17.14.1Core transform implementation
17.14.1.1.1.1.1.1.1JCTVC-D036 Matrix multiplication specification for HEVC transforms [M. Sadafale, M. Budagavi (TI)]
This contribution is a continuation of prior work presented in JCTVC-C226 and presents more details and software optimization results. JCTVC-C226 proposed the use of matrix multiplication to specify HEVC transforms. Matrix multiplication has the advantage that it is friendly to parallel processing with minimal dependency and control logic. In hardware, matrix multiplication results in low-area architecture, while in software it reportedly leads to efficient implementation on SIMD processors. Matrix multiplication allegedly has better fixed-point behavior than Chen’s DCT/IDCT which is asserted to allow for elimination of existing norm correction matrices in HM. The memory requirement for storing norm correction matrices in the TMuC decoder reportedly goes down from 7.5 KB to 6 bytes. The corresponding memory requirements in the TMuC encoder reportedly goes down from 7.5 KB to 12 bytes. A fixed-point version of a matrix multiplication pseudo DCT/IDCT along with norm correction matrix elimination optimization was implemented in TMuC-0.9. Simulation results reportedly indicate that matrix multiplication DCT/IDCT achieves similar coding efficiency performance when compared to TMuC-0.9 with a Chen-style pseudo DCT/IDCT (Average BD BR: 0.1 to -0.3%). TMuC-0.9 decoder and encoder with matrix multiplication DCT/IDCT implemented with even-odd decomposition optimization have run times that are comparable to TMuC-0.9 with Chen’s implementation. Decoding time ratios on the same PC are reportedly in the range of 99%-100% range. Encoder time ratios on heterogeneous Linux cluster reportedly range from 93%-107%.
-
With IBDI off, the current HM implementation of IDCT reportedly requires 21 bits, whereas matrix mult. reportedly requires 15 bits. With IBDI on, 25/19 bits (case of 16-size transform)
-
Butterfly structures have cascading of multipliers, such that they cannot be parallelized and incur latency (at least if there is rounding between the stages)
-
In effect, this may mean that matrix multiplication can be faster than "fast" algorithms (particularly in hardware)
Some doubt was expressed that the estimate of bit depth in HM may be pessimistic and could be different by other implementation. It did not seem fully clear how to measure the bit depth and latency. A comment was that measuring latency must take into account pipelining and the number of available multipliers.
The best selection would also depend on target platform architecture. Compromises will be necessary.
17.14.1.1.1.1.1.1.2JCTVC-D245 Verification of TI's proposal on matrix multiplication architecture for DCT/IDCT [Jian Lou, Limin Wang]
17.14.1.1.1.1.1.1.3JCTVC-D224 Unified transform design for HEVC with 16 bit intermediate data representation [A. Fuldseth, G. Bjøntegaard (Cisco)]
This contribution proposes a set of transforms for HEVC, covering all transform sizes from 4x4 to 32x32. The proposed transforms are intended to replace the existing transforms specified in HM1.0. The proposed transforms have the following properties; 16 bit data representation before and after each transform stage (assuming 8-bit input data and IBDI disabled), no need for correction of different norms of basis vectors during quantization/dequantization, all transform sizes above 4x4 can reuse arithmetic operations for smaller transform sizes, and implementations using either pure matrix multiplication or a combination of matrix multiplication and butterfly structures are possible. The proposed transforms are implemented without transform precision extension (TPE) and compared with the existing transforms with TPE=4 (anchor conditions). Average BD BR gains were reported, varying between 0.1% and 0.6% depending on the test condition (low complexity/high efficiency, intra/random access/low delay).
-
The basis vectors are almost orthogonal and have almost equal norms, so no quantization/dequantization matrices are necessary (same quantization for all transform sizes)
-
8 bit representation of transform coefficients is enabled, with a bit width of accumulators for matrix multiplication of less than or equal to 32 bits.
-
Scaled transform coefficients are very close to DCT coefficients
-
Large transforms partially re-use smaller transforms, such that a "partial butterfly" is possible
-
4x4 and 8x8 were also changed
Clarification was requested regarding what a "partial butterfly" means. It seems it is not possible to fully decompose these matrices (log2N). In the ideal case, a fast transform algorithm should be capable of implementation by both matrix multiplication and "fast" implementation.
Gain was reported (0.1 or 0.2% bit rate reduction compared to HM). It was not clear what the source of this gain was (perhaps the change of 4x4 and 8x8 transforms).
17.14.1.1.1.1.1.1.4JCTVC-D268 Verification of Cisco transform proposal (JCTVC-D224) [M. Budagavi (TI)]
17.14.1.1.1.1.1.1.5JCTVC-D339 Fast, Mult-Free Transforms for the HM [Wei Dai, Madhu Krishnan, Pankaj Topiwala] (initial uploaded versions had problems)
The current HM (0.9) uses integer transforms of sizes 4-pt to 32-pt, designed using Chen's factorization of the DCT, which provide good decorrelation performance and moderate complexity. This contribution discusses integer transforms and several variants, which reportedly offer substantial computational savings with effectively no performance loss. The current 4-pt and 8-pt transforms are initially retained, while the more complex 16-pt and 32-pt transforms are simplified, including multiplication-free designs. The multiplication-free designs reportedly provide the lowest complexity as well as exact invertibility in limited bitdepth arithmetic, yet reportedly have identical performance. A CE was proposed to further study and finalize these designs, and corresponding quantization structures. The transform itself was developed jointly with Samsung, and is presented in a joint proposal JCTVC-D365.
-
Lifting implementation, allows perfect inversion
-
Reports minor deviations (+/- 0.02%) in bit rate
-
Transform is not orthogonal, quantization was adapted (see JCTVC-D365)
-
The whole dynamic range (down to QP=0/1 with worst-case input) should be checked in the investigations
17.14.1.1.1.1.1.1.6JCTVC-D037 DCT+Hadamard large transform [M. Budagavi, A. Gupte (TI)]
This contribution is a continuation of prior work presented in JCTVC-C226 and JCTVC-C255 and presents more details and software optimization results. JCTVC-C255 proposed a class of transforms for large block sizes which is a combination of DCT+Hadamard transforms for reducing computational complexity. When DCT+Hadamard is applied to only 32x32 Inter blocks, the average BD BR degradation is reportedly in the range of 0.2-0.4%. When DCT+Hadamard is applied to both Inter and Intra blocks, the average BD BR degradation is reportedly in the range of 0.4-0.8%. The 1D DCT+Hadamard transform involves application of two size-N/2 DCT transforms instead of one size-N DCT transform, resulting in reduction in number of multiplications required. There is reportedly a 50% reduction in number of multiplications for DCT+Hadamard implemented with matrix multiplication and 24% reduction in number of multiplications for DCT+Hadamard implemented with Chen’s algorithm.
The contribution proposed a combination of DCT-16 followed by Hadamard-2.
Loss in compression is up to 0.8/0.9% BR increase in all-Intra coding; less in inter.
It appears that this loss does not justify the savings in complexity. In effect, this loses the advantage of 32x32 blocks, as the error would concentrate at the boundaries of the 16x16 transform.
17.14.1.1.1.1.1.1.7JCTVC-D371 Crosscheck of TI's matrix multiplication specification (JCTVC-D036) and DCT+Hadamard large transform (JCTVC-D037) by Qualcomm [R. Joshi, Y. Zheng, J. Sole]
17.14.1.1.1.1.1.1.8JCTVC-D388 Cross check of JCTVC-D037 on DCT+Hadamard large transform [C. Auyeung]
17.14.1.1.1.1.1.1.9JCTVC-D256 Efficient 16 and 32-point transforms [R. Joshi, Y. Reznik, J. Sole, M. Karczewicz]
This contribution proposes new 16 and 32-point transforms. The proposed transforms are scaled orthogonal integer transforms and support a recursive factorization structure. The proposed 16 and 32-point transforms use 36 and 92 multiplications, respectively, compared to 44 and 116 multiplications for the corresponding transforms in the existing test model. A multiplication-free implementation for the proposed transforms is also provided. The BD BR performance of the proposed transforms is reportedly nearly identical to the existing transforms.
-
12 distinct scale factors for quantization for the 32x32 transform
-
No significant change in coding efficiency
-
Bit precision is bitdepthvideo+4bitsIBDI+9+5+5 for 2D transform (in the case of 10-bit video this would be larger than 32 bit)
17.14.1.1.1.1.1.1.10JCTVC-D079 Verification of Qualcomm’s proposal on alternative large transform architecture [C. Yeo, Y. H. Tan, Z. Li (I2R)]
17.14.1.1.1.1.1.1.11JCTVC-D269 Verification of Qualcomm transform proposal (JCTVC-D256) [M. Budagavi (TI)]
17.14.1.1.1.1.1.1.12JCTVC-D257 Low complexity 32×32 transform by partial frequency transform [J. Sole, R. Joshi, M. Karczewicz, Y.-M. Hong, M.-S. Cheon, I.-K. Kim]
This proposal presents an approach to simplifying the 32×32 transform by computing only the 16 lowest frequency coefficients in each direction. This simplification reportedly reduces the number of multiplications by 74℅ while reducing the intermediate buffering requirements by 50℅. The loss in terms of BD BR for the high efficiency intra, random access, and low-delay configurations is reported as 0.31%, 0.22%, and 0.21℅, respectively. There is no reported impact on the BD BR for the low complexity configuration.
-
Combination of JCTVC-C209 and JCTVC-C237
-
Loss in performance, roughly 0.3% BR increase
-
Same quantization matrix used (norms of the basis vectors are the same)
-
Question: 32x32 is not used often, is the loss higher for specific sequences? Not remarkable, highest loss around 0.7 or 0.8.
-
A remark was given that even a small loss may be dangerous as it indicates a deviation of the transform basis characteristics. Particularly for large transforms, ringing may be a danger.
-
Would be advisable to perform viewing. It was suggested that we may wish to find special test images for this.
-
Some people expressed a concern that losses such as 0.3-0.4% could be unacceptable.
17.14.1.1.1.1.1.1.13JCTVC-D135 Crosscheck of Samsung and Qualcomm's Partial Frequency Transform by MediaTek [Tzu-Der Chuang, Ching-Yeh Chen, Yu-Wen Huang]
17.14.1.1.1.1.1.1.14JCTVC-D365 Fast Integer Transforms for the HM, and Complexity Analysis [Y.-M. Hong, I.-K. Kim, K.-H. Lee (Samsung), W. Dai, M. Krishnan, P. Topiwala (FastVDO)]
The current HM (TMuC-0.9) uses integer transforms of sizes 4-pt to 32-pt, designed using Chen's factorization of the DCT, which provide good decorrelation performance with moderate complexity. This contribution summarizes the performance and the complexity of the existing transform cores and describes some integer transforms and several variants, which reportedly offer substantial computational savings with effectively no performance loss. The current 4-pt and 8-pt transforms are initially retained, while the more complex 16-pt and 32-pt transforms are changed, including multiplication-free designs. The multiplication-free designs afford exact invertibility in limited bitdepth arithmetic, yet reportedly have identical performance. In addition, a simplified approach to quantization matrices was suggested – just replacing them with scalar values. This can reportedly be done for both the existing HM transforms, as well as the new transforms in the proposed design. The contributors asserted that the transform and quantization designs in the HM can be significantly improved computationally, with no noticeable loss of performance.
No quantization matrix was proposed for 16x16 and 32x32; only scaling according to QP value.
17.14.1.1.1.1.1.1.15JCTVC-D129 Verification results of Samsung’s proposals on Low-complexity transforms [Jungsun Kim, Byeongmoon Jeon] (missing prior / placeholder, uploaded before meeting)
17.14.1.1.1.1.1.1.16JCTVC-D408 Crosscheck of Samsung and FastVDO's proposal JCTVC-D365 on simplified quantization/dequantization matrices [Xin Zhao, Li Zhang, Siwei Ma, Wen Gao] (missing prior, uploaded Tuesday 18th, before meeting)
17.14.1.1.1.1.1.1.17JCTVC-D435 Cross verification of JCTVC-D365 by Nokia [K. Ugur (Nokia)] (late registration Saturday 22nd after start of meeting, uploaded Thursday 27th, near the end of the meeting)
17.14.1.1.1.1.1.1.18JCTVC-D071 On transform dynamic range [Kiran Misra, Louie Kerofsky, Andrew Segall] (missing prior, uploaded Wednesday 19th, before meeting)
The dynamic range of an inverse transform is a critical parameter influencing the minimum width of the bus required between the processing unit which carries out the transform and the memory used to store results. Additionally it determines the total size of the memory used for storage and number of coefficients which can be pre-fetched with a single instruction. In this proposal a technique is described which reduces the required transform dynamic range to achieve a reduced dynamic range of 16-bits. This goal is reportedly achieved with an increase in the average Y Bjøntegaard Delta (BD) bit rate between 0.0%. and 0.2%. The performance was measured with all the tools in their default test settings, for e.g. there is an internal bit depth increment of 4 for the HE case, the transform precision extension is turned on for the low-complexity case etc. The transform dynamic range limitation is achieved by downshifting to restrict the bit-depth of coefficients being stored in memory (a) prior to being input to the first inverse 1-D transform and (b) at the output of the first 1-D inverse transform. A combination of QP period based scaling, clipping and transform size dependent bit shifts are used to achieve the 16-bit dynamic range.
-
Clipping at the decoder was discussed. In practice, some encoders may not perform clipping.
-
It should be clarified how this would be described in the standard.
-
Several experts think this is a good idea, and recommended further study.
17.14.1.1.1.1.1.1.19Discussions and Conclusions
A breakout activity was established (coordinated by P. Topiwala) on issues of transform implementation. In that activity, the main discussions were about measures for complexity.
It was suggested to establish a CE on core transforms and related quantization, with further discussion in a BoG (see JCTVC-D445):
-
Complexity parameters (number of op’s, bit depth, parallelism etc.)
-
Performance (also visual investigation particularly for ringing)
-
Fitness for implementation on different platforms (software/hardware/SIMDetc.)
-
Complexity of quantization
The current HM uses quantization matrices for 16x16 and 32x32 which are consisting of almost equal entries, and it was asked whether these should be changed to equal values or replaced by a scalar value. According to JCTVC-D365, this would not affect performance. It is agreed to wait with this until other transform issues will be decided.
Dostları ilə paylaş: |