# Learning clustering-based linear mappings for quantization noise removal

### Abstract

This paper describes a novel scheme to reduce the quantization noise of compressed videos and improve the overall coding performances. The proposed scheme first consists in clustering noisy patches of the compressed sequence. Then, at the encoder side, linear mappings are learned for each cluster between the noisy patches and the corresponding source patches. The linear mappings are then transmitted to the decoder where they can be applied to perform de-noising. The method has been tested with the HEVC standard, leading to a bitrate saving of up to 9.63%.

Publication
IEEE International Conference on Image Processing
Date

## Context and Goal

Modern video codecs such as HEVC rely on four main steps, which aim at reducing the redundancies of the signal. First, spatial and temporal redudancies are reduced through prediction tools. The prediction residue is then transformed to further reduce the inner correlations of the signal, and the transformed coefficients are then quantized to remove non-perceptible information. The quantized transformed coefficients are finally entropy coded to remove the remaining statistical redundancies. Because the quantization step is not revertible at the decoder side, the reconstructed frames are corrupted with a so-called quantization noise. The goal of this work is to improve the rate-distortion (RD) performances of existing compression schemes using a novel out-of-the-loop de-noising scheme.

## Approach

The proposed scheme is represented in Fig. 1, and consists in the following steps:

• At the encoder side:

• cluster the coded/decoded patches
• for each cluster, learn a projection function using multivariate linear regression between the coded/decoded patches and the corresponding source patches
• encode the projection function and transmit them to the decoder
• At the decoder side:

• decode the projection functions
• cluster the decoded patches, as it is performed at the encoder side
• for each cluster, apply the corresponding projection function to the decoded patches in order to obtain the de-noised patches

Alternatively, we propose a two steps scheme represented in Fig. 2, where a blind de-noising algorithm is first performed on the decoded frames, before applying the proposed mehtods.

### Selecting the number of clusters

Selecting the number of clusters K is crucial for the overall RD performances, as increasing K improves the de-noising performances, but also the bit-rate associated to the projection functions. We thus propose an adative method to select K, based on a rate-distortion optimization (RDO) criterion, which proceeds as follow:

• At initialization, the full data set is considered as one cluster
• Each cluster is then recursively split in 2 clusters if the RDO criterion (explained below) is satisfied, creating a binary tree structure
• The procedure stops when no RDO criterion is satisfied, or a certain maximum tree depth is reached

To perform the exact same partitioning at the decoder, we then need to transmit the binary tree structure.
The aforementioned RDO criterion can be computed for each cluster, first by estimating the distortion between the de-noised patches of the cluster and the corresponding source patches. The rate is then esitmated as the number of bits of the encoded projection function used to de-noise the patches. Thus, to decide if a cluster $C_{n}$ is further divided into two clusters $C_{n+1}$ and $C_{n+2}$, we merely check the following condition on the associated rate distortion costs: $J_{n+1} + J_{n+2} < J_n$. An example of binary tree structure obtained after recursive paritioning is given in Fig. 3.

### Oracle clustering

We propose a so-called oracle clustering, which goal is to maximize the de-noising performances of the proposed scheme for a fixed number of clusters $K$. Formally, the oracle clustering solve the following problem: $$\min_C \sum_{i=1 \dots K} \sum_{x_d,x_s \in C_i} || x_s - \textbf{P}_i x_d || ^2$$ where $x_s$ and $x_d$ are corresponding source and decoded patches respectively, and $\textbf{P}_i$ is the projection function corresponding to the cluster $C_i$.
This oracle clustering is not reproducible at the decoder side, as it relies on the knowledge of the source patches, and thus can not be directly applied in the proposed scheme. However, it allows to define an upper bound on the performances.

First frame of the Kimono source sequence. First frame of the Kimono sequence encoded with HEVC at QP=37. Decoded PSNR = 37.95dB.
K-means clustering (K=10) performed on non-overlapping 4x4 patches of the Kimono sequence encoded with HEVC at QP=37. Each color correspond to a cluster label. Corresponding de-noised PSNR = 38.07 dB. Oracle clustering (K=10) performed on non-overlapping 4x4 patches of the Kimono sequence encoded with HEVC at QP=37. Each color correspond to a cluster label. Corresponding de-noised PSNR = 41.26 dB.

## Experimental Results

The sequences used in the experiment are presented in Table 1, and mainly consist in HEVC test sequences. The test sequences are processed per Group Of Pictures (GOP), where a GOP contains a number of frames equal to the frame rate (i.e. we have one GOP per second). The sequences are encoded with the HEVC test model HM (ver 15.0) using the Main profile in Random Access, with 4 values for the Quantization Parameter, QP=22, 27, 32, 37.

Table 1 - Test sequences
Sequence Resolution Frame count Frame rate
City 1280x720 600 60
ParkScene 1920x1080 240 24
Tennis 1920x1080 240 24
Kimono 1920x1080 240 24
Cactus 1920x1080 500 50
Terrace 1920x1080 600 60
Ducks 1920x1080 500 50
PeopleOnStreet 2560x1600 150 30
Traffic 2560x1600 150 30

We show in Table 2 the RD performances of our method (without first pass de-noising, see Fig. 1) for each GOP of the test sequences, performed with fixed number of clusters K=10.

Table 2 - RD performances per GOP of K-means clustering with a fixed number of clusters K=10 (Bjontegaard bit-rate gains with respect to HEVC)
GOP City Park Scene Tennis Kimono Cactus Terrace Basket Ducks People On Street Traffic Average
1 -2.31 0.12 -0.06 -0.13 -1.16 -5.38 -0.83 -2.32 -2.54 -1.29 -1.59
2 -1.76 0.15 0.09 -0.13 -1.22 -7.14 -0.43 -1.81 -2.55 -1.23 -1.60
3 -1.96 0.09 0.18 -0.20 -0.88 -8.62 -0.87 -1.02 -2.54 -1.22 -1.70
4 -1.74 0.18 0.10 -0.24 -0.81 -11.69 -0.74 -1.16 -2.60 -1.18 -1.99
5 -0.96 0.30 0.10 -0.19 -1.05 -12.15 -0.65 -1.20 -2.58 -1.31 -1.97
6 -1.90 0.47 1.60 0.35 -1.20 -10.05 -0.43 -1.61 -1.60
7 -1.59 0.48 1.45 1.22 -1.11 -8.49 -0.51 -2.01 -1.32
8 -1.98 0.50 1.05 1.42 -1.34 -7.53 -0.71 -2.36 -1.37
9 -1.84 0.59 0.84 1.90 -1.03 -6.73 -0.74 -2.49 -1.19
10 -1.92 0.97 0.77 2.54 -0.91 -6.79 -0.49 -2.76 -1.07
Average -1.81 0.38 0.08 0.43 -1.08 -8.28 -0.65 -1.88 -2.56 -1.26 -1.66

We show in Table 3 the RD performances of our method (without first pass de-noising, see Fig. 1) for each GOP of the test sequences, performed with the adaptive selection of the number of clusters (Kmax=16). In addition, we give in Table 4 the selected K values depending on the quantization parameter and averaged over all GOPs. We can see that the K values adapt to the sequence, reaching higher values for sequences with a higher resolution and/or frame rate. The K values also vary depending on QP, and lower values are selected at low bit-rate (high QP value).

Table 3 - RD performances per GOP of K-means clustering with the adaptive K selection (Kmax=16) (Bjontegaard bit-rate gains with respect to HEVC)
GOP City Park Scene Tennis Kimono Cactus Terrace Basket Ducks People On Street Traffic Average
1 -2.76 -0.69 -0.90 -1.32 -1.12 -5.25 -1.10 -2.39 -2.57 -1.44 -1.96
2 -2.31 -0.67 -0.72 -1.35 -1.29 -6.83 -0.40 -1.82 -2.58 -1.39 -1.93
3 -2.45 -0.63 -0.78 -1.36 -0.89 -8.86 -0.98 -1.08 -2.59 -1.35 -2.10
4 -1.66 -0.54 -0.82 -1.31 -0.98 -11.58 -0.62 -1.24 -2.61 -1.35 -2.27
5 -1.84 -0.52 -0.72 -0.89 -1.11 -11.95 -0.76 -1.27 -2.63 -1.53 -2.32
6 -2.22 -0.48 -1.08 -0.85 -1.37 -9.90 -0.77 -1.71 -2.30
7 -1.90 -0.48 -1.91 -0.58 -1.25 -8.20 -0.68 -2.13 -2.14
8 -1.86 -0.40 -2.39 -0.58 -1.30 -7.23 -1.09 -2.47 -2.16
9 -2.51 -0.31 -1.77 -0.43 -1.19 -6.35 -0.88 -2.58 -2.00
10 -2.74 0.02 -2.52 0.29 -1.00 -6.47 -0.69 -2.84 -1.99
Average -2.24 -0.47 -1.41 -0.94 -1.15 -8.08 -0.80 -1.96 -2.60 -1.42 -2.11
Table 4 - Selected K values for the RDO adaptive selection of the number of clusters (Averaged over all GOPs)
QP City Park Scene Tennis Kimono Cactus Terrace Basket Ducks People On Street Traffic Average
22 10.8 8.5 7.6 8.2 14.9 15.4 12.0 15.9 16.0 14.0 12.3
27 11.3 6.8 5.7 6.4 11.7 15.0 9.7 15.8 16.0 12.6 11.1
32 7.4 4.0 3.4 4.5 9.0 14.1 7.8 15.8 16.0 7.6 9.0
37 5.9 4.0 2.8 4.0 7.1 13.1 5.8 15.5 15.8 5.8 8.0
Average 8.9 5.8 4.9 5.8 10.7 14.4 8.8 15.8 16.0 10.0 10.1

We show in Table 5 the RD performances of the proposed two steps scheme (see Fig. 2), with a first pass VBM3D de-noising, for the first GOP of the test sequences, performed with the adaptive selection of the number of clusters (Kmax=16). In addition, we give in Table 6 the selected K values depending on the quantization parameter.

Table 5 - RD performances of the two steps scheme with the adaptive K selection (Kmax=16) for the first GOP (Bjontegaard bit-rate gains with respect to HEVC)
Sequence BD-rate
City -3.24
Park Scene -1.07
Tennis -7.77
Kimono -4.32
Cactus -6.19
Terrace -7.24
Ducks -4.13
People On Street -9.63
Traffic -5.43
Average -5.28
Table 6 - Selected K values for the two steps scheme with the RDO adaptive selection of the number of clusters (Averaged over all GOPs)
QP City Park Scene Tennis Kimono Cactus Terrace Basket Ducks People On Street Traffic Average
22 15.0 12.0 9.0 7.0 16.0 16.0 11.0 16.0 16.0 15.0 13.3
27 15.0 8.0 7.0 6.0 13.0 16.0 11.0 16.0 16.0 12.0 12.0
32 10.0 6.0 6.0 6.0 12.0 16.0 9.0 16.0 15.0 13.0 10.9
37 9.0 4.0 5.0 4.0 11.0 15.0 5.0 16.0 15.0 6.0 9.0
Average 12.3 7.5 6.8 5.8 13.0 15.8 9.0 16.0 15.5 11.5 11.3

Finally, we show in Table 7 the upper bound on the RD performances obtained with the oracle clustering.

Table 7 - Upper bound on the RD performances per GOP using the oracle clustering with a fixed number of clusters K=10 (Bjontegaard bit-rate gains with respect to HEVC)
GOP City Park Scene Tennis Kimono Cactus Terrace Basket Ducks People On Street Traffic Average
1 -28.85 -22.38 -30.48 -40.90 -23.28 -29.93 -32.60 -31.03 -23.11 -25.35 -28.79
2 -28.34 -22.40 -30.39 -42.10 -24.62 -34.49 -29.80 -28.40 -23.15 -25.52 -28.92
3 -28.35 -21.86 -30.61 -41.10 -24.89 -41.78 -30.62 -18.31 -23.07 -25.31 -28.59
4 -29.20 -21.43 -30.92 -40.77 -26.09 -48.57 -27.25 -16.72 -23.88 -25.68 -29.05
5 -28.89 -21.07 -30.34 -40.85 -25.08 -46.06 -29.78 -16.08 -23.07 -25.65 -28.69
6 -29.43 -21.50 -44.94 -38.14 -24.65 -39.84 -31.07 -18.51 -31.01
7 -29.19 -21.66 -45.89 -25.38 -24.21 -38.70 -29.48 -20.69 -29.40
8 -29.08 -21.71 -46.38 -25.24 -24.71 -37.85 -30.36 -22.64 -29.75
9 -28.89 -21.37 -45.92 -25.17 -24.53 -35.55 -27.57 -23.33 -29.04
10 -29.01 -21.24 -46.14 -16.37 -12.72 -42.21 -21.39 -24.99 -26.76
Average -28.92 -21.66 -38.20 -33.60 -23.48 -39.50 -28.99 -22.07 -23.26 -25.50 -28.52

We display below a summary of the RD performances of the different versions of the proposed scheme, in the form of RD curves for the first GOP of each sequence.

RD performances for the first GOP of the City sequence. RD performances for the first GOP of the Park Scene sequence.
RD performances for the first GOP of the Tennis sequence. RD performances for the first GOP of the Kimono sequence.
RD performances for the first GOP of the Cactus sequence. RD performances for the first GOP of the Terrace sequence.
RD performances for the first GOP of the Basket sequence. RD performances for the first GOP of the Ducks sequence.
RD performances for the first GOP of the People On Street sequence. RD performances for the first GOP of the Traffic sequence.

## References

• Kostadin Dabov, Alessandro Foi, Karen Egiazarian, Video denoising by sparse 3D transform-domain collaborative filtering, EUSIPCO 2007. [PDF]