Information Theory

4. Data Compression

2020-12-24

Existence of Code

A source code $C$ for a random variable $X$ is a mapping from $\mathcal{X}$ to $\mathcal{D}^*$, the set of finite-length strings over the alphabet $\mathcal{D}$

  • $\mathcal{X}$: the range of $X$
  • $D$-ary alphabet: $\mathcal{D}=\{0,1,\cdots,D-1\}$
  • $C(x)$: codeword for $x$
  • $l(x)$: length of $C(x)$
  • $L(C)=\sum_{x\in\mathcal{X}}p(x)l(x)$: the expected length of a source code $C$ for a random variable with pmf $p(x)$
  • rate $R$: average length per symbol

Nonsingular code

  • every element of the range $\mathcal{X}$ maps into a different string in $\mathcal{D}^*$
  • $C^*$: the extension of a code $C$ is the mapping from finite-length strings of $\mathcal{X}$ to finite-length strings of $\mathcal{D}$ defined by $C^*(x_1x_2\cdots x_n)=C(x_1)C(x_2)\cdots C(x_n)$
  • Uniquely decodable: the extension $C^*$ is nonsingular

Prefix Code (instantaneous code)

  • no codeword is a prefix of any other codeword
  • Tree representation
    • Kraft Inequality (1949): for any instantaneous code over an alphabet of size $D$, the codeword lengths $l_1,l_2,\cdots,l_m$ must satisfy $\sum_{i=1}^m D^{-l_i}\leq 1$
    • Conversely, given a set of codeword lengths satisfying this inequality, there exists a prefix code with these lengths.
  • Interval representation
    • Extended Kraft Inequality: the Kraft inequality and its converse also hold for any countably infinite set of codewords
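
The converse half of the Kraft inequality is constructive. Below is a minimal Python sketch (binary case, $D=2$; function names are mine, not from the notes) that checks the inequality and builds a prefix code with given lengths via the canonical construction:

```python
def kraft_sum(lengths, D=2):
    """Left-hand side of the Kraft inequality: sum of D^(-l_i)."""
    return sum(D ** -l for l in lengths)

def canonical_prefix_code(lengths):
    """Build a binary prefix code with the given lengths (Kraft must hold).

    Canonical construction: assign codewords in order of increasing length;
    each new codeword is the previous one plus one, shifted out to the
    new length. No codeword can then be a prefix of a later one.
    """
    assert kraft_sum(lengths) <= 1, "lengths violate the Kraft inequality"
    code, c, prev = {}, 0, 0
    for i, l in sorted(enumerate(lengths), key=lambda t: t[1]):
        c <<= l - prev                # extend the counter to the new length
        code[i] = format(c, f'0{l}b')  # zero-padded binary codeword
        c += 1                         # next available codeword at this length
        prev = l
    return code
```

For lengths $(1,2,3,3)$ this yields the codewords 0, 10, 110, 111, and indeed no codeword is a prefix of another.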

Optimal Codes

  • optimal: $\sum p_il_i$ is minimal
  • optimization problem (Lagrange multipliers): $\min_{l_i} L=\sum p_il_i$ subject to $\sum D^{-l_i}\leq 1$
  • the minimum is attained at $p_i=D^{-l_i}$, i.e. $l_i^*=-\log_D p_i$, giving $L^*\geq H_D(X)$
  • Bounds: $H_D(X)\leq L^*<H_D(X)+1$
  • Shannon codes: $l_i=\lceil\log_D\frac{1}{p_i}\rceil$
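
A quick numeric check of the Shannon lengths and the entropy bounds (a sketch for the binary case; helper names are mine):

```python
import math

def entropy(probs):
    """Entropy H(X) in bits."""
    return -sum(p * math.log2(p) for p in probs)

def shannon_lengths(probs):
    """Shannon code lengths l_i = ceil(log2(1/p_i))."""
    return [math.ceil(math.log2(1.0 / p)) for p in probs]

def expected_length(probs):
    """L = sum p_i * l_i for the Shannon code."""
    return sum(p * l for p, l in zip(probs, shannon_lengths(probs)))
```

For the dyadic pmf $(1/2,1/4,1/8,1/8)$ the ceilings are exact and $L=H(X)=1.75$ bits; for a non-dyadic pmf one only gets $H(X)\leq L<H(X)+1$.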

Encode $n$ symbols together (block coding removes the $+1$ overhead in $H_D(X)+1$)

  • $\frac{H(X_1,X_2,\cdots,X_n)}{n}\leq L^*<\frac{H(X_1,X_2,\cdots,X_n)}{n}+\frac{1}{n}$; for a stationary stochastic process, $L^*\rightarrow H(\mathcal{X})$, the entropy rate
  • shortcoming: the alphabet size grows to $|\mathcal{X}|^n$

Wrong code

  • The expected length under $p(x)$ of the code assignment $l(x)=\lceil\log\frac{1}{q(x)}\rceil$: $H(p)+D(p\|q)\leq E_pl(x)<H(p)+D(p\|q)+1$
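
A small numeric illustration of the penalty for designing the code for the wrong distribution (a sketch; helper names are mine):

```python
import math

def entropy(p):
    """H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p)

def kl_divergence(p, q):
    """Relative entropy D(p||q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

def wrong_code_length(p, q):
    """Expected length under p of the Shannon code designed for q,
    l(x) = ceil(log2(1/q(x)))."""
    return sum(pi * math.ceil(math.log2(1.0 / qi)) for pi, qi in zip(p, q))

# Example: true pmf p, code built for q; penalty is D(p||q) extra bits.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
```

Here $E_pl(x)=1.75$ while $H(p)=1.5$ and $D(p\|q)=0.25$, so the lower bound $H(p)+D(p\|q)$ is met with equality (both pmfs are dyadic, so the ceilings change nothing).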

Kraft Inequality for Uniquely Decodable Codes (McMillan): the codeword lengths of any uniquely decodable $D$-ary code must satisfy the Kraft inequality $\sum_i D^{-l_i}\leq 1$. Conversely, given a set of codeword lengths satisfying this inequality, it is possible to construct a uniquely decodable code with these lengths. (So prefix codes do no worse than any other uniquely decodable code.)

Codes

Huffman Codes

  • $D$-ary Huffman codes (prefix codes) for a given distribution: repeatedly combine the $D$ symbols with the lowest probabilities into a single source symbol, until only one symbol remains
  • add dummy symbols of zero probability so that the total number of symbols has the form $1+k(D-1)$
  • optimality: if $C'$ is any other uniquely decodable code, then $L(C)\leq L(C')$
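
The merge step above can be sketched in a few lines of Python for the binary case (the tiebreak counter keeps heap comparisons away from the dicts; function name is mine):

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Binary Huffman code: repeatedly merge the two least-probable symbols.

    Each heap entry is (probability, tiebreak, {symbol: codeword-suffix}).
    Merging prepends '0' to one subtree's codewords and '1' to the other's.
    """
    tiebreak = count()
    heap = [(p, next(tiebreak), {i: ''}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]
```

For the dyadic pmf $(1/2,1/4,1/8,1/8)$ the resulting lengths are $(1,2,3,3)$ and the expected length equals the entropy, $1.75$ bits.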

Shannon Codes

  • $D$-adic distribution: $P(X=x_i)=D^{-n_i}$ for integers $n_i$; then $l_i=\log_D\frac{1}{p_i}=n_i$
  • Shannon codes
    • attain optimality within 1
    • $D$-adic distribution: Shannon codes are optimal
    • $p_i\rightarrow0$: Shannon codes may be much worse than optimal, since $l_i=\lceil\log\frac{1}{p_i}\rceil\rightarrow\infty$

For any optimal coding scheme

  • lengths are ordered inversely with probability ($p_j>p_k\Rightarrow l_j\leq l_k$)
  • the two longest codewords have the same length
  • two of the longest codewords differ only in the last bit

Shannon-Fano-Elias Coding

  • cumulative distribution function: $F(x)=\sum_{a\leq x}p(a)$
  • modified cumulative distribution function: $\overline F(x)=\sum_{a<x}p(a)+\frac{1}{2}p(x)=F(x-1)+\frac{1}{2}p(x)$
  • truncate $\overline{F}(x)$ to $l(x)$ bits, written $\lfloor\overline{F}(x)\rfloor_{l(x)}$; then $\overline F(x)-\lfloor\overline{F}(x)\rfloor_{l(x)}<\frac{1}{2^{l(x)}}$
  • with $l(x)=\lceil\log\frac{1}{p(x)}\rceil+1$: $\frac{1}{2^{l(x)}}\leq\frac{p(x)}{2}=\overline{F}(x)-F(x-1)$, so the truncated value still lies in the interval of $x$ and the code is prefix-free
  • $L=\sum p(x)l(x)<H(X)+2$
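
The encoder is short enough to write out (a sketch; floats suffice for small examples, while practical coders use integer arithmetic to avoid rounding):

```python
import math

def sfe_code(probs):
    """Shannon-Fano-Elias code: truncate the modified CDF Fbar(x)
    to l(x) = ceil(log2(1/p(x))) + 1 bits."""
    code = {}
    F = 0.0                      # F(x-1): mass of symbols before x
    for x, p in enumerate(probs):
        Fbar = F + p / 2         # modified CDF: midpoint of x's interval
        l = math.ceil(math.log2(1.0 / p)) + 1
        # first l bits of the binary expansion of Fbar
        code[x] = format(int(Fbar * 2 ** l), f'0{l}b')
        F += p
    return code
```

For the pmf $(0.25, 0.5, 0.125, 0.125)$ this gives the codewords 001, 10, 1101, 1111, with $L=2.75<H(X)+2$.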

Random Variable Generation

  • fair coin tosses $Z_1,Z_2,\cdots,Z_n$
  • $E(T)$: expected number of fair bits used
  • generate the variable by walking down a binary tree whose leaves are labeled with symbols of $\mathcal{X}$
  • $E(T)\geq H(X)$
    • dyadic distribution: $E(T)=H(X)$
    • otherwise: use the binary expansions of the probabilities, $H(X)\leq E(T)<H(X)+2$
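
For a dyadic pmf the tree can be built directly: a symbol with probability $2^{-k}$ becomes a leaf at depth $k$ (its canonical codeword), and fair bits walk down the tree. A sketch, with the bit source passed in so it is testable (function names are mine):

```python
import math

def dyadic_tree(probs):
    """Leaves of the generating tree for a dyadic pmf: the symbol with
    probability 2^-k sits at depth k, at its canonical codeword."""
    lengths = [int(round(-math.log2(p))) for p in probs]
    code, c, prev = {}, 0, 0
    for i, l in sorted(enumerate(lengths), key=lambda t: t[1]):
        c <<= l - prev
        code[format(c, f'0{l}b')] = i
        c += 1
        prev = l
    return code

def generate(leaves, bits):
    """Walk down the tree with fair bits until a leaf is hit.

    Returns (symbol, T), T = number of bits consumed; for a dyadic pmf
    E[T] = sum p * depth = H(X)."""
    path = ''
    for b in bits:
        path += str(b)
        if path in leaves:
            return leaves[path], len(path)
```

For the pmf $(1/2,1/4,1/4)$ the leaves are 0, 10, 11 and $E(T)=1.5=H(X)$.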

Universal Source Coding

  • minimax redundancy: $R^*=\min_q\max_{p_\theta}R=\min_q\max_{p_\theta}D(p_\theta\|q)$
  • Arithmetic Coding
  • LZ77: sliding window
  • LZ78: Trie
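
The LZ78 parse is easy to sketch: each new phrase is the longest previously seen phrase plus one fresh symbol, and the phrase dictionary is exactly a trie (here stored as a dict of strings; function names are mine):

```python
def lz78_parse(s):
    """LZ78 parse: emit (phrase_index, symbol) pairs, index 0 = empty phrase.

    The dictionary of phrases is prefix-closed, i.e. it forms a trie.
    """
    dictionary = {'': 0}
    output, phrase = [], ''
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch                      # keep extending the match
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ''
    if phrase:                                # flush an unfinished phrase
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

def lz78_decode(pairs):
    """Invert the parse: phrase k = phrase at pairs[k]'s index + its symbol."""
    phrases, out = [''], []
    for idx, ch in pairs:
        phrase = phrases[idx] + ch
        phrases.append(phrase)
        out.append(phrase)
    return ''.join(out)
```

For example, 'AABABBBAB' parses into the phrases A, AB, ABB, B, AB, i.e. the pairs (0,A), (1,B), (2,B), (0,B), (1,B).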