Information Theory

4. Data Compression

2020-12-24

Existence of Code

A source code $C$ for a random variable $X$ is a mapping from $\mathcal{X}$ to $\mathcal{D}^*$, the set of finite-length strings over the alphabet $\mathcal{D}$

  • $\mathcal{X}$: the range of $X$
  • $D$-ary alphabet: $\mathcal{D}=\{0,1,\cdots,D-1\}$
  • $C(x)$: codeword for $x$
  • $l(x)$: length of $C(x)$
  • $L(C)=\sum_{x\in\mathcal{X}}p(x)l(x)$: the expected length of a source code $C$ for a random variable with pmf $p(x)$
  • rate $R$: average length per symbol

Nonsingular code

  • every element of the range $\mathcal{X}$ maps into a different string in $\mathcal{D}^*$
  • $C^*$: the extension of a code $C$ is the mapping from finite-length strings of $\mathcal{X}$ to finite-length strings of $\mathcal{D}$ defined by $C^*(x_1x_2\cdots x_n)=C(x_1)C(x_2)\cdots C(x_n)$
  • Uniquely decodable: the extension $C^*$ is nonsingular

Prefix Code (instantaneous code)

  • no codeword is a prefix of any other codeword
  • Tree representation
    • Kraft Inequality (1949): for any instantaneous code over an alphabet of size $D$, the codeword lengths $l_1,l_2,\cdots,l_m$ must satisfy $\sum_{i=1}^m D^{-l_i}\leq 1$
    • Conversely, given a set of codeword lengths satisfying this inequality, there exists a prefix code with these lengths.
  • Interval representation
    • Extended Kraft Inequality: the Kraft inequality and its converse also hold for any countably infinite set of codewords
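
The converse half of the Kraft inequality is constructive. Below is a minimal Python sketch (binary case, $D=2$; function names are mine, not from the notes) that checks the inequality and builds a prefix code with given lengths via the canonical construction:

```python
def kraft_sum(lengths, D=2):
    """Left-hand side of the Kraft inequality: sum of D^(-l_i)."""
    return sum(D ** -l for l in lengths)

def canonical_prefix_code(lengths):
    """Build a binary prefix code with the given lengths (Kraft must hold).

    Canonical construction: assign codewords in order of increasing length;
    each new codeword is the previous one plus one, shifted out to the
    new length. No codeword can then be a prefix of a later one.
    """
    assert kraft_sum(lengths) <= 1, "lengths violate the Kraft inequality"
    code, c, prev = {}, 0, 0
    for i, l in sorted(enumerate(lengths), key=lambda t: t[1]):
        c <<= l - prev                # extend the counter to the new length
        code[i] = format(c, f'0{l}b')  # zero-padded binary codeword
        c += 1                         # next available codeword at this length
        prev = l
    return code
```

For lengths $(1,2,3,3)$ this yields the codewords 0, 10, 110, 111, and indeed no codeword is a prefix of another.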

Optimal Codes

  • optimal: $\sum p_il_i$ is minimal
  • optimization problem (Lagrange multipliers): $\min_{l_i} L=\sum p_il_i$ subject to $\sum D^{-l_i}\leq 1$
  • the minimum is attained at $p_i=D^{-l_i}$, i.e. $l_i^*=-\log_D p_i$, giving $L^*\geq H_D(X)$
  • Bounds: $H_D(X)\leq L^*<H_D(X)+1$
  • Shannon codes: $l_i=\lceil\log_D\frac{1}{p_i}\rceil$
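
A quick numeric check of the Shannon lengths and the entropy bounds (a sketch for the binary case; helper names are mine):

```python
import math

def entropy(probs):
    """Entropy H(X) in bits."""
    return -sum(p * math.log2(p) for p in probs)

def shannon_lengths(probs):
    """Shannon code lengths l_i = ceil(log2(1/p_i))."""
    return [math.ceil(math.log2(1.0 / p)) for p in probs]

def expected_length(probs):
    """L = sum p_i * l_i for the Shannon code."""
    return sum(p * l for p, l in zip(probs, shannon_lengths(probs)))
```

For the dyadic pmf $(1/2,1/4,1/8,1/8)$ the ceilings are exact and $L=H(X)=1.75$ bits; for a non-dyadic pmf one only gets $H(X)\leq L<H(X)+1$.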

Encode $n$ symbols together (block coding removes the $+1$ overhead in $H_D(X)+1$)

  • $\frac{H(X_1,X_2,\cdots,X_n)}{n}\leq L^*<\frac{H(X_1,X_2,\cdots,X_n)}{n}+\frac{1}{n}$; for a stationary stochastic process, $L^*\rightarrow H(\mathcal{X})$, the entropy rate
  • shortcoming: the alphabet size grows to $|\mathcal{X}|^n$

Wrong code

  • The expected length under $p(x)$ of the code assignment $l(x)=\lceil\log\frac{1}{q(x)}\rceil$: $H(p)+D(p\|q)\leq E_pl(x)<H(p)+D(p\|q)+1$
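
A small numeric illustration of the penalty for designing the code for the wrong distribution (a sketch; helper names are mine):

```python
import math

def entropy(p):
    """H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p)

def kl_divergence(p, q):
    """Relative entropy D(p||q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

def wrong_code_length(p, q):
    """Expected length under p of the Shannon code designed for q,
    l(x) = ceil(log2(1/q(x)))."""
    return sum(pi * math.ceil(math.log2(1.0 / qi)) for pi, qi in zip(p, q))

# Example: true pmf p, code built for q; penalty is D(p||q) extra bits.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
```

Here $E_pl(x)=1.75$ while $H(p)=1.5$ and $D(p\|q)=0.25$, so the lower bound $H(p)+D(p\|q)$ is met with equality (both pmfs are dyadic, so the ceilings change nothing).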

Kraft Inequality for Uniquely Decodable Codes (McMillan): the codeword lengths of any uniquely decodable $D$-ary code must satisfy the Kraft inequality $\sum_i D^{-l_i}\leq 1$. Conversely, given a set of codeword lengths satisfying this inequality, it is possible to construct a uniquely decodable code with these lengths. (So prefix codes do no worse than any other uniquely decodable code.)

Codes

Huffman Codes

  • $D$-ary Huffman codes (prefix codes) for a given distribution: repeatedly combine the $D$ symbols with the lowest probabilities into a single source symbol, until only one symbol remains
  • add dummy symbols of zero probability so that the total number of symbols has the form $1+k(D-1)$
  • optimality: if $C'$ is any other uniquely decodable code, then $L(C)\leq L(C')$
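
The merge step above can be sketched in a few lines of Python for the binary case (the tiebreak counter keeps heap comparisons away from the dicts; function name is mine):

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Binary Huffman code: repeatedly merge the two least-probable symbols.

    Each heap entry is (probability, tiebreak, {symbol: codeword-suffix}).
    Merging prepends '0' to one subtree's codewords and '1' to the other's.
    """
    tiebreak = count()
    heap = [(p, next(tiebreak), {i: ''}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]
```

For the dyadic pmf $(1/2,1/4,1/8,1/8)$ the resulting lengths are $(1,2,3,3)$ and the expected length equals the entropy, $1.75$ bits.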

Shannon Codes

  • $D$-adic distribution: $P(X=x_i)=D^{-n_i}$ for integers $n_i$; then $l_i=\log_D\frac{1}{p_i}=n_i$
  • Shannon codes
    • attain optimality within 1
    • $D$-adic distribution: Shannon codes are optimal
    • $p_i\rightarrow0$: Shannon codes may be much worse than optimal, since $l_i=\lceil\log\frac{1}{p_i}\rceil\rightarrow\infty$

For any optimal coding scheme

  • lengths are ordered inversely with probability ($p_j>p_k\Rightarrow l_j\leq l_k$)
  • the two longest codewords have the same length
  • two of the longest codewords differ only in the last bit

Shannon-Fano-Elias Coding

  • cumulative distribution function: $F(x)=\sum_{a\leq x}p(a)$
  • modified cumulative distribution function: $\overline F(x)=\sum_{a<x}p(a)+\frac{1}{2}p(x)=F(x-1)+\frac{1}{2}p(x)$
  • truncate $\overline{F}(x)$ to $l(x)$ bits, written $\lfloor\overline{F}(x)\rfloor_{l(x)}$; then $\overline F(x)-\lfloor\overline{F}(x)\rfloor_{l(x)}<\frac{1}{2^{l(x)}}$
  • with $l(x)=\lceil\log\frac{1}{p(x)}\rceil+1$: $\frac{1}{2^{l(x)}}\leq\frac{p(x)}{2}=\overline{F}(x)-F(x-1)$, so the truncated value still lies in the interval of $x$ and the code is prefix-free
  • $L=\sum p(x)l(x)<H(X)+2$
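
The encoder is short enough to write out (a sketch; floats suffice for small examples, while practical coders use integer arithmetic to avoid rounding):

```python
import math

def sfe_code(probs):
    """Shannon-Fano-Elias code: truncate the modified CDF Fbar(x)
    to l(x) = ceil(log2(1/p(x))) + 1 bits."""
    code = {}
    F = 0.0                      # F(x-1): mass of symbols before x
    for x, p in enumerate(probs):
        Fbar = F + p / 2         # modified CDF: midpoint of x's interval
        l = math.ceil(math.log2(1.0 / p)) + 1
        # first l bits of the binary expansion of Fbar
        code[x] = format(int(Fbar * 2 ** l), f'0{l}b')
        F += p
    return code
```

For the pmf $(0.25, 0.5, 0.125, 0.125)$ this gives the codewords 001, 10, 1101, 1111, with $L=2.75<H(X)+2$.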

Random Variable Generation

  • fair coin tosses $Z_1,Z_2,\cdots,Z_n$
  • $E(T)$: expected number of fair bits used
  • generate the variable by walking down a binary tree whose leaves are labeled with symbols of $\mathcal{X}$
  • $E(T)\geq H(X)$
    • dyadic distribution: $E(T)=H(X)$
    • otherwise: use the binary expansions of the probabilities, $H(X)\leq E(T)<H(X)+2$
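
For a dyadic pmf the tree can be built directly: a symbol with probability $2^{-k}$ becomes a leaf at depth $k$ (its canonical codeword), and fair bits walk down the tree. A sketch, with the bit source passed in so it is testable (function names are mine):

```python
import math

def dyadic_tree(probs):
    """Leaves of the generating tree for a dyadic pmf: the symbol with
    probability 2^-k sits at depth k, at its canonical codeword."""
    lengths = [int(round(-math.log2(p))) for p in probs]
    code, c, prev = {}, 0, 0
    for i, l in sorted(enumerate(lengths), key=lambda t: t[1]):
        c <<= l - prev
        code[format(c, f'0{l}b')] = i
        c += 1
        prev = l
    return code

def generate(leaves, bits):
    """Walk down the tree with fair bits until a leaf is hit.

    Returns (symbol, T), T = number of bits consumed; for a dyadic pmf
    E[T] = sum p * depth = H(X)."""
    path = ''
    for b in bits:
        path += str(b)
        if path in leaves:
            return leaves[path], len(path)
```

For the pmf $(1/2,1/4,1/4)$ the leaves are 0, 10, 11 and $E(T)=1.5=H(X)$.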

Universal Source Coding

  • minimax redundancy: $R^*=\min_q\max_{p_\theta}R=\min_q\max_{p_\theta}D(p_\theta\|q)$
  • Arithmetic Coding
  • LZ77: sliding window
  • LZ78: Trie
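
The LZ78 parse is easy to sketch: each new phrase is the longest previously seen phrase plus one fresh symbol, and the phrase dictionary is exactly a trie (here stored as a dict of strings; function names are mine):

```python
def lz78_parse(s):
    """LZ78 parse: emit (phrase_index, symbol) pairs, index 0 = empty phrase.

    The dictionary of phrases is prefix-closed, i.e. it forms a trie.
    """
    dictionary = {'': 0}
    output, phrase = [], ''
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch                      # keep extending the match
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ''
    if phrase:                                # flush an unfinished phrase
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

def lz78_decode(pairs):
    """Invert the parse: phrase k = phrase at pairs[k]'s index + its symbol."""
    phrases, out = [''], []
    for idx, ch in pairs:
        phrase = phrases[idx] + ch
        phrases.append(phrase)
        out.append(phrase)
    return ''.join(out)
```

For example, 'AABABBBAB' parses into the phrases A, AB, ABB, B, AB, i.e. the pairs (0,A), (1,B), (2,B), (0,B), (1,B).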