
Data Compression - Prefix Free Code and Length Optimisation (ICT Part 2)



In my last post I arrived at the requirement for a prefix-free code $C$. Now I will discuss how to obtain prefix-free codes, and then how to optimise codeword lengths for data compression.

The naive way to check whether a code $C$ is prefix-free would be to brute force through each $y \in C(\Omega)$ and check whether there exists a non-empty $y' \in \{0,1\}^*$ such that $yy' \in C(\Omega)$.
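As a concrete sketch, the brute-force check can be written in a few lines of Python (the function name and string representation of codewords are my own choices for illustration):

```python
def is_prefix_free(codewords):
    """Brute-force check that no codeword is a proper prefix of another.

    z.startswith(y) with z != y is exactly the condition that some
    non-empty y' exists with y + y' in C(Omega).
    """
    for y in codewords:
        for z in codewords:
            if y != z and z.startswith(y):
                return False
    return True

print(is_prefix_free({"0", "10", "110", "111"}))  # True
print(is_prefix_free({"0", "01", "11"}))          # False: "0" is a prefix of "01"
```

This runs in $O(n^2)$ string comparisons over the $n$ codewords, which is exactly the inefficiency that motivates Kraft's inequality below.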

We want to make this check faster; one way to do this is to use Kraft's inequality. The theorem is stated as follows:

Kraft's Inequality

Given $C : \Omega = \{a_1, \dots , a_n\} \rightarrow \{0,1\}^*$ and $l_i = $ length of $C(a_i) = |C(a_i)|$ for all $1 \leq i \leq n$: if $C$ is prefix-free then

$\sum_{i=1}^{n} 2^{-l_i} \leq 1 \quad (1)$

Proof: For structured notation, we assume an ordering of the $l_i$'s, $l_1 \leq l_2 \leq \dots \leq l_n$; the summation in Kraft's inequality holds regardless of this ordering. With the ordering, $l_n = |C(a_n)|$ is the maximum length among the codewords mapped from $\Omega$, and multiplying $(1)$ through by $2^{l_n}$ we have:

$\sum_{i=1}^{n} 2^{-l_i} \leq 1 \Rightarrow \sum_{i=1}^{n} 2^{-l_i} \cdot 2^{l_n} \leq 2^{l_n}$

$\Rightarrow \sum_{i=1}^{n} 2^{l_n - l_i} \leq 2^{l_n}$

To see why this holds, suppose we have a prefix-free code $C$:

  1. Construct a binary tree of height $l_n$.
  2. Label the nodes as follows:
    • The root is labeled with the empty string.
    • Nodes at height $1$ (left child first, then the right child) get the labels $0$ and $1$ respectively.
    • Whenever a node has a label $x$, its left child gets the label $x0$ and its right child gets $x1$.
  3. At height $l_1$, there exists a node labeled $C(a_1)$ (this layer has $2^{l_1}$ nodes, and every string in $\{0,1\}^{l_1}$ appears as a label; the same holds at height $l_i$ for every $a_i \in \Omega$); call this node $N_{C(a_1)}$. Now take the subtree rooted at $N_{C(a_1)}$: no $C(a_i)$ with $i \neq 1$ can be the label of a node in this subtree, since such a label would have $C(a_1)$ as a prefix. The number of leaf nodes of this subtree is $2^{l_n - l_1}$ (it is a binary tree of height $l_n - l_1$).
  4. The subtrees rooted at $N_{C(a_1)}, N_{C(a_2)}, \dots, N_{C(a_n)}$ must be pairwise disjoint, as $C$ is prefix-free.
  5. The total number of leaves of these subtrees put together is $2^{l_n - l_1} + 2^{l_n - l_2} + \dots + 2^{l_n - l_n} \leq 2^{l_n}$, since the full tree has exactly $2^{l_n}$ leaves at height $l_n$.

Hence, we have proved that if $C$ is prefix-free then $\sum_{i=1}^{n} 2^{l_n - l_i} \leq 2^{l_n}$; dividing both sides by $2^{l_n}$ recovers $(1)$.
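The inequality gives a fast numerical test on lengths alone; a minimal sketch (the function name is mine):

```python
def kraft_sum(lengths):
    # Left-hand side of Kraft's inequality: sum of 2^{-l_i} over the lengths.
    return sum(2 ** -l for l in lengths)

# Lengths (1, 2, 3, 3) of the prefix-free code {0, 10, 110, 111}:
print(kraft_sum([1, 2, 3, 3]))  # 1.0 -- satisfies the inequality (with equality)
# Lengths (1, 1, 2) violate the inequality, so no prefix-free code can have them:
print(kraft_sum([1, 1, 2]))     # 1.25
```

Note this checks only the lengths: a set of codewords can have lengths satisfying $(1)$ and still fail to be prefix-free, but whenever $(1)$ holds a prefix-free code with those lengths exists.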

Length for Data Compression

Up till now I have only discussed how to form codes that are decodable. Now I will turn to optimising the lengths of the codewords.

Figure: source encoding to transmitter.

The source input is a string over the set $\Omega$. The source encoder outputs $C(a_i)$ for each symbol, and $C^*(w)$ is formed by concatenating these outputs.

Let $w \in \Omega^*$.

E.g. if $w = a_{i_1}a_{i_2}\dots a_{i_k}$, then $C^*(w) = C(a_{i_1})C(a_{i_2})\dots C(a_{i_k})$ and $|C^*(w)| = l_{i_1} + l_{i_2} + \dots + l_{i_k} = l$. We want an encoding function $C$ such that this $l$ is minimal for the maximum number of source strings $w$.
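The extension $C^*$ is just symbol-by-symbol concatenation; a small sketch, assuming a particular prefix-free code as the dictionary (the code table below is illustrative, not from the post):

```python
def extend(C, w):
    """C*(w): concatenate the codeword of each symbol of w in order."""
    return "".join(C[a] for a in w)

C = {"a": "0", "b": "10", "c": "110", "d": "111"}  # an assumed prefix-free code
encoded = extend(C, "abad")
print(encoded)       # 0100111
print(len(encoded))  # l = l_a + l_b + l_a + l_d = 1 + 2 + 1 + 3 = 7
```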

Solving this problem won't be as simple as selecting a $C'$ such that $|C'(w)| < |C(w)|$ for one specific $w$, as there can be some $w' \neq w$ such that $|C'(w')| > |C(w')|$.

We want to minimise the codeword length $|C^*(w)|$ on average over a given source distribution on $\Omega$ (the correct way to describe a distribution is with probabilities, which I will get to next).

Probability Distribution for $\Omega$

Define: $P : \Omega = \{a_1, \dots , a_n\} \rightarrow [0,1]$ such that $\sum_{i=1}^{n} P(a_i) = 1$.

E.g. take $\Omega = \{a,b,c,d\}$. For a given source, $P$ is estimated over a large sample of inputs as the probability of the source sending each $a_i \in \Omega$ to be transmitted:

  • $P(a) = 0.35$
  • $P(b) = 0.15$
  • $P(c) = 0.26$
  • $P(d) = 0.24$

Using $P$ we obtain a model of the source distribution as $(\Omega, P)$.

If $|C(a_i)| \neq |C(a_j)|$ for all $i \neq j$, then $P(a_i) = P(|C(a_i)|) = P(l_i)$ for all $1 \leq i \leq |\Omega|$, i.e. the probability of a symbol equals the probability of observing a codeword of that length.

Now, for a given $C$, using the $P(|C(a_i)|)$ values we can calculate the expected length of a single codeword drawn from the source distribution $(\Omega, P)$ as:

$L(C) = \sum_{i=1}^{n} p_i l_i$, where $p_i = P(a_i)$.

Hence, given a source distribution $(\Omega, P)$, to minimise the codeword length over the entire domain $\Omega^*$ we need to choose a $C$ such that $L(C)$ is minimised subject to $\sum_i 2^{-l_i} \leq 1$ ($C$ should remain prefix-free so that the codewords are efficiently decodable).
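Putting the pieces together for the example distribution above, here is a sketch that computes $L(C)$ for one candidate code. The code table is an assumption of mine for illustration: it assigns shorter codewords to more probable symbols, but is not claimed to be the optimal choice.

```python
P = {"a": 0.35, "b": 0.15, "c": 0.26, "d": 0.24}

# Hypothetical prefix-free code: most probable symbol gets the shortest codeword.
C = {"a": "0", "c": "10", "d": "110", "b": "111"}

kraft = sum(2 ** -len(cw) for cw in C.values())  # must be <= 1 (Kraft's inequality)
L = sum(P[s] * len(C[s]) for s in P)             # expected codeword length L(C)
print(kraft)        # 1.0
print(round(L, 2))  # 2.04 bits per symbol
```

Any alternative assignment of the same lengths to different symbols (e.g. giving $b$ the length-2 codeword instead of $c$) yields a larger $L(C)$ here, which is the intuition behind matching short codewords to probable symbols.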