
Data Compression - Prefix Free Code and Length Optimisation (ICT Part 2)



In my last post I arrived at the requirement for a prefix-free code $C$. Now I will discuss how to obtain prefix-free codes, and then how to optimise codeword lengths for data compression.

The naive way to check whether a code $C$ is prefix-free would be to brute force through each $y \in C(\Omega)$ and check whether there exists a non-empty $y' \in \{0,1\}^*$ such that $yy' \in C(\Omega)$.
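As a concrete sketch, the brute-force check can be written in a few lines of Python (the function name and string representation of codewords are my own choices for illustration):

```python
def is_prefix_free(codewords):
    """Brute-force check that no codeword is a proper prefix of another.

    z.startswith(y) with z != y is exactly the condition that some
    non-empty y' exists with y + y' in C(Omega).
    """
    for y in codewords:
        for z in codewords:
            if y != z and z.startswith(y):
                return False
    return True

print(is_prefix_free({"0", "10", "110", "111"}))  # True
print(is_prefix_free({"0", "01", "11"}))          # False: "0" is a prefix of "01"
```

This runs in $O(n^2)$ string comparisons over the $n$ codewords, which is exactly the inefficiency that motivates Kraft's inequality below.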

We want to make this check faster; one way to do this is to use Kraft's inequality. The theorem is stated as follows:

Kraft's Inequality

Given $C : \Omega = \{a_1, \dots , a_n\} \rightarrow \{0,1\}^*$ and $l_i = $ length of $C(a_i) = |C(a_i)|$ for all $1 \leq i \leq n$: if $C$ is prefix-free then

$\sum_{i=1}^{n} 2^{-l_i} \leq 1 \quad (1)$

Proof: For structured notation, we assume an ordering of the $l_i$'s, $l_1 \leq l_2 \leq \dots \leq l_n$; the summation in Kraft's inequality holds regardless of this ordering. With the ordering, $l_n = |C(a_n)|$ is the maximum length among the codewords mapped from $\Omega$, and multiplying $(1)$ through by $2^{l_n}$ we have:

$\sum_{i=1}^{n} 2^{-l_i} \leq 1 \Rightarrow \sum_{i=1}^{n} 2^{-l_i} \cdot 2^{l_n} \leq 2^{l_n}$

$\Rightarrow \sum_{i=1}^{n} 2^{l_n - l_i} \leq 2^{l_n}$

To see why this holds, suppose we have a prefix-free code $C$:

  1. Construct a binary tree of height $l_n$.
  2. Label the nodes as follows:
    • The root is labeled with the empty string.
    • Nodes at height $1$ (left child first, then the right child) get the labels $0$ and $1$ respectively.
    • Whenever a node has a label $x$, its left child gets the label $x0$ and its right child gets $x1$.
  3. At height $l_1$, there exists a node labeled $C(a_1)$ (this layer has $2^{l_1}$ nodes, and every string in $\{0,1\}^{l_1}$ appears as a label; the same holds at height $l_i$ for every $a_i \in \Omega$); call this node $N_{C(a_1)}$. Now take the subtree rooted at $N_{C(a_1)}$: no $C(a_i)$ with $i \neq 1$ can be the label of a node in this subtree, since such a label would have $C(a_1)$ as a prefix. The number of leaf nodes of this subtree is $2^{l_n - l_1}$ (it is a binary tree of height $l_n - l_1$).
  4. The subtrees rooted at $N_{C(a_1)}, N_{C(a_2)}, \dots, N_{C(a_n)}$ must be pairwise disjoint, as $C$ is prefix-free.
  5. The total number of leaves of these subtrees put together is $2^{l_n - l_1} + 2^{l_n - l_2} + \dots + 2^{l_n - l_n} \leq 2^{l_n}$, since the full tree has exactly $2^{l_n}$ leaves at height $l_n$.

Hence, we have proved that if $C$ is prefix-free then $\sum_{i=1}^{n} 2^{l_n - l_i} \leq 2^{l_n}$; dividing both sides by $2^{l_n}$ recovers $(1)$.
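The inequality gives a fast numerical test on lengths alone; a minimal sketch (the function name is mine):

```python
def kraft_sum(lengths):
    # Left-hand side of Kraft's inequality: sum of 2^{-l_i} over the lengths.
    return sum(2 ** -l for l in lengths)

# Lengths (1, 2, 3, 3) of the prefix-free code {0, 10, 110, 111}:
print(kraft_sum([1, 2, 3, 3]))  # 1.0 -- satisfies the inequality (with equality)
# Lengths (1, 1, 2) violate the inequality, so no prefix-free code can have them:
print(kraft_sum([1, 1, 2]))     # 1.25
```

Note this checks only the lengths: a set of codewords can have lengths satisfying $(1)$ and still fail to be prefix-free, but whenever $(1)$ holds a prefix-free code with those lengths exists.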

Length for Data Compression

Up till now I have only discussed how to form codes that are decodable. Now I will turn to optimising the lengths of the codewords.

Figure: source encoding to transmitter.

The source input is a string over the set $\Omega$. The source encoder outputs $C(a_i)$ for each symbol, and $C^*(w)$ is formed by concatenating these outputs.

Let $w \in \Omega^*$.

E.g. if $w = a_{i_1}a_{i_2}\dots a_{i_k}$, then $C^*(w) = C(a_{i_1})C(a_{i_2})\dots C(a_{i_k})$ and $|C^*(w)| = l_{i_1} + l_{i_2} + \dots + l_{i_k} = l$. We want an encoding function $C$ such that this $l$ is minimal for the maximum number of source strings $w$.
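The extension $C^*$ is just symbol-by-symbol concatenation; a small sketch, assuming a particular prefix-free code as the dictionary (the code table below is illustrative, not from the post):

```python
def extend(C, w):
    """C*(w): concatenate the codeword of each symbol of w in order."""
    return "".join(C[a] for a in w)

C = {"a": "0", "b": "10", "c": "110", "d": "111"}  # an assumed prefix-free code
encoded = extend(C, "abad")
print(encoded)       # 0100111
print(len(encoded))  # l = l_a + l_b + l_a + l_d = 1 + 2 + 1 + 3 = 7
```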

Solving this problem won't be as simple as selecting a $C'$ such that $|C'(w)| < |C(w)|$ for one specific $w$, as there can be some $w' \neq w$ such that $|C'(w')| > |C(w')|$.

We want to minimise the codeword length $|C^*(w)|$ on average over a given source distribution on $\Omega$ (the correct way to describe a distribution is with probabilities, which I will get to next).

Probability Distribution for $\Omega$

Define: $P : \Omega = \{a_1, \dots , a_n\} \rightarrow [0,1]$ such that $\sum_{i=1}^{n} P(a_i) = 1$.

E.g. take $\Omega = \{a,b,c,d\}$. For a given source, $P$ is estimated over a large sample of inputs as the probability of the source sending each $a_i \in \Omega$ to be transmitted:

  • $P(a) = 0.35$
  • $P(b) = 0.15$
  • $P(c) = 0.26$
  • $P(d) = 0.24$

Using $P$ we obtain a model of the source distribution as $(\Omega, P)$.

If $|C(a_i)| \neq |C(a_j)|$ for all $i \neq j$, then $P(a_i) = P(|C(a_i)|) = P(l_i)$ for all $1 \leq i \leq |\Omega|$, i.e. the probability of a symbol equals the probability of observing a codeword of that length.

Now, for a given $C$, using the $P(|C(a_i)|)$ values we can calculate the expected length of a single codeword drawn from the source distribution $(\Omega, P)$ as:

$L(C) = \sum_{i=1}^{n} p_i l_i$, where $p_i = P(a_i)$.

Hence, given a source distribution $(\Omega, P)$, to minimise the codeword length over the entire domain $\Omega^*$ we need to choose a $C$ such that $L(C)$ is minimised subject to $\sum_i 2^{-l_i} \leq 1$ ($C$ should remain prefix-free so that the codewords are efficiently decodable).
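Putting the pieces together for the example distribution above, here is a sketch that computes $L(C)$ for one candidate code. The code table is an assumption of mine for illustration: it assigns shorter codewords to more probable symbols, but is not claimed to be the optimal choice.

```python
P = {"a": 0.35, "b": 0.15, "c": 0.26, "d": 0.24}

# Hypothetical prefix-free code: most probable symbol gets the shortest codeword.
C = {"a": "0", "c": "10", "d": "110", "b": "111"}

kraft = sum(2 ** -len(cw) for cw in C.values())  # must be <= 1 (Kraft's inequality)
L = sum(P[s] * len(C[s]) for s in P)             # expected codeword length L(C)
print(kraft)        # 1.0
print(round(L, 2))  # 2.04 bits per symbol
```

Any alternative assignment of the same lengths to different symbols (e.g. giving $b$ the length-2 codeword instead of $c$) yields a larger $L(C)$ here, which is the intuition behind matching short codewords to probable symbols.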