<<< Previous chapter

5. Unique path pruning

 

Sometimes it is not needed to extend the context tree to the maximum tree depth. If a path in the context tree is a unique path, which means that from a certain level in the tree on the tree doesn’t split up anymore. This is because in such a path the weighted probability is equal to the estimated probability in each node. Removing unique pathes from the context tree is called “unique path pruning” (figure 5.1).

 

Figure 5.1: with unique path pruning, the tree parts above the crosses (left tree) are removed (right tree)

 

At a later time during the encoding or decoding process, a context may occur that coincides with a pruned path. Then it is necessary to be able to reconstruct the full context corresponding to this pruned path. If this context is known, it can be compared to the current context, and the tree can be extended as far as needed to make the contexts unique again.

However to be able to get the context of a pruned path, it is necessary to keep the whole source sequence in memory, which is a drawback of using unique path pruning. A benefit of using unique path pruning is that the number of nodes decreases significantly, which saves more memory space than the source sequence takes. It can also result in a smaller execution time, because there are less nodes to search for and less calculations to make.

 

5.1. Original implementation of unique path pruning

 

In the implementation of CTW as described in [EIDMA], the unique path pruning was implemented as follows:

 

1.      Start at the root node of the context tree for the current phase;

2.      find the child node for the current context symbol;

3.      if the child node does exist:

3a.   the child node becomes the current node;

3b.  if the maximum tree depth is reached: stop;

3c.   otherwise go back to step 2 to find the next child node;

4.      otherwise: (the tree was pruned, going to extend the tree as far as needed)

4a.   create the new node;

4b.  if the current symbol of the current context is the same as the current symbol of the pruned context: go back to step 4a (to create the next new node of the current context); or if the maximum tree depth is reached: stop;

4c.   otherwise: create a node for the pruned context if it doesn’t exist yet, and stop.

 

 

Example 5.1: Suppose that a tree at a certain time looks like the left tree in figure 5.2.

 

 

Figure 5.2: pruning example 1

 

A symbol x with context ‘daaa’ arrives. Note that the context has te be read from right to left, because the symbol on the right side of the total context is the most recent context symbol and therefore the first context symbol to process. The path is searched and updated using the algorithm described above. We start at root node n0 , which has a pruned context ‘aaac’ (i.e. it was previously pruned with this context starting from node n5), and find the child node for the current context symbol ‘a’. This node does not exist, so we create the node (n1). The current symbol of the current context ‘a’ is not the same as the current symbol of the pruned context ‘c’, so now we can stop. It is not needed to create a node for the pruned context because it already exists. The dotted line shows the part of the tree that was pruned during this step.

 

After some time the symbol y with context ‘abaa’ arrives. Again we start at node n0. The child node n1 is found. From node n1 we search for the child node with context symbol ‘a’. This node does not exist so it is created (n2). The current symbol of the current context ‘a’ is the same as the current symbol of the pruned context ‘a’, so we create the child node for the current context (n3). Now the current context symbol of the current context ‘b’ and the one of the pruned context ‘a’ aren’t the same, so the node n4 for the pruned context is created. The resulting tree is shown on the right side of figure 5.2. Again the dotted lines show the part of the tree that was pruned during this step.

 

 

A drawback of this implementation occurs when the new context is exactly the same as the pruned context. In this case it can be seen that the context is fully extended, while this is not nescessary at all. This is shown in example 2.

 

 

Example 5.2: Suppose that the context tree looks like the left tree in figure 5.3.

 

Figure 5.3: pruning example 2

 

A new symbol y arrives with context ‘daaa’. We start at node n0. The child node n1 is found. From node n1 we search for the child node with context symbol ‘a’. This node does not exist so it is created (n2). The current symbol of the current context ‘a’ is the same as the current symbol of the pruned context ‘a’, so we create the child node for the current context (n3). The next context symbols are also the same (‘d’), so after this node n4 is created. Now the maximum depth is reached so we stop. The result is that the path for context ‘daaa’ is fully extended, though this is not nescessary at all.

 

5.2. Strict unique path pruning

 

Strict unique path pruning works a bit different than the original unique path pruning. Before extending the tree, the full context of the pruned tree and the current context are compared with each other. If they are fully the same, no new nodes will be created at all:

 

1.      Start at the root node of the context tree for the current phase;

2.      find the child node for the current context symbol;

3.      if the child node does exist:

3a.   the child node becomes the current node;

3b.  if the maximum tree depth is reached: stop;

3c.   otherwise go back to step 2 to find the next child node;

4.      otherwise: (the tree was pruned, going to extend the tree as far as needed)

4a.   if full context of the pruned tree and full current context are the same: stop;

4b.  (otherwise) create the new node (copy a and b counts from the last node in the path that already existed to the new node);

4c.   if the current symbol of the current context is the same as the current symbol of the pruned context: go back to step 4b (to create the next new node of the current context); or if the maximum tree depth is reached: stop;

4d.  otherwise: create a node for the pruned context if it doesn’t exist yet, and stop.

 

 

Example 5.3: Consider the left tree in figure 5.3. A new symbol y arrives with context ‘daaa’. We start at node n0. The child node n1 is found. From node n1 we search for the child node with context symbol ‘a’. This node does not exist. Now we check if the full context of the pruned tree and the full current context are the same. This is true because they are both ‘daaa’, so we stop. The tree structure does not change.

 

 

The use of strict unique path pruning results in about 15% less nodes (this was the average saving on all files of the calgary corpus) compared to the original unique path pruning algorithm. It also results in a slightly faster execution time, because on average there are less nodes that have to be treated per encoding step, which saves time that was needed for the hashing method to find the node (chapter 6) and the weighting in the node (chapter 3). The time that is saved this way appeared to be more than the time that is needed to compare the full contexts, which has to be performed for strict unique path pruning. Another benefit is that because less nodes are created with strict unique path pruning enabled, there usually will be less failed nodes, which results in a better compression performance. However, in theory the compression rate would have to be exactly the same, as long are there no failed nodes. In practice the number of code bits might differ a little, because of rounding errors in the integer arithmetic (paragraph 3.1).

 

5.3. Handling large files

 

For unique path pruning we need to store the whole file in memory. This leads to a new problem: if the files are very large there is too much memory space required. This means the amount of available memory leads to a limitation on the file size. But if there is plenty of memory, there still is another restriction: the pointer to the filebuffer in each node in our implementation has a size of 24 bits. This means that a file buffer larger than 224 = 16 megabytes is useless because we cannot address it. There are several possible solutions to this problem:

 

 

Figure 5.4: structure of a tree node in this implementation

 

In this implementation we choose the last solution: freezing the tree and file buffer. It is possible to specify the maximum file buffer size on the command prompt. If the file size exceeds this size, the tree will be frozen as soon as the file buffer is full.

 

 

More information can be found in [EIDMA ch. 4 par. 1.4]. Unique path pruning is implemented in ctwtree.c.


Next chapter >>>