<<< Previous chapter

7. Miscellaneous

 

Beside the subjects discussed in the previous chapters, there are some subjects that are not so relevant for understanding the current implementation of CTW. These subjects can be found in the implementation however, so they will be discussed in this chapter. First the effects of root weighting will be discussed (which can be enabled but is disabled by default). In the second paragraph it is explained why the number of the bit in the current byte (the phase) is used in the hashing function.

 

7.1. Root weighting

 

The part about binary decomposition in chapter 4, states there are 255 different trees. However, one could also see this as if there is just one tree for each phase (bit) of an ASCII-symbol, which first splits in 2phase nodes (with 0 ≤ phase ≤ 7) dependent on the previous bits of the current symbol. Now each of these nodes corresponds to the root node of one of the 255 trees mentioned before.

 

If root weighting is disabled, the previous bits of the current symbol are used to select the weighted probability from one of these nodes only. This means the weighted probability from the selected node is used as the coding probability. However, if root weighting is enabled, weighting also takes place in the root nodes of the eight trees (not to be confused with the root nodes of the 255 trees). For example, the coding probability (i.e. weighted probability in the root node λ) for the third bit of a byte (phase = 2) is given by:

 

                                                                     (7.1)

 

If the symbol source that is modeled really is a source with a 256-symbol alphabet, then the most reasonable choice is to disable root weighting. If the true source model does not include the root node, the codeword should be about one bit shorter per tree than with root weighting enabled, so the total gain would be about eight bits. For the Calgary corpus, the difference in the codelength per file varies from –42 to +120 bits (for depth 4) and from –43 to +110 bits (for depth 8), as can be found in [EIDMA ch. 4 app. 4]. Without root weighting the speed is also increased slightly, because less weighting operations are to be performed. However, in the implementation as described in [EIDMA] root weighting was enabled, because enabling root weighting does not much influence the performance of the algorithm and the implementation with root weighting enabled is slightly simpler (because the root nodes are handled in the same way as all other nodes). In this implementation of CTW root weighting can be enabled or disabled using command line options, but is disabled by default. Enabling root weighting might result in a slightly different (better or worse) compression performance, dependent on the characteristics of the input file.

 

7.2. Using phase in the hashing function

 

A simple hashing function that could be used in this implementation is:

 

                                                       offset = hash(symbol)                                                     (7.2)

 

However in chapter 6, we introduced a more difficult hashing function:

 

                        offset = hash(symbol) xor ((phase+1) · 2) , with 0 ≤ phase ≤ 7                      (7.3)

 

One might wonder what is the benefit of using phase in the hashing function. Suppose that we use formula 7.2 as hashing function. Now the offset only depends on symbol, which means in all different trees the offset (that is added to the index of a node into the array) for a certain value of symbol is the same. This may lead to a situation in which collisions occur between two trees that try to store a node on the same positions all the time. The trees that belong to different phases look quite similar to each other because the context symbols are the same for all phases, so this situation will occur regularly. When using phase, the offsets become different for the same value of symbol in different trees. This results in a smaller probability for a collision to occur, wich results in a faster execution time.

 

The effect of using phase can be measured by calculating the average number of tries that was needed to find a node during the encoding / decoding of a file. A smaller average number of tries means a smaller amount of collisions. Table 7.1 shows the results for all files in the calgary corpus.

 

Filename

No phase

With phase

difference

bib

1.09808

1.09732

-0.00076

book1

1.21630

1.21250

-0.00380

book2

1.19481

1.19256

-0.00225

geo

1.16459

1.16006

-0.00453

news

1.19255

1.19038

-0.00217

obj1

1.05890

1.05101

-0.00789

obj2

1.16199

1.14584

-0.01615

paper1

1.12991

1.12901

-0.00090

paper2

1.15380

1.15307

-0.00073

paper3

1.14326

1.14303

-0.00023

paper4

1.11142

1.11149

-0.00007

paper5

1.10925

1.10833

-0.00092

paper6

1.12573

1.12459

-0.00114

pic

1.09279

1.08770

-0.00509

progc

1.10625

1.10515

-0.00110

progl

1.09582

1.09543

-0.00039

progp

1.07847

1.07805

-0.00042

trans

1.08486

1.08360

-0.00126

average

1.12882

1.12606

-0.00276

Table 7.1: average number of tries for each file in calgary corpus

 

It can be seen that the average number of tries decreases for all files with the use of phase in the hashing function. The differences may seem to be small, but this is because the values are averages on the whole file and in most cases one try will be enough.

 

 

Root weighting can be found in ctwmath.c. More information on root weighting can be found in [EIDMA ch. 4 par. 4].