6. Hashing
Storing a tree structure in computer memory
can be done in various ways:
- Allocating an array in which the data of all
possible nodes in a fully extended tree (to a certain depth) can be
stored. The benefit of this method is that a simple calculation can
be used to find a child or a parent of a node. For example for a
binary tree the children of a node on index x can be found
on position 2·x and 2·x + 1. However,
often a tree (also the context tree of the CTW algorithm) will not
be fully extended. This results in a very sparse array and thus a
large waste of memory.
- Dynamically allocating memory space for the
data of each node. The benefit is that memory space is only
required for nodes that are actually created, but the drawback is
that extra memory is required to store pointers to the children
(and if needed to the parent) of the node. Especially when a lot of
nodes have to be allocated with a small amount of data, like in the
CTW algorithm, these pointers cause a huge overhead.
- The use of hashing. With hashing, an array is
used in which a fixed number of nodes can be placed. The idea is
that every node is placed on a pseudo-random index in this array.
The index is calculated from a certain key parameter with
the so called hashing function. The hashing function must be
a pseudo-random function, but the result of the hashing function
must always be the same given the same key value. It is possible
that different keys result in the same index; this is called a
collision. To detect such a collision it is important that there is
some data stored in each node which makes it possible to recognize
if the right node has been found. If it isn’t the right node,
the hashing algorithm can perform a second try, adding a certain
offset to the index value and checking if the new location in the
array contains the right node. The hashing function must be chosen
such that the probability on a collision is as small as
possible.
The benefit of using hashing is that it’s an efficient way
of using memory when storing a tree. A drawback is that possibly a
node can’t be allocated although there still is free space in
the array, because of the chosen hashing function and the maximum
number of tries that is performed. It is also difficult to forecast
how large the array should be.
For use with the CTW algorithm, the hashing
method is best suitable. The CTW hashing implementation is simple
and fast:
- The index in the tree array of the parent of
the node that is to be found, is used as a ‘start
position’, say i.
- A certain offset is calculated with the use
of a hashing function, in which a ‘mix’ of the value of
the current context symbol and the position of the current bit in
the current context symbol (called phase, as described in
paragraph 4.1) is used as the key:
offset = hash(symbol) xor ((phase+1) ·
2) , with 0 ≤ phase ≤ 7
.
(6.1)
Because we actually store different trees for each phase in the
tree array, the phase is also used in calculating this key because
this decreases the probability that two trees conflict with each
other. This is explained in more detail in paragraph 7.2. The value of
hash(symbol) is retrieved from the hash table, which is just
a permutation table with 256 pseudo-random chosen values.
- A new index is calculated:
inew = i + offset (modulo the array
size).
- At index inew in the tree
array we check if the right node is found. This can be recognized
by the phase the node belongs to, the number of tries
that was needed to find this node and the current context
symbol.
- If all three values are correct, it is
sure that the right node is found. If the values are not correct, a
new index inew is calculated by adding
offset to the current index again, and a new try will be
performed at this new index. However if the number of tries
exceeds the maximum number of tries that is allowed, the search
operation will stop. In that case, the node can’t be used in
the CTW algorithm, which will result in a (slightly) worse
compression rate, because now CTW in fact is forced to use a
simpler source model. However the file can still correctly be
decoded because exactly the same nodes will fail during the
decoding process.
It is also possible that an empty node is found. If this situation
occurs, the node we’re searching for is a new one or the tree
is pruned (chapter 5) at that
location.
Using hashing appeared to be a good choice. It
decreased the amount of required memory enormously. However a
drawback still is that it is difficult to estimate how large the
array in which the values are stored should be for a certain file.
This size mostly depends on the characteristic of the input file
and the maximum tree depth. The total array has to be allocated
before the actual encoding/decoding process, and at that time the
characteristic of the input file is not known yet.
More information can be found in [EIDMA ch. 4
par. 3]. Hashing is implemented in ctwtree.c.