2

I'm writing a Huffman encoding program in C. I'm trying to include as little information in the header as possible. I know the simplest way to allow decompression would be to store the frequencies of each character in the header, but for a large file with 256 distinct characters that would take 2304 bytes ((1 byte for character + 8 bytes for long frequency) * 256), which I don't think is optimal.
I know I can reconstruct a tree from a preorder scan and an inorder scan of it, but that requires having no duplicate values. That is bad because I would then have to store each node in the tree twice (a Huffman tree has n*2 - 1 nodes, with n being the number of unique characters), with each node being a long value, which could take ((256*2 - 1) * 2) * 8 = 8176 bytes.

Is there a way I'm missing here, or are those my only options?

Thanks.

shoham
  • 225

4 Answers

2

First, as discussed in comments, you should get rid of the frequencies, since you only need them to create the tree, not to reproduce the codes for decoding. In your program, but not on disk, the tree structure might look like this (note the absence of frequencies):

struct Node {
  char value; // only used for leaf nodes
  // leaf nodes have BOTH child pointers NULL
  struct Node *left, *right;
};

I think the following scheme should allow reproducing the tree (though not the frequencies) using at most 2n * k bits for alphabets where each character takes k bits (so k <= log2 n <= k + 1):

  • Assign arbitrary consecutive indices to all interior nodes of the Huffman tree.
  • For each character, write out the index of the parent node.
  • Order the interior nodes by their indices. For each node except the root, write out the index of its parent node. For the root node, make its "parent" index equal to itself.

Since there are at most n-1 interior nodes, node indices fit into k bits each. Adding the interior node records to the n character records, we arrive at slightly less than 2n*k bits. Decoding is relatively easy: first read the n character records, create the corresponding interior nodes, and iteratively add the newly discovered nodes (those referenced by other interior nodes but not yet created). You can recognize the root node by its self-reference.

Note that this would require a different tree structure, one with parent references instead of child references and a flag to distinguish leaf nodes (in memory, you can use NULL for the root's parent). If this makes it easier to generate the codes, you can invert the parent pointers, i.e. turn this representation into the nice top-down structure mentioned above.

Caveat: I assumed k is known to both parties (if not, a single extra byte should suffice for any practical application). I also assumed an alphabet of fixed-size bit vectors, but I think that's the case in virtually all applications (and if it's not, you can add that metadata and still get away rather well).
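The packing side of the scheme above could look something like the following sketch. `BitWriter`, `put_bits` and `write_header` are illustrative names, not anything from a standard library, and the layout (MSB-first k-bit fields, leaf records first) is just one possible choice:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Append a k-bit value to a caller-provided, zeroed buffer, MSB first. */
typedef struct {
    uint8_t *buf;
    size_t bitpos;   /* bits written so far */
} BitWriter;

static void put_bits(BitWriter *w, unsigned value, unsigned k)
{
    for (unsigned i = 0; i < k; i++) {
        unsigned bit = (value >> (k - 1 - i)) & 1u;
        w->buf[w->bitpos >> 3] |= (uint8_t)(bit << (7 - (w->bitpos & 7)));
        w->bitpos++;
    }
}

/* n character records (parent index of each leaf), then m interior
   records (parent index of each interior node; the root references
   itself).  Returns the total bit count, (n + m) * k. */
static size_t write_header(BitWriter *w,
                           const unsigned *leaf_parent, size_t n,
                           const unsigned *inner_parent, size_t m,
                           unsigned k)
{
    for (size_t i = 0; i < n; i++) put_bits(w, leaf_parent[i], k);
    for (size_t i = 0; i < m; i++) put_bits(w, inner_parent[i], k);
    return w->bitpos;
}
```

For 4 leaves there are 3 interior nodes, so with k = 2 the whole header is 14 bits, matching the "slightly less than 2n*k" bound.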

2

There are 2 separate problems: storing the topology and assigning the leaf nodes.

Assigning the leaf nodes can be done by storing the characters in a predefined order so each one can be extracted as needed.

Storing the topology can be done with a bit vector holding 2 bits per parent node in the previous layer, where 1 represents a compound node and 0 represents a leaf node.

So first there is 1 bit for the root (which is 1), and the next 2 bits represent the next level down.
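The writing side of this scheme can be sketched in C as a breadth-first walk; `write_topology` is an illustrative name, and the bits are emitted as a string of '0'/'1' characters only for clarity (a real header would pack them):

```c
#include <assert.h>
#include <string.h>

struct node { char value; struct node *left, *right; };

/* BFS over the tree: '1' for a compound (internal) node, '0' for a
   leaf; leaf characters are collected in the same traversal order. */
static void write_topology(const struct node *root, char *bits, char *leaves)
{
    const struct node *queue[512];
    size_t head = 0, tail = 0, bi = 0, li = 0;

    /* 1 bit for the root itself */
    if (root->left) { bits[bi++] = '1'; queue[tail++] = root; }
    else            { bits[bi++] = '0'; leaves[li++] = root->value; }

    while (head < tail) {
        const struct node *n = queue[head++];
        const struct node *kids[2] = { n->left, n->right };
        for (int i = 0; i < 2; i++) {
            if (kids[i]->left) { bits[bi++] = '1'; queue[tail++] = kids[i]; }
            else               { bits[bi++] = '0'; leaves[li++] = kids[i]->value; }
        }
    }
    bits[bi] = '\0';
    leaves[li] = '\0';
}
```

For a tree with leaf 'a' under the root and leaves 'b', 'c' under an internal right child, this emits "10100" and the character list "abc" (1 bit for the root plus 2 bits per internal node).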

To build the tree using the node {char value; node *left, *right;} setup:

char chars[]; // prefilled with the stored character array
int charIndex = 0;

node root;
deque<node*> toBuild;
toBuild.push_back(&root); // assumes the root bit was 1

while (!toBuild.empty()) {
    node *n = toBuild.front();
    toBuild.pop_front();

    n->left = new node;
    if (grabBit()) toBuild.push_back(n->left);
    else n->left->value = chars[charIndex++];

    n->right = new node;
    if (grabBit()) toBuild.push_back(n->right);
    else n->right->value = chars[charIndex++];
}
return &root;

This is 2*n bits for the topology, plus the character permutation, which takes at least O(log n!) bits.


Another option is to store the length of each encoded token. Using just that, you can build a Huffman tree deterministically. You start with the shortest token and assign it all 0 bits. For each following token you add 1 to the previous encoding (carrying as needed) and append 0 bits to reach the new length. To store the length of each token you can use a fixed Huffman encoding.

This is the method used in DEFLATE.
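The DEFLATE construction (RFC 1951, section 3.2.2) can be sketched in C as below. Canonical codes are assigned from the code lengths alone: count how many codes exist at each length, derive the first code of each length, then hand out codes in symbol order:

```c
#include <assert.h>

#define MAX_BITS 15  /* DEFLATE's limit on code length */

/* Given only the code length of each symbol, assign canonical codes. */
static void canonical_codes(const unsigned *len, unsigned *code, int n)
{
    unsigned bl_count[MAX_BITS + 1] = {0};
    unsigned next_code[MAX_BITS + 1] = {0};
    unsigned c = 0;

    for (int i = 0; i < n; i++) bl_count[len[i]]++;
    bl_count[0] = 0;                       /* length 0 = unused symbol */

    /* smallest code for each length */
    for (int bits = 1; bits <= MAX_BITS; bits++) {
        c = (c + bl_count[bits - 1]) << 1;
        next_code[bits] = c;
    }

    /* hand out consecutive codes within each length, in symbol order */
    for (int i = 0; i < n; i++)
        if (len[i]) code[i] = next_code[len[i]]++;
}
```

With the lengths (3, 3, 3, 3, 3, 2, 4, 4) from the RFC's own example, this yields the codes 010, 011, 100, 101, 110, 00, 1110, 1111, so encoder and decoder agree on the whole table from 8 small length values.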

ratchet freak
  • 25,986
0

It's not necessary to store the actual frequencies of each symbol, or the exact topology of the Huffman tree. You only need to store enough information to encode the level on the tree at which each symbol resides.

You can modify a Huffman tree by shuffling symbols and internal branch nodes around on the same level without changing the coding efficiency of the tree. So it makes sense to map your particular Huffman tree to its canonical version; then you only need to specify which of the canonical trees you are using. I suggest, starting at the top and going down, shoving all the symbols to the left as far as they will go, then sorting them in ascending order.

Once you've made your tree canonical, you need to actually encode it.

If you limit your tree depth to 32 levels then you can just encode a 256 by 5-bit array (160 bytes) giving the huffman tree level of each symbol.
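Packing that 256-entry, 5-bit table is straightforward; here is a hypothetical sketch (the name `pack_levels` and the MSB-first layout are my choices, not from the answer):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack 256 tree levels of 5 bits each into 160 bytes
   (256 * 5 bits = 1280 bits), MSB first. */
static void pack_levels(const uint8_t levels[256], uint8_t out[160])
{
    memset(out, 0, 160);
    for (int i = 0; i < 256; i++)
        for (int b = 0; b < 5; b++) {
            int pos = i * 5 + b;               /* absolute bit position */
            if ((levels[i] >> (4 - b)) & 1)
                out[pos >> 3] |= (uint8_t)(0x80 >> (pos & 7));
        }
}
```

The decoder reverses this, then rebuilds the canonical tree from the levels alone.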

You can approach the information-theoretic minimum of the encoding size if you arithmetically encode the subset of available symbols at each level, but I figure since you're using Huffman codes you're not ready for arithmetic encoding yet.

0

You absolutely don't need the frequencies. Define a fixed algorithm that determines the code values depending on the code lengths for each code. The code lengths for single bytes will be less than eight on average and take about three bits each to store, so you do some statistics and define a fixed Huffman encoding for the code lengths.

Example: Assume 4 symbols with code lengths 2, 3, 3, 1. You assign the next possible code. So 2 -> 00. 3 -> 010. 3 -> 011. 1 -> 1. You only need to store the code lengths. In this case 2 bits per code were enough, giving eight bits for the table.
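This worked example can be reproduced with a short sketch; `assign_codes` is an illustrative name, and it processes the lengths in the order given, as the answer does:

```c
#include <assert.h>

/* Assign each symbol "the next possible code" for its length:
   increment the previous code, then widen with 0 bits or truncate
   to match the new length. */
static void assign_codes(const unsigned *len, unsigned *code, int n)
{
    unsigned c = 0;
    unsigned prev = len[0];

    code[0] = 0;                       /* first code: all 0 bits */
    for (int i = 1; i < n; i++) {
        c += 1;                        /* next possible code... */
        if (len[i] > prev)
            c <<= (len[i] - prev);     /* ...widened with 0 bits */
        else if (len[i] < prev)
            c >>= (prev - len[i]);     /* ...truncated to the new length */
        code[i] = c;
        prev = len[i];
    }
}
```

Running it on the lengths 2, 3, 3, 1 reproduces the codes 00, 010, 011, 1 from the example.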

But we might assume that 3 bits is the most common code length, followed by 1 and 2. So we store "3" using one bit, and 1 and 2 using two bits each. Now we have 2x1 + 2 + 2 = 6 bits.

For 256 values, you'll likely need less than 256 x 3 bits = 96 bytes.

If you don't mind slower encoding and decoding, you don't need to store a table at all. Define an algorithm that creates a Huffman code from the frequencies with reasonable results in all cases, including the worst case that you have no frequency results at all yet.

So the encoder builds a Huffman code for all frequencies zero and the decoder does the same. You encode the first byte, the decoder decodes the first byte. Both now know the first byte and create a new Huffman code for one non-zero frequency. They encode and decode the second byte, then build another code with the added information, and so on.

gnasher729
  • 49,096