161-A2.3

From Microlab Classes
Jump to navigation Jump to search

Exercise 2.2. Huffman encoding (5 points)

If we can model our information source as discrete and memoryless, Huffman codes will provide a prefix-free encoding that has the minimum expected length per symbol. In practice, we do not have access to the underlying probability distribution of our information source, so we use the frequency counts as a proxy. For instance, if we have a coin-flipping machine as an information source and we observe 201 heads and 99 tails after 300 observations, then we estimate the probability of heads to be 201/300. We can similarly perform such an estimate for other information sources with larger alphabets.

The Huffman code can be visualized as a binary tree where each symbol is located at a leaf. The path from the root node to a leaf node corresponds to a binary encoding of one symbol. To produce a Huffman code, we need a sequence of samples to observe their relative frequencies. Construction then proceeds in a greedy fashion wherein more probable symbols tend to have shorter encodings. In this exercise, you are to use a Huffman encoder to convert a sequence of samples into a bit string.

Task description

In this task, you need to write a function huffman_encode which will take a HuffmanEncoder object and a list of strings, and return a binary string consisting of '0's and '1's which corresponds to an encoding of samples. The list of strings may contain strings with more than one character, and each string is guaranteed to have some encoding using the Huffman encoder.

Although the Huffman encoder is constructed in the header script for this exercise, we still encourage you to understand how it works. Note that in the interest of conciseness and readability, the script below may not be the fastest way to implement Huffman encoders.

Python test script for execution
Header
class Node:
    def __init__(self, data, weight):
        self.left = None
        self.right = None
        self.data = data
        self.weight = weight

class HuffmanEncoder:
    def __init__(self, samples):
        alphabet = set(samples)
        alphabet_size = len(alphabet)

        nodes = [Node({symb}, samples.count(symb)) for symb in alphabet]

        for it in range(1, len(alphabet)):
            a,b = sorted(range(len(nodes)), key = lambda idx : nodes[idx].weight)[:2]
            
            supernode = Node(nodes[a].data.union(nodes[b].data), nodes[a].weight + nodes[b].weight)
            supernode.left = nodes[a]
            supernode.right = nodes[b]
            
            nodes = [nodes[idx] for idx in range(len(nodes)) if idx not in (a,b)]
            nodes.append(supernode)

        self.root = nodes[0]
    
    def encode(self, samples):
        return huffman_encode(self, samples)
Student submission
def huffman_encode(encoder, samples):
    # Your code goes here
    # More code..
    
    return ans
Main test script
# Main test script goes here, and will not be visible to students.
# The script will test if the submitted code does the prescribed functionality.
# For successfully validated scripts, test results will be sent via email.

Sample input

To be added