161-A2.3

From Microlab Classes
Revision as of 09:43, 19 March 2021 by Adrian Vidal (talk | contribs)
Jump to navigation Jump to search

The exercise below will appear in your submission bin after passing Exercise 2.1. (Prefix-free codes).

Exercise 2.2. Huffman encoding (5 points)

If we can model our information source as discrete and memoryless, Huffman codes will provide a prefix-free encoding that has the minimum expected length per symbol. In practice, we do not have access to the underlying probability distribution of our information source, so we use the frequency counts as a proxy. For instance, if we have a coin-flipping machine as an information source and we observe 201 heads and 99 tails after 300 observations, then we estimate the probability of heads to be 201/300. We can similarly perform such an estimate for other information sources with larger alphabets.

The Huffman code can be visualized as a binary tree where each symbol is located at a leaf. The path from the root node to a leaf node corresponds to a binary encoding of one symbol. To produce a Huffman code, we need a sequence of samples to observe their relative frequencies. Construction then proceeds in a greedy fashion wherein more probable symbols tend to have shorter encodings. In this exercise, you are to use a Huffman encoder to convert a sequence of samples into a bit string.

Task description

In this task, you need to write a function huffman_encode which will take a HuffmanEncoder object and a list of strings, and return a binary string consisting of '0's and '1's which corresponds to an encoding of samples. The list of strings may contain strings with more than one character, and each string is guaranteed to have some encoding using the Huffman encoder.

Although the Huffman encoder is constructed in the header script for this exercise, we still encourage you to understand how it works. Note that in the interest of conciseness and readability, the script below may not be the fastest way to implement Huffman encoders.

Python test script for execution
Header
class Node:
    def __init__(self, data, weight):
        self.left = None
        self.right = None
        self.data = data
        self.weight = weight

class HuffmanEncoder:
    def __init__(self, samples):
        alphabet = set(samples)
        alphabet_size = len(alphabet)

        nodes = [Node({symb}, samples.count(symb)) for symb in alphabet]

        for it in range(1, len(alphabet)):
            a,b = sorted(range(len(nodes)), key = lambda idx : nodes[idx].weight)[:2]
            
            supernode = Node(nodes[a].data.union(nodes[b].data), nodes[a].weight + nodes[b].weight)
            supernode.left = nodes[a]
            supernode.right = nodes[b]
            
            nodes = [nodes[idx] for idx in range(len(nodes)) if idx not in (a,b)]
            nodes.append(supernode)

        self.root = nodes[0]
    
    def encode(self, samples):
        return huffman_encode(self, samples)
Student submission
def huffman_encode(encoder, samples):
    # Your code goes here
    # More code..
    
    return ans
Main test script
# Main test script goes here, and will not be visible to students.
# The script will test if the submitted code does the prescribed functionality.
# For successfully validated scripts, test results will be sent via email.

Sample input

This time, we will use an entire script to test your code. The code below reads a text file containing Grimms' Fairy Tales from Project Gutenberg. It is a collection of stories gathered by the Brothers Grimm, who studied 19th century German folklore. (It is worth a read if only to realize how the source material differs from the fairy tales you came to know about growing up.) The script first reads the entire file to construct a Huffman encoder from the characters found in the file, and then applies the encoder to all titles in the collection. A helper function reports the encoder output as a pair corresponding to (1) the number of bytes in the original (uncompressed) text, and (2) the number of bits produced by the Huffman encoder.

import re

def encoder_report(encoder, grimms_title, titles, lines):
    title_idx = titles.index(grimms_title)
    line_start = lines.index(grimms_title)
    line_end = lines.index(titles[title_idx + 1])

    orig_text = '\n'.join(lines[line_start:line_end])
    bitstring = huffman_encode(encoder, orig_text)

    return len(orig_text), len(bitstring)

if __name__ == '__main__':
    
    regex = re.compile("^( ){5}[A-Z',\\-\\[\\] ]+$")
    titles = []

    file_contents = open('grimms-fairy-tales.txt').read().replace('\r','')
    lines = file_contents.split('\n')
    
    for i in range(len(lines)):
        line = lines[i]
        if regex.match(line):
            titles.append(line[5:])
    titles.append('*****')
    
    encoder = HuffmanEncoder(list(file_contents))

    for grimms_title in titles[:-1]:
        print(grimms_title, encoder_report(encoder, grimms_title, titles, lines))

To use the script, you need to download the .txt file from Project Gutenberg via this [1] and save it as grimms-fairy-tales.txt in the same folder as the sample script. Keep in mind that only the function huffman_encode function will be read when you submit your code.

Here is a selected list of titles from the file along with the expected output:

Title Number of bytes in the uncompressed text Number of bits in the encoded output
THE TRAVELLING MUSICIANS 6939 30819
CAT AND MOUSE IN PARTNERSHIP 5108 22940
RAPUNZEL 7280 32211
HANSEL AND GRETEL 15410 68016
THE STORY OF THE YOUTH WHO WENT FORTH TO LEARN WHAT FEAR WAS 19511 87221