161-A2.3
The exercise below will appear in your submission bin after passing Exercise 2.1. (Prefix-free codes).
Exercise 2.2. Huffman encoding (5 points)
If we can model our information source as discrete and memoryless, Huffman codes will provide a prefix-free encoding that has the minimum expected length per symbol. In practice, we do not have access to the underlying probability distribution of our information source, so we use the frequency counts as a proxy. For instance, if we have a coin-flipping machine as an information source and we observe 201 heads and 99 tails after 300 observations, then we estimate the probability of heads to be 201/300. We can similarly perform such an estimate for other information sources with larger alphabets.
The Huffman code can be visualized as a binary tree where each symbol is located at a leaf. The path from the root node to a leaf node corresponds to a binary encoding of one symbol. To produce a Huffman code, we need a sequence of samples to observe their relative frequencies. Construction then proceeds in a greedy fashion wherein more probable symbols tend to have shorter encodings. In this exercise, you are to use a Huffman encoder to convert a sequence of samples into a bit string.
Task description
In this task, you need to write a function huffman_encode
which will take a HuffmanEncoder
object and a list of strings, and return a binary string consisting of '0'
s and '1'
s which corresponds to an encoding of samples. The list of strings may contain strings with more than one character, and each string is guaranteed to have some encoding using the Huffman encoder.
Although the Huffman encoder is constructed in the header script for this exercise, we still encourage you to understand how it works. Note that in the interest of conciseness and readability, the script below may not be the fastest way to implement Huffman encoders.
Header |
class Node:
def __init__(self, data, weight):
self.left = None
self.right = None
self.data = data
self.weight = weight
class HuffmanEncoder:
def __init__(self, samples):
alphabet = set(samples)
alphabet_size = len(alphabet)
nodes = [Node({symb}, samples.count(symb)) for symb in alphabet]
for it in range(1, len(alphabet)):
a,b = sorted(range(len(nodes)), key = lambda idx : nodes[idx].weight)[:2]
supernode = Node(nodes[a].data.union(nodes[b].data), nodes[a].weight + nodes[b].weight)
supernode.left = nodes[a]
supernode.right = nodes[b]
nodes = [nodes[idx] for idx in range(len(nodes)) if idx not in (a,b)]
nodes.append(supernode)
self.root = nodes[0]
def encode(self, samples):
return huffman_encode(self, samples)
|
Student submission |
def huffman_encode(encoder, samples):
# Your code goes here
# More code..
return ans
|
Main test script |
# Main test script goes here, and will not be visible to students.
# The script will test if the submitted code does the prescribed functionality.
# For successfully validated scripts, test results will be sent via email.
|
Sample input
This time, we will use an entire script to test your code. The code below reads a text file containing Grimms' Fairy Tales from Project Gutenberg. It is a collection of stories gathered by the Brothers Grimm, who studied 19th century German folklore. (It is worth a read if only to realize how the source material differs from the fairy tales you came to know about growing up.) The script first reads the entire file to construct a Huffman encoder from the characters found in the file, and then applies the encoder to all titles in the collection. A helper function reports the encoder output as a pair corresponding to (1) the number of bytes in the original (uncompressed) text, and (2) the number of bits produced by the Huffman encoder.
import re
def encoder_report(encoder, grimms_title, titles, lines):
title_idx = titles.index(grimms_title)
line_start = lines.index(grimms_title)
line_end = lines.index(titles[title_idx + 1])
orig_text = '\n'.join(lines[line_start:line_end])
bitstring = huffman_encode(encoder, list(orig_text))
return len(orig_text), len(bitstring)
if __name__ == '__main__':
regex = re.compile("^( ){5}[A-Z',\\-\\[\\] ]+$")
titles = []
file_contents = open('grimms-fairy-tales.txt').read().replace('\r','')
lines = file_contents.split('\n')
for i in range(len(lines)):
line = lines[i]
if regex.match(line):
titles.append(line[5:])
titles.append('*****')
encoder = HuffmanEncoder(list(file_contents))
for grimms_title in titles[:-1]:
print(grimms_title, encoder_report(encoder, grimms_title, titles, lines))
To use the script, you need to download the .txt file from Project Gutenberg via this link and save it as grimms-fairy-tales.txt
in the same folder as the sample script. Keep in mind that only the function huffman_encode
function will be read when you submit your code.
Here is a selected list of titles from the file along with the expected output:
Title | Number of bytes in the uncompressed text | Number of bits in the encoded output |
THE TRAVELLING MUSICIANS
|
6939
|
30819
|
CAT AND MOUSE IN PARTNERSHIP
|
5108
|
22940
|
RAPUNZEL
|
7280
|
32211
|
HANSEL AND GRETEL
|
15410
|
68016
|
THE STORY OF THE YOUTH WHO WENT FORTH TO LEARN WHAT FEAR WAS
|
19511
|
87221
|