Assignment 10

Task

Write a program that decompresses files that were compressed using Huffman encoding.

Text: 6.5
Concepts: Huffman trees, (priority queues, maps), bit operators.

Requirements

Your program should take a filename as a command line argument. If no such argument is given, print a usage message explaining what the program does and how to use it.

If the filename ends with a ".huff" extension, decompress the file. Remove the .huff extension to get the output filename. For example, if you run your program as

  java UsernameA10 homework.txt.huff

your program will decompress homework.txt.huff into homework.txt.

If the filename does not end with a ".huff" extension, what your program does is up to you (so long as it doesn't crash). You can compress the given file instead (using the code given below), or you can just print an error message and quit. If the file you are extracting to (such as homework.txt in the example above) already exists, it is also up to you whether to simply overwrite it or to ask the user if they would like to abort.

A valid .huff file will contain the following segments in this order:

count: 4 bytes (an int) containing a count of the number of bytes in the original uncompressed file.
tree data: A pre-order traversal of the Huffman tree used, with 0 marking internal nodes and 1 marking leaf nodes. Every leaf node (1) is followed by 8 bits containing the byte of data for that leaf node. Every tree will have at least one internal node at the root.
traversal data: A series of paths that each start at the root and lead to a leaf node that specifies the corresponding symbol in the original file. A 0 bit means go left, a 1 bit means go right. The valid traversal data ends when count symbols have been decoded. There may be extra bits remaining to complete the last byte of the file.

There are no extra bits or markers between the three file segments.

You may also find useful:

some sample compressed files that you should be able to decompress.
some helpful source code.

Sample Output

Up to you. You can print the Huffman tree used or other information if you want to, or you can run silently except on errors, in traditional commmand-line program style. The file to compress must come from the command line arguments though. As long as that filename is a valid .huff file that will not overwrite another file, you must not ask the user for input.

What to Submit

Your UsernameA10.java file should contain your program's main method. Put this into a JAR along with all other .java files that your program needs to run. Upload your UsernameA10.jar file to Tamarin.

Remember to follow the course coding standards on all assignments.

Grading [80 points]

5 - Compiles

75 - Correctly decompresses

When given a .huff file as a command line argument. Your program should be able to do this quickly for the small provided sample files, which are all less than 50KB.

Partial credit: Provided that your program correctly decompresses the files, you need no other output. However, if your decompression is incorrect, you can still get partial credit for printing these details in your screen output:

The byte count of the original file (5 points)
Up to the first 64 bits of the tree data, printed as 0s and 1s with no spaces (20 points), OR
the correct decoded tree in the same toFullString() format as used in A09 (also see provided code above) (40 points).

Compression Code Provided

While your job is to decompress the files, it may be handy to see how the compression was done. My source code is here. You can also use this program to compress your own sample files to test.

You are welcome to:

completely ignore my code and write your decompression program from scratch in your own style
write your own code, but plunder mine for ideas or code snippets that you find useful, or
build on my code and include it in your finished program so that your program can both compress and decompress files.

Other than the HuffmanNode class and the main method, most of it will not be directly reusable for your A10. If you want to follow my design, you will need to:

write a BitReader class that lets you read 1 bit at a time from a file
implement the three unfinished methods that I left in Huffman.java, all marked with //TODO.
Rename ZtomaszeA10.java

The extra HexFileDump program might be useful to view the contents of a small file as bit/byte values.

Discussion

Decoding Walkthrough

Your program needs to read the size of the original file, decompress the encoded tree, and then decode the message from the remaining bits. It will help to walk through this on a known file so you can double-check the correctness of each of your steps as you implement them.

Consider the file short.txt.huff. When viewed with HexFileDump, you should see this:

00 00 00 05 50  94 3A 0E 98     ....P .:..

Each pair of digits here is a byte shown in hex. (The characters on the right are not actually significant or meaningful if the file contains a binary format and not ASCII.)

If we translate the hex to binary:

00000000 00000000 00000000 00000101 01010000  10010100 00111010 00001110 10011000

You can see that the first four bytes together contain the value 5. You don't need to be able to read single bits to read this part of the file; you can just use a java.io.FileInputStream to read 4 bytes and then use the shift operator to merge them into a single int variable. This variable tells us that short.txt originally contained only 5 bytes. (Since the max value for an int is about 2 billion or so, the current .huff file format can't handle encoded files bigger than 2 GB. We could have used a long instead, though.)

After reading in the byte count, the rest of the file is:

50 94 3A 0E 98

As binary, this is:

0101 0000  1001 0100  0011 1010  0000 1110  1001 1000

This is the encoded tree followed by the encoded message, but we don't know where one ends and the next starts until after we decode the tree.

You will first need an object that lets you read a bit at a time from a file stream. Java will only let you read or write a byte at a time, so you will need to buffer at least 1 byte worth of data. (See the BitWriter class of an example of this for output.) Test this object before you continue. For example, you should be able to print the binary sequence shown above from short.txt.huff.

You can then follow the algorithm given in the slides. If you do this, you'll find that the tree data is as follows:

01[01 0000 10]01 [0100 0011] 1[010 0000 1]

= 01[42]01[43]1[41]  //representing the binary data as hex
= 01[66]01[67]1[65]  //representing the binary data as decimal values
= 01[B]01[C]1[A]     //interpreting the binary data as ASCII values

which results in this tree (note that you don't have any counts when decoding):

|* (x 0)
|<66 'B' (x 0)
|>* (x 0)
|><67 'C' (x 0)
|>>65 'A' (x 0)

which is to say:

   *
  / \
 B   *
    / \
   C   A

The remaining bits in the file are now:

 110 1001 1000

This is where we use the byte count we read in at the beginning. We're supposed to decode these bits by moving left or right from the root until we hit a leaf node. We then write out the value in that leaf node. But we stop as soon as we've printed 5 byte values. We ignore the extra 0 bits that finish the final byte of the .huff file. Therefore:

[11][0] [10][0][1 1]000

= [01000001][01000010][01000011][01000010][01000001]
= [A][B][C][B][A]  //human interpretation of the binary

Thus, we would write these 5 byte values into short.txt. Note that we're writing binary, so the file short.txt will then contain:

01000001 01000010 01000011 01000010 01000001

If viewed with HexFileDump, we would see this binary as:

41 42 43 42 41                  ABCBA

And indeed, when you open short.txt with a text editor, it should appear as: ABCBA. But, as with any file, it is still really just binary on your hard drive or in memory. Note that some of the files you will be decompressing won't be text files, so the data in your trees won't always correspond to characters as they did here.

In conclusion, the sub-problems you'll want to tackle are:

Read in the encoded byte count. Print it out to verify it is correct.
Read in the stream as bits and print them. Verify that these bits really correspond to the hex view of the file's contents.
Now, instead of printing the bits, use them to decode the tree. Print the tree. Make sure it matches the one above.
Use the tree to decode the rest of the file. Print or write the contents into a file. Confirm that the output is correct.
Use your program to decode all the sample files and confirm the output is correct. (You can use the provided hashes to be certain of this.) Or use the provided code to encode your own sample files, use your program to decode them, and make sure they are the same as what you started with.

Java bit-twiddling gotcha: You're always working with ints.

Whenever you use a bit operator, it first converts any byte operands into ints. If you are not aware of this, it can have some unexpected consequences. Consider:

    byte buffer = (byte) 0xFF;  //or: 0b1111_1111;
    System.out.printf("buffer as hex: %X\n", buffer);  // FF (all 1s)
    System.out.println("buffer as decimal: " + buffer); //-1

The decimal value is -1 because of the 2s-complement format. If we add 1 to 11111111, we get (1)00000000 (the 1 overflow gets dropped), which is 0.

Now, if we upgrade this value to an int, don't get 0x000000FF (all 0s except for the last 8 bits, which are all 1s). Instead, we get 0xFFFFFFFF. This is 32 1 bits, which has the value -1 for an int (by the same reasoning as above). So, if we're interested in the integer value, this conversion from byte to int makes total sense: it's the same -1 value in each representation. On the other hand, if you're thinking in terms of bits, suddenly you've got 24 extra 1 bits you may not have expected.

    int asInt = buffer;
    System.out.printf("as hex int: %X\n", asInt);    //FFFFFFFF (all 1s)
    System.out.println("as decimal int: " + asInt);  // -1

So now consider shifting. If you take a byte 11111111 and shift it 7 digits to the right, you may expect to have 00000001:

    int bit = buffer >> 7;
    System.out.println("bit: " + bit);  // -1

But that's not what you get. The >> is "signed shift", which means the left-most bit gets repeated when you shift. So you're shifting in 1s, not 0s, from the left, which means your value is still all 1s after the shift.

Lets fix this by using the unsigned shift operator >>> which will guarantee we shift 0s in from the left:

    bit = buffer >>> 7;
    System.out.println("bit: " + bit);  // 33554431

Huh? This is because the >>> upgraded the buffer to an int before the shift: 11111111 11111111 11111111 11111111 >>> by 7 bits is: 00000001 11111111 11111111 11111111, which is 33554431 in decimal. Casting this back to a byte won't help because you'll only get the 8 right-most bits: all 1s, which is what you started with.

Hopefully knowning how this works will help you make sense of any bugs you run into. If you're only trying to grab one bit out of a byte, masking will help you by converting everything you don't care about to 0s. For example:

    bit = buffer >> 7;
    bit = bit & 0x01;  //mask with all 0s except for last bit
    System.out.println("bit: " + bit);  // 1

If my goal was to be able to print only the bold bit marked in the examples above, this will work with either >> or >>>.

FAQs

I converted all the bits to 0s and 1s and made a String out of them. Now I'm trying to convert 8 of them back to a byte using the Byte class, but I'm getting a NumberFormatException.

First of all, for those that didn't do it this way: It is possible to do the decoding without using Strings at all. In fact, using Strings can be fairly inefficient.

For those that did it this way: It's a valid approach, but you'll probably run into problems when you try to decode image.bmp.huff. The text files contain only ASCII values, which have values between 0 and 127. However, binary files may have byte values like this: 11100110. This series of bits is equivalent to 0xE6 in hex or 230 in decimal. However, as an 8-bit byte, where the first bit is effectively the sign bit, this sequence is also used to represent -26. (However, in an int variable, it would be preceded by 24 zeros and so it would only be equal to 230.)

The source of your problem is that, when you try to use the parseByte or valueOf methods in the Byte class, it will first convert to 11100110 to its positive numerical value, which is 230. But a byte can only store values between -128 and 127, so you get:

  NumberFormatException: Value out of range. Value:"11100110" Radix:2

Probably the easiest fix for this is to use the the corresponding method in the Integer class, which can handle the full range of possible unsigned byte values (0 to 255). Then just cast the value to a byte. For example:

  String s = "11100110";
  int n = Integer.valueOf(s, 2);  // n == 230
  byte b = (byte) n;              // b == -26
  System.out.printf("%X\n", b);   // b also == 0xE6, which == 0b11100110

Instead of using only ints and bytes, I used a String-based approach. However, my decompression takes quite a while to complete, especially with the larger files.

Using Strings to represent 1 and 0s as chars means you're taking up 16x the memory that you need to. If you then also build those Strings by appending 1 char at a time, each String will need to be copied over with each concatenation. To build a bit sequence of length n, this becomes an O(n^2) algorithm. At the very least, use a StringBuilder to get back to (almost) O(n) complexity.

It's hard to debug this binary stuff. Any tips?

If you're using a debugger, check to see if you can change the display format of the variables that you are watching. For example, in JGrasp you can right click on one of the variables and change the view to numeric, which will give you decimal, hex, and binary views of that variable.

If you just want to use a System.out.print statement, check out the static toBinaryString and toHexString methods found in the Integer class.

Tamarin says my decompressed image file is off in the 4th byte, or something like that. And the image.bmp.huff decoded to some weird useless greenish image for me. But all the character files work fine.

The original image.bmp is mostly white with 3 colored digits, so that's what you should see.

Check that you are really writing raw bytes to the output stream. For this to work, you must use FileOutputStream and BufferedOutputStream. If you use FileWriter and PrintWriter instead, you'll be writing characters instead of bytes. This will cause Java to perform some incorrect conversions of your byte values.