Write a program that decompresses files that were compressed using Huffman encoding.
Concepts: Huffman trees, (priority queues, maps), bit operators.
Your program should take a filename as a command line argument. If no such argument is given, print a usage message explaining what the program does and how to use it.
If the filename ends with a ".huff" extension, decompress the file. Remove the .huff extension to get the output filename. For example, if you run your program as
java UsernameA10 homework.txt.huff

your program will decompress homework.txt.huff into homework.txt.
If the filename does not end with a ".huff" extension, what your program does is up to you (so long as it doesn't crash). You can compress the given file instead (using the code given below), or you can just print an error message and quit. If the file you are extracting to (such as homework.txt in the example above) already exists, it is also up to you whether to simply overwrite it or to ask the user if they would like to abort.
A valid .huff file will contain the following segments in this order:

1. a 4-byte count of how many bytes the original (uncompressed) file contained
2. the encoded Huffman tree
3. the encoded message bits, with extra 0 bits at the end to fill out the final byte
There are no extra bits or markers between the three file segments.
You may also find useful:
Up to you. You can print the Huffman tree used or other information if you want to, or you can run silently except on errors, in traditional command-line program style. The file to decompress must come from the command line arguments though. As long as that filename is a valid .huff file that will not overwrite another file, you must not ask the user for input.
Your UsernameA10.java file should contain your program's main method. Put this into a JAR along with all other .java files that your program needs to run. Upload your
UsernameA10.jar file to Tamarin.
Remember to follow the course coding standards on all assignments.
Partial credit: Provided that your program correctly decompresses the files, you need no other output. However, if your decompression is incorrect, you can still get partial credit for printing these details in your screen output:
While your job is to decompress the files, it may be handy to see how the compression was done. My source code is here. You can also use this program to compress your own sample files to test.
You are welcome to:
Other than the HuffmanNode class and the main method, most of it will not be directly reusable for your A10. If you want to follow my design, you will need to:
The extra HexFileDump program might be useful to view the contents of a small file as bit/byte values.
Consider the file short.txt.huff. When viewed with HexFileDump, you should see this:
00 00 00 05 50 94 3A 0E 98 ....P .:..
Each pair of digits here is a byte shown in hex. (The characters on the right are not actually significant or meaningful if the file contains a binary format and not ASCII.)
If we translate the hex to binary:
00000000 00000000 00000000 00000101 01010000 10010100 00111010 00001110 10011000
You can see that the first four bytes together contain the value 5. You don't need to be able to read single bits to read this part of the file; you can just use a java.io.FileInputStream to read 4 bytes and then use the shift operator to merge them into a single int variable. This variable tells us that short.txt originally contained only 5 bytes. (Since the max value for an int is about 2 billion or so, the current .huff file format can't handle encoded files bigger than 2 GB. We could have used a long instead, though.)
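As a sketch of that merging step (the class and method names here are my own invention, not part of the assignment code), you can mask each byte and shift it into position. The main method below uses the four header bytes from this example in place of a real FileInputStream, but the reads work the same way:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ByteCountDemo {
    /**
     * Merges 4 bytes (big-endian, in stream order) into one int.
     * The 0xFF masks strip the sign-extension bits discussed later
     * in this handout.
     */
    public static int mergeBytes(int b1, int b2, int b3, int b4) {
        return ((b1 & 0xFF) << 24) | ((b2 & 0xFF) << 16)
             | ((b3 & 0xFF) << 8) | (b4 & 0xFF);
    }

    public static void main(String[] args) throws IOException {
        // The first four bytes of short.txt.huff; with a real file you'd
        // pass a FileInputStream here instead.
        InputStream in = new ByteArrayInputStream(new byte[] {0, 0, 0, 5});
        int count = mergeBytes(in.read(), in.read(), in.read(), in.read());
        System.out.println(count); // 5
    }
}
```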
After reading in the byte count, the rest of the file is:
50 94 3A 0E 98
As binary, this is:
0101 0000 1001 0100 0011 1010 0000 1110 1001 1000
This is the encoded tree followed by the encoded message, but we don't know where one ends and the next starts until after we decode the tree.
You will first need an object that lets you read a bit at a time from a file stream. Java will only let you read or write a byte at a time, so you will need to buffer at least 1 byte worth of data. (See the BitWriter class for an example of this on the output side.) Test this object before you continue. For example, you should be able to print the binary sequence shown above from short.txt.huff.
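A minimal sketch of such an object might look like this. The BitReader name is my own choosing (the assignment only provides BitWriter); it buffers one byte and hands out its bits from most-significant to least-significant:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BitReader {
    private final InputStream in;
    private int buffer;       // the byte currently being consumed
    private int bitsLeft = 0; // unread bits remaining in buffer

    public BitReader(InputStream in) {
        this.in = in;
    }

    /** Returns the next bit (0 or 1), or -1 at end of stream. */
    public int readBit() throws IOException {
        if (bitsLeft == 0) {
            buffer = in.read();   // returns -1 at EOF
            if (buffer == -1) {
                return -1;
            }
            bitsLeft = 8;
        }
        bitsLeft--;
        return (buffer >> bitsLeft) & 0x01;  // mask off just the one bit
    }

    public static void main(String[] args) throws IOException {
        // 0x50 is the first byte after the count in short.txt.huff.
        BitReader r = new BitReader(new ByteArrayInputStream(new byte[] {(byte) 0x50}));
        for (int bit; (bit = r.readBit()) != -1; ) {
            System.out.print(bit);  // prints 01010000
        }
        System.out.println();
    }
}
```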
You can then follow the algorithm given in the slides. If you do this, you'll find that the tree data is as follows:
01[0100 0010]01[0100 0011]1[0100 0001]
= 01[42]01[43]1[41] //representing the binary data as hex
= 01[66]01[67]1[65] //representing the binary data as decimal values
= 01[B]01[C]1[A] //interpreting the binary data as ASCII values
which results in this tree (note that you don't have any counts when decoding):
* (x 0)
<66 'B' (x 0)
>* (x 0)
><67 'C' (x 0)
>>65 'A' (x 0)

which is to say:
    *
   / \
  B   *
     / \
    C   A
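The tree-reading step can be sketched as below, assuming the bit layout the example above implies (a 0 bit for an internal node whose two children follow, a 1 bit for a leaf followed by the leaf's byte value in the next 8 bits). All class and method names here are hypothetical, and the sketch assumes well-formed input:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class TreeDecoder {
    static class HuffmanNode {
        HuffmanNode left, right;
        byte value;
        boolean isLeaf;
    }

    private final InputStream in;
    private int buffer;
    private int bitsLeft = 0;

    TreeDecoder(InputStream in) { this.in = in; }

    int readBit() throws IOException {
        if (bitsLeft == 0) { buffer = in.read(); bitsLeft = 8; }
        bitsLeft--;
        return (buffer >> bitsLeft) & 0x01;
    }

    HuffmanNode readTree() throws IOException {
        HuffmanNode node = new HuffmanNode();
        if (readBit() == 0) {            // internal node: decode both children
            node.left = readTree();
            node.right = readTree();
        } else {                         // leaf: the next 8 bits are its value
            node.isLeaf = true;
            for (int i = 0; i < 8; i++) {
                node.value = (byte) ((node.value << 1) | readBit());
            }
        }
        return node;
    }

    public static void main(String[] args) throws IOException {
        // The bytes following the count in short.txt.huff:
        byte[] rest = {(byte) 0x50, (byte) 0x94, (byte) 0x3A, (byte) 0x0E, (byte) 0x98};
        TreeDecoder d = new TreeDecoder(new ByteArrayInputStream(rest));
        HuffmanNode root = d.readTree();
        System.out.println((char) root.left.value);         // B
        System.out.println((char) root.right.left.value);   // C
        System.out.println((char) root.right.right.value);  // A
    }
}
```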
The remaining bits in the file are now:
110 1001 1000
This is where we use the byte count we read in at the beginning. We're supposed to decode these bits by moving left or right from the root until we hit a leaf node. We then write out the value in that leaf node. But we stop as soon as we've printed 5 byte values. We ignore the extra 0 bits that finish the final byte of the .huff file. Therefore:
[11][0][10][0][11]000
= [65][66][67][66][65] //the decoded byte values in decimal
= [A][B][C][B][A] //human interpretation of the binary
Thus, we would write these 5 byte values into short.txt. Note that we're writing binary, so the file short.txt will then contain:
01000001 01000010 01000011 01000010 01000001
If viewed with HexFileDump, we would see this binary as:
41 42 43 42 41 ABCBA
And indeed, when you open short.txt with a text editor, it should appear as:
ABCBA. But, as with any file, it is still really just binary on your hard drive or in memory. Note that some of the files you will be decompressing won't be text files, so the data in your trees won't always correspond to characters as they did here.
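To tie the example together, here is a sketch of the decoding loop just described, using the tiny B/C/A tree and the leftover bits 110 1001 1000. All names are made up, and a real program would pull bits from the file stream rather than from a String of '0' and '1' characters:

```java
import java.io.ByteArrayOutputStream;

public class DecodeLoopDemo {
    static class Node {
        Node left, right;
        byte value;
        Node(byte v) { value = v; }                    // leaf
        Node(Node l, Node r) { left = l; right = r; }  // internal node
        boolean isLeaf() { return left == null; }
    }

    /** Decodes byteCount values from the bit string, ignoring padding. */
    static byte[] decode(Node root, String bits, int byteCount) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        for (int written = 0; written < byteCount; written++) {
            Node node = root;
            while (!node.isLeaf()) {  // left on 0, right on 1, until a leaf
                node = (bits.charAt(i++) == '0') ? node.left : node.right;
            }
            out.write(node.value);    // stopping after byteCount values
        }                             // means the padding 0s are never read
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Node root = new Node(new Node((byte) 'B'),
                             new Node(new Node((byte) 'C'), new Node((byte) 'A')));
        System.out.println(new String(decode(root, "11010011000", 5))); // ABCBA
    }
}
```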
In conclusion, the sub-problems you'll want to tackle are:
- reading a file one bit at a time
- reading the 4-byte original-file byte count
- decoding the Huffman tree
- decoding the message bits and writing the decoded bytes to the output file
Whenever you use a bit operator, it first converts any byte operands into ints. If you are not aware of this, it can have some unexpected consequences. Consider:
byte buffer = (byte) 0xFF; //or: 0b1111_1111;
System.out.printf("buffer as hex: %X\n", buffer); // FF (all 1s)
System.out.println("buffer as decimal: " + buffer); // -1
The decimal value is -1 because of the 2s-complement format. If we add 1 to 11111111, we get (1)00000000 (the 1 overflow gets dropped), which is 0.
Now, if we upgrade this value to an int, we don't get 0x000000FF (all 0s except for the last 8 bits, which are all 1s). Instead, we get 0xFFFFFFFF. This is 32 1 bits, which has the value -1 for an int (by the same reasoning as above). So, if we're interested in the integer value, this conversion from byte to int makes total sense: it's the same -1 value in each representation. On the other hand, if you're thinking in terms of bits, suddenly you've got 24 extra 1 bits you may not have expected.
int asInt = buffer;
System.out.printf("as hex int: %X\n", asInt); // FFFFFFFF (all 1s)
System.out.println("as decimal int: " + asInt); // -1
So now consider shifting. If you take a byte 11111111 and shift it 7 digits to the right, you may expect to have 00000001:
int bit = buffer >> 7;
System.out.println("bit: " + bit); // -1
But that's not what you get. The >> is "signed shift", which means the left-most bit gets repeated when you shift. So you're shifting in 1s, not 0s, from the left, which means your value is still all 1s after the shift.
Let's fix this by using the unsigned shift operator >>>, which will guarantee we shift 0s in from the left:
bit = buffer >>> 7;
System.out.println("bit: " + bit); // 33554431
Huh? This is because the >>> upgraded the buffer to an int before the shift:

11111111 11111111 11111111 11111111

>>> by 7 bits is:

00000001 11111111 11111111 11111111

which is 33554431 in decimal. Casting this back to a byte won't help because you'll only get the 8 right-most bits: all 1s, which is what you started with.
Hopefully knowing how this works will help you make sense of any bugs you run into. If you're only trying to grab one bit out of a byte, masking will help you by converting everything you don't care about to 0s. For example:
bit = buffer >> 7;
bit = bit & 0x01; //mask with all 0s except for the last bit
System.out.println("bit: " + bit); // 1
If the goal is just to grab that single left-most bit from the examples above, this will work with either >> or >>>.
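The shift-and-mask pair generalizes to grabbing any bit. A small sketch (the helper name is made up):

```java
public class BitDemo {
    /** Returns bit i of b (i = 0 is the least-significant bit). */
    static int getBit(byte b, int i) {
        return (b >> i) & 0x01;  // the mask strips any sign-extended 1s
    }

    public static void main(String[] args) {
        System.out.println(getBit((byte) 0xFF, 7)); // 1
        System.out.println(getBit((byte) 0x50, 7)); // 0 (0x50 is 01010000)
        System.out.println(getBit((byte) 0x50, 6)); // 1
    }
}
```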
For those that read a leaf's 8 value bits into a String and then parse it: it's a valid approach, but you'll probably run into problems when you try to decode image.bmp.huff. The text files contain only ASCII values, which are between 0 and 127. However, binary files may have byte values like this: 11100110. This series of bits is equivalent to 0xE6 in hex or 230 in decimal. However, as an 8-bit byte, where the first bit is effectively the sign bit, this same sequence represents -26. (In an int variable, it would be preceded by 24 zeros, so it would equal only 230.)
The source of your problem is that, when you try to use the parseByte or valueOf methods in the Byte class, it will first convert 11100110 to its positive numerical value, which is 230. But a byte can only store values between -128 and 127, so you get:
NumberFormatException: Value out of range. Value:"11100110" Radix:2
Probably the easiest fix for this is to use the corresponding method in the Integer class, which can handle the full range of possible unsigned byte values (0 to 255). Then just cast the value to a byte. For example:
String s = "11100110";
int n = Integer.valueOf(s, 2); // n == 230
byte b = (byte) n; // b == -26
System.out.printf("%X\n", b); // b also == 0xE6, which == 0b11100110
If you just want to use a System.out.print statement, check out the static toBinaryString and toHexString methods found in the Integer class.
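For example (note that sign extension strikes again when printing a byte's bits, since the byte widens to an int first):

```java
public class PrintBitsDemo {
    public static void main(String[] args) {
        int n = 230;
        System.out.println(Integer.toBinaryString(n));        // 11100110
        System.out.println(Integer.toHexString(n));           // e6
        byte b = (byte) n;                                    // -26
        System.out.println(Integer.toBinaryString(b));        // 32 bits, mostly 1s
        System.out.println(Integer.toBinaryString(b & 0xFF)); // 11100110 again
    }
}
```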
Check that you are really writing raw bytes to the output stream. For this to work, you must use FileOutputStream and BufferedOutputStream. If you use FileWriter and PrintWriter instead, you'll be writing characters instead of bytes. This will cause Java to perform some incorrect conversions of your byte values.
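A minimal sketch of writing raw bytes with a byte stream (the filename and byte values here are just the ones from the short.txt example):

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class RawWriteDemo {
    public static void main(String[] args) throws IOException {
        byte[] decoded = {0x41, 0x42, 0x43, 0x42, 0x41}; // ABCBA from the example
        // A byte stream writes these values verbatim; FileWriter/PrintWriter
        // would run them through a character encoding instead.
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("short.txt"))) {
            out.write(decoded);
        }
    }
}
```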