Implement and then benchmark a couple different sorts under a variety of conditions.
Text: 8.2, 8.7, Chapter 8
Concepts: sorting, big-O, micro-benchmarking.
In a class named UsernameA09
, write these two static sort methods:
public static <E extends Comparable<E>> void insertionSort(List<E> list)
public static <E extends Comparable<E>> void mergeSort(List<E> list)
Note that we are sorting java.util.List objects here, rather than arrays. When writing your sorts, you can assume these lists are random access (ArrayLists) in terms of performance. For example, you do not need to use an iterator and you can assume that a get
method call is an O(1) operation.
You should implement a standard insertion sort algorithm.
Your merge sort should use the simple optimization of skipping unnecessary merges. Basically, add an if statement near the start of the merge: if the last element of the left array (at index endLeft
) is less than or equal to the first element of the right array (at startRight
), then the two arrays are already sorted relative to each other and the rest of the merge can be skipped/aborted.
For extra credit, you can also write:
public static <E extends Comparable<E>> void otherSort(List<E> list)
For otherSort, you can choose whatever comparison sort you want that is O(n2) or better. (O(n lg n) is better than O(n2). I recommend quicksort for the practice, but bubble, selection, or any interesting sort you find on Wikipedia or elsewhere is fine too. Name your method otherSort so that Tamarin can invoke it, but document in your javadoc comments what sort you are actually implementing and any optimization choices you made.
Here is a program to generate some unsorted data: FileGenerator.java
See the javadoc documentation on how to run the program. You'll need to specify the number of integers you want in the file as a command line argument. Run the program with 1000, 10000, and 100000 to produce the files 1000.txt, 10000.txt, and 100000.txt. Since it's a little hard to quickly count that many 0s, I'll refer to these as your 1K, 10K, and 100K test cases.
Make sure your sort methods are correct. How you do this is up to you, but make sure you don't test all 3 sorts on the same list object.
Now time each of your sorts, in milliseconds, for lists of 1K, 10K, and 100K integers. As a control measure, also time the java.util.Collections.sort
method. To time a method, do something like this:
long sortTime = System.currentTimeMillis(); //start time insertionSort(listCopy); sortTime = System.currentTimeMillis() - sortTime; //duration = end - start time System.out.println("Insertion sort time: " + sortTime);
You will find that benchmarking is not a very precise science. If you run your program multiple times, you'll probably see that your times can vary about 20% either way. Run each test at least 3 or 4 times and take an average. (You can do this formally with a loop in your program, or you can just run your program a few times and eyeball it.)
Time each method for each of the 1K, 10K, and 100K data sets in each of these cases:
If a single call to a sort method takes more than a few minutes to complete, you can skip recording the time for that entry. You do not have produce all of these values with a single run of your program. That would certainly be acceptable, but you can also modify your program and run it again to handle the different cases.
Log into Laulima and go to the new Wiki section. Further instructions will be on the wiki page itself. We'll collect everyone's times there and then discuss any trends after the assignment deadline.
To speed up the edit process and so hopefully avoid any edit conflicts, you'll need your times in the following format on one line: UH username, 3 times for insertion sort, 3 times for merge sort, 3 times for Collections.sort, and 3 times for your other sort if you have one. All of these elements should be separated by a | character. You'll have 1 line for each of the four conditions. An example:
Username | i-1K | i-10K | i-100k | m-1k | m-10k | m-100k | C-1K | C-10K | C-100k | o-1K | o-10k | o-100k #Unordered integers: ztomasze| 25 | 230 | 10900 | 12 | 60 | 95 | 5 | 50 | 105 | - | - | - #Unordered strings: ztomasze| 30 | 630 | 140500| 4 | 40 | 170 | 4 | 35 | 170 | - | - | - #Sorted integers: ztomasze| 1 | 9 | 13 | 1 | 9 | 30 | 1 | 9 | 15 | - | - | - #LinkedList of integers: ztomasze| 340 | 343365 | + | 19 | 1980 | 274976 | 3 | 30 | 105 | - | - | -
Having your data already in this format means you can just paste one new row into each of the 4 wiki tables. You'll also be asked for your username, java version, OS, CPU speed (type optional), and specifically what other sort you implemented (if any):
ztomasze | 1.7.0_06 | Windows 7 | 2.4Ghz (Intel i5) | -
Tamarin will test your 2 (or 3) static methods directly for correctness. It won't time them, and it won't care what you did in your main method.
Up to you.
Upload your UsernameA09.java
file to Tamarin. (No jar files this time.)
Remember to follow the course coding standards on all assignments.
I recommend you add a main method to your class that reads in a file of integers. A quick way to do this is specify the filename as a command line argument. You can use a Scanner's hasNextInt() and nextInt() methods to read the numbers from the file as ints, or use hasNextLine() and nextLine() to read them as Strings. You should store them in an an ArrayList. Pass a copy of this List to your sort methods and then print the sorted list out afterwards to make sure the sort worked. You'll probably want to test it with a small file of 20 or 30 elements.
Note that you can't pass the same list to all of your sort methods because it will already have been sorted by the first sort you test. One way to avoid this is to re-read the contents from the file before each sort test, but a better approach is just to make a copy of the unsorted list. Remember that every Collection in Java has a constructor that takes another Collection, so you can do something like this:
List<Integer> unsorted = new ArrayList<Integer>(); // ...load unsorted list with file contents... List<Integer> listCopy = new ArrayList<Integer>(unsorted); insertionSort(listCopy); System.out.println("After insertion sort: " + listCopy); // although listCopy is now sorted, original unsorted list is still unsorted, // so you can make another copy of it to test mergeSort
Benchmarking means running a program to evaluate the real-world performance of either the code or the underlying hardware. What we're doing here is micro-benchmarking: testing the performance of just a small section of code.
Benchmarking code is fraught with peril. There are so many many factors that can affect the time you get: your hardware, your operating system, the possible precision of your timing mechanism, what else your OS was doing while running the test, what the compiler did to your code (where small implementation differences of the same algorithm can make big differences), what the JVM does while running your code (whether the garbage collector runs, whether the JVM pauses to compile part of your code into machine code for faster performance), etc.
The general advice on benchmarking is: don't do it (without very good reason). If you do it, don't put too much stock in it. (The first reading below is actually from the documentation of a Java microbenchmarking framework, and they say the same thing.)
So if the numbers we're getting are not very useful, why are we bothering? Because, while the "error rate" is very high in our measurements, we should still be able to see some significant trends as predicted by big-O. For example, a particular benchmark time might vary by 30% or more on the same machine, but if a different test takes 100x as long, that much larger difference will still be apparent despite the high error rate.
More reading: Java Microbenchmarks, Writing a Java Micro-Benchmark
For example, consider this chunk of code from array-based bubbleSort:
for (int i = 0; i < array.length - 1; i++) { if (array[i].compareTo(array[i+1]) > 0) { //swap the two E temp = array[i]; array[i] = array[i + 1]; array[i + 1] = temp; swapped = true; } }This can be easily converted to using List methods, like so:
for (int i = 0; i < list.size() - 1; i++) { if (list.get(i).compareTo(list.get(i+1)) > 0) { //swap the two E temp = list.get(i); list.set(i, list.get(i + 1)); list.set(i + 1, temp); swapped = true; } }
Note that we could have performed the swap by using the remove and add methods, but that would entail an O(n) shift for the remove and then again for the reinsertion. If you're assuming the List is an ArrayList, then the get/set is more efficient.
If you want to, you can write your sorts using iterators in places where you think it will improve performance. The translation from array-based sorting code will be a little trickier if you do so, though.
List<E> aux = new ArrayList<E>(list);
Just leaving it as array is okay too, though.
Try to keep your edit times as short as possible to avoid trashing someone else's data. There's also a history record of all changes made to the document, so even if your times do get overwritten by someone else's changes, we will know that you did at least submit them before the deadline. So don't stress about your grade being affected by this. However, it would help us out if you could check back after the deadline and reinsert any missing sort-time data after the editing frenzy has died down.
Also, people aren't really keeping the table rows sorted by username. If you noticed this, don't worry about it. I'll fix it later.