Assignment 08

Task

Implement and then benchmark a couple different sorts under a variety of conditions.

Text: 8.2, 8.7, Chapter 8
Concepts: sorting, big-O, micro-benchmarking.

Steps

Step 1: Write 3 sorts

In a class named UsernameA08, write these three static sort methods:

public static <E extends Comparable<E>> void insertionSort(List<E> list)
public static <E extends Comparable<E>> void quickSort(List<E> list)
public static <E extends Comparable<E>> void otherSort(List<E> list)

Note that we are sorting java.util.List objects here, rather than arrays. When writing your sorts, you can assume these lists are random access (ArrayLists) in terms of performance. For example, you do not need to use an iterator and you can assume that a get method call is an O(1) operation.

You should implement a standard insertion sort algorithm.

Your quickSort should select a random pivot, as in the lecture slides.

For otherSort, you can choose whatever comparison sort you want that is O(n²) or better. Remember that O(n lg n) is better than O(n²). This can be one of the sorts covered in lecture, or, for extra credit, a different interesting sort that you find on Wikipedia or elsewhere. You could also do quickSort again with a different pivot selection algorithm. Name your method otherSort so that Tamarin can invoke it, but document in your javadoc comments what sort you are actually implementing and any optimization choices you made.

Step 2: Generate some test files

Here is a program to generate some unsorted data: FileGenerator.java

See the javadoc documentation on how to run the program. You'll need to specify the number of integers you want in the file as a command line argument. Run the program with 1000, 10000, and 100000 to produce the files 1000.txt, 10000.txt, and 100000.txt. Since it's a little hard to quickly count that many 0s, I'll refer to these as your 1K, 10K, and 100K test cases.

Step 3: Test your 3 sorts

Make sure your sort methods are correct. How you do this is up to you, but make sure you don't test all 3 sorts on the same list object.

Step 4: Time your sorts, plus one

Now time each of your sorts, in milliseconds, for lists of 1K, 10K, and 100K integers. As a control measure, also time the java.util.Collections.sort method. To time a method, do something like this:

  long sortTime = System.currentTimeMillis();  //start time
  insertionSort(listCopy);
  sortTime = System.currentTimeMillis() - sortTime;  //duration = end - start time
  System.out.println("Insertion sort time: " + sortTime);

You will find that benchmarking is not a very precise science. If you run your program multiple times, you'll probably see that your times can vary about 20% either way. Run each test at least 3 or 4 times and take an average. (You can do this formally with a loop in your program, or you can just run your program a few times and eyeball it.)

Time each method for each of the 1K, 10K, and 100K data sets in each of these cases:

an unordered ArrayList<Integer>
an unordered ArrayList<String>
an already-sorted ArrayList<Integer>
an unordered LinkedList<Integer>

If a single call to a sort method takes more than a few minutes to complete, you can skip recording the time for that entry. You do not have produce all of these values with a single run of your program. That would certainly be acceptable, but you can also modify your program and run it again to handle the different cases. For example, you could read in the file as integers, get the resulting times, and then change your program to read the file in as strings.

Step 5: Report your times

Log into Laulima and go to the new Wiki section. Further instructions will be on the wiki page itself. We'll collect everyone's times there and then discuss any trends after the assignment deadline.

To speed up the edit process and so hopefully avoid edit conflicts, you'll need your times in the following format on one line: UH username, 3 times for insertion sort, 3 times for quick sort, 3 times for other sort, and 3 times for for Collections.sort. All of these elements should be separated by a | character. You'll have 1 line for each of the four conditions. An example:

Username | i-1K | i-10K | i-100K | q-1K | q-10K | q-100K | o-1K | o-10K | o-100K | C-1K | C-10K | C-100k
#Unordered integers:
ztomasze| 25 | 230 | 10900 | 6 | 75 | 50 | 12 | 60 | 95 | 5 | 50 | 105
#Unordered strings:
ztomasze| 30 | 630 | 140500| 10 | 77 | 55 | 4 | 40 | 170 | 4 | 35 | 170
#Sorted integers:
ztomasze| 1 | 9 | 13 | 9 | 58 | 48 | 1 | 9 | 30 | 1 | 9 | 15
#LinkedList of integers:
ztomasze| 340 | 343365 | + | 30 | 2602 | 295548 | 19 | 1980 | 274976 | 3 | 30 | 105

Having your data already in this format means you can just paste one new row into each of the 4 wiki tables. You'll also be asked for your username, java version, OS, CPU speed (type optional), and specifically what other sort you implemented (if any):

ztomasze | 1.7.0_06 | Windows 7 | 2.4Ghz (Intel i5) | merge sort

Step 6: Submit your code

Tamarin will test your 3 static methods directly for correctness. It won't time them, and it won't care what you did in your main method.

Sample Output

Up to you.

What to Submit

Upload your UsernameA08.java file to Tamarin. (No jar files this time.)

Remember to follow the course coding standards on all assignments.

Grading [50 points]

5 - Compiles
30 - 3 requested static sort methods
15 - Reported times to the Laulima wiki
+ - Novel otherSort method: If you implement one of the sorts covered in lecture, (selection, bubble, merge, or modified quick), you get no EC. Heapsort is worth +2 EC. If you implement a different sort, you can get +5 EC. (No Bogosort please.)

Discussion

Testing your sorts

I recommend you add a main method to your class that reads in a file of integers. A quick way to do this is specify the filename as a command line argument. You can use a Scanner's hasNextInt() and nextInt() methods to read the numbers from the file as ints, or use hasNextLine() and nextLine() to read them as Strings. You should store them in an an ArrayList. Pass a copy of this List to your sort methods and then print the sorted list out afterwards to make sure the sort worked. You'll probably want to test it with a small file of 20 or 30 elements.

Note that you can't pass the same list to all of your sort methods because it will already have been sorted by the first sort you test. One way to avoid this is to re-read the contents from the file before each sort test, but a better approach is just to make a copy of the unsorted list. Remember that every Collection in Java has a constructor that takes another Collection, so you can do something like this:

  List<Integer> unsorted = new ArrayList<Integer>();
  // ...load unsorted list with file contents...
  List<Integer> listCopy = new ArrayList<Integer>(unsorted);
  insertionSort(listCopy);
  System.out.println("After insertion sort: " + listCopy);
  // although listCopy is now sorted, original unsorted list is still unsorted,
  // so you can make another copy of it to test quickSort

Benchmarking

Benchmarking means running a program to evaluate the real-world performance of either the code or the underlying hardware. What we're doing here is micro-benchmarking: testing the performance of just a small section of code.

Benchmarking code is fraught with peril. There are many many factors that can affect the time you get: your hardware, your operating system, the possible precision of your timing mechanism, what else your OS was doing while running the test, what the compiler did to your code (where small implementation differences of the same algorithm can make big differences), what the JVM does while running your code (whether the garbage collector runs, whether the JVM pauses to compile part of your code into machine code for faster performance), etc.

The general advice on benchmarking is: don't do it (without very good reason). If you do it, don't put too much stock in it. (The first reading below is actually from the documentation of a Java microbenchmarking framework, and they say the same thing.)

So if the numbers we're getting are not very useful, why are we bothering? Because, while the "error rate" is very high in our measurements, we should still be able to see some significant trends as predicted by big-O. For example, a particular benchmark time might vary by 30% or more on the same machine, but if a different test takes 100x as long, that much larger difference will still be apparent despite the high error rate.

More reading: Java Microbenchmarks, Writing a Java Micro-Benchmark

FAQs

In our sort methods, should we dump the List into an array so we can sort it?

No. While that does work, the extra copying is not necessary and so should be avoided. As specified above, you can assume that your Lists are random access (basically, an ArrayList), and so you should write the sort using List methods. We'll then violate that assumption by passing in a LinkedList instead, just to see what sort of performance hit that entails. Basically anything you'd do with an array you can do using only the size, get, and set methods.

For example, consider this chunk of code from array-based bubbleSort:

for (int i = 0; i < array.length - 1; i++) {
  if (array[i].compareTo(array[i+1]) > 0) {
    //swap the two
    E temp = array[i];
    array[i] = array[i + 1];
    array[i + 1] = temp;
    swapped = true;
  }
}

This can be easily converted to using List methods, like so:

for (int i = 0; i < list.size() - 1; i++) {
  if (list.get(i).compareTo(list.get(i+1)) > 0) {
    //swap the two
    E temp = list.get(i);          // Or replace these 3 lines with:
    list.set(i, list.get(i + 1));  //  Collections.swap(list, i, i+1);
    list.set(i + 1, temp);         //
    swapped = true;
  }
}

Note that we could have performed the swap by using the remove and add methods, but that would entail an O(n) shift for the remove and then again for the reinsertion. If you're assuming the List is an ArrayList, then the get/set is more efficient.

If you want to, you can write your sorts using iterators in places where you think it will improve performance. The translation from array-based sorting code will be a little trickier if you do so, though.

I noticed that in the slides, quickSort's type parameter definition is more complicated. Why is that?

Good spotting. Compare:

public static <T extends Comparable<T>> void mergeSort(T[] array) {
public static <T extends Comparable<? super T>> void quickSort(T[] array) { ... }

Quicksort's definition is really the better way, but mergeSort's is simpler. You can declare your own methods either way, but here's what the difference means:

<T extends Comparable<T>>: T must be something that is Comparable to something else of the exact same type.
<T extends Comparable<? super T>>: T must be something that is Comparable to something else of the exact same type or any of its super types.

The later is more flexible. For example, consider your Meep class that implemented Comparable:

  public class Meep implements Comparable<Meep> {
    ...
    public int compareTo(Meep m) {
      ...

Meeps are comparable to other Meeps. In other words, "Meep extends Comparable<Meep>", so Meep would be a valid value for T. If we created an array of Meeps, each element of the list will have a compareTo(Meep) method, and so we could pass this array to either of the sort methods above.

But now consider a subclass:

  /** Exactly like a Meep, except a bit heavier than it should be. */
  public class FatMeep extends Meep {
    public FatMeep(int size) {
      super(size + 1);  
    }
  }

Thanks to inheritance, FatMeep is Comparable<Meep> and has a compareTo(Meep) method. This is fine; it means a FatMeep works just like a regular Meep in all comparison tests. You can compare a Meep to a FatMeep or a FatMeep to a Meep and get the same results. However, note that "FatMeep extends Comparable<Meep>" will match the generics requirement for quickSort but not for mergeSort. So we would not be able to sort a FatMeep[] using the mergeSort method above.

This situation is fairly rare, but it does reveal the complex underbelly of generics. (For more information, do a search for java generics PECS.)

Some sorts, such as merge sort, internally use an extra storage array. Does this have to be a List too?

That's up to you. You could easily make a List<E> aux = new ArrayList<E>(list);

Just leaving it as array is okay too, though.

When we're running the LinkedList sorts, should we change any internal auxillary arrays/lists to be a LinkedList too?

No, that's not necessary. I just intended you to change the implementations of the lists that are to be sorted. Internal details of the sort methods don't have to be touched.

How do we generate 1K, 10K, and 100K Strings to test with?

Just use the files of numbers that you already have, but read the numbers in as if they were Strings. For example, use Scanner's nextLine() rather than nextInt(), as mentioned above under testing.

I read about System.nanoTime(). Can we use that instead to time our sorts?

Sure, that's fine. nanoTime() is more precise, but it may not be any more accurate because your OS may only be capable of updating the current time every few milliseconds. If you do use nanoTime, don't forget to divide your results by 1 million to get the time in milliseconds for reporting purposes.

I entered my times on the wiki, but now they're not there! Someone else deleted them or something.

Yes, if two people, A and B, start editing the wiki at the same time, and A submits and then B submits, B will overwrite A's change. Having the data already formatted and ready to insert will limit the chance of this happening, but edit conflicts will still be likely on the night the assignment is due. (I believe Wikipedia will attempt to merge two recent changes and alert the second poster if it can't do that. The Laulima wiki does not appear to be that robust.)

Try to keep your edit times as short as possible to avoid trashing someone else's data. There's also a history record of all changes made to the document, so even if your times do get overwritten by someone else's changes, we will know that you did at least submit them before the deadline. So don't stress about your grade being affected by this. However, it would help us out if you could check back after the deadline and reinsert any missing sort-time data after the editing frenzy has died down.