Assignment 09

Task

Implement and then benchmark a couple different sorts under a variety of conditions.

Text: 8.2, 8.7, Chapter 8
Concepts: sorting, big-O, micro-benchmarking.

Steps

Step 1: Write 2 sorts

In a class named UsernameA09, write these two static sort methods:

public static <E extends Comparable<E>> void insertionSort(List<E> list)
public static <E extends Comparable<E>> void mergeSort(List<E> list)

Note that we are sorting java.util.List objects here, rather than arrays. When writing your sorts, you can assume these lists are random access (ArrayLists) in terms of performance. For example, you do not need to use an iterator and you can assume that a get method call is an O(1) operation.

You should implement a standard insertion sort algorithm.

Your merge sort should use the simple optimization of skipping unnecessary merges. Basically, add an if statement near the start of the merge: if the last element of the left array (at index endLeft) is less than or equal to the first element of the right array (at startRight), then the two arrays are already sorted relative to each other and the rest of the merge can be skipped/aborted.

For extra credit, you can also write:

public static <E extends Comparable<E>> void otherSort(List<E> list)

For otherSort, you can choose whatever comparison sort you want that is O(n²) or better. (O(n lg n) is better than O(n²). I recommend quicksort for the practice, but bubble, selection, or any interesting sort you find on Wikipedia or elsewhere is fine too. Name your method otherSort so that Tamarin can invoke it, but document in your javadoc comments what sort you are actually implementing and any optimization choices you made.

Step 2: Generate some test files

Here is a program to generate some unsorted data: FileGenerator.java

See the javadoc documentation on how to run the program. You'll need to specify the number of integers you want in the file as a command line argument. Run the program with 1000, 10000, and 100000 to produce the files 1000.txt, 10000.txt, and 100000.txt. Since it's a little hard to quickly count that many 0s, I'll refer to these as your 1K, 10K, and 100K test cases.

Step 3: Test your 3 sorts

Make sure your sort methods are correct. How you do this is up to you, but make sure you don't test all 3 sorts on the same list object.

Step 4: Time your sorts, plus one

Now time each of your sorts, in milliseconds, for lists of 1K, 10K, and 100K integers. As a control measure, also time the java.util.Collections.sort method. To time a method, do something like this:

  long sortTime = System.currentTimeMillis();  //start time
  insertionSort(listCopy);
  sortTime = System.currentTimeMillis() - sortTime;  //duration = end - start time
  System.out.println("Insertion sort time: " + sortTime);

You will find that benchmarking is not a very precise science. If you run your program multiple times, you'll probably see that your times can vary about 20% either way. Run each test at least 3 or 4 times and take an average. (You can do this formally with a loop in your program, or you can just run your program a few times and eyeball it.)

Time each method for each of the 1K, 10K, and 100K data sets in each of these cases:

an unordered ArrayList<Integer>
an unordered ArrayList<String>
an already-sorted ArrayList<Integer>
an unordered LinkedList<Integer>

If a single call to a sort method takes more than a few minutes to complete, you can skip recording the time for that entry. You do not have produce all of these values with a single run of your program. That would certainly be acceptable, but you can also modify your program and run it again to handle the different cases.

Step 5: Report your times

Log into Laulima and go to the new Wiki section. Further instructions will be on the wiki page itself. We'll collect everyone's times there and then discuss any trends after the assignment deadline.

To speed up the edit process and so hopefully avoid any edit conflicts, you'll need your times in the following format on one line: UH username, 3 times for insertion sort, 3 times for merge sort, 3 times for Collections.sort, and 3 times for your other sort if you have one. All of these elements should be separated by a | character. You'll have 1 line for each of the four conditions. An example:

Username | i-1K | i-10K | i-100k | m-1k | m-10k | m-100k | C-1K | C-10K | C-100k | o-1K | o-10k | o-100k
#Unordered integers:
ztomasze| 25 | 230 | 10900 | 12 | 60 | 95 | 5 | 50 | 105 | - | - | -
#Unordered strings:
ztomasze| 30 | 630 | 140500| 4 | 40 | 170 | 4 | 35 | 170 | - | - | -
#Sorted integers:
ztomasze| 1 | 9  | 13 | 1 | 9 | 30 | 1 | 9 | 15 | - | - | -
#LinkedList of integers:
ztomasze| 340 | 343365 | + | 19 | 1980 | 274976 | 3 | 30 | 105 | - | - | -

Having your data already in this format means you can just paste one new row into each of the 4 wiki tables. You'll also be asked for your username, java version, OS, CPU speed (type optional), and specifically what other sort you implemented (if any):

ztomasze | 1.7.0_06 | Windows 7 | 2.4Ghz (Intel i5) | -

Step 6: Submit your code

Tamarin will test your 2 (or 3) static methods directly for correctness. It won't time them, and it won't care what you did in your main method.

Sample Output

Up to you.

What to Submit

Upload your UsernameA09.java file to Tamarin. (No jar files this time.)

Remember to follow the course coding standards on all assignments.

Grading [50 points]

5 - Compiles
30 - 2 requested static sort methods
15 - Reported times to the Laulima wiki
5+ - Extra otherSort method: If you implement selection or bubble sort, you'll get +5 EC. Heapsort is worth +6 EC. If you implement a different sort, you can get +10 EC. (No Bogosort please.)

Discussion

Testing your sorts

I recommend you add a main method to your class that reads in a file of integers. A quick way to do this is specify the filename as a command line argument. You can use a Scanner's hasNextInt() and nextInt() methods to read the numbers from the file as ints, or use hasNextLine() and nextLine() to read them as Strings. You should store them in an an ArrayList. Pass a copy of this List to your sort methods and then print the sorted list out afterwards to make sure the sort worked. You'll probably want to test it with a small file of 20 or 30 elements.

Note that you can't pass the same list to all of your sort methods because it will already have been sorted by the first sort you test. One way to avoid this is to re-read the contents from the file before each sort test, but a better approach is just to make a copy of the unsorted list. Remember that every Collection in Java has a constructor that takes another Collection, so you can do something like this:

  List<Integer> unsorted = new ArrayList<Integer>();
  // ...load unsorted list with file contents...
  List<Integer> listCopy = new ArrayList<Integer>(unsorted);
  insertionSort(listCopy);
  System.out.println("After insertion sort: " + listCopy);
  // although listCopy is now sorted, original unsorted list is still unsorted,
  // so you can make another copy of it to test mergeSort

Benchmarking

Benchmarking means running a program to evaluate the real-world performance of either the code or the underlying hardware. What we're doing here is micro-benchmarking: testing the performance of just a small section of code.

Benchmarking code is fraught with peril. There are so many many factors that can affect the time you get: your hardware, your operating system, the possible precision of your timing mechanism, what else your OS was doing while running the test, what the compiler did to your code (where small implementation differences of the same algorithm can make big differences), what the JVM does while running your code (whether the garbage collector runs, whether the JVM pauses to compile part of your code into machine code for faster performance), etc.

The general advice on benchmarking is: don't do it (without very good reason). If you do it, don't put too much stock in it. (The first reading below is actually from the documentation of a Java microbenchmarking framework, and they say the same thing.)

So if the numbers we're getting are not very useful, why are we bothering? Because, while the "error rate" is very high in our measurements, we should still be able to see some significant trends as predicted by big-O. For example, a particular benchmark time might vary by 30% or more on the same machine, but if a different test takes 100x as long, that much larger difference will still be apparent despite the high error rate.

More reading: Java Microbenchmarks, Writing a Java Micro-Benchmark

FAQs

In our sort methods, should we dump the List into an array so we can sort it?

No. While that does work, the extra copying is not necessary and so should be avoided. As specified above, you can assume that your Lists are random access (bascially, an ArrayList), and so you should write the sort using List methods. (We'll then violate that assumption by passing in a LinkedList instead, just to see what sort of performance hit that entails.) Basically anything you'd do with an array you can do using only the size, get, and set methods.

For example, consider this chunk of code from array-based bubbleSort:

for (int i = 0; i < array.length - 1; i++) {
  if (array[i].compareTo(array[i+1]) > 0) {
    //swap the two
    E temp = array[i];
    array[i] = array[i + 1];
    array[i + 1] = temp;
    swapped = true;
  }
}

This can be easily converted to using List methods, like so:

for (int i = 0; i < list.size() - 1; i++) {
  if (list.get(i).compareTo(list.get(i+1)) > 0) {
    //swap the two
    E temp = list.get(i);
    list.set(i, list.get(i + 1));
    list.set(i + 1, temp);
    swapped = true;
  }
}

Note that we could have performed the swap by using the remove and add methods, but that would entail an O(n) shift for the remove and then again for the reinsertion. If you're assuming the List is an ArrayList, then the get/set is more efficient.

If you want to, you can write your sorts using iterators in places where you think it will improve performance. The translation from array-based sorting code will be a little trickier if you do so, though.

Merge sort internally uses an extra storage array. Does this have to be a List too?

That's up to you. You could easily make a List<E> aux = new ArrayList<E>(list);

Just leaving it as array is okay too, though.

When we're running the LinkedList sorts, should we change merge's internal auxillary list to be a LinkedList too?

No, that's not necessary. I just intended you to change the implementations of the lists that are to be sorted. Internal details of the sort methods don't have to be touched.

How do we generate 1K, 10K, and 100K Strings to test with?

Just use the files of numbers that you already have, but read the numbers in as if they were Strings. For example, use Scanner's nextLine() rather than nextInt(), as mentioned above under testing.

I read about System.nanoTime(). Can we use that instead to time our sorts?

Sure, that's fine. nanoTime() is more precise, but it may not be any more accurate because your OS may only be capable of updating the current time every few milliseconds. If you do use nanoTime, don't forget to divide your results by 1 million to get the time in milliseconds for reporting purposes.

I entered my times on the wiki, but now they're not there! Someone else deleted them or something.

Yes, if two people, A and B, start editing the wiki at the same time, and A submits and then B submits, B will overwrite A's change. I was hoping that having the data already formatted and ready to insert would limit the chance of this happening, but it looks like 30 people or more are all going to be trying to edit the wiki this evening. (I believe Wikipedia will attempt to merge two recent changes and alert the second poster if it can't do that. The Laulima wiki does not appear to be that robust.)

Try to keep your edit times as short as possible to avoid trashing someone else's data. There's also a history record of all changes made to the document, so even if your times do get overwritten by someone else's changes, we will know that you did at least submit them before the deadline. So don't stress about your grade being affected by this. However, it would help us out if you could check back after the deadline and reinsert any missing sort-time data after the editing frenzy has died down.

Also, people aren't really keeping the table rows sorted by username. If you noticed this, don't worry about it. I'll fix it later.