A Sort Factory for the Collections API

ORIGINAL DRAFT

The Java 1.2 Collections API provides a number of useful data structures and algorithms but only provides one static method for sorting data in the Collections class. This article extends the basic API with a SortFactory, which provides a number of sorting algorithms for use with any of the sortable collections. The general purpose heap sort provided by the Java 1.2 API is a good choice in most cases, but this group of algorithms provides more flexibility when you need to optimize for certain conditions.

This article also illustrates a number of useful techniques for extending Java that you may find instructive. We’ll apply a few programing patterns, among them the Factory and Singleton patterns. We base our approach on a Sort interface, which makes the sorting algorithms interchangeable. Finally, we’ll capitalize on the Comparable and Comparator interfaces provided in the Java collections API to develop a fully generalized implementation, which can be used with any List or Object array.

Overview of the Java Collections API

The Java collections API provides a set of interfaces and concrete implementations that allow you to manipulate object groups. Let’s take a quick look at the interfaces:

Collection is a group of objects in no particular order.
Set is a collection with no duplication, in no particular order.
List is an ordered collection with possible duplicates.
Map is collection of key/value pairs with a maximum of one value per key.
SortedSet is a Set, which is always kept sorted.
SortedMap is a Map, which is always kept sorted.

The general objectives articulated in the specification include presenting a unified architecture for representing and manipulating collections, independent of the details of their representation. The interfaces support interoperability between unrelated APIs, and reduce design and learning effort, encouraging reuse. The concrete implementations provide high performance solutions while the abstract classes provide a springboard for extensibility.

The concrete implementations provided by the API are:

HashSet implements the Set interface using a hash table.
TreeSet implements the Set interface using a red-black tree structure.
ArrayList implements the List interface with an unsynchronized Vector.
LinkedList implements the List interface with a doubly-linked list (a chain).
HashMap implements the Map interface with a hash table.
TreeMap implements the SortedMap interface with a red-black tree.

In Java 1.2, the Vector class now implements the List interface and the Hashtable class implements the Map interface. These two classes remain synchronized but the new collection classes are not. They are, therefore, faster than their predecessors. It’s possible to ensure thread safety by calling the static synchronizedCollection method in the Collections class. You can also make a collection immutable by calling the static unmodifiableCollection method in the Collections class.

To make arrays easier to deal with, the Arrays class implements an asList method that lets you manipulate the array as if it was a List. Several algorithms are provided to sort, shuffle, search, reverse, fill, copy and find minimum or maximum values. You’ll find these implemented as static methods in the Collections class. Both the Collections and Array classes provide a large number of useful methods you’ll want to be familiar with.

To make it easy to create your own implementations, the designers of the Collections API have provided a handful of abstract classes, which implement several useful methods, allowing you to leverage this foundation when you are forced to build your own collection classes. This is rarely going to be necessary but there are certainly cases where you might want to optimize for certain conditions (building a disk-based hash table, for example, or an alternate tree implementation). The abstract classes provide skeletal implementations for the various interfaces.

AbstractCollection implements the Collection interface.
AbstractSet implements the Set interface.
AbstractList implements the List interface.
AbstractSequentialList implements an ordered list.
AbstractMap implements the Map interface.

The Collections API was designed to be relatively thin, both compact and as uncomplicated as possible. Many of the existing Java classes, which made use of the Vector and/or Hashtable can use the new Collections classes, and it is highly recommended this approach be applied to any of your own development as early as possible in a given project. The new collections are faster, more flexible and represent the better choice in most cases.

One of the big surprises in this API is the UnsupportedOperationException, which can be thrown by unimplemented methods. The designers chose to do this to avoid the typical trap experienced by implementers of similar libraries. To capture all the subtleties and possible variations in different data structures, it’s easy to find yourself implementing a wide array of complicated interfaces. Rather than produce a large number of interfaces, the designers or the Java Collections API decided to consider the non-core methods optional, throwing a suitable exception when they are unavailable. While this tends to conflict with the whole idea of declaring interfaces to define roles and imperative methods, it was considered a reasonable compromise in order to keep the API as simple as possible, without unnecessarily constraining it.

Sort Interface

In this article, we’re primarily concerned with the interfaces that let you sort elements. The Comparable interface defines the compareTo method that lets you determine relative order for the object being compared. The Comparator interface defines a compare method, which does the same thing as compareTo but acts on two independent objects, along with the equals method for determining equality between Comparators.

To make our sorting implementations uniform, we define a Sort interface, which forces the following methods to be present:

public void sort(List list);
public void sort(Comparable[] list);
public void sort(Object[] list, Comparator comp);

Listing 1 shows the Sort interface, which is designed to accommodate a trio of basic variations. By having the first two method call the more flexible third method, we can reduce the amount of code in each of the sorting classes. We implement an AbstractSort class that each of the concrete classes can extend. In addition, we define a DefaultComparator, which is always used in cases where you don’t explicitly define another one by using the extended method syntax.

To make it easier to select algorithms, we implement a factory class called SortFactory. The Factory pattern is designed to produce instances of the required classes on demand. We’ll use a static getInstance method with a named argument that returns an implementation of the Sort interface. To maximize efficiency, only one instance is actually created for each of the algorithms and these are only created when asked for the first time.

Sorting Algorithms

Over the years, numerous sorting algorithms have been developed and studied by researchers and computing professionals. To gauge their efficiency, the O notation was developed to measure the growth rate associated with algorithms applied to data sets of different sizes. The average number of operations required to sort a list of elements plays a direct role in this equasion. In these equations, N represents the number of elements in the list being sorted. The fastest algorithms run in linear O(N) time, the weaker ones typically use an inner and outer loop and run in O(N^2) time.

Table 1: Average sort times for different sample sets. The samples are randomized, already in ascending or descending order, or the elements are equal to each other.
Algorithm	Strategy	Random	Ascending	Descending	Equality
Bubble	Transposition	O(N^2)	O(N)	O(N^2)	O(N)
Selection	Priority Queue	O(N^2)	O(N^2)	O(N^2)	O(N^2)
Insertion	Insert and Keep Sorted	O(N^2)	O(N)	O(N^2)	O(N)
Shell	Diminishing Increment	O(N^1.25)	O(N log N)	O(N^1.5)	O(N log N)
Quick	Divide and Conquer	O(N log N)	O(N log N)	O(N log N)	O(N log N)
Merge	Divide and Conquer	O(N log N)	O(N log N)	O(N log N)	O(N log N)
Heap	Divide and Conquer	O(N log N)	O(N log N)	O(N log N)	O(N)

Table 1 shows how different data conditions can affect these algorithms. From these numbers, you can conclude that the Bubble and Insertion algorithms do well if the data is already nearly sorted or uniform, but poorly in other cases. The Heap sort does a pretty good job most of the time and tends to excel on sets that are fairly uniform. Keep in mind that these assessments are based on extreme conditions, which occur only rarely in the real world. A better way to decide which of these algorithms are most required for your own needs may require benchmarking measurements and a more in-depth understanding of your particular data conditions.

Here’s a quick overview of the algorithms and how they function:

BubbleSort makes a series of passes through the collection, exchanging adjacent items if they are in the wrong order. The sort procedure ends when a full pass is made and no exchanges were necessary.
SelectionSort searches the collection for the smallest item and swaps it with the currently marked lower bound. The lower bound is moved successively up the collection until the final element is in place.
InsertionSort picks the item right after the current lower bound and finds the correct position for it in the preceding part of the list, moving entries above the insertion up by one position.
ShellSort is similar to InsetionSort but tries to sort elements a fixed distance apart, reducing the distance over several iterations. This minimizes the number of elements that need to be moved when an item is inserted.
QuickSort divides the collection into two halves around a pivotal element and moves the elements above the pivot to the first half and elements which are larger in the second half. The process is applied recursively to each half until the elements end up in order.
MergeSort divides a list into sorted subsections by first taking every sequential pair of elements and sorting them and then grouping every pair of pairs and merging them in the correct order, repeating the process at increasing granularities until the whole series is in order.
HeapSort uses the heap data structure, which represents a tree by dividing a list into successive binary divisions. The algorithm inserts each elements into the tree and removes them successively, reorganizing the tree as elements are positioned at the end of the list.

Supporting Classes

Listing 2 shows the AbstractSort class, which provides a foundation for building the concrete sort classes. We use the DefaultComparator from Listing 3 to do comparisons if none is explicitly provided. The DefaultComparator implements the Comparator interface. We test the two arguments in the compare method and throw an IllegalArgumentException if they do not implement the Comparable interface. If they do, we cast the arguments and call the Comparable interface compareTo method. The equals method simply calls compare and returns true if the returned value is zero (0).

Figure 1: Class Relationships. The Sort interface is implemented by the AbstractSort class, which is the parent of each of the other sorting classes. AbstractSort provides a DefaultComparator. The SortFactory generates requested Sort classes on demand.

Listing 4 shows the SortFactory code. This is the heart of our implementation. The getInstance method returns an instance of a sorting class that implements the Sort interface. The SortFactory is implemented as a singleton. The constructor is declared as private, so no other instance can be created. We use a static singleton variable to hold the only permissible instance of this class. The getInstance method is declared as static and uses the singleton instance to create suitable Sort classes.

The getInstance method always checks to see if an instance of the requested Sort class has already been created, returning it if it finds one. Otherwise, the class is created and a reference is save in an suitably named static variable. This has the effect of keeping unused classes from being created and created classes from being duplicated, keeping the SortFactory’s single instance as efficient as possible. The technique used to create classes only when necessary is often referred to as lazy instantiation.

Using a Test Harness

Often overlooked when implementing code is the test harness. For our purposes, we’ll use the Java 1.2 Integer class, which implements the Comparable interface. Our test elements will be integer values, but any Comparable type works equally well.

Our SortingTest class implements the harness. We use a few static methods to make life easier. The randomList method generates a List collection with the specified number of randomized elements. The printList method is used to display our results in a readable format. We also need a copyList method to do a deep copy. We need to use separate copies of the input Lists because our algorithms sort by reference. Finally, we implement a sortedList method, which tests to see if a list is sorted. It does this by checking all adjacent element pairs and making sure they are in the right order.

The main clause in our test harness loops for several iterations. On each loop, a random sample is generated and each of the algorithms are tested. For each test, we copy the test sample, sort it and display both the original list and the sorted list, along with a PASSED or FAILED message to show whether the results were properly sorted. If any of the tests fail, our success flag gets set to false and the full test result indicates a failure.

This harness allows us to run very large tests, possibly overnight, to certify our code before delivery and represents an important technique used by Quality Assurance teams to do regression testing without manual intervention. It’s useful to think of these as unit tests, however, implemented at development time rather than as an afterthought. The net improvement in code quality and lowered debugging and maintenance time are well worth the small investment you can make up front.

Preliminary Benchmarking

It’s important to make intelligent decisions about which algorithm you might want to apply under a given set of circumstance. Of course, a number of factors will come into play but the most common will probably depend on the number of elements you are sorting and/or the level of order in the set to be sored. Some of the algorithms tend to perform better under different circumstances.

The basic test we performed were based on random lists of a given number of elements. To make this at least reasonably statistically relevant, several runs were made for each list size so the values you see are effectively averages. Numerous factors are not accounted for, however, including memory constraints, type comparison (we used a simple Integer data type and numbers are likely to vary for strings or compound data structures), nor were the algorithms optimized for speed.

Still, it is useful to be able to compare the algorithms, even if the comparison is coarse. Figure 2 shows the comparative results in table and graph forms.

Figure 2: Benchmark Results. This graph shows how the various sort classes compare to each other. The test was run 5 times and results were averaged. Each test loop used the same source data with each of the algorithms.

Notice that the graph uses a logarithmic scale, so small differences are expecially pronounced toward the top of the graph. It’s also worth noting that the InsertionSort beat the others for small sets. QuickSort seems to beat out the others, but HeapSort consumes the least amount of memory, so it could be argued that it offers the best tradeoff. Some of these results may be attributable to the order of the random data, which is not strictly random but pseudo-random, as generated by the Math.random method.

Using the SortFactory, you can easily experiment with the type of data you use to optimize the algorithm for your circumstances.

Algorithm analysis provides a framework for measuring the efficiency of algorithms. For several key reasons, it is impossible to perform a deep scientific analysis that takes into account all the possible machine, memory, language and implementation-specific issues. Yet a rough estimate, based on the most significant elements affecting performance, is important when deciding which algorithm to use. The O notation, a.k.a. Big-Oh or Asymptotic notation, effectively eliminates the less significant elements of the equation. The objective is to produce a simplified formula, which is similar to less than or equal to the actual algorithm performance curve.

A formula that looks like: f(n) = XN^2 + YN + Z; factors into something like: O(N^2). It’s intuitively easy enough to understand that the N^2 (N squared) part of the equation is the most significant factor. Removing any other element has less impact on the curve than removing that one. There are other variants to the Big O, by the way - Big Omega, Big Theta, and so on - each measures growth rate by applying slightly different formulas. Based on our quick explanation, you can probably guess why Big O has it’s name. The O stands for Order. So the Bubble sort, which includes a pair of nested loops, is said to perform on the order of N squared, where N stands for the number of elements being sorted.

To make things easier to compare, algorithms are usually grouped into complexity classes. The following table shows the most common classes. Notice that the general purpose sorting algorithms fall between Quadratic and Linear. For obvious reasons, it is not wise to use the slower algorithms on large sets of data. It’s also good idea to understand enough about these factors to predict performance in your applications, whether these factors apply to sorting or other algorithms you may develop or use.

Name	Notation
Constant	O(1)
Logarithmic	O(log N)
Linear	O(N)
N log N	O(N log N)
Quadratic	O(N^2)
Cubic	O(N^3)
Exponential	O(2^N)

Keep in mind that O notation is an indicator, which cannot and does not try to account for memory usage, resource constraints, machine speed or other elements, so your mileage will tend to vary widely. Obviously a faster machine will typically run the algorithm more quickly but certain algorithms perform better in constrained memory environments (like Heap sort, which consumes less memory) while others might be better in slower memory conditions (Merge sort, for example, can be applied to disk-based sorting). Finally, the order of the data can affect performance, especially in sorting. Some algorithms move less data and the O notation cannot account for those conditions.

Summary

The classes presented in this article provide an inventory of sorting algorithms, a uniform interface and an efficient factory for generating these algorithms and using them in a variety of interchangeable circumstances. Applications can take advantage of algorithm-specific characteristics to optimize sorting speeds under a variety of data conditions. The use of interfaces and programing patterns in our implementation maximize reuse and extensibility and the test harness guarantees reliability.

Listing 1

import java.util.List;
import java.util.Comparator;

public interface Sort
{
  public void sort(List list);
  public void sort(Comparable[] list);
  public void sort(Object[] list, Comparator comp);
}

Listing 2

import java.util.List;
import java.util.Comparator;

public abstract class AbstractSort implements Sort
{
  private static DefaultComparator
    defaultComparator = new DefaultComparator();
	
  public void sort(List list)
  {
    int size = list.size();
    Comparable[] array = new Comparable[size];
    for (int i = 0; i < size; i++)
    {
      array[i] = (Comparable)list.get(i);
    }
    sort(array);
    for (int i = 0; i < size; i++)
    {
      list.set(i, array[i]);
    }
  }

  public void sort(Comparable[] array)
  {
    sort(array, defaultComparator);
  }

  public abstract void sort(Object[] list, Comparator comp);
}

Listing 3

import java.util.Comparator;

public class DefaultComparator implements Comparator
{
  public int compare(Object one, Object two)
  {
    if (!(one instanceof Comparable &amp;&amp; two instanceof Comparable))
      throw new IllegalArgumentException(
        "DefaultComparator requires Comparable elements");
    Comparable left = (Comparable)one;
    Comparable right = (Comparable)two;
    return left.compareTo(right);
  }

  public boolean equals(Object one, Object two)
  {
    return compare(one, two) == 0;
  }
}

Listing 4

import java.util.Comparator;

public class SortFactory
{
  private static SortFactory singleton = new SortFactory();
  private static Sort bubble, insertion,
    selection, shell, merge, heap, quick;
	
  private SortFactory() {}

  public static Sort getInstance(String name)
  {
    if (name.equalsIgnoreCase("Bubble"))
    {
      if (singleton.bubble == null)
        singleton.bubble = new BubbleSort();
      return singleton.bubble;
    }
    if (name.equalsIgnoreCase("Insertion"))
    {
      if (singleton.insertion == null)
        singleton.insertion = new InsertionSort();
      return singleton.insertion;
    }
    if (name.equalsIgnoreCase("Selection"))
    {
      if (singleton.selection == null)
        singleton.selection = new SelectionSort();
      return singleton.selection;
    }
    if (name.equalsIgnoreCase("Merge"))
    {
      if (singleton.merge == null)
        singleton.merge =new MergeSort();
      return singleton.merge;
    }
    if (name.equalsIgnoreCase("Shell"))
    {
      if (singleton.shell == null)
        singleton.shell =new ShellSort();
      return singleton.shell;
    }
    if (name.equalsIgnoreCase("Heap"))
    {
      if (singleton.heap == null)
        singleton.heap =new HeapSort();
      return singleton.heap;
    }	
    if (name.equalsIgnoreCase("Quick"))
    {
      if (singleton.quick == null)
        singleton.quick = new QuickSort();
      return singleton.quick;
    }
    throw new IllegalArgumentException(
      "Sorting algorithm " + name + " is not supported");
  }
}

Listing 5

import java.util.List;
import java.util.ArrayList;

public class SortingTest
{
  public static List randomList(int length)
  {
    List list = new ArrayList();
    for (int i = 0; i < length; i++)
    {
      list.add(new Integer((int)(Math.random() * 99)));
    }
    return list;
  }

  public static void printList(List list)
  {
    for (int i = 0; i < list.size(); i++)
    {
      Object element = list.get(i);
      System.out.print(element + " ");
    }
    System.out.println();
  }
	
  public static List copyList(List list)
  {
    List copy = new ArrayList();
    for (int i = 0; i < list.size(); i++)
    {
      int value = ((Integer)list.get(i)).intValue();
      copy.add(new Integer(value));
    }
    return copy;
  }

  public static boolean sortedList(List list)
  {
    for (int i = 0; i < list.size() - 1; i++)
    {
      Comparable item = (Comparable)list.get(i);
      Comparable next = (Comparable)list.get(i + 1);
      if (item.compareTo(next) > 0) return false;
    }
    return true;
  }
	
  public static void main(String[] args)
  {
    String[] algorithms = {"Bubble", "Insertion",
      "Selection", "Shell", "Merge", "Quick", "Heap"};
    List list = randomList(20);
    boolean success = true;
    int count = 20;
    for (int j = 0; j < count; j++)
    {
      for (int i = 0; i < algorithms.length; i++)
      {
        List copy = copyList(list);
        Sort sorter = SortFactory.getInstance(algorithms[i]);
        String className = sorter.getClass().getName();
        if (!className.startsWith(algorithms[i]))
          throw new IllegalArgumentException("Algorithm failed");
        sorter.sort(copy);
		
        System.out.println();
        System.out.println("TESTING  : " + className);
        System.out.print("Original : ");
        printList(list);
        System.out.print("Sorted   : ");
        printList(copy);
        System.out.print("RESULT   : ");
        boolean result = sortedList(copy);
        System.out.println(result ? "SUCCESS" : "FAILURE");
        if (!result) success = false;
      }
      System.out.print("\nLOOP " + (j + 1) + " RESULT: ");
      System.out.println(success ? "SUCCESS" : "FAILURE");
    }
    System.out.print("\nTOTAL " + count + " LOOP TEST RESULT: ");
    System.out.println(success ? "SUCCESS" : "FAILURE");
  }
}

Listing 6

import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;

public class SortingBenchmark
{
  public static List randomList(int length)
  {
    List list = new ArrayList();
    for (int i = 0; i < length; i++)
    {
      list.add(new Integer((int)(Math.random() * 99)));
    }
    return list;
  }

  public static List copyList(List list)
  {
    List copy = new ArrayList();
    for (int i = 0; i < list.size(); i++)
    {
      int value = ((Integer)list.get(i)).intValue();
      copy.add(new Integer(value));
    }
    return copy;
  }
	
  public static long mark()
  {
    return System.currentTimeMillis();
  }

  public static long time(long mark)
  {
    return System.currentTimeMillis() - mark;
  }

  public static void main(String[] String_1darray1)
  {
    String[] algorithm = {"Bubble", "Insertion",
      "Selection", "Shell", "Merge", "Quick", "Heap"};
    int iterations = 5;
    int[] elements = { 100, 500, 1000, 5000, 10000, 50000 };
    long[][] averages = new long[elements.length][algorithm.length];

    for(int i = 0; i < iterations; i++)
    {
      System.out.print( "\nIteration: " + (i + 1));
      for(int j = 0; j < elements.length; j++)
      {
        System.out.print( "\n" + elements[j] + " elements" );
        List list = randomList(elements[j]);
        int count = 0;
        while (count < algorithm.length)
        {
          List copy = copyList(list);
          Sort sort = SortFactory.getInstance(algorithm[count]);
          String name = sort.getClass().getName();

          if(!(name.startsWith(algorithm[count])))
          {
            throw new IllegalArgumentException( "Algorithm failed" );
          }
          else
          {
            long mark = mark();
            sort.sort(copy);
            averages[j][count] += time(mark);
            System.out.print( "." );
            count++;
          }
        }
      }
    }
    System.out.println();
    System.out.print( "Elements: " );
    for(int i = 0; i < elements.length; i++)
    {
      if(i > 0) System.out.print( ", " );
      System.out.print( elements[i] );
    }
    System.out.println();
    for(int i = 0; i < algorithm.length; i++)
    {
      System.out.print((String.valueOf(algorithm[i])) + "Sort: ");
      for(int j = 0; j < elements.length; j++)
      {
        long average = averages[j][i] / (long)iterations;
        if(j > 0) System.out.print( ", " );
        System.out.print(average);
      }
      System.out.println();
    }
    System.out.println();
  }
}

Claude Duguay

Enterprises, Inc.