VCE IT Lecture Notes by Mark Kelly, McKinnon Secondary College

Searching and Sorting

SD U4O1 KK13- techniques for searching, including binary search, and techniques for sorting, including bubble sort and quick sort

It's ironic. The most computationally demanding tasks a computer can do are searching and sorting. They are also the most common operations done by computers.

It's no surprise then that programmers have for decades sought to discover more and more efficient ways to search and sort data. Some programmers have spent the bulk of their career researching search and sort algorithms. Don't expect this page to neatly tell you everything you need to know about them: they are complete sciences in themselves.

Searching

Let's take the case of an array or sequential file of N text (student IDs and corresponding test results) items in random order.

BIS0004 83
HOS0004 52
KIT0001 53
LAU0004 59
PIL0004 61
STE0018 65
DIP0001 71
CLA0001 75
GIB0002 82
ROB0004 82
BUN0002 83
MIL0009 84
KIN0005 89
MCC0004 89
BLA0008 95
KRA0005 95

We want to find the location of a particular item in the list.

Using brute force (any technique that relies on effort instead of elegance) we could look at each item in the list and compare it with the item we are searching for.

Let's search for BLA0008. Reading from the top of the list, we would need to check 16 IDs before we finally found the one we were looking for. If we'd been looking for HOS0004, we would've lucked out and only needed 2 comparisons. So, for a random list of N items, the number of comparisons required is anywhere between 1 and N.

This could be very bad if the list has millions of entries.

And if the item being searched for is not guaranteed to be in the list at all (e.g. you're searching a list of registered users to see if an ID is in the list), it will always take N comparisons before it reported that the item is not in the list.

Since there is no order in the list, there is no possible strategy to search smarter. You can't overcome the massive power of randomness. So you need to introduce some order into the list in order to use some brains during the search.

This means sorting the list. How we can sort the list will remain a mystery for a while.

BIS0004 83
BLA0008 95
BUN0002 83
CLA0001 75
DIP0001 71
GIB0002 82
HOS0004 52
KIN0005 89
KIT0001 53
KRA0005 95
LAU0004 59
MCC0004 89
MIL0009 84
PIL0004 61
ROB0004 82
STE0018 65

Now we can use some tricky techniques. Such tricky techniques are called algorithms - they describe the strategy used to solve a problem.

At least now we can find whether a designated item is or is not in the list without always having to check every item first.

How? Think about it before proceeding.

Here's some thinking animation...

Let's assume we're searching to find out if KEL0001 is in the list. Starting at the top, we compare items down to HOS0004. The next item in the sorted list is KIN0005, which is alphabetically after KEL0001. Since KEL0001 was not found by the time we reached KIN0005, we know it's not in the list and we can stop searching after only 8 items instead of having to process all 16.

That's a saving, but we can do better.

Let's try a binary search. Have you ever played the game, "I'm thinking of a number between 1 and 100"? Each time you guess, the person tells you if your guess is high or low or correct. With brute force (sheer guessing), you'd get the right answer in 1 to 100 moves. There's a smarter way. With each guess you can halve the number of possible choices.

At each guess we revise the upper and lower limits of possible answers and the next guess is precisely half way between the limits. When we start, the lower limit is 1 and the upper limit is 100, so we guess halfway between the those...

Guess Response Action New lower
limit of range
New upper
limit of range
Halfway point between limits
50 Too high Rule out numbers 50-100 1 49 25
25 Too low Rule out 1 to 25 26 49 32
32 Too low Rule out 25 to 32 33 49 41
41 Too high Rule out 41 to 49 33 40 37
37 Right!        

So, instead of up to 100 random guesses, we can always get the answer within 5 to 7 guesses. That's an improvement.

In pseudocode it might look like this:

DECLARE upperlimit, lowerlimit
DECLARE valuetofind
DECLARE guess
DECLARE success, count

upperlimit = 100
lowerlimit = 1
success = FALSE
count = 0

WHILE success = FALSE
    /* guess halfway between upper and lower limits */
    guess = (upperlimit + lowerlimit - 1) / 2
    count = count + 1

    IF guess = valuetofind THEN
     /* guessed it! */
         success = TRUE
     /* set the success flag */
    ELSEIF guess < valuetofind THEN
     /* guessed low */
        lowerlimit = guess + 1
    /* new lower limit */
    ELSE
     /* only other option is guessed high */
        upperlimit = guess - 1
    /* new upper limit */
    ENDIF
END WHILE    /* try again if haven't succeeded*/

PRINT "Got it in " & count & " tries!"

 

1 BIS0004 83
2 BLA0008 95
3 BUN0002 83
4 CLA0001 75
5 DIP0001 71
6 GIB0002 82
7 HOS0004 52
8 KIN0005 89
9 KIT0001 53
10 KRA0005 95
11 LAU0004 59
12 MCC0004 89
13 MIL0009 84
14 PIL0004 61
15 ROB0004 82
16 STE0018 65

In our original test, let's search for DIP0001 in the sorted list.
Lower limit = 1. Upper limit = 16. Halfway guess = 8. Value is KIN0005. Too high.

1 BIS0004 83
2 BLA0008 95
3 BUN0002 83
4 CLA0001 75
5 DIP0001 71
6 GIB0002 82
7 HOS0004 52
8 KIN0005 89
9 KIT0001 53
10 KRA0005 95
11 LAU0004 59
12 MCC0004 89
13 MIL0009 84
14 PIL0004 61
15 ROB0004 82
16 STE0018 65

Lower limit = 1. Upper limit = 7. Halfway guess = 4. Value is CLA0001. Too low.

1 BIS0004 83
2 BLA0008 95
3 BUN0002 83
4 CLA0001 75
5 DIP0001 71
6 GIB0002 82
7 HOS0004 52
8 KIN0005 89
9 KIT0001 53
10 KRA0005 95
11 LAU0004 59
12 MCC0004 89
13 MIL0009 84
14 PIL0004 61
15 ROB0004 82
16 STE0018 65

Lower limit = 5. Upper limit = 7. Halfway guess = 6. Value is GIB0002. Too high.

1 BIS0004 83
2 BLA0008 95
3 BUN0002 83
4 CLA0001 75
5 DIP0001 71
6 GIB0002 82
7 HOS0004 52
8 KIN0005 89
9 KIT0001 53
10 KRA0005 95
11 LAU0004 59
12 MCC0004 89
13 MIL0009 84
14 PIL0004 61
15 ROB0004 82
16 STE0018 65

Lower limit = 5. Upper limit = 7. Halfway guess = 6. Value is GIB0002. Too high.

Lower limit =5. Upper limit =5. We have a winner. Now we can deal with record 5 in whatever way we need to.
And it took 5 only guesses instead of 16.

Good old binary search.

Sorting

Such searching relies on the data being sorted, and sorting is even more challenging than searching since it can require a lot of data copying and moving to get things in order. Moving data in RAM is slow enough; moving it on a disk is very slow. So you want to sort as cleverly as you can to minimise the total number of data movements.

There are many different sorting algorithms - some are better in some circumstances (e.g. how random the original data is) than others. They include

  • Bubble sort (you need to know this one)
  • Insertion sort
  • Shell sort
  • Merge sort
  • Heapsort
  • Quicksort (you need to know this one)
  • Counting Sort
  • Bucket sort
  • Radix sort
  • Distribution sort

Bubble sort - thanks to Wikipedia for help

 

Bubble sort is a simple algorithm that works by repeatedly stepping through the list to be sorted, comparing each pair of adjacent items and swapping them if they are in the wrong order. The pass through the list is repeated until no swaps are needed, which indicates that the list is sorted. It is probably the least efficient sorting algorithm.

Let us take the array of numbers "5 1 4 2 8", and sort them array from smallest to biggest using bubble sort algorithm.
In each step, elements written in bold are being compared.

First Pass:
( 5 1 4 2 8 ) > ( 1 5 4 2 8 ), Here, algorithm compares the first two elements, and swaps them.
( 1 5 4 2 8 ) > ( 1 4 5 2 8 ), Swap since 5 > 4
( 1 4 5 2 8 ) > ( 1 4 2 5 8 ), Swap since 5 > 2
( 1 4 2 5 8 ) > ( 1 4 2 5 8 ), Now, since these elements are already in order (8 > 5), algorithm does not swap them.

Second Pass:

( 1 4 2 5 8 ) > ( 1 4 2 5 8 )
( 1 4 2 5 8 ) > ( 1 2 4 5 8 ), Swap since 4 > 2
( 1 2 4 5 8 ) > ( 1 2 4 5 8 )
( 1 2 4 5 8 ) > ( 1 2 4 5 8 )

Now, the array is already sorted, but our algorithm does not know if it is completed. The algorithm needs one whole pass without any swap to know it is sorted.

Third Pass:

( 1 2 4 5 8 ) > ( 1 2 4 5 8 )
( 1 2 4 5 8 ) > ( 1 2 4 5 8 )
( 1 2 4 5 8 ) > ( 1 2 4 5 8 )
( 1 2 4 5 8 ) > ( 1 2 4 5 8 )

Finally, the array is sorted, and the algorithm can terminate.

Bubble sort is simple to code, but awfully inefficient to use, and it gets exponentially worse as the number of items to sort increases. It has been called a "perversely awful algorithm" and "it has nothing to recommend it".

Quicksort

Quicksort animation
Quicksort algorithm visualised (from Wikipedia)

  • Quicksort is a sorting algorithm developed by C. A. R. Hoare.
  • On average, Quicksort makes O(n log n) (big O notation) comparisons to sort n items.
  • In the worst case, it makes O( n2 ) comparisons.
  • Quicksort takes advantage of virtual memory and available caches, and because it uses no temporary memory during data swaps, it is very well suited to implementation on computers.

Quicksort sorts by employing a divide and conquer strategy to divide a list into two sub-lists.

The steps are:

  1. Pick an element, called a pivot, from the list.
  2. Reorder the list so that all elements with values less than the pivot come before the pivot, while all elements with values greater than the pivot come after it (equal values can go either way). After this partitioning, the pivot is in its final position. This is called the partition operation.
  3. Recursively sort the sub-list of lesser elements and the sub-list of greater elements.

 

 

BOGOSORT - If bogosort were used to sort a deck of cards, it would consist of checking if the deck were in order, and if it were not, one would throw the deck into the air, pick the cards up at random, and repeat the process until the deck turns out to be sorted. The bogosort is not be recommended in a Software Development outcome.

 

As an alternative to slow swapping of list items, also see my pages on binary trees and indexing.

 

If you're into sorting theory (and have a PhD in maths), check out Wikipedia to learn about ...

Theory
Exchange sorts
Bubble sort · Cocktail sort · Odd-even sort · Comb sort · Gnome sort · Quicksort
Selection sorts
Insertion sorts
Merge sorts
Non-comparison sorts
Others
Ineffective/humorous sorts

 

Back to the IT Lecture Notes index

Back to the last page you visited

Created 16 Sep 2010

Last changed: September 17, 2010 1:07 PM

VCE IT Lecture notes copyright © Mark Kelly 2001-