CS 228 meeting -*- Outline -*-

* sorting with binary trees (HR 13.6-7)

It should be clear that we can use our search trees to do sorting
(so I lied saying sorting doesn't have anything to do with data structures)

------------------------------------------
SORTING WITH BINARY TREES (HR 13.6-7)

Problem:
   Quicksort has O(N^2) worst case time.
   Can we do better by using
   a clever data structure?

Idea: use a binary search tree, t.

   Initialize t to empty;
   For each data value, v, in input:
       Insert(v, t);

Pictures:

------------------------------------------

Q: What's the average case efficiency?
   N * insertion time = N log N

   If we want to get the data back in a vector, how long does that take?
   traversal time = N

------------------------------------------
PROBLEMS WITH NAIVE USE OF TREES

Time overhead:
   - building the tree from array
   - inorder traversal
   - filling up the array again

Worst case time:
   - still O(N^2)

Space overhead:
   - 2 pointers for each node

------------------------------------------
   so that's 2/3 extra space

What we'll see in this unit is something that solves *all* of these problems:
it will be worst case O(N log N), and also be space efficient.

** treesort, a worst case time O(N log N) sort (HR 13.6)

The idea of this is to *solve the time problem*
by *starting with* a balanced tree.

Like a tennis or basketball tournament, except that after we get the winner,
play again to determine 2nd place, etc.

Think of these as tennis players, numbers are their "true rankings"

------------------------------------------
TREESORT

Phase 1:

              E

        E           E

     7     3     4     1

------------------------------------------
Note: this is called a *full* binary tree.

draw pictures

              E

        3           1

     7     E     4     E


              1

        3           4

     7     E     E     E

Now the tree has the invariant property:
   the root is smaller than all proper nodes of its subtree.

Extract numbers into vector, one at a time,
continuing the tournament each time to find the next place winner

Phase 2:

              3

        7           4

     E     E     E     E

   winners: 1


              4

        7           E

     E     E     E     E

   winners: 1 3


              7

        E           E

     E     E     E     E

   winners: 1 3 4


   winners: 1 3 4 7

Q: how long does each phase take?

------------------------------------------
REPRESENTATION ISSUES

What if we don't have 2^n - 1 data items?

------------------------------------------
   ... use empty (E) for the rest
   draw picture of it for 5 items

------------------------------------------
How to represent E (empty)?

Don't want it to be promoted

So has to either
   - be a tag field of the nodes
   - or be number ____________________

------------------------------------------
   ... greater than any proper number (+inf).

Q: Which will be faster?
   the regular number; it's also more space efficient,
   but doesn't work if all the number codes are needed for real data

See book (Fig 13.23) for coding details (recursive).

------------------------------------------
WORST CASE TIME COST OF TREESORT

Phase 1:
   elements to move:       N
   max length of move:     log N
   total:

Phase 2:
   elements to extract:    N
   time to promote next:   log N
   total:

Overall:

------------------------------------------

So now we've solved the time problem, need to solve the space problem...

** vector implementation of binary trees (HR 13.7)

------------------------------------------
VECTOR IMPLEMENTATION OF BINARY TREES
(HR 13.7)

Problem: eliminate overhead for pointers

Idea: use array
   node at index k:
      Left child at 2*k + 1
      Right child at 2*k + 2

              A                   0   A
                                  1   B
        B           C             2   C


              A                   0   A
                                  1   B
        B           C             2   C
                                  3   D
     D     E     F     G          4   E
                                  5   F
                                  6   G

------------------------------------------

Q: What's the mapping from the right to the left called?
   the abstraction map!

Q: How do you find the index of a parent in the tree
   from the index of a child?
   parent_index(k) = (k-1)/2   // with C++ integer division (trunc)

Q: The vectors in the pictures are sorted,
   where is the largest element in the vector?
   ... this will be used for heapsort...

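The index formulas above can be written down directly in C++. This is only a
minimal sketch: the helper names leftChild, rightChild and parentIndex are
made up for illustration (the notes give just the formulas), and indexes are
assumed to be std::size_t.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper names; only the formulas come from the notes.

// Index of the left child of the node stored at index k.
std::size_t leftChild(std::size_t k) { return 2 * k + 1; }

// Index of the right child of the node stored at index k.
std::size_t rightChild(std::size_t k) { return 2 * k + 2; }

// Index of the parent of the node stored at index k.
// The root (k == 0) has no parent, so k must be positive.
// Integer division truncates, so both children map to the same parent.
std::size_t parentIndex(std::size_t k) {
    assert(k > 0);
    return (k - 1) / 2;
}
```

For the seven-node picture, the node at index 1 (B) has its children at
indexes 3 (D) and 4 (E), and parentIndex gives 1 for both 3 and 4. A child
index is only a real node if it is less than the number of elements stored.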
------------------------------------------
REPRESENTATION ISSUES

Wasted space?

How much space for tree of height N?

------------------------------------------
   ... O(2^N)
   That's bad, really bad if the tree is unbalanced.
   Only acceptable for trees that are nearly fully balanced
   (like what we'll work with!)

------------------------------------------
Unused array elements?

------------------------------------------
   ... use special data value (which shouldn't occur)
       would 0 be a good choice for sorting integers?
   ... use an array of structs (or parallel array) to tell what's used

** heapsort (HR pp. 640ff)

------------------------------------------
HEAPSORT (HR pp. 640ff)

Combines both ideas:
   - use fully balanced tree
        ==> O(N log N) time
   - use vector representation of trees
        ==> no cost to move data into tree
        ==> no cost to extract data
        ==> less overhead

Phase 1: arrange data so largest value
         is in root of each subtree

   def: a *heap* is

------------------------------------------
   ... a binary tree that stores the largest value
       in the root of each subtree.
   don't confuse with notion of heap = free store

Basic trick in phase 1 is to use breadth-first search,
which corresponds to iterating through the array

------------------------------------------
Phase 2: remove largest element,
         put into its place in sorted vector,
         repeat

------------------------------------------

*** examples

------------------------------------------
EXAMPLES OF HEAPSORT

              A                   0   A
                                  1   C
        C           D             2   D

Phase 1:

------------------------------------------

              C                   0   C
                                  1   A
        A           D             2   D


              D                   0   D
                                  1   A
        A           C             2   C

------------------------------------------
Phase 2:

              C                   0   C
                                  1   A
        A                         2   D

------------------------------------------

              A                   0   A
                                  1   C
                                  2   D

------------------------------------------
ANOTHER EXAMPLE

              A                   0   A
                                  1   C
        C           D             2   D
                                  3   I
     I     B     F     G          4   B
                                  5   F
    D                             6   G
                                  7   D

Phase 1:

------------------------------------------

remember this goes breadth-first, sees for each index 0..7,
how high up that can go, stopping when it gets to a parent bigger,
because the tree above is ...
already a heap!

first 2 steps as above:

              D                   0   D
                                  1   A
        A           C             2   C
                                  3   I
     I     B     F     G          4   B
                                  5   F
    D                             6   G
                                  7   D

now look at index 3 (I): have to demote A, because I is greater,
then see if I can go higher, and it does, up to the root.

              I                   0   I
                                  1   D
        D           C             2   C
                                  3   A
     A     B     F     G          4   B
                                  5   F
    D                             6   G
                                  7   D

now look at index 4, no change because D is greater

then look at 5 (F), swapped with C:

              I                   0   I
                                  1   D
        D           F             2   F
                                  3   A
     A     B     C     G          4   B
                                  5   C
    D                             6   G
                                  7   D

now look at index 6 (G), promote that over F

              I                   0   I
                                  1   D
        D           G             2   G
                                  3   A
     A     B     C     F          4   B
                                  5   C
    D                             6   F
                                  7   D

finally, look at index 7, promote that over A, but not over the other D

              I                   0   I
                                  1   D
        D           G             2   G
                                  3   D
     D     B     C     F          4   B
                                  5   C
    A                             6   F
                                  7   A

Q: suppose you start with C O M P U T I N G, what's the heap?

Phase 2:

keep track of the bottom of the array, and the value taken out of
the old bottom when moving the winner out

              I                   0   I
                                  1   D
        D           G             2   G
                                  3   D
     D     B     C     F          4   B
                                  5   C
                                  6   F
                                  7   I

   bottom = 7
   displacedVal = A

promote the larger of the two children, and fill the vacancy

              G                   0   G
                                  1   D
        D           G             2   G
                                  3   D
     D     B     C     F          4   B
                                  5   C
                                  6   F
                                  7   I

   bottom = 7
   displacedVal = A


              G                   0   G
                                  1   D
        D           F             2   F
                                  3   D
     D     B     C     A          4   B
                                  5   C
                                  6   A
                                  7   I

   bottom = 7
   displacedVal = A

Q: How was one step of phase 2 like phase 1?

Now remember displaced value, put top of heap in bottom, move bottom up

              G                   0   G
                                  1   D
        D           F             2   F
                                  3   D
     D     B     C                4   B
                                  5   C
                                  6   G
                                  7   I

   bottom = 6
   displacedVal = A

etc.

Q: How do we know to stop phase 2?
   When bottom = 1

*** coding

Q: What data needs to be kept for phase 1?
   counter
   original value in the ith index of vec
   indexes to place being promoted, and parent

Q: What's the invariant?
   that the tree above i is a heap

See p. 641 ff for details

Q: What data needs to be kept for phase 2?
   false bottom of vector
   the displaced value (temp for a swap)
   indexes to the place being considered vacant,
      and to left and right of that, which is larger

Q: What's the invariant?
   That vec[bottom..vsize-1] is sorted,
   that vec[0..bottom-1] is a heap
   and each element of vec[bottom..vsize-1] >= each element of vec[0..bottom-1]

See p. 644ff for details
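
The two phases and their invariants can be sketched in C++ as follows. This
is not the book's code (see the pages cited above for that); it is a minimal
sketch under some assumptions: the elements are ints, and the names siftUp,
fillVacancy and heapSort are invented here for illustration. One small
difference from the pictures: the vacancy stops moving down as soon as the
displaced value fits, rather than always going to the bottom of the heap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Phase 1 helper (hypothetical name): let vec[i] climb toward the root.
// Assumes vec[0..i-1] is already a heap, i.e. "the tree above i is a heap".
void siftUp(std::vector<int>& vec, std::size_t i) {
    int value = vec[i];                      // original value at index i
    while (i > 0) {
        std::size_t parent = (i - 1) / 2;
        if (vec[parent] >= value) break;     // parent is big enough: stop
        vec[i] = vec[parent];                // demote the parent
        i = parent;
    }
    vec[i] = value;
}

// Phase 2 helper (hypothetical name): the root of the heap vec[0..heapSize-1]
// is vacant. Promote the larger child into the vacancy until displacedVal
// can be placed without breaking the heap property.
void fillVacancy(std::vector<int>& vec, std::size_t heapSize, int displacedVal) {
    assert(heapSize >= 1);
    std::size_t vacant = 0;
    while (true) {
        std::size_t child = 2 * vacant + 1;            // left child
        if (child >= heapSize) break;                  // no children left
        if (child + 1 < heapSize && vec[child + 1] > vec[child]) {
            ++child;                                   // right child is larger
        }
        if (vec[child] <= displacedVal) break;         // displaced value fits
        vec[vacant] = vec[child];                      // promote larger child
        vacant = child;
    }
    vec[vacant] = displacedVal;
}

// Sorts vec into ascending order.
void heapSort(std::vector<int>& vec) {
    // Phase 1: after step i, vec[0..i] is a heap.
    for (std::size_t i = 1; i < vec.size(); ++i) {
        siftUp(vec, i);
    }
    // Phase 2: vec[bottom..size-1] is sorted and holds the largest values;
    // vec[0..bottom-1] is a heap. Stop when bottom reaches 1.
    std::size_t bottom = vec.size();
    while (bottom > 1) {
        --bottom;
        int displacedVal = vec[bottom];      // value taken out of old bottom
        vec[bottom] = vec[0];                // winner goes to its final place
        fillVacancy(vec, bottom, displacedVal);
    }
}
```

Calling heapSort on a vector holding 3 1 4 1 5 9 2 6 leaves it holding
1 1 2 3 4 5 6 9. Both loops do at most log N work per element, which is
where the worst case O(N log N) comes from, and no storage beyond the
vector and a few local variables is used.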