I. introduction to searching and hashing (HR 12.4-6) A. motivating question How to find info associated with a key in a (online) database? ------------------------------------------ SEARCHING AND HASHING (HR 12.4-6) MOTIVATING PROBLEM Given a collection of records, retrieve the record with a given key. Example: A poison control hotline. Records are struct Poison { String name; // the key String treatment; }; ------------------------------------------ B. overview ------------------------------------------ SEARCHING AND HASHING OVERVIEW Techniques Time cost A. linear search O(N) B. binary search O(log N) C. hashing O(1) on average Key insight: Changing the data structure leads to qualitative improvements ------------------------------------------ II. searching random access collections (HR 12.4) A. linear search (pp. 556-7) ------------------------------------------ LINEAR SEARCH PROBLEM // Poison.h #ifndef Poison_h #define Poison_h 1 #include "String.h" struct Poison { String name; // the key String treatment; }; #endif // LinearSearch.h #include "Poison.h" #include "bool.h" extern void LinearSearch( const Poison arr[], int size, const String & key, Boolean & found, int & loc); // PRE: 0 <= size // && 0..size-1 are legal indexes of arr // MODIFIES: found, loc // POST: found --> arr[loc].name == key // (NOT found) --> there is no element // in arr[0..size-1] with name field key ------------------------------------------ What does a caller have to do to get the treatment? Why not return a String (the treatment) instead of void? Why return the index of the record, instead of the treatment? ------------------------------------------ LINEAR SEARCH ALGORITHM // LinearSearch.C #include "LinearSearch.h" void LinearSearch( const Poison arr[], int size, const String & key, Boolean & found, int & loc) { ------------------------------------------ What's the time complexity of this? Is that good for the poison hotline, where every second counts? B. binary search Suppose you had a copy of the "Poison Dictionary". How would you look up the poison for the caller? ------------------------------------------ BINARY SEARCH OF SORTED ARRAY // BinarySearch.h #include "Poison.h" #include "bool.h" extern void BinarySearch( const Poison arr[], int size, const String & key, Boolean & found, int & loc); // PRE: 0 <= size <= INT_MAX / 2 // && 0..size-1 are legal indexes of arr // && arr is sorted in increasing order // MODIFIES: found, loc // POST: found --> arr[loc].name == key // (NOT found) --> there is no element // in arr[0..size-1] with name field key ------------------------------------------ ------------------------------------------ BINARY SEARCH ALGORITHM // BinarySearch.C #include "BinarySearch.h" void BinarySearch( const Poison arr[], int size, const String & key, Boolean & found, int & loc) { ------------------------------------------ (if time) What would this look like written recursively? 1. pictures 2. analysis (HR 12.5) ------------------------------------------ ANALYSIS OF SEARCH ALGORITHMS def: a *best case analysis* gives ideal (smallest) cost. Technique Best Case Cost linear search binary search ------------------------------------------ ------------------------------------------ def: a *worst case analysis* gives maximum (largest) cost. Technique Worst Case Cost linear search binary search ------------------------------------------ ------------------------------------------ def: a *average case analysis* gives a representative cost. Technique Average Case Cost linear search binary search ------------------------------------------ When size is 100,000, how many comparisons are needed in worst case for binary search? is there any way to get rid of that extra comparison in each loop? ------------------------------------------ IMPROVED BINARY SEARCH // ImprovedBinarySearch.C #include "ImprovedBinarySearch.h" void ImprovedBinarySearch( const Poison arr[], int size, const String & key, Boolean & found, int & loc) { int lowerIndex = 0; int upperIndex = size - 1; // INV: if key in arr[0..size-1].name, // key in a name field of // arr[lowerIndex..upperIndex] // && if 0 < size, then // lowerIndex <= upperIndex while (lowerIndex < upperIndex) { loc = (upperIndex + lowerIndex) / 2; if (key > arr[loc].name) { lowerIndex = loc + 1; } else { // ASSERT: upperIndex = loc; } } // ASSERT: // loc = lowerIndex; found = (0 < size && arr[loc].name == key); } ------------------------------------------ Can you fill in the assertions? To summarize, what's being saved here over the old binary search? III. hashing (HR 12.6) ------------------------------------------ HASHING (HR 12.6) THE PROBLEM Can we search in O(1) time? THE SOLUTION IDEA Idea: use a "hash function" to map key -> index then extract information in O(1) time using the index from a "hash table". Picture: ------------------------------------------ A. hash functions (p. 565) ------------------------------------------ HASH FUNCTIONS Problem: Map a poison name to 0..TAB_SIZE. Want to "spread out" the results. Solutions? 1. hash1(s) == 0 2. hash2(s) == int(s[0]) % TAB_SIZE 3. ------------------------------------------ ------------------------------------------ A HASH FUNCTION FOR STRINGS // SumCharsMod.h #include "String.h" extern int SumCharsMod(const String & s, int n); // PRE: n > 0 // POST: FCTVAL = the sum of the // character codes in s, modulo n. // SumCharsMod.C #include "SumCharsMod.h" int SumCharsMod(const String & s, int n) { int s_size = Length(s); int hash_val = 0; // INV: hash_val is the sum of // the char codes s[0..i-1] mod n // && i <= s_size for (int i = 0; i < s_size; i++) { hash_val = (hash_val + int(s[i])) % n; } return hash_val; } ------------------------------------------ why isn't hash_val += int(s[i]) % n; correct? B. collisions (p. 567 ff.) ------------------------------------------ COLLISIONS def: a *collision* is two keys that a hash function maps to the same index. Example: SumCharsMod(String("arsenic", 301)) = SumCharsMod(String("iron", 301)) = 173 SumCharsMod(String("barbitone", 301)) = SumCharsMod(String("copper", 301)) = 47 SumCharsMod(String("meprobamate", 301)) = SumCharsMod(String("thallium", 301)) = 262 SumCharsMod(String( "phenobarbitone", 1009)) = SumCharsMod(String( "cholinesterase inhibitor", 1009)) = 479 ------------------------------------------ 1. collision resolution ------------------------------------------ COLLISION RESOLUTION OVERVIEW linear probing: if detect a collision, look at the next index, modulo table size variations: quadratic probing: +1, -1, +4, -4, +9.. random probing: add random number rehashing: use another hash function chaining: list of synonyms for each index ------------------------------------------ 2. linear probing ------------------------------------------ LINEAR PROBING name treatment |-----------------------------| 0 | gasoline | don't light ..| |-------------|---------------| 1 | arsenic | see a doctor | |-------------|---------------| 2 | iodine | apply victum..| |-------------|---------------| 3 | | | |-------------|---------------| 4 | | | |-------------|---------------| 5 | | | |-------------|---------------| 6 | caffine | induce vomit..| |-------------|---------------| 7 | | | |-------------|---------------| 8 | kelp | drink seawat..| |-------------|---------------| 9 | | | |-----------------------------| ------------------------------------------ ------------------------------------------ PROBLEM: CLUSTERING def: primary clustering results from keys that hash to adjacent indexes. example: gasoline, arsenic and iodine. def: secondary clustering results when different clusters merge to form a larger cluster. examples: ovaltine, nicotine ------------------------------------------ what's an easy way to reduce the problems of clustering without writing new code? 3. chaining ------------------------------------------ CHAINING name tre link |----| 0 | *---> [gasoline|do..| ] |----| 1 | *---> [arsenic |se..| ] |----| 2 | *---> [iodine |ap..| ] |----| 3 | | |----| 4 | | |----| 5 | | |----| 6 | *---> [caffine |in..| ] |----| 7 | | |----| 8 | *---> [kelp |dr..| ] |----| 9 | | |----| ------------------------------------------ ------------------------------------------ CAVEATS WITH CHAINING What is the worst case search time? How does that happen? ------------------------------------------ ------------------------------------------ EFFICIENCY OF HASHING Factors: - quality of hash function (for the data) - collision resolution technique - availabilty of space def: the *load factor* of a hash table is n/s, where n = number of entries and s = size of table ------------------------------------------ what is the load factor for a full table with linear probing? can a table have a load factor of more than 1 with chaining? C. comparison (p. 572) ------------------------------------------ SEARCHING COMPARISON |linear binary chained |search search hash table |============================= special | none sorted hash requirements| function |----------------------------- worst case | time cost | |----------------------------- average case| time cost | ------------------------------------------ what's the time cost measured in? If we had to sort the data by key anyway, which should we use? Is a chained hash table best for the poison control center?