I. introduction to searching and hashing (HR 12.4-6)
 A. motivating question
	How to find info associated with a key in a (online) database?
------------------------------------------
   SEARCHING AND HASHING (HR 12.4-6)

          MOTIVATING PROBLEM

Given a collection of records,
retrieve the record with a given key.

Example:
   A poison control hotline.
   Records are

      struct Poison {
        String name;      // the key
        String treatment;
      };
------------------------------------------
 B. overview
------------------------------------------
     SEARCHING AND HASHING
           OVERVIEW

Techniques	     Time cost
 A. linear search    O(N)
 B. binary search    O(log N)
 C. hashing          O(1) on average

Key insight:

    Changing the data structure
    leads to qualitative improvements
------------------------------------------
II. searching random access collections (HR 12.4)
 A. linear search (pp. 556-7)
------------------------------------------
        LINEAR SEARCH PROBLEM

// Poison.h
#ifndef Poison_h
#define Poison_h 1
#include "String.h"
struct Poison {
  String name;      // the key
  String treatment;
};
#endif

// LinearSearch.h
#include "Poison.h"
#include "bool.h"

extern void LinearSearch(
	     const Poison arr[],
	     int size,
	     const String & key,
	     Boolean & found,
	     int & loc);
  // PRE: 0 <= size
  // && 0..size-1 are legal indexes of arr
  // MODIFIES: found, loc
  // POST: found --> arr[loc].name == key
  // (NOT found) --> there is no element
  // in arr[0..size-1] with name field key
------------------------------------------
	What does a caller have to do to get the treatment?
	Why not return a String (the treatment) instead of void?
	Why return the index of the record, instead of the treatment?
------------------------------------------
          LINEAR SEARCH ALGORITHM

// LinearSearch.C
#include "LinearSearch.h"

void LinearSearch(
	     const Poison arr[],
	     int size,
	     const String & key,
	     Boolean & found,
	     int & loc)
{


------------------------------------------
	What's the time complexity of this?
	Is that good for the poison hotline, where every second counts?
 B. binary search
	Suppose you had a copy of the "Poison Dictionary".
		How would you look up the poison for the caller?
------------------------------------------
	BINARY SEARCH OF SORTED ARRAY

// BinarySearch.h
#include "Poison.h"
#include "bool.h"

extern void BinarySearch(
	     const Poison arr[],
	     int size,
	     const String & key,
	     Boolean & found,
	     int & loc);
  // PRE: 0 <= size <= INT_MAX / 2
  // && 0..size-1 are legal indexes of arr
  // && arr is sorted in increasing order
  // MODIFIES: found, loc
  // POST: found --> arr[loc].name == key
  // (NOT found) --> there is no element
  // in arr[0..size-1] with name field key
------------------------------------------
------------------------------------------
       BINARY SEARCH ALGORITHM

// BinarySearch.C
#include "BinarySearch.h"
void BinarySearch(
	     const Poison arr[],
	     int size,
	     const String & key,
	     Boolean & found,
	     int & loc)
{


------------------------------------------
	(if time) What would this look like written recursively?
  1. pictures
  2. analysis (HR 12.5)
------------------------------------------
      ANALYSIS OF SEARCH ALGORITHMS

def: a *best case analysis* gives ideal
     (smallest) cost.

   Technique		Best Case Cost
   linear search
   binary search
------------------------------------------
------------------------------------------
def: a *worst case analysis* gives maximum
     (largest) cost.


   Technique		Worst Case Cost
   linear search
   binary search
------------------------------------------
------------------------------------------
def: a *average case analysis* gives
     a representative cost.


   Technique		Average Case Cost
   linear search
   binary search
------------------------------------------
	When size is 100,000, how many comparisons are needed in worst case
		for binary search?
	is there any way to get rid of that extra comparison in each loop?
------------------------------------------
    IMPROVED BINARY SEARCH

// ImprovedBinarySearch.C
#include "ImprovedBinarySearch.h"
void ImprovedBinarySearch(
	     const Poison arr[],
	     int size,
	     const String & key,
	     Boolean & found,
	     int & loc)
{
  int lowerIndex = 0;
  int upperIndex = size - 1;

  // INV: if key in arr[0..size-1].name,
  //  key in a name field of
  // arr[lowerIndex..upperIndex]
  // && if 0 < size, then
  // lowerIndex <= upperIndex
  while (lowerIndex < upperIndex) {
    loc = (upperIndex + lowerIndex) / 2;
    if (key > arr[loc].name) {
      lowerIndex = loc + 1;
    } else {
      // ASSERT:
      upperIndex = loc;
    }
  }
  // ASSERT: 
  //
  loc = lowerIndex;
  found = (0 < size
           && arr[loc].name == key);
}
------------------------------------------
	Can you fill in the assertions?
	To summarize, what's being saved here over the old binary search?
III. hashing (HR 12.6)
------------------------------------------
         HASHING (HR 12.6)
            THE PROBLEM

Can we search in O(1) time?


	    THE SOLUTION IDEA

Idea: use a "hash function" to map
	key -> index
then extract information in O(1) time
using the index from a "hash table".

Picture:


------------------------------------------
 A. hash functions (p. 565)
------------------------------------------
          HASH FUNCTIONS

Problem:

   Map a poison name to 0..TAB_SIZE.
   Want to "spread out" the results.


Solutions?

  1. hash1(s) == 0

  2. hash2(s) == int(s[0]) % TAB_SIZE

  3.
------------------------------------------
------------------------------------------
          A HASH FUNCTION FOR STRINGS

// SumCharsMod.h
#include "String.h"

extern int SumCharsMod(const String & s,
                       int n);
  // PRE: n > 0
  // POST: FCTVAL = the sum of the
  // character codes in s, modulo n.


// SumCharsMod.C
#include "SumCharsMod.h"

int SumCharsMod(const String & s, int n)
{
  int s_size = Length(s);
  int hash_val = 0;

  // INV: hash_val is the sum of
  // the char codes s[0..i-1] mod n
  // && i <= s_size
  for (int i = 0; i < s_size; i++) {
    hash_val = (hash_val + int(s[i])) % n;
  }
  return hash_val;
}
------------------------------------------
	why isn't
		hash_val += int(s[i]) % n;
	 correct?
 B. collisions (p. 567 ff.)
------------------------------------------
           COLLISIONS

def: a *collision* is two keys that a hash
  function maps to the same index.

Example:
  SumCharsMod(String("arsenic", 301))
  = SumCharsMod(String("iron", 301))
  = 173

  SumCharsMod(String("barbitone", 301))
  = SumCharsMod(String("copper", 301))
  = 47

  SumCharsMod(String("meprobamate", 301))
  = SumCharsMod(String("thallium", 301))
  = 262

  SumCharsMod(String(
        "phenobarbitone", 1009))
  = SumCharsMod(String(
        "cholinesterase inhibitor", 1009))
  = 479

------------------------------------------
  1. collision resolution
------------------------------------------
         COLLISION RESOLUTION
	      OVERVIEW

linear probing:
   if detect a collision, look at the next
   index, modulo table size

variations:
   quadratic probing: +1, -1, +4, -4, +9..
   random probing: add random number
   rehashing: use another hash function

chaining:
   list of synonyms for each index
------------------------------------------
  2. linear probing
------------------------------------------
         LINEAR PROBING

	name         treatment
     |-----------------------------|
   0 | gasoline    | don't light ..|
     |-------------|---------------|
   1 | arsenic     | see a doctor  |
     |-------------|---------------|
   2 | iodine      | apply victum..|
     |-------------|---------------|
   3 |             |               |
     |-------------|---------------|
   4 |             |               |
     |-------------|---------------|
   5 |             |               |
     |-------------|---------------|
   6 | caffine     | induce vomit..|
     |-------------|---------------|
   7 |             |               |
     |-------------|---------------|
   8 | kelp        | drink seawat..|
     |-------------|---------------|
   9 |             |               |
     |-----------------------------|


------------------------------------------
------------------------------------------
	PROBLEM: CLUSTERING

def: primary clustering results from keys
    that hash to adjacent indexes.

    example: gasoline, arsenic and iodine.

def: secondary clustering results when
    different clusters merge to form
    a larger cluster.

    examples: ovaltine, nicotine
------------------------------------------
	what's an easy way to reduce the problems of clustering
		without writing new code?
  3. chaining
------------------------------------------
              CHAINING

	    name     tre link
  |----|
0 |  *---> [gasoline|do..|  ]
  |----|
1 |  *---> [arsenic |se..|  ]
  |----|
2 |  *---> [iodine  |ap..|  ]
  |----|
3 |    |
  |----|
4 |    |
  |----|
5 |    |
  |----|
6 |  *---> [caffine |in..|  ]
  |----|
7 |    |
  |----|
8 |  *---> [kelp    |dr..|  ]
  |----|
9 |    |
  |----|


------------------------------------------
------------------------------------------
            CAVEATS WITH CHAINING

What is the worst case search time?


How does that happen?


------------------------------------------
------------------------------------------
         EFFICIENCY OF HASHING

Factors:
 - quality of hash function (for the data)
 - collision resolution technique
 - availabilty of space

def: the *load factor* of a hash table
  is n/s, where n = number of entries
            and s = size of table
------------------------------------------
	what is the load factor for a full table with linear probing?
	can a table have a load factor of more than 1 with chaining?
 C. comparison (p. 572)
------------------------------------------
         SEARCHING COMPARISON

            |linear   binary    chained
            |search   search    hash table
            |=============================
special     | none    sorted    hash
requirements|                   function
	    |-----------------------------
worst case  |
time cost   |
	    |-----------------------------
average case|
time cost   |
------------------------------------------
	what's the time cost measured in?
	If we had to sort the data by key anyway, which should we use?
	Is a chained hash table best for the poison control center?