14b: Hash Tables
ICS211, Fall 2012
Dr. Zach
Status check
- Exam 3 should be graded by Monday... Solutions posted.
- A11 → Correction to Huffman tree algorithm
- Huffman should always have at least 1 internal root node (slides 10, 14)
- eCafe is open (course evals)
- William Albritton will be teaching ICS212 next term (Sp13) at LCC, T/TH, 5-6:15pm. (At UHM, 212 will be online + lab.)
- Binary ops
ADT performance so far
- Stacks and Queues: Add, remove, get in O(1), but only at ends
- Lists: Add, remove, get in O(n) (or O(1) at an iterator's position; get is O(1) with an array-backed list)
- BSTs: Add, remove, get in O(lg n) when balanced (heap: O(1) get, but only of the top element)
- Can we do better? Say, O(1) to add, remove, and get (random access)?
Hash tables
- Array (or array-like data structure)
- Hash function: given an item, compute what cell it goes into
- Similar concept to radix sort (and other distribution sorts)
- Example: Putting playing cards into a PlayingCard[52]
- this is direct addressing (basically)
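A minimal sketch of the playing-card example. The PlayingCard fields (suit 0-3, rank 1-13), the index formula, and the CardTable class are assumptions for illustration; the slides do not define them.

```java
// Hypothetical card type (an assumption): suit is 0..3, rank is 1..13.
class PlayingCard {
    final int suit;
    final int rank;

    PlayingCard(int suit, int rank) {
        this.suit = suit;
        this.rank = rank;
    }

    // Direct addressing: each of the 52 distinct cards gets its own index 0..51.
    int index() {
        return suit * 13 + (rank - 1);
    }
}

class CardTable {
    private final PlayingCard[] cells = new PlayingCard[52];

    // O(1): the card itself says which cell it belongs in; no collisions possible.
    void add(PlayingCard card) {
        cells[card.index()] = card;
    }

    // O(1): look in the one cell where this card could be.
    boolean contains(PlayingCard card) {
        return cells[card.index()] != null;
    }
}
```

Because there are exactly as many cells as possible cards, no two cards ever compete for a cell.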
Hash function
- direct addressing doesn't scale
- consider: UH student IDs
- required array too big
- lots of wasted space (empty cells)
- non-integers: Strings, complex objects, etc.
- a function that maps the element (later: key) to array index
- index in range of array
- simplest: modulo
- example: 10 UH students into an array of size 10
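A minimal sketch of the modulo hash. The class name and the ID values mentioned below are made up for illustration, and the sketch assumes keys are non-negative.

```java
class ModuloHash {
    // Simplest hash fn: the remainder is always in 0..tableSize-1
    // (assuming key is non-negative and tableSize > 0).
    static int hash(int key, int tableSize) {
        return key % tableSize;
    }
}
```

With a table of size 10, the made-up IDs 12345678 and 20481357 land in cells 8 and 7. The made-up ID 87654328 also lands in cell 8, so even 10 students in 10 cells can compete for the same cell.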
Collisions
- Possible whenever not using direct addressing
- range of possible keys > array range, so two different keys can map to the same index
- Two (basic) ways to handle: open addressing, chaining
Open addressing: Linear probing
- Linear probing: On collision, put in next cell over... and over... (wrap around array)
- Try it:
- Hash table: Array of size 7
- Hash fn: abs(key) % 7 → index
- Keys to add: 4, 8, -3, 24, 18, 13
- Full? Can track size separately or detect return to original hash index
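One way to check a hand trace of the exercise above is a small sketch like the following. The class and method names are hypothetical; it stores int keys only, uses null for an empty cell, tracks size separately to detect a full table, and does not check for duplicates.

```java
class LinearProbingTable {
    private final Integer[] cells;   // null means "empty"
    private int size = 0;            // tracked separately to detect a full table

    LinearProbingTable(int capacity) {
        cells = new Integer[capacity];
    }

    private int hash(int key) {
        return Math.abs(key) % cells.length;
    }

    // Stores key and returns the index it ended up in, or -1 if the table is full.
    int add(int key) {
        if (size == cells.length) {
            return -1;
        }
        int i = hash(key);
        while (cells[i] != null) {
            i = (i + 1) % cells.length;   // next cell over, wrapping around
        }
        cells[i] = key;
        size++;
        return i;
    }

    @Override
    public String toString() {
        return java.util.Arrays.toString(cells);
    }
}
```

Adding 4, 8, -3, 24, 18, 13 in that order to a table of size 7 returns indexes 4, 1, 3, 5, 6, 0, and toString() then gives [13, 8, null, -3, 4, 24, 18]. Note how 24, 18, and 13 each had to probe past occupied cells.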
Open addressing: Getting item back out
- Retrieve objects using original key value
- Given your hash, trace through:
- Linear probing required again... hash to find the start, then keep probing until you find the key or hit an empty cell.
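The lookup described above could be sketched as follows. The names are hypothetical, null marks an empty cell, and the hash fn is the abs(key) % size used in the earlier exercise.

```java
class ProbeSearch {
    // Returns the index of key in cells, or -1 if it is not in the table.
    static int find(Integer[] cells, int key) {
        int start = Math.abs(key) % cells.length;
        int i = start;
        do {
            if (cells[i] == null) {
                return -1;                 // empty cell: key would have been placed here
            }
            if (cells[i] == key) {
                return i;                  // found it (Integer is unboxed for ==)
            }
            i = (i + 1) % cells.length;    // keep probing, wrapping around
        } while (i != start);
        return -1;                         // back at the start: table is full, key absent
    }
}
```

On the table {13, 8, null, -3, 4, 24, 18}, finding 18 starts at index 4 and probes to index 6. Finding 11 (which also hashes to 4) probes all the way around to the empty cell at index 2 before giving up.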
Open Addressing: Removal
- Since finding uses probing, deletion can mess things up...
- Find and remove: 24
- Now see if contains 18.
- Solution: Mark removed with special value (NIL, for example)
- Probe past the marker when seeking; can overwrite it when adding.
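A sketch of removal with a marker. Instead of storing a NIL value in the cell, this version keeps a parallel state array (EMPTY, FULL, REMOVED); that is an assumption of the sketch, as are all the names.

```java
class TombstoneTable {
    private static final int EMPTY = 0, FULL = 1, REMOVED = 2;
    private final int[] keys;
    private final int[] state;   // Java zero-fills, so every cell starts EMPTY

    TombstoneTable(int capacity) {
        keys = new int[capacity];
        state = new int[capacity];
    }

    private int hash(int key) {
        return Math.abs(key) % keys.length;
    }

    // Index of key, or -1. Probes past REMOVED cells; stops at an EMPTY one.
    private int find(int key) {
        int start = hash(key);
        for (int n = 0; n < keys.length; n++) {
            int i = (start + n) % keys.length;
            if (state[i] == EMPTY) return -1;
            if (state[i] == FULL && keys[i] == key) return i;
        }
        return -1;
    }

    boolean contains(int key) {
        return find(key) >= 0;
    }

    // Returns false if key is already present or the table is full.
    boolean add(int key) {
        if (find(key) >= 0) return false;
        int start = hash(key);
        for (int n = 0; n < keys.length; n++) {
            int i = (start + n) % keys.length;
            if (state[i] != FULL) {        // EMPTY or REMOVED cells can be reused
                keys[i] = key;
                state[i] = FULL;
                return true;
            }
        }
        return false;
    }

    boolean remove(int key) {
        int i = find(key);
        if (i < 0) return false;
        state[i] = REMOVED;                // leave a marker so later probes keep going
        return true;
    }
}
```

After adding 4, 8, -3, 24, 18, 13 to a table of size 7 and removing 24, contains(18) still returns true because the probe steps over the REMOVED cell at index 5 instead of stopping there.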
Avoiding collisions: Load factor
- O(1) if no collisions, but up to O(n) if all mapped to one cell (bucket)
- Load factor
- = (elements in table) / (size of table)
- Around 0.7 is usually good
- From textbook (average probes per search, linear probing): load factor 0.5 → 1.5 probes; 0.75 → 2.5; 0.9 → 5.5
- Expand the table size
- Need to rehash everything.
- Consider: size from 7 to 14, element 8 (8 % 7 = 1, but 8 % 14 = 8, so it has to move)
- Performance hit, though can spread out over inserts by running both tables for a while
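A sketch of the load factor and of rehashing into a bigger table. The names are hypothetical. The expectedProbes formula is the usual approximation for linear probing, which happens to reproduce the probe counts quoted above; treat it as an assumption about where those numbers come from. The rehash method assumes newCapacity is larger than the number of stored keys.

```java
class LoadFactorDemo {
    // Load factor = (elements in table) / (size of table).
    static double loadFactor(int elements, int tableSize) {
        return (double) elements / tableSize;
    }

    // Common approximation for linear probing: (1 + 1 / (1 - L)) / 2.
    static double expectedProbes(double load) {
        return 0.5 * (1.0 + 1.0 / (1.0 - load));
    }

    // Re-inserts every key into a fresh, larger table (linear probing, null = empty).
    // Each key's index must be recomputed because it depends on the table size.
    static Integer[] rehash(Integer[] old, int newCapacity) {
        Integer[] bigger = new Integer[newCapacity];
        for (Integer key : old) {
            if (key == null) continue;
            int i = Math.abs(key) % newCapacity;
            while (bigger[i] != null) {
                i = (i + 1) % newCapacity;
            }
            bigger[i] = key;
        }
        return bigger;
    }
}
```

Rehashing {13, 8, null, -3, 4, 24, 18} from size 7 to size 14 moves 8 from index 1 to index 8 and drops the load factor from about 0.86 to about 0.43.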
Avoiding collisions: Hash fn
- Good hash function = hard to do!
- Hash fn: hash(x) = 1? No, very bad (every element collides in the same cell).
- Hash fn: (int) (Math.random() * array.length)
- NO! Hash must be deterministic
- Two equal objects must always hash to same bucket (will revisit this with .equals method)
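- Example with a table size that is a power of 10: 93, 43, 73, 63, ... all hash to 3 (only the last digit matters)
- Table sizes that are powers of 2 have this effect in binary (only the low bits matter)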
- Prime number array length usually works well
- Strings: sum the character values and then % table size
- But consider: snail, slain, nails (anagrams: same sum, so same bucket).
- A more complicated hash fn includes position
- (BTW: Can also hash entire files like this, such as for cryptography or file ID/fingerprinting)
- What if our values were all 1 to 10, % 7? (cells 1-3 get two values each; cells 0, 4, 5, 6 get one each)
- A good hash fn should map to all cells with equal likelihood
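A sketch comparing the two string hashes. The class and method names are hypothetical. The positional version uses the multiply-by-31 scheme that Java documents for String.hashCode(); other multipliers would work too.

```java
class StringHashes {
    // Sum of character values: ignores position, so anagrams always collide.
    static int sumHash(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    // Positional hash: each character is weighted by its position (powers of 31).
    static int positionalHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);   // may overflow for long strings; that is fine
        }
        return h;
    }

    // Reduce any hash (possibly negative after overflow) to 0..tableSize-1.
    static int toIndex(int hash, int tableSize) {
        return Math.floorMod(hash, tableSize);
    }
}
```

sumHash gives the same value for "snail", "slain", and "nails", while positionalHash gives three different values. Different hash values can still land in the same cell after the % step, so collisions are reduced, not eliminated.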
Avoiding collisions: Avoiding clusters
- Quadratic probing
- instead of +x (where x = 1, 2, 3, 4), +x^2
- still get clumps, but spread out a bit
- Double hashing
- secondary hash fn (and then probing)
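A sketch of the probe sequences, where x counts the probes made so far (0 for the first try). All names are hypothetical. The double-hashing version follows one common textbook formulation, in which the second hash sets the step size; the particular second hash used here, 1 + abs(key) % (size - 1), is an assumption and not something from these slides.

```java
class ProbeSequences {
    // Linear probing: h, h+1, h+2, h+3, ...
    static int linear(int h, int x, int size) {
        return (h + x) % size;
    }

    // Quadratic probing: h, h+1, h+4, h+9, ...
    static int quadratic(int h, int x, int size) {
        return (h + x * x) % size;
    }

    // Double hashing: the step size comes from a second hash of the key,
    // so keys that collide on the first hash usually follow different paths.
    static int doubleHash(int key, int x, int size) {
        int h1 = Math.abs(key) % size;
        int step = 1 + Math.abs(key) % (size - 1);   // assumed second hash; never 0
        return (h1 + x * step) % size;
    }
}
```

With a table of size 7 and a first hash of 3, quadratic probing visits 3, 4, 0, 5. With double hashing, keys 24 and -3 both start at 3, but 24 continues 4, 5 while -3 continues 0, 4.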
Chaining
- Alternative to open addressing
- Generally simpler and more common
- buckets: Each cell is head of a linked list (stack)
- Add: hash to bucket, add at head of list
- Find: hash to bucket, loop through list there
- Remove: hash to bucket, loop through to element, and remove
- Related concept: Bucket sort
Chaining: Try it
- Hash table: Array of size 7
- Hash fn: abs(key) % 7 → index
- Keys to add: 4, 8, -3, 24, 18, 13
- Contains: 8, 4, 19, 17
- Remove: 18, -3
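One way to check a hand trace of the chaining exercise is a sketch like this. The names are hypothetical; it stores int keys, adds at the head of each bucket's list, and skips duplicates.

```java
class ChainedTable {
    // Singly linked list node; each bucket is the head of one list.
    private static class Node {
        final int key;
        Node next;

        Node(int key, Node next) {
            this.key = key;
            this.next = next;
        }
    }

    private final Node[] buckets;

    ChainedTable(int capacity) {
        buckets = new Node[capacity];
    }

    private int hash(int key) {
        return Math.abs(key) % buckets.length;
    }

    // Find: hash to bucket, loop through the list there.
    boolean contains(int key) {
        for (Node n = buckets[hash(key)]; n != null; n = n.next) {
            if (n.key == key) return true;
        }
        return false;
    }

    // Add: hash to bucket, add at head of list (duplicates are skipped).
    void add(int key) {
        if (contains(key)) return;
        int i = hash(key);
        buckets[i] = new Node(key, buckets[i]);
    }

    // Remove: hash to bucket, loop through to the element, and unlink it.
    boolean remove(int key) {
        int i = hash(key);
        Node prev = null;
        for (Node n = buckets[i]; n != null; prev = n, n = n.next) {
            if (n.key == key) {
                if (prev == null) buckets[i] = n.next;
                else prev.next = n.next;
                return true;
            }
        }
        return false;
    }

    // Contents of one bucket, head first, for tracing.
    String bucket(int i) {
        StringBuilder sb = new StringBuilder();
        for (Node n = buckets[i]; n != null; n = n.next) {
            if (sb.length() > 0) sb.append(" -> ");
            sb.append(n.key);
        }
        return sb.toString();
    }
}
```

After adding 4, 8, -3, 24, 18, 13 to a table of size 7, bucket 3 holds 24 -> -3 and bucket 4 holds 18 -> 4. Lookups for 8 and 4 succeed; 19 and 17 are not found. Removing 18 and -3 leaves 4 alone in bucket 4 and 24 alone in bucket 3. No marker cells are needed, unlike with open addressing.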
Chaining Limitations
- Load factor still a factor
- no clusters to compound the problem though
- still want to expand table to keep lists short
- if lists under constant limit, O(1)
- Still need a good hash fn to spread over all buckets
- Hash fn, while constant-time, is not zero-time
- e.g., complex objects with lots of fields take longer to hash
Iteration
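- List all elements in the hash table
- Open addressing: scan the array, skipping empty (and removed) cells; chaining: walk each bucket's list
- Fairly easy/low cost to also maintain a doubly-linked list that preserves insertion order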
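A sketch of iterating over an open-addressed table, assuming null marks an empty cell. The names are hypothetical.

```java
class TableIteration {
    // Lists every element by scanning the whole array, so the cost depends on
    // the table size, not just on the number of elements stored.
    static java.util.List<Integer> elements(Integer[] cells) {
        java.util.List<Integer> out = new java.util.ArrayList<>();
        for (Integer key : cells) {
            if (key != null) {
                out.add(key);
            }
        }
        return out;
    }
}
```

For the table {13, 8, null, -3, 4, 24, 18}, this lists 13, 8, -3, 4, 24, 18: array order, not the order in which the keys were added (4, 8, -3, 24, 18, 13). Java's java.util.LinkedHashMap is documented as keeping a doubly-linked list through its entries for exactly this reason.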
Summary
- Hash tables: Array + a hash fn to determine which bucket an element goes into
- Direct addressing (or other perfect hashing): no collisions
- In real world:
- Good hash function to spread things (hard to do well)
- manage collisions
- Open addressing (with linear probing, or other fall-backs)
- Chaining (simpler)
- Keep load factor down to get required performance
- Considered to be O(1) given these constraints
- If done very badly, O(n)
For next time...
- Have a good Thanksgiving
- A11 (start now! 80 points, lots of sequential steps, all-or-nothing outcome)
- Quiz 12 to be posted
- Will record EC as soon as we get to it...
- Next time: Maps and Sets (ADTs that rely on hash tables)