Status check

Exam 3 should be graded by Monday... Solutions posted.
A11 → Correction to Huffman tree algorithm
- Huffman should always have at least 1 internal root node (slides 10, 14)
eCafe is open (course evals)
William Albritton will be teaching ICS212 next term (Sp13) at LCC, T/TH, 5-6:15pm. (At UHM, 212 will be online + lab.)
Binary ops

ADT performance so far

Stacks and Queues: Add, remove, get in O(1), but only at ends
Lists: Add, remove, get in O(n) (or O(1) at an iterator's position, or for get on an array-backed list)
BSTs: Add, remove, get in O(lg n) when balanced (heap: O(1) get of the top element)
Can we do better? Say, O(1) to add, remove, and get (random access)?
- Yes: Hash tables

Hash tables

Array (or array-like data structure)
Hash function: given an item, compute what cell it goes into
Similar concept to radix sort (and other distribution sorts)
Example: Putting playing cards into a PlayingCard[52]
- this is direct addressing (basically)
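
A minimal sketch of the playing-card example. The suit/rank numbering and the use of String in place of a PlayingCard class are assumptions made for illustration; the point is only that each of the 52 cards gets its own cell.

```java
// Direct addressing sketch: every possible card has its own cell, so no collisions.
class DirectAddressDemo {
    static final int RANKS = 13; // assumed encoding: ace = 0 ... king = 12
    static final int SUITS = 4;  // assumed encoding: clubs = 0 ... spades = 3

    // Maps a card to a unique index in 0..51.
    static int indexOf(int suit, int rank) {
        return suit * RANKS + rank;
    }

    public static void main(String[] args) {
        String[] table = new String[SUITS * RANKS]; // stands in for PlayingCard[52]
        table[indexOf(3, 0)] = "ace of spades";
        table[indexOf(0, 12)] = "king of clubs";
        System.out.println(indexOf(3, 0));           // 39
        System.out.println(table[indexOf(0, 12)]);   // king of clubs
    }
}
```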

Hash function

direct addressing doesn't scale
- consider: UH student IDs
  - required array too big
  - lots of wasted space (empty cells)
- non-integers: Strings, complex objects, etc.
a function that maps the element (later: key) to array index
- index in range of array
- simplest: modulo
- example: 10 UH students into an array of size 10
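
A small sketch of the modulo approach. The student IDs below are made up for illustration (not real UH IDs); with a table of size 10, only the last digit of each ID decides the cell, so two different IDs can land in the same place.

```java
// Modulo hashing sketch: map a large integer key into the range of the array.
class ModuloHashDemo {
    // Simplest hash function: remainder after dividing by the table length.
    static int hash(int studentId, int tableLength) {
        return Math.abs(studentId) % tableLength;
    }

    public static void main(String[] args) {
        int[] ids = {12345678, 23456781, 12345688}; // hypothetical IDs
        for (int id : ids) {
            System.out.println(id + " -> cell " + hash(id, 10));
        }
        // The first and third IDs both map to cell 8: a collision.
    }
}
```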

Collisions

Possible whenever not using direct addressing
Range of keys > array range, so two different keys can map to the same cell
Two (basic) ways to handle: open addressing, chaining

Open addressing: Linear probing

Linear probing: On collision, put in next cell over... and over... (wrap around array)
Try it:
- Hash table: Array of size 7
- Hash fn: abs(key) % 7 → index
- Keys to add: 4, 8, -3, 24, 18, 13
Full? Can track size separately or detect return to original hash index
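
One way to check your answer to the exercise: a minimal sketch of insertion with linear probing. The class and method names are illustrative only. Here "full" is detected by probing all the way back around to the original hash index.

```java
import java.util.Arrays;

// Linear probing sketch: on collision, try the next cell, wrapping around.
class LinearProbingInsert {
    static int hash(int key, int length) {
        return Math.abs(key) % length;
    }

    // Adds key to the table; returns the index used, or -1 if the table is full.
    static int add(Integer[] table, int key) {
        int start = hash(key, table.length);
        int i = start;
        do {
            if (table[i] == null) {
                table[i] = key;
                return i;
            }
            i = (i + 1) % table.length; // next cell over, wrapping around
        } while (i != start);           // back at the start: table is full
        return -1;
    }

    // Builds a table of the given size and adds the keys in order.
    static Integer[] build(int size, int... keys) {
        Integer[] table = new Integer[size];
        for (int key : keys) {
            add(table, key);
        }
        return table;
    }

    public static void main(String[] args) {
        Integer[] table = build(7, 4, 8, -3, 24, 18, 13);
        System.out.println(Arrays.toString(table)); // [13, 8, null, -3, 4, 24, 18]
    }
}
```

Note how 24, 18, and 13 each end up one or more cells away from their hash index, and 13 wraps around to cell 0.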

Open addressing: Getting item back out

Retrieve objects using original key value
Given your hash, trace through:
- Contains: 8, 24, 19
Linear probing required again... hash to find the start, then keep probing until you find the item or hit an empty cell.
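
A sketch of lookup under linear probing, for checking the trace. The table contents are hard-coded to the state worked out by hand from the earlier exercise (size 7, keys 4, 8, -3, 24, 18, 13); the names are illustrative only.

```java
// Lookup sketch: start at the hash index, probe until found or an empty cell.
class LinearProbingContains {
    static int hash(int key, int length) {
        return Math.abs(key) % length;
    }

    static boolean contains(Integer[] table, int key) {
        int start = hash(key, table.length);
        int i = start;
        do {
            if (table[i] == null) {
                return false;           // empty cell: key cannot be further along
            }
            if (table[i] == key) {
                return true;            // Integer is unboxed for the comparison
            }
            i = (i + 1) % table.length;
        } while (i != start);           // probed every cell without success
        return false;
    }

    public static void main(String[] args) {
        Integer[] table = {13, 8, null, -3, 4, 24, 18};
        System.out.println(contains(table, 8));  // true: found at its hash index
        System.out.println(contains(table, 24)); // true: found after two extra probes
        System.out.println(contains(table, 19)); // false: probes 5, 6, 0, 1, then empty cell 2
    }
}
```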

Open Addressing: Removal

Since finding uses probing, deletion can mess things up...
Find and remove: 24
Now see if contains 18.
Solution: Mark removed with special value (NIL, for example)
- Probe over when seeking; can overwrite when adding.
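
A sketch of removal with a marker value. The REMOVED object plays the role of the NIL marker; the names and the Object[] representation are assumptions for illustration. As a simplification, add assumes the key is not already in the table.

```java
// Removal sketch: mark removed cells so later probes keep going past them.
class LinearProbingRemove {
    static final Object REMOVED = new Object(); // marker for a removed cell

    static int hash(int key, int length) {
        return Math.abs(key) % length;
    }

    // Returns the index of key, or -1. Probes over markers; stops at null.
    static int find(Object[] table, int key) {
        int start = hash(key, table.length);
        int i = start;
        do {
            if (table[i] == null) return -1;
            if (table[i] != REMOVED && table[i].equals(key)) return i;
            i = (i + 1) % table.length;
        } while (i != start);
        return -1;
    }

    static boolean remove(Object[] table, int key) {
        int i = find(table, key);
        if (i < 0) return false;
        table[i] = REMOVED; // do not set to null: that would cut the probe chain
        return true;
    }

    // Adding may overwrite a marker (assumes key is not already present).
    static boolean add(Object[] table, int key) {
        int start = hash(key, table.length);
        int i = start;
        do {
            if (table[i] == null || table[i] == REMOVED) {
                table[i] = key;
                return true;
            }
            i = (i + 1) % table.length;
        } while (i != start);
        return false;
    }

    public static void main(String[] args) {
        Object[] table = {13, 8, null, -3, 4, 24, 18};
        remove(table, 24);
        System.out.println(find(table, 18)); // 6: probe passes over the marker in cell 5
    }
}
```

If cell 5 were simply emptied instead, the search for 18 would stop there and wrongly report that 18 is missing.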

Avoiding collisions: Load factor

O(1) if no collisions, but up to O(n) if all mapped to one cell (bucket)
Load factor
- = (elements in table) / (size of table)
- Around 0.7 is usually good
  - From textbook: 0.5 -> 1.5 probes; 0.75 -> 2.5; 0.9 -> 5.5
Expand the table size
- Need to rehash everything.
- Consider: size from 7 to 14, element 8
- Performance hit, though can spread out over inserts by running both tables for a while
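
A sketch tying these points together. The probe counts quoted above are consistent with the standard estimate for a successful search under linear probing, 0.5 * (1 + 1 / (1 - L)) for load factor L; that this is the formula the textbook uses is an assumption. The rehash method is illustrative and assumes the new table is larger than the number of elements.

```java
// Load factor and rehashing sketch.
class LoadFactorDemo {
    static double loadFactor(int elements, int tableLength) {
        return (double) elements / tableLength;
    }

    // Standard estimate of probes per successful search with linear probing.
    static double expectedProbes(double loadFactor) {
        return 0.5 * (1.0 + 1.0 / (1.0 - loadFactor));
    }

    // Expanding means re-adding every key, because key % length changes.
    static Integer[] rehash(Integer[] old, int newLength) {
        Integer[] bigger = new Integer[newLength];
        for (Integer key : old) {
            if (key == null) continue;
            int i = Math.abs(key) % newLength;
            while (bigger[i] != null) {   // linear probing in the new table
                i = (i + 1) % newLength;
            }
            bigger[i] = key;
        }
        return bigger;
    }

    public static void main(String[] args) {
        System.out.println(expectedProbes(0.5));  // 1.5
        System.out.println(expectedProbes(0.75)); // 2.5
        Integer[] old = {13, 8, null, -3, 4, 24, 18};
        Integer[] bigger = rehash(old, 14);
        // Element 8 was in cell 1 (8 % 7); it now lives in cell 8 (8 % 14).
        System.out.println(bigger[8]);
    }
}
```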

Avoiding collisions: Hash fn

Good hash function = hard to do!
Hash fn: always return 1? No, very bad: every element collides in the same cell.
Hash fn: (int) (Math.random() * array.length)
- NO! Hash must be deterministic
- Two equal objects must always hash to same bucket (will revisit this with .equals method)
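Example with an array length that is a power of 10: 93, 43, 73, 63, ... all land in cell 3 (only the last digit counts)
- Power-of-2 lengths have the same effect on binary (only the low bits count)
- Prime number array length usually works well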
Strings: sum characters values and then %
- But consider: snail, slain, nails.
- A more complicated hash fn includes position
- (BTW: Can also hash entire files like this, such as for cryptography or file ID/fingerprinting)
What if our values were all 1 to 10, % 7?
- Should map to all cells with equal likelihood
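
A sketch of the string example. sumHash ignores character position, so anagrams collide; positionalHash weights each position by a power of 31, which is the same scheme Java's own String.hashCode() uses. The method names are illustrative.

```java
// String hashing sketch: summing characters vs. including position.
class StringHashDemo {
    // Sum of character values: anagrams always get the same hash.
    static int sumHash(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    // Position-sensitive hash: multiply the running value by 31 at each step.
    static int positionalHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Reduce a hash (possibly negative after overflow) to a valid index.
    static int bucket(int hash, int tableLength) {
        return Math.floorMod(hash, tableLength);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"snail", "slain", "nails"}) {
            System.out.println(s + ": sum=" + sumHash(s)
                    + " positional=" + positionalHash(s));
        }
        // All three sums are 535; the three positional hashes differ.
    }
}
```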

Avoiding collisions: Avoiding clusters

Quadratic probing
- instead of +x (where x = 1, 2, 3, 4), +x^2
- still get clumps, but spread out a bit
Double hashing
- secondary hash fn (and then probing)
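
A sketch of the probe sequences. Each method returns the cell to try on a given attempt (attempt 0 is the hash index itself). The secondary hash used for double hashing, 1 + abs(key) % (length - 1), is one common choice and an assumption here, not something the slides specify.

```java
// Probe sequence sketch: which cell to try on the i-th attempt.
class ProbeSequences {
    static int linear(int home, int attempt, int length) {
        return (home + attempt) % length;
    }

    // Quadratic probing: offsets 0, 1, 4, 9, ... from the hash index.
    static int quadratic(int home, int attempt, int length) {
        return (home + attempt * attempt) % length;
    }

    // Assumed secondary hash; never 0, so the probe always moves.
    static int stepFor(int key, int length) {
        return 1 + Math.abs(key) % (length - 1);
    }

    // Double hashing: step size depends on the key, not just the hash index.
    static int doubleHash(int key, int attempt, int length) {
        int home = Math.abs(key) % length;
        return (home + attempt * stepFor(key, length)) % length;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            System.out.print(quadratic(3, i, 7) + " "); // 3 4 0 5
        }
        System.out.println();
        // 24 and 10 share hash index 3 but take different second probes.
        System.out.println(doubleHash(24, 1, 7)); // 4
        System.out.println(doubleHash(10, 1, 7)); // 1
    }
}
```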

Chaining

Alternative to open addressing
Generally simpler and more common
buckets: Each cell is head of a linked list (stack)
- Add: hash to bucket, add at head of list
- Find: hash to bucket, loop through list there
- Remove: hash to bucket, loop through to element, and remove
Related concept: Bucket sort

Chaining: Try it

Hash table: Array of size 7
Hash fn: abs(key) % 7 → index
Keys to add: 4, 8, -3, 24, 18, 13
Contains: 8, 4, 19, 17
Remove: 18, -3
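
One way to check the exercise: a minimal chaining sketch using java.util.LinkedList for each bucket. The class and method names are illustrative; new items go at the head of the bucket's list, as described for chaining.

```java
import java.util.LinkedList;

// Chaining sketch: each cell holds a linked list of the keys that hash there.
class ChainedHashTable {
    private final LinkedList<Integer>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int length) {
        buckets = new LinkedList[length];
        for (int i = 0; i < length; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    private int hash(int key) {
        return Math.abs(key) % buckets.length;
    }

    void add(int key) {
        buckets[hash(key)].addFirst(key); // add at head of the bucket's list
    }

    boolean contains(int key) {
        return buckets[hash(key)].contains(key); // loop through one list only
    }

    boolean remove(int key) {
        // Integer.valueOf forces remove(Object), not remove(int index).
        return buckets[hash(key)].remove(Integer.valueOf(key));
    }

    String bucket(int index) {
        return buckets[index].toString();
    }

    public static void main(String[] args) {
        ChainedHashTable table = new ChainedHashTable(7);
        for (int key : new int[] {4, 8, -3, 24, 18, 13}) {
            table.add(key);
        }
        System.out.println(table.bucket(3));    // [24, -3]
        System.out.println(table.bucket(4));    // [18, 4]
        System.out.println(table.contains(17)); // false: hashes to bucket 3, not in its list
        table.remove(18);
        table.remove(-3);
        System.out.println(table.bucket(3));    // [24]
    }
}
```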

Chaining Limitations

Load factor still a factor
- no clusters to compound the problem though
- still want to expand table to keep lists short
  - if lists under constant limit, O(1)
Still need a good hash fn to spread over all buckets
Hash fn, while constant-time, is not free
- complex objects with lots of fields

Iteration

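List all elements in the hash table
In open addressing vs. chaining: scan the array and skip empty cells, vs. walk each bucket's list
Fairly easy/low cost to also keep a doubly-linked list through the elements to preserve insertion order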
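
A short illustration using the Java library. java.util.LinkedHashSet is documented as a hash table plus a linked list with predictable (insertion) iteration order, while java.util.HashSet makes no ordering guarantee. The helper method name is illustrative.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Iteration sketch: plain hash set vs. one that also tracks insertion order.
class IterationOrderDemo {
    // Returns the keys in the order a LinkedHashSet iterates over them.
    static List<Integer> insertionOrder(int... keys) {
        Set<Integer> set = new LinkedHashSet<>();
        for (int key : keys) {
            set.add(key);
        }
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        Set<Integer> plain = new HashSet<>();
        for (int key : new int[] {4, 8, -3, 24, 18, 13}) {
            plain.add(key);
        }
        // Order here depends on the table layout, not on insertion order.
        System.out.println(plain);
        // Order here matches insertion: [4, 8, -3, 24, 18, 13]
        System.out.println(insertionOrder(4, 8, -3, 24, 18, 13));
    }
}
```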

Summary

Hash tables: Array + a hash fn to determine which bucket an element goes into
Direct addressing (or other perfect hashing): no collisions
In real world:
- Good hash function to spread things (hard to do well)
- manage collisions
  - Open-addressing (with linear probing, or other fall-backs)
  - Chaining (simpler)
Keep load factor down to get required performance
- Considered to be O(1) given these constraints
- If done very badly, O(n)

For next time...

Have a good Thanksgiving
A11 (start now! 80 points, lots of sequential steps, all-or-nothing outcome)
Quiz 12 to be posted
Will record EC as soon as we get to it...
Next time: Maps and Sets (ADTs that rely on hash tables)

14b: Hash Tables

ICS211, Fall 2012
Dr. Zach
