While an ideal hash function distributes keys perfectly uniformly across the hash table slots, preventing any two keys from landing in the same spot, reality is often less cooperative. Given a finite number of slots (m) and potentially a much larger (or even infinite) universe of possible keys, the Pigeonhole Principle tells us that collisions, where multiple keys map to the same hash index, are bound to happen. Even with excellent hash functions, collisions become increasingly likely as the table fills up.
Therefore, a fundamental part of designing or using hash tables is having a strategy to manage these collisions effectively. Without a resolution strategy, we would overwrite existing data whenever a collision occurred, rendering the hash table incorrect. The goal is to handle collisions in a way that maintains correct storage and retrieval while minimizing performance degradation.
There are two primary approaches to resolving hash collisions: Separate Chaining and Open Addressing.
The idea behind Separate Chaining is straightforward: instead of demanding that each slot hold at most one element, we let each slot (j) in the hash table array reference a data structure that holds all elements whose keys hash to index j. Most commonly, this secondary data structure is a linked list.
A hash table using Separate Chaining. Slots 1 and 5 contain single elements. Slot 3 has experienced a collision, storing both Key B and Key C in a linked list. Slots 0, 2, 4, and 6 are empty.
Operations:
Insert: Compute the index j = h(key) and add the new element to the front of the list at table[j]. This is typically an O(1) operation for adding to the front of a list.
Search: Compute j = h(key) and traverse the list at table[j], comparing the search key with the key of each element in the list.
Delete: Compute j = h(key) and search the list at table[j] for the element with the matching key. If found, remove it from the list (a standard linked list deletion).
Performance:
The performance of Separate Chaining depends heavily on the load factor, λ=n/m, where n is the number of elements stored in the table and m is the number of slots (or buckets). The average length of a chain is λ.
Assuming a hash function that distributes keys uniformly (Simple Uniform Hashing Assumption), the average time for an unsuccessful search is O(1+λ), as we need to compute the hash (O(1)) and potentially traverse the average chain length (λ). A successful search takes, on average, slightly less time, also roughly O(1+λ). If the number of slots m is proportional to the number of elements n (i.e., λ is kept bounded by a constant), then insertion, deletion, and search operations take O(1) time on average.
The worst-case scenario occurs when all keys hash to the same slot. In this situation, the hash table degenerates into a single linked list, and search/delete operations take O(n) time. This highlights the importance of a good hash function and potentially resizing the table if the load factor becomes too high.
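To make the chaining operations above concrete, here is a minimal sketch of a chained hash table in Python. The class name, the fixed number of slots, and the use of Python lists in place of linked lists are illustrative choices, not taken from any particular library.

```python
class ChainedHashTable:
    """Minimal separate-chaining hash table (illustrative sketch)."""

    def __init__(self, num_slots=7):
        self.num_slots = num_slots
        # Each slot holds a list of (key, value) pairs; Python lists stand in
        # for the linked lists described above.
        self.slots = [[] for _ in range(num_slots)]

    def _index(self, key):
        return hash(key) % self.num_slots

    def insert(self, key, value):
        chain = self.slots[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key already present: update in place
                chain[i] = (key, value)
                return
        chain.append((key, value))       # otherwise add to the chain

    def search(self, key):
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        return None                      # not found

    def delete(self, key):
        chain = self.slots[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False
```

If m is kept proportional to n by resizing when the load factor grows, each chain stays short on average and these operations run in O(1) expected time, matching the analysis above.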
In contrast to Separate Chaining, Open Addressing stores all elements directly within the hash table array itself. When a collision occurs (i.e., we hash to a slot j that is already occupied), we systematically probe subsequent slots in the table until an empty slot is found.
The sequence of slots checked is called the probe sequence. The hash function is effectively extended to take both the key and the probe number (0 for the first try, 1 for the second, and so on) as input: h(key,i), where i=0,1,2,....
Several probing strategies exist:
Linear Probing: This is the simplest strategy. If the initial slot j = h(key, 0) is occupied, we try j+1, then j+2, j+3, and so on, wrapping around the end of the table if necessary (i.e., probing (h(key) + i) mod m for i = 0, 1, 2, ...). Its main drawback is primary clustering: occupied slots tend to form long contiguous runs, which lengthens future probe sequences.
Quadratic Probing: To mitigate primary clustering, quadratic probing uses a quadratically growing offset. The probe sequence is (h(key) + c1⋅i + c2⋅i²) mod m for i = 0, 1, 2, ..., where c1 and c2 (with c2 ≠ 0) are constants. A common choice is (h(key) + i²) mod m.
Double Hashing: This method uses a second, independent hash function, h2(key), to determine the step size for probing after an initial collision. The probe sequence is (h1(key) + i⋅h2(key)) mod m for i = 0, 1, 2, ....
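The three strategies differ only in how the probe offset grows with i. A small sketch of the three probe formulas, assuming m is the table size; the second hash used for double hashing here is a simple placeholder, not a recommendation:

```python
def linear_probe(key, i, m):
    # (h(key) + i) mod m
    return (hash(key) + i) % m

def quadratic_probe(key, i, m, c1=1, c2=1):
    # (h(key) + c1*i + c2*i^2) mod m
    return (hash(key) + c1 * i + c2 * i * i) % m

def double_hash_probe(key, i, m):
    # (h1(key) + i * h2(key)) mod m; h2 must never evaluate to 0
    h2 = 1 + (hash(key) // m) % (m - 1)   # placeholder second hash in 1..m-1
    return (hash(key) + i * h2) % m

# Probe sequence for one key under each strategy (m = 7, as in the example below)
m = 7
for probe in (linear_probe, quadratic_probe, double_hash_probe):
    print(probe.__name__, [probe(22, i, m) for i in range(4)])
```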
Example: Linear Probing Insertion
Consider a table of size m=7 and a simple hash function h(k)=k(mod7). Let's insert keys 15, 22, 8, 1.
Insert 15: h(15) = 1. Slot 1 is empty, so 15 is placed there. Table: [_, 15, _, _, _, _, _]
Insert 22: h(22) = 1. Slot 1 is occupied (by 15), so we probe slot 2, which is empty. Table: [_, 15, 22, _, _, _, _]
Insert 8: h(8) = 1. Slots 1 and 2 are occupied, so we probe slot 3, which is empty. Table: [_, 15, 22, 8, _, _, _]
Insert 1: h(1) = 1. Slots 1, 2, and 3 are occupied, so we probe slot 4, which is empty. Table: [_, 15, 22, 8, 1, _, _]
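The same trace can be reproduced with a short linear probing insertion sketch (the function and variable names are illustrative):

```python
def linear_probing_insert(table, key):
    """Insert key using linear probing; assumes at least one free slot exists."""
    m = len(table)
    j = key % m                      # h(k) = k mod m
    while table[j] is not None:      # probe until an empty slot is found
        j = (j + 1) % m              # wrap around the end of the table
    table[j] = key

table = [None] * 7
for key in (15, 22, 8, 1):
    linear_probing_insert(table, key)
print(table)   # [None, 15, 22, 8, 1, None, None]
```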
Operations:
Insert: Follow the key's probe sequence until an empty slot is found and place the element there. If no empty slot exists, the insertion fails (or triggers a resize).
Search: Follow the same probe sequence, comparing keys at each occupied slot. The search succeeds when the key is found and fails when an empty slot is reached.
Delete: Simply emptying a slot would break later searches whose probe sequences pass through it. Instead, the slot is typically marked with a special "deleted" (tombstone) value, which searches skip over but insertions may reuse.
Performance:
Open addressing performance is highly sensitive to the load factor λ=n/m. Unlike separate chaining, where λ can exceed 1, in open addressing λ must be less than 1 (you can't store more items than slots). As λ approaches 1, the number of probes required increases dramatically and performance degrades sharply. For example, under the uniform hashing assumption (unrealistic for simple linear/quadratic probing but approximated by double hashing), an unsuccessful search makes at most 1/(1−λ) probes on average, and a successful search at most (1/λ)⋅ln(1/(1−λ)). At λ = 0.5 that is about 2 probes per unsuccessful search; at λ = 0.9 it is about 10.
To maintain good performance (approaching O(1) average time), it's important to keep the load factor low, typically below 0.5 for linear/quadratic probing and perhaps up to 0.7 or 0.8 for double hashing, by resizing the table when necessary. Open addressing can have better cache performance than separate chaining because elements are stored contiguously in the array, but it requires more careful implementation and management.
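A quick way to see how sharply cost grows near λ = 1 is to evaluate the uniform-hashing estimates above for a few load factors; this is a back-of-the-envelope sketch, not a measurement of any real implementation:

```python
import math

def expected_probes(load_factor):
    """Uniform-hashing estimates of probes per unsuccessful/successful search."""
    unsuccessful = 1 / (1 - load_factor)
    successful = (1 / load_factor) * math.log(1 / (1 - load_factor))
    return unsuccessful, successful

for lam in (0.25, 0.5, 0.75, 0.9, 0.99):
    u, s = expected_probes(lam)
    print(f"λ={lam:.2f}  unsuccessful≈{u:5.1f}  successful≈{s:4.1f}")
```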
In many standard library implementations (like Python's dict, which uses open addressing, or Java's HashMap, which uses separate chaining), these details are hidden from the user. However, understanding these collision resolution mechanisms is valuable when designing specialized hash-based structures or when analyzing the performance characteristics of algorithms that rely on them. For instance, in feature hashing, collisions mean that distinct features might be mapped to the same index (feature aliasing). While sometimes acceptable or even beneficial for regularization, understanding collision resolution helps appreciate the potential trade-offs involved.
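As a small illustration of feature aliasing, here is a hedged sketch of the hashing trick for mapping string features into a fixed-length count vector; the bucket count and feature names are made up for the example, and with only 8 buckets some distinct features will inevitably share an index:

```python
def hashed_feature_vector(features, num_buckets=8):
    """Map string features into a fixed-size count vector via hashing."""
    vec = [0] * num_buckets
    for f in features:
        # Colliding features are added into the same bucket (feature aliasing).
        # Note: Python's built-in hash() is salted per process for strings;
        # real systems use a stable hash function instead.
        vec[hash(f) % num_buckets] += 1
    return vec

print(hashed_feature_vector(["color=red", "size=large", "brand=acme", "shape=round"]))
```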