While an ideal hash function distributes keys perfectly uniformly across the hash table slots, preventing any two keys from landing in the same spot, reality is often less cooperative. Given a finite number of slots (m) and potentially a much larger (or even infinite) universe of possible keys, the Pigeonhole Principle tells us that collisions, where multiple keys map to the same hash index, are bound to happen. Even with excellent hash functions, collisions become increasingly likely as the table fills up.
Therefore, a fundamental part of designing or using hash tables is having a strategy to manage these collisions effectively. Without a resolution strategy, we would overwrite existing data whenever a collision occurred, rendering the hash table incorrect. The goal is to handle collisions in a way that maintains correct storage and retrieval while minimizing performance degradation.
There are two primary approaches to resolving hash collisions: Separate Chaining and Open Addressing.
The idea behind Separate Chaining is straightforward: instead of demanding that each slot hold at most one element, we let each slot (j) in the hash table array reference a data structure that holds all elements whose keys hash to index j. Most commonly, this secondary data structure is a linked list.
A hash table using Separate Chaining. Slots 1 and 5 contain single elements. Slot 3 has experienced a collision, storing both Key B and Key C in a linked list. Slots 0, 2, 4, and 6 are empty.
Operations:
Insert: Compute the index j = h(key) and add the new element to the front of the list at table[j]. This is typically an O(1) operation for adding to the front of a list.
Search: Compute j = h(key) and traverse the list at table[j], comparing the search key with the key of each element in the list.
Delete: Compute j = h(key) and search the list at table[j] for the element with the matching key. If found, remove it from the list (a standard linked list deletion).
Performance:
The performance of Separate Chaining depends heavily on the load factor, λ=n/m, where n is the number of elements stored in the table and m is the number of slots (or buckets). The average length of a chain is λ.
Assuming a hash function that distributes keys uniformly (Simple Uniform Hashing Assumption), the average time for an unsuccessful search is O(1+λ), as we need to compute the hash (O(1)) and potentially traverse the average chain length (λ). A successful search takes, on average, slightly less time, also roughly O(1+λ). If the number of slots m is proportional to the number of elements n (i.e., λ is kept bounded by a constant), then insertion, deletion, and search operations take O(1) time on average.
The worst-case scenario occurs when all keys hash to the same slot. In this situation, the hash table degenerates into a single linked list, and search/delete operations take O(n) time. This highlights the importance of a good hash function and potentially resizing the table if the load factor becomes too high.
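To make the chaining operations above concrete, here is a minimal sketch of a chained hash table in Python. The class name, the fixed number of slots, and the use of Python lists in place of linked lists are illustrative choices, not taken from any particular library.

```python
class ChainedHashTable:
    """Minimal separate-chaining hash table (illustrative sketch)."""

    def __init__(self, num_slots=7):
        self.num_slots = num_slots
        # Each slot holds a list of (key, value) pairs; Python lists stand in
        # for the linked lists described above.
        self.slots = [[] for _ in range(num_slots)]

    def _index(self, key):
        return hash(key) % self.num_slots

    def insert(self, key, value):
        chain = self.slots[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key already present: update in place
                chain[i] = (key, value)
                return
        chain.append((key, value))       # otherwise add to the chain

    def search(self, key):
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        return None                      # not found

    def delete(self, key):
        chain = self.slots[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False
```

If m is kept proportional to n by resizing when the load factor grows, each chain stays short on average and these operations run in O(1) expected time, matching the analysis above.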
In contrast to Separate Chaining, Open Addressing stores all elements directly within the hash table array itself. When a collision occurs (i.e., we hash to a slot j that is already occupied), we systematically probe subsequent slots in the table until an empty slot is found.
The sequence of slots checked is called the probe sequence. The hash function is effectively extended to take both the key and the probe number (0 for the first try, 1 for the second, and so on) as input: h(key,i), where i=0,1,2,....
Several probing strategies exist:
Linear Probing: This is the simplest strategy. If the initial slot j = h(key, 0) is occupied, we try j+1, then j+2, j+3, and so on, wrapping around the end of the table if necessary (i.e., probing (h(key) + i) mod m for i = 0, 1, 2, ...). Its main drawback is primary clustering: occupied slots tend to form long contiguous runs, which lengthens future probe sequences.
Quadratic Probing: To mitigate primary clustering, quadratic probing uses a quadratically growing offset. The probe sequence is (h(key) + c1⋅i + c2⋅i²) mod m for i = 0, 1, 2, ..., where c1 and c2 (with c2 ≠ 0) are constants. A common choice is (h(key) + i²) mod m.
Double Hashing: This method uses a second, independent hash function, h2(key), to determine the step size for probing after an initial collision. The probe sequence is (h1(key) + i⋅h2(key)) mod m for i = 0, 1, 2, ....
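The three strategies differ only in how the probe offset grows with i. A small sketch of the three probe formulas, assuming m is the table size; the second hash used for double hashing here is a simple placeholder, not a recommendation:

```python
def linear_probe(key, i, m):
    # (h(key) + i) mod m
    return (hash(key) + i) % m

def quadratic_probe(key, i, m, c1=1, c2=1):
    # (h(key) + c1*i + c2*i^2) mod m
    return (hash(key) + c1 * i + c2 * i * i) % m

def double_hash_probe(key, i, m):
    # (h1(key) + i * h2(key)) mod m; h2 must never evaluate to 0
    h2 = 1 + (hash(key) // m) % (m - 1)   # placeholder second hash in 1..m-1
    return (hash(key) + i * h2) % m

# Probe sequence for one key under each strategy (m = 7, as in the example below)
m = 7
for probe in (linear_probe, quadratic_probe, double_hash_probe):
    print(probe.__name__, [probe(22, i, m) for i in range(4)])
```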
Example: Linear Probing Insertion
Consider a table of size m=7 and a simple hash function h(k)=k(mod7). Let's insert keys 15, 22, 8, 1.
Insert 15: h(15) = 1. Slot 1 is empty, so 15 is placed there. Table: [_, 15, _, _, _, _, _]
Insert 22: h(22) = 1. Slot 1 is occupied (by 15), so we probe slot 2, which is empty. Table: [_, 15, 22, _, _, _, _]
Insert 8: h(8) = 1. Slots 1 and 2 are occupied, so we probe slot 3, which is empty. Table: [_, 15, 22, 8, _, _, _]
Insert 1: h(1) = 1. Slots 1, 2, and 3 are occupied, so we probe slot 4, which is empty. Table: [_, 15, 22, 8, 1, _, _]
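The same trace can be reproduced with a short linear probing insertion sketch (the function and variable names are illustrative):

```python
def linear_probing_insert(table, key):
    """Insert key using linear probing; assumes at least one free slot exists."""
    m = len(table)
    j = key % m                      # h(k) = k mod m
    while table[j] is not None:      # probe until an empty slot is found
        j = (j + 1) % m              # wrap around the end of the table
    table[j] = key

table = [None] * 7
for key in (15, 22, 8, 1):
    linear_probing_insert(table, key)
print(table)   # [None, 15, 22, 8, 1, None, None]
```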
Operations:
Insert: Follow the key's probe sequence until an empty slot is found and place the element there. If no empty slot exists, the insertion fails (or triggers a resize).
Search: Follow the same probe sequence, comparing keys at each occupied slot. The search succeeds when the key is found and fails when an empty slot is reached.
Delete: Simply emptying a slot would break later searches whose probe sequences pass through it. Instead, the slot is typically marked with a special "deleted" (tombstone) value, which searches skip over but insertions may reuse.
Performance:
Open addressing performance is highly sensitive to the load factor λ=n/m. Unlike separate chaining, where λ can exceed 1, in open addressing λ must be less than 1 (you can't store more items than slots). As λ approaches 1, the number of probes required increases dramatically and performance degrades sharply. For example, under the uniform hashing assumption (unrealistic for simple linear/quadratic probing but approximated by double hashing), an unsuccessful search makes at most 1/(1−λ) probes on average, and a successful search at most (1/λ)⋅ln(1/(1−λ)). At λ = 0.5 that is about 2 probes per unsuccessful search; at λ = 0.9 it is about 10.
To maintain good performance (approaching O(1) average time), it's important to keep the load factor low, typically below 0.5 for linear/quadratic probing and perhaps up to 0.7 or 0.8 for double hashing, by resizing the table when necessary. Open addressing can have better cache performance than separate chaining because elements are stored contiguously in the array, but it requires more careful implementation and management.
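A quick way to see how sharply cost grows near λ = 1 is to evaluate the uniform-hashing estimates above for a few load factors; this is a back-of-the-envelope sketch, not a measurement of any real implementation:

```python
import math

def expected_probes(load_factor):
    """Uniform-hashing estimates of probes per unsuccessful/successful search."""
    unsuccessful = 1 / (1 - load_factor)
    successful = (1 / load_factor) * math.log(1 / (1 - load_factor))
    return unsuccessful, successful

for lam in (0.25, 0.5, 0.75, 0.9, 0.99):
    u, s = expected_probes(lam)
    print(f"λ={lam:.2f}  unsuccessful≈{u:5.1f}  successful≈{s:4.1f}")
```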
In many standard library implementations (like Python's dict, which uses open addressing, or Java's HashMap, which uses separate chaining), these details are hidden from the user. However, understanding these collision resolution mechanisms is valuable when designing specialized hash-based structures or when analyzing the performance characteristics of algorithms that rely on them. For instance, in feature hashing, collisions mean that distinct features might be mapped to the same index (feature aliasing). While sometimes acceptable or even beneficial for regularization, understanding collision resolution helps appreciate the potential trade-offs involved.
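As a small illustration of feature aliasing, here is a hedged sketch of the hashing trick for mapping string features into a fixed-length count vector; the bucket count and feature names are made up for the example, and with only 8 buckets some distinct features will inevitably share an index:

```python
def hashed_feature_vector(features, num_buckets=8):
    """Map string features into a fixed-size count vector via hashing."""
    vec = [0] * num_buckets
    for f in features:
        # Colliding features are added into the same bucket (feature aliasing).
        # Note: Python's built-in hash() is salted per process for strings;
        # real systems use a stable hash function instead.
        vec[hash(f) % num_buckets] += 1
    return vec

print(hashed_feature_vector(["color=red", "size=large", "brand=acme", "shape=round"]))
```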