How to approximate the number of distinct elements in a multiset ?
Main Idea: Length of the longest run of 0s express how many distinct elements a good hash function have seen. Total distinct items seen . Use multiple hashes/bucket to prevent unlucky hit and use the harmonic mean to get final guess.
- Initialization:
- Choose a parameter to determine the number of buckets .
- Initialize an array of size with all elements set to 0.
- Hashing:
- Define a hash function that maps each element to a uniformly distributed binary string.
- Update Procedure:
- For each element in the dataset:
- Compute the hash value .
- Split the hash value into two parts: the first bits (bucket index) and the remaining bits (leading zeros).
- Update the corresponding bucket in with the maximum number of leading zeros observed.
- For each element in the dataset:
- Estimation:
- Compute the harmonic mean of the values in to estimate the cardinality.
- Apply a correction factor to improve accuracy.
- Correction for Small Cardinalities:
- If the estimated cardinality is small, use linear counting to correct the estimate.
- Final Estimate:
- The final estimate of the number of distinct elements is derived from the harmonic mean or the corrected estimate.
def hyperloglog(data, b):
m = 2 ** b
M = [0] * m
alpha_m = 0.7213 / (1 + 1.079 / m)
def hash_function(x):
# Implement a hash function that returns a binary string of length L
pass
def leading_zeros(binary_string):
# Count the number of leading zeros in the binary string
pass
for x in data:
hash_value = hash_function(x)
j = int(hash_value[:b], 2) # First b bits for bucket index
r = leading_zeros(hash_value[b:]) # Remaining bits for leading zeros
M[j] = max(M[j], r)
E = alpha_m * m ** 2 / sum(2 ** -M[j] for j in range(m))
if E <= 5/2 * m:
V = sum(1 for j in range(m) if M[j] == 0)
E_corrected = m * log(m / V)
return E_corrected
else:
return E