Affected Version
All versions (confirmed on 0.12.0 and current master branch)
Description
HyperLogLogCollector returns a cardinality estimate of 0 when a single element is added whose murmur3_128 hash has its first 16+ bits all zero (i.e., positionOf1 > 15). This causes DetermineHashedPartitionsJob to report "Found approximately [0] rows in data" and subsequently fail with No buckets?? seems there is no data to index., even though the Hadoop job successfully reads valid input records.
Root cause: When positionOf1 exceeds the 4-bit nibble storage range (max 15), the value is stored only in the global overflow register. No entry is written to the sparse storage buffer. During estimation in estimateSparse(), the overflow register's bucket is not accounted for — zeroCount equals NUM_BUCKETS (all 2048 buckets counted as empty). The linear counting correction formula then computes m × ln(m/m) = m × ln(1) = 0.
Steps to reproduce:
- Create a fresh
HyperLogLogCollector via makeLatestCollector()
- Add a single element whose hash bytes start with
0x00 0x80 (making positionOf1 = 16)
- Call
estimateCardinalityRound() — returns 0 instead of 1
Minimal reproduction:
@Test
public void testSingleOverflowElementEstimatesToZero() {
HyperLogLogCollector collector = HyperLogLogCollector.makeLatestCollector();
// Construct a hash where positionOf1 = 16 (first byte 0x00, second byte 0x80)
byte[] hashedValue = new byte[16];
hashedValue[0] = 0x00; // 8 zero bits
hashedValue[1] = (byte) 0x80; // lowest 1-bit at position 8 → total positionOf1 = 16
hashedValue[14] = 0x00;
hashedValue[15] = 0x05; // bucket info
collector.add(hashedValue);
// BUG: returns 0 instead of 1
assertEquals(0, collector.estimateCardinalityRound());
// EXPECTED after fix:
// assertTrue(collector.estimateCardinalityRound() >= 1);
}
Trigger probability: ~1/32768 per element (depends on hash value distribution). Only manifests when total element count is very small (single digits), as larger datasets will have other elements populating regular registers that keep zeroCount < NUM_BUCKETS.
The error message (in Hadoop indexing context):
java.lang.RuntimeException: No buckets?? seems there is no data to index.
at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:209)
Preceded by:
DetermineHashedPartitionsJob - Found approximately [0] rows in data.
DetermineHashedPartitionsJob - Creating [0] shards
While Hadoop counters confirm Map input records = 1.
Debugging performed:
- Confirmed source data has 1 valid record and is correctly read by the MapReduce job
- Confirmed the record passes all filtering logic (timestamp parsing, interval matching)
- Traced
HyperLogLogCollector.add(): positionOf1 = 16 > registerOffset(0) + range(15) → enters overflow branch → value stored in overflow register only → sparse buffer remains empty (buf.remaining() = 0)
- Traced
estimateSparse(): since no sparse entries exist, zeroCount = NUM_BUCKETS = 2048 → linear counting returns 2048 × ln(2048/2048) = 2048 × ln(1) = 0
- Verified against the original HyperLogLog paper (Flajolet et al. 2007, Figure 3) — the paper's algorithm uses unbounded registers with no overflow mechanism. The bug is specific to Druid's 4-bit nibble space optimization that introduces the overflow register concept.
Suggested fix: In estimateSparse(), when overflowValue > 0 and the overflow bucket is not present in the sparse entry list, subtract 1 from zeroCount and add 1.0 / Math.pow(2, overflowValue) to the harmonic sum e. This correctly accounts for the non-empty overflow bucket in the linear counting formula.
Affected Version
All versions (confirmed on 0.12.0 and current master branch)
Description
HyperLogLogCollectorreturns a cardinality estimate of 0 when a single element is added whose murmur3_128 hash has its first 16+ bits all zero (i.e.,positionOf1 > 15). This causesDetermineHashedPartitionsJobto report "Found approximately [0] rows in data" and subsequently fail withNo buckets?? seems there is no data to index., even though the Hadoop job successfully reads valid input records.Root cause: When
positionOf1exceeds the 4-bit nibble storage range (max 15), the value is stored only in the global overflow register. No entry is written to the sparse storage buffer. During estimation inestimateSparse(), the overflow register's bucket is not accounted for —zeroCountequalsNUM_BUCKETS(all 2048 buckets counted as empty). The linear counting correction formula then computesm × ln(m/m) = m × ln(1) = 0.Steps to reproduce:
HyperLogLogCollectorviamakeLatestCollector()0x00 0x80(makingpositionOf1 = 16)estimateCardinalityRound()— returns 0 instead of 1Minimal reproduction:
Trigger probability: ~1/32768 per element (depends on hash value distribution). Only manifests when total element count is very small (single digits), as larger datasets will have other elements populating regular registers that keep
zeroCount < NUM_BUCKETS.The error message (in Hadoop indexing context):
Preceded by:
While Hadoop counters confirm
Map input records = 1.Debugging performed:
HyperLogLogCollector.add():positionOf1 = 16 > registerOffset(0) + range(15)→ enters overflow branch → value stored in overflow register only → sparse buffer remains empty (buf.remaining() = 0)estimateSparse(): since no sparse entries exist,zeroCount = NUM_BUCKETS = 2048→ linear counting returns2048 × ln(2048/2048) = 2048 × ln(1) = 0Suggested fix: In
estimateSparse(), whenoverflowValue > 0and the overflow bucket is not present in the sparse entry list, subtract 1 fromzeroCountand add1.0 / Math.pow(2, overflowValue)to the harmonic sume. This correctly accounts for the non-empty overflow bucket in the linear counting formula.