HyperLogLogCollector estimates 0 cardinality when single element overflows into sparse mode

### Affected Version

All versions (confirmed on 0.12.0 and current master branch)

### Description

`HyperLogLogCollector` returns a cardinality estimate of **0** when a single element is added whose murmur3_128 hash has its first 16+ bits all zero (i.e., `positionOf1 > 15`). This causes `DetermineHashedPartitionsJob` to report "Found approximately [0] rows in data" and subsequently fail with `No buckets?? seems there is no data to index.`, even though the Hadoop job successfully reads valid input records.

**Root cause**: When `positionOf1` exceeds the 4-bit nibble storage range (max 15), the value is stored only in the global overflow register. No entry is written to the sparse storage buffer. During estimation in `estimateSparse()`, the overflow register's bucket is not accounted for — `zeroCount` equals `NUM_BUCKETS` (all 2048 buckets counted as empty). The linear counting correction formula then computes `m × ln(m/m) = m × ln(1) = 0`.

**Steps to reproduce**:

1. Create a fresh `HyperLogLogCollector` via `makeLatestCollector()`
2. Add a single element whose hash bytes start with `0x00 0x80` (making `positionOf1 = 16`)
3. Call `estimateCardinalityRound()` — returns 0 instead of 1

Minimal reproduction:

```java
@Test
public void testSingleOverflowElementEstimatesToZero() {
  HyperLogLogCollector collector = HyperLogLogCollector.makeLatestCollector();

  // Construct a hash where positionOf1 = 16 (first byte 0x00, second byte 0x80)
  byte[] hashedValue = new byte[16];
  hashedValue[0] = 0x00;           // 8 zero bits
  hashedValue[1] = (byte) 0x80;   // lowest 1-bit at position 8 → total positionOf1 = 16
  hashedValue[14] = 0x00;
  hashedValue[15] = 0x05;          // bucket info

  collector.add(hashedValue);

  // BUG: returns 0 instead of 1
  assertEquals(0, collector.estimateCardinalityRound());

  // EXPECTED after fix:
  // assertTrue(collector.estimateCardinalityRound() >= 1);
}
```

**Trigger probability**: ~1/32768 per element (depends on hash value distribution). Only manifests when total element count is very small (single digits), as larger datasets will have other elements populating regular registers that keep `zeroCount < NUM_BUCKETS`.

**The error message** (in Hadoop indexing context):

```
java.lang.RuntimeException: No buckets?? seems there is no data to index.
    at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:209)
```

Preceded by:

```
DetermineHashedPartitionsJob - Found approximately [0] rows in data.
DetermineHashedPartitionsJob - Creating [0] shards
```

While Hadoop counters confirm `Map input records = 1`.

**Debugging performed**:

1. Confirmed source data has 1 valid record and is correctly read by the MapReduce job
2. Confirmed the record passes all filtering logic (timestamp parsing, interval matching)
3. Traced `HyperLogLogCollector.add()`: `positionOf1 = 16 > registerOffset(0) + range(15)` → enters overflow branch → value stored in overflow register only → sparse buffer remains empty (`buf.remaining() = 0`)
4. Traced `estimateSparse()`: since no sparse entries exist, `zeroCount = NUM_BUCKETS = 2048` → linear counting returns `2048 × ln(2048/2048) = 2048 × ln(1) = 0`
5. Verified against the original HyperLogLog paper (Flajolet et al. 2007, Figure 3) — the paper's algorithm uses unbounded registers with no overflow mechanism. The bug is specific to Druid's 4-bit nibble space optimization that introduces the overflow register concept.

**Suggested fix**: In `estimateSparse()`, when `overflowValue > 0` and the overflow bucket is not present in the sparse entry list, subtract 1 from `zeroCount` and add `1.0 / Math.pow(2, overflowValue)` to the harmonic sum `e`. This correctly accounts for the non-empty overflow bucket in the linear counting formula.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HyperLogLogCollector estimates 0 cardinality when single element overflows into sparse mode #19649

Affected Version

Description

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

HyperLogLogCollector estimates 0 cardinality when single element overflows into sparse mode #19649

Description

Affected Version

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions