Skip to content

HyperLogLogCollector estimates 0 cardinality when single element overflows into sparse mode #19649

Description

@FaxianZhao

Affected Version

All versions (confirmed on 0.12.0 and current master branch)

Description

HyperLogLogCollector returns a cardinality estimate of 0 when a single element is added whose murmur3_128 hash has its first 16+ bits all zero (i.e., positionOf1 > 15). This causes DetermineHashedPartitionsJob to report "Found approximately [0] rows in data" and subsequently fail with No buckets?? seems there is no data to index., even though the Hadoop job successfully reads valid input records.

Root cause: When positionOf1 exceeds the 4-bit nibble storage range (max 15), the value is stored only in the global overflow register. No entry is written to the sparse storage buffer. During estimation in estimateSparse(), the overflow register's bucket is not accounted for — zeroCount equals NUM_BUCKETS (all 2048 buckets counted as empty). The linear counting correction formula then computes m × ln(m/m) = m × ln(1) = 0.

Steps to reproduce:

  1. Create a fresh HyperLogLogCollector via makeLatestCollector()
  2. Add a single element whose hash bytes start with 0x00 0x80 (making positionOf1 = 16)
  3. Call estimateCardinalityRound() — returns 0 instead of 1

Minimal reproduction:

@Test
public void testSingleOverflowElementEstimatesToZero() {
  HyperLogLogCollector collector = HyperLogLogCollector.makeLatestCollector();

  // Construct a hash where positionOf1 = 16 (first byte 0x00, second byte 0x80)
  byte[] hashedValue = new byte[16];
  hashedValue[0] = 0x00;           // 8 zero bits
  hashedValue[1] = (byte) 0x80;   // lowest 1-bit at position 8 → total positionOf1 = 16
  hashedValue[14] = 0x00;
  hashedValue[15] = 0x05;          // bucket info

  collector.add(hashedValue);

  // BUG: returns 0 instead of 1
  assertEquals(0, collector.estimateCardinalityRound());

  // EXPECTED after fix:
  // assertTrue(collector.estimateCardinalityRound() >= 1);
}

Trigger probability: ~1/32768 per element (depends on hash value distribution). Only manifests when total element count is very small (single digits), as larger datasets will have other elements populating regular registers that keep zeroCount < NUM_BUCKETS.

The error message (in Hadoop indexing context):

java.lang.RuntimeException: No buckets?? seems there is no data to index.
    at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:209)

Preceded by:

DetermineHashedPartitionsJob - Found approximately [0] rows in data.
DetermineHashedPartitionsJob - Creating [0] shards

While Hadoop counters confirm Map input records = 1.

Debugging performed:

  1. Confirmed source data has 1 valid record and is correctly read by the MapReduce job
  2. Confirmed the record passes all filtering logic (timestamp parsing, interval matching)
  3. Traced HyperLogLogCollector.add(): positionOf1 = 16 > registerOffset(0) + range(15) → enters overflow branch → value stored in overflow register only → sparse buffer remains empty (buf.remaining() = 0)
  4. Traced estimateSparse(): since no sparse entries exist, zeroCount = NUM_BUCKETS = 2048 → linear counting returns 2048 × ln(2048/2048) = 2048 × ln(1) = 0
  5. Verified against the original HyperLogLog paper (Flajolet et al. 2007, Figure 3) — the paper's algorithm uses unbounded registers with no overflow mechanism. The bug is specific to Druid's 4-bit nibble space optimization that introduces the overflow register concept.

Suggested fix: In estimateSparse(), when overflowValue > 0 and the overflow bucket is not present in the sparse entry list, subtract 1 from zeroCount and add 1.0 / Math.pow(2, overflowValue) to the harmonic sum e. This correctly accounts for the non-empty overflow bucket in the linear counting formula.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions