Classic-only histogram: consider synchronized block instead of multi-LongAdder for observe() hot path #1915

@zeitlinger

Context

While benchmarking the Prometheus shim PoC (bridging the Prometheus client API to the OTel SDK), I found that classic-only histograms have about 30% lower observe() latency through the OTel SDK than through native Prometheus.

Benchmark numbers (JMH, single thread)

Path                                  | observe() latency
Native Prometheus (classic-only)      | 10.5 ns
OTel SDK (explicit bucket histogram)  | 7.3 ns

Root cause

Native Prometheus doObserve() uses 3 separate CAS-based atomics per call:

  1. classicBuckets[i].add(1) (LongAdder)
  2. sum.add(value) (DoubleAdder)
  3. count.increment() (LongAdder)

Plus a buffer.append() CAS attempt and volatile reads for reset/scale-down state.
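
To make the shape of that hot path concrete, here is a minimal sketch of an adder-based classic histogram. The class name AdderHistogramSketch and its accessors are hypothetical and are not the client_java implementation; the buffer.append() step and the reset/scale-down state mentioned above are deliberately left out.

```java
import java.util.concurrent.atomic.DoubleAdder;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical, simplified stand-in for the adder-based classic path. */
final class AdderHistogramSketch {
    private final double[] upperBounds;       // sorted ascending; +Inf bucket is implicit
    private final LongAdder[] classicBuckets; // one adder per bucket, plus one for +Inf
    private final DoubleAdder sum = new DoubleAdder();
    private final LongAdder count = new LongAdder();

    AdderHistogramSketch(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        this.classicBuckets = new LongAdder[upperBounds.length + 1];
        for (int i = 0; i < classicBuckets.length; i++) {
            classicBuckets[i] = new LongAdder();
        }
    }

    void observe(double value) {
        int i = bucketIndex(value);
        classicBuckets[i].add(1); // atomic update 1
        sum.add(value);           // atomic update 2
        count.increment();        // atomic update 3
    }

    /** Linear scan for the first bucket whose upper bound is >= value. */
    private int bucketIndex(double value) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) {
                return i;
            }
        }
        return upperBounds.length; // falls into the implicit +Inf bucket
    }

    long count() { return count.sum(); }
    double sum() { return sum.sum(); }
    long bucketCount(int i) { return classicBuckets[i].sum(); }
}
```

Each of the three updates is independently atomic, so a concurrent reader can see a state where, for example, the bucket has been incremented but the count has not.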

The OTel SDK uses a single synchronized block with plain +=/++ arithmetic:

synchronized (lock) {
    this.sum += value;
    this.count++;
    this.counts[bucketIndex]++;
    // min/max tracking
}

In uncontended (single-thread) benchmarks, HotSpot keeps the uncontended lock cheap and the JIT can optimize the plain arithmetic freely, so the single lock beats the multi-CAS approach.

Suggestion

For classic-only histograms (where nativeInitialSchema == CLASSIC_HISTOGRAM), consider an alternative doObserve() implementation that uses a synchronized block with plain fields instead of multiple LongAdder/DoubleAdder instances. The buffer mechanism (needed for native histogram scale-down) could also be bypassed in classic-only mode.

This wouldn't affect native or hybrid histograms, which still need the current design.
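
As a rough illustration of the suggestion, a classic-only data point could look something like the sketch below. SynchronizedClassicHistogramSketch and its Snapshot type are hypothetical names, not existing client_java classes, and the sketch assumes a fixed, sorted array of bucket upper bounds with an implicit +Inf bucket.

```java
/** Hypothetical classic-only data point: plain fields guarded by one lock. */
final class SynchronizedClassicHistogramSketch {
    /** Immutable copy of the state, taken under the lock. */
    static final class Snapshot {
        final double sum;
        final long count;
        final long[] counts;

        Snapshot(double sum, long count, long[] counts) {
            this.sum = sum;
            this.count = count;
            this.counts = counts;
        }
    }

    private final Object lock = new Object();
    private final double[] upperBounds; // immutable after construction
    private final long[] counts;        // guarded by lock
    private double sum;                 // guarded by lock
    private long count;                 // guarded by lock

    SynchronizedClassicHistogramSketch(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        this.counts = new long[upperBounds.length + 1];
    }

    void observe(double value) {
        // The bucket lookup only reads immutable state, so it stays outside the lock.
        int i = bucketIndex(value);
        synchronized (lock) {
            sum += value;
            count++;
            counts[i]++;
        }
    }

    /** Readers take the same lock, so sum, count and buckets are mutually consistent. */
    Snapshot snapshot() {
        synchronized (lock) {
            return new Snapshot(sum, count, counts.clone());
        }
    }

    private int bucketIndex(double value) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) {
                return i;
            }
        }
        return upperBounds.length; // implicit +Inf bucket
    }
}
```

A side effect of this design is that a snapshot taken under the lock never observes a half-applied observation, which the separate adders cannot guarantee on their own.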

Multi-threaded consideration

The LongAdder approach was chosen for multi-threaded scalability (striped cells reduce contention). A synchronized block would serialize threads. However:

  • Most real-world observe() calls happen on different label-value combinations (different data points), so contention on a single data point is rare
  • Even under contention, the critical section is very short (~5 ns of arithmetic), so lock hold time is minimal
  • A benchmark with 4 threads would clarify the actual tradeoff
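
For a quick look at the contended case before setting up a proper JMH run, something like the following plain-Java probe could be used. ContentionProbe and nanosPerOp are hypothetical names; the probe does no warmup and does not guard against dead-code elimination, so its numbers are only a rough indication.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical helper: runs the same operation on several threads at once. */
final class ContentionProbe {
    /**
     * Starts `threads` workers that each call `op` `opsPerThread` times, releasing
     * them together. Returns wall-clock nanoseconds divided by the total number of
     * calls (inverse throughput, not per-call latency).
     */
    static double nanosPerOp(int threads, int opsPerThread, Runnable op) {
        CountDownLatch ready = new CountDownLatch(threads);
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                ready.countDown();
                try {
                    start.await(); // wait until all workers are ready
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < opsPerThread; i++) {
                    op.run();
                }
            });
            workers[t].start();
        }
        try {
            ready.await();
            long begin = System.nanoTime();
            start.countDown(); // release all workers at once
            for (Thread w : workers) {
                w.join();
            }
            long elapsed = System.nanoTime() - begin;
            return (double) elapsed / ((double) threads * opsPerThread);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Calling ContentionProbe.nanosPerOp(4, 1_000_000, () -> histogram.observe(1.0)) once per implementation, all threads hitting the same data point, gives the worst case for the synchronized variant. Here histogram stands for whichever implementation is under test. A JMH benchmark annotated with @Threads(4) remains the more trustworthy measurement.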

Not a high priority — 10.5 ns is already excellent. But worth considering if classic histogram performance matters.
