Classic-only histogram: consider synchronized block instead of multi-LongAdder for observe() hot path

## Context

While benchmarking the [Prometheus shim PoC](https://github.com/prometheus/client_java/compare/main...zeitlinger:client_java:feat/metric-backend-spi) (bridging the Prometheus client API to the OTel SDK), I found that `observe()` on classic-only histograms is **about 30% faster through the OTel SDK** than through native Prometheus.

## Benchmark numbers (JMH, single thread)

| Path | `observe()` latency |
|------|-------------------|
| Native Prometheus (classic-only) | 10.5 ns |
| OTel SDK (explicit bucket histogram) | 7.3 ns |

## Root cause

Native Prometheus `doObserve()` uses **3 separate CAS-based atomics** per call:

1. `classicBuckets[i].add(1)` — `LongAdder`
2. `sum.add(value)` — `DoubleAdder`
3. `count.increment()` — `LongAdder`

Plus a `buffer.append()` CAS attempt and volatile reads for reset/scale-down state.
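
To make the shape of this hot path concrete, here is a minimal sketch of an adder-based classic histogram. It is not the actual Prometheus code: the class `AdderHistogram` and its accessors are hypothetical, it leaves out the buffer and the reset/scale-down reads, and it assumes the last upper bound is `+Inf`.

```java
import java.util.concurrent.atomic.DoubleAdder;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for an adder-based data point; not the Prometheus class. */
final class AdderHistogram {
    private final double[] upperBounds; // sorted ascending; assumed to end with +Inf
    private final LongAdder[] buckets;
    private final DoubleAdder sum = new DoubleAdder();
    private final LongAdder count = new LongAdder();

    AdderHistogram(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        this.buckets = new LongAdder[upperBounds.length];
        for (int i = 0; i < buckets.length; i++) {
            buckets[i] = new LongAdder();
        }
    }

    void observe(double value) {
        // Find the first bucket whose upper bound covers the value.
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) {
                buckets[i].add(1);   // atomic update 1
                break;
            }
        }
        sum.add(value);              // atomic update 2
        count.increment();           // atomic update 3
    }

    long bucketCount(int i) { return buckets[i].sum(); }
    double sum() { return sum.sum(); }
    long count() { return count.sum(); }
}
```

Each `observe()` call touches three independent adders, so a reader calling `sum()` and `count()` while writers are active can see values that do not belong to the same set of observations.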

The OTel SDK uses a **single `synchronized` block** with plain `+=`/`++` arithmetic:

```java
synchronized (lock) {
    this.sum += value;
    this.count++;
    this.counts[bucketIndex]++;
    // min/max tracking
}
```

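In uncontended (single-thread) benchmarks, the lock most likely stays on HotSpot's lightweight fast path (it is not actually elided, since the lock object is reachable from other threads), and the JIT can optimize the plain arithmetic inside the block freely, which beats the multi-CAS approach.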

## Suggestion

For **classic-only** histograms (where `nativeInitialSchema == CLASSIC_HISTOGRAM`), consider an alternative `doObserve()` implementation that uses a synchronized block with plain fields instead of multiple `LongAdder`/`DoubleAdder` instances. The buffer mechanism (needed for native histogram scale-down) could also be bypassed in classic-only mode.

This wouldn't affect native or hybrid histograms, which still need the current design.
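
A rough sketch of what such a classic-only data point could look like, under stated assumptions: the class name `SynchronizedClassicHistogram`, the private `lock` field, and the accessor methods are all hypothetical and not taken from either code base, and the bucket lookup is a plain linear scan over bounds that are assumed to end with `+Inf`.

```java
/** Hypothetical classic-only data point guarded by a single lock. */
final class SynchronizedClassicHistogram {
    private final Object lock = new Object();
    private final double[] upperBounds; // immutable after construction
    private final long[] counts;        // guarded by lock
    private double sum;                 // guarded by lock
    private long count;                 // guarded by lock

    SynchronizedClassicHistogram(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        this.counts = new long[upperBounds.length];
    }

    void observe(double value) {
        // The lookup only reads immutable state, so it can stay outside the lock.
        int idx = bucketIndex(value);
        synchronized (lock) {
            sum += value;
            count++;
            counts[idx]++;
        }
    }

    private int bucketIndex(double value) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) {
                return i;
            }
        }
        // Only reached for NaN when the last bound is +Inf; this sketch
        // puts such values in the last bucket.
        return upperBounds.length - 1;
    }

    long[] bucketCounts() {
        synchronized (lock) {
            return counts.clone();
        }
    }

    double sum() {
        synchronized (lock) {
            return sum;
        }
    }

    long count() {
        synchronized (lock) {
            return count;
        }
    }
}
```

A side effect of this design is that a collector could read `sum`, `count`, and all bucket counts under the same lock and get a mutually consistent snapshot, which the separate accessors in this sketch do not yet do.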

## Multi-threaded consideration

The `LongAdder` approach was chosen for multi-threaded scalability (striped cells reduce contention). A synchronized block would serialize threads. However:

- Most real-world `observe()` calls happen on different label-value combinations (different data points), so contention on a single data point is rare
- Even under contention, the critical section is very short (~5 ns of arithmetic), so lock hold time is minimal
- A benchmark with 4 threads would clarify the actual tradeoff
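
As a starting point for that comparison, the following harness runs both update strategies from several threads against a single shared data point. Everything in it is an assumption made for illustration: the names `ContentionCheck`, `Recorder`, `adderRecorder`, and `lockedRecorder` are hypothetical, the histogram is reduced to one bucket, and the wall-clock timing has no warmup, so it only checks correctness and gives a rough feel. A trustworthy number would need a JMH benchmark run with 4 threads.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.DoubleAdder;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical harness; the names are illustrative and not part of any library. */
final class ContentionCheck {
    interface Recorder {
        void observe(double value);
        long count();
    }

    /** Three independent striped adders, mirroring the multi-adder design. */
    static Recorder adderRecorder() {
        LongAdder bucket = new LongAdder();
        DoubleAdder sum = new DoubleAdder();
        LongAdder total = new LongAdder();
        return new Recorder() {
            public void observe(double value) {
                bucket.add(1);
                sum.add(value);
                total.increment();
            }
            public long count() { return total.sum(); }
        };
    }

    /** Plain fields guarded by one monitor, mirroring the synchronized design. */
    static Recorder lockedRecorder() {
        return new Recorder() {
            private long bucket;
            private double sum;
            private long count;
            public synchronized void observe(double value) {
                sum += value;
                count++;
                bucket++;
            }
            public synchronized long count() { return count; }
        };
    }

    /** Starts all threads together, each observing perThread values; returns elapsed nanoseconds. */
    static long run(Recorder r, int threads, int perThread) {
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < perThread; i++) {
                    r.observe(1.0);
                }
            });
            workers[t].start();
        }
        long begin = System.nanoTime();
        start.countDown();
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return System.nanoTime() - begin;
    }
}
```

Calling `ContentionCheck.run(ContentionCheck.lockedRecorder(), 4, 1_000_000)` and the same with `adderRecorder()` exercises the worst case, where all four threads hit one data point. Spreading the threads over several recorders would approximate the more common case of distinct label values.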

Not a high priority — 10.5 ns is already excellent. But worth considering if classic histogram performance matters.
