tests/slo_workloads: align with ydb-slo-action v2#609
Conversation
The v2 SLO action (ydb-platform/ydb-slo-action) starts a workload
container once per run, passes config via env vars (WORKLOAD_REF,
WORKLOAD_DURATION, YDB_CONNECTION_STRING, ...), and queries
sdk_operation_latency_p{50,95,99}_seconds as pre-computed gauges
labeled by (operation_type, operation_status, ref).
Changes:
- Dockerfile: drop SLO_BRANCH_REF compile-time ref baking; a single
ENTRYPOINT binary without CMD is now sufficient.
- utils/utils.cpp: subcommand is optional — with no free-arg the
binary runs create -> run -> cleanup sequentially (cleanup always
runs, even if run errored). Option resolution follows the standard
CLI > env > default priority via .DefaultValue():
- -c from YDB_CONNECTION_STRING (or built from YDB_ENDPOINT +
YDB_DATABASE);
- --metrics-push-url from OTEL_EXPORTER_OTLP_METRICS_ENDPOINT;
- --time from WORKLOAD_DURATION.
- utils/metrics.cpp: replace the OTel Histogram with HDR-backed
gauges. Only successful operations are recorded; a background
thread snapshots p50/p95/p99 every second, publishes the gauges
with operation_status="success", then resets the HDR window.
sdk_retry_attempts_total becomes a counter of (retry_attempts + 1)
per op. ref is read from WORKLOAD_REF at startup, not compile time.
- utils/CMakeLists.txt: pull HdrHistogram_c 0.11.8 via FetchContent
(opt-in via the outer YDB_SDK_TESTS flag) and link slo-utils
against hdr_histogram_static.
There was a problem hiding this comment.
Pull request overview
This PR updates the C++ SLO workload (tests/slo_workloads/) to match the v2 ydb-slo-action contract: configuration via environment variables, a single-container lifecycle (optionally running create→run→cleanup in one process), and publishing precomputed latency percentile gauges with the expected labels.
Changes:
- Make the subcommand optional; when omitted, run
create → run → cleanupsequentially in one process, ensuring cleanup always runs. - Switch option resolution to CLI > env > default, including env-derived defaults for connection string, metrics endpoint, and workload duration.
- Replace the exported latency histogram with HDR-backed p50/p95/p99 gauge metrics and adjust retry-attempts reporting; add HdrHistogram_c via FetchContent; simplify the Docker build.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/slo_workloads/utils/utils.h | Adds All command mode to support full lifecycle execution with no subcommand. |
| tests/slo_workloads/utils/utils.cpp | Implements env-derived defaults, allows zero free args, and runs create→run→cleanup in All mode. |
| tests/slo_workloads/utils/metrics.cpp | Reworks OTLP metrics to export percentile gauges from an HDR window and updates counters/labels. |
| tests/slo_workloads/utils/CMakeLists.txt | Fetches and links HdrHistogram_c for the SLO utils library. |
| tests/slo_workloads/Dockerfile | Removes build-time REF arg and builds the workload with a single ENTRYPOINT. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| void PublishPercentiles() { | ||
| auto snapshot = Latency_.SnapshotAndReset(); | ||
| if (!snapshot.HasData) { | ||
| return; | ||
| } | ||
| auto attrs = MergeAttributes({ | ||
| {"operation_type", OperationType_}, | ||
| {"operation_status", "success"}, | ||
| }); | ||
| LatencyP50_->Record(snapshot.P50, attrs); | ||
| LatencyP95_->Record(snapshot.P95, attrs); | ||
| LatencyP99_->Record(snapshot.P99, attrs); |
| void InitMetrics() { | ||
| ErrorsTotal_ = Meter_->CreateUInt64Counter("sdk_errors_total", | ||
| "Total number of errors encountered, categorized by error type." | ||
| OperationsTotal_ = Meter_->CreateDoubleCounter("sdk_operations_total", | ||
| "Total number of operations, categorized by operation type and status." | ||
| ); | ||
|
|
||
| OperationsTotal_ = Meter_->CreateUInt64Counter("sdk_operations_total", | ||
| "Total number of operations, categorized by type attempted by the SDK." | ||
| ); | ||
|
|
||
| OperationsSuccessTotal_ = Meter_->CreateUInt64Counter("sdk_operations_success_total", | ||
| "Total number of successful operations, categorized by type." | ||
| RetryAttemptsTotal_ = Meter_->CreateDoubleCounter("sdk_retry_attempts_total", | ||
| "Total number of retry attempts (including the first attempt), categorized by operation type." | ||
| ); |
Rewrite the SLO workflows following the pattern used by ydb-java-sdk (ydb-platform/ydb-java-sdk#644). slo.yml: - Drop the hand-rolled docker run orchestration that spun up YDB and invoked the workload with --dont-push / explicit create/run phases. The v2 action owns that lifecycle via deploy/compose.yml — we just hand it two prebuilt images. - Gate on the `SLO` PR label. - Build both images with a single `docker build` per ref; if the baseline commit can't be built (missing Dockerfile or compile error on a historical SHA), fall back to the current image so the run is comparable against itself rather than silently failing. - Drop `--build-arg REF=…` — ref is now read from WORKLOAD_REF env at runtime. - Rename matrix entry to `cpp-key-value` to match the built binary and collapse the per-compiler matrix to a single clang entry (gcc variant can be added back when needed). slo_report.yml: - Pin to @v2. - Add a second job that removes the `SLO` label from the PR after the report is published, matching js-sdk/java-sdk/go-sdk.
- Counters: switch OperationsTotal/RetryAttemptsTotal from DoubleCounter to UInt64Counter. Operation counts and retry attempts are inherently integer; float representation could drift for very high cumulative values and doesn't match the Prometheus counter convention. - Percentile gauges: publish 0.0 on empty-window intervals instead of returning early. OTel's sync Gauge holds the last Record() value and the periodic exporter re-emits it every collection cycle — without an explicit reset, gauges would look "stuck" at the last non-empty value after load stops, contradicting the per-second HDR reset semantics.
Shared CI runners (and local docker builds) periodically hit connection timeouts on ppa.launchpadcontent.net. Telling apt itself to retry via Acquire::Retries=5 plus a 60 s connect timeout handles the blip in-place without a shell retry loop.
Two separate issues were making every SLO PR a coin flip:
1. Any transient network error (GitHub release download, PPA timeout,
abseil/protobuf/grpc tarball) kills the entire 30-min build. Each
of the 8 wget calls and both apt-get steps is now retried:
- wget gets --tries=5 --waitretry=15 --timeout=60
--retry-connrefused --retry-on-http-error=500,502,503,504
via a shared WGET_OPTS env var.
- apt already has Acquire::Retries=5.
2. Every CI run does a full cold build. The Dockerfile's toolchain and
dep layers (~25 min) never change between SDK commits, so caching
them is trivially safe.
- Switch slo.yml to docker/build-push-action@v6 with GHA type=gha
cache export/import, scoped per preset.
- First build on a runner still takes ~30 min; subsequent builds
where only the SDK source changed should take ~3 min.
- Baseline build uses continue-on-error + an explicit fallback
step that retags ydb-app-current as ydb-app-baseline when the
historical commit won't compile — replaces the inline shell
if-else that docker/build-push-action can't express.
The earlier Acquire::Retries=5 tweak covered HTTP-level errors but didn't handle TCP connect timeouts to ppa.launchpadcontent.net, which is the actual failure mode observed on shared CI runners. Wrap both the add-apt-repository step and the main apt-get install in shell retry loops (5 attempts, 15/30/45/60/75 s backoff) so a CDN blip no longer kills the 30-min build.
The v2 SLO action invokes the workload as `slo-key-value <run-args>` without a subcommand keyword. The global parser used to error on unknown long options like --read-rps; now it tolerates them and the implicit All branch passes the leftover argv to the run phase.
The baseline image used to be built from the merge-base checkout's tests/slo_workloads/, so any harness change on the PR (Dockerfile, CLI parser, …) was absent from the baseline. SDK comparison should only vary the library, not the harness — so reuse current's harness on both sides.
cmake configure/build now run under ccache as the C/C++ compiler launcher, with /root/.ccache exposed as a BuildKit cache mount so state persists across runs through cache-to=type=gha,mode=max (BuildKit ≥0.13 exports cache mounts under mode=max). Cold runs incur the usual full compile; warm runs reuse object hashes and should drop cmake --build from ~14 min to a few minutes.
…ring lacks ?database=
The v2 SLO action provides YDB_CONNECTION_STRING in path form
(grpc://host:port/Root/testdb), which GetDatabase can't parse, so
prefix stayed empty and create issued CreateTable("key_value")
without the database root → BAD_REQUEST. Falling back to the
YDB_DATABASE env var (which the action also sets) restores the
correct table path.
…he-dance cache-to: type=gha,mode=max exports layer cache but not the contents of --mount=type=cache, so the ccache mount started empty on every run (0% hit rate observed). Restore /root/.ccache from a host dir backed by actions/cache, inject it into BuildKit before each build via the cache-dance action, and extract the updated state for the next run. Same scope is shared by current and baseline since the SDK code overlap dominates ccache hit potential.
| // YDB SLO action sets YDB_CONNECTION_STRING in path form | ||
| // (grpc://host:port/Root/testdb), which GetDatabase can't parse. | ||
| // Fall back to YDB_DATABASE which the action sets alongside it. | ||
| prefix = GetEnv("YDB_DATABASE"); |
There was a problem hiding this comment.
database можно вытащить из драйвера
| char programName[] = "slo"; | ||
| std::vector<char*> runArgv; | ||
| runArgv.reserve(argc + 1); | ||
| runArgv.push_back(programName); | ||
| for (int i = 0; i < argc; ++i) { | ||
| runArgv.push_back(argv[i]); |
| break; \ | ||
| echo "apt-get install attempt $i failed; sleeping $((i * 15))s"; \ | ||
| sleep $((i * 15)); \ | ||
| apt-get -y update || true; \ |
| remove-slo-label: | ||
| if: github.event.workflow_run.event == 'pull_request' | ||
| runs-on: ubuntu-latest |
| // When no successful ops landed in the last second, publish 0.0 | ||
| // for all percentiles so the gauges reset with the HDR window | ||
| // rather than appearing "stuck" at the last non-empty value (the | ||
| // OTel periodic exporter would otherwise re-emit the previous | ||
| // Record() value on every collection cycle). | ||
| LatencyP50_->Record(snapshot.HasData ? snapshot.P50 : 0.0, attrs); | ||
| LatencyP95_->Record(snapshot.HasData ? snapshot.P95 : 0.0, attrs); | ||
| LatencyP99_->Record(snapshot.HasData ? snapshot.P99 : 0.0, attrs); |
There was a problem hiding this comment.
AI Review Summary
Verdict: ✅ No critical issues found
Critical issues
No critical issues found.
Other findings
- Major | Medium: Observable callback ordering assumption —
tests/slo_workloads/utils/metrics.cpp:229 - Minor | Medium: Missing null terminator in argv vector for All mode —
tests/slo_workloads/utils/utils.cpp:253 - Minor | Medium: Per-call map allocation in MergeAttributes on hot path —
tests/slo_workloads/utils/metrics.cpp:150 - Minor | Low: DropTable silently converts NOT_FOUND to SUCCESS —
tests/slo_workloads/key_value/drop.cpp:12 - Nit | High: Redundant atomic load immediately after store in ObserveP50 —
tests/slo_workloads/utils/metrics.cpp:245
This review was generated automatically. Critical issues require attention; other findings are advisory.
If this comment was useful, please give it a 👍 — it helps us improve the review bot.
| // HDR once, and caches p50/p95/p99 in the series atomics for the | ||
| // subsequent two callbacks to read. This mirrors the Java workload's | ||
| // single-snapshot strategy without requiring a batch-callback API. | ||
| static void ObserveP50(opentelemetry::metrics::ObserverResult result, void* state) { |
There was a problem hiding this comment.
Severity: Major
Confidence: Medium
The design assumes ObserveP50 is always called before ObserveP95/ObserveP99 within a single collection cycle, so that SnapshotAndReset() runs once (in P50's callback) and the other two callbacks read the cached atomics.
The OTel C++ SDK currently iterates observable instruments in registration order (backed by a vector), but this is an implementation detail, not a documented API guarantee. If a future SDK version changes the iteration order (e.g., uses a hash map internally), ObserveP95/ObserveP99 could run before ObserveP50, reporting stale values from the previous cycle while the HDR histogram still holds the current cycle's data — or worse, the HDR would never be reset (since only ObserveP50 calls SnapshotAndReset).
Consider using a single observable callback that observes all three gauges atomically from one snapshot, or guard with an explicit epoch counter so P95/P99 can detect whether P50 has already run in the current cycle and trigger SnapshotAndReset themselves if it has not.
| // run-phase option (no subcommand keyword was supplied), so | ||
| // prepend a synthetic program name for ParseOptionsRun. | ||
| char programName[] = "slo"; | ||
| std::vector<char*> runArgv; |
There was a problem hiding this comment.
Severity: Minor
Confidence: Medium
The runArgv vector is passed to ParseOptionsRun via runArgv.data() but does not end with a nullptr sentinel. The C/POSIX convention is argv[argc] == NULL, and while NLastGetopt likely uses argc for bounds, violating this contract risks undefined behaviour if any downstream code (or a future SDK version) reads argv[argc].
Add runArgv.push_back(nullptr); after the loop and keep passing static_cast<int>(runArgv.size() - 1) as argc.
| OperationsSuccessTotal_ = Meter_->CreateUInt64Counter("sdk_operations_success_total", | ||
| "Total number of successful operations, categorized by type." | ||
| ); | ||
| OperationsTotal_->Add(uint64_t{1}, MergeAttributes({ |
There was a problem hiding this comment.
Severity: Minor
Confidence: Medium
MergeAttributes copies the entire CommonAttributes_ map and inserts new entries on every Record() invocation. At 1000+ RPS with two operation types, this creates thousands of short-lived std::map allocations per second on the hot path.
Since the set of (operation_type, operation_status) pairs is known at series-creation time, consider pre-computing the merged attribute maps per-series and storing them in TSeries. This removes per-call allocation and map-copy overhead entirely.
| } else { | ||
| series->HasData.store(false); | ||
| } | ||
| if (!series->HasData.load()) { |
There was a problem hiding this comment.
Severity: Nit
Confidence: High
The series->HasData.load() on this line always reflects the value just stored 2–4 lines above in the same thread — the load is redundant. This could be simplified to a direct boolean check:
bool hasData = snap.HasData;
if (hasData) {
series->P50.store(snap.P50);
series->P95.store(snap.P95);
series->P99.store(snap.P99);
}
series->HasData.store(hasData);
if (!hasData) {
continue;
}|
Analysis performed by claude, claude-opus-4-6. |
| // sdk_retry_attempts_total = total number of technical attempts | ||
| // including the first one. RetryAttempts counts only post-first | ||
| // attempts, so add 1 to include the initial attempt. | ||
| RetryAttemptsTotal_->Add(data.RetryAttempts + 1, | ||
| opentelemetry::common::MakeKeyValueIterableView(series.RetryAttrs)); | ||
|
|
||
| if (success) { | ||
| series.Recorder.Record(data.Delay); | ||
| } |
| #include <atomic> | ||
| #include <chrono> | ||
| #include <memory> | ||
| #include <mutex> | ||
| #include <shared_mutex> | ||
| #include <unordered_map> |
Summary
The v2 ydb-platform/ydb-slo-action starts a workload container once per run, passes config exclusively via env vars (
WORKLOAD_REF,WORKLOAD_DURATION,YDB_CONNECTION_STRING,OTEL_EXPORTER_OTLP_METRICS_ENDPOINT, …), and queriessdk_operation_latency_p{50,95,99}_secondsas pre-computed gauges labeled by(operation_type, operation_status, ref). The existing C++ workload does not match: it requires a subcommand + a long list of CLI flags, bakesrefat compile time viaSLO_BRANCH_REF, and publishes an OTel histogram that the action no longer reads. The old CI workflow also hand-rolled thedocker runorchestration thatydb-slo-action/init@v2now owns viadeploy/compose.yml.This PR realigns
tests/slo_workloads/and the SLO CI workflows to the v2 contract with the minimum possible code churn, following the pattern used by the Java SDK (ydb-platform/ydb-java-sdk#644).Workload changes (
tests/slo_workloads/)SLO_BRANCH_REFbuild arg and thecmake -DSLO_BRANCH_REF=…step. SingleENTRYPOINT, noCMD.utils/utils.{h,cpp}— subcommand becomes optional. With no free-arg, the binary executescreate → run → cleanupsequentially in one process (cleanup always runs, even if run errored) — same single-container lifecycle used byydb-js-sdk/tests/slo. Option resolution follows CLI > env > default via.DefaultValue():-cfromYDB_CONNECTION_STRING(or composed fromYDB_ENDPOINT+YDB_DATABASE);--metrics-push-urlfromOTEL_EXPORTER_OTLP_METRICS_ENDPOINT;--timefromWORKLOAD_DURATION.utils/metrics.cpp— replace the OTelHistogram<double>with HDR-backed gauges. Only successful operations are recorded (errors are excluded from percentile stats, matchingoperation_status="success"in the action's metrics query). A background thread snapshots p50/p95/p99 every second, publishes the threesdk_operation_latency_p*_secondsgauges, then resets the HDR window — each gauge value reflects only the last interval.sdk_retry_attempts_totalbecomes a counter ofretry_attempts + 1per op.refis read fromWORKLOAD_REFat startup, not compile time.utils/CMakeLists.txt— add HdrHistogram_c 0.11.8 viaFetchContent(opt-in — only reached whenYDB_SDK_TESTS=ON), linkslo-utilsprivately againsthdr_histogram_static.Metrics contract after change
sdk_operations_totaloperation_type,operation_status,refsdk_retry_attempts_totaloperation_type,refsdk_operation_latency_p50_secondsoperation_type,operation_status,refsdk_operation_latency_p95_secondsoperation_type,operation_status,refsdk_operation_latency_p99_secondsoperation_type,operation_status,refLegacy metrics (
sdk_errors_total,sdk_operations_success_total,sdk_operations_failure_total,sdk_operation_latency_secondshistogram) are removed.CI changes (
.github/workflows/)slo.yml— delegate toydb-slo-action/init@v2. No more hand-rolleddocker runorchestration — the v2 action brings up YDB + Prometheus + Grafana + chaos-monkey via its owndeploy/compose.yml. We just hand it two prebuilt images. Gated on theSLOPR label (matches js-sdk/java-sdk/go-sdk). Baseline build has a fallback to the current image if the historical commit can't be built.slo_report.yml— pin to@v2and add a second job that removes theSLOlabel after the report is published.Triggering convention
Add the
SLOlabel to a PR to opt into a full SLO run. The label is removed automatically once the report is posted.Test plan
docker run --rm -e WORKLOAD_REF=current -e WORKLOAD_DURATION=15 -e YDB_CONNECTION_STRING=grpc://…:2136/?database=/local ydb-app-current— logs showcreate→run(15 s) →cleanup, exit 0.workload_current_command: "--read-rps 500"override works (passes through to therunphase).SLOlabel on this PR triggers the workflow,ydb-slo-action/init@v2starts compose, and Prometheus servessdk_operation_latency_p99_seconds{ref="current",operation_type="read"}with sensible values.p99_seconds ≈ 0.05; after load stops the gauge stops updating (window resets each second with no new samples).