Skip to content

Integrate Benchmarking into CI with Performance Analysis#418

Open
ev1lm0nk3y wants to merge 4 commits into
walles:masterfrom
ev1lm0nk3y:ci-benchmark-integration
Open

Integrate Benchmarking into CI with Performance Analysis#418
ev1lm0nk3y wants to merge 4 commits into
walles:masterfrom
ev1lm0nk3y:ci-benchmark-integration

Conversation

@ev1lm0nk3y
Copy link
Copy Markdown
Contributor

Overview
This PR overhauls our Continuous Integration pipeline by consolidating workflows, migrating to modern GitHub Actions for linting, and introducing automated performance tracking with Pull Request feedback. The goal is to catch both logic bugs and performance regressions in a single, efficient pipeline.

Key Changes

  1. Optimized CI Architecture (2-Job Structure)
    The workflow is now split into two focused jobs to reduce redundant cycles and improve security:
  • Lint Job (Ubuntu-only):
    • Migrated to golangci/golangci-lint-action@v6 with only-new-issues: true for cleaner PR reviews.
    • Centralized man page linting (mandoc) into this job, removing it from the matrixed testing job.
  • Validate Job (Matrix: Linux, macOS, Windows):
    • Consolidates unit tests and benchmarks into a single matrixed execution.
    • Security: Elevated permissions (contents: write, pull-requests: write) are scoped specifically to this job to support performance analysis.
  1. Automated Benchmarking & PR Feedback
  • New benchmark.sh Script: A platform-aware script that detects system specs (OS/Arch/CPU) and saves results to benchmarks/.
  • PR Performance Comments: Integrated benchmark-action/github-action-benchmark to automatically post a comment on PRs comparing the current run's performance against the baseline.
  • Regression Guardrails: The CI will fail if performance regresses by more than 20% (alert-threshold: 120%).
  • Stable Baselines: While benchmarks run on all platforms for runtime validation, analysis/commenting is restricted to ubuntu-latest to ensure hardware consistency and prevent redundant bot comments.
  1. Tooling Enhancements
  • test.sh: Added a SKIP_LINT flag to allow CI to skip the internal linting step (now handled by the specialized Lint job).
  • README.md: Updated the "Developing" section with instructions for the new benchmarking workflow.

How to Use Locally
Run the new benchmark script directly:

1 ./benchmark.sh
Or use the new flag in the test suite:

1 ./test.sh --bench

Verification Results

  • test.sh --bench correctly delegates to benchmark.sh.
  • benchmark.sh correctly identifies hardware (e.g., benchmarks/darwin-arm64-m4_pro).
  • CI correctly skips linting in the matrix jobs when SKIP_LINT is set.
  • Workflow YAML validated for matrix logic and permission scoping.

@ev1lm0nk3y
Copy link
Copy Markdown
Contributor Author

Feel free to reject this, but I thought these changes would be useful

Split CI into separate lint, test, and benchmark jobs

Consolidate CI: Combine tests/benchmarks and move mandoc linting to Lint job

limit fetch depth for linting
@ev1lm0nk3y ev1lm0nk3y force-pushed the ci-benchmark-integration branch from 84ae6aa to c91d14b Compare May 20, 2026 03:30
@walles
Copy link
Copy Markdown
Owner

walles commented May 23, 2026

Doesn't this assume that all GitHub Actions runners always have the same performance characteristics?

Would the benchmark numbers here be stable or flaky, and if flaky, by how much?

Since these are random cloud machines, how would one protect from flakiness from things like:

  • Load from other VMs on the same host
  • Disk caches being filled or not filled, on the host
  • Amount of RAM in the VM
  • Number of cores in the VM

@ev1lm0nk3y
Copy link
Copy Markdown
Contributor Author

Responses in-line

In short, yes these are valid concerns and I have some suggestions I would like your opinion on.

  1. Do nothing and let the alert threshold at 120% of baseline account for the variability at the risk of introducing performance regressions over time.
  2. Use sample averaging to smooth out any single test run spike alleviating the "noisy neighbor" concern
  3. Move to "relative" benchmarking where the tests are run within the source and destination branches to ensure that both are using the same hardware specs
  4. Convert to using deterministic metrics over time-based metrics because things like allocs/op or B/op do not change with the host's current performance

My inclination is to use deterministic metrics instead of sample averaging and relative benchmarking because these are calculated by the Go runtime based on code execution and are virtually immune to VM specs and noisy neighbors. To my knowledge benchdiff and benchstat use the time-based metrics so a custom parser my be required.

What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants