Skip to content

vLLM/SGLang: comparison report#904

Merged
podkidyshev merged 15 commits into
mainfrom
ipod/llm-comparison
Jun 3, 2026
Merged

vLLM/SGLang: comparison report#904
podkidyshev merged 15 commits into
mainfrom
ipod/llm-comparison

Conversation

@podkidyshev
Copy link
Copy Markdown
Contributor

Summary

  • Enable comparison reporting for vLLM and SGLang

Test Plan

  • Automated CI
  • Manual runs

Additional Notes

@podkidyshev podkidyshev self-assigned this May 27, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 27, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds canonicalized diff helpers and internal ComparisonReport grouping; introduces an abstract LLMServingComparisonReport for standardized metric extraction and visualization; implements vLLM and SGLang comparison reports, registers them, and adds unit/integration tests that validate generated HTML outputs.

Changes

LLM Serving Comparison Reports

Layer / File(s) Summary
Diff utility refactoring
src/cloudai/report_generator/util.py
New canonicalization helpers and diff_comparison_values(values_by_run: list[Mapping[str, object]]) -> dict[str, list[object]] compute per-key diffs across runs.
ComparisonReport grouping refactor
src/cloudai/report_generator/comparison_report.py
ComparisonReport now builds per-run comparison_values, resolves group keys/labels, diffs per-run values with diff_comparison_values, and exposes group_test_runs() used by get_bokeh_html()/generate().
groups.py minor adjustments
src/cloudai/report_generator/groups.py
Header year and import line updated; minor spacing/import changes around GroupedTestRuns/TestRunsGrouper.
LLMServingComparisonReport abstract framework
src/cloudai/workloads/common/llm_serving_report.py
Abstract base class defines workload hooks (can_handle, parse_results, benchmark_cmd_args, parse_accuracy), augments comparison_values, filters runs, extracts standardized latency/success/throughput/quality dataframes, and creates Rich tables and Bokeh charts with per-run caching.
vLLM comparison report implementation
src/cloudai/workloads/vllm/vllm_comparison_report.py, src/cloudai/workloads/vllm/__init__.py
VLLMComparisonReport implements vLLM-specific detection, arg exposure, parsing via VLLMBenchReportGenerationStrategy, semantic-accuracy parsing, sets vllm_comparison.html, and updates package exports.
SGLang comparison report implementation
src/cloudai/workloads/sglang/sglang_comparison_report.py, src/cloudai/workloads/sglang/__init__.py
SGLangComparisonReport implements SGLang-specific detection, arg exposure, parsing via SGLangBenchReportGenerationStrategy, semantic-accuracy parsing, sets sglang_comparison.html, and updates package exports.
Registry integration
src/cloudai/registration.py
Register "vllm_comparison" and "sglang_comparison" scenario reports with their corresponding classes and enabled ComparisonReportConfig.
Tests and integration
tests/test_init.py, tests/report_generation_strategy/test_report_groups.py, tests/workloads/common/test_llm_serving_report.py
Registry expectations extended; add normalization tests for diff_comparison_values; add integration tests that write synthetic benchmark/accuracy artifacts, load TestRuns, build comparison tables/charts, and assert generated HTML files (vllm_comparison.html, sglang_comparison.html).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • srivatsankrishnan
  • jeffnvidia

"🐰 I hopped through diffs with a twitch and a cheer,
Tables and charts now stand crystal clear,
vLLM and SGLang, side-by-side they bloom,
Reports sprout like carrots across the room,
Happy hops — HTMLs tucked in near!"

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'vLLM/SGLang: comparison report' directly and clearly summarizes the main change—enabling comparison reporting for vLLM and SGLang workloads.
Description check ✅ Passed The description explicitly states the PR's purpose ('Enable comparison reporting for vLLM and SGLang'), which aligns with the changeset's primary objectives.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ipod/llm-comparison

Comment @coderabbitai help to get the list of available commands and usage tips.

@podkidyshev podkidyshev added the enhancement New feature or request label May 27, 2026
@podkidyshev podkidyshev force-pushed the ipod/llm-comparison branch from bd9012d to 2596726 Compare May 28, 2026 15:58
@podkidyshev podkidyshev marked this pull request as ready for review May 28, 2026 16:12
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/report_generator/util.py`:
- Around line 175-183: diff_comparison_values currently uses repr(v) to detect
differences which produces false positives (e.g., 1 vs 1.0 or dicts with
different insertion order); replace that with a canonicalization + equality
check: add a small helper (inside diff_comparison_values) named
canonicalize(value) that (a) for Mapping or list/tuple returns json.dumps(value,
sort_keys=True, default=str) to normalize structural content/order, (b) for
numeric types normalizes to a single numeric form (e.g., float(value) or
Decimal) so 1 and 1.0 compare equal, and (c) otherwise returns the value itself
(or str(value) as a last resort); then compute canonical_forms =
[canonicalize(v) for v in all_values] and use len(set(canonical_forms)) > 1 to
decide diffs and assign diff[key] = all_values as before (refer to
diff_comparison_values, values_by_run, all_values).

In `@src/cloudai/workloads/common/llm_serving_report.py`:
- Line 176: The calls to extract_data_as_df are duplicated in create_tables and
create_charts causing repeated parsing; modify the class to compute and cache
extracted DataFrames once and reuse them: e.g., add a lightweight cache
attribute (like self._df_cache) keyed by the test run object or an immutable id
(referencing extract_data_as_df, create_tables, create_charts, and
group.items/item.tr), populate it the first time you call extract_data_as_df for
a given tr, and have both create_tables and create_charts read from that cache
(or accept an injected precomputed list of DataFrames) instead of calling
extract_data_as_df again.

In `@src/cloudai/workloads/vllm/vllm_comparison_report.py`:
- Around line 24-25: The current import in vllm_comparison_report.py pulls
VllmBenchCmdArgs, VLLMBenchReportGenerationStrategy, and VllmTestDefinition from
the package root (cloudai.workloads.vllm) which creates a circular init
dependency; change those to direct imports from the submodules that actually
define them (use relative/submodule imports) so VllmBenchCmdArgs,
VLLMBenchReportGenerationStrategy, and VllmTestDefinition are imported from
their defining modules instead of from cloudai.workloads.vllm.

In `@tests/workloads/common/test_llm_serving_report.py`:
- Around line 40-137: Add a second test mirroring
test_llm_comparison_report_generates_html that exercises the SGLang path
end-to-end: create two cloudai.core.TestRun instances for SGLang (use
SGLangTestDefinition, SGLangCmdArgs, SGLangBenchCmdArgs,
SGLangSemanticEvalCmdArgs), write corresponding bench and quality JSON results
to slurm_system.output_path under each test run (use the SGLANG_BENCH_JSON_FILE
and SGLANG_GSM8K_JSON_FILE constants or their equivalents), instantiate
SGLangComparisonReport with a TestScenario containing those runs, call
load_test_runs(), create_tables(group_test_runs()) and assert key table cells
(latency/throughput/quality/success) similar to the vLLM test, call
report.generate(), and finally assert that (slurm_system.output_path /
"sglang_comparison.html").exists() to ensure HTML was produced.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 6a508bb1-229e-4702-b1ad-bdb49888fef9

📥 Commits

Reviewing files that changed from the base of the PR and between 0738c4b and 727552b.

📒 Files selected for processing (12)
  • src/cloudai/registration.py
  • src/cloudai/report_generator/comparison_report.py
  • src/cloudai/report_generator/groups.py
  • src/cloudai/report_generator/util.py
  • src/cloudai/workloads/common/llm_serving_report.py
  • src/cloudai/workloads/sglang/__init__.py
  • src/cloudai/workloads/sglang/sglang_comparison_report.py
  • src/cloudai/workloads/vllm/__init__.py
  • src/cloudai/workloads/vllm/vllm_comparison_report.py
  • tests/test_init.py
  • tests/workloads/common/test_llm_serving.py
  • tests/workloads/common/test_llm_serving_report.py

Comment thread src/cloudai/report_generator/util.py Outdated
Comment thread src/cloudai/workloads/common/llm_serving_report.py Outdated
Comment thread src/cloudai/workloads/vllm/vllm_comparison_report.py Outdated
Comment thread tests/workloads/common/test_llm_serving_report.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/report_generator/util.py`:
- Around line 191-197: _canonical_comparison_value currently only coerces
top-level ints/floats to float and then uses json.dumps on mappings/lists which
leaves nested numeric types like 1 vs 1.0 unchanged; update
_canonical_comparison_value to recursively walk collections (for the mapping
case handle keys and values, for list/tuple handle each element) and replace any
int/float with float (preserve bools), then json.dumps the normalized structure
so nested numerics are canonicalized before serialization. Target the existing
function name _canonical_comparison_value and ensure you handle collections.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 711f86bc-39c4-43a8-bb10-01f52a72de91

📥 Commits

Reviewing files that changed from the base of the PR and between 727552b and 626442f.

📒 Files selected for processing (6)
  • src/cloudai/report_generator/util.py
  • src/cloudai/workloads/common/llm_serving_report.py
  • src/cloudai/workloads/sglang/sglang_comparison_report.py
  • src/cloudai/workloads/vllm/vllm_comparison_report.py
  • tests/report_generation_strategy/test_report_groups.py
  • tests/workloads/common/test_llm_serving_report.py

Comment thread src/cloudai/report_generator/util.py Outdated
@podkidyshev podkidyshev merged commit 6f678f7 into main Jun 3, 2026
5 checks passed
@podkidyshev podkidyshev deleted the ipod/llm-comparison branch June 3, 2026 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants