vLLM/SGLang: comparison report by podkidyshev · Pull Request #904 · NVIDIA/cloudai

podkidyshev · 2026-05-27T21:49:13Z

Summary

Enable comparison reporting for vLLM and SGLang

Test Plan

Automated CI
Manual runs

Additional Notes

coderabbitai · 2026-05-27T21:49:20Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds canonicalized diff helpers and internal ComparisonReport grouping; introduces an abstract LLMServingComparisonReport for standardized metric extraction and visualization; implements vLLM and SGLang comparison reports, registers them, and adds unit/integration tests that validate generated HTML outputs.

Changes

LLM Serving Comparison Reports

Layer / File(s)	Summary
Diff utility refactoring `src/cloudai/report_generator/util.py`	New canonicalization helpers and `diff_comparison_values(values_by_run: list[Mapping[str, object]]) -> dict[str, list[object]]` compute per-key diffs across runs.
ComparisonReport grouping refactor `src/cloudai/report_generator/comparison_report.py`	ComparisonReport now builds per-run `comparison_values`, resolves group keys/labels, diffs per-run values with `diff_comparison_values`, and exposes `group_test_runs()` used by `get_bokeh_html()`/`generate()`.
groups.py minor adjustments `src/cloudai/report_generator/groups.py`	Header year and import line updated; minor spacing/import changes around `GroupedTestRuns`/`TestRunsGrouper`.
LLMServingComparisonReport abstract framework `src/cloudai/workloads/common/llm_serving_report.py`	Abstract base class defines workload hooks (`can_handle`, `parse_results`, `benchmark_cmd_args`, `parse_accuracy`), augments `comparison_values`, filters runs, extracts standardized latency/success/throughput/quality dataframes, and creates Rich tables and Bokeh charts with per-run caching.
vLLM comparison report implementation `src/cloudai/workloads/vllm/vllm_comparison_report.py`, `src/cloudai/workloads/vllm/__init__.py`	`VLLMComparisonReport` implements vLLM-specific detection, arg exposure, parsing via `VLLMBenchReportGenerationStrategy`, semantic-accuracy parsing, sets `vllm_comparison.html`, and updates package exports.
SGLang comparison report implementation `src/cloudai/workloads/sglang/sglang_comparison_report.py`, `src/cloudai/workloads/sglang/__init__.py`	`SGLangComparisonReport` implements SGLang-specific detection, arg exposure, parsing via `SGLangBenchReportGenerationStrategy`, semantic-accuracy parsing, sets `sglang_comparison.html`, and updates package exports.
Registry integration `src/cloudai/registration.py`	Register `"vllm_comparison"` and `"sglang_comparison"` scenario reports with their corresponding classes and enabled `ComparisonReportConfig`.
Tests and integration `tests/test_init.py`, `tests/report_generation_strategy/test_report_groups.py`, `tests/workloads/common/test_llm_serving_report.py`	Registry expectations extended; add normalization tests for `diff_comparison_values`; add integration tests that write synthetic benchmark/accuracy artifacts, load TestRuns, build comparison tables/charts, and assert generated HTML files (`vllm_comparison.html`, `sglang_comparison.html`).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

srivatsankrishnan
jeffnvidia

"🐰 I hopped through diffs with a twitch and a cheer,
Tables and charts now stand crystal clear,
vLLM and SGLang, side-by-side they bloom,
Reports sprout like carrots across the room,
Happy hops — HTMLs tucked in near!"

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'vLLM/SGLang: comparison report' directly and clearly summarizes the main change—enabling comparison reporting for vLLM and SGLang workloads.
Description check	✅ Passed	The description explicitly states the PR's purpose ('Enable comparison reporting for vLLM and SGLang'), which aligns with the changeset's primary objectives.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch ipod/llm-comparison

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/report_generator/util.py`:
- Around line 175-183: diff_comparison_values currently uses repr(v) to detect
differences which produces false positives (e.g., 1 vs 1.0 or dicts with
different insertion order); replace that with a canonicalization + equality
check: add a small helper (inside diff_comparison_values) named
canonicalize(value) that (a) for Mapping or list/tuple returns json.dumps(value,
sort_keys=True, default=str) to normalize structural content/order, (b) for
numeric types normalizes to a single numeric form (e.g., float(value) or
Decimal) so 1 and 1.0 compare equal, and (c) otherwise returns the value itself
(or str(value) as a last resort); then compute canonical_forms =
[canonicalize(v) for v in all_values] and use len(set(canonical_forms)) > 1 to
decide diffs and assign diff[key] = all_values as before (refer to
diff_comparison_values, values_by_run, all_values).

In `@src/cloudai/workloads/common/llm_serving_report.py`:
- Line 176: The calls to extract_data_as_df are duplicated in create_tables and
create_charts causing repeated parsing; modify the class to compute and cache
extracted DataFrames once and reuse them: e.g., add a lightweight cache
attribute (like self._df_cache) keyed by the test run object or an immutable id
(referencing extract_data_as_df, create_tables, create_charts, and
group.items/item.tr), populate it the first time you call extract_data_as_df for
a given tr, and have both create_tables and create_charts read from that cache
(or accept an injected precomputed list of DataFrames) instead of calling
extract_data_as_df again.

In `@src/cloudai/workloads/vllm/vllm_comparison_report.py`:
- Around line 24-25: The current import in vllm_comparison_report.py pulls
VllmBenchCmdArgs, VLLMBenchReportGenerationStrategy, and VllmTestDefinition from
the package root (cloudai.workloads.vllm) which creates a circular init
dependency; change those to direct imports from the submodules that actually
define them (use relative/submodule imports) so VllmBenchCmdArgs,
VLLMBenchReportGenerationStrategy, and VllmTestDefinition are imported from
their defining modules instead of from cloudai.workloads.vllm.

In `@tests/workloads/common/test_llm_serving_report.py`:
- Around line 40-137: Add a second test mirroring
test_llm_comparison_report_generates_html that exercises the SGLang path
end-to-end: create two cloudai.core.TestRun instances for SGLang (use
SGLangTestDefinition, SGLangCmdArgs, SGLangBenchCmdArgs,
SGLangSemanticEvalCmdArgs), write corresponding bench and quality JSON results
to slurm_system.output_path under each test run (use the SGLANG_BENCH_JSON_FILE
and SGLANG_GSM8K_JSON_FILE constants or their equivalents), instantiate
SGLangComparisonReport with a TestScenario containing those runs, call
load_test_runs(), create_tables(group_test_runs()) and assert key table cells
(latency/throughput/quality/success) similar to the vLLM test, call
report.generate(), and finally assert that (slurm_system.output_path /
"sglang_comparison.html").exists() to ensure HTML was produced.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 6a508bb1-229e-4702-b1ad-bdb49888fef9

📥 Commits

Reviewing files that changed from the base of the PR and between 0738c4b and 727552b.

📒 Files selected for processing (12)

src/cloudai/registration.py
src/cloudai/report_generator/comparison_report.py
src/cloudai/report_generator/groups.py
src/cloudai/report_generator/util.py
src/cloudai/workloads/common/llm_serving_report.py
src/cloudai/workloads/sglang/__init__.py
src/cloudai/workloads/sglang/sglang_comparison_report.py
src/cloudai/workloads/vllm/__init__.py
src/cloudai/workloads/vllm/vllm_comparison_report.py
tests/test_init.py
tests/workloads/common/test_llm_serving.py
tests/workloads/common/test_llm_serving_report.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/report_generator/util.py`:
- Around line 191-197: _canonical_comparison_value currently only coerces
top-level ints/floats to float and then uses json.dumps on mappings/lists which
leaves nested numeric types like 1 vs 1.0 unchanged; update
_canonical_comparison_value to recursively walk collections (for the mapping
case handle keys and values, for list/tuple handle each element) and replace any
int/float with float (preserve bools), then json.dumps the normalized structure
so nested numerics are canonicalized before serialization. Target the existing
function name _canonical_comparison_value and ensure you handle collections.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 711f86bc-39c4-43a8-bb10-01f52a72de91

📥 Commits

Reviewing files that changed from the base of the PR and between 727552b and 626442f.

📒 Files selected for processing (6)

src/cloudai/report_generator/util.py
src/cloudai/workloads/common/llm_serving_report.py
src/cloudai/workloads/sglang/sglang_comparison_report.py
src/cloudai/workloads/vllm/vllm_comparison_report.py
tests/report_generation_strategy/test_report_groups.py
tests/workloads/common/test_llm_serving_report.py

comparison reports for vllm and sglang

bf13531

podkidyshev self-assigned this May 27, 2026

podkidyshev added the enhancement New feature or request label May 27, 2026

podkidyshev added 6 commits May 27, 2026 14:54

use updated import style

243dc0d

one more

3f10c13

minor refactor

634e39a

refactor base comp-reporting code to avoid hacky getattr

ba30fa6

better code than getattr

abe9ac0

remove callables as args in reporting

2596726

podkidyshev force-pushed the ipod/llm-comparison branch from bd9012d to 2596726 Compare May 28, 2026 15:58

fix copyright

727552b

podkidyshev marked this pull request as ready for review May 28, 2026 16:12

podkidyshev requested review from jeffnvidia and srivatsankrishnan as code owners May 28, 2026 16:12

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

Comment thread src/cloudai/report_generator/util.py Outdated

Comment thread src/cloudai/workloads/common/llm_serving_report.py Outdated

Comment thread src/cloudai/workloads/vllm/vllm_comparison_report.py Outdated

Comment thread tests/workloads/common/test_llm_serving_report.py

podkidyshev added 3 commits June 3, 2026 12:05

Merge branch 'main' into ipod/llm-comparison

852b1b7

Address CodeRabbit comparison report feedback

626442f

Fix report groups copyright header

82151bc

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread src/cloudai/report_generator/util.py Outdated

podkidyshev added 4 commits June 3, 2026 12:24

Canonicalize nested comparison values

c644d64

Simplify comparison report grouping

81bd9f5

Format comparison grouping tests

69cd20a

Prune nested comparison label diffs

777d44b

amaslenn approved these changes Jun 3, 2026

View reviewed changes

podkidyshev merged commit 6f678f7 into main Jun 3, 2026
5 checks passed

podkidyshev deleted the ipod/llm-comparison branch June 3, 2026 14:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vLLM/SGLang: comparison report#904

vLLM/SGLang: comparison report#904
podkidyshev merged 15 commits into
mainfrom
ipod/llm-comparison

podkidyshev commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

podkidyshev commented May 27, 2026

Summary

Test Plan

Additional Notes

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 27, 2026 •

edited

Loading