Add vLLM offline backend with micro-batching support by maryamtahhan · Pull Request #736 · vllm-project/guidellm

maryamtahhan · 2026-05-20T15:33:44Z

Add vLLM Offline Backend with Shared Base Class

This PR implements offline/batch inference support for vLLM using a clean, extensible architecture that eliminates code duplication between vLLM backends.

Summary

Adds VLLMOfflineBackend for batch processing and refactors existing vLLM code into a shared VLLMBackendBase class. This reduces code duplication by ~360 lines while adding new offline inference capabilities optimized for benchmarking scenarios.

New Components

VLLMBackendBase (base.py)

Shared base class for all vLLM backends containing ~400 lines of common functionality:

Chat template resolution (plain, default-template, custom Jinja2)
Multimodal data handling (image/audio columns)
Request formatting and prompt resolution
Sampling parameter creation
Abstract _get_tokenizer() method for subclass implementation
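The abstract-hook arrangement described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in base.py: `BackendBaseSketch`, `OfflineSketch`, `WhitespaceTokenizer`, and `count_prompt_tokens` are hypothetical names, and only `_get_tokenizer()` is taken from the PR description.

```python
from abc import ABC, abstractmethod


class WhitespaceTokenizer:
    """Hypothetical stand-in for a real tokenizer."""

    def encode(self, text: str) -> list[str]:
        return text.split()


class BackendBaseSketch(ABC):
    """Hypothetical base class: shared logic lives here."""

    @abstractmethod
    def _get_tokenizer(self):
        """Each subclass returns the tokenizer owned by its engine."""

    def count_prompt_tokens(self, prompt: str) -> int:
        # Shared helper that depends only on the abstract hook.
        return len(self._get_tokenizer().encode(prompt))


class OfflineSketch(BackendBaseSketch):
    """Hypothetical subclass that supplies its own tokenizer."""

    def __init__(self) -> None:
        self._tokenizer = WhitespaceTokenizer()

    def _get_tokenizer(self):
        return self._tokenizer
```

Because the hook is abstract, instantiating the base class directly raises `TypeError`, so a new backend cannot forget to supply a tokenizer.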

VLLMOfflineBackend (offline.py)

New backend for offline batch processing using vLLM's LLM class:

Micro-batching with configurable batch_size (default: 32)
Buffers requests until batch is full, then processes with LLM.generate()
Auto-flushes remaining requests on shutdown
Single-process execution for batch coordination
Ideal for offline benchmarking and dataset evaluation
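The buffer-then-flush behaviour listed above can be sketched as below. This is a simplified, synchronous illustration and not the implementation in offline.py: `MicroBatcher`, `submit`, `flush`, and `fake_generate` are hypothetical names, and `fake_generate` merely stands in for a batched generate call.

```python
from typing import Callable


class MicroBatcher:
    """Hypothetical buffer: collects prompts and runs them in groups."""

    def __init__(
        self,
        generate: Callable[[list[str]], list[str]],
        batch_size: int = 32,
    ) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._generate = generate
        self._batch_size = batch_size
        self._pending: list[str] = []
        self.results: list[str] = []

    def submit(self, prompt: str) -> None:
        # Buffer the prompt; run the batch once it is full.
        self._pending.append(prompt)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        # Process whatever is buffered, even a partial batch.
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self.results.extend(self._generate(batch))


def fake_generate(prompts: list[str]) -> list[str]:
    # Stand-in for a batched generate call: upper-cases each prompt.
    return [p.upper() for p in prompts]


batcher = MicroBatcher(fake_generate, batch_size=2)
for prompt in ["a", "b", "c"]:
    batcher.submit(prompt)
flushed_early = list(batcher.results)  # only the first full batch so far
batcher.flush()  # what a shutdown hook would do for the remainder
```

With a batch size of 2 and three prompts, the first two are processed as soon as the buffer fills, and the third is processed only by the final flush.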

Refactored VLLMPythonBackend (vllm.py)

Now extends VLLMBackendBase instead of Backend directly
Removed ~360 lines of duplicate code
Implements _get_tokenizer() for AsyncLLMEngine
No breaking changes to public API

Key Benefits

✅ Code Reuse: ~400 lines shared between backends
✅ Reduced Duplication: ~360 lines eliminated from VLLMPythonBackend
✅ Extensibility: Easy to add new vLLM-based backends (e.g., vLLM server)
✅ No Breaking Changes: VLLMPythonBackend API unchanged
✅ Clean Architecture: Clear separation of concerns with shared base

Documentation

New guide: docs/guides/vllm-offline-backend.md
- Usage examples and configuration options
- Performance tuning (batch size, vLLM EngineArgs)
- Comparison with other backends
- Troubleshooting guide
Updated: docs/guides/backends.md with offline backend section

Usage Example

guidellm benchmark run \
  --backend vllm_offline \
  --model "Qwen/Qwen3-0.6B" \
  --backend-kwargs '{"batch_size": 64, "vllm_config": {"tensor_parallel_size": 2}}' \
  --data "prompt_tokens=256,output_tokens=128" \
  --max-requests 1000
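As a rough picture of how the JSON passed to --backend-kwargs could map onto typed arguments, consider the sketch below. It is an assumption for illustration only: the PR's real class is VLLMOfflineBackendArgs and its fields and parsing path are not shown here, so `OfflineArgsSketch` and its validation rule are hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OfflineArgsSketch:
    """Hypothetical container; the PR's real class is VLLMOfflineBackendArgs."""

    batch_size: int = 32  # default stated in the PR description
    vllm_config: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumed validation rule, not confirmed by the PR.
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")


raw = '{"batch_size": 64, "vllm_config": {"tensor_parallel_size": 2}}'
args = OfflineArgsSketch(**json.loads(raw))
```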

Test Plan

Unit Tests (✅ Passing)

2296 unit tests passing (all existing + new tests)
New test coverage:
- VLLMOfflineBackend lifecycle (startup, shutdown, validate)
- Batch processing logic and request buffering
- VLLMBackendBase request resolution and formatting
- Chat template handling (plain, default, custom)
- Multimodal data processing (audio/image)
- Sampling parameter creation
- Backend registration and creation

Integration Tests (✅ Verified)

Backend registration in Backend registry
Args creation and validation (VLLMOfflineBackendArgs)
Backend creation via Backend.create()
Request resolution with chat templates
Batch size configuration (8-128+)
vLLM config passthrough (tensor_parallel_size, gpu_memory_utilization, etc.)
Backend info property exposure

Manual Testing

Validated functionality in a local environment

Details

Add VLLMBackendBase shared base class in src/guidellm/backends/vllm_python/base.py
Add VLLMOfflineBackend and VLLMOfflineBackendArgs in src/guidellm/backends/vllm_python/offline.py
Refactor VLLMPythonBackend to extend VLLMBackendBase (eliminate duplication)
Re-export test helpers (_ResolvedRequest, _has_jinja2_markers) from base for backward compatibility
Add optional dependency handling for audio/vision extras (catch RuntimeError from torchcodec/PIL)
Add comprehensive test coverage in tests/unit/backends/vllm_python/test_vllm.py
Add new guide docs/guides/vllm-offline-backend.md
Update docs/guides/backends.md with offline backend documentation
Register vllm_offline backend type in Backend registry
Update test_backend.py with offline backend registration test

"I certify that all code in this PR is my own, except as noted below."

Use of AI

Includes code generated or substantially modified by an AI agent
Includes tests generated or substantially modified by an AI agent

All commits include appropriate Co-Authored-By trailers as described in DEVELOPING.md.

This commit implements offline/batch inference support for vLLM using a clean, extensible architecture that eliminates code duplication.

- Shared base class for all vLLM backends (~400 lines)
- Extracted common functionality:
  - Chat template resolution
  - Multimodal data handling (image/audio)
  - Request formatting and resolution
  - Sampling parameter creation
- Abstract method `_get_tokenizer()` for subclass implementation

- Offline batch processing using vLLM's LLM class (~370 lines)
- Micro-batching with configurable batch_size (default: 32)
- Buffers requests until batch is full, then processes with LLM.generate()
- Auto-flushes remaining requests on shutdown
- Single-process execution for batch coordination
- Ideal for offline benchmarking and dataset evaluation

- Now extends VLLMBackendBase instead of Backend directly
- Removed ~360 lines of duplicate code
- Implements `_get_tokenizer()` for AsyncLLMEngine
- No breaking changes to public API

- New guide: docs/guides/vllm-offline-backend.md
  - Usage examples, configuration, performance tuning
  - Comparison with other backends
  - Troubleshooting guide
- Updated docs/guides/backends.md with offline backend section

- **Code Reuse**: ~400 lines shared between backends
- **Code Reduction**: ~360 lines eliminated from VLLMPythonBackend
- **Extensibility**: Easy to add new vLLM-based backends
- **No Breaking Changes**: VLLMPythonBackend API unchanged
- **Clean Architecture**: Clear separation of concerns

```bash
guidellm benchmark run \
  --backend vllm_offline \
  --model "Qwen/Qwen3-0.6B" \
  --backend-kwargs '{"batch_size": 64}' \
  --data "prompt_tokens=256,output_tokens=128" \
  --max-requests 1000
```

Validated on EC2:
- ✅ VLLMOfflineBackend class structure and configuration
- ✅ VLLMBackendBase inheritance chain
- ✅ VLLMPythonBackend refactoring (no regressions)
- ✅ All imports and registry integration
- ⏸️ End-to-end inference (requires GPU)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

- Re-export _has_jinja2_markers and _ResolvedRequest from vllm.py for test compatibility
- Add shutdown protection flag to reject requests during shutdown
- Improve batch processing error handling with size validation
- Use strict=True for zip to catch mismatches
- Don't re-raise exceptions in batch processing (let requests handle failures)
- Lock final batch processing during shutdown
- Remove unused Backend import from offline.py

Fixes unit test ImportError and improves robustness.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Removed duplicate _check_vllm_available() calls from VLLMPythonBackend and VLLMOfflineBackend __init__ methods. The check is already performed in VLLMBackendBase.__init__(), so calling it again in subclasses is redundant.

Updated unit tests to patch base._check_vllm_available instead of vllm._check_vllm_available since the function is no longer called from the vllm module.

This fixes CI test failures where tests were trying to patch a function that was being called twice from different modules.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
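Why the patch target had to move can be shown with a self-contained sketch. The two in-memory modules and the names `base_sketch`, `child_sketch`, `check_available`, and `constructs_ok` are hypothetical stand-ins for base.py, vllm.py, and `_check_vllm_available`; the point is only that `unittest.mock` must replace the name in the module whose code performs the call.

```python
import types
from unittest.mock import patch

# Build two throwaway modules in memory so the example is self-contained.
base = types.ModuleType("base_sketch")
exec(
    "def check_available():\n"
    "    raise ImportError('dependency missing')\n"
    "\n"
    "class Base:\n"
    "    def __init__(self):\n"
    "        check_available()  # resolved in base's globals at call time\n",
    base.__dict__,
)

child = types.ModuleType("child_sketch")
child.check_available = base.check_available  # like `from base import ...`
child.Base = base.Base


def constructs_ok() -> bool:
    try:
        child.Base()
        return True
    except ImportError:
        return False


# Patching the copy held by the child module does not affect Base.__init__.
with patch.object(child, "check_available", lambda: None):
    patched_child = constructs_ok()

# Patching the name in the module that performs the call does.
with patch.object(base, "check_available", lambda: None):
    patched_base = constructs_ok()
```

Here `patched_child` ends up `False` and `patched_base` ends up `True`, which mirrors the move from patching the vllm module to patching the base module.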

Fixed two issues causing CI failures:

1. Stream value not propagated from backend to resolved request
   - Added _stream_value property to VLLMBackendBase (default: True)
   - VLLMPythonBackend overrides to return self._stream
   - Updated _resolve_request to pass stream=self._stream_value
2. test_backend.py still patching vllm._check_vllm_available
   - Updated to patch base._check_vllm_available instead

This fixes test_stream_false_propagated and test_vllm_python_backend_registered.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
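The property-override fix for the stream flag can be sketched as follows. Only `_stream_value` and `self._stream` come from the commit message; `BaseSketch`, `PythonBackendSketch`, and the dictionary returned by `resolve_request` are hypothetical simplifications of the real classes and resolved-request type.

```python
class BaseSketch:
    """Hypothetical stand-in for the shared base class."""

    @property
    def _stream_value(self) -> bool:
        # Default used by backends that do not expose a stream option.
        return True

    def resolve_request(self, prompt: str) -> dict:
        # The shared resolver reads the property instead of a hard-coded flag.
        return {"prompt": prompt, "stream": self._stream_value}


class PythonBackendSketch(BaseSketch):
    """Hypothetical subclass that carries a user-configured stream setting."""

    def __init__(self, stream: bool) -> None:
        self._stream = stream

    @property
    def _stream_value(self) -> bool:
        return self._stream
```

Routing the flag through a property keeps the resolver in the base class while letting each subclass decide the value.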

Linter fixes:
- Use contextlib.suppress instead of try-except-pass (SIM105)
- Add noqa comment for intentional Exception catch (BLE001)
- Apply ruff formatting to base.py and offline.py

Test fixes:
- Patch base.SamplingParams in addition to vllm.SamplingParams
- This fixes TypeError: 'NoneType' object is not callable
- _create_sampling_params moved to base.py, so tests need to patch there

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
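For readers unfamiliar with the SIM105 rewrite, the pattern looks roughly like this. `close_quietly` and `Closable` are hypothetical, and `AttributeError` is an assumed exception type chosen for the example; the commit does not say which exception the real code suppresses.

```python
import contextlib


class Closable:
    """Hypothetical resource used only to exercise the helper."""

    closed = False

    def close(self) -> None:
        self.closed = True


def close_quietly(resource) -> None:
    # Equivalent to `try: resource.close()` / `except AttributeError: pass`,
    # the form that ruff reports as SIM105.
    with contextlib.suppress(AttributeError):
        resource.close()
```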

- Remove unused imports from vllm.py (_ResolvedRequest, _has_jinja2_markers)
- Fix import order in vllm.py
- Apply mdformat to vllm-offline-backend.md

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Tests were patching vllm._decode_audio but the function is called from base.py where it's imported. Updated all 4 patches to reference base._decode_audio instead.

This fixes the audio-related test failures where fake audio bytes were being decoded by the real _decode_audio function instead of the mocked version.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Handle RuntimeError from torchcodec/PIL when FFmpeg or image libraries are not available. Update exception handlers to catch both ImportError and RuntimeError for graceful degradation when optional dependencies fail to load.

Also fix test patches for multimodal data:
- Change image_dict_to_pil patch from vllm to base module
- Add HAS_AUDIO and HAS_VISION patches to enable tests when optional dependencies are unavailable

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
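The graceful-degradation pattern described in this commit might look like the sketch below. `optional_import` is a hypothetical helper and the module names are placeholders chosen so the example runs anywhere; the PR's own flags are named HAS_AUDIO and HAS_VISION, and how they are actually computed is not shown in this page.

```python
import importlib


def optional_import(name: str):
    """Return (module, available) without failing when an extra is unusable."""
    try:
        return importlib.import_module(name), True
    except (ImportError, RuntimeError):
        # ImportError: the package is not installed.
        # RuntimeError: the package is installed but a native library
        # it depends on failed to load.
        return None, False


# Placeholder modules: one that always exists, one that never does.
json_mod, HAS_JSON = optional_import("json")
missing_mod, HAS_MISSING = optional_import("surely_not_a_real_module_xyz")
```

Catching only `ImportError` would let a load-time `RuntimeError` escape and break import of the whole backend module, which is the failure this commit addresses.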

maryamtahhan force-pushed the feat/vllm-offline-batching-backend branch 3 times, most recently from c05ee69 to 968f44e May 21, 2026 13:54

maryamtahhan and others added 6 commits May 25, 2026 10:14

maryamtahhan force-pushed the feat/vllm-offline-batching-backend branch from fc01371 to bbe2874 May 25, 2026 09:14

maryamtahhan and others added 2 commits May 25, 2026 10:56

maryamtahhan marked this pull request as ready for review May 25, 2026 10:25

maryamtahhan force-pushed the feat/vllm-offline-batching-backend branch from efa1d9e to 942fa2e May 25, 2026 13:43

sjmonson self-requested a review May 27, 2026 15:25

sjmonson added the "internal" label (filed by core contributor or associate) May 27, 2026

sjmonson added this to the v0.8.0 milestone May 27, 2026

sjmonson added the priority-low label Jun 1, 2026

sjmonson requested a review from jaredoconnell June 1, 2026 15:01

maryamtahhan wants to merge 8 commits into vllm-project:main from maryamtahhan:feat/vllm-offline-batching-backend
