Add vLLM offline backend with micro-batching support #736

Open
maryamtahhan wants to merge 8 commits into vllm-project:main from maryamtahhan:feat/vllm-offline-batching-backend

maryamtahhan commented May 20, 2026

Contributor

Add vLLM Offline Backend with Shared Base Class

This PR adds offline (batch) inference support for vLLM and moves the logic common to the vLLM backends into a shared base class, removing the duplication between them.

Summary

Adds VLLMOfflineBackend for batch processing and refactors existing vLLM code into a shared VLLMBackendBase class. This reduces code duplication by ~360 lines while adding new offline inference capabilities optimized for benchmarking scenarios.

New Components

VLLMBackendBase (base.py)

Shared base class for all vLLM backends containing ~400 lines of common functionality:

  • Chat template resolution (plain, default-template, custom Jinja2)
  • Multimodal data handling (image/audio columns)
  • Request formatting and prompt resolution
  • Sampling parameter creation
  • Abstract _get_tokenizer() method for subclass implementation
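
As a rough illustration of the pattern listed above, the sketch below shows a base class whose shared logic calls an abstract `_get_tokenizer()` hook that each backend fills in. Only the `_get_tokenizer` name comes from this PR; `BackendBaseSketch`, `OfflineSketch`, `FakeTokenizer`, and `format_prompt` are hypothetical stand-ins, and the real `VLLMBackendBase` will differ in detail.

```python
from abc import ABC, abstractmethod
from typing import Any


class FakeTokenizer:
    """Toy tokenizer so the sketch runs without vLLM installed."""

    def apply_chat_template(self, messages: list[dict[str, str]]) -> str:
        return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


class BackendBaseSketch(ABC):
    """Hypothetical stand-in for a shared base class (not the PR's code)."""

    def __init__(self, model: str) -> None:
        self.model = model

    @abstractmethod
    def _get_tokenizer(self) -> Any:
        """Each subclass returns the tokenizer owned by its engine type."""

    def format_prompt(self, messages: list[dict[str, str]]) -> str:
        # Shared logic lives in the base and only depends on the abstract hook.
        return self._get_tokenizer().apply_chat_template(messages)


class OfflineSketch(BackendBaseSketch):
    """Hypothetical subclass; a real one would ask its engine for a tokenizer."""

    def _get_tokenizer(self) -> Any:
        return FakeTokenizer()
```

With this shape, the base class cannot be instantiated directly, and a new backend only has to supply the hook to inherit the shared formatting path.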

VLLMOfflineBackend (offline.py)

New backend for offline batch processing using vLLM's LLM class:

  • Micro-batching with configurable batch_size (default: 32)
  • Buffers requests until batch is full, then processes with LLM.generate()
  • Auto-flushes remaining requests on shutdown
  • Single-process execution for batch coordination
  • Ideal for offline benchmarking and dataset evaluation

Refactored VLLMPythonBackend (vllm.py)

  • Now extends VLLMBackendBase instead of Backend directly
  • Removed ~360 lines of duplicate code
  • Implements _get_tokenizer() for AsyncLLMEngine
  • No breaking changes to public API

Key Benefits

  • Code Reuse: ~400 lines shared between backends
  • Reduced Duplication: ~360 lines eliminated from VLLMPythonBackend
  • Extensibility: Easy to add new vLLM-based backends (e.g., vLLM server)
  • No Breaking Changes: VLLMPythonBackend API unchanged
  • Clean Architecture: Clear separation of concerns with shared base

Documentation

  • New guide: docs/guides/vllm-offline-backend.md
    • Usage examples and configuration options
    • Performance tuning (batch size, vLLM EngineArgs)
    • Comparison with other backends
    • Troubleshooting guide
  • Updated: docs/guides/backends.md with offline backend section

Usage Example

guidellm benchmark run \
  --backend vllm_offline \
  --model "Qwen/Qwen3-0.6B" \
  --backend-kwargs '{"batch_size": 64, "vllm_config": {"tensor_parallel_size": 2}}' \
  --data "prompt_tokens=256,output_tokens=128" \
  --max-requests 1000

Test Plan

Unit Tests (✅ Passing)

  • 2296 unit tests passing (all existing + new tests)
  • New test coverage:
    • VLLMOfflineBackend lifecycle (startup, shutdown, validate)
    • Batch processing logic and request buffering
    • VLLMBackendBase request resolution and formatting
    • Chat template handling (plain, default, custom)
    • Multimodal data processing (audio/image)
    • Sampling parameter creation
    • Backend registration and creation

Integration Tests (✅ Verified)

  • Backend registration in Backend registry
  • Args creation and validation (VLLMOfflineBackendArgs)
  • Backend creation via Backend.create()
  • Request resolution with chat templates
  • Batch size configuration (8-128+)
  • vLLM config passthrough (tensor_parallel_size, gpu_memory_utilization, etc.)
  • Backend info property exposure

Manual Testing

  • Validated functionality on local environment

Details

  • Add VLLMBackendBase shared base class in src/guidellm/backends/vllm_python/base.py
  • Add VLLMOfflineBackend and VLLMOfflineBackendArgs in src/guidellm/backends/vllm_python/offline.py
  • Refactor VLLMPythonBackend to extend VLLMBackendBase (eliminate duplication)
  • Re-export test helpers (_ResolvedRequest, _has_jinja2_markers) from base for backward compatibility
  • Add optional dependency handling for audio/vision extras (catch RuntimeError from torchcodec/PIL)
  • Add comprehensive test coverage in tests/unit/backends/vllm_python/test_vllm.py
  • Add new guide docs/guides/vllm-offline-backend.md
  • Update docs/guides/backends.md with offline backend documentation
  • Register vllm_offline backend type in Backend registry
  • Update test_backend.py with offline backend registration test
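
The optional dependency handling mentioned in the list above (catching `RuntimeError` as well as `ImportError`) might look roughly like the sketch below. `try_import` is a hypothetical helper introduced for illustration, and the `HAS_VISION` flag is set up here only by assumption about how the PR's flag of that name is used; the actual code in `base.py` may be organised differently.

```python
import importlib
from types import ModuleType
from typing import Optional


def try_import(name: str) -> Optional[ModuleType]:
    """Return the module, or None if it is missing or fails to initialise."""
    try:
        return importlib.import_module(name)
    except (ImportError, RuntimeError):
        # ImportError: the optional package is not installed.
        # RuntimeError: the package is installed but could not load something
        # it needs at import time (the PR cites torchcodec/PIL without FFmpeg
        # or image libraries).
        return None


# Hypothetical feature flag; code paths that need images would check it first.
HAS_VISION = try_import("PIL.Image") is not None
```

Callers can then skip or reject multimodal requests when a flag is false, rather than failing at import time.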

  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes code generated or substantially modified by an AI agent
  • Includes tests generated or substantially modified by an AI agent

All commits include appropriate Co-Authored-By trailers as described in DEVELOPING.md.

maryamtahhan force-pushed the feat/vllm-offline-batching-backend branch 3 times, most recently from c05ee69 to 968f44e May 21, 2026 13:54
maryamtahhan and others added 6 commits May 25, 2026 10:14
This commit implements offline/batch inference support for vLLM using
a clean, extensible architecture that eliminates code duplication.

VLLMBackendBase (base.py):
- Shared base class for all vLLM backends (~400 lines)
- Extracted common functionality:
  - Chat template resolution
  - Multimodal data handling (image/audio)
  - Request formatting and resolution
  - Sampling parameter creation
- Abstract method `_get_tokenizer()` for subclass implementation

VLLMOfflineBackend (offline.py):
- Offline batch processing using vLLM's LLM class (~370 lines)
- Micro-batching with configurable batch_size (default: 32)
- Buffers requests until batch is full, then processes with LLM.generate()
- Auto-flushes remaining requests on shutdown
- Single-process execution for batch coordination
- Ideal for offline benchmarking and dataset evaluation

VLLMPythonBackend (vllm.py):
- Now extends VLLMBackendBase instead of Backend directly
- Removed ~360 lines of duplicate code
- Implements `_get_tokenizer()` for AsyncLLMEngine
- No breaking changes to public API

Documentation:
- New guide: docs/guides/vllm-offline-backend.md
  - Usage examples, configuration, performance tuning
  - Comparison with other backends
  - Troubleshooting guide
- Updated docs/guides/backends.md with offline backend section

Key benefits:
- **Code Reuse**: ~400 lines shared between backends
- **Code Reduction**: ~360 lines eliminated from VLLMPythonBackend
- **Extensibility**: Easy to add new vLLM-based backends
- **No Breaking Changes**: VLLMPythonBackend API unchanged
- **Clean Architecture**: Clear separation of concerns

```bash
guidellm benchmark run \
  --backend vllm_offline \
  --model "Qwen/Qwen3-0.6B" \
  --backend-kwargs '{"batch_size": 64}' \
  --data "prompt_tokens=256,output_tokens=128" \
  --max-requests 1000
```

Validated on EC2:
- ✅ VLLMOfflineBackend class structure and configuration
- ✅ VLLMBackendBase inheritance chain
- ✅ VLLMPythonBackend refactoring (no regressions)
- ✅ All imports and registry integration
- ⏸️ End-to-end inference (requires GPU)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

- Re-export _has_jinja2_markers and _ResolvedRequest from vllm.py for test compatibility
- Add shutdown protection flag to reject requests during shutdown
- Improve batch processing error handling with size validation
- Use strict=True for zip to catch mismatches
- Don't re-raise exceptions in batch processing (let requests handle failures)
- Lock final batch processing during shutdown
- Remove unused Backend import from offline.py

Fixes unit test ImportError and improves robustness.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Removed duplicate _check_vllm_available() calls from VLLMPythonBackend
and VLLMOfflineBackend __init__ methods. The check is already performed
in VLLMBackendBase.__init__(), so calling it again in subclasses is
redundant.

Updated unit tests to patch base._check_vllm_available instead of
vllm._check_vllm_available since the function is no longer called from
the vllm module.

This fixes CI test failures where tests were trying to patch a function
that was being called twice from different modules.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Fixed two issues causing CI failures:

1. Stream value not propagated from backend to resolved request
   - Added _stream_value property to VLLMBackendBase (default: True)
   - VLLMPythonBackend overrides to return self._stream
   - Updated _resolve_request to pass stream=self._stream_value

2. test_backend.py still patching vllm._check_vllm_available
   - Updated to patch base._check_vllm_available instead

This fixes test_stream_false_propagated and test_vllm_python_backend_registered.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Linter fixes:
- Use contextlib.suppress instead of try-except-pass (SIM105)
- Add noqa comment for intentional Exception catch (BLE001)
- Apply ruff formatting to base.py and offline.py

Test fixes:
- Patch base.SamplingParams in addition to vllm.SamplingParams
- This fixes TypeError: 'NoneType' object is not callable
- _create_sampling_params moved to base.py, so tests need to patch there

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

- Remove unused imports from vllm.py (_ResolvedRequest, _has_jinja2_markers)
- Fix import order in vllm.py
- Apply mdformat to vllm-offline-backend.md

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

maryamtahhan force-pushed the feat/vllm-offline-batching-backend branch from fc01371 to bbe2874 May 25, 2026 09:14
maryamtahhan and others added 2 commits May 25, 2026 10:56
Tests were patching vllm._decode_audio but the function is called
from base.py where it's imported. Updated all 4 patches to reference
base._decode_audio instead.

This fixes the audio-related test failures where fake audio bytes
were being decoded by the real _decode_audio function instead of
the mocked version.
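
The failure mode described above (patching a name in one module while the call resolves it in another) can be reproduced with a small standalone sketch. The module names `helpers_sketch` and `base_sketch` and the functions `decode_audio`, `load`, and `call_with_patch` are hypothetical and exist only so the example runs without this repository; the point is that `mock.patch` must target the module where the name is looked up.

```python
import sys
import types
from unittest import mock

# Build two throwaway modules so the sketch is self-contained.
helpers = types.ModuleType("helpers_sketch")
exec(
    "def decode_audio(data):\n"
    "    raise RuntimeError('real decoder called')\n",
    helpers.__dict__,
)
sys.modules["helpers_sketch"] = helpers

base = types.ModuleType("base_sketch")
exec(
    "from helpers_sketch import decode_audio\n"
    "def load(data):\n"
    "    return decode_audio(data)\n",
    base.__dict__,
)
sys.modules["base_sketch"] = base


def call_with_patch(target: str) -> str:
    """Patch `target`, then call base.load() and report which decoder ran."""
    with mock.patch(target, return_value="mocked"):
        try:
            return base.load(b"fake audio bytes")
        except RuntimeError:
            return "real decoder ran"
```

Patching `helpers_sketch.decode_audio` leaves the copy of the name inside `base_sketch` untouched, so the real function still runs; patching `base_sketch.decode_audio` replaces the name that `load()` actually resolves.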

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Handle RuntimeError from torchcodec/PIL when FFmpeg or image libraries
are not available. Update exception handlers to catch both ImportError
and RuntimeError for graceful degradation when optional dependencies
fail to load.

Also fix test patches for multimodal data:
- Change image_dict_to_pil patch from vllm to base module
- Add HAS_AUDIO and HAS_VISION patches to enable tests when optional
  dependencies are unavailable

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

maryamtahhan marked this pull request as ready for review May 25, 2026 10:25
maryamtahhan force-pushed the feat/vllm-offline-batching-backend branch from efa1d9e to 942fa2e May 25, 2026 13:43
sjmonson self-requested a review May 27, 2026 15:25
sjmonson added the internal label (filed by core contributor or associate) May 27, 2026
sjmonson added this to the v0.8.0 milestone May 27, 2026
sjmonson requested a review from jaredoconnell June 1, 2026 15:01
Labels

internal (filed by core contributor or associate), priority-low
