Add vLLM offline backend with micro-batching support #736
Open
maryamtahhan wants to merge 8 commits into
This commit implements offline/batch inference support for vLLM using
a clean, extensible architecture that eliminates code duplication.
New Components

VLLMBackendBase (base.py)
- Shared base class for all vLLM backends (~400 lines)
- Extracted common functionality:
  - Chat template resolution
  - Multimodal data handling (image/audio)
  - Request formatting and resolution
  - Sampling parameter creation
- Abstract method `_get_tokenizer()` for subclass implementation

VLLMOfflineBackend (offline.py)
- Offline batch processing using vLLM's LLM class (~370 lines)
- Micro-batching with configurable batch_size (default: 32)
- Buffers requests until batch is full, then processes with LLM.generate()
- Auto-flushes remaining requests on shutdown
- Single-process execution for batch coordination
- Ideal for offline benchmarking and dataset evaluation

Refactored VLLMPythonBackend (vllm.py)
- Now extends VLLMBackendBase instead of Backend directly
- Removed ~360 lines of duplicate code
- Implements `_get_tokenizer()` for AsyncLLMEngine
- No breaking changes to public API

Documentation
- New guide: docs/guides/vllm-offline-backend.md
  - Usage examples, configuration, performance tuning
  - Comparison with other backends
  - Troubleshooting guide
- Updated docs/guides/backends.md with offline backend section

Key Benefits
- **Code Reuse**: ~400 lines shared between backends
- **Code Reduction**: ~360 lines eliminated from VLLMPythonBackend
- **Extensibility**: Easy to add new vLLM-based backends
- **No Breaking Changes**: VLLMPythonBackend API unchanged
- **Clean Architecture**: Clear separation of concerns
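The shared-base-class arrangement described above can be pictured with a minimal sketch. This is not the PR's code: the class names `VLLMBackendBaseSketch` and `OfflineSketch`, the `format_prompt` helper, and the toy tokenizer are all hypothetical. Only the abstract `_get_tokenizer()` hook comes from the commit message.

```python
from abc import ABC, abstractmethod


class VLLMBackendBaseSketch(ABC):
    """Hypothetical stand-in for the shared base class (not the PR's code)."""

    def format_prompt(self, text: str) -> list[int]:
        # Shared logic lives here and relies on the tokenizer that each
        # subclass supplies through the abstract hook below.
        tokenizer = self._get_tokenizer()
        return tokenizer(text)

    @abstractmethod
    def _get_tokenizer(self):
        """Return a callable mapping text to token ids (assumed shape)."""


class OfflineSketch(VLLMBackendBaseSketch):
    def _get_tokenizer(self):
        # Toy tokenizer: one "token id" per word, equal to the word length.
        # A real backend would return its engine's tokenizer instead.
        return lambda text: [len(word) for word in text.split()]


tokens = OfflineSketch().format_prompt("hello big world")
```

Because the hook is abstract, instantiating the base class directly raises `TypeError`, so a new vLLM-based backend cannot forget to provide a tokenizer.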
```bash
guidellm benchmark run \
--backend vllm_offline \
--model "Qwen/Qwen3-0.6B" \
--backend-kwargs '{"batch_size": 64}' \
--data "prompt_tokens=256,output_tokens=128" \
--max-requests 1000
```
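The micro-batching behavior described in this commit (buffer until the batch is full, process the batch in one call, flush the remainder on shutdown) can be sketched as follows. This is an illustrative approximation, not the `VLLMOfflineBackend` implementation: `MicroBatcher`, `submit`, and the `generate` callable are hypothetical names, and the callable merely stands in for `LLM.generate()`.

```python
from typing import Callable


class MicroBatcher:
    """Hypothetical sketch of buffer-then-flush micro-batching."""

    def __init__(
        self,
        generate: Callable[[list[str]], list[str]],
        batch_size: int = 32,
    ) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._generate = generate  # stand-in for a batched generate call
        self._batch_size = batch_size
        self._pending: list[str] = []
        self._closed = False
        self.results: list[str] = []

    def submit(self, prompt: str) -> None:
        if self._closed:
            raise RuntimeError("batcher is shut down")
        self._pending.append(prompt)
        if len(self._pending) >= self._batch_size:
            self._flush()

    def _flush(self) -> None:
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        outputs = self._generate(batch)
        if len(outputs) != len(batch):
            raise ValueError("generate returned a different number of outputs")
        self.results.extend(outputs)

    def shutdown(self) -> None:
        # Reject new work first, then process whatever is still buffered.
        self._closed = True
        self._flush()


batcher = MicroBatcher(lambda batch: [p.upper() for p in batch], batch_size=2)
for prompt in ["a", "b", "c"]:
    batcher.submit(prompt)
batcher.shutdown()
```

With `batch_size=2`, the first two prompts are processed together and the third is only processed when `shutdown()` flushes the buffer.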
Validated on EC2:
- ✅ VLLMOfflineBackend class structure and configuration
- ✅ VLLMBackendBase inheritance chain
- ✅ VLLMPythonBackend refactoring (no regressions)
- ✅ All imports and registry integration
- ⏸️ End-to-end inference (requires GPU)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Re-export _has_jinja2_markers and _ResolvedRequest from vllm.py for test compatibility
- Add shutdown protection flag to reject requests during shutdown
- Improve batch processing error handling with size validation
- Use strict=True for zip to catch mismatches
- Don't re-raise exceptions in batch processing (let requests handle failures)
- Lock final batch processing during shutdown
- Remove unused Backend import from offline.py

Fixes unit test ImportError and improves robustness.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Removed duplicate _check_vllm_available() calls from VLLMPythonBackend and VLLMOfflineBackend __init__ methods. The check is already performed in VLLMBackendBase.__init__(), so calling it again in subclasses is redundant.

Updated unit tests to patch base._check_vllm_available instead of vllm._check_vllm_available since the function is no longer called from the vllm module.

This fixes CI test failures where tests were trying to patch a function that was being called twice from different modules.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Fixed two issues causing CI failures:

1. Stream value not propagated from backend to resolved request
   - Added _stream_value property to VLLMBackendBase (default: True)
   - VLLMPythonBackend overrides to return self._stream
   - Updated _resolve_request to pass stream=self._stream_value
2. test_backend.py still patching vllm._check_vllm_available
   - Updated to patch base._check_vllm_available instead

This fixes test_stream_false_propagated and test_vllm_python_backend_registered.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
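The stream-propagation fix can be pictured roughly as below. Only `_stream_value`, `_stream`, `_resolve_request`, and the default of `True` come from the commit message. The class names and the dictionary returned by `_resolve_request` are simplified placeholders.

```python
class BaseSketch:
    """Hypothetical stand-in for VLLMBackendBase."""

    @property
    def _stream_value(self) -> bool:
        # Default used by backends that do not override the property.
        return True

    def _resolve_request(self, prompt: str) -> dict:
        # The resolved request takes its stream flag from the property,
        # so a subclass override is picked up automatically.
        return {"prompt": prompt, "stream": self._stream_value}


class PythonBackendSketch(BaseSketch):
    """Hypothetical stand-in for VLLMPythonBackend."""

    def __init__(self, stream: bool) -> None:
        self._stream = stream

    @property
    def _stream_value(self) -> bool:
        return self._stream


default_request = BaseSketch()._resolve_request("hi")
non_streaming = PythonBackendSketch(stream=False)._resolve_request("hi")
```

Routing the flag through a property keeps `_resolve_request` in the base class while still letting each backend decide its own streaming behavior.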
Linter fixes:
- Use contextlib.suppress instead of try-except-pass (SIM105)
- Add noqa comment for intentional Exception catch (BLE001)
- Apply ruff formatting to base.py and offline.py

Test fixes:
- Patch base.SamplingParams in addition to vllm.SamplingParams
- This fixes TypeError: 'NoneType' object is not callable
- _create_sampling_params moved to base.py, so tests need to patch there

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
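For readers unfamiliar with SIM105, the rewrite it asks for looks like the sketch below. The `close_quietly` helper, the `Flaky` class, and the choice of `OSError` are invented for illustration. The commit does not say which exception or call site was involved.

```python
import contextlib


def close_quietly(resource) -> None:
    """Close a resource, ignoring OSError (hypothetical example)."""
    # Equivalent to: try: resource.close() / except OSError: pass
    with contextlib.suppress(OSError):
        resource.close()


class Flaky:
    def close(self) -> None:
        raise OSError("already closed")


close_quietly(Flaky())  # the OSError is swallowed
```

Only the listed exception types are suppressed, so anything else raised inside the block still propagates.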
- Remove unused imports from vllm.py (_ResolvedRequest, _has_jinja2_markers)
- Fix import order in vllm.py
- Apply mdformat to vllm-offline-backend.md

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Tests were patching vllm._decode_audio but the function is called from base.py where it's imported. Updated all 4 patches to reference base._decode_audio instead.

This fixes the audio-related test failures where fake audio bytes were being decoded by the real _decode_audio function instead of the mocked version.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
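The underlying rule is that `unittest.mock.patch` must target the namespace where the name is looked up at call time. The self-contained sketch below reproduces the situation with two throwaway in-memory modules. The module names `sketch_base` and `sketch_vllm` and the `handle` function are hypothetical stand-ins for base.py, vllm.py, and the code that calls `_decode_audio`.

```python
import sys
import types
from unittest import mock

# Throwaway module playing the role of base.py: it defines the helper
# and the function that calls it.
base = types.ModuleType("sketch_base")
exec(
    "def _decode_audio(data):\n"
    "    return 'real'\n"
    "\n"
    "def handle(data):\n"
    "    return _decode_audio(data)\n",
    base.__dict__,
)

# Throwaway module playing the role of vllm.py: it only re-exports the name,
# as `from base import _decode_audio` would.
facade = types.ModuleType("sketch_vllm")
facade._decode_audio = base._decode_audio

sys.modules["sketch_base"] = base
sys.modules["sketch_vllm"] = facade

# Patching the re-export does not affect the call made inside sketch_base.
with mock.patch("sketch_vllm._decode_audio", return_value="fake"):
    wrong_target = base.handle(b"")

# Patching the module where the call happens does.
with mock.patch("sketch_base._decode_audio", return_value="fake"):
    right_target = base.handle(b"")
```

Here `wrong_target` still comes from the real function while `right_target` comes from the mock, which mirrors why the tests had to switch their patch targets to the base module.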
Handle RuntimeError from torchcodec/PIL when FFmpeg or image libraries are not available. Update exception handlers to catch both ImportError and RuntimeError for graceful degradation when optional dependencies fail to load.

Also fix test patches for multimodal data:
- Change image_dict_to_pil patch from vllm to base module
- Add HAS_AUDIO and HAS_VISION patches to enable tests when optional dependencies are unavailable

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
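A guard of the kind this commit describes might look like the following sketch. The `load_optional` helper and its `importer` parameter are hypothetical. Only the idea of catching both `ImportError` and `RuntimeError`, and the `HAS_AUDIO` flag name, come from the commit message.

```python
import importlib
from types import ModuleType
from typing import Callable, Optional


def load_optional(
    name: str,
    importer: Callable[[str], ModuleType] = importlib.import_module,
) -> Optional[ModuleType]:
    """Import an optional dependency, returning None if it cannot load."""
    try:
        return importer(name)
    except (ImportError, RuntimeError):
        # ImportError: the package is not installed.
        # RuntimeError: the package is installed but a native library it
        # needs failed to load (the case this commit describes).
        return None


# A feature flag could then be derived at module import time, for example:
# HAS_AUDIO = load_optional("torchcodec") is not None
present = load_optional("json")
missing = load_optional("module_that_does_not_exist_for_this_sketch")
```

The `importer` parameter exists only so that the failure paths can be exercised in tests without installing or breaking a real package.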
Add vLLM Offline Backend with Shared Base Class
This PR implements offline/batch inference support for vLLM using a clean, extensible architecture that eliminates code duplication between vLLM backends.
Summary
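Adds `VLLMOfflineBackend` for batch processing and refactors existing vLLM code into a shared `VLLMBackendBase` class. This reduces code duplication by ~360 lines while adding new offline inference capabilities optimized for benchmarking scenarios.

New Components

VLLMBackendBase (base.py)
Shared base class for all vLLM backends containing ~400 lines of common functionality:
- Chat template resolution
- Multimodal data handling (image/audio)
- Request formatting and resolution
- Sampling parameter creation
- Abstract `_get_tokenizer()` method for subclass implementation

VLLMOfflineBackend (offline.py)
New backend for offline batch processing using vLLM's `LLM` class:
- Micro-batching with configurable `batch_size` (default: 32)
- Buffers requests until batch is full, then processes with `LLM.generate()`
- Auto-flushes remaining requests on shutdown
- Single-process execution for batch coordination
- Ideal for offline benchmarking and dataset evaluation

Refactored VLLMPythonBackend (vllm.py)
- Now extends `VLLMBackendBase` instead of `Backend` directly
- Removed ~360 lines of duplicate code
- Implements `_get_tokenizer()` for `AsyncLLMEngine`
- No breaking changes to public API

Key Benefits
- **Code Reuse**: ~400 lines shared between backends
- **Code Reduction**: ~360 lines eliminated from VLLMPythonBackend
- **Extensibility**: Easy to add new vLLM-based backends
- **No Breaking Changes**: VLLMPythonBackend API unchanged
- **Clean Architecture**: Clear separation of concerns

Documentation
- New guide: `docs/guides/vllm-offline-backend.md`
- Updated `docs/guides/backends.md` with offline backend section

Usage Example

Test Plan

Unit Tests (✅ Passing)
- `VLLMOfflineBackend` lifecycle (startup, shutdown, validate)
- `VLLMBackendBase` request resolution and formatting

Integration Tests (✅ Verified)
- `Backend.create()`

Manual Testing

Details
- `VLLMBackendBase` shared base class in `src/guidellm/backends/vllm_python/base.py`
- `VLLMOfflineBackend` and `VLLMOfflineBackendArgs` in `src/guidellm/backends/vllm_python/offline.py`
- `VLLMPythonBackend` to extend `VLLMBackendBase` (eliminate duplication)
- Re-export (`_ResolvedRequest`, `_has_jinja2_markers`) from base for backward compatibility
- Handle `RuntimeError` from torchcodec/PIL
- `tests/unit/backends/vllm_python/test_vllm.py`
- `docs/guides/vllm-offline-backend.md`
- `docs/guides/backends.md` with offline backend documentation
- `vllm_offline` backend type in Backend registry
- `test_backend.py` with offline backend registration test

Use of AI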