Add vLLM #221
Conversation
Reviewer's Guide
This PR adds comprehensive vLLM support across the codebase: a complete vLLM backend implementation with configuration and tests, platform detection and scheduler integration for Safetensors routing, updates to the Backend interface and CLI/docs for vLLM, a new Dockerfile variant and CI steps to build and publish CUDA-enabled vLLM images, and a cleanup of obsolete distribution utilities with a new log sanitization helper.
Sequence diagram for scheduler routing Safetensors models to vLLM backend
```mermaid
sequenceDiagram
participant User
participant Scheduler
participant ModelManager
participant vLLMBackend
User->>Scheduler: POST /v1/completions (Safetensors model)
Scheduler->>ModelManager: Get model config
ModelManager-->>Scheduler: Return config (Format=Safetensors)
Scheduler->>vLLMBackend: Route request to vLLM backend
vLLMBackend->>Scheduler: Process request
Scheduler-->>User: Return response
```
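As a companion to the diagram, here is a minimal, self-contained Go sketch of the routing decision it describes; the types and map layout are simplified stand-ins, not the scheduler code from this PR.

```go
package main

import "fmt"

// Format is a simplified stand-in for the model format field the
// scheduler consults; only the value relevant to this PR is shown.
type Format string

const FormatSafetensors Format = "safetensors"

// pickBackend sketches the routing step from the diagram: Safetensors
// models go to the vLLM backend when it is registered, otherwise the
// previously selected backend is kept.
func pickBackend(format Format, backends map[string]bool, current string) string {
	if format == FormatSafetensors && backends["vllm"] {
		return "vllm"
	}
	return current
}

func main() {
	backends := map[string]bool{"llama.cpp": true, "vllm": true}
	fmt.Println(pickBackend(FormatSafetensors, backends, "llama.cpp")) // prints "vllm"
}
```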
Class diagram for new and updated vLLM backend types
```mermaid
classDiagram
class vLLM {
- log logging.Logger
- modelManager *models.Manager
- serverLog logging.Logger
- config *Config
- status string
+ Name() string
+ UsesExternalModelManagement() bool
+ Install(ctx context.Context, httpClient *http.Client) error
+ Run(ctx context.Context, socket string, model string, modelRef string, mode inference.BackendMode, backendConfig *inference.BackendConfiguration) error
+ Status() string
+ GetDiskUsage() (int64, error)
+ GetRequiredMemoryForModel(ctx context.Context, model string, config *inference.BackendConfiguration) (inference.RequiredMemory, error)
- binaryPath() string
}
class Config {
+ Args []string
+ GetArgs(bundle types.ModelBundle, socket string, mode inference.BackendMode, config *inference.BackendConfiguration) ([]string, error)
+ NewDefaultVLLMConfig() *Config
}
vLLM --> Config
class inference.Backend {
<<interface>>
+ Run(ctx context.Context, socket string, model string, modelRef string, mode BackendMode, config *BackendConfiguration) error
+ Status() string
+ GetDiskUsage() (int64, error)
+ GetRequiredMemoryForModel(ctx context.Context, model string, config *BackendConfiguration) (RequiredMemory, error)
}
vLLM ..|> inference.Backend
```
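To make the realization edge (`vLLM ..|> inference.Backend`) concrete, here is a small self-contained Go sketch. The interface and struct are trimmed mirrors of the types in the diagram (the real signatures also take contexts and configuration structs); the compile-time assertion is the idiomatic way to enforce that relationship.

```go
package main

import "fmt"

// Backend is a trimmed mirror of the inference.Backend interface above;
// the real interface also includes Run and GetRequiredMemoryForModel.
type Backend interface {
	Name() string
	Status() string
	GetDiskUsage() (int64, error)
}

// vLLM is a stub for the backend type in this PR; the real struct also
// holds loggers, a model manager, and a *Config.
type vLLM struct {
	status string
}

func (v *vLLM) Name() string                 { return "vllm" }
func (v *vLLM) Status() string               { return v.status }
func (v *vLLM) GetDiskUsage() (int64, error) { return 0, nil }

// Compile-time check that vLLM satisfies Backend, mirroring the
// realization edge in the class diagram.
var _ Backend = (*vLLM)(nil)

func main() {
	var b Backend = &vLLM{status: "not running"}
	fmt.Println(b.Name(), b.Status())
}
```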
File-Level Changes
Possibly linked issues
Pull Request Overview
This PR adds a new vLLM CUDA Docker image variant to support vLLM (a fast LLM inference engine) alongside the existing model runner functionality.
- Adds a new Docker build stage for vLLM with CUDA support
- Introduces a new GitHub Actions workflow job to build and push the vLLM image
- Includes configuration for vLLM version parameterization
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
| --- | --- |
| Dockerfile | Adds new vLLM build stage that installs vLLM using uv package manager |
| .github/workflows/release.yml | Adds workflow input parameter and build job for vLLM CUDA image |
New security issues found
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
echo "vllm-cuda<<EOF" >> "$GITHUB_OUTPUT" | ||
echo "docker/model-runner:${{ inputs.releaseTag }}-vllm-cuda" >> "$GITHUB_OUTPUT" | ||
if [ "${{ inputs.pushLatest }}" == "true" ]; then | ||
echo "docker/model-runner:latest-vllm-cuda" >> "$GITHUB_OUTPUT" | ||
fi | ||
echo 'EOF' >> "$GITHUB_OUTPUT" |
Copilot AI (Oct 14, 2025)
The tag generation logic for vllm-cuda duplicates the pattern used for the cuda tags above. Consider extracting this into a reusable function or template to reduce code duplication and improve maintainability.
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
Pull Request Overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Pull Request Overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
Pull Request Overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
pkg/inference/backends/vllm/vllm.go:1
- The log message uses 'modelID' in the format string but should use 'model' to be consistent with the variable name pattern used elsewhere in the codebase.
package vllm
Pull Request Overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
pkg/inference/backends/vllm/vllm.go:1
- The log message refers to 'modelID' but should be consistent with the parameter name 'model' used elsewhere in the codebase for clarity.
package vllm
Reorganize Dockerfile stages to avoid unnecessary layer copying and improve build caching. Split the final stages into final-llamacpp and final-vllm variants, copying the model-runner binary only in the final stages rather than in intermediate ones, so the build cache is reused when only the model-runner binary has changed. Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Add modelRef parameter to Backend.Run() interface to support serving models with vLLM under their reference names. Update vLLM backend to use modelRef with --served-model-name flag. Signed-off-by: Dorin Geman <dorin.geman@docker.com>
…n Linux This reverts commit 1ab1ead. Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Pull Request Overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
log.Warnf("Backend %s running modelID %s exited with error: %v", | ||
backend.Name(), modelID, err, |
Copilot AI (Oct 21, 2025)
[nitpick] The log message says 'modelID', but the context suggests it should say 'model' to match the previous naming convention (or the variable should stay 'model', since modelID is the internal identifier). The original message 'running model' was clearer.
log.Warnf("Backend %s running modelID %s exited with error: %v", | |
backend.Name(), modelID, err, | |
log.Warnf("Backend %s running model %s exited with error: %v", | |
backend.Name(), modelRef, err, |
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
```dockerfile
# Install uv and vLLM as modelrunner user
RUN curl -LsSf https://astral.sh/uv/install.sh | sh \
    && ~/.local/bin/uv venv --python /usr/bin/python3 /opt/vllm-env \
```
If we change this to copy from the vllm/vllm-openai:v0.11.0 container we get DGX Spark support (I know I suggested doing it this less hacky way; apologies, I didn't realize that container had aarch64 support and this approach doesn't appear to).
Could be a follow-on PR too.
Or, even better, install the wheels from here:
https://wheels.vllm.ai/b8b302cde434df8c9289a2b465406b47ebab1c2d/vllm/
That commit sha is the 0.11.0 one.
They tipped me off in the vLLM Slack that they build CUDA x86_64 and aarch64 wheels for every commit. So this is the same thing, but it also has an aarch64 version.
That would be better than the hacky container copy, which is prone to errors: missing files, OS mismatches, library version mismatches.
This is a way to get that programmatically:
```sh
$ git rev-list -n 1 v0.11.0
b8b302cde434df8c9289a2b465406b47ebab1c2d
```
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Pull Request Overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.
```go
} else {
	if config.Format == types.FormatSafetensors {
		if vllmBackend, ok := s.backends[vllm.Name]; ok {
			backend = vllmBackend
```
Copilot AI (Oct 21, 2025)
The automatic backend selection logic should handle the case where vLLM is not installed. Currently, if vLLM is not in the backends map, the original backend assignment remains, but this could lead to runtime errors if that backend doesn't support safetensors. Consider adding a check to verify the selected backend supports the model format, or log a warning if vLLM is unavailable for a safetensors model.
```diff
 backend = vllmBackend
+} else {
+	s.log.Warnf("vLLM backend is not available for safetensors model '%s'; current backend '%s' may not support safetensors format", model.Name(), backend.Name())
```
@sourcery-ai review
Hey there - I've reviewed your changes and they look great!
Blocking issues:
- An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload. (link)
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `.github/workflows/release.yml:117` </location>
<code_context>
uses: docker/build-push-action@v5
</code_context>
<issue_to_address>
**security (yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha):** An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.
*Source: opengrep*
</issue_to_address>
@sourcery-ai review
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `pkg/inference/backends/vllm/vllm.go:97` </location>
<code_context>
- // TODO: Implement.
- v.log.Warn("vLLM backend is not yet supported")
- return errors.New("not implemented")
+func (v *vLLM) Run(ctx context.Context, socket, model string, modelRef string, mode inference.BackendMode, backendConfig *inference.BackendConfiguration) error {
+ if !platform.SupportsVLLM() {
+ v.log.Warn("vLLM backend is not yet supported")
</code_context>
<issue_to_address>
**issue (bug_risk):** The Run method appends both model and modelRef to --served-model-name, which may not match expected CLI usage.
If the CLI expects only one value for --served-model-name, passing both may cause errors. Please confirm the correct usage and update the argument construction if necessary.
</issue_to_address>
### Comment 2
<location> `pkg/inference/backends/vllm/vllm.go:204` </location>
<code_context>
+ return size, nil
+}
+
+func (v *vLLM) GetRequiredMemoryForModel(_ context.Context, _ string, _ *inference.BackendConfiguration) (inference.RequiredMemory, error) {
+ if !platform.SupportsVLLM() {
+ return inference.RequiredMemory{}, errors.New("not implemented")
</code_context>
<issue_to_address>
**suggestion:** GetRequiredMemoryForModel returns hardcoded values for RAM and VRAM.
Static memory values may cause incorrect resource allocation. Please add logic to estimate requirements from model metadata or configuration.
Suggested implementation:
```golang
func (v *vLLM) GetRequiredMemoryForModel(ctx context.Context, modelPath string, config *inference.BackendConfiguration) (inference.RequiredMemory, error) {
if !platform.SupportsVLLM() {
return inference.RequiredMemory{}, errors.New("not implemented")
}
// Example: Estimate memory based on model metadata
metadata, err := inference.GetModelMetadata(modelPath)
if err != nil {
return inference.RequiredMemory{}, fmt.Errorf("failed to get model metadata: %w", err)
}
// Assume float32 weights, 4 bytes per parameter
const bytesPerParam = 4
paramCount := metadata.NumParameters
ramEstimate := int64(paramCount * bytesPerParam)
// VRAM estimate: typically similar to RAM, but may be higher for large batch sizes
// Here we use a simple heuristic, can be improved with more info
vramEstimate := ramEstimate
// If config specifies batch size or precision, adjust estimates
if config != nil {
if config.Precision == "float16" {
vramEstimate = int64(paramCount * 2)
ramEstimate = int64(paramCount * 2)
}
if config.BatchSize > 0 {
// Increase VRAM estimate for larger batch sizes
vramEstimate += int64(config.BatchSize * 1024 * 1024) // 1MB per batch as a rough estimate
}
}
return inference.RequiredMemory{
RAM: ramEstimate,
VRAM: vramEstimate,
}, nil
}
```
- You must implement or ensure the existence of `inference.GetModelMetadata(modelPath)` which should return a struct with at least `NumParameters` (int64).
- If your model metadata includes more details (e.g., data type, layers), you can refine the estimation logic.
- Adjust the estimation heuristics as needed for your specific models and hardware.
</issue_to_address>
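Regarding Comment 1 above (the --served-model-name construction), here is a minimal self-contained Go sketch of one way the arguments could be assembled so that only the reference name is passed as the served model name. buildServeArgs is a hypothetical helper, the model path and reference values are placeholders, and the socket/transport flags are deliberately omitted since that wiring depends on the backend's Config.

```go
package main

import "fmt"

// buildServeArgs is a hypothetical helper: the model path is passed
// positionally to the vllm CLI's serve subcommand, and only modelRef is
// handed to --served-model-name. Whether the internal ID should also be
// exposed as a served name is the open question in Comment 1.
func buildServeArgs(modelPath, modelRef string) []string {
	return []string{
		"serve", modelPath,
		"--served-model-name", modelRef,
	}
}

func main() {
	fmt.Println(buildServeArgs("/models/blobs/sha256-abc", "example/my-model"))
}
```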
```go
// Name is the backend name.
Name    = "vllm"
vllmDir = "/opt/vllm-env/bin"
```
I am sometimes curious why we care about the directory in the golang code. We could just add this directory to the PATH in the Dockerfile.
It could be useful if we start to use artifacts from other containers, for example for ROCm, which expects things to be in a different path:
https://hub.docker.com/r/rocm/vllm
We can worry about this in a follow-on PR though.
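Following up on the PATH remark, a self-contained Go sketch of how the lookup could prefer PATH and only fall back to the hard-coded directory. resolveVLLMBinary is a hypothetical helper for illustration, not the PR's binaryPath implementation.

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
)

// defaultVLLMDir mirrors the vllmDir constant quoted above and is used
// only as a fallback in this sketch.
const defaultVLLMDir = "/opt/vllm-env/bin"

// resolveVLLMBinary prefers whatever `vllm` is on PATH (e.g. exported in
// the Dockerfile) and falls back to the hard-coded virtualenv bin dir.
func resolveVLLMBinary() string {
	if p, err := exec.LookPath("vllm"); err == nil {
		return p
	}
	return filepath.Join(defaultVLLMDir, "vllm")
}

func main() {
	fmt.Println("using vLLM binary at:", resolveVLLMBinary())
}
```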
Building v0.0.14-rc1. Make sure you're using a Docker Engine context with the nvidia runtime.
Build:
Run:
Test:
Loading the model is expected to take ~30s on Docker Offload. E.g.,
Summary by Sourcery
Add support for building and publishing a vLLM-enabled CUDA Docker image by extending the Dockerfile and GitHub Actions release workflow.
New Features:
Summary by Sourcery
Add vLLM backend support to model-runner, including backend implementation, scheduler integration, Docker build for vLLM CUDA images, CLI updates, and associated tests.
New Features:
Enhancements:
Build:
CI:
Documentation:
Tests: