
Conversation

@doringeman doringeman commented Oct 13, 2025

Building v0.0.14-rc1.

Make sure you're using a Docker Engine context with the NVIDIA runtime available.

Build:

docker build -t docker/model-runner:test-vllm-cuda \
    --target final-vllm \
    --build-arg LLAMA_SERVER_VARIANT=cuda \
    --build-arg BASE_IMAGE=nvidia/cuda:12.9.0-runtime-ubuntu24.04 \
    --build-arg VLLM_VERSION=0.11.0 --platform linux/amd64 \
    .

Run:

docker run --rm -p 8080:8080 -e MODEL_RUNNER_PORT=8080 --gpus all --runtime=nvidia docker/model-runner:test-vllm-cuda

Test:

MODEL_RUNNER_HOST=http://localhost:8080 docker model run --backend vllm aistaging/smollm2-vllm hi

Loading the model is expected to take ~30s on Docker Offload.
E.g.,

$ MODEL_RUNNER_HOST=http://localhost:8080 docker model run --backend vllm aistaging/smollm2-vllm hi
Unable to find model 'aistaging/smollm2-vllm' locally. Pulling from the server.
Downloaded 272.05MB of 272.05MB
Model pulled successfully
Hello there! I’m here to help with an interesting task. What can I assist you with?

Summary by Sourcery

Add vLLM backend support to model-runner, including backend implementation, scheduler integration, Docker build for vLLM CUDA images, CLI updates, and associated tests.

New Features:

  • Introduce vLLM backend with installation, run, status, disk usage, and memory estimation support
  • Automatically select vLLM backend for safetensors-format models in the scheduler
  • Add a vLLM-CUDA Docker image variant in Dockerfile and GitHub Actions release workflow
  • Extend CLI to recognize and document the vLLM backend as a valid option

Enhancements:

  • Sanitize backend arguments for safe logging and consolidate format compatibility logic in distribution client
  • Add platform.SupportsVLLM utility and refactor runner and loader to pass modelRef parameter
  • Upgrade GitHub Actions references to fixed commit SHAs across workflows

Build:

  • Add Dockerfile target final-vllm and supporting environment setup for vLLM

CI:

  • Add build-and-push step for the vLLM CUDA image and update CI workflows with action version pins

Documentation:

  • Update CLI reference documentation to include vLLM in backend options

Tests:

  • Add unit tests for vLLM configuration argument generation
  • Extend distribution client tests for platform-dependent safetensors support

@Copilot Copilot AI review requested due to automatic review settings October 13, 2025 15:40

sourcery-ai bot commented Oct 13, 2025

Reviewer's Guide

This PR adds comprehensive vLLM support across the codebase: a complete vLLM backend implementation with configuration and tests, platform detection and scheduler integration for Safetensors routing, updates to the Backend interface and CLI/docs for vLLM, a new Dockerfile variant and CI steps to build and publish CUDA-enabled vLLM images, and a cleanup of obsolete distribution utilities with a new log sanitization helper.
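To illustrate the log-sanitization piece mentioned above: a minimal sketch of what a helper like SanitizeForLog typically does before untrusted values (such as backend arguments) reach the logs. This is illustrative only; the actual implementation in pkg/internal/utils/log.go may differ.

```go
// Illustrative sketch only, not the PR's implementation.
package utils

import "strings"

// SanitizeForLog removes newline and carriage-return characters from a value
// before it is interpolated into a log line, so untrusted input cannot forge
// additional log entries.
func SanitizeForLog(s string) string {
	return strings.NewReplacer("\n", " ", "\r", " ").Replace(s)
}
```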

Sequence diagram for scheduler routing Safetensors models to vLLM backend

sequenceDiagram
    participant User
    participant Scheduler
    participant ModelManager
    participant vLLMBackend
    User->>Scheduler: POST /v1/completions (Safetensors model)
    Scheduler->>ModelManager: Get model config
    ModelManager-->>Scheduler: Return config (Format=Safetensors)
    Scheduler->>vLLMBackend: Route request to vLLM backend
    vLLMBackend->>Scheduler: Process request
    Scheduler-->>User: Return response

Class diagram for new and updated vLLM backend types

classDiagram
    class vLLM {
        - log logging.Logger
        - modelManager *models.Manager
        - serverLog logging.Logger
        - config *Config
        - status string
        + Name() string
        + UsesExternalModelManagement() bool
        + Install(ctx context.Context, httpClient *http.Client) error
        + Run(ctx context.Context, socket string, model string, modelRef string, mode inference.BackendMode, backendConfig *inference.BackendConfiguration) error
        + Status() string
        + GetDiskUsage() (int64, error)
        + GetRequiredMemoryForModel(ctx context.Context, model string, config *inference.BackendConfiguration) (inference.RequiredMemory, error)
        - binaryPath() string
    }
    class Config {
        + Args []string
        + GetArgs(bundle types.ModelBundle, socket string, mode inference.BackendMode, config *inference.BackendConfiguration) ([]string, error)
        + NewDefaultVLLMConfig() *Config
    }
    vLLM --> Config
    class inference.Backend {
        <<interface>>
        + Run(ctx context.Context, socket string, model string, modelRef string, mode BackendMode, config *BackendConfiguration) error
        + Status() string
        + GetDiskUsage() (int64, error)
        + GetRequiredMemoryForModel(ctx context.Context, model string, config *BackendConfiguration) (RequiredMemory, error)
    }
    vLLM ..|> inference.Backend
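For readers who prefer Go to class diagrams, here is the interface shape from the diagram above rendered as code. The placeholder types are stand-ins so the snippet compiles on its own; the real definitions in pkg/inference/backend.go may differ.

```go
// Sketch of the updated Backend interface as depicted above.
package inference

import "context"

type BackendMode int                           // placeholder
type BackendConfiguration struct{}             // placeholder
type RequiredMemory struct{ RAM, VRAM uint64 } // placeholder fields

type Backend interface {
	// Run now receives modelRef alongside model, so a backend such as vLLM
	// can serve the model under its user-facing reference name.
	Run(ctx context.Context, socket, model, modelRef string, mode BackendMode, config *BackendConfiguration) error
	Status() string
	GetDiskUsage() (int64, error)
	GetRequiredMemoryForModel(ctx context.Context, model string, config *BackendConfiguration) (RequiredMemory, error)
}
```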

File-Level Changes

Change Details Files
Implement full vLLM backend
  • Extend New() to accept serverLog and default Config
  • Implement Install() with platform check and version detection
  • Build Run() to generate args, launch sandbox, handle errors and modelRef
  • Add Status(), GetDiskUsage(), GetRequiredMemoryForModel(), binaryPath()
pkg/inference/backends/vllm/vllm.go
Add vLLM configuration and unit tests
  • Define Config.GetArgs and GetMaxModelLen in vllm_config.go
  • Cover argument generation and context-size logic in vllm_config_test.go
pkg/inference/backends/vllm/vllm_config.go
pkg/inference/backends/vllm/vllm_config_test.go
Integrate platform detection and scheduler/distribution routing
  • Introduce platform.SupportsVLLM() util (see the sketch after this table)
  • Auto-route Safetensors models to vLLM in scheduler
  • Update distribution client to support Safetensors on Linux
  • Extend run/loader to pass modelRef through runner calls
pkg/inference/platform/platform.go
pkg/inference/scheduling/scheduler.go
pkg/distribution/distribution/client.go
pkg/inference/scheduling/runner.go
pkg/inference/scheduling/loader.go
pkg/inference/backend.go
Support vLLM in Dockerfile and CI workflows
  • Add vllm build stage in Dockerfile with python/uv/vLLM install
  • Extend release.yml to take vllmVersion, tag and push vllm-cuda images
  • Pin and update GitHub Actions versions across CI workflows
Dockerfile
.github/workflows/release.yml
.github/workflows/cli-build.yml
.github/workflows/cli-validate.yml
.github/workflows/ci.yml
.github/workflows/dmr-daily-check.yml
Extend Backend interface, CLI and docs for vLLM
  • Update Backend.Run signature to include modelRef
  • Adapt llama.cpp and mlx backends to new signature and sanitize args
  • Register vLLM in main and add it to ValidBackends
  • Document vLLM option in CLI reference YAMLs
pkg/inference/backend.go
pkg/inference/backends/llamacpp/llamacpp.go
pkg/inference/backends/mlx/mlx.go
main.go
cmd/cli/commands/backend.go
cmd/cli/docs/reference/docker_model_list.yaml
cmd/cli/docs/reference/docker_model_run.yaml
Remove obsolete distribution utils and introduce log sanitization
  • Delete old FormatBytes/ShowProgress/ReadContent from distribution utils
  • Add SanitizeForLog helper in internal/utils/log.go
  • Update imports and logging calls to use new sanitization utility
pkg/distribution/internal/utils/utils.go
pkg/internal/utils/log.go
pkg/distribution/distribution/client.go
pkg/inference/backends/llamacpp/llamacpp.go
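As referenced in the platform-detection row above, here is a minimal sketch of what a SupportsVLLM gate could look like, assuming it is a simple OS/architecture check (the image in this PR is built for linux/amd64). The real pkg/inference/platform code may apply different or additional criteria.

```go
// Illustrative sketch only; the actual SupportsVLLM may differ.
package platform

import "runtime"

// SupportsVLLM reports whether the vLLM backend is expected to work on the
// current platform. Assumption: Linux on amd64, matching the CUDA image
// built in this PR; other platforms keep using llama.cpp.
func SupportsVLLM() bool {
	return runtime.GOOS == "linux" && runtime.GOARCH == "amd64"
}
```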

Possibly linked issues

  • #Feature Request: vLLM Backend: The PR adds vLLM backend support, including Docker image builds, CI/CD integration, and CLI updates, directly addressing the feature request.


@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds a new vLLM CUDA Docker image variant to support vLLM (a fast LLM inference engine) alongside the existing model runner functionality.

  • Adds a new Docker build stage for vLLM with CUDA support
  • Introduces a new GitHub Actions workflow job to build and push the vLLM image
  • Includes configuration for vLLM version parameterization

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
Dockerfile Adds new vLLM build stage that installs vLLM using uv package manager
.github/workflows/release.yml Adds workflow input parameter and build job for vLLM CUDA image



@sourcery-ai sourcery-ai bot left a comment


New security issues found

@Copilot Copilot AI review requested due to automatic review settings October 14, 2025 06:49

@Copilot Copilot AI left a comment


Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.



Comment on lines +67 to +72
echo "vllm-cuda<<EOF" >> "$GITHUB_OUTPUT"
echo "docker/model-runner:${{ inputs.releaseTag }}-vllm-cuda" >> "$GITHUB_OUTPUT"
if [ "${{ inputs.pushLatest }}" == "true" ]; then
echo "docker/model-runner:latest-vllm-cuda" >> "$GITHUB_OUTPUT"
fi
echo 'EOF' >> "$GITHUB_OUTPUT"

Copilot AI Oct 14, 2025


The tag generation logic for vllm-cuda duplicates the pattern used for the cuda tags above. Consider extracting this into a reusable function or template to reduce code duplication and improve maintainability.


@Copilot Copilot AI review requested due to automatic review settings October 14, 2025 09:52

@Copilot Copilot AI left a comment


Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.



@doringeman doringeman changed the title from "build: add new vLLM CUDA image" to "Add vLLM" Oct 14, 2025
@Copilot Copilot AI review requested due to automatic review settings October 14, 2025 12:14

@Copilot Copilot AI left a comment


Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.



@Copilot Copilot AI review requested due to automatic review settings October 14, 2025 16:46

@Copilot Copilot AI left a comment


Pull Request Overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.



@Copilot Copilot AI review requested due to automatic review settings October 15, 2025 19:56

@Copilot Copilot AI left a comment


Pull Request Overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

pkg/inference/backends/vllm/vllm.go:1

  • The log message uses 'modelID' in the format string but should use 'model' to be consistent with the variable name pattern used elsewhere in the codebase.
package vllm


@Copilot Copilot AI review requested due to automatic review settings October 16, 2025 12:04

@Copilot Copilot AI left a comment


Pull Request Overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

pkg/inference/backends/vllm/vllm.go:1

  • The log message refers to 'modelID' but should be consistent with the parameter name 'model' used elsewhere in the codebase for clarity.
package vllm


Reorganize Dockerfile stages to avoid unnecessary layer copying and improve build caching. Split the final stages into final-llamacpp and final-vllm variants, copying the model-runner binary only in the final stages rather than in intermediate ones, so the build cache is reused when only the model-runner binary has changed.

Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Add modelRef parameter to Backend.Run() interface to support serving models with vLLM under their reference names. Update vLLM backend to use modelRef with --served-model-name flag.

Signed-off-by: Dorin Geman <dorin.geman@docker.com>
…n Linux

This reverts commit 1ab1ead.

Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
@Copilot Copilot AI review requested due to automatic review settings October 21, 2025 09:47

@Copilot Copilot AI left a comment


Pull Request Overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.



Comment on lines +184 to +185
log.Warnf("Backend %s running modelID %s exited with error: %v",
backend.Name(), modelID, err,

Copilot AI Oct 21, 2025


[nitpick] Log message incorrectly says 'modelID' but the context suggests this should be 'model' to match the previous naming convention, or the variable name should remain as 'model' since modelID is the internal identifier. The original message 'running model' was clearer.

Suggested change
log.Warnf("Backend %s running modelID %s exited with error: %v",
backend.Name(), modelID, err,
log.Warnf("Backend %s running model %s exited with error: %v",
backend.Name(), modelRef, err,


Signed-off-by: Dorin Geman <dorin.geman@docker.com>
@Copilot Copilot AI review requested due to automatic review settings October 21, 2025 10:58

# Install uv and vLLM as modelrunner user
RUN curl -LsSf https://astral.sh/uv/install.sh | sh \
&& ~/.local/bin/uv venv --python /usr/bin/python3 /opt/vllm-env \

@ericcurtin ericcurtin Oct 21, 2025


If we change this to copy from the vllm/vllm-openai:v0.11.0 container, we get DGX Spark support (I know I suggested doing it this less hacky way; apologies, I didn't realize that container has aarch64 builds and this approach doesn't appear to).

Could be a follow-on PR too.


@ericcurtin ericcurtin Oct 21, 2025


Or, even better, install the wheels from here:

https://wheels.vllm.ai/b8b302cde434df8c9289a2b465406b47ebab1c2d/vllm/

That commit SHA is the v0.11.0 one.

They tipped me off in the vLLM stack that they build CUDA x86_64 and aarch64 wheels for every commit. So this is the same thing, but it also has an aarch64 version.

It'd be better than the hacky container copy (which is prone to errors: missing files, OS mismatch, library version mismatch).


@ericcurtin ericcurtin Oct 21, 2025


This is a way to get that programmatically:

$ git rev-list -n 1 v0.11.0
b8b302cde434df8c9289a2b465406b47ebab1c2d


@Copilot Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@Copilot Copilot AI review requested due to automatic review settings October 21, 2025 18:05

@Copilot Copilot AI left a comment


Pull Request Overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.



} else {
if config.Format == types.FormatSafetensors {
if vllmBackend, ok := s.backends[vllm.Name]; ok {
backend = vllmBackend

Copilot AI Oct 21, 2025


The automatic backend selection logic should handle the case where vLLM is not installed. Currently, if vLLM is not in the backends map, the original backend assignment remains, but this could lead to runtime errors if that backend doesn't support safetensors. Consider adding a check to verify the selected backend supports the model format, or log a warning if vLLM is unavailable for a safetensors model.

Suggested change
backend = vllmBackend
backend = vllmBackend
} else {
s.log.Warnf("vLLM backend is not available for safetensors model '%s'; current backend '%s' may not support safetensors format", model.Name(), backend.Name())


@ilopezluna

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes and they look great!

Blocking issues:

  • An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload. (link)
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `.github/workflows/release.yml:117` </location>
<code_context>
        uses: docker/build-push-action@v5
</code_context>

<issue_to_address>
**security (yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha):** An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.

*Source: opengrep*
</issue_to_address>


@ilopezluna ilopezluna marked this pull request as ready for review October 21, 2025 19:21
@ilopezluna

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `pkg/inference/backends/vllm/vllm.go:97` </location>
<code_context>
-	// TODO: Implement.
-	v.log.Warn("vLLM backend is not yet supported")
-	return errors.New("not implemented")
+func (v *vLLM) Run(ctx context.Context, socket, model string, modelRef string, mode inference.BackendMode, backendConfig *inference.BackendConfiguration) error {
+	if !platform.SupportsVLLM() {
+		v.log.Warn("vLLM backend is not yet supported")
</code_context>

<issue_to_address>
**issue (bug_risk):** The Run method appends both model and modelRef to --served-model-name, which may not match expected CLI usage.

If the CLI expects only one value for --served-model-name, passing both may cause errors. Please confirm the correct usage and update the argument construction if necessary.
</issue_to_address>

### Comment 2
<location> `pkg/inference/backends/vllm/vllm.go:204` </location>
<code_context>
+	return size, nil
+}
+
+func (v *vLLM) GetRequiredMemoryForModel(_ context.Context, _ string, _ *inference.BackendConfiguration) (inference.RequiredMemory, error) {
+	if !platform.SupportsVLLM() {
+		return inference.RequiredMemory{}, errors.New("not implemented")
</code_context>

<issue_to_address>
**suggestion:** GetRequiredMemoryForModel returns hardcoded values for RAM and VRAM.

Static memory values may cause incorrect resource allocation. Please add logic to estimate requirements from model metadata or configuration.

Suggested implementation:

```golang
func (v *vLLM) GetRequiredMemoryForModel(ctx context.Context, modelPath string, config *inference.BackendConfiguration) (inference.RequiredMemory, error) {
	if !platform.SupportsVLLM() {
		return inference.RequiredMemory{}, errors.New("not implemented")
	}

	// Example: Estimate memory based on model metadata
	metadata, err := inference.GetModelMetadata(modelPath)
	if err != nil {
		return inference.RequiredMemory{}, fmt.Errorf("failed to get model metadata: %w", err)
	}

	// Assume float32 weights, 4 bytes per parameter
	const bytesPerParam = 4
	paramCount := metadata.NumParameters
	ramEstimate := int64(paramCount * bytesPerParam)

	// VRAM estimate: typically similar to RAM, but may be higher for large batch sizes
	// Here we use a simple heuristic, can be improved with more info
	vramEstimate := ramEstimate

	// If config specifies batch size or precision, adjust estimates
	if config != nil {
		if config.Precision == "float16" {
			vramEstimate = int64(paramCount * 2)
			ramEstimate = int64(paramCount * 2)
		}
		if config.BatchSize > 0 {
			// Increase VRAM estimate for larger batch sizes
			vramEstimate += int64(config.BatchSize * 1024 * 1024) // 1MB per batch as a rough estimate
		}
	}

	return inference.RequiredMemory{
		RAM:  ramEstimate,
		VRAM: vramEstimate,
	}, nil
}

```

- You must implement or ensure the existence of `inference.GetModelMetadata(modelPath)` which should return a struct with at least `NumParameters` (int64).
- If your model metadata includes more details (e.g., data type, layers), you can refine the estimation logic.
- Adjust the estimation heuristics as needed for your specific models and hardware.
</issue_to_address>

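On Comment 1 above: vLLM's --served-model-name flag accepts one or more names, so passing both the internal identifier and the user-facing reference is workable, though serving only the reference is simpler. A hypothetical sketch of passing the reference name (buildServeArgs is not the PR's GetArgs, which lives in vllm_config.go):

```go
// Hypothetical helper, shown only to illustrate the flag usage discussed in
// Comment 1; it is not the PR's argument construction.
package vllm

func buildServeArgs(modelPath, modelRef string, extra []string) []string {
	args := []string{
		"serve", modelPath,
		// Expose the model under its user-facing reference name.
		"--served-model-name", modelRef,
	}
	return append(args, extra...)
}
```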


// Name is the backend name.
Name = "vllm"
Name = "vllm"
vllmDir = "/opt/vllm-env/bin"

@ericcurtin ericcurtin Oct 21, 2025


I am sometimes curious why we care about the directory in the golang code. We could just add this directory to the PATH in the Dockerfile.

Could be useful if we start to use artifacts from this container, for example for ROCm, which expects things to be in a different path:

https://hub.docker.com/r/rocm/vllm

We can worry about this in a follow-on PR, though.

