Skip to content

feat(cli): add GPU count requests#1812

Open
elezar wants to merge 6 commits into
mainfrom
1444-gpu-cli-count/elezar
Open

feat(cli): add GPU count requests#1812
elezar wants to merge 6 commits into
mainfrom
1444-gpu-cli-count/elezar

Conversation

@elezar

@elezar elezar commented Jun 8, 2026

Copy link
Copy Markdown
Member

Summary

Adds structured GPU resource requirements for sandbox creation and updates the CLI/API path so openshell sandbox create --gpu [COUNT] records GPU intent in ResourceRequirements.gpu.

This is a breaking proto compatibility change: the public and compute-driver sandbox specs now carry resource_requirements and reserve the previous flat GPU fields. The shape aligns the implementation with rfc/0004-sandbox-resource-requirements/README.md and #1360.

Related Issue

Part of #1444. Related to #1338, #1156, and #1360. Follow-up GPU support preflight semantics are tracked in #1807.

Changes

  • Add ResourceRequirements.gpu.count to the public and compute-driver protos, and reserve the previous GPU device fields.
  • Replace the older GPU CLI shape with --gpu for the driver default request and --gpu COUNT for counted requests.
  • Pass GPU resource requirements through sandbox create, gateway-to-driver translation, provisioning timeout messages, and driver helper APIs.
  • Render Kubernetes nvidia.com/gpu limits from GPU requirements; default --gpu requests one GPU, and --gpu COUNT requests that count.
  • Keep exact device selection driver-owned through driver_config: Docker/Podman use cdi_devices, and VM uses gpu_device_ids.
  • Validate exact device requests consistently: they require a GPU request, a single exact device works with default --gpu, and multi-device exact lists require --gpu COUNT matching the list length.
  • Reject unsupported count-only selection in Docker and Podman, while preserving VM support for the single-GPU default/count path.
  • Update GPU request docs plus Docker, Podman, Kubernetes, and compute runtime notes.

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests not added; the change is covered by unit/integration coverage in pre-commit and does not add a new live sandbox e2e path

Focused GPU checks also run during local review:

  • /Users/elezar/.cargo/bin/cargo test -p openshell-core gpu::tests
  • /Users/elezar/.cargo/bin/cargo test -p openshell-driver-docker cdi_device
  • /Users/elezar/.cargo/bin/cargo test -p openshell-driver-podman cdi_device
  • /Users/elezar/.cargo/bin/cargo test -p openshell-driver-vm gpu_device

Checklist

@elezar elezar requested a review from a team as a code owner June 8, 2026 13:10
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

@mrunalp

mrunalp commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

/ok to test abe5b79

@TaylorMutch

Copy link
Copy Markdown
Collaborator

/ok to test abe5b79

@elezar elezar marked this pull request as draft June 10, 2026 07:55
@copy-pr-bot

copy-pr-bot Bot commented Jun 10, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@elezar

elezar commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

Landing #1815 first should simplify the changes here.

BREAKING CHANGE: SandboxSpec.gpu and DriverSandboxSpec.gpu were replaced with resource_requirements.gpu, changing protobuf field 9 from a bool to a message for both public and driver APIs.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the 1444-gpu-cli-count/elezar branch from abe5b79 to 06c69dd Compare June 12, 2026 07:12
@elezar elezar marked this pull request as ready for review June 12, 2026 07:29
@elezar elezar added gator:in-review Gator is reviewing or awaiting PR review feedback test:e2e-gpu Requires GPU end-to-end coverage labels Jun 12, 2026
@github-actions

Copy link
Copy Markdown

Label test:e2e-gpu applied for 06c69dd. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute GPU E2E after building the required supervisor image once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

@elezar

elezar commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

gator-agent

PR Review Status

Validation: this is maintainer-authored, project-valid GPU CLI/API/runtime work that aligns GPU sandbox intent with structured resource requirements and the related resource-requirements RFC direction.
Head SHA: 06c69dddf62dd74b3215bdc8e3dafc95ee2622a2

Review findings:

  • Blocking: crates/openshell-cli/src/run.rs moves resource_requirements into CreateSandboxRequest and then borrows it later for provisioning timeout messages. This is a Rust use-after-move compile failure; clone or otherwise retain the value before moving it into the request.
  • Blocking: proto/openshell.proto and proto/compute_driver.proto reuse field number 9 for resource_requirements, replacing the old bool gpu = 9 with a message. Old clients and persisted sandbox records encode field 9 as a varint, while the new schema expects length-delimited data, so prost can silently ignore the old GPU request. Reserve the old field and add resource_requirements on a new field, or keep a transitional deprecated field and map both safely.
  • Warning: --gpu now accepts an optional COUNT, so default GPU plus a trailing command requires the documented --gpu -- <command> form. The docs/tests cover that; keep the parse error crisp for accidental --gpu <command> usage.

Docs: Fern docs were updated under docs/; no navigation change appears necessary for this focused CLI/runtime behavior update.
E2E: test:e2e-gpu is applied because this changes GPU runtime behavior and user-visible GPU sandbox creation semantics.
Checks: OpenShell / Branch Checks is still pending for this head; OpenShell / Helm Lint and DCO are passing.

Next state: gator:in-review

@elezar

elezar commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After CI Update

I re-evaluated latest head 06c69dddf62dd74b3215bdc8e3dafc95ee2622a2 after OpenShell / Branch Checks completed successfully.

Disposition: partially resolved.

Remaining items:

  • Resolved: the earlier compile-failure concern in crates/openshell-cli/src/run.rs is cleared by the passing Rust checks and the current diff retaining resource_requirements for the provisioning timeout path.
  • Still blocking: proto/openshell.proto and proto/compute_driver.proto still replace the old bool gpu = 9 fields with length-delimited ResourceRequirements resource_requirements = 9. That is a protobuf wire-compatibility break for existing clients and stored sandbox records. Please reserve the old field number and add resource_requirements on a new field, or keep a deprecated transitional field and map both safely.
  • OpenShell / GPU E2E is still pending for this head with test:e2e-gpu applied.

Next state: gator:in-review

@elezar

elezar commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After CI Update

I re-evaluated latest head 06c69dddf62dd74b3215bdc8e3dafc95ee2622a2 after the required gates completed successfully.

Disposition: partially resolved.

Remaining items:

  • Resolved: OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, and OpenShell / GPU E2E are now green for this head with test:e2e-gpu applied.
  • Still blocking: proto/openshell.proto and proto/compute_driver.proto still replace the old bool gpu = 9 fields with length-delimited ResourceRequirements resource_requirements = 9. That is a protobuf wire-compatibility break for existing clients and stored sandbox records. Please reserve the old field number and add resource_requirements on a new field, or keep a deprecated transitional field and map both safely.

Next state: gator:in-review

@elezar elezar force-pushed the 1444-gpu-cli-count/elezar branch from 06c69dd to 87c9a8c Compare June 12, 2026 10:26
@elezar

elezar commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After Author Update

I re-evaluated latest head 87c9a8c6015bc23d48b11823540b44d85fa24399 after the branch update following the prior protobuf compatibility feedback.

Disposition: not resolved.

Remaining items:

  • Still blocking: proto/openshell.proto and proto/compute_driver.proto still define resource_requirements = 9, replacing the previous bool gpu = 9 field with a length-delimited message. This remains a protobuf wire-compatibility break for old clients and stored sandbox records. Please keep a deprecated transitional gpu = 9 field and map it when the new field is absent, or reserve field 9 and add resource_requirements on a new field number.
  • Still blocking: crates/openshell-cli/src/run.rs still moves resource_requirements into CreateSandboxRequest.spec.resource_requirements and later reads resource_requirements.as_ref() for provisioning timeout messages. Please clone or otherwise retain that value before moving it into the request.
  • Additional warning from the independent re-review: Docker and Podman should also reject count-only GPU requests at the create/build sink, not only during validation, so a missed validation path cannot over-grant all CDI GPUs.
  • OpenShell / Branch Checks, OpenShell / Helm Lint, and DCO are passing for this head. OpenShell / GPU E2E is still pending with test:e2e-gpu applied.

Next state: gator:in-review

@elezar

elezar commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After CI Update

I re-evaluated latest head 87c9a8c6015bc23d48b11823540b44d85fa24399 after the required gates completed successfully and after a focused re-check of the current diff.

Disposition: partially resolved.

Remaining items:

  • Resolved: the prior crates/openshell-cli/src/run.rs move/borrow compile concern is no longer present in the current diff, and OpenShell / Branch Checks is green for this head.
  • Resolved: the prior Docker/Podman count-only GPU warning is addressed at the driver validation/build paths. Docker and Podman now reject count-only GPU requests without driver_config.cdi_devices, and they validate exact CDI device requests at the sink.
  • Still blocking: proto/openshell.proto and proto/compute_driver.proto still define resource_requirements = 9, replacing the previous bool gpu = 9 field with a length-delimited message. That remains a protobuf wire-compatibility break for old clients and stored sandbox records. Please keep a deprecated transitional gpu = 9 field and map it when the new field is absent, or reserve field 9 and add resource_requirements on a new field number.
  • Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, and OpenShell / GPU E2E are green for this head with test:e2e-gpu applied.

Next state: gator:in-review

@elezar

elezar commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

Still blocking: proto/openshell.proto and proto/compute_driver.proto still define resource_requirements = 9, replacing the previous bool gpu = 9 field with a length-delimited message. That remains a protobuf wire-compatibility break for old clients and stored sandbox records. Please keep a deprecated transitional gpu = 9 field and map it when the new field is absent, or reserve field 9 and add resource_requirements on a new field number.

The breaking proto change is intentional. However, I will defer to @drew on whether we should rather reserve the previous gpu field and introduce this at a new number.

@elezar

elezar commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After Author Update

I re-evaluated latest head 87c9a8c6015bc23d48b11823540b44d85fa24399 after @elezar's 2026-06-12 11:33 UTC comment that the protobuf wire-compatibility break is intentional and should be decided by @drew.

Disposition: not resolved.

Remaining items:

  • The protobuf compatibility concern remains blocking for gator: proto/openshell.proto and proto/compute_driver.proto still define resource_requirements = 9, replacing the previous bool gpu = 9 field with a length-delimited message. Please either change the field numbering/transitional mapping, or have an appropriate maintainer explicitly accept that compatibility break on this PR.
  • Checks are green for this head with test:e2e-gpu applied.

Next state: gator:in-review

@elezar elezar enabled auto-merge (squash) June 12, 2026 12:43
@elezar elezar disabled auto-merge June 12, 2026 12:44
elezar added 4 commits June 12, 2026 14:52
Pass the coupled GPU requirement object through the CLI sandbox_create boundary instead of splitting presence and count into separate arguments.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Pass ResourceRequirements into the provisioning timeout message helper so GPU hints are derived from the same nested request object used to create the sandbox.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Thread Option<GpuResourceRequirements> through driver validation and rendering helpers instead of splitting GPU presence and count into separate arguments.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Require exact driver GPU device lists to be tied to a GPU request, allow a single exact device to use the default countless request, and require explicit matching counts for multi-device lists.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the 1444-gpu-cli-count/elezar branch from 87c9a8c to da6fbd8 Compare June 12, 2026 12:55
@elezar

elezar commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After Author Update

I re-evaluated latest head da6fbd8752372c86a834d59edb238d61d38dc9b7 after the branch update that followed the prior protobuf compatibility feedback.

Disposition: not resolved.

Remaining items:

  • Still blocking: crates/openshell-cli/src/run.rs constructs resource_requirements, moves it into CreateSandboxRequest.spec.resource_requirements, and later borrows resource_requirements.as_ref() for provisioning timeout messages. Please clone the value into the request or retain a separate GPU-request flag/value before moving it.
  • Still blocking: proto/openshell.proto and proto/compute_driver.proto still define resource_requirements = 9, replacing the previous bool gpu = 9 field with a length-delimited message. This remains a protobuf wire-compatibility break for old clients, old drivers, and persisted sandbox records. Please either change the field numbering/transitional mapping, or have an appropriate maintainer explicitly accept that compatibility break on this PR.
  • Resolved: the prior Docker/Podman count-only GPU warning remains addressed. Docker and Podman reject count-only GPU requests without driver_config.cdi_devices, and they validate exact CDI device requests at the sink.
  • Checks: OpenShell / Branch Checks and OpenShell / GPU E2E are still pending for this head; OpenShell / Helm Lint, DCO, and the docs preview are passing.

Next state: gator:in-review

@elezar elezar enabled auto-merge (squash) June 12, 2026 13:03
@elezar

elezar commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After CI and Independent Review

I re-evaluated latest head da6fbd8752372c86a834d59edb238d61d38dc9b7 after the required gates completed successfully and after a bounded independent review of the current diff.

Disposition: not resolved.

Remaining items:

  • Still blocking: crates/openshell-cli/src/run.rs moves resource_requirements into CreateSandboxRequest.spec.resource_requirements and later borrows resource_requirements.as_ref() for provisioning timeout messages. Please clone the value into the request or retain a separate timeout value before the move.
  • Still blocking: proto/openshell.proto and proto/compute_driver.proto still define resource_requirements = 9, replacing the previous bool gpu = 9 field with a length-delimited message. This remains a protobuf wire-compatibility break for old clients, old drivers, and persisted sandbox records unless an appropriate maintainer explicitly accepts that break on this PR.
  • Warning: Docker and Podman reject count-only GPU requests during create validation, but the lower-level device builders can still translate gpu.count = Some(_) without exact devices into the all-GPUs CDI request if reached directly. Please mirror the count-only rejection at the builder sink and add builder-level tests, or explain why no alternate path can reach those builders without prior validation.
  • Checks are now green for this head: OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, OpenShell / E2E, and OpenShell / GPU E2E are passing with test:e2e-gpu applied.

Next state: gator:in-review

@elezar

elezar commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

Responding to the two blocking findings from gator:

  1. crates/openshell-cli/src/run.rs is not a use-after-move compile failure. The generated prost types for ResourceRequirements and GpuResourceRequirements derive Copy, so Option<ResourceRequirements> is also Copy. I verified this by running cargo check -p openshell-cli -j 1 successfully. I also tested the proposed clone fix, and cargo clippy --workspace --all-targets -- -D warnings rejects it with clippy::clone_on_copy, confirming the original code is the correct form for the generated type.

  2. The protobuf field-number change is an intentional alpha API break. The compute-driver proto is not currently treated as a public compatibility surface: gateway and driver lifetimes are tightly coupled, and local drivers are launched by the gateway at startup. For the public SandboxSpec, direct API use is currently limited to the matching OpenShell CLI. We are also intentionally not preserving live or persisted legacy GPU intent across this transition; GPU sandboxes should be recreated after upgrade if they need the new typed resource-requirements shape.

Given that the API is still alpha, we do not want to carry legacy GPU-specific reserved or transitional fields forward into the proto shape we intend to stabilize. I will update RFC 0004 in this branch to reflect that decision, since the current RFC text still describes reserving the old GPU fields.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gator:in-review Gator is reviewing or awaiting PR review feedback test:e2e-gpu Requires GPU end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants