feat(kubernetes): warm-pooled sandboxes via agent-sandbox extension CRDs #1892

Open
rmalani-nv wants to merge 10 commits into NVIDIA:main from rmalani-nv:rmalani-nv/warm-pooled-sandboxes
@rmalani-nv

Contributor

Summary

Implements warm-pooled sandboxes on the Kubernetes compute driver (#1879). Instead of cold-starting a Sandbox CR per request, the gateway claims a pre-warmed pod from an operator-declared warm pool by creating an agent-sandbox SandboxClaim (extensions.agents.x-k8s.io/v1alpha1), cutting provisioning from pod-startup time to near-instant.

This supersedes the closed groundwork PR #1813 (RFC + extension install): it keeps that groundwork at the base of the branch and folds in the full driver/server/supervisor implementation plus the security remediation items raised in the #1813 review.

Warm pooling is off by default (server.warmPool.enabled=false); with it disabled the cold path is unchanged. Validated end-to-end on a local k3s (k3d) cluster.

Security-sensitive change — please review closely

A warm pod's owning Sandbox CR is created generically by the pool controller and carries no OpenShell sandbox-id, so IssueSandboxToken re-anchors identity through a fail-closed chain: pod → owning Sandbox → controlling SandboxClaim → live bound claim → durable gateway-written claim mapping (crates/openshell-server/src/auth/k8s_sa.rs). Any broken link denies the token. Custom-policy requests never warm-pool (Landlock is applied once at boot and cannot be loosened), and the sandbox pod ServiceAccount is granted no claim/template/pool access.
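
The fail-closed chain can be pictured as a sequence of lookups in which every missing or inconsistent link returns a denial. The sketch below is illustrative only: `World`, `Deny`, and `resolve_sandbox_id` are hypothetical names, the cluster state is modelled as in-memory maps rather than Kubernetes API calls, and the namespace component of the mapping key is omitted for brevity. It is not the code in `k8s_sa.rs`.

```rust
use std::collections::HashMap;

/// Reasons a token request is denied. Each variant is one broken link.
#[derive(Debug, PartialEq, Eq)]
pub enum Deny {
    NoOwningSandbox,
    NoControllingClaim,
    ClaimNotLive,
    ClaimUidMismatch,
    ClaimNotBoundToSandbox,
    NoDurableMapping,
}

/// Hypothetical snapshot of cluster state plus the gateway's mapping store.
#[derive(Default)]
pub struct World {
    /// pod name -> owning Sandbox name
    pub pod_owner: HashMap<String, String>,
    /// Sandbox name -> (controlling claim name, claim uid) from its owner reference
    pub sandbox_claim: HashMap<String, (String, String)>,
    /// live claims: claim name -> (claim uid, bound Sandbox name)
    pub live_claims: HashMap<String, (String, String)>,
    /// durable gateway-written mapping: (claim name, claim uid) -> sandbox id
    pub mappings: HashMap<(String, String), String>,
}

/// Walk pod -> Sandbox -> claim -> live bound claim -> durable mapping.
/// There is no fallback branch: any failed lookup ends in `Err`.
pub fn resolve_sandbox_id(w: &World, pod: &str) -> Result<String, Deny> {
    let sandbox = w.pod_owner.get(pod).ok_or(Deny::NoOwningSandbox)?;
    let (claim, uid) = w
        .sandbox_claim
        .get(sandbox)
        .ok_or(Deny::NoControllingClaim)?;
    let (live_uid, bound) = w.live_claims.get(claim).ok_or(Deny::ClaimNotLive)?;
    if live_uid != uid {
        return Err(Deny::ClaimUidMismatch);
    }
    if bound != sandbox {
        return Err(Deny::ClaimNotBoundToSandbox);
    }
    w.mappings
        .get(&(claim.clone(), uid.clone()))
        .cloned()
        .ok_or(Deny::NoDurableMapping)
}
```

Returning `Result` and propagating with `?` keeps the default outcome a denial: a token is only issued when every lookup succeeds and the final mapping exists.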

Related Issue

Closes #1879. Addresses the security remediation items from the #1813 review (fail-closed eligibility, identity re-anchor, single-use workspace, least-privilege RBAC).

Changes

Stacked commits, in dependency order:

  • feat(compute-driver) — extend the gRPC contract: WarmClaimBinding on CreateSandboxResponse, claim_name/claim_uid on DriverSandboxStatus, disallow_warm_pool on DriverSandboxSpec; adapt the Docker/Podman/VM drivers (they don't warm-pool).
  • refactor(core) — centralize agent-sandbox CRD identity (group/version/kind, claim-uid label, sandbox-id annotation) in openshell-core so the driver and the auth path can't drift.
  • feat(server) persistence — durable, HA-safe (namespace, claim_name, claim_uid) → sandbox_id mapping; rebinding a key to a different id is rejected.
  • feat(kubernetes) — reconcile warm pools (one SandboxTemplate + SandboxWarmPool per pool), bind pre-warmed pods via SandboxClaim, and watch claims to surface warm sandboxes. warm_eligible() exhaustively destructures the request spec/template, so any newly added field must be classified before it can ride the warm path.
  • feat(server) re-anchor — the fail-closed identity chain above, plus a synchronous mapping write with claim rollback, crash-window backfill bounded to gateway-minted ids, and orphan GC.
  • feat(sandbox) — the supervisor boots identity-less on the image baseline policy; the gateway derives the sandbox id from the gateway-minted JWT when the supervisor hello carries none, with a fast bootstrap retry while the claim binds.
  • feat(deploy) — warm-pool gateway.toml rendering, a values-warm-pool.yaml CI overlay, and least-privilege RBAC (SSA create+patch on templates/pools; create/delete/get/list/watch on claims; no status subresources).
  • test(e2e) — workspace single-use across re-claims, read-only shared volume, and cold-fallback for custom-policy requests (gated on OPENSHELL_E2E_WARM_POOL).
  • docs — config reference, driver README, and RFC 0005 updated to the validated design.
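
The exhaustive-destructuring guard in `warm_eligible()` relies on a Rust property: a struct pattern without a `..` rest pattern must name every field, so adding a field to the struct breaks compilation until the pattern is updated. A minimal sketch of the idea follows. `RequestSpec` and all its fields other than `disallow_warm_pool` are hypothetical stand-ins, not the real `DriverSandboxSpec`.

```rust
/// Hypothetical, heavily reduced request spec used only for illustration.
pub struct RequestSpec {
    pub disallow_warm_pool: bool,
    pub custom_policy: Option<String>,
    pub image_override: Option<String>,
    pub display_name: String,
}

/// Returns true only when nothing in the request needs a per-sandbox pod.
pub fn warm_eligible(spec: &RequestSpec) -> bool {
    // No `..` here on purpose: a new field on RequestSpec makes this
    // pattern fail to compile until someone decides how it is classified.
    let RequestSpec {
        disallow_warm_pool,
        custom_policy,
        image_override,
        display_name: _, // classified as irrelevant to pooling in this sketch
    } = spec;

    !*disallow_warm_pool && custom_policy.is_none() && image_override.is_none()
}
```

The classification of each field lives next to the pattern, so the decision for a new field is forced at compile time rather than left to review.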

Testing

Validated on a local k3s (k3d) cluster during development:

  • Drove a real SandboxTemplate → SandboxWarmPool → SandboxClaim cycle: a claim binds a pre-warmed pod near-instantly, the openshell.io/sandbox-id annotation lands on the pod, and the pool self-replenishes.
  • Warm-path sandbox create reaches Ready and runs a command over the supervisor relay; the cold-path baseline is unchanged.
  • warm_pool_workspace_isolation (workspace single-use + read-only shared volume) passed live on k3d.

This change set:

  • mise run pre-commit ✓ (lint / format / license).

  • Targeted unit tests pass: warm_eligible (Kubernetes driver) and the auth::k8s_sa re-anchor suite (30 tests, including spoofed-label, uid-mismatch, and store-mapping-spoof cases).

  • Added a cold-fallback e2e assertion (compiles under e2e-kubernetes; runs in the gated warm-pool e2e).

  • Ran an adversarial multi-dimension review over the diff; it surfaced one over-trimmed RBAC verb (claims watch, which the driver's claim watcher needs) — now fixed and re-verified against the rendered chart.

  • Full unit/integration suite and e2e run in CI.

  • mise run pre-commit passes

  • Unit tests added/updated — warm_eligible, auth re-anchor (incl. adversarial cases), persistence mappings

  • E2E tests added/updated — warm-pool workspace isolation + cold-fallback

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable) — RFC 0005, the Kubernetes driver README, and the gateway config reference document the warm-pool design and configuration

@rmalani-nv rmalani-nv requested review from a team, derekwaynecarr and mrunalp as code owners June 13, 2026 00:49
@copy-pr-bot

copy-pr-bot Bot commented Jun 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Propose adopting the upstream agent-sandbox warm-pool extension CRDs (SandboxTemplate / SandboxWarmPool / SandboxClaim, extensions.agents.x-k8s.io/v1alpha1) on the Kubernetes driver to hand out pre-warmed sandbox pods in ~milliseconds instead of cold-starting a Sandbox CR per request. Documents the claim-based create flow, what bakes into the shared template vs. late-binds over the supervisor relay, the security-sensitive identity re-anchor, risks, alternatives, and a phased rollout.

Signed-off-by: Roshni Malani <rmalani@nvidia.com>
…and e2e

Apply the agent-sandbox extensions.yaml alongside manifest.yaml in the local k3d dev bootstrap and the e2e kube harness, reusing the pinned AGENT_SANDBOX_VERSION. The e2e harness waits for the new extension CRDs to be Established and for the re-rolled controller. Note the change in the helm-dev-environment skill.

Signed-off-by: Roshni Malani <rmalani@nvidia.com>
Extend the ComputeDriver gRPC contract: WarmClaimBinding on CreateSandboxResponse, claim_name/claim_uid on DriverSandboxStatus, and disallow_warm_pool on DriverSandboxSpec so the gateway can force the cold path for custom-policy requests. Adapt the Docker/Podman/VM drivers (they don't warm-pool).

Signed-off-by: Roshni Malani <rmalani@nvidia.com>
The gateway's warm-pool identity re-anchor must agree byte-for-byte with the CRD group/version/kind, claim-uid label, and sandbox-id annotation the Kubernetes driver writes. Hoist these into openshell-core as the single source of truth so the driver and the auth path can't drift.

Signed-off-by: Roshni Malani <rmalani@nvidia.com>
Add a durable, HA-safe (namespace, claim_name, claim_uid) -> sandbox_id mapping in the shared object store, keyed so a uid mismatch fails closed and rebinding an existing key to a different sandbox id is rejected. This is the record the warm-pool identity re-anchor resolves against.

Signed-off-by: Roshni Malani <rmalani@nvidia.com>
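
The mapping semantics described in the persistence commit above (uid in the key, rebinding rejected) can be sketched with an in-memory map. This is an assumption-laden illustration: `ClaimKey`, `ClaimMappings`, and `MappingError` are hypothetical names, and a `HashMap` stands in for the shared object store, so none of the durability or HA behaviour is modelled.

```rust
use std::collections::HashMap;

/// Key for one claim. The uid is part of the key, so a claim that is
/// deleted and recreated under the same name does not match.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ClaimKey {
    pub namespace: String,
    pub claim_name: String,
    pub claim_uid: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum MappingError {
    /// The key is already bound to a different sandbox id.
    Rebind { existing: String },
}

/// Hypothetical in-memory stand-in for the gateway's mapping store.
#[derive(Default)]
pub struct ClaimMappings {
    inner: HashMap<ClaimKey, String>,
}

impl ClaimMappings {
    /// Record key -> sandbox_id. Re-recording the same pair is a no-op;
    /// pointing an existing key at another id is rejected.
    pub fn record(&mut self, key: ClaimKey, sandbox_id: &str) -> Result<(), MappingError> {
        if let Some(existing) = self.inner.get(&key) {
            return if existing.as_str() == sandbox_id {
                Ok(())
            } else {
                Err(MappingError::Rebind {
                    existing: existing.clone(),
                })
            };
        }
        self.inner.insert(key, sandbox_id.to_string());
        Ok(())
    }

    /// Look up the sandbox id; an unknown key (including a uid mismatch) is None.
    pub fn resolve(&self, key: &ClaimKey) -> Option<&str> {
        self.inner.get(key).map(String::as_str)
    }
}
```

Because the uid is part of the key, a lookup with a stale or spoofed uid simply finds nothing, which the caller can treat as a denial.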
Reconcile operator-declared warm pools (one SandboxTemplate + SandboxWarmPool per pool), bind pre-warmed pods via a SandboxClaim instead of cold-creating a Sandbox, and watch claims to surface warm sandboxes. warm_eligible() exhaustively destructures the request spec/template so any newly added field must be classified before it can ride the warm path; custom-policy requests and per-request overrides fall back to cold. The pooled template uses an ephemeral workspace, an unmanaged network policy, and no injected identity.

Signed-off-by: Roshni Malani <rmalani@nvidia.com>
…SandboxClaim

A warm pod's owning Sandbox CR is created generically by the pool controller and carries no openshell sandbox-id, so IssueSandboxToken must re-anchor identity through a fail-closed chain: pod -> owning Sandbox -> controlling SandboxClaim -> live bound claim -> durable gateway claim mapping. Record the mapping synchronously at create (rolling the claim back if the write fails), back-fill it after a crash window only for sandboxes the gateway minted, and GC orphaned mappings.

Signed-off-by: Roshni Malani <rmalani@nvidia.com>
A pooled pod boots before it is claimed, so it cannot be handed a per-sandbox identity at create time. Let the supervisor boot identity-less on the image baseline policy and have the gateway derive the sandbox id from the gateway-minted JWT principal when the supervisor hello carries none, with a fast bootstrap retry while the claim binds. Landlock is applied once at process start, so a warm pod enforces the pool baseline policy; custom-policy requests stay on the cold path.

Signed-off-by: Roshni Malani <rmalani@nvidia.com>
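
The identity derivation described in the supervisor commit above can be sketched as a small decision function. The names `effective_sandbox_id` and `HelloError` are hypothetical, and rejecting a hello whose id disagrees with the JWT principal is an assumption of this sketch, not something the commit message states.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum HelloError {
    /// Assumed behaviour: a hello that names a different sandbox than the
    /// gateway-minted token is refused rather than trusted.
    IdentityMismatch,
}

/// Pick the sandbox id for a supervisor connection.
/// `hello_id` is what the supervisor sent (None for an identity-less warm pod);
/// `jwt_principal` is the sandbox id carried by the gateway-minted JWT.
pub fn effective_sandbox_id(
    hello_id: Option<&str>,
    jwt_principal: &str,
) -> Result<String, HelloError> {
    match hello_id {
        // Warm path: the pod booted before it was claimed, so use the token.
        None => Ok(jwt_principal.to_string()),
        // Cold path: the id was injected at create time and must agree.
        Some(id) if id == jwt_principal => Ok(id.to_string()),
        Some(_) => Err(HelloError::IdentityMismatch),
    }
}
```

In this model the token is the only source of identity for a warm pod, so the supervisor never has to learn its sandbox id before it connects.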
Render warm-pool settings into gateway.toml, add a values-warm-pool.yaml CI overlay, document the fields in the config reference, and grant the gateway only the extension-CRD verbs it uses: server-side apply (create+patch) on templates/pools and create/delete/get/list/watch on claims, with no status subresources. The sandbox pod ServiceAccount is intentionally granted none of these.

Signed-off-by: Roshni Malani <rmalani@nvidia.com>
Add a Kubernetes e2e (gated on OPENSHELL_E2E_WARM_POOL) asserting a re-claimed pod never sees the prior claim's writable workspace, the shared data volume is mounted read-only, and a custom-policy request falls back to cold (no SandboxClaim) while a default request binds one.

Signed-off-by: Roshni Malani <rmalani@nvidia.com>
@rmalani-nv rmalani-nv force-pushed the rmalani-nv/warm-pooled-sandboxes branch from d83d658 to 03d7fde Compare June 13, 2026 01:09
Development

Successfully merging this pull request may close these issues.

Warm-pooled sandboxes for the Kubernetes compute driver
