feat(kubernetes): warm-pooled sandboxes via agent-sandbox extension CRDs #1892
Open
rmalani-nv wants to merge 10 commits into
Propose adopting the upstream agent-sandbox warm-pool extension CRDs (SandboxTemplate / SandboxWarmPool / SandboxClaim, extensions.agents.x-k8s.io/v1alpha1) on the Kubernetes driver to hand out pre-warmed sandbox pods in ~milliseconds instead of cold-starting a Sandbox CR per request. Documents the claim-based create flow, what bakes into the shared template vs. late-binds over the supervisor relay, the security-sensitive identity re-anchor, risks, alternatives, and a phased rollout. Signed-off-by: Roshni Malani <rmalani@nvidia.com>
…and e2e Apply the agent-sandbox extensions.yaml alongside manifest.yaml in the local k3d dev bootstrap and the e2e kube harness, reusing the pinned AGENT_SANDBOX_VERSION. The e2e harness waits for the new extension CRDs to be Established and for the re-rolled controller. Note the change in the helm-dev-environment skill. Signed-off-by: Roshni Malani <rmalani@nvidia.com>
Extend the ComputeDriver gRPC contract: WarmClaimBinding on CreateSandboxResponse, claim_name/claim_uid on DriverSandboxStatus, and disallow_warm_pool on DriverSandboxSpec so the gateway can force the cold path for custom-policy requests. Adapt the Docker/Podman/VM drivers (they don't warm-pool). Signed-off-by: Roshni Malani <rmalani@nvidia.com>
The gateway's warm-pool identity re-anchor must agree byte-for-byte with the CRD group/version/kind, claim-uid label, and sandbox-id annotation the Kubernetes driver writes. Hoist these into openshell-core as the single source of truth so the driver and the auth path can't drift. Signed-off-by: Roshni Malani <rmalani@nvidia.com>
Add a durable, HA-safe (namespace, claim_name, claim_uid) -> sandbox_id mapping in the shared object store, keyed so a uid mismatch fails closed and rebinding an existing key to a different sandbox id is rejected. This is the record the warm-pool identity re-anchor resolves against. Signed-off-by: Roshni Malani <rmalani@nvidia.com>
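The keying rules in this commit message can be sketched as follows. This is a minimal in-memory illustration, not the PR's persistence code: `ClaimKey`, `ClaimMappings`, and `MappingError` are hypothetical names, and a `HashMap` stands in for the shared object store. Because the uid is part of the key, a lookup with a stale or spoofed uid finds nothing, and rebinding an existing key to a different sandbox id is refused.

```rust
use std::collections::HashMap;

/// Hypothetical key type. The claim uid is part of the key, so a claim that is
/// deleted and recreated under the same name never matches the old record.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ClaimKey {
    pub namespace: String,
    pub claim_name: String,
    pub claim_uid: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum MappingError {
    /// The key is already bound to a different sandbox id.
    RebindRejected { existing: String },
}

/// In-memory stand-in for the shared object store (assumption for this sketch).
#[derive(Default)]
pub struct ClaimMappings {
    entries: HashMap<ClaimKey, String>,
}

impl ClaimMappings {
    /// Record key -> sandbox_id. Re-recording the same pair is idempotent;
    /// binding an existing key to a different id is rejected.
    pub fn record(&mut self, key: ClaimKey, sandbox_id: &str) -> Result<(), MappingError> {
        if let Some(existing) = self.entries.get(&key) {
            return if existing.as_str() == sandbox_id {
                Ok(())
            } else {
                Err(MappingError::RebindRejected { existing: existing.clone() })
            };
        }
        self.entries.insert(key, sandbox_id.to_string());
        Ok(())
    }

    /// Resolve a mapping. A uid mismatch yields `None`, which callers treat as deny.
    pub fn resolve(&self, key: &ClaimKey) -> Option<&str> {
        self.entries.get(key).map(String::as_str)
    }
}
```

A real store would need an atomic create-if-absent primitive to keep the rebind check safe across gateway replicas; the sketch does not model that.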
Reconcile operator-declared warm pools (one SandboxTemplate + SandboxWarmPool per pool), bind pre-warmed pods via a SandboxClaim instead of cold-creating a Sandbox, and watch claims to surface warm sandboxes. warm_eligible() exhaustively destructures the request spec/template so any newly added field must be classified before it can ride the warm path; custom-policy requests and per-request overrides fall back to cold. The pooled template uses an ephemeral workspace, an unmanaged network policy, and no injected identity. Signed-off-by: Roshni Malani <rmalani@nvidia.com>
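The exhaustive-destructuring idea behind warm_eligible() can be illustrated with a small sketch. `RequestSpec` here is a hypothetical, simplified stand-in for the real request spec/template; only `disallow_warm_pool` and the notion of a custom policy come from the commit messages, and the other fields are invented for illustration. The point is the pattern without a `..` rest: adding a field to the struct breaks compilation until someone decides whether it is warm-safe.

```rust
/// Hypothetical, simplified request spec; the real spec has different fields.
#[derive(Debug, Default, Clone)]
pub struct RequestSpec {
    pub image_override: Option<String>, // illustrative per-request override
    pub custom_policy: Option<String>,  // custom policy forces the cold path
    pub disallow_warm_pool: bool,       // set by the gateway to force cold
    pub labels: Vec<(String, String)>,  // assumed warm-safe in this sketch
}

/// Returns true only when every field has been classified as warm-safe.
pub fn warm_eligible(spec: &RequestSpec) -> bool {
    // No `..` rest pattern on purpose: a newly added field is a compile error
    // here until it is explicitly bound or explicitly ignored.
    let RequestSpec {
        image_override,
        custom_policy,
        disallow_warm_pool,
        labels: _, // classified as irrelevant to eligibility (assumption)
    } = spec;

    image_override.is_none() && custom_policy.is_none() && !*disallow_warm_pool
}
```

With this shape, `warm_eligible(&RequestSpec::default())` is true, and setting any of the three classified fields sends the request to the cold path.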
…SandboxClaim A warm pod's owning Sandbox CR is created generically by the pool controller and carries no openshell sandbox-id, so IssueSandboxToken must re-anchor identity through a fail-closed chain: pod -> owning Sandbox -> controlling SandboxClaim -> live bound claim -> durable gateway claim mapping. Record the mapping synchronously at create (rolling the claim back if the write fails), back-fill it after a crash window only for sandboxes the gateway minted, and GC orphaned mappings. Signed-off-by: Roshni Malani <rmalani@nvidia.com>
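A rough sketch of the fail-closed chain, under heavy simplification: the types below (`Pod`, `Sandbox`, `Claim`, `Cluster`, `Deny`) are hypothetical, namespaces are omitted, and plain maps replace Kubernetes API lookups and the durable store. It only shows the shape of the logic, where each hop either succeeds or returns a denial and no hop has a permissive default.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum Deny {
    NoOwningSandbox,
    NoControllingClaim,
    UidMismatch,
    ClaimNotBound,
    NoMapping,
}

pub struct Pod { pub owner_sandbox: Option<String> }
/// Controlling claim reference as (claim_name, claim_uid).
pub struct Sandbox { pub controller_claim: Option<(String, String)> }
pub struct Claim { pub uid: String, pub bound_sandbox: Option<String> }

/// Stand-in for cluster state plus the gateway's durable claim mappings.
pub struct Cluster {
    pub sandboxes: HashMap<String, Sandbox>,
    pub claims: HashMap<String, Claim>,
    pub mappings: HashMap<(String, String), String>, // (claim_name, claim_uid) -> sandbox_id
}

/// pod -> owning Sandbox -> controlling SandboxClaim -> live bound claim -> mapping.
pub fn reanchor(pod: &Pod, c: &Cluster) -> Result<String, Deny> {
    let sb_name = pod.owner_sandbox.as_ref().ok_or(Deny::NoOwningSandbox)?;
    let sandbox = c.sandboxes.get(sb_name).ok_or(Deny::NoOwningSandbox)?;
    let (claim_name, claim_uid) =
        sandbox.controller_claim.as_ref().ok_or(Deny::NoControllingClaim)?;
    // The claim must still exist and carry the same uid as the owner reference.
    let live = c.claims.get(claim_name).ok_or(Deny::NoControllingClaim)?;
    if &live.uid != claim_uid {
        return Err(Deny::UidMismatch);
    }
    // The live claim must be bound to this exact Sandbox.
    if live.bound_sandbox.as_deref() != Some(sb_name.as_str()) {
        return Err(Deny::ClaimNotBound);
    }
    // Only a gateway-written mapping can supply the sandbox id.
    c.mappings
        .get(&(claim_name.clone(), claim_uid.clone()))
        .cloned()
        .ok_or(Deny::NoMapping)
}
```

Returning a distinct `Deny` variant per hop is a choice made for this sketch; it makes the adversarial cases (spoofed label, uid mismatch, missing mapping) straightforward to assert individually.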
A pooled pod boots before it is claimed, so it cannot be handed a per-sandbox identity at create time. Let the supervisor boot identity-less on the image baseline policy and have the gateway derive the sandbox id from the gateway-minted JWT principal when the supervisor hello carries none, with a fast bootstrap retry while the claim binds. Landlock is applied once at process start, so a warm pod enforces the pool baseline policy; custom-policy requests stay on the cold path. Signed-off-by: Roshni Malani <rmalani@nvidia.com>
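The id-derivation rule described here might look roughly like the sketch below. `Principal` and `resolve_sandbox_id` are hypothetical names, and the sketch assumes the JWT has already been verified and that the principal carries the sandbox id. It does not model the bootstrap retry, nor what happens when a hello id and the principal disagree, since the commit message does not say.

```rust
/// Hypothetical principal extracted from an already-verified, gateway-minted JWT.
pub struct Principal {
    pub sandbox_id: String,
}

/// Pick the sandbox id for a supervisor session.
/// A cold-started supervisor sends its id in the hello; a warm-pooled one
/// boots identity-less, so the id falls back to the JWT principal.
pub fn resolve_sandbox_id(hello_sandbox_id: Option<&str>, principal: &Principal) -> String {
    match hello_sandbox_id {
        Some(id) if !id.is_empty() => id.to_string(),
        // Missing or empty hello id: derive from the JWT principal.
        _ => principal.sandbox_id.clone(),
    }
}
```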
Render warm-pool settings into gateway.toml, add a values-warm-pool.yaml CI overlay, document the fields in the config reference, and grant the gateway only the extension-CRD verbs it uses: server-side apply (create+patch) on templates/pools and create/delete/get/list/watch on claims, with no status subresources. The sandbox pod ServiceAccount is intentionally granted none of these. Signed-off-by: Roshni Malani <rmalani@nvidia.com>
Add a Kubernetes e2e (gated on OPENSHELL_E2E_WARM_POOL) asserting a re-claimed pod never sees the prior claim's writable workspace, the shared data volume is mounted read-only, and a custom-policy request falls back to cold (no SandboxClaim) while a default request binds one. Signed-off-by: Roshni Malani <rmalani@nvidia.com>
d83d658 to
03d7fde
Compare
Summary
Implements warm-pooled sandboxes on the Kubernetes compute driver (#1879). Instead of cold-starting a Sandbox CR per request, the gateway claims a pre-warmed pod from an operator-declared warm pool by creating an agent-sandbox SandboxClaim (extensions.agents.x-k8s.io/v1alpha1), cutting provisioning from pod-startup time to near-instant.
This supersedes the closed groundwork PR #1813 (RFC + extension install): it keeps that groundwork at the base of the branch and folds in the full driver/server/supervisor implementation plus the security remediation items raised in the #1813 review.
Warm pooling is off by default (server.warmPool.enabled=false); with it disabled the cold path is unchanged. Validated end-to-end on a local k3s (k3d) cluster.
Security-sensitive change: please review closely
A warm pod's owning Sandbox CR is created generically by the pool controller and carries no OpenShell sandbox-id, so IssueSandboxToken re-anchors identity through a fail-closed chain: pod → owning Sandbox → controlling SandboxClaim → live bound claim → durable gateway-written claim mapping (crates/openshell-server/src/auth/k8s_sa.rs). Any broken link denies the token. Custom-policy requests never warm-pool (Landlock is applied once at boot and cannot be loosened), and the sandbox pod ServiceAccount is granted no claim/template/pool access.
Related Issue
Closes #1879. Addresses the security remediation items from the #1813 review (fail-closed eligibility, identity re-anchor, single-use workspace, least-privilege RBAC).
Changes
Stacked commits, in dependency order:
- feat(compute-driver): extend the gRPC contract with WarmClaimBinding on CreateSandboxResponse, claim_name/claim_uid on DriverSandboxStatus, and disallow_warm_pool on DriverSandboxSpec; adapt the Docker/Podman/VM drivers (they don't warm-pool).
- refactor(core): centralize agent-sandbox CRD identity (group/version/kind, claim-uid label, sandbox-id annotation) in openshell-core so the driver and the auth path can't drift.
- feat(server) persistence: durable, HA-safe (namespace, claim_name, claim_uid) → sandbox_id mapping; rebinding a key to a different id is rejected.
- feat(kubernetes): reconcile warm pools (one SandboxTemplate + SandboxWarmPool per pool), bind pre-warmed pods via SandboxClaim, and watch claims to surface warm sandboxes. warm_eligible() exhaustively destructures the request spec/template, so any newly added field must be classified before it can ride the warm path.
- feat(server) re-anchor: the fail-closed identity chain above, plus a synchronous mapping write with claim rollback, crash-window backfill bounded to gateway-minted ids, and orphan GC.
- feat(sandbox): the supervisor boots identity-less on the image baseline policy; the gateway derives the sandbox id from the gateway-minted JWT when the supervisor hello carries none, with a fast bootstrap retry while the claim binds.
- feat(deploy): warm-pool gateway.toml rendering, a values-warm-pool.yaml CI overlay, and least-privilege RBAC (SSA create+patch on templates/pools; create/delete/get/list/watch on claims; no status subresources).
- test(e2e): workspace single-use across re-claims, read-only shared volume, and cold-fallback for custom-policy requests (gated on OPENSHELL_E2E_WARM_POOL).
- docs: config reference, driver README, and RFC 0005 updated to the validated design.
Testing
Validated on a local k3s (k3d) cluster during development:
- SandboxTemplate → SandboxWarmPool → SandboxClaim cycle: a claim binds a pre-warmed pod near-instantly, the openshell.io/sandbox-id annotation lands on the pod, and the pool self-replenishes.
- sandbox create reaches Ready and runs a command over the supervisor relay; the cold-path baseline is unchanged.
- warm_pool_workspace_isolation (workspace single-use + read-only shared volume) passed live on k3d.
This change set:
- mise run pre-commit ✓ (lint / format / license).
- Targeted unit tests pass: warm_eligible (Kubernetes driver) and the auth::k8s_sa re-anchor suite (30 tests, including spoofed-label, uid-mismatch, and store-mapping-spoof cases).
- Added a cold-fallback e2e assertion (compiles under e2e-kubernetes; runs in the gated warm-pool e2e).
- Ran an adversarial multi-dimension review over the diff; it surfaced one over-trimmed RBAC verb (claims watch, which the driver's claim watcher needs), now fixed and re-verified against the rendered chart.
- Full unit/integration suite and e2e run in CI.
- mise run pre-commit passes
- Unit tests added/updated: warm_eligible, auth re-anchor (incl. adversarial cases), persistence mappings
- E2E tests added/updated: warm-pool workspace isolation + cold-fallback
Checklist