feat(cli): dstack + dstackup — hands-off single-node onboarding#731
h4x3rotab wants to merge 13 commits into
Captures the problem (the ~22-step path to a first app), the hardware-validated findings, and the locked design for hands-off single-node onboarding: dstackup (host setup) + dstack (client), single-node KMS bootstrap, a Rust auth webhook, and the crates/ layout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds four crates under crates/ (no changes to existing crates):
- dstack-core: shared library. Typed VMM prpc client, config rendering (vmm.toml / kms.toml / auth-allowlist.json), app-compose build, host/SGX detection, free-port + --port spec helpers.
- dstack: client CLI. run / ls / logs (info / upgrade / init are scaffolds); talks to a local VMM over its unix socket or a remote one over http(s).
- dstackup: host setup CLI. install / status / destroy. SGX preflight, renders configs, manages dstack-vmm + dstack-auth as systemd units, deploys and bootstraps a single-node KMS-in-CVM. Idempotent install, deterministic cgroup teardown on destroy.
- dstack-auth: Rust reimplementation of the single-operator KMS auth webhook (compose-hash allowlist, re-read per request, fails closed).

Validated end-to-end on a TDX host against the official meta-dstack v0.5.11 release image: dstackup install -> KMS bootstrap -> dstack run -> app serves HTTP 200 -> dstackup destroy, all at the default 1 GB.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses code-review blockers:
- allowlist + state files are now written atomically (temp + rename), and the allowlist read-modify-write holds an exclusive flock, so a concurrent `dstack run` or a crash can no longer corrupt it. A torn allowlist matters: the webhook fails closed on invalid JSON, i.e. denies keys to every app on the host. New `dstack-core::fsutil` (write_atomic, lock_exclusive), tested.
- `dstackup destroy` now finds the KMS CVM by recorded id OR by name, so an install that died before persisting kms_vm_id can't orphan the CVM.
- `--expose` fails fast with guidance (use an SSH tunnel): it would otherwise bind the VM-control plane with neither TLS nor an auth token.

Minors: align hex normalization between `dstack run` and the webhook and store the normalized hash; command stubs exit non-zero; dedupe the KMS image default against `config::DEFAULT_KMS_IMAGE`; fix a stale doc comment and duplicate step numbering.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
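The temp + rename write described in this commit can be sketched with the standard library alone. This is an illustrative sketch, not the code in `fsutil`: the signature and the temp-file naming are assumptions, the parent-directory fsync is the durability step a later commit mentions, and the exclusive `flock` that guards the read-modify-write is deliberately left out. It assumes a Unix host, where a directory can be opened and synced.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical stand-in for `fsutil::write_atomic`; the real signature may differ.
/// Readers see either the old file or the new one, never a torn write.
fn write_atomic(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // The temp file must live in the same directory so rename stays on one filesystem.
    let dir = match path.parent() {
        Some(p) if !p.as_os_str().is_empty() => p,
        _ => Path::new("."),
    };
    let name = path.file_name().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "path has no file name")
    })?;
    let mut tmp_name = name.to_os_string();
    tmp_name.push(format!(".tmp.{}", std::process::id())); // naming scheme is an assumption
    let tmp = dir.join(tmp_name);

    let result: io::Result<()> = (|| {
        let mut f = File::create(&tmp)?;
        f.write_all(bytes)?;
        f.sync_all()?; // flush file contents before the rename makes them visible
        fs::rename(&tmp, path) // atomic replace on POSIX filesystems
    })();
    if let Err(e) = result {
        let _ = fs::remove_file(&tmp); // best-effort cleanup of the temp file
        return Err(e);
    }
    // Sync the directory entry so the rename itself survives a crash.
    File::open(dir)?.sync_all()
}
```

A caller would render the allowlist JSON to bytes first and then call `write_atomic(path, &bytes)` while holding the lock.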
`dstackup install` now reads the guest image's digest.txt and renders it into
the webhook allowlist's `osImages`, so the auth webhook enforces which OS image
an app may boot — even though the KMS's own download-verify stays off for the
single-node flow. Previously both gates were open: an app could boot under a
different, unmeasured image and still receive keys. `bootAuth/kms` ignores
`osImages`, so the KMS bootstrap is unaffected.
Validated on a TDX host with the official meta-dstack v0.5.11 image: the pin
(digest.txt c2aa0186…) matches the KMS-reported image hash — nginx still gets
keys and serves HTTP 200 with the pin active — while a wrong image hash is now
denied ("os image not allowed"); 0x/case variants normalize correctly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
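A minimal sketch of the os-image check this commit describes, assuming the pin list is already loaded into memory. The function names are hypothetical and the digest values in the usage note are placeholders. The "empty list allows any image" branch reflects the behavior a later review round reports for `osImages=[]`.

```rust
/// Normalize a hex digest: trim whitespace, drop one optional 0x/0X prefix, lowercase.
/// Hypothetical helper; the webhook's real normalization may differ in detail.
fn normalize_hex(s: &str) -> String {
    let t = s.trim();
    let t = t
        .strip_prefix("0x")
        .or_else(|| t.strip_prefix("0X"))
        .unwrap_or(t);
    t.to_ascii_lowercase()
}

/// Returns true when the image hash reported at boot matches a pinned digest.
/// An empty pin list allows any image, so a missing pin is effectively open.
fn os_image_allowed(os_images: &[String], reported: &str) -> bool {
    if os_images.is_empty() {
        return true;
    }
    let want = normalize_hex(reported);
    os_images.iter().any(|pin| normalize_hex(pin) == want)
}
```

With a pin of `ab12cd`, both `0xAB12CD` and `AB12CD` are accepted and any other digest is denied.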
- `dstack run`: the "registered" line no longer implies the KMS will honor it regardless of path; it states keys are issued only if the file is the allowlist the auth webhook actually serves.
- `dstack logs`: clearer gate for a remote endpoint (remote support lands with the TLS+token transport) instead of a terse "unix only".
- `dstackup`: document that the auth webhook's 127.0.0.1 bind is deliberate (it decides key release; CVMs still reach it at 10.0.2.2 via user-mode networking).

Message/comment-only; no behavior change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ust readiness
- `dstackup install` auto-picks a CID window that avoids any VMM already running on the host: it reads other `dstack-vmm` processes' configs for their reserved `[cid_start, cid_start+pool)` and any live `guest-cid`, then offsets past them. `--cid-start` is now optional (auto by default) and refuses an explicit value that overlaps a reserved range.
- external tools (systemctl/docker/curl) run with a sanitized `PATH`, so a hijacked environment can't substitute a binary while we run as root.
- KMS readiness now requires curl to succeed AND a parsed, non-empty `ca_cert` field rather than a substring match (an error body can't read as "ready"), which also confirms our KMS is bound to the expected port.
- dstack-auth: a BootInfo wire-contract test pins the camelCase field names the webhook depends on, so a future KMS rename breaks a test, not production.

Re-validated end-to-end on a TDX host: with an existing VMM reserving [1000,2000), install with no --cid-start auto-picks 2000; KMS + app CVMs land at 2001/2002 (no collision), nginx serves HTTP 200 with the os-image pin active, clean destroy to baseline.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
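The CID-window selection can be sketched as a sweep over the busy ranges. This is an assumption-laden illustration rather than the code in `cid.rs`: `CidRange`, `overlaps`, and `pick_cid_start` are hypothetical names, and the default start of 1000 and pool size of 1000 in the usage note are inferred from the `[1000,2000)` example above.

```rust
/// Half-open CID range [start, end).
#[derive(Debug, Clone, Copy, PartialEq)]
struct CidRange {
    start: u32,
    end: u32,
}

/// Two half-open ranges overlap when each starts before the other ends.
fn overlaps(a: CidRange, b: CidRange) -> bool {
    a.start < b.end && b.start < a.end
}

/// Pick the lowest start >= `default_start` whose window [start, start + pool)
/// avoids every reserved range and every live guest CID.
/// Returns None if the window would overflow u32.
fn pick_cid_start(
    default_start: u32,
    pool: u32,
    reserved: &[CidRange],
    live: &[u32],
) -> Option<u32> {
    // Treat each live CID as a one-element reserved range.
    let mut busy: Vec<CidRange> = reserved.to_vec();
    busy.extend(live.iter().map(|&c| CidRange {
        start: c,
        end: c.saturating_add(1),
    }));
    busy.sort_by_key(|r| r.start);

    let mut start = default_start;
    for r in &busy {
        let candidate = CidRange {
            start,
            end: start.checked_add(pool)?,
        };
        if overlaps(candidate, *r) {
            start = r.end; // offset past the clash and keep sweeping
        }
    }
    start.checked_add(pool)?;
    Some(start)
}
```

With one other VMM reserving `[1000, 2000)`, `pick_cid_start(1000, 1000, ...)` yields 2000, matching the validation run above. The same `overlaps` check is enough to refuse an explicit `--cid-start` that lands inside a reserved range.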
…coexistence
From a second clean-context review of the hardened branch:
- B1 (fail-open): a missing/empty digest.txt silently disabled the OS-image pin (osImages=[] => webhook allow-any-image) while install only warned. Now the pin is resolved in preflight and `dstackup install` BAILS in KMS mode unless --allow-unpinned-image is passed. Fail-closed.
- B2 (half-install): coexistence was handled only for CIDs; the dashboard/auth TCP ports and the host-api vsock port had fixed defaults with no detection, so a second install half-installed then failed on bind. All collision checks (CIDs + ports) now run in a preflight BEFORE any side effect (TCP ports are bind-tested, the host-api port is checked against other dstack-vmm configs) and refuse with guidance.
- M1: `dstack run --allowlist <missing>` no longer misreports ENOENT as a permissions problem; distinct "run dstackup install first" message.
- M2: TCB-status enforcement intentionally NOT added (single-node operator trusts their own host; real TDX hosts often report non-UpToDate); documented as a deliberate deviation from auth-simple.
- minors: device_ok matches auth-simple (empty list = any device); write_atomic fsyncs the parent dir (rename durability); lowercase two error messages; comments on the CID-block math, the compose-hash clone, and the /logs transport.

Validated on a TDX host: missing-pin and port-collision installs both bail in preflight with zero side effects; --allow-unpinned-image opt-out works; happy path (real image) -> KMS bootstrap -> nginx HTTP 200 -> clean destroy.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
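The TCP half of the B2 preflight (bind-test every port before any side effect) might look like the sketch below. The function names, the port labels, and the error wording are assumptions; the vsock host-api check against other `dstack-vmm` configs is not modeled here.

```rust
use std::net::{Ipv4Addr, SocketAddrV4, TcpListener};

/// A port counts as free if we can bind it on localhost right now.
/// The listener is dropped immediately, so this is a probe, not a reservation.
fn tcp_port_free(port: u16) -> bool {
    TcpListener::bind(SocketAddrV4::new(Ipv4Addr::LOCALHOST, port)).is_ok()
}

/// Check every (label, port) pair up front and report all clashes at once,
/// so the caller can refuse before touching systemd units or config files.
fn preflight_ports(ports: &[(&str, u16)]) -> Result<(), String> {
    let busy: Vec<String> = ports
        .iter()
        .filter(|(_, port)| !tcp_port_free(*port))
        .map(|(label, port)| format!("{label} port {port}"))
        .collect();
    if busy.is_empty() {
        Ok(())
    } else {
        Err(format!(
            "ports already in use: {}; choose other ports or stop the conflicting service",
            busy.join(", ")
        ))
    }
}
```

Because the probe releases the port again, another process could still take it between preflight and the real bind; the check narrows the half-install window rather than eliminating it.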
CI runs `cargo clippy -- -D warnings -D clippy::expect_used -D clippy::unwrap_used`
(stricter than the `-D warnings` documented in CLAUDE.md). The three infallible
`to_string_pretty(&value).expect(...)` calls now pretty-print via the Value's
Display (`{:#}`), which is byte-identical output — so the compose hash and the
rendered configs are unchanged — and trips neither lint.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
h4x3rotab
left a comment
- crates/dstack is confusing. Let's call it crates/dstack-cli, but the binary can still be called dstack.
- crates/dstack-core is also confusing, because it looks like the core of the entire dstack project, but it's actually just a shared lib between the two CLIs.
- There are so many mentions of "2025 Phala Network". It's 2026 now.
We should separate the work progress tracking doc from the user facing docs
#[derive(Subcommand)]
enum Command {
    /// Deploy an app from a docker-compose file.
    Run {
let's try to make our cli align with phala cli as much as we can
Done in 1661cc5 — aligned the client with the phala CLI: run → deploy, ls → apps, dropped the separate upgrade (phala folds update into deploy), and added a global -j/--json (honored by deploy/apps). logs already matched; the binary is still dstack. Fuller alignment (ps, --cvm-id, a phala.toml-style project file) is noted as a follow-up in #699 so it can be designed deliberately.
Addresses PR #731 review feedback:
- crates/dstack -> crates/dstack-cli: the directory no longer reads like "the dstack project"; the binary is still `dstack` (via [[bin]]).
- crates/dstack-core -> crates/dstack-cli-core: it's the shared lib for the two CLIs, not the core of dstack.
- bump the new files' copyright year from 2025 to 2026.

git mv preserves history; produced binaries unchanged (dstack/dstackup/dstack-auth).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the project-internal "Tier-1" shorthand with plain "single-node (no gateway)" in the ~8 code comments + a test name, so the comments stand on their own without the design doc. "Tier 1/2" stays in docs/onboarding-redesign.md where it's actually defined. Also drops a stale config.rs module-doc line that claimed vmm.toml rendering was still a follow-up (it's rendered now). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
main.rs was one 925-line file. Break it by concern:
- cli.rs — clap definitions (Cli, Command, InstallOpts)
- install.rs — cmd_install + its helpers (key provider, image pin, port
preflight, readiness probes)
- destroy.rs — cmd_destroy
- state.rs — install-state persistence + the atomic write helper
- systemd.rs — unit management + the sanitized tool() spawner
- cid.rs — CID-window allocation (+ its tests)
- main.rs — now ~70 lines: module decls, main() dispatch, cmd_status
Also collapses the Command::Install 20-field destructure-then-rebuild into
`Install(InstallOpts)` via a clap `Args` derive (the duplication the first
review flagged). Pure reorganization, no behavior change: build, clippy
(-D warnings -D expect_used -D unwrap_used), fmt, the CID tests, and the full
`install --help` surface (all 20 flags + subcommands) are unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
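The sanitized `tool()` spawner listed under systemd.rs could be as small as the sketch below. The fixed search path and the helper's exact shape are assumptions; the point is only that root-run tools such as systemctl, docker, and curl are resolved through a known `PATH` rather than the caller's environment.

```rust
use std::process::Command;

/// Fixed search path for external tools; the exact value is an assumption.
const SAFE_PATH: &str = "/usr/sbin:/usr/bin:/sbin:/bin";

/// Build a Command whose PATH is pinned to system directories. On Unix the
/// program name is resolved against the child's PATH, so a binary planted
/// earlier in the caller's PATH is not picked up.
fn tool(name: &str) -> Command {
    let mut cmd = Command::new(name);
    cmd.env("PATH", SAFE_PATH);
    cmd
}
```

Usage would look like `tool("systemctl").args(["daemon-reload"]).status()`. Clearing the whole environment with `env_clear()` first would be stricter, at the cost of dropping variables some tools expect.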
Addresses the two inline PR review comments:
- align the `dstack` client with the `phala` CLI taxonomy: `run` -> `deploy`, `ls` -> `apps`, drop the separate `upgrade` (phala folds update into `deploy`), and add a global `-j/--json` honored by `deploy` and `apps`. `logs` already matched. The binary is still `dstack`.
- remove docs/onboarding-redesign.md: it's a planning/roadmap doc (design, "what we validated", roadmap, open questions), tracked in #699, not user-facing docs; this keeps docs/ user-only. Also drops the two code comments that linked it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Thanks for the review — all five points are addressed:
Two extra cleanups while in here: de-jargoned "Tier-1" in code comments → "single-node (no gateway)" (47f4c47), and split the 925-line main.rs into focused modules. Verified locally against the exact CI flags.
What
Adds a hands-off single-node onboarding path for dstack, cutting the route to a first app from ~20 manual steps to two commands: `dstackup install` on the host, then `dstack deploy` for the app.
Four new crates under `crates/` (no existing crate is modified):
- `dstack-cli-core`: shared lib for the two CLIs. Typed VMM `prpc` client, config rendering (`vmm.toml` / `kms.toml` / `auth-allowlist.json`), app-compose build, host/SGX detection + CID/port coexistence scan, atomic-write + `flock` helpers.
- `dstack-cli`: client CLI (binary `dstack`). `deploy` / `apps` / `logs` (`info` / `init` are scaffolds), with a global `-j/--json`. Command names follow the `phala` CLI. Talks to a local VMM over its unix socket or a remote one over http(s).
- `dstackup`: host-setup CLI. `install` / `status` / `destroy`. Privileged; manages `dstack-vmm` + `dstack-auth` as systemd units; deploys and bootstraps a single-node KMS-in-CVM. Idempotent install, deterministic cgroup teardown. (Split into focused modules: `cli` / `install` / `destroy` / `state` / `systemd` / `cid`.)
- `dstack-auth`: Rust reimplementation of the single-operator KMS auth webhook (compose-hash + os-image allowlist; fails closed).

Part of #699. The design rationale and roadmap live in that issue.
Architecture
Two binaries, mirroring `kubeadm`/`kubectl`: `dstackup` (host setup, local + privileged) and `dstack` (client, local-or-remote). The dashboard binds localhost by default and is reached via an SSH tunnel, which gives the browser a secure context for the env-encryption crypto without needing a cert.

Security model and the deliberate single-node trade-offs
The single-node flow makes a few scoped relaxations. Each is confined to the single-node KMS-in-CVM config and does not touch the per-app key path or multi-node replication (verified against `kms/`):
- `enforce_self_authorization=false` removes only the KMS's self-attestation gate at bootstrap; per-app quote verification + the compose-hash allowlist stay fully enforced.
- `verify_os_image=false` (KMS download-verify) is compensated by pinning the app's measured OS image (`digest.txt`) into the webhook's `osImages`, and it's fail-closed: a missing pin aborts `install` unless `--allow-unpinned-image` is passed.
- `--expose` is intentionally disabled: the VMM management RPCs are unauthenticated and the TLS+token transport isn't built yet, so it fails with SSH-tunnel guidance rather than opening an unauthenticated control plane.
- TCB-status enforcement is intentionally not added, a deliberate deviation from `auth-simple` (the single-node operator controls/trusts their own host, and real TDX hosts routinely report a non-`UpToDate` TCB); documented in `dstack-auth`.

The auth webhook is fail-closed by construction; the allowlist read-modify-write is atomic (temp + rename + dir fsync) under an `flock`; and `install` runs all CID/port collision checks in a preflight before any side effect, so a clash refuses cleanly instead of half-installing.

Validation
Hardware-validated end-to-end on an Intel TDX host (alongside an unrelated VMM, left undisturbed throughout): `dstackup install` (real `dstack-vmm` built from this repo + the official meta-dstack v0.5.11 image) → KMS-in-CVM bootstraps → `dstack deploy nginx` → HTTP 200 at the default 1 GB with the os-image pin enforced → clean `destroy` to baseline. Fail-closed paths (missing pin, port collision) verified to bail with zero side effects, and CID/host-api coexistence verified against a second running VMM.

`cargo fmt`, `cargo clippy -- -D warnings -D clippy::expect_used -D clippy::unwrap_used`, and the unit tests pass.

Out of scope / follow-ups (tracked in #699)
Remote transport (TLS+token for `--expose`, remote `dstack logs`, `--token`), the gateway tier, env-var encryption for `dstack deploy`, fuller `phala`-CLI alignment, and OS packaging are deliberately not in this PR.

This went through several rounds of adversarial ("Linus-style") review; the findings are addressed across the commit history.
🤖 Generated with Claude Code