shim: avoid re-downloading runner/shim binaries on forced re-install#3954
Open
peterschmidt85 wants to merge 2 commits into
Open
shim: avoid re-downloading runner/shim binaries on forced re-install#3954peterschmidt85 wants to merge 2 commits into
peterschmidt85 wants to merge 2 commits into
Conversation
The shim re-downloaded the full runner/shim binary on every server-issued install (which is always force=true), with no content check. When an instance's installed version never converges to the version the server expects, the server re-issues the install every check, so the same binaries were re-downloaded indefinitely -- unbounded egress from the downloads bucket. Cache the downloaded bytes next to the target and revalidate them with the object's ETag via a conditional request: an unchanged artifact returns 304 and no body is transferred, even under force. Install from the cache so a finalization (chmod/rename) failure is retried without re-downloading, and propagate those errors instead of masking them. A genuinely changed artifact (different ETag) is still downloaded exactly once; without force an existing file is still left untouched. This bounds the loop to at most one download per distinct artifact regardless of why versions fail to converge. Refs #3953 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
94ff89a to
5bfbf12
Compare
The shim download cache used a single slot per binary, so when two server versions reconcile the same instance (the confirmed real-world trigger), each version-swap overwrote the slot and re-downloaded the full binary every cycle. - downloadFile now keys the cache + ETag by URL, so alternating versions each persist and are revalidated independently -> each version is downloaded once, then 304s -> egress ~0 even while the binary keeps swapping. Bounded by maxCachedVersions (oldest pruned). - bootstrap curl: add -f and chain `mv`, so a transient 403/5xx is never saved as the shim binary (which otherwise runs as a shell script -> "Syntax error: newline unexpected" crash loop) -- fail cleanly instead. - add a flip-flop test (each of two alternating versions downloaded exactly once). Refs #3953 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5bfbf12 to
59f28e8
Compare
r4victor
reviewed
Jun 11, 2026
|
|
||
| // Install the cached bytes; skip if path already matches the cache. | ||
| if !downloaded { | ||
| installed, err := sameSize(path, cachePath) |
Collaborator
There was a problem hiding this comment.
Same-size adjacent Go builds are unlikely but not impossible.
Suggested fix: after a successful rename in installFile, write the cache's urlKey to a <path>.installed marker and replace sameSize with a marker comparison.
r4victor
reviewed
Jun 11, 2026
|
|
||
| // downloadFile ensures path holds the artifact at url. | ||
| // | ||
| // Bytes are cached next to path (one cache per URL, validated by ETag), so a repeated |
Collaborator
There was a problem hiding this comment.
I don't quite like putting these helper-files in /usr/local/bin. How about using a shimHomeDir subdir?
r4victor
reviewed
Jun 11, 2026
| case "/0.20.23/runner": | ||
| body, etag = "RUNNER-0.20.23", `"e23"` | ||
| case "/0.20.18/runner": | ||
| body, etag = "RUNNER-0.20.18-x", `"e18"` // intentionally different length |
Collaborator
There was a problem hiding this comment.
See tests only work with different lengths!
r4victor
reviewed
Jun 11, 2026
| # 403/5xx is never saved as the shim binary; chain `mv` so a failed download | ||
| # is never installed (otherwise it would run as a script -> "Syntax error"). | ||
| f'sudo curl -fsS --compressed --connect-timeout 60 --max-time 240 --retry 1 --output "$dlpath" "{url}"' | ||
| f' && sudo mv "$dlpath" {dstack_shim_binary_path}', |
Collaborator
There was a problem hiding this comment.
Not sure if && is needed since commands are chained with &&. Adding -f should be sufficient
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bounds the wasted egress from the runner/shim re-install loop (#3953): the shim re-downloaded the full binaries on every server-issued install (
force=true), and a transient HTTP error during bootstrap could brick the instance. Two changes, both in the download path.downloadFile— cache + revalidate instead of always re-downloadingrunner/internal/shim/components/utils.go. Installingurlintopathnow:force=falseandpathexists → return (unchanged).<path>.<sha256(url)[:16]>.cache(+ a.etagsidecar). Keying by URL means different versions get separate caches instead of overwriting one slot.ensureCachedGETsurlwithIf-None-Match: <stored etag>:304 Not Modified→ cache is current, no body transferred;200→ stream into the cache (temp file + atomic rename) and store the responseETag.path: copy →chmod→ atomic rename. Skipped ifpathalready matches the cache (same size). Ifchmod/rename fails, the next call retries from the cache — no re-download.pruneCacheskeeps the newestmaxCachedVersions(5) and deletes older ones, so disk stays bounded.Net: a repeated/forced install of an unchanged version is a
304with no transfer; two servers alternating between versions download each version once, then304.Bootstrap
curl— fail instead of saving an error pagesrc/dstack/_internal/core/backends/base/compute.py. The shim-bootstrapcurlgains-fand chainsmv, so a transient403/5xxis no longer written todstack-shimand then executed (/usr/local/bin/dstack-shim: 2: Syntax error: newline unexpected); it fails cleanly.Tests
runner/internal/shim/components/utils_test.go(pass under-race):force→ body served once;ETag→ re-downloaded and updated;force=false+ existing file → no request;Validated end-to-end (AWS)
Two servers on one DB + backend, one real
eu-west-1instance, custom-versioned builds on the staging bucket, measured via S3 request metrics:0.20.30 ↔ 0.20.31): 22 conditional GETs over 25 min of continuous flip-flop → 1 real transfer (~17 MB warmup), the rest304. Steady-state egress ≈ 0 (unfixed: ~430 MB and growing).0.20.29shim ↔ fixed0.20.31): ~454 MB in 11 min, climbing — see scope below.Scope — what this does NOT fix (tracked in #3953)
This PR bounds the egress, not the loop:
Fully stopping the loop needs a server-side guard (no-downgrade on comparable versions + an attempt cap/backoff). That's deferred: dstack's version space (real semver vs CI build-numbers like
27305vs dev/None) has no reliable ordering yet, so the server can't safely decide which version wins. See #3953.🤖 Generated with Claude Code