
Stop zero-traffic App Engine versions after production deploys #3719

Open
MaxGhenis wants to merge 1 commit into master from stop-old-app-engine-versions

@MaxGhenis

Collaborator

The problem

App Engine flexible environment keeps a VM running 24/7 for every version in SERVING state, even at a 0% traffic split — and our deploy workflow never stops old versions. Every release therefore leaked two 4vCPU/24GB VMs (staging + prod, ~$278/month each).

Impact on the GCP bill (billing account 0160DF-370818-B14FEA):

Month        Bill
April 2026   £2,030
May 2026     £6,561
June 2026    £9,470

As of tonight the service had 42 SERVING versions (41 with zero traffic), the oldest from April 22 — a ~$10.6k/month run rate on this repo alone. policyengine-household-api had the same pattern (17 zombies, two dating to August 2025).

The fix

Adds a stop-old-app-engine-versions job that runs after promote-production:

  • Stops SERVING versions beyond the newest 2 per prefix (prod-*, staging-*) — configurable via KEEP_PER_PREFIX.
  • Never touches a version currently receiving traffic, regardless of age.
  • Stopped versions remain deployed and can be restarted with one command (gcloud app versions start), so rollback via the console keeps working.
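
The selection rule in the list above can be sketched as a pure function. This is an illustration, not the job's actual implementation: the `Version` fields and the `versions_to_stop` name are hypothetical, and it assumes a version holding traffic still counts toward the newest-N window for its prefix.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Version:
    """Minimal view of an App Engine version (field names are illustrative)."""
    id: str
    created: datetime
    serving: bool   # True when the version is in SERVING state
    traffic: float  # share of traffic, 0.0 to 1.0


def versions_to_stop(versions, prefixes=("prod-", "staging-"), keep_per_prefix=2):
    """Return ids of SERVING versions outside the newest N per prefix."""
    by_prefix = defaultdict(list)
    for v in versions:
        if not v.serving:
            continue  # already stopped, nothing to do
        for prefix in prefixes:
            if v.id.startswith(prefix):
                by_prefix[prefix].append(v)
                break  # ids matching no known prefix are never grouped

    to_stop = []
    for group in by_prefix.values():
        newest_first = sorted(group, key=lambda v: v.created, reverse=True)
        for v in newest_first[keep_per_prefix:]:
            if v.traffic > 0:
                continue  # never stop a version that still receives traffic
            to_stop.append(v.id)
    return sorted(to_stop)
```

Versions whose id matches none of the prefixes are left alone in this sketch, which errs on the side of stopping too little.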

Steady state becomes ≤4 idle-capable versions (~$1.1k/month ceiling) instead of unbounded growth.
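
The cost figures in this description can be sanity-checked with simple arithmetic. The sketch below assumes one VM per idle version at the quoted ~$278/month; the function names are illustrative. Reading the ~$10.6k run rate as 38 idle versions (the number stopped by hand) is an inference from the figures, not something stated explicitly.

```python
VM_COST_USD_PER_MONTH = 278  # approximate per-VM figure quoted in this description


def monthly_idle_cost(idle_versions, cost_per_vm=VM_COST_USD_PER_MONTH):
    """Monthly cost of idle SERVING versions, assuming one VM per version."""
    return idle_versions * cost_per_vm


def steady_state_ceiling(prefixes=2, keep_per_prefix=2,
                         cost_per_vm=VM_COST_USD_PER_MONTH):
    """Worst case after the fix: every retained version sits idle."""
    return monthly_idle_cost(prefixes * keep_per_prefix, cost_per_vm)


print(steady_state_ceiling())   # 4 versions at $278 each
print(monthly_idle_cost(38))    # 38 idle versions at $278 each
```

Four retained versions come to $1,112/month, in line with the ~$1.1k ceiling, and 38 idle versions come to $10,564/month, in line with the ~$10.6k run rate.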

Manual remediation already done (2026-07-02)

  • Stopped the 38 zero-traffic versions on this service (kept prod-2393 + staging-2393 + prior pair).
  • Stopped 17 zombie versions on policyengine-household-api and set min-instances=0 on two leftover testing-codex-* Cloud Run gateways there.
  • Posted a summary in #infrastructure.
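
For reference, the manual stop and the one-command restart both map onto `gcloud app versions stop` and `gcloud app versions start`. The helper below is a hypothetical sketch that only builds the argument list and runs nothing; the service name `default` in the usage note is an assumption, not taken from this repo.

```python
def gcloud_versions_command(action, service, version_ids, project=None):
    """Build argv for `gcloud app versions stop|start` without executing it."""
    if action not in ("stop", "start"):
        raise ValueError(f"unsupported action: {action!r}")
    if not version_ids:
        raise ValueError("at least one version id is required")
    cmd = ["gcloud", "app", "versions", action, *version_ids,
           f"--service={service}", "--quiet"]  # --quiet skips the confirmation prompt
    if project:
        cmd.append(f"--project={project}")
    return cmd
```

For example, `gcloud_versions_command("start", "default", ["prod-2393"])` yields a command that could be passed to `subprocess.run(cmd, check=True)` to bring a stopped version back for rollback.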

Follow-ups

  • Port the same cleanup to policyengine-household-api's deploy workflow.
  • Consider deleting (not just stopping) versions beyond a deeper window; App Engine also caps total versions at 210.
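
One possible shape for the deeper-window follow-up is a three-tier plan: keep the newest few serving, stop the next tier, and delete the rest. This is only a sketch of the idea; `plan_retention`, `keep_serving`, and `keep_deployed` are hypothetical names, and the default of 10 for `keep_deployed` is an arbitrary placeholder rather than a recommendation.

```python
def plan_retention(newest_first_ids, traffic_ids, keep_serving=2, keep_deployed=10):
    """Split version ids (newest first, one prefix) into keep / stop / delete."""
    plan = {"keep": [], "stop": [], "delete": []}
    for rank, version_id in enumerate(newest_first_ids):
        if version_id in traffic_ids or rank < keep_serving:
            plan["keep"].append(version_id)    # holds traffic, or inside the serving window
        elif rank < keep_deployed:
            plan["stop"].append(version_id)    # stays deployed, restartable for rollback
        else:
            plan["delete"].append(version_id)  # beyond the deeper window
    return plan
```

Deleted versions cannot be restarted, so the stop tier is what preserves the rollback path described above.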

🤖 Generated with Claude Code

Flexible-environment versions keep their VMs running 24/7 while in
SERVING state, even at 0% traffic. Each release leaked a staging and a
prod version (4vCPU/24GB, ~$278/month each); 41 had accumulated by
July 2026 (~$11k/month). Adds a post-promote job that stops SERVING
versions beyond the newest two per prefix, never touching versions
that hold traffic.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@codecov

codecov Bot commented Jul 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.72%. Comparing base (7023b64) to head (6f15d7b).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3719   +/-   ##
=======================================
  Coverage   79.72%   79.72%           
=======================================
  Files          70       70           
  Lines        4326     4326           
  Branches      807      807           
=======================================
  Hits         3449     3449           
  Misses        657      657           
  Partials      220      220           
