ci: resilient HF pre-download + offline tests (fix main CI flake)#75

Merged
cdeust merged 1 commit into main from fix/ci-hf-offline-predownload
Jul 1, 2026

@cdeust cdeust commented Jul 1, 2026

Owner

Problem

main CI failed on run 28495801728: 58 tests failed on the Python 3.10 matrix leg only, all with:

OSError: We couldn't connect to 'https://huggingface.co' ... and couldn't find them in the cached files.

The other legs (3.11/3.12/3.13) passed. This is a transient-network / cache-cold flake, not a code defect: the model (all-MiniLM-L6-v2) is fine.

Root cause

The Pre-download embedding model step had no retry and was set to continue-on-error: true. When huggingface.co blipped on the 3.10 runner, the step was masked as ✓ but the HF cache stayed empty. The test suite, which is not offline-aware, then tried to re-fetch the model at runtime and cascaded into 58 misleading failures.

Fix (all three pre-download sites: test, test-sqlite, test-windows)

  • Retry the download 5× with linear backoff → a transient blip self-heals.
  • Drop continue-on-error → a genuine persistent failure surfaces clearly at the download step, not as a confusing test cascade.
  • HF_HUB_OFFLINE / TRANSFORMERS_OFFLINE on the test-run steps → the model is already cached, so tests never touch the network mid-suite (deterministic, flake-free).
  • Windows pre-download runs under shell: bash for the retry loop.
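
The retry-with-linear-backoff idea from the list above can be sketched in Python as follows. This is only an illustration of the technique, not the workflow's actual bash loop: the function name `retry_linear`, the `step_seconds` default, and the choice to retry on `OSError` (the exception type in the failure message) are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_linear(
    fn: Callable[[], T],
    attempts: int = 5,
    step_seconds: float = 10.0,  # assumed delay unit; the PR does not state one
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn up to `attempts` times, waiting step_seconds * n after the n-th failure."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except OSError:
            if attempt == attempts:
                # Persistent failure: re-raise so it surfaces at the download
                # step instead of being masked.
                raise
            # Linear backoff: the wait grows by a fixed step each time.
            sleep(step_seconds * attempt)
    raise AssertionError("unreachable")
```

With the defaults, a download that fails twice and then succeeds would wait 10 s and then 20 s before returning. A download that fails all five times raises the last `OSError`, which mirrors dropping continue-on-error.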

Verification

  • YAML validated (yaml.safe_load).
  • CI on this PR is the proof — all matrix legs must go green offline-from-cache.
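
The offline-from-cache condition can be approximated outside CI with a small helper like the one below. It is a hedged sketch: `offline_test_env` is a hypothetical name, and the value `"1"` for both variables is an assumption, since the PR names the variables but not their values.

```python
import os
from typing import Dict, Mapping, Optional


def offline_test_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the Hugging Face offline switches set."""
    env = dict(os.environ if base is None else base)
    # Assumed values: "1" enables offline mode, so model loads are expected
    # to be served from the local cache rather than from huggingface.co.
    env["HF_HUB_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"
    return env
```

One way to use it is `subprocess.run([sys.executable, "-m", "pytest"], env=offline_test_env(), check=True)` after the model has been downloaded once. If the cache is cold, the tests should then fail immediately instead of trying the network.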

🤖 Generated with Claude Code

Root cause of the main-branch CI failure (run 28495801728, Python 3.10,
2026-07-01): the "Pre-download embedding model" step had no retry and
`continue-on-error: true`, so a transient huggingface.co blip left the HF
cache empty on that one matrix leg. The offline-unaware test suite then
re-fetched the model at runtime and cascaded into 58 spurious
"couldn't connect to huggingface.co" failures while the model itself was fine
(3.11/3.12/3.13 legs, which got the cache, all passed).

Fix, applied to all three pre-download sites (test, test-sqlite, test-windows):
- Retry the download 5× with linear backoff so a transient blip self-heals.
- Drop `continue-on-error` so a genuine persistent failure surfaces at the
  download step instead of cascading into a misleading test failure.
- Set HF_HUB_OFFLINE / TRANSFORMERS_OFFLINE on the test-run steps: the model
  is already cached by the step above, so tests never touch the network
  mid-suite — deterministic and flake-free.
- Windows pre-download runs under `shell: bash` for the retry loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UwnQrVh2tnNMWJabhAQgaN
@cdeust cdeust merged commit f00edd1 into main Jul 1, 2026
13 checks passed
@cdeust cdeust deleted the fix/ci-hf-offline-predownload branch July 1, 2026 06:41