fix(worker): re-index repos whose zoekt shards are missing from disk by rudraasakariya · Pull Request #1304 · sourcebot-dev/sourcebot

rudraasakariya · 2026-06-12T14:11:17Z

Problem

If the zoekt index directory is deleted while the database and repo clones remain intact (e.g., .sourcebot/index placed on ephemeral/node-local storage in a Kubernetes deployment, lost on pod replacement), Sourcebot starts normally but repos stay marked as indexed in the DB. No shard files exist, so search silently returns empty/incomplete results, and nothing schedules a rebuild until reindexIntervalMs elapses.

Fix

Adds a reconciliation step, markReposWithMissingShardsAsStale, to RepoIndexManager. It runs on scheduler startup and on every scheduler poll (so an index directory lost mid-run is also caught), and:

Queries repos with indexedAt != null from the DB.
Scans INDEX_CACHE_DIR for completed .zoekt shards and collects which repo ids have at least one.
Clears indexedAt on repos with no shards, so the existing scheduler re-indexes them through its normal path (keeping the dedup/backoff guards from fix(backend): prevent duplicate index jobs from indexedAt race #1298 intact).

Design notes:

The DB is queried before the disk scan, so a repo whose first index completes between the two reads cannot be falsely marked stale.
Empty repos are excluded (indexedCommitHash != null): they complete indexing without producing a shard and would otherwise be re-indexed forever.
Unconnected repos are excluded: clearing indexedAt on them would bypass the GC grace period in scheduleCleanupJobs.
.tmp files from in-flight/failed indexing don't count as shards.
No jobs are created directly; the change is purely additive and reuses the existing scheduling machinery.

Test plan

6 new unit tests in repoIndexManager.test.ts (written first, TDD): missing shards on startup, all-present no-op, index dir missing entirely, tmp-file handling, query guards, and poll-time detection.
Full backend suite passes (160/160), tsc clean.
Live verification against a local instance: deleted .sourcebot/index/ while the worker was running with an indexed repo. One poll later the worker logged Found 1 repo(s) marked as indexed but with no index shards on disk. Marking as stale for re-indexing: ... and re-indexed the repo; the shard was restored. The warning fired exactly once (no re-index loop), and a normal restart with intact shards triggers nothing.

Summary by CodeRabbit

Bug Fixes
- Repositories are now automatically re-indexed when their associated index shards are missing or removed from disk, even if the database still marks them as previously indexed. This ensures the search index remains accurate and synchronized with the actual indexed data.

When the index directory is lost (e.g., it lives on ephemeral storage in a k8s deployment) while the database still marks repos as indexed, search silently returns empty results and nothing triggers a rebuild until the reindex interval elapses. Add a reconciliation step that runs on scheduler startup and on every scheduler poll: repos marked as indexed in the DB that have no index shards on disk get their indexedAt cleared, so the existing scheduler re-indexes them with its usual dedup and backoff guards.

coderabbitai · 2026-06-12T14:11:33Z

Walkthrough

RepoIndexManager detects repos marked as indexed in the database but missing corresponding shard files on disk, clears their indexedAt status on startup and during scheduler polling, and allows the scheduler to re-index them. The changelog documents the fix and test infrastructure validates the behavior.

Changes

Missing Shard Reconciliation

Layer / File(s)	Summary
Reconciliation method: detect and mark missing shards `packages/backend/src/repoIndexManager.ts`, `CHANGELOG.md`	`markReposWithMissingShardsAsStale()` queries indexed repos with connections, scans `INDEX_CACHE_DIR` for shard files, identifies repos recorded as indexed in the DB but missing on-disk shards, and clears `indexedAt` to trigger scheduler re-indexing. Changelog documents the fix.
Scheduler integration: reconciliation on startup and each tick `packages/backend/src/repoIndexManager.ts`	`startScheduler()` calls the reconciliation method immediately after orphaned resource cleanup and then on every scheduler poll interval, ensuring continuous detection of missing shards.
Test mocks and reconciliation test suite `packages/backend/src/repoIndexManager.test.ts`	Test mocks extend constants with `REPOS_CACHE_DIR`, augment `getRepoIdFromShardFileName` to parse shard filenames, and add `repo.updateMany` to the Prisma client mock. A new test suite validates startup/periodic reconciliation, index directory absence handling, temporary shard filtering, repo eligibility filtering, and repeated reconciliation across scheduler polls.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

sourcebot-dev/sourcebot#973: Both PRs update RepoIndexManager.startScheduler() to reconcile INDEX_CACHE_DIR/shard files with DB state, though one deletes orphaned disk resources while the other clears indexedAt to trigger re-indexing when shards are missing.
sourcebot-dev/sourcebot#860: Both PRs modify packages/backend/src/repoIndexManager.ts scheduler/startup polling behavior, with the main PR adding missing-shard reconciliation and the retrieved PR refactoring scheduler workers.
sourcebot-dev/sourcebot#305: Both PRs update the semantics of Repo.indexedAt during indexing—main PR clears it when disk shards are missing to trigger re-indexing, while the other PR avoids setting it when indexing fails.

Suggested reviewers

msukkari

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main change: fixing re-indexing of repos whose zoekt shards are missing from disk.
Linked Issues check	✅ Passed	The PR fully implements the objectives from issue `#1210`: detecting missing shard files on startup/polls, clearing indexedAt to trigger re-indexing, excluding empty/unconnected repos, and handling temp files.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to addressing the missing shard reconciliation: test coverage, implementation logic, and changelog documentation align with the stated objectives.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/backend/src/repoIndexManager.ts`:
- Around line 718-720: Replace the two-step existsSync + readdir pattern so the
directory scan tolerates a race: call readdir(INDEX_CACHE_DIR) directly inside a
try/catch around the call used to populate entries (the block that constructs
repoIdsWithShards), and if the catch error.code === 'ENOENT' treat it as an
empty directory (i.e. leave repoIdsWithShards empty) instead of aborting the
poll; for other errors rethrow or log as before. Ensure you update the logic
that uses entries to handle the empty-case correctly.
- Around line 702-744: staleRepos were selected from a snapshot but the
subsequent updateMany only filters by id, which can clear indexedAt incorrectly
if a repo lost its connections or its indexedAt changed; instead, re-apply the
preconditions at write time by updating only rows that still match both the
original indexedAt value and still have connections: for example iterate
staleRepos and for each call this.db.repo.updateMany (or a single updateMany
with a where: { OR: [...] }) where each clause is { id: repo.id, indexedAt:
repo.indexedAt, connections: { some: {} } } so you only set indexedAt: null when
the repo still has the same indexedAt and at least one connection.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8047b0d6-9b12-4aaf-a6bd-3c3de86c7d38

📥 Commits

Reviewing files that changed from the base of the PR and between 90a5afe and 5e1624e.

📒 Files selected for processing (3)

CHANGELOG.md
packages/backend/src/repoIndexManager.test.ts
packages/backend/src/repoIndexManager.ts

coderabbitai · 2026-06-12T14:18:17Z

+        const indexedRepos = await this.db.repo.findMany({
+            where: {
+                indexedAt: { not: null },
+                indexedCommitHash: { not: null },
+                connections: { some: {} },
+            },
+            select: {
+                id: true,
+                name: true,
+            },
+        });
+
+        if (indexedRepos.length === 0) {
+            return;
+        }
+
+        const repoIdsWithShards = new Set<number>();
+        if (existsSync(INDEX_CACHE_DIR)) {
+            const entries = await readdir(INDEX_CACHE_DIR);
+            for (const entry of entries) {
+                // Ignore temporary files (e.g., `.tmp` files from in-flight or
+                // failed indexing operations) - only completed shards count.
+                if (!entry.endsWith('.zoekt')) {
+                    continue;
+                }
+                const repoId = getRepoIdFromShardFileName(entry);
+                if (repoId !== undefined) {
+                    repoIdsWithShards.add(repoId);
+                }
+            }
+        }
+
+        const staleRepos = indexedRepos.filter(repo => !repoIdsWithShards.has(repo.id));
+        if (staleRepos.length === 0) {
+            return;
+        }
+
+        logger.warn(`Found ${staleRepos.length} repo(s) marked as indexed but with no index shards on disk. Marking as stale for re-indexing: ${staleRepos.map(repo => repo.name).join(', ')}`);
+
+        await this.db.repo.updateMany({
+            where: { id: { in: staleRepos.map(repo => repo.id) } },
+            data: { indexedAt: null },
+        });


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Re-apply the stale-marking preconditions at write time.

staleRepos is derived from a snapshot, but Line 741 later clears indexedAt by id only. If a repo loses its last connection or finishes a reindex after the findMany() but before updateMany(), this write can either bypass the GC grace period or clobber a freshly-written indexedAt, which then triggers an unnecessary extra reindex. The update should re-check the repo is still connected and still has the same indexedAt value that was observed during selection.

Suggested fix

const indexedRepos = await this.db.repo.findMany({ where: { indexedAt: { not: null }, indexedCommitHash: { not: null }, connections: { some: {} }, }, select: { id: true, name: true, + indexedAt: true, }, }); @@ - await this.db.repo.updateMany({ - where: { id: { in: staleRepos.map(repo => repo.id) } }, - data: { indexedAt: null }, - }); + await this.db.$transaction( + staleRepos.map((repo) => + this.db.repo.updateMany({ + where: { + id: repo.id, + indexedAt: repo.indexedAt, + indexedCommitHash: { not: null }, + connections: { some: {} }, + }, + data: { indexedAt: null }, + }) + ) + );

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const indexedRepos = await this.db.repo.findMany({

where: {

indexedAt: { not: null },

indexedCommitHash: { not: null },

connections: { some: {} },

},

select: {

id: true,

name: true,

},

});

if (indexedRepos.length === 0) {

return;

}

const repoIdsWithShards = new Set<number>();

if (existsSync(INDEX_CACHE_DIR)) {

const entries = await readdir(INDEX_CACHE_DIR);

for (const entry of entries) {

// Ignore temporary files (e.g., `.tmp` files from in-flight or

// failed indexing operations) - only completed shards count.

if (!entry.endsWith('.zoekt')) {

continue;

}

const repoId = getRepoIdFromShardFileName(entry);

if (repoId !== undefined) {

repoIdsWithShards.add(repoId);

}

}

}

const staleRepos = indexedRepos.filter(repo => !repoIdsWithShards.has(repo.id));

if (staleRepos.length === 0) {

return;

}

logger.warn(`Found ${staleRepos.length} repo(s) marked as indexed but with no index shards on disk. Marking as stale for re-indexing: ${staleRepos.map(repo => repo.name).join(', ')}`);

await this.db.repo.updateMany({

where: { id: { in: staleRepos.map(repo => repo.id) } },

data: { indexedAt: null },

});

const indexedRepos = await this.db.repo.findMany({

where: {

indexedAt: { not: null },

indexedCommitHash: { not: null },

connections: { some: {} },

},

select: {

id: true,

name: true,

indexedAt: true,

},

});

if (indexedRepos.length === 0) {

return;

}

const repoIdsWithShards = new Set<number>();

if (existsSync(INDEX_CACHE_DIR)) {

const entries = await readdir(INDEX_CACHE_DIR);

for (const entry of entries) {

// Ignore temporary files (e.g., `.tmp` files from in-flight or

// failed indexing operations) - only completed shards count.

if (!entry.endsWith('.zoekt')) {

continue;

}

const repoId = getRepoIdFromShardFileName(entry);

if (repoId !== undefined) {

repoIdsWithShards.add(repoId);

}

}

}

const staleRepos = indexedRepos.filter(repo => !repoIdsWithShards.has(repo.id));

if (staleRepos.length === 0) {

return;

}

logger.warn(`Found ${staleRepos.length} repo(s) marked as indexed but with no index shards on disk. Marking as stale for re-indexing: ${staleRepos.map(repo => repo.name).join(', ')}`);

await this.db.$transaction(

staleRepos.map((repo) =>

this.db.repo.updateMany({

where: {

id: repo.id,

indexedAt: repo.indexedAt,

indexedCommitHash: { not: null },

connections: { some: {} },

},

data: { indexedAt: null },

})

)

);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/backend/src/repoIndexManager.ts` around lines 702 - 744, staleRepos were selected from a snapshot but the subsequent updateMany only filters by id, which can clear indexedAt incorrectly if a repo lost its connections or its indexedAt changed; instead, re-apply the preconditions at write time by updating only rows that still match both the original indexedAt value and still have connections: for example iterate staleRepos and for each call this.db.repo.updateMany (or a single updateMany with a where: { OR: [...] }) where each clause is { id: repo.id, indexedAt: repo.indexedAt, connections: { some: {} } } so you only set indexedAt: null when the repo still has the same indexedAt and at least one connection.

coderabbitai · 2026-06-12T14:18:17Z

+        const repoIdsWithShards = new Set<number>();
+        if (existsSync(INDEX_CACHE_DIR)) {
+            const entries = await readdir(INDEX_CACHE_DIR);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle ENOENT from the directory scan directly.

If INDEX_CACHE_DIR disappears between Line 719 and Line 720, readdir() throws and this poll aborts instead of recovering by marking repos stale. That is the exact failure mode this reconciliation is meant to survive. Prefer calling readdir() directly and treating ENOENT as “no shards present”.

Suggested fix

- const repoIdsWithShards = new Set<number>(); - if (existsSync(INDEX_CACHE_DIR)) { - const entries = await readdir(INDEX_CACHE_DIR); - for (const entry of entries) { - // Ignore temporary files (e.g., `.tmp` files from in-flight or - // failed indexing operations) - only completed shards count. - if (!entry.endsWith('.zoekt')) { - continue; - } - const repoId = getRepoIdFromShardFileName(entry); - if (repoId !== undefined) { - repoIdsWithShards.add(repoId); - } - } - } + const repoIdsWithShards = new Set<number>(); + let entries: string[] = []; + try { + entries = await readdir(INDEX_CACHE_DIR); + } catch (error) { + if ((error as NodeJS.ErrnoException).code !== 'ENOENT') { + throw error; + } + } + + for (const entry of entries) { + // Ignore temporary files (e.g., `.tmp` files from in-flight or + // failed indexing operations) - only completed shards count. + if (!entry.endsWith('.zoekt')) { + continue; + } + const repoId = getRepoIdFromShardFileName(entry); + if (repoId !== undefined) { + repoIdsWithShards.add(repoId); + } + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const repoIdsWithShards = new Set<number>();

if (existsSync(INDEX_CACHE_DIR)) {

const entries = await readdir(INDEX_CACHE_DIR);

const repoIdsWithShards = new Set<number>();

let entries: string[] = [];

try {

entries = await readdir(INDEX_CACHE_DIR);

} catch (error) {

if ((error as NodeJS.ErrnoException).code !== 'ENOENT') {

throw error;

}

}

for (const entry of entries) {

// Ignore temporary files (e.g., `.tmp` files from in-flight or

// failed indexing operations) - only completed shards count.

if (!entry.endsWith('.zoekt')) {

continue;

}

const repoId = getRepoIdFromShardFileName(entry);

if (repoId !== undefined) {

repoIdsWithShards.add(repoId);

}

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/backend/src/repoIndexManager.ts` around lines 718 - 720, Replace the two-step existsSync + readdir pattern so the directory scan tolerates a race: call readdir(INDEX_CACHE_DIR) directly inside a try/catch around the call used to populate entries (the block that constructs repoIdsWithShards), and if the catch error.code === 'ENOENT' treat it as an empty directory (i.e. leave repoIdsWithShards empty) instead of aborting the poll; for other errors rethrow or log as before. Ensure you update the logic that uses entries to handle the empty-case correctly.

chore: add CHANGELOG entry for sourcebot-dev#1304

5e1624e

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(worker): re-index repos whose zoekt shards are missing from disk#1304

fix(worker): re-index repos whose zoekt shards are missing from disk#1304
rudraasakariya wants to merge 2 commits into
sourcebot-dev:mainfrom
rudraasakariya:rudraasakariya/fix-missing-shards-stale-detection

rudraasakariya commented Jun 12, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Uh oh!

coderabbitai Bot Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rudraasakariya commented Jun 12, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rudraasakariya commented Jun 12, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading