Skip to content

fix(worker): re-index repos whose zoekt shards are missing from disk#1304

Open
rudraasakariya wants to merge 2 commits into
sourcebot-dev:mainfrom
rudraasakariya:rudraasakariya/fix-missing-shards-stale-detection
Open

fix(worker): re-index repos whose zoekt shards are missing from disk#1304
rudraasakariya wants to merge 2 commits into
sourcebot-dev:mainfrom
rudraasakariya:rudraasakariya/fix-missing-shards-stale-detection

Conversation

@rudraasakariya

@rudraasakariya rudraasakariya commented Jun 12, 2026

Copy link
Copy Markdown

Fixes #1210

Problem

If the zoekt index directory is deleted while the database and repo clones remain intact (e.g., .sourcebot/index placed on ephemeral/node-local storage in a Kubernetes deployment, lost on pod replacement), Sourcebot starts normally but repos stay marked as indexed in the DB. No shard files exist, so search silently returns empty/incomplete results, and nothing schedules a rebuild until reindexIntervalMs elapses.

Fix

Adds a reconciliation step, markReposWithMissingShardsAsStale, to RepoIndexManager. It runs on scheduler startup and on every scheduler poll (so an index directory lost mid-run is also caught), and:

  1. Queries repos with indexedAt != null from the DB.
  2. Scans INDEX_CACHE_DIR for completed .zoekt shards and collects which repo ids have at least one.
  3. Clears indexedAt on repos with no shards, so the existing scheduler re-indexes them through its normal path (keeping the dedup/backoff guards from fix(backend): prevent duplicate index jobs from indexedAt race #1298 intact).

Design notes:

  • The DB is queried before the disk scan, so a repo whose first index completes between the two reads cannot be falsely marked stale.
  • Empty repos are excluded (indexedCommitHash != null): they complete indexing without producing a shard and would otherwise be re-indexed forever.
  • Unconnected repos are excluded: clearing indexedAt on them would bypass the GC grace period in scheduleCleanupJobs.
  • .tmp files from in-flight/failed indexing don't count as shards.
  • No jobs are created directly; the change is purely additive and reuses the existing scheduling machinery.

Test plan

  • 6 new unit tests in repoIndexManager.test.ts (written first, TDD): missing shards on startup, all-present no-op, index dir missing entirely, tmp-file handling, query guards, and poll-time detection.
  • Full backend suite passes (160/160), tsc clean.
  • Live verification against a local instance: deleted .sourcebot/index/ while the worker was running with an indexed repo. One poll later the worker logged Found 1 repo(s) marked as indexed but with no index shards on disk. Marking as stale for re-indexing: ... and re-indexed the repo; the shard was restored. The warning fired exactly once (no re-index loop), and a normal restart with intact shards triggers nothing.

Summary by CodeRabbit

  • Bug Fixes
    • Repositories are now automatically re-indexed when their associated index shards are missing or removed from disk, even if the database still marks them as previously indexed. This ensures the search index remains accurate and synchronized with the actual indexed data.

When the index directory is lost (e.g., it lives on ephemeral storage in
a k8s deployment) while the database still marks repos as indexed, search
silently returns empty results and nothing triggers a rebuild until the
reindex interval elapses.

Add a reconciliation step that runs on scheduler startup and on every
scheduler poll: repos marked as indexed in the DB that have no index
shards on disk get their indexedAt cleared, so the existing scheduler
re-indexes them with its usual dedup and backoff guards.
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

RepoIndexManager detects repos marked as indexed in the database but missing corresponding shard files on disk, clears their indexedAt status on startup and during scheduler polling, and allows the scheduler to re-index them. The changelog documents the fix and test infrastructure validates the behavior.

Changes

Missing Shard Reconciliation

Layer / File(s) Summary
Reconciliation method: detect and mark missing shards
packages/backend/src/repoIndexManager.ts, CHANGELOG.md
markReposWithMissingShardsAsStale() queries indexed repos with connections, scans INDEX_CACHE_DIR for shard files, identifies repos recorded as indexed in the DB but missing on-disk shards, and clears indexedAt to trigger scheduler re-indexing. Changelog documents the fix.
Scheduler integration: reconciliation on startup and each tick
packages/backend/src/repoIndexManager.ts
startScheduler() calls the reconciliation method immediately after orphaned resource cleanup and then on every scheduler poll interval, ensuring continuous detection of missing shards.
Test mocks and reconciliation test suite
packages/backend/src/repoIndexManager.test.ts
Test mocks extend constants with REPOS_CACHE_DIR, augment getRepoIdFromShardFileName to parse shard filenames, and add repo.updateMany to the Prisma client mock. A new test suite validates startup/periodic reconciliation, index directory absence handling, temporary shard filtering, repo eligibility filtering, and repeated reconciliation across scheduler polls.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • sourcebot-dev/sourcebot#973: Both PRs update RepoIndexManager.startScheduler() to reconcile INDEX_CACHE_DIR/shard files with DB state, though one deletes orphaned disk resources while the other clears indexedAt to trigger re-indexing when shards are missing.
  • sourcebot-dev/sourcebot#860: Both PRs modify packages/backend/src/repoIndexManager.ts scheduler/startup polling behavior, with the main PR adding missing-shard reconciliation and the retrieved PR refactoring scheduler workers.
  • sourcebot-dev/sourcebot#305: Both PRs update the semantics of Repo.indexedAt during indexing—main PR clears it when disk shards are missing to trigger re-indexing, while the other PR avoids setting it when indexing fails.

Suggested reviewers

  • msukkari
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: fixing re-indexing of repos whose zoekt shards are missing from disk.
Linked Issues check ✅ Passed The PR fully implements the objectives from issue #1210: detecting missing shard files on startup/polls, clearing indexedAt to trigger re-indexing, excluding empty/unconnected repos, and handling temp files.
Out of Scope Changes check ✅ Passed All changes are directly scoped to addressing the missing shard reconciliation: test coverage, implementation logic, and changelog documentation align with the stated objectives.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/backend/src/repoIndexManager.ts`:
- Around line 718-720: Replace the two-step existsSync + readdir pattern so the
directory scan tolerates a race: call readdir(INDEX_CACHE_DIR) directly inside a
try/catch around the call used to populate entries (the block that constructs
repoIdsWithShards), and if the catch error.code === 'ENOENT' treat it as an
empty directory (i.e. leave repoIdsWithShards empty) instead of aborting the
poll; for other errors rethrow or log as before. Ensure you update the logic
that uses entries to handle the empty-case correctly.
- Around line 702-744: staleRepos were selected from a snapshot but the
subsequent updateMany only filters by id, which can clear indexedAt incorrectly
if a repo lost its connections or its indexedAt changed; instead, re-apply the
preconditions at write time by updating only rows that still match both the
original indexedAt value and still have connections: for example iterate
staleRepos and for each call this.db.repo.updateMany (or a single updateMany
with a where: { OR: [...] }) where each clause is { id: repo.id, indexedAt:
repo.indexedAt, connections: { some: {} } } so you only set indexedAt: null when
the repo still has the same indexedAt and at least one connection.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8047b0d6-9b12-4aaf-a6bd-3c3de86c7d38

📥 Commits

Reviewing files that changed from the base of the PR and between 90a5afe and 5e1624e.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • packages/backend/src/repoIndexManager.test.ts
  • packages/backend/src/repoIndexManager.ts

Comment on lines +702 to +744
const indexedRepos = await this.db.repo.findMany({
where: {
indexedAt: { not: null },
indexedCommitHash: { not: null },
connections: { some: {} },
},
select: {
id: true,
name: true,
},
});

if (indexedRepos.length === 0) {
return;
}

const repoIdsWithShards = new Set<number>();
if (existsSync(INDEX_CACHE_DIR)) {
const entries = await readdir(INDEX_CACHE_DIR);
for (const entry of entries) {
// Ignore temporary files (e.g., `.tmp` files from in-flight or
// failed indexing operations) - only completed shards count.
if (!entry.endsWith('.zoekt')) {
continue;
}
const repoId = getRepoIdFromShardFileName(entry);
if (repoId !== undefined) {
repoIdsWithShards.add(repoId);
}
}
}

const staleRepos = indexedRepos.filter(repo => !repoIdsWithShards.has(repo.id));
if (staleRepos.length === 0) {
return;
}

logger.warn(`Found ${staleRepos.length} repo(s) marked as indexed but with no index shards on disk. Marking as stale for re-indexing: ${staleRepos.map(repo => repo.name).join(', ')}`);

await this.db.repo.updateMany({
where: { id: { in: staleRepos.map(repo => repo.id) } },
data: { indexedAt: null },
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Re-apply the stale-marking preconditions at write time.

staleRepos is derived from a snapshot, but Line 741 later clears indexedAt by id only. If a repo loses its last connection or finishes a reindex after the findMany() but before updateMany(), this write can either bypass the GC grace period or clobber a freshly-written indexedAt, which then triggers an unnecessary extra reindex. The update should re-check the repo is still connected and still has the same indexedAt value that was observed during selection.

Suggested fix
         const indexedRepos = await this.db.repo.findMany({
             where: {
                 indexedAt: { not: null },
                 indexedCommitHash: { not: null },
                 connections: { some: {} },
             },
             select: {
                 id: true,
                 name: true,
+                indexedAt: true,
             },
         });
@@
-        await this.db.repo.updateMany({
-            where: { id: { in: staleRepos.map(repo => repo.id) } },
-            data: { indexedAt: null },
-        });
+        await this.db.$transaction(
+            staleRepos.map((repo) =>
+                this.db.repo.updateMany({
+                    where: {
+                        id: repo.id,
+                        indexedAt: repo.indexedAt,
+                        indexedCommitHash: { not: null },
+                        connections: { some: {} },
+                    },
+                    data: { indexedAt: null },
+                })
+            )
+        );
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const indexedRepos = await this.db.repo.findMany({
where: {
indexedAt: { not: null },
indexedCommitHash: { not: null },
connections: { some: {} },
},
select: {
id: true,
name: true,
},
});
if (indexedRepos.length === 0) {
return;
}
const repoIdsWithShards = new Set<number>();
if (existsSync(INDEX_CACHE_DIR)) {
const entries = await readdir(INDEX_CACHE_DIR);
for (const entry of entries) {
// Ignore temporary files (e.g., `.tmp` files from in-flight or
// failed indexing operations) - only completed shards count.
if (!entry.endsWith('.zoekt')) {
continue;
}
const repoId = getRepoIdFromShardFileName(entry);
if (repoId !== undefined) {
repoIdsWithShards.add(repoId);
}
}
}
const staleRepos = indexedRepos.filter(repo => !repoIdsWithShards.has(repo.id));
if (staleRepos.length === 0) {
return;
}
logger.warn(`Found ${staleRepos.length} repo(s) marked as indexed but with no index shards on disk. Marking as stale for re-indexing: ${staleRepos.map(repo => repo.name).join(', ')}`);
await this.db.repo.updateMany({
where: { id: { in: staleRepos.map(repo => repo.id) } },
data: { indexedAt: null },
});
const indexedRepos = await this.db.repo.findMany({
where: {
indexedAt: { not: null },
indexedCommitHash: { not: null },
connections: { some: {} },
},
select: {
id: true,
name: true,
indexedAt: true,
},
});
if (indexedRepos.length === 0) {
return;
}
const repoIdsWithShards = new Set<number>();
if (existsSync(INDEX_CACHE_DIR)) {
const entries = await readdir(INDEX_CACHE_DIR);
for (const entry of entries) {
// Ignore temporary files (e.g., `.tmp` files from in-flight or
// failed indexing operations) - only completed shards count.
if (!entry.endsWith('.zoekt')) {
continue;
}
const repoId = getRepoIdFromShardFileName(entry);
if (repoId !== undefined) {
repoIdsWithShards.add(repoId);
}
}
}
const staleRepos = indexedRepos.filter(repo => !repoIdsWithShards.has(repo.id));
if (staleRepos.length === 0) {
return;
}
logger.warn(`Found ${staleRepos.length} repo(s) marked as indexed but with no index shards on disk. Marking as stale for re-indexing: ${staleRepos.map(repo => repo.name).join(', ')}`);
await this.db.$transaction(
staleRepos.map((repo) =>
this.db.repo.updateMany({
where: {
id: repo.id,
indexedAt: repo.indexedAt,
indexedCommitHash: { not: null },
connections: { some: {} },
},
data: { indexedAt: null },
})
)
);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/backend/src/repoIndexManager.ts` around lines 702 - 744, staleRepos
were selected from a snapshot but the subsequent updateMany only filters by id,
which can clear indexedAt incorrectly if a repo lost its connections or its
indexedAt changed; instead, re-apply the preconditions at write time by updating
only rows that still match both the original indexedAt value and still have
connections: for example iterate staleRepos and for each call
this.db.repo.updateMany (or a single updateMany with a where: { OR: [...] })
where each clause is { id: repo.id, indexedAt: repo.indexedAt, connections: {
some: {} } } so you only set indexedAt: null when the repo still has the same
indexedAt and at least one connection.

Comment on lines +718 to +720
const repoIdsWithShards = new Set<number>();
if (existsSync(INDEX_CACHE_DIR)) {
const entries = await readdir(INDEX_CACHE_DIR);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle ENOENT from the directory scan directly.

If INDEX_CACHE_DIR disappears between Line 719 and Line 720, readdir() throws and this poll aborts instead of recovering by marking repos stale. That is the exact failure mode this reconciliation is meant to survive. Prefer calling readdir() directly and treating ENOENT as “no shards present”.

Suggested fix
-        const repoIdsWithShards = new Set<number>();
-        if (existsSync(INDEX_CACHE_DIR)) {
-            const entries = await readdir(INDEX_CACHE_DIR);
-            for (const entry of entries) {
-                // Ignore temporary files (e.g., `.tmp` files from in-flight or
-                // failed indexing operations) - only completed shards count.
-                if (!entry.endsWith('.zoekt')) {
-                    continue;
-                }
-                const repoId = getRepoIdFromShardFileName(entry);
-                if (repoId !== undefined) {
-                    repoIdsWithShards.add(repoId);
-                }
-            }
-        }
+        const repoIdsWithShards = new Set<number>();
+        let entries: string[] = [];
+        try {
+            entries = await readdir(INDEX_CACHE_DIR);
+        } catch (error) {
+            if ((error as NodeJS.ErrnoException).code !== 'ENOENT') {
+                throw error;
+            }
+        }
+
+        for (const entry of entries) {
+            // Ignore temporary files (e.g., `.tmp` files from in-flight or
+            // failed indexing operations) - only completed shards count.
+            if (!entry.endsWith('.zoekt')) {
+                continue;
+            }
+            const repoId = getRepoIdFromShardFileName(entry);
+            if (repoId !== undefined) {
+                repoIdsWithShards.add(repoId);
+            }
+        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const repoIdsWithShards = new Set<number>();
if (existsSync(INDEX_CACHE_DIR)) {
const entries = await readdir(INDEX_CACHE_DIR);
const repoIdsWithShards = new Set<number>();
let entries: string[] = [];
try {
entries = await readdir(INDEX_CACHE_DIR);
} catch (error) {
if ((error as NodeJS.ErrnoException).code !== 'ENOENT') {
throw error;
}
}
for (const entry of entries) {
// Ignore temporary files (e.g., `.tmp` files from in-flight or
// failed indexing operations) - only completed shards count.
if (!entry.endsWith('.zoekt')) {
continue;
}
const repoId = getRepoIdFromShardFileName(entry);
if (repoId !== undefined) {
repoIdsWithShards.add(repoId);
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/backend/src/repoIndexManager.ts` around lines 718 - 720, Replace the
two-step existsSync + readdir pattern so the directory scan tolerates a race:
call readdir(INDEX_CACHE_DIR) directly inside a try/catch around the call used
to populate entries (the block that constructs repoIdsWithShards), and if the
catch error.code === 'ENOENT' treat it as an empty directory (i.e. leave
repoIdsWithShards empty) instead of aborting the poll; for other errors rethrow
or log as before. Ensure you update the logic that uses entries to handle the
empty-case correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug/rfe] Rebuild or mark repos stale when zoekt shard files are missing but DB marks repos indexed

1 participant