Incrementally resume long-PDF ingestion using cached PageIndex doc_id #43

Closed
plasma16 wants to merge 1 commit into VectifyAI:main from plasma16:feat/long-pdf-resume

Conversation

plasma16 commented May 7, 2026

Summary

  • add long-PDF ingest checkpoint state in .openkb/long_pdf_jobs.json
  • cache doc_id and description after successful PageIndex indexing
  • on re-run, reuse cached doc_id for long PDFs and retry compilation directly
  • persist index/compile failure state for troubleshooting and incremental retry
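A minimal sketch of the checkpoint flow summarized above, for illustration only and not the code in this PR. Only the `.openkb/long_pdf_jobs.json` path and the cached `doc_id` and `description` fields come from the summary. The JSON layout, the `status` values, and the names `load_jobs`, `save_jobs`, `ingest_long_pdf`, `index_fn`, and `compile_fn` are hypothetical stand-ins for the real indexing and compilation steps.

```python
import json
from pathlib import Path

# Path taken from the PR summary; the schema stored in it is an assumption.
JOBS_FILE = Path(".openkb") / "long_pdf_jobs.json"


def load_jobs(path=JOBS_FILE):
    """Return the saved job state, or an empty dict if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_jobs(jobs, path=JOBS_FILE):
    """Write to a temp file, then swap it in, so a crash cannot truncate the state."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(jobs, indent=2), encoding="utf-8")
    tmp.replace(path)


def ingest_long_pdf(pdf_key, index_fn, compile_fn, path=JOBS_FILE):
    """Index once, cache doc_id, and retry only compilation on later runs.

    index_fn(pdf_key) -> (doc_id, description) and
    compile_fn(doc_id, description) are hypothetical callables.
    """
    jobs = load_jobs(path)
    job = jobs.get(pdf_key, {})

    if not job.get("doc_id"):
        # No cached doc_id yet: run the expensive indexing step.
        try:
            doc_id, description = index_fn(pdf_key)
        except Exception as exc:
            jobs[pdf_key] = {"status": "index_failed", "error": str(exc)}
            save_jobs(jobs, path)
            raise
        job = {"doc_id": doc_id, "description": description, "status": "indexed"}
        jobs[pdf_key] = job
        save_jobs(jobs, path)  # checkpoint before compiling

    try:
        compile_fn(job["doc_id"], job["description"])
    except Exception as exc:
        # Keep doc_id so the next run skips indexing and retries compilation.
        job["status"] = "compile_failed"
        job["error"] = str(exc)
        jobs[pdf_key] = job
        save_jobs(jobs, path)
        raise

    job["status"] = "done"
    job.pop("error", None)
    jobs[pdf_key] = job
    save_jobs(jobs, path)
    return job
```

In this sketch the checkpoint is written immediately after indexing succeeds, so a compile failure leaves `doc_id` on disk and a second call goes straight to `compile_fn`.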

Why

When long PDF ingestion fails after indexing, re-running currently re-indexes the same document. This change makes retries incremental for long PDFs while leaving existing skip behavior unchanged for other file types.

Scope

  • only the long-document (long_pdf) ingestion path is affected
  • no queue/cursor behavior for non-PDF files

@KylinMountain

Collaborator

Thanks @plasma16. In the default local mode, PageIndex already content-hashes each file and reuses the cached doc_id without re-running the expensive parse/index pass, so local re-runs do not re-index from scratch; the only remaining gap is cloud-mode retries. The patch also no longer applies after #142 rewrote the function it targets. Closing for now — appreciate the dig into the resume path.
