Incrementally resume long-PDF ingestion using cached PageIndex doc_id by plasma16 · Pull Request #43 · VectifyAI/OpenKB

plasma16 · 2026-05-07T07:31:03Z

Summary

add long-PDF ingest checkpoint state in .openkb/long_pdf_jobs.json
cache doc_id and description after successful PageIndex indexing
on re-run, reuse cached doc_id for long PDFs and retry compilation directly
persist index/compile failure state for troubleshooting and incremental retry
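
The checkpoint flow listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the JSON schema, the status strings, and the `index_fn` / `compile_fn` callables are all assumptions standing in for the real indexing and compilation steps. Only the file path `.openkb/long_pdf_jobs.json` comes from the description.

```python
import json
from pathlib import Path

# Path named in the PR description; the schema stored inside it is assumed.
STATE_FILE = Path(".openkb") / "long_pdf_jobs.json"


def load_jobs(state_file=STATE_FILE):
    """Return the saved job map, or an empty one if no checkpoint exists yet."""
    if not state_file.exists():
        return {}
    return json.loads(state_file.read_text(encoding="utf-8"))


def save_jobs(jobs, state_file=STATE_FILE):
    """Write the job map through a temp file so a crash cannot truncate it."""
    state_file.parent.mkdir(parents=True, exist_ok=True)
    tmp = state_file.with_name(state_file.name + ".tmp")
    tmp.write_text(json.dumps(jobs, indent=2), encoding="utf-8")
    tmp.replace(state_file)


def ingest_long_pdf(pdf_key, index_fn, compile_fn, state_file=STATE_FILE):
    """Index once, cache doc_id, and retry only compilation on later runs.

    Hypothetical callables:
      index_fn(pdf_key) -> (doc_id, description)
      compile_fn(doc_id, description) -> None
    """
    jobs = load_jobs(state_file)
    job = jobs.get(pdf_key, {})

    if "doc_id" not in job:
        # No cached doc_id: run the expensive indexing step.
        try:
            doc_id, description = index_fn(pdf_key)
        except Exception as exc:
            jobs[pdf_key] = {"status": "index_failed", "error": str(exc)}
            save_jobs(jobs, state_file)
            raise
        job = {"doc_id": doc_id, "description": description, "status": "indexed"}
        jobs[pdf_key] = job
        save_jobs(jobs, state_file)

    # Cached or freshly indexed: go straight to compilation.
    try:
        compile_fn(job["doc_id"], job["description"])
    except Exception as exc:
        job.update(status="compile_failed", error=str(exc))
        jobs[pdf_key] = job
        save_jobs(jobs, state_file)
        raise

    job["status"] = "done"
    job.pop("error", None)
    jobs[pdf_key] = job
    save_jobs(jobs, state_file)
    return job["doc_id"]
```

With this shape, a run that fails during compilation leaves a `compile_failed` entry that still carries `doc_id`, so the next run skips indexing. A run that fails during indexing leaves no `doc_id`, so the next run indexes again.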

Why

When long PDF ingestion fails after indexing, re-running currently re-indexes the same document. This change makes retries incremental for long PDFs while leaving existing skip behavior unchanged for other file types.

Scope

only long-document (long_pdf) ingestion path
no queue/cursor behavior for non-PDF files

KylinMountain · 2026-07-02T07:07:30Z

Thanks @plasma16. In the default local mode, PageIndex already content-hashes each file and reuses the cached doc_id without re-running the expensive parse/index pass, so local re-runs do not re-index from scratch; the only remaining gap is cloud-mode retries. The patch also no longer applies after #142 rewrote the function it targets. Closing for now — appreciate the dig into the resume path.
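
For readers unfamiliar with the pattern, content-hash reuse of this kind can be pictured with the sketch below. It is illustrative only and is not PageIndex's implementation: the choice of SHA-256, the in-memory `cache` dict, and the `index_fn` callable are all assumptions.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file's bytes in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def get_or_index(path, cache, index_fn):
    """Return the cached doc_id for identical content; call index_fn only on a miss.

    `cache` maps content hash -> doc_id; `index_fn(path) -> doc_id` is hypothetical.
    """
    key = file_sha256(path)
    if key not in cache:
        cache[key] = index_fn(path)
    return cache[key]
```

Because the key is derived from the file's bytes rather than its name, renaming or re-adding an unchanged PDF still hits the cache.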

Incrementally resume long-PDF ingestion via cached PageIndex doc_id

06c5954

KylinMountain closed this Jul 2, 2026

plasma16 wants to merge 1 commit into VectifyAI:main from plasma16:feat/long-pdf-resume
