Replace manual JSON parsing with Instructor for structured output reliability by pegahmansourian · Pull Request #339 · VectifyAI/PageIndex

pegahmansourian · 2026-06-26T15:25:23Z

Problem

The codebase uses manual JSON prompt instructions across 11 locations in
page_index.py, with a custom extract_json() parser in utils.py that
handles malformed output through multiple fallback strategies.

When the LLM returns slightly malformed JSON, extract_json() silently
returns {}, causing downstream KeyError crashes like:

json_content['completed']  # KeyError if extract_json returned {}

Solution

Replace manual JSON parsing with Instructor,
which is patched directly onto the existing LiteLLM client already used in the codebase.

Changes

utils.py

Add sync_instructor_client and async_instructor_client patched via instructor.from_litellm()
Add llm_structured() for sync structured calls
Add llm_astructured() for async structured calls
extract_json() and get_json_content() can be removed once all callers are updated

page_index.py

Add Pydantic models for all 11 structured outputs
Replace all extract_json() calls with llm_structured() / llm_astructured()
Remove "Directly return the final JSON structure" from all prompts — Instructor handles this automatically

Benefits

Removes 40 lines of fragile JSON parsing with multiple fallback hacks
Adds automatic retry on malformed output — no more silent {} returns
Adds type safety via Pydantic — Literal["yes", "no"] fields are always valid
No behavior change — same LiteLLM routing, same models, same logic
instructor is the only new dependency

New dependency

This PR adds one new dependency: instructor

Add to requirements.txt:

instructor

Or install directly:

pip install instructor

…iability - Add llm_structured() and llm_astructured() in utils.py using instructor.from_litellm() - Add Pydantic models for all 11 structured outputs in page_index.py - Replace all extract_json() calls with typed Instructor calls - Remove 'Directly return the final JSON structure' from all prompts - Fixes silent KeyError crashes when extract_json() returned {} on malformed LLM output

KylinMountain · 2026-07-03T09:23:51Z

@pegahmansourian Thanks for this — moving to structured output (instructor + Pydantic) is a solid direction and it removes a whole class of manual-parsing bugs. A few things to address before we can take it:

Move the schemas out of page_index.py. The ~14 BaseModel definitions don't belong in the core logic file — please put them in a dedicated module (e.g. pageindex/schemas.py) and import them. Keeps page_index.py focused and lets the schemas be reused/tested independently.
The per-field prompt guidance is being dropped. For example detect_page_index used to tell the model what to put in thinking:"thinking": <why do you think there are page numbers/indices given within the table of contents>.
After the change that's a bare thinking: str with no guidance, so the model loses that instruction. instructor serializes Pydantic field descriptions into the schema it sends the model, so please carry the original hints. Without this, the change risks lowering answer quality rather than improving it.
Do you have before/after test data? Since this PR is about reliability, it'd help to see concrete numbers — e.g. JSON parse-failure rate or TOC accuracy over a set of PDFs, old vs new, ideally including a weaker/local (Ollama) model since Mode.JSON relies on the provider honoring the schema. Right now the gain is asserted but not measured.

Minor/practical: instructor isn't in requirements.txt (install will break), and the branch now conflicts with main — will need a rebase.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Replace manual JSON parsing with Instructor for structured output reliability#339

Replace manual JSON parsing with Instructor for structured output reliability#339
pegahmansourian wants to merge 1 commit into
VectifyAI:mainfrom
pegahmansourian:fix/replace-manual-json-parsing-with-instructor

pegahmansourian commented Jun 26, 2026

Uh oh!

KylinMountain commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

pegahmansourian commented Jun 26, 2026

Problem

Solution

Changes

Benefits

New dependency

Uh oh!

KylinMountain commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants