Skip to content

Replace manual JSON parsing with Instructor for structured output reliability#339

Open
pegahmansourian wants to merge 1 commit into
VectifyAI:mainfrom
pegahmansourian:fix/replace-manual-json-parsing-with-instructor
Open

Replace manual JSON parsing with Instructor for structured output reliability#339
pegahmansourian wants to merge 1 commit into
VectifyAI:mainfrom
pegahmansourian:fix/replace-manual-json-parsing-with-instructor

Conversation

@pegahmansourian

Copy link
Copy Markdown

Problem

The codebase uses manual JSON prompt instructions across 11 locations in
page_index.py, with a custom extract_json() parser in utils.py that
handles malformed output through multiple fallback strategies.

When the LLM returns slightly malformed JSON, extract_json() silently
returns {}, causing downstream KeyError crashes like:

json_content['completed']  # KeyError if extract_json returned {}

Solution

Replace manual JSON parsing with Instructor,
which is patched directly onto the existing LiteLLM client already used in the codebase.

Changes

utils.py

  • Add sync_instructor_client and async_instructor_client patched via instructor.from_litellm()
  • Add llm_structured() for sync structured calls
  • Add llm_astructured() for async structured calls
  • extract_json() and get_json_content() can be removed once all callers are updated

page_index.py

  • Add Pydantic models for all 11 structured outputs
  • Replace all extract_json() calls with llm_structured() / llm_astructured()
  • Remove "Directly return the final JSON structure" from all prompts — Instructor handles this automatically

Benefits

  • Removes 40 lines of fragile JSON parsing with multiple fallback hacks
  • Adds automatic retry on malformed output — no more silent {} returns
  • Adds type safety via Pydantic — Literal["yes", "no"] fields are always valid
  • No behavior change — same LiteLLM routing, same models, same logic
  • instructor is the only new dependency

New dependency

This PR adds one new dependency: instructor

Add to requirements.txt:

instructor

Or install directly:

pip install instructor

…iability

- Add llm_structured() and llm_astructured() in utils.py using instructor.from_litellm()
- Add Pydantic models for all 11 structured outputs in page_index.py
- Replace all extract_json() calls with typed Instructor calls
- Remove 'Directly return the final JSON structure' from all prompts
- Fixes silent KeyError crashes when extract_json() returned {} on malformed LLM output
@KylinMountain

Copy link
Copy Markdown
Collaborator

@pegahmansourian Thanks for this — moving to structured output (instructor + Pydantic) is a solid direction and it removes a whole class of manual-parsing bugs. A few things to address before we can take it:

  1. Move the schemas out of page_index.py. The ~14 BaseModel definitions don't belong in the core logic file — please put them in a dedicated module (e.g. pageindex/schemas.py) and import them. Keeps page_index.py focused and lets the schemas be reused/tested independently.

  2. The per-field prompt guidance is being dropped. For example detect_page_index used to tell the model what to put in thinking:"thinking": <why do you think there are page numbers/indices given within the table of contents>.
    After the change that's a bare thinking: str with no guidance, so the model loses that instruction. instructor serializes Pydantic field descriptions into the schema it sends the model, so please carry the original hints. Without this, the change risks lowering answer quality rather than improving it.

  3. Do you have before/after test data? Since this PR is about reliability, it'd help to see concrete numbers — e.g. JSON parse-failure rate or TOC accuracy over a set of PDFs, old vs new, ideally including a weaker/local (Ollama) model since Mode.JSON relies on the provider honoring the schema. Right now the gain is asserted but not measured.

Minor/practical: instructor isn't in requirements.txt (install will break), and the branch now conflicts with main — will need a rebase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants