feat: preserve DOCX table structure as HTML #2111

Open
gyx09212214-prog wants to merge 1 commit into microsoft:main from gyx09212214-prog:codex/preserve-docx-html-tables

@gyx09212214-prog

Summary

  • Add an opt-in docx_table_format="html" mode for DOCX conversion.
  • Preserve raw HTML <table> output when DOCX tables contain structure that Markdown pipe tables cannot represent, such as merged cells and nested tables.
  • Add docx_markdownify_options forwarding for DOCX conversions and expose --docx-table-format markdown|html in the CLI.
  • Document the new CLI and Python API usage.
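
To make the second bullet concrete, here is a minimal sketch of how a converter might decide that a table needs HTML output. It is not the code in this PR: `TableStructureProbe`, `_span`, and `needs_html_table` are hypothetical names, and the sketch assumes the table is already available as an HTML string. It uses only the standard library.

```python
from html.parser import HTMLParser


class TableStructureProbe(HTMLParser):
    """Flags table features that a Markdown pipe table cannot express."""

    def __init__(self):
        super().__init__()
        self.depth = 0            # current <table> nesting depth
        self.has_nested = False   # a <table> inside another <table>
        self.has_merged = False   # any rowspan/colspan greater than 1

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth > 1:
                self.has_nested = True
        elif tag in ("td", "th"):
            for name, value in attrs:
                if name in ("rowspan", "colspan") and _span(value) > 1:
                    self.has_merged = True

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def _span(value):
    """Parse a span attribute, treating missing or malformed values as 1."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 1


def needs_html_table(table_html):
    """Return True when the table has merged cells or nested tables."""
    probe = TableStructureProbe()
    probe.feed(table_html)
    probe.close()
    return probe.has_merged or probe.has_nested
```

A table for which `needs_html_table` returns `False` can be rendered as a pipe table without losing structure; anything else is a candidate for raw `<table>` output.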

Why

Markdown pipe tables cannot represent rowspan, colspan, or nested table structure. For document ingestion workflows, especially financial or research documents, merged headers and nested table regions often carry important meaning that should survive conversion.
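
As an illustration of that loss, the sketch below renders a small table with a merged header as a pipe table. `to_pipe_table` and the sample rows are made up for this example and are not part of markitdown. Because pipe-table syntax has no way to express a span, the header row ends up with fewer cells than the rows beneath it.

```python
def to_pipe_table(rows):
    """Naively render rows of (text, colspan) cells as a Markdown pipe table.

    Pipe tables have no span syntax, so colspan is deliberately ignored.
    """
    lines = []
    for index, row in enumerate(rows):
        lines.append("| " + " | ".join(text for text, _ in row) + " |")
        if index == 0:
            # Separator row sized to the header, as Markdown requires.
            lines.append("|" + " --- |" * len(row))
    return "\n".join(lines)


rows = [
    [("Region", 1), ("2023", 2)],        # "2023" spans the Q1 and Q2 columns
    [("", 1), ("Q1", 1), ("Q2", 1)],
    [("EMEA", 1), ("10", 1), ("12", 1)],
]
print(to_pipe_table(rows))
```

The header row has two cells while the body rows have three, so a reader of the Markdown can no longer tell that "2023" covered both quarter columns.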

Related issues: #1211, #1217, #1248, #167.

Tests

  • python -m pytest packages\markitdown\tests\test_docx_tables.py packages\markitdown\tests\test_module_misc.py::test_docx_comments packages\markitdown\tests\test_module_misc.py::test_docx_equations packages\markitdown\tests\test_cli_misc.py -q
  • python -m black --check packages\markitdown\src\markitdown\converters\_markdownify.py packages\markitdown\src\markitdown\converters\_docx_converter.py packages\markitdown\src\markitdown\_markitdown.py packages\markitdown\src\markitdown\__main__.py packages\markitdown\tests\test_docx_tables.py

Note: I also ran test_cli_vectors.py::test_output_to_stdout. It failed because my local environment has only the [docx] extra installed and lacks the optional [xls], [outlook], and [pdf] dependencies; the failures are unrelated to this change.
