Feeding Word documents to an LLM
A .docx is not a document — it's a ZIP archive of
XML files describing one: content in one part, styles in another,
images in a media folder, plus themes, fonts and settings. That
architecture is why a 3-page memo weighs 36 KB and why naive
extraction loses precisely the things a model needs.
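To see that layout for yourself, here is a minimal sketch using only Python's standard library. It builds a tiny stand-in archive in memory and lists its parts the same way you would for a real file. The part names follow the usual .docx layout, but the contents are placeholders rather than a valid Word document, and `list_parts` and `make_stand_in` are illustrative names, not part of any converter.

```python
import io
import zipfile


def list_parts(docx_bytes: bytes) -> list[tuple[str, int]]:
    """Return (part name, uncompressed size) for every entry in the archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


def make_stand_in() -> bytes:
    """Build a toy archive with typical .docx part names (placeholder contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr("word/styles.xml", "<styles/>")
        archive.writestr("word/media/image1.png", b"\x89PNG")
    return buffer.getvalue()


if __name__ == "__main__":
    for name, size in list_parts(make_stand_in()):
        print(f"{size:>6}  {name}")
```

For a real document, read the file's bytes (for example with `pathlib.Path(...).read_bytes()`) and pass them to `list_parts` instead of the stand-in.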
What breaks if you paste or extract it raw
- The outline evaporates. "Heading 2" is a style reference, not markup. Extractors that ignore styles emit every heading as an ordinary paragraph — the document arrives as a flat wall, and retrieval loses its best structure.
- Tables degrade. Word tables wrap every cell's content in paragraph elements; careless conversion produces empty cells or one-column mush, and header rows arrive as data.
- Images vanish or explode. They are either dropped without a trace or inlined as base64, which buries the surrounding text under thousands of characters of encoded noise.
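The first failure is the easiest to illustrate. The sketch below reads a paragraph's style reference and turns it into a Markdown heading. It assumes style IDs of the form `Heading1` to `Heading6` (common in default English templates) and a one-to-one mapping from heading level to `#` count. A real converter would more likely resolve IDs through word/styles.xml. `paragraph_to_markdown` is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# First paragraph is the heading from this page's example;
# the second is made-up filler split across two runs.
SAMPLE = f"""
<w:body xmlns:w="{W}">
  <w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
    <w:r><w:t>Regional performance</w:t></w:r></w:p>
  <w:p><w:r><w:t>An ordinary </w:t></w:r><w:r><w:t>paragraph.</w:t></w:r></w:p>
</w:body>
"""


def paragraph_to_markdown(p: ET.Element) -> str:
    """Join the paragraph's text runs and prefix '#' marks if it has a heading style."""
    text = "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
    style = p.find("w:pPr/w:pStyle", NS)
    style_id = style.get(f"{{{W}}}val", "") if style is not None else ""
    # Assumption: style IDs look like "Heading2"; other IDs are treated as body text.
    match = re.fullmatch(r"Heading([1-6])", style_id)
    if match:
        return "#" * int(match.group(1)) + " " + text
    return text


if __name__ == "__main__":
    body = ET.fromstring(SAMPLE.strip())
    for paragraph in body.iterfind("w:p", NS):
        print(paragraph_to_markdown(paragraph))
```

An extractor that skips the `w:pPr` lookup would emit both paragraphs as plain text, which is exactly the flat wall described above.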
The element mapping
| In the .docx | In the Markdown |
|---|---|
| Title / Heading 1–4 styles | # / ## / ### — a real outline, from the style objects |
| Bold / italic runs | **bold** / *italic* |
| Bulleted / numbered lists | Markdown lists, nesting kept |
| Tables | GFM pipe tables; first row promoted to a real header when Word didn't mark one |
| Embedded images | [Figure: …] placeholder in place; image file extracted into the .zip's figures/ |
| Hyperlinks | Inline Markdown links |
| Headers, footers, page numbers | Dropped — page furniture, not content |
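The table row of this mapping can be sketched in a few lines. The code below is an illustration under simplifying assumptions, not the converter's actual implementation: it always treats the first row as the header, joins multiple paragraphs in a cell with a space, and ignores merged cells. The helper names (`cell_text`, `table_to_gfm`, `_cell`, `_row`) are invented for the example, and the sample values are the ones used on this page.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _cell(text: str) -> str:
    # Word wraps cell content in a paragraph, then a run, then a text node.
    return f"<w:tc><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"


def _row(*cells: str) -> str:
    return "<w:tr>" + "".join(_cell(c) for c in cells) + "</w:tr>"


SAMPLE = (
    f'<w:tbl xmlns:w="{W}">'
    + _row("Region", "Units shipped", "Avg delivery (days)")
    + _row("North", "18,420", "2.1")
    + "</w:tbl>"
)


def cell_text(tc: ET.Element) -> str:
    """Collect the text of every paragraph directly inside the cell."""
    paragraphs = [
        "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
        for p in tc.iterfind("w:p", NS)
    ]
    # Escape pipes so cell content cannot break the table syntax.
    return " ".join(s for s in paragraphs if s).replace("|", "\\|")


def table_to_gfm(tbl: ET.Element) -> str:
    """Render a w:tbl element as a GFM pipe table, first row as header."""
    rows = [[cell_text(tc) for tc in tr.iterfind("w:tc", NS)]
            for tr in tbl.iterfind("w:tr", NS)]
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    rows = [r + [""] * (width - len(r)) for r in rows]  # pad ragged rows

    def fmt(cells: list[str]) -> str:
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    return "\n".join([fmt(header), fmt(["---"] * width)] + [fmt(r) for r in body])


if __name__ == "__main__":
    print(table_to_gfm(ET.fromstring(SAMPLE)))
```

Reading text through the cell's paragraphs, rather than expecting it directly inside `w:tc`, is what avoids the empty-cell failure described earlier.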
Before → after
In the file
<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
<w:r><w:t>Regional performance</w:t></w:r></w:p>
<w:tbl><w:tr><w:tc><w:p><w:r><w:t>Region</w:t></w:r>…
In the Markdown
## Regional performance
| Region | Units shipped | Avg delivery (days) |
| --- | --- | --- |
| North | 18,420 | 2.1 |
Honest limits
- Unstyled "headings" (someone typed a line and made it big and bold by hand) can't be distinguished from emphasized paragraphs — they convert as bold text. If the outline matters, the source document needs real heading styles.
- Comments and tracked changes are not included; convert the accepted-changes version you actually mean to feed the model.
- Text boxes, shapes and SmartArt are drawing objects; their text is generally lost, and the fidelity report's detected counts are how you notice.
- .doc (legacy binary) is a different, pre-2007 format — resave as .docx first (File → Save As).
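If you are unsure which kind of file you have, the first few bytes usually tell you. The sketch below is a rough check under stated assumptions, not a full validator: a .docx starts with the ZIP local-file signature, while a legacy .doc is an OLE compound file with its own signature. Password-encrypted .docx files are typically wrapped in the same OLE container, so an "ole" result means "not directly parseable" in either case. `sniff_container` is a hypothetical helper name.

```python
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")    # OLE compound file header


def sniff_container(head: bytes) -> str:
    """Classify a file by its leading bytes: 'zip', 'ole', or 'unknown'."""
    if head.startswith(ZIP_MAGIC):
        return "zip"      # plausible .docx (or any other ZIP-based format)
    if head.startswith(OLE_MAGIC):
        return "ole"      # legacy .doc, or an encrypted OOXML package
    return "unknown"


if __name__ == "__main__":
    print(sniff_container(b"PK\x03\x04rest-of-file"))
    print(sniff_container(OLE_MAGIC + b"\x00" * 8))
```

To check a file on disk, read its first eight bytes in binary mode and pass them in. A "zip" result only says the container is right; it does not confirm the archive holds a Word document.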
FAQ
Where do the images go? Into the downloaded .zip, next to the Markdown, named as the placeholders say — workspace uploads can include or skip them.
Does it read password-protected files? No — remove protection first; the parser sees only encrypted bytes.
Confidential documents? Conversion is local to your browser; nothing is uploaded. Verify by converting offline. See privacy.