LLMs don't just read Markdown — they speak it
Try a small experiment. Ask any chat model to explain something —
a tradeoff, a recipe, an error message. Look at the shape of the
answer: a bold lead, a bulleted list, maybe a ## heading
and a fenced code block. Nobody asked it to format anything. The
model reached for Markdown the way you reach for your native
language.
That's not a UI trick. It's a fact about how these models were made, and it has a practical consequence most people miss: the format you paste in is not neutral. Some formats land in the model's native register. Others make it do translation work before it can even start thinking.
The point: the format you paste in is not neutral — Markdown lands in the model’s native register; everything else costs a translation step first.
1 · Why Markdown is the native register
Three forces, all pointing the same direction:
- The training corpus is soaked in it. README files, documentation sites, wikis, developer forums, chat logs — an enormous share of the technical text a model learns from is Markdown or renders from it. The model has seen # heading mean "heading" literally billions of times.
- Chat tuning rewards it. Assistant models are fine-tuned on conversations where good answers are structured answers — and the structure is written in Markdown. Producing clean Markdown is, quite literally, what these models were graded on.
- Every chat interface renders it. The bold text and tidy lists you see in a chat window are Markdown being rendered live. Output format and display format agree, so the whole ecosystem keeps reinforcing it.

What your chat renders:
Trade-offs
- Speed — the cached path wins
- Cost — batch where possible

What the model actually streamed:
## Trade-offs
- **Speed** — the cached path wins
- **Cost** — batch where possible
So when your document arrives as Markdown, its structure is
expressed in the exact vocabulary the model uses to organize its own
thoughts. A ## is not a hint to be decoded; it's a
first-class token pattern the model has an extremely strong prior
about.
2 · What other formats make the model do first
Consider the same document arriving two ways: as text extracted from a PDF, and as Markdown.
With a pasted PDF extraction, the model must first infer where lines break into paragraphs, which fragments are headings, which runs of numbers were once a table — all from typography that no longer exists. Modern models are impressively good at this guessing. But every guess consumes capacity, and a wrong guess doesn't announce itself: the model just answers confidently from a slightly wrong document. We measured what this does to tables and layout in "Why PDFs are hostile input for LLMs".
With Markdown, that entire first stage disappears. Heading levels, list nesting, table cells, code boundaries — all explicit, all in the model's home notation. Reading comprehension starts at sentence one.
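To make "all explicit" concrete, here is a minimal sketch of how little work it takes to recover a document outline from Markdown. The function name `outline` and both regular expressions are illustrative assumptions, not part of any library, and the rules are deliberately simplified (only `#`-style headings, fenced code skipped). A real parser that follows the CommonMark specification handles many more cases.

```python
import re

# Simplified heading rule: 1-6 '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
# A line that opens or closes a fenced code block (backticks or tildes).
FENCE = re.compile(r"^\s*(`{3,}|~{3,})")


def outline(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs for headings, ignoring fenced code."""
    headings = []
    in_code = False
    for line in markdown.splitlines():
        if FENCE.match(line):
            in_code = not in_code  # every fence line toggles code mode
            continue
        if in_code:
            continue  # a '#' inside code is a comment, not a heading
        match = HEADING.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


FENCE_MARK = "`" * 3  # built at runtime to keep this example easy to embed
sample = "\n".join([
    "# Report",
    "## Trade-offs",
    "- **Speed**: the cached path wins",
    FENCE_MARK + "python",
    "# a comment, not a heading",
    FENCE_MARK,
    "### Details",
])
print(outline(sample))  # [(1, 'Report'), (2, 'Trade-offs'), (3, 'Details')]
```

The same task on extracted PDF text has no such markers to match on, which is the inference step described above.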
3 · The token bill
Structure has a price in tokens, and Markdown's price is close to the minimum. The same three-row table, three ways:
<table><tr><th>region</th><th>units</th></tr>
<tr><td>North</td><td>1204</td></tr>
<tr><td>South</td><td>980</td></tr></table> ← 123 characters
{"rows":[{"region":"North","units":1204},
{"region":"South","units":980}]} ← 72 characters, keys repeat per row
| region | units |
| --- | --- |
| North | 1204 |
| South | 980 | ← 63 characters
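The comparison can be reproduced with a short script. This is a sketch under stated assumptions: the helper names `to_html`, `to_json` and `to_markdown` are hypothetical, the serializers emit the most compact form of each format, and character count is only a rough proxy for tokens (real token counts depend on the tokenizer). Exact totals depend on whitespace and line breaks, so they can differ by a few characters from the annotations above; the ordering is the point.

```python
import json

rows = [
    {"region": "North", "units": 1204},
    {"region": "South", "units": 980},
]
columns = ["region", "units"]


def to_html(rows, columns):
    """Serialize as a single-line HTML table with no extra whitespace."""
    head = "<tr>" + "".join(f"<th>{c}</th>" for c in columns) + "</tr>"
    body = "".join(
        "<tr>" + "".join(f"<td>{row[c]}</td>" for c in columns) + "</tr>"
        for row in rows
    )
    return f"<table>{head}{body}</table>"


def to_json(rows):
    """Serialize as compact JSON: no spaces after ',' or ':'."""
    return json.dumps({"rows": rows}, separators=(",", ":"))


def to_markdown(rows, columns):
    """Serialize as a pipe table, one row per line."""
    lines = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[c]) for c in columns) + " |")
    return "\n".join(lines)


sizes = {
    "html": len(to_html(rows, columns)),
    "json": len(to_json(rows)),
    "markdown": len(to_markdown(rows, columns)),
}
for name, size in sizes.items():
    print(f"{name:8} {size} characters")
```

Adding rows makes the difference grow: JSON repeats every key per row and HTML repeats its tags per cell, while the Markdown table pays for its column names once.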
The gap widens with real documents, because office and notebook formats are containers: fonts, themes, XML plumbing, embedded previews. The text you care about is a minority of the bytes. Some honest numbers from our own built-in samples and test files:
| Source | Original | As Markdown |
|---|---|---|
| 2-page business PDF | 26.6 KB | 0.9 KB |
| Word report (.docx) | 36.4 KB | 0.9 KB |
| Data-analysis notebook, 100+ cells | 3,465 KB | 261 KB ≈ 64K tokens |
The notebook row is the dramatic one: as raw JSON it doesn't fit in most context windows at all; as Markdown it fits with room to spare. (Small office files exaggerate the ratio — their fixed container overhead dominates — but the direction never flips: Markdown is the text, minus the plumbing.)
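The arithmetic behind those rows, as a small sketch. The sizes are copied from the table above; the token estimate for the raw notebook rests on an assumption that is not in the measurements, namely that raw notebook JSON tokenizes at the same density as the Markdown version. Treat that last figure as an order-of-magnitude guess only.

```python
# (source, original size in KB, Markdown size in KB), copied from the table.
samples = [
    ("2-page business PDF", 26.6, 0.9),
    ("Word report (.docx)", 36.4, 0.9),
    ("Data-analysis notebook", 3465.0, 261.0),
]

# Reduction factor per source: how many times smaller the Markdown is.
ratios = {name: original / markdown for name, original, markdown in samples}
for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.1f}x smaller as Markdown")

# The table pairs 261 KB of Markdown with roughly 64K tokens.
kb_per_thousand_tokens = 261.0 / 64.0  # about 4 KB per thousand tokens

# ASSUMPTION (not measured): the raw notebook JSON has the same
# KB-per-token density as its Markdown conversion.
raw_notebook_tokens_k = 3465.0 / kb_per_thousand_tokens
print(f"raw notebook: roughly {raw_notebook_tokens_k:.0f}K tokens under that assumption")
```

Under that assumption the raw notebook lands in the high hundreds of thousands of tokens, against about 64K for the converted version.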
4 · Where Markdown is not the answer
Fairness requires one caveat. Deeply nested, machine-generated data — API payloads, config trees — is often better left as JSON, which models also read fluently; flattening it into prose can lose precision. Markdown wins for documents: things with headings, paragraphs, tables, figures and code, written for a reader. That is exactly the shape of lecture notes, reports, articles and notebooks — the things people actually paste into chat windows.
5 · The takeaway
Models answer from what they can parse, in a register they were trained to think in. Markdown is that register. Convert once, and every downstream use — pasting into a chat, uploading to an AI workspace, indexing for retrieval — starts from the model's native language instead of a guessing game.