Start here · 02

Why PDFs are hostile input for LLMs

The point: a PDF is a painting of a document — the model reads a guess at the text, and every downstream error starts there.

Paste a PDF into a chat model and it usually answers something. That is the problem: it answers from a degraded copy of your document, and neither of you can see what was lost. The failure isn't the model. It's the format.

1 · A PDF is a painting of a document

PDF was designed in 1993 to make documents print identically everywhere. It succeeds by storing drawing instructions: put this glyph at x=72.4, y=310.2; move right 4.1 points; draw a line here. There is no paragraph object, no heading object, and in most real-world files no table object. The visual structure your eye reconstructs — columns, headers, captions under figures — exists only as coordinates.

What the PDF stores

Tj (R) 72.0 404.2
Tj (e) 78.4 404.2
Tj (s) 83.9 404.2
Tj (u) 88.7 404.2  …
(glyphs at coordinates)

What a model needs

## Results

The cached path wins on
every measured run…

Text extraction therefore has to guess. Which glyph runs form a line? Which lines form a paragraph? Does that vertical gap mean a new column or a new section? Every extractor answers these questions with heuristics, and every heuristic fails somewhere.
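To make the guessing concrete, here is a minimal sketch of the kind of heuristic an extractor might use to turn positioned glyphs back into lines and words. The `Glyph` type, the function name and both thresholds are assumptions made up for illustration; real extractors are more elaborate, but they make the same kind of bet.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float      # left edge, in points
    y: float      # baseline, in points
    width: float  # advance width, in points

def glyphs_to_lines(glyphs, y_tolerance=2.0, space_gap=1.5):
    """Guess lines and word breaks from glyph positions alone."""
    lines = []  # each entry: [baseline_y, [glyphs on that line]]
    # PDF y grows upward, so descending y reads top to bottom.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= y_tolerance:
            lines[-1][1].append(g)   # close enough: same line (a guess)
        else:
            lines.append([g.y, [g]])  # otherwise start a new line
    out = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            # A horizontal gap wider than space_gap is treated as a space.
            if cur.x - (prev.x + prev.width) > space_gap:
                text += " "
            text += cur.char
        out.append(text)
    return out
```

Both thresholds are arbitrary. Loosely tracked text picks up spaces inside words, tight tracking fuses words together, and a superscript sitting a few points above the baseline can become its own line.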

2 · What this does to model output

The damage is quiet. A model given interleaved columns rarely says "this text is garbled" — it produces fluent summaries of sentences that never existed. A collapsed table doesn't yield "I can't read this table"; it yields plausible numbers attached to the wrong rows. In retrieval pipelines the effect compounds: chunk boundaries fall mid-table, embeddings encode noise, and the retriever surfaces the page-number footer as the "most relevant passage."
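The column case is easy to reproduce. The sketch below uses invented coordinates and an invented right-hand column to show how a plain top-to-bottom reading order splices two columns into sentences nobody wrote. `gutter_x` is assumed to be known here; a real extractor has to guess it.

```python
def naive_order(lines):
    """Read top to bottom, then left to right, across the full page width."""
    return [text for _, _, text in sorted(lines, key=lambda l: (-l[1], l[0]))]

def column_order(lines, gutter_x):
    """Read the left column fully, then the right one."""
    left = [l for l in lines if l[0] < gutter_x]
    right = [l for l in lines if l[0] >= gutter_x]
    return naive_order(left) + naive_order(right)

# (x, y, text) for four lines on an illustrative two-column page.
page = [
    (72.0, 700.0, "The cached path wins on"),
    (72.0, 686.0, "every measured run."),
    (320.0, 700.0, "Limitations: we tested"),
    (320.0, 686.0, "a single machine."),
]

print(" ".join(naive_order(page)))
print(" ".join(column_order(page, gutter_x=200.0)))
```

The first line of output is fluent enough that a model will summarize it without complaint, which is exactly the quiet failure described above.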

The most expensive failure mode in LLM document work is not refusal. It is confident output built on silently corrupted input.

3 · Measure it yourself

Take any two-column paper or financial report. Extract its text with any tool, then check three things: whether any sentence splices into an unrelated one, whether any table row survives with its header attached, and how many times the page header repeats. In our test corpus, a typical 40-page report produced usable prose but zero fully intact tables; every one had to be reconstructed by eye.
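The third check is the easiest to automate. A minimal sketch, assuming you already have the extracted text as one string per page; the function name and the `min_share` threshold are illustrative choices, not part of any particular tool:

```python
from collections import Counter

def repeated_lines(pages, min_share=0.6):
    """Return lines that appear on at least min_share of pages, with page counts."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        seen = {line.strip() for line in page.splitlines() if line.strip()}
        counts.update(seen)
    threshold = min_share * len(pages)
    return {line: n for line, n in counts.items() if n >= threshold}
```

Anything this returns is probably a running header or footer that got mixed into the body text. Exact matching misses footers that change per page, such as page numbers, so treat the result as a lower bound.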

4 · What to feed the model instead

The practical fix is to move the document into a format where structure is explicit — Markdown is the current lingua franca because headings, lists, tables and code fences survive tokenization and are abundantly represented in training data. The order of preference we use:

  1. The source, not the print. If an HTML, DOCX or notebook version exists, convert that. It still knows what a heading is. The PDF is the worst copy of your document that exists.
  2. Deterministic conversion with a fidelity report. Whatever converts your file should tell you what it detected — tables, figures, equations — so silent loss becomes visible loss.
  3. PDF as a last resort, with disclaimers attached. When only a PDF exists, extract per page, keep the layout-lossy warning attached to the output, and treat scanned pages as a hard stop rather than pretending.
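As a rough sketch of points 2 and 3, the function below takes text that has already been extracted per page, attaches the warning to every page, reports what it noticed, and stops at a page that looks scanned. This is not our converter's code. Every name, the `min_chars` cutoff and the table-detection rule are assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass, field

LAYOUT_WARNING = "[extracted from PDF: layout, tables and reading order may be wrong]"

class ScannedPageError(Exception):
    """Raised when a page has too little text to be a real text layer."""

@dataclass
class FidelityReport:
    pages: int = 0
    table_like_lines: int = 0
    warnings: list = field(default_factory=list)

def convert_pages(page_texts, min_chars=20):
    """Turn per-page extracted text into Markdown plus a report of what was seen."""
    report = FidelityReport()
    chunks = []
    for number, text in enumerate(page_texts, start=1):
        body = text.strip()
        if len(body) < min_chars:
            # Hard stop: almost no text usually means a scanned image (assumption).
            raise ScannedPageError(
                f"page {number} has no usable text layer; look for the HTML or source file"
            )
        # Crude signal: two or more wide gaps in a line often mean table cells.
        table_like = sum(
            1 for line in body.splitlines()
            if len(re.findall(r"\s{2,}", line.strip())) >= 2
        )
        report.pages += 1
        report.table_like_lines += table_like
        if table_like:
            report.warnings.append(
                f"page {number}: {table_like} table-like lines, verify by hand"
            )
        # The warning travels with every page, so it survives chunking.
        chunks.append(f"## Page {number}\n\n{LAYOUT_WARNING}\n\n{body}")
    return "\n\n".join(chunks), report
```

Raising an error instead of returning an empty page is deliberate: an empty string passed downstream looks like a blank page, while an exception forces someone to decide what to do with the scan.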

That's the design behind our converter: it prefers structured sources, reports what it detected instead of promising preservation, and refuses to hallucinate text for scanned pages — it tells you to find the HTML instead.

Convert a document and see its fidelity report — everything runs in your browser, nothing is uploaded.