Feeding PDFs to an LLM

MakeItMarkdown · July 2026

PDF is the only format on this site where our guide begins with "avoid it if you can". A PDF stores drawing instructions — put this glyph at these coordinates — and nothing about paragraphs, headings or tables. Everything any tool "extracts" from a PDF is reconstruction, and honest conversion means saying so. (The full autopsy: Why PDFs are hostile input for LLMs.)

What can honestly be recovered

For born-digital PDFs (exported from a word processor or browser), the text itself is intact — it's the structure that's gone. Per-page text extraction recovers the words in reading order for simple single-column layouts, and progressively less reliably for columns, floats and dense tables.
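To make "reading order" concrete, here is a minimal sketch of how positioned text fragments can be regrouped into lines. It is an illustration under stated assumptions, not this site's converter: the `Fragment` type, the `fragments_to_lines` function and the `y_tolerance` value are all hypothetical, and the coordinates are assumed to have y growing downward from the top of the page.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A positioned run of text (hypothetical stand-in for an extractor's output)."""
    x: float   # left edge, in page units
    y: float   # baseline, measured downward from the top of the page
    text: str


def fragments_to_lines(fragments: list[Fragment], y_tolerance: float = 2.0) -> list[str]:
    """Group fragments that share a baseline, then order each group left to right."""
    lines: list[list[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        # Same line if the baseline is within the (assumed) tolerance of the
        # first fragment of the current line; otherwise start a new line.
        if lines and abs(frag.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines]


single_column = [
    Fragment(120, 50.0, "18,420"),
    Fragment(40, 50.5, "North"),
    Fragment(40, 70.0, "South"),
    Fragment(120, 69.5, "15,876"),
]
print(fragments_to_lines(single_column))
```

The same sketch shows why columns are harder: fragments from two columns usually share baselines, so baseline grouping alone joins the left and right column into a single line.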

The element mapping

In the PDF	In the Markdown
Each page's text layer	A `## Page N` section with the extracted text lines
Multi-column / complex layout	Extracted as-is, with a standing layout-lossy warning in the fidelity report
Tables	Not reconstructed as GFM — a PDF has no table objects, and fake table recovery is how other tools invent data; rows survive as text lines
Image-only (scanned) pages	Detected by character density and flagged; an opt-in Try OCR (experimental) button runs locally and returns clearly-labelled approximate text
Password-protected file	A clear error with the workaround (open it, remove the password, e.g. Print → Save as PDF)
Corrupt / truncated file	A clear "re-export or re-download" error instead of a hang
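
Two rows of the mapping above can be sketched in a few lines: one `## Page N` section per page, and a character-density check for image-only pages. This is a rough illustration only. The function names, the `min_chars` threshold and the wording of the image-only warning are assumptions made for the example, not the tool's actual values.

```python
def is_probably_scanned(page_text: str, min_chars: int = 20) -> bool:
    """Flag a page whose text layer has very few visible characters.

    The threshold of 20 is an assumed value chosen for illustration.
    """
    visible = sum(1 for ch in page_text if not ch.isspace())
    return visible < min_chars


def pages_to_markdown(pages: list[str]) -> str:
    """Emit one `## Page N` section per page, flagging likely image-only pages."""
    sections: list[str] = []
    for number, text in enumerate(pages, start=1):
        if is_probably_scanned(text):
            # Hypothetical warning wording; no text is invented for the page.
            body = "> Warning: image-only page; no usable text layer found."
        else:
            body = text.strip()
        sections.append(f"## Page {number}\n\n{body}")
    return "\n\n".join(sections) + "\n"


pages = [
    "Regional performance\nNorth 18,420 2.1\nSouth 15,876 2.6",
    "  \n ",  # a page with an empty text layer, as a scan would have
]
print(pages_to_markdown(pages))
```

Note that the table rows pass through as plain text lines, matching the mapping: nothing here tries to rebuild cells.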

Long documents show per-page progress while converting; a 40-page report takes moments, not minutes.

The decision tree

1. Does the content exist as HTML or .docx anywhere? Convert that instead; every downstream use improves. (This is the single highest-leverage move in this entire guide.)
2. Born-digital PDF, simple layout? Convert it; expect clean text with page addresses.
3. Complex layout? Convert, then read the fidelity warnings and spot-check the sections you'll rely on.
4. Scan? Different problem: see the scanned-PDF guide.
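
The decision tree above can also be written as a small function, which is handy if you triage files in a script. This is a sketch: the `Advice` enum, the `choose_path` function and its three boolean inputs are hypothetical names, and you still have to answer the questions yourself. The scan check is placed before the layout check here, on the assumption that layout complexity is moot when there is no text layer.

```python
from enum import Enum


class Advice(Enum):
    USE_STRUCTURED_SOURCE = "Convert the HTML or .docx version instead."
    CONVERT = "Convert it; expect clean text with page addresses."
    CONVERT_AND_VERIFY = "Convert, then read the fidelity warnings and spot-check."
    SCANNED_GUIDE = "Different problem: see the scanned-PDF guide."


def choose_path(has_structured_source: bool, is_scan: bool, complex_layout: bool) -> Advice:
    """Walk the guide's questions in priority order and return the advice."""
    if has_structured_source:
        # A structured source beats any PDF, whatever the PDF looks like.
        return Advice.USE_STRUCTURED_SOURCE
    if is_scan:
        return Advice.SCANNED_GUIDE
    if complex_layout:
        return Advice.CONVERT_AND_VERIFY
    return Advice.CONVERT


print(choose_path(has_structured_source=False, is_scan=False, complex_layout=True).value)
```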

Before → after

In the file

Regional per formance
North 18,420 2.1 South
15,876 2.6 the table
continues mid-sentence…

In the Markdown

## Page 3

Regional performance
North 18,420 2.1
South 15,876 2.6

> Warning: layout-lossy — verify tables against the source.

FAQ

Why is the QC score capped lower for PDFs? Because structural checks (real sections, intact tables) can't fully pass on a structureless source. The score is honest rather than flattering.

Big PDFs? Parsing is local (pdf.js, the engine behind Firefox's built-in PDF viewer); size is limited by your machine, not by an upload cap.

Forms and annotations? Body text only — filled form values and comments aren't extracted today.

Convert a PDF and read its warnings — knowing what's shaky is the feature.