Feeding PDFs to an LLM
PDF is the only format on this site where our guide begins with "avoid it if you can". A PDF stores drawing instructions — put this glyph at these coordinates — and nothing about paragraphs, headings or tables. Everything any tool "extracts" from a PDF is reconstruction, and honest conversion means saying so. (The full autopsy: Why PDFs are hostile input for LLMs.)
What can honestly be recovered
For born-digital PDFs (exported from a word processor or browser), the text itself is intact — it's the structure that's gone. Per-page text extraction recovers the words in reading order for simple single-column layouts, and progressively less reliably for columns, floats and dense tables.
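
A minimal sketch can show why simple layouts survive and columns do not. It assumes each text fragment arrives with a page position (x, y, with y growing upward as in PDF coordinates). The `Fragment` type, the `lines_in_reading_order` function and the tolerance value are illustrative names and numbers, not any particular extractor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A run of text plus the position where it starts (hypothetical type)."""
    text: str
    x: float
    y: float  # PDF convention: larger y is higher on the page


def lines_in_reading_order(fragments, y_tolerance=2.0):
    """Group fragments into lines by baseline, top to bottom, left to right."""
    lines = []  # each entry: (baseline_y, [fragments on that baseline])
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


# Single column: position alone is enough to recover the lines.
single = [
    Fragment("performance", 120, 700),
    Fragment("Regional", 50, 700),
    Fragment("North 18,420 2.1", 50, 680),
]
print(lines_in_reading_order(single))

# Two columns: fragments that share a baseline are merged across the gutter,
# so the two columns interleave line by line.
two_columns = [
    Fragment("Left one", 50, 700),
    Fragment("Right one", 320, 700),
    Fragment("Left two", 50, 680),
    Fragment("Right two", 320, 680),
]
print(lines_in_reading_order(two_columns))
```

Nothing in the positions says where one column ends and the next begins, so any fix for the second case is a guess about layout. That guess is the reconstruction the fidelity warnings are about.
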
The element mapping
| In the PDF | In the Markdown |
|---|---|
| Each page's text layer | A ## Page N section with the extracted text lines |
| Multi-column / complex layout | Extracted as-is, with a standing layout-lossy warning in the fidelity report |
| Tables | Not reconstructed as GFM — a PDF has no table objects, and fake table recovery is how other tools invent data; rows survive as text lines |
| Image-only (scanned) pages | Detected by character density and flagged; an opt-in Try OCR (experimental) button runs locally and returns clearly-labelled approximate text |
| Password-protected file | A clear error with the workaround (open it, remove the password, e.g. Print → Save as PDF) |
| Corrupt / truncated file | A clear "re-export or re-download" error instead of a hang |
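
The first and fourth rows of the mapping can be sketched as follows. This is an illustration under stated assumptions, not the converter's actual code: the raw non-whitespace count is a crude stand-in for character density, and the `min_chars` threshold, the function names and the wording of the scan warning are all hypothetical.

```python
def looks_scanned(page_text, min_chars=20):
    """Guess that a page is image-only when it has almost no extractable text.

    Assumption: a raw count of non-whitespace characters stands in for
    character density; the threshold of 20 is arbitrary.
    """
    return sum(1 for ch in page_text if not ch.isspace()) < min_chars


def pages_to_markdown(pages, min_chars=20):
    """Render each page's extracted text as a '## Page N' section."""
    sections = []
    for number, text in enumerate(pages, start=1):
        parts = [f"## Page {number}"]
        # Keep the extracted lines as they are; no table or heading recovery.
        parts.extend(line.strip() for line in text.splitlines() if line.strip())
        if looks_scanned(text, min_chars):
            # Hypothetical wording for the scanned-page flag.
            parts.append("> Warning: little or no text layer; this page may be a scan.")
        sections.append("\n".join(parts))
    return "\n\n".join(sections)


print(pages_to_markdown(["Regional performance\nNorth 18,420 2.1\n", "  \n"]))
```

Table rows pass through as plain text lines, matching the mapping above: the sketch never tries to infer cells or columns.
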
Long documents show per-page progress while converting; a 40-page report takes moments, not minutes.
The decision tree
- Does the content exist as HTML or .docx anywhere? Convert that instead — every downstream use improves. (This is the single highest-leverage move in this entire guide.)
- Born-digital PDF, simple layout? Convert it; expect clean text with page addresses.
- Complex layout? Convert, then read the fidelity warnings and spot-check the sections you'll rely on.
- Scan? Different problem — see the scanned-PDF guide.
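
The same tree can be written as a small function, which makes the order of the checks explicit. This is a sketch only: `choose_route` and its three boolean inputs are hypothetical, and it tests for a scan before layout on the assumption that layout cannot be judged without a text layer.

```python
def choose_route(has_html_or_docx, is_scan, complex_layout):
    """Return the recommended next step for a document (illustrative only)."""
    if has_html_or_docx:
        # A structured source beats any PDF reconstruction.
        return "convert the HTML or .docx instead"
    if is_scan:
        return "see the scanned-PDF guide"
    if complex_layout:
        return "convert, then read the fidelity warnings and spot-check"
    return "convert; expect clean text with page addresses"


print(choose_route(has_html_or_docx=False, is_scan=False, complex_layout=True))
```
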
Before → after
In the file
Regional per formance
North 18,420 2.1 South
15,876 2.6 the table
continues mid-sentence…
In the Markdown
## Page 3
Regional performance
North 18,420 2.1
South 15,876 2.6
> Warning: layout-lossy — verify tables against the source.
FAQ
Why does the QC score cap lower for PDFs? Because the structural checks (real sections, intact tables) cannot fully pass on a source that carries no structure. The score is honest rather than flattering.
Big PDFs? Parsing is local (pdf.js, the same engine as Firefox); size is limited by your machine, not by an upload cap.
Forms and annotations? Body text only — filled form values and comments aren't extracted today.
Convert a PDF and read its warnings — knowing what's shaky is the feature.