Scanned PDFs: what text converters can't read — and what OCR can
Some PDFs contain text. Others contain pictures of text — scans, photographed pages, faxes reborn as PDFs, and exports from apps that "print" to an image. To every text extractor on earth, the second kind is a photo album. There is nothing to extract, because nothing textual is there.
1 · Diagnose it in five seconds
- The selection test: open the PDF and try to select a sentence with your cursor. Selectable → real text. Only draggable as a picture → scan.
- The search test: Ctrl/Cmd-F a word you can see on the page. No hits for a visible word → scan.
- The converter test: drop it into MakeItMarkdown — image-only pages trigger an explicit "looks scanned" warning in the fidelity report instead of an empty file pretending to be a conversion.
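All three tests ask the same question: does the page carry extractable characters, or only an image? A minimal Python sketch of that check follows, assuming some PDF extractor has already reported a character count and an image count for each page. The `PageStats` shape, the function names and the 20-character threshold are hypothetical and chosen for illustration; this is not MakeItMarkdown's actual detection logic.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Per-page facts a PDF extractor could report (hypothetical shape)."""
    number: int
    text_chars: int   # characters found in the page's text layer
    image_count: int  # raster images placed on the page


def looks_scanned(page: PageStats, min_chars: int = 20) -> bool:
    """Assumed heuristic: at least one image and almost no extractable text."""
    return page.image_count > 0 and page.text_chars < min_chars


def scanned_pages(pages: list[PageStats]) -> list[int]:
    """Return the page numbers that look image-only."""
    return [p.number for p in pages if looks_scanned(p)]


def diagnose(pages: list[PageStats]) -> str:
    """Give a one-line verdict for the whole document."""
    if not pages:
        return "no pages"
    flagged = scanned_pages(pages)
    if not flagged:
        return "text layer present"
    if len(flagged) == len(pages):
        return "looks scanned: all pages are image-only"
    return f"partly scanned: {len(flagged)} of {len(pages)} pages are image-only"
```

A document that mixes real text pages with scanned inserts comes back as "partly scanned", which is the case the selection test tends to miss if you only try the first page.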
2 · Your actual options, best first
- Find the born-digital source. Most scans are copies of something that exists properly elsewhere — the original web page, the publisher's HTML version, the .docx it was printed from. Converting that beats OCR-ing its photograph every time.
- Re-export instead of re-scanning. If the document is yours, export or print-to-PDF from the source app: that normally produces a real text layer, which a plain scan of the printout lacks.
- OCR as the last resort. Optical character recognition guesses characters from pixels — good on clean print, shaky on tables, columns and handwriting. The converter now offers this directly: when a scan is detected, a Try OCR (experimental) button appears in the fidelity panel. It runs locally (~8 MB engine, downloaded once) and its output is labelled approximate — treat it as a draft needing a proof-read, not a source of truth.
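One way to make the proof-read practical: many OCR engines report a confidence value per recognised word (Tesseract, for example, uses a 0 to 100 scale). The sketch below assumes the OCR output has already been reduced to `(word, confidence)` pairs. That input shape, the function names and the threshold of 80 are assumptions for illustration, not the output format of the converter's OCR button.

```python
def words_to_proofread(
    words: list[tuple[str, float]], threshold: float = 80.0
) -> list[str]:
    """Return recognised words whose confidence falls below the threshold."""
    return [text for text, conf in words if conf < threshold]


def render_draft(words: list[tuple[str, float]], threshold: float = 80.0) -> str:
    """Join the words into a draft, wrapping doubtful ones in [?...] markers."""
    return " ".join(
        f"[?{text}]" if conf < threshold else text for text, conf in words
    )
```

Marking doubtful words inline keeps the guess visible in the draft itself, so a reader checks those words against the scan instead of trusting them.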
3 · Why scans are flagged, not silently converted
A converter that silently OCRs hands over plausible text with invisible errors baked in, which is the exact failure mode this site exists to prevent. So OCR is an explicit, opt-in button, never a default. It runs locally in the browser like everything else; its result replaces the conversion and carries every warning ("approximate", per-page recognition counts); and the QC score is capped to reflect that. First a detected gap, then a labelled guess: each is marked as what it is, the same rule the rest of the fidelity report follows.
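The "labelled guess" rule can be sketched as a small data transformation: OCR text goes into the report only together with its warnings and a capped score. Everything named here is hypothetical, including `FidelityReport`, `apply_ocr` and the cap of 60; the article does not state the real cap or the report's internal shape.

```python
from dataclasses import dataclass, field

OCR_SCORE_CAP = 60  # assumed ceiling; the real cap is not stated in the article


@dataclass
class FidelityReport:
    """Hypothetical stand-in for a conversion's fidelity report."""
    score: int
    warnings: list[str] = field(default_factory=list)
    text: str = ""


def apply_ocr(report: FidelityReport, ocr_pages: dict[int, str]) -> FidelityReport:
    """Return a new report whose text comes from OCR, labelled and score-capped."""
    ordered = sorted(ocr_pages.items())  # page number order
    warnings = list(report.warnings)     # copy, so the input report is untouched
    warnings.append("approximate: text produced by OCR")
    for page, text in ordered:
        warnings.append(f"page {page}: {len(text.split())} words recognised")
    return FidelityReport(
        score=min(report.score, OCR_SCORE_CAP),
        warnings=warnings,
        text="\n\n".join(text for _, text in ordered),
    )
```

Because the earlier "looks scanned" warning is kept and the OCR labels are appended after it, the report shows both steps in order: the gap that was detected, then the guess that filled it.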
Not sure whether your PDF is a scan? Drop it in — the fidelity report answers in one line.