Fixes · 06

Scanned PDFs: what text converters can't read — and what OCR can

Some PDFs contain text. Others contain pictures of text — scans, photographed pages, faxes reborn as PDFs, and exports from apps that "print" to an image. To every text extractor on earth, the second kind is a photo album. There is nothing to extract, because nothing textual is there.

1 · Diagnose it in five seconds
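
A quick manual test is to try selecting or searching for a word in the PDF viewer; if nothing can be selected, the pages are probably images. The same check can be sketched in code. This is a minimal sketch, not the converter's actual detector: it assumes you already have one string of extracted text per page from any extractor, and the function name `looks_like_scan` and both thresholds are hypothetical.

```python
def looks_like_scan(page_texts, min_chars_per_page=20, max_text_page_ratio=0.1):
    """Guess whether a PDF is a scan from the text extracted per page.

    page_texts: list of strings, one per page, from any text extractor.
    The thresholds are illustrative assumptions, not values the converter uses.
    """
    if not page_texts:
        return False  # no pages, nothing to judge

    # Count pages that carry a meaningful amount of non-whitespace text.
    pages_with_text = sum(
        1
        for text in page_texts
        if len("".join(text.split())) >= min_chars_per_page
    )

    # A scan has (almost) no pages with a real text layer.
    return pages_with_text / len(page_texts) <= max_text_page_ratio
```

A ratio is used instead of "any page is empty" because born-digital PDFs often include a blank or image-only page, such as a cover, without being scans.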

2 · Your actual options, best first

  1. Find the born-digital source. Most scans are copies of something that exists properly elsewhere — the original web page, the publisher's HTML version, the .docx it was printed from. Converting that beats OCR-ing its photograph every time.
  2. Re-export instead of re-scanning. If the document is yours, exporting or printing to PDF from the source app normally produces a real text layer.
  3. OCR as the last resort. Optical character recognition guesses characters from pixels — good on clean print, shaky on tables, columns and handwriting. The converter now offers this directly: when a scan is detected, a Try OCR (experimental) button appears in the fidelity panel. It runs locally (~8 MB engine, downloaded once) and its output is labelled approximate — treat it as a draft needing a proof-read, not a source of truth.

Figure: What a detected scan offers: opt-in, local, and labelled approximate before you click.

3 · Why scans are flagged, not silently converted

A converter that silently OCRs hands over plausible text with invisible errors baked in, which is exactly the failure mode this site exists to prevent. So OCR is an explicit, opt-in button, never a default. It runs locally in the browser like everything else. Its result replaces the conversion and carries every warning ("approximate", per-page recognition counts), and the QC score is capped to reflect that. First a detected gap, then a labelled guess: each is marked as what it is, the same rule the rest of the fidelity report follows.
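
The labelling rule can be sketched as a small data structure. This is an illustration under stated assumptions, not the site's code: the names `OcrPage`, `ConversionResult` and `label_ocr_result` are hypothetical, the cap value of 60 is invented for the example (the article does not state the real one), and "recognition count" is assumed here to mean words recognised per page.

```python
from dataclasses import dataclass, field

# Hypothetical ceiling: the real cap is not stated in the article.
OCR_QC_CAP = 60


@dataclass
class OcrPage:
    text: str
    words_recognised: int  # assumed meaning of "per-page recognition count"


@dataclass
class ConversionResult:
    text: str
    qc_score: int
    warnings: list = field(default_factory=list)


def label_ocr_result(pages, raw_qc_score, cap=OCR_QC_CAP):
    """Wrap OCR output so it can never pass as a clean conversion."""
    # The "approximate" label always comes first, before any page detail.
    warnings = ["approximate: text was produced by OCR and needs a proof-read"]
    for number, page in enumerate(pages, start=1):
        warnings.append(f"page {number}: {page.words_recognised} words recognised")

    return ConversionResult(
        text="\n\n".join(page.text for page in pages),
        # Cap the score: OCR output cannot outrank a true text-layer conversion.
        qc_score=min(raw_qc_score, cap),
        warnings=warnings,
    )
```

Because the warnings and the capped score travel inside the result object, any code that displays the text has the labels to hand, which is the point of flagging rather than converting silently.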

Not sure whether your PDF is a scan? Drop it in — the fidelity report answers in one line.