Feeding webpages to an LLM
HTML is the best-structured source most people ever feed a model — real headings, real tables, real links — wrapped in the worst noise: navigation, ads, consent banners, recommendation sidebars. The conversion problem isn't recovering structure (it's there); it's finding the content.
How article extraction works
We run Mozilla's Readability — the algorithm family behind Reader View — in your browser. It scores the page's element tree for the densest coherent block of text and discards the rest. The surviving article then converts to Markdown with its structure intact.
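To make the scoring idea concrete, here is a deliberately simplified sketch. It is not Readability's actual algorithm: it assumes a toy rule in which each container block earns one point per character of plain text and loses one point per character of link text, and the densest block wins. The names `BlockScorer`, `densest_block`, and `BLOCK_TAGS` are hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser

# Containers we score in this toy example (an assumption, not Readability's rule set).
BLOCK_TAGS = {"article", "main", "section", "div", "nav", "aside", "footer"}


class BlockScorer(HTMLParser):
    """Credits each innermost container with its text length; link text counts against it."""

    def __init__(self):
        super().__init__()
        self.open_blocks = []  # stack of [tag, label, score] for blocks still open
        self.link_depth = 0    # greater than 0 while inside an <a> element
        self.scores = []       # (label, score) for every closed block

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag in BLOCK_TAGS:
            css_class = dict(attrs).get("class")
            label = f"{tag}.{css_class}" if css_class else tag
            self.open_blocks.append([tag, label, 0])

    def handle_endtag(self, tag):
        if tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif self.open_blocks and self.open_blocks[-1][0] == tag:
            _, label, score = self.open_blocks.pop()
            self.scores.append((label, score))

    def handle_data(self, data):
        length = len(data.strip())
        if length == 0 or not self.open_blocks:
            return
        # Plain text raises the innermost block's score; link text lowers it.
        self.open_blocks[-1][2] += -length if self.link_depth else length


def densest_block(html: str) -> str:
    """Return the label of the highest-scoring block, or '' if none was found."""
    scorer = BlockScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.scores:
        return ""
    return max(scorer.scores, key=lambda item: item[1])[0]


page = """
<nav><a href="/">Home</a> <a href="/pricing">Pricing</a> <a href="/blog">Blog</a></nav>
<div class="cookie-banner">We value your privacy.</div>
<article><h2>The measurement</h2><p>We logged every request for a week
and compared the totals against the billing export.</p></article>
<aside><a href="/related">Related: 14 stories</a></aside>
"""

print(densest_block(page))  # article
```

The link penalty is what pushes navigation and "related" sidebars to the bottom of the ranking. A rule this crude also shows why short articles can lose to long comment threads: the score only knows about text volume.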
The element mapping
| In the page | In the Markdown |
|---|---|
| Article headings | The #/## outline (levels normalized) |
| Article tables | GFM pipe tables |
| Images + captions | [Figure: …] placeholders with captions; original URLs kept in the figure list |
| Links | Inline Markdown links; relative links are flagged in the fidelity report (a saved file can't resolve /pricing) |
| Code blocks | Fenced blocks |
| Nav, ads, banners, sidebars, footers | Discarded by the extraction pass |
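The link row of the table can be sketched as follows. This is an illustration under stated assumptions, not the converter's implementation: the helper names `is_relative` and `convert_links` are hypothetical, and treating in-page anchors such as `#top` as unflagged is our own choice for the example.

```python
from urllib.parse import urlsplit


def is_relative(href: str) -> bool:
    """True when the href has no scheme and no host, so a saved file cannot resolve it."""
    if href.startswith("#"):
        return False  # assumption: in-page anchors are left unflagged
    parts = urlsplit(href)
    return not parts.scheme and not parts.netloc


def convert_links(links):
    """Turn (text, href) pairs into inline Markdown links and collect relative hrefs."""
    markdown = [f"[{text}]({href})" for text, href in links]
    flagged = [href for _, href in links if is_relative(href)]
    return markdown, flagged


markdown, flagged = convert_links([
    ("Docs", "https://example.com/docs"),
    ("Pricing", "/pricing"),
    ("Back to top", "#top"),
])

print(markdown[1])  # [Pricing](/pricing)
print(flagged)      # ['/pricing']
```

The link text and target are preserved either way; the flag only tells the reader that `/pricing` has lost the site it used to be relative to.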
Before → after
In the file
<nav>Home · Pricing · Blog</nav>
<div class="cookie-banner">We value your privacy…</div>
<article><h2>The measurement</h2><p>We logged…</p></article>
<aside>Related: 14 stories</aside>
In the Markdown
## The measurement
We logged…
(nav, cookie banner and sidebar discarded by extraction)
Honest limits
- JavaScript-rendered pages: what you saved is what converts. Save the page after it has rendered (Ctrl/Cmd-S in the browser), not via a raw source download.
- Extraction can pick the wrong block on unusual layouts — comment threads have out-scored short articles before. The fidelity report's detected counts (title? sections? the table you expected?) are how you catch it in five seconds.
- No URL fetching. You give us the file, not the address — the site makes no network requests with your content, which is the whole privacy model. Saving the page first is the one extra step.
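The kind of check the detected counts support can be approximated by hand. The sketch below is an assumption about what such counts might look like, not the fidelity report's actual format: the function name `detected_counts` and the keys `title`, `sections`, and `tables` are hypothetical, and the sample document is made up.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+\S")
# A GFM table delimiter row such as |---|---| or | :--- | ---: |
TABLE_DELIMITER = re.compile(r"^\s*\|?(\s*:?-+:?\s*\|)+(\s*:?-+:?\s*)?$")
FENCE = "`" * 3  # fenced code block marker


def detected_counts(markdown: str) -> dict:
    """Count the title, section headings, and tables in converted Markdown."""
    counts = {"title": False, "sections": 0, "tables": 0}
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # skip everything inside code fences
            continue
        if in_fence:
            continue
        heading = HEADING.match(line)
        if heading:
            if len(heading.group(1)) == 1:
                counts["title"] = True
            else:
                counts["sections"] += 1
        elif TABLE_DELIMITER.match(line):
            counts["tables"] += 1
    return counts


sample = "\n".join([
    "# Quarterly report",
    "## The measurement",
    "We logged…",
    "| Region | Requests |",
    "|---|---|",
    "| EU | 1200 |",
])

print(detected_counts(sample))  # {'title': True, 'sections': 1, 'tables': 1}
```

A missing title, or a zero where you expected a table, is the signal that extraction probably kept the wrong block.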
FAQ
Can I paste HTML instead of saving a file? Yes: paste the page source onto the landing page with Ctrl/Cmd-V.
Documentation pages with heavy chrome? That's the sweet spot — see the junk-flood walkthrough in the fix article.
Whole sites? One page per file today; batch-drop several saved pages at once and the .zip bundles them.
Try the sample article — nav and ads in, clean outline out.