Format guide · .html

Feeding webpages to an LLM

HTML is the best-structured source most people ever feed a model — real headings, real tables, real links — wrapped in the worst noise: navigation, ads, consent banners, recommendation sidebars. The conversion problem isn't recovering structure (it's there); it's finding the content.

How article extraction works

We run Mozilla's Readability, the library behind Firefox's Reader View, in your browser. It scores the page's element tree to find the densest coherent block of text and discards the rest. The surviving article is then converted to Markdown with its structure intact.
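To make the scoring idea concrete, here is a minimal sketch in Python. It is not Readability's code: it is a simplified stand-in, loosely modeled on the same idea, in which each paragraph's non-link text is credited to its parent element (full weight) and grandparent (half weight), and the highest-scoring container wins. The names ParagraphScorer and best_container, and the weights, are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements with no closing tag; they must not be pushed onto the open-element chain.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One open element: its tag, its parent, and the score it has collected."""

    def __init__(self, tag, parent):
        self.tag = tag
        self.parent = parent
        self.score = 0.0
        self.text_chars = 0  # filled in for <p> nodes only
        self.link_chars = 0  # filled in for <p> nodes only


class ParagraphScorer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = Node("#root", None)
        self.scored = []  # containers that received paragraph credit

    def _inside(self, tag):
        """Return the nearest open element with this tag, or None."""
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        return node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.current = Node(tag, self.current)

    def handle_data(self, data):
        paragraph = self._inside("p")
        if paragraph is None:
            return  # text outside a paragraph earns nothing in this sketch
        chars = len(data.strip())
        paragraph.text_chars += chars
        if self._inside("a") is not None:
            paragraph.link_chars += chars

    def handle_endtag(self, tag):
        target = self._inside(tag)
        if target is None:
            return  # stray closing tag
        node = self.current
        while True:  # credit every paragraph closed by this tag
            if node.tag == "p":
                self._credit(node)
            if node is target:
                break
            node = node.parent
        self.current = target.parent

    def _credit(self, paragraph):
        if paragraph.text_chars == 0:
            return
        link_density = paragraph.link_chars / paragraph.text_chars
        value = paragraph.text_chars * (1.0 - link_density)
        container, weight = paragraph.parent, 1.0
        for _ in range(2):  # parent gets full credit, grandparent half
            if container is None or container.tag == "#root":
                break
            if container not in self.scored:
                self.scored.append(container)
            container.score += value * weight
            container, weight = container.parent, 0.5


def best_container(html):
    """Return the tag name of the highest-scoring container, or None."""
    scorer = ParagraphScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.scored:
        return None
    return max(scorer.scored, key=lambda node: node.score).tag


SAMPLE = """<nav>Home · Pricing · Blog</nav>
<div class="cookie-banner">We value your privacy…</div>
<article><h2>The measurement</h2><p>We logged…</p></article>
<aside>Related: 14 stories</aside>"""
print(best_container(SAMPLE))  # article
```

A link-heavy paragraph scores close to zero, which is why navigation blocks tend to lose even when they contain plenty of characters. Real pages need much more than this (class-name hints, unclosed paragraphs, sibling merging), so treat it as an intuition aid only.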

The element mapping

In the page → In the Markdown
Article headings → The #/## outline (levels normalized)
Article tables → GFM pipe tables
Images + captions → [Figure: …] placeholders with captions; original URLs kept in the figure list
Links → Inline Markdown links; relative links are flagged in the fidelity report (a saved file can't resolve /pricing)
Code blocks → Fenced blocks
Nav, ads, banners, sidebars, footers → Discarded by the extraction pass
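The relative-link check in the mapping above can be pictured with a short sketch. This is a hypothetical helper, not the tool's fidelity report: the function name relative_links and the regular expression are assumptions, and the rule used here (no URL scheme and not an in-page #anchor) is one reasonable reading of "relative".

```python
import re
from urllib.parse import urlsplit

# Matches inline Markdown links of the form [text](url); an illustrative
# pattern that ignores titles and nested brackets.
LINK_RE = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")


def relative_links(markdown):
    """Return link targets that a saved file could not resolve on its own."""
    flagged = []
    for match in LINK_RE.finditer(markdown):
        url = match.group(2)
        if url.startswith("#"):
            continue  # in-page anchors still work inside the same document
        if not urlsplit(url).scheme:
            flagged.append(url)  # no scheme: depends on the original site
    return flagged


text = "See [pricing](/pricing), [docs](https://example.com/docs) and [top](#top)."
print(relative_links(text))  # ['/pricing']
```

Flagging rather than rewriting is the cautious choice when the page's original URL is unknown, since there is nothing to resolve /pricing against.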

Before → after

In the file

<nav>Home · Pricing · Blog</nav>
<div class="cookie-banner">We value your privacy…</div>
<article><h2>The measurement</h2><p>We logged…</p></article>
<aside>Related: 14 stories</aside>

In the Markdown

## The measurement

We logged…

(nav, cookie banner and sidebar discarded by extraction)
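The before → after pair above can be reproduced with a small Python sketch. It assumes the page marks its content with an <article> element and simply keeps headings and paragraphs found inside it; that shortcut stands in for the real extraction pass, and the class name ArticleToMarkdown is made up for this example.

```python
from html.parser import HTMLParser

# Heading tag to Markdown prefix.
HEADINGS = {f"h{level}": "#" * level for level in range(1, 7)}


class ArticleToMarkdown(HTMLParser):
    """Keep headings and paragraphs inside <article>; ignore everything else."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0  # greater than 0 while inside <article>
        self.prefix = ""        # Markdown prefix for the block being read
        self.buffer = []        # text pieces of the block being read
        self.blocks = []        # finished Markdown blocks

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag in HEADINGS or tag == "p":
            self.prefix = HEADINGS.get(tag, "")
            self.buffer = []

    def handle_data(self, data):
        if self.article_depth:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "article":
            self.article_depth -= 1
        elif (tag in HEADINGS or tag == "p") and self.article_depth:
            text = " ".join("".join(self.buffer).split())  # collapse whitespace
            if text:
                self.blocks.append(f"{self.prefix} {text}".strip())
            self.buffer = []


def to_markdown(html):
    parser = ArticleToMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


PAGE = """<nav>Home · Pricing · Blog</nav>
<div class="cookie-banner">We value your privacy…</div>
<article><h2>The measurement</h2><p>We logged…</p></article>
<aside>Related: 14 stories</aside>"""
print(to_markdown(PAGE))
```

Run on the sample page, this prints the heading and the paragraph and nothing from the nav, banner, or sidebar. Pages without an <article> element are exactly where a scoring pass is needed instead of this shortcut.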

Honest limits

FAQ

Can I paste HTML instead of saving a file? Yes — paste page source onto the landing page with ⌘V.

Documentation pages with heavy chrome? That's the sweet spot — see the junk-flood walkthrough in the fix article.

Whole sites? One page per file today; batch-drop several saved pages at once and the .zip bundles them.
