RAG-ready Markdown: chunk boundaries, stable anchors, honest citations

MakeItMarkdown · July 2026 · 8 min read

The point: most RAG quality problems are chunking problems, and chunking problems start at the source format — fix them before embedding.

Most RAG quality problems are chunking problems. And most chunking problems are input problems: the chunker was handed a flat wall of extracted text and told to find topic boundaries with a ruler. This article is about fixing retrieval at the source — before the embedding model ever runs.

1 · Thirty seconds of background

A retrieval-augmented pipeline ingests documents by splitting them into chunks, embedding each chunk as a vector, and — at question time — pulling the closest chunks into the model's context. Everything downstream (answer accuracy, citations, hallucination rate) inherits from one early decision: where the splits fall.

2 · What a ruler does to a document

The default strategy — fixed windows of N characters with overlap — knows nothing about your document. On raw PDF extractions it reliably produces the three classic failure chunks:

The half-definition — a split lands mid-sentence, so the condition ends up in chunk 12 and its exception in chunk 13; retrieval surfaces one without the other.
The beheaded table — header row in one chunk, data rows in the next; the numbers arrive with no column names.
The orphan context — a chunk starting with "as shown above, this fails when…" — referring to text the model will never see.

Overlap papers over some of this at the cost of duplicated tokens and duplicated retrievals. The real fix is upstream: give the chunker boundaries that mean something.
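To see the beheaded table happen, here is a minimal sketch of a fixed-window chunker. The function name, the window sizes, and the "South" row are assumptions made up for illustration; only the "North" row comes from the sample output shown later in this article.

```python
def fixed_window_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters; each window repeats
    the last `overlap` characters of the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Hypothetical extracted text: a table header followed by two data rows.
sample = (
    "region | units | revenue\n"
    "North | 1204 | 96432.10\n"
    "South | 980 | 71200.00\n"
)

chunks = fixed_window_chunks(sample, size=40, overlap=8)
for n, chunk in enumerate(chunks):
    print(n, repr(chunk))
```

With these numbers the header row lands in the first chunk and the revenue figures in the second, so a retrieved second chunk carries numbers with no column names. Raising the overlap only moves the cut; it does not teach the chunker where the table ends.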

3 · Headings are free chunk boundaries

A well-structured Markdown document already contains the splits a chunker wishes it could infer: section headings are topic boundaries, written by the one entity that knew the topics — the author. Chunk at headings and each chunk becomes a coherent unit with a self-describing label; tables stay whole because they live inside a section; "as shown above" mostly refers to text within the same chunk.
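Heading-based chunking might look like the sketch below. It assumes ATX-style headings (lines starting with one to six # characters) and, as a known limitation, does not skip # lines inside fenced code blocks. The function name and the sample document are hypothetical.

```python
import re

# An ATX heading: 1-6 '#' characters, spaces or tabs, then the title.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def split_at_headings(markdown: str) -> list[tuple[str, str]]:
    """Return (title, section_text) pairs, one per heading.
    Text before the first heading is returned with an empty title."""
    matches = list(HEADING.finditer(markdown))
    sections = []
    first_start = matches[0].start() if matches else len(markdown)
    preamble = markdown[:first_start].strip()
    if preamble:
        sections.append(("", preamble))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        sections.append((m.group(2), markdown[m.start():end].strip()))
    return sections


# Hypothetical document used only to exercise the function.
doc = """Intro paragraph.

## Quarterly results

| region | units |
| --- | --- |
| North | 1204 |

## Methodology

Units are counted per region.
"""

for title, text in split_at_headings(doc):
    print(repr(title), len(text))
```

Each section keeps its heading line as the first line of the chunk, which is what gives the chunk its self-describing label, and the table stays in one piece because the split can only fall on a heading.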

4 · What the RAG preset emits

MakeItMarkdown's RAG preset makes those boundaries machine-obvious instead of implicit. It adds two things to the standard output, both invisible in any Markdown renderer that passes raw HTML through:

<!-- chunk: quarterly-results -->
<a id="quarterly-results"></a>
## Quarterly results

| region (str) | units (int) | revenue (float) |
| --- | --- | --- |
| North | 1204 | 96432.10 |
…

<!-- chunk: methodology -->
<a id="methodology"></a>
## Methodology
…

<!-- chunk: slug --> comments mark every section boundary. Your ingestion script doesn't parse heading levels or guess; it splits on a literal string. Any language, three lines of code.
<a id="slug"></a> anchors give each section a stable address. Slugs are deduplicated deterministically (a second "Results" becomes results-2), so the same document converts to the same anchors every time.
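In Python, the split on the literal marker could be sketched as follows. The function name is hypothetical, and the sketch assumes the marker text appears only at section starts, exactly as in the sample output above.

```python
MARKER = "<!-- chunk: "


def split_on_chunk_markers(markdown: str) -> dict[str, str]:
    """Map each slug to its section text by splitting on the literal marker."""
    chunks = {}
    # Piece 0 is whatever precedes the first marker, so it is skipped.
    for piece in markdown.split(MARKER)[1:]:
        slug, _, body = piece.partition(" -->")
        chunks[slug.strip()] = body.strip()
    return chunks


sample = """<!-- chunk: quarterly-results -->
<a id="quarterly-results"></a>
## Quarterly results

| region (str) | units (int) | revenue (float) |
| --- | --- | --- |
| North | 1204 | 96432.10 |

<!-- chunk: methodology -->
<a id="methodology"></a>
## Methodology
"""

for slug, body in split_on_chunk_markers(sample).items():
    print(slug, len(body))
```

A dict keyed by slug works here only because slugs are deduplicated at conversion time; with non-unique keys a later section would silently overwrite an earlier one.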

5 · Stable anchors are what make citations honest

The quiet failure of RAG systems is the citation that can't be followed — "source: document.pdf, page ~14" pointing into a file whose extraction changed since indexing. Anchors fix the contract: store notes-05.md#convergence-conditions alongside each vector, and a citation resolves to an exact section — after re-indexing, after edits elsewhere in the file, in any Markdown viewer that renders HTML anchors, in your repo's rendered view. Deduplicated slugs double as chunk IDs for deduping retrievals and tracking chunk-level metrics.

The chunk carries its own address; the answer cites it; you can open the address and check.
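A rough sketch of both uses of the slug follows: collapsing duplicate retrievals by chunk ID, and resolving a stored citation back to its section. The Hit class, the function names, and the in-memory files dict are assumptions for illustration; a real pipeline would read from its own vector store and file system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    citation: str  # "filename#slug", stored alongside the vector at index time
    score: float   # similarity score from the vector store (higher is closer)


def dedupe_hits(hits: list[Hit]) -> list[Hit]:
    """Keep the best-scoring hit per citation, best first."""
    best: dict[str, Hit] = {}
    for hit in hits:
        if hit.citation not in best or hit.score > best[hit.citation].score:
            best[hit.citation] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


def resolve(citation: str, files: dict[str, str]) -> str:
    """Return the section text that a 'filename#slug' citation points to."""
    filename, _, slug = citation.partition("#")
    text = files[filename]
    start = text.find(f'<a id="{slug}"></a>')
    if start == -1:
        raise KeyError(f"no anchor {slug!r} in {filename}")
    # The section runs until the next chunk marker, or to the end of the file.
    next_marker = text.find("<!-- chunk: ", start)
    end = next_marker if next_marker != -1 else len(text)
    return text[start:end].strip()


# Hypothetical file contents standing in for the converted document.
files = {
    "notes-05.md": (
        '<!-- chunk: intro -->\n<a id="intro"></a>\n## Intro\n\n'
        "Hypothetical intro text.\n\n"
        "<!-- chunk: convergence-conditions -->\n"
        '<a id="convergence-conditions"></a>\n## Convergence conditions\n\n'
        "Hypothetical section body.\n"
    )
}

print(resolve("notes-05.md#convergence-conditions", files))
```

Because resolve raises when the anchor is missing, a citation that no longer points anywhere fails loudly instead of quietly pointing at the wrong text.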

6 · Keep the fidelity signals in the index

Two more properties of the converted output matter specifically for pipelines:

Figure placeholders are signal. [Figure: cell_12_figure_1.png] embeds cheaply and tells the model a figure exists at that point — far better than a stripped image leaving no trace, and enormously better than a base64 wall poisoning the embedding. (One notebook we tested hid 515 KB of base64 in a single cell — as an embedding input that's pure noise.)
Explicit truncation prevents confident nonsense. A table cut at 50 rows says so in the text, so a retrieved chunk can't masquerade as the whole dataset.

Both come from the same design rule as the fidelity report: never let loss be silent.
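On the ingestion side, one way to put the figure signal to work is sketched below: collect figure placeholders as chunk metadata, and refuse to embed anything that still contains a long base64-looking run. The placeholder pattern follows the [Figure: ...] example above; the function names and the 200-character threshold are assumptions, not part of the converter. Truncation notices are left out because their exact wording isn't given here.

```python
import re

# Matches placeholders shaped like "[Figure: cell_12_figure_1.png]".
FIGURE = re.compile(r"\[Figure: ([^\]]+)\]")

# A long unbroken run of base64-alphabet characters.
# The 200-character threshold is an arbitrary assumption; tune it for your data.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/=]{200,}")


def figure_refs(chunk: str) -> list[str]:
    """Return the figure names mentioned by placeholders in a chunk."""
    return FIGURE.findall(chunk)


def has_base64_blob(chunk: str) -> bool:
    """True if the chunk contains a run that looks like embedded base64 data."""
    return BASE64_RUN.search(chunk) is not None


chunk = "Loss drops after epoch 3. [Figure: cell_12_figure_1.png] See the table."
print(figure_refs(chunk))
print(has_base64_blob(chunk))
```

Storing the figure names as metadata lets an answer mention that a figure exists at that point, and the base64 check is a cheap guard for input that did not come through the converter.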

7 · Recipe: files to chunks in three steps

Convert with the RAG preset (batch drop works; everything runs in your browser).
In your ingestion script, split on <!-- chunk: and index each piece with its slug and source filename as metadata. Oversized sections can still be sub-split by paragraph — the slug keeps the lineage.
At answer time, return filename#slug with every quote. Your citations now survive re-indexing.
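Steps 2 and 3 could be sketched as below. The record fields, the function names, and the max_chars default are assumptions for illustration; the embedding call and the vector store are left out because they depend on your stack.

```python
def sub_split(body: str, max_chars: int) -> list[str]:
    """Greedily pack paragraphs into pieces of at most max_chars.
    A single paragraph longer than max_chars is kept whole."""
    pieces, current = [], ""
    for para in (p.strip() for p in body.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            pieces.append(current)
            current = para
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def build_records(filename: str, markdown: str, max_chars: int = 2000) -> list[dict]:
    """Split on chunk markers and attach slug and source filename to each piece."""
    records = []
    for piece in markdown.split("<!-- chunk: ")[1:]:
        slug, _, body = piece.partition(" -->")
        slug = slug.strip()
        for part, text in enumerate(sub_split(body.strip(), max_chars)):
            records.append({
                "text": text,                        # what gets embedded
                "source": filename,
                "slug": slug,                        # lineage survives sub-splitting
                "part": part,
                "citation": f"{filename}#{slug}",    # returned with every quote
            })
    return records


# Hypothetical converted document.
markdown = (
    '<!-- chunk: results -->\n<a id="results"></a>\n## Results\n\n'
    "First paragraph.\n\nSecond paragraph.\n\n"
    '<!-- chunk: methodology -->\n<a id="methodology"></a>\n## Methodology\n\n'
    "Only paragraph.\n"
)

for record in build_records("report.md", markdown, max_chars=60):
    print(record["citation"], record["part"], len(record["text"]))
```

Splitting on blank lines keeps a Markdown table in one piece, since a table contains no blank lines, and every sub-piece cites the same filename#slug, so a quote from part 1 still resolves to the section it came from.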

8 · Where this is heading

The same properties that help retrieval — addressable sections, explicit structure, honest truncation — are what agentic workflows need: an assistant that can be told "read report.md, section methodology" navigates instead of re-reading. Agent-oriented output is on our roadmap; if you're building in that direction, tell us what your agents choke on.

Convert a document with the RAG preset and inspect the chunk boundaries yourself — this link preloads a sample with the preset switched on.