Advanced · 01

RAG-ready Markdown: chunk boundaries, stable anchors, honest citations

The point: most RAG quality problems are chunking problems, and chunking problems start at the source format — fix them before embedding.

Most RAG quality problems are chunking problems. And most chunking problems are input problems: the chunker was handed a flat wall of extracted text and told to find topic boundaries with a ruler. This article is about fixing retrieval at the source — before the embedding model ever runs.

1 · Thirty seconds of background

A retrieval-augmented pipeline ingests documents by splitting them into chunks, embedding each chunk as a vector, and — at question time — pulling the closest chunks into the model's context. Everything downstream (answer accuracy, citations, hallucination rate) inherits from one early decision: where the splits fall.
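As a rough illustration of that loop, here is a toy sketch in Python. The `embed` function is a bag-of-words stand-in for a real embedding model, and the names `embed`, `cosine`, and `retrieve` are hypothetical, chosen only for this example. The shape is what matters: chunks become vectors, and the question pulls the closest ones.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks whose vectors are closest to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Made-up sample chunks, for illustration only.
chunks = [
    "Quarterly results: revenue by region and units sold.",
    "Methodology: how the survey sample was drawn.",
]
top = retrieve("How was the sample drawn?", chunks)
```

Whatever the embedding model, the ranking can only return chunks that exist, so a chunk that starts or ends in the wrong place is retrieved exactly as it was cut.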

2 · What a ruler does to a document

The default strategy — fixed windows of N characters with overlap — knows nothing about your document. On raw PDF extractions it reliably produces the three classic failure chunks:

Overlap papers over some of this at the cost of duplicated tokens and duplicated retrievals. The real fix is upstream: give the chunker boundaries that mean something.
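A minimal sketch of the ruler itself makes the problem concrete. The function name `fixed_windows` and the sample text are assumptions made for this example; real chunkers usually count tokens rather than characters, but the behaviour is the same.

```python
def fixed_windows(text: str, size: int, overlap: int) -> list[str]:
    """Cut text into windows of `size` characters, each sharing `overlap`
    characters with the previous window. Structure is ignored entirely."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# Made-up sample document, for illustration only.
text = "## Results\nThe cached path wins.\n## Methodology\nWe ran ten trials."
windows = fixed_windows(text, size=30, overlap=10)

for w in windows:
    print(repr(w))
```

Printing the windows shows cuts landing mid-sentence and mid-heading, with the overlapping characters stored twice.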

3 · Headings are free chunk boundaries

A well-structured Markdown document already contains the splits a chunker wishes it could infer: section headings are topic boundaries, written by the one entity that knew the topics — the author. Chunk at headings and each chunk becomes a coherent unit with a self-describing label; tables stay whole because they live inside a section; "as shown above" mostly refers to text within the same chunk.
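Heading-based splitting can be sketched in a few lines of Python. This is an illustrative simplification, not a full Markdown parser: it assumes ATX-style `#` headings, ignores any text before the first heading, and does not skip `#` lines inside fenced code. The name `split_at_headings` is hypothetical.

```python
import re

# ATX headings: one to six '#', a space or tab, then the title.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def split_at_headings(markdown: str) -> list[tuple[str, str]]:
    """Return (heading title, section text) pairs, one per heading."""
    matches = list(HEADING.finditer(markdown))
    sections = []
    for i, m in enumerate(matches):
        # A section runs from its heading to the next heading (or end of file).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        sections.append((m.group(2), markdown[m.start():end].strip()))
    return sections


# Made-up sample document, for illustration only.
doc = (
    "## Quarterly results\n\n"
    "| region | units |\n| --- | --- |\n| North | 1204 |\n\n"
    "## Methodology\n\n"
    "Samples were drawn per region.\n"
)
sections = split_at_headings(doc)
```

Each section keeps its heading as the first line, so the chunk carries its own label, and the table stays in one piece because no cut can fall inside a section.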

4 · What the RAG preset emits

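MakeItMarkdown's RAG preset makes those boundaries machine-obvious instead of implicit. It adds two things to the standard output, both inert in any Markdown renderer. Each section heading is preceded by an HTML comment naming the chunk and an empty anchor carrying the same slug: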

```markdown
<!-- chunk: quarterly-results -->
<a id="quarterly-results"></a>
## Quarterly results

| region (str) | units (int) | revenue (float) |
| --- | --- | --- |
| North | 1204 | 96432.10 |
…

<!-- chunk: methodology -->
<a id="methodology"></a>
## Methodology
…
```

5 · Stable anchors are what make citations honest

The quiet failure of RAG systems is the citation that can't be followed: "source: document.pdf, page ~14" pointing into a file whose extraction changed since indexing. Anchors fix the contract. Store an address such as notes-05.md#convergence-conditions alongside each vector, and a citation resolves to an exact section after re-indexing, after edits elsewhere in the file, in any Markdown viewer that renders HTML anchors, and in your repo's rendered view. Deduplicated slugs double as chunk IDs for deduping retrievals and tracking chunk-level metrics.

[Figure: a chunk marked <!-- chunk: results --> under the heading "## Results", containing "the cached path wins on every measured run"; an answer quoting "…wins on every run" with source: report.md#results. An answer that must cite an anchor can be audited at that anchor.]
The chunk carries its own address; the answer cites it; you can open the address and check.
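Opening the address and checking can itself be scripted. The sketch below assumes the marker and anchor layout shown in section 4. The names `resolve` and `quote_in_section` are hypothetical helpers for this example, not part of MakeItMarkdown.

```python
def resolve(citation: str, files: dict[str, str]) -> str:
    """Return the section text that a 'filename#slug' citation points at."""
    filename, _, slug = citation.partition("#")
    text = files[filename]
    anchor = f'<a id="{slug}"></a>'
    start = text.find(anchor)
    if start == -1:
        raise KeyError(f"no anchor {slug!r} in {filename}")
    # The section runs until the next chunk marker, or to the end of the file.
    nxt = text.find("<!-- chunk:", start)
    end = nxt if nxt != -1 else len(text)
    return text[start + len(anchor):end].strip()


def quote_in_section(quote: str, section: str) -> bool:
    """True if the quote appears in the section, ignoring case and whitespace runs."""
    def norm(s: str) -> str:
        return " ".join(s.split()).lower()
    return norm(quote) in norm(section)


# Sample file contents, for illustration only.
files = {
    "report.md": (
        '<!-- chunk: results -->\n<a id="results"></a>\n## Results\n\n'
        "The cached path wins on every measured run.\n\n"
        '<!-- chunk: methodology -->\n<a id="methodology"></a>\n## Methodology\n\n…\n'
    )
}

section = resolve("report.md#results", files)
exact = quote_in_section("wins on every measured run", section)
shortened = quote_in_section("wins on every run", section)
```

Here `exact` is true and `shortened` is false. A strict substring check like this flags a quote that drops a word, which is the kind of audit a page-number citation cannot support.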

6 · Keep the fidelity signals in the index

Two more properties of the converted output matter specifically for pipelines:

Both come from the same design rule as the fidelity report: never let loss be silent.

7 · Recipe: files to chunks in three steps

  1. Convert with the RAG preset (batch drop works; everything runs in your browser).
  2. In your ingestion script, split on the <!-- chunk: marker and index each piece with its slug and source filename as metadata. Oversized sections can still be sub-split by paragraph; the slug keeps the lineage.
  3. At answer time, return filename#slug with every quote. Your citations now survive re-indexing.
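Steps 2 and 3 can be sketched as follows. This assumes slugs contain no whitespace and that markers look exactly like the sample in section 4. The `Chunk` class, `split_chunks`, `sub_split`, and the `max_chars` threshold are hypothetical names and values chosen for this example.

```python
import re
from dataclasses import dataclass

# Assumed marker shape: <!-- chunk: some-slug -->
CHUNK_MARKER = re.compile(r"<!-- chunk: ([^\s>]+) -->")


@dataclass
class Chunk:
    source: str  # source filename
    slug: str    # slug taken from the chunk marker
    part: int    # index of the sub-split piece (0 if the section was not split)
    text: str

    @property
    def citation(self) -> str:
        """The address to return with every quote."""
        return f"{self.source}#{self.slug}"


def split_chunks(markdown: str, source: str, max_chars: int = 2000) -> list[Chunk]:
    # re.split with one capture group yields [preamble, slug1, body1, slug2, body2, ...]
    pieces = CHUNK_MARKER.split(markdown)
    chunks = []
    for slug, body in zip(pieces[1::2], pieces[2::2]):
        body = body.strip()
        parts = [body] if len(body) <= max_chars else sub_split(body, max_chars)
        for n, part in enumerate(parts):
            chunks.append(Chunk(source, slug, n, part))
    return chunks


def sub_split(body: str, max_chars: int) -> list[str]:
    """Greedily pack paragraphs into pieces of at most max_chars.
    A single paragraph longer than max_chars is kept whole."""
    parts, current = [], ""
    for para in re.split(r"\n\s*\n", body):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            parts.append(current)
            current = para
        else:
            current = candidate
    if current:
        parts.append(current)
    return parts


# Made-up sample document, for illustration only.
doc = (
    '<!-- chunk: quarterly-results -->\n<a id="quarterly-results"></a>\n'
    "## Quarterly results\n\n"
    "| region (str) | units (int) |\n| --- | --- |\n| North | 1204 |\n\n"
    '<!-- chunk: methodology -->\n<a id="methodology"></a>\n'
    "## Methodology\n\nFirst paragraph.\n\nSecond paragraph.\n"
)
chunks = split_chunks(doc, "report.md")
small = split_chunks(doc, "report.md", max_chars=60)
```

Every piece in `small` still reports the citation of the section it came from, so sub-splitting changes the granularity of retrieval without changing where a citation points.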

8 · Where this is heading

The same properties that help retrieval — addressable sections, explicit structure, honest truncation — are what agentic workflows need: an assistant that can be told "read report.md, section methodology" navigates instead of re-reading. Agent-oriented output is on our roadmap; if you're building in that direction, tell us what your agents choke on.

Convert a document with the RAG preset and inspect the chunk boundaries yourself — this link preloads a sample with the preset switched on.