RAG-ready Markdown: chunk boundaries, stable anchors, honest citations
The point: most RAG quality problems are chunking problems, and chunking problems start at the source format — fix them before embedding.
Most RAG quality problems are chunking problems. And most chunking problems are input problems: the chunker was handed a flat wall of extracted text and told to find topic boundaries with a ruler. This article is about fixing retrieval at the source — before the embedding model ever runs.
1 · Thirty seconds of background
A retrieval-augmented pipeline ingests documents by splitting them into chunks, embedding each chunk as a vector, and — at question time — pulling the closest chunks into the model's context. Everything downstream (answer accuracy, citations, hallucination rate) inherits from one early decision: where the splits fall.
2 · What a ruler does to a document
The default strategy — fixed windows of N characters with overlap — knows nothing about your document. On raw PDF extractions it reliably produces the three classic failure chunks:
- The half-definition — a split lands mid-sentence, so the condition ends up in chunk 12 and its exception in chunk 13; retrieval surfaces one without the other.
- The beheaded table — header row in one chunk, data rows in the next; the numbers arrive with no column names.
- The orphan context — a chunk starting with "as shown above, this fails when…" — referring to text the model will never see.
Overlap papers over some of this at the cost of duplicated tokens and duplicated retrievals. The real fix is upstream: give the chunker boundaries that mean something.
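To make the beheaded table concrete, here is a minimal sketch of a fixed-window chunker run over a tiny table. The function name, window size, and overlap are illustrative choices, not taken from any particular library.

```python
def fixed_window_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]


doc = (
    "| region | units |\n"
    "| --- | --- |\n"
    "| North | 1204 |\n"
    "| South | 987 |\n"
)
header = "| region | units |"

chunks = fixed_window_chunks(doc, size=40, overlap=8)

# Chunks that carry the South row but not the header row naming its columns.
beheaded = [c for c in chunks if "987" in c and header not in c]
```

With these settings the window boundary falls inside the table, so `beheaded` is non-empty: the South row is retrievable only without its column names. A larger overlap shrinks the problem but never guarantees the header travels with every row.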
3 · Headings are free chunk boundaries
A well-structured Markdown document already contains the splits a chunker wishes it could infer: section headings are topic boundaries, written by the one entity that knew the topics — the author. Chunk at headings and each chunk becomes a coherent unit with a self-describing label; tables stay whole because they live inside a section; "as shown above" mostly refers to text within the same chunk.
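As a rough illustration of chunking at headings, the sketch below splits on ATX headings (lines starting with one to six `#` characters). It is a simplification under stated assumptions: text before the first heading is dropped, and a `#` line inside a fenced code block would be mistaken for a heading. The function name is hypothetical.

```python
import re

# ATX heading at the start of a line, e.g. "## Methodology".
HEADING = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def split_at_headings(markdown: str) -> list[tuple[str, str]]:
    """Return (heading_title, section_text) pairs, one per heading."""
    matches = list(HEADING.finditer(markdown))
    sections = []
    for i, match in enumerate(matches):
        # A section runs from its heading to the start of the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        sections.append((match.group(2), markdown[match.start():end].strip()))
    return sections


doc = (
    "## Quarterly results\n"
    "| region | units |\n"
    "| --- | --- |\n"
    "| North | 1204 |\n"
    "\n"
    "## Methodology\n"
    "As shown above, units are counted per region.\n"
)
sections = split_at_headings(doc)
```

Here the table header and its data row land in the same section, and each chunk starts with the heading that labels it.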
4 · What the RAG preset emits
MakeItMarkdown's RAG preset makes those boundaries machine-obvious instead of implicit. Two additions to the standard output, both inert in any Markdown renderer:
<!-- chunk: quarterly-results -->
<a id="quarterly-results"></a>
## Quarterly results
| region (str) | units (int) | revenue (float) |
| --- | --- | --- |
| North | 1204 | 96432.10 |
…
<!-- chunk: methodology -->
<a id="methodology"></a>
## Methodology
…
- <!-- chunk: slug --> comments mark every section boundary. Your ingestion script doesn't parse heading levels or guess; it splits on a literal string. Any language, three lines of code.
- <a id="slug"></a> anchors give each section a stable address. Slugs are deduplicated deterministically (a second "Results" becomes results-2), so the same document converts to the same anchors every time.
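A minimal sketch of splitting on the marker, assuming the comment format is exactly as in the sample above. The regex and the function name are illustrative, not part of MakeItMarkdown.

```python
import re

# Matches the marker shown in the sample output, e.g. "<!-- chunk: methodology -->".
CHUNK_MARKER = re.compile(r"<!--\s*chunk:\s*(\S+?)\s*-->")


def split_on_markers(markdown: str) -> dict[str, str]:
    """Map each slug to the text between its marker and the next marker."""
    # With one capture group, re.split returns [preamble, slug1, body1, slug2, body2, ...].
    parts = CHUNK_MARKER.split(markdown)
    # Slugs are unique within a file, so a dict keyed by slug loses nothing.
    return {slug: body.strip() for slug, body in zip(parts[1::2], parts[2::2])}


sample = """<!-- chunk: quarterly-results -->
<a id="quarterly-results"></a>
## Quarterly results
| region (str) | units (int) | revenue (float) |
| --- | --- | --- |
| North | 1204 | 96432.10 |

<!-- chunk: methodology -->
<a id="methodology"></a>
## Methodology
"""
chunks = split_on_markers(sample)
```

Each value keeps its anchor, heading, and table together, and the slug comes straight from the marker rather than being re-derived from the heading text.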
5 · Stable anchors are what make citations honest
The quiet failure of RAG systems is the citation that can't be followed: "source: document.pdf, page ~14" pointing into a file whose extraction changed since indexing. Anchors fix the contract: store notes-05.md#convergence-conditions alongside each vector, and a citation resolves to an exact section after re-indexing, after edits elsewhere in the file, in any Markdown viewer that renders HTML anchors, and in your repo's rendered view. Deduplicated slugs double as chunk IDs for deduping retrievals and tracking chunk-level metrics.
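One way this could look in code: a sketch that builds filename#slug citations and uses them as chunk IDs to collapse duplicate retrievals. The `Hit` class and its fields are hypothetical, and the sketch assumes a higher score means a closer match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One retrieval result. Field names are illustrative, not a MakeItMarkdown API."""
    filename: str
    slug: str
    score: float  # assumption: higher means more similar

    @property
    def citation(self) -> str:
        # filename#slug, the address stored alongside each vector.
        return f"{self.filename}#{self.slug}"


def dedupe_hits(hits: list[Hit]) -> list[Hit]:
    """Keep the best-scoring hit per section, ordered by descending score."""
    best: dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.citation)
        if current is None or hit.score > current.score:
            best[hit.citation] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


hits = [
    Hit("notes-05.md", "convergence-conditions", 0.81),
    Hit("notes-05.md", "convergence-conditions", 0.77),
    Hit("notes-05.md", "results-2", 0.64),
]
citations = [h.citation for h in dedupe_hits(hits)]
# citations == ["notes-05.md#convergence-conditions", "notes-05.md#results-2"]
```

Two hits from the same section can occur when an oversized section was sub-split at indexing time; keying on the citation string collapses them to one source.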
6 · Keep the fidelity signals in the index
Two more properties of the converted output matter specifically for pipelines:
- Figure placeholders are signal. [Figure: cell_12_figure_1.png] embeds cheaply and tells the model a figure exists at that point: far better than a stripped image leaving no trace, and enormously better than a base64 wall poisoning the embedding. (One notebook we tested hid 515 KB of base64 in a single cell; as an embedding input that's pure noise.)
- Explicit truncation prevents confident nonsense. A table cut at 50 rows says so in the text, so a retrieved chunk can't masquerade as the whole dataset.
Both come from the same design rule as the fidelity report: never let loss be silent.
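On the ingestion side, the same rule can be enforced with a small pre-embedding check. The sketch below is an assumption-laden illustration: it recognizes the [Figure: …] placeholder format shown above and flags any long unbroken run of base64-alphabet characters, with 200 as an arbitrary threshold. The function name is hypothetical.

```python
import re

# Placeholder format as shown in the article, e.g. "[Figure: cell_12_figure_1.png]".
FIGURE = re.compile(r"\[Figure: ([^\]]+)\]")
# A long unbroken run of base64-alphabet characters; the threshold of 200 is arbitrary.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/=]{200,}")


def inspect_chunk(text: str) -> dict:
    """Report figure placeholders and suspected base64 payloads in a chunk."""
    return {
        "figures": FIGURE.findall(text),
        "has_base64_wall": bool(BASE64_RUN.search(text)),
    }


clean = inspect_chunk("Loss curve: [Figure: cell_12_figure_1.png] shows convergence.")
noisy = inspect_chunk("![plot](data:image/png;base64," + "A" * 500 + ")")
```

A chunk flagged with `has_base64_wall` is a candidate for re-conversion rather than embedding; the figure names can be stored as metadata next to the vector.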
7 · Recipe: files to chunks in three steps
- Convert with the RAG preset (batch drop works; everything runs in your browser).
- In your ingestion script, split on the literal string <!-- chunk: and index each piece with its slug and source filename as metadata. Oversized sections can still be sub-split by paragraph; the slug keeps the lineage.
- At answer time, return filename#slug with every quote. Your citations now survive re-indexing.
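The ingestion step might be sketched as follows. This is not a definitive implementation: the record fields, the `max_chars` limit, and the function name are assumptions, and the embedding and vector-store calls are left out because they depend on your stack.

```python
import re

# Marker format as in the sample output, e.g. "<!-- chunk: methodology -->".
CHUNK_MARKER = re.compile(r"<!--\s*chunk:\s*(\S+?)\s*-->")


def build_records(filename: str, markdown: str, max_chars: int = 2000) -> list[dict]:
    """Turn one converted file into index records carrying slug and filename metadata."""
    # With one capture group, re.split returns [preamble, slug1, body1, slug2, body2, ...].
    parts = CHUNK_MARKER.split(markdown)
    records = []
    for slug, body in zip(parts[1::2], parts[2::2]):
        body = body.strip()
        if len(body) <= max_chars:
            pieces = [body]
        else:
            # Sub-split oversized sections on blank lines; every piece keeps the parent slug.
            # A single paragraph longer than max_chars is left as-is in this sketch.
            pieces = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
        for part_no, piece in enumerate(pieces):
            records.append({
                "text": piece,
                "source": filename,
                "slug": slug,
                "part": part_no,
                "citation": f"{filename}#{slug}",
            })
    return records


doc = (
    "<!-- chunk: intro -->\n## Intro\nShort.\n"
    "<!-- chunk: methodology -->\n## Methodology\n"
    "First paragraph.\n\nSecond paragraph.\n"
)
records = build_records("report.md", doc, max_chars=30)
```

In this example the short intro stays whole while the methodology section is split into two paragraph-level records, both citing report.md#methodology.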
8 · Where this is heading
The same properties that help retrieval (addressable sections, explicit structure, honest truncation) are what agentic workflows need: an assistant that can be told "read report.md, section methodology" navigates instead of re-reading. Agent-oriented output is on our roadmap; if you're building in that direction, tell us what your agents choke on.
Convert a document with the RAG preset and inspect the chunk boundaries yourself — this link preloads a sample with the preset switched on.