Feeding CSV files to an LLM

MakeItMarkdown · July 2026

CSV is the format everyone assumes is trivial — it's "just commas". In the wild it's a family of dialects: semicolons in European locales, tabs from database exports, quoted fields with embedded commas and newlines, and rows whose column counts drift. Models read messy CSV the way you'd expect: confidently and inconsistently.

What breaks in a raw paste

Quoting is invisible logic. "Portland, OR",1200 is two fields; a model skimming commas sees three. Embedded newlines inside quotes are worse — they look like new rows.
The delimiter is a guess. Semicolon files pasted as "CSV" read as one giant column.
No types. Is 0042 a number or a code? Is 45922 a date? The bytes don't say.
Length. Ten thousand rows don't fit a chat input; whatever cuts them off won't tell the model it happened.
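
The quoting problem is easy to reproduce. The sketch below uses Python's standard `csv` module to contrast a naive comma split with a quote-aware parse of the same line, then does the same for a quoted field that contains a newline. The sample strings are made up for illustration.

```python
import csv
import io

line = '"Portland, OR",1200'

# Naive comma split: the comma inside the quotes is treated as a delimiter.
naive = line.split(",")

# csv.reader honours the quotes, so the city stays in one field.
parsed = next(csv.reader(io.StringIO(line)))

# A quoted field containing a newline is still one record to a real parser,
# even though it spans two physical lines.
multiline = 'id,note\n1,"first line\nsecond line"\n'
records = list(csv.reader(io.StringIO(multiline, newline="")))

print(len(naive), naive)
print(len(parsed), parsed)
print(len(records), "records vs", len(multiline.splitlines()), "physical lines")
```

A model skimming the raw text is effectively doing the naive split, which is why the field and row counts it reports can drift from the real ones.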

The element mapping

In the file	In the Markdown
Delimiter (comma/semicolon/tab/pipe)	Auto-sniffed, then parsed with real RFC 4180 quoting rules
Header row	GFM header with inferred per-column types: `region (str) · units (int) · revenue (float)`
Rows	First 50 as a GFM table + an explicit "first 50 of N" sentence
Ragged rows	Padded to rectangular; counted in a fidelity warning
Pipes in values	Escaped so the table can't shear
Overview	Row/column counts up front, so the model knows the dataset's true size
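
To make the mapping concrete, here is a minimal Python sketch of the same steps: guess the delimiter, parse with real quoting rules, pad ragged rows, escape pipes, and print counts up front. It is not the converter's actual code. The function names `sniff_delimiter` and `to_markdown` are hypothetical, the delimiter heuristic (pick whichever candidate splits the first record into the most fields) is an assumption, and typed headers are left out to keep it short.

```python
import csv
import io


def sniff_delimiter(text, candidates=",;\t|"):
    """Heuristic (an assumption): the candidate that splits the first
    record into the most fields wins; ties go to the earlier candidate."""
    def field_count(delim):
        reader = csv.reader(io.StringIO(text, newline=""), delimiter=delim)
        return len(next(reader, []))
    return max(candidates, key=field_count)


def to_markdown(text, max_rows=50):
    """Render CSV text as an overview line, a GFM table, and warnings."""
    delim = sniff_delimiter(text)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delim)
    rows = [r for r in reader if r]  # skip blank lines
    if not rows:
        return "Empty file."
    header, data = rows[0], rows[1:]
    width = max(len(r) for r in rows)
    # A row is ragged when its field count differs from the header's.
    ragged = sum(1 for r in data if len(r) != len(header))

    def line(row):
        # Pad short rows to the table width, escape pipes, and flatten
        # embedded newlines so no value can break the table.
        padded = row + [""] * (width - len(row))
        cells = [c.replace("|", r"\|").replace("\n", " ") for c in padded]
        return "| " + " | ".join(cells) + " |"

    out = [f"{len(data)} rows x {width} columns.", ""]
    out.append(line(header))
    out.append("| " + " | ".join(["---"] * width) + " |")
    out.extend(line(r) for r in data[:max_rows])
    out.append("")
    if len(data) > max_rows:
        out.append(f"Showing the first {max_rows} of {len(data)} rows.")
    if ragged:
        out.append(f"{ragged} ragged row(s) padded.")
    return "\n".join(out)


sample = 'name;note\nA|B;"x; y"\nC\n'
print(to_markdown(sample))
```

In the sample, the quoted semicolon stays inside its field, the pipe in `A|B` is escaped, and the one-field row `C` is padded and counted.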

Before → after

In the file

region;units;"revenue, gross";signup
North;12;"1.204,50";45922
South;;"980,00";45923

In the Markdown

| region (str) | units (int) | revenue, gross (str) | signup (int) |
| --- | --- | --- | --- |
| North | 12 | 1.204,50 | 45922 |
| South |  | 980,00 | 45923 |
No ragged-row warning appears for this file: the South row has an empty units cell, but it still has four fields, so nothing was padded.
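
The typed header keeps revenue as str and signup as int because the file itself does not say otherwise. If you happen to know more than the bytes do, the reinterpretation is short. The sketch below assumes two things the file does not state: that revenue uses `.` for thousands and `,` for decimals, and that signup holds Excel 1900-system serial dates. Both helper names are hypothetical.

```python
from datetime import date, timedelta
from decimal import Decimal


def parse_european_decimal(text):
    """Assumes '.' groups thousands and ',' marks decimals, as in '1.204,50'."""
    return Decimal(text.replace(".", "").replace(",", "."))


def excel_serial_to_date(serial):
    """Assumes an Excel 1900-system serial; correct for serials of 61 and up."""
    return date(1899, 12, 30) + timedelta(days=serial)


print(parse_european_decimal("1.204,50"))  # 1204.50
print(excel_serial_to_date(45922))         # 2025-09-22
```

Stating those assumptions in the prompt, next to the typed header, is usually safer than hoping the model guesses them.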

Honest limits

The first row is assumed to be a header. A headerless file gets its first data row promoted to header, which the preview makes obvious at a glance.
Type inference is per-column majority; a column mixing 12 and n/a reads as str, which is the safe call.
50 rows is a paste-budget default. For whole-dataset questions, ask the model for the code to compute the answer, not the answer.
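
One literal reading of the per-column majority rule can be sketched in Python as follows. Two details are assumptions rather than documented behaviour: empty cells are skipped when voting (consistent with the units column in the example, which reads as int despite a blank), and ties go to the more permissive type, so a column holding just 12 and n/a comes out as str. The names `cell_type`, `infer_column_type`, and `typed_header` are hypothetical.

```python
import re
from collections import Counter

INT_RE = re.compile(r"[+-]?\d+")
# Plain decimals with a dot, optional exponent; anything else is a string.
FLOAT_RE = re.compile(r"[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?")

# Tie-break order: the most permissive type wins.
PERMISSIVE_FIRST = ["str", "float", "int"]


def cell_type(value):
    """Classify one cell as 'int', 'float', or 'str'."""
    value = value.strip()
    if INT_RE.fullmatch(value):
        return "int"
    if FLOAT_RE.fullmatch(value):
        return "float"
    return "str"


def infer_column_type(values):
    """Majority vote over non-empty cells; ties fall back toward 'str'."""
    votes = Counter(cell_type(v) for v in values if v.strip())
    if not votes:
        return "str"
    best = max(votes.values())
    return next(t for t in PERMISSIVE_FIRST if votes.get(t, 0) == best)


def typed_header(columns):
    """Build the typed header line from a {name: values} mapping."""
    return " · ".join(
        f"{name} ({infer_column_type(values)})" for name, values in columns.items()
    )


print(typed_header({
    "region": ["North", "South"],
    "units": ["12", ""],
    "revenue": ["1204.50", "980.00"],
}))
```

Note that this sketch reads 0042 as int, which is exactly the code-versus-number ambiguity mentioned earlier. A stricter variant could treat leading zeros as a signal for str.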

FAQ

.tsv? Yes — tabs are one of the sniffed delimiters.

Huge files? Parsing happens in your browser's memory; hundreds of MB may be slow. The output stays small by design.

Why not JSON output for data? Per-row key repetition costs ~15–40% more tokens for flat tables, as measured in "LLMs speak Markdown".
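
A rough way to see the key-repetition effect on your own data is to serialize the same rows both ways and compare sizes. The Python sketch below counts characters, which is only a crude proxy for tokens: real token counts depend on the tokenizer, and this does not reproduce the figure quoted above. The sample rows are invented.

```python
import json

header = ["region", "units", "revenue"]
rows = [["North", 12, 1204.5], ["South", 7, 980.0], ["East", 31, 2210.75]]

# JSON: every record repeats every column name.
as_json = json.dumps([dict(zip(header, r)) for r in rows])

# Markdown: column names appear once, in the header row.
lines = [
    "| " + " | ".join(header) + " |",
    "| " + " | ".join("---" for _ in header) + " |",
]
lines += ["| " + " | ".join(str(v) for v in r) + " |" for r in rows]
as_markdown = "\n".join(lines)

print("JSON characters:    ", len(as_json))
print("Markdown characters:", len(as_markdown))
```

The gap widens with more rows and longer column names, since the table pays for each name once and the JSON pays for it on every row.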

Drop a CSV and check the typed header line — it's the piece models thank you for.