Feeding CSV files to an LLM
CSV is the format everyone assumes is trivial — it's "just commas". In the wild it's a family of dialects: semicolons in European locales, tabs from database exports, quoted fields with embedded commas and newlines, and rows whose column counts drift. Models read messy CSV the way you'd expect: confidently and inconsistently.
What breaks in a raw paste
- Quoting is invisible logic. `"Portland, OR",1200` is two fields; a model skimming commas sees three. Embedded newlines inside quotes are worse: they look like new rows.
- The delimiter is a guess. Semicolon files pasted as "CSV" read as one giant column.
- No types. Is `0042` a number or a code? Is `45922` a date? The bytes don't say.
- Length. Ten thousand rows don't fit a chat input; whatever cuts them off won't tell the model it happened.
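The quoting and delimiter failures are easy to reproduce. Here is a short sketch using Python's standard `csv` module; every sample string other than the Portland line is made up for illustration.

```python
import csv
import io

line = '"Portland, OR",1200'

# Naive approach: splitting on every comma ignores the quotes.
naive_fields = line.split(",")

# csv.reader applies the quoting rules, so the quoted comma stays in one field.
parsed_fields = next(csv.reader(io.StringIO(line)))

# A newline inside quotes belongs to the field; it does not start a new record.
multiline = 'name,note\n"Ann","line one\nline two"\n'
records = list(csv.reader(io.StringIO(multiline, newline="")))

# A semicolon file read with the default comma delimiter collapses to one column.
semicolon_line = "region;units;signup"
one_column = next(csv.reader(io.StringIO(semicolon_line)))

print(len(naive_fields), parsed_fields, len(records), one_column)
```

The naive split yields three pieces where the parser yields two fields, which is the same mistake a model makes when it skims commas.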
The element mapping
| In the file | In the Markdown |
|---|---|
| Delimiter (comma/semicolon/tab/pipe) | Auto-sniffed, then parsed with real RFC-4180 quoting rules |
| Header row | GFM header with inferred per-column types: region (str) · units (int) · revenue (float) |
| Rows | First 50 as a GFM table + an explicit "first 50 of N" sentence |
| Ragged rows | Padded to rectangular; counted in a fidelity warning |
| Pipes in values | Escaped so the table can't shear |
| Overview | Row/column counts up front, so the model knows the dataset's true size |
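A minimal sketch of how the mapping above could be implemented with Python's standard `csv` module. The names `csv_to_markdown` and `escape_cell`, the 4096-character sniffing sample, and the wording of the summary lines are assumptions for illustration, not the tool's actual code. Type inference for the header is left out to keep the sketch short.

```python
import csv
import io


def escape_cell(value):
    # A raw pipe would end the cell early; a raw newline would end the table row.
    return value.replace("|", "\\|").replace("\n", " ")


def csv_to_markdown(text, preview_rows=50):
    # Guess the delimiter from the start of the text, limited to four candidates.
    dialect = csv.Sniffer().sniff(text[:4096], delimiters=",;\t|")
    rows = [r for r in csv.reader(io.StringIO(text, newline=""), dialect) if r]

    # Pad short rows to the widest row and count how many needed padding.
    width = max(len(row) for row in rows)
    ragged = sum(1 for row in rows if len(row) < width)
    rows = [row + [""] * (width - len(row)) for row in rows]
    header, data = rows[0], rows[1:]

    lines = [
        f"{len(data)} rows x {width} columns",  # overview first: the true size
        "",
        "| " + " | ".join(escape_cell(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in data[:preview_rows]:
        lines.append("| " + " | ".join(escape_cell(c) for c in row) + " |")
    if len(data) > preview_rows:
        lines.append(f"Showing the first {preview_rows} of {len(data)} rows.")
    if ragged:
        lines.append(f"{ragged} ragged row(s) were padded.")
    return "\n".join(lines)


sample = 'region;units;"revenue, gross";note\nNorth;12;"1.204,50";a|b\nSouth;;"980,00"\n'
print(csv_to_markdown(sample, preview_rows=1))
```

`csv.Sniffer` is a heuristic and raises `csv.Error` when it cannot decide, so real input would need a fallback delimiter.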
Before → after
In the file
region;units;"revenue, gross";signup
North;12;"1.204,50";45922
South;;"980,00";45923
In the Markdown
| region (str) | units (int) | revenue, gross (str) | signup (int) |
| --- | --- | --- | --- |
| North | 12 | 1.204,50 | 45922 |
| South | | 980,00 | 45923 |
1 ragged row was padded (see fidelity warnings).
Honest limits
- The first row is assumed to be a header. A headerless file gets its first data row promoted, which the preview makes obvious at a glance.
- Type inference is per-column majority; a column mixing `12` and `n/a` reads as `str`, which is the safe call.
- 50 rows is a paste-budget default. For whole-dataset questions, ask the model for the code to compute the answer, not the answer.
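One possible reading of "per-column majority", sketched in Python. The function name `infer_column_type`, the strict more-than-half threshold, the choice to skip empty cells, and the regular expressions are all assumptions; the exact rule the tool applies is not spelled out here.

```python
import re

INT_RE = re.compile(r"[+-]?\d+")
FLOAT_RE = re.compile(r"[+-]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][+-]?\d+)?")


def infer_column_type(values):
    # Empty cells carry no type information, so they are skipped (assumption).
    present = [v.strip() for v in values if v.strip()]
    if not present:
        return "str"
    ints = sum(1 for v in present if INT_RE.fullmatch(v))
    floats = sum(1 for v in present if FLOAT_RE.fullmatch(v))  # ints count too
    # A numeric type wins only with a strict majority; ties fall back to str.
    if ints * 2 > len(present):
        return "int"
    if floats * 2 > len(present):
        return "float"
    return "str"


print(infer_column_type(["12", "n/a"]))          # the mixed column from the text
print(infer_column_type(["1.204,50", "980,00"]))  # European decimals stay text
```

Under these assumptions an even split between `12` and `n/a` falls back to `str`, and locale-formatted numbers such as `1.204,50` are never coerced.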
FAQ
.tsv? Yes — tabs are one of the sniffed delimiters.
Huge files? Parsing happens in your browser's memory; hundreds of MB may be slow. The output stays small by design.
Why not JSON output for data? Per-row key repetition costs ~15–40% more tokens for flat tables, as measured in "LLMs speak Markdown".
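The key repetition behind that answer can be seen in a toy comparison. The three rows below are invented sample data, and the sketch counts how often a column name appears rather than counting tokens, since token counts depend on the tokenizer.

```python
import json

header = ["region", "units", "revenue"]
rows = [["North", 12, 1204.5], ["South", 7, 980.0], ["East", 9, 1100.25]]

# JSON: one object per row, so every key is written once per row.
as_json = json.dumps([dict(zip(header, row)) for row in rows])

# Markdown: column names appear once, in the header row.
table_lines = [
    "| " + " | ".join(header) + " |",
    "| " + " | ".join("---" for _ in header) + " |",
]
table_lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
as_markdown = "\n".join(table_lines)

key_mentions_json = as_json.count('"region"')
key_mentions_md = as_markdown.count("region")
print(key_mentions_json, key_mentions_md)
```

In the JSON form each column name is repeated once per row; in the table it appears once regardless of row count.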
Drop a CSV and check the typed header line — it's the piece models thank you for.