Feeding Excel workbooks to an LLM
Spreadsheets look like tables but store something stranger: typed cells whose displayed value and stored value differ. That gap — dates stored as day-counts, formulas stored as expressions, layout stored as merges — is where models get systematically misled.
What breaks if you hand it over raw
- Serial dates: the cell shows
2025-09-22, the file stores45922. Raw extractions hand the model five-digit numbers it happily treats as amounts. - Formulas: depending on the path, the model
sees
=SUM(B2:B13)instead of the number — or a number with no hint it's computed. - Merged headers unmerge into one value and blanks, shifting columns.
- Multi-sheet workbooks flatten into one undifferentiated stream, or lose all but the first sheet.
The element mapping
| In the workbook | In the Markdown |
|---|---|
| Each sheet | Its own ## section, in workbook order |
| Data range | GFM table with type-annotated headers — units (int), month (date) |
| Date cells | ISO dates (2025-09-22), never serials |
| Formula cells | The cached computed value, with a note that values are cached |
| Rows beyond 50 | Truncated with an explicit "first 50 of N" sentence |
| Ragged/merged regions | Padded to rectangular + a fidelity warning |
| Empty sheets | Named and flagged, not silently skipped |
| Charts, pivot caches, styling | Not extracted — values only (the report says what was detected) |
Why the type annotations matter more than anything else here: Tables LLMs can actually read.
Before → after
In the file
A2: 45922 (formatted as 2025-09-22)
B2: =SUM(B3:B14) (cached value 79)
C1: "Revenue
(EUR)" (merged across C1:D1)In the Markdown
| month (date) | total (int) | revenue_eur (float) |
| --- | --- | --- |
| 2025-09-22 | 79 | 96432.10 |
Values are cached formula results from the last save.Honest limits
- Formulas are not recomputed — you get the value cached at last save. Stale saves mean stale numbers (the note in the output reminds the reader).
- Charts don't survive: a chart is a drawing bound to data; the data survives, the picture doesn't.
- Fifty rows per table is a deliberate paste-budget default — for full-table workloads, attach the .md and the original together, or ask the model for code to run over the real file.
FAQ
Legacy .xls? Yes — both the modern ZIP format and the legacy binary are read (each is verified by its magic bytes first, so mislabeled files fail loudly, not weirdly).
Several sheets, one question? Every sheet is a titled section, so you can tell the model "use the Summary section".
Financial data privacy? Local conversion, nothing uploaded — verifiably.
Convert a workbook and inspect the typed headers and the cached-value note.