Feeding Jupyter notebooks to an LLM
The notebook is the hardest general format we handle, and the one
where conversion adds the most: a .ipynb is a JSON
container in which your actual reasoning — code and prose — is
buried under everything the notebook ever displayed.
What's inside a .ipynb
Four layers: code cells (source + an
execution_count recording the order in which you ran it),
markdown cells (prose, sometimes with pasted images
embedded as base64), outputs (every chart as an
embedded PNG, every dataframe preview, every traceback), and
metadata (kernel info, widget state). In real
notebooks the outputs dominate: our stress-test file is 3.5 MB,
of which about 7% is meaningful text.
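To see how a particular notebook splits across these layers, a rough measurement like the one below can help. It is a minimal sketch using only the standard library: the name `layer_sizes` is made up for illustration, sizes are character counts of the serialized JSON rather than tokens, and raw cells are ignored.

```python
import json


def layer_sizes(nb):
    """Return approximate character counts per notebook layer.

    `nb` is the parsed .ipynb (the dict that json.load returns).
    """
    sizes = {"code": 0, "markdown": 0, "outputs": 0, "metadata": 0}
    sizes["metadata"] += len(json.dumps(nb.get("metadata", {})))

    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat allows str or list of lines
            source = "".join(source)

        kind = cell.get("cell_type")
        if kind == "code":
            sizes["code"] += len(source)
            sizes["outputs"] += len(json.dumps(cell.get("outputs", [])))
        elif kind == "markdown":
            sizes["markdown"] += len(source)
            attachments = cell.get("attachments")
            if attachments:
                # Pasted images are stored as base64 under "attachments".
                sizes["markdown"] += len(json.dumps(attachments))

        sizes["metadata"] += len(json.dumps(cell.get("metadata", {})))

    return sizes
```

Dividing each value by the total gives a quick sense of how much of the file is code and prose versus displayed output.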
What breaks if you paste it raw
- The JSON wrapper spends tokens on "cell_type": plumbing around every line of your code.
- Base64 images are token bombs: one pasted screenshot ≈ 130,000 junk tokens.
- Cells have no names, so neither you nor the model can point at one.
- Out-of-order execution — the notebook's most famous bug — is invisible in the raw file unless you know to compare execution counts.
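The execution-count comparison mentioned in the last point can be sketched in a few lines. This is an illustration, not the converter's actual check: `execution_order_warnings` is a hypothetical name, positions are assumed to be 1-based, and never-run cells (an `execution_count` of `null`) are skipped.

```python
def execution_order_warnings(nb):
    """Flag code cells whose execution count is lower than that of the
    previous executed code cell, i.e. cells run out of document order."""
    warnings = []
    previous = None  # (position, execution_count) of the last executed cell

    for position, cell in enumerate(nb.get("cells", []), start=1):
        if cell.get("cell_type") != "code":
            continue
        count = cell.get("execution_count")
        if count is None:  # never run: nothing to compare
            continue
        if previous is not None and count < previous[1]:
            warnings.append(
                f"cell at position {position} ran as In [{count}], "
                f"before the cell at position {previous[0]} (In [{previous[1]}])"
            )
        previous = (position, count)

    return warnings
```

An empty list means the counts never decrease from top to bottom. It does not prove the notebook was run top to bottom in one session.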
The element mapping
| In the notebook | In the Markdown |
|---|---|
| Code cell, run as In [7] | ## Cell [7] · type:code · id:… + fenced ```python block |
| Never-run cell | Cell [p12] (position-based address) |
| Markdown cell | Verbatim prose under its own cell header |
| Text output | **Output:** block, truncated at 30 lines with an explicit note |
| Image output / pasted image | [Figure: cell_7_output_1.png] placeholder; the image itself is extracted into the .zip |
| Decreasing execution counts | An execution-order warning in the fidelity report |
| Variable reuse across cells | "depends on df (defined in Cell [2])" annotations |
| Kernel/widget metadata | Dropped (one-line overview keeps language + cell counts) |
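To make the code-cell rows of the table concrete, here is a minimal sketch of rendering one cell. Several details are assumptions rather than the tool's documented behaviour: the function name `render_code_cell`, taking the `id` from the cell's own `id` field, 1-based output numbering in the figure placeholder, and the wording of the truncation note. Error outputs and image extraction are left out.

```python
FENCE = "`" * 3        # built indirectly so this listing stays one fenced block
MAX_OUTPUT_LINES = 30  # truncation limit from the mapping table


def _text(value):
    """nbformat stores multi-line strings as a str or as a list of lines."""
    return "".join(value) if isinstance(value, list) else value


def render_code_cell(cell, position):
    count = cell.get("execution_count")
    # Executed cells are addressed by execution count, never-run cells by position.
    address = str(count) if count is not None else f"p{position}"
    header = f"## Cell [{address}] · type:code"
    if "id" in cell:  # assumption: the id shown is the notebook's own cell id
        header += f" · id:{cell['id']}"

    source = _text(cell.get("source", "")).rstrip("\n")
    parts = [header, f"{FENCE}python", source, FENCE]

    for index, output in enumerate(cell.get("outputs", []), start=1):
        data = output.get("data", {})
        if "image/png" in data:
            # The image itself would be written to a file; only a placeholder stays.
            parts.append(f"[Figure: cell_{address}_output_{index}.png]")
            continue
        text = _text(output.get("text") or data.get("text/plain") or "")
        if not text:
            continue
        lines = text.rstrip("\n").split("\n")
        if len(lines) > MAX_OUTPUT_LINES:
            hidden = len(lines) - MAX_OUTPUT_LINES
            lines = lines[:MAX_OUTPUT_LINES]
            lines.append(f"[output truncated: {hidden} more lines]")
        parts += ["**Output:**", FENCE, *lines, FENCE]

    return "\n".join(parts)
```

Keeping the address logic in one place means the same `[7]` or `[p12]` string can be reused for the header and the figure file name.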
The dependency annotations come from static analysis of
assignments and imports, and they're labelled what they are:
approximate cell dependency hints. Imports are treated as
ambient (a notebook that uses pd in 60 cells doesn't
need 60 arrows). The design decisions and their failure cases:
Making Jupyter notebooks LLM-addressable.
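As a rough illustration of how such hints can be produced, the sketch below tracks top-level assignments with regular expressions and ignores import lines entirely, which is one simple way to treat imports as ambient. The patterns and the name `dependency_hints` are assumptions made for this example, not the converter's real rules; tuple assignments, function bodies, and names that appear inside strings are all handled naively.

```python
import re

# A name at the start of a line followed by "=" (but not "==").
ASSIGN = re.compile(r"^([A-Za-z_]\w*)\s*=(?!=)", re.MULTILINE)
# Any identifier-like token, including ones inside strings and comments.
NAME = re.compile(r"\b[A-Za-z_]\w*\b")
# Whole import lines, dropped so imported names never produce hints.
IMPORT = re.compile(r"^[ \t]*(?:import|from)\s.+$", re.MULTILINE)


def dependency_hints(cells):
    """cells: list of (address, source) pairs in document order.

    Returns {address: [hint, ...]} with approximate cell dependency hints.
    """
    defined_in = {}  # variable name -> address of the cell that last assigned it
    hints = {}

    for address, source in cells:
        code = IMPORT.sub("", source)
        used = set(NAME.findall(code))
        hints[address] = [
            f"depends on {name} (defined in Cell [{defined_in[name]}])"
            for name in sorted(used)
            if name in defined_in
        ]
        # Record this cell's assignments only after computing its own hints.
        for name in ASSIGN.findall(code):
            defined_in[name] = address

    return hints
```

Because the last assignment wins regardless of control flow, a variable reassigned inside a branch is attributed to whichever cell mentions it last.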
Before → after
In the file
{"cell_type":"code","execution_count":2,
"source":["df = pd.read_csv(\"sales.csv\")\n","df.shape"],
"outputs":[{"data":{"text/plain":["(120, 4)"]}}],
"metadata":{"scrolled":true,"tags":[]}}
In the Markdown
## Cell [2] · type:code · id:bb22cc33
```python
df = pd.read_csv("sales.csv")
df.shape
```
**Output:**
```
(120, 4)
```
Honest limits
- Dependency hints are regex-based, not real dataflow — branchy reassignments can fool them (the output says so).
- Dependency analysis is Python-only; other-language notebooks convert fine but without hints.
- Interactive widget state is dropped — it has no textual meaning.
FAQ
Does it work on non-Python notebooks? Yes — the structure conversion is language-agnostic; only the dependency hints are Python-specific.
My notebook has cleared outputs — still worth converting? Yes: you keep addresses, structure and the much smaller paste; there are just no output blocks.
My analysis is unpublished. Where does the file go? Nowhere. Parsing runs in your browser — the site works offline. See privacy.
See the mapping live on a sample notebook — cell addresses, dependency hints, and the fidelity report.