What's really inside your files
- Jupyter notebooks
  Cell addresses, dependency hints, execution-order warnings, base64 extraction: the hardest format, and the biggest win.
  .ipynb
- Word documents
  Styles become a real outline; tables survive as GFM; page furniture is dropped.
  .docx
- Slide decks
  Slides become an addressable outline; speaker notes surface; charts are confessed, not faked.
  .pptx
- Emails
  Decoded headers, the newest message intact, quoted reply pyramids truncated explicitly.
  .eml
- Excel workbooks
  ISO dates instead of serials, cached formula values, one section per sheet, explicit truncation.
  .xlsx · .xls
- CSV / TSV
  Sniffed delimiters, RFC 4180 quoting, typed columns, ragged rows repaired and confessed.
  .csv · .tsv
- PDF
  Per-page text with honest limits: no fake tables, scans flagged, layout loss disclosed.
  .pdf
- Webpages
  Reader-style article extraction: content kept with tables and figures, chrome discarded.
  .html
- LaTeX sources
  The structure the PDF destroys: outline, fenced math, keyed citations, with no TeX engine pretensions.
  .tex
- Mailbox archives
  A whole mailbox as an addressable thread outline, quote pyramids truncated per message.
  .mbox
- JSON / JSONL
  A structure outline plus record tables, and when to leave JSON as JSON.
  .json · .jsonl
- Subtitles
  Timestamped transcripts with compact time markers; styling and numbering stripped.
  .srt · .vtt
- Markdown / plain text
  Structure QC, exact token counts, and retargeting to chat, RAG, Obsidian or archive presets.
  .md · .txt
Fastest way to see any of these mappings: convert your own file and read the fidelity report.
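
To make two of the mappings above concrete ("ISO dates instead of serials" and "sniffed delimiters"), here is a minimal Python sketch. It is an illustration only, not the converter's actual code: the helper names `excel_serial_to_iso` and `sniff_delimiter` are hypothetical, and the date helper assumes the workbook uses Excel's default 1900 date system.

```python
import csv
from datetime import datetime, timedelta

# Assumption: Excel's default 1900 date system. For every serial from 61
# (1900-03-01) onward, counting days from 1899-12-30 gives the right date.
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_iso(serial: float) -> str:
    """Hypothetical helper: turn an Excel date serial into an ISO 8601 string."""
    if serial < 61:
        # Earlier serials are skewed by Excel's fictional 1900-02-29.
        raise ValueError("serials below 61 need special handling")
    moment = EXCEL_EPOCH + timedelta(days=serial)
    if moment.hour == moment.minute == moment.second == 0:
        return moment.date().isoformat()          # whole day: date only
    return moment.isoformat(timespec="seconds")   # fractional day: date and time


def sniff_delimiter(sample: str, candidates: str = ",\t;|") -> str:
    """Hypothetical helper: guess the delimiter from a text sample.

    Raises csv.Error when the standard-library sniffer cannot decide.
    """
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


print(excel_serial_to_iso(45292))                       # 2024-01-01
print(excel_serial_to_iso(45292.5))                     # 2024-01-01T12:00:00
print(repr(sniff_delimiter("a;b;c\n1;2;3\n4;5;6\n")))   # ';'
```

Workbooks saved with the 1904 date system would need a different epoch, and delimiter sniffing is a heuristic that can fail on very short or irregular samples, which is why a fidelity report is worth reading after any conversion.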