Crack open a PDF, get clean markdown and its pictures.
caju is the cashew — that funny little nut that grows outside its fruit, in
plain sight. cajupdf does the same trick to your PDFs: it cracks the shell and
hands you the good stuff — readable markdown, plus every embedded image, laid out
where you can actually use them.
Three stages, one command:
- Text: pulled with unpdf (PDF.js under the hood). Scanned/image-only PDFs fall back to the macOS Vision framework OCR automatically, with no setup and no API keys.
- Images / figures / sigils: extracted at full fidelity with Poppler's pdfimages (JPEGs stay JPEGs, PNGs stay PNGs).
- Markdown: assembled with YAML front matter (title, author, page & image counts) and one ## Page N section per page, images linked inline.

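The markdown stage described above can be pictured with a small sketch. This is not cajupdf's actual code: the `PageSketch` and `DocMetaSketch` shapes, the function name, and the front matter keys `pages` and `images` are all assumptions made for illustration.

```typescript
// Hypothetical shapes: cajupdf's real internal types are not shown in this README.
interface PageSketch {
  text: string
  images: string[] // image paths relative to the markdown file
}

interface DocMetaSketch {
  title: string
  author: string
}

// Build YAML front matter plus one "## Page N" section per page.
function assembleMarkdownSketch(meta: DocMetaSketch, pages: PageSketch[]): string {
  const imageCount = pages.reduce((n, p) => n + p.images.length, 0)
  const frontMatter = [
    '---',
    // JSON strings are valid YAML double-quoted scalars, so this quotes safely.
    `title: ${JSON.stringify(meta.title)}`,
    `author: ${JSON.stringify(meta.author)}`,
    `pages: ${pages.length}`, // key name is an assumption
    `images: ${imageCount}`, // key name is an assumption
    '---',
  ].join('\n')

  const sections = pages.map((page, i) => {
    // Angle brackets keep a link destination valid even if it contains spaces.
    const links = page.images.map(src => `![](<${src}>)`)
    // Empty page text is dropped so a blank page is just its heading.
    return [`## Page ${i + 1}`, page.text.trim(), ...links].filter(Boolean).join('\n\n')
  })

  return [frontMatter, ...sections].join('\n\n') + '\n'
}
```

Images are appended after each page's text here; where the real tool places them within a page is not something this README specifies.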
- macOS — cajupdf leans on the system Vision framework for OCR, so it's macOS-only for now.
- Node ≥ 22
- pdfimages (Poppler): only needed when extracting images. Install it with brew install poppler, or skip it entirely with --no-images.

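For a feel of what the Poppler dependency does, here is a sketch of an argument list for `pdfimages`. The `-all` flag (write each image in its native format) and `-p` flag (include the page number in each file name) are real Poppler options, but whether cajupdf passes exactly these is an assumption, as is the `img` file name root and the function name.

```typescript
import { join } from 'node:path'

// Build an argument list for Poppler's pdfimages (sketch only).
// -all keeps each image in its native format; -p adds the page number to file names,
// which yields names of the form <root>-PPP-NNN.<ext>.
function pdfimagesArgsSketch(pdfPath: string, imageDir: string): string[] {
  const root = join(imageDir, 'img') // "img" root is assumed from the sample file names
  return ['-all', '-p', pdfPath, root]
}
```

The resulting array could be handed to `execFile('pdfimages', args)` from `node:child_process`; passing arguments as an array avoids shell quoting problems with file names that contain spaces.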
npm i -g cajupdf
# or
pnpm add -g cajupdf

Then:

cajupdf some.pdf

Or run it from a clone:

pnpm install
pnpm build
node dist/cli.js some.pdf

cajupdf <pdf> [<pdf> ...] [options]
Options:
--no-images Skip image extraction (images are extracted by default).
--url-friendly Slugify output names to kebab-case (default: verbatim).
--out-dir <dir> Where to write outputs (default: current directory).
-h, --help Show help.
-v, --version Show version.
Pass as many PDFs as you like — each is processed independently, and one bad file won't sink the rest (the run just exits non-zero if any failed).
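The "one bad file won't sink the rest" behaviour can be sketched as a loop that records failures and derives the exit code at the end. This is an illustration, not the CLI's actual implementation: `runBatchSketch` and the `processOne` callback are hypothetical names, and processing files sequentially is an assumption.

```typescript
// Hypothetical runner: `processOne` stands in for whatever handles a single PDF.
async function runBatchSketch(
  pdfs: string[],
  processOne: (pdf: string) => Promise<void>,
): Promise<{ failed: string[]; exitCode: number }> {
  const failed: string[] = []
  for (const pdf of pdfs) {
    try {
      await processOne(pdf)
    } catch (err) {
      // Record the failure and keep going with the remaining files.
      failed.push(pdf)
      const message = err instanceof Error ? err.message : String(err)
      console.error(`failed: ${pdf}: ${message}`)
    }
  }
  // Non-zero exit code if any file failed, zero otherwise.
  return { failed, exitCode: failed.length > 0 ? 1 : 0 }
}
```

A caller would typically assign the returned `exitCode` to `process.exitCode` once the loop finishes.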
The markdown is <name>.md, and its images sit in a sibling <name>-images/
directory.
Default (verbatim names):
My Book.pdf → My Book.md
My Book-images/
img-002-000.jpg
...
Verbatim names can contain spaces, so image links are percent-encoded and
angle-bracket-wrapped to stay valid markdown.
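As a rough illustration of that encoding, a link could be built like this. The function name, the empty default alt text, and the choice of `encodeURIComponent` are assumptions; the README only states that links are percent-encoded and wrapped in angle brackets.

```typescript
// Build a markdown image link whose destination survives spaces in the path.
// `imageDir` is expected to be a single directory name, not a nested path.
function imageLinkSketch(imageDir: string, fileName: string, alt = ''): string {
  // Encode each path segment separately so the "/" separator is kept.
  const dest = [imageDir, fileName].map(part => encodeURIComponent(part)).join('/')
  return `![${alt}](<${dest}>)`
}

console.log(imageLinkSketch('My Book-images', 'img-002-000.jpg'))
// prints ![](<My%20Book-images/img-002-000.jpg>)
```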
--url-friendly (kebab-case):
My Book.pdf → my-book.md
my-book-images/
img-002-000.jpg
...
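The two naming modes can be sketched as follows. The library exports its own slugify and resolveOutputNames, whose signatures and exact rules are not shown here, so the functions below are hypothetical stand-ins and the accent-stripping step in particular is an assumption.

```typescript
import { basename, extname } from 'node:path'

// Hypothetical kebab-case slugifier; the real slugify may apply different rules.
function slugifySketch(name: string): string {
  return name
    .normalize('NFKD')
    .replace(/[\u0300-\u036f]/g, '') // drop combining accents left by NFKD
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // any run of other characters becomes one hyphen
    .replace(/^-+|-+$/g, '') // trim leading and trailing hyphens
}

// Derive the markdown file name and the sibling image directory name.
function outputNamesSketch(pdfPath: string, urlFriendly: boolean) {
  const stem = basename(pdfPath, extname(pdfPath))
  const name = urlFriendly ? slugifySketch(stem) : stem
  return { mdName: `${name}.md`, imageDirName: `${name}-images` }
}

console.log(outputNamesSketch('My Book.pdf', true))
// prints { mdName: 'my-book.md', imageDirName: 'my-book-images' }
```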
cajupdf is also a library. The one-call workhorse:
import { extractPdf } from 'cajupdf'
const result = await extractPdf('My Book.pdf', {
images: true, // extract images (default)
urlFriendly: false, // keep names verbatim (default)
outDir: './out', // default: cwd
onProgress: p => console.log(p.stage, p.current, p.total),
})
// → { mdPath, imageDir, pageCount, imageCount }

Or reach for the individual stages (parsePDF, extractImages, toMarkdown,
slugify, resolveOutputNames), all exported with full types.
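If the bare `console.log` in the `onProgress` example is too noisy, a small formatter is one option. The `ProgressSketch` shape below is inferred only from the three fields used in that example (`stage`, `current`, `total`); the real progress type and the actual stage names may differ.

```typescript
// Hypothetical shape, inferred only from the fields used in the onProgress example.
interface ProgressSketch {
  stage: string
  current: number
  total: number
}

// Render a progress event as a single status line with a rounded percentage.
function formatProgressSketch(p: ProgressSketch): string {
  // Guard against division by zero when a stage reports no work.
  const pct = p.total > 0 ? Math.round((p.current / p.total) * 100) : 0
  return `${p.stage}: ${p.current}/${p.total} (${pct}%)`
}
```

It could then be wired in as `onProgress: p => console.log(formatProgressSketch(p))`.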
MIT
Built to be cracked open. Mind the shell. 🥜🐘