cajupdf 🥜

Crack open a PDF, get clean markdown and its pictures.

caju is the cashew — that funny little nut that grows outside its fruit, in plain sight. cajupdf does the same trick to your PDFs: it cracks the shell and hands you the good stuff — readable markdown, plus every embedded image, laid out where you can actually use them.

What it does

Three stages, one command:

Text — pulled with unpdf (PDF.js under the hood). Scanned/image-only PDFs fall back to the macOS Vision framework OCR automatically — no setup, no API keys.
Images / figures / sigils — extracted at full fidelity with Poppler's pdfimages (JPEGs stay JPEGs, PNGs stay PNGs).
Markdown — assembled with YAML front matter (title, author, page & image counts) and one ## Page N section per page, images linked inline.
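
The assembly stage can be pictured roughly as follows. This is a minimal sketch and not cajupdf's actual code: the `PageContent` and `DocMeta` shapes, the `assembleMarkdown` name, and the exact front matter keys are assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; cajupdf's real types may differ.
interface PageContent {
  text: string
  images: string[] // relative paths of the images found on this page
}

interface DocMeta {
  title?: string
  author?: string
}

// Build YAML front matter plus one "## Page N" section per page.
function assembleMarkdown(meta: DocMeta, pages: PageContent[]): string {
  const imageCount = pages.reduce((n, p) => n + p.images.length, 0)
  const frontMatter = [
    '---',
    // JSON string literals are valid YAML double-quoted scalars.
    `title: ${JSON.stringify(meta.title ?? '')}`,
    `author: ${JSON.stringify(meta.author ?? '')}`,
    `pages: ${pages.length}`, // assumed key name
    `images: ${imageCount}`, // assumed key name
    '---',
  ].join('\n')

  const sections = pages.map((page, i) => {
    // Angle brackets keep paths with spaces valid as link destinations.
    const links = page.images.map(src => `![](<${src}>)`)
    // Drop empty parts so a page without text does not leave a blank gap.
    return [`## Page ${i + 1}`, page.text.trim(), ...links].filter(Boolean).join('\n\n')
  })

  return [frontMatter, ...sections].join('\n\n') + '\n'
}
```

Keeping one section per page means a reader can map any passage back to its page in the source PDF.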

Requirements

macOS — cajupdf leans on the system Vision framework for OCR, so it's macOS-only for now.
Node ≥ 22
pdfimages (Poppler) — only needed when extracting images: brew install poppler. Skip it entirely with --no-images.
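
A script that wraps cajupdf may want to decide up front whether to pass --no-images. The sketch below assumes only Node's built-in `child_process` module; `hasCommand` and `cajupdfArgs` are hypothetical helpers, not part of cajupdf.

```typescript
import { spawnSync } from 'node:child_process'

// Hypothetical helper: true when `cmd` can be launched from PATH.
// spawnSync sets `error` (for example ENOENT) when the binary cannot be started.
function hasCommand(cmd: string): boolean {
  const result = spawnSync(cmd, ['-v'], { stdio: 'ignore' })
  return result.error === undefined
}

// Hypothetical helper: append --no-images when image extraction is unavailable.
function cajupdfArgs(pdfs: string[], imagesAvailable: boolean): string[] {
  return imagesAvailable ? pdfs : [...pdfs, '--no-images']
}
```

A caller could then build the argument list with `cajupdfArgs(['some.pdf'], hasCommand('pdfimages'))`.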

Install

npm i -g cajupdf
# or
pnpm add -g cajupdf

Then:

cajupdf some.pdf

Or run it from a clone:

pnpm install
pnpm build
node dist/cli.js some.pdf

Usage

cajupdf <pdf> [<pdf> ...] [options]

Options:
  --no-images        Skip image extraction (images are extracted by default).
  --url-friendly     Slugify output names to kebab-case (default: verbatim).
  --out-dir <dir>    Where to write outputs (default: current directory).
  -h, --help         Show help.
  -v, --version      Show version.

Pass as many PDFs as you like — each is processed independently, and one bad file won't sink the rest (the run just exits non-zero if any failed).
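
The same "keep going, report at the end" behaviour can be mirrored in library code. This is a sketch of the idea, not the CLI's actual implementation: `runAll` and the `Extract` type are hypothetical, and in real use the `extract` argument would be a call into cajupdf's `extractPdf`.

```typescript
// Hypothetical signature standing in for a call to cajupdf's extractPdf.
type Extract = (pdf: string) => Promise<unknown>

// Process each PDF independently and collect failures instead of stopping.
async function runAll(pdfs: string[], extract: Extract): Promise<string[]> {
  const failed: string[] = []
  for (const pdf of pdfs) {
    try {
      await extract(pdf)
    } catch (err) {
      // Report the failure and move on to the next file.
      console.error(`${pdf}: ${err instanceof Error ? err.message : String(err)}`)
      failed.push(pdf)
    }
  }
  return failed
}
```

Setting `process.exitCode = 1` when the returned list is non-empty gives the same non-zero exit without cutting the batch short.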

Output layout

The markdown is <name>.md, and its images sit in a sibling <name>-images/ directory.

Default (verbatim names):

My Book.pdf  →  My Book.md
                My Book-images/
                  img-002-000.jpg
                  ...

Verbatim names can contain spaces, so image links are percent-encoded and angle-bracket-wrapped to stay valid markdown: ![](<My Book-images/img-002-000.jpg>).

--url-friendly (kebab-case):

My Book.pdf  →  my-book.md
                my-book-images/
                  img-002-000.jpg
                  ...
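
The two naming schemes can be sketched like this. It illustrates the layout described above and is not cajupdf's own `slugify` or `resolveOutputNames`, whose signatures and edge-case handling may differ; `kebab` and `sketchOutputNames` are hypothetical names.

```typescript
import * as path from 'node:path'

// Hypothetical kebab-case helper for illustration.
function kebab(name: string): string {
  return name
    .normalize('NFKD') // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, '') // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // any run of other characters becomes one hyphen
    .replace(/^-+|-+$/g, '') // trim leading and trailing hyphens
}

// Derive <name>.md and the sibling <name>-images directory from a PDF path.
function sketchOutputNames(pdfPath: string, urlFriendly = false, outDir = '.') {
  const base = path.basename(pdfPath, path.extname(pdfPath))
  const name = urlFriendly ? kebab(base) : base
  return {
    mdPath: path.join(outDir, `${name}.md`),
    imageDir: path.join(outDir, `${name}-images`),
  }
}
```

Deriving both paths from one `name` keeps the markdown file and its image directory in sync whichever scheme is chosen.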

Programmatic API

cajupdf is also a library. The one-call workhorse:

import { extractPdf } from 'cajupdf'

const result = await extractPdf('My Book.pdf', {
  images: true, // extract images (default)
  urlFriendly: false, // keep names verbatim (default)
  outDir: './out', // default: cwd
  onProgress: p => console.log(p.stage, p.current, p.total),
})

// → { mdPath, imageDir, pageCount, imageCount }
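
For anything fancier than `console.log`, the progress callback can be routed through a small formatter. The sketch below assumes `stage` is a string and `current` / `total` are numbers, which is inferred from the example above rather than taken from cajupdf's type definitions; `Progress` and `formatProgress` are hypothetical names.

```typescript
// Assumed shape, inferred from the fields used in the onProgress example.
interface Progress {
  stage: string
  current: number
  total: number
}

// Render one progress event as "<stage> <current>/<total> (<percent>%)".
function formatProgress(p: Progress): string {
  // Guard against a zero total so the percentage never becomes NaN.
  const pct = p.total > 0 ? Math.round((p.current / p.total) * 100) : 0
  return `${p.stage} ${p.current}/${p.total} (${pct}%)`
}
```

It would plug in as `onProgress: p => console.log(formatProgress(p))`.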

Or reach for the individual stages — parsePDF, extractImages, toMarkdown, slugify, resolveOutputNames — all exported with full types.

License

MIT

Built to be cracked open. Mind the shell. 🥜🐘
