juliocesar/cajupdf

cajupdf 🥜

Crack open a PDF, get clean markdown and its pictures.

caju is the cashew — that funny little nut that grows outside its fruit, in plain sight. cajupdf does the same trick to your PDFs: it cracks the shell and hands you the good stuff — readable markdown, plus every embedded image, laid out where you can actually use them.


What it does

Three stages, one command:

  1. Text — pulled with unpdf (PDF.js under the hood). Scanned/image-only PDFs fall back to the macOS Vision framework OCR automatically — no setup, no API keys.
  2. Images / figures / sigils — extracted at full fidelity with Poppler's pdfimages (JPEGs stay JPEGs, PNGs stay PNGs).
  3. Markdown — assembled with YAML front matter (title, author, page & image counts) and one ## Page N section per page, images linked inline.
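
To picture the third stage, here is a minimal sketch of the assembly step. It is not cajupdf's own toMarkdown: the `DocMeta` and `PageText` shapes, the `assembleMarkdown` name, and the front matter key names (`pages`, `images`) are assumptions made for illustration only.

```typescript
// Hypothetical shapes: cajupdf's real types and front matter keys may differ.
interface DocMeta {
  title: string
  author: string
}

interface PageText {
  text: string
  images: string[] // relative image paths that belong to this page
}

// JSON string syntax is also a valid double-quoted YAML scalar,
// so colons or quotes in a title do not break the front matter.
function yamlString(value: string): string {
  return JSON.stringify(value)
}

function assembleMarkdown(meta: DocMeta, pages: PageText[]): string {
  const imageCount = pages.reduce((n, p) => n + p.images.length, 0)
  const frontMatter = [
    '---',
    `title: ${yamlString(meta.title)}`,
    `author: ${yamlString(meta.author)}`,
    `pages: ${pages.length}`,
    `images: ${imageCount}`,
    '---',
  ].join('\n')

  // One "## Page N" section per page, with that page's images linked inline.
  const sections = pages.map((page, i) => {
    const links = page.images.map(src => `![](<${src}>)`)
    return [`## Page ${i + 1}`, page.text.trim(), ...links].join('\n\n')
  })

  return [frontMatter, ...sections].join('\n\n') + '\n'
}
```

The sketch keeps page numbering 1-based so the headings match what a PDF viewer shows.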

Requirements

  • macOS — cajupdf leans on the system Vision framework for OCR, so it's macOS-only for now.
  • Node ≥ 22
  • pdfimages (Poppler) — only needed when extracting images: brew install poppler. Skip it entirely with --no-images.

Install

npm i -g cajupdf
# or
pnpm add -g cajupdf

Then:

cajupdf some.pdf

Or run it from a clone:

pnpm install
pnpm build
node dist/cli.js some.pdf

Usage

cajupdf <pdf> [<pdf> ...] [options]

Options:
  --no-images        Skip image extraction (images are extracted by default).
  --url-friendly     Slugify output names to kebab-case (default: verbatim).
  --out-dir <dir>    Where to write outputs (default: current directory).
  -h, --help         Show help.
  -v, --version      Show version.

Pass as many PDFs as you like — each is processed independently, and one bad file won't sink the rest (the run just exits non-zero if any failed).
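
The "one bad file won't sink the rest" behaviour can be sketched as a loop that catches per-file failures and only reports them at the end. This is an illustration, not cajupdf's CLI code: `runBatch`, `Extractor`, and `BatchSummary` are hypothetical names, and whether the real tool processes files one at a time or concurrently is not stated here.

```typescript
// `Extractor` stands in for any per-file async job, e.g. a call to extractPdf.
type Extractor = (pdfPath: string) => Promise<unknown>

interface BatchSummary {
  succeeded: string[]
  failed: { path: string; reason: string }[]
  exitCode: 0 | 1
}

async function runBatch(paths: string[], extract: Extractor): Promise<BatchSummary> {
  const succeeded: string[] = []
  const failed: { path: string; reason: string }[] = []

  for (const path of paths) {
    try {
      await extract(path) // a throw here only affects this file
      succeeded.push(path)
    } catch (err) {
      failed.push({ path, reason: err instanceof Error ? err.message : String(err) })
    }
  }

  // Non-zero as soon as at least one file failed, zero otherwise.
  return { succeeded, failed, exitCode: failed.length > 0 ? 1 : 0 }
}
```

A CLI wrapper built on this sketch would print the `failed` entries and assign `exitCode` to `process.exitCode` once every file has been attempted.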


Output layout

The markdown is <name>.md, and its images sit in a sibling <name>-images/ directory.

Default (verbatim names):

My Book.pdf  →  My Book.md
                My Book-images/
                  img-002-000.jpg
                  ...

Verbatim names can contain spaces, so image links are percent-encoded and angle-bracket-wrapped to stay valid markdown: ![](<My Book-images/img-002-000.jpg>).
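
A rough sketch of how such a link could be built. Which characters cajupdf actually percent-encodes is not spelled out here, so this `imageLink` helper (a hypothetical name) assumes the minimum: spaces are left alone because the angle brackets already allow them, and only `<`, `>`, `\` and `%` are encoded, since those would break or confuse an angle-bracket link destination.

```typescript
// Assumption: encode only what an angle-bracket destination cannot hold as-is.
function imageLink(relativePath: string, alt = ''): string {
  const safe = relativePath.replace(/[%<>\\]/g, ch =>
    '%' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0'),
  )
  return `![${alt}](<${safe}>)`
}

const plain = imageLink('My Book-images/img-002-000.jpg')
// plain === '![](<My Book-images/img-002-000.jpg>)'

const tricky = imageLink('a<b>-images/x.png')
// tricky === '![](<a%3Cb%3E-images/x.png>)'
```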

--url-friendly (kebab-case):

My Book.pdf  →  my-book.md
                my-book-images/
                  img-002-000.jpg
                  ...
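
For a feel of what the kebab-case mapping involves, here is a minimal sketch. The exported slugify and resolveOutputNames may handle edge cases differently; `toKebabCase` and `outputNames` below are hypothetical stand-ins, and stripping accents is an assumption of this sketch.

```typescript
function toKebabCase(name: string): string {
  return name
    .normalize('NFKD') // split accented letters into base letter + mark
    .replace(/[\u0300-\u036f]/g, '') // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // any run of other characters becomes one hyphen
    .replace(/^-+|-+$/g, '') // trim leading and trailing hyphens
}

function outputNames(pdfFileName: string, urlFriendly: boolean) {
  const base = pdfFileName.replace(/\.pdf$/i, '')
  const name = urlFriendly ? toKebabCase(base) : base
  return { markdown: `${name}.md`, imageDir: `${name}-images` }
}

const verbatim = outputNames('My Book.pdf', false)
// verbatim → { markdown: 'My Book.md', imageDir: 'My Book-images' }

const friendly = outputNames('My Book.pdf', true)
// friendly → { markdown: 'my-book.md', imageDir: 'my-book-images' }
```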

Programmatic API

cajupdf is also a library. The one-call workhorse:

import { extractPdf } from 'cajupdf'

const result = await extractPdf('My Book.pdf', {
  images: true, // extract images (default)
  urlFriendly: false, // keep names verbatim (default)
  outDir: './out', // default: cwd
  onProgress: p => console.log(p.stage, p.current, p.total),
})

// → { mdPath, imageDir, pageCount, imageCount }

Or reach for the individual stages — parsePDF, extractImages, toMarkdown, slugify, resolveOutputNames — all exported with full types.


License

MIT


Built to be cracked open. Mind the shell. 🥜🐘
