archive-assistant

A preprocessing tool for document archives. Designed to complement find-anything by ensuring PDFs are text-searchable and archives are in a format find-anything can browse efficiently.

Two tools in one workspace:

archive-repack — reads any archive format, applies processor rules to members, writes a ZIP. Standalone and useful on its own.
archive-assistant — walks a directory tree, decides which archives need processing, calls archive-repack, and manages idempotency (state DB, mtime+60s).

What it does

OCRs image-only PDFs — embeds a text layer so content is searchable
Converts any archive to ZIP — 7z, tar, tar.gz, tar.bz2, tar.xz, rar → ZIP
Processes files inside archives — extracts, applies rules, repacks
Handles nested archives — configurable via --nested: pass through unchanged (default), repack recursively, or flatten contents into subdirectories
Idempotent — ZIPs with an embedded archive-assistant.txt manifest are skipped; optional SQLite state DB for top-level files
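
The manifest-based skip can be pictured with a short Python sketch. It assumes the manifest sits at the root of the ZIP under the name archive-assistant.txt and that archives are recognised purely by file extension. The function names are illustrative, not the tool's internals, and the optional state DB is left out.

```python
import zipfile
from pathlib import Path

# Assumption: the manifest is a root-level member with this exact name.
MANIFEST_NAME = "archive-assistant.txt"
# Extensions taken from the format list above, plus .zip itself.
ARCHIVE_SUFFIXES = (".zip", ".7z", ".tar", ".tar.gz", ".tar.bz2", ".tar.xz", ".rar")


def has_manifest(zip_path: Path) -> bool:
    """Return True if the ZIP already carries the manifest."""
    with zipfile.ZipFile(zip_path) as zf:
        return MANIFEST_NAME in zf.namelist()


def needs_processing(path: Path) -> bool:
    """Decide whether a file looks like an archive that still needs work."""
    name = path.name.lower()
    if not name.endswith(ARCHIVE_SUFFIXES):
        return False  # not an archive at all
    if name.endswith(".zip") and has_manifest(path):
        return False  # already repacked, so skip it
    return True


def pending_archives(root: Path) -> list[Path]:
    """Walk a directory tree and list archives that would be processed."""
    return sorted(p for p in root.rglob("*") if p.is_file() and needs_processing(p))
```

Calling pending_archives(Path("/data/archive")) on a tree that has already been processed once would, under these assumptions, return only archives added since then.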

Install

curl -fsSL https://raw.githubusercontent.com/jamietre/archive-assistant/master/install.sh | sh

Installs archive-repack and archive-assistant to ~/.local/bin. Set INSTALL_DIR=/usr/local/bin to override the destination.

Build from source

cargo build --workspace --release

The binaries are written to target/release/archive-repack and target/release/archive-assistant.

Requirements

# OCR support (pipx manages the venv automatically)
pipx install ocrmypdf
apt install tesseract-ocr-eng   # or other language packs as needed

# PDF text detection
apt install poppler-utils       # provides pdftotext

# RAR extraction uses the unrar Rust crate, so no external unrar binary is needed.
# On some systems the build also needs the unrar development headers:
# apt install libunrar-dev   # if the build fails looking for unrar headers

Config file

Both tools use the same TOML config to define what happens to each file type.

# zip-rewrite.toml

# Exclude macOS metadata and Windows thumbnails from the output archive
exclude = ["__MACOSX/**", "*.DS_Store", "Thumbs.db"]

# OCR image-only PDFs using ocrmypdf (in-place)
[[processor]]
match = "*.pdf"
chain = [
    { io = "in-place", command = "ocrmypdf", args = ["--skip-text", "--quiet", "{input}", "{input}"] },
]

# Shell passthrough example — arbitrary pipeline via sh -c
[[processor]]
match = "*.txt"
shell = "cat {input} | tr '[:lower:]' '[:upper:]'"
io = "stdin-stdout"

`exclude`

A list of glob patterns matched against each member's full path inside the archive. Members that match any pattern are omitted from the output ZIP entirely.

exclude = ["__MACOSX/**", "*.DS_Store", "Thumbs.db", "**/.gitkeep"]

Patterns can also be supplied per-invocation via --exclude on the command line (see below).
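
As a rough illustration of exclude filtering, the following Python sketch drops members whose full path matches any pattern. It uses fnmatch from the standard library as a stand-in for the tool's glob matcher. That is an assumption: with fnmatch, `*` also matches `/`, which may differ from the glob semantics archive-repack actually uses.

```python
from fnmatch import fnmatchcase


def is_excluded(member_path: str, patterns: list[str]) -> bool:
    """True if the member's full in-archive path matches any exclude glob."""
    return any(fnmatchcase(member_path, pattern) for pattern in patterns)


def kept_members(members: list[str], patterns: list[str]) -> list[str]:
    """Return the members that would survive into the output ZIP."""
    return [m for m in members if not is_excluded(m, patterns)]


patterns = ["__MACOSX/**", "*.DS_Store", "Thumbs.db", "**/.gitkeep"]
members = [
    "docs/report.pdf",
    "__MACOSX/docs/._report.pdf",
    "docs/.DS_Store",
    "Thumbs.db",
    "src/.gitkeep",
]
print(kept_members(members, patterns))
```

Under these stand-in semantics only docs/report.pdf is kept.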

I/O modes

| Mode | Description |
|---|---|
| `in-place` | Tool modifies the file at `{input}` directly |
| `file-to-file` | Tool reads `{input}`, writes to `{output}` |
| `file-to-stdout` | Tool reads `{input}`, result captured from stdout |
| `stdin-stdout` | Input piped to stdin, result captured from stdout |

In a chain, each step's output becomes the next step's input. A shell rule passes its expression to sh -c with {input} substituted.
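
The four I/O modes and the chaining rule can be sketched in Python as follows. This is not the tool's implementation. The names run_step and run_chain, the temp file names, and the use of the Python interpreter as a stand-in command are all assumptions made for illustration. A real tool such as ocrmypdf would likely also need the original file extension preserved.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

IO_MODES = {"in-place", "file-to-file", "file-to-stdout", "stdin-stdout"}


def run_step(data: bytes, io: str, command: str, args: list[str]) -> bytes:
    """Run one processor step on `data` and return the resulting bytes."""
    if io not in IO_MODES:
        raise ValueError(f"unknown io mode: {io}")
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "input"
        dst = Path(tmp) / "output"
        src.write_bytes(data)
        # Substitute the {input} and {output} placeholders in every argument.
        argv = [command] + [
            a.replace("{input}", str(src)).replace("{output}", str(dst)) for a in args
        ]
        if io == "stdin-stdout":
            done = subprocess.run(argv, input=data, stdout=subprocess.PIPE, check=True)
            return done.stdout
        if io == "file-to-stdout":
            done = subprocess.run(argv, stdout=subprocess.PIPE, check=True)
            return done.stdout
        subprocess.run(argv, check=True)
        # in-place: the tool rewrote src; file-to-file: the tool wrote dst.
        return (src if io == "in-place" else dst).read_bytes()


def run_chain(data: bytes, steps: list[dict]) -> bytes:
    """Feed each step's output into the next step."""
    for step in steps:
        data = run_step(data, step["io"], step["command"], step["args"])
    return data


# Stand-in commands: tiny Python one-liners instead of real tools.
UPPER = "import sys; sys.stdout.write(sys.stdin.read().upper())"
APPEND = "import sys; open(sys.argv[1], 'a').write('!')"
steps = [
    {"io": "stdin-stdout", "command": sys.executable, "args": ["-c", UPPER]},
    {"io": "in-place", "command": sys.executable, "args": ["-c", APPEND, "{input}"]},
]
result = run_chain(b"scan", steps)
print(result)  # b'SCAN!'
```

The first step upper-cases its stdin and the second appends to the file in place, so the second step sees the first step's output rather than the original bytes.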

`archive-repack`

Reads any archive, applies processor rules to members, writes a ZIP. How nested archives inside the input are handled is controlled by --nested.

# From a config file
archive-repack input.7z --config zip-rewrite.toml

# Inline rule — no config file needed
archive-repack input.7z \
  --match '*.pdf' \
  --command ocrmypdf \
  --arg '--skip-text' --arg '--quiet' --arg '{input}' --arg '{input}'

# Inline shell expression
archive-repack input.tar.gz \
  --match '*.txt' \
  --shell 'cat {input} | tr a-z A-Z' \
  --io stdin-stdout

# Combine: inline rule runs first, then config-file rules
archive-repack input.zip \
  --config zip-rewrite.toml \
  --match '*.png' --command convert --arg '{input}' --arg '{output}' --io file-to-file

# Write manifest into output ZIP (archive-assistant always passes this)
archive-repack input.7z --config zip-rewrite.toml --write-manifest

# Exclude files by glob pattern (CLI flag, repeatable)
archive-repack input.zip --exclude "*.DS_Store" --exclude "__MACOSX/**"

# Repack nested archives recursively (default is passthrough)
archive-repack input.7z --config zip-rewrite.toml --nested repack

# Flatten nested archives into subdirectories
archive-repack input.7z --config zip-rewrite.toml --nested flatten

# Dry run
archive-repack --dry-run input.7z --config zip-rewrite.toml

# Explicit output path
archive-repack input.tar.gz --config zip-rewrite.toml --output /tmp/repacked.zip

`--nested <MODE>`

Controls what happens when a nested archive is found inside the input:

| Mode | Behaviour |
|---|---|
| `passthrough` (default) | Copy the nested archive into the output ZIP unchanged |
| `repack` | Shell out to `archive-repack` recursively; nested archive becomes a repacked ZIP member |
| `flatten` | Expand contents into a subdirectory named after the archive; processor rules apply to members |

Example with --nested flatten:

input.7z/docs/inner.zip → out.zip/docs/inner/fileA.pdf
                           out.zip/docs/inner/fileB.txt

The --nested flag is forwarded to recursive subprocess calls so deeply nested archives are handled consistently.
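
The path mapping in the flatten example above can be written as a small Python sketch. The function name is hypothetical, and stripping only the final extension is an assumption: how compound extensions such as .tar.gz are turned into a directory name is not stated here.

```python
from pathlib import PurePosixPath


def flattened_path(nested_archive: str, inner_member: str) -> str:
    """Map a member of a nested archive to its path in the flattened output.

    nested_archive: path of the nested archive inside the input archive
    inner_member:   path of a member inside that nested archive
    """
    archive = PurePosixPath(nested_archive)
    subdir = archive.with_suffix("")  # docs/inner.zip -> docs/inner
    return str(subdir / inner_member)


print(flattened_path("docs/inner.zip", "fileA.pdf"))  # docs/inner/fileA.pdf
print(flattened_path("docs/inner.zip", "fileB.txt"))  # docs/inner/fileB.txt
```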

Options

archive-repack [OPTIONS] <INPUT>

Arguments:
  <INPUT>    Input archive (any supported format)

Options:
  --output <PATH>       Output ZIP path [default: input stem + .zip, same directory]
  --config <PATH>       Config file defining processor rules

Inline rule (alternative or supplement to --config):
  --match <GLOB>        Filename pattern [default: * when --command/--shell given]
  --command <CMD>       Command to run on matching members
  --arg <ARG>           Argument for the command (repeatable); use {input}, {output}
  --io <MODE>           I/O mode: in-place, file-to-file, file-to-stdout, stdin-stdout
                        [default: in-place]
  --shell <EXPR>        Shell expression via sh -c (alternative to --command)

General:
  --exclude <GLOB>      Omit members matching this glob from the output (repeatable)
  --nested <MODE>       How to handle nested archives: passthrough (default), repack, flatten
  --write-manifest      Embed archive-assistant.txt manifest in the output ZIP
  --dry-run             Print what would be done without writing any output
  --verbose             Log each member being processed

If ARCHIVE_REPACK_CONFIG is set in the environment, archive-repack uses it as the config path for recursive calls on nested archives. archive-assistant sets this variable automatically.