path-findeR

What it solves

Writing CSS selectors for a new site is grunt work, and the selectors you hand-write today drift the moment the site re-templates. path-findeR takes that off your plate: feed it a handful of pages from the same template and it returns a stable selector manifest for the content that actually matters — title, price, body, images, whatever the LLM identifies as a meaningful zone.

The output is a JSON manifest keyed by parser_id that your scraper can load and use directly. When the site changes, regenerate from a fresh corpus; the API is the same.

How it works

HTML input → URL pattern detection → parse → analyze + score → LLM labels → derive selectors → validate
                    │                                                            │
              (gate: min 2 pages)                                     (retry on miss, mark unresolved)

URL pattern detection compares 2+ source URLs to extract a common shape (/products/{}). Dynamic segments are recorded and used to deprioritize instance-specific text in the analyzer.
Parser turns each HTML page into a DOM tree, strips scripts/styles/comments, normalizes whitespace.
Analyzer drops obvious junk (nav, cookie banners, ads, etc.) and scores survivors by text density, semantic tag, and link density.
Semantic doc builder compacts the top-scored nodes into an LLM-friendly representation. When a page is smaller than max_direct_kb (default 300 KB), the raw HTML is sent to the LLM instead.
LLM identifies content zones and returns (label, gen_id) pairs plus similarity clusters. It never writes selectors — only points at nodes.
Selector builder derives CSS from each labeled gen_id at multiple specificity levels (ID, class, tag+class, structural path).
Validator runs every selector across the corpus, retries on miss, and marks labels unresolved when nothing holds.
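
The URL pattern detection stage can be pictured with a short sketch. This is not the project's implementation (the core is written in Rust); it only illustrates the idea of comparing path segments across 2+ URLs and replacing the ones that differ with {}. The function name detect_url_pattern and its return shape are hypothetical.

```python
from urllib.parse import urlsplit


def detect_url_pattern(urls):
    """Illustrative only: derive (host, pattern, dynamic_values) from 2+ URLs.

    Assumes all URLs share one host and the same number of path segments.
    """
    if len(urls) < 2:
        raise ValueError("need at least 2 URLs to detect a pattern")

    parts = [urlsplit(u) for u in urls]
    hosts = {p.netloc for p in parts}
    if len(hosts) != 1:
        raise ValueError("URLs span multiple hosts")

    # Split each path into segments: "/products/123" -> ["products", "123"]
    paths = [p.path.strip("/").split("/") for p in parts]
    if len({len(p) for p in paths}) != 1:
        raise ValueError("URLs differ in path depth")

    pattern_segments = []
    dynamic_values = []
    for column in zip(*paths):
        if len(set(column)) == 1:
            # Same value on every page: part of the template.
            pattern_segments.append(column[0])
        else:
            # Value changes per page: record it as a dynamic segment.
            pattern_segments.append("{}")
            dynamic_values.append(list(column))

    return hosts.pop(), "/" + "/".join(pattern_segments), dynamic_values
```

For the two example product URLs used later in this README, the sketch yields the host shop.example.com, the pattern /products/{}, and the dynamic values 123 and 456, which are the kind of instance-specific text the analyzer is described as deprioritizing.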

How to use it

1. Run the service

cp .env.example .env       # fill in API keys + storage choice
make run                   # builds the Rust core and starts the service on :7117

Other Makefile targets: make build (release artifacts), make test (Rust + Go), make clean.

.env highlights:

STORAGE_ADAPTER — s3 (default) or local (writes to LOCAL_STORAGE_PATH)
S3_BUCKET, AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY — only needed when STORAGE_ADAPTER=s3
MONGO_URI, MONGO_DB — where manifests are persisted
AI_ADAPTER — anthropic | openai | openrouter; set the matching *_API_KEY

2. Feed pages

Pick a job_id you'll reuse for the corpus. Feed at least 2 pages from the same template:

# CLI
path-finder feed "https://shop.example.com/products/123" page1.html --job-id myshop
path-finder feed "https://shop.example.com/products/456" page2.html --job-id myshop

# or HTTP
curl -X POST http://localhost:7117/feed \
  -H 'content-type: application/json' \
  -d '{"url":"https://shop.example.com/products/123","html":"<html>…","job_id":"myshop"}'

The pipeline runs automatically once min_pages is reached (default 2). Force it early with path-finder force myshop or POST /force.
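
If you feed pages from a script rather than the CLI or curl, the same POST /feed call can be built with Python's standard library. This is a minimal sketch that mirrors the curl example above; the helper name build_feed_request is hypothetical, and the shape of the service's response is not documented here, so the sketch stops at building the request.

```python
import json
import urllib.request


def build_feed_request(base_url, page_url, html, job_id):
    """Build (but do not send) a POST /feed request mirroring the curl example."""
    payload = json.dumps(
        {"url": page_url, "html": html, "job_id": job_id}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/feed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Pass the returned object to urllib.request.urlopen to send it to a running service, once per page, reusing the same job_id for the whole corpus.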

3. Get the manifest

path-finder get myshop
# or
curl http://localhost:7117/parser/myshop

Output:

{
  "parser_id": "myshop",
  "url_pattern": { "host": "shop.example.com", "pattern": "/products/{}" },
  "parser": {
    "title": {
      "selectors": [{ "css": "main > article > h1.product-title" }, { "css": ".product-hero h1" }],
      "concrete_types": ["Text"],
      "abstract_types": ["Title", "Headline"],
      "array": false,
      "unresolved": false
    },
    "price":  { "selectors": [{ "css": ".price-now" }], "concrete_types": ["Float"], "array": false, "unresolved": false },
    "images": { "selectors": [{ "css": ".gallery img" }], "array": true, "unresolved": false }
  }
}

In your scraper, walk the parser map: try selectors in order, use querySelectorAll when array: true, fall back / log when unresolved: true.
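
The walk described above might look like the following sketch. It assumes the manifest shape shown in the example output and takes a query_all callable that you supply (for instance a thin wrapper around your HTML library's CSS select method); both extract and query_all are hypothetical names, not part of path-findeR.

```python
from typing import Any, Callable, Dict, List, Tuple

# Caller-supplied function: CSS selector in, list of matched nodes/values out.
QueryAll = Callable[[str], List[Any]]


def extract(manifest: Dict[str, Any], query_all: QueryAll) -> Tuple[Dict[str, Any], List[str]]:
    """Apply a manifest to one page; return (values, skipped_labels)."""
    values: Dict[str, Any] = {}
    skipped: List[str] = []

    for label, entry in manifest["parser"].items():
        if entry.get("unresolved"):
            # No selector held across the corpus: leave it to the caller to log or fall back.
            skipped.append(label)
            continue

        # Try selectors in manifest order and stop at the first one that matches.
        for selector in entry.get("selectors", []):
            matches = query_all(selector["css"])
            if matches:
                # array: true keeps every match; otherwise keep the first.
                values[label] = matches if entry.get("array") else matches[0]
                break
        else:
            # Resolved in the manifest, but nothing matched on this page.
            skipped.append(label)

    return values, skipped
```

Labels that end up in the skipped list are the ones to log; a growing list across pages is a reasonable signal that the site changed and the manifest should be regenerated.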

4. Regenerate when a site changes

path-finder regenerate myshop                      # all labels
path-finder regenerate myshop -l price -l in_stock # just those two
path-finder regenerate myshop --force              # override the "newer pages exist" check

5. (Optional) Web UI

cd frontend && npm install && npm run dev

Then open http://localhost:3000 to feed pages, browse history, inspect the manifest tree, test selectors against pasted HTML, and watch the run trace (candidate scores, cross-corpus validation grid, LLM rationale).
