Skip to content

xcodethink/pixelcheck

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

194 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

PixelCheck

Real eyes and hands for the AI agent that's writing your frontend.

Drop-in MCP server. Five browser primitives. Eighteen personas across fifteen countries.
Local-first · Vendor-agnostic · MIT-licensed · Yours to own.

npm npm downloads license MCP CI node typescript release stars

Quick Start →  ·  Primitives  ·  MCP Server  ·  Audit Preset  ·  Why Not E2E  ·  Changelog


If PixelCheck helps you, give it a star — it helps others discover the project.


Right now, you're a screenshotting middleman.

Your AI agent is writing 80% of your frontend. It's fast. It's good at code. But it's blind.

  • It writes a button. You open Chrome to check it rendered right. Paste a screenshot back. Ask for the fix.
  • It tweaks the OAuth flow. You log in to verify it didn't silently break. Again. Sixth time this month.
  • It updates the Japanese strings. A user emails: "half the page is in English." You didn't catch it.
  • It rewrites checkout. You walk through it on iPhone, Android, iPad just to feel whether step 3 is confusing.
  • It changes the Arabic layout. RTL didn't propagate. You don't notice for two days.

You become the bridge. The agent has thoughts. You have a browser. The two never meet. Hours of your week, every week, indefinitely.

PixelCheck is the bridge.

A single MCP server. Five primitives. Drop it in once — your agent has eyes and hands.

see(url, opts)              snapshot a page (DOM + screenshot + console + network)
act(url, steps)             execute an action sequence (semantic + selector + Computer Use)
extract(url, schema)        pull structured data matching a Zod / JSON schema
judge(url, rubric)          score a page against a rubric ("is this dark-pattern free?")
compare(a, b, criteria)     A/B comparison of two URLs (incl. blind mode)

Now your agent navigates. Sees rendered HTML. Reads console errors. Clicks. Fills. Judges. Compares. Without ever leaving its workflow — drop into Claude Desktop, Cursor, Cline, Continue, Zed, or Claude Code via four lines in ~/.mcp.json.

npm install -g pixelcheck
pixelcheck doctor                # 8-check environment health
pixelcheck-mcp                   # MCP server (stdio transport)
// ~/.mcp.json
{
  "mcpServers": {
    "pixelcheck": {
      "command": "pixelcheck-mcp",
      "env": { "ANTHROPIC_API_KEY": "sk-ant-..." }
    }
  }
}

Restart your client. Your agent has eyes.

Three promises that aren't going anywhere.

Local-first. PixelCheck runs entirely on your machine. The only outbound network destination is the LLM provider your agent already uses. Screenshots, DOMs, business flows, OAuth tokens, customer URLs — they stay yours. Zero telemetry. Zero remote storage. Zero SaaS sign-up. The audit data hits Anthropic only when the vision critic actively scores a screenshot, and you opt in once on first run.

Vendor-agnostic. Works with Claude today; multi-provider abstraction (OpenAI, Gemini, Ollama-local) is on the v1.x Wave 2 roadmap and your agent will switch with one config flag. The reason is simple: AI tools that lock you to a single LLM provider die in 2026. PixelCheck is the antidote.

Yours to own. MIT license. Source-available. No paid tier. No "Pro" upgrade path. No commercial fork waiting in the wings. The 1858-test, 29-ADR, 30-published-schema product in this repo is the entire product. There's no premium edition behind a sign-up wall — never was, never will be.


The Audit Preset — when you want to be the user, not the bridge

The five primitives compose into something more powerful when you're the operator: PixelCheck bundles an 18-persona / 15-country audit preset on top of the primitives — a CLI-first composition that runs "eighteen real users review your product" after every deployment.

You deploy. Tests pass. CI is green. But then:

  • A Japanese user opens your app and sees half-translated English strings mixed into the UI
  • A user on a budget Android phone in Nigeria waits 12 seconds for your hero image to load
  • Your OAuth login flow silently breaks — again — for the 6th time in 10 deployments
  • The Arabic version renders left-to-right, making the entire layout unusable
  • Your "Trusted" score badge shows green while the copy says "stop interacting immediately"

No E2E test catches these. They test whether code runs. They don't test whether the product works for real humans in real contexts.

The audit preset launches real Chromium browsers as 18 different users from 17 countries, walks through your product's core flows, and delivers a verdict — like having a senior PM, QA engineer, and UX reviewer audit every deployment, in every language, on every device class.

pixelcheck init projects/my-app --name "My App" --url "https://myapp.com"
pixelcheck run --project projects/my-app

Output: a structured report with per-step screenshots, video recordings, network logs, WCAG accessibility violations, and AI-scored ratings across 6 dimensions — served as JSON, HTML dashboard, or Markdown.

How It Works

For each (persona x scenario) combination:

 1. Launch Chromium with device-accurate fingerprint
    (viewport, locale, timezone, UA, regional proxy)
                        |
 2. Execute scenario steps semantically via Stagehand 3.x
    ("click the sign-up button" not "click #btn-37")
                        |
 3. 5-Layer Reliability Stack ensures 98%+ step success
    Stability Gate -> LLM Rewrite -> Selector Discovery -> Auto Selector -> Computer Use
                        |
 4. Claude Vision Critic + axe-core score each checkpoint on 18 dimensions
    completion | localization | visual_polish | trust_signals | accessibility | ...
                        |
 5. Critical steps escalate to Computer Use for pixel-level review
                        |
 6. Generate report: JSON + HTML dashboard + Markdown + video + HAR

Why Not E2E Tests?

Traditional E2E PixelCheck (audit preset)
What it tests Code logic Product experience
Decision making Hardcoded selectors AI reads the page like a human
Assertion style expect(text).toBe("Welcome") "As a Japanese free-tier user, is this CTA clear and fully localized?"
When UI changes Selectors break, tests fail Semantic instructions adapt automatically
Failure output Stack trace Screenshots + video + 6-dimension score + specific UX issues
What it catches Functional bugs i18n gaps, UX friction, visual regressions, trust issues, accessibility violations, cultural mismatches

PixelCheck's audit preset is not a replacement for E2E tests. It's what runs after them — the layer between "code works" and "product is good."

Compared to existing tools

PixelCheck Playwright Cypress Stagehand Browserbase
MCP server out of the box
Browser primitives an AI agent can call 5 (see / act / extract / judge / compare) n/a (low-level page API) n/a 3 (act / extract / observe) n/a
AI vision (judge / critique) ✅ via Anthropic ❌ (action-only)
Built-in personas 18 across 15 countries
Localised report (5 languages)
WCAG 2.x audit + SARIF export ✅ (axe-core + GitHub Code Scanning ready) manual via plugins manual via plugins
Local-first by default ✅ (your machine, your API key) ✅ (or Browserbase) ❌ (cloud-only)
Vendor lock-in none (MIT, no SaaS) none none optional Browserbase full (paid SaaS)
LLM provider swap any (Anthropic default; primitives are vendor-agnostic) n/a n/a swap any n/a
Open source ✅ MIT ✅ Apache 2.0 ✅ MIT ✅ MIT partial

TL;DR: Playwright / Cypress are deterministic browser drivers — you tell them exactly what to click. Stagehand wraps Playwright with natural-language act / extract so an agent can drive a browser. PixelCheck is the next layer up: an MCP-shaped surface that gives any AI agent vision (see / judge / compare) on top of action (act / extract), with audit presets composed across personas. Use Playwright for unit-style tests; use Stagehand if you only need an agent to fill forms; use PixelCheck when the agent needs to evaluate a UI, not just operate it.

Personas

18 built-in personas covering real-world user diversity. The Subscriber Tier column is the persona's subscription level in the SaaS you're auditing (Free user / Pro subscriber / Power-user / enterprise) — used so PixelCheck can audit your product's tiered features (paywalls, upsells, gated UI, Pro-only flows). PixelCheck itself is MIT-licensed and 100% free with no paid tier or commercial fork.

Persona Country Language Device Subscriber Tier (in your app)
US college student US English iPhone 14 Free
Tokyo housewife JP Japanese MacBook Pro Pro
Berlin security analyst DE German iPad Pro Power
Shanghai student CN Chinese Xiaomi Android Free
Sao Paulo freelancer BR Portuguese Desktop Free
Riyadh businessman SA Arabic (RTL) iPhone 15 Pro Pro
Mumbai office worker IN Hindi Budget Android Free
Seoul designer KR Korean QHD Desktop Pro
Hanoi student VN Vietnamese Android Free
Moscow engineer RU Russian (Cyrillic) Windows Desktop Free
Lagos entrepreneur NG English Budget Tecno Free
Mexico City teacher MX Spanish (LATAM) Android Free
Jakarta gig worker ID Bahasa Indonesia Android Free
US retired teacher (72yo) US English iPad Free
London security analyst UK English Desktop Power
Paris marketing manager FR French iPhone Free
Bangkok student TH Thai iPhone SE Free
Taipei engineer TW Traditional Chinese iPad Pro

Each persona includes a mental model (who they are, what they expect) and critical concerns (what would make them lose trust). The AI reviewer judges your product through their eyes.

6 script systems: Latin, CJK, Arabic (RTL), Cyrillic, Devanagari, Thai.

Scenarios Are Declarative YAML

No code required. Describe what a user does, not how to click:

id: signup-flow
name: New User Signup
priority: P0
steps:
  - id: open-home
    type: visit
    url: https://myapp.com/${persona.url_locale}

  - id: click-signup
    type: act
    instruction: Click the sign-up or get-started button

  - id: check-language
    type: assert_visual
    instruction: |
      Is all visible text in ${persona.language}?
      Flag any English strings outside of brand names.

  - id: complete-oauth
    type: act
    instruction: Sign in with Google

  - id: verify-email
    type: check_email
    subject_contains: "welcome"
    timeout: 60000

  - id: a11y-check
    type: assert_a11y
    standard: wcag2aa          # axe-core WCAG analysis
    exclude: [".cookie-banner"]

  - id: rate-onboarding
    type: assert_visual
    critical_review: true    # escalates to Computer Use
    instruction: |
      Rate the post-signup experience. Is the value proposition
      clear within 10 seconds? Is the first action obvious?

12 step types: visit, act, extract, observe, wait_for, assert_visual, assert_dom, assert_a11y, check_email, screenshot, computer_use, custom

5-Layer Reliability Stack

AI-driven browsers are flaky (~75% baseline). We engineered that away:

Layer 1: Page Stability Gate                              +10%  (zero cost)
         Wait for network idle + DOM stable + framework hydration
                            |
Layer 2: LLM Rewrite + Local Mutation                     +7%   (~$0.001/call)
         Haiku rewrites failed instructions using DOM context;
         local rules rephrase/decompose/specify as fallback
                            |
Layer 3a: Selector Hint                                   +3%   (zero cost)
          Optional CSS selector fallback (manual or YAML-defined)
                            |
Layer 3b: Auto Selector Discovery                         +3%   (zero cost)
          Stagehand observe() extracts candidate selectors automatically
                            |
Layer 4: Computer Use Fallback                            +2-4% ($0.01-0.15/call)
         Claude sees the actual pixels and operates the browser directly
         (Sonnet for non-critical steps, Opus for critical reviews)

Target: 98-99% step success rate across all persona/scenario combinations.

Each step records which layer succeeded via execution_method, giving you a reliability breakdown per run.

Reports

Every audit produces a full evidence package:

reports/2026-04-11_post-deploy/
 |-- audit.json              # Machine-readable, all scores and issues
 |-- audit.html              # Dark-theme dashboard with trend sparklines
 |-- audit-explorer.html     # Filterable SPA view of every (scenario × persona) — open with ?lang=zh-CN/ja/es/de for localised UI chrome
 |-- audit.pdf               # Stakeholder-facing summary (A4, 12pt, vector text)
 |-- summary.md              # Terminal-friendly overview
 |-- jp-japanese-pro-desktop__signup-flow/
      |-- 01-open_home.png          # Timestamped screenshot
      |-- 02-check_language.png     # + SHA-256 hash for each
      |-- network.har               # Full network log
      |-- console.log               # Browser console errors
      |-- video/*.webm              # Session recording

audit.json and every MCP tool response carries a top-level schema_version field (SemVer). The contract is documented in docs/contracts/RESULT_SCHEMA.md; machine-readable JSON Schemas live in docs/schemas/ and can be regenerated with npm run schemas.

WCAG compliance reporting

The assert_a11y scenario step runs axe-core to detect accessibility violations. As of v1, every violation carries structured WCAG attribution that flows through to all stakeholder reports:

  • PDF report — a "WCAG Compliance Summary" section grouped by conformance level (A / AA / AAA), by the four WCAG principles (Perceivable / Operable / Understandable / Robust), and a top-violated-criteria table with deep links to the W3C Understanding documents.
  • SARIF (GitHub Code Scanning / GitLab SAST) — per-criterion ruleIds like wcag/1-4-3, wcag/2-1-1. Filter by W3C clause directly in the Security tab. Each rule's detail panel shows "WCAG 1.4.3 Contrast (Minimum) (Level AA)" with a link to the W3C spec.
  • audit.json — every accessibility issue gets wcag_level and wcag_criterion fields alongside the existing description / recommendation.

Catalog covers WCAG 2.1 (the production-deployed standard) plus the 9 net-new success criteria added in WCAG 2.2 (e.g. 2.4.11 Focus Not Obscured, 2.5.8 Target Size). Compliance teams reading reports in zh-CN / ja / es / de see the section headings translated; SC names and id numbers (1.4.3, 2.1.1) stay canonical for compliance-document consistency.

Use case — answering an RFP that asks "Are you WCAG 2.1 AA compliant?":

pixelcheck run --project myapp                                    # writes audit.pdf + audit.sarif
# Open audit.pdf → "WCAG Compliance Summary" section shows A / AA / AAA counts
# Or upload audit.sarif via github/codeql-action/upload-sarif → grouped under wcag/* ruleIds

See ADR-024 for the full design.

Localised reports

Stakeholder reports (PDF / trends dashboard / PR diff Markdown / PR diff HTML) emit in the language of your audience. v1 supports 5 locales:

Code Language Used for
en English (default) Baseline
zh-CN Simplified Chinese China-market teams
ja Japanese Japan-market product orgs
es Spanish Spain + Latin America
de German DACH-region enterprises
pixelcheck run --project myapp --locale ja          # Japanese PDF + reports
pixelcheck trends --project myapp --locale zh-CN     # Chinese trends dashboard
pixelcheck diff <a> <b> --format markdown --locale es  # Spanish PR comment

Or pin a default in config.yaml:

project_name: myapp
base_url: https://myapp.com
default_locale: ja    # any audit run on this project defaults to ja

What's translated: report skeleton — section titles, table headers, status / severity badges, disclaimer prose. What's NOT translated: PixelCheck's findings themselves (those come from the LLM in whatever language you asked Claude for) and numeric values / dates / run IDs. See ADR-023 for the full design.

Translations reviewed by: machine-assisted draft pending native-speaker review. We track reviewer credits publicly — see docs/translation-review-template.md and the translation-review issue template. Confirmed reviewers will be listed below as the v1.x review pass completes.

Locale Reviewer Date Corrections applied
en (source — no review needed)
zh-CN pending pending pending
ja pending pending pending
es pending pending pending
de pending pending pending

PDF report (audit.pdf)

A 4-section A4 portrait PDF aimed at the layer of decision-makers above engineering — PMs, executives, customers, sales / CS reps. The format every email client renders inline, every slide deck embeds, every phone opens.

Section Contents
Cover Project + URL + run date + colour-coded overall score (green ≥ 8, amber 5–8, red < 5) + 7-counter summary card
Top findings Severity-sorted (critical → high → medium → low), capped at 5; each cites scenario × persona context + recommendation
Scenario results One block per (scenario × persona): status badge, score + cost, per-dimension table, all issues
Methodology How the audit works, persona list, scenario list, calibration disclaimer, run id for archival

Vector text (selectable / searchable / accessible) — not a screenshot of HTML. No screenshots embedded so the file stays under ~1 MB and emailable; for visual evidence, the recipient opens audit-explorer.html (cited in the methodology disclaimer).

Default: ON every run. Pass --no-pdf to skip during fast local iteration. See ADR-020 for the full design.

Historical Trends

Scores are tracked in a local SQLite database. Three ways to look at history:

pixelcheck history                    # Terminal table of recent runs with scores
pixelcheck diff run_0412 run_0411     # Score deltas, new/resolved issues
pixelcheck trends                     # Full HTML dashboard with 5 charts (writes <reports>/trends.html)

pixelcheck trends reads <reports>/history.db and writes a standalone HTML dashboard answering "did our UX get better or worse?" Five inline-SVG charts (no Chart.js / external CDN — opens behind any firewall, emails / prints / archives cleanly):

Chart Answer it gives
Overall score line Trending up or down?
Pass / Warn / Fail stacked bars Consistent or flaky?
Issues over time (total + critical) Where are the regression hot spots?
Cost over time Is efficiency drifting?
Per-dimension multi-line Which scoring dimension is the cause?

Plus six summary cards at the top (latest score, mean last 7, mean last 30, total cost, total issues, total critical issues) and a recent-runs table for navigation. See ADR-021 for the full design.

pixelcheck trends --project myapp -n 90 --dashboard reports/trends.html

The per-run audit.html also includes inline sparkline charts for at-a-glance trends within that single report.

Quality Gate

Fail your CI build if the experience drops below your bar:

pixelcheck run --project projects/my-app --min-score 7.5
# Exit code 1 if overall score < 7.5

Quick Start

1. Install

npm install pixelcheck
npx playwright install chromium

For corporate proxy / Alpine Linux / Docker / air-gapped environments, see docs/INSTALLATION.md.

2. Verify your environment

npx pixelcheck doctor

Reports Node version, API key, config / scenarios / personas, network proxy, and api.anthropic.com reachability. Exits 0 when ready, 1 when any check fails — useful in CI scripts to fail-fast before running an audit.

Add --verbose for diagnostic detail, --skip-network for offline / air-gapped environments.

3. Set up a project (interactive or scripted)

Interactive wizard (recommended for first-time users):

npx pixelcheck init
# Walks you through project name, base URL, sample scenario, and runs
# `doctor` at the end to confirm setup.

Non-interactive (CI / scripted):

npx pixelcheck init my-project --name acme-shop --url https://acme.example.com

Either path scaffolds:

  • config.yaml (project name + base URL + model defaults + budget)
  • scenarios/00-smoke.yaml (starter visual + a11y check)

4. Set your API key

export ANTHROPIC_API_KEY=sk-ant-...

Get a key at console.anthropic.com. The wizard above tells you when this is missing; pixelcheck doctor re-checks it any time.

5. Create your first audit

npx pixelcheck init projects/my-app --name "My App" --url "https://myapp.com"

This generates a project directory with a config file and a starter scenario. Edit the scenario to match your app's flows.

6. Run

# Dry run — validate config, print the persona x scenario matrix
npx pixelcheck run --project projects/my-app --dry-run

# Full audit
npx pixelcheck run --project projects/my-app

# Debug mode — visible browser
npx pixelcheck run --project projects/my-app --headed

# Single persona
npx pixelcheck run --project projects/my-app --persona jp-japanese-pro-desktop

CI Integration

Trigger an audit after every deployment:

# .github/workflows/deploy.yml
audit-after-deploy:
  needs: [deploy]
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: npm install pixelcheck && npx playwright install chromium
    - run: npx pixelcheck run --project .audit --min-score 7.0
      env:
        ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    - uses: actions/upload-artifact@v4
      if: always()
      with:
        name: audit-report
        path: reports/

Or dispatch to a central PixelCheck repo that audits all your projects:

    - run: |
        gh workflow run post-deploy-audit.yml \
          --repo your-org/pixelcheck \
          --field project="my-app"
      env:
        GH_TOKEN: ${{ secrets.GH_PAT }}

Exit codes: 0 = pass, 1 = fail, 2 = warn.

CI output formats

When PixelCheck detects a CI environment (CI=true, GITHUB_ACTIONS=true, GITLAB_CI=true, CIRCLECI=true, TF_BUILD=True, or JENKINS_URL), it automatically emits four standard formats alongside audit.json/audit.html:

File Format Consumed by
junit.xml JUnit XML Jenkins, GitLab CI, Azure DevOps, CircleCI
audit.sarif SARIF 2.1.0 GitHub Code Scanning, GitLab SAST
audit.jsonl JSON Lines (one record per line) jq, log aggregators, custom dashboards
github-annotations.txt GHA workflow commands GitHub Actions inline PR annotations

Inside GitHub Actions the workflow-command lines are also streamed to stderr so issues attach inline to PR diffs without a separate annotation step.

Override behaviour explicitly:

  • --ci-format auto — default; emit all 4 in CI, none on developer laptop
  • --ci-format all — force-emit all 4 regardless of environment
  • --ci-format none — skip CI formats
  • --ci-format junit,sarif — comma-separated subset

Severity mapping: critical/high → SARIF error / GHA error; mediumwarning/warning; lownote/notice. See ADR-019 for the full design.

Example — upload SARIF to GitHub Code Scanning:

- run: npx pixelcheck run --project .audit
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
- uses: github/codeql-action/upload-sarif@v3
  if: always()
  with:
    sarif_file: reports/<run-id>/audit.sarif

PR diff report

Posting a "did this PR make UX better or worse?" summary as a PR comment is two commands:

# Audit main → audit PR → diff → post
- run: pixelcheck run --tag main && pixelcheck run --tag pr
- run: pixelcheck diff <MAIN_RUN_ID> <PR_RUN_ID> --format markdown --output diff.md
- uses: marocchino/sticky-pull-request-comment@v2
  with: { path: diff.md }

The Markdown contains:

  • A headline metrics table (overall score / issues / critical issues / cost / duration) with ▲ / ▼ polarity arrows
  • Per-dimension changes (sorted by absolute delta magnitude)
  • 🆕 New issues raised by this PR (with severity tags + recommendations)
  • ✅ Resolved issues fixed by this PR
  • A "no meaningful UX changes" message when both lists are empty

Other output formats: --format html for email / Slack, --format json for downstream charting, --format text (default) for terminal. Use --output <path> to write directly to a file (extension auto-detects format) or omit to print to stdout. See ADR-022 for the full design.

Notifications: Slack webhook and Telegram bot on completion.

MCP Server

PixelCheck ships an MCP server that lets any Model Context Protocol client (Claude Code, Cursor, Cline, Continue, Zed agent) drive audits without leaving its workflow.

Register with Claude Code

Add to ~/.mcp.json (or your client's equivalent):

{
  "mcpServers": {
    "pixelcheck": {
      "command": "pixelcheck-mcp",
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-..."
      }
    }
  }
}

Tools

Tool Kind Use when
audit_url preset You want the full audit pipeline against one URL — agent loop, scoring, JSON + HTML report.
explore_url preset You want a quick autonomous run with a free-form goal; no scenario YAML needed.
see primitive You want to look at a URL once and get back DOM summary + screenshot + console errors + an optional natural-language note. 0 LLM cost when goal is omitted.
act primitive You want to drive an action sequence (click / fill / scroll / screenshot / natural-language act / vision note) and get back per-step status + final DOM + screenshot.
extract primitive You want a typed payload back from a URL — pricing tiers, feature lists, FAQ entries — shaped exactly the way you asked for. Hand the tool a JSON Schema; get back data matching it plus DOM / console / screenshot.
judge primitive You want a rubric-driven critique of one URL — aesthetic polish, dark-pattern risk, or any custom rubric. Returns per-criterion scores (0..10) + severity-graded findings with on-screen locations. 1 vision call.
compare primitive You want an A/B comparison of two URLs against the same rubric. Default double_blind mode judges each side independently then synthesises a comparison (3 vision calls, free of anchoring bias). fast mode is 1 call (cheaper, anchored).
list_personas meta Discover which personas are installed in a project.
list_scenarios meta Discover which scenarios are installed in a project.
list_capabilities meta Self-describe the server: every shipped tool with kind / cacheability / static cost band / side-effects / dependency declarations, plus the public env-var table and live result-cache state. Pure introspection — call it once on first connect to plan the rest of your session.
calibrate_critic meta Run the critic calibration gate against labeled fixtures (returns pass/fail + agreement metrics).
get_last_report meta Read the most recent audit's summary JSON from the local history DB.

see — one-shot navigation snapshot

The lightest tool in the kit. Call it when you want to ask "what's on this page right now?" without spinning up a full audit.

// MCP tools/call arguments
{
  "url": "https://stripe.com/pricing",
  "goal": "Is there a free tier?",      // optional — runs one vision call, ~$0.005
  "wait_for": "networkidle",            // or "load", "domcontentloaded", or a CSS selector
  "viewport_width": 1280,
  "viewport_height": 800,
  "include_dom": true,
  "include_console": true,
  "headless": true,
  "timeout_ms": 30000
}

Returns a SeeResult (see docs/schemas/see-result.schema.json) with url_final (post-redirect), title, dom (interactive count + headings + summary), console.errors, screenshot (path + sha256), and note (the goal answer when set). Artefacts land under $AUDIT_SEES_DIR or ~/.pixelcheck/sees/<UTC-iso>-<rand6>/. See ADR-011 for design rationale.

act — execute an action sequence

Run a sequence of browser actions (deterministic + AI), get back a per-step trace, the final DOM, and a final screenshot. Engine is auto-selected: pure-deterministic step lists run on raw Playwright (~1 s cold start, no LLM key needed), Stagehand only spins up when at least one step is { "type": "act" }.

// MCP tools/call arguments
{
  "url": "https://stripe.com/pricing",
  "steps": [
    { "type": "fill", "selector": "input[name=email]", "value": "user@example.com" },
    { "type": "click", "selector": "button[type=submit]" },
    { "type": "wait_for", "selector": ".dashboard", "state": "visible" },
    { "type": "screenshot", "label": "after-login" },
    { "type": "act", "instruction": "Click the Upgrade to Pro button" },
    { "type": "note", "goal": "Was the upgrade modal shown? Any error?" }
  ],
  "stop_on_error": true
}

Each step kind:

Kind Cost Notes
goto 0 Re-navigate. Supports wait_for (load / domcontentloaded / networkidle / CSS selector).
click / fill / press / wait / wait_for / scroll 0 Direct Playwright. No LLM.
screenshot 0 Writes <label>.png (default step-<index>.png) into the per-call artefacts dir.
act ~1 LLM call Stagehand-resolved natural-language action. Forces the engine to Stagehand for the whole session.
note ~$0.005 One vision call against the current page. Works on either engine.

Returns an ActResult (see docs/schemas/act-result.schema.json) with engine ("playwright" | "stagehand"), steps[] (each with status, duration_ms, cost_usd, optional screenshot / note / output / error), final dom / console / screenshot, and total cost_usd. Failure semantics: stop_on_error: true (default) skips remaining steps after the first failure (recorded as status: "skipped"); false runs them all and the top-level status is "error" if any failed. Artefacts land under $AUDIT_ACTS_DIR or ~/.pixelcheck/acts/<UTC-iso>-<rand6>/. See ADR-012 for design rationale.

extract — schema-bound structured extraction

Hand the tool a JSON Schema describing the payload you want; get back data matching the shape. One LLM call per invocation. Always Stagehand (extract is fundamentally LLM-driven; there is no deterministic alternative for "give me an arbitrarily-shaped object").

// MCP tools/call arguments
{
  "url": "https://stripe.com/pricing",
  "schema": {
    "type": "object",
    "properties": {
      "plans": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name":     { "type": "string" },
            "price":    { "type": "number", "description": "Monthly price in USD" },
            "features": { "type": "array",  "items": { "type": "string" } }
          },
          "required": ["name", "price"]
        }
      }
    },
    "required": ["plans"]
  },
  "instruction": "Extract every pricing plan card",   // optional — auto-synthesised from schema field names if omitted
  "selector": "main"                                   // optional — constrain to a sub-region
}

JSON Schema subset accepted (the converter rejects everything else with a precise error message naming the keyword and JSON path):

Accepted Rejected
type: object | array | string | number | integer | boolean | null oneOf, anyOf, allOf, not
type: ["string", "null"] (nullable shorthand) $ref, patternProperties, dependencies
properties, required, items, enum, description, nullable if / then / else, const (use a single-element enum instead)
additionalProperties (accepted, ignored — z.object strips by default)
pattern, minLength, maxLength, minimum, maximum (accepted, not enforced — the LLM does not honour them)

The root must be type: "object" because Stagehand's extract() requires an object schema. A bare { properties: {…} } (no type) is accepted as object-shorthand.

Returns an ExtractResult (see docs/schemas/extract-result.schema.json) with engine: "stagehand", data (matching your schema), schema_used / instruction_used / selector_used (echoed for client-side re-validation and debugging), dom / console / screenshot, and cost_usd derived from Stagehand's metrics.extractPromptTokens × estimateCost(model, …). The data.json artefact is also persisted alongside the screenshot for replay. If a tight cost-guard cap trips during recordUsage, status flips to "error" but data and cost_usd are still surfaced (partial-success). Artefacts land under $AUDIT_EXTRACTS_DIR or ~/.pixelcheck/extracts/<UTC-iso>-<rand6>/. See ADR-013 for design rationale.

judge — rubric-driven page critic

Score one URL against a rubric — aesthetic polish, dark-pattern risk, or any custom criteria you supply. One vision call per invocation. Built-in rubrics are reified data in src/core/critics/; the criterion ids are part of the public contract so consumers can join verdicts back to the rubric across runs.

// MCP tools/call arguments
{
  "url": "https://stripe.com/pricing",
  "rubrics": ["aesthetic", "dark_pattern"],          // 8 + 12 built-in criteria
  "custom_criteria": [                                // optional one-off rubric
    { "id": "pricing_clarity", "label": "Pricing clarity", "description": "Is total cost visible without scrolling?" }
  ],
  "persona": "us-power-user-desktop",                  // optional — drives viewport/locale via personas/
  "wait_for": "networkidle"
}

Built-in rubrics:

Rubric Criteria Examples
aesthetic (8) visual_hierarchy, typography, alignment_grid, color_contrast, spacing_rhythm, polish, information_density, brand_cohesion Benchmarked against Stripe / Linear / Vercel / Notion
dark_pattern (12) forced_continuity, hidden_costs, preselected_options, fake_urgency, confirmshaming, obstruction, misdirection, trick_questions, disguised_ads, bait_and_switch, privacy_zuckering, nagging Brignull taxonomy + Norwegian Consumer Council 2018
custom Caller-supplied Any one-off rubric — pricing clarity, conversion path, accessibility narrative, …

Score direction is uniform: higher = better, regardless of kind. Aesthetic 10 = excellent; dark-pattern 10 = no dark pattern detected. So overall_score (mean of all verdict scores) is monotonic across mixed rubrics.

Returns a JudgeResult (see docs/schemas/judge-result.schema.json) with rubrics, criteria (the full rubric list rendered into the prompt), verdicts[] (per-criterion { criterion_id, score, rationale, evidence }), findings[] (severity-graded issues with location), overall_score, summary, plus the standard dom / console / screenshot / cost_usd envelope. Artefacts land under $AUDIT_JUDGES_DIR or ~/.pixelcheck/judges/<UTC-iso>-<rand6>/judge.json. See ADR-014 for design rationale.

compare — A/B page comparison

Run an A/B comparison of two pages against a shared rubric. Default mode is double_blind: judge each side independently (parallel) with the same rubric, then run ONE synthesis vision call that sees both screenshots side-by-side with the prior judgements as context. Three vision calls total; wall-clock ≈ two calls because the judges run in parallel. This mirrors commercial UX-review practice — Nielsen Norman, Baymard Institute — where each candidate is evaluated independently before the comparison synthesis. The reason is anchoring bias: when a model is asked to score two pages in one prompt, absolute scores get pulled toward the difference between them, not the page itself.

fast mode collapses to a single side-by-side vision call — cheaper (~3× cheaper) but anchored. Opt in for batch comparisons (e.g. evaluating 100 competitors overnight) where the cost ratio matters more than per-call accuracy.

// MCP tools/call arguments
{
  "a": { "url": "https://stripe.com/pricing" },
  "b": { "url": "https://intercom.com/pricing", "viewport_width": 375, "viewport_height": 812 },
  "rubrics": ["aesthetic", "dark_pattern"],
  "mode": "double_blind"                                // default; use "fast" for cheap batches
}

Per-side viewport lets you compare e.g. desktop A vs mobile B. Either side may be a pre-captured snapshot from a prior see / extract / judge call ({ "capture": { ... } }); the tool will skip the browser for that side.

Returns a CompareResult (see docs/schemas/compare-result.schema.json) with mode, rubrics, criteria, side_a / side_b (each carrying the embedded JudgeResult in double_blind mode + per-side screenshot + artefacts dir), per_criterion[] ({ criterion_id, score_a, score_b, winner: "a"|"b"|"tie", rationale }), overall_winner, summary, and total cost_usd. Artefacts land under $AUDIT_COMPARES_DIR or ~/.pixelcheck/compares/<UTC-iso>-<rand6>/ with a/ and b/ subdirs and a compare.json sidecar.

list_capabilities — self-describe (M9-5)

Call once on first connect to get a structured map of the whole server: every tool with its kind, input schema, result schema title, cacheability, static cost-estimate band, side-effects, and dependency declarations; plus the public env-var table and live state of the M9-4 result cache.

// MCP tools/call arguments — none required
{}

Returns a ListCapabilitiesResult (see docs/schemas/list-capabilities-result.schema.json):

{
  "schema_version": "1.2.0",
  "server": { "name": "pixelcheck", "version": "0.3.0" },
  "result_schema_version": "1.2.0",
  "tools": [
    {
      "name": "judge",
      "kind": "primitive",
      "result_schema": "JudgeResult",
      "cacheable": true,
      "cost_estimate_usd": { "typical": 0.02, "min": 0.01, "max": 0.06, "unit": "per_call", "notes": "..." },
      "side_effects": ["navigation", "network_egress", "fs_writes_artifacts"],
      "requires": { "api_keys": ["ANTHROPIC_API_KEY"], "browser": true }
      /* …plus name / description / input_schema */
    }
    /* …11 more rows */
  ],
  "env": [
    { "name": "ANTHROPIC_API_KEY", "scope": "auth", "default": "", "required": true, "description": "..." }
    /* …20 more rows across auth / cache / cost_guard / artifacts / logging / memory / reports */
  ],
  "cache": { "enabled": true, "ttl_ms_default": 86400000, "path": "~/.pixelcheck/result-cache.db" }
}

Pure introspection. No LLM, no browser, no probe of secret presence. Secret env vars are named (so you know what to set) but values are never returned. The cache file path is exposed because paths are not secrets — agents writing diagnostic / cleanup scripts genuinely need them. See ADR-016 for design rationale, including why tools/list keeps the strict-spec subset and why runtime secret-presence is deliberately not probed.

Every tool response carries a top-level schema_version field per docs/contracts/RESULT_SCHEMA.md. Two parallel tool calls in one server process see independent run-USD cost caps (per ADR-009) but share the persistent daily ledger.

Adding a new tool: drop a file under src/mcp/tools/<name>.ts exporting a ToolDefinition, then push it into ALL_TOOLS in src/mcp/server.ts. See ADR-010 for the registry rationale.

Multi-Project Support

One PixelCheck install serves all your projects:

pixelcheck/
 |-- personas/              # 18 shared personas (used by all projects)
 |-- projects/
      |-- my-saas/          # Project A
      |    |-- config.yaml
      |    |-- scenarios/
      |-- my-mobile-web/    # Project B
      |    |-- config.yaml
      |    |-- scenarios/
      |    |-- personas/    # Optional: project-specific persona overrides
      |-- my-docs-site/     # Project C
           |-- config.yaml
           |-- scenarios/

Safety

  • Stripe live key protection — refuses to start if pk_live_ detected in environment
  • Credential redaction — OAuth tokens, passwords, API keys, and webhook URLs are never written to reports OR to logs (two layers: well-known field names like apiKey / password / token / cookie are always censored, and concrete env-derived secret values are substring-replaced anywhere they appear, including inside log messages)
  • Computer Use guardrails — Anthropic's prompt-injection classifier enabled by default
  • Budget cap — stops spawning new audit units when cumulative API cost exceeds your threshold

Logging

Internal events use a structured logger (pino). Output goes to stderr, so stdout stays clean for CLI results and the MCP stdio protocol. By default the format is human-readable when stderr is a TTY and JSON otherwise.

Env var Values Default Effect
LOG_LEVEL trace, debug, info, warn, error, fatal, silent info Minimum log level
LOG_PRETTY 1, true, 0, false, auto auto Force pretty-print or JSON; auto decides by TTY
LOG_FILE /path/to.log unset Additionally tee logs to a file

Examples:

# CI / piped: JSON to stderr automatically (no TTY)
pixelcheck run --project projects/my-app 2> audit.log

# Force JSON even in a terminal
LOG_PRETTY=0 pixelcheck run --project projects/my-app

# Verbose debugging
LOG_LEVEL=debug pixelcheck run --project projects/my-app

Cost Guard

A process-wide spend cap protects against runaway LLM bills. Every Anthropic API call is tracked against two limits:

  • Per-run — single audit / MCP tool invocation. Reset at run start.
  • Per-day — UTC-day total persisted across processes in a JSON ledger (default ~/.pixelcheck/cost-ledger.json, override via AUDIT_COST_LEDGER_PATH).

Exceeding any cap throws BudgetExceededError so the calling loop stops immediately. The ledger auto-prunes entries older than 30 days.

Env var Default Effect
AUDIT_COST_MAX_RUN_USD 5 Max USD per audit run / MCP tool call
AUDIT_COST_MAX_RUN_TOKENS 10000000 Max input+output tokens per run
AUDIT_COST_MAX_DAILY_USD 50 Max USD per UTC day across all runs
AUDIT_COST_MAX_DAILY_TOKENS 100000000 Max input+output tokens per UTC day
AUDIT_COST_LEDGER_PATH ~/.pixelcheck/cost-ledger.json Path to the persistent daily ledger
AUDIT_COST_GUARD_DISABLED unset 1 / true to bypass entirely (CI / tests)

The cost guard layers over (and is independent of) the runner's budget_usd cap, which only stops the runner from scheduling new units. The cost guard catches direct MCP tool calls, computer-use loops, and instruction mutations that the unit scheduler doesn't see.

Inspect the current state via the snapshot included in the run started log line, or:

LOG_LEVEL=debug pixelcheck run --project projects/my-app
# emits one "llm usage recorded" debug line per Anthropic call with running totals

Concurrency Safety

PixelCheck is safe to run from multiple processes at once — two parallel pixelcheck terminals, an MCP server fielding two audit_url calls in parallel, or a CLI run alongside an MCP-served call. Specifically:

  • Cost ledger (cost-ledger.json): protected by a cross-process advisory lockfile (<ledger>.lock). Concurrent recorders never lose updates.
  • Per-run cost counters: each MCP tool dispatch and each runAudit call gets its own AsyncLocalStorage scope, so two parallel calls have independent run-USD caps. The persistent daily ledger is still shared.
  • Memory DB (memory.db): record(fact) uses one atomic INSERT … ON CONFLICT DO UPDATE. No SELECT-then-write race.
  • Visual diff baselines: first-run bootstrap copies to a .tmp path then linkSyncs into place. Two parallel first-runs both succeed; the first writer wins.

If a process crashes while holding the cost-ledger lock, the lock auto-recovers after 30 seconds (or sooner if the holder pid is no longer alive). See ADR-009 for design.

SQLite stores share a unified migration runner

PixelCheck uses four local SQLite files (history.db / memory.db / plan-cache.db / result-cache.db). They all open through src/core/db-migrate.ts > openManagedDatabase(), which handles the parent-directory creation, busy_timeout pragma, file-locked WAL transition, and a PRAGMA user_version-driven migration walk in one place. Each migration runs in its own BEGIN IMMEDIATE / COMMIT block so a SQL failure rolls every CREATE / ALTER / INSERT in that step back atomically — partial schema is impossible. Older binaries opening newer DBs fail loudly with MigrationVersionError instead of running broken queries against missing columns. See ADR-026 for design.

Result Cache

A persistent local cache memoises results from the deterministic primitives so repeated identical calls return instantly with cost_usd = 0. AI agents can plan more aggressively without burning fresh vision tokens on every tool call.

Cached primitives:

Primitive Cached Notes
judge Same URL + rubrics + custom criteria + persona/model → same verdict.
extract Same URL + schema + instruction + selector + persona/model → same data.
see ✅ when goal is set Without a goal there is no LLM cost — caching a snapshot would risk staleness.
act State-changing semantics; always runs fresh.
compare Transparent Its two per-side judge calls hit cache automatically; the synthesis call is not separately cached.

Hit/miss semantics: every cache-aware result carries an optional cache?: { hit, age_ms, key, cost_saved_usd? } field. On hit the result's own cost_usd is zeroed and the original cost moves to cache.cost_saved_usd — so callers summing nested costs (e.g. compare) do not double-count cached work.

Configuration:

Env var Default Effect
AUDIT_RESULT_CACHE_PATH ~/.pixelcheck/result-cache.db SQLite path; isolate per environment
AUDIT_RESULT_CACHE_TTL_MS 86400000 (24h) Entries older than this are misses + pruned
AUDIT_RESULT_CACHE_DISABLED unset 1 / true to bypass entirely (read = miss, write = no-op)
AUDIT_RESULT_CACHE_MAX_ROWS 10000 LRU cap; oldest last_used_at rows evicted past this. 0 disables.
AUDIT_RESULT_CACHE_MAX_DISK_MB 500 LRU cap by DB size; same eviction order. 0 disables.

Per-call overrides (also exposed on each MCP tool as cache / cache_bust / cache_ttl_ms):

  • cache: false — skip read and write for this one call.
  • cacheBust: true — skip read but persist the new result so subsequent identical calls hit cache.
  • cacheTtlMs: number — override the TTL for this call.

Schema-version invalidation: entries written under a different RESULT_SCHEMA_VERSION are treated as misses and removed at the next prune. The cache survives additive minor bumps automatically; major bumps invalidate everything.

See ADR-015 for design.

Artifact retention

Each MCP primitive call (see / act / extract / judge / compare) writes a per-call subdirectory under ~/.pixelcheck/<kind>/ containing screenshots, DOM dumps, payload JSON, and the LLM response. Long-running MCP servers can accumulate gigabytes over a month. PixelCheck enforces a 30-day retention window by default and prunes lazily.

pixelcheck prune          # explicit cleanup; prints summary, exit 1 on errors

The MCP server runs the same prune at most once per 24 hours on startup (prune-stamp.json records the last run; subsequent connects within the window skip prune entirely).

Env var Default Effect
AUDIT_SEES_RETENTION_DAYS 30 Retention window for see artifacts; 0 disables
AUDIT_ACTS_RETENTION_DAYS 30 Same, for act
AUDIT_EXTRACTS_RETENTION_DAYS 30 Same, for extract
AUDIT_JUDGES_RETENTION_DAYS 30 Same, for judge
AUDIT_COMPARES_RETENTION_DAYS 30 Same, for compare
AUDIT_<KIND>_DIR ~/.pixelcheck/<kind> Custom storage dir per kind

Setting a retention to 0 means infinite retention (skip prune for that kind), matching how every Linux retention tool behaves. To bulk-delete a kind, use rm -rf ~/.pixelcheck/<kind> directly.

Built With

  • Playwright — browser automation
  • Stagehand 3.x — AI-driven semantic browser control
  • Claude (Vision + Computer Use) — visual evaluation and pixel-level review
  • axe-core — WCAG accessibility auditing
  • better-sqlite3 — local audit history and trend tracking

How Is This Different?

You have four real options if you want an AI agent to operate the visual web today, and each makes a different bet:

  • OSS automation frameworks — browser-use (91k★), Stagehand (22k★), Skyvern (21k★). Best-in-class at executing tasks an agent dictates. None ship a multi-persona simulation layer, none have a strict result-schema contract designed for cacheable AI workflows.
  • Rule-based auditors — axe-core (7k★), pa11y (4.4k★), Lighthouse. Excellent at "does this pass WCAG?" Silent on "is this product actually good?"
  • Hosted agentic browsers — Comet, Atlas, BrowserOS, Dia. Consumer products that replace Chrome. You give them a credit card and a session. They give you UI, not infrastructure.
  • PixelCheck — the MCP server beneath your AI agent's workflow. Fully local, fully OSS, fully owned.
OSS Frameworks Rule-Based Auditors Hosted Agentic Browsers PixelCheck
Question answered How do I control a browser? Does this pass WCAG 2.x? Can a product look at the web for me? How does my AI agent see and operate the web?
Primary interface Library / SDK CLI Desktop app + cloud session MCP server (+ CLI for humans)
Intelligence LLM-driven actions Static rules Hosted LLM (you pay per session) LLM vision + rules + Computer Use, your LLM key
User simulation Single anonymous session None Single signed-in session 18 personas × 17 countries × 6 script systems
Anti-detection None N/A Built-in (browser identity) 9 fingerprints + 15 stealth patches
Output contract Action results Pass/fail checklist Conversational replies 31 published JSON Schemas + 89 named API
History None None Per-session, vendor-locked SQLite trends + run-to-run diff, yours
Cost model Free OSS, your LLM bill Free OSS Subscription + per-session Free OSS, your LLM bill, no PixelCheck markup
Where your data lives Your machine Your machine Vendor cloud Your machine. Period.
Lock-in Sometimes (cloud add-ons) None Maximum None — MIT, no paid tier, no commercial fork

No existing open-source project combines MCP-first browser primitives, multi-persona simulation, AI vision scoring, WCAG analysis, stealth fingerprints, and historical trend tracking. PixelCheck is the missing infrastructure layer between AI agents and the visual web — and it's the only one in the table above where the answer to "what happens to my data" is "it never leaves your machine."

Test Coverage

Run with measurement:

npm run test:coverage          # writes ./coverage/index.html
npm run test:coverage:check    # CI gate — fails on regression below thresholds

Coverage is enforced via vitest.config.ts > coverage.thresholds (provider v8). Entry-points (cli.ts, index.ts, mcp/server.ts) and pure-type contracts (core/types.ts, core/result-schema.ts) are excluded — they are tested through consumers (CLI smoke + MCP tools/list handshake + schema round-trip tests). Counting them would dilute the signal.

The threshold floor sits at or below the current global baseline so the gate catches regression but doesn't block the build. Each new test PR ratchets the floor up after pushing it. Per-module coverage is visible in the text-table report or coverage/index.html. See docs/decisions/ADR-017-coverage-tooling-and-m1-2-phase-1.md for the M1-2 phase plan.

Performance regression gate

Hot-path benchmarks live in tests/perf.bench.ts and run separately from the test suite (vitest's *.bench.ts discovery is independent from *.test.ts, so npm test stays fast). 9 benchmarks cover the report rendering + aggregation paths most likely to regress when someone refactors a template or adds an O(N²) loop:

  • renderPdfHtml / renderTrendsHtml / renderDiffMarkdown / renderDiffHtml
  • renderJunitXml / renderSarif
  • summarizeWcag / computeSummary / t() i18n lookup
npm run bench          # measure (writes docs/perf-current.json)
npm run bench:check    # compare to docs/perf-baseline.json — exit 1 on regression > 50%
npm run bench:update   # bake current as new baseline (after intentional perf changes)

The default 50% tolerance is calibrated against measured run-to-run variance (8–53% on quiet hardware). Stricter local checks via --tolerance 0.30. Initial baseline was recorded as min-of-5 consecutive runs so regressions register as "slower than we've ever been" — robust to noise above the floor. See ADR-025 for the full design.

Stability Commitment

Starting v1.0.0, the following surfaces are stable per Semantic Versioning:

  • CLI — flags, subcommands, exit codes, env var names
  • Config schemaconfig.yaml / personas/*.yaml / scenarios/*.yaml
  • Result Schema — version 1.2.0, the 31 published JSON Schemas in docs/schemas/
  • MCP tool surface — 13 tool names + input/output schemas
  • Library exports — 89 named exports from src/index.ts

Breaking changes only land in major version bumps (v2.0, v3.0, ...). Minor and patch releases are guaranteed backward-compatible. Deprecation cycle is documented in docs/DEPRECATION-POLICY.md: features deprecated in v1.x continue to work for at least two minor releases before being removed in the next major.

Upgrading from v0.3 to v1.0? See MIGRATION.md.

Performance baseline (provisional, v1.0-rc1 calibration pending)

A typical 5-unit audit (1 scenario × 5 personas, full AI pipeline) is expected to land in:

Metric v1.0 target Notes
Wall-clock time ~2–5 minutes Varies by site complexity, persona count, model. v1.0-rc1 calibration will set a hard SLA.
API cost ~$0.10–$0.30 Claude Sonnet 4.6 vision; Computer Use spikes can push to $0.50+
Memory peak < 1 GB RSS Chromium ~500 MB + Node heap ~300 MB

Render hot-paths (already tracked via npm run bench:check regression gate):

Path ops/sec on M-series Notes
renderPdfHtml (20-unit audit) ~12,000 A4 portrait + WCAG section + 5 charts
renderTrendsHtml (100-row history) ~1,000 5 inline-SVG charts
renderDiffMarkdown (typical PR) ~90,000 Sticky PR comment friendly
renderSarif (20-unit, 12 issues) ~190,000 Per-WCAG-SC ruleIds

These are micro-benchmarks (single function call). Full audit pipeline (launch chromium → navigate → score) wall-clock baseline is being calibrated in v1.0-rc1.

Contributing

Contributions are welcome. See CONTRIBUTING.md for the full developer guide (dev setup, commit conventions, PR process, ADR practice, branch protection).

We adopt the Contributor Covenant 2.1 as our community Code of Conduct.

Areas where help is especially appreciated:

  • New personas for underrepresented regions/demographics
  • Scenario templates for common app patterns (e-commerce checkout, onboarding, dashboards)
  • Report format improvements
  • Cost optimization strategies

For installation troubleshooting (corporate proxy, Alpine, air-gapped, etc), see docs/INSTALLATION.md.

Privacy & Data Handling

pixelcheck runs entirely on your machine. The only outbound network destination is api.anthropic.com for the audit calls you explicitly trigger. Zero telemetry.

What leaves your machine when you run an audit:

  • Page screenshots + DOM summaries → Anthropic Claude API
  • Your scenario step text + persona profile fields → Claude API
  • Nothing else (URLs / env vars / paths / past audits stay local)

Privacy-first defaults:

  • Password / secret / API-key inputs are redacted to ******** before screenshots (--redact-inputs, on by default; opt out with --no-redact-inputs only for fixture audits)
  • First-run consent prompt explicitly informs you what data goes to Anthropic. Persisted in ~/.pixelcheck/consent.json so subsequent runs don't re-prompt. Bypass for CI / non-TTY: AUDIT_AUTO_CONSENT=1 env or --auto-consent flag (read PRIVACY.md first).
  • Per-run reports stored at mode 0700 (owner-only) under <projectDir>/reports/

For full data-flow disclosure, GDPR / CCPA position, retention controls, and how to delete data — see PRIVACY.md.

Security

Found a vulnerability? Please use GitHub Security Advisories (private disclosure) — see SECURITY.md. Do not file public issues for security reports.

License

MIT — see LICENSE for full text.

Third-party dependencies and their licenses are documented in docs/THIRD_PARTY_LICENSES.md.

Help & Reference

  • FAQ.md — common questions on API key + cost, scenarios + personas, reports + output, privacy, native binaries
  • docs/TROUBLESHOOTING.md — runtime errors and fixes (API + auth, audit run, browser, reports, CI, performance)
  • docs/INSTALLATION.md — install matrix + corporate proxy + Alpine / Docker / air-gapped recipes
  • API reference — generate locally with npm run docs:apidocs/api/index.html (TypeDoc, not committed)
  • docs/decisions/ — 28 Architecture Decision Records explaining design rationale

E2E tests verify your code works. PixelCheck gives AI agents real eyes and hands to verify your product works.
Get started in 2 minutes

About

MCP-first browser primitives for AI agents — real eyes and hands on the web. Local-first. Vendor-agnostic.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors