AI-native B2B wholesale platform for mobile accessories. Built on Cloudflare Workers, D1, R2, Vectorize, and Hono.
Obsidian Vault (content/)
-> git push
-> GitHub Actions CI/CD
-> validate (type check, entity ID linting)
-> generate (graph, link-index, llms.txt, JSON-LD)
-> process images (WebP conversion, R2 upload)
-> chunked D1 sync (idempotent, failure-strict)
-> deploy Worker
-> Cloudflare Edge
-> SSR HTML (Google + humans)
-> Markdown/plain text (AI agents)
-> D1 (products, graph, runtime state)
-> R2 (images)
-> Vectorize (semantic search, optional)
This project follows a shared-repository operating model with three hard rules:
- Single local source of truth — Obsidian, Codex, Claude Code, and Git all work inside the same repository root.
- Deployment isolation — the Cloudflare Worker runtime is only redeployed when code paths change. Markdown/content updates never redeploy the Worker by default.
- Minimal architecture — no monorepo tooling, no turborepo, no pnpm workspaces. Static-first, edge-native, operationally simple.
All local work happens in:
/Users/alexkou/Documents/github/b2bweb
This repository is both:
- the code workspace for Claude Code
- the content vault for Obsidian and Codex
Responsibilities are separated by path ownership and workflow rules, not by maintaining two live local clones.
Two separate GitHub Actions workflows enforce deployment isolation:
The deploy workflow (deploy.yml) fires when any of the following paths change:
src/**
scripts/**
wrangler.toml
package.json
package-lock.json
tsconfig.json
.github/workflows/deploy.yml
Pipeline stages:
validate (type check + entity ID lint)
-> build (graph, llms.txt, JSON-LD)
-> sync (D1 chunked batch)
-> embeddings (Vectorize upsert)
-> images (WebP conversion, R2 upload)
-> deploy (wrangler deploy)
The Worker runtime is only redeployed at the end of this pipeline — never from content changes.
The content workflow fires when any of the following paths change:
content/**
public/**
attachments/**
Pipeline stages are identical to deploy.yml up through embeddings and images, but the deploy job is absent:
validate (entity ID lint)
-> build (graph, llms.txt, JSON-LD)
-> sync (D1 chunked batch)
-> embeddings (Vectorize upsert)
-> images (WebP conversion, R2 upload)
Content editors (Obsidian, Codex) push markdown changes and get full D1 sync, vector indexing, and image processing — without ever touching the Worker runtime.
Without path filtering, every Obsidian markdown push would trigger a full Worker redeploy. This wastes CI minutes, risks deploying an unintended Worker state, and conflates content authoring with infrastructure operations.
With path filtering:
- A product markdown edit → D1 sync + embeddings only. No deploy.
- A source code change → full pipeline including deploy.
- Both change in the same commit → both workflows fire in parallel, each handling its domain.
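The routing rule above can be sketched in Python. This only illustrates the path lists from the two workflows; it is not how GitHub Actions evaluates its path filters. The function name `workflows_for` and the labels `deploy` and `content` are hypothetical, and `fnmatch` globbing only approximates GitHub's pattern syntax.

```python
from fnmatch import fnmatch

# Path lists copied from the two workflow descriptions above.
DEPLOY_PATTERNS = [
    "src/**", "scripts/**", "wrangler.toml", "package.json",
    "package-lock.json", "tsconfig.json", ".github/workflows/deploy.yml",
]
CONTENT_PATTERNS = ["content/**", "public/**", "attachments/**"]


def _matches(path, patterns):
    # fnmatch's "*" also crosses "/", so "src/**" matches nested files.
    return any(fnmatch(path, pattern) for pattern in patterns)


def workflows_for(changed_paths):
    """Return the set of (hypothetical) workflow labels a commit would trigger."""
    fired = set()
    for path in changed_paths:
        if _matches(path, DEPLOY_PATTERNS):
            fired.add("deploy")
        if _matches(path, CONTENT_PATTERNS):
            fired.add("content")
    return fired
```

A commit touching only `content/products/a.md` maps to the content workflow alone, while a commit touching both `src/index.ts` and a product file maps to both.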
The repository may contain both code and content, but daily operations must keep them separated by commit intent and path scope.
The long-term resolution is to split into two GitHub repositories:
| Repo | Contains |
|---|---|
| b2bweb | Source code, Workers, D1 schema, deployment logic |
| b2bweb-content | Products, markdown, Obsidian vault |
In the future architecture, a push to the content repo triggers GitHub Actions that pull the content into the code repo, sync D1, and update Vectorize. The content repo never has access to deployment credentials or Worker configuration.
| Data | Owner | Where |
|---|---|---|
| Product content, SEO text, metadata | Markdown | content/products/*.md |
| Live price, stock, price_locked | D1 runtime | Admin API (PUT /api/products/:id/live) |
| Knowledge graph | Generated | generated/graph.json -> D1 snapshot |
| Embeddings | Vectorize | b2bweb-products index (bge-m3, 1024d) |
| Product images | R2 | products/{slug}/{hash}.webp |
Markdown controls product content and SEO. D1 controls live runtime state (stock quantities, locked prices). When both sources overlap on pricing fields, D1 wins if price_locked = 1.
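The precedence rule can be written as a small function. This is a minimal sketch, not the project's sync code: `effective_price` is a hypothetical name, and falling back to the Markdown price when the D1 row is unlocked or has no price is an assumption drawn from the sentence above.

```python
def effective_price(markdown_price, d1_price, price_locked):
    """Pick the price to serve when Markdown and D1 both carry one.

    D1 wins only when price_locked == 1 (and a D1 price exists);
    otherwise the Markdown value is used (assumed fallback).
    """
    if price_locked == 1 and d1_price is not None:
        return d1_price
    return markdown_price
```

For example, `effective_price(5.50, 7.95, 1)` yields the D1 value, while the same call with `price_locked=0` yields the Markdown value.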
| Name | Type | Purpose |
|---|---|---|
| DB | D1 | Products, customers, orders, quotes, graph |
| ASSETS | R2 | Product images (WebP) |
| VECTORIZE | Vectorize | Semantic search (bge-m3, 1024 dims) |
| AI | Workers AI | Embeddings, vision, inference |
| ALLOWED_ORIGINS | Var | CORS allowlist (comma-separated) |
| ENVIRONMENT | Var | production or development |
git clone https://github.com/alexmorerich/b2bweb.git
cd b2bweb
npm install
# Initialize database
npm run db:init
npm run db:seed
# Run full content pipeline (validate -> graph -> llms -> jsonld -> sync)
npm run pipeline
# Start dev server
npm run dev
# -> http://localhost:8787
Pipeline order is critical. Graph and link-index must be generated before sync:
validate:ids -> content:graph -> content:llms -> content:jsonld -> content:sync
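The ordering constraint can be made explicit with a small runner that stops at the first failing step. This is a sketch only: the script names come from the line above, while `run_pipeline`, `run_step`, and the injectable `runner` argument are hypothetical helpers, not part of the repository.

```python
import subprocess

# npm script names in the order given above.
PIPELINE = [
    "validate:ids",
    "content:graph",
    "content:llms",
    "content:jsonld",
    "content:sync",
]


def run_step(script):
    """Run one npm script and return its exit code."""
    return subprocess.run(["npm", "run", script]).returncode


def run_pipeline(steps=PIPELINE, runner=run_step):
    """Run steps strictly in order; raise on the first non-zero exit code."""
    completed = []
    for script in steps:
        if runner(script) != 0:
            raise RuntimeError(f"{script} failed; completed so far: {completed}")
        completed.append(script)
    return completed
```

Because the loop raises before `content:sync` whenever an earlier step fails, sync never runs against a stale or missing graph.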
# Validate content (linter-only, no mutations)
npm run validate:ids
# Generate artifacts
npm run content:graph # Knowledge graph + link-index.json
npm run content:llms # llms.txt hierarchy (3 layers)
npm run content:jsonld # JSON-LD structured data
npm run content:embeddings # Vector embeddings (bge-m3, 1024d)
npm run content:vision # AI image descriptions
npm run content:images # Process images (WebP, R2 upload)
# Sync to D1 (chunked, failure-strict)
npm run content:sync # Local D1
npm run content:sync:remote # Remote D1
# Full pipeline
npm run pipeline # validate -> graph -> llms -> jsonld -> sync
npm run pipeline:remote # Same, remote D1
# Backup
npm run backup:d1 # Export all D1 tables
npm run backup:r2 # Download all R2 objects
npm run backup # Both
Scrape a product URL, extract structured product data, and generate a Markdown draft with atomic image download. The recommended mode is deterministic --no-llm; LLM mode is optional. This is a draft upload tool. Generated products are active: false and won't appear on the live site until you review and publish them.
cd /Users/alexkou/Documents/github/b2bweb
npm run product:copy:setup
source .venv/bin/activate
The recommended path is --no-llm, which does not require OpenAI quota or a local model. If LLM mode is used, provide a real OpenAI Platform API key with --api_key or $OPENAI_API_KEY. Codex OAuth is not an OpenAI Platform API key and is not used by this script.
python3 scripts/copy_product.py \
--url "https://m.gadgetfix.com/white-charging-port-dock-usb-c-connector-flex-cable-for-iphone-16e-iphone-17e-11052.html" \
--require-image \
--strict \
--no-llm
Use the non-mobile https://gadgetfix.com/... URL if the mobile URL has SSL issues.
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"
python3 scripts/copy_product.py \
--url "https://m.gadgetfix.com/..." \
--model "qwen2.5" \
--require-image --strict
Generated products are drafts (draft: true, needs_review: true, active: false). Review title, SKU, MOQ, pricing, compatible models, materials, image, and source accuracy. Publish by changing:
draft: false
needs_review: false
active: true
python3 scripts/copy_product.py \
--url "https://m.gadgetfix.com/..." \
--require-image --strict --no-llm --push \
--commit-message "content: add iphone 16e 17e charging port flex"
--push commits and pushes to the current branch (sets upstream automatically). It does not merge to main or create a PR. The Cloudflare deploy pipeline only triggers when changes reach main.
| Flag | Description |
|---|---|
| --url | (Required) Product page URL to scrape |
| --model | LLM model name (default: gpt-4o-mini) |
| --api_key | OpenAI Platform API key for LLM mode |
| --base_url | API base URL (default: $OPENAI_BASE_URL or OpenAI) |
| --no-llm | Use deterministic page parsing and skip OpenAI/Ollama |
| --force | Overwrite existing Markdown file |
| --force-image | Overwrite existing image file |
| --require-image | Fail if image cannot be saved |
| --strict | Fail if category/compatible_models/materials are empty |
| --push | Validate, commit, and push to current branch |
| --commit-message | Custom commit message (default: "content: add copied product") |
See docs/PRODUCT_UPLOAD_CLI.md for full details and safety notes.
This section is a complete walkthrough for importing products in bulk. No prior experience with the batch CLI is required. If you can run commands in a terminal, you can use this.
The Batch Product Upload CLI lets you point at a supplier's category page, pick which products you want by keyword, and automatically generate draft Markdown files for each matching product — complete with images, metadata, and frontmatter. You review the drafts in Obsidian, approve the ones you want, and promote them into the live product catalog.
Nothing is published automatically. Every product goes through human review before it reaches the live site.
The system is designed around three principles:
- Local scripts do the work, not AI. The CLI fetches pages, parses HTML, and downloads images using plain Python and curl. No LLM is called during batch intake (--no-llm is always on). This keeps the process fast, deterministic, and free from API costs.
- Sandbox first, publish later. Generated drafts land in a staging area (content/_incoming/), never directly in production (content/products/). You review and approve each file before it goes live. This prevents bad data from ever reaching customers.
- Deduplication is automatic. Every imported URL is hashed and recorded in a local registry. If you run the same intake twice, already-imported products are silently skipped. You never get duplicates.
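The deduplication principle can be illustrated with a short Python sketch. It assumes SHA-256 as the hash and a registry file holding one hex digest per line; the real layout of import-registry.txt may differ, and `url_hash`, `filter_new_urls`, and `commit_registry` are hypothetical names.

```python
import hashlib
from pathlib import Path


def url_hash(url):
    """Hash a URL so the registry stores fixed-length digests, not raw URLs."""
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()


def filter_new_urls(urls, registry_path):
    """Return only URLs whose digest is not yet in the registry file."""
    registry = Path(registry_path)
    seen = set(registry.read_text().split()) if registry.exists() else set()
    fresh = []
    for url in urls:
        digest = url_hash(url)
        if digest not in seen:
            seen.add(digest)  # also drops repeats within the same batch
            fresh.append(url)
    return fresh


def commit_registry(urls, registry_path):
    """Append digests for successfully imported URLs (assumed one per line)."""
    with open(registry_path, "a", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url_hash(url) + "\n")
```

Running `filter_new_urls` a second time after `commit_registry` returns an empty list for the same URLs, which is the behaviour described as "silently skipped".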
YOU ARE HERE
|
v
Step 1 Supplier category page (e.g. gadgetfix.com/apple-parts.html)
|
v
Step 2 extract_candidates.py ---- fetches the page, collects all product URLs
|
v
Step 3 filter_candidates.py ----- keeps only URLs matching your keywords,
| removes URLs matching your exclude terms
v
Step 4 batch_copy_products.sh --- for each new URL:
| - runs copy_product.py (fetch + parse + image download)
| - deduplicates against the import registry
| - validates generated drafts
| - commits the registry atomically
v
Step 5 content/_incoming/products/{run_id}/ --- your draft sandbox
|
v
Step 6 YOU review in Obsidian --- check titles, images, pricing, models
| set review_status: approved on good ones
v
Step 7 promote_incoming.py ------ copies approved drafts to content/products/
| updates frontmatter (draft:false, active:true)
v
Step 8 validate, commit, PR, merge --- standard content workflow
|
v
Step 9 Cloudflare deploys automatically
b2bweb/
content/
_incoming/
products/
20260517-143022/ <-- your run
some-product.md <-- draft (review_status: pending)
another-product.md
assets/
some-product-main.jpg
another-product-main.jpg
batch.log <-- what happened during import
product-urls.txt <-- which URLs were processed
products/ <-- production (after promotion)
some-product.md <-- promoted (active: true)
.local/
intake/
import-registry.txt <-- URL hash registry (dedup)
You need:
- macOS with Python 3.10+ and curl (curl ships with macOS; a recent Python 3 may need a separate install)
- Node.js 18+ and npm
- Git
- A clone of this repository
# 1. Clone the repo (skip if you already have it)
git clone https://github.com/alexmorerich/b2bweb.git
cd b2bweb
# 2. Install Node dependencies
npm install
# 3. Create a Python virtual environment
python3 -m venv .venv
# 4. Activate the virtual environment
source .venv/bin/activate
# 5. Install Python dependencies for the product scraper
pip install -r scripts/requirements-copy-product.txt
You only do this once. In future sessions, just run source .venv/bin/activate.
Pick a supplier category page. For example, to import iPhone repair parts from GadgetFix:
https://gadgetfix.com/apple-parts.html
Open the page in your browser to see what's there. Note the kinds of products you want (and don't want).
Open a terminal in the b2bweb directory and set these three required variables:
cd /path/to/b2bweb
source .venv/bin/activate
# Which domains are allowed (comma-separated if multiple)
export B2BWEB_ALLOWED_DOMAINS="gadgetfix.com"
# The category/search page to scrape
export B2BWEB_SOURCE_URL="https://gadgetfix.com/apple-parts.html"
# Keywords to match (one per line, using $'...' syntax)
export B2BWEB_KEYWORDS=$'iphone 17 pro max
bluetooth
flex cable
charging port
usb-c'
Optionally, exclude products you don't want:
export B2BWEB_EXCLUDE_TERMS=$'case
protector
tempered glass'
How keywords work: The filter checks each product URL for your keywords. A URL like replacement-bluetooth-flex-cable-for-iphone-17-pro-max-11014.html would match bluetooth, flex cable, and iphone 17 pro max. URLs containing any exclude term are dropped.
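The matching rule can be sketched as follows. This is an illustration under assumptions, not the code in filter_candidates.py: it assumes hyphens and other separators are treated as spaces on both the URL and the keyword, and that matching is a case-insensitive substring test. The names `normalize` and `keep_url` are hypothetical.

```python
import re
from urllib.parse import urlparse


def normalize(text):
    """Lowercase and turn every non-alphanumeric run into a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def keep_url(url, keywords, exclude_terms=()):
    """Keep a URL if any keyword appears in its path and no exclude term does."""
    text = normalize(urlparse(url).path)
    excludes = [term for term in map(normalize, exclude_terms) if term]
    includes = [kw for kw in map(normalize, keywords) if kw]
    if any(term in text for term in excludes):
        return False
    return any(kw in text for kw in includes)
```

Normalizing the keyword as well as the URL matters for terms such as usb-c, which would otherwise never match a path whose hyphens have been turned into spaces.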
bash scripts/intake_run.sh
This single command runs the full pipeline (extract -> filter -> batch copy). You'll see output like:
=== B2BWeb Batch Intake v10 ===
run_id: 20260517-143022
source: https://gadgetfix.com/apple-parts.html
--- Extracting candidates...
--- Filtering candidates...
--- Running batch copy...
=== INTAKE SUMMARY ===
pipeline_version: v10
schema_version: 1
batch_status: success
registry_committed: true
blocked: false
candidate_count: 87
selected_count: 12
duplicate_skipped_count: 0
new_url_count: 12
success_generated: 12
failed_generated: 0
stage_directory: content/_incoming/products/20260517-143022
Reading the summary:
| Field | Meaning |
|---|---|
| candidate_count | Total product URLs found on the page |
| selected_count | URLs that matched your keywords (after excludes) |
| duplicate_skipped_count | URLs already imported in a previous run |
| new_url_count | URLs actually processed this run |
| success_generated | Draft files successfully created |
| failed_generated | Products that failed to import |
| batch_status | success or no_new_urls = healthy. Anything else = check logs |
| stage_directory | Where your drafts live |
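If you want to check a run from a script, the summary can be parsed as simple key: value lines. The sketch below assumes the output format shown above; `parse_summary` and `is_healthy` are hypothetical helpers, and the health rule is taken from the batch_status row of the table.

```python
HEALTHY_STATUSES = {"success", "no_new_urls"}


def parse_summary(text):
    """Parse 'key: value' lines from the intake summary into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # banner lines such as '=== INTAKE SUMMARY ===' have no colon
            fields[key.strip()] = value.strip()
    return fields


def is_healthy(fields):
    """A run is healthy when batch_status is success or no_new_urls."""
    return fields.get("batch_status") in HEALTHY_STATUSES
```

Values are kept as strings, so convert counts with `int(fields["candidate_count"])` when you need to compare them.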
Open your b2bweb vault in Obsidian. Navigate to:
content/_incoming/products/20260517-143022/
Each .md file is a product draft. Open one and check:
- Title — Is it accurate and readable?
- SKU — Does it look right?
- Compatible models — Are the right devices listed?
- Materials — Reasonable for the product type?
- Image — Scroll down. Does the product image show correctly?
- Price — If extracted, is it in the right range?
- Source URL — At the bottom. Click it to verify against the original page.
The frontmatter will look like this:
---
id: 01JXYZ...
entity_type: product
slug: replacement-bluetooth-flex-cable-for-iphone-17-pro-max
sku: "IP17PM-BT-FLEX-11014"
title: "Replacement Bluetooth Flex Cable for iPhone 17 Pro Max"
draft: true
needs_review: true
active: false
source_url: "https://gadgetfix.com/replacement-bluetooth-flex-cable-..."
review_status: pending # <-- change this to approve
category: ["accessories"]
compatible_models: ["iPhone 17 Pro Max"]
materials: ["flexible pcb"]
moq: 1
unit: "pcs"
price_usd: 0.0
---
For each product you want to publish, change one line in the frontmatter:
# Before
review_status: pending
# After
review_status: approved
Leave products you don't want as pending, or delete the file entirely.
You can also fix any metadata while reviewing — edit the title, adjust the price, add compatible models, etc.
Once you've marked your approved products, run:
python3 scripts/promote_incoming.py content/_incoming/products/20260517-143022
Replace 20260517-143022 with your actual run ID.
This command:
- Copies files with review_status: approved into content/products/
- Copies their images into content/products/assets/
- Updates frontmatter: draft: false, active: true, review_status: promoted
- Skips everything still marked pending
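The frontmatter part of that behaviour can be sketched in a few lines of Python. This is not the code in promote_incoming.py: it is a simplified, line-based illustration that assumes flat key: value frontmatter between two --- lines, and `promote_frontmatter` is a hypothetical name. File and image copying are left out.

```python
# Fields rewritten on promotion, as listed above.
UPDATES = {"draft": "false", "active": "true", "review_status": "promoted"}


def _key_value(line):
    """Split 'key: value  # comment' into (key, value); (None, None) if no colon."""
    key, sep, rest = line.partition(":")
    if not sep:
        return None, None
    return key.strip(), rest.split("#", 1)[0].strip()


def promote_frontmatter(markdown):
    """Return the promoted document, or None if it is not exactly 'approved'."""
    lines = markdown.split("\n")
    if lines[0].strip() != "---" or "---" not in lines[1:]:
        return None  # no frontmatter block found
    end = lines.index("---", 1)
    head = lines[1:end]
    status = dict(_key_value(line) for line in head).get("review_status")
    if status != "approved":  # case-sensitive, exact match
        return None
    new_head = []
    for line in head:
        key, _ = _key_value(line)
        new_head.append(f"{key}: {UPDATES[key]}" if key in UPDATES else line)
    return "\n".join(["---"] + new_head + lines[end:])
```

A draft still marked pending, or one marked Approved with a capital A, returns None and is therefore skipped.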
Output:
{
"promoted_count": 8,
"skipped_count": 4,
"error_count": 0
}
Run the content validator to make sure everything is consistent:
npm run validate:ids
You should see:
=== Content Validation ===
All content is valid. No issues found.
If there are issues (missing IDs, duplicate slugs), the validator will tell you exactly what to fix.
Create a branch, commit the promoted products, and push:
# Make sure you're on a clean branch
git checkout main
git pull origin main
git checkout -b content/batch-products-20260517-143022
# Stage only the promoted production files
git add content/products/ content/products/assets/
git commit -m "content: add batch products from 20260517-143022"
# Push and create PR
git push -u origin HEAD
Open the PR link printed by git, review it on GitHub, and merge to main. The content workflow will then validate the change and sync it to D1. A content-only merge does not redeploy the Worker.
By default, the extractor looks for URLs ending in -{number}.html. If your supplier uses a different URL format:
# Match URLs ending in /product/{number}
export B2BWEB_PRODUCT_URL_REGEX='.*\/product\/[0-9]+$'
bash scripts/intake_run.sh
By default, keywords are matched against the URL text only. For more accurate filtering, enable detail mode, which fetches each candidate page's title and meta description:
export B2BWEB_FILTER_FETCH_DETAILS=1
bash scripts/intake_run.sh
This is slower (one HTTP request per candidate) but catches products whose URLs don't contain descriptive text.
If you need to re-import a product that's already in content/products/:
python3 scripts/promote_incoming.py content/_incoming/products/{run_id} --overwrite --overwrite-assets
To see which URLs are being processed:
export B2BWEB_VERBOSE=1
bash scripts/intake_run.sh
If the run reports a lock_held error, another intake process is running or a previous one crashed. If no other intake is running, remove the stale lock:
rmdir .local/intake/.lock
If no candidates are found:
- Check that B2BWEB_SOURCE_URL actually contains product links
- Check that B2BWEB_ALLOWED_DOMAINS matches the domain in the URLs
- Try adjusting B2BWEB_PRODUCT_URL_REGEX if the site uses non-standard URLs
The registry remembers every URL you've imported. If you need to re-import:
# View the registry
cat .local/intake/import-registry.txt
# Clear it to allow re-importing everything
# (the leading ":" keeps zsh, the macOS default shell, from waiting on stdin)
: > .local/intake/import-registry.txt
Check the batch log for details:
tail -30 content/_incoming/products/{run_id}/batch.log
Common issues:
- Product page returned 403/429 (blocked by the site)
- No product images found on the page
- Title couldn't be extracted (empty page or JavaScript-rendered content)
If approved drafts are not promoted, make sure you set review_status: approved. The value is case-sensitive and must be exactly approved (not approve or Approved).
# Full intake (one command)
export B2BWEB_ALLOWED_DOMAINS="gadgetfix.com"
export B2BWEB_SOURCE_URL="https://gadgetfix.com/apple-parts.html"
export B2BWEB_KEYWORDS=$'bluetooth\nflex cable\ncharging port'
bash scripts/intake_run.sh
# Review in Obsidian, then promote
python3 scripts/promote_incoming.py content/_incoming/products/{run_id}
# Validate and ship
npm run validate:ids
git add content/products/ && git commit -m "content: add batch products" && git push
For the full technical reference, see docs/BATCH_PRODUCT_UPLOAD_CLI.md.
Upload one product at a time using only macOS Terminal. This uses copy_product.py directly — no batch pipeline, no keyword filtering. Best for adding a specific product you already found on a supplier site.
You find a product URL on a supplier site
|
v
copy_product.py
| curl (fetches the page)
| parses title, SKU, images, price, models, materials
| curl (downloads product image)
| writes Markdown draft to content/products/
v
content/products/{slug}.md <-- draft (active: false)
|
| you review and publish
v
git commit + push --> CI --> Cloudflare (live)
Two extraction modes:
| Mode | Flag | How it works | Needs API key? |
|---|---|---|---|
| LLM | (default) | Sends page text to GPT/Ollama, gets structured JSON back | Yes |
| Deterministic | --no-llm | Parses HTML directly with Python (no AI, no cost) | No |
cd ~/Documents/github/b2bweb
# Install Python dependencies
npm run product:copy:setup
# Activate the virtual environment
source .venv/bin/activate
For future sessions, just run source .venv/bin/activate.
Browse the supplier site and copy the product page URL. Example:
https://gadgetfix.com/white-charging-port-dock-usb-c-connector-flex-cable-for-iphone-16e-iphone-17e-11052.html
Option A: Without AI (deterministic, free, no API key)
cd ~/Documents/github/b2bweb
source .venv/bin/activate
python3 scripts/copy_product.py \
--url "https://gadgetfix.com/white-charging-port-dock-usb-c-connector-flex-cable-for-iphone-16e-iphone-17e-11052.html" \
--require-image \
--strict \
--no-llm
Option B: With OpenAI
export OPENAI_API_KEY="sk-..."
python3 scripts/copy_product.py \
--url "https://gadgetfix.com/white-charging-port-dock-usb-c-connector-flex-cable-for-iphone-16e-iphone-17e-11052.html" \
--require-image \
--strict
Option C: With Ollama (local LLM, free)
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"
python3 scripts/copy_product.py \
--url "https://gadgetfix.com/..." \
--model "qwen2.5" \
--require-image \
--strict
You'll see output like:
🚀 Scraping target: https://gadgetfix.com/white-charging-port-...
🧩 Extracting structured JSON with deterministic parser (--no-llm)...
📸 Image downloaded safely via curl: assets/white-charging-port-...-main.jpg (45832 bytes)
🎉 Clean Draft Successfully Created: content/products/white-charging-port-dock-usb-c-connector-flex-cable-for-iphone-16e-iphone-17e.md
# See the frontmatter
head -30 content/products/white-charging-port-dock-usb-c-connector-flex-cable-for-iphone-16e-iphone-17e.md
Output:
---
id: 01JXYZ...
entity_type: product
slug: white-charging-port-dock-usb-c-connector-flex-cable-for-iphone-16e-iphone-17e
sku: "IP16E-IP17E-USBC-CHG-FLEX-WHT-11052"
title: "White Charging Port Dock USB-C Connector Flex Cable For iPhone 16e iPhone 17e"
draft: true
needs_review: true
active: false
source_url: "https://gadgetfix.com/white-charging-port-..."
category: ["accessories"]
compatible_models: ["iPhone 16e", "iPhone 17e"]
materials: ["flexible pcb", "usb-c connector"]
moq: 1
price_usd: 5.50
---
Check the image was downloaded:
ls -la content/products/assets/ | grep white-charging
# Fix the MOQ
sed -i '' 's/^moq: 1$/moq: 20/' \
content/products/white-charging-port-dock-usb-c-connector-flex-cable-for-iphone-16e-iphone-17e.md
# Fix a price
sed -i '' 's/^price_usd: 5.50$/price_usd: 7.95/' \
content/products/white-charging-port-dock-usb-c-connector-flex-cable-for-iphone-16e-iphone-17e.md
# Or open in nano for bigger edits
nano content/products/white-charging-port-dock-usb-c-connector-flex-cable-for-iphone-16e-iphone-17e.md
Change three fields in the frontmatter to make it live:
FILE="content/products/white-charging-port-dock-usb-c-connector-flex-cable-for-iphone-16e-iphone-17e.md"
sed -i '' 's/^draft: true$/draft: false/' "$FILE"
sed -i '' 's/^needs_review: true$/needs_review: false/' "$FILE"
sed -i '' 's/^active: false$/active: true/' "$FILE"
Verify:
grep -E '^(draft|needs_review|active):' "$FILE"
Expected:
draft: false
needs_review: false
active: true
npm run validate:ids
Expected: All content is valid. No issues found.
git add content/products/
git commit -m "content: add iphone 16e 17e charging port flex cable"
git push origin main
Or use --push to do it in one shot (Step 2 + Step 7 combined):
python3 scripts/copy_product.py \
--url "https://gadgetfix.com/..." \
--require-image --strict --no-llm --push \
--commit-message "content: add iphone 16e 17e charging port flex cable"
| Flag | Description |
|---|---|
| --url | (Required) Product page URL |
| --no-llm | Deterministic parsing; no AI, no API key needed |
| --model | LLM model name (default: gpt-4o-mini) |
| --api_key | OpenAI API key (default: $OPENAI_API_KEY env) |
| --base_url | API base URL (for Ollama: http://localhost:11434/v1) |
| --require-image | Fail if no product image can be downloaded |
| --strict | Fail if category/compatible_models/materials are empty |
| --force | Overwrite existing Markdown file |
| --force-image | Overwrite existing image asset |
| --output-dir | Write to a custom directory instead of content/products/ |
| --push | Auto-validate, commit, and push after generating |
| --commit-message | Custom git commit message (used with --push) |
Upload many products at once from a supplier's category page using only macOS Terminal. The batch CLI scans a page for product links, filters by your keywords, and runs the single-product scraper on each match. No Obsidian, no GUI required.
Supplier website Your Mac (Terminal) GitHub / Cloudflare
================ =================== ===================
Category page
|
| curl (fetch HTML)
v
extract_candidates.py --- collects all product URLs
|
v
filter_candidates.py ---- keeps only keyword matches
|
v
batch_copy_products.sh -- for each URL:
| copy_product.py --no-llm
| (fetch + parse + download image)
v
content/_incoming/ SANDBOX (not live)
|
| you review with: ls, head, grep, sed
| you approve with: sed (change one line)
v
promote_incoming.py ----- copies approved drafts
| to content/products/
v
git add, commit, push ----------------------> GitHub Actions CI
|
v
validate + sync
|
v
Cloudflare Edge
(live on the web)
| Principle | What it means |
|---|---|
| No AI during batch import | Products are parsed with Python + curl, not LLMs. --no-llm is always on. Fast, free, deterministic. |
| Sandbox first | Drafts land in content/_incoming/ staging area. Nothing reaches content/products/ until you explicitly promote. |
| Auto-deduplication | Every URL is SHA-256 hashed into a local registry. Run the same import twice and duplicates are silently skipped. |
b2bweb/
content/
_incoming/products/{run_id}/ <-- sandbox (your drafts land here)
bluetooth-flex-cable.md
charging-port-flex.md
assets/
bluetooth-flex-cable-main.jpg
batch.log
products/ <-- production (after you promote)
bluetooth-flex-cable.md
assets/
.local/intake/
import-registry.txt <-- URL dedup registry
cd ~/Documents/github/b2bweb
npm install
python3 -m venv .venv
source .venv/bin/activate
pip install -r scripts/requirements-copy-product.txt
For future sessions, just: cd ~/Documents/github/b2bweb && source .venv/bin/activate
cd ~/Documents/github/b2bweb
source .venv/bin/activate
# REQUIRED: which domains to allow
export B2BWEB_ALLOWED_DOMAINS="gadgetfix.com"
# REQUIRED: the category page to scan
export B2BWEB_SOURCE_URL="https://gadgetfix.com/apple-parts.html"
# REQUIRED: keywords to include (one per line)
export B2BWEB_KEYWORDS=$'iphone 17 pro max
bluetooth
flex cable
charging port
usb-c'
# OPTIONAL: terms to exclude
export B2BWEB_EXCLUDE_TERMS=$'case
protector
tempered glass'
How keywords work: Each product URL is converted to readable text (replacement-bluetooth-flex-cable becomes replacement bluetooth flex cable). If any keyword appears, it's included. If any exclude term appears, it's dropped.
bash scripts/intake_run.sh
Output:
=== INTAKE SUMMARY ===
batch_status: success
registry_committed: true
candidate_count: 87
selected_count: 12
new_url_count: 12
success_generated: 12
failed_generated: 0
stage_directory: content/_incoming/products/20260518-143022
Save the run ID:
RUN_ID=$(ls -t content/_incoming/products/ | head -1)
echo "Run ID: $RUN_ID"
# List drafts
ls content/_incoming/products/$RUN_ID/*.md
# Quick-scan all titles
grep '^title:' content/_incoming/products/$RUN_ID/*.md
# Check images
ls content/_incoming/products/$RUN_ID/assets/
head -35 content/_incoming/products/$RUN_ID/replacement-bluetooth-flex-cable-for-iphone-17-pro-max.md
Check these fields:
| Field | What to check |
|---|---|
| title | Readable, accurate product name? |
| sku | Sensible abbreviation? |
| compatible_models | Correct device(s)? |
| materials | Reasonable for this product? |
| price_usd | In the right range? (0.0 = not detected, edit manually) |
Approve a single file:
sed -i '' 's/^review_status: pending$/review_status: approved/' \
content/_incoming/products/$RUN_ID/replacement-bluetooth-flex-cable-for-iphone-17-pro-max.md
Approve all at once:
sed -i '' 's/^review_status: pending$/review_status: approved/' \
content/_incoming/products/$RUN_ID/*.md
Approve all, then un-approve rejects:
# Approve everything first
sed -i '' 's/^review_status: pending$/review_status: approved/' \
content/_incoming/products/$RUN_ID/*.md
# Un-approve specific files you don't want
sed -i '' 's/^review_status: approved$/review_status: pending/' \
content/_incoming/products/$RUN_ID/front-camera-flex-cable-for-iphone-17-pro-max.md
Verify:
grep '^review_status:' content/_incoming/products/$RUN_ID/*.md
# Fix a price
sed -i '' 's/^price_usd: 0.0$/price_usd: 4.50/' \
content/_incoming/products/$RUN_ID/replacement-bluetooth-flex-cable-for-iphone-17-pro-max.md
# Fix MOQ
sed -i '' 's/^moq: 1$/moq: 20/' \
content/_incoming/products/$RUN_ID/replacement-bluetooth-flex-cable-for-iphone-17-pro-max.md
# Or use nano for bigger changes
nano content/_incoming/products/$RUN_ID/replacement-bluetooth-flex-cable-for-iphone-17-pro-max.md
python3 scripts/promote_incoming.py content/_incoming/products/$RUN_ID
Output:
{
"promoted_count": 10,
"skipped_count": 2,
"error_count": 0
}
Verify files landed in production:
ls content/products/*.md | tail -5
npm run validate:ids
Expected: All content is valid. No issues found.
git add content/products/
git commit -m "content: add batch products from $RUN_ID"
git push origin main
GitHub Actions CI automatically validates the content and syncs it to D1. A content-only push does not redeploy the Worker.
Replace the values at the top and paste into Terminal:
# --- Configuration (EDIT THESE) ---
cd ~/Documents/github/b2bweb
source .venv/bin/activate
export B2BWEB_ALLOWED_DOMAINS="gadgetfix.com"
export B2BWEB_SOURCE_URL="https://gadgetfix.com/apple-parts.html"
export B2BWEB_KEYWORDS=$'iphone 17 pro max
bluetooth
flex cable
charging port
usb-c'
export B2BWEB_EXCLUDE_TERMS=$'case
protector
tempered glass'
# --- Run intake ---
bash scripts/intake_run.sh
# --- Identify run ---
RUN_ID=$(ls -t content/_incoming/products/ | head -1)
echo "Run: $RUN_ID"
# --- Quick review ---
echo "=== TITLES ==="
grep '^title:' content/_incoming/products/$RUN_ID/*.md
# --- Approve all ---
sed -i '' 's/^review_status: pending$/review_status: approved/' \
content/_incoming/products/$RUN_ID/*.md
# --- Promote ---
python3 scripts/promote_incoming.py content/_incoming/products/$RUN_ID
# --- Validate and push ---
npm run validate:ids
git add content/products/
git commit -m "content: add batch products from $RUN_ID"
git push origin main
Xxxweb category pages include product-card links and extra recommendation links. Use B2BWEB_LINK_CLASS="product-main-img" to import only the visible category products and avoid bottom/related product blocks.
cd ~/Documents/github/b2bweb
source .venv/bin/activate
export B2BWEB_ALLOWED_DOMAINS="www.xxxweb-online.com,xxxweb-online.com"
export B2BWEB_SOURCE_URL="https://www.xxxweb-online.com/product/default!search.do?keyword=&categoryId=111855&priceRange=&brandIds=&colorIds=&certs=&brandModelIds=&propOptions=&closedFilters=&orderBy=rank&desc=true"
export B2BWEB_PRODUCT_URL_REGEX="^/p/[^/]+/.*\.htm$"
export B2BWEB_LINK_CLASS="product-main-img"
export B2BWEB_KEYWORDS=$'iphone'
export B2BWEB_MAX_FAILURES=100
bash scripts/intake_run.sh
| Variable | Required | Default | Description |
|---|---|---|---|
| B2BWEB_ALLOWED_DOMAINS | Yes | (none) | Comma-separated allowed domains |
| B2BWEB_SOURCE_URL | Yes | (none) | Category/search page URL to scan |
| B2BWEB_KEYWORDS | Yes | (none) | Newline-separated include keywords |
| B2BWEB_EXCLUDE_TERMS | No | empty | Newline-separated exclude terms |
| B2BWEB_RUN_ID | No | YYYYMMDD-HHMMSS | Custom run identifier |
| B2BWEB_RUN_DIR | No | /tmp/b2bweb-intake/{run_id} | Custom runtime directory |
| B2BWEB_PRODUCT_URL_REGEX | No | .*-[0-9]+\.html$ | Regex for product URL matching |
| B2BWEB_LINK_CLASS | No | unset | Only keep product links whose anchor has this CSS class, useful for excluding recommendation blocks |
| B2BWEB_MAX_FAILURES | No | 3 | Stop batch after N failures |
| B2BWEB_FILTER_FETCH_DETAILS | No | 0 | Set 1 to fetch page titles for filtering |
| B2BWEB_VERBOSE | No | 0 | Set 1 for verbose output |
| B2BWEB_PROXY | No | unset | HTTP proxy for curl |
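Before a run, you can check a B2BWEB_PRODUCT_URL_REGEX value against a few sample URLs. The sketch below uses Python's `re` module, which may differ in detail from the regex engine the extractor actually uses; `is_product_url` is a hypothetical helper and the example.com URL is made up for illustration.

```python
import re

# Default pattern from the table above.
DEFAULT_PRODUCT_URL_REGEX = r".*-[0-9]+\.html$"


def is_product_url(url, pattern=DEFAULT_PRODUCT_URL_REGEX):
    """Return True when the URL matches the product URL pattern."""
    return re.search(pattern, url) is not None


samples = [
    "https://gadgetfix.com/replacement-bluetooth-flex-cable-for-iphone-17-pro-max-11014.html",
    "https://gadgetfix.com/apple-parts.html",
]
for sample in samples:
    print(sample, is_product_url(sample))

# A custom pattern for URLs ending in /product/{number} (hypothetical URL).
print(is_product_url("https://example.com/product/123", r".*\/product\/[0-9]+$"))
```

The category page itself does not match the default pattern because it lacks the trailing -{number}.html, which is why it is never treated as a product.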
"lock_held" error — Previous run crashed. Fix: rmdir .local/intake/.lock
Zero candidates — URL regex doesn't match the site. Debug:
curl -sL "$B2BWEB_SOURCE_URL" | grep -oE 'href="[^"]*"' | head -20
All duplicates — Registry already has them. Reset it with:
: > .local/intake/import-registry.txt
Validation fails — Auto-fix: npm run validate:ids -- --fix
price_usd: 0.0 — Scraper couldn't extract price. Fix it with sed:
sed -i '' 's/^price_usd: 0.0$/price_usd: 5.00/' content/products/PRODUCT-SLUG.md
Undo a promotion — Delete the file: rm content/products/PRODUCT-SLUG.md
View import log — cat content/_incoming/products/$RUN_ID/batch.log
Remove supplier watermarks from product images using LaMA AI inpainting. Scripts: scripts/run_lama.py (single image) and scripts/remove-watermarks.py (batch).
pip3 install -r scripts/requirements-watermark.txt
# LaMa TorchScript model (~206 MB) — high-frequency texture-preserving inpainting.
# Runs directly on torch (CPU/MPS); no IOPaint needed.
curl -L -o /tmp/big-lama-model.pt \
"https://github.com/enesmsahin/simple-lama-inpainting/releases/download/v0.1.0/big-lama.pt"

LaMa vs OpenCV inpainting. Telea/Navier-Stokes leave blur/artifacts on dense industrial textures (PCB traces, laser etching, gold pins). The LaMa model reconstructs high-frequency texture, giving commercial-grade results (sharpness ratio ≈ 1.0). In clean it activates automatically when the model file exists. In pilot/process it is off by default, since CPU inference is ~6 s/image; pass --use-lama to enable it.

IOPaint (the usual LaMa wrapper) does not build on Python 3.13 because it pins an old Pillow that fails to compile. We use the underlying TorchScript model directly via _inpaint_lama, which forces CPU to avoid an Apple-MPS compiler crash seen with repeated GPU loads.
Before a full run, sanity-check detection + mask quality on a small batch. The pilot command scans a capped sample, attempts a clean on each, and emits an original / mask / cleaned preview HTML for visual sign-off. Source images are never modified; all output goes to a folder outside the repo.
python3 scripts/remove-watermarks.py pilot \
--assets /Users/alexkou/Documents/github/b2bweb/content/products/assets \
--max-total 50 \
--preset review \
--rights-confirmed

| Flag | Meaning |
|---|---|
| --assets | Source image directory (read-only) |
| --max-total | Max watermarked images to include in the pilot (default 50) |
| --preset | Detection preset (default review — lenient discovery + MSER + verification) |
| --rights-confirmed | Required. Confirms you hold the rights to remove the watermark. Without it the command refuses. |
| --out | Output folder (default ~/Downloads/sunsky-watermark-pilot-<timestamp>; must be outside the repo) |
| --use-lama | Use LaMA inpainting if /tmp/big-lama-model.pt exists (slower, higher quality) |
Output: review.html (side-by-side original/mask/cleaned with per-image status), cleaned/, masks/, manifest.json. Open with open ~/Downloads/sunsky-watermark-pilot-*/review.html.
⚠️ iPhone 14 and newer are EXCLUDED from scanning. The supplier (sunsky-online.com) stopped watermarking iPhone 14/15/16/17 series images, so those product photos are already clean. The scanner skips any filename whose only iPhone token is iphone-14 or higher (e.g. for-iphone-14-pro-..., for-iphone-15-pro-max-..., for-iphone-16-..., for-iphone-17-...). The rule lives in should_scan_file() in scripts/remove-watermarks.py; change SKIP_IPHONE_MIN if the supplier's policy changes.
python3 scripts/remove-watermarks.py scan # medium preset (default)
python3 scripts/remove-watermarks.py scan --preset high # most accurate, slowest
python3 scripts/remove-watermarks.py scan --preset fast # quickest, may miss faint watermarks
python3 scripts/remove-watermarks.py scan --assets /path/to/content/products/assets  # custom vault

Speed vs accuracy presets (rate is per worker; default uses 10 workers):

| Preset | Downscale | Scales | Expected rate | 17K images | Notes |
|---|---|---|---|---|---|
| high | 1200px | 11 | ~3 img/s | ~95 min | Most accurate — final production pass |
| medium (default) | 600px | 6 | ~10 img/s | ~28 min | Balanced — recommended for day-to-day |
| fast | 300px | 3 | ~25 img/s | ~11 min | Quick triage |
| super | 200px | 2 | ~60 img/s | ~5 min | Very fast — for re-scans |
| lightning | 100px | 2 | ~120 img/s | ~2 min | Extreme — significant recall loss |
All presets disable MSER (caused false positives on clutter) and run a full-resolution verification on every candidate (rejects scores < 0.42 at native scale).
Multi-template matching: in addition to the synthesized "Helvetica" template, the scanner loads any scripts/watermark-templates/real-*.png crops (real watermark patches extracted from product images). Best score across all templates wins, which improves recognition of fonts/styles the synthesized version misses.
Outputs watermarked-images.json with detected files and coordinates.
Full scan filter — which files are checked:
| Category | Scanned? | Examples |
|---|---|---|
| iPhone 13 / 12 / 11 / XS / XR / X / SE | Yes | for-iphone-13-pro-..., for-iphone-x-..., for-iphone-se-... |
| iPhone 14 / 15 / 16 / 17 | Skipped (excluded) | for-iphone-14-pro-..., for-iphone-15-pro-max-..., for-iphone-16-..., for-iphone-17-... |
| Combo products (e.g. fits 12 + 13 + 14) | Yes | Any filename containing an iPhone 13 or earlier token |
| iPad / Mac / Apple Watch / accessories | Yes | for-ipad-mini-..., for-macbook-pro-..., for-apple-watch-... |
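
The filter table above can be read as a small decision rule. The sketch below is one way to express it in Python; it is an illustration, not the actual should_scan_file() in scripts/remove-watermarks.py. The token regex and the treatment of X/XS/XR/SE as "earlier than 14" are assumptions drawn from the examples in the table.

```python
import re

SKIP_IPHONE_MIN = 14  # first iPhone generation whose images are assumed clean

# Assumed token shapes: iphone-13, iphone-15, iphone-x, iphone-xs, iphone-xr, iphone-se
_IPHONE_TOKEN = re.compile(r"iphone-(\d+|xs|xr|x|se)(?![a-z0-9])")


def should_scan_file(filename):
    """Scan unless every iPhone token in the name is SKIP_IPHONE_MIN or newer."""
    tokens = _IPHONE_TOKEN.findall(filename.lower())
    if not tokens:
        return True  # iPad, Mac, Apple Watch, accessories
    for token in tokens:
        # Non-numeric tokens (x, xs, xr, se) predate the iPhone 14.
        if not token.isdigit() or int(token) < SKIP_IPHONE_MIN:
            return True
    return False


for name in ("for-iphone-13-pro-back-cover.jpg",
             "for-iphone-14-pro-lcd.jpg",
             "for-iphone-x-battery.jpg",
             "for-ipad-mini-6-glass.jpg"):
    print(name, should_scan_file(name))
```

Returning True as soon as one older token is found is what keeps combo products in the scan even when they also mention a newer model.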
Three cleaning modes available:
| Mode | Speed | Quality | Model needed | Best for |
|---|---|---|---|---|
| LaMA + GPU (MPS) | ~5-10 img/s | Best | Yes (196 MB) | Final production cleanup |
| LaMA + CPU | ~1 img/s | Best | Yes (196 MB) | Machines without GPU |
| OpenCV | ~50+ img/s | Good | No | Quick batch runs |
python3 scripts/remove-watermarks.py clean -f watermarked-images.json # LaMA (auto-detects MPS GPU, falls back to CPU)
python3 scripts/remove-watermarks.py clean -f watermarked-images.json --method opencv # fastest, no model needed
python3 scripts/remove-watermarks.py clean --dry-run -f watermarked-images.json # preview only
python3 scripts/remove-watermarks.py clean --assets /path/to/content/products/assets -f watermarked-images.json  # custom vault

LaMA automatically uses Apple Silicon GPU (MPS) when available, falling back to CPU. Originals are backed up to /tmp/watermark-backup/ before modification.
For a single image with a custom mask:
python3 scripts/run_lama.py --image input.jpg --mask mask.png --output clean.jpg

git add content/products/assets/
git commit -m "fix: remove watermarks from product images"
git push origin main

git push triggers content-sync.yml, which runs automatically:
validate (entity ID lint)
-> images (WebP conversion, R2 upload)
-> build (graph, llms.txt, JSON-LD)
-> sync (D1 chunked batch)
-> embeddings (Vectorize upsert)
No manual R2/D1 steps needed — CI converts cleaned images to WebP, uploads to R2, and syncs product data to D1.
Markdown sync (content:sync) respects D1 runtime state:
| Field | Rule |
|---|---|
| stock_qty | Never overwritten by sync |
| price_usd, bulk_price_usd, bulk_qty | Preserved if price_locked = 1; updated from markdown if price_locked = 0 |
| active | Preserved unless markdown explicitly sets active: true/false |
| vision_description | Preserved if not null; overwritten only when new value exists |

Flags:
- --force-price -- Override price_locked and update prices from markdown
- --deactivate-missing -- Set active = 0 for products not in markdown
- --remote -- Execute against remote D1
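
The field rules and the --force-price flag can be summarised as a merge function. The following Python sketch is illustrative only (the real logic lives in scripts/sync-content.ts); merge_product and the dictionary shapes are hypothetical, and "new value exists" is read here as "markdown supplies a non-empty value".

```python
def merge_product(db_row, md, force_price=False):
    """Return the row to write back, given the current D1 row and markdown frontmatter."""
    merged = dict(db_row)  # start from runtime state, so stock_qty is never touched

    # Prices follow markdown unless the row is locked; force_price overrides the lock.
    if force_price or db_row.get("price_locked", 0) == 0:
        for field in ("price_usd", "bulk_price_usd", "bulk_qty"):
            if field in md:
                merged[field] = md[field]

    # active changes only when markdown sets it explicitly.
    if "active" in md:
        merged["active"] = 1 if md["active"] else 0

    # vision_description is replaced only by a new non-empty value.
    if md.get("vision_description"):
        merged["vision_description"] = md["vision_description"]

    return merged


row = {"stock_qty": 40, "price_usd": 5.0, "price_locked": 1, "active": 1}
frontmatter = {"stock_qty": 0, "price_usd": 4.5}
print(merge_product(row, frontmatter))                    # locked price is kept
print(merge_product(row, frontmatter, force_price=True))  # price follows markdown
```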
D1 sync splits statements into chunks of 50 to avoid Wrangler payload limits. Each chunk is executed separately. If any chunk fails, the script exits non-zero immediately.
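
As a rough illustration of the chunked, failure-strict behaviour, the Python sketch below splits a statement list into groups of 50 and stops at the first failing chunk. The execute callable is a hypothetical stand-in for whatever runs a chunk against D1; the returned integer is the exit code the caller would pass on.

```python
import sys

CHUNK_SIZE = 50


def chunked(statements, size=CHUNK_SIZE):
    """Yield consecutive slices of at most `size` statements."""
    for start in range(0, len(statements), size):
        yield statements[start:start + size]


def sync(statements, execute):
    """Run each chunk in order; return 1 on the first failure, 0 if all succeed."""
    for index, chunk in enumerate(chunked(statements), start=1):
        if not execute(chunk):  # hypothetical: returns True on success
            print(f"chunk {index} failed; aborting", file=sys.stderr)
            return 1
    return 0


statements = [f"UPDATE products SET active = 1 WHERE id = {i};" for i in range(120)]
print([len(chunk) for chunk in chunked(statements)])
print(sync(statements, execute=lambda chunk: True))
```

Stopping on the first failure leaves later chunks unapplied, which is safe only because the sync is idempotent and can simply be rerun.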
Soft-delete sets active: false in frontmatter and D1. The .md file stays on disk (visible in Obsidian). Add --purge to also remove the file.
# Soft-delete (keeps .md file, sets active: false)
npm run product:delete -- --slug PRODUCT-SLUG --remote
# Hard-delete (removes .md file + deactivates in D1)
npm run product:delete -- --slug PRODUCT-SLUG --remote --purge
# With optional reason
npm run product:delete -- --slug PRODUCT-SLUG --remote --purge --reason "Discontinued"

# Comma-separated slugs
npm run product:batch-delete -- --slugs slug-one,slug-two,slug-three --remote
# With purge (removes all .md files)
npm run product:batch-delete -- --slugs slug-one,slug-two --remote --purge
# From a text file (one slug per line, # lines ignored)
npm run product:batch-delete -- --file scripts/to-delete.txt --remote --purge

to-delete.txt format:
# Products to remove - May 2026
samsung-galaxy-s25-ultra-clear-case
samsung-galaxy-s25-ultra-screen-protector
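
For reference, a slug file in this format could be read as follows. This is a sketch, not the batch-delete script itself; read_slugs is a hypothetical name, and skipping blank lines is an assumption beyond the stated "# lines ignored" rule.

```python
def read_slugs(text):
    """Return the slugs in a to-delete file: one per line, comment lines ignored."""
    slugs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumed: blank lines are skipped along with comments
        slugs.append(line)
    return slugs


sample = """# Products to remove - May 2026
samsung-galaxy-s25-ultra-clear-case
samsung-galaxy-s25-ultra-screen-protector
"""
print(read_slugs(sample))
```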
Both commands issue a single targeted D1 UPDATE — no full catalog sync. Safe to run at any time without risking auth timeouts.
The repository root is the Obsidian vault. Content lives under content/. Obsidian is the CMS.
Obsidian edit -> git commit -> git push -> CI validates and syncs
CI does NOT auto-commit. Missing IDs fail CI with a clear error message. ID generation and fixes happen locally via Obsidian Templater or npm run validate:ids -- --fix.
- Open Obsidian at the repository root, not at content/ alone.
- Obsidian Git must not auto-commit or auto-push on a timer or file change.
- Pull before starting work: git pull --ff-only
- Review git status before every commit.
- Content commits should normally include only content/, attachments/, or selected public/ assets.
Every product markdown must have: id, title, slug, sku, category, moq, price_usd.
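
A required-field check of this kind can be sketched in a few lines of Python. This is not scripts/validate-ids.ts; it assumes flat key: value frontmatter between --- fences, and the helper names are hypothetical.

```python
REQUIRED_FIELDS = ("id", "title", "slug", "sku", "category", "moq", "price_usd")


def parse_frontmatter(markdown):
    """Tiny parser for flat `key: value` frontmatter between --- fences."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def missing_fields(markdown):
    """List required fields that are absent or empty, in declaration order."""
    fields = parse_frontmatter(markdown)
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]


doc = "---\nid: p-001\ntitle: Clear Case\nslug: clear-case\ncategory: cases\nprice_usd: 1.20\n---\nBody text"
print(missing_fields(doc))
```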
[[TPU]], [[CE]], [[iPhone 16 Pro Max]] -- resolved during sync using:
- Generated link index (generated/link-index.json) from graph build
- Hardcoded material/certification slug sets (fallback)
- Device prefix detection
- Default: /wholesale/{slug}
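
Read as a priority chain, the resolution steps might look like the Python sketch below. Only the /wholesale/{slug} default comes from the text; the slug sets, the device prefixes, the /materials/, /certifications/ and /devices/ URL shapes, and a link index keyed by slug are all assumptions made for illustration. The actual resolver is src/lib/wikilink.ts.

```python
import re

MATERIAL_SLUGS = {"tpu", "silicone"}            # hypothetical fallback set
CERTIFICATION_SLUGS = {"ce", "rohs"}            # hypothetical fallback set
DEVICE_PREFIXES = ("iphone", "ipad", "galaxy")  # hypothetical prefixes


def slugify(label):
    """Lowercase and collapse non-alphanumerics to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def resolve_wikilink(label, link_index):
    slug = slugify(label)
    if slug in link_index:                # 1. generated link index
        return link_index[slug]
    if slug in MATERIAL_SLUGS:            # 2. hardcoded fallback sets
        return f"/materials/{slug}"       #    (URL shape is an assumption)
    if slug in CERTIFICATION_SLUGS:
        return f"/certifications/{slug}"  #    (URL shape is an assumption)
    if slug.startswith(DEVICE_PREFIXES):  # 3. device prefix detection
        return f"/devices/{slug}"         #    (URL shape is an assumption)
    return f"/wholesale/{slug}"           # 4. default from the text


for label in ("TPU", "CE", "iPhone 16 Pro Max", "Magnetic Ring"):
    print(label, "->", resolve_wikilink(label, link_index={}))
```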
Obsidian image embeds are processed by CI:
![[ip16-housing-ga.jpg]]
![[ iPhone 16 Glass .JPG ]]
![[product photo 01.jpeg]]

Parser supports: global matching, case-insensitive extensions, spaces in filenames, .jpg/.jpeg/.png/.webp.
Images are converted to WebP and uploaded to R2 with deterministic keys: products/{slug}/{hash}.webp.
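
The embed parsing and deterministic key scheme can be approximated in Python as below. This is a sketch rather than scripts/process-images.ts: the regex is one possible pattern that satisfies the listed parser properties, and hashing the image bytes with SHA-256 (truncated to 16 hex characters) is an assumption, since the text does not say what the {hash} is derived from.

```python
import hashlib
import re

# Global, case-insensitive match; tolerates spaces inside the brackets and in names.
EMBED_RE = re.compile(r"!\[\[\s*([^\[\]|]+?\.(?:jpe?g|png|webp))\s*\]\]", re.IGNORECASE)


def find_image_embeds(markdown):
    """Return every embedded image filename in document order."""
    return [name.strip() for name in EMBED_RE.findall(markdown)]


def r2_key(slug, image_bytes):
    """Same bytes and slug always give the same key (hash choice is assumed)."""
    digest = hashlib.sha256(image_bytes).hexdigest()[:16]
    return f"products/{slug}/{digest}.webp"


body = "![[ip16-housing-ga.jpg]]\n![[ iPhone 16 Glass .JPG ]]\n![[product photo 01.jpeg]]\n![[notes.md]]"
print(find_image_embeds(body))
print(r2_key("ip16-housing", b"example image bytes"))
```

A content-derived key means re-running the pipeline on unchanged images produces the same R2 object names, so repeated uploads are harmless.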
pip3 install -r scripts/requirements-watermark.txt

Scan all product images for watermark candidates:
python3 scripts/remove-watermarks.py scan

Review the report, then clean:
python3 scripts/remove-watermarks.py clean -f watermarked-images.json
python3 scripts/remove-watermarks.py clean --dry-run -f watermarked-images.json  # preview only

Uses LaMA inpainting if the model is available at /tmp/big-lama-model.pt, and falls back to OpenCV Navier-Stokes otherwise. Originals are backed up to /tmp/watermark-backup/. After cleaning, git push triggers CI (validate → images → R2 → D1 sync).
- CORS allowlist (configurable via ALLOWED_ORIGINS)
- CSRF Origin/Referer validation on mutations
- Rate limiting on login, register, search, quotes, orders, catalog
- Security headers (HSTS, CSP, X-Frame-Options, X-Content-Type-Options)
- Private wholesale catalog: Bearer auth only (no query-token)
- HTML sanitization on all rendered markdown content
- Blocks: script, iframe, object, event handlers, javascript: URLs, dangerous HTMX attributes
- Sanitization applied after markdown-to-HTML conversion and before rendering
- JWT (HS256) with server-side session tracking in D1
- PBKDF2 (100k iterations, SHA-256) password hashing
- HttpOnly, Secure, SameSite=Lax cookies
- Session revocation support
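
For readers unfamiliar with the hashing parameters, the sketch below shows PBKDF2 with 100k iterations and SHA-256 using Python's standard library. It only illustrates the parameters named above; the Worker's own implementation is in src/middleware/auth.ts, and the 16-byte salt and the (salt, digest) return shape here are assumptions.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000


def hash_password(password, salt=None):
    """Derive a PBKDF2-HMAC-SHA256 digest; generates a random salt if none is given."""
    salt = salt if salt is not None else os.urandom(16)  # salt size is an assumption
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected):
    """Recompute with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))
print(verify_password("wrong password", salt, stored))
```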
3-layer retrieval for AI agents:
- /llms.txt -- Index map (noindex)
- /llms-full.txt -- Compressed global catalog (noindex)
- /llms-ctx-{slug}.txt -- Entity-level deep context (noindex)

- /ai-index.json -- Machine-readable product index (noindex)
- /api/knowledge/{slug} -- Markdown export (noindex)
- /api/graph -- Knowledge graph (nodes, edges, summary)
- /robots.txt -- Crawl rules with AI retrieval comments
- /sitemap.xml -- Canonical URLs only (no AI/noindex routes)
All AI-only endpoints return X-Robots-Tag: noindex to prevent Google from indexing duplicate content.
- POST /api/auth/register -- Customer registration (rate limited)
- POST /api/auth/login -- JWT login (rate limited)
- POST /api/auth/logout -- Logout

- GET /api/products -- List (query: category, search, page, limit)
- GET /api/products/:id -- Detail
- POST /api/products -- Create (admin)
- PUT /api/products/:id -- Update (admin)
- DELETE /api/products/:id -- Soft delete (admin)
- PUT /api/products/:id/live -- Update live price/stock (admin)

- GET /api/search?q=&mode=hybrid -- Hybrid search (rate limited)
  - Priority: exact SKU -> alias -> FTS5 -> vector similarity
  - Modes: hybrid (default), semantic, keyword
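
One plausible reading of the priority order is "first non-empty source wins". The Python sketch below shows that reading only; the four lookup callables are hypothetical stand-ins, and the real endpoint may merge or re-rank results instead of short-circuiting.

```python
def hybrid_search(query, by_sku, by_alias, fts, vector):
    """Try each lookup in priority order; return (source, results) for the first hit."""
    lookups = (("sku", by_sku), ("alias", by_alias), ("fts5", fts), ("vector", vector))
    for source, lookup in lookups:
        results = lookup(query)
        if results:
            return source, results
    return None, []


# Hypothetical in-memory lookups, for demonstration only.
skus = {"TPU-IP16-001": ["iphone-16-tpu-case"]}
aliases = {"ip16 case": ["iphone-16-tpu-case"]}

source, hits = hybrid_search(
    "ip16 case",
    by_sku=lambda q: skus.get(q, []),
    by_alias=lambda q: aliases.get(q, []),
    fts=lambda q: [],
    vector=lambda q: ["semantic-fallback-product"],
)
print(source, hits)
```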

- GET /api/knowledge/:slug -- Markdown export
- GET /api/graph -- Full knowledge graph JSON
- GET /api/llms/wholesale-catalog -- Bearer-auth private catalog (rate limited)

- POST /api/orders -- Create order (auth, rate limited)
- GET /api/orders -- List orders (auth)
- POST /api/quotes -- Request quote (auth, rate limited)
- GET /api/quotes -- List quotes (auth)

- GET /api/cart -- Get cart (cookie-based)
- POST /api/cart/add -- Add to cart
- PUT /api/cart/update -- Update quantity
- DELETE /api/cart/clear -- Clear cart
npm run backup  # Export D1 + R2 to backups/

Backups fail hard on any error. Recovery target: 0-24 hours.
npm install
npx tsc --noEmit
npm run validate:ids
npm run db:init
npm run db:seed

Do not run npm run db:migrate:role after db:init; the fresh schema already includes customers.role.
For databases created before customers.role existed:
npm run backup:d1
npm run db:migrate:role

See docs/LOCAL_ADMIN_SMOKE_TEST.md for creating a local admin user.
npm run dev # Start local dev server
npx tsc --noEmit # Type check
npm run validate:ids # Validate content
npm run deploy         # Deploy to Cloudflare Workers

b2bweb/
  content/                     # Obsidian vault (source of truth)
    products/                  # Product markdown + images
    categories/
    materials/
    certifications/
    devices/
    entities/                  # Suppliers, workflows, tools
  generated/                   # Build artifacts (gitignored)
  scripts/                     # Content pipeline
    process-images.ts          # Image WebP/R2 pipeline
    sync-content.ts            # Chunked D1 sync
    validate-ids.ts            # Content linter
    generate-graph.ts          # Knowledge graph
    backup-d1.ts               # D1 backup
    backup-r2.ts               # R2 backup
    remove-watermarks.py       # Batch watermark scan + clean
    run_lama.py                # LaMA inpainting (single image)
  src/
    index.ts                   # App entry + middleware stack
    types.ts                   # Type definitions
    routes/
      api.ts                   # REST API (rate limited)
      pages.tsx                # SSR pages + SEO
    middleware/
      auth.ts                  # JWT + sessions + PBKDF2
      csrf.ts                  # CSRF protection
      rate-limit.ts            # Rate limiting
      sanitize.ts              # HTML sanitization
      security-headers.ts
      ai-crawler.ts            # Bot detection
    lib/
      wikilink.ts              # WikiLink resolver
      graph/                   # Knowledge graph
      resolver/                # pSEO page resolver
      seo/schema.ts            # JSON-LD builders
    pages/                     # JSX page components
    components/                # Shared UI components
  db/
    schema.sql                 # Complete schema
    migrate-*.sql              # Migrations
    seed.sql                   # Sample data
Temporarily hide an entire product category from the live site without deleting anything. Products stay in the Obsidian vault and can be restored with one command.
toggle-category.sh hide "iPhone Parts"
|
v
Obsidian markdown files: active: true → active: false
|
v
git add + commit + push ──────> GitHub Actions CI
|
v
D1 sync (no Worker redeploy)
|
v
Products hidden on hyranger.com
To restore, run show — same flow in reverse.
cd ~/Documents/github/b2bweb
# List all categories and product counts
./scripts/toggle-category.sh list
# Check status (single or multiple)
./scripts/toggle-category.sh status "iPhone Parts"
./scripts/toggle-category.sh status "iPhone Parts" "iPad Parts" "MacBook Parts"
# Hide one category
./scripts/toggle-category.sh hide "iPhone Parts"
# Hide multiple categories in one command
./scripts/toggle-category.sh hide "Apple Watch Parts" "iPad Parts" "MacBook Parts"
# Restore one or multiple categories
./scripts/toggle-category.sh show "iPad Parts"
./scripts/toggle-category.sh show "Apple Watch Parts" "iPad Parts" "MacBook Parts"

| Category | Products |
|---|---|
| iPhone Parts | 2,056 |
| Apple Watch Parts | 377 |
| iPad Parts | 315 |
| MacBook Parts | 189 |
The script edits the markdown files but does not commit or push. You decide when to propagate:
# Review what changed
git diff --stat content/products/
# Propagate to live site
git add content/products/
git commit -m "content: hide Apple Watch, iPad, MacBook parts for watermark cleanup"
git push

CI handles the rest: D1 sync updates the database and the changes go live. Because only content/ changed, content-sync.yml runs and the Worker is not redeployed.
To undo a hide, just run show with the same categories:
./scripts/toggle-category.sh show "Apple Watch Parts" "iPad Parts" "MacBook Parts"
git add content/products/
git commit -m "content: restore Apple Watch, iPad, MacBook parts"
git push