A standalone, retrieval-based chatbot for greenbuildermedia.com that:
- crawls public pages from the site's sitemap
- extracts clean article text and metadata
- chunks content into retrieval passages
- embeds and stores passages in a local LanceDB index
- answers user questions using only Green Builder Media content
- returns source links for every answer
- provides an embeddable website widget
This MVP is designed to save implementation time. It does not require rebuilding the site or changing the CMS. It runs as a separate service that can be embedded into any page with one script tag.
- `scripts/crawl_greenbuilder.py`
  - reads `robots.txt` and/or `sitemap.xml`
  - filters allowed Green Builder URLs
  - fetches and cleans article pages
  - stores normalized documents in `data/documents.jsonl`
- `scripts/build_index.py`
  - chunks documents into passages
  - creates embeddings with OpenAI
  - stores records in LanceDB under `data/lancedb`
- `app/main.py`
  - FastAPI backend
  - `/health` endpoint
  - `/chat` endpoint for grounded answers
  - `/widget.js` endpoint to serve the embed script
- `widget/embed.js`
  - lightweight embeddable chat launcher
  - inject with a single `<script>` tag
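The chunking step in `scripts/build_index.py` can be sketched as follows. This is a minimal illustration, not the repository's actual implementation; the word-count limit and overlap values are assumptions:

```python
def chunk_text(text: str, max_words: int = 250, overlap: int = 40) -> list[str]:
    """Split article text into overlapping retrieval passages.

    max_words and overlap are illustrative defaults, not the
    repository's actual settings.
    """
    words = text.split()
    if len(words) <= max_words:
        return [text] if words else []
    passages = []
    step = max_words - overlap  # slide the window, keeping some overlap
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # last window already reached the end of the text
    return passages
```

Overlapping windows keep sentences that straddle a chunk boundary retrievable from at least one passage.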
This is the fastest path to a useful production-style bot for a publishing archive:
- no WordPress plugin dependency
- no custom training workflow
- works on thousands of pages
- easy to re-crawl nightly or after publishing batches
- citations keep it trustworthy
- Backend: Render, Railway, Fly.io, or any Linux VM
- Persistent volume: required for `data/lancedb`
- Secrets: `OPENAI_API_KEY`
- Create a Python 3.11+ service.
- Set environment variables from `.env.example`.
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Crawl the site:

  ```bash
  python scripts/crawl_greenbuilder.py
  ```

- Build the vector index:

  ```bash
  python scripts/build_index.py
  ```

- Start the API:

  ```bash
  uvicorn app.main:app --host 0.0.0.0 --port 8000
  ```

- Embed the widget on Green Builder Media pages:

  ```html
  <script
    src="https://YOUR-BOT-DOMAIN/widget.js"
    data-chatbot-title="Ask Green Builder"
    data-api-base="https://YOUR-BOT-DOMAIN"
  ></script>
  ```
The bot is configured to:
- answer from Green Builder content only
- prefer recent pages when the query asks for latest coverage
- cite source pages in every answer
- refuse to invent facts not supported by retrieved passages
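The grounding behavior above can be sketched as a prompt-assembly helper. The wording, function name, and passage field names (`text`, `url`) are illustrative assumptions, not the repository's exact prompt:

```python
def build_grounded_prompt(question: str, passages: list[dict]) -> str:
    """Assemble a system prompt that restricts the model to retrieved
    Green Builder passages and asks for citations.

    Passage dicts are assumed to carry "text" and "url" keys; the
    instruction wording is a sketch, not the repo's actual prompt.
    """
    context = "\n\n".join(
        f"[{i + 1}] {p['text']}\n(Source: {p['url']})"
        for i, p in enumerate(passages)
    )
    return (
        "Answer using ONLY the numbered Green Builder Media passages below. "
        "Cite the source URL for every claim. If the passages do not "
        "support an answer, say you don't know rather than inventing facts.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Keeping the refusal instruction inside the prompt, next to the numbered passages, is what lets the model decline questions the archive cannot answer.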
Start with the most valuable sections first:
- `/blog` - article pages linked from the blog archive
- topic pages only as metadata helpers
- optionally: ebooks, magazine pages, webinar pages
Set a cron or scheduled job:

```bash
python scripts/crawl_greenbuilder.py && python scripts/build_index.py
```

Update `ALLOWED_ORIGINS` in `.env` to include `https://www.greenbuildermedia.com`.
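For example, a nightly crontab entry might look like this. The install path, Python location, log file, and 02:00 schedule are all assumptions; adjust them for your host:

```shell
# Re-crawl and rebuild the index every night at 02:00 (paths are examples)
0 2 * * * cd /srv/greenbuilder-bot && /usr/bin/python scripts/crawl_greenbuilder.py && /usr/bin/python scripts/build_index.py >> /var/log/greenbuilder-recrawl.log 2>&1
```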
The widget is intentionally simple. It can be restyled to match brand colors by editing `widget/embed.js`.
For larger usage, swap LanceDB for Postgres + pgvector or a managed vector DB. The retrieval logic in `app/retrieval.py` is modular, so storage can be swapped later.
- working crawler
- working index builder
- working FastAPI chat API
- working embeddable widget
- config templates
This MVP now supports mixing public and private Green Builder documents in the same index.
Each document can include:
- `visibility`: `public` or `private`
- `attribution_label`: the branded attribution to use when private material informs an answer
Example private record:
```json
{"title": "Draft article title", "url": "", "published_at": "2024-11-08", "category": "Building Science", "text": "...", "visibility": "private", "attribution_label": "Green Builder Media's editorial archive"}
```

Behavior:
- Public records may appear in the response `sources` list with URLs.
- Private records are retrieved and used as background material.
- Private records are not returned in the `sources` list.
- When private material materially shapes the answer, the generator is instructed to use natural attribution such as:
  - *Green Builder Media's research archive suggests...*
  - *Green Builder Media's editors note...*
  - *Based on Green Builder Media's internal editorial archive...*
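The public/private split above can be sketched as a small post-retrieval filter. Field names follow the record format shown earlier; the function itself is an illustration, not the repo's code:

```python
def split_sources(retrieved: list[dict]) -> tuple[list[dict], bool, list[str]]:
    """Separate public citable sources from private background material.

    Returns the citable public sources, whether any private record was
    used, and the distinct attribution labels to weave into the answer.
    """
    public_sources = [r for r in retrieved if r.get("visibility") == "public"]
    private = [r for r in retrieved if r.get("visibility") == "private"]
    labels = sorted({r["attribution_label"] for r in private if r.get("attribution_label")})
    return public_sources, bool(private), labels
```

The boolean and label list map directly onto the `private_archive_used` and `attribution_note` response fields described below this section.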
Use the included importer:
```bash
python scripts/import_private_hubspot_zip.py "/path/to/mpower draft blogs hubspot zipped.zip" \
  --output ./data/private_documents.jsonl \
  --append-to-docs ./data/documents.jsonl
```

That script:
- reads `.html` files from blog paths inside the ZIP
- skips obvious temporary-slug pages
- extracts title, publish date where available, category, and body text
- marks every imported document as `visibility=private`
- assigns `attribution_label="Green Builder Media's editorial archive"`
```bash
python scripts/crawl_greenbuilder.py
python scripts/import_private_hubspot_zip.py "/path/to/private-export.zip" --output ./data/private_documents.jsonl --append-to-docs ./data/documents.jsonl
python scripts/build_index.py
```

The `/chat` response now also includes:
- `private_archive_used`: boolean
- `attribution_note`: which private attribution label(s) were used in retrieval
This makes it easier to audit when unpublished material influenced an answer.
Private records now support three response-use modes:
- `paraphrase`: may influence the answer and can receive branded attribution such as *Green Builder Media's research archive suggests...*
- `weight_only`: may help internal retrieval/background weighting, but should not be paraphrased, quoted, or directly attributed
- `blocked`: excluded from retrieval and response generation entirely
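The three modes can be sketched as a pre-retrieval partition. The `use_mode` field name is an assumption; only the mode semantics come from the list above:

```python
def apply_use_modes(records: list[dict]) -> dict:
    """Partition private records by response-use mode.

    blocked records are dropped before retrieval; weight_only records may
    inform ranking but never attribution; paraphrase records may be
    attributed. "use_mode" is an assumed field name.
    """
    buckets = {"paraphrase": [], "weight_only": []}
    for rec in records:
        mode = rec.get("use_mode", "paraphrase")  # assumed default
        if mode == "blocked":
            continue  # excluded from retrieval and generation entirely
        buckets.setdefault(mode, []).append(rec)
    return buckets
```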
The governance layer also flags stale or risky private material using date age, obsolete-platform signals, placeholder text, embargo language, and technology/policy references that are likely to have changed.
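A few of those staleness signals can be sketched as simple heuristics. The age threshold and keyword lists here are assumptions, not the governance layer's actual rules:

```python
from datetime import date

def flag_risky(doc: dict, today: date, max_age_years: int = 4) -> list[str]:
    """Return governance flags for a private document.

    Checks date age, placeholder text, and embargo language; the
    4-year threshold and keywords are illustrative assumptions.
    """
    flags = []
    published = doc.get("published_at")
    if published:
        age_days = (today - date.fromisoformat(published)).days
        if age_days > max_age_years * 365:
            flags.append("stale")
    text = doc.get("text", "").lower()
    if "lorem ipsum" in text or "tbd" in text:
        flags.append("placeholder")
    if "embargo" in text:
        flags.append("embargoed")
    return flags
```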
```bash
python scripts/import_private_hubspot_zip.py "/path/to/private-export.zip" --output ./data/private_documents.jsonl --append-to-docs ./data/documents.jsonl
python scripts/classify_private_docs.py ./data/private_documents.jsonl
python scripts/build_index.py
```

In practice, you can leave public documents alone and run governance mainly on the unpublished archive.
This MVP now includes a lightweight editor-facing correction app.
What it does:
- logs recent chatbot questions and answers
- lets editors save an override for an exact question, a recurring phrase, or a regex pattern
- applies the editor correction immediately on future matching questions
- does not require a re-crawl or reindex for simple answer fixes
Set these env vars:
```
ADMIN_USERNAME=editor
ADMIN_PASSWORD=strong-password-here
```

Then open:

```
https://YOUR-BOT-DOMAIN/admin
```
Editors can review recent answers and create a correction by typing the corrected answer into the form.
- `exact`: fixes one specific question
- `contains`: fixes a family of similar questions containing a phrase
- `regex`: advanced pattern matching for repeated edge cases
When a correction is applied, the API returns it immediately and marks the response as editor-corrected.
This package is now tuned to answer in a more Green Builder-style voice:
- direct and journalistic
- practical about tradeoffs, costs, and timing
- grounded in sustainable-building topics without sounding promotional
To show the quick correction button in the widget for editors, use:
```html
<script
  src="https://YOUR-BOT-DOMAIN/widget.js"
  data-chatbot-title="Ask Green Builder"
  data-api-base="https://YOUR-BOT-DOMAIN"
  data-editor-tools="true"
></script>
```

When an editor clicks **Fix this answer**, the bot opens the admin console with the question and answer pre-filled, so the editor can simply type a corrected response and save it.
You can preprocess the draft archive before deployment so install day is easier:
```bash
python scripts/import_private_hubspot_zip.py "/path/to/mpower draft blogs hubspot zipped.zip" --output ./data/private_documents.jsonl
python scripts/classify_private_docs.py ./data/private_documents.jsonl
python scripts/summarize_private_archive.py ./data/private_documents.jsonl --output ./data/private_archive_report.json
```

That produces:
- `data/private_documents.jsonl` for indexing later
- `data/private_archive_report.json` so editors can review what was imported, downgraded, or blocked before the bot goes live