Add GOV.UK mirror site (port 40015) #32

Open: lamawmouk wants to merge 1 commit into aiming-lab:main from lamawmouk:feat/gov-uk-mirror


@lamawmouk


TL;DR

Adds a Flask mirror of gov.uk as the 16th WebHarbor site (port 40015), with topic browse, guidance article detail, department directory, announcements, and search. Uses the official MIT-licensed govuk-frontend v6.1.0 for canonical Design System DOM.

Companion HuggingFace PR: https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/22

What's in this PR

sites/gov_uk/:

| File | Purpose |
| --- | --- |
| app.py | 5 SQLAlchemy models, 9 routes |
| seed_data.py | Idempotent seed: 16 topics, 44 subtopics, 15 departments, 73 articles, 20 announcements |
| templates/*.html | base + 9 page templates using canonical govuk-frontend DOM |
| static/{css,js,fonts,icons}/ | Official govuk-frontend v6.1.0 bundle (MIT) |
| tasks.jsonl | Stub; WebVoyager tasks in follow-up |
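
For readers unfamiliar with the term, an idempotent seed is one that can run any number of times and leave the same rows behind. The following is a minimal sketch of that property only. It uses the standard-library sqlite3 module and a hypothetical `topic` table with made-up slugs, not the SQLAlchemy models in app.py, whose schema is not shown here.

```python
import sqlite3

# Hypothetical sample rows; the titles come from the PR text, the slugs are invented.
TOPICS = [("money-and-tax", "Money and tax"), ("driving", "Driving")]


def seed(conn):
    """Insert seed rows; safe to call repeatedly."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS topic (slug TEXT PRIMARY KEY, title TEXT NOT NULL)"
    )
    # INSERT OR IGNORE skips rows whose primary key already exists,
    # so a second run adds nothing.
    conn.executemany("INSERT OR IGNORE INTO topic (slug, title) VALUES (?, ?)", TOPICS)
    conn.commit()


def topic_count(conn):
    """Return the number of rows in the topic table."""
    return conn.execute("SELECT COUNT(*) FROM topic").fetchone()[0]
```

Calling `seed(conn)` twice on the same connection should leave `topic_count(conn)` equal to `len(TOPICS)`.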

Registration (sync per AGENTS.md): gov_uk added to websyn_start.sh and control_server.py, Dockerfile EXPOSE bumped to 40000-40015.

Verification

All checks in AGENTS.md § Pre-PR checks pass: image builds clean, 16/16 sites alive, every gov_uk route returns 200, POST /reset/gov_uk byte-identical pre/post (md5 f6931b6c…), and identical after docker restart.
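
The byte-identical check can be reproduced with a short script. This is a sketch under assumptions: the file being hashed and the reset action are supplied by the caller, and the helper names `md5_of` and `is_byte_identical` are hypothetical rather than part of the repository.

```python
import hashlib
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of a file, read in 64 KiB chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(path, action):
    """Run `action` (for example a reset call) and report whether the file's MD5 is unchanged."""
    before = md5_of(path)
    action()
    return md5_of(path) == before
```

In practice `action` would be whatever triggers POST /reset/gov_uk, and `path` would point at the file whose digest is being compared.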

Notes

  • Content is synthesized (no upstream copy). OGL v3.0 would permit direct copying, but synthesizing keeps the seed at 128 KB and deterministic.
  • govuk-frontend.min.css is patched only by a single sed command that rewrites url(/assets/...) to relative paths so they resolve through Flask's /static/.
  • .assets-revision still points at main; will bump to the HF merge SHA after that PR is reviewed.
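
The url(/assets/...) rewrite from the notes above can also be expressed in Python, which may be easier to re-run when the bundle is upgraded. This is a sketch, not the sed command actually used: the `../` prefix assumes the stylesheet lives in a css/ directory next to fonts/ and images/, and `rewrite_asset_urls` is a hypothetical helper name.

```python
import re

# Matches url(/assets/ with an optional opening quote.
# The "../" default below is an assumption about the static directory layout.
_ASSET_URL = re.compile(r"""url\((['"]?)/assets/""")


def rewrite_asset_urls(css, prefix="../"):
    """Rewrite absolute /assets/ URLs in CSS to paths relative to the stylesheet."""
    return _ASSET_URL.sub(lambda m: f"url({m.group(1)}{prefix}", css)
```

URLs that do not start with /assets/ are left untouched, so the function can be applied to the whole minified file in one pass.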

Adds a Flask mirror of https://www.gov.uk/ as the 16th WebHarbor site,
running on port 40015.

## What's mirrored

- 16 top-level topics (Money and tax, Visas and immigration, Driving, ...)
- 44 subtopics
- 15 government departments (HMRC, DfE, Home Office, DVLA, NHS England, ...)
  with real ministers / permanent secretaries / employee counts
- 73 guidance articles (Self Assessment, Income Tax, Universal Credit,
  Skilled Worker visa, passport applications, vehicle tax, ...)
- 20 announcements (press releases, news stories, speeches)
- Search across articles / announcements / departments

## Visual fidelity

Uses the official MIT-licensed govuk-frontend v6.1.0 CSS + JS + GDS
Transport font + crown SVG. Templates use the canonical Design System
component DOM (govuk-header, govuk-breadcrumbs, govuk-summary-list,
govuk-pagination, govuk-grid-row, etc.) so an agent's selectors match
the real GOV.UK.

Content licensed under the Open Government Licence v3.0 (synthesized
in the spirit of GOV.UK guidance; no upstream copy embedded).

## Folder layout

Matches the canonical site layout (compare wolfram_alpha, google_search):

  sites/gov_uk/
  |-- _health.py
  |-- app.py
  |-- seed_data.py
  |-- tasks.jsonl
  |-- instance_seed/        (HF-managed)
  |-- static/{css,js,fonts,icons,images,external_cache}/
  `-- templates/

## Wiring

- websyn_start.sh: gov_uk appended to SITES, 15->16 counts
- control_server.py: gov_uk added to SITES
- Dockerfile: EXPOSE 40000-40015

## Pre-PR verification (passed)

- docker build webharbor:dev clean (5.92 GB)
- 16/16 sites bind in 2s
- All gov_uk routes (/, /browse, /browse/<topic>, /browse/<t>/<s>,
  /guidance/<slug>, /government/organisations[/<dept>],
  /government/announcements, /search, /_health) return 200
- /reset/gov_uk -> {ready: true}, md5 byte-identical pre/post
- Byte-identical after docker restart
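
A reviewer who wants to repeat the route check could use something like the following sketch. It relies only on the standard library; the helper names are hypothetical, and the concrete paths passed in (including any topic or article slugs) have to come from the seeded data.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def http_status(url):
    """Return the HTTP status code for a GET request to `url`."""
    try:
        with urlopen(url, timeout=5) as resp:
            return resp.status
    except HTTPError as err:
        # urlopen raises on 4xx/5xx; the code is still what we want to report.
        return err.code


def failing_routes(base_url, paths, get_status=http_status):
    """Return the subset of `paths` that do not answer with HTTP 200."""
    base = base_url.rstrip("/")
    return [p for p in paths if get_status(base + p) != 200]
```

For example, `failing_routes("http://localhost:40015", ["/", "/browse", "/government/organisations", "/government/announcements", "/_health"])` should return an empty list when the site is up; the localhost host name is an assumption about where the container is reachable.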

## Asset PR

Seed DB (gov_uk.tar.gz, 32 KB) uploaded as HF PR:
https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/22

.assets-revision will be bumped to the HF merge SHA once that PR lands.
@lamawmouk force-pushed the feat/gov-uk-mirror branch from 96f4916 to 63b73a5 on May 24, 2026 22:48
@lamawmouk changed the title from "feat(gov_uk): add GOV.UK mirror site (port 40015)" to "Add GOV.UK mirror site (port 40015)" on May 24, 2026
@lamawmouk

Author

@Raibows would you be able to review this when you have a chance? Thanks! 🙏

@YuanDaoze


Review: gov_uk

I ran the full pipeline in an isolated git worktree using alternate ports 42000-42015.

Mechanical / Visual / Functional all PASS, but Task Quality is N/A because tasks.jsonl is empty. See notes below.

✅ Mechanical checks: PASS

  • 16/16 sites return HTTP 200, including gov_uk on port 42015

  • Control-plane /health reports all 16 sites alive

  • Byte-identical reset PASS

    • md5 == f6931b6c99df43d3ce1ffbef4566d2ef
    • Verified at startup, after /reset/gov_uk, and after /reset-all
  • /reset-all across 16 sites completes in parallel in ~1.3s, under the 10s requirement

  • Reasonable seed scale:

    • 16 topics
    • 44 subtopics
    • 15 departments
    • 73 articles
    • 20 announcements

✅ Visual fidelity: PASS

This site uses the official MIT-licensed govuk-frontend v6.1.0, and the homepage, detail pages, and search page closely match the real [gov.uk](https://www.gov.uk/).

Checked pages include:

  • Home page

    • Crown logo
    • “Welcome to GOV.UK” hero
    • Popular-on-GOV.UK links
    • 16-topic 4×4 grid
    • Departments / Announcements two-column footer
    • Open Government Licence v3.0 footer
  • Detail page: State Pension forecast

    • Breadcrumb: Home > Working > State Pension
    • DWP department label
    • Related content sidebar
    • Published / Last updated metadata
  • Search page

    • Tri-category grouping:

      • Guidance and services
      • Announcements
      • Departments
    • Each result has a short summary, department, and updated date

The “Mirror” banner correctly identifies this as a benchmark mirror.

Note: having 0 <img> elements on the home page is not a bug. The real gov.uk is also an almost text-only portal with just an SVG crown, so this is a deliberate and faithful departure from e-commerce-style mirrors.

✅ Functional depth: PASS

  • Search quality is good:

    • passport → 12 results across 3 categories
    • tax return → 6 results
    • drive → 6 results
    • Search uses token-overlap scoring rather than strict AND matching, which works well here
  • 8/8 sampled detail pages have populated content with more than 200 characters

  • 4/4 sampled navigation links return non-404

  • The absence of an authentication system is intentional and correct

    • The real gov.uk also has no on-site user accounts
    • It delegates account-related flows to Government Gateway
    • This matches the upstream design
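
To illustrate the difference between token-overlap scoring and strict AND matching mentioned in the search bullet, here is a minimal sketch. It is not the code in app.py: the tokenizer, the scoring formula (fraction of query tokens found), and the function names are all assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, document):
    """Fraction of query tokens that appear in the document (0.0 to 1.0)."""
    q = tokenize(query)
    if not q:
        return 0.0
    return len(q & tokenize(document)) / len(q)


def search(query, documents):
    """Return documents with any token overlap, best match first."""
    scored = [(overlap_score(query, d), d) for d in documents]
    ranked = sorted(scored, key=lambda pair: -pair[0])  # stable for ties
    return [doc for score, doc in ranked if score > 0]
```

With this scoring, a query such as "tax return" still returns a document that only mentions "tax", ranked below one that contains both words. Strict AND matching would drop the partial match entirely.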

❌ Task quality: N/A

sites/gov_uk/tasks.jsonl currently contains only:

[]

The PR description acknowledges that “WebVoyager tasks” will come in a follow-up, so Phase 2 is not done yet.

This is the most important step in the review-env skill, specifically Step 5. Without tasks, I cannot audit:

  • answer leaks
  • distractor density
  • task difficulty
  • whether the site is benchmark-ready

Recommendation:

  • Either add 15-20 tasks in this PR before merging, per the contribution guide
  • Or, with maintainer agreement, split the tasks into a follow-up PR

However, if tasks are split into a follow-up, that follow-up should land before the author-list cutoff. Otherwise, this contribution should probably count as infrastructure scaffolding rather than a complete benchmark-ready site.

⚠️ Required fixes

1. .assets-revision should point at the HF PR ref, not main

Currently .assets-revision points to revision: main, but gov_uk.tar.gz is still in HF PR #22 and has not been merged yet.

Because of this, when a reviewer runs:

./scripts/fetch_assets.sh

only the existing 15 sites are pulled, and gov_uk is missing.

I had to manually fetch from refs/pr/22 in order to build and review the site.

Suggested fix, following the pattern used by other paired PRs:

repo: <your-hf-fork>/WebHarbor
revision: <commit-sha-containing-gov_uk-assets>

Then, once HF PR #22 is merged, bump it back to:

repo: ChilleD/WebHarbor
revision: <merge-sha>

2. The tarball contains macOS AppleDouble files

Extracting gov_uk.tar.gz writes 7 ._* files into sites/gov_uk/, for example:

  • ._instance_seed
  • ._gov_uk.db
  • ._.gitkeep

It also emits warnings such as:

Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.provenance'

Suggested fix:

Re-pack on macOS with:

COPYFILE_DISABLE=1 tar --no-xattrs -czf gov_uk.tar.gz ...

or re-pack the tarball from Linux.
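
A third, platform-independent option is to build the archive with Python's tarfile module and filter out the sidecar files by name. This is a sketch, assuming the tarball should contain a single top-level directory; `pack_clean` is a hypothetical helper, and the actual archive layout expected by fetch_assets.sh should be checked before using it.

```python
import os
import tarfile


def _skip_appledouble(info):
    """Drop macOS AppleDouble entries (names starting with '._')."""
    if os.path.basename(info.name).startswith("._"):
        return None  # returning None excludes the member from the archive
    return info


def pack_clean(src_dir, out_path):
    """Create a gzip tarball of src_dir without '._*' sidecar files."""
    arcname = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(src_dir, arcname=arcname, filter=_skip_appledouble)
```

Legitimate dotfiles such as .gitkeep are kept, because only names beginning with the two-character prefix `._` are filtered.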

Summary

The infrastructure is solid:

  • visual fidelity is strong
  • functional depth is good
  • byte-identical reset passes
  • data modeling looks reasonable
  • using govuk-frontend was a great call

The main blocker is the empty tasks.jsonl, which prevents this site from being benchmark-ready.

Once the tasks are added and the two asset issues are fixed, I would be happy to approve.

Looking forward to the follow-up.
