Add Compass real-estate mirror (port 40015) #25

Open
sarendis56 wants to merge 1 commit into aiming-lab:main from sarendis56:add-compass-mirror

@sarendis56

TL;DR

Adds a Flask mirror of compass.com as the 16th
WebHarbor site, with browse / search / filter, listing detail, agent
directory, account flows (save, tour, inquiry, saved search, collection),
and 18 WebVoyager-format benchmark tasks.

Companion HuggingFace PR: https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/3

Note: PRs #11, #12, #24 also claim port 40015. Whichever lands first
keeps it; happy to rebase onto the next free port if this isn't merged
first.

What's in this PR

Site code (sites/compass/)

| File | Lines | Purpose |
| --- | --- | --- |
| app.py | 1,011 | Flask app: 10 SQLAlchemy models, 35+ routes, token-overlap scored search |
| seed_data.py | 659 | Idempotent seed (524 listings, 20 agents, 10 cities, 4 benchmark users) |
| templates/*.html | 33 files | base + 32 page templates, hand-rolled Compass look |
| static/css/compass.css | 327 | White / black / serif palette matching the real site |
| listings_clean.json | 524 records | Normalized scrape output consumed by seed_data.py at build time |
| tasks.jsonl | 18 | WebVoyager benchmark tasks (3 hard multi-step) |
| _health.py | | End-to-end health probe |
| requirements.txt | | Pinned to image's Flask / SQLAlchemy versions |

Registration (3 files modified, must stay in sync per AGENTS.md)

  • websyn_start.sh: compass appended to SITES=( … ), the two 15s
    in ready-count log lines bumped to 16.
  • control_server.py: 'compass' appended to SITES.
  • Dockerfile: EXPOSE 8101 40000-40015.

Verification

All checks in AGENTS.md § Pre-PR checks pass.

  • python3 -m py_compile sites/compass/{app.py,seed_data.py} — clean.
  • ./scripts/build.sh webharbor:dev — image builds.
  • docker run on alt ports 8201 / 41000-41015:
    • /health reports all 16 sites alive with PIDs.
    • All 16 root paths return 200.
  • Byte-identical reset (the strict invariant):
    curl -X POST :8201/reset/compass
    md5sum instance/compass.db instance_seed/compass.db
    # 2a7458e3b6c3e3d0b39c32cca5d0f519  both files
    
  • All 18 tasks in tasks.jsonl walk end-to-end against the running mirror.
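
The byte-identical reset check above can also be scripted rather than eyeballed from `md5sum` output. The following is a minimal sketch, not code from this PR: the helper names `md5_of` and `reset_is_byte_identical` are hypothetical, and it assumes the reset has already been triggered (for example with the `curl -X POST :8201/reset/compass` call shown above) before the comparison runs.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reset_is_byte_identical(live_db: Path, seed_db: Path) -> bool:
    """True when the live DB hashes to the same MD5 as the seed snapshot."""
    return md5_of(live_db) == md5_of(seed_db)
```

Called as `reset_is_byte_identical(Path("instance/compass.db"), Path("instance_seed/compass.db"))`, it should return `True` right after a reset and `False` once the running app has written to the live database.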

Design notes

  • Determinism. Passwords use PBKDF2 with a fixed per-email salt
    (`sha1("salt-" + email)[:8]`), not bcrypt, because bcrypt's random salt
    breaks byte-identical reset. `User.check_password` accepts both prefixes
    so future writes from the running app (which uses Flask-Bcrypt) still
    authenticate.
  • Search scoring. Token-overlap with city / state / neighborhood boosts
    rather than strict `LIKE %q% AND %q%` — matches the booking-site pattern
    in `sites/booking/app.py`.
  • No task-info leaks. Homepage panels (Newest, Luxury) sort by
    `Listing.id` rather than `price.desc()` so the answers to Tasks 11 / 17
    don't surface for free in the hero grid. Co-op pool was backfilled to
    >= 5 candidates per filter combo used in tasks.

  • Real assets. All listing photos are the actual
    `compass.com/m/0/<uuid>/600x400.webp` images, resolved via Playwright
    and downloaded with httpx. No placeholders, no AI stock photos.
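
To illustrate the "Search scoring" note above, here is a rough sketch of token-overlap scoring with city / state / neighborhood boosts. It is not the code in `app.py`: the `Listing` dataclass, the `tokenize` / `score` / `search` helpers, and the boost weights (3, 2, 1) are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Listing:
    title: str
    city: str
    state: str
    neighborhood: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(query: str, listing: Listing) -> float:
    """Count query tokens found in the title, then add location boosts."""
    q = tokenize(query)
    if not q:
        return 0.0
    total = float(len(q & tokenize(listing.title)))
    # Hypothetical weights: a location field earns its boost only when
    # every one of its tokens appears in the query.
    boosts = ((listing.city, 3.0), (listing.neighborhood, 2.0), (listing.state, 1.0))
    for field, boost in boosts:
        field_tokens = tokenize(field)
        if field_tokens and field_tokens <= q:
            total += boost
    return total


def search(query: str, listings: list[Listing]) -> list[Listing]:
    """Return listings with a positive score, best match first."""
    scored = [(score(query, item), item) for item in listings]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for s, item in scored if s > 0]
```

Unlike a strict `LIKE %q% AND %q%` filter, a query such as "Miami condo" still returns a New York condo, just ranked below the Miami one, because partial overlap lowers the score instead of excluding the row.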

Assets

Heavy assets (`instance_seed/compass.db`, `static/images/`, ~129 MB
packed) ship via the companion HuggingFace PR linked above.
`.assets-revision` already pins `main`, so once the HF PR merges this
code PR Just Works.

Adds a Flask mirror of compass.com as the 16th WebHarbor site, with
browse / search / filter, listing detail, agent directory, account
flows (save, tour, inquiry, saved search, collection), and 18
WebVoyager-format benchmark tasks.

sites/compass/:
- app.py (1011 lines): 10 SQLAlchemy models, 35+ routes,
  token-overlap scored search with city/state/neighborhood boosts.
  User.check_password accepts both pbkdf2 and bcrypt prefixes so
  seed-time PBKDF2 hashes (deterministic) coexist with runtime
  Flask-Bcrypt writes.
- seed_data.py (659 lines): idempotent function-level gates;
  PBKDF2 with fixed per-email salt to preserve byte-identical reset;
  Co-op pool backfilled to keep filter-based tasks at >=5 candidates.
- 33 Jinja templates + 327-line hand-rolled CSS (white/black/serif
  to match the real Compass palette).
- tasks.jsonl: 18 WebVoyager tasks (3 hard multi-step).
- listings_clean.json: 524 normalized listings consumed by seed_data
  at build time (committed alongside the mirror, per the convention
  used by booking/, arxiv/, etc.).

Registration (3 files, must stay in sync per AGENTS.md):
- websyn_start.sh: compass appended to SITES, two ready-count 15s -> 16.
- control_server.py: 'compass' appended to SITES.
- Dockerfile: EXPOSE 8101 40000-40015.

Heavy assets (instance_seed/compass.db, static/images/, ~129 MB
packed) ship via the companion HuggingFace PR
ChilleD/WebHarbor#3. .assets-revision already pins main, so once
that merges this Just Works.

Byte-identical reset verified:
  md5sum instance/compass.db instance_seed/compass.db
  -> 2a7458e3b6c3e3d0b39c32cca5d0f519 (both files).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@YuanDaoze

Review

Tested on a fresh worktree of this branch. HF asset (compass.tar.gz, 124 MB) pulled directly from the dataset PR ref (refs/pr/3), the other 15 sites' assets hard-linked from upstream/main. Built as webharbor:rev25, run on alt ports :8231 / :42300-42315 to avoid colliding with my dev container.

What works ✓

Mechanical

  • Three-place registration in sync (websyn_start.sh SITES, control_server.py SITES, Dockerfile EXPOSE 8101 40000-40015).
  • All 16 sites alive after boot. compass on :42315 returns 200.
  • Byte-identical reset holds: md5(instance/compass.db) == md5(instance_seed/compass.db) == 2a7458e3b6c3e3d0b39c32cca5d0f519 both before and after POST /reset/compass — matches the value in the PR description verbatim.
  • Reset hygiene is correct: registered a new "Taylor Reed" account → instance/ MD5 changed (733d23f8…) → after /reset/compass it's back to the seed MD5.
  • Each seed_*() function has its own if X.query.count() > 0: return gate; module-level seed_database() and seed_benchmark_users() are both idempotent.
  • Tarball is clean: 0 macOS ._* AppleDouble files (PR #9, "feat(drugs_com): add drugs.com mirror site (port 40015)", had 51; PR #32, "Add GOV.UK mirror site (port 40015)", had 7). 1945 image files + 1 seed DB.
  • Build context clean — no .db files, no static/images/, no scraped_data/ committed.
  • tasks.jsonl has 18 tasks, all five required fields on every line, port matches 40015.
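
The per-function idempotency gate noted in the list above (`if X.query.count() > 0: return`) can be sketched without Flask or SQLAlchemy. This is an illustrative stand-in using the standard-library `sqlite3` module; the `agent` table, its columns, and the sample rows are hypothetical and do not come from `seed_data.py`.

```python
import sqlite3


def seed_agents(conn: sqlite3.Connection) -> int:
    """Insert demo agents only when the table is empty; return rows inserted."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS agent (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    (count,) = conn.execute("SELECT COUNT(*) FROM agent").fetchone()
    if count > 0:
        # Gate: a previous run already seeded this table, so do nothing.
        return 0
    rows = [(1, "Agent One"), (2, "Agent Two")]
    conn.executemany("INSERT INTO agent (id, name) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```

Running the function twice against the same connection inserts rows only the first time, which is what keeps a repeated seed from changing the database.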

Determinism trick — well-executed
The author handles a subtle bcrypt-versus-determinism conflict elegantly: benchmark seed users get pbkdf2:sha256:1000$<sha1(salt-email)[:8]>$<derived> hashes (deterministic), while live registrations from set_password() use bcrypt (random salt). check_password() dispatches by prefix:

if (self.password_hash or "").startswith("pbkdf2:"):
    return wz_check(self.password_hash, pw)
return bcrypt.check_password_hash(self.password_hash, pw)

Verified end-to-end: registered Taylor Reed with password=ReedPass123! (bcrypt path), login returned 302, /account shows "Welcome back, Taylor". Alice (alice.j@test.com / webharbor123, PBKDF2 path) logs in too. Both paths coexist without breaking byte-identity.
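
For readers who want to see why the seed hashes are reproducible, the sketch below rebuilds the scheme from the format quoted above (`pbkdf2:sha256:1000$<salt>$<derived>` with `salt = sha1("salt-" + email)[:8]`) using only the standard library. It is an approximation, not the PR's code: `deterministic_hash` and `check_pbkdf2` are hypothetical names, the bcrypt branch is replaced by a caller-supplied `bcrypt_check` callback, and whether the output matches Werkzeug's own hashes byte for byte has not been verified here.

```python
import hashlib
import hmac
from typing import Callable


def deterministic_hash(email: str, password: str, iterations: int = 1000) -> str:
    """Build a PBKDF2 hash whose salt depends only on the email."""
    salt = hashlib.sha1(("salt-" + email).encode()).hexdigest()[:8]
    derived = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), salt.encode(), iterations
    ).hex()
    return f"pbkdf2:sha256:{iterations}${salt}${derived}"


def check_pbkdf2(stored: str, password: str) -> bool:
    """Recompute the digest from the stored parameters and compare."""
    method, salt, expected = stored.split("$", 2)
    _, digest_name, iterations = method.split(":")
    derived = hashlib.pbkdf2_hmac(
        digest_name, password.encode(), salt.encode(), int(iterations)
    ).hex()
    return hmac.compare_digest(derived, expected)


def check_password(
    stored: str, password: str, bcrypt_check: Callable[[str, str], bool]
) -> bool:
    """Dispatch on the hash prefix, mirroring the snippet quoted above."""
    if (stored or "").startswith("pbkdf2:"):
        return check_pbkdf2(stored, password)
    # Assumption: any other prefix is a bcrypt hash handled by the callback.
    return bcrypt_check(stored, password)
```

Because the salt is a pure function of the email, seeding the same user twice yields the same hash string, so the seed database stays byte-identical across rebuilds.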

Functional depth

  • Visual fidelity is the strongest of the 4 PRs I've reviewed: real Compass minimalist serif palette, full-bleed property photographs, hero grid → "Compass Exclusives" → "New on Compass" → "Compass Luxury" sections, all with real compass.com/m/0/<uuid>/600x400.webp images (not AI stock).
  • Listing detail at /listing/<slug>-<id> renders hero photo + 3-photo grid sidebar, $1,325,000-style price, beds/baths/sqft summary, full property details, listing-agent block with tour/contact/save CTAs, "Similar homes nearby" carousel.
  • Agent directory at /agents shows 33 agents with "Top Agent" badges, transactions count, volume, niche tags. City filter via dropdown works (filters server-side).
  • /open-houses shows 140 open houses with a real, functioning city filter (e.g. ?city=Miami → 18 results, all Miami).
  • Login works, /account shows alice's profile with Saved homes / Searches / Collections / Tours / Inquiries sidebar plus seeded "Recently saved homes" + "Recent tours" tables — matches the PR description claim that alice/bob/carol/david are pre-seeded with realistic activity.

Task quality

  • 18 tasks span 12 categories: search/filter, listing detail, multi-listing compare (--16), agent dir, open houses, register, login + profile edit, login + tour request, login + collection create, saved searches, find-the-cheapest with constraints (--14), find-most-expensive (--11, --17).
  • Difficulty gradient is solid: simple lookup (--2 1425 Brickell facts) → multi-criteria filter (--13 Miami built≥2016 + non-zero sqft) → multi-step authenticated workflow (--6 carol login + tour request, --7 david login + collection create).
  • Author notes in the PR description that "homepage panels (Newest, Luxury) sort by Listing.id rather than price.desc() so the answers to Tasks 11 / 17 don't surface for free in the hero grid" — verified, the homepage is not a single-glance leak for those tasks.
  • Spot-checked 5 tasks via natural search/q= path:
    • --0 q=Miami&property_type=Condo&beds=3 → 6 results (clean count)
    • --2 1425 Brickell → exact listing slug found
    • --11 q=Aspen&property_type=Single Family → max price = $99M
    • --12 17145 Southwest 90th Avenue → listing exists at /listing/17145-...
    • --16 3-listing compare URLs all resolve

Should-fix (non-blocking)

1. /search silently ignores ?city= parameter.

filter_listings() in app.py:402-441 filters on status, type/property_type, price_min/max, beds, baths, sqft_min, year_built_min, and a bunch of booleans (pool, garage, waterfront, doorman, open_house, new, compass_exclusive) — but not city. The /open-houses route filters on city separately and correctly, and q=Miami works through search_listings()'s token matcher, but /search?city=Miami returns all 497 listings labeled "All listings" with NYC results dominating.

In practice this isn't task-blocking — the homepage search box uses q= (which works), no internal template links to /search?city=…, and tasks resolve via q=. But it's a UX footgun: an agent that constructs a URL by hand or follows a ?city= link from outside (or from saved-searches state) will get silently wrong results. Either honor the param or strip it from the saved-search criteria.
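
One way to honor the parameter is sketched below. This is an in-memory stand-in, not a patch against `app.py`: the real `filter_listings()` presumably narrows a SQLAlchemy query, whereas this version filters a plain list, and the `Listing` fields, the case-insensitive exact match on city, and the "at least N beds" semantics are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    address: str
    city: str
    beds: int


def filter_listings(listings: list[Listing], params: dict[str, str]) -> list[Listing]:
    """Apply the query-string filters that are present, including city."""
    result = list(listings)
    city = (params.get("city") or "").strip().lower()
    if city:
        # Previously ignored: narrow to an exact, case-insensitive city match.
        result = [item for item in result if item.city.lower() == city]
    beds = params.get("beds")
    if beds:
        # Assumed semantics: "beds=3" means three or more bedrooms.
        result = [item for item in result if item.beds >= int(beds)]
    return result
```

With this shape, a hand-built `/search?city=Miami` would narrow the results instead of returning every listing, and an absent or empty `city` leaves the result set unchanged.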

2. .assets-revision pinned to main, but compass.tar.gz is in HF PR ref refs/pr/3 and not yet merged.

./scripts/fetch_assets.sh on a fresh clone won't find compass on main and will silently skip → site won't boot. Author calls this out in the PR description ("once the HF PR merges this code PR Just Works") which is honest, but reviewers/CI without the HF PR merged need the same workaround I used (download tarball from refs/pr/3 directly).

Either get the HF PR merged first then bump .assets-revision to a commit SHA, or temporarily pin to a fork/PR ref like refs/pr/3 so the code PR is independently reviewable.

3. PR base is 3c408d8 init, way behind upstream/main.

A rebase before merge would be wise — both for cleanliness and to expose any silent conflict with sites added since (merriam_webster, etc., that also extend EXPOSE / SITES=). Author's PR description acknowledges PRs #11/#12/#24 also claim port 40015 and offers to rebase to whichever port is free post-merge.

4. requirements.txt is unused.

The Dockerfile pip-installs explicitly with locked versions (correct per AGENTS.md). requirements.txt in sites/compass/ is dead code — either delete or wire it in. Same comment as on PR #9; this seems to be a common pattern.

5. listings_clean.json (524 records, ~330 KB) committed to the source tree.

seed_data.py reads this at build time to populate listings. It's not heavy enough to require HF, but feels like the kind of intermediate-scrape JSON that AGENTS.md recommends folding into instance_seed/<site>.db ("Runtime data lives in instance_seed/*.db, not JSON"). Since seed_data.py only reads it once at boot to do INSERTs into the DB (which is then snapshotted into instance_seed/), it's effectively the source of truth for re-seeding on a fresh build — which is fine, but worth noting that this couples re-seeding to the JSON staying in sync with the binary DB. Not blocking.

Bottom line

The most polished site visually of the four PRs I've reviewed — real photographs throughout, clean Compass-style design system, deep listing detail. Seed determinism handled with the cleanest hand-rolled PBKDF2 trick I've seen, and the tarball is the only one out of three so far without macOS junk. Two tiny pieces of friction (unused ?city= filter, .assets-revision pin) and otherwise mergeable.
