Add Merriam-Webster mirror site #44

Open
YuanDaoze wants to merge 1 commit into aiming-lab:main from YuanDaoze:feat/merriam-webster

YuanDaoze commented Jun 2, 2026

Summary

Adds the 16th WebHarbor site: a Merriam-Webster mirror with dictionary,
thesaurus, Word of the Day, vocabulary quizzes, and login/account flows.
It ships 142 real word entries, 30 thesaurus entries, 3 quizzes (10
questions each), and 20 benchmark tasks in tasks.jsonl.

Paired HF PR

Heavy assets (instance_seed/merriam_webster.db, 12 real images) live in:

Pre-PR checks (passed locally)

  • docker build webharbor:dev (5.89GB)
  • 16/16 sites return HTTP 200
  • /reset/merriam_webster byte-identical (md5 a4248bef..)
  • /reset-all 16 sites parallel ~1.1s
  • 20/20 benchmark tasks walkable in container
  • All 15 existing sites still byte-identical (no regression)
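
For reviewers who want to reproduce the byte-identical checks, here is a minimal sketch of one way to compare a seed file against the post-reset file by MD5. The function names and the example paths are hypothetical; the PR does not show how the checks were actually run.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(seed: Path, live: Path) -> bool:
    """True when both files hash to the same MD5 digest."""
    return md5_of(seed) == md5_of(live)
```

A call such as `is_byte_identical(Path("instance_seed/merriam_webster.db"), Path("live/merriam_webster.db"))` would be run after hitting the reset endpoint; the second path is an assumed location, not one taken from the repo.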

YuanDaoze changed the title from "feat(merriam_webster): add Merriam-Webster mirror site" to "Add Merriam-Webster mirror site" on Jun 2, 2026

Adds a Flask mirror of merriam-webster.com as the 16th WebHarbor site:
dictionary, thesaurus, word of the day, vocabulary quizzes, and full
account/login flow. 142 real word entries, 30 thesaurus entries, 3
quizzes (10 questions each), all scraped from the live site. 20
WebVoyager-format benchmark tasks in tasks.jsonl.
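
A quick sanity check on the task file could look like the sketch below. It only assumes that tasks.jsonl holds one JSON object per line; it makes no assumption about field names, and `load_tasks` is a hypothetical helper, not part of the repo.

```python
import json
from pathlib import Path


def load_tasks(path: Path) -> list[dict]:
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    tasks = []
    with path.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError(f"line {lineno}: expected a JSON object")
            tasks.append(record)
    return tasks
```

With this helper, `len(load_tasks(Path("tasks.jsonl")))` should equal 20 if the file matches the description above.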

Registered as site index 15 (port 40015) in websyn_start.sh,
control_server.py, and Dockerfile (EXPOSE 40000-40015).
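
The index-to-port relationship implied here can be sketched as follows. The base port of 40000 and the one-port-per-index rule are inferred from "site index 15 (port 40015)" and "EXPOSE 40000-40015"; `BASE_PORT` and `site_port` are hypothetical names, not identifiers from the repo.

```python
BASE_PORT = 40000  # assumption: inferred from EXPOSE 40000-40015
SITE_COUNT = 16    # 16 sites after this PR, indices 0..15


def site_port(index: int) -> int:
    """Map a zero-based site index to its port under the assumed scheme."""
    if not 0 <= index < SITE_COUNT:
        raise ValueError(f"site index out of range: {index}")
    return BASE_PORT + index
```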

Pre-PR checks (passed locally):
- docker build webharbor:dev (5.89GB)
- 16/16 sites return HTTP 200
- /reset/merriam_webster byte-identical (md5 a4248bef..)
- /reset-all 16 sites parallel ~1.1s
- 20/20 benchmark tasks walkable in container
- All 15 existing sites still byte-identical (no regression)

Assets: heavy assets (instance_seed/merriam_webster.db, 12 real images
from MW games/quizzes) uploaded to HF dataset YuanDaozeiii/WebHarbor at
revision 8866e560. .assets-revision pins to the fork until the HF PR
adding merriam_webster.tar.gz to ChilleD/WebHarbor is merged.

Also fixes a pre-existing .gitignore bug where the inline comment on
sites/*/scraped_data/ silently disabled the rule (gitignore does not
support inline comments).
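
To illustrate the class of bug: in a .gitignore, only a line that starts with `#` is a comment, so text after a pattern becomes part of the pattern. A minimal sketch of a check for such lines is below; `find_inline_comments` is a hypothetical helper, and it is a heuristic that would also flag a pattern that legitimately contains whitespace followed by `#`.

```python
import re

# Whitespace followed by '#' on a non-comment line is a likely inline comment.
_INLINE_COMMENT = re.compile(r"\s#")


def find_inline_comments(gitignore_text: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that look like 'pattern # comment'."""
    suspects = []
    for lineno, line in enumerate(gitignore_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and whole-line comments are fine
        if _INLINE_COMMENT.search(line):
            suspects.append((lineno, line))
    return suspects
```

Moving the comment onto its own line above the pattern is the usual way to resolve a flagged line.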

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

YuanDaoze force-pushed the feat/merriam-webster branch from 6268698 to ce6d6e3 on June 2, 2026 11:31

YuanDaoze (Author) commented

hello, please check the repo~ @Raibows
