feat: scrape quality gates — classify outcome and roll back unusable indexes#439
Draft
walterfrey wants to merge 15 commits into
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
walterfrey added a commit to NEXUZ-SYS/docs-mcp-server that referenced this pull request Jun 13, 2026
Promote a scraped (library, version) to searchable only when it indexed usable docs. Adds a ScrapeOutcome classifier + gate-then-rollback hard-fail at the pipeline seam, fixing three failure modes that previously reported "completed": empty/hostile host (FM-1, locale pin-en + LOCALE_REDIRECT_LOOP), off-topic subtree (FM-2, denyPaths + opt-in expectTerms/scope relevance), and silent no-op on a nonexistent GitHub subpath (FM-3, GITHUB_SUBPATH_NOT_FOUND). Additive API (expectTerms/denyPaths/localeStrategy + outcome/errorCode); refresh/clean=false skip the gate to avoid data loss. Upstream PR (clean branch): arabold#439 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
feat: scrape quality gates — classify outcome and roll back unusable indexes
Problem
Today a scrape job is marked `completed` as soon as the scraper returns without throwing (`PipelineManager` success path). There is no quality gate: the server conflates "the crawl finished" with "I indexed usable documentation". This produces three failure modes that all report `completed`, verified against the real `google/generative-ai-docs` repository:
- FM-1, empty/hostile host (`?hl=` locale redirect): no `Location` detection and no locale stripping. Reported status: `completed`.
- FM-2, off-topic subtree: `demos/` is crawled as descendants. Measured: 679 of 1045 tree entries are under `demos/`, drowning the ~50 real `.md` docs. Path scope can't help (demos are descendants of the root). Reported status: `completed` (non-empty!).
- FM-3, nonexistent GitHub subpath: `/tree/<branch>/<subPath>` indexes nothing and the job reports `completed` in ~0.4s. Reported status: `completed`.
Empirical evidence: real repo tree, not truncated, 1045 entries.
Solution
Promote a `(library, version)` to searchable only if it passes quality gates. Otherwise, discard what was indexed and return a typed error code with remediation.
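A minimal sketch of the promote-or-discard rule stated above, not the PR's actual code. `ScrapeOutcome`, `removeVersion()`, `EMPTY_RESULT` and `THIN_RESULT` are names from this description; the `VersionStore` interface, `countChunks`, `InMemoryStore`, `applyGate`, the outcome-to-code mapping, and the thin threshold of 5 chunks are hypothetical stand-ins chosen for illustration.

```typescript
// Hypothetical sketch: only ScrapeOutcome, removeVersion and the error code
// names come from the PR text. Signatures and thresholds are assumptions.
type ScrapeOutcome = "indexed" | "empty" | "thin" | "degenerate" | "failed";
type ScrapeErrorCode = "EMPTY_RESULT" | "THIN_RESULT";

interface VersionStore {
  countChunks(library: string, version: string): number; // hypothetical accessor
  removeVersion(library: string, version: string): void; // assumed signature
}

interface GateResult {
  status: "COMPLETED" | "FAILED";
  outcome: ScrapeOutcome;
  errorCode?: ScrapeErrorCode;
}

const THIN_CHUNK_THRESHOLD = 5; // invented value, for illustration only

// Pure classifier over a counter that already exists at job end.
function classify(chunks: number): ScrapeOutcome {
  if (chunks === 0) return "empty";
  if (chunks < THIN_CHUNK_THRESHOLD) return "thin";
  return "indexed";
}

// Gate, then roll back: a failing verdict discards the staged docs.
function applyGate(store: VersionStore, library: string, version: string): GateResult {
  const outcome = classify(store.countChunks(library, version));
  if (outcome === "indexed") {
    return { status: "COMPLETED", outcome };
  }
  store.removeVersion(library, version);
  const errorCode: ScrapeErrorCode = outcome === "empty" ? "EMPTY_RESULT" : "THIN_RESULT";
  return { status: "FAILED", outcome, errorCode };
}

// Tiny in-memory store so the sketch can be exercised without a database.
class InMemoryStore implements VersionStore {
  private chunks = new Map<string, number>();
  add(library: string, version: string, count: number): void {
    this.chunks.set(`${library}@${version}`, count);
  }
  countChunks(library: string, version: string): number {
    return this.chunks.get(`${library}@${version}`) ?? 0;
  }
  removeVersion(library: string, version: string): void {
    this.chunks.delete(`${library}@${version}`);
  }
}
```

In this sketch a version holding 2 chunks is classified `thin`, removed from the store, and reported as `FAILED` with `THIN_RESULT`, so nothing half-indexed remains searchable.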
- `ScrapeOutcome` enum (`indexed | empty | thin | degenerate | failed`) computed at job end from counters that already exist.
- Gate-then-rollback: before the job is marked `COMPLETED`, a failing verdict calls the existing `removeVersion()` to discard the staged docs and marks the job `FAILED` with a typed `errorCode`. No half-indexed garbage stays searchable.
- FM-1: `localeStrategy` (`pin-en` default) sends `Accept-Language: en` and strips the `hl`/`lang`/`locale` query params; cyclic-`Location` detection within the existing redirect cap raises `LOCALE_REDIRECT_LOOP`.
- FM-2: `denyPaths` (default `demos/examples`) trims irrelevant subtrees; an opt-in relevance gate (`expectTerms`) samples indexed chunks and flags `OFF_TOPIC`, and a path/host coherence check flags `SCOPE_DRIFT`.
- FM-3: a `/tree/` subPath matching zero files raises `GITHUB_SUBPATH_NOT_FOUND`, whose message lists the repo's real top-level directories.
Error codes (surfaced in `get_job_info`/`list_jobs`)
`EMPTY_RESULT`, `THIN_RESULT`, `OFF_TOPIC`, `SCOPE_DRIFT`, `LOCALE_REDIRECT_LOOP`, `GITHUB_SUBPATH_NOT_FOUND`.
Backward compatibility
All public changes are additive:
- `scrape_docs` params: `expectTerms`, `denyPaths`, `localeStrategy` (bounded in the zod schema: ≤50 items, ≤200 chars each).
- `ScraperOptions`, `FetchOptions`, `JobInfo`, `PipelineJob`.
- The relevance axes (`OFF_TOPIC`/`SCOPE_DRIFT`) are opt-in via `expectTerms`.
- Refresh and `clean=false` jobs skip the gate, so a transient thin re-index can never delete a pre-existing good version.
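The compatibility rules above can be illustrated with a small sketch. It uses plain TypeScript rather than the zod schema the PR mentions, and `validateBoundedList`, `GateOptions`, `GatePlan`, `planGates`, and the `isRefresh` flag are hypothetical names. The limits (50 items, 200 characters) and the skip conditions come from the description; treating an omitted `clean` as "gate applies" is an assumption.

```typescript
// Hypothetical sketch: bounds check for list params plus gate selection.
const MAX_ITEMS = 50;
const MAX_ITEM_LENGTH = 200;

// Reject lists that exceed the documented bounds; undefined means "not set".
function validateBoundedList(name: string, values: string[] | undefined): string[] {
  if (values === undefined) return [];
  if (values.length > MAX_ITEMS) {
    throw new RangeError(`${name}: at most ${MAX_ITEMS} items allowed`);
  }
  for (const value of values) {
    if (value.length > MAX_ITEM_LENGTH) {
      throw new RangeError(`${name}: items must be at most ${MAX_ITEM_LENGTH} chars`);
    }
  }
  return values;
}

interface GateOptions {
  expectTerms?: string[];
  clean?: boolean; // assumption: only an explicit false skips the gate
  isRefresh?: boolean; // hypothetical flag for refresh jobs
}

interface GatePlan {
  structural: boolean; // empty/thin checks
  relevance: boolean; // OFF_TOPIC / SCOPE_DRIFT checks
}

function planGates(options: GateOptions): GatePlan {
  // Refresh and clean=false jobs never roll back, to protect existing data.
  if (options.isRefresh === true || options.clean === false) {
    return { structural: false, relevance: false };
  }
  const terms = validateBoundedList("expectTerms", options.expectTerms);
  // Relevance axes run only when the caller opted in with expectTerms.
  return { structural: true, relevance: terms.length > 0 };
}
```

With no new parameters supplied, the plan runs only the structural checks, which matches the claim that existing callers do not need to change their requests.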
Implementation walkthrough (one commit per step)
1. `ScrapeOutcome`/`ScrapeErrorCode`/`JobOutcomeMetrics` types; pure `evaluateOutcome()` classifier; `getVersionMetrics()` store accessor.
2. `applyQualityGate()` seam in `PipelineManager` + gate-then-rollback; E2E asserting a failed gate leaves the store empty.
3. `GITHUB_SUBPATH_NOT_FOUND` guard; locale normalization + `LOCALE_REDIRECT_LOOP`.
4. `denyPaths` in GitHub + web crawl; `relevanceGate` (`toScopeKey` ref-agnostic scope, `expectTerms` sampling); wiring into the gate (opt-in).
5. `outcome`/`errorCode` in job tools; plumb new params through `scrape_docs`; docs.
Tests
New deterministic coverage (no network): `outcomeGate.test.ts`, `relevanceGate.test.ts`, `quality-gate-e2e.test.ts`, plus additions to the `PipelineManager`, `HttpFetcher`, `GitHubScraperStrategy`, `DocumentManagementService`, `GetJobInfoTool`, and `ScrapeTool` tests. FM-2/FM-3 use a committed snapshot of the real `google/generative-ai-docs` tree.
`npm test`, `npm run lint`, `npm run typecheck` all green (a pre-existing locale-dependent `cli-e2e` case and an environment-dependent `vector-persistence` case are excluded; both fail identically on `main`).
Out of scope (intentionally deferred)
- `discover_source(library)` tool.
- … `logger.warn`).