
feat: scrape quality gates — classify outcome and roll back unusable indexes #439

Draft
walterfrey wants to merge 15 commits into arabold:main from NEXUZ-SYS:upstream-pr/scrape-quality-gates


@walterfrey


feat: scrape quality gates — classify outcome and roll back unusable indexes

Problem

Today a scrape job is marked completed as soon as the scraper returns without
throwing (PipelineManager success path). There is no quality gate: the server
conflates "the crawl finished" with "I indexed usable documentation". This produces
three failure modes that all report completed, verified against the real
google/generative-ai-docs repository:

| # | Failure mode | Verified root cause | Today |
|---|---|---|---|
| FM-1 | Empty / hostile host (SPA, ?hl= locale redirect) | No locale normalization; a JS-gated or redirect-walled page yields 0 docs with no signal. Redirect cap exists but there is no cyclic-Location detection and no locale stripping. | completed |
| FM-2 | Wrong content: crawler indexes an irrelevant subtree | A repo-root scrape legitimately includes demos/ as descendants. Measured: 679 of 1045 tree entries are under demos/, drowning the ~50 real .md docs. Path scope can't help (demos are descendants of the root). | completed (non-empty!) |
| FM-3 | Silent no-op: GitHub /tree/<branch>/<subPath> indexes nothing | The GitHub tree API returns HTTP 200 with the entire tree regardless of subPath; the subPath is filtered client-side. A nonexistent/mistyped/wrong-case subPath matches 0 files silently → only the wiki link is queued (404) → 0 docs → completed in ~0.4s. | completed |

Empirical evidence (real repo, not truncated, 1045 entries):

/tree/main/docs                     -> 0 files   (dir doesn't exist)
/tree/main/gemini-api/docs          -> 0 files   (wrong path)
/tree/main/Site/en                  -> 0 files   (wrong case)
/tree/main/site/en/gemini-api/docs  -> 8 files   (correct path)
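
The zero-match behaviour above can be reproduced with a small sketch of a client-side subPath filter. This is an illustration under assumptions, not the project's code: `TreeEntry`, `SubPathNotFoundError`, `filterBySubPath`, and the five-entry tree are hypothetical stand-ins (the real tree has 1045 entries), and the guard mirrors the GITHUB_SUBPATH_NOT_FOUND idea proposed later in this description.

```typescript
// Hypothetical, simplified shape of one entry in a Git tree listing.
interface TreeEntry {
  path: string;
  type: "blob" | "tree";
}

// Hypothetical error type carrying the typed code named in this PR.
class SubPathNotFoundError extends Error {
  readonly errorCode = "GITHUB_SUBPATH_NOT_FOUND";
  constructor(subPath: string, topLevelDirs: string[]) {
    super(
      `subPath "${subPath}" matched 0 files; top-level directories: ${topLevelDirs.join(", ")}`,
    );
  }
}

// Client-side filter: the whole tree arrives regardless of subPath, so a
// wrong path or wrong case simply matches nothing. Instead of returning an
// empty list silently, raise an error that lists the real top-level dirs.
function filterBySubPath(tree: TreeEntry[], subPath: string): TreeEntry[] {
  const blobs = tree.filter((e) => e.type === "blob");
  const trimmed = subPath.replace(/^\/+|\/+$/g, "");
  if (trimmed === "") return blobs; // no subPath: keep every file
  const prefix = trimmed + "/";
  const matches = blobs.filter((e) => e.path.startsWith(prefix)); // case-sensitive
  if (matches.length === 0) {
    const topLevelDirs = tree
      .filter((e) => e.type === "tree" && !e.path.includes("/"))
      .map((e) => e.path);
    throw new SubPathNotFoundError(subPath, topLevelDirs);
  }
  return matches;
}

// Made-up miniature tree, only for demonstration.
const tree: TreeEntry[] = [
  { path: "site", type: "tree" },
  { path: "site/en", type: "tree" },
  { path: "site/en/gemini-api/docs/index.md", type: "blob" },
  { path: "demos", type: "tree" },
  { path: "demos/app/main.py", type: "blob" },
];

console.log(filterBySubPath(tree, "site/en/gemini-api/docs").length);
try {
  filterBySubPath(tree, "Site/en"); // wrong case
} catch (err) {
  console.log((err as Error).message);
}
```

Listing the top-level directories in the message gives the caller something to correct the path against, which is the remediation the typed error code is meant to carry.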

Solution

Promote a (library, version) to searchable only if it passes quality gates.
Otherwise, discard what was indexed and return a typed error code with remediation.

  • Outcome classification: new ScrapeOutcome enum (indexed | empty | thin | degenerate | failed) computed at job end from counters that already exist.
  • Gate-then-rollback (hard-fail): at the single seam where the manager would mark
    COMPLETED, a failing verdict calls the existing removeVersion() to discard the
    staged docs and marks the job FAILED with a typed errorCode. No half-indexed
    garbage stays searchable.
  • FM-1: localeStrategy (pin-en default) sends Accept-Language: en and strips
    hl/lang/locale query params; cyclic-Location detection within the existing
    redirect cap raises LOCALE_REDIRECT_LOOP.
  • FM-2: denyPaths (default demos/examples) trims irrelevant subtrees; an
    opt-in relevance gate (expectTerms) samples indexed chunks and flags OFF_TOPIC,
    and a path/host coherence check flags SCOPE_DRIFT.
  • FM-3: a /tree/ subPath matching zero files raises GITHUB_SUBPATH_NOT_FOUND,
    whose message lists the repo's real top-level directories.
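
To make the classify-then-rollback flow concrete, here is a minimal sketch. The names `evaluateOutcome`, `applyQualityGate`, `getVersionMetrics`, and `removeVersion` come from this description, but every signature, the `documentCount` field, the `minDocuments` threshold, and the `isRefresh` flag are assumptions. The store is synchronous for brevity, and the criteria for `degenerate` are not stated here, so the sketch does not classify it.

```typescript
type ScrapeOutcome = "indexed" | "empty" | "thin" | "degenerate" | "failed";
type ScrapeErrorCode =
  | "EMPTY_RESULT" | "THIN_RESULT" | "OFF_TOPIC"
  | "SCOPE_DRIFT" | "LOCALE_REDIRECT_LOOP" | "GITHUB_SUBPATH_NOT_FOUND";

// Assumed metric shape: only a document counter is modelled here.
interface JobOutcomeMetrics { documentCount: number }
interface Verdict { outcome: ScrapeOutcome; errorCode?: ScrapeErrorCode }

// Pure classifier. With the assumed default (minDocuments = 1) only a
// 0-document scrape fails, matching the conservative default described.
function evaluateOutcome(metrics: JobOutcomeMetrics, minDocuments = 1): Verdict {
  if (metrics.documentCount === 0) return { outcome: "empty", errorCode: "EMPTY_RESULT" };
  if (metrics.documentCount < minDocuments) return { outcome: "thin", errorCode: "THIN_RESULT" };
  return { outcome: "indexed" };
}

// Assumed store surface; the real accessors are probably asynchronous.
interface VersionStore {
  getVersionMetrics(library: string, version: string): JobOutcomeMetrics;
  removeVersion(library: string, version: string): void;
}

interface PipelineJob {
  library: string;
  version: string;
  isRefresh: boolean; // hypothetical flag standing in for refresh / clean=false
  status: "RUNNING" | "COMPLETED" | "FAILED";
  outcome?: ScrapeOutcome;
  errorCode?: ScrapeErrorCode;
}

// Gate at the point where the job would otherwise be marked COMPLETED.
function applyQualityGate(job: PipelineJob, store: VersionStore): PipelineJob {
  if (job.isRefresh) return { ...job, status: "COMPLETED" }; // never roll back a refresh
  const verdict = evaluateOutcome(store.getVersionMetrics(job.library, job.version));
  if (verdict.errorCode !== undefined) {
    store.removeVersion(job.library, job.version); // discard what was indexed
    return { ...job, status: "FAILED", outcome: verdict.outcome, errorCode: verdict.errorCode };
  }
  return { ...job, status: "COMPLETED", outcome: verdict.outcome };
}

// Usage with a throwaway in-memory store keyed by "library@version".
const docs = new Map<string, number>([["gemini@1.0", 0]]);
const store: VersionStore = {
  getVersionMetrics: (lib, ver) => ({ documentCount: docs.get(`${lib}@${ver}`) ?? 0 }),
  removeVersion: (lib, ver) => { docs.delete(`${lib}@${ver}`); },
};
const gated = applyQualityGate(
  { library: "gemini", version: "1.0", isRefresh: false, status: "RUNNING" },
  store,
);
console.log(gated.status, gated.errorCode); // FAILED EMPTY_RESULT
```

Keeping the classifier pure and the rollback in one function means the verdict logic can be unit-tested without a store, while the destructive step stays at a single call site.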

Error codes (surfaced in get_job_info / list_jobs)

EMPTY_RESULT, THIN_RESULT, OFF_TOPIC, SCOPE_DRIFT, LOCALE_REDIRECT_LOOP,
GITHUB_SUBPATH_NOT_FOUND.

Backward compatibility

All public changes are additive:

  • New optional scrape_docs params: expectTerms, denyPaths, localeStrategy
    (bounded in the zod schema: ≤50 items, ≤200 chars each).
  • New optional fields on ScraperOptions, FetchOptions, JobInfo, PipelineJob.
  • Conservative defaults: only a 0-document scrape is hard-failed by default; the
    relevance axes (OFF_TOPIC/SCOPE_DRIFT) are opt-in via expectTerms.
  • Refresh / clean=false jobs skip the gate, so a transient thin re-index can never
    delete a pre-existing good version.
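
The list-parameter bounds can be illustrated without the schema library. The PR states the limits live in the zod schema; the sketch below is a hand-rolled, dependency-free equivalent, and `validateBoundedList` and its error messages are hypothetical.

```typescript
// Limits quoted in the description: at most 50 items, at most 200 chars each.
const MAX_ITEMS = 50;
const MAX_CHARS = 200;

// Validates an optional string-list parameter such as expectTerms or denyPaths.
// Returns undefined when the parameter is omitted, since both are optional.
function validateBoundedList(name: string, value: unknown): string[] | undefined {
  if (value === undefined) return undefined;
  if (!Array.isArray(value)) {
    throw new TypeError(`${name} must be an array of strings`);
  }
  if (value.length > MAX_ITEMS) {
    throw new RangeError(`${name} has ${value.length} items; the limit is ${MAX_ITEMS}`);
  }
  for (const item of value) {
    if (typeof item !== "string") {
      throw new TypeError(`${name} must contain only strings`);
    }
    if (item.length > MAX_CHARS) {
      throw new RangeError(`${name} contains an item longer than ${MAX_CHARS} characters`);
    }
  }
  return value as string[];
}

console.log(validateBoundedList("denyPaths", ["demos", "examples"]));
```

Because omitted parameters pass through as undefined, existing callers that send none of the new fields behave as before.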

Implementation walkthrough (one commit per step)

  • M1: ScrapeOutcome/ScrapeErrorCode/JobOutcomeMetrics types; pure
    evaluateOutcome() classifier; getVersionMetrics() store accessor.
  • M2: applyQualityGate() seam in PipelineManager + gate-then-rollback; E2E
    asserting a failed gate leaves the store empty.
  • M3: real GitHub tree fixture; GITHUB_SUBPATH_NOT_FOUND guard; locale
    normalization + LOCALE_REDIRECT_LOOP.
  • M4: denyPaths in GitHub + web crawl; relevanceGate (toScopeKey ref-agnostic
    scope, expectTerms sampling); wiring into the gate (opt-in).
  • M5: expose outcome/errorCode in job tools; plumb new params through
    scrape_docs; docs.
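
The locale normalization and cyclic-Location detection from M3 might look roughly like the sketch below. `stripLocaleParams`, `followRedirects`, `LocaleRedirectLoopError`, the `nextLocation` callback, and the default cap of 5 are all hypothetical; a real fetcher would issue HTTP requests (with Accept-Language: en) where this sketch calls `nextLocation`.

```typescript
// Query parameters named in the description as locale selectors.
const LOCALE_PARAMS = ["hl", "lang", "locale"];

// Removes locale query parameters so that /docs and /docs?hl=pt-br
// normalize to the same URL.
function stripLocaleParams(rawUrl: string): string {
  const url = new URL(rawUrl);
  for (const param of LOCALE_PARAMS) url.searchParams.delete(param);
  return url.toString();
}

// Hypothetical error type carrying the typed code named in this PR.
class LocaleRedirectLoopError extends Error {
  readonly errorCode = "LOCALE_REDIRECT_LOOP";
}

// Follows redirects up to maxRedirects hops. nextLocation stands in for a
// request: it returns the Location target, or null for a final response.
// A normalized URL seen twice means the host keeps bouncing us back.
function followRedirects(
  start: string,
  nextLocation: (url: string) => string | null,
  maxRedirects = 5,
): string {
  const seen = new Set<string>();
  let current = stripLocaleParams(start);
  for (let hops = 0; hops <= maxRedirects; hops++) {
    if (seen.has(current)) {
      throw new LocaleRedirectLoopError(`redirect cycle detected at ${current}`);
    }
    seen.add(current);
    const location = nextLocation(current);
    if (location === null) return current; // not a redirect: done
    // Location may be relative, so resolve it against the current URL.
    current = stripLocaleParams(new URL(location, current).toString());
  }
  throw new Error(`exceeded redirect cap of ${maxRedirects}`);
}

// A host that always redirects to a locale-tagged copy of the same page.
try {
  followRedirects("https://example.com/docs", (url) => `${url}?hl=en`);
} catch (err) {
  console.log((err as LocaleRedirectLoopError).errorCode);
}
```

Detecting the cycle on the normalized URL lets the loop surface as a specific error on the second hop instead of burning through the whole redirect cap.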

Tests

New deterministic coverage (no network): outcomeGate.test.ts, relevanceGate.test.ts,
quality-gate-e2e.test.ts, plus additions to PipelineManager, HttpFetcher,
GitHubScraperStrategy, DocumentManagementService, GetJobInfoTool, ScrapeTool tests.
FM-2/FM-3 use a committed snapshot of the real google/generative-ai-docs tree.
npm test, npm run lint, npm run typecheck all green (pre-existing locale-dependent
cli-e2e and an environment-dependent vector-persistence case excluded — failing
identically on main).

Out of scope (intentionally deferred)

  • discover_source(library) tool.
  • A real staging-version with pointer swap (this PR uses gate-then-rollback).
  • Typed signal + pagination for truncated GitHub trees (still logger.warn).

walterfrey added a commit to NEXUZ-SYS/docs-mcp-server that referenced this pull request Jun 13, 2026
Promote a scraped (library, version) to searchable only when it indexed
usable docs. Adds a ScrapeOutcome classifier + gate-then-rollback hard-fail
at the pipeline seam, fixing three failure modes that previously reported
"completed": empty/hostile host (FM-1, locale pin-en + LOCALE_REDIRECT_LOOP),
off-topic subtree (FM-2, denyPaths + opt-in expectTerms/scope relevance),
and silent no-op on a nonexistent GitHub subpath (FM-3, GITHUB_SUBPATH_NOT_FOUND).
Additive API (expectTerms/denyPaths/localeStrategy + outcome/errorCode);
refresh/clean=false skip the gate to avoid data loss.

Upstream PR (clean branch): arabold#439

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
