Optimize Semantic Cache Retrieval and Pruning by d-oit · Pull Request #432 · d-oit/do-web-doc-resolver

d-oit · 2026-06-08T08:24:32Z

This PR addresses performance and hit-rate issues in the wdr CLI semantic cache.

Key changes:

Advanced Normalization: Queries are now normalized by removing documentation-specific stop-words (docs, library, standard, etc.) and sorting the remaining tokens alphabetically. Queries that differ only in stop-words or word order therefore normalize to the same string and hit the same cache entry with a similarity of 1.0.
Redundancy Pruning: The 'store' operation now checks for an existing entry with identical content or extremely high vector similarity (>0.999) before adding a new one, which reduces cache bloat.
Corrected Stats: The do-wdr cache-stats command now reports live concept counts from the underlying framework instead of hardcoded zeros.
Performance Verification: Benchmarks against the Python, Rust, Go, MDN, and .NET docs confirm that cache hits remain fast (~11 ms), well within the 200 ms target, and that quality scores remain high (0.9-1.0).
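
For readers who want a concrete picture of the normalization step, here is a minimal sketch in Python. It is not the code from this PR: the function name `normalize_query`, the tokenizer regex, and the exact stop-word set are assumptions (the description names only "docs", "library", and "standard" before "etc."). It also does not expand abbreviations such as "std".

```python
import re

# Hypothetical stop-word set. The PR description lists "docs", "library",
# and "standard"; "documentation" is added here purely for illustration.
STOP_WORDS = frozenset({"docs", "documentation", "library", "standard"})


def normalize_query(query: str) -> str:
    """Lowercase the query, drop stop-words, and sort the remaining tokens."""
    # Assumed tokenizer: runs of ASCII letters, digits, and underscores.
    tokens = re.findall(r"[a-z0-9_]+", query.lower())
    kept = sorted(token for token in tokens if token not in STOP_WORDS)
    return " ".join(kept)


if __name__ == "__main__":
    # Two phrasings that differ only in stop-words and word order
    # normalize to the same cache key.
    print(normalize_query("Python Standard Library asyncio"))
    print(normalize_query("asyncio python docs"))
```

Because the normalized string is what gets embedded or compared, two queries that reduce to the same string trivially reach a similarity of 1.0.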

Full analysis documented in agents-docs/SEMANTIC_HEALTH_2026_06.md.
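
The redundancy check in the 'store' operation can be sketched roughly as follows. This is an illustrative Python sketch, not the PR's implementation: the `SemanticCache` class, its in-memory list, and the choice of cosine similarity are assumptions (the description says only "vector similarity"); the 0.999 threshold is taken from the description.

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class SemanticCache:
    """Hypothetical in-memory cache holding (vector, content) pairs."""

    threshold: float = 0.999  # similarity above which an entry is redundant
    entries: list = field(default_factory=list)

    def store(self, vector, content) -> bool:
        """Add an entry unless an equivalent one exists; return True if added."""
        for existing_vector, existing_content in self.entries:
            if existing_content == content:
                return False  # identical content is already cached
            if cosine_similarity(existing_vector, vector) > self.threshold:
                return False  # near-duplicate vector is already cached
        self.entries.append((list(vector), content))
        return True


if __name__ == "__main__":
    cache = SemanticCache()
    print(cache.store([1.0, 0.0], "asyncio overview"))      # new entry
    print(cache.store([1.0, 0.00001], "asyncio overview 2"))  # near-duplicate vector
    print(len(cache.entries))
```

A linear scan like this costs O(n) per store; a real cache would more likely reuse its existing nearest-neighbour lookup for the similarity check.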

PR created automatically by Jules for task 16123962746434932673 started by @d-oit

- Implement advanced normalization (stop-word filtering and token sorting) to improve hit rates for variadic queries (e.g., 'Python Std Lib' -> 'Python Standard Library').
- Implement redundancy pruning in 'store' operation to skip identical content and extremely similar vectors.
- Fix 'cache-stats' command to report actual entry counts from the framework.
- Switch to code-aware TextEncoder for better identifier handling.
- Verify hit latency remains ~11ms and quality scores >0.85.
- Add Semantic Health summary for June 2026.

Co-authored-by: d-oit <6849456+d-oit@users.noreply.github.com>

google-labs-jules · 2026-06-08T08:24:33Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

vercel · 2026-06-08T08:24:34Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
do-web-doc-resolover	Ready	Preview, Comment	Jun 8, 2026 9:01am

deepsource-io · 2026-06-08T08:24:48Z

DeepSource Code Review

We reviewed changes in 40ab7b3...e7900d5 on this pull request. Below is the summary for the review, and you can see the individual issues we found as inline review comments.

See full review on DeepSource ↗

PR Report Card

Overall Grade	Security Reliability Complexity Hygiene

Code Review Summary

Analyzer	Updated (UTC)	Details
JavaScript	Jun 8, 2026 9:00a.m.	Review ↗
Python	Jun 8, 2026 9:00a.m.	Review ↗
Rust	Jun 8, 2026 9:00a.m.	Review ↗
Shell	Jun 8, 2026 9:00a.m.	Review ↗

Important

AI Review is run only on demand for your team. We're only showing results of static analysis review right now. To trigger AI Review, comment @deepsourcebot review on this thread.

codacy-production · 2026-06-08T08:26:25Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics -2 duplication

Metric Results

Duplication -2

View in Codacy

- Implement advanced normalization (stop-word filtering and token sorting) to improve hit rates for variadic queries (e.g., 'Python Std Lib' -> 'Python Standard Library').
- Implement redundancy pruning in 'store' operation to skip identical content and extremely similar vectors.
- Fix 'cache-stats' command to report actual entry counts from the framework.
- Switch to code-aware TextEncoder for better identifier handling.
- Verify hit latency remains ~11ms and quality scores >0.85.
- Add Semantic Health summary for June 2026.
- Fix linting and clippy issues in previous attempt.

Co-authored-by: d-oit <6849456+d-oit@users.noreply.github.com>

- Implement advanced normalization (stop-word filtering and token sorting) to improve hit rates for variadic queries (e.g., 'Python Std Lib' -> 'Python Standard Library').
- Implement redundancy pruning in 'store' operation to skip identical content and extremely similar vectors.
- Fix 'cache-stats' command to report actual entry counts from the framework.
- Switch to code-aware TextEncoder for better identifier handling.
- Verify hit latency remains ~11ms and quality scores >0.85.
- Add Semantic Health summary for June 2026.
- Fix Markdownlint and Clippy issues.

Co-authored-by: d-oit <6849456+d-oit@users.noreply.github.com>

Merge branch 'main' into perf/semantic-cache-optimization-16123962746…

12c4643

…434932673

vercel Bot deployed to Preview June 8, 2026 08:25 View deployment

vercel Bot deployed to Preview June 8, 2026 08:45 View deployment

vercel Bot deployed to Preview June 8, 2026 09:01 View deployment

d-oit merged commit 305cd06 into main Jun 8, 2026
39 checks passed

d-oit deleted the perf/semantic-cache-optimization-16123962746434932673 branch June 8, 2026 09:44

d-oit merged 4 commits into main from perf/semantic-cache-optimization-16123962746434932673
