perf(parser): fix O(n²) JSONL streaming that made Sync take >2 minutes by san360 · Pull Request #161 · microsoft/AI-Engineering-Coach

san360 · 2026-06-24T17:15:14Z

Summary

Sync Sessions took >2 minutes for large session volumes (follow-up to the #106 OOM fix). Root-caused via CPU profiling and fixed an O(n²) hotspot in the JSONL streaming readers. Phase-2 parse drops from ~197s → ~14s (~14×) with identical parsed results (same 616 sessions / 132 workspaces in my test data).

Root cause

The cold-parse JSONL readers (forEachJsonlLine / forEachJsonlLineAsync in parser-vscode-files.ts) used:

pending += decoder.write(chunk);
const parts = pending.split('\n');   // re-splits the ENTIRE growing buffer on every 1MB read
pending = parts.pop() ?? '';

Copilot CLI events.jsonl lines embed tens of MB of inline base64 (images). When a single line spans many 1 MB read chunks, every read re-concatenated and re-split the whole accumulated buffer → O(n²) CPU plus massive GC churn.

Evidence (CPU profile of a real 616-session / 132-workspace parse)

	Before	After
Phase 2 (parse)	~197 s	~14 s
`forEachJsonlLineAsync` self-time	154.5 s	negligible
Garbage collector	29.8 s	0.9 s
Idle (waiting on disk I/O)	2.6 s	9.4 s (healthy)

This ruled out the earlier hypotheses (per-file setImmediate yields, prefetch I/O, the worker pool) — the profile attributed 78% of wall time to a single function.

The fix

A shared emitChunkLines() helper that scans only the freshly read chunk with indexOf, joining the carried partial line only when its terminating newline arrives. Total work is now linear in file size, with byte-for-byte identical line-splitting semantics. Applied to both the sync and async readers.

No functional change to parsed output.
Respects the Parse worker OOM crash (exit code 3758096392) during cold parse of large session volumes #106 memory-safety design (streaming, per-workspace stripping, bounded prefetch all unchanged).
No new runtime dependencies; no network calls.

Telemetry (local-only, as requested)

Adds Sync timing instrumentation to the VS Code Output channel to attribute future regressions — all local, no remote telemetry:

[panel] sync-timing — end-to-end Sync breakdown (discover / parse / analyzer / warm-up / dataReady).
[parser] worker-phase-timing — per-phase wall clock.
cold-parse-breakdown / phase2-attribution — fine-grained cold-parse attribution.

Tests

Added regression tests for a single line spanning multiple read chunks (the O(n²) trigger) in both readers.
npm run check green — 1215 tests pass (+2 new).

Verification

VSIX repackaged and installed locally; "Developer: Reload Window" then re-run Sync to see the ~14× improvement and the sync-timing line in the Output channel.

Parsed-output parity verification

Added npm run parity (scripts/parse-parity.ts + scripts/parity-parse-entry.ts) to prove this change is output-preserving. There was no existing automated way to compare parsed JSON against main, and the harness can't live on main — so instead of a committed golden snapshot it builds the parser from both trees and diffs them directly:

Bundles parseAllLogs from this branch's src/ and from a main checkout's src/.
Runs both against the same real local logs.
Serializes each ParseResult to canonical, stable-sorted NDJSON and compares by SHA-256.

Result — byte-for-byte identical across both trees (main runs the old split-based reader, branch runs emitChunkLines):

branch sha256 = 849183b49f4556f44678256104a169f3073f2002baa711e5ddc5ed95efe13880
main   sha256 = 849183b49f4556f44678256104a169f3073f2002baa711e5ddc5ed95efe13880
sessions=616 workspaces=132 editLoc=1771 sources=599  (identical)

PARITY OK — parsed output is byte-for-byte identical across both trees.

Usage: npm run parity (defaults to the local main checkout) or npm run parity -- --main <path-to-main-checkout>.

The cold-parse JSONL readers accumulated `pending += chunk` and then re-ran `pending.split('\n')` over the entire growing buffer on every 1MB read. Copilot CLI events.jsonl lines embed tens of MB of inline base64, so a single long line spanning many chunks cost O(n^2) CPU plus heavy GC. CPU-profiling a real 616-session / 132-workspace parse attributed 154.5s of 198s to forEachJsonlLineAsync self-time and 29.8s to GC. Replacing the repeated whole-buffer split with emitChunkLines(), which scans only each freshly read chunk via indexOf, makes the work linear with identical line semantics: phase-2 parse dropped from ~197s to ~14s (~14x) with the same parsed result. Also adds local-only Sync timing telemetry to the VS Code Output channel (sync-timing, worker-phase-timing, cold-parse-breakdown, phase2-attribution) to attribute future regressions. No remote telemetry or network calls. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-06-24T17:15:27Z

⚠️ Deprecation Warning: The deny-licenses option is deprecated for possible removal in the next major release. For more information, see issue 997.

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Adds `npm run parity`, which builds the parser from two source trees (this branch and a main checkout), runs both against the same real local logs, and compares a canonical NDJSON serialization of the ParseResult by SHA-256. This is how we prove the O(n^2) JSONL streaming fix changed performance only, not output: there is no committed golden snapshot, and the harness itself lives only on the branch, so it diffs branch-vs-main builds directly instead of against a baseline. Verified: both trees emit an identical sha256 (sessions=616 workspaces=132 editLoc=1771 sources=599). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

san360 · 2026-06-25T05:00:34Z

Thanks for the automated check. This is informational: Dependency Review found no vulnerabilities or license issues for this PR, which is expected since it adds no runtime dependencies (only source changes plus two dev-only scripts). No action needed here.

The "deny-licenses is deprecated" note refers to the repository's dependency-review workflow configuration, not anything in this PR's diff, so it is out of scope for this change and best handled as a separate workflow-maintenance update.

aymenfurter

LGTM

Copilot

Pull request overview

Improves cold-parse performance for large Copilot JSONL logs by replacing an O(n²) chunk-splitting hotspot with a linear-time line emitter, and adds local-only timing instrumentation (plus a parity harness) to make future regressions diagnosable while preserving parsed output.

Changes:

Reworked JSONL streaming readers to scan only the newly read chunk for newlines (linear-time behavior) and added regression coverage for long single-line spans.
Added local-only sync/parse timing instrumentation across the panel, worker host, and parser to attribute wall-clock time by phase/sub-phase.
Introduced a npm run parity harness that builds/runs parsers from two trees and SHA-compares canonicalized NDJSON output.

Show a summary per file

File	Description
src/webview/panel.ts	Adds end-to-end sync timing breakdown emitted via `runtimeDebug` (warm-cache + cold).
src/core/parser.ts	Adds phase-2 attribution logging and cold-parse sub-phase breakdown logging hooks.
src/core/parser-worker-host.ts	Tracks per-phase wall-clock durations from worker progress and logs a summary on completion.
src/core/parser-worker-host.test.ts	Adds coverage ensuring the worker-phase timing summary is emitted on `done`.
src/core/parser-vscode.ts	Adds sub-step timing accumulation around chat/CLI/edit parsing operations.
src/core/parser-vscode-files.ts	Replaces `pending.split('\n')` loop with `emitChunkLines()` to avoid O(n²) behavior on long lines.
src/core/parser-vscode-files.test.ts	Adds regression tests for single JSONL lines spanning multiple read chunks (sync + async readers).
src/core/parser-shared.ts	Introduces parse timing counters and integrates forced-GC counting for attribution.
scripts/parse-parity.ts	Adds a two-tree parse parity runner that bundles/runs both parsers and compares SHA-256 output hashes.
scripts/parity-parse-entry.ts	Adds the deterministic canonicalization + NDJSON emission entrypoint used by the parity runner.
package.json	Adds `parity` npm script.
.gitignore	Ignores parity output/temp artifacts.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Files reviewed: 11/12 changed files
Comments generated: 2

- parse-parity: replace the hard-coded Windows-only main checkout path with git-based discovery (git worktree list -> the main branch's working tree), so `npm run parity` works on any OS and checkout layout, and fails with an actionable message when no main checkout is found. - parser-shared(maybeForceGc): increment parseTiming.forcedGc only after gc() returns, so the counter reflects GCs actually forced rather than attempted. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

san360 assigned aymenfurter Jun 24, 2026

san360 requested a review from aymenfurter June 24, 2026 17:22

aymenfurter approved these changes Jun 25, 2026

View reviewed changes

aymenfurter requested a review from Copilot June 25, 2026 08:19

Copilot started reviewing on behalf of aymenfurter June 25, 2026 08:19 View session

Copilot AI reviewed Jun 25, 2026

View reviewed changes

Comment thread scripts/parse-parity.ts

Comment thread src/core/parser-shared.ts Outdated

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(parser): fix O(n²) JSONL streaming that made Sync take >2 minutes#161

perf(parser): fix O(n²) JSONL streaming that made Sync take >2 minutes#161
san360 wants to merge 3 commits into
mainfrom
san360-sync-timing-telemetry

san360 commented Jun 24, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

san360 commented Jun 25, 2026

Uh oh!

aymenfurter left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

san360 commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Evidence (CPU profile of a real 616-session / 132-workspace parse)

The fix

Telemetry (local-only, as requested)

Tests

Verification

Parsed-output parity verification

Uh oh!

github-actions Bot commented Jun 24, 2026

Dependency Review

Scanned Files

Uh oh!

san360 commented Jun 25, 2026

Uh oh!

aymenfurter left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot's findings

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

san360 commented Jun 24, 2026 •

edited

Loading