Skip to content

fix(android): stop large-graph startup crash from uncaught Throwables on the ViewModel scope#141

Merged
tstapler merged 5 commits into
mainfrom
claude/android-crash-large-docs-h7p49e
Jun 11, 2026
Merged

fix(android): stop large-graph startup crash from uncaught Throwables on the ViewModel scope#141
tstapler merged 5 commits into
mainfrom
claude/android-crash-large-docs-h7p49e

Conversation

@tstapler

Copy link
Copy Markdown
Owner

Root cause of the 8 030-page "SteleKit keeps stopping" crash: the ViewModel
scope (App.kt) had no CoroutineExceptionHandler. Under large-graph heap
pressure an OutOfMemoryError is thrown in whichever coroutine allocates
next — the unguarded standing collectors (observeSpecialPages, PageNameIndex)
and fire-and-forget launches (ensureTodayJournal) on that scope let it reach
the default uncaught-exception handler, which kills the process on Android
but only logs on desktop. That is why the crash never reproduced on desktop
and why prior mitigations inside GraphLoader's own scopes did not help.

The hang half came from observeSpecialPages keeping a standing collector on
getAllPages(): every DB write invalidates that query, so graph import/
reconcile re-materialized all 8 030 Page objects per write burst (GC thrash),
and cachedAllPages pinned a full-table snapshot for the ViewModel's lifetime.

Fixes:

  • StelekitViewModel scope now carries a CoroutineExceptionHandler that
    surfaces uncaught Throwables as fatalError (recovery screen) instead of
    crashing.
  • observeSpecialPages no longer collects getAllPages(): favorites use a new
    dedicated selectFavoritePages query (PageRepository.getFavoritePages, with
    a conflated SqlDelight override); recents resolve via bounded point lookups;
    cachedAllPages is removed.
  • PageNameIndex guards its collector and Aho-Corasick matcher build against
    Throwable, degrading suggestions instead of dying (its first 8 000-name trie
    build lands ~500 ms after launch — exactly the observed crash window).
  • android:largeHeap="true" for headroom on whole-graph workloads.

Tests (fail before the fix — they reproduce the crash by recording Throwables
that reach the default uncaught-exception handler):

  • StelekitViewModelCrashReproductionTest: OOM in the page-list observer, OOM
    in ensureTodayJournal, and a guard that the ViewModel keeps at most one
    standing getAllPages() subscription.
  • PageNameIndexResilienceTest: OOM from the page flow keeps the last matcher.
  • LargeGraphWarmStartCrashTest: end-to-end 8 030-page warm start (persistent
    cache kept, warm reconcile + Phase 3 indexing) completes with no uncaught
    Throwables.

https://claude.ai/code/session_01Avi2KYmMrxLLFCiMBJHwpb

claude added 3 commits June 10, 2026 17:09
… on the ViewModel scope

Root cause of the 8 030-page "SteleKit keeps stopping" crash: the ViewModel
scope (App.kt) had no CoroutineExceptionHandler. Under large-graph heap
pressure an OutOfMemoryError is thrown in whichever coroutine allocates
next — the unguarded standing collectors (observeSpecialPages, PageNameIndex)
and fire-and-forget launches (ensureTodayJournal) on that scope let it reach
the default uncaught-exception handler, which kills the process on Android
but only logs on desktop. That is why the crash never reproduced on desktop
and why prior mitigations inside GraphLoader's own scopes did not help.

The hang half came from observeSpecialPages keeping a standing collector on
getAllPages(): every DB write invalidates that query, so graph import/
reconcile re-materialized all 8 030 Page objects per write burst (GC thrash),
and cachedAllPages pinned a full-table snapshot for the ViewModel's lifetime.

Fixes:
- StelekitViewModel scope now carries a CoroutineExceptionHandler that
  surfaces uncaught Throwables as fatalError (recovery screen) instead of
  crashing.
- observeSpecialPages no longer collects getAllPages(): favorites use a new
  dedicated selectFavoritePages query (PageRepository.getFavoritePages, with
  a conflated SqlDelight override); recents resolve via bounded point lookups;
  cachedAllPages is removed.
- PageNameIndex guards its collector and Aho-Corasick matcher build against
  Throwable, degrading suggestions instead of dying (its first 8 000-name trie
  build lands ~500 ms after launch — exactly the observed crash window).
- android:largeHeap="true" for headroom on whole-graph workloads.

Tests (fail before the fix — they reproduce the crash by recording Throwables
that reach the default uncaught-exception handler):
- StelekitViewModelCrashReproductionTest: OOM in the page-list observer, OOM
  in ensureTodayJournal, and a guard that the ViewModel keeps at most one
  standing getAllPages() subscription.
- PageNameIndexResilienceTest: OOM from the page flow keeps the last matcher.
- LargeGraphWarmStartCrashTest: end-to-end 8 030-page warm start (persistent
  cache kept, warm reconcile + Phase 3 indexing) completes with no uncaught
  Throwables.

https://claude.ai/code/session_01Avi2KYmMrxLLFCiMBJHwpb
…kpressure

Eliminates the remaining O(graph) query consumers so peak memory during load,
reconcile, and background indexing is bounded by chunk size, not graph size —
the other half of the 8 000-page Android OOM ("hangs and crashes").

- GraphLoader.loadDirectory: per-chunk IN-clause existence lookups
  (getPagesByNames / getJournalPagesByDates, ≤500 per IN list for Android's
  SQLITE_MAX_VARIABLE_NUMBER) replace the getAllPages() full-table preload
  that held every Page plus two full-size maps for the whole load.
- GraphLoader.indexRemainingPages: drain loop over the new
  getUnloadedPages(limit, offset) (batches of 100) replaces the all-at-once
  unloaded-pages snapshot. An attempted-UUID set advances the drain offset
  past permanently-failing rows (missing file, parse error, zero-block parse),
  guaranteeing termination; countUnloadedPages() provides the O(1) progress
  denominator.
- PageNameIndex: collects the new names-only getPageNameEntries() projection
  (name + is_journal, conflated) instead of getAllPages() — a few hundred KB
  per rebuild instead of tens of MB of full Page objects with properties maps.
- No production code subscribes to getAllPages() anymore; it remains for
  tests/benchmarks and migration tooling only. CLAUDE.md documents the
  pagination/backpressure rules per consumer type.

Tests (verified to fail against the unpaginated implementation):
- LargeGraphWarmStartCrashTest now also asserts zero getAllPages()/unbounded
  getUnloadedPages() subscriptions and ≤100-row batches across the full
  8 030-page warm start + Phase 3 indexing.
- GraphLoaderIndexBatchingTest: drains 250 pages in bounded batches; and
  terminates when 30 pages can never be indexed.
- StelekitViewModelCrashReproductionTest: ViewModel holds zero getAllPages()
  subscriptions; OOM reproduction tests retargeted to the new projection flow.

https://claude.ai/code/session_01Avi2KYmMrxLLFCiMBJHwpb
… pagination at compile time

Now that no code path needs a full-table read, delete the unbounded queries
entirely so reintroducing the 8 000-page Android OOM pattern is a compile
error, not a code-review catch:

- PageRepository no longer declares getAllPages() or unbounded
  getUnloadedPages(); selectAllPages / selectUnloadedPages are removed from
  SteleDatabase.sq. The interface KDoc and CLAUDE.md document the bounded
  query menu per consumer type.
- Whole-graph one-shots (export, migration tooling, benchmarks, tests) use
  the new getAllPagesSnapshot(batchSize=500) interface method, which pages
  through getPages(limit, offset) in bounded batches — even "fetch
  everything" callers never run an unpaginated query or hold a reactive
  full-table flow.
- Remaining production callers migrated to bounded reads:
  - AllPagesViewModel: standing getAllPages() collector (a live OOM vector
    while indexing runs with the All Pages screen open) becomes a one-shot
    bounded-batch snapshot; refresh() reloads on demand.
  - ImportViewModel: suggestion-dedup uses the names-only
    getPageNameEntries() projection; URL dedup uses the snapshot.
  - GlobalUnlinkedReferencesViewModel: names-only projection.
  - ExportService.exportJournalRange: bounded pagination over
    getJournalPages (journal_date DESC) with early stop below the range.
  - InMemorySearchRepository (test backend): snapshot instead of full flow.
- DatalogPageRepository, commonTest fakes, and InstrumentedPageRepository
  implement the bounded query set; the instrumentation wrapper delegates
  explicitly so the SQL-optimized chunked lookups are preserved.
- QueryPlanAuditTest audits the new bounded queries; UpgradeResilienceTest
  now exercises the closed-DB guard on every new read (favorites, name
  entries, paginated unloaded, counts, snapshot).
- Runtime "zero unbounded subscriptions" test assertions removed — the
  guarantee is now structural.

https://claude.ai/code/session_01Avi2KYmMrxLLFCiMBJHwpb
Copilot AI review requested due to automatic review settings June 11, 2026 02:31
@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

JVM Load Benchmark (Desktop)

Synthetic in-memory benchmark measuring load performance for the desktop (JVM) app.
Comparing 636ff68a (this PR) vs 3d674c6 (baseline)
Graph config: xlarge — 230 pages

Metric This PR Baseline Delta
Phase 1 TTI ↓ 0ms 1ms -1ms (-100%) ✅
Phase 2 background ↓ 1ms 0ms +1ms ⚠️
Phase 3 index ↓ 1ms 1ms 0 (0%)
Total ↓ 2ms 2ms 0 (0%)
Write p95 (baseline) ↓ 4ms 2ms +2ms (+100%) ⚠️
Write p95 (under load) ↓ n/a n/a
Jank factor ↓ n/a n/a
↓ lower is better
Flamegraphs (this PR) **Allocation** — object allocation pressure (JDBC/SQLite churn)

Alloc flamegraph not available

CPU — method-level hotspots by on-CPU time

CPU flamegraph not available

Top SQL queries by total time (this PR) | table:operation | calls | p50 | p99 | max | total | |-----------------|-------|-----|-----|-----|-------| | `pages:select` | 2 | 1ms | 1ms | 0ms | 0ms |
Top allocation hotspots (this PR) `42%` byte[]_[k] `9.5%` java.lang.String_[k] `7.1%` java.util.LinkedHashMap$Entry_[k] `3.7%` dev.stapler.stelekit.parsing.lexer.Token_[k] `3.3%` int[]_[k]
Top CPU hotspots (this PR) `94.5%` /usr/lib/x86_64-linux-gnu/libc.so.6 `2.3%` /tmp/sqlite-3.51.3.0-77cabc3d-9165-401a-9a27-a799ea3e85f4-libsqlitejdbc.so `0.4%` fsync `0.4%` __libc_pwrite `0.3%` pthread_cond_signal

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses an Android-only startup crash on very large graphs by preventing uncaught coroutine Throwables (notably OutOfMemoryError) from reaching the platform default uncaught-exception handler, and by eliminating unbounded full-table page reads/collectors that caused O(N²) re-materialization during write bursts.

Changes:

  • Add a CoroutineExceptionHandler to the long-lived StelekitViewModel scope and add resilience guards in PageNameIndex so uncaught Throwables surface as fatalError / degrade gracefully instead of crashing Android.
  • Remove unbounded “all pages” reactive access patterns by redesigning PageRepository around paginated, projected, chunked, and one-shot snapshot reads (getFavoritePages, getPageNameEntries, chunked IN lookups, getAllPagesSnapshot, bounded unloaded-page drain).
  • Add regression tests reproducing the large-graph warm-start crash window and enforcing bounded-query behavior.

Reviewed changes

Copilot reviewed 41 out of 41 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/ui/StelekitViewModelCrashReproductionTest.kt New JVM test that reproduces/guards against uncaught Throwables escaping the ViewModel scope.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/ui/fixtures/FakeRepositories.kt Updates fake repositories to the new bounded/projection PageRepository API.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/repository/UpgradeResilienceTest.kt Adjusts upgrade resilience coverage to new snapshot/projection APIs.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/repository/RepositoryFlowResilienceTest.kt Updates flow-resilience test to use snapshot instead of removed unbounded flow.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/migration/DryRunTest.kt Uses getAllPagesSnapshot() for pre/post state snapshots.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/integration/PipelineReproductionTest.kt Switches whole-graph test read to getAllPagesSnapshot().
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/domain/PageNameIndexResilienceTest.kt New test ensuring PageNameIndex survives upstream OOM without crashing.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/QueryPlanAuditTest.kt Updates query allowlist/audits to reflect new bounded/projection queries.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/LargeGraphWarmStartCrashTest.kt New end-to-end warm-start regression test for an 8,030-page graph.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/GraphManagerDatabaseLifecycleTest.kt Adjusts lifecycle tests away from removed unbounded page flow.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/GraphLoaderWatcherTest.kt Updates watcher test to use snapshot access.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/GraphLoaderTest.kt Updates loader tests to use snapshot access.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/GraphLoaderIntegrationTest.kt Updates integration tests to use snapshot access.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/GraphLoaderIndexBatchingTest.kt New tests enforcing bounded batch draining + termination behavior for indexing.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/GraphLoaderCacheTest.kt Updates cache tests to use snapshot access.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/ExternalChangeConflictTest.kt Updates conflict tests to use snapshot access.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/DemoGraphIntegrationTest.kt Updates demo integration test to use snapshot access.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/AppLoadJournalIntegrationTest.kt Updates journal-load integration assertions to use snapshot access.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/benchmark/UserSessionBenchmarkTest.kt Updates benchmark to use snapshot access for warmup and final counts.
kmp/src/jvmTest/kotlin/dev/stapler/stelekit/benchmark/GraphLoadTimingTest.kt Updates timing benchmark to use snapshot access for page counts.
kmp/src/jvmCommonMain/kotlin/dev/stapler/stelekit/performance/InstrumentedPageRepository.kt Keeps instrumentation consistent with new PageRepository methods and avoids default fallbacks.
kmp/src/commonTest/kotlin/dev/stapler/stelekit/ui/screens/JournalsViewModelTest.kt Updates test fake PageRepository implementation to new API.
kmp/src/commonTest/kotlin/dev/stapler/stelekit/ui/screens/JournalsViewModelEditorTest.kt Updates test fake PageRepository implementation to new API.
kmp/src/commonTest/kotlin/dev/stapler/stelekit/repository/JournalServiceTest.kt Switches assertions to use getAllPagesSnapshot().
kmp/src/commonMain/sqldelight/dev/stapler/stelekit/db/SteleDatabase.sq Removes unpaginated full-table page queries; adds bounded/projection queries used by new APIs.
kmp/src/commonMain/kotlin/dev/stapler/stelekit/ui/StelekitViewModel.kt Adds scope exception handler; replaces unbounded collectors with favorites/projection/point lookups; removes full-table pinning.
kmp/src/commonMain/kotlin/dev/stapler/stelekit/ui/screens/ImportViewModel.kt Uses names-only projection and snapshot where whole-graph reads are required.
kmp/src/commonMain/kotlin/dev/stapler/stelekit/ui/screens/GlobalUnlinkedReferencesViewModel.kt Uses names-only projection for whole-graph name scanning.
kmp/src/commonMain/kotlin/dev/stapler/stelekit/ui/screens/AllPagesViewModel.kt Replaces standing full-table collector with an explicit one-shot snapshot load.
kmp/src/commonMain/kotlin/dev/stapler/stelekit/repository/SqlDelightPageRepository.kt Implements new bounded/projection APIs (favorites, names projection, chunked IN, bounded unloaded drain, count).
kmp/src/commonMain/kotlin/dev/stapler/stelekit/repository/PageRepository.kt Redesigns the interface to remove unbounded flows and add snapshot/projection/chunked/bounded APIs.
kmp/src/commonMain/kotlin/dev/stapler/stelekit/repository/InMemoryRepositories.kt Updates in-memory repositories and in-memory search to match new PageRepository API.
kmp/src/commonMain/kotlin/dev/stapler/stelekit/repository/DatalogPageRepository.kt Updates datascript/datalog page repo to match new bounded/projection API.
kmp/src/commonMain/kotlin/dev/stapler/stelekit/migration/DslEvaluator.kt Switches to getAllPagesSnapshot() for bulk migration prefetch.
kmp/src/commonMain/kotlin/dev/stapler/stelekit/export/ExportService.kt Replaces full-table reads with bounded pagination + early-stop export logic.
kmp/src/commonMain/kotlin/dev/stapler/stelekit/domain/PageNameIndex.kt Switches to names-only projection and guards matcher/collector against Throwable.
kmp/src/commonMain/kotlin/dev/stapler/stelekit/db/GraphLoader.kt Makes Phase 3 indexing drain bounded batches; uses chunked IN lookups instead of preloading all pages.
kmp/src/businessTest/kotlin/dev/stapler/stelekit/voice/VoiceNoteBlockFormatTest.kt Updates to use snapshot access for assertions.
kmp/src/androidInstrumentedTest/kotlin/dev/stapler/stelekit/benchmark/AndroidGraphBenchmark.kt Updates Android benchmarks to use snapshot access for page counts.
CLAUDE.md Documents new invariants: scope exception handling and bounded/projection/chunked graph-scale reads.
androidApp/src/main/AndroidManifest.xml Enables android:largeHeap="true" to provide additional headroom for whole-graph workloads.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +246 to +250
// Resolved Page objects for the recent-pages list, keyed by UUID and bounded by
// recentPageUuids (≤20 entries). Replaces the former cachedAllPages field, which
// pinned the entire pages table (8 000+ Page objects on large graphs) in memory
// for the lifetime of the ViewModel.
private val recentPagesByUuid = mutableMapOf<String, Page>()

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All access to recentPagesByUuid and recentPageUuids is now guarded by recentMutex (added in this PR). addToRecent, refreshRecentPages, setGraphPath, and observeSpecialPages all use recentMutex.withLock { }. trimRecentPagesCache is only ever called inside an existing withLock block. Thanks for the catch — the concern was valid on the original code.

Comment on lines 437 to 442
_uiState.update { it.copy(currentGraphPath = path) }
recentPageUuids = platformSettings.getString(recentPagesKey, "")
.split(",").filter { it.isNotEmpty() }.toMutableList()
recentPagesByUuid.clear()
scope.launch { refreshRecentPages() }
loadGraph(path)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed: refreshRecentPages() is no longer called in setGraphPath itself. It's now deferred to inside onPhase1Complete (after Phase 1 has populated the new graph's DB), so the getPageByUuid lookups never race with loadGraph's clear().

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Android Load Benchmark

Instrumented benchmark on an API 30 x86_64 emulator — 500-page synthetic graph.

Comparing 636ff68a (this PR) vs 2eaadb6 (baseline)
Device: API 30 x86_64 emulator — 530 pages loaded

Graph Load

Metric This PR Baseline Delta
Phase 1 TTI ↓ 41ms 49ms -8ms (-16%) ✅
Phase 3 index ↓ 4441ms 5020ms -579ms (-12%) ✅

Interactive Write Latency (during Phase 3)

Metric This PR Baseline Delta
Write p95 (baseline) ↓ 5ms 9ms -4ms (-44%) ✅
Write p95 (during phase 3) ↓ 16ms 15ms +1ms (+7%) ⚠️
Jank factor ↓ 3.2x 1.67x +1.53x (+92%) ⚠️
Concurrent writes ↑ 22 25 -3ms (-12%) ⚠️

SAF I/O Overhead (ContentProvider vs direct File read)

Measures Binder IPC cost added by ContentResolver per readFile() call.
Real SAF via ExternalStorageProvider will be higher on device; this is a lower bound.

Metric This PR Baseline Delta
Direct read / file ↓ 0.0ms 0.0ms 0 (0%)
Provider read / file ↓ 0.2ms 0.2ms 0ms (-15%) ✅
IPC overhead ratio ↓ 5x 6x -1x (-17%) ✅
↓ lower is better · ↑ higher is better

tstapler and others added 2 commits June 10, 2026 23:00
- BLOCKER: remove runCatching from withTimeout in GraphLoaderIndexBatchingTest to
  prevent CancellationException from breaking structured concurrency
- CRITICAL: snapshot recentPageUuids under recentMutex then do DB work outside lock
  in refreshRecentPages to eliminate starvation of addToRecent
- fix sanitizeErrorMessage: apply in loadGraph catch blocks (bypassed before);
  add Windows path regex C:\... pattern
- fix QueryPlanAuditTest selectJournalPages: audit SQL was COALESCE phantom query,
  not the actual ORDER BY journal_date DESC used in SteleDatabase.sq
- change countUnloadedPages() from Flow to suspend fun to fix reactive/one-shot
  contract mismatch between SQL and InMemory implementations
- add getPagesByNames/getJournalPagesByDates overrides to DatalogPageRepository
  to avoid O(graph) getAllPagesSnapshot() fallback
- improve test determinism: replace delay(1_000) with delay(200); replace polling
  while loop with flow.first{} in LargeGraphWarmStartCrashTest
- add test for ExportService oldest==null guard branch (uEJR04)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ace with loadGraph clear()

setGraphPath previously launched refreshRecentPages() concurrently with loadGraph(),
which calls pageRepository.clear() early in the load sequence. The getPageByUuid
lookups in refreshRecentPages would silently return null while clear() was in flight,
leaving recents empty until the user navigated again.

Fix: refreshRecentPages() is now called inside onPhase1Complete, after Phase 1 has
populated the new graph's DB with page metadata.

Addresses Copilot PR review comment on StelekitViewModel.kt:442.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tstapler tstapler merged commit dd96cd1 into main Jun 11, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants