fix(android): stop large-graph startup crash from uncaught Throwables on the ViewModel scope#141
Conversation
… on the ViewModel scope Root cause of the 8 030-page "SteleKit keeps stopping" crash: the ViewModel scope (App.kt) had no CoroutineExceptionHandler. Under large-graph heap pressure an OutOfMemoryError is thrown in whichever coroutine allocates next — the unguarded standing collectors (observeSpecialPages, PageNameIndex) and fire-and-forget launches (ensureTodayJournal) on that scope let it reach the default uncaught-exception handler, which kills the process on Android but only logs on desktop. That is why the crash never reproduced on desktop and why prior mitigations inside GraphLoader's own scopes did not help. The hang half came from observeSpecialPages keeping a standing collector on getAllPages(): every DB write invalidates that query, so graph import/ reconcile re-materialized all 8 030 Page objects per write burst (GC thrash), and cachedAllPages pinned a full-table snapshot for the ViewModel's lifetime. Fixes: - StelekitViewModel scope now carries a CoroutineExceptionHandler that surfaces uncaught Throwables as fatalError (recovery screen) instead of crashing. - observeSpecialPages no longer collects getAllPages(): favorites use a new dedicated selectFavoritePages query (PageRepository.getFavoritePages, with a conflated SqlDelight override); recents resolve via bounded point lookups; cachedAllPages is removed. - PageNameIndex guards its collector and Aho-Corasick matcher build against Throwable, degrading suggestions instead of dying (its first 8 000-name trie build lands ~500 ms after launch — exactly the observed crash window). - android:largeHeap="true" for headroom on whole-graph workloads. Tests (fail before the fix — they reproduce the crash by recording Throwables that reach the default uncaught-exception handler): - StelekitViewModelCrashReproductionTest: OOM in the page-list observer, OOM in ensureTodayJournal, and a guard that the ViewModel keeps at most one standing getAllPages() subscription. - PageNameIndexResilienceTest: OOM from the page flow keeps the last matcher. - LargeGraphWarmStartCrashTest: end-to-end 8 030-page warm start (persistent cache kept, warm reconcile + Phase 3 indexing) completes with no uncaught Throwables. https://claude.ai/code/session_01Avi2KYmMrxLLFCiMBJHwpb
…kpressure
Eliminates the remaining O(graph) query consumers so peak memory during load,
reconcile, and background indexing is bounded by chunk size, not graph size —
the other half of the 8 000-page Android OOM ("hangs and crashes").
- GraphLoader.loadDirectory: per-chunk IN-clause existence lookups
(getPagesByNames / getJournalPagesByDates, ≤500 per IN list for Android's
SQLITE_MAX_VARIABLE_NUMBER) replace the getAllPages() full-table preload
that held every Page plus two full-size maps for the whole load.
- GraphLoader.indexRemainingPages: drain loop over the new
getUnloadedPages(limit, offset) (batches of 100) replaces the all-at-once
unloaded-pages snapshot. An attempted-UUID set advances the drain offset
past permanently-failing rows (missing file, parse error, zero-block parse),
guaranteeing termination; countUnloadedPages() provides the O(1) progress
denominator.
- PageNameIndex: collects the new names-only getPageNameEntries() projection
(name + is_journal, conflated) instead of getAllPages() — a few hundred KB
per rebuild instead of tens of MB of full Page objects with properties maps.
- No production code subscribes to getAllPages() anymore; it remains for
tests/benchmarks and migration tooling only. CLAUDE.md documents the
pagination/backpressure rules per consumer type.
Tests (verified to fail against the unpaginated implementation):
- LargeGraphWarmStartCrashTest now also asserts zero getAllPages()/unbounded
getUnloadedPages() subscriptions and ≤100-row batches across the full
8 030-page warm start + Phase 3 indexing.
- GraphLoaderIndexBatchingTest: drains 250 pages in bounded batches; and
terminates when 30 pages can never be indexed.
- StelekitViewModelCrashReproductionTest: ViewModel holds zero getAllPages()
subscriptions; OOM reproduction tests retargeted to the new projection flow.
https://claude.ai/code/session_01Avi2KYmMrxLLFCiMBJHwpb
… pagination at compile time
Now that no code path needs a full-table read, delete the unbounded queries
entirely so reintroducing the 8 000-page Android OOM pattern is a compile
error, not a code-review catch:
- PageRepository no longer declares getAllPages() or unbounded
getUnloadedPages(); selectAllPages / selectUnloadedPages are removed from
SteleDatabase.sq. The interface KDoc and CLAUDE.md document the bounded
query menu per consumer type.
- Whole-graph one-shots (export, migration tooling, benchmarks, tests) use
the new getAllPagesSnapshot(batchSize=500) interface method, which pages
through getPages(limit, offset) in bounded batches — even "fetch
everything" callers never run an unpaginated query or hold a reactive
full-table flow.
- Remaining production callers migrated to bounded reads:
- AllPagesViewModel: standing getAllPages() collector (a live OOM vector
while indexing runs with the All Pages screen open) becomes a one-shot
bounded-batch snapshot; refresh() reloads on demand.
- ImportViewModel: suggestion-dedup uses the names-only
getPageNameEntries() projection; URL dedup uses the snapshot.
- GlobalUnlinkedReferencesViewModel: names-only projection.
- ExportService.exportJournalRange: bounded pagination over
getJournalPages (journal_date DESC) with early stop below the range.
- InMemorySearchRepository (test backend): snapshot instead of full flow.
- DatalogPageRepository, commonTest fakes, and InstrumentedPageRepository
implement the bounded query set; the instrumentation wrapper delegates
explicitly so the SQL-optimized chunked lookups are preserved.
- QueryPlanAuditTest audits the new bounded queries; UpgradeResilienceTest
now exercises the closed-DB guard on every new read (favorites, name
entries, paginated unloaded, counts, snapshot).
- Runtime "zero unbounded subscriptions" test assertions removed — the
guarantee is now structural.
https://claude.ai/code/session_01Avi2KYmMrxLLFCiMBJHwpb
JVM Load Benchmark (Desktop)Synthetic in-memory benchmark measuring load performance for the desktop (JVM) app.
Flamegraphs (this PR)**Allocation** — object allocation pressure (JDBC/SQLite churn)Alloc flamegraph not available CPU — method-level hotspots by on-CPU time CPU flamegraph not available Top SQL queries by total time (this PR)| table:operation | calls | p50 | p99 | max | total | |-----------------|-------|-----|-----|-----|-------| | `pages:select` | 2 | 1ms | 1ms | 0ms | 0ms |Top allocation hotspots (this PR)`42%` byte[]_[k] `9.5%` java.lang.String_[k] `7.1%` java.util.LinkedHashMap$Entry_[k] `3.7%` dev.stapler.stelekit.parsing.lexer.Token_[k] `3.3%` int[]_[k]Top CPU hotspots (this PR)`94.5%` /usr/lib/x86_64-linux-gnu/libc.so.6 `2.3%` /tmp/sqlite-3.51.3.0-77cabc3d-9165-401a-9a27-a799ea3e85f4-libsqlitejdbc.so `0.4%` fsync `0.4%` __libc_pwrite `0.3%` pthread_cond_signal |
There was a problem hiding this comment.
Pull request overview
This PR addresses an Android-only startup crash on very large graphs by preventing uncaught coroutine Throwables (notably OutOfMemoryError) from reaching the platform default uncaught-exception handler, and by eliminating unbounded full-table page reads/collectors that caused O(N²) re-materialization during write bursts.
Changes:
- Add a
CoroutineExceptionHandlerto the long-livedStelekitViewModelscope and add resilience guards inPageNameIndexso uncaughtThrowables surface asfatalError/ degrade gracefully instead of crashing Android. - Remove unbounded “all pages” reactive access patterns by redesigning
PageRepositoryaround paginated, projected, chunked, and one-shot snapshot reads (getFavoritePages,getPageNameEntries, chunked IN lookups,getAllPagesSnapshot, bounded unloaded-page drain). - Add regression tests reproducing the large-graph warm-start crash window and enforcing bounded-query behavior.
Reviewed changes
Copilot reviewed 41 out of 41 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/ui/StelekitViewModelCrashReproductionTest.kt | New JVM test that reproduces/guards against uncaught Throwables escaping the ViewModel scope. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/ui/fixtures/FakeRepositories.kt | Updates fake repositories to the new bounded/projection PageRepository API. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/repository/UpgradeResilienceTest.kt | Adjusts upgrade resilience coverage to new snapshot/projection APIs. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/repository/RepositoryFlowResilienceTest.kt | Updates flow-resilience test to use snapshot instead of removed unbounded flow. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/migration/DryRunTest.kt | Uses getAllPagesSnapshot() for pre/post state snapshots. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/integration/PipelineReproductionTest.kt | Switches whole-graph test read to getAllPagesSnapshot(). |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/domain/PageNameIndexResilienceTest.kt | New test ensuring PageNameIndex survives upstream OOM without crashing. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/QueryPlanAuditTest.kt | Updates query allowlist/audits to reflect new bounded/projection queries. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/LargeGraphWarmStartCrashTest.kt | New end-to-end warm-start regression test for an 8,030-page graph. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/GraphManagerDatabaseLifecycleTest.kt | Adjusts lifecycle tests away from removed unbounded page flow. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/GraphLoaderWatcherTest.kt | Updates watcher test to use snapshot access. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/GraphLoaderTest.kt | Updates loader tests to use snapshot access. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/GraphLoaderIntegrationTest.kt | Updates integration tests to use snapshot access. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/GraphLoaderIndexBatchingTest.kt | New tests enforcing bounded batch draining + termination behavior for indexing. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/GraphLoaderCacheTest.kt | Updates cache tests to use snapshot access. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/ExternalChangeConflictTest.kt | Updates conflict tests to use snapshot access. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/DemoGraphIntegrationTest.kt | Updates demo integration test to use snapshot access. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/db/AppLoadJournalIntegrationTest.kt | Updates journal-load integration assertions to use snapshot access. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/benchmark/UserSessionBenchmarkTest.kt | Updates benchmark to use snapshot access for warmup and final counts. |
| kmp/src/jvmTest/kotlin/dev/stapler/stelekit/benchmark/GraphLoadTimingTest.kt | Updates timing benchmark to use snapshot access for page counts. |
| kmp/src/jvmCommonMain/kotlin/dev/stapler/stelekit/performance/InstrumentedPageRepository.kt | Keeps instrumentation consistent with new PageRepository methods and avoids default fallbacks. |
| kmp/src/commonTest/kotlin/dev/stapler/stelekit/ui/screens/JournalsViewModelTest.kt | Updates test fake PageRepository implementation to new API. |
| kmp/src/commonTest/kotlin/dev/stapler/stelekit/ui/screens/JournalsViewModelEditorTest.kt | Updates test fake PageRepository implementation to new API. |
| kmp/src/commonTest/kotlin/dev/stapler/stelekit/repository/JournalServiceTest.kt | Switches assertions to use getAllPagesSnapshot(). |
| kmp/src/commonMain/sqldelight/dev/stapler/stelekit/db/SteleDatabase.sq | Removes unpaginated full-table page queries; adds bounded/projection queries used by new APIs. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/ui/StelekitViewModel.kt | Adds scope exception handler; replaces unbounded collectors with favorites/projection/point lookups; removes full-table pinning. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/ui/screens/ImportViewModel.kt | Uses names-only projection and snapshot where whole-graph reads are required. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/ui/screens/GlobalUnlinkedReferencesViewModel.kt | Uses names-only projection for whole-graph name scanning. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/ui/screens/AllPagesViewModel.kt | Replaces standing full-table collector with an explicit one-shot snapshot load. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/repository/SqlDelightPageRepository.kt | Implements new bounded/projection APIs (favorites, names projection, chunked IN, bounded unloaded drain, count). |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/repository/PageRepository.kt | Redesigns the interface to remove unbounded flows and add snapshot/projection/chunked/bounded APIs. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/repository/InMemoryRepositories.kt | Updates in-memory repositories and in-memory search to match new PageRepository API. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/repository/DatalogPageRepository.kt | Updates datascript/datalog page repo to match new bounded/projection API. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/migration/DslEvaluator.kt | Switches to getAllPagesSnapshot() for bulk migration prefetch. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/export/ExportService.kt | Replaces full-table reads with bounded pagination + early-stop export logic. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/domain/PageNameIndex.kt | Switches to names-only projection and guards matcher/collector against Throwable. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/db/GraphLoader.kt | Makes Phase 3 indexing drain bounded batches; uses chunked IN lookups instead of preloading all pages. |
| kmp/src/businessTest/kotlin/dev/stapler/stelekit/voice/VoiceNoteBlockFormatTest.kt | Updates to use snapshot access for assertions. |
| kmp/src/androidInstrumentedTest/kotlin/dev/stapler/stelekit/benchmark/AndroidGraphBenchmark.kt | Updates Android benchmarks to use snapshot access for page counts. |
| CLAUDE.md | Documents new invariants: scope exception handling and bounded/projection/chunked graph-scale reads. |
| androidApp/src/main/AndroidManifest.xml | Enables android:largeHeap="true" to provide additional headroom for whole-graph workloads. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Resolved Page objects for the recent-pages list, keyed by UUID and bounded by | ||
| // recentPageUuids (≤20 entries). Replaces the former cachedAllPages field, which | ||
| // pinned the entire pages table (8 000+ Page objects on large graphs) in memory | ||
| // for the lifetime of the ViewModel. | ||
| private val recentPagesByUuid = mutableMapOf<String, Page>() |
There was a problem hiding this comment.
All access to recentPagesByUuid and recentPageUuids is now guarded by recentMutex (added in this PR). addToRecent, refreshRecentPages, setGraphPath, and observeSpecialPages all use recentMutex.withLock { }. trimRecentPagesCache is only ever called inside an existing withLock block. Thanks for the catch — the concern was valid on the original code.
| _uiState.update { it.copy(currentGraphPath = path) } | ||
| recentPageUuids = platformSettings.getString(recentPagesKey, "") | ||
| .split(",").filter { it.isNotEmpty() }.toMutableList() | ||
| recentPagesByUuid.clear() | ||
| scope.launch { refreshRecentPages() } | ||
| loadGraph(path) |
There was a problem hiding this comment.
Good catch. Fixed: refreshRecentPages() is no longer called in setGraphPath itself. It's now deferred to inside onPhase1Complete (after Phase 1 has populated the new graph's DB), so the getPageByUuid lookups never race with loadGraph's clear().
Android Load BenchmarkInstrumented benchmark on an API 30 x86_64 emulator — 500-page synthetic graph. Comparing Graph Load
Interactive Write Latency (during Phase 3)
SAF I/O Overhead (ContentProvider vs direct File read)Measures Binder IPC cost added by ContentResolver per readFile() call.
|
- BLOCKER: remove runCatching from withTimeout in GraphLoaderIndexBatchingTest to
prevent CancellationException from breaking structured concurrency
- CRITICAL: snapshot recentPageUuids under recentMutex then do DB work outside lock
in refreshRecentPages to eliminate starvation of addToRecent
- fix sanitizeErrorMessage: apply in loadGraph catch blocks (bypassed before);
add Windows path regex C:\... pattern
- fix QueryPlanAuditTest selectJournalPages: audit SQL was COALESCE phantom query,
not the actual ORDER BY journal_date DESC used in SteleDatabase.sq
- change countUnloadedPages() from Flow to suspend fun to fix reactive/one-shot
contract mismatch between SQL and InMemory implementations
- add getPagesByNames/getJournalPagesByDates overrides to DatalogPageRepository
to avoid O(graph) getAllPagesSnapshot() fallback
- improve test determinism: replace delay(1_000) with delay(200); replace polling
while loop with flow.first{} in LargeGraphWarmStartCrashTest
- add test for ExportService oldest==null guard branch (uEJR04)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ace with loadGraph clear() setGraphPath previously launched refreshRecentPages() concurrently with loadGraph(), which calls pageRepository.clear() early in the load sequence. The getPageByUuid lookups in refreshRecentPages would silently return null while clear() was in flight, leaving recents empty until the user navigated again. Fix: refreshRecentPages() is now called inside onPhase1Complete, after Phase 1 has populated the new graph's DB with page metadata. Addresses Copilot PR review comment on StelekitViewModel.kt:442. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of the 8 030-page "SteleKit keeps stopping" crash: the ViewModel
scope (App.kt) had no CoroutineExceptionHandler. Under large-graph heap
pressure an OutOfMemoryError is thrown in whichever coroutine allocates
next — the unguarded standing collectors (observeSpecialPages, PageNameIndex)
and fire-and-forget launches (ensureTodayJournal) on that scope let it reach
the default uncaught-exception handler, which kills the process on Android
but only logs on desktop. That is why the crash never reproduced on desktop
and why prior mitigations inside GraphLoader's own scopes did not help.
The hang half came from observeSpecialPages keeping a standing collector on
getAllPages(): every DB write invalidates that query, so graph import/
reconcile re-materialized all 8 030 Page objects per write burst (GC thrash),
and cachedAllPages pinned a full-table snapshot for the ViewModel's lifetime.
Fixes:
surfaces uncaught Throwables as fatalError (recovery screen) instead of
crashing.
dedicated selectFavoritePages query (PageRepository.getFavoritePages, with
a conflated SqlDelight override); recents resolve via bounded point lookups;
cachedAllPages is removed.
Throwable, degrading suggestions instead of dying (its first 8 000-name trie
build lands ~500 ms after launch — exactly the observed crash window).
Tests (fail before the fix — they reproduce the crash by recording Throwables
that reach the default uncaught-exception handler):
in ensureTodayJournal, and a guard that the ViewModel keeps at most one
standing getAllPages() subscription.
cache kept, warm reconcile + Phase 3 indexing) completes with no uncaught
Throwables.
https://claude.ai/code/session_01Avi2KYmMrxLLFCiMBJHwpb