Skip to content

fix: synchronize CoalesceService slot lifecycle to prevent dropped reruns#486

Open
JuliusHenke wants to merge 3 commits into
masterfrom
codex/propose-fix-for-coalesceservice-vulnerability
Open

fix: synchronize CoalesceService slot lifecycle to prevent dropped reruns#486
JuliusHenke wants to merge 3 commits into
masterfrom
codex/propose-fix-for-coalesceservice-vulnerability

Conversation

@JuliusHenke
Copy link
Copy Markdown
Member

Motivation

  • A recent refactor introduced a race in CoalesceService where submissions near task completion could create an orphan nextFuture (never completed) or allow duplicate concurrent executions for the same key.
  • This affects cache rebuilding and relay-detail recomputation jobs and can cause stale caches, hung CompletableFuture.allOf callers, duplicate DB/cache work, and extra lock contention.

Description

  • Serialize submitAsync state transitions (running, rerunRequested, nextFuture) by synchronizing on slot.monitor so the initial-run path sets running and prepares the first future while holding the lock.
  • Start the async run loop only after the monitor-held state is established to avoid submitters observing partially-shutdown state.
  • Move the worker-loop rerun check and shutdown (running=false + slots.remove) into the same synchronized(slot.monitor) block and promote/clear nextFuture there to prevent orphaned futures and stale-slot removal races.
  • Change is limited to backend/src/main/kotlin/org/tormap/service/CoalesceService.kt and preserves the intended coalescing semantics.

Testing

  • Attempted ./gradlew :backend:test --tests "org.tormap.service.CoalesceServiceTest" at the repo root failed because ./gradlew was not present in that path. (failed)
  • Ran cd backend && ./gradlew test --tests "org.tormap.service.CoalesceServiceTest" which failed due to a Gradle/toolchain environment error (25.0.2), so unit tests could not be executed in this environment. (failed)

Codex Task

Copilot AI review requested due to automatic review settings May 20, 2026 16:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts the concurrency control in CoalesceService to better synchronize per-key slot lifecycle and prevent rerun futures from being dropped or orphaned during task completion races.

Changes:

  • Synchronizes submitAsync state transitions on slot.monitor before starting the worker loop.
  • Moves rerun promotion and slot shutdown/removal into a monitor-held critical section in the worker loop.
  • Refactors the “first run” path to create/return its CompletableFuture only after the slot state is established.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread backend/src/main/kotlin/org/tormap/service/CoalesceService.kt Outdated
Comment thread backend/src/main/kotlin/org/tormap/service/CoalesceService.kt
Agent-Logs-Url: https://github.com/TorMap/tormap/sessions/3ff79a36-5735-4758-a207-d2ef45467cbb

Co-authored-by: JuliusHenke <23460202+JuliusHenke@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI commented May 20, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • 113.114.104.172.in-addr.arpa
    • Triggering command: /usr/lib/jvm/temurin-17-jdk-amd64/bin/java /usr/lib/jvm/temurin-17-jdk-amd64/bin/java -Dorg.gradle.internal.worker.tmpdir=/home/REDACTED/work/tormap/tormap/backend/build/tmp/test/work @/home/REDACTED/.gradle/.tmp/gradle-worker-classpath4577654228264605735txt -Xmx512m -Dfile.encoding=UTF-8 -Duser.country -Duser.language=en -Duser.variant -ea worker.org.gradle.process.internal.worker.GradleWorkerMain &#39;Gradle Test Executor 2&#39; (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

@JuliusHenke
Copy link
Copy Markdown
Member Author

@copilot perform a final review to determine if there are any issues left

Copy link
Copy Markdown
Contributor

Copilot AI commented May 20, 2026

@copilot perform a final review to determine if there are any issues left

Final review done at 02ebc1c.

Current state:

  • The earlier CI compile failure on an older commit was due to continue in an inline lambda and is already resolved in the latest commit.
  • Latest backend workflow run for this head SHA is still queued (actions/runs/26177660781), so no new CI result yet.
  • One issue still appears intermittently in focused local runs: CoalesceServiceTest > never runs the same key concurrently even under repeated submissions can fail (expected 2, got 1).

So I still see that intermittent test-stability concern as the remaining issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants