feat: Refactor agent and hub sync for better readability by maximbigler · Pull Request #208 · OrcaCD/orca-cd

maximbigler · 2026-06-22T17:49:09Z

Type of change

🐛 Bug fix
🚀 New feature
❓ Other (please specify): Refactor

Description

Fixes #172
Fixes #211

Refactors the repository-sync and application-deploy flow in internal/hub/applications to make it readable and to remove duplicated/divergent logic. No new feature; behavior is preserved except for the intentional changes called out below.

Application-level deploy state machine

Introduced updateApplicationStatus as the single chokepoint for every application status write + SSE publish; the four mark* helpers are now thin wrappers over it. Fixes a bug where markDeploymentInProgress published an SSE event twice on a DB error.
Extracted applyDeployResult (+ deployNotifications) — one shared interpreter of an agent's deploy outcome (err / nil result / !Success / success), replacing two divergent copies that previously lived in the sync path and the manual-deploy path.
Split Deployer.StartDeploy into dispatchDeploy (transport-only) and StartDeploy (dispatch + mark in-progress). The sync path uses dispatchDeploy via DeployAndWait because processSyncJob already owns the in-progress transition, so each path now marks the application in-progress exactly once instead of twice.
Rewrote processSyncJob to use linear control flow instead of the success bool + defer pattern.
Bug fix: terminal status writes now use context.Background() instead of the job context. A deploy that hit the 3-minute job timeout previously wrote its failure status on an already-cancelled context, which could leave the application stuck displaying Syncing.

Unified sync entry points

Added SyncApplications, the single path that all four sync triggers funnel through (polling, manual repo sync, webhooks, GitHub Actions). It performs consistent repository-status bookkeeping (Syncing → Success/Failed).
Added CommitResolver with StaticCommit (webhooks and GitHub Actions already carry the pushed SHA) and LatestCommit (polling and commit-less generic webhooks), separating "where the commit comes from" from the shared sync logic.
SyncRepository, WebhookHandler, handleGenericWebhook, and the GitHub Actions handler now delegate to SyncApplications; the bespoke enqueueGenericApps helper and the inline repository-status updates were removed.

Documentation

Doc comments make the two-layer model explicit (repository-level sync vs. application-level deploy) and clarify that a repository's SyncStatus reflects whether the sync was dispatched (commits resolved, jobs enqueued), not whether every application finished deploying — per-application progress lives on each Application.

Additional context

Intentional behavior changes (please review)

A sync-path transport failure (agent dropped / nil result) now marks the application OutOfSync only, instead of also Unhealthy — matching the manual-deploy path. A dropped connection does not mean the running containers are unhealthy.
Webhooks and GitHub Actions now perform the full Syncing → Success/Failed repository bookkeeping like the poller. Previously webhooks marked Success only (never Failed) and GitHub Actions did no repository-status bookkeeping at all. For these synchronous handlers the Syncing state is effectively instantaneous; it is only visibly observable on the polling path during the network commit lookup.
Generic webhooks now skip applications with an empty branch (consistent with the poller) and mark the repository Failed if commit resolution fails, instead of silently enqueuing an empty commit.

Out of scope (planned follow-up)

Removing the Default* package-level globals in favor of dependency injection, which also requires reordering the server.go bootstrap. Deliberately left for a separate PR to keep this diff focused and reviewable.

Testing

applications, routes, and websocket suites pass; gofmt / go vet clean. The only failing tests are the Docker-daemon-dependent ones in internal/agent/docker, which this branch does not touch.

codecov · 2026-06-24T15:05:26Z

Codecov Report

❌ Patch coverage is 69.35484% with 133 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
backend/internal/hub/routes/applications.go	57.14%	14 Missing and 7 partials ⚠️
backend/internal/agent/websocket.go	73.77%	12 Missing and 4 partials ⚠️
backend/internal/hub/websocket/handler.go	54.54%	13 Missing and 2 partials ⚠️
backend/internal/hub/db/name_hash.go	0.00%	14 Missing ⚠️
backend/internal/agent/docker/deploy.go	53.84%	7 Missing and 5 partials ⚠️
backend/internal/hub/websocket/deploy.go	73.33%	7 Missing and 5 partials ⚠️
backend/internal/hub/applications/sync.go	73.17%	7 Missing and 4 partials ⚠️
backend/internal/hub/routes/webhooks.go	52.94%	4 Missing and 4 partials ⚠️
backend/internal/hub/applications/repo_sync.go	74.07%	6 Missing and 1 partial ⚠️
backend/cmd/hub/key_rotate.go	0.00%	4 Missing and 1 partial ⚠️
... and 5 more

📢 Thoughts on this report? Let us know!

…c.go

When a sync finds the compose file unchanged, also set SyncStatus to Synced, update LastSyncedAt, and clear LastSyncError instead of only recording the new commit. Previously a no-op sync left a previously out-of-sync application looking failed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

markRepositorySuccess passed LastSyncError: nil, which GORM skips on a struct Updates, leaving the old error in place. Use Select to force the column to NULL. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Application names are encrypted with a random nonce, so the old uniqueness check loaded and decrypted every row and compared in code, enforcing a single global namespace. Replace it with a keyed blind-index name_hash column: - crypto.BlindIndex derives a deterministic HMAC-SHA256 of the name keyed off APP_SECRET, so the stored hash does not leak the plaintext. - applicationNameTaken now does an indexed lookup scoped to the agent, allowing the same name on different agents. - Migration 000023 adds name_hash with a partial unique index on (agent_id, name_hash). - BackfillNameHashes populates the column on startup and recomputes it after key rotation (the blind index is keyed off APP_SECRET). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The hub bulk-set every application to Healthy when an agent connected, masking a real Unhealthy from a failed deploy. Have the agent inspect its containers and report actual health back in a single message instead: - Add ApplicationStatusReport proto message (one report covers all apps). - Agent derives health per application from container state via the orca-cd.application-id label and sends the report on AgentSettings. - Hub applies the report scoped to the reporting agent (an agent can only update its own applications) and no longer assumes Healthy on connect. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add tests for the previously-uncovered code introduced in this PR: - hub RemoveApplication / ResolveDeleteResult: round trip, agent offline, context cancellation, and unknown request id. - agent docker Remove: compose down + directory removal, unsafe name rejection, down error, and missing deployments dir. - agent docker ApplicationHealth: no containers and unreachable daemon both report unknown; aggregateHealth table test. - agent executeDelete: success, nil deployer, and remove error. - processSyncJob: no-op sync now asserts Synced/LastSyncedAt/cleared error, plus a fetch-error failure path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

alex289 · 2026-06-29T21:35:20Z

Manual deploy on an offline agent now returns 500 (was 409), and the test no longer covers it

DeployApplicationHandler (routes/applications.go:442) calls fire-and-forget TriggerApplicationDeploy, which does return an error when hub.Send fails — so the handler hits c.JSON(http.StatusInternalServerError, …). The old code returned 409 Conflict ("agent is not connected"), which is the correct semantic and is exactly what the new DeleteApplicationHandler still does (ErrAgentOffline → 409). So delete and deploy are now inconsistent for the same offline condition.

Worse, TestDeployApplicationHandler_AgentUnavailable was changed to use a bare &stubRouteDeployer{} (no startErr), so it returns nil and the test asserts 202 with a comment claiming "Agent-not-connected handling is still a TODO." The test no longer exercises the offline path at all — its name is now misleading. Either restore a stub that simulates the offline error and assert 409, or rename the test to reflect what it checks.

alex289 · 2026-06-29T21:35:46Z

Delete timeout mismatch: hub 30s vs. agent 5min → state divergence

agentDeleteTimeout = 30s (hub) but the agent's executeDelete uses deploymentTimeout = 5 * time.Minute for compose down. If teardown takes 30s–5min, the hub times out (504), keeps its record, and tells the user it failed — but the agent goes on to complete the removal (containers + deployment dir gone). The hub now references an application the agent has destroyed: the inverse of the "don't lose track of running containers" goal this flow is designed around. Consider aligning the hub wait with the agent's execution timeout, or making the agent abort teardown when its delete context is cancelled.

…r.go Co-authored-by: Alex <me@alexanderkonietzko.com>

…rcaCD/orca-cd into refactor/deploy-state-machine

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Co-authored-by: Alex <me@alexanderkonietzko.com>

…rcaCD/orca-cd into refactor/deploy-state-machine

github-actions Bot added backend hub labels Jun 22, 2026

maximbigler self-assigned this Jun 24, 2026

maximbigler added the agent label Jun 24, 2026

feat: Refactor agent and hub sync for better readability

2b3b396

maximbigler force-pushed the refactor/deploy-state-machine branch from 6a3927e to 2b3b396 Compare June 24, 2026 14:58

maximbigler added 9 commits June 24, 2026 17:40

feat: Handle error on websocket connection

17e92ec

fix: Formatting in application_deployer.go

48cc348

fix: Deployer test

5c1a893

feat: Add test for websocket/deploy.go

ff14bc1

docs: Remove ai comments

bbf3b98

feat: Add function for to mark application failed in applications/syn…

e72524d

…c.go

fix: Formatting in sync.go

070fd0e

feat: Trigger deploy when app is created

54ffa86

feat: Sync app status on websocket connection state

89d2834

maximbigler linked an issue Jun 26, 2026 that may be closed by this pull request

fix: Sync issues #172

Open

8 tasks

maximbigler added 3 commits June 26, 2026 23:15

feat: Remove application also on agent side

b3e0b49

feat: Unique application names

068ddf7

feat: Add error banner for application

7af1b26

github-actions Bot added the frontend label Jun 26, 2026

fix: Formatting in applications page

8974362

maximbigler marked this pull request as ready for review June 26, 2026 22:01

alex289 requested a review from timokoessler June 26, 2026 23:34

This comment has been minimized.

Sign in to view

timokoessler reviewed Jun 27, 2026

View reviewed changes

Comment thread frontend/src/routes/_authenticated/applications/$id.index.tsx Outdated

Comment thread backend/internal/hub/websocket/handler.go Outdated

Comment thread backend/internal/hub/websocket/delete.go Outdated

alex289 reviewed Jun 27, 2026

View reviewed changes

Comment thread backend/internal/hub/models/applications.go

timokoessler reviewed Jun 27, 2026

View reviewed changes

Comment thread backend/internal/hub/websocket/deploy.go Outdated

timokoessler reviewed Jun 27, 2026

View reviewed changes

Comment thread backend/internal/hub/websocket/deploy.go Outdated

timokoessler reviewed Jun 27, 2026

View reviewed changes

Comment thread backend/internal/agent/docker/deploy.go

maximbigler and others added 6 commits June 29, 2026 21:37

fix: Silent no-op when DefaultQueue is nil

ce89491

fix: Clear stale repository sync error on success

2ef905e

markRepositorySuccess passed LastSyncError: nil, which GORM skips on a struct Updates, leaving the old error in place. Use Select to force the column to NULL. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

maximbigler requested review from alex289 and timokoessler June 29, 2026 21:13