Skip to content

feat: Refactor agent and hub sync for better readability#208

Open
maximbigler wants to merge 40 commits into
mainfrom
refactor/deploy-state-machine
Open

feat: Refactor agent and hub sync for better readability#208
maximbigler wants to merge 40 commits into
mainfrom
refactor/deploy-state-machine

Conversation

@maximbigler

@maximbigler maximbigler commented Jun 22, 2026

Copy link
Copy Markdown
Member

Type of change

  • 🐛 Bug fix
  • 🚀 New feature
  • ❓ Other (please specify): Refactor

Description

Fixes #172
Fixes #211

Refactors the repository-sync and application-deploy flow in internal/hub/applications to make it readable and to remove duplicated/divergent logic. No new feature; behavior is preserved except for the intentional changes called out below.

Application-level deploy state machine

  • Introduced updateApplicationStatus as the single chokepoint for every application status write + SSE publish; the four mark* helpers are now thin wrappers over it. Fixes a bug where markDeploymentInProgress published an SSE event twice on a DB error.
  • Extracted applyDeployResult (+ deployNotifications) — one shared interpreter of an agent's deploy outcome (err / nil result / !Success / success), replacing two divergent copies that previously lived in the sync path and the manual-deploy path.
  • Split Deployer.StartDeploy into dispatchDeploy (transport-only) and StartDeploy (dispatch + mark in-progress). The sync path uses dispatchDeploy via DeployAndWait because processSyncJob already owns the in-progress transition, so each path now marks the application in-progress exactly once instead of twice.
  • Rewrote processSyncJob to use linear control flow instead of the success bool + defer pattern.
  • Bug fix: terminal status writes now use context.Background() instead of the job context. A deploy that hit the 3-minute job timeout previously wrote its failure status on an already-cancelled context, which could leave the application stuck displaying Syncing.

Unified sync entry points

  • Added SyncApplications, the single path that all four sync triggers funnel through (polling, manual repo sync, webhooks, GitHub Actions). It performs consistent repository-status bookkeeping (Syncing → Success/Failed).
  • Added CommitResolver with StaticCommit (webhooks and GitHub Actions already carry the pushed SHA) and LatestCommit (polling and commit-less generic webhooks), separating "where the commit comes from" from the shared sync logic.
  • SyncRepository, WebhookHandler, handleGenericWebhook, and the GitHub Actions handler now delegate to SyncApplications; the bespoke enqueueGenericApps helper and the inline repository-status updates were removed.

Documentation

  • Doc comments make the two-layer model explicit (repository-level sync vs. application-level deploy) and clarify that a repository's SyncStatus reflects whether the sync was dispatched (commits resolved, jobs enqueued), not whether every application finished deploying — per-application progress lives on each Application.

Additional context

Intentional behavior changes (please review)

  1. A sync-path transport failure (agent dropped / nil result) now marks the application OutOfSync only, instead of also Unhealthy — matching the manual-deploy path. A dropped connection does not mean the running containers are unhealthy.
  2. Webhooks and GitHub Actions now perform the full Syncing → Success/Failed repository bookkeeping like the poller. Previously webhooks marked Success only (never Failed) and GitHub Actions did no repository-status bookkeeping at all. For these synchronous handlers the Syncing state is effectively instantaneous; it is only visibly observable on the polling path during the network commit lookup.
  3. Generic webhooks now skip applications with an empty branch (consistent with the poller) and mark the repository Failed if commit resolution fails, instead of silently enqueuing an empty commit.

Out of scope (planned follow-up)

Removing the Default* package-level globals in favor of dependency injection, which also requires reordering the server.go bootstrap. Deliberately left for a separate PR to keep this diff focused and reviewable.

Testing

applications, routes, and websocket suites pass; gofmt / go vet clean. The only failing tests are the Docker-daemon-dependent ones in internal/agent/docker, which this branch does not touch.

@maximbigler maximbigler force-pushed the refactor/deploy-state-machine branch from 6a3927e to 2b3b396 Compare June 24, 2026 14:58
@maximbigler maximbigler linked an issue Jun 26, 2026 that may be closed by this pull request
8 tasks
@maximbigler maximbigler marked this pull request as ready for review June 26, 2026 22:01
@alex289 alex289 requested a review from timokoessler June 26, 2026 23:34
@alex289

This comment has been minimized.

Comment thread frontend/src/routes/_authenticated/applications/$id.index.tsx Outdated
Comment thread backend/internal/hub/websocket/handler.go Outdated
Comment thread backend/internal/hub/websocket/delete.go Outdated
Comment thread backend/internal/hub/models/applications.go
Comment thread backend/internal/hub/websocket/deploy.go Outdated
Comment thread backend/internal/hub/websocket/deploy.go Outdated
Comment thread backend/internal/agent/docker/deploy.go
maximbigler and others added 6 commits June 29, 2026 21:37
When a sync finds the compose file unchanged, also set SyncStatus to Synced,
update LastSyncedAt, and clear LastSyncError instead of only recording the new
commit. Previously a no-op sync left a previously out-of-sync application
looking failed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
markRepositorySuccess passed LastSyncError: nil, which GORM skips on a struct
Updates, leaving the old error in place. Use Select to force the column to NULL.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Application names are encrypted with a random nonce, so the old uniqueness check
loaded and decrypted every row and compared in code, enforcing a single global
namespace. Replace it with a keyed blind-index name_hash column:

- crypto.BlindIndex derives a deterministic HMAC-SHA256 of the name keyed off
  APP_SECRET, so the stored hash does not leak the plaintext.
- applicationNameTaken now does an indexed lookup scoped to the agent, allowing
  the same name on different agents.
- Migration 000023 adds name_hash with a partial unique index on
  (agent_id, name_hash).
- BackfillNameHashes populates the column on startup and recomputes it after key
  rotation (the blind index is keyed off APP_SECRET).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The hub bulk-set every application to Healthy when an agent connected, masking a
real Unhealthy from a failed deploy. Have the agent inspect its containers and
report actual health back in a single message instead:

- Add ApplicationStatusReport proto message (one report covers all apps).
- Agent derives health per application from container state via the
  orca-cd.application-id label and sends the report on AgentSettings.
- Hub applies the report scoped to the reporting agent (an agent can only update
  its own applications) and no longer assumes Healthy on connect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add tests for the previously-uncovered code introduced in this PR:

- hub RemoveApplication / ResolveDeleteResult: round trip, agent offline,
  context cancellation, and unknown request id.
- agent docker Remove: compose down + directory removal, unsafe name rejection,
  down error, and missing deployments dir.
- agent docker ApplicationHealth: no containers and unreachable daemon both
  report unknown; aggregateHealth table test.
- agent executeDelete: success, nil deployer, and remove error.
- processSyncJob: no-op sync now asserts Synced/LastSyncedAt/cleared error, plus
  a fetch-error failure path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread frontend/src/routes/_authenticated/applications/$id/details.tsx Outdated
Comment thread backend/internal/hub/server.go Outdated
Comment thread backend/internal/agent/docker/status.go
Comment thread backend/internal/hub/crypto/crypto.go Outdated
@alex289

This comment was marked as outdated.

@alex289

alex289 commented Jun 29, 2026

Copy link
Copy Markdown
Member

Manual deploy on an offline agent now returns 500 (was 409), and the test no longer covers it

DeployApplicationHandler (routes/applications.go:442) calls fire-and-forget TriggerApplicationDeploy, which does return an error when hub.Send fails — so the handler hits c.JSON(http.StatusInternalServerError, …). The old code returned 409 Conflict ("agent is not connected"), which is the correct semantic and is exactly what the new DeleteApplicationHandler still does (ErrAgentOffline → 409). So delete and deploy are now inconsistent for the same offline condition.

Worse, TestDeployApplicationHandler_AgentUnavailable was changed to use a bare &stubRouteDeployer{} (no startErr), so it returns nil and the test asserts 202 with a comment claiming "Agent-not-connected handling is still a TODO." The test no longer exercises the offline path at all — its name is now misleading. Either restore a stub that simulates the offline error and assert 409, or rename the test to reflect what it checks.

@alex289

alex289 commented Jun 29, 2026

Copy link
Copy Markdown
Member

Delete timeout mismatch: hub 30s vs. agent 5min → state divergence

agentDeleteTimeout = 30s (hub) but the agent's executeDelete uses deploymentTimeout = 5 * time.Minute for compose down. If teardown takes 30s–5min, the hub times out (504), keeps its record, and tells the user it failed — but the agent goes on to complete the removal (containers + deployment dir gone). The hub now references an application the agent has destroyed: the inverse of the "don't lose track of running containers" goal this flow is designed around. Consider aligning the hub wait with the agent's execution timeout, or making the agent abort teardown when its delete context is cancelled.

Comment thread backend/internal/hub/deployer/application_deployer.go Outdated
Comment thread backend/internal/hub/deployer/application_deployer.go
Comment thread frontend/src/routes/_authenticated/applications/$id/details.tsx Fixed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Normalize the compose name and skip check fix: Sync issues

4 participants