feat: Refactor agent and hub sync for better readability#208
feat: Refactor agent and hub sync for better readability#208maximbigler wants to merge 40 commits into
Conversation
6a3927e to
2b3b396
Compare
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
This comment has been minimized.
This comment has been minimized.
When a sync finds the compose file unchanged, also set SyncStatus to Synced, update LastSyncedAt, and clear LastSyncError instead of only recording the new commit. Previously a no-op sync left a previously out-of-sync application looking failed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
markRepositorySuccess passed LastSyncError: nil, which GORM skips on a struct Updates, leaving the old error in place. Use Select to force the column to NULL. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Application names are encrypted with a random nonce, so the old uniqueness check loaded and decrypted every row and compared in code, enforcing a single global namespace. Replace it with a keyed blind-index name_hash column: - crypto.BlindIndex derives a deterministic HMAC-SHA256 of the name keyed off APP_SECRET, so the stored hash does not leak the plaintext. - applicationNameTaken now does an indexed lookup scoped to the agent, allowing the same name on different agents. - Migration 000023 adds name_hash with a partial unique index on (agent_id, name_hash). - BackfillNameHashes populates the column on startup and recomputes it after key rotation (the blind index is keyed off APP_SECRET). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The hub bulk-set every application to Healthy when an agent connected, masking a real Unhealthy from a failed deploy. Have the agent inspect its containers and report actual health back in a single message instead: - Add ApplicationStatusReport proto message (one report covers all apps). - Agent derives health per application from container state via the orca-cd.application-id label and sends the report on AgentSettings. - Hub applies the report scoped to the reporting agent (an agent can only update its own applications) and no longer assumes Healthy on connect. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add tests for the previously-uncovered code introduced in this PR: - hub RemoveApplication / ResolveDeleteResult: round trip, agent offline, context cancellation, and unknown request id. - agent docker Remove: compose down + directory removal, unsafe name rejection, down error, and missing deployments dir. - agent docker ApplicationHealth: no containers and unreachable daemon both report unknown; aggregateHealth table test. - agent executeDelete: success, nil deployer, and remove error. - processSyncJob: no-op sync now asserts Synced/LastSyncedAt/cleared error, plus a fetch-error failure path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This comment was marked as outdated.
This comment was marked as outdated.
|
Manual deploy on an offline agent now returns 500 (was 409), and the test no longer covers it DeployApplicationHandler (routes/applications.go:442) calls fire-and-forget TriggerApplicationDeploy, which does return an error when hub.Send fails — so the handler hits c.JSON(http.StatusInternalServerError, …). The old code returned 409 Conflict ("agent is not connected"), which is the correct semantic and is exactly what the new DeleteApplicationHandler still does (ErrAgentOffline → 409). So delete and deploy are now inconsistent for the same offline condition. Worse, TestDeployApplicationHandler_AgentUnavailable was changed to use a bare &stubRouteDeployer{} (no startErr), so it returns nil and the test asserts 202 with a comment claiming "Agent-not-connected handling is still a TODO." The test no longer exercises the offline path at all — its name is now misleading. Either restore a stub that simulates the offline error and assert 409, or rename the test to reflect what it checks. |
|
Delete timeout mismatch: hub 30s vs. agent 5min → state divergence agentDeleteTimeout = 30s (hub) but the agent's executeDelete uses deploymentTimeout = 5 * time.Minute for compose down. If teardown takes 30s–5min, the hub times out (504), keeps its record, and tells the user it failed — but the agent goes on to complete the removal (containers + deployment dir gone). The hub now references an application the agent has destroyed: the inverse of the "don't lose track of running containers" goal this flow is designed around. Consider aligning the hub wait with the agent's execution timeout, or making the agent abort teardown when its delete context is cancelled. |
…r.go Co-authored-by: Alex <me@alexanderkonietzko.com>
…rcaCD/orca-cd into refactor/deploy-state-machine
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Alex <me@alexanderkonietzko.com>
…rcaCD/orca-cd into refactor/deploy-state-machine
Type of change
Description
Fixes #172
Fixes #211
Refactors the repository-sync and application-deploy flow in
internal/hub/applicationsto make it readable and to remove duplicated/divergent logic. No new feature; behavior is preserved except for the intentional changes called out below.Application-level deploy state machine
updateApplicationStatusas the single chokepoint for every application status write + SSE publish; the fourmark*helpers are now thin wrappers over it. Fixes a bug wheremarkDeploymentInProgresspublished an SSE event twice on a DB error.applyDeployResult(+deployNotifications) — one shared interpreter of an agent's deploy outcome (err/ nil result /!Success/ success), replacing two divergent copies that previously lived in the sync path and the manual-deploy path.Deployer.StartDeployintodispatchDeploy(transport-only) andStartDeploy(dispatch + mark in-progress). The sync path usesdispatchDeployviaDeployAndWaitbecauseprocessSyncJobalready owns the in-progress transition, so each path now marks the application in-progress exactly once instead of twice.processSyncJobto use linear control flow instead of thesuccess bool+deferpattern.context.Background()instead of the job context. A deploy that hit the 3-minute job timeout previously wrote its failure status on an already-cancelled context, which could leave the application stuck displayingSyncing.Unified sync entry points
SyncApplications, the single path that all four sync triggers funnel through (polling, manual repo sync, webhooks, GitHub Actions). It performs consistent repository-status bookkeeping (Syncing → Success/Failed).CommitResolverwithStaticCommit(webhooks and GitHub Actions already carry the pushed SHA) andLatestCommit(polling and commit-less generic webhooks), separating "where the commit comes from" from the shared sync logic.SyncRepository,WebhookHandler,handleGenericWebhook, and the GitHub Actions handler now delegate toSyncApplications; the bespokeenqueueGenericAppshelper and the inline repository-status updates were removed.Documentation
SyncStatusreflects whether the sync was dispatched (commits resolved, jobs enqueued), not whether every application finished deploying — per-application progress lives on eachApplication.Additional context
Intentional behavior changes (please review)
OutOfSynconly, instead of alsoUnhealthy— matching the manual-deploy path. A dropped connection does not mean the running containers are unhealthy.Syncing → Success/Failedrepository bookkeeping like the poller. Previously webhooks markedSuccessonly (neverFailed) and GitHub Actions did no repository-status bookkeeping at all. For these synchronous handlers theSyncingstate is effectively instantaneous; it is only visibly observable on the polling path during the network commit lookup.Failedif commit resolution fails, instead of silently enqueuing an empty commit.Out of scope (planned follow-up)
Removing the
Default*package-level globals in favor of dependency injection, which also requires reordering theserver.gobootstrap. Deliberately left for a separate PR to keep this diff focused and reviewable.Testing
applications,routes, andwebsocketsuites pass;gofmt/go vetclean. The only failing tests are the Docker-daemon-dependent ones ininternal/agent/docker, which this branch does not touch.