Commit 644147b

chore(webapp,run-engine): downgrade boundary log noise to warn
Several boundary log sites and customer-input validation paths were unconditionally logging at error level for failures the system already handles gracefully (disconnect, retry-skip, return early). This batch downgrades them to warn or counts them as metrics — visibility is preserved in stdout / OTel metrics without surfacing them as alerts.

- apiBuilder.server.ts: logBoundaryError helper — warn for AbortError and ServiceValidationError at loader/action boundary catches
- handleSocketIo.server.ts: warn for "Worker authentication failed" (system disconnects on auth failure; refs TRI-8863)
- waitpointSystem.ts: skip throw and warn when run was canceled while suspended (benign cancel-vs-resume race, no checkpoint to resume from)
- runAttemptSystem.ts: warn for failed parse/validate of customer's flushedMetadata (system already returns gracefully)
- batch-queue/index.ts: warn for non-retryable batch item failures via result.skipRetries (queue size limit exceeded, etc.)
- queryPerformanceMonitor.server.ts: slow queries are observability, not errors — warn
- timeoutDeployment.server.ts: deployment-state mismatch is a benign timeout-vs-completion race — warn
- platform.v3.server.ts: platform_client.failures_total OTel counter with {function, kind} labels replaces logger.error from BillingClient call sites; helper recordPlatformFailure(fn, kind)
- waitpointCompletionPacket.server.ts: log inner uploadError before throwing the wrapper ServiceValidationError so the underlying error context isn't lost
1 parent c69e939 commit 644147b

12 files changed

Lines changed: 414 additions & 114 deletions

Lines changed: 233 additions & 0 deletions
@@ -0,0 +1,233 @@
---
name: sentry-triage
description: Run a Sentry triage session — pull unresolved production issues, classify them, archive noise, file Linear tickets for real bugs, and close the loop after deploy. Use when asked to triage Sentry, audit errors, work through unresolved issues, do a Sentry sweep, or set up the triage flow. Triggers on phrases like "sentry triage", "audit sentry errors", "unresolved errors", "what should we fix in Sentry".
allowed-tools: Read, Write, Edit, Bash, Glob, Grep, mcp__linear-server__save_issue, mcp__linear-server__list_projects
---

# Sentry Triage Loop

A repeatable process for working down the unresolved-Sentry-issues backlog. Output is: code fixes for noisy log sites, Linear tickets in **Platform bugs** for real bugs, and a clean Sentry inbox.

The four lenses for every issue:

1. **Ignore / archive non-errors** — system-handled validation, expected races, customer-side issues
2. **Update log level error → warn** — for boundary catches that already handle the failure
3. **Investigate real bugs / perf issues / capacity signals** — these get Linear tickets
4. **Improve the loop** — patterns recur; codify them so future triage is faster

## Constants

| Thing | Value |
|---|---|
| Sentry org slug | `triggerdev` |
| Sentry project slug (cloud) | `trigger-cloud` |
| Sentry project numeric ID | `4509723301642240` |
| Linear team | `Triggerdotdev` |
| Linear project | `Platform bugs` |
| Linear-Sentry App install UUID | `51f89c4b-a9c5-4684-bfd1-d9a24e1e222f` |
| Sentry token | `~/.sentryclirc` (must have `event:admin` scope) |

## Step 1 — Pull and enrich

Pull the top 100 unresolved production issues by frequency over the last 7 days. The org-scoped issues endpoint accepts `statsPeriod=7d` (the project-scoped one only allows `24h` or `14d`).

```bash
# Read the auth token from the sentry-cli config
TOKEN=$(grep '^token=' ~/.sentryclirc | cut -d= -f2)
curl -sG "https://sentry.io/api/0/organizations/triggerdev/issues/" \
  -H "Authorization: Bearer $TOKEN" \
  --data-urlencode "project=4509723301642240" \
  --data-urlencode "query=is:unresolved environment:production" \
  --data-urlencode "statsPeriod=7d" \
  --data-urlencode "sort=freq" \
  --data-urlencode "limit=100" \
  > /tmp/sentry-triage-raw.json
```

Filter out issues that already have a Linear link (so re-runs skip what's been triaged):

```bash
# An issue is already-triaged if it has any external-issue
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://sentry.io/api/0/issues/<ID>/external-issues/" | jq 'length > 0'
```

Format into a triage file at `~/.work-journal/sentry-triage-YYYY-MM-DD.md` with one block per issue:

```
[_] TRIGGER-CLOUD-XX <count> events · <users>u · first <date> · last <date>
**<ErrorType>**: <title>
culprit: `<culprit>`
https://triggerdev.sentry.io/issues/<id>/
notes:
```

`[_]` is the tag slot; the user fills it in by editing the file. **Do not** auto-tag items the user hasn't reviewed — see "Bulk-action gotcha" below.
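
The block format above could be rendered from the saved JSON with a small formatter. This is a minimal sketch, not part of the skill: the `SentryIssue` interface is a hypothetical subset of the issue payload, and its field names (`shortId`, `count`, `userCount`, `firstSeen`, `lastSeen`, `metadata.type`) are assumptions to check against `/tmp/sentry-triage-raw.json`.

```typescript
// Hypothetical subset of a Sentry issue object; verify field names against the saved JSON.
interface SentryIssue {
  id: string;
  shortId: string;
  title: string;
  culprit: string;
  count: string; // assumed to arrive as a string
  userCount: number;
  firstSeen: string; // ISO 8601 timestamp
  lastSeen: string; // ISO 8601 timestamp
  metadata?: { type?: string };
}

// Render one untagged triage block. The "[_]" slot is left for the user to fill in.
function formatTriageBlock(issue: SentryIssue): string {
  const day = (iso: string): string => iso.slice(0, 10); // keep only YYYY-MM-DD
  const errorType = issue.metadata?.type ?? "Error";
  return [
    `[_] ${issue.shortId} ${issue.count} events · ${issue.userCount}u · first ${day(issue.firstSeen)} · last ${day(issue.lastSeen)}`,
    `**${errorType}**: ${issue.title}`,
    `culprit: \`${issue.culprit}\``,
    `https://triggerdev.sentry.io/issues/${issue.id}/`,
    `notes:`,
  ].join("\n");
}
```

The formatter always emits `[_]`, which keeps tagging a manual step.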

For high-volume items where the `culprit` looks misleading, also pull the latest event stack and surface the top in-app frame. The events list endpoint doesn't include `.context` per item — fetch each event individually:

```bash
# Summarize the latest event: request URL, message, exception type, top three in-app frames
curl -s -H "Authorization: Bearer $TOKEN" "https://sentry.io/api/0/issues/<ID>/events/latest/" \
  | jq '{
      title,
      url: ([.entries[]? | select(.type=="request") | .data.url] | first),
      msg: ([.entries[]? | select(.type=="message") | .data.formatted] | first),
      exceptionType: ([.entries[]? | select(.type=="exception") | .data.values[0].type] | first),
      topInApp: ([.entries[]? | select(.type=="exception") | .data.values[0].stacktrace.frames | reverse | map(select(.inApp==true))] | .[0] // [] | .[0:3]),
      extras: .context
    }'
```

## Step 2 — Walk the file with the user

Tag each issue: `[r]` real, `[n]` noise, `[u]` unclear, `[s]` skip. The user makes the call. Surface patterns as suggestions; never act on `[u]` or `[_]` items unilaterally.
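
Reading the tags back out of the triage file might look like the sketch below. The function names are hypothetical, and it assumes each block's first line starts with the tag slot followed by the shortId, as in the Step 1 format. Only `[n]` and `[r]` items are returned as actionable; `[u]`, `[s]`, and `[_]` are deliberately left out.

```typescript
type Tag = "r" | "n" | "u" | "s" | "_";

// Group issue shortIds by tag, reading the "[x] SHORT-ID ..." header line of each block.
function parseTriageTags(fileText: string): Record<Tag, string[]> {
  const groups: Record<Tag, string[]> = { r: [], n: [], u: [], s: [], _: [] };
  const header = /^\[([rnus_])\]\s+(\S+)/;
  for (const line of fileText.split("\n")) {
    const match = header.exec(line);
    if (match) groups[match[1] as Tag].push(match[2]);
  }
  return groups;
}

// Only user-confirmed tags map to an action; unclear, skipped, and untagged items never do.
function actionableItems(groups: Record<Tag, string[]>): { archive: string[]; ticket: string[] } {
  return { archive: groups.n, ticket: groups.r };
}
```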

### The three patterns to recognize

**Pattern A — Pure validation (no underlying error)**: e.g. `ServiceValidationError: Snapshot ID doesn't match the latest snapshot`. The throw IS the error; nothing's wrapped. Right call: archive + filter at SDK or downgrade log level.

**Pattern B — Wrapper around a real error**: e.g. `Error in loader` boundary catches that wrap an inner SVE/AbortError/etc. Check whether the inner error is *also* captured separately (look for a `logger.error("...", { error: filesError })` adjacent to the throw — `Logger.onError` extracts `args[].error` and calls `captureException(error)` so the inner becomes its own typed Sentry issue). If yes, archive the wrapper. If no, *first* add a `logger.error("...", { error: innerError })` so we don't lose the signal, then archive the wrapper.

**Pattern C — Real bug / perf issue / capacity signal**: needs investigation. File a Linear ticket (Step 4).

### The boundary-log realization

Most high-volume Sentry "errors" are actually wrapper logs at boundary catches (`logger.error("Error in loader/action", { error })` in `apiBuilder.server.ts`, `logger.error("Batch item processing failed after all attempts", ...)`, etc.). The boundary catches everything regardless of type — including expected things like aborts and validation errors.

The fix is at the catch site itself: route to `logger.warn` when the inner error is a known-noise type, OR use a signal the callback already provides (e.g. `result.skipRetries` for batch items). Don't try to filter at the SDK by message — that fights the boundary.
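
Routing by inner error type at the catch site could be sketched roughly as follows. This is not the actual `logBoundaryError` helper from `apiBuilder.server.ts`: the `LogSink` interface, the `KNOWN_NOISE` list, and the name-based check are assumptions made for illustration.

```typescript
// Hypothetical logger shape; the real logger lives in logger.server.ts.
interface LogSink {
  warn(message: string, fields: Record<string, unknown>): void;
  error(message: string, fields: Record<string, unknown>): void;
}

// Error names treated as already-handled noise (assumed list, matched by name).
const KNOWN_NOISE = ["AbortError", "ServiceValidationError"];

function isKnownNoise(error: unknown): boolean {
  return error instanceof Error && KNOWN_NOISE.indexOf(error.name) !== -1;
}

// Log at warn for known noise, at error for everything else; returns the level used.
function logBoundaryErrorSketch(logger: LogSink, message: string, error: unknown): "warn" | "error" {
  const level = isKnownNoise(error) ? "warn" : "error";
  logger[level](message, { error });
  return level;
}
```

Anything not on the list, including non-`Error` values, still goes to `error`, so unknown failures keep reaching Sentry.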

### Sentry SDK filtering — what works

`ignoreErrors` matches against the formatted `"Type: message"` string, not just `Type`. So `/^ServiceValidationError$/` does NOT match (event message is `"ServiceValidationError: <message>"`). Use `/^ServiceValidationError(?::|$)/` instead. **Verify with `apps/webapp/scripts/test-sentry-filter.mts`** — it captures four error variants tagged with a unique run ID, then queries Sentry to confirm which arrived.
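
The difference between the two patterns can be checked locally with plain `RegExp.test`, without involving the SDK. The sample strings below are illustrative; only the two regular expressions come from the text above.

```typescript
// Formatted event string in the "Type: message" shape described above.
const formattedEvent = "ServiceValidationError: Snapshot ID doesn't match the latest snapshot";

const typeOnly = /^ServiceValidationError$/; // requires end of string right after the type
const typeThenColonOrEnd = /^ServiceValidationError(?::|$)/; // allows ": <message>" or a bare type

const filterChecks = {
  typeOnlyMatchesFormatted: typeOnly.test(formattedEvent), // false
  fixedMatchesFormatted: typeThenColonOrEnd.test(formattedEvent), // true
  fixedMatchesBareType: typeThenColonOrEnd.test("ServiceValidationError"), // true
  fixedMatchesLookalike: typeThenColonOrEnd.test("ServiceValidationErrorish: nope"), // false
};

console.log(filterChecks);
```

A local check like this only covers the pattern itself; whether events are actually dropped still needs the test script.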

`logger.warn` does NOT flow to Sentry (only `Logger.onError` is wired to `captureException`/`captureMessage`). So downgrading the log call is sufficient — you don't also need to add it to `ignoreErrors`.

## Step 3 — Archive `[n]` items

Archive with `archived_until_escalating` (Sentry auto-resurfaces if event volume jumps significantly). **Token must have `event:admin` scope**: `event:write` silently coerces every archive to `archived_forever`.

```bash
TOKEN=$(grep '^token=' ~/.sentryclirc | cut -d= -f2)
# One id=<n> pair per issue to archive
QS="id=<id1>&id=<id2>&..."
curl -s -X PUT "https://sentry.io/api/0/organizations/triggerdev/issues/?$QS" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"status":"ignored","substatus":"archived_until_escalating"}'
```

### Substatus transition gotcha

Sentry **does not allow direct transitions between archive substatuses**. If an issue is already `archived_forever`, sending `archived_until_escalating` silently no-ops. Round-trip via `unresolved` first:

```bash
# 1. unresolve
curl -s -X PUT ".../issues/?$QS" ... -d '{"status":"unresolved"}'
sleep 1
# 2. then archive with desired substatus
curl -s -X PUT ".../issues/?$QS" ... -d '{"status":"ignored","substatus":"archived_until_escalating"}'
```
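
The round-trip can also be expressed as data, which makes the ordering easy to review before anything is sent. This is a sketch: `buildRearchiveRequests` and `IssueUpdateRequest` are hypothetical names, and sending the requests (with the `Authorization` header and a short pause between them) is left to the caller.

```typescript
interface IssueUpdateRequest {
  method: "PUT";
  url: string;
  body: string; // JSON payload
}

const ISSUES_URL = "https://sentry.io/api/0/organizations/triggerdev/issues/";

// Build the two ordered requests: unresolve first, then archive with the desired substatus.
function buildRearchiveRequests(issueIds: string[]): IssueUpdateRequest[] {
  const qs = issueIds.map((id) => `id=${encodeURIComponent(id)}`).join("&");
  const url = `${ISSUES_URL}?${qs}`;
  return [
    { method: "PUT", url, body: JSON.stringify({ status: "unresolved" }) },
    {
      method: "PUT",
      url,
      body: JSON.stringify({ status: "ignored", substatus: "archived_until_escalating" }),
    },
  ];
}
```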

### When to archive immediately vs after deploy

- **Now**: pure noise the user has confirmed `[n]` (e.g. customer quota errors, validation failures with no system impact)
- **After deploy + volume drop**: anything addressed by a code change. Code in a branch isn't shipped — premature archive hides the very signal that tells us the fix worked. See `feedback_archive_after_deploy.md` in memory.

## Step 4 — File Linear tickets for `[r]` items

The Linear-Sentry app is installed at install UUID `51f89c4b-a9c5-4684-bfd1-d9a24e1e222f`. We do NOT use Sentry's "Create Linear Issue" UI button — we use Linear MCP directly, then attach the back-link via Sentry's external-issues API.

Step 4a — Create the Linear issue:

```ts
mcp__linear-server__save_issue({
  team: "Triggerdotdev",
  project: "Platform bugs",
  title: "<descriptive title, not the Sentry shortId>",
  priority: 3, // 1=Urgent / 2=High / 3=Medium / 4=Low
  description: `## Sentry issue
- [TRIGGER-CLOUD-XX](https://triggerdev.sentry.io/issues/<id>/)
- Volume: <N> events (<rate>/day, steady|bursty)
- First seen / last seen

## Where it fires
\`<file>:<line>\` — <function context>

\`\`\`ts
<the actual throw/log site>
\`\`\`

## Root cause (confirmed | hypothesis)
<what's actually happening; sample data if available>

## Fix applied (this branch | needed)
<what the change is, or what investigation is needed>

## Acceptance criteria
- After deploy, Sentry volume on TRIGGER-CLOUD-XX drops to ~0
- Linear issue closes when verified

## Source
Auto-created by YYYY-MM-DD Sentry triage session.`
})
```

Step 4b — Attach the back-link in Sentry:

```bash
INSTALL_UUID=51f89c4b-a9c5-4684-bfd1-d9a24e1e222f
curl -s -X POST "https://sentry.io/api/0/sentry-app-installations/$INSTALL_UUID/external-issues/" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{
    "issueId": <sentry numeric issue id>,
    "webUrl": "<linear issue URL>",
    "project": "Platform bugs",
    "identifier": "<TRI-XXXX>"
  }'
```

Verify with `GET /api/0/issues/<id>/external-issues/`.

### Granularity: cluster vs fingerprint

One Linear ticket per **cluster** (root cause), not per Sentry fingerprint. Example: K0 cluster (10 SyntaxError-on-HTML fingerprints from billing client) → 1 Linear ticket. The ticket's description should list all the related Sentry shortIds.
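
Collapsing fingerprints into one ticket per cluster is a plain group-by. In the sketch below, the `cluster` field is a hypothetical label assigned by whoever is triaging, and both function names are illustrative.

```typescript
interface ClusteredIssue {
  shortId: string;
  cluster: string; // hypothetical, human-assigned root-cause label such as "K0"
}

// One entry per cluster, holding every Sentry shortId that shares the root cause.
function groupByCluster(issues: ClusteredIssue[]): Map<string, string[]> {
  const clusters = new Map<string, string[]>();
  for (const issue of issues) {
    const ids = clusters.get(issue.cluster) ?? [];
    ids.push(issue.shortId);
    clusters.set(issue.cluster, ids);
  }
  return clusters;
}

// Bullet list of related shortIds for the ticket description.
function relatedIssuesList(shortIds: string[]): string {
  return shortIds.map((id) => `- ${id}`).join("\n");
}
```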

## Step 5 — Investigate, fix, deploy

Investigation work happens on the Linear ticket. Common patterns we've seen:

- **Boundary log noise**: change the catch site to inspect inner error type and route to `warn` (see `apiBuilder.server.ts` `logBoundaryError` helper for the canonical example).
- **Wrapper-of-real-error**: ensure the inner error gets its own `logger.error("...", { error: inner })` before the wrapper SVE throw, so visibility is preserved when the wrapper's class is filtered.
- **Customer-driven validation noise**: usually `ServiceValidationError` subclasses. Filter at SDK via `ignoreErrors: [/^ServiceValidationError(?::|$)/]`.
- **High-volume failures from a 3rd-party client (e.g. billing)**: don't log to Sentry, count as a metric instead. See `platform.v3.server.ts` for the OTel counter pattern with `{function, kind}` labels (low cardinality).
- **Deliberate observability signals masquerading as errors** (slow queries, deployment timeouts firing on already-completed deployments): downgrade to `warn`, surface via dashboards.
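
The count-as-a-metric pattern from the list above might be shaped like this. It is a sketch rather than the code in `platform.v3.server.ts`: `FailureCounter` is a minimal stand-in whose `add(value, attributes)` mirrors the shape of an OpenTelemetry counter, the `FailureKind` values are invented, and `InMemoryCounter` exists only to make the example self-contained.

```typescript
// Minimal counter shape; a real OTel counter would be passed in instead.
interface FailureCounter {
  add(value: number, attributes: Record<string, string>): void;
}

// Hypothetical closed set of kinds, kept small so label cardinality stays low.
type FailureKind = "timeout" | "http_error" | "parse_error";

// In-memory counter used only for this example.
class InMemoryCounter implements FailureCounter {
  readonly totals = new Map<string, number>();
  add(value: number, attributes: Record<string, string>): void {
    const key = `${attributes.function}|${attributes.kind}`;
    this.totals.set(key, (this.totals.get(key) ?? 0) + value);
  }
}

// Returns a recorder that increments the counter with {function, kind} labels.
function makeRecordPlatformFailure(counter: FailureCounter) {
  return (fn: string, kind: FailureKind): void => {
    counter.add(1, { function: fn, kind });
  };
}
```

Keeping `kind` a closed union rather than a free-form error message is what keeps the label set small.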

Server-only changes need a `.server-changes/` file. Public-package changes need a changeset.

## Step 6 — Verify and close

After the fix deploys to prod and the volume drops in Sentry over ~24h:

1. Archive the Sentry issue (`archived_until_escalating`)
2. Move the Linear ticket to Done
3. The external-issue link stays — easy to find what was fixed when

## Bulk-action gotcha (very important)

When you spot a pattern (e.g. "all 12 untagged items are ServiceValidationError"), **do NOT bulk-act on `[u]` or `[_]` items**. The user marked them unclear/untagged for a reason — some "noise-shaped" issues are real (e.g. `Snapshot ID doesn't match the latest snapshot` looked like SVE noise but was a benign concurrency-rejection signal worth understanding before silencing). Surface the pattern as a suggestion, ask, then act.

See `feedback_audit_no_bulk_actions.md` in memory.

## Common pitfalls

- **Sentry's data scrubber filters `error` extras** to `[Filtered]` if `sendDefaultPii: false`. Use the **request URL** and **`namespace`/`runId`/etc. context fields** for attribution instead.
- **`pluginIssues`/`pluginActions` on issue objects are the legacy plugin system** — ignore. Modern integrations are sentry-apps with the `external-issues` endpoint.
- **No-stack message-level events**: when `logger.error("msg", { error })` passes a plain string `error`, `Logger.onError` calls `captureMessage` (no exception type, no stack). Look in `.context.error` for the underlying error.
- **Outgoing fetch ECONNRESET errors lose their callsite** because they originate from `node:_http_client` socket events. Wrap outgoing calls in try/catch that captures the originator before the await if you need attribution.
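
For the last pitfall, capturing the originator before the await could look like the sketch below. All names are hypothetical. The idea is to create an `Error` synchronously, while the caller's frames are still on the stack, and rethrow that error with the socket-level failure attached.

```typescript
interface AttributedError extends Error {
  inner: unknown; // the original low-level error, e.g. an ECONNRESET
}

// Created synchronously, so its stack still contains the caller's frames.
function captureCallsite(label: string): Error {
  return new Error(`${label} failed`);
}

// Attach the low-level error to the callsite error.
function attribute(callsite: Error, inner: unknown): AttributedError {
  return Object.assign(callsite, { inner });
}

// Wrap an outgoing call so failures surface with the originating stack.
async function withCallsite<T>(label: string, call: () => Promise<T>): Promise<T> {
  const callsite = captureCallsite(label); // captured before the await
  try {
    return await call();
  } catch (inner) {
    throw attribute(callsite, inner);
  }
}
```

A call site would then read something like `await withCallsite("billing.getUsage", () => fetch(url))`, where the label and URL are placeholders.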

## Where the data lives

| What | Where |
|---|---|
| Triage file | `~/.work-journal/sentry-triage-YYYY-MM-DD.md` |
| Test script for SDK filter changes | `apps/webapp/scripts/test-sentry-filter.mts` |
| Boundary-log helper pattern | `apps/webapp/app/services/routeBuilders/apiBuilder.server.ts` (`logBoundaryError`) |
| Logger → Sentry wiring | `apps/webapp/app/services/logger.server.ts` (`Logger.onError`) |
| Sentry init config | `apps/webapp/sentry.server.ts` |

apps/webapp/app/runEngine/concerns/waitpointCompletionPacket.server.ts

Lines changed: 6 additions & 0 deletions
@@ -2,6 +2,7 @@ import { type IOPacket, packetRequiresOffloading, tryCatch } from "@trigger.dev/
 import type { AuthenticatedEnvironment } from "~/services/apiAuth.server";
 import { env } from "~/env.server";
 import { uploadPacketToObjectStore } from "~/v3/objectStore.server";
+import { logger } from "~/services/logger.server";
 import { ServiceValidationError } from "~/v3/services/common.server";

 function packetExtensionForDataType(dataType: string): string {
@@ -53,6 +54,11 @@ export async function processWaitpointCompletionPacket(
   );

   if (uploadError) {
+    logger.error("Failed to upload large waitpoint to object store", {
+      error: uploadError,
+      filename,
+      environmentId: environment.id,
+    });
     throw new ServiceValidationError("Failed to upload large waitpoint to object store", 500);
   }
