triggerdotdev
diff --git a/.claude/skills/sentry-triage/SKILL.md b/.claude/skills/sentry-triage/SKILL.md
diff --git a/apps/webapp/app/runEngine/concerns/waitpointCompletionPacket.server.ts b/apps/webapp/app/runEngine/concerns/waitpointCompletionPacket.server.ts
@@ -0,0 +1,233 @@
+---
+name: sentry-triage
+description: Run a Sentry triage session — pull unresolved production issues, classify them, archive noise, file Linear tickets for real bugs, and close the loop after deploy. Use when asked to triage Sentry, audit errors, work through unresolved issues, do a Sentry sweep, or set up the triage flow. Triggers on phrases like "sentry triage", "audit sentry errors", "unresolved errors", "what should we fix in Sentry".
+allowed-tools: Read, Write, Edit, Bash, Glob, Grep, mcp__linear-server__save_issue, mcp__linear-server__list_projects
+---
+
+# Sentry Triage Loop
+
+A repeatable process for working down the backlog of unresolved Sentry issues. The output is code fixes for noisy log sites, Linear tickets in **Platform bugs** for real bugs, and a clean Sentry inbox.
+
+The four lenses for every issue:
+
+1. **Ignore / archive non-errors** — system-handled validation, expected races, customer-side issues
+2. **Update log level error → warn** — for boundary catches that already handle the failure
+3. **Investigate real bugs / perf issues / capacity signals** — these get Linear tickets
+4. **Improve the loop** — patterns recur; codify them so future triage is faster
+
+## Constants
+
+| Thing | Value |
+|---|---|
+| Sentry org slug | `triggerdev` |
+| Sentry project slug (cloud) | `trigger-cloud` |
+| Sentry project numeric ID | `4509723301642240` |
+| Linear team | `Triggerdotdev` |
+| Linear project | `Platform bugs` |
+| Linear-Sentry App install UUID | `51f89c4b-a9c5-4684-bfd1-d9a24e1e222f` |
+| Sentry token | `~/.sentryclirc` (must have `event:admin` scope) |
+
+## Step 1 — Pull and enrich
+
+Pull the top 100 unresolved production issues by frequency over the last 7 days. The org-scoped issues endpoint accepts `statsPeriod=7d` (the project-scoped one only allows `24h` or `14d`).
+
+```bash
+TOKEN=$(grep '^token=' ~/.sentryclirc | cut -d= -f2)
+curl -sG "https://sentry.io/api/0/organizations/triggerdev/issues/" \
+  -H "Authorization: Bearer $TOKEN" \
+  --data-urlencode "project=4509723301642240" \
+  --data-urlencode "query=is:unresolved environment:production" \
+  --data-urlencode "statsPeriod=7d" \
+  --data-urlencode "sort=freq" \
+  --data-urlencode "limit=100" \
+  > /tmp/sentry-triage-raw.json
+```
+
+Filter out issues that already have a Linear link (so re-runs skip what's been triaged):
+
+```bash
+# Prints `true` when the issue already has at least one external-issue link (already triaged)
+curl -s -H "Authorization: Bearer $TOKEN" \
+  "https://sentry.io/api/0/issues/<ID>/external-issues/" | jq 'length > 0'
+```
+
+Format into a triage file at `~/.work-journal/sentry-triage-YYYY-MM-DD.md` with one block per issue:
+
+```
+[_]  TRIGGER-CLOUD-XX   <count> events · <users>u · first <date> · last <date>
+     **<ErrorType>**: <title>
+     culprit: `<culprit>`
+     https://triggerdev.sentry.io/issues/<id>/
+     notes:
+```
+
+`[_]` is the tag slot; the user fills it in by editing the file. **Do not** auto-tag items the user hasn't reviewed (see "Bulk-action gotcha" below).
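A minimal sketch of how one block in the layout above could be rendered from an issue in `/tmp/sentry-triage-raw.json`. The `SentryIssue` field names (`shortId`, `count`, `userCount`, `firstSeen`, `lastSeen`, `metadata.type`) are assumptions about the payload and should be checked against the real response; `formatTriageBlock` and the sample values are hypothetical.

```typescript
// Assumed shape of one issue in the raw response; verify against real data.
interface SentryIssue {
  id: string;
  shortId: string;
  title: string;
  culprit: string;
  count: string;
  userCount: number;
  firstSeen: string; // ISO timestamp
  lastSeen: string; // ISO timestamp
  metadata?: { type?: string };
}

// Render one untagged triage block: tag slot, counts, type/title, culprit, link, notes.
function formatTriageBlock(issue: SentryIssue): string {
  const day = (iso: string): string => iso.slice(0, 10); // keep YYYY-MM-DD only
  const errorType = issue.metadata?.type ?? "Unknown";
  return [
    `[_]  ${issue.shortId}   ${issue.count} events · ${issue.userCount}u · first ${day(issue.firstSeen)} · last ${day(issue.lastSeen)}`,
    `     **${errorType}**: ${issue.title}`,
    `     culprit: \`${issue.culprit}\``,
    `     https://triggerdev.sentry.io/issues/${issue.id}/`,
    `     notes:`,
  ].join("\n");
}

// Hypothetical sample issue, for illustration only.
const exampleBlock = formatTriageBlock({
  id: "123",
  shortId: "TRIGGER-CLOUD-XX",
  title: "Snapshot ID doesn't match the latest snapshot",
  culprit: "example.culprit",
  count: "42",
  userCount: 3,
  firstSeen: "2025-01-01T08:00:00Z",
  lastSeen: "2025-01-07T09:30:00Z",
  metadata: { type: "ServiceValidationError" },
});
console.log(exampleBlock);
```

The tag slot is always emitted as `[_]`, so the formatter never pre-tags an issue on the user's behalf.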
+
+For high-volume items where the `culprit` looks misleading, also pull the latest event stack and surface the top in-app frame. The events list endpoint doesn't include `.context` per item — fetch each event individually:
+
+```bash
+curl -s -H "Authorization: Bearer $TOKEN" "https://sentry.io/api/0/issues/<ID>/events/latest/" \
+  | jq '{
+      title,
+      url: ([.entries[]? | select(.type=="request") | .data.url] | first),
+      msg: ([.entries[]? | select(.type=="message") | .data.formatted] | first),
+      exceptionType: ([.entries[]? | select(.type=="exception") | .data.values[0].type] | first),
+      topInApp: ([.entries[]? | select(.type=="exception") | .data.values[0].stacktrace.frames | reverse | map(select(.inApp==true))] | .[0] // [] | .[0:3]),
+      extras: .context
+    }'
+```
+
+## Step 2 — Walk the file with the user
+
+Tag each issue: `[r]` real, `[n]` noise, `[u]` unclear, `[s]` skip. The user makes the call. Surface patterns as suggestions; never act on `[u]` or `[_]` items unilaterally.
+
+### The three patterns to recognize
+
+**Pattern A — Pure validation (no underlying error)**: e.g. `ServiceValidationError: Snapshot ID doesn't match the latest snapshot`. The throw IS the error; nothing's wrapped. Right call: archive + filter at SDK or downgrade log level.
+
+**Pattern B — Wrapper around a real error**: e.g. `Error in loader` boundary catches that wrap an inner `ServiceValidationError` (SVE), `AbortError`, etc. Check whether the inner error is *also* captured separately: look for a `logger.error("...", { error: filesError })` adjacent to the throw. `Logger.onError` extracts `args[].error` and calls `captureException(error)`, so the inner error becomes its own typed Sentry issue. If yes, archive the wrapper. If no, *first* add a `logger.error("...", { error: innerError })` so we don't lose the signal, then archive the wrapper.
+
+**Pattern C — Real bug / perf issue / capacity signal**: needs investigation. File a Linear ticket (Step 4).
+
+### The boundary-log realization
+
+Most high-volume Sentry "errors" are actually wrapper logs at boundary catches (`logger.error("Error in loader/action", { error })` in `apiBuilder.server.ts`, `logger.error("Batch item processing failed after all attempts", ...)`, etc.). The boundary catches everything regardless of type — including expected things like aborts and validation errors.
+
+The fix is at the catch site itself: route to `logger.warn` when the inner error is a known-noise type, OR use a signal the callback already provides (e.g. `result.skipRetries` for batch items). Don't try to filter at the SDK by message — that fights the boundary.
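A minimal sketch of routing at the catch site by inner error type, as described above. It is not the `logBoundaryError` helper from `apiBuilder.server.ts`: `BoundaryLogger`, `KNOWN_NOISE_NAMES`, `boundaryLogLevel` and `logAtBoundary` are hypothetical names, and the noise list is only an example.

```typescript
type LogFn = (message: string, fields: Record<string, unknown>) => void;
type BoundaryLogger = { error: LogFn; warn: LogFn };

// Error names treated as expected noise. Illustrative list, not the project's real one.
const KNOWN_NOISE_NAMES = new Set(["ServiceValidationError", "AbortError"]);

// Decide the log level from the inner error's name; anything unknown stays at error.
function boundaryLogLevel(error: unknown): "warn" | "error" {
  return error instanceof Error && KNOWN_NOISE_NAMES.has(error.name) ? "warn" : "error";
}

// Log once at the boundary, at the level chosen above.
function logAtBoundary(logger: BoundaryLogger, message: string, error: unknown): void {
  logger[boundaryLogLevel(error)](message, { error });
}

// Usage with a recording logger standing in for the real one.
const calls: string[] = [];
const recorder: BoundaryLogger = {
  error: (message) => { calls.push(`error:${message}`); },
  warn: (message) => { calls.push(`warn:${message}`); },
};

const aborted = new Error("The operation was aborted");
aborted.name = "AbortError";

logAtBoundary(recorder, "Error in loader", aborted);
logAtBoundary(recorder, "Error in loader", new TypeError("x is undefined"));
console.log(calls);
```

Matching on `error.name` rather than on the message keeps the decision at the boundary and defaults unknown errors to `error`, so new failure types are not silenced by accident.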
+
+### Sentry SDK filtering — what works
+
+`ignoreErrors` matches against the formatted `"Type: message"` string, not just `Type`. So `/^ServiceValidationError$/` does NOT match (event message is `"ServiceValidationError: <message>"`). Use `/^ServiceValidationError(?::|$)/` instead. **Verify with `apps/webapp/scripts/test-sentry-filter.mts`** — it captures four error variants tagged with a unique run ID, then queries Sentry to confirm which arrived.
+
+`logger.warn` does NOT flow to Sentry (only `Logger.onError` is wired to `captureException`/`captureMessage`). So downgrading the log call is sufficient — you don't also need to add it to `ignoreErrors`.
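The regex claim above can be checked in isolation. This sketch only runs the two patterns against strings shaped like the formatted `"Type: message"` value; it does not exercise the Sentry SDK, and the sample strings other than the snapshot message are made up.

```typescript
// The pattern that looks right but misses, and the one recommended above.
const exactOnly = /^ServiceValidationError$/;
const typePrefix = /^ServiceValidationError(?::|$)/;

// Strings in the "Type: message" shape the filter is matched against.
const samples = [
  "ServiceValidationError: Snapshot ID doesn't match the latest snapshot",
  "ServiceValidationError",
  "ServiceValidationErrorLike: unrelated type with a shared prefix",
  "Error: ServiceValidationError: nested in another message",
];

const results = samples.map((sample) => ({
  sample,
  exactOnly: exactOnly.test(sample),
  typePrefix: typePrefix.test(sample),
}));

for (const row of results) {
  console.log(`${row.exactOnly}\t${row.typePrefix}\t${row.sample}`);
}
```

Only the first two samples match `typePrefix`: the `(?::|$)` group requires a colon or end of string right after the type name, so a longer type name sharing the prefix is not swallowed.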
+
+## Step 3 — Archive `[n]` items
+
+Archive with `archived_until_escalating` (Sentry auto-resurfaces if event volume jumps significantly). **Token must have `event:admin` scope** — `event:write` silently coerces every archive to `archived_forever`.
+
+```bash
+TOKEN=$(grep '^token=' ~/.sentryclirc | cut -d= -f2)
+QS="id=<id1>&id=<id2>&..."
+curl -s -X PUT "https://sentry.io/api/0/organizations/triggerdev/issues/?$QS" \
+  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
+  -d '{"status":"ignored","substatus":"archived_until_escalating"}'
+```
+
+### Substatus transition gotcha
+
+Sentry **does not allow direct transitions between archive substatuses**. If an issue is already `archived_forever`, sending `archived_until_escalating` silently no-ops. Round-trip via `unresolved` first:
+
+```bash
+# 1. unresolve
+curl -s -X PUT ".../issues/?$QS" ... -d '{"status":"unresolved"}'
+sleep 1
+# 2. then archive with desired substatus
+curl -s -X PUT ".../issues/?$QS" ... -d '{"status":"ignored","substatus":"archived_until_escalating"}'
+```
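A minimal sketch of the round-trip rule as a pure planning function, so the request sequence can be reviewed before anything is sent. `planArchive` and `issueQueryString` are hypothetical helpers; the payloads are the ones used in the curl calls above, and no request is made here.

```typescript
type ArchiveSubstatus = "archived_until_escalating" | "archived_forever";
type IssueUpdate =
  | { status: "unresolved" }
  | { status: "ignored"; substatus: ArchiveSubstatus };

// Build the repeated id=<id> query string used by the bulk endpoint.
function issueQueryString(ids: string[]): string {
  return ids.map((id) => `id=${encodeURIComponent(id)}`).join("&");
}

// Return the PUT payloads to send, in order, to reach the target substatus.
function planArchive(current: IssueUpdate, target: ArchiveSubstatus): IssueUpdate[] {
  const archive: IssueUpdate = { status: "ignored", substatus: target };
  if (current.status === "unresolved") return [archive]; // direct archive is fine
  if (current.substatus === target) return []; // already in the desired state
  return [{ status: "unresolved" }, archive]; // round-trip via unresolved
}

const plan = planArchive(
  { status: "ignored", substatus: "archived_forever" },
  "archived_until_escalating"
);
console.log(issueQueryString(["111", "222"]), JSON.stringify(plan));
```

Each payload in the returned list corresponds to one `PUT`; the `sleep 1` between them in the shell version would sit between the two sends.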
+
+### When to archive immediately vs after deploy
+
+- **Now**: pure noise the user has confirmed `[n]` (e.g. customer quota errors, validation failures with no system impact)
+- **After deploy + volume drop**: anything addressed by a code change. Code in a branch isn't shipped — premature archive hides the very signal that tells us the fix worked. See `feedback_archive_after_deploy.md` in memory.
+
+## Step 4 — File Linear tickets for `[r]` items
+
+The Linear-Sentry app install UUID is `51f89c4b-a9c5-4684-bfd1-d9a24e1e222f`. We do NOT use Sentry's "Create Linear Issue" UI button. Instead we create the issue through Linear MCP directly, then attach the back-link via Sentry's external-issues API.
+
+Step 4a — Create the Linear issue:
+
+```ts
+mcp__linear-server__save_issue({
+  team: "Triggerdotdev",
+  project: "Platform bugs",
+  title: "<descriptive title, not the Sentry shortId>",
+  priority: 3, // 1=Urgent / 2=High / 3=Medium / 4=Low
+  description: `## Sentry issue
+- [TRIGGER-CLOUD-XX](https://triggerdev.sentry.io/issues/<id>/)
+- Volume: <N> events (<rate>/day, steady|bursty)
+- First seen / last seen
+
+## Where it fires
+\`<file>:<line>\` — <function context>
+
+\`\`\`ts
+<the actual throw/log site>
+\`\`\`
+
+## Root cause (confirmed | hypothesis)
+<what's actually happening; sample data if available>
+
+## Fix applied (this branch | needed)
+<what the change is, or what investigation is needed>
+
+## Acceptance criteria
+- After deploy, Sentry volume on TRIGGER-CLOUD-XX drops to ~0
+- Linear issue closes when verified
+
+## Source
+Auto-created by YYYY-MM-DD Sentry triage session.`
+})
+```
+
+Step 4b — Attach the back-link in Sentry:
+
+```bash
+INSTALL_UUID=51f89c4b-a9c5-4684-bfd1-d9a24e1e222f
+curl -s -X POST "https://sentry.io/api/0/sentry-app-installations/$INSTALL_UUID/external-issues/" \
+  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
+  -d '{
+    "issueId": <sentry numeric issue id>,
+    "webUrl": "<linear issue URL>",
+    "project": "Platform bugs",
+    "identifier": "<TRI-XXXX>"
+  }'
+```
+
+Verify with `GET /api/0/issues/<id>/external-issues/`.
+
+### Granularity: cluster vs fingerprint
+
+One Linear ticket per **cluster** (root cause), not per Sentry fingerprint. Example: K0 cluster (10 SyntaxError-on-HTML fingerprints from billing client) → 1 Linear ticket. The ticket's description should list all the related Sentry shortIds.
+
+## Step 5 — Investigate, fix, deploy
+
+Investigation work happens on the Linear ticket. Common patterns we've seen:
+
+- **Boundary log noise**: change the catch site to inspect inner error type and route to `warn` (see `apiBuilder.server.ts` `logBoundaryError` helper for the canonical example).
+- **Wrapper-of-real-error**: ensure the inner error gets its own `logger.error("...", { error: inner })` before the wrapper SVE throw, so visibility is preserved when the wrapper's class is filtered.
+- **Customer-driven validation noise**: usually `ServiceValidationError` subclasses. Filter at SDK via `ignoreErrors: [/^ServiceValidationError(?::|$)/]`.
+- **High-volume failures from a 3rd-party client (e.g. billing)**: don't log to Sentry, count as a metric instead. See `platform.v3.server.ts` for the OTel counter pattern with `{function, kind}` labels (low cardinality).
+- **Deliberate observability signals masquerading as errors** (slow queries, deployment timeouts firing on already-completed deployments): downgrade to `warn`, surface via dashboards.
+
+Server-only changes need a `.server-changes/` file. Public-package changes need a changeset.
+
+## Step 6 — Verify and close
+
+After the fix deploys to prod and the volume drops in Sentry over ~24h:
+
+1. Archive the Sentry issue (`archived_until_escalating`)
+2. Move the Linear ticket to Done
+3. The external-issue link stays — easy to find what was fixed when
+
+## Bulk-action gotcha (very important)
+
+When you spot a pattern (e.g. "all 12 untagged items are ServiceValidationError"), **do NOT bulk-act on `[u]` or `[_]` items**. The user marked them unclear/untagged for a reason — some "noise-shaped" issues are real (e.g. `Snapshot ID doesn't match the latest snapshot` looked like SVE noise but was a benign concurrency-rejection signal worth understanding before silencing). Surface the pattern as a suggestion, ask, then act.
+
+See `feedback_audit_no_bulk_actions.md` in memory.
+
+## Common pitfalls
+
+- **Sentry's data scrubber filters `error` extras** to `[Filtered]` if `sendDefaultPii: false`. Use the **request URL** and **`namespace`/`runId`/etc. context fields** for attribution instead.
+- **`pluginIssues`/`pluginActions` on issue objects are the legacy plugin system** — ignore. Modern integrations are sentry-apps with the `external-issues` endpoint.
+- **No-stack message-level events**: when `logger.error("msg", { error })` passes a plain string `error`, `Logger.onError` calls `captureMessage` (no exception type, no stack). Look in `.context.error` for the underlying error.
+- **Outgoing fetch ECONNRESET errors lose their callsite** because they originate from `node:_http_client` socket events. Wrap outgoing calls in try/catch that captures the originator before the await if you need attribution.
+
+## Where the data lives
+
+| What | Where |
+|---|---|
+| Triage file | `~/.work-journal/sentry-triage-YYYY-MM-DD.md` |
+| Test script for SDK filter changes | `apps/webapp/scripts/test-sentry-filter.mts` |
+| Boundary-log helper pattern | `apps/webapp/app/services/routeBuilders/apiBuilder.server.ts` (`logBoundaryError`) |
+| Logger → Sentry wiring | `apps/webapp/app/services/logger.server.ts` (`Logger.onError`) |
+| Sentry init config | `apps/webapp/sentry.server.ts` |
@@ -2,6 +2,7 @@ import { type IOPacket, packetRequiresOffloading, tryCatch } from "@trigger.dev/
 import type { AuthenticatedEnvironment } from "~/services/apiAuth.server";
 import { env } from "~/env.server";
 import { uploadPacketToObjectStore } from "~/v3/objectStore.server";
+import { logger } from "~/services/logger.server";
 import { ServiceValidationError } from "~/v3/services/common.server";
 
 function packetExtensionForDataType(dataType: string): string {
@@ -53,6 +54,11 @@ export async function processWaitpointCompletionPacket(
   );
 
   if (uploadError) {
+    logger.error("Failed to upload large waitpoint to object store", {
+      error: uploadError,
+      filename,
+      environmentId: environment.id,
+    });
     throw new ServiceValidationError("Failed to upload large waitpoint to object store", 500);
   }
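The change above is an instance of Pattern B from the skill file: log the inner error before throwing the wrapper, so the cause keeps its own signal. A minimal self-contained sketch of that ordering follows; the `ServiceValidationError` class and `logger` here are stand-ins rather than the project's real ones, and `rejectUpload` and its sample arguments are hypothetical.

```typescript
// Stand-in for the project's ServiceValidationError (message plus optional status).
class ServiceValidationError extends Error {
  status?: number;
  constructor(message: string, status?: number) {
    super(message);
    this.name = "ServiceValidationError";
    this.status = status;
  }
}

// Stand-in logger that records error-level calls instead of sending them anywhere.
type LogEntry = { message: string; fields: Record<string, unknown> };
const loggedErrors: LogEntry[] = [];
const logger = {
  error(message: string, fields: Record<string, unknown>): void {
    loggedErrors.push({ message, fields });
  },
};

// Log the inner error first, then throw the wrapper that callers see.
function rejectUpload(uploadError: Error, filename: string): never {
  logger.error("Failed to upload large waitpoint to object store", {
    error: uploadError,
    filename,
  });
  throw new ServiceValidationError("Failed to upload large waitpoint to object store", 500);
}

let thrown: unknown;
try {
  rejectUpload(new Error("socket hang up"), "waitpoint.json");
} catch (error) {
  thrown = error;
}
console.log(loggedErrors.length, thrown instanceof Error ? thrown.name : thrown);
```

If the wrapper class is later filtered or downgraded, the logged inner error is the only place the underlying failure stays visible.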