Commit 51ecf45

harshach, claude, github-actions[bot], aniketkatkar97, and IceS2 authored
Task redesign (#25894)
* Task Redesign: Add Task entity & tests
* Task Redesign: Add Task entity & tests
* Task Redesign: Add Permissions checks for Task APIs
* Task UI changed to the new APIs
* Migrate UI and APIs to new tasks system including suggestions
* Add Suggestions integration
* Activity Feed Refactor
* ActivityFeed -> ActivityStream publisher
* Activity Feed redesign
* Activity Feed redesign, adding tests
* Incident Manager update
* Migrate Incidents to new tasks
* Migrate Incidents to new tasks
* Update generated TypeScript types
* Update generated TypeScript types
* feat(tasks): add domain-aware task cutover and workflow v2 migration
* test(tasks): cover domain filters and task feed visibility flows
* Address comments
* Fix workflow tests to use new Task entity API and fix UserApprovalTaskV2 candidate transformation
  Migrated 9 WorkflowDefinitionResourceIT tests from legacy Feed/Thread API to the new Task entity API (UserApprovalTaskV2 creates Task entities, not Thread entities). Fixed a bug in UserApprovalTaskV2 where candidates were passed as raw EntityReferences instead of being transformed into users/teams FQN arrays for SetApprovalAssigneesImpl.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix tests
* refactor: stabilize task entity workflows
* refactor: finish task entity cutover and activity migration
* refactor: migrate legacy thread feed during cutover
* refactor: split legacy thread rename and archive migrations
* Merge main; fix tests
* Update generated TypeScript types
* feat: advance task redesign through phase 2
* Merge main; fix tests
* Update generated TypeScript types
* Fix failing tests
* Update generated TypeScript types
* finish phase 6 of the design, configurable task forms
* Update generated TypeScript types
* Update generated TypeScript types
* Fix linting
* Address gitar comments
* Address gitar comments
* Fix build
* Address gitar comments
* fix build
* Add task custom forms
* Fix tests
* Address tests
* Apply UI lint autofixes
* Fix tests
* Fix linter
* Fix task patching
* Fix tests
* Fix playwright tests
* fix java checkstyle
* Add python sdk support for tasks, announcements
* Fix playwright tests
* Fix playwright tests
* Fix playwright tests
* Fix python tests
* Fix python tests
* Fix linting workflows
* fix pycheck
* fix pycheck
* Fix tests
* Fix build
* Address deviations from main and fix tests
* Fix integration tests
* Fix integration tests
* Fix integration tests
* Update generated TypeScript types
* Fix Playwright tests
* Fix Playwright tests
* feat(incident): wire incident manager to task-first architecture (#27369)
  * feat(incident): wire incident manager to task-first architecture
    Connect the incident manager to the task redesign so it works end-to-end: resolve data persistence, backward transitions, reopen from resolved, and incident discovery via TCRS.
  * Update generated TypeScript types
  * refactor: single-query incident task lookup with parameterized statuses
    Replace two sequential queries (Open, InProgress) in getOrCreateIncident with one findByAboutAndTypeAndStatuses query using @BindList for status IN (...).
  ---------
  Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Fix Playwright tests
* Update generated TypeScript types
* Fix linter
* Fix tests
* Fix tests
* Fix checkstyle
* Fix tests
* Fix checkstyle
* Update FeedResourceIT.java
* Update TableRepository.java
* fix tests
* Update ActivityFeedProvider.tsx
* fix tests
* fix tests
* Address Task comments
* Fix unit test
* Fix the feed summary panel showing on landing page
* Fix comment functionality
* Fix pytests
* Fix failing playwright tests
* Fix test flakiness
* Fix ui-checkstyle
* Fix advanced search spec failure
* Fix playwright tests
  Co-authored-by: Copilot <copilot@github.com>
* Fix checkstyle
* Fix the flaky tests
  Co-authored-by: Copilot <copilot@github.com>
* fix checkstyle
* Reduce the workflow polling
* Update generated TypeScript types
* skip failing tests
  Co-authored-by: Copilot <copilot@github.com>
* Fix ui-checkstyle
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aniket Katkar <aniketkatkar97@gmail.com>
Co-authored-by: IceS2 <pablo.takara@getcollate.io>
Co-authored-by: karanh37 <karanh37@gmail.com>
Co-authored-by: Karan Hotchandani <33024356+karanh37@users.noreply.github.com>
Co-authored-by: Copilot <copilot@github.com>
1 parent b47c219 commit 51ecf45

498 files changed

Lines changed: 76513 additions & 13744 deletions

.claude/skills/connector-audit

Lines changed: 0 additions & 1 deletion
This file was deleted.

CLAUDE.md

Lines changed: 16 additions & 0 deletions
@@ -139,6 +139,22 @@ make unit_ingestion # Python unit tests with coverage
yarn test:coverage # Frontend test coverage
```

### Backend Integration Tests
All backend API integration tests MUST be placed in `openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/` directory. Tests should:
- Use naming convention `*IT.java` (Integration Test)
- Extend `BaseEntityIT<T, K>` for entity CRUD tests
- Be designed to run concurrently (use `@Execution(ExecutionMode.CONCURRENT)`)
- Use `TestNamespace` for test isolation
- Use `SdkClients` for API calls (e.g., `SdkClients.adminClient().tables().create(...)`)

```bash
# Run a specific integration test
mvn test -pl openmetadata-integration-tests -Dtest=TaskResourceIT

# Run all integration tests
mvn test -pl openmetadata-integration-tests
```

## Code Generation and Schemas

OpenMetadata uses a schema-first approach with JSON Schema definitions driving code generation:
Lines changed: 305 additions & 0 deletions
@@ -0,0 +1,305 @@
# Integrate Incident Manager in the Governance Workflows Framework

ADR-#: 1
Authors: Pablo Takara
Reviewers: Teddy Crépineau, Ram Narayan Balaji
Date: February 27, 2026
Status: Proposed

> Migrate incident lifecycle into a governance workflow using a new Task Lifecycle Node. The node uses OpenMetadata tasks as the source of truth (not Flowable UserTask), receives a template with configurable statuses, and exposes each status transition to the main workflow graph via process variables. Users wire hooks on any transition using standard edges. Non-terminal statuses loop back; terminal statuses auto-close the task.

---

## Context

The Incident Manager handles the lifecycle of data quality incidents in OpenMetadata. When a test case fails, an incident is created; it progresses through `New → Ack → Assigned → Resolved` as humans triage it.

Today, this lifecycle is a **switch statement** in `TestCaseResolutionStatusRepository.storeInternal()`. It handles state transitions, task creation, assignment, and resolution. The state machine is simple, correct, and performant, but it has **no extension points**. Adding a behavior like "on Assigned, notify via Slack" or "on New, auto-assign to table owner" requires modifying repository code, testing, and redeploying.

Meanwhile, OpenMetadata ships a **governance workflows framework** built on Flowable BPM. It is fully configurable via REST API and UI. Users configure workflows as abstract **trigger → nodes → edges** graphs (they never see BPMN XML). The backend compiles these to Flowable process definitions automatically via `NodeFactory` and `MainWorkflow`.

The two systems live side by side but do not interact.

Additionally, the **task refactor** promotes tasks to first-class entities with standard `ChangeEvents`. This enables Flowable to be notified of every status transition — not just resolution — unlocking configurable hooks on any transition from day one.

### Specific Gaps

1. **No auto-close when tests pass.** `TestCaseResultRepository.setTestCaseResultIncidentId()` sets `incidentId = null` when a test succeeds but **never resolves the incident or closes its task**.
2. **No auto-assign on incident creation.** Every incident starts in `New` and requires manual acknowledgement.
3. **No extensibility.** Organizations cannot define configurable rules like "on any status change, execute action X" without code changes.
4. **Fixed lifecycle.** The `New → Ack → Assigned → Resolved` states are hardcoded. Organizations with different triage processes have no way to customize.
5. **No incident TTL.** No mechanism to auto-close stale incidents.

### Enterprise scale context

- 5M assets, 10-30% with data quality tests = 500K-1.5M test cases
- At 2-5% failure rate = **10K-75K concurrent open incidents** (typical)
- `getOrCreateIncident()` enforces one unresolved incident per test case
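
The bounds above can be reproduced with a few lines of arithmetic. This is only a back-of-envelope sketch: it assumes roughly one test case per tested asset, which the figures imply but the ADR does not state, and the class and method names are illustrative.

```java
// Back-of-envelope check of the scale figures above.
// Assumption (not stated in the ADR): one test case per tested asset.
final class IncidentScaleEstimate {

    /** Open incidents = assets x share of assets with tests x failure rate. */
    static long openIncidents(long assets, double testedShare, double failureRate) {
        return Math.round(assets * testedShare * failureRate);
    }

    public static void main(String[] args) {
        long assets = 5_000_000L;
        long low = openIncidents(assets, 0.10, 0.02);   // 10% tested, 2% failing
        long high = openIncidents(assets, 0.30, 0.05);  // 30% tested, 5% failing
        System.out.println("open incidents: " + low + " to " + high);
    }
}
```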

---

## Use Cases

**UC-1 — Auto-close incident when test passes**
The system automatically resolves the open incident (reason: AutoResolved) and closes its task. No human intervention required.

**UC-2 — Auto-assign incident on creation**
When a new incident is created, the system automatically assigns it to a configured user or team.

**UC-3 — Auto-close stale incidents (TTL)**
An incident open longer than a configurable deadline is automatically resolved (reason: Expired).

**UC-4 — User-defined hooks on any status transition**
Users wire follow-up steps (notifications, Jira tickets, etc.) on any status change via workflow edges — no code changes.

---

## Decision

### Task Lifecycle Node

A new governance workflow node that does NOT use Flowable's BPMN UserTask. It creates an OpenMetadata task, waits for status changes via `IntermediateCatchEvent`, and exposes each status to the parent workflow for routing.

**Internal BPMN structure:**
```
┌─ SubProcess ───────────────────────────────────────────────────────┐
│                                                                    │
│  [Start] → [Setup] → [Gateway: created?]                           │
│               │          no → [End: skip]                          │
│               │          yes ↓                                     │
│               │      [IntermediateCatchEvent: wait]                │
│               │          ↓ message with {status}                   │
│               │      [Gateway: terminal?]                          │
│               │          yes → [CloseTask] → [SetResult] → [End]   │
│               │          no → [SetResult] → [End]                  │
│               │                                                    │
│               │  Setup (idempotent):                               │
│               │   • Check for existing open incident               │
│               │     → if exists with active process: skip          │
│               │     → if orphaned process: terminate it            │
│               │   • Create incident record (New)                   │
│               │   • Create OM task                                 │
│               │   • Auto-assign (from template config)             │
│               │   • Set process variable omTaskId = task UUID      │
│                                                                    │
│  + [TTL Boundary Timer: configurable, interrupting]                │
│      → [AutoResolve via repository] → [End]                        │
└────────────────────────────────────────────────────────────────────┘
```

**Node config:**
```json
{
  "type": "taskLifecycleNode",
  "config": {
    "template": "incident",
    "statuses": ["New", "Ack", "Assigned", "Resolved"],
    "terminal": ["Resolved"],
    "responsibles": { "source": "tableOwner" },
    "ttl": "P30D"
  }
}
```
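
The `ttl` value reads as an ISO-8601 duration. The sketch below illustrates what a 30-day deadline means using `java.time`; in the design the deadline is enforced by a boundary timer, so `isExpired` is a hypothetical helper for illustration and not part of the node.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative only: shows how "P30D" translates into a deadline.
final class TtlSketch {

    /** True when the incident has been open for at least the configured TTL. */
    static boolean isExpired(Instant createdAt, String ttl, Instant now) {
        Duration limit = Duration.parse(ttl); // "P30D" parses as 30 days of 24 hours
        return !now.isBefore(createdAt.plus(limit));
    }

    public static void main(String[] args) {
        Instant created = Instant.parse("2026-01-01T00:00:00Z");
        System.out.println(isExpired(created, "P30D", Instant.parse("2026-01-30T00:00:00Z")));
        System.out.println(isExpired(created, "P30D", Instant.parse("2026-01-31T00:00:00Z")));
    }
}
```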

The node:
1. **Setup** — Creates the OM task (idempotent on re-entry). Sets `omTaskId` process variable.
2. **Wait** — `IntermediateCatchEvent` with `messageExpression="${omTaskId}"`. Subscribes to a message named after the task UUID (~2 Flowable DB rows).
3. **On message** — Evaluates whether the received status is terminal.
4. **Terminal** — Closes the OM task (idempotent), sets `{nodeName}_result` at parent scope, subprocess exits.
5. **Non-terminal** — Sets `{nodeName}_result` at parent scope, subprocess exits. Parent-level edges route back to the node.
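
Steps 3 to 5 reduce to one decision: set the result variable in every case, and close the task only when the status is terminal. A minimal sketch under that reading follows; `TaskLifecycleExitSketch`, `Exit`, and `onMessage` are hypothetical names, not classes from the codebase.

```java
import java.util.Set;

// Hypothetical sketch of steps 3 to 5; names are illustrative.
final class TaskLifecycleExitSketch {

    /** What the subprocess does when a status message arrives. */
    static final class Exit {
        final String resultVariable; // e.g. "incident_result", set at parent scope
        final String status;         // value the parent-level edges condition on
        final boolean closeTask;     // true only for terminal statuses

        Exit(String resultVariable, String status, boolean closeTask) {
            this.resultVariable = resultVariable;
            this.status = status;
            this.closeTask = closeTask;
        }
    }

    static Exit onMessage(String nodeName, String status, Set<String> terminal) {
        // The result variable is set for every status; only terminal ones close the task.
        return new Exit(nodeName + "_result", status, terminal.contains(status));
    }

    public static void main(String[] args) {
        Set<String> terminal = Set.of("Resolved");
        Exit ack = onMessage("incident", "Ack", terminal);
        Exit resolved = onMessage("incident", "Resolved", terminal);
        System.out.println(ack.resultVariable + "=" + ack.status + " close=" + ack.closeTask);
        System.out.println(resolved.resultVariable + "=" + resolved.status + " close=" + resolved.closeTask);
    }
}
```

In both branches the subprocess exits; the difference is only whether the task is closed first.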

### Status exposed via graph edges (with cycles)

Status is set as a Flowable process variable when the subprocess exits. Parent-level edges condition on this variable. Non-terminal edges loop back to the node.

```
            ┌────── "ack" ───────────────────────────┐
            │  ┌─── "assigned" → [NotifySlack] ──────┤
            ▼  ▼                                     │
[Start] → [ManageIncident] ── "resolved" → [End]
```

**Workflow definition example:**
```json
{
  "name": "incident-lifecycle",
  "trigger": {
    "type": "eventBasedEntity",
    "config": {
      "entityTypes": ["TestCase"],
      "events": ["Updated"],
      "filter": { "TestCase": { "==": [{"var": "testCaseStatus"}, "Failed"] } }
    }
  },
  "nodes": [
    { "type": "startEvent", "name": "start" },
    { "type": "taskLifecycleNode", "name": "incident", "config": {
      "template": "incident",
      "statuses": ["New", "Ack", "Assigned", "Resolved"],
      "terminal": ["Resolved"],
      "responsibles": { "source": "tableOwner" },
      "ttl": "P30D"
    }},
    { "type": "automatedTask", "subType": "sinkTask", "name": "notifySlack" },
    { "type": "endEvent", "name": "end" }
  ],
  "edges": [
    { "from": "start", "to": "incident" },
    { "from": "incident", "to": "incident", "condition": { "status": "Ack" } },
    { "from": "incident", "to": "notifySlack", "condition": { "status": "Assigned" } },
    { "from": "notifySlack", "to": "incident" },
    { "from": "incident", "to": "end", "condition": { "status": "Resolved" } }
  ]
}
```
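
To make the routing concrete, the sketch below walks the edges of the example definition and picks the next node for a given status. It is a simplified model, not the `MainWorkflow` compiler: `EdgeRoutingSketch` and `next` are hypothetical, and first-match-wins is an assumption about evaluation order.

```java
import java.util.List;
import java.util.Optional;

// Simplified model of condition-based routing over the example edges.
final class EdgeRoutingSketch {

    static final class Edge {
        final String from;
        final String to;
        final String status; // null means the edge is unconditional

        Edge(String from, String to, String status) {
            this.from = from;
            this.to = to;
            this.status = status;
        }
    }

    /** Edges of the example workflow, in declaration order. */
    static final List<Edge> EDGES = List.of(
        new Edge("start", "incident", null),
        new Edge("incident", "incident", "Ack"),
        new Edge("incident", "notifySlack", "Assigned"),
        new Edge("notifySlack", "incident", null),
        new Edge("incident", "end", "Resolved"));

    /** First edge leaving {@code from} whose condition matches {@code status}. */
    static Optional<String> next(String from, String status) {
        return EDGES.stream()
            .filter(e -> e.from.equals(from))
            .filter(e -> e.status == null || e.status.equals(status))
            .map(e -> e.to)
            .findFirst();
    }

    public static void main(String[] args) {
        System.out.println(next("incident", "Ack"));
        System.out.println(next("incident", "Assigned"));
        System.out.println(next("notifySlack", "Assigned"));
        System.out.println(next("incident", "Resolved"));
    }
}
```

A status with no matching edge yields an empty result in this model; how the real compiler should treat that case is related to the cycle-validation question under Open Questions.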

### Message delivery via task ChangeEvents

With the task refactor, tasks emit `ChangeEvents` on status changes. These drive message delivery to Flowable:

1. Task status changes (via REST API / `storeInternal`)
2. `ChangeEvent` emitted
3. Listener correlates message to waiting `IntermediateCatchEvent`

The OM task is already updated before the message fires. If correlation fails, the task state is correct — Flowable catches up on the next status change.

**Mechanism TBD**: Listener on task `ChangeEvents` (clean separation) vs direct hook in task status update code (fewer hops).
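
Whichever mechanism is chosen, the delivery step has the same shape: look up the execution waiting on the task's message, notify it if present, and skip otherwise. The sketch below models that with an in-memory map so it runs standalone. All names are hypothetical, and the Flowable wiring (presumably an event subscription query plus `RuntimeService.messageEventReceived`) is deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for the correlation step; not the real listener.
final class StatusMessageDeliverySketch {

    // message name (the task UUID) -> id of the execution waiting on it
    private final Map<String, String> waiting = new HashMap<>();
    // execution id -> last status delivered to it
    private final Map<String, String> delivered = new HashMap<>();

    /** Models the IntermediateCatchEvent subscribing to its task's message. */
    void subscribe(String taskId, String executionId) {
        waiting.put(taskId, executionId);
    }

    /**
     * Called after the OM task has already been updated.
     * Returns true if a waiting execution was notified, false if the event was skipped.
     */
    boolean onTaskStatusChanged(String taskId, String newStatus) {
        String executionId = waiting.remove(taskId);
        if (executionId == null) {
            return false; // nothing is waiting; the task state is still correct
        }
        delivered.put(executionId, newStatus);
        return true;
    }

    String lastDelivered(String executionId) {
        return delivered.get(executionId);
    }

    public static void main(String[] args) {
        StatusMessageDeliverySketch delivery = new StatusMessageDeliverySketch();
        delivery.subscribe("task-1", "exec-1");
        System.out.println(delivery.onTaskStatusChanged("task-1", "Ack"));
        System.out.println(delivery.onTaskStatusChanged("task-1", "Assigned"));
    }
}
```

The second call finds no subscription because the node has not re-entered its wait state yet. That is the race listed under Risks, where the message is skipped rather than buffered.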

### What the workflow controls vs the repository

| Action | Who handles it |
| --- | --- |
| Task creation | Node setup phase (idempotent) |
| Status changes (Ack, Assigned, etc.) | Repository — synchronous, unchanged |
| Resolution | Repository — synchronous, unchanged |
| Task closure | Both — node closes on terminal, repository may also close. Idempotent. |
| Flowable notification | Task ChangeEvent → message to IntermediateCatchEvent |
| Follow-up hooks | Workflow edges — user-configurable |
| TTL auto-resolve | Boundary timer on node |
| Auto-close on test pass | Separate short-lived workflow |

### Why this approach

1. **Hooks on any transition.** Status exposed to parent graph → users wire follow-up steps via edges.
2. **Configurable lifecycle.** Template defines statuses and terminal set. No hardcoded lifecycle.
3. **OM task is source of truth.** No BPMN UserTask. ~2 DB rows per task vs ~5-10.
4. **Repository stays in the critical path.** All transitions are synchronous. Flowable is notified after the fact. If Flowable is down, transitions still succeed.
5. **Unified abstraction.** Same node type for incidents, approvals, certifications — different templates.

---

## Consequences

### Positive

- **Hooks on any status transition** without code changes.
- **Configurable lifecycle from day one** via template config.
- **Lightweight** — ~2 Flowable DB rows per task (IntermediateCatchEvent).
- **Safe** — repository owns all transitions synchronously; Flowable is follow-up only.
- **Default workflow replicates current behavior** and ships enabled.
- **Unified abstraction** — incidents, approvals, certifications share one node type.

### Negative

- **MainWorkflow compiler must support cycles.** Today it assumes a DAG. Biggest technical risk.
- **More Flowable interactions.** Every status change sends a message (vs resolution only). ~225K correlations over lifetime of 75K incidents with ~3 transitions each.
- **Task refactor dependency.** Fallback: direct `reportOutcome()` from `storeInternal()` if not ready.

### Neutral

- REST API surface unchanged.
- `TestCaseResolutionStatus` schema changes minimally (add `AutoResolved`, `Expired` reasons).
- Resolution business logic in the repository is unchanged.

---

## Alternatives Considered

### Bookends only (no intermediate state hooks)

Handle only creation + resolution in the workflow. Intermediate states stay entirely in `storeInternal()`.

**Not chosen:** Users cannot wire hooks on Ack/Assigned. The task refactor makes full lifecycle hooks possible now — deferring them means two migrations.

### Internal loop (cycle hidden inside SubProcess)

The message loop lives inside the node. Status exposed only on terminal exit. Outer graph stays a DAG.

**Not chosen:** Users cannot wire hooks on non-terminal transitions. The point is exposing every status change to the parent graph.

### Resolution through Flowable (not fire-and-forget)

Route resolution through the Flowable process.

**Not chosen:** Puts Flowable in the critical path. If Flowable is slow/down, resolution is blocked.

### Extend state machine with Java hooks

**Rejected:** Parallel automation system, requires code changes for every new behavior.

### CMMN (Case Management)

**Rejected:** Zero existing infrastructure, overkill.

---

## Design Choices

### IntermediateCatchEvent with messageExpression

`messageExpression="${omTaskId}"` gives unique-per-instance subscriptions. `EventSubscriptionQuery.eventName(taskId)` is an indexed lookup. No MessageCorrelationBuilder (doesn't exist in Flowable 7.2.0).

### Idempotent setup on loop re-entry

When non-terminal edges loop back, Setup detects the existing task and reuses it. Safe for any number of loops.
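
The reuse-on-re-entry behaviour amounts to a get-or-create keyed by the test case. A minimal sketch, assuming an in-memory map stands in for the task store (the real Setup would query the task repository); `IdempotentSetupSketch` and `setup` are hypothetical names.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Get-or-create keyed by test case FQN; the map stands in for the task store.
final class IdempotentSetupSketch {

    private final Map<String, UUID> openTaskByTestCase = new ConcurrentHashMap<>();

    /** Returns the existing open task id for the test case, creating one only if absent. */
    UUID setup(String testCaseFqn) {
        return openTaskByTestCase.computeIfAbsent(testCaseFqn, key -> UUID.randomUUID());
    }

    public static void main(String[] args) {
        IdempotentSetupSketch node = new IdempotentSetupSketch();
        UUID first = node.setup("service.db.schema.table.rowCountCheck");
        UUID again = node.setup("service.db.schema.table.rowCountCheck");
        System.out.println(first.equals(again));
    }
}
```

Because every loop iteration returns the same task id, the `omTaskId` process variable, and with it the message name the node waits on, stays stable across re-entries.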

### Terminal auto-close — both sides

`storeInternal(Resolved)` closes the task. The node's `CloseTask` also closes on terminal status. Both are idempotent. This handles TTL (node-initiated) and human resolution (repository-initiated) uniformly.

### Business key = test case FQN

Enables idempotent creation, fire-and-forget termination, auto-close correlation.

### Governance-bot loop prevention

`WorkflowEventConsumer` skips events from `governance-bot`. The workflow runs as `governance-bot`, so its own events don't re-trigger workflows.
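
The guard itself is a single comparison on the event's author. A hedged sketch of the idea follows; `shouldTrigger` is a hypothetical helper, and the real check in `WorkflowEventConsumer` may be structured differently.

```java
// Illustrative guard: events authored by the workflow's own bot are ignored.
final class LoopPreventionSketch {

    static final String GOVERNANCE_BOT = "governance-bot";

    /** True when the change event should be considered for triggering workflows. */
    static boolean shouldTrigger(String updatedBy) {
        return !GOVERNANCE_BOT.equals(updatedBy);
    }

    public static void main(String[] args) {
        System.out.println(shouldTrigger("alice"));
        System.out.println(shouldTrigger("governance-bot"));
    }
}
```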

---

## Open Questions

- [ ] **Message delivery mechanism**: Listener on task ChangeEvents vs direct hook in task status update.
- [ ] **TestCaseResult.incidentId linking**: If creation moves to async workflow, test result may store before incident exists. Recommendation: keep `getOrCreateIncident()` synchronous.
- [ ] **Cycle validation**: Should the compiler enforce that every non-terminal edge path routes back to a task node?

---

## Risks

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Cycle support in MainWorkflow | Blocks the design | Spike early. Workaround: invisible gateway node. |
| Task refactor not ready | No ChangeEvents for message delivery | Fall back to direct reportOutcome() from storeInternal() |
| Race condition | Message lost during follow-up execution | EventSubscriptionQuery returns null → skipped. Java-side buffer later. |
| ACT_RU growth | ~2 rows per open incident | 75K incidents = 150K rows. Measure in hardening phase. |
| Process orphaning | Never-resolved incidents linger | TTL handles deadlines. Batch sweep for the rest. |

---

## Follow-up Work

1. **Batch sweep** for orphaned processes.
2. **Migrate UserApprovalTask** (glossary) to same node type with `template: "approval"`.
3. **SLA timer escalation** — optional boundary timer using same infrastructure as TTL.

---

## References

- `TestCaseResolutionStatusRepository.storeInternal()` — Current state machine
- `WorkflowHandler.java` — Flowable ProcessEngine, message delivery
- `MainWorkflow.java` — BPMN compiler (needs cycle support)
- `UserApprovalTask.java` — Current UserTask pattern (being replaced)
- `NodeFactory.java` — Node type registration
- `WorkflowEventConsumer.java` — Event routing, governance-bot loop prevention
