feat: Real-Time Agent Execution Monitoring Dashboard (#448) #457

Open
BekkamMallishwari wants to merge 5 commits into sreerevanth:main from BekkamMallishwari:feature/issue-448-agent-monitoring-dashboard

Conversation

@BekkamMallishwari BekkamMallishwari commented Jun 22, 2026

This PR adds a comprehensive Real-Time Agent Monitoring & Analytics Dashboard to address Issue #448.

Features Included:

  • Live Agent Monitoring: Running, completed, failed, and pending status tracking via WebSocket integration.
  • Analytics Dashboard: Success vs failure rate visualization via Recharts PieChart, average execution time metrics, and performance comparison panels showing framework statistics.
  • Error Management: Centralized error tracking system showcasing recent failures and safety blocks.
  • Visualization Layer: Interactive charts and graphs (Recharts), including an Execution Timeline view over the last 24 hours.
  • Notifications System: Integrated Toast notifications for agent failure alerts, long-running task warnings, and safety blocks.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added API dashboard analytics (success rate, average execution time, per-framework breakdown, recent failures, and 24-hour trends).
    • Introduced a Visual Workflow Builder with workflow CRUD and a simulation runner with step-by-step traces.
    • Added workflow persistence for saved templates and definitions.
  • UI/Frontend
    • Updated the dashboard with success-rate visualization, framework breakdown, execution timeline, recent error panel, and failure toasts.
  • Security
    • Strengthened API key encryption key derivation.
  • Tests
    • Added end-to-end workflow endpoint lifecycle test coverage.

coderabbitai Bot commented Jun 22, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: f61a5630-f445-4ac4-87ce-73cdf3a0f861

📥 Commits

Reviewing files that changed from the base of the PR and between 65f06e2 and ffcec31.

📒 Files selected for processing (2)
  • agentwatch/api/server.py
  • agentwatch/cli/main.py
✅ Files skipped from review due to trivial changes (1)
  • agentwatch/cli/main.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • agentwatch/api/server.py

📝 Walkthrough

Adds dashboard analytics, workflow CRUD and simulation APIs, and a new workflow builder UI. Updates the dashboard to use analytics-driven charts and notifications, and includes related security, typing, and test adjustments.

Changes

Dashboard Analytics and Workflow Builder

| Layer | File(s) | Summary |
| --- | --- | --- |
| Backend security and typing updates | agentwatch/api/auth.py, agentwatch/core/models.py, agentwatch/protocol/mcp_server.py, agentwatch/rollback/engine.py, agentwatch/security/encryption.py, agentwatch/security/key_storage.py, agentwatch/cli/main.py | Updates SAML secret initialization, replaces the key derivation KDF, migrates the ORM base class, and adds local type annotations or type-checker suppressions in backend helpers. |
| Analytics contracts and endpoint | frontend/.eslintrc.json, frontend/lib/api.ts, agentwatch/api/server.py | Adds dashboard analytics TypeScript interfaces and ESLint config, then implements the authenticated analytics endpoint that returns success rate, execution timing, framework stats, recent errors, and hourly session trends. |
| Dashboard page updates | frontend/pages/index.tsx | Expands the dashboard with toast notifications, analytics refresh handling, recent errors, and the new analytics-driven charts and summary cards. |
| Workflow backend models, persistence, CRUD, and simulation | agentwatch/api/server.py | Adds workflow graph models, default templates, disk-backed workflow loading and saving, authenticated CRUD endpoints, and workflow simulation endpoints that traverse nodes and emit step results. |
| Workflow builder frontend | frontend/pages/workflow-builder.tsx | Adds the workflow builder page with workflow and graph types, node palette configuration, canvas editing, validation, save/delete actions, simulation controls, and the full builder layout. |
| Tests and test setup | tests/conftest.py, tests/test_safety.py, tests/test_workflows.py | Adds workflow lifecycle coverage and updates test setup and safety mocks for the new workflow and embedding behavior. |

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Server as API
  participant Store as workflows_json
  Client->>Server: GET/POST/DELETE /api/v1/workflows
  Server->>Store: load_workflows_from_disk()
  Store-->>Server: workflows + defaults
  Server->>Store: save_workflows_to_disk()
  Store-->>Server: persisted workflows
  Client->>Server: POST /api/v1/workflows/{id}/run
  Server->>Server: traverse nodes and build step traces
  Server-->>Client: simulation status, logs, outputs

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

enhancement, backend

Suggested reviewers

  • sreerevanth

Poem

🐇 I hop through charts and workflows bright,
With toast bells ringing late at night.
The nodes all dance, the dashboards gleam,
And analytics flow like a dreamy stream.
Hoppy code, and a brighter screen!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 21.43% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly captures the main dashboard/monitoring focus of the PR, even though it does not mention the additional workflow and backend changes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@BekkamMallishwari

Author

Hi @sreerevanth
Just checking if you had a chance to review this PR for Issue #448.

It includes the Real-Time Agent Execution Monitoring Dashboard and is ready for review. 🚀

github-actions Bot commented Jun 23, 2026

Contributor

🧪 PR Test Results

| Check | Result |
| --- | --- |
| Tests (pytest tests/) | ✅ success |
| Lint (ruff check .) | ❌ failure |
| Coverage (agentwatch) | 74.03% |

Python 3.12 · commit 65f06e2

@sreerevanth

Owner

@BekkamMallishwari ci failing

@BekkamMallishwari

Author

Hi @sreerevanth ,

I've pushed a fix commit (35904c8) addressing the CI lint/type-check issues. The PR is currently waiting for workflow approval before the checks can run again.

Once approved, the updated CI results should be available for review. Thanks!

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 12

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@agentwatch/api/server.py`:
- Around line 1461-1478: The read-modify-write pattern in the workflow update
logic is not protected against concurrent requests. Wrap the entire sequence
starting from load_workflows_from_disk() through
save_workflows_to_disk(workflows) in a file lock or synchronization mechanism to
ensure atomicity. Additionally, implement atomic file persistence by writing
workflow data to a temporary file first and then using atomic rename/replace
operations instead of directly overwriting the workflows file. Apply the same
synchronization pattern to the other affected workflow operations mentioned in
the comment (the delete workflow operation and any other save/delete functions
that follow similar patterns).
- Around line 1523-1535: The topological sort is enqueuing target nodes too
early, before all their dependencies are satisfied. When processing edges from a
completed node, the code must decrement the incoming_counts for each target node
and only enqueue the target when its incoming_counts reaches exactly 0 (meaning
all predecessors have completed). Currently, nodes are being added to the queue
as soon as a single predecessor finishes rather than waiting for all incoming
edges to be processed. Modify the edge processing logic to decrement
incoming_counts[target] for each edge processed and conditionally add target to
the queue only when its count becomes 0, ensuring proper topological ordering.
- Around line 1088-1091: The average execution time calculation has a mismatch
between the numerator and denominator. The total_time sum only includes sessions
where ended_at exists (due to the if s.ended_at filter), but the divisor uses
len(completed_sessions) which counts all completed sessions including those
without ended_at. To fix this, count only the sessions that actually have
ended_at values and use that count as the divisor for avg_execution_time instead
of len(completed_sessions). This ensures the metric is calculated correctly
based on the sessions that actually contributed to the total_time sum.

In `@frontend/pages/index.tsx`:
- Around line 338-339: The Retry button in index.tsx currently has an onClick
handler that only triggers a mock alert message. Replace this mock alert with a
real retry action that performs the actual retry operation (such as calling a
retry function or refetching data), or alternatively disable the button with a
disabled attribute and add visual/text indicators that the feature is coming
soon to avoid misleading users.
- Around line 350-351: The blockedData from the useSWR hook for blocked events
is not being included in the refresh flows, causing the Safety panel to display
stale data. Extract the mutate function from the blockedData useSWR call
(similar to how refreshAnalytics is extracted from the analytics useSWR call),
and include this mutate function in all refresh flows by calling it wherever
refreshAnalytics is called, including both the live refresh callback and the
manual Refresh button handler referenced at lines 375-378 and 411.
- Around line 392-395: The success value in the pieData array is incorrectly
calculated by subtracting failed_sessions from total_sessions, which
misclassifies running, blocked, and pending sessions as successful. Replace this
calculation in the pieData definition with the actual successful sessions count
from the summary object (likely available as a separate field like
successful_sessions or completed_sessions) to align with the true success-rate
metric semantics.
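
The pieData finding above is language-independent, so the arithmetic can be shown with a short Python sketch. The status labels and the `pie_slices` helper are hypothetical; the real summary object may expose different field names.

```python
from collections import Counter

# Hypothetical status labels; the real session model may use other names.
SUCCESS_STATUSES = {"completed"}
FAILURE_STATUSES = {"failed"}


def pie_slices(statuses):
    """Return (successful, failed, other) counts from a list of session statuses."""
    counts = Counter(statuses)
    successful = sum(counts[s] for s in SUCCESS_STATUSES)
    failed = sum(counts[s] for s in FAILURE_STATUSES)
    other = len(statuses) - successful - failed
    return successful, failed, other


statuses = ["completed", "completed", "failed", "running", "pending", "blocked"]
successful, failed, other = pie_slices(statuses)

# The subtraction flagged by the review counts running, pending, and blocked
# sessions as successes.
naive_success = len(statuses) - failed
```

With these six sample sessions, the explicit count gives 2 successes while the subtraction gives 5, which is the misclassification the review describes.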

In `@frontend/pages/workflow-builder.tsx`:
- Around line 411-413: The triggerSimulation function currently runs a
simulation using selectedWorkflowId before verifying that the workflow has been
saved to the backend. Since handleNewWorkflow and handleSaveWorkflow generate a
client-side ID before backend persistence, running a simulation with an unsaved
workflow will fail. Guard the triggerSimulation function to either check if the
workflow has been saved to the backend before proceeding, or automatically call
the save logic (similar to handleSaveWorkflow) before executing the simulation
POST request. This ensures the selectedWorkflowId exists on the backend when the
API endpoint is called.
- Around line 1134-1150: The conditional logic only handles two states (running
vs not running), but when simulationStatus is 'paused', it falls into the else
branch and shows a Run button tied to triggerSimulation, which restarts the
simulation instead of resuming. Add an additional condition to check if
simulationStatus === 'paused' and render a Resume button that simply calls
setSimulationStatus('running') without invoking triggerSimulation. This
preserves the paused simulation state and allows continuation from the last
step. Keep the Run Simulation button as a fallback for other states when
simulationStatus is neither 'running' nor 'paused'.
- Around line 640-642: The hint text in the div element claims "Mouse wheel to
zoom" but the canvas main element lacks an onWheel event handler to implement
this functionality. Either add an onWheel handler to the main element that calls
zoom functions (such as zoomIn or zoomOut based on the wheel direction), or
remove the "Mouse wheel to zoom" claim from the hint text to accurately reflect
the available functionality.
- Around line 34-36: The API_BASE constant uses NEXT_PUBLIC_API_HOST with a
hardcoded https:// scheme instead of following the documented convention.
Replace the conditional logic that checks for NEXT_PUBLIC_API_HOST and
constructs the URL with a simpler approach that uses NEXT_PUBLIC_API_URL
environment variable as the override, defaulting to /api/v1 when the environment
variable is not set. This ensures consistency with the documented convention
where the API URL defaults to /api/v1 and can be overridden via
NEXT_PUBLIC_API_URL.
- Line 38: The fetcher function and all fetch calls throughout the
workflow-builder.tsx page are missing the required X-Api-Key header that the
backend expects, causing 401 errors in production. Update the fetcher function
to include the X-Api-Key header in its fetch request, and ensure all other
direct fetch calls at the mentioned line numbers (97, 211, 230, 417) also
include this header. Consider either creating a centralized HTTP client that
both workflow-builder.tsx and the dashboard (index.tsx) can use, or expose the
API key via NEXT_PUBLIC_API_KEY environment variable and include it in all
request headers. This ensures consistent authentication across all workflow
operations including list, save, delete, and run endpoints.

In `@tests/test_workflows.py`:
- Around line 11-13: The client fixture and tests in the file are using a
shared, persistent disk-backed workflow state with a hardcoded workflow ID
(wf-test-1234), causing tests to contaminate each other and become
order-dependent. Modify the client fixture to create a temporary workflows
directory instead of using the default shared storage, then update all test
cases (lines 25-64) to use a unique or temporary workflow ID per test run
instead of the hardcoded wf-test-1234. This will isolate each test's state and
prevent cross-test contamination.
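
One way to get the isolation described in the tests/test_workflows.py finding is sketched below. It is an assumption-laden illustration: `isolated_workflow_store` is a hypothetical helper, the `workflows.json` file name is invented, and how the server is pointed at the temporary path depends on code not shown in this thread. In a pytest suite, the built-in `tmp_path` fixture could play the role of the temporary directory.

```python
import json
import tempfile
import uuid
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_workflow_store():
    """Yield a fresh workflows file and a unique workflow ID for a single test."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        store_path = Path(tmp_dir) / "workflows.json"
        store_path.write_text(json.dumps({}), encoding="utf-8")
        # A random suffix replaces the hardcoded wf-test-1234 ID.
        workflow_id = f"wf-test-{uuid.uuid4().hex[:8]}"
        yield store_path, workflow_id
    # The directory and everything in it is removed when the block exits.
```

Each test that enters the context manager gets its own empty store and its own ID, so no state survives from one test to the next.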

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 20c1a023-e75e-4376-a322-dc6e50bd238b

📥 Commits

Reviewing files that changed from the base of the PR and between 66eb9d7 and 35904c8.

📒 Files selected for processing (15)
  • agentwatch/api/auth.py
  • agentwatch/api/server.py
  • agentwatch/cli/main.py
  • agentwatch/core/models.py
  • agentwatch/protocol/mcp_server.py
  • agentwatch/rollback/engine.py
  • agentwatch/security/encryption.py
  • agentwatch/security/key_storage.py
  • frontend/.eslintrc.json
  • frontend/lib/api.ts
  • frontend/pages/index.tsx
  • frontend/pages/workflow-builder.tsx
  • tests/conftest.py
  • tests/test_safety.py
  • tests/test_workflows.py

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@agentwatch/api/server.py`:
- Around line 549-562: The record_failure call in the except block only captures
exceptions raised by handlers, but FastAPI returns HTTP errors (4xx/5xx) as
normal Response objects without raising exceptions. In the finally block where
latency is recorded, add a check for response.status_code to also record
failures for error responses. After confirming response is not None, inspect the
status code and call record_failure for responses with status_code >= 500 (and
optionally >= 400) to ensure all failure scenarios are captured in the failure
metric.
- Around line 767-769: The call to record_failure() on the line that processes
ExecutionStatus.FAILURE events is missing required arguments. Update the
record_failure() call to provide all required parameters: the endpoint
(currently passing only event.agent_id), the status_code (which needs to be
extracted from the event or derived from the failure context), and optionally an
error message. Reference the correct usage pattern in the middleware handler at
line 556 which calls record_failure with all three required arguments to match
the function signature defined in agentwatch/monitoring/metrics.py that expects
endpoint, status_code, and error_msg parameters.
- Around line 624-627: The metrics function at the /metrics endpoint lacks
authentication protection and uses a hardcoded media-type string. Add the
_require_api_key authentication decorator to the metrics function to match the
security pattern used by all other endpoints in this module, and replace the
hardcoded media-type string "text/plain; version=0.0.4" with the
library-provided constant from prometheus_client (imported from
prometheus_client.exposition) to ensure compatibility across versions and
prevent disclosing internal topology metrics to unauthenticated callers.
- Line 546: The endpoint variable is being assigned from request.url.path before
routing completes, but request.scope["route"] (which contains the matched route
template) is not populated until after await call_next(request) returns. Move
the endpoint assignment to capture the route template after the routing
completes by accessing it in both the except and finally blocks where it will be
available. Instead of using the concrete request.url.path which creates
unbounded metric labels with UUIDs and IDs, extract the route pattern from
request.scope["route"] after call_next(request) completes, then use this
endpoint value for both record_api_latency and record_failure metric calls.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 52ea8052-dc4b-4797-8339-332a6f783c30

📥 Commits

Reviewing files that changed from the base of the PR and between 35904c8 and 65f06e2.

📒 Files selected for processing (2)
  • agentwatch/api/server.py
  • agentwatch/cli/main.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • agentwatch/cli/main.py

@coderabbitai coderabbitai Bot left a comment

Contributor

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 4

🛑 Comments failed to post (4)
agentwatch/api/server.py (4)

546-546: 🚀 Performance & Scalability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and read the relevant file section
cat -n agentwatch/api/server.py | sed -n '540,565p'

Repository: sreerevanth/AgentWatch

Length of output: 1100


🏁 Script executed:

# Search for how the endpoint variable is used after line 546
rg -n "record_api_latency|record_failure" agentwatch/api/server.py -A 3 -B 3

Repository: sreerevanth/AgentWatch

Length of output: 1333


🏁 Script executed:

# Check if request.scope is used elsewhere in the file to understand the framework context
rg -n "request\.scope" agentwatch/api/server.py -B 2 -A 2

Repository: sreerevanth/AgentWatch

Length of output: 160


🏁 Script executed:

# Understand the middleware/handler structure
ast-grep outline agentwatch/api/server.py --view expanded

Repository: sreerevanth/AgentWatch

Length of output: 7238


🏁 Script executed:

# Search for any FastAPI/Starlette route handling patterns in the codebase
rg -n "scope\[" agentwatch/api/server.py -A 2 -B 2

Repository: sreerevanth/AgentWatch

Length of output: 160


🏁 Script executed:

# Check what framework is being used and understand the middleware context
head -100 agentwork/api/server.py | grep -E "import.*FastAPI|import.*Starlette|from.*fastapi|from.*starlette"

Repository: sreerevanth/AgentWatch

Length of output: 243


🏁 Script executed:

# Look for route/path patterns in request handling
rg -n "@app\.|@router\." agentwatch/api/server.py | head -20

Repository: sreerevanth/AgentWatch

Length of output: 1155


🏁 Script executed:

# Check requirements or dependencies to confirm FastAPI version
find . -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.py" | xargs cat 2>/dev/null | grep -i fastapi

Repository: sreerevanth/AgentWatch

Length of output: 182


🌐 Web query:

FastAPI Starlette middleware request.scope route when is it populated timing

💡 Result:

In FastAPI and Starlette, the routing information (specifically the 'route' key within the request.scope dictionary) is populated only after the routing phase has successfully matched the incoming request to an endpoint [1][2][3]. Because standard middleware (such as those implemented using BaseHTTPMiddleware or @app.middleware('http')) executes before the routing process, the 'route' key is not present in request.scope when the middleware first receives the request [4][2][5].

To access route information within a middleware, you must await the call_next(request) function [1][5]. This call proceeds to the application’s routing and endpoint execution logic. Once call_next returns, the routing phase has completed, and the 'route' object will typically be populated in the request.scope [1][3].

Implementation Summary:

  • Before await call_next(request): The 'route' key will be missing. Attempting to access request.scope['route'] at this stage will result in a KeyError [1][5].
  • After await call_next(request): The 'route' key is usually available in the scope, assuming the request successfully reached a route [1][3].

Always use request.scope.get('route') and check for existence to avoid potential errors if the request failed to match any route [3].

Alternative approaches: If you require route information before the endpoint logic executes, consider using a custom APIRoute class (in FastAPI) to hook into the route handler, which provides more control over the execution lifecycle than standard ASGI middleware [1][2].

Citations:


🏁 Script executed:

# Look at the metric recording functions to understand their signature/usage
rg -n "def record_api_latency|def record_failure" agentwatch/monitoring/metrics.py -A 5

Repository: sreerevanth/AgentWatch

Length of output: 632


🏁 Script executed:

# Check if there are any other middlewares or examples of route access in request.scope
rg -n "request\.scope\[.*route" agentwatch/ -r

Repository: sreerevanth/AgentWatch

Length of output: 574


🏁 Script executed:

# Check the Starlette Route class to understand the path attribute
rg -n "class Route|\.path" agentwatch/ -t py | grep -i route | head -10

Repository: sreerevanth/AgentWatch

Length of output: 210


🏁 Script executed:

# Create a small verification script to understand the Starlette/FastAPI route structure
python3 << 'PY'
# Verify Starlette/FastAPI route object structure
# Looking at typical usage patterns
print("In Starlette/FastAPI:")
print("- request.scope['route'] is a Route object (or Match object)")
print("- Route objects have a 'path' attribute that contains the template")
print("- Example: '/api/v1/sessions/{session_id}' instead of concrete '/api/v1/sessions/abc-123'")
print("\nThe suggested diff captures endpoint BEFORE call_next, which is the problem:")
print("- Line 546: endpoint = request.url.path  (BEFORE call_next)")
print("- At this point, scope['route'] is NOT populated")
print("- So getattr(route, 'path', ...) would fail or return None")
print("\nCorrect approach: Capture endpoint AFTER call_next returns")
print("(in except or finally blocks where routing HAS completed)")
PY

Repository: sreerevanth/AgentWatch

Length of output: 764


Use the matched route template instead of the raw request path to prevent metric label cardinality explosion.

The metric labels record_api_latency and record_failure (lines 555, 556, 562) currently receive the concrete path from request.url.path, which includes dynamic segments like UUIDs and IDs. This creates unbounded label values—each /api/v1/sessions/<id>, /api/v1/workflows/<uuid>, etc., becomes a separate time-series in Prometheus, causing cardinality explosion and memory pressure in the scrape backend.

Implementation note: The suggested diff in the original comment has a timing issue. request.scope["route"] is not populated until after routing completes, which occurs during await call_next(request) (line 550). The endpoint should be captured after the call returns, not before. Capture it in the except and finally blocks where routing has definitively completed:

Corrected approach
     async def record_metrics(request: Request, call_next):
         """Record API latency metrics for all requests including failures."""
         start_time = time.time()
-        endpoint = request.url.path
         response = None
 
         try:
             response = await call_next(request)
             return response
         except Exception as exc:
+            route = request.scope.get("route")
+            endpoint = route.path if route else request.url.path
             duration = time.time() - start_time
             record_api_latency(endpoint, duration)
             record_failure(endpoint, 500, str(exc))
             raise
         finally:
             if response is not None:
+                route = request.scope.get("route")
+                endpoint = route.path if route else request.url.path
                 duration = time.time() - start_time
                 record_api_latency(endpoint, duration)

This ensures the route template is available for both success and failure paths.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@agentwatch/api/server.py` at line 546, The endpoint variable is being
assigned from request.url.path before routing completes, but
request.scope["route"] (which contains the matched route template) is not
populated until after await call_next(request) returns. Move the endpoint
assignment to capture the route template after the routing completes by
accessing it in both the except and finally blocks where it will be available.
Instead of using the concrete request.url.path which creates unbounded metric
labels with UUIDs and IDs, extract the route pattern from request.scope["route"]
after call_next(request) completes, then use this endpoint value for both
record_api_latency and record_failure metric calls.

549-562: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Failure metric misses error responses returned without raising.

The docstring states latency is recorded "for all requests including failures," but record_failure is only invoked in the except block. FastAPI converts HTTPException (and validation errors) into normal Response objects, so they never propagate here — 4xx/5xx responses returned by handlers are silently excluded from the failure metric, while only unhandled exceptions are counted.

Consider inspecting response.status_code in the finally block to record failures for >= 500 (and optionally >= 400) responses.
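
The check could look roughly like the following, written without the framework so it runs on its own. The names `is_failure_status` and `observe` are hypothetical, and the recorder callables are injected stand-ins for `record_api_latency` and `record_failure`. Here `call_next` returns a bare status code, whereas the real middleware would read `response.status_code`.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Tuple


def is_failure_status(status_code: int, include_client_errors: bool = False) -> bool:
    """Treat 5xx as failures; optionally treat 4xx as failures too."""
    return status_code >= (400 if include_client_errors else 500)


async def observe(
    endpoint: str,
    call_next: Callable[[], Awaitable[int]],
    record_latency: Callable[[str, float], None],
    record_failure: Callable[[str, int, str], None],
) -> int:
    """Framework-free stand-in for the middleware body (an assumption, not the PR's code)."""
    start = time.perf_counter()
    try:
        status = await call_next()
    except Exception as exc:
        # Unhandled exception: record latency, count it as a 500, and re-raise.
        record_latency(endpoint, time.perf_counter() - start)
        record_failure(endpoint, 500, str(exc))
        raise
    record_latency(endpoint, time.perf_counter() - start)
    if is_failure_status(status):
        # Error response that was returned without raising.
        record_failure(endpoint, status, "")
    return status


failures: List[Tuple[str, int, str]] = []


async def returns_503() -> int:
    return 503


asyncio.run(
    observe(
        "/demo",
        returns_503,
        lambda endpoint, duration: None,
        lambda endpoint, status, msg: failures.append((endpoint, status, msg)),
    )
)
print(failures)
```

Keeping the threshold in one small function makes the 4xx decision a single-argument change rather than an edit to the middleware flow.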

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@agentwatch/api/server.py` around lines 549 - 562, The record_failure call in
the except block only captures exceptions raised by handlers, but FastAPI
returns HTTP errors (4xx/5xx) as normal Response objects without raising
exceptions. In the finally block where latency is recorded, add a check for
response.status_code to also record failures for error responses. After
confirming response is not None, inspect the status code and call record_failure
for responses with status_code >= 500 (and optionally >= 400) to ensure all
failure scenarios are captured in the failure metric.

624-627: 🔒 Security & Privacy | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the file
fd -t f "server.py" | grep -E "agentwatch.*server"

Repository: sreerevanth/AgentWatch

Length of output: 219


🏁 Script executed:

# Get overview of the server.py file
wc -l agentwatch/api/server.py

Repository: sreerevanth/AgentWatch

Length of output: 190


🏁 Script executed:

# Examine the metrics endpoint and surrounding context
sed -n '620,635p' agentwatch/api/server.py

Repository: sreerevanth/AgentWatch

Length of output: 647


🏁 Script executed:

# Check how other endpoints use authentication
grep -n "_require_api_key" agentwatch/api/server.py | head -20

Repository: sreerevanth/AgentWatch

Length of output: 1526


🏁 Script executed:

# Check imports at the top of the file to see what's currently imported
head -50 agentwatch/api/server.py

Repository: sreerevanth/AgentWatch

Length of output: 1441


🏁 Script executed:

# Verify CONTENT_TYPE_LATEST is available in prometheus_client
python3 -c "from prometheus_client import CONTENT_TYPE_LATEST; print(CONTENT_TYPE_LATEST)" 2>&1 || echo "CONTENT_TYPE_LATEST not found"

Repository: sreerevanth/AgentWatch

Length of output: 321


🏁 Script executed:

# Check if there's any documentation or comments about metrics endpoint authentication
grep -n -B5 -A10 "/metrics" agentwatch/api/server.py

Repository: sreerevanth/AgentWatch

Length of output: 707


🏁 Script executed:

# Search for any existing usage of CONTENT_TYPE_LATEST in the codebase
rg "CONTENT_TYPE_LATEST" --type py

Repository: sreerevanth/AgentWatch

Length of output: 160


🏁 Script executed:

# Check requirements or setup files to see prometheus_client version
fd -t f "requirements" | xargs cat 2>/dev/null | grep -i prometheus

Repository: sreerevanth/AgentWatch

Length of output: 160


🏁 Script executed:

# Look for any prometheus configuration or documentation
find . -type f \( -name "*.md" -o -name "*.txt" \) | xargs grep -l "prometheus" 2>/dev/null | head -5

Repository: sreerevanth/AgentWatch

Length of output: 220


🏁 Script executed:

# Check if there's any security documentation about metrics
rg "metrics" agentwatch/api/server.py -i -B2 -A2 | head -40

Repository: sreerevanth/AgentWatch

Length of output: 1061


Add authentication to /metrics endpoint and use library constant for content-type.

The /metrics endpoint at lines 624-627 lacks authentication while all other endpoints in this module use _require_api_key. Prometheus output includes per-endpoint/per-agent failure counts and latencies, disclosing internal topology and traffic patterns to anonymous callers. Additionally, the media-type is hardcoded; prefer the library-provided constant to guarantee correctness across prometheus_client versions.

Suggested change
-from prometheus_client import generate_latest
+from prometheus_client import CONTENT_TYPE_LATEST, generate_latest
@@
-@app.get("/metrics")
-async def metrics() -> Response:
+@app.get("/metrics")
+async def metrics(_auth: None = Depends(_require_api_key)) -> Response:
     """Expose Prometheus metrics endpoint."""
-    return Response(content=generate_latest(), media_type="text/plain; version=0.0.4")
+    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@agentwatch/api/server.py` around lines 624 - 627, The metrics function at the
/metrics endpoint lacks authentication protection and uses a hardcoded
media-type string. Add the _require_api_key dependency (via Depends) to the
metrics function to match the security pattern used by the other endpoints in
this module, so that internal topology metrics are not disclosed to
unauthenticated callers. Replace the hardcoded media-type string "text/plain;
version=0.0.4" with the CONTENT_TYPE_LATEST constant imported from
prometheus_client to ensure compatibility across versions.

767-769: 🎯 Functional Correctness | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm record_failure signature and all call sites/arity.
ast-grep run --pattern 'def record_failure($$$):' --lang python agentwatch/monitoring/metrics.py
rg -nP '\brecord_failure\s*\(' --type=py

Repository: sreerevanth/AgentWatch

Length of output: 160


🏁 Script executed:

# First, check if the files exist and their structure
fd -t f 'metrics.py' agentwatch/
fd -t f 'server.py' agentwatch/

Repository: sreerevanth/AgentWatch

Length of output: 252


🏁 Script executed:

# Read the record_failure definition in metrics.py
cat -n agentwatch/monitoring/metrics.py | grep -A 10 'def record_failure'

Repository: sreerevanth/AgentWatch

Length of output: 622


🏁 Script executed:

# Read the specific lines in server.py (767-769)
sed -n '760,775p' agentwatch/api/server.py

Repository: sreerevanth/AgentWatch

Length of output: 759


🏁 Script executed:

# Read the code around line 556 in server.py
sed -n '550,565p' agentwatch/api/server.py

Repository: sreerevanth/AgentWatch

Length of output: 672


🏁 Script executed:

# Find all occurrences of record_failure calls in the codebase
rg 'record_failure\s*\(' agentwatch/ -n --type py -B 1 -A 1

Repository: sreerevanth/AgentWatch

Length of output: 788


record_failure() is called with incorrect arity and will raise TypeError when processing FAILURE events.

The function signature (agentwatch/monitoring/metrics.py:37) requires record_failure(endpoint: str, status_code: int, error_msg: str = "") with two required positional arguments. The call at line 769 passes only event.agent_id, missing the required status_code parameter. This will raise TypeError: record_failure() missing 1 required positional argument: 'status_code' on every FAILURE event, crashing the ingest handler. The middleware at line 556 correctly calls it with all three arguments, confirming the mismatch.

Suggested fix
     if event.agent_id and hasattr(event, "status"):
         if getattr(event, "status", None) == ExecutionStatus.FAILURE:
-            record_failure(event.agent_id)
+            record_failure(event.agent_id, 500)

Note: the function uses its first parameter (endpoint) as the agent_id label, but the event handler is passing event.agent_id in place of an endpoint. Consider whether this semantic difference impacts your metrics.
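
The arity error can be reproduced with a stub that copies the signature quoted above. The stub body and the `agent_id` value are assumptions for illustration; the real function lives in agentwatch/monitoring/metrics.py and its body is not shown in this review.

```python
from typing import List, Tuple

recorded: List[Tuple[str, int, str]] = []


def record_failure(endpoint: str, status_code: int, error_msg: str = "") -> None:
    """Stub with the signature quoted in the review; the body is hypothetical."""
    recorded.append((endpoint, status_code, error_msg))


agent_id = "agent-123"  # hypothetical stand-in for event.agent_id

arity_error = ""
try:
    record_failure(agent_id)  # same call shape as line 769: status_code is missing
except TypeError as exc:
    arity_error = str(exc)
    print(arity_error)

record_failure(agent_id, 500)  # call shape from the suggested fix
print(recorded)
```

A static type checker would also flag the one-argument call, which is a cheap way to catch this class of mismatch before it reaches the ingest handler.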

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@agentwatch/api/server.py` around lines 767 - 769, The call to
record_failure() on the line that processes ExecutionStatus.FAILURE events is
missing a required argument. Update the record_failure() call to provide both
required parameters: the endpoint (event.agent_id is currently the only argument
passed) and the status_code (which needs to be extracted from the event or
derived from the failure context), plus optionally an error message. Reference
the usage pattern in the middleware handler at line 556, which calls
record_failure with all three arguments, to match the function signature defined
in agentwatch/monitoring/metrics.py that expects endpoint, status_code, and an
optional error_msg.

@sreerevanth

Owner

@BekkamMallishwari can you please fix this?

@sreerevanth

Owner

@BekkamMallishwari there are merge conflicts. Could you please join Discord? It would be good for addressing the changes.
