feat: Real-Time Agent Execution Monitoring Dashboard (#448) #457
BekkamMallishwari wants to merge 5 commits into
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID:
📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough
Adds dashboard analytics, workflow CRUD and simulation APIs, and a new workflow builder UI. Updates the dashboard to use analytics-driven charts and notifications, and includes related security, typing, and test adjustments.
Changes
Dashboard Analytics and Workflow Builder
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Server as API
participant Store as workflows_json
Client->>Server: GET/POST/DELETE /api/v1/workflows
Server->>Store: load_workflows_from_disk()
Store-->>Server: workflows + defaults
Server->>Store: save_workflows_to_disk()
Store-->>Server: persisted workflows
Client->>Server: POST /api/v1/workflows/{id}/run
Server->>Server: traverse nodes and build step traces
Server-->>Client: simulation status, logs, outputs
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~75 minutes
Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
|
Hi @sreerevanth, this PR includes the Real-Time Agent Execution Monitoring Dashboard and is ready for review. 🚀
🧪 PR Test Results
Python 3.12 · commit 65f06e2 |
|
@BekkamMallishwari CI is failing.
|
Hi @sreerevanth, I've pushed a fix commit ( Once approved, the updated CI results should be available for review. Thanks!
Actionable comments posted: 12
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@agentwatch/api/server.py`:
- Around line 1461-1478: The read-modify-write pattern in the workflow update
logic is not protected against concurrent requests. Wrap the entire sequence
starting from load_workflows_from_disk() through
save_workflows_to_disk(workflows) in a file lock or synchronization mechanism to
ensure atomicity. Additionally, implement atomic file persistence by writing
workflow data to a temporary file first and then using atomic rename/replace
operations instead of directly overwriting the workflows file. Apply the same
synchronization pattern to the other affected workflow operations mentioned in
the comment (the delete workflow operation and any other save/delete functions
that follow similar patterns).
- Around line 1523-1535: The topological sort is enqueuing target nodes too
early, before all their dependencies are satisfied. When processing edges from a
completed node, the code must decrement the incoming_counts for each target node
and only enqueue the target when its incoming_counts reaches exactly 0 (meaning
all predecessors have completed). Currently, nodes are being added to the queue
as soon as a single predecessor finishes rather than waiting for all incoming
edges to be processed. Modify the edge processing logic to decrement
incoming_counts[target] for each edge processed and conditionally add target to
the queue only when its count becomes 0, ensuring proper topological ordering.
- Around line 1088-1091: The average execution time calculation has a mismatch
between the numerator and denominator. The total_time sum only includes sessions
where ended_at exists (due to the if s.ended_at filter), but the divisor uses
len(completed_sessions) which counts all completed sessions including those
without ended_at. To fix this, count only the sessions that actually have
ended_at values and use that count as the divisor for avg_execution_time instead
of len(completed_sessions). This ensures the metric is calculated correctly
based on the sessions that actually contributed to the total_time sum.
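The dependency-counting rule from the topological-sort comment above can be sketched as follows. This is a minimal illustration of Kahn's algorithm, not the code in `server.py`: the function name `execution_order`, the `(source, target)` edge tuples, and the shape of `incoming_counts` are assumptions chosen to mirror the wording of the review.

```python
from collections import deque


def execution_order(nodes, edges):
    """Return node ids so that every node comes after all of its predecessors.

    `nodes` is an iterable of node ids and `edges` is a list of
    (source, target) pairs. Both shapes are assumptions made for this sketch.
    """
    incoming_counts = {node: 0 for node in nodes}
    targets = {node: [] for node in incoming_counts}
    for source, target in edges:
        targets[source].append(target)
        incoming_counts[target] += 1

    # Start only from nodes that have no predecessors at all.
    queue = deque(node for node, count in incoming_counts.items() if count == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for target in targets[node]:
            # Decrement once per processed edge; enqueue only at exactly zero.
            incoming_counts[target] -= 1
            if incoming_counts[target] == 0:
                queue.append(target)

    # Nodes left with a positive count can only be part of a cycle.
    if len(order) != len(incoming_counts):
        raise ValueError("workflow graph contains a cycle")
    return order
```

In a diamond-shaped graph, where `a` feeds `b` and `c` and both feed `d`, the node `d` is enqueued only after both `b` and `c` have been processed, which is the behaviour the comment asks for.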
In `@frontend/pages/index.tsx`:
- Around line 338-339: The Retry button in index.tsx currently has an onClick
handler that only triggers a mock alert message. Replace this mock alert with a
real retry action that performs the actual retry operation (such as calling a
retry function or refetching data), or alternatively disable the button with a
disabled attribute and add visual/text indicators that the feature is coming
soon to avoid misleading users.
- Around line 350-351: The blockedData from the useSWR hook for blocked events
is not being included in the refresh flows, causing the Safety panel to display
stale data. Extract the mutate function from the blockedData useSWR call
(similar to how refreshAnalytics is extracted from the analytics useSWR call),
and include this mutate function in all refresh flows by calling it wherever
refreshAnalytics is called, including both the live refresh callback and the
manual Refresh button handler referenced at lines 375-378 and 411.
- Around line 392-395: The success value in the pieData array is incorrectly
calculated by subtracting failed_sessions from total_sessions, which
misclassifies running, blocked, and pending sessions as successful. Replace this
calculation in the pieData definition with the actual successful sessions count
from the summary object (likely available as a separate field like
successful_sessions or completed_sessions) to align with the true success-rate
metric semantics.
In `@frontend/pages/workflow-builder.tsx`:
- Around line 411-413: The triggerSimulation function currently runs a
simulation using selectedWorkflowId before verifying that the workflow has been
saved to the backend. Since handleNewWorkflow and handleSaveWorkflow generate a
client-side ID before backend persistence, running a simulation with an unsaved
workflow will fail. Guard the triggerSimulation function to either check if the
workflow has been saved to the backend before proceeding, or automatically call
the save logic (similar to handleSaveWorkflow) before executing the simulation
POST request. This ensures the selectedWorkflowId exists on the backend when the
API endpoint is called.
- Around line 1134-1150: The conditional logic only handles two states (running
vs not running), but when simulationStatus is 'paused', it falls into the else
branch and shows a Run button tied to triggerSimulation, which restarts the
simulation instead of resuming. Add an additional condition to check if
simulationStatus === 'paused' and render a Resume button that simply calls
setSimulationStatus('running') without invoking triggerSimulation. This
preserves the paused simulation state and allows continuation from the last
step. Keep the Run Simulation button as a fallback for other states when
simulationStatus is neither 'running' nor 'paused'.
- Around line 640-642: The hint text in the div element claims "Mouse wheel to
zoom" but the canvas main element lacks an onWheel event handler to implement
this functionality. Either add an onWheel handler to the main element that calls
zoom functions (such as zoomIn or zoomOut based on the wheel direction), or
remove the "Mouse wheel to zoom" claim from the hint text to accurately reflect
the available functionality.
- Around line 34-36: The API_BASE constant uses NEXT_PUBLIC_API_HOST with a
hardcoded https:// scheme instead of following the documented convention.
Replace the conditional logic that checks for NEXT_PUBLIC_API_HOST and
constructs the URL with a simpler approach that uses NEXT_PUBLIC_API_URL
environment variable as the override, defaulting to /api/v1 when the environment
variable is not set. This ensures consistency with the documented convention
where the API URL defaults to /api/v1 and can be overridden via
NEXT_PUBLIC_API_URL.
- Line 38: The fetcher function and all fetch calls throughout the
workflow-builder.tsx page are missing the required X-Api-Key header that the
backend expects, causing 401 errors in production. Update the fetcher function
to include the X-Api-Key header in its fetch request, and ensure all other
direct fetch calls at the mentioned line numbers (97, 211, 230, 417) also
include this header. Consider either creating a centralized HTTP client that
both workflow-builder.tsx and the dashboard (index.tsx) can use, or expose the
API key via NEXT_PUBLIC_API_KEY environment variable and include it in all
request headers. This ensures consistent authentication across all workflow
operations including list, save, delete, and run endpoints.
In `@tests/test_workflows.py`:
- Around line 11-13: The client fixture and tests in the file are using a
shared, persistent disk-backed workflow state with a hardcoded workflow ID
(wf-test-1234), causing tests to contaminate each other and become
order-dependent. Modify the client fixture to create a temporary workflows
directory instead of using the default shared storage, then update all test
cases (lines 25-64) to use a unique or temporary workflow ID per test run
instead of the hardcoded wf-test-1234. This will isolate each test's state and
prevent cross-test contamination.
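The isolation described for `tests/test_workflows.py` could look roughly like the sketch below. It uses only the standard library; the helper names `isolated_workflow_store` and `unique_workflow_id`, the `workflows.json` file name, and the empty-list initial content are all hypothetical. How the server is pointed at the temporary path depends on the project's configuration and is not shown here.

```python
import json
import tempfile
import uuid
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_workflow_store():
    """Yield the path of an empty workflows file inside a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        store_path = Path(tmp_dir) / "workflows.json"
        # Assumed initial content: an empty JSON list of workflows.
        store_path.write_text(json.dumps([]), encoding="utf-8")
        yield store_path
    # The directory and everything in it is removed when the block exits.


def unique_workflow_id(prefix="wf-test"):
    """Return an id that will not collide with another test's workflow."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"
```

A pytest fixture could wrap `isolated_workflow_store()` and yield a client configured against that path, with each test calling `unique_workflow_id()` instead of reusing `wf-test-1234`.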
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 20c1a023-e75e-4376-a322-dc6e50bd238b
📒 Files selected for processing (15)
- agentwatch/api/auth.py
- agentwatch/api/server.py
- agentwatch/cli/main.py
- agentwatch/core/models.py
- agentwatch/protocol/mcp_server.py
- agentwatch/rollback/engine.py
- agentwatch/security/encryption.py
- agentwatch/security/key_storage.py
- frontend/.eslintrc.json
- frontend/lib/api.ts
- frontend/pages/index.tsx
- frontend/pages/workflow-builder.tsx
- tests/conftest.py
- tests/test_safety.py
- tests/test_workflows.py
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@agentwatch/api/server.py`:
- Around line 549-562: The record_failure call in the except block only captures
exceptions raised by handlers, but FastAPI returns HTTP errors (4xx/5xx) as
normal Response objects without raising exceptions. In the finally block where
latency is recorded, add a check for response.status_code to also record
failures for error responses. After confirming response is not None, inspect the
status code and call record_failure for responses with status_code >= 500 (and
optionally >= 400) to ensure all failure scenarios are captured in the failure
metric.
- Around line 767-769: The call to record_failure() on the line that processes
ExecutionStatus.FAILURE events is missing required arguments. Update the
record_failure() call to provide all required parameters: the endpoint
(currently passing only event.agent_id), the status_code (which needs to be
extracted from the event or derived from the failure context), and optionally an
error message. Reference the correct usage pattern in the middleware handler at
line 556 which calls record_failure with all three required arguments to match
the function signature defined in agentwatch/monitoring/metrics.py that expects
endpoint, status_code, and error_msg parameters.
- Around line 624-627: The metrics function at the /metrics endpoint lacks
authentication protection and uses a hardcoded media-type string. Add the
_require_api_key authentication decorator to the metrics function to match the
security pattern used by all other endpoints in this module, and replace the
hardcoded media-type string "text/plain; version=0.0.4" with the
library-provided constant from prometheus_client (imported from
prometheus_client.exposition) to ensure compatibility across versions and
prevent disclosing internal topology metrics to unauthenticated callers.
- Line 546: The endpoint variable is being assigned from request.url.path before
routing completes, but request.scope["route"] (which contains the matched route
template) is not populated until after await call_next(request) returns. Move
the endpoint assignment to capture the route template after the routing
completes by accessing it in both the except and finally blocks where it will be
available. Instead of using the concrete request.url.path which creates
unbounded metric labels with UUIDs and IDs, extract the route pattern from
request.scope["route"] after call_next(request) completes, then use this
endpoint value for both record_api_latency and record_failure metric calls.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 52ea8052-dc4b-4797-8339-332a6f783c30
📒 Files selected for processing (2)
- agentwatch/api/server.py
- agentwatch/cli/main.py
🚧 Files skipped from review as they are similar to previous changes (1)
- agentwatch/cli/main.py
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 4
🛑 Comments failed to post (4)
agentwatch/api/server.py (4)
546-546: 🚀 Performance & Scalability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and read the relevant file section
cat -n agentwatch/api/server.py | sed -n '540,565p'
Repository: sreerevanth/AgentWatch
Length of output: 1100
🏁 Script executed:
# Search for how the endpoint variable is used after line 546
rg -n "record_api_latency|record_failure" agentwatch/api/server.py -A 3 -B 3
Repository: sreerevanth/AgentWatch
Length of output: 1333
🏁 Script executed:
# Check if request.scope is used elsewhere in the file to understand the framework context
rg -n "request\.scope" agentwatch/api/server.py -B 2 -A 2
Repository: sreerevanth/AgentWatch
Length of output: 160
🏁 Script executed:
# Understand the middleware/handler structure
ast-grep outline agentwatch/api/server.py --view expanded
Repository: sreerevanth/AgentWatch
Length of output: 7238
🏁 Script executed:
# Search for any FastAPI/Starlette route handling patterns in the codebase
rg -n "scope\[" agentwatch/api/server.py -A 2 -B 2
Repository: sreerevanth/AgentWatch
Length of output: 160
🏁 Script executed:
# Check what framework is being used and understand the middleware context
head -100 agentwork/api/server.py | grep -E "import.*FastAPI|import.*Starlette|from.*fastapi|from.*starlette"
Repository: sreerevanth/AgentWatch
Length of output: 243
🏁 Script executed:
# Look for route/path patterns in request handling
rg -n "`@app`\.|`@router`\." agentwatch/api/server.py | head -20
Repository: sreerevanth/AgentWatch
Length of output: 1155
🏁 Script executed:
# Check requirements or dependencies to confirm FastAPI version
find . -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.py" | xargs cat 2>/dev/null | grep -i fastapi
Repository: sreerevanth/AgentWatch
Length of output: 182
🌐 Web query:
FastAPI Starlette middleware request.scope route when is it populated timing
💡 Result:
In FastAPI and Starlette, the routing information (specifically the 'route' key within the request.scope dictionary) is populated only after the routing phase has successfully matched the incoming request to an endpoint [1][2][3]. Because standard middleware (such as those implemented using BaseHTTPMiddleware or @app.middleware('http')) executes before the routing process, the 'route' key is not present in request.scope when the middleware first receives the request [4][2][5].
To access route information within a middleware, you must await the call_next(request) function [1][5]. This call proceeds to the application's routing and endpoint execution logic. Once call_next returns, the routing phase has completed, and the 'route' object will typically be populated in the request.scope [1][3].
Implementation Summary:
- Before await call_next(request): The 'route' key will be missing. Attempting to access request.scope['route'] at this stage will result in a KeyError [1][5].
- After await call_next(request): The 'route' key is usually available in the scope, assuming the request successfully reached a route [1][3].
Always use request.scope.get('route') and check for existence to avoid potential errors if the request failed to match any route [3].
Alternative approaches: If you require route information before the endpoint logic executes, consider using a custom APIRoute class (in FastAPI) to hook into the route handler, which provides more control over the execution lifecycle than standard ASGI middleware [1][2].
Citations:
- 1: https://python.tutorialink.com/how-to-get-routes-name-using-fastapi-starlette/
- 2: https://stackoverflow.com/questions/78739451/get-path-template-from-starlette-request-from-a-middleware
- 3: fastapi/fastapi#12975
- 4: Kludex/starlette#2565
- 5: fastapi/fastapi#11450
🏁 Script executed:
# Look at the metric recording functions to understand their signature/usage
rg -n "def record_api_latency|def record_failure" agentwatch/monitoring/metrics.py -A 5
Repository: sreerevanth/AgentWatch
Length of output: 632
🏁 Script executed:
# Check if there are any other middlewares or examples of route access in request.scope
rg -n "request\.scope\[.*route" agentwatch/ -r
Repository: sreerevanth/AgentWatch
Length of output: 574
🏁 Script executed:
# Check the Starlette Route class to understand the path attribute
rg -n "class Route|\.path" agentwatch/ -t py | grep -i route | head -10
Repository: sreerevanth/AgentWatch
Length of output: 210
🏁 Script executed:
# Create a small verification script to understand the Starlette/FastAPI route structure
python3 << 'PY'
# Verify Starlette/FastAPI route object structure
# Looking at typical usage patterns
print("In Starlette/FastAPI:")
print("- request.scope['route'] is a Route object (or Match object)")
print("- Route objects have a 'path' attribute that contains the template")
print("- Example: '/api/v1/sessions/{session_id}' instead of concrete '/api/v1/sessions/abc-123'")
print("\nThe suggested diff captures endpoint BEFORE call_next, which is the problem:")
print("- Line 546: endpoint = request.url.path (BEFORE call_next)")
print("- At this point, scope['route'] is NOT populated")
print("- So getattr(route, 'path', ...) would fail or return None")
print("\nCorrect approach: Capture endpoint AFTER call_next returns")
print("(in except or finally blocks where routing HAS completed)")
PY
Repository: sreerevanth/AgentWatch
Length of output: 764
Use the matched route template instead of the raw request path to prevent metric label cardinality explosion.
The metric labels `record_api_latency` and `record_failure` (lines 555, 556, 562) currently receive the concrete path from `request.url.path`, which includes dynamic segments like UUIDs and IDs. This creates unbounded label values: each `/api/v1/sessions/<id>`, `/api/v1/workflows/<uuid>`, etc., becomes a separate time-series in Prometheus, causing cardinality explosion and memory pressure in the scrape backend.
Implementation note: The suggested diff in the original comment has a timing issue. `request.scope["route"]` is not populated until after routing completes, which occurs during `await call_next(request)` (line 550). The endpoint should be captured after the call returns, not before. Capture it in the `except` and `finally` blocks where routing has definitively completed:
Corrected approach
 async def record_metrics(request: Request, call_next):
     """Record API latency metrics for all requests including failures."""
     start_time = time.time()
-    endpoint = request.url.path
     response = None
     try:
         response = await call_next(request)
         return response
     except Exception as exc:
+        route = request.scope.get("route")
+        endpoint = route.path if route else request.url.path
         duration = time.time() - start_time
         record_api_latency(endpoint, duration)
         record_failure(endpoint, 500, str(exc))
         raise
     finally:
         if response is not None:
+            route = request.scope.get("route")
+            endpoint = route.path if route else request.url.path
             duration = time.time() - start_time
             record_api_latency(endpoint, duration)
This ensures the route template is available for both success and failure paths.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@agentwatch/api/server.py` at line 546, The endpoint variable is being assigned from request.url.path before routing completes, but request.scope["route"] (which contains the matched route template) is not populated until after await call_next(request) returns. Move the endpoint assignment to capture the route template after the routing completes by accessing it in both the except and finally blocks where it will be available. Instead of using the concrete request.url.path which creates unbounded metric labels with UUIDs and IDs, extract the route pattern from request.scope["route"] after call_next(request) completes, then use this endpoint value for both record_api_latency and record_failure metric calls.
549-562: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Failure metric misses error responses returned without raising.
The docstring states latency is recorded "for all requests including failures," but `record_failure` is only invoked in the `except` block. FastAPI converts `HTTPException` (and validation errors) into normal `Response` objects, so they never propagate here. As a result, 4xx/5xx responses returned by handlers are silently excluded from the failure metric, while only unhandled exceptions are counted.
Consider inspecting `response.status_code` in the `finally` block to record failures for `>= 500` (and optionally `>= 400`) responses.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@agentwatch/api/server.py` around lines 549 - 562, The record_failure call in the except block only captures exceptions raised by handlers, but FastAPI returns HTTP errors (4xx/5xx) as normal Response objects without raising exceptions. In the finally block where latency is recorded, add a check for response.status_code to also record failures for error responses. After confirming response is not None, inspect the status code and call record_failure for responses with status_code >= 500 (and optionally >= 400) to ensure all failure scenarios are captured in the failure metric.
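The status-code check suggested in this comment can be sketched without any framework dependency. The helper name `record_error_response` and the `threshold` parameter are hypothetical; the `(endpoint, status_code, error_msg)` argument order follows the `record_failure` signature quoted elsewhere in this review. The recorder is passed in as an argument only so the sketch stays self-contained.

```python
def record_error_response(endpoint, status_code, record_failure, threshold=500):
    """Record a failure for a response that was returned with an error status.

    Returns True when a failure was recorded and False otherwise. Pass
    threshold=400 to also count client errors.
    """
    if status_code >= threshold:
        record_failure(endpoint, status_code, f"HTTP {status_code}")
        return True
    return False


# Example with a stand-in recorder that just collects its arguments.
recorded = []
record_error_response("/api/v1/workflows/{id}/run", 503, lambda *args: recorded.append(args))
record_error_response("/api/v1/workflows", 200, lambda *args: recorded.append(args))
```

In the middleware, a call like this would sit in the `finally` block after confirming that `response` is not `None`, using `response.status_code` as the second argument.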
624-627: 🔒 Security & Privacy | 🟡 Minor
🧩 Analysis chain
🏁 Script executed:
# First, locate and examine the file
fd -t f "server.py" | grep -E "agentwatch.*server"
Repository: sreerevanth/AgentWatch
Length of output: 219
🏁 Script executed:
# Get overview of the server.py file
wc -l agentwatch/api/server.py
Repository: sreerevanth/AgentWatch
Length of output: 190
🏁 Script executed:
# Examine the metrics endpoint and surrounding context
sed -n '620,635p' agentwatch/api/server.py
Repository: sreerevanth/AgentWatch
Length of output: 647
🏁 Script executed:
# Check how other endpoints use authentication
grep -n "_require_api_key" agentwatch/api/server.py | head -20
Repository: sreerevanth/AgentWatch
Length of output: 1526
🏁 Script executed:
# Check imports at the top of the file to see what's currently imported
head -50 agentwatch/api/server.py
Repository: sreerevanth/AgentWatch
Length of output: 1441
🏁 Script executed:
# Verify CONTENT_TYPE_LATEST is available in prometheus_client
python3 -c "from prometheus_client import CONTENT_TYPE_LATEST; print(CONTENT_TYPE_LATEST)" 2>&1 || echo "CONTENT_TYPE_LATEST not found"
Repository: sreerevanth/AgentWatch
Length of output: 321
🏁 Script executed:
# Check if there's any documentation or comments about metrics endpoint authentication
grep -n -B5 -A10 "/metrics" agentwatch/api/server.py
Repository: sreerevanth/AgentWatch
Length of output: 707
🏁 Script executed:
# Search for any existing usage of CONTENT_TYPE_LATEST in the codebase
rg "CONTENT_TYPE_LATEST" --type py
Repository: sreerevanth/AgentWatch
Length of output: 160
🏁 Script executed:
# Check requirements or setup files to see prometheus_client version
fd -t f "requirements" | xargs cat 2>/dev/null | grep -i prometheus
Repository: sreerevanth/AgentWatch
Length of output: 160
🏁 Script executed:
# Look for any prometheus configuration or documentation
find . -type f \( -name "*.md" -o -name "*.txt" \) | xargs grep -l "prometheus" 2>/dev/null | head -5
Repository: sreerevanth/AgentWatch
Length of output: 220
🏁 Script executed:
# Check if there's any security documentation about metrics
rg "metrics" agentwatch/api/server.py -i -B2 -A2 | head -40
Repository: sreerevanth/AgentWatch
Length of output: 1061
Add authentication to `/metrics` endpoint and use library constant for content-type.
The `/metrics` endpoint at lines 624-627 lacks authentication while all other endpoints in this module use `_require_api_key`. Prometheus output includes per-endpoint/per-agent failure counts and latencies, disclosing internal topology and traffic patterns to anonymous callers. Additionally, the media-type is hardcoded; prefer the library-provided constant to guarantee correctness across prometheus_client versions.
Suggested change
-from prometheus_client import generate_latest
+from prometheus_client import CONTENT_TYPE_LATEST, generate_latest
@@
-@app.get("/metrics")
-async def metrics() -> Response:
+@app.get("/metrics")
+async def metrics(_auth: None = Depends(_require_api_key)) -> Response:
     """Expose Prometheus metrics endpoint."""
-    return Response(content=generate_latest(), media_type="text/plain; version=0.0.4")
+    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@agentwatch/api/server.py` around lines 624 - 627, The metrics function at the /metrics endpoint lacks authentication protection and uses a hardcoded media-type string. Add the _require_api_key authentication decorator to the metrics function to match the security pattern used by all other endpoints in this module, and replace the hardcoded media-type string "text/plain; version=0.0.4" with the library-provided constant from prometheus_client (imported from prometheus_client.exposition) to ensure compatibility across versions and prevent disclosing internal topology metrics to unauthenticated callers.
767-769: 🎯 Functional Correctness | 🔴 Critical
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Confirm record_failure signature and all call sites/arity.
ast-grep run --pattern 'def record_failure($$$):' --lang python agentwatch/monitoring/metrics.py
rg -nP '\brecord_failure\s*\(' --type=py
Repository: sreerevanth/AgentWatch
Length of output: 160
🏁 Script executed:
# First, check if the files exist and their structure
fd -t f 'metrics.py' agentwatch/
fd -t f 'server.py' agentwatch/
Repository: sreerevanth/AgentWatch
Length of output: 252
🏁 Script executed:
# Read the record_failure definition in metrics.py
cat -n agentwatch/monitoring/metrics.py | grep -A 10 'def record_failure'
Repository: sreerevanth/AgentWatch
Length of output: 622
🏁 Script executed:
# Read the specific lines in server.py (767-769)
sed -n '760,775p' agentwatch/api/server.py
Repository: sreerevanth/AgentWatch
Length of output: 759
🏁 Script executed:
# Read the code around line 556 in server.py
sed -n '550,565p' agentwatch/api/server.py
Repository: sreerevanth/AgentWatch
Length of output: 672
🏁 Script executed:
# Find all occurrences of record_failure calls in the codebase
rg 'record_failure\s*\(' agentwatch/ -n --type py -B 1 -A 1
Repository: sreerevanth/AgentWatch
Length of output: 788
`record_failure()` is called with incorrect arity and will raise `TypeError` when processing FAILURE events.
The function signature (agentwatch/monitoring/metrics.py:37) requires `record_failure(endpoint: str, status_code: int, error_msg: str = "")` with two required positional arguments. The call at line 769 passes only `event.agent_id`, missing the required `status_code` parameter. This will raise `TypeError: record_failure() missing 1 required positional argument: 'status_code'` on every FAILURE event, crashing the ingest handler. The middleware at line 556 correctly calls it with all three arguments, confirming the mismatch.
Suggested fix
 if event.agent_id and hasattr(event, "status"):
     if getattr(event, "status", None) == ExecutionStatus.FAILURE:
-        record_failure(event.agent_id)
+        record_failure(event.agent_id, 500)
Note: the function uses its first parameter (`endpoint`) as the agent_id label, but the event handler is passing `event.agent_id` in place of an endpoint. Consider whether this semantic difference impacts your metrics.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@agentwatch/api/server.py` around lines 767 - 769, The call to record_failure() on the line that processes ExecutionStatus.FAILURE events is missing required arguments. Update the record_failure() call to provide all required parameters: the endpoint (currently passing only event.agent_id), the status_code (which needs to be extracted from the event or derived from the failure context), and optionally an error message. Reference the correct usage pattern in the middleware handler at line 556 which calls record_failure with all three required arguments to match the function signature defined in agentwatch/monitoring/metrics.py that expects endpoint, status_code, and error_msg parameters.
|
@BekkamMallishwari can you please fix this?
|
@BekkamMallishwari there are merge conflicts. Could you please join the Discord? It would make it easier to address the changes.
This PR adds a comprehensive Real-Time Agent Monitoring & Analytics Dashboard to address Issue #448.
Features Included:
Summary by CodeRabbit
Release Notes