[Bug]: @trace_class on EventQueue generates excessive spans during LLM streaming (1500+/session) #1034

@tomoyukiy0

What happened?

Problem

The @trace_class(kind=SpanKind.SERVER) decorator is applied to EventQueue and related high-frequency classes without an exclude_list:

  • a2a/server/events/event_queue.py: EventQueue (v0.3.x) / EventQueueLegacy (v1.0.x)
  • a2a/server/events/event_consumer.py: EventConsumer
  • a2a/server/events/in_memory_queue_manager.py: InMemoryQueueManager

This causes a span to be created for every call to high-frequency methods like:

  • EventQueue.enqueue_event — called once per streamed LLM token
  • EventQueue.dequeue_event — called once per streamed LLM token
  • EventQueue.task_done — called once per streamed LLM token

For a typical LLM streaming response of ~500 tokens, the three per-token calls above generate 1500+ spans per session (3 × 500). Most of these spans provide no actionable observability value, since they represent fine-grained internal queue operations rather than meaningful request-level events.

Impact

  1. Breaks span-quota-limited systems — AWS Bedrock AgentCore Online Evaluation has a hard limit of 1000 spans per evaluated session. Sessions exceeding this limit are silently skipped, leaving evaluations unusable for any A2A-based agent doing non-trivial LLM streaming.

  2. Increased observability costs — CloudWatch Logs storage, network bandwidth, and memory overhead for spans that are mostly noise.

  3. Approaches span size quotas — A single session with many internal spans can approach the 15 MB/session span data limit.
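The quota arithmetic behind item 1 can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the helper names are made up for illustration, and treating the ~53 spans reported under Verification as the non-queue baseline is an assumption.

```python
SPAN_LIMIT = 1000     # per-session span limit cited in item 1
SPANS_PER_TOKEN = 3   # enqueue_event + dequeue_event + task_done


def queue_spans(tokens: int) -> int:
    """Spans produced by the traced queue methods for one streamed response."""
    return tokens * SPANS_PER_TOKEN


def max_tokens_within_limit(other_spans: int = 0) -> int:
    """Largest token count that keeps a session at or under SPAN_LIMIT.

    other_spans is the number of non-queue spans in the session (assumed).
    """
    return max(0, (SPAN_LIMIT - other_spans) // SPANS_PER_TOKEN)


print(queue_spans(500))             # 1500
print(max_tokens_within_limit())    # 333
print(max_tokens_within_limit(53))  # 315
```

Under these assumptions, any streamed response longer than roughly 315 to 333 tokens pushes the session past the limit.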

Environment variable OTEL_INSTRUMENTATION_A2A_SDK_ENABLED=false is too coarse

The existing environment variable disables all A2A tracing, including the useful RequestHandler-level spans. There is no way to disable only the high-frequency internal spans.

Current workaround

Users can apply a runtime monkey-patch at application startup to unwrap the @trace_class decorator on the high-frequency classes, restoring the original methods via __wrapped__ (which is preserved by functools.wraps in trace_function). This is fragile and requires knowledge of internal SDK structure.
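A minimal sketch of the unwrapping idea, using only the standard library. `fake_trace_function`, `fake_trace_class`, `ToyQueue`, and `untrace` are hypothetical stand-ins written for this illustration and are not the SDK's code. The only property relied on is the one described above: a wrapper built with functools.wraps exposes the original function as `__wrapped__`.

```python
import functools
import inspect

CALLS = []  # stand-in for exported spans


def fake_trace_function(func):
    """Stand-in for a tracing wrapper: records one 'span' per call."""
    @functools.wraps(func)  # functools.wraps sets wrapper.__wrapped__ = func
    def wrapper(*args, **kwargs):
        CALLS.append(func.__name__)
        return func(*args, **kwargs)
    return wrapper


def fake_trace_class(cls):
    """Stand-in for a class decorator that wraps every public method."""
    for name, method in inspect.getmembers(cls, inspect.isfunction):
        if not name.startswith('_'):
            setattr(cls, name, fake_trace_function(method))
    return cls


@fake_trace_class
class ToyQueue:
    def __init__(self):
        self.items = []

    def enqueue_event(self, event):
        self.items.append(event)

    def task_done(self):
        pass


def untrace(cls, method_names):
    """Restore the undecorated methods; return the names actually unwrapped."""
    restored = []
    for name in method_names:
        # Look in the class __dict__ so inherited attributes are left alone.
        attr = cls.__dict__.get(name)
        if inspect.isfunction(attr) and hasattr(attr, '__wrapped__'):
            setattr(cls, name, attr.__wrapped__)
            restored.append(name)
    return restored
```

Calling `untrace(ToyQueue, ['enqueue_event'])` once at startup stops `enqueue_event` from being recorded while `task_done` keeps its wrapper. Against the real SDK the same pattern would target the classes listed above, which is exactly why it depends on internal structure.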

Proposed fix

Add an exclude_list (or equivalent) to the @trace_class application on the high-frequency classes. For example:

# a2a/server/events/event_queue.py

@trace_class(
    kind=SpanKind.SERVER,
    exclude_list=['enqueue_event', 'dequeue_event', 'task_done', 'clear_events'],
)
class EventQueue:
    ...

Similar changes would apply to EventConsumer and InMemoryQueueManager. The high-frequency internal methods would no longer generate spans, while the class-level decorator stays in place and continues to trace any other methods, including ones added in the future.
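To illustrate the intended effect, here is a self-contained toy model of a class decorator with an exclude_list. `toy_trace_class`, `make_queue_class`, and `count_spans` are hypothetical names for this sketch. It mimics the behaviour described in this issue, does not reproduce the SDK's trace_class, and counts wrapper calls instead of creating real OpenTelemetry spans.

```python
import functools
import inspect


def toy_trace_class(exclude_list=None, span_log=None):
    """Toy stand-in: wrap public methods unless they are in exclude_list."""
    excluded = set(exclude_list or [])
    log = span_log if span_log is not None else []

    def record(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            log.append(func.__name__)  # one entry per would-be span
            return func(*args, **kwargs)
        return wrapper

    def decorator(cls):
        for name, method in inspect.getmembers(cls, inspect.isfunction):
            if name.startswith('_') or name in excluded:
                continue  # private, dunder, and excluded methods stay untraced
            setattr(cls, name, record(method))
        return cls

    return decorator


def make_queue_class(exclude_list, span_log):
    @toy_trace_class(exclude_list=exclude_list, span_log=span_log)
    class ToyQueue:
        def __init__(self):
            self._items = []

        def enqueue_event(self, event):
            self._items.append(event)

        def dequeue_event(self):
            return self._items.pop(0)

        def task_done(self):
            pass

        def close(self):
            self._items.clear()

    return ToyQueue


def count_spans(exclude_list, tokens):
    """Simulate one streamed response and return the number of recorded spans."""
    log = []
    queue = make_queue_class(exclude_list, log)()
    for token in range(tokens):
        queue.enqueue_event(token)
        queue.dequeue_event()
        queue.task_done()
    queue.close()
    return len(log)


hot_methods = ['enqueue_event', 'dequeue_event', 'task_done', 'clear_events']
print(count_spans(None, 500))         # 1501
print(count_spans(hot_methods, 500))  # 1
```

In this toy model only the single `close` call is still recorded once the per-token methods are excluded, which matches the shape of the reduction reported below.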

Verification

I have verified locally that:

  • The trace_class mechanism already supports exclude_list
  • Applying the fix reduces spans from 1500+ to ~53 per session (97% reduction)
  • Useful RequestHandler traces (DefaultRequestHandler, JSONRPCHandler, RESTHandler) and client transport traces are preserved

Happy to submit a PR with the proposed changes if this direction is acceptable.
