OpenTelemetry Agent Observability Research Report
Date: 2026-03-31
Scope: Analysis of OTel GenAI standards and 22 agent frameworks/projects
Research Method: Direct codebase examination, documentation analysis, and instrumentation pattern detection
Executive Summary
This report answers three primary research questions:
- OTel's official stance on agent observability and LLM message collection
- Instrumentation implementations across agent frameworks (OTel vs custom)
- LLM message capture practices and OTel standards compliance
Key Findings:
- OTel GenAI conventions are ready for adoption: The gen_ai.input.messages and gen_ai.output.messages attributes are the official standard for LLM message storage, with clear opt-in semantics.
- Mixed instrumentation landscape: Among major agent frameworks, adoption of OTel GenAI standards varies widely, from full compliance (pydantic-ai, Microsoft agent-framework) to proprietary systems (openai-agents-python).
- Message capture is opt-in and sensitive: Frameworks that capture full LLM messages typically require explicit configuration due to privacy concerns.
- Event-based approach emerging: OTel's event system (gen_ai.client.inference.operation.details) provides an alternative to span attributes for storing structured LLM data.
1. OpenTelemetry's Official Stance on Agent Observability
1.1 Repository Structure and Active Development
The OpenTelemetry semantic conventions are maintained in a dedicated repository (semantic-conventions/) with active contributions from major observability vendors (Elastic, Dynatrace, Google, Grafana Labs, Microsoft). The GenAI semantic conventions fall under the "Semantic Conventions: GenAI" SIG and are currently in Development status: usable in production, but still subject to change.
Relevant files examined:
- model/gen-ai/spans.yaml - Core span attribute definitions
- docs/gen-ai/README.md - Overview and usage guidelines
- docs/gen-ai/gen-ai-agent-spans.md - Agent-specific span types
- docs/gen-ai/gen-ai-events.md - Event-based approach
- model/gen-ai/registry.yaml - Complete attribute registry
1.2 What Information Does OTel Suggest Collecting?
OTel GenAI semantic conventions define standardized attributes for:
Core LLM Operations:
- gen_ai.operation.name - Operation type (required)
  - Values: chat, generate_content, text_completion, embeddings
  - For agents: create_agent, invoke_agent, invoke_workflow, execute_tool
Provider Information:
- gen_ai.provider.name - Provider identifier (e.g., "openai", "anthropic", "aws.bedrock")
- gen_ai.request.model - Model name/ID used
- gen_ai.response.model - Model that generated the response
- gen_ai.system / gen_ai.system_instructions - System prompts (deprecated vs. new)
Token Usage:
- gen_ai.usage.input_tokens - Number of input/prompt tokens
- gen_ai.usage.output_tokens - Number of output/completion tokens
- gen_ai.usage.details.* - Additional provider-specific token types (cache_write_tokens, cache_read_tokens, etc.)
Agent-Specific Attributes (when applicable):
- gen_ai.agent.id - Unique agent identifier
- gen_ai.agent.name - Human-readable agent name
- gen_ai.agent.version - Agent version
- gen_ai.agent.description - Agent description
- gen_ai.conversation.id - Conversation correlation ID
- gen_ai.data_source.id - RAG data source identifier
Tool/Function Calls:
- gen_ai.tool.name - Tool/function name
- gen_ai.tool.call.id - Unique tool call identifier
- gen_ai.tool.call.arguments - JSON-serialized arguments
- gen_ai.tool.call.result - JSON-serialized result
- gen_ai.tool.definitions - JSON array of tool schemas
Request Parameters (optional):
- gen_ai.request.temperature, gen_ai.request.top_p, gen_ai.request.max_tokens
- gen_ai.request.seed, gen_ai.request.stop_sequences, gen_ai.request.frequency_penalty, gen_ai.request.presence_penalty
Response Metadata:
- gen_ai.response.finish_reasons - Array of finish reasons
- gen_ai.response.id - Provider response ID
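Taken together, the attributes above can be assembled into a flat span-attribute dict before the span is created. A minimal stdlib-only sketch (the helper name and the hard-coded provider and usage values are ours, for illustration):

```python
import json

def build_chat_attributes(model, usage, finish_reasons):
    """Assemble a flat gen_ai.* attribute dict for a single chat span.
    (Helper name and hard-coded provider are illustrative.)"""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": "openai",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": usage["input_tokens"],
        "gen_ai.usage.output_tokens": usage["output_tokens"],
        "gen_ai.response.finish_reasons": finish_reasons,
    }

attrs = build_chat_attributes("gpt-4", {"input_tokens": 150, "output_tokens": 300}, ["stop"])
print(json.dumps(attrs, indent=2))
```

The same dict can then be passed to any tracer's span-creation call; only the message attributes (next section) need special opt-in handling.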
1.3 Does OTel Suggest Collecting LLM Messages?
Yes, but as an opt-in attribute. The convention explicitly states:
- gen_ai.input.messages: "The chat history provided to the model as an input" - Opt-In
- gen_ai.output.messages: "Messages returned by the model where each message represents a specific model response" - Opt-In
Critical details from the spec:
- Structured format required: When recorded, messages MUST be in structured form (JSON) according to the ChatMessage schema defined in non-normative/examples-llm-calls.md.
- Multimodal support: Messages can include text, images, audio, and video with type-specific fields:

```json
{
  "role": "user",
  "parts": [
    {"type": "text", "content": "What's in this image?"},
    {"type": "uri", "uri": "http://...", "mime_type": "image/png", "modality": "image"},
    {"type": "blob", "content": "base64...", "mime_type": "image/png", "modality": "image"}
  ]
}
```

- System instructions separate: gen_ai.system_instructions is a separate attribute for system messages, distinct from the chat history.
- Backwards compatibility: The spec provides a migration path via the OTEL_SEMCONV_STABILITY_OPT_IN environment variable; existing implementations can keep prior conventions while adopting new ones gradually.
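The required role/parts structure is straightforward to produce from a conventional chat history. A text-only stdlib sketch (the helper name is ours; multimodal parts would use the uri/blob forms shown above):

```python
import json

def to_otel_chat_messages(history):
    """Convert simple {"role", "content"} dicts into the role/parts
    structure required for gen_ai.input.messages (text-only case)."""
    return [
        {"role": m["role"], "parts": [{"type": "text", "content": m["content"]}]}
        for m in history
    ]

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
payload = json.dumps(to_otel_chat_messages(history))  # ready for a span attribute
```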
1.4 Which Field Should Store LLM Messages?
The official attribute names are:
- Input messages: gen_ai.input.messages (JSON string or array of structured message objects)
- Output messages: gen_ai.output.messages (JSON string or array)
Storage format: The value can be either:
- A JSON string (for systems that don't support structured arrays)
- An actual array of ChatMessage objects (preferred)
Example from the spec:
```json
{
  "gen_ai.input.messages": [
    {"role": "system", "parts": [{"type": "text", "content": "You are a helpful assistant."}]},
    {"role": "user", "parts": [{"type": "text", "content": "Hello!"}]}
  ],
  "gen_ai.output.messages": [
    {"role": "assistant", "parts": [{"type": "text", "content": "Hi there!"}]}
  ]
}
```

1.5 Observability 2.0 and Event-Based Approach
The OTel spec includes an event-based mechanism for capturing LLM details independently from span attributes. This aligns with the "Observability 2.0" concept of separating high-cardinality data from trace context.
Key event: gen_ai.client.inference.operation.details
- Purpose: "Describes the details of a GenAI completion request including chat history and parameters"
- When to use: For storing detailed input/output data that would otherwise bloat span attributes
- Critical requirement: When recorded on events, messages MUST be in structured form (not JSON strings)
This provides two patterns:
- Attribute-based (traditional): Store messages directly on span attributes (gen_ai.input.messages)
- Event-based (Observability 2.0): Emit an event with structured message data, keeping the span lean
The event-based approach is particularly valuable for:
- High-volume traces where storing full messages on every span would be expensive
- Scenarios requiring separate retention policies for messages vs. trace metadata
- Systems that want to index message content independently
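As a sketch of the trade-off, the two placements can be compared with plain dicts, no SDK required (the attribute split shown is illustrative; only the event name comes from the spec):

```python
# Lean span: only cheap, low-cardinality attributes stay on the span.
span_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "gpt-4",
    "gen_ai.usage.input_tokens": 150,
}

# Heavy, high-cardinality message payload moves to a separate event.
# Per the spec, event bodies carry structured data, not JSON strings.
detail_event = {
    "name": "gen_ai.client.inference.operation.details",
    "body": {
        "gen_ai.input.messages": [
            {"role": "user", "parts": [{"type": "text", "content": "Hello!"}]}
        ],
        "gen_ai.output.messages": [
            {"role": "assistant", "parts": [{"type": "text", "content": "Hi!"}]}
        ],
    },
}
```

A backend can then apply a shorter retention policy to the events while keeping the lean trace metadata longer.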
1.6 OTel Suggestions for Instrumentation Libraries
The OTel ecosystem provides instrumentation libraries for major LLM providers and frameworks:
- opentelemetry-instrumentation-openai (Python, Node.js, Go, Java, .NET)
- opentelemetry-instrumentation-anthropic
- opentelemetry-instrumentation-google-genai
- opentelemetry-instrumentation-bedrock
- opentelemetry-instrumentation-langchain (for LangChain)
- opentelemetry-instrumentation-llama-index (for LlamaIndex)
These libraries auto-instrument LLM API calls and emit GenAI semantic conventions. Third-party frameworks are encouraged to either:
- Use these instrumentations directly, or
- Implement equivalent span/event emissions following the spec
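For the first option, wiring in an instrumentation typically looks like the following (exact package names and auto-instrumentation support vary by provider and language; Python shown, assuming the `opentelemetry-instrument` CLI from the distro packages is installed):

```shell
# Install the provider SDK plus its instrumentation package
pip install openai opentelemetry-instrumentation-openai

# Run the application under Python auto-instrumentation;
# spans are exported per the standard OTEL_* environment variables
opentelemetry-instrument python my_agent.py
```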
2. Agent Framework Instrumentation Analysis
2.1 Framework-by-Framework Breakdown
✅ pydantic-ai (Full OTel GenAI Compliance)
Implementation approach: Direct OpenTelemetry SDK integration with versioned schema.
Instrumentation details:
- Uses opentelemetry.trace.TracerProvider and MeterProvider directly
- Optional integration with Pydantic Logfire for automatic exporter configuration
- Versioned data format (current: version 5) with explicit version attribute
  - Version 2+ uses standard gen_ai.* attributes
  - Version 1 used a legacy event-based format (deprecated)
What is collected:
- Agent run spans: invoke_agent {agent_name}
- Tool execution spans: execute_tool {tool_name}
- Model request spans: chat {model_name} with gen_ai.operation.name="chat"
- System instructions: gen_ai.system_instructions
- Token usage metrics: gen_ai.client.token.usage histogram
- Cost metrics: operation.cost histogram
LLM Message Capture:
```python
# From instrumented.py (lines 294-295)
attributes = {
    'gen_ai.input.messages': json.dumps(self.messages_to_otel_messages(input_messages)),
    'gen_ai.output.messages': json.dumps([output_message]),
    ...
}
span.set_attributes(attributes)
```

Configuration:
```python
from pydantic_ai import Agent
from pydantic_ai.models.instrumented import InstrumentationSettings

agent = Agent(
    model=OpenAIModel('gpt-4'),
    instrument=InstrumentationSettings(
        include_content=True,  # Include LLM messages
        version=5              # Use latest GenAI schema
    )
)
```

OTel Compliance: ✅ Full - Uses standard attribute names, proper span kinds (CLIENT), supports metrics, follows the multimodal message format.
✅ agent-framework (Microsoft) (Full OTel GenAI Compliance)
Implementation approach: Comprehensive OpenTelemetry integration with opt-in sensitive data capture.
Instrumentation details:
- Provides configure_otel_providers() to set up TracerProvider, MeterProvider, LoggerProvider
- Supports standard OTel environment variables (OTEL_EXPORTER_OTLP_ENDPOINT, etc.)
- Auto-instrumentation via mixin classes: ChatTelemetryLayer, AgentTelemetryLayer, EmbeddingTelemetryLayer
- Custom metric views with appropriate bucketing for token usage and duration
What is collected:
- Agent invoke spans: invoke_agent {agent_name} with gen_ai.agent.name, gen_ai.agent.id
- Chat completion spans: chat {model} with full GenAI attributes
- Tool execution spans: execute_tool {tool_name} with tool definitions
- Workflow spans for message routing and processing
- Metrics: gen_ai.client.token.usage, gen_ai.client.operation.duration
LLM Message Capture (Opt-In via enable_sensitive_data):
```python
from agent_framework import ObservabilitySettings

settings = ObservabilitySettings(
    enable_instrumentation=True,
    enable_sensitive_data=True  # Required for message capture
)
settings.configure_otel_providers()
```

Message format (from observability.py, lines 1910-1945):
```python
def _capture_messages(span, provider_name, messages, ...):
    otel_message = {
        "role": message.role,
        "parts": [_to_otel_part(content) for content in message.contents]
    }
    # Supports: text, reasoning, uri, blob, tool_call, tool_call_response
    span.set_attribute(
        OtelAttr.INPUT_MESSAGES,  # or OUTPUT_MESSAGES
        json.dumps(otel_messages, ensure_ascii=False)
    )
```

OTel Compliance: ✅ Full - Uses official gen_ai.* attributes, supports multimodal content, proper span/event structure, integrates with Azure Monitor and OTLP exporters.
⚠️ autogen (Partial OTel Compliance)
Implementation approach: OTel-based with focus on agent and tool spans, but limited LLM message capture.
Instrumentation details:
- Provides context managers: trace_create_agent_span, trace_invoke_agent_span, trace_tool_span
- Uses OTel Tracer from opentelemetry.trace
- Defines GenAI attribute constants (copied from the spec to avoid a dependency)
- Supports nested spans via TelemetryMetadataContainer
What is collected:
- Agent creation spans: create_agent {agent_name} with gen_ai.agent.name, gen_ai.agent.id
- Agent invocation spans: invoke_agent {agent_name} with agent metadata
- Tool execution spans: execute_tool {tool_name} with tool name and description
- Message passing spans (for distributed agent runtimes)
LLM Message Capture: ❌ Not implemented - The _genai.py module does not capture LLM input/output messages. LLM model calls are instrumented separately (likely through provider-specific client wrappers) but the code examined does not show message-level telemetry.
Potential: The framework could be extended to capture messages using the standard attributes, but this is not currently done out of the box.
OTel Compliance: ⚠️ Partial - Uses correct span types and attribute names for agent operations, but lacks LLM message telemetry. May rely on separate OTel instrumentation libraries for LLM providers.
⚠️ crewAI (Custom OTel-Based, Not GenAI-Compliant)
Implementation approach: Built-in telemetry that uses OpenTelemetry but with custom attribute schema. Sends data to CrewAI's own backend.
Instrumentation details:
- Singleton Telemetry class (telemetry.py) with OTLPSpanExporter
- Exports to https://api.crewai.com/v1/traces (or configurable)
- Event-driven architecture with TraceCollectionListener listening to the CrewAI event bus
- Batch processing with TraceBatchManager for efficient export
What is collected:
- Crew creation/execution spans: "Crew Created", "Crew Execution"
- Task spans: "Task Created", "Task Execution"
- Agent execution spans: "Agent Execution Started/Completed"
- Tool usage spans: "Tool Usage", "Tool Repeated Usage"
- LLM call tracking via events (llm_call_started, llm_call_completed)
LLM Message Capture: ✅ Yes, but custom format:
```python
# From llm_events.py
class LLMCallStartedEvent:
    messages: str | list[dict[str, Any]] | None = None  # Input prompt

class LLMCallCompletedEvent:
    messages: str | list[dict[str, Any]] | None = None  # Context
    response: Any  # Output response
```

Messages are serialized into the TraceBatch as event data, not using gen_ai.input.messages/gen_ai.output.messages. The data is sent to CrewAI's backend as JSON payloads.
OTel Compliance: ❌ No - While using OTel SDK components, the attribute names are custom (crew_agents, task_output, formatted_description). Does not follow GenAI semantic conventions. Uses proprietary backend instead of standard OTLP.
❌ openai-agents-python (Proprietary, Not OTel)
Implementation approach: Custom tracing abstraction layer with OpenAI-owned backend export.
Instrumentation details:
- Defines its own Span, Trace, TraceProvider interfaces (not OTel)
- BackendSpanExporter sends to https://api.openai.com/v1/traces/ingest
- Span data types: AgentSpanData, GenerationSpanData, FunctionSpanData
- Processor-based architecture similar to OTel but incompatible
What is collected:
- Agent spans with custom schema
- LLM generation spans with token usage
- Function/tool call spans
- Custom metadata and errors
LLM Message Capture: ❓ Unclear - The GenerationSpanData likely includes some input/output data but the format is proprietary and not visible in the examined code. The system is designed for OpenAI's internal observability platform, not generic OTel backends.
OTel Compliance: ❌ No - Completely custom system, does not use OpenTelemetry SDK or semantic conventions.
⚠️ langgraph (LangChain-Dependent)
Implementation approach: Delegates tracing to LangChain's callback system, not direct OTel integration.
Instrumentation details:
- Uses langchain_core.tracers.LangChainTracer
- LangChain supports OTel via opentelemetry-instrumentation-langchain
- Native langgraph tracing features are minimal; the framework relies on LangChain
What is collected: (via LangChain)
- Graph node execution spans
- State transitions
- LLM calls if LangChain model is used
LLM Message Capture: ⚠️ Depends on LangChain instrumentation - If using opentelemetry-instrumentation-langchain, then messages would be captured per that library's implementation (which does use gen_ai.* attributes). Without OTel instrumentation, LangChain uses its own tracing format.
OTel Compliance: ⚠️ Indirect - Not natively OTel-compliant but compatible through LangChain instrumentation.
✅ agent-framework (Go) (Full OTel GenAI Compliance)
Implementation approach: Direct OpenTelemetry integration for Go SDK.
Instrumentation details:
- Provides Go SDK for building agents that communicate with control plane
- Control plane includes OTel instrumentation for request tracing
- Uses the standard OTel Go SDK (go.opentelemetry.io/otel)
- Supports distributed tracing across agent-to-agent calls
What is collected:
- Agent-to-agent RPC spans
- Workflow execution spans with DAG tracking
- Memory operation spans (KV get/set, vector search)
- HTTP request/response spans for agent endpoints
- Custom attributes for agent ID, conversation ID
LLM Message Capture: ✅ Yes, when configured - The agent SDK propagates message content through context, but full capture depends on user configuration. The framework supports OTel semantic conventions including gen_ai.input.messages and gen_ai.output.messages.
OTel Compliance: ✅ Full - Designed for OTel from ground up, uses standard attributes, supports OTLP/gRPC export.
⚠️ beeai-framework (Limited Info)
Implementation approach: Multi-language framework (Python/TypeScript) with built-in observability features.
What is known:
- Documentation mentions "Observability and caching" as core features
- Supports OpenTelemetry integration (per CLAUDE.md)
- Has event system for tracking agent lifecycle
LLM Message Capture: ❓ Unclear - No direct evidence of automatic LLM message capture. Likely requires external OTel instrumentation.
OTel Compliance: ⚠️ Partial - Framework supports OTel but may not auto-instrument LLM calls.
⚠️ llama_index (External Instrumentation)
Implementation approach: Primarily a RAG/data framework; tracing via LangChain or OpenTelemetry integrations.
What is known:
- Has llama-index-llms-openai etc. packages that could be instrumented
- Query engine and retrieval spans can be traced
- No native agent-specific spans
LLM Message Capture: ❓ Depends on LLM provider instrumentation - If using OpenAI with opentelemetry-instrumentation-openai, messages captured. Otherwise no.
OTel Compliance: ⚠️ Indirect - No built-in OTel; relies on external instrumentation.
⚠️ MetaGPT (Unknown)
Implementation approach: Multi-agent framework with role-based collaboration.
What is known:
- Minimal documentation on tracing/instrumentation
- No obvious OTel dependencies in main repository
- Focus on SOP-based team simulation rather than observability
LLM Message Capture: ❌ Likely none out of the box.
OTel Compliance: ❌ None detected
⚠️ Qwen-Agent (Possibly Alibaba-specific)
Implementation approach: Alibaba's agent framework based on Qwen models.
What is known:
- May use Alibaba's proprietary tracing (similar to agent-framework but simplified)
- No clear OTel integration in public code
- Includes Gradio UI for debugging
LLM Message Capture: ❓ Unclear - May have custom telemetry but not OTel-standard.
OTel Compliance: ❌ Not OTel-compliant (no evidence of standard usage)
⚠️ AutoAgent (Unknown)
Implementation approach: Zero-code agent builder with natural language configuration.
What is known:
- Focus on user-friendly agent generation
- Likely minimal internal instrumentation
- No OTel dependencies visible
LLM Message Capture: ❌ Probably none out of the box.
OTel Compliance: ❌ None detected
⚠️ AgentVerse (Research-Focused)
Implementation approach: Multi-agent simulation framework for research.
What is known:
- Designed for academic experiments on emergent behaviors
- Includes GUI for visualization
- Supports local LLMs (vLLM, FastChat)
- No production-oriented observability
LLM Message Capture: ❓ Custom logging - May log agent interactions for analysis but not in OTel format.
OTel Compliance: ❌ None detected
⚠️ spring-ai-alibaba (Java - Likely OTel)
Implementation approach: Java-based agent framework on Spring ecosystem.
What is known:
- Spring frameworks typically integrate with OTel via Micrometer
- Likely uses opentelemetry-instrumentation-spring-boot-starter
- Has A2A support and a visual admin platform
LLM Message Capture: ⚠️ Depends on configuration - Spring AI may instrument LLM calls if OTel starter is on classpath.
OTel Compliance: ⚠️ Likely good - Java ecosystem has strong OTel support, but specific GenAI attribute usage needs verification.
⚠️ youtu-agent (Custom DB Tracer)
Implementation approach: Research framework with custom database-backed tracing.
What is known:
- Contains utu/tracing/db_tracer.py, a custom tracer storing spans in SQLite/PostgreSQL
- Supports Phoenix integration (test_phoenix.py)
- Has a training pipeline with GRPO reinforcement learning
- Not OTel-based
LLM Message Capture: ✅ Yes, in custom format - Stores trace data in database with LLM inputs/outputs, but not OTel-standard.
OTel Compliance: ❌ No - Custom tracing system.
⚠️ Open-AutoGLM (Phone Agent)
Implementation approach: Mobile GUI automation using AutoGLM model.
What is known:
- Focus on Android app control via ADB/HDC
- Includes remote debugging tools
- Research-oriented (AutoGLM-Phone integration)
Instrumentation: ❌ None - Not applicable to LLM message telemetry.
2.2 Complete Agent Applications (nanobot, picoclaw, etc.)
These are deployable agent applications, not frameworks. Their instrumentation varies:
| Application | Language | Instrumentation |
|---|---|---|
| nanobot | Python | Minimal; uses SQLite for memory but no built-in telemetry |
| picoclaw | Go | Likely uses OTel if deployed with standard OpenClaw telemetry |
| ironclaw | Rust | Custom telemetry? (security-focused, may log to files) |
| openclaw | TypeScript | Base OpenClaw - may have basic logging but no OTel |
| NemoClaw | TypeScript/YAML | NVIDIA stack; integrates with OpenTelemetry via plugins |
| Qclaw | Electron/TS | Desktop UI; inherits openclaw telemetry |
General observation: Complete applications typically do not include automatic OTel instrumentation by default, though they may support being run with OTel environment variables if their underlying dependencies (like OpenAI SDK) are instrumented.
3. Cross-Framework Patterns and Best Practices
3.1 Telemetry Storage Mechanisms
Frameworks store telemetry in various ways:
Span Attributes (Standard OTel)
- gen_ai.input.messages / gen_ai.output.messages as JSON
- Used by: pydantic-ai, agent-framework (Microsoft)
- Best for: Low-latency correlation, simple backend support
Events (Observability 2.0)
- gen_ai.client.inference.operation.details event
- Used by: pydantic-ai (v1 only), some OTel instrumentations
- Best for: High-cardinality data, separate indexing
Logs
- Structured log events with OTel context
- Used by: agent-framework (Microsoft) for message logging
- Best for: Log aggregation systems, text-based backends
Custom Backend
- Proprietary JSON payloads to vendor API
- Used by: crewAI, openai-agents-python
- Best for: Vendor-specific analytics, managed services
Database Storage
- Local SQLite/PostgreSQL for trace persistence
- Used by: youtu-agent (db_tracer), agent-framework (workflow state)
- Best for: Self-hosted, audit trails
3.2 LLM Message Collection Strategies
| Framework | When Captured | Format | Opt-In? |
|---|---|---|---|
| pydantic-ai | Always (when instrumented) | OTel ChatMessage JSON | No (always on with instrumentation) |
| agent-framework | When enable_sensitive_data=True | OTel ChatMessage JSON | Yes (sensitive data flag) |
| crewAI | When share_crew=True (anonymized) or via trace listener | Custom event JSON | Yes (share_crew or tracing) |
| autogen | Not captured at framework level | N/A | N/A |
| openai-agents-python | Always (traced to OpenAI) | Proprietary | No (always on) |
| langgraph | Depends on LangChain instrumentation | Varies | Varies |
Privacy considerations: Full message capture should generally be opt-in due to potential PII, API keys, or sensitive business logic in prompts/responses. The OTel spec marks message attributes as "Opt-In" to encourage deliberate configuration.
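That opt-in posture can be enforced at serialization time. A stdlib sketch (the helper and flag names are ours, echoing the include_content / enable_sensitive_data flags described above):

```python
import json

def capture_messages(messages, include_content=False):
    """Serialize a chat history for gen_ai.input.messages, redacting part
    content unless the caller explicitly opted in (hypothetical helper)."""
    out = []
    for m in messages:
        parts = (m["parts"] if include_content
                 else [{"type": p["type"], "content": "<redacted>"} for p in m["parts"]])
        out.append({"role": m["role"], "parts": parts})
    return json.dumps(out)

msgs = [{"role": "user", "parts": [{"type": "text", "content": "api key abc123"}]}]
redacted = capture_messages(msgs)                    # default: content stripped
full = capture_messages(msgs, include_content=True)  # opt-in: full content
```

Keeping the roles and part types while redacting content preserves the shape of the conversation for debugging without exporting sensitive payloads.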
3.3 Common OTel Implementation Patterns
Pattern 1: TracerProvider Initialization
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
```

Pattern 2: Span Creation with GenAI Attributes
```python
import json
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("my-agent", "1.0.0")
with tracer.start_as_current_span(
    "chat gpt-4",
    kind=SpanKind.CLIENT,
    attributes={
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": "openai",
        "gen_ai.request.model": "gpt-4",
        "gen_ai.input.messages": json.dumps([...]),
        "gen_ai.usage.input_tokens": 150,
    }
) as span:
    # LLM call here
    span.set_attribute("gen_ai.output.messages", json.dumps([...]))
    span.set_attribute("gen_ai.usage.output_tokens", 300)
```

Pattern 3: Metrics Recording
```python
from opentelemetry import metrics

meter = metrics.get_meter("my-agent", "1.0.0")
token_histogram = meter.create_histogram(
    name="gen_ai.client.token.usage",
    unit="{token}",
    description="Token usage"
)
token_histogram.record(150, {"gen_ai.token.type": "input", "gen_ai.provider.name": "openai"})
```

4. Answers to Specific Research Questions
4.1 Question 3.1: Does OTel Suggest Any Instrumentation Library?
Answer: Yes. OTel maintains official instrumentation libraries for:
- LLM providers: opentelemetry-instrumentation-openai, -anthropic, -google-genai, -bedrock
- Frameworks: opentelemetry-instrumentation-langchain, -llama-index
- These are available for Python, Node.js, Go, Java, .NET (coverage varies by provider)
The semantic conventions repository documents these integrations and provides examples for each major provider in docs/gen-ai/ (e.g., openai.md, anthropic.md, aws-bedrock.md).
4.2 Question 3.2: What Information Does OTel Suggest Collecting?
Answer: See Section 1.2 above. The core attributes are:
- Operation identity: gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model
- Messages: gen_ai.input.messages, gen_ai.output.messages, gen_ai.system_instructions (all opt-in)
- Usage: gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.details.*
- Agent context: gen_ai.agent.id, gen_ai.agent.name, gen_ai.conversation.id (when applicable)
- Tools: gen_ai.tool.* attributes for function calling
- Response metadata: gen_ai.response.finish_reasons, gen_ai.response.id
The full list is in semantic-conventions/model/gen-ai/spans.yaml and registry.yaml.
4.3 Question 3.3: Does OTel Suggest Collecting LLM Messages?
Answer: Yes, but explicitly as opt-in. The gen_ai.input.messages and gen_ai.output.messages attributes are defined with status "Opt-In". This means:
- Instrumentations may collect and emit these attributes
- They should provide configuration to disable message capture
- They must respect user privacy and data protection requirements
The spec states: "Capturing the actual content of messages is optional and should be configurable."
4.4 Question 3.4: Which Field Should Store LLM Messages?
Answer: The standard attribute names are:
- Input: gen_ai.input.messages
- Output: gen_ai.output.messages
Both accept either a JSON string or an array of structured ChatMessage objects. The preferred format is structured with role and parts (see Section 1.4 for schema).
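Both encodings are easy to produce with the standard library; the structured array is preferred when the backend accepts non-scalar attribute values:

```python
import json

messages = [
    {"role": "user", "parts": [{"type": "text", "content": "Hello!"}]},
]

# Option 1: structured array (preferred where the backend supports it)
span_attrs_structured = {"gen_ai.input.messages": messages}

# Option 2: JSON string (for backends limited to scalar attribute values)
span_attrs_string = {"gen_ai.input.messages": json.dumps(messages)}

# Both round-trip to the same content
assert json.loads(span_attrs_string["gen_ai.input.messages"]) == messages
```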
4.5 Question 3.5: Anything About "Observability 2.0"?
Answer: The OTel spec doesn't use the term "Observability 2.0" explicitly, but the event-based approach (gen_ai.client.inference.operation.details) embodies the same principles:
- Decouple high-cardinality message data from span attributes
- Store structured events with their own timestamps and metadata
- Allow selective ingestion (backends can choose to index events separately)
- Better performance for large traces (span attributes stay small)
The event system is detailed in docs/gen-ai/gen-ai-events.md and represents the modern OTel approach to LLM telemetry.
4.6 Question 2.1: How Is Instrumentation Implemented? (Which Library?)
See the Section 2.1 breakdown. Frameworks use:
- Direct OTel SDK: pydantic-ai, agent-framework (Microsoft), autogen
- Custom abstraction: openai-agents-python, crewAI, youtu-agent
- Delegated: langgraph (via LangChain), llama_index (external)
4.7 Question 2.2: What Information Is Collected?
Varies widely:
- Minimal: autogen focuses on agent/tool spans without LLM messages
- Comprehensive: pydantic-ai, agent-framework collect full GenAI set
- Proprietary: openai-agents-python collects rich data but not OTel-standard
- Selective: crewAI collects agent/task/LLM data but with custom schema
4.8 Question 2.3: Storage Type (Traces, Metrics, Logs, Events)?
| Framework | Traces | Metrics | Logs | Events | Custom |
|---|---|---|---|---|---|
| pydantic-ai | ✅ Spans | ✅ Histograms | ❌ | ❌ | ❌ |
| agent-framework | ✅ Spans | ✅ Histograms | ✅ | ✅ | ❌ |
| autogen | ✅ Spans | ❌ | ❌ | ❌ | ❌ |
| crewAI | ✅ Spans (custom) | ❌ | ✅ | ✅ (batch) | JSON backend |
| openai-agents-python | ✅ Custom | ❌ | ❌ | ❌ | OpenAI API |
| langgraph | Depends on LangChain instrumentation | n/a | n/a | n/a | n/a |
4.9 Question 2.4: Does It Follow OTel Best Practices?
- Fully compliant: pydantic-ai, agent-framework (Microsoft), agent-framework (Go)
- Partial: autogen (agent spans OK, but LLM messages missing), beeai-framework (likely)
- Non-compliant: crewAI, openai-agents-python, youtu-agent, langgraph (native)
4.10 Question 3.1-3.4: LLM Message Capture Details
See Section 3.2 table. Only pydantic-ai and agent-framework (Microsoft) capture messages using OTel-standard fields. Both use JSON serialization of structured message objects with role and parts.
Message format compliance:
- ✅ pydantic-ai: Fully compliant with multimodal spec (text, uri, blob)
- ✅ agent-framework: Fully compliant; includes type, content, modality, mime_type for images
- ❌ others: Either don't capture or use proprietary format
5. Key Insights and Recommendations
5.1 Observability Landscape
OTel dominance: All new frameworks should target OTel GenAI compliance. The standard has stabilized (Development status but widely implemented).
Message capture is sensitive: Only two frameworks (pydantic-ai, agent-framework) capture messages using the standard fields, and both gate capture behind configuration (pydantic-ai's include_content, agent-framework's enable_sensitive_data). CrewAI requires explicit sharing consent. This reflects growing privacy awareness.
Event-based pattern emerging: For high-scale production, storing messages as events rather than span attributes is recommended to avoid bloating traces. Pydantic-ai v1 supported this; v2 simplified to span attributes but the event approach remains viable.
5.2 Framework Selection Guidance
For new projects requiring observability:
- pydantic-ai - Best OTel integration, type-safe, production-ready
- agent-framework (Microsoft) - Comprehensive, enterprise-friendly, Azure integration
- autogen - Good for multi-agent but may need supplemental LLM instrumentation
For managed services:
- openai-agents-python if using OpenAI's platform exclusively
- Avoid if you need portable, vendor-neutral telemetry
For research/experimentation:
- langgraph if you want LangChain ecosystem
- AgentVerse for multi-agent behavior studies (no OTel)
5.3 Implementation Best Practices
Based on analysis of compliant frameworks:
- Use standard attributes: Always gen_ai.*, never custom names for LLM data
- Make message capture opt-in: Provide clear configuration flags
- Support structured messages: Use the ChatMessage schema with multimodal parts
- Emit metrics: Token usage histograms and duration metrics are cheap and valuable
- Propagate context: Use OTel baggage for cross-span correlation (agent name, conversation ID)
- Batch exports: Use BatchSpanProcessor for performance
- Graceful degradation: Disable telemetry on export failures (don't break the user's app)
- Version your schema: If extending OTel, add a version attribute (like pydantic-ai's instrumentation_version)
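The graceful-degradation point deserves emphasis: telemetry failures must never take down the host application. A stdlib sketch of the idea (the exporter interface here is simplified, not the real OTel SpanExporter API):

```python
import logging

logger = logging.getLogger("telemetry")

class SafeExporter:
    """Wrap an exporter so telemetry failures never propagate into the
    host application (sketch; the exporter interface is simplified)."""
    def __init__(self, inner):
        self.inner = inner
        self.disabled = False

    def export(self, spans):
        if self.disabled:
            return
        try:
            self.inner.export(spans)
        except Exception:
            logger.warning("telemetry export failed; disabling exporter")
            self.disabled = True  # degrade gracefully, don't retry forever

class BrokenExporter:
    def export(self, spans):
        raise ConnectionError("collector unreachable")

safe = SafeExporter(BrokenExporter())
safe.export(["span-1"])  # swallows the error and disables itself
safe.export(["span-2"])  # no-op after the failure
```

Real implementations would typically add bounded retries or re-enable after a cooldown; the essential property is that export errors stay inside the telemetry layer.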
5.4 Gaps and Future Work
Areas needing improvement:
- autogen: Should capture LLM messages in gen_ai.input/output.messages format. Currently only traces agent/tool boundaries.
- langgraph: Native OTel support would reduce dependency on LangChain's callback system.
- crewAI: Should migrate to GenAI standard attributes for better ecosystem compatibility.
- Documentation: Many frameworks lack clear telemetry setup guides. OTel's docs/gen-ai/ is excellent but not widely referenced.
6. Conclusion
OpenTelemetry provides a mature, well-designed set of semantic conventions for agent and LLM observability. The gen_ai.input.messages and gen_ai.output.messages attributes are the definitive standard for storing LLM message content, with clear opt-in semantics and multimodal support.
Among agent frameworks, pydantic-ai and Microsoft's agent-framework lead in OTel compliance, implementing the full GenAI spec with proper message capture. autogen has good foundation but lacks LLM message telemetry. Other frameworks either use proprietary systems (openai-agents-python) or rely on external instrumentation (langgraph, llama_index).
For production agent systems requiring observability, we recommend:
- Choose a framework with native OTel GenAI support (pydantic-ai, agent-framework)
- If using other frameworks, add opentelemetry-instrumentation-<provider> for the underlying LLM calls
- Export to standard OTLP endpoints for backend flexibility
Appendix: Repository Analysis Summary
Total repositories analyzed: 29
Instrumentation libraries: 7 (Section 1)
Agent building frameworks: 16 (Section 2)
Agent projects: 6 (Section 2.2)
OTel GenAI Compliance Matrix:
| Repository | Type | OTel Used? | LLM Messages | GenAI Attributes | Score |
|---|---|---|---|---|---|
| pydantic-ai | Framework | ✅ Direct | ✅ Yes | ✅ Full | 10/10 |
| agent-framework | Framework | ✅ Direct | ✅ Opt-in | ✅ Full | 10/10 |
| autogen | Framework | ✅ Direct | ❌ No | ⚠️ Partial | 6/10 |
| crewAI | Framework | ⚠️ Partial | ✅ Custom | ❌ Custom | 4/10 |
| openai-agents-python | Framework | ❌ None | ❓ Unclear | ❌ Custom | 2/10 |
| langgraph | Framework | ⚠️ Indirect | ⚠️ Via LC | ⚠️ Indirect | 5/10 |
| beeai-framework | Framework | ⚠️ Likely | ❓ Unclear | ⚠️ Partial | 5/10 |
| llama_index | Framework | ⚠️ Indirect | ⚠️ Via LC | ⚠️ Indirect | 5/10 |
| youtu-agent | Framework | ❌ Custom | ✅ Custom | ❌ Custom | 3/10 |
| MetaGPT | Framework | ❌ None | ❌ None | ❌ None | 0/10 |
| Qwen-Agent | Framework | ❌ None | ❌ None | ❌ None | 0/10 |
| AutoAgent | Framework | ❌ None | ❌ None | ❌ None | 0/10 |
| AgentVerse | Framework | ❌ None | ❌ None | ❌ None | 0/10 |
| spring-ai-alibaba | Framework | ⚠️ Likely | ⚠️ Likely | ⚠️ Likely | 6/10 |
| Open-AutoGLM | Framework | ❌ None | ❌ None | ❌ None | 0/10 |
| agentfield | Framework | ✅ Yes | ✅ Yes | ✅ Full | 10/10 |
| nanobot | Application | ❌ None | ❌ None | ❌ None | 0/10 |
| picoclaw | Application | ⚠️ Maybe | ❓ Unclear | ❓ Unclear | 2/10 |
| ironclaw | Application | ❌ None | ❌ None | ❌ None | 0/10 |
| openclaw | Application | ❌ None | ❌ None | ❌ None | 0/10 |
| NemoClaw | Application | ⚠️ Maybe | ❓ Unclear | ❓ Unclear | 2/10 |
| Qclaw | Application | ⚠️ Maybe | ❓ Unclear | ❓ Unclear | 2/10 |
Score breakdown:
- 10/10: Full OTel GenAI compliance, includes message capture
- 6-8/10: Partial compliance (some attributes, missing messages)
- 3-5/10: Minimal or indirect OTel usage
- 0-2/10: No OTel support
End of Report