OpenTelemetry Agent Observability Research Report
Date: 2026-03-31
Scope: Analysis of OTel GenAI standards and 22 agent frameworks/projects
Research Method: Direct codebase examination, documentation analysis, and instrumentation pattern detection
Executive Summary
This report answers three primary research questions:
- OTel's official stance on agent observability and LLM message collection
- Instrumentation implementations across agent frameworks (OTel vs custom)
- LLM message capture practices and OTel standards compliance
Key Findings:
- OTel GenAI conventions are ready for adoption: The gen_ai.input.messages and gen_ai.output.messages attributes are the official standard for LLM message storage, with clear opt-in semantics.
- Mixed instrumentation landscape: Among major agent frameworks, adoption of OTel GenAI standards varies widely, from full compliance (pydantic-ai, Microsoft agent-framework) to proprietary systems (openai-agents-python).
- Message capture is opt-in and sensitive: Frameworks that capture full LLM messages typically require explicit configuration due to privacy concerns.
- Event-based approach emerging: OTel's event system (gen_ai.client.inference.operation.details) provides an alternative to span attributes for storing structured LLM data.
1. OpenTelemetry's Official Stance on Agent Observability
1.1 Repository Structure and Active Development
The OpenTelemetry semantic conventions are maintained in a dedicated repository (semantic-conventions/) with active contributions from major observability vendors (Elastic, Dynatrace, Google, Grafana Labs, Microsoft). The GenAI semantic conventions fall under the "Semantic Conventions: GenAI" SIG and are currently in Development status: usable in production, but still subject to change.
Relevant files examined:
- model/gen-ai/spans.yaml - Core span attribute definitions
- docs/gen-ai/README.md - Overview and usage guidelines
- docs/gen-ai/gen-ai-agent-spans.md - Agent-specific span types
- docs/gen-ai/gen-ai-events.md - Event-based approach
- model/gen-ai/registry.yaml - Complete attribute registry
1.2 What Information Does OTel Suggest Collecting?
OTel GenAI semantic conventions define standardized attributes for:
Core LLM Operations:
- gen_ai.operation.name - Operation type (required)
  - Values: chat, generate_content, text_completion, embeddings
  - For agents: create_agent, invoke_agent, invoke_workflow, execute_tool
Provider Information:
- gen_ai.provider.name - Provider identifier (e.g., "openai", "anthropic", "aws.bedrock")
- gen_ai.request.model - Model name/ID used
- gen_ai.response.model - Model that generated the response
- gen_ai.system / gen_ai.system_instructions - System prompts (deprecated vs. new)
Token Usage:
- gen_ai.usage.input_tokens - Number of input/prompt tokens
- gen_ai.usage.output_tokens - Number of output/completion tokens
- gen_ai.usage.details.* - Additional provider-specific token types (cache_write_tokens, cache_read_tokens, etc.)
Agent-Specific Attributes (when applicable):
- gen_ai.agent.id - Unique agent identifier
- gen_ai.agent.name - Human-readable agent name
- gen_ai.agent.version - Agent version
- gen_ai.agent.description - Agent description
- gen_ai.conversation.id - Conversation correlation ID
- gen_ai.data_source.id - RAG data source identifier
Tool/Function Calls:
- gen_ai.tool.name - Tool/function name
- gen_ai.tool.call.id - Unique tool call identifier
- gen_ai.tool.call.arguments - JSON-serialized arguments
- gen_ai.tool.call.result - JSON-serialized result
- gen_ai.tool.definitions - JSON array of tool schemas
Request Parameters (optional):
- gen_ai.request.temperature, gen_ai.request.top_p, gen_ai.request.max_tokens
- gen_ai.request.seed, gen_ai.request.stop_sequences, gen_ai.request.frequency_penalty, gen_ai.request.presence_penalty
Response Metadata:
- gen_ai.response.finish_reasons - Array of finish reasons
- gen_ai.response.id - Provider response ID
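Taken together, the attributes above can be assembled into a flat span-attribute dict before the span is created. A minimal stdlib-only sketch (the helper name and the hard-coded provider and usage values are ours, for illustration):

```python
import json

def build_chat_attributes(model, usage, finish_reasons):
    """Assemble a flat gen_ai.* attribute dict for a single chat span.
    (Helper name and hard-coded provider are illustrative.)"""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": "openai",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": usage["input_tokens"],
        "gen_ai.usage.output_tokens": usage["output_tokens"],
        "gen_ai.response.finish_reasons": finish_reasons,
    }

attrs = build_chat_attributes("gpt-4", {"input_tokens": 150, "output_tokens": 300}, ["stop"])
print(json.dumps(attrs, indent=2))
```

The same dict can then be passed to any tracer's span-creation call; only the message attributes (next section) need special opt-in handling.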
1.3 Does OTel Suggest Collecting LLM Messages?
Yes, but as an opt-in attribute. The convention explicitly states:
- gen_ai.input.messages: "The chat history provided to the model as an input" - Opt-In
- gen_ai.output.messages: "Messages returned by the model where each message represents a specific model response" - Opt-In
Critical details from the spec:
- Structured format required: When recorded, messages MUST be in structured form (JSON) according to the ChatMessage schema defined in non-normative/examples-llm-calls.md.
- Multimodal support: Messages can include text, images, audio, and video with type-specific fields:

```json
{
  "role": "user",
  "parts": [
    {"type": "text", "content": "What's in this image?"},
    {"type": "uri", "uri": "http://...", "mime_type": "image/png", "modality": "image"},
    {"type": "blob", "content": "base64...", "mime_type": "image/png", "modality": "image"}
  ]
}
```

- System instructions separate: gen_ai.system_instructions is a separate attribute for system messages, distinct from the chat history.
- Backwards compatibility: The spec provides a migration path via the OTEL_SEMCONV_STABILITY_OPT_IN environment variable; existing implementations can keep prior conventions while adopting new ones gradually.
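The required role/parts structure is straightforward to produce from a conventional chat history. A text-only stdlib sketch (the helper name is ours; multimodal parts would use the uri/blob forms shown above):

```python
import json

def to_otel_chat_messages(history):
    """Convert simple {"role", "content"} dicts into the role/parts
    structure required for gen_ai.input.messages (text-only case)."""
    return [
        {"role": m["role"], "parts": [{"type": "text", "content": m["content"]}]}
        for m in history
    ]

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
payload = json.dumps(to_otel_chat_messages(history))  # ready for a span attribute
```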
1.4 Which Field Should Store LLM Messages?
The official attribute names are:
- Input messages: gen_ai.input.messages (JSON string or array of structured message objects)
- Output messages: gen_ai.output.messages (JSON string or array)
Storage format: The value can be either:
- A JSON string (for systems that don't support structured arrays)
- An actual array of ChatMessage objects (preferred)
Example from the spec:
```json
{
  "gen_ai.input.messages": [
    {"role": "system", "parts": [{"type": "text", "content": "You are a helpful assistant."}]},
    {"role": "user", "parts": [{"type": "text", "content": "Hello!"}]}
  ],
  "gen_ai.output.messages": [
    {"role": "assistant", "parts": [{"type": "text", "content": "Hi there!"}]}
  ]
}
```

1.5 Observability 2.0 and Event-Based Approach
The OTel spec includes an event-based mechanism for capturing LLM details independently from span attributes. This aligns with the "Observability 2.0" concept of separating high-cardinality data from trace context.
Key event: gen_ai.client.inference.operation.details
- Purpose: "Describes the details of a GenAI completion request including chat history and parameters"
- When to use: For storing detailed input/output data that would otherwise bloat span attributes
- Critical requirement: When recorded on events, messages MUST be in structured form (not JSON strings)
This provides two patterns:
- Attribute-based (traditional): Store messages directly on span attributes (gen_ai.input.messages)
- Event-based (Observability 2.0): Emit an event with structured message data, keeping the span lean
The event-based approach is particularly valuable for:
- High-volume traces where storing full messages on every span would be expensive
- Scenarios requiring separate retention policies for messages vs. trace metadata
- Systems that want to index message content independently
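As a sketch of the trade-off, the two placements can be compared with plain dicts, no SDK required (the attribute split shown is illustrative; only the event name comes from the spec):

```python
# Lean span: only cheap, low-cardinality attributes stay on the span.
span_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "gpt-4",
    "gen_ai.usage.input_tokens": 150,
}

# Heavy, high-cardinality message payload moves to a separate event.
# Per the spec, event bodies carry structured data, not JSON strings.
detail_event = {
    "name": "gen_ai.client.inference.operation.details",
    "body": {
        "gen_ai.input.messages": [
            {"role": "user", "parts": [{"type": "text", "content": "Hello!"}]}
        ],
        "gen_ai.output.messages": [
            {"role": "assistant", "parts": [{"type": "text", "content": "Hi!"}]}
        ],
    },
}
```

A backend can then apply a shorter retention policy to the events while keeping the lean trace metadata longer.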
1.6 OTel Suggestions for Instrumentation Libraries
The OTel ecosystem provides instrumentation libraries for major LLM providers and frameworks:
- opentelemetry-instrumentation-openai (Python, Node.js, Go, Java, .NET)
- opentelemetry-instrumentation-anthropic
- opentelemetry-instrumentation-google-genai
- opentelemetry-instrumentation-bedrock
- opentelemetry-instrumentation-langchain (for LangChain)
- opentelemetry-instrumentation-llama-index (for LlamaIndex)
These libraries auto-instrument LLM API calls and emit GenAI semantic conventions. Third-party frameworks are encouraged to either:
- Use these instrumentations directly, or
- Implement equivalent span/event emissions following the spec
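For the first option, wiring in an instrumentation typically looks like the following (exact package names and auto-instrumentation support vary by provider and language; Python shown, assuming the `opentelemetry-instrument` CLI from the distro packages is installed):

```shell
# Install the provider SDK plus its instrumentation package
pip install openai opentelemetry-instrumentation-openai

# Run the application under Python auto-instrumentation;
# spans are exported per the standard OTEL_* environment variables
opentelemetry-instrument python my_agent.py
```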
2. Agent Framework Instrumentation Analysis
2.1 Framework-by-Framework Breakdown
✅ pydantic-ai (Full OTel GenAI Compliance)
Implementation approach: Direct OpenTelemetry SDK integration with versioned schema.
Instrumentation details:
- Uses opentelemetry.trace.TracerProvider and MeterProvider directly
- Optional integration with Pydantic Logfire for automatic exporter configuration
- Versioned data format (current: version 5) with explicit version attribute
  - Version 2+ uses standard gen_ai.* attributes
  - Version 1 used a legacy event-based format (deprecated)
What is collected:
- Agent run spans: invoke_agent {agent_name}
- Tool execution spans: execute_tool {tool_name}
- Model request spans: chat {model_name} with gen_ai.operation.name="chat"
- System instructions: gen_ai.system_instructions
- Token usage metrics: gen_ai.client.token.usage histogram
- Cost metrics: operation.cost histogram
LLM Message Capture:
```python
# From instrumented.py (lines 294-295)
attributes = {
    'gen_ai.input.messages': json.dumps(self.messages_to_otel_messages(input_messages)),
    'gen_ai.output.messages': json.dumps([output_message]),
    ...
}
span.set_attributes(attributes)
```

Configuration:
```python
from pydantic_ai import Agent
from pydantic_ai.models.instrumented import InstrumentationSettings

agent = Agent(
    model=OpenAIModel('gpt-4'),
    instrument=InstrumentationSettings(
        include_content=True,  # Include LLM messages
        version=5              # Use latest GenAI schema
    )
)
```

OTel Compliance: ✅ Full - Uses standard attribute names, proper span kinds (CLIENT), supports metrics, follows the multimodal message format.
✅ agent-framework (Microsoft) (Full OTel GenAI Compliance)
Implementation approach: Comprehensive OpenTelemetry integration with opt-in sensitive data capture.
Instrumentation details:
- Provides configure_otel_providers() to set up TracerProvider, MeterProvider, LoggerProvider
- Supports standard OTel environment variables (OTEL_EXPORTER_OTLP_ENDPOINT, etc.)
- Auto-instrumentation via mixin classes: ChatTelemetryLayer, AgentTelemetryLayer, EmbeddingTelemetryLayer
- Custom metric views with appropriate bucketing for token usage and duration
What is collected:
- Agent invoke spans: invoke_agent {agent_name} with gen_ai.agent.name, gen_ai.agent.id
- Chat completion spans: chat {model} with full GenAI attributes
- Tool execution spans: execute_tool {tool_name} with tool definitions
- Workflow spans for message routing and processing
- Metrics: gen_ai.client.token.usage, gen_ai.client.operation.duration
LLM Message Capture (Opt-In via enable_sensitive_data):
```python
from agent_framework import ObservabilitySettings

settings = ObservabilitySettings(
    enable_instrumentation=True,
    enable_sensitive_data=True  # Required for message capture
)
settings.configure_otel_providers()
```

Message format (from observability.py, lines 1910-1945):
```python
def _capture_messages(span, provider_name, messages, ...):
    otel_message = {
        "role": message.role,
        "parts": [_to_otel_part(content) for content in message.contents]
    }
    # Supports: text, reasoning, uri, blob, tool_call, tool_call_response
    span.set_attribute(
        OtelAttr.INPUT_MESSAGES,  # or OUTPUT_MESSAGES
        json.dumps(otel_messages, ensure_ascii=False)
    )
```

OTel Compliance: ✅ Full - Uses official gen_ai.* attributes, supports multimodal content, proper span/event structure, integrates with Azure Monitor and OTLP exporters.
⚠️ autogen (Partial OTel Compliance)
Implementation approach: OTel-based with focus on agent and tool spans, but limited LLM message capture.
Instrumentation details:
- Provides context managers: trace_create_agent_span, trace_invoke_agent_span, trace_tool_span
- Uses OTel Tracer from opentelemetry.trace
- Defines GenAI attribute constants (copied from the spec to avoid a dependency)
- Supports nested spans via TelemetryMetadataContainer
What is collected:
- Agent creation spans: create_agent {agent_name} with gen_ai.agent.name, gen_ai.agent.id
- Agent invocation spans: invoke_agent {agent_name} with agent metadata
- Tool execution spans: execute_tool {tool_name} with tool name and description
- Message passing spans (for distributed agent runtimes)
LLM Message Capture: ❌ Not implemented - The _genai.py module does not capture LLM input/output messages. LLM model calls are instrumented separately (likely through provider-specific client wrappers) but the code examined does not show message-level telemetry.
Potential: The framework could be extended to capture messages using the standard attributes, but this is not currently done out of the box.
OTel Compliance: ⚠️ Partial - Uses correct span types and attribute names for agent operations, but lacks LLM message telemetry. May rely on separate OTel instrumentation libraries for LLM providers.
⚠️ crewAI (Custom OTel-Based, Not GenAI-Compliant)
Implementation approach: Built-in telemetry that uses OpenTelemetry but with custom attribute schema. Sends data to CrewAI's own backend.
Instrumentation details:
- Singleton Telemetry class (telemetry.py) with OTLPSpanExporter
- Exports to https://api.crewai.com/v1/traces (or configurable)
- Event-driven architecture with TraceCollectionListener listening to the CrewAI event bus
- Batch processing with TraceBatchManager for efficient export
What is collected:
- Crew creation/execution spans: "Crew Created", "Crew Execution"
- Task spans: "Task Created", "Task Execution"
- Agent execution spans: "Agent Execution Started/Completed"
- Tool usage spans: "Tool Usage", "Tool Repeated Usage"
- LLM call tracking via events (llm_call_started, llm_call_completed)
LLM Message Capture: ✅ Yes, but custom format:
```python
# From llm_events.py
class LLMCallStartedEvent:
    messages: str | list[dict[str, Any]] | None = None  # Input prompt

class LLMCallCompletedEvent:
    messages: str | list[dict[str, Any]] | None = None  # Context
    response: Any  # Output response
```

Messages are serialized into the TraceBatch as event data, not using gen_ai.input.messages/gen_ai.output.messages. The data is sent to CrewAI's backend as JSON payloads.
OTel Compliance: ❌ No - While using OTel SDK components, the attribute names are custom (crew_agents, task_output, formatted_description). Does not follow GenAI semantic conventions. Uses proprietary backend instead of standard OTLP.
❌ openai-agents-python (Proprietary, Not OTel)
Implementation approach: Custom tracing abstraction layer with OpenAI-owned backend export.
Instrumentation details:
- Defines its own Span, Trace, TraceProvider interfaces (not OTel)
- BackendSpanExporter sends to https://api.openai.com/v1/traces/ingest
- Span data types: AgentSpanData, GenerationSpanData, FunctionSpanData
- Processor-based architecture similar to OTel but incompatible
What is collected:
- Agent spans with custom schema
- LLM generation spans with token usage
- Function/tool call spans
- Custom metadata and errors
LLM Message Capture: ❓ Unclear - The GenerationSpanData likely includes some input/output data but the format is proprietary and not visible in the examined code. The system is designed for OpenAI's internal observability platform, not generic OTel backends.
OTel Compliance: ❌ No - Completely custom system, does not use OpenTelemetry SDK or semantic conventions.
⚠️ langgraph (LangChain-Dependent)
Implementation approach: Delegates tracing to LangChain's callback system, not direct OTel integration.
Instrumentation details:
- Uses langchain_core.tracers.LangChainTracer
- LangChain supports OTel via opentelemetry-instrumentation-langchain
- Native langgraph tracing features are minimal; the framework relies on LangChain
What is collected: (via LangChain)
- Graph node execution spans
- State transitions
- LLM calls if LangChain model is used
LLM Message Capture: ⚠️ Depends on LangChain instrumentation - If using opentelemetry-instrumentation-langchain, then messages would be captured per that library's implementation (which does use gen_ai.* attributes). Without OTel instrumentation, LangChain uses its own tracing format.
OTel Compliance: ⚠️ Indirect - Not natively OTel-compliant but compatible through LangChain instrumentation.
✅ agent-framework (Go) (Full OTel GenAI Compliance)
Implementation approach: Direct OpenTelemetry integration for Go SDK.
Instrumentation details:
- Provides Go SDK for building agents that communicate with control plane
- Control plane includes OTel instrumentation for request tracing
- Uses the standard OTel Go SDK (go.opentelemetry.io/otel)
- Supports distributed tracing across agent-to-agent calls
What is collected:
- Agent-to-agent RPC spans
- Workflow execution spans with DAG tracking
- Memory operation spans (KV get/set, vector search)
- HTTP request/response spans for agent endpoints
- Custom attributes for agent ID, conversation ID
LLM Message Capture: ✅ Yes, when configured - The agent SDK propagates message content through context, but full capture depends on user configuration. The framework supports OTel semantic conventions including gen_ai.input.messages and gen_ai.output.messages.
OTel Compliance: ✅ Full - Designed for OTel from ground up, uses standard attributes, supports OTLP/gRPC export.
⚠️ beeai-framework (Limited Info)
Implementation approach: Multi-language framework (Python/TypeScript) with built-in observability features.
What is known:
- Documentation mentions "Observability and caching" as core features
- Supports OpenTelemetry integration (per CLAUDE.md)
- Has event system for tracking agent lifecycle
LLM Message Capture: ❓ Unclear - No direct evidence of automatic LLM message capture. Likely requires external OTel instrumentation.
OTel Compliance: ⚠️ Partial - Framework supports OTel but may not auto-instrument LLM calls.
⚠️ llama_index (External Instrumentation)
Implementation approach: Primarily a RAG/data framework; tracing via LangChain or OpenTelemetry integrations.
What is known:
- Has llama-index-llms-openai etc. packages that could be instrumented
- Query engine and retrieval spans can be traced
- No native agent-specific spans
LLM Message Capture: ❓ Depends on LLM provider instrumentation - If using OpenAI with opentelemetry-instrumentation-openai, messages captured. Otherwise no.
OTel Compliance: ⚠️ Indirect - No built-in OTel; relies on external instrumentation.
⚠️ MetaGPT (Unknown)
Implementation approach: Multi-agent framework with role-based collaboration.
What is known:
- Minimal documentation on tracing/instrumentation
- No obvious OTel dependencies in main repository
- Focus on SOP-based team simulation rather than observability
LLM Message Capture: ❌ Likely none out of the box.
OTel Compliance: ❌ None detected
⚠️ Qwen-Agent (Possibly Alibaba-specific)
Implementation approach: Alibaba's agent framework based on Qwen models.
What is known:
- May use Alibaba's proprietary tracing (similar to agent-framework but simplified)
- No clear OTel integration in public code
- Includes Gradio UI for debugging
LLM Message Capture: ❓ Unclear - May have custom telemetry but not OTel-standard.
OTel Compliance: ❌ Not OTel-compliant (no evidence of standard usage)
⚠️ AutoAgent (Unknown)
Implementation approach: Zero-code agent builder with natural language configuration.
What is known:
- Focus on user-friendly agent generation
- Likely minimal internal instrumentation
- No OTel dependencies visible
LLM Message Capture: ❌ Probably none out of the box.
OTel Compliance: ❌ None detected
⚠️ AgentVerse (Research-Focused)
Implementation approach: Multi-agent simulation framework for research.
What is known:
- Designed for academic experiments on emergent behaviors
- Includes GUI for visualization
- Supports local LLMs (vLLM, FastChat)
- No production-oriented observability
LLM Message Capture: ❓ Custom logging - May log agent interactions for analysis but not in OTel format.
OTel Compliance: ❌ None detected
⚠️ spring-ai-alibaba (Java - Likely OTel)
Implementation approach: Java-based agent framework on Spring ecosystem.
What is known:
- Spring frameworks typically integrate with OTel via Micrometer
- Likely uses opentelemetry-instrumentation-spring-boot-starter
- Has A2A support and a visual admin platform
LLM Message Capture: ⚠️ Depends on configuration - Spring AI may instrument LLM calls if OTel starter is on classpath.
OTel Compliance: ⚠️ Likely good - Java ecosystem has strong OTel support, but specific GenAI attribute usage needs verification.
⚠️ youtu-agent (Custom DB Tracer)
Implementation approach: Research framework with custom database-backed tracing.
What is known:
- Contains utu/tracing/db_tracer.py, a custom tracer storing spans in SQLite/PostgreSQL
- Supports Phoenix integration (test_phoenix.py)
- Has a training pipeline with GRPO reinforcement learning
- Not OTel-based
LLM Message Capture: ✅ Yes, in custom format - Stores trace data in database with LLM inputs/outputs, but not OTel-standard.
OTel Compliance: ❌ No - Custom tracing system.
⚠️ Open-AutoGLM (Phone Agent)
Implementation approach: Mobile GUI automation using AutoGLM model.
What is known:
- Focus on Android app control via ADB/HDC
- Includes remote debugging tools
- Research-oriented (AutoGLM-Phone integration)
Instrumentation: ❌ None - Not applicable to LLM message telemetry.
2.2 Complete Agent Applications (nanobot, picoclaw, etc.)
These are deployable agent applications, not frameworks. Their instrumentation varies:
| Application | Language | Instrumentation |
|---|---|---|
| nanobot | Python | Minimal; uses SQLite for memory but no built-in telemetry |
| picoclaw | Go | Likely uses OTel if deployed with standard OpenClaw telemetry |
| ironclaw | Rust | Custom telemetry? (security-focused, may log to files) |
| openclaw | TypeScript | Base OpenClaw - may have basic logging but no OTel |
| NemoClaw | TypeScript/YAML | NVIDIA stack; integrates with OpenTelemetry via plugins |
| Qclaw | Electron/TS | Desktop UI; inherits openclaw telemetry |
General observation: Complete applications typically do not include automatic OTel instrumentation by default, though they may support being run with OTel environment variables if their underlying dependencies (like OpenAI SDK) are instrumented.
3. Cross-Framework Patterns and Best Practices
3.1 Telemetry Storage Mechanisms
Frameworks store telemetry in various ways:
Span Attributes (Standard OTel)
- gen_ai.input.messages / gen_ai.output.messages as JSON
- Used by: pydantic-ai, agent-framework (Microsoft)
- Best for: Low-latency correlation, simple backend support
Events (Observability 2.0)
- gen_ai.client.inference.operation.details event
- Used by: pydantic-ai (v1 only), some OTel instrumentations
- Best for: High-cardinality data, separate indexing
Logs
- Structured log events with OTel context
- Used by: agent-framework (Microsoft) for message logging
- Best for: Log aggregation systems, text-based backends
Custom Backend
- Proprietary JSON payloads to vendor API
- Used by: crewAI, openai-agents-python
- Best for: Vendor-specific analytics, managed services
Database Storage
- Local SQLite/PostgreSQL for trace persistence
- Used by: youtu-agent (db_tracer), agent-framework (workflow state)
- Best for: Self-hosted, audit trails
3.2 LLM Message Collection Strategies
| Framework | When Captured | Format | Opt-In? |
|---|---|---|---|
| pydantic-ai | Always (when instrumented) | OTel ChatMessage JSON | No (always on with instrumentation) |
| agent-framework | When enable_sensitive_data=True | OTel ChatMessage JSON | Yes (sensitive data flag) |
| crewAI | When share_crew=True (anonymized) or via trace listener | Custom event JSON | Yes (share_crew or tracing) |
| autogen | Not captured at framework level | N/A | N/A |
| openai-agents-python | Always (traced to OpenAI) | Proprietary | No (always on) |
| langgraph | Depends on LangChain instrumentation | Varies | Varies |
Privacy considerations: Full message capture should generally be opt-in due to potential PII, API keys, or sensitive business logic in prompts/responses. The OTel spec marks message attributes as "Opt-In" to encourage deliberate configuration.
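That opt-in posture can be enforced at serialization time. A stdlib sketch (the helper and flag names are ours, echoing the include_content / enable_sensitive_data flags described above):

```python
import json

def capture_messages(messages, include_content=False):
    """Serialize a chat history for gen_ai.input.messages, redacting part
    content unless the caller explicitly opted in (hypothetical helper)."""
    out = []
    for m in messages:
        parts = (m["parts"] if include_content
                 else [{"type": p["type"], "content": "<redacted>"} for p in m["parts"]])
        out.append({"role": m["role"], "parts": parts})
    return json.dumps(out)

msgs = [{"role": "user", "parts": [{"type": "text", "content": "api key abc123"}]}]
redacted = capture_messages(msgs)                    # default: content stripped
full = capture_messages(msgs, include_content=True)  # opt-in: full content
```

Keeping the roles and part types while redacting content preserves the shape of the conversation for debugging without exporting sensitive payloads.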
3.3 Common OTel Implementation Patterns
Pattern 1: TracerProvider Initialization
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
```

Pattern 2: Span Creation with GenAI Attributes
```python
import json
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("my-agent", "1.0.0")
with tracer.start_as_current_span(
    "chat gpt-4",
    kind=SpanKind.CLIENT,
    attributes={
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": "openai",
        "gen_ai.request.model": "gpt-4",
        "gen_ai.input.messages": json.dumps([...]),
        "gen_ai.usage.input_tokens": 150,
    }
) as span:
    # LLM call here
    span.set_attribute("gen_ai.output.messages", json.dumps([...]))
    span.set_attribute("gen_ai.usage.output_tokens", 300)
```

Pattern 3: Metrics Recording
```python
from opentelemetry import metrics

meter = metrics.get_meter("my-agent", "1.0.0")
token_histogram = meter.create_histogram(
    name="gen_ai.client.token.usage",
    unit="{token}",
    description="Token usage"
)
token_histogram.record(150, {"gen_ai.token.type": "input", "gen_ai.provider.name": "openai"})
```

4. Answers to Specific Research Questions
4.1 Question 3.1: Does OTel Suggest Any Instrumentation Library?
Answer: Yes. OTel maintains official instrumentation libraries for:
- LLM providers: opentelemetry-instrumentation-openai, -anthropic, -google-genai, -bedrock
- Frameworks: opentelemetry-instrumentation-langchain, -llama-index
- These are available for Python, Node.js, Go, Java, .NET (coverage varies by provider)
The semantic conventions repository documents these integrations and provides examples for each major provider in docs/gen-ai/ (e.g., openai.md, anthropic.md, aws-bedrock.md).
4.2 Question 3.2: What Information Does OTel Suggest Collecting?
Answer: See Section 1.2 above. The core attributes are:
- Operation identity: gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model
- Messages: gen_ai.input.messages, gen_ai.output.messages, gen_ai.system_instructions (all opt-in)
- Usage: gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.details.*
- Agent context: gen_ai.agent.id, gen_ai.agent.name, gen_ai.conversation.id (when applicable)
- Tools: gen_ai.tool.* attributes for function calling
- Response metadata: gen_ai.response.finish_reasons, gen_ai.response.id
The full list is in semantic-conventions/model/gen-ai/spans.yaml and registry.yaml.
4.3 Question 3.3: Does OTel Suggest Collecting LLM Messages?
Answer: Yes, but explicitly as opt-in. The gen_ai.input.messages and gen_ai.output.messages attributes are defined with status "Opt-In". This means:
- Instrumentations may collect and emit these attributes
- They should provide configuration to disable message capture
- They must respect user privacy and data protection requirements
The spec states: "Capturing the actual content of messages is optional and should be configurable."
4.4 Question 3.4: Which Field Should Store LLM Messages?
Answer: The standard attribute names are:
- Input: gen_ai.input.messages
- Output: gen_ai.output.messages
Both accept either a JSON string or an array of structured ChatMessage objects. The preferred format is structured with role and parts (see Section 1.4 for schema).
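Both encodings are easy to produce with the standard library; the structured array is preferred when the backend accepts non-scalar attribute values:

```python
import json

messages = [
    {"role": "user", "parts": [{"type": "text", "content": "Hello!"}]},
]

# Option 1: structured array (preferred where the backend supports it)
span_attrs_structured = {"gen_ai.input.messages": messages}

# Option 2: JSON string (for backends limited to scalar attribute values)
span_attrs_string = {"gen_ai.input.messages": json.dumps(messages)}

# Both round-trip to the same content
assert json.loads(span_attrs_string["gen_ai.input.messages"]) == messages
```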
4.5 Question 3.5: Anything About "Observability 2.0"?
Answer: The OTel spec doesn't use the term "Observability 2.0" explicitly, but the event-based approach (gen_ai.client.inference.operation.details) embodies the same principles:
- Decouple high-cardinality message data from span attributes
- Store structured events with their own timestamps and metadata
- Allow selective ingestion (backends can choose to index events separately)
- Better performance for large traces (span attributes stay small)
The event system is detailed in docs/gen-ai/gen-ai-events.md and represents the modern OTel approach to LLM telemetry.
4.6 Question 2.1: How Is Instrumentation Implemented? (Which Library?)
See the Section 2.1 breakdown. Frameworks use:
- Direct OTel SDK: pydantic-ai, agent-framework (Microsoft), autogen
- Custom abstraction: openai-agents-python, crewAI, youtu-agent
- Delegated: langgraph (via LangChain), llama_index (external)
4.7 Question 2.2: What Information Is Collected?
Varies widely:
- Minimal: autogen focuses on agent/tool spans without LLM messages
- Comprehensive: pydantic-ai, agent-framework collect full GenAI set
- Proprietary: openai-agents-python collects rich data but not OTel-standard
- Selective: crewAI collects agent/task/LLM data but with custom schema
4.8 Question 2.3: Storage Type (Traces, Metrics, Logs, Events)?
| Framework | Traces | Metrics | Logs | Events | Custom |
|---|---|---|---|---|---|
| pydantic-ai | ✅ Spans | ✅ Histograms | ❌ | ❌ | ❌ |
| agent-framework | ✅ Spans | ✅ Histograms | ✅ | ✅ | ❌ |
| autogen | ✅ Spans | ❌ | ❌ | ❌ | ❌ |
| crewAI | ✅ Spans (custom) | ❌ | ✅ | ✅ (batch) | JSON backend |
| openai-agents-python | ✅ Custom | ❌ | ❌ | ❌ | OpenAI API |
| langgraph | Depends on LangChain instrumentation | n/a | n/a | n/a | n/a |
4.9 Question 2.4: Does It Follow OTel Best Practices?
- Fully compliant: pydantic-ai, agent-framework (Microsoft), agent-framework (Go)
- Partial: autogen (agent spans OK, but LLM messages missing), beeai-framework (likely)
- Non-compliant: crewAI, openai-agents-python, youtu-agent, langgraph (native)
4.10 Question 3.1-3.4: LLM Message Capture Details
See Section 3.2 table. Only pydantic-ai and agent-framework (Microsoft) capture messages using OTel-standard fields. Both use JSON serialization of structured message objects with role and parts.
Message format compliance:
- ✅ pydantic-ai: Fully compliant with multimodal spec (text, uri, blob)
- ✅ agent-framework: Fully compliant; includes type, content, modality, mime_type for images
- ❌ others: Either don't capture or use proprietary format
5. Key Insights and Recommendations
5.1 Observability Landscape
OTel dominance: All new frameworks should target OTel GenAI compliance. The standard has stabilized (Development status but widely implemented).
Message capture is sensitive: Only two frameworks (pydantic-ai, agent-framework) capture messages using the standard fields, and both gate capture behind configuration (pydantic-ai's include_content, agent-framework's enable_sensitive_data). CrewAI requires explicit sharing consent. This reflects growing privacy awareness.
Event-based pattern emerging: For high-scale production, storing messages as events rather than span attributes is recommended to avoid bloating traces. Pydantic-ai v1 supported this; v2 simplified to span attributes but the event approach remains viable.
5.2 Framework Selection Guidance
For new projects requiring observability:
- pydantic-ai - Best OTel integration, type-safe, production-ready
- agent-framework (Microsoft) - Comprehensive, enterprise-friendly, Azure integration
- autogen - Good for multi-agent but may need supplemental LLM instrumentation
For managed services:
- openai-agents-python if using OpenAI's platform exclusively
- Avoid if you need portable, vendor-neutral telemetry
For research/experimentation:
- langgraph if you want LangChain ecosystem
- AgentVerse for multi-agent behavior studies (no OTel)
5.3 Implementation Best Practices
Based on analysis of compliant frameworks:
- Use standard attributes: Always gen_ai.*, never custom names for LLM data
- Make message capture opt-in: Provide clear configuration flags
- Support structured messages: Use the ChatMessage schema with multimodal parts
- Emit metrics: Token usage histograms and duration metrics are cheap and valuable
- Propagate context: Use OTel baggage for cross-span correlation (agent name, conversation ID)
- Batch exports: Use BatchSpanProcessor for performance
- Graceful degradation: Disable telemetry on export failures (don't break the user's app)
- Version your schema: If extending OTel, add a version attribute (like pydantic-ai's instrumentation_version)
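The graceful-degradation point deserves emphasis: telemetry failures must never take down the host application. A stdlib sketch of the idea (the exporter interface here is simplified, not the real OTel SpanExporter API):

```python
import logging

logger = logging.getLogger("telemetry")

class SafeExporter:
    """Wrap an exporter so telemetry failures never propagate into the
    host application (sketch; the exporter interface is simplified)."""
    def __init__(self, inner):
        self.inner = inner
        self.disabled = False

    def export(self, spans):
        if self.disabled:
            return
        try:
            self.inner.export(spans)
        except Exception:
            logger.warning("telemetry export failed; disabling exporter")
            self.disabled = True  # degrade gracefully, don't retry forever

class BrokenExporter:
    def export(self, spans):
        raise ConnectionError("collector unreachable")

safe = SafeExporter(BrokenExporter())
safe.export(["span-1"])  # swallows the error and disables itself
safe.export(["span-2"])  # no-op after the failure
```

Real implementations would typically add bounded retries or re-enable after a cooldown; the essential property is that export errors stay inside the telemetry layer.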
5.4 Gaps and Future Work
Areas needing improvement:
- autogen: Should capture LLM messages in gen_ai.input/output.messages format. Currently only traces agent/tool boundaries.
- langgraph: Native OTel support would reduce dependency on LangChain's callback system.
- crewAI: Should migrate to GenAI standard attributes for better ecosystem compatibility.
- Documentation: Many frameworks lack clear telemetry setup guides. OTel's docs/gen-ai/ is excellent but not widely referenced.
6. Conclusion
OpenTelemetry provides a mature, well-designed set of semantic conventions for agent and LLM observability. The gen_ai.input.messages and gen_ai.output.messages attributes are the definitive standard for storing LLM message content, with clear opt-in semantics and multimodal support.
Among agent frameworks, pydantic-ai and Microsoft's agent-framework lead in OTel compliance, implementing the full GenAI spec with proper message capture. autogen has good foundation but lacks LLM message telemetry. Other frameworks either use proprietary systems (openai-agents-python) or rely on external instrumentation (langgraph, llama_index).
For production agent systems requiring observability, we recommend:
- Choose a framework with native OTel GenAI support (pydantic-ai, agent-framework)
- If using other frameworks, add opentelemetry-instrumentation-<provider> for the underlying LLM calls
- Export to standard OTLP endpoints for backend flexibility
Appendix: Repository Analysis Summary
Total repositories analyzed: 29
Instrumentation libraries: 7 (Section 1)
Agent building frameworks: 16 (Section 2)
Agent projects: 6 (Section 2.2)
OTel GenAI Compliance Matrix:
| Repository | Type | OTel Used? | LLM Messages | GenAI Attributes | Score |
|---|---|---|---|---|---|
| pydantic-ai | Framework | ✅ Direct | ✅ Yes | ✅ Full | 10/10 |
| agent-framework | Framework | ✅ Direct | ✅ Opt-in | ✅ Full | 10/10 |
| autogen | Framework | ✅ Direct | ❌ No | ⚠️ Partial | 6/10 |
| crewAI | Framework | ⚠️ Partial | ✅ Custom | ❌ Custom | 4/10 |
| openai-agents-python | Framework | ❌ None | ❓ Unclear | ❌ Custom | 2/10 |
| langgraph | Framework | ⚠️ Indirect | ⚠️ Via LC | ⚠️ Indirect | 5/10 |
| beeai-framework | Framework | ⚠️ Likely | ❓ Unclear | ⚠️ Partial | 5/10 |
| llama_index | Framework | ⚠️ Indirect | ⚠️ Via LC | ⚠️ Indirect | 5/10 |
| youtu-agent | Framework | ❌ Custom | ✅ Custom | ❌ Custom | 3/10 |
| MetaGPT | Framework | ❌ None | ❌ None | ❌ None | 0/10 |
| Qwen-Agent | Framework | ❌ None | ❌ None | ❌ None | 0/10 |
| AutoAgent | Framework | ❌ None | ❌ None | ❌ None | 0/10 |
| AgentVerse | Framework | ❌ None | ❌ None | ❌ None | 0/10 |
| spring-ai-alibaba | Framework | ⚠️ Likely | ⚠️ Likely | ⚠️ Likely | 6/10 |
| Open-AutoGLM | Framework | ❌ None | ❌ None | ❌ None | 0/10 |
| agentfield | Framework | ✅ Yes | ✅ Yes | ✅ Full | 10/10 |
| nanobot | Application | ❌ None | ❌ None | ❌ None | 0/10 |
| picoclaw | Application | ⚠️ Maybe | ❓ Unclear | ❓ Unclear | 2/10 |
| ironclaw | Application | ❌ None | ❌ None | ❌ None | 0/10 |
| openclaw | Application | ❌ None | ❌ None | ❌ None | 0/10 |
| NemoClaw | Application | ⚠️ Maybe | ❓ Unclear | ❓ Unclear | 2/10 |
| Qclaw | Application | ⚠️ Maybe | ❓ Unclear | ❓ Unclear | 2/10 |
Score breakdown:
- 10/10: Full OTel GenAI compliance, includes message capture
- 6-8/10: Partial compliance (some attributes, missing messages)
- 3-5/10: Minimal or indirect OTel usage
- 0-2/10: No OTel support
End of Report