2026-04-05 Attempt to Implement Observability for LoongClaw
Vibe Coding
I subscribed to the standard tier of Xiaomi's MiMo Token Plan, hoping to finish this task in one day.
In VSCode, I described the task to MiMo-V2-Pro via Kilo, and MiMo created an instrumentation plan. (The initial prompt is no longer visible in history... I can only see that I asked MiMo to execute the task book in the .kilo/ directory.)
After reading the task book, MiMo created 12 todos:
- [ ] Add workspace OTel dependencies to root Cargo.toml
- [ ] Wire otel feature flag in daemon Cargo.toml
- [ ] Create ObservabilityConfig in crates/app/src/config/observability.rs
- [ ] Register observability module in config/mod.rs and runtime.rs
- [ ] Create crates/daemon/src/otel.rs (OtelGuard, init_otel, gen_ai constants)
- [ ] Refactor observability.rs to use registry() + optional OTel layer
- [ ] Update main.rs: load config first, pass observability config to init_tracing
- [ ] Register otel module in daemon lib.rs
- [ ] Add GenAI span instrumentation in request.rs (non-streaming path)
- [ ] Add GenAI span instrumentation in requestCoercor.rs (streaming path)
- [ ] Create docker/docker-compose.otel.yml and otel-collector-config.yaml
- [ ] Verify: cargo build, clippy, tests
Adding OTel dependencies, setting feature flags, creating config, registering modules: these steps seemed sensible. The later items depended on the current state of the codebase. From what I observed, after writing each piece of code MiMo ran a sequence of checks: build -> check? -> clippy -> tests.
MiMo read many relevant files and then started implementing. It seemed to complete the initial instrumentation and write the Docker Compose file within an hour, and then moved on to debugging.
Following the recommendations in OTel's Rust documentation, the approach was to generate spans with the tracing crate, bridge them to OpenTelemetry via tracing-opentelemetry, and finally export them through an OTel exporter (OTLP, in this case).
Literals vs Constants
The first issue I encountered was trying to use the opentelemetry-semantic-conventions library to name span attributes, but it caused errors when placed in the Tracing macro used to create spans.
The macro for creating spans is designed to perform compile-time optimization. When a literal is passed directly:
trace_span!("my.span")
Because the macro receives the literal "my.span", the compiler can intern the string.
- Intern: The compiler stores a unique instance of this string in the read-only data segment of the binary, assigning a fixed pointer or identifier.
Because the span's name is baked into the binary, telemetry stays very efficient.
Only literals or identifiers are supported. The attribute names provided by the library are path expressions that refer to constants. A macro expands on source tokens, before those constants are resolved to their values, so it only ever sees the path, never the string behind it. TODO: expand on the issues encountered here (compile-time vs. runtime values).
But when I used the semconv library, I was passing the name of a constant: an identifier that refers to a value, not the raw text itself.
// In opentelemetry-semantic-conventions
pub const GEN_AI_OPERATION_NAME: &str = "gen_ai.operation.name";
How to use semconv? One approach provided by MiMo was to use the record method after creating the span, but that didn't work either and looked inelegant. The best solution was to abandon semconv and write the attribute names manually. For now, that's acceptable.
A further solution would be to write a script that extracts the attribute names from semconv and generates literals that the macro can use.
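The literal-versus-constant distinction can be reproduced with a small declarative macro, without pulling in tracing at all. The sketch below is only a model: `literal_name!` and `tokens_seen!` are hypothetical macros written for this note, not part of tracing or semconv, and the constant is copied by hand so the example stays self-contained. It shows that a macro receives the token it was given (a quoted literal, or a bare identifier), not the value the identifier will eventually resolve to.

```rust
// Same shape as the semconv constant, copied here for illustration.
pub const GEN_AI_OPERATION_NAME: &str = "gen_ai.operation.name";

/// Hypothetical stand-in for a span macro: it accepts only a string literal.
macro_rules! literal_name {
    ($name:literal) => {
        $name
    };
}

/// Hypothetical helper that returns the raw tokens a macro was handed.
macro_rules! tokens_seen {
    ($($t:tt)*) => {
        stringify!($($t)*)
    };
}

/// Tokens the macro sees when given a string literal.
fn literal_tokens() -> &'static str {
    tokens_seen!("gen_ai.operation.name")
}

/// Tokens the macro sees when given the constant: just its identifier.
fn constant_tokens() -> &'static str {
    tokens_seen!(GEN_AI_OPERATION_NAME)
}

fn main() {
    // A literal matches the `$name:literal` rule.
    let name: &'static str = literal_name!("gen_ai.operation.name");
    println!("accepted literal: {name}");

    // Uncommenting the next line fails to compile, because an identifier
    // is not a literal token and no macro rule matches it:
    // let name = literal_name!(GEN_AI_OPERATION_NAME);

    println!("macro saw for the literal:  {}", literal_tokens());
    println!("macro saw for the constant: {}", constant_tokens());

    // Outside the macro, the constant resolves to the same text as usual.
    assert_eq!(GEN_AI_OPERATION_NAME, name);
}
```

This is also why hand-written attribute names work: the text is physically present at the macro call site, which is what a generator script would need to produce as well.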
Deadlock
When building the OTLP pipeline, MiMo used SimpleSpanProcessor, which is a synchronous span processor. After I ran LoongClaw and entered some content, there was no response — it seemed stuck.
I asked MiMo to check what was wrong. MiMo examined the process's various states, initially suspecting an incorrect LLM API key. Eventually, by inspecting the code, it found that using a synchronous processor in an asynchronous environment caused a deadlock.
The Tokio runtime created many worker threads. While running SimpleSpanProcessor exports, these threads sent telemetry over the network and blocked waiting for the responses. The blocked tasks never yielded control, so they kept the threads occupied. With no free thread left, the runtime could not deliver the network responses to the waiting tasks, and the result was a deadlock.
Let's review the four essential conditions for a deadlock:
- Mutual Exclusion
- Hold and Wait
- No Preemption
- Circular Wait
How to resolve a deadlock? Break any of these conditions.
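Here, the solution was to use the asynchronous BatchSpanProcessor, which queues finished spans and exports them in the background instead of blocking the thread that ended the span. That made much more sense.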
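The starvation pattern can be reproduced without Tokio or OpenTelemetry. The sketch below is a simplified model using only the Rust standard library, not the actual LoongClaw or OTel SDK code. A hypothetical one-worker pool stands in for the runtime's worker threads; `blocking_export_completes` mimics a synchronous export whose reply can only be produced by the same pool, and `batched_export_completes` mimics handing the span to a dedicated background thread. All names are invented for illustration, and a timeout replaces the real hang so the program terminates.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical one-worker pool standing in for the runtime's worker threads.
fn spawn_pool() -> mpsc::Sender<Job> {
    let (tx, rx) = mpsc::channel::<Job>();
    thread::spawn(move || {
        // Runs jobs one at a time until every sender is dropped.
        for job in rx {
            job();
        }
    });
    tx
}

/// Synchronous export: the job blocks until a reply arrives, but the reply
/// can only be produced by another job queued on the same (busy) worker.
fn blocking_export_completes() -> bool {
    let pool = spawn_pool();
    let pool_for_job = pool.clone();
    let (done_tx, done_rx) = mpsc::channel();
    pool.send(Box::new(move || {
        let (reply_tx, reply_rx) = mpsc::channel();
        pool_for_job
            .send(Box::new(move || {
                // By the time this runs, the waiter may have given up.
                let _ = reply_tx.send(());
            }))
            .unwrap();
        // The only worker is stuck here, so the reply job cannot run.
        let got_reply = reply_rx.recv_timeout(Duration::from_millis(200)).is_ok();
        done_tx.send(got_reply).unwrap();
    }))
    .unwrap();
    done_rx.recv().unwrap()
}

/// Batched export: the job only enqueues the span and returns; a dedicated
/// thread outside the pool does the slow work.
fn batched_export_completes() -> bool {
    let pool = spawn_pool();
    let (span_tx, span_rx) = mpsc::channel::<&'static str>();
    let (exported_tx, exported_rx) = mpsc::channel();
    thread::spawn(move || {
        for span in span_rx {
            exported_tx.send(span).unwrap();
        }
    });
    let (done_tx, done_rx) = mpsc::channel();
    pool.send(Box::new(move || {
        span_tx.send("chat").unwrap(); // hand off and return immediately
        done_tx.send(true).unwrap();
    }))
    .unwrap();
    let finished = done_rx
        .recv_timeout(Duration::from_millis(200))
        .unwrap_or(false);
    finished && exported_rx.recv_timeout(Duration::from_millis(200)).is_ok()
}

fn main() {
    println!("blocking export completed: {}", blocking_export_completes());
    println!("batched export completed:  {}", batched_export_completes());
}
```

In terms of the four conditions, the hand-off removes "hold and wait": the worker no longer holds its thread while waiting for something that needs that same thread.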
Filtering
In Jaeger, I noticed many spans that had nothing to do with gen_ai chat requests. MiMo suggested adding a filter to exclude them.
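The filter MiMo actually added is not recorded here. As a rough sketch of the idea only, the example below keeps a span when it carries at least one attribute whose key starts with `gen_ai.`; `SpanRecord`, `is_gen_ai`, `filter_gen_ai`, and the sample span names are all hypothetical and do not come from any OTel crate or from the LoongClaw code.

```rust
/// Hypothetical finished-span record, reduced to what the filter needs.
struct SpanRecord {
    name: &'static str,
    attribute_keys: Vec<&'static str>,
}

/// A span counts as GenAI-related if any attribute key is in the `gen_ai.` namespace.
fn is_gen_ai(span: &SpanRecord) -> bool {
    span.attribute_keys
        .iter()
        .any(|key| key.starts_with("gen_ai."))
}

/// Drops every span that has no `gen_ai.*` attribute.
fn filter_gen_ai(spans: Vec<SpanRecord>) -> Vec<SpanRecord> {
    spans.into_iter().filter(is_gen_ai).collect()
}

/// Made-up sample data: one GenAI span and two unrelated ones.
fn sample_spans() -> Vec<SpanRecord> {
    vec![
        SpanRecord {
            name: "chat",
            attribute_keys: vec!["gen_ai.operation.name"],
        },
        SpanRecord {
            name: "http.request",
            attribute_keys: vec!["http.method"],
        },
        SpanRecord {
            name: "config.load",
            attribute_keys: vec![],
        },
    ]
}

fn main() {
    for span in filter_gen_ai(sample_spans()) {
        println!("kept span: {}", span.name);
    }
}
```

Filtering on attribute keys rather than span names is a design choice made for this sketch; a filter on span name or target would follow the same pattern.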
Refactoring
Although some progress had been made, I still felt it wasn't enough. I asked multiple models to study the project in search of an ideal instrumentation location, ideally one that could collect data across the entire lifecycle of the agent.
Although the models' answers varied, it was clear that the current instrumentation only handled collection for LLM requests. Other parts, such as Tool Calls, still needed separate work.
Old-School Code Reading
Having a model read the code statically may not reveal the project's runtime logic, which makes it hard to determine the best instrumentation points.
I wanted to fully understand the project through dynamic debugging.
After some searching, I found a tool called uftrace, which claims to record the functions called during a program's execution along with the source files they come from.