01
o11ycon 2026

Making Your Metrics LLM-Ready

Structured Observability Data for AI-Assisted Analysis

Stephanie Wang
Staff Software Engineer @ MongoDB
02

Stephanie Wang

Staff Software Engineer @ MongoDB
ex-Founding Engineer @ MotherDuck
BigQuery @ Google
Sybase-powered trading systems @ Morgan Stanley
Cassandra @ IBM
duckdb-mongo on GitHub
03

Imagine: It's 2am. You get paged.

Service Health Overview (last 1h): ⚠ 3 firing

Metric            Current   Change vs baseline
latency_p99       847ms     ▲ +612%
cpu_utilization   78%       ▲ +24%
error_rate        2.4%      ▲ +1.9pp
throughput        1.2k/s    ▼ -8%

Active Alerts
⬤ FIRING  latency_p99 > 500ms
⬤ FIRING  error_rate > 1%
⬤ FIRING  gc_pause > 30ms
✓ OK      cpu_util < 90%
✓ OK      disk_iops < 800

Panels: latency_p99 (ms), gc_pause_p99 (ms), heap_used (GB), conn_pool_wait (ms), request_rate (req/s) by endpoint
You're not debugging. You're doing manual pattern recognition.
04
Who thought—
‘this should be easier’?
05

AI is perfect for observability

High volume
Pattern-heavy
Repetitive
So why doesn't this work well today?

The gap isn't the model — it's the data.

06

Why LLMs struggle with metrics today

01

Fragmentation

Scattered across many tools

02

No Structure

Built to collect, not analyze

03

No Meaning

No sense of good vs. bad

These aren't independent. Each one compounds the others.

07

Fragmentation

Metrics
Logs
Traces
Benchmarks
(you)

The model never sees the system. It sees fragments.

08

No structure

timestamp, gc_pause_ms, conn_pool_wait, heap_frag_ratio, queue_depth
10:00, 11, 3.1, 1.04, 12
10:01, 13, 3.5, 1.06, 15
10:02, 28, 9.2, 1.31, 84
10:03, 45, 16.8, 1.72, 210
10:04, 52, 19.1, 1.85, 340

"Did a regression occur?"

09

No meaning

Chart: GC Pause p99 (ms), baseline (~12ms) vs. change
Is this bad? How bad?

threads_active +40%: more capacity handling load, or threads blocked waiting on locks?
heap_used +180%: cache warming up as expected, or a memory leak that will OOM?
error_budget_remaining −24%: normal depletion after a planned deploy, or time to wake someone up?

A number changing direction means nothing without context. Engineers carry that context.

10
The model can't reason about what it can't see.

Unified access layer

VictoriaMetrics
Honeycomb
Prometheus
CI logs
MCP / CLI
LLM

MCP / CLI lets us pull the right data at the right time — across systems.
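
One way to picture a unified access layer is a single query entry point that routes to per-backend fetchers. The sketch below is illustrative only: `UnifiedAccess`, the `source:metric` target syntax, and the in-memory `FAKE_PROM` data are all assumptions, and no real Prometheus, Honeycomb, or MCP API is called.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, float]            # (timestamp, value)
Fetcher = Callable[[str], List[Sample]]


class UnifiedAccess:
    """Routes "source:metric" queries to whichever fetcher is registered."""

    def __init__(self) -> None:
        self._sources: Dict[str, Fetcher] = {}

    def register(self, name: str, fetcher: Fetcher) -> None:
        self._sources[name] = fetcher

    def query(self, target: str) -> List[Sample]:
        # Split "prometheus:gc_pause_ms" into source and metric name.
        source, _, metric = target.partition(":")
        if source not in self._sources:
            raise KeyError(f"unknown source: {source}")
        return self._sources[source](metric)


# In-memory stand-in; a real fetcher would call the backend's own API.
FAKE_PROM = {"gc_pause_ms": [("10:00", 11.0), ("10:01", 13.0)]}

access = UnifiedAccess()
access.register("prometheus", lambda metric: FAKE_PROM.get(metric, []))
samples = access.query("prometheus:gc_pause_ms")
```

The point of the design is that the model (or the tool calling it) asks one question in one shape, regardless of where the data lives.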

11
Raw metrics are designed for collection, not analysis.

Consistent transformation

10:00, 11, 3.1, 1.04
10:01, 13, 3.5, 1.06
10:02, 28, 9.2, 1.31
...
baseline: 3.20
current: 18.50
change: +478%
CoV: 2.1 → 18.4

Apply the same transformation template across workloads and services.
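
A minimal sketch of such a transformation template, assuming the baseline and current values are window means and CoV is the population standard deviation divided by the mean, expressed as a percentage. The function name `summarize` and the way the sample rows are split into windows are assumptions for illustration.

```python
from statistics import mean, pstdev


def summarize(baseline_window, current_window):
    """Reduce two windows of raw samples to one comparable summary."""
    base = mean(baseline_window)
    cur = mean(current_window)
    return {
        "baseline": round(base, 2),
        "current": round(cur, 2),
        # Relative change of the current mean against the baseline mean.
        "change_pct": round((cur - base) / base * 100, 1),
        # Coefficient of variation (stddev / mean), in percent.
        "cov_baseline_pct": round(pstdev(baseline_window) / base * 100, 1),
        "cov_current_pct": round(pstdev(current_window) / cur * 100, 1),
    }


# conn_pool_wait samples from the raw table; the window split is an assumption.
summary = summarize([3.1, 3.5], [9.2, 16.8, 19.1])
```

Because every workload passes through the same function, the model always receives the same fields in the same units.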

12
Structure gives the model clean input, but it still doesn't know what the numbers mean.

Semantic & reasoning layer

  • Make signals interpretable: directionality, thresholds, sustained vs spike
  • Make decisions consistent: detect regressions, assess severity
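
To make "sustained vs spike" concrete, here is a rough sketch that labels a series by its longest run of consecutive threshold breaches. It assumes higher values are worse; `classify_breach` and the `min_consecutive` knob are hypothetical names, not part of any tool mentioned in this talk.

```python
def classify_breach(samples, threshold, min_consecutive=3):
    """Label a series as 'ok', 'spike', or 'sustained' against a threshold.

    Assumes higher values are worse.
    """
    longest = run = 0
    for value in samples:
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if value > threshold else 0
        longest = max(longest, run)
    if longest == 0:
        return "ok"
    return "sustained" if longest >= min_consecutive else "spike"


# gc_pause_ms samples against the 30ms alert threshold.
label = classify_breach([11, 13, 28, 45, 52], threshold=30, min_consecutive=2)
```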
13

Encoding directionality

{
  "metrics": [
    {
      "name": "gc_pause_p99",
      "baseline": 12,
      "current": 48,
      "direction": "lower_is_better"
    },
    {
      "name": "error_budget_remaining",
      "baseline": 0.95,
      "current": 0.72,
      "direction": "higher_is_better"
    }
  ]
}
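
Once direction is encoded, a consumer can turn a raw change into a verdict without guessing. A small sketch, assuming a hypothetical `assess` helper and an arbitrary 10% tolerance band:

```python
import json


def assess(metric, tolerance_pct=10.0):
    """Return (verdict, signed change in %) for one metric record."""
    change_pct = (metric["current"] - metric["baseline"]) / metric["baseline"] * 100
    if metric["direction"] == "lower_is_better":
        worse_pct = change_pct
    elif metric["direction"] == "higher_is_better":
        # Flip the sign so a positive value always means "got worse".
        worse_pct = -change_pct
    else:
        raise ValueError(f"unknown direction: {metric['direction']}")
    if worse_pct > tolerance_pct:
        verdict = "regression"
    elif worse_pct < -tolerance_pct:
        verdict = "improvement"
    else:
        verdict = "unchanged"
    return verdict, round(change_pct, 1)


payload = json.loads("""
{"metrics": [
  {"name": "gc_pause_p99", "baseline": 12, "current": 48,
   "direction": "lower_is_better"},
  {"name": "error_budget_remaining", "baseline": 0.95, "current": 0.72,
   "direction": "higher_is_better"}
]}
""")
results = {m["name"]: assess(m) for m in payload["metrics"]}
```

Both metrics come out as regressions even though one went up and the other went down, which is exactly the ambiguity the `direction` field removes.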
14
Making metrics LLM-ready
Contextualized, structured, and opinionated

SKILLS
Semantics
Structure
Raw Data
15

Real-world applications

16

llm-perf-analyzer

Transform raw metrics into structured, LLM-consumable reports

Pluggable input adapters (JSON, Prometheus) → Statistical engine (mean, p95, p99, CoV) → Polarity classification → Structured markdown output
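
The pipeline shape can be sketched in a few functions. This is not the llm-perf-analyzer source: the names `from_json`, `stats`, `report`, and the `POLARITY` lookup table are assumptions, and polarity is read from a hand-written table here rather than classified automatically.

```python
import json
from statistics import mean, pstdev, quantiles

# Hypothetical polarity table; a real tool might infer or configure this.
POLARITY = {"latency_ms": "lower_is_better", "throughput_rps": "higher_is_better"}


def from_json(text):
    """Hypothetical JSON adapter: {"metric": [samples, ...]} -> dict of lists."""
    return {name: [float(v) for v in values] for name, values in json.loads(text).items()}


def stats(samples):
    """mean, p95, p99 and CoV (stddev / mean) for one series of 2+ samples."""
    cuts = quantiles(samples, n=100, method="inclusive")  # 99 percentile cut points
    avg = mean(samples)
    cov = pstdev(samples) / avg if avg else 0.0
    return {"mean": avg, "p95": cuts[94], "p99": cuts[98], "cov": cov}


def report(series, polarity):
    """Render one markdown table row per metric."""
    lines = ["| metric | polarity | mean | p95 | p99 | CoV |",
             "|---|---|---|---|---|---|"]
    for name, samples in series.items():
        s = stats(samples)
        lines.append(
            f"| {name} | {polarity.get(name, 'unknown')} | {s['mean']:.2f} "
            f"| {s['p95']:.2f} | {s['p99']:.2f} | {s['cov']:.2f} |"
        )
    return "\n".join(lines)


markdown = report(from_json('{"latency_ms": [10, 12, 11, 50]}'), POLARITY)
```

Adding a Prometheus adapter would mean writing another function that returns the same dict-of-lists shape, leaving the statistics and report stages untouched.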

17
So if you remember
one thing from this talk…
18

If you want better AI-assisted analysis...

Don't start with better models
Start with better data!

Sonnet handles this just fine — you don't need Opus.

19

Be in Touch
