o11ycon 2026

Making Your Metrics LLM-Ready

Structured Observability Data for AI-Assisted Analysis

Stephanie Wang

Staff Software Engineer @ MongoDB

02

About Me

Stephanie Wang

Staff Software Engineer @ MongoDB

ex-Founding Engineer @ MotherDuck

BigQuery @ Google

Sybase-powered trading systems @ Morgan Stanley

Cassandra @ IBM

duckdb-mongo on GitHub

03

Imagine: It's 2am. You get paged.

[Grafana dashboard: Service Health Overview, last 1h, 3 alerts firing]

latency_p99: 847ms (▲ +612% vs baseline)
cpu_utilization: 78% (▲ +24%)
error_rate: 2.4% (▲ +1.9pp)
throughput: 1.2k/s (▼ -8%)

Active Alerts

⬤ FIRING: latency_p99 > 500ms
⬤ FIRING: error_rate > 1%
⬤ FIRING: gc_pause > 30ms
✓ OK: cpu_util < 90%
✓ OK: disk_iops < 800

Chart panels: latency_p99 (ms), last 1h; gc_pause_p99 (ms); heap_used (GB); conn_pool_wait (ms); request_rate (req/s), by endpoint

You're not debugging. You're doing manual pattern recognition.

04

Who thought—
‘this should be easier’?

05

AI is perfect for observability

High volume

Pattern-heavy

Repetitive

So why doesn't this work well today?

The gap isn't the model — it's the data.

06

Three gaps

Why LLMs struggle with metrics today

01

Fragmentation

Scattered across many tools

02

No Structure

Built to collect, not analyze

03

No Meaning

No sense of good vs. bad

These aren't independent. Each one compounds the others.

07

Fragmentation

Metrics

Logs

Traces

Benchmarks

(you)

The model never sees the system. It sees fragments.

08

No structure

timestamp, gc_pause_ms, conn_pool_wait, heap_frag_ratio, queue_depth
10:00, 11, 3.1, 1.04, 12
10:01, 13, 3.5, 1.06, 15
10:02, 28, 9.2, 1.31, 84
10:03, 45, 16.8, 1.72, 210
10:04, 52, 19.1, 1.85, 340

"Did a regression occur?"

09

No meaning

GC Pause p99 (ms)

Baseline (~12ms)

Change

Is this bad? How bad?

threads_active +40%: More capacity handling load? Or threads blocked waiting on locks?

heap_used +180%: Cache warming up as expected? Or a memory leak that will OOM?

error_budget_remaining −24%: Normal depletion after a planned deploy? Or need to wake someone up?

A number changing direction means nothing without context. Engineers carry that context.

10

Solution #1

The model can't reason about what it can't see.

Unified access layer

VictoriaMetrics

Honeycomb

Prometheus

CI logs

→

MCP / CLI

→

LLM

MCP / CLI lets us pull the right data at the right time — across systems.
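To make the idea concrete, here is a minimal sketch of a single query interface in front of several backends. Everything in it is an assumption for illustration: `AccessLayer`, `Sample`, and `stub_fetcher` are hypothetical names, and the in-memory stubs stand in for real Prometheus or VictoriaMetrics clients. It does not use any actual MCP API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Sample:
    source: str     # backend that produced the sample
    metric: str
    timestamp: str  # "HH:MM", so plain string comparison orders correctly
    value: float


# A fetcher takes (metric, start, end) and returns the matching samples.
Fetcher = Callable[[str, str, str], List[Sample]]


class AccessLayer:
    """One query interface in front of many backends (hypothetical)."""

    def __init__(self) -> None:
        self._fetchers: Dict[str, Fetcher] = {}

    def register(self, source: str, fetcher: Fetcher) -> None:
        self._fetchers[source] = fetcher

    def query(self, metric: str, start: str, end: str) -> List[Sample]:
        samples: List[Sample] = []
        for fetcher in self._fetchers.values():
            samples.extend(fetcher(metric, start, end))
        # Merge all backends into one time-ordered view.
        return sorted(samples, key=lambda s: (s.timestamp, s.source))


def stub_fetcher(source: str, rows: Dict[str, List[Tuple[str, float]]]) -> Fetcher:
    """Build an in-memory stand-in for a real backend client."""
    def fetch(metric: str, start: str, end: str) -> List[Sample]:
        return [Sample(source, metric, ts, value)
                for ts, value in rows.get(metric, [])
                if start <= ts <= end]
    return fetch


layer = AccessLayer()
layer.register("prometheus", stub_fetcher(
    "prometheus", {"gc_pause_ms": [("10:00", 11.0), ("10:02", 28.0)]}))
layer.register("victoriametrics", stub_fetcher(
    "victoriametrics", {"gc_pause_ms": [("10:01", 13.0)]}))

samples = layer.query("gc_pause_ms", "10:00", "10:02")
for s in samples:
    print(s.timestamp, s.source, s.value)
```

The point of the design is that the caller (a person or an agent) asks one question and gets one merged, time-ordered answer, regardless of which tool holds the data.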

11

Solution #2

Raw metrics are designed for collection, not analysis.

Consistent transformation

10:00, 11, 3.1, 1.04
10:01, 13, 3.5, 1.06
10:02, 28, 9.2, 1.31
...

→

baseline: 3.20
current: 18.50
change: +478%
CoV: 2.1 → 18.4

Baseline vs comparison
Deltas & percent changes
Segmentation by time window
Aggregation (mean, p95, p99)

Apply the same transformation template across workloads and services.
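A minimal sketch of such a transformation template, using the conn_pool_wait samples from the "No structure" slide. The window split (first two samples as baseline, last three as current), the nearest-rank percentile, and CoV expressed as a percentage are assumptions made for this example; the resulting numbers are not meant to reproduce the figures on the slide above.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List, Optional


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile (assumed method; others exist)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(values: List[float]) -> Dict[str, float]:
    """Aggregate one time window into a handful of comparable numbers."""
    m = mean(values)
    return {
        "mean": m,
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
        # Coefficient of variation as a percentage of the mean.
        "cov_pct": pstdev(values) / m * 100 if m else 0.0,
    }


def compare(baseline: List[float], current: List[float]) -> Dict[str, Optional[float]]:
    """Same template for every metric: baseline vs current plus deltas."""
    b, c = summarize(baseline), summarize(current)
    delta = c["mean"] - b["mean"]
    return {
        "baseline": b["mean"],
        "current": c["mean"],
        "delta": delta,
        # Percent change is undefined when the baseline mean is zero.
        "change_pct": delta / b["mean"] * 100 if b["mean"] else None,
        "current_p99": c["p99"],
        "baseline_cov_pct": b["cov_pct"],
        "current_cov_pct": c["cov_pct"],
    }


# conn_pool_wait samples for 10:00 through 10:04.
conn_pool_wait = [3.1, 3.5, 9.2, 16.8, 19.1]
report = compare(baseline=conn_pool_wait[:2], current=conn_pool_wait[2:])
for key, value in report.items():
    print(f"{key}: {value:.2f}")
```

Because `compare` takes nothing but two lists of numbers, the same function can be reused across workloads and services without per-metric code.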

12

Solution #3

Structure gives the model clean input, but it still doesn't know what the numbers mean.

Semantic & reasoning layer

Make signals interpretable: directionality, thresholds, sustained vs spike
Make decisions consistent: detect regressions, assess severity
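One way to encode "sustained vs spike" is to look at the longest run of consecutive samples above a threshold. The sketch below is an assumption about how that rule could be written, not a description of any particular tool. The function name `classify_breach` and the `min_sustained` parameter are hypothetical. The first series is the gc_pause_ms column from the "No structure" slide and the 30ms threshold comes from the alert on the dashboard slide; the other two series are made up for contrast.

```python
from typing import List


def classify_breach(values: List[float], threshold: float,
                    min_sustained: int = 2) -> str:
    """Return 'ok', 'spike', or 'sustained' for one metric window.

    A breach counts as sustained when at least `min_sustained`
    consecutive samples exceed the threshold; shorter runs are spikes.
    """
    longest = run = 0
    for value in values:
        run = run + 1 if value > threshold else 0
        longest = max(longest, run)
    if longest == 0:
        return "ok"
    return "sustained" if longest >= min_sustained else "spike"


gc_pause_ms = [11, 13, 28, 45, 52]      # from the "No structure" slide
one_off = [11, 45, 13, 12, 11]          # hypothetical single outlier
quiet = [11, 13, 12, 11, 12]            # hypothetical healthy window

for name, series in [("gc_pause_ms", gc_pause_ms),
                     ("one_off", one_off),
                     ("quiet", quiet)]:
    print(name, classify_breach(series, threshold=30))
```

Writing the rule down once means every analysis applies the same definition of "sustained", instead of each engineer eyeballing the chart.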

13

Encoding directionality

{
  "metrics": [
    {
      "name": "gc_pause_p99",
      "baseline": 12,
      "current": 48,
      "direction": "lower_is_better"
    },
    {
      "name": "error_budget_remaining",
      "baseline": 0.95,
      "current": 0.72,
      "direction": "higher_is_better"
    }
  ]
}
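A short sketch of how the `direction` field can turn a raw change into a verdict. The payload is the one from the slide; the `assess` function and the 5% tolerance band are assumptions added for illustration.

```python
import json
from typing import Dict

PAYLOAD = """
{"metrics": [
  {"name": "gc_pause_p99", "baseline": 12, "current": 48,
   "direction": "lower_is_better"},
  {"name": "error_budget_remaining", "baseline": 0.95, "current": 0.72,
   "direction": "higher_is_better"}
]}
"""


def assess(metric: Dict, tolerance_pct: float = 5.0) -> Dict:
    """Classify one metric as regression, improvement, or unchanged."""
    change_pct = (metric["current"] - metric["baseline"]) / metric["baseline"] * 100
    # Flip the sign for lower_is_better so that negative always means worse.
    if metric["direction"] == "higher_is_better":
        goodness = change_pct
    elif metric["direction"] == "lower_is_better":
        goodness = -change_pct
    else:
        raise ValueError(f"unknown direction: {metric['direction']}")

    if goodness < -tolerance_pct:
        verdict = "regression"
    elif goodness > tolerance_pct:
        verdict = "improvement"
    else:
        verdict = "unchanged"
    return {"name": metric["name"], "change_pct": change_pct, "verdict": verdict}


results = [assess(m) for m in json.loads(PAYLOAD)["metrics"]]
for r in results:
    print(f"{r['name']}: {r['change_pct']:+.1f}% -> {r['verdict']}")
```

Note that the two metrics move in opposite directions (one up, one down) and both come out as regressions. Without the `direction` field, a model has to guess which way is bad.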

14

The framework

Making metrics LLM-ready

Contextualized, structured, and opinionated

SKILLS

Semantics

Structure

Raw Data

15

Real-world applications

Regression analysis: Baseline vs current, spikes vs sustained
Root cause debugging: Correlate signals across systems without jumping between tools
Operational workflows: Automated reports, smarter anomaly detection, and better alert triage
Eliminate tribal knowledge: Any engineer, or any agent, can diagnose a problem

16

Live Demo

llm-perf-analyzer

stephaniewang526/llm-perf-analyzer

Transform raw metrics into structured, LLM-consumable reports

Pluggable input adapters (JSON, Prometheus) → Statistical engine (mean, p95, p99, CoV) → Polarity classification → Structured markdown output

Includes SKILL.md for AI reasoning — agents know how to interpret output
Extend with MCP tools for git history, CI logs, trace data
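To illustrate the last stage of the pipeline, here is a rough sketch of rendering classified metrics as a markdown table. This is not the code from llm-perf-analyzer: the `render_report` function, the column layout, and the row format are all assumptions for this example. The input values are the two metrics from the directionality slide.

```python
from typing import Dict, List


def render_report(rows: List[Dict]) -> str:
    """Render classified metrics as a markdown table (hypothetical layout)."""
    lines = [
        "| metric | baseline | current | change | verdict |",
        "| --- | --- | --- | --- | --- |",
    ]
    for row in rows:
        lines.append(
            f"| {row['name']} | {row['baseline']:g} | {row['current']:g} "
            f"| {row['change_pct']:+.0f}% | {row['verdict']} |"
        )
    return "\n".join(lines)


rows = [
    {"name": "gc_pause_p99", "baseline": 12, "current": 48,
     "change_pct": 300.0, "verdict": "regression"},
    {"name": "error_budget_remaining", "baseline": 0.95, "current": 0.72,
     "change_pct": -24.2, "verdict": "regression"},
]

markdown = render_report(rows)
print(markdown)
```

A fixed table shape like this gives the model the same columns every time, so instructions written once (for example in a SKILL.md) keep applying as new metrics are added.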

17

So if you remember
one thing from this talk…

18

If you want better AI-assisted analysis...

Don't start with better models

Start with better data!

Sonnet handles this just fine — you don't need Opus.

19

Let's connect

Be in Touch

email steph.wang@mongodb.com

twitter @StephWangBuilds

github stephaniewang526

linkedin /in/stephwangbuilds
