🛰️ The LLM Observability Stack: What to Track and Why

Gradient Descent Weekly — Issue #26

Your LLM app works great in dev.
But in prod?

Weird responses
Hallucinations
Latency spikes
User complaints
Hard-to-reproduce bugs
Everyone's guessing why

Welcome to the LLM observability problem.
Most teams ship LLM apps like they’re demos.
But real systems need real monitoring — across prompts, models, tools, and users.

In this issue:

What observability means in an LLM context
What you should be tracking (not just logging)
A reference observability stack
Tools that actually help
Real-world trace patterns to watch for

🔍 First, Why Does LLM Observability Matter?

LLMs are:

Non-deterministic
Slow
Prone to hallucinations
Cost-sensitive
Context-dependent
Version-volatile

If you’re not tracking these properties in production, you’re not just blind: you’re wasting money and risking user trust.

🧠 What Is LLM Observability?

Observability = the ability to answer why something broke, before the user tells you.

For LLM-powered apps, this means:

Prompt + context tracking
Model versioning
Latency and token usage
Tool/invocation chains
Output evaluation
User-level behavior traces
Cost breakdowns
Quality metrics (accuracy, relevance, etc.)

📊 10 Things You MUST Track in LLM Systems

Metric / Log	Why It Matters
🧠 Prompt Text (Raw & Final)	Debug hallucinations, optimize templates
🧾 Input Metadata (User, Intent)	Helps trace back user-level issues
🧩 Retrieved Context (RAG)	Diagnose poor retrieval → poor generation
🔧 Tools Invoked (Agents)	Identify broken chains, high-latency tools
🕰️ Latency (P50/P95)	Ensure responsiveness, alert on spikes
💰 Token Usage (input/output)	Optimize cost, track against budget
🤖 Model Version Used	Reproducibility, version-aware evals
🔁 Retry/Fail/Timeouts	Reliability signals
✅ Eval Scores (Auto/Human)	Track response quality, rank prompts/models
📦 Response Type/Class	Analyze what types of requests break most often
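To make the table concrete, here is a minimal sketch of one logged request as a Python dataclass. The class name, field names, and units are illustrative assumptions, not a standard schema; map them onto whatever your logging backend expects.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional
import json


@dataclass
class LLMCallRecord:
    """One logged LLM request. Field names are hypothetical, chosen to mirror the table."""
    raw_prompt: str                 # template before variable substitution
    final_prompt: str               # exact text sent to the model
    user_id: str                    # input metadata
    intent: str
    retrieved_context: list[str]    # RAG chunks; empty list if no retrieval ran
    tools_invoked: list[str]        # agent tool names, in call order
    latency_ms: float
    input_tokens: int
    output_tokens: int
    model_version: str
    retries: int = 0
    timed_out: bool = False
    eval_scores: dict[str, float] = field(default_factory=dict)
    response_class: Optional[str] = None

    def to_json(self) -> str:
        """Serialize to one JSON line, ready for a log pipeline."""
        return json.dumps(asdict(self), ensure_ascii=False)
```

Emitting one such line per request is usually enough to start answering "which prompts, which model, which users" without any special tooling.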

🧱 Reference LLM Observability Stack (2025 Edition)

Let’s build your stack layer by layer:

1. Instrumentation Layer

Middleware that wraps LLM calls (OpenAI, Anthropic, local models) and logs:
- Prompt templates
- Variables inserted
- Model used
- Token count and cost

✅ Tools:

LangChain Tracing
OpenAI Function Logging
Manual wrappers via middleware functions
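A manual wrapper can be very small. The sketch below assumes a hypothetical `llm` callable that returns a dict with `text`, `input_tokens`, and `output_tokens`; the model name and the prices in `PRICE_PER_1K` are made up for illustration, and `fake_llm` is a stub so the example runs offline.

```python
import time
from typing import Callable

# Hypothetical USD prices per 1,000 tokens; substitute your provider's real rates.
PRICE_PER_1K = {"example-model-v1": {"input": 0.01, "output": 0.03}}

CALL_LOG: list[dict] = []  # stand-in for a real log sink


def fake_llm(model: str, prompt: str) -> dict:
    """Offline stub: counts whitespace-separated words as 'tokens'."""
    return {"text": "stub answer", "input_tokens": len(prompt.split()), "output_tokens": 2}


def instrumented_call(
    llm: Callable[[str, str], dict],
    model: str,
    template: str,
    variables: dict[str, str],
) -> str:
    """Render the template, call the model, and log the items listed above."""
    prompt = template.format(**variables)
    start = time.perf_counter()
    response = llm(model, prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    price = PRICE_PER_1K[model]
    cost_usd = (
        response["input_tokens"] * price["input"]
        + response["output_tokens"] * price["output"]
    ) / 1000
    CALL_LOG.append({
        "template": template,        # prompt template
        "variables": variables,      # variables inserted
        "model": model,              # model used
        "input_tokens": response["input_tokens"],
        "output_tokens": response["output_tokens"],
        "cost_usd": cost_usd,
        "latency_ms": latency_ms,
    })
    return response["text"]
```

Calling `instrumented_call(fake_llm, "example-model-v1", "Summarize: {doc}", {"doc": "a b c"})` appends one entry to `CALL_LOG`; in a real system you would swap the stub for your client call and the list for your logger.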

2. Trace & Session Layer

Group multiple LLM calls + tool invocations into one trace/session.

✅ You want:

Multi-step agent chain visibility
Call tree visualization
User-level sessions

✅ Tools:

3. Evaluation Layer

Auto- or human-grade LLM responses.

Types:

Factuality
Relevance
Fluency
Helpfulness
Safety / Bias
Groundedness (in RAG)

✅ Tools:

You want continuous eval pipelines, not one-off testing.
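As a starting point for a continuous pipeline, here is a deliberately crude sketch of an automatic groundedness check for RAG. It measures lexical overlap only, which is an assumption of this example and a weak proxy; production setups typically use an LLM judge or an entailment model instead. The function names and record keys are hypothetical.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, deduplicated."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_groundedness(answer: str, context_chunks: list[str]) -> float:
    """Share of distinct answer tokens that also appear in the retrieved context."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set().union(*(_tokens(chunk) for chunk in context_chunks))
    return len(answer_tokens & context_tokens) / len(answer_tokens)


def evaluate_batch(records: list[dict]) -> float:
    """Mean score over logged records; schedule this to run on fresh traffic."""
    scores = [lexical_groundedness(r["answer"], r["context"]) for r in records]
    return sum(scores) / len(scores) if scores else 0.0
```

Running `evaluate_batch` on a sample of each day's logged requests, and storing the result next to the model version, is the difference between a one-off test and a trend line you can alert on.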

4. Alerting & Monitoring Layer

Set up:

Latency alerts
Token usage spikes
Drop in eval scores
Frequent fallback usage

✅ Tools:

Datadog / Grafana + custom metrics
LangSmith alerts
Self-hosted Prometheus if you’re feeling fancy
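The four alert conditions above can be sketched as plain threshold checks over a recent window of requests compared with a baseline window. Every threshold below is a hypothetical placeholder, and `Window` and `check_alerts` are names invented for this example; in practice these rules would live in your monitoring tool.

```python
from dataclasses import dataclass
from statistics import mean, quantiles

# Hypothetical thresholds; tune them against your own traffic.
P95_LATENCY_LIMIT_MS = 5000.0
TOKEN_SPIKE_RATIO = 2.0      # current mean tokens vs baseline mean
EVAL_DROP_LIMIT = 0.1        # absolute drop in mean eval score
FALLBACK_RATE_LIMIT = 0.2    # share of requests that hit a fallback


@dataclass
class Window:
    """Per-request measurements collected over one time window."""
    latencies_ms: list[float]
    total_tokens: list[int]
    eval_scores: list[float]
    used_fallback: list[bool]


def p95(values: list[float]) -> float:
    """95th percentile; statistics.quantiles needs at least two samples."""
    return quantiles(values, n=100, method="inclusive")[94]


def check_alerts(current: Window, baseline: Window) -> list[str]:
    """Return the names of every alert rule the current window trips."""
    alerts = []
    if p95(current.latencies_ms) > P95_LATENCY_LIMIT_MS:
        alerts.append("latency_p95")
    if mean(current.total_tokens) > TOKEN_SPIKE_RATIO * mean(baseline.total_tokens):
        alerts.append("token_spike")
    if mean(baseline.eval_scores) - mean(current.eval_scores) > EVAL_DROP_LIMIT:
        alerts.append("eval_drop")
    if sum(current.used_fallback) / len(current.used_fallback) > FALLBACK_RATE_LIMIT:
        alerts.append("fallback_rate")
    return alerts
```

Comparing against a baseline rather than a fixed number matters most for tokens and eval scores, where "normal" differs a lot between routes.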

5. Analytics & Feedback Loop

Analyze:

Model version vs performance
User feedback correlation
Top failing prompts / queries
Cost per route / tool / agent

Feed insights into:

Prompt refinement
Routing logic
Cost controls

✅ Use:

Dashboards (Tableau, Superset, Metabase)
Custom logs in BigQuery / Snowflake
Langfuse Analytics
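Before reaching for a dashboard, the same questions can be answered with a few aggregations over logged records. This is a sketch that assumes each record is a dict with hypothetical keys `route`, `cost_usd`, `template`, `eval_score`, and `model_version`; the 0.7 cutoff for a "failing" response is likewise an assumption.

```python
from collections import Counter, defaultdict


def cost_per_route(records: list[dict]) -> dict[str, float]:
    """Total spend grouped by route."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["route"]] += r["cost_usd"]
    return dict(totals)


def top_failing_templates(
    records: list[dict], min_score: float = 0.7, k: int = 3
) -> list[tuple[str, int]]:
    """Prompt templates with the most responses scoring below min_score."""
    failures = Counter(r["template"] for r in records if r["eval_score"] < min_score)
    return failures.most_common(k)


def mean_score_by_model(records: list[dict]) -> dict[str, float]:
    """Average eval score per model version, for version-vs-performance comparisons."""
    scores: dict[str, list[float]] = defaultdict(list)
    for r in records:
        scores[r["model_version"]].append(r["eval_score"])
    return {model: sum(vals) / len(vals) for model, vals in scores.items()}
```

The same three group-bys translate directly into SQL once the logs land in a warehouse.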

🧠 Example Trace You Should Be Seeing

User Query: "What’s our company’s refund policy in Germany?"

Trace:
  - Router → intent: "policy_lookup"
  - RAG → 3 chunks from vector DB
  - PromptTemplate v3.4 applied
  - OpenAI GPT-4 called
  - Output: ✅ Answer + Source
  - Tokens used: 412 input / 189 output
  - Latency: 2.7s
  - Eval: Groundedness 0.92, Helpfulness 0.85

If your system can’t give you this trace…
you’re not in production. You’re in prototype land.

⚠️ Common Anti-Patterns

Symptom	Observability Gap
Hallucinations post-deploy	No RAG context logged
Unexplained cost spikes	Token usage not monitored
Agents silently fail mid-chain	No tool call traces
Same prompt gives different results	Model version not logged
You don’t know what broke	No evals or user-level sessions

✅ TL;DR: LLM Observability Checklist

Track full prompt chain (template + variables)
Store retrieved context and its source
Capture toolchain usage per request
Log latency + tokens + cost
Tag everything with model version
Score output quality continuously
Group calls into sessions with user metadata
Set alerts on cost spikes, eval drops, and latency outliers
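One way to enforce the per-request items on this checklist is a completeness check that runs in CI or on a sample of production logs. The key names below are hypothetical; the sketch only verifies that fields are present, not that their values are sensible, and the alerting item is a system-level setting that a per-record check cannot cover.

```python
# Hypothetical record keys, one or more per checklist item.
REQUIRED_FIELDS = [
    "prompt_template", "prompt_variables",      # full prompt chain
    "retrieved_context", "context_sources",     # retrieved context and its source
    "tools_invoked",                            # toolchain usage per request
    "latency_ms", "input_tokens", "output_tokens", "cost_usd",
    "model_version",                            # tag everything with model version
    "eval_scores",                              # output quality
    "session_id", "user_id",                    # sessions with user metadata
]


def missing_fields(record: dict) -> list[str]:
    """Return checklist fields that are absent or None in a logged record."""
    return sorted(k for k in REQUIRED_FIELDS if record.get(k) is None)
```

A record that comes back with a non-empty list is a request you will not be able to debug later.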

🧠 Final Thoughts:

You can’t fix what you can’t see.
You can’t improve what you don’t measure.
You can’t scale what you don’t understand.

LLM apps are no longer toys.
They’re mission-critical interfaces powered by expensive, probabilistic models.

So stop flying blind.
Build observability in, or get burned by support tickets and cloud bills.

🔮 Up Next on Gradient Descent Weekly:

Why You Should Stop Using LLMs as Black Boxes
