🛰️ The LLM Observability Stack: What to Track and Why
If you can’t observe it, you can’t trust it.

Gradient Descent Weekly — Issue #26
Your LLM app works great in dev.
But in prod?
- Weird responses
- Hallucinations
- Latency spikes
- User complaints
- Hard-to-reproduce bugs
- Everyone's guessing why
Welcome to the LLM observability problem.
Most teams ship LLM apps like they’re demos.
But real systems need real monitoring — across prompts, models, tools, and users.
In this issue:
- What observability means in an LLM context
- What you should be tracking (not just logging)
- A reference observability stack
- Tools that actually help
- Real-world trace patterns to watch for
🔍 First, Why Does LLM Observability Matter?
LLMs are:
- Non-deterministic
- Slow
- Prone to hallucinations
- Cost-sensitive
- Context-dependent
- Version-volatile
If you’re not tracking, you’re not just blind — you’re wasting money and risking trust.
🧠 What Is LLM Observability?
Observability = the ability to answer why something broke, before the user tells you.
For LLM-powered apps, this means:
- Prompt + context tracking
- Model versioning
- Latency and token usage
- Tool/invocation chains
- Output evaluation
- User-level behavior traces
- Cost breakdowns
- Quality metrics (accuracy, relevance, etc.)
📊 10 Things You MUST Track in LLM Systems
| Metric / Log | Why It Matters |
| --- | --- |
| 🧠 Prompt Text (Raw & Final) | Debug hallucinations, optimize templates |
| 🧾 Input Metadata (User, Intent) | Helps trace back user-level issues |
| 🧩 Retrieved Context (RAG) | Diagnose poor retrieval → poor generation |
| 🔧 Tools Invoked (Agents) | Identify broken chains, high-latency tools |
| 🕰️ Latency (P50/P95) | Ensure responsiveness, alert on spikes |
| 💰 Token Usage (input/output) | Optimize cost, track against budget |
| 🤖 Model Version Used | Reproducibility, version-aware evals |
| 🔁 Retry/Fail/Timeouts | Reliability signals |
| ✅ Eval Scores (Auto/Human) | Track response quality, rank prompts/models |
| 📦 Response Type/Class | Analyze what types of requests break most often |
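One way to make the table above concrete is a single record type that every LLM call fills in. The sketch below is a minimal illustration: the class name `LLMCallRecord`, its field names, and the `example-model-v1` label are assumptions for this example, not a standard schema or a real model identifier.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class LLMCallRecord:
    """One record per LLM call. Field names are illustrative, not a standard."""
    raw_prompt: str                  # template before variable substitution
    final_prompt: str                # exact text sent to the model
    user_id: str                     # input metadata
    intent: str
    retrieved_context: list[str]     # RAG chunks (or their IDs)
    tools_invoked: list[str]         # agent tool calls, in order
    latency_ms: float
    input_tokens: int
    output_tokens: int
    model_version: str
    retries: int = 0
    timed_out: bool = False
    eval_scores: dict[str, float] = field(default_factory=dict)
    response_class: Optional[str] = None

    def to_json(self) -> str:
        """Serialize to one JSON line, ready for a log pipeline."""
        return json.dumps(asdict(self), ensure_ascii=False)


record = LLMCallRecord(
    raw_prompt="Context: {context}\nQuestion: {question}",
    final_prompt="Context: Refunds within 14 days.\nQuestion: What is the refund policy?",
    user_id="u-123",
    intent="policy_lookup",
    retrieved_context=["Refunds within 14 days."],
    tools_invoked=[],
    latency_ms=2700.0,
    input_tokens=412,
    output_tokens=189,
    model_version="example-model-v1",  # placeholder, not a real model name
    eval_scores={"groundedness": 0.92, "helpfulness": 0.85},
    response_class="answer_with_source",
)
print(record.to_json())
```

Keeping all ten signals in one flat record means a single query can join cost, latency, and quality without stitching logs together later.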
🧱 Reference LLM Observability Stack (2025 Edition)
Let’s build your stack layer by layer:
1. Instrumentation Layer
Middleware that wraps LLM calls (OpenAI, Anthropic, local models)
Logs:
- Prompt templates
- Variables inserted
- Model used
- Token count and cost
✅ Tools:
- LangChain Tracing
- Manual wrappers via middleware functions
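A manual wrapper can be very small. The sketch below assumes the model is any callable that takes a prompt string and returns text. The names `instrumented_call`, `count_tokens`, and `fake_llm` are hypothetical, the prices are made up, and the whitespace token count is only a stand-in for a real tokenizer.

```python
import time
from typing import Callable

# Hypothetical USD prices per 1K tokens. Replace with your provider's real rates.
PRICE_PER_1K = {"example-model": {"input": 0.01, "output": 0.03}}


def count_tokens(text: str) -> int:
    """Crude whitespace count, standing in for a real tokenizer."""
    return len(text.split())


def instrumented_call(llm: Callable[[str], str], model: str, template: str,
                      variables: dict, log: list) -> str:
    """Render the prompt, call the model, and append one log entry."""
    prompt = template.format(**variables)
    start = time.perf_counter()
    output = llm(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    in_tok, out_tok = count_tokens(prompt), count_tokens(output)
    price = PRICE_PER_1K[model]
    log.append({
        "template": template,
        "variables": variables,
        "model": model,
        "input_tokens": in_tok,
        "output_tokens": out_tok,
        "cost_usd": in_tok / 1000 * price["input"] + out_tok / 1000 * price["output"],
        "latency_ms": latency_ms,
    })
    return output


def fake_llm(prompt: str) -> str:
    """Stub model so the sketch runs without network access."""
    return "Refunds are accepted within 14 days."


call_log: list = []
answer = instrumented_call(fake_llm, "example-model",
                           "What is the refund policy in {country}?",
                           {"country": "Germany"}, call_log)
print(call_log[0])
```

Logging the template and the variables separately, rather than only the rendered prompt, is what lets you tell a bad template apart from a bad input later.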
2. Trace & Session Layer
Group multiple LLM calls + tool invocations into one trace/session.
✅ You want:
- Multi-step agent chain visibility
- Call tree visualization
- User-level sessions
✅ Tools:
- LangSmith
3. Evaluation Layer
Grade LLM responses automatically, with human reviewers, or both.
Types:
- Factuality
- Relevance
- Fluency
- Helpfulness
- Safety / Bias
- Groundedness (in RAG)
✅ Tools:
You want continuous eval pipelines, not one-off testing.
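To show the shape of an automatic grader, here is a deliberately crude groundedness proxy: the share of answer words that also appear in the retrieved chunks. This is a lexical heuristic invented for this example, not an established metric. Real pipelines usually rely on an LLM judge, an entailment model, or human review. The function name `groundedness_proxy` is hypothetical.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness_proxy(answer: str, chunks: list[str]) -> float:
    """Fraction of answer tokens that appear in the retrieved context (0.0 to 1.0)."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set().union(*(_tokens(chunk) for chunk in chunks))
    return len(answer_tokens & context_tokens) / len(answer_tokens)


chunks = ["Customers in Germany can request a refund within 14 days of purchase."]
grounded = groundedness_proxy("Refunds in Germany are available within 14 days.", chunks)
ungrounded = groundedness_proxy("Shipping is free worldwide.", chunks)
print(grounded, ungrounded)
```

Running a scorer like this on every logged response, rather than on a one-time test set, is what turns it into a continuous eval.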
4. Alerting & Monitoring Layer
Set up:
- Latency alerts
- Token usage spikes
- Drop in eval scores
- Frequent fallback usage
✅ Tools:
- Datadog / Grafana + custom metrics
- LangSmith alerts
- Self-hosted Prometheus if you’re feeling fancy
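Before wiring up a monitoring tool, it helps to pin down the rules themselves. The sketch below checks one window of call records against static thresholds for the four signals listed above. The record keys, the `limits` keys, and the threshold values are all assumptions for this example. In practice these rules would live in your alerting tool rather than in application code.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_alerts(window: list[dict], limits: dict) -> list[str]:
    """Compare one window of call records against static thresholds."""
    if not window:
        return []
    alerts = []
    p95 = percentile([r["latency_ms"] for r in window], 95)
    if p95 > limits["p95_latency_ms"]:
        alerts.append(f"p95 latency {p95:.0f} ms exceeds {limits['p95_latency_ms']} ms")
    tokens = sum(r["input_tokens"] + r["output_tokens"] for r in window)
    if tokens > limits["max_tokens"]:
        alerts.append(f"token usage {tokens} exceeds {limits['max_tokens']}")
    mean_eval = sum(r["eval_score"] for r in window) / len(window)
    if mean_eval < limits["min_mean_eval"]:
        alerts.append(f"mean eval score {mean_eval:.2f} below {limits['min_mean_eval']}")
    fallback_rate = sum(r["used_fallback"] for r in window) / len(window)
    if fallback_rate > limits["max_fallback_rate"]:
        alerts.append(f"fallback rate {fallback_rate:.0%} above {limits['max_fallback_rate']:.0%}")
    return alerts


def make_record(latency_ms: float, used_fallback: bool = False) -> dict:
    """Build a synthetic call record for the demo window."""
    return {"latency_ms": latency_ms, "input_tokens": 400, "output_tokens": 200,
            "eval_score": 0.9, "used_fallback": used_fallback}


# 18 fast calls plus 2 slow ones that fell back to a secondary model.
window = [make_record(800.0) for _ in range(18)] + [make_record(6000.0, True) for _ in range(2)]
limits = {"p95_latency_ms": 3000, "max_tokens": 50_000,
          "min_mean_eval": 0.7, "max_fallback_rate": 0.2}
alerts = check_alerts(window, limits)
print(alerts)
```

Note that the average latency in this window looks healthy. Only the P95 check catches the two slow calls, which is why the table earlier asks for percentiles rather than means.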
5. Analytics & Feedback Loop
Analyze:
- Model version vs performance
- User feedback correlation
- Top failing prompts / queries
- Cost per route / tool / agent
Feed insights into:
- Prompt refinement
- Routing logic
- Cost controls
✅ Use:
- Dashboards (Tableau, Superset, Metabase)
- Custom logs in BigQuery / Snowflake
- Langfuse Analytics
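The analysis itself is mostly group-by arithmetic. As a rough sketch, the same helper can answer both "cost per route" and "model version vs performance" from logged records. The record keys and the `summarize` helper are hypothetical, and in a warehouse like BigQuery or Snowflake this would be a SQL `GROUP BY` instead.

```python
from collections import defaultdict


def summarize(records: list[dict], key: str) -> dict:
    """Group records by `key` and report call count, total cost, and mean eval score."""
    groups = defaultdict(lambda: {"calls": 0, "cost_usd": 0.0, "eval_sum": 0.0})
    for r in records:
        g = groups[r[key]]
        g["calls"] += 1
        g["cost_usd"] += r["cost_usd"]
        g["eval_sum"] += r["eval_score"]
    return {
        name: {
            "calls": g["calls"],
            "cost_usd": round(g["cost_usd"], 6),
            "mean_eval": round(g["eval_sum"] / g["calls"], 3),
        }
        for name, g in groups.items()
    }


# Synthetic records; model version labels are placeholders.
records = [
    {"route": "policy_lookup", "model_version": "v1", "cost_usd": 0.004, "eval_score": 0.9},
    {"route": "policy_lookup", "model_version": "v2", "cost_usd": 0.006, "eval_score": 0.8},
    {"route": "smalltalk", "model_version": "v1", "cost_usd": 0.001, "eval_score": 0.7},
]
by_route = summarize(records, "route")
by_model = summarize(records, "model_version")
print(by_route)
print(by_model)
```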
🧠 Example Trace You Should Be Seeing
User Query: "What’s our company’s refund policy in Germany?"
Trace:
- Router → intent: "policy_lookup"
- RAG → 3 chunks from vector DB
- PromptTemplate v3.4 applied
- OpenAI GPT-4 called
- Output: ✅ Answer + Source
- Tokens used: 412 input / 189 output
- Latency: 2.7s
- Eval: Groundedness 0.92, Helpfulness 0.85
If your system can’t give you this trace…
you’re not in production. You’re in prototype land.
⚠️ Common Anti-Patterns
| Symptom | Cause |
| --- | --- |
| Hallucinations post-deploy | No RAG context logged |
| Unexplained cost spikes | Token usage not monitored |
| Agents silently fail mid-chain | No tool call traces |
| Same prompt gives different result | Model version not logged |
| You don’t know what broke | No eval or user-level sessions |
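You can turn the table above into a quick self-check: scan a logged record and report which symptom each missing field would leave you unable to explain. The field names in `REQUIRED_FIELDS` are hypothetical, so map them to whatever your own log schema uses.

```python
# Hypothetical field names mapped to the symptom they help diagnose.
REQUIRED_FIELDS = {
    "retrieved_context": "Hallucinations post-deploy",
    "input_tokens": "Unexplained cost spikes",
    "output_tokens": "Unexplained cost spikes",
    "tools_invoked": "Agents silently fail mid-chain",
    "model_version": "Same prompt gives different result",
    "eval_scores": "You don't know what broke",
    "session_id": "You don't know what broke",
}


def audit_record(record: dict) -> dict[str, list[str]]:
    """Return {symptom: [missing fields]} for fields that are absent or None."""
    gaps: dict[str, list[str]] = {}
    for field_name, symptom in REQUIRED_FIELDS.items():
        if record.get(field_name) is None:
            gaps.setdefault(symptom, []).append(field_name)
    return gaps


# A record that forgot to log the model version and the session.
logged = {
    "retrieved_context": ["Refunds within 14 days."],
    "input_tokens": 412,
    "output_tokens": 189,
    "tools_invoked": [],
    "eval_scores": {"groundedness": 0.92},
}
gaps = audit_record(logged)
print(gaps)
```

An empty list such as `tools_invoked: []` counts as logged here, since "no tools were called" is itself useful information.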
✅ TL;DR: LLM Observability Checklist
- Track full prompt chain (template + variables)
- Store retrieved context and its source
- Capture toolchain usage per request
- Log latency + tokens + cost
- Tag everything with model version
- Score output quality continuously
- Group calls into sessions with user metadata
- Set alerts on cost spikes, eval drops, and latency outliers
🧠 Final Thoughts:
You can’t fix what you can’t see.
You can’t improve what you don’t measure.
You can’t scale what you don’t understand.
LLM apps are no longer toys.
They’re mission-critical interfaces powered by expensive, probabilistic models.
So stop flying blind.
Build observability in — or get burned out by support tickets and cloud bills.
🔮 Up Next on Gradient Descent Weekly:
- Why You Should Stop Using LLMs as Black Boxes





