🐞 Debugging ML Systems: From Data to Deployment

Gradient Descent Weekly — Issue #13

You built the model. You trained it. It worked in dev.
Now it’s in prod and giving garbage predictions.
Welcome to real-world machine learning.

Debugging ML systems is not like debugging software.
A syntax error won’t save you. There are no stack traces for mispredictions.

Instead, your model silently fails…

Because the data pipeline shifted
Because the target label logic changed
Because a retraining job picked up the wrong version
Because your model is making correct predictions… for the wrong reasons

In this issue, we’ll walk through how to systematically debug ML systems from end to end — starting with data and ending at deployment.

🧭 The Debugging Map

ML systems fail across multiple layers:

Data Issues
Feature Engineering Bugs
Training Problems
Model Evaluation Mistakes
Serving Skew
Deployment Bugs
Monitoring Blind Spots

Let’s dig into each. With tactical steps. And no fluff.

🧩 1. Debugging Data Issues

🔍 Symptoms:

Sudden drop in accuracy
Model performs well on the training set but fails on real-world data
Predictions don't “feel right”

🛠 Checklist:

✅ Compare training data vs inference data distribution
✅ Validate schema and data types
✅ Check for missing/null/inconsistent values
✅ Watch out for label leakage (features that encode the target, or that are only known after the outcome)

Tools:

Great Expectations
TensorFlow Data Validation (TFDV)
DVC for data versioning

✅ Tip: Always snapshot the data used for training. If you can’t reproduce the failure, you can’t fix it.
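
One way to make "compare training vs inference distribution" concrete is the Population Stability Index (PSI). The sketch below is a minimal stdlib-only version for a single numeric feature. The function name `psi`, the 10-bin default, and the floor of `1e-6` for empty bins are illustrative choices, not part of any tool listed above.

```python
import math
from typing import Sequence


def psi(train: Sequence[float], live: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a training and a live sample."""
    lo, hi = min(train), max(train)
    # Fall back to a width of 1.0 if the training feature is constant.
    width = (hi - lo) / bins or 1.0

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp so live values outside the training range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    expected, actual = bucket_shares(train), bucket_shares(live)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


train_ages = [i / 100 for i in range(1000)]
shifted_ages = [x + 5 for x in train_ages]
print(psi(train_ages, train_ages))    # identical samples give 0.0
print(psi(train_ages, shifted_ages))  # a large shift gives a large PSI
```

A common rule of thumb treats PSI above roughly 0.25 as a shift worth investigating, but the right cutoff depends on the feature and should be tuned on your own data.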

🧮 2. Feature Engineering Bugs

🔍 Symptoms:

Weird feature values at runtime (e.g., negative ages)
Model performs inconsistently across environments
Drift between training and serving features

🛠 Checklist:

✅ Ensure same preprocessing logic for train and inference
✅ Normalize, encode, scale features identically
✅ Test edge cases (out-of-range inputs, unseen categories)

Tools:

Feast (feature store)
Pytest unit tests for preprocessing
MLflow/Weights & Biases for feature logging

✅ Tip: Don't split feature prep between Python scripts and SQL. Maintaining the same transformation in two places is how skew sneaks in.
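
The checklist above can be sketched as one preprocessing function that both the training job and the serving path import. Everything here is hypothetical: the `FeatureSpec` fields, the `age` and `plan` features, and the fitted values are placeholders for whatever your pipeline actually fits on training data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    """Parameters fitted on training data, then frozen and shipped with the model."""
    age_min: float = 0.0
    age_max: float = 120.0
    age_mean: float = 40.0
    age_std: float = 15.0
    plans: tuple = ("free", "pro", "team")  # categories seen during training


def preprocess(raw: dict, spec: FeatureSpec) -> list:
    """Turn one raw record into a feature vector (used by training AND serving)."""
    age = float(raw["age"])
    if not spec.age_min <= age <= spec.age_max:
        raise ValueError(f"age out of range: {age}")
    scaled_age = (age - spec.age_mean) / spec.age_std
    # One-hot encode, with an explicit trailing slot for unseen categories.
    one_hot = [1.0 if raw.get("plan") == p else 0.0 for p in spec.plans]
    one_hot.append(0.0 if any(one_hot) else 1.0)
    return [scaled_age] + one_hot


spec = FeatureSpec()
print(preprocess({"age": 55, "plan": "pro"}, spec))         # known category
print(preprocess({"age": 40, "plan": "enterprise"}, spec))  # unseen category
```

Because the edge cases (negative ages, unseen categories) are handled in one place, a handful of Pytest cases against `preprocess` covers both environments at once.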

🧠 3. Training Pipeline Failures

🔍 Symptoms:

Model accuracy is suspiciously high or low
Retraining leads to drastically different results
Model doesn’t generalize

🛠 Checklist:

✅ Check if the training data is balanced and representative
✅ Validate label correctness
✅ Ensure train/val/test splits are truly disjoint
✅ Check for overfitting (train ≫ val/test accuracy)

Tools:

Confusion matrix, ROC curves, learning curves
Model explainability (SHAP, LIME)

✅ Tip: Always keep an untouched test set from day one. It's your sanity check.
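
A minimal sketch of "truly disjoint" splits, assuming each example carries a stable ID (a user ID, say). Hashing the ID keeps every entity in the same split across retraining runs, and a small helper confirms nothing overlaps. The function names and the 80/10/10 proportions are illustrative.

```python
import hashlib


def assign_split(entity_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map an ID to a split with a stable hash, so it never moves between runs."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def overlapping_ids(*splits):
    """Return IDs that appear in more than one split (should be empty)."""
    seen, dupes = set(), set()
    for ids in splits:
        ids = set(ids)
        dupes |= seen & ids
        seen |= ids
    return dupes


ids = [f"user-{i}" for i in range(1000)]
by_split = {"train": [], "val": [], "test": []}
for entity_id in ids:
    by_split[assign_split(entity_id)].append(entity_id)

assert not overlapping_ids(*by_split.values())
```

Splitting on the entity ID rather than the row index matters when one user contributes many rows. Otherwise the same user can land in both train and test, which inflates test scores.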

🔬 4. Model Evaluation Bugs

🔍 Symptoms:

Great evaluation scores, terrible user feedback
Mismatch between metrics and business outcomes

🛠 Checklist:

✅ Are you using the right metric? (Accuracy vs F1 vs AUC)
✅ Are thresholds tuned properly?
✅ Are metrics aligned with business KPIs?

Tools:

MLflow / W&B for experiment tracking
Custom evaluation scripts

✅ Tip: Accuracy ≠ usefulness. Optimize for impact, not just math.
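
To see why the metric and the threshold both matter, here is a toy example on imbalanced data. The labels and scores are made up for illustration. The two thresholds give the same accuracy but very different F1.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN for binary labels at a given score threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def f1(tp, fp, fn, tn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
scores = [0.1] * 90 + [0.35] * 5 + [0.4] * 5

for threshold in (0.5, 0.3):
    counts = confusion_counts(y_true, scores, threshold)
    print(threshold, round(accuracy(*counts), 3), round(f1(*counts), 3))
```

At a threshold of 0.5 the model never predicts the positive class: accuracy is 0.95 and F1 is 0.0. At 0.3 it catches all five positives at the cost of five false alarms: accuracy is still 0.95, but F1 rises to about 0.667. Which trade-off is better is a business question, not a math one.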

🧯 5. Serving Skew (Train ≠ Inference)

🔍 Symptoms:

Model behaves differently in production than in test
Predictions fail silently or return NaNs
Unexpected drop in API performance

🛠 Checklist:

✅ Validate feature types and ranges at runtime
✅ Log input/output data at inference
✅ Use the same codebase (or containers) for train & serve

Tools:

Inference loggers
Schema validators
A/B test environments

✅ Tip: Shadow test models on live traffic before full deployment. Always.
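
A runtime guard for "validate feature types and ranges" can be very small. The sketch below assumes a hand-written schema; the feature names and bounds in `SCHEMA` are hypothetical, and in practice you might derive them from training data or from a schema tool.

```python
import math

# Hypothetical schema: feature name -> (expected type, min, max).
SCHEMA = {
    "age": (float, 0.0, 120.0),
    "sessions_7d": (int, 0, 10_000),
}


def validate_features(features: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, (kind, lo, hi) in SCHEMA.items():
        if name not in features:
            problems.append(f"{name}: missing")
            continue
        value = features[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, kind):
            problems.append(
                f"{name}: expected {kind.__name__}, got {type(value).__name__}"
            )
        elif isinstance(value, float) and math.isnan(value):
            problems.append(f"{name}: NaN")
        elif not lo <= value <= hi:
            problems.append(f"{name}: {value} outside [{lo}, {hi}]")
    return problems


print(validate_features({"age": 34.0, "sessions_7d": 3}))
print(validate_features({"age": float("nan"), "sessions_7d": 3}))
```

Logging the returned problems alongside the request (rather than silently dropping or imputing) is what turns a NaN prediction from a mystery into a one-line log search.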

🚀 6. Deployment & Infrastructure Bugs

🔍 Symptoms:

Latency spikes
High failure rate on API calls
Wrong model version deployed

🛠 Checklist:

✅ Container includes the correct model file
✅ Model versioning + rollback works
✅ Infrastructure (memory/CPU) is sized correctly

Tools:

Docker, Kubernetes
Prometheus + Grafana
AWS SageMaker logs / endpoint configs

✅ Tip: Always tag and log every deployed version with a unique commit/hash.
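
One lightweight way to catch "wrong model version deployed" is to fingerprint the model file itself. This is a sketch, assuming the model is a single file on disk; `deployment_record` and `verify_model` are hypothetical helper names, and the commit string is passed in rather than read from git.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deployment_record(model_path: str, git_commit: str) -> dict:
    """Build the record to log at release time (store it with the deployment)."""
    return {"model_sha256": file_sha256(model_path), "git_commit": git_commit}


def verify_model(model_path: str, expected_sha256: str) -> bool:
    """Check at container startup that the bundled model is the intended one."""
    return file_sha256(model_path) == expected_sha256
```

Calling `verify_model` at startup and refusing to serve on a mismatch turns a silent version mix-up into a loud, immediate failure.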

📉 7. Monitoring Blind Spots

🔍 Symptoms:

Drift goes undetected
Label distribution shifts quietly
You find out about issues from users (😬)

🛠 Checklist:

✅ Track input drift, label drift, concept drift
✅ Log prediction confidence, latency, throughput
✅ Set alerts and thresholds

Tools:

Evidently AI, WhyLabs
Prometheus, DataDog, CloudWatch
Custom health checks

✅ Tip: No monitoring = no trust. Make monitoring part of CI/CD, not an afterthought.
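
As a sketch of "log prediction confidence" plus "set alerts and thresholds", the class below keeps a rolling window of confidences and flags when the mean drops too far below a baseline. The class name, window size, and `max_drop` value are illustrative assumptions; a real setup would export the metric to your monitoring system and alert from there.

```python
from collections import deque


class ConfidenceMonitor:
    """Rolling-window check on mean prediction confidence."""

    def __init__(self, baseline_mean: float, window: int = 500, max_drop: float = 0.10):
        self.baseline_mean = baseline_mean  # measured on validation data
        self.max_drop = max_drop
        self.recent = deque(maxlen=window)

    def observe(self, confidence: float) -> bool:
        """Record one prediction; return True if an alert should fire."""
        self.recent.append(confidence)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data yet
        mean = sum(self.recent) / len(self.recent)
        return self.baseline_mean - mean > self.max_drop


monitor = ConfidenceMonitor(baseline_mean=0.9, window=3, max_drop=0.1)
for confidence in (0.9, 0.9, 0.9, 0.5):
    if monitor.observe(confidence):
        print("ALERT: mean confidence dropped below baseline")
```

Falling confidence is only a proxy for drift, so it works best alongside input-distribution checks and, once labels arrive, real accuracy tracking.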

🧠 Debugging Mindset: Don't Look for a Bug. Look for the Mismatch.

ML failures are rarely “bugs” in the traditional sense.
They’re mismatches — between data and assumptions, training and production, signals and business value.

Debugging ML isn’t about fixing code.
It’s about asking better questions about your system.

🛡️ Debugging Survival Kit

Version everything — code, data, configs, models
Validate inputs — always
Test both model logic and business logic
Monitor before users complain
Automate testing and evaluation in your pipeline
Never, ever debug blind

🔚 Final Thoughts: You Can’t Debug What You Don’t Log

ML systems are fragile by default.
You want them to be observable, explainable, and reproducible.

If your model is a black box, make your system a glass box.

Build logs, metrics, and checkpoints into everything.

Because when it breaks—and it will—you want to be the one with answers.

🔮 Up Next on Gradient Descent Weekly:

Building a 1-Person MLOps Stack That Works
