🐞 Debugging ML Systems: From Data to Deployment
Your model doesn’t “just break.” It breaks for a reason. Let’s find it.

Gradient Descent Weekly — Issue #13
You built the model. You trained it. It worked in dev.
Now it’s in prod and giving garbage predictions.
Welcome to real-world machine learning.
Debugging ML systems is not like debugging software.
A syntax error won’t save you. There are no stack traces for mispredictions.
Instead, your model silently fails…
Because the data pipeline shifted
Because the target label logic changed
Because a retraining job picked up the wrong version
Because your model is making correct predictions… for the wrong reasons
In this issue, we’ll walk through how to systematically debug ML systems from end to end — starting with data and ending at deployment.
🧭 The Debugging Map
ML systems fail across multiple layers:
Data Issues
Feature Engineering Bugs
Training Problems
Model Evaluation Mistakes
Serving Skew
Deployment Bugs
Monitoring Blind Spots
Let’s dig into each. With tactical steps. And no fluff.
🧩 1. Debugging Data Issues
🔍 Symptoms:
Sudden drop in accuracy
Model performs well on training data but fails on real-world data
Predictions don't “feel right”
🛠 Checklist:
✅ Compare training data vs inference data distribution
✅ Validate schema and data types
✅ Check for missing/null/inconsistent values
✅ Watch out for label leakage (features that encode the target, or information you won't have at prediction time)
Tools:
Great Expectations
TensorFlow Data Validation (TFDV)
DVC for data versioning
✅ Tip: Always snapshot the data used for training. If you can’t reproduce the failure, you can’t fix it.
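One way to compare training and inference distributions for a single numeric feature is the Population Stability Index (PSI). The sketch below is a minimal, standard-library version; the function name `psi`, the 10-bucket default, and the 1e-6 floor are illustrative choices, not something the tools listed above prescribe.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a training sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all training values are equal

    def bucket_shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the training range are clamped into the edge buckets.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor keeps the logarithm defined for empty buckets.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


train_ages = [i / 100 for i in range(1000)]   # stand-in for a training snapshot
live_ages = [x + 3 for x in train_ages]       # stand-in for shifted production data
print(f"PSI: {psi(train_ages, live_ages):.2f}")
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a shift worth investigating, but the right threshold depends on the feature, so treat that number as a starting point.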
🧮 2. Feature Engineering Bugs
🔍 Symptoms:
Weird feature values at runtime (e.g., negative ages)
Model performs inconsistently across environments
Drift between training and serving features
🛠 Checklist:
✅ Ensure same preprocessing logic for train and inference
✅ Normalize, encode, scale features identically
✅ Test edge cases (out-of-range inputs, unseen categories)
Tools:
Feast (feature store)
Pytest unit tests for preprocessing
MLflow/Weights & Biases for feature logging
✅ Tip: Don't split feature prep between Python scripts and SQL logic. Two implementations of the same transformation is how skew sneaks in.
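A simple way to keep train and inference preprocessing identical is to fit one object on training data and reuse it, unchanged, at serving time. The sketch below assumes a toy schema with an `age` and a `plan` feature; `Preprocessor`, `fit`, and the 0 to 120 age range are hypothetical names and values picked for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preprocessor:
    """Statistics fitted on training data, reused unchanged at inference."""
    age_mean: float
    age_std: float
    known_plans: tuple[str, ...]

    def transform(self, row: dict) -> list[float]:
        age = row.get("age")
        # Missing or impossible ages fall back to the training mean (one possible policy).
        if age is None or not 0 <= age <= 120:
            age = self.age_mean
        scaled_age = (age - self.age_mean) / self.age_std
        # One-hot encode the plan; an unseen category maps to all zeros.
        plan = row.get("plan")
        one_hot = [1.0 if plan == p else 0.0 for p in self.known_plans]
        return [scaled_age, *one_hot]


def fit(rows: list[dict]) -> Preprocessor:
    ages = [r["age"] for r in rows]
    mean = sum(ages) / len(ages)
    std = (sum((a - mean) ** 2 for a in ages) / len(ages)) ** 0.5 or 1.0
    plans = tuple(sorted({r["plan"] for r in rows}))
    return Preprocessor(mean, std, plans)


pre = fit([{"age": 20, "plan": "free"}, {"age": 40, "plan": "pro"}])
print(pre.transform({"age": 30, "plan": "free"}))
print(pre.transform({"age": -5, "plan": "enterprise"}))  # negative age, unseen plan
```

Whether to impute, clip, or reject a bad value is a product decision. The point is that the decision lives in one place and is covered by a test for each edge case.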
🧠 3. Training Pipeline Failures
🔍 Symptoms:
Model accuracy is suspiciously high or low
Re-training leads to drastically different results
Model doesn’t generalize
🛠 Checklist:
✅ Check if the training data is balanced and representative
✅ Validate label correctness
✅ Ensure train/val/test splits are truly disjoint
✅ Check for overfitting (train ≫ val/test accuracy)
Tools:
Confusion matrix, ROC curves, learning curves
Model explainability (SHAP, LIME)
✅ Tip: Always keep an untouched test set from day one. It's your sanity check.
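To keep splits disjoint across retraining runs, one option is to assign each record to a split by hashing a stable ID instead of shuffling. The sketch below assumes every record has such an ID; `assign_split` and `overlapping_ids` are hypothetical helper names.

```python
import hashlib


def assign_split(record_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map an ID to 'train', 'val', or 'test'."""
    digest = hashlib.sha256(record_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def overlapping_ids(*splits: list[str]) -> set[str]:
    """Return IDs that appear in more than one split."""
    seen: set[str] = set()
    dupes: set[str] = set()
    for split in splits:
        for rid in set(split):
            (dupes if rid in seen else seen).add(rid)
    return dupes


splits: dict[str, list[str]] = {"train": [], "val": [], "test": []}
for i in range(1000):
    rid = f"user-{i}"
    splits[assign_split(rid)].append(rid)

# Fails loudly if any ID leaked between splits.
assert not overlapping_ids(*splits.values())
```

Because the assignment depends only on the ID, a record that landed in the test set on day one stays there after every data refresh.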
🔬 4. Model Evaluation Bugs
🔍 Symptoms:
Great evaluation scores, terrible user feedback
Mismatch between metrics and business outcomes
🛠 Checklist:
✅ Are you using the right metric? (Accuracy vs F1 vs AUC; accuracy alone misleads on imbalanced classes)
✅ Are thresholds tuned properly?
✅ Are metrics aligned with business KPIs?
Tools:
MLflow / W&B for experiment tracking
Custom evaluation scripts
✅ Tip: Accuracy ≠ usefulness. Optimize for impact, not just math.
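The metric and threshold questions above are easy to see on a small imbalanced example. The sketch below uses made-up labels and scores (5 positives, 95 negatives), and `binary_metrics` is a hypothetical helper written out by hand so nothing is hidden.

```python
def binary_metrics(y_true: list[int], scores: list[float], threshold: float = 0.5) -> dict[str, float]:
    """Accuracy, precision, recall, and F1 for binary labels at a given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision, "recall": recall, "f1": f1}


# Illustrative data: the model ranks positives higher but never crosses 0.5.
y_true = [1] * 5 + [0] * 95
scores = [0.4] * 5 + [0.1] * 95

print(binary_metrics(y_true, scores, threshold=0.5))  # high accuracy, zero F1
print(binary_metrics(y_true, scores, threshold=0.3))  # same model, tuned threshold
```

At the default threshold the model scores 95% accuracy while finding none of the positives. Lowering the threshold changes nothing about the model and everything about its usefulness.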
🧯 5. Serving Skew (Train ≠ Inference)
🔍 Symptoms:
Model behaves differently in production than in test
Predictions fail silently or return NaNs
Unexpected drop in API performance
🛠 Checklist:
✅ Validate feature types and ranges at runtime
✅ Log input/output data at inference
✅ Use the same codebase (or containers) for train & serve
Tools:
Inference loggers
Schema validators
A/B test environments
✅ Tip: Shadow test models on live traffic before full deployment. Always.
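Runtime validation of feature types and ranges can be as small as a list of specs checked before every prediction. The sketch below is one possible shape; `FeatureSpec`, `validate`, and the example ranges are assumptions for illustration, and a dedicated schema validation library may suit a larger service better.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    name: str
    dtype: type
    min_value: float
    max_value: float


def validate(payload: dict, specs: list[FeatureSpec]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the payload is OK."""
    errors = []
    for spec in specs:
        if spec.name not in payload:
            errors.append(f"{spec.name}: missing")
            continue
        value = payload[spec.name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, spec.dtype):
            errors.append(f"{spec.name}: expected {spec.dtype.__name__}, got {type(value).__name__}")
            continue
        if isinstance(value, float) and math.isnan(value):
            errors.append(f"{spec.name}: NaN")
            continue
        if not spec.min_value <= value <= spec.max_value:
            errors.append(f"{spec.name}: {value} outside [{spec.min_value}, {spec.max_value}]")
    return errors


SPECS = [FeatureSpec("age", int, 0, 120), FeatureSpec("income", float, 0.0, 1e7)]

print(validate({"age": 35, "income": 52000.0}, SPECS))
print(validate({"age": -3, "income": float("nan")}, SPECS))
```

Logging the returned errors alongside the request gives you the input/output trail the checklist asks for, and it turns a silent NaN into a visible rejection.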
🚀 6. Deployment & Infrastructure Bugs
🔍 Symptoms:
Latency spikes
High failure rate on API calls
Wrong model version deployed
🛠 Checklist:
✅ Container includes the correct model file
✅ Model versioning + rollback works
✅ Infrastructure (memory/CPU) is sized correctly
Tools:
Docker, Kubernetes
Prometheus + Grafana
Amazon SageMaker logs / endpoint configs
✅ Tip: Always tag and log every deployed version with a unique identifier, such as the Git commit and a hash of the model file.
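One lightweight way to confirm the container holds the model you meant to ship is to fingerprint the model file and compare it with the hash recorded at training time. In the sketch below, `model_fingerprint` and `verify_deployed_model` are hypothetical helpers, and where the expected hash comes from (a registry, a config file) is left as an assumption.

```python
import hashlib
from pathlib import Path


def model_fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the model file, read in chunks so large files don't fill memory."""
    sha = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def verify_deployed_model(path: Path, expected_sha256: str) -> None:
    """Raise at startup if the file on disk is not the model that was approved."""
    actual = model_fingerprint(path)
    if actual != expected_sha256:
        raise RuntimeError(
            f"model mismatch: expected {expected_sha256[:12]}, got {actual[:12]}"
        )
```

Calling `verify_deployed_model` when the service starts, and logging the fingerprint with every prediction, makes "wrong model version deployed" a startup failure instead of a week of confusing metrics.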
📉 7. Monitoring Blind Spots
🔍 Symptoms:
Drift goes undetected
Label distribution shifts quietly
You find out about issues from users (😬)
🛠 Checklist:
✅ Track input drift, label drift, concept drift
✅ Log prediction confidence, latency, throughput
✅ Set alerts and thresholds
Tools:
Evidently AI, WhyLabs
Prometheus, Datadog, CloudWatch
Custom health checks
✅ Tip: No monitoring = no trust. Make monitoring part of CI/CD, not an afterthought.
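As a sketch of what "set alerts and thresholds" can look like for prediction confidence, the class below keeps a rolling window of scores and signals when the window mean drops too low. `ConfidenceMonitor`, the window size, and the 0.7 floor are illustrative assumptions. In practice the signal would feed whichever alerting tool you already run.

```python
from collections import deque


class ConfidenceMonitor:
    """Rolling-window check on prediction confidence."""

    def __init__(self, window: int = 500, min_mean: float = 0.7):
        self.scores: deque[float] = deque(maxlen=window)
        self.min_mean = min_mean

    def record(self, confidence: float) -> bool:
        """Store one score; return True if the alert condition is met."""
        self.scores.append(confidence)
        # Wait for a full window so a few early low scores don't trigger noise.
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and sum(self.scores) / len(self.scores) < self.min_mean


monitor = ConfidenceMonitor(window=3, min_mean=0.7)
for score in [0.9, 0.9, 0.9, 0.2]:
    if monitor.record(score):
        print(f"ALERT: mean confidence below {monitor.min_mean}")
```

The same pattern works for latency or for the share of inputs that fail validation. Only the recorded value and the comparison change.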
🧠 Debugging Mindset: Don't Look for a Bug. Look for the Mismatch.
ML failures are rarely “bugs” in the traditional sense.
They’re mismatches — between data and assumptions, training and production, signals and business value.
Debugging ML isn’t about fixing code.
It’s about asking better questions about your system.
🛡️ Debugging Survival Kit
Version everything — code, data, configs, models
Validate inputs — always
Test both model logic and business logic
Monitor before users complain
Automate test + eval in your pipeline
Never, ever debug blind
🔚 Final Thoughts: You Can’t Debug What You Don’t Log
ML systems are fragile by default.
You want them to be observable, explainable, and reproducible.
If your model is a black box, make your system a glass box.
Build logs, metrics, and checkpoints into everything.
Because when it breaks—and it will—you want to be the one with answers.
🔮 Up Next on Gradient Descent Weekly:
- Building a 1-Person MLOps Stack That Works






