🐞 Debugging ML Systems: From Data to Deployment

Your model doesn’t “just break.” It breaks for a reason. Let’s find it.


Forward-thinking IT Operations Leader with cross-domain expertise spanning incident & change management, cloud infrastructure (Azure, AWS, GCP), and automation engineering. Proven track record in building and leading high-performance operations teams that drive reliability, innovation, and uptime across mission-critical enterprise systems. Adept at aligning IT services with business goals through strategic leadership, cloud-native transformation, and process modernization. Currently spearheading application operations and monitoring for digital modernization initiatives. Deeply passionate about coding in Rust, Go, and Python, and solving real-world problems through machine learning, model inference, and Generative AI. Actively exploring the intersection of AI engineering and infrastructure automation to future-proof operational ecosystems and unlock new business value.

Gradient Descent Weekly — Issue #13

You built the model. You trained it. It worked in dev.
Now it’s in prod and giving garbage predictions.
Welcome to real-world machine learning.

Debugging ML systems is not like debugging software.
A syntax error won’t save you. There are no stack traces for mispredictions.

Instead, your model silently fails…

  • Because the data pipeline shifted

  • Because the target label logic changed

  • Because a retraining job picked up the wrong version

  • Because your model is making correct predictions… for the wrong reasons

In this issue, we’ll walk through how to systematically debug ML systems from end to end — starting with data and ending at deployment.

🧭 The Debugging Map

ML systems fail across multiple layers:

  1. Data Issues

  2. Feature Engineering Bugs

  3. Training Pipeline Failures

  4. Model Evaluation Bugs

  5. Serving Skew

  6. Deployment & Infrastructure Bugs

  7. Monitoring Blind Spots

Let’s dig into each. With tactical steps. And no fluff.

🧩 1. Debugging Data Issues

🔍 Symptoms:

  • Sudden drop in accuracy

  • Model performs well on the training set but fails on real-world data

  • Predictions don't “feel right”

🛠 Checklist:

  • ✅ Compare training data vs inference data distribution

  • ✅ Validate schema and data types

  • ✅ Check for missing/null/inconsistent values

  • ✅ Watch out for label leakage (features that encode the target and won't exist at prediction time)

Tools:

  • Great Expectations

  • TensorFlow Data Validation (TFDV)

  • DVC for data versioning

Tip: Always snapshot the data used for training. If you can’t reproduce the failure, you can’t fix it.
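To make the first checklist item concrete, here is a minimal sketch of a train-vs-inference distribution check using the Population Stability Index (PSI). PSI is a technique swapped in for illustration, not something the tools above require; the function name, the 10-bin default, and the sample age data are all assumptions. A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but tune that for your own features.

```python
import math
from typing import Sequence


def population_stability_index(
    train: Sequence[float], live: Sequence[float], bins: int = 10
) -> float:
    """PSI between a training sample and a live sample of one numeric feature."""
    lo, hi = min(train), max(train)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def bucket_shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so live values outside the training range land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = bucket_shares(train), bucket_shares(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Made-up data for illustration only.
train_ages = [20 + (i % 40) for i in range(1000)]   # ages 20..59, roughly uniform
same_ages = [20 + (i % 40) for i in range(500)]     # same shape, fewer rows
shifted_ages = [45 + (i % 40) for i in range(500)]  # the population got older

psi_same = population_stability_index(train_ages, same_ages)
psi_shifted = population_stability_index(train_ages, shifted_ages)
print(f"same: {psi_same:.4f}  shifted: {psi_shifted:.4f}")
```

Run this per feature against your training snapshot. If the snapshot is gone, there is nothing to compare against, which is exactly why the tip above matters.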

🧮 2. Feature Engineering Bugs

🔍 Symptoms:

  • Weird feature values at runtime (e.g., negative ages)

  • Model performs inconsistently across environments

  • Drift between training and serving features

🛠 Checklist:

  • ✅ Use the same preprocessing logic for training and inference

  • ✅ Normalize, encode, and scale features identically in both paths

  • ✅ Test edge cases (out-of-range inputs, unseen categories)

Tools:

  • Feast (feature store)

  • Pytest unit tests for preprocessing

  • MLflow/Weights & Biases for feature logging

Tip: Don't split feature prep between Python scripts and SQL logic. Two implementations of the same feature are how skew sneaks in.
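One way to keep train and inference in lockstep is a single preprocessing object that is fitted once and imported by both paths. The sketch below is hypothetical: `FeaturePrep`, the `age` and `plan` fields, and the choice to clip out-of-range values and zero-encode unseen categories are all assumptions, not a prescribed design.

```python
from dataclasses import dataclass, field


@dataclass
class FeaturePrep:
    """Hypothetical preprocessing step shared by training and serving code."""

    age_min: float = 0.0
    age_max: float = 1.0
    plans: list[str] = field(default_factory=list)

    def fit(self, rows: list[dict]) -> "FeaturePrep":
        ages = [r["age"] for r in rows]
        self.age_min, self.age_max = min(ages), max(ages)
        self.plans = sorted({r["plan"] for r in rows})
        return self

    def transform(self, row: dict) -> list[float]:
        age = row["age"]
        if age is None or age < 0:
            raise ValueError(f"invalid age: {age!r}")
        span = (self.age_max - self.age_min) or 1.0
        # Clip to the range seen in training so out-of-range inputs stay in [0, 1].
        scaled = min(max((age - self.age_min) / span, 0.0), 1.0)
        # One-hot encode; an unseen category maps to all zeros instead of crashing.
        one_hot = [1.0 if row["plan"] == p else 0.0 for p in self.plans]
        return [scaled, *one_hot]


train_rows = [
    {"age": 20, "plan": "free"},
    {"age": 40, "plan": "pro"},
    {"age": 60, "plan": "free"},
]
prep = FeaturePrep().fit(train_rows)

print(prep.transform({"age": 40, "plan": "pro"}))         # in-range input
print(prep.transform({"age": 95, "plan": "enterprise"}))  # edge case: clipped, unseen plan
```

The edge cases from the checklist (negative ages, out-of-range inputs, unseen categories) then become plain unit tests against `transform`.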

🧠 3. Training Pipeline Failures

🔍 Symptoms:

  • Model accuracy is suspiciously high or low

  • Re-training leads to drastically different results

  • Model doesn’t generalize

🛠 Checklist:

  • ✅ Check if the training data is balanced and representative

  • ✅ Validate label correctness

  • ✅ Ensure train/val/test splits are truly disjoint

  • ✅ Check for overfitting (train ≫ val/test accuracy)

Tools:

  • Confusion matrix, ROC curves, learning curves

  • Model explainability (SHAP, LIME)

Tip: Always keep an untouched test set from day one. It's your sanity check.
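Two of the checks above are cheap to automate: split disjointness and the train/validation gap. This is a rough sketch, assuming rows are JSON-serializable dicts and that exact-duplicate rows are the leak you care about. The helper names and the 0.10 gap threshold are made up for illustration.

```python
import hashlib
import json


def row_key(row: dict) -> str:
    """Stable fingerprint of a row, independent of key order."""
    payload = json.dumps(row, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def split_overlap(*splits: list[dict]) -> int:
    """Count distinct rows in each split that already appeared in an earlier split."""
    seen: set[str] = set()
    overlap = 0
    for split in splits:
        keys = {row_key(r) for r in split}
        overlap += len(keys & seen)
        seen |= keys
    return overlap


def looks_overfit(train_acc: float, val_acc: float, max_gap: float = 0.10) -> bool:
    """Flag a train/validation accuracy gap above max_gap (an assumed threshold)."""
    return (train_acc - val_acc) > max_gap


# Made-up rows for illustration only.
train = [{"user": 1, "spend": 10.0}, {"user": 2, "spend": 55.5}]
val = [{"user": 3, "spend": 7.25}]
test = [{"spend": 55.5, "user": 2}]  # same row as in train, keys reordered

print("leaked rows:", split_overlap(train, val, test))
print("overfit?", looks_overfit(train_acc=0.99, val_acc=0.81))
```

Exact-match hashing won't catch near-duplicates or the same entity appearing with different timestamps; for those, key on an entity ID instead of the full row.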

🔬 4. Model Evaluation Bugs

🔍 Symptoms:

  • Great evaluation scores, terrible user feedback

  • Mismatch between metrics and business outcomes

🛠 Checklist:

  • ✅ Are you using the right metric? (Accuracy vs F1 vs AUC)

  • ✅ Are thresholds tuned properly?

  • ✅ Are metrics aligned with business KPIs?

Tools:

  • MLflow / W&B for experiment tracking

  • Custom evaluation scripts

Tip: Accuracy ≠ usefulness. Optimize for impact, not just math.
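A small worked example shows why the metric choice matters. On the made-up, imbalanced dataset below (5 positives in 100 rows), a model that always predicts "negative" scores 95% accuracy and an F1 of zero. The scores and the threshold grid are invented for illustration; in practice scikit-learn's metrics would replace these hand-rolled helpers.

```python
def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true: list[int], y_pred: list[int]) -> float:
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_at(threshold: float, y_true: list[int], scores: list[float]) -> float:
    return f1(y_true, [int(s >= threshold) for s in scores])


# 5 positives, 95 negatives, with hypothetical model scores.
y_true = [1] * 5 + [0] * 95
scores = [0.91, 0.83, 0.47, 0.42, 0.22] + [0.62, 0.37] + [0.05] * 93

always_negative = [0] * len(y_true)
print("always-negative accuracy:", accuracy(y_true, always_negative))
print("always-negative F1:", f1(y_true, always_negative))

# Sweep thresholds instead of assuming 0.5 is right.
thresholds = [t / 10 for t in range(1, 10)]
best = max(thresholds, key=lambda t: f1_at(t, y_true, scores))
print("F1 at 0.5:", f1_at(0.5, y_true, scores))
print("best threshold:", best, "F1:", f1_at(best, y_true, scores))
```

In a real system, tune the threshold on a validation set and pick the metric that tracks the business cost of false positives versus false negatives.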

🧯 5. Serving Skew (Train ≠ Inference)

🔍 Symptoms:

  • Model behaves differently in production than in test

  • Predictions fail silently or return NaNs

  • Unexpected drop in API performance

🛠 Checklist:

  • ✅ Validate feature types and ranges at runtime

  • ✅ Log input/output data at inference

  • ✅ Use the same codebase (or containers) for train & serve

Tools:

  • Inference loggers

  • Schema validators

  • A/B test environments

Tip: Shadow test models on live traffic before full deployment. Always.
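The first two checklist items (validate at runtime, log inputs and outputs) can be sketched as a thin wrapper around the model call. Everything named here is an assumption: the `SCHEMA` dict, its two features and their ranges, and the idea that `model` is a plain callable. A real service would likely use a dedicated schema library instead.

```python
import json
import logging
import math

logger = logging.getLogger("inference")

# Hypothetical schema: feature name -> (expected type, min, max).
SCHEMA = {
    "age": (int, 0, 120),
    "monthly_spend": (float, 0.0, 1e6),
}


def validate(features: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the input is fine."""
    errors = []
    for name, (ftype, lo, hi) in SCHEMA.items():
        if name not in features:
            errors.append(f"{name}: missing")
            continue
        value = features[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, ftype):
            errors.append(f"{name}: expected {ftype.__name__}, got {type(value).__name__}")
        elif math.isnan(value) or not lo <= value <= hi:
            errors.append(f"{name}: {value} outside [{lo}, {hi}]")
    return errors


def predict_with_checks(model, features: dict):
    """Validate, predict, and log one request. Returns None for rejected inputs."""
    errors = validate(features)
    if errors:
        logger.warning("rejected input %s: %s", json.dumps(features, default=str), errors)
        return None
    prediction = model(features)
    logger.info(json.dumps({"input": features, "output": prediction}, default=str))
    return prediction


fake_model = lambda features: 0.7  # stand-in for a real model

print(predict_with_checks(fake_model, {"age": 30, "monthly_spend": 12.5}))
print(validate({"age": -1, "monthly_spend": float("nan")}))
```

Rejecting bad inputs loudly is a design choice; some teams prefer to impute and flag instead. Either way, the NaN never reaches the model silently.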

🚀 6. Deployment & Infrastructure Bugs

🔍 Symptoms:

  • Latency spikes

  • High failure rate on API calls

  • Wrong model version deployed

🛠 Checklist:

  • ✅ Container includes the correct model file

  • ✅ Model versioning + rollback works

  • ✅ Infrastructure (memory/CPU) is sized correctly

Tools:

  • Docker, Kubernetes

  • Prometheus + Grafana

  • AWS SageMaker logs / endpoint configs

Tip: Always tag and log every deployed version with a unique commit/hash.
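Here is one possible way to act on that tip: write a small manifest next to the model at build time, then verify the file hash at startup so a wrong artifact fails fast. The manifest layout, file names, and the placeholder commit string are assumptions for illustration; model registries and SageMaker have their own mechanisms.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large model artifacts don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(model_path: Path, manifest_path: Path, git_commit: str) -> dict:
    """Record which artifact was built from which commit (run at build time)."""
    manifest = {"git_commit": git_commit, "sha256": file_sha256(model_path)}
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest


def verify_model(model_path: Path, manifest_path: Path) -> dict:
    """Raise if the model on disk is not the one the manifest describes (run at startup)."""
    manifest = json.loads(manifest_path.read_text())
    actual = file_sha256(model_path)
    if actual != manifest["sha256"]:
        raise RuntimeError(
            f"model hash mismatch: expected {manifest['sha256'][:12]}, got {actual[:12]}"
        )
    return manifest


with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "model.bin"
    manifest_file = Path(tmp) / "model.manifest.json"
    model_file.write_bytes(b"pretend these are model weights")
    write_manifest(model_file, manifest_file, git_commit="0000000")  # placeholder commit

    print("serving version:", verify_model(model_file, manifest_file)["git_commit"])

    model_file.write_bytes(b"someone copied in the wrong file")
    try:
        verify_model(model_file, manifest_file)
    except RuntimeError as err:
        print("refusing to serve:", err)
```

Logging the returned manifest with every prediction also makes "which version produced this output?" answerable after the fact.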

📉 7. Monitoring Blind Spots

🔍 Symptoms:

  • Drift goes undetected

  • Label distribution shifts quietly

  • You find out about issues from users (😬)

🛠 Checklist:

  • ✅ Track input drift, label drift, and concept drift

  • ✅ Log prediction confidence, latency, throughput

  • ✅ Set alerts and thresholds

Tools:

  • Evidently AI, WhyLabs

  • Prometheus, DataDog, CloudWatch

  • Custom health checks

Tip: No monitoring = no trust. Make monitoring part of CI/CD, not an afterthought.
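As a starting point for "set alerts and thresholds", the sketch below keeps a rolling window of prediction confidence and latency and returns alert names when either crosses a limit. The class name, window size, and both thresholds are assumptions; in production you would export these values to Prometheus or a similar system and alert there.

```python
from collections import deque
from statistics import mean


class RollingMonitor:
    """Rolling-window health check over prediction confidence and latency."""

    def __init__(
        self, window: int = 100, min_mean_conf: float = 0.6, max_p95_ms: float = 200.0
    ) -> None:
        self.conf: deque[float] = deque(maxlen=window)
        self.lat: deque[float] = deque(maxlen=window)
        self.min_mean_conf = min_mean_conf
        self.max_p95_ms = max_p95_ms

    def record(self, confidence: float, latency_ms: float) -> list[str]:
        """Add one observation and return the names of any alerts that fire."""
        self.conf.append(confidence)
        self.lat.append(latency_ms)
        if len(self.conf) < self.conf.maxlen:
            return []  # not enough data to judge yet
        alerts = []
        if mean(self.conf) < self.min_mean_conf:
            alerts.append("low_confidence")
        # Simple index-based estimate of the 95th percentile.
        p95 = sorted(self.lat)[int(0.95 * (len(self.lat) - 1))]
        if p95 > self.max_p95_ms:
            alerts.append("high_latency")
        return alerts


monitor = RollingMonitor(window=50)

# Healthy traffic: confident predictions, fast responses.
for _ in range(50):
    healthy_alerts = monitor.record(confidence=0.9, latency_ms=40.0)

# Degraded traffic: confidence collapses while latency stays flat.
for _ in range(50):
    degraded_alerts = monitor.record(confidence=0.3, latency_ms=40.0)

print("healthy:", healthy_alerts)
print("degraded:", degraded_alerts)
```

A drop in mean confidence is only a proxy for drift, but it needs no labels, so it can fire long before ground truth arrives.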

🧠 Debugging Mindset: Don't Look for a Bug. Look for the Mismatch.

ML failures are rarely “bugs” in the traditional sense.
They’re mismatches — between data and assumptions, training and production, signals and business value.

Debugging ML isn’t about fixing code.
It’s about asking better questions about your system.

🛡️ Debugging Survival Kit

  • Version everything — code, data, configs, models

  • Validate inputs — always

  • Test both model logic and business logic

  • Monitor before users complain

  • Automate testing and evaluation in your pipeline

  • Never, ever debug blind

🔚 Final Thoughts: You Can’t Debug What You Don’t Log

ML systems are fragile by default.
You want them to be observable, explainable, and reproducible.

If your model is a black box, make your system a glass box.

Build logs, metrics, and checkpoints into everything.

Because when it breaks—and it will—you want to be the one with answers.

🔮 Up Next on Gradient Descent Weekly:

  • Building a 1-Person MLOps Stack That Works