SentryML

Engineering-focused coverage of ML observability and MLOps. Model monitoring, drift detection, training/serving skew, debugging production model failures, evaluation pipelines, and the tooling that actually works at scale.
https://sentryml.com/
Language: en

Model Monitoring Tools in 2026: What's Changed, What to Use Now
https://sentryml.com/posts/model-monitoring-tools-2/
The model monitoring tools landscape shifted in 2026: WhyLabs shut down, LLM observability went mainstream, and open source caught up to managed SaaS. Here's the current map.
Published: Mon, 22 Jun 2026 00:00:00 GMT
Tags: model-monitoring, drift-detection, mlops, tooling, llm-observability
Author: SentryML Editorial

Predicting Model Behavior Before Release: What OpenAI's Deployment Simulation Means for MLOps
https://sentryml.com/posts/weekly-predicting-model-behavior-before-release-by-simulating-deplo-2/
OpenAI's Deployment Simulation replays 1.3M real conversations through candidate models before release, hitting 1.5x median error on safety predictions and surfacing behaviors like 'calculator hacking' that conventional evals never find.
Published: Mon, 22 Jun 2026 00:00:00 GMT
Tags: deployment-simulation, llm-safety, pre-deployment-evaluation, mlops, shadow-testing, model-behavior
Author: SentryML Editorial

ML Model Deployment: Serving Frameworks, KV Cache, and the Latency Metrics That Matter
https://sentryml.com/posts/ml-model-deployment-2/
Once a model clears staging, the serving stack decision determines whether you hit your latency SLAs or spend a sprint chasing p99 spikes. Here's what to evaluate and what to instrument.
Published: Sun, 21 Jun 2026 00:00:00 GMT
Tags: mlops, model-deployment, inference, latency, serving
Author: SentryML Editorial

Replaying Production to Catch Drift: Inside OpenAI's Deployment Simulation Framework
https://sentryml.com/posts/weekly-predicting-model-behavior-before-release-by-simulating-deplo/
OpenAI's deployment simulation replays 1.3M de-identified production conversations through a candidate model pre-release, catching behavior shifts static benchmarks miss. Here's how it works and what it means for teams running their own models.
Published: Sun, 21 Jun 2026 00:00:00 GMT
Tags: mlops, evaluation, drift, model-monitoring, pre-deployment, safety
Author: SentryML Editorial

Federated Learning in Production: What Substra Actually Does for Privacy-Preserving ML
https://sentryml.com/posts/creating-privacy-preserving-ai-with-substra/
Owkin's Substra framework keeps training data local while sharing only model weights, but federated architectures break standard MLOps assumptions around
Published: Sat, 13 Jun 2026 00:00:00 GMT
Tags: federated-learning, privacy, mlops, tooling, monitoring, data-governance
Author: SentryML Editorial

OpenAI Tops Gartner's Coding-Agent Quadrant. Now You Own a Production ML System.
https://sentryml.com/posts/openai-named-a-leader-in-enterprise-coding-agents-by-gartner/
Gartner named OpenAI a Leader in its first Magic Quadrant for Enterprise AI Coding Agents. The operational story is the part the press release skips: a
Published: Wed, 03 Jun 2026 00:00:00 GMT
Tags: llm-observability, drift, monitoring, evals, mlops
Author: SentryML Editorial

The ML Monitoring Metrics Taxonomy: Drift, Data Quality, and Model Decay
https://sentryml.com/posts/ml-monitoring-metrics-taxonomy-drift-data-quality-decay/
A reference taxonomy of the signals that actually tell you a production ML system is failing: input drift, prediction drift, concept drift, data quality
Published: Sat, 23 May 2026 00:00:00 GMT
Tags: mlops, monitoring, drift, data-quality, model-decay, observability, metrics
Author: SentryML Editorial

OpenTelemetry GenAI Semantic Conventions: Instrument LLM Apps
https://sentryml.com/posts/opentelemetry-genai-semantic-conventions-instrumenting-llm-apps/
How the OpenTelemetry GenAI semantic conventions standardize spans, metrics, and events for LLM apps, what they skip, and how to instrument without rework.
Published: Sat, 23 May 2026 00:00:00 GMT
Tags: observability, opentelemetry, llm-security, agents, monitoring, instrumentation, mlops
Author: SentryML Editorial

Model Monitoring in Production: A Four-Layer Framework
https://sentryml.com/posts/model-monitoring-2/
Model monitoring covers more than drift detection. Here's the four-layer framework (software health, data quality, model quality, business KPIs) wired
Published: Sat, 16 May 2026 00:00:00 GMT
Tags: model-monitoring, drift-detection, mlops, evidently, psi
Author: SentryML Editorial

Model Monitoring for LLM Inference: Metrics Your APM Can't See
https://sentryml.com/posts/model-monitoring-3/
Model monitoring for LLM APIs requires a different metric set than traditional ML. Here's the signal hierarchy: TTFT, KV cache hit rate, output length
Published: Sat, 16 May 2026 00:00:00 GMT
Tags: model-monitoring, llm-observability, ttft, drift-detection, mlops, vllm
Author: SentryML Editorial

SmithDB and Five Other Things LangChain Shipped at Interrupt 2026
https://sentryml.com/posts/langchain-interrupt-2026-smithdb-announcements/
LangChain's Interrupt 2026 surfaced a purpose-built trace database, a context version-control system, and an automated failure-triage engine.
Published: Thu, 14 May 2026 00:00:00 GMT
Tags: agent-observability, tracing, langsmith, mlops, infra
Author: SentryML Editorial

LLM Benchmarks in 2026: Which Still Discriminate, and How to Run
https://sentryml.com/posts/llm-benchmarks-2/
Static benchmarks like MMLU and HumanEval have saturated for frontier models. Here's which LLM benchmarks still produce signal, why contamination is worse
Published: Thu, 14 May 2026 00:00:00 GMT
Tags: llm, benchmarks, evaluation, model-selection, mlops, monitoring
Author: SentryML Editorial

Watermarking Should Be Treated as a Monitoring Primitive
https://sentryml.com/posts/watermarking-should-be-treated-as-a-monitoring-primitive/
A new paper reframes LLM watermarking from an adversarial evasion problem into a monitoring infrastructure question.
Published: Thu, 14 May 2026 00:00:00 GMT
Tags: watermarking, monitoring, provenance, attribution, mlops
Author: SentryML Editorial

LLM Fine Tuning: Methods, Training Data, and Evaluation
https://sentryml.com/posts/llm-fine-tuning-2/
A practitioner's guide to LLM fine tuning: how to pick between SFT, LoRA, and DPO, what your training data actually needs, and how to validate a
Published: Tue, 12 May 2026 00:00:00 GMT
Tags: llm, fine-tuning, mlops, lora, dpo, evaluation
Author: SentryML Editorial

LLM Testing: A Guide to Evals, Metrics, and Production Monitoring
https://sentryml.com/posts/llm-testing/
LLM testing spans offline evals, CI gate checks, and live production monitoring: three distinct jobs that need different tools.
Published: Tue, 12 May 2026 00:00:00 GMT
Tags: llm, evaluation, monitoring, mlops, testing, observability
Author: SentryML Editorial

ML Testing: A Checklist from Pre-Train Checks to Production Drift
https://sentryml.com/posts/ml-testing/
ML testing spans pre-train sanity checks, behavioral validation, data integrity, and continuous drift monitoring.
Published: Tue, 12 May 2026 00:00:00 GMT
Tags: ml-testing, model-validation, drift-detection, mlops, data-quality
Author: SentryML Editorial

Choosing MLOps Tools: A Decision Framework for Production Teams
https://sentryml.com/posts/mlops-tools-2/
Picking the wrong MLOps tools costs months of migration work. Here's how to evaluate experiment tracking, orchestration, monitoring, and serving options
Published: Tue, 12 May 2026 00:00:00 GMT
Tags: mlops, tooling, mlops-tools, model-monitoring, orchestration
Author: SentryML Editorial

When Embedding-Based Defenses Fail in Multi-Agent LLMs
https://sentryml.com/posts/embedding-defenses-fail-multi-agent-llm-logging/
A new arXiv paper shows that embedding-distance detectors miss three classes of adversarial agent. The fix lives in your observability stack, not your
Published: Mon, 11 May 2026 00:00:00 GMT
Tags: multi-agent, observability, llm-monitoring, agent-telemetry, drift-detection, mlops
Author: SentryML Editorial

LLM Benchmarks Explained: What the Numbers Mean and Miss
https://sentryml.com/posts/llm-benchmarks/
A practical guide to the major LLM benchmarks (MMLU, HumanEval, GPQA Diamond, SWE-bench): what they actually test, why saturation makes most scores
Published: Mon, 11 May 2026 00:00:00 GMT
Tags: llm, benchmarks, evaluation, mlops, model-selection, monitoring
Author: SentryML Editorial

LLM Fine Tuning in Production: A Practical MLOps Guide
https://sentryml.com/posts/llm-fine-tuning/
When to use LLM fine tuning over RAG, how LoRA and QLoRA cut GPU costs, and what to monitor after you ship a fine-tuned model, for ML engineers who own
Published: Mon, 11 May 2026 00:00:00 GMT
Tags: llm, fine-tuning, mlops, lora, model-drift, monitoring
Author: SentryML Editorial

Machine Learning Pipeline: Stages, Failure Points, and Monitoring
https://sentryml.com/posts/machine-learning-pipeline/
A practitioner's guide to the machine learning pipeline, from data ingestion to production monitoring, covering common failure points, drift types, and
Published: Mon, 11 May 2026 00:00:00 GMT
Tags: mlops, monitoring, drift, pipelines, data-validation, ci-cd
Author: SentryML Editorial

ML Model Deployment: A Guide to Shipping Models That Stay Healthy
https://sentryml.com/posts/ml-model-deployment/
ML model deployment fails far more often than it should, typically before the model ever serves traffic. Here's what breaks, which deployment patterns
Published: Mon, 11 May 2026 00:00:00 GMT
Tags: mlops, model-deployment, production-ml, monitoring, feature-store
Author: SentryML Editorial

MLOps Best Practices: What Keeps Models Running in Production
https://sentryml.com/posts/mlops-best-practices/
A practitioner's guide to MLOps best practices, from CI/CD pipeline automation and model versioning to drift detection and continuous retraining, based
Published: Mon, 11 May 2026 00:00:00 GMT
Tags: mlops, monitoring, drift, ci-cd, versioning, retraining
Author: SentryML Editorial

MLOps Tools: A Practitioner's Map of the Production Stack
https://sentryml.com/posts/mlops-tools/
A category-by-category breakdown of MLOps tools (experiment tracking, orchestration, feature stores, serving, and monitoring) with honest tradeoffs for
Published: Mon, 11 May 2026 00:00:00 GMT
Tags: mlops, tooling, experiment-tracking, orchestration, monitoring
Author: SentryML Editorial

Model Monitoring Tools: A Technical Comparison for ML Teams
https://sentryml.com/posts/model-monitoring-tools/
Evidently, Arize, WhyLabs, Fiddler, NannyML, Alibi Detect: how each tool actually detects drift, what it costs to run, and which one fits your stack.
Published: Mon, 11 May 2026 00:00:00 GMT
Tags: model-monitoring, drift-detection, tooling, mlops, observability
Author: SentryML Editorial

Model Monitoring in Production: What to Track and When to Act
https://sentryml.com/posts/model-monitoring/
A practical guide to model monitoring for ML engineers: drift types, the metrics that actually matter, handling the no-ground-truth problem, and which
Published: Mon, 11 May 2026 00:00:00 GMT
Tags: model-monitoring, data-drift, mlops, observability, concept-drift
Author: SentryML Editorial

OpenAI's DeployCo Pushes the Observability Problem Onto You
https://sentryml.com/posts/openai-deployco-forward-deployed-observability/
OpenAI's new $10B deployment subsidiary will build production AI systems inside enterprises. What that means for ML platform teams who inherit the runbook
Published: Mon, 11 May 2026 00:00:00 GMT
Tags: mlops, observability, drift, deployment, openai, platform-engineering
Author: SentryML Editorial

Detection Engineering for LLM Apps: A MITRE ATLAS Runbook
https://sentryml.com/posts/llm-detection-engineering-mitre-atlas-runbook/
Mapping LLM application telemetry to MITRE ATLAS techniques. Concrete log shapes, alerting heuristics, and a runbook structure that scales beyond ad-hoc
Published: Thu, 07 May 2026 00:00:00 GMT
Tags: detection-engineering, blue-team, mitre-atlas, llm-security, siem, incident-response
Author: SentryML Editorial

A Lean 4 Stability Proof for Tool-Mediated LLM Agents
https://sentryml.com/posts/lean4-stability-proof-tool-mediated-llm-agents/
A new arXiv paper certifies controllability and ISS robustness for an LLM-driven SOC agent using Lean 4. The MLOps takeaway is simpler than the math
Published: Wed, 06 May 2026 00:00:00 GMT
Tags: agents, observability, formal-methods, llm-monitoring, mlops
Author: SentryML Editorial

The Agent Authority Gap Is an Observability Problem
https://sentryml.com/posts/agent-authority-gap-observability-instrumentation/
Orchid Security's framing of agent governance as a delegation problem lands in the lap of ML observability teams.
Published: Tue, 05 May 2026 00:00:00 GMT
Tags: agent-observability, identity, mlops, opentelemetry, governance, runbook
Author: SentryML Editorial

Local Coding Assistants Crossed the Quality Bar: Now Observe Them
https://sentryml.com/posts/local-coding-assistants-quality-bar-observability/
A practitioner's Reddit report on running Qwen3.6-27B locally signals a real inflection point. But moving off managed cloud APIs shifts monitoring
Published: Sun, 03 May 2026 00:00:00 GMT
Tags: local-llm, inference, tooling, mlops, observability, serving
Author: SentryML Editorial