Most MLflow setups use about 20% of what the platform offers. Teams log metrics, compare a few training runs, maybe register a model. That’s useful for notebooks and early experimentation — but it doesn’t get you to production.
The gap between “we track experiments” and “we have a production ML pipeline” is filled with questions that experiment tracking alone can’t answer: Which model is live right now? What data trained it? Why was the previous version replaced? Can we roll back in under a minute?
This article describes an approach that uses MLflow as the central control plane for the entire ML lifecycle — not just tracking, but model promotion, dataset lineage, prompt versioning, and decision audit. The result is an architecture with fewer moving parts, not more. One system replaces what would otherwise require three or four separate tools.
Five Roles, One System
In this setup, MLflow serves five distinct roles. Each one addresses a specific production need. Together, they form a coherent system where every stage of the pipeline is traceable, auditable, and reproducible.
1. Experiment Tracking
Every pipeline stage — training, evaluation, performance testing — logs its work to a dedicated MLflow experiment. Parameters, metrics, artifacts, and system metrics (CPU, GPU, memory) are captured automatically per run. Nothing special here; this is what most teams already do. The difference is that tracking doesn’t stop after training — it extends through every subsequent stage.
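The logging step of such a stage can be sketched like this. The experiment name, run name, and metric names are illustrative assumptions, not taken from the article's repository; `log_system_metrics` requires MLflow 2.8+ with `psutil` installed (plus `pynvml` for GPU stats):

```python
import os

def log_training_run(params: dict, metrics: dict, artifact_path=None):
    """Log one pipeline stage to its dedicated experiment (sketch)."""
    import mlflow  # imported lazily so the sketch can be read and tested without MLflow

    mlflow.set_experiment("llm-finetuning-training")  # one experiment per stage
    # log_system_metrics captures CPU/GPU/memory automatically (MLflow >= 2.8)
    with mlflow.start_run(run_name="mistral-7b-lora-r16-alpha32",
                          log_system_metrics=True):
        mlflow.log_params(params)    # hyperparameters, data config, git SHA, ...
        mlflow.log_metrics(metrics)  # loss, eval scores, throughput, ...
        if artifact_path and os.path.exists(artifact_path):
            mlflow.log_artifact(artifact_path)  # e.g. adapter weights or eval report
```

The same pattern repeats in evaluation and performance-testing stages; only the experiment name and the logged values change.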
2. Model Registry and Alias-Based Promotion
The Model Registry stores versioned model artifacts. But the key mechanism isn’t versioning — it’s aliases. Aliases act as contracts between workflow stages. Each stage reads from an input alias and writes to an output alias.
This decouples the workflows. The deployment workflow doesn't need to know how training works; it only reads whatever version the candidate alias points to. The evaluation workflow doesn't need to know how deployment works; it reads from the staged alias. Each workflow operates independently, connected only through the alias contract.
This is a concrete example, not a rigid prescription. The alias chain is entirely flexible — you can add stages, remove them, or define different chains for different model types. A simpler pipeline might use just candidate → live. A more complex one might add a drift-detected alias that triggers retraining.
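The contract can be sketched in a few lines. The chain below and the model name are assumptions for illustration; `models:/<name>@<alias>` is MLflow's standard URI for resolving an alias to a concrete version:

```python
# One possible alias chain; yours may differ, as noted above.
ALIAS_CHAIN = {"candidate": "staged", "staged": "live"}

def next_alias(current: str):
    """Where a model goes after passing the gate for its current alias."""
    return ALIAS_CHAIN.get(current)

def load_for_stage(model_name: str, input_alias: str):
    """The read side of the contract: a stage knows only its input alias."""
    import mlflow  # lazy import keeps the sketch testable without MLflow installed
    # Resolves to whatever version the alias currently points at
    return mlflow.pyfunc.load_model(f"models:/{model_name}@{input_alias}")
```

Because the chain is plain data, adding a stage means adding one entry, not rewriting workflows.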
3. Dataset Lineage
Every pipeline stage that consumes data logs the exact dataset via mlflow.log_input() — with a content digest computed from the actual data, the S3 source URI, and the inferred schema. If the data changes, the digest changes. Within an experiment, you can filter runs by dataset digest to find all runs that used a specific data version.
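A lineage-logging step might look like this. `mlflow.data.from_pandas` and `mlflow.log_input` are the real MLflow APIs; the `content_digest` helper is only a stand-in to illustrate the content-addressing idea (MLflow computes its own digest internally), and the S3 URI is hypothetical:

```python
import hashlib

def content_digest(rows) -> str:
    """Illustrative stand-in: a digest that changes iff the data changes."""
    h = hashlib.sha256()
    for row in rows:
        h.update(str(row).encode("utf-8"))
    return h.hexdigest()[:8]

def log_training_data(df, source_uri: str) -> str:
    """Attach dataset lineage to the active run via mlflow.log_input (sketch)."""
    import mlflow
    import mlflow.data  # lazy imports keep the sketch testable without MLflow
    ds = mlflow.data.from_pandas(df, source=source_uri, name="train")
    mlflow.log_input(ds, context="training")  # digest, source URI, schema all logged
    return ds.digest
```

Calling `log_training_data(df, "s3://bucket/train-v3.parquet")` inside a run records exactly which bytes the stage consumed.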
The key is to differentiate runs through descriptive run names, not by splitting into ever more granular experiments. A run named mistral-7b-lora-r16-alpha32-v3-data tells you what happened; a run named run_1 does not.
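A trivial helper makes that convention mechanical rather than a matter of discipline (the naming scheme here is one suggestion, not a prescription):

```python
def descriptive_run_name(model: str, **config) -> str:
    """Build a run name that encodes what actually varied, keys sorted for stability."""
    parts = [f"{key}{value}" for key, value in sorted(config.items())]
    return "-".join([model] + parts)
```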
4. Prompt Tracking
ML pipelines increasingly depend on prompts — for training data templates, for judge instructions, for evaluation criteria. Changing a prompt changes the results, so prompts need versioning too.
There are two approaches, depending on where the prompt lives. If the prompt is embedded in the data — as with training samples that contain their instruction templates — then tracking the data is enough. If the prompt is not part of the data — as with a judge system prompt used for LLM-as-Judge evaluation — it needs to be tracked explicitly. Logging the full prompt as an artifact with a content hash stored as a searchable tag makes it queryable: you can find all evaluation runs that used a specific judge prompt version.
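A sketch of the artifact-plus-tag pattern, assuming it runs inside an active run; the tag name, artifact filename, and experiment name in the search comment are assumptions:

```python
import hashlib

def prompt_hash(prompt: str) -> str:
    """Short, stable content hash used as a searchable tag value."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]

def log_judge_prompt(prompt: str) -> str:
    """Log an out-of-band prompt: full text as artifact, hash as a queryable tag."""
    import mlflow  # lazy import keeps the sketch testable without MLflow installed
    digest = prompt_hash(prompt)
    mlflow.log_text(prompt, artifact_file="judge_prompt.txt")  # reviewable in the UI
    mlflow.set_tag("judge_prompt_hash", digest)                # filterable in search
    return digest

# Finding every evaluation run that used one specific prompt version:
# mlflow.search_runs(experiment_names=["llm-evaluation"],
#                    filter_string="tags.judge_prompt_hash = '<digest>'")
```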
5. Decision Auditability
Every quality gate logs its decision as a structured MLflow run: the metrics it evaluated, the thresholds it applied, whether it passed or failed, and the full results as artifacts. This makes every promotion decision auditable after the fact. Not through documentation that someone wrote — through data that the pipeline generated automatically.
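A gate's decision record might be structured like this. The experiment name, run name, and the pass-if-above-threshold rule are illustrative assumptions; the point is that the decision, its inputs, and its criteria land in one queryable run:

```python
def evaluate_gate(metrics: dict, thresholds: dict) -> bool:
    """Pass only if every gated metric meets its threshold."""
    return all(metrics.get(name, float("-inf")) >= limit
               for name, limit in thresholds.items())

def log_gate_decision(metrics: dict, thresholds: dict) -> bool:
    """Record a quality-gate decision as a structured MLflow run (sketch)."""
    import mlflow  # lazy import keeps the sketch testable without MLflow installed
    passed = evaluate_gate(metrics, thresholds)
    mlflow.set_experiment("quality-gate")
    with mlflow.start_run(run_name="gate-candidate-to-staged"):
        mlflow.log_metrics(metrics)
        mlflow.log_params({f"threshold_{k}": v for k, v in thresholds.items()})
        mlflow.set_tag("gate_passed", str(passed))
        # Full decision record as an artifact for after-the-fact audits
        mlflow.log_dict({"metrics": metrics, "thresholds": thresholds,
                         "passed": passed}, "decision.json")
    return passed
```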
The Alias Chain as a Promotion Contract
The alias mechanism deserves a closer look, because it’s the architectural core that holds everything together.
With alias-based promotion, the coupling is minimal. Each stage only knows two things: which alias to read from, and which alias to write to. This means:
Workflows are reusable. A promotion workflow that moves a model from one alias to another can be used for initial deployment, for promotion after evaluation, and for rollback — by simply passing different alias parameters.
Stages are independently deployable. You can change how training works without touching deployment. You can add a new quality gate without modifying existing stages.
Rollback is trivial. The previous alias always points to the last known-good version. Rolling back means running the same promotion workflow with --source-alias=previous. No retraining, no re-evaluation, no manual artifact hunting.
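The three properties above can be sketched as a single parameterized workflow. The model name and the `previous` bookkeeping alias are assumptions; `get_model_version_by_alias` and `set_registered_model_alias` are the real MlflowClient APIs:

```python
def run_promotion(model_name: str, source_alias: str, target_alias: str,
                  keep_previous: bool = True) -> None:
    """One workflow for deployment, promotion, and rollback; only the aliases differ."""
    from mlflow import MlflowClient  # lazy import keeps the sketch testable standalone
    client = MlflowClient()
    version = client.get_model_version_by_alias(model_name, source_alias).version
    if keep_previous:
        try:
            # Remember the current target so rollback has something to point back to
            current = client.get_model_version_by_alias(model_name, target_alias).version
            client.set_registered_model_alias(model_name, "previous", current)
        except Exception:
            pass  # first promotion: the target alias does not exist yet
    client.set_registered_model_alias(model_name, target_alias, version)

# Promotion:  run_promotion("mistral-7b-lora", "staged", "live")
# Rollback:   run_promotion("mistral-7b-lora", "previous", "live", keep_previous=False)
```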
Reducing Complexity, Not Adding It
The counterintuitive insight is that using MLflow for five roles doesn’t make the architecture more complex — it makes it simpler. Here’s what you don’t need when MLflow covers these roles:
You don’t need a separate promotion database to track which model is live. You don’t need an external artifact store management layer. You don’t need a separate audit log for deployment decisions. You don’t need a custom lineage tracking system.
Each of these would be a separate system to deploy, maintain, secure, and keep in sync. With MLflow as the control plane, they’re all queryable from one API, visible in one UI, and backed by one database.
Model- and Platform-Agnostic
Nothing in this approach is tied to a specific model type, framework, or serving platform. The same patterns — experiments, aliases, dataset lineage, decision logging — work across fundamentally different pipelines. In the implementation this article is based on, the same MLflow instance manages both a computer vision pipeline (ResNet-18 on Triton Inference Server) and an LLM fine-tuning pipeline (Mistral-7B LoRA adapters on vLLM). The models are different, the serving stacks are different, but the contract is the same: register a versioned artifact, advance it through aliases, log every decision.
See It in Practice
This approach is implemented in a public repository with two fully working pipelines on Kubernetes — including training, deployment, evaluation, performance testing, and promotion workflows:
The LLM pipeline is additionally documented in a 10-part blog series (German) that covers the complete journey from first deployment to production-grade evaluation:
→ Self-Hosted LLMs für Datensouveränität
I’m an ML Engineer specializing in self-hosted ML infrastructure and data sovereignty. I help organizations in the DACH region run ML and AI systems on their own infrastructure — with the MLOps foundation to make it production-ready. Open to interesting projects and conversations — feel free to reach out.