Your Platform Should Be a Control Plane

[Image: Crossplane hero image. Source: github.com/crossplane/artwork]

Introduction
Crossplane has been a large part of my daily work for close to a year now. Throughout this time, my team has been building out a platform that we are really proud of. I’ve had the pleasure of doing a lot of deep dives into the project, its inner workings and the different ways it can be applied. Before we get too technical, it is worth taking a step back to look at the thought process behind it. Control planes are a deep enough topic that the why deserves its own moment before we get to the how.

There is a moment a lot of teams may recognise. You have invested a lot of time into building a library of reusable Terraform modules, carefully parameterised, peer-reviewed, versioned. The golden path is laid. Developers declare what they want, pipelines run, infrastructure appears. Life is good.

Then, six months later, configuration starts to drift. Teams are pinning older module versions. Some role assignments get applied outside of your IaC. An incident happens and a hotfix is applied, but never merged back into main. The diff gets overlooked. A new compliance requirement lands, and you have to work out which database was provisioned with which version of the module. Getting everything aligned after drift is a struggle.

The ground truth is no longer in code. Your platform did not fail. It just never had a way to notice.

The Usual Paths

There are two very common approaches.

The first is the shared module library: Terraform, Bicep or Helm is used to codify reusable templates. Teams consume these modules from a registry, pin to a version, and provision resources through their pipelines.

The second is the custom provisioning API: a bespoke service, often written in Go, Python, C# or Java — sometimes using Pulumi as the underlying provisioning engine — that exposes endpoints for creating infrastructure. When called, it runs the necessary provisioning logic and returns. It feels like building a product. It often becomes a maintenance burden instead.

You can start with a library and move to a more API-centric model later on. You can stick to one. Both approaches have delivered real platforms for real organisations, and a skilled team with a healthy DevOps culture can get significant mileage out of either. The ceiling is real, though. The cracks appear at a specific point: when teams multiply, compliance requirements evolve frequently, and the coordination overhead of keeping everything aligned starts to grow faster than the team maintaining it.

The Shared Library

The appeal is genuine: modules are portable, reviewable, version-controlled, and composable. Wrapping infrastructure into a module that teams can call from their own pipelines is a low-ceremony way to raise the floor.

The friction surfaces when the interface has to do more work than a function call was designed to do. When your platform is a library, you are implicitly promising to handle every edge case through configuration. The pull is irresistible. Each new ticket feels small. Each new variable feels reasonable. Gradually, the interface between your platform and your consumers becomes a sprawling list of infrastructure-shaped knobs — subnet delegations, managed identity resource IDs and their permissions, diagnostic settings, private endpoint configurations.

What looked like abstraction is actually delegation. The consumer is still thinking in cloud primitives. They just write slightly less of it.

There is a second cost that is harder to see but very easy to feel on a Monday morning. When your platform is a module library, you become its support team. Developers file issues when a module does not cover their edge case. They open PRs to expose parameters you deliberately left out. They pin to old versions because upgrading breaks their variable names, so they defer it indefinitely. You spend your sprint reviewing module PRs, triaging compatibility questions, and bridging the gap between what the module was designed to do and what teams are actually trying to do with it. The abstraction created a support contract.

The practical consequence shows up six months later: twelve teams, twelve slightly different database configurations, twelve subtly divergent security postures, and a platform team that owns the maintenance burden of all of them. The module was a packaging mechanism, not an API. A module with twenty input variables is still twenty input variables — wrapped in a function call, but still forcing every caller to reason about the full parameter space and make decisions that should not be theirs to make.

But the deeper problem is not the interface. It is the execution model.

When a module is consumed, it is baked into a repository at a point in time. That version is what gets provisioned, and that version is what stays running. If you discover a security misconfiguration in your networking module, or want to enforce a new tagging standard across all storage resources, you now face a task that scales with the number of teams using your platform. You need to bump the module version, communicate the change, and wait for every team to run their pipeline on their own schedule. Some will. Many will not — not out of negligence, but because change freezes exist, upgrading a module version is never urgent enough to prioritise, and teams have their own roadmaps.

The result is that your organisation’s infrastructure drifts into a patchwork of versions, states, and interpretations. Diffs emerge. Someone made a change in the portal. Someone added a resource manually and never committed it. The ground truth is no longer in code.

It is fair to note that Terraform’s ecosystem has come a long way in addressing some of this. A workflow backed by Spacelift, Atlantis, or Terraform Cloud can automate applies, enforce policy, and reduce the coordination burden meaningfully. Drift detection tooling can flag when the real world diverges from state. These are real improvements. The structural gap that remains is continuous reconciliation — the world staying aligned with the declared state automatically, without a pipeline being triggered. That is the property the module library model does not provide, and it is the one that matters most at scale.

The Custom API

Building a bespoke provisioning API feels like the more serious engineering answer. You own the contract, you can version it, and you can change the underlying provisioning logic without touching every consumer.

In practice, it carries its own structural burden. These APIs are typically stateless by design: they provision infrastructure, and then they walk away. If infrastructure drifts, and it always does, the API is not watching. There is no feedback loop. The API does not know that someone resized a database in the portal, changed the backup retention, or manually assigned a role outside of the declared configuration. It provisioned the desired state once. Whether that state still exists is a different question.

Keeping the real world in sync with the declared world requires continuous observation. A provisioning API that runs once and exits cannot do that.

What fills the gap in practice is a person. Someone checks the portal against the last known state. Someone responds to tickets that say “this resource looks different from what we deployed.” Someone runs periodic audits, comparing what should exist against what actually exists, deciding what needs to be re-applied and in which order. The provisioning API did not eliminate the reconciliation problem. It delegated it to a human.

Some teams address this with an administrative endpoint on the API itself — a reconcile route that queries the real world, compares it against the last-known desired state, and applies corrections where it finds differences. This is a genuine improvement over leaving the comparison entirely to a person. But it is still a human-triggered operation. Someone has to decide to call it, on what cadence, and act on what it finds. The continuous loop — always watching, correcting without being asked — is the property the stateless model cannot replicate.

What a Control Plane Does Differently

Both problems point toward the same missing ingredient: continuous reconciliation.

If you have been running Kubernetes for any length of time, you already depend on this property every day without thinking much about it. Consider what happens when you delete a Pod that belongs to a Deployment. Within seconds, Kubernetes creates a replacement — not because you asked it to, but because the Deployment controller is always watching. It continuously compares the actual state of the cluster against the desired state you declared. When those diverge, it acts. It does not wait for a pipeline. It does not send you an alert and wait for a human to respond. It corrects, immediately and automatically.

That is a control plane. And that is what is missing from both the module library and the custom provisioning API.

The defining characteristic of a control plane is not how it creates infrastructure. It is how it maintains it. A control plane holds a declared desired state and continuously compares it against the actual state of the world. When they diverge — because someone made a manual change, because a resource was deleted, because a dependency changed — the control plane notices, and it corrects. Not once. Continuously.

This is sometimes called self-healing. The thermostat is the classic metaphor: you set a temperature, and the system maintains it indefinitely, regardless of what the environment does. You are not running a pipeline. You are declaring intent, and a running system is responsible for honouring it.

This changes the operational posture fundamentally. Infrastructure is no longer a snapshot of the last pipeline run. It is a live reflection of a declared state. Drift is detected and corrected automatically. The platform team does not need to audit every team’s last run date.
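
The compare-and-correct loop described in this section can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Kubernetes or Crossplane implement their controllers: the `State` dictionaries, the `reconcile` function and the `orders-db` resource are all hypothetical, and the sketch assumes the control plane owns everything it sees in `actual`.

```python
from typing import Dict, List

State = Dict[str, dict]  # resource name -> settings (hypothetical model)


def reconcile(desired: State, actual: State) -> List[str]:
    """Run one reconciliation pass and return the corrective actions taken."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actual[name] = dict(spec)      # resource is missing: recreate it
            actions.append(f"create {name}")
        elif actual[name] != spec:
            actual[name] = dict(spec)      # resource has drifted: reset it
            actions.append(f"update {name}")
    for name in list(actual):
        if name not in desired:
            del actual[name]               # no longer declared: remove it
            actions.append(f"delete {name}")
    return actions


desired = {"orders-db": {"tier": "standard", "backup_days": 7}}
actual: State = {}

first_pass = reconcile(desired, actual)    # initial provisioning
actual["orders-db"]["backup_days"] = 1     # simulated manual change in the portal
second_pass = reconcile(desired, actual)   # drift correction
third_pass = reconcile(desired, actual)    # nothing left to do
```

The first pass creates the resource, the second reverts the manual change, and the third finds nothing to correct. A real control plane runs this kind of pass on a timer or in response to events, which is what turns a one-off comparison into continuous reconciliation.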

The Right API for the Right Audience

Control planes also force a clarifying question: what does the consumer actually need to express?

A developer requesting a database for their application does not need to think about subnet delegations, server configurations, backup retention windows, or customer-managed key assignments. Those are platform concerns. They are important — often critically so — but they belong to the people who understand the organisation’s security posture, cost model, and compliance requirements.

What a developer needs to express is intent: “I need a Postgres database for the production environment, with high availability.” That is a fundamentally different vocabulary from infrastructure configuration. It speaks in business terms — environment context, availability requirements — not cloud primitives. There are no SKU names. No retention periods. No network topology. Those details are not missing; they are deliberately out of scope for the consumer.

This distinction matters more than it might appear. When a consumer expresses intent, they are saying something that remains true regardless of what the underlying infrastructure looks like. Declarations such as environment: production and availability: high are as meaningful today as they will be when the platform migrates to a new database generation, a new region, or a new cloud provider entirely. The intent does not change. The implementation does. And because the two are separated, the platform team can evolve the implementation freely (improving it, optimising it, securing it) without asking a single development team to update their configuration.

Compare this to a module parameter like high_availability.mode = "ZoneRedundant". That is not intent. That is an implementation detail that has leaked into the consumer’s configuration and will stay there forever, because it is now their responsibility. If the platform team wants to change the HA strategy, say by moving from a zone-redundant standby to a read replica model, they cannot do so on their own. They have to negotiate with every team that has set that parameter.

A well-designed control plane API exposes only the intent surface. The implementation — all the cloud-specific wiring that turns that intent into a running, compliant, cost-optimised database — lives entirely in the platform. The consumer declares what they want. The platform owns how it is built, and how it continues to be built as requirements evolve.

Crucially, the consumer never sees the how. If the platform team migrates to customer-managed keys, changes the backup strategy, or tightens the RBAC model, none of that is visible in the consumer’s configuration. Their declaration stays the same. The platform implementation changes. The control plane reconciles the difference across every instance of that resource in the organisation — automatically, continuously, without a single pipeline being triggered manually.
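
As a rough illustration of that split, the sketch below separates what the consumer declares from what the platform derives. It assumes a hypothetical `DatabaseClaim` schema and a hypothetical `compose` function; the derived values are invented placeholders, not real SKUs or recommended settings, and none of this is Crossplane's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseClaim:
    """What the consumer declares: intent only (hypothetical schema)."""
    environment: str   # "production" or "development"
    availability: str  # "high" or "standard"


def compose(claim: DatabaseClaim) -> dict:
    """Platform-owned translation from intent to implementation details.

    Every value below is an invented placeholder, not a real SKU or setting.
    """
    production = claim.environment == "production"
    return {
        "sku": "large" if production else "small",
        "backup_retention_days": 35 if production else 7,
        "ha_mode": "zone-redundant" if claim.availability == "high" else "none",
        "private_endpoint": True,
    }


claim = DatabaseClaim(environment="production", availability="high")
resource = compose(claim)
```

The consumer only ever writes the two fields on `DatabaseClaim`. If the platform team later changes what `compose` returns for `availability="high"`, the claim itself does not need to change.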

Reconciliation Owns the Rollout

This is the capability that separates a control plane from a module library most sharply.

When you update a Terraform module and bump its version, nothing in production changes. Existing infrastructure is frozen at the version it was last provisioned with. Adopting the new version requires every consuming team to update their reference, plan, and apply. For a change that touches hundreds of repositories, this is a coordination problem masquerading as a technical problem.

When you update a composition in a control plane, every resource that uses it is brought into conformance with the new definition on the next reconciliation cycle. There is no coordination burden. There is no version adoption lag. The platform team makes one change, and the organisation moves together.

This is how the cloud providers themselves operate. Azure does not ask you to recreate your storage accounts to pick up a new feature. The providers operate control planes, and so should you.
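
The rollout behaviour described in this section can be sketched as follows. This is a simplified model, not Crossplane's implementation: `composition_v1`, `composition_v2`, `reconcile_all` and the `tls_min_version` setting are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict

Composition = Callable[[dict], dict]


def composition_v1(claim: dict) -> dict:
    return {"env": claim["environment"], "tls_min_version": "1.0"}


def composition_v2(claim: dict) -> dict:
    # The platform team tightens one setting, in one place.
    return {"env": claim["environment"], "tls_min_version": "1.2"}


def reconcile_all(claims: Dict[str, dict], composition: Composition,
                  actual: Dict[str, dict]) -> int:
    """Bring every claimed resource in line with the current composition.

    Returns how many resources had to be changed in this cycle.
    """
    changed = 0
    for name, claim in claims.items():
        wanted = composition(claim)
        if actual.get(name) != wanted:
            actual[name] = wanted
            changed += 1
    return changed


claims = {f"team-{i}-db": {"environment": "production"} for i in range(12)}
actual: Dict[str, dict] = {}

reconcile_all(claims, composition_v1, actual)                # initial state
rolled_out = reconcile_all(claims, composition_v2, actual)   # cycle after the update
settled = reconcile_all(claims, composition_v2, actual)      # later cycles are no-ops
```

None of the twelve claims changed between the second and third call. Only the composition did, and the next cycle carried every resource along with it.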

Eventually Consistent, Not Eventually Wrong

There is a term worth borrowing from distributed systems here: eventual consistency. A control plane does not guarantee that your infrastructure matches its desired state at every instant. But it guarantees that given enough time and no conflicting changes, the real world will converge on what you declared.

This is a stronger guarantee than a pipeline gives you. A pipeline gives you a snapshot of correctness at the moment it ran. A control plane gives you a continuously enforced contract. It is the difference between a photograph and a video.

It is worth being direct about one implication that compliance teams will raise: eventual consistency means there is a window — typically seconds to minutes, depending on reconciliation frequency — between when drift occurs and when it is corrected. A malicious or accidental change will exist briefly before the control plane acts. In high-compliance environments, that window needs to be auditable. Control planes address this through event streams and audit logs that record what changed, when, and what the reconciliation response was. The gap is short and observable — but it is real, and worth planning for.
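
One way to make that window auditable is to record, for each correction, when the change happened and when it was reverted, then check the gap against an agreed budget. The sketch below assumes a hypothetical `DriftEvent` record and invented sample values; it does not reflect the event format of any particular control plane.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DriftEvent:
    """One audit record (hypothetical shape, not a Crossplane type)."""
    resource: str
    changed_field: str
    changed_at: float     # seconds, when the out-of-band change was made
    corrected_at: float   # seconds, when reconciliation reverted it

    @property
    def window_seconds(self) -> float:
        return self.corrected_at - self.changed_at


def exceeding(events: List[DriftEvent], budget_seconds: float) -> List[DriftEvent]:
    """Return the events whose drift window was longer than the agreed budget."""
    return [e for e in events if e.window_seconds > budget_seconds]


# Invented sample records, for illustration only.
events = [
    DriftEvent("orders-db", "backup_retention_days", 100.0, 130.0),
    DriftEvent("orders-db", "public_network_access", 500.0, 1400.0),
]
flagged = exceeding(events, budget_seconds=300.0)
```

With a five-minute budget, only the second record is flagged. The budget itself is a policy decision for the compliance team, not something the sketch can choose.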

Eventual consistency also changes how you think about dependencies between resources. When you declare a database that needs to live inside a particular network, and that network has not finished provisioning yet, a pipeline would fail and require a retry. A control plane simply waits. The database resource sits in a pending state, retrying on each reconciliation cycle until the network is available, and then proceeds. You do not need to sequence your declarations or manage ordering scripts. You declare the full desired state at once, and the control plane figures out the path to get there. Initial provisioning and ongoing drift correction are the same loop, behaving identically in both situations.
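
The waiting behaviour can be sketched with a small dependency map. This is a toy model under stated assumptions: the resource names and the `reconcile_once` function are hypothetical, and each cycle is assumed to see only what was ready when the cycle started.

```python
from typing import Dict, List, Set

# Hypothetical declarations, deliberately listed in the "wrong" order.
declared: Dict[str, List[str]] = {
    "database": ["network"],
    "network": [],
    "role-assignment": ["database"],
}


def reconcile_once(declared: Dict[str, List[str]], ready: Set[str]) -> List[str]:
    """One cycle: provision whatever has all of its dependencies ready.

    Resources with unmet dependencies are skipped and retried next cycle.
    """
    # Decide against the state at the start of the cycle, so a resource
    # created now only unblocks its dependents in the following cycle.
    snapshot = set(ready)
    created = [
        name for name, deps in declared.items()
        if name not in snapshot and all(dep in snapshot for dep in deps)
    ]
    ready.update(created)
    return created


ready: Set[str] = set()
history: List[List[str]] = []
# Bounded loop so a circular dependency could not spin forever.
while len(ready) < len(declared) and len(history) < 10:
    history.append(reconcile_once(declared, ready))
```

Nothing in `declared` states an order, and nothing fails when the database is seen before its network. The database is simply skipped until a later cycle finds the network ready.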

Platform-Level Optimisations

There is a dimension of control plane thinking that goes beyond correctness and consistency. A platform that understands the structure of your organisation can make intelligent decisions that no individual consumer could.

Consider static website hosting. A common pattern is for each frontend application to get its own Azure Storage Account with its own private endpoint and its own CDN configuration behind a gateway. From a consumer’s perspective, this is clean. Each team owns their resources. Boundaries are clear. But from the platform’s perspective, it is wasteful. You are paying for dozens of private endpoints that could be consolidated. You are managing dozens of storage accounts that differ in nothing except their name.

A control plane architecture can do better. Because the platform mediates every resource request, it has visibility across all consumers. A design that takes advantage of this can reason about which teams are requesting the same class of resource, which of those resources could be safely collocated, and at what level a shared model makes sense — per domain, per solution, per application tier. Rather than blindly provisioning an isolated resource for every request, the platform can route multiple consumers to the same underlying storage account behind the same private endpoint, while still exposing a clean, isolated interface to each consumer. From the consumer’s perspective, nothing changes. They declared their intent. The platform decided the most efficient way to honour it.
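
A minimal sketch of such a placement decision, assuming sharing is done per domain with a fixed number of sites per storage account. The `place` function, the site and domain names and the naming scheme are all hypothetical; a real design would also weigh workload profiles and isolation requirements.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def place(requests: List[Tuple[str, str]], max_per_account: int) -> Dict[str, str]:
    """Assign each site to a shared storage account within its domain.

    requests: (site_name, domain) pairs, both hypothetical.
    Returns a mapping of site_name to storage account name.
    """
    per_domain: Dict[str, int] = defaultdict(int)
    placement = {}
    for site, domain in requests:
        index = per_domain[domain] // max_per_account  # next account once one is full
        placement[site] = f"{domain}-static-{index}"
        per_domain[domain] += 1
    return placement


requests = [
    ("checkout-ui", "shop"), ("catalog-ui", "shop"), ("cart-ui", "shop"),
    ("admin-ui", "backoffice"),
]
placement = place(requests, max_per_account=2)
accounts = set(placement.values())
```

Four requests end up on three accounts instead of four, and the consumers never see the difference. Setting `max_per_account=1` restores the fully isolated model without any change to the requests.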

There is a genuine trade-off to name here. Consolidating consumers onto shared resources introduces blast radius — if one team’s workload stresses the shared storage account, it affects everyone collocated on it. Debugging becomes less obvious when a resource is not exclusively owned. This strategy is worth pursuing where the cost savings are significant and the workload profiles are predictable; it is not a universal win. The point is not that the platform should always consolidate, but that the design can reason about the decision centrally, rather than each team making it in isolation.

What makes this powerful is the same property that makes compliance rollouts powerful. When the platform team decides to move from an isolated model to a shared one — or back again, or to something more nuanced — the change lives in the composition. It does not require a conversation with every team, a migration plan per application, or a careful sequencing of Terraform runs. The composition changes. The control plane reconciles. Every consumer converges on the new model, without filing a ticket or touching their configuration.

In a shared module library, this kind of optimisation is structurally difficult. Each team’s module invocation is self-contained. The module has no awareness of what other teams have provisioned. Making a shared resource strategy work would require coordination at the caller level — agreed naming conventions, pre-provisioned shared resources that modules reference, and careful documentation of the contract between them. It is possible, but it is friction. In a control plane, it is just another decision encoded in a composition.

Developer Velocity at Organisational Scale

The optimisation story is compelling on its own. But the same principle extends further: what happens when the platform has enough knowledge to bootstrap an entire application, end to end.

When a developer requests a new application through an internal developer portal, a well-built control plane already knows what it needs to provision. It knows what a production-ready landing zone looks like — the resource groups, the Kubernetes namespaces, the RBAC assignments, the initial environment topology. Rather than a platform engineer stitching this together manually, or a developer waiting days for a ticket, the control plane responds: namespaces and resource groups appear, role assignments are provisioned, a repository is created from the right template.

This is the natural consequence of encoding your organisation’s opinions into a running system rather than a document. The same knowledge that enforces a consistent security posture across your databases is the knowledge that knows how to initialise a new application correctly. Getting there requires investment — well-authored compositions, a portal integration, real alignment on what a good landing zone looks like. But once that work is done, the platform does not need to be asked twice. And because the control plane continues watching after provisioning, the application’s landing zone does not drift the moment attention moves elsewhere. The velocity is sustained, not just one-time.

What the Control Plane Does Not Own

Before committing to this model, it is worth being precise about scope — because a control plane is not a replacement for all infrastructure as code.

The control plane is the right model for the resources that belong to an application’s landing zone: the databases, storage accounts, message queues, Key Vaults, role assignments, and networking components that each application team needs provisioned, maintained, and kept in conformance. These are the resources that vary per application, drift under load, and benefit most from continuous reconciliation and an intent-based API.

But some infrastructure sits below or beside that layer. The Kubernetes clusters that host the workloads. The shared API gateway that fronts multiple applications. The central firewall, the hub network, the DNS zones. These resources are organisational in scope — not owned by any single application team, not provisioned on demand, and not the kind of thing you want a self-healing loop adjusting without deliberate human intent. For these, traditional infrastructure as code — Terraform, Bicep, whatever your organisation already uses — is entirely appropriate. Apply once. Change deliberately. Treat them as the stable foundation the control plane builds on top of.

There is also an operational dependency worth naming explicitly: Crossplane itself runs inside a Kubernetes cluster, which means that cluster becomes foundational infrastructure. It must be highly available, monitored, and kept up to date. If it is unavailable, reconciliation pauses. For organisations already running Kubernetes, the incremental cost is modest. For those that are not, it is a real prerequisite to plan for — not a reason to avoid the model, but something to factor into the adoption path.

What makes this interesting is that the boundary between “provisioned by IaC” and “managed by the control plane” does not have to be a hard wall. A Kubernetes cluster can be provisioned and maintained by your infrastructure team using a Terraform module, and that same cluster can be the target the control plane deploys application workloads onto. An API gateway can be stood up once by the cloud team, while the control plane is responsible for registering each application’s APIs into it. The shared resource is owned by one layer; the application-specific configuration on top of it is owned by another. Both layers do what they are good at.

The principle is consistent: the control plane owns what is dynamic, per-application, and prone to drift. Everything that is static, shared, or foundational can stay in the hands of the team that understands it best.

You Don’t Have to Build It From Scratch

If the problems described above sound familiar, the encouraging part is that you do not need to build a solution from scratch.

The alternative to shared module libraries and custom provisioning APIs is to build your platform on an existing control plane foundation, one that already handles continuous reconciliation, declarative state management, drift correction, and dependency resolution. What you bring is your organisation’s specific opinions: what a “production database” means for you, what a “highly available API” or “event-driven service” looks like in your environment, what compliance controls every workload must carry. That separation is exactly what makes this approach maintainable at scale: the machinery handles the hard part, and your team focuses on encoding what good looks like, knowing that once it is declared, the control plane will enforce it everywhere, always.

This is the architecture we will build in this series. The tool that makes it practical on top of Kubernetes is Crossplane, an open-source CNCF project that turns a Kubernetes cluster into a control plane for any infrastructure. Crossplane is the right foundation because it does not ask you to write a reconciliation engine. It provides one, along with provider integrations for the major cloud platforms and, beyond that, for anything that exposes an API: databases, SaaS platforms, internal systems, or any other control plane your organisation already depends on. You encode your organisation’s opinions. Crossplane enforces them continuously. You provide the intent. The control plane does the rest.

Crossplane does come with a real learning curve. Composite resource definitions, compositions, provider configurations — there is a vocabulary to learn before the pieces click. The next post in this series covers that foundation directly: what Crossplane is, how it models infrastructure, and how to think about it before writing any compositions.

The result is a platform that behaves like the cloud itself: always watching, always correcting, always converging — with an API your developers actually want to use.