From YAML to Typed Manifests with KCL

[Figure: Crossplane hero image. Source: github.com/crossplane/artwork]

Introduction
The previous post took the composition pipeline apart: how functions read observed state, how the identity annotations survive across reconciliations, and how an EnvironmentConfig feeds shared infrastructure references into the render. It closed on a confession (go-templating is stringly-typed) and a teaser that my team was moving to KCL. This post is the first half of that move: the layer underneath function-kcl, the typed schemas and offline rendering that produce the EnvironmentConfigs and ClusterProviderConfigs the compositions consume. The pipeline function itself comes later in the series. Everything here has a working example at github.com/ToonVanDeuren/crossplane-typed-environments.

Recommendation
Before continuing, check out my previous post: Inside the Composition Pipeline.

Disclosure: Parts of this post were refined with artificial intelligence using Claude (Opus 4.8, Sonnet 4.6). The opinions, thoughts, architectural choices, and the failures I learned from are my own. It was mainly used as a way of efficiently structuring my thoughts.

I will admit something up front: I can be messy, disorganised, and more than a little forgetful. For a long time that turned managing a fleet of environments into a real chore, and keeping the YAML coordinated across all of them was, on a bad day, a nightmare.

I would drop a property or mistype its name, and nothing would catch it. There was no validation to fail and no build to break, so I would end up with a manifest that looked fine and wasn’t. Sometimes it broke at deploy time. The worse cases broke at runtime, long after the change had shipped and the context had drained out of my head. This happened more often than I would like to admit.

So the team and I changed how we work. We type-check our configuration now, with real layers of validation and checks, so a fat-fingered key cannot quietly rot a manifest again. The shape is shared: every environment is an instance of one type, so you cannot silently add, drop, or rename a field in just one of them. Change the contract and they all have to satisfy it. KCL is what made that possible. Defining schemas gave us type-safety, plus a place to keep testing and validation right next to the thing being validated.

Two failure modes pushed us there. The first was drift across environments that survives review: a typo, a missing key, a stale field that lingers in acceptance after someone cleaned it out of production. The diff is small and plausible, the reviewer is tired, and it ships. The second was the one the previous post ended on: the go-templating compositions are stringly-typed, so a typo in a field name renders cleanly and dies at admission minutes later, in someone else’s namespace.

When every environment is the same typed schema rendered from a single contract, drift between environments stops being a discipline problem and becomes a render-time error.

Two Layers of KCL

KCL plays two distinct roles in our platform, and the line between them is drawn by one question: what is known when?

The first role is offline rendering, which is the subject of this post. The EnvironmentConfigs and ClusterProviderConfigs a composition reads are bootstrap data. They are known before any composite resource exists, and they are the same across every reconciliation, because the server identity for an environment does not change when a developer creates a database in their namespace. So they render once, at deploy time, from typed input into the YAML the cluster applies.

The second role is inline rendering: function-kcl inside the composition pipeline, replacing go-templating. That layer renders on every reconciliation, against live observed state, because what it produces depends on what already exists in the cluster and the cloud. The lifecycle and the tooling differ, but both layers speak the same language, and that shared language is much of the appeal.

The split is not arbitrary, and it shows up in the repository layout. The top level divides by what you author versus what the cluster applies:

kcl/             the source you edit
  schemas/         the types: the contract and the rendered shapes
  values.k         the three environments
  render/          one renderer per resource
manifests/
  static/          hand-written, applied as-is (providers, XRDs, compositions)
  generated/       rendered from kcl/, never hand-edited

manifests/static/ holds the things a human writes once: the Azure providers, and the XRD plus composition for each workload. The PostgresDatabase set is the one from the previous post, reused verbatim. Alongside it sits a RedisInstance set, a second workload added here specifically to show the pattern generalises beyond a single resource. manifests/generated/ holds the rendered output, one directory per environment, and nobody edits it by hand. If it is wrong, you fix the source and render again.

The render itself is a one-liner, the kind of thing a CI step runs:

kcl run -D env=production kcl/render/postgres.k

There is one renderer per resource (render/providerconfig.k, render/postgres.k, render/redis.k), and each emits exactly one document. No Kustomize, no overlays, no patch graph to reason about. The directories are applied directly. What’s new in this post isn’t the compositions. Postgres carries over from the previous post, and redis is its twin. What moved is where their per-environment input comes from: a single typed schema instead of hand-written YAML.

Schemas as Types

With KCL you stop writing YAML and start describing its shape. You no longer author the EnvironmentConfig directly. You author the type it has to conform to, plus the values for one environment, and KCL renders the rest. The YAML ends up as a build artifact, and the schema is the thing you actually maintain.

Here is the contract every environment satisfies, kcl/schemas/environment.k:

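import regex  # needed by the check block; the base and pg schema imports are omitted in this excerpt

schema Environment: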
    name: "integration" | "acceptance" | "production"
    subscriptionId: str
    resourceGroup: str
    location: str
    tags: base.Tags
    postgres: pg.PostgresInput

    check:
        regex.match(subscriptionId, r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"), "subscriptionId must be a UUID"
        regex.match(resourceGroup, r"^rg-[a-z0-9-]+$"), "resourceGroup must start with rg- and be kebab-case"
        regex.match(location, r"^[a-z][a-z0-9]+$"), "location must be an Azure region like westeurope"

The name field is a literal-union type: the only legal values are the three strings listed. That set mirrors the spec.environment enum in each XRD, so both ends of the platform (the typed config and the developer-facing API) enforce the same closed set. Add a fourth environment in one place and forget the other, and the build tells you before anything reaches a cluster.

The tags field is a required sub-schema, not a free {str:str} map. Tags (in base.k) demands owner, costCenter, and a dataClassification that is itself a union of internal | confidential. An environment that does not declare its governance tags does not render, so there is no path to an unclassified resource.
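As a rough sketch of what such a tags contract could look like: the field names come from this post, but the actual base.k may differ, and the two optional tenancy attributes are an assumption inferred from the tenancy check discussed further down.

```kcl
# Hypothetical reconstruction of the Tags schema; the real kcl/schemas/base.k may differ.
schema Tags:
    owner: str
    costCenter: str
    # Literal union: only these two strings are accepted.
    dataClassification: "internal" | "confidential"
    # Tenancy markers, assumed optional here because each renderer sets only one of them.
    dedicatedTo?: str
    sharedAcross?: str

# A valid instance. Leaving out owner, or writing dataClassification = "public",
# fails evaluation instead of rendering an unclassified resource.
productionTags = Tags {
    owner = "platform-team"
    costCenter = "engineering-platform"
    dataClassification = "confidential"
}
```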

The postgres: pg.PostgresInput field is the per-service input. Rather than piling postgres fields onto the top level, the input mirrors the per-service split of the rendered data: a service that needs per-environment input gets a typed field here, and one that does not, does not. Redis is the second case, deriving everything it needs from the shared fields and its own defaults, so it has no field on Environment at all.

An environment that satisfies this is mostly values, almost no structure:

_production = s.Environment {
    name = "production"
    subscriptionId = _prodSubscriptionId
    resourceGroup = "rg-platform-${name}"
    location = _region
    tags = _baseTags | {dataClassification = _confidentialClassification}
    postgres = {
        serverName = "psql-shared-${name}"
        adminUser = "psql-admin"
    }
}

Run the renderer and you get the EnvironmentConfig the previous post’s composition selects by label:

kcl run -D env=production kcl/render/postgres.k

apiVersion: apiextensions.crossplane.io/v1beta1
kind: EnvironmentConfig
metadata:
  name: postgres-production
  labels:
    scope: shared
    service: postgres
    environment: production
data:
  tags:
    owner: platform-team
    costCenter: engineering-platform
    dataClassification: confidential
    sharedAcross: environments
  serverId: /subscriptions/22222222-2222-2222-2222-222222222222/resourceGroups/rg-platform-production/providers/Microsoft.DBforPostgreSQL/flexibleServers/psql-shared-production
  serverFqdn: psql-shared-production.postgres.database.azure.com
  adminUser: psql-admin
  charset: UTF8
  collation: en_US.utf8

Neither serverId nor serverFqdn is stored anywhere; both are interpolated from the typed inputs at render time, in render/postgres.k:

serverId = "/subscriptions/${_e.subscriptionId}/resourceGroups/${_e.resourceGroup}/providers/Microsoft.DBforPostgreSQL/flexibleServers/${_e.postgres.serverName}"
serverFqdn = "${_e.postgres.serverName}.postgres.database.azure.com"

Change serverName in one place and both derived values follow. There is no second copy of the resource ID to keep in sync with the FQDN.

The output is typed as well. The rendered resource is an instance of a schema in schemas/, so its apiVersion, kind, and field shapes are fixed by that schema. A wrong field or a misspelled kind fails before it can reach a cluster. The data payload is typed in layers. EnvironmentConfig is generic: its data is a base EnvironmentData that carries the environment’s tags. Each workload’s payload is a schema that inherits that base:

schema PostgresData(base.EnvironmentData):
    serverId: str
    serverFqdn: str
    adminUser: str
    charset: str
    collation: str

So a key a composition reads (serverFqdn, say) is a typed field, and renaming it fails the render before it can silently produce a config the composition cannot find. The same Tags type that gates the input is carried into the rendered data, which means the governance an environment declares is what the cluster sees. (KCL has no parametric generics, so this is plain schema inheritance with a base data type rather than a Data<T>. In practice the base is the extension point you want anyway.)

Inheritance is also why adding redis cost almost nothing. RedisData(base.EnvironmentData) is one new file under schemas/services/, one new renderer, and one new static/resources/redis/. The generic EnvironmentConfig was never touched. The repository ships two workloads to make that visible: adding a workload means adding a file, and the shared resource stays untouched.
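To make the layering concrete, here is a minimal, self-contained sketch of a generic EnvironmentConfig whose data attribute is typed as the base and filled with a workload-specific sub-schema. The schema bodies are trimmed stand-ins rather than the repository's actual definitions, and Tags is reduced to a single field for brevity.

```kcl
# Trimmed stand-ins for the schemas under kcl/schemas/ (assumed shapes, not the real files).
schema Tags:
    owner: str

schema EnvironmentData:
    tags: Tags

# Generic wrapper: apiVersion and kind are literal types with matching defaults,
# so a misspelled kind cannot be rendered.
schema EnvironmentConfig:
    apiVersion: "apiextensions.crossplane.io/v1beta1" = "apiextensions.crossplane.io/v1beta1"
    kind: "EnvironmentConfig" = "EnvironmentConfig"
    metadata: {str:}
    data: EnvironmentData

# Workload payload: inherits tags from the base and adds its own typed keys.
schema PostgresData(EnvironmentData):
    serverFqdn: str

config = EnvironmentConfig {
    metadata = {name = "postgres-production"}
    data = PostgresData {
        tags = {owner = "platform-team"}
        serverFqdn = "psql-shared-production.postgres.database.azure.com"
    }
}
```

A second workload would add only another schema that inherits EnvironmentData; the EnvironmentConfig wrapper in this sketch stays unchanged.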

The selector labels are typed as well. EnvironmentConfigLabels fixes scope, service (itself a union of the known services), and environment. A renamed selector label fails the render, long before it could quietly break the composition’s function-extra-resources match at reconcile time. The set is typed on both ends: the input an environment provides and the output a composition reads.
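A label schema along these lines would give that behaviour. This is a sketch under assumptions: the attribute names match the rendered labels shown earlier, but scope is typed as a plain str because the post does not list its allowed values.

```kcl
# Hypothetical shape of EnvironmentConfigLabels; the real schema may be stricter.
schema EnvironmentConfigLabels:
    scope: str
    service: "postgres" | "redis"
    environment: "integration" | "acceptance" | "production"

labels = EnvironmentConfigLabels {
    scope = "shared"
    service = "postgres"
    environment = "production"
}

# Renaming a key in the instance (say env = "production" instead of environment)
# is rejected at evaluation: the schema declares no index signature, so unknown
# attributes cannot be added, and the required environment attribute is missing.
```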

Compare this to the same coverage spread across N Kustomize overlays. Overlays patch a base; nothing forces an overlay to be complete. An overlay that is missing a key does not fail, it just applies a resource without that key, and you discover the gap when something downstream reads a value that is not there. A KCL-rendered set cannot drift in shape, because the shape is a type and the type is checked before anything is written.

Validation in the Schema

Types give you structural validation. check blocks add semantic validation, and this is the part I find hardest to give up once you have it. The drift-detection rules you would otherwise write as a CI script live next to the schema they constrain and fail the render. Anything you would write a test for, you can write as a check.

The rules are not all in one file. They sit on the schema they belong to. Environment carries the UUID and kebab-case checks shown earlier. PostgresInput carries the one that constrains the server name to Azure’s Flexible Server naming rules, because that is where a server name is a server name:

import regex

schema PostgresInput:
    serverName: str
    adminUser: str = "platform-admin"
    charset: str = "UTF8"
    collation: str = "en_US.utf8"

    check:
        regex.match(serverName, r"^[a-z][a-z0-9-]{2,62}$"), "serverName must match Azure Flexible Server naming rules"

The tenancy check has a different shape. It encodes a governance requirement: every rendered resource is either dedicated to one consumer or shared across many, never both and never neither. That is a cross-field rule, and it sits on the rendered-data base EnvironmentData rather than on any input:

schema EnvironmentData:
    tags: Tags

    check:
        tags.dedicatedTo or tags.sharedAcross, "tenancy must be declared: set tags.dedicatedTo or tags.sharedAcross"
        not (tags.dedicatedTo and tags.sharedAcross), "tags.dedicatedTo and tags.sharedAcross are mutually exclusive"

Two conditions, two distinct messages, enforcing exactly-one. Each renderer sets its side: render/redis.k writes dedicatedTo = "app", render/postgres.k writes sharedAcross = "environments". Because the check lives on the base every workload inherits, every rendered resource has to declare its tenancy or it does not render. The alternative is a CI script that greps rendered YAML for the two keys and asserts exactly one is present. Here it is two lines next to the data it governs.

When a check fails, KCL points at the offending value and the line of the rule, with no stack trace to dig through. Break the tenancy rule by setting both keys and the render stops:

EvaluationError
  --> kcl/render/redis.k:25:16
   |
25 |         data = redis.RedisData {
   |                ^ Instance check failed
   |
  --> kcl/schemas/base.k:39:1
   |
39 |         not (tags.dedicatedTo and tags.sharedAcross), "tags.dedicatedTo and tags.sharedAcross are mutually exclusive"
   |  Check failed on the condition: tags.dedicatedTo and tags.sharedAcross are mutually exclusive

Break the resource group convention in a value file and you get the same shape, pointing at the env in values.k and the rule in environment.k:

EvaluationError
  --> kcl/values.k:23:16
   |
23 | _integration = s.Environment {
   |                ^ Instance check failed
   |
  --> kcl/schemas/environment.k:26:1
   |
26 |         regex.match(resourceGroup, r"^rg-[a-z0-9-]+$"), "resourceGroup must start with rg- and be kebab-case"
   |  Check failed on the condition: resourceGroup must start with rg- and be kebab-case

The error names the value that failed and the rule it failed, with file and line for both ends. The render fails closed, so nothing gets written and nothing wrong reaches a cluster.

One Schema, Every Environment

Every environment is the same schema, so the three environment entries are interchangeable in shape and differ only in values. Drift in shape becomes impossible by construction, and it stops being something discipline has to catch.

Here are all three, side by side in values.k. The locals at the top are the values shared across environments; each entry sets only what differs:

_region = "westeurope"

_nonProdSubscriptionId = "11111111-1111-1111-1111-111111111111"
_prodSubscriptionId = "22222222-2222-2222-2222-222222222222"

_internalClassification = "internal"
_confidentialClassification = "confidential"

_baseTags = {
    owner = "platform-team"
    costCenter = "engineering-platform"
}

_integration = s.Environment {
    name = "integration"
    subscriptionId = _nonProdSubscriptionId
    resourceGroup = "rg-platform-${name}"
    location = _region
    tags = _baseTags | {dataClassification = _internalClassification}
    postgres = {serverName = "psql-shared-${name}"}
}

_acceptance = s.Environment {
    name = "acceptance"
    subscriptionId = _nonProdSubscriptionId
    resourceGroup = "rg-platform-${name}"
    location = _region
    tags = _baseTags | {dataClassification = _internalClassification}
    postgres = {serverName = "psql-shared-${name}"}
}

_production = s.Environment {
    name = "production"
    subscriptionId = _prodSubscriptionId
    resourceGroup = "rg-platform-${name}"
    location = _region
    tags = _baseTags | {dataClassification = _confidentialClassification}
    postgres = {
        serverName = "psql-shared-${name}"
        adminUser = "psql-admin"
    }
}

All three share one shape and differ only in values. Apart from living in its own subscription, production does two things the others do not: it classifies its data as confidential and it overrides adminUser. Everything else (charset, collation, the redis sizing defaults) comes from the schema’s defaults. That is defaults-plus-delta without any inheritance between environments. Production is just another instance of the same type that happens to set one more field and a few different values. There is no base environment for it to drift away from.

Rendering all of them is one command:

make render-all

It writes manifests/generated/<env>/{providerconfig,postgres,redis}.yaml for every environment: the ClusterProviderConfig named azure-<env> that both compositions reference, and the per-service EnvironmentConfigs labelled service=postgres / service=redis and environment=<env>. One renderer per resource, and values.k is the single place that maps -D env=<name> to the values for that environment. Adding staging is a fourth entry here, a fourth literal in the Environment name union, and one line in each XRD enum. You are not copying a directory whose structure can rot over time.
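The mapping from -D env=<name> to an environment can be sketched with KCL's built-in option() function. The lookup table, the variable names, and the trimmed Environment schema below are illustrative assumptions, not a copy of the repository's values.k.

```kcl
# Minimal stand-in for the real Environment schema, to keep the sketch self-contained.
schema Environment:
    name: "integration" | "acceptance" | "production"
    resourceGroup: str

# Hypothetical lookup table keyed by environment name.
_environments = {
    integration = Environment {name = "integration", resourceGroup = "rg-platform-integration"}
    acceptance = Environment {name = "acceptance", resourceGroup = "rg-platform-acceptance"}
    production = Environment {name = "production", resourceGroup = "rg-platform-production"}
}

# option() reads the value passed on the command line as -D env=<name>.
_env = option("env")
assert _env in _environments, "unknown or missing environment '${_env}'"
_e = _environments[_env]

# Everything a renderer emits is derived from the selected instance.
resourceGroupName = _e.resourceGroup
```

Saved under a hypothetical name such as sketch.k, this would be run as kcl run -D env=production sketch.k. Leaving out -D, or passing a name that is not in the table, trips the assert instead of rendering anything.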

The redis config renders through the same machinery, just a different data schema:

apiVersion: apiextensions.crossplane.io/v1beta1
kind: EnvironmentConfig
metadata:
  name: redis-production
  labels:
    scope: shared
    service: redis
    environment: production
data:
  tags:
    owner: platform-team
    costCenter: engineering-platform
    dataClassification: confidential
    dedicatedTo: app
  resourceGroupName: rg-platform-production
  location: westeurope
  skuName: Standard
  family: C
  capacity: 1

Now close the loop. A developer creates a PostgresDatabase directly in their namespace. Crossplane v2 has no separate claim; the composite is namespaced and applied as-is:

apiVersion: platform.example.com/v1alpha1
kind: PostgresDatabase
metadata:
  name: orders
  namespace: team-checkout
spec:
  environment: production
  databaseName: orders

spec.environment: production is the only coordination the developer does. It selects the postgres-production EnvironmentConfig (via the scope/service/environment labels) and the azure-production ClusterProviderConfig, and the composition from the previous post builds the database on the shared server for that environment. The developer never sees a server ID, a subscription, or a tag. They name a database and pick an environment.

A RedisInstance closes the identical loop through its own EnvironmentConfig:

apiVersion: platform.example.com/v1alpha1
kind: RedisInstance
metadata:
  name: sessions
  namespace: team-checkout
spec:
  environment: production
  instanceName: checkout-sessions-prod

Same spec.environment, same selection mechanism, different workload. The payoff is not postgres-specific. KCL feeds the same machine the previous post built, and adding a second workload to that machine was one schema file and one renderer.

The Toolchain

The reason it stuck for us is the day-to-day experience. The schema works for you while you type, well before it becomes a gate at the end of CI.

In the editor, the KCL language server (the VS Code extension) gives you schema-aware highlighting, autocompletion of fields and their types, and inline errors as you write, so a mistyped key is underlined before you save. The same schema that gates CI is the one driving the autocomplete.

That is the real shape of the enforcement chain. Add a property to a schema and you are forced to fill it in for every environment at once, with no second-guessing whether you wired it up everywhere, because the unfilled ones stop rendering. The same rule is checked three times against the same source of truth: while you type (the LSP), locally (make vet), and automatically in CI on the pull request, before anything deploys.
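A small sketch of that forcing function, using a hypothetical costOwner attribute that does not exist in the repository: an attribute declared without a default and without the optional marker is required, so every instance has to set it.

```kcl
schema Environment:
    name: str
    location: str
    # Hypothetical newly added attribute: no default and no "?", so it is required.
    costOwner: str

integration = Environment {
    name = "integration"
    location = "westeurope"
    costOwner = "platform-team"
}

# An entry that has not been updated yet, for example
#     Environment {name = "acceptance", location = "westeurope"}
# fails evaluation because costOwner is missing, which is the error the
# editor, a local vet run, and CI would each surface.
```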

kcl fmt keeps the source formatted, and make vet is the gate: every environment renders and passes its checks, no output written.

make vet

✓ all environments render

In CI that is the whole validation job, plus a make check that fails if the committed manifests/generated/ has drifted from the source:

- name: Vet (every environment renders and passes its checks)
  run: make vet
- name: Check committed output matches source
  run: make check

None of this is free, and the honest cost shows up in three places. KCL is another language the team had to learn, and the early weeks were slower while the schema-first way of thinking settled in; for a while it felt like more ceremony than the YAML it replaced. The directory layout grew a second axis too, the split between the source you edit and the generated output you do not, which is one more thing a newcomer has to hold in their head before the first edit makes sense. And there is now a render step between writing config and applying it, so the generated manifests can drift from their source if someone skips it; make check exists because that drift is possible. The payoff is real, but it arrives after the ramp.

And to close the section: KCL is a real language. You will not run an HTTP server with it, but it has types, variables, lambdas, inheritance, packages, and imports. That is why the tooling feels like a real language toolchain, with a formatter, a language server, and a CI gate. The package and OCI-distribution story (kpm) is real and deserves its own post; I am leaving it out here on purpose.

Closing

This post covered the offline layer: typed schemas and a render step that turn the per-environment bootstrap config (the EnvironmentConfigs and ClusterProviderConfigs a composition reads) into a contract that fails the build, long before anything reaches apply. The other layer is inline rendering: function-kcl inside the composition pipeline, replacing the go-templating from the previous post and rendering on every reconciliation against live observed state. Same language, different lifecycle, and the natural next step in this series. I may take a detour first, though; writing custom composition functions is a topic I have been wanting to dig into, and it would slot in well here.

The full example, with both workloads, the schemas, the rendered output, and the Makefile, is at github.com/ToonVanDeuren/crossplane-typed-environments.

References:

kcl-lang.io (language reference, schemas, check, the standard library)

KCL: schema definitions (types, defaults, inheritance, check blocks)

KCL tools and IDE integration (kcl run, kcl fmt, kcl vet, the language server)

Crossplane docs: EnvironmentConfigs