Introduction
Have you inspected your nodes lately? If you’re running Kubernetes on one of the major cloud platforms, chances are high you have noticed this at some point: a long list of conditions that seemingly appeared overnight. Personally, I was confused and did a little deep dive into where these came from. What I didn’t expect was to stumble upon a solution to a problem that we had been facing for some time and had not gotten around to fixing yet.
The Problem
When a new node is created, a few things happen: the machine itself boots, and the kubelet starts running alongside other critical services like the container runtime and your container network interface of choice. Next, the kubelet sends a heartbeat to your control plane. This heartbeat tells your cluster that the new node is healthy and ready to accept pods. And there lies the fundamental problem: the node is ready on a basic technical level, sure, but that does not account for your workload’s dependencies.
We often need more than just a host machine, network connectivity and a container runtime. Our clusters are filled with other dependencies. Some of them offer their services on a cluster-wide basis, like an operator, controller or custom API. Take ingress for example: as long as it is up and running somewhere within our cluster and can send traffic towards the right pods, we do not care which node the controller is running on. However, when our workloads need a specific storage driver, GPU firmware or an observability agent to be available on the same node, this becomes a hard requirement for scheduling. There is no point in the node accepting pods if it does not meet all of these requirements.
DaemonSets
You might already be familiar with the term “DaemonSet” or have heard about a ‘node-local’ deployment. If you haven’t, here is the short version. DaemonSets are very similar to Deployments in that they create and manage Pods across your cluster. However, they are very particular about where these pods need to run: DaemonSets try to place a copy of their Pods on every single node. This is achieved by creating pods with node affinities that each target a single node, using a selector that matches the name of that specific node. Below is an example of a pod created by a DaemonSet. Humor me for a second and pretend that this is a very critical service rather than something that just prints to stdout.
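A minimal sketch of what such a pod could look like. The generated name suffix, the busybox image and the command are assumptions made up for illustration; the node affinity and the default tolerations are the parts the DaemonSet controller adds on its own.

```yaml
apiVersion: v1
kind: Pod
metadata:
  # Hypothetical generated name; the suffix is random in a real cluster.
  name: demo-critical-supporting-daemon-x7k2p
  namespace: demo
  labels:
    app: demo-critical-supporting-daemon
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          # matchFields on metadata.name pins the pod to exactly one node.
          - matchFields:
              - key: metadata.name
                operator: In
                values:
                  - bob-the-node
  containers:
    - name: daemon
      image: busybox:1.36   # assumption: any small image works for the demo
      command: ["sh", "-c", "while true; do echo 'very critical work'; sleep 10; done"]
  tolerations:
    # Default tolerations added automatically to DaemonSet pods.
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
    - key: node.kubernetes.io/disk-pressure
      operator: Exists
      effect: NoSchedule
    - key: node.kubernetes.io/memory-pressure
      operator: Exists
      effect: NoSchedule
    - key: node.kubernetes.io/pid-pressure
      operator: Exists
      effect: NoSchedule
    - key: node.kubernetes.io/unschedulable
      operator: Exists
      effect: NoSchedule
```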
As you can see in spec.affinity, this pod is targeting one specific node with the name ‘bob-the-node’. I promise Alice will join us soon.
The tolerations block is equally important. Taints are a way for a node to repel pods — a node can be marked with a taint that says “do not schedule here unless you explicitly opt in”. Tolerations are that opt-in: a pod that tolerates a taint is allowed to land on the node regardless. Kubernetes automatically adds a set of default tolerations to DaemonSet pods, covering system conditions like not-ready, unreachable, and various pressure states. This is what lets DaemonSet pods reach nodes that are in a degraded state — a node experiencing memory pressure would normally push pods away, but the DaemonSet tolerates it and lands anyway. We will come back to this when we wire up our own custom taint.
Our Problem
The observability agent scenario is where we ran into trouble. That agent needs to be deployed and actively running on the node before any of our actual workloads can be scheduled on it. Our agent scrapes metrics from all containers running on its node. When the agent is not yet active, but the pod is already scheduled and its containers are starting, we miss valuable metrics. The agent also makes some adjustments to our pod’s containers, and an application that is already running does not pick these up without a restart. Quite a few of these annoying bugs were popping up sporadically.
“You can solve this with affinities”, I hear you typing already. Yes, of course, you can set pod-affinity rules on your workloads that make it a hard requirement for the scheduler to place them on a node that already has your observability agent running. This comes with a lot of other challenges: how do you make sure every workload has the correct affinities configured? Kyverno policies are a potential solution but come with quite a bit of unwanted complexity that our platform does not need right now, and you have to make sure they cover every scenario. It also pushes the responsibility to the Pod spec, which is in my opinion not a clean solution.
We were looking for something to handle this elegantly.
Node Readiness Controller
The Node Readiness Controller is a kubernetes-sigs project that adds a single, elegant abstraction on top of node conditions: the NodeReadinessRule. You define a rule that says “this node is not ready until condition X has the required status”, and the controller handles applying and removing a taint automatically. It is currently in alpha, but it is already doing exactly what we need.
Before we get into the NodeReadinessRule though, we need something that actually sets those custom conditions on our nodes. That is where the Node Problem Detector comes in.
Node Problem Detector
The Node Problem Detector (NPD) is another Kubernetes project that runs as a DaemonSet on every node in your cluster. Its job is to detect problems on the node and surface them as node conditions or events. Out of the box it covers things like kernel deadlocks, read-only filesystems and frequent kubelet restarts, which is exactly where those extra conditions from the introduction came from.
What makes it interesting for our use case is its custom plugin system. You can write a shell script that exits with a specific code, and NPD translates that exit code into a node condition. Exit 0 means healthy, exit 1 means problem, exit 2 means unknown. That is all there is to it.
So the plan is:
- Write a plugin script that checks whether our critical daemon is running on the node.
- Configure NPD to run that script on an interval and map the result to a custom condition.
- Configure the Node Readiness Controller to manage a taint based on that condition.
Implementing the custom condition
The check script
The script below is what NPD will execute every 10 seconds on each node. It uses kubectl to query for a Running pod with the label app=demo-critical-supporting-daemon on that specific node. If it finds one, it exits 0. If it does not, it exits 1.
NODE_NAME is injected into the NPD pod through the downward API, so the script knows which node it is on. We use a field selector to filter for pods on this specific node that are in the Running phase. kubectl is not present in the NPD image by default; we will handle that in the DaemonSet section below.
NPD configuration
The script alone is not enough. We need to tell NPD what to do with those exit codes. That configuration lives in a JSON file that gets mounted into the NPD pod as a ConfigMap.
The conditions block defines the default state: what the condition looks like when things are healthy. The rules block is what actually drives the condition. With "type": "permanent", a failing check is reported as a node condition rather than a one-off event, and because NPD re-runs the script on every interval, the condition is refreshed each time. The condition name CriticalDaemonNotReady is intentionally phrased as a problem statement so that False means everything is fine, which is how standard Kubernetes conditions like MemoryPressure work.
Both the script and the JSON config get stored together in a single ConfigMap in kube-system:
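A sketch of how that ConfigMap might be laid out, under a few assumptions: the ConfigMap name, the two file names, the reason and message strings and the 5 second timeout are all made up for illustration. The JSON follows the general shape of NPD's custom plugin monitor configs, and the script is a simplified take on the check described above.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: npd-critical-daemon-config   # hypothetical name
  namespace: kube-system
data:
  # Monitor config: tells NPD how often to run the script and which
  # condition to set based on its exit code.
  critical-daemon-monitor.json: |
    {
      "plugin": "custom",
      "pluginConfig": {
        "invoke_interval": "10s",
        "timeout": "5s",
        "max_output_length": 80,
        "concurrency": 1
      },
      "source": "critical-daemon-monitor",
      "metricsReporting": false,
      "conditions": [
        {
          "type": "CriticalDaemonNotReady",
          "reason": "CriticalDaemonRunning",
          "message": "critical daemon is running on this node"
        }
      ],
      "rules": [
        {
          "type": "permanent",
          "condition": "CriticalDaemonNotReady",
          "reason": "CriticalDaemonMissing",
          "path": "/config/check-critical-daemon.sh",
          "timeout": "5s"
        }
      ]
    }
  # Plugin script: exit 0 when a Running daemon pod exists on this node,
  # exit 1 otherwise.
  check-critical-daemon.sh: |
    #!/bin/sh
    pods=$(kubectl get pods -n demo \
      -l app=demo-critical-supporting-daemon \
      --field-selector "spec.nodeName=${NODE_NAME},status.phase=Running" \
      -o name 2>/dev/null)
    if [ -n "$pods" ]; then
      echo "critical daemon is running"
      exit 0
    fi
    echo "critical daemon is not running"
    exit 1
```

Note that in this sketch a failed kubectl call also ends in exit 1; returning exit 2 (unknown) for that case would be a reasonable variation.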
RBAC
NPD needs two sets of permissions. First, the cluster-wide permission to write node conditions and events — that is what lets it actually set CriticalDaemonNotReady on the node object. Second, a namespaced permission to read pods in the demo namespace, which our check script relies on.
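Roughly, those two sets of permissions could be expressed as follows. The ServiceAccount, role and binding names are assumptions; the cluster-wide rules are modelled on what NPD needs to patch node status and emit events, and the namespaced Role only grants read access to pods in demo.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-problem-detector        # hypothetical name
  namespace: kube-system
---
# Cluster-wide: write node conditions and events.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-problem-detector
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["nodes/status"]
    verbs: ["patch"]
  - apiGroups: ["", "events.k8s.io"]
    resources: ["events"]
    verbs: ["create", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-problem-detector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-problem-detector
subjects:
  - kind: ServiceAccount
    name: node-problem-detector
    namespace: kube-system
---
# Namespaced: let the check script list pods in the demo namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: npd-pod-reader               # hypothetical name
  namespace: demo
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: npd-pod-reader
  namespace: demo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: npd-pod-reader
subjects:
  - kind: ServiceAccount
    name: node-problem-detector
    namespace: kube-system
```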
NPD DaemonSet
With the ConfigMap and RBAC in place, we can deploy NPD itself. Because our check script calls kubectl and the NPD image does not ship with it, we use an init container to copy the kubectl binary into a shared emptyDir volume before the main container starts. The main container then picks it up from /usr/local/bin/kubectl. I must admit that this is quite a lazy and not that secure way of installing this binary. You should probably consider building your own container image from the node-problem-detector base image.
The other important bits are: passing the custom monitor config file as a flag, injecting NODE_NAME from the pod spec, and mounting the ConfigMap as /config — which serves both the JSON config and the plugin script. The script item gets mode: 0755 so it is executable.
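Put together, the DaemonSet could look something like the sketch below. Treat the details as assumptions: the init container image is a placeholder for any image that ships a shell and kubectl on its PATH, the NPD image tag is just an example version, and the ConfigMap, ServiceAccount and file names are hypothetical and need to match whatever you created earlier.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector
    spec:
      serviceAccountName: node-problem-detector   # hypothetical name
      tolerations:
        - operator: Exists          # tolerate every taint on the node
      initContainers:
        - name: install-kubectl
          # Placeholder image: anything with a shell and kubectl on PATH.
          image: example.com/tools/kubectl:latest
          command: ["sh", "-c", "cp \"$(command -v kubectl)\" /shared/kubectl"]
          volumeMounts:
            - name: kubectl
              mountPath: /shared
      containers:
        - name: node-problem-detector
          # The tag is an assumption; pin whichever release you have tested.
          image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19
          command:
            - /node-problem-detector
            - --logtostderr
            - --config.custom-plugin-monitor=/config/critical-daemon-monitor.json
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: config
              mountPath: /config
              readOnly: true
            # Expose only the copied binary, not the whole volume.
            - name: kubectl
              mountPath: /usr/local/bin/kubectl
              subPath: kubectl
      volumes:
        - name: kubectl
          emptyDir: {}
        - name: config
          configMap:
            name: npd-critical-daemon-config      # hypothetical name
            items:
              - key: critical-daemon-monitor.json
                path: critical-daemon-monitor.json
              - key: check-critical-daemon.sh
                path: check-critical-daemon.sh
                mode: 0755          # make the plugin script executable
```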
The tolerations on NPD are important — it needs to tolerate all taints, including the readiness taint we are about to add, otherwise it would never schedule on a node that is not yet ready. That would be a neat little deadlock.
The NodeReadinessRule
With NPD running and setting CriticalDaemonNotReady on each node, we can now tell the Node Readiness Controller what to do with it.
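A sketch of such a rule follows. The API is in alpha, so take the apiVersion and exact field layout as my best understanding of the project's v1alpha1 examples and verify them against the CRD installed in your cluster; the rule name is made up.

```yaml
# Assumption: group/version as used in the project's v1alpha1 examples.
apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: critical-daemon-readiness    # hypothetical name
spec:
  # The node counts as ready once this condition reports "False".
  conditions:
    - type: "CriticalDaemonNotReady"
      requiredStatus: "False"
  # Taint managed by the controller while the condition is not satisfied.
  taint:
    key: "readiness.k8s.io/critical-daemon-not-ready"
    effect: "NoSchedule"
  # Keep enforcing after the node first became ready.
  enforcementMode: "continuous"
  # Empty selector: apply the rule to every node.
  nodeSelector: {}
```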
There are a few things worth noting here. The nodeSelector: {} applies the rule to every node in the cluster. The requiredStatus: "False" is what drives the taint: the NRC removes the taint when the condition is False, meaning the daemon is running. As long as the condition is True or Unknown, the taint stays.
The enforcementMode: "continuous" means the controller keeps watching. If the daemon crashes after the node was marked ready, the condition flips back to True, and the controller re-applies the taint. If you only care about startup — that the daemon is ready when the node joins the cluster — you can use bootstrap-only instead, which does not react to changes after the initial ready state is reached.
The taint key is required to be prefixed with readiness.k8s.io/ and is immutable after creation, so pick a good name the first time.
The chicken-and-egg
At this point you might be wondering: if the node starts with the taint, how does the critical daemon ever get scheduled? The daemon is a DaemonSet, and the taint blocks scheduling. So nothing gets on the node. But without the daemon, the condition never becomes False. And without the condition becoming False, the taint never gets removed.
The answer is straightforward: the critical daemon’s DaemonSet needs to tolerate its own taint.
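In manifest form, that could look like the sketch below. The image and command are placeholders for the demo daemon; the toleration is the part that matters.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: demo-critical-supporting-daemon
  namespace: demo
spec:
  selector:
    matchLabels:
      app: demo-critical-supporting-daemon
  template:
    metadata:
      labels:
        app: demo-critical-supporting-daemon
    spec:
      tolerations:
        # Opt in to the readiness taint so the daemon can start on a
        # node that is still gated for everything else.
        - key: readiness.k8s.io/critical-daemon-not-ready
          operator: Exists
          effect: NoSchedule
      containers:
        - name: daemon
          image: busybox:1.36   # assumption: stand-in for the real agent
          command: ["sh", "-c", "while true; do echo 'very critical work'; sleep 10; done"]
```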
By explicitly tolerating readiness.k8s.io/critical-daemon-not-ready, the DaemonSet can land on a node regardless of the taint. The daemon starts, NPD’s next check picks it up, flips the condition to False, and the NRC removes the taint. The node is now open for business.
Your actual workloads do not need any changes at all. No tolerations, no affinities, no Kyverno policies. They simply cannot schedule onto nodes that carry the taint, which is exactly the behaviour we wanted.
Putting it all together
Let me walk through the full sequence one more time with Alice and Bob, who are joining our cluster as fresh nodes.
Alice and Bob both start with CriticalDaemonNotReady=Unknown because NPD has not run its first check yet. The NRC sees this and applies readiness.k8s.io/critical-daemon-not-ready:NoSchedule to both of them. Our actual workloads cannot schedule on either node.
The critical daemon DaemonSet tolerates the taint, so Alice and Bob both get a daemon pod almost immediately. NPD runs its check within 10 seconds, finds a Running pod on each node, and sets CriticalDaemonNotReady=False. The NRC picks up the condition change and removes the taint from both nodes. The workload pods are now free to schedule.
Six hours later, the daemon crashes on Bob for some reason. NPD’s next check returns exit 1, the condition flips to True, and the NRC re-applies the taint. Bob is now blocked from accepting new pods again. The workload pods that were already running on Bob are not evicted because the taint effect is NoSchedule, not NoExecute. If you wanted eviction, you would switch to NoExecute — but that is a separate trade-off.
Alice is unaffected throughout all of this. Her daemon kept running, her condition stayed False, and her workloads never noticed a thing.
Wrap-up
What started as me wondering where all those node conditions came from turned into a fairly clean solution to a problem that had been sitting on our backlog for a while. The combination of NPD’s custom plugin system and the Node Readiness Controller gives you a way to express node-level readiness requirements without touching your workload specs at all. The taint acts as a gate that the node itself has to earn its way through.
The two projects are still evolving. NPD is a well-established project but its custom plugin API is not exactly well-documented. The Node Readiness Controller is in alpha under kubernetes-sigs and the API may change. Still, the pattern itself is solid, and if you are running anything that has hard per-node dependencies — observability agents, storage drivers, GPU toolkits — it is worth looking at.
All the manifests from this post are in the demo repository if you want to try it yourself.
Stay curious!