Introduction

Have you inspected your nodes lately? If you’re running Kubernetes on one of the major cloud platforms, chances are high that you have noticed a long list of conditions like the one below, seemingly appearing overnight. Personally, I was confused and did a little deep dive into where these came from. What I didn’t expect was to stumble upon a solution to a problem that we had been facing for some time and had not gotten around to fixing yet.

Conditions:
  Type                         Status  Reason                         Message
  ----                         ------  ------                         -------
  VMEventScheduled             False   NoVMEventScheduled             VM has no scheduled event
  FrequentContainerdRestart    False   NoFrequentContainerdRestart    containerd is functioning properly
  FrequentDockerRestart        False   NoFrequentDockerRestart        docker is functioning properly
  FilesystemCorruptionProblem  False   FilesystemIsOK                 Filesystem is healthy
  FrequentUnregisterNetDevice  False   NoFrequentUnregisterNetDevice  node is functioning properly
  ContainerRuntimeProblem      False   ContainerRuntimeIsUp           container runtime service is up
  KernelDeadlock               False   KernelHasNoDeadlock            kernel has no deadlock
  FrequentKubeletRestart       False   NoFrequentKubeletRestart       kubelet is functioning properly
  KubeletProblem               False   KubeletIsUp                    kubelet service is up
  ReadonlyFilesystem           False   FilesystemIsNotReadOnly        Filesystem is not read-only
  NetworkUnavailable           False   RouteCreated                   RouteController created a route
  MemoryPressure               True    KubeletHasInsufficientMemory   kubelet has insufficient memory available
  DiskPressure                 False   KubeletHasNoDiskPressure       kubelet has no disk pressure
  PIDPressure                  False   KubeletHasSufficientPID        kubelet has sufficient PID available
  Ready                        True    KubeletReady                   kubelet is posting ready status.

The Problem

When a new node is created, a few things happen: the machine itself boots, a kubelet instance starts running, and so do other critical services like the container runtime and your container network interface of choice. Next, the kubelet sends a heartbeat to your control plane. This heartbeat tells your cluster that the new node is healthy and ready to accept pods. And there lies the fundamental problem: the node is ready on a basic technical level, sure, but that readiness does not account for your workload’s dependencies.

We often need more than just a host machine, network connectivity and a container runtime. Our clusters are filled with other dependencies. Some of them offer their services on a cluster-wide basis, like an operator, controller or custom API. Take ingress, for example: as long as it is up and running somewhere within our cluster and can send traffic towards the right pods, we do not care which node the controller is running on. However, when our workloads need a specific storage driver, GPU firmware or an observability agent to be available on their own node, this becomes a hard requirement for scheduling. There is no point in the node accepting pods if it does not meet these requirements.

DaemonSets

You might already be familiar with the term “DaemonSet” or have heard about a ‘node-local’ deployment. If you haven’t, here is the short version. DaemonSets are very similar to Deployments in that they create and manage Pods across your cluster. However, they are very particular about where these pods need to run: a DaemonSet tries to place a copy of its Pod on every single node. This is achieved by creating pods with node affinities that each target a single node, using a selector that matches the name of that specific node. Below is an example of a pod created by a DaemonSet. Humor me for a second and pretend that this is a very critical service rather than something that just prints to stdout.

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: demo-critical-supporting-daemon
  name: demo-critical-supporting-daemon-fz5dd
  namespace: demo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - bob-the-node
  containers:
  - command:
    - sh
    - -c
    - |
      echo "demo-critical-supporting-daemon started on $(hostname)"
      while true; do
        echo "demo-critical-supporting-daemon running at $(date)"
        sleep 30
      done
    image: alpine:3.23
    name: daemon
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists

As you can see in spec.affinity, this pod is targeting one specific node with the name ‘bob-the-node’. I promise Alice will join us soon.

The tolerations block is equally important. Taints are a way for a node to repel pods — a node can be marked with a taint that says “do not schedule here unless you explicitly opt in”. Tolerations are that opt-in: a pod that tolerates a taint is allowed to land on the node regardless. Kubernetes automatically adds a set of default tolerations to DaemonSet pods, covering system conditions like not-ready, unreachable, and various pressure states. This is what lets DaemonSet pods reach nodes that are in a degraded state — a node experiencing memory pressure would normally push pods away, but the DaemonSet tolerates it and lands anyway. We will come back to this when we wire up our own custom taint.

Our Problem

The observability agent scenario is where we ran into trouble. That agent needs to be deployed and actively running on the node before any of our actual workloads can be scheduled on it. Our agent scrapes metrics from all containers running on its node. When the agent is not yet active but a pod is already scheduled and its containers are starting, we miss valuable metrics. Our agent also makes some adjustments to our pod’s containers, but an application that was already running does not pick these up without a restart. Quite a few of these annoying bugs were popping up sporadically.

“You can solve this with affinities”, I hear you typing already. Yes, of course, you can set pod-affinity rules on your workloads that make it a hard requirement for the scheduler to place them on a node that already has your observability agent running. This comes with a lot of other challenges: how do you make sure every workload has the correct affinities configured? Kyverno policies are a potential solution but come with quite a bit of unwanted complexity that our platform does not need right now, and you have to make sure they cover every scenario. It also pushes the responsibility to the Pod spec, which is, in my opinion, not a clean solution.

We were looking for something to handle this elegantly.

Node Readiness Controller

The Node Readiness Controller is a Kubernetes SIG project that adds a single, elegant abstraction on top of node conditions: the NodeReadinessRule. You define a rule that says “this node is not ready until condition X is False”, and the controller handles applying and removing a taint automatically. It is currently in alpha but it is already doing exactly what we need.

Before we get into the NodeReadinessRule though, we need something that actually sets those custom conditions on our nodes. That is where the Node Problem Detector comes in.

Node Problem Detector

The Node Problem Detector (NPD) is another Kubernetes project that runs as a DaemonSet on every node in your cluster. Its job is to detect problems on the node and surface them as node conditions or events. Out of the box it covers things like kernel deadlocks, read-only filesystems and frequent kubelet restarts, which is exactly where those extra conditions from the introduction came from.

What makes it interesting for our use case is its custom plugin system. You can write a shell script that exits with a specific code, and NPD translates that exit code into a node condition. Exit 0 means healthy, exit 1 means problem, exit 2 means unknown. That is all there is to it.
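
To make that mapping concrete, here is a minimal sketch of how an exit code translates into a condition status. The function name status_for_exit is hypothetical and only mirrors the behaviour described above; it is not NPD code. Treating every code other than 0 and 1 as Unknown is an assumption of this sketch.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only (not part of NPD).
# Mirrors the exit-code convention described in the text.
status_for_exit() {
  case "$1" in
    0) echo "False" ;;    # healthy: the problem condition is not present
    1) echo "True" ;;     # problem detected
    *) echo "Unknown" ;;  # exit 2; other codes are lumped in here as an assumption
  esac
}

for code in 0 1 2; do
  echo "exit $code -> condition status $(status_for_exit "$code")"
done
```

Because our condition is phrased as a problem (CriticalDaemonNotReady), a healthy check ends up as status False.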

So the plan is:

  1. Write a plugin script that checks whether our critical daemon is running on the node.
  2. Configure NPD to run that script on an interval and map the result to a custom condition.
  3. Configure the Node Readiness Controller to manage a taint based on that condition.

Implementing the custom condition

The check script

The script below is what NPD will execute every 10 seconds on each node. It uses kubectl to query for a Running pod with the label app=demo-critical-supporting-daemon on that specific node. If it finds one, it exits 0. If it does not, it exits 1.

#!/bin/bash
# Checks whether demo-critical-supporting-daemon has a Running pod on this node.
# Exit 0 → healthy (NPD sets CriticalDaemonNotReady=False, NRC removes taint)
# Exit 1 → problem (NPD sets CriticalDaemonNotReady=True, NRC applies taint)
# Exit 2 → unknown (NPD sets CriticalDaemonNotReady=Unknown, NRC keeps taint)

NAMESPACE="demo"
NODE_NAME="${NODE_NAME:-}"

if [ -z "$NODE_NAME" ]; then
  echo "NODE_NAME environment variable is not set"
  exit 2
fi

if ! PODS=$(kubectl get pods -n "$NAMESPACE" \
    -l app=demo-critical-supporting-daemon \
    --field-selector="spec.nodeName=${NODE_NAME},status.phase=Running" \
    --no-headers 2>/dev/null); then
  echo "Failed to query Kubernetes API"
  exit 2
fi

if [ -z "$PODS" ]; then
  echo "demo-critical-supporting-daemon is NOT running on node ${NODE_NAME}"
  exit 1
else
  echo "demo-critical-supporting-daemon is running on node ${NODE_NAME}"
  exit 0
fi

NODE_NAME is injected into the NPD container from the pod’s own spec, so the script knows which node it is on. We use a field selector to filter for pods on this specific node that are in the Running phase. kubectl is not present in the NPD image by default; we will handle that in the DaemonSet section below.
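
Because the script talks to a live cluster, it can be handy to exercise its branches offline first. The sketch below is a stand-in under stated assumptions: it re-implements the same three branches in a function called check_daemon and shadows kubectl with a shell function that returns whatever is in FAKE_PODS. The names check_daemon, report, FAKE_PODS and FAKE_FAIL are all made up for this example.

```shell
#!/bin/sh
# Stub that shadows the real kubectl so no cluster is needed.
# FAKE_PODS and FAKE_FAIL are hypothetical knobs for this example only.
kubectl() {
  if [ "${FAKE_FAIL:-0}" = "1" ]; then
    return 1
  fi
  printf '%s' "${FAKE_PODS:-}"
}

# Same three branches as the check script, returning instead of exiting.
check_daemon() {
  if [ -z "${NODE_NAME:-}" ]; then
    return 2
  fi
  pods=$(kubectl get pods -n demo \
    -l app=demo-critical-supporting-daemon \
    --field-selector="spec.nodeName=${NODE_NAME},status.phase=Running" \
    --no-headers 2>/dev/null) || return 2
  if [ -z "$pods" ]; then
    return 1
  fi
  return 0
}

# Runs one scenario and prints the code the real script would exit with.
report() {
  rc=0
  check_daemon || rc=$?
  echo "$1 -> exit $rc"
}

NODE_NAME="bob-the-node"
FAKE_PODS="demo-critical-supporting-daemon-fz5dd 1/1 Running 0 5m"
report "pod present"
FAKE_PODS=""
report "no pod"
FAKE_FAIL=1
report "API failure"
```

This only checks the branching logic. It says nothing about RBAC or whether kubectl is actually available inside the NPD container.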

NPD configuration

The script alone is not enough. We need to tell NPD what to do with those exit codes. That configuration lives in a JSON file that gets mounted into the NPD pod as a ConfigMap.

{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "10s",
    "timeout": "30s",
    "max_output_length": 80,
    "concurrency": 1,
    "enable_message_change_based_condition_update": false
  },
  "source": "critical-daemon-custom-monitor",
  "conditions": [
    {
      "type": "CriticalDaemonNotReady",
      "reason": "CriticalDaemonIsReady",
      "message": "demo-critical-supporting-daemon is running on this node"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "CriticalDaemonNotReady",
      "reason": "CriticalDaemonIsNotRunning",
      "path": "/config/check-critical-daemon.sh",
      "timeout": "30s"
    }
  ]
}

The conditions block defines the default state: what the condition looks like when things are healthy. The rules block is what actually drives the condition. NPD runs the script on every interval, and because the rule has "type": "permanent", the result is written to the node condition each time rather than only being reported as an event, which is what a "temporary" rule does. The condition name CriticalDaemonNotReady is intentionally phrased as a problem statement so that False means everything is fine, which is how standard Kubernetes conditions like MemoryPressure work.

Both the script and the JSON config get stored together in a single ConfigMap in kube-system:

apiVersion: v1
kind: ConfigMap
metadata:
  name: node-problem-detector
  namespace: kube-system
data:
  critical-daemon-monitor.json: |
    { ... }
  check-critical-daemon.sh: |
    #!/bin/bash
    ...

RBAC

NPD needs two sets of permissions. First, the cluster-wide permission to write node conditions and events — that is what lets it actually set CriticalDaemonNotReady on the node object. Second, a namespaced permission to read pods in the demo namespace, which our check script relies on.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-problem-detector
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-problem-detector
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["nodes/status"]
  verbs: ["patch", "update"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-problem-detector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-problem-detector
subjects:
- kind: ServiceAccount
  name: node-problem-detector
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: npd-pod-reader
  namespace: demo
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: npd-pod-reader
  namespace: demo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: npd-pod-reader
subjects:
- kind: ServiceAccount
  name: node-problem-detector
  namespace: kube-system
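
Before rolling anything out, it may be worth checking that these bindings grant what the check script needs. A rough sketch using kubectl auth can-i with impersonation follows. The helper name can_npd and the KUBECTL override are hypothetical, and impersonating a service account only works if your own user is allowed to do so.

```shell
#!/bin/sh
# KUBECTL can point at another binary or wrapper; it defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"
SA="system:serviceaccount:kube-system:node-problem-detector"

# Hypothetical helper: asks the API server whether the NPD service account
# may perform VERB on RESOURCE. Extra arguments (e.g. -n demo) are passed through.
can_npd() {
  verb="$1"
  resource="$2"
  shift 2
  "$KUBECTL" auth can-i "$verb" "$resource" --as="$SA" "$@"
}

# Example calls, to be run against a real cluster:
#   can_npd list pods -n demo                  # needed by the check script
#   can_npd patch nodes --subresource=status   # needed to write node conditions
#   can_npd list pods -n kube-system           # should answer "no" if only the manifests above apply
```

The last call is a quick way to confirm that the pod-read permission really is scoped to the demo namespace.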

NPD DaemonSet

With the ConfigMap and RBAC in place, we can deploy NPD itself. Because our check script calls kubectl and the NPD image does not ship with it, we use an init container to copy the kubectl binary into a shared emptyDir volume before the main container starts. The main container then picks it up from /usr/local/bin/kubectl. I must admit that this is a rather lazy and not particularly secure way of installing the binary. You should probably consider building your own container image on top of the node-problem-detector base image.

The other important bits are: passing the custom monitor config file as a flag, injecting NODE_NAME from the pod spec, and mounting the ConfigMap as /config — which serves both the JSON config and the plugin script. The script item gets mode: 0755 so it is executable.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
  labels:
    app: node-problem-detector
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
      initContainers:
      - name: install-kubectl
        image: bitnami/kubectl:latest
        command: ["sh", "-c", "cp $(which kubectl) /kubectl-bin/kubectl"]
        volumeMounts:
        - name: kubectl-bin
          mountPath: /kubectl-bin
      containers:
      - name: node-problem-detector
        command:
        - /node-problem-detector
        - --config.custom-plugin-monitor=/config/critical-daemon-monitor.json
        image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19
        securityContext:
          privileged: true
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: config
          mountPath: /config
          readOnly: true
        - name: kubectl-bin
          mountPath: /usr/local/bin/kubectl
          subPath: kubectl
      serviceAccountName: node-problem-detector
      volumes:
      - name: config
        configMap:
          name: node-problem-detector
          items:
          - key: critical-daemon-monitor.json
            path: critical-daemon-monitor.json
          - key: check-critical-daemon.sh
            path: check-critical-daemon.sh
            mode: 0755
      - name: kubectl-bin
        emptyDir: {}
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - effect: NoExecute
        operator: Exists

The tolerations on NPD are important — it needs to tolerate all taints, including the readiness taint we are about to add, otherwise it would never schedule on a node that is not yet ready. That would be a neat little deadlock.

The NodeReadinessRule

With NPD running and setting CriticalDaemonNotReady on each node, we can now tell the Node Readiness Controller what to do with it.

apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: critical-daemon-readiness
spec:
  nodeSelector: {}
  conditions:
  - type: "CriticalDaemonNotReady"
    requiredStatus: "False"
  taint:
    key: "readiness.k8s.io/critical-daemon-not-ready"
    effect: "NoSchedule"
  enforcementMode: "continuous"

There are a few things worth noting here. The nodeSelector: {} applies the rule to every node in the cluster. The requiredStatus: "False" is what drives the taint: the NRC removes the taint when the condition is False, meaning the daemon is running. As long as the condition is True or Unknown, the taint stays.
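
The decision described above fits in a few lines. This is only a sketch of the behaviour as explained in this post, not the controller's actual code, and taint_wanted is a made-up name.

```shell
#!/bin/sh
# Hypothetical model of the rule: the taint is wanted unless the observed
# condition status equals the required status ("False" in our rule).
REQUIRED_STATUS="False"

taint_wanted() {
  if [ "$1" = "$REQUIRED_STATUS" ]; then
    echo "no"    # requirement met: the taint gets removed
  else
    echo "yes"   # True or Unknown: the taint stays
  fi
}

for status in False True Unknown; do
  echo "CriticalDaemonNotReady=$status -> taint wanted: $(taint_wanted "$status")"
done
```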

The enforcementMode: "continuous" means the controller keeps watching. If the daemon crashes after the node was marked ready, the condition flips back to True, and the controller re-applies the taint. If you only care about startup — that the daemon is ready when the node joins the cluster — you can use bootstrap-only instead, which does not react to changes after the initial ready state is reached.

The taint key is required to be prefixed with readiness.k8s.io/ and is immutable after creation, so pick a good name the first time.

The chicken-and-egg

At this point you might be wondering: if the node starts with the taint, how does the critical daemon ever get scheduled? The daemon is a DaemonSet, and the taint blocks scheduling. So nothing gets on the node. But without the daemon, the condition never becomes False. And without the condition becoming False, the taint never gets removed.

The answer is straightforward: the critical daemon’s DaemonSet needs to tolerate its own taint.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: demo-critical-supporting-daemon
  namespace: demo
spec:
  selector:
    matchLabels:
      app: demo-critical-supporting-daemon
  template:
    metadata:
      labels:
        app: demo-critical-supporting-daemon
    spec:
      tolerations:
      - key: "readiness.k8s.io/critical-daemon-not-ready"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: daemon
        image: busybox:stable
        command:
        - sh
        - -c
        - |
          echo "demo-critical-supporting-daemon started on $(hostname)"
          while true; do
            echo "demo-critical-supporting-daemon running at $(date)"
            sleep 30
          done

By explicitly tolerating readiness.k8s.io/critical-daemon-not-ready, the DaemonSet can land on a node regardless of the taint. The daemon starts, NPD’s next check picks it up, flips the condition to False, and the NRC removes the taint. The node is now open for business.

Your actual workloads do not need any changes at all. No tolerations, no affinities, no Kyverno policies. They simply cannot schedule onto nodes that carry the taint, which is exactly the behaviour we wanted.

Putting it all together

Let me walk through the full sequence one more time with Alice and Bob, who are joining our cluster as fresh nodes.

Alice and Bob both start with CriticalDaemonNotReady=Unknown because NPD has not run its first check yet. The NRC sees this and applies readiness.k8s.io/critical-daemon-not-ready:NoSchedule to both of them. Our actual workloads cannot schedule on either node.

The critical daemon DaemonSet tolerates the taint, so Alice and Bob both get a daemon pod almost immediately. NPD runs its check within 10 seconds, finds a Running pod on each node, and sets CriticalDaemonNotReady=False. The NRC picks up the condition change and removes the taint from both nodes. The workload pods are now free to schedule.

Six hours later, the daemon crashes on Bob for some reason. NPD’s next check returns exit 1, the condition flips to True, and the NRC re-applies the taint. Bob is now blocked from accepting new pods again. The workload pods that were already running on Bob are not evicted because the taint effect is NoSchedule, not NoExecute. If you wanted eviction, you would switch to NoExecute — but that is a separate trade-off.

Alice is unaffected throughout all of this. Her daemon kept running, her condition stayed False, and her workloads never noticed a thing.
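
To follow this sequence on a live cluster, two small queries are usually enough: one for the custom condition and one for the taints on a node. The sketch below wraps them in helper functions. The names condition_status and taint_keys and the KUBECTL override are hypothetical; the jsonpath expressions assume the condition type used in this post.

```shell
#!/bin/sh
# KUBECTL can point at another binary or wrapper; it defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Prints the status of the custom condition for one node.
condition_status() {
  "$KUBECTL" get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="CriticalDaemonNotReady")].status}'
}

# Prints the taint keys currently set on one node.
taint_keys() {
  "$KUBECTL" get node "$1" -o jsonpath='{.spec.taints[*].key}'
}

# Example, against a real cluster (node name taken from the story above):
#   condition_status bob-the-node
#   taint_keys bob-the-node
```

Running both before and after deleting the daemon pod on a node should show the condition and the taint changing together, give or take the 10-second check interval.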

Wrap-up

What started as me wondering where all those node conditions came from turned into a fairly clean solution to a problem that had been sitting on our backlog for a while. The combination of NPD’s custom plugin system and the Node Readiness Controller gives you a way to express node-level readiness requirements without touching your workload specs at all. The taint acts as a gate that the node itself has to earn its way through.

The two projects are still evolving. NPD is a well-established project but its custom plugin API is not exactly well-documented. The Node Readiness Controller is in alpha under kubernetes-sigs and the API may change. Still, the pattern itself is solid, and if you are running anything that has hard per-node dependencies — observability agents, storage drivers, GPU toolkits — it is worth looking at.

All the manifests from this post are in the demo repository if you want to try it yourself.

Stay curious!