Cluster Operations
Worker Node & kubelet Reliability
Trace kubelet, container runtime, and CNI signals when pods look healthy but workloads stall.
Program narrative
You spend most of the week inside nodes: cgroup pressure, image pulls, CNI timeouts, and kubelet PLEG (Pod Lifecycle Event Generator) health loops. Labs rotate through deliberate faults so you practice narrowing blame with evidence instead of restarting everything. We also cover graceful drain choreography for enterprise clients who cannot afford surprise eviction storms.
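Much of that evidence-gathering starts with the kubelet's own journal. A minimal runnable sketch, assuming a systemd-managed kubelet that emits the standard "PLEG is not healthy" message — the log excerpt below is a fabricated sample so the commands work anywhere, not a real node's output:

```shell
# On a real node you would capture the live journal instead:
#   journalctl -u kubelet --since "15 min ago" > /tmp/kubelet.log
# Fabricated excerpt for illustration:
cat <<'EOF' > /tmp/kubelet.log
I0101 12:00:00 kubelet.go:1234] "Pod status updated" pod="default/web-0"
E0101 12:03:10 kubelet.go:2045] "PLEG is not healthy" err="pleg was last seen active 3m5s ago; threshold is 3m0s"
EOF
# A nonzero count points at runtime or node pressure, not at the workloads.
grep -c 'PLEG is not healthy' /tmp/kubelet.log
```

The point of the exercise is that one grep over a bounded time window is often enough to shift a conversation from "the pods are flaky" to "the node's runtime stalled".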
Inclusions
- kubelet flags that matter in production versus toy clusters
- Runtime class differences without picking a vendor winner
- Host-level networking checks that complement kubectl
- Node problem detector patterns and when to silence them
- Quality standards for cordon/drain communications
- Pair debugging etiquette under time pressure
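The host-level networking checks in the list above pair a few standard node commands with a look at the CNI configuration directory. A hedged sketch, assuming the common containerd layout with CNI configs under /etc/cni/net.d (the demo writes to a temp dir so it runs anywhere; the directory name and sample config are illustrative):

```shell
# On a real node you would check the standard locations directly:
#   ls /etc/cni/net.d/                              # any network config at all?
#   ip link show                                    # did the CNI create its interfaces?
#   journalctl -u kubelet --since "10 min ago" | grep -i cni
# Portable demo of the config-presence check using a temp dir:
CNI_DIR=$(mktemp -d)
echo '{"cniVersion":"1.0.0","name":"demo","plugins":[]}' > "$CNI_DIR/10-demo.conflist"
if ls "$CNI_DIR"/*.conflist >/dev/null 2>&1; then
  echo "CNI config present"
else
  echo "no CNI config found"
fi
```

An empty config directory is one of the few node faults that kubectl alone will never show you: pods simply stick in ContainerCreating while the API objects look fine.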
Outcomes you can evidence
- Complete a drain plan with risk callouts for StatefulSets
- Produce a kubelet log excerpt that proves root cause
- Ship a one-page postmortem your manager can skim in two minutes
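The drain-plan outcome above centers on the cordon-then-drain sequence and the flags that make its risks explicit. A sketch that prints the plan rather than executing it, so it is safe to run anywhere — the node name is a placeholder, while the kubectl flags are real:

```shell
# Safety flags a drain plan should call out:
#   --ignore-daemonsets       DaemonSet pods are not evicted and would otherwise block the drain
#   --delete-emptydir-data    acknowledges that emptyDir contents on the node are lost
#   --grace-period/--timeout  bound how long the eviction storm can last
NODE="worker-3"   # placeholder node name
CORDON_CMD="kubectl cordon $NODE"
DRAIN_CMD="kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data --grace-period=120 --timeout=5m"
echo "$CORDON_CMD"
echo "$DRAIN_CMD"
```

Cordoning first stops new scheduling onto the node, so the drain only has to move what is already there — the ordering is the choreography the labs rehearse.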
Common questions
Labs default to containerd. If your employer still ships another supported runtime, bring screenshots and we adapt the exercises on the spot.
From our cohorts
“Client in cloud operations — the PLEG loop reproduction finally convinced our vendor it was not ‘just Kubernetes being Kubernetes.’”