Cluster Operations

Worker Node & kubelet Reliability

Trace kubelet, container runtime, and CNI signals when pods look healthy but workloads stall.

4-day intensive · Hybrid studio days · KRW 1,680,000


Program narrative

You spend most of the week inside nodes: cgroup pressure, image pulls, CNI timeouts, and kubelet PLEG (Pod Lifecycle Event Generator) loops. Labs rotate through deliberate faults so you practice narrowing blame with evidence instead of restarting everything. We also cover graceful drain choreography for enterprise clients who cannot afford surprise eviction storms.
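The evidence-first triage the labs drill can be sketched as a handful of node-level commands, assuming SSH access to a systemd-managed node running containerd (time windows and grep patterns are illustrative, not part of the course materials):

```shell
# Recent kubelet messages, including PLEG health complaints (systemd hosts).
journalctl -u kubelet --since "30 min ago" | grep -iE "pleg|not ready"

# Ask the container runtime directly, bypassing the API server's view.
crictl ps --state exited        # containers that died recently
crictl pods --state NotReady    # sandboxes the kubelet has given up on

# Check node-level pressure before blaming any single workload (Linux PSI).
cat /proc/pressure/cpu /proc/pressure/memory
```

The point of the sequence is ordering: runtime and pressure data narrow blame before anyone restarts a component.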

Inclusions

  • kubelet flags that matter in production versus toy clusters
  • Runtime class differences without picking a vendor winner
  • Host-level networking checks that complement kubectl
  • Node problem detector patterns and when to silence them
  • Quality standards for cordon/drain communications
  • Pair debugging etiquette under time pressure
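The host-level networking checks mentioned above might look like the following sketch; the kubelet port is the standard 10250, while the DNS service IP shown is only the common default and is an assumption about your cluster:

```shell
# Did the CNI actually wire the pod network on this node?
ip -br addr show    # expect the bridge/veth interfaces your CNI creates
ip route show       # pod CIDR routes should be present

# Is the kubelet listening where the API server expects it?
ss -tlnp | grep 10250

# Resolve a cluster name from the node itself, independent of any pod.
# 10.96.0.10 is a common default cluster DNS service IP -- substitute yours.
dig +short kubernetes.default.svc.cluster.local @10.96.0.10
```

These complement kubectl because they still work when the API server's picture of the node is stale.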

Outcomes you can evidence

  • Complete a drain plan with risk callouts for stateful sets
  • Produce a kubelet log excerpt that proves root cause
  • Ship a one-page postmortem your manager can skim in two minutes
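A drain plan of the kind listed above typically wraps a command sequence like this minimal sketch, assuming a node named worker-2 (the name and timeout are illustrative):

```shell
# Stop new scheduling first; nothing is evicted yet.
kubectl cordon worker-2

# Rehearse: list what would be evicted without touching workloads.
kubectl drain worker-2 --ignore-daemonsets --delete-emptydir-data --dry-run=client

# Evict for real, with a timeout so a stuck PodDisruptionBudget surfaces
# as an error instead of silently hanging the maintenance window.
kubectl drain worker-2 --ignore-daemonsets --delete-emptydir-data --timeout=5m

# After maintenance, allow scheduling again.
kubectl uncordon worker-2
```

The risk callouts for stateful sets live around the real drain step: PodDisruptionBudgets and slow volume detach are what turn an eviction into a storm.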

Common questions

Labs default to containerd. If your employer still ships another supported runtime, bring screenshots and we adapt exercises verbally.

From our cohorts

“The PLEG loop reproduction finally convinced our vendor it was not ‘just Kubernetes being Kubernetes.’”

D. in Busan, cloud operations