# Module 15 — Troubleshooting Clusters

## Overview
Troubleshooting is the highest-weighted domain on the CKA exam (30%). You must be able to diagnose and fix broken control plane components, worker nodes, and system services quickly. This module covers a systematic approach to cluster-level troubleshooting — control plane failures, worker node failures, component logs, and node conditions.
## 1. Troubleshooting Methodology

### 1.1 The Systematic Approach
Always follow this order — top-down, from the cluster level to the component level:
```text
1. Can I reach the API server?
   │
   ├── NO → Control plane issue (Section 2)
   │         Check kubelet, static pods, certificates
   │
   └── YES
        │
        2. Are all nodes Ready?
           │
           ├── NO → Node issue (Section 3)
           │         Check kubelet, container runtime, networking
           │
           └── YES
                │
                3. Are system pods running?
                   │
                   ├── NO → Component issue (Section 2/3)
                   │         Check specific component logs
                   │
                   └── YES → Cluster is healthy
                             Problem is likely application-level
                             (covered in 16-troubleshooting-applications.md)
```
### 1.2 First Commands to Run

```bash
# 1. Can I talk to the API server?
kubectl cluster-info
kubectl get nodes

# 2. What's the state of all nodes?
kubectl get nodes -o wide

# 3. What's running in kube-system?
kubectl get pods -n kube-system -o wide

# 4. Any recent events?
kubectl get events -n kube-system --sort-by='.lastTimestamp' | tail -20
```
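The "are all nodes Ready?" check lends itself to scripting. A minimal sketch that flags NotReady nodes — sample `kubectl get nodes --no-headers` output is inlined here so the filter runs without a live cluster:

```shell
# Sample output of `kubectl get nodes --no-headers` (stand-in for a cluster)
NODES='controlplane   Ready      control-plane   30d   v1.30.0
worker-1       NotReady   <none>          30d   v1.30.0'

# Column 2 is STATUS; anything other than "Ready" needs attention
NOT_READY=$(echo "$NODES" | awk '$2 != "Ready" {print $1}')
if [ -n "$NOT_READY" ]; then
  echo "node issue: $NOT_READY (see Section 3)"
else
  echo "all nodes Ready"
fi
```

On a real cluster, replace the `NODES` variable with `$(kubectl get nodes --no-headers)`.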
## 2. Diagnosing Control Plane Failures

### 2.1 Control Plane Components Recap
```text
┌─────────────────────────────────────────────────┐
│ Control Plane Node                              │
│                                                 │
│ Static Pods (in /etc/kubernetes/manifests/):    │
│ ┌──────────────────┐  ┌─────────────────────┐   │
│ │ kube-apiserver   │  │ etcd                │   │
│ └──────────────────┘  └─────────────────────┘   │
│ ┌──────────────────┐  ┌─────────────────────┐   │
│ │ kube-scheduler   │  │ kube-controller-mgr │   │
│ └──────────────────┘  └─────────────────────┘   │
│                                                 │
│ Systemd Service:                                │
│ ┌──────────────────┐                            │
│ │ kubelet          │ ← manages static pods      │
│ └──────────────────┘                            │
└─────────────────────────────────────────────────┘
```
**Key insight:** kubelet manages the static pods. If kubelet is down, all control plane components are down.
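For orientation, here is a heavily trimmed sketch of what such a static pod manifest looks like — a real `kube-apiserver.yaml` carries dozens of flags; the paths shown are kubeadm defaults:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (trimmed sketch)
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.30.0
    command:
    - kube-apiserver
    - --etcd-servers=https://127.0.0.1:2379
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
```

kubelet watches this directory continuously: deleting the file kills the pod, restoring it brings the pod back — which is exactly what Exercises 1 and 3 below exploit.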
### 2.2 kube-apiserver Failures
The API server is the single point of contact. If it's down, kubectl doesn't work at all.
#### Symptoms

```bash
kubectl get nodes
# The connection to the server 192.168.1.10:6443 was refused
# OR
# Unable to connect to the server: dial tcp 192.168.1.10:6443: connect: connection refused
```
#### Diagnosis

```bash
# SSH to the control plane node

# 1. Is kubelet running? (kubelet manages the API server static pod)
systemctl status kubelet

# 2. Is the API server container running?
crictl ps | grep kube-apiserver
crictl ps -a | grep kube-apiserver   # include stopped containers

# 3. Check API server container logs
crictl logs <apiserver-container-id>

# 4. Check the static pod manifest
cat /etc/kubernetes/manifests/kube-apiserver.yaml

# 5. Check for syntax errors in the manifest
# Look for typos in flags, wrong paths, missing files
```
#### Common Causes and Fixes

| Cause | How to identify | Fix |
|---|---|---|
| Manifest syntax error | `crictl ps -a` shows container restarting; `crictl logs` shows the error | Fix the YAML in `/etc/kubernetes/manifests/kube-apiserver.yaml` |
| Wrong certificate path | Logs: `open /etc/kubernetes/pki/wrong-file.crt: no such file or directory` | Correct the path in the manifest |
| Expired certificates | Logs: `certificate has expired` | `kubeadm certs renew all` |
| Wrong etcd endpoint | Logs: connection refused to etcd | Fix `--etcd-servers` in the manifest |
| Port conflict | Logs: `bind: address already in use` | Find and stop the conflicting process |
| kubelet not running | `systemctl status kubelet` shows failed | Fix kubelet (see Section 3) |
### 2.3 etcd Failures
If etcd is down, the API server can't read or write cluster state.
#### Symptoms

```bash
kubectl get nodes
# Error from server: etcdserver: leader changed
# OR
# Error from server: rpc error: code = Unavailable
```
#### Diagnosis

```bash
# 1. Is the etcd container running?
crictl ps | grep etcd
crictl ps -a | grep etcd

# 2. Check etcd logs
crictl logs <etcd-container-id>

# 3. Check the etcd manifest
cat /etc/kubernetes/manifests/etcd.yaml

# 4. Check etcd health (if etcdctl is available)
ETCDCTL_API=3 etcdctl endpoint health \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# 5. Check disk space (etcd needs headroom)
df -h /var/lib/etcd
```
#### Common Causes and Fixes

| Cause | How to identify | Fix |
|---|---|---|
| Wrong data directory | Logs: no such file or directory for data-dir | Fix `--data-dir` in the manifest and ensure the directory exists |
| Disk full | `df -h` shows 100% on the etcd partition | Free disk space; clean up old snapshots |
| Certificate mismatch | Logs: `certificate signed by unknown authority` | Verify cert paths in the manifest match the actual files |
| Corrupt data | Logs: `database space exceeded` or WAL errors | Restore from backup (see 04-etcd-backup-restore.md) |
| Permission denied | Logs: permission denied on the data directory | `chown -R root:root /var/lib/etcd` |
### 2.4 kube-scheduler Failures
If the scheduler is down, new pods stay Pending (existing pods keep running).
#### Symptoms

```bash
kubectl get pods
# NAME    READY   STATUS    AGE
# nginx   0/1     Pending   5m

kubectl describe pod nginx
# Events:
#   Warning  FailedScheduling  no nodes available to schedule pods
# (But nodes are Ready — the scheduler isn't running)
```
#### Diagnosis

```bash
# 1. Is the scheduler running?
crictl ps | grep kube-scheduler

# 2. Check scheduler logs
crictl logs <scheduler-container-id>

# 3. Check the manifest
cat /etc/kubernetes/manifests/kube-scheduler.yaml

# 4. Check whether the scheduler pod exists
kubectl get pods -n kube-system | grep scheduler
```
#### Common Causes and Fixes

| Cause | Fix |
|---|---|
| Manifest typo (wrong command name, wrong flag) | Fix `/etc/kubernetes/manifests/kube-scheduler.yaml` |
| Wrong kubeconfig path | Verify `--kubeconfig=/etc/kubernetes/scheduler.conf` exists |
| Port conflict on 10259 | Find and stop the conflicting process |
### 2.5 kube-controller-manager Failures
If the controller manager is down, no self-healing occurs — replicas aren't maintained, nodes aren't monitored, namespaces aren't cleaned up.
#### Symptoms

- Deployments don't scale
- Deleted namespaces stay in `Terminating`
- Nodes aren't marked `NotReady` when they fail
- ReplicaSets don't create new pods
#### Diagnosis

```bash
# 1. Is the controller manager running?
crictl ps | grep kube-controller-manager

# 2. Check logs
crictl logs <controller-manager-container-id>

# 3. Check the manifest
cat /etc/kubernetes/manifests/kube-controller-manager.yaml
```
#### Common Causes and Fixes

| Cause | Fix |
|---|---|
| Wrong `--cluster-signing-cert-file` or `--cluster-signing-key-file` | Fix paths in the manifest |
| Wrong kubeconfig | Verify `--kubeconfig=/etc/kubernetes/controller-manager.conf` |
| Manifest syntax error | Fix the YAML |
### 2.6 Quick Reference — Control Plane Troubleshooting

```bash
# For ANY control plane component:

# Step 1: Is the container running?
crictl ps -a | grep <component-name>

# Step 2: What do the logs say?
crictl logs <container-id>
# OR (if the pod is visible to kubectl)
kubectl logs -n kube-system <pod-name>

# Step 3: Is the manifest correct?
cat /etc/kubernetes/manifests/<component>.yaml

# Step 4: Is kubelet running? (it manages all static pods)
systemctl status kubelet
journalctl -u kubelet --no-pager | tail -50
```
## 3. Diagnosing Worker Node Failures

### 3.1 Node Not Ready

```bash
kubectl get nodes
# NAME       STATUS     ROLES    AGE   VERSION
# worker-1   NotReady   <none>   30d   v1.30.0
```
#### Diagnosis Flowchart

```text
Node is NotReady
│
├── Can you SSH to the node?
│   │
│   ├── NO → Node is down (hardware, VM, network)
│   │        Fix: restart the node/VM
│   │
│   └── YES
│       │
│       ├── Is kubelet running?
│       │   systemctl status kubelet
│       │   │
│       │   ├── NO → Start/fix kubelet (Section 3.2)
│       │   │
│       │   └── YES
│       │       │
│       │       ├── Is the container runtime running?
│       │       │   systemctl status containerd
│       │       │   │
│       │       │   ├── NO → Start/fix containerd (Section 3.3)
│       │       │   │
│       │       │   └── YES
│       │       │       │
│       │       │       └── Check kubelet logs for errors
│       │       │           journalctl -u kubelet -f
│       │       │           (certificates, config, networking)
│       │
│       └── Check node conditions (Section 4)
```
### 3.2 kubelet Failures
kubelet is the most common cause of node issues.
```bash
# Check kubelet status
systemctl status kubelet

# Check kubelet logs
journalctl -u kubelet --no-pager | tail -100

# Check kubelet config
cat /var/lib/kubelet/config.yaml

# Check the kubelet service file
systemctl cat kubelet
```
#### Common kubelet Failures

| Symptom in logs | Cause | Fix |
|---|---|---|
| `failed to load kubelet config file` | Wrong config path or missing file | Verify `/var/lib/kubelet/config.yaml` exists |
| `unable to load client CA file` | Wrong CA certificate path | Fix `clientCAFile` in the kubelet config |
| `node not found` | kubelet can't register with the API server | Check `--kubeconfig` and API server connectivity |
| `container runtime is not running` | containerd/CRI-O is down | `systemctl restart containerd` |
| `failed to run Kubelet: misconfiguration` | Invalid kubelet config | Check the config YAML for syntax errors |
| `certificate has expired` | kubelet client cert expired | `kubeadm certs renew all` on the control plane |
| cgroup driver mismatch | kubelet and containerd use different cgroup drivers | Align both to `systemd` |
```bash
# Common fix pattern
systemctl daemon-reload
systemctl restart kubelet
systemctl status kubelet

# If kubelet keeps failing, check the service file
systemctl cat kubelet
# Look for a --config flag pointing to the right config file
```
### 3.3 Container Runtime Failures

```bash
# Check containerd status
systemctl status containerd

# Check containerd logs
journalctl -u containerd --no-pager | tail -50

# List containers (even if kubelet is down)
crictl ps -a

# Check the runtime endpoint
crictl info

# Restart containerd
systemctl restart containerd
```
| Symptom | Cause | Fix |
|---|---|---|
| `crictl ps` fails with a connection error | containerd is down | `systemctl restart containerd` |
| `runtime not ready` in kubelet logs | containerd socket not available | Check that `/run/containerd/containerd.sock` exists |
| `SystemdCgroup` mismatch | containerd config doesn't match kubelet | Set `SystemdCgroup = true` in `/etc/containerd/config.toml` |
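The cgroup-driver rows above boil down to two settings that must agree. Sketches of the relevant fragments, assuming kubeadm-style defaults and a containerd 1.x config layout:

`/etc/containerd/config.toml`:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
```

`/var/lib/kubelet/config.yaml`:

```yaml
cgroupDriver: systemd
```

After changing either file, restart both services: `systemctl restart containerd && systemctl restart kubelet`.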
### 3.4 kube-proxy Failures
kube-proxy runs as a DaemonSet. If it fails, Services don't route traffic on that node.
```bash
# Check kube-proxy pods
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide

# Check kube-proxy logs
kubectl logs -n kube-system <kube-proxy-pod>

# Check the kube-proxy ConfigMap
kubectl get configmap kube-proxy -n kube-system -o yaml

# Check iptables rules on the node
iptables -t nat -L KUBE-SERVICES -n | head -20
```
## 4. Checking Component Logs

### 4.1 Log Sources by Component
| Component | How to check logs |
|---|---|
| kube-apiserver | `crictl logs <id>` or `kubectl logs -n kube-system kube-apiserver-<node>` |
| etcd | `crictl logs <id>` or `kubectl logs -n kube-system etcd-<node>` |
| kube-scheduler | `crictl logs <id>` or `kubectl logs -n kube-system kube-scheduler-<node>` |
| kube-controller-manager | `crictl logs <id>` or `kubectl logs -n kube-system kube-controller-manager-<node>` |
| kubelet | `journalctl -u kubelet` (systemd service — NOT a pod) |
| kube-proxy | `kubectl logs -n kube-system <kube-proxy-pod>` |
| containerd | `journalctl -u containerd` |
| CoreDNS | `kubectl logs -n kube-system -l k8s-app=kube-dns` |
### 4.2 crictl — When kubectl Doesn't Work
When the API server is down, kubectl is useless. Use crictl directly on the node:
```bash
# List all containers (running and stopped)
crictl ps -a

# Find a specific component
crictl ps -a | grep kube-apiserver

# View container logs
crictl logs <container-id>

# View the last 50 lines
crictl logs --tail=50 <container-id>

# Follow logs
crictl logs -f <container-id>

# Inspect container details
crictl inspect <container-id>

# List pods
crictl pods

# Pull an image manually
crictl pull nginx:latest
```
### 4.3 journalctl — For Systemd Services

```bash
# kubelet logs
journalctl -u kubelet

# Last 100 lines
journalctl -u kubelet --no-pager | tail -100

# Follow in real time
journalctl -u kubelet -f

# Since a specific time
journalctl -u kubelet --since "2024-01-15 10:00:00"

# Only errors
journalctl -u kubelet -p err

# containerd logs
journalctl -u containerd -f
```
### 4.4 kubectl logs — For Pod-Based Components

```bash
# Current logs
kubectl logs -n kube-system kube-apiserver-controlplane

# Previous container's logs (after a crash)
kubectl logs -n kube-system kube-apiserver-controlplane --previous

# Follow logs
kubectl logs -n kube-system kube-apiserver-controlplane -f

# Last 50 lines
kubectl logs -n kube-system kube-apiserver-controlplane --tail=50

# All pods with a label
kubectl logs -n kube-system -l component=kube-apiserver
```
## 5. Node Conditions and Status

### 5.1 Checking Node Status

```bash
kubectl describe node <node-name>
```
Key sections to check:
```text
Conditions:
  Type             Status  Reason                      Message
  ----             ------  ------                      -------
  MemoryPressure   False   KubeletHasSufficientMemory  kubelet has sufficient memory
  DiskPressure     False   KubeletHasNoDiskPressure    kubelet has no disk pressure
  PIDPressure      False   KubeletHasSufficientPID     kubelet has sufficient PID
  Ready            True    KubeletReady                kubelet is posting ready status
```
### 5.2 Node Conditions

| Condition | True means | Effect |
|---|---|---|
| `Ready` | kubelet is healthy and ready to accept pods | Normal operation |
| `MemoryPressure` | Node is running low on memory | Pod eviction begins |
| `DiskPressure` | Node is running low on disk space | Pod eviction begins |
| `PIDPressure` | Too many processes on the node | New pods may not be scheduled |
| `NetworkUnavailable` | Node network is not configured | Pods can't communicate |
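Any condition other than `Ready` that reports `True` is the thing to hunt for. A small sketch that filters them out of `kubectl get node <node> -o json` output — a sample of the conditions array is inlined so it runs without a cluster:

```shell
# Stand-in for: kubectl get node worker-1 -o json
cat <<'EOF' > /tmp/node.json
{"status":{"conditions":[
  {"type":"MemoryPressure","status":"False"},
  {"type":"DiskPressure","status":"True"},
  {"type":"PIDPressure","status":"False"},
  {"type":"Ready","status":"True"}
]}}
EOF

# Print every pressure-style condition that is currently True
python3 - <<'EOF'
import json

node = json.load(open("/tmp/node.json"))
for cond in node["status"]["conditions"]:
    if cond["type"] != "Ready" and cond["status"] == "True":
        print(cond["type"])
EOF
```

With the sample above this prints `DiskPressure` — on a live cluster, pipe `kubectl get node <node> -o json` into the same filter.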
### 5.3 When Ready Is False

```bash
kubectl get nodes
# NAME       STATUS     ROLES    AGE   VERSION
# worker-1   NotReady   <none>   30d   v1.30.0

kubectl describe node worker-1 | grep -A5 Conditions
#   Ready   False   KubeletNotReady   container runtime not ready
```
| Ready=False Reason | Meaning | Fix |
|---|---|---|
| `KubeletNotReady` | kubelet is not running or not healthy | Check `systemctl status kubelet` |
| `container runtime not ready` | containerd/CRI-O is down | `systemctl restart containerd` |
| `PLEG is not healthy` | Pod Lifecycle Event Generator stuck | Restart kubelet; check the container runtime |
| `NetworkPluginNotReady` | CNI plugin not installed or broken | Install/fix the CNI plugin |
### 5.4 Capacity and Allocatable

```bash
# Capacity vs Allocatable
kubectl describe node <node> | grep -A10 "Capacity\|Allocatable"
# Capacity:    total resources on the node
# Allocatable: resources available for pods (capacity minus system reserved)
```
```text
Capacity:
  cpu:     4
  memory:  8Gi
  pods:    110
Allocatable:
  cpu:     3800m    # 200m reserved for system
  memory:  7600Mi   # ~600Mi reserved
  pods:    110
```
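The reserved slice in the example above is plain subtraction: the kubelet computes allocatable as capacity minus kube-reserved, system-reserved, and the eviction threshold. A sketch with the numbers from the example (the reservation split is illustrative — the real values come from the kubelet config):

```shell
CAPACITY_CPU_M=4000     # 4 cores = 4000m
RESERVED_CPU_M=200      # kube-reserved + system-reserved (illustrative)
CAPACITY_MEM_MI=8192    # 8Gi = 8192Mi
RESERVED_MEM_MI=592     # reservations + eviction threshold (illustrative)

echo "allocatable cpu:    $((CAPACITY_CPU_M - RESERVED_CPU_M))m"
echo "allocatable memory: $((CAPACITY_MEM_MI - RESERVED_MEM_MI))Mi"
# allocatable cpu:    3800m
# allocatable memory: 7600Mi
```

This is why a "4-core" node never schedules a full 4000m of pod requests.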
```bash
# Current resource usage
kubectl top nodes   # requires metrics-server
kubectl describe node <node> | grep -A10 "Allocated resources"
```
### 5.5 Node Taints and Unschedulable

```bash
# Check if the node is cordoned
kubectl get nodes
# STATUS: Ready,SchedulingDisabled  ← cordoned

# Check taints
kubectl describe node <node> | grep Taints

# Common taints on problem nodes:
# node.kubernetes.io/not-ready:NoExecute         ← node is NotReady
# node.kubernetes.io/unreachable:NoExecute       ← node is unreachable
# node.kubernetes.io/disk-pressure:NoSchedule    ← disk pressure
# node.kubernetes.io/memory-pressure:NoSchedule  ← memory pressure
# node.kubernetes.io/unschedulable:NoSchedule    ← cordoned
```
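When scripting checks, the taints line can be pulled straight out of `kubectl describe node` output. Sample text stands in for a live cluster here:

```shell
# Stand-in for: kubectl describe node worker-1
DESCRIBE='Name:               worker-1
Taints:             node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable:      false'

# Field 2 of the Taints: line is the taint itself
echo "$DESCRIBE" | awk '/^Taints:/ {print $2}'
# node.kubernetes.io/disk-pressure:NoSchedule
```

Pressure taints clear themselves once the underlying condition goes back to `False`; a cordon is cleared manually with `kubectl uncordon <node>`.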
## 6. Complete Troubleshooting Scenarios

### 6.1 Scenario: kubectl Doesn't Work

```bash
# Symptom
kubectl get nodes
# The connection to the server 192.168.1.10:6443 was refused

# Checklist (on the control plane node):

# 1. Is kubelet running?
systemctl status kubelet
# If not: systemctl start kubelet

# 2. Is the API server container running?
crictl ps | grep kube-apiserver
# If not: check crictl ps -a for crashed containers

# 3. Check API server logs
crictl logs <apiserver-container-id>

# 4. Check the manifest for errors
cat /etc/kubernetes/manifests/kube-apiserver.yaml

# 5. Check certificates
kubeadm certs check-expiration

# 6. Check that the kubeconfig is correct
echo $KUBECONFIG
grep server ~/.kube/config
```
### 6.2 Scenario: Node NotReady

```bash
# Symptom
kubectl get nodes
# worker-1   NotReady

# On the worker node:

# 1. Check kubelet
systemctl status kubelet
journalctl -u kubelet --no-pager | tail -50

# 2. Check the container runtime
systemctl status containerd
crictl info

# 3. Common fixes
systemctl restart containerd
systemctl daemon-reload
systemctl restart kubelet

# 4. Check networking
ip addr show
ping <control-plane-ip>
nc -zv <control-plane-ip> 6443
```
### 6.3 Scenario: Pods Not Being Scheduled

```bash
# Symptom: new pods stuck in Pending

# 1. Is the scheduler running?
kubectl get pods -n kube-system | grep scheduler
crictl ps | grep scheduler

# 2. Check scheduler logs
crictl logs <scheduler-container-id>

# 3. Check the scheduler manifest
cat /etc/kubernetes/manifests/kube-scheduler.yaml

# 4. If the scheduler is running, check pod events
kubectl describe pod <pending-pod>
# Look for: Insufficient cpu/memory, taints, affinity mismatch
```
### 6.4 Scenario: Deployments Not Scaling

```bash
# Symptom: kubectl scale works but no new pods appear

# 1. Is the controller manager running?
kubectl get pods -n kube-system | grep controller-manager
crictl ps | grep controller-manager

# 2. Check controller manager logs
crictl logs <controller-manager-container-id>

# 3. Check the manifest
cat /etc/kubernetes/manifests/kube-controller-manager.yaml
```
## 7. Practice Exercises

### Exercise 1 — Break and Fix the API Server

```bash
# 1. Introduce a typo in the API server manifest
# SSH to the control plane node
cp /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/kube-apiserver.yaml.bak

# Change --etcd-servers to a wrong port
sed -i 's/2379/2399/' /etc/kubernetes/manifests/kube-apiserver.yaml

# 2. Wait 30 seconds, then try kubectl
kubectl get nodes
# Should fail — connection refused

# 3. Diagnose
crictl ps -a | grep kube-apiserver
crictl logs <container-id>
# Should show: connection refused to etcd on port 2399

# 4. Fix
cp /tmp/kube-apiserver.yaml.bak /etc/kubernetes/manifests/kube-apiserver.yaml

# 5. Verify
sleep 30
kubectl get nodes
```
### Exercise 2 — Break and Fix kubelet

```bash
# 1. Stop kubelet
systemctl stop kubelet

# 2. Observe from another node (or before kubelet stops)
kubectl get nodes
# The control plane node goes NotReady after ~40 seconds

# 3. Diagnose
systemctl status kubelet
# Active: inactive (dead)

# 4. Fix
systemctl start kubelet

# 5. Verify
kubectl get nodes
# Should return to Ready
```
### Exercise 3 — Break and Fix the Scheduler

```bash
# 1. Move the scheduler manifest away
mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/

# 2. Create a test pod
kubectl run test-sched --image=nginx
kubectl get pods
# test-sched   0/1   Pending   ← no scheduler to place it

# 3. Diagnose
kubectl get pods -n kube-system | grep scheduler
# No scheduler pod
crictl ps | grep scheduler
# Nothing
ls /etc/kubernetes/manifests/
# kube-scheduler.yaml is missing

# 4. Fix
mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/

# 5. Verify
sleep 15
kubectl get pods
# test-sched   1/1   Running

# 6. Clean up
kubectl delete pod test-sched
```
### Exercise 4 — Investigate Node Conditions

```bash
# 1. Check all node conditions
kubectl describe node <node-name> | grep -A20 Conditions

# 2. Check resource allocation
kubectl describe node <node-name> | grep -A10 "Allocated resources"

# 3. Check for taints
kubectl describe node <node-name> | grep Taints

# 4. Check kubelet logs for warnings
journalctl -u kubelet --no-pager | grep -i "error\|warning" | tail -20

# 5. Check disk and memory
df -h
free -m
```
### Exercise 5 — Full Cluster Health Check

```bash
# Run through this checklist on any cluster:

# 1. API server reachable?
kubectl cluster-info

# 2. All nodes Ready?
kubectl get nodes

# 3. All system pods running?
kubectl get pods -n kube-system

# 4. Certificates valid?
kubeadm certs check-expiration

# 5. etcd healthy?
ETCDCTL_API=3 etcdctl endpoint health \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# 6. DNS working?
kubectl run dns-test --image=busybox -it --rm --restart=Never -- nslookup kubernetes

# 7. Any warning events?
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp' | tail -20
```
## 8. Key Takeaways for the CKA Exam

| Point | Detail |
|---|---|
| kubectl broken → SSH to the node | Use `crictl` and `journalctl` when kubectl doesn't work |
| kubelet is the root of everything | If kubelet is down, the static pods (control plane) are down |
| `crictl ps -a` | Shows stopped/crashed containers — essential for diagnosing restarts |
| `crictl logs <id>` | First thing to check for any crashed control plane component |
| `journalctl -u kubelet` | kubelet is a systemd service, not a pod — use journalctl |
| Check the manifest | Most control plane issues are typos in `/etc/kubernetes/manifests/` |
| `systemctl daemon-reload && systemctl restart kubelet` | The universal "try this first" for kubelet issues |
| Node NotReady | Check kubelet → container runtime → networking, in that order |
| Node conditions | `kubectl describe node` — look at Conditions, Taints, Allocated resources |
| Scheduler down = Pending pods | Existing pods keep running; new pods can't be placed |
| Controller manager down = no self-healing | Replicas not maintained, no scaling, no node monitoring |
| etcd down = API server errors | Check disk space, certificates, data directory |
Previous: 14-storage.md — Storage
Next: 16-troubleshooting-applications.md — Troubleshooting Applications