
Module 15 — Troubleshooting Clusters

Overview

Troubleshooting is the highest-weighted domain on the CKA exam (30%). You must be able to diagnose and fix broken control plane components, worker nodes, and system services quickly. This module covers a systematic approach to cluster-level troubleshooting — control plane failures, worker node failures, component logs, and node conditions.


1. Troubleshooting Methodology

1.1 The Systematic Approach

Always follow this order — top-down, from the cluster level to the component level:

1. Can I reach the API server?
   ├── NO  → Control plane issue (Section 2)
   │         Check kubelet, static pods, certificates
   └── YES
        2. Are all nodes Ready?
           ├── NO  → Node issue (Section 3)
           │         Check kubelet, container runtime, networking
           └── YES
                3. Are system pods running?
                   ├── NO  → Component issue (Section 2/3)
                   │         Check specific component logs
                   └── YES → Cluster is healthy
                             Problem is likely application-level
                             (covered in 16-troubleshooting-applications.md)
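
The first branch of the decision tree can be wrapped in a tiny helper. A minimal sketch (the function name unhealthy_nodes is illustrative; it filters any kubectl get nodes --no-headers output, demoed here on canned output so it runs without a cluster):

```shell
# List nodes whose STATUS column is anything other than plain "Ready"
# (NotReady, Ready,SchedulingDisabled, Unknown, ...).
unhealthy_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Demo against canned output of: kubectl get nodes --no-headers
unhealthy_nodes <<'EOF'
controlplane   Ready      control-plane   30d   v1.30.0
worker-1       NotReady   <none>          30d   v1.30.0
worker-2       Ready      <none>          30d   v1.30.0
EOF
# Prints: worker-1
```

On a live cluster: kubectl get nodes --no-headers | unhealthy_nodes.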

1.2 First Commands to Run

# 1. Can I talk to the API server?
kubectl cluster-info
kubectl get nodes

# 2. What's the state of all nodes?
kubectl get nodes -o wide

# 3. What's running in kube-system?
kubectl get pods -n kube-system -o wide

# 4. Any recent events?
kubectl get events -n kube-system --sort-by='.lastTimestamp' | tail -20

2. Diagnosing Control Plane Failures

2.1 Control Plane Components Recap

┌─────────────────────────────────────────────────┐
│              Control Plane Node                 │
│                                                 │
│  Static Pods (in /etc/kubernetes/manifests/):   │
│  ┌──────────────────┐  ┌─────────────────────┐  │
│  │ kube-apiserver   │  │ etcd                │  │
│  └──────────────────┘  └─────────────────────┘  │
│  ┌──────────────────┐  ┌─────────────────────┐  │
│  │ kube-scheduler   │  │ kube-controller-mgr │  │
│  └──────────────────┘  └─────────────────────┘  │
│                                                 │
│  Systemd Service:                               │
│  ┌──────────────────┐                           │
│  │ kubelet          │  ← manages static pods    │
│  └──────────────────┘                           │
└─────────────────────────────────────────────────┘

Key insight: kubelet manages the static pods. If kubelet is down, all control plane components are down.
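
kubelet finds those static pods via the staticPodPath setting in its config file. A quick way to confirm which directory kubelet is watching, sketched against a throwaway config so the snippet is self-contained (on a real node, point it at /var/lib/kubelet/config.yaml):

```shell
# Extract staticPodPath from a kubelet config file.
static_pod_path() {
  awk '$1 == "staticPodPath:" { print $2 }' "$1"
}

# Demo with a throwaway config (a real node uses /var/lib/kubelet/config.yaml)
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests
EOF
static_pod_path "$cfg"    # prints: /etc/kubernetes/manifests
rm -f "$cfg"
```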

2.2 kube-apiserver Failures

The API server is the single point of contact. If it's down, kubectl doesn't work at all.

Symptoms

kubectl get nodes
# The connection to the server 192.168.1.10:6443 was refused
# OR
# Unable to connect to the server: dial tcp 192.168.1.10:6443: connect: connection refused

Diagnosis

# SSH to the control plane node

# 1. Is kubelet running? (kubelet manages the API server static pod)
systemctl status kubelet

# 2. Is the API server container running?
crictl ps | grep kube-apiserver
crictl ps -a | grep kube-apiserver    # include stopped containers

# 3. Check API server container logs
crictl logs <apiserver-container-id>

# 4. Check the static pod manifest
cat /etc/kubernetes/manifests/kube-apiserver.yaml

# 5. Check for syntax errors in the manifest
# Look for typos in flags, wrong paths, missing files

Common Causes and Fixes

| Cause | How to identify | Fix |
|---|---|---|
| Manifest syntax error | crictl ps -a shows container restarting; crictl logs shows the error | Fix the YAML in /etc/kubernetes/manifests/kube-apiserver.yaml |
| Wrong certificate path | Logs: open /etc/kubernetes/pki/wrong-file.crt: no such file or directory | Correct the path in the manifest |
| Expired certificates | Logs: certificate has expired | kubeadm certs renew all |
| Wrong etcd endpoint | Logs: connection refused to etcd | Fix --etcd-servers in the manifest |
| Port conflict | Logs: bind: address already in use | Find and stop the conflicting process |
| kubelet not running | systemctl status kubelet shows failed | Fix kubelet (see Section 3) |
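
Several of these causes reduce to a flag pointing at a file that does not exist. A rough heuristic check (missing_paths is a hypothetical helper; it only catches absolute-path flag values and assumes host paths match container paths, which holds for the default kubeadm pki mounts; demoed on a throwaway manifest rather than the real one):

```shell
# Print any --flag=/path values in a manifest whose path does not exist on disk.
missing_paths() {
  grep -o '=/[^ "]*' "$1" | cut -c2- | sort -u | while read -r p; do
    [ -e "$p" ] || echo "MISSING: $p"
  done
}

# Demo: one path that exists on most Linux hosts, one that does not
m=$(mktemp)
cat > "$m" <<'EOF'
    - --client-ca-file=/etc/hostname
    - --tls-cert-file=/etc/kubernetes/pki/does-not-exist.crt
EOF
missing_paths "$m"
rm -f "$m"
```

On a real node: missing_paths /etc/kubernetes/manifests/kube-apiserver.yaml.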

2.3 etcd Failures

If etcd is down, the API server can't read or write cluster state.

Symptoms

kubectl get nodes
# Error from server: etcdserver: leader changed
# OR
# Error from server: rpc error: code = Unavailable

Diagnosis

# 1. Is etcd container running?
crictl ps | grep etcd
crictl ps -a | grep etcd

# 2. Check etcd logs
crictl logs <etcd-container-id>

# 3. Check etcd manifest
cat /etc/kubernetes/manifests/etcd.yaml

# 4. Check etcd health (if etcdctl is available)
ETCDCTL_API=3 etcdctl endpoint health \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# 5. Check disk space (etcd needs disk space)
df -h /var/lib/etcd

Common Causes and Fixes

| Cause | How to identify | Fix |
|---|---|---|
| Wrong data directory | Logs: no such file or directory for data-dir | Fix --data-dir in manifest and ensure directory exists |
| Disk full | df -h shows 100% on etcd partition | Free disk space, clean up old snapshots |
| Certificate mismatch | Logs: certificate signed by unknown authority | Verify cert paths in manifest match actual files |
| Corrupt data | Logs: database space exceeded or WAL errors | Restore from backup (see 04-etcd-backup-restore.md) |
| Permission denied | Logs: permission denied on data directory | chown -R root:root /var/lib/etcd |
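
The disk-full case is easy to script: parse df -P output and flag filesystems above a threshold (the 90% cutoff and the function name nearly_full are arbitrary choices; shown on canned df output, since the real listing depends on the node):

```shell
# Print filesystems at or above 90% use from `df -P` output.
nearly_full() {
  awk 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 >= 90) print $6, $5 "%" }'
}

nearly_full <<'EOF'
Filesystem     1024-blocks     Used Available Capacity Mounted on
/dev/sda1         51474044 49400000   2074044      96% /var/lib/etcd
/dev/sdb1         51474044 10000000  41474044      20% /data
EOF
# Prints: /var/lib/etcd 96%
```

On a node: df -P /var/lib/etcd | nearly_full.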

2.4 kube-scheduler Failures

If the scheduler is down, new pods stay Pending (existing pods keep running).

Symptoms

kubectl get pods
# NAME    READY   STATUS    AGE
# nginx   0/1     Pending   5m

kubectl describe pod nginx
# Events:  <none>
# No scheduling events at all: nothing is even attempting to place the pod.
# (A running scheduler that cannot place a pod would log FailedScheduling instead.)

Diagnosis

# 1. Is the scheduler running?
crictl ps | grep kube-scheduler

# 2. Check scheduler logs
crictl logs <scheduler-container-id>

# 3. Check the manifest
cat /etc/kubernetes/manifests/kube-scheduler.yaml

# 4. Check if the scheduler pod exists
kubectl get pods -n kube-system | grep scheduler

Common Causes and Fixes

| Cause | Fix |
|---|---|
| Manifest typo (wrong command name, wrong flag) | Fix /etc/kubernetes/manifests/kube-scheduler.yaml |
| Wrong kubeconfig path | Verify --kubeconfig=/etc/kubernetes/scheduler.conf exists |
| Port conflict on 10259 | Find and stop the conflicting process |
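
Port conflicts (10259 for the scheduler, 6443 for the API server, 2379/2380 for etcd) can be confirmed by parsing ss -tln output. A sketch against canned output, since the real listing depends on the node (port_in_use is an illustrative name):

```shell
# Report whether anything is listening on a given port, from `ss -tln` output.
port_in_use() {
  awk -v port=":$1" '$4 ~ port "$" { found = 1 } END { print (found ? "in use" : "free") }'
}

# Demo against canned `ss -tln` output
sample='LISTEN 0 4096 127.0.0.1:10259 0.0.0.0:*'
echo "$sample" | port_in_use 10259   # prints: in use
echo "$sample" | port_in_use 10251   # prints: free
```

On a node: ss -tln | port_in_use 10259, then ss -tlnp to identify the owning process.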

2.5 kube-controller-manager Failures

If the controller manager is down, no self-healing occurs — replicas aren't maintained, nodes aren't monitored, namespaces aren't cleaned up.

Symptoms

  • Deployments don't scale
  • Deleted namespaces stay in Terminating
  • Nodes aren't marked NotReady when they fail
  • ReplicaSets don't create new pods

Diagnosis

# 1. Is the controller manager running?
crictl ps | grep kube-controller-manager

# 2. Check logs
crictl logs <controller-manager-container-id>

# 3. Check the manifest
cat /etc/kubernetes/manifests/kube-controller-manager.yaml

Common Causes and Fixes

| Cause | Fix |
|---|---|
| Wrong --cluster-signing-cert-file or --cluster-signing-key-file | Fix paths in manifest |
| Wrong kubeconfig | Verify --kubeconfig=/etc/kubernetes/controller-manager.conf |
| Manifest syntax error | Fix the YAML |

2.6 Quick Reference — Control Plane Troubleshooting

# For ANY control plane component:

# Step 1: Is the container running?
crictl ps -a | grep <component-name>

# Step 2: What do the logs say?
crictl logs <container-id>
# OR (if the pod is visible to kubectl)
kubectl logs -n kube-system <pod-name>

# Step 3: Is the manifest correct?
cat /etc/kubernetes/manifests/<component>.yaml

# Step 4: Is kubelet running? (it manages all static pods)
systemctl status kubelet
journalctl -u kubelet --no-pager | tail -50
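
Steps 1 and 2 can be combined into a quick report. Because the CREATED column of crictl ps -a is free-form text ("2 minutes ago"), plain grep is more reliable than column parsing; a sketch demoed on canned output (crashed_components is an illustrative name):

```shell
# Flag control plane containers that have exited, from `crictl ps -a` output.
crashed_components() {
  grep -E 'kube-apiserver|etcd|kube-scheduler|kube-controller-manager' | grep Exited
}

# Demo against canned `crictl ps -a` lines
printf '%s\n' \
  'abc123  imgA  2 minutes ago  Exited   kube-apiserver  5  podid  pod' \
  'def456  imgB  2 minutes ago  Running  etcd            0  podid  pod' \
  | crashed_components
# Prints only the kube-apiserver line
```

On a node: crictl ps -a | crashed_components, then crictl logs on each hit.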

3. Diagnosing Worker Node Failures

3.1 Node Not Ready

kubectl get nodes
# NAME       STATUS     ROLES    AGE   VERSION
# worker-1   NotReady   <none>   30d   v1.30.0

Diagnosis Flowchart

Node is NotReady
    ├── Can you SSH to the node?
    │   │
    │   ├── NO  → Node is down (hardware, VM, network)
    │   │         Fix: restart the node/VM
    │   │
    │   └── YES
    │        │
    │        ├── Is kubelet running?
    │        │   systemctl status kubelet
    │        │   │
    │        │   ├── NO  → Start/fix kubelet (Section 3.2)
    │        │   │
    │        │   └── YES
    │        │        │
    │        │        ├── Is the container runtime running?
    │        │        │   systemctl status containerd
    │        │        │   │
    │        │        │   ├── NO  → Start/fix containerd (Section 3.3)
    │        │        │   │
    │        │        │   └── YES
    │        │        │        │
    │        │        │        └── Check kubelet logs for errors
    │        │        │            journalctl -u kubelet -f
    │        │        │            (certificates, config, networking)
    │        │
    │        └── Check node conditions (Section 4)
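
After the SSH check, the flowchart collapses to two systemctl probes. A sketch that maps the two unit states to the next step (pure string logic, so it runs anywhere; on a node you would feed it the output of systemctl is-active kubelet and systemctl is-active containerd; next_step is an illustrative name):

```shell
# Map (kubelet state, containerd state) to the next troubleshooting step.
next_step() {
  if [ "$1" != "active" ]; then
    echo "fix kubelet (Section 3.2)"
  elif [ "$2" != "active" ]; then
    echo "fix containerd (Section 3.3)"
  else
    echo "check kubelet logs: journalctl -u kubelet"
  fi
}

next_step inactive active    # prints: fix kubelet (Section 3.2)
next_step active   failed    # prints: fix containerd (Section 3.3)
next_step active   active    # prints: check kubelet logs: journalctl -u kubelet
```

On a node: next_step "$(systemctl is-active kubelet)" "$(systemctl is-active containerd)".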

3.2 kubelet Failures

kubelet is the most common cause of node issues.

# Check kubelet status
systemctl status kubelet

# Check kubelet logs
journalctl -u kubelet --no-pager | tail -100

# Check kubelet config
cat /var/lib/kubelet/config.yaml

# Check kubelet service file
systemctl cat kubelet

Common kubelet Failures

| Symptom in logs | Cause | Fix |
|---|---|---|
| failed to load kubelet config file | Wrong config path or missing file | Verify /var/lib/kubelet/config.yaml exists |
| unable to load client CA file | Wrong CA certificate path | Fix clientCAFile in kubelet config |
| node not found | kubelet can't register with API server | Check --kubeconfig and API server connectivity |
| container runtime is not running | containerd/CRI-O is down | systemctl restart containerd |
| failed to run Kubelet: misconfiguration | Invalid kubelet config | Check config YAML for syntax errors |
| certificate has expired | kubelet client cert expired | kubeadm certs renew all on control plane |
| cgroup driver mismatch | kubelet and containerd use different cgroup drivers | Align both to systemd |
# Common fix pattern
systemctl daemon-reload
systemctl restart kubelet
systemctl status kubelet

# If kubelet keeps failing, check the service file
systemctl cat kubelet
# Look for --config flag pointing to the right config file
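
The cgroup-driver mismatch from the table above can be checked mechanically: kubelet's cgroupDriver (in /var/lib/kubelet/config.yaml) must agree with containerd's SystemdCgroup setting (in /etc/containerd/config.toml). A sketch of the comparison, demoed on throwaway copies of both files (cgroup_drivers_match is an illustrative name):

```shell
# Compare kubelet's cgroupDriver with containerd's SystemdCgroup setting.
cgroup_drivers_match() {
  kubelet_driver=$(awk '$1 == "cgroupDriver:" { print $2 }' "$1")
  containerd_systemd=$(awk -F' *= *' '$1 ~ /SystemdCgroup/ { print $2 }' "$2")
  if { [ "$kubelet_driver" = "systemd" ] && [ "$containerd_systemd" = "true" ]; } ||
     { [ "$kubelet_driver" = "cgroupfs" ] && [ "$containerd_systemd" != "true" ]; }; then
    echo "match"
  else
    echo "MISMATCH: kubelet=$kubelet_driver containerd SystemdCgroup=$containerd_systemd"
  fi
}

# Demo with throwaway config copies
k=$(mktemp); c=$(mktemp)
echo 'cgroupDriver: systemd' > "$k"
echo '            SystemdCgroup = true' > "$c"
cgroup_drivers_match "$k" "$c"    # prints: match
rm -f "$k" "$c"
```

On a node: cgroup_drivers_match /var/lib/kubelet/config.yaml /etc/containerd/config.toml.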

3.3 Container Runtime Failures

# Check containerd status
systemctl status containerd

# Check containerd logs
journalctl -u containerd --no-pager | tail -50

# List containers (even if kubelet is down)
crictl ps -a

# Check runtime endpoint
crictl info

# Restart containerd
systemctl restart containerd

| Symptom | Cause | Fix |
|---|---|---|
| crictl ps fails with connection error | containerd is down | systemctl restart containerd |
| runtime not ready in kubelet logs | containerd socket not available | Check /run/containerd/containerd.sock exists |
| SystemdCgroup mismatch | containerd config doesn't match kubelet | Set SystemdCgroup = true in /etc/containerd/config.toml |

3.4 kube-proxy Failures

kube-proxy runs as a DaemonSet. If it fails, Services don't route traffic on that node.

# Check kube-proxy pods
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide

# Check kube-proxy logs
kubectl logs -n kube-system <kube-proxy-pod>

# Check kube-proxy ConfigMap
kubectl get configmap kube-proxy -n kube-system -o yaml

# Check iptables rules on the node
iptables -t nat -L KUBE-SERVICES -n | head -20

4. Checking Component Logs

4.1 Log Sources by Component

| Component | How to check logs |
|---|---|
| kube-apiserver | crictl logs <id> or kubectl logs -n kube-system kube-apiserver-<node> |
| etcd | crictl logs <id> or kubectl logs -n kube-system etcd-<node> |
| kube-scheduler | crictl logs <id> or kubectl logs -n kube-system kube-scheduler-<node> |
| kube-controller-manager | crictl logs <id> or kubectl logs -n kube-system kube-controller-manager-<node> |
| kubelet | journalctl -u kubelet (systemd service — NOT a pod) |
| kube-proxy | kubectl logs -n kube-system <kube-proxy-pod> |
| containerd | journalctl -u containerd |
| CoreDNS | kubectl logs -n kube-system -l k8s-app=kube-dns |

4.2 crictl — When kubectl Doesn't Work

When the API server is down, kubectl is useless. Use crictl directly on the node:

# List all containers (running and stopped)
crictl ps -a

# Find a specific component
crictl ps -a | grep kube-apiserver

# View container logs
crictl logs <container-id>

# View last 50 lines
crictl logs --tail=50 <container-id>

# Follow logs
crictl logs -f <container-id>

# Inspect container details
crictl inspect <container-id>

# List pods
crictl pods

# Pull an image manually
crictl pull nginx:latest

4.3 journalctl — For Systemd Services

# kubelet logs
journalctl -u kubelet

# Last 100 lines
journalctl -u kubelet --no-pager | tail -100

# Follow in real-time
journalctl -u kubelet -f

# Since a specific time
journalctl -u kubelet --since "2024-01-15 10:00:00"

# Only errors
journalctl -u kubelet -p err

# containerd logs
journalctl -u containerd -f

4.4 kubectl logs — For Pod-Based Components

# Current logs
kubectl logs -n kube-system kube-apiserver-controlplane

# Previous container logs (after a crash)
kubectl logs -n kube-system kube-apiserver-controlplane --previous

# Follow logs
kubectl logs -n kube-system kube-apiserver-controlplane -f

# Last 50 lines
kubectl logs -n kube-system kube-apiserver-controlplane --tail=50

# All pods with a label
kubectl logs -n kube-system -l component=kube-apiserver

5. Node Conditions and Status

5.1 Checking Node Status

kubectl describe node <node-name>

Key sections to check:

Conditions:
  Type                 Status  Reason                       Message
  ----                 ------  ------                       -------
  MemoryPressure       False   KubeletHasSufficientMemory   kubelet has sufficient memory
  DiskPressure         False   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   KubeletHasSufficientPID      kubelet has sufficient PID
  Ready                True    KubeletReady                 kubelet is posting ready status

5.2 Node Conditions

| Condition | True means | Effect |
|---|---|---|
| Ready | kubelet is healthy and ready to accept pods | Normal operation |
| MemoryPressure | Node is running low on memory | Eviction of pods begins |
| DiskPressure | Node is running low on disk space | Eviction of pods begins |
| PIDPressure | Too many processes on the node | New pods may not be scheduled |
| NetworkUnavailable | Node network is not configured | Pods can't communicate |
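
A healthy node reports Ready=True and every pressure condition False, so anything unhealthy can be filtered out of the Conditions block mechanically. A sketch demoed on a canned sample (bad_conditions is an illustrative name; it only reads the Type and Status columns):

```shell
# From a `kubectl describe node` Conditions block, print anything unhealthy:
# Ready must be True, everything else should be False.
bad_conditions() {
  awk '($1 == "Ready" && $2 != "True") ||
       ($1 ~ /Pressure|NetworkUnavailable/ && $2 == "True") { print $1, $2 }'
}

bad_conditions <<'EOF'
MemoryPressure   False   KubeletHasSufficientMemory   kubelet has sufficient memory
DiskPressure     True    KubeletHasDiskPressure       kubelet has disk pressure
Ready            False   KubeletNotReady              container runtime not ready
EOF
# Prints:
# DiskPressure True
# Ready False
```

On a cluster: kubectl describe node <node> | bad_conditions.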

5.3 When Ready Is False

kubectl get nodes
# NAME       STATUS     ROLES    AGE   VERSION
# worker-1   NotReady   <none>   30d   v1.30.0

kubectl describe node worker-1 | grep -A5 Conditions
# Ready   False   KubeletNotReady   container runtime not ready

| Ready=False reason | Meaning | Fix |
|---|---|---|
| KubeletNotReady | kubelet is not running or not healthy | Check systemctl status kubelet |
| container runtime not ready | containerd/CRI-O is down | systemctl restart containerd |
| PLEG is not healthy | Pod Lifecycle Event Generator stuck | Restart kubelet, check container runtime |
| NetworkPluginNotReady | CNI plugin not installed or broken | Install/fix CNI plugin |

5.4 Node Resource Information

# Capacity vs Allocatable
kubectl describe node <node> | grep -A10 "Capacity\|Allocatable"

# Capacity:    total resources on the node
# Allocatable: resources available for pods (capacity minus system reserved)
Capacity:
  cpu:                4
  memory:             8Gi
  pods:               110
Allocatable:
  cpu:                3800m      # 200m reserved for system
  memory:             7600Mi     # ~400Mi reserved
  pods:               110
# Current resource usage
kubectl top nodes                    # requires metrics-server
kubectl describe node <node> | grep -A10 "Allocated resources"

5.5 Node Taints and Unschedulable

# Check if node is cordoned
kubectl get nodes
# STATUS: Ready,SchedulingDisabled  ← cordoned

# Check taints
kubectl describe node <node> | grep Taints

# Common taints on problem nodes:
# node.kubernetes.io/not-ready:NoExecute           ← node is NotReady
# node.kubernetes.io/unreachable:NoExecute          ← node is unreachable
# node.kubernetes.io/disk-pressure:NoSchedule       ← disk pressure
# node.kubernetes.io/memory-pressure:NoSchedule     ← memory pressure
# node.kubernetes.io/unschedulable:NoSchedule       ← cordoned

6. Complete Troubleshooting Scenarios

6.1 Scenario: kubectl Doesn't Work

# Symptom
kubectl get nodes
# The connection to the server 192.168.1.10:6443 was refused

# Checklist (on the control plane node):
# 1. Is kubelet running?
systemctl status kubelet
# If not: systemctl start kubelet

# 2. Is the API server container running?
crictl ps | grep kube-apiserver
# If not: check crictl ps -a for crashed containers

# 3. Check API server logs
crictl logs <apiserver-container-id>

# 4. Check the manifest for errors
cat /etc/kubernetes/manifests/kube-apiserver.yaml

# 5. Check certificates
kubeadm certs check-expiration

# 6. Check if kubeconfig is correct
echo $KUBECONFIG
cat ~/.kube/config | grep server

6.2 Scenario: Node NotReady

# Symptom
kubectl get nodes
# worker-1   NotReady

# On the worker node:
# 1. Check kubelet
systemctl status kubelet
journalctl -u kubelet --no-pager | tail -50

# 2. Check container runtime
systemctl status containerd
crictl info

# 3. Common fixes
systemctl restart containerd
systemctl daemon-reload
systemctl restart kubelet

# 4. Check networking
ip addr show
ping <control-plane-ip>
nc -zv <control-plane-ip> 6443

6.3 Scenario: Pods Not Being Scheduled

# Symptom: new pods stuck in Pending

# 1. Is the scheduler running?
kubectl get pods -n kube-system | grep scheduler
crictl ps | grep scheduler

# 2. Check scheduler logs
crictl logs <scheduler-container-id>

# 3. Check the scheduler manifest
cat /etc/kubernetes/manifests/kube-scheduler.yaml

# 4. If scheduler is running, check pod events
kubectl describe pod <pending-pod>
# Look for: Insufficient cpu/memory, taints, affinity mismatch

6.4 Scenario: Deployments Not Scaling

# Symptom: kubectl scale works but no new pods appear

# 1. Is the controller manager running?
kubectl get pods -n kube-system | grep controller-manager
crictl ps | grep controller-manager

# 2. Check controller manager logs
crictl logs <controller-manager-container-id>

# 3. Check the manifest
cat /etc/kubernetes/manifests/kube-controller-manager.yaml

7. Practice Exercises

Exercise 1 — Break and Fix the API Server

# 1. Introduce a typo in the API server manifest
#    SSH to the control plane node
cp /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/kube-apiserver.yaml.bak

# Change --etcd-servers to a wrong port
sed -i 's/2379/2399/' /etc/kubernetes/manifests/kube-apiserver.yaml

# 2. Wait 30 seconds, then try kubectl
kubectl get nodes
# Should fail — connection refused

# 3. Diagnose
crictl ps -a | grep kube-apiserver
crictl logs <container-id>
# Should show: connection refused to etcd on port 2399

# 4. Fix
cp /tmp/kube-apiserver.yaml.bak /etc/kubernetes/manifests/kube-apiserver.yaml

# 5. Verify
sleep 30
kubectl get nodes

Exercise 2 — Break and Fix kubelet

# 1. Stop kubelet
systemctl stop kubelet

# 2. Observe from another node (or before kubelet stops)
kubectl get nodes
# Control plane node goes NotReady after ~40 seconds

# 3. Diagnose
systemctl status kubelet
# Active: inactive (dead)

# 4. Fix
systemctl start kubelet

# 5. Verify
kubectl get nodes
# Should return to Ready

Exercise 3 — Break and Fix the Scheduler

# 1. Move the scheduler manifest
mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/

# 2. Create a test pod
kubectl run test-sched --image=nginx
kubectl get pods
# test-sched   0/1   Pending   ← no scheduler to place it

# 3. Diagnose
kubectl get pods -n kube-system | grep scheduler
# No scheduler pod

crictl ps | grep scheduler
# Nothing

ls /etc/kubernetes/manifests/
# kube-scheduler.yaml is missing

# 4. Fix
mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/

# 5. Verify
sleep 15
kubectl get pods
# test-sched   1/1   Running

# 6. Clean up
kubectl delete pod test-sched

Exercise 4 — Investigate Node Conditions

# 1. Check all node conditions
kubectl describe node <node-name> | grep -A20 Conditions

# 2. Check resource allocation
kubectl describe node <node-name> | grep -A10 "Allocated resources"

# 3. Check for taints
kubectl describe node <node-name> | grep Taints

# 4. Check kubelet logs for any warnings
journalctl -u kubelet --no-pager | grep -i "error\|warning" | tail -20

# 5. Check disk and memory
df -h
free -m

Exercise 5 — Full Cluster Health Check

# Run through this checklist on any cluster:

# 1. API server reachable?
kubectl cluster-info

# 2. All nodes Ready?
kubectl get nodes

# 3. All system pods running?
kubectl get pods -n kube-system

# 4. Certificates valid?
kubeadm certs check-expiration

# 5. etcd healthy?
ETCDCTL_API=3 etcdctl endpoint health \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# 6. DNS working?
kubectl run dns-test --image=busybox -it --rm --restart=Never -- nslookup kubernetes

# 7. Any warning events?
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp' | tail -20
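
The first three checklist items can be wrapped into one function that counts failures. A minimal sketch (cluster_health is an illustrative name; the snippet only defines the function, so run it yourself on a node with kubectl access; the kube-system filter treats anything not Running or Completed as unhealthy, which is a simplification):

```shell
# Wrap the start of the checklist in one function; non-zero exit on any failure.
cluster_health() {
  fails=0
  kubectl cluster-info > /dev/null 2>&1 \
    || { echo "FAIL: API server unreachable"; fails=$((fails + 1)); }
  [ -z "$(kubectl get nodes --no-headers 2>/dev/null | awk '$2 != "Ready"')" ] \
    || { echo "FAIL: node(s) not Ready"; fails=$((fails + 1)); }
  [ -z "$(kubectl get pods -n kube-system --no-headers 2>/dev/null |
          awk '$3 != "Running" && $3 != "Completed"')" ] \
    || { echo "FAIL: kube-system pod(s) unhealthy"; fails=$((fails + 1)); }
  echo "checks failed: $fails"
  [ "$fails" -eq 0 ]
}
```

Usage on a node: cluster_health && echo OK. Certificates, etcd, and DNS still need the explicit commands above.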

8. Key Takeaways for the CKA Exam

| Point | Detail |
|---|---|
| kubectl broken → SSH to node | Use crictl and journalctl when kubectl doesn't work |
| kubelet is the root of everything | If kubelet is down, static pods (control plane) are down |
| crictl ps -a | Shows stopped/crashed containers — essential for diagnosing restarts |
| crictl logs <id> | First thing to check for any crashed control plane component |
| journalctl -u kubelet | kubelet is a systemd service, not a pod — use journalctl |
| Check the manifest | Most control plane issues are typos in /etc/kubernetes/manifests/ |
| systemctl daemon-reload && restart kubelet | The universal "try this first" for kubelet issues |
| Node NotReady | Check kubelet → container runtime → networking, in that order |
| Node conditions | kubectl describe node — look at Conditions, Taints, Allocated resources |
| Scheduler down = Pending pods | Existing pods keep running; new pods can't be placed |
| Controller manager down = no self-healing | Replicas not maintained, no scaling, no node monitoring |
| etcd down = API server errors | Check disk space, certificates, data directory |

Previous: 14-storage.md — Storage

Next: 16-troubleshooting-applications.md — Troubleshooting Applications