
Module 15 — Troubleshooting Clusters

Overview

Troubleshooting is the highest-weighted domain on the CKA exam (30%). You must be able to diagnose and fix broken control plane components, worker nodes, and system services quickly. This module covers a systematic approach to cluster-level troubleshooting — control plane failures, worker node failures, component logs, and node conditions.


1. Troubleshooting Methodology

1.1 The Systematic Approach

Always follow this order — top-down, from the cluster level to the component level:

1. Can I reach the API server?
   ├── NO  → Control plane issue (Section 2)
   │         Check kubelet, static pods, certificates
   └── YES
        2. Are all nodes Ready?
           ├── NO  → Node issue (Section 3)
           │         Check kubelet, container runtime, networking
           └── YES
                3. Are system pods running?
                   ├── NO  → Component issue (Section 2/3)
                   │         Check specific component logs
                   └── YES → Cluster is healthy
                             Problem is likely application-level
                             (covered in 16-troubleshooting-applications.md)
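
The first branch of the decision tree can be wrapped in a tiny helper. A minimal sketch (the function name unhealthy_nodes is illustrative; it filters any kubectl get nodes --no-headers output, demoed here on canned output so it runs without a cluster):

```shell
# List nodes whose STATUS column is anything other than plain "Ready"
# (NotReady, Ready,SchedulingDisabled, Unknown, ...).
unhealthy_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Demo against canned output of: kubectl get nodes --no-headers
unhealthy_nodes <<'EOF'
controlplane   Ready      control-plane   30d   v1.30.0
worker-1       NotReady   <none>          30d   v1.30.0
worker-2       Ready      <none>          30d   v1.30.0
EOF
# Prints: worker-1
```

On a live cluster: kubectl get nodes --no-headers | unhealthy_nodes.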

1.2 First Commands to Run

# 1. Can I talk to the API server?
kubectl cluster-info
kubectl get nodes

# 2. What's the state of all nodes?
kubectl get nodes -o wide

# 3. What's running in kube-system?
kubectl get pods -n kube-system -o wide

# 4. Any recent events?
kubectl get events -n kube-system --sort-by='.lastTimestamp' | tail -20

2. Diagnosing Control Plane Failures

2.1 Control Plane Components Recap

┌─────────────────────────────────────────────────┐
│              Control Plane Node                 │
│                                                 │
│  Static Pods (in /etc/kubernetes/manifests/):   │
│  ┌──────────────────┐  ┌─────────────────────┐  │
│  │ kube-apiserver   │  │ etcd                │  │
│  └──────────────────┘  └─────────────────────┘  │
│  ┌──────────────────┐  ┌─────────────────────┐  │
│  │ kube-scheduler   │  │ kube-controller-mgr │  │
│  └──────────────────┘  └─────────────────────┘  │
│                                                 │
│  Systemd Service:                               │
│  ┌──────────────────┐                           │
│  │ kubelet          │  ← manages static pods    │
│  └──────────────────┘                           │
└─────────────────────────────────────────────────┘

Key insight: kubelet manages the static pods. If kubelet is down, all control plane components are down.
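
kubelet finds those static pods via the staticPodPath setting in its config file. A quick way to confirm which directory kubelet is watching, sketched against a throwaway config so the snippet is self-contained (on a real node, point it at /var/lib/kubelet/config.yaml):

```shell
# Extract staticPodPath from a kubelet config file.
static_pod_path() {
  awk '$1 == "staticPodPath:" { print $2 }' "$1"
}

# Demo with a throwaway config (a real node uses /var/lib/kubelet/config.yaml)
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests
EOF
static_pod_path "$cfg"    # prints: /etc/kubernetes/manifests
rm -f "$cfg"
```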

2.2 kube-apiserver Failures

The API server is the single point of contact. If it's down, kubectl doesn't work at all.

Symptoms

kubectl get nodes
# The connection to the server 192.168.1.10:6443 was refused
# OR
# Unable to connect to the server: dial tcp 192.168.1.10:6443: connect: connection refused

Diagnosis

# SSH to the control plane node

# 1. Is kubelet running? (kubelet manages the API server static pod)
systemctl status kubelet

# 2. Is the API server container running?
crictl ps | grep kube-apiserver
crictl ps -a | grep kube-apiserver    # include stopped containers

# 3. Check API server container logs
crictl logs <apiserver-container-id>

# 4. Check the static pod manifest
cat /etc/kubernetes/manifests/kube-apiserver.yaml

# 5. Check for syntax errors in the manifest
# Look for typos in flags, wrong paths, missing files

Common Causes and Fixes

| Cause | How to identify | Fix |
|---|---|---|
| Manifest syntax error | crictl ps -a shows container restarting; crictl logs shows the error | Fix the YAML in /etc/kubernetes/manifests/kube-apiserver.yaml |
| Wrong certificate path | Logs: open /etc/kubernetes/pki/wrong-file.crt: no such file or directory | Correct the path in the manifest |
| Expired certificates | Logs: certificate has expired | kubeadm certs renew all |
| Wrong etcd endpoint | Logs: connection refused to etcd | Fix --etcd-servers in the manifest |
| Port conflict | Logs: bind: address already in use | Find and stop the conflicting process |
| kubelet not running | systemctl status kubelet shows failed | Fix kubelet (see Section 3) |
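
Several of these causes reduce to a flag pointing at a file that does not exist. A rough heuristic check (missing_paths is a hypothetical helper; it only catches absolute-path flag values and assumes host paths match container paths, which holds for the default kubeadm pki mounts; demoed on a throwaway manifest rather than the real one):

```shell
# Print any --flag=/path values in a manifest whose path does not exist on disk.
missing_paths() {
  grep -o '=/[^ "]*' "$1" | cut -c2- | sort -u | while read -r p; do
    [ -e "$p" ] || echo "MISSING: $p"
  done
}

# Demo: one path that exists on most Linux hosts, one that does not
m=$(mktemp)
cat > "$m" <<'EOF'
    - --client-ca-file=/etc/hostname
    - --tls-cert-file=/etc/kubernetes/pki/does-not-exist.crt
EOF
missing_paths "$m"
rm -f "$m"
```

On a real node: missing_paths /etc/kubernetes/manifests/kube-apiserver.yaml.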

2.3 etcd Failures

If etcd is down, the API server can't read or write cluster state.

Symptoms

kubectl get nodes
# Error from server: etcdserver: leader changed
# OR
# Error from server: rpc error: code = Unavailable

Diagnosis

# 1. Is etcd container running?
crictl ps | grep etcd
crictl ps -a | grep etcd

# 2. Check etcd logs
crictl logs <etcd-container-id>

# 3. Check etcd manifest
cat /etc/kubernetes/manifests/etcd.yaml

# 4. Check etcd health (if etcdctl is available)
ETCDCTL_API=3 etcdctl endpoint health \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# 5. Check disk space (etcd needs disk space)
df -h /var/lib/etcd

Common Causes and Fixes

| Cause | How to identify | Fix |
|---|---|---|
| Wrong data directory | Logs: no such file or directory for data-dir | Fix --data-dir in manifest and ensure directory exists |
| Disk full | df -h shows 100% on etcd partition | Free disk space, clean up old snapshots |
| Certificate mismatch | Logs: certificate signed by unknown authority | Verify cert paths in manifest match actual files |
| Corrupt data | Logs: database space exceeded or WAL errors | Restore from backup (see 04-etcd-backup-restore.md) |
| Permission denied | Logs: permission denied on data directory | chown -R root:root /var/lib/etcd |
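
The disk-full case is easy to script: parse df -P output and flag filesystems above a threshold (the 90% cutoff and the function name nearly_full are arbitrary choices; shown on canned df output, since the real listing depends on the node):

```shell
# Print filesystems at or above 90% use from `df -P` output.
nearly_full() {
  awk 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 >= 90) print $6, $5 "%" }'
}

nearly_full <<'EOF'
Filesystem     1024-blocks     Used Available Capacity Mounted on
/dev/sda1         51474044 49400000   2074044      96% /var/lib/etcd
/dev/sdb1         51474044 10000000  41474044      20% /data
EOF
# Prints: /var/lib/etcd 96%
```

On a node: df -P /var/lib/etcd | nearly_full.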

2.4 kube-scheduler Failures

If the scheduler is down, new pods stay Pending (existing pods keep running).

Symptoms

kubectl get pods
# NAME    READY   STATUS    AGE
# nginx   0/1     Pending   5m

kubectl describe pod nginx
# Events:  <none>
# No scheduling events at all: nothing is even attempting to place the pod.
# (A running scheduler that cannot place a pod would log FailedScheduling instead.)

Diagnosis

# 1. Is the scheduler running?
crictl ps | grep kube-scheduler

# 2. Check scheduler logs
crictl logs <scheduler-container-id>

# 3. Check the manifest
cat /etc/kubernetes/manifests/kube-scheduler.yaml

# 4. Check if the scheduler pod exists
kubectl get pods -n kube-system | grep scheduler

Common Causes and Fixes

| Cause | Fix |
|---|---|
| Manifest typo (wrong command name, wrong flag) | Fix /etc/kubernetes/manifests/kube-scheduler.yaml |
| Wrong kubeconfig path | Verify --kubeconfig=/etc/kubernetes/scheduler.conf exists |
| Port conflict on 10259 | Find and stop the conflicting process |
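
Port conflicts (10259 for the scheduler, 6443 for the API server, 2379/2380 for etcd) can be confirmed by parsing ss -tln output. A sketch against canned output, since the real listing depends on the node (port_in_use is an illustrative name):

```shell
# Report whether anything is listening on a given port, from `ss -tln` output.
port_in_use() {
  awk -v port=":$1" '$4 ~ port "$" { found = 1 } END { print (found ? "in use" : "free") }'
}

# Demo against canned `ss -tln` output
sample='LISTEN 0 4096 127.0.0.1:10259 0.0.0.0:*'
echo "$sample" | port_in_use 10259   # prints: in use
echo "$sample" | port_in_use 10251   # prints: free
```

On a node: ss -tln | port_in_use 10259, then ss -tlnp to identify the owning process.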

2.5 kube-controller-manager Failures

If the controller manager is down, no self-healing occurs — replicas aren't maintained, nodes aren't monitored, namespaces aren't cleaned up.

Symptoms

  • Deployments don't scale
  • Deleted namespaces stay in Terminating
  • Nodes aren't marked NotReady when they fail
  • ReplicaSets don't create new pods

Diagnosis

# 1. Is the controller manager running?
crictl ps | grep kube-controller-manager

# 2. Check logs
crictl logs <controller-manager-container-id>

# 3. Check the manifest
cat /etc/kubernetes/manifests/kube-controller-manager.yaml

Common Causes and Fixes

| Cause | Fix |
|---|---|
| Wrong --cluster-signing-cert-file or --cluster-signing-key-file | Fix paths in manifest |
| Wrong kubeconfig | Verify --kubeconfig=/etc/kubernetes/controller-manager.conf |
| Manifest syntax error | Fix the YAML |

2.6 Quick Reference — Control Plane Troubleshooting

# For ANY control plane component:

# Step 1: Is the container running?
crictl ps -a | grep <component-name>

# Step 2: What do the logs say?
crictl logs <container-id>
# OR (if the pod is visible to kubectl)
kubectl logs -n kube-system <pod-name>

# Step 3: Is the manifest correct?
cat /etc/kubernetes/manifests/<component>.yaml

# Step 4: Is kubelet running? (it manages all static pods)
systemctl status kubelet
journalctl -u kubelet --no-pager | tail -50
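
Steps 1 and 2 can be combined into a quick report. Because the CREATED column of crictl ps -a is free-form text ("2 minutes ago"), plain grep is more reliable than column parsing; a sketch demoed on canned output (crashed_components is an illustrative name):

```shell
# Flag control plane containers that have exited, from `crictl ps -a` output.
crashed_components() {
  grep -E 'kube-apiserver|etcd|kube-scheduler|kube-controller-manager' | grep Exited
}

# Demo against canned `crictl ps -a` lines
printf '%s\n' \
  'abc123  imgA  2 minutes ago  Exited   kube-apiserver  5  podid  pod' \
  'def456  imgB  2 minutes ago  Running  etcd            0  podid  pod' \
  | crashed_components
# Prints only the kube-apiserver line
```

On a node: crictl ps -a | crashed_components, then crictl logs on each hit.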

3. Diagnosing Worker Node Failures

3.1 Node Not Ready

kubectl get nodes
# NAME       STATUS     ROLES    AGE   VERSION
# worker-1   NotReady   <none>   30d   v1.30.0

Diagnosis Flowchart

Node is NotReady
    ├── Can you SSH to the node?
    │   │
    │   ├── NO  → Node is down (hardware, VM, network)
    │   │         Fix: restart the node/VM
    │   │
    │   └── YES
    │        │
    │        ├── Is kubelet running?
    │        │   systemctl status kubelet
    │        │   │
    │        │   ├── NO  → Start/fix kubelet (Section 3.2)
    │        │   │
    │        │   └── YES
    │        │        │
    │        │        ├── Is the container runtime running?
    │        │        │   systemctl status containerd
    │        │        │   │
    │        │        │   ├── NO  → Start/fix containerd (Section 3.3)
    │        │        │   │
    │        │        │   └── YES
    │        │        │        │
    │        │        │        └── Check kubelet logs for errors
    │        │        │            journalctl -u kubelet -f
    │        │        │            (certificates, config, networking)
    │        │
    │        └── Check node conditions (Section 4)
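
After the SSH check, the flowchart collapses to two systemctl probes. A sketch that maps the two unit states to the next step (pure string logic, so it runs anywhere; on a node you would feed it the output of systemctl is-active kubelet and systemctl is-active containerd; next_step is an illustrative name):

```shell
# Map (kubelet state, containerd state) to the next troubleshooting step.
next_step() {
  if [ "$1" != "active" ]; then
    echo "fix kubelet (Section 3.2)"
  elif [ "$2" != "active" ]; then
    echo "fix containerd (Section 3.3)"
  else
    echo "check kubelet logs: journalctl -u kubelet"
  fi
}

next_step inactive active    # prints: fix kubelet (Section 3.2)
next_step active   failed    # prints: fix containerd (Section 3.3)
next_step active   active    # prints: check kubelet logs: journalctl -u kubelet
```

On a node: next_step "$(systemctl is-active kubelet)" "$(systemctl is-active containerd)".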

3.2 kubelet Failures

kubelet is the most common cause of node issues.

# Check kubelet status
systemctl status kubelet

# Check kubelet logs
journalctl -u kubelet --no-pager | tail -100

# Check kubelet config
cat /var/lib/kubelet/config.yaml

# Check kubelet service file
systemctl cat kubelet

Common kubelet Failures

| Symptom in logs | Cause | Fix |
|---|---|---|
| failed to load kubelet config file | Wrong config path or missing file | Verify /var/lib/kubelet/config.yaml exists |
| unable to load client CA file | Wrong CA certificate path | Fix clientCAFile in kubelet config |
| node not found | kubelet can't register with API server | Check --kubeconfig and API server connectivity |
| container runtime is not running | containerd/CRI-O is down | systemctl restart containerd |
| failed to run Kubelet: misconfiguration | Invalid kubelet config | Check config YAML for syntax errors |
| certificate has expired | kubelet client cert expired | kubeadm certs renew all on control plane |
| cgroup driver mismatch | kubelet and containerd use different cgroup drivers | Align both to systemd |
# Common fix pattern
systemctl daemon-reload
systemctl restart kubelet
systemctl status kubelet

# If kubelet keeps failing, check the service file
systemctl cat kubelet
# Look for --config flag pointing to the right config file
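
The cgroup-driver mismatch from the table above can be checked mechanically: kubelet's cgroupDriver (in /var/lib/kubelet/config.yaml) must agree with containerd's SystemdCgroup setting (in /etc/containerd/config.toml). A sketch of the comparison, demoed on throwaway copies of both files (cgroup_drivers_match is an illustrative name):

```shell
# Compare kubelet's cgroupDriver with containerd's SystemdCgroup setting.
cgroup_drivers_match() {
  kubelet_driver=$(awk '$1 == "cgroupDriver:" { print $2 }' "$1")
  containerd_systemd=$(awk -F' *= *' '$1 ~ /SystemdCgroup/ { print $2 }' "$2")
  if { [ "$kubelet_driver" = "systemd" ] && [ "$containerd_systemd" = "true" ]; } ||
     { [ "$kubelet_driver" = "cgroupfs" ] && [ "$containerd_systemd" != "true" ]; }; then
    echo "match"
  else
    echo "MISMATCH: kubelet=$kubelet_driver containerd SystemdCgroup=$containerd_systemd"
  fi
}

# Demo with throwaway config copies
k=$(mktemp); c=$(mktemp)
echo 'cgroupDriver: systemd' > "$k"
echo '            SystemdCgroup = true' > "$c"
cgroup_drivers_match "$k" "$c"    # prints: match
rm -f "$k" "$c"
```

On a node: cgroup_drivers_match /var/lib/kubelet/config.yaml /etc/containerd/config.toml.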

3.3 Container Runtime Failures

# Check containerd status
systemctl status containerd

# Check containerd logs
journalctl -u containerd --no-pager | tail -50

# List containers (even if kubelet is down)
crictl ps -a

# Check runtime endpoint
crictl info

# Restart containerd
systemctl restart containerd

| Symptom | Cause | Fix |
|---|---|---|
| crictl ps fails with connection error | containerd is down | systemctl restart containerd |
| runtime not ready in kubelet logs | containerd socket not available | Check /run/containerd/containerd.sock exists |
| SystemdCgroup mismatch | containerd config doesn't match kubelet | Set SystemdCgroup = true in /etc/containerd/config.toml |

3.4 kube-proxy Failures

kube-proxy runs as a DaemonSet. If it fails, Services don't route traffic on that node.

# Check kube-proxy pods
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide

# Check kube-proxy logs
kubectl logs -n kube-system <kube-proxy-pod>

# Check kube-proxy ConfigMap
kubectl get configmap kube-proxy -n kube-system -o yaml

# Check iptables rules on the node
iptables -t nat -L KUBE-SERVICES -n | head -20

4. Checking Component Logs

4.1 Log Sources by Component

| Component | How to check logs |
|---|---|
| kube-apiserver | crictl logs <id> or kubectl logs -n kube-system kube-apiserver-<node> |
| etcd | crictl logs <id> or kubectl logs -n kube-system etcd-<node> |
| kube-scheduler | crictl logs <id> or kubectl logs -n kube-system kube-scheduler-<node> |
| kube-controller-manager | crictl logs <id> or kubectl logs -n kube-system kube-controller-manager-<node> |
| kubelet | journalctl -u kubelet (systemd service — NOT a pod) |
| kube-proxy | kubectl logs -n kube-system <kube-proxy-pod> |
| containerd | journalctl -u containerd |
| CoreDNS | kubectl logs -n kube-system -l k8s-app=kube-dns |

4.2 crictl — When kubectl Doesn't Work

When the API server is down, kubectl is useless. Use crictl directly on the node:

# List all containers (running and stopped)
crictl ps -a

# Find a specific component
crictl ps -a | grep kube-apiserver

# View container logs
crictl logs <container-id>

# View last 50 lines
crictl logs --tail=50 <container-id>

# Follow logs
crictl logs -f <container-id>

# Inspect container details
crictl inspect <container-id>

# List pods
crictl pods

# Pull an image manually
crictl pull nginx:latest

4.3 journalctl — For Systemd Services

# kubelet logs
journalctl -u kubelet

# Last 100 lines
journalctl -u kubelet --no-pager | tail -100

# Follow in real-time
journalctl -u kubelet -f

# Since a specific time
journalctl -u kubelet --since "2024-01-15 10:00:00"

# Only errors
journalctl -u kubelet -p err

# containerd logs
journalctl -u containerd -f

4.4 kubectl logs — For Pod-Based Components

# Current logs
kubectl logs -n kube-system kube-apiserver-controlplane

# Previous container logs (after a crash)
kubectl logs -n kube-system kube-apiserver-controlplane --previous

# Follow logs
kubectl logs -n kube-system kube-apiserver-controlplane -f

# Last 50 lines
kubectl logs -n kube-system kube-apiserver-controlplane --tail=50

# All pods with a label
kubectl logs -n kube-system -l component=kube-apiserver

5. Node Conditions and Status

5.1 Checking Node Status

kubectl describe node <node-name>

Key sections to check:

Conditions:
  Type                 Status  Reason                       Message
  ----                 ------  ------                       -------
  MemoryPressure       False   KubeletHasSufficientMemory   kubelet has sufficient memory
  DiskPressure         False   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   KubeletHasSufficientPID      kubelet has sufficient PID
  Ready                True    KubeletReady                 kubelet is posting ready status

5.2 Node Conditions

| Condition | True means | Effect |
|---|---|---|
| Ready | kubelet is healthy and ready to accept pods | Normal operation |
| MemoryPressure | Node is running low on memory | Eviction of pods begins |
| DiskPressure | Node is running low on disk space | Eviction of pods begins |
| PIDPressure | Too many processes on the node | New pods may not be scheduled |
| NetworkUnavailable | Node network is not configured | Pods can't communicate |
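
A healthy node reports Ready=True and every pressure condition False, so anything unhealthy can be filtered out of the Conditions block mechanically. A sketch demoed on a canned sample (bad_conditions is an illustrative name; it only reads the Type and Status columns):

```shell
# From a `kubectl describe node` Conditions block, print anything unhealthy:
# Ready must be True, everything else should be False.
bad_conditions() {
  awk '($1 == "Ready" && $2 != "True") ||
       ($1 ~ /Pressure|NetworkUnavailable/ && $2 == "True") { print $1, $2 }'
}

bad_conditions <<'EOF'
MemoryPressure   False   KubeletHasSufficientMemory   kubelet has sufficient memory
DiskPressure     True    KubeletHasDiskPressure       kubelet has disk pressure
Ready            False   KubeletNotReady              container runtime not ready
EOF
# Prints:
# DiskPressure True
# Ready False
```

On a cluster: kubectl describe node <node> | bad_conditions.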

5.3 When Ready Is False

kubectl get nodes
# NAME       STATUS     ROLES    AGE   VERSION
# worker-1   NotReady   <none>   30d   v1.30.0

kubectl describe node worker-1 | grep -A5 Conditions
# Ready   False   KubeletNotReady   container runtime not ready

| Ready=False reason | Meaning | Fix |
|---|---|---|
| KubeletNotReady | kubelet is not running or not healthy | Check systemctl status kubelet |
| container runtime not ready | containerd/CRI-O is down | systemctl restart containerd |
| PLEG is not healthy | Pod Lifecycle Event Generator stuck | Restart kubelet, check container runtime |
| NetworkPluginNotReady | CNI plugin not installed or broken | Install/fix CNI plugin |

5.4 Node Resource Information

# Capacity vs Allocatable
kubectl describe node <node> | grep -A10 "Capacity\|Allocatable"

# Capacity:    total resources on the node
# Allocatable: resources available for pods (capacity minus system reserved)
Capacity:
  cpu:                4
  memory:             8Gi
  pods:               110
Allocatable:
  cpu:                3800m      # 200m reserved for system
  memory:             7600Mi     # ~400Mi reserved
  pods:               110
# Current resource usage
kubectl top nodes                    # requires metrics-server
kubectl describe node <node> | grep -A10 "Allocated resources"

5.5 Node Taints and Unschedulable

# Check if node is cordoned
kubectl get nodes
# STATUS: Ready,SchedulingDisabled  ← cordoned

# Check taints
kubectl describe node <node> | grep Taints

# Common taints on problem nodes:
# node.kubernetes.io/not-ready:NoExecute           ← node is NotReady
# node.kubernetes.io/unreachable:NoExecute          ← node is unreachable
# node.kubernetes.io/disk-pressure:NoSchedule       ← disk pressure
# node.kubernetes.io/memory-pressure:NoSchedule     ← memory pressure
# node.kubernetes.io/unschedulable:NoSchedule       ← cordoned

6. Complete Troubleshooting Scenarios

6.1 Scenario: kubectl Doesn't Work

# Symptom
kubectl get nodes
# The connection to the server 192.168.1.10:6443 was refused

# Checklist (on the control plane node):
# 1. Is kubelet running?
systemctl status kubelet
# If not: systemctl start kubelet

# 2. Is the API server container running?
crictl ps | grep kube-apiserver
# If not: check crictl ps -a for crashed containers

# 3. Check API server logs
crictl logs <apiserver-container-id>

# 4. Check the manifest for errors
cat /etc/kubernetes/manifests/kube-apiserver.yaml

# 5. Check certificates
kubeadm certs check-expiration

# 6. Check if kubeconfig is correct
echo $KUBECONFIG
cat ~/.kube/config | grep server

6.2 Scenario: Node NotReady

# Symptom
kubectl get nodes
# worker-1   NotReady

# On the worker node:
# 1. Check kubelet
systemctl status kubelet
journalctl -u kubelet --no-pager | tail -50

# 2. Check container runtime
systemctl status containerd
crictl info

# 3. Common fixes
systemctl restart containerd
systemctl daemon-reload
systemctl restart kubelet

# 4. Check networking
ip addr show
ping <control-plane-ip>
nc -zv <control-plane-ip> 6443

6.3 Scenario: Pods Not Being Scheduled

# Symptom: new pods stuck in Pending

# 1. Is the scheduler running?
kubectl get pods -n kube-system | grep scheduler
crictl ps | grep scheduler

# 2. Check scheduler logs
crictl logs <scheduler-container-id>

# 3. Check the scheduler manifest
cat /etc/kubernetes/manifests/kube-scheduler.yaml

# 4. If scheduler is running, check pod events
kubectl describe pod <pending-pod>
# Look for: Insufficient cpu/memory, taints, affinity mismatch

6.4 Scenario: Deployments Not Scaling

# Symptom: kubectl scale works but no new pods appear

# 1. Is the controller manager running?
kubectl get pods -n kube-system | grep controller-manager
crictl ps | grep controller-manager

# 2. Check controller manager logs
crictl logs <controller-manager-container-id>

# 3. Check the manifest
cat /etc/kubernetes/manifests/kube-controller-manager.yaml

7. Practice Exercises

Exercise 1 — Break and Fix the API Server

# 1. Introduce a typo in the API server manifest
#    SSH to the control plane node
cp /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/kube-apiserver.yaml.bak

# Change --etcd-servers to a wrong port
sed -i 's/2379/2399/' /etc/kubernetes/manifests/kube-apiserver.yaml

# 2. Wait 30 seconds, then try kubectl
kubectl get nodes
# Should fail — connection refused

# 3. Diagnose
crictl ps -a | grep kube-apiserver
crictl logs <container-id>
# Should show: connection refused to etcd on port 2399

# 4. Fix
cp /tmp/kube-apiserver.yaml.bak /etc/kubernetes/manifests/kube-apiserver.yaml

# 5. Verify
sleep 30
kubectl get nodes

Exercise 2 — Break and Fix kubelet

# 1. Stop kubelet
systemctl stop kubelet

# 2. Observe from another node (or before kubelet stops)
kubectl get nodes
# Control plane node goes NotReady after ~40 seconds

# 3. Diagnose
systemctl status kubelet
# Active: inactive (dead)

# 4. Fix
systemctl start kubelet

# 5. Verify
kubectl get nodes
# Should return to Ready

Exercise 3 — Break and Fix the Scheduler

# 1. Move the scheduler manifest
mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/

# 2. Create a test pod
kubectl run test-sched --image=nginx
kubectl get pods
# test-sched   0/1   Pending   ← no scheduler to place it

# 3. Diagnose
kubectl get pods -n kube-system | grep scheduler
# No scheduler pod

crictl ps | grep scheduler
# Nothing

ls /etc/kubernetes/manifests/
# kube-scheduler.yaml is missing

# 4. Fix
mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/

# 5. Verify
sleep 15
kubectl get pods
# test-sched   1/1   Running

# 6. Clean up
kubectl delete pod test-sched

Exercise 4 — Investigate Node Conditions

# 1. Check all node conditions
kubectl describe node <node-name> | grep -A20 Conditions

# 2. Check resource allocation
kubectl describe node <node-name> | grep -A10 "Allocated resources"

# 3. Check for taints
kubectl describe node <node-name> | grep Taints

# 4. Check kubelet logs for any warnings
journalctl -u kubelet --no-pager | grep -i "error\|warning" | tail -20

# 5. Check disk and memory
df -h
free -m

Exercise 5 — Full Cluster Health Check

# Run through this checklist on any cluster:

# 1. API server reachable?
kubectl cluster-info

# 2. All nodes Ready?
kubectl get nodes

# 3. All system pods running?
kubectl get pods -n kube-system

# 4. Certificates valid?
kubeadm certs check-expiration

# 5. etcd healthy?
ETCDCTL_API=3 etcdctl endpoint health \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# 6. DNS working?
kubectl run dns-test --image=busybox -it --rm --restart=Never -- nslookup kubernetes

# 7. Any warning events?
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp' | tail -20
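
The first three checklist items can be wrapped into one function that counts failures. A minimal sketch (cluster_health is an illustrative name; the snippet only defines the function, so run it yourself on a node with kubectl access; the kube-system filter treats anything not Running or Completed as unhealthy, which is a simplification):

```shell
# Wrap the start of the checklist in one function; non-zero exit on any failure.
cluster_health() {
  fails=0
  kubectl cluster-info > /dev/null 2>&1 \
    || { echo "FAIL: API server unreachable"; fails=$((fails + 1)); }
  [ -z "$(kubectl get nodes --no-headers 2>/dev/null | awk '$2 != "Ready"')" ] \
    || { echo "FAIL: node(s) not Ready"; fails=$((fails + 1)); }
  [ -z "$(kubectl get pods -n kube-system --no-headers 2>/dev/null |
          awk '$3 != "Running" && $3 != "Completed"')" ] \
    || { echo "FAIL: kube-system pod(s) unhealthy"; fails=$((fails + 1)); }
  echo "checks failed: $fails"
  [ "$fails" -eq 0 ]
}
```

Usage on a node: cluster_health && echo OK. Certificates, etcd, and DNS still need the explicit commands above.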

8. Key Takeaways for the CKA Exam

| Point | Detail |
|---|---|
| kubectl broken → SSH to node | Use crictl and journalctl when kubectl doesn't work |
| kubelet is the root of everything | If kubelet is down, static pods (control plane) are down |
| crictl ps -a | Shows stopped/crashed containers — essential for diagnosing restarts |
| crictl logs <id> | First thing to check for any crashed control plane component |
| journalctl -u kubelet | kubelet is a systemd service, not a pod — use journalctl |
| Check the manifest | Most control plane issues are typos in /etc/kubernetes/manifests/ |
| systemctl daemon-reload && restart kubelet | The universal "try this first" for kubelet issues |
| Node NotReady | Check kubelet → container runtime → networking, in that order |
| Node conditions | kubectl describe node — look at Conditions, Taints, Allocated resources |
| Scheduler down = Pending pods | Existing pods keep running; new pods can't be placed |
| Controller manager down = no self-healing | Replicas not maintained, no scaling, no node monitoring |
| etcd down = API server errors | Check disk space, certificates, data directory |

Previous: 14-storage.md — Storage

Next: 16-troubleshooting-applications.md — Troubleshooting Applications