
16 — Troubleshooting Applications

← 15-troubleshooting-clusters.md | → 17-troubleshooting-networking.md

Overview

This file covers Module 5 — Troubleshooting (30%), topics 5–8:

| # | Topic | Section |
|---|-------|---------|
| 5 | Pod troubleshooting (CrashLoopBackOff, ImagePullBackOff, Pending…) | Pod Troubleshooting |
| 6 | Service and endpoint troubleshooting | Service & Endpoint Troubleshooting |
| 7 | DNS resolution debugging | DNS Debugging |
| 8 | Resource quota and LimitRange issues | Quota & LimitRange Issues |

Pod Troubleshooting

Diagnostic Flow

Pod not running?
├─ Pending ──────────► Scheduling problem
│                      kubectl describe pod → Events
├─ ImagePullBackOff ─► Image/registry problem
│                      Check image name, tag, pull secret
├─ CrashLoopBackOff ► Container starts then dies
│                      kubectl logs / kubectl logs --previous
├─ CreateContainerConfigError ► ConfigMap/Secret missing
│                      kubectl describe pod → Events
├─ Init:Error ───────► Init container failing
│                      kubectl logs <pod> -c <init-container>
└─ Running but wrong ► App-level issue
                       kubectl logs, kubectl exec

Status → Cause → Fix Reference

| Status | Common Causes | Key Commands / Fix |
|---|---|---|
| Pending | No node matches requests/selectors/affinity/taints; PVC unbound; ResourceQuota exceeded | `kubectl describe pod` → Events section |
| ImagePullBackOff | Wrong image name/tag; private registry without imagePullSecret; network to registry blocked | `kubectl describe pod` → look for "Failed to pull image" |
| CrashLoopBackOff | App error; wrong command/args; missing env var; readiness/liveness misconfigured | `kubectl logs <pod> --previous` |
| CreateContainerConfigError | Referenced ConfigMap or Secret doesn't exist | `kubectl describe pod` → "Error: configmaps \"x\" not found" |
| ErrImageNeverPull | `imagePullPolicy: Never` but image not on node | Change policy or pre-pull image |
| OOMKilled | Container exceeded memory limit | `kubectl describe pod` → Last State: OOMKilled; raise limit or fix leak |
| Init:CrashLoopBackOff | Init container failing | `kubectl logs <pod> -c <init-container-name>` |

CrashLoopBackOff Deep Dive

The backoff timer doubles each restart: 10s → 20s → 40s → 80s → … capped at 5 minutes.
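The kubelet's exact timing is an implementation detail, but the doubling-with-cap rule above can be sketched as a small shell function (the function name and values here are illustrative, not a real kubelet API):

```shell
# Sketch: model the CrashLoopBackOff delay after N restarts —
# start at 10s, double each restart, cap at 300s (5 minutes).
backoff_delay() {
  restarts=$1
  delay=10
  i=0
  while [ "$i" -lt "$restarts" ]; do
    delay=$((delay * 2))
    [ "$delay" -gt 300 ] && delay=300
    i=$((i + 1))
  done
  echo "$delay"
}

backoff_delay 0   # → 10  (seconds, after the first crash)
backoff_delay 3   # → 80
backoff_delay 6   # → 300 (capped)
```

The practical takeaway: after a handful of crashes the pod sits idle for minutes between attempts, so don't wait for the next restart — pull the previous container's logs immediately.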

# See restart count and last state
kubectl describe pod <pod> | grep -A5 "Last State"

# Logs from the PREVIOUS (crashed) container
kubectl logs <pod> --previous

# If container exits too fast to exec into, override entrypoint
kubectl run debug --image=<same-image> --command -- sleep 3600
kubectl exec -it debug -- sh

Pending Pod Checklist

# 1. Check events
kubectl describe pod <pod> | tail -20

# 2. Common event messages and meaning:
#    "Insufficient cpu"        → node resources exhausted
#    "Insufficient memory"     → node resources exhausted
#    "didn't match Pod's node affinity/selector" → scheduling constraint
#    "persistentvolumeclaim not found" → PVC missing or unbound
#    "exceeded quota"          → ResourceQuota hit

# 3. Check node capacity vs requests
kubectl describe nodes | grep -A5 "Allocated resources"

# 4. Check if PVC is bound
kubectl get pvc

ImagePullBackOff Checklist

# 1. Verify image exists (typo is #1 cause on exam)
kubectl describe pod <pod> | grep "Image:"

# 2. Check pull secret
kubectl get pod <pod> -o jsonpath='{.spec.imagePullSecrets}'
kubectl get secret <secret> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

# 3. Test pull manually on node (if SSH access)
crictl pull <image>
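The decoded pull secret is a JSON document keyed by registry host; the registry in the image name must match one of those keys or the pull still fails. A local sketch of what to look for (the payload and `registry.example.com` below are made up — the real value comes from the jsonpath command in step 2):

```shell
# Hypothetical .dockerconfigjson payload, base64-encoded the same way
# the Secret stores it.
PAYLOAD=$(printf '%s' '{"auths":{"registry.example.com":{"auth":"dXNlcjpwYXNz"}}}' | base64)

# Decode and confirm which registry host the secret covers.
printf '%s' "$PAYLOAD" | base64 -d
printf '%s' "$PAYLOAD" | base64 -d | grep -o 'registry.example.com'
```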

Multi-Container Pod Debugging

# List all containers in a pod
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].name}'

# Logs for specific container
kubectl logs <pod> -c <container-name>

# Exec into specific container
kubectl exec -it <pod> -c <container-name> -- sh

Service & Endpoint Troubleshooting

Service Connectivity Flow

Client → Service (ClusterIP) → Endpoints → Pod
         │                      │
         │                      └─ Empty? Selector mismatch
         └─ No response? kube-proxy / iptables issue

Checklist:
1. Does the Service exist?
2. Does it have Endpoints?
3. Do Endpoints point to running Pods?
4. Are Pod labels matching Service selector?
5. Is the targetPort correct?

The Selector–Label Match Problem

This is the #1 Service troubleshooting issue on the CKA exam.

# Service
apiVersion: v1
kind: Service
metadata:
  name: my-svc
spec:
  selector:
    app: myapp        # ← must match pod labels EXACTLY
  ports:
  - port: 80
    targetPort: 8080  # ← must match container port
# Step 1: Check Service selector
kubectl get svc my-svc -o wide
# Shows SELECTOR column

# Step 2: Check Endpoints
kubectl get endpoints my-svc
# Empty ENDPOINTS = selector doesn't match any pod

# Step 3: Compare labels
kubectl get pods --show-labels
kubectl get svc my-svc -o jsonpath='{.spec.selector}'

# Step 4: Verify targetPort matches container
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].ports}'

Port Mapping Confusion

Service port (what clients connect to)
  port: 80  ──────►  targetPort: 8080  (container listens here)
                     containerPort: 8080

| Field | Where | Meaning |
|---|---|---|
| port | Service spec | What clients use |
| targetPort | Service spec | Container's listening port |
| containerPort | Pod spec | What the app actually binds to |
| nodePort | Service spec (NodePort type) | External access port (30000-32767) |
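One way to sidestep the port/targetPort mismatch entirely is to name the container port and have the Service reference it by name (the names `web` and `http` below are illustrative — a sketch, not the exam's required solution):

```yaml
# targetPort may reference a NAMED containerPort, so changing the
# numeric port in the pod spec never silently breaks the Service.
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: myapp
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - name: http          # named port
      containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-svc
spec:
  selector:
    app: myapp
  ports:
  - port: 80
    targetPort: http      # resolves to the containerPort named "http"
```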

Endpoint Debugging

# Endpoints empty → no pods match selector
kubectl get ep my-svc

# EndpointSlices (newer)
kubectl get endpointslices -l kubernetes.io/service-name=my-svc

# Manually verify connectivity to pod IP
kubectl run tmp --image=busybox --rm -it -- wget -qO- <pod-ip>:8080

Service Not Reachable Checklist

# 1. Service exists?
kubectl get svc my-svc

# 2. Endpoints populated?
kubectl get ep my-svc

# 3. Pod running and ready?
kubectl get pods -l app=myapp

# 4. Pod actually serving traffic?
kubectl exec <pod> -- curl -s localhost:8080

# 5. kube-proxy running on nodes?
kubectl get pods -n kube-system -l k8s-app=kube-proxy

# 6. iptables rules present? (on node)
iptables-save | grep my-svc

DNS Debugging

Kubernetes DNS Architecture

Pod resolves "my-svc"
/etc/resolv.conf → nameserver <CoreDNS ClusterIP>
                    search <ns>.svc.cluster.local svc.cluster.local cluster.local
CoreDNS (Deployment in kube-system)
Returns ClusterIP of Service
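The search list is why a bare `my-svc` resolves at all: the resolver appends each suffix in turn until one answers. A sketch of that expansion for a pod assumed to be in the `default` namespace:

```shell
# Simulate the resolver's search-list expansion (suffixes taken from a
# typical pod resolv.conf; cluster domain "cluster.local" is assumed).
SEARCH="default.svc.cluster.local svc.cluster.local cluster.local"

expand_query() {
  name=$1
  for suffix in $SEARCH; do
    echo "$name.$suffix"   # candidates tried in order
  done
}

expand_query my-svc
```

The first candidate, `my-svc.default.svc.cluster.local`, matches a Service in the pod's own namespace, so resolution stops there — and it's also why cross-namespace lookups need at least `<svc>.<ns>`.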

DNS Record Formats

| Resource | DNS Name | Resolves To |
|---|---|---|
| Service (ClusterIP) | `<svc>.<ns>.svc.cluster.local` | ClusterIP |
| Service (Headless) | `<svc>.<ns>.svc.cluster.local` | Set of Pod IPs |
| Pod | `<pod-ip-dashed>.<ns>.pod.cluster.local` | Pod IP |
| StatefulSet Pod | `<pod-name>.<svc>.<ns>.svc.cluster.local` | Pod IP |
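The "dashed IP" form in the Pod row is mechanical — dots become dashes. A sketch (the IP `10.244.1.5` and namespace are example values):

```shell
# Derive a pod's DNS name from its IP: dots in the IP become dashes,
# then namespace and the pod.cluster.local suffix are appended.
pod_dns_name() {
  ip=$1
  ns=$2
  echo "$(echo "$ip" | tr '.' '-').$ns.pod.cluster.local"
}

pod_dns_name 10.244.1.5 default
# → 10-244-1-5.default.pod.cluster.local
```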

DNS Debugging Chain

# 1. Check resolv.conf in pod
kubectl exec <pod> -- cat /etc/resolv.conf
# Should show: nameserver 10.96.0.10 (CoreDNS ClusterIP)

# 2. Test DNS from a debug pod
kubectl run dnstest --image=busybox:1.36 --rm -it -- nslookup my-svc
kubectl run dnstest --image=busybox:1.36 --rm -it -- nslookup my-svc.default.svc.cluster.local

# 3. Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# 4. Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# 5. Check CoreDNS Service
kubectl get svc -n kube-system kube-dns

# 6. Check CoreDNS ConfigMap
kubectl get cm -n kube-system coredns -o yaml

Common DNS Failures

| Symptom | Cause | Fix |
|---|---|---|
| nslookup times out | CoreDNS pods not running | Check CoreDNS Deployment, restart if needed |
| nslookup returns NXDOMAIN | Wrong service name or namespace | Use FQDN: `<svc>.<ns>.svc.cluster.local` |
| DNS works from some pods, not others | NetworkPolicy blocking UDP/53 to kube-dns | Add egress rule allowing DNS |
| resolv.conf has wrong nameserver | kubelet --cluster-dns misconfigured | Check kubelet config |
| Cross-namespace resolution fails | Using short name without namespace | Use `<svc>.<ns>` or FQDN |
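For the NetworkPolicy case, the egress rule might look like the sketch below (the `k8s-app: kube-dns` and `kubernetes.io/metadata.name` labels are the conventional ones — verify against your cluster with `kubectl get pods -n kube-system --show-labels`):

```yaml
# Sketch: allow all pods in this namespace to reach kube-dns on port 53.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP        # DNS falls back to TCP for large responses
      port: 53
```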

DNS Policy Reminder

# Default: pod gets cluster DNS
spec:
  dnsPolicy: ClusterFirst    # default — uses CoreDNS

# Other policies (rarely changed on exam):
# - Default          → inherits node's resolv.conf
# - ClusterFirstWithHostNet → ClusterFirst for hostNetwork pods
# - None             → requires dnsConfig to be set

CKA Tip: If DNS doesn't work, always check CoreDNS pods first. 90% of DNS issues = CoreDNS not running or misconfigured.


Quota & LimitRange Issues

How Quotas Block Pod Creation

Pod create request (kubectl run / apply)
Admission Controller checks:
    ├─ ResourceQuota exceeded? → REJECTED
    │   "exceeded quota: mem-quota, requested: memory=512Mi, used: 1536Mi, limited: 2Gi"
    ├─ LimitRange violated? → REJECTED or DEFAULTED
    │   "minimum memory usage per Container is 64Mi, but request is 32Mi"
    └─ Pass → Pod created

ResourceQuota Troubleshooting

# Check quota usage in namespace
kubectl get resourcequota -n <ns>
kubectl describe resourcequota <name> -n <ns>

# Output shows:
# Resource    Used    Hard
# --------    ----    ----
# cpu         800m    2
# memory      1Gi     2Gi
# pods        5       10

Common quota-related failures:

| Error Message | Meaning | Fix |
|---|---|---|
| exceeded quota | Sum of requests exceeds Hard limit | Reduce requests or increase quota |
| must specify requests/limits | Quota exists but pod has no requests | Add resource requests to pod spec |
| Deployment creates RS but no pods | Quota silently blocks pod creation | `kubectl describe rs <rs>` → Events |

Critical exam pattern: When a ResourceQuota is set in a namespace, every pod must specify requests (and limits, if the quota constrains them) for each resource the quota covers. Otherwise creation is rejected.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
    pods: "10"
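A pod that passes this quota declares its resources explicitly — the values below are illustrative and sized to fit within compute-quota above:

```yaml
# Without these requests/limits, admission rejects the pod outright
# in a namespace where compute-quota constrains them.
apiVersion: v1
kind: Pod
metadata:
  name: quota-ok
  namespace: dev
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi
```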

LimitRange Troubleshooting

LimitRange sets per-container defaults and constraints.

kubectl describe limitrange -n <ns>

# Output:
# Type        Resource  Min   Max   Default  DefaultRequest
# ----        --------  ---   ---   -------  --------------
# Container   cpu       50m   2     500m     100m
# Container   memory    64Mi  1Gi   256Mi    128Mi

| Scenario | Behavior |
|---|---|
| Pod has no requests/limits | LimitRange injects defaults |
| Pod requests below Min | Rejected |
| Pod limits above Max | Rejected |
| Pod has limit but no request | Request defaults to the limit |

apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limits
  namespace: dev
spec:
  limits:
  - type: Container
    default:          # default limit
      memory: 256Mi
    defaultRequest:   # default request
      memory: 128Mi
    min:
      memory: 64Mi
    max:
      memory: 1Gi

Debugging Quota/LimitRange Rejections

# Pod stuck — check events
kubectl describe pod <pod> | grep -A3 "Events"

# Deployment not scaling — check ReplicaSet events
kubectl get rs -n <ns>
kubectl describe rs <rs-name> -n <ns> | grep -A5 "Events"
# Look for: "Error creating: pods ... is forbidden: exceeded quota"

# Quick status check
kubectl get resourcequota,limitrange -n <ns>

Interaction Between Quota and LimitRange

Namespace has both ResourceQuota and LimitRange:

1. LimitRange injects defaults into pods without requests/limits
2. ResourceQuota checks totals against hard limits
3. Both must pass for pod to be admitted

Common pattern on exam:
- ResourceQuota requires requests → pod has none → REJECTED
- Fix: add LimitRange with defaults so pods auto-get requests

CKA Tips

  • Pod status tells you where to look: Pending = scheduling, ImagePull = image, CrashLoop = app logs, OOMKilled = memory
  • kubectl logs --previous is essential for CrashLoopBackOff — the current container may have no logs yet
  • Empty Endpoints = selector/label mismatch — this is the most common Service fix on the exam
  • DNS debug pod: kubectl run tmp --image=busybox:1.36 --rm -it -- nslookup <svc>
  • Quota blocks are silent for Deployments — always check ReplicaSet events, not just Deployment events
  • kubectl describe is your best friend — Events section at the bottom has the answer 90% of the time

Practice Exercises

Exercise 1 — Fix a CrashLoopBackOff Pod

# Create a broken pod
kubectl run crash-pod --image=busybox --command -- /bin/sh -c "exit 1"

# Tasks:
# 1. Identify why the pod is crashing
# 2. Fix it so it runs successfully (hint: change the command)

Exercise 2 — Fix Service Connectivity

# Create pod and broken service
kubectl run web --image=nginx --port=80 --labels="app=web"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    app: webapp
  ports:
  - port: 80
    targetPort: 8080
EOF

# Tasks:
# 1. Why does `kubectl get ep web-svc` show no endpoints?
# 2. Fix the Service so it routes to the pod
# 3. Verify connectivity

Exercise 3 — DNS Troubleshooting

# Tasks:
# 1. Deploy a busybox pod and verify DNS resolves kubernetes.default.svc.cluster.local
# 2. Check what nameserver the pod uses
# 3. Verify CoreDNS pods are running and check their logs
# 4. Resolve a service in a different namespace using FQDN

Exercise 4 — Quota Rejection

# Setup
kubectl create namespace quota-test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: strict-quota
  namespace: quota-test
spec:
  hard:
    requests.cpu: "500m"
    requests.memory: 256Mi
    pods: "2"
EOF

# Tasks:
# 1. Try: kubectl run test --image=nginx -n quota-test — why does it fail?
# 2. Fix it by adding resource requests
# 3. Create a LimitRange that auto-injects defaults
# 4. Verify a pod without explicit requests now gets admitted

Exercise 5 — Break/Fix: Full Application Stack

# Deploy a broken stack and fix all issues:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: app
  labels:
    tier: frontend
spec:
  containers:
  - name: app
    image: nginx:latestttt
    ports:
    - containerPort: 80
    env:
    - name: DB_HOST
      valueFrom:
        configMapKeyRef:
          name: app-config
          key: db_host
---
apiVersion: v1
kind: Service
metadata:
  name: app-svc
spec:
  selector:
    tier: backend
  ports:
  - port: 80
    targetPort: 9090
EOF

# Find and fix ALL issues (there are at least 4)

Key Takeaways

| Concept | Key Point |
|---|---|
| Pod status | Directly indicates troubleshooting path |
| `--previous` flag | Gets logs from crashed container |
| Empty Endpoints | Selector doesn't match any pod labels |
| targetPort | Must match what container actually listens on |
| DNS debug | nslookup from busybox pod → check CoreDNS |
| resolv.conf | Shows nameserver and search domains |
| ResourceQuota | Blocks pods silently via ReplicaSet |
| LimitRange | Injects defaults; rejects min/max violations |
| Quota + no requests | Pod rejected — add LimitRange defaults |

← 15-troubleshooting-clusters.md | → 17-troubleshooting-networking.md