# 16 — Troubleshooting Applications

← 15-troubleshooting-clusters.md | → 17-troubleshooting-networking.md

## Overview

This file covers Module 5 — Troubleshooting (30%), topics 5–8.
## Pod Troubleshooting

### Diagnostic Flow

```text
Pod not running?
│
├─ Pending ──────────────────► Scheduling problem
│      kubectl describe pod → Events
│
├─ ImagePullBackOff ─────────► Image/registry problem
│      Check image name, tag, pull secret
│
├─ CrashLoopBackOff ─────────► Container starts then dies
│      kubectl logs / kubectl logs --previous
│
├─ CreateContainerConfigError ► ConfigMap/Secret missing
│      kubectl describe pod → Events
│
├─ Init:Error ───────────────► Init container failing
│      kubectl logs <pod> -c <init-container>
│
└─ Running but wrong ────────► App-level issue
       kubectl logs, kubectl exec
```
### Status → Cause → Fix Reference

| Status | Common Causes | Key Commands |
|---|---|---|
| Pending | No node matches requests/selectors/affinity/taints; PVC unbound; ResourceQuota exceeded | `kubectl describe pod` → Events section |
| ImagePullBackOff | Wrong image name/tag; private registry without imagePullSecret; network to registry blocked | `kubectl describe pod` → look for "Failed to pull image" |
| CrashLoopBackOff | App error; wrong command/args; missing env var; readiness/liveness misconfigured | `kubectl logs <pod> --previous` |
| CreateContainerConfigError | Referenced ConfigMap or Secret doesn't exist | `kubectl describe pod` → "Error: configmaps \"x\" not found" |
| ErrImageNeverPull | `imagePullPolicy: Never` but image not on node | Change policy or pre-pull image |
| OOMKilled | Container exceeded memory limit | `kubectl describe pod` → Last State: OOMKilled; raise limit or fix leak |
| Init:CrashLoopBackOff | Init container failing | `kubectl logs <pod> -c <init-container-name>` |
### CrashLoopBackOff Deep Dive

The backoff timer doubles with each restart: 10s → 20s → 40s → 80s → … capped at 5 minutes.

```bash
# See restart count and last state
kubectl describe pod <pod> | grep -A5 "Last State"

# Logs from the PREVIOUS (crashed) container
kubectl logs <pod> --previous

# If the container exits too fast to exec into, override the entrypoint
kubectl run debug --image=<same-image> --command -- sleep 3600
kubectl exec -it debug -- sh
```
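The doubling schedule above can be sketched in plain bash, with no cluster needed — `backoff` is an illustrative helper (not a kubelet API), assuming the 10s-doubling-capped-at-5-minutes behavior just described:

```bash
# CrashLoopBackOff delay (in seconds) before restart attempt n:
# starts at 10s, doubles each restart, capped at 300s (5 minutes)
backoff() {
  local delay=10 i
  for (( i=1; i<$1; i++ )); do
    delay=$(( delay * 2 ))
    if (( delay > 300 )); then delay=300; fi
  done
  echo "$delay"
}

for n in 1 2 3 4 5 6 7; do printf '%s ' "$(backoff "$n")"; done; echo
# → 10 20 40 80 160 300 300
```

The practical consequence: after a few crashes you may wait up to 5 minutes between restart attempts, which is why `kubectl logs --previous` matters — the current container may not have run yet.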
### Pending Pod Checklist

```bash
# 1. Check events
kubectl describe pod <pod> | tail -20

# 2. Common event messages and their meaning:
#    "Insufficient cpu"                          → node resources exhausted
#    "Insufficient memory"                       → node resources exhausted
#    "didn't match Pod's node affinity/selector" → scheduling constraint
#    "persistentvolumeclaim not found"           → PVC missing or unbound
#    "exceeded quota"                            → ResourceQuota hit

# 3. Check node capacity vs requests
kubectl describe nodes | grep -A5 "Allocated resources"

# 4. Check whether the PVC is bound
kubectl get pvc
```
### ImagePullBackOff Checklist

```bash
# 1. Verify the image exists (a typo is the #1 cause on the exam)
kubectl describe pod <pod> | grep "Image:"

# 2. Check the pull secret
kubectl get pod <pod> -o jsonpath='{.spec.imagePullSecrets}'
kubectl get secret <secret> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

# 3. Test the pull manually on the node (if you have SSH access)
crictl pull <image>
```
### Multi-Container Pod Debugging

```bash
# List all containers in a pod
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].name}'

# Logs for a specific container
kubectl logs <pod> -c <container-name>

# Exec into a specific container
kubectl exec -it <pod> -c <container-name> -- sh
```
## Service & Endpoint Troubleshooting

### Service Connectivity Flow

```text
Client → Service (ClusterIP) → Endpoints → Pod
              │                    │
              │                    └─ Empty? Selector mismatch
              └─ No response? kube-proxy / iptables issue
```

Checklist:

1. Does the Service exist?
2. Does it have Endpoints?
3. Do the Endpoints point to running Pods?
4. Do the Pod labels match the Service selector?
5. Is the targetPort correct?
### The Selector–Label Match Problem

This is the #1 Service troubleshooting issue on the CKA exam.

```yaml
# Service
apiVersion: v1
kind: Service
metadata:
  name: my-svc
spec:
  selector:
    app: myapp          # ← must match the pod labels EXACTLY
  ports:
    - port: 80
      targetPort: 8080  # ← must match the container port
```

```bash
# Step 1: Check the Service selector (shows the SELECTOR column)
kubectl get svc my-svc -o wide

# Step 2: Check the Endpoints (empty ENDPOINTS = selector matches no pod)
kubectl get endpoints my-svc

# Step 3: Compare labels
kubectl get pods --show-labels
kubectl get svc my-svc -o jsonpath='{.spec.selector}'

# Step 4: Verify targetPort matches the container
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].ports}'
```
### Port Mapping Confusion

```text
Service port (what clients connect to)
        │
        ▼
   port: 80 ──────► targetPort: 8080 (container listens here)
                            │
                            ▼
                    containerPort: 8080
```
| Field | Where | Must Match |
|---|---|---|
| port | Service spec | What clients use |
| targetPort | Service spec | Container's listening port |
| containerPort | Pod spec | What the app actually binds to |
| nodePort | Service spec (NodePort type) | External access port (30000–32767) |
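Putting the fields together: a minimal sketch of a matched Service/Pod pair, where the names (`web-svc`, `my-app:1.0`) are illustrative, showing how `port`, `targetPort`, and `containerPort` line up:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-svc               # illustrative name
spec:
  selector:
    app: web
  ports:
    - port: 80                # clients connect to web-svc:80
      targetPort: 8080        # traffic is forwarded to this container port
---
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web                  # must match the Service selector
spec:
  containers:
    - name: app
      image: my-app:1.0       # hypothetical image whose app binds 0.0.0.0:8080
      ports:
        - containerPort: 8080 # documents the port; the app must actually bind it
```

Note that `containerPort` is informational — traffic reaches the pod as long as the app listens on the `targetPort`, even if `containerPort` is omitted.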
### Endpoint Debugging

```bash
# Endpoints empty → no pods match the selector
kubectl get ep my-svc

# EndpointSlices (the newer API)
kubectl get endpointslices -l kubernetes.io/service-name=my-svc

# Manually verify connectivity to the pod IP
kubectl run tmp --image=busybox --rm -it -- wget -qO- <pod-ip>:8080
```
### Service Not Reachable Checklist

```bash
# 1. Does the Service exist?
kubectl get svc my-svc

# 2. Are the Endpoints populated?
kubectl get ep my-svc

# 3. Is the pod running and ready?
kubectl get pods -l app=myapp

# 4. Is the pod actually serving traffic?
kubectl exec <pod> -- curl -s localhost:8080

# 5. Is kube-proxy running on the nodes?
kubectl get pods -n kube-system -l k8s-app=kube-proxy

# 6. Are the iptables rules present? (run on the node)
iptables-save | grep my-svc
```
## DNS Debugging

### Kubernetes DNS Architecture

```text
Pod resolves "my-svc"
        │
        ▼
/etc/resolv.conf → nameserver <CoreDNS ClusterIP>
                   search <ns>.svc.cluster.local svc.cluster.local cluster.local
        │
        ▼
CoreDNS (Deployment in kube-system)
        │
        ▼
Returns the ClusterIP of the Service
```
| Resource | DNS Name | Resolves To |
|---|---|---|
| Service (ClusterIP) | `<svc>.<ns>.svc.cluster.local` | ClusterIP |
| Service (Headless) | `<svc>.<ns>.svc.cluster.local` | Set of Pod IPs |
| Pod | `<pod-ip-dashed>.<ns>.pod.cluster.local` | Pod IP |
| StatefulSet Pod | `<pod-name>.<svc>.<ns>.svc.cluster.local` | Pod IP |
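The name patterns in the table are mechanical, so they can be sketched as plain string builders — `svc_fqdn` and `pod_dns` are illustrative bash helpers, not kubectl commands; note that the pod form replaces the dots in the IP with dashes:

```bash
# Service DNS name: <svc>.<ns>.svc.cluster.local
svc_fqdn() { printf '%s.%s.svc.cluster.local\n' "$1" "$2"; }

# Pod DNS name: <pod-ip-dashed>.<ns>.pod.cluster.local
pod_dns() { printf '%s.%s.pod.cluster.local\n' "${1//./-}" "$2"; }

svc_fqdn my-svc default       # → my-svc.default.svc.cluster.local
pod_dns 10.244.1.23 default   # → 10-244-1-23.default.pod.cluster.local
```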
### DNS Debugging Chain

```bash
# 1. Check resolv.conf inside the pod
kubectl exec <pod> -- cat /etc/resolv.conf
# Should show: nameserver 10.96.0.10 (the CoreDNS ClusterIP)

# 2. Test DNS from a debug pod
kubectl run dnstest --image=busybox:1.36 --rm -it -- nslookup my-svc
kubectl run dnstest --image=busybox:1.36 --rm -it -- nslookup my-svc.default.svc.cluster.local

# 3. Check that CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# 4. Check the CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# 5. Check the CoreDNS Service
kubectl get svc -n kube-system kube-dns

# 6. Check the CoreDNS ConfigMap
kubectl get cm -n kube-system coredns -o yaml
```
### Common DNS Failures

| Symptom | Cause | Fix |
|---|---|---|
| nslookup times out | CoreDNS pods not running | Check the CoreDNS Deployment, restart if needed |
| nslookup returns NXDOMAIN | Wrong service name or namespace | Use the FQDN: `<svc>.<ns>.svc.cluster.local` |
| DNS works from some pods, not others | NetworkPolicy blocking UDP/53 to kube-dns | Add an egress rule allowing DNS |
| resolv.conf has wrong nameserver | kubelet `--cluster-dns` misconfigured | Check the kubelet config |
| Cross-namespace resolution fails | Using a short name without the namespace | Use `<svc>.<ns>` or the FQDN |
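For the NetworkPolicy row, a sketch of the kind of egress rule the fix refers to — the policy name and namespace are illustrative, and the `kubernetes.io/metadata.name` namespace label assumes Kubernetes ≥ 1.21, where it is set automatically:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns          # illustrative name
  namespace: dev           # adjust to the namespace whose pods lose DNS
spec:
  podSelector: {}          # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Remember that DNS uses TCP as a fallback for large responses, so allow both protocols on port 53.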
### DNS Policy Reminder

```yaml
# Default: the pod gets cluster DNS
spec:
  dnsPolicy: ClusterFirst  # default — uses CoreDNS

# Other policies (rarely changed on the exam):
# - Default                 → inherits the node's resolv.conf
# - ClusterFirstWithHostNet → ClusterFirst for hostNetwork pods
# - None                    → requires dnsConfig to be set
```
CKA Tip: If DNS doesn't work, always check CoreDNS pods first. 90% of DNS issues = CoreDNS not running or misconfigured.
## Quota & LimitRange Issues

### How Quotas Block Pod Creation

```text
kubectl create pod
        │
        ▼
Admission controller checks:
        │
        ├─ ResourceQuota exceeded? → REJECTED
        │    "exceeded quota: mem-quota, requested: memory=512Mi, used: 1536Mi, limited: 2Gi"
        │
        ├─ LimitRange violated? → REJECTED or DEFAULTED
        │    "minimum memory usage per Container is 64Mi, but request is 32Mi"
        │
        └─ Pass → Pod created
```
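The quota check itself is simple arithmetic: admission rejects the pod when the sum of what is already used plus the new request exceeds the hard limit. A local bash sketch (the `to_mi` helper and the example numbers are illustrative):

```bash
# Convert a memory quantity in Mi or Gi to Mi
to_mi() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
  esac
}

used=$(to_mi 1536Mi)   # already consumed by pods in the namespace
req=$(to_mi 768Mi)     # requested by the new pod
hard=$(to_mi 2Gi)      # the quota's hard limit

if (( used + req > hard )); then
  echo "rejected: exceeded quota"
else
  echo "admitted"
fi
# → rejected: exceeded quota (1536Mi + 768Mi = 2304Mi > 2048Mi)
```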
### ResourceQuota Troubleshooting

```bash
# Check quota usage in the namespace
kubectl get resourcequota -n <ns>
kubectl describe resourcequota <name> -n <ns>
# Output shows:
#   Resource  Used  Hard
#   --------  ----  ----
#   cpu       800m  2
#   memory    1Gi   2Gi
#   pods      5     10
```
Common quota-related failures:

| Error Message | Meaning | Fix |
|---|---|---|
| exceeded quota | Sum of requests exceeds the Hard limit | Reduce requests or increase the quota |
| must specify requests/limits | A quota exists but the pod has no requests | Add resource requests to the pod spec |
| Deployment creates RS but no pods | Quota silently blocks pod creation | `kubectl describe rs <rs>` → Events |
Critical exam pattern: When a ResourceQuota is set in a namespace, every pod must specify resource requests for the resources being quota'd. Otherwise creation is rejected.
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
    pods: "10"
```
### LimitRange Troubleshooting

LimitRange sets per-container defaults and constraints.

```bash
kubectl describe limitrange -n <ns>
# Output:
#   Type       Resource  Min   Max  Default  DefaultRequest
#   ----       --------  ---   ---  -------  --------------
#   Container  cpu       50m   2    500m     100m
#   Container  memory    64Mi  1Gi  256Mi    128Mi
```
| Scenario | Behavior |
|---|---|
| Pod has no requests/limits | LimitRange injects the defaults |
| Pod requests below Min | Rejected |
| Pod limits above Max | Rejected |
| Pod has a limit but no request | Request is set to the limit (or DefaultRequest if lower) |
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limits
  namespace: dev
spec:
  limits:
    - type: Container
      default:           # default limit
        memory: 256Mi
      defaultRequest:    # default request
        memory: 128Mi
      min:
        memory: 64Mi
      max:
        memory: 1Gi
```
### Debugging Quota/LimitRange Rejections

```bash
# Pod stuck — check the events
kubectl describe pod <pod> | grep -A3 "Events"

# Deployment not scaling — check the ReplicaSet events
kubectl get rs -n <ns>
kubectl describe rs <rs-name> -n <ns> | grep -A5 "Events"
# Look for: "Error creating: pods ... is forbidden: exceeded quota"

# Quick status check
kubectl get resourcequota,limitrange -n <ns>
```
### Interaction Between Quota and LimitRange

When a namespace has both a ResourceQuota and a LimitRange:

1. The LimitRange injects defaults into pods that lack requests/limits.
2. The ResourceQuota checks the totals against its hard limits.
3. Both must pass for the pod to be admitted.

Common pattern on the exam:

- The ResourceQuota requires requests → the pod has none → REJECTED.
- Fix: add a LimitRange with defaults so pods automatically get requests.
## CKA Tips

- Pod status tells you where to look: Pending = scheduling, ImagePull = image, CrashLoop = app logs, OOMKilled = memory.
- `kubectl logs --previous` is essential for CrashLoopBackOff — the current container may have no logs yet.
- Empty Endpoints = selector/label mismatch — the most common Service fix on the exam.
- DNS debug pod: `kubectl run tmp --image=busybox:1.36 --rm -it -- nslookup <svc>`
- Quota blocks are silent for Deployments — always check the ReplicaSet events, not just the Deployment events.
- `kubectl describe` is your best friend — the Events section at the bottom has the answer 90% of the time.
## Practice Exercises

### Exercise 1 — Fix a CrashLoopBackOff Pod

```bash
# Create a broken pod
kubectl run crash-pod --image=busybox --command -- /bin/sh -c "exit 1"

# Tasks:
# 1. Identify why the pod is crashing
# 2. Fix it so it runs successfully (hint: change the command)
```
### Exercise 2 — Fix Service Connectivity

```bash
# Create a pod and a broken Service
kubectl run web --image=nginx --port=80 --labels="app=web"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    app: webapp
  ports:
    - port: 80
      targetPort: 8080
EOF

# Tasks:
# 1. Why does `kubectl get ep web-svc` show no endpoints?
# 2. Fix the Service so it routes to the pod
# 3. Verify connectivity
```
### Exercise 3 — DNS Troubleshooting

```bash
# Tasks:
# 1. Deploy a busybox pod and verify DNS resolves kubernetes.default.svc.cluster.local
# 2. Check which nameserver the pod uses
# 3. Verify the CoreDNS pods are running and check their logs
# 4. Resolve a service in a different namespace using the FQDN
```
### Exercise 4 — Quota Rejection

```bash
# Setup
kubectl create namespace quota-test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: strict-quota
  namespace: quota-test
spec:
  hard:
    requests.cpu: "500m"
    requests.memory: 256Mi
    pods: "2"
EOF

# Tasks:
# 1. Try: kubectl run test --image=nginx -n quota-test — why does it fail?
# 2. Fix it by adding resource requests
# 3. Create a LimitRange that auto-injects defaults
# 4. Verify a pod without explicit requests now gets admitted
```
### Exercise 5 — Break/Fix: Full Application Stack

```bash
# Deploy a broken stack and fix all issues:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: app
  labels:
    tier: frontend
spec:
  containers:
    - name: app
      image: nginx:latestttt
      ports:
        - containerPort: 80
      env:
        - name: DB_HOST
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: db_host
---
apiVersion: v1
kind: Service
metadata:
  name: app-svc
spec:
  selector:
    tier: backend
  ports:
    - port: 80
      targetPort: 9090
EOF

# Find and fix ALL issues (there are at least 4)
```
## Key Takeaways

| Concept | Key Point |
|---|---|
| Pod status | Directly indicates the troubleshooting path |
| `--previous` flag | Gets logs from the crashed container |
| Empty Endpoints | Selector doesn't match any pod labels |
| targetPort | Must match what the container actually listens on |
| DNS debug | nslookup from a busybox pod → check CoreDNS |
| resolv.conf | Shows the nameserver and search domains |
| ResourceQuota | Blocks pods silently via the ReplicaSet |
| LimitRange | Injects defaults; rejects min/max violations |
| Quota + no requests | Pod rejected — add LimitRange defaults |