# 16 — Troubleshooting Applications

← 15-troubleshooting-clusters.md | → 17-troubleshooting-networking.md

## Overview

This file covers Module 5 — Troubleshooting (30%), topics 5–8.
## Pod Troubleshooting

### Diagnostic Flow

```text
Pod not running?
│
├─ Pending ──────────────────► Scheduling problem
│      kubectl describe pod → Events
│
├─ ImagePullBackOff ─────────► Image/registry problem
│      Check image name, tag, pull secret
│
├─ CrashLoopBackOff ─────────► Container starts then dies
│      kubectl logs / kubectl logs --previous
│
├─ CreateContainerConfigError ► ConfigMap/Secret missing
│      kubectl describe pod → Events
│
├─ Init:Error ───────────────► Init container failing
│      kubectl logs <pod> -c <init-container>
│
└─ Running but wrong ────────► App-level issue
       kubectl logs, kubectl exec
```
### Status → Cause → Fix Reference

| Status | Common Causes | Key Commands |
|---|---|---|
| Pending | No node matches requests/selectors/affinity/taints; PVC unbound; ResourceQuota exceeded | `kubectl describe pod` → Events section |
| ImagePullBackOff | Wrong image name/tag; private registry without imagePullSecret; network to registry blocked | `kubectl describe pod` → look for "Failed to pull image" |
| CrashLoopBackOff | App error; wrong command/args; missing env var; readiness/liveness misconfigured | `kubectl logs <pod> --previous` |
| CreateContainerConfigError | Referenced ConfigMap or Secret doesn't exist | `kubectl describe pod` → "Error: configmaps \"x\" not found" |
| ErrImageNeverPull | `imagePullPolicy: Never` but image not on node | Change policy or pre-pull image |
| OOMKilled | Container exceeded memory limit | `kubectl describe pod` → Last State: OOMKilled; raise limit or fix leak |
| Init:CrashLoopBackOff | Init container failing | `kubectl logs <pod> -c <init-container-name>` |
### CrashLoopBackOff Deep Dive

The backoff timer doubles with each restart: 10s → 20s → 40s → 80s → … capped at 5 minutes.

```bash
# See restart count and last state
kubectl describe pod <pod> | grep -A5 "Last State"

# Logs from the PREVIOUS (crashed) container
kubectl logs <pod> --previous

# If the container exits too fast to exec into, override the entrypoint
kubectl run debug --image=<same-image> --command -- sleep 3600
kubectl exec -it debug -- sh
```
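The doubling schedule above can be sketched in plain bash, with no cluster needed — `backoff` is an illustrative helper (not a kubelet API), assuming the 10s-doubling-capped-at-5-minutes behavior just described:

```bash
# CrashLoopBackOff delay (in seconds) before restart attempt n:
# starts at 10s, doubles each restart, capped at 300s (5 minutes)
backoff() {
  local delay=10 i
  for (( i=1; i<$1; i++ )); do
    delay=$(( delay * 2 ))
    if (( delay > 300 )); then delay=300; fi
  done
  echo "$delay"
}

for n in 1 2 3 4 5 6 7; do printf '%s ' "$(backoff "$n")"; done; echo
# → 10 20 40 80 160 300 300
```

The practical consequence: after a few crashes you may wait up to 5 minutes between restart attempts, which is why `kubectl logs --previous` matters — the current container may not have run yet.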
### Pending Pod Checklist

```bash
# 1. Check events
kubectl describe pod <pod> | tail -20

# 2. Common event messages and their meaning:
#    "Insufficient cpu"                          → node resources exhausted
#    "Insufficient memory"                       → node resources exhausted
#    "didn't match Pod's node affinity/selector" → scheduling constraint
#    "persistentvolumeclaim not found"           → PVC missing or unbound
#    "exceeded quota"                            → ResourceQuota hit

# 3. Check node capacity vs requests
kubectl describe nodes | grep -A5 "Allocated resources"

# 4. Check whether the PVC is bound
kubectl get pvc
```
### ImagePullBackOff Checklist

```bash
# 1. Verify the image exists (a typo is the #1 cause on the exam)
kubectl describe pod <pod> | grep "Image:"

# 2. Check the pull secret
kubectl get pod <pod> -o jsonpath='{.spec.imagePullSecrets}'
kubectl get secret <secret> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

# 3. Test the pull manually on the node (if you have SSH access)
crictl pull <image>
```
### Multi-Container Pod Debugging

```bash
# List all containers in a pod
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].name}'

# Logs for a specific container
kubectl logs <pod> -c <container-name>

# Exec into a specific container
kubectl exec -it <pod> -c <container-name> -- sh
```
## Service & Endpoint Troubleshooting

### Service Connectivity Flow

```text
Client → Service (ClusterIP) → Endpoints → Pod
              │                    │
              │                    └─ Empty? Selector mismatch
              └─ No response? kube-proxy / iptables issue
```

Checklist:

1. Does the Service exist?
2. Does it have Endpoints?
3. Do the Endpoints point to running Pods?
4. Do the Pod labels match the Service selector?
5. Is the targetPort correct?
### The Selector–Label Match Problem

This is the #1 Service troubleshooting issue on the CKA exam.

```yaml
# Service
apiVersion: v1
kind: Service
metadata:
  name: my-svc
spec:
  selector:
    app: myapp          # ← must match the pod labels EXACTLY
  ports:
    - port: 80
      targetPort: 8080  # ← must match the container port
```

```bash
# Step 1: Check the Service selector (shows the SELECTOR column)
kubectl get svc my-svc -o wide

# Step 2: Check the Endpoints (empty ENDPOINTS = selector matches no pod)
kubectl get endpoints my-svc

# Step 3: Compare labels
kubectl get pods --show-labels
kubectl get svc my-svc -o jsonpath='{.spec.selector}'

# Step 4: Verify targetPort matches the container
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].ports}'
```
### Port Mapping Confusion

```text
Service port (what clients connect to)
        │
        ▼
   port: 80 ──────► targetPort: 8080 (container listens here)
                            │
                            ▼
                    containerPort: 8080
```
| Field | Where | Must Match |
|---|---|---|
| port | Service spec | What clients use |
| targetPort | Service spec | Container's listening port |
| containerPort | Pod spec | What the app actually binds to |
| nodePort | Service spec (NodePort type) | External access port (30000–32767) |
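Putting the fields together: a minimal sketch of a matched Service/Pod pair, where the names (`web-svc`, `my-app:1.0`) are illustrative, showing how `port`, `targetPort`, and `containerPort` line up:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-svc               # illustrative name
spec:
  selector:
    app: web
  ports:
    - port: 80                # clients connect to web-svc:80
      targetPort: 8080        # traffic is forwarded to this container port
---
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web                  # must match the Service selector
spec:
  containers:
    - name: app
      image: my-app:1.0       # hypothetical image whose app binds 0.0.0.0:8080
      ports:
        - containerPort: 8080 # documents the port; the app must actually bind it
```

Note that `containerPort` is informational — traffic reaches the pod as long as the app listens on the `targetPort`, even if `containerPort` is omitted.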
### Endpoint Debugging

```bash
# Endpoints empty → no pods match the selector
kubectl get ep my-svc

# EndpointSlices (the newer API)
kubectl get endpointslices -l kubernetes.io/service-name=my-svc

# Manually verify connectivity to the pod IP
kubectl run tmp --image=busybox --rm -it -- wget -qO- <pod-ip>:8080
```
### Service Not Reachable Checklist

```bash
# 1. Does the Service exist?
kubectl get svc my-svc

# 2. Are the Endpoints populated?
kubectl get ep my-svc

# 3. Is the pod running and ready?
kubectl get pods -l app=myapp

# 4. Is the pod actually serving traffic?
kubectl exec <pod> -- curl -s localhost:8080

# 5. Is kube-proxy running on the nodes?
kubectl get pods -n kube-system -l k8s-app=kube-proxy

# 6. Are the iptables rules present? (run on the node)
iptables-save | grep my-svc
```
## DNS Debugging

### Kubernetes DNS Architecture

```text
Pod resolves "my-svc"
        │
        ▼
/etc/resolv.conf → nameserver <CoreDNS ClusterIP>
                   search <ns>.svc.cluster.local svc.cluster.local cluster.local
        │
        ▼
CoreDNS (Deployment in kube-system)
        │
        ▼
Returns the ClusterIP of the Service
```
| Resource | DNS Name | Resolves To |
|---|---|---|
| Service (ClusterIP) | `<svc>.<ns>.svc.cluster.local` | ClusterIP |
| Service (Headless) | `<svc>.<ns>.svc.cluster.local` | Set of Pod IPs |
| Pod | `<pod-ip-dashed>.<ns>.pod.cluster.local` | Pod IP |
| StatefulSet Pod | `<pod-name>.<svc>.<ns>.svc.cluster.local` | Pod IP |
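The name patterns in the table are mechanical, so they can be sketched as plain string builders — `svc_fqdn` and `pod_dns` are illustrative bash helpers, not kubectl commands; note that the pod form replaces the dots in the IP with dashes:

```bash
# Service DNS name: <svc>.<ns>.svc.cluster.local
svc_fqdn() { printf '%s.%s.svc.cluster.local\n' "$1" "$2"; }

# Pod DNS name: <pod-ip-dashed>.<ns>.pod.cluster.local
pod_dns() { printf '%s.%s.pod.cluster.local\n' "${1//./-}" "$2"; }

svc_fqdn my-svc default       # → my-svc.default.svc.cluster.local
pod_dns 10.244.1.23 default   # → 10-244-1-23.default.pod.cluster.local
```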
### DNS Debugging Chain

```bash
# 1. Check resolv.conf inside the pod
kubectl exec <pod> -- cat /etc/resolv.conf
# Should show: nameserver 10.96.0.10 (the CoreDNS ClusterIP)

# 2. Test DNS from a debug pod
kubectl run dnstest --image=busybox:1.36 --rm -it -- nslookup my-svc
kubectl run dnstest --image=busybox:1.36 --rm -it -- nslookup my-svc.default.svc.cluster.local

# 3. Check that CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# 4. Check the CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# 5. Check the CoreDNS Service
kubectl get svc -n kube-system kube-dns

# 6. Check the CoreDNS ConfigMap
kubectl get cm -n kube-system coredns -o yaml
```
### Common DNS Failures

| Symptom | Cause | Fix |
|---|---|---|
| nslookup times out | CoreDNS pods not running | Check the CoreDNS Deployment, restart if needed |
| nslookup returns NXDOMAIN | Wrong service name or namespace | Use the FQDN: `<svc>.<ns>.svc.cluster.local` |
| DNS works from some pods, not others | NetworkPolicy blocking UDP/53 to kube-dns | Add an egress rule allowing DNS |
| resolv.conf has wrong nameserver | kubelet `--cluster-dns` misconfigured | Check the kubelet config |
| Cross-namespace resolution fails | Using a short name without the namespace | Use `<svc>.<ns>` or the FQDN |
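For the NetworkPolicy row, a sketch of the kind of egress rule the fix refers to — the policy name and namespace are illustrative, and the `kubernetes.io/metadata.name` namespace label assumes Kubernetes ≥ 1.21, where it is set automatically:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns          # illustrative name
  namespace: dev           # adjust to the namespace whose pods lose DNS
spec:
  podSelector: {}          # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Remember that DNS uses TCP as a fallback for large responses, so allow both protocols on port 53.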
### DNS Policy Reminder

```yaml
# Default: the pod gets cluster DNS
spec:
  dnsPolicy: ClusterFirst  # default — uses CoreDNS

# Other policies (rarely changed on the exam):
# - Default                 → inherits the node's resolv.conf
# - ClusterFirstWithHostNet → ClusterFirst for hostNetwork pods
# - None                    → requires dnsConfig to be set
```
CKA Tip: If DNS doesn't work, always check CoreDNS pods first. 90% of DNS issues = CoreDNS not running or misconfigured.
## Quota & LimitRange Issues

### How Quotas Block Pod Creation

```text
kubectl create pod
        │
        ▼
Admission controller checks:
        │
        ├─ ResourceQuota exceeded? → REJECTED
        │    "exceeded quota: mem-quota, requested: memory=512Mi, used: 1536Mi, limited: 2Gi"
        │
        ├─ LimitRange violated? → REJECTED or DEFAULTED
        │    "minimum memory usage per Container is 64Mi, but request is 32Mi"
        │
        └─ Pass → Pod created
```
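The quota check itself is simple arithmetic: admission rejects the pod when the sum of what is already used plus the new request exceeds the hard limit. A local bash sketch (the `to_mi` helper and the example numbers are illustrative):

```bash
# Convert a memory quantity in Mi or Gi to Mi
to_mi() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
  esac
}

used=$(to_mi 1536Mi)   # already consumed by pods in the namespace
req=$(to_mi 768Mi)     # requested by the new pod
hard=$(to_mi 2Gi)      # the quota's hard limit

if (( used + req > hard )); then
  echo "rejected: exceeded quota"
else
  echo "admitted"
fi
# → rejected: exceeded quota (1536Mi + 768Mi = 2304Mi > 2048Mi)
```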
### ResourceQuota Troubleshooting

```bash
# Check quota usage in the namespace
kubectl get resourcequota -n <ns>
kubectl describe resourcequota <name> -n <ns>
# Output shows:
#   Resource  Used  Hard
#   --------  ----  ----
#   cpu       800m  2
#   memory    1Gi   2Gi
#   pods      5     10
```
Common quota-related failures:

| Error Message | Meaning | Fix |
|---|---|---|
| exceeded quota | Sum of requests exceeds the Hard limit | Reduce requests or increase the quota |
| must specify requests/limits | A quota exists but the pod has no requests | Add resource requests to the pod spec |
| Deployment creates RS but no pods | Quota silently blocks pod creation | `kubectl describe rs <rs>` → Events |
Critical exam pattern: When a ResourceQuota is set in a namespace, every pod must specify resource requests for the resources being quota'd. Otherwise creation is rejected.
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
    pods: "10"
```
### LimitRange Troubleshooting

LimitRange sets per-container defaults and constraints.

```bash
kubectl describe limitrange -n <ns>
# Output:
#   Type       Resource  Min   Max  Default  DefaultRequest
#   ----       --------  ---   ---  -------  --------------
#   Container  cpu       50m   2    500m     100m
#   Container  memory    64Mi  1Gi  256Mi    128Mi
```
| Scenario | Behavior |
|---|---|
| Pod has no requests/limits | LimitRange injects the defaults |
| Pod requests below Min | Rejected |
| Pod limits above Max | Rejected |
| Pod has a limit but no request | Request is set to the limit (or DefaultRequest if lower) |
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limits
  namespace: dev
spec:
  limits:
    - type: Container
      default:           # default limit
        memory: 256Mi
      defaultRequest:    # default request
        memory: 128Mi
      min:
        memory: 64Mi
      max:
        memory: 1Gi
```
### Debugging Quota/LimitRange Rejections

```bash
# Pod stuck — check the events
kubectl describe pod <pod> | grep -A3 "Events"

# Deployment not scaling — check the ReplicaSet events
kubectl get rs -n <ns>
kubectl describe rs <rs-name> -n <ns> | grep -A5 "Events"
# Look for: "Error creating: pods ... is forbidden: exceeded quota"

# Quick status check
kubectl get resourcequota,limitrange -n <ns>
```
### Interaction Between Quota and LimitRange

When a namespace has both a ResourceQuota and a LimitRange:

1. The LimitRange injects defaults into pods that lack requests/limits.
2. The ResourceQuota checks the totals against its hard limits.
3. Both must pass for the pod to be admitted.

Common pattern on the exam:

- The ResourceQuota requires requests → the pod has none → REJECTED.
- Fix: add a LimitRange with defaults so pods automatically get requests.
## CKA Tips

- Pod status tells you where to look: Pending = scheduling, ImagePull = image, CrashLoop = app logs, OOMKilled = memory.
- `kubectl logs --previous` is essential for CrashLoopBackOff — the current container may have no logs yet.
- Empty Endpoints = selector/label mismatch — the most common Service fix on the exam.
- DNS debug pod: `kubectl run tmp --image=busybox:1.36 --rm -it -- nslookup <svc>`
- Quota blocks are silent for Deployments — always check the ReplicaSet events, not just the Deployment events.
- `kubectl describe` is your best friend — the Events section at the bottom has the answer 90% of the time.
## Practice Exercises

### Exercise 1 — Fix a CrashLoopBackOff Pod

```bash
# Create a broken pod
kubectl run crash-pod --image=busybox --command -- /bin/sh -c "exit 1"

# Tasks:
# 1. Identify why the pod is crashing
# 2. Fix it so it runs successfully (hint: change the command)
```
### Exercise 2 — Fix Service Connectivity

```bash
# Create a pod and a broken Service
kubectl run web --image=nginx --port=80 --labels="app=web"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    app: webapp
  ports:
    - port: 80
      targetPort: 8080
EOF

# Tasks:
# 1. Why does `kubectl get ep web-svc` show no endpoints?
# 2. Fix the Service so it routes to the pod
# 3. Verify connectivity
```
### Exercise 3 — DNS Troubleshooting

```bash
# Tasks:
# 1. Deploy a busybox pod and verify DNS resolves kubernetes.default.svc.cluster.local
# 2. Check which nameserver the pod uses
# 3. Verify the CoreDNS pods are running and check their logs
# 4. Resolve a service in a different namespace using the FQDN
```
### Exercise 4 — Quota Rejection

```bash
# Setup
kubectl create namespace quota-test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: strict-quota
  namespace: quota-test
spec:
  hard:
    requests.cpu: "500m"
    requests.memory: 256Mi
    pods: "2"
EOF

# Tasks:
# 1. Try: kubectl run test --image=nginx -n quota-test — why does it fail?
# 2. Fix it by adding resource requests
# 3. Create a LimitRange that auto-injects defaults
# 4. Verify a pod without explicit requests now gets admitted
```
### Exercise 5 — Break/Fix: Full Application Stack

```bash
# Deploy a broken stack and fix all issues:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: app
  labels:
    tier: frontend
spec:
  containers:
    - name: app
      image: nginx:latestttt
      ports:
        - containerPort: 80
      env:
        - name: DB_HOST
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: db_host
---
apiVersion: v1
kind: Service
metadata:
  name: app-svc
spec:
  selector:
    tier: backend
  ports:
    - port: 80
      targetPort: 9090
EOF

# Find and fix ALL issues (there are at least 4)
```
## Key Takeaways

| Concept | Key Point |
|---|---|
| Pod status | Directly indicates the troubleshooting path |
| `--previous` flag | Gets logs from the crashed container |
| Empty Endpoints | Selector doesn't match any pod labels |
| targetPort | Must match what the container actually listens on |
| DNS debug | nslookup from a busybox pod → check CoreDNS |
| resolv.conf | Shows the nameserver and search domains |
| ResourceQuota | Blocks pods silently via the ReplicaSet |
| LimitRange | Injects defaults; rejects min/max violations |
| Quota + no requests | Pod rejected — add LimitRange defaults |