
17 — Troubleshooting Networking

← 16-troubleshooting-applications.md | → 18-exam-tips.md

Overview

This file covers Module 5 — Troubleshooting (30%), topics 9–12:

| # | Topic | Section |
|---|-------|---------|
| 9 | Debugging connectivity between pods and services | Pod & Service Connectivity |
| 10 | Verifying kube-proxy rules (iptables/ipvs) | kube-proxy & iptables/IPVS |
| 11 | CNI troubleshooting | CNI Troubleshooting |
| 12 | Using tools: kubectl exec, nslookup, curl, netcat, tcpdump | Network Debugging Toolkit |

Pod & Service Connectivity

Systematic Debugging Flow

Pod A cannot reach Service B
├─ 1. Can Pod A reach anything?
│     kubectl exec A -- ping <another-pod-ip>
│     └─ No → CNI / node networking issue (jump to CNI section)
├─ 2. Can Pod A resolve the Service name?
│     kubectl exec A -- nslookup B
│     └─ No → DNS issue (see 16-troubleshooting-applications.md)
├─ 3. Can Pod A reach the Service ClusterIP directly?
│     kubectl exec A -- curl -s --max-time 3 <clusterIP>:<port>
│     └─ No → kube-proxy / iptables issue
├─ 4. Can Pod A reach the backend Pod IP directly?
│     kubectl exec A -- curl -s --max-time 3 <podIP>:<targetPort>
│     └─ No → NetworkPolicy blocking, or pod not listening
└─ 5. Is the backend pod actually serving?
      kubectl exec <backend-pod> -- curl -s localhost:<targetPort>
      └─ No → Application issue
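
The flow above can be scripted as a rough first pass. This is a sketch only: pod/service names and addresses are placeholders, and it assumes the client pod's image ships nslookup and curl (busybox does).

```shell
# Sketch: automate steps 2-4 of the flow above. Stops at the first failing
# layer and names the likely culprit.
check_path() {   # usage: check_path <client-pod> <service> <clusterIP:port> <podIP:targetPort>
  kubectl exec "$1" -- nslookup "$2" >/dev/null 2>&1        || { echo "DNS issue"; return 1; }
  kubectl exec "$1" -- curl -s --max-time 3 "$3" >/dev/null || { echo "kube-proxy issue"; return 1; }
  kubectl exec "$1" -- curl -s --max-time 3 "$4" >/dev/null || { echo "NetworkPolicy or app issue"; return 1; }
  echo "path OK"
}
```

Run it as e.g. `check_path pod-a svc-b 10.96.0.10:80 10.244.1.5:8080` and work on whichever layer it reports first.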

Layer-by-Layer Isolation

| Layer | Test | Failure Means |
|-------|------|---------------|
| App | kubectl exec backend -- curl localhost:8080 | App not listening / crashed |
| Pod-to-Pod | curl <pod-ip>:8080 from another pod | CNI or NetworkPolicy |
| Service | curl <clusterIP>:80 from a pod | kube-proxy / Endpoints empty |
| DNS | nslookup <svc-name> from a pod | CoreDNS issue |
| External | curl <nodeIP>:<nodePort> from outside | NodePort / firewall |

Cross-Namespace Connectivity

# Pods in namespace A reaching service in namespace B
kubectl exec -n ns-a <pod> -- curl -s svc-b.ns-b.svc.cluster.local

# Common mistake: using short name across namespaces
# "curl svc-b" only works within ns-b
# Must use "svc-b.ns-b" or FQDN from other namespaces
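
Short names fail across namespaces because the pod's /etc/resolv.conf search path only includes the pod's own namespace (e.g. ns-a.svc.cluster.local). A tiny helper (hypothetical, assuming the default cluster.local domain) makes the FQDN construction explicit:

```shell
# Build the FQDN a pod in another namespace must use.
# svc-b / ns-b are example names; cluster.local is the default cluster domain.
svc_fqdn() { echo "$1.$2.svc.cluster.local"; }   # usage: svc_fqdn <service> <namespace>

svc_fqdn svc-b ns-b   # → svc-b.ns-b.svc.cluster.local
```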

kube-proxy & iptables/IPVS

kube-proxy Modes

kube-proxy watches Services & Endpoints
    ├─ iptables mode (default)
    │   Programs iptables rules for DNAT
    │   Service ClusterIP → random backend Pod IP
    └─ IPVS mode
        Uses Linux IPVS (IP Virtual Server)
        Better performance at scale
        Supports multiple load-balancing algorithms

Verifying kube-proxy Is Running

# Check kube-proxy pods
kubectl get pods -n kube-system -l k8s-app=kube-proxy

# Check kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy

# Check kube-proxy mode
kubectl logs -n kube-system -l k8s-app=kube-proxy | grep "Using"
# "Using iptables proxier" or "Using ipvs proxier"

# Check kube-proxy ConfigMap
kubectl get cm -n kube-system kube-proxy -o yaml | grep mode

Verifying iptables Rules (on node)

# List all Service NAT rules
iptables-save | grep <service-name>

# Trace a specific Service — look for KUBE-SVC and KUBE-SEP chains
iptables -t nat -L KUBE-SERVICES -n | grep <clusterIP>

# Example output for a Service with 2 endpoints:
# KUBE-SVC-XXXX  tcp  --  0.0.0.0/0  10.96.100.50  tcp dpt:80
#   → KUBE-SEP-AAAA  (statistic mode random probability 0.50)  → DNAT to 10.244.1.5:8080
#   → KUBE-SEP-BBBB  (statistic mode random probability 1.00)  → DNAT to 10.244.2.8:8080

# Count total rules (very large rule sets slow rule updates at scale)
iptables-save | wc -l
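
A quick sanity check is counting the KUBE-SEP jumps behind a Service's KUBE-SVC chain: it should equal the number of ready endpoints. A minimal sketch, using a trimmed sample of iptables-save output (chain names are examples):

```shell
# Sketch: count the backend endpoints (KUBE-SEP jumps) behind one KUBE-SVC
# chain. $rules holds a trimmed sample; on a node, pipe iptables-save instead.
rules='-A KUBE-SVC-XXXX -m statistic --mode random --probability 0.50000 -j KUBE-SEP-AAAA
-A KUBE-SVC-XXXX -j KUBE-SEP-BBBB'
printf '%s\n' "$rules" | grep -c 'KUBE-SEP'   # → 2
```

If this count is 0 while the Service has endpoints, kube-proxy has not programmed the rules.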

Verifying IPVS Rules (on node)

# List virtual servers
ipvsadm -Ln

# Example output:
# TCP  10.96.100.50:80 rr
#   -> 10.244.1.5:8080    Masq  1  0  0
#   -> 10.244.2.8:8080    Masq  1  0  0

# Count configured virtual servers (0 suggests IPVS mode is not active)
ipvsadm -Ln | grep -c "TCP\|UDP"
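
The real-server lines (the `->` entries) are the backends. A sketch of extracting them, run here against a saved sample of the output above; on a real node you would pipe ipvsadm -Ln directly:

```shell
# Sketch: pull the real-server (backend) addresses out of ipvsadm -Ln output.
ipvs_out='TCP  10.96.100.50:80 rr
  -> 10.244.1.5:8080    Masq  1  0  0
  -> 10.244.2.8:8080    Masq  1  0  0'
printf '%s\n' "$ipvs_out" | awk '/->/ {print $2}'
# prints one backend Pod IP:port per line
```

An empty backend list for a virtual server mirrors the "empty Endpoints" failure in iptables mode.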

kube-proxy Failure Symptoms

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| Service ClusterIP unreachable but Pod IP works | kube-proxy not running or misconfigured | Restart kube-proxy pods; check ConfigMap |
| Service works on some nodes, not others | kube-proxy down on specific node | Check kube-proxy pod on that node |
| New Service not reachable | kube-proxy can't reach API server | Check kube-proxy logs for connection errors |
| NodePort not accessible externally | Firewall blocking port range 30000-32767 | Check node firewall / security groups |

kube-proxy ConfigMap

# View full config
kubectl get cm -n kube-system kube-proxy -o yaml

# Key fields:
# mode: ""           ← empty = iptables (default)
# mode: "ipvs"       ← IPVS mode
# clusterCIDR: "10.244.0.0/16"
# metricsBindAddress: "0.0.0.0:10249"

# After editing, restart kube-proxy
kubectl rollout restart daemonset kube-proxy -n kube-system

CNI Troubleshooting

CNI Architecture

kubelet creates pod sandbox
Calls CNI binary (from /opt/cni/bin/)
Reads config (from /etc/cni/net.d/)
CNI plugin sets up:
  - veth pair (pod ↔ node bridge)
  - IP address assignment (IPAM)
  - Routes
Pod gets network interface and IP

Key CNI Paths

| Path | Purpose |
|------|---------|
| /etc/cni/net.d/ | CNI configuration files (first file alphabetically wins) |
| /opt/cni/bin/ | CNI plugin binaries |
| /var/log/calico/ | Calico-specific logs (if using Calico) |
| /run/flannel/ | Flannel subnet config (if using Flannel) |

CNI Not Installed / Broken

# Symptom: pods stuck in ContainerCreating, nodes NotReady
kubectl get nodes
kubectl describe node <node> | grep -i "network"
# "NetworkReady=false  reason:NetworkPluginNotReady message:Network plugin not ready: cni"

# Check if CNI config exists
ls /etc/cni/net.d/

# Check if CNI binaries exist
ls /opt/cni/bin/

# Check CNI plugin pods (e.g., Calico)
kubectl get pods -n kube-system -l k8s-app=calico-node
kubectl get pods -n kube-system -l app=flannel

# Check CNI pod logs
kubectl logs -n kube-system <calico-node-pod>

Common CNI Issues

| Symptom | Cause | Fix |
|---------|-------|-----|
| Nodes NotReady, pods ContainerCreating | No CNI installed | Install CNI: kubectl apply -f <cni-manifest> |
| Pods get IPs but can't communicate cross-node | CNI misconfigured; pod CIDR mismatch | Verify --pod-network-cidr matches CNI config |
| IP address exhaustion | Pod CIDR too small | Expand CIDR or clean up leaked IPs |
| CNI pods CrashLooping | Wrong CNI config or missing binaries | Check CNI pod logs; reinstall CNI |
| Partial connectivity | Overlay network blocked (VXLAN UDP 4789, BGP TCP 179) | Open required ports between nodes |

Pod CIDR Mismatch

# What kubeadm was told
kubectl cluster-info dump | grep -m1 "cluster-cidr"
# or
kubectl get cm -n kube-system kubeadm-config -o yaml | grep podSubnet

# What CNI is configured with
# Calico:
kubectl get ippools.crd.projectcalico.org -o yaml | grep cidr
# Flannel:
kubectl get cm -n kube-system kube-flannel-cfg -o yaml | grep Network

# These MUST match — mismatch = cross-node communication fails
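
Once you have both values, the comparison itself is trivial; the point is to put them side by side rather than eyeball two YAML dumps. A sketch (the CIDR values are examples; substitute the real outputs of the commands above):

```shell
# Sketch: compare the kubeadm podSubnet with the CNI-configured CIDR.
kubeadm_cidr="10.244.0.0/16"
cni_cidr="10.244.0.0/16"
if [ "$kubeadm_cidr" = "$cni_cidr" ]; then
  echo "pod CIDRs match"
else
  echo "MISMATCH: $kubeadm_cidr vs $cni_cidr"
fi
```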

Required Ports for CNI Overlays

| CNI | Protocol | Port | Purpose |
|-----|----------|------|---------|
| Calico (VXLAN) | UDP | 4789 | VXLAN encapsulation |
| Calico (BGP) | TCP | 179 | BGP peering |
| Flannel (VXLAN) | UDP | 4789 | VXLAN encapsulation |
| Cilium | UDP | 8472 | VXLAN (default) |
| WireGuard (any CNI) | UDP | 51820 | Encrypted overlay |

Network Debugging Toolkit

Tool Quick Reference

| Tool | Purpose | Available In |
|------|---------|--------------|
| kubectl exec | Run commands inside a pod | kubectl (always available) |
| nslookup / dig | DNS resolution testing | busybox / dnsutils |
| curl / wget | HTTP connectivity testing | curlimages/curl / busybox |
| nc (netcat) | TCP/UDP port testing | busybox / nicolaka/netshoot |
| tcpdump | Packet capture | nicolaka/netshoot / node |
| ip | Network interface/route inspection | Most images / node |
| ss / netstat | Socket/connection listing | Most images / node |

kubectl exec Patterns

# Interactive shell
kubectl exec -it <pod> -- sh

# One-shot command
kubectl exec <pod> -- cat /etc/resolv.conf

# Specific container in multi-container pod
kubectl exec <pod> -c <container> -- ps aux

# Namespace-aware
kubectl exec -n <ns> <pod> -- env

Ephemeral Debug Containers

# Attach a debug container to a running pod (K8s 1.25+)
kubectl debug -it <pod> --image=busybox --target=<container>

# Debug a node
kubectl debug node/<node-name> -it --image=busybox
# Gives you a shell with node filesystem at /host

# Copy a pod for debugging (non-destructive)
kubectl debug <pod> -it --image=busybox --copy-to=debug-pod

DNS Testing (nslookup)

# Quick DNS test pod
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup kubernetes.default

# Test specific service
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup my-svc.my-ns.svc.cluster.local

# Test external resolution
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup google.com

# Using dig for more detail (needs dnsutils image)
kubectl run dig-test --image=registry.k8s.io/e2e-test-images/agnhost:2.39 --rm -it -- dig my-svc.default.svc.cluster.local

curl / wget for HTTP Testing

# Test Service from inside cluster
kubectl run curl-test --image=curlimages/curl --rm -it -- curl -s --max-time 5 http://my-svc:80

# Test with headers and status code
kubectl run curl-test --image=curlimages/curl --rm -it -- curl -s -o /dev/null -w "%{http_code}" http://my-svc:80

# Test pod IP directly (bypass Service)
kubectl run curl-test --image=curlimages/curl --rm -it -- curl -s http://10.244.1.5:8080

# wget alternative (busybox)
kubectl run wget-test --image=busybox:1.36 --rm -it -- wget -qO- --timeout=5 http://my-svc:80
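
In a script, the captured status code is more useful than raw output. A small helper (name hypothetical) that classifies what curl's %{http_code} gave back, noting that curl prints 000 when no HTTP response arrived at all:

```shell
# Sketch: classify the status code captured via curl -w "%{http_code}".
check_http() {
  case "$1" in
    2??|3??) echo "reachable ($1)" ;;
    000)     echo "no connection (timeout/refused)" ;;
    *)       echo "server error ($1)" ;;
  esac
}

check_http 200   # → reachable (200)
check_http 000   # → no connection (timeout/refused)
```

"no connection" points at the network layers below; a 5xx means the path works and the app itself is failing.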

netcat (nc) for Port Testing

# Test if TCP port is open
kubectl run nc-test --image=busybox:1.36 --rm -it -- nc -zv <ip> <port>
# "open" = port reachable, "Connection refused" = nothing listening, timeout = blocked

# Test Service port
kubectl run nc-test --image=busybox:1.36 --rm -it -- nc -zv my-svc 80

# Listen on a port (create a test server)
kubectl run listener --image=busybox:1.36 -- nc -lk -p 8080

# Send data to test connectivity
kubectl exec sender -- sh -c 'echo "test" | nc <ip> 8080'

tcpdump for Packet Capture

# On a node — capture traffic for a specific pod IP
tcpdump -i any host 10.244.1.5 -nn

# Capture DNS traffic only
tcpdump -i any port 53 -nn

# Capture with limited count and write to file
tcpdump -i any host 10.244.1.5 -nn -c 50 -w /tmp/capture.pcap

# Using netshoot as a debug sidecar
kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container> -- tcpdump -i eth0 -nn

ip and ss Commands

# Inside a pod — check interfaces and IPs
kubectl exec <pod> -- ip addr
kubectl exec <pod> -- ip route

# Check listening ports inside a pod
kubectl exec <pod> -- ss -tlnp
# or
kubectl exec <pod> -- netstat -tlnp

# On a node — check bridge interfaces
ip link show type bridge
ip link show type veth
bridge fdb show

The netshoot Swiss Army Knife

# nicolaka/netshoot has ALL networking tools pre-installed
kubectl run netshoot --image=nicolaka/netshoot --rm -it -- bash

# Inside netshoot you get:
# curl, wget, nslookup, dig, nc, tcpdump, ip, ss, traceroute,
# mtr, iperf3, ethtool, nmap, strace, and more

# Attach to existing pod's network namespace
kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container>

Putting It All Together: Connectivity Debugging Playbook

Scenario: Pod A → Service B Broken

# Step 1: Verify Service and Endpoints exist
kubectl get svc,ep <service-b>

# Step 2: DNS resolution
kubectl exec <pod-a> -- nslookup <service-b>

# Step 3: Reach ClusterIP
kubectl exec <pod-a> -- curl -s --max-time 3 http://<clusterIP>:<port>

# Step 4: Reach Pod IP directly (bypass kube-proxy)
kubectl exec <pod-a> -- curl -s --max-time 3 http://<pod-b-ip>:<targetPort>

# Step 5: Check backend is listening
kubectl exec <pod-b> -- ss -tlnp | grep <targetPort>

# Step 6: Check NetworkPolicies
kubectl get networkpolicy -A
kubectl describe networkpolicy -n <namespace>

# Step 7: Check kube-proxy (on node)
iptables-save | grep <service-name>

# Step 8: Check CNI pods
kubectl get pods -n kube-system | grep -E "calico|flannel|cilium|weave"

Quick Decision Matrix

Step 4 works but Step 3 doesn't → kube-proxy issue
Step 3 works but Step 2 doesn't → DNS issue
Step 2 works but Step 1 shows empty EP → selector mismatch
Step 5 fails → app not listening on expected port
Step 4 fails → NetworkPolicy or CNI issue
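
The matrix above amounts to a lookup from first failing step to likely culprit. A sketch (helper name hypothetical; step numbers refer to the playbook above):

```shell
# Sketch: map the first failing playbook step to a diagnosis.
diagnose() {
  case "$1" in
    1) echo "Service or Endpoints missing: check selector" ;;
    2) echo "DNS: check CoreDNS" ;;
    3) echo "kube-proxy: check iptables/IPVS rules" ;;
    4) echo "NetworkPolicy or CNI" ;;
    5) echo "app not listening on expected port" ;;
    *) echo "no failing step given" ;;
  esac
}

diagnose 3   # → kube-proxy: check iptables/IPVS rules
```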

CKA Tips

  • Always isolate the layer first — don't jump to CNI when it's a selector mismatch
  • --rm -it on debug pods auto-cleans up — essential during exam to avoid clutter
  • busybox:1.36 is your go-to debug image — has nslookup, wget, nc, ping
  • nicolaka/netshoot when you need heavier tools (tcpdump, dig, iperf)
  • curl -s --max-time 3 prevents hanging on unreachable endpoints
  • Check iptables on the node only after confirming kube-proxy is the suspect
  • CNI issues show as NotReady nodes and ContainerCreating pods — check /etc/cni/net.d/ first

Practice Exercises

Exercise 1 — Isolate a Connectivity Failure

# Setup
kubectl create namespace net-debug
kubectl run server --image=nginx --port=80 -n net-debug --labels="app=server"
kubectl expose pod server --port=80 --target-port=80 -n net-debug

# Tasks:
# 1. From a busybox pod in the default namespace, try to reach the service
# 2. Use FQDN to resolve across namespaces
# 3. Verify with curl that HTTP response is 200
# 4. Test direct pod IP connectivity

Exercise 2 — kube-proxy Verification

# Tasks:
# 1. Identify which mode kube-proxy is running in
# 2. Find the kube-proxy ConfigMap and check clusterCIDR
# 3. SSH to a node and verify iptables rules exist for the "server" service from Exercise 1
# 4. Count total iptables NAT rules

Exercise 3 — CNI Diagnosis

# Tasks:
# 1. Identify which CNI plugin is installed in your cluster
# 2. Check /etc/cni/net.d/ on a node for the config file
# 3. Verify CNI pods are running and healthy
# 4. Confirm pod CIDR matches between kubeadm config and CNI config
# 5. Check that required overlay ports are open between nodes

Exercise 4 — NetworkPolicy Blocking Traffic

# Setup
kubectl create namespace policy-test
kubectl run web --image=nginx --port=80 -n policy-test --labels="app=web"
kubectl expose pod web --port=80 -n policy-test

# Apply default deny
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: policy-test
spec:
  podSelector: {}
  policyTypes:
  - Ingress
EOF

# Tasks:
# 1. Verify that a test pod in policy-test cannot reach the web service
# 2. Create a NetworkPolicy that allows ingress from pods with label role=client
# 3. Label a test pod as role=client and verify connectivity is restored
# 4. Verify pods without the label are still blocked

Exercise 5 — Break/Fix: Full Network Stack

# This exercise requires a multi-node cluster (kubeadm, kind, etc.)
# Simulate these failures one at a time and fix them:

# Scenario A: Delete the CNI config
#   sudo mv /etc/cni/net.d/10-calico.conflist /tmp/
#   Observe: new pods stuck in ContainerCreating
#   Fix: restore the file, restart affected pods

# Scenario B: Kill kube-proxy
#   kubectl delete pods -n kube-system -l k8s-app=kube-proxy
#   Observe: existing connections work (conntrack), new Service connections fail
#   Fix: kube-proxy DaemonSet recreates pods automatically

# Scenario C: Break DNS
#   kubectl scale deployment coredns -n kube-system --replicas=0
#   Observe: DNS resolution fails, but direct IP access works
#   Fix: kubectl scale deployment coredns -n kube-system --replicas=2

Key Takeaways

| Concept | Key Point |
|---------|-----------|
| Debugging flow | Isolate layer: App → Pod IP → Service → DNS → CNI |
| kube-proxy | Translates ClusterIP to Pod IPs via iptables/IPVS rules |
| iptables verification | iptables-save \| grep <svc> on the node |
| CNI config | /etc/cni/net.d/ — first file alphabetically wins |
| CNI binaries | /opt/cni/bin/ — must exist on every node |
| Pod CIDR mismatch | kubeadm podSubnet must match CNI config |
| busybox debug pod | kubectl run tmp --image=busybox:1.36 --rm -it -- sh |
| netshoot | Full toolkit: tcpdump, dig, nc, iperf, traceroute |
| --max-time | Always set timeout on curl to avoid exam time waste |
| Decision matrix | Pod IP works but ClusterIP doesn't = kube-proxy problem |

← 16-troubleshooting-applications.md | → 18-exam-tips.md