# 17 — Troubleshooting Networking

← 16-troubleshooting-applications.md | → 18-exam-tips.md

## Overview

This file covers Module 5 — Troubleshooting (30%), topics 9–12.

## Pod & Service Connectivity

### Systematic Debugging Flow
```text
Pod A cannot reach Service B
│
├─ 1. Can Pod A reach anything?
│     kubectl exec A -- ping <another-pod-ip>
│     └─ No → CNI / node networking issue (jump to CNI section)
│
├─ 2. Can Pod A resolve the Service name?
│     kubectl exec A -- nslookup B
│     └─ No → DNS issue (see 16-troubleshooting-applications.md)
│
├─ 3. Can Pod A reach the Service ClusterIP directly?
│     kubectl exec A -- curl -s --max-time 3 <clusterIP>:<port>
│     └─ No → kube-proxy / iptables issue
│
├─ 4. Can Pod A reach the backend Pod IP directly?
│     kubectl exec A -- curl -s --max-time 3 <podIP>:<targetPort>
│     └─ No → NetworkPolicy blocking, or pod not listening
│
└─ 5. Is the backend pod actually serving?
      kubectl exec <backend-pod> -- curl -s localhost:<targetPort>
      └─ No → Application issue
```
### Layer-by-Layer Isolation

| Layer | Test | Failure Means |
|---|---|---|
| App | `kubectl exec backend -- curl localhost:8080` | App not listening / crashed |
| Pod-to-Pod | `curl <pod-ip>:8080` from another pod | CNI or NetworkPolicy |
| Service | `curl <clusterIP>:80` from a pod | kube-proxy / Endpoints empty |
| DNS | `nslookup <svc-name>` from a pod | CoreDNS issue |
| External | `curl <nodeIP>:<nodePort>` from outside | NodePort / firewall |
### Cross-Namespace Connectivity

```bash
# Pods in namespace A reaching a Service in namespace B
kubectl exec -n ns-a <pod> -- curl -s svc-b.ns-b.svc.cluster.local

# Common mistake: using the short name across namespaces
# "curl svc-b" only works from within ns-b
# Use "svc-b.ns-b" or the FQDN from other namespaces
```
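Why the short name fails comes down to the `search` line in the pod's /etc/resolv.conf. A rough sketch of the candidate expansion the resolver performs (the search domains assume a pod in a namespace `ns-a`; `expand_name` is a hypothetical illustration, not a real tool):

```shell
# Kubernetes injects search domains into each pod's /etc/resolv.conf,
# e.g. for a pod in ns-a:
#   search ns-a.svc.cluster.local svc.cluster.local cluster.local
expand_name() {
  name=$1
  search="ns-a.svc.cluster.local svc.cluster.local cluster.local"
  case $name in
    *.) printf '%s\n' "${name%.}" ;;  # trailing dot = absolute name, no expansion
    *)  for d in $search; do printf '%s.%s\n' "$name" "$d"; done ;;
  esac
}

expand_name svc-b
# First candidate tried is svc-b.ns-a.svc.cluster.local — the wrong
# namespace, so "curl svc-b" from ns-a never reaches ns-b.
expand_name svc-b.ns-b
# Here svc-b.ns-b.svc.cluster.local appears among the candidates and resolves.
```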
## kube-proxy & iptables/IPVS

### kube-proxy Modes

```text
kube-proxy watches Services & Endpoints
│
├─ iptables mode (default)
│     Programs iptables rules for DNAT
│     Service ClusterIP → random backend Pod IP
│
└─ IPVS mode
      Uses Linux IPVS (IP Virtual Server)
      Better performance at scale
      Supports multiple load-balancing algorithms
```
### Verifying kube-proxy Is Running

```bash
# Check kube-proxy pods
kubectl get pods -n kube-system -l k8s-app=kube-proxy

# Check kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy

# Check the kube-proxy mode
kubectl logs -n kube-system -l k8s-app=kube-proxy | grep "Using"
# "Using iptables proxier" or "Using ipvs proxier"

# Check the kube-proxy ConfigMap
kubectl get cm -n kube-system kube-proxy -o yaml | grep mode
```
### Verifying iptables Rules (on a node)

```bash
# List all Service NAT rules
iptables-save | grep <service-name>

# Trace a specific Service — look for KUBE-SVC and KUBE-SEP chains
iptables -t nat -L KUBE-SERVICES -n | grep <clusterIP>

# Example output for a Service with 2 endpoints:
#   KUBE-SVC-XXXX  tcp  --  0.0.0.0/0  10.96.100.50  tcp dpt:80
#     → KUBE-SEP-AAAA (statistic mode random probability 0.50) → DNAT to 10.244.1.5:8080
#     → KUBE-SEP-BBBB (statistic mode random probability 1.00) → DNAT to 10.244.2.8:8080

# Count rules (a very high count means slow rule processing with many Services)
iptables-save | wc -l
```
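The KUBE-SVC → KUBE-SEP structure can be checked mechanically: a chain with zero KUBE-SEP jumps usually means an empty Endpoints object (selector mismatch). A sketch over inline sample `iptables-save` output so it runs anywhere (chain names are made up; on a real node, pipe `iptables-save -t nat` in instead):

```shell
# Count how many KUBE-SEP (endpoint) rules hang off each KUBE-SVC chain.
sample='-A KUBE-SERVICES -d 10.96.100.50/32 -p tcp --dport 80 -j KUBE-SVC-EXAMPLE
-A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.5 -j KUBE-SEP-AAAA
-A KUBE-SVC-EXAMPLE -j KUBE-SEP-BBBB'

count_endpoints() {
  # Reads iptables-save text on stdin; prints "<chain> <endpoint-count>".
  awk '$1 == "-A" && $2 ~ /^KUBE-SVC-/ && /-j KUBE-SEP-/ { n[$2]++ }
       END { for (c in n) print c, n[c] }'
}

printf '%s\n' "$sample" | count_endpoints
# → KUBE-SVC-EXAMPLE 2
```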
### Verifying IPVS Rules (on a node)

```bash
# List virtual servers
ipvsadm -Ln
# Example output:
#   TCP  10.96.100.50:80 rr
#     -> 10.244.1.5:8080   Masq  1  0  0
#     -> 10.244.2.8:8080   Masq  1  0  0

# Count configured virtual servers (a nonzero count confirms IPVS mode is active)
ipvsadm -Ln | grep -c "TCP\|UDP"
```
### kube-proxy Failure Symptoms

| Symptom | Likely Cause | Fix |
|---|---|---|
| Service ClusterIP unreachable but Pod IP works | kube-proxy not running or misconfigured | Restart kube-proxy pods; check the ConfigMap |
| Service works on some nodes, not others | kube-proxy down on a specific node | Check the kube-proxy pod on that node |
| New Service not reachable | kube-proxy can't reach the API server | Check kube-proxy logs for connection errors |
| NodePort not accessible externally | Firewall blocking port range 30000–32767 | Check node firewall / security groups |
### kube-proxy ConfigMap

```bash
# View the full config
kubectl get cm -n kube-system kube-proxy -o yaml

# Key fields:
#   mode: ""       ← empty = iptables (default)
#   mode: "ipvs"   ← IPVS mode
#   clusterCIDR: "10.244.0.0/16"
#   metricsBindAddress: "0.0.0.0:10249"

# After editing, restart kube-proxy
kubectl rollout restart daemonset kube-proxy -n kube-system
```
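The empty-string default is easy to misread when grepping the ConfigMap. A small sketch of the defaulting logic, using inline sample text in place of the `kubectl get cm ... | grep mode` output (the helper name is made up):

```shell
# kube-proxy treats mode: "" as iptables; this mimics that defaulting.
effective_mode() {
  m=$(printf '%s\n' "$1" | sed -n 's/^mode: *"\{0,1\}\([^"]*\)"\{0,1\}.*/\1/p')
  if [ -n "$m" ]; then printf '%s\n' "$m"; else printf 'iptables\n'; fi
}

effective_mode 'mode: ""'      # → iptables (empty string falls back to default)
effective_mode 'mode: "ipvs"'  # → ipvs
```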
## CNI Troubleshooting

### CNI Architecture

```text
kubelet creates the pod sandbox
│
▼
Calls the CNI binary (from /opt/cni/bin/)
│
▼
Reads config (from /etc/cni/net.d/)
│
▼
CNI plugin sets up:
  - veth pair (pod ↔ node bridge)
  - IP address assignment (IPAM)
  - Routes
│
▼
Pod gets a network interface and IP
```
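For reference, a minimal sketch of what a conflist file in /etc/cni/net.d/ can look like, using the simple `bridge` plugin (the name, bridge device, and subnet here are illustrative; production CNIs like Calico generate far larger configs):

```json
{
  "cniVersion": "0.4.0",
  "name": "example-net",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "isGateway": true,
      "ipMasq": true,
      "ipam": {
        "type": "host-local",
        "subnet": "10.244.1.0/24",
        "routes": [{ "dst": "0.0.0.0/0" }]
      }
    },
    { "type": "portmap", "capabilities": { "portMappings": true } }
  ]
}
```

kubelet reads the alphabetically first file here, and each `type` must correspond to a binary of the same name in /opt/cni/bin/.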
### Key CNI Paths

| Path | Purpose |
|---|---|
| /etc/cni/net.d/ | CNI configuration files (first file alphabetically wins) |
| /opt/cni/bin/ | CNI plugin binaries |
| /var/log/calico/ | Calico-specific logs (if using Calico) |
| /run/flannel/ | Flannel subnet config (if using Flannel) |
### CNI Not Installed / Broken

```bash
# Symptom: pods stuck in ContainerCreating, nodes NotReady
kubectl get nodes
kubectl describe node <node> | grep -i "network"
# "NetworkReady=false reason:NetworkPluginNotReady message:Network plugin not ready: cni"

# Check whether CNI config exists (on the node)
ls /etc/cni/net.d/

# Check whether CNI binaries exist (on the node)
ls /opt/cni/bin/

# Check CNI plugin pods (e.g., Calico or Flannel)
kubectl get pods -n kube-system -l k8s-app=calico-node
kubectl get pods -n kube-system -l app=flannel

# Check CNI pod logs
kubectl logs -n kube-system <calico-node-pod>
```
### Common CNI Issues

| Symptom | Cause | Fix |
|---|---|---|
| Nodes NotReady, pods ContainerCreating | No CNI installed | Install a CNI: `kubectl apply -f <cni-manifest>` |
| Pods get IPs but can't communicate cross-node | CNI misconfigured; pod CIDR mismatch | Verify `--pod-network-cidr` matches the CNI config |
| IP address exhaustion | Pod CIDR too small | Expand the CIDR or clean up leaked IPs |
| CNI pods CrashLooping | Wrong CNI config or missing binaries | Check CNI pod logs; reinstall the CNI |
| Partial connectivity | Overlay network blocked (VXLAN UDP 4789, BGP TCP 179) | Open the required ports between nodes |
### Pod CIDR Mismatch

```bash
# What kubeadm was told
kubectl cluster-info dump | grep -m1 "cluster-cidr"
# or
kubectl get cm -n kube-system kubeadm-config -o yaml | grep podSubnet

# What the CNI is configured with
# Calico:
kubectl get ippools.crd.projectcalico.org -o yaml | grep cidr
# Flannel:
kubectl get cm -n kube-system kube-flannel-cfg -o yaml | grep Network

# These MUST match — a mismatch means cross-node communication fails
```
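The comparison itself can be scripted. A sketch using inline sample text standing in for the two kubectl outputs above (swap in the real commands on a cluster):

```shell
# Extract podSubnet from kubeadm-config and cidr from the CNI config, then compare.
kubeadm_out='    podSubnet: 10.244.0.0/16'
cni_out='      cidr: 10.244.0.0/16'

a=$(printf '%s\n' "$kubeadm_out" | sed -n 's/.*podSubnet: *//p')
b=$(printf '%s\n' "$cni_out"     | sed -n 's/.*cidr: *//p')

if [ "$a" = "$b" ]; then
  echo "OK: pod CIDR matches ($a)"
else
  echo "MISMATCH: kubeadm=$a cni=$b"   # cross-node pod traffic will break
fi
# → OK: pod CIDR matches (10.244.0.0/16)
```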
### Required Ports for CNI Overlays

| CNI | Protocol | Port | Purpose |
|---|---|---|---|
| Calico (VXLAN) | UDP | 4789 | VXLAN encapsulation |
| Calico (BGP) | TCP | 179 | BGP peering |
| Flannel (VXLAN) | UDP | 8472 | VXLAN encapsulation (Linux kernel default port) |
| Cilium | UDP | 8472 | VXLAN (default) |
| WireGuard (any CNI) | UDP | 51820 | Encrypted overlay |
## Network Debugging Tools

| Tool | Purpose | Available In |
|---|---|---|
| kubectl exec | Run commands inside a pod | kubectl (always available) |
| nslookup / dig | DNS resolution testing | busybox / dnsutils |
| curl / wget | HTTP connectivity testing | curlimages/curl / busybox |
| nc (netcat) | TCP/UDP port testing | busybox / nicolaka/netshoot |
| tcpdump | Packet capture | nicolaka/netshoot / node |
| ip | Network interface/route inspection | Most images / node |
| ss / netstat | Socket/connection listing | Most images / node |
### kubectl exec Patterns

```bash
# Interactive shell
kubectl exec -it <pod> -- sh

# One-shot command
kubectl exec <pod> -- cat /etc/resolv.conf

# Specific container in a multi-container pod
kubectl exec <pod> -c <container> -- ps aux

# Namespace-aware
kubectl exec -n <ns> <pod> -- env
```
### Ephemeral Debug Containers

```bash
# Attach a debug container to a running pod (K8s 1.25+)
kubectl debug -it <pod> --image=busybox --target=<container>

# Debug a node
kubectl debug node/<node-name> -it --image=busybox
# Gives you a shell with the node filesystem mounted at /host

# Copy a pod for debugging (non-destructive)
kubectl debug <pod> -it --image=busybox --copy-to=debug-pod
```
### DNS Testing (nslookup)

```bash
# Quick DNS test pod
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup kubernetes.default

# Test a specific Service
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup my-svc.my-ns.svc.cluster.local

# Test external resolution
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup google.com

# Use dig for more detail (needs an image that ships it, e.g. agnhost)
kubectl run dig-test --image=registry.k8s.io/e2e-test-images/agnhost:2.39 --rm -it -- dig my-svc.default.svc.cluster.local
```
### curl / wget for HTTP Testing

```bash
# Test a Service from inside the cluster
kubectl run curl-test --image=curlimages/curl --rm -it -- curl -s --max-time 5 http://my-svc:80

# Show only the HTTP status code
kubectl run curl-test --image=curlimages/curl --rm -it -- curl -s -o /dev/null -w "%{http_code}" http://my-svc:80

# Test a pod IP directly (bypass the Service)
kubectl run curl-test --image=curlimages/curl --rm -it -- curl -s http://10.244.1.5:8080

# wget alternative (busybox)
kubectl run wget-test --image=busybox:1.36 --rm -it -- wget -qO- --timeout=5 http://my-svc:80
```
### netcat (nc) for Port Testing

```bash
# Test whether a TCP port is open
kubectl run nc-test --image=busybox:1.36 --rm -it -- nc -zv <ip> <port>
# "open" = port reachable; "Connection refused" = nothing listening; timeout = blocked

# Test a Service port
kubectl run nc-test --image=busybox:1.36 --rm -it -- nc -zv my-svc 80

# Listen on a port (create a test server)
kubectl run listener --image=busybox:1.36 -- nc -lk -p 8080

# Send data to test connectivity
kubectl exec sender -- sh -c 'echo "test" | nc <ip> 8080'
```
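When the target image ships no nc at all (minimal or distroless images), bash's /dev/tcp pseudo-device can stand in for a quick TCP probe. A sketch, assuming the container actually has bash (`check_port` is a hypothetical helper; this is a bash feature, not POSIX sh):

```shell
# TCP port probe without nc. A refused or unreachable connection both
# report "closed or filtered"; only a completed connect reports "open".
check_port() {
  if (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null; then
    echo "open"
  else
    echo "closed or filtered"
  fi
}

check_port 127.0.0.1 1   # port 1 almost never has a listener
```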
### tcpdump for Packet Capture

```bash
# On a node — capture traffic for a specific pod IP
tcpdump -i any host 10.244.1.5 -nn

# Capture DNS traffic only
tcpdump -i any port 53 -nn

# Capture a limited count and write to a file
tcpdump -i any host 10.244.1.5 -nn -c 50 -w /tmp/capture.pcap

# Use netshoot as an ephemeral debug container
kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container> -- tcpdump -i eth0 -nn
```
### ip and ss Commands

```bash
# Inside a pod — check interfaces and IPs
kubectl exec <pod> -- ip addr
kubectl exec <pod> -- ip route

# Check listening ports inside a pod
kubectl exec <pod> -- ss -tlnp
# or
kubectl exec <pod> -- netstat -tlnp

# On a node — check bridge and veth interfaces
ip link show type bridge
ip link show type veth
bridge fdb show
```
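"Is anything listening on the target port?" reduces to parsing ss output, which you can rehearse offline. A sketch against inline sample text (the `is_listening` helper is illustrative; on a cluster, pipe `kubectl exec <pod> -- ss -tln` in instead):

```shell
# Check whether a port shows as LISTEN in `ss -tln`-style output.
sample='State   Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
LISTEN  0       128     0.0.0.0:8080          0.0.0.0:*
LISTEN  0       511     127.0.0.1:6379        0.0.0.0:*'

is_listening() {
  # $1 = port; reads ss output on stdin; exits 0 if a listener exists
  awk -v port="$1" '$1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
                    END { exit !found }'
}

printf '%s\n' "$sample" | is_listening 8080 && echo "8080: listening"
printf '%s\n' "$sample" | is_listening 9090 || echo "9090: not listening"
```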
### The netshoot Swiss Army Knife

```bash
# nicolaka/netshoot has virtually every networking tool pre-installed
kubectl run netshoot --image=nicolaka/netshoot --rm -it -- bash

# Inside netshoot you get:
#   curl, wget, nslookup, dig, nc, tcpdump, ip, ss, traceroute,
#   mtr, iperf3, ethtool, nmap, strace, and more

# Attach to an existing pod's network namespace
kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container>
```
## Putting It All Together: Connectivity Debugging Playbook

### Scenario: Pod A → Service B Broken

```bash
# Step 1: Verify the Service and Endpoints exist
kubectl get svc,ep <service-b>

# Step 2: DNS resolution
kubectl exec <pod-a> -- nslookup <service-b>

# Step 3: Reach the ClusterIP
kubectl exec <pod-a> -- curl -s --max-time 3 http://<clusterIP>:<port>

# Step 4: Reach the Pod IP directly (bypass kube-proxy)
kubectl exec <pod-a> -- curl -s --max-time 3 http://<pod-b-ip>:<targetPort>

# Step 5: Check the backend is listening
kubectl exec <pod-b> -- ss -tlnp | grep <targetPort>

# Step 6: Check NetworkPolicies
kubectl get networkpolicy -A
kubectl describe networkpolicy -n <namespace>

# Step 7: Check kube-proxy rules (on a node)
iptables-save | grep <service-name>

# Step 8: Check CNI pods
kubectl get pods -n kube-system | grep -E "calico|flannel|cilium|weave"
```
### Quick Decision Matrix

```text
Step 4 works but Step 3 doesn't → kube-proxy issue
Step 3 works but Step 2 doesn't → DNS issue
Step 2 works but Step 1 shows empty Endpoints → selector mismatch
Step 5 fails → app not listening on the expected port
Step 4 fails → NetworkPolicy or CNI issue
```
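The matrix is just a priority-ordered lookup over the playbook's step results, which you can encode to sanity-check your reasoning (the ok/fail encoding and the `diagnose` name are made up for this sketch):

```shell
# Map the five playbook step results to a diagnosis. Arguments are the
# outcomes of steps 1-5 as "ok" or "fail":
#   1=Endpoints exist  2=DNS  3=ClusterIP  4=Pod IP  5=backend listening
# Deeper failures take priority, matching the matrix above.
diagnose() {
  ep=$1 dns=$2 cluster_ip=$3 pod_ip=$4 listening=$5
  if   [ "$listening"  = fail ]; then echo "app not listening on expected port"
  elif [ "$pod_ip"     = fail ]; then echo "NetworkPolicy or CNI issue"
  elif [ "$cluster_ip" = fail ]; then echo "kube-proxy issue"
  elif [ "$dns"        = fail ]; then echo "DNS issue"
  elif [ "$ep"         = fail ]; then echo "selector mismatch (empty Endpoints)"
  else echo "network path looks healthy"
  fi
}

diagnose ok ok fail ok ok   # → kube-proxy issue
diagnose ok fail ok ok ok   # → DNS issue
```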
## CKA Tips

- Always isolate the layer first — don't jump to CNI when it's a selector mismatch
- `--rm -it` on debug pods auto-cleans up — essential during the exam to avoid clutter
- `busybox:1.36` is your go-to debug image — it has nslookup, wget, nc, and ping
- `nicolaka/netshoot` when you need heavier tools (tcpdump, dig, iperf)
- `curl -s --max-time 3` prevents hanging on unreachable endpoints
- Check iptables on the node only after confirming kube-proxy is the suspect
- CNI issues show up as NotReady nodes and ContainerCreating pods — check /etc/cni/net.d/ first
## Practice Exercises

### Exercise 1 — Isolate a Connectivity Failure

```bash
# Setup
kubectl create namespace net-debug
kubectl run server --image=nginx --port=80 -n net-debug --labels="app=server"
kubectl expose pod server --port=80 --target-port=80 -n net-debug

# Tasks:
# 1. From a busybox pod in the default namespace, try to reach the service
# 2. Use the FQDN to resolve across namespaces
# 3. Verify with curl that the HTTP response is 200
# 4. Test direct pod IP connectivity
```
### Exercise 2 — kube-proxy Verification

```bash
# Tasks:
# 1. Identify which mode kube-proxy is running in
# 2. Find the kube-proxy ConfigMap and check clusterCIDR
# 3. SSH to a node and verify iptables rules exist for the "server" service from Exercise 1
# 4. Count the total iptables NAT rules
```
### Exercise 3 — CNI Diagnosis

```bash
# Tasks:
# 1. Identify which CNI plugin is installed in your cluster
# 2. Check /etc/cni/net.d/ on a node for the config file
# 3. Verify the CNI pods are running and healthy
# 4. Confirm the pod CIDR matches between the kubeadm config and the CNI config
# 5. Check that the required overlay ports are open between nodes
```
### Exercise 4 — NetworkPolicy Blocking Traffic

```bash
# Setup
kubectl create namespace policy-test
kubectl run web --image=nginx --port=80 -n policy-test --labels="app=web"
kubectl expose pod web --port=80 -n policy-test

# Apply default deny
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: policy-test
spec:
  podSelector: {}
  policyTypes:
  - Ingress
EOF

# Tasks:
# 1. Verify that a test pod in policy-test cannot reach the web service
# 2. Create a NetworkPolicy that allows ingress from pods with label role=client
# 3. Label a test pod role=client and verify connectivity is restored
# 4. Verify pods without the label are still blocked
```
### Exercise 5 — Break/Fix: Full Network Stack

```bash
# This exercise requires a multi-node cluster (kubeadm, kind, etc.)
# Simulate these failures one at a time and fix them:

# Scenario A: Delete the CNI config
#   sudo mv /etc/cni/net.d/10-calico.conflist /tmp/
#   Observe: new pods stuck in ContainerCreating
#   Fix: restore the file, restart affected pods

# Scenario B: Kill kube-proxy
#   kubectl delete pods -n kube-system -l k8s-app=kube-proxy
#   Observe: existing connections keep working (conntrack), new Service connections fail
#   Fix: the kube-proxy DaemonSet recreates the pods automatically

# Scenario C: Break DNS
#   kubectl scale deployment coredns -n kube-system --replicas=0
#   Observe: DNS resolution fails, but direct IP access works
#   Fix: kubectl scale deployment coredns -n kube-system --replicas=2
```
## Key Takeaways

| Concept | Key Point |
|---|---|
| Debugging flow | Isolate the layer: App → Pod IP → Service → DNS → CNI |
| kube-proxy | Translates ClusterIP to Pod IPs via iptables/IPVS rules |
| iptables verification | `iptables-save \| grep <svc>` on the node |
| CNI config | /etc/cni/net.d/ — first file alphabetically wins |
| CNI binaries | /opt/cni/bin/ — must exist on every node |
| Pod CIDR mismatch | kubeadm podSubnet must match the CNI config |
| busybox debug pod | `kubectl run tmp --image=busybox:1.36 --rm -it -- sh` |
| netshoot | Full toolkit: tcpdump, dig, nc, iperf, traceroute |
| `--max-time` | Always set a timeout on curl to avoid wasting exam time |
| Decision matrix | Pod IP works but ClusterIP doesn't = kube-proxy problem |
← 16-troubleshooting-applications.md | → 18-exam-tips.md