
17 — Troubleshooting Networking

← 16-troubleshooting-applications.md | → 18-exam-tips.md

Overview

This file covers Module 5 — Troubleshooting (30%), topics 9–12:

| # | Topic | Section |
|---|-------|---------|
| 9 | Debugging connectivity between pods and services | Pod & Service Connectivity |
| 10 | Verifying kube-proxy rules (iptables/ipvs) | kube-proxy & iptables/IPVS |
| 11 | CNI troubleshooting | CNI Troubleshooting |
| 12 | Using tools: kubectl exec, nslookup, curl, netcat, tcpdump | Network Debugging Toolkit |

Pod & Service Connectivity

Systematic Debugging Flow

Pod A cannot reach Service B
├─ 1. Can Pod A reach anything?
│     kubectl exec A -- ping <another-pod-ip>
│     └─ No → CNI / node networking issue (jump to CNI section)
├─ 2. Can Pod A resolve the Service name?
│     kubectl exec A -- nslookup B
│     └─ No → DNS issue (see 16-troubleshooting-applications.md)
├─ 3. Can Pod A reach the Service ClusterIP directly?
│     kubectl exec A -- curl -s --max-time 3 <clusterIP>:<port>
│     └─ No → kube-proxy / iptables issue
├─ 4. Can Pod A reach the backend Pod IP directly?
│     kubectl exec A -- curl -s --max-time 3 <podIP>:<targetPort>
│     └─ No → NetworkPolicy blocking, or pod not listening
└─ 5. Is the backend pod actually serving?
      kubectl exec <backend-pod> -- curl -s localhost:<targetPort>
      └─ No → Application issue
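
The flow above can be scripted as a rough first pass. This is a sketch only: pod/service names and addresses are placeholders, and it assumes the client pod's image ships nslookup and curl (busybox does).

```shell
# Sketch: automate steps 2-4 of the flow above. Stops at the first failing
# layer and names the likely culprit.
check_path() {   # usage: check_path <client-pod> <service> <clusterIP:port> <podIP:targetPort>
  kubectl exec "$1" -- nslookup "$2" >/dev/null 2>&1        || { echo "DNS issue"; return 1; }
  kubectl exec "$1" -- curl -s --max-time 3 "$3" >/dev/null || { echo "kube-proxy issue"; return 1; }
  kubectl exec "$1" -- curl -s --max-time 3 "$4" >/dev/null || { echo "NetworkPolicy or app issue"; return 1; }
  echo "path OK"
}
```

Run it as e.g. `check_path pod-a svc-b 10.96.0.10:80 10.244.1.5:8080` and work on whichever layer it reports first.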

Layer-by-Layer Isolation

| Layer | Test | Failure Means |
|-------|------|---------------|
| App | kubectl exec backend -- curl localhost:8080 | App not listening / crashed |
| Pod-to-Pod | curl <pod-ip>:8080 from another pod | CNI or NetworkPolicy |
| Service | curl <clusterIP>:80 from a pod | kube-proxy / Endpoints empty |
| DNS | nslookup <svc-name> from a pod | CoreDNS issue |
| External | curl <nodeIP>:<nodePort> from outside | NodePort / firewall |

Cross-Namespace Connectivity

# Pods in namespace A reaching service in namespace B
kubectl exec -n ns-a <pod> -- curl -s svc-b.ns-b.svc.cluster.local

# Common mistake: using short name across namespaces
# "curl svc-b" only works within ns-b
# Must use "svc-b.ns-b" or FQDN from other namespaces
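
Short names fail across namespaces because the pod's /etc/resolv.conf search path only includes the pod's own namespace (e.g. ns-a.svc.cluster.local). A tiny helper (hypothetical, assuming the default cluster.local domain) makes the FQDN construction explicit:

```shell
# Build the FQDN a pod in another namespace must use.
# svc-b / ns-b are example names; cluster.local is the default cluster domain.
svc_fqdn() { echo "$1.$2.svc.cluster.local"; }   # usage: svc_fqdn <service> <namespace>

svc_fqdn svc-b ns-b   # → svc-b.ns-b.svc.cluster.local
```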

kube-proxy & iptables/IPVS

kube-proxy Modes

kube-proxy watches Services & Endpoints
    ├─ iptables mode (default)
    │   Programs iptables rules for DNAT
    │   Service ClusterIP → random backend Pod IP
    └─ IPVS mode
        Uses Linux IPVS (IP Virtual Server)
        Better performance at scale
        Supports multiple load-balancing algorithms

Verifying kube-proxy Is Running

# Check kube-proxy pods
kubectl get pods -n kube-system -l k8s-app=kube-proxy

# Check kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy

# Check kube-proxy mode
kubectl logs -n kube-system -l k8s-app=kube-proxy | grep "Using"
# "Using iptables proxier" or "Using ipvs proxier"

# Check kube-proxy ConfigMap
kubectl get cm -n kube-system kube-proxy -o yaml | grep mode

Verifying iptables Rules (on node)

# List all Service NAT rules
iptables-save | grep <service-name>

# Trace a specific Service — look for KUBE-SVC and KUBE-SEP chains
iptables -t nat -L KUBE-SERVICES -n | grep <clusterIP>

# Example output for a Service with 2 endpoints:
# KUBE-SVC-XXXX  tcp  --  0.0.0.0/0  10.96.100.50  tcp dpt:80
#   → KUBE-SEP-AAAA  (statistic mode random probability 0.50)  → DNAT to 10.244.1.5:8080
#   → KUBE-SEP-BBBB  (statistic mode random probability 1.00)  → DNAT to 10.244.2.8:8080

# Count total rules (very large rule sets slow rule updates at scale)
iptables-save | wc -l
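
A quick sanity check is counting the KUBE-SEP jumps behind a Service's KUBE-SVC chain: it should equal the number of ready endpoints. A minimal sketch, using a trimmed sample of iptables-save output (chain names are examples):

```shell
# Sketch: count the backend endpoints (KUBE-SEP jumps) behind one KUBE-SVC
# chain. $rules holds a trimmed sample; on a node, pipe iptables-save instead.
rules='-A KUBE-SVC-XXXX -m statistic --mode random --probability 0.50000 -j KUBE-SEP-AAAA
-A KUBE-SVC-XXXX -j KUBE-SEP-BBBB'
printf '%s\n' "$rules" | grep -c 'KUBE-SEP'   # → 2
```

If this count is 0 while the Service has endpoints, kube-proxy has not programmed the rules.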

Verifying IPVS Rules (on node)

# List virtual servers
ipvsadm -Ln

# Example output:
# TCP  10.96.100.50:80 rr
#   -> 10.244.1.5:8080    Masq  1  0  0
#   -> 10.244.2.8:8080    Masq  1  0  0

# Count configured virtual servers (0 suggests IPVS mode is not active)
ipvsadm -Ln | grep -c "TCP\|UDP"
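
The real-server lines (the `->` entries) are the backends. A sketch of extracting them, run here against a saved sample of the output above; on a real node you would pipe ipvsadm -Ln directly:

```shell
# Sketch: pull the real-server (backend) addresses out of ipvsadm -Ln output.
ipvs_out='TCP  10.96.100.50:80 rr
  -> 10.244.1.5:8080    Masq  1  0  0
  -> 10.244.2.8:8080    Masq  1  0  0'
printf '%s\n' "$ipvs_out" | awk '/->/ {print $2}'
# prints one backend Pod IP:port per line
```

An empty backend list for a virtual server mirrors the "empty Endpoints" failure in iptables mode.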

kube-proxy Failure Symptoms

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| Service ClusterIP unreachable but Pod IP works | kube-proxy not running or misconfigured | Restart kube-proxy pods; check ConfigMap |
| Service works on some nodes, not others | kube-proxy down on specific node | Check kube-proxy pod on that node |
| New Service not reachable | kube-proxy can't reach API server | Check kube-proxy logs for connection errors |
| NodePort not accessible externally | Firewall blocking port range 30000-32767 | Check node firewall / security groups |

kube-proxy ConfigMap

# View full config
kubectl get cm -n kube-system kube-proxy -o yaml

# Key fields:
# mode: ""           ← empty = iptables (default)
# mode: "ipvs"       ← IPVS mode
# clusterCIDR: "10.244.0.0/16"
# metricsBindAddress: "0.0.0.0:10249"

# After editing, restart kube-proxy
kubectl rollout restart daemonset kube-proxy -n kube-system

CNI Troubleshooting

CNI Architecture

kubelet creates pod sandbox
Calls CNI binary (from /opt/cni/bin/)
Reads config (from /etc/cni/net.d/)
CNI plugin sets up:
  - veth pair (pod ↔ node bridge)
  - IP address assignment (IPAM)
  - Routes
Pod gets network interface and IP

Key CNI Paths

| Path | Purpose |
|------|---------|
| /etc/cni/net.d/ | CNI configuration files (first file alphabetically wins) |
| /opt/cni/bin/ | CNI plugin binaries |
| /var/log/calico/ | Calico-specific logs (if using Calico) |
| /run/flannel/ | Flannel subnet config (if using Flannel) |

CNI Not Installed / Broken

# Symptom: pods stuck in ContainerCreating, nodes NotReady
kubectl get nodes
kubectl describe node <node> | grep -i "network"
# "NetworkReady=false  reason:NetworkPluginNotReady message:Network plugin not ready: cni"

# Check if CNI config exists
ls /etc/cni/net.d/

# Check if CNI binaries exist
ls /opt/cni/bin/

# Check CNI plugin pods (e.g., Calico)
kubectl get pods -n kube-system -l k8s-app=calico-node
kubectl get pods -n kube-system -l app=flannel

# Check CNI pod logs
kubectl logs -n kube-system <calico-node-pod>

Common CNI Issues

| Symptom | Cause | Fix |
|---------|-------|-----|
| Nodes NotReady, pods ContainerCreating | No CNI installed | Install CNI: kubectl apply -f <cni-manifest> |
| Pods get IPs but can't communicate cross-node | CNI misconfigured; pod CIDR mismatch | Verify --pod-network-cidr matches CNI config |
| IP address exhaustion | Pod CIDR too small | Expand CIDR or clean up leaked IPs |
| CNI pods CrashLooping | Wrong CNI config or missing binaries | Check CNI pod logs; reinstall CNI |
| Partial connectivity | Overlay network blocked (VXLAN UDP 4789, BGP TCP 179) | Open required ports between nodes |

Pod CIDR Mismatch

# What kubeadm was told
kubectl cluster-info dump | grep -m1 "cluster-cidr"
# or
kubectl get cm -n kube-system kubeadm-config -o yaml | grep podSubnet

# What CNI is configured with
# Calico:
kubectl get ippools.crd.projectcalico.org -o yaml | grep cidr
# Flannel:
kubectl get cm -n kube-system kube-flannel-cfg -o yaml | grep Network

# These MUST match — mismatch = cross-node communication fails
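
Once you have both values, the comparison itself is trivial; the point is to put them side by side rather than eyeball two YAML dumps. A sketch (the CIDR values are examples; substitute the real outputs of the commands above):

```shell
# Sketch: compare the kubeadm podSubnet with the CNI-configured CIDR.
kubeadm_cidr="10.244.0.0/16"
cni_cidr="10.244.0.0/16"
if [ "$kubeadm_cidr" = "$cni_cidr" ]; then
  echo "pod CIDRs match"
else
  echo "MISMATCH: $kubeadm_cidr vs $cni_cidr"
fi
```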

Required Ports for CNI Overlays

| CNI | Protocol | Port | Purpose |
|-----|----------|------|---------|
| Calico (VXLAN) | UDP | 4789 | VXLAN encapsulation |
| Calico (BGP) | TCP | 179 | BGP peering |
| Flannel (VXLAN) | UDP | 4789 | VXLAN encapsulation |
| Cilium | UDP | 8472 | VXLAN (default) |
| WireGuard (any CNI) | UDP | 51820 | Encrypted overlay |

Network Debugging Toolkit

Tool Quick Reference

| Tool | Purpose | Available In |
|------|---------|--------------|
| kubectl exec | Run commands inside a pod | kubectl (always available) |
| nslookup / dig | DNS resolution testing | busybox / dnsutils |
| curl / wget | HTTP connectivity testing | curlimages/curl / busybox |
| nc (netcat) | TCP/UDP port testing | busybox / nicolaka/netshoot |
| tcpdump | Packet capture | nicolaka/netshoot / node |
| ip | Network interface/route inspection | Most images / node |
| ss / netstat | Socket/connection listing | Most images / node |

kubectl exec Patterns

# Interactive shell
kubectl exec -it <pod> -- sh

# One-shot command
kubectl exec <pod> -- cat /etc/resolv.conf

# Specific container in multi-container pod
kubectl exec <pod> -c <container> -- ps aux

# Namespace-aware
kubectl exec -n <ns> <pod> -- env

Ephemeral Debug Containers

# Attach a debug container to a running pod (K8s 1.25+)
kubectl debug -it <pod> --image=busybox --target=<container>

# Debug a node
kubectl debug node/<node-name> -it --image=busybox
# Gives you a shell with node filesystem at /host

# Copy a pod for debugging (non-destructive)
kubectl debug <pod> -it --image=busybox --copy-to=debug-pod

DNS Testing (nslookup)

# Quick DNS test pod
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup kubernetes.default

# Test specific service
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup my-svc.my-ns.svc.cluster.local

# Test external resolution
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup google.com

# Using dig for more detail (needs dnsutils image)
kubectl run dig-test --image=registry.k8s.io/e2e-test-images/agnhost:2.39 --rm -it -- dig my-svc.default.svc.cluster.local

curl / wget for HTTP Testing

# Test Service from inside cluster
kubectl run curl-test --image=curlimages/curl --rm -it -- curl -s --max-time 5 http://my-svc:80

# Test with headers and status code
kubectl run curl-test --image=curlimages/curl --rm -it -- curl -s -o /dev/null -w "%{http_code}" http://my-svc:80

# Test pod IP directly (bypass Service)
kubectl run curl-test --image=curlimages/curl --rm -it -- curl -s http://10.244.1.5:8080

# wget alternative (busybox)
kubectl run wget-test --image=busybox:1.36 --rm -it -- wget -qO- --timeout=5 http://my-svc:80
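
In a script, the captured status code is more useful than raw output. A small helper (name hypothetical) that classifies what curl's %{http_code} gave back, noting that curl prints 000 when no HTTP response arrived at all:

```shell
# Sketch: classify the status code captured via curl -w "%{http_code}".
check_http() {
  case "$1" in
    2??|3??) echo "reachable ($1)" ;;
    000)     echo "no connection (timeout/refused)" ;;
    *)       echo "server error ($1)" ;;
  esac
}

check_http 200   # → reachable (200)
check_http 000   # → no connection (timeout/refused)
```

"no connection" points at the network layers below; a 5xx means the path works and the app itself is failing.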

netcat (nc) for Port Testing

# Test if TCP port is open
kubectl run nc-test --image=busybox:1.36 --rm -it -- nc -zv <ip> <port>
# "open" = port reachable, "Connection refused" = nothing listening, timeout = blocked

# Test Service port
kubectl run nc-test --image=busybox:1.36 --rm -it -- nc -zv my-svc 80

# Listen on a port (create a test server)
kubectl run listener --image=busybox:1.36 -- nc -lk -p 8080

# Send data to test connectivity
kubectl exec sender -- sh -c 'echo "test" | nc <ip> 8080'

tcpdump for Packet Capture

# On a node — capture traffic for a specific pod IP
tcpdump -i any host 10.244.1.5 -nn

# Capture DNS traffic only
tcpdump -i any port 53 -nn

# Capture with limited count and write to file
tcpdump -i any host 10.244.1.5 -nn -c 50 -w /tmp/capture.pcap

# Using netshoot as a debug sidecar
kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container> -- tcpdump -i eth0 -nn

ip and ss Commands

# Inside a pod — check interfaces and IPs
kubectl exec <pod> -- ip addr
kubectl exec <pod> -- ip route

# Check listening ports inside a pod
kubectl exec <pod> -- ss -tlnp
# or
kubectl exec <pod> -- netstat -tlnp

# On a node — check bridge interfaces
ip link show type bridge
ip link show type veth
bridge fdb show

The netshoot Swiss Army Knife

# nicolaka/netshoot has ALL networking tools pre-installed
kubectl run netshoot --image=nicolaka/netshoot --rm -it -- bash

# Inside netshoot you get:
# curl, wget, nslookup, dig, nc, tcpdump, ip, ss, traceroute,
# mtr, iperf3, ethtool, nmap, strace, and more

# Attach to existing pod's network namespace
kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container>

Putting It All Together: Connectivity Debugging Playbook

Scenario: Pod A → Service B Broken

# Step 1: Verify Service and Endpoints exist
kubectl get svc,ep <service-b>

# Step 2: DNS resolution
kubectl exec <pod-a> -- nslookup <service-b>

# Step 3: Reach ClusterIP
kubectl exec <pod-a> -- curl -s --max-time 3 http://<clusterIP>:<port>

# Step 4: Reach Pod IP directly (bypass kube-proxy)
kubectl exec <pod-a> -- curl -s --max-time 3 http://<pod-b-ip>:<targetPort>

# Step 5: Check backend is listening
kubectl exec <pod-b> -- ss -tlnp | grep <targetPort>

# Step 6: Check NetworkPolicies
kubectl get networkpolicy -A
kubectl describe networkpolicy -n <namespace>

# Step 7: Check kube-proxy (on node)
iptables-save | grep <service-name>

# Step 8: Check CNI pods
kubectl get pods -n kube-system | grep -E "calico|flannel|cilium|weave"

Quick Decision Matrix

Step 4 works but Step 3 doesn't → kube-proxy issue
Step 3 works but Step 2 doesn't → DNS issue
Step 2 works but Step 1 shows empty EP → selector mismatch
Step 5 fails → app not listening on expected port
Step 4 fails → NetworkPolicy or CNI issue
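
The matrix above amounts to a lookup from first failing step to likely culprit. A sketch (helper name hypothetical; step numbers refer to the playbook above):

```shell
# Sketch: map the first failing playbook step to a diagnosis.
diagnose() {
  case "$1" in
    1) echo "Service or Endpoints missing: check selector" ;;
    2) echo "DNS: check CoreDNS" ;;
    3) echo "kube-proxy: check iptables/IPVS rules" ;;
    4) echo "NetworkPolicy or CNI" ;;
    5) echo "app not listening on expected port" ;;
    *) echo "no failing step given" ;;
  esac
}

diagnose 3   # → kube-proxy: check iptables/IPVS rules
```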

CKA Tips

  • Always isolate the layer first — don't jump to CNI when it's a selector mismatch
  • --rm -it on debug pods auto-cleans up — essential during exam to avoid clutter
  • busybox:1.36 is your go-to debug image — has nslookup, wget, nc, ping
  • nicolaka/netshoot when you need heavier tools (tcpdump, dig, iperf)
  • curl -s --max-time 3 prevents hanging on unreachable endpoints
  • Check iptables on the node only after confirming kube-proxy is the suspect
  • CNI issues show as NotReady nodes and ContainerCreating pods — check /etc/cni/net.d/ first

Practice Exercises

Exercise 1 — Isolate a Connectivity Failure

# Setup
kubectl create namespace net-debug
kubectl run server --image=nginx --port=80 -n net-debug --labels="app=server"
kubectl expose pod server --port=80 --target-port=80 -n net-debug

# Tasks:
# 1. From a busybox pod in the default namespace, try to reach the service
# 2. Use FQDN to resolve across namespaces
# 3. Verify with curl that HTTP response is 200
# 4. Test direct pod IP connectivity

Exercise 2 — kube-proxy Verification

# Tasks:
# 1. Identify which mode kube-proxy is running in
# 2. Find the kube-proxy ConfigMap and check clusterCIDR
# 3. SSH to a node and verify iptables rules exist for the "server" service from Exercise 1
# 4. Count total iptables NAT rules

Exercise 3 — CNI Diagnosis

# Tasks:
# 1. Identify which CNI plugin is installed in your cluster
# 2. Check /etc/cni/net.d/ on a node for the config file
# 3. Verify CNI pods are running and healthy
# 4. Confirm pod CIDR matches between kubeadm config and CNI config
# 5. Check that required overlay ports are open between nodes

Exercise 4 — NetworkPolicy Blocking Traffic

# Setup
kubectl create namespace policy-test
kubectl run web --image=nginx --port=80 -n policy-test --labels="app=web"
kubectl expose pod web --port=80 -n policy-test

# Apply default deny
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: policy-test
spec:
  podSelector: {}
  policyTypes:
  - Ingress
EOF

# Tasks:
# 1. Verify that a test pod in policy-test cannot reach the web service
# 2. Create a NetworkPolicy that allows ingress from pods with label role=client
# 3. Label a test pod as role=client and verify connectivity is restored
# 4. Verify pods without the label are still blocked

Exercise 5 — Break/Fix: Full Network Stack

# This exercise requires a multi-node cluster (kubeadm, kind, etc.)
# Simulate these failures one at a time and fix them:

# Scenario A: Delete the CNI config
#   sudo mv /etc/cni/net.d/10-calico.conflist /tmp/
#   Observe: new pods stuck in ContainerCreating
#   Fix: restore the file, restart affected pods

# Scenario B: Kill kube-proxy
#   kubectl delete pods -n kube-system -l k8s-app=kube-proxy
#   Observe: existing connections work (conntrack), new Service connections fail
#   Fix: kube-proxy DaemonSet recreates pods automatically

# Scenario C: Break DNS
#   kubectl scale deployment coredns -n kube-system --replicas=0
#   Observe: DNS resolution fails, but direct IP access works
#   Fix: kubectl scale deployment coredns -n kube-system --replicas=2

Key Takeaways

| Concept | Key Point |
|---------|-----------|
| Debugging flow | Isolate layer: App → Pod IP → Service → DNS → CNI |
| kube-proxy | Translates ClusterIP to Pod IPs via iptables/IPVS rules |
| iptables verification | iptables-save \| grep <svc> on the node |
| CNI config | /etc/cni/net.d/ — first file alphabetically wins |
| CNI binaries | /opt/cni/bin/ — must exist on every node |
| Pod CIDR mismatch | kubeadm podSubnet must match CNI config |
| busybox debug pod | kubectl run tmp --image=busybox:1.36 --rm -it -- sh |
| netshoot | Full toolkit: tcpdump, dig, nc, iperf, traceroute |
| --max-time | Always set timeout on curl to avoid exam time waste |
| Decision matrix | Pod IP works but ClusterIP doesn't = kube-proxy problem |

← 16-troubleshooting-applications.md | → 18-exam-tips.md