Module 8 — Scheduling
Overview
The scheduler decides which node runs each pod. The CKA exam tests your ability to control scheduling using nodeSelector, affinity rules, taints, tolerations, resource constraints, static pods, and manual scheduling. This module covers the scheduler internals and every mechanism to influence pod placement.
1. Scheduler Internals
1.1 How the Scheduler Works
The scheduler watches for pods with no spec.nodeName and assigns them to a node through two phases:
```
Unscheduled Pod (nodeName = "")
        │
        ▼
┌───────────────────┐
│ 1. FILTERING      │  Eliminate nodes that CAN'T run the pod
│                   │
│ Checks:           │
│ - Sufficient CPU/memory (vs requests)
│ - Node taints tolerated?
│ - nodeSelector / nodeAffinity match?
│ - PodAntiAffinity satisfied?
│ - Node cordoned? (unschedulable)
│ - Port conflicts?
│ - Volume topology constraints?
└────────┬──────────┘
         │ remaining candidate nodes
         ▼
┌───────────────────┐
│ 2. SCORING        │  Rank candidates by preference
│                   │
│ Factors:          │
│ - Resource balance (NodeResourcesBalancedAllocation)
│ - Pod spread (PodTopologySpread)
│ - Node affinity weight
│ - Pod affinity/anti-affinity weight
│ - Image already present on node (ImageLocality)
└────────┬──────────┘
         │
         ▼
Highest-scoring node wins
Pod.spec.nodeName = "selected-node"
```
1.2 When Scheduling Fails
If no node passes filtering, the pod stays Pending:
```bash
kubectl describe pod <pending-pod> | grep -A5 Events
# Warning  FailedScheduling  0/3 nodes are available:
#   1 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate,
#   2 node(s) didn't match Pod's node affinity/selector
```
CKA Tip: Always check Events in kubectl describe pod when a pod is stuck in Pending.
2. nodeSelector
The simplest way to constrain a pod to specific nodes. Matches exact node labels.
2.1 Label a Node
```bash
# Add a label
kubectl label node worker-1 disktype=ssd

# Verify
kubectl get nodes --show-labels | grep disktype

# Remove a label
kubectl label node worker-1 disktype-
```
2.2 Use nodeSelector in a Pod
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: app
    image: nginx
```
```bash
# Imperative (generate YAML, then add nodeSelector)
kubectl run ssd-pod --image=nginx --dry-run=client -o yaml > pod.yaml
# Edit pod.yaml to add nodeSelector
```
2.3 Built-in Node Labels
Every node has these labels automatically:
| Label | Example Value |
|---|---|
| kubernetes.io/hostname | worker-1 |
| kubernetes.io/os | linux |
| kubernetes.io/arch | amd64 |
| node.kubernetes.io/instance-type | m5.large (cloud) |
| topology.kubernetes.io/zone | us-east-1a (cloud) |
| topology.kubernetes.io/region | us-east-1 (cloud) |
3. Node Affinity
More expressive than nodeSelector — supports operators, soft/hard preferences, and multiple conditions.
3.1 Required vs Preferred
| Type | Behavior | Equivalent to |
|---|---|---|
| requiredDuringSchedulingIgnoredDuringExecution | Hard — pod won't schedule if no node matches | nodeSelector (but more flexible) |
| preferredDuringSchedulingIgnoredDuringExecution | Soft — scheduler prefers matching nodes but will schedule elsewhere if needed | No equivalent |
"IgnoredDuringExecution" means if a node's labels change after the pod is scheduled, the pod is NOT evicted. It stays where it is.
3.2 Required (Hard) Node Affinity
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hard-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
            - nvme
  containers:
  - name: app
    image: nginx
```
3.3 Preferred (Soft) Node Affinity
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: soft-affinity
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80            # 1-100, higher = stronger preference
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      - weight: 20
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
  containers:
  - name: app
    image: nginx
```
3.4 Combining Required and Preferred
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os        # MUST be linux
            operator: In
            values: ["linux"]
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: disktype                # PREFER ssd
            operator: In
            values: ["ssd"]
```
3.5 Operators
| Operator | Meaning |
|---|---|
| In | Label value is in the list |
| NotIn | Label value is NOT in the list |
| Exists | Label key exists (value doesn't matter) |
| DoesNotExist | Label key does NOT exist |
| Gt | Label value is greater than (numeric) |
| Lt | Label value is less than (numeric) |
4. Pod Affinity and Anti-Affinity
Controls pod placement relative to other pods, not nodes.
4.1 Pod Affinity — "Schedule Near"
Schedule this pod on a node that already runs pods with label app=cache:
```yaml
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["cache"]
        topologyKey: kubernetes.io/hostname   # same node
```
4.2 Pod Anti-Affinity — "Schedule Away From"
Spread replicas across different nodes:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["web"]
            topologyKey: kubernetes.io/hostname   # different nodes
      containers:
      - name: web
        image: nginx
```
4.3 topologyKey
Defines the "domain" for affinity/anti-affinity:
| topologyKey | Meaning |
|---|---|
| kubernetes.io/hostname | Same/different node |
| topology.kubernetes.io/zone | Same/different availability zone |
| topology.kubernetes.io/region | Same/different region |
4.4 Soft Pod Anti-Affinity
```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values: ["web"]
          topologyKey: kubernetes.io/hostname
```
CKA Tip: Hard anti-affinity with topologyKey: kubernetes.io/hostname means you can't have more replicas than nodes. If you have 3 nodes and 4 replicas, one pod stays Pending.
5. Taints and Tolerations
Taints are applied to nodes to repel pods. Tolerations are applied to pods to allow scheduling on tainted nodes.
5.1 Concept
```
Node with taint: "gpu=true:NoSchedule"

Pod WITHOUT toleration ──▶ REJECTED ✗
Pod WITH toleration    ──▶ ALLOWED ✓
```
Taints repel. Tolerations permit. A toleration does NOT guarantee scheduling on the tainted node — it only removes the restriction. Use nodeSelector or affinity to attract pods to specific nodes.
5.2 Taint Effects
| Effect | Behavior |
|---|---|
| NoSchedule | New pods without toleration won't be scheduled. Existing pods stay. |
| PreferNoSchedule | Scheduler tries to avoid, but will schedule if no other option. |
| NoExecute | New pods rejected AND existing pods without toleration are evicted. |
5.3 Managing Taints
```bash
# Add a taint
kubectl taint nodes worker-1 gpu=true:NoSchedule

# Verify
kubectl describe node worker-1 | grep Taints

# Remove a taint (note the minus at the end)
kubectl taint nodes worker-1 gpu=true:NoSchedule-

# Remove all taints with a key
kubectl taint nodes worker-1 gpu-
```
5.4 Control Plane Taint
By default, kubeadm taints control plane nodes:
```bash
kubectl describe node controlplane | grep Taints
# Taints: node-role.kubernetes.io/control-plane:NoSchedule
```
This prevents workload pods from running on the control plane. To allow scheduling:
```bash
# Remove the control plane taint
kubectl taint nodes controlplane node-role.kubernetes.io/control-plane:NoSchedule-
```
5.5 Tolerations
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nvidia/cuda:latest
```
5.6 Toleration Operators
| Operator | Meaning |
|---|---|
| Equal | Key, value, and effect must all match |
| Exists | Key and effect must match (value is ignored) |
Special cases:
```yaml
# Tolerate ALL taints with key "gpu" (any effect)
tolerations:
- key: "gpu"
  operator: "Exists"

# Tolerate ALL taints on the node (blanket toleration, as used by some DaemonSets)
tolerations:
- operator: "Exists"
```
5.7 NoExecute and tolerationSeconds
With NoExecute, you can set how long a pod stays before eviction:
```yaml
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300   # stay for 5 minutes, then evict
```
5.8 Taints + nodeSelector Together
Common pattern — dedicate nodes to specific workloads:
```bash
# 1. Taint the node (repel everything)
kubectl taint nodes worker-3 dedicated=gpu:NoSchedule

# 2. Label the node (attract specific pods)
kubectl label nodes worker-3 hardware=gpu
```
```yaml
# Pod: tolerate the taint AND select the node
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  nodeSelector:
    hardware: gpu
```
6. Resource Requests and Limits
6.1 Concepts
| Field | Purpose | Used by |
|---|---|---|
| requests | Minimum guaranteed resources — used for placement decisions | Scheduler |
| limits | Maximum allowed resources — enforced at runtime | kubelet / kernel |
```yaml
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 250m        # 0.25 CPU cores
        memory: 128Mi    # 128 MiB
      limits:
        cpu: 500m        # 0.5 CPU cores
        memory: 256Mi    # 256 MiB
```
6.2 CPU vs Memory Units
| Resource | Unit | Examples |
|---|---|---|
| CPU | millicores (m) or cores | 100m = 0.1 core, 1 = 1 core, 1500m = 1.5 cores |
| Memory | bytes with suffix | 128Mi (mebibytes), 1Gi (gibibytes), 256M (megabytes) |
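As a sanity check on these units, the conversions can be done with plain shell arithmetic (no cluster needed; the values are taken from the table's examples):

```shell
# Millicores to cores: divide by 1000
cpu="1500m"
cores=$(awk "BEGIN { print ${cpu%m} / 1000 }")
echo "$cpu = $cores cores"              # 1500m = 1.5 cores

# Mi (mebibytes) to bytes: multiply by 1024^2
mem_mi=128
mem_bytes=$(( mem_mi * 1024 * 1024 ))
echo "${mem_mi}Mi = $mem_bytes bytes"   # 128Mi = 134217728 bytes
```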
6.3 How Requests Affect Scheduling
```
Node capacity:      4 CPU, 8Gi memory
Already allocated:  2.5 CPU, 5Gi memory
Available:          1.5 CPU, 3Gi memory

New pod requests:   2 CPU, 1Gi memory
→ CPU request (2) > available (1.5) → NOT SCHEDULABLE on this node
```
The scheduler sums all pod requests (not limits) on a node to determine available capacity.
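That fit check can be sketched as plain shell arithmetic (illustrative only — the real scheduler runs this per resource alongside all the other filters):

```shell
# Numbers from the example above, in millicores
capacity_m=4000      # node capacity: 4 CPU
allocated_m=2500     # sum of existing pods' requests: 2.5 CPU
request_m=2000       # new pod's request: 2 CPU

available_m=$(( capacity_m - allocated_m ))   # 1500m
if [ "$request_m" -le "$available_m" ]; then
  echo "fits"
else
  echo "does not fit"    # 2000m > 1500m, so this branch runs
fi
```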
6.4 How Limits Are Enforced
| Resource | Over-limit behavior |
|---|---|
| CPU | Throttled — container is slowed down but NOT killed |
| Memory | OOMKilled — container is killed and restarted |
```bash
# Check if a pod was OOMKilled
kubectl describe pod <name> | grep -A3 "Last State"
#     Last State:  Terminated
#       Reason:    OOMKilled
#       Exit Code: 137
```
6.5 QoS Classes
Kubernetes assigns a Quality of Service class based on requests and limits:
| QoS Class | Condition | Eviction Priority |
|---|---|---|
| Guaranteed | requests == limits for ALL containers (CPU and memory) | Last to be evicted |
| Burstable | At least one container has a request or limit set, but the Guaranteed criteria aren't met | Middle |
| BestEffort | No requests or limits set on any container | First to be evicted |
```bash
# Check QoS class
kubectl get pod <name> -o jsonpath='{.status.qosClass}'
```
```yaml
# Guaranteed example (requests == limits)
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 256Mi

# BestEffort example (no resources at all)
# Just don't set the resources field
```
6.6 LimitRange
Namespace-level defaults and constraints for resource requests/limits:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - type: Container
    default:              # default limits (if not specified)
      cpu: 500m
      memory: 256Mi
    defaultRequest:       # default requests (if not specified)
      cpu: 100m
      memory: 128Mi
    max:                  # maximum allowed
      cpu: "2"
      memory: 1Gi
    min:                  # minimum allowed
      cpu: 50m
      memory: 64Mi
```
6.7 ResourceQuota
Namespace-level total resource budget:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
    services: "10"
    persistentvolumeclaims: "5"
```
```bash
# Check quota usage
kubectl get resourcequota -n team-a
kubectl describe resourcequota team-quota -n team-a
```
CKA Tip: When a ResourceQuota sets compute resources (requests.cpu, limits.memory, etc.), every pod in that namespace MUST specify those requests/limits, or it will be rejected.
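For example, with the team-quota above applied, a quick pod without resources is rejected at admission (exact error wording varies by version):

```
kubectl run noquota --image=nginx -n team-a
# Error from server (Forbidden): pods "noquota" is forbidden: failed quota: team-quota:
#   must specify limits.cpu,limits.memory,requests.cpu,requests.memory
```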
7. Static Pods
7.1 What Are Static Pods?
Static pods are managed directly by kubelet, not by the API server. kubelet watches a directory for pod manifests and creates/restarts them automatically.
```
kubelet watches → /etc/kubernetes/manifests/
                    ├── etcd.yaml
                    ├── kube-apiserver.yaml
                    ├── kube-controller-manager.yaml
                    └── kube-scheduler.yaml
```
7.2 Key Characteristics
| Aspect | Static Pods | Regular Pods |
|---|---|---|
| Created by | kubelet (from manifest files) | API server (via controllers) |
| Visible in API | Yes (mirror pod, read-only) | Yes (full control) |
| Can be deleted via kubectl | No — kubelet recreates them | Yes |
| Managed by controllers | No | Yes (Deployments, etc.) |
| Naming | `<name>-<node-name>` | `<name>-<random>` |
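One way to see the mirror-pod relationship from the API side (assumes a kubeadm cluster whose control plane node is named controlplane):

```
# A mirror pod is "owned" by its Node object, not by a controller
kubectl get pod kube-apiserver-controlplane -n kube-system \
  -o jsonpath='{.metadata.ownerReferences[0].kind}'
# Node
```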
7.3 Finding the Static Pod Path
```bash
# Method 1: Check kubelet config
grep staticPodPath /var/lib/kubelet/config.yaml
# staticPodPath: /etc/kubernetes/manifests

# Method 2: Check kubelet process arguments
ps aux | grep kubelet | grep -- --pod-manifest-path

# Method 3: Check kubelet service file
systemctl cat kubelet | grep -- --config
# Then check the config file for staticPodPath
```
7.4 Creating a Static Pod
```bash
# Create a manifest in the static pod directory
cat <<EOF > /etc/kubernetes/manifests/static-nginx.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-nginx
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
EOF

# kubelet automatically creates the pod
kubectl get pods | grep static-nginx
# static-nginx-controlplane   1/1   Running   0   10s
```
7.5 Deleting a Static Pod
```bash
# This does NOT work permanently — kubelet recreates it
kubectl delete pod static-nginx-controlplane

# To actually remove it, delete the manifest file
rm /etc/kubernetes/manifests/static-nginx.yaml
```
CKA Tip: If asked to create a static pod on a specific node, SSH to that node, find the staticPodPath, and create the manifest there.
8. Manual Scheduling (nodeName)
8.1 Bypassing the Scheduler
Setting spec.nodeName directly assigns a pod to a node without going through the scheduler:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: manual-pod
spec:
  nodeName: worker-1     # bypass scheduler entirely
  containers:
  - name: nginx
    image: nginx
```
8.2 When to Use
- Scheduler is down and you need to place a pod
- Debugging scheduling issues
- Exam scenarios that explicitly ask for manual scheduling
8.3 Limitations
- No filtering or scoring — the pod is placed even if the node can't handle it
- If the node doesn't exist, the pod stays Pending
- Cannot be changed after pod creation — you must delete and recreate
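A quick demonstration of the second limitation (the node name here is deliberately bogus):

```
# nodeName pointing at a node that doesn't exist
kubectl run ghost --image=nginx --overrides='{"spec":{"nodeName":"no-such-node"}}'

# Stays Pending — and with no FailedScheduling event, since the scheduler was bypassed
kubectl get pod ghost
kubectl delete pod ghost
```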
8.4 Binding Object (Alternative)
If a pod is already created without nodeName, you can bind it manually:
```bash
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Binding
metadata:
  name: manual-pod        # must match the Pending pod's name
target:
  apiVersion: v1
  kind: Node
  name: worker-1
EOF
```

Note: use kubectl create here, not apply — apply needs to read the object back, which the Binding subresource doesn't support.
9. Scheduler Profiles and Multiple Schedulers
9.1 Custom Scheduler
You can run a second scheduler alongside the default:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled
spec:
  schedulerName: my-custom-scheduler   # use a non-default scheduler
  containers:
  - name: app
    image: nginx
```
```bash
# Check which scheduler placed a pod
kubectl get pod <name> -o jsonpath='{.spec.schedulerName}'

# If schedulerName doesn't match any running scheduler, the pod stays Pending
```
CKA Tip: You probably won't need to deploy a custom scheduler, but you should know the schedulerName field exists.
10. Summary — Choosing the Right Mechanism
| Goal | Mechanism |
|---|---|
| Pod MUST run on nodes with label X | nodeSelector or required nodeAffinity |
| Pod SHOULD PREFER nodes with label X | Preferred nodeAffinity with weight |
| Pod MUST run on same node as pod Y | Required podAffinity |
| Pod MUST NOT run on same node as pod Y | Required podAntiAffinity |
| Repel all pods from a node | Taint with NoSchedule |
| Dedicate a node to specific workloads | Taint + toleration + nodeSelector |
| Guarantee minimum resources | resources.requests |
| Cap maximum resources | resources.limits |
| Run exactly one pod per node | DaemonSet |
| Place pod on a specific node (no scheduler) | spec.nodeName |
| Run pod managed by kubelet only | Static pod in /etc/kubernetes/manifests/ |
11. Practice Exercises
Exercise 1 — nodeSelector
```bash
# 1. Label a node
kubectl label node worker-1 env=production

# 2. Create a pod with nodeSelector
kubectl run selector-test --image=nginx --dry-run=client -o yaml > pod.yaml
# Add nodeSelector: { env: production } to the spec
kubectl apply -f pod.yaml

# 3. Verify it landed on worker-1
kubectl get pod selector-test -o wide

# 4. Clean up
kubectl delete pod selector-test
kubectl label node worker-1 env-
```
Exercise 2 — Node Affinity
```bash
# 1. Label two nodes
kubectl label node worker-1 disktype=ssd
kubectl label node worker-2 disktype=hdd

# 2. Create a pod that REQUIRES ssd
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: affinity-test
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd"]
  containers:
  - name: app
    image: nginx
EOF

# 3. Verify it's on worker-1
kubectl get pod affinity-test -o wide

# 4. Clean up
kubectl delete pod affinity-test
kubectl label node worker-1 disktype-
kubectl label node worker-2 disktype-
```
Exercise 3 — Taints and Tolerations
```bash
# 1. Taint a node
kubectl taint nodes worker-1 dedicated=special:NoSchedule

# 2. Try to schedule a pod (should go to worker-2)
kubectl run no-toleration --image=nginx
kubectl get pod no-toleration -o wide   # NOT on worker-1

# 3. Create a pod with toleration (can go to worker-1)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: with-toleration
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "special"
    effect: "NoSchedule"
  nodeSelector:
    kubernetes.io/hostname: worker-1
  containers:
  - name: app
    image: nginx
EOF

# 4. Verify
kubectl get pod with-toleration -o wide   # on worker-1

# 5. Clean up
kubectl delete pod no-toleration with-toleration
kubectl taint nodes worker-1 dedicated=special:NoSchedule-
```
Exercise 4 — Resource Requests and Limits
```bash
# 1. Create a pod with requests and limits
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: resource-test
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi
EOF

# 2. Check QoS class
kubectl get pod resource-test -o jsonpath='{.status.qosClass}'
# Burstable

# 3. Check node resource allocation
kubectl describe node <node> | grep -A10 "Allocated resources"

# 4. Create a Guaranteed pod (requests == limits)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 200m
        memory: 256Mi
      limits:
        cpu: 200m
        memory: 256Mi
EOF

# 5. Verify QoS
kubectl get pod guaranteed-pod -o jsonpath='{.status.qosClass}'
# Guaranteed

# 6. Clean up
kubectl delete pod resource-test guaranteed-pod
```
Exercise 5 — Static Pod
```bash
# 1. Find the static pod path
grep staticPodPath /var/lib/kubelet/config.yaml

# 2. Create a static pod manifest
cat <<EOF > /etc/kubernetes/manifests/static-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-test
spec:
  containers:
  - name: nginx
    image: nginx
EOF

# 3. Verify it appears (with node name suffix)
kubectl get pods | grep static-test

# 4. Try to delete it via kubectl — it comes back
kubectl delete pod static-test-<node-name>
kubectl get pods | grep static-test   # still there

# 5. Actually remove it
rm /etc/kubernetes/manifests/static-test.yaml
kubectl get pods | grep static-test   # gone
```
Exercise 6 — Manual Scheduling
```bash
# 1. Create a pod YAML with nodeName set
cat <<EOF > manual.yaml
apiVersion: v1
kind: Pod
metadata:
  name: manual-pod
spec:
  nodeName: worker-1
  containers:
  - name: nginx
    image: nginx
EOF

# 2. Apply
kubectl apply -f manual.yaml

# 3. Verify it's on worker-1 (bypassed scheduler)
kubectl get pod manual-pod -o wide

# 4. Clean up
kubectl delete pod manual-pod
```
12. Key Takeaways for the CKA Exam
| Point | Detail |
|---|---|
| nodeSelector is simplest | Exact label match — use when you just need "run on nodes with label X" |
| Node affinity for complex rules | In, NotIn, Exists, DoesNotExist operators + soft/hard |
| Taints repel, tolerations permit | Toleration alone doesn't attract — combine with nodeSelector |
| Know the three taint effects | NoSchedule, PreferNoSchedule, NoExecute |
| Remove taint with minus | `kubectl taint nodes <node> key:effect-` |
| Requests = scheduling, limits = enforcement | Scheduler uses requests; kubelet enforces limits |
| CPU throttled, memory OOMKilled | Over-limit behavior differs by resource type |
| QoS: Guaranteed > Burstable > BestEffort | Eviction order under memory pressure |
| Static pods in /etc/kubernetes/manifests/ | Managed by kubelet, not the API server |
| nodeName bypasses the scheduler | Direct placement — no filtering or scoring |
| kubectl describe pod for scheduling failures | Check Events section for FailedScheduling |
Previous: 07-pods-and-workloads.md — Pods & Workloads
Next: 09-configmaps-secrets.md — ConfigMaps & Secrets