
Module 8 — Scheduling

Overview

The scheduler decides which node runs each pod. The CKA exam tests your ability to control scheduling using nodeSelector, affinity rules, taints, tolerations, resource constraints, static pods, and manual scheduling. This module covers the scheduler internals and every mechanism to influence pod placement.


1. Scheduler Internals

1.1 How the Scheduler Works

The scheduler watches for pods with no spec.nodeName and assigns them to a node through two phases:

Unscheduled Pod (nodeName = "")
                        │
┌────────────────────────────────────────────────┐
│  1. FILTERING — eliminate nodes that CAN'T     │
│     run the pod. Checks:                       │
│     - Sufficient CPU/memory (vs requests)      │
│     - Node taints tolerated?                   │
│     - nodeSelector / nodeAffinity match?       │
│     - podAntiAffinity satisfied?               │
│     - Node cordoned? (unschedulable)           │
│     - Port conflicts?                          │
│     - Volume topology constraints?             │
└───────────────────────┬────────────────────────┘
                        │  remaining candidate nodes
┌───────────────────────┴────────────────────────┐
│  2. SCORING — rank candidates by preference:   │
│     - Resource balance (LeastRequestedPriority)│
│     - Pod spread (SelectorSpreadPriority)      │
│     - Node affinity weight                     │
│     - Pod affinity/anti-affinity weight        │
│     - Image already present on node            │
└───────────────────────┬────────────────────────┘
                        │
          Highest-scoring node wins:
          pod.spec.nodeName = "selected-node"

1.2 When Scheduling Fails

If no node passes filtering, the pod stays Pending:

kubectl describe pod <pending-pod> | grep -A5 Events
# Warning  FailedScheduling  0/3 nodes are available:
#   1 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate,
#   2 node(s) didn't match Pod's node affinity/selector

CKA Tip: Always check Events in kubectl describe pod when a pod is stuck in Pending.
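When several pods are stuck, querying Events directly is faster than describing each pod one by one. A small sketch using `kubectl get events` with a field selector:

```shell
# List all FailedScheduling events in the current namespace
kubectl get events --field-selector reason=FailedScheduling

# Same, but cluster-wide and sorted by time
kubectl get events -A --field-selector reason=FailedScheduling --sort-by=.lastTimestamp
```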


2. nodeSelector

The simplest way to constrain a pod to specific nodes. Matches exact node labels.

2.1 Label a Node

# Add a label
kubectl label node worker-1 disktype=ssd

# Verify
kubectl get nodes --show-labels | grep disktype

# Remove a label
kubectl label node worker-1 disktype-

2.2 Use nodeSelector in a Pod

apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: app
    image: nginx
# Imperative (generate YAML, then add nodeSelector)
kubectl run ssd-pod --image=nginx --dry-run=client -o yaml > pod.yaml
# Edit pod.yaml to add nodeSelector

2.3 Built-in Node Labels

Every node has these labels automatically:

| Label | Example Value |
|---|---|
| kubernetes.io/hostname | worker-1 |
| kubernetes.io/os | linux |
| kubernetes.io/arch | amd64 |
| node.kubernetes.io/instance-type | m5.large (cloud) |
| topology.kubernetes.io/zone | us-east-1a (cloud) |
| topology.kubernetes.io/region | us-east-1 (cloud) |
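To inspect one node's labels without the noise of `--show-labels`, jsonpath works well (`worker-1` is a placeholder node name):

```shell
# Dump all labels on one node as a JSON map
kubectl get node worker-1 -o jsonpath='{.metadata.labels}'

# Read a single label value — dots inside the key must be escaped
kubectl get node worker-1 -o jsonpath='{.metadata.labels.kubernetes\.io/hostname}'
```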

3. Node Affinity

More expressive than nodeSelector — supports operators, soft/hard preferences, and multiple conditions.

3.1 Required vs Preferred

| Type | Behavior | Equivalent to |
|---|---|---|
| requiredDuringSchedulingIgnoredDuringExecution | Hard — pod won't schedule if no node matches | nodeSelector (but more flexible) |
| preferredDuringSchedulingIgnoredDuringExecution | Soft — scheduler prefers matching nodes but will schedule elsewhere if needed | No equivalent |

"IgnoredDuringExecution" means if a node's labels change after the pod is scheduled, the pod is NOT evicted. It stays where it is.
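You can convince yourself of this behavior with a quick experiment (node name is a placeholder; `hard-affinity.yaml` stands for a manifest like the one in 3.2 that requires disktype=ssd):

```shell
# Label the node, schedule a pod that REQUIRES the label
kubectl label node worker-1 disktype=ssd
kubectl apply -f hard-affinity.yaml        # pod lands on worker-1

# Remove the label AFTER scheduling
kubectl label node worker-1 disktype-

# "IgnoredDuringExecution": the pod is NOT evicted
kubectl get pod hard-affinity -o wide      # still Running on worker-1
```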

3.2 Required (Hard) Node Affinity

apiVersion: v1
kind: Pod
metadata:
  name: hard-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
            - nvme
  containers:
  - name: app
    image: nginx

3.3 Preferred (Soft) Node Affinity

apiVersion: v1
kind: Pod
metadata:
  name: soft-affinity
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80                    # 1-100, higher = stronger preference
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      - weight: 20
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
  containers:
  - name: app
    image: nginx

3.4 Combining Required and Preferred

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os       # MUST be linux
            operator: In
            values: ["linux"]
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: disktype               # PREFER ssd
            operator: In
            values: ["ssd"]

3.5 Operators

| Operator | Meaning |
|---|---|
| In | Label value is in the list |
| NotIn | Label value is NOT in the list |
| Exists | Label key exists (value doesn't matter) |
| DoesNotExist | Label key does NOT exist |
| Gt | Label value is greater than (numeric) |
| Lt | Label value is less than (numeric) |
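Gt and Lt take exactly one value and compare it as an integer, which is handy for numeric labels. A sketch of a fragment (gpu-count is a made-up label for illustration):

```yaml
# Match only nodes whose gpu-count label is numerically greater than 4
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gpu-count          # hypothetical label
          operator: Gt
          values: ["4"]           # single numeric string
```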

4. Pod Affinity and Anti-Affinity

Controls pod placement relative to other pods, not nodes.

4.1 Pod Affinity — "Schedule Near"

Schedule this pod on a node that already runs pods with label app=cache:

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["cache"]
        topologyKey: kubernetes.io/hostname    # same node

4.2 Pod Anti-Affinity — "Schedule Away From"

Spread replicas across different nodes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["web"]
            topologyKey: kubernetes.io/hostname    # different nodes
      containers:
      - name: web
        image: nginx

4.3 topologyKey

Defines the "domain" for affinity/anti-affinity:

| topologyKey | Meaning |
|---|---|
| kubernetes.io/hostname | Same/different node |
| topology.kubernetes.io/zone | Same/different availability zone |
| topology.kubernetes.io/region | Same/different region |
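Swapping the topologyKey changes the spreading domain: the same anti-affinity term with the zone key spreads replicas across availability zones instead of nodes. A sketch:

```yaml
# With the zone key, no two app=web pods may run in the same zone
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchLabels:
        app: web
    topologyKey: topology.kubernetes.io/zone   # different zones
```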

4.4 Soft Pod Anti-Affinity

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values: ["web"]
          topologyKey: kubernetes.io/hostname

CKA Tip: Hard anti-affinity with topologyKey: kubernetes.io/hostname means you can't have more replicas than nodes. If you have 3 nodes and 4 replicas, one pod stays Pending.


5. Taints and Tolerations

Taints are applied to nodes to repel pods. Tolerations are applied to pods to allow scheduling on tainted nodes.

5.1 Concept

Node with taint: "gpu=true:NoSchedule"

  Pod WITHOUT toleration  ──▶  REJECTED ✗
  Pod WITH toleration     ──▶  ALLOWED  ✓

Taints repel. Tolerations permit. A toleration does NOT guarantee scheduling on the tainted node — it only removes the restriction. Use nodeSelector or affinity to attract pods to specific nodes.

5.2 Taint Effects

| Effect | Behavior |
|---|---|
| NoSchedule | New pods without a toleration won't be scheduled. Existing pods stay. |
| PreferNoSchedule | Scheduler tries to avoid the node, but will schedule there if no other option. |
| NoExecute | New pods rejected AND existing pods without a toleration are evicted. |
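The difference between NoSchedule and NoExecute is easy to demonstrate (node, taint key, and pod names are placeholders):

```shell
# Run a pod, note which node it lands on
kubectl run victim --image=nginx
kubectl get pod victim -o wide               # e.g. worker-1

# NoExecute evicts the ALREADY-RUNNING pod without a toleration;
# a NoSchedule taint would have left it alone
kubectl taint nodes worker-1 maint=true:NoExecute
kubectl get pod victim -o wide

# Clean up
kubectl taint nodes worker-1 maint=true:NoExecute-
```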

5.3 Managing Taints

# Add a taint
kubectl taint nodes worker-1 gpu=true:NoSchedule

# Verify
kubectl describe node worker-1 | grep Taints

# Remove a taint (note the minus at the end)
kubectl taint nodes worker-1 gpu=true:NoSchedule-

# Remove all taints with a key
kubectl taint nodes worker-1 gpu-

5.4 Control Plane Taint

By default, kubeadm taints control plane nodes:

kubectl describe node controlplane | grep Taints
# Taints: node-role.kubernetes.io/control-plane:NoSchedule

This prevents workload pods from running on the control plane. To allow scheduling:

# Remove the control plane taint
kubectl taint nodes controlplane node-role.kubernetes.io/control-plane:NoSchedule-

5.5 Tolerations

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nvidia/cuda:latest

5.6 Toleration Operators

| Operator | Meaning |
|---|---|
| Equal | Key, value, and effect must all match |
| Exists | Key and effect must match (value is ignored) |

Special cases:

# Tolerate ALL taints with key "gpu" (any effect)
tolerations:
- key: "gpu"
  operator: "Exists"

# Tolerate ALL taints on the node (master toleration)
tolerations:
- operator: "Exists"

5.7 NoExecute and tolerationSeconds

With NoExecute, you can set how long a pod stays before eviction:

tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300    # stay for 5 minutes, then evict

5.8 Taints + nodeSelector Together

Common pattern — dedicate nodes to specific workloads:

# 1. Taint the node (repel everything)
kubectl taint nodes worker-3 dedicated=gpu:NoSchedule

# 2. Label the node (attract specific pods)
kubectl label nodes worker-3 hardware=gpu
# Pod: tolerate the taint AND select the node
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  nodeSelector:
    hardware: gpu

6. Resource Requests and Limits

6.1 Concepts

| Field | Purpose | Used by |
|---|---|---|
| requests | Minimum guaranteed resources — used by the scheduler for placement | Scheduler |
| limits | Maximum allowed resources — enforced at runtime | kubelet / kernel |

spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 250m         # 0.25 CPU cores
        memory: 128Mi     # 128 MiB
      limits:
        cpu: 500m         # 0.5 CPU cores
        memory: 256Mi     # 256 MiB

6.2 CPU vs Memory Units

| Resource | Unit | Examples |
|---|---|---|
| CPU | millicores (m) or cores | 100m = 0.1 core, 1 = 1 core, 1500m = 1.5 cores |
| Memory | bytes with suffix | 128Mi (mebibytes), 1Gi (gibibytes), 256M (megabytes) |
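Decimal and millicore notation are interchangeable for CPU, while Mi and M differ by base 2 vs base 10. A fragment showing the equivalences:

```yaml
resources:
  requests:
    cpu: "0.5"        # identical to 500m
    memory: 128Mi     # 128 * 1024^2 = 134,217,728 bytes
  limits:
    cpu: 500m         # identical to "0.5"
    memory: 128M      # 128 * 1000^2 = 128,000,000 bytes — slightly LESS than 128Mi
```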

6.3 How Requests Affect Scheduling

Node capacity: 4 CPU, 8Gi memory
Already allocated: 2.5 CPU, 5Gi memory
Available: 1.5 CPU, 3Gi memory

New pod requests: 2 CPU, 1Gi memory
  → CPU request (2) > available (1.5) → NOT SCHEDULABLE on this node

The scheduler sums all pod requests (not limits) on a node to determine available capacity.
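Both sides of that arithmetic are visible on a live node (node name is a placeholder):

```shell
# Capacity minus system reservations = what the scheduler can hand out
kubectl describe node worker-1 | grep -A6 Allocatable

# Sum of requests from pods already placed on the node
kubectl describe node worker-1 | grep -A8 "Allocated resources"
```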

6.4 How Limits Are Enforced

| Resource | Over-limit behavior |
|---|---|
| CPU | Throttled — container is slowed down but NOT killed |
| Memory | OOMKilled — container is killed and restarted |

# Check if a pod was OOMKilled
kubectl describe pod <name> | grep -A3 "Last State"
#   Last State:  Terminated
#     Reason:    OOMKilled
#     Exit Code: 137

6.5 QoS Classes

Kubernetes assigns a Quality of Service class based on requests and limits:

| QoS Class | Condition | Eviction Priority |
|---|---|---|
| Guaranteed | requests == limits (CPU and memory) for ALL containers | Last to be evicted |
| Burstable | At least one container has requests or limits set, but Guaranteed criteria not met | Middle |
| BestEffort | No requests or limits set on any container | First to be evicted |

# Check QoS class
kubectl get pod <name> -o jsonpath='{.status.qosClass}'
# Guaranteed example (requests == limits)
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 256Mi

# BestEffort example (no resources at all)
# Just don't set the resources field

6.6 LimitRange

Namespace-level defaults and constraints for resource requests/limits:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - type: Container
    default:            # default limits (if not specified)
      cpu: 500m
      memory: 256Mi
    defaultRequest:     # default requests (if not specified)
      cpu: 100m
      memory: 128Mi
    max:                # maximum allowed
      cpu: "2"
      memory: 1Gi
    min:                # minimum allowed
      cpu: 50m
      memory: 64Mi

6.7 ResourceQuota

Namespace-level total resource budget:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
    services: "10"
    persistentvolumeclaims: "5"
# Check quota usage
kubectl get resourcequota -n team-a
kubectl describe resourcequota team-quota -n team-a

CKA Tip: When a ResourceQuota tracks compute resources (requests.cpu, limits.memory, etc.), every pod in that namespace MUST specify the corresponding requests/limits, or it will be rejected at admission.
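The rejection happens at admission time, so it is visible immediately (assumes the team-quota above is applied in team-a; the pod name is arbitrary):

```shell
# A pod with no resources, in a namespace whose quota tracks requests/limits
kubectl run noquota --image=nginx -n team-a
# Fails with an error along the lines of:
#   Error from server (Forbidden): ... failed quota: team-quota:
#   must specify limits.cpu, limits.memory, requests.cpu, requests.memory
```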


7. Static Pods

7.1 What Are Static Pods?

Static pods are managed directly by kubelet, not by the API server. kubelet watches a directory for pod manifests and creates/restarts them automatically.

kubelet watches → /etc/kubernetes/manifests/
                  ├── etcd.yaml
                  ├── kube-apiserver.yaml
                  ├── kube-controller-manager.yaml
                  └── kube-scheduler.yaml

7.2 Key Characteristics

| Aspect | Static Pods | Regular Pods |
|---|---|---|
| Created by | kubelet (from manifest files) | API server (via controllers) |
| Visible in API | Yes (mirror pod, read-only) | Yes (full control) |
| Deletable via kubectl | No — kubelet recreates them | Yes |
| Managed by controllers | No | Yes (Deployments, etc.) |
| Naming | <name>-<node-name> | <name>-<random> |

7.3 Finding the Static Pod Path

# Method 1: Check kubelet config
cat /var/lib/kubelet/config.yaml | grep staticPodPath
# staticPodPath: /etc/kubernetes/manifests

# Method 2: Check kubelet process arguments
ps aux | grep kubelet | grep -- --pod-manifest-path

# Method 3: Check kubelet service file
systemctl cat kubelet | grep -- --config
# Then check the config file for staticPodPath

7.4 Creating a Static Pod

# Create a manifest in the static pod directory
cat <<EOF > /etc/kubernetes/manifests/static-nginx.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-nginx
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
EOF

# kubelet automatically creates the pod
kubectl get pods | grep static-nginx
# static-nginx-controlplane   1/1   Running   0   10s

7.5 Deleting a Static Pod

# This does NOT work permanently — kubelet recreates it
kubectl delete pod static-nginx-controlplane

# To actually remove it, delete the manifest file
rm /etc/kubernetes/manifests/static-nginx.yaml

CKA Tip: If asked to create a static pod on a specific node, SSH to that node, find the staticPodPath, and create the manifest there.


8. Manual Scheduling (nodeName)

8.1 Bypassing the Scheduler

Setting spec.nodeName directly assigns a pod to a node without going through the scheduler:

apiVersion: v1
kind: Pod
metadata:
  name: manual-pod
spec:
  nodeName: worker-1          # bypass scheduler entirely
  containers:
  - name: nginx
    image: nginx

8.2 When to Use

  • Scheduler is down and you need to place a pod
  • Debugging scheduling issues
  • Exam scenarios that explicitly ask for manual scheduling

8.3 Limitations

  • No filtering or scoring — the pod is placed even if the node can't handle it
  • If the node doesn't exist, the pod stays Pending
  • Cannot be changed after pod creation — you must delete and recreate

8.4 Binding Object (Alternative)

If a pod is already created without nodeName, you can bind it manually by creating a Binding object. Use kubectl create rather than apply — the bindings resource is write-only, so apply's read-back fails:

cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Binding
metadata:
  name: manual-pod
target:
  apiVersion: v1
  kind: Node
  name: worker-1
EOF

9. Scheduler Profiles and Multiple Schedulers

9.1 Custom Scheduler

You can run a second scheduler alongside the default:

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled
spec:
  schedulerName: my-custom-scheduler    # use a non-default scheduler
  containers:
  - name: app
    image: nginx
# Check which scheduler placed a pod
kubectl get pod <name> -o jsonpath='{.spec.schedulerName}'

# If schedulerName doesn't match any running scheduler, pod stays Pending

CKA Tip: You probably won't need to deploy a custom scheduler, but you should know the schedulerName field exists.


10. Summary — Choosing the Right Mechanism

| Goal | Mechanism |
|---|---|
| Pod MUST run on nodes with label X | nodeSelector or required nodeAffinity |
| Pod SHOULD PREFER nodes with label X | Preferred nodeAffinity with weight |
| Pod MUST run on same node as pod Y | Required podAffinity |
| Pod MUST NOT run on same node as pod Y | Required podAntiAffinity |
| Repel all pods from a node | Taint with NoSchedule |
| Dedicate a node to specific workloads | Taint + toleration + nodeSelector |
| Guarantee minimum resources | resources.requests |
| Cap maximum resources | resources.limits |
| Run exactly one pod per node | DaemonSet |
| Place pod on a specific node (no scheduler) | spec.nodeName |
| Run pod managed by kubelet only | Static pod in /etc/kubernetes/manifests/ |

11. Practice Exercises

Exercise 1 — nodeSelector

# 1. Label a node
kubectl label node worker-1 env=production

# 2. Create a pod with nodeSelector
kubectl run selector-test --image=nginx --dry-run=client -o yaml > pod.yaml
# Add nodeSelector: { env: production } to the spec
kubectl apply -f pod.yaml

# 3. Verify it landed on worker-1
kubectl get pod selector-test -o wide

# 4. Clean up
kubectl delete pod selector-test
kubectl label node worker-1 env-

Exercise 2 — Node Affinity

# 1. Label two nodes
kubectl label node worker-1 disktype=ssd
kubectl label node worker-2 disktype=hdd

# 2. Create a pod that REQUIRES ssd
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: affinity-test
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd"]
  containers:
  - name: app
    image: nginx
EOF

# 3. Verify it's on worker-1
kubectl get pod affinity-test -o wide

# 4. Clean up
kubectl delete pod affinity-test
kubectl label node worker-1 disktype-
kubectl label node worker-2 disktype-

Exercise 3 — Taints and Tolerations

# 1. Taint a node
kubectl taint nodes worker-1 dedicated=special:NoSchedule

# 2. Try to schedule a pod (should go to worker-2)
kubectl run no-toleration --image=nginx
kubectl get pod no-toleration -o wide    # NOT on worker-1

# 3. Create a pod with toleration (can go to worker-1)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: with-toleration
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "special"
    effect: "NoSchedule"
  nodeSelector:
    kubernetes.io/hostname: worker-1
  containers:
  - name: app
    image: nginx
EOF

# 4. Verify
kubectl get pod with-toleration -o wide    # on worker-1

# 5. Clean up
kubectl delete pod no-toleration with-toleration
kubectl taint nodes worker-1 dedicated=special:NoSchedule-

Exercise 4 — Resource Requests and Limits

# 1. Create a pod with requests and limits
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: resource-test
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi
EOF

# 2. Check QoS class
kubectl get pod resource-test -o jsonpath='{.status.qosClass}'
# Burstable

# 3. Check node resource allocation
kubectl describe node <node> | grep -A10 "Allocated resources"

# 4. Create a Guaranteed pod (requests == limits)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 200m
        memory: 256Mi
      limits:
        cpu: 200m
        memory: 256Mi
EOF

# 5. Verify QoS
kubectl get pod guaranteed-pod -o jsonpath='{.status.qosClass}'
# Guaranteed

# 6. Clean up
kubectl delete pod resource-test guaranteed-pod

Exercise 5 — Static Pod

# 1. Find the static pod path
cat /var/lib/kubelet/config.yaml | grep staticPodPath

# 2. Create a static pod manifest
cat <<EOF > /etc/kubernetes/manifests/static-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-test
spec:
  containers:
  - name: nginx
    image: nginx
EOF

# 3. Verify it appears (with node name suffix)
kubectl get pods | grep static-test

# 4. Try to delete it via kubectl — it comes back
kubectl delete pod static-test-<node-name>
kubectl get pods | grep static-test    # still there

# 5. Actually remove it
rm /etc/kubernetes/manifests/static-test.yaml
kubectl get pods | grep static-test    # gone

Exercise 6 — Manual Scheduling

# 1. Create a pod YAML without scheduling
cat <<EOF > manual.yaml
apiVersion: v1
kind: Pod
metadata:
  name: manual-pod
spec:
  nodeName: worker-1
  containers:
  - name: nginx
    image: nginx
EOF

# 2. Apply
kubectl apply -f manual.yaml

# 3. Verify it's on worker-1 (bypassed scheduler)
kubectl get pod manual-pod -o wide

# 4. Clean up
kubectl delete pod manual-pod

12. Key Takeaways for the CKA Exam

| Point | Detail |
|---|---|
| nodeSelector is simplest | Exact label match — use when you just need "run on nodes with label X" |
| Node affinity for complex rules | In, NotIn, Exists, DoesNotExist operators + soft/hard |
| Taints repel, tolerations permit | A toleration alone doesn't attract — combine with nodeSelector |
| Know the three taint effects | NoSchedule, PreferNoSchedule, NoExecute |
| Remove a taint with a trailing minus | kubectl taint nodes <node> key:effect- |
| Requests = scheduling, limits = enforcement | Scheduler uses requests; kubelet enforces limits |
| CPU throttled, memory OOMKilled | Over-limit behavior differs by resource type |
| QoS: Guaranteed > Burstable > BestEffort | Eviction order under memory pressure |
| Static pods in /etc/kubernetes/manifests/ | Managed by kubelet, not the API server |
| nodeName bypasses the scheduler | Direct placement — no filtering or scoring |
| kubectl describe pod for scheduling failures | Check the Events section for FailedScheduling |

Previous: 07-pods-and-workloads.md — Pods & Workloads

Next: 09-configmaps-secrets.md — ConfigMaps & Secrets