Module 8 — Scheduling
Overview
The scheduler decides which node runs each pod. The CKA exam tests your ability to control scheduling using nodeSelector, affinity rules, taints, tolerations, resource constraints, static pods, and manual scheduling. This module covers the scheduler internals and every mechanism to influence pod placement.
1. Scheduler Internals
1.1 How the Scheduler Works
The scheduler watches for pods with no spec.nodeName and assigns them to a node through two phases:
```
Unscheduled Pod (nodeName = "")
        │
        ▼
┌───────────────────┐
│ 1. FILTERING      │  Eliminate nodes that CAN'T run the pod
│                   │
│ Checks:           │
│ - Sufficient CPU/memory (vs requests)
│ - Node taints tolerated?
│ - nodeSelector / nodeAffinity match?
│ - PodAntiAffinity satisfied?
│ - Node cordoned? (unschedulable)
│ - Port conflicts?
│ - Volume topology constraints?
└────────┬──────────┘
         │ remaining candidate nodes
         ▼
┌───────────────────┐
│ 2. SCORING        │  Rank candidates by preference
│                   │
│ Factors:          │
│ - Resource balance (NodeResourcesBalancedAllocation)
│ - Pod spread (PodTopologySpread)
│ - Node affinity weight
│ - Pod affinity/anti-affinity weight
│ - Image already present on node (ImageLocality)
└────────┬──────────┘
         │
         ▼
Highest-scoring node wins
Pod.spec.nodeName = "selected-node"
```
1.2 When Scheduling Fails
If no node passes filtering, the pod stays Pending:
```bash
kubectl describe pod <pending-pod> | grep -A5 Events
# Warning  FailedScheduling  0/3 nodes are available:
#   1 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate,
#   2 node(s) didn't match Pod's node affinity/selector
```
CKA Tip: Always check Events in kubectl describe pod when a pod is stuck in Pending.
2. nodeSelector
The simplest way to constrain a pod to specific nodes. Matches exact node labels.
2.1 Label a Node
```bash
# Add a label
kubectl label node worker-1 disktype=ssd

# Verify
kubectl get nodes --show-labels | grep disktype

# Remove a label
kubectl label node worker-1 disktype-
```
2.2 Use nodeSelector in a Pod
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: app
    image: nginx
```
```bash
# Imperative (generate YAML, then add nodeSelector)
kubectl run ssd-pod --image=nginx --dry-run=client -o yaml > pod.yaml
# Edit pod.yaml to add nodeSelector
```
2.3 Built-in Node Labels
Every node has these labels automatically:
| Label | Example Value |
|---|---|
| kubernetes.io/hostname | worker-1 |
| kubernetes.io/os | linux |
| kubernetes.io/arch | amd64 |
| node.kubernetes.io/instance-type | m5.large (cloud) |
| topology.kubernetes.io/zone | us-east-1a (cloud) |
| topology.kubernetes.io/region | us-east-1 (cloud) |
3. Node Affinity
More expressive than nodeSelector — supports operators, soft/hard preferences, and multiple conditions.
3.1 Required vs Preferred
| Type | Behavior | Equivalent to |
|---|---|---|
| requiredDuringSchedulingIgnoredDuringExecution | Hard — pod won't schedule if no node matches | nodeSelector (but more flexible) |
| preferredDuringSchedulingIgnoredDuringExecution | Soft — scheduler prefers matching nodes but will schedule elsewhere if needed | No equivalent |
"IgnoredDuringExecution" means if a node's labels change after the pod is scheduled, the pod is NOT evicted. It stays where it is.
3.2 Required (Hard) Node Affinity
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hard-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
            - nvme
  containers:
  - name: app
    image: nginx
```
3.3 Preferred (Soft) Node Affinity
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: soft-affinity
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80            # 1-100, higher = stronger preference
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      - weight: 20
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
  containers:
  - name: app
    image: nginx
```
3.4 Combining Required and Preferred
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os        # MUST be linux
            operator: In
            values: ["linux"]
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: disktype                # PREFER ssd
            operator: In
            values: ["ssd"]
```
3.5 Operators
| Operator | Meaning |
|---|---|
| In | Label value is in the list |
| NotIn | Label value is NOT in the list |
| Exists | Label key exists (value doesn't matter) |
| DoesNotExist | Label key does NOT exist |
| Gt | Label value is greater than (numeric) |
| Lt | Label value is less than (numeric) |
4. Pod Affinity and Anti-Affinity
Controls pod placement relative to other pods, not nodes.
4.1 Pod Affinity — "Schedule Near"
Schedule this pod on a node that already runs pods with label app=cache:
```yaml
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["cache"]
        topologyKey: kubernetes.io/hostname   # same node
```
4.2 Pod Anti-Affinity — "Schedule Away From"
Spread replicas across different nodes:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["web"]
            topologyKey: kubernetes.io/hostname   # different nodes
      containers:
      - name: web
        image: nginx
```
4.3 topologyKey
Defines the "domain" for affinity/anti-affinity:
| topologyKey | Meaning |
|---|---|
| kubernetes.io/hostname | Same/different node |
| topology.kubernetes.io/zone | Same/different availability zone |
| topology.kubernetes.io/region | Same/different region |
4.4 Soft Pod Anti-Affinity
```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values: ["web"]
          topologyKey: kubernetes.io/hostname
```
CKA Tip: Hard anti-affinity with topologyKey: kubernetes.io/hostname means you can't have more replicas than nodes. If you have 3 nodes and 4 replicas, one pod stays Pending.
5. Taints and Tolerations
Taints are applied to nodes to repel pods. Tolerations are applied to pods to allow scheduling on tainted nodes.
5.1 Concept
```
Node with taint: "gpu=true:NoSchedule"

Pod WITHOUT toleration ──▶ REJECTED ✗
Pod WITH toleration    ──▶ ALLOWED ✓
```
Taints repel. Tolerations permit. A toleration does NOT guarantee scheduling on the tainted node — it only removes the restriction. Use nodeSelector or affinity to attract pods to specific nodes.
5.2 Taint Effects
| Effect | Behavior |
|---|---|
| NoSchedule | New pods without toleration won't be scheduled. Existing pods stay. |
| PreferNoSchedule | Scheduler tries to avoid, but will schedule if no other option. |
| NoExecute | New pods rejected AND existing pods without toleration are evicted. |
5.3 Managing Taints
```bash
# Add a taint
kubectl taint nodes worker-1 gpu=true:NoSchedule

# Verify
kubectl describe node worker-1 | grep Taints

# Remove a taint (note the minus at the end)
kubectl taint nodes worker-1 gpu=true:NoSchedule-

# Remove all taints with a key
kubectl taint nodes worker-1 gpu-
```
5.4 Control Plane Taint
By default, kubeadm taints control plane nodes:
```bash
kubectl describe node controlplane | grep Taints
# Taints: node-role.kubernetes.io/control-plane:NoSchedule
```
This prevents workload pods from running on the control plane. To allow scheduling:
```bash
# Remove the control plane taint
kubectl taint nodes controlplane node-role.kubernetes.io/control-plane:NoSchedule-
```
5.5 Tolerations
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nvidia/cuda:latest
```
5.6 Toleration Operators
| Operator | Meaning |
|---|---|
| Equal | Key, value, and effect must all match |
| Exists | Key and effect must match (value is ignored) |
Special cases:
```yaml
# Tolerate ALL taints with key "gpu" (any effect)
tolerations:
- key: "gpu"
  operator: "Exists"

# Tolerate ALL taints on the node (blanket toleration, as used by some DaemonSets)
tolerations:
- operator: "Exists"
```
5.7 NoExecute and tolerationSeconds
With NoExecute, you can set how long a pod stays before eviction:
```yaml
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300   # stay for 5 minutes, then evict
```
5.8 Taints + nodeSelector Together
Common pattern — dedicate nodes to specific workloads:
```bash
# 1. Taint the node (repel everything)
kubectl taint nodes worker-3 dedicated=gpu:NoSchedule

# 2. Label the node (attract specific pods)
kubectl label nodes worker-3 hardware=gpu
```
```yaml
# Pod: tolerate the taint AND select the node
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  nodeSelector:
    hardware: gpu
```
6. Resource Requests and Limits
6.1 Concepts
| Field | Purpose | Used by |
|---|---|---|
| requests | Minimum guaranteed resources — used for placement decisions | Scheduler |
| limits | Maximum allowed resources — enforced at runtime | kubelet / kernel |
```yaml
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 250m        # 0.25 CPU cores
        memory: 128Mi    # 128 MiB
      limits:
        cpu: 500m        # 0.5 CPU cores
        memory: 256Mi    # 256 MiB
```
6.2 CPU vs Memory Units
| Resource | Unit | Examples |
|---|---|---|
| CPU | millicores (m) or cores | 100m = 0.1 core, 1 = 1 core, 1500m = 1.5 cores |
| Memory | bytes with suffix | 128Mi (mebibytes), 1Gi (gibibytes), 256M (megabytes) |
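As a sanity check on these units, the conversions can be done with plain shell arithmetic (no cluster needed; the values are taken from the table's examples):

```shell
# Millicores to cores: divide by 1000
cpu="1500m"
cores=$(awk "BEGIN { print ${cpu%m} / 1000 }")
echo "$cpu = $cores cores"              # 1500m = 1.5 cores

# Mi (mebibytes) to bytes: multiply by 1024^2
mem_mi=128
mem_bytes=$(( mem_mi * 1024 * 1024 ))
echo "${mem_mi}Mi = $mem_bytes bytes"   # 128Mi = 134217728 bytes
```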
6.3 How Requests Affect Scheduling
```
Node capacity:      4 CPU, 8Gi memory
Already allocated:  2.5 CPU, 5Gi memory
Available:          1.5 CPU, 3Gi memory

New pod requests:   2 CPU, 1Gi memory
→ CPU request (2) > available (1.5) → NOT SCHEDULABLE on this node
```
The scheduler sums all pod requests (not limits) on a node to determine available capacity.
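That fit check can be sketched as plain shell arithmetic (illustrative only — the real scheduler runs this per resource alongside all the other filters):

```shell
# Numbers from the example above, in millicores
capacity_m=4000      # node capacity: 4 CPU
allocated_m=2500     # sum of existing pods' requests: 2.5 CPU
request_m=2000       # new pod's request: 2 CPU

available_m=$(( capacity_m - allocated_m ))   # 1500m
if [ "$request_m" -le "$available_m" ]; then
  echo "fits"
else
  echo "does not fit"    # 2000m > 1500m, so this branch runs
fi
```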
6.4 How Limits Are Enforced
| Resource | Over-limit behavior |
|---|---|
| CPU | Throttled — container is slowed down but NOT killed |
| Memory | OOMKilled — container is killed and restarted |
```bash
# Check if a pod was OOMKilled
kubectl describe pod <name> | grep -A3 "Last State"
#     Last State:  Terminated
#       Reason:    OOMKilled
#       Exit Code: 137
```
6.5 QoS Classes
Kubernetes assigns a Quality of Service class based on requests and limits:
| QoS Class | Condition | Eviction Priority |
|---|---|---|
| Guaranteed | requests == limits for ALL containers (CPU and memory) | Last to be evicted |
| Burstable | At least one container has a request or limit set, but the Guaranteed criteria aren't met | Middle |
| BestEffort | No requests or limits set on any container | First to be evicted |
```bash
# Check QoS class
kubectl get pod <name> -o jsonpath='{.status.qosClass}'
```
```yaml
# Guaranteed example (requests == limits)
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 256Mi

# BestEffort example (no resources at all)
# Just don't set the resources field
```
6.6 LimitRange
Namespace-level defaults and constraints for resource requests/limits:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - type: Container
    default:              # default limits (if not specified)
      cpu: 500m
      memory: 256Mi
    defaultRequest:       # default requests (if not specified)
      cpu: 100m
      memory: 128Mi
    max:                  # maximum allowed
      cpu: "2"
      memory: 1Gi
    min:                  # minimum allowed
      cpu: 50m
      memory: 64Mi
```
6.7 ResourceQuota
Namespace-level total resource budget:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
    services: "10"
    persistentvolumeclaims: "5"
```
```bash
# Check quota usage
kubectl get resourcequota -n team-a
kubectl describe resourcequota team-quota -n team-a
```
CKA Tip: When a ResourceQuota sets compute resources (requests.cpu, limits.memory, etc.), every pod in that namespace MUST specify those requests/limits, or it will be rejected.
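For example, with the team-quota above applied, a quick pod without resources is rejected at admission (exact error wording varies by version):

```
kubectl run noquota --image=nginx -n team-a
# Error from server (Forbidden): pods "noquota" is forbidden: failed quota: team-quota:
#   must specify limits.cpu,limits.memory,requests.cpu,requests.memory
```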
7. Static Pods
7.1 What Are Static Pods?
Static pods are managed directly by kubelet, not by the API server. kubelet watches a directory for pod manifests and creates/restarts them automatically.
```
kubelet watches → /etc/kubernetes/manifests/
                    ├── etcd.yaml
                    ├── kube-apiserver.yaml
                    ├── kube-controller-manager.yaml
                    └── kube-scheduler.yaml
```
7.2 Key Characteristics
| Aspect | Static Pods | Regular Pods |
|---|---|---|
| Created by | kubelet (from manifest files) | API server (via controllers) |
| Visible in API | Yes (mirror pod, read-only) | Yes (full control) |
| Can be deleted via kubectl | No — kubelet recreates them | Yes |
| Managed by controllers | No | Yes (Deployments, etc.) |
| Naming | `<name>-<node-name>` | `<name>-<random>` |
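One way to see the mirror-pod relationship from the API side (assumes a kubeadm cluster whose control plane node is named controlplane):

```
# A mirror pod is "owned" by its Node object, not by a controller
kubectl get pod kube-apiserver-controlplane -n kube-system \
  -o jsonpath='{.metadata.ownerReferences[0].kind}'
# Node
```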
7.3 Finding the Static Pod Path
```bash
# Method 1: Check kubelet config
grep staticPodPath /var/lib/kubelet/config.yaml
# staticPodPath: /etc/kubernetes/manifests

# Method 2: Check kubelet process arguments
ps aux | grep kubelet | grep -- --pod-manifest-path

# Method 3: Check kubelet service file
systemctl cat kubelet | grep -- --config
# Then check the config file for staticPodPath
```
7.4 Creating a Static Pod
```bash
# Create a manifest in the static pod directory
cat <<EOF > /etc/kubernetes/manifests/static-nginx.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-nginx
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
EOF

# kubelet automatically creates the pod
kubectl get pods | grep static-nginx
# static-nginx-controlplane   1/1   Running   0   10s
```
7.5 Deleting a Static Pod
```bash
# This does NOT work permanently — kubelet recreates it
kubectl delete pod static-nginx-controlplane

# To actually remove it, delete the manifest file
rm /etc/kubernetes/manifests/static-nginx.yaml
```
CKA Tip: If asked to create a static pod on a specific node, SSH to that node, find the staticPodPath, and create the manifest there.
8. Manual Scheduling (nodeName)
8.1 Bypassing the Scheduler
Setting spec.nodeName directly assigns a pod to a node without going through the scheduler:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: manual-pod
spec:
  nodeName: worker-1     # bypass scheduler entirely
  containers:
  - name: nginx
    image: nginx
```
8.2 When to Use
- Scheduler is down and you need to place a pod
- Debugging scheduling issues
- Exam scenarios that explicitly ask for manual scheduling
8.3 Limitations
- No filtering or scoring — the pod is placed even if the node can't handle it
- If the node doesn't exist, the pod stays Pending
- Cannot be changed after pod creation — you must delete and recreate
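A quick demonstration of the second limitation (the node name here is deliberately bogus):

```
# nodeName pointing at a node that doesn't exist
kubectl run ghost --image=nginx --overrides='{"spec":{"nodeName":"no-such-node"}}'

# Stays Pending — and with no FailedScheduling event, since the scheduler was bypassed
kubectl get pod ghost
kubectl delete pod ghost
```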
8.4 Binding Object (Alternative)
If a pod is already created without nodeName, you can bind it manually:
```bash
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Binding
metadata:
  name: manual-pod        # must match the Pending pod's name
target:
  apiVersion: v1
  kind: Node
  name: worker-1
EOF
```

Note: use kubectl create here, not apply — apply needs to read the object back, which the Binding subresource doesn't support.
9. Scheduler Profiles and Multiple Schedulers
9.1 Custom Scheduler
You can run a second scheduler alongside the default:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled
spec:
  schedulerName: my-custom-scheduler   # use a non-default scheduler
  containers:
  - name: app
    image: nginx
```
```bash
# Check which scheduler placed a pod
kubectl get pod <name> -o jsonpath='{.spec.schedulerName}'

# If schedulerName doesn't match any running scheduler, the pod stays Pending
```
CKA Tip: You probably won't need to deploy a custom scheduler, but you should know the schedulerName field exists.
10. Summary — Choosing the Right Mechanism
| Goal | Mechanism |
|---|---|
| Pod MUST run on nodes with label X | nodeSelector or required nodeAffinity |
| Pod SHOULD PREFER nodes with label X | Preferred nodeAffinity with weight |
| Pod MUST run on same node as pod Y | Required podAffinity |
| Pod MUST NOT run on same node as pod Y | Required podAntiAffinity |
| Repel all pods from a node | Taint with NoSchedule |
| Dedicate a node to specific workloads | Taint + toleration + nodeSelector |
| Guarantee minimum resources | resources.requests |
| Cap maximum resources | resources.limits |
| Run exactly one pod per node | DaemonSet |
| Place pod on a specific node (no scheduler) | spec.nodeName |
| Run pod managed by kubelet only | Static pod in /etc/kubernetes/manifests/ |
11. Practice Exercises
Exercise 1 — nodeSelector
```bash
# 1. Label a node
kubectl label node worker-1 env=production

# 2. Create a pod with nodeSelector
kubectl run selector-test --image=nginx --dry-run=client -o yaml > pod.yaml
# Add nodeSelector: { env: production } to the spec
kubectl apply -f pod.yaml

# 3. Verify it landed on worker-1
kubectl get pod selector-test -o wide

# 4. Clean up
kubectl delete pod selector-test
kubectl label node worker-1 env-
```
Exercise 2 — Node Affinity
```bash
# 1. Label two nodes
kubectl label node worker-1 disktype=ssd
kubectl label node worker-2 disktype=hdd

# 2. Create a pod that REQUIRES ssd
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: affinity-test
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd"]
  containers:
  - name: app
    image: nginx
EOF

# 3. Verify it's on worker-1
kubectl get pod affinity-test -o wide

# 4. Clean up
kubectl delete pod affinity-test
kubectl label node worker-1 disktype-
kubectl label node worker-2 disktype-
```
Exercise 3 — Taints and Tolerations
```bash
# 1. Taint a node
kubectl taint nodes worker-1 dedicated=special:NoSchedule

# 2. Try to schedule a pod (should go to worker-2)
kubectl run no-toleration --image=nginx
kubectl get pod no-toleration -o wide   # NOT on worker-1

# 3. Create a pod with toleration (can go to worker-1)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: with-toleration
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "special"
    effect: "NoSchedule"
  nodeSelector:
    kubernetes.io/hostname: worker-1
  containers:
  - name: app
    image: nginx
EOF

# 4. Verify
kubectl get pod with-toleration -o wide   # on worker-1

# 5. Clean up
kubectl delete pod no-toleration with-toleration
kubectl taint nodes worker-1 dedicated=special:NoSchedule-
```
Exercise 4 — Resource Requests and Limits
```bash
# 1. Create a pod with requests and limits
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: resource-test
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi
EOF

# 2. Check QoS class
kubectl get pod resource-test -o jsonpath='{.status.qosClass}'
# Burstable

# 3. Check node resource allocation
kubectl describe node <node> | grep -A10 "Allocated resources"

# 4. Create a Guaranteed pod (requests == limits)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 200m
        memory: 256Mi
      limits:
        cpu: 200m
        memory: 256Mi
EOF

# 5. Verify QoS
kubectl get pod guaranteed-pod -o jsonpath='{.status.qosClass}'
# Guaranteed

# 6. Clean up
kubectl delete pod resource-test guaranteed-pod
```
Exercise 5 — Static Pod
```bash
# 1. Find the static pod path
grep staticPodPath /var/lib/kubelet/config.yaml

# 2. Create a static pod manifest
cat <<EOF > /etc/kubernetes/manifests/static-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-test
spec:
  containers:
  - name: nginx
    image: nginx
EOF

# 3. Verify it appears (with node name suffix)
kubectl get pods | grep static-test

# 4. Try to delete it via kubectl — it comes back
kubectl delete pod static-test-<node-name>
kubectl get pods | grep static-test   # still there

# 5. Actually remove it
rm /etc/kubernetes/manifests/static-test.yaml
kubectl get pods | grep static-test   # gone
```
Exercise 6 — Manual Scheduling
```bash
# 1. Create a pod YAML with nodeName set
cat <<EOF > manual.yaml
apiVersion: v1
kind: Pod
metadata:
  name: manual-pod
spec:
  nodeName: worker-1
  containers:
  - name: nginx
    image: nginx
EOF

# 2. Apply
kubectl apply -f manual.yaml

# 3. Verify it's on worker-1 (bypassed scheduler)
kubectl get pod manual-pod -o wide

# 4. Clean up
kubectl delete pod manual-pod
```
12. Key Takeaways for the CKA Exam
| Point | Detail |
|---|---|
| nodeSelector is simplest | Exact label match — use when you just need "run on nodes with label X" |
| Node affinity for complex rules | In, NotIn, Exists, DoesNotExist operators + soft/hard |
| Taints repel, tolerations permit | Toleration alone doesn't attract — combine with nodeSelector |
| Know the three taint effects | NoSchedule, PreferNoSchedule, NoExecute |
| Remove taint with minus | `kubectl taint nodes <node> key:effect-` |
| Requests = scheduling, limits = enforcement | Scheduler uses requests; kubelet enforces limits |
| CPU throttled, memory OOMKilled | Over-limit behavior differs by resource type |
| QoS: Guaranteed > Burstable > BestEffort | Eviction order under memory pressure |
| Static pods in /etc/kubernetes/manifests/ | Managed by kubelet, not the API server |
| nodeName bypasses the scheduler | Direct placement — no filtering or scoring |
| kubectl describe pod for scheduling failures | Check Events section for FailedScheduling |
Previous: 07-pods-and-workloads.md — Pods & Workloads
Next: 09-configmaps-secrets.md — ConfigMaps & Secrets