Kernel Space and User Space in Virtualization and Containers¶
Audience: Developers with basic Linux knowledge who want to understand how process isolation really works — from CPU privilege rings down to what happens when a containerized process makes a system call.
Modern operating systems split memory and execution into two distinct domains: kernel space and user space. This boundary is one of the most fundamental concepts in systems programming, and understanding it is essential for reasoning about how virtual machines, containers, and tools like Docker or containerd actually provide (or fail to provide) isolation. This page walks from first principles — CPU rings, context switches, system calls — through to the concrete security implications of sharing a kernel across dozens of containers.
Table of Contents¶
- Definitions
- How It Works in a Traditional OS
- Virtualization: VMs vs Containers
- Containers Deep Dive
- Practical Examples
- Comparison Table
- Further Reading
- Summary
1. Definitions¶
Kernel Space¶
Kernel space is the region of virtual memory reserved for the operating system kernel and its extensions (drivers, kernel modules). Code running here executes at the highest privilege level and has unrestricted access to:
- All physical and virtual memory
- All CPU instructions, including privileged ones (lgdt, in/out, wrmsr, etc.)
- All hardware devices
- Every process's address space
All of the kernel's subsystems (the scheduler, memory manager, VFS layer, network stack, and device drivers) live here.
User Space¶
User space is everything else: applications, daemons, libraries, language runtimes. User-space processes run in an isolated virtual address space and cannot directly access hardware or kernel memory. Any operation that requires elevated privilege — opening a file, allocating memory, sending a packet — must be requested from the kernel via a system call.
Why the Separation Exists¶
The separation is a deliberate security and stability boundary:
| Concern | Without separation | With separation |
|---|---|---|
| Stability | A buggy app could corrupt kernel memory | A crashing process is killed; kernel is unaffected |
| Security | Any process could read any other process's memory | Processes are isolated in separate virtual address spaces |
| Integrity | Malicious code could overwrite interrupt handlers | Kernel code is protected from unprivileged writes |
The boundary is enforced by hardware — not just by software convention.
2. How It Works in a Traditional OS¶
CPU Privilege Rings¶
x86/x64 CPUs implement four privilege rings (Ring 0–3), though mainstream operating systems use only two:
- Ring 0 — the kernel runs here. Can execute any instruction.
- Ring 3 — user processes run here. Attempting a privileged instruction causes a CPU exception (General Protection Fault), which the kernel handles — usually by killing the offending process.
Rings 1 and 2 were intended for device drivers but are unused on Linux and Windows; drivers run in Ring 0.
System Calls¶
A system call (syscall) is the controlled gateway from user space into kernel space. The kernel exposes a stable ABI of ~300–400 syscalls on Linux (read, write, open, fork, mmap, socket, etc.).
The flow for a simple read() call:
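This transition (saving registers, switching privilege levels, executing kernel code, and returning) is a mode switch between privilege domains. It is often loosely called a context switch, but that term properly describes the scheduler swapping one process for another. A mode switch costs on the order of hundreds of nanoseconds on modern hardware.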
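As a rough illustration of the read() path, the sketch below issues the underlying system calls from Python through the os module, which wraps them thinly. The pipe stands in for a file so the example is self-contained, and the function name read_via_syscalls is made up for this page. The numbered comments describe the usual x86-64 Linux flow in simplified form; they are not a trace of any particular kernel version.

```python
import os


def read_via_syscalls(payload: bytes) -> bytes:
    """Push bytes through a kernel pipe using pipe(2), write(2) and read(2)."""
    # pipe(2): the kernel allocates a buffer in kernel space and hands back
    # two file descriptors, which are plain integers in user space.
    read_fd, write_fd = os.pipe()
    try:
        # write(2): the kernel copies `payload` out of this process's memory
        # into its own pipe buffer.
        os.write(write_fd, payload)

        # read(2), step by step on x86-64 Linux (simplified):
        #   1. libc places the syscall number and arguments in registers
        #   2. the `syscall` instruction switches the CPU from Ring 3 to Ring 0
        #   3. kernel entry code dispatches to the read handler
        #   4. the handler copies data from the pipe buffer into user memory
        #   5. the return path drops the CPU back to Ring 3
        return os.read(read_fd, len(payload))
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print(read_via_syscalls(b"hello from user space"))
```

Running the script under strace should show the corresponding write and read calls among Python's own startup activity.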
The Virtual Address Space¶
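Each user-space process has its own virtual address space; on x86-64 Linux with 4-level paging, the user portion spans 128 TiB. The upper half of the address range is reserved for the kernel. Historically the kernel was mapped there in every process and protected only by page-table permission bits, so it was always "present" but unreachable from Ring 3. Kernels with page-table isolation (KPTI, introduced in response to Meltdown) go further and keep only a minimal set of entry code mapped while user code runs. Either way, the hardware enforces that Ring 3 code cannot read or write kernel memory.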
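One way to see the user portion of the address space is to look at /proc/self/maps. The sketch below parses that format and reports the highest mapped user address, which should sit below the 128 TiB mark (2^47) on x86-64 with 4-level paging. The function name and the sample text are illustrative assumptions; the sample addresses are invented and only mimic the file's layout.

```python
USER_SPACE_LIMIT = 1 << 47  # 128 TiB: user half with 4-level paging on x86-64


def highest_user_mapping(maps_text: str) -> int:
    """Return the highest end address among mappings in /proc/<pid>/maps text."""
    highest = 0
    for line in maps_text.splitlines():
        fields = line.split()
        if not fields or fields[-1] == "[vsyscall]":
            # [vsyscall] is a legacy page at a fixed kernel-half address; skip it
            continue
        _start, end = fields[0].split("-")
        highest = max(highest, int(end, 16))
    return highest


# Invented sample in the /proc/<pid>/maps format, used when /proc is unavailable.
SAMPLE_MAPS = """\
55d0c7a00000-55d0c7a21000 r-xp 00000000 08:01 1234 /usr/bin/python3
7f3a5c000000-7f3a5c021000 rw-p 00000000 00:00 0
7ffc1b2e0000-7ffc1b301000 rw-p 00000000 00:00 0 [stack]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
"""

if __name__ == "__main__":
    try:
        with open("/proc/self/maps") as f:
            text = f.read()
    except OSError:
        text = SAMPLE_MAPS
    top = highest_user_mapping(text)
    print(f"highest user mapping ends at {top:#x}")
    print(f"below the 128 TiB boundary: {top <= USER_SPACE_LIMIT}")
```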
3. Virtualization: VMs vs Containers¶
The kernel/user-space model becomes more interesting in virtualized environments because the one-kernel, one-user-space picture no longer holds: a host may run several kernels (one per virtual machine) or several isolated user spaces on top of a single kernel (containers).
Virtual Machines (Full Virtualization)¶
A hypervisor (KVM, VMware ESXi, Hyper-V) sits between the hardware and one or more guest OSes. Each VM runs a complete OS with its own kernel.
Key properties:
- Full kernel isolation: a kernel bug in VM1 does not affect VM2.
- Syscalls stay within the guest kernel: they never reach the host kernel directly.
- Higher overhead: each VM boots a full OS; memory footprint is in the GB range.
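A guest kernel can usually tell that it is running on virtual hardware. On x86, hypervisors conventionally set a CPUID bit that Linux surfaces as the hypervisor flag in /proc/cpuinfo. The sketch below checks for that flag in cpuinfo-style text; the function name is an assumption for illustration, and other architectures expose this differently or not at all.

```python
def runs_under_hypervisor(cpuinfo_text: str) -> bool:
    """True if any x86 'flags' line in /proc/cpuinfo text lists 'hypervisor'."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags" and "hypervisor" in value.split():
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("hypervisor flag present:", runs_under_hypervisor(f.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

A container on a bare-metal host would not show the flag, because it sees the host kernel's view of the CPU rather than a virtualized one.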
Containers (OS-Level Virtualization)¶
Containers use Linux kernel features (namespaces and cgroups) to partition a single running kernel into isolated views. There is no guest kernel.
Key properties:
- Shared kernel: all containers call into the same kernel.
- Lower overhead: containers start in milliseconds; overhead is in the MB range.
- Weaker isolation: the shared kernel is both the isolation mechanism and the attack surface.
4. Containers Deep Dive¶
The Linux Primitives: Namespaces and cgroups¶
Containers are not a single kernel feature — they are a composition of several:
| Feature | What it isolates |
|---|---|
| pid namespace | Process ID space (PID 1 in a container is not the host's init) |
| net namespace | Network interfaces, routing tables, firewall rules |
| mnt namespace | Mount points and filesystem hierarchy |
| uts namespace | Hostname and NIS domain name |
| ipc namespace | POSIX message queues, shared memory segments |
| user namespace | UID/GID mappings (root in container ≠ root on host, optionally) |
| cgroup namespace | View of cgroup hierarchy |
| cgroups v2 | Resource limits: CPU, memory, I/O, PIDs |
These namespaces give each container a virtualised view of the system, but all system calls still reach the same host kernel.
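Namespace membership is visible from user space as symlinks under /proc/&lt;pid&gt;/ns. The sketch below reads them for the current process; parse_ns_link and namespaces_of are hypothetical helper names, and the code returns an empty dict on systems without a Linux-style /proc.

```python
import os
import re

_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a /proc/<pid>/ns/* symlink target such as 'pid:[4026531836]'."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match["kind"], int(match["inode"])


def namespaces_of(pid: str = "self") -> dict[str, int]:
    """Map each entry in /proc/<pid>/ns to its namespace inode number."""
    ns_dir = f"/proc/{pid}/ns"
    result: dict[str, int] = {}
    try:
        for entry in sorted(os.listdir(ns_dir)):
            _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
            result[entry] = inode
    except (OSError, ValueError):
        return {}  # no Linux-style /proc, or an unreadable entry
    return result


if __name__ == "__main__":
    for name, inode in namespaces_of().items():
        print(f"{name:<20} {inode}")
```

Two processes in the same namespace report the same inode number, so comparing the output on the host with the output inside a container should show different values for each namespace the runtime created.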
What Runs in User Space Per Container¶
Each container has its own isolated user space containing:
- Root filesystem: typically an OCI image layer stack (e.g., alpine:3.19 + your app layer), mounted via overlayfs. This provides a private /bin, /lib, /etc, and so on.
- Processes: each container has its own PID namespace; the first process is PID 1 inside the container.
- Libraries: libc, language runtimes, and dependencies come from the container image, not the host.
- Environment: its own hostname, network interfaces, and mount tree.
The kernel itself, whether observed through /proc/version, kernel memory, or the system call table, is the host kernel and is identical for all containers.
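A quick way to observe the shared kernel is to ask it for its identity from several places. The sketch below wraps platform.uname(); the function name kernel_identity is an assumption made for this page. Run on the host and inside a conventional (runc-style) container, it should report the same release string, even when the container image is a different distribution.

```python
import platform


def kernel_identity() -> dict[str, str]:
    """Report the running kernel as seen from this process."""
    info = platform.uname()  # backed by uname(2) on Unix-like systems
    return {
        "system": info.system,
        "release": info.release,
        "version": info.version,
    }


if __name__ == "__main__":
    for key, value in kernel_identity().items():
        print(f"{key}: {value}")
```

Sandboxed runtimes such as gVisor or VM-backed runtimes such as Kata Containers are expected to behave differently here, since they do not hand the container the host kernel directly.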
How containerd Manages This¶
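containerd is the industry-standard container runtime (used by Docker, Kubernetes, and others). Its role in the stack: a client such as the Docker daemon or the Kubernetes kubelet asks containerd to pull images and manage container lifecycles; containerd starts a shim process per container, and the shim calls runc to create the container itself.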
runc — the low-level OCI runtime — directly invokes syscalls like clone(2), unshare(2), pivot_root(2), and mount(2) to set up the container environment. It does not involve any virtual machine or hypervisor.
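To make the syscall-level view concrete, here is a heavily reduced sketch of one step: creating a new UTS namespace with unshare(2) and changing the hostname inside it. It is not how runc is implemented (runc is written in Go and also handles pivot_root, mounts, cgroups, and much more). The function name is hypothetical, the call goes through ctypes to libc, and the CLONE_* values are the constants from the Linux headers. The work happens in a forked child so the calling process is never modified, and the function simply returns False where the kernel or a sandbox refuses the operation.

```python
import ctypes
import os
import socket

CLONE_NEWUTS = 0x04000000   # new hostname (UTS) namespace
CLONE_NEWUSER = 0x10000000  # new user namespace, so no root is needed


def hostname_in_new_uts_namespace(new_name: str = "sketch-container") -> bool:
    """Fork a child, unshare user+UTS namespaces there, and rename the host inside.

    Returns True if the child managed it, False if the platform refused.
    """
    pid = os.fork()
    if pid == 0:  # child process
        status = 1
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWUTS) == 0:
                # Only this child's private UTS namespace sees the new name.
                socket.sethostname(new_name)
                if socket.gethostname() == new_name:
                    status = 0
        except Exception:
            status = 1
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0


if __name__ == "__main__":
    before = socket.gethostname()
    print("child renamed its own hostname:", hostname_in_new_uts_namespace())
    print("parent hostname unchanged:", socket.gethostname() == before)
```

Whether the call succeeds depends on the environment: unprivileged user namespaces may be disabled by the distribution, and default container seccomp profiles commonly block unshare.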
Isolation and Security Implications¶
Because every container shares the host kernel:
- There is no kernel isolation. A vulnerability in the kernel is exploitable by any container.
- Namespace escapes are possible. Misconfigured namespaces (e.g., a container with access to the host's pid namespace) break isolation entirely.
- Privileged containers are dangerous. Running a container with --privileged or excessive capabilities grants near-root access to the host.
- The syscall surface is the attack surface. Every syscall reachable from a container is a potential exploit path into the shared kernel.
Mitigation tools (seccomp, AppArmor/SELinux, user namespaces) reduce this surface but do not eliminate it.
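Several of these mitigations can be inspected from inside a process by reading /proc/self/status. The sketch below parses that format and reports the seccomp mode and whether CAP_SYS_ADMIN is in the effective set. The helper names are assumptions, and the two sample strings are illustrative values resembling what a default and a privileged container commonly show, not captured output.

```python
CAP_SYS_ADMIN = 21  # bit position from linux/capability.h
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def parse_status(status_text: str) -> dict[str, str]:
    """Turn /proc/<pid>/status text into a {field: value} dict."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def confinement_summary(status_text: str) -> dict[str, object]:
    """Summarise seccomp mode, CAP_SYS_ADMIN and no_new_privs for a process."""
    fields = parse_status(status_text)
    cap_eff = int(fields.get("CapEff", "0"), 16)  # hex bitmask of capabilities
    seccomp = int(fields.get("Seccomp", "0"))
    return {
        "seccomp": SECCOMP_MODES.get(seccomp, "unknown"),
        "has_cap_sys_admin": bool((cap_eff >> CAP_SYS_ADMIN) & 1),
        "no_new_privs": fields.get("NoNewPrivs") == "1",
    }


# Illustrative samples only (hypothetical processes).
SAMPLE_DEFAULT = "Name:\tsh\nNoNewPrivs:\t0\nSeccomp:\t2\nCapEff:\t00000000a80425fb\n"
SAMPLE_PRIVILEGED = "Name:\tsh\nNoNewPrivs:\t0\nSeccomp:\t0\nCapEff:\t000001ffffffffff\n"

if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            print(confinement_summary(f.read()))
    except OSError:
        print(confinement_summary(SAMPLE_DEFAULT))
```

A process reporting seccomp as disabled together with CAP_SYS_ADMIN has very little standing between it and the full kernel interface.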
5. Practical Examples¶
5.1 Architecture Diagram (Mermaid)¶
5.2 A System Call from Inside a Container¶
Consider a Python process inside Container B calling socket.connect(). Here is the full path from container user space to the shared kernel:
Python (user space, inside container):
What happens at the kernel boundary:
Key observation: the connect(2) syscall executes in the host kernel's network stack. The container's net namespace gives it an isolated view (its own routing table, its own eth0), but the code executing is the host kernel's TCP implementation — the same code serving Container A and the host itself.
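The user-space half of this walk-through might look like the following sketch. The helper name connect_once is made up, and the demo connects to a throwaway listener on loopback rather than a real service so that it runs anywhere. The comments mark where control passes into the kernel.

```python
import socket


def connect_once(host: str, port: int, timeout: float = 2.0) -> tuple[str, int]:
    """Open a TCP connection and return the local (address, port) the kernel chose."""
    # socket(2): the kernel creates a socket in the caller's net namespace
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect(2): route lookup, source-address selection and the TCP
        # handshake all run inside the shared kernel's network stack
        sock.connect((host, port))
        return sock.getsockname()


if __name__ == "__main__":
    # A temporary listener on loopback keeps the demo self-contained.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        print(connect_once(*listener.getsockname()))
```

Nothing in the Python code changes when it runs inside a container; only the namespace the kernel consults for interfaces and routes is different.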
5.3 Security Implication: Kernel Exploit Affects All Containers¶
Scenario: Dirty Pipe (CVE-2022-0847)
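The vulnerability, in the Linux kernel's pipe subsystem, was reported to the kernel security team and fixed in February 2022, then publicly disclosed in March 2022. It was introduced in kernel 5.8 and fixed in 5.16.11, 5.15.25, and 5.10.102. It allowed an unprivileged local user to overwrite data in arbitrary read-only files, including SUID binaries.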
Why this is different from a VM scenario:
In a VM, the guest kernel's pipe subsystem is isolated. CVE-2022-0847 in the guest kernel would give root inside the VM, but the hypervisor boundary would stop the exploit from reaching the host or other VMs. In a container environment, there is no such boundary — every container on the host was vulnerable until the host kernel was patched.
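Because the kernel is shared, the host's kernel release is the one value that decides exposure for every container. The sketch below is a rough heuristic that checks an upstream-style release string against the versions in which Dirty Pipe was fixed (5.16.11, 5.15.25, and 5.10.102). The function names are assumptions. Distribution kernels often backport fixes without changing the upstream version numbers, so a True result is a prompt to check the vendor advisory, not a verdict.

```python
import re

# First fixed release on each stable branch that received the patch.
FIXED_IN = {(5, 10): 102, (5, 15): 25, (5, 16): 11}


def parse_release(release: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from a string such as '5.16.10-custom'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def looks_affected_by_dirty_pipe(release: str) -> bool:
    """Heuristic: True if an upstream kernel of this version predates the fix."""
    major, minor, patch = parse_release(release)
    if (major, minor) < (5, 8) or (major, minor) >= (5, 17):
        return False  # bug not yet introduced, or fix already included
    fixed_patch = FIXED_IN.get((major, minor))
    if fixed_patch is None:
        return True  # end-of-life branches (e.g. 5.11 to 5.14) got no upstream fix
    return patch < fixed_patch


if __name__ == "__main__":
    import platform
    release = platform.release()
    try:
        print(release, "->", looks_affected_by_dirty_pipe(release))
    except ValueError:
        print(release, "is not a Linux-style release string")
```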
Mitigation layers (none are complete substitutes for patching):
5.4 Container Escape via Privileged Mode¶
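Running a container with --privileged leaves the namespaces in place but strips away nearly all of the protections layered on top of them (capability drops, seccomp, AppArmor/SELinux confinement, and device restrictions):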
This escape works because --privileged grants CAP_SYS_ADMIN and disables seccomp/AppArmor, effectively giving the container access to the host kernel's full interface.
6. Comparison Table¶
| Dimension | Bare Metal | Virtual Machine | Container |
|---|---|---|---|
| Kernel | Host kernel only | Separate guest kernel per VM | Shared host kernel |
| Isolation level | None (single OS) | Strong (hypervisor boundary) | Moderate (namespace boundary) |
| Kernel sharing | N/A | No — each VM has its own | Yes — all containers on host |
| Attack surface | Full host | Guest kernel + hypervisor API | Host kernel syscall table |
| A kernel CVE affects… | The host | Only VMs running that kernel version | All containers on the host |
| Startup time | N/A | Seconds to minutes | Milliseconds |
| Memory overhead | None | ~512 MB – several GB per VM | ~10–50 MB per container |
| Filesystem isolation | None | Full (virtual disk) | overlayfs per container |
| Network isolation | None | Virtual NIC per VM | net namespace per container |
| Syscall path | Direct to kernel | Handled by the guest kernel; only privileged operations and device I/O trap to the hypervisor (VM exit) | Direct to host kernel (scoped by namespaces, optionally filtered by seccomp) |
| Root in environment = root on host? | Yes | No (root in guest ≠ host root) | Potentially yes (without user namespaces) |
| Use case | Max performance, single tenant | Strong multi-tenant isolation | High-density, microservices, CI |
| Examples | Physical server | KVM, VMware ESXi, Hyper-V | Docker, containerd/runc, LXC, Podman |
7. Further Reading¶
Linux Kernel and System Calls¶
- Linux Kernel Documentation: syscalls
- The Linux Programming Interface — Michael Kerrisk (definitive reference)
- man 2 syscall, man 2 clone, man 7 namespaces, man 7 cgroups
Containers and Runtimes¶
- OCI Runtime Specification: defines what runc implements
- containerd Architecture: official docs
- Containers from Scratch — Liz Rice, GOTO 2018 (excellent live demo building a container with raw Go syscalls)
- Linux Namespaces — LWN series by Michael Kerrisk
Security¶
- CVE-2022-0847 (Dirty Pipe) analysis — original writeup by Max Kellermann
- NCC Group: Understanding and Hardening Linux Containers
- gVisor — Google's user-space kernel for containers (addresses the shared-kernel problem)
- Kata Containers — lightweight VMs that look like containers (hardware isolation + container UX)
- seccomp-bpf — syscall filtering for containers
Tools¶
- strace: trace syscalls made by a process (works inside containers)
- nsenter: enter the namespaces of a running container
- unshare: create new namespaces from the shell
- Falco: runtime security; detects anomalous syscall behaviour in containers
8. Summary¶
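The kernel/user-space boundary is a hardware-enforced privilege separation that underpins all OS security and stability. System calls are the only sanctioned way for a process to request a crossing (interrupts and exceptions also enter the kernel, but not at the program's request), governed by CPU rings and a controlled kernel ABI.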
Virtual machines extend this model by running complete guest kernels on top of a hypervisor, giving each VM its own isolated kernel. This means a kernel-level exploit in one VM stays within that VM unless the attacker can also break through the hypervisor.
Containers take a fundamentally different approach: they use Linux namespaces to create isolated views of a single running kernel, and cgroups to enforce resource limits. The kernel itself is shared. This makes containers fast and lightweight, but it means:
- The kernel is the trust boundary — and it is shared by every container on the host.
- A kernel vulnerability is a host-wide vulnerability — not scoped to a single container.
- User space is fully isolated per container (filesystem, processes, network interfaces), but kernel space is not.
For most workloads, the container isolation model is acceptable when combined with defence-in-depth (seccomp profiles, AppArmor/SELinux policies, non-root users, minimal capabilities, image provenance). For high-security multi-tenant environments, consider gVisor (user-space kernel) or Kata Containers (lightweight VMs) to recover hardware-level isolation without sacrificing the container developer experience.
Page last updated: 2025. Kernel behaviour described targets Linux 5.15+ / 6.x LTS unless otherwise noted.