
The Mechanics of Containerization

A container image is an immutable, self-contained package that bundles everything your application needs to run. Think of it as a snapshot - a blueprint that describes the state of a filesystem and the configuration needed to launch a process from it.

A container image can include any combination of the following:

  • System packages: OS-level utilities and libraries (e.g., glibc, openssl)
  • A runtime: the execution environment for your app (e.g., JVM, Node.js, Python interpreter)
  • Library dependencies: application-level libraries your code imports
  • Source code: your application’s source files (in interpreted languages)
  • Binaries: pre-compiled executables
  • Static assets: HTML, CSS, images, config templates
  • Configuration: container configuration (entrypoint, env vars, user, working dir)
Two concepts that are easy to conflate:

  • Container image: a static, read-only blueprint stored on disk. It doesn’t “run” - it just exists.
  • Container: a live, running instance created from a container image. It only exists while there are active processes.
  • When you run a container image, the container runtime executes the program specified in the image’s entrypoint (e.g., starting the JVM for a Java app).
  • A container only exists at runtime. If the process exits or is killed, the container stops and ceases to exist.

Two important things happen automatically when a container is launched:

The contents of the container image are used to seed a private, isolated file system for the container. Every process inside the container sees this file system - and only this file system - as if it were the entire machine.

The container gets its own virtual network interface with a local IP address. Your application can bind to this interface and start listening on a port, enabling it to receive incoming network traffic.

The configuration section of a container image tells the runtime how to turn the image into a running container. Key settings include:

Entrypoint

The entrypoint is the command executed when the container starts. For example:

  • A Java app → java -jar app.jar
  • A Python app → python main.py
  • A compiled binary → /usr/local/bin/myapp

Environment variables

Environment variables pass runtime configuration into your application without baking it into the image. Common uses: database URLs, API keys, feature flags, log levels.

DATABASE_URL=postgres://user:pass@host/db
LOG_LEVEL=info
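Inside the application, these variables are typically read once at startup. A minimal Python sketch (the defaults and variable names are illustrative, matching the example above):

```python
import os

def load_config() -> dict:
    """Read runtime configuration from environment variables,
    falling back to defaults when a variable is unset."""
    return {
        "database_url": os.environ.get("DATABASE_URL", "sqlite:///local.db"),
        "log_level": os.environ.get("LOG_LEVEL", "info"),
    }

# The same image runs unchanged in every environment: only the
# injected environment variables differ, never the code.
config = load_config()
print(config["log_level"])
```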

User

Specifies which OS user the container process runs as.

Working directory

Sets the default directory from which the entrypoint command is executed (equivalent to running cd /app before starting your process).

When starting a container, you can override the image’s default values for:

  • The entrypoint command
  • Arguments passed to the entrypoint
  • Environment variables

This makes images reusable across different environments (dev, staging, production) without rebuilding.
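The override rules can be sketched as a small resolution function. This is a simplified model of how a runtime combines the image’s entrypoint, its default arguments, and run-time overrides; the function name is illustrative:

```python
def effective_command(image_entrypoint, image_cmd,
                      entrypoint_override=None, args_override=None):
    """Compute the argv the runtime launches: the entrypoint is the
    executable, the image's default args follow, and either part can
    be replaced when the container is started."""
    entrypoint = entrypoint_override if entrypoint_override is not None else image_entrypoint
    args = args_override if args_override is not None else image_cmd
    return list(entrypoint) + list(args)

# Image defaults: run the app JAR.
print(effective_command(["java", "-jar"], ["app.jar"]))
# → ['java', '-jar', 'app.jar']

# Override only the arguments at start time, e.g. in staging:
print(effective_command(["java", "-jar"], ["app.jar"],
                        args_override=["app.jar", "--profile=staging"]))
```

Because resolution happens at start time, one image serves dev, staging, and production without a rebuild.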

The Linux Kernel Features That Make Containers Possible


Containers are not virtual machines. They don’t emulate hardware or run a full guest OS. Instead, they rely on three foundational Linux kernel features to provide isolation, resource control, and efficient storage.

1. Container Namespaces

A namespace wraps a global system resource so that processes within the namespace have their own isolated view of it. From inside a namespace, a process believes it has the entire resource to itself.

PID Namespace
  • Inside the container, your application’s main process is assigned PID 1, even if on the host machine it’s actually running as, say, PID 12345.
  • The process cannot see, signal, or interact with processes running outside its PID namespace.
  • This is why running ps aux inside a container shows only the container’s own processes.
Network Namespace
  • Each container gets its own completely isolated network stack: its own IP address, routing table, firewall rules (iptables), and ports.
  • Two containers can each bind to port 8080 simultaneously without conflict, because they live in different network namespaces.
  • The container runtime (e.g., Docker) creates virtual ethernet pairs (veth) to connect the container’s network namespace to the host.
Mount Namespace
  • Gives the container its own isolated view of the file system hierarchy.
  • The container can have its own /etc, /var, /home, etc., completely separate from the host’s file system.
  • Changes to the file system inside the container (in the writable layer) are invisible to the host and to other containers.

Three tools work together to deliver complete filesystem isolation:

  • Union Filesystem (UFS): combines the image’s read-only layers into a single unified view for the container
  • MNT Namespace: gives the container its own mount points, so it cannot see or traverse the host’s filesystem tree
  • chroot: a kernel syscall that sets the image filesystem root as the container’s root (/), preventing the container from accessing anything above it on the host
UTS Namespace
  • Allows the container to have its own hostname and domain name, independent of the host.
  • This is why a container can report its hostname as web-server-1 while the host machine is named prod-node-42.

IPC Namespace (Inter-Process Communication Isolation)

  • Isolates IPC resources such as System V message queues and POSIX shared memory.
  • Prevents processes in one container from interfering with IPC resources used by another.
User Namespace
  • Maps user IDs inside the container to different user IDs on the host.
  • A process running as UID 0 (root) inside the container can be mapped to an unprivileged user (e.g., UID 65534) on the host.
  • This is a critical security feature: even if a malicious process “escapes” the container, it runs as a non-privileged host user.

2. Control Groups (cgroups) - Resource Management

Control Groups

While namespaces provide isolation, cgroups provide resource governance. They allow the kernel to limit, account for, and isolate the resource usage of a group of processes.

CPU
  • You can cap how much CPU time a container’s processes receive.
  • Example: limit a container to 0.5 CPU cores, even on a 32-core machine.
  • Implemented via CPU shares, CPU quotas, and CPU periods.

Memory
  • Set a maximum amount of RAM a container can use.
  • Example: --memory=512m restricts the container to 512 MB of RAM.
  • If the container exceeds its memory limit, the kernel’s OOM (Out-Of-Memory) killer will terminate a process in the container.

Block I/O
  • Throttle the rate at which a container can read from or write to disk.
  • Prevents one container from saturating disk bandwidth.
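A fractional-core cap is expressed to the scheduler as a quota/period pair: the group may run for quota microseconds out of every period microseconds. A sketch of the arithmetic (the 100 ms period is Docker's default, but treat the mapping as illustrative):

```python
def cpu_quota_us(cores: float, period_us: int = 100_000) -> int:
    """Translate a fractional core limit into a CFS quota: the cgroup
    may run for `quota` microseconds out of every `period`
    microseconds, regardless of how many physical cores exist."""
    return int(cores * period_us)

# A `0.5 core` limit becomes quota=50000 against period=100000:
print(cpu_quota_us(0.5))   # 50000
print(cpu_quota_us(2.0))   # 200000
```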
Noisy Neighbor

Without cgroups, a single poorly-written or malicious container could consume 100% of CPU or RAM, starving every other application on the same host. cgroups enforce guaranteed resource isolation, making multi-tenant container hosting reliable.

cgroups v1 vs. v2
  • cgroups v1: Each resource controller (cpu, memory, blkio) is managed separately.
  • cgroups v2: A unified hierarchy where all controllers are managed together. More modern and consistent. Required by some newer container runtimes.

3. Union File Systems & Copy-on-Write (CoW) - Storage

Union File Systems

A Union File System combines multiple directories (called layers) into a single unified view. This layered architecture is what makes images lightweight, fast to pull, and highly efficient in storage and memory usage. It maintains two distinct perspectives:

  • Upper view: The merged view of all layers combined
  • Lower view: The view of the base layer only

A container image is composed of multiple stacked, read-only layers. Each layer represents a set of file system changes (additions, modifications, deletions) from one build step.

Example layer stack for a Java web app:

[ Layer 4 ] → App JAR file added (top, most specific)
[ Layer 3 ] → JDK installed
[ Layer 2 ] → apt-get update + curl
[ Layer 1 ] → Base Ubuntu 22.04 image (bottom, most general)
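The shadowing behavior of a layer stack like this can be modeled with plain dictionaries. This is a toy sketch of the merge rule only, not how a real union filesystem is implemented:

```python
def merged_view(layers):
    """Merge an ordered stack of layers (bottom first) into the single
    unified filesystem view a container sees: files in upper layers
    shadow same-named files in the layers below."""
    view = {}
    for layer in layers:       # bottom → top
        view.update(layer)     # later (upper) layers win
    return view

# A miniature version of the four-layer Java stack above:
stack = [
    {"/bin/bash": "ubuntu", "/etc/os-release": "22.04"},  # base image
    {"/usr/bin/curl": "curl-binary"},                     # apt-get layer
    {"/opt/jdk/bin/java": "jdk"},                         # JDK layer
    {"/app/app.jar": "v1"},                               # app layer
]
print(merged_view(stack)["/app/app.jar"])
```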

Each layer is content-addressed - identified by a cryptographic hash of its contents. This means:

  • If two images share the same base layer (e.g., Ubuntu 22.04), that layer is stored once on disk and shared in memory, even if 50 containers are running from different images.
  • Pulling a new image version is fast: only the changed layers need to be downloaded.

Each layer’s metadata includes its own identifier (cryptographic hash), the identifier of its parent layer, and the execution context of the container that created it. These relationships form a graph that Docker and the UFS use to assemble and verify images at build, pull, and run time.
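A simplified model of that identifier chain follows. Real layer digests are SHA-256 hashes of layer tarballs and the metadata is richer; the inputs here are illustrative:

```python
import hashlib

def layer_id(content: bytes, parent_id: str = "") -> str:
    """Content-address a layer: the identifier hashes the layer's own
    bytes together with its parent's identifier, so changing any layer
    changes every identifier above it in the chain."""
    return hashlib.sha256(parent_id.encode() + content).hexdigest()

base = layer_id(b"ubuntu-22.04-rootfs")
jdk  = layer_id(b"jdk-install", parent_id=base)
app  = layer_id(b"app.jar-v1", parent_id=jdk)

# Two images built on the same base compute the same base ID,
# so that layer is stored and pulled only once.
assert base == layer_id(b"ubuntu-22.04-rootfs")
print(app[:12])
```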

Container Layer (Upper Layer - Read-Write)


When a container is started, a thin, ephemeral read-write layer is added on top of the read-only image layers. This is the only place where the running container can write new data.

[ Container Layer ] → Read-Write (ephemeral, destroyed when container stops)
─────────────────────────────────────────────────────────────────────────────
[ Layer 4 ] → Read-Only ─┐
[ Layer 3 ] → Read-Only │ These are the image layers - shared and immutable
[ Layer 2 ] → Read-Only │
[ Layer 1 ] → Read-Only ─┘
Copy-on-Write

The key question is: what happens if a running container needs to modify a file that exists in a read-only layer?

The answer is Copy-on-Write:

  1. When a container process writes to a file that exists only in a read-only lower layer, the file system detects this.
  2. The file is copied up into the writable container layer.
  3. The modification is applied to the copy in the upper layer.
  4. Subsequent reads of that file will see the modified version from the upper layer (it shadows the original below).
  5. When the container is deleted, the entire upper read-write layer is discarded - the original image layers remain perfectly intact.
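The copy-up sequence above can be sketched with two dictionaries standing in for the merged read-only layers and the writable upper layer. A toy model, not a real filesystem:

```python
class CowFS:
    """Toy copy-on-write filesystem: reads fall through to the
    read-only lower layers unless the upper layer shadows them; the
    first write places a modified copy in the upper layer."""
    def __init__(self, lower: dict):
        self.lower = lower   # merged read-only image layers
        self.upper = {}      # ephemeral writable container layer

    def read(self, path):
        # The upper copy, if present, shadows the original below.
        return self.upper.get(path, self.lower.get(path))

    def write(self, path, data):
        # Copy-up and modification collapse into one step here.
        self.upper[path] = data

fs = CowFS({"/etc/config": "original"})
fs.write("/etc/config", "patched")
print(fs.read("/etc/config"))     # the container sees the upper copy
print(fs.lower["/etc/config"])    # the image layer is untouched
```

Discarding `fs.upper` models step 5: deleting the container throws away every modification while the image layers survive intact.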

Deletions work differently. When a container process deletes a file that exists in a read-only layer, Docker writes a whiteout record to the writable container layer. The UFS interprets this whiteout and hides any version of that file in the layers below. The original file in the read-only layer is never touched — the whiteout simply masks it from view. The same mechanism applies to the parent folder when an entire directory is removed.
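Whiteout semantics can be sketched the same way. The WHITEOUT marker below stands in for the special whiteout file the union filesystem writes; the paths are illustrative:

```python
WHITEOUT = object()  # sentinel standing in for a whiteout record

def delete(upper: dict, path: str):
    """Deleting never touches the read-only layers: a whiteout entry
    in the upper layer masks the file from the merged view."""
    upper[path] = WHITEOUT

def read(upper: dict, lower: dict, path: str):
    """Resolve a path through the merged view, honoring whiteouts."""
    if path in upper:
        entry = upper[path]
        return None if entry is WHITEOUT else entry
    return lower.get(path)

lower = {"/var/log/app.log": "old contents"}
upper = {}
delete(upper, "/var/log/app.log")
print(read(upper, lower, "/var/log/app.log"))  # None: hidden by whiteout
print(lower["/var/log/app.log"])               # original layer intact
```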


This is why stopping and removing a container does not destroy the image, and why you can spin up 100 containers from the same image without 100 copies of the image on disk.

Union File System Implementations

The “stacking” of layers into a single unified view is handled by a Union File System. Common implementations include:

  • OverlayFS: the default for Docker on modern Linux. Uses kernel-native overlay mounts. Very fast.
  • AUFS: an older union file system, once the Docker default. Requires a kernel patch; largely deprecated.
  • Btrfs: a copy-on-write file system at the block level. Offers snapshots.
  • ZFS: provides strong integrity guarantees and efficient snapshots. Less common for containers.
  • Device Mapper: block-level thin provisioning. Used in older RHEL/CentOS environments.

Union filesystems are powerful but carry some trade-offs worth knowing:

SELinux Extended Attributes

Because union filesystems translate between different filesystem conventions when merging layers, some implementations do not support certain metadata features — most notably SELinux extended attributes (xattrs). A layer built with SELinux labels may lose those labels or have them reinterpreted when merged. This can cause unexpected permission or policy enforcement issues in hardened environments.

Memory-Mapped Files (mmap)

Union filesystems use a copy-on-write pattern: when a process modifies a file in a read-only layer, the file is first copied up into the writable layer before the write is applied. This conflicts with mmap — the system call that maps a file directly into a process’s memory address space — because mmap expects the file to have a stable location on disk. The copy-up breaks that assumption, causing failures or degraded performance for applications that rely heavily on mmap (databases, certain JVMs, memory-mapped logging).

The Workaround: Volumes

Most write-related UFS issues — including mmap conflicts — can be resolved by mounting a volume at the path where the application writes. Volumes bypass the union filesystem entirely and write directly to the host filesystem, eliminating both the performance overhead of copy-on-write and the mmap compatibility issue.

# Mount a volume at the path an mmap-heavy app writes to,
# bypassing the union filesystem for that directory:
docker run -v my-data:/var/lib/postgresql/data postgres

Filesystem Attribute Changes (Ownership and Permissions)

Changes to filesystem attributes — such as file ownership (chown) and permissions (chmod) — are recorded by the UFS in exactly the same way as changes to file content. The practical implication: if you run chown or chmod on a large number of files in a single layer, every one of those files gets copied into the new layer even though their content is unchanged. This can silently inflate layer sizes significantly. Keep attribute changes targeted and combine them with the instruction that first creates the files to avoid creating a bloated extra layer.

Hidden Bloat and the Flatten Anti-Pattern

Because whiteouts hide but never delete, and because every RUN instruction adds a layer, images can accumulate invisible bloat over time. The temptation is to “flatten” the image by collapsing all layers into one:

# ❌ Anti-pattern: flattening destroys critical image metadata
docker image save my-app | docker image import - my-app:flat

This is the wrong solution — it costs you:

  • Metadata — entrypoint, env vars, labels, exposed ports, and the full execution context are lost
  • Change history — docker history becomes a single opaque entry, making debugging far harder
  • Layer sharing — consumers who already have your base layers cached must re-download the entire image from scratch

The correct approach is to branch instead of flatten — go back to an earlier point in the image’s layer history and create a new image from there, keeping only the layers you need. Better still, automate construction with a Dockerfile and combine operations into single RUN steps to prevent bloat from accumulating in the first place. Every Dockerfile build is a reproducible, auditable branch — without the manual error-prone steps of the flatten workaround.

Exporting and Importing Filesystems (Legitimate Uses)

While flattening a production image is an anti-pattern, there are legitimate reasons to work with container filesystems outside the union filesystem context entirely — and Docker provides two commands for this:

  • docker container export: streams the full flattened filesystem of a container to a tarball (stdout or a file). It represents what the container actually sees — the merged union of all layers.
  • docker image import: streams a tarball into Docker as a new single-layer image. Supports compressed and uncompressed formats. An optional Dockerfile instruction can be applied during import.
# Export a running or stopped container's filesystem to a tar archive
docker container export my-container -o my-container-fs.tar
# For just a few files, docker cp is more direct
docker cp my-container:/etc/nginx/nginx.conf ./nginx.conf
# Import a filesystem tarball as a new minimal base image
# (optional: apply a Dockerfile instruction inline with --change)
docker image import my-container-fs.tar my-base-image:latest
docker image import --change 'CMD ["/bin/sh"]' my-fs.tar my-base:minimal

When to use these:

  • You need to inspect or use the container’s merged filesystem outside Docker (e.g., forensic analysis, extracting artifacts)

  • You are bootstrapping a minimal base image from a filesystem you built externally (e.g., a custom root fs created with debootstrap or buildroot)

  • You need more than a few files — docker cp works for individual files, but export is more direct for bulk extraction


Putting It All Together: The Full Container Lifecycle

1. Build → Dockerfile instructions create stacked read-only image layers
each layer hashed and stored
2. Push/Pull → Only missing layers are transferred over the network
shared layers are reused from local cache
3. Start → Runtime creates:
• New set of Namespaces (pid, net, mnt, uts, ipc, user)
• New cgroup for the container's processes
• A writable CoW layer on top of the image layers
• A virtual network interface with an IP address
4. Running → Process runs in isolation
sees its own PID 1, its own /etc, its own IP
cgroups enforce CPU/memory limits
writes go to the ephemeral upper layer
5. Stop/Delete → Running processes are terminated
The ephemeral read-write layer is discarded
Image layers remain untouched
Namespaces and cgroups are torn down