
Container Isolation Misconceptions: PID Namespaces, Docker Architecture, and Memory Overcommit

February 16, 2026 · 15 min read

How I Got Here

I’ve been playing around with containers for a while — writing Dockerfiles, spinning up services, poking at cgroups. I thought I had a decent mental model of how container isolation works. Namespaces keep things separate, cgroups limit resources, done.

Then I spun up a Debian VM in OrbStack and started actually testing those assumptions. What I found broke three things I took for granted:

  1. A single Docker flag lets any container read every other container’s secrets
  2. Killing docker run doesn’t kill the container
  3. A 32MB container can malloc() 256MB without errors

Each of these is something I either assumed was impossible or never thought to question. Here’s what happened when I ran the experiments.

Setup

Component   Detail
Host        OrbStack VM on macOS (Apple Silicon, aarch64)
OS          Debian 13 (trixie), kernel 6.17.8
Docker      29.2.1, containerd v2.2.1, runc 1.3.4
cgroup      v2 (cgroup2fs), systemd driver
Memory      7.8 GiB RAM, vm.overcommit_memory=1
Images      fedora:latest, test-container:latest

1. --pid=host Exposes Everything

Setup: Two fedora:latest containers — a “victim” with secret env vars, and an “attacker” launched once with default isolation, once with --pid=host.

The Assumption

Containers can’t see each other’s processes. Each container gets its own PID namespace — it sees only its own processes starting from PID 1. This is a fundamental security boundary.

The --pid=host flag exists for debugging and monitoring tools. It puts the container in the host’s PID namespace so it can see all processes. I knew this existed but never thought through the full implications.
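
The boundary itself is easy to see: each PID namespace has its own inode, and /proc exposes it. A quick sketch (the container name tc is arbitrary):

docker run --rm -d --name tc fedora:latest sleep infinity

readlink /proc/self/ns/pid                  # the host shell's PID namespace, e.g. pid:[4026531836]
docker exec tc readlink /proc/self/ns/pid   # a different pid:[...] inode — the container's own namespace

docker rm -f tc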

The Experiment

I started a “victim” container with sensitive environment variables — the kind of thing you’d see in any 12-factor app:

docker run --rm -dit --name victim \
  -e DB_PASSWORD=super_secret_p4ss \
  -e API_KEY=sk-12345abcde \
  fedora:latest bash -c 'sleep infinity'

Then I launched an “attacker” container in two modes — once with normal isolation, once with --pid=host.
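
The probe run inside the attacker container was roughly this — a sketch, not the verbatim script (27720 is the victim's host PID, obtained on the host with docker inspect -f '{{.State.Pid}}' victim):

VICTIM_PID=27720

echo "Visible PIDs: $(ls -d /proc/[0-9]* | wc -l)"
for f in environ cmdline maps; do
    echo "== /proc/$VICTIM_PID/$f =="
    tr '\0' '\n' < /proc/$VICTIM_PID/$f    # environ and cmdline are NUL-separated
done
ls /proc/$VICTIM_PID/root/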

Normal mode (isolated PID namespace):

Visible PIDs: 3
cat: /proc/27720/environ: No such file or directory
cat: /proc/27720/cmdline: No such file or directory
ls: cannot access '/proc/27720/root/': No such file or directory

The victim’s host PID doesn’t even exist inside the attacker’s namespace. Complete isolation — the attacker can’t confirm the victim exists.

With --pid=host:

Visible PIDs: 42

== /proc/27720/environ ==
DB_PASSWORD=super_secret_p4ss
API_KEY=sk-12345abcde
HOME=/root
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
...

== /proc/27720/cmdline ==
sleep infinity

== /proc/27720/maps ==
aaaab1790000-aaaab1796000 r-xp 00000000 00:9e 66208  /usr/bin/sleep
aaaab77d4000-aaaab77f5000 rw-p 00000000 00:00 0      [heap]
ffffb40b0000-ffffb4256000 r-xp 00000000 00:9e 67427  /usr/lib64/libc.so.6

Full compromise. One flag, and the attacker gets:

Asset                   Path                  What’s Exposed
Environment variables   /proc/<pid>/environ   Database passwords, API keys, tokens
Command line            /proc/<pid>/cmdline   Flags, config paths, secrets passed as args
Filesystem              /proc/<pid>/root/     Application code, config files
Memory layout           /proc/<pid>/maps      ASLR¹ bypass for exploit development

It Gets Worse: Cross-Container Kill

I tested whether a --pid=host container could kill other containers. Started two containers with separate PID namespaces:

docker run --rm -dit --name tc fedora:latest bash -c 'sleep infinity'
docker run --rm -dit --name tc2 fedora:latest bash -c 'sleep infinity'

From tc with normal isolation, trying to kill tc2:

bash: line 1: kill: (28051) - No such process

tc2 is alive. Now restart tc with --pid=host and try again:

docker exec tc bash -c "kill -9 28051"
# (no error)

tc2 is dead. A container with --pid=host can kill any process on the host, including other containers.

What This Means

  • Never use --pid=host in multi-tenant environments. A single container with this flag can exfiltrate secrets from every other container on the host.
  • Kubernetes hostPID: true is the equivalent. Pod security policies or admission controllers should block it.
  • Secrets in environment variables are always exposed via /proc. Even without --pid=host, anyone with root on the host can read them. Use mounted secret files or a secret manager instead.
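
A sketch of the file-based alternative (the host path and secret value here are made up for illustration):

# Instead of -e DB_PASSWORD=..., bind-mount the secret as a read-only file
sudo mkdir -p /run/host-secrets
echo -n 'super_secret_p4ss' | sudo tee /run/host-secrets/db_password > /dev/null

docker run --rm -d --name victim \
  -v /run/host-secrets/db_password:/run/secrets/db_password:ro \
  fedora:latest sleep infinity

# The process environment no longer carries anything sensitive:
docker exec victim bash -c 'tr "\0" "\n" < /proc/1/environ'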

2. docker run Is Not the Container

Setup: One test-container:latest container in interactive mode. Two terminals — one running docker run, another to inspect and kill the client process.

The Assumption

I always thought of docker run as “running the container.” That creates a mental model where the docker run process IS the container — and killing it should kill the container.

This is wrong.

What Actually Happens

The Docker container runtime separates the CLI client and the actual container into completely different process trees:

Process Tree A (CLI client):
  user's shell
    └─ docker run --rm -it --name tc ...     ← API client only

Process Tree B (actual container — entirely separate):
  systemd (PID 1)
    └─ containerd-shim-runc-v2               ← container supervisor
         └─ sleep infinity                    ← container PID 1

The key: containerd-shim’s parent is systemd (PID 1), not dockerd or containerd. This is by design.

docker run performs three HTTP operations over /var/run/docker.sock:

Step   API Call                       What Happens                                      Persistent?
1      POST /containers/create        Creates container config in dockerd’s database    Yes
2      POST /containers/{id}/start    containerd spawns containerd-shim + runc          Yes
3      POST /containers/{id}/attach   Bidirectional stdio stream (HTTP hijack)          No — this is docker run’s only ongoing role

Killing docker run disconnects Step 3 (the stdio attachment). Steps 1 and 2 are already complete and self-sustaining.
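
The first two steps don't need docker run at all — a sketch of driving them with curl against the socket (unversioned API paths; the create body is trimmed to the minimum):

# Step 1: create — dockerd stores the container config and returns an Id
curl -s --unix-socket /var/run/docker.sock \
  -H 'Content-Type: application/json' \
  -d '{"Image": "fedora:latest", "Cmd": ["sleep", "infinity"]}' \
  'http://localhost/containers/create?name=tc'

# Step 2: start — containerd spawns the shim and runc; no client needs to stay attached
curl -s -X POST --unix-socket /var/run/docker.sock \
  http://localhost/containers/tc/start

docker ps --filter name=tc   # running, and no docker run process ever existed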

The Experiment

I traced the process ancestry from the container’s PID back to init:

PID 28496 [sleep] → parent 28472
  cmd: sleep infinity
PID 28472 [containerd-shim] → parent 1
  cmd: /usr/bin/containerd-shim-runc-v2 -namespace moby -id ab842a027dd9...
PID 1 [systemd] → parent 0
  cmd: /sbin/init

docker run doesn’t appear anywhere in that chain.
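
The trace itself is just a walk up PPid in /proc — a sketch, assuming the container is named tc:

# Start from the container's init process and follow parents up to PID 1
pid=$(docker inspect -f '{{.State.Pid}}' tc)
while [ "$pid" -gt 0 ]; do
    echo "PID $pid: $(tr '\0' ' ' < /proc/$pid/cmdline)"
    pid=$(awk '/^PPid:/ {print $2}' /proc/$pid/status)
done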

Then I killed the docker run client process directly:

# Terminal 1: running the container interactively
docker run --rm -it --name tc test-container:latest bash

# Terminal 2: find and kill the docker run process
ps aux --forest | grep 'docker run'
# tasnim 13666 ... docker run --rm -it --name tc test-container:latest bash

sudo kill -9 13666

# Container is still running:
docker ps
# NAMES  IMAGE                  STATUS
# tc     test-container:latest  Up 2 minutes

docker exec tc bash -c 'echo STILL ALIVE'
# STILL ALIVE

The container didn’t even notice.

The Resilience Hierarchy

This architecture means containers survive cascading failures:

What Dies                 Container Impact                   Recovery
docker run CLI            None                               docker attach or docker exec
User’s terminal/SSH       None                               Container keeps running
dockerd (Docker daemon)   None — containers keep running     Restart dockerd, it reconnects
containerd                None — containers keep running     Restart containerd, it reconnects
containerd-shim           Container dies                     This is the single point of failure
Host kernel               Everything dies                    —

This explains things I’d observed but never thought about. Closing my laptop doesn’t kill remote containers. CI pipeline timeouts don’t stop containers they started. Ctrl+C in docker run only works because the CLI catches SIGINT and forwards it via the Docker API — it’s not parent-child signal propagation.
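
That API-mediated path is available directly, with no docker run in the picture — a sketch (delivering SIGINT to a container named tc):

# Ask the daemon to deliver the signal — the same route the CLI uses, not parent-to-child kill
docker kill --signal=SIGINT tc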


3. malloc() Lies About Memory Limits

Setup: One fedora:latest container with --memory=32m. A static C binary (mem_eater) that allocates 1MB blocks via malloc(), with an option to memset() them.

The Assumption

Docker’s --memory=32m sets a cgroup limit. I assumed a container with 32MB of memory can’t use more than 32MB. That’s only true for physical memory.

How Linux Memory Allocation Actually Works

graph TD
    A["malloc(1MB)"] --> B["glibc calls mmap()"]
    B --> C["Kernel creates VMA"]
    C --> D["Returns SUCCESS"]
    D --> E["App writes to memory"]
    E --> F["Page fault"]
    F --> G["Kernel allocates physical page"]
    G --> H{"cgroup limit exceeded?"}
    H -->|No| I["Page mapped — done"]
    H -->|Yes| J["OOM killer → SIGKILL"]

The critical insight: malloc() succeeds at VMA creation time, not page allocation time. The kernel promises memory it may not have, deferring physical allocation until the pages are actually accessed. On this host, vm.overcommit_memory=1 — the kernel never rejects malloc().
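
Both pieces are visible from the shell — a quick check of the host's overcommit policy and the commit accounting the kernel keeps anyway:

# 1 = always overcommit: the kernel grants virtual memory without checking it can back it
cat /proc/sys/vm/overcommit_memory

# CommitLimit / Committed_AS — how much has been promised vs. what a stricter policy would allow
grep -E '^Commit' /proc/meminfo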

The Test Binary

I wrote a small C program (mem_eater) that allocates 1MB blocks via malloc() and optionally memset()s them to force physical page allocation. Compiled with gcc -static, copied into a 32MB-limited container. Source code is in the appendix.

Test A: malloc Without Touching Pages

docker run --rm -dit --name mem-test --memory=32m fedora:latest bash
docker cp /tmp/mem_eater mem-test:/tmp/mem_eater
docker exec mem-test /tmp/mem_eater 256 n
Mode: NO-TOUCH pages | Target: 256 MB
  Allocated 10 MB (virtual) [pages NOT touched]
  Allocated 20 MB (virtual) [pages NOT touched]
  ...
  Allocated 250 MB (virtual) [pages NOT touched]
  Allocated 256 MB (virtual) [pages NOT touched]
Final: 256 MB allocated. Sleeping...
Exit code: 0

256 MB “allocated” in a 32 MB container. No OOM. No error. Every malloc() returned a valid pointer. The process sleeps with 256MB of virtual address space mapped but zero physical pages consumed.

Test B: malloc With memset (Touching Pages)

docker exec mem-test /tmp/mem_eater 256 t
Mode: TOUCH pages | Target: 256 MB
  Allocated 10 MB (virtual) [pages touched]
  Allocated 20 MB (virtual) [pages touched]
  Allocated 30 MB (virtual) [pages touched]
  Allocated 40 MB (virtual) [pages touched]
  Allocated 50 MB (virtual) [pages touched]
  Allocated 60 MB (virtual) [pages touched]
Exit code: 137

OOM killed between 60–70 MB virtual (~32 MB resident). Exit code 137 = 128 + 9 (SIGKILL from the OOM killer). The process was killed during memset() when it tried to touch pages beyond the cgroup limit.

What the Kernel Saw

The cgroup’s memory.events told the full story:

BEFORE OOM:                    AFTER OOM:
low           0                low           0
high          0                high          0
max           0                max           40    ← 40 times cgroup limit was hit
oom           0                oom           1     ← OOM event triggered
oom_kill      0                oom_kill      1     ← process was killed

The kernel hit the cgroup limit 40 times before triggering the OOM kill. Each time, it tried to reclaim memory — flushing page cache, swapping — before giving up.
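
Those counters live under the container's cgroup — a sketch of where to read them (the path assumes the systemd cgroup driver from the setup table):

CID=$(docker inspect -f '{{.Id}}' mem-test)
CG=/sys/fs/cgroup/system.slice/docker-$CID.scope

cat $CG/memory.max       # 33554432 — the 32MB limit in bytes
cat $CG/memory.current   # resident usage right now, in bytes
cat $CG/memory.events    # the low/high/max/oom/oom_kill counters above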

The dmesg output confirmed:

Memory cgroup out of memory: Killed process 87087 (mem_eater)
  total-vm:31688kB, anon-rss:30324kB, file-rss:520kB, shmem-rss:0kB
  constraint=CONSTRAINT_MEMCG

Key details: anon-rss (29.6 MB resident) hit the 32MB cgroup wall. The kill was CONSTRAINT_MEMCG — scoped to the container’s cgroup, not system-wide. Only mem_eater was killed, not the container’s PID 1. The container stayed running.

And there’s no warning. No SIGTERM first. SIGKILL, immediately. The signal handler I registered in the code never fired.

One more thing I checked — how the kernel decides what to kill:

Process             oom_score   oom_score_adj
Container process   666         0
dockerd             336         -500
systemd             666         0

dockerd runs with oom_score_adj=-500, making it a low-priority target. Container processes have no such protection — they’re the first to go.
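
Those scores come straight out of /proc — a sketch of the check (dockerd's PID looked up with pidof):

# Container workload: full badness score, no adjustment
pid=$(docker inspect -f '{{.State.Pid}}' mem-test)
cat /proc/$pid/oom_score /proc/$pid/oom_score_adj

# dockerd ships with oom_score_adj=-500, so its reported score is much lower
cat /proc/$(pidof dockerd)/oom_score /proc/$(pidof dockerd)/oom_score_adj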


What I Learned

Misconception                                       Reality                                            Risk
“Containers can’t see each other’s processes”       --pid=host exposes everything                      Critical — full secret exfiltration
“Killing docker run kills the container”            Container is in a separate process tree            Low — operational confusion
“--memory=32m means 32MB max”                       256MB virtual allocated; OOM only on page touch    High — silent, sudden SIGKILL

The biggest takeaway: container isolation is real but conditional. The defaults are good, but a single misconfigured flag or a misunderstanding of how memory works can break the model completely.

What I changed after running these experiments:

Security:

  • Block --pid=host in production. Kubernetes PodSecurityStandards (restricted profile) or OPA/Gatekeeper policies should enforce this — see the sketch after this list.
  • Stop putting secrets in environment variables. They’re readable via /proc by anyone with access to the host. Mounted secret files or a secret manager (Vault, AWS Secrets Manager) are better.
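
For the Kubernetes side of the first point, enforcement can be a single namespace label via Pod Security Admission — a sketch (the namespace name is arbitrary; both the baseline and restricted profiles reject hostPID: true):

# Reject pods that request hostPID (among other privileged settings) in this namespace
kubectl label namespace prod \
  pod-security.kubernetes.io/enforce=restricted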

Memory:

  • Monitor RSS, not VSZ. Virtual memory size is meaningless for capacity planning. docker stats, cAdvisor, or memory.current show actual usage.
  • Set --memory-swap equal to --memory. Without it, containers silently use swap to defer OOM kills, masking the problem while degrading performance.
  • Watch memory.events. The max counter rising means the cgroup limit is being hit repeatedly — it’s an early warning before OOM kills start.
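
Putting those three together, a sketch of the settings and checks (container name, image, and sizes are placeholders; the cgroup path assumes the systemd driver from the setup table):

# Cap RAM and swap together so the limit means what it says
docker run --rm -d --name api --memory=256m --memory-swap=256m myapp:latest   # myapp is a placeholder image

# Resident usage (what capacity planning should track), not virtual size
docker stats --no-stream api

# Early warning: a rising 'max' counter means the limit is being hit before any OOM kill
CID=$(docker inspect -f '{{.Id}}' api)
cat /sys/fs/cgroup/system.slice/docker-$CID.scope/memory.events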

Appendix: mem_eater.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>

void sig_handler(int sig) {
    printf("CAUGHT SIGNAL %d before OOM kill!\n", sig);
    fflush(stdout);
}

int main(int argc, char *argv[]) {
    int mb = atoi(argv[1]);
    int touch = (argc > 2 && argv[2][0] == 't');

    signal(SIGTERM, sig_handler);
    signal(SIGUSR1, sig_handler);

    printf("Mode: %s pages | Target: %d MB\n", touch ? "TOUCH" : "NO-TOUCH", mb);
    fflush(stdout);

    char **blocks = malloc(sizeof(char*) * mb);
    int allocated = 0;

    for (int i = 0; i < mb; i++) {
        blocks[i] = malloc(1024 * 1024);
        if (!blocks[i]) {
            printf("malloc() FAILED at %d MB\n", i);
            fflush(stdout);
            break;
        }
        allocated++;
        if (touch) {
            memset(blocks[i], 'A', 1024 * 1024);
        }
        if (i % 10 == 9 || i == mb - 1) {
            printf("  Allocated %d MB (virtual)%s\n", i + 1,
                   touch ? " [pages touched]" : " [pages NOT touched]");
            fflush(stdout);
        }
    }

    printf("Final: %d MB allocated. Sleeping...\n", allocated);
    fflush(stdout);
    sleep(60);
    return 0;
}

Compiled with: gcc -static mem_eater.c -o mem_eater


Appendix: Demand Paging (Page Touch)

When malloc() returns a pointer, the kernel hasn’t allocated any physical RAM — it only creates a Virtual Memory Area (VMA), a bookkeeping entry that marks the address range as valid. Physical pages are allocated only when the process writes to (touches) that memory, triggering a page fault:

  1. malloc(1MB) — kernel creates VMA, returns immediately. No RAM consumed.
  2. memset(ptr, 'A', 1MB) — CPU accesses the address, finds no physical page mapped, traps to the kernel (page fault).
  3. Kernel allocates a real 4KB physical page, maps it into the process’s page table, resumes execution.
  4. This repeats for every 4KB page within the block.

This lazy strategy is called demand paging — the kernel defers physical allocation until the last possible moment. Combined with vm.overcommit_memory=1 (the default on this host), the kernel never refuses a malloc(), which is why Test A allocated 256MB of virtual memory in a 32MB container without triggering the OOM killer. Test B, which called memset() to touch every page, forced real page faults that demanded physical RAM — hitting the cgroup limit and getting killed.
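
This is easy to watch from the host while the no-touch run from Test A sleeps — VmSize carries the ~256MB of promises, VmRSS stays tiny. A sketch, assuming mem_eater is the only matching process:

# Run on the host while mem_eater sleeps after the no-touch test
pid=$(pgrep -f mem_eater)
grep -E 'VmSize|VmRSS' /proc/$pid/status   # VmSize ≈ 256MB+, VmRSS a few MB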


Appendix: OOM Score

When the OOM killer triggers, the kernel picks a victim using a two-part scoring system:

  • Badness heuristic (0–1000): the kernel assigns each process a base score roughly proportional to the fraction of allowed memory it’s consuming. A process using half its allowed memory scores around 500. This base score is what drives the OOM killer’s selection.
  • oom_score_adj (-1000 to +1000): an admin-tunable bias added to the badness score before the kernel makes its kill decision. Negative values protect a process; positive values make it a bigger target. Setting -1000 disables OOM killing entirely for that process — it will always report a badness score of 0.
  • oom_score (shown in /proc/<pid>/oom_score): the final score displayed by the kernel, which already includes the oom_score_adj offset. This is what you see when inspecting a process, but the underlying calculation is badness + adjustment.
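
The adjustment is just a writable file under /proc — a sketch of biasing a process by hand (here the current shell, via $$):

# Volunteer the current shell as a preferred OOM victim — raising the value needs no privilege
echo 500 > /proc/$$/oom_score_adj

# Protecting it (lowering the value) requires root / CAP_SYS_RESOURCE
echo -500 | sudo tee /proc/$$/oom_score_adj

cat /proc/$$/oom_score   # the reported score already includes the adjustment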

From the experiments:

Process             oom_score   oom_score_adj   Why
Container process   666         0               No protection — first to die
dockerd             336         -500            Docker protects itself; losing dockerd means losing all container management
systemd             666         0               High base score, but PID 1 has kernel-level OOM immunity regardless

The effective kill priority: container processes first (high score, no adjustment) → systemd (protected as PID 1 by the kernel) → dockerd (score artificially lowered by its -500 adjustment). This ensures Docker infrastructure survives while container workloads get sacrificed.



Footnotes

  1. Address Space Layout Randomization — a security technique where the OS randomizes memory addresses (stack, heap, libraries) for each process, making it harder for attackers to predict where code and data reside. Reading /proc/<pid>/maps reveals the actual layout, defeating the randomization entirely.