Container Isolation Misconceptions: PID Namespaces, Docker Architecture, and Memory Overcommit
How I Got Here
I’ve been playing around with containers for a while — writing Dockerfiles, spinning up services, poking at cgroups. I thought I had a decent mental model of how container isolation works. Namespaces keep things separate, cgroups limit resources, done.
Then I spun up a Debian VM in OrbStack and started actually testing those assumptions. What I found broke three things I took for granted:
- A single Docker flag lets any container read every other container’s secrets
- Killing docker run doesn’t kill the container
- A 32MB container can malloc() 256MB without errors
Each of these is something I either assumed was impossible or never thought to question. Here’s what happened when I ran the experiments.
Setup
| Component | Detail |
|---|---|
| Host | OrbStack VM on macOS (Apple Silicon, aarch64) |
| OS | Debian 13 (trixie), kernel 6.17.8 |
| Docker | 29.2.1, containerd v2.2.1, runc 1.3.4 |
| cgroup | v2 (cgroup2fs), systemd driver |
| Memory | 7.8 GiB RAM, vm.overcommit_memory=1 |
| Images | fedora:latest, test-container:latest |
1. --pid=host Exposes Everything
Setup: Two fedora:latest containers — a “victim” with secret env vars, and an “attacker” launched once with default isolation, once with --pid=host.
The Assumption
Containers can’t see each other’s processes. Each container gets its own PID namespace — it sees only its own processes starting from PID 1. This is a fundamental security boundary.
The --pid=host flag exists for debugging and monitoring tools. It puts the container in the host’s PID namespace so it can see all processes. I knew this existed but never thought through the full implications.
The Experiment
I started a “victim” container with sensitive environment variables — the kind of thing you’d see in any 12-factor app:
docker run --rm -dit --name victim \
-e DB_PASSWORD=super_secret_p4ss \
-e API_KEY=sk-12345abcde \
fedora:latest bash -c 'sleep infinity'
Then I launched an “attacker” container in two modes — once with normal isolation, once with --pid=host.
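A rough sketch of what the probe looks like (the victim’s host PID comes from docker inspect on the host; I’ve trimmed it to the essentials, so the exact flags may differ from my run):

VICTIM_PID=$(docker inspect -f '{{.State.Pid}}' victim)   # 27720 in my run
docker run --rm --pid=host fedora:latest bash -c \
  "tr '\0' '\n' < /proc/$VICTIM_PID/environ; head -3 /proc/$VICTIM_PID/maps"
# Re-run the same command without --pid=host: every /proc/$VICTIM_PID path fails
# with 'No such file or directory'.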
Normal mode (isolated PID namespace):
Visible PIDs: 3
cat: /proc/27720/environ: No such file or directory
cat: /proc/27720/cmdline: No such file or directory
ls: cannot access '/proc/27720/root/': No such file or directory
The victim’s host PID doesn’t even exist inside the attacker’s namespace. Complete isolation — the attacker can’t confirm the victim exists.
With --pid=host:
Visible PIDs: 42
== /proc/27720/environ ==
DB_PASSWORD=super_secret_p4ss
API_KEY=sk-12345abcde
HOME=/root
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
...
== /proc/27720/cmdline ==
sleep infinity
== /proc/27720/maps ==
aaaab1790000-aaaab1796000 r-xp 00000000 00:9e 66208 /usr/bin/sleep
aaaab77d4000-aaaab77f5000 rw-p 00000000 00:00 0 [heap]
ffffb40b0000-ffffb4256000 r-xp 00000000 00:9e 67427 /usr/lib64/libc.so.6
Full compromise. One flag, and the attacker gets:
| Asset | Path | What’s Exposed |
|---|---|---|
| Environment variables | /proc/<pid>/environ | Database passwords, API keys, tokens |
| Command line | /proc/<pid>/cmdline | Flags, config paths, secrets passed as args |
| Filesystem | /proc/<pid>/root/ | Application code, config files |
| Memory layout | /proc/<pid>/maps | ASLR1 bypass for exploit development |
It Gets Worse: Cross-Container Kill
I tested whether a --pid=host container could kill other containers. Started two containers with separate PID namespaces:
docker run --rm -dit --name tc fedora:latest bash -c 'sleep infinity'
docker run --rm -dit --name tc2 fedora:latest bash -c 'sleep infinity'
From tc with normal isolation, trying to kill tc2:
bash: line 1: kill: (28051) - No such process
tc2 is alive. Now restart tc with --pid=host and try again:
docker exec tc bash -c "kill -9 28051"
# (no error)
tc2 is dead. A container with --pid=host can kill any process on the host, including other containers.
What This Means
- Never use --pid=host in multi-tenant environments. A single container with this flag can exfiltrate secrets from every other container on the host.
- Kubernetes hostPID: true is the equivalent. Pod security policies or admission controllers should block it.
- Secrets in environment variables are always exposed via /proc. Even without --pid=host, anyone with root on the host can read them. Use mounted secret files or a secret manager instead (a sketch follows below).
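For the last point, a minimal sketch of the file-based alternative (the path and container name here are just for illustration):

echo -n 'super_secret_p4ss' > /tmp/db_password
docker run --rm -dit --name victim-v2 \
  -v /tmp/db_password:/run/secrets/db_password:ro \
  fedora:latest bash -c 'sleep infinity'
# The app reads /run/secrets/db_password at startup. Nothing shows up in
# /proc/<pid>/environ or docker inspect, though host root can still read the file.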
2. docker run Is Not the Container
Setup: One test-container:latest container in interactive mode. Two terminals — one running docker run, another to inspect and kill the client process.
The Assumption
I always thought of docker run as “running the container.” That creates a mental model where the docker run process IS the container — and killing it should kill the container.
This is wrong.
What Actually Happens
The Docker container runtime separates the CLI client and the actual container into completely different process trees:
Process Tree A (CLI client):
user's shell
└─ docker run --rm -it --name tc ... ← API client only
Process Tree B (actual container — entirely separate):
systemd (PID 1)
└─ containerd-shim-runc-v2 ← container supervisor
└─ sleep infinity ← container PID 1
The key: containerd-shim’s parent is systemd (PID 1), not dockerd or containerd. This is by design.
docker run performs three HTTP operations over /var/run/docker.sock:
| Step | API Call | What Happens | Persistent? |
|---|---|---|---|
| 1 | POST /containers/create | Creates container config in dockerd’s database | Yes |
| 2 | POST /containers/{id}/start | containerd spawns containerd-shim + runc | Yes |
| 3 | POST /containers/{id}/attach | Bidirectional stdio stream (HTTP hijack) | No — this is docker run’s only ongoing role |
Killing docker run disconnects Step 3 (the stdio attachment). Steps 1 and 2 are already complete and self-sustaining.
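You can reproduce steps 1 and 2 without the docker CLI at all. A rough sketch using curl against the socket (the JSON body is trimmed to the essentials; <container-id> is a placeholder for the Id the create call returns):

curl -s --unix-socket /var/run/docker.sock \
  -H 'Content-Type: application/json' \
  -d '{"Image":"fedora:latest","Cmd":["sleep","infinity"]}' \
  http://localhost/containers/create
# => {"Id":"<container-id>", ...}
curl -s --unix-socket /var/run/docker.sock \
  -X POST http://localhost/containers/<container-id>/start
# The container is now running with no CLI process involved anywhere.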
The Experiment
I traced the process ancestry from the container’s PID back to init:
PID 28496 [sleep] → parent 28472
cmd: sleep infinity
PID 28472 [containerd-shim] → parent 1
cmd: /usr/bin/containerd-shim-runc-v2 -namespace moby -id ab842a027dd9...
PID 1 [systemd] → parent 0
cmd: /sbin/init
docker run doesn’t appear anywhere in that chain.
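One way to reproduce that trace is to walk the PPid entries in /proc. A sketch, run on the host with the container named tc as before:

pid=$(docker inspect -f '{{.State.Pid}}' tc)   # the container's main process
while [ "$pid" -gt 0 ]; do
  echo "PID $pid: $(tr '\0' ' ' < /proc/$pid/cmdline)"
  pid=$(awk '/^PPid:/ {print $2}' /proc/$pid/status)
done
# Prints the container process, then containerd-shim-runc-v2, then PID 1.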
Then I killed the docker run client process directly:
# Terminal 1: running the container interactively
docker run --rm -it --name tc test-container:latest bash
# Terminal 2: find and kill the docker run process
ps aux --forest | grep 'docker run'
# tasnim 13666 ... docker run --rm -it --name tc test-container:latest bash
sudo kill -9 13666
# Container is still running:
docker ps
# NAMES IMAGE STATUS
# tc test-container:latest Up 2 minutes
docker exec tc bash -c 'echo STILL ALIVE'
# STILL ALIVE
The container didn’t even notice.
The Resilience Hierarchy
This architecture means containers survive cascading failures:
| What Dies | Container Impact | Recovery |
|---|---|---|
| docker run CLI | None | docker attach or docker exec |
| User’s terminal/SSH | None | Container keeps running |
| dockerd (Docker daemon) | None — containers keep running | Restart dockerd, it reconnects |
| containerd | None — containers keep running | Restart containerd, it reconnects |
| containerd-shim | Container dies | This is the single point of failure |
| Host kernel | Everything dies | — |
This explains things I’d observed but never thought about. Closing my laptop doesn’t kill remote containers. CI pipeline timeouts don’t stop containers they started. Ctrl+C in docker run only works because the CLI catches SIGINT and forwards it via the Docker API — it’s not parent-child signal propagation.
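One caveat on the dockerd row: Docker documents the live-restore daemon option as the way to keep containers running across planned daemon restarts and upgrades, not just crashes. A sketch of enabling it (note this overwrites any existing daemon.json):

sudo tee /etc/docker/daemon.json <<'EOF'
{ "live-restore": true }
EOF
sudo systemctl restart docker
docker ps   # containers started before the restart are still Up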
3. malloc() Lies About Memory Limits
Setup: One fedora:latest container with --memory=32m. A static C binary (mem_eater) that allocates 1MB blocks via malloc(), with an option to memset() them.
The Assumption
Docker’s --memory=32m sets a cgroup limit. I assumed a container with 32MB of memory can’t use more than 32MB. That’s only true for physical memory.
How Linux Memory Allocation Actually Works
graph TD
A["malloc(1MB)"] --> B["glibc calls mmap()"]
B --> C["Kernel creates VMA"]
C --> D["Returns SUCCESS"]
D --> E["App writes to memory"]
E --> F["Page fault"]
F --> G["Kernel allocates physical page"]
G --> H{"cgroup limit exceeded?"}
H -->|No| I["Page mapped — done"]
H -->|Yes| J["OOM killer → SIGKILL"]
The critical insight: malloc() succeeds at VMA creation time, not page allocation time. The kernel promises memory it may not have, deferring physical allocation until the pages are actually accessed. On this host, vm.overcommit_memory=1 — the kernel never rejects malloc().
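You can confirm the overcommit policy directly on the host:

cat /proc/sys/vm/overcommit_memory
# 1   (0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting)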
The Test Binary
I wrote a small C program (mem_eater) that allocates 1MB blocks via malloc() and optionally memset()s them to force physical page allocation. Compiled with gcc -static, copied into a 32MB-limited container. Source code is in the appendix.
Test A: malloc Without Touching Pages
docker run --rm -dit --name mem-test --memory=32m fedora:latest bash
docker cp /tmp/mem_eater mem-test:/tmp/mem_eater
docker exec mem-test /tmp/mem_eater 256 n
Mode: NO-TOUCH pages | Target: 256 MB
Allocated 10 MB (virtual) [pages NOT touched]
Allocated 20 MB (virtual) [pages NOT touched]
...
Allocated 250 MB (virtual) [pages NOT touched]
Allocated 256 MB (virtual) [pages NOT touched]
Final: 256 MB allocated. Sleeping...
Exit code: 0
256 MB “allocated” in a 32 MB container. No OOM. No error. Every malloc() returned a valid pointer. The process sleeps with 256MB of virtual address space mapped while consuming almost no physical memory.
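You can watch the gap while Test A is still in its sleep window. A sketch, run from the host (assuming procps is installed for pgrep):

pid=$(pgrep -f mem_eater)
grep -E 'VmSize|VmRSS' /proc/$pid/status
# VmSize should sit roughly 256 MB above the binary's baseline,
# while VmRSS stays in the low megabytes.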
Test B: malloc With memset (Touching Pages)
docker exec mem-test /tmp/mem_eater 256 t
Mode: TOUCH pages | Target: 256 MB
Allocated 10 MB (virtual) [pages touched]
Allocated 20 MB (virtual) [pages touched]
Allocated 30 MB (virtual) [pages touched]
Allocated 40 MB (virtual) [pages touched]
Allocated 50 MB (virtual) [pages touched]
Allocated 60 MB (virtual) [pages touched]
Exit code: 137
OOM killed between 60–70 MB virtual (~32 MB resident). Exit code 137 = 128 + 9 (SIGKILL from the OOM killer). The process was killed during memset() when it tried to touch pages beyond the cgroup limit.
What the Kernel Saw
The cgroup’s memory.events told the full story:
BEFORE OOM: AFTER OOM:
low 0 low 0
high 0 high 0
max 0 max 40 ← 40 times cgroup limit was hit
oom 0 oom 1 ← OOM event triggered
oom_kill 0 oom_kill 1 ← process was killed
The kernel hit the cgroup limit 40 times before triggering the OOM kill. Each time, it tried to reclaim memory — flushing page cache, swapping — before giving up.
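Those counters live in the container’s cgroup directory. A sketch of where to look (the path assumes cgroup v2 with the systemd driver, as in my setup):

CID=$(docker inspect -f '{{.Id}}' mem-test)
CG=/sys/fs/cgroup/system.slice/docker-$CID.scope
cat $CG/memory.max      # 33554432, the 32 MiB limit in bytes
cat $CG/memory.events   # the low/high/max/oom/oom_kill counters shown above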
The dmesg output confirmed:
Memory cgroup out of memory: Killed process 87087 (mem_eater)
total-vm:31688kB, anon-rss:30324kB, file-rss:520kB, shmem-rss:0kB
constraint=CONSTRAINT_MEMCG
Key details: anon-rss (29.6 MB resident) hit the 32MB cgroup wall. The kill was CONSTRAINT_MEMCG — scoped to the container’s cgroup, not system-wide. Only mem_eater was killed, not the container’s PID 1. The container stayed running.
And there’s no warning. No SIGTERM first. SIGKILL, immediately. The signal handler I registered in the code never fired.
One more thing I checked — how the kernel decides what to kill:
| Process | oom_score | oom_score_adj |
|---|---|---|
| Container process | 666 | 0 |
| dockerd | 336 | -500 |
systemd | 666 | 0 |
dockerd runs with oom_score_adj=-500, making it a low-priority target. Container processes have no such protection — they’re the first to go.
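All of these values are readable from /proc. A sketch for comparing them yourself (the PIDs are whatever pgrep finds on your host):

# dockerd, one container process, and systemd (PID 1)
for p in $(pgrep -x dockerd) $(pgrep -f 'sleep infinity' | head -1) 1; do
  printf '%-8s score=%s adj=%s\n' "$p" \
    "$(cat /proc/$p/oom_score)" "$(cat /proc/$p/oom_score_adj)"
done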
What I Learned
| Misconception | Reality | Risk |
|---|---|---|
| “Containers can’t see each other’s processes” | --pid=host exposes everything | Critical — full secret exfiltration |
| “Killing docker run kills the container” | Container is in a separate process tree | Low — operational confusion |
| “--memory=32m means 32MB max” | 256MB virtual allocated; OOM only on page touch | High — silent, sudden SIGKILL |
The biggest takeaway: container isolation is real but conditional. The defaults are good, but a single misconfigured flag or a misunderstanding of how memory works can break the model completely.
What I changed after running these experiments:
Security:
- Block --pid=host in production. Kubernetes Pod Security Standards (restricted profile) or OPA/Gatekeeper policies should enforce this (see the sketch after this list).
- Stop putting secrets in environment variables. They’re readable via /proc by anyone with access to the host. Mounted secret files or a secret manager (Vault, AWS Secrets Manager) are better.
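On Kubernetes, both the baseline and restricted Pod Security profiles reject hostPID: true. A sketch of enforcing it on a namespace (the namespace name is just an example):

kubectl label namespace prod \
  pod-security.kubernetes.io/enforce=restricted
# Pods requesting hostPID: true in this namespace are now rejected at admission.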
Memory:
- Monitor RSS, not VSZ. Virtual memory size is meaningless for capacity planning. docker stats, cAdvisor, or memory.current show actual usage.
- Set --memory-swap equal to --memory. Without it, containers silently use swap to defer OOM kills, masking the problem while degrading performance.
- Watch memory.events. The max counter rising means the cgroup limit is being hit repeatedly — it’s an early warning before OOM kills start. (A sketch of all three follows below.)
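A sketch of what those three recommendations look like in practice (the container name and image are placeholders; the cgroup path assumes the systemd driver as before):

# Swap ceiling equals the RAM ceiling, so the container can't silently spill to swap
docker run -d --name svc --memory=32m --memory-swap=32m fedora:latest sleep infinity

# Actual (resident) usage, not virtual size
docker stats --no-stream svc
CID=$(docker inspect -f '{{.Id}}' svc)
cat /sys/fs/cgroup/system.slice/docker-$CID.scope/memory.current

# Early warning: a rising 'max' counter here precedes OOM kills
cat /sys/fs/cgroup/system.slice/docker-$CID.scope/memory.events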
Appendix: mem_eater.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
void sig_handler(int sig) {
printf("CAUGHT SIGNAL %d before OOM kill!\n", sig);
fflush(stdout);
}
int main(int argc, char *argv[]) {
int mb = atoi(argv[1]);
int touch = (argc > 2 && argv[2][0] == 't');
signal(SIGTERM, sig_handler);
signal(SIGUSR1, sig_handler);
printf("Mode: %s pages | Target: %d MB\n", touch ? "TOUCH" : "NO-TOUCH", mb);
fflush(stdout);
char **blocks = malloc(sizeof(char*) * mb);
int allocated = 0;
for (int i = 0; i < mb; i++) {
blocks[i] = malloc(1024 * 1024);
if (!blocks[i]) {
printf("malloc() FAILED at %d MB\n", i);
fflush(stdout);
break;
}
allocated++;
if (touch) {
memset(blocks[i], 'A', 1024 * 1024);
}
if (i % 10 == 9 || i == mb - 1) {
printf(" Allocated %d MB (virtual)%s\n", i + 1,
touch ? " [pages touched]" : " [pages NOT touched]");
fflush(stdout);
}
}
printf("Final: %d MB allocated. Sleeping...\n", allocated);
fflush(stdout);
sleep(60);
return 0;
}
Compiled with: gcc -static mem_eater.c -o mem_eater
Appendix: Demand Paging (Page Touch)
When malloc() returns a pointer, the kernel hasn’t allocated any physical RAM — it only creates a Virtual Memory Area (VMA), a bookkeeping entry that marks the address range as valid. Physical pages are allocated only when the process writes to (touches) that memory, triggering a page fault:
- malloc(1MB) — kernel creates a VMA and returns immediately. No RAM consumed.
- memset(ptr, 'A', 1MB) — CPU accesses the address, finds no physical page mapped, traps to the kernel (page fault).
- Kernel allocates a real 4KB physical page, maps it into the process’s page table, resumes execution.
- This repeats for every 4KB page within the block.
This lazy strategy is called demand paging — the kernel defers physical allocation until the last possible moment. Combined with vm.overcommit_memory=1 (the default on this host), the kernel never refuses a malloc(), which is why Test A allocated 256MB of virtual memory in a 32MB container without triggering the OOM killer. Test B, which called memset() to touch every page, forced real page faults that demanded physical RAM — hitting the cgroup limit and getting killed.
Appendix: OOM Score
When the OOM killer triggers, the kernel picks a victim using a two-part scoring system:
- Badness heuristic (0–1000): the kernel assigns each process a base score roughly proportional to the fraction of allowed memory it’s consuming. A process using half its allowed memory scores around 500. This base score is what drives the OOM killer’s selection.
- oom_score_adj (-1000 to +1000): an admin-tunable bias added to the badness score before the kernel makes its kill decision. Negative values protect a process; positive values make it a bigger target. Setting -1000 disables OOM killing entirely for that process — it will always report a badness score of 0.
- oom_score (shown in /proc/<pid>/oom_score): the final score displayed by the kernel, which already includes the oom_score_adj offset. This is what you see when inspecting a process, but the underlying calculation is badness + adjustment.
From the experiments:
| Process | oom_score | oom_score_adj | Why |
|---|---|---|---|
| Container process | 666 | 0 | No protection — first to die |
| dockerd | 336 | -500 | Docker protects itself; losing dockerd means losing all container management |
| systemd | 666 | 0 | High base score, but PID 1 has kernel-level OOM immunity regardless |
The effective kill priority: container processes first (high score, no adjustment) → systemd (protected as PID 1 by the kernel) → dockerd (score artificially lowered by its -500 adjustment). This ensures Docker infrastructure survives while container workloads get sacrificed.
References
- proc_pid_oom_score(5) — Linux manual page for /proc/<pid>/oom_score
- proc_pid_oom_score_adj(5) — Linux manual page for /proc/<pid>/oom_score_adj
- Overcommit Accounting — Kernel documentation on vm.overcommit_memory modes
- Memory Management Concepts — Kernel documentation on demand paging and virtual memory
- The /proc Filesystem — Kernel documentation on /proc entries, including OOM-related files
- Documentation for /proc/sys/vm/ — Kernel sysctl documentation for VM tunables
Footnotes
1. Address Space Layout Randomization — a security technique where the OS randomizes memory addresses (stack, heap, libraries) for each process, making it harder for attackers to predict where code and data reside. Reading /proc/<pid>/maps reveals the actual layout, defeating the randomization entirely.