- Single-host or orchestrated? Just `docker run` on one machine, or multi-host orchestration (Swarm/K8s)? → Focus on single-host runtime + registry. Mention orchestration as an evolution.
- Build, ship, or run, or all three? → All three: docker build (image creation), docker push/pull (registry), docker run (container lifecycle).
- Container type? Linux containers (namespaces/cgroups), Windows containers, or VMs? → Linux containers, built on the core isolation primitives.
- Registry: self-hosted (Docker Registry) or cloud (Docker Hub)? → Design both: the storage backend and the distribution protocol.
- Security model? Root vs rootless, image signing, vulnerability scanning? → In scope: namespace isolation, image trust. Out of scope: deep CVE scanning.
- Scale? How many images, how many running containers per host? → Registry: millions of images, billions of pulls/day. Runtime: ~100-500 containers per host.
| In Scope | Out of Scope |
|---|---|
| Container runtime (create, start, stop, exec) | Multi-host orchestration (Kubernetes) |
| Image building (Dockerfile → image) | CI/CD pipeline integration |
| Image registry (push, pull, storage) | CVE / vulnerability scanning |
| Container networking (bridge, host, overlay basics) | Service mesh (Istio, Linkerd) |
| Storage: volumes, layers, copy-on-write | Windows containers |
| Security: namespaces, cgroups, seccomp, image trust | Kubernetes CRI integration details |
- UC1 (docker build): A developer writes a Dockerfile → the system produces an immutable, layered image
- UC2 (docker push / pull): The image is uploaded to a registry and distributed to any host
- UC3 (docker run): The host creates an isolated process with its own filesystem, network, and resource limits
- UC4 (docker exec): Attach to a running container for debugging
- UC5 (docker stop / rm): Graceful shutdown (SIGTERM, then SIGKILL after a timeout), followed by resource cleanup
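
The SIGTERM-then-SIGKILL sequence in UC5 can be sketched with an ordinary child process. This is a minimal illustration, not Docker's implementation: `stop_gracefully` and its `grace_seconds` parameter are hypothetical names, and the sketch assumes a POSIX host.

```python
import signal
import subprocess
import sys

def stop_gracefully(proc: subprocess.Popen, grace_seconds: float = 10.0) -> str:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"          # the process exited within the grace period
    except subprocess.TimeoutExpired:
        proc.kill()                  # SIGKILL cannot be caught or ignored
        proc.wait()                  # reap the child so it does not linger as a zombie
        return "killed"

# A child that ignores SIGTERM, to exercise the SIGKILL fallback.
STUBBORN_CHILD = (
    "import signal, time;"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN);"
    "print('ready', flush=True);"
    "time.sleep(60)"
)

if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", STUBBORN_CHILD], stdout=subprocess.PIPE)
    child.stdout.readline()          # wait until the child has installed its handler
    print(stop_gracefully(child, grace_seconds=0.5))
```

The 10-second default mirrors the default grace period of docker stop; a well-behaved entrypoint handles SIGTERM and never reaches the SIGKILL branch.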
- Container startup <1 second: Containers must start near-instantly compared to VMs (tens of seconds to minutes). This is the core value proposition: lightweight process isolation.
- Image layer deduplication: 1000 containers sharing the same base image (Ubuntu 22.04) should NOT store 1000 copies. Copy-on-write (CoW) filesystem is essential.
- Hard isolation between containers: A compromised container must NOT be able to access another container's filesystem, network, or processes. Security boundary = defense in depth.
- Registry pull latency: Cold pull of a 500MB image should complete in <30 seconds. Incremental pull (only missing layers) in <5 seconds.
- Registry availability: If the registry is down, already-pulled images can still run. Registry is on the critical path for deployment, not for runtime.
- Resource limits must be enforced: A container configured for 512MB RAM must be OOM-killed if it exceeds this. No noisy neighbors: cgroup limits must be hard limits.
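
A quick back-of-the-envelope check of the pull targets above. The function names are illustrative, and "billions of pulls/day" is taken as exactly one billion, which is an assumption for the sake of the arithmetic.

```python
SECONDS_PER_DAY = 86_400

def required_throughput_mb_s(image_mb: float, deadline_s: float) -> float:
    """Sustained download rate needed to pull image_mb within deadline_s."""
    return image_mb / deadline_s

def average_rate_per_second(events_per_day: float) -> float:
    """Average event rate, ignoring daily peaks."""
    return events_per_day / SECONDS_PER_DAY

cold_pull = required_throughput_mb_s(500, 30)   # 500MB image, 30-second target
pull_rate = average_rate_per_second(1e9)        # assumed: 1 billion pulls/day

print(f"cold pull needs about {cold_pull:.1f} MB/s ({cold_pull * 8:.0f} Mbit/s)")
print(f"registry averages about {pull_rate:,.0f} pulls/s")
```

Roughly 17 MB/s per client and over ten thousand manifest requests per second on average: the first number argues for a CDN close to the client, the second for keeping the manifest path cheap.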
| Requirement | Decision | Why (and what was rejected) | Consistency |
|---|---|---|---|
| Sub-second container start | Linux namespaces + cgroups (not VMs) | No kernel boot. Process isolation via kernel primitives. VMs require hypervisor + guest OS boot (30-60s). Trade: weaker isolation (shared kernel). | N/A |
| 500 containers share one base image | OverlayFS (copy-on-write layers) | Layers are immutable and shared. Only writes create new data. AUFS is legacy, devicemapper is slow on metadata. OverlayFS is mainline kernel. | N/A |
| Image identity must be tamper-proof | Content-addressable storage (SHA-256) | Image ID = hash of content. Changing 1 byte changes the hash. Tag-based identity is mutable ("latest" can point to anything). Hash-based is immutable. | CP |
| Registry must serve billions of pulls/day | Blob storage (S3) + CDN + layer dedup | Layers stored once in S3 regardless of how many images reference them. CDN caches popular layers at edge. DB-stored blobs can't scale to petabytes. | AP |
| Containers need isolated networking | Network namespaces + veth pairs + bridge | Each container gets its own IP, routing table, iptables. Bridge connects containers on same host. Shared host networking leaks isolation. | N/A |
| Resource limits must be hard (no noisy neighbor) | cgroups v2 for CPU, memory, I/O | Kernel-enforced limits. The OOM killer fires when memory.max is exceeded. cpu.max caps CPU time; cpu.weight shares it proportionally. Userspace enforcement is bypassable. | N/A |
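
To make the cgroups row concrete, here is a sketch of how user-facing limits such as "512m" of memory and 1.5 CPUs could translate into cgroup v2 interface files. `memory.max` and `cpu.max` are real cgroup v2 files; the helper names and the 100 ms default period are assumptions of this example, and nothing is written to /sys/fs/cgroup.

```python
def parse_memory(value: str) -> int:
    """Convert a size like '512m' or '2g' into bytes (binary units)."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    suffix = value[-1].lower()
    if suffix in units:
        return int(float(value[:-1]) * units[suffix])
    return int(value)  # a plain number is already in bytes

def cgroup_v2_settings(memory: str, cpus: float, period_us: int = 100_000) -> dict:
    """Return the file contents a runtime would write for these limits."""
    quota_us = int(cpus * period_us)  # CPU time allowed per scheduling period
    return {
        "memory.max": str(parse_memory(memory)),  # hard cap; exceeding it triggers the OOM killer
        "cpu.max": f"{quota_us} {period_us}",     # format: "<quota> <period>" in microseconds
    }

print(cgroup_v2_settings("512m", 1.5))
```

The point of the sketch is that both values are hard ceilings enforced by the kernel, not hints to a userspace scheduler.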
Docker CLI CLIENT
- Parses commands, sends REST calls to daemon via Unix socket
- Handles build context (tar + send to daemon)
- Streams logs, attach, exec via HTTP hijack
Docker Daemon (dockerd) DAEMON
- REST API server on Unix socket /var/run/docker.sock
- Manages images, containers, networks, volumes
- Delegates container lifecycle to containerd
containerd RUNTIME
- Container lifecycle management (create, start, stop)
- Image pull, unpack, snapshot management
- gRPC API, Kubernetes CRI-compatible
runc OCI RUNTIME
- Creates namespaces, cgroups, mounts
- Forks container init process, then exits
- OCI Runtime Spec compliant (replaceable: crun, gVisor, Kata)
containerd-shim SHIM
- Reparents container process after runc exits
- Keeps STDIO open for logs and exec
- Allows containerd/dockerd restart without killing containers
Registry DISTRIBUTION
- OCI Distribution Spec: manifest + blob storage
- Content-addressable: blobs keyed by SHA-256
- S3 backend for blobs, PostgreSQL for tag→digest mapping
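
The "blobs keyed by SHA-256" idea can be sketched with an in-memory store. `BlobStore` is a hypothetical stand-in for the S3-backed blob layer, not a real registry API.

```python
import hashlib

def sha256_digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()

class BlobStore:
    """In-memory content-addressable store: the key is the hash of the value."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = sha256_digest(data)
        self._blobs.setdefault(digest, data)  # identical content is stored only once
        return digest

    def get(self, digest: str) -> bytes:
        data = self._blobs[digest]
        if sha256_digest(data) != digest:     # verify on read: tampering changes the hash
            raise ValueError("blob content does not match its digest")
        return data

    def __len__(self) -> int:
        return len(self._blobs)

store = BlobStore()
first = store.put(b"ubuntu base layer")
second = store.put(b"ubuntu base layer")      # a second image pushing the same layer
print(first == second, len(store))
```

Because the address is derived from the content, deduplication needs no extra bookkeeping: pushing the same layer from a thousand images still yields one stored blob.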
| Namespace | What It Isolates | Why It Matters |
|---|---|---|
| PID | Process tree | Container sees only its own processes. PID 1 inside = entrypoint. Host PID might be 48291. |
| NET | Network stack | Container gets its own IP, routing table, iptables. Can't see host interfaces. |
| MNT | Filesystem mounts | Container has its own root filesystem (from image layers). Can't see host /etc/passwd. |
| UTS | Hostname | Container has its own hostname. `hostname` returns container ID, not host. |
| IPC | System V IPC, POSIX MQ | Shared memory segments are per-container. Prevents cross-container IPC leakage. |
| USER | UID/GID mapping | With user-namespace remapping enabled, root inside the container (UID 0) maps to an unprivileged host UID (e.g. 100000). The basis for rootless containers. |
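
The USER namespace row can be illustrated by translating a container UID through a mapping in the `/proc/<pid>/uid_map` format (inside ID, outside ID, range length). The function name and the sample mapping are assumptions of this sketch.

```python
def map_to_host_uid(container_uid: int, uid_map: str) -> int:
    """Translate a UID inside a user namespace to the UID the host sees."""
    for line in uid_map.strip().splitlines():
        inside, outside, length = (int(field) for field in line.split())
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    raise ValueError(f"UID {container_uid} is not mapped in this namespace")

# Assumed mapping: container UIDs 0-65535 map onto host UIDs 100000-165535.
SAMPLE_UID_MAP = "0 100000 65536"

print(map_to_host_uid(0, SAMPLE_UID_MAP))     # container root
print(map_to_host_uid(1000, SAMPLE_UID_MAP))  # an ordinary user in the container
```

With such a mapping, a process that believes it is root holds only the privileges of an unprivileged host account.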
The docker exec command works by calling setns() to join an existing container's namespaces. It doesn't create a new container; it attaches a new process to the same isolation boundary. This is why exec'd processes see the same filesystem and network as the container's main process.

| Concept | What It Is | Key Property |
|---|---|---|
| Layer | Filesystem diff (tar.gz) from one instruction | Immutable, content-addressed (SHA-256). Shared across images. |
| Image Manifest | JSON listing layer digests + config | The "recipe": tells the runtime which layers to stack in what order. |
| Image Config | JSON with env vars, entrypoint, exposed ports | Runtime metadata. Not a layer; doesn't contain filesystem data. |
| Tag | Human-readable pointer (e.g., "nginx:1.25") | MUTABLE: "latest" can point to different digests over time. Not trustworthy for pinning. |
| Digest | SHA-256 of the manifest (e.g., sha256:abc123...) | IMMUTABLE: changing 1 byte changes the digest. Use for production pinning. |
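
The tag-versus-digest distinction in the table can be sketched as two maps: one immutable (digest to manifest) and one mutable (tag to digest). This is a simplified model; in particular, hashing a canonical JSON serialization is an assumption standing in for the exact manifest bytes a real registry hashes, and the layer digests are placeholders.

```python
import hashlib
import json

def manifest_digest(manifest: dict) -> str:
    # Assumption: canonical JSON stands in for the exact uploaded manifest bytes.
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return "sha256:" + hashlib.sha256(payload).hexdigest()

manifests = {}  # digest -> manifest (entries never change)
tags = {}       # "repo:tag" -> digest (pointers may be moved)

def push(reference: str, manifest: dict) -> str:
    digest = manifest_digest(manifest)
    manifests[digest] = manifest
    tags[reference] = digest  # re-pushing a tag silently repoints it
    return digest

layer_a = "sha256:" + "a" * 64  # placeholder layer digests
layer_b = "sha256:" + "b" * 64

v1 = push("nginx:latest", {"layers": [layer_a]})
v2 = push("nginx:latest", {"layers": [layer_a, layer_b]})

print(tags["nginx:latest"] == v2)  # the tag moved to the new manifest
print(manifests[v1])               # the old digest still resolves to the old content
```

A deployment pinned to `v1` keeps getting the same image after the tag moves, which is the reason to pin production references by digest.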
When many images start with FROM ubuntu:22.04, the Ubuntu base layer is stored ONCE in the registry, regardless of how many images reference it. The layer's address IS its content hash, so deduplication is automatic. This is what makes Docker Hub viable: billions of pulls, but most of them skip layers that are already cached locally. A docker pull first downloads the manifest (small JSON), then checks each layer against the local store. Only missing layers are downloaded.

The build cache follows the same logic: if an instruction such as RUN apt-get install nginx hasn't changed AND its parent layer is the same, the cached layer is reused. This is why Dockerfiles should order instructions by change frequency: base OS first (changes rarely), dependencies next (change weekly), application code last (changes every commit). A well-ordered Dockerfile rebuilds in seconds because only the final COPY layer is new.

| Network Mode | How It Works | Use Case |
|---|---|---|
| bridge (default) | Container on docker0 bridge, NAT to host. Each gets 172.17.0.x IP. | Standard isolation. Containers communicate via bridge, reach internet via NAT. |
| host | No network namespace. Container shares host's network stack directly. | Maximum network performance (no NAT overhead). Zero isolation. |
| none | No networking. Only loopback interface. | Batch processing, security-sensitive workloads that should never access network. |
| overlay | VXLAN tunnels between hosts. Containers on different hosts get same virtual network. | Docker Swarm / multi-host networking. Containers on different machines communicate as if local. |
| macvlan | Container gets its own MAC address on physical network. | Legacy apps that need to appear as physical devices on the LAN. |
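
For the default bridge mode, address assignment can be sketched with the standard `ipaddress` module. `BridgeIPAM` is a hypothetical helper, not Docker's IPAM driver; it assumes the 172.17.0.0/16 subnet from the table, gives the first usable address to the bridge, and never reclaims addresses.

```python
import ipaddress

class BridgeIPAM:
    """Hands out addresses from the bridge subnet, one per container."""

    def __init__(self, subnet: str = "172.17.0.0/16"):
        self.network = ipaddress.ip_network(subnet)
        self._free = iter(self.network.hosts())
        self.gateway = next(self._free)  # first usable address belongs to the bridge itself
        self.assigned = {}

    def allocate(self, container_id: str):
        if container_id in self.assigned:  # idempotent for an existing container
            return self.assigned[container_id]
        address = next(self._free)
        self.assigned[container_id] = address
        return address

ipam = BridgeIPAM()
print("gateway:", ipam.gateway)
print("web:", ipam.allocate("web"))
print("db:", ipam.allocate("db"))
```

Each allocated address would sit on the container end of a veth pair, with the bridge address acting as the default gateway for NAT to the outside world.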
A volume is a host directory (/var/lib/docker/volumes/{name}/_data) mounted directly into the container, bypassing OverlayFS. This means: (1) I/O goes directly to the host filesystem (no CoW overhead), (2) data survives container deletion, (3) volumes can be shared between containers. Bind mounts are similar but mount an arbitrary host path, which is useful for development (mounting source code into the container).

| Data | Store | Why This Store |
|---|---|---|
| Image layers | OverlayFS (/var/lib/docker/overlay2) | Read-only, content-addressed by SHA-256. Shared across containers. Deduped on disk automatically. |
| Container writable layer | OverlayFS (upperdir, per-container) | Ephemeral. Copy-on-write. Deleted with container. Not for persistent data. |
| Volumes | Host filesystem (direct mount) | Persistent data (DBs, uploads). Bypasses OverlayFS for native I/O performance. Survives container lifecycle. |
| Registry blobs | S3 / object storage | Petabyte scale, content-addressed. CDN-friendly. Layer blobs are immutable, which suits object storage. |
| Registry metadata | PostgreSQL | Tag → digest mapping, repository permissions, user accounts. Relational with ACID for tag updates. |
| Container state | JSON on disk + containerd DB (bbolt) | Container config, status, restart policy. Local to host. Low volume, so no distributed DB is needed. |
- Multi-host orchestration (Kubernetes): Scheduling containers across a cluster. Pod abstraction (co-located containers), service discovery, rolling deploys, auto-scaling based on metrics.
- Image build optimization (BuildKit): Parallel builds (independent build stages execute concurrently), cache mounts (reuse pip/npm cache across builds), multi-stage builds to minimize final image size.
- Rootless containers: Run the entire Docker daemon as a non-root user. User namespaces map container root to unprivileged host UID. Eliminates the biggest attack vector (Docker socket = root access).
- WebAssembly (WASM) containers: WASM runtimes (WasmEdge) as an alternative to Linux containers. Microsecond startup, 1MB memory footprint, sandbox-by-default. Ideal for edge computing and serverless.
- Image streaming (lazy pulling): Start the container before the full image is downloaded. Pull layers on-demand as files are accessed (eStargz, Nydus). Reduces cold-start from 30 seconds to <2 seconds for large images.
- Supply chain security (SBOM + attestation): Every image includes a Software Bill of Materials listing every package. Signed attestations prove the image was built from a specific commit by a specific CI pipeline. SLSA framework compliance.
What's the difference between a container and a VM? When would you still choose a VM?
A container is a process isolated by kernel namespaces, sharing the host kernel. A VM runs a complete guest kernel on a hypervisor, with hardware-level isolation. Containers win on: startup speed (<1s vs 30-60s), density (500/host vs 50), image size (MBs vs GBs), and resource efficiency (shared kernel, shared base layers). VMs win on: isolation strength (a hardware boundary: a kernel exploit in a container can escape to the host, whereas in a VM it compromises only the guest kernel), running different OSes (Windows on a Linux host), and regulatory compliance (some security standards require hardware isolation). The practical rule: containers for your own code running in a trusted environment, VMs for untrusted multi-tenant workloads or when you need a different kernel version. The middle ground is Kata Containers or Firecracker: lightweight VMs that boot in <1 second, giving VM-level isolation with near-container performance.
Why does Docker need containerd AND runc? Why not just have the daemon create containers directly?
This is separation of concerns driven by hard operational requirements. Original Docker (pre-1.11) was monolithic: the daemon did everything. Problem: restarting the daemon (for upgrades) killed ALL running containers. Splitting into layers solved this: (1) runc creates the container process and exits. It's a short-lived CLI tool, not a daemon. (2) containerd-shim reparents the container process, so it survives daemon restarts. (3) containerd manages the lifecycle and talks to the shim over a gRPC-style API (ttrpc). (4) dockerd provides the user-facing API and builds on top of containerd. The result: with live restore enabled, you can upgrade dockerd without touching running containers. Kubernetes also benefits: it talks to containerd directly through the CRI, bypassing dockerd entirely. This is why Kubernetes removed dockershim (in v1.24): K8s never needed dockerd, just a CRI runtime such as containerd.
How does copy-on-write work in OverlayFS, and what are its performance implications?
OverlayFS has two directories: lowerdir (read-only image layers, stacked) and upperdir (writable, per-container). On read: the kernel checks upperdir first, then falls through the lower layers until the file is found. On write: if the file exists only in a lower layer, it's first copied up to upperdir (copy-on-write), then modified there. New files go directly to upperdir. On delete: a "whiteout" file is created in upperdir that hides the lower-layer file. Performance implications: reads are fast (the kernel caches the lookup path). The first write to an existing file is slow because the entire file must be copied up, even when only 1 byte changes; copy-up is per-file, not per-block. Subsequent writes to the same file are fast (it's already in upperdir). This means read-heavy containers do well, while containers that modify large files repeatedly (databases) should use a Docker volume, which bypasses OverlayFS entirely and gives native filesystem performance.
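
The read, copy-up, and whiteout rules above can be modeled with dictionaries standing in for directories. This is a toy model under stated assumptions (`OverlayView` and its methods are made-up names, and file contents are plain bytes), not how the kernel implements OverlayFS.

```python
WHITEOUT = object()  # marker that hides a lower-layer file

class OverlayView:
    def __init__(self, lower_layers):
        self.lower = lower_layers  # list of read-only dicts, topmost layer first
        self.upper = {}            # per-container writable layer
        self.copy_ups = 0

    def read(self, path: str) -> bytes:
        if path in self.upper:     # upperdir always wins
            if self.upper[path] is WHITEOUT:
                raise FileNotFoundError(path)
            return self.upper[path]
        for layer in self.lower:   # fall through the stacked lower layers
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def patch(self, path: str, offset: int, data: bytes) -> None:
        content = self.read(path)  # raises if the file is missing or whited out
        if path not in self.upper:
            self.copy_ups += 1     # the whole file is copied up, not just the changed bytes
        self.upper[path] = content[:offset] + data + content[offset + len(data):]

    def delete(self, path: str) -> None:
        self.read(path)            # must be visible to be deleted
        self.upper[path] = WHITEOUT

base = {"/etc/app.conf": b"port=80", "/bin/sh": b"ELF"}
view = OverlayView([base])
view.patch("/etc/app.conf", 5, b"81")  # first write triggers a copy-up
view.patch("/etc/app.conf", 5, b"82")  # second write hits upperdir directly
view.delete("/bin/sh")
print(view.read("/etc/app.conf"), view.copy_ups, base["/etc/app.conf"])
```

The lower layer is never modified, which is what lets many containers share it; the cost is that the first write to a large file pays for copying all of it.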
How would you design Docker Hub to handle billions of image pulls per day?
The key insight is that a "pull" is mostly blob (layer) downloads, and blobs are immutable, content-addressed data, which is ideal for a CDN. Architecture: (1) The client requests the manifest from the API server (PostgreSQL lookup: tag → digest → manifest). (2) The manifest lists layer digests. The client checks its local cache; for popular base images, most layers are often already present. (3) Missing layers are fetched via a CDN (CloudFront/Cloudflare). The CDN key IS the SHA-256 digest: same layer, same key, regardless of which image references it. (4) S3 stores all blobs. The CDN cache hit rate is extremely high because popular base images (python, node, ubuntu) are pulled millions of times. The PostgreSQL metadata DB is the likely bottleneck, so it is sharded by repository name. Rate limiting per user/IP prevents CI systems from overwhelming the registry. The massive deduplication (the ubuntu:22.04 base layer stored once, referenced by millions of images) is what makes the economics work.
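
Steps (2) and (3) of the pull flow, sketched client-side. The `pull` function, the dict-based local store, and the `fetch_blob` callback are assumptions of this example; a real client would fetch over HTTP from the registry or CDN.

```python
import hashlib

def sha256_digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()

def pull(manifest: dict, local_store: dict, fetch_blob) -> list:
    """Download only the layers missing locally; return the digests fetched."""
    downloaded = []
    for digest in manifest["layers"]:
        if digest in local_store:              # already cached: no network traffic
            continue
        data = fetch_blob(digest)              # stand-in for a CDN request keyed by digest
        if sha256_digest(data) != digest:      # verify content against its address
            raise ValueError(f"digest mismatch for {digest}")
        local_store[digest] = data
        downloaded.append(digest)
    return downloaded

# Fake remote registry holding three layers.
remote = {sha256_digest(b): b for b in (b"ubuntu base", b"python runtime", b"app code")}
manifest = {"layers": list(remote)}

base_digest = manifest["layers"][0]
local = {base_digest: remote[base_digest]}     # the base layer is already on this host

fetched = pull(manifest, local, remote.__getitem__)
print(len(fetched), "layers downloaded;", len(local), "layers now cached")
```

Because every blob is verified against its digest, the client does not need to trust the CDN edge that served it.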
A container running as root inside β is it root on the host? How do you secure this?
By default, yes: root in the container IS root on the host (UID 0). If a container escape exploit exists, the attacker has root on the host machine. This is Docker's biggest security criticism. Mitigations, roughly in order of impact: (1) User namespace remapping (userns-remap): configure Docker to map container UID 0 to an unprivileged host UID such as 100000. Now even a container escape lands as an unprivileged user. Rootless mode goes a step further and runs the daemon itself as a non-root user. This is one of the most effective security controls. (2) Don't run as root in the container: the Dockerfile should include `USER nonroot`. Most applications don't need root. (3) Drop capabilities: Docker drops many Linux capabilities by default (CAP_SYS_ADMIN, CAP_SYS_MODULE, etc.), so even root inside can't load kernel modules or mount filesystems. CAP_NET_RAW is still granted by default, so drop it explicitly (--cap-drop) if raw sockets aren't needed. (4) Seccomp profile: the default profile blocks a few dozen dangerous syscalls out of 300+. (5) Read-only filesystem (--read-only): prevents writes to the container filesystem entirely. (6) For truly untrusted code: use gVisor (intercepts syscalls in a userspace kernel) or Kata (lightweight VM). With defense in depth, an escape typically has to defeat several of these layers rather than just one.
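
Several of these mitigations are plain `docker run` flags. The helper below only assembles an argument list and does not invoke Docker; `hardened_run_args` and the UID/GID defaults are illustrative, while `--user`, `--cap-drop`, and `--read-only` are existing CLI flags.

```python
def hardened_run_args(image: str, uid: int = 1000, gid: int = 1000) -> list:
    """Build a docker run command line that applies the mitigations above."""
    return [
        "docker", "run", "--rm",
        "--user", f"{uid}:{gid}",  # (2) do not run as root inside the container
        "--cap-drop", "ALL",       # (3) drop every capability, including NET_RAW
        "--read-only",             # (5) container filesystem is not writable
        image,
    ]

args = hardened_run_args("nginx:1.25")
print(" ".join(args))
```

An application that needs one specific capability or a writable scratch directory would add it back explicitly, which keeps the exceptions visible in the command line.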