This page attempts to document the ins and outs of containers on Linux. It is aimed not only at programmers looking to implement
containers or use container-like features in their own code, but also at sysadmins and users who want a better handle on how containers work
'under the hood'.
If you are a user looking to know more, start with the FAQ section. If you are a programmer, the Implementations and Links sections are going to be the most useful. Sysadmins
are advised to read the Security and Networking sections and perhaps take a look at the different Implementations available.
Rather than taking an 'all or nothing' approach to containers (eg FreeBSD/Solaris/OpenVZ), native Linux container support allows you to unshare
specific resources from the host. These can be mixed and matched in various ways to produce interesting combinations for things such as testing
network setups, preventing information leakage (eg for shared hosting webservers) or testing out OS builds (eg from debootstrap). It can even be used
to provide a more complete fakeroot replacement. The resources that can be unshared are:
- UTS: Allows a different hostname for each container
- PID: Hides processes outside the namespace from processes in the namespace. Calling reboot() in a PID namespace performs a 'shutdown' of only the processes in that namespace (since Linux 3.4)
- MOUNT: Allows a group of processes to mount and unmount filesystems and not have these changes visible outside the namespace
- UID: Allows you to give processes root inside the namespace but have this mapped to a normal user when interacting with processes outside the namespace (eg accessing files)
- IPC: Allows you to have a separate space for IPC resources such as semaphores and locks
- NET: Allows processes to have their own networking stack with different interfaces, firewalls and routing tables
- SYSLOG: Shows only the kernel log messages that belong to the namespace you are in (eg dmesg)
- AUDIT: Allows a namespace to only see messages from the audit subsystem that apply to that namespace
- CGROUP: Allows a namespacing of cgroups giving you your own separate hierarchy
- AF_NET: A lightweight form of network namespaces centered around limiting which devices/addresses bind() calls bind to
There are also some additional proposed namespaces that are not yet in Linux:
- DEVICE: Allows 'hotplugging' of devices into a container, causing the right signals to be passed to the container to recognize device addition/removal
- LSM: Namespacing of Linux Security Modules, allowing a namespace to apply its own (eg SELinux) security policy that applies only to that namespace
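To make the mix-and-match idea concrete, here is a minimal Python sketch. It ORs together the clone(2)/unshare(2) flag for each requested namespace and lists the namespaces the current process belongs to by reading /proc/self/ns. Only namespaces that have a CLONE_NEW* flag in the kernel's <linux/sched.h> are included. The helper names (flags_for, current_namespaces) are made up for illustration, and nothing here actually unshares anything, so it runs without privileges.

```python
import os

# CLONE_NEW* values as defined in the kernel's <linux/sched.h>.
# Keys use the names from the list above (UID is the user namespace).
NAMESPACE_FLAGS = {
    "MOUNT": 0x00020000,   # CLONE_NEWNS
    "CGROUP": 0x02000000,  # CLONE_NEWCGROUP
    "UTS": 0x04000000,     # CLONE_NEWUTS
    "IPC": 0x08000000,     # CLONE_NEWIPC
    "UID": 0x10000000,     # CLONE_NEWUSER
    "PID": 0x20000000,     # CLONE_NEWPID
    "NET": 0x40000000,     # CLONE_NEWNET
}


def flags_for(names):
    """OR together the flags for the requested namespaces."""
    flags = 0
    for name in names:
        flags |= NAMESPACE_FLAGS[name.upper()]
    return flags


def current_namespaces(pid="self"):
    """Map namespace name -> identifier such as 'net:[4026531992]'.

    Returns an empty dict if /proc/<pid>/ns is not available.
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        entries = os.listdir(ns_dir)
    except OSError:
        return result
    for entry in entries:
        try:
            result[entry] = os.readlink(os.path.join(ns_dir, entry))
        except OSError:
            continue
    return result


if __name__ == "__main__":
    # Flags you would pass to unshare(2) for a PID + MOUNT combination.
    print(hex(flags_for(["PID", "MOUNT"])))
    for name, ident in sorted(current_namespaces().items()):
        print(name, ident)
```

Two processes are in the same namespace when the identifiers reported for them match, which makes this a quick way to check what a container runtime actually unshared.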
Containers are commonly thought of as a security mechanism, in much the same way that chroot is. This is the wrong way to think
about containers; it will not only lead you astray but may also lead to compromise. Namespaces (one part of containers) are an isolation mechanism
that can be used to prevent information from one namespace inadvertently leaking into another. They do not, however, prevent intentional leakage
(eg the same filesystem mounted in multiple containers at once).
One thing to consider with containers is that the Linux kernel is shared between all of them. A namespace-aware rootkit that compromises
one container will be able to infect other containers and do what it wants to them. Containers do not reduce the attack surface in this case
but instead give you multiple instances of that attack surface.
Unfortunately there is no 'one size fits all' security solution when implementing containers, and as such you will need to mix and match features
from multiple security subsystems in order to secure containers against attack.
Used together, they let you implement an overlapping security model that is still resilient should one of the
features be unavailable (eg seccomp, SELinux and cgroups can all be used to limit what device nodes can be created). The main ones, which together cover all bases, are listed below:
- Seccomp mode 2: This can be used to filter out individual syscalls or to filter based on the arguments passed to a syscall. The mknod syscall is a good example here: as there is no dynamic device support in the kernel yet, a static /dev can be populated and udev disabled, so mknod can be blocked. Another good example is mount, as this can be used in some situations to escape the container.
- SELinux: After evaluating multiple options in this space we found SELinux to be the clear winner. Its multi-category security can be used to prevent a compromised container from accessing the resources of another container should information from one container leak into another (this is surprisingly simple to set up).
- capabilities: While CAP_SYS_ADMIN does tend to be overpowered, many of the other capabilities do not make sense in the context of containers and can be disabled. A perfect example of this is CAP_MAC_OVERRIDE, as this would allow you to disable the SELinux or equivalent protections above.
- cgroups: While the resource-limiting cgroups are handy, the device cgroup is useful for restricting what a container can do to a device node. For those times where you want to hand out only read-only access to a device node but root in the container could change the permissions, this cgroup enforces a second level of checks that cannot be overridden from inside the container.
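As an illustration of the capabilities point above, the following Python sketch decodes the capability bounding set from /proc/<pid>/status (the CapBnd field) and shows the bit arithmetic behind removing capabilities from a mask. The capability numbers are those in the kernel's <linux/capability.h>. The helper names are hypothetical, and the code only computes masks: actually dropping capabilities requires prctl(PR_CAPBSET_DROP) or a library such as libcap, which is not shown here.

```python
import re

# Capability numbers from the kernel's <linux/capability.h>.
CAP_SYS_ADMIN = 21
CAP_MKNOD = 27
CAP_MAC_OVERRIDE = 32
CAP_MAC_ADMIN = 33


def parse_cap_mask(status_text, field="CapBnd"):
    """Extract a hex capability mask from the text of /proc/<pid>/status."""
    match = re.search(rf"^{field}:\s*([0-9a-fA-F]+)$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError(f"{field} not found")
    return int(match.group(1), 16)


def has_cap(mask, cap):
    """True if capability number `cap` is set in `mask`."""
    return bool((mask >> cap) & 1)


def without_caps(mask, caps):
    """Return `mask` with every capability in `caps` cleared."""
    for cap in caps:
        mask &= ~(1 << cap)
    return mask


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as handle:
            bounding = parse_cap_mask(handle.read())
    except (OSError, ValueError):
        bounding = None
    if bounding is not None:
        print("CAP_SYS_ADMIN in bounding set:", has_cap(bounding, CAP_SYS_ADMIN))
        print("CAP_MAC_OVERRIDE in bounding set:", has_cap(bounding, CAP_MAC_OVERRIDE))
        wanted = without_caps(bounding, [CAP_MAC_OVERRIDE, CAP_MAC_ADMIN, CAP_MKNOD])
        print("mask after dropping MAC and mknod capabilities: %016x" % wanted)
```

Running this inside and outside a container is a simple way to confirm that the runtime really removed the capabilities you expected it to.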
Namespaces under Linux have not been without their fair share of security vulnerabilities, most notably around the User namespaces feature, which
can be used to get around some of the restrictions placed on users. Below is a list of most of the CVEs that have been reported against the
Linux kernel, as well as some preemptive patches which highlight the potential security complications that namespaces can introduce.
- CVE-2010-0006: Linux prior to 2.6.32.4 contained an error in driver code that only manifests when network namespaces are turned on
- git:41c21e351e: Changing namespace mappings should require privileges
- CVE-2011-2189: Linux prior to 2.6.35 contained a denial of service (DoS) via network namespaces
- CVE-2013-1858: Linux prior to 3.8.3 allowed mismatched args passed to clone() to cause privilege escalation (More info available here)
- CVE-2013-1956: Linux prior to 3.8.6 was vulnerable to chrooting into an alternate (more privileged) namespace
- CVE-2013-4205: Linux prior to 3.10.6 was vulnerable to a memory leak when creating a User namespace that could cause a DoS
- CVE-2014-4014: In Linux prior to 3.14.8, filesystem capabilities were not namespace aware and could be used to raise privileges or escalate to outside a container
- CVE-2014-5206: Linux prior to 3.16.1 did not prevent removal of MNT_LOCK_READONLY on remount, allowing the restriction to be bypassed with the mount command
- CVE-2014-5207: Linux prior to 3.16.1 did not prevent removal of MNT_NODEV, MNT_NOSUID, MNT_NOEXEC and MNT_ATIME_MASK during a remount of a bind mount, allowing these restrictions to be bypassed with the mount command
- CVE-2014-8989: Linux prior to 3.17.4 allowed dropping of supplementary groups allowing ACL bypass by dropping the restricted groups from the active set
- CVE-2015-8709: Linux releases current as of 2016/01/22 contained a timing attack that allowed an attacker to ptrace processes entering a namespace before they can setuid to a UID that is valid for the container, potentially allowing escape from the container
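The 'fixed in' versions above lend themselves to a quick first-pass check of a host. The Python sketch below compares a kernel release string against a subset of the list, with the version numbers copied from the entries above. The function names are made up for illustration. Treat the result as a hint only: distribution kernels routinely backport fixes without changing the upstream version number, so a hit here does not prove a host is vulnerable.

```python
import platform
import re

# A subset of the list above: CVE -> first kernel release containing the fix,
# exactly as stated in that list.
FIXED_IN = {
    "CVE-2013-1858": (3, 8, 3),
    "CVE-2013-1956": (3, 8, 6),
    "CVE-2013-4205": (3, 10, 6),
    "CVE-2014-4014": (3, 14, 8),
    "CVE-2014-5206": (3, 16, 1),
    "CVE-2014-5207": (3, 16, 1),
    "CVE-2014-8989": (3, 17, 4),
}


def parse_release(release):
    """Turn '3.14.7-generic' into (3, 14, 7); missing parts become 0."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


def possibly_affected(release):
    """Return the CVEs whose fixed-in version is newer than `release`."""
    version = parse_release(release)
    return sorted(cve for cve, fixed in FIXED_IN.items() if version < fixed)


if __name__ == "__main__":
    running = platform.release()
    try:
        print(running, possibly_affected(running))
    except ValueError as error:
        print(error)
```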
Networking is one of the easiest ways to get started with namespaces under Linux. The iproute2 'ip' command has native namespace support (via 'setns') in its 'link'
commands, and a 'netns' command to create and destroy namespaces. This can be used to spin up environments for testing network topologies of
arbitrary complexity or create a process behind a virtual interface for
Below is a list of virtual networking features under Linux that can be used in conjunction with containers. For most simple uses, knowing
about Virtual Ethernet interfaces and bridging should suffice. If you are looking at more advanced
networking setups such as those detailed below then you may want to be familiar with all the options listed here.
- Virtual Ethernet Pipes: Easiest to think of as a virtual version of an ethernet or crossover cable that can join 2 containers
- Bridges: Implements a layer 2 switch in software letting all 'ports' (both real and virtual) communicate back and forth
- VLANs: Used to split a bridge/switch into multiple smaller switches or use a single interface to connect to multiple separate locations on a network without having multiple 'real' interfaces
- MACVLANs: Creates a new interface with a different MAC address that allows a container more direct access to a real ethernet interface
- VXLAN: VLANs on steroids, allows a VLAN to expand across datacenters
- Openflow: Software defined networking, sends the packet headers to userspace to discover where and how the packet should be sent. Allows for new and flexible switching/routing options
- Hyperspace: by folding and manipulating space in 6 dimensions one can send a signal to a remote host instantaneously. Downsides: no known implementations, Physics may not allow it
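As a starting point for the simplest of these building blocks, the sketch below generates the iproute2 commands for one network namespace joined to the host by a Virtual Ethernet pair. It is a Python helper that only builds the argument lists and prints them; it does not run anything, since the commands themselves need root. The function name, interface names and addresses are all placeholders chosen for illustration.

```python
import shlex


def veth_netns_commands(namespace, host_if, container_if, host_addr, container_addr):
    """Return the ip(8) invocations, as argv lists, for one namespace + veth pair."""
    in_ns = ["ip", "netns", "exec", namespace]
    return [
        # Create the namespace and a veth pair with both ends on the host.
        ["ip", "netns", "add", namespace],
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", container_if],
        # Move one end of the pair into the namespace.
        ["ip", "link", "set", container_if, "netns", namespace],
        # Address and bring up the host end.
        ["ip", "addr", "add", host_addr, "dev", host_if],
        ["ip", "link", "set", host_if, "up"],
        # Address and bring up the namespaced end, plus its loopback.
        in_ns + ["ip", "addr", "add", container_addr, "dev", container_if],
        in_ns + ["ip", "link", "set", container_if, "up"],
        in_ns + ["ip", "link", "set", "lo", "up"],
    ]


if __name__ == "__main__":
    commands = veth_netns_commands(
        "demo", "veth-host", "veth-demo", "10.0.0.1/24", "10.0.0.2/24"
    )
    for argv in commands:
        print(shlex.join(argv))
```

To apply the setup for real, each argv list could be passed to subprocess.run(argv, check=True) as root. Deleting the namespace with 'ip netns del demo' also removes the veth pair.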
To date most if not all Linux container setups have used the VETH + Bridge model detailed below. While this model
suffices for simple uses and scales up fairly well, there may be scenarios where a slightly different setup provides additional benefits, eg
multi-tenancy private networks.
- Raw VETH: This setup can be handy for small setups where only local networking is required (eg testing) or where you wish to
route/firewall connections from your containers to the internet. This involves having VETH pipes between the host and the containers.
Firewalling and routing are then done with the host's networking stack
- VETH + Bridge: An extension of the above. Instead of routing packets between the containers in the host's networking stack,
a bridge is used to virtually connect them together, creating a private network. If the host's real ethernet connection is added
to the bridge then all the containers connected to the bridge can access all the same machines the host can. This has the advantage of easy
networking setup in the containers via DHCP, compared to the above setup which is usually best served by explicitly setting up the IP
addresses on both ends of the pipe
- MACVLAN: Due to limitations with most networking cards, this setup is only suitable for a small number of containers.
Most cards have a hard limit on how many MAC addresses they can listen for on the network at a time, and once this number has been exceeded they
switch to software processing of ALL incoming packets regardless of whether they are destined for a MAC address on the system or not. The flip side
to this is that MACVLANs have the lowest processing overhead of any of the networking setups listed here, and as such if you have a small number
of containers pushing large numbers of packets (eg 10Gbit) then this may be something to consider. Good examples of this are where you have
multiple 'border routers' that are themselves containers and route packets to containers internally
- VLANs: There are times when a collection of containers needs its own private network so that it cannot see any other
servers on that network segment (eg isolating apps of different sensitivity or just grouping similar servers together). For this, VLANs can
be used in conjunction with the VETH + Bridge setup above. A VLAN interface is created on top of the main host network interface and,
instead of adding the main network interface to the bridge, the VLAN interface is added. This causes all traffic for the containers
on that bridge to go out the main network interface 'tagged' with a specific VLAN
- VXLAN: This setup is similar to the above but can allow you to drop the bridge part and instead have a VXLAN interface in
each container, massively simplifying the setup work. In addition VXLAN lends itself to easier migration compared to the VLAN setup above as
you do not need to ensure the VLAN is set up on the destination machine; you just move the container to the new destination and continue running.
Another advantage of this setup is the 24 bits worth of 'VLANs' compared to VLAN's 12 bits, allowing dramatically more private networks (in many
cases you don't even get to use the full 12 bits due to support lacking in the switches). The downside however is that you need a multicast
- OPENFLOW: Fully software defined networking, capable of forwarding packets sent to the old location on to the new location during
migration of nodes to a new host, and of re-configuring the network on the fly based on internal network usage. If you are at this stage it is
likely you do not need my help to tell you how to do networking. Example users: Google
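The VETH + Bridge and VLANs setups described above differ only in what gets attached to the bridge as the uplink. The Python sketch below shows that difference as generated iproute2 commands. As before it only builds argument lists and needs no privileges to run. The function name, the bridge and interface names, and the VLAN id are illustrative assumptions, and the container ends of the VETH pipes are assumed to exist already.

```python
import shlex


def bridge_commands(bridge, members, uplink=None, vlan_id=None):
    """Return ip(8) argv lists that build a bridge and attach interfaces to it.

    members: host-side veth interfaces, one per container.
    uplink:  the host's real interface, or None for a purely private network.
    vlan_id: if given, a tagged sub-interface of `uplink` joins the bridge
             instead of `uplink` itself.
    """
    if vlan_id is not None:
        if uplink is None:
            raise ValueError("a VLAN needs an uplink interface")
        if not 1 <= vlan_id <= 4094:
            raise ValueError("VLAN ids are 12 bits, with 0 and 4095 reserved")

    commands = [
        ["ip", "link", "add", "name", bridge, "type", "bridge"],
        ["ip", "link", "set", bridge, "up"],
    ]
    for member in members:
        commands.append(["ip", "link", "set", member, "master", bridge])

    if uplink is not None and vlan_id is not None:
        # VLAN setup: traffic leaving the bridge is tagged on the uplink.
        vlan_if = f"{uplink}.{vlan_id}"
        commands.append(
            ["ip", "link", "add", "link", uplink, "name", vlan_if,
             "type", "vlan", "id", str(vlan_id)]
        )
        commands.append(["ip", "link", "set", vlan_if, "master", bridge])
        commands.append(["ip", "link", "set", vlan_if, "up"])
    elif uplink is not None:
        # Plain VETH + Bridge: the real interface itself joins the bridge.
        commands.append(["ip", "link", "set", uplink, "master", bridge])
    return commands


if __name__ == "__main__":
    for argv in bridge_commands("br0", ["veth-c1", "veth-c2"], uplink="eth0", vlan_id=100):
        print(shlex.join(argv))
```

Calling bridge_commands("br0", ["veth-c1", "veth-c2"]) with no uplink gives the isolated private network case, and passing uplink="eth0" without a vlan_id gives the plain VETH + Bridge case.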
- lxcfs: Provides fake cgroups to namespaced programs and dummy /proc entries
- iproute2: Implements network namespaces and is a good way to get started
- LXC (git): One of the early implementations of containers (and most feature rich)
- systemd-nspawn: systemd chroot replacement with extra magic
- libvirt-lxc: Containers support in libvirt (not related to LXC mentioned above)
- lmctfy: Open source version of Google's containers implementation
- psd: Nice new single file C implementation
- pflask: Another simple C implementation with more features than psd
- contain: C implementation which is compatible with /etc/subuid for delegating uids to a container
- mbox: A novel approach to process containment using seccomp and ptrace
- firejail: Process containment and monitoring for security (LWN Review)
- criu: Checkpoint and restore in userspace, allows snapshotting of containers
- rocket: CoreOS's implementation of containers
- CoreOS's App Container spec: CoreOS's container spec (configs and filesystem images)
- Open Container spec: Linux Foundation's Open Container spec (supersedes appc)
- bocker: Short and simple container solution written in bash
- vagga: Daemonless containers written in rust
- oz: Restrict X11 apps using containers, seccomp and capabilities
- libct: Container management library
- omochabako: Simple toy container implementation
- beamwhale: Simple implementation written in Erlang
- Isolation: 1st Gen container program - Proof of concept
- Asylum: 2nd Gen container program - "Kitchen Sink"
- Hammerhead: 3rd Gen container program - Streamlined chroot replacement
- Asylum Deploy: Container startup 'beacon'
- Igor: Runs plugins based on Inotify/etcd events
- etcpy: Python library to talk to etcd
- Butter: Python library to interface to features of Linux such as fsnotify and inotify
- Linux VServers: Predates namespace support in Linux, appears to now use namespace features
- OpenVZ: Predates namespace support in Linux, appears to now use namespace features. Did a lot of the work implementing namespaces
- FreeBSD Jails: Container like support in FreeBSD (Predates Linux support)
- Solaris Containers: Container like support in Solaris (Predates Linux support)
- SmartOS: Uses Solaris zones and KVM to build a hypervisor for containers and VMs
- Mininet: Spin up large virtual networks for testing using namespaces
- Warden: Ruby containers implementation with namespaces backend
- Garden: Go container orchestration system
- Anbox: Containerized android running under linux
- What is a container?
A container is a collection of namespaces mixed in with cgroups, normal Linux networking and normal Linux security mechanisms. It allows you to 'host' multiple instances of userspace (the utilities you use and interact with every day) with only a single kernel in a similar manner to how vhosting on a webserver allows you to host multiple websites.
- So what's a namespace then?
A namespace is a specific type of resource that can be split up and partitioned, eg network interfaces, process IDs, user ids, the filesystem. A good example of this that everyone is familiar with is the 'chroot' command which allows you to present a subset of files on the filesystem to a set of processes and hide the other folders from it. All the children of the process you launch with chroot or its namespace equivalent will see exactly the same files/pids/interfaces as the process that was launched in the namespace.
- So which ones are useful then?
That depends on what you are doing.
If you are emulating 'old style' virtualization by taking a standard Linux install and containerizing it then the correct answer is 'all of them'.
If you need to emulate multiple nodes to pretend to be a network of machines then the NET namespace (and the MOUNT namespace if you need to mount tmpfs on a lock dir for some daemons) should be sufficient and allow you to reuse your existing host filesystem.
If you are bootstrapping new installs of a distro, PID and MOUNT namespaces should be sufficient (you don't want to do any UID translation and you want to ensure that shutting down the container does not shut down the host, hence the PID namespace).
If you are using this as a thin tool to do things like continuous integration against multiple distros without maintaining multiple runners or buildbots, then PID namespaces to hide other processes (and prevent errant processes from causing damage) and chroot to switch between distro images should be sufficient. If, however, your processes require root for testing then UID namespaces may prove to be a handy replacement, meaning you don't need to run the server process as root.
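To show what the UID translation mentioned above looks like in practice, here is a small Python sketch of the mapping that a User namespace applies. Each line of /proc/<pid>/uid_map has three fields: the first ID inside the namespace, the first ID outside it, and the length of the range. The helper names and the example range (container root mapped to host UID 100000, with 65536 IDs) are assumptions for illustration. Real ranges are typically delegated through /etc/subuid.

```python
def format_uid_map(mappings):
    """Render (inside_start, outside_start, length) tuples in uid_map syntax."""
    return "".join(
        f"{inside} {outside} {length}\n" for inside, outside, length in mappings
    )


def map_uid(inside_uid, mappings):
    """Translate a UID seen inside the namespace to the UID seen outside.

    Returns None when the UID falls outside every mapped range, which is
    the case where the kernel would report the overflow ('nobody') UID.
    """
    for inside, outside, length in mappings:
        if inside <= inside_uid < inside + length:
            return outside + (inside_uid - inside)
    return None


if __name__ == "__main__":
    # Hypothetical delegation: 65536 IDs starting at host UID 100000.
    mappings = [(0, 100000, 65536)]
    print(format_uid_map(mappings), end="")
    print("root in the container is host UID", map_uid(0, mappings))
    print("UID 1000 in the container is host UID", map_uid(1000, mappings))
```

With a mapping like this, a process can be root for the purposes of the test suite while any file it touches on the host is accessed as an ordinary unprivileged user.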