Secure Your Containers with this One Weird Trick

Did you know there is an option to drop Linux capabilities in Docker? Using the docker run --cap-drop option, you can lock down root in a container so that it has limited access within the container. Sadly, almost no one ever tightens the security on a container or anywhere else.

The Day After is Too Late

There’s an unfortunate tendency in IT to think about security too late. People only buy a security system the day after they have been broken into.

Dropping capabilities can be low hanging fruit when it comes to improving container security.

What are Linux Capabilities?

According to the capabilities man page, capabilities are distinct units of privilege that can be independently enabled or disabled.

The way I describe it is that most people think of root as being all powerful. This isn’t the whole picture, the root user with all capabilities is all powerful. Capabilities were added to the kernel around 15 or so years ago to try to divide up the power of root.

Originally the kernel allocated a 32-bit bitmask to define these capabilities. A few years ago it was expanded to 64. There are currently around 38 capabilities defined.

Capabilities are things like the ability to send raw IP packets, or bind to ports below 1024. When we run containers we can drop a whole bunch of capabilities before running our containers without causing the vast majority of containerized applications to fail.

Most capabilities are required to manipulate the kernel/system, and these are used by the container framework (docker), but seldom used by the processes running inside the container. However, some containers require a few capabilities, for example a container process needs capabilities like setuid/setgid to drop privileges. As with most things in the container world, we try to establish a compromise between security and the ability to get work done.

A few years ago the guys at grsecurity did some analysis of capabilities and found that a lot of them give you close to full access to the system.

Luckily we also use additional tools like SELinux, seccomp, and namespaces to protect the host system from the containers.

Bottom line: dropping more of the capabilities from your container is a good idea from a security point of view.

Note: When the container framework drops capabilities before starting a container, the processes inside of the container can not get them back, even if they execute a setuid application. For more information look for the section Capability Bounding Set in the capabilities man page.

What Docker gives by default

Let’s look at the default list of capabilities available to privileged processes in a docker container:

chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap

In the OCI/runc spec they are even more drastic only retaining, audit_write, kill, and net_bind_service and users can use ocitools to add additional capabilities. As you can imagine, I like the approach of adding capabilities you need rather than having to remember to remove capabilities you don’t.

Deep Dive into Capabilities

Lets look deeper into each of these remaining capabilities.


The man page describes chown as the ability to make arbitrary changes to file UIDs and GIDs.

This means that root can change the ownership or group of any file system object. If you are not running a shell within a container and not installing packages into the container, you should drop this capability.

I would make the argument this should never be needed in production. If you need to chown, allow the capability, do the work, then take it away.


The man page says that dac_override allows root to bypass file read, write, and execute permission checks. DAC is an abbreviation of “discretionary access control”.

This means a root capable process can read, write, and execute any file on the system, even if the permission and ownership fields would not allow it. Almost no apps need DAC_OVERRIDE, and if they do they are probably doing something wrong. There are probably less than ten in the whole distribution that actually need it. Of course the administrator shell could require DAC_OVERRIDE fixing bad permissions in the file system.

Steve Grubb, security standards expert at Red Hat, says that “nothing should need this. If your container needs this, it’s probably doing something horrible.”


According to the man page, fowner conveys the ability to bypass permission checks on operations that normally require the filesystem UID of the process to match the UID of the file. For example, chmod and utime, and excludes operations covered by cap_dac_override and cap_dac_read_search. Here’s more from the man page:

  • set extended file attributes (see chattr(1)) on arbitrary files;
  • set Access Control Lists (ACLs) on arbitrary files;
  • ignore directory sticky bit on file deletion;
  • specify O_NOATIME for arbitrary files in open(2) and fcntl(2).

This is similar to DAC_OVERRIDE, almost no applications need this other than, potentially, software installation tools. Most likely your container would run fine without this capability. You might need to allow this for docker build but it should be blocked it when you run your container is production.


The man page says “don’t clear set-user-ID and set-group-ID mode bits when a file is modified; set the set-group-ID bit for a file whose GID does not match the filesystem or any of the supplementary GIDs of the calling process.”

My take: if you are not running an installation, you probably do not need this capability. I would disable this one by default.


If a process has this capability it can override the restriction that “the real or effective user ID of a process sending a signal must match the real or effective user ID of the process receiving the signal.”

This capability basically means that a root owned process can send kill signals to non root processes. If your container is running all processes as root or the root processes never kills processes running as non root, you do not need this capability. If you are running systemd as PID 1 inside of a container and you want to stop a container running with a different UID you might need this capability.

It’s probably also worth mentioning on the danger scale, this one is on the low end.


The man page says that the setgid capability lets a process make arbitrary manipulations of process GIDs and supplementary GID list. It can also forge GID when passing socket credentials via UNIX domain sockets or write a group ID mapping in a user namespace. See user_namespaces(7) for more information.

In short, a process with this capability can change its GID to any other GID. Basically allows full group access to all files on the system. If your container processes do not change UIDs/GIDs, they do not need this capability.


If a process has the setuid capability it can “make arbitrary manipulations of process UIDs (setuid(2), setreuid(2), setresuid(2), setfsuid(2)); forge UID when passing socket credentials via UNIX domain sockets; write a user ID mapping in a user namespace (see user_namespaces(7)).”

A process with this capability can change its UID to any other UID. Basically, it allows full access to all files on the system. If your container processes do not change UIDs/GIDs always running as the same UID, preferably non root, they do not need this capability. Applications that that need setuid usually start as root in order to bind to ports below 1024 and then changes their UIDS and drop capabilities. Apache binding to port 80 requires net_bind_service, usually starting as root. It then needs setuid/setgid to switch to the apache user and drop capabilities.

Most containers can safely drop setuid/setgid capability.


Let’s look at the man page description: “Add any capability from the calling thread’s bounding set to its inheritable set; drop capabilities from the bounding set (via prctl(2) PR_CAPBSET_DROP); make changes to the securebits flags.”

In layman’s terms, a process with this capability can change its current capability set within its bounding set. Meaning a process could drop capabilities or add capabilities if it did not currently have them, but limited by the bounding set capabilities.


This one’s easy. If you have this capability, you can bind to privileged ports (e.g., those below 1024).

If you want to bind to a port below 1024 you need this capability. If you are running a service that listens to a port above 1024 you should drop this capability.

The risk of this capabilty is a rogue process interpreting a service like sshd, and collecting users passwords. Running a container in a different network namespace reduces the risk of this capability. It would be difficult for the container process to get to the public network interface


The man page says, “allow use of RAW and PACKET sockets. Allow binding to any address for transparent proxying.”

This access allows a process to spy on packets on its network. That’s bad, right? Most container processes would not need this access so it probably should be dropped. Note this would only affect the containers that share the same network that your container process is running on, usually preventing access to the real network.

RAW sockets also give an attacker the ability to inject scary things onto the network. Depending on what you are doing with the ping command, it could require this access.


This capability allows use of chroot(). In other words, it allows your processes to chroot into a different rootfs. chroot is probably not used within your container, so it should be dropped.


If you have this capability, you can create special files using mknod.

This allows your processes to create device nodes. Containers are usually provided all of the device nodes they need in /dev, the creation of device nodes is controlled by the device node cgroup, but I really think this should be dropped by default. Almost no containers ever do this, and even fewer containers should do this.


If you have this one, you can write a message to kernel auditing log. Few processes attempt to write to the audit log (login programs, su, sudo) and processes inside of the container are probably not trusted. The audit subsystem is not currently namespace aware, so this should be dropped by default.


Finally, the setfcap capability allows you to set file capabilities on a file system. Might be needed for doing installs during builds, but in production it should probably be dropped.

How can I drop these capabilities using Docker?

So, how can we drop these capabilities using docker? First, let’s see what capabilities a process has. There is a cool tool in Linux that can help you view what capabilities a process has called pscap, available in the libcap-ng-utils package on Fedora.

Here’s a sample output using pscap | head -10:

ppid  pid   name        command         capabilities
1   1082  root      systemd-journal   chown, dac_override, dac_read_search, fowner, setgid, setuid, sys_ptrace, sys_admin, audit_control, mac_override, syslog, audit_read
1   1116  root      systemd-udevd   full
1   1760  root      auditd          full
1760  1778  root        audispd         full
1   1812  root      mcelog          full
1   1815  root      bluetoothd      net_bind_service, net_admin
1   1816  root      ModemManager    full
1   1817  root      systemd-logind  chown, dac_override, dac_read_search, fowner, kill, sys_admin, sys_tty_config, audit_control, mac_admin
1   1818  root      rngd            full

Here are the capabilities of a normal container running:

#  docker run -d fedora sleep 5 >/dev/null; pscap | grep sleep
26358 26375 root        sleep           chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap

If I wanted to drop setfcap, audit_write, and mknod, I could use --cap-drop=setfcap --cap-drop=audit_write --cap-drop=mknod:

#  docker run -d --cap-drop=setfcap --cap-drop=audit_write --cap-drop=mknod fedora sleep 5 > /dev/null; pscap | grep sleep
26555 26571 root        sleep           chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot

Better yet, if you know your container only needs setuid and setgid, you can drop all capabilities and just add setgid and setuid back in.

#  docker run -d --cap-drop=all --cap-add=setuid --cap-add=setgid fedora sleep 5 > /dev/null; pscap | grep sleep
26767 26783 root        sleep           setgid, setuid

You can even use Container Labels and the [atomic run]( command to define the default run command which your container should run with.

# cat Dockerfile
FROM fedora
LABEL RUN /usr/bin/docker run -d --cap-drop=all --cap-add=setuid --cap-add=setgid  \${IMAGE} sleep 10
# docker build -t sleep . >/dev/null
# atomic run  --quiet sleep > /dev/null; pscap | grep sleep
32119 32135 root        sleep           setgid, setuid

Bottom Line

You are probably running containers with a lot more privileges than they need. Dropping these capabilities when the containers are in production would be a great idea.

  1. How about not legitimizing clickbait by using their annoying title formats. I want to pass this article onto colleagues and professional peers, but I also want them to correspond with me in the future. And I am afraid that sending them articles like this will cause my emails to end up in their SPAM folder or even worse, to have them believe I have become a mindless zombie that sends them clickbait on purpose.

      1. I respect the artistic choice. But the tongue-in-cheek reference is not clear for the receiving party. Also is it hard to find in the article itself. Rather I only now this because of your reply. So this is quite a good article with a (sorry) really bad title and have to agree with Ryan on it’s ability to be passed on or gain track.

  2. Instead of installing pscap, you can execute this command to view the capabilities
    docker ps –quiet | xargs docker inspect –format ‘{{ .Id }}: CapAdd={{ .HostConfig.CapAdd }} CapDrop={{ .HostConfig.CapDrop }}’

  3. Why do we need to drop these privileges if the container is namespaced? Presumably all of these privileges only extend to the namespaced boundary. Why does it matter if the only process in my container can chroot?

    Are you just proposing defense in depth (a fine proposal), or is there another reason? The article seems to skip the justification for these changes.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s