Architecting Containers Part 1: Why Understanding User Space vs. Kernel Space Matters

Perhaps you’ve been charged with developing a container-based application infrastructure. If so, you most likely understand the value that containers can provide to your developers, architects, and operations team. In fact, you’ve likely been reading up on containers and are excited about exploring the technology in more detail. However, before diving head-first into a discussion about the architecture and deployment of containers in a production environment, there are three important things that developers, architects, and systems administrators need to know:

  1. All applications, inclusive of containerized applications, rely on the underlying kernel
  2. The kernel provides an API to these applications via system calls
  3. Versioning of this API matters as it’s the “glue” that ensures deterministic communication between the user space and kernel space

While containers are sometimes treated like virtual machines, it is important to note that, unlike virtual machines, the kernel is the only layer of abstraction between programs and the resources they need to access. Let’s see why.

All processes make system calls:

[Figure: User Space vs. Kernel Space - Simple User Space]

As containers are processes, they also make system calls:

[Figure: User Space vs. Kernel Space - Simple Container]

OK, so you understand what a process is, and that containers are processes, but what about the files and programs that live inside a container image? These files and programs make up what is known as user space. When a container is started, a program is loaded into memory from the container image. Once the program in the container is running, it still needs to make system calls into kernel space. The ability for the user space and kernel space to communicate in a deterministic fashion is critical.

User Space

User space refers to all of the code in an operating system that lives outside of the kernel. Most Unix-like operating systems (including Linux) come pre-packaged with all kinds of utilities, programming languages, and graphical tools – these are user space applications. We often refer to this as “userland.”

Userland applications can include programs that are written in C, Java, Python, Ruby, and other languages. In a containerized world, these programs are typically delivered in a container image format such as Docker. When you pull down and run a Red Hat Enterprise Linux 7 container image from the Red Hat Registry, you are utilizing a pre-packaged, minimal Red Hat Enterprise Linux 7 user space which contains utilities such as bash, awk, grep, and yum (so that you can install other software).

docker run -i -t rhel7 bash

All user programs (containerized or not) function by manipulating data, but where does this data live? This data can come from registers in the CPU and external devices, but most commonly it is stored in memory and on disk. User programs get access to data by making special requests to the kernel called system calls. Examples include allocating memory (variables) or opening a file. Memory and files often store sensitive information owned by different users, so access must be requested from the kernel through system calls.
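
To make this concrete, here is a minimal C sketch (my own illustration, not from the article's diagrams) in which both the memory and the file contents must be requested from the kernel via system calls; the file path /etc/hosts is just an example:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    /* malloc() ultimately asks the kernel for memory via brk(2)/mmap(2) */
    char *buf = malloc(4096);
    if (buf == NULL)
        return 1;

    /* open() is a thin glibc wrapper around the open(2) system call */
    int fd = open("/etc/hosts", O_RDONLY);
    if (fd == -1) {
        perror("open");
        free(buf);
        return 1;
    }

    /* read(2) copies data from the kernel into our user space buffer */
    ssize_t n = read(fd, buf, 4096);
    if (n > 0)
        write(STDOUT_FILENO, buf, n);  /* write(2): yet another system call */

    close(fd);
    free(buf);
    return 0;
}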

Kernel Space

The kernel provides abstraction for security, hardware, and internal data structures. The open() system call is commonly used to get a file handle in Python, C, Ruby, and other languages. You wouldn’t want your program to be able to make bit-level changes to an XFS file system, so the kernel provides a system call and handles the drivers. In fact, this system call is so common that it is part of the POSIX library.

Notice in the following drawing that bash makes a getpid() call which requests its own process identity. Also, notice that the cat command requests access to /etc/hosts with a file open() call. In the next article, we will dig into how this works in a containerized world, but notice that some code lives in user space, and some lives in the kernel.

[Figure: User Space vs. Kernel Space - Basic System Calls]

Regular user space programs invoke system calls all the time to get work done, for example (see the sketch after this list):

ls
ps
top
bash
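
As the drawing above shows, bash requests its own process ID with getpid(). Here is a minimal C sketch (my own, assuming a Linux system with glibc) that makes the same request twice: once through the usual glibc wrapper and once directly by syscall number, which is what the wrapper does under the hood:

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    /* The common route: glibc's wrapper around getpid(2) */
    printf("getpid() via glibc:     %ld\n", (long)getpid());

    /* The same request issued by raw syscall number */
    printf("getpid() via syscall(): %ld\n", (long)syscall(SYS_getpid));
    return 0;
}

Both lines print the same process ID.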


Some user space programs map almost directly to individual system calls, for example (a short sketch follows the list):

chroot
sync
mount/umount
swapon/swapoff
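
As a rough illustration (mine, not from the original post), the core of a utility like sync is little more than its namesake system call:

#include <unistd.h>

/* Approximately what the sync(1) utility does: one system call, then exit.
   The real coreutils implementation adds option parsing and error handling. */
int main(void) {
    sync();   /* sync(2): ask the kernel to flush filesystem buffers to disk */
    return 0;
}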


Digging one layer deeper, the following are some example system calls that are invoked by the programs listed above. Typically, these functions are called through libraries such as glibc, or through an interpreter such as Ruby, Python, or the Java Virtual Machine (a socket example follows the list):

open (files)
getpid (processes)
socket (network)
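
open(2) and getpid(2) appeared in the sketches above; here is a similar minimal example (again my own) for the network case, socket(2):

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

int main(void) {
    /* socket(2) asks the kernel to allocate a network endpoint; no packets
       are sent yet, but even this setup step requires the kernel's help */
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock == -1) {
        perror("socket");
        return 1;
    }
    printf("the kernel handed us socket file descriptor %d\n", sock);
    close(sock);
    return 0;
}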


A typical program gets access to resources in the kernel through layers of abstraction similar to the following diagram:

[Figure: User Space vs. Kernel Space - System Calls Gears]

To get a feel for what system calls are available in a Linux kernel, check out the syscalls man page. Interestingly, I am invoking this command on my Red Hat Enterprise Linux 7 laptop, but I am using a Red Hat Enterprise Linux 6 container image (aka user space) because I want to see system calls which were added/removed in the older kernel:

docker run -t -i rhel6-base man syscalls


SYSCALLS(2)                Linux Programmer’s Manual               SYSCALLS(2)

NAME
       syscalls - Linux system calls

SYNOPSIS
       Linux system calls.

DESCRIPTION
       The system call is the fundamental interface between an application
       and the kernel.
System call                 Kernel        Notes
------------------------------------------------------------------------------
_llseek(2)                  1.2
_newselect(2)
_sysctl(2)
accept(2)
accept4(2)                  2.6.28
access(2)
acct(2)
add_key(2)                  2.6.11
adjtimex(2)
afs_syscall(2)                            Not implemented
alarm(2)
alloc_hugepages(2)          2.5.36        Removed in 2.5.44
bdflush(2)                                Deprecated (does nothing) since 2.6
bind(2)
break(2)                                  Not implemented
brk(2)
cacheflush(2)               1.2           Not on i386


Notice from the man page that certain system calls (aka interfaces) have been added and removed in different versions of the kernel. Linus Torvalds et al. take great care to keep the behavior of these system calls well understood and stable. As of Red Hat Enterprise Linux 7 (kernel 3.10), there are 382 syscalls available. From time to time new system calls are added, and old system calls are deprecated; this should be considered when thinking about the lifecycle of your container infrastructure and the applications that will run within it.
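
One way to observe this version skew from user space is to invoke a newer system call by number and check for ENOSYS, the error the kernel returns for calls it does not implement. The sketch below is my own illustration and assumes your kernel headers define SYS_getrandom; getrandom(2) arrived in kernel 3.17, so it is missing from the 3.10 kernel discussed above:

#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    char buf[16];

    /* Issue getrandom(2) by number; old kernels answer with ENOSYS */
    long ret = syscall(SYS_getrandom, buf, sizeof(buf), 0);

    if (ret == -1 && errno == ENOSYS)
        printf("getrandom(2): not implemented by this kernel (ENOSYS)\n");
    else
        printf("getrandom(2): supported, returned %ld\n", ret);
    return 0;
}
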
Conclusion

There are some important takeaways that you need to understand about the user space and kernel space:

  1. Applications contain business logic, but rely on system calls.
  2. Once an application is compiled, the set of system calls that an application uses (i.e. relies upon) is embedded in the binary (in higher level languages, this is the interpreter or JVM).
  3. Containers don’t abstract the need for the user space and kernel space to share a common set of system calls.
  4. In a containerized world, this user space is bundled up and shipped around to different hosts, ranging from laptops to production servers.
  5. Over the coming years, this will create challenges.


Over time, it will be challenging to guarantee that a container built today will run on the container hosts of tomorrow. Imagine the year is 2024 (maybe we’ll finally have real hoverboards) and you still have a container-based application that requires a Red Hat Enterprise Linux 7 user space running in production. How can you safely upgrade the underlying container host and infrastructure? Will the containerized application run equally well on any of the latest and greatest container hosts available in the marketplace?

In Architecting Containers Part 2: Why the User Space Matters, we will explore how the user space / kernel space relationship affects architectural decisions and what you can do to minimize these challenges. In the meantime, if you have thoughts or questions, feel free to reach out using the comments section (below).

  1. This was a good jog into the inner workings of Linux processes. You conclude by reminding us of the challenges with the longevity of containerized applications, since they package the user space. I think I get that. I think you are cautioning us that a container that works today on my RHEL 7 kernel may not work on a RHEL 15 kernel in 2024. That is because the kernel on which RHEL 15 is based may not support the system calls being made by the user space programs in my container. Right?
    But it left me wondering: isn’t that the same challenge with “traditional apps” (non-container-based ones)? Say I build a Java app that targets the Java 8 VM. That Java app has a limited shelf life — determined by the backward compatibility of future JVMs.

    1. You nailed it! It’s just like being locked into a specific version of the JVM. The good news is that given the length of Red Hat Enterprise Linux life cycle, containers based upon Red Hat Enterprise Linux 7 user space have a lot of supported mileage ahead of them. The value is around the packaging.

      In my opinion, Docker images bring the equivalent of a layered WAR file and a standard protocol for storage (Docker registry) to any application stack (Ruby, Python, C, whatever). This simplifies polyglot deployment.

  2. This is a really well written article. Thank you.

    A big challenge with legacy applications in a VM paradigm is that you not only need the JVM or libraries from when the app was compiled, but you also need to maintain the entire OS. I can’t tell you how many times I’ve seen a business need for a legacy app that immediately introduces vulnerabilities on the network, just because the OS or Java version it needs is no longer supported.

    With the container model, wouldn’t you be able to write wrappers or facades in the container to intercept any deprecated kernel calls, and thus be able to ensure the longevity of the app FOREVER, without introducing any legacy-style vulnerabilities?

    1. Jim, thanks for the compliment, and very interesting thought. The gears in my brain immediately started going. I see three major layers to any application (containerized or not) – the kernel, the container image, and the network (where the app is accessed). An exploit can be found and mitigated at each layer. In a containerized world, I think HOW changes, but I don’t think WHAT is really affected.

      In a container world, the kernel is part of the container host. A vulnerable kernel can be swapped out by using the tools the container host provides. With RHEL7, this is RPM. With RHEL Atomic, this is Atomic updates. Furthermore, with a tool like kpatch, you might be able to do this live. In all cases, the longevity of this solution will be dependent on the lifecycle of the operating system (e.g. how long the OS is supported). If the vendor doesn’t provide an updated kernel, you would have to compile one yourself.

      In a containerized world, the user space is packaged up and shipped around as an image. The vulnerable JVM typically lives in the container image (user space). If the application required an unsupported JVM, that same JVM would need to be built/embedded in the container image. Once that container image was started, it would be exploitable from the network. The same is true of any user space vulnerabilities (think web servers, openssl, ssh, etc). All of these utilities live in the user space and would require an upgrade via a rebuild of the container image. You would still be limited by what versions were available for the user space (typically provided by the OS, e.g. RHEL7).

      In a containerized world, the network is pretty much the same. At some place there is a demarcation point between where your application lives and the external world. Many a company has built application layer firewalls (aka layer 7 firewalls) to try to mitigate problems like this. Some vendors can manipulate data structures and even block exploit code from ever reaching the application. In a containerized world, this is probably no different.

      To conclude, in a containerized world, theoretically, only the application and its immediate dependencies (this is often referred to as application virtualization) are exposed to the network. One might imagine that, typically, end users would not be given access to old containers with known CVEs, etc. Though this would require fairly complex management, and I think it might be years before we see any change to the security paradigm based on containers. Also, this isn’t terribly different from what PCI compliance strives to do in a current virtual and physical infrastructure.

      1. More of the same, maybe more depth. For example, the interface from the application is obviously the same, but through what mechanism does the OS know that the process is running inside a container? How does it hide other processes/resources from it? I imagine it’s something in the process’s metadata, but it’d be cool to have some details there.

        One detail I haven’t looked into so much is how networking works. I imagine it’s something like what VirtualBox would do: add virtual NICs and configure them. But then again, how does the OS restrict which processes can see/use these devices? Is it by process owner? But I’m sure there are cases where we want two containers with the same owner to have a different view of the interfaces (so we can simulate them living on different IP addresses).

  3. The perennity of applications – and their entire dependency stacks – is exactly the reason why we moved from the container world (Linux VServer at the time) to full virtualization (QEMU/KVM) and are not going back (at least not in any foreseeable future). Sure, unmaintained/unported applications (including their outdated dependency stacks and OSes) are a bag of security vulnerabilities which should not be allowed to run any longer. But the real world – ours, at least (academia and scientific research) – is full of unmaintained/unported applications (students/researchers come, go, and leave us with their legacy); there is no way around it (we just deal with the issues they raise as best we can).

    1. Cédric,
      I think your comment is quite interesting (and I wholeheartedly agree). There is a debate as to whether containers will end up being long lived like VMs, but I suspect they will.

      Some argue that automation and the automatic rebuilding of containers will somehow change this, but I am skeptical.

      That’s kind of the whole point of my blog series: just because you break the operating system into two parts doesn’t fundamentally change 1. the things we need to worry about or 2. the way we will use it (a la containers).

      Now, to address your point from another angle. Even with virtualization, there is a limit to how long a virtual machine can be supported, but we just don’t bump into that limit as much. I suspect VMs from ESX 2.X will run on 6.X, but there will come a point where old VMs will probably need to be upgraded or won’t be supported anymore (e.g. ESX 25).

      With containers, I suspect we will bump into this challenge more often, so VMs will still need to be used to mitigate the chance of your container no longer working.

      If you want to run an unsupported RHEL 7 container image on a RHEL 7 container host in a VM, go ahead, but be careful mixing and matching user spaces and kernels; I foresee compatibility issues, but I suspect it will be years before people really start to feel this pain.

  4. Great post! This issue will become even more important with NFV (network functions virtualization). Networking protocols evolve more slowly than applications, so just as much infrastructure runs on 10-15 year old Cisco routers and switches today, as we move to NFV there will be demands for containerized software versions of these to run long into the future.

  5. This is the article I’ve been looking for. There’s so much 30,000 ft. hype about containers that I can’t find any information on what’s going on under the hood. Excellent work!
