Containing Myself

Recently, for a variety of reasons, I got interested in how containers work in Linux. While my systems came with a version of the LXC container tools, I hadn't used containers for anything. But, a use-case appeared, and it piqued my interest. So I looked at containers and concluded that, if I were to use them, I would have to understand them first. And so, down the rabbit hole, I've gone.

The first thing that I discovered is that there is no such thing as a "container" in Linux. What the IT world calls a "container" is actually a confluence of several independent, but related, features implemented in the Linux kernel over the past couple of decades. So far, I've been able to sort them out into four technologies:

  1. Capabilities, which subdivide the powers of the root user,
  2. Namespaces, which allow processes to isolate or share their view of the system,
  3. Mounts, whose expanded powers permit the sharing or isolation of multiple views of mounted filesystems, and
  4. Cgroups, which provide a mechanism to limit the use of select system resources within groups of processes.

I don't intend to expound on the features or usage of these various technologies; I list them simply to show that, individually, they are each complex and arcane studies that together make up the simple technology called "containers".

My plan of attack is to study each of these technologies, and to prove my study by building my own "container" system. I've taken the first steps with a "toy" container that succeeds in giving me a (not very configurable) isolated, "contained" environment to play in. I call this implementation "toybox", and will write more about it and my learning as I go.

But, for now, I will have to contain myself.

So, I've been playing with my own container code, inspired by Brian Swetland's "mkbox" toy container (see ), and I've got it working pretty well.

My "toybox" container doesn't use cgroups (yet), and is still a very primitive implementation, but it does let me run a unique Linux environment within my existing Slackware host. I have tested it with a faux root filesystem and the busybox utility set, and am satisfied that it indeed provides a (very rudimentary) "container" environment. I intend to continue to work on this toybox, adding functionality and optimizing its implementation, as it allows me to learn about system components that I otherwise wouldn't have the opportunity to play with.

If you want to play with my toybox, you can get my code here.

I have too many projects on the go, and I've had slow progress with toybox. But, today I had a breakthrough.

Up until now, toybox only allowed for a completely self-contained program environment; all external programs had to reside within the toybox environment's directory structure. This meant that, at a minimum, I had to provide my own /bin/sh, etc., to toybox, and (ultimately) any other programs that I wanted to run within it. So, I populated the toybox root filesystem directory with a copy of busybox and all its utilities.

But, that's pretty limiting. If, for instance, I want to write code, I can't compile within toybox, because I have no resident compiler. I only have the tools that I put in, which means that I've more work to do in order to explore and expand the capabilities of my little toy container.

Last night, I had a brainstorm: What if I BIND mount my host environment's program and library directories into the toybox guest directory structure? If I BIND mount them READ-ONLY, I have little or no chance to damage my host environment, and (given the proper choice of directories) I can have all the tools present on my existing host system without the work.

So, that's what I did. I modified toybox to (on demand) READ-ONLY BIND mount /bin, /sbin, /lib, /lib64, and /usr (home of /usr/bin, etc.) to directories within the toybox environment. And (with some minor tweaking), it worked! To make it work properly, I had to seed /etc/ within the toybox environment to get dynamic loading working, but after that, root prompt and all the tools of home.
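The read-only bind mounts described above might look something like the following sketch. This is not the toybox source: the names guest_path(), bind_mount_readonly(), and share_host_dirs() are hypothetical, and the sketch assumes it runs inside the container's mount namespace before the root is switched, while the host paths are still visible. Because mount(2) ignores MS_RDONLY when a bind mount is first created, the read-only flag is applied with a second, remounting call.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>

/* Host directories named in the text. */
static const char *const shared_dirs[] = { "/bin", "/sbin", "/lib", "/lib64", "/usr" };

/* Build "<guest_root><host_dir>" in buf; returns -1 if it does not fit. */
int guest_path(char *buf, size_t size, const char *guest_root, const char *host_dir)
{
    assert(buf != NULL && guest_root != NULL && host_dir != NULL);
    if (strlen(guest_root) + strlen(host_dir) + 1 > size)
        return -1;
    snprintf(buf, size, "%s%s", guest_root, host_dir);
    return 0;
}

/* Bind mount src onto dst, then remount that bind mount read-only. */
int bind_mount_readonly(const char *src, const char *dst)
{
    if (mount(src, dst, NULL, MS_BIND | MS_REC, NULL) != 0) {
        perror("bind mount");
        return -1;
    }
    if (mount(NULL, dst, NULL, MS_BIND | MS_REMOUNT | MS_RDONLY, NULL) != 0) {
        perror("read-only remount");
        return -1;
    }
    return 0;
}

/* Hypothetical driver: share every listed host directory with the guest. */
int share_host_dirs(const char *guest_root)
{
    char dst[4096];

    for (size_t i = 0; i < sizeof shared_dirs / sizeof shared_dirs[0]; i++) {
        if (guest_path(dst, sizeof dst, guest_root, shared_dirs[i]) != 0)
            return -1;
        if (mkdir(dst, 0755) != 0 && errno != EEXIST) {
            perror(dst);
            return -1;
        }
        if (bind_mount_readonly(shared_dirs[i], dst) != 0)
            return -1;
    }
    return 0;
}
```

Two caveats on this sketch: the remount only changes the top-level bind mount, so any filesystems mounted beneath (say) /usr keep their own flags, and inside a user namespace the remount may be refused with EPERM unless flags inherited from the host mount (such as MS_NOSUID or MS_NODEV) are repeated in the call.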

So, allow me to present the next phase in toybox: toybox-20230824

I looked over my toybox code with an eye to simplify some of the logic. I had an idea that, should it pan out, would eliminate a couple of steps and make the container construction more durable. So, I wrote a one-off program to do some A/B testing between the toybox code as it stood, and my ideas for simplification. As I fiddled, I incorporated changes that worked back into the A side of my A/B test code, and (as it stands now) simplified the container construction a bit. Here's what worked:

We first perform some preparation by creating three subdirectories in our target directory. We need that target directory to contain:

  1. a subdirectory called "tmp",
  2. a subdirectory called "proc", and
  3. a subdirectory called "sys".
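The preparation step is small enough to sketch in a few lines of C. The function name prepare_target() and the 0755 mode are assumptions for illustration, not taken from toybox; an already existing subdirectory is treated as success.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Create tmp, proc and sys under the target directory. */
int prepare_target(const char *target)
{
    static const char *const subdirs[] = { "tmp", "proc", "sys" };
    char path[4096];

    assert(target != NULL);
    for (size_t i = 0; i < sizeof subdirs / sizeof subdirs[0]; i++) {
        int n = snprintf(path, sizeof path, "%s/%s", target, subdirs[i]);
        if (n < 0 || (size_t)n >= sizeof path) {
            errno = ENAMETOOLONG;
            return -1;
        }
        /* An existing directory is fine; anything else is an error. */
        if (mkdir(path, 0755) != 0 && errno != EEXIST) {
            perror(path);
            return -1;
        }
    }
    return 0;
}
```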

We start off our container build with a minimal unshare() to gain us our own user namespace, mount namespace, PID namespace and network namespace. In this version, we don't do anything with the network namespace; we simply need it so that our container can mount a sysfs.
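In C, that unshare() comes down to a single call with four flags. The sketch below uses hypothetical function names (container_namespace_flags() and enter_namespaces()) and leaves out anything toybox may do around the call.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* needed for unshare() and the CLONE_NEW* flags */
#endif
#include <sched.h>
#include <stdio.h>

/* The four namespaces named in the text: user, mount, PID and network. */
int container_namespace_flags(void)
{
    return CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET;
}

/* Detach the calling process into its own set of namespaces. */
int enter_namespaces(void)
{
    if (unshare(container_namespace_flags()) != 0) {
        perror("unshare");
        return -1;
    }
    return 0;
}
```

An unprivileged caller would normally also write /proc/self/uid_map and /proc/self/gid_map after this call in order to appear as root inside the new user namespace; that step is omitted from the sketch.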

Next, in our mount namespace, we change the propagation type of the (current) root mountpoint (and all those mounts under it) to PRIVATE, so that future host namespace mounts don't propagate into our mount namespace.

Now, we bind mount our target directory to itself, to satisfy the conditions of pivot_root(2), which requires that the "new root" (our target directory) be a mount point, and that it not reside on the same mount as our "old root" (the current root directory).

At this point, we can move our CWD into the directory that will become our container's root directory.

Now, we pivot_root(2), to move the (old) root filesystem onto our CWD's tmp directory, and make our CWD the (new) root filesystem.

For reasons, pivot_root(2) recommends that we "call chdir("/") immediately after pivot_root()".
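Taken together, the root-switching steps (private propagation, self bind mount, chdir, pivot_root, chdir("/")) could be sketched as below. The function name enter_new_root() is hypothetical, and the stat() check at the top is an addition for this sketch rather than something described in the text. glibc provides no pivot_root() wrapper, so the call goes through syscall(2).

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Switch the root filesystem to target; the old root lands on target/tmp. */
int enter_new_root(const char *target)
{
    struct stat st;

    assert(target != NULL);

    /* Sanity check (an addition for this sketch): target must be a directory. */
    if (stat(target, &st) != 0 || !S_ISDIR(st.st_mode)) {
        fprintf(stderr, "%s: not a usable directory\n", target);
        return -1;
    }
    /* Stop mount events from propagating between host and container. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0) {
        perror("make / private");
        return -1;
    }
    /* Bind mount target onto itself so that it becomes a mount point. */
    if (mount(target, target, NULL, MS_BIND | MS_REC, NULL) != 0) {
        perror("bind mount target");
        return -1;
    }
    if (chdir(target) != 0) {
        perror("chdir target");
        return -1;
    }
    /* CWD becomes the new root; the old root is moved onto ./tmp. */
    if (syscall(SYS_pivot_root, ".", "tmp") != 0) {
        perror("pivot_root");
        return -1;
    }
    if (chdir("/") != 0) {
        perror("chdir /");
        return -1;
    }
    return 0;
}
```

After a successful return, the old root filesystem is still reachable under /tmp until it is unmounted.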

We are working in our own PID namespace because pid_namespaces(7) tells us that

"if a new mount namespace is simultaneously created by including CLONE_NEWNS in the flags argument of ... unshare(2), then ... a new procfs instance can be mounted directly over /proc"

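So, we fork() here, and perform the rest of our work in the child process. The fork() is needed because unshare(2) with CLONE_NEWPID does not move the calling process into the new PID namespace; only its children land there. The first child gets PID 1 and becomes the init(8) process for our container.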
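A minimal sketch of that fork() follows. The names run_as_init() and demo_init() are hypothetical, and demo_init() is only a stand-in for the real container setup; it reports the PID it sees, which is 1 when the fork happens after unshare() with CLONE_NEWPID.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, run init_main() in the child, and return the child's exit status. */
int run_as_init(int (*init_main)(void))
{
    pid_t child;
    int status;

    assert(init_main != NULL);
    fflush(NULL);               /* don't duplicate buffered output in the child */

    child = fork();
    if (child < 0) {
        perror("fork");
        return -1;
    }
    if (child == 0)
        _exit(init_main());     /* the child is the container's "init" */

    if (waitpid(child, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Stand-in for the container's init work: report the PID the child sees. */
int demo_init(void)
{
    printf("init sees itself as PID %ld\n", (long)getpid());
    fflush(stdout);             /* _exit() does not flush stdio buffers */
    return 0;
}
```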

In the child ("init") process, we can now mount our /proc filesystem. Note that mount(2) wants a proc filesystem mounted already, before it will let us mount /proc. Fortunately, we still have access to the "old" proc, buried in the (current) /tmp directory structure.

Also in the child ("init") process, we can now mount our /sys filesystem.

And now, in the child ("init") process, we can unmount our old root filesystem.

Finally, we can mount a tmpfs over our /tmp directory.
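The last four steps, in the order given above, can be sketched as follows. The names fs_mount, mount_all(), and finish_container_mounts() are hypothetical. The mount flags (MS_NOSUID, MS_NODEV, MS_NOEXEC) and the lazy MNT_DETACH unmount are conventional choices assumed for this sketch, not details confirmed by the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mount.h>

struct fs_mount {
    const char *source;
    const char *target;
    const char *fstype;
    unsigned long flags;
};

/* Mount each entry in order; stop at the first failure. */
int mount_all(const struct fs_mount *table, size_t count)
{
    assert(table != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (mount(table[i].source, table[i].target, table[i].fstype,
                  table[i].flags, NULL) != 0) {
            perror(table[i].target);
            return -1;
        }
    }
    return 0;
}

/* Run in the container's init process, after pivot_root(). */
int finish_container_mounts(void)
{
    static const struct fs_mount pseudo[] = {
        { "proc",  "/proc", "proc",  MS_NOSUID | MS_NODEV | MS_NOEXEC },
        { "sysfs", "/sys",  "sysfs", MS_NOSUID | MS_NODEV | MS_NOEXEC },
    };
    static const struct fs_mount scratch =
        { "tmpfs", "/tmp", "tmpfs", MS_NOSUID | MS_NODEV };

    /* proc and sysfs go first, while the old root is still present on /tmp. */
    if (mount_all(pseudo, sizeof pseudo / sizeof pseudo[0]) != 0)
        return -1;
    /* Drop the old root filesystem that pivot_root() left on /tmp. */
    if (umount2("/tmp", MNT_DETACH) != 0) {
        perror("umount2 /tmp");
        return -1;
    }
    /* Give the container a fresh, empty /tmp. */
    return mount_all(&scratch, 1);
}
```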

At this point, our child ("init") process can now fork off daemons, and do all the other stuff that init(8) does.

With all this new knowledge, I revised my toybox.c container exploration toy. Subsequent testing has proven that the changes work, and I have written (most of) a viable unprivileged container for system paravirtualization. So, I give you the latest iteration: toybox-20230927. Enjoy!