#+comment: -*- org-src-preserve-indentation: t; org-babel-use-quick-and-dirty-noweb-expansion: t-*-
#+title: Linux containers in 500 lines of code
#+text: ...and 3000 lines of text.
#+comment: ^:nil means that _x_ won't be underscores...
#+options: ^:nil
#+toc: headlines 3
#+date: <2016-10-17 Mon>

I've used Linux containers [[https://circleci.com][directly]] and [[https://chromium.googlesource.com/chromium/src/%2B/master/docs/linux_sandboxing.md][indirectly]] for years, but I wanted to become more familiar with them. So I wrote some code. This used to be 500 lines of code, I swear, but I've revised it some since publishing; I've ended up with about 70 lines more.

I wanted specifically to find a minimal set of restrictions to run untrusted code. This isn't how you should approach containers on anything with any exposure: you should restrict everything you can. But I think it's important to know which permissions are categorically unsafe! I've tried to back up things I'm saying with links to code or people I trust, but [[mailto:_@lizzie.io][I'd love to know if I missed anything.]]

This is a [[https://www.cs.tufts.edu/~nr/noweb/][~noweb~]]-style piece of literate code. References named ~<<x>>~ will be expanded to the code block named ~x~. You can find the tangled source [[file:linux-containers-in-500-loc/contained.c][here]].

This document is an [[http://orgmode.org/][orgmode]] document; you can find its source [[https://blog.lizzie.io/linux-containers-in-500-loc.org][here]]. This document and this code are licensed under the GPLv3; you can find its text [[https://www.gnu.org/licenses/gpl-3.0.en.html][here]].

* Container setup

There are several complementary and overlapping mechanisms that make up modern Linux containers. Roughly,

+ ~namespaces~ are used to group kernel objects into different sets that can be accessed by specific process trees. For example, pid namespaces limit the view of the process list to the processes within the namespace.
  There are a couple of different kinds of namespaces. I'll go into this more later.
+ ~capabilities~ are used here to set some coarse limits on what uid 0 can do.
+ ~cgroups~ is a mechanism to limit usage of resources like memory, disk io, and cpu-time.
+ ~setrlimit~ is another mechanism for limiting resource usage. It's older than cgroups, but can do some things cgroups can't.

These are all Linux kernel mechanisms. Seccomp, capabilities, and ~setrlimit~ are all done with system calls. ~cgroups~ is accessed through a filesystem.

There's a lot here, and the scope of each mechanism is pretty unclear. They overlap a lot and it's tricky to find the best way to limit things.

User namespaces are somewhat new, and promise to unify a lot of this behavior. But unfortunately compiling the kernel with user namespaces enabled complicates things. Compiling with user namespaces changes the semantics of capabilities system-wide, which could cause more problems or at least confusion[fn:subverting-capabilities]. There have been a large number of privilege-escalation bugs exposed by user namespaces. [[https://www.nccgroup.trust/globalassets/our-research/us/whitepapers/2016/april/ncc_group_understanding_hardening_linux_containers-1-1.pdf]["Understanding and Hardening Linux Containers"]] explains

#+begin_quote
Despite the large upsides the user namespace provides in terms of security, due to the sensitive nature of the user namespace, somewhat conflicting security models and large amount of new code, several serious vulnerabilities have been discovered and new vulnerabilities have unfortunately continued to be discovered. These deal with both the implementation of user namespaces itself or allow the illegitimate or unintended use of the user namespace to perform a privilege escalation. Often these issues present themselves on systems where containers are not being used, and where the kernel version is recent enough to support user namespaces.
#+end_quote It's turned off by default in Linux at the time of this writing[fn:turned-off-in-linux], but many distributions apply patches to turn it on in a limited way[fn:distros-userns]. [fn:subverting-capabilities] [[https://medium.com/@ewindisch/linux-user-namespaces-might-not-be-secure-enough-a-k-a-subverting-posix-capabilities-f1c4ae19cad#.3lbw4loa7]["Linux User Namespaces Might Not Be Secure Enough"]] by Erica Windisch: #+begin_quote If a (real) root user has had the SYS_CAP_ADMIN capability removed, but then creates a user namespace, this capability is restored for the (fake) root user. That is, before creating the namespace, ‘mount’ would be denied, but following the creation of the user namespace, the ‘mount’ syscall would magically work again, albeit in a limited fashion. While limited in function, it’s significant enough that given a (real) root user and a kernel with user namespaces, Linux capabilities may be completely subverted. #+end_quote and [[http://man7.org/linux/man-pages/man7/user_namespaces.7.html][~man 7 user_namespaces~]] says: #+begin_quote The child process created by clone(2) with the CLONE_NEWUSER flag starts out with a complete set of capabilities in the new user namespace. #+end_quote and [[https://www.nccgroup.trust/globalassets/our-research/us/whitepapers/2016/april/ncc_group_understanding_hardening_linux_containers-1-1.pdf]["Understanding and Hardening Linux Containers"]] again #+begin_quote User namespaces also allows for ``interesting'' intersections of security models, whereas full root capabilities are granted to new namespace. This can allow CLONE_NEWUSER to effectively use CAP_NET_ADMIN over other network namespaces as they are exposed, and if containers are not in use. Additionally, as we have seen many times, processes with CAP_NET_ADMIN have a large attack surface and have resulted in a number of different kernel vulnerabilities. 
This may allow an unprivileged user namespace to target a large attack surface (the kernel networking subsystem) whereas a privileged container with reduced capabilities would not have such permissions. See Section 5.5 on page 39 for a more in-depth discussion on this topic.
#+end_quote

We can demonstrate this behavior (on a host with user namespaces compiled in) with

#+caption: ~subverting_networking.c~
#+include: "linux-containers-in-500-loc/subverting_networking.c" src C

#+begin_example
alpine-kernel-dev:~$ whoami
lizzie
alpine-kernel-dev:~$ ./subverting_networking
++ success!
alpine-kernel-dev:~$
#+end_example

but we're not actually that powerful.

#+caption: ~subverting_setfcap.c~
#+include: "linux-containers-in-500-loc/subverting_setfcap.c" src C

#+begin_example
alpine-kernel-dev:~$ whoami
lizzie
alpine-kernel-dev:~$ touch example
alpine-kernel-dev:~$ ./subverting_setfcap
++ cap_set_file failed: Operation not permitted
#+end_example

[fn:turned-off-in-linux] [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/init/Kconfig?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n1207][~init/Kconfig:1207@c8d2bc~]]

#+begin_src text
config USER_NS
	bool "User namespace"
	default n
	help
	  This allows containers, i.e. vservers, to use user namespaces
	  to provide different user info for different servers.

	  When user namespaces are enabled in the kernel it is
	  recommended that the MEMCG option also be enabled and that
	  user-space use the memory control groups to limit the amount
	  of memory a memory unprivileged users can use.

	  If unsure, say N.
#+end_src

[fn:distros-userns] Ubuntu switches ~CONFIG_USER_NS~ on, but patches it so that unprivileged use can be disabled with a sysctl, ~unprivileged_userns_clone~.
#+caption: [[http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/commit/?id%3D92e575e769cc50a9bfb50fb58fe94aab4f2a2bff][~92e575e769cc50a9bfb50fb58fe94aab4f2a2bff~]] #+begin_src diff commit 92e575e769cc50a9bfb50fb58fe94aab4f2a2bff Author: Serge Hallyn Date: Tue Jan 5 20:12:21 2016 +0000 UBUNTU: SAUCE: add a sysctl to disable unprivileged user namespace unsharing It is turned on by default, but can be turned off if admins prefer or, more importantly, if a security vulnerability is found. The intent is to use this as mitigation so long as Ubuntu is on the cutting edge of enablement for things like unprivileged filesystem mounting. (This patch is tweaked from the one currently still in Debian sid, which in turn came from the patch we had in saucy) Signed-off-by: Serge Hallyn [bwh: Remove unneeded binary sysctl bits] Signed-off-by: Tim Gardner #+end_src Debian has the same behavior: #+caption: [[https://anonscm.debian.org/git/kernel/linux.git/tree/debian/patches/debian/add-sysctl-to-disallow-unprivileged-CLONE_NEWUSER-by-default.patch][~debian/patches/debian/add-sysctl-to-allow-unprivileged-CLONE_NEWUSER-by-default.patch~]] #+begin_src diff From: Serge Hallyn Date: Fri, 31 May 2013 19:12:12 +0000 (+0100) Subject: add sysctl to disallow unprivileged CLONE_NEWUSER by default Origin: http://kernel.ubuntu.com/git?p=serge%2Fubuntu-saucy.git;a=commit;h=5c847404dcb2e3195ad0057877e1422ae90892b8 add sysctl to disallow unprivileged CLONE_NEWUSER by default This is a short-term patch. Unprivileged use of CLONE_NEWUSER is certainly an intended feature of user namespaces. However for at least saucy we want to make sure that, if any security issues are found, we have a fail-safe. Signed-off-by: Serge Hallyn [bwh: Remove unneeded binary sysctl bits] --- #+end_src Grsecurity disables it entirely for users without ~CAP_SYS_ADMIN~, ~CAP_SETUID~, and ~CAP_SETGID~. 
#+caption: https://grsecurity.net/test/grsecurity-3.1-4.7.9-201610200819.patch
#+begin_src diff
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -84,6 +84,21 @@ int create_user_ns(struct cred *new)
 	    !kgid_has_mapping(parent_ns, group))
 		return -EPERM;
+#ifdef CONFIG_GRKERNSEC
+	/*
+	 * This doesn't really inspire confidence:
+	 * http://marc.info/?l=linux-kernel&m=135543612731939&w=2
+	 * http://marc.info/?l=linux-kernel&m=135545831607095&w=2
+	 * Increases kernel attack surface in areas developers
+	 * previously cared little about ("low importance due
+	 * to requiring "root" capability")
+	 * To be removed when this code receives *proper* review
+	 */
+	if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) ||
+	    !capable(CAP_SETGID))
+		return -EPERM;
+#endif
#+end_src

and Arch Linux has it off.

#+caption: [[https://bugs.archlinux.org/task/36969][{linux} 3.13 add CONFIG_USER_NS]]
#+begin_src text
Comment by William Kennington (Webhostbudd) - Sunday, 06 October 2013, 03:55 GMT

I agree with Florian, allowing non-root users to take advantage of elevating themselves to a local root seems like a huge attack surface. Preferably this would be a sysctl with a huge warning attached to it when it is switched on.

Comment by Daniel Micay (thestinger) - Monday, 24 November 2014, 03:55 GMT

[...]

Arch doesn't add new features via patches. If you want to see this feature enabled, then land something like this upstream. Note that CONFIG_USER_NS is already enabled in the linux-grsec package because it fully removes the ability to have unprivileged user namespaces.
#+end_src

It would have been cool to include Red Hat's patches here, but I couldn't find them.


But all of these issues apply to hosts with user namespaces compiled in; it doesn't really matter whether we use user namespaces or not, especially since I'll be preventing nested user namespaces. So I'll only use a user namespace if one is available.

(The user-namespace handling in this code was originally pretty broken.
Jann Horn in particular gave great feedback. Thanks!) * ~contained.c~ This program can be used like this, to run ~/misc/img/bin/sh~ in ~/misc/img~ as ~root~: #+begin_example [lizzie@empress l-c-i-500-l]$ sudo ./contained -m ~/misc/busybox-img/ -u 0 -c /bin/sh => validating Linux version...4.7.10.201610222037-1-grsec on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.oQ5jOY...done. => trying a user namespace...writing /proc/32627/uid_map...writing /proc/32627/gid_map...done. => switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. / # whoami root / # hostname 05fe5c-three-of-pentacles / # exit => cleaning cgroups...done. #+end_example So, a skeleton for it: #+caption: ~contained.c~ #+begin_src C :tangle "linux-containers-in-500-loc/contained.c" :padline "no" :noweb tangle /* -*- compile-command: "gcc -Wall -Werror -lcap -lseccomp contained.c -o contained" -*- */ /* This code is licensed under the GPLv3. 
You can find its text here: https://www.gnu.org/licenses/gpl-3.0.en.html */ #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include struct child_config { int argc; uid_t uid; int fd; char *hostname; char **argv; char *mount_dir; }; <> <> <> <> <> <> int main (int argc, char **argv) { struct child_config config = {0}; int err = 0; int option = 0; int sockets[2] = {0}; pid_t child_pid = 0; int last_optind = 0; while ((option = getopt(argc, argv, "c:m:u:"))) { switch (option) { case 'c': config.argc = argc - last_optind - 1; config.argv = &argv[argc - config.argc]; goto finish_options; case 'm': config.mount_dir = optarg; break; case 'u': if (sscanf(optarg, "%d", &config.uid) != 1) { fprintf(stderr, "badly-formatted uid: %s\n", optarg); goto usage; } break; default: goto usage; } last_optind = optind; } finish_options: if (!config.argc) goto usage; if (!config.mount_dir) goto usage; <> char hostname[256] = {0}; if (choose_hostname(hostname, sizeof(hostname))) goto error; config.hostname = hostname; <> goto cleanup; usage: fprintf(stderr, "Usage: %s -u -1 -m . -c /bin/sh ~\n", argv[0]); error: err = 1; cleanup: if (sockets[0]) close(sockets[0]); if (sockets[1]) close(sockets[1]); return err; } #+end_src Since I'll be blacklisting system calls and capabilities, it's important to make sure there aren't any new ones. 
#+caption: =<>= = #+begin_src C :noweb-ref check-linux-version fprintf(stderr, "=> validating Linux version..."); struct utsname host = {0}; if (uname(&host)) { fprintf(stderr, "failed: %m\n"); goto cleanup; } int major = -1; int minor = -1; if (sscanf(host.release, "%u.%u.", &major, &minor) != 2) { fprintf(stderr, "weird release format: %s\n", host.release); goto cleanup; } if (major != 4 || (minor != 7 && minor != 8)) { fprintf(stderr, "expected 4.7.x or 4.8.x: %s\n", host.release); goto cleanup; } if (strcmp("x86_64", host.machine)) { fprintf(stderr, "expected x86_64: %s\n", host.machine); goto cleanup; } fprintf(stderr, "%s on %s.\n", host.release, host.machine); #+end_src (This had a bug. [[https://www.reddit.com/r/programming/comments/57x26h/linux_containers_in_500_lines_of_code/d8w07vf?context%3D3][captainjey on reddit let me know. Thanks!]]) And I wasn't quite at 500 lines of code, so I thought I had some space to build nice hostnames. #+caption: =<>= = #+begin_src C :noweb-ref choose-hostname int choose_hostname(char *buff, size_t len) { static const char *suits[] = { "swords", "wands", "pentacles", "cups" }; static const char *minor[] = { "ace", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "page", "knight", "queen", "king" }; static const char *major[] = { "fool", "magician", "high-priestess", "empress", "emperor", "hierophant", "lovers", "chariot", "strength", "hermit", "wheel", "justice", "hanged-man", "death", "temperance", "devil", "tower", "star", "moon", "sun", "judgment", "world" }; struct timespec now = {0}; clock_gettime(CLOCK_MONOTONIC, &now); size_t ix = now.tv_nsec % 78; if (ix < sizeof(major) / sizeof(*major)) { snprintf(buff, len, "%05lx-%s", now.tv_sec, major[ix]); } else { ix -= sizeof(major) / sizeof(*major); snprintf(buff, len, "%05lxc-%s-of-%s", now.tv_sec, minor[ix % (sizeof(minor) / sizeof(*minor))], suits[ix / (sizeof(minor) / sizeof(*minor))]); } return 0; } #+end_src ** Namespaces ~clone~ is the system 
call behind ~fork()~ et al. It's also the key to all of this. Conceptually we want to create a process with different properties than its parent: it should be able to mount a different ~/~, set its own hostname, and do other things. We'll specify all of this by passing flags to ~clone~ [fn:man-clone]. [fn:man-clone] Most of this section is cribbed from the example at the bottom of [[http://man7.org/linux/man-pages/man2/clone.2.html][~man 2 clone~]]. The child needs to send some messages to the parent, so we'll initialize a socketpair, and then make sure the child only receives access to one. #+caption: =<>= += #+begin_src C :noweb-ref namespaces if (socketpair(AF_LOCAL, SOCK_SEQPACKET, 0, sockets)) { fprintf(stderr, "socketpair failed: %m\n"); goto error; } if (fcntl(sockets[0], F_SETFD, FD_CLOEXEC)) { fprintf(stderr, "fcntl failed: %m\n"); goto error; } config.fd = sockets[1]; #+end_src But first we need to set up room for a stack. We'll ~execve~ later, which will actually set up the stack again, so this is only temporary.[fn:clone-stack-temporary] [fn:clone-stack-temporary] #+caption: ~clone_stack.c~ #+include: "linux-containers-in-500-loc/clone_stack.c" src C #+caption: ~show_stack.c~ #+include: "linux-containers-in-500-loc/show_stack.c" src C #+begin_example [lizzie@empress linux-containers-in-500-loc]$ ./clone_stack pre-execve, stack is ~0x7f3f98deefec post-execve, stack is ~0x7ffd14d2291c #+end_example The stack grows down on x86, so the fact that the address is higher numerically post-execve means that a new stack has been allocated. #+caption: =<>= += #+begin_src C :noweb-ref namespaces #define STACK_SIZE (1024 * 1024) char *stack = 0; if (!(stack = malloc(STACK_SIZE))) { fprintf(stderr, "=> malloc failed, out of memory?\n"); goto error; } #+end_src We'll also prepare the cgroup for this process tree. More on this later. 
#+caption: =<<namespaces>>= +=
#+begin_src C :noweb-ref namespaces
	if (resources(&config)) {
		err = 1;
		goto clear_resources;
	}
#+end_src

We'll namespace the mounts, pids, IPC data structures, network devices, and hostname / domain name. I'll go into these more in the code for capabilities, cgroups, and syscalls.

#+caption: =<<namespaces>>= +=
#+begin_src C :noweb-ref namespaces
	int flags = CLONE_NEWNS
		| CLONE_NEWCGROUP
		| CLONE_NEWPID
		| CLONE_NEWIPC
		| CLONE_NEWNET
		| CLONE_NEWUTS;
#+end_src

Stacks on x86, and almost everything else Linux runs on, grow downwards, so we'll add ~STACK_SIZE~ to get a pointer just below the end.[fn:pointer-addition-ub] We also ~|~ the flags with ~SIGCHLD~ so that we can ~wait~ on it.

[fn:pointer-addition-ub] I thought this might be undefined behavior, since ~stack + STACK_SIZE~ does point past the last item of the array, but point 8 of 6.5.6 [Additive operators] in [[http://www.iso-9899.info/n1570.html][ISO-9899]] has us covered:

#+begin_quote
If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.
#+end_quote

i.e., the pointer addition is valid, but dereferencing it wouldn't be.


#+caption: =<<namespaces>>= +=
#+begin_src C :noweb-ref namespaces
	if ((child_pid = clone(child, stack + STACK_SIZE, flags | SIGCHLD, &config)) == -1) {
		fprintf(stderr, "=> clone failed! %m\n");
		err = 1;
		goto clear_resources;
	}
#+end_src

Close and zero the child's socket, so that if something breaks then we don't leave an open fd, possibly causing the child or the parent to hang.
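
The hang is easy to reproduce in miniature. The following is only a sketch and not part of ~contained.c~: it uses ~fork~ as a stand-in for ~clone~ so that it runs without privileges, and the function name ~read_after_silent_child~ is made up for illustration. Once the parent has closed its copy of the child's end, a ~read~ on the parent's end returns 0 as soon as the child is gone, instead of blocking.

#+begin_src C
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not part of contained.c.
   Returns what read() gives the parent after a child that never
   writes has exited: 0 (end-of-file) on the expected path, -1 if
   the socketpair or fork could not be set up. */
int read_after_silent_child(void)
{
	int sockets[2] = {0};
	int msg = 0;
	if (socketpair(AF_LOCAL, SOCK_SEQPACKET, 0, sockets))
		return -1;
	pid_t pid = fork();
	if (pid == -1) {
		close(sockets[0]);
		close(sockets[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: give up the parent's end, then exit without writing. */
		close(sockets[0]);
		_exit(0);
	}
	/* Parent: drop our copy of the child's end. If we kept it open,
	   the read below would never see end-of-file, because we would
	   still be holding the peer open ourselves. */
	close(sockets[1]);
	ssize_t n = read(sockets[0], &msg, sizeof(msg));
	close(sockets[0]);
	waitpid(pid, NULL, 0);
	return (int) n;
}
#+end_src

Calling ~read_after_silent_child()~ should return 0 straight away. Removing the parent's ~close(sockets[1])~ is the interesting experiment: the ~read~ then has no reason to return.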
#+caption: =<>= += #+begin_src C :noweb-ref namespaces close(sockets[1]); sockets[1] = 0; #+end_src The parent process will configure the child's user namespace and then pause until the child process tree exits[fn:pidns-sigkill]. [fn:pidns-sigkill] I wasn't confident that ~waitpid~ was enough to wait for the process and all of its children, but when the root of a pid namespace closes, all of its children get ~SIGKILL~: [[http://man7.org/linux/man-pages/man7/pid_namespaces.7.html][~man 7 pid_namespaces~]]: #+begin_quote If the "init" process of a PID namespace terminates, the kernel terminates all of the processes in the namespace via a SIGKILL signal. This behavior reflects the fact that the "init" process is essential for the correct operation of a PID namespace. #+end_quote Also verified this myself, before I found that: #+caption: ~persistent_child.c~ #+include: "linux-containers-in-500-loc/persistent_child.c" src C #+begin_example [lizzie@empress l-c-i-500-l]$ touch persistent_child.log [lizzie@empress l-c-i-500-l]$ chmod 666 persistent_child.log [lizzie@empress l-c-i-500-l]$ sudo strace -f ./contained -m . -u 0 -c ./persistent_child execve("./contained", ["./contained", "-m", ".", "-u", "0", "-c", "./persistent_child"], [/* 15 vars */]) = 0 brk(NULL) = 0x605490 # ... [pid 736] clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x6b68d0) = 2 strace: Process 746 attached [pid 736] nanosleep({2, 0}, [pid 746] open("persistent_child.log", O_WRONLY|O_CREAT|O_APPEND, 0600) = 3 [pid 746] fstat(3, {st_mode=S_IFREG|0666, st_size=4, ...}) = 0 [pid 746] lseek(3, 0, SEEK_CUR) = 0 [pid 746] write(3, "0\n", 2) = 2 [pid 746] nanosleep({1, 0}, 0x3fee2d718d0) = 0 [pid 746] fstat(3, {st_mode=S_IFREG|0666, st_size=6, ...}) = 0 [pid 746] lseek(3, 0, SEEK_CUR) = 6 [pid 746] write(3, "1\n", 2) = 2 [pid 746] nanosleep({1, 0}, [pid 736] <... nanosleep resumed> 0x3fee2d718d0) = 0 [pid 736] exit_group(0) = ? 
[pid 746] +++ killed by SIGKILL +++ [pid 736] +++ exited with 0 +++ # ... #+end_example #+caption: =<>= += #+begin_src C :noweb-ref namespaces close(sockets[1]); sockets[1] = 0; if (handle_child_uid_map(child_pid, sockets[0])) { err = 1; goto kill_and_finish_child; } goto finish_child; kill_and_finish_child: if (child_pid) kill(child_pid, SIGKILL); finish_child:; int child_status = 0; waitpid(child_pid, &child_status, 0); err |= WEXITSTATUS(child_status); clear_resources: free_resources(&config); free(stack); #+end_src #+RESULTS: A process setting its own user namespace is pretty limited[fn:self-userns-limited], so the parent will wait until the child enters the user namespace, and then write a mapping to its ~uid_map~ and ~gid_map~. [fn:self-userns-limited] #+caption: [[http://man7.org/linux/man-pages/man7/user_namespaces.7.html][~man 7 user_namespaces~]] #+begin_src text In order for a process to write to the /proc/[pid]/uid_map (/proc/[pid]/gid_map) file, all of the following requirements must be met: 1. The writing process must have the CAP_SETUID (CAP_SETGID) capability in the user namespace of the process pid. 2. The writing process must either be in the user namespace of the process pid or be in the parent user namespace of the process pid. 3. The mapped user IDs (group IDs) must in turn have a mapping in the parent user namespace. 4. One of the following two cases applies: ,* Either the writing process has the CAP_SETUID (CAP_SETGID) capability in the parent user namespace. + No further restrictions apply: the process can make mappings to arbitrary user IDs (group IDs) in the parent user namespace. ,* Or otherwise all of the following restrictions apply: + The data written to uid_map (gid_map) must consist of a single line that maps the writing process's effective user ID (group ID) in the parent user namespace to a user ID (group ID) in the user namespace. 
+ The writing process must have the same effective user ID as the process that created the user namespace. + In the case of gid_map, use of the setgroups(2) system call must first be denied by writing deny to the /proc/[pid]/setgroups file (see below) before writing to gid_map. Writes that violate the above rules fail with the error EPERM. #+end_src #+caption: =<>= += #+begin_src C :noweb-ref child #define USERNS_OFFSET 10000 #define USERNS_COUNT 2000 int handle_child_uid_map (pid_t child_pid, int fd) { int uid_map = 0; int has_userns = -1; if (read(fd, &has_userns, sizeof(has_userns)) != sizeof(has_userns)) { fprintf(stderr, "couldn't read from child!\n"); return -1; } if (has_userns) { char path[PATH_MAX] = {0}; for (char **file = (char *[]) { "uid_map", "gid_map", 0 }; *file; file++) { if (snprintf(path, sizeof(path), "/proc/%d/%s", child_pid, *file) > sizeof(path)) { fprintf(stderr, "snprintf too big? %m\n"); return -1; } fprintf(stderr, "writing %s...", path); if ((uid_map = open(path, O_WRONLY)) == -1) { fprintf(stderr, "open failed: %m\n"); return -1; } if (dprintf(uid_map, "0 %d %d\n", USERNS_OFFSET, USERNS_COUNT) == -1) { fprintf(stderr, "dprintf failed: %m\n"); close(uid_map); return -1; } close(uid_map); } } if (write(fd, & (int) { 0 }, sizeof(int)) != sizeof(int)) { fprintf(stderr, "couldn't write: %m\n"); return -1; } return 0; } #+end_src The child process will send a message to the parent process about whether it should set uid and gid mappings. If that works, it will ~setgroups~, ~setresgid~, and ~setresuid~. Both ~setgroups~ and ~setresgid~ are necessary here since there are two separate group mechanisms on Linux[fn:setgroups-setresuid]. I'm also assuming here that every uid has a corresponding gid, which is common but not necessarily universal. 
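
To make the mapping concrete: the line ~handle_child_uid_map~ writes, ~0 10000 2000~, says that ids 0 through 1999 inside the namespace correspond to ids 10000 through 11999 in the parent namespace. Here is a small sketch of that arithmetic. The ~struct id_map~ type and both function names are invented for this illustration; the kernel's own bookkeeping is more involved.

#+begin_src C
#include <stdio.h>

/* Hypothetical type: one line of /proc/[pid]/uid_map. Ids in
   [inside, inside + count) within the namespace correspond to
   [outside, outside + count) in the parent namespace. */
struct id_map {
	unsigned inside;
	unsigned outside;
	unsigned count;
};

/* Formats the line in the "inside outside count" layout that
   handle_child_uid_map writes. Returns snprintf's result. */
int format_id_map(char *buf, size_t len, struct id_map map)
{
	return snprintf(buf, len, "%u %u %u\n", map.inside, map.outside, map.count);
}

/* Translates an id inside the namespace to the parent's id,
   or returns -1 if the id falls outside the mapped range. */
long map_to_parent(struct id_map map, unsigned id)
{
	if (id < map.inside || id - map.inside >= map.count)
		return -1;
	return (long) map.outside + (id - map.inside);
}
#+end_src

With ~USERNS_OFFSET~ and ~USERNS_COUNT~ as defined above, ~map_to_parent((struct id_map) { 0, 10000, 2000 }, 0)~ gives 10000, so "root" in the container is an ordinary uid on the host, and any id from 2000 upwards has no mapping at all.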
[fn:setgroups-setresuid] ~gid~, ~sgid~, and ~egid~ are separate from ~group_info~ in ~struct cred~: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/linux/cred.h?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n95][~include/linux/cred.h:95@c8d2bc~]] #+begin_src C /* ,* The security context of a task ,* ,* The parts of the context break down into two categories: ,* ,* (1) The objective context of a task. These parts are used when some other ,* task is attempting to affect this one. ,* ,* (2) The subjective context. These details are used when the task is acting ,* upon another object, be that a file, a task, a key or whatever. ,* ,* Note that some members of this structure belong to both categories - the ,* LSM security pointer for instance. ,* ,* A task has two security pointers. task->real_cred points to the objective ,* context that defines that task's actual details. The objective part of this ,* context is used whenever that task is acted upon. ,* ,* task->cred points to the subjective context that defines the details of how ,* that task is going to act upon another object. This may be overridden ,* temporarily to point to another security context, but normally points to the ,* same context as task->real_cred. 
,*/ struct cred { atomic_t usage; #ifdef CONFIG_DEBUG_CREDENTIALS atomic_t subscribers; /* number of processes subscribed */ void *put_addr; unsigned magic; #define CRED_MAGIC 0x43736564 #define CRED_MAGIC_DEAD 0x44656144 #endif kuid_t uid; /* real UID of the task */ kgid_t gid; /* real GID of the task */ kuid_t suid; /* saved UID of the task */ kgid_t sgid; /* saved GID of the task */ kuid_t euid; /* effective UID of the task */ kgid_t egid; /* effective GID of the task */ kuid_t fsuid; /* UID for VFS ops */ kgid_t fsgid; /* GID for VFS ops */ unsigned securebits; /* SUID-less security management */ kernel_cap_t cap_inheritable; /* caps our children can inherit */ kernel_cap_t cap_permitted; /* caps we're permitted */ kernel_cap_t cap_effective; /* caps we can actually use */ kernel_cap_t cap_bset; /* capability bounding set */ kernel_cap_t cap_ambient; /* Ambient capability set */ #ifdef CONFIG_KEYS unsigned char jit_keyring; /* default keyring to attach requested ,* keys to */ struct key __rcu *session_keyring; /* keyring inherited over fork */ struct key *process_keyring; /* keyring private to this process */ struct key *thread_keyring; /* keyring private to this thread */ struct key *request_key_auth; /* assumed request_key authority */ #endif #ifdef CONFIG_SECURITY void *security; /* subjective LSM security */ #endif struct user_struct *user; /* real user ID subscription */ struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. 
*/
	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
	struct rcu_head rcu;		/* RCU deletion hook */
};
#+end_src


#+caption: =<<child>>= +=
#+begin_src C :noweb-ref child
int userns(struct child_config *config)
{
	fprintf(stderr, "=> trying a user namespace...");
	int has_userns = !unshare(CLONE_NEWUSER);
	if (write(config->fd, &has_userns, sizeof(has_userns)) != sizeof(has_userns)) {
		fprintf(stderr, "couldn't write: %m\n");
		return -1;
	}
	int result = 0;
	if (read(config->fd, &result, sizeof(result)) != sizeof(result)) {
		fprintf(stderr, "couldn't read: %m\n");
		return -1;
	}
	if (result) return -1;
	if (has_userns) {
		fprintf(stderr, "done.\n");
	} else {
		fprintf(stderr, "unsupported? continuing.\n");
	}
	fprintf(stderr, "=> switching to uid %d / gid %d...", config->uid, config->uid);
	if (setgroups(1, & (gid_t) { config->uid }) ||
	    setresgid(config->uid, config->uid, config->uid) ||
	    setresuid(config->uid, config->uid, config->uid)) {
		fprintf(stderr, "%m\n");
		return -1;
	}
	fprintf(stderr, "done.\n");
	return 0;
}
#+end_src

And this is where the child process from ~clone~ will end up. We'll perform all of our setup, switch users and groups, and then load the executable. The order is important here: we can't change mounts without certain capabilities, we can't ~unshare~ after we limit the syscalls, etc.

#+caption: =<<child>>= +=
#+begin_src C :noweb-ref child
int child(void *arg)
{
	struct child_config *config = arg;
	if (sethostname(config->hostname, strlen(config->hostname))
	    || mounts(config)
	    || userns(config)
	    || capabilities()
	    || syscalls()) {
		close(config->fd);
		return -1;
	}
	if (close(config->fd)) {
		fprintf(stderr, "close failed: %m\n");
		return -1;
	}
	if (execve(config->argv[0], config->argv, NULL)) {
		fprintf(stderr, "execve failed! %m.\n");
		return -1;
	}
	return 0;
}
#+end_src

** Capabilities

~capabilities~ subdivide the property of "being root" on Linux.
It's useful to compartmentalize privileges so that, for example, a process can allocate network devices (~CAP_NET_ADMIN~) but not read all files (~CAP_DAC_OVERRIDE~). I'll use them here to drop the ones we don't want.

But not all of "being root" is subdivided into capabilities. For example, writing to parts of procfs is allowed by root even after having dropped capabilities[fn:procfs-write]. There are a lot of things like this: this is part of why we need other restrictions besides capabilities.

[fn:procfs-write] For example, ~test_perm~ in the ~/proc/sys~-handling code:

#+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/proc/proc_sysctl.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n406][~fs/proc/proc_sysctl.c:406@c8d2bc~]]
#+begin_src C
static int test_perm(int mode, int op)
{
	if (uid_eq(current_euid(), GLOBAL_ROOT_UID))
		mode >>= 6;
	else if (in_egroup_p(GLOBAL_ROOT_GID))
		mode >>= 3;
	if ((op & ~mode & (MAY_READ|MAY_WRITE|MAY_EXEC)) == 0)
		return 0;
	return -EACCES;
}
#+end_src


It's also important to think about how we're dropping capabilities. ~man 7 capabilities~ has an algorithm for us:

#+begin_src text
During an execve(2), the kernel calculates the new capabilities of
the process using the following algorithm:

    P'(ambient)     = (file is privileged) ? 0 : P(ambient)

    P'(permitted)   = (P(inheritable) & F(inheritable)) |
                      (F(permitted) & cap_bset) | P'(ambient)

    P'(effective)   = F(effective) ? P'(permitted) : P'(ambient)

    P'(inheritable) = P(inheritable)    [i.e., unchanged]

where:

    P         denotes the value of a thread capability set before the
              execve(2)

    P'        denotes the value of a thread capability set after the
              execve(2)

    F         denotes a file capability set

    cap_bset  is the value of the capability bounding set (described
              below).
#+end_src

We'd like ~P'(ambient)~ and ~P(inheritable)~ to be empty, and ~P'(permitted)~ and ~P(effective)~ to only include the capabilities above. This is achievable by doing the following:

+ Clearing our own inheritable set.
This clears the ambient set; ~man 7 capabilities~ says "The ambient capability set obeys the invariant that no capability can ever be ambient if it is not both permitted and inheritable." This also clears the child's inheritable set. + Clearing the bounding set. This limits the file capabilities we'll gain when we ~execve~, and the rest are limited by clearing the inheritable and ambient sets. If we were to only drop our own effective, permitted and inheritable sets, we'd regain the permissions in the child file's capabilities. This is how ~bash~ can call ~ping~, for example.[fn:execve-setcap-file] [fn:execve-setcap-file] #+caption: ~try_regain_cap.c~ #+include: "linux-containers-in-500-loc/try_regain_cap.c" src C If we drop the bounding set, files with extra capabilities don't get those capabilities: #+begin_example [lizzie@empress l-c-i-500-l]$ sudo setcap "cap_mknod+p" try_regain_cap [lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c try_regain_cap => validating Linux version...4.7.10.201610222037-1-grsec on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.lVLNB1...done. => trying a user namespace...writing /proc/852/uid_map...writing /proc/852/gid_map...done. => switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. ++ don't have CAP_MKNOD => cleaning cgroups...done. #+end_example but if we don't, they work: #+caption: ~allow_all_caps.diff~ #+include: "linux-containers-in-500-loc/allow_all_caps.diff" src diff #+begin_example [lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_all_caps -m . -u 0 -c try_regain_cap => validating Linux version...4.7.10.201610222037-1-grsec on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. 
=> remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.Qnzw2A...done. => trying a user namespace...writing /proc/940/uid_map...writing /proc/940/gid_map...done. => switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. ++ have CAP_MKNOD => cleaning cgroups...done. #+end_example (and if we set ~+ep~, execve fails because it's considered a "capability-dumb binary") #+begin_example [lizzie@empress l-c-i-500-l]$ sudo setcap "cap_mknod+ep" try_regain_cap [lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c try_regain_cap => validating Linux version...4.7.10.201610222037-1-grsec on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.Esog3p...done. => trying a user namespace...writing /proc/994/uid_map...writing /proc/994/gid_map...done. => switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. execve failed! Operation not permitted. => cleaning cgroups...done. #+end_example #+caption: [[http://man7.org/linux/man-pages/man7/capabilities.7.html][~man 7 capabilities~]] #+begin_src text Safety checking for capability-dumb binaries A capability-dumb binary is an application that has been marked to have file capabilities, but has not been converted to use the libcap(3) API to manipulate its capabilities. (In other words, this is a traditional set-user-ID-root program that has been switched to use file capabilities, but whose code has not been modified to understand capabilities.) 
For such applications, the effective capability bit is set on the file, so that the file permitted capabilities are automatically enabled in the process effective set when executing the file. The kernel recognizes a file which has the effective capability bit set as capability-dumb for the purpose of the check described here. When executing a capability-dumb binary, the kernel checks if the process obtained all permitted capabilities that were specified in the file permitted set, after the capability transformations described above have been performed. (The typical reason why this might not occur is that the capability bounding set masked out some of the capabilities in the file permitted set.) If the process did not obtain the full set of file permitted capabilities, then execve(2) fails with the error EPERM. This prevents possible security risks that could arise when a capability-dumb application is executed with less privilege that it needs. Note that, by definition, the application could not itself recognize this problem, since it does not employ the libcap(3) API. #+end_src *** Dropped capabilities #+caption: =<>= += #+begin_src C :noweb-ref capabilities int capabilities() { fprintf(stderr, "=> dropping capabilities..."); #+end_src ~CAP_AUDIT_CONTROL~, ~_READ~, and ~_WRITE~ allow access to the audit system of the kernel (i.e. functions like ~audit_set_enabled~, usually used with ~auditctl~). The kernel prevents messages that normally require ~CAP_AUDIT_CONTROL~ outside of the first pid namespace, but it does allow messages that would require ~CAP_AUDIT_READ~ and ~CAP_AUDIT_WRITE~ from any namespace.[fn:cap-audit-control-pid-ns] So let's drop them all. We especially want to drop ~CAP_AUDIT_READ~, since it isn't namespaced[fn:audit-socket] and may contain important information, but ~CAP_AUDIT_WRITE~ may also allow the contained process to falsify logs or DOS the audit system. 
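Dropping these from the bounding set matters because of the ~execve~ rules quoted from ~man 7 capabilities~ earlier. Here's a minimal sketch of those rules over plain bitmasks, to show what happens to a file marked with ~cap_audit_write~. It isn't part of ~contained.c~: ~struct capsets~, ~struct filecaps~, ~CAP_MASK~ and ~execve_caps~ are made-up names, and the sketch assumes a file counts as "privileged" exactly when it carries file capabilities (it ignores setuid/setgid bits, the special treatment of uid 0, and securebits).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <linux/capability.h> /* CAP_AUDIT_WRITE */

/* Hypothetical helper types: each set is a bitmask indexed by capability number. */
struct capsets {
	uint64_t permitted, inheritable, effective, ambient;
};

struct filecaps {
	uint64_t permitted, inheritable;
	bool effective; /* the single file "effective" bit */
};

#define CAP_MASK(cap) (UINT64_C(1) << (cap))

/*
 * Apply the execve(2) transformation from man 7 capabilities.
 * Returns 0 and fills *after on success, or -1 where the
 * capability-dumb check would make execve fail with EPERM.
 */
int execve_caps(const struct capsets *before, const struct filecaps *file,
		uint64_t cap_bset, struct capsets *after)
{
	assert(before && file && after);

	/* Assumption: "privileged" means the file has any file capabilities. */
	bool privileged = file->permitted || file->inheritable || file->effective;

	after->ambient = privileged ? 0 : before->ambient;
	after->permitted = (before->inheritable & file->inheritable)
		| (file->permitted & cap_bset)
		| after->ambient;
	after->effective = file->effective ? after->permitted : after->ambient;
	after->inheritable = before->inheritable;

	/* Capability-dumb check: with the effective bit set, every file
	 * permitted capability must have reached the new permitted set. */
	if (file->effective && (file->permitted & ~after->permitted))
		return -1;
	return 0;
}
```

With empty thread sets and a bounding set that lacks ~CAP_AUDIT_WRITE~, a file with ~cap_audit_write+p~ comes out with an empty permitted set, and the same file with ~+ep~ makes ~execve_caps~ return -1. Under the stated assumptions, that mirrors the ~try_regain_cap~ runs above.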
[fn:cap-audit-control-pid-ns] #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/audit.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n663][~kernel/audit.c:663@c8d2bc~]] #+begin_src C switch (msg_type) { case AUDIT_LIST: case AUDIT_ADD: case AUDIT_DEL: return -EOPNOTSUPP; case AUDIT_GET: case AUDIT_SET: case AUDIT_GET_FEATURE: case AUDIT_SET_FEATURE: case AUDIT_LIST_RULES: case AUDIT_ADD_RULE: case AUDIT_DEL_RULE: case AUDIT_SIGNAL_INFO: case AUDIT_TTY_GET: case AUDIT_TTY_SET: case AUDIT_TRIM: case AUDIT_MAKE_EQUIV: /* Only support auditd and auditctl in initial pid namespace ,* for now. */ if (task_active_pid_ns(current) != &init_pid_ns) return -EPERM; if (!netlink_capable(skb, CAP_AUDIT_CONTROL)) err = -EPERM; break; case AUDIT_USER: case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG: case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2: if (!netlink_capable(skb, CAP_AUDIT_WRITE)) err = -EPERM; break; default: /* bad msg */ err = -EINVAL; } #+end_src [fn:audit-socket] You can obtain an audit system file descriptor by calling : socket(AF_NETLINK, SOCK_DGRAM, NETLINK_AUDIT) #+caption: [[http://man7.org/linux/man-pages/man7/netlink.7.html][~man 7 netlink~]] #+begin_src text NETLINK(7) -- 2016-07-17 -- Linux -- Linux Programmer's Manual NAME netlink - communication between kernel and user space (AF_NETLINK) SYNOPSIS [...] netlink_socket = socket(AF_NETLINK, socket_type, netlink_family); [...] DESCRIPTION Netlink is used to transfer information between the kernel and user-space processes. It consists of a standard sockets-based interface for user space processes and an internal kernel API for kernel modules. [...] netlink_family selects the kernel module or netlink group to communicate with. The currently assigned netlink families are: [...] NETLINK_AUDIT (since Linux 2.6.6) Auditing. 
#+end_src #+caption: =<>= += #+begin_src C :noweb-ref capabilities int drop_caps[] = { CAP_AUDIT_CONTROL, CAP_AUDIT_READ, CAP_AUDIT_WRITE, #+end_src ~CAP_BLOCK_SUSPEND~ lets programs prevent the system from suspending, either with ~EPOLLWAKEUP~ or /proc/sys/wake_lock.[fn:cap-block-suspend] Suspend isn't namespaced, so we'd like to prevent this. [fn:cap-block-suspend] #+caption: [[http://man7.org/linux/man-pages/man7/capabilities.7.html][~man 7 capabilities~]] #+begin_src text CAP_BLOCK_SUSPEND (since Linux 3.5) Employ features that can block system suspend (epoll(7) EPOLLWAKEUP, /proc/sys/wake_lock). #+end_src #+caption: =<>= += #+begin_src C :noweb-ref capabilities CAP_BLOCK_SUSPEND, #+end_src ~CAP_DAC_READ_SEARCH~ lets programs call ~open_by_handle_at~ with an arbitrary ~struct file_handle *~. ~struct file_handle~ is in theory an opaque type, but in practice it corresponds to inode numbers. So it's easy to brute-force them, and read arbitrary files. This was used by Sebastian Krahmer to write a program to read arbitrary system files from within Docker in 2014.[fn:shocker-c] [fn:shocker-c] [[http://www.openwall.com/lists/oss-security/2014/06/18/4][An email and description by Sebastian Krahmer]] #+begin_quote In 0.11 the problem is that the apps that run in the container have CAP_DAC_READ_SEARCH and CAP_DAC_OVERRIDE which allows the containered app to access files not just by pathname (which would be impossible due to the bind mount of the rootfs) but also by handles via open_by_handle_at(). Handles are mostly 64bit values and can be kind of pre-computed as they are inode-based and the inode of / is 2. So you can go ahead and walk / by passing a handle of 2 and search the FS until you find the inode# of the file you want to access. Even though you are containered somewhere in /var/lib. #+end_quote which links to the code, [[http://stealth.openwall.net/xSports/shocker.c][~shocker.c~]].
Note that, if user namespaces are on, we're not vulnerable, since ~open_by_handle_at~ checks for ~CAP_DAC_READ_SEARCH~ in the root user namespace: #+begin_example [lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capdacreadsearch -m . -u 0 -c ./shocker => validating Linux version...4.7.10.201610222037-1-grsec on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.GSmTxw...done. => trying a user namespace...writing /proc/1538/uid_map...writing /proc/1538/gid_map...done. => switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. [***] docker VMM-container breakout Po(C) 2014 [***] [***] The tea from the 90's kicks your sekurity again. [***] [***] If you have pending sec consulting, I'll happily [***] [***] forward to my friends who drink secury-tea too! [***] [*] Resolving 'etc/shadow' [-] open_by_handle_at: Operation not permitted => cleaning cgroups...done. #+end_example #+caption: ~fs/fhandle.c:166~ #+begin_src C static int handle_to_path(int mountdirfd, struct file_handle __user *ufh, struct path *path) { int retval = 0; struct file_handle f_handle; struct file_handle *handle = NULL; /* ,* With handle we don't look at the execute bit on the ,* the directory. Ideally we would like CAP_DAC_SEARCH. ,* But we don't have that ,*/ if (!capable(CAP_DAC_READ_SEARCH)) { retval = -EPERM; goto out_err; } /* ... */ } #+end_src #+caption: =<>= += #+begin_src C :noweb-ref capabilities CAP_DAC_READ_SEARCH, #+end_src ~CAP_FSETID~, without user namespacing, allows the process to modify a setuid executable without removing the setuid bit. This is pretty dangerous!
It means that if we include a setuid binary in a container, it's easy for us to accidentally leave a dangerous setuid root binary on our disk, which any user can use to escalate privileges.[fn:cap_fsetid] [fn:cap_fsetid] The setuid executable we'll subvert: #+caption: ~harmless_setuid.c~ #+include: "linux-containers-in-500-loc/harmless_setuid.c" src C This program will write itself to the executable at =argv[1]=. If it's a setuid root executable, there's no user namespace, and ~CAP_FSETID~ isn't dropped, it'll retain setuid root. #+caption: ~cap_fsetid.c~ #+include: "linux-containers-in-500-loc/cap_fsetid.c" src C #+caption: ~allow_capfsetid.diff~ #+include: "linux-containers-in-500-loc/allow_capfsetid.diff" src diff #+begin_example [lizzie@empress l-c-i-500-l]$ make -B harmless_setuid cc -Wall -Werror -static harmless_setuid.c -o harmless_setuid [lizzie@empress l-c-i-500-l]$ sudo chown root harmless_setuid [lizzie@empress l-c-i-500-l]$ sudo chmod 4755 harmless_setuid [lizzie@empress l-c-i-500-l]$ ./harmless_setuid I'm #1000/0/0 [lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c ./cap_fsetid harmless_setuid => validating Linux version...4.8.4-1-ARCH on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.qapCVs...done. => trying a user namespace...unsupported? continuing. => switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. => cleaning cgroups...done. 
[lizzie@empress l-c-i-500-l]$ ./harmless_setuid ++ failed switching uids to root: Operation not permitted [lizzie@empress l-c-i-500-l]$ make -B harmless_setuid cc -Wall -Werror -static harmless_setuid.c -o harmless_setuid [lizzie@empress l-c-i-500-l]$ sudo chown root harmless_setuid [lizzie@empress l-c-i-500-l]$ sudo chmod 4755 harmless_setuid [lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capfsetid -m . -u 0 -c ./cap_fsetid harmless_setuid => validating Linux version...4.8.4-1-ARCH on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.4u1dNe...done. => trying a user namespace...unsupported? continuing. => switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. => cleaning cgroups...done. [lizzie@empress l-c-i-500-l]$ ls -lh ./harmless_setuid -rwsr-xr-x 1 root lizzie 788K Oct 25 05:22 ./harmless_setuid [lizzie@empress l-c-i-500-l]$ ./harmless_setuid sh-4.3# whoami root sh-4.3# id uid=0(root) gid=1000(lizzie) groups=1000(lizzie) sh-4.3# exit [lizzie@empress l-c-i-500-l]$ rm harmless_setuid #+end_example #+caption: =<>= += #+begin_src C :noweb-ref capabilities CAP_FSETID, #+end_src ~CAP_IPC_LOCK~ can be used to lock more of a process' own memory than would normally be allowed[fn:cap_ipc_x], which could be a way to deny service. [fn:cap_ipc_x] #+caption: [[http://man7.org/linux/man-pages/man2/mlock.2.html][~man 2 mlock~]] #+begin_src text DESCRIPTION mlock(), mlock2(), and mlockall() lock part or all of the calling process's virtual address space into RAM, preventing that memory from being paged to the swap area. 
munlock() and munlockall() perform the converse operation, unlocking part or all of the calling process's virtual address space, so that pages in the specified virtual address range may once more to be swapped out if required by the kernel memory manager. Memory locking and unlocking are performed in units of whole pages. ERRORS ENOMEM (Linux 2.6.9 and later) the caller had a nonzero RLIMIT_MEMLOCK soft resource limit, but tried to lock more memory than the limit permitted. This limit is not enforced if the process is privileged (CAP_IPC_LOCK). #+end_src These functions are the only use of ~CAP_IPC_LOCK~; the only mention in the source is: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/mm/mlock.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n27][~mm/mlock.c:27@c8d2bc~]] #+begin_src C bool can_do_mlock(void) { if (rlimit(RLIMIT_MEMLOCK) != 0) return true; if (capable(CAP_IPC_LOCK)) return true; return false; } #+end_src #+caption: =<>= += #+begin_src C :noweb-ref capabilities CAP_IPC_LOCK, #+end_src ~CAP_MAC_ADMIN~ and ~CAP_MAC_OVERRIDE~ are used by the mandatory access control systems AppArmor, SELinux, and SMACK to restrict access to their settings. These aren't namespaced, so they could be used by the contained programs to circumvent system-wide access control. #+caption: =<>= += #+begin_src C :noweb-ref capabilities CAP_MAC_ADMIN, CAP_MAC_OVERRIDE, #+end_src ~CAP_MKNOD~, without user namespacing, allows programs to create device files corresponding to real-world devices. This includes creating new device files for existing hardware.
If this capability were not dropped, a contained process could re-create the hard disk device, remount it, and read or write to it.[fn:cap_mknod_exploit] [fn:cap_mknod_exploit] #+caption: ~cap_mknod.c~ #+include: "linux-containers-in-500-loc/cap_mknod.c" src C #+caption: ~allow_capmknod.diff~ #+include: "linux-containers-in-500-loc/allow_capmknod.diff" src diff Note that ~CAP_SYS_ADMIN~ doesn't need to be allowed for this to work, it's just that ~mount~ is more convenient than reading the block device in userspace. #+begin_example [lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c cap_mknod 8 1 vfat => validating Linux version...4.8.4-1-ARCH on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.VTnW1G...done. => trying a user namespace...unsupported? continuing. => switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. ++ mknod failed: Operation not permitted => cleaning cgroups...done. [lizzie@empress l-c-i-500-l]$ make contained.allow_capmknod patch contained.c -i allow_capmknod.diff -o contained.allow_capmknod.c patching file contained.allow_capmknod.c (read from contained.c) Hunk #1 succeeded at 46 (offset 8 lines). cc -Wall -Werror -lseccomp -lcap contained.allow_capmknod.c -o contained.allow_capmknod rm contained.allow_capmknod.c [lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capmknod -m . -u 0 -c cap_mknod 8 1 vfat => validating Linux version...4.8.4-1-ARCH on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.fdbi8q...done. => trying a user namespace...unsupported? continuing. 
=> switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. ++ reading /etc/shadow: [redacted] => cleaning cgroups...done. #+end_example #+caption: =<>= += #+begin_src C :noweb-ref capabilities CAP_MKNOD, #+end_src I was worried that ~CAP_SETFCAP~ could be used to add a capability to an executable and ~execve~ it, but it's not actually possible for a process to set capabilities it doesn't have[fn:cap_setfcap]. But! An executable altered this way could be executed by any unsandboxed user, so I think it unacceptably undermines the security of the system. [fn:cap_setfcap] #+caption: ~setfcap_and_exec.c~ #+include: "linux-containers-in-500-loc/setfcap_and_exec.c" src C #+caption: ~allow_capsetfcap.diff~ #+include: "linux-containers-in-500-loc/allow_capsetfcap.diff" src diff #+begin_example [lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capsetfcap -m . -u 0 -c setfcap_and_exec => validating Linux version...4.8.4-1-ARCH on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.GCu2Ry...done. => trying a user namespace...unsupported? continuing. => switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. !! don't have cap_mknod+p? ++ can't cap_set_proc: Operation not permitted => cleaning cgroups...done. #+end_example it *does* work if we don't restrict ~CAP_MKNOD~, so it does seem like processes aren't allowed to set capabilities on files that they don't have: #+caption: ~allow_capmknod_capsetfcap.diff~ #+include: "linux-containers-in-500-loc/allow_capmknod_capsetfcap.diff" src diff #+begin_example [lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capmknod_capsetfcap -m . 
-u 0 -c setfcap_and_exec => validating Linux version...4.8.4-1-ARCH on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.IZ1gDw...done. => trying a user namespace...unsupported? continuing. => switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. ++ have CAP_MKNOD! => cleaning cgroups...done. #+end_example This disagrees with [[https://forums.grsecurity.net/viewtopic.php?f%3D7&t%3D2522][Brad Spengler's note in False Boundaries and Arbitrary Code Execution]] #+begin_quote CAP_SETFCAP: generic: can set full capabilities on a file, granting full capabilities upon exec #+end_quote but that's 5 years old, so it may have changed. #+caption: =<>= += #+begin_src C :noweb-ref capabilities CAP_SETFCAP, #+end_src ~CAP_SYSLOG~ lets users perform destructive actions against the syslog. Importantly, dropping it doesn't prevent contained processes from reading the syslog, which could be risky. The capability also exposes kernel addresses, which could be used to circumvent kernel address space layout randomization[fn:cap_syslog]. [fn:cap_syslog] #+caption: [[http://man7.org/linux/man-pages/man7/capabilities.7.html][~man 7 capabilities~]] #+begin_src text CAP_SYSLOG (since Linux 2.6.37) * Perform privileged syslog(2) operations. See syslog(2) for information on which operations require privilege. * View kernel addresses exposed via /proc and other interfaces when /proc/sys/kernel/kptr_restrict has the value 1. (See the discussion of the kptr_restrict in proc(5).) #+end_src #+caption: [[http://man7.org/linux/man-pages/man2/syslog.2.html][~man 2 syslog~]] #+begin_src text SYSLOG_ACTION_READ (2) [...] Bytes read from the log disappear from the log buffer [...] SYSLOG_ACTION_READ_ALL (3) [...]
The call reads the last len bytes from the log buffer (nondestructively) [...] SYSLOG_ACTION_READ_CLEAR (4) [...] SYSLOG_ACTION_CLEAR (5) [...] SYSLOG_ACTION_CONSOLE_OFF (6) [...] SYSLOG_ACTION_CONSOLE_ON (7) [...] SYSLOG_ACTION_CONSOLE_LEVEL (8) [...] SYSLOG_ACTION_SIZE_UNREAD (9) [...] SYSLOG_ACTION_SIZE_BUFFER (10) [...] All commands except 3 and 10 require privilege. #+end_src #+caption: =<>= += #+begin_src C :noweb-ref capabilities CAP_SYSLOG, #+end_src ~CAP_SYS_ADMIN~ allows many behaviors! We don't want most of them (~mount~, ~vm86~, etc.). Some would be nice to have (~sethostname~, ~mount~ for bind mounts...) but the extra complexity doesn't seem worth it. #+caption: =<>= += #+begin_src C :noweb-ref capabilities CAP_SYS_ADMIN, #+end_src ~CAP_SYS_BOOT~ allows programs to restart the system (the ~reboot~ syscall) and load new kernels (the ~kexec_load~ and ~kexec_file_load~ syscalls)[fn:cap_sys_boot-usages]. We absolutely don't want this. ~reboot~ is user-namespaced, and the ~kexec*~ functions only work in the root user namespace, but neither of those helps us. [fn:cap_sys_boot-usages] All of the uses of ~CAP_SYS_BOOT~: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/reboot.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n280][~kernel/reboot.c:280@c8d2bc~]]: #+begin_src C SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd, void __user *, arg) { struct pid_namespace *pid_ns = task_active_pid_ns(current); char buffer[256]; int ret = 0; /* We only trust the superuser with rebooting the system. */ if (!ns_capable(pid_ns->user_ns, CAP_SYS_BOOT)) return -EPERM; [...]
} #+end_src #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/kexec.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n187][~kernel/kexec.c:187@c8d2bc~]]: #+begin_src C SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments, struct kexec_segment __user *, segments, unsigned long, flags) { int result; /* We only trust the superuser with rebooting the system. */ if (!capable(CAP_SYS_BOOT) || kexec_load_disabled) return -EPERM; [...] } #+end_src #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/kexec_file.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n256][~kernel/kexec_file.c:256@c8d2bc~]]: #+begin_src C SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, unsigned long, cmdline_len, const char __user *, cmdline_ptr, unsigned long, flags) { int ret = 0, i; struct kimage **dest_image, *image; /* We only trust the superuser with rebooting the system. */ if (!capable(CAP_SYS_BOOT) || kexec_load_disabled) return -EPERM; [...] } #+end_src #+caption: =<>= += #+begin_src C :noweb-ref capabilities CAP_SYS_BOOT, #+end_src ~CAP_SYS_MODULE~ is used by the syscalls ~delete_module~, ~init_module~, ~finit_module~ [fn:cap-sys-module], by the code for ~kmod~ [fn:kmod], and by the code for loading device modules with ioctl[fn:dev-load]. [fn:cap-sys-module] #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/module.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n931][~kernel/module.c:931@c8d2bc~]] #+begin_src C SYSCALL_DEFINE2(delete_module, const char __user *, name_user, unsigned int, flags) { struct module *mod; char name[MODULE_NAME_LEN]; int ret, forced = 0; if (!capable(CAP_SYS_MODULE) || modules_disabled) return -EPERM; [...] 
} #+end_src #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/module.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n3468][~kernel/module.c:3468@c8d2bc~]] #+begin_src C static int may_init_module(void) { if (!capable(CAP_SYS_MODULE) || modules_disabled) return -EPERM; return 0; } #+end_src which is called by ~init_module~ and ~finit_module~: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/module.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n3759][~kernel/module.c:3759@c8d2bc~]] #+begin_src C SYSCALL_DEFINE3(init_module, void __user *, umod, unsigned long, len, const char __user *, uargs) { int err; struct load_info info = { }; err = may_init_module(); if (err) return err; pr_debug("init_module: umod=%p, len=%lu, uargs=%p\n", umod, len, uargs); err = copy_module_from_user(umod, len, &info); if (err) return err; return load_module(&info, uargs, 0); } SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags) { struct load_info info = { }; loff_t size; void *hdr; int err; err = may_init_module(); if (err) return err; pr_debug("finit_module: fd=%d, uargs=%p, flags=%i\n", fd, uargs, flags); if (flags & ~(MODULE_INIT_IGNORE_MODVERSIONS |MODULE_INIT_IGNORE_VERMAGIC)) return -EINVAL; err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX, READING_MODULE); if (err) return err; info.hdr = hdr; info.len = size; return load_module(&info, uargs, flags); } #+end_src [fn:kmod] #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/kmod.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n630][~kernel/kmod.c:630@c8d2bc~]] #+begin_src C static int proc_cap_handler(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos) { struct ctl_table t; unsigned long cap_array[_KERNEL_CAPABILITY_U32S]; kernel_cap_t new_cap; int err, i; if (write && (!capable(CAP_SETPCAP) || !capable(CAP_SYS_MODULE))) return -EPERM; [...] 
} #+end_src which is used to authorize requests to load modules. [fn:dev-load] #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/net/core/dev_ioctl.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n349][~net/core/dev_ioctl.c:349@c8d2bc~]] #+begin_src C /** ,* dev_load - load a network module ,* @net: the applicable net namespace ,* @name: name of interface ,* ,* If a network interface is not present and the process has suitable ,* privileges this function loads the module. If module loading is not ,* available in this kernel then it becomes a nop. ,*/ void dev_load(struct net *net, const char *name) { struct net_device *dev; int no_module; rcu_read_lock(); dev = dev_get_by_name_rcu(net, name); rcu_read_unlock(); no_module = !dev; if (no_module && capable(CAP_NET_ADMIN)) no_module = request_module("netdev-%s", name); if (no_module && capable(CAP_SYS_MODULE)) request_module("%s", name); } #+end_src This also allows processes with only ~CAP_NET_ADMIN~ to load ~netdev-*~ modules, and is run on almost every ~ioctl~ on a network device: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/net/core/dev_ioctl.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n381][~net/core/dev_ioctl.c:381@c8d2bc~]] #+begin_src C /** ,* dev_ioctl - network device ioctl ,* @net: the applicable net namespace ,* @cmd: command to issue ,* @arg: pointer to a struct ifreq in user space ,* ,* Issue ioctl functions to devices. This is normally called by the ,* user space syscall interfaces but can sometimes be useful for ,* other purposes. The return value is the return from the syscall if ,* positive or a negative errno code on error. ,*/ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg) { [...] /* ,* See which interface the caller is talking about. ,*/ switch (cmd) { /* ,* These ioctl calls: ,* - can be done by all. ,* - atomic and do not require locking. 
,* - return a value ,*/ case SIOCGIFFLAGS: case SIOCGIFMETRIC: case SIOCGIFMTU: case SIOCGIFHWADDR: case SIOCGIFSLAVE: case SIOCGIFMAP: case SIOCGIFINDEX: case SIOCGIFTXQLEN: dev_load(net, ifr.ifr_name); [...] } #+end_src This was pretty surprising to me! I should look into this further. #+caption: =<>= += #+begin_src C :noweb-ref capabilities CAP_SYS_MODULE, #+end_src ~CAP_SYS_NICE~ allows processes to set higher priority on given pids than the default[fn:cap_sys_nice]. The default kernel scheduler doesn't know anything about pid namespaces, so it's possible for a contained process to deny service to the rest of the system[fn:nice-dos]. [fn:cap_sys_nice] #+caption: [[http://man7.org/linux/man-pages/man2/nice.2.html][~man 2 nice~]] #+begin_src text DESCRIPTION nice() adds inc to the nice value for the calling process. (A higher nice value means a low priority.) Only the superuser may specify a negative increment, or priority increase. [...] ERRORS EPERM The calling process attempted to increase its priority by supplying a negative inc but has insufficient privileges. Under Linux, the CAP_SYS_NICE capability is required. (But see the discussion of the RLIMIT_NICE resource limit in setrlimit(2).) #+end_src [fn:nice-dos] We'll see how many CPU cycles this gets in a single-core virtual machine, in the host and in a container that can set low nice values: #+caption: ~busy_loop.c~ #+include: "linux-containers-in-500-loc/busy_loop.c" src C #+caption: ~nice_dos.c~ #+include: "linux-containers-in-500-loc/nice_dos.c" src C #+caption: ~allow_capsysnice.diff~ #+include: "linux-containers-in-500-loc/allow_capsysnice.diff" src diff #+begin_example alpine-kernel-dev:~# (./busy_loop && echo '^ uncontained one' &) && (sudo ./contained.allow_capsysnice -m . -u 0 -c ./nice_dos &) => validating Linux version...4.7.6. => setting cgroups...memory...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. 
=> pivoting root...done. => unmounting /oldroot.elKMci...done. => trying a user namespace...unsupported? continuing. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. ticks: 52 ^ uncontained one ticks: 341 => cleaning cgroups...done. alpine-kernel-dev:~# #+end_example #+caption: =<>= += #+begin_src C :noweb-ref capabilities CAP_SYS_NICE, #+end_src ~CAP_SYS_RAWIO~ allows full access to the host systems memory with ~/proc/kcore~, ~/dev/mem~, and ~/dev/kmem~ [fn:kmem-etc], but a contained process would need ~mknod~ to access these within the namespace.[fn:kmem-etc-mknod]. But it also allows things like ~iopl~ and ~ioperm~, which give raw access to the IO ports[fn:io-ports]. [fn:kmem-etc] #+caption: [[http://man7.org/linux/man-pages/man7/capabilities.7.html][~man 7 capabilities~]] #+begin_src text CAP_SYS_RAWIO ,* Perform I/O port operations (iopl(2) and ioperm(2)); ,* access /proc/kcore; ,* employ the FIBMAP ioctl(2) operation; ,* open devices for accessing x86 model-specific registers (MSRs, see msr(4)) ,* update /proc/sys/vm/mmap_min_addr; ,* create memory mappings at addresses below the value specified by /proc/sys/vm/mmap_min_addr; ,* map files in /proc/bus/pci; ,* open /dev/mem and /dev/kmem; ,* perform various SCSI device commands; ,* perform certain operations on hpsa(4) and cciss(4) devices; ,* perform a range of device-specific operations on other devices. #+end_src [fn:kmem-etc-mknod] #+caption: [[http://man7.org/linux/man-pages/man4/mem.4.html][~man 4 mem~]] #+begin_src text /dev/mem is a character device file that is an image of the main memory of the computer. It may be used, for example, to examine (and even patch) the system. [...] It is typically created by: mknod -m 660 /dev/mem c 1 1 chown root:kmem /dev/mem The file /dev/kmem is the same as /dev/mem, except that the kernel virtual memory rather than physical memory is accessed. 
Since Linux 2.6.26, this file is available only if the CONFIG_DEVKMEM kernel configuration option is enabled. It is typically created by: mknod -m 640 /dev/kmem c 1 2 chown root:kmem /dev/kmem /dev/port is similar to /dev/mem, but the I/O ports are accessed. It is typically created by: mknod -m 660 /dev/port c 1 4 chown root:kmem /dev/port #+end_src [fn:io-ports] #+caption: [[http://man7.org/linux/man-pages/man2/ioperm.2.html][~man 2 ioperm~]] #+begin_src text ioperm() sets the port access permission bits for the calling thread for num bits starting from port address from. If turn_on is nonzero, then permission for the specified bits is enabled; otherwise it is disabled. If turn_on is nonzero, the calling thread must be privileged (CAP_SYS_RAWIO). #+end_src #+caption: [[http://man7.org/linux/man-pages/man2/iopl.2.html][~man 2 iopl~]] #+begin_src text iopl() changes the I/O privilege level of the calling process, as specified by the two least significant bits in level. This call is necessary to allow 8514-compatible X servers to run under Linux. Since these X servers require access to all 65536 I/O ports, the ioperm(2) call is not sufficient. In addition to granting unrestricted I/O port access, running at a higher I/O privilege level also allows the process to disable interrupts. This will probably crash the system, and is not recommended. #+end_src #+caption: =<>= += #+begin_src C :noweb-ref capabilities CAP_SYS_RAWIO, #+end_src ~CAP_SYS_RESOURCE~ specifically allows circumventing kernel-wide limits, so we probably should drop it[fn:cap_sys_resource]. But I don't think this can do more than DOS the kernel, in general[fn:cap_sys_resource-spender]. 
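
As a rough illustration of the "increase resource limits" item in the man page, here's a minimal sketch (not part of ~contained.c~) that probes whether the current process can raise a hard rlimit. The helper name ~can_raise_hard_rlimit~ is made up for this example, and it assumes the behaviour ~man 2 setrlimit~ describes: raising a hard limit without ~CAP_SYS_RESOURCE~ fails with ~EPERM~. The probe runs in a forked child, so the caller's own limits aren't changed.

#+begin_src C
#include <errno.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not from contained.c.
 * Returns 1 if a hard rlimit can be raised (the process effectively has
 * CAP_SYS_RESOURCE), 0 if the kernel refuses with EPERM, and -1 on any
 * other error. */
int can_raise_hard_rlimit(void)
{
	pid_t pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		struct rlimit lim = { .rlim_cur = 0, .rlim_max = 0 };
		/* Lowering a hard limit is always allowed. */
		if (setrlimit(RLIMIT_CORE, &lim))
			_exit(2);
		/* Raising it again is the privileged operation. */
		lim.rlim_max = 1;
		if (!setrlimit(RLIMIT_CORE, &lim))
			_exit(1);
		_exit(errno == EPERM ? 0 : 2);
	}
	int status = 0;
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	int code = WEXITSTATUS(status);
	return code == 2 ? -1 : code;
}
#+end_src

Run as root on the host, this should return 1; run inside the container after ~CAP_SYS_RESOURCE~ has been dropped, it should return 0.
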
[fn:cap_sys_resource]
#+caption: [[http://man7.org/linux/man-pages/man7/capabilities.7.html][~man 7 capabilities~]]
#+begin_src text
CAP_SYS_RESOURCE
  ,* Use reserved space on ext2 filesystems;
  ,* make ioctl(2) calls controlling ext3 journaling;
  ,* override disk quota limits;
  ,* increase resource limits (see setrlimit(2));
  ,* override RLIMIT_NPROC resource limit;
  ,* override maximum number of consoles on console allocation;
  ,* override maximum number of keymaps;
  ,* allow more than 64hz interrupts from the real-time clock;
  ,* raise msg_qbytes limit for a System V message queue above the limit in /proc/sys/kernel/msgmnb (see msgop(2) and msgctl(2));
  ,* override the /proc/sys/fs/pipe-size-max limit when setting the capacity of a pipe using the F_SETPIPE_SZ fcntl(2) command.
  ,* use F_SETPIPE_SZ to increase the capacity of a pipe above the limit specified by /proc/sys/fs/pipe-max-size;
  ,* override /proc/sys/fs/mqueue/queues_max limit when creating POSIX message queues (see mq_overview(7));
  ,* employ prctl(2) PR_SET_MM operation;
  ,* set /proc/PID/oom_score_adj to a value lower than the value last set by a process with CAP_SYS_RESOURCE.
#+end_src

[fn:cap_sys_resource-spender] [[https://forums.grsecurity.net/viewtopic.php?f%3D7&t%3D2522][Brad Spengler agrees in "False Boundaries and Arbitrary Code Execution":]]
#+begin_quote
No transitions known (to this author, yet):
[...]
CAP_SYS_RESOURCE
[...]
#+end_quote

#+caption: =<>= +=
#+begin_src C :noweb-ref capabilities
		CAP_SYS_RESOURCE,
#+end_src

~CAP_SYS_TIME~: setting the time isn't namespaced, so we should prevent contained processes from altering the system-wide time[fn:time-travel-attacks].

[fn:time-travel-attacks] It turns out that you can break important things by altering the time. [[https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_dowling.pdf]["Authenticated Network Time Synchronization"]] describes some of these:
#+begin_quote
The importance of accurate time for security.
There are many examples of security mechanisms which (often implicitly) rely on having an accurate clock:

+ Certificate validation in TLS and other protocols. Validating a public key certificate requires confirming that the current time is within the certificate’s validity period. Performing validation with a slow or inaccurate clock may cause expired certificates to be accepted as valid. A revoked certificate may also validate if the clock is slow, since the relying party will not check for updated revocation information.
+ Ticket verification in Kerberos. In Kerberos, authentication tickets have a validity period, and proper verification requires an accurate clock to prevent authentication with an expired ticket.
+ HTTP Strict Transport Security (HSTS) policy duration. HSTS allows website administrators to protect against downgrade attacks from HTTPS to HTTP by sending a header to browsers indicating that HTTPS must be used instead of HTTP. HSTS policies specify the duration of time that HTTPS must be used. If the browser’s clock jumps ahead, the policy may expire re-allowing downgrade attacks. A related mechanism, HTTP Public Key Pinning also relies on accurate client time for security.

For clients who set their clocks using NTP, these security mechanisms (and others) can be attacked by a network-level attacker who can intercept and modify NTP traffic, such as a malicious wireless access point or an insider at an ISP. In practice, most NTP servers do not authenticate themselves to clients, so a network attacker can intercept responses and set the timestamps arbitrarily. Even if the client sends requests to multiple servers, these may all be intercepted by an upstream network device and modified to present a consistently incorrect time to a victim.
Such an attack on HSTS was demonstrated by [[https://www.blackhat.com/docs/eu-14/materials/eu-14-Selvi-Bypassing-HTTP-Strict-Transport-Security-wp.pdf][Selvi]], who provided a tool to advance the clock of victims in order to expire HSTS policies. [[http://www.cs.bu.edu/~goldbe/NTPattack.html][Malhotra et al]]. present a variety of attacks that rely on NTP being unauthenticated, further emphasizing the need for authenticated time synchronization. #+end_quote #+caption: =<>= += #+begin_src C :noweb-ref capabilities CAP_SYS_TIME, #+end_src ~CAP_WAKE_ALARM~, like ~CAP_BLOCK_SUSPEND~, lets the contained process interfere with suspend[fn:cap_wake_alarm], and we'd like to prevent that. [fn:cap_wake_alarm] #+caption: [[http://man7.org/linux/man-pages/man7/capabilities.7.html][~man 7 capabilities~]] #+begin_src text CAP_WAKE_ALARM (since Linux 3.0) Trigger something that will wake up the system (set CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM timers). #+end_src I had trouble finding more information about these, but [[https://lwn.net/Articles/429925/]["Waking systems from suspend" on LWN]] goes into more detail: #+begin_quote these timers are exposed to user space via the standard POSIX clocks and timers interface, using the new the CLOCK_REALTIME_ALARM clockid. The new clockid behaves identically to CLOCK_REALTIME except that timers set against the _ALARM clockid will wake the system if it is suspended. 
#+end_quote

#+caption: =<>= +=
#+begin_src C :noweb-ref capabilities
		CAP_WAKE_ALARM
	};
#+end_src

#+caption: =<>= +=
#+begin_src C :noweb-ref capabilities
	size_t num_caps = sizeof(drop_caps) / sizeof(*drop_caps);
	/* Drop each capability from the bounding set. */
	fprintf(stderr, "bounding...");
	for (size_t i = 0; i < num_caps; i++) {
		if (prctl(PR_CAPBSET_DROP, drop_caps[i], 0, 0, 0)) {
			fprintf(stderr, "prctl failed: %m\n");
			return 1;
		}
	}
	/* Clear the same capabilities from the inheritable set. */
	fprintf(stderr, "inheritable...");
	cap_t caps = NULL;
	if (!(caps = cap_get_proc())
	    || cap_set_flag(caps, CAP_INHERITABLE, num_caps, drop_caps, CAP_CLEAR)
	    || cap_set_proc(caps)) {
		fprintf(stderr, "failed: %m\n");
		if (caps) cap_free(caps);
		return 1;
	}
	cap_free(caps);
	fprintf(stderr, "done.\n");
	return 0;
}
#+end_src

*** Retained Capabilities

It's important to keep track of the capabilities I'm not dropping, too.

I've heard in multiple places[fn:multiple-places] that ~CAP_DAC_OVERRIDE~ might expose the same functionality as ~CAP_DAC_READ_SEARCH~ (i.e. ~open_by_handle_at~), but as far as I can tell that isn't true. ~shocker.c~ doesn't get anywhere with only ~CAP_DAC_OVERRIDE~[fn:cap_dac_override-same-functionality], and the only usage in the kernel is in the Unix permission-checking code[fn:cap_dac_override-the-only-usage]. So my understanding is that ~CAP_DAC_OVERRIDE~ on its own doesn't allow processes to read outside of their mount namespaces ("DAC" or "Discretionary Access Control" refers here to ordinary Unix permissions).

[fn:multiple-places] [[https://forums.grsecurity.net/viewtopic.php?f%3D7&t%3D2522][Brad Spengler's "False Boundaries and Arbitrary Code Execution"]]:
#+begin_quote
CAP_DAC_OVERRIDE: generic: same bypass as CAP_DAC_READ_SEARCH, can also modify a non-suid binary executed by root to execute code with full privileges (modifying a suid root binary for you to execute would require CAP_FSETID, as the setuid bit is cleared on modification otherwise; thanks to Eric Paris). The modprobe sysctl can be modified as mentioned above to execute code with full capabilities.
#+end_quote and of course [[http://www.openwall.com/lists/oss-security/2014/06/18/4][Sebastian Krahmer's email]]: #+begin_quote In 0.11 the problem is that the apps that run in the container have CAP_DAC_READ_SEARCH and CAP_DAC_OVERRIDE which allows the containered app to access files not just by pathname (which would be impossible due to the bind mount of the rootfs) but also by handles via open_by_handle_at(). #+end_quote He might mean that the combination of both of them is problematic, though, which is absolutely true: with ~CAP_DAC_OVERRIDE~ and ~CAP_DAC_READ_SEARCH~, it's possible to modify arbitrary files: #+caption: ~shocker_write.patch~ #+include: "linux-containers-in-500-loc/shocker_write.patch" src diff #+caption: ~allow_capdacreadsearch.diff~ #+include: "linux-containers-in-500-loc/allow_capdacreadsearch.diff" src diff #+begin_example [lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capdacreadsearch -m . -u 0 -c ./shocker_write => validating Linux version...4.8.4-1-ARCH on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.axVxAE...done. => trying a user namespace...unsupported? continuing. => switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. [***] docker VMM-container breakout Po(C) 2014 [***] [***] The tea from the 90's kicks your sekurity again. [***] [***] If you have pending sec consulting, I'll happily [***] [***] forward to my friends who drink secury-tea too! [***] [*] Resolving 'etc/motd' [*] Found . [*] Found .. 
[*] Found lib64
[*] Found sys
[*] Found run
[*] Found sbin
[*] Found opt
[*] Found tmp
[*] Found lost+found
[*] Found dev
[*] Found mnt
[*] Found root
[*] Found lib
[*] Found boot
[*] Found home
[*] Found usr
[*] Found bin
[*] Found srv
[*] Found etc
[+] Match: etc ino=4325377
[*] Brute forcing remaining 32bit. This can take a while...
[*] (etc) Trying: 0x00000000
[*] #=8, 1, char nh[] = {0x01, 0x00, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00};
[*] Resolving 'motd'
[*] Found binfmt.d
[*] Found ts.conf
[*] Found nscd.conf
[*] Found dhcpcd.duid
[*] Found sensors3.conf
[*] Found libao.conf
[*] Found .
[*] Found motd
[+] Match: motd ino=4325389
[*] Brute forcing remaining 32bit. This can take a while...
[*] (motd) Trying: 0x00000000
[*] #=8, 1, char nh[] = {0x0d, 0x00, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00};
[!] Got a final handle!
[*] #=8, 1, char nh[] = {0x0d, 0x00, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00};
[!] Win! /etc/motd written.
=> cleaning cgroups...done.
#+end_example

[fn:cap_dac_override-same-functionality]
#+caption: ~allow_capdacreadsearch.diff~
#+include: "linux-containers-in-500-loc/allow_capdacreadsearch.diff" src diff

#+begin_example
[lizzie@empress l-c-i-500-l]$sudo ./contained -m . -u 0 -c ./shocker
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.bWoGr4...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
[***] docker VMM-container breakout Po(C) 2014 [***]
[***] The tea from the 90's kicks your sekurity again.
[***] [***] If you have pending sec consulting, I'll happily [***] [***] forward to my friends who drink secury-tea too! [***] [*] Resolving 'etc/shadow' [-] open_by_handle_at: Operation not permitted => cleaning cgroups...done. [lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capdacreadsearch -m . -u 0 -c ./shocker => validating Linux version...4.8.4-1-ARCH on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.Jto0pj...done. => trying a user namespace...unsupported? continuing. => switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. [***] docker VMM-container breakout Po(C) 2014 [***] [***] The tea from the 90's kicks your sekurity again. [***] [***] If you have pending sec consulting, I'll happily [***] [***] forward to my friends who drink secury-tea too! [***] [*] Resolving 'etc/shadow' [*] Found . [*] Found .. [*] Found lib64 [*] Found sys [*] Found run [*] Found sbin [*] Found opt [*] Found tmp [*] Found lost+found [*] Found dev [*] Found mnt [*] Found root [*] Found lib [*] Found boot [*] Found home [*] Found usr [*] Found bin [*] Found srv [*] Found etc [+] Match: etc ino=4325377 [*] Brute forcing remaining 32bit. This can take a while... [*] (etc) Trying: 0x00000000 [*] #=8, 1, char nh[] = {0x01, 0x00, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00}; [*] Resolving 'shadow' [*] Found binfmt.d [*] Found ts.conf [*] Found nscd.conf [*] Found dhcpcd.duid [*] Found sensors3.conf [*] Found libao.conf [*] Found . [*] Found motd [*] Found gdb [*] Found .. 
[*] Found qemu [*] Found lirc [*] Found healthd.conf [*] Found subuid [*] Found locale.gen.pacnew [*] Found gtk-3.0 [*] Found idn.conf [*] Found wgetrc [*] Found mime.types [*] Found texmf [*] Found request-key.conf [*] Found xinetd.d [*] Found ssl [*] Found ifplugd [*] Found mpd.conf [*] Found gimp [*] Found logrotate.d [*] Found dhcpcd.conf [*] Found trusted-key.key [*] Found resolv.conf [*] Found gemrc [*] Found libpaper.d [*] Found hostname [*] Found kernel [*] Found audit [*] Found request-key.d [*] Found subgid [*] Found services [*] Found protocols [*] Found profile.d [*] Found Muttrc.dist [*] Found audisp [*] Found default [*] Found resolv.conf.bak [*] Found ufw [*] Found man_db.conf [*] Found gconf [*] Found geoclue [*] Found netconfig [*] Found nanorc [*] Found environment [*] Found crypttab [*] Found brltty.conf [*] Found logrotate.conf [*] Found goaccess.conf [*] Found nsswitch.conf [*] Found shadow [+] Match: shadow ino=4334485 [*] Brute forcing remaining 32bit. This can take a while... [*] (shadow) Trying: 0x00000000 [*] #=8, 1, char nh[] = {0x95, 0x23, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00}; [!] Got a final handle! [*] #=8, 1, char nh[] = {0x95, 0x23, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00}; [!] Win! /etc/shadow output follows: [redacted] => cleaning cgroups...done. #+end_example [fn:cap_dac_override-the-only-usage] #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n316][~fs/namei.c:316@c8d2bc~]]: #+begin_src C int generic_permission(struct inode *inode, int mask) { int ret; /* ,* Do the basic permission checks. 
	 ,*/
	ret = acl_permission_check(inode, mask);
	if (ret != -EACCES)
		return ret;

	if (S_ISDIR(inode->i_mode)) {
		/* DACs are overridable for directories */
		if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
			return 0;
		if (!(mask & MAY_WRITE))
			if (capable_wrt_inode_uidgid(inode,
						     CAP_DAC_READ_SEARCH))
				return 0;
		return -EACCES;
	}
	/*
	 ,* Read/write DACs are always overridable.
	 ,* Executable DACs are overridable when there is
	 ,* at least one exec bit set.
	 ,*/
	if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO))
		if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
			return 0;

	/*
	 ,* Searching includes executable on directories, else just read.
	 ,*/
	mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
	if (mask == MAY_READ)
		if (capable_wrt_inode_uidgid(inode, CAP_DAC_READ_SEARCH))
			return 0;

	return -EACCES;
}
#+end_src

~CAP_FOWNER~, ~CAP_LEASE~, and ~CAP_LINUX_IMMUTABLE~ all operate on files inside of the mount namespace. Likewise, ~CAP_SYS_PACCT~ allows a process to switch process accounting on and off for itself. The ~acct~ system call takes a path to log to (which must be within the mount namespace), and only operates on the calling process. We're not using process accounting in our containerization, so turning it off should be harmless as well[fn:cap_sys_pacct].

[fn:cap_sys_pacct] [[http://man7.org/linux/man-pages/man5/acct.5.html][~man 5 acct~]] gives more useful information about this system than [[http://man7.org/linux/man-pages/man2/acct.2.html][~man 2 acct~]].

~CAP_IPC_OWNER~ is only used by functions that respect IPC namespaces[fn:cap-ipc-owner]; since we're in a separate IPC namespace from the host, we can allow this.
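
To make it concrete when the ~CAP_IPC_OWNER~ check in ~ipcperms~ (quoted in the footnote) is actually reached, here's a userspace replay of its mode arithmetic. It's a sketch, not kernel code: the name ~ipc_needs_cap_ipc_owner~ and the ~is_owner~ / ~in_group~ parameters are stand-ins assumed here for the kernel's uid and gid comparisons.

#+begin_src C
/* Hypothetical helper: mirrors the bit arithmetic of ipcperms() in userspace.
 * Returns 1 if the kernel would have to fall back on CAP_IPC_OWNER to grant
 * the request, 0 if the object's mode bits already allow it. */
int ipc_needs_cap_ipc_owner(unsigned short mode, int flag,
			    int is_owner, int in_group)
{
	/* Fold the user, group, and other request bits into the low 3 bits. */
	int requested_mode = (flag >> 6) | (flag >> 3) | flag;
	int granted_mode = mode;

	if (is_owner)
		granted_mode >>= 6;	/* use the owner's rwx bits */
	else if (in_group)
		granted_mode >>= 3;	/* use the group's rwx bits */

	/* Any requested bit that isn't granted needs the capability. */
	return (requested_mode & ~granted_mode & 0007) != 0;
}
#+end_src

For an object with mode ~0600~, the owner never reaches the capability check, while any other uid asking for read or write access does. Either way, the object was already looked up in the caller's IPC namespace before this arithmetic runs.
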
[fn:cap-ipc-owner] ~CAP_IPC_OWNER~ is only used in ~ipcperms~: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/util.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n468][~ipc/util.c:468@c8d2bc~]] #+begin_src C /** * ipcperms - check ipc permissions * @ns: ipc namespace * @ipcp: ipc permission set * @flag: desired permission set * * Check user, group, other permissions for access * to ipc resources. return 0 if allowed * * @flag will most probably be 0 or S_...UGO from */ int ipcperms(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp, short flag) { kuid_t euid = current_euid(); int requested_mode, granted_mode; audit_ipc_obj(ipcp); requested_mode = (flag >> 6) | (flag >> 3) | flag; granted_mode = ipcp->mode; if (uid_eq(euid, ipcp->cuid) || uid_eq(euid, ipcp->uid)) granted_mode >>= 6; else if (in_group_p(ipcp->cgid) || in_group_p(ipcp->gid)) granted_mode >>= 3; /* is there some bit set in requested_mode but not in granted_mode? */ if ((requested_mode & ~granted_mode & 0007) && !ns_capable(ns->user_ns, CAP_IPC_OWNER)) return -1; return security_ipc_permission(ipcp, flag); } #+end_src It's used in the following places immediately after looking up the IPC object in the IPC namespace: + In the IPC shared memory system [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/shm.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3][~ipc/shm.c@c8d2bc~]] (done after ~shm_obtain_object~ and ~shm_obtain_object_check~): + [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/shm.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n869][~ipc/shm.c:869@c8d2bc~]]: ~shmctl_nolock~ + [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/shm.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n1081][~ipc/shm.c:1081@c8d2bc~]]: ~do_shmat~ + In the IPC semaphore system, 
  [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/sem.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3][~ipc/sem.c@c8d2bc~]] (done after ~sem_obtain_object~ and ~sem_obtain_object_check~):
  + [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/sem.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n1200][~ipc/sem.c:1200@c8d2bc~]]: ~semctl_nolock~
  + [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/sem.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n1289][~ipc/sem.c:1289@c8d2bc~]]: ~semctl_setval~
  + [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/sem.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n1360][~ipc/sem.c:1360@c8d2bc~]]: ~semctl_main~
  + [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/sem.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n1816][~ipc/sem.c:1816@c8d2bc~]]: ~semtimedop~
+ In the IPC message queue system, [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/msg.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3][~ipc/msg.c@c8d2bc~]] (done after ~msq_obtain_object~ and ~msq_obtain_object_check~):
  + [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/msg.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n445][~ipc/msg.c:445@c8d2bc~]]: ~msgctl_nolock~
  + [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/msg.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n630][~ipc/msg.c:630@c8d2bc~]]: ~do_msgsnd~
  + [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/msg.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n846][~ipc/msg.c:846@c8d2bc~]]: ~do_msgrcv~

~ipc_check_perms~ is a thin layer over ~ipcperms~ that doesn't check the IPC namespace:
#+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/util.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n290][~ipc/util.c:290@c8d2bc~]] #+begin_src C /** * ipc_check_perms - check security and permissions for an ipc object * @ns: ipc namespace * @ipcprgre: ipc permission set * @ops: the actual security routine to call * @params: its parameters * * This routine is called by sys_msgget(), sys_semget() and sys_shmget() * when the key is not IPC_PRIVATE and that key already exists in the * ds IDR. * * On success, the ipc id is returned. * * It is called with ipc_ids.rwsem and ipcp->lock held. */ static int ipc_check_perms(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp, const struct ipc_ops *ops, struct ipc_params *params) { int err; if (ipcperms(ns, ipcp, params->flg)) err = -EACCES; else { err = ops->associate(ipcp, params->flg); if (!err) err = ipcp->id; } return err; } #+end_src which is called by ~ipcget_public~. #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/util.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n323][~ipc/util.c:323@c8d2bc~]] #+begin_src C /** * ipcget_public - get an ipc object or create a new one * @ns: ipc namespace * @ids: ipc identifier set * @ops: the actual creation routine to call * @params: its parameters * * This routine is called by sys_msgget, sys_semget() and sys_shmget() * when the key is not IPC_PRIVATE. * It adds a new entry if the key is not found and does some permission * / security checkings if the key is found. * * On success, the ipc id is returned. 
 */
static int ipcget_public(struct ipc_namespace *ns, struct ipc_ids *ids,
		const struct ipc_ops *ops, struct ipc_params *params)
{
	struct kern_ipc_perm *ipcp;
	int flg = params->flg;
	int err;

	/*
	 * Take the lock as a writer since we are potentially going to add
	 * a new entry + read locks are not "upgradable"
	 */
	down_write(&ids->rwsem);
	ipcp = ipc_findkey(ids, params->key);
	if (ipcp == NULL) {
		/* key not used */
		if (!(flg & IPC_CREAT))
			err = -ENOENT;
		else
			err = ops->getnew(ns, params);
	} else {
		/* ipc object has been locked by ipc_findkey() */

		if (flg & IPC_CREAT && flg & IPC_EXCL)
			err = -EEXIST;
		else {
			err = 0;
			if (ops->more_checks)
				err = ops->more_checks(ipcp, params);
			if (!err)
				/*
				 * ipc_check_perms returns the IPC id on
				 * success
				 */
				err = ipc_check_perms(ns, ipcp, ops, params);
		}
		ipc_unlock(ipcp);
	}
	up_write(&ids->rwsem);

	return err;
}
#+end_src

~ipcget_public~ handles both creation and access for requests whose key isn't ~IPC_PRIVATE~. It *doesn't* check the IPC namespace for existing IPC objects. It's called by ~ipcget~ if the key isn't ~IPC_PRIVATE~:

#+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/util.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n625][~ipc/util.c:625@c8d2bc~]]
#+begin_src C
/**
 * ipcget - Common sys_*get() code
 * @ns: namespace
 * @ids: ipc identifier set
 * @ops: operations to be called on ipc object creation, permission checks
 *       and further checks
 * @params: the parameters needed by the previous operations.
 *
 * Common routine called by sys_msgget(), sys_semget() and sys_shmget().
 */
int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
			const struct ipc_ops *ops, struct ipc_params *params)
{
	if (params->key == IPC_PRIVATE)
		return ipcget_new(ns, ids, ops, params);
	else
		return ipcget_public(ns, ids, ops, params);
}
#+end_src

which in turn is called in the following places:

+ [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/shm.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n654][~ipc/shm.c:654@c8d2bc~]]: ~shmget~
+ [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/sem.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n604][~ipc/sem.c:604@c8d2bc~]]: ~semget~
+ [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/msg.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n265][~ipc/msg.c:265@c8d2bc~]]: ~msgget~

But ~shmget~, ~semget~, and ~msgget~ are all part of the System V IPC set, and to actually use the objects they return you need to call ~shmat~, ~semop~ / ~semtimedop~, or ~msgsnd~ / ~msgrcv~, which all only work for objects in the namespace. ~shmat~ immediately calls ~do_shmat~, which is listed above:

#+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/shm.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n1249][~ipc/shm.c:1249@c8d2bc~]]
#+begin_src C
SYSCALL_DEFINE3(shmat, int, shmid, char __user *, shmaddr, int, shmflg)
{
	unsigned long ret;
	long err;

	err = do_shmat(shmid, shmaddr, shmflg, &ret, SHMLBA);
	if (err)
		return err;
	force_successful_syscall_return();
	return (long)ret;
}
#+end_src

~semop~ calls ~semtimedop~:

#+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/sem.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n2051][~ipc/sem.c:2051@c8d2bc~]]
#+begin_src C
SYSCALL_DEFINE3(semop, int, semid, struct sembuf __user *, tsops,
		unsigned, nsops)
{
	return sys_semtimedop(semid, tsops, nsops, NULL);
}
#+end_src

#+caption:
[[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/sem.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n1816][~ipc/sem.c:1816@c8d2bc~]]
#+begin_src C
SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
		unsigned, nsops, const struct timespec __user *, timeout)
{
	/* ... */
	ns = current->nsproxy->ipc_ns;
	/* ...
	   allocate some space for things.
	   ...
	,*/
	sma = sem_obtain_object_check(ns, semid);
	/* ... */
}
#+end_src

~msgsnd~ and ~msgrcv~ immediately call ~do_msgsnd~ and ~do_msgrcv~, which are also listed above:

#+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/msg.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n743][~ipc/msg.c:743@c8d2bc~]]
#+begin_src C
SYSCALL_DEFINE4(msgsnd, int, msqid, struct msgbuf __user *, msgp, size_t, msgsz,
		int, msgflg)
{
	long mtype;

	if (get_user(mtype, &msgp->mtype))
		return -EFAULT;
	return do_msgsnd(msqid, mtype, msgp->mtext, msgsz, msgflg);
}
#+end_src

#+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/ipc/msg.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n1004][~ipc/msg.c:1004@c8d2bc~]]
#+begin_src C
SYSCALL_DEFINE5(msgrcv, int, msqid, struct msgbuf __user *, msgp, size_t, msgsz,
		long, msgtyp, int, msgflg)
{
	return do_msgrcv(msqid, msgp, msgsz, msgtyp, msgflg, do_msg_fill);
}
#+end_src

~CAP_NET_ADMIN~ lets processes create network devices; ~CAP_NET_BIND_SERVICE~ lets processes bind to low ports on those devices; ~CAP_NET_RAW~ lets processes send raw packets on those devices. Since we're going to isolate the networking with a virtual bridge, and the contained process is inside of a network namespace, these shouldn't be an issue[fn:networking-namespaces].

I was wondering whether we could recreate an existing device like ~mknod~ does, but I don't think it's possible[fn:net-device-initialization].
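
A quick way to see what a process holding these capabilities can actually reach is to list the interfaces in its network namespace. This is a minimal sketch using ~if_nameindex(3)~; it isn't the ~enumerate_net_devs.c~ used in the footnote, and the name ~list_net_devs~ is made up here.

#+begin_src C
#include <net/if.h>
#include <stdio.h>

/* Hypothetical helper: print "index: name" for every interface visible in
 * the caller's network namespace. Returns the number of interfaces printed,
 * or -1 if enumeration failed. */
int list_net_devs(FILE *out)
{
	struct if_nameindex *devs = if_nameindex();
	if (!devs)
		return -1;
	int n = 0;
	/* The array is terminated by an entry with if_index == 0. */
	for (struct if_nameindex *d = devs; d->if_index != 0; d++, n++)
		fprintf(out, "%u: %s\n", d->if_index, d->if_name);
	if_freenameindex(devs);
	return n;
}
#+end_src

Calling ~list_net_devs(stdout)~ inside a fresh network namespace should show only the loopback device, however many interfaces the host has.
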
[fn:networking-namespaces] We can see that they're effectively namespaced:
#+caption: ~enumerate_net_devs.c~
#+include: "linux-containers-in-500-loc/enumerate_net_devs.c" src C

#+begin_example
[lizzie@empress l-c-i-500-l]$sudo ./contained -m . -u 0 -c ./enumerate_net_devs
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.7npCN7...done.
=> trying a user namespace...writing /proc/1750/uid_map...writing /proc/1750/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
1: lo
=> cleaning cgroups...done.
#+end_example

[fn:net-device-initialization] Network device data structures are created inside of the kernel, not in userspace with ~mknod~. For example, ~ip link add dummy0 type dummy~ does this:
+ Opens a ~NETLINK_ROUTE~ netlink socket.
+ Sends a ~RTM_NEWLINK~ message over it.
+ Code in [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/net/core/rtnetlink.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3][~net/core/rtnetlink.c@c8d2bc~]] dispatches the message to ~rtnl_create_link~, which does this:
  #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/net/core/rtnetlink.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n2239][~net/core/rtnetlink.c:2239@c8d2bc~]]
  #+begin_src C
struct net_device *rtnl_create_link(struct net *net,
	const char *ifname, unsigned char name_assign_type,
	const struct rtnl_link_ops *ops, struct nlattr *tb[])
{
	int err;
	struct net_device *dev;
	unsigned int num_tx_queues = 1;
	unsigned int num_rx_queues = 1;
	/* ...
*/ err = -ENOMEM; dev = alloc_netdev_mqs(ops->priv_size, ifname, name_assign_type, ops->setup, num_tx_queues, num_rx_queues); if (!dev) goto err; /* ... */ } #+end_src + ~alloc_netdev_mqs~ calls the ~setup~ function: #+caption: #+begin_src C /** ,* alloc_netdev_mqs - allocate network device ,* @sizeof_priv: size of private data to allocate space for ,* @name: device name format string ,* @name_assign_type: origin of device name ,* @setup: callback to initialize device ,* @txqs: the number of TX subqueues to allocate ,* @rxqs: the number of RX subqueues to allocate ,* ,* Allocates a struct net_device with private data area for driver use ,* and performs basic initialization. Also allocates subqueue structs ,* for each queue on the device. ,*/ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name, unsigned char name_assign_type, void (*setup)(struct net_device *), unsigned int txqs, unsigned int rxqs) { struct net_device *dev; size_t alloc_size; struct net_device *p; /* ... */ setup(dev); /* ... */ } #+end_src + ~dummy_setup~ gets called, since it's the ~.setup~ of a ~rtnl_link_ops~: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/net/dummy.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n170][~drivers/net/dummy.c:170@c8d2bc~]] #+begin_src C static struct rtnl_link_ops dummy_link_ops __read_mostly = { .kind = DRV_NAME, .setup = dummy_setup, .validate = dummy_validate, }; #+end_src #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/net/dummy.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n137][~drivers/net/dummy.c:137@c8d2bc~]] #+begin_src C static void dummy_setup(struct net_device *dev) { ether_setup(dev); /* Initialize the device structure. */ dev->netdev_ops = &dummy_netdev_ops; dev->ethtool_ops = &dummy_ethtool_ops; dev->destructor = free_netdev; /* Fill in device structure with ethernet-generic values. 
*/ dev->flags |= IFF_NOARP; dev->flags &= ~IFF_MULTICAST; dev->priv_flags |= IFF_LIVE_ADDR_CHANGE | IFF_NO_QUEUE; dev->features |= NETIF_F_SG | NETIF_F_FRAGLIST; dev->features |= NETIF_F_ALL_TSO | NETIF_F_UFO; dev->features |= NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_LLTX; dev->features |= NETIF_F_GSO_ENCAP_ALL; dev->hw_features |= dev->features; dev->hw_enc_features |= dev->features; eth_hw_addr_random(dev); } #+end_src In other words, there's no equivalent of userspace major / minor device numbers for network devices. ~CAP_SYS_PTRACE~ doesn't allow ptrace across pid namespaces[fn:cap_sys_ptrace]. ~CAP_KILL~ doesn't allow signals across pid namespaces[fn:cap_kill]. [fn:cap_sys_ptrace] #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n1079][~kernel/ptrace.c:1079@c8d2bc~]]: #+begin_src C SYSCALL_DEFINE4(ptrace, long, request, long, pid, unsigned long, addr, unsigned long, data) { struct task_struct *child; long ret; if (request == PTRACE_TRACEME) { ret = ptrace_traceme(); if (!ret) arch_ptrace_attach(current); goto out; } child = ptrace_get_task_struct(pid); if (IS_ERR(child)) { ret = PTR_ERR(child); goto out; } [...] 
} #+end_src which calls ~ptrace_get_task_struct~: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n1060][~kernel/ptrace.c:1060@c8d2bc~]]: #+begin_src C static struct task_struct *ptrace_get_task_struct(pid_t pid) { struct task_struct *child; rcu_read_lock(); child = find_task_by_vpid(pid); if (child) get_task_struct(child); rcu_read_unlock(); if (!child) return ERR_PTR(-ESRCH); return child; } #+end_src ...which in turn calls ~find_task_by_vpid~: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/pid.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n459][~kernel/pid.c:459@c8d2bc~]]: #+begin_src C struct task_struct *find_task_by_vpid(pid_t vnr) { return find_task_by_pid_ns(vnr, task_active_pid_ns(current)); } #+end_src which calls ~find_task_by_pid_ns~: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/pid.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n452][~kernel/pid.c:452@c8d2bc~]]: #+begin_src C struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *ns) { RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "find_task_by_pid_ns() needs rcu_read_lock() protection"); return pid_task(find_pid_ns(nr, ns), PIDTYPE_PID); } #+end_src which, finally, calls ~find_pid_ns~. You can see here that it only finds a ~struct pid *~ that shares the pid namespace of the current task. 
#+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/pid.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n366][~kernel/pid.c:366@c8d2bc~]]: #+begin_src C struct pid *find_pid_ns(int nr, struct pid_namespace *ns) { struct upid *pnr; hlist_for_each_entry_rcu(pnr, &pid_hash[pid_hashfn(nr, ns)], pid_chain) if (pnr->nr == nr && pnr->ns == ns) return container_of(pnr, struct pid, numbers[ns->level]); return NULL; } #+end_src [fn:cap_kill] The ~kill~ syscalls call ~kill_something_info~, which follows a dense call chain ( ~kill_pid_info~ -> ~group_send_sig_info~ -> ~do_send_sig_info~ -> ~send_sig_info~ -> ~send_signal~ -> ~__send_signal~) to eventually end up in ~__send_signal~, which does respect user namespaces: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/signal.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n972][~kernel/signal.c:972@c8d2bc~]] #+begin_src C static int __send_signal(int sig, struct siginfo *info, struct task_struct *t, int group, int from_ancestor_ns) { /* ... 
*/ q = __sigqueue_alloc(sig, t, GFP_ATOMIC | __GFP_NOTRACK_FALSE_POSITIVE, override_rlimit); if (q) { list_add_tail(&q->list, &pending->list); switch ((unsigned long) info) { case (unsigned long) SEND_SIG_NOINFO: q->info.si_signo = sig; q->info.si_errno = 0; q->info.si_code = SI_USER; q->info.si_pid = task_tgid_nr_ns(current, task_active_pid_ns(t)); q->info.si_uid = from_kuid_munged(current_user_ns(), current_uid()); break; case (unsigned long) SEND_SIG_PRIV: q->info.si_signo = sig; q->info.si_errno = 0; q->info.si_code = SI_KERNEL; q->info.si_pid = 0; q->info.si_uid = 0; break; default: copy_siginfo(&q->info, info); if (from_ancestor_ns) q->info.si_pid = 0; break; } userns_fixup_signal_uid(&q->info, t); } /*...*/ } #+end_src ~CAP_SETUID~ and ~CAP_SETGID~ have similar behaviors[fn:similar-behaviors]: + ~Make arbitrary manipulations of process UIDs and GIDs and supplementary GID list~, which will only apply to pids in the namespace. + ~forge UID (GID) when passing socket credentials via UNIX domain sockets~: the mount namespace should prevent us from reading the host system's unix domain sockets. + ~write a user (group) ID mapping in a user namespace (see user_namespaces(7))~: this is ~/proc/self/uid_map~, which will be hidden inside the container. [fn:similar-behaviors] Quoting [[http://man7.org/linux/man-pages/man7/capabilities.7.html][~man 7 capabilities~]], again: #+begin_src text CAP_SETGID Make arbitrary manipulations of process GIDs and supplementary GID list; forge GID when passing socket credentials via UNIX domain sockets; write a group ID mapping in a user namespace (see user_namespaces(7)). CAP_SETUID Make arbitrary manipulations of process UIDs (setuid(2), setreuid(2), setresuid(2), setfsuid(2)); forge UID when passing socket credentials via UNIX domain sockets; write a user ID mapping in a user namespace (see user_namespaces(7)). 
#+end_src ~CAP_SETPCAP~ only lets processes add or drop capabilities they already effectively have; [[http://man7.org/linux/man-pages/man7/capabilities.7.html][~man 7 capabilities~]] says #+begin_quote If file capabilities are supported: add any capability from the calling thread's bounding set to its inheritable set; drop capabilities from the bounding set (via prctl(2) PR_CAPBSET_DROP); make changes to the securebits flags. #+end_quote We've dropped everything relevant from the bounding set, and dropping further capabilities should be harmless. ~CAP_SYS_CHROOT~ is traditionally abused by changing root to a directory with a setuid root binary and tampered-with dynamic libraries[fn:chroot-dynamic-libraries]. Additionally, it can be used to escape a chroot "jail"[fn:escaping-chroot-jail]. Neither of those should be relevant in our setup so this should be harmless. [fn:chroot-dynamic-libraries] [[https://forums.grsecurity.net/viewtopic.php?f%3D7&t%3D2522][Brad Spengler's "False Boundaries and Arbitrary Code Execution"]], again #+begin_quote CAP_SYS_CHROOT: generic: From Julien Tinnes/Chris Evans: if you have write access to the same filesystem as a suid root binary, set up a chroot environment with a backdoored libc and then execute a hardlinked suid root binary within your chroot and gain full root privileges through your backdoor #+end_quote [fn:escaping-chroot-jail] [[http://man7.org/linux/man-pages/man2/chroot.2.html][~man 2 chroot~]]: #+begin_quote This call does not change the current working directory, so that after the call '.' can be outside the tree rooted at '/'. In particular, the superuser can escape from a "chroot jail" by doing: : mkdir foo; chroot foo; cd .. 
#+end_quote [[https://forums.grsecurity.net/viewtopic.php?f%3D7&t%3D2522][Brad Spengler, in "False Boundaries and Arbitrary Code Execution"]] says that ~CAP_SYS_TTYCONFIG~ can "temporarily change the keyboard mapping of an administrator's tty via the KDSETKEYCODE ioctl to cause a different command to be executed than intended", but again this is an ~ioctl~ against a device that should be impossible to access within the mount namespace. ** Mounts The child process is in its own mount namespace, so we can unmount things that it specifically shouldn't have access to. Here's how: + Create a temporary directory, and another one inside of it. + Bind mount the user argument onto the temporary directory. + ~pivot_root~, making the bind mount our root and mounting the old root onto the inner temporary directory. + ~umount~ the old root, and remove the inner temporary directory. But first we'll remount everything with ~MS_PRIVATE~. This is mostly a convenience, so that the bind mount is invisible outside of our namespace. #+caption: =<>= = #+begin_src C :noweb-ref mounts <> int mounts(struct child_config *config) { fprintf(stderr, "=> remounting everything with MS_PRIVATE..."); if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL)) { fprintf(stderr, "failed! 
%m\n"); return -1; } fprintf(stderr, "remounted.\n"); fprintf(stderr, "=> making a temp directory and a bind mount there..."); char mount_dir[] = "/tmp/tmp.XXXXXX"; if (!mkdtemp(mount_dir)) { fprintf(stderr, "failed making a directory!\n"); return -1; } if (mount(config->mount_dir, mount_dir, NULL, MS_BIND | MS_PRIVATE, NULL)) { fprintf(stderr, "bind mount failed!\n"); return -1; } char inner_mount_dir[] = "/tmp/tmp.XXXXXX/oldroot.XXXXXX"; memcpy(inner_mount_dir, mount_dir, sizeof(mount_dir) - 1); if (!mkdtemp(inner_mount_dir)) { fprintf(stderr, "failed making the inner directory!\n"); return -1; } fprintf(stderr, "done.\n"); fprintf(stderr, "=> pivoting root..."); if (pivot_root(mount_dir, inner_mount_dir)) { fprintf(stderr, "failed!\n"); return -1; } fprintf(stderr, "done.\n"); char *old_root_dir = basename(inner_mount_dir); char old_root[sizeof(inner_mount_dir) + 1] = { "/" }; strcpy(&old_root[1], old_root_dir); fprintf(stderr, "=> unmounting %s...", old_root); if (chdir("/")) { fprintf(stderr, "chdir failed! %m\n"); return -1; } if (umount2(old_root, MNT_DETACH)) { fprintf(stderr, "umount failed! %m\n"); return -1; } if (rmdir(old_root)) { fprintf(stderr, "rmdir failed! %m\n"); return -1; } fprintf(stderr, "done.\n"); return 0; } #+end_src ~pivot_root~ is a system call that lets us swap the mount at ~/~ with another. Glibc doesn't provide a wrapper for it, but includes a prototype in the man page. I don't really understand why, but OK, we'll include our own. #+caption: =<>= = #+begin_src C :noweb-ref pivot-root int pivot_root(const char *new_root, const char *put_old) { return syscall(SYS_pivot_root, new_root, put_old); } #+end_src It's worth noting that I'm avoiding packing and unpacking containers. This is fertile ground for vulnerabilities[fn:unpackaging-containers]; I'll count on the user to ensure that the mounted directory doesn't contain trusted or sensitive files or hard links. 
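If you want a quick sanity check on a directory before mounting it, something like the following sketch could flag the obvious problems. It isn't part of ~contained.c~: ~is_suspicious~ and ~audit_mount_dir~ are hypothetical names, and the policy (flag regular files that have more than one hard link or carry setuid / setgid bits) is only my assumption about what's worth looking at.

#+begin_src C
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helpers for illustration; not part of contained.c. */

/* A regular file is flagged if it has more than one hard link (it may
   alias a file outside the tree) or carries a setuid / setgid bit. */
int is_suspicious(const struct stat *st)
{
	if (!S_ISREG(st->st_mode))
		return 0;
	return st->st_nlink > 1 || (st->st_mode & (S_ISUID | S_ISGID)) != 0;
}

/* Walks the tree rooted at dir without following symlinks. Returns the
   number of flagged files, or -1 if anything couldn't be examined. */
int audit_mount_dir(const char *dir)
{
	DIR *d = opendir(dir);
	if (!d)
		return -1;
	int found = 0;
	struct dirent *entry;
	while ((entry = readdir(d))) {
		if (!strcmp(entry->d_name, ".") || !strcmp(entry->d_name, ".."))
			continue;
		char path[PATH_MAX];
		int len = snprintf(path, sizeof(path), "%s/%s", dir, entry->d_name);
		struct stat st;
		/* lstat, not stat: we want the link itself, not its target. */
		if (len < 0 || (size_t) len >= sizeof(path) || lstat(path, &st)) {
			found = -1;
			break;
		}
		int n;
		if (S_ISDIR(st.st_mode)) {
			n = audit_mount_dir(path);
		} else {
			n = is_suspicious(&st);
			if (n)
				fprintf(stderr, "suspicious: %s\n", path);
		}
		if (n < 0) {
			found = -1;
			break;
		}
		found += n;
	}
	closedir(d);
	return found;
}
#+end_src

Calling ~audit_mount_dir(config->mount_dir)~ before ~mounts~ and refusing to continue on a nonzero result would be one way to use it. A link count above one only says that another name for the file exists somewhere on the same filesystem; it can't tell you whether that name is inside or outside the tree, so this will have false positives.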
[fn:unpackaging-containers] There have been issues with unpacking containers in Docker and LXC: #+caption: [[http://www.openwall.com/lists/oss-security/2014/11/24/5][~Docker 1.3.2 - Security Advisory {24 Nov 2014}~]] #+begin_src text ===================================================== [CVE-2014-6407] Archive extraction allowing host privilege escalation ===================================================== Severity: Critical Affects: Docker up to 1.3.1 The Docker engine, up to and including version 1.3.1, was vulnerable to extracting files to arbitrary paths on the host during ‘docker pull’ and ‘docker load’ operations. This was caused by symlink and hardlink traversals present in Docker's image extraction. This vulnerability could be leveraged to perform remote code execution and privilege escalation. #+end_src #+caption: [[http://www.openwall.com/lists/oss-security/2015/05/07/10][~Docker 1.6.1 - Security Advisory {150507}~]] #+begin_src text ==================================================================== [CVE-2015-3629] Symlink traversal on container respawn allows local privilege escalation ==================================================================== Libcontainer version 1.6.0 introduced changes which facilitated a mount namespace breakout upon respawn of a container. This allowed malicious images to write files to the host system and escape containerization. #+end_src #+caption: [[http://www.openwall.com/lists/oss-security/2015/07/22/4][~Security issues in LXC (CVE-2015-1331 and CVE-2015-1334)~]], from Tyler Hicks #+begin_src text ,* Roman Fiedler discovered a directory traversal flaw that allows arbitrary file creation as the root user. A local attacker must set up a symlink at /run/lock/lxc/var/lib/lxc/, prior to an admin ever creating an LXC container on the system. If an admin then creates a container with a name matching , the symlink will be followed and LXC will create an empty file at the symlink's target as the root user. 
- CVE-2015-1331 - Affects LXC 1.0.0 and higher - https://launchpad.net/bugs/1470842 - https://github.com/lxc/lxc/commit/72cf81f6a3404e35028567db2c99a90406e9c6e6 (master) - https://github.com/lxc/lxc/commit/61ecf69d7834921cc078e14d1b36c459ad8f91c7 (stable-1.1) - https://github.com/lxc/lxc/commit/f547349ea7ef3a6eae6965a95cb5986cd921bd99 (stable-1.0) ,* Roman Fiedler discovered a flaw that allows processes intended to be run inside of confined LXC containers to escape their AppArmor or SELinux confinement. A malicious container can create a fake proc filesystem, possibly by mounting tmpfs on top of the container's /proc, and wait for a lxc-attach to be ran from the host environment. lxc-attach incorrectly trusts the container's /proc/PID/attr/{current,exec} files to set up the AppArmor profile and SELinux domain transitions which may result in no confinement being used. - CVE-2015-1334 - Affects LXC 0.9.0 and higher - https://launchpad.net/bugs/1475050 - https://github.com/lxc/lxc/commit/5c3fcae78b63ac9dd56e36075903921bd9461f9e (master) - https://github.com/lxc/lxc/commit/659e807c8dd1525a5c94bdecc47599079fad8407 (stable-1.1) - https://github.com/lxc/lxc/commit/15ec0fd9d490dd5c8a153401360233c6ee947c24 (stable-1.0) Tyler #+end_src These are all really interesting! I want to write more about them. ** System Calls I'll be blacklisting system calls that I can demonstrate causing harm or sandbox escapes. Again, this isn't the best way to do this, but it seems like the most illustrative. [[https://github.com/docker/docker.github.io/blob/master/engine/security/seccomp.md][Docker's documentation]] and [[https://github.com/docker/docker/blob/b248de7e332b6e67b08a8981f68060e6ae629ccf/profiles/seccomp/default.json][default seccomp profile]] are reasonable sources for dangerous system calls[fn:docker-seccomp-whitelist]. They also include obsolete system calls and calls that overlap with restricted capabilities; I'll ignore those. 
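Several of the rules in this section filter on a single argument with ~SCMP_CMP_MASKED_EQ~ instead of blocking the whole call. As I understand the libseccomp documentation, that comparison matches when ~(arg & mask) == datum~. Here is a small userspace model of it, which is not libseccomp code; ~masked_eq~ and ~chmod_mode_denied~ are hypothetical names used only for illustration.

#+begin_src C
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical model of SCMP_CMP_MASKED_EQ, assuming it means
   (arg & mask) == datum. This is not libseccomp code. */
int masked_eq(uint64_t arg, uint64_t mask, uint64_t datum)
{
	return (arg & mask) == datum;
}

/* Mirrors a pair of rules of the form
     SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISUID, S_ISUID)
     SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID)
   on chmod's mode argument: deny if either bit is set, whatever the
   other permission bits are. */
int chmod_mode_denied(mode_t mode)
{
	return masked_eq(mode, S_ISUID, S_ISUID)
		|| masked_eq(mode, S_ISGID, S_ISGID);
}
#+end_src

Under this model ~chmod_mode_denied(04755)~ and ~chmod_mode_denied(02755)~ are true, while ~0755~ and the sticky-only ~01777~ pass. The same model also shows why a masked comparison is a poor fit for arguments that are plain numbers and not bit flags: any value that happens to contain all of the mask's bits will match too.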
[fn:docker-seccomp-whitelist] The Docker seccomp policy doesn't include an explicit blacklist, which makes it a little hard to follow, so I wrote code to find it. #+begin_src python :results output raw drawer :exports both #!/usr/bin/env python3 import gzip import requests import re import sys url = "https://raw.githubusercontent.com/docker/docker/5ff21add06ce0e502b41a194077daad311901996/profiles/seccomp/default.json" conditional = set() allowed = set() disallowed = set() for entry in requests.get(url).json()["syscalls"]: if entry["args"]: conditional |= set(entry["names"]) else: allowed |= set(entry["names"]) manpage = "/usr/share/man/man2/syscalls.2.gz" with gzip.open(manpage, "r") as f: ready = False for _line in f: line = _line.decode("utf-8") # table end if ready and line == ".TE\n": break match = re.match(r"\\fB(.+?)\\fP(.+)", line) if match: if match.group(1) == "System call": ready = True elif (match.group(1) not in allowed and match.group(1) not in conditional): disallowed.add(match.group(1)) print("Conditionally allowed:") for c in sorted(conditional): sys.stdout.write("~%s~, " % c) print("\n\nDisallowed:") for d in sorted(disallowed): sys.stdout.write("~%s~, " % d) sys.stdout.write("\n") #+end_src #+RESULTS: :RESULTS: Conditionally allowed: ~clone~, ~personality~, Disallowed: ~_sysctl~, ~add_key~, ~alloc_hugepages~, ~bdflush~, ~clock_adjtime~, ~clock_settime~, ~create_module~, ~free_hugepages~, ~get_kernel_syms~, ~get_mempolicy~, ~getpagesize~, ~kern_features~, ~kexec_file_load~, ~kexec_load~, ~keyctl~, ~mbind~, ~migrate_pages~, ~move_pages~, ~nfsservctl~, ~nice~, ~oldfstat~, ~oldlstat~, ~oldolduname~, ~oldstat~, ~olduname~, ~pciconfig_iobase~, ~pciconfig_read~, ~pciconfig_write~, ~perfctr~, ~perfmonctl~, ~pivot_root~, ~ppc_rtas~, ~preadv2~, ~pwritev2~, ~quotactl~, ~readdir~, ~request_key~, ~set_mempolicy~, ~setup~, ~sgetmask~, ~sigaction~, ~signal~, ~sigpending~, ~sigprocmask~, ~sigsuspend~, ~spu_create~, ~spu_run~, ~ssetmask~, ~subpage_prot~, 
~swapoff~, ~swapon~, ~sync_file_range2~, ~sysfs~, ~uselib~, ~userfaultfd~, ~ustat~, ~utrap_install~, ~vm86~, ~vm86old~ :END: *** Disallowed System Calls #+caption: =<>= += #+begin_src C :noweb-ref syscalls #define SCMP_FAIL SCMP_ACT_ERRNO(EPERM) int syscalls() { scmp_filter_ctx ctx = NULL; fprintf(stderr, "=> filtering syscalls..."); if (!(ctx = seccomp_init(SCMP_ACT_ALLOW)) #+end_src We want to prevent new setuid / setgid executables from being created, since in the absence of user namespaces the contained process could create a setuid binary that could be used by any user to get root.[fn:self-setuid] [fn:self-setuid] #+caption: ~self_setuid.c~ #+include: "linux-containers-in-500-loc/self_setuid.c" src C #+caption: ~allow_chmod.diff~ #+include: "linux-containers-in-500-loc/allow_chmod.diff" src diff #+begin_example [lizzie@empress l-c-i-500-l]$sudo ./contained -m . -u 0 -c ./self_setuid => validating Linux version...4.8.4-1-ARCH on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.EXwjdL...done. => trying a user namespace...unsupported? continuing. => switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. ++ chmod / fchmod / fchmodat failed: Operation not permitted => cleaning cgroups...done. [lizzie@empress l-c-i-500-l]$sudo ./contained.allow_chmod -m . -u 0 -c ./self_setuid => validating Linux version...4.8.4-1-ARCH on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.35HO0W...done. => trying a user namespace...unsupported? continuing. => switching to uid 0 / gid 0...done. 
=> dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. => cleaning cgroups...done. [lizzie@empress l-c-i-500-l]$./self_setuid shell sh-4.3#whoami root sh-4.3# exit [lizzie@empress l-c-i-500-l]$rm ./self_setuid #+end_example #+caption: =<>= += #+begin_src C :noweb-ref syscalls || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(chmod), 1, SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISUID, S_ISUID)) || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(chmod), 1, SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID)) || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmod), 1, SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISUID, S_ISUID)) || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmod), 1, SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID)) || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmodat), 1, SCMP_A2(SCMP_CMP_MASKED_EQ, S_ISUID, S_ISUID)) || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmodat), 1, SCMP_A2(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID)) #+end_src Allowing contained processes to start new user namespaces can allow processes to gain new (albeit limited) capabilities, so we prevent it. #+caption: =<>= += #+begin_src C :noweb-ref syscalls || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(unshare), 1, SCMP_A0(SCMP_CMP_MASKED_EQ, CLONE_NEWUSER, CLONE_NEWUSER)) || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(clone), 1, SCMP_A0(SCMP_CMP_MASKED_EQ, CLONE_NEWUSER, CLONE_NEWUSER)) #+end_src ~TIOCSTI~ allows contained processes to write to the controlling terminal[fn:tiocsti]. [fn:tiocsti] I heard about this pretty recently because of CVE-2016-7545, an SELinux bug: #+caption: [[http://www.openwall.com/lists/oss-security/2016/09/25/1][~CVE-2016-7545 -- SELinux sandbox escape~]] from Federico Bento #+begin_src text Hi, When executing a program via the SELinux sandbox, the nonpriv session can escape to the parent session by using the TIOCSTI ioctl to push characters into the terminal's input buffer, allowing an attacker to escape the sandbox. 
$ cat test.c #include #include int main() { char *cmd = "id\n"; while(*cmd) ioctl(0, TIOCSTI, cmd++); execlp("/bin/id", "id", NULL); } $ gcc test.c -o test $ /bin/sandbox ./test id uid=1000 gid=1000 groups=1000 context=unconfined_u:unconfined_r:sandbox_t:s0:c47,c176 $ id <------ did not type this uid=1000(saken) gid=1000(saken) groups=1000(saken) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 Bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1378577 Upstream fix: https://marc.info/?l=selinux&m=147465160112766&w=2 https://marc.info/?l=selinux&m=147466045909969&w=2 https://github.com/SELinuxProject/selinux/commit/acca96a135a4d2a028ba9b636886af99c0915379 Federico Bento. #+end_src #+caption: ~tiocsti.c~ #+include: "linux-containers-in-500-loc/tiocsti.c" src C #+caption: ~allow_tiocsti.diff~ #+include: "linux-containers-in-500-loc/allow_tiocsti.diff" src diff #+begin_example [lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c ./tiocsti => validating Linux version...4.7.10.201610222037-1-grsec on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.P5QATt...done. => trying a user namespace...writing /proc/1819/uid_map...writing /proc/1819/gid_map...done. => switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. ++ ioctl failed: Operation not permitted => cleaning cgroups...done. [lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_tiocsti -m . -u 0 -c ./tiocsti => validating Linux version...4.7.10.201610222037-1-grsec on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.J9mulv...done. 
=> trying a user namespace...writing /proc/1865/uid_map...writing /proc/1865/gid_map...done. => switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. id => cleaning cgroups...done. [lizzie@empress l-c-i-500-l]$ uid=1000(lizzie) gid=1000(lizzie) groups=1000(lizzie) #+end_example #+caption: =<>= += #+begin_src C :noweb-ref syscalls || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(ioctl), 1, SCMP_A1(SCMP_CMP_MASKED_EQ, TIOCSTI, TIOCSTI)) #+end_src The kernel keyring system isn't namespaced.[fn:kernel-keyring] [fn:kernel-keyring] There's a notion of "user keyrings", that I believe are user-namespaced, but that's it. #+caption: [[http://man7.org/linux/man-pages/man7/keyrings.7.html][~man 7 keyrings~]] #+begin_src text User keyrings Each UID known to the kernel has a record that contains two keyrings: The user keyring and the user session keyring. These exist for as long as the UID record in the kernel exists. A link to the user keyring is placed in a new session keyring by pam_keyinit when a new login session is initiated. #+end_src #+caption: =<>= += #+begin_src C :noweb-ref syscalls || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(keyctl), 0) || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(add_key), 0) || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(request_key), 0) #+end_src Before Linux 4.8, ~ptrace~ totally breaks seccomp[fn:ptrace-seccomp]. [fn:ptrace-seccomp] [[http://man7.org/linux/man-pages/man2/seccomp.2.html][~man 2 seccomp~]] says: #+begin_quote The seccomp check will not be run again after the tracer is notified. (This means that seccomp-based sandboxes must not allow use of ptrace(2)--even of other sandboxed processes--without extreme care; ptracers can use this mechanism to escape from the seccomp sandbox.) 
#+end_quote Here's an example (remember that our seccomp profile should prevent ~chmod(x, S_ISUID)~): #+caption: ~ptrace_breaks_seccomp.c~ #+include: "linux-containers-in-500-loc/ptrace_breaks_seccomp.c" src C #+caption: ~allow_ptrace.diff~ #+include: "linux-containers-in-500-loc/allow_ptrace.diff" src diff #+begin_example [lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c ./ptrace_breaks_seccomp => validating Linux version...4.7.6-1-ARCH on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.EiZRVH...done. => trying a user namespace...unsupported? continuing. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. ++ child stopping itself. ++ ptrace failed: Operation not permitted => cleaning cgroups...done. [lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_ptrace -m . -u 0 -c ./ptrace_breaks_seccomp => validating Linux version...4.7.6-1-ARCH on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.ThyjKm...done. => trying a user namespace...unsupported? continuing. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. ++ child stopping itself. ++ child continued ++ got MAGIC_SYSCALL! ++ chmod succeeded, child finished. ++ finished waiting. => cleaning cgroups...done. 
[lizzie@empress l-c-i-500-l]$ ls -lh ptrace_breaks_seccomp -rws------ 1 lizzie lizzie 793K Oct 11 14:55 ptrace_breaks_seccomp #+end_example This seems to have been fixed in June by Kees Cook: #+caption: [[https://lkml.org/lkml/2016/6/9/627][~run seccomp after ptrace~]] on LKML #+begin_src text There has been a long-standing (and documented) issue with seccomp where ptrace can be used to change a syscall out from under seccomp. This is a problem for containers and other wider seccomp filtered environments where ptrace needs to remain available, as it allows for an escape of the seccomp filter. Since the ptrace attack surface is available for any allowed syscall, moving seccomp after ptrace doesn't increase the actually available attack surface. And this actually improves tracing since, for example, tracers will be notified of syscall entry before seccomp sends a SIGSYS, which makes debugging filters much easier. The per-architecture changes do make one (hopefully small) semantic change, which is that since ptrace comes first, it may request a syscall be skipped. Running seccomp after this doesn't make sense, so if ptrace wants to skip a syscall, it will bail out early similarly to how seccomp was. This means that skipped syscalls will not be fed through audit, though that likely means we're actually avoiding noise this way. This series first cleans up seccomp to remove the now unneeded two-phase entry, fixes the SECCOMP_RET_TRACE hole (same as the ptrace hole above), and then reorders seccomp after ptrace on each architecture. Thanks, -Kees #+end_src This patchset made it into the kernel at 4.8. 
See for example [[https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=93e35efb8de45393cf61ed07f7b407629bf698ea][93e35e]]: #+begin_example [lizzie@empress linux-stable]$ git branch --contains 93e35efb8de45393cf61ed07f7b407629bf698ea * linux-4.8.y master #+end_example #+caption: =<>= += #+begin_src C :noweb-ref syscalls || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(ptrace), 0) #+end_src These system calls let processes assign NUMA nodes. I don't have anything specific in mind, but I could see these being used to deny service to some other NUMA-aware application on the host. #+caption: =<>= += #+begin_src C :noweb-ref syscalls || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(mbind), 0) || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(migrate_pages), 0) || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(move_pages), 0) || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(set_mempolicy), 0) #+end_src ~userfaultfd~ allows userspace to handle page faults[fn:userfaultfd]. It doesn't require any privileges, so in theory it should be safe to be called by an unprivileged user. But it can be used to pause execution in the kernel by triggering page faults in system calls. This is an important part of some kernel exploits[fn:userfaultfd-races]. It's only rarely used legitimately, so I'll disable it. [fn:userfaultfd] This is, as far as I can tell, only documented in the kernel tree: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/userfaultfd.txt?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3][~Documentation/vm/userfaultfd.txt@c8d2bc~]] #+begin_src text = Userfaultfd = == Objective == Userfaults allow the implementation of on-demand paging from userland and more generally they allow userland to take control of various memory page faults, something otherwise only the kernel code could do. [...] 
= API == When first opened the userfaultfd must be enabled invoking the UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or a later API version) which will specify the read/POLLIN protocol userland intends to speak on the UFFD and the uffdio_api.features userland requires. The UFFDIO_API ioctl if successful (i.e. if the requested uffdio_api.api is spoken also by the running kernel and the requested features are going to be enabled) will return into uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of respectively all the available features of the read(2) protocol and the generic ioctl available. Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should be invoked (if present in the returned uffdio_api.ioctls bitmask) to register a memory range in the userfaultfd by setting the uffdio_register structure accordingly. The uffdio_register.mode bitmask will specify to the kernel which kind of faults to track for the range (UFFDIO_REGISTER_MODE_MISSING would track missing pages). The UFFDIO_REGISTER ioctl will return the uffdio_register.ioctls bitmask of ioctls that are suitable to resolve userfaults on the range registered. Not all ioctls will necessarily be supported for all memory types depending on the underlying virtual memory backend (anonymous memory vs tmpfs vs real filebacked mappings). Userland can use the uffdio_register.ioctls to manage the virtual address space in the background (to add or potentially also remove memory from the userfaultfd registered range). This means a userfault could be triggering just before userland maps in the background the user-faulted page. The primary ioctl to resolve userfaults is UFFDIO_COPY. That atomically copies a page into the userfault registered range and wakes up the blocked userfaults (unless uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to UFFDIO_COPY. 
They're atomic as in guaranteeing that nothing can see an half copied page since it'll keep userfaulting until the copy has finished. #+end_src [fn:userfaultfd-races] Jann Horn described this to me, [[https://bugs.chromium.org/p/project-zero/issues/detail?id%3D808][and linked to his vulnerability and exploit]]: #+begin_quote In order to make exploitation more reliable, the attacker should be able to pause code execution in the kernel between the writability check of the target file and the actual write operation. This can be done by abusing the writev() syscall and FUSE: The attacker mounts a FUSE filesystem that artificially delays read accesses, then mmap()s a file containing a struct iovec from that FUSE filesystem and passes the result of mmap() to writev(). (Another way to do this would be to use the userfaultfd() syscall.) #+end_quote It was also used by [[https://cyseclabs.com/blog/cve-2016-6187-heap-off-by-one-exploit][Vitaly Nikolenko in his proof-of-concept for CVE-2016-6187]]: #+begin_quote [...] If we could overwrite the cleanup function pointer (remember that this object is now allocated in user space), then we'll have arbitrary code execution with CPL=0. The only problem is that subprocess_info object allocation and freeing happens on the same path. One way to modify the object's function pointer is to somehow suspend the execution before info->cleanup)(info) gets called and set the function pointer to our privilege escalation payload. I could have found other objects of the same size with two "separate" paths for allocation and function triggering but I needed a reason to try userfaultfd() and the page splitting idea. The userfaultfd syscall can be used to handle page faults in user space. We can allocate a page in user space and set up a handler (as a separate thread); when this page is accessed either for reading or writing, execution will be transferred to the user-space handler to deal with the page fault. 
There's nothing new here and this was mentioned by [[https://bugs.chromium.org/p/project-zero/issues/detail?id%3D808][Jann Hornh]] [...]. + Allocate two consecutive pages, split the object over these two pages (as before) and set up the page handler for the second page. + When the user-space PF is triggered by memset, set up another user-space PF handler but for the first page. + The next user-space PF will be triggered when object variables (located in the first page) get initialised in call_usermodehelper_setup. At this point, set up another PF for the second page. + Finally, the last user-space PF handler can modify the cleanup function pointer (by setting it to our privilege escalation payload or a ROP chain) and set the path member to 0 (since these members are all located in the first page and already initialised). Setting up user-space PF handlers for already "page-faulted" pages can be accomplished by munmapping/mapping these pages again and then passing them to userfaultfd(). The PoC for 4.5.1 can be found [[https://cyseclabs.com/exploits/matreshka.c][here]]. There's nothing specific to the kernel version though (it should work on all vulnerable kernels). There's no privilege escalation payload but the PoC will execute instructions at the user-space address 0xdeadbeef. #+end_quote #+caption: =<>= += #+begin_src C :noweb-ref syscalls || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(userfaultfd), 0) #+end_src I was initially worried about ~perf_event_open~ because the [[https://github.com/docker/docker.github.io/blob/master/engine/security/seccomp.md][Docker documentation says]] it "could leak a lot of information on the host", but it can't be used in our system to see information for out-of-namespace processes[fn:perf_event_open]. But, if ~/proc/sys/kernel/perf_event_paranoid~ is less than 2, it can be used to discover kernel addresses and possibly uninitialized memory. 
2 is the default since 4.6, but it can be changed, and relying on it seems like a bad idea[fn:paranoid-46]. [fn:perf_event_open] #+caption: [[http://man7.org/linux/man-pages/man2/perf_event_open.2.html][~man 2 perf_event_open~]] #+begin_src text PERF_EVENT_OPEN(2) -- 2016-07-17 -- Linux -- Linux Programmer's Manual NAME perf_event_open - set up performance monitoring SYNOPSIS #include #include int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu, int group_fd, unsigned long flags); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION [...] Arguments The pid and cpu arguments allow specifying which process and CPU to monitor: pid == 0 and cpu == -1 This measures the calling process/thread on any CPU. pid == 0 and cpu >= 0 This measures the calling process/thread only when running on the specified CPU. pid > 0 and cpu == -1 This measures the specified process/thread on any CPU. pid > 0 and cpu >= 0 This measures the specified process/thread only when running on the specified CPU. pid == -1 and cpu >= 0 This measures all processes/threads on the specified CPU. This requires CAP_SYS_ADMIN capability or a /proc/sys/kernel/perf_event_paranoid value of less than 1. pid == -1 and cpu == -1 This setting is invalid and will return an error. #+end_src If a pid is specified, the corresponding process is found within the namespace: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/events/core.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n9376][~kernel/events/core.c:9376@c8d2bc~]] #+begin_src C /** ,* sys_perf_event_open - open a performance event, associate it to a task/cpu ,* ,* @attr_uptr: event_id type attributes for monitoring/sampling ,* @pid: target pid ,* @cpu: target cpu ,* @group_fd: group leader event fd ,*/ SYSCALL_DEFINE5(perf_event_open, struct perf_event_attr __user *, attr_uptr, pid_t, pid, int, cpu, int, group_fd, unsigned long, flags) { /* ...
*/ if (pid != -1 && !(flags & PERF_FLAG_PID_CGROUP)) { task = find_lively_task_by_vpid(pid); if (IS_ERR(task)) { err = PTR_ERR(task); goto err_group_fd; } } /* ... */ } #+end_src #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/events/core.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n3621][~kernel/events/core.c:3621@c8d2bc~]] #+begin_src C static struct task_struct * find_lively_task_by_vpid(pid_t vpid) { struct task_struct *task; rcu_read_lock(); if (!vpid) task = current; else task = find_task_by_vpid(vpid); if (task) get_task_struct(task); rcu_read_unlock(); if (!task) return ERR_PTR(-ESRCH); return task; } #+end_src #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/pid.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n459][~kernel/pid.c:459@c8d2bc~]] #+begin_src C struct task_struct *find_task_by_vpid(pid_t vnr) { return find_task_by_pid_ns(vnr, task_active_pid_ns(current)); } #+end_src [fn:paranoid-46] The relevant commit is [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id%3D0161028b7c8aebef64194d3d73e43bc3b53b5c66][~0161028~]], whose commit message gives a good description of the problems: #+begin_src diff commit 0161028b7c8aebef64194d3d73e43bc3b53b5c66 Author: Andy Lutomirski Date: Mon May 9 15:48:51 2016 -0700 perf/core: Change the default paranoia level to 2 Allowing unprivileged kernel profiling lets any user dump follow kernel control flow and dump kernel registers. This most likely allows trivial kASLR bypassing, and it may allow other mischief as well. (Off the top of my head, the PERF_SAMPLE_REGS_INTR output during /dev/urandom reads could be quite interesting.)
Signed-off-by: Andy Lutomirski Acked-by: Kees Cook Signed-off-by: Linus Torvalds diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 57653a4..fcddfd5 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -645,7 +645,7 @@ allowed to execute. perf_event_paranoid: Controls use of the performance events system by unprivileged -users (without CAP_SYS_ADMIN). The default value is 1. +users (without CAP_SYS_ADMIN). The default value is 2. -1: Allow use of (almost) all events by all users >=0: Disallow raw tracepoint access by users without CAP_IOC_LOCK diff --git a/kernel/events/core.c b/kernel/events/core.c index 4e2ebf6..c0ded24 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -351,7 +351,7 @@ static struct srcu_struct pmus_srcu; ,* 1 - disallow cpu events for unpriv ,* 2 - disallow kernel profiling for unpriv ,*/ -int sysctl_perf_event_paranoid __read_mostly = 1; +int sysctl_perf_event_paranoid __read_mostly = 2; /* Minimum for 512 kiB + 1 user control page */ #+end_src This is included in 4.6: #+begin_example [lizzie@empress linux]$ git tag --contains 0161028b7c8aebef64194d3d73e43bc3b53b5c66 v4.6 v4.7 v4.7-rc1 v4.7-rc2 v4.7-rc3 v4.7-rc4 v4.7-rc5 v4.7-rc6 v4.7-rc7 v4.8 v4.8-rc1 v4.8-rc2 v4.8-rc3 v4.8-rc4 v4.8-rc5 v4.8-rc6 v4.8-rc7 v4.8-rc8 #+end_example Thanks to Jann Horn for pointing this out. #+caption: =<<syscalls>>= += #+begin_src C :noweb-ref syscalls || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(perf_event_open), 0) #+end_src We'll set ~PR_SET_NO_NEW_PRIVS~ to 0. The name is a little vague: it specifically prevents ~setuid~ and ~setcap~'d binaries from being executed with their additional privileges. This has some security benefits (it makes it harder for an unprivileged user in-container to exploit a vulnerability in a setuid or setcap executable to become in-container root, for example).
But it's a little weird, and means that, for example, ~ping~ won't work in a container for an unprivileged user[fn:pr_set_no_new_privs]. [fn:pr_set_no_new_privs] [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/prctl/no_new_privs.txt?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3][~Documentation/prctl/no_new_privs.txt@c8d2bc~]] #+begin_quote The execve system call can grant a newly-started program privileges that its parent did not have. The most obvious examples are setuid/setgid programs and file capabilities. [...] Any task can set no_new_privs. Once the bit is set, it is inherited across fork, clone, and execve and cannot be unset. With no_new_privs set, execve promises not to grant the privilege to do anything that could not have been done without the execve call. #+end_quote #+caption: [[http://man7.org/linux/man-pages/man2/seccomp.2.html][~man 2 seccomp~]] #+begin_src text In order to use the SECCOMP_SET_MODE_FILTER operation, either the caller must have the CAP_SYS_ADMIN capability in its user namespace, or the thread must already have the no_new_privs bit set. If that bit was not already set by an ancestor of this thread, the thread must make the following call: prctl(PR_SET_NO_NEW_PRIVS, 1); Otherwise, the SECCOMP_SET_MODE_FILTER operation will fail and return EACCES in errno. This requirement ensures that an unprivileged process cannot apply a malicious filter and then invoke a set-user-ID or other privileged program using execve(2), thus potentially compromising that program. (Such a malicious filter might, for example, cause an attempt to use setuid(2) to set the caller's user IDs to non-zero values to instead return 0 without actually making the system call. Thus, the program might be tricked into retaining superuser privileges in circumstances where it is possible to influence it to do dangerous things because it did not actually drop privileges.) #+end_src It took me a while to internalize this behavior. 
My impression was that without ~PR_SET_NO_NEW_PRIVS~, seccomp filters would be dropped across a ~setuid~ exec. This would lead to an easy way to escape ~seccomp~: + Create a setuid executable that calls some filtered syscall. + Become a non-root user. + Execute that setuid executable. But that's actually not the case. Instead, you just can't set seccomp filters unless you have one of the following: + ~PR_SET_NO_NEW_PRIVS~ == 1 + ~CAP_SYS_ADMIN~ and so libseccomp sets ~PR_SET_NO_NEW_PRIVS~ by default. Here's the code I thought would work: #+caption: ~setuidd_lower_reexec_and_escape.c~ #+include: "linux-containers-in-500-loc/setuidd_lower_reexec_and_escape.c" src C but it doesn't: #+begin_example [lizzie@empress l-c-i-500-l]$sudo chown root setuidd_lower_reexec_and_escape [lizzie@empress l-c-i-500-l]$sudo chmod 4007 setuidd_lower_reexec_and_escape [lizzie@empress l-c-i-500-l]$sudo ./contained -m . -u 0 -c ./setuidd_lower_reexec_and_escape => validating Linux version...4.7.10.201610222037-1-grsec on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.ZM2vnz...done. => trying a user namespace...writing /proc/2095/uid_map...writing /proc/2095/gid_map...done. => switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. ++ we're 99/99/99. ++ ioctl failed: Operation not permitted => cleaning cgroups...done. #+end_example Here's the code responsible for that check: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/seccomp.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n340][~kernel/seccomp.c:340@c8d2bc~]] #+begin_src C /** ,* seccomp_prepare_filter: Prepares a seccomp filter for use. ,* @fprog: BPF program to install ,* ,* Returns filter on success or an ERR_PTR on failure.
,*/ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) { struct seccomp_filter *sfilter; int ret; const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) return ERR_PTR(-EINVAL); BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter)); /* ,* Installing a seccomp filter requires that the task has ,* CAP_SYS_ADMIN in its namespace or be running with no_new_privs. ,* This avoids scenarios where unprivileged tasks can affect the ,* behavior of privileged children. ,*/ if (!task_no_new_privs(current) && security_capable_noaudit(current_cred(), current_user_ns(), CAP_SYS_ADMIN) != 0) return ERR_PTR(-EACCES); /* Allocate a new seccomp_filter */ sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN); if (!sfilter) return ERR_PTR(-ENOMEM); ret = bpf_prog_create_from_user(&sfilter->prog, fprog, seccomp_check_filter, save_orig); if (ret < 0) { kfree(sfilter); return ERR_PTR(ret); } atomic_set(&sfilter->usage, 1); return sfilter; } #+end_src and the code that unconditionally propagates seccomp filters across exec: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#1268][~kernel/fork.c:1268@c8d2bc~]] #+begin_src C static void copy_seccomp(struct task_struct *p) { #ifdef CONFIG_SECCOMP /* ,* Must be called with sighand->lock held, which is common to ,* all threads in the group. Holding cred_guard_mutex is not ,* needed because this new task is not yet running and cannot ,* be racing exec. ,*/ assert_spin_locked(&current->sighand->siglock); /* Ref-count the new filter user, and assign it. */ get_seccomp_filter(current); p->seccomp = current->seccomp; /* ,* Explicitly enable no_new_privs here in case it got set ,* between the task_struct being duplicated and holding the ,* sighand lock. The seccomp state and nnp must be in sync.
,*/ if (task_no_new_privs(current)) task_set_no_new_privs(p); /* ,* If the parent gained a seccomp mode after copying thread ,* flags and between before we held the sighand lock, we have ,* to manually enable the seccomp thread flag here. ,*/ if (p->seccomp.mode != SECCOMP_MODE_DISABLED) set_tsk_thread_flag(p, TIF_SECCOMP); #endif } #+end_src (called by ~copy_process~ in [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3][~kernel/fork.c@c8d2bc~]]). #+caption: =<<syscalls>>= += #+begin_src C :noweb-ref syscalls || seccomp_attr_set(ctx, SCMP_FLTATR_CTL_NNP, 0) #+end_src And we'll actually apply it to the process, and release the context. #+caption: =<<syscalls>>= += #+begin_src C :noweb-ref syscalls || seccomp_load(ctx)) { if (ctx) seccomp_release(ctx); fprintf(stderr, "failed: %m\n"); return 1; } seccomp_release(ctx); fprintf(stderr, "done.\n"); return 0; } #+end_src *** Allowed System Calls Here are the system calls that are disallowed by the default Docker policy but permitted by this code: ~_sysctl~ is obsolete and disabled by default[fn:_sysctl]. ~alloc_hugepages~ and ~free_hugepages~ [fn:alloc_hugepages], ~bdflush~ [fn:bdflush], ~create_module~ [fn:create_module], ~nfsservctl~ [fn:nfsservctl], ~perfctr~ [fn:perfctr], ~get_kernel_syms~ [fn:get_kernel_syms], and ~setup~ [fn:setup-syscall] are not present on modern Linux. ~clock_adjtime~, ~clock_settime~ [fn:clock_settime], and ~adjtime~ [fn:adjtime] depend on ~CAP_SYS_TIME~. ~pciconfig_read~ and ~pciconfig_write~ [fn:pci-etc] and all of the side-effecting operations of ~quotactl~ [fn:quotactl] are prevented by ~CAP_SYS_ADMIN~. ~get_mempolicy~ and ~getpagesize~ reveal information about the memory layout of the system, but they can be made by unprivileged processes, and are probably harmless. ~pciconfig_iobase~ can be made by unprivileged processes, and reveals information about PCI devices.
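Several of the calls in this section are left unfiltered only because the capability they need is gone. Here's a minimal sketch for checking that from inside a process, using ~prctl(PR_CAPBSET_READ, ...)~ to query the capability bounding set. The helper names ~cap_in_bounding_set~ and ~count_gating_caps_present~ are made up for illustration and aren't part of ~contained.c~, and the sketch assumes these capabilities were dropped from the bounding set earlier.

#+begin_src C
#include <linux/capability.h>
#include <stddef.h>
#include <sys/prctl.h>

/* Hypothetical helper: 1 if `cap` is in the bounding set, 0 if it has
   been dropped, -1 if the kernel rejects the query. */
static int cap_in_bounding_set(unsigned long cap)
{
	return prctl(PR_CAPBSET_READ, cap, 0UL, 0UL, 0UL);
}

/* The capabilities this section leans on to gate otherwise-allowed
   system calls. */
static const unsigned long gating_caps[] = {
	CAP_SYS_TIME, CAP_SYS_ADMIN, CAP_SYS_BOOT, CAP_SYS_NICE,
};

/* Counts how many of them are still in the bounding set. Assuming all
   four were dropped, 0 is what we'd hope to see in-container. */
static int count_gating_caps_present(void)
{
	int present = 0;
	for (size_t i = 0; i < sizeof(gating_caps) / sizeof(*gating_caps); i++)
		if (cap_in_bounding_set(gating_caps[i]) == 1)
			present++;
	return present;
}
#+end_src

This only looks at the bounding set, not the effective set, so it answers "could an exec'd binary ever get this capability back?" rather than "does this process hold it right now?".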
~ustat~ [fn:ustat] and ~sysfs~ [fn:sysfs] leak some information about the filesystems, but are nothing that I see as critical. ~uselib~ is more-or-less obsolete, but is just used for loading a shared library in userspace [fn:uselib]. ~sync_file_range2~ is ~sync_file_range~ with swapped argument order[fn:sync_file_range2]. ~readdir~ is mostly obsolete, but probably harmless[fn:readdir]. ~kexec_file_load~ and ~kexec_load~ are prevented by ~CAP_SYS_BOOT~ [fn:kexec-etc]. ~nice~ can only be used to lower priority without ~CAP_SYS_NICE~ [fn:nice-again]. ~oldfstat~, ~oldlstat~, ~oldolduname~, ~oldstat~, and ~olduname~ are just older versions of their respective functions. I expect them to have the same security properties as the modern ones. ~perfmonctl~ [fn:perfmonctl] is only available on IA-64. ~ppc_rtas~ [fn:ppc_rtas], ~spu_create~ [fn:spu_create] and ~spu_run~ [fn:spu_run], and ~subpage_prot~ [fn:subpage_prot] are only available on PowerPC. ~utrap_install~ is only available on Sparc[fn:utrap_install]. ~kern_features~ is only available on Sparc64, and should be harmless anyway[fn:kern_features]. I don't believe ~pivot_root~ is a problem in our setup (but it could probably be used to circumvent path-based MAC). ~preadv2~ and ~pwritev2~ are just extensions to ~preadv~ and ~pwritev~ / ~readv~ and ~writev~, which are "scatter input" / "gather output" extensions to ~read~ and ~write~ [fn:preadv2-etc]. [fn:sysfs] #+caption: [[http://man7.org/linux/man-pages/man2/sysfs.2.html][~man 2 sysfs~]] #+begin_src text SYSFS(2) -- 2010-06-27 -- Linux -- Linux Programmer's Manual NAME sysfs - get filesystem type information SYNOPSIS int sysfs(int option, const char *fsname); int sysfs(int option, unsigned int fs_index, char *buf); int sysfs(int option); DESCRIPTION sysfs() returns information about the filesystem types currently present in the kernel.
The specific form of the sysfs() call and the information returned depends on the option in effect: 1 Translate the filesystem identifier string fsname into a filesystem type index. 2 Translate the filesystem type index fs_index into a null-terminated filesystem identifier string. This string will be written to the buffer pointed to by buf. Make sure that buf has enough space to accept the string. 3 Return the total number of filesystem types currently present in the kernel. The numbering of the filesystem type indexes begins with zero. #+end_src [fn:ustat] #+caption: [[http://man7.org/linux/man-pages/man2/ustat.2.html][~man 2 ustat~]] #+begin_src text USTAT(2) -- 2003-08-04 -- Linux -- Linux Programmer's Manual NAME ustat - get filesystem statistics SYNOPSIS #include #include /* libc[45] */ #include /* glibc2 */ int ustat(dev_t dev, struct ustat *ubuf); DESCRIPTION ustat() returns information about a mounted filesystem. dev is a device number identifying a device containing a mounted filesystem. ubuf is a pointer to a ustat structure that contains the following members: daddr_t f_tfree; /* Total free blocks */ ino_t f_tinode; /* Number of free inodes */ char f_fname[6]; /* Filsys name */ char f_fpack[6]; /* Filsys pack name */ The last two fields, f_fname and f_fpack, are not implemented and will always be filled with null bytes ('\0'). #+end_src [fn:kern_features] #+caption: [[http://man7.org/linux/man-pages/man2/syscalls.2.html][~man 2 syscalls~]] #+begin_src text kern_features(2) 3.7 Sparc64 #+end_src This is pretty vague, so I looked at the source. 
It's only mentioned in a Sparc64-specific file: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/sparc/kernel/sys_sparc_64.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n648][~arch/sparc/kernel/sys_sparc_64.c:648@c8d2bc~]] #+begin_src C asmlinkage long sys_kern_features(void) { return KERN_FEATURE_MIXED_MODE_STACK; } #+end_src [fn:get_kernel_syms] #+caption: [[http://man7.org/linux/man-pages/man2/get_kernel_syms.2.html][~man 2 get_kernel_syms~]] #+begin_src text GET_KERNEL_SYMS(2) -- 2016-10-08 -- Linux -- Linux Programmer's Manual NAME get_kernel_syms - retrieve exported kernel and module symbols SYNOPSIS #include int get_kernel_syms(struct kernel_sym *table); Note: No declaration of this system call is provided in glibc headers; see NOTES. DESCRIPTION Note: This system call is present only in kernels before Linux 2.6. #+end_src [fn:readdir] #+caption: [[http://man7.org/linux/man-pages/man2/readdir.2.html][~man 2 readdir~]] #+begin_src text READDIR(2) -- 2013-06-21 -- Linux -- Linux Programmer's Manual NAME readdir - read directory entry SYNOPSIS int readdir(unsigned int fd, struct old_linux_dirent *dirp, unsigned int count); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION This is not the function you are interested in. Look at readdir(3) for the POSIX conforming C library interface. This page documents the bare kernel system call interface, which is superseded by getdents(2). readdir() reads one old_linux_dirent structure from the directory referred to by the file descriptor fd into the buffer pointed to by dirp. The argument count is ignored; at most one old_linux_dirent structure is read. #+end_src [fn:sync_file_range2] #+caption: [[http://man7.org/linux/man-pages/man2/sync_file_range2.2.html][~man 2 sync_file_range2~]] #+begin_src text SYNC_FILE_RANGE(2) -- 2014-08-19 -- Linux -- Linux Programmer's Manual NAME sync_file_range - sync a file segment with disk [...]
NOTES sync_file_range2() Some architectures (e.g., PowerPC, ARM) need 64-bit arguments to be aligned in a suitable pair of registers. On such architectures, the call signature of sync_file_range() shown in the SYNOPSIS would force a register to be wasted as padding between the fd and offset arguments. (See syscall(2) for details.) Therefore, these architectures define a different system call that orders the arguments suitably: int sync_file_range2(int fd, unsigned int flags, off64_t offset, off64_t nbytes); The behavior of this system call is otherwise exactly the same as sync_file_range(). #+end_src [fn:utrap_install] #+caption: [[http://man7.org/linux/man-pages/man2/syscalls.2.html][~man 2 syscalls~]] #+begin_src text utrap_install(2) 2.2 Sparc only #+end_src [fn:uselib] #+caption: [[http://man7.org/linux/man-pages/man2/uselib.2.html][~man 2 uselib~]] #+begin_src text USELIB(2) -- 2016-03-15 -- Linux -- Linux Programmer's Manual NAME uselib - load shared library [..] NOTES [...] Since Linux 3.15, this system call is available only when the kernel is configured with the CONFIG_USELIB option. #+end_src [fn:subpage_prot] #+caption: [[http://man7.org/linux/man-pages/man2/subpage_prot.2.html][~man 2 subpage_prot~]] #+begin_src text SUBPAGE_PROT(2) -- 2012-07-13 -- Linux -- Linux Programmer's Manual NAME subpage_prot - define a subpage protection for an address range [...] VERSIONS This system call is provided on the PowerPC architecture since Linux 2.6.25. The system call is provided only if the kernel is configured with CONFIG_PPC_64K_PAGES. No library support is provided. 
#+end_src [fn:spu_create] #+caption: [[http://man7.org/linux/man-pages/man2/spu_create.2.html][~man 2 spu_create~]] #+begin_src text SPU_CREATE(2) -- 2015-12-28 -- Linux -- Linux Programmer's Manual NAME spu_create - create a new spu context SYNOPSIS #include #include int spu_create(const char *pathname, int flags, mode_t mode); int spu_create(const char *pathname, int flags, mode_t mode, int neighbor_fd); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION The spu_create() system call is used on PowerPC machines that implement the Cell Broadband Engine Architecture in order to access Synergistic Processor Units (SPUs). It creates a new logical context for an SPU in pathname and returns a file descriptor associated with it. pathname must refer to a nonexistent directory in the mount point of the SPU filesystem (spufs). If spu_create() is successful, a directory is created at pathname and it is populated with the files described in spufs(7). #+end_src [fn:spu_run] #+caption: [[http://man7.org/linux/man-pages/man2/spu_run.2.html][~man 2 spu_run~]] #+begin_src text SPU_RUN(2) -- 2012-08-05 -- Linux -- Linux Programmer's Manual NAME spu_run - execute an SPU context SYNOPSIS #include int spu_run(int fd, unsigned int *npc, unsigned int *event); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION The spu_run() system call is used on PowerPC machines that implement the Cell Broadband Engine Architecture in order to access Synergistic Processor Units (SPUs). The fd argument is a file descriptor returned by spu_create(2) that refers to a specific SPU context. When the context gets scheduled to a physical SPU, it starts execution at the instruction pointer passed in npc. #+end_src [fn:setup-syscall] #+caption: [[http://man7.org/linux/man-pages/man2/setup.2.html][~man 2 setup~]] #+begin_src text SETUP(2) -- 2008-12-03 -- Linux -- Linux Programmer's Manual NAME setup - setup devices and filesystems, mount root filesystem [...] 
VERSIONS Since Linux 2.1.121, no such function exists anymore. #+end_src [fn:quotactl] Too many to list, but see [[http://man7.org/linux/man-pages/man2/quotactl.2.html][~man 2 quotactl~]]. [fn:preadv2-etc] #+caption: [[http://man7.org/linux/man-pages/man2/preadv2.2.html][~man 2 preadv2~]] #+begin_src text DESCRIPTION The readv() system call reads iovcnt buffers from the file associated with the file descriptor fd into the buffers described by iov ("scatter input"). The writev() system call writes iovcnt buffers of data described by iov to the file associated with the file descriptor fd ("gather output"). [...] The readv() system call works just like read(2) except that multiple buffers are filled. The writev() system call works just like write(2) except that multiple buffers are written out. [...] preadv() and pwritev() The preadv() system call combines the functionality of readv() and pread(2). It performs the same task as readv(), but adds a fourth argument, offset, which specifies the file offset at which the input operation is to be performed. The pwritev() system call combines the functionality of writev() and pwrite(2). It performs the same task as writev(), but adds a fourth argument, offset, which specifies the file offset at which the output operation is to be performed. The file offset is not changed by these system calls. The file referred to by fd must be capable of seeking. preadv2() and pwritev2() These system calls are similar to preadv() and pwritev() calls, but add a fifth argument, flags, which modifies the behavior on a per-call basis. Unlike preadv() and pwritev(), if the offset argument is -1, then the current file offset is used and updated. The flags argument contains a bitwise OR of zero or more of the following flags: RWF_DSYNC (since Linux 4.7) Provide a per-write equivalent of the O_DSYNC open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call.
RWF_HIPRI (since Linux 4.6) High priority read/write. Allows block-based filesystems to use polling of the device, which provides lower latency, but may use additional resources. (Currently, this feature is usable only on a file descriptor opened using the O_DIRECT flag.) RWF_SYNC (since Linux 4.7) Provide a per-write equivalent of the O_SYNC open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call. #+end_src [fn:ppc_rtas] #+caption: [[http://man7.org/linux/man-pages/man2/syscalls.2.html][~man 2 syscalls~]] #+begin_src text ppc_rtas(2) 2.6.2 PowerPC only #+end_src [fn:perfmonctl] #+caption: [[http://man7.org/linux/man-pages/man2/perfmonctl.2.html][~man 2 perfmonctl~]] #+begin_src text PERFMONCTL(2) -- 2013-02-13 -- Linux -- Linux Programmer's Manual NAME perfmonctl - interface to IA-64 performance monitoring unit [...] CONFORMING TO perfmonctl() is Linux-specific and is available only on the IA-64 architecture. #+end_src [fn:perfctr] #+caption: [[http://man7.org/linux/man-pages/man2/syscalls.2.html][~man 2 syscalls~]] #+begin_src text perfctr(2) 2.2 Sparc; removed in 2.6.34 #+end_src [fn:pci-etc] #+caption: [[http://man7.org/linux/man-pages/man2/pciconfig_read.2.html][~man 2 pciconfig_read~]] #+begin_src text PCICONFIG_READ(2) -- 2016-07-17 -- Linux -- Linux Programmer's Manual NAME pciconfig_read, pciconfig_write, pciconfig_iobase - pci device information handling [...] ERRORS [...] EPERM User does not have the CAP_SYS_ADMIN capability. This does not apply to pciconfig_iobase(). #+end_src [fn:clock_settime] [[http://man7.org/linux/man-pages/man2/clock_settime.2.html][~man 2 clock_settime~]] is unfortunately pretty vague: #+caption: [[http://man7.org/linux/man-pages/man2/clock_settime.2.html][~man 2 clock_settime~]] #+begin_src text CLOCK_GETRES(2) -- 2016-05-09 -- Linux Programmer's Manual NAME clock_getres, clock_gettime, clock_settime - clock and time functions [...] 
ERRORS EFAULT tp points outside the accessible address space. EINVAL The clk_id specified is not supported on this system. EPERM clock_settime() does not have permission to set the clock indicated. #+end_src but you can see in the source that ~CLOCK_REALTIME~ is the only clock with ~.clock_set~ and ~.clock_adj~ set: #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/time/posix-timers.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n282][~kernel/time/posix-timers.c:282@c8d2bc~]] #+begin_src C /* ,* Initialize everything, well, just everything in Posix clocks/timers ;) ,*/ static __init int init_posix_timers(void) { struct k_clock clock_realtime = { .clock_getres = posix_get_hrtimer_res, .clock_get = posix_clock_realtime_get, .clock_set = posix_clock_realtime_set, .clock_adj = posix_clock_realtime_adj, .nsleep = common_nsleep, .nsleep_restart = hrtimer_nanosleep_restart, .timer_create = common_timer_create, .timer_set = common_timer_set, .timer_get = common_timer_get, .timer_del = common_timer_del, }; struct k_clock clock_monotonic = { .clock_getres = posix_get_hrtimer_res, .clock_get = posix_ktime_get_ts, .nsleep = common_nsleep, .nsleep_restart = hrtimer_nanosleep_restart, .timer_create = common_timer_create, .timer_set = common_timer_set, .timer_get = common_timer_get, .timer_del = common_timer_del, }; struct k_clock clock_monotonic_raw = { .clock_getres = posix_get_hrtimer_res, .clock_get = posix_get_monotonic_raw, }; struct k_clock clock_realtime_coarse = { .clock_getres = posix_get_coarse_res, .clock_get = posix_get_realtime_coarse, }; struct k_clock clock_monotonic_coarse = { .clock_getres = posix_get_coarse_res, .clock_get = posix_get_monotonic_coarse, }; struct k_clock clock_tai = { .clock_getres = posix_get_hrtimer_res, .clock_get = posix_get_tai, .nsleep = common_nsleep, .nsleep_restart = hrtimer_nanosleep_restart, .timer_create = common_timer_create, .timer_set = common_timer_set, .timer_get = common_timer_get, 
.timer_del = common_timer_del, }; struct k_clock clock_boottime = { .clock_getres = posix_get_hrtimer_res, .clock_get = posix_get_boottime, .nsleep = common_nsleep, .nsleep_restart = hrtimer_nanosleep_restart, .timer_create = common_timer_create, .timer_set = common_timer_set, .timer_get = common_timer_get, .timer_del = common_timer_del, }; posix_timers_register_clock(CLOCK_REALTIME, &clock_realtime); posix_timers_register_clock(CLOCK_MONOTONIC, &clock_monotonic); posix_timers_register_clock(CLOCK_MONOTONIC_RAW, &clock_monotonic_raw); posix_timers_register_clock(CLOCK_REALTIME_COARSE, &clock_realtime_coarse); posix_timers_register_clock(CLOCK_MONOTONIC_COARSE, &clock_monotonic_coarse); posix_timers_register_clock(CLOCK_BOOTTIME, &clock_boottime); posix_timers_register_clock(CLOCK_TAI, &clock_tai); posix_timers_cache = kmem_cache_create("posix_timers_cache", sizeof (struct k_itimer), 0, SLAB_PANIC, NULL); return 0; } #+end_src and that those methods go through ~settimeofday~ and ~adjtimex~, which are both also gated by ~CAP_SYS_TIME~. 
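The first half of that claim can be poked at from userspace. This is a rough sketch, not part of ~contained.c~: the helper ~try_set_clock~ is a made-up name, and it relies on the documented behavior that ~clock_settime~ fails with ~EINVAL~ for a clock that isn't settable. If a seccomp filter rejects ~clock_settime~ outright, the error will be whatever that filter returns (commonly ~EPERM~) instead.

#+begin_src C
#define _GNU_SOURCE
#include <errno.h>
#include <time.h>

/* Hypothetical helper: try to set `clk` to the value it already has.
   Returns 0 on success, or the errno from whichever call failed. */
static int try_set_clock(clockid_t clk)
{
	struct timespec now;

	if (clock_gettime(clk, &now))
		return errno;
	/* For a clock with no .clock_set method the kernel bails out
	   with EINVAL before any capability check happens. */
	if (clock_settime(clk, &now))
		return errno;
	return 0;
}
#+end_src

~try_set_clock(CLOCK_MONOTONIC)~ should come back with ~EINVAL~ whether or not the caller is privileged. ~try_set_clock(CLOCK_REALTIME)~ is the interesting one: assuming ~CAP_SYS_TIME~ has been dropped, it should fail with ~EPERM~. Be careful running that second call as real root on a host, since it will actually set the clock.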
#+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/time/posix-timers.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n212][~kernel/time/posix-timers.c:212@c8d2bc~]] #+begin_src C /* Set clock_realtime */ static int posix_clock_realtime_set(const clockid_t which_clock, const struct timespec *tp) { return do_sys_settimeofday(tp, NULL); } static int posix_clock_realtime_adj(const clockid_t which_clock, struct timex *t) { return do_adjtimex(t); } #+end_src #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/security/commoncap.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n106][~security/commoncap.c:106@c8d2bc~]] #+begin_src C /** ,* cap_settime - Determine whether the current process may set the system clock ,* @ts: The time to set ,* @tz: The timezone to set ,* ,* Determine whether the current process may set the system clock and timezone ,* information, returning 0 if permission granted, -ve if denied. ,*/ int cap_settime(const struct timespec64 *ts, const struct timezone *tz) { if (!capable(CAP_SYS_TIME)) return -EPERM; return 0; } #+end_src #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/time/ntp.c?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n657][~kernel/time/ntp.c:657@c8d2bc~]] #+begin_src C /** ,* ntp_validate_timex - Ensures the timex is ok for use in do_adjtimex ,*/ int ntp_validate_timex(struct timex *txc) { if (txc->modes & ADJ_ADJTIME) { /* singleshot must not be used with any other mode bits */ if (!(txc->modes & ADJ_OFFSET_SINGLESHOT)) return -EINVAL; if (!(txc->modes & ADJ_OFFSET_READONLY) && !capable(CAP_SYS_TIME)) return -EPERM; } else { /* In order to modify anything, you gotta be super-user! */ if (txc->modes && !capable(CAP_SYS_TIME)) return -EPERM; /* ,* if the quartz is off by more than 10% then ,* something is VERY wrong! 
,*/ if (txc->modes & ADJ_TICK && (txc->tick < 900000/USER_HZ || txc->tick > 1100000/USER_HZ)) return -EINVAL; } /* ... */ } #+end_src [fn:adjtime] #+caption: [[http://man7.org/linux/man-pages/man3/adjtime.3.html][~man 3 adjtime~]] #+begin_src text ADJTIME(3) -- 2016-03-15 -- Linux -- Linux Programmer's Manual NAME adjtime - correct the time to synchronize the system clock [...] ERRORS EINVAL The adjustment in delta is outside the permitted range. EPERM The caller does not have sufficient privilege to adjust the time. Under Linux, the CAP_SYS_TIME capability is required. #+end_src [fn:nice-again] #+caption: [[http://man7.org/linux/man-pages/man2/nice.2.html][~man 2 nice~]] #+begin_src text NICE(2) -- 2016-03-15 -- Linux -- Linux Programmer's Manual NAME nice - change process priority [...] ERRORS EPERM The calling process attempted to increase its priority by supplying a negative inc but has insufficient privileges. Under Linux, the CAP_SYS_NICE capability is required. (But see the discussion of the RLIMIT_NICE resource limit in setrlimit(2).) #+end_src [fn:nfsservctl] #+caption: [[http://man7.org/linux/man-pages/man2/nfsservctl.2.html][~man 2 nfsservctl~]] #+begin_src text NAME nfsservctl - syscall interface to kernel nfs daemon SYNOPSIS #include long nfsservctl(int cmd, struct nfsctl_arg *argp, union nfsctl_res *resp); DESCRIPTION Note: Since Linux 3.1, this system call no longer exists. It has been replaced by a set of files in the nfsd filesystem; see nfsd(7). #+end_src [fn:kexec-etc] #+caption: [[http://man7.org/linux/man-pages/man2/kexec_file_load.2.html][~man 2 kexec_file_load~]] #+begin_src text NAME kexec_load, kexec_file_load - load a new kernel for later execution [...] ERRORS [...] EPERM The caller does not have the CAP_SYS_BOOT capability.
#+end_src [fn:create_module] #+caption: [[http://man7.org/linux/man-pages/man2/create_module.2.html][~man 2 create_module~]] #+begin_src text DESCRIPTION Note: This system call is present only in kernels before Linux 2.6. #+end_src [fn:_sysctl] #+caption: [[http://man7.org/linux/man-pages/man2/_sysctl.2.html][~man 2 _sysctl~]] #+begin_src text NOTES Glibc does not provide a wrapper for this system call; call it using syscall(2). Or rather... don't call it: use of this system call has long been discouraged, and it is so unloved that it is likely to disappear in a future kernel version. Since Linux 2.6.24, uses of this system call result in warnings in the kernel log. Remove it from your programs now; use the /proc/sys interface instead. This system call is available only if the kernel was configured with the CONFIG_SYSCTL_SYSCALL option. #+end_src #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/init/Kconfig?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3#n1420][~init/Kconfig:1420@c8d2bc~]] #+begin_src text config SYSCTL_SYSCALL bool "Sysctl syscall support" if EXPERT depends on PROC_SYSCTL default n select SYSCTL ---help--- sys_sysctl uses binary paths that have been found challenging to properly maintain and use. The interface in /proc/sys using paths with ascii names is now the primary path to this information. Almost nothing using the binary sysctl interface so if you are trying to save some space it is probably safe to disable this, making your kernel marginally smaller. If unsure say N here. #+end_src [fn:alloc_hugepages] #+caption: [[http://man7.org/linux/man-pages/man2/alloc_hugepages.2.html][~man 2 alloc_hugepages~]] #+begin_src text DESCRIPTION The system calls alloc_hugepages() and free_hugepages() were introduced in Linux 2.5.36 and removed again in 2.5.54. They existed only on i386 and ia64 (when built with CONFIG_HUGETLB_PAGE). In Linux 2.4.20, the syscall numbers exist, but the calls fail with the error ENOSYS. 
#+end_src

[fn:bdflush]
#+caption: [[http://man7.org/linux/man-pages/man2/bdflush.2.html][~man 2 bdflush~]]
#+begin_src text
DESCRIPTION
       Note: Since Linux 2.6, this system call is deprecated and does nothing. It is likely to disappear altogether in a future kernel release. Nowadays, the task performed by bdflush() is handled by the kernel pdflush thread.
#+end_src


** Resources

We'd like to prevent badly-behaved child processes from denying service to the rest of the system[fn:oom-killer-security]. Cgroups let us limit memory and cpu time in particular; limiting the pid count and IO usage is also useful. [[https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt][There's a very useful document in the kernel tree about it]].

The ~cgroup~ and ~cgroup2~ filesystems are the canonical interfaces to the cgroup system. ~cgroup2~ is a little different, and uninitialized on my system, so I'll use the first version here.

Cgroup namespaces are a little different from, for example, mount namespaces. We need to create the cgroup before we enter a cgroup namespace; once we do, that cgroup will behave like the root cgroup inside of the namespace[fn:cgroup-namespaces]. This isn't the most relevant, since a contained process can't mount the cgroup filesystem or ~/proc~ for introspection, but it's nice to be thorough.

[fn:oom-killer-security] This isn't just a denial-of-service concern. If a process consumes a lot of memory, and has a better (lower) ~badness~ score than some other critical host-side process, the host-side process will be killed by the kernel's out-of-memory killer. The badness score favors longer-running processes, among other things: [[https://lwn.net/Articles/317814/]["Taming the OOM Killer"]] on LWN:
#+begin_quote
The process to be killed in an out-of-memory situation is selected based on its badness score. The badness score is reflected in /proc/<pid>/oom_score.
This value is determined on the basis that the system loses the minimum amount of work done, recovers a large amount of memory, doesn't kill any innocent process eating tons of memory, and kills the minimum number of processes (if possible limited to one). The badness score is computed using the original memory size of the process, its CPU time (utime + stime), the run time (uptime - start time) and its oom_adj value. The more memory the process uses, the higher the score. The longer a process is alive in the system, the smaller the score.
#+end_quote

I haven't demonstrated it, but I believe this could be manipulated to cause a screen lock program to be killed, for example. It's not unheard of for e.g. xscreensaver to leak memory: [[https://bugs.launchpad.net/ubuntu/%2Bsource/xscreensaver/%2Bbug/768032]["gltext seems to leak memory eventually causing oom-killer to run"]]:
#+begin_quote
gltext is consuming large amounts of memory. Often being killed by oom-killer but eventually causing me not to be able to log into my computer disabling gltext from the list of possible screensavers caused the problem to go away.
#+end_quote

There's even an open Ubuntu xscreensaver bug to make the OOM killer *more likely* to kill xscreensaver. This seems like the wrong direction to me.... [[https://bugs.launchpad.net/ubuntu/%2Bsource/xscreensaver/%2Bbug/807685]["xscreensaver does not protect the system against its children"]]:
#+begin_quote
The thing is, a screensaver is *NOT* a critically important part of the system. It should die early if it is a resource hog. All you have to do is write "10" into /proc/PID/oom_adj and Bob's your uncle. Until then, Xscreensaver is failing its duties.
#+end_quote

[fn:cgroup-namespaces]
#+caption: [[http://man7.org/linux/man-pages/man7/cgroup_namespaces.7.html][~man 7 cgroup_namespaces~]]
#+begin_src text
Cgroup namespaces virtualize the view of a process's cgroups (see cgroups(7)) as seen via /proc/[pid]/cgroup and /proc/[pid]/mountinfo.
Each cgroup namespace has its own set of cgroup root directories, which are the base points for the relative locations displayed in /proc/[pid]/cgroup. When a process creates a new cgroup namespace using clone(2) or unshare(2) with the CLONE_NEWCGROUP flag, it enters a new cgroup namespace in which its current cgroups directories become the cgroup root directories of the new namespace. (This applies both for the cgroups version 1 hierarchies and the cgroups version 2 unified hierarchy.)
#+end_src


I'll set up a struct so I don't have to repeat myself too much, with the following instructions:

+ Set ~memory/$hostname/memory.limit_in_bytes~, so the contained process and its child processes can't total more than 1GB memory in userspace[fn:1GB-total-userspace].
+ Set ~memory/$hostname/memory.kmem.limit_in_bytes~, so that the contained process and its child processes can't total more than 1GB memory in the kernel[fn:1GB-total-kmem].
+ Set ~cpu/$hostname/cpu.shares~ to 256. CPU shares are weights relative to the default of 1024; 256 * 4 = 1024, so on a busy system the contained process gets a quarter of the weight of a default cgroup[fn:cpu-time].
+ Set ~pids/$hostname/pids.max~ to 64, allowing the contained process and its children to have 64 pids at most. This is useful because there are per-user pid limits that we could hit on the host if the contained process occupies too many[fn:pids].
+ Set ~blkio/$hostname/blkio.weight~ to 10, the lowest allowed value, so that it's lower than the rest of the system and prioritized accordingly[fn:blkio].

I'll also add the calling process to each of ~{memory,cpu,blkio,pids}/$hostname/tasks~ by writing '0' to it.

[fn:1GB-total-userspace]
#+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroup-v1/memory.txt?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3][~Documentation/cgroup-v1/memory.txt@c8d2bc~]]
#+begin_src text
Brief summary of control files.
[...]
memory.limit_in_bytes # set/show limit of memory usage #+end_src [fn:1GB-total-kmem] #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroup-v1/memory.txt?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3][~Documentation/cgroup-v1/memory.txt@c8d2bc~]] #+begin_src text Brief summary of control files. [...] memory.kmem.limit_in_bytes # set/show hard limit for kernel memory #+end_src [fn:pids] #+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroup-v1/pids.txt?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3][~Documentation/cgroup-v1/pids.txt@c8d2bc~]] #+begin_src text Process Number Controller ========================= Abstract -------- The process number controller is used to allow a cgroup hierarchy to stop any new tasks from being fork()'d or clone()'d after a certain limit is reached. Since it is trivial to hit the task limit without hitting any kmemcg limits in place, PIDs are a fundamental resource. As such, PID exhaustion must be preventable in the scope of a cgroup hierarchy by allowing resource limiting of the number of tasks in a cgroup. Usage ----- In order to use the `pids` controller, set the maximum number of tasks in pids.max (this is not available in the root cgroup for obvious reasons). The number of processes currently in the cgroup is given by pids.current. #+end_src for example, #+caption: ~forkbomb.c~ #+include: "linux-containers-in-500-loc/forkbomb.c" src C #+begin_example [lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c forkbomb => validating Linux version...4.7.10.201610222037-1-grsec on x86_64. => setting cgroups...memory...cpu...pids...blkio...done. => setting rlimit...done. => remounting everything with MS_PRIVATE...remounted. => making a temp directory and a bind mount there...done. => pivoting root...done. => unmounting /oldroot.0sOZgF...done. => trying a user namespace...writing /proc/2184/uid_map...writing /proc/2184/gid_map...done. 
=> switching to uid 0 / gid 0...done. => dropping capabilities...bounding...inheritable...done. => filtering syscalls...done. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. ++ successful fork. C-c C-c #+end_example [fn:cpu-time] #+caption: [[http://man7.org/linux/man-pages/man7/cgroups.7.html][~man 7 cgroups~]] #+begin_src text Cgroups version 1 controllers Each of the cgroups version 1 controllers is governed by a kernel configuration option (listed below). Additionally, the availability of the cgroups feature is governed by the CONFIG_CGROUPS kernel configuration option. cpu (since Linux 2.6.24; CONFIG_CGROUP_SCHED) Cgroups can be guaranteed a minimum number of "CPU shares" when a system is busy. This does not limit a cgroup's CPU usage if the CPUs are not busy. 
Further information can be found in the kernel source file Documentation/scheduler/sched-bwc.txt.
#+end_src

[fn:blkio]
#+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroup-v1/blkio-controller.txt?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3][~Documentation/cgroup-v1/blkio-controller.txt@c8d2bc~]]
#+begin_src text
Details of cgroup files
=======================
Proportional weight policy files
--------------------------------
- blkio.weight
	- Specifies per cgroup weight. This is default weight of the group on all the devices until and unless overridden by per device rule. (See blkio.weight_device). Currently allowed range of weights is from 10 to 1000.
#+end_src


#+caption: =<<resources>>= +=
#+begin_src C :noweb-ref resources
#define MEMORY "1073741824"
#define SHARES "256"
#define PIDS "64"
#define WEIGHT "10"
#define FD_COUNT 64

/* One controller (a directory under /sys/fs/cgroup) and the
   NULL-terminated list of files to write inside it. */
struct cgrp_control {
	char control[256];
	struct cgrp_setting {
		char name[256];
		char value[256];
	} **settings;
};
/* Writing "0" to tasks moves the writing process into the cgroup. */
struct cgrp_setting add_to_tasks = {
	.name = "tasks",
	.value = "0"
};

struct cgrp_control *cgrps[] = {
	& (struct cgrp_control) {
		.control = "memory",
		.settings = (struct cgrp_setting *[]) {
			& (struct cgrp_setting) {
				.name = "memory.limit_in_bytes",
				.value = MEMORY
			},
			& (struct cgrp_setting) {
				.name = "memory.kmem.limit_in_bytes",
				.value = MEMORY
			},
			&add_to_tasks,
			NULL
		}
	},
	& (struct cgrp_control) {
		.control = "cpu",
		.settings = (struct cgrp_setting *[]) {
			& (struct cgrp_setting) {
				.name = "cpu.shares",
				.value = SHARES
			},
			&add_to_tasks,
			NULL
		}
	},
	& (struct cgrp_control) {
		.control = "pids",
		.settings = (struct cgrp_setting *[]) {
			& (struct cgrp_setting) {
				.name = "pids.max",
				.value = PIDS
			},
			&add_to_tasks,
			NULL
		}
	},
	& (struct cgrp_control) {
		.control = "blkio",
		.settings = (struct cgrp_setting *[]) {
			& (struct cgrp_setting) {
				.name = "blkio.weight",
				.value = WEIGHT
			},
			&add_to_tasks,
			NULL
		}
	},
	NULL
};
#+end_src

Writing to the cgroups version 1 filesystem works like this[fn:how-cgroups-works]:
+ In each controller, you can create a named cgroup with ~mkdir~. For memory, ~mkdir /sys/fs/cgroup/memory/$hostname~.
+ Inside of that you can write to the individual files to set values. For example, ~echo $MEMORY > /sys/fs/cgroup/memory/$hostname/memory.limit_in_bytes~.
+ You can write a pid to ~tasks~ to add that process, and any children it forks afterwards, to the cgroup. "0" is a special value that means "the writing process".

So I'll iterate over that structure and fill in the values.

[fn:how-cgroups-works]
#+caption: [[http://man7.org/linux/man-pages/man7/cgroups.7.html][~man 7 cgroups~]]
#+begin_src text
Creating cgroups and moving processes
    A cgroup filesystem initially contains a single root cgroup, '/', which all processes belong to. A new cgroup is created by creating a directory in the cgroup filesystem:

        mkdir /sys/fs/cgroup/cpu/cg1

    This creates a new empty cgroup.

    A process may be moved to this cgroup by writing its PID into the cgroup's cgroup.procs file:

        echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs

    Only one PID at a time should be written to this file.

    Writing the value 0 to a cgroup.procs file causes the writing process to be moved to the corresponding cgroup.

    When writing a PID into the cgroup.procs, all threads in the process are moved into the new cgroup at once.

    Within a hierarchy, a process can be a member of exactly one cgroup. Writing a process's PID to a cgroup.procs file automatically removes it from the cgroup of which it was previously a member.

    The cgroup.procs file can be read to obtain a list of the processes that are members of a cgroup. The returned list of PIDs is not guaranteed to be in order. Nor is it guaranteed to be free of duplicates. (For example, a PID may be recycled while reading from the list.)

    In cgroups v1 (but not cgroups v2), an individual thread can be moved to another cgroup by writing its thread ID (i.e., the kernel thread ID returned by clone(2) and gettid(2)) to the tasks file in a cgroup directory.
    This file can be read to discover the set of threads that are members of the cgroup. This file is not present in cgroup v2 directories.
#+end_src


#+caption: =<<resources>>= +=
#+begin_src C :noweb-ref resources
int resources(struct child_config *config)
{
	fprintf(stderr, "=> setting cgroups...");
	for (struct cgrp_control **cgrp = cgrps; *cgrp; cgrp++) {
		char dir[PATH_MAX] = {0};
		fprintf(stderr, "%s...", (*cgrp)->control);
		/* Create /sys/fs/cgroup/<controller>/<hostname>. */
		if (snprintf(dir, sizeof(dir), "/sys/fs/cgroup/%s/%s",
			     (*cgrp)->control, config->hostname) == -1) {
			return -1;
		}
		if (mkdir(dir, S_IRUSR | S_IWUSR | S_IXUSR)) {
			fprintf(stderr, "mkdir %s failed: %m\n", dir);
			return -1;
		}
		/* Write each setting's value into its file in that directory. */
		for (struct cgrp_setting **setting = (*cgrp)->settings; *setting; setting++) {
			char path[PATH_MAX] = {0};
			int fd = 0;
			if (snprintf(path, sizeof(path), "%s/%s", dir,
				     (*setting)->name) == -1) {
				fprintf(stderr, "snprintf failed: %m\n");
				return -1;
			}
			if ((fd = open(path, O_WRONLY)) == -1) {
				fprintf(stderr, "opening %s failed: %m\n", path);
				return -1;
			}
			if (write(fd, (*setting)->value, strlen((*setting)->value)) == -1) {
				fprintf(stderr, "writing to %s failed: %m\n", path);
				close(fd);
				return -1;
			}
			close(fd);
		}
	}
	fprintf(stderr, "done.\n");
#+end_src

I'll also lower the hard limit on the number of file descriptors. The number of file descriptors, like the number of pids, is per-user, so we want to prevent in-container processes from occupying all of them. Setting the hard limit sets a permanent upper bound for this process tree, since I've dropped ~CAP_SYS_RESOURCE~[fn:lower-hard-limit].

[fn:lower-hard-limit]
#+caption: [[http://man7.org/linux/man-pages/man2/setrlimit.2.html][~man 2 setrlimit~]]
#+begin_src text
The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may set only its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit.
A privileged process (under Linux: one with the CAP_SYS_RESOURCE capability) may make arbitrary changes to either limit value.
#+end_src


#+caption: =<<resources>>= +=
#+begin_src C :noweb-ref resources
	fprintf(stderr, "=> setting rlimit...");
	if (setrlimit(RLIMIT_NOFILE,
		      & (struct rlimit) {
			.rlim_max = FD_COUNT,
			.rlim_cur = FD_COUNT,
		})) {
		fprintf(stderr, "failed: %m\n");
		return 1;
	}
	fprintf(stderr, "done.\n");
	return 0;
}
#+end_src

We'd also like to clean up the cgroup for this hostname. There's built-in functionality for this, but we would need to change system-wide values to do it cleanly[fn:built-in-cgroup-free]. Since the ~contained~ process is already waiting on the contained process, it's simple to clean up by hand instead. First we move the ~contained~ process back into the root ~tasks~; then, since the child process is finished, and leaving the pid namespace sends ~SIGKILL~ to its children, the cgroup's ~tasks~ is empty. We can safely ~rmdir~ at this point.

[fn:built-in-cgroup-free]
#+caption: [[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroup-v1/cgroups.txt?id%3Dc8d2bc9bc39ebea8437fd974fdbc21847bb897a3][~Documentation/cgroup-v1/cgroups.txt@c8d2bc~]]
#+begin_src text
1.4 What does notify_on_release do ?
------------------------------------
If the notify_on_release flag is enabled (1) in a cgroup, then whenever the last task in the cgroup leaves (exits or attaches to some other cgroup) and the last child cgroup of that cgroup is removed, then the kernel runs the command specified by the contents of the "release_agent" file in that hierarchy's root directory, supplying the pathname (relative to the mount point of the cgroup file system) of the abandoned cgroup. This enables automatic removal of abandoned cgroups. The default value of notify_on_release in the root cgroup at system boot is disabled (0). The default value of other cgroups at creation is the current value of their parents' notify_on_release settings.
The default value of a cgroup hierarchy's release_agent path is empty.
#+end_src

It's annoying to set the release agent on a per-container basis, so we'll avoid it.


#+caption: =<<resources>>= +=
#+begin_src C :noweb-ref resources
int free_resources(struct child_config *config)
{
	fprintf(stderr, "=> cleaning cgroups...");
	for (struct cgrp_control **cgrp = cgrps; *cgrp; cgrp++) {
		char dir[PATH_MAX] = {0};
		char task[PATH_MAX] = {0};
		int task_fd = 0;
		if (snprintf(dir, sizeof(dir), "/sys/fs/cgroup/%s/%s",
			     (*cgrp)->control, config->hostname) == -1
		    || snprintf(task, sizeof(task), "/sys/fs/cgroup/%s/tasks",
				(*cgrp)->control) == -1) {
			fprintf(stderr, "snprintf failed: %m\n");
			return -1;
		}
		/* Move ourselves back to the controller's root cgroup... */
		if ((task_fd = open(task, O_WRONLY)) == -1) {
			fprintf(stderr, "opening %s failed: %m\n", task);
			return -1;
		}
		if (write(task_fd, "0", 2) == -1) {
			fprintf(stderr, "writing to %s failed: %m\n", task);
			close(task_fd);
			return -1;
		}
		close(task_fd);
		/* ...so the per-hostname cgroup is empty and can be removed. */
		if (rmdir(dir)) {
			fprintf(stderr, "rmdir %s failed: %m\n", dir);
			return -1;
		}
	}
	fprintf(stderr, "done.\n");
	return 0;
}
#+end_src

** Networking

Container networking takes a little too much explanation for this space. It usually works like this:

+ Create a bridge device.
+ Create a virtual ethernet pair and attach one end to the bridge.
+ Put the other end in the network namespace.
+ For outside networking access, the host needs to be set to forward (and possibly NAT) packets.

Having multiple contained processes sharing a bridge device would mean they're all on the same LAN from the host's perspective. So ARP spoofing is a recurring issue with containers that work this way[fn:arp-spoofing].

The canonical way to do this from C is the ~rtnetlink~ interface; it would probably be easier to use ~ip link ...~.
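To make the first three steps a little more concrete, here's a rough sketch that only /builds/ the ~ip link~ command lines; it doesn't run them, and it isn't part of ~contained.c~. The names ~br0~, ~veth0~ and ~veth1~, the pid, and the helper ~veth_commands~ are all made up for illustration. Bringing the links up, assigning addresses, and the forwarding/NAT step are left out.

#+begin_src C
#include <net/if.h>     /* IFNAMSIZ */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>  /* pid_t */

#define CMD_MAX 128
#define STEP_COUNT 4

/* Interface names must be non-empty and fit in IFNAMSIZ with the NUL. */
static int name_ok(const char *name)
{
	size_t len = strlen(name);
	return len > 0 && len < IFNAMSIZ;
}

/* Hypothetical helper: fill cmds[] with one ip(8) command per step.
   Returns 0 on success, -1 on a bad name or a command that won't fit. */
int veth_commands(char cmds[STEP_COUNT][CMD_MAX], const char *bridge,
		  const char *host_if, const char *child_if, pid_t child_pid)
{
	int n[STEP_COUNT];

	if (!name_ok(bridge) || !name_ok(host_if) || !name_ok(child_if))
		return -1;
	/* Create a bridge device. */
	n[0] = snprintf(cmds[0], CMD_MAX, "ip link add name %s type bridge", bridge);
	/* Create a virtual ethernet pair... */
	n[1] = snprintf(cmds[1], CMD_MAX, "ip link add %s type veth peer name %s",
			host_if, child_if);
	/* ...and attach one end to the bridge. */
	n[2] = snprintf(cmds[2], CMD_MAX, "ip link set %s master %s", host_if, bridge);
	/* Put the other end in the child's network namespace, named by pid. */
	n[3] = snprintf(cmds[3], CMD_MAX, "ip link set %s netns %ld",
			child_if, (long) child_pid);
	for (int i = 0; i < STEP_COUNT; i++)
		if (n[i] < 0 || n[i] >= CMD_MAX)
			return -1;
	return 0;
}

int main(void)
{
	char cmds[STEP_COUNT][CMD_MAX];

	/* All arguments here are illustrative placeholders. */
	if (veth_commands(cmds, "br0", "veth0", "veth1", 2184))
		return 1;
	for (int i = 0; i < STEP_COUNT; i++)
		puts(cmds[i]);
	return 0;
}
#+end_src

Printing the commands instead of running them keeps the sketch unprivileged; a real version would have to run them from the host side, as root, after the child has entered its network namespace.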
[fn:arp-spoofing] #+caption: [[https://bugs.launchpad.net/ubuntu/%2Bsource/lxc/%2Bbug/1548497]["Cross-Container ARP Poisoning"]], an LXC bug report by Jesse Hertz of NCCGroup #+begin_src text Description: An unprivileged LXC container can conduct an ARP spoofing attack against another unprivileged LXC container running on the same host. This allows man-in-the-middle attacks on another container's traffic. Recommendation: Due to the complex nature of this involving the Linux bridge interface, NCC is not aware of an easy fix. We suggest involving the kernel networking team to allow for ARP restrictions on virtual bridge interfaces. Using ebtables to block and control link layer traffic may also be an effective fix. Documentation should reflect the risks of not using any future protections or ebtables. Stéphane Graber (stgraber) wrote on 2016-02-22: #1 Hi, Thanks for the report. This is not exactly news to us and has been mentioned publicly a few times. Our usual answer to this is that if you don't trust your users, you shouldn't grant them access to a shared bridge, instead setup a separate bridge for them. MAC filtering through ebtables is an option but the problem with this approach is that it essentially prevents container nesting as that would lead to more than one MAC being used by the container which ebtables would block. [...] On a local system, our answer to that is as I said to either trust everyone you give access to a shared bridge or to segment traffic by using multiple bridges. #+end_src We could also limit the network usage with the ~net_prio~ cgroup controller[fn:net-prio]. [fn:net-prio] #+caption: [[http://man7.org/linux/man-pages/man7/cgroups.7.html][~man 7 cgroups~]] #+begin_src text Cgroups version 1 controllers Each of the cgroups version 1 controllers is governed by a kernel configuration option (listed below). Additionally, the availability of the cgroups feature is governed by the CONFIG_CGROUPS kernel configuration option. [...] 
net_prio (since Linux 3.3; CONFIG_CGROUP_NET_PRIO) This allows priorities to be specified, per network interface, for cgroups. Further information can be found in the kernel source file Documentation/cgroup-v1/net_prio.txt. #+end_src
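

If I did wire up ~net_prio~, I'd expect it to look like the other controllers: one more file to write under the per-hostname cgroup. Here's a small sketch of the path and value involved, assuming the controller is mounted at ~/sys/fs/cgroup/net_prio~ (many systems co-mount it as ~net_cls,net_prio~ instead) and that ~net_prio.ifpriomap~ takes ~<ifname> <priority>~ pairs as the kernel documentation describes. The helper names, ~example-host~, and ~eth0~ are placeholders, not part of ~contained.c~. Note that this sets a priority for the cgroup's traffic on an interface; it isn't a bandwidth cap.

#+begin_src C
#include <stdio.h>

/* Assumed mount point for the cgroup v1 net_prio controller. */
#define NET_PRIO_ROOT "/sys/fs/cgroup/net_prio"

/* Hypothetical helper: build the ifpriomap path for a container's cgroup.
   Returns 0 on success, -1 if the path would not fit in buf. */
int net_prio_path(char *buf, size_t len, const char *hostname)
{
	int n = snprintf(buf, len, "%s/%s/net_prio.ifpriomap",
			 NET_PRIO_ROOT, hostname);
	return (n < 0 || (size_t) n >= len) ? -1 : 0;
}

/* Hypothetical helper: build the "<ifname> <priority>" value to write.
   Returns 0 on success, -1 if the value would not fit in buf. */
int net_prio_value(char *buf, size_t len, const char *ifname, unsigned int prio)
{
	int n = snprintf(buf, len, "%s %u", ifname, prio);
	return (n < 0 || (size_t) n >= len) ? -1 : 0;
}

int main(void)
{
	char path[256];
	char value[64];

	/* "example-host" and "eth0" are placeholders. */
	if (net_prio_path(path, sizeof(path), "example-host")
	    || net_prio_value(value, sizeof(value), "eth0", 1))
		return 1;
	printf("would write \"%s\" to %s\n", value, path);
	return 0;
}
#+end_src

In the struct-based setup from the Resources section this would amount to one more ~cgrp_control~ entry with a ~net_prio.ifpriomap~ setting and the usual ~tasks~ write.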