I am going to use the ptmalloc2 heap implementation used by default in the glibc library. Accordingly, I will need a Linux PC and a number of tools. I install the GDB debugger, GCC (the entire build-essential package can be installed at once), and the debug version of libc making it possible to see detailed information about the heap. In addition, I install pwngdb and its dependency, peda, to be able to use such handy commands as vmmap
, hexdump
, and heapinfo
.
sudo apt install gdb build-essential libc6-dbg
git clone https://github.com/scwuaptx/Pwngdb.git ~/Pwngdb
cp ~/Pwngdb/.gdbinit ~/
git clone https://github.com/longld/peda.git ~/peda
GDB basics
To understand operation principles of the test programs, you have to know basic GDB commands:
-
r[
– run a file;un] -
b[
– set a breakpoint at linereak] basic. c: 123 123
of thebasic.
source code;c -
c[
– continue program execution;ontinue] -
s[
– execute one assembly instruction;tep] -
n[
– execute one string in the source code;ext] -
x/
– print ten 8-byte words at address10xg 0x1234 0x1234
; -
p[
– print the value of therint] a a
variable; -
p[
– take the memory contents at addressrint] *(( mchunkptr) 0x555555756680) 0x555555756680
as themchunkptr
type, dereference it, and print; and -
where
– shows current line number in the source code and the executed function.
Peda
и pwngdb
commands:
-
vmmap
– show the memory map; -
hexdump
– show the memory contents at a given address in thehexdump
format; and -
heapinfo
– show heap info.
Chunk structure
When a program requests a buffer to store data (e.g. 10 bytes in size) using malloc
, the actually allocated memory space is greater since additional space is required to store metadata. Such a piece of memory containing metadata is called a chunk.
The chunk structure used in ptmalloc2 is shown below. As you can see, there are two fields before the pointer to the allocated memory buffer (mem
) that is returned to the user: size of the current chunk and size of the previous chunk.
chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Size of previous chunk, if unallocated (P clear) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Size of chunk, in bytes |A|M|P| mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | User data starts here... . . . . (malloc_usable_size() bytes) . . |
nextchunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| (size of chunk, but used for application data) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Size of next chunk, in bytes |A|0|1|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The chunk structure is as follows:
struct malloc_chunk { INTERNAL_SIZE_T mchunk_prev_size; /* Size of previous chunk, if it is free. */ INTERNAL_SIZE_T mchunk_size; /* Size in bytes, including overhead. */ struct malloc_chunk* fd; /* double links - used only if this chunk is free. */ struct malloc_chunk* bk; /* Only used for large blocks: pointer to next larger size. */ struct malloc_chunk* fd_nextsize; /* double links - used only if this chunk is free. */ struct malloc_chunk* bk_nextsize;};typedef struct malloc_chunk* mchunkptr;
To get from a pointer to a chunk (service structure) a pointer to a memory buffer that can be used, you have to add to the former one the value 2*SIZE_SZ
. For the x64 architecture, SIZE_SZ
is 8; for x86, it’s 4. In other words, on x64, user_mem
. Conversely, to get a pointer to a chunk from the pointer returned by malloc
, you have to subtract 2*SIZE_SZ
from it. The following macros are responsible for these operations:
#define chunk2mem(p) ((void*)((char*)(p) + 2*SIZE_SZ))#define mem2chunk(mem) ((mchunkptr)((char*)(mem) - 2*SIZE_SZ))
Important: the mchunk_prev_size
in the next chunk is used to store user data of the previous chunk.
Arena
Old heap managers had a single heap for the entire process and used mutexes to synchronize the access of different threads to it. Needless to say that this approach doesn’t boost performance. Ptmalloc2 uses arenas: memory areas where each thread can store its own heap.
But in real life, there may be too many threads in an app; in such cases, the maximum number of arenas that can be created is calculated using the following formula (n
is the number of processors):
#define NARENAS_FROM_NCORES(n) ((n) * (**sizeof** (**long**) == 4 ? 2 : 8))
As long as the number of arenas is less than the maximum, the heap manager creates a new arena for each new thread. After that, several threads have to share the same arena.
The first arena created by the heap manager is called the main arena. A single-threaded app only uses the main arena.
Flags
Let’s examine chunk flags in more detail. The field containing the size of the previous chunk (mchunk_size
) also contains three flags: A
, M
, and P
. This feature is possible due to the chunk size alignment. Since the chunk size is always a multiple of either 8 or 16 bytes, the last 3 bits of its size don’t contain meaningful info and can be used to store flags.
-
A
(NON_MAIN_ARENA): (0) chunk was allocated from the main arena and the main heap; and (1) chunk belongs to one of the secondary arenas. When an app creates additional threads, each of these threads gets its own arena (roughly speaking, its own heap). TheA
bit is set in chunks allocated in these arenas; -
M
(IS_MMAPPED): (1) chunk allocated by callingmmap
. The remaining flags are ignored because these chunks are neither located in an arena nor adjacent to other chunks; and -
P
(PREV_INUSE): (0) previous chunk isn’t in use. Themchunk_size
field is preceded by a value reflecting the size of the previous chunk; (1) previous chunk is in use. Themchunk_size
field is preceded by user data.
Bins
To boost performance, chunks are reused (and this feature is used to exploit the heap). Previously used and subsequently freed chunks are put into bins. There are five types of bins in the studied heap implementation:
- small (62 bins);
- large (63 bins);
- unsorted (1 bin);
- fast (10 bins); and
- tcache (64 bins per thread).
Small bins
- In each of the small bins, chunks are stored in a doubly-linked list.
- Freed chunks are inserted into this list at its beginning (head) and deleted at its end (tail, FIFO). The fd and bk pointers are used to maintain this list (see the chunk structure).
- There are 62 such bins. Each of the small bins stores only chunks of the same size: 16, 24, …, 504 bytes for x86 and 1008 bytes for x64.
- If two adjacent chunks get into the small bin, then they are merged together and sent to the unsorted bin.
Large bins
- In each of the large bins, chunks are stored in a doubly-linked list.
- In each of the bins, chunk sizes vary in a certain range.
- Chunks are sorted as follows: the largest chunk is stored in the head of the list; while the smallest one, in the tail. The
fd_nextsize
andbk_nextsize
pointers are used for this purpose. - Insertions and deletions occur at any position.
- There are 63 such bins; they store chunks over 512 bytes in size for x86 and over 1024 bytes for x64.
Unsorted bin
Instead of putting newly-freed chunks into suitable bins, the heap manager merges them with neighbors and puts into the unsorted bin. When malloc
is called next time, each chunk in the unsorted bin is checked: does its size fit or not? If it fits, then malloc
uses it. Otherwise, the chunk is placed to the respective bin (i.e. small or large one).
Fast bins
- Designed to optimize the release and allocation of small chunks.
- Fast bins store chunks of a fixed size. The maximum chunk size is 88 bytes for x86 and 176 bytes for x64.
- Chunks are stored in a singly-linked list (LIFO).
- The
P
flag is not removed from chunks in fast bins. Therefore, adjacent freed chunks aren’t merged. This is done to free and allocate small chunks faster.
Tcache bins
- Each thread has 64 tcache bins. This is done to further optimize the allocation of small chunks. Tcache bins are always used if the number of threads exceeds the maximum number of arenas: in such cases, each thread can use its own cache first (rather than wait for a mutex responsible for heap access synchronization).
- Chunks are stored in a singly-linked list.
- Each bin stores a maximum of seven chunks of the same size (12-516 on x86 and 24-1032 on x64).
- Chunks aren’t merged and freed ‘for real’ (the
P
flag is not removed). - When a memory call is made, the program first looks for a suitable chunk in tcache and then in other bins.
Test program
I am going to write a simple test program for demonstration purposes:
#include <stdio.h>#include <stdlib.h>#include <string.h>int main(){ puts("Basic allocation example.\n"); char* a = malloc(0x10); strcpy(a, "AAAAAAAAAAAAAAA"); // A * 15 char* b = malloc(0x12); memcpy(b, "BBBBBBBBBBBBBBBBBBBBBBBB", 24); // B * 23 char* c = malloc(496); // was 0x200 char* c = malloc(496); // was 0x200 char* d = malloc(0x500); char* e = malloc(0x500); char* f = malloc(0x500); char* g = malloc(0x500); char* h = malloc(0x500); free(a); free(b); free(c); free(d); // free(e); free(f); free(h); free(g); puts("End.\n");}
To compile it with debugging information, I use the command:
gcc basic.c -o basic -ggdb
Practice
I am going to use an x64 platform. I open the test program in GDB and set a breakpoint at line 8 (char*
) using the command b
.
I run the program with the r
command and look what happens (the heapinfo
command).
The execution of the user code begins, the heap is initialized, and I can see the address the top chunk of the heap is located at: 0x5555555596a0
.
info
Top chunk is the topmost chunk in the heap that borders its end. The P
flag is always set for this chunk.
Using the vmmap
command, I can review the process’s memory map and see where the heap is located.
Now I know for sure that the top chunk of the heap is indeed located in the heap segment.
I execute one string of the program (char*
) using the n
command. As you can see, the top heap chunk has been shifted forward by 0x20 bytes, and the available heap size has decreased respectively. But why 0x20 bytes if I’ve requested 0x10 bytes? The remaining 16 bytes store the metadata located before the a
pointer: mchunk_prev_size
and mchunk_size
.
Also, you can see that the P
flag in the mchunk_size
field is set to 1 (i.e. the previous chunk isn’t free). Using the command p
, you can print chunk fields as structures.
Even if you request a buffer 1 byte in size, a 0x20-byte chunk will be allocated in the heap, and a pointer to the buffer 0x18 bytes in size will be returned.
Now I execute the string char*
. The heap manager allocates a chunk 0x20 bytes in size. Similar to the previous example, the user can use a buffer 0x18 bytes in size.
If you write a 24-byte buffer into your chunk, you’ll see that the last 8 bytes of the buffer overlap with the next chunk, but the metadata won’t be overwritten.
The string char*
is executed, and a chunk 496 + 16 = 0x200 bytes in size is allocated as a result.
Time to examine the freed chunks. First, I execute string 18: free(
. Since the size of the a
chunk is less than 1032 bytes, the freed chunk goes into tcache.
As you remember, chunks aren’t freed ‘for real’ in tcache; therefore, the next chunk has the P
flag set.
I free one more chunk. It’s added to the list of chunks in tcache.
At offset 0x55555555*96d0, you can see a pointer to the next chunk in your tcache bin. If you print the chunk structure at offset 0x5555555596c0, it turns out that the fd
field (pointer to the next free chunk in the bin) is equal to 0x5555555596b0* – this is the chunk you have freed one string earlier. The values of the fd_nextsize
and bk_nextsize
pointers are of no importance for this chunk since they are only used for chunks stored in the large bin.
When you free a chunk 0x200 bytes in size, it goes to another bin (the one suitable for its size).
String 21 is executed, and the freed chunk goes to the unsorted bin.
String 23 frees another 0x500 chunk. Now there are two chunks in the unsorted bin, and you can see how these chunks are stored in a doubly-linked list.
You see that chunk 0x55555555a300 is at the top of the list. The fd
field points to the next chunk: 0x5555555598e0.
Fast bin Dup
Now that you possess the required knowledge, let’s perform the simplest fastbin duplication attack. The idea is as follows: if a double-free vulnerability is present in an app, then you can force malloc
to return the same chunks from fastbin. This technique is used to get a write-what-where primitive (here is an example from 0ctf).
I am going to analyze an example of this attack available in the how2heap repository.
info
I will examine the code for libc 2.31 because this version is used in Ubuntu 20.04. You can use different code, but make sure that the libc version matches the one used by your program.
First of all, you have to fill tcache (as you remember, tcache stores a maximum of seven chunks of the same size).
void *ptrs[8];for (int i=0; i<8; i++) { ptrs[i] = malloc(8);}for (int i=0; i<7; i++) { free(ptrs[i]);}
This has to be done because the tcache implementation in libc 2.31 has a built-in protection against such an attack. When a chunk is placed in tcache, a pointer to tcache is saved in it.
typedef struct tcache_entry{ struct tcache_entry *next; /* This field exists to detect double frees. */ struct tcache_perthread_struct *key;} tcache_entry;//...static __always_inline voidtcache_put (mchunkptr chunk, size_t tc_idx){ tcache_entry *e = (tcache_entry *) chunk2mem (chunk); /* Mark this chunk as "in the tcache" so the test in _int_free will detect a double free. */ e->key = tcache;//...}
Prior to putting the freed chunk into tcache, the free
function, checks whether the pointer to tcache is already stored in this chunk. If yes, a double free check is triggered. By the way, in the current version of glibc, this check has been improved: instead of a pointer to tcache, a random number is used as a key.
if (__glibc_unlikely (e->key == tcache)) { tcache_entry *tmp; LIBC_PROBE (memory_tcache_double_free, 2, e, tc_idx); for (tmp = tcache->entries[tc_idx]; tmp; tmp = tmp->next) if (tmp == e) malloc_printerr ("free(): double free detected in tcache 2"); /* If we get here, it was a coincidence. We've wasted a few cycles, but don't abort. */ }
Next, three buffers are allocated. Calloc
is used for allocation since it doesn’t use tcache. I know that in real life, calloc
is used more rarely than malloc
, but for training purposes it’s OK.
int *a = calloc(1, 8);int *b = calloc(1, 8);int *c = calloc(1, 8);
Then I sequentially free a
and b
.
free(a);free(b);
Nothing criminal so far; the heap looks as shown below.
There are two chunks in fastbin[
; b→fd
points at a
.
But if you run free(
again, you’ll see that the a
chunk is added to fastbin[
for a second time.
Now if you allocate three more buffers with calloc
, you’ll get three identical pointers since all of them are taken from fastbin[
!
Let’s find out why this trick has worked. The old
chunk in the snippet below is the old chunk at the top of fastbin, and p
is the chunk being freed. If you try to free the chunk at the top of fastbin twice (old
), the program will display the message “double free or corruption (fasttop)” and close.
unsigned int idx = fastbin_index(size);fb = &fastbin (av, idx);/* Atomically link P to its fastbin: P->FD = *FB; *FB = P; */mchunkptr old = *fb, old2;//.../* Check that the top of the bin is not the record we are going to add (i. e., double free). */if (__builtin_expect (old == p, 0)) malloc_printerr ("double free or corruption (fasttop)");p->fd = old2 = old;
This is why you can’t successfully execute free(
twice. But if you put a b
chunk on top of fastbin, then free
will successfully overwrite the fd
pointer of the a
chunk (p->
in the above snippet) and add it to fastbin!
More on heap attacks
If you’ve finished this article and heap exploitation is of interest to you, I strongly recommend reviewing other classical attacks from the how2heap repository; don’t forget to open in the next tab malloc.
from the tested glibc version.