No News is Good News

Using ‘new’ without care can be slow. Paul Floyd uses Godbolt’s compiler explorer to see what happens when you do.

There are two influences that have inspired me to write this article. The first is that I’ve been playing a lot with Compiler Explorer ( https://godbolt.org ). Secondly, a while back I read Optimized C++ by Kurt Guntheroth. It contains a chapter on using dynamic memory (Chapter 6: Optimize Dynamically Allocated Variables).

I agree with a lot of what is said. In short, there is a description of the types of memory available in C++ (automatic, dynamic and static); a description of how this memory relates to variables in code; APIs to deal with dynamic memory; smart pointers and many tips on optimizations related to dynamic memory. In this article I’m going to explore why you should be trying to optimize use of new by digging down to the machine code.

When I’m looking at production C++ code I do see a lot of gratuitous uses of new , for instance

  list <handle>* handleList = new list <handle>;
  ...
  processList(handleList);

I suspect that there are a few possible reasons for writing such code:

the influence of Java
the assumption that where an API takes a pointer, you have to pass it a pointer allocated on the heap
not realizing that, at least for standard library containers, the object itself is quite small and that the bulk of the memory (what is contained) is dynamically allocated. A std::list is only 24 bytes, for instance (on 64bit Linux with GCC).

Obviously, in the second case the memory doesn’t need to be allocated dynamically. The top-level object can perfectly well be on the stack. For instance, the above example could have been written:

  list <handle> handleList;
  ...
  processList(&handleList);

In addition, if I could change the interface to the processList API, I would almost certainly change it to take a reference rather than a pointer. Furthermore, std::list is rarely a good choice when it comes to performance, so I’d probably also change that if I could.

Both versions of the code do the same thing. So, what is wrong with the first version?

Memory refresher

For those of you who are a bit rusty on what the difference is between dynamic and automatic memory (I’ll skip static), here is a quick refresher (Wikipedia has a longer description with diagrams [ Wikipedia ]). Firstly, they are also commonly also known by other names, referring to how they are often implemented. Dynamic allocation is also known as heap allocation, and automatic allocation is known as stack allocation. The stack is a large block of memory. It is referred to by CPU registers such as the Stack Pointer. Memory can be ‘allocated’ on the stack very quickly simply by manipulating the stack registers. The main drawbacks of stack memory are that it does not persist beyond the current scope and it can be quite limited in size. Heap memory is a separate block of memory, but this time it is controlled via functions like malloc [ C++Ref-a ] and operator new [ C++Ref-b ]. It’s not so limited in size and persists until explicitly deallocated.

Problems with new

Performance

There are a couple of reasons why heap allocation has worse performance than stack allocation.

More memory is required – the allocator generally needs a small amount of memory for housekeeping in addition to the memory requested by the caller. In addition, the amount of memory might get rounded up to a larger size like the nearest power of 2 whilst stack allocation probably only rounds up to the nearest machine word size. On 64bit Linux, allocations have a 16byte minimum size and there is an 8byte housekeeping overhead.
The biggest difference is that functions like new and delete are relatively slow. Much effort by library and OS writers has been made to make them as fast as possible (for instance, see the history of jemalloc [ github-a ]).

There is a third question, concerning the size of code that gets generated. This complicates the picture because it isn’t always an apples and pears comparison. When you use stack allocation, it generally means that you are using RAII and the compiler will generate the code necessary for clean-up at the end of the scope. When you use heap allocation with raw pointers as above then it’s up to you to ensure that resources get freed.

About the example code

In the examples that follow, I will continue to use handleList . In my testing I defined Handle to be

  class Handle {
  public:
    int data;
  };

It doesn’t matter what Handle is. The only thing of importance is to consider that handleList itself is something that needs some memory. I’m going to stick with the std::list in the examples for two reasons. Firstly, I want there to be a code smell. Secondly and more seriously, though the examples presented here are trivial I don’t want them to be so small that the compiler optimizes them to almost nothing. Also, for that reason I’ve added calls to an externally defined populateList function.

The assembly was generated on Compiler Explorer using GCC 7.3 64bit.

Comparison of allocation methods

Digging deeper into the different allocation methods, if we have a stack allocation function like this:

  void processStack()
  {
    list<Handle> handleList;
    populateList(&handleList);
  }

the optimized machine code that gets generated is in Table 1

processStack():
1	push r12	A: Function entry prologue
2	push rbp
3	push rbx
4	sub rsp, 48
5	lea rbp, [rsp+16]
6	mov QWORD PTR [rsp+32], 0	B: Inlined std::list construction of handleList on stack
7	mov QWORD PTR [rsp+8], rbp
8	mov rdi, rbp
9	movq xmm0, QWORD PTR [rsp+8]
10	punpcklqdq xmm0, xmm0
11	movaps XMMWORD PTR [rsp+16], xmm0
12	call populateList()	C: Call function
13	mov rdi, QWORD PTR [rsp+16]	D: Check result and inlined destructor
14	cmp rdi, rbp
15	je .L1
16	.L3: mov rbx, QWORD PTR [rdi]
17	call operator delete(void*)
18	cmp rbx, rbp
19	mov rdi, rbx
20	jne .L3	E: Function exit epilogue
21	.L1: add rsp, 48
22	pop rbx
23	pop rbp
24	pop r12
25	ret
26	mov rdi, QWORD PTR [rsp+16]	F: Stack Unwind Handling
27	mov rbx, rax
28	.L6: cmp rdi, rbp
29	je .L5
30	mov r12, QWORD PTR [rdi]
31	call operator delete(void*)
32	mov rdi, r12
33	jmp .L6
34	.L5: mov rdi, rbx
35	call _Unwind_Resume

Table 1

Note that for non-exceptional flow, only items in blocks A to F in column 3 are executed. Block F only gets called when exceptions are thrown via the stack unwinding mechanism.

On the other hand, for heap allocation that does not ensure proper clean-up, like this:

  void processHeap()
  {
    list<Handle>* handleList = new list<Handle>;
    populateList(handleList);
    delete handleList;
  }

the machine code that gets generated is in Table 2.

processHeap():
1	push rbp	A: Function Entry Prologue
2	push rbx
3	mov edi, 24
4	sub rsp, 8
5	call operator new(unsigned long)	B: Dynamic allocation and inlined std::list constructor
6	mov rbx, rax
7	mov rdi, rax
8	mov QWORD PTR [rax+16], 0
9	mov QWORD PTR [rax], rax
10	mov QWORD PTR [rax+8], rax
11	call populateList()	C: Call function
12	mov rdi, QWORD PTR [rbx]	D: Inlined std::list destructor, function exit epilogue and tail optimized delete
13	cmp rbx, rdi
14	je .L12
15	.L13: mov rbp, QWORD PTR [rdi]
16	call operator delete(void*)
17	cmp rbx, rbp
18	mov rdi, rbp
19	jne .L13
20	.L12: add rsp, 8
21	mov rdi, rbx
22	mov esi, 24
23	pop rbx
24	pop rbp
25	jmp operator delete(void*, unsigned long)

Table 2

Thus, it has lost exception safety but made the generated code slightly shorter.

To regain exception safety with heap allocation, still using raw pointers, we would need to write something like Listing 1.

void processHeapNoleak()
{
    list<Handle>* handleList = nullptr;
    try
    {
        handleList = new list<Handle>;
        populateList(handleList);
        delete handleList;    
    }
    catch (...)
    {
        delete handleList;
        throw;
    }
}

Listing 1

That doesn’t look too pretty. Please don’t do this at home. The generated machine code for this is in Table 3.

processHeapNoleak():
1	push rbp	A: Function Entry Prologue
2	push rbx
3	mov edi, 24
4	sub rsp, 8
5	call operator new(unsigned long)	B: Dynamic allocation and inlined std::list constructor
6	mov rbx, rax
7	mov QWORD PTR [rax+16], 0
8	mov rdi, rax
9	mov QWORD PTR [rbx], rax
10	mov QWORD PTR [rbx+8], rax
11	call populateList()	C: Call function
12	mov rdi, QWORD PTR [rbx]	D: Inlined std::list destructor and tail optimized delete
13	cmp rbx, rdi
14	je .L20
15	.L21: mov rbp, QWORD PTR [rdi]
16	call operator delete(void*)
17	cmp rbx, rbp
18	mov rdi, rbp
19	jne .L21
20	.L20: add rsp, 8
21	mov rdi, rbx
22	mov esi, 24
23	pop rbx
24	pop rbp
25	jmp operator delete(void*, unsigned long)
26	mov rdi, rax	E: Exception handling, inlined std::list destructor delete and rethrow
27	call __cxa_begin_catch
28	.L23: call __cxa_rethrow
29	mov rdi, rax
30	call __cxa_begin_catch
31	mov rdi, QWORD PTR [rbx]
32	.L19: cmp rbx, rdi
33	je .L24
34	mov rbp, QWORD PTR [rdi]
35	call operator delete(void*)
36	mov rdi, rbp
37	jmp .L19
38	mov rbx, rax
39	call __cxa_end_catch
40	mov rdi, rbx
41	call _Unwind_Resume
42	.L24: mov esi, 24
43	mov rdi, rbx
44	call operator delete(void*, unsigned long)
45	jmp .L23

Table 3

Moving on quickly, let’s consider a fourth alternative, using a smart pointer.

  void processHeapSmartPtr()
  {
    auto handleList = make_unique<list<Handle>>();
    populateList(handleList.get());
  }

OK, that’s code that I could live with.

Note that if you want to use only the smart pointer and not the underlying raw pointer you would have to rewrite or overload populateList . When I tried this, I noticed that passing a reference to the unique_ptr prevented the compiler from inlining and optimizing the pointer use resulting in more register use and larger code. Furthermore, the CppCoreGuidelines discourage passing references to smart pointers in cases like this [ github-b ]. When get() is used, there is no interface issue and the code size is barely any larger than the stack version.

The code flow in this case is quite similar to processStack .

The machine code for this function is in Table 4.

processHeapSmartPtr():
1	push r12	A: Function Entry Prologue
2	push rbp
3	mov edi, 24
4	push rbx
5	call operator new(unsigned long)	B: Dynamic allocation and inlined std::list constructor
6	mov rbx, rax
7	mov QWORD PTR [rax+16], 0
8	mov rdi, rax
9	mov QWORD PTR [rbx], rax
10	mov QWORD PTR [rbx+8], rax
11	call populateList()	C: Call function
12	mov rdi, QWORD PTR [rbx]	D: Inlined std::list destructor and tail optimized delete
13	cmp rbx, rdi
14	je .L33
15	.L34: mov rbp, QWORD PTR [rdi]
16	call operator delete(void*)
17	cmp rbx, rbp
18	mov rdi, rbp
19	jne .L34
20	.L33: mov rdi, rbx
21	mov esi, 24
22	pop rbx
23	pop rbp
24	pop r12
25	jmp operator delete(void*, unsigned long)
26	mov rdi, QWORD PTR [rbx]	E: Stack unwind handling, inlined std::list destructor
27	mov rbp, rax
28	.L37: cmp rbx, rdi
29	je .L36
30	mov r12, QWORD PTR [rdi]
31	call operator delete(void*)
32	mov rdi, r12
33	jmp .L37
34	.L36: mov rdi, rbx
35	mov esi, 24
36	call operator delete(void*, unsigned long)
37	mov rdi, rbp
38	call _Unwind_Resume

Table 4

You may have noticed that the non-exception path for the three versions using the heap are the same (lines 1 to 25 and blocks A to D in the tables of assembler). The only thing that is different is how they handle exceptions.

Here is a summary of the size of the code generated. The byte sizes were obtained by nm

Performance

I did some measurements of these 4 functions. I just wrote a main() with a loop running a million times calling the 4 functions with empty stub populateList functions.

With Valgrind callgrind, I got the following numbers of op-codes executed per call.

Function	Op-codes executed
processStack	23
processHeap	27
processHeapNoLeak	27
processHeapSmartPointer	29

These are the exclusive counts i.e., only for the functions themselves. Whilst callgrind counts every instruction, by default it doesn’t record functions that take less than 1% of the total. So, adding a loop is an easy way to ensure that they are included in the output. I get a slightly higher count for processHeapSmartPointer as I was doing these tests with GCC trunk, I expect that if I’d used GCC 7.3 the count would have been the same as the other two Heap functions.

This is pretty much what I was expecting

processStack is the fastest but not the smallest due to the exception handling
processHeap is the smallest because it does no exception handling
All of the functions using the heap execute similar numbers of machine instructions.

The picture is very different for the inclusive counts, that is the functions plus any callees.

Function	Op-codes executed
processStack	24
processHeap	360
processHeapNoLeak	360
processHeapSmartPointer	362

There are two things that stand out

The number of op-codes executed hasn’t change for the stack version, other than the call to the empty stub.
The 3 heap versions have essentially the same count.

As usual, you may get different results on different platforms/compilers – I tried a couple of others and the results were broadly similar. Furthermore, this test case just uses stub functions. With real functions that actually do something the cost of the heap allocation would become relatively smaller. That said, the stack allocation here is around 15 times faster.

Conclusions

I think that the case is more or less settled. Use stack allocation where you can. It’s safer, faster and requires writing the least code. Obviously, there are times when you need heap allocation:

when you have huge data
when you need extended object lifetime
when you have self-referential data structures like graphs
when you want to decouple interfaces like with the pimpl idiom
when you have deeply recursive functions.

When you have to use heap allocation, use smart pointers. There is a small code size overhead, but if you use make_unique (or make_shared ) then the difference in time performance is negligible compared to using smart pointers with the benefit of not having to worry about resource leaks.