[HN Gopher] Benchmarking C++ Allocators
___________________________________________________________________
 
Benchmarking C++ Allocators
 
Author : jeffbee
Score  : 50 points
Date   : 2020-04-27 19:17 UTC (3 hours ago)
 
(HTM) web link (docs.google.com)
(TXT) w3m dump (docs.google.com)
 
  | favorited wrote:
  | A bit unrelated, but if anyone is interested, I recently watched
  | Andrei Alexandrescu's 2015 talk about a composable allocator
  | design in C++ and it was really interesting. And Andrei's
  | presentation style was fantastic, per usual.
  |
  | Note that this talk is about a std::allocator-like type,
  | potentially _on top of_ one of malloc/jemalloc/etc.
  |
  | https://www.youtube.com/watch?v=LIb3L4vKZ7U
  | inquirrer wrote:
  | How exactly does tcmalloc optimize away new/delete at compile
  | time?
  | scott_s wrote:
  | It does not optimize them away entirely, but it eliminates the
  | actual function call through inlining. User code can implement
  | its own new and delete. If those implementations are available
  | during compilation, then they are candidates for inlining just
  | like any other function.
  | quotemstr wrote:
  | The C++ allocator interface really needs to support realloc.
  | Vector, in many cases, could avoid copying elements,
  | particularly in cases where the move constructor is trivial ---
  | e.g., vector<int>.
  | nteon wrote:
  | Can you compare `tcmalloc` to e.g. tcmalloc dynamically loaded,
  | or jemalloc specified with Bazel's malloc option? It is unclear
  | from the post whether the wins come from Google's internal
  | post-gperftools improvements to tcmalloc, or from the reduction
  | in overhead from bypassing the PLT.
  | balls187 wrote:
  | Thank you for sharing.
  |
  | Would you consider linking your source code, including
  | makefiles?
  | CppCoder wrote:
  | Let's not forget a benchmark is not a typical application. Just
  | because it improves without dynamic linking does not mean it
  | would do so in every application.
  | But I definitely would not dynamically link a custom allocator.
  | For me the draw is the improved memory usage and performance,
  | which I do not want to risk losing.
  | AshamedCaptain wrote:
  | > For C++ programs, replacing malloc and free at runtime is the
  | worst choice. When the compiler can see the definition of new
  | and delete at build time it can generate far better programs.
  | When it can't see them, it generates out-of-line function calls
  | to malloc for every operator new, which is bananas.
  |
  | This argument smells like bullshit, sorry, and it's totally not
  | justified in the article. Giving just numbers without even
  | explaining a potential cause points to benchmark setup failure
  | more than anything.
  |
  | Unless you are talking about statically linking your malloc
  | implementation, your entire STL and all dependent libraries AND
  | then performing LTO on it -- and even then I'd be surprised if
  | there's any noticeable improvement.
  |
  | Not to mention I don't know of _any_ real-life executable that
  | does this...
  | jlebar wrote:
  | There indeed isn't a ton of detail in TFA, and so I can
  | understand your skepticism.
  |
  | Jeff Baker is an expert on this at Google.
  |
  | Until you've seen the sorts of analyses that are conducted
  | inside of Google, it may be hard to appreciate the level of
  | scrutiny that _every CPU instruction_ in tcmalloc has gotten
  | from people like Jeff.
  | notacoward wrote:
  | > statically linking your malloc implementation AND then
  | performing LTO
  |
  | This is exactly how some large tech companies build all their
  | executables. You might not get to run those, but they
  | definitely exist.
  | banachtarski wrote:
  | > This argument smells like bullshit, sorry, and it's totally
  | not justified in the article. Giving just numbers without even
  | explaining a potential cause points to benchmark setup failure
  | more than anything.
  |
  | Why? This argument sounds perfectly reasonable and is true in
  | many other contexts as well.
  | mitchs wrote:
  | When I backtrace from my malloc wrapper, injected with
  | LD_PRELOAD, I see malloc, std::new, then the call site that
  | involved new. Unless std::new is doing some funky shit (like
  | detecting that dlsym("malloc") isn't glibc's malloc, and
  | changing implementations) I have my doubts about the claim that
  | dynamic replacement is costly. It appears new already links
  | malloc dynamically.
  | jeffbee wrote:
  | That is the entire point of the post. If you leave the
  | resolution of malloc to run time, the best you can hope for is
  | that new calls malloc, out of line. If you build with an
  | implementation of new, the outcomes may be dramatically better.
  |
  | By the way, there is no such thing as std::new. We are
  | discussing ::new.
  | mitchs wrote:
  | Heh, std::new was some cruft I got out of backtrace(3) with
  | some wacky embedded toolchain.
  |
  | Are these gains supposed to be from inlining the top-level
  | function of malloc into new? Compared to the costs of what can
  | go on inside malloc... Is that the biggest problem?
  | mehrdadn wrote:
  | > For C++ programs, replacing malloc and free at runtime is the
  | worst choice. When the compiler can see the definition of new
  | and delete at build time it can generate far better programs.
  | When it can't see them, it generates out-of-line function calls
  | to malloc for every operator new, which is bananas.
  |
  | I feel like whenever I've dealt with general-purpose memory
  | allocation (as opposed to special-purpose, like from a stack
  | buffer without freeing) this kind of overhead has been dwarfed
  | by the actual overhead of the allocation and has been pointless
  | to worry about. Is this not the case in others' experience?
  | jeffbee wrote:
  | There are a wide variety of outcomes available when the
  | implementation of global operator new is visible to the
  | compiler. It may inline the entire new, and having done so it
  | may also apply any other optimization that would apply to any
  | other function.
  | Without an allocator at build time, the only thing it can do is
  | generate a call to malloc. Leaving malloc resolution to runtime
  | also negates all benefits of sized delete.
  | the8472 wrote:
  | My understanding is that compilers are allowed to treat malloc
  | specially, i.e. they can assume that it returns non-aliasing
  | memory, does not mutate other memory, and can be elided if its
  | result is unused.
  | mehrdadn wrote:
  | I'm not aware of that being a blanket assumption they can make,
  | is it? (Is it in the C++ standard?) That kind of thing is
  | generally dictated by compiler-specific annotations (noalias
  | etc.) which you can put on any function, and which shouldn't
  | apply unless you write them explicitly...
  | mehrdadn wrote:
  | I'm completely aware of everything you said, but it's not what
  | I asked. I was very specifically not asking what is
  | theoretically possible but what actually happens in practice.
  | Because I don't see any evidence that this is actually the case
  | in practice. For a memory allocator that has a legitimate
  | general-purpose deallocator (i.e. one that is a reasonable
  | replacement for malloc/free), my experience says this kind of
  | overhead being worth optimizing is very much hypothetical
  | rather than the reality. Of course, for allocators that skip
  | half the steps (e.g. don't bother to return memory dynamically,
  | or only work in specific cases) the story might be very
  | different, but it's not really fair to compare them against
  | general-purpose ones.
  |
  | Now, I'm perfectly happy to accept that my imagination and
  | experience are just lacking and that's why I don't see such use
  | cases, but I'm seeing multiple other top comments voicing the
  | same concerns as I am, so a plausible example demonstrating
  | this would likely go a long way toward convincing people this
  | is the right thing to focus on.
  | saagarjha wrote:
  | Have you ever seen operator new inlined in practice?
  | I've always seen it make a call to malloc, because I can't
  | think of a compiler actually containing the implementation of a
  | memory allocator inside of it...
  | acqq wrote:
  | > Is this not the case in others' experience?
  |
  | My experience is that C++ programs that try to be "Java" are
  | the worst programs, from a performance standpoint, that can be
  | written. And here I mean the "Java" practice that whatever
  | object you have, it has to be made with "new." When one wants
  | really good performance in C++, that's the worst thing one can
  | do. And my observation includes all the implicit allocations
  | due to the use of the "containers." Not to mention other
  | effects that then become a problem in long-running programs.
  |
  | That being said, in the targets I used, the compiler was
  | effectively never able to avoid calling the actual malloc and
  | free behind the new, and that matches your experience: malloc
  | and free themselves were a much bigger overhead than the few
  | instructions that wrapped them to appear as "new."
  |
  | It was, however, a few years ago that I last analyzed something
  | like that -- specifically, for code which has to be really fast
  | I simply avoid having these wrapped "malloc" allocations in the
  | hot paths, so I haven't had to investigate that specific
  | aspect.
  | chubot wrote:
  | I also think that if it _does_ matter, it could be a sign
  | you're using new too much.
  |
  | In other words, the style of "invoke new every time I need a
  | new object" is not really what most people who care about
  | performance do. There are higher-level patterns than that, and
  | often pluggable allocators, which could make the call
  | out-of-line anyway, etc.
  | summerlight wrote:
  | Modern allocators are extremely optimized for a fast path which
  | consists of tens of instructions and usually takes a few
  | nanoseconds. Function-call overhead is not negligible in this
  | case.
  | mehrdadn wrote:
  | It feels a little bit like saying modern cars are extremely
  | optimized for fast acceleration, with a 0-to-60 of 3 seconds
  | instead of 9 seconds, and acceleration delay is not negligible
  | in this case. I mean, I suppose you're right and it would be a
  | 3x improvement, but surely you're not using your
  | (general-purpose) heavyweight tool (car) to just zoom down the
  | block and back at peak acceleration all the time, right? And if
  | you really do have a niche requirement like that, aren't there
  | bigger fish to fry here (like eliminating the car, or the need
  | to transport people there & back altogether)?
  | summerlight wrote:
  | I'm not sure what you're trying to achieve with this metaphor.
  | Are you suggesting that the fast path itself is "niche" in
  | terms of allocator performance?
  | Ididntdothis wrote:
  | I tend to agree. For some types of applications it may make
  | sense to worry about it, but there are usually other things to
  | look into first.
  | winrid wrote:
  | Yeah, I think it's very situational. Like, an arena allocator
  | is going to give you a tremendous bump in allocation
  | performance since you just allocate once and reuse the memory
  | every frame - to give a gaming-industry example.
  | BubRoss wrote:
  | An arena allocator isn't going to be the underpinning of the
  | new operator.
  | saagarjha wrote:
  | It could be!
  | The_rationalist wrote:
  | They should include mimalloc. It is gaining momentum; for
  | example, it is the default allocator for Kotlin/Native.
  | jeffbee wrote:
  | Picking the best allocator was not my goal. I was writing about
  | how to correctly use any allocator. I agree that mimalloc is
  | worth evaluating. Note that Microsoft says this in their docs,
  | which agrees with what I am driving at:
  |
  | """ For best performance in C++ programs, it is also
  | recommended to override the global new and delete operators.
  | For convenience, mimalloc provides mimalloc-new-delete.h which
  | does this for you -- just include it in a single(!) source file
  | in your project. """
  | moonchild wrote:
  | Note that there are other considerations besides absolute
  | allocation speed, like memory fragmentation, which isn't
  | affected by inlining. In the ithare post they link, jemalloc
  | thrashes all other allocators[1] on that front. It would be
  | interesting to see if the newer version of tcmalloc is improved
  | at all in that respect.
  |
  | 1: http://ithare.com/wp-content/uploads/malloc-overhead.png
  | jeffbee wrote:
  | In this test jemalloc actually uses the most memory of all.
  | With default tunings it uses 700k pages at peak, compared to
  | 170k for tcmalloc and 131k for glibc. I have no idea if that is
  | relevant in real life and I have never cared, so I didn't
  | mention it.
  | saagarjha wrote:
  | That's actually quite interesting, as IIRC jemalloc has much
  | looser alignment than other allocators and thus can give
  | "tighter" blocks for small allocations.
  | the8472 wrote:
  | The post has no date and does not mention the versions used or
  | the allocator tunables. Considering that this is a shifting
  | landscape, that makes it difficult to judge how relevant those
  | results are today.
  | jeffbee wrote:
  | True. I added a date at the top. I wrote it this morning with
  | head revisions of both jemalloc and tcmalloc.
  | benibela wrote:
  | Something that could help a lot with performance would be
  | functions that handle memory for many small blocks at once.
  |
  | Like a malloc extension function that allocates many small
  | blocks at once. Rather than returning one block of size n, it
  | would return k blocks of size n.
  |
  | Of course, the user could do something similar by allocating
  | one big block of size k * n ordinarily. But then the user needs
  | to keep track of which small blocks belong to which big block,
  | and it might be too big when some small blocks are no longer
  | needed.
  | Say the function needs k small blocks temporarily for
  | processing, but only returns k/10 small blocks. Then 90% of the
  | memory would be wasted.
  |
  | But if you could allocate k small blocks at once, the allocator
  | could allocate one big array of k*(n + sizeof metadata). Then
  | each of the small blocks could be freed like a normally
  | allocated block of size n, but they would have all the
  | advantages of quick allocation and the cache locality of a big
  | array.
  |
  | Or, alternatively, there could be a partial free that does not
  | free the entire malloced block, but only a part of it. Then you
  | could allocate a block of size k*n, and when you do not need it
  | anymore, free the 90% of the block that you do not need but
  | keep the 10% that you do.
  | scott_s wrote:
  | Region-based allocation is similar to this:
  | https://en.wikipedia.org/wiki/Region-based_memory_management
  | munchbunny wrote:
  | Object pools are a pretty common pattern for that kind of
  | usage. Is this any different, or are you just asking to push
  | the "allocation" into the memory allocator instead?
  | benibela wrote:
  | The difference would be that each object can be freed
  | individually when it is no longer needed.
  |
  | Especially when the code that creates the objects and the code
  | that uses/discards the objects are in two different projects.
  |
  | Like a library that can load a JSON file as some kind of tree
  | structure. Then someone uses the library, loads a file of a
  | million objects, but then only needs a single object. All the
  | other objects should be freed, but if they are in a pool, they
  | cannot be. The library does not know which object the user
  | wants, and a user of the library should not need to know how
  | the library allocates the objects.
  | ape4 wrote:
  | In my experience an application-specific pool works best. Let's
  | say a part of your program does a bunch of malloc()s then
  | frees... put it into a pool, then delete the pool when it's
  | done.
  | haberman wrote:
  | We often call this "arena allocation". Wikipedia calls it
  | Region-based memory management:
  | https://en.wikipedia.org/wiki/Region-based_memory_management
  | pizlonator wrote:
  | My data says that perf of malloc/free in _actual programs_ (not
  | fake ass shit calling malloc in a loop) is never affected by
  | the call overhead. Not by dynamic call overhead, not even if
  | you write a wrapper that uses a switch or indirect call to
  | select mallocs. The reason is simple: malloc perf is all about
  | data dependencies in the malloc/free path and how many cache
  | misses it causes or prevents. Adding indirection on the malloc
  | call path is handled gracefully by the CPU's ability to extract
  | ILP, and malloc is a perfect candidate since its bottlenecks
  | will be cache misses inside the malloc/free paths.
  | saagarjha wrote:
  | Have per-thread arenas made locking less of an issue these
  | days?
  | pizlonator wrote:
  | Not everyone does per-thread arenas as a way of solving this
  | problem. And I don't think any solution to locking in malloc is
  | a silver bullet. Some schemes work exceedingly well for
  | specific use cases. So there will always be folks in the malloc
  | space who think theirs is better, or that theirs completely
  | solves the problem, because it is better and does solve the
  | problem for their use case.
  | s_kanev wrote:
  | This was the case a few years back, when the fastest pools were
  | implemented with recursive data structures (e.g. linked lists
  | for the freelists in gperftools).
  |
  | In the new tcmalloc (and, I think, hoard?) the fastest pools
  | are essentially slabs with bump allocation, so the fastest (and
  | by far the most common) calls are a grand total of 15 or so
  | instructions, without many cache misses (size-class lookups
  | tend to stay in the cache). Call overhead can be a substantial
  | chunk of that.
  | pizlonator wrote:
  | First of all, let's be clear: for a lot of algorithms, adding
  | call overhead around them is free because the CPU predicts the
  | call, predicts the return, and uses register renaming to handle
  | arguments and return.
  |
  | But that's not the whole story. Malloc perf is about what
  | happens on free and what happens when the program accesses the
  | memory that malloc gave it.
  |
  | When you factor that all in, it doesn't matter how many
  | instructions the malloc has. It matters whether those
  | instructions form a bad dependency chain, whether they miss
  | cache, whether the memory we return is in the "best" place, and
  | how much work happens in free (using the same metrics -
  | dependency chain length and misses, not the total number of
  | instructions or whether there's a call).
  | jlebar wrote:
  | > First of all, let's be clear: for a lot of algorithms, adding
  | call overhead around them is free because the CPU predicts the
  | call, predicts the return, and uses register renaming to handle
  | arguments and return.
  |
  | This is only thinking at the hardware layer. There are also
  | effects at the software layer.
  |
  | Function calls not known to LTO prevent all sorts of compiler
  | optimizations as well. In addition, as Jeff says on this
  | thread, not being able to inline free means you get no benefit
  | from sized delete, which is a substantial improvement on some
  | workloads. (I'd cite the exact number, but I'm not sure Google
  | ever released it.)
  |
  | Source: I literally worked on this.
  | pizlonator wrote:
  | Let's be clear: even without LTO, the compiler knows what
  | malloc and free do at a semantic level.
  |
  | I buy the sized-delete win, I've seen that too, in some niche
  | cases. But that's a far cry from the claim that not inlining
  | malloc is bananas. If that's all you got, then this sounds like
  | a stunning retreat from the thesis of Jeff's post.
  | You're just saying that you've got a free that benefits from
  | size specialization, which has nothing to do with whether
  | malloc got inlined.
  | nightcracker wrote:
  | You are not wrong, but you're also missing the point of the
  | person you're replying to.
  | jlebar wrote:
  | > My data says that perf of malloc/free in actual programs (not
  | fake ass shit calling malloc in a loop) is never affected by
  | the call overhead.
  |
  | Your data is legitimate, but you've got to ask yourself: would
  | Jeff specifically, and Google more generally, care so much
  | about it if it were also true for their use cases?
  | pizlonator wrote:
  | I don't consider appeals to authority when thinking about
  | technical issues.
  | jlebar wrote:
  | This isn't an appeal to authority. It's a suggestion that we
  | consider the possibility that they are not wasting their time
  | on make-work.
  |
  | This is Bayesian logic (our prior is that the systems engineers
  | at Google are not incompetent), not an appeal to anything.
  | pizlonator wrote:
  | Just because someone isn't incompetent doesn't mean that we
  | should assume they are right. Highly competent people are often
  | wrong. Questioning things is good.
___________________________________________________________________
(page generated 2020-04-27 23:00 UTC)