[HN Gopher] Benchmarking C++ Allocators
       ___________________________________________________________________
        
       Benchmarking C++ Allocators
        
       Author : jeffbee
       Score  : 50 points
       Date   : 2020-04-27 19:17 UTC (3 hours ago)
        
 (HTM) web link (docs.google.com)
 (TXT) w3m dump (docs.google.com)
        
       | favorited wrote:
       | A bit unrelated, but if anyone is interested, I recently watched
       | Andrei Alexandrescu's 2015 talk about a composable allocator
       | design in C++ and it was really interesting. And Andrei's
       | presentation style was fantastic, per usual.
       | 
       | Note that this talk is about a std::allocator-like type,
        | potentially _on top of_ one of malloc/jemalloc/etc.
       | 
       | https://www.youtube.com/watch?v=LIb3L4vKZ7U
        
       | inquirrer wrote:
       | How does tcmalloc optimize away new/deletes at compile time
       | exactly?
        
         | scott_s wrote:
         | Not optimize them away entirely, but eliminate the actual
         | function call through inlining. User-code can implement their
         | own new and delete. If those implementations are available
         | during compilation, then they are candidates for inlining just
         | like any other function.
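          | 
          | For illustration, a minimal sketch of what such a user-
          | provided replacement looks like; my_malloc and my_free are
          | hypothetical stand-ins for an allocator's entry points:
          | 
          |     #include <cstddef>
          |     #include <new>
          |     
          |     // Hypothetical allocator entry points.
          |     void* my_malloc(std::size_t n);
          |     void my_free(void* p) noexcept;
          |     
          |     // Replacements for the global operators. If these
          |     // definitions are visible at build time (or via LTO),
          |     // they are candidates for inlining at each call site.
          |     void* operator new(std::size_t n) {
          |         if (void* p = my_malloc(n)) return p;
          |         throw std::bad_alloc();
          |     }
          |     void operator delete(void* p) noexcept { my_free(p); }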
        
       | quotemstr wrote:
       | The C++ allocator interface really needs to support realloc.
       | Vector, in many cases, could avoid copying elements, particularly
       | in cases where the move constructor is trivial --- e.g.,
       | vector<int>.
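        | 
        | A sketch of the kind of grow path this would enable; realloc
        | may extend the block in place, and the trivially-copyable check
        | is what makes the bitwise move legal (grow_buffer is invented
        | for illustration):
        | 
        |     #include <cstddef>
        |     #include <cstdlib>
        |     #include <type_traits>
        |     
        |     // Grow a heap buffer without element-by-element moves.
        |     template <class T>
        |     T* grow_buffer(T* data, std::size_t new_count) {
        |         static_assert(std::is_trivially_copyable<T>::value,
        |                       "bitwise relocation requires trivially "
        |                       "copyable T");
        |         // May extend in place; otherwise does one memcpy
        |         // internally instead of N move-constructor calls.
        |         return static_cast<T*>(
        |             std::realloc(data, new_count * sizeof(T)));
        |     }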
        
       | nteon wrote:
        | Can you compare `tcmalloc` to, e.g., tcmalloc dynamically
        | loaded, or jemalloc specified with Bazel's malloc option? It is
        | unclear from the post whether the wins come from Google's
        | internal improvements to tcmalloc post-gperftools, or from the
        | reduction in overhead from bypassing the PLT.
        
       | balls187 wrote:
       | Thank you for sharing.
       | 
       | Would you consider linking your source code, including make
       | files?
        
       | CppCoder wrote:
        | Let's not forget a benchmark is not a typical application. Just
        | because it improves without dynamic linking does not mean it
        | would do so in every application. But I definitely would not
        | dynamically link a custom allocator. For me the draw would be
        | the improved memory usage and performance, which I do not want
        | to risk losing.
        
       | AshamedCaptain wrote:
       | > For C++ programs, replacing malloc and free at runtime is the
       | worst choice. When the compiler can see the definition of new and
       | delete at build time it can generate far better programs. When it
       | can't see them, it generates out-of-line function calls to malloc
       | for every operator new, which is bananas.
       | 
       | This argument smells like bullshit, sorry, and it's totally not
       | justified in the article. Giving just numbers without even
       | explaining a potential causality points to benchmark setup
       | failure more than anything.
       | 
        | Unless you are talking about statically linking your malloc
        | implementation, your entire STL, and all dependent libraries,
        | AND then performing LTO on it. And even then I'd be surprised
        | if there's any noticeable improvement over this.
       | 
       | Not to mention I don't know of _any_ real-life executable that
       | does this...
        
         | jlebar wrote:
         | There indeed isn't a ton of detail in TFA, and so I can
         | understand your skepticism.
         | 
         | Jeff Baker is an expert on this at Google.
         | 
         | Until you've seen the sorts of analyses that are conducted
         | inside of Google, it may be hard to appreciate the level of
         | scrutiny that _every CPU instruction_ in tcmalloc has gotten
         | from people like Jeff.
        
         | notacoward wrote:
         | > statically linking your malloc implementation AND then
         | performing LTO
         | 
         | This is exactly how some large tech companies build all their
         | executables. You might not get to run those, but they
         | definitely exist.
        
         | banachtarski wrote:
         | > This argument smells like bullshit, sorry, and it's totally
         | not justified in the article. Giving just numbers without even
         | explaining a potential causality points to benchmark setup
         | failure more than anything.
         | 
         | Why? This argument sounds perfectly reasonable and is true in
         | many other contexts as well.
        
       | mitchs wrote:
       | When I backtrace from my malloc wrapper, injected with
       | LD_PRELOAD, I see malloc, std::new, then the call site that
        | involved new. Unless std::new is doing some funky shit (like
        | detecting that dlsym("malloc") isn't glibc's malloc and changing
        | implementations), I have my doubts about the claim that dynamic
       | replacement is costly. It appears new is already dynamically
       | linking malloc.
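        | 
        | For reference, the wrapper is roughly this shape (a minimal
        | sketch; a real interposer has to guard against recursion, since
        | backtrace itself can allocate):
        | 
        |     #include <dlfcn.h>     // dlsym, RTLD_NEXT (_GNU_SOURCE)
        |     #include <execinfo.h>  // backtrace
        |     #include <cstddef>
        |     
        |     // LD_PRELOADed malloc: resolve the next malloc in the
        |     // link chain (normally glibc's) and record a backtrace.
        |     extern "C" void* malloc(std::size_t n) {
        |         static void* (*real_malloc)(std::size_t);
        |         if (!real_malloc)
        |             real_malloc = reinterpret_cast<
        |                 void* (*)(std::size_t)>(
        |                 dlsym(RTLD_NEXT, "malloc"));
        |         void* frames[16];
        |         backtrace(frames, 16);  // shows ::new above this frame
        |         return real_malloc(n);
        |     }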
        
         | jeffbee wrote:
         | That is the entire point of the post. If you leave the
         | resolution of malloc to run time, the best you can hope for is
         | that new calls malloc, out of line. If you build with an
         | implementation of new, the outcomes may be dramatically better.
         | 
         | By the way there is no such thing as std::new. We are
         | discussing ::new.
        
           | mitchs wrote:
           | Heh, std::new was some cruft I got out of backtrace(3) with
           | some wacky embedded tool chain.
           | 
           | Are these gains supposed to be from inlining the top level
           | function of malloc into new? Compared to the costs of what
           | can go on inside malloc... Is that the biggest problem?
        
       | mehrdadn wrote:
       | > For C++ programs, replacing malloc and free at runtime is the
       | worst choice. When the compiler can see the definition of new and
       | delete at build time it can generate far better programs. When it
       | can't see them, it generates out-of-line function calls to malloc
       | for every operator new, which is bananas.
       | 
       | I feel like whenever I've dealt with general-purpose memory
       | allocation (as opposed to special-purpose like from a stack
       | buffer without freeing) this kind of overhead has been dwarfed by
       | the actual overhead of the allocation and pointless to worry
       | about. Is this not the case in others' experience?
        
         | jeffbee wrote:
         | There are a wide variety of outcomes available when the
         | implementation of global operator new is visible to the
         | compiler. It may inline the entire new, and having done so it
         | may also apply any other optimization that would apply to any
          | other function. Without an allocator at build time, the only
          | thing it can do is generate a call to malloc. Leaving malloc
         | resolution to runtime also negates all benefits of sized
         | delete.
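          | 
          | For context, sized delete is the C++14 overload below; the
          | compiler passes a size it already knows statically, so the
          | allocator can skip its size lookup (my_free_sized is a
          | hypothetical entry point):
          | 
          |     #include <cstddef>
          |     
          |     // Hypothetical size-aware deallocation entry point.
          |     void my_free_sized(void* p, std::size_t n) noexcept;
          |     
          |     // C++14 sized delete: the size the compiler already
          |     // knows is passed in, so no metadata lookup is needed.
          |     void operator delete(void* p, std::size_t n) noexcept {
          |         my_free_sized(p, n);
          |     }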
        
           | the8472 wrote:
           | My understanding is that compilers are allowed to treat
           | malloc specially, i.e. they can assume that it returns non-
           | aliasing memory, does not mutate other memory and can be
           | elided if its result is unused.
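            | 
            | A toy example of the elision case; recent Clang (and often
            | GCC) folds this whole function to "return 42" at -O2,
            | because the allocation provably doesn't escape:
            | 
            |     #include <cstdlib>
            |     
            |     // The compiler knows malloc's semantics, so an
            |     // allocation whose result never escapes can be
            |     // removed entirely.
            |     int no_allocation_needed() {
            |         int* p = static_cast<int*>(
            |             std::malloc(sizeof(int)));
            |         if (!p) return 0;
            |         *p = 42;
            |         int result = *p;
            |         std::free(p);
            |         return result;  // typically folds to: return 42
            |     }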
        
             | mehrdadn wrote:
             | I'm not aware of that being a blanket assumption they can
             | make, is it? (Is it in the C++ standard?) That kind of
             | thing is generally dictated by compiler-specific
             | annotations (noalias etc.) which you can put on any
             | function, and which shouldn't apply unless you write them
             | explicitly...
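              | 
              | (For the record, the annotations I mean look like this
              | in GCC/Clang; my_alloc is a made-up function:)
              | 
              |     #include <cstddef>
              |     
              |     // 'malloc' promises the result aliases nothing
              |     // live; 'alloc_size' tells the compiler the size
              |     // of the returned object (from argument 1).
              |     __attribute__((malloc, alloc_size(1)))
              |     void* my_alloc(std::size_t n);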
        
           | mehrdadn wrote:
           | I'm completely aware of everything you said, but it's not
           | what I asked. I was very specifically not asking what is
           | theoretically possible but what actually happens in practice.
           | Because I don't see any evidence that this is actually the
            | case in practice. For a memory allocator that has a
            | legitimate general-purpose deallocator (i.e. one that is a
            | reasonable replacement for malloc/free), my experience says
            | this kind of overhead being worth optimizing is very much
           | hypothetical rather than the reality. Of course for
           | allocators that skip half the steps (e.g. don't bother to
           | return memory dynamically or only work in specific cases)
           | then the story might be very different but it's not really
           | fair to compare them against general purpose ones?
           | 
            | Now I'm perfectly happy to accept my imagination/experience
            | is just lacking and that's why I don't see such use cases.
            | But I'm seeing multiple other top comments voicing the same
            | concerns as I am, so I'm mostly suggesting that a plausible
            | worked example would go a long way toward convincing people
            | this is the right thing to focus on.
        
           | saagarjha wrote:
            | Have you ever seen operator new inlined in practice? I've
           | always seen it make a call to malloc, because I can't think
           | of a compiler actually containing the implementation of a
           | memory allocator inside of it...
        
         | acqq wrote:
         | > Is this not the case in others' experience?
         | 
          | My experience is that C++ programs that try to be "Java" are,
          | from a performance standpoint, the worst programs that can be
          | written. And here I mean the "Java" practice that whatever
          | object you have, it has to be made with "new." When one wants
          | really good performance in C++, that's the worst thing one
          | can do. And my observation includes all implicit allocations
          | due to the use of the "containers." Not to mention other
          | effects that then become a problem in long-running programs.
         | 
          | That being said, in the targets I used the compiler was
          | effectively never able to avoid calling actual malloc and free
          | behind the new, and that matches your experience: malloc and
          | free themselves were a much bigger overhead than the few
          | instructions that wrapped them to appear as "new."
         | 
         | It was however a few years ago the last time I analyzed
         | something like that -- specifically for the code which has to
         | be really fast I simply avoid having these wrapped "malloc"
         | allocations in the hot paths, so I haven't had to investigate
         | that specific aspect.
        
         | chubot wrote:
          | I also think that if it _does_ matter, it could be a sign
          | you're using new too much.
         | 
         | In other words, the style of "invoke new every time I need a
         | new object" is not really what most people who care about
         | performance do. There are higher level patterns than that, and
         | often pluggable allocators, which could make the call out-of-
         | line anyway, etc.
        
         | summerlight wrote:
         | Modern allocators are extremely optimized for a fast path which
         | consists of tens of instructions and usually takes a few
          | nanoseconds. Function call overhead is not negligible in this
          | case.
        
           | mehrdadn wrote:
           | It feels a little bit like saying modern cars are extremely
           | optimized for a fast acceleration which consists of a 0-to-60
           | of 3 seconds instead of 9 seconds, and acceleration delay is
           | not negligible in this case. I mean, I suppose you're right
           | and it would be a 3x improvement, but surely you're not using
           | your (general-purpose) heavyweight tool (car) to just zoom
           | down the block and back at peak acceleration all the time,
           | right? And if you really do have a niche requirement like
           | that, aren't there bigger fish to fry here (like eliminating
           | the car or the need to transport people there & back
           | altogether)?
        
             | summerlight wrote:
             | I'm not sure what you're trying to achieve with this
             | metaphor. Are you suggesting that fast path itself is
             | "niche" in terms of allocator performance?
        
         | Ididntdothis wrote:
         | I tend to agree. For some types of applications it may make
         | sense to worry about it but there are usually other things to
         | look into first.
        
         | winrid wrote:
         | Yeah I think it's very situational. Like an arena allocator is
         | going to give you a tremendous bump in allocation performance
         | since you just allocate once and replace memory every frame -
         | to give the gaming industry example.
        
           | BubRoss wrote:
           | An arena allocator isn't going to be the underpinning of the
           | new operator.
        
             | saagarjha wrote:
             | It could be!
        
       | The_rationalist wrote:
        | They should include mimalloc. It is gaining momentum; for
        | example, it is the default allocator for Kotlin/Native.
        
         | jeffbee wrote:
         | Picking the best allocator was not my goal. I was writing about
         | how to correctly use any allocator. I agree that mimalloc is
         | worth evaluating. Note that Microsoft says this in their docs,
         | which agrees with what I am driving at:
         | 
         | """ For best performance in C++ programs, it is also
         | recommended to override the global new and delete operators.
          | For convenience, mimalloc provides mimalloc-new-delete.h which
         | does this for you -- just include it in a single(!) source file
         | in your project. """
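          | 
          | Concretely, per those docs, that override is a single include
          | in a single translation unit:
          | 
          |     // In exactly one source file of the program:
          |     #include <mimalloc-new-delete.h>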
        
       | moonchild wrote:
        | Note that there are other considerations besides absolute
        | allocation speed, like memory fragmentation, which isn't
        | affected by inlining. In the ithare post they link, jemalloc
       | other allocators[1] on that front. It would be interesting to see
       | if the newer version of tcmalloc is improved at all in that
       | respect.
       | 
       | 1: http://ithare.com/wp-content/uploads/malloc-overhead.png
        
         | jeffbee wrote:
         | In this test jemalloc actually uses the most memory of all.
         | With default tunings it uses 700k pages at peak, compared to
         | 170k for tcmalloc and 131k for glibc. I have no idea if that is
         | relevant in real life and I have never cared, so I didn't
         | mention it.
        
           | saagarjha wrote:
           | That's actually quite interesting, as IIRC jemalloc has much
           | looser alignment than other allocators and thus can give
           | "tighter" blocks for small allocations.
        
       | the8472 wrote:
        | The post has no date and does not mention the versions used or
        | the allocator tunables. Given that this is a shifting landscape,
        | that makes it difficult to judge how relevant those results are
        | today.
        
         | jeffbee wrote:
         | True. I added a date at the top. I wrote it this morning with
         | head revisions of both jemalloc and tcmalloc.
        
       | benibela wrote:
       | Something that could help a lot with performance would be
       | functions to handle memory for many small blocks at once.
       | 
       | Like a malloc extension function that allocates many small blocks
       | at once. Rather than returning one block of size n, it would
       | return k blocks of size n.
       | 
        | Of course the user could do something similar by allocating one
        | big block of size k * n ordinarily. But then the user needs to
        | keep track of which small blocks belong to which big block, and
        | the big block might be too big when some small blocks are not
        | needed anymore. Say the function needs k small blocks
        | temporarily for processing, but only returns k/10 small blocks.
        | Then 90% of the memory would be wasted.
       | 
        | But if you could allocate k small blocks at once, the allocator
        | could allocate one big array of k*(n + sizeof metadata). Then
        | each of the small blocks could be freed like a normally
        | allocated block of size n, but they would have all the
        | advantages of quick allocation and cache locality of a big
        | array.
       | 
       | Or, alternatively, there could be a partial free that does not
       | free the entire malloced block, but only a part of it. Then you
       | could allocate a block of size k*n, and when you do not need it
       | anymore, free the 90% of the block that you do not need, but keep
       | the 10% that you need.
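        | 
        | To make the idea concrete, a sketch of what such an extension
        | could look like (malloc_many is invented for illustration; no
        | allocator I know of ships exactly this):
        | 
        |     #include <cstddef>
        |     
        |     // Hypothetical extension: allocate 'count' blocks of
        |     // 'size' bytes, carved from one contiguous region, each
        |     // individually releasable with a plain free(). Returns
        |     // how many blocks were actually allocated.
        |     extern "C" std::size_t malloc_many(std::size_t size,
        |                                        std::size_t count,
        |                                        void** out_blocks);
        |     
        |     // Usage sketch: grab 1000 nodes at once, then free each
        |     // one individually whenever it is no longer needed.
        |     //   void* nodes[1000];
        |     //   std::size_t got = malloc_many(64, 1000, nodes);
        |     //   ... later: free(nodes[i]);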
        
         | scott_s wrote:
         | Region-based allocation is similar to this:
         | https://en.wikipedia.org/wiki/Region-based_memory_management
        
         | munchbunny wrote:
         | Object pools are a pretty common pattern for that kind of
         | usage. Is this any different, or are you just asking to push
         | the "allocation" into the memory allocator instead?
        
           | benibela wrote:
           | The difference would be that each object can be freed
           | individually when it is no longer needed.
           | 
           | Especially when the code that creates the objects and the
           | code that uses/discards the objects are in two different
           | projects.
           | 
           | Like a library that can load a JSON file as some kind of tree
           | structure. Then someone uses the library, loads a file of a
           | million objects, but then only needs a single object. All the
            | other objects should be freed, but if they are in a pool,
            | they cannot be. The library does not know which object the
            | user wants, and a user of the library should not need to
            | know how the library allocates the objects.
        
       | ape4 wrote:
        | In my experience an application-specific pool works best. Let's
        | say a part of your program does a bunch of malloc()s then
        | frees... put them into a pool, then delete the pool when it's
        | done.
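        | 
        | A minimal sketch of that pattern (simplified: fixed capacity,
        | 16-byte alignment, everything freed at once):
        | 
        |     #include <cstddef>
        |     #include <cstdlib>
        |     
        |     // Bump-allocate out of one big block, then release the
        |     // whole pool with a single free.
        |     struct Pool {
        |         char* base;
        |         std::size_t used, cap;
        |     };
        |     
        |     Pool pool_create(std::size_t cap) {
        |         return Pool{static_cast<char*>(std::malloc(cap)),
        |                     0, cap};
        |     }
        |     
        |     void* pool_alloc(Pool& p, std::size_t n) {
        |         n = (n + 15) & ~std::size_t(15);  // keep alignment
        |         if (p.used + n > p.cap) return nullptr;
        |         void* out = p.base + p.used;
        |         p.used += n;
        |         return out;
        |     }
        |     
        |     void pool_destroy(Pool& p) {
        |         std::free(p.base);
        |         p.base = nullptr;
        |     }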
        
         | haberman wrote:
         | We often call this "arena allocation". Wikipedia calls it
         | Region-based memory management:
         | https://en.wikipedia.org/wiki/Region-based_memory_management
        
       | pizlonator wrote:
        | My data says that perf of malloc/free in _actual programs_ (not
        | fake ass shit calling malloc in a loop) is never affected by the
        | call overhead. Not by dynamic call overhead, not even if you
        | write a wrapper that uses a switch or indirect call to select
        | mallocs. The reason is simple: malloc perf is all about data
        | dependencies in the malloc/free path and how many cache misses
        | it causes or prevents. Adding indirection on the malloc call
        | path is handled gracefully by the CPU's ability to extract ILP,
        | and malloc is a perfect candidate since its bottlenecks will be
        | cache misses inside the malloc/free paths.
        
         | saagarjha wrote:
          | Have per-thread arenas made locking less of an issue these
          | days?
        
           | pizlonator wrote:
           | Not everyone does per thread arenas as a way of solving this
           | problem. And I don't think any solution to locking in malloc
           | is a silver bullet. Some schemes work exceedingly well for
           | specific use cases. So, there will always be folks in the
           | malloc space who think theirs is better or theirs completely
           | solves the problem because it is better and does solve the
           | problem for their use case.
        
         | s_kanev wrote:
         | This was the case a few years back when the fastest pools were
         | implemented with recursive data structures (e.g. linked lists
         | for the freelists in gperftools).
         | 
         | In the new tcmalloc (and, I think, hoard?) the fastest pools
         | are essentially slabs with bump allocation, so the fastest (and
         | by far, the most common) calls are a grand total of 15 or so
         | instructions, without many cache misses (size class lookups
         | tend to stay in the cache). Call overhead can be a substantial
         | chunk of that.
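          | 
          | For intuition, the fast path is roughly this shape (a very
          | simplified sketch, not tcmalloc's actual code):
          | 
          |     // Per size class: a bump pointer into the current slab.
          |     struct SizeClass {
          |         char* next;
          |         char* limit;
          |     };
          |     
          |     // The whole hot path is a handful of instructions, so
          |     // an out-of-line call is a large fraction of the total.
          |     inline void* fast_alloc(SizeClass& sc,
          |                             unsigned obj_size) {
          |         char* p = sc.next;
          |         if (p + obj_size > sc.limit)
          |             return nullptr;  // slow path: refill the slab
          |         sc.next = p + obj_size;
          |         return p;
          |     }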
        
           | pizlonator wrote:
           | First of all, let's be clear: for a lot of algorithms adding
           | call overhead around them is free because the cpu predicts
           | the call, predicts the return, and uses register renaming to
           | handle arguments and return.
           | 
           | But that's not the whole story. Malloc perf is about what
           | happens on free and what happens when the program accesses
           | that memory that malloc gave it.
           | 
           | When you factor that all in, it doesn't matter how many
           | instructions the malloc has. It matters whether those
           | instructions form a bad dependency chain, if they miss cache,
           | whether the memory we return is in the "best" place, and how
           | much work happens in free (using the same metrics -
           | dependency chain length and misses, not total number of
           | instructions or whether there's a call).
        
             | jlebar wrote:
             | > First of all, let's be clear: for a lot of algorithms
             | adding call overhead around them is free because the cpu
             | predicts the call, predicts the return, and uses register
             | renaming to handle arguments and return.
             | 
             | This is only thinking at the hardware layer. There are also
             | effects at the software layer.
             | 
             | Function calls not known to LTO prevent all sorts of
             | compiler optimizations as well. In addition, as Jeff says
             | on this thread, not being able to inline free means you get
             | no benefit from sized-delete, which is a substantial
             | improvement on some workloads. (I'd cite the exact number,
             | but I'm not sure Google ever released it.)
             | 
             | Source: I literally worked on this.
        
               | pizlonator wrote:
               | Let's be clear: even without LTO, the compiler knows what
               | malloc and free do at a semantic level.
               | 
               | I buy the sized delete win, I've seen that too, in some
               | niche cases. But that's a far cry from the claim that not
               | inlining malloc is bananas. If that's all you got then
               | this sounds like a stunning retreat from the thesis of
               | Jeff's post. You're just saying that you've got a free
               | that benefits from size specialization, which has nothing
               | to do with whether malloc got inlined.
        
           | nightcracker wrote:
           | You are not wrong but also missing the point of the person
           | you're replying to.
        
         | jlebar wrote:
         | > My data says that perf of malloc/free in actual programs (not
         | fake ass shit calling malloc in a loop) is never affected by
         | the call overhead.
         | 
         | Your data is legitimate, but you've got to ask yourself, would
         | Jeff specifically and Google more generally care so much about
         | it if it were also true for their use-cases?
        
           | pizlonator wrote:
           | I don't consider appeals to authority when thinking about
           | technical issues.
        
             | jlebar wrote:
             | This isn't an appeal to authority. It's a suggestion that
             | we consider the possibility that they are not wasting their
             | time on make-work.
             | 
             | This is Bayesian logic (our prior is that the systems
             | engineers at Google are not incompetent), not an appeal to
             | anything.
        
               | pizlonator wrote:
               | Just because someone isn't incompetent doesn't mean that
               | we should assume they are right. Highly competent people
               | are often wrong. Questioning things is good.
        
       ___________________________________________________________________
       (page generated 2020-04-27 23:00 UTC)