[HN Gopher] Custom Allocators Demystified
       ___________________________________________________________________
        
       Custom Allocators Demystified
        
       Author : deafcalculus
       Score  : 46 points
       Date   : 2020-10-13 06:28 UTC (16 hours ago)
        
 (HTM) web link (slembcke.github.io)
 (TXT) w3m dump (slembcke.github.io)
        
       | chrchang523 wrote:
       | I've gotten a lot of mileage out of the following two extensions
       | to the linear allocator:
       | 
       | 1. A "pop-to" operation. If you're ready to free ALL memory
       | allocated at or after allocation X, you can do this by moving the
       | start-of-free-space pointer back to the start of allocation X.
       | 
       | 2. An end-of-free-space pointer, with its own push and pop-to
       | operations.
       | 
       | In combination, these are often enough to get maximum value out
       | of a fixed-size memory workspace. A function can allocate all
       | returned data structures on one side of the workspace, and use
       | the other side of the workspace for all temporary allocations.
        | This approach does require tight coupling, so there are a lot
        | of settings where it clearly doesn't belong, but I've found it
        | to be very effective for high-performance scientific code.
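        | 
        | A minimal sketch of that double-ended scheme (the type and
        | method names here are illustrative, not from any particular
        | codebase):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Double-ended linear allocator: persistent results grow from the low
// end, temporaries from the high end. "pop-to" rewinds either side to a
// previously saved mark in O(1), freeing everything allocated after it.
struct TwoEndedArena {
    uint8_t*    base;
    std::size_t low;    // start-of-free-space pointer (grows up)
    std::size_t high;   // end-of-free-space pointer (grows down)

    TwoEndedArena(uint8_t* mem, std::size_t cap)
        : base(mem), low(0), high(cap) {}

    void* push_low(std::size_t n) {
        if (high - low < n) return nullptr;   // workspace exhausted
        void* p = base + low;
        low += n;
        return p;
    }
    void* push_high(std::size_t n) {
        if (high - low < n) return nullptr;
        high -= n;
        return base + high;
    }
    std::size_t low_mark()  const { return low; }
    std::size_t high_mark() const { return high; }
    // Free ALL memory allocated at or after the saved mark.
    void pop_low_to(std::size_t m)  { assert(m <= low);  low = m; }
    void pop_high_to(std::size_t m) { assert(m >= high); high = m; }
};
```

        | 
        | A function saves low_mark()/high_mark() on entry, builds its
        | returned data with push_low, uses push_high for temporaries,
        | and calls pop_high_to on exit.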
        
         | dbandstra wrote:
         | This is a really nice system. Quake used it back in 1996 to fit
         | the game in 8MB. They called it a "hunk"[0] and it worked
         | pretty much exactly as you said.
         | 
         | The main drawback I find is that you can't use some generic
         | data structure implementations that expect to be able to free
         | memory out of order, unless you're fine with leaking before the
         | final cleanup (if integrated into a generic allocator
         | interface, the "free" method will probably be a no-op). For
         | example, maybe you need a temporary hashmap to help with
         | parsing a text file. It can be interesting to come up with
         | implementations that work without freeing out of order.
         | 
         | Of course, you can always opt to use another, more general
         | purpose allocator, on a block of memory retrieved from the hunk
         | (see Quake's "zones").
         | 
         | [0] https://github.com/id-
         | Software/Quake/blob/master/WinQuake/zo...
        
       | malkia wrote:
        | For C++, I walked (halfway) the really long road of replacing
        | std:: with std::pmr:: where I could, safely plugging in
        | jemalloc to solve the performance issues we had with the
        | standard heap allocator on Windows. But std::pmr::string being
        | 8 bytes bigger caused unexpected slowdowns. Also tons of
        | changes, different types; overall not great.
       | 
        | Then I discovered mimalloc, which hooks malloc/free at a low
        | level, and around the same time a coworker did a similar job,
        | so I'm no longer in love with "pmr".
       | 
        | I wish it were a success story, but it's not. It probably could
        | be, if everyone were on board with it, but that was hardly my
        | case. Most libraries aren't on board either - Qt is not, and
        | many others.
       | 
        | pmr is great for things like trying to allocate on the stack
        | and, if that's not enough, continuing with the heap (under the
        | hood). But the extra 8 bytes, and the fact that do_allocate is
        | a virtual call, are problems (the latter may not matter much
        | for performance, but it's still a hard sell to performance-
        | oriented programmers).
        | 
        | I wish it had been designed in a way where it could've been
        | sold as such.
        
         | gpderetta wrote:
         | Hum, pmr is for type erasing allocators, i.e. you need multiple
         | allocators and you do not want to instantiate your templates
         | more than once.
         | 
          | If you want to use jemalloc everywhere, just create your own
          | (stateless) allocator wrapper and instantiate your string
          | with it.
         | 
          | In fact, you probably don't even need to do that. Doesn't
          | jemalloc provide a DLL interposer mode that transparently
          | replaces malloc/free?
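          | 
          | Something like this (a sketch; std::malloc stands in so it
          | compiles anywhere - substitute jemalloc's prefixed
          | je_malloc/je_free if your build exports them):

```cpp
#include <cstdlib>
#include <new>
#include <string>

// Stateless allocator wrapper. std::malloc/std::free are stand-ins;
// to route everything through jemalloc, include <jemalloc/jemalloc.h>
// (built with a je_ prefix) and call je_malloc/je_free instead.
template <class T>
struct JeAllocator {
    using value_type = T;
    JeAllocator() = default;
    template <class U> JeAllocator(const JeAllocator<U>&) {}

    T* allocate(std::size_t n) {
        if (void* p = std::malloc(n * sizeof(T))) return static_cast<T*>(p);
        throw std::bad_alloc();
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
};
template <class T, class U>
bool operator==(const JeAllocator<T>&, const JeAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const JeAllocator<T>&, const JeAllocator<U>&) { return false; }

// Because the allocator is stateless and empty, this is typically the
// same size as std::string - unlike std::pmr::string, which carries an
// extra memory_resource pointer.
using JeString = std::basic_string<char, std::char_traits<char>, JeAllocator<char>>;
```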
        
       | favorited wrote:
       | Andrei Alexandrescu's 2015 talk about C++'s allocators is great.
        | He's a very entertaining presenter, and he makes the case for
        | extremely composable templated allocators (and why
        | std::allocator isn't that).
       | 
       | https://www.youtube.com/watch?v=LIb3L4vKZ7U
        
         | chromatin wrote:
         | Great talk.
         | 
          | He and others made these principles manifest in Dlang's
          | std.experimental.allocator [0], which offers composability
          | and is really quite nice, allowing you to use Dlang's GC,
          | malloc, a custom malloc, various strategies for metering out
          | the allocated blocks, and any combination thereof.
         | 
         | The documentation can be hard to navigate and it is easy to
         | miss the submodules listed in the tree on the left hand side,
         | so I would especially draw attention to `building blocks` [1]
         | and `mallocator`, `gc_allocator`, etc.
         | 
         | [0] https://dlang.org/phobos/std_experimental_allocator.html
         | [1]
         | https://dlang.org/phobos/std_experimental_allocator_building...
        
       | fizixer wrote:
       | Tangential, but I just learned (TIL) that C is platform-
       | independent in the following way, compared to Java:
       | 
       | - C is "write once, compile everywhere, run everywhere"
       | 
       | - Java is "write once, compile once, run everywhere"
        
         | jjtheblunt wrote:
         | maybe taking the JIT-compiler inside the JVM into account...
         | 
         | - Java is "write once, partially compile once, finish compiling
         | everywhere, run everywhere"
         | 
         | ?
        
           | fizixer wrote:
           | I agree (was just oversimplifying).
           | 
           | I'm starting to think the biggest selling point for Java and
           | dotNET platforms is, not platform-independence, but code
           | obfuscation.
           | 
            | In the mid-1990s, closed source was big, and Java provided
            | a platform to obfuscate code before shipment, while
            | retaining the advantages of hardware-specific code
            | generation (the final phase of a compiler), to make it
            | harder for the user to copy the ideas.
            | 
            | These days open source is king, there's no need to
            | obfuscate, and that's one reason C is making a comeback of
            | sorts.
        
           | saagarjha wrote:
           | Not every JVM includes a JIT; the statement you're replying
           | to is fairly accurate.
        
       | saagarjha wrote:
       | Do note that "custom allocator" can often just be "take what
       | malloc gives you and carve it up yourself" rather than "interpose
       | malloc"; there's usually no need to drop the system allocator
       | completely. This lets you experiment with your own custom pools
       | or arenas without having to go all in and make something general-
       | purpose that works for every object in your program.
       | 
       | > Over the years, I've wasted many hours debugging memory issues.
       | With just a hash table and a set of linked lists, you can track
       | the history of all your allocations. That makes it pretty easy to
       | track down use after free errors, double free errors, and memory
       | leaks. Using guard pages around your allocations, you can detect
       | overflows.
       | 
        | No, just use Address Sanitizer - it's meant for this. Odds are
        | your custom allocator will have a few bugs of its own, and they
        | will make your life more annoying if you're using it only for
        | this.
        
         | gpderetta wrote:
         | Yes! The sanitizers have been a huge game changer for me.
        
       | dragontamer wrote:
       | One more allocator to mention:
       | 
       | * Fixed size allocator
       | 
       | If you don't need multiple sizes (like malloc(size)), then a
       | fixed-size allocator is significantly simpler to implement, to
       | the point of absurdity.
       | 
       | Step 1: Put all valid memory locations into a stack.
       | 
       | Step 2: alloc with stack.pop();
       | 
       | Step 3: free with stack.push();
       | 
       | The end. You get memory locality, cache-friendliness, O(1)
       | operations, zero external fragmentation. The works. Bonus points:
        | stack.push() and stack.pop() can be implemented with SIMD with
        | the help of stream compaction
        | (http://www.cse.chalmers.se/~uffe/streamcompaction.pdf), and
        | can therefore serve as a GPU malloc as long as one size is
        | sufficient.
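        | 
        | The three steps above, sketched as a tiny single-threaded pool
        | (names and sizes are illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// Fixed-size allocator: every block is BLOCK bytes. Free slots live on
// a stack, so alloc and free are both O(1) index moves.
template <std::size_t BLOCK = 64, std::size_t COUNT = 16>
struct FixedPool {
    alignas(64) uint8_t slab[BLOCK * COUNT];
    void*       stack[COUNT];   // Step 1: every slot starts on the stack
    std::size_t top;

    FixedPool() : top(COUNT) {
        for (std::size_t i = 0; i < COUNT; i++)
            stack[i] = slab + i * BLOCK;
    }
    void* alloc() {                         // Step 2: stack.pop()
        return top ? stack[--top] : nullptr;
    }
    void free(void* p) {                    // Step 3: stack.push()
        stack[top++] = p;
    }
};
```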
       | 
       | --------------
       | 
        | The only downside of "fixed size" allocation is the significant
        | amount of internal fragmentation per node. (Ex: if you only use
        | 8 bytes of a node, your 64-byte reservation is mostly wasted.)
        | But if you're making a custom allocator, fixed-size is probably
        | one of the easiest to implement in practice.
       | 
       | -------------
       | 
       | You can extend this to multithreaded operations by simply using
       | atomic-swap, atomic-and, and atomic-or across a shared bitset
       | between threads. 1-bit per fixed-size block. (bit == 1 means
       | something has been alloc(). bit == 0 means something is free()).
       | The stack.push() and stack.pop() operations can run thread-local.
       | 
        | I recommend 64 bytes or 128 bytes as your "fixed size", because
        | 64 bytes (one cache line) is the smallest block you can hand
        | out before false sharing becomes a thing.
       | 
       | ---------
       | 
       | If you have a fixed-size allocator but need items to extend
       | beyond a fixed-size array, then learn to use linked lists.
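        | 
        | For example, a growable list built from 64-byte blocks might
        | look like this (a sketch; pool_alloc stands in for whatever
        | fixed-size allocator hands out the blocks):

```cpp
#include <cstddef>

// Growing past one fixed-size block by chaining blocks: each 64-byte
// node carries a small payload array plus a link to the next node.
struct ChunkNode {
    static const std::size_t CAP = 13;   // 13*4 + 4 + 8 = 64 on LP64
    int        items[CAP];
    int        count;
    ChunkNode* next;
};

// Append a value, taking a fresh node from the fixed-size allocator
// when the current one is full. Newest node sits at the front, so
// iteration order is reversed across chunks.
ChunkNode* append(ChunkNode* head, int v, ChunkNode* (*pool_alloc)()) {
    if (!head || head->count == (int)ChunkNode::CAP) {
        ChunkNode* n = pool_alloc();
        n->count = 0;
        n->next = head;
        head = n;
    }
    head->items[head->count++] = v;
    return head;
}
```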
        
         | gpderetta wrote:
          | Well, if your stack is using a dynamic array as backing
          | store, push is only amortized O(1). You can simply chain all
          | freed blocks into a singly linked free list without using any
          | additional space.
         | 
          | Edit: ... which is what is described in the article, which I
          | most definitely had read when I originally made this comment!
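          | 
          | For reference, the intrusive version - the free list is
          | threaded through the blocks themselves, so alloc and free are
          | strict O(1) with no side storage (a sketch, names mine):

```cpp
#include <cstddef>
#include <cstdint>

// A freed block's first bytes hold the pointer to the next free block,
// so the bookkeeping costs no extra memory.
template <std::size_t BLOCK, std::size_t COUNT>
struct FreeListPool {
    static_assert(BLOCK >= sizeof(void*), "block must fit a next-pointer");
    struct Node { Node* next; };

    alignas(alignof(std::max_align_t)) uint8_t slab[BLOCK * COUNT];
    Node* head = nullptr;

    FreeListPool() {
        for (std::size_t i = 0; i < COUNT; i++)
            free(slab + i * BLOCK);           // seed the list with every slot
    }
    void* alloc() {
        if (!head) return nullptr;
        Node* n = head;
        head = head->next;
        return n;
    }
    void free(void* p) {
        Node* n = static_cast<Node*>(p);
        n->next = head;
        head = n;
    }
};
```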
        
           | dragontamer wrote:
            | > Well, if your stack is using a dynamic array as backing
            | store, push is only amortized O(1). You can simply chain
            | all freed blocks into a singly linked free list without
            | using additional space.
           | 
           | Why do you need a dynamic array as the backing store to the
           | stack?
           | 
           | If your fixed-size is 64B, a 1MB "slab" can be contained in a
           | std::array<void*, 16384>
        
         | hinkley wrote:
         | Isn't this a variant/degenerate case of slab allocation?
        
           | dragontamer wrote:
           | Hmmm... I'm pretty sure that when I hear "slab allocation", I
           | imagine multiple sizes supported.
           | 
            | In the case of "fixed-size" allocations, you support all
            | "smaller sizes" by returning a larger size. Ex: malloc(4)
            | returns a 64-byte block. malloc(20) returns a 64-byte
            | block. malloc(64) returns a 64-byte block. malloc(65) fails
            | an assert(size <= 64), and your program exits.
           | 
           | malloc(65) would probably create a "new slab", supporting
           | 128-byte blocks under a slab allocator. At least, based on
            | how I normally hear the term used. malloc(34) may return a
            | 64-byte block, but malloc(31) may return a 32-byte block
            | (supporting size classes as powers of 2).
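            | 
            | That rounding rule, as a sketch (the 8-byte minimum class
            | is an assumption):

```cpp
#include <cstddef>

// Round a request up to its power-of-two size class, the way a slab
// allocator with one slab per class might: malloc(31) -> 32,
// malloc(34) -> 64.
inline std::size_t size_class(std::size_t n) {
    std::size_t c = 8;          // smallest class (assumed minimum)
    while (c < n) c <<= 1;
    return c;
}
```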
           | 
           | -------------
           | 
           | The "slab allocator" discussed in the blogpost seems to be a
           | fixed-size slab allocator though. So the blogpost's
           | terminology doesn't match my own.
        
       | Aardwolf wrote:
       | I know custom allocators are important for tiny platforms and so
       | on, but what's a reason C/C++ can't allocate memory in a
       | satisfactory way by default on those platforms? And how do other
       | languages get away with it?
        
         | jeffbee wrote:
         | C++ doesn't come out of the box with an allocator at all.
         | Implementations have to provide it. But this article isn't
         | talking about the difference between, say, jemalloc and
         | mimalloc. It's talking about cases where you want to minimize
         | calls to global operator new, or cases where you want to make a
         | lot of allocations but you don't want to have to delete
         | anything. The latter is often a massive advantage in speed. For
         | example if you need to use a std::set<int> within a scope, and
         | it doesn't escape that scope, it will be much faster to provide
         | an arena allocator that allocates the nodes used by std::set,
         | both because it will minimize the necessary calls to global new
         | -- it may even eliminate them if you can safely use enough
         | stack space -- and especially because there isn't a
         | corresponding deallocation for every allocation. You simply
         | discard the entire arena at the end of the scope.
         | 
         | Fans of Java may rightly point out that garbage collection also
         | has this property, but GC brings other costs. Nothing in Java
         | even remotely approximates the performance of an STL container
         | backed by an arena allocator.
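          | 
          | With C++17's <memory_resource>, the scoped-set case looks
          | roughly like this (a sketch; the buffer size and function
          | name are illustrative):

```cpp
#include <array>
#include <cstddef>
#include <memory_resource>
#include <set>

// Arena-backed std::pmr::set: nodes come from a stack buffer via a
// monotonic_buffer_resource. Individual deallocation is a no-op; the
// whole arena is discarded when the resource goes out of scope.
int sum_of_unique(const int* data, std::size_t n) {
    std::array<std::byte, 4096> buf;                     // stack arena
    std::pmr::monotonic_buffer_resource arena(buf.data(), buf.size());
    std::pmr::set<int> seen(&arena);                     // nodes live in buf
    int sum = 0;
    for (std::size_t i = 0; i < n; i++)
        if (seen.insert(data[i]).second) sum += data[i];
    return sum;   // arena freed wholesale here, no per-node deletes
}
```

          | 
          | If the set outgrows the buffer, monotonic_buffer_resource
          | falls back to its upstream resource (heap by default), so the
          | fast path stays on the stack without a hard size limit.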
        
         | gpderetta wrote:
         | It is very hard to beat a good off the shelf allocator in the
         | general case, but it is easy in specialized cases, simply
         | because you can make tradeoffs and rely on knowledge the
         | allocator doesn't have.
        
         | jasonzemos wrote:
          | No single C/C++ implementation's default allocator can meet
          | every application's needs. I'll just speak to one example off
          | the top of my
         | head: the default glibc malloc() stores an allocation size near
         | or below the actual allocation data itself. In contrast, an
         | alternative like jemalloc tracks the size in a separate control
         | block. When freeing allocations with the former, that memory
         | has to be touched; with the latter, the control block has to be
         | touched instead. Similarly it can lead to less or more
         | efficient packing of aligned data. All of this yields better or
         | worse performance depending on the application.
        
         | mac01021 wrote:
         | Which other languages do you have in mind?
        
         | alfalfasprout wrote:
          | It's not just tiny platforms. For example, you often want an
          | allocator with a predictable characteristic, e.g.
          | constant-time allocation for a given request size. Or you
          | might want an
         | allocator that keeps certain items contiguous so they'll tend
         | to already be in cache. Custom allocators tend to be used when
         | you have a particular requirement that a general purpose
         | allocator isn't optimal for.
         | 
          | Tiny platforms are often bare-metal, so there's only one
          | running "process", and something like "malloc" that needs to
          | be aware of memory usage across the system isn't relevant.
         | Moreover, embedded systems tend to care about deterministic
         | performance so an allocator specific to the application can be
         | a better match.
        
       | snicker7 wrote:
       | Would recommend watching the "What's a Memory Allocator, Anyway?"
       | talk from Zig contributor Benjamin Feng:
       | 
       | https://www.youtube.com/watch?v=vHWiDx_l4V0
        
       ___________________________________________________________________
       (page generated 2020-10-13 23:00 UTC)