[HN Gopher] Custom Allocators Demystified ___________________________________________________________________ Custom Allocators Demystified Author : deafcalculus Score : 46 points Date : 2020-10-13 06:28 UTC (16 hours ago) (HTM) web link (slembcke.github.io) (TXT) w3m dump (slembcke.github.io) | chrchang523 wrote: | I've gotten a lot of mileage out of the following two extensions | to the linear allocator: | | 1. A "pop-to" operation. If you're ready to free ALL memory | allocated at or after allocation X, you can do this by moving the | start-of-free-space pointer back to the start of allocation X. | | 2. An end-of-free-space pointer, with its own push and pop-to | operations. | | In combination, these are often enough to get maximum value out | of a fixed-size memory workspace. A function can allocate all | returned data structures on one side of the workspace, and use | the other side of the workspace for all temporary allocations. | This approach does require tight coupling so there are a lot of | settings where it clearly doesn't belong, but I've found it to be | very effective for high-performance scientific code. | dbandstra wrote: | This is a really nice system. Quake used it back in 1996 to fit | the game in 8MB. They called it a "hunk"[0] and it worked | pretty much exactly as you said. | | The main drawback I find is that you can't use some generic | data structure implementations that expect to be able to free | memory out of order, unless you're fine with leaking before the | final cleanup (if integrated into a generic allocator | interface, the "free" method will probably be a no-op). For | example, maybe you need a temporary hashmap to help with | parsing a text file. It can be interesting to come up with | implementations that work without freeing out of order. | | Of course, you can always opt to use another, more general | purpose allocator, on a block of memory retrieved from the hunk | (see Quake's "zones"). | | [0] https://github.com/id- | Software/Quake/blob/master/WinQuake/zo... | malkia wrote: | For C++, I walked (halfway) the really long road of replacing | std:: with std::pmr:: where I can, safely pluging jemalloc - and | solve the performance issues we had with the standard heap | allocator (Windows). But std::pmr::string now being 8 bytes | bigger caused us slowdowns unexpectedly. Also tons of changes, | different types, overall not great. | | Then discovered mimalloc, which hooks at low-level malloc/free, | and at the same time a coworker also did similar job and now I'm | no longer in love with "pmr". | | I wish it was success story, but it's not. It probably is, if | everyone is on board with it. But hardly this was my case. Also | libraries are not - Qt is not, and many others. | | pmr is great for things like - try to alloc on the stack, and if | not enough continue with heap (under the hood), but the extra 8 | bytes, and the fact that do_alloc is virtual call is problem | (last one maybe not perf enough, but still hard to sell to | performance oriented programmers). | | I wish it was designed in way where it could've been sold as | such. | gpderetta wrote: | Hum, pmr is for type erasing allocators, i.e. you need multiple | allocators and you do not want to instantiate your templates | more than once. | | If you want to use jemalloc everywere, just create your own | (stateless) allocator wrapper and instantiate your string with | it. | | In fact you probably don't even need to do that. Doesnt | jemalloc provide a dll interposer mode that transparently | replaces malloc/free? | favorited wrote: | Andrei Alexandrescu's 2015 talk about C++'s allocators is great. | He's a very entertaining presenter, and he makes case for | extremely composable templated allocators (and why std::allocator | isn't that). | | https://www.youtube.com/watch?v=LIb3L4vKZ7U | chromatin wrote: | Great talk. | | He and others made these principles manifest in Dlang's | std.experimental.allocator [0] which offers composability and | is really quite nice, allowing for you to use Dlang's GC, | malloc, custom malloc, various strategies for metering out the | allocated blocks, and any combination thereof. | | The documentation can be hard to navigate and it is easy to | miss the submodules listed in the tree on the left hand side, | so I would especially draw attention to `building blocks` [1] | and `mallocator`, `gc_allocator`, etc. | | [0] https://dlang.org/phobos/std_experimental_allocator.html | [1] | https://dlang.org/phobos/std_experimental_allocator_building... | fizixer wrote: | Tangential, but I just learned (TIL) that C is platform- | independent in the following way, compared to Java: | | - C is "write once, compile everywhere, run everywhere" | | - Java is "write once, compile once, run everywhere" | jjtheblunt wrote: | maybe taking the JIT-compiler inside the JVM into account... | | - Java is "write once, partially compile once, finish compiling | everywhere, run everywhere" | | ? | fizixer wrote: | I agree (was just oversimplifying). | | I'm starting to think the biggest selling point for Java and | dotNET platforms is, not platform-independence, but code | obfuscation. | | In mid-1990s closed source was big, and Java provided a | platform to obfuscate the code before shipment, while | retaining the advantages of hardware-specific code-gen (final | phase of a compiler), to make it harder for the user to copy | the ideas. | | These days open source is king, no need to obfuscate, and a | reason C is making a comeback of sorts. | saagarjha wrote: | Not every JVM includes a JIT; the statement you're replying | to is fairly accurate. | saagarjha wrote: | Do note that "custom allocator" can often just be "take what | malloc gives you and carve it up yourself" rather than "interpose | malloc"; there's usually no need to drop the system allocator | completely. This lets you experiment with your own custom pools | or arenas without having to go all in and make something general- | purpose that works for every object in your program. | | > Over the years, I've wasted many hours debugging memory issues. | With just a hash table and a set of linked lists, you can track | the history of all your allocations. That makes it pretty easy to | track down use after free errors, double free errors, and memory | leaks. Using guard pages around your allocations, you can detect | overflows. | | No, just use address sanitizer-it's meant for this. Odds are your | custom allocator might have a few bugs in itself and they will | make you life more annoying if you're trying to use it for only | this. | gpderetta wrote: | Yes! The sanitizers have been a huge game changer for me. | dragontamer wrote: | One more allocator to mention: | | * Fixed size allocator | | If you don't need multiple sizes (like malloc(size)), then a | fixed-size allocator is significantly simpler to implement, to | the point of absurdity. | | Step 1: Put all valid memory locations into a stack. | | Step 2: alloc with stack.pop(); | | Step 3: free with stack.push(); | | The end. You get memory locality, cache-friendliness, O(1) | operations, zero external fragmentation. The works. Bonus points: | stack.push() and stack.pop() can be implemented with SIMD with | help of stream-compaction | (http://www.cse.chalmers.se/~uffe/streamcompaction.pdf), and can | therefore serve as a GPU-malloc as long as one-size is | sufficient. | | -------------- | | The only downsize of "fixed size" allocation is the significant | amount of internal fragmentation per node. (Ex: if you only use | 8-bytes per node, your 64-byte reservation is mostly wasted). But | if you're making a custom-allocator, fixed-size is probably one | of the easiest to implement in practice. | | ------------- | | You can extend this to multithreaded operations by simply using | atomic-swap, atomic-and, and atomic-or across a shared bitset | between threads. 1-bit per fixed-size block. (bit == 1 means | something has been alloc(). bit == 0 means something is free()). | The stack.push() and stack.pop() operations can run thread-local. | | I recommend 64-bytes or 128-bytes as your "fixed size", because | 64-bytes is the smallest you get before false-sharing becomes a | thing. | | --------- | | If you have a fixed-size allocator but need items to extend | beyond a fixed-size array, then learn to use linked lists. | gpderetta wrote: | Well, if your stack is using a dynamic array as backing store, | push is only amortized O(1). You can simply chain all all freed | blocks in a singly linked free list without using additional | space. | | Edit: ... which is what is described in the article which most | definitely I had read when I originally made this comment! | dragontamer wrote: | > Well, if your stack is using a dynamic array as backing | store, push is only amortized O(1). You can simply chain all | all freed blocks in a singly linked free list without using | additional space. | | Why do you need a dynamic array as the backing store to the | stack? | | If your fixed-size is 64B, a 1MB "slab" can be contained in a | std::array<void*, 16384> | hinkley wrote: | Isn't this a variant/degenerate case of slab allocation? | dragontamer wrote: | Hmmm... I'm pretty sure that when I hear "slab allocation", I | imagine multiple sizes supported. | | In the case of "fixed-size" allocations, you support all | "smaller sizes" by returning a large size. Ex: malloc(4) | returns a 64-byte block. malloc(20) returns a 64-byte block. | malloc(64) returns a 64-byte block. malloc(65) causes assert( | size < 64) fails, and your program exits. | | malloc(65) would probably create a "new slab", supporting | 128-byte blocks under a slab allocator. At least, based on | how I normally hear the term used. malloc(34) may return a | 64-byte block, but malloc(31) may return 32-byte block | (supporting true sizes as power-of-2). | | ------------- | | The "slab allocator" discussed in the blogpost seems to be a | fixed-size slab allocator though. So the blogpost's | terminology doesn't match my own. | Aardwolf wrote: | I know custom allocators are important for tiny platforms and so | on, but what's a reason C/C++ can't allocate memory in a | satisfactory way by default on those platforms? And how do other | languages get away with it? | jeffbee wrote: | C++ doesn't come out of the box with an allocator at all. | Implementations have to provide it. But this article isn't | talking about the difference between, say, jemalloc and | mimalloc. It's talking about cases where you want to minimize | calls to global operator new, or cases where you want to make a | lot of allocations but you don't want to have to delete | anything. The latter is often a massive advantage in speed. For | example if you need to use a std::set<int> within a scope, and | it doesn't escape that scope, it will be much faster to provide | an arena allocator that allocates the nodes used by std::set, | both because it will minimize the necessary calls to global new | -- it may even eliminate them if you can safely use enough | stack space -- and especially because there isn't a | corresponding deallocation for every allocation. You simply | discard the entire arena at the end of the scope. | | Fans of Java may rightly point out that garbage collection also | has this property, but GC brings other costs. Nothing in Java | even remotely approximates the performance of an STL container | backed by an arena allocator. | gpderetta wrote: | It is very hard to beat a good off the shelf allocator in the | general case, but it is easy in specialized cases, simply | because you can make tradeoffs and rely on knowledge the | allocator doesn't have. | jasonzemos wrote: | There's no single confluence where every application's needs | are provided by a single C/C++ implementation's default | allocator. I'll just speak to one example off the top of my | head: the default glibc malloc() stores an allocation size near | or below the actual allocation data itself. In contrast, an | alternative like jemalloc tracks the size in a separate control | block. When freeing allocations with the former, that memory | has to be touched; with the latter, the control block has to be | touched instead. Similarly it can lead to less or more | efficient packing of aligned data. All of this yields better or | worse performance depending on the application. | mac01021 wrote: | Which other languages do you have in mind? | alfalfasprout wrote: | It's not just tiny platforms. For example, you often want | allocators with some predictable characteristic. Eg; constant | time allocation for a given request size. Or you might want an | allocator that keeps certain items contiguous so they'll tend | to already be in cache. Custom allocators tend to be used when | you have a particular requirement that a general purpose | allocator isn't optimal for. | | For tiny platforms often it's bare-metal so there's only one | running "process" and so something like "malloc" that needs to | be aware of memory usage across the system isn't relevant. | Moreover, embedded systems tend to care about deterministic | performance so an allocator specific to the application can be | a better match. | snicker7 wrote: | Would recommend watching the "What's a Memory Allocator, Anyway?" | talk from Zig contributor Benjamin Feng: | | https://www.youtube.com/watch?v=vHWiDx_l4V0 ___________________________________________________________________ (page generated 2020-10-13 23:00 UTC)