[HN Gopher] All About Libpas, Phil's Super Fast Malloc
___________________________________________________________________
 
All About Libpas, Phil's Super Fast Malloc
 
Author : Jarred
Score  : 238 points
Date   : 2022-06-01 13:45 UTC (9 hours ago)
 
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
 
  | carlsborg wrote:
  | Trying to understand the license for bmalloc, and WebKit in
  | general. The WebCore folder has an Apple license and two LGPL
  | licenses. The root folder has no license file. Can anyone
  | comment?
  |
  | pizlonator wrote:
  | libpas is BSD 2-clause, and so is bmalloc. libpas is in the
  | bmalloc directory in WebKit, but is really a separate piece of
  | code.
  |
  | Each file has its own license header.
  |
  | SemanticStrengh wrote:
  |
  | ridiculous_fish wrote:
  | Hey Phil, thanks for the write-up! Curious why this is only used
  | for JSC's JIT, and not to back its heap? Did the design of the
  | JSC GC influence libpas at all?
  |
  | pizlonator wrote:
  | It's used for JSC's JIT and for all heap allocations in
  | JavaScriptCore and WebKit that aren't GC allocations.
  |
  | Riptide (the JSC GC) has a similar design, and some of the key
  | ideas were first pioneered there:
  |
  | - The use of fast bitvector searches to find an eligible page
  | (Riptide calls it the first eligible block, I think).
  |
  | - The use of bitvector SIMD to change object states in bulk.
  |
  | - Riptide uses a mark bitvector that is indexed by minalign atom
  | (16 bytes) and a block size of 16KB. Libpas's default "small
  | segregated" configuration is 16-byte minalign, 16KB block, and
  | an alloc bitvector indexed by minalign.
  |
  | I have since thought about how Riptide could be wired up to use
  | libpas. It's possible to do it. Already today you could create a
  | libpas page_config that leaves room for mark bits in the page
  | header and makes them use the same indexing as the alloc bits.
  | Riptide has a "newly allocated" bitvector that serves almost the
  | same purpose as the alloc bits (kinda, if you squint really
  | hard) - so you'd go from Riptide's newly_allocated and mark
  | bitvectors to libpas's alloc bitvector and a custom mark
  | bitvector stuffed into page headers using an exotic page_config.
  | If JSC did this, it would buy:
  |
  | - Better decommit policies than Riptide. I did a lot of tuning
  | in libpas to make decommit Just Right (TM).
  |
  | - Faster allocation of medium-sized objects.
  |
  | - Possibly better memory efficiency.
  |
  | But Riptide is already very nicely tuned, and it gets some
  | benefit from libpas's decommit rules because Riptide calls
  | fastMalloc to get memory and sometimes uses fastFree to return
  | it. fastFree calls bmalloc::api::free, which calls libpas's
  | bmalloc_deallocate, which then goes to pas_deallocate with the
  | BMALLOC_HEAP_CONFIG. So Riptide already gets some of the
  | goodness of libpas by sitting on top of it.
  |
  | [deleted]
  |
  | dragontamer wrote:
  | General-purpose memory allocation is a lie. Every allocator has
  | use cases where it's faster than the other libraries.
  |
  | I think that's why we keep seeing newer malloc schemes pop up:
  | the performance of the heap depends 100% on the use case, and
  | different people have different use cases.
  |
  | Still, studying everyone else's heaps (and garbage collectors, a
  | closely related discussion) is probably good for high-
  | performance programmers.
  |
  | pizlonator wrote:
  | Yeah, you're totally right.
  |
  | Libpas beats other mallocs in WebKit.
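 
(As a minimal illustration of the "first eligible" bitvector search
pizlonator describes upthread: one bit per page, scanned a 64-bit
word at a time. This is a hypothetical simplification, not libpas's
or Riptide's actual code.)
 
    #include <stddef.h>
    #include <stdint.h>
 
    /* One bit per page; a set bit means the page is eligible
       (i.e., has free space). Returns the index of the first
       eligible page, or SIZE_MAX if none. */
    static size_t find_first_eligible(const uint64_t *bits,
                                      size_t num_words)
    {
        for (size_t i = 0; i < num_words; i++) {
            if (bits[i]) {
                /* Count trailing zeros to get the lowest set bit. */
                return i * 64 + (size_t)__builtin_ctzll(bits[i]);
            }
        }
        return SIZE_MAX;
    }
 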
  | I also had benchmarks involving non-WebKit workloads, and the
  | results were all over the place (sometimes slower than other
  | mallocs by a lot, sometimes faster by a lot - same thing with
  | memory: sometimes more efficient, sometimes less). This didn't
  | surprise me; I've seen this before when writing memory
  | management code.
  |
  | I think it makes sense for large software projects that use
  | malloc a lot and care about perf to eventually either write
  | their own malloc or take an existing one and tune it a lot.
  |
  | latenightcoding wrote:
  | I've seen a few open-source projects archive their custom
  | allocators because they were not beating the system's one or
  | jemalloc. So if you have the skill and time, then yeah, go for
  | it; else stick with general-purpose ones.
  |
  | mananaysiempre wrote:
  | There are a couple of approaches that might legitimately be
  | simpler than using a general-purpose allocator: in a single-pass
  | batch process, just throw things on the floor and let the OS
  | sort it out on exit[1] (I think the D compiler does this); in a
  | multiple-pass batch process, make each pass litter its own room
  | (arena, obstack, etc.), then demolish said room when done (GCC
  | does this, or at least did in the past).
  |
  | On the other hand, these may require rearranging the logic
  | somewhat to fit them, so the question of how much rearrangement
  | to tolerate still remains. And, well[4],
  |
  |     /*
  |      * We divy out chunks of memory rather than call malloc each time so
  |      * we don't have to worry about leaking memory.  It's probably
  |      * not a big deal if all this memory was wasted but if this ever
  |      * goes into a library that would probably not be a good idea.
  |      *
  |      * XXX - this *is* in a library....
  |      */
  |
  | [1] I am rather dismayed by how Raymond Chen advocates for this
  | for memory[2] but insists it could not possibly be a good idea
  | for Windows GDI handles, no, you stupid sloppy programmer[3].
  | Maybe because he was involved in GDI from the other side? (Of
  | course, for file descriptors on Unix it's still the standard
  | practice.)
  |
  | [2] http://bytepointer.com/resources/old_new_thing/20120105_006_...
  |
  | [3] http://bytepointer.com/resources/old_new_thing/20051014_305_...
  |
  | [4] https://github.com/the-tcpdump-group/libpcap/blob/4c1e516dd2...
  |
  | jstimpfle wrote:
  | This "[2] vs [3]" is a good find. I have several possible
  | explanations. It could be the difference of 7 years between
  | those posts. It could also be that Windows handles can't be
  | cleaned up in userspace. Note how malloc chunks up the "handles"
  | (i.e. pages / regions) that it got from Windows. This is, AFAIK,
  | different from HANDLEs, which have to be requested from the
  | system one-by-one and can't be chunked up. (Or am I wrong? I
  | actually don't know a lot about how this works under the hood.)
  |
  | dragontamer wrote:
  | It's very easy to beat the general-purpose ones if you know your
  | exact use case.
  |
  | Ex: 16-bit pointers give you a 65536-element heap. Assume 8
  | bytes per element; that's 512KB of space. A bit small, but large
  | enough to do a lot of things.
  |
  | 65536 elements can be represented as a bitmask. The bitmask only
  | takes up 8192 bytes (8KB), which fits inside of 16 AVX512
  | registers (Intel offers 32x AVX512/ZMM registers btw). Or it
  | fits inside of GPU __shared__ memory.
  |
  | If you need multithreading, you perform atomic AND and atomic OR
  | to clear and set the bitmask as appropriate.
  |
  | -------
  |
  | Do you know the exact size of your heap? The size of the
  | pointer? The amount of parallelism involved?
  | What about the size of the elements? Access pattern? (Bump-
  | alloc'd linked lists are a sequential traversal over RAM btw, so
  | that's very efficient.) Etc., etc.
  |
  | The 64-bit pointer is overkill for most people's purposes. 32
  | bits represents 4 billion objects, and if each object is 16
  | bytes long, that's 64GB of RAM. Right here, right now, a 32-bit
  | pointer / custom allocator already offers many benefits over the
  | 64-bit pointer (half-sized pointers, more data in cache, etc.).
  |
  | thesz wrote:
  | 512 in AVX512 is the number of bits per register. You are off by
  | a factor of 8.
  |
  | cmrdporcupine wrote:
  | 100%. And also I think there's a tendency among people who
  | compare 'higher up the stack' in GC'd languages to manual heap
  | allocation in C/C++ etc. in a way that implies that the latter
  | is almost "free" from a performance POV, when in fact underneath
  | the covers in these allocators there's still a lot of the same
  | kind of walking of data structures that also happens inside a
  | tracing GC.
  |
  | You can get unpredictable pauses from malloc/free, too.
  |
  | asveikau wrote:
  | I think in either case, GC or not, you write something
  | intuitively and then, when it becomes an actual problem, you
  | study the patterns and improve them. But most of the time you
  | can leave it alone, and the general-purpose thing is good
  | enough.
  |
  | Or if you are in a niche like audio/video or something, you
  | avoid allocations altogether during the bulk of the code.
  |
  | cmrdporcupine wrote:
  | Oh, for sure. Premature optimizing is usually bad form.
  |
  | What is tricky about performance issues in allocation is that
  | they can be hard to profile. There are tools for analyzing GC
  | performance in GC'd languages, but sometimes malloc/free can
  | just be a big black box.
  |
  | djmips wrote:
  | Premature optimization is not bad form when it's re-framed as
  | good architecture. So it's not 'usually bad form' to architect
  | something from the outset using your experience, and that's
  | something everyone understands. This pervasive disdain for
  | premature optimization leads to bad architecture, which often
  | leads to expensive rewrites. Because the word optimization is so
  | overloaded and treated with disdain, it feels like we need new
  | language to talk about what's really meant by 'premature
  | optimization' (the bad kind).
  |
  | asveikau wrote:
  | That's one reason I point out a few comments above that certain
  | niches will need to take it into account ahead of time. E.g. an
  | A/V application will typically allocate all buffers up front and
  | re-use them frequently rather than return them to the allocator.
  | A lot of server applications will want to keep per-client memory
  | usage low. For general purposes, there's the general-purpose
  | allocator.
  |
  | tptacek wrote:
  | This is a pretty well-known CS truism, isn't it? One of the
  | first papers I ever fell in love with covers it in detail:
  | https://users.cs.northwestern.edu/~pdinda/ics-s05/doc/dsa.pd...
  |
  | lidavidm wrote:
  | I wonder how this compares to jemalloc, mimalloc, snmalloc?
  |
  | pizlonator wrote:
  | I never got a chance to compare it to those, since I was most
  | interested in beating bmalloc. And I mainly wanted to beat it on
  | Safari workloads.
  |
  | I believe bmalloc was previously compared against jemalloc and
  | tcmalloc, also using Safari workloads, and bmalloc was
  | significantly faster at the time.
  |
  | IshKebab wrote:
  | For other people who have never heard of bmalloc - it's a custom
  | allocator used only by WebKit.
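 
(A minimal sketch of the fixed-size bitmask allocator dragontamer
describes upthread, using C11 atomics for the multithreaded AND/OR
claim and release. Names and sizes are illustrative; the SIMD and
GPU variants are left out.)
 
    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>
 
    #define NUM_SLOTS 65536             /* 16-bit "pointers" */
 
    /* One bit per slot: 1 = free, 0 = in use. */
    static _Atomic uint64_t freemask[NUM_SLOTS / 64];
    static uint64_t slots[NUM_SLOTS];   /* 8 bytes/slot = 512KB of
                                           backing store (unused in
                                           this sketch) */
 
    static void heap_init(void)
    {
        for (size_t w = 0; w < NUM_SLOTS / 64; w++)
            atomic_store(&freemask[w], ~0ull);  /* all free */
    }
 
    /* Returns a slot index, or -1 if the heap is full. */
    static int32_t slot_alloc(void)
    {
        for (size_t w = 0; w < NUM_SLOTS / 64; w++) {
            uint64_t bits = atomic_load(&freemask[w]);
            while (bits) {
                int b = __builtin_ctzll(bits);
                uint64_t mask = 1ull << b;
                /* Atomically clear the bit; retry if we raced. */
                uint64_t old = atomic_fetch_and(&freemask[w], ~mask);
                if (old & mask)
                    return (int32_t)(w * 64 + b);
                bits = old & ~mask;
            }
        }
        return -1;
    }
 
    static void slot_free(int32_t i)
    {
        /* Atomic OR sets the bit back to free. */
        atomic_fetch_or(&freemask[i / 64], 1ull << (i % 64));
    }
 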
  | I guess it's not surprising they added one, since the Mac system
  | allocator is extremely slow. A custom allocator is pretty much a
  | free 20% speed-up on Mac (depending on your workload), but I
  | found they made no difference on Linux. Haven't tried on
  | Windows.
  |
  | pizlonator wrote:
  | Just some credit where credit is due.
  |
  | I measured the system malloc crushing all other mallocs on some
  | workloads, and they were the kind of workloads that some folks
  | run every day.
  |
  | I measured the system malloc crushing most other mallocs on
  | memory efficiency on most workloads. System malloc is really
  | good at reusing memory and has very mature decommit policies
  | that easily rival what I came up with.
  |
  | So there's that.
  |
  | pcwalton wrote:
  | macOS malloc was _incredibly_ slow for Pathfinder, far behind
  | every other OS. Everything became bottlenecked on it. It was a
  | free 2x speedup, if not more, to switch to jemalloc.
  |
  | I suspect this is because Pathfinder's CPU portion is a
  | multicore workload and the macOS allocator performs poorly under
  | heavy multithreaded contention. It probably just isn't the kind
  | of workload that the macOS allocator was tuned for.
  |
  | pizlonator wrote:
  | My understanding is that the system malloc is excellent under
  | contention, but has a high baseline cost for every malloc/free
  | call.
  |
  | I don't remember exactly what workloads it performed really
  | great at (and if I did, I dunno if I could say), but I do
  | remember they were parallel, and the speed-ups got bigger the
  | more cores you added.
  |
  | Everything else about your experience matches mine. Libpas is
  | much faster than system malloc in WebKit and JSC. The difference
  | isn't 2x on my preferred benchmarks (which are large and do lots
  | of things that don't rely on malloc), but it is easily more than
  | 2x on smaller benchmarks. So your 2x result sounds about right.
  |
  | meisel wrote:
  | What sort of workloads are those? I have yet to see a workload
  | where macOS's system allocator is not substantially slower than
  | alternatives like jemalloc.
  |
  | astrange wrote:
  | Alas, many people think performance engineering is only about
  | the wall-clock time of their program and not what impact it has
  | on anything else.
  |
  | malloc performance highly depends on how large the allocations
  | are, though.
  |
  | KerrAvon wrote:
  | What's the workload where you found that to be the case? The
  | system allocator is heavily tuned for general use, and should be
  | very difficult to beat -- generally.
  |
  | IshKebab wrote:
  | It was a non-traditional compiler written in C++.
  |
  | > The system allocator is heavily tuned
  |
  | Perhaps, but it probably has more constraints than a custom
  | allocator too, e.g. backwards compatibility with software that
  | unwittingly depends on its exact behaviour.
  |
  | I found a discussion of this where people reported various
  | speedups (10%, 100%, 300%) on Mac by switching to a custom
  | allocator: https://news.ycombinator.com/item?id=29068828
  |
  | I'm sure there are better benchmarks if you Google it.
  |
  | jeffbee wrote:
  | Well, the other browser's design doc is not anywhere near this
  | detailed, but if you want to compare and contrast the overall
  | design, here is Chromium's PartitionAlloc.
  |
  | https://chromium.googlesource.com/chromium/src/+/master/base...
  |
  | rq1 wrote:
  | > Consequently, passing a function pointer (or struct of
  | function pointers), where the pointer points to an always_inline
  | function and the callee is always_inline results in
  | specialization akin to template monomorphization.
  | This works to any depth; the compiler won't be satisfied until
  | there are no more always_inline function calls. This fortuitous
  | development in compilers allowed me to write very nice template
  | code in C. Libpas achieves templates in C using config structs
  | that contain function pointers -- sometimes to always_inline
  | functions (when we want specialization and inlining) and
  | sometimes to out-of-line functions (when we want specialization
  | but not inlining). Additionally, the C template style allows us
  | to have true polymorphic functions. Lots of libpas slow paths
  | are huge and not at all hot. We don't want that code specialized
  | for every config. Luckily, this works just fine in C templates
  | -- those polymorphic functions just pass around a pointer to the
  | config they are using, and dynamically load and call things in
  | that config, almost exactly the same way that the specialized
  | code would do. This saves a lot of code size versus C++
  | templates.
  |
  | This document is incredible!
  |
  | photochemsyn wrote:
  | Seems like a real tour de force of low-level memory management.
  | Prior to seeing this, my go-to for understanding custom low-
  | level (for embedded) memory managers was this (it explains the
  | basics of stack, block, bitmap, and thread-local allocation),
  | but now I have something far more complicated to get confused
  | about.
  |
  | https://www.embedded-software-engineering.de/dynamic-memory-...
  |
  | pizlonator wrote:
  | Thank you! :-)
  |
  | I always wanted to write a malloc, and this is what I came up
  | with after 3.5 years or so.
  |
  | [deleted]
  |
  | adamnemecek wrote:
  | I'm always suspicious of things that are named after the maker.
  |
  | astrange wrote:
  | A better rule is to be suspicious of anything with "fast" or
  | synonyms in the name. It either won't be fast or it'll be
  | insecure.
  |
  | fmajid wrote:
  | Did you know the entire US air traffic control system runs on
  | long-discontinued computers and is written in JOVIAL, "Jules'
  | Own Version of the International Algebraic Language" (i.e.
  | ALGOL), named after Jules Schwartz?
  |
  | https://en.wikipedia.org/wiki/JOVIAL
  |
  | The FAA's been trying to replace it for ages, but AFAIK the
  | effort is still ongoing.
  |
  | thrtythreeforty wrote:
  | Like Linux, or gcc.godbolt.org?
  |
  | I kid, but I don't see it as a downside for software. Naming a
  | scientific phenomenon after yourself is a red flag, though; see
  | also the Crackpot Index [1].
  |
  | [1]: https://math.ucr.edu/home/baez/crackpot.html
  |
  | tialaramex wrote:
  | Godbolt _says_ Compiler Explorer right there; the fact people
  | call it "Godbolt" is because that's the memorable name, not
  | because Matt Godbolt tried to name it after himself. He calls it
  | Compiler Explorer.
  |
  | You can't control what people call things. My favourite example
  | is Blackfriars Bridge in London. When that bridge was built, it
  | was formally named after William Pitt. But, like, who cares? In
  | what way is this bridge Pitt's bridge? People knew where it was,
  | the Blackfriars area of London, so too bad, William Pitt:
  | everybody called it Blackfriars Bridge. And so, that's what its
  | name is.
  |
  | The priory from which "Blackfriars" got its name hadn't existed
  | for two hundred years by the time the bridge was proposed.
  | Nobody living when the bridge was finished would have ever known
  | anybody who'd met any of the friars during their lifetime.
  | Nevertheless, that part of London had "always" been called
  | Blackfriars, and so a bridge and a railway station built many
  | years later are named that too.
  | bool3max wrote:
  | People call it Godbolt because you access it via godbolt.org.
  |
  | throwaway744678 wrote:
  | Off the top of my head: Linux, Bourne shell...
  |
  | adamnemecek wrote:
  |
  | kbenson wrote:
  | Sometimes things are named after the maker ex post facto because
  | people need a way to distinguish them. The Bourne shell,
  | mentioned here, is likely called that for the same reason the
  | very first shell was called the Thompson shell: they were
  | written by a single person originally and not really given a
  | name (the original being from Ken Thompson), since they were
  | just different (or the original) implementations of the system
  | shell. Eventually you're going to need a way to distinguish one
  | from the other, right? Noting who wrote it is as good as
  | (probably better than) any other option.
  |
  | Linux was probably named differently, but also probably doesn't
  | really match whatever criteria you're using to be suspicious. It
  | started as a hobby project to make a free and open-source UNIX
  | system, of which there were few (none?), written by one guy.
  | There was no expectation that other people would use it for
  | anything other than something interesting to play around with,
  | and using a portion of your own name for your own little hobby
  | project isn't something I would look askance at, personally,
  | especially in an age when there was no expectation that it might
  | grow into anything, because that was definitely not the norm at
  | the time.
  |
  | ncmncm wrote:
  | "Linux" was forced on him.
  |
  | pizlonator wrote:
  | It's called libpas, and pas stands for whatever you like.
  |
  | mattgreenrocks wrote:
  | I admit I'm a sucker for metacircularity, but have to ask:
  |
  | > The bootstrap heap has hacks to allow itself to allocate that
  | array out of itself.
  |
  | Not quite following that; can someone elaborate?
  |
  | convolvatron wrote:
  | I suspect it's this: generally, when you write a heap, it needs
  | to keep some metadata. If you have a multi-heap system (really
  | handy), then it's generally easier to host that metadata on
  | another heap. Sometimes you really want that, because the target
  | heap might be specialized for different kinds of data. So I
  | assume this is carefully constructing the heap state so that it
  | can be used for its own metadata as it's being constructed.
  |
  | pizlonator wrote:
  | The bootstrap heap has a freelist that is just an array of
  | entries that tell you where the free memory is.
  |
  | That array has to be allocated somewhere.
  |
  | All pas_simple_large_free_heaps allocate that array from the
  | bootstrap heap. Including the bootstrap heap. So, the
  | pas_simple_large_free_heap has hacks that go something like: "if
  | I'm allocating or freeing something and I need to allocate or
  | free the free list, then _very carefully_ also perform that
  | allocation/deallocation in tandem with the one I'm doing". The
  | main hack there is just that there's some statically allocated
  | "slop" for the bootstrap free list, so if you want to add to it
  | in the middle of an allocation, you add it to the slop array,
  | and then after you're done you reallocate the free list and copy
  | stuff from slop into it.
  |
  | sydthrowaway wrote:
  | Did the author leak this IP when he left Apple?
  |
  | pizlonator wrote:
  | lol
  |
  | WebKit is open source. Libpas was always designed with WebKit in
  | mind. It landed in WebKit half a year before I left Apple.
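 
(A toy illustration of the "slop" trick pizlonator describes above: a
free list that grows out of the heap it describes, with a small
static slop array to break the recursion while the list is being
reallocated. This is a hypothetical sketch -- malloc/free stand in
for carving memory out of the bootstrap heap itself, and error and
capacity checks are omitted.)
 
    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
 
    /* A span of free memory tracked by the bootstrap heap. */
    struct span { uintptr_t base; size_t size; };
 
    /* Static slop: entries parked here while the free-list array is
       itself being reallocated. */
    static struct span slop[8];
    static size_t num_slop;
 
    static struct span *list;    /* the free list itself */
    static size_t num_entries, capacity;
    static int growing;          /* are we mid-reallocation? */
 
    static void grow_list(void);
 
    static void add_span(struct span s)
    {
        if (growing || num_entries == capacity) {
            slop[num_slop++] = s;    /* park it; don't recurse */
            if (!growing)
                grow_list();
            return;
        }
        list[num_entries++] = s;
    }
 
    static void grow_list(void)
    {
        growing = 1;
        size_t new_capacity = capacity ? capacity * 2 : 16;
        /* In the real thing this memory would come from the
           bootstrap heap itself, and freeing the old array could
           re-enter add_span -- which now lands in the slop. */
        struct span *new_list = malloc(new_capacity * sizeof *new_list);
        if (num_entries)
            memcpy(new_list, list, num_entries * sizeof *list);
        free(list);
        list = new_list;
        capacity = new_capacity;
        growing = 0;
        /* Drain whatever accumulated in the slop during the resize. */
        while (num_slop)
            list[num_entries++] = slop[--num_slop];
    }
 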
  | RcouF1uZ4gsC wrote:
  | > Consequently, passing a function pointer (or struct of
  | function pointers), where the pointer points to an always_inline
  | function and the callee is always_inline results in
  | specialization akin to template monomorphization. This works to
  | any depth; the compiler won't be satisfied until there are no
  | more always_inline function calls. This fortuitous development
  | in compilers allowed me to write very nice template code in C.
  | Libpas achieves templates in C using config structs that contain
  | function pointers -- sometimes to always_inline functions (when
  | we want specialization and inlining) and sometimes to out-of-
  | line functions (when we want specialization but not inlining).
  | Additionally, the C template style allows us to have true
  | polymorphic functions. Lots of libpas slow paths are huge and
  | not at all hot. We don't want that code specialized for every
  | config. Luckily, this works just fine in C templates -- those
  | polymorphic functions just pass around a pointer to the config
  | they are using, and dynamically load and call things in that
  | config, almost exactly the same way that the specialized code
  | would do. This saves a lot of code size versus C++ templates.
  |
  | I thought this was super interesting. It seems like a very nice
  | technique for when you use function pointers in C for stuff like
  | sort comparison. There was an earlier thread about PostgreSQL
  | that discussed how they created multiple sort functions to avoid
  | the overhead of an indirect function pointer call when sorting
  | things like integers, where the function call dwarfs the
  | comparison cost. This seems like a technique that could have
  | been used to avoid duplicating the sorting functions while still
  | getting the benefits.
  |
  | thrtythreeforty wrote:
  | I still don't quite understand how the technique saves code size
  | versus C++ templates, although I am quite eager to understand
  | it, since template instantiation is a thorn in my side when
  | writing code for a microcontroller with 32KB of code space.
  |
  | Is the idea that the struct contains pointers to always_inline
  | functions, and that the compiler has visibility into this at the
  | top of the always_inline call stack, so the always_inline gets
  | monomorphized even through multiple layers of pointer
  | indirection? If I'm understanding it right, that's a pretty nice
  | technique. Would this allow you to mark a "class" (or individual
  | members of that class) as "hot, please monomorphize" by making
  | it always_inline, or "cold, please use regular pointers" by
  | omitting always_inline?
  |
  | C++ templates would force you to monomorphize every variant, or
  | use vtables (function pointers) for every variant, without the
  | ability to choose the approach for each subclass. The compiler
  | is also spectacularly bad at recognizing deduplication
  | opportunities between template specializations.
  |
  | verdagon wrote:
  | I think that's right.
  |
  | C++ binaries tend to be pretty large because they use
  | templates/monomorphization a lot, even when there aren't really
  | meaningful performance gains. For example, it often doesn't make
  | sense to have 20 instantiations of std::vector when only two of
  | them are used in the critical path. Rust can be even further in
  | this unfortunate direction because of the borrow checker's
  | restrictions around polymorphism and abstraction.
  |
  | In comparison, C, Java, and JavaScript tend to rely on dynamic
  | dispatch more often (not that C has any built-in dynamic
  | dispatch features, but in practice, C programs tend to use
  | function pointers manually).
  |
  | A few languages like C# have a really nice balance, and
  | monomorphize in the JIT at run-time.
  |
  | With this technique they mention, C is the only mainstream
  | language I know of that lets you control it in a decoupled way,
  | which is really cool.
  |
  | pizlonator wrote:
  | Say you have:
  |
  |     template<typename T>
  |     void foo(T& value) { value += 42; }
  |
  | There's no easy way to tell the C++ compiler, "please make only
  | one version of foo and use callbacks to figure out what it means
  | to += on T".
  |
  | In libpas, this would be done by first creating:
  |
  |     always_inline void foo(void* value,
  |                            void (*add_int)(void* value, int addend))
  |     {
  |         add_int(value, 42);
  |     }
  |
  | And then you could also create:
  |
  |     never_inline void foo_outline(void* value,
  |                                   void (*add_int)(void* value, int addend))
  |     {
  |         foo(value, add_int);
  |     }
  |
  | Now, if you want monomorphization like C++, call foo() and pass
  | a literal for add_int. But if you don't, then call
  | foo_outline().
  |
  | Say you have 1000 callsites to foo(), they all use different
  | types, and 5 of those callsites are hot while the other ones are
  | hardly ever called (maybe they only run in unusual-but-necessary
  | situations). So, the 5 hot callsites can call foo(), and the
  | remaining 995 cold ones can call foo_outline().
  |
  | thrtythreeforty wrote:
  | > There's no easy way to tell the C++ compiler, "please make
  | only one version of foo and use callbacks to figure out what it
  | means to += on T".
  |
  | Partial counterpoint: you could use class polymorphism, if T is
  | always a type you control. But you're right in general; C++
  | doesn't have typeclasses or some other way to create an ad-hoc
  | vtable for, say, int.
  |
  | > Now, if you want monomorphization like C++, call foo() and
  | pass a literal for add_int. But if you don't, then call
  | foo_outline.
  |
  | Does this work through structs of function pointers? Is that the
  | reason it's so powerful?
  |
  | For example, a my_class constructor create_my_class makes the
  | class point to foo_outline, unless it's known to be a hot type,
  | in which case it points to foo. When you call create_my_class,
  | would the compiler see the values of the pointers and start
  | inlining foo calls, including if you pass the my_class struct
  | into foo as "this", continuing the inlining?
  |
  | pizlonator wrote:
  | Yes, it works with structs of function pointers. And it works
  | recursively.
  |
  | This works (this is slightly shorthand C, fill in the blanks
  | yourself):
  |
  |     struct config {
  |         void (*foo)(things bar);
  |         stuff (*bar)(int baz);
  |     };
  |
  |     always_inline stuff doit(config c)
  |     {
  |         c.foo(whatever);
  |         return c.bar(42);
  |     }
  |
  |     always_inline void my_foo(...) { ... }
  |     always_inline stuff my_bar(...) { ... }
  |
  |     stuff dostuff(void)
  |     {
  |         config c;
  |         c.foo = my_foo;
  |         c.bar = my_bar;
  |         return doit(c);
  |     }
  |
  | This'll all get inlined and specialized to death.
  |
  | thrtythreeforty wrote:
  | Got it. That is definitely a cool insight. Thanks for sharing!
  |
  | RcouF1uZ4gsC wrote:
  | > Would this allow you to mark a "class" (or individual members
  | of that class) as "hot, please monomorphize" by making it
  | always_inline, or "cold, please use regular pointers" by
  | omitting always_inline?
  |
  | From my understanding of the technique, this is correct.
  |
  | verdagon wrote:
  | IIRC, the branch predictor can work with function pointers as
  | well.
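 
(For reference, a filled-in, compilable version of pizlonator's
shorthand sketch above. The concrete types and function bodies here
are assumptions chosen for illustration, standing in for the
"things"/"stuff" placeholders.)
 
    #include <stdio.h>
 
    #define always_inline inline __attribute__((always_inline))
 
    struct config {
        void (*foo)(int bar);
        int (*bar)(int baz);
    };
 
    static always_inline int doit(struct config c)
    {
        c.foo(7);
        return c.bar(42);
    }
 
    static always_inline void my_foo(int bar) { printf("foo %d\n", bar); }
    static always_inline int my_bar(int baz) { return baz + 1; }
 
    static int dostuff(void)
    {
        struct config c;
        c.foo = my_foo;
        c.bar = my_bar;
        /* The constant function pointers let the compiler inline
           my_foo/my_bar through doit, specializing the whole call. */
        return doit(c);
    }
 
    int main(void)
    {
        printf("%d\n", dostuff());
        return 0;
    }
 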
  | It can sometimes mitigate problems like this, though not as
  | completely as just monomorphizing away the run-time
  | branching/calling like they did.
  |
  | exikyut wrote:
  | A weekend project idea I just thought of, but wouldn't be able
  | to comment on as well as someone more experienced in memory
  | management:
  |
  | Run a bunch of memory allocator demo programs (trivial and
  | complex) under _blinkenlights_
  | (https://justine.lol/blinkenlights/), a program execution
  | visualizer that shows how programs interact with the stack and
  | heap as they run.
  |
  | For bonus points, proceed to explain _why_ their visualized
  | performance is different. :)
  |
  | meisel wrote:
  | Will this have support for macOS by inserting itself, as
  | jemalloc does, as the default malloc zone? If so, does that mean
  | it will be very fast at determining if a given pointer was
  | allocated by libpas?
  |
  | pizlonator wrote:
  | libpas can do that, though some of the code to do it is not in
  | the WebKit repo (but the enumerator support - a complex
  | component that is necessary to fully replace system malloc - is
  | in the repo).
  |
  | libpas can already determine if it owns a pointer in O(1) time
  | with no locking. It uses megapages for this purpose. Though,
  | right now, it's not configured to fully leverage this (if you
  | pass an unknown ptr to free, then as a last step it'll lock the
  | heap_lock, even though it would be a small change to just use
  | megapages there).
  |
  | donfuzius wrote:
  | This looks awesome and I really like the tone of the
  | documentation. Great work! At the same time, I'm sort of worried
  | that such a deep change in the memory management of my day-to-
  | day browser opens up a significant security risk for a
  | considerable time, until new bugs are found and closed. I'm sure
  | this got more than one pair of eyes at Apple, but I'd really
  | welcome a specific bug bounty on such a change...
  |
  | pizlonator wrote:
  | That's a fair point. Code churn creates bugs. That's a good
  | argument for slowing down the pace of browser development
  | overall, except maybe the kind of development that focuses on
  | security hardening.
  |
  | Libpas has some security features that bmalloc didn't have. It
  | goes further than bmalloc in avoiding inline metadata (bmalloc
  | had an IsoHeap<> thing that used scrambled inline freelists).
  | Also, libpas opens the door to isoheaping _all_ allocations in
  | WebKit (see https://bugs.webkit.org/show_bug.cgi?id=231938),
  | which would be a big security win.
  |
  | Also, libpas has been fuzzed to death. In test_pas, there's a
  | bunch of fuzz tests I call "chaos", and I've run those for
  | extended periods of time (days, weeks) after major changes. Of
  | course it still has bugs, but considering how insane the heap
  | organization is, libpas also makes attackers' lives much harder
  | in the short term.
  |
  | The combination of libpas's enhanced security features and the
  | fact that it substantially changes heap layout might be enough
  | to cancel out the fact that it has an increased bug tail.
  | And... if you're worried about bug tail from major changes
  | introducing security vulns, then there are probably much worse
  | major changes that someone could do.
  |
  | ryanianian wrote:
  | Are there any projects that surgically augment the memory APIs
  | to be more cooperative? Sure, `malloc` implementations will make
  | various tradeoffs, but what if the malloc/free APIs were
  | expanded to expose more programmer intent, cooperative defrag,
  | etc.?
  |
  | - Perhaps `malloc` could ask for an intended lifecycle (think GC
  | generation) that could swap arenas or other algo internals. This
  | opens us up to meta-allocators.
  |
  | - Perhaps program lifecycle hooks that give allocators more
  | context into how already-allocated memory intends to be
  | accessed. ("Done bootstrapping the server, so now new memory
  | will be accessed by single threads and only live for one
  | request", etc.)
  |
  | - Perhaps the programmer could periodically call `remalloc` to
  | allow objects to be moved or promoted between
  | arenas/generations.
  |
  | - Perhaps mallocs/frees could be more intelligently batched
  | (where explicit structs/arrays don't normally make sense) for
  | surprise-free defrag or OS allocation.
  |
  | - Intelligent pairing of malloc/free with explicit notions of
  | intended re-use; some indication of "I will free this and ask
  | for the same memory again in the same call", or auto-struct-ing
  | mallocs which know that we tend to stack
  | malloc(foo);malloc(bar);free(bar);free(foo). Avoid any actual
  | OS/memory involvement whatsoever.
  |
  | - Perhaps `identmalloc` could handle common flyweight patterns
  | of instantiate-unless-exists; e.g. intelligently turn shared
  | state into offsets into contiguous arenas. This may be more
  | useful in C++ with existing conventions around copy-by-value
  | constructs.
  |
  | Some of these are indeed components of existing
  | mallocs/allocators (e.g. libpas is type-aware), but my point is
  | that allocators have to jump through hoops to derive heuristics,
  | and these heuristics often amount to "the programmer often
  | intends to use memory like X." Performance engineers then
  | pattern-match for X rather than cooperating with the compiler to
  | expose X intrinsically from the beginning. The above ideas
  | impose _slightly_ different programming paradigms, so there's
  | still the white whale of a perfect drop-in heuristic, but it
  | seems less subject to temporary or local maxima.
  |
  | So: are there compelling (albeit slightly less general-purpose)
  | alternatives where the allocator and programmer communicate
  | additional hints or other API contracts that allow the allocator
  | to move away from heuristics and into proven patterns? Or is
  | this basically akin to "use C++/smart pointers or just get a
  | GC"?
  |
  | pizlonator wrote:
  | I've thought about this a lot. I think that overall,
  | malloc/free/new/delete are already so hard to use that if you
  | added more stuff, it would create too much cognitive load for
  | the programmer. The result would be that the hints the
  | programmer gave you would be more wrong than the malloc's best
  | guess.
  |
  | Let me go through these point by point and offer some thoughts.
  |
  | > Perhaps `malloc` could ask for an intended lifecycle (think GC
  | generation) that could swap arenas or other algo internals. This
  | opens us up to meta-allocators.
  |
  | It could, and if that information was accurate, then it would be
  | profitable. But if that information was inaccurate, then you'd
  | get worse perf and worse memory usage than if you let malloc
  | just do whatever it wants. Also, anything like this creates more
  | fragmentation, since it creates a new opportunity for the
  | programmer to force the allocator to pick something other than
  | the first fit.
  |
  | > Perhaps program lifecycle hooks that give allocators more
  | context into how already-allocated memory intends to be
  | accessed.
("Done bootstrapping the server, so now new memory | will be accessed by single threads and only live for one | request", etc) | | If you convey this information after you've already allocated | the object then there isn't much the malloc could do. If the | malloc uses this information to decide where the next malloc | call puts memory then you've created a fragmentation risk. | | - Perhaps the programmer could periodically call `remalloc` to | allow objects to be moved or promoted between | arenas/generations. | | Yes, they could, and this is a great idea. I believe that | calling realloc (or just mallocing a new object, memcpying or | copy-constructing, and freeing the old one) relieves | fragmentation because it gives the malloc an opportunity to put | the object in a more profitable location. | | - Perhaps mallocs/frees could be more intelligently batched | (where explicit structs/arrays don't normally make sense) for | surprise-free defrag or OS allocation. | | libpas batches allocation and deallocation. These strategies | help performance. They have only a small effect on memory usage | (and that effect is that you use a bit more memory by | batching). | | - Intelligent pairing of malloc/free with explicit notions of | intended re-use; some indication of "I will free this and ask | for the same memory again in the same call" or auto-struct-ing | mallocs which know that we tend stack | malloc(foo);malloc(bar);free(bar);free(foo). Avoid any actual | OS/memory involvement whatsoever. | | Allocators like libpas (and mimalloc and many others) have very | cheap malloc/free calls that perform great in these situations. | They don't do any locking or call into the OS in the common | case when you do this. | | - Perhaps `identmalloc` could handle common flyweight patterns | of instantiate-unless-exists; e.g. intelligently turn shared | state into offsets into a contiguous arenas. This may be more | useful in C++ with existing conventions around copy-by-value | constructs. | | I think that most fast mallocs like libpas (and others) make | this situation cheap enough that you don't need anything else. | | - Some of these are indeed components of existing | mallocs/allocators (e.g. libpas is type-aware), but my point is | that allocators have to jump through hoops to derive | heuristics, and these heuristics often amount to "the | programmer often intends to use memory like X." Performance | engineers then pattern-match for X rather than cooperating with | the compiler to expose X intrinsically from the beginning. The | above ideas impose slightly different programming paradigms, so | there's still the white wale of a perfect drop-in heuristic, | but it seems less subject to temporary or local maxima. | | If the program is simple enough that you understand the object | lifetimes completely, or you have enough resources to throw at | the problem that you can completely understand lifetimes even | for a complex program, then the best path forward is to just | not call malloc. Lay out your memory a priori or write a highly | customized "allocator" (that may not even have an API that | looks anything like malloc). | | If the program is too complex for such reasoning, then I | believe there's no way for the programmer to tell the malloc | anything it can't deduce with heuristics. The programmer can | say some stuff, but when you have a large sophisticated heap, | then the programmer will not be able to predict what it is | about the heap that creates pain for the malloc. 
  | In most experiments where I've tried to do this kind of tuning,
  | I end up with something that runs slower or uses more memory,
  | and the investigation usually leads me to learn that my
  | hypothesis about the malloc's pain point was totally wrong.
  |
  | > So: are there compelling (albeit slightly less general-
  | purpose) alternatives where the allocator and programmer
  | communicate additional hints or other API contracts that allow
  | the allocator to move away from heuristics and into proven
  | patterns? Or is this basically akin to "use C++/smart pointers
  | or just get a GC"?
  |
  | Maybe! I'm not proving a negative here. But everything I've
  | tried in this area has been a negative result for me.
  |
  | twoodfin wrote:
  | _I've thought about this a lot. I think that overall,
  | malloc/free/new/delete are already so hard to use that if you
  | added more stuff, it would create too much cognitive load for
  | the programmer. The result would be that the hints the
  | programmer gave you would be more wrong than the malloc's best
  | guess._
  |
  | I think this is a generally applicable lesson. Programmers
  | overestimate their ability to statically optimize the dynamic
  | behavior of their code, _especially_ relative to a runtime that
  | _can_ access and act on dynamic state.
  |
  | This comes up in the database world as application developers
  | desiring to "pin" certain tables in memory. Experience has shown
  | this to be an anti-feature: as the schema and workload of a
  | database system evolve, these pinning decisions can become
  | obsolete, starving more relevant tables of buffer space. Dynamic
  | decisions, even made by simple heuristics like LRU, are an
  | empirically winning strategy.
  |
  | The failure of Itanium/EPIC and the "sufficiently smart
  | compiler" to beat traditional OoO execution is another example
  | from a different computing domain.
  |
  | akrymski wrote:
  | Now if only I could _limit_ the amount of memory a browser would
  | consume! The fact that desktop browsers grind a 16GB machine to
  | a halt, while having 1000s of tabs on mobile devices is a non-
  | issue, is plain wrong. Since I can't limit a process's memory on
  | macOS (any OS?), why not give me a setting in preferences to do
  | so? After all, the browser is really just a VM at this stage.
  | I'm quite certain I don't need 100 tabs in active memory at
  | once.
  |
  | While at it, why can't we hibernate tabs? Write the JSC heap and
  | DOM state to disk, the way VMs do it?
  |
  | What's taking up the majority of my memory anyway - the graphics
  | / GPU side or the JS VM?
  |
  | There's a whole spectrum of startups trying to solve the WebKit
  | memory consumption issues crudely, such as
  | https://www.mightyapp.com
  |
  | pizlonator wrote:
  | I think that desktop browser users expect that their tabs
  | generally won't get reloaded. On mobile, users have generally
  | accepted that their tabs will get reloaded. Maybe browsers
  | should just reload tabs if anything fishy happens or you're not
  | using them.
  |
  | Hibernating tabs to disk is hard because the JS heap (and the
  | DOM heap) will contain lots of stuff that is tied to handles you
  | got from the OS. It's not impossible. But it would require a lot
  | of work to build specialized serialization functions for stuff
  | that has native state attached to it that the browser doesn't
  | control or have good visibility into.
  |
  | I think that some websites use a lot of graphics memory while
  | other websites use a lot of JS memory. And there are other
  | categories of memory usage, too.
  |
  | Maybe if they put the cloud browsers on the blockchain using
  | machine learning, and then write them in Rust and compile them
  | to wasm, then it will be even better.
  |
  | akrymski wrote:
  | > I think that desktop browser users expect that their tabs
  | generally won't get reloaded
  |
  | But they do get reloaded. There's absolutely no guarantee
  | provided, and browser tabs will get swapped out of memory as
  | necessary (e.g. opening lots of new tabs or apps).
  |
  | This is easily solvable though: let me pin tabs that I really
  | need to persist.
  |
  | pizlonator wrote:
  | Sure they do, but much less aggressively than mobile browsers. I
  | think that desktop Safari will try to keep your tab live so long
  | as nothing fishy happens. Based on my experiences using Chrome,
  | it seems to do the same thing.
  |
  | tptacek wrote:
  | This is a thread about an allocator design, not really about
  | WebKit.
  |
  | akrymski wrote:
  | It's also a thread about memory allocation in WebKit, tho.
  |
  | jibcage wrote:
  | Fun fact: I believe PAS stood for "Phil's awesome shit." Whether
  | that's subject to NDA is anyone's guess :D
  |
  | pizlonator wrote:
  | I can neither confirm nor deny this awesome conjecture.
___________________________________________________________________
(page generated 2022-06-01 23:00 UTC)