[HN Gopher] All About Libpas, Phil's Super Fast Malloc
       ___________________________________________________________________
        
       All About Libpas, Phil's Super Fast Malloc
        
       Author : Jarred
       Score  : 238 points
       Date   : 2022-06-01 13:45 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | carlsborg wrote:
       | Trying to understand the license for bmalloc, and webkit in
       | general. The WebCore folder has an Apple license and two LGPL
       | licenses. The root folder has no license file. Can anyone
       | comment?
        
         | pizlonator wrote:
         | libpas is BSD 2-clause. so is bmalloc. libpas is in the bmalloc
         | directory in WebKit, but is really a separate piece of code.
         | 
         | each file has its own license header.
        
       | SemanticStrengh wrote:
        
       | ridiculous_fish wrote:
       | Hey Phil, thanks for the write up! Curious why this is only used
       | for JSC's JIT, and not to back its heap? Did the design of the
       | JSC GC influence libpas at all?
        
         | pizlonator wrote:
         | It's used for JSC's JIT and for all heap allocations in
         | JavaScriptCore and WebKit that aren't GC allocations.
         | 
         | Riptide (the JSC GC) has a similar design and some of the key
         | ideas were first pioneered there:
         | 
         | - The use of fast bitvector searches to find an eligible page
         | (Riptide calls it first eligible block, I think).
         | 
         | - The use of bitvector simd to change object states in bulk.
         | 
         | - Riptide uses a mark bitvector that is indexed by minalign
         | atom (16 bytes) and a block size of 16KB. Libpas's default
         | "small segregated" configuration is 16 byte minalign, 16KB
         | block, and an alloc bitvector indexed by minalign.
         | 
         | I have since thought about how Riptide could be wired up to use
         | libpas. It's possible to do it. Already today you could create
         | a libpas page_config that leaves room for mark bits in the page
         | header, and makes them use the same indexing as the alloc bits.
         | Riptide has a "newly allocated" bitvector that serves almost
         | the same purpose as the alloc bits (kinda, if you squint really
          | hard) - so you'd go from Riptide's newly_allocated and mark
         | bitvectors, to libpas's alloc bitvector and a custom mark
         | bitvector stuffed into page headers using an exotic
         | page_config. If JSC did this, it would buy:
         | 
         | - Better decommit policies than Riptide. I did a lot of tuning
         | in libpas to make decommit Just Right (TM).
         | 
         | - Faster allocation of medium-sized objects.
         | 
         | - Possibly better memory efficiency.
         | 
         | But, Riptide is already very nicely tuned, and it gets some
         | benefit from libpas's decommit rules because Riptide calls
         | fastMalloc to get memory and sometimes uses fastFree to return
         | it. fastFree calls bmalloc::api::free, which calls libpas's
         | bmalloc_deallocate, which then goes to pas_deallocate with the
         | BMALLOC_HEAP_CONFIG. So, Riptide already gets some of the
         | goodness of libpas by sitting on top of it.
        
         | [deleted]
        
       | dragontamer wrote:
        | General-purpose memory allocation is a lie. Every allocator has
        | use cases where it's faster than the others.
       | 
       | I think that's why we keep seeing newer malloc schemes pop up,
       | because the performance of the heap 100% depends on the use-case,
       | and different people have different use cases.
       | 
       | Still, studying everyone else's heaps (and garbage collectors, a
       | closely related discussion) is probably good for high-performance
       | programmers.
        
         | pizlonator wrote:
         | Yeah, you're totally right.
         | 
         | Libpas beats other mallocs in WebKit. I also had benchmarks
         | involving non-WebKit workloads and the results were all over
         | the place (sometimes slower than other mallocs by a lot,
         | sometimes faster by a lot - same thing with memory, sometimes
         | more efficient, sometimes less). This didn't surprise me; I've
         | seen this before when writing memory management code.
         | 
         | I think it makes sense for large software projects that use
         | malloc a lot and care about perf to eventually either write
         | their own malloc or to take an existing one and then tune it a
         | lot.
        
         | latenightcoding wrote:
         | I've seen a few opensource projects archive their custom
         | allocators because they were not beating the system's one or
         | jemalloc. So if you have the skill and time then yeah go for
         | it, else stick with general purpose ones.
        
           | mananaysiempre wrote:
           | There are a couple of approaches that might legitimately be
           | simpler than using a general-purpose allocator: in a single-
           | pass batch process, just throw things on the floor and let
           | the OS sort it out on exit[1] (I think the D compiler does
           | this); in a multiple-pass batch process, make each pass
           | litter its own room (arena, obstack, etc.) then demolish said
           | room when done (GCC does this or at least did in the past).
           | 
           | On the other hand, these may require rearranging the logic
           | somewhat to fit them, so the question of how much
           | rearrangement to tolerate still remains. And, well[4],
            | /*
            |  * We divy out chunks of memory rather than call malloc
            |  * each time so we don't have to worry about leaking
            |  * memory.  It's probably not a big deal if all this memory
            |  * was wasted but if this ever goes into a library that
            |  * would probably not be a good idea.
            |  *
            |  * XXX - this *is* in a library....
            |  */
           | 
           | [1] I am rather dismayed by how Raymond Chen advocates for
           | this for memory[2] but insists it could not possibly be a
           | good idea for Windows GDI handles, no, you stupid sloppy
           | programmer[3]. Maybe because he was involved in GDI from the
           | other side? (Of course, for file descriptors on Unix it's
           | still the standard practice.)
           | 
           | [2] http://bytepointer.com/resources/old_new_thing/20120105_0
           | 06_...
           | 
           | [3] http://bytepointer.com/resources/old_new_thing/20051014_3
           | 05_...
           | 
           | [4] https://github.com/the-tcpdump-
           | group/libpcap/blob/4c1e516dd2...
        
             | jstimpfle wrote:
             | This "[2] vs [3]" is a good find. I have several possible
             | explanations. It could be the difference of 7 years between
             | those posts. It could also be that Windows Handles can't be
              | cleaned up in userspace. Note how malloc chunks up the
              | "handles" (i.e. pages / regions) that it got from Windows.
              | This is, AFAIK, different from HANDLEs, which have to be
              | requested from the system one-by-one and can't be chunked
              | up. (Or am I wrong? I actually don't know a lot about how
              | this works under the hood.)
        
           | dragontamer wrote:
            | It's very easy to beat the general-purpose ones if you know
            | your exact use case.
           | 
            | Ex: 16-bit pointers give a 65536-element heap. Assume 8
            | bytes per element; that's 512KB of space. A bit small, but
            | large enough to do a lot of things.
           | 
           | 65536 elements can be represented as a bitmask. The bitmask
           | only takes up 8192-bytes (8KB), which fits inside of 16
           | AVX512 registers (Intel offers 32x AVX512/ZMM registers btw).
           | Or it fits inside of GPU __shared__ memory.
           | 
            | If you need multithreading, you perform atomic AND and
            | atomic OR to clear and set the bitmask as appropriate.
           | 
           | -------
           | 
           | Do you know the exact size of your heap? The size of the
           | pointer? The amount of parallelism involved? What about the
           | size of the elements? Access pattern? (Bump-alloc'd Linked
           | Lists are a sequential traversal over RAM btw, so that's very
           | efficient), etc. etc.
           | 
            | The 64-bit pointer is overkill for most people's purposes.
            | 32 bits represent 4 billion objects, and if each object is
            | 16 bytes long, that's 64GB of RAM. Right here, right now, a
            | 32-bit pointer / custom allocator already offers many
            | benefits over the 64-bit pointer (half-sized pointers, more
            | data in cache, etc.).
        
             | thesz wrote:
              | 512 in AVX512 is the number of bits per register. You are
              | off by a factor of 8.
        
         | cmrdporcupine wrote:
          | 100%, and also I think there's a tendency among people who
          | compare 'higher up the stack' in GC'd languages to manual heap
          | allocation in C/C++ etc. in a way that implies the latter is
          | almost "free" from a performance POV, when in fact, underneath
          | the covers, these allocators still do a lot of the same kind
          | of walking of data structures that also happens inside a
          | tracing GC.
         | 
         | You can get unpredictable pauses from malloc/free, too.
        
           | asveikau wrote:
           | I think in either case, GC or not, you write something
           | intuitively and then when it becomes an actual problem, you
           | study the patterns and improve them, but most of the time you
           | can leave it alone and the general purpose thing is good
           | enough.
           | 
            | Or if you are in a niche like audio/video or something, you
            | avoid allocations altogether during the bulk of the code.
        
             | cmrdporcupine wrote:
             | Oh for sure. Premature optimizing is usually bad form.
             | 
             | What is tricky about performance issues in allocation is
             | that they can be hard to profile. There are tools for
             | analyzing GC performance in GC'd languages, but sometimes
             | malloc/free can just be a big black box.
        
               | djmips wrote:
                | Premature optimization is not bad form when it's
                | reframed as good architecture. So it's not 'usually bad
                | form' to architect something from the outset using your
                | experience, and that's something everyone understands.
                | This pervasive disdain for premature optimization leads
                | to bad architecture, which often leads to expensive
                | rewrites. Because the word 'optimization' is so
                | overloaded and treated with disdain, it feels like we
                | need new language to talk about what's really meant by
                | 'premature optimization' (the bad kind).
        
               | asveikau wrote:
               | That's one reason I point out a few comments above that
               | certain niches will need to take it into account ahead of
                | time. E.g. an A/V application will typically allocate all
               | buffers up front and re-use them frequently rather than
               | return them to the allocator. A lot of server
               | applications will want to keep per-client memory usage
               | low. For general purposes, there's the general purpose
               | allocator.
        
         | tptacek wrote:
         | This is a pretty well-known CS truism, isn't it? One of the
         | first papers I ever fell in love with covers it in detail:
         | https://users.cs.northwestern.edu/~pdinda/ics-s05/doc/dsa.pd...
        
       | lidavidm wrote:
       | I wonder how this compares to jemalloc, mimalloc, snmalloc?
        
         | pizlonator wrote:
         | I never got a chance to compare it to those, since I was most
         | interested in beating bmalloc. And I mainly wanted to beat it
         | on Safari workloads.
         | 
         | I believe bmalloc was previously compared against jemalloc and
         | tcmalloc, also using Safari workloads, and bmalloc was
         | significantly faster at the time.
        
           | IshKebab wrote:
           | For other people who have never heard of bmalloc - it's a
           | custom allocator used only by WebKit. I guess it's not
           | surprising they added one since the Mac system allocator is
           | extremely slow. A custom allocator is pretty much a free 20%
            | speedup on Mac (depending on your workload), but I found it
            | made no difference on Linux. Haven't tried on Windows.
        
             | pizlonator wrote:
             | Just some credit where credit is due.
             | 
             | I measured the system malloc crushing all other mallocs on
             | some workloads and they were the kind of workloads that
             | some folks run every day.
             | 
             | I measured the system malloc crushing most other mallocs on
             | memory efficiency on most workloads. System malloc is
             | really good at reusing memory and has very mature decommit
             | policies that easily rival what I came up with.
             | 
             | So there's that.
        
               | pcwalton wrote:
               | macOS malloc was _incredibly_ slow for Pathfinder, far
               | behind every other OS. Everything became bottlenecked on
               | it. It was a free 2x speedup if not more to switch to
               | jemalloc.
               | 
               | I suspect this is because Pathfinder's CPU portion is a
               | multicore workload and macOS allocator performs poorly
               | when under heavy multithreaded contention. It probably
               | just isn't the kind of workload that macOS allocator was
               | tuned for.
        
               | pizlonator wrote:
               | My understanding is that the system malloc is excellent
               | under contention, but has a high baseline cost for every
               | malloc/free call.
               | 
               | I don't remember exactly what workloads it performed
               | really great at (and if I did I dunno if I could say),
               | but I do remember they were parallel, and the speed-ups
               | got bigger the more cores you added.
               | 
               | Everything else about your experience matches mine.
               | Libpas is much faster than system malloc in WebKit and
               | JSC. The difference isn't 2x on my preferred benchmarks
               | (which are large and do lots of things that don't rely on
               | malloc), but it is easily more than 2x on smaller
               | benchmarks. So your 2x result sounds about right.
        
               | meisel wrote:
               | What sort of workloads are those? I have yet to see a
               | workload where macOS's system allocator is not
               | substantially slower than alternatives like jemalloc.
        
             | astrange wrote:
             | Alas, many people think performance engineering is only
             | about the wall clock time of their program and not what
             | impact it has on anything else.
             | 
             | malloc performance highly depends on how large the
             | allocations are though.
        
             | KerrAvon wrote:
             | What's the workload where you found that to be the case?
             | The system allocator is heavily tuned for general use, and
             | should be very difficult to beat -- generally.
        
               | IshKebab wrote:
               | It was a non-traditional compiler written in C++.
               | 
               | > The system allocator is heavily tuned
               | 
               | Perhaps but it probably has more constraints than a
               | custom allocator too, e.g. backwards compatibility with
               | software that unwittingly depends on its exact behaviour.
               | 
               | I found a discussion of this where people reported
               | various speedups (10%, 100%, 300%) on Mac by switching to
               | a custom allocator:
               | https://news.ycombinator.com/item?id=29068828
               | 
               | I'm sure there are better benchmarks if you Google it.
        
       | jeffbee wrote:
        | Well, the other browser's design doc is nowhere near as
        | detailed as this one, but if you want to compare and contrast
        | the overall design, here is Chromium's PartitionAlloc.
       | 
       | https://chromium.googlesource.com/chromium/src/+/master/base...
        
       | rq1 wrote:
       | > Consequently, passing a function pointer (or struct of function
       | pointers), where the pointer points to an always_inline function
       | and the callee is always_inline results in specialization akin to
       | template monomorphization. This works to any depth; the compiler
       | won't be satisfied until there are no more always_inline function
       | calls. This fortuitous development in compilers allowed me to
       | write very nice template code in C. Libpas achieves templates in
       | C using config structs that contain function pointers --
       | sometimes to always_inline functions (when we want specialization
       | and inlining) and sometimes to out-of-line functions (when we
       | want specialization but not inlining). Additionally, the C
       | template style allows us to have true polymorphic functions. Lots
       | of libpas slow paths are huge and not at all hot. We don't want
       | that code specialized for every config. Luckily, this works just
       | fine in C templates -- those polymorphic functions just pass
       | around a pointer to the config they are using, and dynamically
       | load and call things in that config, almost exactly the same way
       | that the specialized code would do. This saves a lot of code size
       | versus C++ templates.
       | 
       | This document is incredible!
        
         | photochemsyn wrote:
         | Seems like a real tour de force of low-level memory management.
         | Prior to seeing this my go-to for understanding custom low-
         | level (for embedded) memory managers was this (explains the
         | basics of stack, block, bitmap and thread-local allocation) but
         | now I have something far more complicated to get confused
         | about.
         | 
         | https://www.embedded-software-engineering.de/dynamic-memory-...
        
           | pizlonator wrote:
           | Thank you! :-)
           | 
           | I always wanted to write a malloc, and this is what I came up
           | with after 3.5 years or so.
        
         | [deleted]
        
       | adamnemecek wrote:
       | I'm always suspicious of things that are named after the maker.
        
         | astrange wrote:
         | A better rule is to be suspicious of anything with "fast" or
         | synonyms in the name. It either won't be fast or it'll be
         | insecure.
        
         | fmajid wrote:
          | Did you know the entire US Air Traffic Control system runs on
          | long-discontinued computers and is written in JOVIAL, "Jules'
          | Own Version of the International Algebraic Language" (i.e.
          | ALGOL), named after Jules Schwartz?
         | 
         | https://en.wikipedia.org/wiki/JOVIAL
         | 
          | The FAA's been trying to replace it for ages, but AFAIK it's
          | still ongoing.
        
         | thrtythreeforty wrote:
         | Like Linux, or gcc.godbolt.org?
         | 
         | I kid, but I don't see it as a downside for software. Naming a
         | scientific phenomenon after yourself is a red flag though, see
         | also the Crackpot Index [1].
         | 
         | [1]: https://math.ucr.edu/home/baez/crackpot.html
        
           | tialaramex wrote:
           | Godbolt _says_ Compiler Explorer right there, the fact people
           | call it  "Godbolt" is because that's the memorable name
           | rather than because Matt Godbolt tried to name it after
           | himself. He calls it Compiler Explorer.
           | 
           | You can't control what people call things. My favourite
           | example is Blackfriars Bridge in London. When that bridge was
           | built, it was formally named after William Pitt. But like,
           | who cares? In what way is this bridge, Pitt's bridge? People
           | knew where it was, the Blackfriars area of London, so too bad
           | William Pitt, everybody called it Blackfriars Bridge. And so,
           | that's what its name is.
           | 
            | The priory from which "Blackfriars" got its name hadn't
           | existed for two hundred years by the time the bridge was
           | proposed. Nobody living when the bridge was finished would
           | have ever known anybody who'd met any of the friars during
           | their lifetime. Nevertheless that part of London had "always"
           | been called Blackfriars, and so a bridge and a railway
           | station built many years later are named that too.
        
             | bool3max wrote:
              | People call it Godbolt because you access it via
              | godbolt.org.
        
         | throwaway744678 wrote:
         | From the top of my head: Linux, Bourne shell...
        
           | adamnemecek wrote:
        
             | kbenson wrote:
             | Sometimes things are named after the maker ex post facto
             | because they need a way to distinguish it. The Bourne
             | shell, mentioned here, is likely called that for the same
             | reason the very first shell was called the Thompson shell,
             | in that they were written by a single person originally and
             | not really given a name (the original being from Ken
             | Thompson), since they were just different (or the original)
             | implementations of the system shell. Eventually you're
             | going to need a way to distinguish one from the other
              | though, right? Noting who wrote it is as good as (probably
              | better than) any other option.
             | 
             | Linux was probably named differently, but also probably
             | doesn't really match whatever criteria you're using to be
             | suspicious. It started as a hobby project to make a free
             | and open source UNIX system, of which there were few
             | (none?), written by one guy. There was no expectation that
             | other people would use it for anything other than something
             | interesting to play around with, and using a portion of
             | your own name for your own little hobby project isn't
             | something I would look askance at, personally, especially
             | in an age where there was no expectation that it might grow
             | to anything, because that was definitely not the norm at
             | the time.
        
               | ncmncm wrote:
               | "Linux" was forced on him.
        
         | pizlonator wrote:
         | It's called libpas, and pas stands for whatever you like.
        
       | mattgreenrocks wrote:
       | I admit I'm a sucker for metacircularity, but have to ask:
       | 
       | > The bootstrap heap has hacks to allow itself to allocate that
       | array out of itself.
       | 
       | Not quite following that, can someone elaborate?
        
         | convolvatron wrote:
          | I suspect it's that - generally when you write a heap it needs
          | to keep some metadata. If you have a multi-heap system (really
          | handy), then it's generally easier to host that metadata on
          | another heap. Sometimes you really want that, because the
          | target heap might be specialized for different kinds of data.
          | So I assume this is carefully constructing the heap state so
          | that it can be used for its own metadata as it's being
          | constructed.
        
         | pizlonator wrote:
         | The bootstrap heap has a freelist that is just an array of
         | entries that tell you where the free memory is.
         | 
         | That array has to be allocated somewhere.
         | 
         | All pas_simple_large_free_heaps allocate that array from the
         | bootstrap heap. Including the bootstrap heap. So, the
         | pas_simple_large_free_heap has hacks that go something like:
          | "if I'm allocating or freeing something and I need to allocate
          | or free the free list, then _very carefully_ also perform that
          | allocation/deallocation in tandem with the one I'm doing". The
          | main hack there is just that there's some statically allocated
          | "slop" for the bootstrap free list, so if you want to add to it
          | in the middle of an allocation, you add it to the slop array,
          | and then after you're done you reallocate the free list and
          | copy stuff from slop into it.
        
       | sydthrowaway wrote:
       | Did the author leak this IP when he left Apple?
        
         | pizlonator wrote:
         | lol
         | 
         | WebKit is open source. Libpas was always designed with WebKit
         | in mind. It landed in WebKit half a year before I left Apple.
        
       | RcouF1uZ4gsC wrote:
       | > Consequently, passing a function pointer (or struct of function
       | pointers), where the pointer points to an always_inline function
       | and the callee is always_inline results in specialization akin to
       | template monomorphization. This works to any depth; the compiler
       | won't be satisfied until there are no more always_inline function
       | calls. This fortuitous development in compilers allowed me to
       | write very nice template code in C. Libpas achieves templates in
       | C using config structs that contain function pointers --
       | sometimes to always_inline functions (when we want specialization
       | and inlining) and sometimes to out-of-line functions (when we
       | want specialization but not inlining). Additionally, the C
       | template style allows us to have true polymorphic functions. Lots
       | of libpas slow paths are huge and not at all hot. We don't want
       | that code specialized for every config. Luckily, this works just
       | fine in C templates -- those polymorphic functions just pass
       | around a pointer to the config they are using, and dynamically
       | load and call things in that config, almost exactly the same way
       | that the specialized code would do. This saves a lot of code size
       | versus C++ templates.
       | 
       | I thought this was super interesting. Seems like a very nice
       | technique for when you use function pointers in C for stuff like
       | sort comparison. There was an earlier thread about PostgreSQL
       | that discussed how they created multiple sort functions to avoid
        | the overhead of an indirect function pointer call when sorting
       | things like integers where the function call dwarfs the
       | comparison cost. This seems like a technique that could have been
       | used to avoid duplicating the sorting functions while still
       | getting the benefits.
        
         | thrtythreeforty wrote:
         | I still don't quite understand how the technique saves code
         | size versus C++ templates, although I am quite eager to
         | understand it since template instantiation is a thorn in my
         | side when writing code for a microcontroller with 32KB of code
         | space.
         | 
         | Is the idea that the struct contains pointers to always_inline
         | functions, and that the compiler has visibility into this at
         | the top of the always_inline call stack, so the always_inline
         | gets monomorphized even through multiple layers of pointer
         | indirection? If I'm understanding it right, that's a pretty
         | nice technique. Would this allow you to mark a "class" (or
         | individual members of that class) as "hot, please monomorphize"
         | by making it always_inline, or "cold, please use regular
         | pointers" by omitting always_inline?
         | 
         | C++ templates would force you to monomorphize every variant, or
         | use vtables (function pointers) for every variant, without the
         | ability to choose the approach for each subclass. The compiler
         | is also spectacularly bad at recognizing deduplication
         | opportunities between template specializations.
        
           | verdagon wrote:
           | I think that's right.
           | 
           | C++ binaries tend to be pretty large because they use
           | templates/monomorphization a lot, even when there aren't
           | really meaningful performance gains. For example, it often
           | doesn't make sense to have 20 instantiations of std::vector,
           | when only two of them are used in the critical path. Rust can
           | be even further in this unfortunate direction because of the
           | borrow checker's restrictions around polymorphism and
           | abstraction.
           | 
           | In comparison, C, Java, and Javascript tend to rely on
           | dynamic dispatch more often (not that C has any built-in
           | dynamic dispatch features, but in practice, C programs tend
           | to use function pointers manually).
           | 
           | A few languages like C# have a really nice balance, and
           | monomorphize in the JIT at run-time.
           | 
           | With this technique they mention, C is the only mainstream
           | language I know of that lets you control it in a decoupled
           | way, which is really cool.
        
           | pizlonator wrote:
            | Say you have:
            | 
            |     template<typename T>
            |     void foo(T& value) { value += 42; }
            | 
            | There's no easy way to tell the C++ compiler, "please make
            | only one version of foo and use callbacks to figure out what
            | it means to += on T".
            | 
            | In libpas, this would be done by first creating:
            | 
            |     always_inline void foo(
            |         void* value, void (*add_int)(void* value, int addend))
            |     {
            |         add_int(value, 42);
            |     }
            | 
            | And then you could also create:
            | 
            |     never_inline void foo_outline(
            |         void* value, void (*add_int)(void* value, int addend))
            |     {
            |         foo(value, add_int);
            |     }
           | 
           | Now, if you want monomorphization like C++, call foo() and
           | pass a literal for add_int. But if you don't, then call
           | foo_outline.
           | 
           | Say you have 1000 callsites to foo(), they all use different
           | types, and 5 of those callsites are hot while the other ones
           | are hardly ever called (maybe they only run in unusual-but-
           | necessary situations). So, the 5 hot callsites can call
            | foo(), and the remaining 995 cold ones can call
            | foo_outline().
        
             | thrtythreeforty wrote:
             | > There's no easy way to tell the C++ compiler, "please
             | make only one version of foo and use callbacks to figure
             | out what it means to += on T".
             | 
             | Partial counterpoint: you could use class polymorphism, if
             | T is always a type you control. But you're right in
             | general; C++ doesn't have typeclasses or some other way to
             | create an ad-hoc vtable for, say, int.
             | 
             | > Now, if you want monomorphization like C++, call foo()
             | and pass a literal for add_int. But if you don't, then call
             | foo_outline.
             | 
             | Does this work through structs of function pointers? Is
             | that the reason it's so powerful?
             | 
              | For example, a my_class constructor, create_my_class,
              | makes the struct point to foo_outline unless it's a
              | known-hot type, in which case it points to foo. When you
              | call create_my_class, would the compiler see the values
              | of the pointers and start inlining the foo calls,
              | including when you pass the my_class struct into foo as
              | "this", continuing the inlining?
        
               | pizlonator wrote:
               | Yes, it works with structs of function pointers. And it
               | works recursively.
               | 
                | This works (this is slightly shorthand C, fill in the
                | blanks yourself):
                | 
                |     struct config {
                |         void (*foo)(things bar);
                |         stuff (*bar)(int baz);
                |     };
                | 
                |     always_inline stuff doit(config c)
                |     {
                |         c.foo(whatever);
                |         return c.bar(42);
                |     }
                | 
                |     always_inline void my_foo(...) { ... }
                |     always_inline stuff my_bar(...) { ... }
                | 
                |     stuff dostuff(void)
                |     {
                |         config c;
                |         c.foo = my_foo;
                |         c.bar = my_bar;
                |         return doit(c);
                |     }
                | 
                | This'll all get inlined and specialized to death.
        
               | thrtythreeforty wrote:
               | Got it. That is definitely a cool insight. Thanks for
               | sharing!
        
           | RcouF1uZ4gsC wrote:
           | > Would this allow you to mark a "class" (or individual
           | members of that class) as "hot, please monomorphize" by
           | making it always_inline, or "cold, please use regular
           | pointers" by omitting always_inline?
           | 
           | From my understanding of the technique, this is correct.
        
         | verdagon wrote:
         | IIRC, the branch predictor can work with function pointers as
         | well. It can sometimes mitigate problems like this, though not
         | as completely as just monomorphizing away the run-time
         | branching/calling like they did.
        
       | exikyut wrote:
       | A weekend project idea I just thought of, but wouldn't be able to
       | commentate on as well as someone more experienced in memory
       | management:
       | 
       | Run a bunch of memory allocator demo programs (trivial and
       | complex) under _blinkenlights_
       | (https://justine.lol/blinkenlights/), a program execution
       | visualizer that shows how programs interact with the stack and
       | heap as they run.
       | 
       | For bonus points, proceed to explain _why_ their visualized
       | performance is different. :)
        
       | meisel wrote:
       | Will this have support for macOS by inserting itself, as jemalloc
       | does, as the default malloc zone? If so, does that mean it will
       | be very fast at determining if a given pointer was allocated by
       | libpas?
        
         | pizlonator wrote:
         | libpas can do that, though some of the code to do it is not in
         | the WebKit repo (but the enumerator support - a complex
         | component that is necessary to fully replace system malloc - is
         | in the repo).
         | 
         | libpas can already determine if it owns a pointer in O(1) time
         | and no locking. It uses megapages for this purpose. Though,
         | right now, it's not configured to fully leverage this (if you
         | pass an unknown ptr to free then as a last step it'll lock the
         | heap_lock even though it would be a small change to just use
         | megapages there).
        
       | donfuzius wrote:
       | This looks awesome and I really like the tone of the
       | documentation. Great work! At the same time I'm sort of worried,
       | that such a deep change in the memory management of my day-to-day
       | browser opens up a significant security risk for a considerable
       | time, until new bugs are found and closed. I'm sure this got more
       | than one pair of eyes at apple, but I'd really welcome a specific
       | bug bounty on such a change...
        
         | pizlonator wrote:
         | That's a fair point. Code churn creates bugs. That's a good
         | argument for slowing down the pace of browser development
         | overall, except maybe the kind of development that focuses on
         | security hardening.
         | 
         | Libpas has some security features that bmalloc didn't have. It
         | goes further than bmalloc in avoiding inline metadata (bmalloc
         | had an IsoHeap<> thing that used scrambled inline freelists).
         | Also, libpas opens the door to isoheaping _all_ allocations in
         | WebKit (see https://bugs.webkit.org/show_bug.cgi?id=231938),
         | which would be a big security win.
         | 
         | Also, libpas has been fuzzed to death. In test_pas, there's a
         | bunch of fuzz tests I call "chaos" and I've run those for
         | extended periods of time (days, weeks) after major changes. Of
         | course it still has bugs, but considering how insane the heap
         | organization is, libpas also makes attackers' lives much harder
         | in the short term.
         | 
         | The combination of libpas's enhanced security features and the
         | fact that it substantially changes heap layout might be enough
         | to cancel out the fact that it has increased bug tail. And...
         | if you're worried about bug tail from major changes introducing
         | security vulns then there are probably much worse major changes
         | that someone could do.
        
       | ryanianian wrote:
       | Are there any projects that surgically augment the memory APIs to
       | be more cooperative? Sure `malloc` implementations will make
       | various tradeoffs, but what if the malloc/free APIs were expanded
       | to expose more programmer-intent or cooperative defrag, etc?
       | 
       | - Perhaps `malloc` could ask for an intended lifecycle (think GC
       | generation) that could swap arenas or other algo internals. This
       | opens us up to meta-allocators.
       | 
       | - Perhaps program lifecycle hooks that give allocators more
       | context into how already-allocated memory intends to be accessed.
       | ("Done bootstrapping the server, so now new memory will be
       | accessed by single threads and only live for one request", etc)
       | 
       | - Perhaps the programmer could periodically call `remalloc` to
       | allow objects to be moved or promoted between arenas/generations.
       | 
       | - Perhaps mallocs/frees could be more intelligently batched
       | (where explicit structs/arrays don't normally make sense) for
       | surprise-free defrag or OS allocation.
       | 
       | - Intelligent pairing of malloc/free with explicit notions of
       | intended re-use; some indication of "I will free this and ask for
       | the same memory again in the same call" or auto-struct-ing
        | mallocs which know that we tend to stack
       | malloc(foo);malloc(bar);free(bar);free(foo). Avoid any actual
       | OS/memory involvement whatsoever.
       | 
       | - Perhaps `identmalloc` could handle common flyweight patterns of
       | instantiate-unless-exists; e.g. intelligently turn shared state
       | into offsets into a contiguous arenas. This may be more useful in
       | C++ with existing conventions around copy-by-value constructs.
       | 
       | Some of these are indeed components of existing
       | mallocs/allocators (e.g. libpas is type-aware), but my point is
       | that allocators have to jump through hoops to derive heuristics,
       | and these heuristics often amount to "the programmer often
       | intends to use memory like X." Performance engineers then
       | pattern-match for X rather than cooperating with the compiler to
       | expose X intrinsically from the beginning. The above ideas impose
        | _slightly_ different programming paradigms, so there's still the
        | white whale of a perfect drop-in heuristic, but it seems less
       | subject to temporary or local maxima.
       | 
       | So: Are there compelling (albeit slightly less general-purpose)
       | alternatives where the allocator and programmer communicate
       | additional hints or other API contracts that allow the allocator
       | to move away from heuristics and into proven patterns? Or is this
       | basically akin to "use c++/smart pointers or just get a gc"?
        
         | pizlonator wrote:
         | I've thought about this a lot. I think that overall,
         | malloc/free/new/delete are already so hard to use that if you
         | added more stuff, it would create too much cognitive load for
         | the programmer. The result would be that the hints the
         | programmer gave you would be more wrong than the malloc's best
         | guess.
         | 
         | Let me go through these point by point and offer some thoughts.
         | 
         | - Perhaps `malloc` could ask for an intended lifecycle (think
         | GC generation) that could swap arenas or other algo internals.
         | This opens us up to meta-allocators.
         | 
         | It could, and if that information was accurate, then it would
         | be profitable. But if that information was inaccurate, then
         | you'd get worse perf and worse memory usage than if you let
         | malloc just do whatever it wants. Also, anything like this
         | creates more fragmentation since it creates a new opportunity
         | for the programmer to force the allocator to pick something
         | other than the first fit.
         | 
         | - Perhaps program lifecycle hooks that give allocators more
         | context into how already-allocated memory intends to be
         | accessed. ("Done bootstrapping the server, so now new memory
         | will be accessed by single threads and only live for one
         | request", etc)
         | 
         | If you convey this information after you've already allocated
         | the object then there isn't much the malloc could do. If the
         | malloc uses this information to decide where the next malloc
         | call puts memory then you've created a fragmentation risk.
         | 
         | - Perhaps the programmer could periodically call `remalloc` to
         | allow objects to be moved or promoted between
         | arenas/generations.
         | 
         | Yes, they could, and this is a great idea. I believe that
         | calling realloc (or just mallocing a new object, memcpying or
         | copy-constructing, and freeing the old one) relieves
         | fragmentation because it gives the malloc an opportunity to put
         | the object in a more profitable location.
         | 
         | - Perhaps mallocs/frees could be more intelligently batched
         | (where explicit structs/arrays don't normally make sense) for
         | surprise-free defrag or OS allocation.
         | 
         | libpas batches allocation and deallocation. These strategies
         | help performance. They have only a small effect on memory usage
         | (and that effect is that you use a bit more memory by
         | batching).
         | 
         | - Intelligent pairing of malloc/free with explicit notions of
         | intended re-use; some indication of "I will free this and ask
         | for the same memory again in the same call" or auto-struct-ing
          | mallocs which know that we tend to stack
         | malloc(foo);malloc(bar);free(bar);free(foo). Avoid any actual
         | OS/memory involvement whatsoever.
         | 
         | Allocators like libpas (and mimalloc and many others) have very
         | cheap malloc/free calls that perform great in these situations.
         | They don't do any locking or call into the OS in the common
         | case when you do this.
         | 
         | - Perhaps `identmalloc` could handle common flyweight patterns
         | of instantiate-unless-exists; e.g. intelligently turn shared
         | state into offsets into a contiguous arenas. This may be more
         | useful in C++ with existing conventions around copy-by-value
         | constructs.
         | 
         | I think that most fast mallocs like libpas (and others) make
         | this situation cheap enough that you don't need anything else.
         | 
         | - Some of these are indeed components of existing
         | mallocs/allocators (e.g. libpas is type-aware), but my point is
         | that allocators have to jump through hoops to derive
         | heuristics, and these heuristics often amount to "the
         | programmer often intends to use memory like X." Performance
         | engineers then pattern-match for X rather than cooperating with
         | the compiler to expose X intrinsically from the beginning. The
         | above ideas impose slightly different programming paradigms, so
          | there's still the white whale of a perfect drop-in heuristic,
         | but it seems less subject to temporary or local maxima.
         | 
         | If the program is simple enough that you understand the object
         | lifetimes completely, or you have enough resources to throw at
         | the problem that you can completely understand lifetimes even
         | for a complex program, then the best path forward is to just
         | not call malloc. Lay out your memory a priori or write a highly
         | customized "allocator" (that may not even have an API that
         | looks anything like malloc).
         | 
         | If the program is too complex for such reasoning, then I
         | believe there's no way for the programmer to tell the malloc
         | anything it can't deduce with heuristics. The programmer can
         | say some stuff, but when you have a large sophisticated heap,
         | then the programmer will not be able to predict what it is
         | about the heap that creates pain for the malloc. In most
         | experiments where I've tried to do this kind of tuning, I end
         | up with something that runs slower or uses more memory, and the
         | investigation usually leads me to learn that my hypothesis
         | about the malloc's pain point was totally wrong.
         | 
         | - So: Are there compelling (albeit slightly less general-
         | purpose) alternatives where the allocator and programmer
         | communicate additional hints or other API contracts that allow
         | the allocator to move away from heuristics and into proven
         | patterns? Or is this basically akin to "use c++/smart pointers
         | or just get a gc"?
         | 
         | Maybe! I'm not proving a negative here. But everything I've
         | tried in this area has been a negative result for me.
        
           | twoodfin wrote:
            | _I've thought about this a lot. I think that overall,
           | malloc/free/new/delete are already so hard to use that if you
           | added more stuff, it would create too much cognitive load for
           | the programmer. The result would be that the hints the
           | programmer gave you would be more wrong than the malloc's
           | best guess._
           | 
           | I think this is a generally applicable lesson. Programmers
           | overestimate their ability to statically optimize the dynamic
           | behavior of their code, _especially_ relative to a runtime
           | that _can_ access and act on dynamic state.
           | 
           | This comes up in the database world as application developers
           | desiring to "pin" certain tables in memory. Experience has
           | shown this to be an anti-feature: As the schema and workload
           | of a database system evolves, these pinning decisions can
           | become obsolete, starving more relevant tables of buffer
           | space. Dynamic decisions, even made by simple heuristics like
           | LRU, are an empirically winning strategy.
           | 
           | The failure of Itanium/EPIC and the "sufficiently smart
           | compiler" to beat traditional OoO execution is another
           | example from a different computing domain.
        
       | akrymski wrote:
       | Now if only I could _limit_ the amount of memory a browser would
       | consume! The fact that desktop browsers grind a 16GB machine to a
       | halt, while on mobile devices having 1000s of tabs is a non-issue
        | is plain wrong. Since I can't limit a process' memory on Mac OS
       | (any OS?), why not give me a setting in preferences to do so?
       | After all the browser is really just a VM at this stage. I'm
       | quite certain I don't need 100 tabs in active memory at once.
       | 
       | While at it, why can't we hibernate tabs? Write the JSC heap and
       | DOM state to disk, the way VMs do it?
       | 
       | What's taking up the majority of my memory anyway - the graphics
       | / gpu side or the JS VM?
       | 
       | There's a whole spectrum of startups trying to solve the webkit
       | memory consumption issues crudely, such as
       | https://www.mightyapp.com
        
         | pizlonator wrote:
         | I think that desktop browser users expect that their tabs
         | generally won't get reloaded. On mobile, users have generally
         | accepted that their tabs will get reloaded. Maybe browsers
         | should just reload tabs if anything fishy happens or you're not
         | using them.
         | 
         | Hibernating tabs to disk is hard because the JS heap (and the
         | DOM heap) will contain lots of stuff that is tied to handles
         | you got from the OS. It's not impossible. But it would require
         | a lot of work to build specialized serialization functions for
         | stuff that has native state attached to it that the browser
         | doesn't control or have good visibility into.
         | 
         | I think that some websites use a lot of graphics memory while
         | other websites use a lot of JS memory. And there are other
         | categories of memory usage, too.
         | 
         | Maybe if they put the cloud browsers on the blockchain using
         | machine learning, and then write them in Rust and compile them
         | to wasm, then it will be even better.
        
           | akrymski wrote:
           | > I think that desktop browser users expect that their tabs
           | generally won't get reloaded
           | 
           | But they do get reloaded. There's absolutely no guarantee
           | provided, and browser tabs will get swapped out of memory as
           | necessary (eg opening lots of new tabs or apps).
           | 
           | This is easily solvable though: let me pin tabs that I really
           | need to persist.
        
             | pizlonator wrote:
             | Sure they do, but much less aggressively than mobile
             | browsers. I think that desktop Safari will try to keep your
             | tab live so long as nothing fishy happens. Based on my
             | experiences using Chrome, it seems to do the same thing.
        
         | tptacek wrote:
         | This is a thread about an allocator design, not really about
         | Webkit.
        
           | akrymski wrote:
           | It's also a thread about memory allocation in Webkit tho.
        
       | jibcage wrote:
       | Fun fact: I believe PAS stood for "Phil's awesome shit." Whether
       | that's subject to NDA is anyone's guess :D
        
         | pizlonator wrote:
         | I can neither confirm nor deny this awesome conjecture.
        
       ___________________________________________________________________
       (page generated 2022-06-01 23:00 UTC)