[HN Gopher] Rust std fs slower than Python? No, it's hardware
       ___________________________________________________________________
        
       Rust std fs slower than Python? No, it's hardware
        
       Author : Pop_-
       Score  : 555 points
       Date   : 2023-11-29 09:18 UTC (13 hours ago)
        
 (HTM) web link (xuanwo.io)
 (TXT) w3m dump (xuanwo.io)
        
       | royjacobs wrote:
       | I was prepared to read the article and scoff at the author's
       | misuse of std::fs. However, the article is a delightful
       | succession of rabbit holes and mysteries. Well written and very
       | interesting!
        
         | bri3d wrote:
         | This was such a good article! The debugging was smart (writing
         | test programs to peel each layer off), the conclusion was
         | fascinating and unexpected, and the writing was clear and easy
         | to follow.
        
       | sgift wrote:
       | Either the author changed the headline to something less
       | clickbaity in the meantime or you edited it for clickbait Pop_-
       | (in that case: shame on you) - current headline: "Rust std fs
       | slower than Python!? No, it's hardware!"
        
         | epage wrote:
         | Based on the /r/rust thread, the author seemed to change the
         | headline based on feedback to make it less clickbait-y
        
         | xuanwo wrote:
          | Sorry for the clickbaity title. I have changed it based on
          | others' advice.
        
           | thechao wrote:
           | I disagree that it's clickbait-y. Diving down from Python
           | bindings to ucode is ... not how things usually go. Doubly
           | so, since Python is a very mature runtime, and I'd be
           | inclined to believe they've dug up file-reading Kung Fu not
           | available to the Average Joe.
        
         | Pop_- wrote:
          | The author has updated the title and also contacted me, but
          | unfortunately I'm no longer able to update it here.
        
       | Pesthuf wrote:
       | Clickbait headline, but the article is great!
        
         | joshfee wrote:
         | Surprisingly I think this usage of clickbait is totally
         | reasonable because it matches the author's initial
         | thoughts/experiences of "what?! this can't be right..."
        
         | saghm wrote:
         | I think there might be a range of where people draw the line
         | between reasonable headlines and clickbait, because I tend to
         | think of clickbait as something where the "answer" to some
         | question is intentionally left out to try to bait people into
         | clicking. For this article, something I'd consider clickbait
         | would be something like "Rust std fs is slower than Python?"
         | without the answer after. More commonly, the headline isn't
         | phrased directly as a question, but instead of saying something
         | like "So-and-so musician loves burritos", it will leave out the
         | main detail and say something like "The meal so-and-so eats
         | before every concert", which is trying to get you to click and
         | have to read through lots of extraneous prose just to find the
         | word "burritos".
         | 
         | Having a hook to get people to want to read the article is
         | reasonable in my opinion; after all, if you could fit every
         | detail in the size of a headline, you wouldn't need an article
          | at all! Clickbait inverts this by _only_ having enough
          | substance that you could get all the info in the headline, but
         | instead it leaves out the one detail that's interesting and
         | then pads it with fluff that you're forced to click and read
         | through if you want the answer.
        
       | iampims wrote:
       | Most interesting article I've read this week. Excellent write-up.
        
       | Pop_- wrote:
       | Disclaimer: The title has been changed to "Rust std fs slower
        | than Python!? No, it's hardware!" to avoid clickbait. However,
        | I'm not able to fix the title on HN.
        
         | 3cats-in-a-coat wrote:
         | What's the TLDR on how... hardware performs differently on two
         | software runtimes?
        
           | lynndotpy wrote:
           | One of the very first things in the article is a TLDR section
           | that points you to the conclusion.
           | 
           | > In conclusion, the issue isn't software-related. Python
           | outperforms C/Rust due to an AMD CPU bug.
        
             | j16sdiz wrote:
              | It _is_ software-related. It's just that the CPU performs
              | badly on some _software_ instruction.
        
               | xuanwo wrote:
               | FSRM is a CPU feature embedded in the microcode (in this
               | instance, amd-ucode) that software such as glibc cannot
               | interact with. I refer to it as hardware because I
               | consider microcode a part of the hardware.
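                | 
                | (A quick way to check for the feature on Linux, as a
                | small sketch of my own: FSRM shows up as the "fsrm"
                | flag in /proc/cpuinfo.)
                | 
                |     use std::fs;
                |     
                |     fn main() {
                |         let info = fs::read_to_string("/proc/cpuinfo")
                |             .expect("Linux only");
                |         let fsrm = info.lines()
                |             .filter(|l| l.starts_with("flags"))
                |             .any(|l| l.split_whitespace()
                |                       .any(|f| f == "fsrm"));
                |         println!("fsrm: {}", fsrm);
                |     }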
        
           | pornel wrote:
           | AMD's implementation of `rep movsb` instruction is
           | surprisingly slow when addresses are page aligned. Python's
           | allocator happens to add a 16-byte offset that avoids the
           | hardware quirk/bug.
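            | 
            | A minimal sketch of how one might observe that quirk (my
            | own illustration, not the article's benchmark; results
            | depend on the CPU, the glibc version, and whether memcpy
            | uses rep movsb at all):
            | 
            |     use std::alloc::{alloc, dealloc, Layout};
            |     use std::time::Instant;
            |     
            |     fn main() {
            |         const SIZE: usize = 64 * 1024 * 1024;
            |         // Page-aligned buffers, so offset 0 means
            |         // page-aligned copies.
            |         let layout =
            |             Layout::from_size_align(SIZE + 64, 4096).unwrap();
            |         let (src, dst) =
            |             unsafe { (alloc(layout), alloc(layout)) };
            |         assert!(!src.is_null() && !dst.is_null());
            |         for off in [0usize, 16] {
            |             let t = Instant::now();
            |             for _ in 0..50 {
            |                 // Large copies lower to libc memcpy.
            |                 unsafe {
            |                     std::ptr::copy_nonoverlapping(
            |                         src.add(off), dst.add(off), SIZE);
            |                 }
            |             }
            |             println!("offset {:>2}: {:?}", off, t.elapsed());
            |         }
            |         unsafe { dealloc(src, layout); dealloc(dst, layout); }
            |     }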
        
             | sound1 wrote:
             | thank you, upvoted!
        
         | sharperguy wrote:
         | "Works on contingency? No, money down!"
        
         | pvg wrote:
         | you can mail hn@ycombinator.com and they can change it for you
         | to whatever.
        
       | quietbritishjim wrote:
       | I'm a bit confused about the premise. This is not comparing pure
       | Python code against some native (C or Rust) code. It's comparing
       | one Python wrapper around native code (Python's file read method)
       | against another Python wrapper around some native code (OpenDAL).
       | OK it's still interesting that there's a difference in
       | performance, but it's very odd to describe it as "slower than
       | Python". Did they expect that the Python standard library is all
       | written in pure Python? On the contrary, I would expect the
       | implementations of functions in Python's standard library to be
       | native and, individually, highly optimised.
       | 
       | I'm not surprised the conclusion had something to do with the way
       | that native code works. Admittedly I was surprised at the
       | specific answer - still a very interesting article despite the
       | confusing start.
       | 
       | Edit: The conclusion also took me a couple of attempts to parse.
       | There's a heading "C is slower than Python with specified
       | offset". To me, as a native English speaker, this reads as "C is
       | slower (than Python) with specified offset" i.e. it sounds like
       | they took the C code, specified the same offset as Python, and
       | then it's still slower than Python. But it's the opposite: once
       | the offset from Python was also specified in the C code, the C
       | code was then faster. Still very interesting once I got what they
       | were saying though.
        
         | xuanwo wrote:
         | Thanks for the comments. I have fixed the headers :)
        
         | crabbone wrote:
         | > individually, highly optimised.
         | 
         | Now why would you expect _that_?
         | 
          | What happened to OP is pure chance. CPython's C code doesn't
         | even care about const-consistency. It's flush with dynamic
         | memory allocations, bunch of helper / convenience calls... Even
         | stuff like arithmetic does dynamic memory allocation...
         | 
         | Normally, you don't expect CPython to perform well, not if you
         | have any experience working with it. Whenever you want to
         | improve performance you want to sidestep all the functionality
         | available there.
         | 
         | Also, while Python doesn't have a standard library, since it
         | doesn't have a standard... the library that's distributed with
         | it is _mostly_ written in Python. Of course, some of it comes
          | written in C, but there's also a sizable fraction of that C
         | code that's essentially Python code translated mechanically
         | into C (a good example of this is Python's binary search
         | implementation which was originally written in Python, and
         | later translated into C using Python's C API).
         | 
         | What one would expect is that functionality that is simple to
         | map to operating system functionality has a relatively thin
         | wrapper. I.e. reading files wouldn't require much in terms of
         | binding code because, essentially, it goes straight into the
         | system interface.
        
           | codr7 wrote:
           | Have you ever attempted to write a scripting language that
           | performs better?
           | 
           | I have, several, and it's far from trivial.
           | 
            | The basics are seriously optimized for typical use cases;
            | take a look at the source code for the dict type.
        
             | svieira wrote:
             | Raymond Hettinger's talk _Modern Python Dictionaries: A
             | confluence of a dozen great ideas_ is an awesome  "history
             | of how we got these optimizations" and a walk through why
             | they are so effective -
             | https://www.youtube.com/watch?v=npw4s1QTmPg
        
               | codr7 wrote:
                | Yeah, I had a nice chat with Raymond Hettinger at a PyCon
                | in Birmingham/UK back in the day (had no idea who he was
               | at the time). He seemed like a dedicated and intelligent
               | person, I'm sure we can thank him for some of that.
        
             | crabbone wrote:
             | > Have you ever attempted to write a scripting language
             | that performs better?
             | 
             | No, because "scripting language" is not a thing.
             | 
             | But, if we are talking about implementing languages, then I
             | worked with many language implementations. The most
             | comparable one that I know fairly well, inside-and-out
             | would be the AVM, i.e. the ActionScript Virtual Machine.
             | It's not well-written either unfortunately.
             | 
             | I've looked at implementations of Lua, Emacs Lisp and
             | Erlang at different times and to various degree. I'm also
             | somewhat familiar with SBCL and ECL, the implementation
             | side. There are different things the authors looked for in
             | these implementations. For example, SBCL emphasizes
             | performance, where ECL emphasizes simplicity and interop
             | with C.
             | 
             | If I had to grade language implementations I've seen,
             | Erlang would absolutely take the cake. It's a very
              | thoughtful and disciplined program whose authors went to
              | great lengths to design and implement it. CPython is on the
             | lower end of such programs. It's anarchic, very unevenly
             | implemented, you run into comments testifying to the author
             | not knowing what they are doing, what their predecessor
              | did, nor what to do next. Sometimes the code is written
              | from that perspective as well: if the author somehow
              | manages to drive themselves into a corner where they don't
              | know what the reference count is anymore, they just hammer
              | it and hope all references are dead (well, maybe).
             | 
             | It's the code style that, unfortunately, I associate with
             | proprietary projects where deadlines and cost dictate the
             | quality, where concurrency problems are solved with sleeps,
             | and if that doesn't work, then the sleep delay is doubled.
             | It's not because I specifically hate code being
             | proprietary, but because I meet that kind of code in my day
             | job more than I meet it in hobby open-source projects.
             | 
             | > take a look at the source code for the dict type.
             | 
             | I wrote a Protobuf parser in C with the intention of
             | exposing its bindings to Python. Dictionaries were a
             | natural choice for the hash-map Protobuf elements. I
             | benchmarked my implementation against C++ (Google's)
             | implementation only to discover that std::map wins against
             | Python's dictionary by a landslide.
             | 
             | Maybe Python's dict isn't as bad as most of the rest of the
             | interpreter, but being the best of the worst still doesn't
             | make it good.
        
               | codr7 wrote:
                | Except it is, because everyone sort of knows what it
                | means: an interpreted language that prioritizes
                | convenience over performance;
                | Perl/Python/Ruby/Lua/PHP/etc.
               | 
               | SBCL is definitely a different beast.
               | 
               | I would expect Emacs Lisp & Lua to be more similar.
               | 
               | Erlang had plenty more funding and stricter requirements.
               | 
               | C++'s std::map has most likely gotten even more attention
               | than Python's dict, but I'm not sure from your comment if
               | you're including Python's VM dispatch in that comparison.
               | 
               | What are you trying to prove here?
        
             | wahern wrote:
             | > The basics are seriously optimized for typical use cases,
             | take a look at the source code for the dict type
             | 
             | Python is well micro-optimized, but the broader
             | architecture of the language and especially the CPython
             | implementation did not put much concern into performance,
             | even for a dynamically typed scripting language. For
             | example, in CPython values of built-in types are still
             | allocated as regular objects and passed by reference; this
             | is atrocious for performance and no amount of micro
             | optimization will suffice to completely bridge the
             | performance gap for tasks which stress this aspect of
             | CPython. By contrast, primitive types in Lua (including PUC
             | Lua, the reference, non-JIT implementation) and JavaScript
             | are passed around internally as scalar values, and the
             | languages were designed with this in mind.
             | 
             | Perl is similar to Python in this regard--the language
             | constructs and type systems weren't designed for high
             | primitive operation throughput. Rather, performance
             | considerations were focused on higher level, functional
             | tasks. For example, Perl string objects were designed to
             | support fast concatenation and copy-on-write references,
             | optimizations which pay huge dividends for the tasks for
             | which Perl became popular. Perl can often seem ridiculously
             | fast for naive string munging compared to even compiled
             | languages, yet few people care to defend Perl as a
             | performant language per se.
        
         | qd011 wrote:
         | I don't understand why Python gets shit for being a slow
         | language when it's slow but no credit for being fast when it's
         | fast just because "it's not really Python".
         | 
         | If I write Python and my code is fast, to me that sounds like
         | Python is fast, I couldn't care less whether it's because the
         | implementation is in another language or for some other reason.
        
           | paulddraper wrote:
           | Yeah, it's weird.
        
           | afdbcreid wrote:
            | Usually, yes, but when it's a bug in the hardware, it's not
            | really that Python is fast; it's more that the CPython
            | developers were lucky enough not to hit the bug.
        
             | munch117 wrote:
             | How do you know that it's luck?
        
               | cozzyd wrote:
               | Because the offset is entirely due to space for the
               | PyObject header.
        
               | munch117 wrote:
               | The PyObject header is a target for optimisation.
               | Performance regressions are likely to be noticed, and if
               | a different header layout is faster, then it's entirely
               | possible that it will be used for purely empirical
               | reasons. Trying different options and picking the best
               | performing one is not luck, even if you can't explain why
               | it's the best performing.
        
               | cozzyd wrote:
               | I suspect any size other than 0 would lead to this.
               | 
               | But the Zen3/4 were developed far, far after the PyObject
               | header...
        
               | adgjlsfhk1 wrote:
               | because the offset here is a result of python's reference
               | counting which dates ~20 years before zen3
        
           | benrutter wrote:
            | I wonder if it's because we're sometimes talking at cross
            | purposes.
           | 
           | For me, coding is almost exclusively using python libraries
           | like numpy to call out to other languages like c or FORTRAN.
           | It feels silly to say I'm not coding in Python to me.
           | 
           | On the other hand, if you're writing those libraries, coding
           | to you is mostly writing FORTRAN and c optimizations. It
           | probably feels silly to say you're coding in Python just
           | because that's where your code is called from.
        
           | kbenson wrote:
           | Because for any nontrivial case you would expect
           | python+compiled library and associated marshaling of data to
           | be slower than that library in its native implementation
            | without any interop/marshaling required.
           | 
           | When you see an interpreted language faster than a compiled
            | one, it's worth looking at why, because _most_ of the time
            | it's because there's some hidden issue causing the other to be
           | slow (which could just be a different and much worse
           | implementation).
           | 
           | Put another way, you can do a lot to make a Honda Civic very
           | fast, but when you hear one goes up against a Ferrari and
           | wins your first thoughts should be about what the test was,
           | how the Civic was modified, and if the Ferrari had problems
           | or the test wasn't to its strengths at all. If you just think
           | "yeah, I love Civics, that's awesome" then you're not
           | thinking critically enough about it.
        
             | Attummm wrote:
             | In this case, Python's code (opening and loading the
             | content of a file) operates almost fully within its C
             | runtime.
             | 
             | The C components initiate the system call and manage the
             | file pointer, which loads the data from the disk into a
             | pyobj string.
             | 
              | Therefore, it isn't so much Python itself that is being
              | tested, but rather Python's underlying C runtime.
        
               | kbenson wrote:
               | Yep, and the next logical question when both
               | implementations are for the most part bare metal
               | (compiled and low-level), is why is there a large
               | difference? Is it a matter of implementation/algorithm,
               | inefficiency, or a bug somewhere? In this case, that
               | search turned up a hardware issue that should be
               | addressed, which is why it's so useful to examine these
               | things.
        
           | rafaelmn wrote:
            | But you will care if that "python" breaks - you get to drop
            | down to C/C++ and debug native code. Likewise for adding
           | features or understanding the implementation. Not to mention
           | having to deal with native build tooling and platform
           | specific stuff.
           | 
           | It's completely fair to say that's not python because it
           | isn't - any language out there can FFI to C and it has the
           | same problems mentioned above.
        
           | IshKebab wrote:
           | Because when people talk about Python performance they're
           | talking about the performance of Python code itself, not
           | C/Rust code that it's wrapping.
           | 
           | Pretty much any language can wrap C/Rust code.
           | 
           | Why does it matter?
           | 
           | 1. Having to split your code across 2 languages via FFI is a
           | huge pain.
           | 
            | 2. You are still writing _some_ Python. There's plenty of
            | code that is pure Python. That code is slow.
        
             | munch117 wrote:
              | Of course in this case there's no FFI involved - the _open_
              | function is built-in. It's as pure-Python as it can get.
        
               | IshKebab wrote:
               | Not sure I agree there, but anyway in this case the
               | performance had nothing to do with Python being a slow or
               | fast language.
        
           | insanitybit wrote:
           | >I don't understand why Python gets shit for being a slow
           | language when it's slow but no credit for being fast when
           | it's fast just because "it's not really Python".
           | 
           | What's there to understand? When it's fast it's not really
           | Python, it's C. C is fast. Python can call out to C. You
           | don't have to care that the implementation is in another
           | language, but it is.
        
         | fl0ki wrote:
         | The premise is that any time you say "Python [...] faster than
         | Rust [...]" you get page views even if it's not true. People
         | have noticed after the last few dozen times something like this
         | was posted.
        
         | lambda wrote:
         | I'm a bit confused by why you are confused.
         | 
          | It's surprising that something as simple as reading a file is
          | slower in the Rust standard library than in the Python standard
          | library. Even knowing that a Python standard library call like
         | this is written in C, you'd still expect the Rust standard
         | library call to be of a similar speed; so you'd expect either
         | that you're using it wrong, or that the Rust standard library
         | has some weird behavior.
         | 
          | In this case, it turns out that neither was the case; there's
         | just a weird hardware performance cliff based on the exact
         | alignment of an allocation on particular hardware.
         | 
         | So, yeah, I'd expect a filesystem read to be pretty well
         | optimized in Python, but I'd expect the same in Rust, so it's
         | surprising that the latter was so much slower, and especially
         | surprising that it turned out to be hardware and allocator
         | dependent.
        
       | drtgh wrote:
       | >Rust std fs slower than Python!? No, it's hardware!
       | 
       | >...
       | 
       | >Python features three memory domains, each representing
       | different allocation strategies and optimized for various
       | purposes.
       | 
       | >...
       | 
       | >Rust is slower than Python only on my machine.
       | 
       | if one library performs wildly better than the other in the same
       | test, on the same hardware, how can that not be a software-
       | related problem? sounds like a contradiction.
       | 
        | Maybe it should be considered a coding issue and/or a missing
        | feature? IMHO it would be expected that Rust's std library
        | perform well without making all its users circumvent the issue
        | manually.
        | 
        | The article is well investigated, so I assume the author just
        | wanted to show that the problem exists without creating
        | controversy, because otherwise I cannot understand it.
        
         | Pop_- wrote:
          | The root cause is AMD's bad support for rep movsb (which is a
          | hardware problem). However, Python's buffers by default have a
          | small offset, while lower-level languages (Rust and C) read
          | into unoffset buffers, which is why Python seems to perform
          | better than C/Rust. It "accidentally" avoided the hardware
          | problem.
        
           | CoastalCoder wrote:
           | I'm not sure it makes sense to pin this only on AMD.
           | 
           | Whenever you're writing performance-critical software, you
           | need to consider the relevant combinations of hardware +
           | software + workload + configuration.
           | 
           | Sometimes a problem can be created or fixed by adjusting any
           | one / some subset of those details.
        
             | hobofan wrote:
             | If that's a bug that only happens with AMD CPUs, I think
             | that's totally fair.
             | 
             | If we start adding in exceptions at the top of the software
              | stack for individual failures of specific CPUs/vendors,
             | that seems like a strong regression from where we are today
             | in terms of ergonomics of writing performance-critical
             | software. We can't be writing individual code for each N x
             | M x O x P combination of hardware + software + workload +
             | configuration (even if you can narrow down the "relevant"
             | ones).
        
               | jpc0 wrote:
               | > We can't be writing individual code for each N x M x O
               | x P combination of hardware + software + workload +
               | configuration
               | 
               | That is kind of exactly what you would do when optimising
               | for popular platforms.
               | 
                | If this error occurs on an AMD CPU used by half your
                | users, is your response to your users going to be "just
                | buy a different CPU", or are you going to fix it in code
                | and ship a "performance improvement on XYZ platform"
                | update?
        
               | jacoblambda wrote:
               | Nobody said "just buy a different CPU" anywhere in this
               | discussion or the article. And they are pinning the root
               | cause on AMD which is completely fair because they are
               | the source of the issue.
               | 
               | Given that the fix is within the memory allocator, there
               | is already a relatively trivial fix for users who really
               | need it (recompile with jemalloc as the global memory
               | allocator).
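                | 
                | A minimal sketch of that swap (assuming the
                | tikv-jemallocator crate; other jemalloc bindings exist):
                | 
                |     // Cargo.toml: tikv-jemallocator = "0.5"
                |     use tikv_jemallocator::Jemalloc;
                |     
                |     // Route every heap allocation in this binary
                |     // through jemalloc instead of the system malloc.
                |     #[global_allocator]
                |     static GLOBAL: Jemalloc = Jemalloc;
                |     
                |     fn main() {
                |         let buf = std::fs::read("/etc/hostname").unwrap();
                |         println!("{} bytes", buf.len());
                |     }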
               | 
               | For everyone else, it's probably better to wait until AMD
               | reports back with an analysis from their side and either
               | recommends an "official" mitigation or pushes out a
               | microcode update.
        
               | ansible wrote:
               | The fix is that AMD needs to develop, test and deploy a
               | microcode update for their affected CPUs, and then the
               | problem is truly fixed for everyone, not just the people
               | who have detected the issue and tried to mitigate it.
        
               | richardwhiuk wrote:
               | You are going to be disappointed when you find out
               | there's lots of architecture and CPU specific code in
               | software libraries and the kernel.
        
               | pmontra wrote:
                | Well, if Excel ran at half the speed (or half the speed
                | of LibreOffice Calc!) on half of the machines around
                | here, somebody at Redmond would notice, find the hardware
                | bug and work around it.
               | 
               | I guess that in most big companies it suffices that there
               | is a problem with their own software running on the
               | laptop of a C* manager or of somebody close to there.
               | When I was working for a mobile operator the antennas the
               | network division cared about most were the ones close to
               | the home of the CEO. If he could make his test calls with
               | no problems they had the time to fix the problems of the
               | rest of the network in all the country.
        
             | Pop_- wrote:
              | It's a known issue for AMD, confirmed by multiple people
              | and by the data provided by the author. It's fair to pin
              | this problem on AMD.
        
           | formerly_proven wrote:
           | That extra 0x20 (32 byte) offset is the size of the PyBytes
           | object header for anyone wondering; 64 bits each for type
           | object pointer, reference count, base pointer and item count.
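            | 
            | Roughly, as an illustrative repr(C) sketch (the exact field
            | set varies by CPython version; this is not CPython's
            | literal definition):
            | 
            |     #[repr(C)]
            |     struct PyBytesHeaderish {
            |         ob_refcnt: isize,   // reference count
            |         ob_type: *const u8, // type object pointer
            |         ob_size: isize,     // item count
            |         ob_shash: isize,    // cached hash
            |     }
            |     
            |     fn main() {
            |         // 4 x 8 bytes = the 0x20 offset at which the
            |         // payload starts, relative to the allocation.
            |         assert_eq!(
            |             std::mem::size_of::<PyBytesHeaderish>(), 32);
            |     }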
        
             | mrweasel wrote:
             | Thank you, because I was wondering if some Python developer
             | found the same issue and decided to just implement the
             | offset. It makes much more sense that it just happens to
             | work out that way in Python.
        
           | meneer_oke wrote:
            | It doesn't just "seem" faster; "seem" would imply that it
            | isn't the case. It is faster, currently, on that setup.
            | 
            | But since the Python runtime is written in C, the issue
            | can't be Python vs C.
        
             | TylerE wrote:
             | C is a very wide target. There are plenty of things that
             | one can do "in C" that no human would ever write. For
             | instance, the C code generated by languages like nim and
             | zig that essentially use C as a sort of IR.
        
               | meneer_oke wrote:
                | That is true. With C, a lot is possible.
               | 
               | > However, python by default has a small offset when
               | reading memories while lower level language (rust and c)
               | 
               | Yet if the runtime is made with C, then that statement is
               | incorrect.
        
               | bilkow wrote:
                | By going through that line of thought, you could also
                | argue that the slow version in C and Rust is actually
                | implemented in C, as memcpy is in glibc. Hence, Python
                | being faster than Rust would also mean in this case that
                | Python is faster than C.
               | 
               | The point is not that one language is faster than
               | another. The point is that the default way to implement
               | something in a language ended up being surprisingly
               | faster when compared to other languages in this specific
               | scenario due to a performance issue in the hardware.
               | 
               | In other words: on this specific hardware, the default
               | way to do this in Python is faster than the default way
               | to do this in C and Rust. That can be true, as Python
               | does not use C in the default way, it adds an offset! You
               | can change your implementation in any of those languages
               | to make it faster, in this case by just adding an offset,
               | so it doesn't mean that "Python is faster than C or Rust
               | in general".
        
             | topaz0 wrote:
             | It's obviously not python vs c -- the time difference turns
             | out to be in kernel code (system call) and not user code at
             | all, and the post explicitly constructs a c program that
             | doesn't have the slowdown by adding a memory offset. It
             | just turns up by default in a comparison of python vs c
             | code because python reads have a memory offset by default
             | (for completely unrelated reasons) and analogous c reads
             | don't by default. In principle you could also construct
             | python code that does see this slowdown, it would just be
             | much less likely to show up at random. So the python vs c
             | comp is a total red herring here, it just happened to be
             | what the author noticed and used as a hook to understand
             | the problem.
        
           | magicalhippo wrote:
           | I recall when Pentium was introduced we were told to avoid
           | rep and write a carefully tuned loop ourselves. To go really
           | fast one could use the FPU to do the loads and stores.
           | 
           | Not too long ago I read in Intel's optimization guidelines
           | that rep was now faster again and should be used.
           | 
            | Seems most of these things need to be benchmarked on the
            | CPU, as they change "all the time". I've sped up plenty of
            | code by just replacing hand-crafted assembly with
            | functionally equivalent high-level code.
           | 
           | Of course so-slow-it's-bad is different, however a runtime-
           | determined implementation choice would avoid that as well.
        
         | mwcampbell wrote:
         | Years ago, Rust's standard library used jemalloc. That decision
         | substantially increased the minimum executable size, though. I
         | didn't publicly complain about it back then (as far as I can
         | recall), but perhaps others did. So the Rust library team
         | switched to using the OS's allocator by default.
         | 
         | Maybe using an alternative allocator only solves the problem by
         | accident and there's another way to solve it intentionally; I
         | don't yet fully understand the problem. My point is that using
         | a different allocator by default was already tried.
        
           | saghm wrote:
           | > I didn't publicly complain about it back then (as far as I
           | can recall), but perhaps others did. So the Rust library team
           | switched to using the OS's allocator by default.
           | 
           | I've honestly never worked in a domain where binary size ever
           | really mattered beyond maybe invoking `strip` on a binary
           | before deploying it, so I try to keep an open mind. That
           | said, this has always been a topic of discussion around
           | Rust[0], and while I obviously don't have anything against
           | binary sizes being smaller, bugs like this do make me wonder
           | about huge changes like switching the default allocator where
           | we can't really test all of the potential side effects; next
           | time, the unintended consequences might not be worth the
           | tradeoff.
           | 
           | [0]: https://hn.algolia.com/?dateRange=all&page=0&prefix=fals
           | e&qu...
        
       | exxos wrote:
       | It's the hardware. Of course Rust remains the fastest and safest
       | language and you must rewrite your applications in Rust.
        
         | dang wrote:
         | You've been posting like this so frequently as to cross into
         | abusing the forum, so I've banned the account.
         | 
         | If you don't want to be banned, you're welcome to email
         | hn@ycombinator.com and give us reason to believe that you'll
         | follow the rules in the future. They're here:
         | https://news.ycombinator.com/newsguidelines.html.
        
       | Aissen wrote:
       | Associated glibc bug (Zen 4 though):
       | https://sourceware.org/bugzilla/show_bug.cgi?id=30994
        
         | Arnavion wrote:
         | The bug is also about Zen 3, and even mentions the 5900X (the
         | article author's CPU).
        
           | nabakin wrote:
           | If you read the bug tracker, a comment mentions this affects
           | Zen 3 and Zen 4
        
         | fweimer wrote:
         | And AMD is investigating: https://inbox.sourceware.org/libc-
         | alpha/20231115190559.29112...
        
       | explodingwaffle wrote:
       | Anyone else feeling the frequency illusion with rep movsb?
       | 
       | (https://lock.cmpxchg8b.com/reptar.html)
        
       | a1o wrote:
       | > Rust developers might consider switching to jemallocator for
       | improved performance
       | 
       | I am curious if this is something that everyone can do to get
       | free performance or if there are caveats. Can C codebases benefit
        | from this too? Is this performance that is simply left on the
        | table currently?
        
         | nicoburns wrote:
         | I think it's pretty much free performance that's being left on
          | the table. There's a slight cost to binary size. And it may not
         | perform better in absolutely all circumstances (but it will in
         | almost all).
         | 
          | Rust used to use jemalloc by default but switched because
          | people found it a surprising default.
        
         | Pop_- wrote:
          | Switching to a non-default allocator does not always bring a
          | performance boost. It really depends on your workload, which
          | requires profiling and benchmarking. But C/C++/Rust and other
          | lower-level languages should all at least be able to choose
          | from these allocators. One caveat is binary size: a custom
          | allocator does add more bytes to the executable.
        
           | vlovich123 wrote:
           | I don't know why people still look to jemalloc. Mimalloc
           | outperforms the standard allocator on nearly every single
           | benchmark. Glibc's allocator & jemalloc both are long in the
           | tooth & don't actually perform as well as state of the art
           | allocators. I wish Rust would switch to mimalloc or the
           | latest tcmalloc (not the one in gperftools).
        
             | masklinn wrote:
             | > I wish Rust would switch to mimalloc or the latest
             | tcmalloc (not the one in gperftools).
             | 
             | That's nonsensical. Rust uses the system allocators for
             | reliability, compatibility, binary bloat, maintenance
             | burden, ..., not because they're _good_ (they were not when
              | Rust switched away from jemalloc, and they aren't now).
             | 
             | If you want to use mimalloc in your rust programs, you can
             | just set it as global allocator same as jemalloc, that
             | takes all of three lines:
             | https://github.com/purpleprotocol/mimalloc_rust#usage
             | 
              | If you want the rust compiler to link against mimalloc
             | rather than jemalloc, feel free to test it out and open an
             | issue, but maybe take a gander at the previous attempt:
             | https://github.com/rust-lang/rust/pull/103944 which died
             | for the exact same reason the the one before that
             | (https://github.com/rust-lang/rust/pull/92249) did:
             | unacceptable regression of max-rss.
        
               | vlovich123 wrote:
               | I know it's easy to change but the arguments for using
               | glibc's allocator are less clear to me:
               | 
               | 1. Reliability - how is an alternate allocator less
               | reliable? Seems like a FUD-based argument. Unless by
               | reliability you mean performance in which case yes -
               | jemalloc isn't reliably faster than standard allocators,
               | but mimalloc is.
               | 
               | 2. Compatibility - again sounds like a FUD argument. How
               | is compatibility reduced by swapping out the allocator?
               | You don't even have to do it on all systems if you want.
               | Glibc is just unequivocally bad.
               | 
               | 3. Binary bloat - This one is maybe an OK argument
               | although I don't know what size difference we're talking
               | about for mimalloc. Also, most people aren't writing
               | hello world applications so the default should probably
               | be for a good allocator. I'd also note that having a
               | dependency of the std runtime on glibc in the first place
               | likely bloats your binary more than the specific
               | allocator selected.
               | 
               | 4. Maintenance burden - I don't really buy this argument.
               | In both cases you're relying on a 3rd party to maintain
               | the code.
        
               | masklinn wrote:
               | > I know it's easy to change but the arguments for using
               | glibc's allocator are less clear to me:
               | 
               | You can find them at the original motivation for removing
               | jemalloc, 7 years ago: https://github.com/rust-
               | lang/rust/issues/36963
               | 
               | Also it's not "glibc's allocator", it's the system
               | allocator. If you're unhappy with glibc's, get that
               | replaced.
               | 
               | > 1. Reliability - how is an alternate allocator less
               | reliable?
               | 
               | Jemalloc had to be disabled on various platforms and
               | architectures, there is no reason to think mimalloc or
               | tcmalloc are any different.
               | 
               | The system allocator, while shit, is always there and
               | functional, the project does not have to curate its
               | availability across platforms.
               | 
               | > 2. Compatibility - again sounds like a FUD argument.
               | How is compatibility reduced by swapping out the
               | allocator?
               | 
               | It makes interactions with anything which _does_ use the
               | system allocator worse, and almost certainly fails to
               | interact correctly with some of the more specialised
               | system facilities (e.g. malloc.conf) or tooling (in rust,
               | jemalloc as shipped did not work with valgrind).
               | 
               | > Also, most people aren't writing hello world
               | applications
               | 
               | Most people aren't writing applications bound on
               | allocation throughput either
               | 
               | > so the default should probably be for a good allocator.
               | 
               | Probably not, no.
               | 
               | > I'd also note that having a dependency of the std
               | runtime on glibc in the first place likely bloats your
               | binary more than the specific allocator selected.
               | 
               | That makes no sense whatsoever. The libc is the system's
               | and dynamically linked. And changing allocator does not
               | magically unlink it.
               | 
               | > 4. Maintenance burden - I don't really buy this
               | argument.
               | 
               | It doesn't matter that you don't buy it. Having to ship,
               | resync, debug, and curate (cf (1)) an allocator is a
               | maintenance burden. With a system allocator, all the
               | project does is ensure it calls the system allocators
               | correctly, the rest is out of its purview.
        
               | vlovich123 wrote:
               | The reason the reliability & compatibility arguments
               | don't make sense to me is that jemalloc is still in use
               | for rustc (again - not sure why they haven't switched to
               | mimalloc) which has all the same platform requirements as
               | the standard library. There's also no reason an alternate
               | allocator can't be used on Linux specifically because
               | glibc's allocator is just bad full stop.
               | 
               | > It makes interactions with anything which does use the
               | system allocator worse
               | 
               | That's a really niche argument. Most people are not doing
               | any of that and malloc.conf is only for people who are
               | tuning the glibc allocator which is a silly thing to do
               | when mimalloc will outperform whatever tuning you do (yes
               | - glibc really is that bad).
               | 
               | > or tooling (in rust, jemalloc as shipped did not work
               | with valgrind)
               | 
               | That's a fair argument, but it's not an unsolvable one.
               | 
               | > Most people aren't writing applications bound on
               | allocation throughput either
               | 
               | You'd be surprised at how big an impact the allocator can
               | make even when you don't think you're bound on
               | allocations. There's also all sorts of other things
               | beyond allocation throughput & glibc sucks at all of them
               | (e.g. freeing memory, behavior in multithreaded programs,
               | fragmentation etc etc).
               | 
               | > The libc is the system's and dynamically linked. And
               | changing allocator does not magically unlink it
               | 
               | I meant that the dependency on libc at all in the
               | standard library bloats the size of a statically linked
               | executable.
        
               | josephg wrote:
               | > jemalloc is still in use for rustc (again - not sure
               | why they haven't switched to mimalloc)
               | 
               | Performance of rustc matters a lot! If the rust compiler
               | runs faster when using mimalloc, please benchmark &
               | submit a patch to the compiler.
        
               | vlovich123 wrote:
               | Any links to instructions on how to run said benchmarks?
        
               | masklinn wrote:
               | I literally linked two attempts to use mimalloc in rustc
               | just a few comments upthread.
        
           | charcircuit wrote:
            | I've never not gotten increased performance by swapping out
            | the allocator.
        
         | nh2 wrote:
         | Be aware `jemalloc` will make you suffer the observability
         | issues of `MADV_FREE`. `htop` will no longer show the truth
         | about how much memory is in use.
         | 
         | *
         | https://github.com/jemalloc/jemalloc/issues/387#issuecomment...
         | 
         | * https://gitlab.haskell.org/ghc/ghc/-/issues/17411
         | 
         | Apparently now `jemalloc` will call `MADV_DONTNEED` 10 seconds
         | after `MADV_FREE`:
         | https://github.com/JuliaLang/julia/issues/51086#issuecomment...
         | 
         | So while this "fixes" the issue, it'll introduce a confusing
         | time delay between you freeing the memory and you observing
         | that in `htop`.
         | 
         | But according to https://jemalloc.net/jemalloc.3.html you can
         | set `opt.muzzy_decay_ms = 0` to remove the delay.
         | 
         | Still, the musl author has some reservations against making
         | `jemalloc` the default:
         | 
         | https://www.openwall.com/lists/musl/2018/04/23/2
         | 
         | > It's got serious bloat problems, problems with undermining
         | ASLR, and is optimized pretty much only for being as fast as
         | possible without caring how much memory you use.
         | 
         | With the above-mentioned tunables, this should be mitigated to
         | some extent, but the general "theme" (focusing on e.g.
         | performance vs memory usage) will likely still mean "it's a
         | tradeoff" or "it's no tradeoff, but only if you set tunables to
         | what you need".
        
           | a1o wrote:
           | Thank you! That was very thorough! I will be reading the
           | links. :)
        
           | singron wrote:
           | Note that glibc has a similar problem in multithreaded
           | contexts. It strands unused memory in thread-local pools,
           | which grows your memory usage over time like a memory leak.
           | We got lower memory usage that didn't grow over time by
           | switching to jemalloc.
           | 
           | Example of this:
           | https://github.com/prestodb/presto/issues/8993
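            | 
            | (A common mitigation if you stay on glibc is to cap the
            | arena count, e.g. by running the process with
            | MALLOC_ARENA_MAX=2 set in its environment.)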
        
           | masklinn wrote:
           | The musl remark is funny, because jemalloc's use of pretty
           | fine-grained arenas sometimes leads to better memory
           | utilisation through reduced fragmentation. For instance
           | Aerospike couldn't fit in available memory under (admittedly
           | old) glibc, and jemalloc fixed the issue:
           | http://highscalability.com/blog/2015/3/17/in-memory-
           | computin...
           | 
           | And this is not a one-off: https://hackernoon.com/reducing-
           | rails-memory-use-on-amazon-l...
           | https://engineering.linkedin.com/blog/2021/taming-memory-
           | fra...
           | 
           | jemalloc also has extensive observability / debugging
           | capabilities, which can provide a useful global view of the
           | system, it's been used to debug memleaks in JNI-bridge code:
           | https://www.evanjones.ca/java-native-leak-bug.html
           | https://technology.blog.gov.uk/2015/12/11/using-jemalloc-
           | to-...
        
           | dralley wrote:
           | glibc isn't totally free of such issues
           | https://www.algolia.com/blog/engineering/when-allocators-
           | are...
        
           | the8472 wrote:
           | Aiming to please people who panic about their RSS numbers
           | seems... misguided? It seems like worrying about RAM being
           | "used" as file cache[0].
           | 
           | If you want to gauge whether your system is memory-limited
           | look at the PSI metrics instead.
           | 
           | [0] https://www.linuxatemyram.com/
        
         | TillE wrote:
         | jemalloc and mimalloc are very popular in C and C++ software,
         | yes. There are few drawbacks, and it's really easy to benchmark
          | different allocators against each other in your particular use
         | case.
        
         | kragen wrote:
         | basically that's why jason wrote it in the first place, but
         | other allocators have caught up since then to some extent. so
         | jemalloc might make your c either slower or faster, you'll have
         | to test to know. it's pretty reliable at being close to the
         | best choice
         | 
         | does tend to use more ram tho
        
         | secondcoming wrote:
         | You can override the allocator for any app via LD_PRELOAD
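          | For example, something like
          | LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./myapp
          | (the library name and path vary by distro and allocator).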
        
       | fsniper wrote:
       | The article itself is a great read and it has fascinating info
       | related to this issue.
       | 
        | However I am more interested/concerned about another part: how
        | the issue is reported/recorded and how the communications are
        | handled.
       | 
        | Reporting is done over Discord, a proprietary environment which
        | is not indexed or searchable, and will not be archived.
       | 
        | Communications and deliberations are done over Discord and
        | Telegram, which is probably worse than Discord in this context.
       | 
        | This blog post and the GitHub repository are the lingering
        | remains of them. If Xuanwo had not blogged this, it would be
        | lost to the timeline.
       | 
       | Isn't this fascinating?
        
       | amluto wrote:
       | I sent this to the right people.
        
       | londons_explore wrote:
       | So the obvious thing to do... Send a patch to change the
       | "copy_user_generic" kernel method to use a different memory
       | copying implementation when the CPU is detected to be a bad one
       | and the memory alignment is one that triggers the slowness bug...
        
         | p3n1s wrote:
          | Not obvious. Seems like if it can be corrected with microcode,
          | we should just have people use updated microcode rather than
          | litter the kernel with fixes for what are effectively
          | patchable software problems.
         | 
          | The accepted fix would not be trivial for anyone not already
          | experienced with the kernel. But more importantly, it isn't
          | obvious what the right way to enable the workaround is. The
          | best way is probably to measure at boot time; otherwise, how
          | do you know which models and steppings are affected?
        
           | londons_explore wrote:
           | I don't think AMD does microcode updates for performance
           | issues do they? I thought it was strictly correctness or
           | security issues.
           | 
           | If the vendor won't patch it, then a workaround is the next
           | best thing. There shouldn't be many - that's why all copying
           | code is in just a handful of functions.
        
             | p3n1s wrote:
             | A significant performance degradation due to normal use of
             | the instruction (FSRM) not otherwise documented is a
             | correctness problem. Especially considering that the
             | workaround is to avoid using the CPU feature in many cases.
              | People pay for this CPU feature; now they need kernel
              | tooling to warn them when they fall back to some slower
              | workaround because of an alignment issue way up the stack.
        
             | prirun wrote:
             | If AMD has a performance issue and doesn't fix it, AMD
             | should pay the negative publicity costs rather than kernel
             | and library authors adding exceptions. IMHO.
        
       | pmontra wrote:
       | > However, mmap has other uses too. It's commonly used to
       | allocate large regions of memory for applications.
       | 
       | Slack is allocating 1132 GB of virtual memory on my laptop right
       | now. I don't know if they are using mmap but that's 1100 GB more
       | than the physical memory.
        
         | Waterluvian wrote:
         | I'm not sure allocations mean anything practical anymore. I
          | recall OSX allocating ridiculous amounts of virtual memory to
          | stuff, but I never found OSX or the software to feel slow and
          | pagey.
        
           | dietrichepp wrote:
           | The way I describe mmap these days is to say it allocates
           | address space. This can sometimes be a clearer way of
           | describing it, since the physical memory will only get
           | allocated once you use the memory (maybe never).
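            | 
            | A minimal sketch of that, assuming the libc crate on Linux
            | (reserve address space first, commit pages later):
            | 
            |     use libc::{mmap, mprotect, MAP_ANONYMOUS, MAP_FAILED,
            |                MAP_NORESERVE, MAP_PRIVATE, PROT_NONE,
            |                PROT_READ, PROT_WRITE};
            |     
            |     fn main() {
            |         let len = 1usize << 40; // 1 TiB of address space
            |         let p = unsafe {
            |             mmap(std::ptr::null_mut(), len, PROT_NONE,
            |                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
            |                  -1, 0)
            |         };
            |         assert_ne!(p, MAP_FAILED);
            |         // VSZ now includes the whole 1 TiB; RSS grows only
            |         // once pages are made accessible and touched:
            |         unsafe {
            |             mprotect(p, 4096, PROT_READ | PROT_WRITE);
            |             *(p as *mut u8) = 1; // first page faults in here
            |         }
            |         println!("reserved 1 TiB at {:p}", p);
            |     }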
        
             | byteknight wrote:
             | But is it not still limited by allocating the RAM +
             | Page/Swap size?
        
               | wbkang wrote:
               | I don't think so, but it's difficult to find an actual
               | reference. For sure it does overcommit like crazy. Here's
               | an output from my mac:
               | 
               | % ps aux | sort -k5 -rh | head -1
               | 
               | xxxxxxxx 88273 1.2 0.9 1597482768 316064 ?? S 4:07PM
               | 35:09.71
               | /Applications/Slack.app/Contents/Frameworks/Slack Helper
               | (Renderer).app/...
               | 
               | Since ps displays the vsz column in KiB, 1597482768
               | corresponds to roughly 1.5 TiB.
        
               | aseipp wrote:
               | Maybe I'm misunderstanding you but: no, you can
               | allocate terabytes of address space on modern 64-bit
               | Linux on a machine with only 8GB of RAM thanks to
               | overcommit. Try it; you can allocate 2^46 bytes of
               | space (= 64 TiB) today with no problem. There is no
               | limit to the allocation space in an overcommit system;
               | there is only a limit to the actual working set, which
               | is very different.
        
               | j16sdiz wrote:
               | You can do it without overcommit -- you can just back the
               | mmap with file
        
         | Pop_- wrote:
         | I don't know why but this really makes me laugh
        
         | aseipp wrote:
         | That is Chromium doing it, and yes, it is using mmap to create
         | a very large, (almost certainly) contiguous range of memory.
         | Many runtimes do this, because it's useful (on 64-bit systems)
         | to create a ridiculously large virtually mapped address space
         | and then only commit small parts of it over time as needed,
         | because it makes memory allocation simpler in several ways;
         | notably it means you don't have to worry about allocating new
         | address spaces when simply allocating memory, and it means
         | answering things like "Is this a heap object?" is easier.
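         | 
         | A minimal sketch of that reserve-then-commit pattern (sizes
         | are placeholders): reserve a large range with PROT_NONE,
         | then commit chunks with mprotect as the heap grows.
         | 
         |   #include <stdio.h>
         |   #include <sys/mman.h>
         |   
         |   #define RESERVE (64ULL << 30)  /* 64 GiB */
         |   #define CHUNK   (2UL << 20)    /* 2 MiB steps */
         |   
         |   int main(void) {
         |       /* address space only; nothing committed */
         |       char *base = mmap(NULL, RESERVE, PROT_NONE,
         |                         MAP_PRIVATE | MAP_ANONYMOUS |
         |                         MAP_NORESERVE, -1, 0);
         |       if (base == MAP_FAILED) return 1;
         |       /* commit the first chunk on demand */
         |       if (mprotect(base, CHUNK,
         |                    PROT_READ | PROT_WRITE))
         |           return 1;
         |       base[0] = 1;
         |       /* "is this a heap object?" is now a simple
         |          range check against [base, base+RESERVE) */
         |       printf("heap reserved at %p\n", (void *)base);
         |       return 0;
         |   }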
        
           | rasz wrote:
            | Dolphin emulator has a recent example of this:
            | https://dolphin-emu.org/blog/2023/11/25/dolphin-progress-rep...
            | 
            | Seems it's not without perils on Windows:
           | 
           | "In an ideal world, that would be all we have to say about
           | the new solution. But for Windows users, there's a special
           | quirk. On most operating systems, we can use a special flag
           | to signal that we don't really care if the system has 32 GiB
           | of real memory. Unfortunately, Windows has no convenient way
           | to do this. Dolphin still works fine on Windows computers
           | that have less than 32 GiB of RAM, but if Windows is set to
           | automatically manage the size of the page file, which is the
           | case by default, starting any game in Dolphin will cause the
           | page file to balloon in size. Dolphin isn't actually writing
           | to all this newly allocated space in the page file, so there
           | are no concerns about performance or disk lifetime. Also,
           | Windows won't try to grow the page file beyond the amount of
           | available disk space, and the page file shrinks back to its
           | previous size when you close Dolphin, so for the most part
           | there are no real consequences... "
        
       | comonoid wrote:
       | jemalloc was Rust's default allocator till 2018.
       | 
       | https://internals.rust-lang.org/t/jemalloc-was-just-removed-...
        
       | titaniumtown wrote:
       | Extremely well written article! Very surprising outcome.
        
       | diamondlovesyou wrote:
       | AMD's string store is not like Intel's. Generally, you don't
       | want to use it until you are past the CPU's L2 size (L3 is a
       | victim cache), which makes ~2k WAY too small. Once past that
       | point, it's profitable to use string store, which should run at
       | "DRAM speed". But it has a high startup cost, hence 256-bit
       | vector loads/stores should be used until that threshold is met.
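       | 
       | A sketch of what that policy looks like as code (the 2 MiB
       | cutoff is a placeholder for "past L2", not a measured value):
       | 
       |   #include <stddef.h>
       |   #include <string.h>
       |   
       |   /* placeholder cutoff standing in for the L2 size */
       |   #define STRING_STORE_CUTOFF (2UL << 20)
       |   
       |   static void rep_movsb(void *d, const void *s, size_t n) {
       |       asm volatile("rep movsb"
       |                    : "+D"(d), "+S"(s), "+c"(n)
       |                    :: "memory");
       |   }
       |   
       |   void copy_dispatch(void *d, const void *s, size_t n) {
       |       if (n < STRING_STORE_CUTOFF)
       |           memcpy(d, s, n);    /* stands in for a vector loop */
       |       else
       |           rep_movsb(d, s, n); /* amortizes its startup cost */
       |   }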
        
         | rasz wrote:
         | Or you leave it as is, forcing AMD to fix their shit. "Fast
         | string mode" was strongly hinted to be _the_ optimal way over
         | 30 years ago with the Pentium Pro, reinforced over 10 years
         | ago with ERMSB, and again with FSRM 4 years ago. AMD, get
         | with the program.
        
         | js2 wrote:
         | Isn't the high startup cost what FSRM is intended to solve?
         | 
         | > With the new Zen3 CPUs, Fast Short REP MOV (FSRM) is finally
         | added to AMD's CPU functions analog to Intel's
         | X86_FEATURE_FSRM. Intel had already introduced this in 2017
         | with the Ice Lake Client microarchitecture. But now AMD is
         | obviously using this feature to increase the performance of REP
         | MOVSB for short and very short operations. This improvement
         | applies to Intel for string lengths between 1 and 128 bytes and
         | one can assume that AMD's implementation will look the same for
         | compatibility reasons.
         | 
         | https://www.igorslab.de/en/cracks-on-the-core-3-yet-the-5-gh...
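         | 
         | For reference, both feature bits are visible in CPUID leaf
         | 7; a quick check (a sketch for GCC/clang on x86-64, not how
         | glibc actually probes):
         | 
         |   #include <cpuid.h>
         |   #include <stdio.h>
         |   
         |   int main(void) {
         |       unsigned a, b, c, d;
         |       if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
         |           return 1;
         |       printf("ERMSB: %u\n", (b >> 9) & 1); /* EBX.9 */
         |       printf("FSRM:  %u\n", (d >> 4) & 1); /* EDX.4 */
         |       return 0;
         |   }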
        
           | diamondlovesyou wrote:
           | Fast is relative here. These are microcoded instructions,
           | which are generally terrible for latency: microcoded
           | instructions don't get branch prediction benefits, nor OoO
           | benefits (they lock the FE/scheduler while running). Small
           | memcpy/moves are always latency bound, hence even if the HW
           | supports "fast" rep store, you're better off not using them.
           | L2 is wicked fast, and these copies are linear, so prediction
           | will be good.
           | 
           | Note that for rep store to be better, it must overcome the
           | cost of the initial latency and then catch up to the
           | 32-byte vector copies, which admittedly don't quite reach
           | DRAM speed, but aren't that bad either. Thus for small
           | copies... just don't use string store.
           | 
           | All this is not even considering non-temporal loads/stores;
           | many larger copies would see better perf by not trashing the
           | L2 cache, since the destination or source is often not
           | inspected right after. String stores don't have a non-
           | temporal option, so this has to be done with vectors.
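           | 
           | For illustration, a minimal non-temporal copy with AVX
           | intrinsics (assumes 32-byte-aligned pointers and a
           | multiple-of-32 length; compile with -mavx):
           | 
           |   #include <immintrin.h>
           |   #include <stddef.h>
           |   
           |   static void copy_nt(void *dst, const void *src,
           |                       size_t n) {
           |       __m256i *d = dst;
           |       const __m256i *s = src;
           |       for (size_t i = 0; i < n / 32; i++) {
           |           __m256i v = _mm256_load_si256(s + i);
           |           /* stream store bypasses the caches */
           |           _mm256_stream_si256(d + i, v);
           |       }
           |       _mm_sfence();  /* order streaming stores */
           |   }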
        
             | js2 wrote:
             | I'm not sure that your comment is responsive to the
             | original post.
             | 
             | FSRM is fast on Intel, even with single byte strings. AMD
             | claims to support FSRM with recent CPUs but performs poorly
             | on small strings, so code which Just Works on Intel has a
             | performance regression when running on AMD.
             | 
             | Now here you're saying `REP MOVSB` shouldn't be used on AMD
             | with small strings. In that case, AMD CPUs shouldn't
             | advertise FSRM. As long as they're advertising it, it
             | shouldn't perform worse than the alternative.
             | 
             | https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515
             | 
             | https://sourceware.org/bugzilla/show_bug.cgi?id=30994
             | 
             | I'm not a CPU expert so perhaps I'm misinterpreting you and
             | we're talking past each other. If so, please clarify.
        
       | forrestthewoods wrote:
       | Delightful article. Thank you author for sharing! I felt like I
       | experienced every shocking twist and surprise in your journey,
       | as if I was right there with you all along.
        
       | darkwater wrote:
       | Totally unrelated but: this post talks about the bug being first
       | discovered in OpenDAL [1], which seems to be an Apache
       | (Incubator) project to add an abstraction layer for storage over
       | several types of storage backend. What's the point/use case of
       | such an abstraction? Anybody using it?
       | 
       | [1] https://opendal.apache.org/
        
       | the8472 wrote:
       | There are two dedicated CPU feature flags to indicate that REP
       | STOS/MOV are fast and usable as a short instruction sequence
       | for memset/memcpy. Having to hand-roll optimized routines for
       | each new CPU generation has been an ongoing pain for decades.
       | 
       | And yet here we are again. Shouldn't this be part of some
       | timing test suite at the CPU vendors by now?
        
         | giancarlostoro wrote:
         | So correct me if I am wrong, but does this mean you need to
         | compile two executables, one per CPU type? Or do you just
         | need to compile it on the specific hardware? Wondering what
         | the fix would be - some sort of runtime check?
        
           | immibis wrote:
           | glibc has the ability to dynamically link a different version
           | of a function based on the CPU.
        
           | dralley wrote:
           | Glibc supports runtime selection of different optimized
           | paths, yes. There was a recent discussion about a security
           | vulnerability in that feature (discussion
           | https://news.ycombinator.com/item?id=37756357), but in
           | essence this is exactly the kind of thing it's useful for.
        
           | fweimer wrote:
           | The exact nature of the fix is unclear at present.
           | 
           | During dynamic linking, glibc picks a memcpy implementation
           | which seems most appropriate for the current machine. We have
           | about 13 different implementations just for x86-64. We could
           | add another one for current(ish) AMD CPUs, select a different
           | existing implementation for them, or change the default for a
           | configurable cutover point in a parameterized implementation.
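            | 
            | The selection itself is done with GNU ifuncs; a toy
            | sketch of the mechanism (the avx2 predicate is just a
            | stand-in, not the real selection logic):
            | 
            |   #include <stdio.h>
            |   
            |   static void copy_a(void) { puts("variant A"); }
            |   static void copy_b(void) { puts("variant B"); }
            |   
            |   /* resolver runs once, at dynamic-link time */
            |   static void (*pick(void))(void) {
            |       __builtin_cpu_init();
            |       return __builtin_cpu_supports("avx2")
            |                  ? copy_b : copy_a;
            |   }
            |   
            |   void copy(void) __attribute__((ifunc("pick")));
            |   
            |   int main(void) { copy(); return 0; }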
        
           | ww520 wrote:
           | Since the CPU instructions are the same, instruction patching
           | at startup or install time can be used. Just patch in the
           | correct instructions for the respective hardware.
        
           | the8472 wrote:
            | The sibling comments mention the hardware-specific dynamic
            | linking glibc uses for function calls. But if your
            | compiler inlines memcpy (usually for short, fixed-size
            | copies) into the binary, then yes, you'll have to compile
            | it for a specific CPU to get optimal performance. That's
            | true for all target-dependent optimizations, though.
            | 
            | More broadly compatible routines will still work on newer
            | CPUs, they just won't yield the best performance.
           | 
           | It still would be nice if such central routines could just be
           | compiled to the REP-prefixed instructions and would deliver
           | (near-)optimal performance so we could stop worrying about
           | that particular part.
        
       | lxe wrote:
       | I wonder what other things we could improve by removing Spectre
       | mitigations and tuning hugepages, syscall latency, and core
       | affinity
        
       | lxe wrote:
       | So Python isn't affected by the bug because pymalloc performs
       | better on buggy CPUs than jemalloc or malloc?
        
         | js2 wrote:
         | No, it has nothing to do with pymalloc's performance. Rather,
         | the performance issue only occurs when using `rep movsb` on
         | AMD CPUs with page-aligned buffers, and pymalloc just happens
         | to hand out buffers that are not page-aligned in this case.
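         | 
         | A sketch that shows the effect (the 16-byte offset mimics
         | pymalloc's object header; the gap only shows up on affected
         | CPUs):
         | 
         |   #include <stdio.h>
         |   #include <stdlib.h>
         |   #include <string.h>
         |   #include <time.h>
         |   
         |   static void rep_movsb(void *d, const void *s,
         |                         size_t n) {
         |       asm volatile("rep movsb"
         |                    : "+D"(d), "+S"(s), "+c"(n)
         |                    :: "memory");
         |   }
         |   
         |   static double bench(char *dst, char *src, size_t n) {
         |       struct timespec a, b;
         |       clock_gettime(CLOCK_MONOTONIC, &a);
         |       for (int i = 0; i < 100000; i++)
         |           rep_movsb(dst, src, n);
         |       clock_gettime(CLOCK_MONOTONIC, &b);
         |       return (b.tv_sec - a.tv_sec)
         |            + (b.tv_nsec - a.tv_nsec) / 1e9;
         |   }
         |   
         |   int main(void) {
         |       size_t n = 4096;
         |       char *src = aligned_alloc(4096, 2 * n);
         |       char *dst = aligned_alloc(4096, 2 * n);
         |       memset(src, 1, 2 * n);
         |       printf("page-aligned: %.3fs\n",
         |              bench(dst, src, n));
         |       printf("offset by 16: %.3fs\n",
         |              bench(dst + 16, src + 16, n));
         |       return 0;
         |   }
         | 
         | On an unaffected CPU the two numbers should be close.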
        
       | jokethrowaway wrote:
       | Clickbait title but interesting article.
       | 
       | This has nothing to do with python or rust
        
       | codedokode wrote:
       | Why is there a need to copy memory at all? Can't the hardware
       | DMA data into non-page-aligned memory? Or does Linux not want
       | to load non-aligned data?
        
         | wmf wrote:
         | The Linux page cache keeps data page-aligned so if you want the
         | data to be unaligned Linux will copy it.
        
           | codedokode wrote:
            | What if I don't want to use the page cache?
        
             | tedunangst wrote:
             | Pull out some RAM sticks.
        
             | wmf wrote:
             | You can use O_DIRECT although that also forces alignment
             | IIRC.
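             | 
             | A sketch of an O_DIRECT read (the file name is made
             | up; the buffer must be aligned, typically to the
             | logical block size):
             | 
             |   #define _GNU_SOURCE
             |   #include <fcntl.h>
             |   #include <stdio.h>
             |   #include <stdlib.h>
             |   #include <unistd.h>
             |   
             |   int main(void) {
             |       int fd = open("data.bin",
             |                     O_RDONLY | O_DIRECT);
             |       if (fd < 0) { perror("open"); return 1; }
             |       void *buf;
             |       /* aligned buffer is mandatory here */
             |       if (posix_memalign(&buf, 4096, 4096))
             |           return 1;
             |       ssize_t n = read(fd, buf, 4096);
             |       printf("read %zd bytes\n", n);
             |       close(fd);
             |       free(buf);
             |       return 0;
             |   }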
        
       | eigenform wrote:
       | would be lovely if ${cpu_vendor} would document exactly how
       | FSRM/ERMS/etc are implemented and what the expected behavior is
        
       ___________________________________________________________________
       (page generated 2023-11-29 23:00 UTC)