[HN Gopher] Rust std fs slower than Python? No, it's hardware ___________________________________________________________________ Rust std fs slower than Python? No, it's hardware Author : Pop_- Score : 555 points Date : 2023-11-29 09:18 UTC (13 hours ago) (HTM) web link (xuanwo.io) (TXT) w3m dump (xuanwo.io) | royjacobs wrote: | I was prepared to read the article and scoff at the author's | misuse of std::fs. However, the article is a delightful | succession of rabbit holes and mysteries. Well written and very | interesting! | bri3d wrote: | This was such a good article! The debugging was smart (writing | test programs to peel each layer off), the conclusion was | fascinating and unexpected, and the writing was clear and easy | to follow. | sgift wrote: | Either the author changed the headline to something less | clickbaity in the meantime or you edited it for clickbait Pop_- | (in that case: shame on you) - current headline: "Rust std fs | slower than Python!? No, it's hardware!" | epage wrote: | Based on the /r/rust thread, the author seemed to change the | headline based on feedback to make it less clickbait-y | xuanwo wrote: | Sorry for the clickbaity title, I have changed it based on | others' advice. | thechao wrote: | I disagree that it's clickbait-y. Diving down from Python | bindings to ucode is ... not how things usually go. Doubly | so, since Python is a very mature runtime, and I'd be | inclined to believe they've dug up file-reading Kung Fu not | available to the Average Joe. | Pop_- wrote: | The author has updated the title and also contacted me. But | unfortunately I'm no longer able to update it now. | Pesthuf wrote: | Clickbait headline, but the article is great! | joshfee wrote: | Surprisingly, I think this usage of clickbait is totally | reasonable because it matches the author's initial | thoughts/experiences of "what?! this can't be right..."
| saghm wrote: | I think there might be a range of where people draw the line | between reasonable headlines and clickbait, because I tend to | think of clickbait as something where the "answer" to some | question is intentionally left out to try to bait people into | clicking. For this article, something I'd consider clickbait | would be something like "Rust std fs is slower than Python?" | without the answer after. More commonly, the headline isn't | phrased directly as a question, but instead of saying something | like "So-and-so musician loves burritos", it will leave out the | main detail and say something like "The meal so-and-so eats | before every concert", which is trying to get you to click and | have to read through lots of extraneous prose just to find the | word "burritos". | | Having a hook to get people to want to read the article is | reasonable in my opinion; after all, if you could fit every | detail in the size of a headline, you wouldn't need an article | at all! Clickbait inverts this by _only_ having enough | substance that you could get all the info in the headline, but | instead it leaves out the one detail that's interesting and | then pads it with fluff that you're forced to click and read | through if you want the answer. | iampims wrote: | Most interesting article I've read this week. Excellent write-up. | Pop_- wrote: | Disclaimer: The title has been changed to "Rust std fs slower | than Python!? No, it's hardware!" to avoid clickbait. However, I'm | not able to fix the title in HN. | 3cats-in-a-coat wrote: | What's the TLDR on how... hardware performs differently on two | software runtimes? | lynndotpy wrote: | One of the very first things in the article is a TLDR section | that points you to the conclusion. | | > In conclusion, the issue isn't software-related. Python | outperforms C/Rust due to an AMD CPU bug. | j16sdiz wrote: | It _is_ software-related. Just the CPU performs badly on | some _software_ instruction.
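The comparison behind the TLDR boils down to timing a plain buffered file read. A minimal Rust sketch of that kind of measurement is below — the file name and size are invented for illustration, and the article's actual harness compares OpenDAL's reader against Python's `open().read()`, not this exact code:

```rust
use std::time::Instant;

/// Write `len` bytes to a temp file, then time reading it back with
/// std::fs::read. Returns the number of bytes read. A single run like this
/// is only a sketch; real benchmarking needs repeated, warmed-up runs.
fn time_read(len: usize) -> std::io::Result<usize> {
    let path = std::env::temp_dir().join("read_bench.dat");
    std::fs::write(&path, vec![0u8; len])?;

    let start = Instant::now();
    let data = std::fs::read(&path)?;
    println!("read {} bytes in {:?}", data.len(), start.elapsed());

    std::fs::remove_file(&path)?;
    Ok(data.len())
}

fn main() {
    // 64 MiB, roughly the size range where the slowdown was visible.
    let n = time_read(64 * 1024 * 1024).expect("benchmark failed");
    assert_eq!(n, 64 * 1024 * 1024);
}
```

On an unaffected CPU this shows nothing surprising; the thread's point is that on certain Zen 3/Zen 4 machines the equivalent Python read came out faster.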
| xuanwo wrote: | FSRM is a CPU feature embedded in the microcode (in this | instance, amd-ucode) that software such as glibc cannot | interact with. I refer to it as hardware because I | consider microcode a part of the hardware. | pornel wrote: | AMD's implementation of `rep movsb` instruction is | surprisingly slow when addresses are page aligned. Python's | allocator happens to add a 16-byte offset that avoids the | hardware quirk/bug. | sound1 wrote: | thank you, upvoted! | sharperguy wrote: | "Works on contingency? No, money down!" | pvg wrote: | you can mail hn@ycombinator.com and they can change it for you | to whatever. | quietbritishjim wrote: | I'm a bit confused about the premise. This is not comparing pure | Python code against some native (C or Rust) code. It's comparing | one Python wrapper around native code (Python's file read method) | against another Python wrapper around some native code (OpenDAL). | OK it's still interesting that there's a difference in | performance, but it's very odd to describe it as "slower than | Python". Did they expect that the Python standard library is all | written in pure Python? On the contrary, I would expect the | implementations of functions in Python's standard library to be | native and, individually, highly optimised. | | I'm not surprised the conclusion had something to do with the way | that native code works. Admittedly I was surprised at the | specific answer - still a very interesting article despite the | confusing start. | | Edit: The conclusion also took me a couple of attempts to parse. | There's a heading "C is slower than Python with specified | offset". To me, as a native English speaker, this reads as "C is | slower (than Python) with specified offset" i.e. it sounds like | they took the C code, specified the same offset as Python, and | then it's still slower than Python. But it's the opposite: once | the offset from Python was also specified in the C code, the C | code was then faster. 
Still very interesting once I got what they | were saying though. | xuanwo wrote: | Thanks for the comments. I have fixed the headers :) | crabbone wrote: | > individually, highly optimised. | | Now why would you expect _that_? | | What happened to OP is pure chance. CPython's C code doesn't | even care about const-consistency. It's flush with dynamic | memory allocations, a bunch of helper / convenience calls... Even | stuff like arithmetic does dynamic memory allocation... | | Normally, you don't expect CPython to perform well, not if you | have any experience working with it. Whenever you want to | improve performance you want to sidestep all the functionality | available there. | | Also, while Python doesn't have a standard library, since it | doesn't have a standard... the library that's distributed with | it is _mostly_ written in Python. Of course, some of it comes | written in C, but there's also a sizable fraction of that C | code that's essentially Python code translated mechanically | into C (a good example of this is Python's binary search | implementation, which was originally written in Python and | later translated into C using Python's C API). | | What one would expect is that functionality that is simple to | map to operating system functionality has a relatively thin | wrapper. I.e. reading files wouldn't require much in terms of | binding code because, essentially, it goes straight into the | system interface. | codr7 wrote: | Have you ever attempted to write a scripting language that | performs better? | | I have, several, and it's far from trivial. | | The basics are seriously optimized for typical use cases, | take a look at the source code for the dict type.
| svieira wrote: | Raymond Hettinger's talk _Modern Python Dictionaries: A | confluence of a dozen great ideas_ is an awesome "history | of how we got these optimizations" and a walk through why | they are so effective - | https://www.youtube.com/watch?v=npw4s1QTmPg | codr7 wrote: | Yeah, I had a nice chat with Raymond Hettinger at a Pycon | in Birmingham/UK back in the day (had no idea who he was | at the time). He seemed like a dedicated and intelligent | person, I'm sure we can thank him for some of that. | crabbone wrote: | > Have you ever attempted to write a scripting language | that performs better? | | No, because "scripting language" is not a thing. | | But, if we are talking about implementing languages, then I | worked with many language implementations. The most | comparable one that I know fairly well, inside-and-out, | would be the AVM, i.e. the ActionScript Virtual Machine. | It's not well-written either, unfortunately. | | I've looked at implementations of Lua, Emacs Lisp and | Erlang at different times and to various degrees. I'm also | somewhat familiar with SBCL and ECL, the implementation | side. There are different things the authors looked for in | these implementations. For example, SBCL emphasizes | performance, where ECL emphasizes simplicity and interop | with C. | | If I had to grade language implementations I've seen, | Erlang would absolutely take the cake. It's a very | thoughtful and disciplined program where the authors went to | great lengths to design and implement it. CPython is on the | lower end of such programs. It's anarchic, very unevenly | implemented, you run into comments testifying to the author | not knowing what they are doing, what their predecessor | did, nor what to do next.
Sometimes the code is written | from that perspective as well, as in, if the author somehow | manages to drive themselves into a corner and no longer knows | what the reference count is, they'll just hammer it | and hope all references are dead (well, maybe). | | It's the code style that, unfortunately, I associate with | proprietary projects where deadlines and cost dictate the | quality, where concurrency problems are solved with sleeps, | and if that doesn't work, then the sleep delay is doubled. | It's not because I specifically hate code being | proprietary, but because I meet that kind of code in my day | job more than I meet it in hobby open-source projects. | | > take a look at the source code for the dict type. | | I wrote a Protobuf parser in C with the intention of | exposing its bindings to Python. Dictionaries were a | natural choice for the hash-map Protobuf elements. I | benchmarked my implementation against C++ (Google's) | implementation only to discover that std::map wins against | Python's dictionary by a landslide. | | Maybe Python's dict isn't as bad as most of the rest of the | interpreter, but being the best of the worst still doesn't | make it good. | codr7 wrote: | Except it is, because everyone knows sort of what it | means, an interpreted language that prioritizes | convenience over performance; | Perl/Python/Ruby/Lua/PHP/etc. | | SBCL is definitely a different beast. | | I would expect Emacs Lisp & Lua to be more similar. | | Erlang had plenty more funding and stricter requirements. | | C++'s std::map has most likely gotten even more attention | than Python's dict, but I'm not sure from your comment if | you're including Python's VM dispatch in that comparison. | | What are you trying to prove here?
| wahern wrote: | > The basics are seriously optimized for typical use cases, | take a look at the source code for the dict type | | Python is well micro-optimized, but the broader | architecture of the language and especially the CPython | implementation did not put much concern into performance, | even for a dynamically typed scripting language. For | example, in CPython values of built-in types are still | allocated as regular objects and passed by reference; this | is atrocious for performance and no amount of micro | optimization will suffice to completely bridge the | performance gap for tasks which stress this aspect of | CPython. By contrast, primitive types in Lua (including PUC | Lua, the reference, non-JIT implementation) and JavaScript | are passed around internally as scalar values, and the | languages were designed with this in mind. | | Perl is similar to Python in this regard--the language | constructs and type systems weren't designed for high | primitive operation throughput. Rather, performance | considerations were focused on higher level, functional | tasks. For example, Perl string objects were designed to | support fast concatenation and copy-on-write references, | optimizations which pay huge dividends for the tasks for | which Perl became popular. Perl can often seem ridiculously | fast for naive string munging compared to even compiled | languages, yet few people care to defend Perl as a | performant language per se. | qd011 wrote: | I don't understand why Python gets shit for being a slow | language when it's slow but no credit for being fast when it's | fast just because "it's not really Python". | | If I write Python and my code is fast, to me that sounds like | Python is fast, I couldn't care less whether it's because the | implementation is in another language or for some other reason. | paulddraper wrote: | Yeah, it's weird. 
| afdbcreid wrote: | Usually, yes, but when it's a bug in the hardware, it's not | really that Python is fast, more like that CPython developers | were lucky enough to not have the bug. | munch117 wrote: | How do you know that it's luck? | cozzyd wrote: | Because the offset is entirely due to space for the | PyObject header. | munch117 wrote: | The PyObject header is a target for optimisation. | Performance regressions are likely to be noticed, and if | a different header layout is faster, then it's entirely | possible that it will be used for purely empirical | reasons. Trying different options and picking the best | performing one is not luck, even if you can't explain why | it's the best performing. | cozzyd wrote: | I suspect any size other than 0 would lead to this. | | But the Zen3/4 were developed far, far after the PyObject | header... | adgjlsfhk1 wrote: | because the offset here is a result of Python's reference | counting, which dates ~20 years before Zen 3 | benrutter wrote: | I wonder if it's because we're sometimes talking at cross | purposes. | | For me, coding is almost exclusively using Python libraries | like numpy to call out to other languages like C or Fortran. | It feels silly to say I'm not coding in Python to me. | | On the other hand, if you're writing those libraries, coding | to you is mostly writing Fortran and C optimizations. It | probably feels silly to say you're coding in Python just | because that's where your code is called from. | kbenson wrote: | Because for any nontrivial case you would expect | python+compiled library and associated marshaling of data to | be slower than that library in its native implementation | without any interop/marshaling required. | | When you see an interpreted language faster than a compiled | one, it's worth looking at why, because _most_ of the time it's | because there's some hidden issue causing the other to be | slow (which could just be a different and much worse | implementation).
| | Put another way, you can do a lot to make a Honda Civic very | fast, but when you hear one goes up against a Ferrari and | wins, your first thoughts should be about what the test was, | how the Civic was modified, and if the Ferrari had problems | or the test wasn't to its strengths at all. If you just think | "yeah, I love Civics, that's awesome" then you're not | thinking critically enough about it. | Attummm wrote: | In this case, Python's code (opening and loading the | content of a file) operates almost fully within its C | runtime. | | The C components initiate the system call and manage the | file pointer, which loads the data from the disk into a | pyobj string. | | Therefore, it isn't so much Python itself that is being | tested, but rather Python's underlying C runtime. | kbenson wrote: | Yep, and the next logical question when both | implementations are for the most part bare metal | (compiled and low-level), is why is there a large | difference? Is it a matter of implementation/algorithm, | inefficiency, or a bug somewhere? In this case, that | search turned up a hardware issue that should be | addressed, which is why it's so useful to examine these | things. | rafaelmn wrote: | But you will care if that "Python" breaks - you get to drop | down to C/C++ and debug native code. Likewise for adding | features or understanding the implementation. Not to mention | having to deal with native build tooling and platform | specific stuff. | | It's completely fair to say that's not Python because it | isn't - any language out there can FFI to C and it has the | same problems mentioned above. | IshKebab wrote: | Because when people talk about Python performance they're | talking about the performance of Python code itself, not | C/Rust code that it's wrapping. | | Pretty much any language can wrap C/Rust code. | | Why does it matter? | | 1. Having to split your code across 2 languages via FFI is a | huge pain. | | 2. You are still writing _some_ Python.
There's plenty of | code that is pure Python. That code is slow. | munch117 wrote: | Of course in this case there's no FFI involved - the _open_ | function is built-in. It's as pure-Python as it can get. | IshKebab wrote: | Not sure I agree there, but anyway in this case the | performance had nothing to do with Python being a slow or | fast language. | insanitybit wrote: | >I don't understand why Python gets shit for being a slow | language when it's slow but no credit for being fast when | it's fast just because "it's not really Python". | | What's there to understand? When it's fast it's not really | Python, it's C. C is fast. Python can call out to C. You | don't have to care that the implementation is in another | language, but it is. | fl0ki wrote: | The premise is that any time you say "Python [...] faster than | Rust [...]" you get page views even if it's not true. People | have noticed after the last few dozen times something like this | was posted. | lambda wrote: | I'm a bit confused by why you are confused. | | It's surprising that something as simple as reading a file is | slower in the Rust standard library than in the Python standard | library. Even knowing that a Python standard library call like | this is written in C, you'd still expect the Rust standard | library call to be of a similar speed; so you'd expect either | that you're using it wrong, or that the Rust standard library | has some weird behavior. | | In this case, it turns out that neither was the case; there's | just a weird hardware performance cliff based on the exact | alignment of an allocation on particular hardware. | | So, yeah, I'd expect a filesystem read to be pretty well | optimized in Python, but I'd expect the same in Rust, so it's | surprising that the latter was so much slower, and especially | surprising that it turned out to be hardware and allocator | dependent. | drtgh wrote: | >Rust std fs slower than Python!? No, it's hardware! | | >...
| | >Python features three memory domains, each representing | different allocation strategies and optimized for various | purposes. | | >... | | >Rust is slower than Python only on my machine. | | If one library performs wildly better than the other in the same | test, on the same hardware, how can that not be a software- | related problem? It sounds like a contradiction. | | Maybe it should be considered a coding issue and/or a missing | feature? IMHO it would be expected that Rust's std library perform | well without making all its users circumvent the issue manually. | | The article is well investigated, so I assume the author just | wants to show that the problem exists without creating controversy, | because otherwise I cannot understand it. | Pop_- wrote: | The root cause is AMD's bad support for rep movsb (which is a | hardware problem). However, Python by default has a small | offset when reading into memory while lower-level languages (Rust | and C) do not, which is why Python seems to perform better | than C/Rust. It "accidentally" avoided the hardware problem. | CoastalCoder wrote: | I'm not sure it makes sense to pin this only on AMD. | | Whenever you're writing performance-critical software, you | need to consider the relevant combinations of hardware + | software + workload + configuration. | | Sometimes a problem can be created or fixed by adjusting any | one / some subset of those details. | hobofan wrote: | If that's a bug that only happens with AMD CPUs, I think | that's totally fair. | | If we start adding in exceptions at the top of the software | stack for individual failures of specific CPUs/vendors, | that seems like a strong regression from where we are today | in terms of ergonomics of writing performance-critical | software. We can't be writing individual code for each N x | M x O x P combination of hardware + software + workload + | configuration (even if you can narrow down the "relevant" | ones).
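The offset effect Pop_- describes earlier in the thread can be poked at with a small Rust sketch that reads the same file into an unshifted buffer and into a buffer shifted by 0x20 bytes (the file name and sizes below are invented; note too that a Vec's base address alignment is up to the allocator, whereas the article's C repro controls page alignment explicitly):

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;
use std::time::Instant;

/// Time reading `path` into a buffer whose data starts `offset` bytes past
/// the start of the allocation, mimicking how CPython's bytes object places
/// file contents 0x20 bytes after the allocation begins.
fn read_with_offset(path: &Path, offset: usize, len: usize) -> std::io::Result<u128> {
    let mut buf = vec![0u8; offset + len];
    let mut file = File::open(path)?;
    let start = Instant::now();
    file.read_exact(&mut buf[offset..])?;
    Ok(start.elapsed().as_micros())
}

fn main() -> std::io::Result<()> {
    // Hypothetical test file, created here so the sketch is self-contained.
    let path = std::env::temp_dir().join("offset_bench.dat");
    let len = 16 * 1024 * 1024;
    std::fs::write(&path, vec![1u8; len])?;

    let plain = read_with_offset(&path, 0x0, len)?; // what C/Rust do by default
    let shifted = read_with_offset(&path, 0x20, len)?; // what CPython effectively does
    println!("no offset: {plain} us, offset 0x20: {shifted} us");
    std::fs::remove_file(&path)?;
    Ok(())
}
```

On an unaffected CPU the two timings should be roughly equal; only on the affected Zen 3/Zen 4 parts, with many repeated runs, would the unshifted read stand out.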
| jpc0 wrote: | > We can't be writing individual code for each N x M x O | x P combination of hardware + software + workload + | configuration | | That is kind of exactly what you would do when optimising | for popular platforms. | | If this error occurs on an AMD CPU used by half your | users, is your response to your user going to be "just buy | a different CPU", or are you going to fix it in code and | ship a "performance improvement on XYZ platform" update? | jacoblambda wrote: | Nobody said "just buy a different CPU" anywhere in this | discussion or the article. And they are pinning the root | cause on AMD which is completely fair because they are | the source of the issue. | | Given that the fix is within the memory allocator, there | is already a relatively trivial fix for users who really | need it (recompile with jemalloc as the global memory | allocator). | | For everyone else, it's probably better to wait until AMD | reports back with an analysis from their side and either | recommends an "official" mitigation or pushes out a | microcode update. | ansible wrote: | The fix is that AMD needs to develop, test and deploy a | microcode update for their affected CPUs, and then the | problem is truly fixed for everyone, not just the people | who have detected the issue and tried to mitigate it. | richardwhiuk wrote: | You are going to be disappointed when you find out | there's lots of architecture and CPU specific code in | software libraries and the kernel. | pmontra wrote: | Well, if Excel were running at half the speed (or | half the speed of LibreOffice Calc!) on half of the machines around | here, somebody at Redmond would notice, find the hardware | bug and work around it. | | I guess that in most big companies it suffices that there | is a problem with their own software running on the laptop | of a C* manager or of somebody close to them.
When I was working for a mobile operator the antennas the | network division cared about most were the ones close to | the home of the CEO. If he could make his test calls with | no problems they had the time to fix the problems of the | rest of the network across the whole country. | Pop_- wrote: | It's a known issue for AMD and has been tested by multiple | people, and by the data provided by the author. It's fair | to pin this problem to AMD. | formerly_proven wrote: | That extra 0x20 (32 byte) offset is the size of the PyBytes | object header for anyone wondering; 64 bits each for type | object pointer, reference count, base pointer and item count. | mrweasel wrote: | Thank you, because I was wondering if some Python developer | found the same issue and decided to just implement the | offset. It makes much more sense that it just happens to | work out that way in Python. | meneer_oke wrote: | It doesn't just _seem_ faster; "seem" would imply that it isn't the | case. It currently _is_ faster on that setup. | | But since the Python runtime is written in C, the issue can't be | Python vs C. | TylerE wrote: | C is a very wide target. There are plenty of things that | one can do "in C" that no human would ever write. For | instance, the C code generated by languages like Nim and | Zig that essentially use C as a sort of IR. | meneer_oke wrote: | That is true, with C a lot is possible | | > However, Python by default has a small offset when | reading into memory while lower-level languages (Rust and C) | | Yet if the runtime is made with C, then that statement is | incorrect. | bilkow wrote: | By going through that line of thought, you could also | argue that the slow version | in C and Rust is actually implemented in C, as memcpy is | in glibc. Hence, Python being faster than Rust would also | mean in this case that Python is faster than C. | | The point is not that one language is faster than | another.
The point is that the default way to implement | something in a language ended up being surprisingly | faster when compared to other languages in this specific | scenario due to a performance issue in the hardware. | | In other words: on this specific hardware, the default | way to do this in Python is faster than the default way | to do this in C and Rust. That can be true, as Python | does not use C in the default way, it adds an offset! You | can change your implementation in any of those languages | to make it faster, in this case by just adding an offset, | so it doesn't mean that "Python is faster than C or Rust | in general". | topaz0 wrote: | It's obviously not Python vs C -- the time difference turns | out to be in kernel code (system call) and not user code at | all, and the post explicitly constructs a C program that | doesn't have the slowdown by adding a memory offset. It | just turns up by default in a comparison of Python vs C | code because Python reads have a memory offset by default | (for completely unrelated reasons) and analogous C reads | don't by default. In principle you could also construct | Python code that does see this slowdown, it would just be | much less likely to show up at random. So the Python vs C | comparison is a total red herring here, it just happened to be | what the author noticed and used as a hook to understand | the problem. | magicalhippo wrote: | I recall when the Pentium was introduced we were told to avoid | rep and write a carefully tuned loop ourselves. To go really | fast one could use the FPU to do the loads and stores. | | Not too long ago I read in Intel's optimization guidelines | that rep was now faster again and should be used. | | Seems most of these things need to be benchmarked on the | CPU, as they change "all the time". I've sped up plenty of | code by just replacing hand-crafted assembly with high-level | functional equivalent code.
| | Of course so-slow-it's-bad is different, however a runtime- | determined implementation choice would avoid that as well. | mwcampbell wrote: | Years ago, Rust's standard library used jemalloc. That decision | substantially increased the minimum executable size, though. I | didn't publicly complain about it back then (as far as I can | recall), but perhaps others did. So the Rust library team | switched to using the OS's allocator by default. | | Maybe using an alternative allocator only solves the problem by | accident and there's another way to solve it intentionally; I | don't yet fully understand the problem. My point is that using | a different allocator by default was already tried. | saghm wrote: | > I didn't publicly complain about it back then (as far as I | can recall), but perhaps others did. So the Rust library team | switched to using the OS's allocator by default. | | I've honestly never worked in a domain where binary size ever | really mattered beyond maybe invoking `strip` on a binary | before deploying it, so I try to keep an open mind. That | said, this has always been a topic of discussion around | Rust[0], and while I obviously don't have anything against | binary sizes being smaller, bugs like this do make me wonder | about huge changes like switching the default allocator where | we can't really test all of the potential side effects; next | time, the unintended consequences might not be worth the | tradeoff. | | [0]: https://hn.algolia.com/?dateRange=all&page=0&prefix=fals | e&qu... | exxos wrote: | It's the hardware. Of course Rust remains the fastest and safest | language and you must rewrite your applications in Rust. | dang wrote: | You've been posting like this so frequently as to cross into | abusing the forum, so I've banned the account. | | If you don't want to be banned, you're welcome to email | hn@ycombinator.com and give us reason to believe that you'll | follow the rules in the future. 
They're here: | https://news.ycombinator.com/newsguidelines.html. | Aissen wrote: | Associated glibc bug (Zen 4 though): | https://sourceware.org/bugzilla/show_bug.cgi?id=30994 | Arnavion wrote: | The bug is also about Zen 3, and even mentions the 5900X (the | article author's CPU). | nabakin wrote: | If you read the bug tracker, a comment mentions this affects | Zen 3 and Zen 4 | fweimer wrote: | And AMD is investigating: https://inbox.sourceware.org/libc- | alpha/20231115190559.29112... | explodingwaffle wrote: | Anyone else feeling the frequency illusion with rep movsb? | | (https://lock.cmpxchg8b.com/reptar.html) | a1o wrote: | > Rust developers might consider switching to jemallocator for | improved performance | | I am curious if this is something that everyone can do to get | free performance or if there are caveats. Can C codebases benefit | from this too? Is this performance that is simply left on the table | currently? | nicoburns wrote: | I think it's pretty much free performance that's being left on | the table. There's a slight cost to binary size. And it may not | perform better in absolutely all circumstances (but it will in | almost all). | | Rust used to use jemalloc by default but switched as people | found this surprising as the default. | Pop_- wrote: | Switching to a non-default allocator does not always bring a | performance boost. It really depends on your workload, which | requires profiling and benchmarking. But C/C++/Rust and other | lower-level languages should all at least be able to choose | from these allocators. One caveat is binary size. A custom | allocator does add more bytes to the executable. | vlovich123 wrote: | I don't know why people still look to jemalloc. Mimalloc | outperforms the standard allocator on nearly every single | benchmark. Glibc's allocator & jemalloc both are long in the | tooth & don't actually perform as well as state of the art | allocators.
I wish Rust would switch to mimalloc or the | latest tcmalloc (not the one in gperftools). | masklinn wrote: | > I wish Rust would switch to mimalloc or the latest | tcmalloc (not the one in gperftools). | | That's nonsensical. Rust uses the system allocators for | reliability, compatibility, binary bloat, maintenance | burden, ..., not because they're _good_ (they were not when | Rust switched away from jemalloc, and they aren't now). | | If you want to use mimalloc in your Rust programs, you can | just set it as the global allocator, same as jemalloc; that | takes all of three lines: | https://github.com/purpleprotocol/mimalloc_rust#usage | | If you want the Rust compiler to link against mimalloc | rather than jemalloc, feel free to test it out and open an | issue, but maybe take a gander at the previous attempt: | https://github.com/rust-lang/rust/pull/103944 which died | for the exact same reason that the one before that | (https://github.com/rust-lang/rust/pull/92249) did: | unacceptable regression of max-rss. | vlovich123 wrote: | I know it's easy to change but the arguments for using | glibc's allocator are less clear to me: | | 1. Reliability - how is an alternate allocator less | reliable? Seems like a FUD-based argument. Unless by | reliability you mean performance in which case yes - | jemalloc isn't reliably faster than standard allocators, | but mimalloc is. | | 2. Compatibility - again sounds like a FUD argument. How | is compatibility reduced by swapping out the allocator? | You don't even have to do it on all systems if you want. | Glibc is just unequivocally bad. | | 3. Binary bloat - This one is maybe an OK argument | although I don't know what size difference we're talking | about for mimalloc. Also, most people aren't writing | hello world applications so the default should probably | be for a good allocator.
I'd also note that having a | dependency of the std runtime on glibc in the first place | likely bloats your binary more than the specific | allocator selected. | | 4. Maintenance burden - I don't really buy this argument. | In both cases you're relying on a 3rd party to maintain | the code. | masklinn wrote: | > I know it's easy to change but the arguments for using | glibc's allocator are less clear to me: | | You can find them at the original motivation for removing | jemalloc, 7 years ago: https://github.com/rust- | lang/rust/issues/36963 | | Also it's not "glibc's allocator", it's the system | allocator. If you're unhappy with glibc's, get that | replaced. | | > 1. Reliability - how is an alternate allocator less | reliable? | | Jemalloc had to be disabled on various platforms and | architectures, there is no reason to think mimalloc or | tcmalloc are any different. | | The system allocator, while shit, is always there and | functional, the project does not have to curate its | availability across platforms. | | > 2. Compatibility - again sounds like a FUD argument. | How is compatibility reduced by swapping out the | allocator? | | It makes interactions with anything which _does_ use the | system allocator worse, and almost certainly fails to | interact correctly with some of the more specialised | system facilities (e.g. malloc.conf) or tooling (in rust, | jemalloc as shipped did not work with valgrind). | | > Also, most people aren't writing hello world | applications | | Most people aren't writing applications bound on | allocation throughput either | | > so the default should probably be for a good allocator. | | Probably not, no. | | > I'd also note that having a dependency of the std | runtime on glibc in the first place likely bloats your | binary more than the specific allocator selected. | | That makes no sense whatsoever. The libc is the system's | and dynamically linked. And changing allocator does not | magically unlink it. | | > 4. 
Maintenance burden - I don't really buy this | argument. | | It doesn't matter that you don't buy it. Having to ship, | resync, debug, and curate (cf (1)) an allocator is a | maintenance burden. With a system allocator, all the | project does is ensure it calls the system allocators | correctly, the rest is out of its purview. | vlovich123 wrote: | The reason the reliability & compatibility arguments | don't make sense to me is that jemalloc is still in use | for rustc (again - not sure why they haven't switched to | mimalloc) which has all the same platform requirements as | the standard library. There's also no reason an alternate | allocator can't be used on Linux specifically because | glibc's allocator is just bad full stop. | | > It makes interactions with anything which does use the | system allocator worse | | That's a really niche argument. Most people are not doing | any of that and malloc.conf is only for people who are | tuning the glibc allocator which is a silly thing to do | when mimalloc will outperform whatever tuning you do (yes | - glibc really is that bad). | | > or tooling (in rust, jemalloc as shipped did not work | with valgrind) | | That's a fair argument, but it's not an unsolvable one. | | > Most people aren't writing applications bound on | allocation throughput either | | You'd be surprised at how big an impact the allocator can | make even when you don't think you're bound on | allocations. There's also all sorts of other things | beyond allocation throughput & glibc sucks at all of them | (e.g. freeing memory, behavior in multithreaded programs, | fragmentation etc etc). | | > The libc is the system's and dynamically linked. And | changing allocator does not magically unlink it | | I meant that the dependency on libc at all in the | standard library bloats the size of a statically linked | executable. 
| josephg wrote: | > jemalloc is still in use for rustc (again - not sure | why they haven't switched to mimalloc) | | Performance of rustc matters a lot! If the rust compiler | runs faster when using mimalloc, please benchmark & | submit a patch to the compiler. | vlovich123 wrote: | Any links to instructions on how to run said benchmarks? | masklinn wrote: | I literally linked two attempts to use mimalloc in rustc | just a few comments upthread. | charcircuit wrote: | I've never not gotten increased performance by swapping out | the allocator. | nh2 wrote: | Be aware `jemalloc` will make you suffer the observability | issues of `MADV_FREE`. `htop` will no longer show the truth | about how much memory is in use. | | * | https://github.com/jemalloc/jemalloc/issues/387#issuecomment... | | * https://gitlab.haskell.org/ghc/ghc/-/issues/17411 | | Apparently now `jemalloc` will call `MADV_DONTNEED` 10 seconds | after `MADV_FREE`: | https://github.com/JuliaLang/julia/issues/51086#issuecomment... | | So while this "fixes" the issue, it'll introduce a confusing | time delay between you freeing the memory and you observing | that in `htop`. | | But according to https://jemalloc.net/jemalloc.3.html you can | set `opt.muzzy_decay_ms = 0` to remove the delay. | | Still, the musl author has some reservations against making | `jemalloc` the default: | | https://www.openwall.com/lists/musl/2018/04/23/2 | | > It's got serious bloat problems, problems with undermining | ASLR, and is optimized pretty much only for being as fast as | possible without caring how much memory you use. | | With the above-mentioned tunables, this should be mitigated to | some extent, but the general "theme" (focusing on e.g. | performance vs memory usage) will likely still mean "it's a | tradeoff" or "it's no tradeoff, but only if you set tunables to | what you need". | a1o wrote: | Thank you! That was very thorough! I will be reading the | links.
:) | singron wrote: | Note that glibc has a similar problem in multithreaded | contexts. It strands unused memory in thread-local pools, | which grows your memory usage over time like a memory leak. | We got lower memory usage that didn't grow over time by | switching to jemalloc. | | Example of this: | https://github.com/prestodb/presto/issues/8993 | masklinn wrote: | The musl remark is funny, because jemalloc's use of pretty | fine-grained arenas sometimes leads to better memory | utilisation through reduced fragmentation. For instance | Aerospike couldn't fit in available memory under (admittedly | old) glibc, and jemalloc fixed the issue: | http://highscalability.com/blog/2015/3/17/in-memory-computin... | | And this is not a one-off: | https://hackernoon.com/reducing-rails-memory-use-on-amazon-l... | https://engineering.linkedin.com/blog/2021/taming-memory-fra... | | jemalloc also has extensive observability / debugging | capabilities, which can provide a useful global view of the | system; it's been used to debug memleaks in JNI-bridge code: | https://www.evanjones.ca/java-native-leak-bug.html | https://technology.blog.gov.uk/2015/12/11/using-jemalloc-to-... | dralley wrote: | glibc isn't totally free of such issues: | https://www.algolia.com/blog/engineering/when-allocators-are... | the8472 wrote: | Aiming to please people who panic about their RSS numbers | seems... misguided? It seems like worrying about RAM being | "used" as file cache[0]. | | If you want to gauge whether your system is memory-limited, | look at the PSI metrics instead. | | [0] https://www.linuxatemyram.com/ | TillE wrote: | jemalloc and mimalloc are very popular in C and C++ software, | yes. There are few drawbacks, and it's really easy to benchmark | different allocators against each other in your particular use | case. | kragen wrote: | basically that's why jason wrote it in the first place, but | other allocators have caught up since then to some extent.
so | jemalloc might make your c either slower or faster, you'll have | to test to know. it's pretty reliable at being close to the | best choice | | does tend to use more ram tho | secondcoming wrote: | You can override the allocator for any app via LD_PRELOAD | fsniper wrote: | The article itself is a great read and it has fascinating info | related to this issue. | | However, I am more interested/concerned about another part: how | the issue is reported/recorded and how the communications are | handled. | | Reporting is done over Discord, a proprietary environment which | is not indexed or searchable and will not be archived. | | Communications and deliberations are done over Discord and | Telegram, which is probably worse than Discord in this context. | | This blog post and the GitHub repository are the lingering remains | of them. If Xuanwo had not blogged this, it would be lost to the | timeline. | | Isn't this fascinating? | amluto wrote: | I sent this to the right people. | londons_explore wrote: | So the obvious thing to do... Send a patch to change the | "copy_user_generic" kernel method to use a different memory | copying implementation when the CPU is detected to be a bad one | and the memory alignment is one that triggers the slowness bug... | p3n1s wrote: | Not obvious. Seems like if it can be corrected with microcode, | just have people use updated microcode rather than litter the | kernel with fixes that are effectively patchable software | problems. | | The accepted fix would not be trivial to anyone not already | experienced with the kernel. But more importantly, it isn't | obvious what is the right way to enable the workaround. The | best way is probably to measure at boot time; otherwise, how | do you know which models and steppings are affected? | londons_explore wrote: | I don't think AMD does microcode updates for performance | issues, do they? I thought it was strictly correctness or | security issues.
| | If the vendor won't patch it, then a workaround is the next | best thing. There shouldn't be many - that's why all copying | code is in just a handful of functions. | p3n1s wrote: | A significant performance degradation due to normal use of | the instruction (FSRM) that is not otherwise documented is a | correctness problem, especially considering that the | workaround is to avoid using the CPU feature in many cases. | People pay for this CPU feature; now they need kernel | tooling to warn them when they fall back to some slower | workaround because of an alignment issue way up the stack. | prirun wrote: | If AMD has a performance issue and doesn't fix it, AMD | should pay the negative publicity costs rather than kernel | and library authors adding exceptions. IMHO. | pmontra wrote: | > However, mmap has other uses too. It's commonly used to | allocate large regions of memory for applications. | | Slack is allocating 1132 GB of virtual memory on my laptop right | now. I don't know if they are using mmap but that's 1100 GB more | than the physical memory. | Waterluvian wrote: | I'm not sure allocations mean anything practical anymore. I | recall OSX allocating ridiculous amounts of virtual memory to | stuff but never found OSX or the software to ever feel slow and | pagey. | dietrichepp wrote: | The way I describe mmap these days is to say it allocates | address space. This can sometimes be a clearer way of | describing it, since the physical memory will only get | allocated once you use the memory (maybe never). | byteknight wrote: | But is it not still limited by allocating the RAM + | Page/Swap size? | wbkang wrote: | I don't think so, but it's difficult to find an actual | reference. For sure it does overcommit like crazy. Here's | an output from my mac: | | % ps aux | sort -k5 -rh | head -1 | | xxxxxxxx 88273 1.2 0.9 1597482768 316064 ?? S 4:07PM | 35:09.71 | /Applications/Slack.app/Contents/Frameworks/Slack Helper | (Renderer).app/...
| | Since ps displays the vsz column in KiB, 1597482768 | corresponds to 1TB+. | aseipp wrote: | Maybe I'm misunderstanding you but: no, you can allocate | terabytes of address space on modern 64-bit Linux on a | machine with only 8GB of RAM with overcommit. Try it; you | can allocate 2^46 bytes of space (~= 70TB) today, with | no problem. There is no limit to the allocation space in | an overcommit system; there is only a limit to the actual | working set, which is very different. | j16sdiz wrote: | You can do it without overcommit -- you can just back the | mmap with a file | Pop_- wrote: | I don't know why but this really makes me laugh | aseipp wrote: | That is Chromium doing it, and yes, it is using mmap to create | a very large, (almost certainly) contiguous range of memory. | Many runtimes do this, because it's useful (on 64-bit systems) | to create a ridiculously large virtually mapped address space | and then only commit small parts of it over time as needed, | because it makes memory allocation simpler in several ways; | notably it means you don't have to worry about allocating new | address spaces when simply allocating memory, and it means | answering things like "Is this a heap object?" is easier. | rasz wrote: | dolphin emulator has a recent example of this: | https://dolphin-emu.org/blog/2023/11/25/dolphin-progress-rep... | | seems it's not without perils on Windows: | | "In an ideal world, that would be all we have to say about | the new solution. But for Windows users, there's a special | quirk. On most operating systems, we can use a special flag | to signal that we don't really care if the system has 32 GiB | of real memory. Unfortunately, Windows has no convenient way | to do this. Dolphin still works fine on Windows computers | that have less than 32 GiB of RAM, but if Windows is set to | automatically manage the size of the page file, which is the | case by default, starting any game in Dolphin will cause the | page file to balloon in size.
Dolphin isn't actually writing | to all this newly allocated space in the page file, so there | are no concerns about performance or disk lifetime. Also, | Windows won't try to grow the page file beyond the amount of | available disk space, and the page file shrinks back to its | previous size when you close Dolphin, so for the most part | there are no real consequences... " | comonoid wrote: | jemalloc was Rust's default allocator till 2018. | | https://internals.rust-lang.org/t/jemalloc-was-just-removed-... | titaniumtown wrote: | Extremely well written article! Very surprising outcome. | diamondlovesyou wrote: | AMD's string store is not like Intel's. Generally, you don't want | to use it until you are past the CPU's L2 size (L3 is a victim | cache), making ~2k WAY too small. Once past that point, it's | profitable to use string store, and should run at "DRAM speed". | But it has a high startup cost, hence 256bit vector loads/stores | should be used until that threshold is met. | rasz wrote: | Or you leave it as is forcing AMD to fix their shit. "fast | string mode" has been strongly hinted as _the_ optimal way over | 30 years ago with Pentium Pro, further enforced over 10 years | ago with ERMSB and FSRM 4 years ago. AMD get with the program. | js2 wrote: | Isn't the high startup cost what FSRM is intended to solve? | | > With the new Zen3 CPUs, Fast Short REP MOV (FSRM) is finally | added to AMD's CPU functions analog to Intel's | X86_FEATURE_FSRM. Intel had already introduced this in 2017 | with the Ice Lake Client microarchitecture. But now AMD is | obviously using this feature to increase the performance of REP | MOVSB for short and very short operations. This improvement | applies to Intel for string lengths between 1 and 128 bytes and | one can assume that AMD's implementation will look the same for | compatibility reasons. | | https://www.igorslab.de/en/cracks-on-the-core-3-yet-the-5-gh... | diamondlovesyou wrote: | Fast is relative here. 
These are microcoded instructions, | which are generally terrible for latency: microcoded | instructions don't get branch prediction benefits, nor OoO | benefits (they lock the FE/scheduler while running). Small | memcpy/moves are always latency bound, hence even if the HW | supports "fast" rep store, you're better off not using them. | L2 is wicked fast, and these copies are linear, so prediction | will be good. | | Note that for rep store to be better it must overcome the | cost of the initial latency and then catch up to the 32-byte | vector copies, which yes generally have not-as-good perf vs | DRAM speed, but they aren't that bad either. Thus for small | copies... just don't use string store. | | All this is not even considering non-temporal loads/stores; | many larger copies would see better perf by not trashing the | L2 cache, since the destination or source is often not | inspected right after. String stores don't have a | non-temporal option, so this has to be done with vectors. | js2 wrote: | I'm not sure that your comment is responsive to the | original post. | | FSRM is fast on Intel, even with single-byte strings. AMD | claims to support FSRM with recent CPUs but performs poorly | on small strings, so code which Just Works on Intel has a | performance regression when running on AMD. | | Now here you're saying `REP MOVSB` shouldn't be used on AMD | with small strings. In that case, AMD CPUs shouldn't | advertise FSRM. As long as they're advertising it, it | shouldn't perform worse than the alternative. | | https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515 | | https://sourceware.org/bugzilla/show_bug.cgi?id=30994 | | I'm not a CPU expert so perhaps I'm misinterpreting you and | we're talking past each other. If so, please clarify. | forrestthewoods wrote: | Delightful article. Thank you author for sharing! I felt like I | experienced every shocking twist and surprise in your journey like I | was right there with you all along.
| darkwater wrote: | Totally unrelated but: this post talks about the bug being first | discovered in OpenDAL [1], which seems to be an Apache | (Incubator) project to add an abstraction layer for storage over | several types of storage backend. What's the point/use case of | such an abstraction? Anybody using it? | | [1] https://opendal.apache.org/ | the8472 wrote: | There are two dedicated CPU feature flags to indicate that REP | STOS/MOV are fast and usable as short instruction sequence for | memset/memcpy. Having to hand-roll optimized routines for each | new CPU generation has been an ongoing pain for decades. | | And yet here we are again. Shouldn't this be part of some timing | testsuite of CPU vendors by now? | giancarlostoro wrote: | So correct me if I am wrong but does this mean you need to | compile two executables for a specific compile time build? Or | is it just you need to compile it from specific hardware? | Wondering what the fix would be, some sort of runtime check? | immibis wrote: | glibc has the ability to dynamically link a different version | of a function based on the CPU. | dralley wrote: | Glibc supports runtime selection of different optimized | paths, yes. There was a recent discussion about a security | vulnerability in that feature (discussion | https://news.ycombinator.com/item?id=37756357), but in | essence this is exactly the kind of thing it's useful for. | fweimer wrote: | The exact nature of the fix is unclear at present. | | During dynamic linking, glibc picks a memcpy implementation | which seems most appropriate for the current machine. We have | about 13 different implementations just for x86-64. We could | add another one for current(ish) AMD CPUs, select a different | existing implementation for them, or change the default for a | configurable cutover point in a parameterized implementation. | ww520 wrote: | Since the CPU instructions are the same, instruction patching | at startup or install time can be used. 
Just patch in the | correct instructions for the respective hardware. | the8472 wrote: | The sibling comments mention the hardware-specific dynamic | linking in glibc that's used for function calls. But if your | compiler inlines memcpy (usually for short, fixed-sized | copies) into the binary then yes, you'll have to compile it | for a specific CPU to get optimal performance. But that's | true for all target-dependent optimizations. | | More broadly compatible routines will still work on newer | CPUs, they just won't yield the best performance. | | It still would be nice if such central routines could just be | compiled to the REP-prefixed instructions and would deliver | (near-)optimal performance so we could stop worrying about | that particular part. | lxe wrote: | I wonder what other things we can improve by removing Spectre | mitigations and tuning hugepages, syscall latency, and core | affinity | lxe wrote: | So Python isn't affected by the bug because pymalloc performs | better on buggy CPUs than jemalloc or malloc? | js2 wrote: | No, it has nothing to do with pymalloc's performance. Rather, | the performance issue only occurs when `rep movsb` is used with | certain buffer alignments on AMD CPUs, and pymalloc's buffers | just happen to land on offsets that avoid it. | jokethrowaway wrote: | Clickbait title but interesting article. | | This has nothing to do with python or rust | codedokode wrote: | Why is there a need to move memory? Hardware cannot DMA data into | non-page-aligned memory? Or Linux doesn't want to load | non-aligned data? | wmf wrote: | The Linux page cache keeps data page-aligned, so if you want the | data to be unaligned Linux will copy it. | codedokode wrote: | What if I don't want to use the cache? | tedunangst wrote: | Pull out some RAM sticks. | wmf wrote: | You can use O_DIRECT although that also forces alignment | IIRC.
| eigenform wrote: | would be lovely if ${cpu_vendor} would document exactly how | FSRM/ERMS/etc are implemented and what the expected behavior is ___________________________________________________________________ (page generated 2023-11-29 23:00 UTC)