[HN Gopher] Should small Rust structs be passed by-copy or by-bo... ___________________________________________________________________ Should small Rust structs be passed by-copy or by-borrow? (2019) Author : aloukissas Score : 209 points Date : 2022-12-31 13:33 UTC (9 hours ago) (HTM) web link (www.forrestthewoods.com) (TXT) w3m dump (www.forrestthewoods.com) | dwheeler wrote: | This is one advantage of Ada, where parameters are abstractly | declared as "in" or "in out" or "out". The compiler can then | decide how to best implement it for that specific size and | architecture. | ardel95 wrote: | How is that semantically different from Rust? | | in - regular function arguments | | inout - mut function arguments | | out - function return | | Is there any additional information that a compiler can infer | from Ada's parameter syntax? | layer8 wrote: | The difference between passing by reference vs. by value is | observable when comparing pointers to the original vs. to the | argument. This difference may be unobservable in Ada though | (not sure), so Ada would have more freedom choosing between | the two. | [deleted] | [deleted] | chromatin wrote: | Dlang can also qualify parameters as in, out, and inout; | although I don't know to what degree the compiler is able to | use that for optimization purposes (it is used for safety | checks IIRC) | bvrmn wrote: | Always curious how Ada solves the ABI issue with such | optimizations in place. | rightbyte wrote: | As long as the calling convention is deterministic from the | declaration of the function it should be fine, right? | usrnm wrote: | If it's deterministic, the compiler cannot actually choose | the best way to optimise it. | [deleted] | rightbyte wrote: | It just has to come up with the same best way each time? | sampo wrote: | > This is one advantage of Ada, where parameters are abstractly | declared as "in" or "in out" or "out". | | Also Fortran has "in", "inout" and "out". 
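For readers mapping Ada's parameter modes onto Rust, a minimal sketch of the correspondence discussed above: "in" is a shared borrow (or a plain copy, for Copy types), "in out" is an exclusive borrow, and "out" is idiomatically a return value. The Vector3 type and function names here are hypothetical, chosen only for illustration.

```rust
// Hypothetical Vector3 used to illustrate Ada's parameter modes in Rust.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Vector3 { x: f32, y: f32, z: f32 }

// Ada "in": read-only access, expressed as a shared borrow.
fn length(v: &Vector3) -> f32 {
    (v.x * v.x + v.y * v.y + v.z * v.z).sqrt()
}

// Ada "in out": read-write access, expressed as an exclusive borrow.
fn scale(v: &mut Vector3, s: f32) {
    v.x *= s;
    v.y *= s;
    v.z *= s;
}

// Ada "out": the callee produces the value; idiomatic Rust returns it.
fn zero() -> Vector3 {
    Vector3 { x: 0.0, y: 0.0, z: 0.0 }
}

fn main() {
    let mut v = zero();
    v.x = 3.0;
    v.y = 4.0;
    scale(&mut v, 2.0);
    println!("{}", length(&v)); // 10
}
```

Unlike Ada, Rust fixes the passing mechanism in the signature rather than leaving it to the compiler, which is the difference the thread is circling around.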
| FpUser wrote: | So does Delphi / FreePascal | trifurcate wrote: | Also, MSVC has similar annotations for various static | analyses: https://learn.microsoft.com/en-us/cpp/code- | quality/understan... | jb1991 wrote: | And Swift also has "inout" parameters. | stephencanon wrote: | But not "out" params, sadly. | | It can return multiple values, so this doesn't matter much | for value types, but it would be nice to be able to specify | that a pointer arg is an out-param sometimes and enforce | that it is not read from while handling allocation in the | caller. | pletnes wrote: | Fortran also has <<default>> / no intent. This is somehow | different from inout. | wiz21c wrote: | and GLSL IIRC ... | Congeec wrote: | C++23 is not too late to the party | https://en.cppreference.com/w/cpp/memory/out_ptr_t/out_ptr | rwaksmunski wrote: | A question to the Rust experts: would lifetime annotations 'a | in Rust have a similar benefit as "in" or "in out" or "out" in | Ada and other languages? With the additional benefit in Rust | where the compiler can deduce those automatically for most | cases? | chc wrote: | As a sibling comment points out, "in" is effectively | equivalent to "&T", and "inout" is effectively equivalent to | "&mut T". Rust is missing purely "out" parameters, but that | isn't a very common case, and I'm not sure how much value | there is in saying "this reference can't be read" since | references are always guaranteed to be valid in Rust. | tuetuopay wrote: | This is not really surprising in such a case. The Rust compiler | is pretty good at optimizing out unneeded copies. Here it does see | that the copied value is not used after the function call, so it | should simply not emit the copies in the final assembly. | eloff wrote: | For this code, the compiler inlined the call. So there should be | no difference between pass by copy or pass by reference, which is | what was measured. Where it could matter is when the code isn't | inlined. 
But with small structs it might not matter all that | much. | | It does sometimes matter though. One optimization I've seen in a | few places is to box the error type, so that a result doesn't | copy the (usually empty) error by value on the stack. That | actually makes a small performance difference, on the order of | about 5-10%. | lukaszwojtow wrote: | I always prefer by-borrow. That's because in the future this | struct may become non-copy, which would mean some unnecessary | refactoring. My thinking is a bit like "don't take ownership if | not needed" - the "not needed" part is the most important thing. | Don't require things that are not needed. | theptip wrote: | Rust noob here - is it common to see a struct lose Copy as | things grow? | carlmr wrote: | Exactly, and if performance at some point matters: benchmark! | | And I would bet 9 times out of 10 it won't be the bottleneck or | even make a measurable difference. | QuadDamaged wrote: | Exactly why IMHO the rust stdlib is so easy to understand. | Ownership only when required as a design principle tends to | make the design of the overall system more consistent / | easier to approach. | eterevsky wrote: | If it's a 3D real-valued vector, or similarly basic structure, | you can be fairly certain that it will stay copyable. | josephg wrote: | I agree. Being copyable is part of the signature for | something like this. Explicitly so in rust. | zozbot234 wrote: | If a struct might lose Copy you shouldn't implement Copy at | all, to preserve forward compatibility. You can still derive | Clone in most cases; using .clone() does not per se add any | overhead. | redox99 wrote: | I'm surprised he tested MSVC and Clang, and not GCC, which usually | generates faster code than those two. | 3836293648 wrote: | Well, they are the two easily available compilers on Windows. 
| And rustc vs clang should be the fair comparison as they both | use llvm | im3w1l wrote: | My first thought was "now what is the calling convention for | float parameters again? they are passed in registers right? the | compiler can probably arrange so they don't have to actually be | copied" and then I realized it will probably even inline it. | | Anyway, assuming it's not inlined I would guess pass-by-copy, | maybe with an occasional exception in code with heavy register | pressure. | | Edit: Actually since it's a structure, the calling convention is | to memory allocate it and pass a pointer, doh. So it should | actually compile the same. | masklinn wrote: | > Edit: Actually since it's a structure, the calling convention | is to memory allocate it and pass a pointer, doh. So it should | actually compile the same. | | FWIW the AMD64 SysV v1.0 psABI allows structures of up to 8 | members to be passed via registers. Though older revisions | limit that to 2 (and it's unclear whether MS's divergent ABI | allows aggregates to be splat at all). | | As sad as it's unsurprising, it does not look like LLVM | (linux?) has followed up; on godbolt a 2-struct passes | everything via registers but a 3-struct passes everything via | the stack. Maybe there's a magic flag to use the 1.0 ABI, but a | quick googling didn't reveal one. ICC doesn't seem to have | followed up either. | unsafecast wrote: | > Edit: Actually since it's a structure, the calling convention | is to memory allocate it and pass a pointer, doh. So it should | actually compile the same. | | Depending on calling convention, the structure may be spread | out into registers. 
| | Performance-wise, if you're likely to touch every element in a | type anyway, err on the side of copies. They are going to have to | end up in registers eventually anyway, so you might as well let | the caller find out the best way to put them there. | BooneJS wrote: | Folks, processors continue to give smaller and smaller gains | every year. Something has to give. If you have critical path code | that absolutely must max out the core, then this type of analysis | (as pedantic as it is) is useful in the long run. | mcguire wrote: | This is one of those questions where you really, honestly, do | need to look at a very low level. | | Back in the ancient days, I worked at IBM doing benchmarking for | an OS project that was never released. We were using PPC601 | Sandalfoots (Sandalfeet?) as dev machines. A perennial fight was | devs writing their own memcpy using *dst++ = *src++ loops rather | than the one in the library, which was written by one of my | coworkers and consisted of 3 pages of assembly that used at least | 18 registers. | | The simple loop was something like X cycles/byte, while the | library version was P + (Q cycles/byte) but the difference was | such that the crossover point was about 8 bytes. So, scraping out | the simple memcpy implementations from the code was about a | weekly thing for me. | | At this point, we discovered that our C compiler would pass | structs by value (This was the early-ish days of ANSI C and was a | surprise to some of my older coworkers.) and benchmarked _that_. | | And discovered that its copy code was _worse_ than the simple | *dst++ = *src++ loops. By about a factor of 4. (The simple loop | would be optimized to work with word-sized ints, while the | compiler was generating code that copied each byte individually.) | | If you are doing something where this matters, something like | VTune is very important. So is the ability to convince people who | do stupid things to stop doing the stupid things. 
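To make the trade-off above concrete, here is a sketch of the two calling styles the article benchmarks (the names are hypothetical; actual codegen depends on inlining and the platform ABI, which is exactly the thread's point):

```rust
#[derive(Clone, Copy)]
struct Vector3 { x: f32, y: f32, z: f32 }

// By-copy: the 12-byte struct may travel in registers (ABI-dependent).
fn dot_copy(a: Vector3, b: Vector3) -> f32 {
    a.x * b.x + a.y * b.y + a.z * b.z
}

// By-borrow: passes a pointer; the callee loads the fields through it.
fn dot_borrow(a: &Vector3, b: &Vector3) -> f32 {
    a.x * b.x + a.y * b.y + a.z * b.z
}

fn main() {
    let a = Vector3 { x: 1.0, y: 2.0, z: 3.0 };
    let b = Vector3 { x: 4.0, y: 5.0, z: 6.0 };
    // Once the calls are inlined, both versions compile to identical code.
    assert_eq!(dot_copy(a, b), dot_borrow(&a, &b));
    println!("{}", dot_copy(a, b)); // 32
}
```

Inspecting both functions on a tool like godbolt with inlining disabled (`#[inline(never)]`) is the quickest way to see the difference the commenters are debating.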
| cmrdporcupine wrote: | There is no single answer to this question because it's going to | depend completely on call patterns further up. Especially in | regards to how much of the rest of the running program's data | fits in L1 cache, and _most especially_ in regards to what's | going on in terms of concurrency. | | The benchmark made here could completely fall apart once more | threads are added. | | Modern computer architectures are non-uniform in terms of any | kind of memory access. The same logical operations can have | extremely varied costs depending on how the whole program flow | goes. | m00dy wrote: | It is a problem of statistics and depends on the internals of the | underlying operating system. I'm not sure you really need that | sort of optimisation. | eloff wrote: | What does this have to do with the operating system? There are | no syscalls in the code measured here. | masklinn wrote: | The "C ABI" is really "the platform ABI", because most OSes are | interacted with through libc (or equivalent). | | Though that should not apply to Rust at all, as it does not | pledge to follow the C ABI internally (aka `extern "Rust"`). | m00dy wrote: | Because Rust is a compiled language, which means you compile your | code for a certain architecture. Who told you about syscalls? | There are systems not using syscalls. | eloff wrote: | There are no syscalls or equivalent operating system calls | in the code paths measured. The architecture is also | independent of the operating system, with exceptions in | some languages for the calling convention (not in Rust, | afaik, or at least Rust makes no guarantees there). | CryZe wrote: | However, in practice Rust's calling convention does | actually depend on the operating system. So on Linux Rust | will make use of the stack red zone, while on Windows it | doesn't. (Also some codegen in LLVM depends on the | operating system.) | pclmulqdq wrote: | The dependency is on the ABI, which can be OS-dependent. 
| Also, it is a depressingly manual optimization to do: | compilers don't know when it is safe to change a reference to | a copy (for example) without an analysis of future code that | they don't do. | littlestymaar wrote: | With Rust ownership guarantees, the compiler has the info | it needs to perform this kind of optimization. | pclmulqdq wrote: | C and C++ are also set up such that the compiler can do | that optimization, they just don't. I'm pretty sure the | Rust compiler is in the same boat - has the information, | but doesn't do the optimization. | forrestthewoods wrote: | Oh neat, that's my blog. My old posts don't resurface on HN that | often. | | Lots of criticism of my methodology in the comments here. That's | fine. That post was more of a self nerd snipe that went way | deeper than I expected. | | I hoped that my post would lead to a more definitive answer from | some actual experts in the field. Unfortunately that never | happened, afaik. Bummer. | brundolf wrote: | Maybe it'll happen here! :) | the_mitsuhiko wrote: | My only criticism is the "ugly mess" part. You can implement | the traits on references too. | forrestthewoods wrote: | True, that does work for traits. But it's super annoying if | you have to write multiple copies of the same thing. That can | get out of control quickly if you need to implement every | combination. | | And that doesn't help at all if you're writing a "free | function" like 3D primitive intersection functions. I suppose | you could change that simple function into a generic function | that takes AsDeref? Bleh. | adham01 wrote: | [dead] | arcticbull wrote: | > Blech! Having to explicitly borrow temporary values is super | gross. | | I don't think you ever have to write code like this. Implement | your math traits in terms of both value and reference types like | the standard library does. 
| | Go down to Trait Implementations for scalar types, for instance | i32 [1] | | impl Add<&i32> for &i32 | | impl Add<&i32> for i32 | | impl Add<i32> for &i32 | | impl Add<i32> for i32 | | Once you do that your ergonomics should be exactly the same as | with built in scalar types. | | [1] https://doc.rust-lang.org/std/primitive.i32.html | datafulman wrote: | [dead] | FpUser wrote: | I did the test on my computer: | | Rust - By-Copy: 14124, By-Borrow: 8150 | | C++ - By-Copy: 12160, By-Ref: 11423 | | P.S. Just built it using LLVM under CLion IDE and the results | are: G:\temp\cpp\rust-cpp-bench\cpp\cmake\cmake- | build- release\fts_cmake_cpp_bench.exe Totals: | Overlaps: 220384338 By-Copy: 4397 By-Ref: 4396 | Delta: -0.0227428% Process finished with exit code 0 | jeffbee wrote: | How did you build it? It doesn't build with either gcc-12 or | clang-15 on linux. | FpUser wrote: | I built it on Windows, Visual C++ 2022. Did not check Linux | as I do not think it matters much. | | Now comes big surprise: I just built it using LLVM under | CLion IDE and the results are: | G:\temp\cpp\rust-cpp-bench\cpp\cmake\cmake-build- | release\fts_cmake_cpp_bench.exe Totals: | Overlaps: 220384338 By-Copy: 4397 By-Ref: | 4396 Delta: -0.0227428% Process finished with | exit code 0 | 29athrowaway wrote: | A more direct comparison would have been a r-value reference. | Rustwerks wrote: | I just went through all of this when building a raytracer. | | * Sprinkling & around everything in math expressions does make | them ugly. Maybe rust needs an asBorrow or similar? | | * If you inline everything then the speed is the same. | | * Link time optimizations are also an easy win. | | https://github.com/mcallahan/lightray | masklinn wrote: | > * Sprinkling & around everything in math expressions does | make them ugly. Maybe rust needs an asBorrow or similar? 
| | Do you mean AsRef, or do you mean magic which automatically | borrows parameters and is specifically what rust does not do | any more than e.g. C does? | | Though you can probably get both _if_ the by-ref version is | faster (or more convenient internally): wrap the by-ref | function with a by-value wrapper which is #[inline]-ed, this | way the interface is by value but the actual parameter passing | is byref (as the value-consuming wrapper will be inlined and | essentially removed). | woodruffw wrote: | > Maybe rust needs an asBorrow or similar? | | FWIW, the `Borrow`, `AsRef`, and `Deref` traits all exist to | support different variants of this. | lowbloodsugar wrote: | I understand that this is an example for the purposes of | answering the given question, but when actually doing things with | 3D vertices one should be thinking in terms of structures of | arrays. As someone said here already: good generals worry about | strategy and great generals worry about logistics. | spuz wrote: | I'd be interested to know what the benchmarks of the two rust | solutions are when inlining is disabled so we can get an idea of | the different performance characteristics of each function call | even if it's not a very realistic scenario. | | The other question I have is which style should you use when | writing a library? It's obviously not possible to benchmark all | the software that will call your library but you still want to | consider readability, performance as well as other factors such | as common convention. | ptero wrote: | I would go with the version that gives the clean user interface | (that is, by copy in this case). _If_ it turns out that the other | version is significantly more performant _and_ this additional | performance is critical for the end users consider adding the by- | borrow option. 
| | The clarity of the code using a particular library is such a big | (but often under-appreciated) benefit that I would heavily lean | in this direction when considering interface options. My 2c. | daviddever23box wrote: | Agreed - and this applies in nearly every language: start | simple, trust your compiler, and optimize only when performance | becomes untenable. | osigurdson wrote: | The assumption behind such arguments is that when a performance | problem does arise, a profiler will point to a single, | easy-to-fix smoking gun. Unfortunately this is not always the | case. Performance problems can be hard to diagnose and hard | to fix. A lot of damage has been done by the unexamined / | dogmatic "root of all evil" mantra. | mattgreenrocks wrote: | The misapplication of that mantra doesn't justify the | design damage done by dogmatically passing everything by | ref. | | There's no hard and fast rule here. Even if there was, | optimizers still occasionally surprise seasoned native devs | in both positive and negative ways. | | Glad the author's first instinct was to pull out profiling | tools. | throw10920 wrote: | In the vast majority of situations (1) you'll prematurely | optimize in the wrong place and (2) yes the profiler _will_ | point to a single, easy-to-fix smoking gun. | | Situations otherwise are the exception, rather than the | rule, and it takes an expert to (1) recognize those | situations and (2) know exactly how to write optimized code | in that situation. | | That's why "don't prematurely optimize" is a good rule of | thumb - because it works the majority of the time, and it | takes experience to know when not to apply it. | osigurdson wrote: | Suggest acquiring the needed knowledge instead of | applying dogma. The true root of all evil is unexamined | dogma. | kllrnohj wrote: | > In the vast majority of situations [..] yes the | profiler will point to a single, easy-to-fix smoking gun. 
| | [citation needed] | | This claim depends hugely on the industry you're actually | working in and the problem space. Things like UIs & games | basically never have a single, easy-to-fix smoking gun. | The _entire app_ is more or less a hotspot - be it | interactive performance, startup performance, RAM usage, | or general responsiveness. | | And once you've gone down the route of "build it first, | optimize it later" you're pretty much fucked when you get | to the "optimize" step because now your performance | mistakes are basically unfixable without a rewrite - | every layer of your architecture has issues that you | can't fix without drastic overhauls. It would have been | _much_ easier to do some up-front measurements, get some | guidelines in place (even if they aren't perfect), and | _then_ build the app. | kllrnohj wrote: | This advice hinges _hugely_ on what "start simple" really | means. There's a ton of counter-examples here where that just | isn't true at all depending on what you're calling "simple". | In particular JIT'd languages can be especially problematic | here. An example would be using Java's Streams interfaces to | do something that could be done without much difficulty with | a regular boring ol' for loop. At the end of the day you're | hoping the JIT will eventually convert the streams version | into the same bytecode the for loop version would have | started with. But it won't do that consistently, and you've | still wasted time before it did so. | | Trusting the compiler also means knowing what the compiler | actually understands & handles vs. what's a library-provided | abstraction that's maybe too bloated for its own good and | that quickly becomes "not simple" depending on your language | of choice. | mlindner wrote: | I agree in general, but the side-effect of doing this is that | no matter how fast your hardware gets, your software will | always end up optimized to the new hardware. 
So over time | your software gets slower and slower but performance stays | consistent as hardware gets faster. | ardel95 wrote: | Minor nit: many of the differences in the article aren't really | specific to Rust vs C++, but rather differences between llvm | vs whatever compiler backend is used by msvc. | amelius wrote: | This is one of the problems I have with writing rust code. You | have to think about so many mundane details that you barely have | time left to think about more important and more interesting | things. | mlindner wrote: | Having written a lot of C, you spend basically all your time | thinking about "mundane details", and worse, if you make a | mistake, you often don't know you made a mistake until it's | running on some customer's servers and you just got a ticket | escalated 3 times up to you with vague information about | crashes rarely happening. Good luck remembering which bit of | code you wrote 6 months ago may be causing the problem. | | I'll take Rust shouting at me for missing "mundane details" any | day of the week. | scotty79 wrote: | As a Rust beginner that likes to learn the hard way I think I | have some insights into why Rust seems cumbersome and/or hard for | programmers trying it. | | Rust uses syntax that feels familiar but means completely | different things than in pretty much any other language. | | For example '=' doesn't mean assign handle or copy. It by | default means move. | | 'let' doesn't mean create a name for something. It means create | physical space for something (of known size) that can be moved | into or moved out of. | | You don't deal with objects and values of primitive types. | Instead everything in Rust is a value. When you move, you move | the value. If you compare, you compare by value. If you pass | something from a variable into a function, you move the value into | the function. 
| | And when the space where you keep the value goes out of scope, | the value dies with it if it wasn't moved out to somewhere else. | | Scope for variables (which are just named spaces for values) | ends with the end of the block, but some values, created by | functions and returned from them, if they are not moved into | any named space, can die sooner, even in the middle of the line | where they were acquired from a function call. | | Everything else stems from that fixed-size moved-value | semantics. If you don't want to move the value into the | function when you call it you need to pass something else | instead, so you create and pass in a borrow. But you have to | ensure that the value doesn't die or get moved anywhere (even | inside the container you borrowed from) before all borrows to it | die. | | Because of this you are better off with borrows that are short- | lived and local. Often it's better to keep the index of an | element of a Vec instead of the borrow of this element. If you | must create types that contain borrows you must know that they | become borrows themselves and you need to treat them exactly | the same, trying to limit their scope and lifetime. | | It's hard when you come from any other language because borrows | are superficially similar to pointers or references to objects. | So you try to use them as such. And crash into the compiler | because they are not that. What's worse, their syntax is very | minimalistic, which triggers the intuition that they must be a | fast and optimal solution for many problems, which they sure can | be once you fully internalize their limitations but not a moment | sooner. | | Another thing is that values in Rust must have a fixed size. | So even as simple a thing as a string requires a bit of hackery. 
| Basically in Rust the default strategy to have something of | variable size is to allocate it on the heap and treat a pointer | to it (possibly with some other fixed-size data like length) | as the fixed-size value you can move around, clone and borrow. | | So if you want to have the semantics you know from other languages | you can't just use basic Rust syntax. | | You need constructs such as Box and Rc, Cell, RefCell. Make | your things clonable and sometimes even copyable and avoid | creating borrows whenever possible initially. When you do it | Rust becomes as flexible as any other language and you can use | it pretty much just as comfortably. Then the value semantics | shines as you can very easily compare your data by value, order | it, create operators for it, create a hash for it so you can keep | it in HashMaps and HashSets. Then it's delightful. | | My advice is when you create a long-lived type just wrap it in | Rc and treat this Rc as your 'object'. And avoid borrows in | your types unless you have a very good performance (measured) | reason to have them or you are creating something obviously | dependent and usually short-lived like an iterator. | zozbot234 wrote: | > For example '=' doesn't mean assign handle or copy. It by | default means move. | | Some complain about this, but the fact is there's no such | thing as a zero-overhead "copy" for non-trivial types. C++ | started out with = meaning clone the object, which was an even | bigger footgun, and support for move had to be added after | the fact. | scotty79 wrote: | Yeah. Making = mean copy is a really bad idea. I very much | like the solution in Rust where an attempt to move out | something that can't be moved out results in an automatic copy | if the type implements the Copy trait. | | It's a very elegant solution for simple, small data types. | But it further occludes how meaningfully Rust is different | from everything else because thanks to that = sometimes | does mean copy. 
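A small sketch of the move-vs-Copy behavior described above (the Buffer and Point types are hypothetical, chosen only to show the two cases):

```rust
// '=' moves by default; only types that implement Copy are duplicated.
#[derive(Clone)]
struct Buffer { data: Vec<u8> }

#[derive(Clone, Copy)]
struct Point { x: i32, y: i32 }

fn main() {
    let a = Buffer { data: vec![1, 2, 3] };
    let b = a; // move: `a` can no longer be used
    // println!("{}", a.data.len()); // error[E0382]: borrow of moved value: `a`
    println!("{}", b.data.len()); // 3

    let p = Point { x: 1, y: 2 };
    let q = p; // copy: `p` remains valid
    assert_eq!(p.x + q.y, 3);
}
```

Buffer cannot be Copy because it owns heap memory, which is exactly why `=` falls back to a move; Point is plain fixed-size data, so the compiler duplicates it silently.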
| amelius wrote: | But Rc is only useful for creating tree-like data-structures. | | One non-tree cross-link or back-link and you'll have to | redesign your entire code. | puffoflogic wrote: | Sibling and parent pointers are almost universally a sign | that an abstract data structure (and associated algorithm) | has been mistaken for concrete. The exception that comes to | mind first is Knuth's dancing links, and its obscurity is | an indication of the rarity of actually needing these | pointers. In any case, it's also a poster child for using | indices rather than pointers. | scotty79 wrote: | Currently I'm working on constructing proofs of | tautologies directly from the system of axioms using | substitution and modus ponens rule. | https://en.m.wikipedia.org/wiki/List_of_Hilbert_systems | | Main objects in my program are expression trees. I | manipulate them, cut them, merge them, compare them, | splice one into the other. Rc's enable me to have full | flexibility and share tremendous amount of data across | objects in my program. | | Rust is absolutely wonderful language for this problem | thanks to Rc's, enums, value semantics, auto-deriving | traits and ability to implement traits for existing types | and of course speed. | | I'm not implementing specific algorithms. I'm making them | up as I go although I used some simple ones like | topological sort or A* that eventually turned into just | breadth search because I have no idea how far I am from | the solution. | amelius wrote: | > I'm not implementing specific algorithms. I'm making | them up as I go | | It's mindboggling to me that people are using a systems | programming language for mathematical research, | especially if they don't know yet what the final | algorithms will look like. | | But all the more power to you for trying. 
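Besides wrapping back-links in Option<Rc<...>> and manually breaking cycles, the standard-library alternative for the back-link problem discussed above is Weak, which does not contribute to the strong count and so cannot leak a cycle. A minimal sketch, with a hypothetical node shape:

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

// A tree node with an owning child list and a non-owning parent back-link.
// The Weak back-link avoids the Rc<->Rc reference-count cycle.
struct Node {
    value: i32,
    parent: RefCell<Weak<Node>>,
    children: RefCell<Vec<Rc<Node>>>,
}

fn main() {
    let root = Rc::new(Node {
        value: 1,
        parent: RefCell::new(Weak::new()),
        children: RefCell::new(vec![]),
    });
    let child = Rc::new(Node {
        value: 2,
        parent: RefCell::new(Rc::downgrade(&root)),
        children: RefCell::new(vec![]),
    });
    root.children.borrow_mut().push(Rc::clone(&child));

    // The back-link can be followed while the parent is alive...
    assert_eq!(child.parent.borrow().upgrade().unwrap().value, 1);
    // ...but it does not keep the parent alive, so nothing leaks.
    assert_eq!(Rc::strong_count(&root), 1);
}
```

This keeps the "almost-tree" ownership a proper tree (parents own children), with cross-links and back-links as Weak observers.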
| scotty79 wrote: | If you further wrap Rc in an Option you can set crosslinks | and backlinks to None when you are dropping your data to | get rid of the problem of crosslinks or backlinks making | reference-counting leak memory. Then you just need to be | mindful not to lose the handle to a cycle of your nodes before | you break the cycle by setting some crosslinks to None. | | You can fairly easily refactor your almost-tree code to | adapt it to that additional Option wrap. | | Of course you might instead opt to introduce some garbage | collector crate into your project. They usually provide a | garbage-collected Rc equivalent, which makes swapping it | out very easy. | | Rc's are really a very useful first approach to making | anything complex in Rust. | | I usually have something like struct | NodeStruct { my_data: i32, link: Node | } | | and struct Node(Rc<NodeStruct>); | | or struct Node(Option<Rc<NodeStruct>>); | | if I need cross-links. | | Great thing is you can then add 'methods' to your type with | impl Node {} | | Or define operators and other traits with: | impl Add<Node> for Node {} | | Sometimes, when I need mutability I even wrap the | NodeStruct in RefCell. | | It seems like a lot of wrappers but thanks to them you can | have very nice code that uses this type that has pretty | much 'normal modern language' semantics + value semantics | and is still blazing fast. | | When you implement Ord, Eq, Hash they all go through all | the wrappers and let you treat your final type Node as a | comparable, sortable, hashable and cheaply clonable value. | Dereferencing also goes through all or most of the wrappers | automatically. | ReflectedImage wrote: | Rc, Cell & RefCell are supposed to be rare. For example I've | got a 2,000-line Rust program in front of me and I've used Arc | 3 times and RwLock 1 time, that's all. | | You need to structure your program as a Directed Acyclic | Graph (DAG), with things interacting only with the things | below them in the graph. 
| | Then occasionally you might need to break the DAG structure | by using Rc, Cell & RefCell, etc... | scotty79 wrote: | The thing is, not everything can be expressed as a DAG. | | And finding that out towards the end of writing your program, | after hours of fighting with the borrow checker, is extremely | unpleasant. | | And I don't think I ever landed in a situation where I | could fix the discrepancy by sprinkling in a few Rc, RefCells | and such. | | So I prefer to write with RefCells from the start, and when | I've got the thing working and am ambitious enough I look | at which parts could be borrows instead and swap them out. | ReflectedImage wrote: | There are many many many ways you can express something | and it's very likely that one of those ways is a DAG. | | The issue here is that you are writing C++ code rather | than Rust code. | scotty79 wrote: | How do you express as a DAG a tree where nodes need to | keep references to their children and parents? | | Two separate synced trees? Is it worth it? | | > The issue here is that you are writing C++ code rather | than Rust code. | | How dare you! I'm writing TypeScript code! ;-) | | Rust is not Forth. I can write whatever I want and | there's nothing wrong with that. | ReflectedImage wrote: | > How do you express as a DAG a tree where nodes need to | keep references to their children and parents? | | Rewrite your program in a form where it does not contain | a tree. | | If you want an actual tree as a data structure, see the | trees crate. | | > Rust is not Forth. I can write whatever I want and | there's nothing wrong with that. | | And other people write Haskell code in Python :p. If your | code style doesn't match the language you are using you | are going to have a lot of unnecessary friction. | scotty79 wrote: | I think Rust is flexible enough to still work very well | with my style. | | But you inspired me about something. 
I think I can | rewrite the program that I am writing to use reverse | Polish notation instead of a tree. Thanks! | ReflectedImage wrote: | Good luck! | matheusmoreira wrote: | Well, it _is_ a systems programming language. Thinking about | exactly how the language passes bits around is the whole point. | Rust should specify a stable ABI already so that everyone can | form a good mental model of what their code becomes once | compiled. | amelius wrote: | True, I probably was using Rust for the wrong type of | problem, i.e. was hoping to write a user-level application | with a graphical UI at the time. | | Rust is probably better used for writing fast low-level | libraries that you call from higher level languages, possibly | with a garbage collector, so you don't waste time thinking | about memory management while you design/write your high- | level application. | throw10920 wrote: | My (brief) experience with Rust was that, while I had to | struggle to learn the borrow-checker, I didn't have lots of | "mundane details" to worry about - if any, less than C(++). | | What did you have in mind? | kibwen wrote: | Note that this is from 2019, so it's probably worth re- | benchmarking to see if anything has changed in the interim. Can | we get the year added to the title? | bjackman wrote: | A potential lesson here (i.e. I am applying confirmation bias to | retroactively view this article as justification for a strongly | held opinion, lol): | | Unless you are gonna benchmark something, for details like this | you should pretty much always just trust the damn compiler and | write the code in the most maintainable way. 
| | This comes up in code review a LOT at my work: | | - "you can write this simpler with XYZ" | | - "but that will be slower because it's a copy/a function call/an | indirect branch/a channel send/a shared memory access/some other | combination of assumptions about what the compiler will generate | and what is slow on a CPU" | | I always ask them to either prove it or write the simple thing. | If the code in question isn't hot enough to bother benchmarking | it, the performance benefits probably aren't worth it _even if | they exist_. | Dobbs wrote: | edit: I misread the previous post. Ignore this. | | How are you using the word simpler? Because to me that implies | a combination of more obvious and number of lines of code. | Something that a benchmark shouldn't be involved in. | | For example asking someone to delete 10 lines of code and | instead use go's ` net.SplitHostPort` would be an example of | "simpler". | 2OEH8eoCRo0 wrote: | I've read that good generals worry about tactics and great | generals worry about logistics. | | Good programmers play code golf, great programmers write | readable and maintainable code. | | Your example seems reasonable but programmers also like to | act like the smartest one in the room. I often come across | tricky and borderline obfuscated code because somebody wanted | to look clever. This is a logistical nightmare. | tested23 wrote: | Ugh, you are right but then someone comes and uses this to | rationalize not including things like map, filter and | reduce in a language because they are supposedly too | complicated and you can just do it with a for loop | baby wrote: | I work in a Rust codebase that uses a lot of functional | functions, and I'll say this: on average the imperative | style takes less lines of code and less indentation. I | also find it more readable personally, and idiomatic. 
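For a concrete sense of the trade-off being debated in this subthread, here is the same computation in both styles (hypothetical example, not from the article; in release builds the two typically compile to near-identical code):

```rust
// Sum of the squares of the even numbers, imperative style.
fn sum_even_squares_imperative(xs: &[i64]) -> i64 {
    let mut total = 0;
    for &x in xs {
        if x % 2 == 0 {
            total += x * x;
        }
    }
    total
}

// The same computation with iterator adapters.
fn sum_even_squares_functional(xs: &[i64]) -> i64 {
    xs.iter()
        .filter(|&&x| x % 2 == 0)
        .map(|&x| x * x)
        .sum()
}

fn main() {
    let data = [1, 2, 3, 4, 5, 6];
    // 4 + 16 + 36
    assert_eq!(sum_even_squares_imperative(&data), 56);
    assert_eq!(sum_even_squares_functional(&data), 56);
    println!("ok");
}
```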
 | nicoburns wrote:
 | Functional iteration is good for the same reason we use
 | for loops over while loops, and while loops over goto:
 | they are more constrained, more clearly communicate
 | intent, and are therefore easier to reason about.
 | josephg wrote:
 | Sure but it's easy to go overboard with this stuff.
 | Reduce (fold) especially can be pretty hard to read in
 | hairy situations.
 |
 | My general rule is that if you can implement your logic in
 | fewer lines of code with a simple for loop, you probably
 | should.
 | josephg wrote:
 | Just because we're on the topic of performance: the Rust
 | optimizer can sometimes generate better code if you use
 | map / filter / etc. The slice iterator in any context is
 | a huge win over manual array iteration because it only
 | needs to do bounds checking once.
 |
 | JavaScript (v8, last I checked) is the opposite. Simple
 | for loops almost always outperform anything else.
 | duckerude wrote:
 | I've seen cases where an iterator was better, but I've
 | also seen gains from using an imperative loop with manual
 | indexing. Loop conditions and the occasional assertion
 | can be enough to elide bounds checks. (Though sometimes
 | the compiler gets too paranoid about integer overflow.)
 |
 | Most of the time you should just write whatever's
 | clear/convenient but sometimes it's worth trying both and
 | scrutinizing godbolt.
 | karamanolev wrote:
 | That's what they're saying: if someone says anything more
 | complicated is faster, they challenge them to benchmark it.
 | Usually, it turns out whoever argues the "is faster" point
 | doesn't bother to benchmark it and the simpler code-wise
 | thing wins out. So yes - the benchmark goes to performance,
 | simplicity is in lines-of-code, cyclomatic, "in the eye of
 | the beholder" or whatever other metric you choose, but
 | usually it's obvious.
 | Diggsey wrote:
 | One neat thing here is that the compiler is aware of which
 | types are `Copy` and not internally mutable (not containing
 | an `UnsafeCell`).
For these types, passing `&T` and `T` are
 | equivalent, so the compiler could just choose the faster
 | option.
 |
 | Even if it's not smart enough to do that today, it could
 | implement this optimization in the future. This could work
 | even without inlining: since the Rust calling convention is
 | unstable, an optimization based on type size could be
 | incorporated into it.
 | zozbot234 wrote:
 | It would be more advisable to add this as a clippy hint,
 | because `&T` and `T` are not always equivalent wrt. FFI.
 | kibwen wrote:
 | Indeed, but the compiler is still capable of doing it on a
 | case-by-case basis. Quite often the observed semantics are
 | identical and it's easy for the backend to see that a
 | pointer has been created only to be immediately
 | dereferenced.
 | cwzwarich wrote:
 | It would be nice if Rust could do this, but it breaks
 | backwards compatibility. Some existing code depends on
 | pointer values of &T being equal or not equal.
 | comex wrote:
 | As an addendum, LLVM _can_ automatically perform the
 | "&T-as-T" optimization (without inlining) in some cases
 | where the callee function is in the same compilation unit
 | and is known not to care about the pointer value. However,
 | these types of optimizations tend to be fragile, easily
 | disturbed when things get slightly complex.
 | throwaway894345 wrote:
 | I generally agree, but it's also not obvious to me in Rust
 | (or in Go) whether passing by reference or by copy is more
 | maintainable or clear. I guess what I want is some guidance
 | on what I should do by default, which you sort of give with
 | "do what is more maintainable", but I can't tell what that
 | means in practice (I've been told to default to
 | pass-by-reference in the past because most traits take &self
 | and not self).
 | kibwen wrote:
 | _> I've been told to default to pass-by-reference in the past
 | because most traits take &self and not self_
 |
 | This is only blanket advice for designing traits, because as
 | the trait author you don't know what concrete type the
 | downstream user is going to want to use, and taking `&self`
 | in that circumstance is the choice that is friendliest to
 | both Copy and non-Copy types.
 |
 | If you're just writing a non-generic function and you _do_
 | know what concrete types you're using, the flowchart is
 | pretty simple:
 |
 | 1. If the type is not Copy, then pass by-ref if you just need
 | to read the value, pass by-mutable-ref if you need to
 | mutate the value, and pass by-value if you want to consume
 | the value.
 |
 | 2. If the type is Copy, then pass by-value, but if your type
 | is _really_ big or if benchmarking has determined that this
 | is a critical code path then pass by-ref.
 | bjackman wrote:
 | Yeah, totally agree it's not always/usually obvious. But
 | there are cases where there's a clear readability/assumed-
 | performance tradeoff and in those cases I say always prefer
 | readability (unless you benchmark).
 | __turbobrew__ wrote:
 | I would still consider myself a Go novice, but I have been
 | burned a number of times passing simple objects by reference
 | and then that object gets mutated causing subtle bugs. Also,
 | Go is happy to blow your foot off if you take the reference
 | of a loop variable. Although, there is a proposal to fix
 | that.
 |
 | Generally I find that fewer bugs get introduced when using
 | copy instead of pass by reference, but I'm sure others have
 | the opposite opinion.
 | mcguire wrote:
 | I have had the same results. Passing by copy is simpler and
 | less bug-prone and reduces the urge to "just set the value
 | since I have a reference to the object", which is a well-
 | paved road to significant pain.
 |
 | And the objects have to get surprisingly large before
 | passing by reference really makes a difference.
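kibwen's flowchart above, written out as signatures (a sketch; `Point` and `Inventory` are hypothetical types, not from the article):

```rust
// Small Copy type: rule 2 says just pass it by value.
#[derive(Clone, Copy)]
struct Point { x: f32, y: f32 }

// Non-Copy type (owns a heap allocation): rule 1 applies.
struct Inventory { items: Vec<String> }

// Non-Copy, read only: shared borrow.
fn count(inv: &Inventory) -> usize { inv.items.len() }

// Non-Copy, mutate: mutable borrow.
fn add_item(inv: &mut Inventory, item: String) { inv.items.push(item); }

// Non-Copy, consume: take by value.
fn into_items(inv: Inventory) -> Vec<String> { inv.items }

// Copy and small: by value.
fn length(p: Point) -> f32 { (p.x * p.x + p.y * p.y).sqrt() }

fn main() {
    let mut inv = Inventory { items: vec!["sword".to_string()] };
    add_item(&mut inv, "shield".to_string());
    assert_eq!(count(&inv), 2);
    assert_eq!(into_items(inv).len(), 2); // consumes inv
    assert_eq!(length(Point { x: 3.0, y: 4.0 }), 5.0);
    println!("ok");
}
```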
 | 411111111111111 wrote:
 | This is about Rust though and that's not really possible
 | there (at least to my knowledge). You should get a compiler
 | error if you attempt this.
 |
 | I've got very little experience in Rust though, so there
 | might be a way (that I'm just not aware of) to circumvent
 | this check
 | [deleted]
 | saghm wrote:
 | I don't think one of them is more clear or maintainable
 | universally, but in a lot of contexts, there might be an
 | obvious choice. As a trivial example (that isn't quite fair
 | given that the topic is about structs), it will almost never
 | be more clear or maintainable to pass a shared reference to
 | an integer (although there may be cases where a mutable
 | reference might make sense). I don't think there's much need
 | for one to have precedence over the other by default; if
 | anything, I see the discussion about performance tradeoffs
 | not being worth fretting about in the absence of actual
 | measurement to be an argument _against_ one of them being
 | inherently preferable.
 | jackmott wrote:
 | [dead]
 | forrestthewoods wrote:
 | Blog author here. I somewhat agree, somewhat disagree. This
 | line makes me uneasy:
 |
 | > I always ask them to either prove it or write the simple
 | thing. If the code in question isn't hot enough to bother
 | benchmarking it, the performance benefits probably aren't
 | worth it _even if they exist_.
 |
 | One of my philosophies is that death by a thousand cuts is
 | fine, but death by ten thousand cuts isn't. A team of 10
 | engineers can probably fix most of a thousand cuts in two or
 | three months. But if you have ten thousand cuts you're
 | probably doomed. And those don't show up cleanly in a flame
 | graph.
 |
 | Now for some context: my background is video games, which
 | means the team knows they need to hit an aggressive
 | performance bar. This isn't true for many projects.
 | shared_ptr is a canonical example of death by ten thousand
 | cuts.
 |
 | That said, I strongly agree with the principle of "just do
 | the simple thing".
However I think it's important to have "sane | defaults". A project can easily have a thousand or ten thousand | papercuts that kill performance. But you can't microbench every | tiny decision. And microbenches are only a vague approximation | of what actually matters. | | I'm also wary of "the compiler will make it fast". Because | that's true... until it's not! Although these days you don't | have any choice but to lean heavily on the compiler and "trust | but verify". | | No one wants a super complex solution if it's not needed. | However I am very amenable to "do a slightly more complex thing | if you know it's correct and we can never think about this ever | again". It's much easier to do the fast thing upfront than for | someone else to try and speed it up in two years when we're | doing a papercut pass. | klyrs wrote: | > But if you have ten thousand cuts you're probably doomed. | And those don't show up cleanly in a flame graph. | | I am reminded of the lovely nanosecond/microsecond talk by | Grace Hopper. If your code does a little bit of setup and | then spends all of its time in a single hotspot, fine. But if | your code is full of microsecond-suboptimal speed bumps, you | can probably hide your hotspot altogether. And a flat-ish | flame graph looks fine: nothing stands out as a problem! | | It's valuable to do micro-benchmarks, not just to hone your | optimization skills, but to learn optimal patterns in your | language of choice. Then, when you're "in the zone" and | laying down new code, you just do the optimal thing | reflexively. Or, when you're reviewing or rewriting | something, those micro-hotspots jump out and grab your | attention. | | There's a reason that ancient software running on ancient | hardware is way more responsive & snappy than what we have | today. Laziness. | Dylan16807 wrote: | > There's a reason that ancient software running on ancient | hardware is way more responsive & snappy than what we have | today. Laziness. 
 | | Laziness in terms of using entirely inappropriate
 | algorithms, sure.
 |
 | Laziness in not microbenchmarking minutia? It shouldn't be.
 | There's a limit on how much that can hurt you. I would say
 | much less than a factor of ten, but let's go with 10x just
 | for argument's sake. If you have a CPU that's 500x faster,
 | and use easy code that's 10x slower, you're doing just
 | fine. This is not the problem with modern unresponsiveness.
 | klyrs wrote:
 | When I rewrite Python code in C, I often hit 1000x
 | speedups, and sub-100x is rare. And that's line-for-line.
 | When I fix an accidentally-quadratic issue, for example,
 | I've seen speedups in the billions without even changing
 | the language.
 |
 | People have lionized Knuth's quote about premature
 | optimization, and used that to ignore performance issues
 | across the board. Since the early '00s, we have not seen
 | a 500x improvement in CPU speed. It's less than 2x on
 | frequency, and let's say 8x on core-count for most users
 | (which doesn't help your single-core lazy programmer). In
 | my experience, programmers will make projections based on
 | 500x-faster processors that _will never arrive_, because
 | it's easier than honing their skills and keeping them
 | sharp. And even if these magical THz-frequency chips
 | arrive, if you have three layers of 10x slowdowns, you're
 | back down to GHz.
 | josephg wrote:
 | This has been my experience too. I wrote a text CRDT last
 | year which improved on automerge's (then) 5-minute runtime.
 | My code currently takes 6ms to do the same work.
 |
 | Automerge's design assumed this stuff would always be
 | slow, so they had this whole frontend / backend code
 | split so they could put the expensive operations on
 | worker threads. Good optimizations in the right places
 | make all that complexity unnecessary. The new automerge
 | is shaping up to be simpler as well as faster.
 | Dylan16807 wrote:
 | > When I rewrite Python code in C, I often hit 1000x
 | speedups, and sub-100x is rare.
And that's line-for-line.
 | When I fix an accidentally-quadratic issue, for example,
 | I've seen speedups in the billions without even changing
 | the language.
 |
 | And neither of those is a microbenchmark thing, which is
 | kind of my point. I'm surprised language would hurt that
 | much, but that's enough to break things on its own
 | without any layering.
 |
 | > Since the early '00s, we have not seen a 500x
 | improvement in CPU speed. It's less than 2x on frequency,
 | and let's say 8x on core-count for most users (which
 | doesn't help your single-core lazy programmer).
 |
 | I don't think people are talking about 2004 when they
 | talk about the responsiveness of ancient software on
 | ancient hardware. I interpret that as more like an Apple
 | II. But instructions per clock have also gone up a lot
 | since the Pentium 4 days, and having more than one core
 | in your CPU has a huge impact even for single-threaded
 | programs.
 | klyrs wrote:
 | > And neither of those is a microbenchmark thing, which
 | is kind of my point. I'm surprised language would hurt
 | that much...
 |
 | The point I'm making here is that every line matters --
 | not just the hotspots. If you're surprised that language
 | can have that much impact, perhaps it's time to learn a
 | bit about performance issues that you're being dismissive
 | of?
 |
 | > I don't think people are talking about 2004 when they
 | talk about the responsiveness of ancient software on
 | ancient hardware.
 |
 | No, I was responding to your mention of a 500x
 | improvement in hardware. That pipedream ended in the
 | early '00s, and people still talk like Moore's law will
 | absolve their inattentive coding practice. And that felt
 | fine in the decades we went from kHz to GHz, but it's
 | unacceptable today.
 | Dylan16807 wrote:
 | > The point I'm making here is that every line matters --
 | not just the hotspots.
 |
 | That depends on how it would have performed if you only
 | transformed the hottest 10% into C.
 | | But I was mainly responding to the idea that micro-
 | optimizations are needed to keep general software snappy,
 | and I don't think they are. If one language is that much
 | faster, that's not micro-optimization.
 |
 | > No, I was responding to your mention of a 500x
 | improvement in hardware.
 |
 | What do you mean "No"? I was talking about current
 | hardware being 500x faster than 1988 hardware, which _it
 | is_. If that's not what you meant by "ancient software
 | on ancient hardware", fine, but _that's what my 500x was
 | talking about_.
 |
 | > people still talk like Moore's law will absolve their
 | inattentive coding practice
 |
 | I'm not trying to excuse inattentive coding. I'm trying
 | to say certain kinds of attention are important and
 | others aren't.
 | naasking wrote:
 | > And a flat-ish flame graph looks fine: nothing stands out
 | as a problem!
 |
 | If your program is still slow, that would also indicate
 | that everything is a problem, i.e. the ten thousand cuts.
 | Start optimizing at some obvious spots and then see what
 | happens.
 | bjackman wrote:
 | Haha, "death by a thousand cuts" is exactly the phrase I
 | encounter in these debates!
 |
 | And actually I still disagree - e.g. I once took over a DMA
 | management firmware and the TL told me "we are really trying
 | to avoid DBATC so we take care to write every line
 | efficiently". But the thing was that once you have a holistic
 | understanding of the system's performance you tend to find
 | _only a small fraction of the code ever affects the metrics
 | you care about_!
 |
 | E.g. in that case the CPU was so rarely the bottleneck that
 | it really didn't matter, we could have rewritten half the
 | code in Python (if we'd had the memory) without hurting
 | latency or throughput.
 |
 | Admittedly I can see how games or like JS engines might be a
 | kinda special case here, where the OVERALL compute bandwidth
 | begins to become a concern (almost like an HPC system) and
 | maybe then every line really does count.
 | Dylan16807 wrote:
 | Very little of your code is in hot loops. If the code that
 | takes half a millisecond per frame _could_ be twice as fast,
 | but the hot loop is very optimized, then it doesn't really
 | matter. And that's what I would think of by default for
 | having many many cuts. Better to spend the optimization
 | effort elsewhere.
 |
 | > shared_ptr is a canonical example of death by ten thousand
 | cuts
 |
 | Why does that count as ten thousand cuts rather than one cut?
 | That doesn't sound intractable to fix if you have months.
 | [deleted]
 | nextaccountic wrote:
 | > I always ask them to either prove it or write the simple
 | thing.
 |
 | Even if they do, they also need to make a case that in this
 | specific case, performance matters enough to pessimize code
 | simplicity and maintainability.
 |
 | Also, if performance is that critical, it's imperative to
 | benchmark again after each compiler release to guard against
 | codegen regressions. And benchmark after changing this piece
 | of code. Otherwise, we can say that performance doesn't
 | really matter.
 | eldenring wrote:
 | I think you're missing the point of the previous comment:
 | they are saying a good proxy for it being worth it to
 | optimize is if you're willing to benchmark it.
 | szundi wrote:
 | My favorite way of thinking. Should be applied to question
 | the need for the existence of the given feature or function
 | too, and best to delete the whole stuff.
 | heydenberk wrote:
 | Even if you do benchmark something, maintainability can be
 | more important than a marginal performance improvement.
 |
 | I've seen this happen _a lot_ with JavaScript, particularly
 | in the last 5-10 years as JS engines have developed
 | increasingly sophisticated approaches to performance.
 | Today's optimization can be tomorrow's de-optimization.
Even given an unchanging
 | landscape of compiler/interpreter, tightly-optimized code can
 | become de-optimized when updated and extended, as compared to
 | maintainable code that may not suffer much performance
 | degradation upon extension.
 | Patrol8394 wrote:
 | This x 10000! If I had a dime for every time I provided this
 | exact feedback in code reviews... I find it surprising that a
 | lot of devs in the tech industry are obsessed with pointless
 | micro-optimizations and don't care about writing
 | maintainable, testable, simple code. My final comment is
 | always to not try to outsmart compilers/JVMs, because they
 | tend to do a much better job than developers.
 |
 | Please, don't optimize unless you have reasons to do so and
 | numbers backing that up.
 | ok123456 wrote:
 | This is true for application code. But Rust is trying to
 | sell itself as a systems language and an embedded language
 | and a language you can write kernel modules in. Memory budget
 | matters in these cases.
 | tialaramex wrote:
 | If memory budget matters, you _have_ a memory budget. So
 | you should be measuring and you can actually tell where you
 | need improvements.
 |
 | But in practice what we see _overwhelmingly_ is that people
 | want to do this stuff but they aren't measuring, because
 | measuring is _boring_ whereas making the code more
 | complicated to show off how much you think you know about
 | optimisation is easy. Knock it off.
 | ok123456 wrote:
 | Then knock off trying to use Rust as a systems language.
 | Linear types make refactoring this nearly impossible if
 | it does become an issue.
 | tialaramex wrote:
 | Works really nicely for me; of course, I actually measure
 | what I'm doing.
 |
 | You edited your comment, so I guess I will too: Rust
 | doesn't actually have linear types. Linear types ("must
 | use or compile error") would be tricky to provide; Aria
 | blogged about it back in the day. So that's definitely
 | going to be a problem with your refactoring.
 | dahfizz wrote:
 | It depends on your specialization, I guess. If you're making
 | a website, a few microseconds here and there probably don't
 | matter.
 |
 | But in my field (Fintech), performance really does matter.
 | Doing the simple, slow thing is just lazy and won't make it
 | through review.
 | imron wrote:
 | Great. Should be easy to prove then.
 | dahfizz wrote:
 | Yup, we have a good benchmarking suite to make sure
 | changes don't cause regressions and that optimization
 | changes actually work.
 |
 | That said, I think asking a developer to write everything
 | they do twice so that they can A/B test is overboard. You
 | can come back and really aggressively optimize later, but
 | I think the "default" should be the fast thing, rather
 | than the slow & easy thing.
 | bmacho wrote:
 | If "performance really does matter" (in Fintech) then a
 | developer surely can write everything twice or more.
 | dahfizz wrote:
 | We could, yeah. I'm sure you're capable of re-writing
 | everything you do twice. It's just a huge waste of time.
 |
 | You get just as much benefit by assigning a performance
 | refactor to an engineer when needed vs literally halving
 | (or worse) the whole team's productivity.
 | elimerl wrote:
 | The idea is that the "default" is the easy thing, which
 | is usually optimized by the compiler.
 | josephg wrote:
 | My advice is the opposite: if you want to make performance
 | justifications for code, you need a benchmarking suite. I
 | have them for a lot of my projects. (Rust's criterion is a
 | delight.) A good benchmark suite is a subtle thing to write -
 | you want real-world testing data, and benchmarks for a range
 | of scenarios. The benchmarks should be run often. For some
 | changes, I'll rerun my benchmarks multiple times for a single
 | commit. I benchmark the speed of a few operations,
 | serialisation size, wasm bundle size and some other things.
 |
 | Having real benchmarking data is eye-opening.
It was shocking
 | to me how much the wasm bundle size increased when I added
 | serialisation code. The time to serialise / deserialise a big
 | chunk of data for me is 0.5ms - so fast that it's not worth
 | more microoptimizations. Lots of changes I think will make
 | the code slower have no impact whatsoever on performance. And
 | my instincts are _so often_ wrong. About 50% of
 | microoptimizations I try either have no effect or make the
 | code slightly slower. And it's quite common for changes that
 | shouldn't change performance at all to cause significant
 | performance regressions for unexpected reasons.
 |
 | I've also learned how important "short circuit" cases can be
 | for performance. Adding a single early check for the trivial
 | case in a function can sometimes improve end-to-end
 | performance by 15-20%, which in already well-tuned code is
 | massive.
 |
 | Performance work is really fun. But if you do performance
 | tuning without measuring the results, you're driving
 | blindfolded. You're as likely to make your code worse as you
 | are to make it better. Add benchmarks.
 | jjice wrote:
 | > but that will be slower because it's a copy/a function
 | call/an indirect branch/a channel send/a shared memory access
 |
 | I really dislike these takes. I see engineers optimize these
 | cases and then go ahead and make two separate SQL queries
 | that could be one, ruining any false optimization gains they
 | got, by lord knows how many times.
 |
 | Yeah, you can loop over that 100-element list twice doing
 | basic computation if you want; it's not going to make a
 | difference for many engineering workloads, but could make a
 | big difference in readability.
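The "short circuit" cases josephg mentions above are usually just an early return for a trivial input before the general work. A sketch (a hypothetical function, not code from the thread):

```rust
// Length of the common prefix of two byte slices, with two
// early-out fast paths before the element-wise scan.
fn common_prefix_len(a: &[u8], b: &[u8]) -> usize {
    // Short-circuit: an empty input needs no scan at all.
    if a.is_empty() || b.is_empty() {
        return 0;
    }
    // Short-circuit: the exact same slice is trivially its own prefix.
    // ptr::eq on slices compares both the data pointer and the length.
    if std::ptr::eq(a, b) {
        return a.len();
    }
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}

fn main() {
    assert_eq!(common_prefix_len(b"", b"abc"), 0);
    assert_eq!(common_prefix_len(b"abcd", b"abxy"), 2);
    let s: &[u8] = b"hello";
    assert_eq!(common_prefix_len(s, s), 5);
    println!("ok");
}
```

Whether either fast path is worth it is exactly the kind of thing a benchmark (not intuition) should decide.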
 | saghm wrote:
 | This seems to be an unpopular opinion, but I feel similarly
 | about how sometimes people seem to toss out `inline` (and,
 | even more suspect, `inline(always)`) annotations on Rust
 | functions like candy on Halloween, and there's almost never
 | any sort of actual measurement of whether it actually helps
 | in the cases it's used. It's not even that I think it really
 | hurts that much in most of the stuff I've worked on (which
 | tends to be more sensitive to concurrency design and network
 | round trips), but I can't help but worry that people using
 | stuff like this when they don't seem to fully grasp the
 | implications is a recipe for trouble.
 | bjackman wrote:
 | Yeah, inline is an absolute classic for this. The number of
 | uncommented __attribute__((always_inline))s I see in C code
 | drives me crazy. There are absolutely legitimate reasons to
 | use that attribute, but there should ALWAYS be a comment
 | about why, so that later readers know in what conditions they
 | can safely remove it.
 | the__alchemist wrote:
 | Interesting! Of note, my `Vec3` and `Quaternion` types (f32
 | and f64) have `Copy` APIs, but I've wondered about this since
 | their inception.
 | yobbo wrote:
 | The Rust test implements the traits Add, Sub, Mul by value.
 | This makes the few references less important in the total
 | test. The ergonomics argument is motivated by using these
 | traits. Otherwise, references would have had the same
 | ergonomics.
 |
 | But also, the struct is 3x32 bits, and Rust auto-implements
 | the Copy trait for it. It is barely larger than u64, which is
 | the size of the reference.
 |
 | But life is only simpler when Copy and Clone can be auto-
 | implemented.
 | birdyrooster wrote:
 | I guess by-copy bc I'm cool
 | jackmott wrote:
 | [dead]
 | celeritascelery wrote:
 | I don't feel like this gave a satisfactory answer to the
 | question. Since everything was inlined, the argument-passing
 | convention made no difference in the micro benchmarks.
But what happens when
 | it does not inline? Then you would actually be testing
 | by-borrow vs by-copy instead of how good Rust is at
 | optimizing.
 | ncallaway wrote:
 | I sort of agree and sort of disagree.
 |
 | > Then you would actually be testing by-borrow vs by-copy
 | instead of how good Rust is at optimizing.
 |
 | I don't think the question is actually: "what is faster in
 | practice, a by-copy method call or a by-borrow method call",
 | I think the question is: "as an implementer, which semantics
 | should I choose when I'm writing my function".
 |
 | For the second question: "Rust is usually pretty good at
 | aggressively inlining, so... if you're willing to trust
 | Rust's compiler, you're often okay going with by-copy
 | implementations, but you should keep an eye on it". Whereas,
 | as you note, for the first question it's not an answer.
 |
 | But, I do think if someone was going to put more work into it
 | I'd be very curious what the answer to the first question is.
 | If I'm choosing to implement with by-copy semantics and
 | trusting the Rust compiler to hopefully inline things for me,
 | I'd like to know the implications in the cases when it
 | doesn't.
 | forrestthewoods wrote:
 | Blog author here. This feels like the best summary in this
 | comment section.
 |
 | The root question is indeed "what semantics should I use".
 | And the answer I came up with was "the compiler does a lot of
 | magic so by-copy seems pretty good". I agree with the
 | previous commenter this is not a satisfying conclusion!
 |
 | My experience with Rust is that it requires a moderate amount
 | of trust in the compiler. Iterator code is another example
 | where the compiler should produce near-optimal code. Emphasis
 | on should!
 | FpUser wrote:
 | When value size is small (whatever "small" means for a
 | particular architecture) I'd say the "trust the compiler"
 | suggestion is reasonable.
When the size grows there should
 | be no more "trust" unless the compiler can decide whether it
 | is safe to use a ref instead of a value based on value size
 | (we assume that the function does not mutate the value).
 |
 | Your tests on my PC:
 |
 |     Rust - By-Copy: 14124, By-Borrow: 8150
 |     C++  - By-Copy: 12160, By-Ref: 11423
 |
 | P.S. Just built it using LLVM under the CLion IDE and the
 | results are:
 |
 |     G:\temp\cpp\rust-cpp-bench\cpp\cmake\cmake-build-release\fts_cmake_cpp_bench.exe
 |     Totals:
 |       Overlaps: 220384338
 |       By-Copy: 4397
 |       By-Ref: 4396
 |       Delta: -0.0227428%
 |     Process finished with exit code 0
 | nicoburns wrote:
 | > When the size grows there should be no more "trust" unless
 | the compiler can decide whether it is safe to use a ref
 | instead of a value based on value size
 |
 | I believe the Rust compiler does exactly that, at least.
 | Large structs will be passed by reference under the hood
 | even if passed by value in the code. I suspect C++
 | compilers do the same, although I'm not sure about that.
 | FpUser wrote:
 | Now comes a big surprise: I just built it using LLVM under
 | the CLion IDE and the results are:
 |
 |     G:\temp\cpp\rust-cpp-bench\cpp\cmake\cmake-build-release\fts_cmake_cpp_bench.exe
 |     Totals:
 |       Overlaps: 220384338
 |       By-Copy: 4397
 |       By-Ref: 4396
 |       Delta: -0.0227428%
 |     Process finished with exit code 0
 | josephg wrote:
 | Why is performance so much better in this case? That
 | seems like a suspiciously large delta from the first
 | test.
 |
 | Were the other benchmarks run in debug mode / with
 | optimizations turned off or something like that? What
 | compiler & flags are you using?
 | FpUser wrote:
 | > "Were the other benchmarks run in debug mode / with
 | optimizations turned off or something like that?"
 |
 | Why would I do something like that? Of course all builds
 | are release mode, optimized for speed.
| Rust - Windows - By-Copy: 14124, By-Borrow: 8150 | C++ - Windows MS Compiler - By-Copy: 12160, By-Ref: 11423 | C++ - Windows LLVM 15 - By-Copy: 4397, By-Ref: 4396 | | >"Why is performance so much better in this case?" | | Not sure and not in a mood to investigate. I do know that if | cache locality and branch prediction stars line up | properly, the performance difference can be staggering. | Maybe LLVM has accomplished something nice in this | department. | forrestthewoods wrote: | I just updated Visual Studio 2022 with all the latest | updates and installed the Clang toolchain. I also updated | Rust to the latest. | | C++ MSVC: By-Copy: 12,077 By-Ref: 11,901 C++ Clang: By-Copy: 5,020 By-Ref: 5,029 Rust: By-Copy: 3,173 By-Borrow: | 3,148 | | All on Windows, and on the same i7-8700k desktop I used | for the original post in 2019. | | Your Rust numbers are particularly curious. Maybe run | `rustup update` and try again? | furyofantares wrote: | I feel like they got excited by their C++ code being so much | slower, got curious about the "weird" C++ result, and forgot to | figure out the original question. | FpUser wrote: | >"C++ code being so much slower" Rust - | Windows - By-Copy: 14124, By-Borrow: 8150 C++ - Windows | MS Compiler - By-Copy: 12160, By-Ref: 11423 C++ - | Windows LLVM 15 - By-Copy: 4397, By-Ref: 4396 Delta: | -0.0227428% | | So it appears that C++ - Windows LLVM 15 beats Rust by a large | margin. | fnordpiglet wrote: | To be fair I got excited too. But I still want to know the | answer as well. | jasonhansel wrote: | In Rust it's considered idiomatic to pass things by-value | whenever you can. Usually this is also the most performant | option, since it avoids dereferencing in the callee. | | Of course, if your struct is truly enormous, you may want to | break this rule to avoid large copies. But in that case you | probably want to Box<T> the struct anyway.
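One way to make the trade-off just described concrete (a minimal sketch; the `Vec2`/`Big` types and function names are illustrative, not code from the article or the thread):

```rust
// A small Copy struct: idiomatic to pass by value. On common ABIs the two
// f64 fields fit in registers, so the "copy" is essentially free.
#[derive(Clone, Copy)]
struct Vec2 {
    x: f64,
    y: f64,
}

fn dot(a: Vec2, b: Vec2) -> f64 {
    a.x * b.x + a.y * b.y
}

// A large struct (8 KiB): boxing it keeps moves cheap, because only the
// heap pointer is shuttled around, not the whole array.
struct Big {
    samples: [f64; 1024],
}

fn sum(big: &Big) -> f64 {
    big.samples.iter().sum()
}

fn main() {
    let a = Vec2 { x: 1.0, y: 2.0 };
    let b = Vec2 { x: 3.0, y: 4.0 };
    println!("{}", dot(a, b));
    // a and b are still usable here: Copy types are not moved out.

    let big = Box::new(Big { samples: [1.0; 1024] });
    // Deref coercion: &Box<Big> coerces to &Big at the call site.
    println!("{}", sum(&big));
}
```

The design choice mirrors the comment above: small `Copy` structs go by value, and a struct large enough that copying would hurt goes behind a `Box<T>` so that passing it around only moves a pointer.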
| | Of course, if your struct contains something that can't be | copied--like a Vec<T>--you'll have to decide whether to clone | the whole struct (and thus the vector in it), pass the struct | by-borrow, or find some other solution. | brundolf wrote: | I don't think I'd agree that idioms come into play here, one | way or the other. Safely borrowing things by reference is one | of Rust's headline features. | kibwen wrote: | _> Safely borrowing things by reference is one of Rust's | headline features_ | | Sure, but it's worth noting that references in Rust do not | exist merely to avoid passing by-value. They also exist to | make it easier to deal with Rust's ownership semantics: | they let you pass things to a function without also | requiring the function to "pass back" those things as | returned values. In other words, references let you do `fn | foo(x: &Bar)` rather than `fn foo(x: Bar) -> Bar`. This is | a unique and interesting consequence of languages with | by-default move semantics. | throwawaybycopy wrote: | Should have also tried pass-by-move. | ergonaught wrote: | It's compiled, so, without any investigation at all, I would have | been disappointed if there were any significant difference in the | code emitted in these cases. I would expect the compiler to do | the efficient thing based on usage rather than the particular | syntax. I may have too much faith in the compiler. | CHY872 wrote: | I'd expect your claim to be true whenever the callee is inlined | into the caller. In this case, the compiler has all the | relevant information at the right point in time. As other | commenters have pointed out, by enabling inlining the author | has gone down a rabbit hole somewhat unrelated to the question, | because any copies can be simply elided. | | If there's no inlining at play, I'd expect vast differences to | be possible.
For example, imagine a chain of 3 functions - f | calls g, g calls h - where one of the arguments is a 1kB struct | and the options are passing by copy or by borrowing. In this | case, each stack frame will be 1kB in size in the copy case, and | there will be a large performance overhead as opposed to the | by-reference case. One would expect simply calling the function | to be similar in overhead to an uncached memory load. | | Within a single crate the inlining is possible; with multiple | crates it's only possible with LTO enabled (and I'm not sure | how _probable_ it is that the inlining would occur). | | In either case, the difference between a 32-byte and an 8-byte | argument in terms of overhead is likely meaningless - the sort | of thing to be optimized if profiling says it's a problem, as | opposed to ahead of time. | kibwen wrote: | _> Within a single crate (more specifically, codegen unit) | the inlining is possible_ | | Cross-crate inlining happens all the time. In order to be | eligible for inlining, a function needs to have its IR | included in the object's metadata. This happens automatically | for every generic function (it's the only way | monomorphization can work), and for non-generic functions it can | be enabled manually via the `#[inline]` attribute (which does | not _force_ inlining, it only makes it possible to inline at | the backend's discretion). | | However, as you say, if you have LTO enabled then "cross-crate" | inlining can happen regardless, since it's all just | one giant compilation unit at that point. | cogman10 wrote: | At the VERY end of the article, the author points out "Oh, btw, | I used MSVC for the C++ compilation, and when I used Clang things | changed!" | | So, what the author actually measured was the difference | between LLVM and MSVC throughout the article. Particularly when | they talked about Rust being better at autovectorization than | C++. | forrestthewoods wrote: | Incorrect.
Clang C++ and MSVC C++ are very comparable, and | noticeably worse for f64 by-ref. Clang C++ is still slower | than Rust by a large margin. Using Clang C++ throughout would | not change any conclusion (or lack thereof). | kolbe wrote: | Anyone know why seemingly knowledgeable people (like the person | who wrote this article) don't use micro-benchmarking frameworks | when they run these tests? | | Also, whenever you do one of these, please post the full source | with it. There's no reason to leave your readers in the dark, | wondering what could be going on, which is exactly what I'm doing | now, because there's almost no excuse for C++ to be slower at a | task than Rust--it's just a matter of how much work you need to | put in to make it get there. | kllrnohj wrote: | For C++ I guess you could make the claim that it's just too | annoying to take a dependency on something like google-benchmark | or whatever, since C++ dependency management is such | a mess to deal with in general. | | But yeah, I have no idea why a benchmark framework wasn't used | for Rust. | kolbe wrote: | Whenever I don't want to endure that annoyance, I just copy | this single-file, header-only microbenchmarking code: | | https://github.com/sheredom/ubench.h | forrestthewoods wrote: | > please post the full source with it. | | There's literally a section called Source Code... | kolbe wrote: | I see now. I looked twice. I think most people stop after a | section called "Conclusion" that ends with "Thanks for | reading." It doesn't help that the formatting then leaves a | large gap between sections that doesn't indicate there's more | after that. | forrestthewoods wrote: | Fair! | zamalek wrote: | The benchmarks lack the standard deviation, so the results may | well be equivalent. Don't roll your own micro-benchmark runners. | | References may get optimized to copies where possible and sound | (i.e.
blittable and const); a common heuristic involves the size | of a cache line (64 bytes on most modern ISAs, including x86_64). | | Using a Vector4 would have pushed the structure size beyond the | 64-byte heuristic. You would also need to disable inlining for the | measured methods. | cogman10 wrote: | It was also (needlessly) using 2 different compilers, MSVC and | LLVM. This is just a bad way to compare things all around. | | And, for simple operations like this, you really should just | look at the assembly output. If you are only generating 20-ish | instructions, then look at those 20 instructions rather than | trying to heuristically guess what is happening. | dang wrote: | Discussed at the time: | | _Should small Rust structs be passed by-copy or by-borrow?_ - | https://news.ycombinator.com/item?id=20798033 - Aug 2019 (107 | comments) | aboelez wrote: | [dead] ___________________________________________________________________ (page generated 2022-12-31 23:00 UTC)