[HN Gopher] Should small Rust structs be passed by-copy or by-bo...
       ___________________________________________________________________
        
       Should small Rust structs be passed by-copy or by-borrow? (2019)
        
       Author : aloukissas
       Score  : 209 points
       Date   : 2022-12-31 13:33 UTC (9 hours ago)
        
 (HTM) web link (www.forrestthewoods.com)
 (TXT) w3m dump (www.forrestthewoods.com)
        
       | dwheeler wrote:
       | This is one advantage of Ada, where parameters are abstractly
       | declared as "in" or "in out" or "out". The compiler can then
       | decide how to best implement it for that specific size and
       | architecture.
        
         | ardel95 wrote:
         | How is that semantically different from Rust?
         | 
         | in - regular function arguments
         | 
          | inout - &mut function arguments
         | 
         | out - function return
         | 
         | Is there any additional information that a compiler can infer
         | from Ada's parameter syntax?
        
           | layer8 wrote:
           | The difference between passing by reference vs. by value is
           | observable when comparing pointers to the original vs. to the
           | argument. This difference may be unobservable in Ada though
           | (not sure), so Ada would have more freedom choosing between
           | the two.
        
             | [deleted]
        
             | [deleted]
        
         | chromatin wrote:
         | Dlang can also qualify parameters as in, out, and inout;
         | although I don't know to what degree the compiler is able to
         | use that for optimization purposes (it is used for safety
         | checks IIRC)
        
         | bvrmn wrote:
          | Always curious how Ada solves the ABI issue with such
          | optimizations in place.
        
           | rightbyte wrote:
           | As long as the calling convention is deterministic from the
           | declaration of the function it should be fine right?
        
             | usrnm wrote:
             | If it's deterministic, the compiler cannot actually choose
             | the best way to optimise it.
        
               | [deleted]
        
               | rightbyte wrote:
                | It just has to come up with the same best way each time?
        
         | sampo wrote:
         | > This is one advantage of Ada, where parameters are abstractly
         | declared as "in" or "in out" or "out".
         | 
         | Also Fortran has "in", "inout" and "out".
        
           | FpUser wrote:
           | So does Delphi / FreePascal
        
           | trifurcate wrote:
           | Also, MSVC has similar annotations for various static
           | analyses: https://learn.microsoft.com/en-us/cpp/code-
           | quality/understan...
        
           | jb1991 wrote:
           | And Swift also has "inout" parameters.
        
             | stephencanon wrote:
             | But not "out" params, sadly.
             | 
             | It can return multiple values, so this doesn't matter much
             | for value types, but it would be nice to be able to specify
             | that a pointer arg is an out-param sometimes and enforce
             | that it is not read from while handling allocation in the
             | caller.
        
           | pletnes wrote:
           | Fortran also has <<default>> / no intent. This is somehow
           | different from inout.
        
           | wiz21c wrote:
           | and GL/SL IIRC ...
        
           | Congeec wrote:
           | C++23 is not too late to the party
           | https://en.cppreference.com/w/cpp/memory/out_ptr_t/out_ptr
        
         | rwaksmunski wrote:
         | A question to the Rust experts, would lifetime annotations 'a
         | in Rust have similar benefit as "in" or "in out" or "out" in
         | Ada and other languages? With the additional benefit in Rust
         | where the compiler can deduce those automatically for most
         | cases?
        
           | chc wrote:
           | As a sibling comment points out, "in" is effectively
           | equivalent to "&T", and "inout" is effectively equivalent to
           | "&mut T". Rust is missing purely "out" parameters, but that
           | isn't a very common case, and I'm not sure how much value
           | there is in saying "this reference can't be read" since
           | references are always guaranteed to be valid in Rust.
        
       | tuetuopay wrote:
       | This is not really surprising in such a case. The Rust compiler
        | is pretty good at optimizing out unneeded copies. Here it does see
       | that the copied value is not used after the function call, so it
       | should simply not emit the copies in the final assembly.
        
       | eloff wrote:
       | For this code, the compiler inlined the call. So there should be
       | no difference between pass by copy or pass by reference, which is
       | what was measured. Where it could matter is when the code isn't
       | inlined. But with small structs it might not matter all that
       | much.
       | 
       | It does sometimes matter though. One optimization I've seen in a
       | few places is to box the error type, so that a result doesn't
       | copy the (usually empty) error by value on the stack. That
       | actually makes a small performance difference, on the order of
       | about 5-10%.
        
       | lukaszwojtow wrote:
       | I always prefer by-borrow. That's because in the future this
       | struct may become non-copy and that means some unnecessary
       | refactoring. My thinking is a bit like "don't take ownership if
       | not needed" - the "not needed" part is the most important thing.
       | Don't require things that are not needed.
        
         | theptip wrote:
         | Rust noob here - is it common to see a struct lose Copy as
         | things grow?
        
         | carlmr wrote:
         | Exactly, and if performance at some point matters: benchmark!
         | 
         | And I would bet 9 times out of 10 it won't be the bottleneck or
         | even make a measurable difference.
        
           | QuadDamaged wrote:
           | Exactly why IMHO the rust stdlib is so easy to understand.
           | Ownership only when required as a design principle tends to
           | make the design of the overall system more consistent /
           | easier to approach.
        
         | eterevsky wrote:
         | If it's a 3D real-valued vector, or similarly basic structure,
         | you can be fairly certain, that it will stay copyable.
        
           | josephg wrote:
           | I agree. Being copyable is part of the signature for
           | something like this. Explicitly so in rust.
        
         | zozbot234 wrote:
         | If a struct might lose Copy you shouldn't implement Copy at
         | all, to preserve forward compatibility. You can still derive
         | Clone in most cases; using .clone() does not per se add any
         | overhead.
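That guideline can be shown in a few lines. A sketch with a hypothetical config struct that is expected to grow:

```rust
// Hypothetical struct: Clone but deliberately not Copy, so that adding
// a non-Copy field (e.g. a String) later is not a breaking change for
// downstream callers.
#[derive(Clone, Debug, PartialEq)]
struct RetryPolicy {
    max_attempts: u32,
    backoff_ms: u64,
}

fn main() {
    let a = RetryPolicy { max_attempts: 3, backoff_ms: 100 };
    // Callers opt in explicitly; for all-primitive fields this clone
    // compiles down to the same bitwise copy that Copy would perform.
    let b = a.clone();
    assert_eq!(a, b);
}
```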
        
       | redox99 wrote:
       | I'm surprised he tested MSVC and Clang, and not GCC which usually
       | generates faster code than those two.
        
         | 3836293648 wrote:
         | Well, they are the two easily available compilers on Windows.
         | And rustc vs clang should be the fair comparison as they both
         | use llvm
        
       | im3w1l wrote:
       | My first thought was "now what is the calling convention for
       | float parameters again? they are passed in registers right? the
       | compiler can probably arrange so they don't have to actually be
       | copied" and then I realized it will probably even inline it.
       | 
       | Anyway, assuming it's not inlined I would guess pass-by-copy,
       | maybe with an occasional exception in code with heavy register
       | pressure.
       | 
       | Edit: Actually since it's a structure, the calling convention is
       | to memory allocate it and pass a pointer, doh. So it should
       | actually compile the same.
        
         | masklinn wrote:
         | > Edit: Actually since it's a structure, the calling convention
         | is to memory allocate it and pass a pointer, doh. So it should
         | actually compile the same.
         | 
          | FWIW the AMD64 SysV v1.0 psABI allows structures of up to 8
          | members to be passed via registers. Though older revisions
          | limit that to 2 (and it's unclear whether MS's divergent ABI
          | allows aggregates to be splat at all).
         | 
          | Sadly if unsurprisingly, it does not look like LLVM (linux?)
          | has followed up: on godbolt a 2-member struct passes
          | everything via registers but a 3-member struct passes
          | everything via the stack. Maybe there's a magic flag to use
          | the 1.0 ABI, but a quick googling didn't reveal one. ICC
          | doesn't seem to have followed up either.
        
         | unsafecast wrote:
         | > Edit: Actually since it's a structure, the calling convention
         | is to memory allocate it and pass a pointer, doh. So it should
         | actually compile the same.
         | 
         | Depending on calling convention, the structure may be spread
         | out into registers.
        
       | Veedrac wrote:
       | The general usability impact matters slightly less than it looks
       | here, in part because the `do_math` with references in the
       | article has two extra &s, and in part because methods
       | autoreference when called like x.f().
       | 
       | Performance-wise, if you're likely to touch every element in a
       | type anyway, err on the side of copies. They are going to have to
       | end up in registers eventually anyway, so you might as well let
       | the caller find out the best way to put them there.
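The autoreferencing point can be seen in a small sketch (hypothetical V2 type): method-call syntax borrows the receiver automatically, so the extra &s only show up on non-receiver arguments.

```rust
#[derive(Clone, Copy)]
struct V2 {
    x: f32,
    y: f32,
}

impl V2 {
    // Takes &self, yet callers never have to write (&a).dot(...):
    // method-call syntax autoreferences the receiver.
    fn dot(&self, other: &V2) -> f32 {
        self.x * other.x + self.y * other.y
    }
}

fn main() {
    let a = V2 { x: 1.0, y: 2.0 };
    let b = V2 { x: 3.0, y: 4.0 };
    // Only the non-receiver argument needs an explicit borrow.
    assert_eq!(a.dot(&b), 11.0);
}
```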
        
       | BooneJS wrote:
       | Folks, processors continue to give smaller and smaller gains
       | every year. Something has to give. If you have critical path code
       | that absolutely must max out the core, then this type of analysis
       | (as pedantic as it is) is useful in the long run.
        
       | mcguire wrote:
       | This is one of those questions where you really, honestly, do
       | need to look at a very low level.
       | 
       | Back in the ancient days, I worked at IBM doing benchmarking for
       | an OS project that was never released. We were using PPC601
       | Sandalfoots (Sandalfeet?) as dev machines. A perennial fight was
        | devs writing their own memcpy using *dst++ = *src++ loops rather
       | than the one in the library, which was written by one of my
       | coworkers and consisted of 3 pages of assembly that used at least
       | 18 registers.
       | 
       | The simple loop was something like X cycles/byte, while the
       | library version was P + (Q cycles/byte) but the difference was
       | such that the crossover point was about 8 bytes. So, scraping out
       | the simple memcpy implementations from the code was about a
       | weekly thing for me.
       | 
       | At this point, we discovered that our C compiler would pass
       | structs by value (This was the early-ish days of ANSI C and was a
       | surprise to some of my older coworkers.) and benchmarked _that_.
       | 
       | And discovered that its copy code was _worse_ than the simple
        | *dst++ = *src++ loops. By about a factor of 4. (The simple loop
       | would be optimized to work with word-sized ints, while the
       | compiler was generating code that copied each byte individually.)
       | 
       | If you are doing something where this matters, something like
       | VTune is very important. So is the ability to convince people who
       | do stupid things to stop doing the stupid things.
        
       | cmrdporcupine wrote:
       | There is no single answer to this question because it's going to
       | depend completely on call patterns further up. Especially in
       | regards to how much of the rest of the running program's data
        | fits in L1 cache, and _most especially_ in regards to what's
       | going on in terms of concurrency.
       | 
       | The benchmark made here could completely fall apart once more
       | threads are added.
       | 
       | Modern computer architectures are non-uniform in terms of any
       | kind of memory accesses. The same logical operations can have
       | extremely varied costs depending on how the whole program flow
       | goes.
        
       | m00dy wrote:
       | It is a problem of statistics and depends on internals of
       | underlying operating system. I'm not sure you really need that
       | sort of optimisation
        
         | eloff wrote:
         | What does this have to do with the operating system? There are
         | no syscalls in the code measured here.
        
           | masklinn wrote:
           | The "C ABI" is really "the platform ABI", because most OS are
           | interacted with through libc (or equivalent).
           | 
           | Though that should not apply to Rust at all, as it does not
           | pledge to follow the C ABI internally (aka `extern "Rust"`).
        
           | m00dy wrote:
           | Because rust is a compiled language and therefore it means
           | you compile your code to a certain architecture. Who told you
            | about syscalls? There are systems not using syscalls.
        
             | eloff wrote:
             | There are no syscalls or equivalent operating system calls
             | in the code paths measured. The architecture is also
             | independent of the operating system, with exceptions in
             | some languages for the calling convention (not in rust,
             | afaik, or at least rust makes no guarantees there.)
        
               | CryZe wrote:
               | However, in practice Rust's calling convention does
               | actually depend on the operating system. So on Linux Rust
               | will make use of the stack red zone, while on Windows it
               | doesn't. (Also some codegen in LLVM depends on the
               | operating system)
        
           | pclmulqdq wrote:
           | The dependency is on the ABI, which can be OS-dependent.
           | Also, it is a depressingly manual optimization to do:
           | compilers don't know when it is safe to change a reference to
           | a copy (for example) without an analysis of future code that
           | they don't do.
        
             | littlestymaar wrote:
             | With Rust ownership guarantees, the compiler has the info
             | it needs to perform this kind of optimizations.
        
               | pclmulqdq wrote:
               | C and C++ are also set up such that the compiler can do
               | that optimization, they just don't. I'm pretty sure the
               | Rust compiler is in the same boat - has the information,
               | but doesn't do the optimization.
        
       | forrestthewoods wrote:
       | Oh neat, that's my blog. My old posts don't resurface on HN that
       | often.
       | 
       | Lots of criticism of my methodology in the comments here. That's
       | fine. That post was more of a self nerd snipe that went way
       | deeper than I expected.
       | 
       | I hoped that my post would lead to a more definitive answer from
       | some actual experts in the field. Unfortunately that never
       | happened, afaik. Bummer.
        
         | brundolf wrote:
         | Maybe it'll happen here! :)
        
         | the_mitsuhiko wrote:
         | My only criticism is the "ugly mess" part. You can implement
         | the traits on references too.
        
           | forrestthewoods wrote:
           | True, that does work for traits. But it's super annoying if
           | you have to write multiple copies of the same thing. That can
           | get out of control quick if you need to implement every
           | combination.
           | 
           | And that doesn't help at all if you're writing a "free
           | function" like 3D primitive intersection functions. I suppose
           | you could change that simple function into a generic function
           | that takes AsDeref? Bleh.
        
       | adham01 wrote:
       | [dead]
        
       | arcticbull wrote:
       | > Blech! Having to explicitly borrow temporary values is super
       | gross.
       | 
       | I don't think you ever have to write code like this. Implement
       | your math traits in terms for both value and reference types like
       | the standard library does.
       | 
       | Go down to Trait Implementations for scalar types, for instance
       | i32 [1]
       | 
       | impl Add<&i32> for &i32
       | 
       | impl Add<&i32> for i32
       | 
       | impl Add<i32> for &i32
       | 
       | impl Add<i32> for i32
       | 
       | Once you do that your ergonomics should be exactly the same as
       | with built in scalar types.
       | 
       | [1] https://doc.rust-lang.org/std/primitive.i32.html
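For a custom math type, those four impls look roughly like the following sketch (hypothetical Vec3 type; the three reference variants simply delegate to the by-value one):

```rust
use std::ops::Add;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}

// The canonical by-value impl holds the actual arithmetic.
impl Add for Vec3 {
    type Output = Vec3;
    fn add(self, o: Vec3) -> Vec3 {
        Vec3 { x: self.x + o.x, y: self.y + o.y, z: self.z + o.z }
    }
}

// The three reference combinations delegate to the by-value impl, so
// any mix of owned and borrowed operands works, as with i32.
impl Add<&Vec3> for Vec3 {
    type Output = Vec3;
    fn add(self, o: &Vec3) -> Vec3 { self + *o }
}

impl Add<Vec3> for &Vec3 {
    type Output = Vec3;
    fn add(self, o: Vec3) -> Vec3 { *self + o }
}

impl Add<&Vec3> for &Vec3 {
    type Output = Vec3;
    fn add(self, o: &Vec3) -> Vec3 { *self + *o }
}

fn main() {
    let a = Vec3 { x: 1.0, y: 2.0, z: 3.0 };
    let b = Vec3 { x: 4.0, y: 5.0, z: 6.0 };
    let expected = Vec3 { x: 5.0, y: 7.0, z: 9.0 };
    assert_eq!(a + b, expected);
    assert_eq!(a + &b, expected);
    assert_eq!(&a + b, expected);
    assert_eq!(&a + &b, expected);
}
```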
        
       | datafulman wrote:
       | [dead]
        
       | FpUser wrote:
       | I did the test on my computer:
       | 
       | Rust - By-Copy: 14124, By-Borrow: 8150
       | 
       | C++ - By-Copy: 12160, By-Ref: 11423
       | 
        | P.S. Just built it using LLVM under CLion IDE and the results
        | are:
        | 
        |   G:\temp\cpp\rust-cpp-bench\cpp\cmake\cmake-build-release\fts_cmake_cpp_bench.exe
        |   Totals:
        |   Overlaps: 220384338
        |   By-Copy: 4397
        |   By-Ref: 4396
        |   Delta: -0.0227428%
        | 
        |   Process finished with exit code 0
        
         | jeffbee wrote:
         | How did you build it? It doesn't build with either gcc-12 or
         | clang-15 on linux.
        
           | FpUser wrote:
           | I built it on Windows, Visual C++ 2022. Did not check Linux
           | as I do not think it matters much.
           | 
            | Now comes the big surprise: I just built it using LLVM
            | under CLion IDE and the results are:
            | 
            |   G:\temp\cpp\rust-cpp-bench\cpp\cmake\cmake-build-release\fts_cmake_cpp_bench.exe
            |   Totals:
            |   Overlaps: 220384338
            |   By-Copy: 4397
            |   By-Ref: 4396
            |   Delta: -0.0227428%
            | 
            |   Process finished with exit code 0
        
       | 29athrowaway wrote:
       | A more direct comparison would have been a r-value reference.
        
       | Rustwerks wrote:
       | I just went through all of this when building a raytracer.
       | 
       | * Sprinkling & around everything in math expressions does make
       | them ugly. Maybe rust needs an asBorrow or similar?
       | 
       | * If you inline everything then the speed is the same.
       | 
       | * Link time optimizations are also an easy win.
       | 
       | https://github.com/mcallahan/lightray
        
         | masklinn wrote:
         | > * Sprinkling & around everything in math expressions does
         | make them ugly. Maybe rust needs an asBorrow or similar?
         | 
         | Do you mean AsRef, or do you mean magic which automatically
         | borrows parameters and is specifically what rust does not do
         | any more than e.g. C does?
         | 
         | Though you can probably get both _if_ the by-ref version is
         | faster (or more convenient internally): wrap the by-ref
         | function with a by-value wrapper which is #[inline]-ed, this
         | way the interface is by value but the actual parameter passing
         | is byref (as the value-consuming wrapper will be inlined and
         | essentially removed).
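That wrapper trick might look like the following sketch (hypothetical function names): the public API takes a value, but the actual parameter passing is by reference.

```rust
// Internal by-ref implementation: parameter passing stays a pointer.
fn length_sq_ref(v: &[f32; 3]) -> f32 {
    v[0] * v[0] + v[1] * v[1] + v[2] * v[2]
}

// Public by-value facade. Marked #[inline] so the wrapper disappears
// at the call site and the by-ref version is invoked directly.
#[inline]
pub fn length_sq(v: [f32; 3]) -> f32 {
    length_sq_ref(&v)
}

fn main() {
    // Callers get by-value ergonomics, with no & at the call site.
    assert_eq!(length_sq([1.0, 2.0, 2.0]), 9.0);
}
```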
        
         | woodruffw wrote:
         | > Maybe rust needs an asBorrow or similar?
         | 
         | FWIW, the `Borrow`, `AsRef`, and `Deref` traits all exist to
         | support different variants of this.
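For instance, an AsRef bound lets a single signature accept both owned and borrowed arguments, so callers rarely need explicit borrows (a minimal sketch with a hypothetical helper):

```rust
// A bound like AsRef<str> accepts &str, String, &String, and so on,
// so callers don't have to sprinkle & at every call site.
fn byte_len<S: AsRef<str>>(s: S) -> usize {
    s.as_ref().len()
}

fn main() {
    assert_eq!(byte_len("abc"), 3);                // &str
    assert_eq!(byte_len(String::from("abcd")), 4); // owned String
}
```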
        
       | lowbloodsugar wrote:
       | I understand that this is an example for the purposes of
       | answering the given question, but when actually doing things with
       | 3D vertices one should be thinking in terms of structures of
       | arrays. As someone said here already: good generals worry about
       | strategy and great generals worry about logistics.
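The structure-of-arrays point, as a rough sketch (hypothetical types): each field gets its own contiguous array, so a pass over one field never drags the others through cache.

```rust
// Array-of-structs: the fields of each point are interleaved in memory.
#[derive(Clone, Copy)]
struct PointAos {
    x: f32,
    y: f32,
    z: f32,
}

// Struct-of-arrays: one contiguous array per field, which suits
// vectorized per-field passes over many points.
#[derive(Default)]
struct PointsSoa {
    xs: Vec<f32>,
    ys: Vec<f32>,
    zs: Vec<f32>,
}

impl PointsSoa {
    fn push(&mut self, p: PointAos) {
        self.xs.push(p.x);
        self.ys.push(p.y);
        self.zs.push(p.z);
    }

    // Streams over the x column only; y and z data stay out of cache.
    fn sum_x(&self) -> f32 {
        self.xs.iter().sum()
    }
}

fn main() {
    let mut pts = PointsSoa::default();
    pts.push(PointAos { x: 1.0, y: 9.0, z: 9.0 });
    pts.push(PointAos { x: 2.0, y: 9.0, z: 9.0 });
    assert_eq!(pts.sum_x(), 3.0);
}
```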
        
       | spuz wrote:
       | I'd be interested to know what the benchmarks of the two rust
       | solutions are when inlining is disabled so we can get an idea of
       | the different performance characteristics of each function call
       | even if it's not a very realistic scenario.
       | 
       | The other question I have is which style should you use when
       | writing a library? It's obviously not possible to benchmark all
       | the software that will call your library but you still want to
       | consider readability, performance as well as other factors such
       | as common convention.
        
       | ptero wrote:
       | I would go with the version that gives the clean user interface
       | (that is, by copy in this case). _If_ it turns out that the other
       | version is significantly more performant _and_ this additional
       | performance is critical for the end users consider adding the by-
       | borrow option.
       | 
        | The clarity of the code using a particular library is such a big
       | (but often under-appreciated) benefit that I would heavily lean
       | in this direction when considering interface options. My 2c.
        
         | daviddever23box wrote:
         | Agreed - and this applies in nearly every language: start
         | simple, trust your compiler, and optimize only when performance
         | becomes untenable.
        
           | osigurdson wrote:
           | The assumption behind such arguments is when a performance
           | problem does arise, a profiler will point to a single, easy
           | to fix, smoking gun. Unfortunately this is not always the
           | case. Performance problems can be hard to diagnose and hard
           | to fix. A lot of damage has been done by unexamined /
           | dogmatic "root of all evil" mantra.
        
             | mattgreenrocks wrote:
             | The misapplication of that mantra doesn't justify the
             | design damage done by dogmatically passing everything by
             | ref.
             | 
             | There's no hard and fast rule here. Even if there was,
             | optimizers still occasionally surprise seasoned native devs
             | in both positive and negative ways.
             | 
             | Glad the author's first instinct was to pull out profiling
             | tools.
        
             | throw10920 wrote:
             | In the vast majority of situations (1) you'll prematurely
             | optimize in the wrong place and (2) yes the profiler _will_
             | point to a single, easy-to-fix smoking gun.
             | 
             | Situations otherwise are the exception, rather than the
             | rule, and it takes an expert to (1) recognize those
             | situations and (2) know exactly how to write optimized code
             | in that situation.
             | 
             | That's why "don't prematurely optimize" is a good rule of
             | thumb - because it works the majority of the time, and it
             | takes experience to know when not to apply it.
        
               | osigurdson wrote:
               | Suggest acquiring the needed knowledge instead of
               | applying dogma. The true root of all evil is unexamined
               | dogma.
        
               | kllrnohj wrote:
               | > In the vast majority of situations [..] yes the
               | profiler will point to a single, easy-to-fix smoking gun.
               | 
               | [citation needed]
               | 
               | This claim depends hugely on the industry you're actually
               | working in and the problem space. Things like UIs & games
               | basically never have a single, easy-to-fix smoking gun.
               | The _entire app_ is more or less a hotspot - be it
               | interactive performance, startup performance, RAM usage,
               | or general responsiveness.
               | 
               | And once you're gone down the route of "build it first,
               | optimize it later" you're pretty much fucked when you get
               | to the "optimize" step because now your performance
               | mistakes are basically unfixable without a rewrite -
               | every layer of your architecture has issues that you
               | can't fix without drastic overhauls. It would have been
               | _much_ easier to do some up-front measurements, get some
                | guidelines in place (even if they aren't perfect), and
               | _then_ build the app.
        
           | kllrnohj wrote:
           | This advice hinges _hugely_ on what  "start simple" really
           | means. There's a ton of counter-examples here where that just
           | isn't true at all depending on what you're calling "simple".
           | In particular JIT'd languages can be especially problematic
           | here. An example would be using Java's Streams interfaces to
           | do something that could be done without much difficulty with
           | a regular boring ol' for loop. At the end of the day you're
           | hoping the JIT will eventually convert the streams version
           | into the same bytecode the for loop version would have
           | started with. But it won't do that consistently, and you've
           | still wasted time before it did so.
           | 
           | Trusting the compiler also means knowing what the compiler
           | actually understands & handles vs. what's a library-provided
           | abstraction that's maybe too bloated for its own good and
           | that quickly becomes "not simple" depending on your language
           | of choice.
        
           | mlindner wrote:
           | I agree in general, but the side-effect of doing this is that
           | no matter how fast your hardware gets, your software will
           | always end up optimized to the new hardware. So over time
           | your software gets slower and slower but performance stays
           | consistent as hardware gets faster.
        
       | ardel95 wrote:
       | Minor nit: many of the differences in the article aren't really
       | specific to the Rust vs C++, but rather differences between llvm
       | vs whatever compiler backend is used by msvc.
        
       | amelius wrote:
       | This is one of the problems I have with writing rust code. You
       | have to think about so many mundane details that you barely have
       | time left to think about more important and more interesting
       | things.
        
         | mlindner wrote:
         | Having written a lot of C, you spend basically all your time
         | thinking about "mundane details", and worse, if you make a
         | mistake, you often don't know you made a mistake until it's
         | running on some customer's servers and you just got a ticket
         | escalated 3 times up to you with vague information about
         | crashes rarely happening. Good luck remembering which bit of
         | code you wrote 6 months ago that may be causing the problem.
         | 
         | I'll take Rust shouting at me for missing "mundane details" any
         | day of the week.
        
         | scotty79 wrote:
         | As a Rust beginner that likes to learn the hard way I think I
         | have some insights why Rust seems cumbersome and/or hard for
         | programmers trying it.
         | 
         | Rust uses syntax that feels familiar but means completely
         | different things than in pretty much any other language.
         | 
         | For example '=' doesn't mean assign handle or copy. It by
         | default means move.
         | 
         | 'let' doesn't mean create a name for something. It means create
         | physical space for something (of known size) that can be moved
         | into or moved out of.
         | 
         | You don't deal with objects and values of primitive types.
         | Instead everything in Rust is a value. When you move, you move
         | the value. If you compare, you compare by value. If you pass
         | something from variable into function, you move the value into
         | the function.
         | 
         | And when the space where you keep the value goes out of scope
         | value dies with it if it wasn't moved out to somewhere else.
         | 
         | Scope for variables (which are just named spaces for values)
         | ends with the end of the block, but some values, created by
         | functions and returned from them, if they are not moved into
         | any named space, can die sooner, even in the middle of the line
         | where they were acquired from function call.
         | 
         | Everything else stems from that fixed size moved value
         | semantics. If you don't want to move the value into the
         | function when you call it you need to pass something else
         | instead, so you create and pass in the borrow. But you have to
         | ensure that the value doesn't die or get moved anywhere (even
         | inside the container you borrowed from) before borrows to it
         | all die.
         | 
         | Because of this you are better off with borrows that are short
         | lived and local. Often it's better to keep the index of an
         | element of a Vec instead of the borrow of this element. If you
         | must create types that contain borrows you must know that they
         | become borrows themselves and you need to treat them exactly
          | the same, trying to limit their scope and lifetime.
         | 
         | It's hard when you come from any other language because borrows
          | are superficially similar to pointers or references to objects.
         | So you try to use them as such. And crash into the compiler
         | because they are not that. What's worse their syntax is very
         | minimalistic which triggers intuition that they must be fast
         | and optimal solution for many problems which they sure can be
         | once you fully internalize their limitations but not a moment
         | sooner.
         | 
         | Another thing is that values in Rust must have the fixed size.
          | So even as simple a thing as a string requires a bit of hackery.
         | Basically in Rust the default strategy to have something of
         | variable size is to allocate it on the heap and treat pointer
         | to it (possibly with some other fixed sized data like length)
          | as the fixed-sized value you can move around, clone, and borrow.
         | 
         | So if you want to have semantics you know from other languages
         | you can't just use basic Rust syntax.
         | 
         | You need constructs such as Box and Rc, Cell, RefCell. Make
         | your things clonable and sometimes even copyable and avoid
         | creating borrows whenever possible initially. When you do it
         | Rust becomes as flexible as any other language and you can use
         | it pretty much just as comfortably. Then the value semantics
         | shines as you can very easily compare your data by value, order
          | it, create operators for it, create hashes for it so you can keep
         | it in HashMaps and HashSets. Then it's delightful.
         | 
         | My advice: when you create a long-lived type, wrap it in an
         | Rc and treat that Rc as your 'object'. And avoid borrows in
         | your types unless you have a very good (measured)
         | performance reason to have them, or you are creating
         | something obviously dependent and usually short-lived, like
         | an iterator.
        
           | zozbot234 wrote:
           | > For example '=' doesn't mean assign handle or copy. It by
           | default means move.
           | 
           | Some complain about this, but the fact is there's no such
           | thing as a zero-overhead "copy" for non-trivial types. C++
           | started out with = meaning "clone the object", which was
           | an even bigger footgun, and support for move had to be
           | added after the fact.
        
             | scotty79 wrote:
             | Yeah. Making = mean copy is a really bad idea. I very much
             | like the solution in Rust where attempt to move out
             | something that can't be moved out results in automatic copy
             | if the type implements trait Copy.
             | 
             | It's very elegant solution for simple, small data types.
             | But it further occludes how meaningfully Rust is different
             | from everything else because thanks to that = sometimes
             | does mean copy.
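A minimal sketch of that behavior (types invented for illustration):

```rust
// `=` moves by default, but silently copies when the type is Copy.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Point {
    x: i32,
    y: i32,
}

#[derive(Debug, PartialEq)]
struct Name(String); // not Copy: it owns heap data

fn main() {
    let p = Point { x: 1, y: 2 };
    let q = p; // Copy type: `p` stays usable
    assert_eq!(p, q);

    let a = Name("hi".to_string());
    let b = a; // move: `a` is no longer usable
    // println!("{:?}", a); // error[E0382]: borrow of moved value: `a`
    assert_eq!(b, Name("hi".to_string()));
}
```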
        
           | amelius wrote:
           | But Rc is only useful for creating tree-like data
           | structures.
           | 
           | One non-tree cross-link or back-link and you'll have to
           | redesign your entire code.
        
             | puffoflogic wrote:
             | Sibling and parent pointers are almost universally a sign
             | that an abstract data structure (and associated algorithm)
             | has been mistaken for concrete. The exception that comes to
             | mind first is Knuth's dancing links, and its obscurity is
             | an indication of the rarity of actually needing these
             | pointers. In any case, it's also a poster child for using
             | indices rather than pointers.
        
               | scotty79 wrote:
                | Currently I'm working on constructing proofs of
                | tautologies directly from a system of axioms using
                | substitution and the modus ponens rule.
                | https://en.m.wikipedia.org/wiki/List_of_Hilbert_systems
               | 
               | Main objects in my program are expression trees. I
               | manipulate them, cut them, merge them, compare them,
               | splice one into the other. Rc's enable me to have full
               | flexibility and share tremendous amount of data across
               | objects in my program.
               | 
                | Rust is an absolutely wonderful language for this
                | problem thanks to Rc's, enums, value semantics, auto-
                | derived traits, the ability to implement traits for
                | existing types, and of course speed.
                | 
                | I'm not implementing specific algorithms. I'm making
                | them up as I go, although I used some simple ones
                | like topological sort, or A* that eventually turned
                | into plain breadth-first search because I have no
                | idea how far I am from the solution.
        
               | amelius wrote:
               | > I'm not implementing specific algorithms. I'm making
               | them up as I go
               | 
                | It's mind-boggling to me that people are using a systems
               | programming language for mathematical research,
               | especially if they don't know yet what the final
               | algorithms will look like.
               | 
               | But all the more power to you for trying.
        
             | scotty79 wrote:
             | If you further wrap the Rc in an Option, you can set
             | cross-links and back-links to None when dropping your
             | data, which gets rid of the problem of cross-links or
             | back-links making reference counting leak memory. You
             | just need to be mindful not to lose the handle to a
             | cycle of nodes before you break the cycle by setting
             | some cross-links to None.
             | 
             | You can fairly easily refactor your almost-tree code to
             | adapt it to that additional Option wrap.
             | 
             | Of course you might instead opt to introduce some garbage
             | collector crate into your project. They usually provide
             | garbage collected Rc equivalent, which makes swapping it
             | out very easy.
             | 
             | Rc's are really very useful first approach to making
             | anything complex in Rust.
             | 
             | I usually have something like
             | 
             |       struct NodeStruct {
             |           my_data: i32,
             |           link: Node,
             |       }
             | 
             | and
             | 
             |       struct Node(Rc<NodeStruct>);
             | 
             | or
             | 
             |       struct Node(Option<Rc<NodeStruct>>);
             | 
             | if I need cross-links.
             | 
             | Great thing is you can then add 'methods' to your type with
             | impl Node {}
             | 
             | Or define operators and other traits with:
             | impl Add<Node> for Node {}
             | 
             | Sometimes, when I need mutability I even wrap the
             | NodeStruct in RefCell.
             | 
             | It seems like a lot of wrappers, but thanks to them the
             | code that uses this type can be very nice, with pretty
             | much 'normal modern language' semantics + value
             | semantics, while still being blazing fast.
             | 
             | When you implement Ord, Eq, and Hash, they all go
             | through the wrappers and let you treat your final Node
             | type as a comparable, sortable, hashable, and cheaply
             | clonable value. Dereferencing also goes through all or
             | most of the wrappers automatically.
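A runnable sketch of the wrapper pattern described above (RefCell is left out here so that Eq and Hash can still be derived; all names are illustrative):

```rust
use std::rc::Rc;

// The inner struct holds the data; Option makes links breakable,
// which is how cycles can be torn down before dropping.
#[derive(Debug, PartialEq, Eq, Hash)]
struct NodeStruct {
    my_data: i32,
    link: Option<Node>,
}

// The newtype hides the Rc plumbing; deriving Clone makes cloning a
// cheap reference-count bump, and Eq/Hash compare by value through
// the wrappers.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Node(Rc<NodeStruct>);

impl Node {
    fn new(my_data: i32, link: Option<Node>) -> Node {
        Node(Rc::new(NodeStruct { my_data, link }))
    }
}

fn main() {
    let leaf = Node::new(1, None);
    let parent = Node::new(2, Some(leaf.clone()));
    // The clone shares the same allocation...
    assert_eq!(Rc::strong_count(&leaf.0), 2);
    // ...and equality goes through the wrappers, by value.
    assert_eq!(parent.0.link.as_ref(), Some(&leaf));
}
```

Adding interior mutability would mean `Rc<RefCell<NodeStruct>>`, at the cost of losing the derived Eq/Hash.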
        
           | ReflectedImage wrote:
           | Rc, Cell & RefCell are supposed to be rare. For example,
           | I've got a 2,000-line Rust program in front of me, and
           | I've used Arc 3 times and RwLock once; that's all.
           | 
           | You need to structure your program as a Directed Acyclic
           | Graph (DAG), with things interacting only with the things
           | below them in the graph.
           | 
           | Then occasionally you might need to break the DAG structure
           | by using Rc, Cell & RefCell, etc...
        
             | scotty79 wrote:
             | The thing is, not everything can be expressed as a DAG.
             | 
             | And finding that out toward the end of writing your
             | program, after hours of fighting with the borrow
             | checker, is extremely unpleasant.
             | 
             | And I don't think I've ever landed in a situation where
             | I could fix the discrepancy by sprinkling in a few Rc's,
             | RefCells, and such.
             | 
             | So I prefer to write with RefCells from the start, and
             | once I have the thing working, if I'm ambitious enough,
             | I look at which parts could be borrows instead and swap
             | them out.
        
               | ReflectedImage wrote:
               | There are many many many ways you can express something
               | and it's very likely that one of those ways is a DAG.
               | 
               | The issue here is that you are writing C++ code rather
               | than Rust code.
        
               | scotty79 wrote:
               | How do you express as a DAG a tree where nodes need to
               | keep references to their children and parents?
               | 
               | Two separate synced trees? Is it worth it?
               | 
               | > The issue here is that you are writing C++ code rather
               | than Rust code.
               | 
               | How dare you! I'm writing TypeScript code! ;-)
               | 
               | Rust is not Forth. I can write whatever I want and
               | there's nothing wrong with that.
        
               | ReflectedImage wrote:
               | > How do you express as a DAG a tree where nodes need to
               | keep references to their children and parents?
               | 
               | Rewrite your program in a form where it does not contain
               | a tree.
               | 
               | If you want an actual tree as a data structure, see the
               | trees crate.
               | 
               | > Rust is not Forth. I can write whatever I want and
               | there's nothing wrong with that.
               | 
                | And other people write Haskell code in Python :p. If
                | your code style doesn't match the language you're
                | using, you're going to have a lot of unnecessary
                | friction.
        
               | scotty79 wrote:
               | I think Rust is flexible enough to still work very well
               | with my style.
               | 
               | But you inspired me about something. I think I can
               | rewrite the program that I am writing to use reverse
               | Polish notation instead of a tree. Thanks!
        
               | ReflectedImage wrote:
               | Good luck!
        
         | matheusmoreira wrote:
         | Well, it _is_ a systems programming language. Thinking about
         | exactly how the language passes bits around is the whole point.
         | Rust should specify a stable ABI already so that everyone can
         | form a good mental model of what their code becomes once
         | compiled.
        
           | amelius wrote:
           | True, I probably was using Rust for the wrong type of
           | problem, i.e. was hoping to write a user-level application
           | with a graphical UI at the time.
           | 
           | Rust is probably better used for writing fast low-level
           | libraries that you call from higher level languages, possibly
           | with a garbage collector, so you don't waste time thinking
           | about memory management while you design/write your high-
           | level application.
        
         | throw10920 wrote:
         | My (brief) experience with Rust was that, while I had to
         | struggle to learn the borrow checker, I didn't have lots of
         | "mundane details" to worry about - if anything, fewer than
         | in C(++).
         | 
         | What did you have in mind?
        
       | kibwen wrote:
       | Note that this is from 2019, so it's probably worth re-
       | benchmarking to see if anything has changed in the interim. Can
       | we get the year added to the title?
        
       | bjackman wrote:
       | A potential lesson here (i.e. I am applying confirmation bias to
       | retroactively view this article as justification for a strongly
       | held opinion, lol):
       | 
       | Unless you are gonna benchmark something, for details like this
       | you should pretty much always just trust the damn compiler and
       | write the code in the most maintainable way.
       | 
       | This comes up in code review a LOT at my work:
       | 
       | - "you can write this simpler with XYZ"
       | 
       | - "but that will be slower because it's a copy/a function call/an
       | indirect branch/a channel send/a shared memory access/some other
       | combination of assumptions about what the compiler will generate
       | and what is slow on a CPU"
       | 
       | I always ask them to either prove it or write the simple thing.
       | If the code in question isn't hot enough to bother benchmarking
       | it, the performance benefits probably aren't worth it _even if
       | they exist_.
        
         | Dobbs wrote:
         | edit: I misread the previous post. Ignore this.
         | 
          | How are you using the word "simpler"? Because to me it
          | implies a combination of obviousness and line count,
          | something a benchmark shouldn't be involved in.
          | 
          | For example, asking someone to delete 10 lines of code and
          | instead use Go's `net.SplitHostPort` would be an example
          | of "simpler".
        
           | 2OEH8eoCRo0 wrote:
           | I've read that good generals worry about tactics and great
           | generals worry about logistics.
           | 
           | Good programmers play code golf, great programmers write
           | readable and maintainable code.
           | 
           | Your example seems reasonable but programmers also like to
           | act like the smartest one in the room. I often come across
           | tricky and borderline obfuscated code because somebody wanted
           | to look clever. This is a logistical nightmare.
        
             | tested23 wrote:
              | Ugh, you're right, but then someone comes along and
              | uses this to rationalize not including things like
              | map, filter, and reduce in a language because they're
              | supposedly too complicated and "you can just do it
              | with a for loop".
        
               | baby wrote:
                | I work in a Rust codebase that uses a lot of
                | functional combinators, and I'll say this: on
                | average the imperative style takes fewer lines of
                | code and less indentation. I also personally find it
                | more readable, and idiomatic.
        
               | nicoburns wrote:
               | Functional iteration is good for the same reason we use
               | for loops over while loops, and while loops over goto:
               | they are more constrained, more clearly communicate
               | intent, and are therefore easier to reason about.
        
               | josephg wrote:
               | Sure but it's easy to go overboard with this stuff.
               | Reduce (fold) especially can be pretty hard to read in
               | hairy situations.
               | 
                | My general rule is that if a simple for loop needs
                | fewer lines of code to implement your logic, you
                | should probably use one.
        
               | josephg wrote:
                | Just because we're on the topic of performance: the
                | Rust optimizer can sometimes generate better code if
                | you use map / filter / etc. The slice iterator in
                | any context is a huge win over manual array
                | iteration because it only needs to do bounds
                | checking once.
                | 
                | JavaScript (V8, last I checked) is the opposite:
                | simple for loops almost always outperform anything
                | else.
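The bounds-check point can be illustrated like this; both functions behave identically, and the claim above is that the iterator form gives the optimizer less work to prove (not benchmarked here):

```rust
// Indexed loop: each `v[i]` is a candidate for a bounds check that
// the optimizer has to prove away.
fn sum_indexed(v: &[i32]) -> i32 {
    let mut total = 0;
    for i in 0..v.len() {
        total += v[i];
    }
    total
}

// Iterator: the slice iterator walks the slice directly, so there is
// no per-element index to check.
fn sum_iter(v: &[i32]) -> i32 {
    v.iter().sum()
}

fn main() {
    let v = [1, 2, 3, 4];
    assert_eq!(sum_indexed(&v), 10);
    assert_eq!(sum_iter(&v), 10);
}
```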
        
               | duckerude wrote:
               | I've seen cases where an iterator was better, but I've
               | also seen gains from using an imperative loop with manual
               | indexing. Loop conditions and the occasional assertion
               | can be enough to elide bounds checks. (Though sometimes
               | the compiler gets too paranoid about integer overflow.)
               | 
               | Most of the time you should just write whatever's
               | clear/convenient but sometimes it's worth trying both and
               | scrutinizing godbolt.
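The assertion trick mentioned above can look roughly like this (a sketch, not a guarantee that any particular compiler version elides the check):

```rust
// One up-front assert! tells the optimizer that every index below is
// in bounds, so the per-iteration check can be elided.
fn first_n_sum(v: &[i32], n: usize) -> i32 {
    assert!(n <= v.len()); // single check here...
    let mut total = 0;
    for i in 0..n {
        total += v[i]; // ...so this indexing needs no extra check
    }
    total
}

fn main() {
    assert_eq!(first_n_sum(&[1, 2, 3, 4], 2), 3);
    assert_eq!(first_n_sum(&[1, 2, 3, 4], 0), 0);
}
```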
        
           | karamanolev wrote:
           | That's what they're saying: if someone claims the more
           | complicated version is faster, they challenge them to
           | benchmark it. Usually, whoever argues the "it's faster"
           | point doesn't bother to benchmark it, and the simpler
           | thing (code-wise) wins out. So yes - the benchmark
           | decides performance; simplicity is measured in lines of
           | code, cyclomatic complexity, "the eye of the beholder",
           | or whatever other metric you choose, but usually it's
           | obvious.
        
         | Diggsey wrote:
         | One neat thing here is that the compiler is aware of which
         | types are `Copy` and not internally mutable (not contianing an
         | `UnsafeCell`). For these types, passing `&T` and `T` are
         | equivalent, so the compiler could just choose the faster
         | option.
         | 
         | Even if it's not smart enough to do that today, it could
         | implement this optimization in the future. This could work even
         | without inlining, since the Rust calling convention is
         | unstable, and an optimization based on type size could be
         | incorporated into it.
        
           | zozbot234 wrote:
           | It would be more advisable to add this as a Clippy lint,
           | because `&T` and `T` are not always equivalent wrt. FFI.
        
             | kibwen wrote:
             | Indeed, but the compiler is still capable of doing it on a
             | case-by-case basis. Quite often the observed semantics are
             | identical and it's easy for the backend to see that a
             | pointer has been created only to be immediately
             | dereferenced.
        
           | cwzwarich wrote:
           | It would be nice if Rust could do this, but it breaks
           | backwards compatibility. Some existing code depends on
           | pointer values of &T being equal or not equal.
        
             | comex wrote:
             | As an addendum, LLVM _can_ automatically perform the "
             | &T-as-T" optimization (without inlining) in some cases
             | where the callee function is in the same compilation unit
             | and known to not care about the pointer value. However,
             | these types of optimizations tend to be fragile, easily
             | disturbed when things get slightly complex.
        
         | throwaway894345 wrote:
         | I generally agree, but it's also not obvious to me in Rust (or
         | in Go) whether passing by reference or by copy is more
         | maintainable or clear. I guess what I want is some guidance on
         | what I should do by default, which you sort of give with "do
         | what is more maintainable", but I can't tell what that means in
         | practice (I've been told to default to pass-by-reference in the
         | past because most traits take &self and not self).
        
           | kibwen wrote:
           | _> I've been told to default to pass-by-reference in the past
           | because most traits take  &self and not self_
           | 
           | This is only blanket advice for designing traits, because as
           | the trait author you don't know what concrete type the
           | downstream user is going to want to use, and taking `&self`
           | in that circumstance is the choice that is friendliest to
           | both Copy and non-Copy types.
           | 
           | If you're just writing a non-generic function and you _do_
           | know what concrete types you're using, the flowchart is
           | pretty simple:
           | 
           | 1. If the type is not Copy, then pass by-ref if you just need
           | to read the value, pass by-mutable-ref if you just need to
           | mutate the value, and pass by-value if you want to consume
           | the value.
           | 
           | 2. If the type is Copy, then pass by-value, but if your type
           | is _really_ big or if benchmarking has determined that this
           | is a critical code path then pass by-ref.
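The flowchart translates into signatures roughly like these (the types and functions are invented for illustration):

```rust
// Case 2: small Copy type -- pass by value.
#[derive(Clone, Copy)]
struct Point {
    x: f32,
    y: f32,
}

// Case 1: non-Copy type.
struct Document {
    text: String,
}

fn length(p: Point) -> f32 {
    (p.x * p.x + p.y * p.y).sqrt()
}

fn word_count(doc: &Document) -> usize {
    // read-only: take &T
    doc.text.split_whitespace().count()
}

fn append(doc: &mut Document, s: &str) {
    // mutate: take &mut T
    doc.text.push_str(s);
}

fn into_text(doc: Document) -> String {
    // consume: take T by value
    doc.text
}

fn main() {
    assert_eq!(length(Point { x: 3.0, y: 4.0 }), 5.0);
    let mut d = Document { text: "hello world".to_string() };
    append(&mut d, " again");
    assert_eq!(word_count(&d), 3);
    assert_eq!(into_text(d), "hello world again");
}
```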
        
           | bjackman wrote:
           | Yeah totally agree it's not always/usually obvious. But there
           | are cases where there's a clear readibility/assumed-
           | performance tradeoff and in those cases I say always prefer
           | readibility (unless you benchmark).
        
           | __turbobrew__ wrote:
           | I would still consider myself a Go novice, but I have
           | been burned a number of times by passing simple objects
           | by reference and then having that object mutated, causing
           | subtle bugs. Also, Go is happy to blow your foot off if
           | you take a reference to a loop variable, although there
           | is a proposal to fix that.
           | 
           | Generally I find that fewer bugs get introduced when
           | passing by copy instead of by reference, but I'm sure
           | others have the opposite opinion.
        
             | mcguire wrote:
             | I have had the same results. Passing by copy is simpler and
             | less bug-prone and reduces the urge to "just set the value
             | since I have a reference to the object" which is a well-
             | paved road to significant pain.
             | 
             | And the objects have to get surprisingly large before
             | passing by reference really makes a difference.
        
             | 411111111111111 wrote:
             | This is about rust though and thats not really possible
             | there (at least to my knowledge). You should get a compiler
             | error if you attempt this.
             | 
             | I got very little experience in rust though, so there might
             | be a way (I'm just not aware of) to circumvent this check
        
             | [deleted]
        
           | saghm wrote:
           | I don't think one of them is more clear or maintainable
           | universally, but in a lot of contexts, there might be an
           | obvious choice. As a trivial example (that isn't quite fair
           | given that the topic is about structs), it will almost never
           | be more clear or maintainable to pass a shared reference to
           | an integer (although there may be cases where a mutable
           | reference might make sense). I don't think there's much need
           | for one to have precedence over the other by default; if
           | anything, I see the discussion about performance tradeoffs
           | not being worth fretting about in the absence of actual
           | measurement to be an argument _against_ one of them being
           | inherently preferable.
        
           | jackmott wrote:
           | [dead]
        
         | forrestthewoods wrote:
         | Blog author here. I somewhat agree, somewhat disagree. This
         | line makes me uneasy:
         | 
         | > I always ask them to either prove it or write the simple
         | thing. If the code in question isn't hot enough to bother
         | benchmarking it, the performance benefits probably aren't worth
         | it _even if they exist_.
         | 
         | One of my philosophies is that death by a thousand cuts is
         | fine, but death by ten thousand cuts isn't. A team of 10
         | engineers can probably fix most of a thousand cuts in two or
         | three months. But if you have ten thousand cuts you're probably
         | doomed. And those don't show up cleanly in a flame graph.
         | 
         | Now for some context my background is video games. Which means
         | the team knows they need to hit an aggressive performance bar.
         | This isn't true for many projects. shared_ptr is a canonical
         | example of death by ten thousand cuts.
         | 
         | That said, I strongly agree with the principle of "just do the
         | simple thing". However I think it's important to have "sane
         | defaults". A project can easily have a thousand or ten thousand
         | papercuts that kill performance. But you can't microbench every
         | tiny decision. And microbenches are only a vague approximation
         | of what actually matters.
         | 
         | I'm also wary of "the compiler will make it fast". Because
         | that's true... until it's not! Although these days you don't
         | have any choice but to lean heavily on the compiler and "trust
         | but verify".
         | 
         | No one wants a super complex solution if it's not needed.
         | However I am very amenable to "do a slightly more complex thing
         | if you know it's correct and we can never think about this ever
         | again". It's much easier to do the fast thing upfront than for
         | someone else to try and speed it up in two years when we're
         | doing a papercut pass.
        
           | klyrs wrote:
           | > But if you have ten thousand cuts you're probably doomed.
           | And those don't show up cleanly in a flame graph.
           | 
           | I am reminded of the lovely nanosecond/microsecond talk by
           | Grace Hopper. If your code does a little bit of setup and
           | then spends all of its time in a single hotspot, fine. But if
           | your code is full of microsecond-suboptimal speed bumps,
           | they can hide your hotspot altogether. And a flat-ish
           | flame graph looks fine: nothing stands out as a problem!
           | 
           | It's valuable to do micro-benchmarks, not just to hone your
           | optimization skills, but to learn optimal patterns in your
           | language of choice. Then, when you're "in the zone" and
           | laying down new code, you just do the optimal thing
           | reflexively. Or, when you're reviewing or rewriting
           | something, those micro-hotspots jump out and grab your
           | attention.
           | 
           | There's a reason that ancient software running on ancient
           | hardware is way more responsive & snappy than what we have
           | today. Laziness.
        
             | Dylan16807 wrote:
             | > There's a reason that ancient software running on ancient
             | hardware is way more responsive & snappy than what we have
             | today. Laziness.
             | 
             | Laziness in terms of using entirely inappropriate
             | algorithms, sure.
             | 
             | Laziness in not microbenchmarking minutia? It shouldn't be.
             | There's a limit on how much that can hurt you. I would say
             | much less than a factor of ten, but let's go with 10x just
             | for argument's sake. If you have a CPU that's 500x faster,
             | and use easy code that's 10x slower, you're doing just
             | fine. This is not the problem with modern unresponsiveness.
        
               | klyrs wrote:
               | When I rewrite python code in C, I often hit 1000x
               | speedups, and sub-100x is rare. And that's line-for-line.
               | When I fix an accidentally-quadratic issue, for example,
               | I've seen speedups in the billions without even changing
               | the language.
               | 
               | People have lionized Knuth's quote about premature
               | optimization, and used that to ignore performance issues
               | across the board. Since the early '00s, we have not seen
               | a 500x improvement in CPU speed. It's less than 2x on
               | frequency, and let's say 8x on core-count for most users
               | (which doesn't help your single-core lazy programmer). In
                | my experience, programmers will make projections
                | based on 500x-faster processors that _will never
                | arrive_, because it's easier than honing their
                | skills and keeping
               | them sharp. And even if these magical THz-frequency chips
               | arrive, if you have three layers of 10x slowdowns, you're
               | back down to GHz.
        
               | josephg wrote:
                | This has been my experience too. I wrote a text CRDT
                | last year which improved on Automerge's (then)
                | 5-minute runtime; my code currently takes 6ms to do
                | the same work.
               | 
               | Automerge's design assumed this stuff would always be
               | slow, so they had this whole frontend / backend code
               | split so they could put the expensive operations on
               | worker threads. Good optimizations in the right places
               | make all that complexity unnecessary. The new automerge
               | is shaping up to be simpler as well as faster.
        
               | Dylan16807 wrote:
               | > When I rewrite python code in C, I often hit 1000x
               | speedups, and sub-100x is rare. And that's line-for-line.
               | When I fix an accidentally-quadratic issue, for example,
               | I've seen speedups in the billions without even changing
               | the language.
               | 
               | And neither of those is a microbenchmark thing, which is
               | kind of my point. I'm surprised language would hurt that
               | much, but that's enough to break things on its own
               | without any layering.
               | 
               | > Since the early '00s, we have not seen a 500x
               | improvement in CPU speed. It's less than 2x on frequency,
               | and let's say 8x on core-count for most users (which
               | doesn't help your single-core lazy programmer).
               | 
               | I don't think people are talking about 2004 when they
               | talk about the responsiveness of ancient software on
               | ancient hardware. I interpret that as more like an Apple
               | II. But instructions per clock have also gone up a lot
               | since the pentium 4 days, and having more than one core
               | in your CPU has a huge impact even for single-threaded
               | programs.
        
               | klyrs wrote:
               | > And neither of those is a microbenchmark thing, which
               | is kind of my point. I'm surprised language would hurt
               | that much...
               | 
               | The point I'm making here is that every line matters --
                | not just the hotspots. If you're surprised that language
               | can have that much impact, perhaps it's time to learn a
               | bit about performance issues that you're being dismissive
               | of?
               | 
               | > I don't think people are talking about 2004 when they
               | talk about the responsiveness of ancient software on
               | ancient hardware.
               | 
               | No, I was responding to your mention of a 500x
               | improvement in hardware. That pipedream ended in the
               | early 00s, and people still talk like Moore's law will
               | absolve their inattentive coding practice. And that felt
               | fine in the decades we went from kHz to GHz, but it's
               | unacceptable today.
        
               | Dylan16807 wrote:
               | > The point I'm making here is that every line matters --
               | not just the hotspots.
               | 
               | That depends on how it would have performed if you only
               | transformed the hottest 10% into C.
               | 
               | But I was mainly responding to the idea that micro-
               | optimizations are needed to keep general software snappy,
               | and I don't think they are. If one language is that much
               | faster, that's not micro-optimization.
               | 
               | > No, I was responding to your mention of a 500x
               | improvement in hardware.
               | 
               | What do you mean "No"? I was talking about current
               | hardware being 500x faster than 1988 hardware, which _it
                | is_. If that's not what you meant by "ancient software
                | on ancient hardware", fine, but _that's what my 500x
                | was talking about_.
               | 
               | > people still talk like Moore's law will absolve their
               | inattentive coding practice
               | 
               | I'm not trying to excuse inattentive coding. I'm trying
               | to say certain kinds of attention are important and
               | others aren't.
        
             | naasking wrote:
             | > And a flat-ish flame graph looks fine: nothing stands out
             | as a problem!
             | 
              | If your program is still slow, that would also indicate
              | that everything is a problem, i.e. the ten thousand cuts.
             | Start optimizing at some obvious spots and then see what
             | happens.
        
           | bjackman wrote:
           | Haha "death by a thousand cuts" is exactly the phrase I
           | encounter in these debates!
           | 
           | And actually I still disagree - e.g. I once took over a DMA
           | management firmware and the TL told me "we are really trying
           | to avoid DBATC so we take care to write every line
           | efficiently". But the thing was that once you have a holistic
            | understanding of the system's performance you tend to find
           | _only a small fraction of the code ever affects the metrics
           | you care about_!
           | 
           | E.g. in that case the CPU was so rarely the bottleneck that
           | it really didn't matter, we could have rewritten half the
           | code in Python (if we'd had the memory) without hurting
           | latency or throughput.
           | 
           | Admittedly I can see how games or like JS engines might be a
           | kinda special case here, where the OVERALL compute bandwidth
           | begins to become a concern (almost like an HPC system) and
           | maybe then every line really does count.
        
           | Dylan16807 wrote:
           | Very little of your code is in hot loops. If the code that
           | takes half a millisecond per frame _could_ be twice as fast,
            | but the hot loop is very optimized, then it doesn't really
           | matter. And that's what I would think of by default for
           | having many many cuts. Better to spend the optimization
           | effort elsewhere.
           | 
           | > shared_ptr is a canonical example of death by ten thousand
           | cuts
           | 
           | Why does that count as ten thousand cuts rather than one cut?
           | That doesn't sound intractable to fix if you have months.
        
             | [deleted]
        
         | nextaccountic wrote:
         | > I always ask them to either prove it or write the simple
         | thing.
         | 
          | Even if they do, they also need to make the case that, in
          | this specific instance, performance matters enough to
          | pessimize code simplicity and maintainability.
         | 
         | Also, if performance is that critical, it's imperative to
         | benchmark again after each compiler release to guard against
         | codegen regressions. And benchmark after changing this piece of
         | code. Otherwise, we can say that performance doesn't really
         | matter.
        
           | eldenring wrote:
            | I think you're missing the point of the previous comment:
            | they are saying that a good proxy for whether it's worth
            | optimizing is whether you're willing to benchmark it.
        
         | szundi wrote:
          | My favorite way of thinking. It should also be applied to
          | question the need for the existence of the given feature or
          | function in the first place - best of all, delete the whole
          | thing.
        
         | heydenberk wrote:
         | Even if you do benchmark something, maintainability can be more
         | important than a marginal performance improvement.
         | 
         | I've seen this happen _a lot_ with JavaScript, particularly in
         | the last 5-10 years as JS engines have developed increasingly
          | sophisticated approaches to performance. Today's optimization
         | can be tomorrow's de-optimization. Even given an unchanging
         | landscape of compiler/interpreter, tightly-optimized code can
         | become de-optimized when updated and extended, as compared to
         | maintainable code that may not suffer much performance
         | degradation upon extension.
        
         | Patrol8394 wrote:
         | This x 10000 ! If I had a dime for every time I provided this
          | exact feedback in code reviews... I find it surprising that a
          | lot of devs in the tech industry are obsessed with pointless
          | micro-optimizations and don't care about writing
          | maintainable, testable, simple code. My final comment is
          | always to not outsmart compilers/JVMs, because they tend to
          | do a much better job than developers.
         | 
         | Please, don't optimize unless you have reasons to do so and
         | numbers backing that up.
        
           | ok123456 wrote:
           | This is true for application code. But, Rust is trying to
           | sell itself as a systems language and an embedded language
           | and a language you can write kernel modules in. Memory budget
           | matters in these cases.
        
             | tialaramex wrote:
             | If memory budget matters, you _have_ a memory budget. So
             | you should be measuring and you can actually tell where you
             | need improvements.
             | 
             | But in practice what we see _overwhelmingly_ is that people
                | want to do this stuff but they aren't measuring, because
             | measuring is _boring_ whereas making the code more
             | complicated to show off how much you think you know about
             | optimisation is easy. Knock it off.
        
               | ok123456 wrote:
               | Then knock off trying to use Rust as a systems language.
                | Linear types make refactoring this nearly impossible if
                | it does become an issue.
        
               | tialaramex wrote:
                | Works really nicely for me - of course, I actually
                | measure what I'm doing.
               | 
               | You edited your comment, so I guess I will too: Rust
               | doesn't actually have Linear types. Linear types ("must
               | use or compile error") would be tricky to provide, Aria
               | blogged about it back in the day. So that's definitely
               | going to be a problem with your refactoring.
        
           | dahfizz wrote:
           | It depends on your specialization, I guess. If you're making
           | a website, a few microseconds here and there probably don't
           | matter.
           | 
           | But in my field (Fintech), performance really does matter.
           | Doing the simple, slow thing is just lazy and won't make it
           | through review.
        
             | imron wrote:
             | Great. Should be easy to prove then.
        
               | dahfizz wrote:
               | Yup, we have a good benchmarking suite to make sure
               | changes don't cause regressions and that optimization
               | changes actually work.
               | 
               | That said, I think asking a developer to write everything
               | they do twice so that they can A/B test is overboard. You
               | can come back and really aggressively optimize later, but
               | I think the "default" should be the fast thing, rather
               | than the slow & easy thing.
        
               | bmacho wrote:
               | If "performance really does matter" (in Fintech) then a
               | developer surely can write everything twice or more.
        
               | dahfizz wrote:
               | We could, yeah. I'm sure you're capable of re-writing
               | everything you do twice. It's just a huge waste of time.
               | 
                | You get just as much benefit by assigning a
                | performance refactor to an engineer when needed vs
                | literally halving (or worse) the whole team's
                | productivity.
        
               | elimerl wrote:
               | The idea is that the "default" is the easy thing, which
               | is usually optimized by the compiler.
        
           | josephg wrote:
           | My advice is the opposite: if you want to make performance
           | justifications for code, you need a benchmarking suite. I
           | have them for a lot of my projects. (Rust's criterion is a
           | delight). A good benchmark suite is a subtle thing to write -
           | you want real world testing data, and benchmarks for a range
           | of scenarios. The benchmarks should be run often. For some
           | changes, I'll rerun my benchmarks multiple times for a single
           | commit. I benchmark the speed of a few operations,
           | serialisation size, wasm bundle size and some other things.
           | 
           | Having real benchmarking data is eye opening. It was shocking
           | to me how much the wasm bundle size increased when I added
           | serialisation code. The time to serialise / deserialise a big
           | chunk of data for me is 0.5ms - so fast that it's not worth
           | more microoptimizations. Lots of changes I think will make
           | the code slower have no impact whatsoever on performance. And
           | my instincts are _so often_ wrong. About 50% of
           | microoptimizations I try either have no effect or make the
           | code slightly slower. And it's quite common for changes that
           | shouldn't change performance at all to cause significant
           | performance regressions for unexpected reasons.
           | 
           | I've also learned how important "short circuit" cases can be
           | for performance. Adding a single early check for the trivial
           | case in a function can sometimes improve end to end
           | performance by 15-20%, which in already well tuned code is
           | massive.
           | 
           | Performance work is really fun. But if you do performance
           | tuning without measuring the results, you're driving
           | blindfolded. You're as likely to make your code worse as you
           | are to make it better. Add benchmarks.
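The "short circuit" pattern josephg describes can be sketched in Rust. The types and the overlap test here are hypothetical illustrations, not taken from the article's benchmark source:

```rust
// Hypothetical sketch of a "short circuit" fast path: reject the
// common non-overlapping case on the first failing axis, skipping
// the remaining comparisons entirely.
#[derive(Clone, Copy)]
struct Aabb {
    min: [f32; 3],
    max: [f32; 3],
}

fn overlaps(a: Aabb, b: Aabb) -> bool {
    for i in 0..3 {
        // Early out: in a scattered scene, most pairs fail on the
        // first axis, so the full test rarely runs.
        if a.max[i] < b.min[i] || b.max[i] < a.min[i] {
            return false;
        }
    }
    true
}
```

Whether such a check is a win depends on how often the trivial case actually occurs, which is exactly why it needs to be measured rather than assumed.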
        
         | jjice wrote:
         | > but that will be slower because it's a copy/a function
         | call/an indirect branch/a channel send/a shared memory access
         | 
         | I really dislike these takes. I see engineers optimize these
         | cases and then go ahead and make two separate SQL queries that
         | could be one, ruining any false optimization gains they got by
          | lord knows how many times over.
         | 
         | Yeah, you can loop over that 100 element list twice doing basic
         | computation if you want, it's not going to make a difference
         | for many engineering workloads, but could make a big difference
         | in readability.
        
         | saghm wrote:
         | This seems to be an unpopular opinion, but I feel similarly
          | about how sometimes people seem to toss out `inline` (and,
          | even more suspect, `inline(always)`) annotations on Rust
          | functions like candy on Halloween, and there are almost
          | never any actual measurements of whether it helps in the
          | cases where it's used. It's not even that I think it really
          | hurts that much
         | in most of the stuff I've worked on (which tends to be more
         | sensitive to concurrency design and network round trips), but I
         | can't help but worry that people using stuff like this when
         | they don't seem to fully grasp the implications is a recipe for
         | trouble.
        
           | bjackman wrote:
           | Yeah inline is an absolute classic for this. The number of
           | uncommented __attribute__((always_inline))s I see in C code
           | drives me crazy. There are absolutely legitimate reasons to
           | use that attribute but there should ALWAYS be a comment about
           | why, so that later readers know in what conditions they can
           | safely remove it.
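For readers unfamiliar with the attributes under discussion, a minimal Rust sketch (hypothetical functions) of the two annotations and the kind of justifying comment being asked for:

```rust
// Illustrative only: `#[inline]` makes a non-generic function
// eligible for cross-crate inlining (a hint, not a command), while
// `#[inline(always)]` is a much stronger request to the backend.
#[derive(Clone, Copy)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}

// Hint: small, hot, and potentially called across crate boundaries.
#[inline]
fn dot(a: Vec3, b: Vec3) -> f32 {
    a.x * b.x + a.y * b.y + a.z * b.z
}

// Stronger request -- per the parent comments, this is exactly the
// kind of annotation that should carry a comment explaining why it
// was measured to help.
#[inline(always)]
fn length_squared(v: Vec3) -> f32 {
    dot(v, v)
}
```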
        
       | the__alchemist wrote:
        | Interesting! Of note, my `Vec3` and `Quaternion` types (f32 and
       | f64) have `Copy` APIs, but I've wondered about this since their
       | inception.
        
       | yobbo wrote:
        | The Rust test implements the traits Add, Sub, Mul by value. This
       | makes the few references less important in the total test. The
       | ergonomics argument is motivated by using these traits.
       | Otherwise, references would have had the same ergonomics.
       | 
       | But also, the struct is 3x32 bits, and Rust auto-implements the
       | Copy-trait for it. It is barely larger than u64, which is the
       | size of the reference.
       | 
       | But life is only simpler when Copy and Clone can be auto-
       | implemented.
        
       | birdyrooster wrote:
       | I guess by-copy bc I'm cool
        
       | jackmott wrote:
       | [dead]
        
       | celeritascelery wrote:
        | I don't feel like this gave a satisfactory answer to the
        | question. Since everything was inlined, the argument passing
        | convention made no difference in the micro benchmarks. But
        | what happens when it does not inline? Then you would actually
        | be testing by-borrow vs by-copy instead of how good Rust is
        | at optimizing.
        
         | ncallaway wrote:
         | I sort of agree and sort of disagree.
         | 
          | > Then you would actually be testing by-borrow vs by-copy
          | instead of how good Rust is at optimizing.
         | 
          | I don't think the question is actually: "what is faster in
          | practice, a by-copy method call or a by-borrow method call", I
         | think the question is: "as an implementer, which semantics
         | should I choose when I'm writing my function".
         | 
         | For the second question: "Rust is usually pretty good at
         | aggressively inlining, so... if you're willing to trust Rust's
         | compiler, you're often okay going with by-copy implementations,
         | but you should keep an eye on it". Whereas, as you note, for
         | the first question it's not an answer.
         | 
         | But, I do think if someone was going to put more work into it
         | I'd be very curious what the answer to the first question is.
         | If I'm choosing to implement with by-copy semantics and
         | trusting the Rust compiler to hopefully inline things for me,
         | I'd like to know the implications in the cases when it doesn't.
        
           | forrestthewoods wrote:
           | Blog author here. This feels like the best summary in this
           | comment section.
           | 
           | The root question is indeed "what semantics should I use".
           | And the answer I came up with was "the compiler does a lot of
           | magic so by-copy seems pretty good". I agree with the
            | previous commenter that this is not a satisfying conclusion!
           | 
           | My experience with Rust is that it requires a moderate amount
           | of trust in the compiler. Iterator code is another example
           | where the compiler should produce near optimal code. Emphasis
           | on should!
        
             | FpUser wrote:
              | When the value size is small (whatever "small" means for
              | a particular architecture) I'd say the "trust the
              | compiler" suggestion is reasonable. When the size grows
              | there should be no more "trust", unless the compiler can
              | determine whether it is safe to use a ref instead of a
              | value based on the value's size (we assume that the
              | function does not mutate the value).
             | 
              | Your tests on my PC:
              |     Rust - By-Copy: 14124, By-Borrow: 8150
              |     C++  - By-Copy: 12160, By-Ref: 11423
              | 
              | P.S. Just built it using LLVM under CLion IDE and the
              | results are:
              |     G:\temp\cpp\rust-cpp-bench\cpp\cmake\cmake-build-release\fts_cmake_cpp_bench.exe
              |     Totals:
              |       Overlaps: 220384338
              |       By-Copy: 4397
              |       By-Ref: 4396  (Delta: -0.0227428%)
              |     Process finished with exit code 0
        
               | nicoburns wrote:
               | > When the size grows there should be no more "trust"
               | unless compiler can decipher if it is safe to use ref
               | instead of value basing on value size
               | 
               | I believe that the Rust compiler at least does exactly
               | that. Large structs will be passed by reference under the
                | hood even if they are passed by value in the code. I
               | C++ compilers do the same, although I'm not sure about
               | that.
        
               | FpUser wrote:
                | Now comes a big surprise: I just built it using LLVM
                | under CLion IDE and the results are:
                |     G:\temp\cpp\rust-cpp-bench\cpp\cmake\cmake-build-release\fts_cmake_cpp_bench.exe
                |     Totals:
                |       Overlaps: 220384338
                |       By-Copy: 4397
                |       By-Ref: 4396  (Delta: -0.0227428%)
                |     Process finished with exit code 0
        
               | josephg wrote:
               | Why is performance so much better in this case? That
               | seems like a suspiciously large delta from the first
               | test.
               | 
               | Were the other benchmarks run in debug mode / with
               | optimizations turned off or something like that? What
               | compiler & flags are you using?
        
               | FpUser wrote:
               | >"Were the other benchmarks run in debug mode / with
               | optimizations turned off or something like that?"
               | 
                | Why would I do something like that? Of course all
                | builds are release mode, optimized for speed.
                |     Rust - Windows            - By-Copy: 14124, By-Borrow: 8150
                |     C++ - Windows MS Compiler - By-Copy: 12160, By-Ref: 11423
                |     C++ - Windows LLVM 15     - By-Copy: 4397,  By-Ref: 4396
               | 
               | >"Why is performance so much better in this case?"
               | 
               | Not sure and not in a mood to investigate. I do know if
               | cache locality and branch prediction stars line up
               | properly the performance difference can be staggering.
               | Maybe LLVM has accomplished something nice in this
               | department.
        
               | forrestthewoods wrote:
               | I just updated Visual Studio 2022 with all the latest
               | updates and installed the Clang toolchain. I also updated
               | Rust to latest.
               | 
                |     C++ MSVC:  By-Copy: 12,077  By-Ref: 11,901
                |     C++ Clang: By-Copy: 5,020   By-Ref: 5,029
                |     Rust:      By-Copy: 3,173   By-Borrow: 3,148
               | 
               | All on Windows, and on the same i7-8700k desktop I used
               | for the original post in 2019.
               | 
               | Your Rust numbers are particularly curious. Maybe run
               | `rustup update` and try again?
        
         | furyofantares wrote:
         | I feel like they got excited by their C++ code being so much
         | slower and curious about the "weird" C++ result and forgot to
         | figure out the original question.
        
           | FpUser wrote:
            | >"C++ code being so much slower"
            | 
            |     Rust - Windows            - By-Copy: 14124, By-Borrow: 8150
            |     C++ - Windows MS Compiler - By-Copy: 12160, By-Ref: 11423
            |     C++ - Windows LLVM 15     - By-Copy: 4397,  By-Ref: 4396  (Delta: -0.0227428%)
           | 
            | So it appears that C++ - Windows LLVM 15 beats Rust by a
            | large margin.
        
           | fnordpiglet wrote:
           | To be fair I got excited too. But I still want to know the
           | answer as well.
        
         | jasonhansel wrote:
         | In Rust it's considered idiomatic to pass things by-value
         | whenever you can. Usually this is also the most performant
         | option, since it avoids dereferencing in the callee.
         | 
         | Of course, if your struct is truly enormous, you may want to
         | break this rule to avoid large copies. But in that case you
         | probably want to Box<T> the struct anyway.
         | 
         | Of course, if your struct contains something that can't be
         | copied--like a Vec<T>--you'll have to decide whether to clone
         | the whole struct (and thus the vector in it), pass the struct
         | by-borrow, or find some other solution.
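The trade-off in the last two paragraphs can be sketched like this (the type is hypothetical):

```rust
// For a genuinely large payload, boxing keeps moves cheap: moving a
// Box<Huge> copies one pointer, not the 32 KiB array behind it.
struct Huge {
    samples: [f64; 4096],
}

// Takes ownership by value, but only a pointer crosses the call
// boundary; the array itself stays in its heap allocation.
fn process(h: Box<Huge>) -> f64 {
    h.samples.iter().sum()
}
```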
        
           | brundolf wrote:
           | I don't think I'd agree that idioms come into play here, one
           | way or the other. Safely borrowing things by reference is one
           | of Rust's headline features
        
             | kibwen wrote:
              |  _> Safely borrowing things by reference is one of
              | Rust's headline features_
             | 
             | Sure, but it's worth noting that references in Rust do not
             | exist merely to avoid passing by-value. They also exist to
             | make it easier to deal with Rust's ownership semantics:
             | they let you pass things to a function without also
             | requiring the function to "pass back" those things as
             | returned values. In other words, references let you do `fn
             | foo(x: &Bar)` rather than `fn foo(x: Bar) -> Bar`. This is
             | a unique and interesting consequence of languages with by-
             | default move semantics.
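A minimal sketch of the two shapes kibwen contrasts, using a hypothetical non-Copy type:

```rust
// With by-default move semantics, passing by value gives the callee
// ownership, so ownership must be "passed back" via the return type.
// A borrow sidesteps that entirely. `Bar` is a made-up example type.
struct Bar {
    data: Vec<u32>,
}

// fn foo(x: &Bar): the caller keeps ownership.
fn sum_borrowed(x: &Bar) -> u32 {
    x.data.iter().sum()
}

// fn foo(x: Bar) -> Bar: ownership threaded through the return value.
fn sum_moved(x: Bar) -> (u32, Bar) {
    let s = x.data.iter().sum();
    (s, x)
}
```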
        
       | throwawaybycopy wrote:
        | Should have also tried pass-by-move.
        
       | ergonaught wrote:
       | It's compiled, so, without any investigation at all, I would have
       | been disappointed if there were any significant difference in the
       | code emitted in these cases. I would expect the compiler to do
       | the efficient thing based on usage rather than the particular
       | syntax. I may have too much faith in the compiler.
        
         | CHY872 wrote:
         | I'd expect your claim to be true whenever the callee is inlined
         | into the caller. In this case, the compiler has all the
         | relevant information at the right point in time. As other
         | commenters have pointed out, by enabling inlining the author
         | has gone down a rabbit hole somewhat unrelated to the question,
         | because any copies can be simply elided.
         | 
         | If there's no inlining at play, I'd expect vast differences to
         | be possible. For example, imagine a chain of 3 functions - f
         | calls g, g calls h, where one of the arguments is a 1kB struct
         | and the options are passing by copy or by borrowing. In this
         | case, each stack frame will be 1kB in size in the copy case and
         | there will be a large performance overhead as opposed to the
         | by-reference case. One would expect simply calling the function
         | to be similar in overhead to an uncached memory load.
         | 
         | Within a single crate the inlining is possible, with multiple
         | crates it's only possible with LTO enabled (and I'm not sure
         | how _probable_ it is that the inlining would occur).
         | 
         | In either case, the difference between a 32 byte and 8 byte
         | argument in terms of overhead is likely meaningless - the sort
         | of thing to be optimized if profiling says it's a problem as
         | opposed to ahead of time.
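The f -> g -> h chain described above can be sketched as follows; `#[inline(never)]` keeps the calls out-of-line, though note that rustc's default ABI may still pass a struct this large via a hidden pointer, so whether copies actually materialize in each frame is up to the backend:

```rust
// Illustrative sketch: a large Copy struct passed down a call chain.
// With inlining suppressed, the by-value chain can copy the argument
// at each call (subject to ABI/backend decisions), while the
// by-reference chain passes only a pointer.
#[derive(Clone, Copy)]
struct Big {
    bytes: [u8; 1024],
}

#[inline(never)]
fn h_copy(b: Big) -> u8 { b.bytes[0] }

#[inline(never)]
fn g_copy(b: Big) -> u8 { h_copy(b) }

#[inline(never)]
fn h_ref(b: &Big) -> u8 { b.bytes[0] }

#[inline(never)]
fn g_ref(b: &Big) -> u8 { h_ref(b) }
```

Comparing the generated assembly for the two chains (e.g. on a compiler explorer) is a quick way to see what the backend actually decided.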
        
           | kibwen wrote:
           | _> Within a single crate (more specifically, codegen unit)
           | the inlining is possible_
           | 
           | Cross-crate inlining happens all the time. In order to be
           | eligible for inlining, a function needs to have its IR
           | included in the object's metadata. This happens automatically
           | for every generic function (it's the only way
           | monomorphization can work), and for non-generic functions can
           | be enabled manually via the `#[inline]` attribute (which does
            | not _force_ inlining, it only makes it possible to inline
            | at the backend's discretion).
           | 
            | However, as you say, if you have LTO enabled then "cross-
           | crate" inlining can happen regardless, since it's all just
           | one giant compilation unit at that point.
        
         | cogman10 wrote:
         | At the VERY end of the article, the author points out "Oh, btw,
         | I used MSVC for the C++ compilation, when I used clang things
         | changed!"
         | 
         | So, what the author actually measured was the difference
         | between llvm and msvc throughout the article. Particularly when
         | they talked about rust being better at autovectorization than
         | C++.
        
           | forrestthewoods wrote:
           | Incorrect. Clang C++ vs MSVC C++ is very comparable, and
           | noticeably worse for f64 by-ref. Clang C++ is still slower
           | than Rust by a large margin. Using Clang C++ throughout would
           | not change any conclusion (or lack thereof).
        
       | kolbe wrote:
       | Anyone know why seemingly knowledgeable people (like the person
       | who wrote this article) don't use micro benchmarking frameworks
       | when they run these tests?
       | 
       | Also, whenever you do one of these, please post the full source
       | with it. There's no reason to leave your readers in the dark,
       | wondering what could be going on, which is exactly what I'm doing
          | now, because there's almost no excuse for C++ to be slower
          | at a task than Rust--it's just a matter of how much work you
          | need to put in to get it there.
        
         | kllrnohj wrote:
         | For C++ I guess you could make the claim that it's just too
         | annoying to take a dependency on something like google-
         | benchmark or whatever, since C++ dependency management is such
         | a mess to deal with in general.
         | 
         | But yeah I have no idea why a benchmark framework wasn't used
         | for Rust.
        
           | kolbe wrote:
            | Whenever I don't want to endure that annoyance, I just
            | copy this single-file, header-only microbenchmarking code:
           | 
           | https://github.com/sheredom/ubench.h
        
         | forrestthewoods wrote:
         | > please post the full source with it.
         | 
         | There's literally a section called Source Code...
        
           | kolbe wrote:
           | I see now. I looked twice. I think most people stop after a
           | section called "Conclusion" that ends with "Thanks for
           | reading." It doesn't help that the formatting then leaves a
           | large gap between sections that doesn't indicate there's more
           | after that.
        
             | forrestthewoods wrote:
             | Fair!
        
       | zamalek wrote:
       | The benchmarks lack the standard deviation, so the results may
       | well be equivalent. Don't roll your own micro-benchmark runners.
       | 
        | References may get optimized to copies where possible and
        | sound (i.e. blittable and const); a common heuristic involves
        | the size of a cache line (64 bytes on most modern ISAs,
        | including x86_64).
        | 
        | Using a Vector4 would have pushed the structure size beyond
        | the 64-byte heuristic. You would also need to disable
        | inlining for the measured methods.
        
         | cogman10 wrote:
         | It was also (needlessly) using 2 different compilers, MSVC and
         | LLVM. This is just a bad way to compare things all around.
         | 
         | And, for simple operations like this, you really should just
         | look at the assembly output. If you are only generating 20ish
         | instructions, then look at those 20 instructions rather than
         | trying to heuristically guess what is happening.
        
       | dang wrote:
       | Discussed at the time:
       | 
       |  _Should small Rust structs be passed by-copy or by-borrow?_ -
       | https://news.ycombinator.com/item?id=20798033 - Aug 2019 (107
       | comments)
        
       | aboelez wrote:
       | [dead]
        
       ___________________________________________________________________
       (page generated 2022-12-31 23:00 UTC)