[HN Gopher] Rust: Dropping heavy things in another thread can ma...
       ___________________________________________________________________
        
       Rust: Dropping heavy things in another thread can make your code
       10000x faster
        
       Author : timooo
       Score  : 280 points
       Date   : 2020-05-30 16:53 UTC (6 hours ago)
        
 (HTM) web link (abramov.io)
 (TXT) w3m dump (abramov.io)
        
       | heftig wrote:
       | If I seriously wanted to move object destruction off-thread, I
       | would use at least a dedicated thread with a channel, so I could
       | make sure the dropper is done at some point (before the program
       | terminates, at the latest). It also avoids starting and stopping
       | threads constantly.
       | 
       | Something like this: https://play.rust-
       | lang.org/?version=stable&mode=debug&editio...
       | 
       | You could have an even more advanced version spawning tasks into
       | something like rayon's thread pool, I assume.
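        | 
        | Roughly the same idea as an untested inline sketch (names made
        | up here, not taken from the playground link):
        | 
        |     use std::sync::mpsc;
        |     use std::thread;
        | 
        |     type DropJob = Box<dyn FnOnce() + Send>;
        | 
        |     fn main() {
        |         // One long-lived dropper thread fed through a channel.
        |         let (tx, rx) = mpsc::channel::<DropJob>();
        |         let dropper = thread::spawn(move || {
        |             for job in rx {
        |                 job();
        |             }
        |         });
        | 
        |         let heavy = vec![vec![1u8]; 1_000_000];
        |         tx.send(Box::new(move || drop(heavy))).unwrap();
        | 
        |         drop(tx); // close the channel...
        |         dropper.join().unwrap(); // ...and wait for all drops
        |     }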
        
         | ReactiveJelly wrote:
         | Someone is working on this as a direct response to this blog:
         | 
         | https://www.reddit.com/r/rust/comments/go4xcp/new_crate_defe...
         | 
         | And yes, spawning a thread for every drop is horrible. It's
         | just to prove the concept. The defer_drop crate uses a global
         | worker thread.
        
       | Ididntdothis wrote:
       | I used to do this sometimes with C++ when I realized that
       | clearing out a vector with lots of objects was slow. Is Rust
       | basically based on unique_ptr? One problem with this approach was
       | that you still had to wait for these threads when the application
       | would shut down.
        
         | masklinn wrote:
         | > Is Rust basically based on unique_ptr?
         | 
          | Rust is based on ownership and statically checked move
          | semantics (on by default, though types can opt out). So each
          | item has a single owner (which is why Rust deals _very badly_
          | with graphs, and more generally any situation where ownership
          | is unclear) and the compiler will prevent you from using a
          | moved object (unlike C++).
          | 
          | Separately it has a smart pointer which is the counterpart of
          | unique_ptr (Box), with the guarantee noted above:
          | 
          |     let b = Box::new(1);
          |     drop(b);
          |     println!("{}", b);
          | 
          | will not compile because the second line _moves_ the box, after
          | which it can't be used because it's been removed entirely from
          | this scope.
        
           | wnoise wrote:
           | > which is why Rust deals very badly with graphs, and more
           | generally any situation where ownership is unclear
           | 
           | To be fair, so do 90+% of programmers. Much of rust's benefit
           | in safe code is training programmers to avoid code like that
           | where possible, and spreading design patterns that avoid it.
        
         | saagarjha wrote:
         | Rust basically gives the compiler understanding of unique_ptr
         | and prevents you from using it after you've moved it.
        
           | Ididntdothis wrote:
           | Would you have to keep track of these threads in Rust? I have
           | done a lot of desktop development where you have to be aware
           | of what happens during shutdown. Seems a lot of server guys
           | write their code under the assumption that it will never shut
           | down.
        
             | pornel wrote:
             | You would need to add `thread.join()` at the end of main,
             | or have some RAII guard that does it for you.
             | 
             | In practice that's probably optional, because the heap and
             | all resources are usually torn down with the process
             | anyway. Important things, like saving data or committing
             | transactions, shouldn't be done in destructors.
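              | 
              | A minimal sketch of such a guard (untested, name made up):
              | 
              |     use std::thread::JoinHandle;
              | 
              |     struct JoinOnDrop(Option<JoinHandle<()>>);
              | 
              |     impl Drop for JoinOnDrop {
              |         fn drop(&mut self) {
              |             // Join the worker when the guard leaves
              |             // scope, e.g. at the end of main.
              |             if let Some(handle) = self.0.take() {
              |                 let _ = handle.join();
              |             }
              |         }
              |     }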
        
             | firethief wrote:
             | The answer to this question is the same for any language
             | without a heavy runtime. You can choose to join the worker
             | thread, detach it, or kill it.
        
           | [deleted]
        
         | zozbot234 wrote:
         | > One problem with this approach was that you still had to wait
         | for these threads when the application would shut down.
         | 
         | If you know that an object will live for the rest of the
         | program and not need any finalization logic, Rust allows you to
         | "leak" it and save that overhead on shutdown.
        
         | ordu wrote:
          | You could have just one thread and kill it at exit. Don't
          | start a new thread for each closure that drops an object; send
          | the closures to one dedicated thread instead.
        
         | qcoh wrote:
         | Out of curiosity, how did you do that in C++?
        
           | Ididntdothis wrote:
           | It depends. Either iterate over the vector and delete the
           | objects or just call clear(). Obviously you have to be sure
           | that nobody else is accessing it at the same time.
        
       | jkoudys wrote:
       | It'd be interesting to implement this on a type that would defer
        | all of these drop threads (or one big drop thread built off a
       | bunch of futures) until the end of some major action, like
       | sending the http response on an actix-web thread. Could be a
       | great way to get the fastest possible response time, since then
       | the client has their response before any delay on cleanup.
        
       | epage wrote:
       | For those wanting a real world example where this can be useful:
       | 
       | I am writing a static site generator. When run in "watch" mode,
       | it deletes everything and starts over (I'd like to reduce these
       | with partial updates but can't always do it). Moving that cleanup
       | to a thread would make "watch" more responsive.
        
         | firethief wrote:
         | Why can't it cleanup right after the work?
        
           | pmontra wrote:
           | Or no cleanup at all. A CLI command that runs for a very
           | short time can allocate memory to perform its job, print the
           | result and exit. Then the OS releases all the memory of the
           | process. No idea if Rust can work like this.
        
             | ReactiveJelly wrote:
             | "Watch mode" for static site gens would mean you leave the
             | process running and let it rebuild the site whenever a file
             | changes, probably 10s to 100s of times in a typical run
        
             | estebank wrote:
             | std::mem::forget, which doesn't run destructors:
             | 
             | https://doc.rust-lang.org/std/mem/fn.forget.html
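              | 
              | For example (hypothetical heavy value):
              | 
              |     let pages = vec![String::from("page"); 1_000_000];
              |     // No destructor runs; the OS reclaims everything
              |     // at process exit.
              |     std::mem::forget(pages);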
        
         | elcomet wrote:
          | That's not really the same issue that is mentioned in the
          | article though, is it?
          | 
          | The issue from the article would be solved by just passing a
          | reference to the variable.
          | 
          | In your case, cleanup is an action that _needs_ to be done
          | before writing new files. So you have to wait for cleanup
          | anyway, don't you?
        
       | [deleted]
        
       | cs702 wrote:
        | In other words, Rust's automagical memory deallocation is NOT a
        | zero-cost abstraction:
        | 
        |     fn get_len1(things: HeavyThings) -> usize {
        |         things.len()
        |     }
        | 
        |     fn get_len2(things: HeavyThings) -> usize {
        |         let len = things.len();
        |         thread::spawn(move || drop(things));
        |         len
        |     }
       | 
       | The OP shows an example in which a function like get_len2 is
       | 10000x faster than a function like get_len1 for a hashmap with 1M
       | keys.
       | 
       | See also this comment by chowells:
       | https://news.ycombinator.com/item?id=23362925
        
         | DasIch wrote:
         | Nothing about how Rust handles deallocation is magical in any
         | way.
         | 
          | It's also definitely a zero-cost abstraction as far as I can
          | see, because the manual solution that's equivalent to
          | get_len1() would be to essentially call free() on things. That
          | would ultimately suffer from the same problem.
        
           | cs702 wrote:
           | Yeah, you're right. In hindsight this was a poorly thought-
           | out and poorly written post on my part.
        
         | dathinab wrote:
          | No, the "zero-cost" refers to the abstraction (and its runtime
          | cost), which still is zero-cost. Deallocating is part of the
          | normal workload, not the abstraction.
          | 
          | Also this isn't Rust-specific. Most (all?) RAII languages are
          | affected and many GC approaches have this effect, too. Some do
          | add _additional_ abstraction to magically always or sometimes
          | put the de-allocation into another thread.
          | 
          | But de-allocating in another thread is not generally good or
          | bad. There are a lot of use-cases where doing so is rather bad
          | or can't be done (e.g. in case TLS is involved). Rust and other
          | similar RAII languages at least let you decide what you want to
          | do.
          | 
          | Now it's (I think) generally known that certain kinds (not all)
          | of GC do make some things simpler for GUI-like usage. Though
          | they also tend to offer less control.
          | 
          | Note that it's a common pattern for small user-facing CLI tools
          | (which are not GC'ed) to leak resources instead of cleaning
          | them up properly. You can do so in Rust too if you want, but
          | it's a potential problem for longer-running applications.
          | 
          | Also here is a `get_len` faster than both, which is also more
          | idiomatic Rust than both:
          | 
          |     fn get_len1(things: &HeavyThings) -> usize { things.len() }
          | 
          | If you have a certain thread (e.g. a UI thread) in which you
          | never want to do any cleanup work, you can consider using a
          | container like:
          | 
          |     struct DropElsewhere<T: Send + 'static>(pub Option<T>);
          | 
          |     impl<T: Send + 'static> Drop for DropElsewhere<T> {
          |         fn drop(&mut self) {
          |             if let Some(value) = self.0.take() {
          |                 thread::spawn(move || drop(value));
          |             }
          |         }
          |     }
          | 
          | You can optimize this with `ManuallyDrop` to have close to zero
          | runtime overhead (it removes the `take` and `if let` part).
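          | 
          | An untested sketch of that optimization (`ManuallyDrop::take`
          | is stable since Rust 1.42):
          | 
          |     use std::mem::ManuallyDrop;
          |     use std::thread;
          | 
          |     struct DropElsewhere<T: Send + 'static>(ManuallyDrop<T>);
          | 
          |     impl<T: Send + 'static> Drop for DropElsewhere<T> {
          |         fn drop(&mut self) {
          |             // SAFETY: the field is never touched again;
          |             // we are inside the final drop.
          |             let value =
          |                 unsafe { ManuallyDrop::take(&mut self.0) };
          |             thread::spawn(move || drop(value));
          |         }
          |     }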
        
           | cs702 wrote:
           | > No the zero-cost refers to the abstraction (and runtime
           | cost), which still is zero-cost. Deallocating is part of the
           | normal work load not the abstraction.
           | 
           | Yeah, you're right. In hindsight my comment was poorly
           | thought-out and poorly written.
        
       | floppy123 wrote:
        | Why should I ever need to drop a heavy object just to get a
        | size? Not in C++ and also not in Rust; the different-thread idea
        | is just creative stupidity, sorry.
        
       | fpgaminer wrote:
       | Some important things I think people should note before blindly
       | commenting:
       | 
       | * The example code is obviously contrived. The real gist is that
       | massive deallocations in the UI thread cause lag, which the
       | example code proves. That very thing can easily happen in the
       | real world.
       | 
       | * I didn't see any difference on my machine between a debug build
       | and a release build.
       | 
        | * The example is performing 1 _million_ deallocations. That's why
       | it's so pathological. It's not just a "large" vector. It's a
       | vector of 1 million vectors. While that may seem contrived,
       | consider a vector of 1 million strings, something that's not too
       | uncommon, and which would likely suffer the same performance
       | penalty.
       | 
       | * Rust is not copying anything, nor duplicating the structures
       | here. In the example code the structures would be moved, not
       | copied, which costs nothing. The deallocation is taking up 99% of
       | the time.
       | 
       | * As an aside, compilers have used the trick of not free-ing data
       | structures before, because it provides a significant performance
       | boost. Instead of calling free on all those billions of tiny data
       | structures a compiler would generate during its lifetime, they
        | just let them leak. Since a compiler is short-lived it's not a
       | problem, they get a free lunch (pun unintended), and the OS takes
       | care of cleaning up after all is said and done. My point is that
       | this post isn't theoretical, we do deallocation trickery in the
       | real world.
        
         | papaf wrote:
         | This deallocation trick is neat but in C and C++ you could use
         | a memory pool to do this.
         | 
          | In theory, you could also use a memory pool in Rust, but I
          | think the standard library uses malloc without any way of
          | overriding this behaviour.
        
           | CoolGuySteve wrote:
            | Even just keeping a free list and deallocating its elements
            | at an idle time is probably cheaper and faster than spawning
            | a thread.
        
           | wongarsu wrote:
           | In a toy raytracer I once wrote in C++, switching from malloc
           | to custom memory pools for small fixed-size objects was a big
           | performance boost. Making free() a noop was another big
           | performance boost, both for deallocation and allocation.
           | Turns out sequentially handing out memory from a big chunk of
           | memory is much easier than keeping track of and reusing empty
           | slots, and it keeps sequentially allocated objects in
           | sequential memory locations.
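            | 
            | Roughly that idea as an untested Rust sketch (names made
            | up):
            | 
            |     // Hands out memory sequentially from one big chunk;
            |     // "free" is a no-op and everything is released at once
            |     // when the pool itself is dropped.
            |     struct BumpPool {
            |         chunk: Vec<u8>,
            |         offset: usize,
            |     }
            | 
            |     impl BumpPool {
            |         fn new(size: usize) -> Self {
            |             BumpPool { chunk: vec![0; size], offset: 0 }
            |         }
            | 
            |         // `align` must be a power of two.
            |         fn alloc(&mut self, n: usize, align: usize)
            |             -> Option<*mut u8>
            |         {
            |             let start =
            |                 (self.offset + align - 1) & !(align - 1);
            |             if start.checked_add(n)? > self.chunk.len() {
            |                 return None; // out of space
            |             }
            |             self.offset = start + n;
            |             // Valid for as long as the pool lives.
            |             Some(unsafe { self.chunk.as_mut_ptr().add(start) })
            |         }
            |     }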
        
             | folmar wrote:
             | jemalloc is the answer to the question you did not ask
        
           | estebank wrote:
            | In Rust you could just call `mem::forget` on whatever heavy
            | thing you're no longer using before it would get dropped,
            | but then the programmer is effectively responsible for that
            | memory leak not becoming a _problematic_ leak during
            | refactors.
           | 
           | Edit: this will also break any code that _relies_ on Drop
           | being called for clean up, but that is already a
           | "suspect"/incorrect pattern because there are no assurances
           | that it will ever run.
        
             | mcguire wrote:
             | Isn't that a problem for the RAII approach?
        
             | kd5bjo wrote:
             | > There are no assurances that it will ever run.
             | 
             | Yes and no. Whenever control leaves a code block, Rust
             | automatically calls the drop() method of all values still
             | owned by that block. There is no guarantee that control
             | will exit every block (cf. Turing), but a moderately
             | exceptional circumstance needs to occur for this not to
             | happen, like an infinite loop.
        
               | rcxdude wrote:
               | There's also no guarantee the object won't be moved
               | elsewhere, including into a context which never gets
               | freed (for example, you can construct mem::forget using
               | entirely safe code by forming a cycle of reference-
                | counted boxes). That said, Rust has support for such a
                | concept through the Pin type (which essentially
                | guarantees an object will not get moved until it is
                | 'Unpinned').
        
           | jomohke wrote:
            | You can easily use a custom global allocator in Rust:
            | 
            |     #[global_allocator]
            |     static GLOBAL: MyAllocator = MyAllocator;
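            | 
            | A minimal allocator to plug in there (untested sketch that
            | just forwards to the system allocator):
            | 
            |     use std::alloc::{GlobalAlloc, Layout, System};
            | 
            |     struct MyAllocator;
            | 
            |     unsafe impl GlobalAlloc for MyAllocator {
            |         unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            |             System.alloc(layout)
            |         }
            | 
            |         unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
            |             System.dealloc(ptr, layout)
            |         }
            |     }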
        
           | [deleted]
        
           | orf wrote:
            | You can change the global allocator in any Rust project. You
            | can write your own easily enough, or use one like jemalloc.
        
             | bauerd wrote:
              | I think with LD_PRELOAD it is always possible to override
              | the allocator?
        
               | dan-robertson wrote:
               | Only if it is dynamically linked
        
           | pjmlp wrote:
           | In C++/WinRT the same approach is taken, because you cannot
           | just use a memory pool for COM.
        
           | projektfu wrote:
           | Yeah, I like Apple's (Next's) approach of pool allocation for
           | each run through the event loop. Defer dealloc, drop pool at
           | the end.
        
             | saurik wrote:
             | Apple's pools are for helping manage reference counts of
             | returned objects (via autorelease) but aren't doing pool
             | allocation (as any of those objects could escape, so you
             | can't do the fast thing of just deallocating the pool). The
             | normal memory allocator is used while in the scope of an
             | autorelease pool.
        
               | projektfu wrote:
               | True, I was thinking the same after I posted. Maybe I'm
               | thinking more of Apache's approach and how it gets some
               | of the benefits of a fork-per-request server without the
               | overhead.
        
             | mcguire wrote:
             | Unless your destructors do more than deallocation, in which
             | case you will leak whatever other resource you're managing.
        
               | MaxBarraclough wrote:
               | A pool can invoke destructors when it is cleared. Might
               | take a bit of overhead (if the pool is to support
               | arbitrary classes), but you could retain the fast
               | pointer-bump allocation.
        
           | vvanders wrote:
           | You can also use the typed-arena crate[0] or roll your own if
           | you're feeling like cracking open unsafe.
           | 
           | [0] https://crates.io/crates/typed-arena
        
         | rubber_duck wrote:
         | >As an aside, compilers have used the trick of not free-ing
         | data structures before, because it provides a significant
         | performance boost. Instead of calling free on all those
         | billions of tiny data structures a compiler would generate
         | during its lifetime, they just let them leak. Since a compiler
          | is short-lived it's not a problem, they get a free lunch (pun
         | unintended), and the OS takes care of cleaning up after all is
         | said and done. My point is that this post isn't theoretical, we
         | do deallocation trickery in the real world.
         | 
            | And then someone tries to use your compiler as a service
            | (code analysis, change-triggered compilation) and it's a
            | dead end.
        
           | qzw wrote:
           | Well, then that's not the original use case anymore, and
           | it'll have to be re-engineered. In the meantime it may have
           | been used for years and the perf difference may have saved
           | many developer-years collectively across its user base.
           | Surely you're not suggesting that the compiler developers
           | should be prematurely optimizing for future use cases that
           | they may not even have envisioned.
        
             | pdimitar wrote:
             | I am suggesting they apply good practices. I'd never
             | imagine that compilers were actually doing what was stated
             | -- sounds awful.
             | 
             | I understand it's tradeoffs and we all have real-world
             | limitations to contend with -- but again, of all the
             | corners that could be cut that's exactly the one I didn't
             | imagine they would.
             | 
             | Nasty.
        
               | rcxdude wrote:
               | Deallocation at the end of a program's execution can
                | substantially add to its runtime, and it's entirely
                | wasted work. It's a much more common strategy than you
                | might think.
        
               | pdimitar wrote:
               | You are right, I indeed didn't know it was that common.
               | 
               | But still, in a world where languages and runtimes are
               | also judged by their ability to run in lambda/serverless
               | setups, I'd think this practice will start being
               | obsolete, wouldn't you think?
               | 
               | (What I mean is that I imagine that any serverless
               | function that runs in severely constrained and measured
               | environments like the AWS Lambda would gain a significant
               | edge over the competition if it did an eager cleanup.
               | Should allow more of them to work in parallel?)
        
               | folmar wrote:
                | Most compilers are not designed to run in daemon mode;
                | specifically, it's a non-issue since their startup is
                | normally fast. And the compiler and runtime are
                | different things.
        
               | pdimitar wrote:
               | I realise that, but nowadays language servers are a
                | pretty normal practice in no small number of areas.
        
               | thebean11 wrote:
               | Can you articulate why it's a bad practice? If it works
               | better than alternatives and it's documented, not really
               | sure what the issue is.
               | 
               | I don't think it's even that uncommon. I believe some HFT
               | firms run Java with a huge amount of RAM and GC disabled,
               | and get around it by just rebooting the software
               | occasionally.
               | 
               | To me writing software like that is fair game, I don't
               | see the point in being dogmatic about "how things should
               | be done".
        
               | pdimitar wrote:
               | Mostly because I look at it from the angle of one-off /
               | general purpose / CLI programs. If one such has to run
               | for 10-30 seconds and its memory just keeps growing and
               | growing with the idea of throwing it all away at the end
               | and letting the OS handle it, it might become disruptive
               | for other programs on the machine.
               | 
               | For specialised apps and servers it's of course a
               | perfectly good practice.
        
             | rubber_duck wrote:
             | Avoiding leaks is not optimisation, it's a matter of
             | correctness - not freeing memory is an optimisation based
             | on a very shortsighted assumption that is not practical for
             | any new language (modern languages are expected to come
             | with language server support)
        
               | random314 wrote:
                | You have not provided any refutation of the OP's argument.
        
               | jashmatthews wrote:
               | Why do you say that? Even if you call free immediately
               | after a piece of memory is no longer needed, malloc won't
               | release that immediately anyway.
               | 
               | If this is incorrect, then every modern malloc
               | implementation is incorrect.
        
               | jpitz wrote:
               | Correctness means adherence to the spec, not some
               | contrived absolute truth.
        
           | yrro wrote:
           | As distasteful as leaky code is, is it that bad to run it in
           | a separate process? You get a bit more robustness against
           | crashes as well.
        
             | rubber_duck wrote:
             | You want to be able to incrementally update state to get
             | performance out of incremental code analysis (eg. language
             | server implemention for IDE)
        
               | folmar wrote:
                | There are more ways to go than that: AST fragment
                | caching, intermediate representation fragment caching,
                | and so on. Incremental updates fit some languages better
                | than others.
        
           | ycombobreaker wrote:
           | In a world where processes can fork-and-exec, nothing about
           | "as a service" changes that. The compiler would just be
           | reinvoked as needed. Converting it into a persistent process
           | breaks a lot more than just allocation optimizations.
        
             | rubber_duck wrote:
             | But you want to share state and only update state
             | incrementally on edit to get any reasonable level of
             | performance for stuff like language server code analysis.
        
               | all-fakes wrote:
               | I'm not sure what the performance overhead would be, but
               | that state can easily be stored on disk.
        
         | ckcheng wrote:
         | >As an aside, compilers have used the trick of not free-ing
         | data structures before, because it provides a significant
         | performance boost. Instead of calling free on all those
         | billions of tiny data structures a compiler would generate
         | during its lifetime, they just let them leak. Since a compiler
          | is short-lived it's not a problem, they get a free lunch (pun
         | unintended), and the OS takes care of cleaning up after all is
         | said and done. My point is that this post isn't theoretical, we
         | do deallocation trickery in the real world.
         | 
         | This reminds me of the exploding ultimate GC technique [1]:
         | 
         | > on-board software for a missile...chief software engineer
         | said "Of course it leaks". ... They added this much additional
         | memory to the hardware to "support" the leaks. Since the
         | missile will explode when it hits its target or at the end of
         | its flight, the ultimate in garbage collection is performed
         | without programmer intervention.
         | 
         | [1]:
         | https://devblogs.microsoft.com/oldnewthing/20180228-00/?p=98...
        
         | hinkley wrote:
         | One of my "favorite" snags in perf analysis is that periodicity
         | in allocations can misattribute the cost of allocations to the
         | wrong function.
         | 
         | If I allocate just enough memory, but not too much, then pauses
         | for defragmentation of free space may be costed to the code
         | that calls me.
         | 
         | A solution to this that I've seen in soft real time systems is
         | to amortize cleanups across all allocations. Every allocation
         | performs n steps of a cleanup process prior to receiving a
         | block of memory. In which case most of the bad actors have to
         | pay part of the cost of memory overhead.
         | 
          | Might be good for Rust to try something in that general realm;
          | the cleanup side may be the easier place to tack it on. On
          | free, set a ceiling for operations and queue what is left.
          | That would at least shave the peaks.
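          | 
          | An untested sketch of the amortized variant (names made up):
          | 
          |     use std::collections::VecDeque;
          | 
          |     struct DeferredDrops(VecDeque<Box<dyn FnOnce()>>);
          | 
          |     impl DeferredDrops {
          |         fn defer<T: 'static>(&mut self, value: T) {
          |             self.0.push_back(Box::new(move || drop(value)));
          |         }
          | 
          |         // Call on every allocation (or loop tick) so all
          |         // actors pay a bounded share of the cleanup debt.
          |         fn step(&mut self, budget: usize) {
          |             for _ in 0..budget {
          |                 match self.0.pop_front() {
          |                     Some(drop_fn) => drop_fn(),
          |                     None => break,
          |                 }
          |             }
          |         }
          |     }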
        
           | rini17 wrote:
           | Doing unrelated cleanup sounds like flushing CPU cache per
           | every allocation.
        
             | hinkley wrote:
             | This post is already about unrelated cleanup, so I'm not
             | sure what other escape hatch you imagine people taking. Do
             | you have a suggestion?
             | 
             | You can tell that it's unrelated cleanup because if it were
             | related, then the cost of freeing wouldn't be noteworthy.
             | It would be cache hot because you would be visiting it for
             | the second time. In which case we'd be talking about why
             | you are scanning a giant object on an event loop in the
             | first place. That's not what's at issue. What's at issue is
             | that you've been handed this great bomb of uncached data
             | from someone else and now you're stuck doing the janitorial
             | work.
             | 
             | Freeing an object of arbitrary size is effectively an
             | unbounded operation. Cache invalidation has a very high
             | cost, sure, but it's still bounded.
             | 
             | Putting a limit on the amount of work you do, you could
             | stop before purging the entire cache. You could use a
             | smaller limit on doing work for someone else and control
             | invalidation there, too.
        
         | indemnity wrote:
          | Isn't a Rust "move" implemented as a bitwise copy (e.g. a
          | memcpy call)? I see people claiming a move has no cost but I'm
          | not sure that is true.
        
           | steveklabnik wrote:
            | Semantically, it is a bitwise copy, yes.
           | 
           | However, these copies can often be elided by optimizations.
        
           | dathinab wrote:
            | What is bitwise-copied is the pointer to the memory.
            | 
            | I.e. a `HashMap` struct or `Vec` struct doesn't directly
            | contain the data.
            | 
            | For example the `Vec` is defined internally as something
            | similar to:
            | 
            |     struct Vec<T> {
            |         data: *mut T,
            |         capacity: usize,
            |         len: usize,
            |         marker: PhantomData<T>,
            |     }
            | 
            | (Slightly simplified, not the actual Vec type.)
            | 
            | So a move of a Vec copies at most 3 usize (24 bytes on 64-bit
            | systems); similar things apply to a HashMap.
            | 
            | Additionally the copy can often be elided through compiler
            | optimizations.
            | 
            | As an interesting side note, a new empty Vec/HashMap will not
            | actually allocate any memory; only once elements get added
            | will it start doing so. This is why the example creates vecs
            | of vecs of length 1; otherwise it wouldn't need to do "number
            | of elements" free calls.
        
           | justinpombrio wrote:
            | EDIT: See kevincox's reply. Rust will bitwise copy the
            | _containing type_, which is typically very cheap. For
            | example, if you move a String, it will copy the String
            | struct, which contains a couple pointers and a length (or
            | something along those lines). Importantly, it will _not_
            | copy the underlying char array.
            | 
            | I was thinking of the following code, where I believe the
            | assignment to y is actually free. Though apparently this
            | isn't called a "move".
            | 
            |     let x = <<large owned type like [char; 1000]>>;
            |     let y = x;
            | 
            | More info:
            | https://doc.rust-lang.org/rust-by-example/scope/move.html
        
             | ahupp wrote:
              | That's not exactly true. In C++ terms, the example code is
              | moving a value:
              | 
              |     std::map<...> foo;
              |     someFunction(std::move(foo));
              | 
              | And not moving a pointer like:
              | 
              |     std::unique_ptr<std::map<...>> foo = ...;
              |     someFunction(std::move(foo));
              | 
              | So it copies sizeof(std::map<...>), not a pointer.
        
             | kevincox wrote:
             | Semantically it is a bitwise copy.
             | 
             | In the example from the article it is probably actually a
             | copy because the value was originally on the parent
             | thread's stack, which will be reused after the function
             | returns, so the value will need to be copied to the new
             | thread's stack.
             | 
              | However it is important to note that it isn't a
              | deep/recursive bitwise copy. It just needs to copy the
              | HashMap itself (which is probably a handful of words).
             | 
             | So yes, it is doing a bitwise copy, but this is also very
             | cheap. It will be much, much cheaper than spawning the
             | thread.
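              | 
              | For instance (illustrative; the exact number varies by
              | platform and std version):
              | 
              |     use std::collections::HashMap;
              |     use std::mem::size_of;
              | 
              |     fn main() {
              |         // Only this small header is copied on a move,
              |         // never the heap contents behind it.
              |         let n = size_of::<HashMap<u64, Vec<u64>>>();
              |         println!("HashMap header: {} bytes", n);
              |     }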
        
       | staticfloat wrote:
       | It seems that this would be a great reason to not pass the entire
       | heavy object through your function, and to instead pass it as a
       | reference. When passing an object (rather than a reference to an
       | object) there's a lot more work going on both in function setup,
       | and in object dropping. I'm not a rust guru, so I don't know the
       | precise wording, but it's simple enough to realize that if this
       | function, as claimed, must drop all the sub-objects within the
       | `HeavyObject` type, then those objects must have been copied from
       | the original object.
       | 
       | If you instead define the function to take in a reference (by
       | adding just two `&` characters into your program), the single-
       | threaded case is now almost 100x faster than the multithreaded
       | case.
       | 
       | Here's a link to a Rust Playground with just those two characters
       | changed: https://play.rust-
       | lang.org/?version=stable&mode=debug&editio...
       | 
       | Note that the code that drops the data in a separate thread is
       | not timing the amount of time your CPU is spinning, dropping the
       | data. So while this does decrease the latency of the original
       | thread, the best solution is to avoid copying and then freeing
       | large, complex objects as much as possible. While it is of course
       | necessary to do this sometimes, this particular example is just
       | not one of them. :)
       | 
       | As an aside, I'm somewhat surprised that the Rust compiler isn't
       | inlining and eliminating all the copying and dropping; this would
       | seem to be a classic case where compiler analysis should be able
        | to determine that `a.len()` should be computable without copying
       | `a`, and it should be able to eliminate the function call cost as
       | well. Manually doing this gives the exact same timing as my gist
       | above, so I assume that this is happening when passing a
       | reference, but not happening when passing the object itself.
        
         | heftig wrote:
         | As already mentioned, Rust wasn't copying anything; the
         | `HashMap` is not a `Copy`-able type, so it was just moved
         | around (it's also not very large: all its items are behind a
         | pointer to the heap).
         | 
         | All you did was move the drop from the
         | `fn_that_drops_heavy_things` to the end of `main`, where it is
         | outside the timing function.
        
         | heavenlyblue wrote:
         | If your function takes a reference to the object, something
         | still needs to free it.
        
         | fpgaminer wrote:
         | Rust isn't copying anything; everything in the original code
         | would be a move.
        
       | cesarb wrote:
       | Just be careful, because moving heavy things to be dropped to
       | another thread can change the _semantics_ of the program. For
       | instance, consider what happens if within that heavy thing you
       | had a BufWriter: unless its buffer is empty, dropping it writes
       | the buffer, so now your file is being written and closed in a
       | random moment in the future, instead of being guaranteed to have
       | been sent to the kernel and closed when the function returns.
       | 
       | And it can even be worse if it's holding a limited resource, like
       | a file descriptor or a database connection. That is, I wouldn't
       | recommend using this trick unless you're sure that the only thing
       | the "heavy thing" is holding is memory (and even then, keep in
       | mind that _memory_ can also be a limited resource).
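        | 
        | An untested sketch of the safer pattern (function name made up):
        | 
        |     use std::fs::File;
        |     use std::io::{self, BufWriter};
        |     use std::thread;
        | 
        |     fn finish_then_defer(w: BufWriter<File>) -> io::Result<()> {
        |         // Flush explicitly so write errors surface here, not
        |         // inside a detached drop at some random later moment.
        |         let file = w.into_inner()?;
        |         // Only the close itself is deferred; note the fd stays
        |         // open until the spawned thread runs.
        |         thread::spawn(move || drop(file));
        |         Ok(())
        |     }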
        
         | usefulcat wrote:
         | It seems like the caller should ensure that the buffer is
         | written before giving away ownership. Also, what happens if
         | there is an error writing during finalization/destruction/etc?
         | Seems like you'd want to find out about such errors earlier if
         | at all possible.
        
         | lostmyoldone wrote:
         | I only know a very little rust, but since it's generally a good
         | practice to never defer writing (or other side effects) to an
         | ambiguous future point in time - with memory allocations as the
         | only plausible exception - is there any way in rust to make
         | sure one doesn't accidentally move complex objects with drop
         | side-effects into other threads?
         | 
          | Granted, the way the type system works you usually know the
          | type of a variable quite well, but could this happen with
          | opaque types?
         | 
         | I'm very much out of my depth, but it felt like one of those
         | things that could really bite you if you are unaware, as
         | happened with finalizers in Java decades ago.
        
           | masklinn wrote:
           | > I only know a very little rust, but since it's generally a
           | good practice to never defer writing (or other side effects)
           | to an ambiguous future point in time - with memory
           | allocations as the only plausible exception - is there any
           | way in rust to make sure one doesn't accidentally move
           | complex objects with drop side-effects into other threads?
           | 
           | If you're the one creating the structure, you could opt it
           | out of Send, that'd make it... not sendable. So it wouldn't
           | be able to cross thread-boundaries. For instance Rc is !Send,
           | you simply can not send it across a thread-boundary (because
           | it's a non-threadsafe reference-counting handle).
           | 
           | If you don't control the type, then you'd have to wrap it
           | (newtype pattern) or remember to manually mem::drop it. The
           | latter would obviously have no safety whatsoever, the former
           | you might be able to lint for I guess, though even that is
           | limited or complicated (because of type inference the
           | problematic type might never get explicitly mentioned).
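            | 
            | Untested sketch of that opt-out (wrapper name made up; the
            | PhantomData of a !Send type such as Rc makes the whole
            | wrapper !Send):
            | 
            |     use std::marker::PhantomData;
            |     use std::rc::Rc;
            | 
            |     // The compiler now rejects moving StayPut<T> into
            |     // another thread, drop-deferral included.
            |     struct StayPut<T>(T, PhantomData<Rc<()>>);
            | 
            |     impl<T> StayPut<T> {
            |         fn new(value: T) -> Self {
            |             StayPut(value, PhantomData)
            |         }
            |     }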
        
         | the8472 wrote:
         | Considering that writing files can also block the process you
         | probably don't want to have that in your latency-sensitive
         | parts either, so you'll have to optimize that one way or
         | another anyway.
         | 
          | For the more general problem you can also dedicate more
          | threads to the task or apply backpressure.
        
       | chowells wrote:
       | This is the standard problem with tracing data structures to free
       | them. You frequently run into it with systems based on
       | malloc/free or reference counting. The underlying problem is that
       | freeing the structure takes time proportional to the number of
       | pointers in the structure it has to chase.
       | 
       | Generational/compacting GC has the opposite problem. Garbage
       | collection takes time proportional to the live set, and the
       | amount of memory collected is unimportant.
       | 
        | It actually says a lot for Rust that the ownership system
        | lets you transfer freeing responsibility off-thread safely and
        | cheaply so that it doesn't block the critical path.
       | 
       | But overall, there's nothing really unexpected here, if you're
       | familiar with memory management.
        
         | loufe wrote:
          | I've not worked with any language thus far without automatic
          | garbage collection, so this was definitely a neat read for me.
          | It sounds rather elegant.
        
         | Reelin wrote:
         | > the ownership system lets you transfer freeing responsibility
         | off-thread safely and cheaply in order to not have it block the
         | critical path
         | 
         | This can also trivially be done in other languages. Atomically
         | append your pointer to a queue of "large things that need to be
         | freed" and move on as though you had actually called free.
         | 
          | Within a particularly time-sensitive loop you can even opt to
          | place pointers into a preallocated array locally. Then once per
          | loop iteration swap that array with the thread handling the
          | deallocations for you. It eats up a bit of CPU time but can
          | significantly reduce latency.
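          | 
          | Untested sketch of the swap trick (names made up):
          | 
          |     use std::sync::mpsc::Sender;
          | 
          |     // Once per loop iteration: ship the locally collected
          |     // garbage to a dropper thread, keep a fresh buffer.
          |     fn flush_batch<T: Send + 'static>(
          |         local: &mut Vec<T>,
          |         dropper: &Sender<Vec<T>>,
          |     ) {
          |         if local.is_empty() {
          |             return;
          |         }
          |         let cap = local.capacity();
          |         let batch =
          |             std::mem::replace(local, Vec::with_capacity(cap));
          |         let _ = dropper.send(batch);
          |     }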
        
           | im3w1l wrote:
           | A lot of C++ code depends on deallocation order for
           | correctness. Like a destructor may want to say bye-bye to a
           | pointed-to-object, and if you reverse order of deallocation,
           | that pointer may be dangling.
           | 
            | Consider this code:
            | 
            |     {
            |         Window a;
            |         ClickHandler* b = new ClickHandler(&a);
            |         delete b;
            |     }
           | 
           | Let's say b tries to deregister itself when it's deleted.
           | This code will work as written. But if you defer the deletion
           | of b, then stack allocated Window a may already be gone.
        
           | jacobparker wrote:
           | OP said safely; what you're describing isn't safe in, say,
           | C++ in the same sense that it is in Rust.
        
             | Reelin wrote:
             | Isn't that essentially a tautology? Manual memory
             | management and threading in such languages lacks safety
             | guarantees to begin with.
        
         | arcticbull wrote:
         | > This is the standard problem with tracing data structures to
         | free them. You frequently run into it with systems based on
         | malloc/free or reference counting. The underlying problem is
         | that freeing the structure takes time proportional to the
         | number of pointers in the structure it has to chase.
         | 
         | That doesn't seem to make intuitive sense. A GC has the same
         | problem.
         | 
          | A garbage collector has to traverse the data structure in a
          | similar way to determine whether it (and its embedded keys and
          | values) are part of the live set or not, and to invoke
          | finalizers. You're beginning your comparison after the mark
          | step, which isn't a fair assessment, since what Rust is doing
          | is akin to both the mark and sweep phases.
         | 
         | The only way to drop an extensively nested structure like this
         | any faster than traversing it would be an arena allocator, and
         | forgetting about the entire arena.
         | 
         | The difference between a GC and this kind of memory management
         | is that the GC does the traversal later, at some point, non-
         | deterministically. Rust allows you to decide between
         | deallocating it in place, immediately, or deferring it to a
         | different thread.
        
           | Reelin wrote:
           | > The only way to drop an extensively nested structure like
           | this any faster than traversing it would be an arena
           | allocator, and forgetting about the entire arena.
           | 
           | Isn't that incompatible with RAII though?
        
             | arcticbull wrote:
             | You can handle this in Rust pretty neatly with lifetimes.
             | There's a bunch of crates that do this. [1]
             | 
             | [1] https://crates.io/crates/typed-arena
        
               | Reelin wrote:
               | That's a neat library but as far as I can tell it doesn't
               | avoid any traversal or cleanup code. It appears to delay
               | the cleanup so it all happens at once. That's certainly
               | useful, but if you have RAII the traversal still has to
               | happen at some point.
        
               | winstonewert wrote:
                | It avoids it _if_ your type has a no-op drop
                | implementation. So if you use typed-arena for objects
                | which don't own resources, they all get dropped in one
                | massive deallocation and don't have to be traversed.
                | 
                | EDIT: and then I noticed that you mentioned RAII...
                | Right, if the objects own some sort of resource, that
                | doesn't apply.
        
               | Reelin wrote:
               | No worries. And to clarify, in context the point is that
               | traversal fundamentally can't be avoided in the case of
               | RAII. This defeats (what I see as) the primary use case
               | of an arena allocator - deallocating an arbitrarily large
               | chunk of contiguous memory in O(1) time regardless of
               | object count.
               | 
               | Of course this is all somewhat tangential to the original
               | topic of generational GCs, where the RAII idiom also has
               | significant negative impacts. The performance
               | characteristics would otherwise be O(n) based on the live
               | set and thus similar to an arena allocator in terms of
               | the ability to dispose of an arbitrarily large number of
               | objects efficiently.
        
               | winstonewert wrote:
                | I'm not sure it's fair to say it defeats that use case.
                | It only defeats that use case if you need RAII.
                | Furthermore, in my experience, an arena is most useful
                | when you allocate lots of small objects, which is the
                | case least likely to need RAII.
        
           | chowells wrote:
           | I said generational/compacting collector. You're talking
           | about a mark and sweep collector.
           | 
           | A generational/compacting collector traverses pointers from
           | the live roots, and copies everything it finds to the start
           | of its memory space, and then declares the rest unused. If
           | there is 1GB of unused memory, it's irrelevant. Only the
           | things that can be reached are even examined.
           | 
           | As I said, this has the opposite problem. When the live set
           | becomes huge, this can drag performance. When the live set is
           | small, it doesn't matter how much garbage it produces,
           | performance is fast.
        
             | arcticbull wrote:
                | How are finalizers invoked if the structure isn't
                | traversed? Would it just be optimized away if none of
                | the objects have finalizers? Hence my suggestion about
                | the arena allocators being a better point of comparison.
        
               | mcguire wrote:
               | Maintain a collection of objects with finalizers
               | associated with the generation; when an object is moved
               | out of the generation, move the finalizer record with it.
               | Then, just before the generation is discarded, run the
               | finalizers.
               | 
               | If the finalizers do something stupid like resurrect the
               | object, have the runtime system notify someone with the
               | authority to go beat the programmer with a stick.
        
               | Reelin wrote:
               | My understanding is that finalizers are special cased for
               | a generational GC, have a tendency to introduce
               | significant overhead, and (as with any GC scheme)
               | generally run at unpredictable times. My impression is
               | that RAII idioms are strongly discouraged in conjunction
               | with most GC ecosystems.
        
               | pcl wrote:
                | Many languages allow a class to avoid declaring a
                | finalizer, for just this reason.
               | 
               | The JVM is notable in this regard. And thus, most classes
               | that compile to Java bytecode offer no-finalizer
               | semantics.
        
               | zucker42 wrote:
               | Java is an example of a language with a generational copy
               | collector by default. Most objects in Java don't have a
               | finalizer, since after all the main point of, for
               | example, destructors in C++ is to make sure you don't
                | leak memory, which the GC solves. But when the `finalize`
                | method is used it causes significant overhead.
               | 
               | > Objects with finalizers (those that have a non-trivial
               | finalize() method) have significant overhead compared to
               | objects without finalizers, and should be used sparingly.
               | Finalizeable objects are both slower to allocate and
               | slower to collect. At allocation time, the JVM must
               | register any finalizeable objects with the garbage
               | collector, and (at least in the HotSpot JVM
               | implementation) finalizeable objects must follow a slower
               | allocation path than most other objects. Similarly,
               | finalizeable objects are slower to collect, too. It takes
               | at least two garbage collection cycles (in the best case)
               | before a finalizeable object can be reclaimed, and the
               | garbage collector has to do extra work to invoke the
               | finalizer. [1]
               | 
               | Sure, you're technically correct that if the objects all
               | had finalizers that did the same thing as C++
               | destructors, it would be equivalent, but because of the
               | existence of a GC we don't have to do any work for most
               | objects. A GC is equivalent to an arena allocator in this
               | sense.
               | 
               | Another point is the C++/Rust pattern of each object
               | recursively freeing the objects it owns presumably leads
               | to slower deallocation, because in the general case it
               | involves pointer following and non-local access.
               | 
               | [1] https://www.ibm.com/developerworks/java/library/j-jtp
               | 01274/i...
        
               | msclrhd wrote:
                | Destructors in C++ aren't just for making sure you don't
                | leak memory. They are used for many lifetime-controlled
                | things such as: 1. general resource cleanup (file
                | handles, database connections, etc.) using RAII
                | (Resource Acquisition Is Initialization); 2. tracing
                | function entry/exit.
        
               | pdpi wrote:
                | In the JVM the equivalent to RAII is implemented with
                | try-with-resources/AutoCloseable instead.
        
               | mcguire wrote:
               | Apparently, that doesn't work in Rust:
               | https://news.ycombinator.com/item?id=23363647
        
               | the8472 wrote:
                | It does work in Rust. You just cannot rely on Drop _for
                | memory safety_. If you mem::forget a struct that holds
                | onto some other resource then all that means is that
                | you're committing that resource to the lifetime of the
                | process. We usually call that a leak but it can be
                | intentional.
        
               | pron wrote:
               | Which is why finalizers in Java have been officially
               | deprecated [1] and might be removed altogether in a
               | future release.
               | 
               | [1]: https://docs.oracle.com/en/java/javase/14/docs/api/j
               | ava.base...
        
           | mcguire wrote:
           | Finalizers/destructors do not work well in garbage collected
           | languages, for that very reason.
        
           | saagarjha wrote:
           | Usually in a background thread ;)
        
             | arcticbull wrote:
             | Indeed. The difference is this is happening
             | deterministically, in place (with an optional deferral). A
             | garbage collector has all sorts of different trade-offs.
        
           | pron wrote:
           | > A garbage collector has to traverse the data structure in a
           | similar way to determine whether it (and it's embedded keys
           | and values) are part of the live set or not
           | 
           | Yes, but in practice tracing in a tracing GC is done
           | concurrently and with the help of GC barriers that don't
           | require synchronization and so are generally cheaper than the
           | common mechanisms for reference-counting GC.
           | 
           | > and to invoke finalizers
           | 
           | As others have said, finalizers are very uncommon and, in
           | fact, have been deprecated in Java.
        
         | Jasper_ wrote:
         | One of my favorite papers by Bacon et al expands on this
         | intuition that garbage collection and reference counting are
         | opposite tradeoffs in many respects, and gives a formal theory
         | for it. My views on gc/rc haven't been the same since.
         | 
         | http://researcher.watson.ibm.com/researcher/files/us-bacon/B...
        
           | pron wrote:
           | That's a great paper, but one important thing to point out is
           | that some production-grade tracing GCs are on the
           | sophisticated end of that paper, while almost all reference
           | counting GCs are on the rather primitive end. Given the same
           | amount of effort, it's easier to get a reasonable result with
           | reference-counting, but there are industrial-strength tracing
           | GCs out there that have had _a lot_ of effort put into them.
        
         | jeffdavis wrote:
         | "Generational/compacting GC has the opposite problem. Garbage
         | collection takes time proportional to the live set, and the
         | amount of memory collected is unimportant."
         | 
         | Takes time proportional the live set _times the number of GC
         | runs that happen while the objects are alive_. In other words,
         | the longer the objects live, the more GC runs have to scan that
         | object (assuming there is enough activity to trigger the GC),
         | and the worse GC looks.
        
       | rhacker wrote:
       | Pass by reference?
        
         | bszupnick wrote:
         | If you pass by reference the heavy object won't be dropped. If
         | your goal is to drop a heavy object, this is a cool way to do
         | it.
        
       | thickice wrote:
       | Is this applicable for Go as well ?
        
       | [deleted]
        
       | SilasX wrote:
        | Completely different dynamic (because Rust has no GC), but this
       | reminds me of how Twitch made their server, written in Go, a lot
       | faster by allocating a bunch of dummy memory at the beginning so
       | the garbage collector doesn't trigger nearly as often:
       | 
       | https://news.ycombinator.com/item?id=21670110
        
         | the8472 wrote:
          | The Java equivalent to the Go case would simply be adjusting
          | the -Xms flag. The Go approach is needlessly convoluted
          | because the runtime doesn't offer any tuning knobs.
         | 
         | As for the rust case, if you squint then it's similar to a
         | concurrent collector.
        
       | saagarjha wrote:
        | Why would you ever write a get_size function that drops the
        | object you call it on? Surely in an actual, non-contrived use
        | case, spawning another thread and letting the drop occur there
        | would just be plain worse?
        
         | pjmlp wrote:
          | Not at all; Herb Sutter has a CppCon talk about this kind of
          | optimisation.
          | 
          | It is also the approach taken by C++/WinRT: COM and UWP
          | components get moved onto a background cleanup thread, to
          | avoid application pauses when complex data structures reach a
          | zero count.
        
         | ashtonkem wrote:
         | It's a contrived example to demonstrate the technique.
        
         | epage wrote:
         | I believe this is contrived to prove a point.
         | 
          | And this isn't just a help in these contrived examples. I
          | believe process cleanup (an extreme case of cleaning up
          | objects) is one of the cases where garbage collection
          | performs better, because it doesn't have to unwind the stack,
          | call cleanup functions that are not in the cache, and make a
          | lot of `free` calls to the allocator.
         | 
         | I vaguely remember reading about Google killing processes
         | rather than having them clean up correctly, relying on the OS
         | to properly clean up any resources of significance.
         | 
         | Now this doesn't mean you should do this in all cases. Profile
         | first, see if you can avoid the large objects, and then look
         | into deferred de-allocations ... if the timing of resource
         | cleanup meets your application's guarantees.
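          | 
          | A minimal sketch of that "let the OS clean up" idea in Rust
          | (my own illustration, not from the article):
          | `std::mem::forget` gives up ownership without running any
          | destructors, so the memory is reclaimed wholesale at process
          | exit instead of being freed allocation by allocation:
          | 
          |     use std::collections::HashMap;
          |     
          |     fn main() {
          |         let heavy: HashMap<usize, Vec<usize>> =
          |             (0..1_000_000).map(|i| (i, vec![i])).collect();
          |     
          |         // ... use `heavy` ...
          |     
          |         // Skip the slow recursive drop entirely; the OS
          |         // reclaims the whole address space on exit.
          |         std::mem::forget(heavy);
          |     }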
        
           | Reelin wrote:
           | > killing processes rather than having them clean up
           | correctly, relying on the OS
           | 
           | I recall Firefox preventing cleanup code from running when
           | you quit a few years ago. Prior to that, quitting with a lot
           | of pages open (ie hundreds) could cause it to lock up for
           | quite some time.
        
           | seventh-chord wrote:
            | Killing a process without freeing all allocations is, as
            | far as I can tell, routine in C. For memory in particular
            | it makes no sense to "free" allocations when the whole
            | address space is getting scrapped anyway. Of course, once
            | you add RAII the compiler can't reason about which
            | destructors it can skip on program exit, and if programmers
            | are unaware of this you get programs that are slow to
            | close.
        
             | gpderetta wrote:
              | exit(2) will only call destructors of static objects;
              | quick_exit not even those.
        
             | estebank wrote:
             | > Killing a process without freeing all allocations is, as
             | far as I can tell, routine in C.
             | 
             | Many times by accident :)
             | 
             | > if programmers are negligent of this you get programs
             | that are slow to close.
             | 
             | I wouldn't call that negligence, just not fully optimized.
        
         | Reelin wrote:
         | I think the contrived use case is just for illustrative
         | purposes? If I'm understanding correctly, the combination of
         | cleanup code and deallocation can sometimes consume enough time
         | that it's worth dispatching it on another thread. That's hardly
         | specific to Rust though.
         | 
         | As you note that will certainly add some overhead, although
         | that could be minimized by not spawning a fresh thread each
         | time. It could easily reduce latency for a thread the UI is
         | waiting on in many cases.
        
           | tedunangst wrote:
           | It would be helpful to see an example from a real
           | application, too.
        
             | masklinn wrote:
             | A very large Vec<String> (say a few million non-empty
             | strings) would do I'd guess, Rust would drop the Vec which
             | would recursively drop each String.
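              | 
              | A minimal sketch of that case (my own numbers): build a
              | couple of million heap-allocated strings, then move the
              | Vec into a spawned thread so the recursive drop happens
              | off the current thread:
              | 
              |     use std::thread;
              |     use std::time::Instant;
              |     
              |     fn main() {
              |         // One heap allocation per String, plus the
              |         // Vec's own buffer.
              |         let v: Vec<String> =
              |             (0..2_000_000).map(|i| i.to_string()).collect();
              |     
              |         let start = Instant::now();
              |         // The Vec moves into the new thread; every
              |         // String is freed over there.
              |         thread::spawn(move || drop(v));
              |         println!("returned after {:?}", start.elapsed());
              |         // NB: main may exit before the dropper thread
              |         // finishes (see discussion below).
              |     }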
        
         | Areading314 wrote:
         | Right there is no reason to pass ownership to a function like
         | this.
        
         | nickm12 wrote:
         | I took this to be a contrived example to illustrate the point.
         | I could imagine a process that creates a big data structure
         | (e.g. parse an xml file), pulls some data out, and then drops
         | the data structure. If you want to use that data sooner, you
         | can push the cleanup off your thread.
        
       | andrewfromx wrote:
        | Hmm, my first thought is: having to do that is a lot like C
        | and cleaning up my own allocations. This feels like something
        | Rust should automatically do for me?
        
         | ashtonkem wrote:
          | Rust will automatically clean up data that has gone out of
          | scope, but you can also do this manually with the "drop"
          | function, which is only necessary if you want to clean up
          | explicitly, such as in a different thread.
          | 
          | Interestingly, the drop function could be written by any
          | user: it is simply an empty function that takes its argument
          | by value. The semantics of ownership in Rust make that
          | sufficient to trigger memory cleanup.
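          | 
          | For reference, the standard library's drop really is just
          | that; its entire definition in std::mem is:
          | 
          |     // Taking `_x` by value moves it into the function, and
          |     // it is dropped when the (empty) body ends.
          |     pub fn drop<T>(_x: T) {}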
        
         | klyrs wrote:
         | As I understand it, rust _is_ automatically cleaning up, and
         | that can cause glitchy timing. The clever hack is that rust
         | lets you shunt that cleanup process off to another thread when
         | you 're the sole owner of that object. You can do the same
         | thing in C, but unlike rust, the cost of cleanup is not hidden
         | by the syntax.
        
           | klyrs wrote:
            | On further reflection... I'm curious about how allocators
            | would handle this -- if you return from this context only
            | to make another heavyweight object, it seems like you'd be
            | trading glitchy timing for allocator contention.
        
         | devit wrote:
         | Because it's impossible to do this automatically in the general
         | case.
         | 
         | In particular, types may not be sendable to other threads, or
         | may have side effects on dropping, and in those cases you would
         | need to rearchitect the code before you can apply this
         | technique.
         | 
          | Also, this technique adds overhead, so it should not be used
          | (not even conditionally) if you don't care about latency or
          | if the objects are always small, and the compiler cannot know
          | whether that is the case.
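          | 
          | A minimal sketch of the Send limitation (my own example): a
          | value holding an Rc cannot be moved to another thread, and
          | the compiler rejects the attempt:
          | 
          |     use std::rc::Rc;
          |     
          |     fn main() {
          |         let shared: Rc<Vec<u8>> = Rc::new(vec![0; 1024]);
          |     
          |         // Does not compile: `Rc<Vec<u8>>` is not `Send`,
          |         // so it cannot move into the spawned closure.
          |         // std::thread::spawn(move || drop(shared));
          |     
          |         drop(shared); // must be dropped on this thread
          |     }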
        
         | ReaLNero wrote:
            | In C, if you forget to clean up, you have a memory leak
            | which is hard to track down. In Rust, if you don't do this,
            | you aren't risking memory leaks, only losing performance. A
            | profiler can tell you when you should drop asynchronously.
        
           | [deleted]
        
           | madmax96 wrote:
           | >A profiler can tell you when you should drop asynchronously
           | 
           | Is there any profiler that does this today?
           | 
           | What are the drawbacks with asynchronous drops?
        
             | ehsanu1 wrote:
             | See some discussion here: https://www.reddit.com/r/rust/com
             | ments/gntv7l/dropping_heavy...
        
               | [deleted]
        
       | dirtydroog wrote:
       | Oh my good god.
       | 
       | I'm hoping this is down to developer naivety rather than being a
       | feature of rust.
        
         | sockgrant wrote:
          | 1) He should pass by reference to avoid the extra copy. So in
          | his example, yes, it's dev naivety.
          | 
          | 2) But somewhere, somehow, this object will be deallocated,
          | so his trick of moving it to another thread would work if the
          | deallocation takes a while. Same for C++ if you have a
          | massive object in a unique_ptr. So it's not a Rust issue.
        
           | renewiltord wrote:
           | Where's the extra copy? I don't see one. He's moving the
           | struct into the function, getting size and then dropping it.
        
           | VWWHFSfQ wrote:
           | > avoid the extra copy
           | 
           | there is no copy happening here
        
         | ReactiveJelly wrote:
         | The same could happen in C++, I think. Destructors are supposed
         | to be called recursively.
        
         | wizzwizz4 wrote:
         | It's not a feature of Rust; it's a "feature" of the way we
         | design operating systems and processors. This is the same in C.
        
       | [deleted]
        
       | andreygrehov wrote:
        | Does anyone know how this would work in Go?
        
         | echlebek wrote:
         | Lots to be learned at https://blog.golang.org/ismmkeynote
        
         | arendtio wrote:
          | I have no idea, but my guess is that it doesn't matter, as
          | the deallocation is done by the garbage collector.
        
       | maxton wrote:
       | I'm not very familiar with Rust, but I don't understand why you
       | wouldn't just use a reference-to-HeavyThing as the function
       | argument, so that the object isn't moved and then dropped in the
       | `get_size` function?
        
         | ehsanu1 wrote:
         | If you never drop it, you have a memory leak. If the caller
         | drops it, it's still the same as the `get_size` dropping it in
         | terms of performance impact.
         | 
         | Generally you'd only pass ownership when that's needed for some
         | reason. So this toy example might not be realistic but it does
         | demonstrate the performance impact.
        
         | heavenlyblue wrote:
         | So the caller of the function still needs to free HeavyThing in
         | the same thread.
        
         | epage wrote:
          | For these contrived cases, yes, you would just pass a
          | reference to the function, but I think the example is
          | simplified down to demonstrate the point.
        
         | Cyph0n wrote:
         | You're spot on: this is simply a bad example that you would
         | never see in a real application.
        
       | jeffdavis wrote:
       | Speedup numbers should be given when optimizing constant factors
       | -- e.g. "I made this operation 5X faster using SIMD" or "By
       | employing readahead, I sped up this file copy by 10X".
       | 
       | The points raised in this article are really different:
       | 
       | * don't do slow stuff in your latency-critical path
       | 
       | * threads are a nice way to unload slow stuff that you don't need
       | done right away (especially if you have spare cores)
       | 
       | * dropping can be slow
       | 
       | The first and second points are good, but not really related to
       | rust, deallocations, or the number 10000.
       | 
       | The last point is worth discussing, but still not really related
       | to the number 10000 and barely related to rust. Rust encourages
       | an eager deallocation strategy (kind of like C), whereas many
       | other languages would use a more deferred strategy (like many
       | GCs).
       | 
       | It seems like deferred (e.g. GC) would be better here, because
       | after the main object is dropped, the GC doesn't bother to
       | traverse all of the tiny allocations because they are all dead
       | (unreachable by the root), and it just discards them. But that's
       | not the full story either.
       | 
       | It's not terribly common to build up zillions of allocations and
       | then immediately free them. What's more common is to keep the
       | structure (and its zillions of allocations) around for a while,
       | perhaps making small random modifications, and then eventually
       | freeing them all at once. If using a GC, while the large
       | structure is alive, the GC needs to scan all of those objects,
       | causing a pause each time, which is not great. The eager strategy
       | is also not great: it only needs to traverse the structure once
       | (at deallocation time), but it needs to individually deallocate.
       | 
       | The answer here is to recognize that all of the objects in the
       | structure will be deallocated together. Use a separate
       | region/arena/heap for the entire structure, and wipe out that
       | region/arena/heap when the structure gets dropped. You don't need
       | to traverse anything while the structure is alive, or when it
       | gets dropped.
       | 
       | In rust, probably the most common way to approximate this is by
       | using slices into a larger buffer rather than separate
       | allocations. I wish there was a little better way of doing this,
       | though. It would be awesome if you could make new heaps specific
       | to an object (like a hash table), then allocate the keys/values
       | on that heap. When you drop the structure, the memory disappears
       | without traversal.
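        | 
        | A minimal sketch of the slices-into-one-buffer idea (my own
        | illustration; it stores index ranges rather than borrowed
        | slices to keep it self-contained): all the string data lives in
        | a single allocation, so dropping the whole structure frees two
        | allocations instead of millions:
        | 
        |     // All strings live in one contiguous buffer; `spans`
        |     // holds (start, end) byte ranges into it.
        |     struct StringArena {
        |         buf: String,
        |         spans: Vec<(usize, usize)>,
        |     }
        |     
        |     impl StringArena {
        |         fn new() -> Self {
        |             StringArena { buf: String::new(), spans: Vec::new() }
        |         }
        |     
        |         fn push(&mut self, s: &str) {
        |             let start = self.buf.len();
        |             self.buf.push_str(s);
        |             self.spans.push((start, self.buf.len()));
        |         }
        |     
        |         fn get(&self, i: usize) -> &str {
        |             let (start, end) = self.spans[i];
        |             &self.buf[start..end]
        |         }
        |     }
        |     
        |     fn main() {
        |         let mut arena = StringArena::new();
        |         for i in 0..2_000_000 {
        |             arena.push(&i.to_string());
        |         }
        |         println!("{}", arena.get(42));
        |         // Dropping `arena` frees exactly two allocations,
        |         // with no per-string traversal.
        |     }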
        
       | ncmncm wrote:
       | There is nothing unique to Rust about this; it is a very old
       | technique. It is usually much inferior to the "arena allocator"
       | method, where all the discarded allocations are coalesced and
       | released in a single, cheap operation that could as well be done
       | without another thread. That method is practical in many
       | languages, Rust possibly included. C++ supports it in the
       | Standard Library, for all the standard containers.
       | 
       | If important work must be done in the destructors, it is still
       | better to farm the work out to a thread pool, rather than
       | starting another thread. Again, C++ supports this in its Standard
       | Library, as I think Rust does too.
       | 
       | One could suggest that the only reason to present the idea in
       | Rust is the cynical one that Rust articles get free upvotes on
       | HN.
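        | 
        | A minimal sketch of the farm-it-out variant (my own
        | illustration; a single long-lived worker stands in for a real
        | pool): values sent down a channel are dropped on the worker
        | thread, and joining it at shutdown guarantees the queue is
        | drained:
        | 
        |     use std::any::Any;
        |     use std::sync::mpsc::{self, Sender};
        |     use std::thread::{self, JoinHandle};
        |     
        |     // One long-lived dropper thread; anything sent down the
        |     // channel is dropped over there.
        |     fn spawn_dropper() -> (Sender<Box<dyn Any + Send>>, JoinHandle<()>) {
        |         let (tx, rx) = mpsc::channel::<Box<dyn Any + Send>>();
        |         let handle = thread::spawn(move || {
        |             for _item in rx {
        |                 // `_item` is dropped here, off the sender.
        |             }
        |         });
        |         (tx, handle)
        |     }
        |     
        |     fn main() {
        |         let (dropper, handle) = spawn_dropper();
        |         let heavy: Vec<String> =
        |             (0..1_000_000).map(|i| i.to_string()).collect();
        |         dropper.send(Box::new(heavy)).unwrap();
        |     
        |         // Closing the channel and joining guarantees all
        |         // queued values are dropped before the process exits.
        |         drop(dropper);
        |         handle.join().unwrap();
        |     }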
        
       | dathinab wrote:
        | One thing I just noticed is that the example doesn't make sure
        | to actually run the new thread to completion before the main
        | thread exits.
        | 
        | This means that if you do a "drop in another thread" and then
        | main exits, the drop might never run. Which is often fine, as
        | the exit of main causes process termination and as such will
        | free the memory anyway.
        | 
        | But it would be a problem on some systems where memory cleanup
        | on process exit is less reliable. Though such systems are
        | fairly rare by now, I think.
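        | 
        | The minimal fix (a sketch, not from the article) is to keep the
        | JoinHandle that thread::spawn returns and join it before main
        | returns:
        | 
        |     use std::thread;
        |     
        |     fn main() {
        |         let heavy: Vec<String> =
        |             (0..1_000_000).map(|i| i.to_string()).collect();
        |     
        |         let dropper = thread::spawn(move || drop(heavy));
        |     
        |         // ... latency-sensitive work happens here ...
        |     
        |         // Joining guarantees the drop has finished before
        |         // main exits.
        |         dropper.join().unwrap();
        |     }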
        
         | ReactiveJelly wrote:
         | It would have to be a non-desktop system.
         | 
         | I'm pretty sure Linux will always free process-private memory,
         | and threads, and file descriptors when a process exits.
         | 
         | The only things that can leak in typical cases are some kinds
         | of shared memory and maybe child processes?
        
       | pierrebai wrote:
       | I've seen variations on this trick multiple times. Using threads,
       | using a message sent to self, using a list and a timer to do the
       | work "later", using a list and waiting for idle time...
       | 
        | They all have one thing in common: papering over a bad design.
       | 
        | In the particular example given, the sub-vectors probably come
       | from a common source. One could keep a big buffer (a single
       | allocation) and an array of internal pointers. For example of
       | such a design to hold a large array of text strings, see for
       | example this blog entry and its associated github repo:
        | https://www.spiria.com/en/blog/desktop-software/optimizing-shared-data/
        | https://github.com/pierrebai/FastTextContainer
       | 
        | Roughly it is this:
        | 
        |     struct TextHolder
        |     {
        |         const char* common_buffer;
        |         std::vector<const char*> internal_pointers;
        |     };
       | 
       | This is of course addressing the example, but the underlying
       | message is generally applicable: change your flawed design, don't
       | hide your flaws.
        
       | cperciva wrote:
       | If _freeing_ the data structure in question takes this long, how
       | much time are you wasting _duplicating_ the data structure?
        
         | saagarjha wrote:
         | I'm actually very curious why it takes this long; is Rust
         | memseting the buffer when dropping it?
         | 
          | Edit: it seems like turning on optimizations improves the
          | situation quite a bit. Not sure why they were profiling the
          | debug build.
        
           | Reelin wrote:
           | > I'm actually very curious why it takes this long; is Rust
           | memseting the buffer when dropping it?
           | 
           | Regardless of memset and optimizations, consider a
           | particularly complicated object which lives on the heap and
           | contains hundreds of other nested objects (which themselves
           | contain nested objects, etc). Now imagine that a significant
           | fraction of them make use of RAII. That cleanup code can't be
           | elided.
           | 
           | That being said, it's a pretty bad example if they were
           | actually profiling the debug build ...
        
           | fpgaminer wrote:
           | > it seems like turning on optimizations seems to improve the
           | situation quite a bit.
           | 
           | I'm not seeing that on my local machine? Were you comparing
           | on the Playground which would be quite variable in its
            | results?
            | 
            |     > cargo build
            |        Compiling foo v0.1.0 (/private/tmp/foo)
            |         Finished dev [unoptimized + debuginfo] target(s) in 0.42s
            |     > ./target/debug/foo
            |     drop in another thread 52.121us
            |     drop in this thread 514.687233ms
            | 
            |     > cargo build --release
            |        Compiling foo v0.1.0 (/private/tmp/foo)
            |         Finished release [optimized] target(s) in 0.47s
            |     > ./target/release/foo
            |     drop in another thread 48.418us
            |     drop in this thread 548.005373ms
        
             | saagarjha wrote:
             | I saw an increase of about 2x on my computer, though I
             | didn't take too much effort to control for noise.
        
           | [deleted]
        
           | firethief wrote:
           | > Edit: it seems like turning on optimizations seems to
           | improve the situation quite a bit. Not sure why they were
           | profiling the debug build.
           | 
           | This is the most important point in the thread, since it
           | invalidates the results for most purposes.
        
             | saagarjha wrote:
             | Not completely, it's still 2-3 orders of magnitude slower.
        
               | firethief wrote:
                | You're right. I expected it would make a bigger
                | difference.
        
               | dathinab wrote:
                | The thing is, it's not slow because Rust is doing
                | anything wrong or unoptimized; it's slow because
                | cleaning up insane amounts of memory allocations is
                | slow.
                | 
                | Also, if you run this:
                | 
                |     fn main() {
                |         ::std::thread::spawn(move || { println!("end") });
                |         println!("Hello, world!");
                |     }
                | 
                | you might notice that "end" might not be printed,
                | because the main thread exits before it prints and
                | terminates the process. This means that the dropping
                | might actually not happen if it's at the end of the
                | program, and nothing is faster than not doing the work.
                | 
                | Also, it's a not uncommon pattern for small user-facing
                | CLIs to leak (memory) resources, as they will (or
                | should) be cleaned up at process termination.
        
         | bszupnick wrote:
          | This code doesn't duplicate it. In Rust, when a variable is
          | sent as an argument to a function, its "ownership" moves into
          | the scope of that function.
         | 
         | https://doc.rust-lang.org/book/ch04-01-what-is-ownership.htm...
        
           | cperciva wrote:
           | You're missing my point. Unless the only thing you want to do
           | with your giant data structure is measure its size, you're
           | not going to be passing ownership of your only copy of it
           | into the get_size function. You're going to be passing in a
           | copy -- hence the cost of duplicating everything.
        
             | phyzome wrote:
             | In the different contrived case where it gets copied, you'd
             | instead change this code to take an immutable reference to
             | it, and compute the size of that. Or you'd call .size()
             | instead of calling this function!
        
             | eMSF wrote:
             | It is just an example. You can think of "measuring size"
             | here as getting the result of a long computation that
             | involves a lot of allocations. After you get the result,
             | you no longer care about the intermediate stuff - i.e. all
             | the allocations. You certainly don't want to duplicate
              | them, you just want to get rid of them, and the author's
              | point is that you might not want to deallocate (drop)
              | them in the UI thread.
             | 
             | If it helps you, you might want to imagine the contents of
             | the get_size function as being the end part of a longer
             | calc_foo function. What's really missing the point is
             | focusing so hard on the part that the example even contains
             | a call to size() of a collection.
        
             | ReactiveJelly wrote:
             | Rust is stricter about aliasing than C++ is.
             | 
             | Vectors are the size of 3 pointers (data, size, capacity),
             | so I guess 24 bytes on x64.
             | 
             | Even if the move requires a memcpy, it's only copying that
             | 24 bytes - The heap allocation is not copied, because there
             | are never two owners of the vector at once.
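              | 
              | A quick way to check (a sketch; the exact number is
              | platform-dependent, but 24 bytes is what you'd expect on
              | x64):
              | 
              |     fn main() {
              |         // Pointer + length + capacity: 3 usizes,
              |         // i.e. 24 bytes on a 64-bit target.
              |         println!("{}", std::mem::size_of::<Vec<String>>());
              |     }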
        
         | [deleted]
        
       | Animats wrote:
       | There's a worse case in deallocation. Tracing through a data
       | structure being released for a long-running program can cause
       | page faults, unused data having been swapped out. This is part of
       | why some programs take far too long to exit.
        
       | littlestymaar wrote:
        | The title is slightly wrong: it's not going to make your code
        | _faster_, it's going to reduce _latency_ on the given thread.
        | 
        | It may be a net win if this is the UI thread of a desktop app,
        | but overall it will come at a performance cost: modern
        | allocators have thread-local memory pools, and now you're
        | moving away from them. And if you're running your code on a
        | NUMA system (most servers nowadays), moving memory from one
        | thread to another means you can end up freeing non-local memory
        | instead of local memory. Also, you won't have any backpressure
        | on your allocations, so you are liable to run out of memory
        | (especially because your deallocations now happen more slowly
        | than they should).
        | 
        | Main takeaway: used blindly it's an anti-pattern, but it can be
        | a good idea in its niche: the UI thread of a GUI.
        
       | wmichelin wrote:
       | Minor typo, `froget` instead of `forget`
        
       | grogers wrote:
       | Contrived examples like this are ridiculous. Creating such a
       | heavy thing is likely even more expensive than tearing it down.
       | So unless you create it on a separate thread, you probably
       | shouldn't be freeing it on a separate one. It's not going to
       | solve your interactivity problem. If you are creating the object
       | on a separate thread then it's already going to be natural to
       | free it on a separate one too.
        
         | ReactiveJelly wrote:
         | Something is better than nothing.
        
       | thePunisher wrote:
       | The obvious solution would be to borrow the HeavyThing instead of
       | having it dropped inside the function.
        
       ___________________________________________________________________
       (page generated 2020-05-30 23:00 UTC)