[HN Gopher] How fast are Linux pipes anyway?
       ___________________________________________________________________
        
       How fast are Linux pipes anyway?
        
       Author : rostayob
       Score  : 576 points
       Date   : 2022-06-02 09:19 UTC (13 hours ago)
        
 (HTM) web link (mazzo.li)
 (TXT) w3m dump (mazzo.li)
        
       | sandGorgon wrote:
       | Android's flavor of Linux uses "binder" instead of pipes because
        | of its security model. IMHO filesystem-based IPC mechanisms
        | (notably pipes) can't be used because of the lack of a world-
        | writable directory - I may be wrong here.
       | 
       | Binder comes from Palm actually (OpenBinder)
        
         | Matthias247 wrote:
         | Pipes don't necessarily mean one has to use FS permissions. Eg
         | a server could hand out anonymous pipes to authorized clients
         | via fd passing on Unix domain sockets. The server can then
         | implement an arbitrary permission check before doing this.
        
         | megous wrote:
         | "lack of a world-writable directory"
         | 
         | What's that?
         | 
         | A lot of programs store sockets in /run which is typically
         | implemented by `tmpfs`.
        
         | marcodiego wrote:
          | The history of Binder is more involved and has its seeds in
          | BeOS, IIRC.
        
       | stackbutterflow wrote:
       | This site is pleasing to the eye.
        
         | apostate wrote:
         | It looks like it is using the "Tufte" style, named after Edward
         | Tufte, who is very famous for his writing on data
         | visualization. More examples: https://rstudio.github.io/tufte/
        
       | ianai wrote:
       | I usually just use cat /dev/urandom > /dev/null to generate load.
       | Not sure how this compares to their code.
       | 
       | Edit: it's actually "yes" that I've used before for generating
       | load. I remember reading somewhere "yes" was optimized
       | differently than the original Unix command as part of the unix
       | certification lawsuit(s).
       | 
       | Long night.
        
         | yakubin wrote:
         | On 5.10.0-14-amd64 "pv < /dev/urandom >/dev/null" reports
         | 72.2MiB/s. "pv < /dev/zero >/dev/null" reports 16.5GiB/s. AMD
         | Ryzen 7 2700X with 16GB of DDR4 3000MHz memory.
         | 
         | "tr '\0' 1 </dev/zero | pv >/dev/null" reports 1.38GiB/s.
         | 
         | "yes | pv >/dev/null" reports 7.26GiB/s.
         | 
         | So "/dev/urandom" may not be the best source when testing
         | performance.
        
           | sumtechguy wrote:
            | I think they were generating load? Going through the
            | urandom device isn't bad for that, as it has to do a bit of
            | work to get that random number. Just for throughput,
            | though, zero is probably better.
        
             | gtirloni wrote:
             | "Generating load" for measuring pipe performance means
             | generating bytes. Any bytes. urandom is terrible for that.
        
             | yakubin wrote:
             | I don't understand. If you're testing how fast pipes are,
             | then I'd expect you to measure throughput or latency. Why
             | would you measure how fast something unrelated to pipes is?
             | If you want to measure this other thing on the other hand,
             | why would you bother with pipes, which add noise to the
             | measurement?
             | 
             | UPDATE: If you mean that you want to test how fast pipes
             | are when there is other load in the system, then I'd
             | suggest just running a lot of stuff in the background. But
             | I wouldn't put the process dedicated for doing something
             | else into the pipeline you're measuring. As a matter of
             | fact, the numbers I gave were taken with plenty of heavy
             | processes running in the background, such as Firefox,
             | Thunderbird, a VM with another instance of Firefox,
             | OpenVPN, etc. etc. :)
        
               | khorne wrote:
               | Because they mentioned generating load, not testing pipe
               | performance.
        
               | yakubin wrote:
               | Oh, wait. You mean that this "cat </dev/urandom
               | >/dev/null" was meant to be running in the background and
               | not be the pipeline which is tested? Ok, my bad for not
               | getting the point.
        
           | ianai wrote:
            | You're right and I mistyped; it's yes that I usually use. I
            | think it's optimized for throughput.
        
       | spacedcowboy wrote:
        | Ran the basic initial implementation on my Mac Studio and was
        | pleasantly surprised to see
        | 
        |     @elysium pipetest % pipetest | pv > /dev/null
        |     102GiB 0:00:13 [8.00GiB/s]
        |     @elysium ~ % pv < /dev/zero > /dev/null
        |     143GiB 0:00:04 [36.4GiB/s]
       | 
        | Not a valid comparison between the two machines because I don't
        | know what the original machine is, but macOS rarely comes out
        | shining in this sort of comparison, and the simplistic approach
        | here giving 8 GB/s rather than the author's 3.5 GB/s was better
        | than I'd expected, even given the machine I'm using.
        
         | mhh__ wrote:
         | Given the machine as in a brand new Mac?
        
           | spacedcowboy wrote:
           | given that the machine is the most performant Mac that Apple
           | make.
        
       | [deleted]
        
       | sylware wrote:
       | yep, you want perf? Don't mutex then yield, do spin and check
       | your cpu heat sink.
       | 
       | :)
        
       | jagrsw wrote:
       | Something maybe a bit related.
       | 
       | I just had 25Gb/s internet installed
       | (https://www.init7.net/en/internet/fiber7/), and at those speeds
       | Chrome and Firefox (which is Chrome-based) pretty much die when
       | using speedtest.net at around 10-12Gbps.
       | 
       | The symptoms are that the whole tab freezes, and the shown speed
       | drops from those 10-12Gbps to <1Gbps and the page starts updating
       | itself only every second or so.
       | 
       | IIRC Chrome-based browsers use some form of IPC with a separate
       | networking process, which actually handles networking, I wonder
       | if this might be the case that the local speed limit for
       | socketpair/pipe under Linux was reached and that's why I'm seeing
       | this.
        
         | [deleted]
        
         | Spooky23 wrote:
         | I ran into this with a VDI environment in a data center. We had
         | initially delivered 10Gb Ethernet to the VMs, because why not.
         | 
          | Turned out Windows 7 or the NICs needed a lot of tuning to
          | work well. There was a lot of freezing and other fail.
        
         | implying wrote:
         | Firefox is not based on the chromium codebase, it is older.
        
           | formerly_proven wrote:
           | Well if we're talking ancestors that's technically true, but
           | not by that much - Firefox comes from Netscape,
           | Chrome/Safari/... come from KHTML.
        
             | elpescado wrote:
             | AFAIR, KHTML was/is not related to Netscape/Gecko in any
             | way.
        
             | wodenokoto wrote:
              | > ... on August 16, 1999 that [Lars Knoll] had checked in
              | what amounted to a complete rewrite of the KHTML library--
              | changing KHTML to use the standard W3C DOM as its internal
              | document representation.
              | https://en.wikipedia.org/wiki/KHTML#Re-write_and_improvement
              | 
              | > In March 1998, Netscape released most of the code base
              | for its popular Netscape Communicator suite under an open
              | source license. The name of the application developed from
              | this would be Mozilla, coordinated by the newly created
              | Mozilla Organization.
              | https://en.wikipedia.org/wiki/Mozilla_Application_Suite#Hist...
              | 
              | Netscape Communicator (or Netscape 4) was released in 1997,
              | so if we are tracing lineage, I'd say Firefox has a 2-year
              | head start.
        
         | def- wrote:
         | Firefox is only Chrome-based on iOS.
        
           | rwaksmunski wrote:
           | You mean WebKit.
        
           | karamanolev wrote:
           | It's Safari-based, which is Webkit-based. Chrome is also
           | Safari-based on iOS, because all the browsers must be.
           | There's no actual Chrome (as in Blink, the browser engine) on
           | iOS, at least in Play Store.
        
             | Izkata wrote:
             | > It's Safari-based, which is Webkit-based.
             | 
             | Firefox only uses Webkit on iOS, due to Apple requirements.
             | It uses Gecko everywhere else. And I don't think it's ever
             | been Safari-based anywhere.
        
         | jve wrote:
         | Do you actually mean Gbit/s? 25Gb/s would translate to
         | 200Gbit/s ...
        
           | Denvercoder9 wrote:
           | The small "b" is customarily used to refer to bits, with the
           | large "B" used to refer to bytes. So 25 Gb/s would be 25
           | Gbit/s, while 25 GB/s would be 200 Gbit/s.
        
           | karamanolev wrote:
           | Gb != GB. Per Wikipedia, which aligns with my understanding,
           | 
           | "The gigabit has the unit symbol Gbit or Gb."
           | 
           | 25GB/s would translate to 200Gbit/s and also 200Gb/s.
        
         | reitanqild wrote:
         | > and at those speeds Chrome and Firefox (which is Chrome-
         | based)
         | 
         | AFAIK, Firefox is not Chrome-based anywhere.
         | 
         | On iOS it uses whatever iOS provides for webview - as does
         | Chrome on iOS.
         | 
          | Firefox and Safari are now the only supported mainstream
          | browsers that have their own rendering engines. Firefox is
          | the only one with its own engine that is also cross-platform.
          | It is also open source.
        
           | yosamino wrote:
           | > AFAIK, Firefox is not Chrome-based anywhere.
           | 
           | Not technically "Chrome-based", but Firefox draws graphics
           | using Chrome's Skia graphics engine.
           | 
           | Firefox is not completely independent from Chrome.
        
             | bawolff wrote:
             | I feel like counting every library is silly.
             | 
             | In any case, i thought chrome used libnss which is a
             | mozilla library, so you could say the reverse as well.
        
             | SahAssar wrote:
             | Skia started in 2004 independently of google and was then
             | acquired by google. Calling it "Chrome's Skia graphics
             | engine" makes it sound like it was built _for_ chrome.
        
               | [deleted]
        
               | [deleted]
        
           | SahAssar wrote:
           | > Firefox is the only that has their own rendering engine and
           | is cross platform.
           | 
            | Interestingly, Safari's rendering engine is open source and
            | cross-platform, but the browser is not. Lots of Linux-
            | focused browsers (Konqueror, GNOME Web, surf) and most
            | embedded browsers (Nintendo DS & Switch, PlayStation) use
            | WebKit. Also some user interfaces (like webOS, which runs
            | on all of LG's TVs and smart refrigerators) use WebKit as
            | their renderer.
        
             | qwerty456127 wrote:
                | WebKit itself is a fork of Konqueror's original KHTML
                | engine, by the way.
        
               | tmccrary55 wrote:
               | Browser Genealogy
        
               | cturtle wrote:
               | Now I want to see the family tree!
        
               | capableweb wrote:
               | Ask, and you shall receive :)
               | 
               | https://en.wikipedia.org/wiki/File:Timeline_of_web_browse
               | rs....
               | 
               | https://en.wikipedia.org/wiki/Timeline_of_web_browsers
               | has tables as well.
        
           | tinus_hn wrote:
           | IOS uses WebKit which is also what Chrome is based on.
        
             | Comevius wrote:
             | Chrome uses Blink, which was forked from WebKit's WebCore
             | in 2013. They replaced JavaScriptCore with V8.
        
         | merightnow wrote:
         | Unrelated question, what hardware do you use to setup your
         | network for 25Gb/s? I've been looking at init7 for a while, but
         | gave up and stayed with Salt after trying to find the right
         | hardware for the job.
        
           | jagrsw wrote:
           | NIC: Intel E810-XXVDA2
           | 
            | Optics: To ISP: Flexoptix
            | (https://www.flexoptix.net/de/p-b1625g-10-ad.html?co10426=972...),
            | Router-PC: https://mikrotik.com/product/S-3553LC20D
           | 
           | Router: Mikrotik CCR-2004 -
            | https://mikrotik.com/product/ccr2004_1g_12s_2xs - warning:
            | it's good up to ~20Gb/s one way. It can handle ~25Gb/s
            | down, but only ~18Gb/s up, and with IPv6 the max seems to be
            | ~10Gb/s in any direction.
           | 
           | If Mikrotik is something you're comfortable using you can
           | also take a look at
           | https://mikrotik.com/product/ccr2216_1g_12xs_2xq - it's more
           | expensive (~2500EUR), but should handle 25Gb/s easily.
        
             | zrail wrote:
             | IIRC most Mikrotik products lack hardware IPv6 offload
             | which is probably why you're seeing lower speeds.
        
               | BenjiWiebe wrote:
               | In that case 10Gb/s sounds actually pretty good, if
               | that's without hardware offload.
        
         | sph wrote:
         | This makes me wonder... does anyone offer an iperf-based
         | speedtest service on the Internet?
        
           | scoopr wrote:
           | Well there are some public iperf servers listed here:
           | https://iperf.fr/iperf-servers.php
        
           | jagrsw wrote:
           | Ha.. my ISP does :) I can hit those 25Gb/s when connecting
           | directly (bypassing the router as it barely handles those
           | 25Gb/s).
           | 
            | With it in the way I get ~15-20Gb/s
            | 
            |     $ iperf3 -l 1M --window 64M -P10 -c speedtest.init7.net
            |     ..
            |     [SUM]   0.00-1.00   sec  1.87 GBytes  16.0 Gbits/sec  181406
            |     $ iperf3 -R -l 1M --window 64M -P10 -c speedtest.init7.net
            |     ..
            |     [SUM]   0.00-1.00   sec  2.29 GBytes  19.6 Gbits/sec
        
             | [deleted]
        
         | jcims wrote:
         | Speedtest does have a CLI as well, might be interesting to
         | compare them.
        
           | jagrsw wrote:
            | Yup, the CLI version works well -
            | https://www.speedtest.net/result/c/e9104814-294f-4927-af9f-d...
        
           | zrail wrote:
           | Thing to note: the open source version on GitHub, installable
           | by homebrew and native package managers, is not the same
           | version as Ookla distributes from their website and is not
           | accurate at all.
        
         | [deleted]
        
         | pca006132 wrote:
         | Is it only affecting the browser or the entire system? It might
         | be possible that the CPU is busy handling interrupts from the
         | ethernet controller, although in general these controllers
         | should use DMA and should not send interrupts frequently.
        
           | jagrsw wrote:
            | Only browser(s), the OS is capable of 25Gb/s - checked with
            | iperf and also speedtest-cli -
            | https://www.speedtest.net/result/c/e9104814-294f-4927-af9f-d...
        
         | jcranberry wrote:
         | Sounds like a hard drive cache filling up.
        
           | megous wrote:
           | One would assume speed testing website would use `Cache-
           | Control: no-store`...
           | 
           | But alas, they do not, lol. They just use no-cache on the
           | query which will not prevent the browser from storing the
           | data.
           | 
           | https://megous.com/dl/tmp/8112dd9346dd66e8.png
        
         | bayindirh wrote:
          | Chrome fires up many processes and creates an IPC-based
          | network between them to isolate stuff. It somewhat abuses
          | your OS to get what it wants in terms of isolation and
          | whatnot.
          | 
          | (Which is similar to how K8s abuses iptables and makes it
          | useless for other ends, making you install a dedicated
          | firewall in front of your ingress path, but let's not
          | digress.)
         | 
         | On the other hand, Firefox is neither chromium based, nor is a
         | cousin of it. It's a completely different codebase, inherited
         | from Netscape days and evolved up to this point.
         | 
         | As another test point, Firefox doesn't even blink at a
         | symmetric gigabit connection going at full speed (my network is
         | capped by my NIC, the pipe is _way_ fatter).
        
           | jagrsw wrote:
           | > As another test point, Firefox doesn't even blink at a
           | symmetric gigabit connection going at full speed (my network
           | is capped by my NIC, the pipe is way fatter).
           | 
            | FWIW Firefox under Linux (Firefox Browser 100.0.2 (64-bit))
            | behaves pretty much the same as Chrome. The speed rises
            | quickly to 5-8Gb/s, then the UI starts choking, and the
            | shown speed drops to 500Mb/s. It could be that there's some
            | scheduling limit or other bottleneck hit in the OS itself,
            | assuming these are different codebases (are they?).
        
             | bayindirh wrote:
              | I'd love to test and debug the path where it dies, but
              | none of the systems we have Firefox on have pipes that
              | fat (again, NIC-limited).
             | 
              | However, you can test the limits of Linux by installing
              | the CLI version of Speedtest and hitting a nearby server.
              | 
              | The bottleneck may be in the browser itself, or in your
              | graphics stack, too.
             | 
              | Linux can do pretty amazing things in the network
              | department; otherwise 100Gbps Infiniband cards wouldn't
              | be possible on Linux servers, yet we have them on our
              | systems.
              | 
              | And yes, Chrome and Firefox are very different browsers.
              | I can confidently say this because I've been using
              | Firefox since it was called Netscape 6.0 (and Mozilla in
              | Knoppix).
        
               | jeffreygoesto wrote:
                | From my experience long ago, all high-performance
                | networking under Linux was traditionally user space and
                | pre-allocated pools (netmap, DPDK, PF_RING...). I did
                | not follow how much io_uring has been catching up for
                | network-stack usage... Maybe somebody else knows?
        
               | sophacles wrote:
               | I have a service that beats epoll with io_uring (it reads
               | gre packets from one socket, and does some
               | lookups/munging on the inner packet and re-encaps them to
               | a different mechanism and writes them back to a different
               | socket). General usage for io_uring vs epoll is pretty
               | comparable IIUC. It wouldn't surprise me if streams (e.g.
               | tcp) end up being faster via io_uring and buffer
               | registration though.
               | 
               | Totally tangential - it looks like io_uring is evolving
               | beyond just io and into an alternate syscall interface,
               | which is pretty neat imho.
        
               | bayindirh wrote:
               | While I'm not very knowledgeable in specifics, there are
               | many paths for networking in Linux now. The usual kernel
               | based one is there, also there's kernel-bypass [0] paths
               | used by very high performance cards.
               | 
               | Also, Infiniband can directly RDMA to and from MPI
               | processes for making "remote memory local", allowing very
               | low latencies and high performance in HPC environments.
               | 
               | I also like this post from Cloudflare [1]. I've read it
               | completely, but the specifics are lost on me since I'm
               | not directly concerned with the network part of our
               | system.
               | 
               | [0]: https://medium.com/@penberg/on-kernel-bypass-
               | networking-and-...
               | 
               | [1]: https://blog.cloudflare.com/how-to-receive-a-
               | million-packets...
        
               | bawolff wrote:
               | > I can confidently say this, because I'm using Firefox
               | since it's called Netscape 6.0 (and Mozilla in Knoppix).
               | 
               | Mozilla suite/seamonkey isn't usually considered the same
               | as firefox, although obviously related.
        
               | bayindirh wrote:
               | I'm not talking about the version which evolved to
               | Seamonkey. I'm talking about Mozilla/Firefox 0.8 which
               | had a Mozilla logo as a "Spinner" instead of Netscape
               | logo on the top right.
        
               | bawolff wrote:
               | Netscape 6 was not firefox based
               | https://en.m.wikipedia.org/wiki/Netscape_6
               | 
               | Firefox 0.8 did not have netscape branding
               | http://theseblog.free.fr/firefox-0.8.jpg
        
           | pjmlp wrote:
            | It is using OS processes for what they were created for in
            | the first place.
            | 
            | Unfortunately the security industry has proven why threads
            | are a bad idea for applications when security is a top
            | concern.
            | 
            | The same applies to dynamically loaded code as plugins,
            | where the host application takes the blame for all the
            | instability and exploits they introduce.
        
             | bayindirh wrote:
              | Yes, Firefox does the same; however, due to the nature
              | of Firefox's processes, the OS doesn't lose much
              | responsiveness or feel bogged down when I have 50+ tabs
              | open for some research.
             | 
             | If you need security, you need isolation. If you want
             | hardware-level isolation, you need processes. That's
             | normal.
             | 
              | My disagreement with Google's applications is with how
              | they behave as if they were the only processes running
              | on the system. I'm well aware that some of the most
              | performant or secure things don't have the prettiest
              | implementation on paper.
        
               | ReactiveJelly wrote:
               | There used to be a setting to tweak Chrome's process
               | behavior.
               | 
               | I believe the default behavior is "Coalesce tabs into the
               | same content process if they're from the same trust
               | domain".
               | 
               | Then you can make it more aggressive like "Don't coalesce
               | tabs ever" or less aggressive like "Just have one content
               | process". I think.
               | 
               | I'm not sure how Firefox decides when to spawn new
               | processes. I know they have one GPU process and then
               | multiple untrusted "content processes" that can touch
               | untrusted data but can't touch the GPU.
               | 
               | I don't mind it. It's a trade-off between security and
               | overhead. The IPC is pretty efficient and the page cache
               | in both Windows and Linux _should_ mean that all the code
               | pages are shared between all content processes.
               | 
               | Static pages actually feel light to me. I think crappy
               | webapps make the web slow, not browser security.
               | 
               | (inb4 I'm replying to someone who works on the Firefox
               | IPC team or something lol)
        
               | girvo wrote:
               | > inb4 I'm replying to someone who works on the Firefox
               | IPC team or something lol
               | 
               | The danger and joy of commenting on HN!
        
               | bayindirh wrote:
               | I'm harmless, don't worry. :) Also you can find more
               | information about me in my profile.
               | 
               | Even if I was working on Firefox/Chrome/whatever, I'd not
               | be mad at someone who doesn't know something very well.
               | Why should I? We're just conversing here.
               | 
               | Also, I've been very wrong here at times, and this
               | improved my conversation / discussion skills a great
               | deal.
               | 
               | So, don't worry, and comment away.
        
       | mastax wrote:
       | I'm glad huge pages make a big difference because I just spent
       | several hours setting them up. Also everyone says to disable
       | transparent_hugepage, so I set it to `madvise`, but I'm skeptical
       | that any programs outside databases will actually use them.
        
         | deagle50 wrote:
         | JVM can. I have JetBrains set up to use them.
        
       | gigatexal wrote:
       | Now this is the kind of content I come to HN for. Absolutely
       | fascinating read.
        
       | lazide wrote:
       | The majority of this overhead (and the slow transfers) naively
       | seem to be in the scripts/systems using the pipes.
       | 
       | I was worried when I saw zfs send/receive used pipes for instance
       | because of performance worries - but using it in reality I had no
       | problems pushing 800MB/s+. It seemed limited by iop/s on my local
       | disk arrays, not any limits in pipe performance.
        
         | Matthias247 wrote:
          | Right. I'm actually surprised the test with 256kB transfers
          | gives reasonable results, and would rather have tested with >
          | 1GB instead. For such a small transfer it seemed likely that
          | the overhead of spawning the process and loading libraries by
          | far dominates the amount of actual work. I'm also surprised
          | this didn't show up in profiles. But it obviously depends on
          | where the measurement's start and end points are.
        
           | azornathogron wrote:
           | Perhaps I've misunderstood what you're referring to, but the
           | test in the article is measuring speed transferring 10 GiB.
           | 256 KiB is just the buffer size.
        
             | Matthias247 wrote:
             | The first C program in the blog post allocates a 256kB
             | buffer and writes that one exactly once to stdout. I don't
             | see another loop which writes it multiple times.
        
               | azornathogron wrote:
               | There's an outer while(true){} loop - the write side just
               | writes continuously.
               | 
               | More generally though, sidenote 5 says that the code in
               | the article itself is incomplete and the real test code
               | is available in the github repo:
               | https://github.com/bitonic/pipes-speed-test
        
       | BeeOnRope wrote:
       | This is a well-written article with excellent explanations and I
       | thoroughly enjoyed it.
       | 
       | However, none of the variants using vmsplice (i.e., all but the
       | slowest) are safe. When you gift [1] pages to the kernel there is
       | no reliable general purpose way to know when the pages are safe
       | to reuse again.
       | 
       | This post (and the earlier FizzBuzz variant) try to get around
       | this by assuming the pages are available again after "pipe size"
       | bytes have been written after the gift, _but this is not true in
       | general_. For example, the read side may also use splice-like
       | calls to move the pages to another pipe or IO queue in zero-copy
       | way so the lifetime of the page can extend beyond the original
       | pipe.
       | 
        | This will show up as race conditions and spontaneously changing
        | data, where a downstream consumer sees the page suddenly change
        | as it is overwritten by the original process.
       | 
       | The author of these splice methods, Jens Axboe, had proposed a
       | mechanism which enabled you to determine when it was safe to
       | reuse the page, but as far as I know nothing was ever merged. So
       | the scenarios where you can use this are limited to those where
       | you control both ends of the pipe and can be sure of the exact
       | page lifetime.
       | 
       | ---
       | 
       | [1] Specifically, using SPLICE_F_GIFT.
        
         | haberman wrote:
         | What if the writer frees the memory entirely? Can you segv the
         | reader? That would be quite a dangerous pattern.
        
         | rostayob wrote:
         | (I am the author of the post)
         | 
         | I haven't digested this comment fully yet, but just to be
         | clear, I am _not_ using SPLICE_F_GIFT (and I don't think the
         | fizzbuzz program is either). However I think what you're saying
         | makes sense in general, SPLICE_F_GIFT or not.
         | 
         | Are you sure this unsafety depends on SPLICE_F_GIFT?
         | 
         | Also, do you have a reference to the discussions regarding this
         | (presumably on LKML)?
        
           | rostayob wrote:
           | Actually, from re-reading the man page for vmsplice, it seems
           | like it _should_ depend on SPLICE_F_GIFT (or in other words,
           | it should be safe without it).
           | 
           | But from what I know about how vmsplice is implemented,
           | gifting or not, it sounds like it should be unsafe anyhow.
        
           | DerSaidin wrote:
           | Hello
           | 
           | https://mazzo.li/posts/fast-pipes.html#what-are-pipes-
           | made-o...
           | 
           | I think the diagram near the start of this section has "head"
           | and "tail" swapped.
           | 
           | Edit: Nevermind, I didn't read far enough.
        
           | BeeOnRope wrote:
           | Yeah my mention of gift was a red herring: I had assumed gift
           | was being used but the same general problem (the "page
           | garbage collection issue") crops up regardless.
           | 
           | If you don't use gift, you never know when the pages are free
           | to use again, so in principle you need to keep writing to new
           | buffers indefinitely. One "solution" to this problem is to
           | gift the pages, in which case the kernel does the GC for you,
           | but you need to churn through new pages constantly because
           | you've gifted the old ones. Gift is especially useful when
           | the page gifted can be used directly in the page cache (i.e.,
           | writing a file, not a pipe).
           | 
           | Without gift some consumption patterns may be safe but I
           | think they are exactly those which involve a copy (not using
           | gift means that a copy will occur for additional read-side
           | scenarios). Ultimately the problem is that if some downstream
           | process is able to get a zero-copy view of a page from an
            | upstream writer, how can this be safe against concurrent
            | modification? The pipe size trick is one way it could work,
            | but it doesn't pan out because the pages may live beyond the
            | immediate pipe (this is actually alluded to in the FizzBuzz
           | article where they mentioned things blew up if more than one
           | pipe was involved).
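To make the hazard concrete, here is a minimal sketch of the writer side in C (the helper name `vmsplice_all` is made up for illustration; no SPLICE_F_GIFT is used). After the call returns, the pipe still references these user pages, so reusing the buffer before every downstream consumer is done with it is exactly the race described above:

```c
// Sketch of a vmsplice(2) writer, without SPLICE_F_GIFT. After this
// returns, the pipe still references the caller's pages: reusing `buf`
// before the reader is done is the race discussed in this thread. The
// "pipe size" trick tries to bound that window, but pages can outlive
// the immediate pipe if the reader re-splices them elsewhere.
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

// Splice `len` bytes from `buf` into pipe `fd`, handling short writes.
static int vmsplice_all(int fd, char *buf, size_t len) {
    while (len > 0) {
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        ssize_t n = vmsplice(fd, &iov, 1, 0);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```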
        
             | rostayob wrote:
             | Yes, this all makes sense, although like everything
              | splicing-related, it is very subtle. Maybe I should have
              | mentioned the subtlety and danger of splicing at the
              | beginning, rather than at the end.
             | 
              | I still think the man page of vmsplice is quite misleading!
              | Specifically:
              | 
              |     SPLICE_F_GIFT
              |         The user pages are a gift to the kernel. The
              |         application may not modify this memory ever,
              |         otherwise the page cache and on-disk data may
              |         differ. Gifting pages to the kernel means that a
              |         subsequent splice(2) SPLICE_F_MOVE can successfully
              |         move the pages; if this flag is not specified, then
              |         a subsequent splice(2) SPLICE_F_MOVE must copy the
              |         pages. Data must also be properly page aligned,
              |         both in memory and length.
             | 
             | To me, this indicates that if we're _not_ using
             | SPLICE_F_GIFT downstream splices will be automatically
             | taken care of, safety-wise.
        
               | scottlamb wrote:
               | Hmm, reading this side-by-side with a paragraph from
               | BeeOnRope's comment:
               | 
               | > This post (and the earlier FizzBuzz variant) try to get
               | around this by assuming the pages are available again
               | after "pipe size" bytes have been written after the gift,
               | _but this is not true in general_. For example, the read
               | side may also use splice-like calls to move the pages to
               | another pipe or IO queue in zero-copy way so the lifetime
               | of the page can extend beyond the original pipe.
               | 
               | The paragraph you quoted says that the "splice-like calls
               | to move the pages" actually copy when SPLICE_F_GIFT is
               | not specified. So perhaps the combination of not using
               | SPLICE_F_GIFT and waiting until "pipe size" bytes have
               | been written is safe.
        
               | BeeOnRope wrote:
                | Yes, it's not clear to me when the copy actually
                | happens, but I had assumed the > 30 GB/s result (after
                | read was changed to use splice) must imply zero copy.
        
               | rostayob wrote:
                | It could be that when splicing to /dev/null (which I'm
                | doing), the kernel knows that their content is never
                | witnessed, and therefore no copy is required. But I
                | haven't verified that.
        
               | scottlamb wrote:
               | Makes sense. If so, some of the nice benchmark numbers
               | for vmsplice would go away in a real scenario, so that'd
               | be nice to know.
        
               | BeeOnRope wrote:
               | Splicing seems to work well for the "middle" part of a
               | chain of piped processes, e.g., how pv works: it can
               | splice pages from one pipe to another w/o needing to
               | worry about reusing the page since someone upstream
               | already wrote the page.
               | 
               | Similarly for splicing from a pipe to a file or something
               | like that. It's really the end(s) of the chain that want
               | to (a) generate the data in memory or (b) read the data
               | in memory that seem to create the problem.
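The safe "middle of the chain" pattern described here can be sketched in a few lines of C (the helper name is made up for illustration; this is roughly what pv's splice path does): the pages are never mapped or copied into this process, so there is no reuse problem, because someone upstream owns them.

```c
// Move data between two pipes with splice(2), zero-copy: the "middle"
// process never touches the pages in userspace, so it has no page
// lifetime problem of its own.
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

// Returns total bytes moved from in_fd to out_fd, or -1 on error.
// Stops when in_fd reaches EOF (its write end has been closed).
static long splice_pipe_to_pipe(int in_fd, int out_fd) {
    long total = 0;
    for (;;) {
        ssize_t n = splice(in_fd, NULL, out_fd, NULL, 1 << 16, SPLICE_F_MOVE);
        if (n == 0)
            break;          // writer closed its end of in_fd
        if (n < 0)
            return -1;
        total += n;
    }
    return total;
}
```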
        
           | scottlamb wrote:
           | I think you're right that the same problem applies without
           | SPLICE_F_GIFT. One of the other fizzbuzz code golfers
           | discusses that here:
           | https://codegolf.stackexchange.com/a/239848
           | 
           | I wonder if io_uring handles this (yet). io_uring is a newer
           | async IO mechanism by the same author which tells you when
           | your IOs have completed. So you might think it would:
           | 
           | * But from a quick look, I think its vmsplice equivalent
           | operation just tells you when the syscall would have
           | returned, so maybe not. [edit: actually, looks like there's
           | not even an IORING_OP_VMSPLICE operation in the latest
           | mainline tree yet, just drafts on lkml. Maybe if/when the
           | vmsplice op is added, it will wait to return for the right
           | time.]
           | 
           | * And in this case (no other syscalls or work to perform
           | while waiting) I don't see any advantage in io_uring's
           | read/write operations over just plain synchronous read/write.
        
             | Matthias247 wrote:
             | uring only really applies for async IO - and would tell you
             | when an otherwise blocking syscall would have finished.
             | Since the benchmark here uses blocking calls, there
             | shouldn't be any change in behavior. The lifetime of the
             | buffer is an orthogonal concern to the lifetime of the
             | operation. Even if the kernel knows when the operation is
             | done inside the kernel it wouldn't have a way to know
             | whether the consuming application is done with it.
        
               | scottlamb wrote:
               | > uring only really applies for async IO - and would tell
               | you when an otherwise blocking syscall would have
               | finished. Since the benchmark here uses blocking calls,
               | there shouldn't be any change in behavior. The lifetime
               | of the buffer is an orthogonal concern to the lifetime of
               | the operation. Even if the kernel knows when the
               | operation is done inside the kernel it wouldn't have a
               | way to know whether the consuming application is done
               | with it.
               | 
               | That doesn't match what I've read. E.g.
               | https://lwn.net/Articles/810414/ opens with "At its core,
               | io_uring is a mechanism for performing asynchronous I/O,
               | but it has been steadily growing beyond that use case and
               | adding new capabilities."
               | 
               | More precisely:
               | 
               | * While most/all ops are async IO now, is there any
               | reason to believe folks won't want to extend it to batch
               | basically any hot-path non-vDSO syscall? As I said,
               | batching doesn't help here, but it does in a lot of other
               | scenarios.
               | 
               | * Several IORING_OP_s seem to be growing capabilities
               | that aren't matched by like-named syscalls. E.g. IO
               | without file descriptors, registered buffers, automatic
               | buffer selection, multishot, and (as of a month ago)
               | "ring mapped supplied buffers". Beyond the individual
               | operation level, support for chains. Why not a mechanism
               | that signals completion when the buffer passed to
               | vmsplice is available for reuse? (Maybe by essentially
                | delaying the vmsplice syscall's return [1], maybe by a
               | second command, maybe by some extra completion event from
               | the same command, details TBD.)
               | 
               | [1] edit: although I guess that's not ideal. The reader
               | side could move the page and want to examine following
               | bytes, but those won't get written until the writer sees
               | the vmsplice return and issues further writes.
        
               | BeeOnRope wrote:
               | Yeah this.
               | 
               | The vanilla io_uring fits "naturally" in an async model,
                | but batching and some of the other capabilities it
                | provides are definitely useful for stuff written to a
               | synchronous model too.
               | 
               | Additionally, io_uring can avoid syscalls sometimes even
               | without any explicit batching by the application, because
               | it can poll the submission queue (root only, last time I
               | checked unfortunately): so with the right setup a series
               | of "synchronous" ops via io_uring (i.e., submit &
               | immediately wait for the response) could happen with < 1
               | user-kernel transition per op, because the kernel is busy
               | servicing ops directly from the incoming queue and the
               | application gets the response during its polling phase
               | before it waits.
        
             | yxhuvud wrote:
              | Perhaps it could be sort of simulated in uring using the
              | splice op against a memfd that has been mmapped in
              | advance? I wonder how fast that could be and how it would
              | compare safety-wise.
        
             | BeeOnRope wrote:
             | I don't know if io_uring provides a mechanism to solve this
             | page ownership thing but I bet Jens does: I've asked [1].
             | 
             | ---
             | 
             | [1]
             | https://twitter.com/trav_downs/status/1532491167077572608
        
         | robocat wrote:
         | > However, none of the variants using vmsplice (i.e., all but
         | the slowest) are safe. When you gift [1] pages to the kernel
         | there is no reliable general purpose way to know when the pages
         | are safe to reuse again. [snip] This will show up as race
         | conditions and spontaneously changing data where a downstream
          | consumer sees the page suddenly change as it is overwritten by
         | the original process.
         | 
         | That sounds like a security issue - the ability of an upstream
         | generator process to write into the memory of a downstream
          | reader process, or, even more perverse, vice versa. I
         | presume that the Linux kernel only lets this happen (zero copy)
         | when the two processes are running as the same user?
        
           | hamandcheese wrote:
           | It's not clear to me that the kernel allows the receiving
           | process to write instead of just read.
           | 
           | But also, if you are sending data, why would you later
           | read/process that send buffer?
           | 
           | The only attack vector I could imagine would be if one sender
           | was splicing the same memory to two or more receivers. A
           | malicious receiver with write access to the spliced memory
           | could compromise other readers.
        
       | nice2meetu wrote:
       | I once had to change my mental model for how fast some of these
       | things were. I was using `seq` as an input for something else,
       | and my thinking was along the lines that it is a small generator
       | program running hot in the cpu and would be super quick.
       | Specifically because it would only be writing things out to
       | memory for the next program to consume, not reading anything in.
       | 
       | But that was way off and `seq` turned out to be ridiculously
       | slow. I dug down a little and made a faster version of `seq`,
        | which kind of got me what I wanted. But then I noticed at the
        | end that the point was moot, because just piping it to the next
        | program over the command line was going to be the slow point
        | anyway.
       | 
       | https://github.com/tverniquet/hseq
        
         | freedomben wrote:
         | I had a somewhat similar discovery once using GNU parallel. I
         | was trying to generate as much web traffic as possible from a
         | single machine to load test a service I was building, and I
          | assumed that the network I/O would be the bottleneck by a long
         | shot, not the overhead of spawning many processes. I was
         | disappointed by the amount of traffic generated, so I rewrote
         | it in Ruby using the parallel gem with threads (instead of
         | processes), and got orders of magnitude more performance.
        
           | strictfp wrote:
           | Node is great for this usecase
        
       | Klasiaster wrote:
       | Netmap offers zero-copy pipes (included in FreeBSD, on Linux it's
       | a third party module):
       | https://www.freebsd.org/cgi/man.cgi?query=netmap&sektion=4
        
       | v3gas wrote:
       | Love the subtle stonks background in the first image.
        
         | [deleted]
        
       | alex_hirner wrote:
       | Does an API similar to vmsplice exist for Windows?
        
       | herodoturtle wrote:
       | This was a long but highly insightful read!
       | 
       | (And as an aside, the combination of that font with the hand-
       | drawn diagrams is really cool)
        
       | arkitaip wrote:
       | The visual design is amazing.
        
       | anotherhue wrote:
       | pv is written in perl so isn't the snappiest, I'm surprised to
       | see it score so highly. I wonder what the initial speed would
       | have been if it just wrote to /dev/null
        
         | merpkz wrote:
         | Confused with parallel, maybe?
        
         | rostayob wrote:
         | It's not written in perl, it's written in C, and it uses
         | splice() (one of the syscalls discussed in the post).
        
           | anotherhue wrote:
           | I was totally wrong. Thank you for showing me the facts.
        
           | karamanolev wrote:
           | Definitely C, per what appears to be the official repo
           | (linking the splice syscall) - https://github.com/icetee/pv/b
           | lob/master/src/pv/transfer.c#L...
        
       | effnorwood wrote:
        
       | [deleted]
        
       | mg wrote:
       | For some reason, this raised my curiosity how fast different
       | languages write individual characters to a pipe:
       | 
       | PHP comes in at about 900KiB/s:                   php -r 'while
       | (1) echo 1;' | pv > /dev/null
       | 
       | Python is about 50% faster at about 1.5MiB/s:
       | python3 -c 'while (1): print (1, end="")' | pv > /dev/null
       | 
       | Javascript is slowest at around 200KiB/s:                   node
       | -e 'while (1) process.stdout.write("1");' | pv > /dev/null
       | 
       | What's also interesting is that node crashes after about a
       | minute:                   FATAL ERROR: Ineffective mark-compacts
       | near heap limit Allocation failed -         JavaScript heap out
       | of memory
       | 
       | All results from within a Debian 10 docker container with the
       | default repo versions of PHP, Python and Node.
       | 
       | Update:
       | 
       | Checking with strace shows that Python caches the output:
       | strace python3 -c 'while (1): print (1, end="")' | pv > /dev/null
       | 
       | Outputs a series of:                   write(1,
       | "11111111111111111111111111111111"..., 8193) = 8193
       | 
       | PHP and JS do not.
       | 
       | So the Python equivalent would be:                   python3 -c
       | 'while (1): print (1, end="", flush=True)' | pv > /dev/null
       | 
        | Which makes it comparable to the speed of JS.
        | 
        | Interesting that PHP is over 4x faster than Python and JS.
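The block buffering observed with strace can be reproduced off to the side with Python's io layer (a sketch; `CountingRaw` is a made-up class for illustration, standing in for the layer that issues write(2)):

```python
import io

class CountingRaw(io.RawIOBase):
    """Raw sink that counts how many writes reach the 'syscall' layer."""
    def __init__(self):
        self.calls = 0
        self.total = 0

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.total += len(b)
        return len(b)

def buffered_writes(n, buffer_size=8192):
    """Write n single bytes through a block buffer; return (raw call count, bytes)."""
    raw = CountingRaw()
    buf = io.BufferedWriter(raw, buffer_size=buffer_size)
    for _ in range(n):
        buf.write(b"1")
    buf.flush()
    return raw.calls, raw.total
```

With an 8 KiB buffer, thousands of one-byte writes collapse into a handful of large raw writes, which matches the big `write(1, "1111...", ...)` calls strace shows for the unflushed Python loop.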
        
         | cestith wrote:
         | I'm on a 2015 MB Air with two browsers running, probably a
         | dozen tabs between them, three tabs in iTerm2, Outlook, Word,
         | and Teams running.
         | 
          | Perl 5.18.0 gives me 3.5 MiB per second. Perl 5.28.3, 5.30.3,
          | and 5.34.0 give 4 MiB per second.                   perl5.34.0
          | -e 'while (){ print 1 }' | pv > /dev/null
         | 
         | For Python 3.10.4, I get about 2.8 MiB/s as you have it
         | written, but around 5 MiB/s (same for 3.9 but only 4 MiB/s for
         | 3.8) with this. I also get 4.8 MiB/s with 2.7:
         | python3 -c 'while (1): print (1)' | pv > /dev/null
         | 
         | If I make Perl behave like yes and print a character and a
         | newline, it has a jump of its own. The following gives me 37.3
         | MiB per second.                   perl5.34.0 -e 'while (){
         | print "1\n" }' | pv > /dev/null
         | 
         | Interestingly, using Perl's say function (which is like a
         | Println) slows it down significantly. This version is only 7.3
         | MiB/s.                   perl5.34.0 -E 'while (1) {say 1}' | pv
         | > /dev/null
         | 
         | Go 1.18 has 940 KiB/s with fmt.Print and 1.5 MiB/s with
          | fmt.Println for some comparison.
          | 
          |     package main
          | 
          |     import "fmt"
          | 
          |     func main() {
          |         for {
          |             fmt.Println("1")
          |         }
          |     }
         | 
         | These are all macports builds.
        
         | mscdex wrote:
         | Potential buffering issues aside, as others have pointed out
         | the node.js example is performing asynchronous writes, unlike
         | the other languages' examples (as far as I know).
         | 
         | To do a proper synchronous write, you'd do something like:
         | node -e 'const { writeSync } = require("fs"); while (1)
         | writeSync(1, "1");' | pv > /dev/null
         | 
         | That gets me ~1.1MB/s with node v18.1.0 and kernel 5.4.0.
        
         | themulticaster wrote:
         | If you ever need to write a random character to a pipe very
         | fast, GNU coreutils has you covered with yes(1). It runs at
         | about 6 GiB/s on my system:                 yes | pv >
         | /dev/null
         | 
         | There's an article floating around [1] about how yes(1) is
          | extremely optimized considering its original purpose. In case
          | you're wondering, yes(1) is meant for commands that
         | (repeatedly) ask whether to proceed, expecting a y/n input or
         | something like that. Instead of repeatedly typing "y", you just
         | run "yes | the_command".
         | 
         | Not sure about how yes(1) compares to the techniques presented
         | in the linked post. Perhaps there's still room for improvement.
         | 
         | [1] Previous HN discussion:
         | https://news.ycombinator.com/item?id=14542938
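The core of the optimization discussed in that thread is easy to sketch: build a large buffer full of the repeated string once, then issue big writes instead of one write(2) per line. A bounded sketch in C (helper names are made up for illustration; the real yes(1) loops forever):

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

// Fill buf with as many whole copies of pat as fit; return bytes used.
static size_t fill_pattern(char *buf, size_t buf_size, const char *pat) {
    size_t pat_len = strlen(pat);
    size_t used = 0;
    while (used + pat_len <= buf_size) {
        memcpy(buf + used, pat, pat_len);
        used += pat_len;
    }
    return used;
}

// Write `count` repetitions of pat to fd in large chunks
// (bounded here for illustration; yes(1) never stops).
static int write_repeated(int fd, const char *pat, size_t count) {
    static char buf[128 * 1024];
    size_t pat_len = strlen(pat);
    size_t per_buf = fill_pattern(buf, sizeof buf, pat) / pat_len;
    while (count > 0) {
        size_t n = count < per_buf ? count : per_buf;
        size_t len = n * pat_len, off = 0;
        while (off < len) {
            ssize_t w = write(fd, buf + off, len - off);
            if (w < 0)
                return -1;
            off += (size_t)w;
        }
        count -= n;
    }
    return 0;
}
```

The speedup comes entirely from amortizing syscall overhead: one write(2) moves up to 128 KiB instead of 2 bytes.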
        
           | gitgud wrote:
           | > _It runs at about 6 GiB /s on my system..._
           | 
           | Honest question: what are the practical use cases of this?
           | 
           | Repeatedly typing the 'y' character into a Linux pipe is
           | surely not that common, especially at that bit rate. Also
           | seems like the bottleneck would always be the consuming
           | program...
        
             | travisgriggs wrote:
             | > Honest question: what are the practical use cases of
             | this?
             | 
              | It also allows you to script otherwise interactive command
              | line operations with the correct answer. Many command line
              | tools nowadays provide specific options to override
              | queries, but there are still a couple of holdouts which
              | might not.
        
             | jolmg wrote:
             | > especially at that bit rate. Also seems like the
             | bottleneck would always be the consuming program...
             | 
              | It's not _made_ to be fast; it's just fast _by nature_,
              | because there's no other computation it needs to do than
              | to just output the string.
        
             | singron wrote:
             | Yes can repeat any string, not just "y". It can be useful
             | for basic load generation.
        
               | jolmg wrote:
               | I've used it to test some db behavior with `yes 'insert
               | ...;' | mysql ...`. Fastest insertions I could think of.
        
             | TacticalCoder wrote:
             | > Repeatedly typing the 'y' character into a Linux pipe is
             | surely not that common, especially at that bit rate.
             | 
              | At that rate no, but I definitely use it once in a while.
              | For example if I copy quite a few files and then get
              | repeatedly asked if I want to overwrite the destination
              | (when it's already present). Sure, I could get my command
              | back and use the proper flag to "cp" or whatever to
              | overwrite, but it's usually much quicker to just get back
              | the previous line, go to the beginning (C-a), then type
              | "yes | " and be done with it.
             | 
             | Note that you can pass a parameter to "yes" and then it
             | repeats what you passed instead of 'y'.
        
             | linsomniac wrote:
             | Historically, you could have dirty filesystems after a
             | reboot that "fsck" would ask an absurd number of questions
             | about ("blah blah blah inode 1234567890 fix? (y/n)").
             | Unless you were in a very specific circumstance, you'd
             | probably just answer "y" to them. It could easily ask
             | thousands of questions though. So: "yes | fsck" was not
             | uncommon.
        
               | jolmg wrote:
               | > Historically
               | 
               | It's probably still common in installation scripts, like
               | in Dockerfiles. `apt-get install` has the `-y` option,
               | but it would be useful for all other programs that don't.
        
           | dpflug wrote:
           | Faster still is                 pv < /dev/zero > /dev/null
        
             | BenjiWiebe wrote:
             | Yes but you don't have control of which character is
             | written (only NULLs).
             | 
             | yes lets you specify which character to output. 'yes n' for
             | example to output n.
        
               | rocqua wrote:
               | Yes doesn't just let you choose a character. It lets you
               | choose a string that will be repeated. So
               | yes 123abc
               | 
               | will print
               | 123abc123abc123abc123abc123abc
               | 
               | and so on.
        
               | jolmg wrote:
               | each time terminated by a newline, so:
               | 123abc       123abc       123abc       ...
        
         | megous wrote:
         | "Javascript" is slowest probably because node pushes the writes
         | to a thread instead of printing directly from the main process
         | like PHP.
         | 
          | Python cheats, and it's still slow as heck even while cheating
          | (it buffers the output in 8192-byte chunks instead of issuing
          | 1-byte writes).
         | 
         | write(1, "1", 1) loop in C pushes 6.38MiB/s on my PC. :)
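A bounded sketch of that loop (the real thing just loops forever on stdout; the helper name is made up for illustration). It issues one syscall per byte, which is why this pattern tops out in the single-digit MiB/s range:

```c
#include <assert.h>
#include <unistd.h>

// Issue n one-byte write(2) calls -- one user/kernel transition per
// byte, so syscall overhead dominates throughput.
static long write_ones(int fd, long n) {
    long done = 0;
    while (done < n && write(fd, "1", 1) == 1)
        done++;
    return done;
}
```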
        
           | cout wrote:
           | Why is it cheating to use a buffer? This is the behavior you
           | would get in C if you used the C standard library
           | (putc/fputc) instead of a system call (write).
        
         | soheil wrote:
         | You're testing a very specific operation, a loop, in each
         | language to determine its speed, not sure if I'd generalize
         | that. I wonder what it'd look like if you replaced the loop
         | with static print statements that were 1000s of characters long
         | with line breaks, the sort of things that compiler
         | optimizations do.
        
         | dpflug wrote:
         | I was getting different results depending on when I run it.
         | Took me a second to realize it was my processor frequency
         | scaling.
        
         | klohto wrote:
          | Python pushes 15MiB/s on my M1 Pro if you go down a level and
         | sys directly.                  python3 -c 'import sys
         | while (1): sys.stdout.write("1")'| pv>/dev/null
        
           | mg wrote:
           | That caches though. You can see it when you strace it.
        
             | klohto wrote:
             | Good point, but so does a standard print call. Calling
             | flush() after each write does bring the perf to 1.5MiB
        
             | rovr138 wrote:
             | python3 -u -c 'import sys           while (1):
             | sys.stdout.write("1")'| pv>/dev/null
             | 
             | 427KiB/s                   python3 -c 'import sys
             | while (1): sys.stdout.write("1")'| pv>/dev/null
             | 
             | 6.08MiB/s
             | 
             | Using python 3.9.7 on macOS Monterey.
        
         | capableweb wrote:
         | > Javascript is slowest at around 200KiB/s:
         | 
         | I get around 1.56MiB/s with that code. PHP gets 4.04MiB/s.
         | Python gets 4.35MiB/s.
         | 
         | > What's also interesting is that node crashes after about a
         | minute
         | 
         | I believe this is because `while(1)` runs so fast that there is
         | no "idle" time for V8 to actually run GC. V8 is a strange
         | beast, and this is just a guess of mine.
         | 
         | The following code shouldn't crash, give it a try:
         | node -e 'function write() {process.stdout.write("1");
         | process.nextTick(write)} write()' | pv > /dev/null
         | 
         | It's slower for me though, giving me 1.18MiB/s.
         | 
         | More examples with Babashka and Clojure:                   bb
         | -e "(while true (print \"1\"))" | pv > /dev/null
         | 
         | 513KiB/s                   clj -e "(while true (print \"1\"))"
         | | pv > /dev/null
         | 
         | 3.02MiB/s                   clj -e "(require '[clojure.java.io
         | :refer [copy]]) (while true (copy \"1\" *out*))" | pv >
         | /dev/null
         | 
         | 3.53MiB/s                   clj -e "(while true (.println
         | System/out \"1\"))" | pv > /dev/null
         | 
         | 5.06MiB/s
         | 
         | Versions: PHP 8.1.6, Python 3.10.4, NodeJS v18.3.0, Babashka
         | v0.8.1, Clojure 1.11.1.1105
        
           | marginalia_nu wrote:
           | > I believe this is because `while(1)` runs so fast that
           | there is no "idle" time for V8 to actually run GC. V8 is a
           | strange beast, and this is just a guess of mine.
           | 
            | Java has (had) weird idiosyncrasies like this as well.
            | Well, it doesn't crash, but depending on the construct you
            | can get performance degradations depending on how the
            | language inserts safepoints (where the VM is at a knowable
            | state and a thread can be safely paused for GC or whatever).
           | 
           | I don't know if this holds today, but I know there was a time
           | where you basically wanted to avoid looping over long-type
           | variables, as they had different semantics. The details are a
           | bit fuzzy to me right now.
        
           | wolfgang42 wrote:
           | _> > What's also interesting is that node crashes after about
           | a minute_
           | 
           |  _> I believe this is because `while(1)` runs so fast that
           | there is no  "idle" time for V8 to actually run GC. V8 is a
           | strange beast, and this is just a guess of mine._
           | 
           | Not exactly: the GC is still running; it's _live_ memory
           | that's growing unbounded.
           | 
           | What's going on here is that WritableStream is non-blocking;
           | it has _advisory_ backpressure, but if you ignore that it
           | will do its best to accept writes anyway and keep them in a
           | buffer until it can actually write them out. Since you're not
           | giving it any breathing room, that buffer just keeps growing
           | until there's no more memory left. `process.nextTick()` is
           | presumably slowing things down enough on your system to give
           | it a chance to drain the buffer. (I see there's some
           | discussion below about this changing by version; I'd guess
           | that's an artifact of other optimizations and such.)
           | 
           | To do this properly, you need to listen to the return value
           | from `.write()` and, if it returns false, back off until the
           | stream drains and there's room in the buffer again.
           | 
           | Here's the (not particularly optimized) function I use to do
            | that:
            | 
            |     async function writestream(chunks, stream) {
            |         for await (const chunk of chunks) {
            |             if (!stream.write(chunk)) {
            |                 // When write returns false, the stream is
            |                 // starting to buffer and we need to wait for
            |                 // it to drain (otherwise we'll run out of
            |                 // memory!)
            |                 await new Promise(resolve =>
            |                     stream.once('drain', () => resolve()))
            |             }
            |         }
            |     }
           | 
           | I _do_ wish Node made it more obvious what was going on in
           | this situation; this is a very common mistake with streams
           | and it's easy to not notice until things suddenly go very
           | wrong.
           | 
           | ETA: I should probably note that transform streams,
           | `readable.pipe()`, `stream.pipeline()`, and the like all
           | handle this stuff automatically. Here's a one-liner, though
           | it's not especially fast:                 node -e 'const
           | {Readable} = require("stream");
           | Readable.from(function*(){while(1) yield
           | "1"}()).pipe(process.stdout)' | pv > /dev/null
        
             | Matthias247 wrote:
              | Are there still no async write functions which handle this
              | more easily than the old event-based mechanism? Waiting
              | for drain also sounds like it might reduce throughput,
              | since then there is 0 buffered data and the peer would be
              | forced to pause reading. A "writable" event sounds more
              | appropriate - but the node docs don't mention one.
        
           | mg wrote:
           | Your node version indeed did not crash. Tried for 2 minutes.
           | 
           | But using a longer string crashed after 23s here:
           | 
           |     node -e 'function write() {
           |       process.stdout.write("1111111111222222222233333333334444444444555555555566666666667777777777888888888899999999990000000000");
           |       process.nextTick(write)
           |     } write()' | pv > /dev/null
        
             | capableweb wrote:
             | Hm, strange. With the same out of memory error as before or
             | a different one? Tried running that one for 2 minutes, no
             | errors here, and memory stays constant.
             | 
             | Also, what NodeJS version are you on?
        
               | mg wrote:
               | Yes, same error as before. Memory usage stays the same
               | for a while, then starts to skyrocket shortly before it
               | crashes.
               | 
               | node is v10.24.0. (Default from the Debian 10 repo)
        
               | capableweb wrote:
                | Huh yeah, seems to be an old memory leak. Running it
                | on v10.24.0 crashes for me too.
               | 
               | After some quick testing in a couple of versions, it
               | seems like it got fixed in v11 at least (didn't test any
               | minor/patch versions).
               | 
                | By the way, all versions up to NodeJS 12 (LTS) are
                | "end of life" and should probably not be used if
                | you're downloading 3rd party dependencies, as there
                | are a bunch of security fixes since then that are not
                | being backported.
        
               | captn3m0 wrote:
               | I used this exact issue today while pointing out how
               | Debian support dates can be misleading as packages
               | themselves aren't always getting fixes:
                | https://github.com/endoflife-date/endoflife.date/issues/763#...
        
         | MaxBarraclough wrote:
         | Perhaps different approaches to caching?
         | 
         | I'm reminded of this StackOverflow question, _Why is reading
         | lines from stdin much slower in C++ than Python?_
         | 
         | https://stackoverflow.com/q/9371238/
        
         | xthrowawayxx wrote:
          | I find that NodeJS eventually runs out of memory and
          | crashes with applications that do a large amount of data
          | processing over a long time with few breaks, even if there
          | are no memory leaks.
         | 
         | Edit: I've found this consistently building multiple data
         | processing applications over multiple years and multiple
         | companies
        
         | rascul wrote:
         | I did the same test, but added a rust and bash version. My
         | results:
         | 
         | Rust: 21.9MiB/s
         | 
         | Bash: 282KiB/s
         | 
         | PHP: 2.35MiB/s
         | 
         | Python: 2.30MiB/s
         | 
         | Node: 943KiB/s
         | 
         | In my case, node did not crash after about two minutes. I find
         | it interesting that PHP and Python are comparable for me but
         | not you, but I'm sure there's a plethora of reasons to explain
         | that. I'm not surprised rust is vastly faster and bash vastly
         | slower, I just thought it interesting to compare since I use
         | those languages a lot.
         | 
          | Rust:
          | 
          |     fn main() {
          |         loop {
          |             print!("1");
          |         }
          |     }
          | 
          | Bash (no discernible difference between echo and printf):
          | 
          |     while :; do printf "1"; done | pv > /dev/null
        
           | anon946 wrote:
           | For languages like C, C++, and Rust, the bottleneck is going
           | to mainly be system calls. With a big buffer, on an old
           | machine, I get about 1.5 GiB/s with C++. Writing 1 char at a
            | time, I get less than 1 MiB/s.
            | 
            |     $ ./a.out 1000000 2000 | cat >/dev/null
            |     buffer size: 1000000, num syscalls: 2000, perf:1578.779593 MiB/s
            |     $ ./a.out 1 2000000 | cat >/dev/null
            |     buffer size: 1, num syscalls: 2000000, perf:0.832587 MiB/s
           | 
            | Code is:
            | 
            |     #include <cstddef>
            |     #include <random>
            |     #include <chrono>
            |     #include <cassert>
            |     #include <array>
            |     #include <cstdio>
            |     #include <unistd.h>
            |     #include <cstring>
            |     #include <cstdlib>
            | 
            |     int main(int argc, char **argv) {
            |         int rv;
            |         assert(argc == 3);
            |         const unsigned int n = std::atoi(argv[1]);
            |         char *buf = new char[n];
            |         std::memset(buf, '1', n);
            |         const unsigned int k = std::atoi(argv[2]);
            |         auto start = std::chrono::high_resolution_clock::now();
            |         for (size_t i = 0; i < k; i++) {
            |             rv = write(1, buf, n);
            |             assert(rv == int(n));
            |         }
            |         auto stop = std::chrono::high_resolution_clock::now();
            |         auto duration = stop - start;
            |         std::chrono::duration<double> secs = duration;
            |         std::fprintf(stderr, "buffer size: %d, num syscalls: %d, perf:%f MiB/s\n",
            |                      n, k, (double(n)*k)/(1024*1024)/secs.count());
            |     }
           | 
           | EDIT: Also note that a big write to a pipe (bigger than
           | PIPE_BUF) may require multiple syscalls on the read side.
           | 
           | EDIT 2: Also, it appears that the kernel is smart enough to
           | not copy anything when it's clear that there is no need. When
           | I don't go through cat, I get rates that are well above
           | memory bandwidth, implying that it's not doing any actual
            | work:
            | 
            |     $ ./a.out 1000000 1000 >/dev/null
            |     buffer size: 1000000, num syscalls: 1000, perf:1827368.373827 MiB/s
        
             | mortehu wrote:
             | There's no special "no work" detection needed. a.out is
             | calling the write function for the null device, which just
             | returns without doing anything. No pipes are involved.
        
           | hderms wrote:
           | with Rust you could also avoid using a lock on STDOUT and get
           | it even faster!
        
             | skitter wrote:
              | Tested it; it seems to about double the speed (from
              | 22.3 MiB/s to 47.6 MiB/s).
        
           | ur-whale wrote:
           | for the bash case, the cost of forking to write two chars is
           | overwhelming compared to anything related to I/O.
        
             | mauvehaus wrote:
             | Echo and printf are shell built-ins in bash[0]. Does it
             | have to fork to execute them?
             | 
             | You could probably answer this by replacing printf with
             | /bin/echo and comparing the results. I'm not in front of a
             | Linux box, or I'd try.
             | 
             | [0]
             | https://www.gnu.org/software/bash/manual/html_node/Bash-
             | Buil...
        
               | ur-whale wrote:
               | > Echo and printf are shell built-ins in bash
               | 
               | Ah, yeah, good point, I am wrong.
        
             | megous wrote:
              | There's no forking, and it's writing one character.
        
           | megous wrote:
           | Rust also cheats.
           | 
           | https://megous.com/dl/tmp/1046458b5b450018.png
        
             | cle wrote:
             | Seems like it's buffering output, which Python also does.
             | Python is much slower if you flush every write (I get 2.6
             | MiB/s default, 600 KiB/s with flush=True).
             | 
             | Interestingly, Go is very fast with a 8 KiB buffer (same as
             | Python's), I get 218 MiB/s.
        
           | [deleted]
        
         | cout wrote:
         | What version of node are you using? It seems to run
         | indefinitely on 14.19.3 that comes with Ubuntu 20.04.
        
         | GlitchMr wrote:
          | `process.stdout.write` differs from PHP's `echo` and
          | Python's `print` in that it pushes a write to an event
          | queue without waiting for the result, which can fill the
          | event queue with writes. Instead, consider `await`-ing
          | `write` so that each write completes before the next one is
          | queued.
          | 
          |     node -e '
          |       const stdoutWrite = util.promisify(process.stdout.write).bind(process.stdout);
          |       (async () => {
          |         while (true) {
          |           await stdoutWrite("1");
          |         }
          |       })();
          |     ' | pv > /dev/null
        
         | fasteo wrote:
          | LuaJIT 2.1.0-beta3, using print and io.write.
          | 
          | Using print is about 17 MiB/s:
          | 
          |     luajit -e "while true do print('x') end" | pv > /dev/null
          | 
          | Using io.write is about 111 MiB/s:
          | 
          |     luajit -e "while true do io.write('x') end" | pv > /dev/null
        
         | [deleted]
        
         | rhyn00 wrote:
         | Adding a few results:
         | 
          | Using OP's code for the following:
          | 
          |     php    1.8 MB/sec
          |     python 3.8 MB/sec
          |     node   1.0 MB/sec
          | 
          | Java print, 1.3 MB/sec:
          | 
          |     echo 'class Code {public static void main(String[] args) {while (true){System.out.print("1");}}}' > Code.java
          |     javac Code.java
          |     java Code | pv > /dev/null
          | 
          | Java with buffering, 57.4 MB/sec:
          | 
          |     echo 'import java.io.*;class Code2 {public static void main(String[] args) throws IOException {BufferedWriter log = new BufferedWriter(new OutputStreamWriter(System.out));while(true){log.write("1");}}}' > Code2.java
          |     javac Code2.java
          |     java Code2 | pv > /dev/null
        
           | kuschku wrote:
            | Java can get even much faster:
            | https://gist.github.com/justjanne/12306b797f4faa977436070ec0...
            | 
            | That manages about 7 GiB/s reusing the same buffer, or
            | about 300 MiB/s when clearing and refilling the buffer
            | every time.
            | 
            | (The magic is in using Java's APIs for writing to
            | files/sockets, which are designed for high performance,
            | instead of the APIs designed for writing to stdout.)
        
             | rhyn00 wrote:
             | Nice, that's pretty cool!
        
         | petercooper wrote:
         | I'll tell you what's fun. I get 5MB/sec with Python, 1.3MB/sec
          | with Node and... 12.6MB/sec with Ruby! :-) (Added: Same
          | speed as Node if I use $stdout.sync = true, though.)
        
         | nequo wrote:
         | For me:
         | 
         | Python3: 3 MiB/s
         | 
         | Node: 350 KiB/s
         | 
          | Lua: 12 MiB/s
          | 
          |     lua -e 'while true do io.write("1") end' | pv > /dev/null
          | 
          | Haskell: 5 MiB/s
          | 
          |     loop = do
          |       putStr "1"
          |       loop
          | 
          |     main = loop
          | 
          | Awk: 4.2 MiB/s
          | 
          |     yes | awk '{printf("1")}' | pv > /dev/null
        
           | VWWHFSfQ wrote:
            | Lua is an interesting one.
            | 
            |     while true do
            |       io.write "1"
            |     end
            | 
            | PUC-Rio 5.1: 25 MiB/s
            | 
            | PUC-Rio 5.4: 25 MiB/s
            | 
            | LuaJIT 2.1.0-beta3: 550 MiB/s <--- WOW
            | 
            | They all go slightly faster if you localize the reference
            | to `io.write`:
            | 
            |     local write = io.write
            |     while true do
            |       write "1"
            |     end
        
             | yakubin wrote:
             | _> They all go slightly faster if you localize the
             | reference to `io.write`_
             | 
              | No noticeable difference for LuaJIT, which makes sense,
              | since the JIT should figure it out without help.
        
               | bjoli wrote:
               | And this, folks, is why you have immutable modules. If
               | you know before runtime what something is, lookup is a
               | lot faster.
        
               | VWWHFSfQ wrote:
               | Ah yes you're right. Basically no difference with LuaJIT.
               | 
               | 5.1 and 5.4 show about ~8% improvement.
        
           | dllthomas wrote:
            | Haskell can be even simpler:
            | 
            |     main = putStr (repeat '1')
            | 
            | [Edit: as pointed out below, this is no longer the case!]
            | 
            | Strings are printed one character at a time in Haskell.
            | This choice is justified by the unpredictability of the
            | interaction between laziness and buffering; I am
            | uncertain it's the correct choice, but the proper
            | response is to use Text where performance is relevant.
        
             | nequo wrote:
             | Wow, this does 160 MiB/s. That's a huge improvement! The
             | output of strace looks completely different:
              |     poll([{fd=1, events=POLLOUT}], 1, 0)    = 1 ([{fd=1, revents=POLLOUT}])
              |     write(1, "11111111111111111111111111111111"..., 8192) = 8192
              |     poll([{fd=1, events=POLLOUT}], 1, 0)    = 1 ([{fd=1, revents=POLLOUT}])
              |     write(1, "11111111111111111111111111111111"..., 8192) = 8192
              | 
              | With the recursive code, it buffered the output in the
              | same way but bugged the kernel a whole lot more
              | in-between writes. Not exactly sure what is going on:
              | 
              |     poll([{fd=1, events=POLLOUT}], 1, 0)    = 1 ([{fd=1, revents=POLLOUT}])
              |     write(1, "11111111111111111111111111111111"..., 8192) = 8192
              |     rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
              |     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=920390843}) = 0
              |     rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
              |     rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
              |     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=920666397}) = 0
              |     ...
              |     rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
              |     poll([{fd=1, events=POLLOUT}], 1, 0)    = 1 ([{fd=1, revents=POLLOUT}])
              |     write(1, "11111111111111111111111111111111"..., 8192) = 8192
        
               | dllthomas wrote:
               | I'm honestly surprised either of them wind up buffered!
               | That must be a change since I stopped paying as much
               | attention to GHC.
               | 
                | I'm also not sure what's going on in the second case.
                | IIRC, at some point historically, a sufficiently
                | tight loop could cause trouble with handling SIGINT,
                | so it might be related to some overaggressive
                | workaround for that?
        
         | wazoox wrote:
         | On my extremely old desktop PC (Phenom II 550) running an out-
         | of-date OS (Slackware 14.2):
         | 
          | Bash:
          | 
          |     while :; do printf "1"; done | ./pv > /dev/null
          |     [ 156KiB/s]
          | 
          | Python 3.7.2:
          | 
          |     python3 -c 'while (1): print (1, end="")' | ./pv > /dev/null
          |     [1,02MiB/s]
          | 
          | Perl 5.22.2:
          | 
          |     perl -e 'while (true) {print 1}' | ./pv > /dev/null
          |     [3,03MiB/s]
          | 
          | Node.js v12.22.1:
          | 
          |     node -e 'while (1) process.stdout.write("1");' | ./pv > /dev/null
          |     [ 482KiB/s]
        
         | cle wrote:
         | A major contributing factor is whether or not the language
         | buffers output by default, and how big the buffer is. I don't
          | think NodeJS buffers, whereas Python does. Here are some
          | comparisons with Go (which does not buffer by default):
         | 
         | - Node (no buffering): 1.2 MiB/s
         | 
         | - Go (no buffering): 2.4 MiB/s
         | 
         | - Python (8 KiB buffer): 2.7 MiB/s
         | 
         | - Go (8 KiB buffer): 218 MiB/s
         | 
          | Go program:
          | 
          |     f := bufio.NewWriterSize(os.Stdout, 8192)
          |     for {
          |         f.WriteRune('1')
          |     }
        
           | preseinger wrote:
           | In addition to buffering within the process, Linux (usually)
           | buffers process stdout with ~16KB, and does not buffer
           | stderr.
        
           | reincarnate0x14 wrote:
           | Not specifically addressed at you, but it's a bit amusing
           | watching a younger generation of programmers rediscovering
           | things like this, which seemed hugely important in like 1990
           | but largely don't matter that much to modern workflows with
           | dedicated APIs or various shared memory or network protocols,
           | as not much that is really performance-critical is typically
           | piped back and forth anymore.
           | 
           | More than a few old backup or transfer scripts had extra dd
           | or similar tools in the pipeline to create larger and semi-
           | asynchronous buffers, or to re-size blocks on output to
           | something handled better by the receiver, which was a big
           | deal on high speed tape drives back in the day. I suspect
           | most modern hardware devices have large enough static RAM and
           | fast processors to make that mostly irrelevant.
        
         | abuckenheimer wrote:
         | > python3 -c 'while (1): print (1, end="")' | pv > /dev/null
         | 
          | python actually buffers its writes, with print only
          | flushing stdout occasionally; you may want to try:
          | 
          |     python3 -c 'while (1): print (1, end="", flush=True)' | pv > /dev/null
          | 
          | which I find goes much slower (550KiB/s)
        
         | orf wrote:
         | Using `sys.stdout.write()` instead of `print()` gets ~8MiB/s on
         | my machine.
        
       | bfors wrote:
       | Love the subtle "stonks" overlay on the first chart
        
       ___________________________________________________________________
       (page generated 2022-06-02 23:00 UTC)