[HN Gopher] How fast are Linux pipes anyway?
___________________________________________________________________
How fast are Linux pipes anyway?
Author : rostayob
Score : 576 points
Date : 2022-06-02 09:19 UTC (13 hours ago)
(HTM) web link (mazzo.li)
(TXT) w3m dump (mazzo.li)
| sandGorgon wrote:
| Android's flavor of Linux uses "binder" instead of pipes because
| of its security model. IMHO filesystem-based IPC mechanisms
| (notably pipes) can't be used because of the lack of a world-
| writable directory - I may be wrong here.
|
| Binder comes from Palm actually (OpenBinder)
| Matthias247 wrote:
| Pipes don't necessarily mean one has to use FS permissions. E.g.
| a server could hand out anonymous pipes to authorized clients
| via fd passing on Unix domain sockets. The server can then
| implement an arbitrary permission check before doing this.
| megous wrote:
| "lack of a world-writable directory"
|
| What's that?
|
| A lot of programs store sockets in /run which is typically
| implemented by `tmpfs`.
| marcodiego wrote:
| The history of Binder is more involved and has its seeds in
| BeOS IIRC.
| stackbutterflow wrote:
| This site is pleasing to the eye.
| apostate wrote:
| It looks like it is using the "Tufte" style, named after Edward
| Tufte, who is very famous for his writing on data
| visualization. More examples: https://rstudio.github.io/tufte/
| ianai wrote:
| I usually just use cat /dev/urandom > /dev/null to generate load.
| Not sure how this compares to their code.
|
| Edit: it's actually "yes" that I've used before for generating
| load. I remember reading somewhere "yes" was optimized
| differently than the original Unix command as part of the unix
| certification lawsuit(s).
|
| Long night.
| yakubin wrote:
| On 5.10.0-14-amd64 "pv < /dev/urandom >/dev/null" reports
| 72.2MiB/s. "pv < /dev/zero >/dev/null" reports 16.5GiB/s. AMD
| Ryzen 7 2700X with 16GB of DDR4 3000MHz memory.
|
| "tr '\0' 1 </dev/zero | pv >/dev/null" reports 1.38GiB/s.
| | "yes | pv >/dev/null" reports 7.26GiB/s. | | So "/dev/urandom" may not be the best source when testing | performance. | sumtechguy wrote: | Think they were generating load? Going through the urandom | device not bad as it has to do a bit of work to get that rand | number? Just for throughput though zero is prob better. | gtirloni wrote: | "Generating load" for measuring pipe performance means | generating bytes. Any bytes. urandom is terrible for that. | yakubin wrote: | I don't understand. If you're testing how fast pipes are, | then I'd expect you to measure throughput or latency. Why | would you measure how fast something unrelated to pipes is? | If you want to measure this other thing on the other hand, | why would you bother with pipes, which add noise to the | measurement? | | UPDATE: If you mean that you want to test how fast pipes | are when there is other load in the system, then I'd | suggest just running a lot of stuff in the background. But | I wouldn't put the process dedicated for doing something | else into the pipeline you're measuring. As a matter of | fact, the numbers I gave were taken with plenty of heavy | processes running in the background, such as Firefox, | Thunderbird, a VM with another instance of Firefox, | OpenVPN, etc. etc. :) | khorne wrote: | Because they mentioned generating load, not testing pipe | performance. | yakubin wrote: | Oh, wait. You mean that this "cat </dev/urandom | >/dev/null" was meant to be running in the background and | not be the pipeline which is tested? Ok, my bad for not | getting the point. | ianai wrote: | You're right and I miss-typed, it's yes that I usually use. I | think it's optimized for throughput. 
| spacedcowboy wrote:
| Ran the basic initial implementation on my Mac Studio and was
| pleasantly surprised to see:
|
|     @elysium pipetest % pipetest | pv > /dev/null
|     102GiB 0:00:13 [8.00GiB/s]
|     @elysium ~ % pv < /dev/zero > /dev/null
|     143GiB 0:00:04 [36.4GiB/s]
|
| Not a valid comparison between the two machines because I don't
| know what the original machine is, but MacOS rarely comes out
| shining in this sort of comparison, and the simplistic approach
| here giving 8 GB/s rather than the author's 3.5 GB/s was better
| than I'd expected, even given the machine I'm using.
| mhh__ wrote:
| Given the machine as in a brand new Mac?
| spacedcowboy wrote:
| given that the machine is the most performant Mac that Apple
| makes.
| [deleted]
| sylware wrote:
| yep, you want perf? Don't mutex then yield, do spin and check
| your cpu heat sink.
|
| :)
| jagrsw wrote:
| Something maybe a bit related.
|
| I just had 25Gb/s internet installed
| (https://www.init7.net/en/internet/fiber7/), and at those speeds
| Chrome and Firefox (which is Chrome-based) pretty much die when
| using speedtest.net at around 10-12Gbps.
|
| The symptoms are that the whole tab freezes, and the shown speed
| drops from those 10-12Gbps to <1Gbps and the page starts updating
| itself only every second or so.
|
| IIRC Chrome-based browsers use some form of IPC with a separate
| networking process, which actually handles networking. I wonder
| if this might be because the local speed limit for
| socketpair/pipe under Linux was reached, and that's why I'm
| seeing this.
| [deleted]
| Spooky23 wrote:
| I ran into this with a VDI environment in a data center. We had
| initially delivered 10Gb Ethernet to the VMs, because why not.
|
| Turned out Windows 7 or the NICs needed a lot of tuning to work
| well. There was a lot of freezing and other fail.
| implying wrote:
| Firefox is not based on the chromium codebase, it is older.
| formerly_proven wrote:
| Well if we're talking ancestors that's technically true, but
| not by that much - Firefox comes from Netscape,
| Chrome/Safari/... come from KHTML.
| elpescado wrote:
| AFAIR, KHTML was/is not related to Netscape/Gecko in any
| way.
| wodenokoto wrote:
| > ... on August 16, 1999 that [Lars Knoll] had checked in
| what amounted to a complete rewrite of the KHTML library--
| changing KHTML to use the standard W3C DOM as its internal
| document representation.
| https://en.wikipedia.org/wiki/KHTML#Re-write_and_improvement
|
| > In March 1998, Netscape released most of the code base
| for its popular Netscape Communicator suite under an open
| source license. The name of the application developed from
| this would be Mozilla, coordinated by the newly created
| Mozilla Organization
| https://en.wikipedia.org/wiki/Mozilla_Application_Suite#Hist...
|
| Netscape Communicator (or Netscape 4) was released in 1997,
| so if we are tracing lineage, I'd say Firefox has a 2 year
| head start.
| def- wrote:
| Firefox is only Chrome-based on iOS.
| rwaksmunski wrote:
| You mean WebKit.
| karamanolev wrote:
| It's Safari-based, which is Webkit-based. Chrome is also
| Safari-based on iOS, because all the browsers must be.
| There's no actual Chrome (as in Blink, the browser engine) on
| iOS, at least in the App Store.
| Izkata wrote:
| > It's Safari-based, which is Webkit-based.
|
| Firefox only uses Webkit on iOS, due to Apple requirements.
| It uses Gecko everywhere else. And I don't think it's ever
| been Safari-based anywhere.
| jve wrote:
| Do you actually mean Gbit/s? 25Gb/s would translate to
| 200Gbit/s ...
| Denvercoder9 wrote:
| The small "b" is customarily used to refer to bits, with the
| large "B" used to refer to bytes. So 25 Gb/s would be 25
| Gbit/s, while 25 GB/s would be 200 Gbit/s.
| karamanolev wrote:
| Gb != GB. Per Wikipedia, which aligns with my understanding,
|
| "The gigabit has the unit symbol Gbit or Gb."
|
| 25GB/s would translate to 200Gbit/s and also 200Gb/s.
| reitanqild wrote:
| > and at those speeds Chrome and Firefox (which is Chrome-
| based)
|
| AFAIK, Firefox is not Chrome-based anywhere.
|
| On iOS it uses whatever iOS provides for webview - as does
| Chrome on iOS.
|
| Firefox and Safari are now the only supported mainstream
| browsers that have their own rendering engines. Firefox is the
| only one that has its own rendering engine and is cross-
| platform. It is also open source.
| yosamino wrote:
| > AFAIK, Firefox is not Chrome-based anywhere.
|
| Not technically "Chrome-based", but Firefox draws graphics
| using Chrome's Skia graphics engine.
|
| Firefox is not completely independent from Chrome.
| bawolff wrote:
| I feel like counting every library is silly.
|
| In any case, I thought Chrome used libnss, which is a
| Mozilla library, so you could say the reverse as well.
| SahAssar wrote:
| Skia started in 2004 independently of Google and was then
| acquired by Google. Calling it "Chrome's Skia graphics
| engine" makes it sound like it was built _for_ Chrome.
| [deleted]
| [deleted]
| SahAssar wrote:
| > Firefox is the only one that has its own rendering engine
| and is cross-platform.
|
| Interestingly, Safari's rendering engine is open source and
| cross-platform, but the browser is not. Lots of Linux-focused
| browsers (Konqueror, GNOME Web, surf) and most embedded
| browsers (Nintendo DS & Switch, PlayStation) use WebKit. Also
| some user interfaces (like webOS, which runs on all of LG's
| TVs and smart refrigerators) use WebKit as their renderer.
| qwerty456127 wrote:
| WebKit itself is a fork of Konqueror's original KHTML
| engine, by the way.
| tmccrary55 wrote:
| Browser Genealogy
| cturtle wrote:
| Now I want to see the family tree!
| capableweb wrote:
| Ask, and you shall receive :)
|
| https://en.wikipedia.org/wiki/File:Timeline_of_web_browsers....
|
| https://en.wikipedia.org/wiki/Timeline_of_web_browsers
| has tables as well.
| tinus_hn wrote:
| iOS uses WebKit which is also what Chrome is based on.
| Comevius wrote:
| Chrome uses Blink, which was forked from WebKit's WebCore
| in 2013. They replaced JavaScriptCore with V8.
| merightnow wrote:
| Unrelated question, what hardware do you use to setup your
| network for 25Gb/s? I've been looking at init7 for a while, but
| gave up and stayed with Salt after trying to find the right
| hardware for the job.
| jagrsw wrote:
| NIC: Intel E810-XXVDA2
|
| Optics: To ISP: Flexoptics
| (https://www.flexoptix.net/de/p-b1625g-10-ad.html?co10426=972...),
| Router-PC: https://mikrotik.com/product/S-3553LC20D
|
| Router: Mikrotik CCR-2004 -
| https://mikrotik.com/product/ccr2004_1g_12s_2xs - warning:
| it's good up to ~20Gb/s one way. It can handle ~25Gb/s
| down, but only ~18Gb/s up, and with IPv6 the max seems to be
| ~10Gb/s in any direction.
|
| If Mikrotik is something you're comfortable using you can
| also take a look at
| https://mikrotik.com/product/ccr2216_1g_12xs_2xq - it's more
| expensive (~2500EUR), but should handle 25Gb/s easily.
| zrail wrote:
| IIRC most Mikrotik products lack hardware IPv6 offload
| which is probably why you're seeing lower speeds.
| BenjiWiebe wrote:
| In that case 10Gb/s sounds actually pretty good, if
| that's without hardware offload.
| sph wrote:
| This makes me wonder... does anyone offer an iperf-based
| speedtest service on the Internet?
| scoopr wrote:
| Well there are some public iperf servers listed here:
| https://iperf.fr/iperf-servers.php
| jagrsw wrote:
| Ha.. my ISP does :) I can hit those 25Gb/s when connecting
| directly (bypassing the router as it barely handles those
| 25Gb/s).
|
| With it in the way I get ~15-20Gb/s:
|
|     $ iperf3 -l 1M --window 64M -P10 -c speedtest.init7.net
|     ..
|     [SUM] 0.00-1.00 sec 1.87 GBytes 16.0 Gbits/sec 181406
|     $ iperf3 -R -l 1M --window 64M -P10 -c speedtest.init7.net
|     ..
|     [SUM] 0.00-1.00 sec 2.29 GBytes 19.6 Gbits/sec
| [deleted]
| jcims wrote:
| Speedtest does have a CLI as well, might be interesting to
| compare them.
| jagrsw wrote:
| Yup, the CLI version works well -
| https://www.speedtest.net/result/c/e9104814-294f-4927-af9f-d...
| zrail wrote:
| Thing to note: the open source version on GitHub, installable
| by homebrew and native package managers, is not the same
| version as Ookla distributes from their website and is not
| accurate at all.
| [deleted]
| pca006132 wrote:
| Is it only affecting the browser or the entire system? It might
| be possible that the CPU is busy handling interrupts from the
| ethernet controller, although in general these controllers
| should use DMA and should not send interrupts frequently.
| jagrsw wrote:
| Only browser(s), the OS is capable of 25Gb/s - checked with
| iperf and also speedtest-cli -
| https://www.speedtest.net/result/c/e9104814-294f-4927-af9f-d...
| jcranberry wrote:
| Sounds like a hard drive cache filling up.
| megous wrote:
| One would assume a speed testing website would use `Cache-
| Control: no-store`...
|
| But alas, they do not, lol. They just use no-cache on the
| query, which will not prevent the browser from storing the
| data.
|
| https://megous.com/dl/tmp/8112dd9346dd66e8.png
| bayindirh wrote:
| Chrome fires many processes and creates an IPC-based comm-
| network between them to isolate stuff. It's somewhat abusing
| your OS to get what it wants in terms of isolation and whatnot.
|
| (Which is similar to how K8S abuses iptables and makes it
| useless for other ends, and makes you install a dedicated
| firewall in front of your ingress path, but let's not digress).
|
| On the other hand, Firefox is neither Chromium-based, nor is it
| a cousin of it. It's a completely different codebase, inherited
| from Netscape days and evolved up to this point.
|
| As another test point, Firefox doesn't even blink at a
| symmetric gigabit connection going at full speed (my network is
| capped by my NIC, the pipe is _way_ fatter).
| jagrsw wrote:
| > As another test point, Firefox doesn't even blink at a
| symmetric gigabit connection going at full speed (my network
| is capped by my NIC, the pipe is way fatter).
|
| FWIW Firefox under Linux (Firefox Browser 100.0.2 (64-bit))
| behaves pretty much the same as Chrome. The speed rises
| quickly to 5-8Gb/s, then the UI starts choking, and the shown
| speed drops to 500Mb/s. It could be that there's some
| scheduling limit or other bottleneck hit in the OS itself,
| assuming these are different codebases (are they?).
| bayindirh wrote:
| I'd love to test and debug the path where it dies, but none
| of the systems we have Firefox on have pipes that fat (again
| NIC limited).
|
| However, you can test the limits of Linux by installing the
| CLI version of Speedtest and hitting a nearby server.
|
| The bottleneck may be in the browser itself, or in your
| graphics stack, too.
|
| Linux can do pretty amazing things in the network
| department, otherwise 100Gbps Infiniband cards wouldn't be
| possible on Linux servers, yet we have them on our systems.
|
| And yes, Chrome and Firefox are way different browsers. I
| can confidently say this, because I've been using Firefox
| since it was called Netscape 6.0 (and Mozilla in Knoppix).
| jeffreygoesto wrote:
| From my experience long ago, all high performance
| networking under Linux was traditionally user space and
| pre-allocated pools (netmap, dpdk, pf-ring...). I did not
| follow how much io_uring has been catching up for
| network stack usage... Maybe somebody else knows?
| sophacles wrote:
| I have a service that beats epoll with io_uring (it reads
| GRE packets from one socket, and does some
| lookups/munging on the inner packet and re-encaps them to
| a different mechanism and writes them back to a different
| socket).
General usage for io_uring vs epoll is pretty
| comparable IIUC. It wouldn't surprise me if streams (e.g.
| tcp) end up being faster via io_uring and buffer
| registration though.
|
| Totally tangential - it looks like io_uring is evolving
| beyond just io and into an alternate syscall interface,
| which is pretty neat imho.
| bayindirh wrote:
| While I'm not very knowledgeable in specifics, there are
| many paths for networking in Linux now. The usual kernel-
| based one is there, and there are also kernel-bypass [0]
| paths used by very high performance cards.
|
| Also, Infiniband can directly RDMA to and from MPI
| processes for making "remote memory local", allowing very
| low latencies and high performance in HPC environments.
|
| I also like this post from Cloudflare [1]. I've read it
| completely, but the specifics are lost on me since I'm
| not directly concerned with the network part of our
| system.
|
| [0]: https://medium.com/@penberg/on-kernel-bypass-networking-and-...
|
| [1]: https://blog.cloudflare.com/how-to-receive-a-million-packets...
| bawolff wrote:
| > I can confidently say this, because I'm using Firefox
| since it's called Netscape 6.0 (and Mozilla in Knoppix).
|
| Mozilla suite/seamonkey isn't usually considered the same
| as firefox, although obviously related.
| bayindirh wrote:
| I'm not talking about the version which evolved into
| Seamonkey. I'm talking about Mozilla/Firefox 0.8 which
| had a Mozilla logo as a "Spinner" instead of the Netscape
| logo on the top right.
| bawolff wrote:
| Netscape 6 was not Firefox-based
| https://en.m.wikipedia.org/wiki/Netscape_6
|
| Firefox 0.8 did not have Netscape branding
| http://theseblog.free.fr/firefox-0.8.jpg
| pjmlp wrote:
| It is using OS processes for what they were created for in
| the first place.
|
| Unfortunately, the security industry has proven why threads
| are a bad idea for applications where security is a top
| concern.
|
| The same applies to dynamically loaded code as plugins, where
| the host application takes the blame for all the instability
| and exploits they introduce.
| bayindirh wrote:
| Yes, Firefox is also doing the same, however due to the
| nature of Firefox's processes, the OS doesn't lose much
| responsiveness or feel bogged down when I have 50+
| tabs open due to some research.
|
| If you need security, you need isolation. If you want
| hardware-level isolation, you need processes. That's
| normal.
|
| My disagreement with Google's applications is how they
| behave like they're the only running processes on the
| system itself. I'm pretty aware that some of the most
| performant or secure things don't have the prettiest
| implementation on paper.
| ReactiveJelly wrote:
| There used to be a setting to tweak Chrome's process
| behavior.
|
| I believe the default behavior is "Coalesce tabs into the
| same content process if they're from the same trust
| domain".
|
| Then you can make it more aggressive like "Don't coalesce
| tabs ever" or less aggressive like "Just have one content
| process". I think.
|
| I'm not sure how Firefox decides when to spawn new
| processes. I know they have one GPU process and then
| multiple untrusted "content processes" that can touch
| untrusted data but can't touch the GPU.
|
| I don't mind it. It's a trade-off between security and
| overhead. The IPC is pretty efficient and the page cache
| in both Windows and Linux _should_ mean that all the code
| pages are shared between all content processes.
|
| Static pages actually feel light to me. I think crappy
| webapps make the web slow, not browser security.
|
| (inb4 I'm replying to someone who works on the Firefox
| IPC team or something lol)
| girvo wrote:
| > inb4 I'm replying to someone who works on the Firefox
| IPC team or something lol
|
| The danger and joy of commenting on HN!
| bayindirh wrote:
| I'm harmless, don't worry.
:) Also you can find more
| information about me in my profile.
|
| Even if I was working on Firefox/Chrome/whatever, I'd not
| be mad at someone who doesn't know something very well.
| Why should I? We're just conversing here.
|
| Also, I've been very wrong here at times, and this
| improved my conversation / discussion skills a great
| deal.
|
| So, don't worry, and comment away.
| mastax wrote:
| I'm glad huge pages make a big difference because I just spent
| several hours setting them up. Also everyone says to disable
| transparent_hugepage, so I set it to `madvise`, but I'm skeptical
| that any programs outside databases will actually use them.
| deagle50 wrote:
| The JVM can. I have JetBrains set up to use them.
| gigatexal wrote:
| Now this is the kind of content I come to HN for. Absolutely
| fascinating read.
| lazide wrote:
| The majority of this overhead (and the slow transfers) naively
| seems to be in the scripts/systems using the pipes.
|
| I was worried when I saw zfs send/receive used pipes for instance
| because of performance worries - but using it in reality I had no
| problems pushing 800MB/s+. It seemed limited by IOPS on my local
| disk arrays, not any limits in pipe performance.
| Matthias247 wrote:
| Right. I'm actually surprised the test with 256kB transfers
| gives reasonable results, and would rather have tested with >
| 1GB instead. For such a small transfer it seemed likely that
| the overhead of spawning the process and loading libraries by
| far dominates the amount of actual work. I'm also surprised
| this didn't show up in profiles. But it obviously depends on
| where the measurement's start and end points are.
| azornathogron wrote:
| Perhaps I've misunderstood what you're referring to, but the
| test in the article is measuring speed transferring 10 GiB.
| 256 KiB is just the buffer size.
| Matthias247 wrote:
| The first C program in the blog post allocates a 256kB
| buffer and writes that one exactly once to stdout.
I don't
| see another loop which writes it multiple times.
| azornathogron wrote:
| There's an outer while(true){} loop - the write side just
| writes continuously.
|
| More generally though, sidenote 5 says that the code in
| the article itself is incomplete and the real test code
| is available in the github repo:
| https://github.com/bitonic/pipes-speed-test
| BeeOnRope wrote:
| This is a well-written article with excellent explanations and I
| thoroughly enjoyed it.
|
| However, none of the variants using vmsplice (i.e., all but the
| slowest) are safe. When you gift [1] pages to the kernel there is
| no reliable general purpose way to know when the pages are safe
| to reuse again.
|
| This post (and the earlier FizzBuzz variant) try to get around
| this by assuming the pages are available again after "pipe size"
| bytes have been written after the gift, _but this is not true in
| general_. For example, the read side may also use splice-like
| calls to move the pages to another pipe or IO queue in a
| zero-copy way, so the lifetime of the page can extend beyond the
| original pipe.
|
| This will show up as race conditions and spontaneously changing
| data where a downstream consumer sees the page suddenly change as
| it is overwritten by the original process.
|
| The author of these splice methods, Jens Axboe, had proposed a
| mechanism which enabled you to determine when it was safe to
| reuse the page, but as far as I know nothing was ever merged. So
| the scenarios where you can use this are limited to those where
| you control both ends of the pipe and can be sure of the exact
| page lifetime.
|
| ---
|
| [1] Specifically, using SPLICE_F_GIFT.
| haberman wrote:
| What if the writer frees the memory entirely? Can you segv the
| reader? That would be quite a dangerous pattern.
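The hazard described above can be seen directly in a small Linux-only sketch (illustrative only, not the article's test code; the function name is made up): vmsplice() without SPLICE_F_GIFT hands the pipe a reference to the caller's page rather than a copy, so a write made after the syscall has already returned can still change what the reader later sees.

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* vmsplice(), SPLICE_F_* - Linux only */
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* vmsplice a page of 'A's into a pipe, overwrite the page with
 * 'B's after vmsplice() returns, then read from the pipe.
 * Returns the byte the reader observed, or -1 on error. */
int vmsplice_then_modify(void) {
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    long pagesz = sysconf(_SC_PAGESIZE);
    char *page;
    if (posix_memalign((void **)&page, (size_t)pagesz, (size_t)pagesz))
        return -1;
    memset(page, 'A', (size_t)pagesz);
    struct iovec iov = { .iov_base = page, .iov_len = (size_t)pagesz };
    /* No SPLICE_F_GIFT: the page still belongs to us, but the
     * kernel references it rather than copying it into the pipe. */
    if (vmsplice(fds[1], &iov, 1, 0) != (ssize_t)pagesz)
        return -1;
    /* Modify the buffer after the "write" has completed... */
    memset(page, 'B', (size_t)pagesz);
    /* ...and check which version the read side gets. */
    char c = 0;
    if (read(fds[0], &c, 1) != 1)
        return -1;
    close(fds[0]);
    close(fds[1]);
    free(page);
    return c;
}
```

On the kernels I'd expect this to return 'B' rather than 'A', i.e. the reader observes the post-vmsplice modification; that is exactly why a vmspliced page must not be reused until it is somehow known to have been fully consumed downstream.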
| rostayob wrote:
| (I am the author of the post)
|
| I haven't digested this comment fully yet, but just to be
| clear, I am _not_ using SPLICE_F_GIFT (and I don't think the
| fizzbuzz program is either). However I think what you're saying
| makes sense in general, SPLICE_F_GIFT or not.
|
| Are you sure this unsafety depends on SPLICE_F_GIFT?
|
| Also, do you have a reference to the discussions regarding this
| (presumably on LKML)?
| rostayob wrote:
| Actually, from re-reading the man page for vmsplice, it seems
| like it _should_ depend on SPLICE_F_GIFT (or in other words,
| it should be safe without it).
|
| But from what I know about how vmsplice is implemented,
| gifting or not, it sounds like it should be unsafe anyhow.
| DerSaidin wrote:
| Hello
|
| https://mazzo.li/posts/fast-pipes.html#what-are-pipes-made-o...
|
| I think the diagram near the start of this section has "head"
| and "tail" swapped.
|
| Edit: Nevermind, I didn't read far enough.
| BeeOnRope wrote:
| Yeah my mention of gift was a red herring: I had assumed gift
| was being used but the same general problem (the "page
| garbage collection issue") crops up regardless.
|
| If you don't use gift, you never know when the pages are free
| to use again, so in principle you need to keep writing to new
| buffers indefinitely. One "solution" to this problem is to
| gift the pages, in which case the kernel does the GC for you,
| but you need to churn through new pages constantly because
| you've gifted the old ones. Gift is especially useful when
| the page gifted can be used directly in the page cache (i.e.,
| writing a file, not a pipe).
|
| Without gift some consumption patterns may be safe but I
| think they are exactly those which involve a copy (not using
| gift means that a copy will occur for additional read-side
| scenarios).
Ultimately the problem is that if some downstream
| process is able to get a zero-copy view of a page from an
| upstream writer, how can this be safe against concurrent
| modification? The pipe size trick is one way it could work,
| but it doesn't pan out because the pages may live beyond the
| immediate pipe (this is actually alluded to in the FizzBuzz
| article, where they mentioned things blew up if more than one
| pipe was involved).
| rostayob wrote:
| Yes, this all makes sense, although like everything
| splicing-related, it is very subtle. Maybe I should have
| mentioned the subtleness and dangerousness of splicing at
| the beginning, rather than at the end.
|
| I still think the man page of vmsplice is quite misleading!
| Specifically:
|
|     SPLICE_F_GIFT
|         The user pages are a gift to the kernel. The
|         application may not modify this memory ever,
|         otherwise the page cache and on-disk data may
|         differ. Gifting pages to the kernel means that a
|         subsequent splice(2) SPLICE_F_MOVE can
|         successfully move the pages; if this flag is not
|         specified, then a subsequent splice(2)
|         SPLICE_F_MOVE must copy the pages. Data must also
|         be properly page aligned, both in memory and
|         length.
|
| To me, this indicates that if we're _not_ using
| SPLICE_F_GIFT, downstream splices will be automatically
| taken care of, safety-wise.
| scottlamb wrote:
| Hmm, reading this side-by-side with a paragraph from
| BeeOnRope's comment:
|
| > This post (and the earlier FizzBuzz variant) try to get
| around this by assuming the pages are available again
| after "pipe size" bytes have been written after the gift,
| _but this is not true in general_. For example, the read
| side may also use splice-like calls to move the pages to
| another pipe or IO queue in zero-copy way so the lifetime
| of the page can extend beyond the original pipe.
|
| The paragraph you quoted says that the "splice-like calls
| to move the pages" actually copy when SPLICE_F_GIFT is
| not specified.
So perhaps the combination of not using
| SPLICE_F_GIFT and waiting until "pipe size" bytes have
| been written is safe.
| BeeOnRope wrote:
| Yes, it is not clear to me when the copy actually happens,
| but I had assumed the > 30 GB/s result after the read side
| was changed to use splice must imply zero copy.
| rostayob wrote:
| It could be that when splicing to /dev/null (which I'm
| doing), the kernel knows that their content is never
| witnessed, and therefore no copy is required. But I
| haven't verified that.
| scottlamb wrote:
| Makes sense. If so, some of the nice benchmark numbers
| for vmsplice would go away in a real scenario, so that'd
| be nice to know.
| BeeOnRope wrote:
| Splicing seems to work well for the "middle" part of a
| chain of piped processes, e.g., how pv works: it can
| splice pages from one pipe to another w/o needing to
| worry about reusing the page since someone upstream
| already wrote the page.
|
| Similarly for splicing from a pipe to a file or something
| like that. It's really the end(s) of the chain that want
| to (a) generate the data in memory or (b) read the data
| in memory that seem to create the problem.
| scottlamb wrote:
| I think you're right that the same problem applies without
| SPLICE_F_GIFT. One of the other fizzbuzz code golfers
| discusses that here:
| https://codegolf.stackexchange.com/a/239848
|
| I wonder if io_uring handles this (yet). io_uring is a newer
| async IO mechanism by the same author which tells you when
| your IOs have completed. So you might think it would:
|
| * But from a quick look, I think its vmsplice equivalent
| operation just tells you when the syscall would have
| returned, so maybe not. [edit: actually, looks like there's
| not even an IORING_OP_VMSPLICE operation in the latest
| mainline tree yet, just drafts on lkml. Maybe if/when the
| vmsplice op is added, it will wait to return for the right
| time.]
| | * And in this case (no other syscalls or work to perform | while waiting) I don't see any advantage in io_uring's | read/write operations over just plain synchronous read/write. | Matthias247 wrote: | uring only really applies for async IO - and would tell you | when an otherwise blocking syscall would have finished. | Since the benchmark here uses blocking calls, there | shouldn't be any change in behavior. The lifetime of the | buffer is an orthogonal concern to the lifetime of the | operation. Even if the kernel knows when the operation is | done inside the kernel it wouldn't have a way to know | whether the consuming application is done with it. | scottlamb wrote: | > uring only really applies for async IO - and would tell | you when an otherwise blocking syscall would have | finished. Since the benchmark here uses blocking calls, | there shouldn't be any change in behavior. The lifetime | of the buffer is an orthogonal concern to the lifetime of | the operation. Even if the kernel knows when the | operation is done inside the kernel it wouldn't have a | way to know whether the consuming application is done | with it. | | That doesn't match what I've read. E.g. | https://lwn.net/Articles/810414/ opens with "At its core, | io_uring is a mechanism for performing asynchronous I/O, | but it has been steadily growing beyond that use case and | adding new capabilities." | | More precisely: | | * While most/all ops are async IO now, is there any | reason to believe folks won't want to extend it to batch | basically any hot-path non-vDSO syscall? As I said, | batching doesn't help here, but it does in a lot of other | scenarios. | | * Several IORING_OP_s seem to be growing capabilities | that aren't matched by like-named syscalls. E.g. IO | without file descriptors, registered buffers, automatic | buffer selection, multishot, and (as of a month ago) | "ring mapped supplied buffers". Beyond the individual | operation level, support for chains. 
Why not a mechanism
| that signals completion when the buffer passed to
| vmsplice is available for reuse? (Maybe by essentially
| delaying the vmsplice syscall's return [1], maybe by a
| second command, maybe by some extra completion event from
| the same command, details TBD.)
|
| [1] edit: although I guess that's not ideal. The reader
| side could move the page and want to examine following
| bytes, but those won't get written until the writer sees
| the vmsplice return and issues further writes.
| BeeOnRope wrote:
| Yeah this.
|
| The vanilla io_uring fits "naturally" in an async model,
| but batching and some of the other capabilities it
| provides are definitely useful for stuff written to a
| synchronous model too.
|
| Additionally, io_uring can avoid syscalls sometimes even
| without any explicit batching by the application, because
| it can poll the submission queue (root only, last time I
| checked, unfortunately): so with the right setup a series
| of "synchronous" ops via io_uring (i.e., submit &
| immediately wait for the response) could happen with < 1
| user-kernel transition per op, because the kernel is busy
| servicing ops directly from the incoming queue and the
| application gets the response during its polling phase
| before it waits.
| yxhuvud wrote:
| Perhaps it could be sort of simulated in uring using the
| splice op against a memfd that has been mmapped in advance?
| I wonder how fast that could be and how it would compare
| safety-wise.
| BeeOnRope wrote:
| I don't know if io_uring provides a mechanism to solve this
| page ownership thing, but I bet Jens does: I've asked [1].
|
| ---
|
| [1]
| https://twitter.com/trav_downs/status/1532491167077572608
| robocat wrote:
| > However, none of the variants using vmsplice (i.e., all but
| the slowest) are safe. When you gift [1] pages to the kernel
| there is no reliable general purpose way to know when the pages
| are safe to reuse again.
[snip] This will show up as race | conditions and spontaneously changing data where a downstream | consumer sees the page suddenly change as it is overwritten by | the original process. | | That sounds like a security issue - the ability of an upstream | generator process to write into the memory of a downstream | reader process, or, worse, vice versa. I | presume that the Linux kernel only lets this happen (zero copy) | when the two processes are running as the same user? | hamandcheese wrote: | It's not clear to me that the kernel allows the receiving | process to write instead of just read. | | But also, if you are sending data, why would you later | read/process that send buffer? | | The only attack vector I could imagine would be if one sender | was splicing the same memory to two or more receivers. A | malicious receiver with write access to the spliced memory | could compromise other readers. | nice2meetu wrote: | I once had to change my mental model for how fast some of these | things were. I was using `seq` as an input for something else, | and my thinking was along the lines that it is a small generator | program running hot in the CPU and would be super quick. | Specifically because it would only be writing things out to | memory for the next program to consume, not reading anything in. | | But that was way off and `seq` turned out to be ridiculously | slow. I dug down a little and made a faster version of `seq`, | which kind of got me what I wanted. But then I noticed at the end | that the point was moot anyway, because just piping it to the | next program over the command line was going to be the slow | point, so it didn't matter anyway. | | https://github.com/tverniquet/hseq | freedomben wrote: | I had a somewhat similar discovery once using GNU parallel.
I | was trying to generate as much web traffic as possible from a | single machine to load test a service I was building, and I | assumed that the network I/O would be the bottleneck by a long | shot, not the overhead of spawning many processes. I was | disappointed by the amount of traffic generated, so I rewrote | it in Ruby using the parallel gem with threads (instead of | processes), and got orders of magnitude more performance. | strictfp wrote: | Node is great for this use case | Klasiaster wrote: | Netmap offers zero-copy pipes (included in FreeBSD, on Linux it's | a third party module): | https://www.freebsd.org/cgi/man.cgi?query=netmap&sektion=4 | v3gas wrote: | Love the subtle stonks background in the first image. | [deleted] | alex_hirner wrote: | Does an API similar to vmsplice exist for Windows? | herodoturtle wrote: | This was a long but highly insightful read! | | (And as an aside, the combination of that font with the | hand-drawn diagrams is really cool) | arkitaip wrote: | The visual design is amazing. | anotherhue wrote: | pv is written in perl so isn't the snappiest, I'm surprised to | see it score so highly. I wonder what the initial speed would | have been if it just wrote to /dev/null | merpkz wrote: | Confused with parallel, maybe? | rostayob wrote: | It's not written in perl, it's written in C, and it uses | splice() (one of the syscalls discussed in the post). | anotherhue wrote: | I was totally wrong. Thank you for showing me the facts. | karamanolev wrote: | Definitely C, per what appears to be the official repo | (linking the splice syscall) - | https://github.com/icetee/pv/blob/master/src/pv/transfer.c#L...
| effnorwood wrote: | [deleted] | mg wrote: | For some reason, this raised my curiosity about how fast different | languages write individual characters to a pipe: | | PHP comes in at about 900KiB/s: php -r 'while | (1) echo 1;' | pv > /dev/null | | Python is about 50% faster at about 1.5MiB/s: | python3 -c 'while (1): print (1, end="")' | pv > /dev/null | | Javascript is slowest at around 200KiB/s: node | -e 'while (1) process.stdout.write("1");' | pv > /dev/null | | What's also interesting is that node crashes after about a | minute: FATAL ERROR: Ineffective mark-compacts | near heap limit Allocation failed - JavaScript heap out | of memory | | All results from within a Debian 10 docker container with the | default repo versions of PHP, Python and Node. | | Update: | | Checking with strace shows that Python buffers the output: | strace python3 -c 'while (1): print (1, end="")' | pv > /dev/null | | Outputs a series of: write(1, | "11111111111111111111111111111111"..., 8193) = 8193 | | PHP and JS do not. | | So the Python equivalent would be: python3 -c | 'while (1): print (1, end="", flush=True)' | pv > /dev/null | | Which makes it comparable to the speed of JS. | | Interesting that PHP is over 4x faster than Python and JS. | cestith wrote: | I'm on a 2015 MB Air with two browsers running, probably a | dozen tabs between them, three tabs in iTerm2, Outlook, Word, | and Teams running. | | Perl 5.18.0 gives me 3.5 MiB per second. Perl 5.28.3, 5.30.3, | and 5.34.0 give 4 MiB per second. perl5.34.0 | -e 'while (1) { print 1 }' | pv > /dev/null | | For Python 3.10.4, I get about 2.8 MiB/s as you have it | written, but around 5 MiB/s (same for 3.9 but only 4 MiB/s for | 3.8) with this. I also get 4.8 MiB/s with 2.7: | python3 -c 'while (1): print (1)' | pv > /dev/null | | If I make Perl behave like yes and print a character and a | newline, it has a jump of its own. The following gives me 37.3 | MiB per second.
perl5.34.0 -e 'while (1) { | print "1\n" }' | pv > /dev/null | | Interestingly, using Perl's say function (which is like a | Println) slows it down significantly. This version is only 7.3 | MiB/s. perl5.34.0 -E 'while (1) {say 1}' | pv | > /dev/null | | Go 1.18 has 940 KiB/s with fmt.Print and 1.5 MiB/s with | fmt.Println for some comparison. package main | import "fmt" func main() { for ;; | { fmt.Println("1") } | } | | These are all macports builds. | mscdex wrote: | Potential buffering issues aside, as others have pointed out | the node.js example is performing asynchronous writes, unlike | the other languages' examples (as far as I know). | | To do a proper synchronous write, you'd do something like: | node -e 'const { writeSync } = require("fs"); while (1) | writeSync(1, "1");' | pv > /dev/null | | That gets me ~1.1MB/s with node v18.1.0 and kernel 5.4.0. | themulticaster wrote: | If you ever need to write an arbitrary character to a pipe very | fast, GNU coreutils has you covered with yes(1). It runs at | about 6 GiB/s on my system: yes | pv > | /dev/null | | There's an article floating around [1] about how yes(1) is | extremely optimized considering its original purpose. In case | you're wondering, yes(1) is meant for commands that | (repeatedly) ask whether to proceed, expecting a y/n input or | something like that. Instead of repeatedly typing "y", you just | run "yes | the_command". | | Not sure how yes(1) compares to the techniques presented | in the linked post. Perhaps there's still room for improvement. | | [1] Previous HN discussion: | https://news.ycombinator.com/item?id=14542938 | gitgud wrote: | > _It runs at about 6 GiB/s on my system..._ | | Honest question: what are the practical use cases of this? | | Repeatedly typing the 'y' character into a Linux pipe is | surely not that common, especially at that bit rate. Also | seems like the bottleneck would always be the consuming | program...
| travisgriggs wrote: | > Honest question: what are the practical use cases of | this? | | It also allows you to script otherwise interactive command | line operations with the correct answer. Many command line | tools nowadays provide specific options to override | queries. But there are still a couple of holdouts which might | not. | jolmg wrote: | > especially at that bit rate. Also seems like the | bottleneck would always be the consuming program... | | It's not _made_ to be fast; it's just fast _by nature_, | because there's no other computation it needs to do than to | just output the string. | singron wrote: | Yes can repeat any string, not just "y". It can be useful | for basic load generation. | jolmg wrote: | I've used it to test some db behavior with `yes 'insert | ...;' | mysql ...`. Fastest insertions I could think of. | TacticalCoder wrote: | > Repeatedly typing the 'y' character into a Linux pipe is | surely not that common, especially at that bit rate. | | At that rate no, but I definitely use it once in a while. | For example if I copy quite a few files and then get | repeatedly asked if I want to overwrite the destination | (when it's already present). Sure, I could get my command | back and use the proper flag to "cp" or whatever to | overwrite, but it's usually much quicker to just get back | the previous line, go to the beginning (C-a), then type | "yes | " and be done with it. | | Note that you can pass a parameter to "yes" and then it | repeats what you passed instead of 'y'. | linsomniac wrote: | Historically, you could have dirty filesystems after a | reboot that "fsck" would ask an absurd number of questions | about ("blah blah blah inode 1234567890 fix? (y/n)"). | Unless you were in a very specific circumstance, you'd | probably just answer "y" to them. It could easily ask | thousands of questions though. So: "yes | fsck" was not | uncommon.
| jolmg wrote: | > Historically | | It's probably still common in installation scripts, like | in Dockerfiles. `apt-get install` has the `-y` option, | but it would be useful for all other programs that don't. | dpflug wrote: | Faster still is pv < /dev/zero > /dev/null | BenjiWiebe wrote: | Yes but you don't have control of which character is | written (only NULLs). | | yes lets you specify which character to output. 'yes n' for | example to output n. | rocqua wrote: | Yes doesn't just let you choose a character. It lets you | choose a string that will be repeated. So | yes 123abc | | will print | 123abc123abc123abc123abc123abc | | and so on. | jolmg wrote: | each time terminated by a newline, so: | 123abc 123abc 123abc ... | megous wrote: | "Javascript" is slowest probably because node pushes the writes | to a thread instead of printing directly from the main process | like PHP. | | Python cheats, and it's still slow as heck even while cheating | (buffers the output at 8192 chunks instead of issuing 1 byte | writes). | | write(1, "1", 1) loop in C pushes 6.38MiB/s on my PC. :) | cout wrote: | Why is it cheating to use a buffer? This is the behavior you | would get in C if you used the C standard library | (putc/fputc) instead of a system call (write). | soheil wrote: | You're testing a very specific operation, a loop, in each | language to determine its speed, not sure if I'd generalize | that. I wonder what it'd look like if you replaced the loop | with static print statements that were 1000s of characters long | with line breaks, the sort of things that compiler | optimizations do. | dpflug wrote: | I was getting different results depending on when I run it. | Took me a second to realize it was my processor frequency | scaling. | klohto wrote: | Python pushes 15MiB on my M1 Pro if you go down a level and use | sys directly. python3 -c 'import sys | while (1): sys.stdout.write("1")'| pv>/dev/null | mg wrote: | That caches though. 
You can see it when you strace it. | klohto wrote: | Good point, but so does a standard print call. Calling | flush() after each write does bring the perf to 1.5MiB | rovr138 wrote: | python3 -u -c 'import sys while (1): | sys.stdout.write("1")'| pv>/dev/null | | 427KiB/s python3 -c 'import sys | while (1): sys.stdout.write("1")'| pv>/dev/null | | 6.08MiB/s | | Using python 3.9.7 on macOS Monterey. | capableweb wrote: | > Javascript is slowest at around 200KiB/s: | | I get around 1.56MiB/s with that code. PHP gets 4.04MiB/s. | Python gets 4.35MiB/s. | | > What's also interesting is that node crashes after about a | minute | | I believe this is because `while(1)` runs so fast that there is | no "idle" time for V8 to actually run GC. V8 is a strange | beast, and this is just a guess of mine. | | The following code shouldn't crash, give it a try: | node -e 'function write() {process.stdout.write("1"); | process.nextTick(write)} write()' | pv > /dev/null | | It's slower for me though, giving me 1.18MiB/s. | | More examples with Babashka and Clojure: bb | -e "(while true (print \"1\"))" | pv > /dev/null | | 513KiB/s clj -e "(while true (print \"1\"))" | | pv > /dev/null | | 3.02MiB/s clj -e "(require '[clojure.java.io | :refer [copy]]) (while true (copy \"1\" *out*))" | pv > | /dev/null | | 3.53MiB/s clj -e "(while true (.println | System/out \"1\"))" | pv > /dev/null | | 5.06MiB/s | | Versions: PHP 8.1.6, Python 3.10.4, NodeJS v18.3.0, Babashka | v0.8.1, Clojure 1.11.1.1105 | marginalia_nu wrote: | > I believe this is because `while(1)` runs so fast that | there is no "idle" time for V8 to actually run GC. V8 is a | strange beast, and this is just a guess of mine. | | Java has (had) weird idiosyncrasies like this as well, well | it doesn't crash, but depending on the construct you can get | performance degradations depending on how the language | inserts safepoints (where the VM is at a knowable state and a | thread can be safely paused for GC or whatever). 
| | I don't know if this holds today, but I know there was a time | where you basically wanted to avoid looping over long-type | variables, as they had different semantics. The details are a | bit fuzzy to me right now. | wolfgang42 wrote: | _> > What's also interesting is that node crashes after about | a minute_ | | _> I believe this is because `while(1)` runs so fast that | there is no "idle" time for V8 to actually run GC. V8 is a | strange beast, and this is just a guess of mine._ | | Not exactly: the GC is still running; it's _live_ memory | that's growing unbounded. | | What's going on here is that WritableStream is non-blocking; | it has _advisory_ backpressure, but if you ignore that it | will do its best to accept writes anyway and keep them in a | buffer until it can actually write them out. Since you're not | giving it any breathing room, that buffer just keeps growing | until there's no more memory left. `process.nextTick()` is | presumably slowing things down enough on your system to give | it a chance to drain the buffer. (I see there's some | discussion below about this changing by version; I'd guess | that's an artifact of other optimizations and such.) | | To do this properly, you need to listen to the return value | from `.write()` and, if it returns false, back off until the | stream drains and there's room in the buffer again. | | Here's the (not particularly optimized) function I use to do | that: async function writestream(chunks, | stream) { for await (const chunk of chunks) { | if (!stream.write(chunk)) { // When write | returns null, stream is starting to buffer and we need to | wait for it to drain // (otherwise we'll | run out of memory!) await new | Promise(resolve => stream.once('drain', () => resolve())) | } } } | | I _do_ wish Node made it more obvious what was going on in | this situation; this is a very common mistake with streams | and it's easy to not notice until things suddenly go very | wrong. 
| | ETA: I should probably note that transform streams, | `readable.pipe()`, `stream.pipeline()`, and the like all | handle this stuff automatically. Here's a one-liner, though | it's not especially fast: node -e 'const | {Readable} = require("stream"); | Readable.from(function*(){while(1) yield | "1"}()).pipe(process.stdout)' | pv > /dev/null | Matthias247 wrote: | Are there still no async write functions which handle this | more easily than the old event-based mechanism? Waiting for | drain also sounds like it might reduce throughput since | then there is 0 buffered data and the peer would be forced | to pause reading. A "writable" event sounds more | appropriate - but the node docs don't mention one. | mg wrote: | Your node version indeed did not crash. Tried for 2 minutes. | | But using a longer string crashed after 23s here: | node -e 'function write() {process.stdout.write("111111111122 | 2222222233333333334444444444555555555566666666667777777777888 | 888888899999999990000000000"); process.nextTick(write)} | write()' | pv > /dev/null | capableweb wrote: | Hm, strange. With the same out of memory error as before or | a different one? Tried running that one for 2 minutes, no | errors here, and memory stays constant. | | Also, what NodeJS version are you on? | mg wrote: | Yes, same error as before. Memory usage stays the same | for a while, then starts to skyrocket shortly before it | crashes. | | node is v10.24.0. (Default from the Debian 10 repo) | capableweb wrote: | Huh yeah, seems to be an old memory leak. Running it on | v10.24.0 crashes for me too. | | After some quick testing in a couple of versions, it | seems like it got fixed in v11 at least (didn't test any | minor/patch versions). | | By the way, all versions up to NodeJS 12 (LTS) are "end | of life", and should probably not be used if you're | downloading 3rd party dependencies, as there are a bunch of | security fixes since then that are not being backported.
| captn3m0 wrote: | I used this exact issue today while pointing out how | Debian support dates can be misleading as packages | themselves aren't always getting fixes: | https://github.com/endoflife- | date/endoflife.date/issues/763#... | MaxBarraclough wrote: | Perhaps different approaches to caching? | | I'm reminded of this StackOverflow question, _Why is reading | lines from stdin much slower in C++ than Python?_ | | https://stackoverflow.com/q/9371238/ | xthrowawayxx wrote: | I find that NodeJS runs eventually out of memory and crashes | with applications that do a large amount of data processing | over a long time with little breaks even if there are no memory | leaks. | | Edit: I've found this consistently building multiple data | processing applications over multiple years and multiple | companies | rascul wrote: | I did the same test, but added a rust and bash version. My | results: | | Rust: 21.9MiB/s | | Bash: 282KiB/s | | PHP: 2.35MiB/s | | Python: 2.30MiB/s | | Node: 943KiB/s | | In my case, node did not crash after about two minutes. I find | it interesting that PHP and Python are comparable for me but | not you, but I'm sure there's a plethora of reasons to explain | that. I'm not surprised rust is vastly faster and bash vastly | slower, I just thought it interesting to compare since I use | those languages a lot. | | Rust: fn main() { loop { | print!("1"); } } | | Bash (no discernible difference between echo and printf): | while :; do printf "1"; done | pv > /dev/null | anon946 wrote: | For languages like C, C++, and Rust, the bottleneck is going | to mainly be system calls. With a big buffer, on an old | machine, I get about 1.5 GiB/s with C++. Writing 1 char at a | time, I get less than 1 MiB/s. 
$ ./a.out | 1000000 2000 | cat >/dev/null buffer size: 1000000, | num syscalls: 2000, perf:1578.779593 MiB/s $ ./a.out | 1 2000000 | cat >/dev/null buffer size: 1, num | syscalls: 2000000, perf:0.832587 MiB/s | | Code is: #include <cstddef> | #include <random> #include <chrono> #include | <cassert> #include <array> #include <cstdio> | #include <unistd.h> #include <cstring> | #include <cstdlib> int main(int argc, char | **argv) { int rv; | assert(argc == 3); const unsigned int n = | std::atoi(argv[1]); char *buf = new char[n]; | std::memset(buf, '1', n); const unsigned int | k = std::atoi(argv[2]); auto start = | std::chrono::high_resolution_clock::now(); for | (size_t i = 0; i < k; i++) { rv = write(1, | buf, n); assert(rv == int(n)); } | auto stop = std::chrono::high_resolution_clock::now(); | auto duration = stop - start; | std::chrono::duration<double> secs = duration; | std::fprintf(stderr, "buffer size: %d, num syscalls: %d, | perf:%f MiB/s\n", n, k, | (double(n)*k)/(1024*1024)/secs.count()); } | | EDIT: Also note that a big write to a pipe (bigger than | PIPE_BUF) may require multiple syscalls on the read side. | | EDIT 2: Also, it appears that the kernel is smart enough to | not copy anything when it's clear that there is no need. When | I don't go through cat, I get rates that are well above | memory bandwidth, implying that it's not doing any actual | work: $ ./a.out 1000000 1000 >/dev/null | buffer size: 1000000, num syscalls: 1000, perf: | 1827368.373827 MiB/s | mortehu wrote: | There's no special "no work" detection needed. a.out is | calling the write function for the null device, which just | returns without doing anything. No pipes are involved. | hderms wrote: | with Rust you could also avoid using a lock on STDOUT and get | it even faster! | skitter wrote: | Tested it, seems to about double the speed (from 22.3mb/s | to 47.6mb/s). 
| ur-whale wrote: | for the bash case, the cost of forking to write two chars is | overwhelming compared to anything related to I/O. | mauvehaus wrote: | Echo and printf are shell built-ins in bash[0]. Does it | have to fork to execute them? | | You could probably answer this by replacing printf with | /bin/echo and comparing the results. I'm not in front of a | Linux box, or I'd try. | | [0] | https://www.gnu.org/software/bash/manual/html_node/Bash-Buil... | ur-whale wrote: | > Echo and printf are shell built-ins in bash | | Ah, yeah, good point, I am wrong. | megous wrote: | There's no forking and it's writing one character. | megous wrote: | Rust also cheats. | | https://megous.com/dl/tmp/1046458b5b450018.png | cle wrote: | Seems like it's buffering output, which Python also does. | Python is much slower if you flush every write (I get 2.6 | MiB/s default, 600 KiB/s with flush=True). | | Interestingly, Go is very fast with an 8 KiB buffer (same as | Python's), I get 218 MiB/s. | [deleted] | cout wrote: | What version of node are you using? It seems to run | indefinitely on 14.19.3 that comes with Ubuntu 20.04. | GlitchMr wrote: | `process.stdout.write` is different from PHP's `echo` and | Python's `print` in that it pushes a write to an event queue | without waiting for the result, which could result in filling | the event queue with writes. Instead, you can consider `await`-ing | `write` so that it would write before pushing another `write` | to an event queue.
node -e ' | const stdoutWrite = | util.promisify(process.stdout.write).bind(process.stdout); | (async () => { while (true) { | await stdoutWrite("1"); } })(); | ' | pv > /dev/null | fasteo wrote: | Luajit using print and io.write LuaJIT | 2.1.0-beta3 | | Using print is about 17 MiB/s luajit -e "while | true do print('x') end" | pv > /dev/null | | Using io.write is about 111 MiB/s luajit -e | "while true do io.write('x') end" | pv > /dev/null | [deleted] | rhyn00 wrote: | Adding a few results: | | Using OP's code for following php 1.8mb/sec | python 3.8 Mb/sec node 1.0 Mb/sec | | Java print 1.3 Mb/sec echo 'class Code | {public static void main(String[] args) {while | (true){System.out.print("1");}}}' >Code.java; javac Code.java ; | java Code | pv>/dev/null | | Java with buffering 57.4 Mb/sec echo 'import | java.io.*;class Code2 {public static void main(String[] args) | throws IOException {BufferedWriter log = new BufferedWriter(new | OutputStreamWriter(System.out));while(true){log.write("1");}}}' | > Code2.java ; javac Code2.java ; java Code2 | pv >/dev/null | kuschku wrote: | Java can get even much much faster: https://gist.github.com/j | ustjanne/12306b797f4faa977436070ec0... | | That manages about 7 GiB/s reusing the same buffer, or about | 300 MiB/s with clearing and refilling the buffer every time | | (the magic is in using java's APIs for writing to | files/sockets, which are designed for high performance, | instead of using the APIs which are designed for writing to | stdout) | rhyn00 wrote: | Nice, that's pretty cool! | petercooper wrote: | I'll tell you what's fun. I get 5MB/sec with Python, 1.3MB/sec | with Node and.... 12.6MB/sec with Ruby! :-) (Added: Same speed | as Node if I use $stdout.sync = true though..) 
| nequo wrote: | For me: | | Python3: 3 MiB/s | | Node: 350 KiB/s | | Lua: 12 MiB/s lua -e 'while true do | io.write("1") end' | pv > /dev/null | | Haskell: 5 MiB/s loop = do putStr "1" | loop main = loop | | Awk: 4.2 MiB/s yes | awk '{printf("1")}' | pv > | /dev/null | VWWHFSfQ wrote: | Lua is an interesting one. while true do | io.write "1" end | | PUC-Rio 5.1: 25 MiB/s | | PUC-Rio 5.4: 25 MiB/s | | LuaJIT 2.1.0-beta3: 550 MiB/s <--- WOW | | They all go slightly faster if you localize the reference to | `io.write` local write = io.write | while true do write "1" end | yakubin wrote: | _> They all go slightly faster if you localize the | reference to `io.write`_ | | No noticeable difference for LuaJIT, which makes sense, | since JIT should figure it out without help. | bjoli wrote: | And this, folks, is why you have immutable modules. If | you know before runtime what something is, lookup is a | lot faster. | VWWHFSfQ wrote: | Ah yes you're right. Basically no difference with LuaJIT. | | 5.1 and 5.4 show about ~8% improvement. | dllthomas wrote: | Haskell can be even simpler: main = putStr | (repeat '1') | | [Edit: as pointed out below, this is no longer the case!] | | Strings are printed one character at a time in Haskell. This | choice is justified by unpredictability of the interaction | between laziness and buffering; I am uncertain it's the | correct choice, but the proper response is to use Text where | performance is relevant. | nequo wrote: | Wow, this does 160 MiB/s. That's a huge improvement! The | output of strace looks completely different: | poll([{fd=1, events=POLLOUT}], 1, 0) = 1 ([{fd=1, | revents=POLLOUT}]) write(1, | "11111111111111111111111111111111"..., 8192) = 8192 | poll([{fd=1, events=POLLOUT}], 1, 0) = 1 ([{fd=1, | revents=POLLOUT}]) write(1, | "11111111111111111111111111111111"..., 8192) = 8192 | | With the recursive code, it buffered the output in the same | way but bugged the kernel a whole lot more in-between | writes. 
Not exactly sure what is going on: | poll([{fd=1, events=POLLOUT}], 1, 0) = 1 ([{fd=1, | revents=POLLOUT}]) write(1, | "11111111111111111111111111111111"..., 8192) = 8192 | rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0 | clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, | tv_nsec=920390843}) = 0 rt_sigprocmask(SIG_SETMASK, | [], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, [INT], [], | 8) = 0 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, | {tv_sec=0, tv_nsec=920666397}) = 0 ... | rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 | poll([{fd=1, events=POLLOUT}], 1, 0) = 1 ([{fd=1, | revents=POLLOUT}]) write(1, | "11111111111111111111111111111111"..., 8192) = 8192 | dllthomas wrote: | I'm honestly surprised either of them wind up buffered! | That must be a change since I stopped paying as much | attention to GHC. | | I'm also not sure what's going on in the second case. | IIRC, at some point historically, a sufficiently tight | loop could cause trouble with handling SIGINT, so it | might be related to some overaggressive workaround for | that? | wazoox wrote: | On my extremely old desktop PC (Phenom II 550) running an | out-of-date OS (Slackware 14.2): | | Bash: while :; do printf "1"; done | ./pv > | /dev/null [ 156KiB/s] | | Python3 3.7.2: python3 -c 'while (1): print | (1, end="")' | ./pv > /dev/null [1,02MiB/s] | | Perl 5.22.2: perl -e 'while (true) {print 1}' | | ./pv > /dev/null [3,03MiB/s] | | Node.js v12.22.1: node -e 'while (1) | process.stdout.write("1");' | ./pv > /dev/null [ | 482KiB/s] | cle wrote: | A major contributing factor is whether or not the language | buffers output by default, and how big the buffer is. I don't | think NodeJS buffers, whereas Python does.
Here's some | comparisons with Go (does not buffer by default): | | - Node (no buffering): 1.2 MiB/s | | - Go (no buffering): 2.4 MiB/s | | - Python (8 KiB buffer): 2.7 MiB/s | | - Go (8 KiB buffer): 218 MiB/s | | Go program: f := | bufio.NewWriterSize(os.Stdout, 8192) for { | f.WriteRune('1') } | preseinger wrote: | In addition to buffering within the process, Linux (usually) | buffers process stdout with ~16KB, and does not buffer | stderr. | reincarnate0x14 wrote: | Not specifically addressed at you, but it's a bit amusing | watching a younger generation of programmers rediscovering | things like this, which seemed hugely important in like 1990 | but largely don't matter that much to modern workflows with | dedicated APIs or various shared memory or network protocols, | as not much that is really performance-critical is typically | piped back and forth anymore. | | More than a few old backup or transfer scripts had extra dd | or similar tools in the pipeline to create larger and semi- | asynchronous buffers, or to re-size blocks on output to | something handled better by the receiver, which was a big | deal on high speed tape drives back in the day. I suspect | most modern hardware devices have large enough static RAM and | fast processors to make that mostly irrelevant. | abuckenheimer wrote: | > python3 -c 'while (1): print (1, end="")' | pv > /dev/null | | python actually buffers its writes with print only flushing to | stdout occasionally, you may want to try: | python3 -c 'while (1): print (1, end="", flush=True)' | pv > | /dev/null | | which I find goes much slower (550Kib/s) | orf wrote: | Using `sys.stdout.write()` instead of `print()` gets ~8MiB/s on | my machine. | bfors wrote: | Love the subtle "stonks" overlay on the first chart ___________________________________________________________________ (page generated 2022-06-02 23:00 UTC)