[HN Gopher] Achieving 5M persistent connections with Project Loo... ___________________________________________________________________ Achieving 5M persistent connections with Project Loom virtual threads Author : genzer Score : 271 points Date : 2022-04-30 08:07 UTC (14 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | deepsun wrote: | How does that compare to Kotlin suspend functions? | jillesvangurp wrote: | Loom will make a great backend for Kotlin's co-routines. Roman | Elizarov (Kotlin language lead and the person behind Kotlin's | co-routine framework) has already confirmed that will happen, | and it makes a lot of sense. | | For those who don't understand this, Kotlin's co-routine | framework is designed to be language neutral and already works | on top of the major platforms that have Kotlin compilers (native, | JavaScript, JVM, and soon wasm). So, it doesn't really compete | with the "native" way of doing concurrent, asynchronous, or | parallel computing on any of those platforms but simply | abstracts the underlying functionality. | | It's actually a multi-platform library that implements all the | platform-specific aspects in the platform-appropriate way. It's | also very easy to adapt existing frameworks in this space via | Kotlin extension functions, and the JVM implementation actually | ships out of the box with such functions for most common | solutions on the JVM (Java's threads, futures, | threadpools, etc., Spring Flux, RxJava, Vert.x, and so on). Loom | will be just another solution in this long list. | | If you use Spring Boot with Kotlin, for example, rather than | dealing with Spring's Flux, you simply define your asynchronous | resources as suspend functions. Spring does the rest. | | With Kotlin-js in a browser you can call Promise.toCoroutine() | and async { ... }.asPromise().
That makes it really easy to | write asynchronous event handling in a web application, for | example, or to work with JavaScript APIs that expect promises from | Kotlin. And if you use web-compose, fritz2, or even React with | kotlin-js, anything asynchronous you'd likely be dealing with | via some kind of co-routine and suspend functions. | | Once Loom ships, it will basically enable some nice low-level | optimizations in the JVM implementation for co-routines, and | there will likely be some new extension functions | to adapt the various new Java APIs. Not a big deal, but | it will probably be nice for situations with extremely large | numbers of co-routines and lots of IO. Not that it's particularly | struggling there, of course, but every little bit helps. It's not | likely to require any code updates either. When the time comes, | simply update your JVM and co-routine library and you should be | good to go. | richdougherty wrote: | I made a comment about this above: | https://news.ycombinator.com/item?id=31218826 | | I won't repeat it all, but the main point is that having | runtime support is much better than relying on compiler | support, even if the compiler support is pretty fantastic. | | Note that the two aren't mutually exclusive: you should still be | able to use coroutines after Project Loom ships, and they still | might make sense in many places. | torginus wrote: | While I can't answer the question directly, there is an article | about C#'s async/await vs Go's goroutines which compares the | two approaches, and while some of the findings are probably stack- | specific, a lot of it is probably intrinsic to the approach: | | - Green threads scale somewhat better, but both scale | ridiculously well, meaning you probably won't run into scaling | issues.
| | - async/await generators use way less memory than a dedicated | green thread; this affects both memory consumption and startup | time, since the process has to keep asking the OS for | more memory | | - green threads are faster to execute | | Here's the link: | | https://alexyakunin.medium.com/go-vs-c-part-1-goroutines-vs-... | Andrew_nenakhov wrote: | Sounds like a job for Erlang. | speed_spread wrote: | Sounds like Erlang's out of a job. | cheradenine_uk wrote: | I think a lot of people are missing the point. | | Go look at the source code. Look at how simple it is - anyone who | has created a thread with Java knows what's happening. With only | minor tweaks, this means your pre-existing code can take | advantage of this with, basically, no effort. And it retains all | the debuggability of traditional Java threads (i.e. a stack trace | that makes sense!) | | If you've spent any time at all dealing with the horrors of C# | async/await (Why am I here? Oh, no idea) and its doubling of | your APIs to support function colouring - or you've fought with | the complexities of reactive solutions in the Java space -- | often, frankly, in the name of "scalability" that will never be | practically required -- this is a big deal. | | You no longer have to worry about any of that. | pjmlp wrote: | Or inserting the occasional Task.Run() calls as a means to | avoid changing the whole call stack up to Main(). | gavinray wrote: | This hasn't been that much of a problem, IME. | | If you decide somewhere deep in your program that you want to use | async operations, most languages allow you to keep the | invoking function/closure synchronous and return some kind of | Promise/Future-like value. | pjmlp wrote: | Which is exactly the workaround with Task.Run(): being able | to integrate a library written with async/await into | codebases older than the feature, where no one is paying | for a full rewrite.
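For a sense of how simple the thread-per-connection style reads with Loom, here is a minimal sketch of a blocking echo server (my own illustration, not code from the linked repo), assuming JDK 21+ or JDK 19 with preview features enabled; the class name and port are arbitrary:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Thread-per-connection echo server: one cheap virtual thread per accepted
// socket, using only plain blocking java.io/java.net calls -- no callbacks,
// no function colouring, and stack traces that make sense.
public class EchoServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(9000);
             ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            while (true) {
                Socket socket = server.accept();      // blocks only this virtual thread
                executor.submit(() -> echo(socket));  // millions of these can coexist
            }
        }
    }

    static void echo(Socket socket) {
        try (socket;
             InputStream in = socket.getInputStream();
             OutputStream out = socket.getOutputStream()) {
            in.transferTo(out);                       // stream bytes straight back
        } catch (IOException ignored) {
            // connection dropped; the virtual thread simply exits
        }
    }
}
```

Swapping `newVirtualThreadPerTaskExecutor()` for a platform-thread pool is the only change needed to compare the two models, which is what makes the benchmark's code so small.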
| SemanticStrengh wrote: | Except Kotlin coroutines already work, can be very easily | integrated into existing Java codebases, and are far superior | to Loom (structured concurrency, Flow, etc.) | richdougherty wrote: | Kotlin coroutines are amazing. They're built on very clever | tech that converts fairly normal source code into a state | machine when compiled. This has huge benefits and allows the | programmer to break their code up without the hassle of | explicitly programming callbacks, etc. | | https://kotlinlang.org/spec/asynchronous-programming-with-co... | | However... an unavoidable fact is that converted code works | differently to other code. The programmer needs to know the | difference. Normal and converted code compose together | differently. The Kotlin compiler and type system help keep | track, but they can't paper over everything. | | Having lightweight thread and continuation support directly | in the VM makes things very much simpler for programmers (and | compiler writers!), since the VM can handle the details of | suspending/resuming and code composes together effortlessly, | even without compiler support, so it works across languages | and codebases. | | I don't want to be critical of Kotlin. It's amazing what | it achieves and I'm a big fan of this stuff. Here are some | notes I wrote on something similar, Scala's experiments with | compile-time delimited continuations: | https://rd.nz/2009/02/delimited-continuations-in-scala_24.ht... | | I think this is a general principle about compiler features | vs runtime features. Having things in the runtime makes life | a lot easier for everyone, at the cost of runtime complexity, | of course. | | Another one I'd like to see is native support for tail calls | in Java. Kotlin, Scala, etc. have to do compile-time tricks to | get basic tail call support, but it doesn't work well across | functions.
| | Scala and Kotlin both ask the programmer to add annotations | where tail calls are needed, since the code gen so often | fails. | | https://kotlinlang.org/docs/functions.html#tail-recursive-fu... | | https://www.scala-lang.org/api/3.x/scala/annotation/tailrec.... | | https://rd.nz/2009/04/tail-calls-tailrec-and-trampolines.htm... | | As a side note, I can see that tail calls are planned for | Project Loom too, but I haven't heard if that's implemented | yet. Does anyone know the status? | | "Project Loom is intended to explore, incubate and deliver | Java VM features and APIs built on top of them for the | purpose of supporting easy-to-use, high-throughput | lightweight concurrency and new programming models on the | Java platform. This is accomplished by the addition of the | following constructs: | | * Virtual threads | | * Delimited continuations | | * Tail-call elimination" | | https://wiki.openjdk.java.net/display/loom/Main | SemanticStrengh wrote: | Coroutines are _much less_ coloured than async/await | programming, though, since functions return resolved types | directly instead of futures. But yes, there is the notion of | coroutine scope, and I don't see how to suppress it without | making it less expressive. | | Very few people know it, but Oracle is developing an | alternative to Loom, in parallel: | https://github.com/oracle/graal/pull/4114 | | BTW I expect Kotlin coroutines to leverage Loom eventually. | | As for the tailrec keyword, it is not a constraint | but a feature, since it guarantees at the type level that | the function cannot stack overflow. Few people know there | is an alternative to tailrec that can make any | function stack-overflow safe by leveraging the heap via | continuations: | https://kotlinlang.org/api/latest/jvm/stdlib/kotlin/-deep-re...
| | As for Java, there is universal support for tail recursion | at the bytecode level: https://github.com/Sipkab/jvm-tail-recursion | ohgodplsno wrote: | > Coroutines are much less coloured than async/await | programming though since functions return resolved types | directly instead of futures | | Only because the compiler does its magic behind the | scenes and transforms it into bytecode that takes a | lambda with a continuation. Try calling a suspend | function from Java or starting a job and surprise, it's | continuations all the way down. | SemanticStrengh wrote: | Yes, interfacing with Java is generally done via RxJava | and Reactor. Interfacing is easy, but nobody wants to | use RxJava and Reactor in the first place... I wonder | whether Loom will enable easier interop and make the magic | work from the Java side's POV. | gavinray wrote: | Thanks for posting that link to the Java tail recursion | library, super handy + didn't know about it. You frequently need | tail recursion for writing expression evaluators/visitors. | | I've been using an IntelliJ extension that can do magic | by rewriting recursive functions to stateful stack-based | code for performance, but it spits out very ugly code: | | https://github.com/andreisilviudragnea/remove-recursion-insp... > "This inspection detects | methods containing recursive calls (not just tail | recursive calls) and removes the recursion from the | method body, while preserving the original semantics of | the code. However, the resulting code becomes rather | obfuscated if the control flow in the recursive method is | complex." | | It was this guy's whole Bachelor thesis, I guess: | | https://github.com/andreisilviudragnea/remove-recursion-insp... | bullen wrote: | Agreed it's simpler, but using NIO with one OS thread per core | also has its benefits. | | The context switch (however small) will cause latency when | this solution is at saturation.
| | I think they should write four tests: fiber, NIO, and each with | userspace networking (no kernel copying network memory), and | compare them. | | Why Oracle is stalling on removing the kernel from Java networking | is surprising to me; they already have a VM. | blibble wrote: | there's still a context switch with NIO, you're just doing it | manually | pron wrote: | https://github.com/ebarlas/project-loom-comparison | vlovich123 wrote: | Shouldn't you be able to send authorization and | authentication requests in parallel in the async and | virtual threads cases? | threeseed wrote: | It is just an example, so they could do anything. | | But in the real world it is common to need information | from the authentication stage to use in the authorization | stage. For example, you may have a user login with an | email address/password which you then pass to an LDAP | server in order to get a userId. This userId is then used | in a database to determine which objects/groups they have | access to. | the8472 wrote: | net.netfilter.nf_conntrack_buckets = 1966050 | net.netfilter.nf_conntrack_max = 7864200 | | or avoid conntrack entirely | LinuxBender wrote: | For completeness' sake, I would add that one must also set | options nf_conntrack expect_hashsize=X hashsize=X | | in /etc/modules.d/nf_conntrack.conf, X being 1/4 the size of | conntrack_max | metabrew wrote: | API for the server example looks... actually good, wow. Nice job! | | Also tickled to see my erlang 1M comet blog post referenced. A | lifetime ago now, pre-websockets. | alberth wrote: | Is this a test of just having 5M people knock on your door? | | Or is this a test where something actually happens (data | exchanges) with each connection? | | I ask because those are two totally different workloads, and | typically it's in the latter test that Erlang shines. | bufferoverflow wrote: | It's an echo server. The client sends the data, the server | responds with the same data.
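The pipeline threeseed describes can be written in the same straightforward blocking style the benchmark uses: authenticate sequentially (it yields the userId), then fan out the lookups that depend only on that userId in parallel. A sketch, where `authenticate`, `fetchGroups`, and `fetchProfile` are hypothetical stand-ins for the LDAP and database calls (JDK 19+ assumed for the virtual-thread executor):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Authenticate first, then run the independent follow-up lookups
// concurrently, each on its own cheap virtual thread.
public class AuthPipeline {
    // Hypothetical stand-ins for the real LDAP/database calls.
    static String authenticate(String email, String password) { return "user-42"; }
    static String fetchGroups(String userId)  { return "admins"; }
    static String fetchProfile(String userId) { return "profile-of-" + userId; }

    public static String handle(String email, String password) throws Exception {
        String userId = authenticate(email, password);  // must complete first
        try (ExecutorService ex = Executors.newVirtualThreadPerTaskExecutor()) {
            // These two blocking calls run in parallel; get() just blocks
            // the handler's own virtual thread until both are done.
            Future<String> groups  = ex.submit(() -> fetchGroups(userId));
            Future<String> profile = ex.submit(() -> fetchProfile(userId));
            return userId + ":" + groups.get() + ":" + profile.get();
        }
    }
}
```

This answers vlovich123's question in the affirmative for the parts that are actually independent: only the authentication step forces sequencing, because its output feeds the rest.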
| newskfm wrote: | sgtnoodle wrote: | I'm not a Java programmer. I tried clicking three layers deep into | links, but still have no idea what virtual threads are in this | context. Is it a userspace thread implementation? | | I've used explicit context-switching syscalls to "mock out" | embedded real-time OS task-switching APIs. It's pretty fun and | useful. The context switching itself may not be any faster than | if the kernel does it, but the fact that it's synchronous to your | program flow means that you don't have to spend any overhead | synchronizing with mutexes, queues, etc. (You still have them; they | just don't have to be thread-safe.) | grishka wrote: | > Is it a userspace thread implementation? | | Yes. | zinxq wrote: | Loom sets out to give you a sane programming paradigm similar to | what threads give you (i.e. as opposed to programming asynchronous I/O | in Java with some type of callback) without the overhead of | operating system threads. | | That's a very cool and noble pursuit. But the title of this | article might as well have been "5M persistent connections with | Linux", because that's where the magic 5M connections happen. | | I could also attempt 5M connections at the Java level using Netty | and asynchronous IO - no threads or Loom. Again, it'd take more | Linux configuration than anything else. Once that configuration | is in place, though, you can also do it with C# async/await, | JavaScript, and I'm sure Erlang and anything else that does | asynchronous I/O, whether it's masked by something like | Loom/async/await or not. | simulate-me wrote: | As the GP said, what's cool about this is how simple the code | is. You might be able to achieve 5M connections in Java using | an event-loop-based solution (e.g. Netty), but if the connection | handlers need to do any async work, then they also need to be | written using an event loop, which is not how most people write | Java.
Simply put, 5M connections was not possible using Java in | the way most people write Java. | [deleted] | pron wrote: | It is true that the experiment exercises the OS, but that's | only _part_ of the point. The other part is that it uses a | simple, blocking, thread-per-request model with Java 1.0 | networking APIs. So this is "achieving 5M persistent | connections with (essentially) 26-year-old code that's fully | debuggable and observable by the platform." This stresses both | the OS and the Java runtime. | | So while you could achieve 5M in other ways, those ways would | not only be more complex, but also not really | observable/debuggable by Java platform tools. | cheradenine_uk wrote: | This. | | Writing the sort of applications that I get involved with, | it's frequently the case that, whilst one OS thread per Java | thread was a theoretical scalability limitation, in practice | we were never likely to hit it (and there was always the 'get | a bigger computer' option). | | But: the complexity mavens inside our company, and projects we | rely upon, get bitten by an obsessive need to chase | 'scalability' /at all costs/. Which is fine, but the downside | is that the negative consequences of coloured functions | come into play. We end up suffering, having to deal with | vert.x or Kotlin or whatever flavour-of-the-month solution it | is that is /inherently/ harder to reason about than a linear | piece of code. If you're in a C# project and you get a | library that's async, boom, game over. | | If Loom gets even within performance shouting distance of | those other models, it ought to kill (for all but the | edgiest of edge-cases) reactive programming in the Java space | dead. You might be able to make a case - obviously depending | on your use cases, which are not mine - that extracting, say, | 50% more scalability is worth the downsides. If that number | is, say, 5%, then for the vast majority of projects the | answer is going to be 'no'.
| | I say 'ought to', as I fear the adage that "developers love | complexity the way moths love flames - and often with the | same results". I see both engineers and projects (Hibernate | and Keycloak, IIRC) that have a great deal of themselves invested | in their Rx position, and I already sense that they're not | going to give it up without a fight. | | So: the headline number is less important than "for virtually | everyone, you will no longer have to trade simplicity for | scalability". I can't wait! | amluto wrote: | Threads (whether lightweight or heavyweight) can't fully | replace reactive/proactive/async programming, even ignoring | performance and scalability. Sometimes network code simply | needs to wait for more than one event as a matter of | functionality. For example, a program might need to handle | the availability of outgoing buffer space and _also_ handle | the availability of incoming data. And it might also need | to handle completion of a database query or incoming data | on a separate connection. Sure, using extra threads might | do it, but it's awkward. | pron wrote: | > Sure, using extra threads might do it, but it's | awkward. | | It's simpler and nicer, actually -- and definitely offers | better tooling and observability -- especially with | structured concurrency: https://download.java.net/java/early_access/loom/docs/api/jd... | mike_hearn wrote: | A couple of points to consider. | | 1. Demanding scalability for inappropriate projects and at | any cost is something I've seen too, and on investigation | it was usually related to former battle scars. A software | system that stops scaling at the wrong time can be horrific | for the business. Some of them never recover, the canonical | example being MySpace, but I've heard of other examples | that were less public. In finance, entire multi-year IT | projects by huge teams have failed and had to be scrapped | because they didn't scale to even current business needs, | let alone future needs.
Emergency projects to make | something "scale" because new customers have been on-boarded, | or business requirements changed, are the sort of | thing nobody wants to get caught up in. Over time these | people graduate into senior management where they become | architects who react to those bad experiences by insisting | on making scalability a checkbox to tick. | | Of course there's also trying to make easy projects more | challenging, resume-driven development, etc. too. It's not | just that. But that's one way it can happen. | | 2. Rx-type models aren't just about the cost of threads. An | abstraction over a stream of events is useful in many | contexts, for example, single-threaded GUIs. | cheradenine_uk wrote: | I think my point is more that you end up having to pay | the costs (of Rx-style APIs) whether you need the | scalability or not, because the libraries end up going | down that route. It has sometimes felt as though I'm being | forced to do work in order to satisfy the fringe needs of | some other project! | | And sure, if you are living in a single-threaded | environment, your choices are somewhat limited. I, | personally, dislike front-end programming for exactly | that reason - things like RxJS feel hideously | overcomplicated to me. My guess is that most, though not | all, will much prefer the Loom-style threading over | async/await given free choice. | lostcolony wrote: | One additional point - as noted, it's been 26 years since | Java's founding. Project Loom has been around since at | least 2018 and still has no release date. It'll be cool | for Java projects whenever it comes out, but I | just... have a hard time caring right now. I can't use it | for old codebases currently, and in new codebases I'm not | using one request per Java thread anyway (tbh - when it's | my choice I'm not choosing the JVM at all). The space has | moved, and continues to move.
In no way is this to say the JVM | shouldn't be adopting the good ideas that come along the | way - that is one of the benefits of being as conservative | and glacial in adoption as it is - but I just... don't get | excited about them, or find myself in any position in | relation to the JVM (Java specifically, but the | fundamentals affect other languages) other than "ugh, | this again". | chrisseaton wrote: | > I'm not using one request per Java thread anyway | | The point is with Loom you can, and you can stop putting | everything into a continuation and go back to straight-line | code. | lostcolony wrote: | >> The point is with Loom you can | | The point I was making is that Loom isn't released, | stable, production-ready, supported, etc., and there's | still no date when it's supposed to be, so what you can | do with Loom in no way affects what I can do with a | production codebase, either new or legacy. I'm not sure | how you missed that from my post. | | I'm not defending reactive programming on the JVM. I'm | also not defending threads as units of concurrency. I'm | saying I can get the benefits of Project Loom -right | now-, in production-ready languages/libraries, outside of | the JVM, and I can't reasonably pick Project Loom if I | want something stable and supported by its creators. | pron wrote: | > and there's still no date when it's supposed to be | | September 20 (in Preview) | | > I'm saying I can get the benefits of Project Loom | -right now-, in production ready languages/libraries, | outside of the JVM | | Only sort of. The only languages offering something | similar in terms of programming model are Erlang | (/Elixir) and Go -- both inspired virtual threads. But | Erlang doesn't offer similar performance, and Go doesn't | offer similar observability. Neither offers the same | popularity. | lostcolony wrote: | I'm not saying there aren't tradeoffs, just that if I | need the benefits of virtual threads... I have other | options.
I'm all for this landing | on the JVM, mainly so that non-Java languages there can | take advantage of it, rather than the hoops they currently | have to jump through to offer a saner concurrency model, | but until it does... I don't care. And last I saw, this | feature is proposed to land in preview in JDK 19 - not | that it's guaranteed to - and... it's still preview. | Meaning the soonest we can expect to see this safely | available to production code is next year (preview in Java | is a bit weird, admittedly; "this is not experimental, but | we can change any part of it or remove it in future | versions depending how things go" was basically my take on | it when I looked in the past). | | Meanwhile, as you say, Erlang/Elixir gives me this model | with 35+ years of history behind it (and no | libraries/frameworks in use trying to provide me a leaky | abstraction of something 'better'), better observability | than the JVM, a safer memory model for concurrent code, and a | better model for reliability, with the main issue being | the CPU hit (less of a concern for IO-bound workloads, | which is where this kind of concurrency is generally | impactful anyway). Go has reduced observability compared to | Java, sure, but a number of other tradeoffs I personally | prefer (not least of all because in most of the Java | shops I was in, I was the one most familiar with | profiling and debugging Java. The tools are there; the | experience amongst the average Java developer isn't), and | will also be releasing twice between now and next year.
| | Again, I'm not saying virtual threads from Loom aren't | cool (in fact, I said they were; the technical | achievement of making them a drop-in replacement is itself | incredible), or that they won't be useful when they release | for those choosing Java, stuck with Java for legacy | reasons, or using a JVM language that will now be able | to migrate to take advantage of this to remove some of | the impedance mismatch between their concurrency model(s) | and Java's threading, and the resulting caveats. Just that | I don't care until it does (because I've been hearing | about it for the past 4 years); it still doesn't put Java | on par with the models other languages have adopted | (the memory model matters to me quite a bit, since I tend to | care about correct behavior under load more than raw | performance numbers; that said, of course, nothing is | preventing people from adopting safer practices | there... just like nothing has been in years previous. | They just... haven't), nor do I care about the claims | people make about it displacing X, Y, or Z. It probably | will for new code! Whenever it gets fully supported in | production. But there's still all that legacy code | written over the past two decades using libraries and | frameworks built to work around Java's initial 1:1 | threading model, and which, simply due to calling | conventions and architecture (i.e., reactive, etc.), | would have to be rewritten - which probably won't happen | due to the reality of production projects, even if there | were clear gains in doing so (which, as the great- | grandparent mentions, is not nearly so clear-cut). | namdnay wrote: | And hopefully we can bury Reactor Core in the garden and | never talk about it again | Scarbutt wrote: | What has the space moved to? | pron wrote: | > and still has no release date | | JEP 425 has been proposed to target JDK 19, out September | 20.
It will first be a "Preview" feature, which means | supported but subject to change, and if all goes well it | would normally be out of Preview two releases, i.e. one | year, after that. | | > I'm not using one request per Java thread anyway | | You don't have to, but note that _only_ the thread-per-request | model offers you world-class observability/debuggability. | | > other than "ugh, this again". | | Ok, although in 2022, the Java platform is still among | the most technologically advanced, state-of-the-art | software platforms out there. It stands shoulder to | shoulder with clang and V8 on compilation, and beats | everything else on GC and low-overhead observability | (yes, even eBPF). | zinxq wrote: | I think we're in agreement. Ignoring what's under the hood, Loom's | programming paradigm (from the viewpoint of control flow) is | the threading programming paradigm. (Virtual-)thread-per-connection | programming is easier and far more intuitive than | asynchronous (i.e. callback-esque) programming. | | I still maintain, though: the 5M connections in this example is | still a red herring. | | Can we get to 6M? Can we get to 10M? Is that a question for | Loom or Java's asynchronous IO system? No - it's a question | for the operating system. | | Loom and Java NIO can probably handle a billion connections | as programmed. Java Threads cannot - although that too is a | broken statement. "Linux threads cannot" is the real | statement. You can't have that many for resource reasons. | Java Threads are just a thin abstraction on top of them. | | Linux out of the box can't do 5M connections (last I | checked). It takes Linux tuning artistry to get it there. | | Don't get me wrong - I think Loom is cool. It attempts to | do the same thing as async/await tried - just better. But it | is most definitely not the only way to achieve 5MM | connections with Java or anything else. Possibly, however, | it's the most friendly and intuitive way to do it.
| | *We typically vilify Java Threads for the RAM they consume - | something like 1 MB per thread (tunable). Loom must still | use "some" RAM per connection, although surely far, far less | (and of course Linux must use some amount of kernel RAM per | connection too). | pron wrote: | > But it is most definitely not the only way to achieve 5MM | connections with Java or anything else. Possibly however, | it's the most friendly and intuitive way to do it. | | It is the only way to achieve that many connections with | Java in a way that's debuggable and observable by the | platform and its tools, regardless of its intuitiveness or | friendliness to human programmers. It's important to | understand that this is an objective technical difference, | and one of the cornerstones of the project. Computations | that are composed in the asynchronous style are invisible | to the runtime. Your server could be overloaded with I/O, | and yet your profile will show idle thread pools. | | Virtual threads don't just allow you to write something you | could do anyway in some other way. They actually do work | that has simply been impossible so far at that scale: they | allow the runtime and its tools to understand how your | program is composed and observe it at runtime in a | meaningful and helpful way. | | One of the main reasons so many companies turn to Java for | their most important server-side applications is that it | offers unmatched observability into what the program is | doing (at least among other languages/platforms with | similar performance). But that ability was missing for | high-scale concurrency. Virtual threads add it to the | platform. | mike_hearn wrote: | I don't quite follow your argument. | | Saying "Linux cannot handle 5M connections with one thread | per connection" isn't a reasonable statement, because no | operating system can do that; they can't even get close.
| The resource usage of a kernel thread is defined by pretty | fundamental limits in operating system architecture - | namely, that the kernel doesn't know anything about the | software using the thread. Any general-purpose kernel will | be unable to provision userspace with that many threads | without consuming infeasible quantities of RAM. | | The reason JVM virtual threads can do this is that the | JVM has deep control and understanding of the stack and the | heap (it compiled all the code). The reason Loom | scalability gets worse if you call into native code is that | then you're back to not controlling the stack. | | Getting to 10M is therefore very much a question for the | JVM as well as the operating system. It'll be heavily | affected by GC performance with huge heaps, which luckily | modern G1 excels at; it'll be affected by the performance | of the JVM's userspace schedulers (ForkJoinPool etc.); it'll | be affected by the JVM's internal book-keeping logic and | many other things. It stresses every level of the stack. | pron wrote: | For more information about virtual threads see | https://openjdk.java.net/jeps/425 (planned to preview in JDK 19, | out this September). | | What's remarkable about this experiment is that it uses simple | 26-year-old (Java 1.0) networking APIs. | midislack wrote: | I see a lot of these making the FP of HN. But it's very difficult | to be impressed or unimpressed, because it's all about hardware. | How much hardware is everybody throwing at all of this? 5M | persistent connections on a Pi with mere GigE? Pretty frickin' | amazing. 5M persistent connections on a Threadripper with 128 | cores and a dozen trunked 4-port 10GE NICs? Yaaaaawwwnnn snooze. | | We need a standardized computer for benchmarking these types of | claims. I propose the RasPi 4 4GB model. Everybody can find one, | all the hardware's soldered on so no cheating is really possible, | etc. Then we can really shoot for efficiency.
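The footprint argument above is easy to poke at with a small experiment: park a large number of virtual threads simultaneously, which would exhaust memory quickly at ~1 MB of kernel stack per OS thread. A sketch, assuming JDK 19+ with preview features (or JDK 21+); the count is arbitrary:

```java
import java.util.concurrent.CountDownLatch;

// Spawn `count` virtual threads and park them all at the same time. The JVM
// owns their stacks (stored as heap objects), so a parked virtual thread
// costs far less than a kernel thread's stack allocation.
public class ManyThreads {
    public static int parkAll(int count) throws InterruptedException {
        CountDownLatch started = new CountDownLatch(count);
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < count; i++) {
            Thread.startVirtualThread(() -> {
                started.countDown();
                try {
                    release.await();   // park: the carrier OS thread is freed
                } catch (InterruptedException ignored) {
                }
            });
        }
        started.await();               // all `count` threads alive and parked at once
        release.countDown();           // let them all finish
        return count;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(parkAll(1_000_000) + " virtual threads parked concurrently");
    }
}
```

Attempting the same with `new Thread(...)` (platform threads) stalls or fails long before a million, which is exactly the kernel-thread limit mike_hearn describes.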
| shadowpho wrote: | Raspberry pi 4 performance changes wildly based on cooling. | Bare die vs heatsink vs heatsink + fan will give you wildly | different results. | midislack wrote: | Same is true with any computer these days. So let's go no | heat sink, Pi 4 4GB anyway. | KingOfCoders wrote: | Something to learn for everybody, the article is mainly about | Linux tuning. | jeroenhd wrote: | The Linux tuning part seems to have been inspired by these blog | posts from 14 years ago: | https://www.metabrew.com/article/a-million-user-comet-applic... | | It's almost a little disappointing that beefy modern servers | only manage a 5x scale improvement, though that could be due to | the differences in runtime behaviour between Erlang and the | JVM. | wiradikusuma wrote: | The experiment is about a Java app, but the tweaks are at the O/S | level. Does it mean any app (Java/not, Loom/not) can achieve | the target given the correct tweaks? | | Also, why are these not default for the O/S? What are we | compromising by setting those values? | mike_hearn wrote: | No, it doesn't. The reason the tweaks are at the OS level is | because, apparently, Loom-enabled JVMs already scale up to that | level without needing any tuning. But if you try that in C++ | you're going to die very quickly. | pjmlp wrote: | With C++ co-routines and a runtime like HPX, not really. | | However there are other reasons why a C++ application | connected to the internet might indeed die faster than a Java | one. | gpderetta wrote: | There have been userspace thread libraries for c++ for | decades. | yosefk wrote: | Sure, I wrote some myself. Q is what libraries you can use | on top of the userspace thread package that are aware of | the userspace threads rather than just using OS APIs and | thus eg blocking the current OS thread. | gpderetta wrote: | There are .so interposition tricks that can be used for | that. | | I think Pth used to do that for example. | yosefk wrote: | Could you elaborate?
| toast0 wrote: | Both your operating system and your application | environment need to be up to the task. I'd expect most | operating systems to be up to the task; although it might need | settings set. Some of the settings are things that are | statically allocated in non-swappable memory and you don't want | to waste memory on being able to have 5M sockets open if you | never go over 10k. Often you'll want to reduce socket buffers | from defaults, which will reduce throughput per socket, but | target throughput per socket is likely low or you wouldn't want | to cram so many connections per server. You may need to | increase the size of the connection table and the hash used for | it as well; again, it wastes non-swappable ram to have it too | big if you won't use it. | | For the application level, it's going to depend on how you handle | concurrency. This post is interesting, because it's a benchmark | of a different way to do it in Java. You could probably do 5M | connections in regular Java through some explicit event loop | structure; but with the Loom preview, you can do it connection | per Thread. You would be unlikely to do it with connection per | Thread without Loom, since Linux threads are very unlikely to | scale so high (but I'd be happy to read a report showing 5M | Linux threads). | jiggawatts wrote: | There are always trade-offs. It would be very rare for any server | to reach even 100K concurrent connections, let alone 5M. | Optimising for that would be optimising for the 0.000001% case | at the expense of the common case. | | Some back of the envelope maths: | https://www.wolframalpha.com/input?i=100+Gbps+%2F+5+million | | If the server had a 100 Gbps Ethernet NIC, this would leave | just 20 kbps for each TCP connection. | | I could imagine some IoT scenarios where this _might_ be a | useful thing, but outside of that? I doubt there's anyone that | wants 20 kbps throughput in this day and age...
| | It's a good stress test however to squeeze out inefficiencies, | super-linear scaling issues, etc... | jeroenhd wrote: | 20kbps should be sufficient for things like chat apps if you | have the CPU power to actually process chat messages like | that. Modern apps also require attachments and those will | require more bandwidth, but for the core messaging | infrastructure without backfilling a message history I think | 20kbps should be sufficient. Chat apps are bursty, after all, | leaving you with more than just the average connection speed | in practice. | henrydark wrote: | I have a memory of some chat site, maybe discord, sending | attachments to a different server, thus trading the | bandwidth problem for extra system complexity | jeroenhd wrote: | That's how I'd solve the problem. The added complexity | isn't even that high: give the application an endpoint to | push an attachment into a distributed object store of | your choice, submit a message with a reference to the | object, and persist it the moment the chat message was | sent. This could be done with mere bytes for the message | itself and some very dumb anycast-to-s3 services in | different data centers. | | I'm sure I'm skipping over tons of complexity here (HTTP | keepalives binding clients to a single attachment host, | for example) because I'm no chat app developer, but the | theoretical complexity is still relatively low. | Koffiepoeder wrote: | Open, idle websockets can be a use case for a large number of | TCP connections with a small data footprint. | jeffbee wrote: | Also IMAP has this unfortunate property. | wiseowise wrote: | And how is that any different from Kotlin coroutines if you still | need to call Thread.startVirtualThread? | pjmlp wrote: | Native VM support instead of an additional library faking it, and | filling .class files with needless boilerplate. | ferdowsi wrote: | Kotlin coroutines are colored and infect your whole codebase. | Virtual threads do not. | pron wrote: | 1.
These are actual threads from the Java runtime's | perspective. You can step through them and profile them with | existing debuggers and profilers. They maintain stacktraces and | ThreadLocals just like platform threads. | | 2. There is no need for a split world of APIs, some designed | for threads and others for coroutines (so-called "function | colouring"). Existing APIs, third-party libraries, and programs | -- even those dating back to Java 1.0 (just as this experiment | does with Java 1.0's java.net.ServerSocket) -- just work on | millions of virtual threads. | | Normally, you wouldn't even call Thread.startVirtualThread(), | but just replace your platform-thread-pool-based | ExecutorService with an ExecutorService that spawns a new | virtual thread for each task | (Executors.newVirtualThreadPerTaskExecutor()). For more | details, see the JEP: https://openjdk.java.net/jeps/425 | imranhou wrote: | It looks closer to goroutines, which to me raises the | question - where are the channels that I could use to communicate | between these virtual threads? | sdfgdfgbsdfg wrote: | In a library. Loom is more about adapting the JVM itself for | continuations and virtual threads than adding to userspace. | [deleted] | adra wrote: | Go's channels are simplistically a mutex in front of a queue. | Java has many existing objects that can do the same; it's just | that it's not the idiomatic choice to do the same. Since green | threads should wake up from Object.notify(), any threads | blocking on the monitor should wake/consume. I'm curious how | a green-thread ConcurrentDeque would stand up to Go's channel | in scalability/performance. | Matthias247 wrote: | You are right. But Go channels also come with the superpower | of "select", which allows waiting for multiple objects to | become ready, with atomic execution of actions. I don't think | this part can be retrofitted on top of simple BlockingQueues.
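adra's suggestion above can be made concrete: a bounded BlockingQueue between two virtual threads behaves much like a buffered Go channel -- put() blocks when the buffer is full, take() blocks when it is empty, and with virtual threads "blocks" is cheap. A hedged sketch, assuming JDK 21+; the class and method names are mine:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;

public class ChannelSketch {
    // Producer/consumer over a bounded queue -- roughly a buffered Go channel.
    static int sumViaChannel(int n) throws InterruptedException {
        BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(16); // capacity 16 is arbitrary
        try (var pool = Executors.newVirtualThreadPerTaskExecutor()) {
            pool.submit(() -> {                        // producer virtual thread
                for (int i = 0; i < n; i++) {
                    try { channel.put(i); }            // blocks (cheaply) when the buffer is full
                    catch (InterruptedException e) { return; }
                }
            });
            int sum = 0;
            for (int i = 0; i < n; i++) {
                sum += channel.take();                 // blocks when the buffer is empty
            }
            return sum;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(sumViaChannel(100));        // 0 + 1 + ... + 99 = 4950
    }
}
```

As Matthias247 notes, this does not cover Go's select: there is no direct BlockingQueue analogue for waiting on several channels at once.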
| sdfgdfgbsdfg wrote: | pron talks about this on | https://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part2.... | christophilus wrote: | Loom looks like it's nicely solved the function coloring problem. | This plus Graal makes me excited to pick up Clojure again. | invalidname wrote: | This is pretty fantastic! | | I'm very excited about the possibilities of Loom. Would love to | have a more realistic sample with Spring Boot that would | demonstrate the real world scale. I saw a few but nothing | remotely as ambitious as that. | isbvhodnvemrwvn wrote: | Spring Boot overhead would likely make that infeasible. | RhodesianHunter wrote: | Spring Boot overhead is largely in startup time. It really | doesn't have much overhead thereafter. | | It's largely a collection of the same libraries you would use | anyways, glued together with a custom DI system. | invalidname wrote: | I'm not saying 5M. I just want to see to what scale it would | get without threading issues. Spring Boot isn't THAT heavy. | nelsonic wrote: | Reminds me of | https://phoenixframework.org/blog/the-road-to-2-million-webs... | Would love to see this extended to more | Languages/Frameworks. | mike_hearn wrote: | In theory once Graal adds support for it, any Graal/Truffle- | compatible language can benefit. | | IMHO it's only JVM+Graal that can bring this to other | languages. Loom relies very heavily on some fairly unique | aspects of the Java ecosystem (Go has these things too though). | One is that lots of important bits of code are implemented in | pure Java, like the IO and SSL stacks. Most languages rely | heavily on FFI to C libraries. That's especially true of | dynamic scripting languages but is also true of things like | Rust. The Java world has more of a culture of writing their own | implementations of things. | | For the Loom approach to work you need: | | a. Very tight and difficult integration between the compiler, | threading subsystem and garbage collector. | | b.
The compiler/runtime to control all code being used. The | moment you cross the FFI into code generated by another | compiler (i.e. a native library) you have to pin the thread and | the scalability degrades or is lost completely. | | But! Graal has a trick up its sleeve. It can JIT compile lots | of languages, and those languages can call into each other | without a classical FFI. Instead the compiler sees both call | site and destination site, and can inline them together to | optimize as one. Moreover those languages include binary | languages like LLVM bitcode and WASM. In turn that means that | e.g. Python calling into a C extension can still work, because | the C extension will be compiled to LLVM bitcode and then the | JVM will take over from there. So there's one compiler for the | entire process, even when mixing code from multiple languages. | That's what Loom needs. | | At least in theory. Perhaps pron will contradict me here | because I have a feeling Loom also needs the invariant that | there are no pointers into the stack. True for most languages | but not once C gets involved. I don't know to what extent you | could "fix" C programs at the compiler level to respect that | invariant, even if you have LLVM bitcode. But at least the one- | compiler aspect is not getting in the way. | kaba0 wrote: | With Truffle you have to map your language's semantics to | java ones. I am unfortunately out of my depth on the details, | but my guess would be that LLVM operates here with this in | mind in a completely safe way (I guess pointers to the stack | are not safe) so presumably it should work for these as well. | mike_hearn wrote: | Not exactly, no. That's the whole point of Truffle and why | it's such a big leap forward. You do _not_ map your | language 's semantics to Java semantics. You can implement | them on top of the JVM but bypassing Java bytecode. 
Your | language doesn't even have to be garbage collected, and | LLVM bitcode isn't (unless you use the enterprise version | which adds support for automatically converting C/C++ to | memory-safe GC'd code!). | | So - C code running on the JVM via Sulong keeps C/C++ | semantics. That probably means you can build pointers into | the stack, and then I don't know what Loom would do. Right | now they aren't integrated so I guess that's a research | question. | bkolobara wrote: | With lunatic [0] we are trying to bring this to all languages | that compile to WebAssembly. A few days ago I wrote about our | journey of bringing it to Rust: | https://lunatic.solutions/blog/writing-rust-the-elixir-way-1... | | [0]: https://github.com/lunatic-solutions/lunatic | TYMorningCoffee wrote: | I was only able to get to 840,000 open connections with my | experiment. My machine only has 8GB of memory. | https://josephmate.github.io/2022-04-14-max-connections/ | | Is there any way for the TCP connections to share memory in kernel | space? My experiment only uses two 8 byte buffers in userspace. | toast0 wrote: | Does Linux actually allocate buffers for each socket or does it | just link to sk_buff's (which I understand are similar to | FreeBSD's mbuf's) and then limit how much storage can be | linked? FreeBSD has a limit on the total ram used for mbufs as | well, not sure about Linux. | | Otoh, FreeBSD's maximum FD limit is set as a factor of total | memory pages (edit: looked it up, it's in | sys/kern/subr_param.c, the limit is one FD per four pages, | unless you edit kernel source) and you've got 2M pages with 8GB | ram, so you would be limited to 512k FDs total, and if you're | running the client on the same machine as server, that's 256k | connections. But 8G is not much for a server, and some phones | have more than that... so it's not super limiting.
| | When you're really not doing much with the connections, | userland TCP, as suggested in a sibling, could help you squeeze in | more connections, but if you're going to actually do work, you | probably need more ram. | | Btw, as a former WhatsApp server engineer, WhatsApp listens on | three ports; 80, 443, and 5222. Not that that makes a | significant difference in the content. | mh- wrote: | no*, and as you've discovered, the skbufs allocated by the | kernel will often be the limiting factor for a highly | concurrent socket server on linux. | | * I don't know if someone has created some experimental | implementation somewhere. It would require a significant | overhaul of the TCP implementation in the kernel. | | edit: check out this sibling thread about userland TCP. I think | this is a more interesting/likely direction to explore. | https://news.ycombinator.com/item?id=31215569 | 10000truths wrote: | A bit of a digression, but I'd love to see how much further one | could go with a memory-optimized userland TCP stack, and storing | the send and receive buffers on disk. | | A TCP connection state machine consists of a few variables to | keep track of sequence numbers and congestion control parameters | (no more than 100-200 bytes total), plus the space for | send/receive buffers. | | A 4 TB SSD would fit ~125 million 16-KB buffer pairs, and 125 | million 256-byte structs would take up only 32 GB of memory. In | theory, handling 100 million simultaneous connections on a single | machine is totally doable. Of course, the per-connection | throughput would be complete doodoo even with the best NICs, but | it would still be a monumental yet achievable milestone. | mike_hearn wrote: | Presumably at 100M simultaneous connections the machine CPU | would be saturated with setting up and closing them, without | getting much actual work done. TCP connections seem too fragile | to make it worth trying to keep them open for really long | periods.
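toast0's earlier point about shrinking per-socket buffers to fit more connections can be applied from Java as well as via sysctl. A hedged sketch: the helper name and the 4 KB figure are my own choices, and the kernel may clamp or round whatever you request (Linux, for instance, doubles the SO_RCVBUF value), so read the value back rather than assuming it stuck:

```java
import java.io.IOException;
import java.net.Socket;
import java.net.StandardSocketOptions;

public class SmallBuffers {
    // Ask the kernel for small per-socket buffers so millions of
    // mostly-idle connections don't pin large amounts of non-swappable
    // kernel memory. This trades per-socket throughput for density.
    static Socket trim(Socket conn) throws IOException {
        conn.setOption(StandardSocketOptions.SO_RCVBUF, 4 * 1024); // 4 KB: arbitrary pick
        conn.setOption(StandardSocketOptions.SO_SNDBUF, 4 * 1024);
        return conn;
    }
}
```

Whether this is a win depends entirely on the target throughput per connection, as discussed above.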
| | It's interesting to think about though, I agree. What are the | next scaling bottlenecks now that threading is nearly solved | (for JVM-compatible languages)? | | There are some obvious ones. Others in the thread have pointed | out network bandwidth. Some use cases don't need much bandwidth | but do need intense routability of data between connections, | like chat apps, and it seems ideal for those. Still, you're | going to face other problems: | | 1. If that process is restarted for any reason that's a _lot_ | of clients that get disrupted. JVMs are quite good at | hot-reloading code on the fly, so it's not inherently the case | that this is problematic because you could make restarts very | rare. But it's still a problem. | | 2. Your CPU may be sufficient for the steady state but on | restart the clients will all try to reconnect at once. Adding | jitter doesn't really solve the issue, as users will still have | to wait. Handling 5M connections is great unless it takes a | long time to reach that level of connectivity and you are | depending on it. | | 3. TCP is rarely used alone now, it usually comes with SSL. | Doing SSL handshakes is more expensive than setting up a TCP | connection (probably!). Do you need to use something like QUIC | instead? Or can you offload that to the NIC, making this a | non-issue? I don't know. BTW the Java SSL stack is written in Java | itself so it's fully Loom compatible. | natdempk wrote: | It depends on what you do, but I think GC/memory pressure can | become an issue rather quickly with the default programming | models Java leads you towards. I end up seeing this a lot in | somewhat high throughput services/workers I own where | fetching a lot of data to handle requests and discarding it | afterwards leads to a lot of GC time. Curious if anyone has | any sage advice on this front. | toast0 wrote: | You're totally spot on that connection establishment is much | more challenging than steady state; with TLS or just TCP.
| | I don't think QUIC helps with that at all. Afaik, QUIC is all | userland, so you'd skip kernel processing, but that doesn't | really make establishment cheaper. And TCP+TLS establishes | the connection before doing crypto, so that saves effort on | spoofing (otoh, it increases the round trips, so pick your | tradeoffs). | | One nice thing about TCP though is it's trivial to determine | if packets are establishing or connected; you can easily drop | incoming SYNs when CPU is saturated to put back pressure on | clients. That will work well enough when crypto setup is the | issue as well. Operating systems will essentially do this for you | if you get behind on accepting on your listen sockets. (Edit) | syncookies help somewhat if your system gets overwhelmed and | can't keep state for all those half-established | connections, although not without tradeoffs. | | In the before times, accelerator cards for TLS handshakes | were common (or at least available), but I think current NIC | acceleration is mainly the bulk ciphering, which IMHO is more | useful for sending files than sending the small data that I'd | expect in a large connection count machine. With file | sending, having the CPU do bulk ciphers is a RAM bottleneck: | the CPU needs to read the data, cipher it, and write to RAM, | then tell the NIC to send it; if the NIC can do the bulk | cipher that's a read and write omitted. If it's chat data, | the CPU probably was already processing it, so a few cycles | with AES instructions to cipher it before sending it to send | buffers is not very expensive. | charcircuit wrote: | I think you meant to say TLS. Not SSL. | adra wrote: | I'm pretty sure the exercise was to show the absolute | extremes that could be achieved in a toy application and | possibly how easily one could achieve a level of IO-blocking | scaling that has been harder than most other tasks in Java of | late.
More and more, heap allocations are cheaper, often with | sub-milli collector locks, CPU scaling has more to do with | what you're doing instead of the platform, but Java has | enough tools to make your application fast. | | For extremely IO-wait-bound workloads though, there were | always a LOT of hoops to jump through to make performance | strong, since OS threads always have a notable stack memory | footprint that just doesn't scale well when you could have | thousands of OS threads waiting around just taking up RAM. | toast0 wrote: | It's easy to just get 4TB of ram if that's what you need; I | haven't scoped out what you can shove into a cheap off the | shelf server these days, but I'd guess around 16TB before you | need to get fancy servers (Edit: maybe 8TB is more realistic | after looking at SuperMicro's 'Ultra' servers). I think you'd | need a very specialized application for 100M connections per | server to make sense, but if you've got one, that sounds like a | fun challenge; my email is in my profile. | | Moving 100M connections for maintenance will be a giant pain | though. You would want to spend a good amount of time on a test | suite so you can have confidence in the new deploys when you | make them. Also, the client side of testing will probably be | harder to scale than the server side... but you can do things | like run 1000 test clients with 100k outgoing connections each | to help with that. | Nullabillity wrote: | Loom is missing the point. | | Time has shown that bare threads are not a viable high-level API | for managing concurrency. As it turns out, we humans don't think | in terms of locks and condvars but "to do X, I first need to know | Y". That maps perfectly onto futures(/promises). And once you | have those, you don't need all the extra complexity and hacks | that green threads (/"colourless async") bring in.
| | I'd take a system that combined the API of futures with the | performance of OS threads over the opposite combination, any day | of the week. But as it turns out, we don't have to choose. We can | have the performance of futures with the API of futures. | | Or we can waste person-years chasing mirages, I guess. I just | hope I won't get stuck having to use the end product of this. | IshKebab wrote: | Threads have essentially the same API as futures - normally you | have some kind of join handle and you can join a set of threads | (the equivalent of awaiting a set of futures). | | Threads don't require locks and condvars. You can use channels | and scoped joins etc. if you want. | | Give me some async code and I'll show you an easier threaded | version. | bpicolo wrote: | The goroutine model in Go is plenty conceptually simple for | concurrency. Correct me if I'm wrong, but Loom seems similar in | that sense? | | I don't find myself missing out on futures in Go. | pron wrote: | I think you're mixing specific synchronisation/communication | mechanisms with the basic concept of a thread, which is simply | the sequential composition of instructions _that is known and | observable by the runtime_. If you like the future/promise | API, that will work even better with threads, because then the | sequence is a reified concept known to the runtime and all its | tools. You'll be able to step through the sequence of | operations with a debugger; the profiler will know to associate | operations with their context. What API you choose to compose | your operations, whether you prefer message passing with no | shared state, shared state with locks, or a combination of the | two -- that's all orthogonal to threads. All they are is a | sequential unit of instructions that may run concurrently to | other such units, _and is traceable and observable by the | platform and its tools_.
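pron's point above, that the composition API is orthogonal to the thread concept, can be illustrated by writing the same two-stage pipeline both ways. A hedged sketch: the stage methods stand in for blocking I/O calls, and all the names here are mine:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;

public class TwoStyles {
    static int fetch()       { return 21; }     // stand-in for a blocking I/O call
    static int enrich(int x) { return x * 2; }  // stand-in for a second stage

    // Asynchronous style: the pipeline lives in callbacks that the runtime
    // and its tools cannot see through as one sequential computation.
    static int futureStyle() throws ExecutionException, InterruptedException {
        return CompletableFuture.supplyAsync(TwoStyles::fetch)
                                .thenApply(TwoStyles::enrich)
                                .get();
    }

    // Thread style: the same pipeline as plain sequential code on a virtual
    // thread; a debugger can step through it, and a stack trace shows
    // fetch() and enrich() in context.
    static int threadStyle() throws Exception {
        try (var pool = Executors.newVirtualThreadPerTaskExecutor()) {
            return pool.submit(() -> enrich(fetch())).get();
        }
    }
}
```

Both return the same value; the difference the thread is arguing about is what the debugger, profiler, and stack traces can see along the way.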
| Nullabillity wrote: | You can implement futures by just running each future as a | thread, but it doesn't really give you much. It's a lot more | complex to write a preemptive thread scheduler + delegating | future scheduler than to just write a future scheduler in the | first place. | | Especially when that future scheduler already exists and | works, and the preemptive one is a multi-year research | project away. | pron wrote: | It gives you a lot (aside from the ability to use existing | libraries and APIs): observability and debuggability. | | Supporting tooling has been one of the most important | aspects of this project, because even those who were | willing to write asynchronous code, and even the few who | actually enjoyed it, constantly complained -- and rightly | so -- that they cannot easily observe, debug and profile | such programs. When it comes to "serious" applications, | observability is one of the most important aspects and | requirements of a system. | | Instead of introducing a new kind of sequential code unit | through all layers of tooling -- which would have been a | huge project anyway -- we abstracted the existing thread | concept. | rvcdbn wrote: | Maybe threads don't work for your thinking style but your claim | that this is generally true is baseless and pretty well refuted | by languages like Go or Erlang that feature stackful | threads/processes as a critical part of their best-in-class | concurrency stories. | Nullabillity wrote: | Erlang sidesteps the problem by avoiding mutable shared | state; in this context they're threads/processes in name | only. | | Go is just yet another implementation of green threads that | is slightly less broken than prior implementations, because | it had the benefit of being implemented on day 1 (so the | whole ecosystem is green thread-aware). It's certainly | nowhere near "best-in-class".
| toast0 wrote: | Shared mutable state is hard to work with, but Java threads | and Java promises both give you access to it. In either | case, you'd need discipline to avoid patterns which reduce | concurrency. | | From the article, it seems that Loom (in preview) enables | the threaded model for Java to scale. IMHO, this is great | because you can write simple straightforward code in a | threaded model. You can certainly write complex code in a | threaded model too. Maybe there's an argument that promises | can be simple and straightforward too, but my experience | with them hasn't been very straightforward. | chrisseaton wrote: | > Erlang sidesteps the problem by avoiding mutable shared | state | | Erlang is maximal shared mutable state! | | Processes are mutable state and they're shared between | other processes. | groestl wrote: | If I look at a thread, I see futures all over the place. | They're just implicit, and the OS takes care of | concurrency/preemption. Sure, that means that you need | concurrency primitives if you access shared resources, but it's | only in the trivial case that you can get away without shared | state in the promise/future scenario as well (i.e. glue code | that ties together the hard stuff). Downside is your code gets | convoluted and your stacktraces suck. | torginus wrote: | While impressive, I don't really see it as something practical - | I think scaling across processes/VMs is a much more realistic | approach. | notorandit wrote: | With a maximum of 64k TCP connections per single server IP, you | need 77 different IPs on the server side. This is a fact. | imperio59 wrote: | Pretty sure you can bump that up in the kernel to hold more | active connections per server than 64k... | jauer wrote: | How do you figure? | | Clients can connect to the server on the same server port, so | the connection limit is more like 64k^2 for every client IP-server | IP pair. | akvadrako wrote: | Actually every client IP+port / server IP+port pair.
Linux | uses 60999 - 32768 for ephemeral ports so can support 28e3^2 | = 784 million connections per IP pair. | mypalmike wrote: | Except your service is almost certainly listening on one | non-ephemeral port. | | But having "only" tens of thousands of connections per | client is rarely a problem in practice, apart from some | load testing scenarios (such as the experiment here, where | they opened a number of ports so they could test a large | number of connections with a single client machine). | charcircuit wrote: | 1 IP can correspond to multiple different clients. | peq wrote: | Isn't this limit per client ip, server ip, and server port? | (https://stackoverflow.com/a/2332756/303637) | alanfranz wrote: | "You need 77 ips" to do what? May be a fact or not, depending | on what you're doing. | | If you suppose just one open server port, you'll probably need | 77 client ips to do this test to get unique socket pairs. | | But it's a client problem, not a server one. | ivanr wrote: | I imagine that's the limit per client IP address [for a single | server port], no? The Linux kernel can use multiple pieces of | information to track connections: client IP address, client | port, server IP address, server port. | | Cloudflare has some interesting blog posts on this topic: | | - https://blog.cloudflare.com/how-we-built-spectrum/ | | - https://blog.cloudflare.com/how-to-stop-running-out-of- | ephem... | NovemberWhiskey wrote: | What? | | Having run production services that had over 250,000 sockets | connecting to a single server port, I'm calling "nope" on that. | | Are you thinking of the ephemeral port limit? That's on the | client side; not the server side. Each TCP socket pair is a | four-tuple of [server IP, server port, client IP, client port]; | the uniqueness comes from the client IP/port part in the server | case. 
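The arithmetic in this subthread can be checked directly. A small sketch, assuming Linux's default ephemeral range of 32768-60999 (net.ipv4.ip_local_port_range); the exact inclusive count is 28232, and the "784 million" above comes from rounding that to 28e3:

```java
public class PortMath {
    // Linux default ephemeral range (net.ipv4.ip_local_port_range): 32768..60999.
    static final int LO = 32_768, HI = 60_999;

    // Ports one client IP can use toward a single server IP:port pair.
    static long clientPorts() {
        return HI - LO + 1;                       // 28232, the "~28e3" above
    }

    // If both sides varied over the ephemeral range (client port x server port),
    // the connection tuple space per IP pair -- the thread's ballpark figure.
    static long perIpPair() {
        return clientPorts() * clientPorts();
    }

    public static void main(String[] args) {
        System.out.println(clientPorts());        // 28232
        System.out.println(perIpPair());          // 797045824, i.e. roughly 800 million
    }
}
```

With the usual setup (one fixed listen port on the server), only clientPorts() applies per client IP, which is why the 64k/28k "limit" is a per-client-IP figure, not a server-wide one.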
| jeroenhd wrote: | You don't really need 77 IP addresses (the 64k limit for TCP is | per client IP, per server port, per server IP) but even if you | did, your average IPv6 server will have a few billion | available. Every client can connect to a server IP of their own | if you ignore the practical limits of the network acceleration | and driver stack. If you're somehow dealing with this scale, I | doubt you'll be stuck with pure legacy IP addressing. | | The real problem with such a setup is that you're not left with | a whole lot of bandwidth per connection, even if you ignore | things like packet loss and retransmits mucking up the | connections. Most VPS servers have a 1 Gbps connection; with 5 | million clients that leaves about 200 bits (25 bytes) per second | of concurrent bandwidth for TCP signaling and data to flow | through. You'll need a ridiculous network card for a single | server to give each connection useful throughput, in the | terabits per second range. ___________________________________________________________________ (page generated 2022-04-30 23:00 UTC)