[HN Gopher] HPC is dying, and MPI is killing it (2015)
___________________________________________________________________

 HPC is dying, and MPI is killing it (2015)

 Author : zdw
 Score  : 53 points
 Date   : 2023-04-07 00:18 UTC (1 day ago)

 (HTM) web link (www.dursi.ca)
 (TXT) w3m dump (www.dursi.ca)

| prpl wrote:
| A lot of MPI's (ab)use in HPC boils down to distributed task
| management in lieu of a work queue system available to users.
| People have embarrassingly parallel jobs but need to coordinate
| on the task management, because many HPC centers don't provide
| resources for a long-lived service to execute near the cluster
| (or even general connectivity outside the cluster).
|
| The problem is that you do have to support true parallel MPI
| jobs in those shared clusters, though, so MPI just becomes the
| hammer for everyone else.
|
| Managing the resources a level higher (all resources live in a
| k8s cluster, with Slurm running under k8s) seems to be the best
| way to really accommodate both types of load, but most HPC
| centers are far from implementing that.
|
| https://slurm.schedmd.com/SC22/Slurm-and-or-vs-Kubernetes.pd...
|
| (I think that presentation has some misconceptions about k8s -
| most k8s clusters are elastic up to a max size - and it sounds
| like they really want to control most scheduling - but it gives
| an overview of merging the two systems.)
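To make prpl's point concrete: "MPI as a makeshift work queue"
usually amounts to something like the minimal sketch below, where
all of MPI's machinery is used only to hand out independent tasks.
The choice of mpi4py, the rank-0 dispatcher layout, and the
run_task placeholder are illustrative assumptions, not anything
from the thread or the article.

    # mpi_taskqueue.py -- run with: mpirun -n 8 python mpi_taskqueue.py
    # Rank 0 hands out task IDs; every other rank loops asking for work.
    from mpi4py import MPI

    TAG_WORK, TAG_DONE = 1, 2

    def run_task(task_id):
        # Stand-in for the real embarrassingly parallel work unit.
        return task_id * task_id

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        tasks = list(range(100))
        workers = comm.Get_size() - 1
        status = MPI.Status()
        results = []
        closed = 0
        while closed < workers:
            # Any worker reporting in gets the next task, or a poison pill.
            res = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG,
                            status=status)
            if res is not None:
                results.append(res)
            if tasks:
                comm.send(tasks.pop(), dest=status.Get_source(), tag=TAG_WORK)
            else:
                comm.send(None, dest=status.Get_source(), tag=TAG_DONE)
                closed += 1
        print(f"collected {len(results)} results")
    else:
        comm.send(None, dest=0)  # report for duty
        while True:
            status = MPI.Status()
            task = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
            if status.Get_tag() == TAG_DONE:
                break
            comm.send(run_task(task), dest=0, tag=TAG_WORK)

Everything here is plumbing; none of it needs MPI's tightly coupled
communication model, which is exactly the mismatch being described.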
| saltcured wrote:
| Roughly 20 years ago, the Condor high-throughput computing
| system gained "glide-ins" to do this sort of repurposing.
| Before that, Condor was mostly persistent runners on desktop
| fleets. After, you would submit a batch job to an HPC cluster
| and, for the duration of the job, those HPC nodes became
| additional runners for an existing Condor scheduler.
|
| Around that time, there was also a period of reservation-based
| "advanced scheduling", where HPC centers were flirting with
| making future scheduling promises. The right way to think of
| these would be as guarantees to get bare-metal machine capacity
| during a certain wall-clock period. In my opinion, the
| commercial pre-cloud/cloud/virtualization stuff then infected
| everyone and regressed to time-sharing with fuzzy QoS and lots
| of over-subscription and dynamic rescheduling.
|
| Of course, these different approaches will all be isomorphic in
| the end if they explore the full space of application
| requirements. The traditional paths were just approaching it
| from very different economic priorities. The IaaS folks are
| incrementally adding more QoS and pricing options, which could
| eventually provide HPC IaaS if carried to full fruition, i.e.
| future guarantees of significant hardware resources. But as far
| as I know, those are still in the realm of "talk to a sales
| rep" and not some automated IaaS request flow at this point.
| tannhaeuser wrote:
| I believe the terminology is/was "advance reservations" (as in
| reservations of resources such as CPU, memory, disk space, I/O,
| and network bandwidth in advance, on clusters otherwise freely
| available to ad-hoc jobs) rather than "advanced" anything - or
| at least it was with the Torque scheduler I reviewed for a
| clickstream-analysis customer project.
| saltcured wrote:
| Yes, I think so too. I blame the predictive typing in my hands.
| dekhn wrote:
| During the years I was active in grid computing (before cloud
| computing became huge), Miron Livny (creator of Condor) would
| basically attend every talk and explain how "Condor already
| does this, why are you reinventing the wheel?"
| saltcured wrote:
| Hah, yes. And Oracle reps used to say the grid was inside their
| cluster. A lot of folks did not appreciate the decentralization
| of the grid. The grid computing concepts were about federation
| of disparate organizations and their resources, not about how
| sprawling a system a single vendor or HPC center could build.
| prpl wrote:
| The PITA part of Condor/grid was software management before
| containers. Sure, everyone was running at least RHEL4/5/6 (or
| SL4/5/6), in many cases AFS worked, and the more advanced
| operators were adding VM execution, but it was (and still is)
| annoying to deal with. Most annoying right now is that nobody
| can agree on a container runtime - there's Docker/Shifter,
| Singularity, Charliecloud. It should all just be podman now,
| but everybody is still holding out.
|
| (I have worked with Slurm, Condor, Torque/PBS, gridEngine,
| DIRAC, and LSF.)
|
| Plugging a different scheduler into k8s might be an interesting
| way of solving this - it seemed like there was a lot of work on
| scheduler plugins when I last looked. Some of the issues are
| similar in the cloud too - co-scheduling by latency.
|
| There's at least some incentive not to solve this - I remember
| k8s on Mesos being popular, and of course we know how that
| played out.
| aqme28 wrote:
| A trend piece from 8 years ago that didn't get its prediction
| right. Not sure why this was reposted.
| markhahn wrote:
| Blast from the past, eh?
|
| Since then, it's become obvious that 99% of the ML world is
| using high-level interfaces, blissfully unaware of MPI.
|
| If you want to make a lasting contribution, work on PIM, not
| MPI ;)
| nerpderp82 wrote:
| This is a nothing burger - though I don't mean that in a
| derisive way.
|
| MPI and Spark solve very different problems. The overlap is
| basically zero, and the fact that MPI is flat while Spark
| fluctuates shows this. HPC is a small, small fraction of the
| job market; the number of folks involved in HPC is tiny
| compared to all the click-log-analyzing Spark and Hadoop
| programmers.
|
| Both of those systems are bulk parallel systems, with the
| majority of folks aggregating low-information-density data.
| That is not what MPI clusters are doing when they model
| weather, calculate subatomic interactions, or simulate wind
| tunnels.
|
| > The idea that the people at Google doing large-scale machine
| learning problems (which involves huge sparse matrices) are
| oblivious to scale and numerical performance is just
| delusional.
|
| Scale is not latency! Google scales to sizes many orders of
| magnitude larger than MPI clusters, but it does not run
| workloads with the same connectivity needs that MPI workloads
| have. It doesn't run supercomputers; it runs massively
| parallel, embarrassingly parallel bulk computers.
|
| Chapel is a language; MPI is a transport. The author obviously
| has skills, but the two shouldn't be conflated. Chapel _can
| use_ MPI.
|
| Chapel supports OFI, MPI, uGNI, and GASNet. This is not unlike
| saying "don't use SCTP, use Python!" I am not being charitable.
| dekhn wrote:
| Google absolutely runs workloads that need supercomputer-style
| tight scaling. That's why they built TPUs with ICI networks.
| HyperSane wrote:
| What are ICI networks?
| dekhn wrote:
| https://dl.acm.org/doi/pdf/10.1145/3360307 Very fast (hundreds
| of gigabits per connection) networks that attach TPUs on the
| same board, as well as between boards. They're wired up point
| to point, forming a logical grid (technically a 2D or 3D torus
| with wraparound links).
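Incidentally, the torus-with-wraparound layout dekhn describes is
a topology MPI itself can express directly, via Cartesian
communicators. A minimal sketch, assuming mpi4py and an
illustrative 4x4 grid (so it must be launched with exactly 16
ranks); nothing here is taken from the thread beyond the shape:

    # torus.py -- run with: mpirun -n 16 python torus.py
    # Arrange 16 ranks as a 4x4 grid with wraparound links (a 2D
    # torus), then do a neighbor exchange and an allreduce.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    # periods=[True, True] adds the wraparound links; without it
    # the grid is a plain mesh.
    cart = comm.Create_cart(dims=[4, 4], periods=[True, True])
    rank = cart.Get_rank()

    # Ranks of the wraparound neighbors along dimension 0.
    src, dst = cart.Shift(direction=0, disp=1)

    # Exchange a value with the neighbors, point to point.
    mine = np.array([float(rank)])
    theirs = np.empty(1)
    cart.Sendrecv(sendbuf=mine, dest=dst, recvbuf=theirs, source=src)

    # A collective over the whole torus (the allreduce pattern).
    total = np.empty(1)
    cart.Allreduce(mine, total, op=MPI.SUM)
    if rank == 0:
        print("sum over all ranks:", total[0])  # 0+1+...+15 = 120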
| rlupi wrote:
| ICI is the inter-chip interconnect in TPUv4 pods.
|
| You can read more in the paper that my colleagues in Google
| platforms recently published:
| https://arxiv.org/abs/2304.01433
| https://cloud.google.com/blog/topics/systems/tpu-v4-enables-...
|
| I am so happy this paper is finally out :-) I led the SRE work
| during their NPI. It's a pain when you have to wait 3 years to
| discuss what you worked on.
|
| They are based on OCS, which powers Google datacenter networks:
| https://arxiv.org/abs/2208.10041
| markhahn wrote:
| ICI is basically NVLink, not like IB.
| dekhn wrote:
| No, ICI is a fiber-optic (or electrical) network between TPUs,
| and it doesn't have any switch functionality. I used to work
| for Google (on TPUs and ICI). Anyway, the comparisons don't
| matter that much. The long and short of it is that Google
| finally built their own supercomputers, ones that can do
| allreduce and alltoall patterns as well as low-latency
| broadcast.
| markhahn wrote:
| The medium isn't the point, and, duh, torus networks are
| distributed-switching.
| dekhn wrote:
| Do you have a point? There's nothing technically wrong with
| saying that TPUs with ICI are a supercomputer - they are very
| similar in design to the T3E I used decades ago.
|
| Please make your point instead of trying to negate what I said.
|
| The medium matters a lot. Because supercomputers are
| fundamentally constrained by physics - power dissipation,
| transistor density, and the speed of light - all-optical
| networks have slightly lower latencies than electrical ones,
| and they also let you build larger systems (longer cables).
| bscphil wrote:
| I think a central element of the article is its focus on the
| needs of genomics researchers. In this field there are many
| tasks that are highly parallel but not embarrassingly parallel.
| Ideally you want to scale out to as many processes as possible,
| but some basic IPC is mandatory.
|
| Though not in the HPC field myself, I've been in the position
| of helping researchers working with an HPC cluster. On this
| hardware something like 16 cores per node was typical, so you
| run out of scaling room quickly using traditional threading
| libraries. And you really want to scale beyond this - jobs
| could take weeks or months otherwise. That meant using MPI,
| because that's all the HPC cluster supported. This was a
| massive pain, because MPI is complicated in ways that are
| unfamiliar to many researchers, most of whom have no low-level
| language experience at all.
|
| What the article calls for are HPC techniques that push
| communication down into a lower-level API rather than exposing
| it to the programmer / end user. I think that's exactly the
| right idea for genomics.
|
| Disclaimer: I'm neither an expert in HPC _nor_ genomics.
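The lower-level-API approach bscphil endorses already exists in
modest form inside the MPI ecosystem itself: mpi4py ships a
futures module that hides the send/recv bookkeeping behind a
standard Python executor. A minimal sketch; the GC-content scoring
function and the input sequences are made up for illustration:

    # pool.py -- run with:
    #   mpiexec -n 16 python -m mpi4py.futures pool.py
    # The researcher writes ordinary map-style code; MPIPoolExecutor
    # does the MPI communication across ranks behind the scenes.
    from mpi4py.futures import MPIPoolExecutor

    def score(sequence):
        # Stand-in for a per-record genomics computation.
        return sum(1 for base in sequence if base in "GC") / len(sequence)

    if __name__ == "__main__":
        sequences = ["ACGTACGT", "GGGCCC", "ATATATAT", "CGCGCGCG"]
        with MPIPoolExecutor() as pool:
            for seq, gc in zip(sequences, pool.map(score, sequences)):
                print(seq, f"GC content: {gc:.2f}")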
| phkahler wrote:
| >> HPC is a small, small fraction of the job market; the number
| of folks involved in HPC is tiny compared to all the click-log-
| analyzing Spark and Hadoop programmers.
|
| You're basically saying HPC = MPI users, which is really
| dismissive of a whole bunch of other people making use of vast
| compute resources.
|
| I for one can't wait for these masses to convince chip makers
| that IEEE 754 floating point really has a lot of crap that we
| don't need.
| morelandjs wrote:
| What compute environments are used for LLMs like ChatGPT? Do
| you see this recent frenzy driving demand for HPC engineers?
| cavisne wrote:
| LLM training (at least for the models we have research details
| on) syncs weights every step, so it has very demanding
| networking and latency needs. That's why every cloud vendor
| competes on interconnect speeds for GPU machines.
|
| So these are basically classic supercomputing workloads.
| cavisne wrote:
| I don't see a lot of traditional "HPC engineers" in the ML
| infrastructure space, though.
|
| I do wonder if, in hindsight, using a Slurm cluster for
| scheduling, Lustre for data, and MPI for any connectivity
| that's not covered by NCCL would have been better than trying
| to make object storage, gRPC, Kubernetes, Ray, etc. work.
| the_svd_doctor wrote:
| "HPC engineering" and "optimizing LLM training" are similar.
| Both are about profiling, performance modeling, finding
| bottlenecks, rewriting what's needed, etc. Lots of overlap, and
| people from more traditional HPC are doing it too. Obviously,
| if you're a 50-year-old university professor doing airplane
| simulations, you won't switch to LLMs today, though...
| adw wrote:
| Some mixture of MPI, NCCL, Gloo, and whatever proprietary stuff
| TPU clusters do. All of these are basically either trad-HPC in
| style or literally from the supercomputing community.
| Interconnects tend to be InfiniBand or the like - which, again,
| is straight out of big iron.
| hackandthink wrote:
| "Our biggest jobs run MPI, and all pods within the job are
| participating in a single MPI communicator."
|
| https://openai.com/research/scaling-kubernetes-to-7500-nodes
| p_l wrote:
| CUDA has support for MPI, including MPI done from the GPU
| itself (over NVLink, or with support for accessing the host
| InfiniBand adapter, IIRC).
| osigurdson wrote:
| They use Kubernetes and MPI.
|
| https://openai.com/research/scaling-kubernetes-to-7500-nodes
| mirker wrote:
| I think one of the most glaring problems with the argument is
| that HPC maintains tons of legacy simulations whose original
| authors may not even be around anymore. Adding Spark support
| wouldn't make those any easier to maintain.
| sfpotter wrote:
| Not sure why this was downvoted. This is absolutely true.
| osigurdson wrote:
| The article is a little out of date. MPI is used for LLM
| training, so it has made a bit of a comeback.
| kergonath wrote:
| Also, HPC is not dead, so obviously MPI did not kill it.
| RcouF1uZ4gsC wrote:
| With regard to Spark: consider whether you could do the
| processing on a single machine. For many workloads, a single-
| threaded program running on a laptop will beat an entire Spark
| cluster.
|
| https://www.frankmcsherry.org/assets/COST.pdf
| kristjansson wrote:
| This was a great point to make at the time, when people thought
| "my data has exceeded Excel's row limit, therefore I should set
| up a Hadoop cluster and run Spark jobs against it."
|
| Since then ... it's become a bit of a meme, unfortunately.
| Definitely there still exist workloads assigned to Spark
| clusters that could run on a laptop, especially if the data
| happens to be there already. But the space as a whole provides
| immense value, both enabling jobs that really don't fit on
| laptops and moving the compute for laptop-sized jobs to where
| the data happens to be.
| briankelly wrote:
| The datasets they test against are 6 GB and 15 GB, and I get
| that those are the two that one of their references uses, but
| that's clearly not multi-node territory. Also, as they point
| out, graph computation is not trivially parallelized. Spark is
| more for doing long-running transformations on independent
| data in a fault-tolerant way.
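For readers who haven't seen the COST paper's argument in code
form, here is the same group-by aggregation twice: once as a plain
single-machine pass, once in PySpark. The events.csv file and its
user/bytes columns are invented for illustration; the point is
only that, for data at this scale, the first version routinely
wins once cluster overheads are counted.

    # cost.py -- the same group-by-sum, two ways.
    # 1) Plain single-machine pass over a CSV.
    import csv
    from collections import defaultdict

    totals = defaultdict(int)
    with open("events.csv") as f:
        for row in csv.DictReader(f):
            totals[row["user"]] += int(row["bytes"])

    # 2) The PySpark equivalent -- same answer, plus cluster
    # scheduling, serialization, and shuffle overhead (requires a
    # Spark installation):
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cost").getOrCreate()
    df = spark.read.csv("events.csv", header=True, inferSchema=True)
    df.groupBy("user").agg(F.sum("bytes")).show()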
___________________________________________________________________
(page generated 2023-04-08 23:01 UTC)