       | prpl wrote:
       | A lot of MPI's (ab)use in HPC boils down to distributed task
       | management in lieu of a work queue system available to users.
       | People have embarrassingly parallel jobs but need to coordinate
       | on the task management because many HPC centers don't
       | provide resources for a long-lived service to execute near the
       | cluster (or even general connectivity outside the cluster).
       | The problem is that you do have to support true parallel MPI jobs
       | in those shared clusters though, so MPI just becomes the hammer
       | for everyone else.
       | Managing the resources a level higher (all resources live in a
       | k8s cluster, slurm under k8s) seems to be the best way to really
       | accommodate both types of loads but most HPC centers are far off
       | from implementing that.
       | https://slurm.schedmd.com/SC22/Slurm-and-or-vs-Kubernetes.pd...
       | (I think that presentation has some misconceptions about k8s -
       | most k8s clusters are elastic to a max size - and it sounds like
       | they really want to control most scheduling - but it gives an
       | overview of merging the systems)
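The "MPI as a work queue" pattern described above often amounts to nothing more than each rank picking its share of an independent task list. A minimal sketch of that idea, assuming nothing beyond a rank number and a rank count: with a real MPI binding those two values would come from the communicator, but here they are plain arguments so the example runs without MPI installed. The function names are made up for illustration.

```python
def tasks_for_rank(tasks, rank, size):
    """Return the tasks one rank would process, using round-robin striding."""
    # Rank 0 takes tasks 0, size, 2*size, ...; rank 1 takes 1, size+1, ...
    return tasks[rank::size]


def run_all(tasks, size, work):
    """Simulate `size` ranks one after another and collect every result.

    Under real MPI each rank would be its own process and the results
    would be gathered over the network; this loop only shows the split.
    """
    results = {}
    for rank in range(size):
        for task in tasks_for_rank(tasks, rank, size):
            results[task] = work(task)
    return results
```

Note that the split is static: no rank talks to another until the final gather, which is why a plain work queue would serve these jobs just as well.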
         | saltcured wrote:
         | Roughly 20 years ago, the Condor high-throughput computing
         | system gained "glide-ins" to do this sort of repurposing.
         | Before that, Condor was mostly persistent runners on desktop
         | fleets. After, you would submit a batch job to an HPC cluster
         | and for the duration of the job, those HPC nodes became
         | additional runners for an existing Condor scheduler.
         | Around that time, there was also a period of reservation-based
         | "advanced scheduling" where HPC centers were flirting with
         | making future scheduling promises. The right way to think of
         | these would be like guarantees to get bare metal machine
         | capacity during a certain wall-clock period. In my opinion, the
         | commercial pre-cloud/cloud/virtualization stuff then infected
         | everyone and regressed to time-sharing with fuzzy QoS and lots
         | of over-subscription and dynamic rescheduling.
         | Of course, these different approaches will all be isomorphic in
         | the end if they explore the full space of application
         | requirements. The traditional paths were just approaching from
         | very different economic priorities. The IaaS folks are
         | incrementally adding more QoS and pricing options which could
         | eventually provide HPC IaaS if carried to full fruition. I.e.
         | future guarantees of significant hardware resources. But as far
         | as I know, those are still in the realm of "talk to a sales
         | rep" and not some automated IaaS request flow at this point.
           | tannhaeuser wrote:
           | I believe the terminology is/was "advance reservations" (as
           | in reservations of resources such as CPU, mem, disk space,
           | i/o and net bandwidth in advance on clusters otherwise freely
           | available to ad-hoc jobs) rather than "advanced" anything, or
           | at least it was with the Torque scheduler I reviewed for a
           | clickstream analysis customer project.
             | saltcured wrote:
             | Yes, I think so too. I blame the predictive typing in my
             | hands.
           | dekhn wrote:
           | During the years I was active in grid computing (before cloud
           | computing became huge) Miron Livny (creator of Condor) would
           | basically attend every talk and explain how "condor already
           | does this, why are you reinventing the wheel?"
             | saltcured wrote:
             | Hah, yes. And Oracle reps used to say the grid was inside
             | their cluster. A lot of folks did not appreciate the
             | decentralization of the grid. The grid computing concepts
             | were about federation of disparate organizations and their
             | resources, not about how sprawling of a system a single
             | vendor or HPC center could build.
           | prpl wrote:
           | The PITA part of condor/grid was software management before
           | containers. Sure, everyone was running at least RHEL4/5/6 (Or
           | SL4/5/6) and in many cases AFS worked and the more advanced
           | operators were adding VM execution, but it was (and still is)
           | annoying to deal with. Most annoying right now is that nobody
           | can agree on a container runtime - there's Docker/Shifter,
           | singularity, Charlie. It should just all be podman now but
           | everybody is still holding on.
           | (I have worked with Slurm, condor, Torque/PBS, gridEngine,
           | DIRAC, and LSF)
           | Plugging a different scheduler into k8s might be an
           | interesting way of solving this - it seemed like there was a
           | lot of work on scheduler plugins when I last looked. Some of the
           | issues are similar in cloud too - coscheduling by latency.
           | There's at least some incentive to not solve this - I
           | remember k8s on Mesos being popular and of course we know how
           | that played out.
       | aqme28 wrote:
       | A trend piece from 8 years ago that didn't get its prediction
       | right. Not sure why this was reposted.
       | markhahn wrote:
       | blast from the past eh?
       | since then, it's become obvious that 99% of the ML world is
       | using high-level interfaces, blissfully unaware of MPI.
       | if you want to make a lasting contribution, work on PIM, not MPI
       | ;)
       | nerpderp82 wrote:
       | This is a nothing burger, I don't mean that in a derisive way
       | though.
       | MPI and Spark solve very different problems. The overlap is
       | basically zero and the fact that MPI is flat while Spark
       | fluctuates shows this. HPC is a small small fraction of the job
       | market, the number of folks involved in HPC is tiny compared to
       | all the click-log analyzing Spark and Hadoop programmers.
       | Both of those systems are bulk parallel systems, with the
       | majority of folks aggregating low information density data. This
       | is not what the MPI clusters are doing modeling weather and
       | calculating subatomic interactions or simulating wind tunnels.
       | > The idea that the people at Google doing large-scale machine
       | learning problems (which involves huge sparse matrices) are
       | oblivious to scale and numerical performance is just delusional.
       | Scale is not latency! Google scales to sizes many orders of
       | magnitude larger than MPI clusters, but it does not run
       | workloads with the same connectivity needs that MPI workloads
       | have. It doesn't run supercomputers; it runs massive,
       | embarrassingly parallel bulk-processing machines.
       | Chapel is a language, MPI is a transport. The author obviously
       | has skills, but they shouldn't be conflated. Chapel _can use_
       | MPI.
       | Chapel supports OFI, MPI, uGNI, and GASNet. This is not unlike
       | saying "don't use SCTP, use Python!" I am not being charitable.
         | dekhn wrote:
         | Google absolutely runs workloads that need supercomputer tight
         | scaling. That's why they built TPUs with ICI networks.
           | HyperSane wrote:
           | what are ICI networks?
             | dekhn wrote:
             | https://dl.acm.org/doi/pdf/10.1145/3360307 Very fast
             | (hundreds of gigabits per connection) networks that attach
             | TPUs on the same board, as well as between boards. It's
             | wired up point to point, forming a logical grid
             | (technically a 2D or 3D torus with wraparound links).
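To make the "torus with wraparound links" description concrete, here is a small sketch of how neighbors and hop counts work in such a grid. It is a generic illustration of torus addressing, not a description of how ICI itself assigns coordinates or routes traffic; the function names and the coordinate scheme are assumptions.

```python
def torus_neighbors(coord, dims):
    """Directly linked neighbors of `coord` in a torus with per-axis sizes `dims`."""
    coord = tuple(coord)
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            # The modulo is the wraparound link: stepping off one edge
            # lands on the opposite edge of the same axis.
            n[axis] = (coord[axis] + step) % size
            n = tuple(n)
            # Small axes (size 1 or 2) can yield the node itself or a duplicate.
            if n != coord and n not in neighbors:
                neighbors.append(n)
    return neighbors


def torus_hops(src, dst, dims):
    """Minimum number of links between two nodes, using wraparound when shorter."""
    return sum(
        min(abs(a - b), size - abs(a - b))
        for a, b, size in zip(src, dst, dims)
    )
```

In a 4x4 torus the corner nodes (0, 0) and (3, 3) are only two hops apart, where a plain mesh without wraparound would need six.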
             | rlupi wrote:
             | ICI is interchip interconnect in TPUv4 pods.
             | You can read more in the paper that my colleagues in Google
             | platforms recently published:
             | https://arxiv.org/abs/2304.01433
             | https://cloud.google.com/blog/topics/systems/tpu-v4-enables-...
             | I am so happy this paper is finally out :-) I led the SRE
             | work during their NPI. It's a pain when you have to wait 3
             | years to discuss what you worked on.
             | They are based on OCS, which powers Google datacenter
             | networks https://arxiv.org/abs/2208.10041
           | markhahn wrote:
           | ICI is basically nvlink, not like IB.
             | dekhn wrote:
             | no, ICI is a fiber optic (or electrical) network between
             | TPUs and it doesn't have any switch functionality. I used
             | to work for Google (on TPUs and ICI). Anyway, the
             | comparisons don't matter that much. The long and short of
             | it is that Google built their own supercomputers (finally)
             | that can do allreduce and alltoall patterns, as well as
             | low-latency broadcast.
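For readers unfamiliar with the allreduce pattern mentioned above: every participant starts with its own vector and ends with the elementwise sum of all of them. One common way to do that is the ring algorithm, simulated below in plain Python. This is only a sketch of that one textbook variant; it makes no claim about which algorithm TPU pods or any MPI library actually use, and it assumes each rank's vector has exactly one element per rank to keep the chunking trivial.

```python
def ring_allreduce(data):
    """Simulate a ring allreduce (sum) over `data`, a list of per-rank vectors.

    Assumes len(vector) == number of ranks, so each chunk is one element.
    Returns the final buffer of every rank.
    """
    n = len(data)
    assert all(len(v) == n for v in data), "one chunk per rank expected"
    buf = [list(v) for v in data]  # work on copies

    # Phase 1, reduce-scatter: after n-1 steps rank r holds the complete
    # sum for chunk (r + 1) % n.
    for s in range(n - 1):
        # Collect all messages first so the sends behave as simultaneous.
        msgs = [((r - s) % n, buf[r][(r - s) % n]) for r in range(n)]
        for r, (idx, val) in enumerate(msgs):
            buf[(r + 1) % n][idx] += val

    # Phase 2, allgather: pass the completed chunks around the ring.
    for s in range(n - 1):
        msgs = [((r + 1 - s) % n, buf[r][(r + 1 - s) % n]) for r in range(n)]
        for r, (idx, val) in enumerate(msgs):
            buf[(r + 1) % n][idx] = val

    return buf
```

Each rank only ever talks to its ring neighbor, and every step moves one chunk per link, which is why this pattern rewards fast point-to-point interconnects rather than a big central switch.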
               | markhahn wrote:
               | medium isn't the point, and duh, torus networks are
               | distributed-switching.
               | dekhn wrote:
               | Do you have a point? There's nothing technically wrong
               | with saying that TPUs with ICI are a supercomputer- they
               | are very similar in design to the T3E I used decades ago.
               | Please make your point instead of trying to negate what
               | I said.
               | Medium matters a lot, because supercomputers are
               | fundamentally constrained by physics in terms of power
               | dissipation, transistor density, and speed of light.
               | All-optical networks have slightly lower latencies than
               | electrical ones, and also let you build larger systems
               | (longer cables).
         | bscphil wrote:
         | I think a central element of the article is its focus on the
         | needs of genomics researchers. In this field there are many
         | tasks that are highly parallel but not embarrassingly parallel.
         | Ideally you want to scale out to as many processes as possible,
         | but some basic IPC is mandatory.
         | Though not in the HPC field myself, I've been in a position of
         | helping researchers working with an HPC cluster. On this
         | hardware something like 16 cores per node was typical, so you
         | run out of scaling room quickly using traditional threading
         | libraries. And you really want to scale beyond this - jobs
         | could take weeks or months otherwise. This means using MPI
         | because that's all the HPC supported. This was a massive pain,
         | because MPI is complicated in ways that are unfamiliar to many
         | researchers, most of whom have no low-level language experience
         | at all.
         | What the article calls for are HPC techniques that move
         | communication to a lower level API, rather than exposing it
         | to the programmer / end-user. I think that's exactly the right
         | idea for genomics.
         | Disclaimer: I'm neither an expert in HPC _nor_ genomics.
         | phkahler wrote:
         | >> HPC is a small small fraction of the job market, the number
         | of folks involved in HPC is tiny compared to all the click-log
         | analyzing Spark and Hadoop programmers.
         | You're basically saying HPC = MPI users, which is really
         | dismissive of a whole bunch of other people making use of vast
         | compute resources.
         | I for one can't wait for these masses to convince chip makers
         | that ieee754 floating point really has a lot of crap that we
         | don't need.
         | morelandjs wrote:
         | What compute environments are used for LLMs like ChatGPT? Do
         | you see this recent frenzy driving demand for HPC engineers?
           | cavisne wrote:
           | LLM training (at least the ones we have research details on)
           | sync weights every step so they have very high networking and
           | latency needs. That's why every cloud vendor competes on
           | interconnect speeds for GPU machines.
           | So they are basically classic supercomputing workloads.
             | cavisne wrote:
             | I don't see a lot of traditional "HPC engineers" in the ml
             | infrastructure space though.
             | I do wonder if in hindsight using a Slurm cluster for
             | scheduling, Lustre for data, MPI for any connectivity that's
             | not covered by NCCL, would have been better than trying to
             | make object storage, grpc, kubernetes, ray etc work.
               | the_svd_doctor wrote:
               | "HPC engineering" and "Optimizing LLM training" are
               | similar. It's about profiling, performance modeling,
               | finding bottlenecks, rewriting what's needed, etc. Lots
               | of overlap, and people from more traditional HPC doing it
               | too. Obviously if you're a 50yo university professor
               | doing airplane simulations you won't switch to LLMs today
               | though...
           | adw wrote:
           | Some mixture of MPI, NCCL, Gloo, and whatever proprietary
           | stuff TPU clusters do. All of these are basically either
           | trad-HPC in style or literally from the supercomputing
           | community. Interconnects tend to be Infiniband or the like,
           | which, again, straight out of big iron.
           | hackandthink wrote:
           | "Our biggest jobs run MPI, and all pods within the job are
           | participating in a single MPI communicator."
           | https://openai.com/research/scaling-kubernetes-to-7500-nodes
           | p_l wrote:
           | CUDA has support for MPI, including MPI done from GPU itself
           | (over nvlink or with support for accessing host infiniband
           | adapter, iirc).
           | osigurdson wrote:
           | They use Kubernetes and MPI
           | https://openai.com/research/scaling-kubernetes-to-7500-nodes
         | mirker wrote:
         | I think one of the most glaring problems with the argument is
         | HPC maintains tons of legacy simulations which may not even
         | have the original authors around. Adding Spark support wouldn't
         | make this easier.
           | sfpotter wrote:
           | Not sure why this was downvoted. This is absolutely true.
       | osigurdson wrote:
       | The article is a little out of date. MPI is used for LLM training
       | so has made a bit of a comeback.
         | kergonath wrote:
         | Also, HPC is not dead, so obviously MPI did not kill it.
       | RcouF1uZ4gsC wrote:
       | With regards to Spark, consider if you could possibly do the
       | processing on a single machine. For many workloads, a single
       | threaded program running on a laptop will beat an entire Spark
       | cluster.
       | https://www.frankmcsherry.org/assets/COST.pdf
         | kristjansson wrote:
         | This was a great point to make at the time, when people thought
         | "my data has exceeded Excel's row limit, therefore I should set
         | up a Hadoop cluster and run Spark jobs against it"
         | Since then ... it's become a bit of a meme, unfortunately.
         | Definitely there still exist workloads assigned to Spark
         | clusters that could run on a laptop, especially if the data
         | happens to be there already. But the space as a whole provides
         | immense value, both enabling jobs that really don't fit on
         | laptops, and moving the compute for laptop sized jobs to where
         | the data happens to be.
         | briankelly wrote:
         | The datasets they test against are 6 GB and 15 GB. I get that
         | those are the two that one of their references uses, but
         | that's clearly not multi-node territory. Also, as they point
         | out, graph computation is not trivially parallelized. Spark is
         | more for doing long-running transformations on independent
         | data in a fault-tolerant way.
