[HN Gopher] The AI Research SuperCluster
       ___________________________________________________________________
        
       The AI Research SuperCluster
        
       Author : minimaxir
       Score  : 56 points
       Date   : 2022-01-24 18:34 UTC (4 hours ago)
        
 (HTM) web link (ai.facebook.com)
 (TXT) w3m dump (ai.facebook.com)
        
       | zwieback wrote:
        | What's the track record on "supercomputers"? It seems risky
        | to pour so much money into a platform that will be superseded
        | in a few years' time. Then again, it's not clear that there
        | are really good alternatives.
        
         | halfeatenpie wrote:
          | Well, it's partially the cost/benefit analysis of:
          | 
          | 1. How much benefit will we get?
          | 
          | 2. Will this benefit be higher than the cost of purchasing
          | this right now?
          | 
          | 3. What other alternatives will satisfy our needs?
         | 
          | For most of these solutions, the answers are:
          | 
          | 1. A lot. We need computing power to perform these analyses
          | faster and to be competitive.
          | 
          | 2. Yes, as continuing to innovate in this space will keep us
          | competitive and give our scientists the resources to remain
          | productive.
          | 
          | 3. AWS/GCP/Azure are alternatives, sure, but at the rate at
          | which Meta probably uses these resources, it probably costs
          | them less to build this out than to pay AWS/GCP/Azure for
          | access to this hardware.
        
         | tedivm wrote:
         | Major GPU architecture changes only happen every few years.
         | 
          | * K80 in 2014, and this really was not that great a chip.
         | 
          | * V100 in 2017, and I'd consider this the first "built from
          | the ground up" ML chip.
         | 
         | * A100 in late 2020, with the first major cloud general
         | availability being in 2021.
         | 
          | Even when new chips come out, the old ones are still usable:
          | you can rent K80s fairly cheaply on all the cloud providers,
          | and they have kept a surprising amount of their resale value.
          | The V100s are also very much still in demand.
         | 
          | The A100 is also an amazing system: the new NVSwitch
          | architecture means the A100s work together far, far better
          | than their V100 counterparts. I was part of an upgrade
          | project setting up an A100 cluster with InfiniBand, and it
          | really is amazing how well these chips work together. That
          | communication barrier was a pretty obvious next step though
          | (the K80s had crap inter-GPU communication, the V100s
          | introduced NVLink, and the NVSwitch was the obvious way to
          | go). There isn't an obvious next step, and I expect the
          | A100s to be the standard for at least the next four years
          | (with lots of continued use after that).
        
           | boulos wrote:
            | You skipped the T4, which came in between Volta and
            | Ampere, as well as the P100 between the K80 and V100. So
            | I'd say "meaningful chip changes" happen closer to every
            | 18 months.
           | 
            | The T4 though isn't a "big part", but for people who fit
            | within its envelope, it's a huge win (since its cost is so
            | much lower). A lot of deep learning folks built out
            | Turing-based workstations in that time period, and I think
            | they're still reasonable value for money.
        
       | codeulike wrote:
       | The Terminator franchise is based around the folly of letting an
       | AI control a nuclear arsenal. But here we are building the
       | biggest AI ever and letting it analyse our social interactions.
       | Think of the power this could have if it goes rogue! It could
        | manipulate entire populations by minutely controlling what
        | they see and read. Surely if manipulation at that scale and
        | fidelity became possible, it would be something to be
        | concerned about?
       | 
       | .... Oh wait ....
        
         | georgeecollins wrote:
         | The premise of the science fiction novel "After On" is that the
         | first AI to reach sentience is running a dating app. It's
          | actually a good, well-researched book.
        
       | abhinai wrote:
       | I wish researchers outside Meta were allowed to rent this
       | SuperCluster for maximum benefit to humanity.
        
         | forgotmyoldacc wrote:
         | Why not just rent AWS/Azure/GCP instead? They're all about the
         | same. Top of the line enterprise GPUs with fast interconnect.
        
           | tedivm wrote:
            | They are not the same at all. AWS has the best GPU
            | instances right now, but there are some huge differences
            | in networking speed. The P4 instances have 400Gbps per
            | machine, with 8 GPUs on each machine. If you self-host a
            | cluster using the standard DGX machines, you get 200Gbps
            | per GPU, for a total of 1600Gbps for just the GPUs. The
            | DGX machine has another two InfiniBand ports that can be
            | used to attach storage at pretty intense speeds as well.
           | 
            | This makes a huge difference when using more than a
            | single machine. I've done the math and purchased the
            | machines at a previous company: assuming you aren't
            | leaving the machines idle most of the time, you save a
            | considerable amount of money and get much better
            | performance by building your own cluster.
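            | 
            | Back-of-the-envelope in Python, using just the numbers
            | above (the per-link speeds are the rough figures quoted
            | here, not vendor spec sheets):
            | 
            |     # aggregate GPU-fabric bandwidth, cloud vs DGX
            |     cloud_gbps = 400        # one shared 400Gbps NIC/box
            |     dgx_per_gpu_gbps = 200  # one IB link per GPU
            |     gpus = 8
            | 
            |     dgx_gbps = dgx_per_gpu_gbps * gpus    # 1600 Gbps
            |     print(cloud_gbps / gpus)              # 50 Gbps/GPU
            |     print(dgx_per_gpu_gbps)               # 200 Gbps/GPU
            |     print(dgx_gbps / cloud_gbps)          # 4.0x fabric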
        
             | boulos wrote:
             | Disclosure: I used to work for Google Cloud.
             | 
             | > you save a considerable amount of money and get a lot
             | better performance when building your own cluster.
             | 
              | This _heavily_ depends on how much benefit you get from
              | each generation's GPU performance improvements. A lot of
              | people assume a 3- or 4-yr TCO. If you instead "rent"
              | for 1 year at a time, you've been getting >2x benefits
              | per generation lately.
             | 
             | Most folks also measure "occupancy" for clusters like this
             | rather than "utilization". That is, if a job is using 128
             | "GPUs" that counts as 128 in use. But that ignores that
             | many jobs might have been just fine with T4s (which are a
             | lot cheaper) versus A100s. (Depends a lot on the model, the
             | I/O, etc.) Once you've bought a physical cluster, you're
             | kind of stuck with that configuration (for better or
             | worse).
             | 
             | tl;dr: It's not just about "idle".
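              | 
              | A toy rent-vs-buy model of that tradeoff (every number
              | here is hypothetical, purely to show the shape of the
              | math, and it optimistically assumes a generation jump
              | lands each rental year):
              | 
              |     # owning fixes your perf/$ for the TCO period;
              |     # renting lets you jump to each new generation
              |     buy_cost = 450_000       # hypothetical 3-yr cluster
              |     rent_per_year = 300_000  # hypothetical yearly rent
              |     gen_speedup = 2.0        # ">2x per generation"
              | 
              |     work_owning = 3 * 1.0    # 3 years at gen-0 speed
              |     work_renting = 1.0 + gen_speedup + gen_speedup**2
              |     print(buy_cost / work_owning)            # $ per unit
              |     print(3 * rent_per_year / work_renting)  # of work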
        
               | tedivm wrote:
                | These generations tend to be three years apart though,
                | so if you're buying as the new generation comes out,
                | your total TCO period has you running almost-peak
                | hardware (there were several versions of the V100,
                | each with minor improvements). Many vendors also offer
                | buy-back and upgrade programs.
               | 
                | At the same time, it's hard to overstate how different
                | the prices are here. Our break-even point for using on
                | prem compared to AWS was about nine months. After that
                | we saved money for the rest of that hardware's
                | lifetime.
               | 
                | I definitely agree that people shouldn't just rush out
                | and buy these without benchmarking and examining their
                | use case. The cloud is really good for that! At the
                | same time, though, I have yet to see any cloud
                | provider with anything even approaching the
                | interconnect I can get on prem, so it's basically
                | impossible to get the same performance out of the
                | cloud as on prem right now.
        
               | boulos wrote:
               | > using on prem compared to AWS was about nine months.
               | 
                | At _list_ price or a moderate discount. The folks at
                | this scale aren't paying that :).
        
             | dekhn wrote:
              | The network topology and the network switch itself can
              | also make a huge difference depending on traffic
              | conditions; you might have tons of fat NICs per GPU, but
              | if all of them want to do all-to-all communication, you
              | had better have a ton of cross-section bandwidth.
             | 
              | I always wonder about performance on these clusters.
              | Back in MY day, I'd wait a week or more for results from
              | my jobs, then immediately resubmit to wait two weeks in
              | a queue for another week of runtime, and do lots of data
              | processing in the downtime. Then I moved to the cloud
              | and decided on "what can I afford to do overnight"
              | (i.e., I set my time-to-result to about 12 hours). I
              | have a hard time justifying additional hardware to get
              | results in 10 minutes versus a day; it seems like at
              | that point you're just using it to get fast cycle times
              | on new ideas, but who has new ideas every 10 minutes?
        
               | tedivm wrote:
                | > The network topology and the network switch itself
                | can also make a huge difference depending on traffic
                | conditions; you might have tons of fat NICs per GPU,
                | but if all of them want to do all-to-all
                | communication, you had better have a ton of cross-
                | section bandwidth.
               | 
                | This is the beauty of the new NVSwitch chips and of
                | InfiniBand networks instead of Ethernet. Anyone who is
                | doing this is setting up a fully switched, high-
                | bandwidth InfiniBand network with ridiculous traffic
                | between the machines. Nvidia purchased Mellanox a year
                | or two ago; combine that with the ridiculously awesome
                | NVSwitch in the A100 DGX machines and there's a huge
                | jump in cross-chip traffic capability. At the same
                | time though, a decent Mellanox switch is probably
                | going to set you back $30k.
        
               | dekhn wrote:
                | I'm not aware of any cost-effective switch that
                | permits scaling all-to-all to arbitrary sizes. That's
                | my point. Modern InfiniBand uses nearly all the same
                | tech as previous supercomputers, but with faster
                | interfaces, and more of them. For example, the
                | Facebook cluster is a dual-layer Clos network, which
                | is one of a few cost-effective ways to get very high
                | randomly distributed traffic, but all-to-all
                | communication scales as n^2, and n^2 wires get
                | expensive fast.
               | 
                | Better to find algorithms that need less communication
                | than to make faster computers that allow you to write
                | algorithms that need lots of communication. Otherwise
                | you'll always pay $$$ to reach peak GPU performance.
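                | 
                | The scaling is easy to see in a few lines of Python
                | (pure illustration, no real hardware numbers):
                | 
                |     # every one of n ranks sends a distinct chunk to
                |     # each of the other n-1, so flows grow as ~n^2
                |     def alltoall_flows(n: int) -> int:
                |         return n * (n - 1)
                | 
                |     for n in (8, 64, 512, 4096):
                |         print(n, alltoall_flows(n))
                |     # 8x the ranks -> ~64x the flows, hence the $$$
                |     # in cross-section bandwidth and wires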
        
             | ska wrote:
             | On-prem for nearly anything is going to be at least a bit
             | of a win if your utilization is uniform or predictable. The
             | real win for not doing it is in adaptability.
        
               | tedivm wrote:
                | Yeah, definitely, but I think it's important to talk
                | about the scale of that win.
               | 
                | I mentioned in another comment that the GPU generation
                | is roughly three years between major architecture
                | upgrades; this has held true for a while now, and that
                | gap may even stretch out a little. When the average
                | company builds one of these clusters, it's safe to
                | assume they'll either run it for three years or sell
                | it back for some return.
               | 
                | Going with the cloud, and assuming you don't commit to
                | several years (losing that adaptability), the yearly
                | cost of a p4 is $287,087. Over three years that's
                | $861,261 to run a single machine. For about $450k you
                | can build out a solid two-machine (16-GPU) cluster
                | (including InfiniBand networking gear and a solid NAS)
                | that will easily last three years. There are
                | datacenters that specialize in this and companies that
                | can manage these machines. If you don't have the cash
                | up front, you can lease them on good terms and your
                | yearly bill will still be much lower than AWS's.
               | 
                | Model training is basically the one use case where I'm
                | really willing to purchase equipment instead of using
                | the cloud. The money it saves is enough to hire one or
                | two more staff members, and the maintenance is
                | shockingly low if you get it set up right to start.
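                | 
                | Rough arithmetic with the figures above (cloud list
                | price vs. the purchased 16-GPU cluster; this ignores
                | power, colo, and staff time for simplicity):
                | 
                |     cloud_per_year = 2 * 287_087  # two p4-class boxes
                |     cluster_price = 450_000  # 2 DGX-class + IB + NAS
                | 
                |     breakeven = cluster_price / cloud_per_year
                |     print(round(breakeven * 12, 1))  # ~9.4 months
                |     saved = 3 * cloud_per_year - cluster_price
                |     print(saved)  # ~$1.27M saved over three years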
        
         | halfeatenpie wrote:
         | You can rent cluster access (to an extent) using Azure Batch.
         | 
          | Granted, it's probably not at this scale, but it gives you
          | access to a ton of resources.
        
           | ankeshanand wrote:
            | You can also rent a Cloud TPU v4 pod
            | (https://cloud.google.com/tpu), which has 4096 TPU v4
            | chips with fast interconnect, amounting to around 1.1
            | exaflops of compute. It won't be cheap though (in excess
            | of $20M/year, I believe).
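            | 
            | That exaflops figure roughly checks out, assuming ~275
            | TFLOPS of bf16 peak per TPU v4 chip (an approximation):
            | 
            |     chips = 4096
            |     tflops_per_chip = 275  # approx bf16 peak per chip
            |     print(chips * tflops_per_chip / 1e6)  # ~1.13 exaflops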
        
       | caaqil wrote:
       | This thread should probably be merged with this one:
       | https://news.ycombinator.com/item?id=30062019
        
       | kn8a wrote:
        | Are there any alternatives to gradient-based learning that
        | could make this less useful? Is there another type of compute
        | unit that is the next evolution of CPU -> GPU -> ?
        
         | winterismute wrote:
          | It's a tough question. It's not even just back-propagation
          | but sometimes the "parameters" of the models: for example,
          | [1] shows that models such as ResNeXt already perform better
          | on a very different architecture such as Graphcore's, for
          | some sizes of convolutions. Older models, or models that get
          | tuned for existing GPUs, do not perform as well.
         | 
          | It's tough to come up with a new architecture that has an
          | advantage on current and future models, at least from a
          | peak-performance point of view; from a perf/watt point of
          | view, on the other hand, the scaled-up Apple GPUs seem to
          | show interesting new properties. But the Graphcore
          | architecture is quite interesting, being able to act somehow
          | as both a SIMD machine and a task-parallel machine.
         | 
         | [1] https://arxiv.org/pdf/1912.03413v1.pdf
        
         | randcraw wrote:
          | Based on the following constraints that lie at the center of
          | AI and parallelism, I'd say no -- stochastic gradient
          | pursuit using vector processors like GPUs is inescapable in
          | all future AI advances.
         | 
          | 1) All AI is based on search (esp. non-convex, where
          | heuristics are insufficient to provide a global convex
          | solution), and thus is inevitably implemented using
          | iteration, driven locally by gradient pursuit and globally
          | by... ways to efficiently gather information to optimize the
          | loss function that measures how well that info gain is being
          | refined and exploited.
         | 
         | 2) Search that is inherently non-convex and inefficient
         | requires as much compute power as possible, i.e. using
         | supercomputers.
         | 
         | 3) All supercomputer-based solutions to non-convex problems are
         | implemented iteratively, where results are improved not using
         | closed-form math or complete info, but by incremental
         | optimization of partial results that aggregate with the
         | iterations, like repeated stochastic gradient descent that
         | creates and enhances 'resonant' clusters of 'neurons'.
         | 
          | 4) The only form of supercomputing that has proven to scale
          | up anywhere near indefinitely is data-parallelism (a
          | dataflow-specific form of SIMD) -- where the search space is
          | spread as evenly (and naively) as possible across as many
          | processing elements as possible.
         | 
          | 5) Vector-processing hardware like GPUs implements data-
          | parallelism as well as any HPC architecture yet devised.
         | 
         | Thus, I believe that AI is stuck with GPUs, or equivalent
         | meshes of vector processors, indefinitely.
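          | 
          | To make the data-parallel SGD point concrete, here is a
          | minimal sketch (numpy standing in for the GPUs; the model,
          | data, and learning rate are all made up for illustration):
          | 
          |     import numpy as np
          | 
          |     def grad(w, X, y):
          |         # gradient of mean squared error, linear model
          |         return 2 * X.T @ (X @ w - y) / len(y)
          | 
          |     rng = np.random.default_rng(0)
          |     X = rng.normal(size=(1024, 8))
          |     w_true = rng.normal(size=8)
          |     y = X @ w_true
          | 
          |     w = np.zeros(8)
          |     shards = np.array_split(np.arange(1024), 4)  # 4 "GPUs"
          |     for step in range(200):
          |         # per-shard gradients, then average (the all-reduce)
          |         g = np.mean([grad(w, X[s], y[s]) for s in shards],
          |                     axis=0)
          |         w -= 0.1 * g  # identical update on every replica
          |     print(np.allclose(w, w_true, atol=1e-3))  # True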
        
       | buildbot wrote:
       | It's not larger than Microsoft's?
       | https://blogs.microsoft.com/ai/openai-azure-supercomputer/
        
         | boulos wrote:
          | That was a V100 cluster though. 10k V100s are less powerful
          | (for ML stuff) than ~6k A100s.
        
       | mooneater wrote:
       | > All this infrastructure must be extremely reliable, as we
       | estimate some experiments could run for weeks and require
       | thousands of GPUs
       | 
        | Is it hardware-fault tolerant? Curious how well this will
        | work otherwise as it scales.
        
       | etaioinshrdlu wrote:
       | It is interesting that it only allows training on anonymized and
       | encrypted data. I wonder how much these restrictions slow down
       | their research?
       | 
        | Although they are definitely a good idea, considering the
        | data source.
        
       ___________________________________________________________________
       (page generated 2022-01-24 23:03 UTC)