[HN Gopher] The AI Research SuperCluster
___________________________________________________________________

The AI Research SuperCluster

Author : minimaxir
Score  : 56 points
Date   : 2022-01-24 18:34 UTC (4 hours ago)

(HTM) web link (ai.facebook.com)
(TXT) w3m dump (ai.facebook.com)

| zwieback wrote:
| What's the track record on "super computers"? It seems risky to pour so much money into a platform that will be superseded in a few years' time. Then again, it's not clear that there are really good alternatives.

| halfeatenpie wrote:
| Well, it's partially the cost/benefit analysis of:
|
| 1. How much benefit will we get?
| 2. Will this benefit be higher than the cost of purchasing this right now?
| 3. What other alternatives will satisfy our needs?
|
| For most of these solutions, the answers are:
|
| 1. A lot. We need computing power to perform these analyses faster and to be competitive.
| 2. Yes, as continuing to innovate in this space will keep us competitive and give our scientists the resources to remain productive.
| 3. AWS/GCP/Azure are alternatives, sure, but at the rate at which Meta probably uses these resources it likely costs them less to build this out than to pay AWS/GCP/Azure for access to this hardware.

| tedivm wrote:
| Major GPU architecture changes only happen every few years.
|
| * K80 - 2014, and this really was not that great of a chip.
|
| * V100 in 2017, and I'd consider this the first "built from the ground up" ML chip.
|
| * A100 in late 2020, with the first major cloud general availability being in 2021.
|
| Even when new chips come out the old ones are still usable - you can rent K80s fairly cheaply on all cloud providers, and they have kept a surprising amount of their resale value. The V100s are also very much still in demand.
|
| The A100 is also an amazing system - the new NVSwitch architecture means the A100s work together far, far better than their V100 counterparts. I was part of an upgrade project setting up an A100 cluster with InfiniBand, and it really is amazing how well these chips work together. That communication barrier was a pretty obvious next step, though (the K80s had crap inter-GPU communication, the V100s introduced NVLink, and NVSwitch was the obvious way to go). There isn't an obvious next step, and I expect the A100s to be the standard for at least the next four years (with lots of continued use after that).

| boulos wrote:
| You skipped the T4, which came in between Volta and Ampere, as well as the P100 between the K80 and V100. So I'd say "meaningful chip changes" happen closer to every 18 months.
|
| The T4, though, isn't a "big part", but for people who fit within its envelope it's a huge win (since its cost is so much lower). A lot of deep learning folks built out Turing-based workstations in that time period, and I think they're still reasonable value for money.

| codeulike wrote:
| The Terminator franchise is based around the folly of letting an AI control a nuclear arsenal. But here we are building the biggest AI ever and letting it analyse our social interactions. Think of the power this could have if it goes rogue! It could manipulate entire populations by minutely controlling what they see and read. Surely if manipulation at that scale and fidelity became possible it would be something to be concerned about?
|
| .... Oh wait ....
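A brief illustration of tedivm's point above about inter-GPU communication: in data-parallel training, every step ends with an all-reduce of the gradients, so NVLink/NVSwitch and InfiniBand bandwidth bounds how fast a job scales across GPUs. The sketch below is a generic micro-benchmark, not anything from the thread or from Meta's RSC; it assumes PyTorch with the NCCL backend, launched via torchrun, and the tensor size is purely illustrative.

    # benchmark_allreduce.py - rough inter-GPU bandwidth probe (illustrative only)
    import os, time
    import torch
    import torch.distributed as dist

    def main():
        # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for each worker process.
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

        # Stand-in for the gradients of a ~1B-parameter fp16 model (~2 GB).
        grads = torch.ones(1_000_000_000, dtype=torch.float16, device="cuda")

        torch.cuda.synchronize()
        t0 = time.time()
        dist.all_reduce(grads)      # sums the tensor across every GPU in the job
        torch.cuda.synchronize()
        dt = time.time() - t0

        if dist.get_rank() == 0:
            gb = grads.numel() * grads.element_size() / 1e9
            print(f"all-reduce of {gb:.1f} GB took {dt * 1000:.0f} ms "
                  f"(~{gb / dt:.0f} GB/s per GPU, ignoring ring-algorithm overhead)")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Run with, for example, `torchrun --nproc_per_node=8 benchmark_allreduce.py`; the same call that takes tens of milliseconds over a switched NVLink/InfiniBand fabric can be orders of magnitude slower over a constrained network, which is the gap the comments above are describing.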
| georgeecollins wrote:
| The premise of the science fiction novel "After On" is that the first AI to reach sentience is running a dating app. It's actually a good, well-researched book.

| abhinai wrote:
| I wish researchers outside Meta were allowed to rent this SuperCluster, for maximum benefit to humanity.

| forgotmyoldacc wrote:
| Why not just rent AWS/Azure/GCP instead? They're all about the same: top-of-the-line enterprise GPUs with fast interconnect.

| tedivm wrote:
| They are not the same at all. AWS has the best GPU instances right now, but there are some huge differences in networking speed. The P4 instances have 400Gbps per machine, with 8 GPUs on each machine. If you were to self-host a cluster using the standard DGX machines you get 200Gbps per GPU, for a total of 1600Gbps for just the GPUs. The DGX machine has another two InfiniBand ports that can be used to attach to storage at pretty intense speeds as well.
|
| This makes a huge difference when using more than a single machine. I've done the math and purchased the machines at a previous company - assuming you aren't leaving the machines idle most of the time, you save a considerable amount of money and get a lot better performance when building your own cluster.

| boulos wrote:
| Disclosure: I used to work for Google Cloud.
|
| > you save a considerable amount of money and get a lot better performance when building your own cluster.
|
| This _heavily_ depends on how much benefit you get from the GPU performance improvement of each generation. A lot of people assume 3/4-yr TCO. If you instead "rent" for 1 year at a time, you've been getting >2x benefits per generation lately.
|
| Most folks also measure "occupancy" for clusters like this rather than "utilization". That is, if a job is using 128 "GPUs" that counts as 128 in use. But that ignores that many jobs might have been just fine with T4s (which are a lot cheaper) versus A100s. (Depends a lot on the model, the I/O, etc.) Once you've bought a physical cluster, you're kind of stuck with that configuration (for better or worse).
|
| tl;dr: It's not just about "idle".

| tedivm wrote:
| These generations tend to be three years apart though, so if you're buying as the new generation comes out then your total TCO period has you running almost-peak hardware (there were several versions of the V100, each with minor improvements). Many vendors also offer buy-back and upgrade programs.
|
| At the same time, it's hard to overstate how different the prices are here. Our break-even point for using on-prem compared to AWS was about nine months. After that we saved money for the rest of that hardware's lifetime.
|
| I definitely agree that people shouldn't just rush out and buy these without benchmarking and examining their use case. The cloud is really good for this! At the same time, though, I have yet to see any cloud provider with anything even approaching the interconnect I can get on-prem, which means it's basically impossible to get the same performance out of the cloud as on-prem right now.

| boulos wrote:
| > using on prem compared to AWS was about nine months.
|
| At _list_ price or a moderate discount. The folks at this scale aren't paying that :).
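To make the break-even arithmetic in this subthread concrete, here is a minimal sketch. The capital cost and cloud rate below are placeholders that roughly echo figures quoted later in the thread; the monthly operating cost is purely an assumption, and none of these are real vendor prices.

    # On-prem vs. cloud break-even, back-of-the-envelope (all numbers assumed).

    def break_even_months(capex, monthly_opex, monthly_cloud_bill):
        """Months until cumulative on-prem spend drops below cumulative cloud spend."""
        monthly_savings = monthly_cloud_bill - monthly_opex
        if monthly_savings <= 0:
            return float("inf")   # at this utilization, renting stays cheaper
        return capex / monthly_savings

    capex = 450_000             # hypothetical two-machine, 16-GPU cluster incl. networking and NAS
    monthly_opex = 6_000        # assumed colo space, power, and support per month
    monthly_cloud = 2 * 24_000  # two comparable 8-GPU cloud instances rented 24/7 (assumed)

    months = break_even_months(capex, monthly_opex, monthly_cloud)
    savings_3yr = (monthly_cloud - monthly_opex) * 36 - capex
    print(f"break-even after ~{months:.0f} months, ~${savings_3yr:,} saved over three years")

As the replies above note, the conclusion flips if the machines sit idle much of the time, if discounts apply, or if per-generation performance gains make short rentals more attractive; this formula only captures the fully utilized case.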
| dekhn wrote:
| The network topology and the network switch itself can also make a huge difference depending on traffic conditions; you might have tons of fat NICs per GPU, but if all of them want to do all-to-all, you had better have a ton of cross-section bandwidth.
|
| I always wonder about performance on these clusters. Back in MY day, I'd wait a week or more for results from my jobs, then immediately resubmit to wait two weeks in a queue for another week of runtime, and do lots of data processing in the downtime. Then I moved to the cloud and decided on "what can I afford to do overnight" (i.e., I set my time-to-result to be about 12 hours). I have a hard time justifying additional hardware to get results in 10 minutes versus a day; it seems like at that point you're just using it to get fast cycle times on new ideas, but who has new ideas every 10 minutes?

| tedivm wrote:
| > the network topology and the network switch itself can also make a huge difference depending on traffic conditions; you might have tons of fat NICs per GPU, but if all of them want to do all-to-all, you had better have a ton of cross-section bandwidth.
|
| This is the beauty of the new NVSwitch chips and of InfiniBand networks instead of Ethernet. Anyone who is doing this is setting up a fully switched, high-bandwidth InfiniBand network with ridiculous traffic between the machines. Nvidia purchased Mellanox a year or two ago - combine that with the ridiculously good NVSwitch in the A100 DGX machines and there's a huge jump in cross-chip traffic capability. At the same time, though, a decent Mellanox switch is probably going to set you back $30k.

| dekhn wrote:
| I'm not aware of any cost-effective switch that permits scaling all-to-all to arbitrary sizes. That's my point. Modern InfiniBand uses nearly all the same tech as previous supercomputers, but with faster interfaces, and more of them. For example, the Facebook cluster is a dual-layer Clos network, which is one of a few cost-effective ways to get very high randomly distributed traffic, but all-to-all communication scales as n^2, and n^2 wires get expensive fast.
|
| Better to find algorithms that need less communication than to make faster computers that allow you to write algorithms that need lots of communication. Otherwise you'll always pay $$$ to reach peak GPU performance.

| ska wrote:
| On-prem for nearly anything is going to be at least a bit of a win if your utilization is uniform or predictable. The real win for not doing it is in adaptability.

| tedivm wrote:
| Yeah, definitely, but I think it's important to talk about the scale of that win.
|
| I mentioned in another comment that the GPU cadence is roughly three years between major architecture upgrades - this has held true for a while now, and that interval may even stretch out a little. When the average company builds one of these clusters it's safe to assume they'll either run it for three years or sell it back for some return.
|
| Going with the cloud, and assuming you don't commit to several years (losing that adaptability), the yearly cost of a p4 instance is $287,087. Over three years that's $861,261 to run a single machine. For about $450k you can build out a solid two-machine (16-GPU) cluster (including InfiniBand networking gear and a solid NAS) that will easily last three years. There are datacenters which specialize in this and companies that can manage these machines.
| If you don't have the cash up front you can lease them on good terms, and your yearly bill will still be much lower than with AWS.
|
| Model training is basically the one use case where I'm really willing to purchase equipment instead of using the cloud. The money it saves is enough to hire one or two more staff members, and the maintenance is shockingly low if you get it set up right to start.

| halfeatenpie wrote:
| You can rent cluster access (to an extent) using Azure Batch.
|
| Granted, it's probably not at this scale, but it gives you access to a ton of resources.

| ankeshanand wrote:
| You can also rent a Cloud TPU v4 pod (https://cloud.google.com/tpu), which has 4,096 TPU v4 chips with fast interconnect, amounting to around 1.1 exaflops of compute. It won't be cheap though (in excess of $20M/year, I believe).

| caaqil wrote:
| This thread should probably be merged with this one: https://news.ycombinator.com/item?id=30062019

| kn8a wrote:
| Are there any alternatives to gradient-based learning that could make this less useful? Is there another type of compute unit that is the next evolution of CPU -> GPU -> ?

| winterismute wrote:
| It's a tough question: it's not only back-propagation but sometimes just the "parameters" of the models that matter. For example, [1] shows that models such as ResNeXt already perform better on a very different architecture such as Graphcore's, for some sizes of convolutions. Older models, or models that get tuned for existing GPUs, do not perform as well.
|
| It's tough to come up with a new architecture that has an advantage on current and future models, at least from a peak-performance point of view; from a perf/watt point of view, on the other hand, the scaled-up Apple GPUs seem to show interesting new properties. But the Graphcore architecture is quite interesting, being able to act somehow both as a SIMD machine and a task-parallel machine.
|
| [1] https://arxiv.org/pdf/1912.03413v1.pdf

| randcraw wrote:
| Based on the following constraints that lie at the center of AI and parallelism, I'd say no -- stochastic gradient pursuit using vector processors like GPUs is inescapable in all future AI advances.
|
| 1) All AI is based on search (esp. non-convex, where heuristics are insufficient to provide a global convex solution), and thus is inevitably implemented using iteration, driven locally by gradient pursuit and globally by... ways to efficiently gather information to optimize the loss function that measures how well that info gain is being refined and exploited.
|
| 2) Search that is inherently non-convex and inefficient requires as much compute power as possible, i.e., using supercomputers.
|
| 3) All supercomputer-based solutions to non-convex problems are implemented iteratively, where results are improved not using closed-form math or complete info, but by incremental optimization of partial results that aggregate with the iterations, like repeated stochastic gradient descent that creates and enhances 'resonant' clusters of 'neurons'.
|
| 4) The only form of supercomputing that has proven to scale up anywhere near indefinitely is data-parallelism (a dataflow-specific form of SIMD) -- where the search space is spread as evenly (and naively) as possible across as many processing elements as possible.
|
| 5) Vector processing hardware like GPUs implements data-parallelism as well as any HPC architecture yet devised.
|
| Thus, I believe that AI is stuck with GPUs, or equivalent meshes of vector processors, indefinitely.

| buildbot wrote:
| It's not larger than Microsoft's? https://blogs.microsoft.com/ai/openai-azure-supercomputer/

| boulos wrote:
| That was a V100 cluster, though. 10k V100s is less powerful (for ML stuff) than ~6k A100s.

| mooneater wrote:
| > All this infrastructure must be extremely reliable, as we estimate some experiments could run for weeks and require thousands of GPUs
|
| Is it hardware-fault tolerant? Curious how well this will work otherwise as it scales.

| etaioinshrdlu wrote:
| It is interesting that it only allows training on anonymized and encrypted data. I wonder how much these restrictions slow down their research?
|
| Although, they are definitely a good idea considering the data source.
___________________________________________________________________
(page generated 2022-01-24 23:03 UTC)
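A closing note on mooneater's reliability question above: multi-week jobs on thousands of GPUs are normally made robust to hardware faults by periodic checkpointing and automatic restart rather than by perfectly reliable nodes. The sketch below shows that generic pattern only; it assumes PyTorch and a hypothetical shared checkpoint path, and is not a description of RSC's actual software stack.

    # checkpointing.py - generic save/resume pattern for long training runs
    import os
    import torch

    CKPT_PATH = "/checkpoints/latest.pt"   # hypothetical path on shared storage
    SAVE_EVERY = 500                       # steps between checkpoints (assumed)

    def save_checkpoint(model, optimizer, step):
        tmp = CKPT_PATH + ".tmp"
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, tmp)
        os.replace(tmp, CKPT_PATH)         # atomic rename so a crash mid-write leaves the old file intact

    def resume(model, optimizer):
        """Returns the step to resume from (0 if no checkpoint exists yet)."""
        if not os.path.exists(CKPT_PATH):
            return 0
        state = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"] + 1

    # In the training loop, a restarted job picks up where the last checkpoint left off:
    #
    #   for step in range(resume(model, optimizer), total_steps):
    #       ...one training step...
    #       if step % SAVE_EVERY == 0:
    #           save_checkpoint(model, optimizer, step)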