[HN Gopher] Grid: AI platform from the makers of PyTorch Lightning ___________________________________________________________________ Grid: AI platform from the makers of PyTorch Lightning Author : ishcheklein Score : 71 points Date : 2020-10-08 16:27 UTC (6 hours ago) (HTM) web link (www.grid.ai) (TXT) w3m dump (www.grid.ai) | bkkaggle wrote: | i used pytorch lightning back in may when i was working on | pretraining gpt2 on TPUs (https://bkkaggle.github.io/blog/nlp- | research-part-2/). it was really impressive how stable it was | especially given how a lot of features were still being added at | a very fast pace. | | also, this was probably the first (and maybe still is?) high- | level pytorch library that let you train on tpus without a lot of | refactoring and bugs which was a really nice thing to be able to | do given how the pytorch-xla api was still unstable at that | point. <3 | ishcheklein wrote: | More on this is here https://techcrunch.com/2020/10/08/grid-ai- | raises-18-6m-serie... | | What do you think folks? | sillysaurusx wrote: | I dislike that pytorch advertises TPU support. Pytorch doesn't | support TPUs. Pytorch supports a gimped version of TPUs that have | no access to the TPU CPU, a massive 300GB memory store that | handles infeed. No infeed means you have to feed the TPUs | manually, on demand, like a gpu. And TPUs are not GPUs. When you | try to do that, you're talking _at least_ a 40x slowdown, no | exaggeration. The TPU CPU is the heart of the TPU's power and | advantage over GPUs, and neither pytorch nor Jax support it at | all yet. No MLPerf benchmark will ever use pytorch in its current | form on TPUs. | | Luckily, that form is changing. There are interesting plans. But | they are still just plans. | | It's better to go the other direction, I think. I ported pytorch | to tensorflow: | https://twitter.com/theshawwn/status/1311925180126511104?s=2... | | Pytorch is mostly just an api. And that api is mostly python. 
| When people say they "like pytorch", they're expressing a | preference for how to organize ML code, not for the set of | operations available to you when you use pytorch. | wfalcon wrote: | We currently have google engineers training on TPU pods with | PyTorch Lightning. | | TPU support is VERY real... but yes, sometimes it breaks, but | PyTorch and Google are working very hard to bridge that gap. | | But we have dedicated partners at Google on the TPU team | working to get Lightning working seamlessly on pods. | | Check out the discussions here: | https://github.com/PyTorchLightning/pytorch-lightning/issues... | sillysaurusx wrote: | No, you do not support the TPU infeed, and this is a crucial | distinction. Saying that you do support this has caused | endless confusion and much surprise. It's almost not an | exaggeration to say that you're lying (sorry for phrasing | this so bluntly, but I've seriously spent dozens of hours | trying to break this misconception due to hype like this). | | TPU support is real. Pytorch does in fact run on TPUs. But | you don't support TPU _CPU memory_, the staging area that | you're supposed to fill with training data. That staging area | is why a TPU v3-512 pod can train an imagenet resnet | classifier in 3 minutes at around 1M examples per second. | | You will not get _anywhere near_ that performance with | pytorch on TPUs. In fact, you're expected _to create a | separate VM for every 8 TPU cores_. The VMs are in charge of | feeding the cores. That's insane; I've driven TPU pods from a | single n1-standard-2 using tensorflow. | | Repeat after me: if you are required to create more than one | VM, you do not (yet!) support TPU pods. I wish I could triple | underline this and put it in bold. People need to understand | the limitations of this technique. Creating 256 VMs to feed a | v3-2048 is not sustainable. | wfalcon wrote: | Like I said... the pytorch and tensorflow teams are working very | hard to make this work.
And yes, it's not a 1:1 with | tensorflow, but we're making progress very aggressively. | sillysaurusx wrote: | I love what you guys are doing, and I love improving the | ML ecosystem, but you've gotta understand, people see | this and think "oh, ok, it's a small difference, no big | deal." In fact it's a _huge_ difference. | | Picture a person with one arm and without legs. Would you | say they aren't "1:1 in terms of features"? They | certainly won't be winning any races. | | And unlike real people, you can't graft on a prosthetic | limb to help this situation. The issue I'm describing | here is a fundamental one that everyone keeps trying to | sweep under the rug and pretend it isn't an issue. And then | everyone wonders what's going on. | wfalcon wrote: | I 100% agree. We don't want to misrepresent TPU support. | In fact, we explicitly warn users in our docs. Open to | suggestions about how we can communicate this much better | to our users. | | We just need to be a part of the effort to help bridge | the big gap and barriers keeping users from TPU adoption. | | https://pytorch- | lightning.readthedocs.io/en/latest/tpu.html#... | minimaxir wrote: | There's a difference between "supporting TPUs" and | "supporting TPUs at 100% potential". Although the | distinction is important, I don't think the marketing here | is misleading. | sillysaurusx wrote: | Not only is it misleading, it even somehow tricked you. | :) | | We're not talking about a small 10% reduction in | performance here. We're talking like 40x differences. | | If it seems unbelievable, and like it can't possibly be | true, well: now you understand my frustration here, and | why I'm trying to break the myth. | | Notice not a single benchmark has ever gone head to head | in MLPerf using pytorch on TPUs. And that's because using | pytorch on TPUs requires you to feed _each image | manually_ to the TPU on demand, _from your VM_. Meaning | the TPU is always infeed bound.
| | Engineers should be wincing at the sound of that. | Especially anyone with graphics experience. Being infeed | bound means you have lots of horsepower sitting around | doing nothing. And that's exactly the situation you'll | end up in with this technique. | | There's a way to settle this decisively: train a resnet | classifier on imagenet, as quickly as possible. If you | get anywhere _near_ the MLPerf v0.6 benchmarks for | tensorflow on TPUs, I will instantly pivot the other | direction and sing the praises of pytorch on TPUs far and | wide. | [deleted] | arugulum wrote: | I want to signal-boost this. TPU support on PyTorch is partial. | You can run modeling computation on the TPU with PyTorch, but | not the data-loading. And without the TPU's data-loading, | you're significantly, significantly bottle-necked, to the point | where you are often better off using GPUs. The reason why | TensorFlow and TPUs synergize so well is that the TPUs | themselves can consume data for training, allowing for massive | scalability. | | I have great respect for the PyTorch-TPU team, but I would | recommend not heavily advertising PyTorch-TPU support until | this major feature disparity is made up. | wfalcon wrote: | we highlight these issues in our docs explicitly. | | https://pytorch- | lightning.readthedocs.io/en/latest/tpu.html#... | marcinzm wrote: | Can you point out where exactly in those docs you highlight | the issue? | | I just read the linked page and found no references to data | loading limitations or performance limitations. Is it only | in the video which isn't search indexed and few people | would bother watching? | | edit: The page literally advertises the speed of TPUs with | "In general, a single TPU is about as fast as 5 V100 GPUs!" | which is the exact opposite of warning people. 
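The "infeed bound" situation described in this thread can be made concrete with a toy throughput model (plain Python; all numbers below are hypothetical, chosen only to show how a large slowdown falls out of the arithmetic when host-side feeding and device compute serialize instead of overlapping):

```python
# Toy model: if the host VM must hand the accelerator one batch at a
# time, each step pays host time + device time in sequence. With an
# on-device staging area keeping batches ready, the steady-state step
# time is just the device time. The 78/2 split is an illustrative
# assumption, not a measurement.

def host_fed_step_ms(host_ms: float, device_ms: float) -> float:
    """Step time when the host feeds each batch on demand:
    host work and device work serialize."""
    return host_ms + device_ms

def device_fed_step_ms(host_ms: float, device_ms: float) -> float:
    """Step time when batches are staged on the device ahead of
    time: the device never waits on the host."""
    return device_ms

if __name__ == "__main__":
    host_ms, device_ms = 78.0, 2.0  # hypothetical per-batch costs
    slowdown = host_fed_step_ms(host_ms, device_ms) / device_fed_step_ms(host_ms, device_ms)
    busy = device_ms / host_fed_step_ms(host_ms, device_ms)
    print(f"slowdown: {slowdown:.0f}x, device busy {busy:.1%} of the time")
    # -> slowdown: 40x, device busy 2.5% of the time
```

With on-device staging (how TensorFlow feeds TPUs from the TPU CPU's memory), the host cost drops out of the steady-state step time entirely, which is the gap being argued about here.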
| hobofan wrote: | > When people say they "like pytorch", they're expressing a | preference for how to organize ML code | | Maybe that's the way you feel, but for me that's very | different. Pytorch is much more than just an API (which is also | nothing to scoff at). | | It also has much cleaner documentation, a very different | ecosystem of libraries (mostly better, but sometimes lacking | depending on the niche), less magic (which makes it easier to | debug). It also has the benefit of less ecosystem churn, while | the transitions of TF1->2 as well as the external | Keras->internal Keras are a shitshow that's almost as bad as | Python2->3. | desku wrote: | What niche libraries do you think PyTorch is lacking? Do you | have some examples of ones that exist in Tensorflow with no | PyTorch equivalent? | tmabraham wrote: | We both know (and correct me if I'm wrong) that this is an | issue with the GCP cloud architecture. | https://github.com/pytorch/xla/issues/1858 | | So you can't blame the PyTorch team. If there's anyone to | blame, it's Google Cloud. In the meantime, I don't think | there's any harm in advertising PyTorch with TPU support if | running on TPUs with PyTorch is often much faster than running | on GPUs with PyTorch. | tmabraham wrote: | In the above-linked GitHub issue, the _Google_ TPU team is | now giving an ETA of early 2021. At that point PyTorch TPU | training (including on TPU pods) should be equivalent to TF | TPU training. But I think my point still stands that as long | as PyTorch TPU training is faster than GPU training, even in | its current state, there's nothing wrong with advertising TPU | support now. | minimaxir wrote: | So _this_ is the endgame of pytorch-lightning, which was always a | mystery to me.
(if you haven't used it, it's strongly | recommended if you use PyTorch: | https://github.com/PyTorchLightning/pytorch-lightning ) | | IMO, open source is at its best when it's supported by a SaaS as | it provides a strong incentive to keep the project up-to-date, | and the devs of PL have been very proactive. | wfalcon wrote: | yes! our goal is to completely remove any engineering from the | AI research -> production lifecycle. | | Not just a marginal improvement on that experience but a 10x | completely different approach. | wfalcon wrote: | Lightning is built for researchers by researchers... we've | already taken a much different approach. | | 100% agree with you that going the other way is likely not | the best approach. | | Lightning + Grid elevates non-experts closer to | researchers... ie: focus on building the products and doing | science and not the engineering. | | That's what lightning excels at today. That's the experience | Grid will 10x. | sillysaurusx wrote: | Dead end. Train researchers to be engineers. Not the other | way around. | | I wish you luck though. We'll see in ten years whether | programmers are as effective as researchers, or whether | researchers are as effective as programmers. In 60-some years | of computing, no one has achieved the latter, despite many | attempts. | | Also, I was surprised that this webpage is basically a | waitlist and nothing else. No discussion of technique, no | docs, no substance. Just a "you like pytorch? Pytorch rules!" | type hype. | | I do like pytorch, but I also like knowing one or two | substantive points about what the proposal here is. If you | want to train a model from your laptop, it's a matter of | applying to TFRC and kicking off a TPU. | | The whole ecosystem is in need of a massive overhaul. I like | the ambition. But I dislike trying to pretend we aren't | programmers. ML is programming, and pretending otherwise will | always cause massive, avoidable delays.
| tasubotadas wrote: | I would pick engineers over researchers any day as you can | teach an engineer how to do research but the opposite is | rarely the case. | mahaniok wrote: | not sure where "you like pytorch" is coming from. Grid.ai | is not limited to pytorch :) | minimaxir wrote: | I would not have touched TensorFlow (or AI in general) at | all if it weren't for Keras, and I wouldn't be happy with | PyTorch if it weren't for PyTorch Lightning. | | Easy onboarding for ML tooling is very valuable for the | industry as a whole. | orbifold wrote: | As a scientist: The amount I struggled with getting | distributed training on an HPC cluster to work vs. how easy it | was with Lightning was eye-opening. Almost no code change and | finally I can run across 20 nodes with 4 V-100 each :). Plus | the automatic SLURM checkpoints and restarts <3. | neilc wrote: | Congratulations to the Grid team on the fundraise and the | announcement! Exciting stuff. | | It seems like there is an emerging consensus that (a) DL | development requires access to massive compute, but (b) if you're | only using off-the-shelf PyTorch or TensorFlow, moving your model | from your personal development environment to a cluster or cloud | setting is too difficult -- it is easy to spend most of your time | managing infrastructure rather than developing models. At | Determined AI, we've spent the last few years building an open | source DL training platform that tries to make that process a lot | simpler (https://github.com/determined-ai/determined), but I | think it's fair to say that this is still very much an open space | and an important problem. Curious to take a look at Grid AI and | see how it compares to other tools in the space -- some other | alternatives include Kubeflow, Polyaxon, and Spell AI. | wfalcon wrote: | maybe determined should come up with its own API instead of | copying Lightning's :) | | Not a nice move for the opensource spirit.
Also, pretty sure | it's a violation of our patent and 100% copyright infringement. | kbash9 wrote: | Seems like PyTorch Lightning is the only first-class citizen in | your offering. Is that true? Or are there value-added features | for TensorFlow and other non-DL libraries such as scikit-learn? | | Also, is there support for distributed training for large | datasets that don't fit into single-instance memory? Or just | distributed grid-search/hyper-parameter optimization? | wfalcon wrote: | No, we support other frameworks too! | | Just that if you use lightning you'll have zero friction. Well | as with the others... you might run into issues inherent in the | other frameworks' hard-to-work-with designs. | high_derivative wrote: | I am extremely pessimistic about ML ops startups like this. At the | end of the day, cloud service providers have too much of an | incentive to provide these tools for free as a cloud value add. | | The other thing is that stitching together other open source | tools like this is simply not enough value. Who will be | incentivised to buy? | | Saying this as a FAANG ML org person where I see the push to open | source ops tooling like this. | wfalcon wrote: | Well... we're not really an ML ops startup haha. I am ALSO | pessimistic about ML Ops startups. | | But calling Grid an ML Ops startup is like calling Lightning | Keras... maybe at a quick glance it looks like that, but that's | where the similarities end. | | For what it's worth, a lot of what we're building comes from my | experience at FAIR. | andrewmutz wrote: | If it is not ML Ops, how would you describe it? | wfalcon wrote: | it's honestly just a different approach. ML ops is | adjusting your code to work with the cloud and managing all | that. | | For us it's basically integrating clouds directly into your | code so the barrier disappears and the cloud providers | become an extension of your laptop.
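For context on the "distributed grid-search" half of kbash9's question: grid search is embarrassingly parallel, since every point in the Cartesian product of hyper-parameter values is an independent trial, which is what makes it easy to farm out to a cluster. A minimal single-machine sketch (the search space, parameter names, and stand-in objective are made up for illustration; a real trial would train a model and return a validation metric):

```python
from itertools import product
from multiprocessing import Pool

# Hypothetical search space; in practice each trial would be a
# full training run on its own worker.
SPACE = {"lr": [0.1, 0.01, 0.001], "batch_size": [32, 64]}

def trial(params):
    # Stand-in "validation loss", lowest near lr=0.01, batch_size=64.
    loss = abs(params["lr"] - 0.01) * 100 + abs(params["batch_size"] - 64) / 64
    return loss, params

def grid(space):
    # Cartesian product of all hyper-parameter values.
    keys = list(space)
    return [dict(zip(keys, vals)) for vals in product(*(space[k] for k in keys))]

if __name__ == "__main__":
    with Pool() as pool:                       # trials are independent,
        results = pool.map(trial, grid(SPACE))  # so they parallelize trivially
    best_loss, best_params = min(results)
    print(best_params)  # -> {'lr': 0.01, 'batch_size': 64}
```

Swapping the `Pool` for a cluster scheduler (SLURM array jobs, Kubernetes jobs, or a managed service) changes where the trials run but not the structure of the search.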
| orbifold wrote: | Might not be your target audience, but high energy | physics has been operating an infrastructure like that | for years: https://www.etp.physik.uni- | muenchen.de/research/grid-computi... You can basically | use these frameworks to run your analysis jobs on any of | the connected HPC centers and interactively move | workloads around. The data never has to touch your hard | drive either but gets moved to the compute on demand. | This is how thousands of physicists do statistical | analysis on petabytes of data. | wfalcon wrote: | super cool! | | One of the professors at my lab at NYU CILVR (Kyle | Cranmer) I believe was super involved with this. Will | definitely sync up with him! | | Thanks for the heads up! | DevX101 wrote: | Sounds like a great reason to acquire a startup like this for | $XX million. Might be a great outcome if the team is lean with | minimal investment. | hobofan wrote: | Well that ship seems to have sailed with a ~$19mil investment | and a 20-person (+$4mil/year burn in NYC?) team. | minimaxir wrote: | On _paper_ that's the case. | | Google certainly has made a push toward scalable AI training | and deployment. However, it is not fun to use in practice, | speaking from experience. | | Startups beating an incumbent with substantially better UX is | always a good story. Improving productivity is an easy winner | for potential customers. | [deleted] | visarga wrote: | How do you handle the security of training data? If the data is | super sensitive how do you deal with it? | | I know the same could be said about Azure and AWS, but the big | name cloud providers stake their prestige on having tight | security, while a startup has much less to lose. | mahaniok wrote: | On the contrary, a startup has everything to lose.
No other | business lines | seibelj wrote: | The name is unfortunately close to "The Grid", an AI website | builder that had a lot of buzz then scammed a lot of people out | of money then disappeared https://medium.com/@seibelj/the-grid- | over-promise-under-deli... | wfalcon wrote: | The grid is how you harness and distribute power and | electricity.... like that coming from lightning :) | | Second, electricity was a great new technology (ie: AI), but | you needed the power grid to make it usable - that's grid AI. | immigrantsheep wrote: | Also related to TRON ___________________________________________________________________ (page generated 2020-10-08 23:01 UTC)