[HN Gopher] Grid: AI platform from the makers of PyTorch Lightning
       ___________________________________________________________________
        
       Grid: AI platform from the makers of PyTorch Lightning
        
       Author : ishcheklein
       Score  : 71 points
       Date   : 2020-10-08 16:27 UTC (6 hours ago)
        
 (HTM) web link (www.grid.ai)
 (TXT) w3m dump (www.grid.ai)
        
       | bkkaggle wrote:
       | i used pytorch lightning back in may when i was working on
       | pretraining gpt2 on TPUs (https://bkkaggle.github.io/blog/nlp-
       | research-part-2/). it was really impressive how stable it was,
       | especially given that a lot of features were still being added
       | at a very fast pace.
       | 
       | also, this was probably the first (and maybe still is?) high-
       | level pytorch library that let you train on tpus without a lot
       | of refactoring and bugs, which was a really nice thing to be
       | able to do given how unstable the pytorch-xla api still was at
       | that point. <3
        
       | ishcheklein wrote:
       | More on this is here https://techcrunch.com/2020/10/08/grid-ai-
       | raises-18-6m-serie...
       | 
       | What do you think, folks?
        
       | sillysaurusx wrote:
       | I dislike that pytorch advertises TPU support. Pytorch doesn't
       | support TPUs. Pytorch supports a gimped version of TPUs with
       | no access to the TPU CPU, a massive 300GB memory store that
       | handles infeed. No infeed means you have to feed the TPUs
       | manually, on demand, like a GPU. And TPUs are not GPUs. When
       | you try to do that, you're talking _at least_ a 40x slowdown,
       | no exaggeration. The TPU CPU is the heart of the TPU's power
       | and advantage over GPUs, and neither pytorch nor Jax supports
       | it at all yet. No MLPerf benchmark will ever use pytorch in
       | its current form on TPUs.
       | 
       | Luckily, that form is changing. There are interesting plans. But
       | they are still just plans.
       | 
       | It's better to go the other direction, I think. I ported pytorch
       | to tensorflow:
       | https://twitter.com/theshawwn/status/1311925180126511104?s=2...
       | 
       | Pytorch is mostly just an API. And that API is mostly Python.
       | When people say they "like pytorch", they're expressing a
       | preference for how to organize ML code, not for the set of
       | operations available to them when they use pytorch.
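       | 
       | To make "feed the TPUs manually" concrete, here's a rough,
       | untested sketch of the host-side loop pytorch-xla pushes you
       | toward (ParallelLoader and xla_model are real torch_xla names
       | as of late 2020; model, optimizer, train_loader and train_step
       | are placeholders):
       | 
       |   import torch_xla.core.xla_model as xm
       |   import torch_xla.distributed.parallel_loader as pl
       | 
       |   device = xm.xla_device()
       |   # the *VM's* Python process iterates the DataLoader and
       |   # ships each batch to the TPU cores on demand
       |   loader = pl.ParallelLoader(train_loader, [device])
       |   for batch in loader.per_device_loader(device):
       |       loss = train_step(model, batch)  # fwd/bwd on the TPU
       |       xm.optimizer_step(optimizer)     # host drives each step
       | 
       | Compare that with tf.data, where the input pipeline lives on
       | the TPU's own host and the cores pull from that 300GB staging
       | area without a round trip to your VM.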
        
         | wfalcon wrote:
         | We currently have google engineers training on TPU pods with
         | PyTorch Lightning.
         | 
         | TPU support is VERY real... yes, it sometimes breaks, but
         | PyTorch and Google are working very hard to bridge that gap.
         | 
         | We have dedicated partners at Google on the TPU team working
         | to make Lightning run seamlessly on pods.
         | 
         | Check out the discussions here:
         | https://github.com/PyTorchLightning/pytorch-lightning/issues...
        
           | sillysaurusx wrote:
           | No, you do not support the TPU infeed, and this is a crucial
           | distinction. Saying that you do support this has caused
           | endless confusion and much surprise. It's almost not an
           | exaggeration to say that you're lying (sorry for phrasing
           | this so bluntly, but I've seriously spent dozens of hours
           | trying to break this misconception due to hype like this).
           | 
           | TPU support is real. Pytorch does in fact run on TPUs. But
           | you don't support TPU _CPU memory_, the staging area that
           | you're supposed to fill with training data. That staging area
           | is why a TPU v3-512 pod can train an imagenet resnet
           | classifier in 3 minutes at around 1M examples per second.
           | 
           | You will not get _anywhere near_ that performance with
           | pytorch on TPUs. In fact, you're expected _to create a
           | separate VM for every 8 TPU cores_. The VMs are in charge of
           | feeding the cores. That's insane; I've driven TPU pods from a
           | single n1-standard-2 using tensorflow.
           | 
           | Repeat after me: if you are required to create more than one
           | VM, you do not (yet!) support TPU pods. I wish I could triple
           | underline this and put it in bold. People need to understand
           | the limitations of this technique. Creating 256 VMs to feed a
           | v3-2048 is not sustainable.
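           | 
           | For contrast, here's roughly what driving a pod from one
           | small VM looks like in tensorflow 2.x (a sketch from
           | memory, not copy-paste ready; "my-pod", make_model and
           | make_dataset are placeholders):
           | 
           |   import tensorflow as tf
           | 
           |   TPUResolver = (
           |       tf.distribute.cluster_resolver.TPUClusterResolver)
           |   resolver = TPUResolver(tpu="my-pod")
           |   tf.config.experimental_connect_to_cluster(resolver)
           |   tf.tpu.experimental.initialize_tpu_system(resolver)
           |   strategy = tf.distribute.TPUStrategy(resolver)
           | 
           |   with strategy.scope():
           |       model = make_model()
           |       model.compile(optimizer="sgd", loss="mse")
           |   # the tf.data pipeline executes on the TPU hosts, not
           |   # on this VM; that is the infeed doing the work
           |   model.fit(make_dataset(), epochs=1)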
        
             | wfalcon wrote:
             | Like I said... the pytorch and tensorflow teams are
             | working very hard to make this work. And yes, it's not
             | 1:1 with tensorflow, but we're making progress very
             | aggressively.
        
               | sillysaurusx wrote:
               | I love what you guys are doing, and I love improving
               | the ML ecosystem, but you've gotta understand, people
               | see this and think "oh, ok, it's a small difference,
               | no big deal." In fact it's a _huge_ difference.
               | 
               | Picture a person with one arm and no legs. Would you
               | describe them as merely "not 1:1 in terms of
               | features"? They certainly won't be winning any races.
               | 
               | And unlike real people, you can't graft on a prosthetic
               | limb to help this situation. The issue I'm describing
               | here is a fundamental one that everyone keeps trying to
               | sweep under the rug and pretend isn't an issue. And then
               | everyone wonders what's going on.
        
               | wfalcon wrote:
               | I 100% agree. We don't want to misrepresent TPU support.
               | In fact, we explicitly warn users in our docs. Open to
               | suggestions about how we can communicate this much better
               | to our users.
               | 
               | We just need to be part of the effort to bridge the
               | big gap and lower the barriers keeping users from TPU
               | adoption.
               | 
               | https://pytorch-
               | lightning.readthedocs.io/en/latest/tpu.html#...
        
             | minimaxir wrote:
             | There's a difference between "supporting TPUs" and
             | "supporting TPUs at 100% potential". Although the
             | distinction is important, I don't think the marketing here
             | is misleading.
        
               | sillysaurusx wrote:
               | Not only is it misleading, it even somehow tricked you.
               | :)
               | 
               | We're not talking about a small 10% reduction in
               | performance here. We're talking about a 40x
               | difference.
               | 
               | If it seems unbelievable, and like it can't possibly be
               | true, well: now you understand my frustration here, and
               | why I'm trying to break the myth.
               | 
               | Notice not a single benchmark has ever gone head to head
               | in MLPerf using pytorch on TPUs. And that's because using
               | pytorch on TPUs requires you to feed _each image
               | manually_ to the TPU on demand, _from your VM_. Meaning
               | the TPU is always infeed bound.
               | 
               | Engineers should be wincing at the sound of that.
               | Especially anyone with graphics experience. Being infeed
               | bound means you have lots of horsepower sitting around
               | doing nothing. And that's exactly the situation you'll
               | end up in with this technique.
               | 
               | There's a way to settle this decisively: train a resnet
               | classifier on imagenet, as quickly as possible. If you
               | get anywhere _near_ the MLPerf v0.6 benchmarks for
               | tensorflow on TPUs, I will instantly pivot the other
               | direction and sing the praises of pytorch on TPUs far and
               | wide.
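               | 
               | A quick way to see it for yourself (an untested
               | back-of-envelope sketch; loader and model are
               | placeholders):
               | 
               |   import time
               |   import torch_xla.core.xla_model as xm
               | 
               |   device = xm.xla_device()
               |   t0 = time.time()
               |   batch = next(loader)        # host-side fetch
               |   t1 = time.time()
               |   loss = model(batch.to(device))
               |   xm.mark_step()              # dispatch to the TPU
               |   xm.wait_device_ops()        # block until done
               |   t2 = time.time()
               |   # if feed time dwarfs compute time, the cores sit
               |   # idle waiting for data: infeed bound
               |   print("feed:", t1 - t0, "compute:", t2 - t1)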
        
               | [deleted]
        
         | arugulum wrote:
         | I want to signal-boost this. TPU support on PyTorch is partial.
         | You can run modeling computation on the TPU with PyTorch, but
         | not the data-loading. And without the TPU's data-loading,
         | you're significantly, significantly bottlenecked, to the point
         | where you are often better off using GPUs. The reason why
         | TensorFlow and TPUs synergize so well is that the TPUs
         | themselves can consume data for training, allowing for massive
         | scalability.
         | 
         | I have great respect for the PyTorch-TPU team, but I would
         | recommend not heavily advertising PyTorch-TPU support until
         | this major feature disparity is made up.
        
           | wfalcon wrote:
           | we highlight these issues in our docs explicitly.
           | 
           | https://pytorch-
           | lightning.readthedocs.io/en/latest/tpu.html#...
        
             | marcinzm wrote:
             | Can you point out where exactly in those docs you highlight
             | the issue?
             | 
             | I just read the linked page and found no references to data
             | loading limitations or performance limitations. Is it
             | only in the video, which isn't search-indexed and which
             | few people would bother watching?
             | 
             | edit: The page literally advertises the speed of TPUs with
             | "In general, a single TPU is about as fast as 5 V100 GPUs!"
             | which is the exact opposite of warning people.
        
         | hobofan wrote:
         | > When people say they "like pytorch", they're expressing a
         | preference for how to organize ML code
         | 
         | Maybe that's the way you feel, but for me that's very
         | different. Pytorch is much more than just an API (which is also
         | nothing to scoff at).
         | 
         | It also means much cleaner documentation, a very different
         | ecosystem of libraries (mostly better, but sometimes lacking
         | depending on the niche), and less magic (which makes it
         | easier to debug). It has the benefit of less ecosystem
         | churn, too, while the TF1->2 transition as well as the
         | external Keras->internal Keras move were a shitshow almost
         | as bad as Python2->3.
        
           | desku wrote:
           | What niche libraries do you think PyTorch is lacking? Do you
           | have some examples of ones that exist in Tensorflow with no
           | PyTorch equivalent?
        
         | tmabraham wrote:
         | We both know (and correct me if I'm wrong) that this is an
         | issue with the GCP cloud architecture.
         | https://github.com/pytorch/xla/issues/1858
         | 
         | So you can't blame the PyTorch team. If there's anyone to
         | blame, it's Google Cloud. In the meantime, I don't think
         | there's any harm in advertising PyTorch with TPU support if
         | running on TPUs with PyTorch is often much faster than running
         | on GPUs with PyTorch.
        
           | tmabraham wrote:
           | In the above-linked GitHub issue, the _Google_ TPU team is
           | now giving an ETA of early 2021. At that point PyTorch TPU
           | training (including on TPU pods) should be equivalent to TF
           | TPU training. But I think my point still stands that as long
           | as PyTorch TPU training is faster than GPU training, even in
           | its current state, there's nothing wrong with advertising
           | TPU support now.
        
       | minimaxir wrote:
       | So _this_ is the endgame of pytorch-lightning, which was always a
       | mystery to me. (if you haven't used it, it's strongly
       | recommended if you use PyTorch:
       | https://github.com/PyTorchLightning/pytorch-lightning )
       | 
       | IMO, open source is at its best when it's supported by a SaaS as
       | it provides a strong incentive to keep the project up-to-date,
       | and the devs of PL have been very proactive.
        
         | wfalcon wrote:
         | yes! our goal is to completely remove any engineering from the
         | AI research -> production lifecycle.
         | 
         | Not just a marginal improvement on that experience, but a
         | completely different, 10x better approach.
        
           | wfalcon wrote:
           | Lightning is built for researchers by researchers... we've
           | already taken a much different approach.
           | 
           | 100% agree with you that going the other way is likely not
           | the best approach.
           | 
           | Lightning + Grid elevates non-experts closer to
           | researchers... ie: they focus on building products and
           | doing science, not on engineering.
           | 
           | That's what lightning excels at today. That's the experience
           | Grid will 10x.
        
           | sillysaurusx wrote:
           | Dead end. Train researchers to be engineers. Not the other
           | way around.
           | 
           | I wish you luck though. We'll see in ten years whether
           | programmers are as effective as researchers, or whether
           | researchers are as effective as programmers. In 60-some years
           | of computing, no one has achieved the latter, despite many
           | attempts.
           | 
           | Also, I was surprised that this webpage is basically a
           | waitlist and nothing else. No discussion of technique, no
           | docs, no substance. Just "you like pytorch? Pytorch
           | rules!"-type hype.
           | 
           | I do like pytorch, but I also like knowing one or two
           | substantive points about what the proposal here is. If you
           | want to train a model from your laptop, it's a matter of
           | applying to TFRC and kicking off a TPU.
           | 
           | The whole ecosystem is in need of massive overhaul. I like
           | the ambition. But I dislike trying to pretend we aren't
           | programmers. ML is programming, and pretending otherwise will
           | always cause massive, avoidable delays.
        
             | tasubotadas wrote:
             | I would pick engineers over researchers any day, as you
             | can teach an engineer how to do research but the
             | opposite is rarely the case.
        
             | mahaniok wrote:
             | not sure where "you like pytorch" is coming from.
             | Grid.ai is not limited to pytorch :)
        
             | minimaxir wrote:
             | I would not have touched TensorFlow (or AI in general) at
             | all if it weren't for Keras, and I wouldn't be happy with
             | PyTorch if it weren't for PyTorch Lightning.
             | 
             | Easy onboarding for ML tooling is very valuable for the
             | industry as a whole.
        
           | orbifold wrote:
           | As a scientist: the amount I struggled to get distributed
           | training working on an HPC cluster vs. how easy it was
           | with Lightning was eye-opening. Almost no code change, and
           | finally I can run across 20 nodes with 4 V100s each :).
           | Plus the automatic SLURM checkpoints and restarts <3.
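           | 
           | For anyone curious, the change amounted to something like
           | this (from memory, 2020-era Lightning API; MyModel is my
           | own LightningModule):
           | 
           |   from pytorch_lightning import Trainer
           | 
           |   model = MyModel()
           |   # same script as single-GPU; Lightning picks up the
           |   # SLURM environment and wires up DDP + checkpointing
           |   trainer = Trainer(gpus=4, num_nodes=20,
           |                     distributed_backend="ddp")
           |   trainer.fit(model)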
        
       | neilc wrote:
       | Congratulations to the Grid team on the fundraise and the
       | announcement! Exciting stuff.
       | 
       | It seems like there is an emerging consensus that (a) DL
       | development requires access to massive compute, but (b) if you're
       | only using off-the-shelf PyTorch or TensorFlow, moving your model
       | from your personal development environment to a cluster or cloud
       | setting is too difficult -- it is easy to spend most of your time
       | managing infrastructure rather than developing models. At
       | Determined AI, we've spent the last few years building an open
       | source DL training platform that tries to make that process a lot
       | simpler (https://github.com/determined-ai/determined), but I
       | think it's fair to say that this is still very much an open space
       | and an important problem. Curious to take a look at Grid AI and
       | see how it compares to other tools in the space -- some other
       | alternatives include Kubeflow, Polyaxon, and Spell AI.
        
         | wfalcon wrote:
         | maybe determined should come up with its own API instead of
         | copying Lightning's :)
         | 
         | Not a nice move for the open-source spirit. Also, pretty
         | sure it's a violation of our patent and 100% copyright
         | infringement.
        
       | kbash9 wrote:
       | Seems like PyTorch Lightning is the only first-class citizen
       | in your offering. Is that true? Or are there value-added
       | features for TensorFlow and other non-DL libraries such as
       | scikit-learn?
       | 
       | Also, is there support for distributed training on large
       | datasets that don't fit into single-instance memory? Or just
       | distributed grid-search/hyper-parameter optimization?
        
         | wfalcon wrote:
         | No, we support other frameworks too!
         | 
         | It's just that if you use Lightning you'll have zero
         | friction. With the others... you might run into issues
         | inherent in those frameworks' harder-to-work-with designs.
        
       | high_derivative wrote:
       | I am extremely pessimistic about ML ops startups like this.
       | At the end of the day, cloud service providers have too much
       | of an incentive to provide these tools for free as a cloud
       | value add.
       | 
       | The other thing is that stitching together other open source
       | tools like this simply doesn't provide enough value. Who will
       | be incentivised to buy?
       | 
       | Saying this as a FAANG ML org person who sees the push to
       | open source ops tooling like this.
        
         | wfalcon wrote:
         | Well... we're not really an ML ops startup haha. I am ALSO
         | pessimistic about ML Ops startups.
         | 
         | But calling Grid an ML Ops startup is like calling Lightning
         | Keras... maybe at a quick glance it looks like that, but
         | that's where the similarities end.
         | 
         | For what it's worth, a lot of what we're building comes from my
         | experience at FAIR.
        
           | andrewmutz wrote:
           | If it is not ML Ops, how would you describe it?
        
             | wfalcon wrote:
             | it's honestly just a different approach. ML ops is
             | adjusting your code to work with the cloud and managing all
             | that.
             | 
             | For us it's basically integrating clouds directly into your
             | code so the barrier disappears and the cloud providers
             | become an extension of your laptop.
        
               | orbifold wrote:
               | Might not be your target audience, but high energy
               | physics has been operating an infrastructure like that
               | for years: https://www.etp.physik.uni-
               | muenchen.de/research/grid-computi... You can basically
               | use these frameworks to run your analysis jobs on any of
               | the connected HPC centers and interactively move
               | workloads around. The data never has to touch your hard
               | drive either but gets moved to the compute on demand.
               | This is how thousands of physicists do statistical
               | analysis on petabytes of data.
        
               | wfalcon wrote:
               | super cool!
               | 
               | One of the professors at my lab at NYU CILVR (Kyle
               | Cranmer), I believe, was super involved with this. Will
               | definitely sync up with him!
               | 
               | Thanks for the heads up!
        
         | DevX101 wrote:
         | Sounds like a great reason to acquire a startup like this for
         | $XX million. Might be a great outcome if the team is lean with
         | minimal investment.
        
           | hobofan wrote:
           | Well that ship seems to have sailed with a ~$19mil
           | investment and a 20-person team (~$4mil/year burn in
           | NYC?).
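           | (back-of-envelope, assuming a fully-loaded ~$200k/head in
           | NYC: 20 x $200k = $4mil/year)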
        
         | minimaxir wrote:
         | On _paper_ that's the case.
         | 
         | Google certainly has made a push toward scalable AI training
         | and deployment. However, it is not fun to use in practice,
         | speaking from experience.
         | 
         | Startups beating an incumbent with substantially better UX is
         | always a good story. Improving productivity is an easy winner
         | for potential customers.
        
       | [deleted]
        
       | visarga wrote:
       | How do you handle the security of training data? If the data
       | is super sensitive, how do you deal with it?
       | 
       | I know the same could be said about Azure and AWS, but the big
       | name cloud providers stake their prestige on having tight
       | security, while a startup has much less to lose.
        
         | mahaniok wrote:
         | On the contrary, a startup has everything to lose: it has no
         | other business lines.
        
       | seibelj wrote:
       | The name is unfortunately close to "The Grid", an AI website
       | builder that had a lot of buzz, scammed a lot of people out of
       | money, and then disappeared: https://medium.com/@seibelj/the-
       | grid-over-promise-under-deli...
        
         | wfalcon wrote:
         | The grid is how you harness and distribute power and
         | electricity... like that coming from lightning :)
         | 
         | Second, electricity was a great new technology (ie: AI),
         | but you needed the power grid to make it usable - that's
         | Grid AI.
        
           | immigrantsheep wrote:
           | Also related to TRON
        
       ___________________________________________________________________
       (page generated 2020-10-08 23:01 UTC)