[HN Gopher] Accelerated PyTorch Training on M1 Mac
       ___________________________________________________________________
        
       Accelerated PyTorch Training on M1 Mac
        
       Author : tgymnich
       Score  : 335 points
       Date   : 2022-05-18 15:33 UTC (7 hours ago)
        
 (HTM) web link (pytorch.org)
 (TXT) w3m dump (pytorch.org)
        
       | buildbot wrote:
       | This is very interesting since the Mac Studio supports 128GB of
       | unified memory - slowly training a large, memory-heavy model on
       | a single device could be interesting, as could inferencing a
       | very large model.
        
         | zdw wrote:
         | Everything old is new again - the Mac Studio's unified memory
         | echoes the SGI O2, which had similar unified CPU/GPU memory
         | back in the '90s.
         | 
         | In both cases the unified memory machines outperformed much
         | larger machines in specific use cases.
        
           | smoldesu wrote:
           | ... _specific use cases_ being the operative phrase here.
           | Unified memory is cool, but there are reasons we don't use
           | it at scale:
           | 
           | - It needs extremely high-bandwidth controllers, which
           | severely limits the amount of memory you can use (Intel Macs
           | could be configured with an order of magnitude more RAM in
           | their server chips)
           | 
           | - ECC is still off the table on the M1, apparently
           | 
           | - _Most_ workloads aren't really constrained by memory
           | access in modern programs/kernels/compilers. Problems only
           | show up when you want to run a GPU off the same memory, which
           | is what these new Macs account for.
           | 
           | - _Most_ of the so-called "specific workloads" that you're
           | outlining aren't very general applications. So far I've only
           | seen ARM outrun x86 in some low-precision physics demos,
           | which is... fine, I guess? I still don't foresee
           | meteorologists dropping their Intel rigs to buy a Mac Studio
           | anytime soon.
        
             | my123 wrote:
             | > - It needs extremely high-bandwidth controllers, which
             | severely limits the amount of memory you can use (Intel
             | Macs could be configured with an order of magnitude more
             | RAM in their server chips)
             | 
             | In the first half of 2023, the NVIDIA Grace Superchip will
             | ship with a 1TB memory config (930GB usable because of ECC
             | bits) on a 1024-bit-wide LPDDR5X-8533 interface (same width
             | as the M1 Ultra, which uses LPDDR5-6400).
             | 
             | So it's going to become much less of an issue really soon.
        
               | zdw wrote:
               | > So it's going to become much less of an issue really
               | soon.
               | 
               | The main issue would be trying to purchase one of those,
               | which is likely going to be both very rare and orders of
               | magnitude more expensive than a Mac Studio.
               | 
               | The Mac Studio isn't some crazy exotic hardware like
               | datacenter class GPUs, but definitely has some exotic
               | capabilities.
        
               | my123 wrote:
               | > The Mac Studio isn't some crazy exotic hardware like
               | datacenter class GPUs, but definitely has some exotic
               | capabilities.
               | 
               | Datacenter-class GPUs are expensive, yeah, but they are
               | quite easy to buy, even in single-unit quantities.
               | 
               | example: https://www.dell.com/en-us/work/shop/nvidia-
               | ampere-a100-pcie... for the first random link, but there
               | are other stores selling them for significantly cheaper.
               | 
               | I wonder what their CPU pricing will be though... we'll
               | see I guess.
        
             | Q6T46nT668w6i3m wrote:
             | > Most workloads aren't really constrained by memory access
             | in modern programs/kernels/compilers. Problems only show up
             | when you want to run a GPU off the same memory, which is
             | what these new Macs account for.
             | 
             | For sure but I expect this is different for the apps Apple
             | _wants_ to write. It's easy to imagine the next version of
             | Logic or whatever doing fine tuning everywhere.
        
               | smoldesu wrote:
               | What is there to fine-tune, in a program like Logic? I've
               | often heard that word associated with using extended
               | instruction sets and leveraging accelerators, but where
               | would the M1 have "untapped power" so-to-speak? I don't
               | think the "upgrade" from a CISC architecture to a RISC
               | one can yield much opportunity for optimization, at least
               | not besides what the compiler already does for you.
        
       | sbeckeriv wrote:
       | What is the * in the chart referencing?
        
         | mrchucklepants wrote:
         | Probably supposed to be referencing the text under the plot
         | stating the specific configuration of the hardware and
         | software.
        
           | sbeckeriv wrote:
           | Looks like the website was updated after I posted. I used
           | page search to look for the *.
        
       | munro wrote:
       | Yess! This is important for me, because I don't have any $$$ to
       | rent GPUs for personal projects. Now we just need M1 support for
       | JAX.
       | 
       | Since there are no hard benchmarks against other GPUs, here's a
       | Geekbench against an RTX 3080 Mobile laptop I have [1]. Looks
       | like it's about 2x slower--the RTX laptop absolutely rips for
       | gaming, I love it.
       | 
       | [1]
       | https://browser.geekbench.com/v5/compute/compare/4140651?bas...
        
         | jph00 wrote:
         | You can use GPUs for free on Paperspace Gradient, Google Colab,
         | and Kaggle.
        
       | mkaic wrote:
       | This is really cool for a number of reasons:
       | 
       | 1.) Apple Silicon _currently_ can't compete with Nvidia GPUs in
       | terms of raw compute power, but they're already way ahead on
       | energy efficiency. Training a small deep learning model on
       | battery power on a laptop could actually be a thing now.
       | 
       | Edit: I've been informed that for matrix math, Apple Silicon
       | isn't actually ahead in efficiency
       | 
       | 2.) Apple Silicon probably _will_ compete directly with Nvidia
       | GPUs on raw compute power in future generations of products like
       | the Mac Studio and Mac Pro, which is
       | very exciting. Competition in this space is incredibly good for
       | consumers.
       | 
       | 3.) At $4800, an M1 Ultra Mac Studio appears to be far and away
       | the cheapest machine you can buy with 128GB of GPU memory. With
       | proper PyTorch support, we'll actually be able to use this memory
       | for training big models or using big batch sizes. For the kind of
       | DL work I do where dataloading is much more of a bottleneck than
       | actual raw compute power, Mac Studio is now looking _very_
       | enticing.
        
         | smoldesu wrote:
         | There's definitely competition, and it's going to be _really
         | interesting_ to watch Nvidia and Apple duke it out over the
         | next few years:
         | 
         | - Apple undoubtedly _owns_ the densest nodes, and will fight
         | TSMC tooth-and-nail over first dibs on whatever silicon they
         | have coming next.
         | 
         | - Apple's current GPU design philosophy relies on horizontally
         | scaling the tech they already use, whereas Nvidia has been
         | scaling vertically, albeit slowly.
         | 
         | - Nvidia has _insane_ engineers. Despite the fact that they're
         | using silicon that's more than twice as large by area when
         | compared to Apple, they're still doubling their numbers across
         | the board. And that's their last-gen tech too; the comparison
         | once they're on 5nm later this summer is going to be _insane_.
         | 
         | I expect things to be very heated by the end of this year, with
         | new Nvidia, Intel and _potentially_ new Apple GPUs.
        
         | my123 wrote:
         | > but they're already way ahead on energy efficiency
         | 
         | 1) Nope. For neural network training, that's not the case:
         | https://tlkh.dev/benchmarking-the-apple-m1-max
         | 
         | And that's with the 3090 set at a very high 400W power limit;
         | it can get far more efficient when clocked lower.
         | 
         | (Which is expected, because the M1 GPU notably has no dedicated
         | matrix math accelerators.)
         | 
         | 2) We'll see, hopefully Apple thinks that the market is worth
         | bothering with... (which would be great)
         | 
         | 3) Indeed, if you need a giant pool of VRAM above everything
         | else at a relatively low price, Apple is a quite enticing
         | option. If you can stand Metal for your use case, of course.
        
         | hedgehog wrote:
         | To me the cool thing is that working through a PyTorch-based
         | course like FastAI on a local Mac may now be above the
         | "tolerably fast" threshold.
        
         | [deleted]
        
         | mhh__ wrote:
         | The thing with the efficiency (which I'm not sure of) and the
         | competition (probably possible) is that the current Nvidia
         | lineup is pretty old and on an even older process. They have a
         | big moat.
        
         | dekhn wrote:
         | I remain skeptical that Apple's best GPU silicon will match
         | Nvidia's premier products (either the top-end desktop card or
         | a server monster) for training.
         | 
         | It seems like this is ideal as an accelerator for already
         | trained models; one can imagine Photoshop utilizing it for
         | deep-learning based infill-painting.
         | 
         | I was doing training on battery with a laptop that had a 1080;
         | I have trained models on an airplane while totally unplugged
         | and still had enough power to web-surf afterwards.
        
         | sudosysgen wrote:
         | Apple Silicon is not ahead at all on energy efficiency for
         | desktop workloads. If they were ahead on energy efficiency,
         | they would simply be ahead on performance: GPUs are massively
         | parallel architectures, and they are generally limited by the
         | transistor and power budget (and memory, of course).
         | 
         | Apple is simply behind in the GPU space.
         | 
         | > At $4800, an M1 Ultra Mac Studio appears to be far and away
         | the cheapest machine you can buy with 128GB of GPU memory. With
         | proper PyTorch support, we'll actually be able to use this
         | memory for training big models or using big batch sizes. For
         | the kind of DL work I do where dataloading is much more of a
         | bottleneck than actual raw compute power, Mac Studio is now
         | looking very enticing.
         | 
         | The reason why it's cheaper is that its memory has a
         | _fraction_ (around 20-35%) of the memory bandwidth of an
         | equivalent 128GB GPU setup, and it also has to be shared with
         | the CPU.
         | This is an unavoidable bottleneck of shared memory systems, and
         | for a great many applications this is a terminal performance
         | bottleneck.
         | 
         | That's the reason you don't have a GPU with 128GB of normal
         | DDR5. It would just be quite limited. Perhaps for some cases it
         | can be useful.
        
           | p1esk wrote:
           | _its memory is at a fraction (around 30-40%) of the memory
           | bandwidth of a 128GB equivalent GPU setup_
           | 
           | Here's some info about M1 memory bandwidth:
           | https://www.anandtech.com/show/17024/apple-m1-max-
           | performanc...
        
             | sudosysgen wrote:
             | Yes. And the M1 Ultra has even more memory bandwidth than
             | the M1 Max. But a 128GB system made of 3 NVIDIA A6000s has
             | 3x768GB/s of memory bandwidth, and a more common AI-grade
             | card setup has 2x2TB/s, which simply dwarfs the M1 Ultra.
        
               | matthew-wegner wrote:
               | For researchers, sure, but it's still quite an apples-to-
               | oranges comparison.
               | 
               | A6000 is ~$5k per card. I guess you're referring to
               | something like an A100 on that other spec, which is
               | $10k/card (for 40GB of memory).
               | 
               | I do a fair bit of neural/AI art experimentation, where
               | memory on the execution side is sometimes a limiting
               | factor for me. I'm not training models, I'm not a
               | hardcore researcher--those folks will absolutely be using
               | NVIDIA's high-end stuff or TPU pods.
               | 
               | 128GB in a Studio is super compelling if it means I can
               | up-res some of my pieces without needing to use high-
               | memory-but-super-slow CPU cloud VMs, or hope I get lucky
               | with an A100 on Colab (or just pay for a GPU VM).
               | 
               | I have a 128GB/Ultra Studio in my office now. It's a
               | great piece of kit, and a big reason I splurged on it--
               | okay, maybe "excuse"--was that I expect it'll be useful
               | for a lot of my side project workloads over the next
               | couple of years...
        
               | sudosysgen wrote:
               | Hmm, that's interesting. What kind of inference workload
               | requires more than the 48GB of memory you'd get from two
               | 3090s, for example? I'm genuinely curious because I
               | haven't run across them, and it sounds interesting.
        
               | mkaic wrote:
               | Not sure about inference but for training, 128GB is big
               | enough to fit a decent-sized dataset entirely into
               | memory, which causes a massive speedup. It's also
               | probably cheaper to get a 128GB Mac Studio than a
               | dual-3090 rig unless you're willing to build the rig
               | yourself and pay the bare minimum for every component
               | except the GPUs themselves.
               | 
               | As for models that need 128GB of memory _at inference_
               | that a consumer would be interested in, I've got nothing,
               | though it certainly seems like it would be fun to mess
               | around with haha
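               | 
               | To make the in-memory dataset point concrete, a tiny
               | sketch (with small stand-in sizes so it actually runs;
               | the 128GB arithmetic in the comment is hypothetical):
               | 
               |     import torch
               |     from torch.utils.data import TensorDataset, DataLoader
               | 
               |     # Hypothetical 128GB case: 1.3M images x 3 x 128 x
               |     # 128 x 2 bytes (fp16) is roughly 120GB, so such a
               |     # preprocessed dataset can sit entirely in RAM.
               |     # Small stand-in tensors so this sketch runs anywhere:
               |     images = torch.randn(10_000, 3, 64, 64).half()
               |     labels = torch.randint(0, 1000, (10_000,))
               |     loader = DataLoader(TensorDataset(images, labels),
               |                         batch_size=512, shuffle=True)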
        
               | matthew-wegner wrote:
               | Mostly it's old-school style transfer! Well, "old" in the
               | sense that it's pre-CLIP. I've played with
               | CLIP-guided stuff too, but I've been tinkering with a
               | custom style transfer workflow for a few years. The
               | pipeline here is fractal IFS images (Chaotica/JWildfire)
               | -> misc processing -> style transfer -> photo editing,
               | basically.
               | 
               | Only the workflow is the custom part--the core here is
               | literally the original jcjohnson implementation.
               | Occasionally I look around at recent work in the area,
               | but most seems focused on fast (video-speed) inference or
               | pre-baked style models. I've never seen something that
               | retains artistic flexibility.
               | 
               | My original gut feeling on style transfer was that it
               | would be possible to mold it into a neat tool, but most
               | people bumped into it, ran their profile photo against
               | Starry Night, said "cool" and bounced off. And I get that
               | --parameter tuning can be a sloooow process. When I
               | really explore a series with a particular style I start
               | to feed it custom content images made just for how it's
               | reacting with various inputs.
               | 
               | Here's a piece that just finished a few minutes ago:
               | https://mwegner.com/misc/styled_render-
               | BMrHXWz_2RBaUq8pAYKfL...
               | 
               | That's from a local server in my garage with a K80. At
               | some point I had two K80s in there (so basically four
               | K40s with how they work), but dialed it back for power
               | consumption reasons.
               | 
               | I do have a 3090 in the house, and a decent amount of
               | cloud infra that I sometimes tap. The jcjohnson
               | implementation is so far back that it doesn't even run
               | against modern hardware. At some point I need to sort
               | that out, or figure out how to wrangle a more modern
               | implementation into behaving in the way that I like.
               | 
               | I don't really post these anywhere, although I do throw
               | them over the wall on Twitter if anyone is curious to see
               | more. These are a mix of things, although the
               | CLIP/Midjourney/etc stuff is pretty easy to spot:
               | https://twitter.com/mwegner/media
        
               | visarga wrote:
               | GPT-3 sized models need that kind of memory for inference
        
               | sudosysgen wrote:
               | GPT-3 is more like 300GB iirc
        
           | mkaic wrote:
           | Interesting, I wasn't aware of the memory bandwidth point,
           | though it makes sense. TIL!
        
         | ActorNightly wrote:
         | > but they're already way ahead on energy efficiency.
         | 
         | For raw compute like you need for ML training, the M1's
         | efficiency doesn't matter. Under the hood at the hardware
         | level, you have a direct mapping of power consumption to
         | compute circuit activation that you really can't get around.
         | 
         | The general efficiency of the M1 is due to its architecture and
         | how it fits together with normal consumer use: less work on
         | instruction decode, more efficient reordering, less energy
         | wasted moving data around thanks to the shared memory
         | architecture, etc.
        
           | ribit wrote:
           | And yet somehow Apple's GPU ALUs are more efficient, at 3.8
           | watts per TFLOP. Mind, I am not talking about specialized
           | matrix multiplication units that have a different internal
           | organization and can do things like matrix multiplication
           | much more efficiently, but about basic general-purpose GPU
           | ALUs.
           | 
           | The comparison of efficiency between Apple and Nvidia here is
           | a bit misleading because it compares Apple's general-purpose
           | ALUs to Nvidia's specialized ALUs. For a more direct
           | efficiency comparison, one would need to compare the Tensor
           | Cores against the AMX or ANE coprocessors.
           | 
           | As to how Apple achieves such high efficiency, nobody knows.
           | The fact that they are on a 5nm node might help, but there
           | must be something special about the ALU design as well. My
           | speculation is that they are wider and much simpler than in
           | other GPUs, which directly translates to efficiency wins.
        
         | [deleted]
        
       | arecurrence wrote:
       | These are much nicer ergonomics than what I had to do for
       | TensorFlow. It's ostensibly out-of-the-box support as just
       | another torch device.
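       | 
       | For anyone curious what that looks like, a minimal sketch (this
       | assumes a PyTorch build that ships the new `mps` backend, on
       | macOS 12.3+):
       | 
       |     import torch
       | 
       |     # Fall back to CPU if the Metal (MPS) backend isn't available.
       |     use_mps = torch.backends.mps.is_available()
       |     device = torch.device("mps" if use_mps else "cpu")
       | 
       |     # the model's weights now live in unified memory
       |     model = torch.nn.Linear(8, 2).to(device)
       |     x = torch.randn(4, 8, device=device)
       |     print(model(x).device)  # -> mps:0 on an M1 Mac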
        
         | mark_l_watson wrote:
         | I agree. I appreciated the M1/Metal TensorFlow support, but
         | that was not as easy to set up.
        
           | alfalfasprout wrote:
           | I mean, building TensorFlow is generally an awful experience.
        
         | dangrie158 wrote:
        
       | lekevicius wrote:
       | Curiously, neither PyTorch nor TensorFlow currently uses the M1's
       | Neural Engine. Is it too limited? Too hard to interact with? Not
       | worth the effort?
        
         | why_only_15 wrote:
         | The ANE only has support for calculations with fp16, int16 and
         | int8, all of which are too small to train with (too much
         | instability). A common thing to do is train in fp32 to be able
         | to capture the small differences and gradients, and then, once
         | the model is frozen, do inference in fp16 or bf16.
        
           | jph00 wrote:
           | Using mixed precision training you can do most operations in
           | fp16 and just a few in fp32 where needed. This is the norm
           | for NVIDIA GPU training nowadays. For instance, using fastai,
           | add `.to_fp16()` after your learner call and that happens
           | automatically.
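           | 
           | In fastai that's literally one extra call - a rough sketch,
           | assuming a recent fastai 2.x install and an NVIDIA GPU (the
           | small sample dataset here is just illustrative):
           | 
           |     from fastai.vision.all import *
           | 
           |     path = untar_data(URLs.MNIST_SAMPLE)
           |     dls = ImageDataLoaders.from_folder(path)
           |     # .to_fp16() switches training to mixed precision
           |     learn = cnn_learner(dls, resnet18).to_fp16()
           |     learn.fine_tune(1)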
        
             | omegalulw wrote:
             | How is the choice between fp16 and fp32 made? Is it like if
             | any gradients in the tensor need the extra range you use
             | fp32?
        
               | h-jones wrote:
               | The PyTorch docs give a pretty good overview of AMP here 
               | https://pytorch.org/tutorials/recipes/recipes/amp_recipe.
               | htm... and an overview of which operations cast to which
               | dtype can be found here
               | https://pytorch.org/docs/stable/amp.html#autocast-op-
               | referen....
               | 
               | Edit: Fixed second link.
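               | 
               | Roughly, that recipe boils down to autocast plus a
               | gradient scaler - a sketch with synthetic data, assuming
               | a CUDA device:
               | 
               |     import torch
               |     import torch.nn.functional as F
               | 
               |     device = "cuda"
               |     model = torch.nn.Linear(512, 10).to(device)
               |     opt = torch.optim.SGD(model.parameters(), lr=1e-2)
               |     # GradScaler rescales the loss so that fp16
               |     # gradients don't underflow to zero
               |     scaler = torch.cuda.amp.GradScaler()
               | 
               |     for _ in range(10):  # synthetic batches
               |         x = torch.randn(64, 512, device=device)
               |         y = torch.randint(0, 10, (64,), device=device)
               |         opt.zero_grad()
               |         # autocast chooses fp16 or fp32 per op
               |         with torch.cuda.amp.autocast():
               |             loss = F.cross_entropy(model(x), y)
               |         scaler.scale(loss).backward()
               |         scaler.step(opt)
               |         scaler.update()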
        
         | RicoElectrico wrote:
         | Most probably the Neural Engine is optimized for inference, not
         | training.
        
           | sillyinseattle wrote:
           | Question about terminology (no background in AI). In
           | econometrics, estimation is model fitting (training, I
           | guess), and inference refers to hypothesis testing (e.g. t or
           | F tests). What does inference mean here?
        
             | malshe wrote:
             | I have a background in both, and it's very confusing to me.
             | Inference in DL is running a trained model to
             | predict/classify. Inference in stats and econometrics is
             | totally different, as you noted.
        
             | mattkrause wrote:
             | Prediction.
             | 
             | The model is literally "inferring" something about its
             | inputs: e.g., these pixels denote a hot dog, those don't.
        
             | iamaaditya wrote:
             | In machine learning (especially deep learning or neural
             | networks), the 'training' is done using Stochastic Gradient
             | Descent. The gradients are computed using backpropagation.
             | Backpropagation requires you to do a backward pass over
             | your model (typically many layers of neural weights) and
             | thus requires you to keep a lot of intermediate values
             | (called activations) in memory. However, if you are doing
             | "inference", that is, if the goal is only to get the result
             | and not to improve the model, then you don't have to do the
             | backpropagation and thus you don't need to store/save the
             | intermediate values. As the number of layers and parameters
             | in deep learning grows, this difference in computation
             | between training and inference becomes significant. In most
             | modern applications of ML, you train once but infer many
             | times, and thus it makes sense to have specialized hardware
             | that is optimized for "inference" at the cost of its
             | inability to do "training".
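             | 
             | The difference is easy to see in PyTorch itself - a toy
             | sketch, not tied to any particular hardware:
             | 
             |     import torch
             | 
             |     model = torch.nn.Sequential(
             |         torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
             |         torch.nn.Linear(1024, 10))
             |     x = torch.randn(256, 1024)
             | 
             |     # Training step: the forward pass keeps the
             |     # intermediate activations alive so backward()
             |     # can compute gradients from them.
             |     opt = torch.optim.SGD(model.parameters(), lr=0.1)
             |     loss = model(x).sum()
             |     loss.backward()
             |     opt.step()
             | 
             |     # Inference: no graph is recorded, activations
             |     # can be freed layer by layer - exactly what
             |     # inference-only accelerators are built around.
             |     with torch.no_grad():
             |         preds = model(x).argmax(dim=1)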
        
               | eklitzke wrote:
               | Just to add to this, the reason these inference
               | accelerators have become big recently (see also the
               | "neural core" in Pixel phones) is that they help do
               | inference tasks in real time (lower model latency) with
               | better power usage than a GPU.
               | 
               | As a concrete example, on a camera you might want to run
               | a facial detector so the camera can automatically adjust
               | its focus when it sees a human face. Or you might want a
               | person detector that can detect the outline of the person
               | in the shot, so that you can blur/change their background
               | in something like a Zoom call. All of these applications
               | are going to work better if you can run your model at,
               | say, 60 Hz instead of 20 Hz. Optimizing hardware to do
               | inference tasks like this as fast as possible with the
               | least possible power usage is pretty different from
               | optimizing for all the things a GPU needs to do, so you
               | might end up with hardware that has both and uses them
               | for different tasks.
        
               | sillyinseattle wrote:
               | Thank you @iamaaditya and @eklitzke . Very informative
        
               | dataexporter wrote:
               | This sounds really fascinating. Are there any resources
               | that you'd recommend for someone who's starting out in
               | learning all this? I'm a complete beginner when it comes
               | to Machine Learning.
        
               | dr_zoidberg wrote:
               | Deep Learning with Python (2nd ed), by Francois Chollet.
               | 
               | If you don't care about the programming parts, it's got a
               | lot of beginner/intermediate concepts clearly explained.
               | If you do dive into the programming examples, you get to
               | play around with a few architectures and ideas, and
               | you're left ready to dive into the more advanced material
               | knowing what you're doing.
        
             | upwardbound wrote:
             | Inference here means "running" the model. So maybe it has a
             | similar meaning as in econometrics?
             | 
             | Training is learning the weights (millions or billions of
             | parameters) that control the model's behavior, vs inference
             | is "running" the trained model on user data.
        
             | Q6T46nT668w6i3m wrote:
             | I'm surprised nobody has provided the basic explanation:
             | inference, here, means matrix-matrix or matrix-scalar
             | multiplication.
        
             | abm53 wrote:
             | It is confusing that the ML community have come to use
             | "inference" to mean prediction, whereas statisticians have
             | long used it to refer to training/fitting, or hypothesis
             | testing.
             | 
             | I'm not sure when or why this started.
        
           | munro wrote:
           | That /sounds/ right, but training still has a forward pass,
           | so OP does raise a really great question. And looking at the
           | silicon, the neural engine is almost the size of the GPU.
           | Really need someone educated in this area to chime in :)
        
             | dgacmu wrote:
             | You have to stash more information from the forward pass in
             | order to calculate the gradients during backprop. You can't
             | just naively use an inference accelerator as part of
             | training - inference-only gets to discard intermediate
             | activations immediately.
             | 
             | (Also, many inference accelerators use lower precision than
             | you do when training)
             | 
             | There are tricks you can do to use inference to accelerate
             | training, such as one we developed to focus on likely-
             | poorly-performing examples:
             | https://arxiv.org/abs/1910.00762
        
             | my123 wrote:
             | The neural engine is only exposed through a CoreML
             | inference API.
             | 
             | You can't even poke the ANE hardware directly from a
             | regular process. The interface for accessing the neural
             | engine is not hardened (you can easily crash the machine
             | from it).
             | 
             | So the matter is essentially moot in practice as you'd need
             | your users to run with SIP off...
        
         | [deleted]
        
       | singularity2001 wrote:
       | Anyone else getting "illegal hardware instruction"?
       | 
       | (pytorch_env) ~/dev/ai/ python -c "import torch"
        
         | zimpenfish wrote:
         | IIRC, when I had that problem, it was because the wrong arch of
         | Python was being loaded.
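         | 
         | A quick sanity check of which arch the interpreter itself is
         | (just a one-liner, nothing specific to torch):
         | 
         |     python -c "import platform; print(platform.machine())"
         | 
         | That should print arm64 for a native Apple Silicon Python;
         | x86_64 means the interpreter is running under Rosetta and will
         | pull x86_64 wheels, which won't match an arm64 torch build.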
        
       | Scene_Cast2 wrote:
       | I'm curious about the performance compared to something like,
       | say, the RTX 3070.
        
         | ivstitia wrote:
         | Here are some comparison numbers I've come across:
         | https://wandb.ai/tcapelle/apple_m1_pro/reports/Deep-Learning...
         | 
         | It is not really comparable on a steps-per-second level, but
         | the power consumption and now the GPU memory make it pretty
         | enticing.
        
         | apohn wrote:
         | I wrote a comment about a TensorFlow-on-M1 comparison to some
         | cloud providers. I imagine PyTorch on M1 would give similar
         | results. I think the gist is that the 3070 is going to be a
         | better investment.
         | 
         | https://news.ycombinator.com/item?id=30608125
        
         | my123 wrote:
         | Low. Apple doesn't have matrix math accelerators in their
         | current GPUs.
         | 
         | The neural engine is small and inference-only. It's also only
         | exposed through a far higher-level interface, CoreML.
         | 
         | Where it could still make sense is if you have a small VRAM
         | pool on the dGPU and a big one on the M1, but with the price of
         | a Mac, not sure that makes a lot of sense either in most
         | scenarios compared to paying for a big dGPU.
        
           | LeanderK wrote:
           | > The neural engine is small and inference only
           | 
           | Why is it inference only? At least the operations are the
           | same...just a bunch of linear algebra
        
             | londons_explore wrote:
             | Inference is often done in fixed point, whereas training is
             | (usually) done in floating point.
             | 
             | Inference also prefers different IO patterns, because you
             | don't need to keep the activations for every layer around
             | for backpropagation.
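             | 
             | As an aside, the fixed-point-for-inference idea is easy to
             | play with in PyTorch via post-training dynamic quantization
             | (a CPU-only sketch; nothing to do with the ANE path):
             | 
             |     import torch
             | 
             |     model = torch.nn.Sequential(
             |         torch.nn.Linear(128, 64), torch.nn.ReLU(),
             |         torch.nn.Linear(64, 10))
             | 
             |     # Linear weights get stored as int8 and activations
             |     # are quantized on the fly. Inference only - you
             |     # can't backprop through the quantized model.
             |     qmodel = torch.quantization.quantize_dynamic(
             |         model, {torch.nn.Linear}, dtype=torch.qint8)
             | 
             |     print(qmodel(torch.randn(1, 128)).shape)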
        
           | Kon-Peki wrote:
           | > Apple doesn't have matrix math accelerators in their
           | current GPUs.
           | 
           | That's because the M1 has a dedicated matrix math accelerator
           | called AMX [1]. I've used it with both Swift and pure C.
           | 
           | https://medium.com/swlh/apples-m1-secret-
           | coprocessor-6599492...
        
             | my123 wrote:
             | AMX is indeed very nice for FP64, where consumer GPUs
             | aren't an alternative at all.
             | 
             | However, for lower precisions (which is what deep learning
             | uses), you're much better off with a GPU.
        
               | brrrrrm wrote:
               | have you actually benchmarked that? I think (someone
               | please correct me if I'm way off here) the AMX
               | instructions can hit ~2.8 TFLOPS (fp16) per co-processor
               | and there are 2 on the 7-core M1. That's 5.6 TFLOPS vs
               | the 4.6 TFLOPS the GPU can hit.
        
               | johndough wrote:
               | Often the limiting factor is memory bandwidth instead of
               | raw FLOPS, so dealing with 4 times larger data types
               | (FP64 vs FP16) is a disadvantage.
        
               | brrrrrm wrote:
               | to clarify: I am comparing FP16 performance, which both
               | the GPU and AMX have native support for.
               | 
               | FP64 is _also_ supported by AMX, making it quite an
               | impressive region of silicon.
        
               | my123 wrote:
               | Yeah, that's within the M1 family, but compare against
               | dGPUs and it doesn't even come close.
               | 
               | 30 TFLOPS for a 3080 for vector FP32, but 119 TFLOPS of
               | dense FP16 with FP16 accumulate, 59.5 TFLOPS with FP32
               | accumulate, and if you exploit sparsity that can go even
               | higher.
        
               | brrrrrm wrote:
               | Ah yes, I misunderstood your original comment
        
       | Kalanos wrote:
       | Anyone care to comment on how this is better than Metal's
       | TensorFlow support?
        
       | ekelsen wrote:
       | Nice results! But why are people still reporting benchmark
       | results on VGG? Does anybody actually use this network anymore?
       | 
       | Better would be MobileNets or EfficientNets or NFNets or vision
       | transformers or almost anything that's come out in the 8 years
       | since VGG was published (great work though it was at the time!).
        
         | p1esk wrote:
         | _why are people still reporting benchmark results on VGG?_
         | 
         | Probably because it makes the hardware look good.
        
           | DSingularity wrote:
           | No. Because it is a way to compare performance. That's all.
           | Just convenience.
        
           | 0-_-0 wrote:
           | This is the right answer. Efficient networks like
           | EfficientNet are much harder to accelerate in HW.
        
         | jorgemf wrote:
         | Probably because it would be impossible to compare with old
         | results. If every year the community chooses a different model,
         | how are you going to compare results year over year?
        
           | learndeeply wrote:
           | ResNets have been around for 7 years...
        
             | jorgemf wrote:
             | It doesn't matter. Deep learning has been mainstream for
             | only 10 years. MNIST is a dataset from 1998 and it is still
             | being used in research papers. The most important thing is
             | to have a constant baseline, and ResNets are a baseline.
             | 
             | Think about changing the model every other year:
             | 
             | - 2015: ResNet trained on an Nvidia K80
             | - 2017: Inception trained on an Nvidia 1080 Ti
             | - 2019: Transformer trained on an Nvidia V100
             | - 2021: GPT-3 trained on a cluster
             | 
             | Now you have your new fancy algorithm X and an Nvidia 4090.
             | How much better is your algorithm compared to the state of
             | the art, and how much have you improved compared to the
             | algorithms of 5 years ago? Now you are in a nightmare and
             | you have to rerun all the past algorithms in order to
             | compare. Or how fast is the new Nvidia card, which no one
             | has yet and for which Nvidia has decided to give numbers
             | based on their own model?
        
         | 6gvONxR4sf7o wrote:
         | > But why are people still reporting benchmark results on VGG?
         | 
         | It makes me feel like I'm missing something! Is it still used
         | as a backbone in the same way that legacy code is everywhere,
         | or is it something else entirely?
        
         | plonk wrote:
         | > Does anybody actually use this network anymore?
         | 
         | Why not? It's still good for simple classification tasks. We
         | use it as an encoder for a segmentation model in some cases.
         | Most ResNet variants are much heavier.
        
           | jph00 wrote:
           | I don't think that's true - have a look at this analysis
           | here:
           | 
           | https://www.kaggle.com/code/jhoward/which-image-models-
           | are-b...
           | 
           | Those slow and inaccurate models at the bottom of the graph
           | are the VGG models. A resnet34 is faster and more accurate
           | than any VGG model. And there are better options now -- for
           | example resnet34d is as fast as resnet34, and more accurate.
           | And then convnext is dramatically better still.
        
           | YetAnotherNick wrote:
           | > ResNet > VGG: ResNet-50 is faster than VGG-16 and more
           | accurate than VGG-19 (7.02 vs 9.0); ResNet-101 is about the
           | same speed as VGG-19 but much more accurate than VGG-16 (6.21
           | vs 9.0).
           | 
           | https://github.com/jcjohnson/cnn-
           | benchmarks#:~:text=ResNet%2....
        
       | toppy wrote:
       | Does the speed-up refer to an absolute value or a percentage?
        
         | dagmx wrote:
         | At least for the charts, it looks like a multiplier (or
         | divisor, I guess) since the CPU baseline looks to be at 1.
        
           | toppy wrote:
           | You're right! I missed this.
        
       | alexfromapex wrote:
       | Since it's tangentially relevant, if you have an M1 Mac I've
       | created some boilerplate for working with the latest Tensorflow
       | with GPU acceleration as well:
       | https://github.com/alexfromapex/tensorexperiments . I'm thinking
       | of adding a branch for PyTorch now.
        
         | masklinn wrote:
         | Did you compare that to Apple's tf plugin to see what was what?
        
         | galoisscobi wrote:
         | This is great! Appreciate the note on H5Py troubleshooting as
         | well.
        
       | [deleted]
        
       | cj8989 wrote:
       | Really hope to see some comparisons with Nvidia GPUs!
        
       | amelius wrote:
       | > Accelerated GPU training is enabled using Apple's Metal
       | Performance Shaders (MPS) as a backend for PyTorch.
       | 
       | What do shaders have to do with it? Deep learning is a mature
       | field now; it shouldn't need to borrow compute architecture from
       | the gaming/entertainment field. Anyone else find this
       | disconcerting?
        
         | my123 wrote:
         | Apple doesn't have a separate API tailored towards compute
         | only, but a single unified API that makes concessions to both.
         | 
         | Concessions towards compute: a C++ programming language for
         | device code (totally unlike what's done for most graphics
         | APIs!)
         | 
         | Concessions towards graphics: no single-source programming
         | model at all for example...
        
           | sudosysgen wrote:
           | Many GPUs allow you to write device code in C++ via SYCL. It
           | works well enough.
        
         | dagmx wrote:
         | Shaders are just the way compute is defined on the GPU.
         | 
         | Why is that concerning to you?
        
           | my123 wrote:
           | That terminology isn't used at all in GPGPU compute APIs
           | specifically tailored for that purpose, which use quite
           | different programming models where you can mix host and
           | device code in the same program.
           | 
           | And there are "GPUs" today that can't do graphics at all (AMD
           | MI100/MI200 generations) or can only do so in a restricted
           | way (Hopper GH100, which keeps the fixed-function pipeline on
           | only two TPCs, for compatibility, and runs it very slowly
           | because of that).
        
             | alfalfasprout wrote:
             | There's absolutely a lot of "graphics" terminology that
             | spills into GPGPU. For example, texture memory in CUDA :)
             | The reality is that GPUs, even the ones that can't output
             | video, are ultimately still using hardware that is largely
             | rooted in gaming. Obviously the underlying architectures
             | for these ML cards are moving away from that (increasingly
             | using more die space for ML-related operations) but many of
             | the core components like memory are still shared. It boils
             | down to the fact that at the end of the day they're linear
             | algebra processors.
        
               | my123 wrote:
               | I'd say that there has been quite some sharing between
               | both back and forth. Evolutions in compute stacks shaped
               | modern graphics APIs too.
               | 
               | Texture units are indeed a part that is useful enough to
               | be exposed to GPGPU compute APIs directly. The "shader"
               | term itself disappeared quite early in those though, as
               | did access to a good part of the FF pipeline including
               | the rasterisers themselves.
        
           | WhitneyLand wrote:
           | It's not the greatest term even for graphics only.
           | 
           | People new to CG are likely to intuit "shaders" as something
           | related to, well, shading, but vertex shaders et al have
           | nothing to do with the color of a pixel or a polygon.
        
             | paulmd wrote:
             | Wait until they learn a kernel has nothing to do with
             | operating systems! And tensor operations have nothing to do
             | with tensor objects! And texture memory often isn't even
             | used for textures!
             | 
             | It's an unfortunate set of terminology due to the way this
             | space evolved from graphics programming - shader cores
             | _used to do_ fixed-function shading! But then people wanted
             | them to be able to run arbitrary shaders and not just
             | fixed-function. And then hey, look at this neat processor,
             | let's run a compute program on it. At first that was
             | "compute shaders" running across graphics APIs, then came
             | CUDA, and later OpenCL. But it is still running on the part
             | of the hardware that provides shading to the graphics
             | pipeline.
             | 
             | Similarly, texture memory actually used to be used for
             | textures, now it is a general-purpose binding that
             | coalesces any type of memory access that has 1D/2D/3D
             | locality.
             | 
             | You kinda just get used to it. Lots of niches have their
             | own lingo that takes some learning. Mathematics is
             | incomprehensible without it, really.
        
         | geertj wrote:
         | Not sure if it's concerning but it caught my eye as well.
        
       | MasterScrat wrote:
       | Small code example in the PyTorch doc:
       | 
       | https://pytorch.org/docs/master/notes/mps.html
        
       | ivstitia wrote:
       | There was a report from a few months ago comparing the M1 Pro
       | with several Nvidia GPUs:
       | https://wandb.ai/tcapelle/apple_m1_pro/reports/Deep-Learning...
       | 
       | I'm curious how the benchmarks change with this new release!
        
       | nafizh wrote:
       | Exciting!! But I don't see a comparison with any laptop Nvidia
       | GPUs in terms of performance. That would be insightful.
        
         | sudosysgen wrote:
         | It compares unfavourably, but then again Nvidia GPUs in laptops
         | are massive power hogs.
        
           | smlacy wrote:
           | Do Apple users really _require_ the ability to train large ML
           | models while mobile and without access to A/C power? Is this
           | a real-world use case for the target market?
        
             | sudosysgen wrote:
             | Indeed, I doubt anyone really needs that. And anyways while
             | training a model you'd be lucky to get an hour of battery
             | life even on an M1 Max.
        
       ___________________________________________________________________
       (page generated 2022-05-18 23:00 UTC)