[HN Gopher] Accelerated PyTorch Training on M1 Mac
___________________________________________________________________

Accelerated PyTorch Training on M1 Mac

Author : tgymnich
Score  : 335 points
Date   : 2022-05-18 15:33 UTC (7 hours ago)

(HTM) web link (pytorch.org)
(TXT) w3m dump (pytorch.org)

| buildbot wrote:
| This is very interesting since the M1 Studio supports 128GB of unified memory - training a large, memory-heavy model slowly on a single device could be interesting, or inferencing a very large model.
| zdw wrote:
| Everything old is new again - the M1 Studio's unified memory echoes the SGI O2, which had similar unified CPU/GPU memory back in the '90s.
|
| In both cases the unified memory machines outperformed much larger machines in specific use cases.
| smoldesu wrote:
| ... _specific use cases_ being the operative phrase here. Unified memory is cool, but there are reasons we don't use it at scale:
|
| - It needs extremely high-bandwidth controllers, which severely limits the amount of memory you can use (Intel Macs could be configured with an order of magnitude more RAM in their server chips)
|
| - ECC is still off the table on M1, apparently
|
| - _Most_ workloads aren't really constrained by memory access in modern programs/kernels/compilers. Problems only show up when you want to run a GPU off the same memory, which is what these new Macs account for.
|
| - _Most_ of the so-called "specific workloads" that you're outlining aren't very general applications. So far I've only seen ARM outrun x86 in some low-precision physics demos, which is... fine, I guess? I still don't foresee meteorologists dropping their Intel rigs to buy a Mac Studio anytime soon.
| my123 wrote:
| > - It needs extremely high-bandwidth controllers, which severely limits the amount of memory you can use (Intel Macs could be configured with an order of magnitude more RAM in their server chips)
|
| In the first half of 2023, the NVIDIA Grace Superchip will ship with a 1TB memory config (930GB usable because of ECC bits) on a 1024-bit-wide LPDDR5X-8533 config (same width as M1 Ultra, with LPDDR5-6400).
|
| So it's going to become much less of an issue really soon.
| zdw wrote:
| > So it's going to become much less of an issue really soon.
|
| The main issue would be trying to purchase one of those, which is likely going to be both very rare and orders of magnitude more expensive than a Mac Studio.
|
| The Mac Studio isn't some crazy exotic hardware like datacenter-class GPUs, but it definitely has some exotic capabilities.
| my123 wrote:
| > The Mac Studio isn't some crazy exotic hardware like datacenter-class GPUs, but it definitely has some exotic capabilities.
|
| Datacenter-class GPUs are expensive, yeah, but they are quite easy to buy, even in single-unit quantities.
|
| Example: https://www.dell.com/en-us/work/shop/nvidia-ampere-a100-pcie... as the first random link, but there are other stores selling them for significantly less.
|
| I wonder what their CPU pricing will be though... we'll see, I guess.
| Q6T46nT668w6i3m wrote:
| > Most workloads aren't really constrained by memory access in modern programs/kernels/compilers. Problems only show up when you want to run a GPU off the same memory, which is what these new Macs account for.
|
| For sure, but I expect this is different for the apps Apple _wants_ to write. It's easy to imagine the next version of Logic or whatever doing fine-tuning everywhere.
| smoldesu wrote:
| What is there to fine-tune in a program like Logic? I've often heard that word associated with using extended instruction sets and leveraging accelerators, but where would the M1 have "untapped power", so to speak? I don't think the "upgrade" from a CISC architecture to a RISC one can yield much opportunity for optimization, at least not besides what the compiler already does for you.
| sbeckeriv wrote:
| What is the * in the chart referencing?
| mrchucklepants wrote:
| Probably supposed to be referencing the text under the plot stating the specific configuration of the hardware and software.
| sbeckeriv wrote:
| Looks like the website was updated after I posted. I used page search to look for the *.
| munro wrote:
| Yess! This is important for me, because I don't have any $$$ to rent GPUs for personal projects. Now we just need M1 support for JAX.
|
| Since there are no hard benchmarks against other GPUs, here's a Geekbench against an RTX 3080 Mobile laptop I have [1]. Looks like it's about 2x slower--the RTX laptop absolutely rips for gaming, I love it.
|
| [1] https://browser.geekbench.com/v5/compute/compare/4140651?bas...
| jph00 wrote:
| You can use GPUs for free on Paperspace Gradient, Google Colab, and Kaggle.
| mkaic wrote:
| This is really cool for a number of reasons:
|
| 1.) Apple Silicon _currently_ can't compete with Nvidia GPUs in terms of raw compute power, but they're already way ahead on energy efficiency. Training a small deep learning model on battery power on a laptop could actually be a thing now.
|
| Edit: I've been informed that for matrix math, Apple Silicon isn't actually ahead in efficiency.
|
| 2.) Apple Silicon probably _will_ compete directly with Nvidia GPUs on raw compute power in future generations of products like the Mac Studio and Mac Pro, which is very exciting. Competition in this space is incredibly good for consumers.
|
| 3.) At $4800, an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory. With proper PyTorch support, we'll actually be able to use this memory for training big models or using big batch sizes. For the kind of DL work I do, where dataloading is much more of a bottleneck than actual raw compute power, the Mac Studio is now looking _very_ enticing.
| smoldesu wrote:
| There's definitely competition, and it's going to be _really interesting_ to watch Nvidia and Apple duke it out over the next few years:
|
| - Apple undoubtedly _owns_ the densest nodes, and will fight TSMC tooth-and-nail over first dibs on whatever silicon they have coming next.
|
| - Apple's current GPU design philosophy relies on horizontally scaling the tech they already use, whereas Nvidia has been scaling vertically, albeit slowly.
|
| - Nvidia has _insane_ engineers. Despite the fact that they're using silicon that's more than twice as large by area when compared to Apple, they're still doubling their numbers across the board. And that's their last-gen tech too; the comparison once they're on 5nm later this summer is going to be _insane_.
|
| I expect things to be very heated by the end of this year, with new Nvidia, Intel, and _potentially_ new Apple GPUs.
| my123 wrote:
| > but they're already way ahead on energy efficiency
|
| 1) Nope. For neural network training, that's not the case: https://tlkh.dev/benchmarking-the-apple-m1-max
|
| And that's with the 3090 set at a very high 400W power limit; it can get far more efficient when clocked lower.
|
| (which is normal, because there are notably no dedicated matrix math accelerators on the GPU)
|
| 2) We'll see, hopefully Apple thinks that the market is worth bothering with... (which would be great)
|
| 3) Indeed, if you need a giant pool of VRAM above everything else at a relatively low price tag, Apple is a quite enticing option. If you can stand Metal for your use case, of course.
| hedgehog wrote:
| To me the cool thing is that working through a PyTorch-based course like FastAI on a local Mac may now be above the tolerably-fast threshold.
| [deleted]
| mhh__ wrote:
| The thing with the efficiency (which I'm not sure of) and the competition (probably possible) is that the current Nvidia lineup is pretty old and on an even older process. They have a big moat.
| dekhn wrote:
| I remain skeptical that Apple's best GPU silicon will match Nvidia's premier products (either the top-end desktop card or a server monster) for training.
|
| It seems like this is ideal as an accelerator for already-trained models; one can imagine Photoshop utilizing it for deep-learning-based infill painting.
|
| I was doing training on battery with a laptop that had a 1080; I have trained models on an airplane while totally unplugged and still had enough power to websurf afterwards.
| sudosysgen wrote:
| Apple Silicon is not ahead at all on energy efficiency for desktop workloads. If they were ahead on energy efficiency, they would simply be ahead on power. Indeed, GPUs are massively parallel architectures, and they are generally limited by the transistor and power budget (and memory, of course).
|
| Apple is simply behind in the GPU space.
|
| > At $4800, an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory. With proper PyTorch support, we'll actually be able to use this memory for training big models or using big batch sizes. For the kind of DL work I do where dataloading is much more of a bottleneck than actual raw compute power, Mac Studio is now looking very enticing.
|
| The reason why it's cheaper is that its memory is at a _fraction_ (around 20-35%) of the memory bandwidth of a 128GB-equivalent GPU setup, which also has to be split with the CPU. This is an unavoidable bottleneck of shared-memory systems, and for a great many applications this is a terminal performance bottleneck.
|
| That's the reason you don't have a GPU with 128GB of normal DDR5. It would just be quite limited. Perhaps for some cases it can be useful.
| p1esk wrote:
| _its memory is at a fraction (around 30-40%) of the memory bandwidth of a 128GB equivalent GPU setup_
|
| Here's some info about M1 memory bandwidth: https://www.anandtech.com/show/17024/apple-m1-max-performanc...
| sudosysgen wrote:
| Yes. And the M1 Ultra has even more memory bandwidth than the M1 Max. But a 128GB system made of three NVIDIA A6000s has 3x768GB/s of memory bandwidth, and a more common AI-grade card has 2x2TB/s of memory bandwidth, which simply dwarfs the M1 Ultra.
| matthew-wegner wrote:
| For researchers, sure, but it's still quite an apples-to-oranges comparison.
|
| A6000 is ~$5k per card. I guess you're referring to something like an A100 on that other spec, which is $10k/card (for 40GB of memory).
|
| I do a fair bit of neural/AI art experimentation, where memory on the execution side is sometimes a limiting factor for me. I'm not training models, I'm not a hardcore researcher--those folks will absolutely be using NVIDIA's high-end stuff or TPU pods.
|
| 128GB in a Studio is super compelling if it means I can up-res some of my pieces without needing to use high-memory-but-super-slow CPU cloud VMs, or hope I get lucky with an A100 on Colab (or just pay for a GPU VM).
|
| I have a 128GB/Ultra Studio in my office now. It's a great piece of kit, and a big reason I splurged on it--okay, maybe "excuse"--was that I expect it'll be useful for a lot of my side-project workloads over the next couple of years...
| sudosysgen wrote:
| Hmm, that's interesting. What kind of inference workload requires more than the 48GB of memory you'd get from two 3090s, for example? I'm genuinely curious because I haven't run across them, and it sounds interesting.
| mkaic wrote:
| Not sure about inference, but for training, 128GB is big enough to fit a decent-sized dataset entirely into memory, which causes a massive speedup. It's also probably cheaper to get a 128GB Mac Studio than a dual-3090 rig unless you're willing to build the rig yourself and pay the bare minimum for every component except the GPUs themselves.
|
| As for models that need 128GB of memory _at inference_ that a consumer would be interested in, I got nothing, though it certainly seems like it would be fun to mess around with, haha.
| matthew-wegner wrote:
| Mostly it's old-school style transfer! Well, "old" in the sense that it's pre-CLIP. I've played with CLIP-guided stuff too, but I've been tinkering with a custom style transfer workflow for a few years. The pipeline here is fractal IFS images (Chaotica/JWildfire) -> misc processing -> style transfer -> photo editing, basically.
|
| Only the workflow is the custom part--the core here is literally the original jcjohnson implementation. Occasionally I look around at recent work in the area, but most seems focused on fast (video-speed) inference or pre-baked style models. I've never seen something that retains artistic flexibility.
|
| My original gut feeling on style transfer was that it would be possible to mold it into a neat tool, but most people bumped into it, ran their profile photo against Starry Night, said "cool", and bounced off. And I get that--parameter tuning can be a sloooow process. When I really explore a series with a particular style, I start to feed it custom content images made just for how it's reacting with various inputs.
|
| Here's a piece that just finished a few minutes ago: https://mwegner.com/misc/styled_render-BMrHXWz_2RBaUq8pAYKfL...
|
| That's from a local server in my garage with a K80. At some point I had two K80s in there (so basically four K40s with how they work), but dialed it back for power consumption reasons.
|
| I do have a 3090 in the house, and a decent amount of cloud infra that I sometimes tap. The jcjohnson implementation is so far back that it doesn't even run against modern hardware. At some point I need to sort that out, or figure out how to wrangle a more modern implementation into behaving in the way that I like.
|
| I don't really post these anywhere, although I do throw them over the wall on Twitter if anyone is curious to see more. These are a mix of things, although the CLIP/Midjourney/etc stuff is pretty easy to spot: https://twitter.com/mwegner/media
| visarga wrote:
| GPT-3-sized models need that kind of memory for inference
| sudosysgen wrote:
| GPT-3 is more like 300GB, IIRC
| mkaic wrote:
| Interesting, I wasn't aware of the memory bandwidth point, though it makes sense. TIL!
| ActorNightly wrote:
| > but they're already way ahead on energy efficiency.
|
| For raw compute like you need for ML training, the M1's efficiency doesn't matter. Under the hood at the hardware level, you have a direct mapping of power consumption to compute circuit activation that you really can't get around.
|
| The general efficiency of the M1 is due to its architecture and how it fits together with normal consumer use: less stuff on the instruction decode, more efficient reordering, less energy wasted moving data around thanks to the shared memory architecture, etc.
| ribit wrote:
| And yet somehow Apple's GPU ALUs are more efficient at 3.8 watts per TFLOP. Mind, I am not talking about specialized matrix multiplication units that have a different internal organization and can do things like matrix multiplication much more efficiently, but about basic general-purpose GPU ALUs.
|
| The comparison of efficiency between Apple and Nvidia here is a bit misleading because one compares Apple's general-purpose ALUs to Nvidia's specialized ALUs. For a more direct efficiency comparison, one would need to compare the Tensor Cores against the AMX or ANE coprocessors.
|
| As to how Apple achieves such high efficiency, nobody knows. The fact that they are on a 5nm node might help, but there must be something special about the ALU design as well. My speculation is that they are wider and much simpler than in other GPUs, which directly translates to efficiency wins.
| [deleted]
| arecurrence wrote:
| This is much nicer ergonomically than what I had to do for TensorFlow. It's ostensibly out-of-the-box support as a different torch device.
| mark_l_watson wrote:
| I agree. I appreciated the M1/Metal TensorFlow support, but that was not as easy to set up.
| alfalfasprout wrote:
| I mean, building TensorFlow is generally an awful experience.
| dangrie158 wrote:
| lekevicius wrote:
| Curiously, neither PyTorch nor TensorFlow currently uses the M1's Neural Engine. Is it too limited? Too hard to interact with? Not worth the effort?
| why_only_15 wrote:
| The ANE only has support for calculations with fp16, int16, and int8, all of which are too small to train with (too much instability). A common thing to do is train in fp32 to be able to capture the small differences and gradients, and then once the model is frozen, do inference in fp16 or bf16.
| jph00 wrote:
| Using mixed-precision training you can do most operations in fp16 and just a few in fp32 where it's needed. This is the norm for NVIDIA GPU training nowadays. For instance, using fastai, add `.to_fp16()` after your learner call, and that happens automatically.
| omegalulw wrote:
| How is the choice between fp16 and fp32 made? Is it that if any gradients in the tensor need the extra range, you use fp32?
| h-jones wrote:
| The PyTorch docs give a pretty good overview of AMP here https://pytorch.org/tutorials/recipes/recipes/amp_recipe.htm... and an overview of which operations cast to which dtype can be found here https://pytorch.org/docs/stable/amp.html#autocast-op-referen....
|
| Edit: Fixed second link.
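
To make the mixed-precision recipe discussed above concrete, here is a minimal sketch following the linked PyTorch AMP docs. The toy model, data, and hyperparameters are placeholders, and it assumes a CUDA device (autocast behaviour on other backends may differ):

    import torch

    device = torch.device("cuda")
    model = torch.nn.Linear(512, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    scaler = torch.cuda.amp.GradScaler()  # rescales gradients so small fp16 values don't underflow

    for step in range(100):
        x = torch.randn(32, 512, device=device)
        y = torch.randint(0, 10, (32,), device=device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():   # ops run in fp16 where safe, fp32 where needed
            loss = loss_fn(model(x), y)
        scaler.scale(loss).backward()     # backward pass on the scaled loss
        scaler.step(optimizer)            # unscales gradients, then takes the optimizer step
        scaler.update()
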
| RicoElectrico wrote:
| Most probably the Neural Engine is optimized for inference, not training.
| sillyinseattle wrote:
| Question about terminology (no background in AI). In econometrics, estimation is model fitting (training, I guess), and inference refers to hypothesis testing (e.g. t or F tests). What does inference mean here?
| malshe wrote:
| I have a background in both and it's very confusing to me. Inference in DL is running a trained model to predict/classify. Inference in stats and econometrics is totally different, as you noted.
| mattkrause wrote:
| Prediction.
|
| The model is literally "inferring" something about its inputs: e.g., these pixels denote a hot dog, those don't.
| iamaaditya wrote:
| In machine learning (especially deep learning or neural networks), the 'training' is done using Stochastic Gradient Descent. The gradients are computed using backpropagation. Backpropagation requires you to do a backward pass of your model (typically many layers of neural weights) and thus requires you to keep in memory a lot of intermediate values (called activations). However, if you are doing "inference", that is, if the goal is only to get the result but not improve the model, then you don't have to do the backpropagation and thus you don't need to store/save the intermediate values. As the layers and number of parameters in deep learning grow, this difference in computation between training and inference becomes significant. In most modern applications of ML, you train once but infer many times, and thus it makes sense to have specialized hardware that is optimized for "inference" at the cost of its inability to do "training".
| eklitzke wrote:
| Just to add to this, the reason these inference accelerators have become big recently (see also the "neural core" in Pixel phones) is that they help do inference tasks in real time (lower model latency) with better power usage than a GPU.
|
| As a concrete example, on a camera you might want to run a facial detector so the camera can automatically adjust its focus when it sees a human face. Or you might want a person detector that can detect the outline of the person in the shot, so that you can blur/change their background in something like a Zoom call. All of these applications are going to work better if you can run your model at, say, 60 Hz instead of 20 Hz. Optimizing hardware to do inference tasks like this as fast as possible with the least possible power usage is pretty different from optimizing for all the things a GPU needs to do, so you might end up with hardware that has both and uses them for different tasks.
| sillyinseattle wrote:
| Thank you @iamaaditya and @eklitzke. Very informative
| dataexporter wrote:
| This sounds really fascinating. Are there any resources that you'd recommend for someone who's starting out in learning all this? I'm a complete beginner when it comes to Machine Learning.
| dr_zoidberg wrote:
| Deep Learning with Python (2nd ed), by Francois Chollet.
|
| If you don't care about the parts where you program, it's still got a lot of beginner/intermediate concepts clearly explained.
| If you do dive into the programming examples, you get to play around with a few architectures and ideas, and you're left ready to dive into the more advanced material knowing what you're doing.
| upwardbound wrote:
| Inference here means "running" the model. So maybe it has a similar meaning as in econometrics?
|
| Training is learning the weights (millions or billions of parameters) that control the model's behavior, whereas inference is "running" the trained model on user data.
| Q6T46nT668w6i3m wrote:
| I'm surprised nobody has provided the basic explanation: inference, here, means matrix-matrix or matrix-scalar multiplication.
| abm53 wrote:
| It is confusing that the ML community has come to use "inference" to mean prediction, whereas statisticians have long used it to refer to training/fitting, or hypothesis testing.
|
| I'm not sure when or why this started.
| munro wrote:
| That /sounds/ right, but training still has a forward part, so OP does raise a really great question. And looking at the silicon, the neural engine is almost the size of the GPU. Really need someone educated in this area to chime in :)
| dgacmu wrote:
| You have to stash more information from the forward pass in order to calculate the gradients during backprop. You can't just naively use an inference accelerator as part of training - inference-only gets to discard intermediate activations immediately.
|
| (Also, many inference accelerators use lower precision than you do when training)
|
| There are tricks you can do to use inference to accelerate training, such as one we developed to focus on likely-poorly-performing examples: https://arxiv.org/abs/1910.00762
| my123 wrote:
| The neural engine is only exposed through a CoreML inference API.
|
| You can't even poke the ANE hardware directly from a regular process. The interface for accessing the neural engine is not hardened (you can easily crash the machine from it).
|
| So the matter is essentially moot in practice, as you'd need your users to run with SIP off...
| [deleted]
| singularity2001 wrote:
| Anyone else getting "illegal hardware instruction"?
|
| (pytorch_env) ~/dev/ai/ python -c "import torch"
| zimpenfish wrote:
| IIRC, when I had that problem, it was because it was loading the wrong arch for Python.
| Scene_Cast2 wrote:
| I'm curious about the performance compared to something like, say, the RTX 3070.
| ivstitia wrote:
| Here are some comparison numbers I've come across: https://wandb.ai/tcapelle/apple_m1_pro/reports/Deep-Learning...
|
| It is not really comparable on a steps-per-second level, but the power consumption and now the GPU memory will make it pretty enticing.
| apohn wrote:
| I wrote a comment about a TensorFlow-on-M1 comparison to some cloud providers. I imagine PyTorch on M1 would give similar results. I think the gist would be that the 3070 is going to be a better investment.
|
| https://news.ycombinator.com/item?id=30608125
| my123 wrote:
| Low. Apple doesn't have matrix math accelerators in their current GPUs.
|
| The neural engine is small and inference only. It's also only exposed by a far higher-level interface, CoreML.
|
| Where it could still make sense is if you have a small VRAM pool on the dGPU and a big one on the M1, but with the price of a Mac, I'm not sure that makes a lot of sense either in most scenarios compared to paying for a big dGPU.
| LeanderK wrote:
| > The neural engine is small and inference only
|
| Why is it inference only? At least the operations are the same... just a bunch of linear algebra
| londons_explore wrote:
| Inference is often done in fixed point, whereas training is (usually) floating point.
|
| Inference also prefers different I/O patterns, because you don't need to keep the activations for every layer ready for backpropagation.
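
The activation point made in this subthread is easy to see in PyTorch itself. A rough illustration (the two-layer model and random data are placeholders, not anything from the article): a training-style forward pass records the autograd graph and keeps intermediate activations alive until backward(), while a forward pass under torch.no_grad() builds no graph, so activations can be dropped as each layer finishes.

    import torch

    model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(),
                                torch.nn.Linear(512, 10))
    x = torch.randn(64, 512)

    # Training-style forward pass: autograd records the graph, so intermediate
    # activations stay in memory until backward() consumes them.
    loss = model(x).sum()
    loss.backward()

    # Inference-style forward pass: no graph is built, so activations can be
    # freed as soon as the following layer has used them.
    with torch.no_grad():
        preds = model(x).argmax(dim=1)
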
| Kon-Peki wrote:
| > Apple doesn't have matrix math accelerators in their current GPUs.
|
| That's because the M1 has a dedicated matrix math accelerator called AMX [1]. I've used it with both Swift and pure C.
|
| [1] https://medium.com/swlh/apples-m1-secret-coprocessor-6599492...
| my123 wrote:
| AMX is indeed very nice for FP64, where consumer GPUs aren't an alternative at all.
|
| However, for lower precisions (which is what deep learning uses), you're much better off with a GPU.
| brrrrrm wrote:
| Have you actually benchmarked that? I think (someone please correct me if I'm way off here) the AMX instructions can hit ~2.8 TFLOPS (fp16) per co-processor, and there are two on the 7-core M1. That's 5.6 TFLOPS vs the 4.6 TFLOPS the GPU can hit.
| johndough wrote:
| Often the limiting factor is memory bandwidth instead of raw FLOPS, so dealing with 4 times larger data types (FP64 vs FP16) is a disadvantage.
| brrrrrm wrote:
| To clarify: I am comparing FP16 performance, which both the GPU and AMX have native support for.
|
| FP64 is _also_ supported by AMX, making it quite an impressive region of silicon.
| my123 wrote:
| Yeah, that's within the M1 family, but put it against dGPUs and it doesn't even come close.
|
| 30 TFLOPS for a 3080 for vector FP32, but 119 TFLOPS FP16 dense with FP16 accumulate, 59.5 with FP32 accumulate, and if you exploit sparsity then that can go even higher.
| brrrrrm wrote:
| Ah yes, I misunderstood your original comment
| Kalanos wrote:
| Anyone care to comment on how this is better than Metal's TensorFlow support?
| ekelsen wrote:
| Nice results! But why are people still reporting benchmark results on VGG? Does anybody actually use this network anymore?
|
| Better would be MobileNets or EfficientNets or NFNets or vision transformers or almost anything that's come out in the 8 years since VGG was published (great work it was at the time!).
| p1esk wrote:
| _why are people still reporting benchmark results on VGG?_
|
| Probably because it makes the hardware look good.
| DSingularity wrote:
| No. Because it is a way to compare performance. That's all. Just convenience.
| 0-_-0 wrote:
| This is the right answer. Efficient networks like EfficientNet are much harder to accelerate in HW.
| jorgemf wrote:
| Probably because otherwise it would be impossible to compare with old results. If every year the community chooses a different model, how are you going to compare results year over year?
| learndeeply wrote:
| ResNets have been around for 7 years...
| jorgemf wrote:
| It doesn't matter. Deep learning has been mainstream for only 10 years. MNIST is a dataset from 1998 and it is still being used in research papers. The most important thing is to have a constant baseline, and ResNets are a baseline.
|
| Think about changing the model every other year:
| - 2015: ResNet trained on an Nvidia K80
| - 2017: Inception trained on an Nvidia 1080 Ti
| - 2019: Transformer trained on an Nvidia V100
| - 2021: GPT-3 trained on a cluster
|
| Now you have your new fancy algorithm X and an Nvidia 4090. How much better is your algorithm compared to the state of the art, and how much have you improved compared to the algorithms from 5 years ago?
| Now you are in a nightmare and you have to rerun all the past algorithms in order to compare. Or how fast is the new Nvidia card, which no one has yet, and for which Nvidia has decided to give numbers based on their own model?
| 6gvONxR4sf7o wrote:
| > But why are people still reporting benchmark results on VGG?
|
| It makes me feel like I'm missing something! Is it still used as a backbone in the same way as legacy code is everywhere, or is it something else entirely??
| plonk wrote:
| > Does anybody actually use this network anymore?
|
| Why not? It's still good for simple classification tasks. We use it as an encoder for a segmentation model in some cases. Most ResNet variants are much heavier.
| jph00 wrote:
| I don't think that's true - have a look at this analysis here:
|
| https://www.kaggle.com/code/jhoward/which-image-models-are-b...
|
| Those slow and inaccurate models at the bottom of the graph are the VGG models. A resnet34 is faster and more accurate than any VGG model. And there are better options now -- for example resnet34d is as fast as resnet34, and more accurate. And then convnext is dramatically better still.
| YetAnotherNick wrote:
| > ResNet > VGG: ResNet-50 is faster than VGG-16 and more accurate than VGG-19 (7.02 vs 9.0); ResNet-101 is about the same speed as VGG-19 but much more accurate than VGG-16 (6.21 vs 9.0).
|
| https://github.com/jcjohnson/cnn-benchmarks#:~:text=ResNet%2....
| toppy wrote:
| Does speed-up refer to an absolute value or a percentage?
| dagmx wrote:
| At least for the charts, it looks like a multiplier (or divisor, I guess), since the CPU baseline looks to be at 1.
| toppy wrote:
| You're right! I've missed this.
| alexfromapex wrote:
| Since it's tangentially relevant, if you have an M1 Mac I've created some boilerplate for working with the latest TensorFlow with GPU acceleration as well: https://github.com/alexfromapex/tensorexperiments. I'm thinking of adding a branch for PyTorch now.
| masklinn wrote:
| Did you compare that to Apple's tf plugin to see what was what?
| galoisscobi wrote:
| This is great! Appreciate the note on H5Py troubleshooting as well.
| [deleted]
| cj8989 wrote:
| Really hope to see some comparisons with Nvidia GPUs!
| amelius wrote:
| > Accelerated GPU training is enabled using Apple's Metal Performance Shaders (MPS) as a backend for PyTorch.
|
| What do shaders have to do with it? Deep learning is a mature field now; it shouldn't need to borrow compute architecture from the gaming/entertainment field. Anyone else find this disconcerting?
| my123 wrote:
| Apple doesn't have a separate API tailored towards compute only, but a single unified API that makes concessions to both.
|
| Concessions towards compute: a C++ programming language for device code (totally unlike what's done for most graphics APIs!)
|
| Concessions towards graphics: no single-source programming model at all, for example...
| sudosysgen wrote:
| Many GPUs allow you to write device code in C++ via SYCL. It works well enough.
| dagmx wrote:
| Shaders are just the way compute is defined on the GPU.
|
| Why is that concerning to you?
| my123 wrote:
| That terminology isn't used at all in GPGPU compute APIs specifically tailored for that purpose, which use quite different programming models where you can mix host and device code in the same program.
|
| And there are "GPUs" today that can't do graphics at all (AMD MI100/MI200 generations), or only in a restricted way (Hopper GH100, which keeps the fixed-function pipeline on only two TPCs for compatibility, running very slowly because of that).
| alfalfasprout wrote:
| There's absolutely a lot of "graphics" terminology that spills into GPGPU. For example, texture memory in CUDA :) The reality is that GPUs, even the ones that can't output video, are ultimately still using hardware that is largely rooted in gaming. Obviously the underlying architectures for these ML cards are moving away from that (increasingly using more die space for ML-related operations) but many of the core components like memory are still shared. It boils down to the fact that at the end of the day they're linear algebra processors.
| my123 wrote:
| I'd say that there has been quite some sharing between both, back and forth. Evolutions in compute stacks shaped modern graphics APIs too.
|
| Texture units are indeed a part that is useful enough to be exposed to GPGPU compute APIs directly. The "shader" term itself disappeared quite early in those, though, as did access to a good part of the FF pipeline, including the rasterisers themselves.
| WhitneyLand wrote:
| It's not the greatest term even for graphics only.
|
| People new to CG are likely to intuit "shaders" as something related to, well, shading, but vertex shaders et al. have nothing to do with the color of a pixel or a polygon.
| paulmd wrote:
| Wait until they learn a kernel has nothing to do with operating systems! And tensor operations have nothing to do with tensor objects! And texture memory often isn't even used for textures!
|
| It's an unfortunate set of terminology due to the way this space evolved from graphics programming - shader cores _used to do_ fixed-function shading! But then people wanted them to be able to run arbitrary shaders and not just fixed-function. And then hey, look at this neat processor, let's run a compute program on it. At first that was "compute shaders" running across graphics APIs, then came CUDA, and later OpenCL. But it is still running on the part of the hardware that provides shading to the graphics pipeline.
|
| Similarly, texture memory actually used to be used for textures; now it is a general-purpose binding that coalesces any type of memory access that has 1D/2D/3D locality.
|
| You kinda just get used to it. Lots of niches have their own lingo that takes some learning. Mathematics is incomprehensible without it, really.
| geertj wrote:
| Not sure if it's concerning, but it caught my eye as well.
| MasterScrat wrote:
| Small code example in the PyTorch doc:
|
| https://pytorch.org/docs/master/notes/mps.html
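
The note linked above treats the new backend as just another torch device. A minimal sketch of that usage, with a placeholder model and tensor standing in for real code:

    import torch

    # Use the Apple-silicon GPU when the MPS backend is available, else fall back to CPU.
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

    model = torch.nn.Linear(128, 10).to(device)   # move parameters onto the MPS device
    x = torch.randn(32, 128, device=device)       # allocate inputs there too
    out = model(x)                                # forward pass runs via Metal Performance Shaders
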
| ivstitia wrote:
| There was a report comparing the M1 Pro with several other Nvidia GPUs from a few months ago: https://wandb.ai/tcapelle/apple_m1_pro/reports/Deep-Learning...
|
| I'm curious how the benchmarks change with this recent new release!
| nafizh wrote:
| Exciting!! But I don't see a comparison with any laptop Nvidia GPUs in terms of performance. That would be insightful.
| sudosysgen wrote:
| It compares unfavourably, but then again Nvidia GPUs in laptops are massive power hogs.
| smlacy wrote:
| Do Apple users really _require_ the ability to train large ML models while mobile and without access to A/C power? Is this a real-world use case for the target market?
| sudosysgen wrote:
| Indeed, I doubt anyone really needs that. And anyway, while training a model you'd be lucky to get an hour of battery life even on an M1 Max.
___________________________________________________________________
(page generated 2022-05-18 23:00 UTC)