[HN Gopher] WebAssembly techniques to speed up matrix multiplica... ___________________________________________________________________ WebAssembly techniques to speed up matrix multiplication Author : brrrrrm Score : 201 points Date : 2022-01-25 15:52 UTC (7 hours ago) (HTM) web link (jott.live) (TXT) w3m dump (jott.live) | VHRanger wrote: | This'll end up inevitably as a WASM BLAS library | | Which wouldn't be a bad thing | LeanderK wrote: | Why can't we just compile BLAS to WASM? Isn't this the point | of WASM? | brandmeyer wrote: | The BLAS have historically relied on a wide range of | microarchitecture-specific optimizations to get the most out | of each processor generation. An ideal solution would be for | the browser to provide that to the application in such a way | that it is difficult to fingerprint. | | See also the history of Atlas, GotoBLAS, Intel MKL, etc. | bee_rider wrote: | libflame/BLIS might be a good starting point, they've | created a framework where you bring your compute kernels, | and they make them into a BLAS (plus some other nice | functionality). I believe most of the framework itself is | in C, so I guess that could somehow be made to spit out | wasm (I know nothing about wasm). Then, getting the browser | to be aware of the actual real assembly kernels might be a | pain. | injidup wrote: | https://www.google.com/search?q=wasm%20blas&ie=utf-8&oe=utf-... | pcwalton wrote: | I really like this writeup. Note that it may not be worth using | the SIMD in this way (horizontal SIMD) if you know you will be | multiplying many matrices that are the same size. It may be | better to do vertical SIMD and simply perform the scalar | algorithm on 4 or 8 matrices at a time, like GPUs would do for | vertex shaders. This does mean that you may have to interleave | your matrices in an odd way to optimize memory access, though. | bertman wrote: | Very cool writeup!
| | Unfortunately, the bar graphs at the bottom of the article have | different y-axis scaling, but even so: | | It's sad how Firefox' performance pales in comparison to Chrome. | zwieback wrote: | Very cool! | | My question: will this kind of thing become more mainstream? I've | seen the web emerge, go from static pages to entire apps being | delivered and executed in the browser. The last bastion of native | apps and libraries seems to be highly optimized algorithms but | maybe those will also migrate to a deliver-from-the-web and | execute in some kind of browser sandbox. | | Java promised to deliver some version of native code execution | but the Java app/applet idea never seemed to take off. In some | ways it seems superior to what we have now but maybe the security | concerns we had during that era held Java back too much. Or am I | misunderstanding what WebAssembly can bring to the game? | brrrrrm wrote: | I'm not really equipped to predict anything, but I think recent | surge in popularity of simple RISC-y[1] architectures like ARM | will allow the WebAssembly standard to stay small yet | efficient. I'm hopeful, but standards often have a way of not | keeping up with the newest technology so we'll see. | | [1] | https://en.wikipedia.org/wiki/Reduced_instruction_set_comput... | pjscott wrote: | I don't expect ARM to have much effect here. The wasm | instruction set is basically a low-level compiler | intermediate representation, and to go from that to machine | code for x86 or ARM is about equally difficult in both cases. | MauranKilom wrote: | Well, here is a related talk: | https://www.destroyallsoftware.com/talks/the-birth-and-death... | halpert wrote: | The web always feels terrible compared to a native app, | especially on a phone. Some of the difference is due to Safari | being a bad browser (probably intentional), but a bigger part | is that the threading model makes it really difficult to have a | responsive UI. 
Not to mention the browser's gestures often | clash with the application's gestures. | eyelidlessness wrote: | I generally find Safari's performance better than other | browsers. Also, every mainstream JS runtime is multithreaded. | There are limitations on what can be shared between threads, | but you can optimize a _lot_ despite those limitations | (including using WASM on the web, and native extensions /FFI | on Node/Deno). | Uehreka wrote: | In my experience, if the thing I'm working on runs in | Safari, it's buttery smooth. And if it doesn't run, it | completely shits the bed. Stuff like "we don't support that | way of doing shadows in SVG, so rather than simply not | implement that property, we've turned your entire SVG | element black". | johncolanduoni wrote: | Every mainstream JS runtime has some sort of background | garbage collection and parsing/compilation, but for any | execution that has to interact with JavaScript heap or | stack state you're still in single threaded territory. | SharedArrayBuffer can help if you're willing to give up on | JavaScript objects entirely, and for WebAssembly this is | less of a burden, but that's not going to help you perform | rendering on most web apps concurrently. JSCore goes a | little bit further and can run javascript in the same | runtime instance on multiple threads with some significant | limitations, but this isn't exposed to developers on | Safari. | halpert wrote: | On iOS, the perf of Safari is definitely better than every | other browser, because every other browser is mandated to | use WkWebView by Apple. They aren't allowed to implement | their own engine. Of course Apple isn't subject to the same | restriction. 
| amelius wrote: | Soon, other browsers can simply run themselves inside | WASM which then runs inside a WkWebView :) | jacobolus wrote: | On a Mac, Safari Javascript generally outperforms Chrome | and Firefox (other browsing tasks are also generally | better performing), but there are some workloads where | Safari turns out slower, especially when the developer | has put a lot of work into Chrome-specific optimization. | | Safari also generally uses a lot less memory and CPU for | the same websites. Chrome in particular burns through my | battery very quickly, and is basically completely | incapable of keeping up with my browser use style (it | just crashes when I try to open a few hundred tabs). | Presumably nobody with authority at Google is a heavy | laptop web-user or prioritizes client-side resource use: | Google's websites are also among the biggest browser | resource hogs, even when sitting idle in a background | tab. | | Safari often takes a couple years longer than other | browsers to implement cutting-edge features. This seems | to me like a perfectly reasonable design decision; some | web developers love complaining about it though, and some | sites that were only developed against the most recent | versions of Chrome don't work correctly in Safari. | jsheard wrote: | WASM threads are available in all modern browsers now, | including Safari. It's very early days in terms of ecosystem | but we're steadily getting there. | halpert wrote: | The issue isn't so much having additional threads, it's | needing two main threads. The browser has one main thread | for accepting user input and then dispatches the relevant | user events to the JS main thread. | | The browser can either dispatch the events asynchronously, | leading to the events being handled in a noticeably delayed | way, or the browser can block its main thread until the JS | dispatch finishes, leading to fewer UI events being | handled. Either way is an inferior experience. 
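As background for the threading discussion above: the WASM threads jsheard mentions are built on workers sharing memory through `SharedArrayBuffer` and `Atomics`. A minimal sketch of those primitives (single-threaded here purely to show the API shape — spawning the actual Worker, and the cross-origin isolation headers browsers require for `SharedArrayBuffer`, are omitted):

```javascript
// A SharedArrayBuffer viewed through a typed array, mutated with Atomics.
// In a real app the buffer would be postMessage'd to a Worker; here we
// exercise the atomic ops on one thread to show the API shape.

const shared = new SharedArrayBuffer(8); // room for 2 x Int32
const counters = new Int32Array(shared);

// Atomically increment slot 0, as several workers could do concurrently
// without data races (unlike a plain `counters[0]++`).
Atomics.add(counters, 0, 1);
Atomics.add(counters, 0, 1);

// Atomics.store/load give sequentially consistent writes and reads.
Atomics.store(counters, 1, 42);

console.log(Atomics.load(counters, 0)); // 2
console.log(Atomics.load(counters, 1)); // 42
```

In a real worker pool each thread would get a view of the same buffer and coordinate with `Atomics.wait`/`Atomics.notify` rather than message passing.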
| danielvaughn wrote: | Also aren't service workers technically multi-threading? | That's been a thing for a while in the browser now. | jsheard wrote: | Technically yeah, but the threads could only communicate | through message passing which isn't ideal for | performance. The more recent major improvement is for | workers to be able to share a single memory space similar | to how threading works in native applications. | arendtio wrote: | > The web always feels terrible compared to a native app | | 'always' is certainly not true. Yes, with modern frameworks | it is very easy to build websites which are slow. But it is | also possible to build websites with butter smooth animations | and instant responses. | | I hope that in the future we will get frameworks that make it | easier to create lightweight web apps, so that we will see | more high performance apps. | halpert wrote: | I made another comment further down, but basically web | apps can't run as fast as native apps for a variety of | reasons. | | One reason is the thread model. There are two main threads | that need to be synchronized (browser main thread and JS | main thread) which will always be slower than a single main | thread. | | Another reason is that layout and measurement of elements | in HTML is really complicated. Native apps heavily | encourage deferred measurement which lets the app measure | and lay itself out once per render pass. In JavaScript, | layouts may need to happen immediately based on what | properties of the DOM you're reading and setting. | arendtio wrote: | I think nobody will dispute that the majority of | native apps are faster than web apps. But the key point | isn't which is faster, but how much is fast enough. | | In general, 60fps is considered sufficient for smooth | rendering and even 5 years ago, mobile hardware was fast | enough for 60 fps web page rendering. However, many web | pages are built in ways that prevent browsers from achieving | that goal.
| | So yes, it is harder for developers to create a pleasant | experience and as a result there are more bad apples in | the web app basket. | halpert wrote: | I disagree. Yes, if you have a static webpage and all you | need to do is scroll, then you can easily get 60 fps, | notably because the scrolling is handled natively by the | browser and basically is just a translation on the GPU. | If the web app accepts user input, especially touch with | dragging, then the page will not feel native with the | current batch of browsers for the reasons I mentioned | above. | arendtio wrote: | So what would be needed to prove you wrong? | | How about a 240fps video of a 60Hz display, with 2 | implementations | | 1. Qt | | 2. Web | | Both times a finger dragging a slider from point A to | point B? | engmgrmgr wrote: | You don't have to always use the DOM. You can render in | another thread, or even run compute in another thread and | use the animation frame system to handle updates. | | Having said that, maybe a little less than 10 years ago, | we achieved the desired performance with touch-screen | dragging of DOM elements. I don't remember specifics, but | we didn't use any frameworks. | nicoburns wrote: | > Some of the difference is due to Safari being a bad browser | (probably intentional), but a bigger part is that only having | one thread makes it really difficult to have a responsive UI | | Interestingly Safari is actually generally better than other | browsers for running a responsive smooth UI. Not sure how | much of that is the Safari engine and how much of it is | better CPU on iPhones. But even on a first generation iPad or | an iPhone 4 it was possible to get 60fps rendering fairly | easily. The same could not be said for even higher end | android phones of the time. | kitsunesoba wrote: | Anecdotally, the only time I've had issues with | unresponsiveness for pages in Safari is with sites that | were written with Chrome specifically in mind. 
| themerone wrote: | So, basically everything. | kitsunesoba wrote: | The impact is minimal to nonexistent on a light-to- | moderate-JS "site" and only really shows up in heavy | "apps", like YouTube or GDocs. | halpert wrote: | Really? Even something simple like Wordle feels janky with | the way Safari's chrome overlaps the keyboard. | acdha wrote: | Do you have some kind of extensions or something like | text zooming enabled? On a clean install it doesn't | overlap at all. | halpert wrote: | Hmm the layout is working for me now, but the tile | animation is broken. They flicker instead of smoothly | flipping over. | acdha wrote: | Any chance you have reduced motion enabled? They flip for | me but are obviously on a short timer. | javajosh wrote: | The web solved software distribution. Full stop. There only | remain the edge cases, and that's where webasm wants to help. | | Sun/Java wanted badly to solve this problem, but tried to do | too much too soon. Java gave devs in 1999 cutting edge OOP | tools for doing GUIs (e.g. Swing) but distribution was always | the problem. Installing and running browser plugins was always | error prone, and it turned out the browser was itself just good | enough to deliver value, so it won. (With the happy side-effect | of giving the world-wide community of devs one of the gooeyist | languages ever to express their fever dreams of what | application code should be). | | The question in my mind is whether there is enough demand for | the kinds of software webasm enables, especially given that | other routes of distribution (app stores) have filled in the | gaps of what the web delivers, and are generally a lot more | pleasant and predictable to native devs. | ginko wrote: | Isn't the idea of webassembly to compile native | C/C++/Rust/Whatever code to be able to run in the browser? | | Why not just compile OpenBLAS or another computer numerics | library like that to WA? 
| brrrrrm wrote: | that's exactly how TF.js does it: | https://github.com/google/XNNPACK/blob/master/src/f32-gemm/M... | remus wrote: | I'm by no means an expert but my understanding is that a lot of | the performance from libraries like openBLAS comes from | targeting specific architectures (e.g. particular instruction | sets on a series of processors). You can probably milk some | more performance by targeting the web assembly architecture | specifically (assuming openBLAS hasn't started doing similar | themselves). | bee_rider wrote: | People are asking about BLAS already in various threads, but if | they know the size of their matrices beforehand, it might be | interesting to try EIGEN. EIGEN also has the benefit that it is | an all-template C++ library so I guess it should be somehow | possible to spit out the WASM (I know nothing about WASM). | | Of course usually BLAS beats EIGEN, but for small, known-sized | matrices, it might have a chance. | MauranKilom wrote: | Can you specify what "small" constitutes for you? | LeanderK wrote: | I was hopeful WebAssembly will speed up the web. What's missing? | Native Browser APIs? Native IO Apis? | | Let's say I want to interactively plot some complicated function | without slowing down the rest. Can I do this in WebAssembly now? | acdha wrote: | > Let's say I want to interactively plot some complicated | function without slowing down the rest. Can I do this in | WebAssembly now? | | You've been able to do that for a while now and it would likely | be fast enough even in pure JavaScript. The things which tend | to slow the web down come back to developers not caring about | performance (and thus not even measuring) or the cross-purposes | of things like ad sales where a significant amount of | JavaScript is deployed to do things the user doesn't care | about. | sharikous wrote: | > Let's say I want to interactively plot some complicated | function without slowing down the rest. Can I do this in | WebAssembly now? 
| | Yep, with a Web Worker for the secondary thread. However the | environment is still young and it's difficult to use heavy | computation libraries. Besides, for some reason SIMD | instructions are present only for 128-bit units (2 doubles or 4 | floats). Another problem is that no matter what you do, it is a layer | over the hardware, so it will be slower than specialized | machine code (if what you do is not in the JS API) | LeanderK wrote: | > difficult to use heavy computation libraries | | what do you mean? What's the blocker? Something like numpy | for js would fill this role, calling wasm in the background. | Just the missing SIMD instructions? Some quick googling shows | that one can't compile BLAS for wasm yet. This might be due | to wasm64 not being available yet, I think? So would this | help to tap into the existing ecosystem of optimised | mathematical routines? | | Ideally...I would leave js and just use python ;) It has the | whole ecosystem at hand with numpy, scipy, statsmodels etc. | But nobody is doing it and idk why. I think it might be due | to fortran not compiling to wasm. | sharikous wrote: | Yes, I forgot about wasm64 not being available still. Yes, | that's a big blocker. | | About numpy for js I believe js is still not comfortable | enough for this kind of use, especially with the lack of | operator overloading. | | Anyway there are some builds of BLAS (or equivalents) to | wasm and even of python. Check out pyodide and brython | [deleted] | onion2k wrote: | _Let's say I want to interactively plot some complicated | function without slowing down the rest_ | | If you can draw your plot in a shader then you can do it in | WebGL very easily. You'd only need to update the input uniforms | and everything else would happen on the GPU. | johndough wrote: | Browsers will always lag behind desktop applications by a | decade or two because everything is designed by committee and | takes forever to arrive (see e.g. SSE, AVX, GPGPU compute).
And | even if everyone can eventually agree on and implement the | smallest common denominator, hardware will already have evolved | beyond that. | | In addition, browsers have to fight all kinds of nefarious | attackers, so it is a very hostile environment to develop in. | For example, we can't even measure time accurately (or do | proper multithreading with shared memory) in the browser thanks | to the Spectre and Meltdown vulnerabilities. | https://meltdownattack.com/ | | That being said, WebGL implements extremely gimped shaders. | Yet, they are still more than enough to render all kinds of | functions. For example, see https://www.shadertoy.com/ or | https://glslsandbox.com/ which are huge collections of | functions which take screen coordinates and time as input and | compute pixel colors from that, i.e. f(x, y, t) -> (r, g, b). | This might not sound very impressive at first glance, but | people have been amazingly creative within this extremely | constrained environment, resulting in all kinds of fancy 3D | renderings. | wdroz wrote: | Since wasm supports threads, I wonder if you can speed up these | operations even further by using multiple threads. | brrrrrm wrote: | That's a good point: you certainly could. There's some fun | exploration to be done with atomic operations. | | The issue is that threaded execution requires cross-origin | isolation, which isn't trivial to integrate. (Example server | that will serve the required headers: | https://github.com/bwasti/wasmblr/blob/main/thread_example/s...) | phkahler wrote: | Another technique is to transpose the left matrix so each dot | product is scanned in row-order and hence more cache friendly. | | Another one I tried ages ago is to use a single loop counter and | "de-interleave" the bits to get what would normally be 3 distinct | loop variables. For this you need to modify the entry in the | result matrix rather than having it write-only.
It has the effect | of accessing like a z-order curve but in 3 dimensions. It's a bit | of overhead, but you can also unroll say 8 iterations (2x2x2) | which helps make up for it. This ends up making good use of both | caches and even virtual memory if things don't fit in RAM. OTOH | it tends to prefer sizes that are a power of 2. | gfd wrote: | I really like this set of lecture notes for optimizing matrix | multiplication: https://ppc.cs.aalto.fi/ch2/v7/ (The transpose | trick is used in v1) | progbits wrote: | This deserves its own submission, wonderful resource! | eigenvalue wrote: | I find it surprising that, even after using all those tricks, | they are still only able to achieve around 50% of the theoretical | peak performance of the chip in terms of GFLOPS. And that's | for matrix multiplication, which is a nearly ideal case for | these techniques. | dralley wrote: | The compiler will sometimes do this transpose for you, but as | with all compiler optimizations it might sometimes break. | melissalobos wrote: | That sounds very interesting, is there anywhere I can read more | about this optimization? I didn't know any compiler could do | optimizations like that. | kanaffa12345 wrote: | there is no way a general purpose compiler will figure this | out. op is probably talking about something like halide or | tvm or torchscript jit. | [deleted] | cerved wrote: | You can get extremely creative in optimizing matrix | multiplication for cache and SIMD. | jacobolus wrote: | For contexts like the web, also check out cache-oblivious | matrix multiplication: | https://dspace.mit.edu/bitstream/handle/1721.1/80568/4355819... | mynameismon wrote: | Another very very interesting optimisation can be found in | these lecture slides [0]. (Scroll to slide 28, although the | entire slide deck is amazing) | | [0]: https://ocw.mit.edu/courses/electrical-engineering-and-compu...
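The transpose trick discussed above (and used in v1 of the Aalto notes) can be sketched against the post's flat row-major layout. In this layout the right-hand matrix is the one whose columns are accessed with a stride, so that is the operand transposed here; function names and the `Float32Array` choice are illustrative, not the article's code:

```javascript
// Flat row-major matrices: C (M x N) = A (M x K) * B (K x N).

// Naive version: b[k * N + n] strides by N on every k step,
// which is cache-unfriendly for large N.
function matmulNaive(a, b, M, N, K) {
  const c = new Float32Array(M * N);
  for (let m = 0; m < M; m++) {
    for (let n = 0; n < N; n++) {
      let acc = 0;
      for (let k = 0; k < K; k++) {
        acc += a[m * K + k] * b[k * N + n];
      }
      c[m * N + n] = acc;
    }
  }
  return c;
}

// Transpose trick: copy B into row-major B^T once (O(K*N) extra work),
// so the inner dot product reads both operands sequentially.
function matmulTransposed(a, b, M, N, K) {
  const bt = new Float32Array(N * K);
  for (let k = 0; k < K; k++) {
    for (let n = 0; n < N; n++) {
      bt[n * K + k] = b[k * N + n];
    }
  }
  const c = new Float32Array(M * N);
  for (let m = 0; m < M; m++) {
    for (let n = 0; n < N; n++) {
      let acc = 0;
      for (let k = 0; k < K; k++) {
        acc += a[m * K + k] * bt[n * K + k];
      }
      c[m * N + n] = acc;
    }
  }
  return c;
}
```

The one-time transpose is quickly repaid on large matrices, since the hot inner loop then walks both `a` and `bt` in row order.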
| magoghm wrote: | I tested it on my M1 Mac and it reached 46.78 Gigaflops, which is | quite amazing for a CPU running at 3.2 GHz. Isn't that like an | average of 14.6 floating point operations per clock cycle? | lostmsu wrote: | If you look at the comment above about the regular GEMM | implementation, M1 actually can do that at 1.6 Teraflops. | danieldk wrote: | I hate to post this multiple times, but the M1 has a dedicated | matrix co-processor, it can do matrix multiplication at | >1300 GFLOP/s if you use the native Accelerate framework (which | uses the standard BLAS API). The M1 Pro/Max can even do double | that (>2600 GFLOP/s) [1]. | | 46.78 GFLOP/s is not even that great on non-specialized | hardware. E.g., a Ryzen 5900X can do ~150 GFLOP/s | single-threaded with MKL. | | [1] https://github.com/danieldk/gemm-benchmark#1-to-16-threads | owlbite wrote: | How does this compare to the native BLAS in the Accelerate | library? | conradludgate wrote: | Accelerate on the M1 is ridiculously fast (thanks to its | special core set and specific instructions). | | Some benchmarks I've done have it beating out CUDA on my RTX | 2070. I have yet to get a proper GFLOPS number though | danieldk wrote: | It's going to absolutely blow this away. Here are some of my | single precision GEMM benchmarks for the M1 and M1 Pro: | | https://github.com/danieldk/gemm-benchmark#1-to-16-threads | | tl;dr, the M1 can do ~1300 GFLOP/s and the M1 Pro up to | ~2700 GFLOP/s. | | On the vanilla M1, that's 28 times faster than the best result | in the post. | | The difference (besides years of optimizing linear algebra | libraries) is that Accelerate uses the AMX matrix | multiplication co-processors through Apple's proprietary | instructions. | [deleted] | riddleronroof wrote: | This is very cool | [deleted] | wheelerof4te wrote: | I admire people who can read and understand this. | Kilenaitor wrote: | Which parts are you unable to read and understand?
I'm sure | some of us here could help explain if you have specific | questions or hangups. | wheelerof4te wrote: | The math stuff :) | | JavaScript code is readable, at least. | Kilenaitor wrote: | Is "the math stuff" all the optimizations being performed, | e.g. vectorizing multiplication? | | Not trying to sound dismissive here but the core math the | post is working with is actually a pretty straightforward | matrix multiplication. | | The bulk of the discussion focuses on optimizing the | execution of that straightforward multiplication algorithm | [triple-nested for loop; O(n^3)] rather than making | algorithmic/mathematical optimizations. | | And again, specific questions are easier to answer. :) | djur wrote: | Matrix multiplication isn't exactly intuitive if you've | never worked with it before. | bruce343434 wrote: | I don't understand the naming and notation of this article | because the author is assuming context that I don't have. | | Section baseline: What are N, M, K? 3 matrices or? Laid out | as a flat array, or what? `c[m * N + n] += a[m * K + k] * b[k | * N + n];`, ah, apparently a, b and c are the matrices? How | does this work? | | Section body: What is the mathy "C'=aC+A⋅B"? derivative of a | constant is the angle times the constant plus the dot product of | A and B??? | conradludgate wrote: | There are 3 matrices in question: A, B and C. They have | dimensions (M * K), (K * N) and (M * N) respectively. | | They are laid out, rather than nested arrays, as a single | continuous collection of bytes that can be interpreted as | having a matrix shape. That's where the `m * N + n` comes | from (m rows down and n cols in). | | C' = alpha * C + A.B | | This is the 'generalised matrix-matrix multiplication' | (GEMM) operation. It's multiplying the matrices A and B, | adding it to a scaled version of C and inserting it back | into C. Setting alpha to 0 gets you basic matmul | wheelerof4te wrote: | Thank you for this detailed explanation.
Making the | matrix one-dimensional makes sense from the performance | standpoint. | [deleted] | marginalia_nu wrote: | It's kind of bizarre how it's an accomplishment to get your code | closer to what the hardware is capable of. In a sane world, that | should be the default environment you're working in. Anything | else is wasteful. | Kilenaitor wrote: | There's always been a tradeoff in writing code between | developer experience and taking full advantage of what the | hardware is capable of. That "waste" in execution efficiency is | often worth it for the sake of representing helpful | abstractions and generally helping developer productivity. | | The real win here is when we can have both because of smart | toolchains that can transform those high-level constructs and | representations into the most efficient implementation for the | hardware. | | Posts like this demonstrate what's possible with the right | optimizations so tools like compilers and assemblers are able | to take advantage of these when given the high-level code. That | way we can achieve what you're hoping for: the default being | optimal implementations. | AnIdiotOnTheNet wrote: | > That "waste" in execution efficiency is often worth it for | the sake of representing helpful abstractions and generally | helping developer productivity. | | That's arguable at best. I for one am sick of 'developer | productivity' being the excuse for why my goddamned | supercomputer crawls when performing tasks that were trivial | even on hardware 15 years older. | | > The real win here is when we can have both because of smart | toolchains that can transform those high-level constructs and | representations into the most efficient implementation for | the hardware. | | That's been the promise for a long time and it still hasn't | been realized. If anything things seem to be getting less and | less optimal. | adamc wrote: | No, it's really not even arguable. 
Lots and lots of | software is written in business contexts where the cost of | developing reliable code is a lot more important than its | performance. Not everything is a commercial product aimed | at a wide audience. | | What you're "sick of" is mostly irrelevant unless you | represent a market that is willing to pay more for a more | efficient product. I use commercial apps every day that | clearly could work a lot better than they do. But... would | I pay a lot for that? No. They are too small a factor in my | workday. | | Saving money is part of engineering too. | bruce343434 wrote: | People have been sick of slow programs and slow computers | since literally forever. I think you live in a bubble or | are complacent. | | No one I know has anything good to say about Microsoft | Teams, for instance. And that's just one of the recent | "desktop apps" which are actually framed browsers. | lijogdfljk wrote: | > when performing tasks that were trivial even on hardware | 15 years older. | | Did the software to perform those tasks stop working? | nicoburns wrote: | > I for one am sick of 'developer productivity' being the | excuse for why my goddamned supercomputer crawls when | performing tasks that were trivial even on hardware 15 | years older. | | The problem here is developer salaries. So long as | developers are as expensive as they are, the incentive will | be to optimise for developer productivity over runtime | efficiency. | dogleash wrote: | If developers cost one fifth of what they do now, how | many projects that let performance languish today would | staff up to the extent that doing a perf pass would make | its way to the top of the backlog queue? | | Come on now. Let's be honest here. The answer for >90% of | projects is either a faster pace on new features, or to | pocket the payroll savings. They'd never prioritize | something that they've already determined can be ignored.
| Kilenaitor wrote: | We've been making developer experience optimizations | _long_ before they started demanding high salaries. The | whole reason to go from assembly to C was to improve | developer experience and efficiency. | | It seems fairly reductive to dismiss the legitimate | advantages of increased productivity. It's faster to | iterate on ideas and products, we gain back time to focus | on more complex concepts, and, more broadly, we further | open up this field to more and more people. And those | folks can then go on to invest in these kind of | performance improvements. | AnIdiotOnTheNet wrote: | > It seems fairly reductive to dismiss the legitimate | advantages of increased productivity. | | Certainly there are some, but I think we passed the point | of diminishing returns long long ago and we're now well | into the territory of regression. I would argue that we | are actually experiencing negative productivity increases | from a lot of the abstractions we employ, because we've | built up giant abstraction stacks where each new | abstraction has new leaks to deal with and everything is | much more complicated than it needs to be because of it. | nicoburns wrote: | Hmm... I think our standards for application | functionality are also a lot higher. For example, how | many applications from the 90s dealt flawlessly with | unicode text. | AnIdiotOnTheNet wrote: | How much added slowness do you think Unicode is | responsible for? Because as much of a complex nightmarish | standard as it is[0], there are plenty of applications | that are fast that handle it just fine as far as I can | tell. They're built with native widgets and written in | (probably) C. | | [0] plenty of slow as fuck modern software doesn't handle | it even close to 'flawlessly' | danieldk wrote: | _There 's always been a tradeoff in writing code between | developer experience and taking full advantage of what the | hardware is capable of. 
That "waste" in execution efficiency | is often worth it for the sake of representing helpful | abstractions and generally helping developer productivity._ | | The GFLOP/s is 1/28th of what you'd get when using the native | Accelerate framework on M1 Macs [1]. I am all in for powerful | abstractions, but not using native APIs for this (even if | it's just the browser calling Accelerate in some way) is just | a huge waste of everyone's CPU cycles and electricity. | | [1] https://github.com/danieldk/gemm-benchmark#1-to-16-threads | Salgat wrote: | Once you realize that it's a completely sandboxed environment | that works on any platform, it's a lot more impressive. | dr_zoidberg wrote: | Wasteful of computing resources, yes, but for a long time we've | been prioritizing developer time. That happens because you can | get faster hardware cheaper than you can get more developer | time (and not all developers' time is equal; say, Carmack can do | in a few hours things I couldn't do in months). | | I do agree that we'd get fantastic performance out of our | systems if we had the important layers optimized like this (or | more), but it seems few (if any) have been pushing in that | direction. | terafo wrote: | But you can't get faster hardware cheaper anymore. Not | naively faster hardware anyways. You are getting more and | more optimization opportunities nowadays though. Vectorize | your code, offload some work to the GPU or one of the | countless other accelerators that are present on modern SOCs, | change your I/O stack so you can utilize SSDs efficiently, | etc. I think it's a matter of time until someone puts an FPGA | onto a mainstream SOC, and the gap between efficient and | mainstream software will only widen from that point. | dr_zoidberg wrote: | You are precisely telling me the ways in which I can get | faster hardware: GPU, accelerators, the I/O stack and SSDs, | etc. | | I agree that the software layer has become slow, crufty, | bloated, etc.
But it's still cheaper to get faster hardware | (or wait a bit for it; see M1, Alder Lake, Zen 3, to name a | few, and those are getting successors later this year) | than to get a good programmer to optimize your code. | | And I know that we'd get much better performance out of | current (and probably future) hardware if we had more | optimized software, but it's rare to see companies and | projects take on such optimization efforts. | terafo wrote: | But you can't get all these things in the browser. You | don't just increase CPU frequency and get free | performance anymore. You need conscious effort to use GPU | computing, conscious effort to ditch the current I/O stack | for io_uring. Modern hardware gives performance to those | who are willing to fight for it. The disparity between the | naive approach and the optimized approach grows every year. | peterhunt wrote: | The real issue here is that the hardware isn't capable of | sandboxing without introducing tons of side channel attacks. | Lots of applications are willing to sacrifice a lot of | performance in order to gain the distribution advantages of a | safe, sandboxed execution environment. | not2b wrote: | In a sane world (which is the world that we live in), it's best | to find a well-optimized library for common operations like | matrix multiplication. But if you want to do something unusual | (multiply large matrices inside a browser, quickly) you've | exited the sane world, so you'll have to work at it. | ska wrote: | > Anything else is wasteful. | | Everything has a cost. If the developer is a slave to machine | architecture, development is slow and error prone. If the | machine is a slave to an abstraction, everything will run | slowly. Unsurprisingly, the real trick is finding the appropriate | balance for your situation. | | Of course you can make things worse, in both directions. 
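terafo's point about the growing gap between naive and optimized code shows up even without leaving plain C. A hypothetical sketch (function names and sizes are mine, not from the thread): two functionally identical multiplies of flat row-major n×n matrices, where swapping the two inner loops turns the strided, cache-hostile walk over `b` into sequential row-wise streaming that compilers can also auto-vectorize.

```c
#include <stddef.h>

/* Naive loop order: the k-loop strides through b column-wise,
 * touching a new cache line of b on almost every iteration. */
void matmul_ijk(size_t n, const float *a, const float *b, float *c) {
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += a[i * n + k] * b[k * n + j]; /* stride-n access */
            c[i * n + j] = acc;
        }
}

/* Interchanged loop order: same arithmetic, but b and c are now
 * read/written row-wise (unit stride), which is cache-friendly and
 * easy for the compiler to vectorize. */
void matmul_ikj(size_t n, const float *a, const float *b, float *c) {
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j)
            c[i * n + j] = 0.0f;
        for (size_t k = 0; k < n; ++k) {
            float aik = a[i * n + k];
            for (size_t j = 0; j < n; ++j) /* unit stride on b and c */
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

On large matrices the second version is typically several times faster on commodity hardware, which is exactly the kind of "free" performance you only get by consciously fighting for it.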
| Zababa wrote: | On the other hand, in your sane world, productivity would be a | fraction of what it currently is, for developers and users. You | favor computer time over developer time. While computer time | can be a proxy for user time, it isn't always, as developer time | can be used to speed up user time too. A single-minded focus on | computer time sounds like a case of throwing out metrics like | developer time and user time because they are harder to measure | than computer time. In any case, it sounds like a mistake to | me. | bruce343434 wrote: | I don't understand the naming and notation of this article | because the author is assuming context that I don't have. | | Section baseline: What are N, M, K? 3 matrices or? Laid out as a | flat array, or what? `c[m * N + n] += a[m * K + k] * b[k * N + | n];`, ah, apparently a, b and c are the matrices? How does this | work? | | Section body: What is the mathy "C'=αC+A·B"? Derivative of a | constant is the angle times the constant plus the dot product of | A and B??? | | Please, if you write a public blog post, use your head. Not | everybody will understand your terse notes. | engmgrmgr wrote: | Not to be too snarky, but perhaps the onus is on you to do some | homework if you want to understand a niche article for which | you lack context? | | Laying out matrices like that is pretty standard, especially | for a post about vectorization. | ausbah wrote: | shouldn't compilers handle stuff like this? | brrrrrm wrote: | In an ideal world, absolutely! It's a hard problem and there | are many attempts to make that happen automatically, including | polyhedral optimization (Polly[1]) and tensor compiler | libraries (XLA[2] and TVM[3]). I work on a project called | LoopTool[4], which is researching ways to dramatically reduce | the representations of the other projects to simplify | optimization scope. 
| | [1] https://polly.llvm.org | | [2] https://www.tensorflow.org/xla | | [3] https://tvm.apache.org | | [4] https://github.com/facebookresearch/loop_tool | visarga wrote: | If they worked so well, AMD would not be in such a bad position | with their GPUs in ML. They would just need to compile to their | arch. ___________________________________________________________________ (page generated 2022-01-25 23:00 UTC)
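For anyone sharing bruce343434's confusion about the notation: M, N and K are the three dimensions of the product (A is M×K, B is K×N, C is M×N), each matrix stored as a flat row-major array, and the "mathy" expression is the standard GEMM update C' = αC + A·B. A minimal sketch, assuming the same index arithmetic as the article's snippet (the function name and explicit `alpha` parameter are my own):

```c
#include <stddef.h>

/* Baseline GEMM over flat row-major arrays: element (i, j) of a matrix
 * with `cols` columns lives at index i * cols + j.  Computes the update
 * C' = alpha*C + A.B; with alpha = 0 this is a plain matrix multiply,
 * matching the article's `c[m * N + n] += a[m * K + k] * b[k * N + n]`
 * when c starts zeroed. */
void gemm(size_t M, size_t N, size_t K, float alpha,
          const float *a, const float *b, float *c) {
    for (size_t m = 0; m < M; ++m)
        for (size_t n = 0; n < N; ++n) {
            float acc = alpha * c[m * N + n];
            for (size_t k = 0; k < K; ++k)
                acc += a[m * K + k] * b[k * N + n];
            c[m * N + n] = acc;
        }
}
```

So a, b and c are not three independent objects of mysterious type: they are just the three matrices of the product, flattened, with N and K doing double duty as both loop bounds and row strides.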