[HN Gopher] WebAssembly techniques to speed up matrix multiplication
       ___________________________________________________________________
        
       WebAssembly techniques to speed up matrix multiplication
        
       Author : brrrrrm
       Score  : 201 points
       Date   : 2022-01-25 15:52 UTC (7 hours ago)
        
 (HTM) web link (jott.live)
 (TXT) w3m dump (jott.live)
        
       | VHRanger wrote:
        | This'll inevitably end up as a WASM BLAS library
       | 
       | Which wouldn't be a bad thing
        
         | LeanderK wrote:
          | Why can't we just compile BLAS to WASM? Isn't that the point
          | of WASM?
        
           | brandmeyer wrote:
           | The BLAS have historically relied on a wide range of
           | microarchitecture-specific optimizations to get the most out
           | of each processor generation. An ideal solution would be for
           | the browser to provide that to the application in such a way
           | that it is difficult to fingerprint.
           | 
           | See also the history of Atlas, GotoBLAS, Intel MKL, etc.
        
             | bee_rider wrote:
             | libflame/BLIS might be a good starting point, they've
             | created a framework where you bring your compute kernels,
             | and they make them into a BLAS (plus some other nice
             | functionality). I believe most of the framework itself is
             | in C, so I guess that could somehow be made to spit out
             | wasm (I know nothing about wasm). Then, getting the browser
             | to be aware of the actual real assembly kernels might be a
             | pain.
        
         | injidup wrote:
         | https://www.google.com/search?q=wasm%20blas&ie=utf-8&oe=utf-...
        
       | pcwalton wrote:
       | I really like this writeup. Note that it may not be worth using
       | the SIMD in this way (horizontal SIMD) if you know you will be
       | multiplying many matrices that are the same size. It may be
       | better to do vertical SIMD and simply perform the scalar
       | algorithm on 4 or 8 matrices at a time, like GPUs would do for
       | vertex shaders. This does mean that you may have to interleave
       | your matrices in an odd way to optimize memory access, though.
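The vertical layout pcwalton describes can be illustrated in scalar JavaScript (a hypothetical sketch; the function name and the 2x2 sizes in the test are made up). Lane `l` of element `(i, j)` occupies one of four consecutive floats, which a single WASM `v128` load could fetch at once:

```javascript
// "Vertical" SIMD sketch: multiply four n x n matrices at once.
// Element (i, j) of matrix l lives at index (i*n + j)*4 + l, so the
// innermost 4-wide lane loop maps directly onto one 128-bit vector op.
function matmul4x(a, b, n) {
  const c = new Float32Array(n * n * 4);
  for (let i = 0; i < n; i++)
    for (let j = 0; j < n; j++)
      for (let k = 0; k < n; k++)
        for (let l = 0; l < 4; l++) // one lane per matrix
          c[(i * n + j) * 4 + l] +=
            a[(i * n + k) * 4 + l] * b[(k * n + j) * 4 + l];
  return c;
}
```

The interleaving is the awkward part the comment mentions: the four matrices have to be shuffled into this layout (and back out) around the multiply.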
        
       | bertman wrote:
       | Very cool writeup!
       | 
       | Unfortunately, the bar graphs at the bottom of the article have
       | different y-axis scaling, but even so:
       | 
        | It's sad how Firefox's performance pales in comparison to
        | Chrome's.
        
       | zwieback wrote:
       | Very cool!
       | 
       | My question: will this kind of thing become more mainstream? I've
       | seen the web emerge, go from static pages to entire apps being
       | delivered and executed in the browser. The last bastion of native
       | apps and libraries seems to be highly optimized algorithms but
       | maybe those will also migrate to a deliver-from-the-web and
       | execute in some kind of browser sandbox.
       | 
       | Java promised to deliver some version of native code execution
       | but the Java app/applet idea never seemed to take off. In some
       | ways it seems superior to what we have now but maybe the security
       | concerns we had during that era held Java back too much. Or am I
       | misunderstanding what WebAssembly can bring to the game?
        
         | brrrrrm wrote:
          | I'm not really equipped to predict anything, but I think the
          | recent surge in popularity of simple RISC-y[1] architectures
          | like ARM
         | will allow the WebAssembly standard to stay small yet
         | efficient. I'm hopeful, but standards often have a way of not
         | keeping up with the newest technology so we'll see.
         | 
         | [1]
         | https://en.wikipedia.org/wiki/Reduced_instruction_set_comput...
        
           | pjscott wrote:
           | I don't expect ARM to have much effect here. The wasm
           | instruction set is basically a low-level compiler
           | intermediate representation, and to go from that to machine
           | code for x86 or ARM is about equally difficult in both cases.
        
         | MauranKilom wrote:
         | Well, here is a related talk:
         | https://www.destroyallsoftware.com/talks/the-birth-and-death...
        
         | halpert wrote:
         | The web always feels terrible compared to a native app,
         | especially on a phone. Some of the difference is due to Safari
         | being a bad browser (probably intentional), but a bigger part
         | is that the threading model makes it really difficult to have a
         | responsive UI. Not to mention the browser's gestures often
         | clash with the application's gestures.
        
           | eyelidlessness wrote:
           | I generally find Safari's performance better than other
           | browsers. Also, every mainstream JS runtime is multithreaded.
           | There are limitations on what can be shared between threads,
           | but you can optimize a _lot_ despite those limitations
           | (including using WASM on the web, and native extensions /FFI
           | on Node/Deno).
        
             | Uehreka wrote:
             | In my experience, if the thing I'm working on runs in
             | Safari, it's buttery smooth. And if it doesn't run, it
             | completely shits the bed. Stuff like "we don't support that
             | way of doing shadows in SVG, so rather than simply not
             | implement that property, we've turned your entire SVG
             | element black".
        
             | johncolanduoni wrote:
             | Every mainstream JS runtime has some sort of background
             | garbage collection and parsing/compilation, but for any
             | execution that has to interact with JavaScript heap or
             | stack state you're still in single threaded territory.
             | SharedArrayBuffer can help if you're willing to give up on
             | JavaScript objects entirely, and for WebAssembly this is
             | less of a burden, but that's not going to help you perform
             | rendering on most web apps concurrently. JSCore goes a
             | little bit further and can run javascript in the same
             | runtime instance on multiple threads with some significant
             | limitations, but this isn't exposed to developers on
             | Safari.
        
             | halpert wrote:
             | On iOS, the perf of Safari is definitely better than every
             | other browser, because every other browser is mandated to
             | use WkWebView by Apple. They aren't allowed to implement
             | their own engine. Of course Apple isn't subject to the same
             | restriction.
        
               | amelius wrote:
               | Soon, other browsers can simply run themselves inside
               | WASM which then runs inside a WkWebView :)
        
               | jacobolus wrote:
               | On a Mac, Safari Javascript generally outperforms Chrome
               | and Firefox (other browsing tasks are also generally
               | better performing), but there are some workloads where
               | Safari turns out slower, especially when the developer
               | has put a lot of work into Chrome-specific optimization.
               | 
               | Safari also generally uses a lot less memory and CPU for
               | the same websites. Chrome in particular burns through my
               | battery very quickly, and is basically completely
               | incapable of keeping up with my browser use style (it
               | just crashes when I try to open a few hundred tabs).
               | Presumably nobody with authority at Google is a heavy
               | laptop web-user or prioritizes client-side resource use:
               | Google's websites are also among the biggest browser
               | resource hogs, even when sitting idle in a background
               | tab.
               | 
               | Safari often takes a couple years longer than other
               | browsers to implement cutting-edge features. This seems
               | to me like a perfectly reasonable design decision; some
               | web developers love complaining about it though, and some
               | sites that were only developed against the most recent
               | versions of Chrome don't work correctly in Safari.
        
           | jsheard wrote:
           | WASM threads are available in all modern browsers now,
           | including Safari. It's very early days in terms of ecosystem
           | but we're steadily getting there.
        
             | halpert wrote:
             | The issue isn't so much having additional threads, it's
             | needing two main threads. The browser has one main thread
             | for accepting user input and then dispatches the relevant
             | user events to the JS main thread.
             | 
             | The browser can either dispatch the events asynchronously,
             | leading to the events being handled in a noticeably delayed
             | way, or the browser can block its main thread until the JS
             | dispatch finishes, leading to fewer UI events being
             | handled. Either way is an inferior experience.
        
             | danielvaughn wrote:
             | Also aren't service workers technically multi-threading?
             | That's been a thing for a while in the browser now.
        
               | jsheard wrote:
               | Technically yeah, but the threads could only communicate
               | through message passing which isn't ideal for
               | performance. The more recent major improvement is for
               | workers to be able to share a single memory space similar
               | to how threading works in native applications.
        
           | arendtio wrote:
           | > The web always feels terrible compared to a native app
           | 
           | 'always' is certainly not true. Yes, with modern frameworks
           | it is very easy to build websites which are slow. But it is
           | also possible to build websites with butter smooth animations
           | and instant responses.
           | 
           | I hope that in the future we will get frameworks that make it
           | easier to create lightweight web apps, so that we will see
           | more high performance apps.
        
             | halpert wrote:
              | I made another comment further down, but basically web
             | apps can't run as fast as native apps for a variety of
             | reasons.
             | 
             | One reason is the thread model. There are two main threads
             | that need to be synchronized (browser main thread and JS
             | main thread) which will always be slower than a single main
             | thread.
             | 
             | Another reason is that layout and measurement of elements
             | in HTML is really complicated. Native apps heavily
             | encourage deferred measurement which lets the app measure
             | and lay itself out once per render pass. In JavaScript,
              | layouts may need to happen immediately based on which
              | properties of the DOM you're reading and setting.
        
               | arendtio wrote:
                | I think nobody will argue against the claim that the
                | majority of native apps are faster than web apps. But
                | the key point isn't which is faster, it's whether it is
                | fast enough.
               | 
                | In general, 60 fps is considered sufficient for smooth
                | rendering, and even 5 years ago mobile hardware was fast
                | enough for 60 fps web page rendering. However, many web
                | pages are built in ways that prevent browsers from
                | achieving that goal.
               | 
               | So yes, it is harder for developers to create a pleasant
               | experience and as a result there are more bad apples in
               | the web app basket.
        
               | halpert wrote:
               | I disagree. Yes, if you have a static webpage and all you
               | need to do is scroll, then you can easily get 60 fps,
               | notably because the scrolling is handled natively by the
               | browser and basically is just a translation on the GPU.
               | If the web app accepts user input, especially touch with
               | dragging, then the page will not feel native with the
               | current batch of browsers for the reasons I mentioned
               | above.
        
               | arendtio wrote:
               | So what would be needed to prove you wrong?
               | 
               | How about a 240fps video of a 60Hz display, with 2
               | implementations
               | 
               | 1. Qt
               | 
               | 2. Web
               | 
               | Both times a finger dragging a slider from point A to
               | point B?
        
               | engmgrmgr wrote:
               | You don't have to always use the DOM. You can render in
               | another thread, or even run compute in another thread and
               | use the animation frame system to handle updates.
               | 
               | Having said that, maybe a little less than 10 years ago,
               | we achieved the desired performance with touch-screen
               | dragging of DOM elements. I don't remember specifics, but
               | we didn't use any frameworks.
        
           | nicoburns wrote:
           | > Some of the difference is due to Safari being a bad browser
           | (probably intentional), but a bigger part is that only having
           | one thread makes it really difficult to have a responsive UI
           | 
           | Interestingly Safari is actually generally better than other
           | browsers for running a responsive smooth UI. Not sure how
           | much of that is the Safari engine and how much of it is
           | better CPU on iPhones. But even on a first generation iPad or
           | an iPhone 4 it was possible to get 60fps rendering fairly
           | easily. The same could not be said for even higher end
           | android phones of the time.
        
             | kitsunesoba wrote:
             | Anecdotally, the only time I've had issues with
             | unresponsiveness for pages in Safari is with sites that
             | were written with Chrome specifically in mind.
        
               | themerone wrote:
               | So, basically everything.
        
               | kitsunesoba wrote:
               | The impact is minimal to nonexistent on a light-to-
               | moderate-JS "site" and only really shows up in heavy
               | "apps", like YouTube or GDocs.
        
             | halpert wrote:
             | Really? Even something simple like Wordle feels janky with
             | the way Safari's chrome overlaps the keyboard.
        
               | acdha wrote:
               | Do you have some kind of extensions or something like
               | text zooming enabled? On a clean install it doesn't
               | overlap at all.
        
               | halpert wrote:
               | Hmm the layout is working for me now, but the tile
               | animation is broken. They flicker instead of smoothly
               | flipping over.
        
               | acdha wrote:
               | Any chance you have reduced motion enabled? They flip for
               | me but are obviously on a short timer.
        
         | javajosh wrote:
         | The web solved software distribution. Full stop. There only
         | remain the edge cases, and that's where webasm wants to help.
         | 
         | Sun/Java wanted badly to solve this problem, but tried to do
         | too much too soon. Java gave devs in 1999 cutting edge OOP
         | tools for doing GUIs (e.g. Swing) but distribution was always
         | the problem. Installing and running browser plugins was always
         | error prone, and it turned out the browser was itself just good
         | enough to deliver value, so it won. (With the happy side-effect
         | of giving the world-wide community of devs one of the gooeyist
         | languages ever to express their fever dreams of what
         | application code should be).
         | 
         | The question in my mind is whether there is enough demand for
         | the kinds of software webasm enables, especially given that
         | other routes of distribution (app stores) have filled in the
         | gaps of what the web delivers, and are generally a lot more
         | pleasant and predictable to native devs.
        
       | ginko wrote:
       | Isn't the idea of webassembly to compile native
       | C/C++/Rust/Whatever code to be able to run in the browser?
       | 
        | Why not just compile OpenBLAS or another numerical computing
        | library like that to WASM?
        
         | brrrrrm wrote:
         | that's exactly how TF.js does it:
         | https://github.com/google/XNNPACK/blob/master/src/f32-gemm/M...
        
         | remus wrote:
         | I'm by no means an expert but my understanding is that a lot of
         | the performance from libraries like openBLAS comes from
         | targeting specific architectures (e.g. particular instruction
         | sets on a series of processors). You can probably milk some
         | more performance by targeting the web assembly architecture
          | specifically (assuming openBLAS hasn't started doing
          | something similar itself).
        
       | bee_rider wrote:
       | People are asking about BLAS already in various threads, but if
       | they know the size of their matrices beforehand, it might be
       | interesting to try EIGEN. EIGEN also has the benefit that it is
       | an all-template C++ library so I guess it should be somehow
       | possible to spit out the WASM (I know nothing about WASM).
       | 
       | Of course usually BLAS beats EIGEN, but for small, known-sized
       | matrices, it might have a chance.
        
         | MauranKilom wrote:
         | Can you specify what "small" constitutes for you?
        
       | LeanderK wrote:
        | I was hopeful WebAssembly would speed up the web. What's
        | missing?
       | Native Browser APIs? Native IO Apis?
       | 
       | Let's say I want to interactively plot some complicated function
       | without slowing down the rest. Can I do this in WebAssembly now?
        
         | acdha wrote:
         | > Let's say I want to interactively plot some complicated
         | function without slowing down the rest. Can I do this in
         | WebAssembly now?
         | 
         | You've been able to do that for a while now and it would likely
         | be fast enough even in pure JavaScript. The things which tend
         | to slow the web down come back to developers not caring about
         | performance (and thus not even measuring) or the cross-purposes
         | of things like ad sales where a significant amount of
         | JavaScript is deployed to do things the user doesn't care
         | about.
        
         | sharikous wrote:
         | > Let's say I want to interactively plot some complicated
         | function without slowing down the rest. Can I do this in
         | WebAssembly now?
         | 
          | Yep, with a Web Worker for the secondary thread. However, the
          | environment is still young and it's difficult to use heavy
          | computation libraries. Besides, for some reason SIMD
          | instructions are present only for 128-bit units (2 doubles or
          | 4 floats). Another problem is that no matter what you do, it
          | is a layer over the hardware, so it will be slower than
          | specialized machine code (unless what you need is already in
          | the JS API).
        
           | LeanderK wrote:
           | > difficult to use heavy computation libraries
           | 
           | what do you mean? What's the blocker? Something like numpy
           | for js would fill this role, calling wasm in the background.
           | Just the missing SIMD-instructions? Some quick googling shows
           | that one can't compile BLAS for wasm yet. This might be due
            | to wasm64 not being available yet, I think? So would this
           | help to tap into the existing ecosystem of optimised
           | mathematical routines?
           | 
           | Ideally...I would leave js and just use python ;) It has the
           | whole ecosystem at hands with numpy, scipy, statsmodels etc.
           | But nobody is doing it and idk why. I think it might be due
           | to fortran not compiling to wasm.
        
             | sharikous wrote:
              | Yes, I forgot that wasm64 is still not available. That's
              | a big blocker.
             | 
             | About numpy for js I believe js is still not comfortable
             | enough for this kind of use, especially with the lack of
             | operator overloading.
             | 
             | Anyway there are some builds of BLAS (or equivalents) to
             | wasm and even of python. Check out pyodide and brython
        
               | [deleted]
        
         | onion2k wrote:
         | _Let 's say I want to interactively plot some complicated
         | function without slowing down the rest_
         | 
         | If you can draw your plot in a shader then you can do it in
         | WebGL very easily. You'd only need to update the input uniforms
         | and everything else would happen on the GPU.
        
         | johndough wrote:
         | Browsers will always lag behind desktop applications by a
         | decade or two because everything is designed by committee and
         | takes forever to arrive (see e.g. SSE, AVX, GPGPU compute). And
          | even if everyone can eventually agree on and implement the
          | lowest common denominator, hardware will already have evolved
          | beyond it.
         | 
         | In addition, browsers have to fight all kinds of nefarious
         | attackers, so it is a very hostile environment to develop in.
         | For example, we can't even measure time accurately (or do
         | proper multithreading with shared memory) in the browser thanks
         | to the Spectre and Meltdown vulnerabilities.
         | https://meltdownattack.com/
         | 
         | That being said, WebGL implements extremely gimped shaders.
         | Yet, they are still more than enough to render all kinds of
         | functions. For example, see https://www.shadertoy.com/ or
         | https://glslsandbox.com/ which are huge collections of
         | functions which take screen coordinates and time as input and
         | compute pixel colors from that, i.e. f(x, y, t) -> (r, g, b).
          | This might not sound very impressive at first glance, but
         | people have been amazingly creative within this extremely
         | constrained environment, resulting in all kinds of fancy 3D
         | renderings.
        
       | wdroz wrote:
        | Since wasm supports threads, I wonder if you can speed up
        | these operations even further by using multiple threads.
        
         | brrrrrm wrote:
         | That's a good point: you certainly could. There's some fun
         | exploration to be done with atomic operations.
         | 
         | The issue is that threaded execution requires cross-origin
         | isolation, which isn't trivial to integrate. (Example server
         | that will serve the required headers: https://github.com/bwasti
         | /wasmblr/blob/main/thread_example/s...)
        
       | phkahler wrote:
       | Another technique is to transpose the left matrix so each dot
       | product is scanned in row-order and hence more cache friendly.
       | 
       | Another one I tried ages ago is to use a single loop counter and
       | "de-interleave" the bits to get what would normally be 3 distinct
       | loop variables. For this you need to modify the entry in the
       | result matrix rather than having it write-only. It has the effect
       | of accessing like a z-order curve but in 3 dimensions. It's a bit
       | of overhead, but you can also unroll say 8 iterations (2x2x2)
       | which helps make up for it. This ends up making good use of both
       | caches and even virtual memory if things don't fit in RAM. OTOH
       | it tends to prefer sizes that are a power of 2.
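The transpose trick can be sketched as follows (function name is made up). Note the sketch assumes the row-major `c[m*N + n] += a[m*K + k] * b[k*N + n]` convention quoted elsewhere in the thread, under which it is the right-hand matrix whose accesses are strided; under a column-major convention the same idea applies to the left matrix, as the parent comment describes:

```javascript
// Transpose B once so the k-loop reads both operands sequentially,
// instead of striding through B by N floats per step.
function matmulTransposed(a, b, c, M, N, K) {
  const bt = new Float32Array(N * K); // bt[n*K + k] = b[k*N + n]
  for (let k = 0; k < K; k++)
    for (let n = 0; n < N; n++)
      bt[n * K + k] = b[k * N + n];
  for (let m = 0; m < M; m++)
    for (let n = 0; n < N; n++) {
      let sum = 0;
      for (let k = 0; k < K; k++)
        sum += a[m * K + k] * bt[n * K + k]; // both scans unit-stride
      c[m * N + n] = sum;
    }
}
```

The O(NK) transpose cost is amortized over the O(MNK) multiply, so it pays off for all but tiny matrices.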
        
         | gfd wrote:
          | I really like this set of lecture notes on optimizing matrix
          | multiplication: https://ppc.cs.aalto.fi/ch2/v7/ (The transpose
         | trick is used in v1)
        
           | progbits wrote:
            | This deserves its own submission, wonderful resource!
        
           | eigenvalue wrote:
            | I find it surprising that, even after using all those
            | tricks, they are still only able to achieve around 50% of
            | the theoretical
           | peak performance of the chip in terms of GFLOPS. And that's
           | for matrix multiplication, which is a nearly ideal case for
           | these techniques.
        
         | dralley wrote:
         | The compiler will sometimes do this transpose for you, but as
         | with all compiler optimizations it might sometimes break.
        
           | melissalobos wrote:
            | That sounds very interesting; is there anywhere I can read
            | more about this optimization? I didn't know any compiler
            | could do
           | optimizations like that.
        
             | kanaffa12345 wrote:
             | there is no way a general purpose compiler will figure this
             | out. op is probably talking about something like halide or
             | tvm or torchscript jit.
        
               | [deleted]
        
         | cerved wrote:
          | You can get extremely creative in optimizing matrix
          | multiplication for cache and SIMD.
        
           | jacobolus wrote:
           | For contexts like the web, also check out cache-oblivious
           | matrix multiplication https://dspace.mit.edu/bitstream/handle
           | /1721.1/80568/4355819...
        
         | mynameismon wrote:
         | Another very very interesting optimisation can be found in
         | these lecture slides [0]. (Scroll to slide 28, although the
         | entire slide deck is amazing)
         | 
         | [0]: https://ocw.mit.edu/courses/electrical-engineering-and-
         | compu...
        
       | magoghm wrote:
       | I tested it on my M1 Mac and it reached 46.78 Gigaflops, which is
       | quite amazing for a CPU running at 3.2 GHz. Isn't that like an
       | average of 14.6 floating point operations per clock cycle?
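The back-of-envelope arithmetic checks out (the per-instruction breakdown in the comment below is an assumption about how such throughput could plausibly be reached, not a measurement):

```javascript
// FLOP/s divided by clock rate = floating point ops retired per cycle.
// ~14.6/cycle is plausible with 128-bit SIMD: an f32x4 multiply-add pair
// contributes 8 FLOPs, so roughly two such operations per cycle suffice.
const flopsPerCycle = 46.78e9 / 3.2e9;
console.log(flopsPerCycle.toFixed(1)); // "14.6"
```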
        
         | lostmsu wrote:
         | If you look at the comment above about the regular GEMM
         | implementation, M1 actually can do that at 1.6 Teraflops.
        
         | danieldk wrote:
         | I hate to post this multiple times, but the M1 has a dedicated
          | matrix co-processor; it can do matrix multiplication at
         | >1300GFLOP/s if you use the native Accelerate framework (which
         | uses the standard BLAS API). The M1 Pro/Max can even do double
         | that (>2600 GFLOP/s) [1].
         | 
         | 46.78 GFLOP/s is not even that great on non-specialized
         | hardware. E.g., a Ryzen 5900X, can do ~150 GFLOP/s single-
         | threaded with MKL.
         | 
         | [1] https://github.com/danieldk/gemm-benchmark#1-to-16-threads
        
       | owlbite wrote:
       | How does this compare to the native BLAS in the Accelerate
       | library?
        
         | conradludgate wrote:
         | Accelerate on the M1 is ridiculously fast (thanks to its
         | special core set and specific instructions).
         | 
          | Some benchmarks I've done have it beating out CUDA on my RTX
          | 2070. I have yet to get a proper GFLOPS number, though.
        
         | danieldk wrote:
         | It's going to absolutely blow this away. Here are some of my
         | single precision GEMM benchmarks for the M1 and M1 Pro:
         | 
         | https://github.com/danieldk/gemm-benchmark#1-to-16-threads
         | 
         | tl;dr, the M1 can do ~1300 GFLOP/s and the M1 Pro up to
         | ~2700GFLOP/s.
         | 
         | On the vanilla M1, that's 28 times faster than the best result
         | in the post.
         | 
         | The difference (besides years of optimizing linear algebra
         | libraries) is that Accelerate uses the AMX matrix
         | multiplication co-processors through Apple's proprietary
         | instructions.
        
         | [deleted]
        
       | riddleronroof wrote:
       | This is very cool
        
       | [deleted]
        
       | wheelerof4te wrote:
       | I admire people who can read and understand this.
        
         | Kilenaitor wrote:
         | Which parts are you unable to read and understand? I'm sure
         | some of us here could help explain if you have specific
         | questions or hangups.
        
           | wheelerof4te wrote:
           | The math stuff :)
           | 
           | JavaScript code is readable, at least.
        
             | Kilenaitor wrote:
             | Is "the math stuff" all the optimizations being performed
             | e.g. vectorizing multiplication?
             | 
             | Not trying to sound dismissive here but the core math the
             | post is working with is actually a pretty straightforward
             | matrix multiplication.
             | 
             | The bulk of the discussion focuses on optimizing the
             | execution of that straightforward multiplication algorithm
             | [triple-nested for loop; O(n^3)] rather than making
             | algorithmic/mathematic optimizations.
             | 
             | And again, specific questions are easier to answer. :)
        
               | djur wrote:
               | Matrix multiplication isn't exactly intuitive if you've
               | never worked with it before.
        
           | bruce343434 wrote:
           | I don't understand the naming and notation of this article
           | because the author is assuming context that I don't have.
           | 
           | Section baseline: What are N, M, K? 3 matrices or? Laid out
           | as a flat array, or what? `c[m * N + n] += a[m * K + k] * b[k
           | * N + n];`, ah, apparently a b and c are the matrices? How
           | does this work?
           | 
            | Section body: What is the mathy "C' = aC + A⋅B"? derivative of
           | a constant is the angle times the constant plus the dot
           | product of A and B???
        
             | conradludgate wrote:
             | There are 3 matrices in question: A, B and C. They have
             | dimensions (M * K), (K * N) and (M * N) respectively.
             | 
             | They are laid out, rather than nested arrays, as a single
             | continuous collection of bytes that can be interpreted as
             | having a matrix shape. That's where the `m * N + n` comes
              | from (m rows down and n cols in)
              | 
              |     C' = alpha C + A.B
             | 
              | This is the 'generalised matrix-matrix multiplication'
              | (GEMM) operation. It multiplies the matrices A and B, adds
              | the result to a scaled version of C, and stores it back
              | into C. Setting alpha to 0 gets you a basic matmul.
        
               | wheelerof4te wrote:
               | Thank you for this detailed explanation. Making the
               | matrix one-dimensional makes sense from the performance
               | standpoint.
        
       | [deleted]
        
       | marginalia_nu wrote:
       | It's kind of bizarre how it's an accomplishment to get your code
       | closer to what the hardware is capable of. In a sane world, that
       | should be the default environment you're working in. Anything
       | else is wasteful.
        
         | Kilenaitor wrote:
         | There's always been a tradeoff in writing code between
         | developer experience and taking full advantage of what the
         | hardware is capable of. That "waste" in execution efficiency is
         | often worth it for the sake of representing helpful
         | abstractions and generally helping developer productivity.
         | 
         | The real win here is when we can have both because of smart
         | toolchains that can transform those high-level constructs and
         | representations into the most efficient implementation for the
         | hardware.
         | 
          | Posts like this demonstrate what's possible with the right
          | optimizations, so that tools like compilers and assemblers can
          | take advantage of them when given high-level code. That way we
          | can achieve what you're hoping for: optimal implementations by
          | default.
        
           | AnIdiotOnTheNet wrote:
           | > That "waste" in execution efficiency is often worth it for
           | the sake of representing helpful abstractions and generally
           | helping developer productivity.
           | 
           | That's arguable at best. I for one am sick of 'developer
           | productivity' being the excuse for why my goddamned
           | supercomputer crawls when performing tasks that were trivial
           | even on hardware 15 years older.
           | 
           | > The real win here is when we can have both because of smart
           | toolchains that can transform those high-level constructs and
           | representations into the most efficient implementation for
           | the hardware.
           | 
           | That's been the promise for a long time and it still hasn't
           | been realized. If anything things seem to be getting less and
           | less optimal.
        
             | adamc wrote:
             | No, it's really not even arguable. Lots and lots of
             | software is written in business contexts where the cost of
             | developing reliable code is a lot more important than its
             | performance. Not everything is a commercial product aimed
             | at a wide audience.
             | 
             | What you're "sick of" is mostly irrelevant unless you
             | represent a market that is willing to pay more for a more
             | efficient product. I use commercial apps every day that
             | clearly could work a lot better than they do. But... would
             | I pay a lot for that? No. They are too small a factor in my
             | workday.
             | 
             | Saving money is part of engineering too.
        
               | bruce343434 wrote:
               | People have been sick of slow programs and slow computers
               | since literally forever. I think you live in a bubble or
               | are complacent.
               | 
                | No one I know has anything good to say about Microsoft
                | Teams, for instance. And that's just one of the recent
                | "desktop apps" which are actually framed browsers.
        
             | lijogdfljk wrote:
             | > when performing tasks that were trivial even on hardware
             | 15 years older.
             | 
             | Did the software to perform those tasks stop working?
        
             | nicoburns wrote:
             | > I for one am sick of 'developer productivity' being the
             | excuse for why my goddamned supercomputer crawls when
             | performing tasks that were trivial even on hardware 15
             | years older.
             | 
             | The problem here is developer salaries. So long as
             | developers are as expensive as they are the incentive will
             | be to optimise for developer productivity over runtime
             | efficiency.
        
               | dogleash wrote:
                | If developers cost one fifth of what they do now, how
                | many projects that let performance languish today would
                | staff up to the extent that doing a perf pass would make
                | its way to the top of the backlog queue?
               | 
               | Come on now. Let's be honest here. The answer for >90% of
               | projects is either a faster pace on new features, or to
               | pocket the payroll savings. They'd never prioritize
               | something that they've already determined can be ignored.
        
               | Kilenaitor wrote:
                | We've been making developer experience optimizations
                | since _long_ before developers started demanding high
                | salaries. The whole reason to go from assembly to C was
                | to improve developer experience and efficiency.
               | 
               | It seems fairly reductive to dismiss the legitimate
               | advantages of increased productivity. It's faster to
               | iterate on ideas and products, we gain back time to focus
               | on more complex concepts, and, more broadly, we further
               | open up this field to more and more people. And those
               | folks can then go on to invest in these kind of
               | performance improvements.
        
               | AnIdiotOnTheNet wrote:
               | > It seems fairly reductive to dismiss the legitimate
               | advantages of increased productivity.
               | 
                | Certainly there are some, but I think we passed the point
                | of diminishing returns long, long ago and we're now well
                | into the territory of regression. I would argue that we
                | are actually experiencing negative productivity returns
                | from a lot of the abstractions we employ, because we've
               | built up giant abstraction stacks where each new
               | abstraction has new leaks to deal with and everything is
               | much more complicated than it needs to be because of it.
        
               | nicoburns wrote:
               | Hmm... I think our standards for application
               | functionality are also a lot higher. For example, how
                | many applications from the 90s dealt flawlessly with
                | Unicode text?
        
               | AnIdiotOnTheNet wrote:
               | How much added slowness do you think Unicode is
               | responsible for? Because as much of a complex nightmarish
               | standard as it is[0], there are plenty of applications
               | that are fast that handle it just fine as far as I can
               | tell. They're built with native widgets and written in
               | (probably) C.
               | 
               | [0] plenty of slow as fuck modern software doesn't handle
               | it even close to 'flawlessly'
        
           | danieldk wrote:
           | _There 's always been a tradeoff in writing code between
           | developer experience and taking full advantage of what the
           | hardware is capable of. That "waste" in execution efficiency
           | is often worth it for the sake of representing helpful
           | abstractions and generally helping developer productivity._
           | 
           | The GFLOP/s is 1/28th of what you'd get when using the native
           | Accelerate framework on M1 Macs [1]. I am all in for powerful
           | abstractions, but not using native APIs for this (even if
           | it's just the browser calling Accelerate in some way) is just
           | a huge waste of everyone's CPU cycles and electricity.
           | 
            | [1] https://github.com/danieldk/gemm-benchmark#1-to-16-threads
        
         | Salgat wrote:
         | Once you realize that it's a completely sandboxed environment
         | that works on any platform, it's a lot more impressive.
        
         | dr_zoidberg wrote:
         | Wasteful of computing resources, yes, but for a long time we've
          | been prioritizing developer time. That happens because you
          | can get faster hardware cheaper than you can get more
          | developer time (and not all developers' time is equal; say,
          | Carmack can do in a few hours things I couldn't do in
          | months).
         | 
         | I do agree that we'd get fantastic performance out of our
         | systems if we had the important layers optimized like this (or
         | more), but it seems few (if any) have been pushing in that
         | direction.
        
           | terafo wrote:
            | But you can't get faster hardware cheaper anymore. Not
            | naively faster hardware, anyway. You are getting more and
            | more optimization opportunities nowadays, though: vectorize
            | your code, offload some work to the GPU or one of the
            | countless other accelerators present on a modern SoC, change
            | your I/O stack so you can utilize SSDs efficiently, etc. I
            | think it's a matter of time until someone puts an FPGA onto
            | a mainstream SoC, and the gap between efficient and
            | mainstream software will only widen from that point.
        
             | dr_zoidberg wrote:
             | You are precisely telling me the ways in which I can get
             | faster hardware: GPU, accelerators, the I/O stack and SSDs,
             | etc.
             | 
              | I agree that the software layer has become slow, crufty,
              | bloated, etc. But it's still cheaper to get faster hardware
              | (or wait a bit for it; see M1, Alder Lake, Zen 3, to name a
              | few, all getting successors later this year) than to get a
              | good programmer to optimize your code.
             | 
             | And I know that we'd get much better performance out of
             | current (and probably future) hardware if we had more
             | optimized software, but it's rare to see companies and
                | projects take on such optimization efforts.
        
               | terafo wrote:
               | But you can't get all these things in the browser. You
               | don't just increase CPU frequency and get free
               | performance anymore. You need conscious effort to use GPU
                | computing, conscious effort to ditch the current I/O
                | stack for io_uring. Modern hardware gives performance to
                | those who are willing to fight for it. The disparity
                | between the naive approach and the optimized approach
                | grows every year.
        
         | peterhunt wrote:
         | The real issue here is that the hardware isn't capable of
         | sandboxing without introducing tons of side channel attacks.
         | Lots of applications are willing to sacrifice a lot of
         | performance in order to gain the distribution advantages from a
         | safe, sandboxed execution environment.
        
         | not2b wrote:
         | In a sane world (which is the world that we live in), it's best
         | to find a well-optimized library for common operations like
         | matrix multiplication. But if you want to do something unusual
         | (multiply large matrices inside a browser, quickly) you've
         | exited the sane world so you'll have to work at it.
        
         | ska wrote:
         | > Anything else is wasteful.
         | 
         | Everything has a cost. If the developer is a slave to machine
         | architecture, development is slow and error prone. If the
          | machine is a slave to an abstraction, everything will run
         | slowly. Unsurprisingly, the real trick is finding appropriate
         | balance for your situation.
         | 
         | Of course you can make things worse, in both directions.
        
         | Zababa wrote:
         | On the other hand, in your sane world, productivity would be a
         | fraction of what it currently is, for developers and users. You
         | favor computer time over developer time. While computer time
          | can be a proxy for user time, it isn't always, as developer
          | time can be used to speed up user time too. A single-minded
         | computer time sounds like a case of throwing out metrics like
         | developer time and user time because they are harder to measure
         | than computer time. In any case, it sounds like a mistake to
         | me.
        
       | bruce343434 wrote:
       | I don't understand the naming and notation of this article
       | because the author is assuming context that I don't have.
       | 
       | Section baseline: What are N, M, K? 3 matrices or? Laid out as a
       | flat array, or what? `c[m * N + n] += a[m * K + k] * b[k * N +
       | n];`, ah, apparently a b and c are the matrices? How does this
       | work?
       | 
        | Section body: What is the mathy "C' = αC + A·B"? Derivative of
        | a constant is the angle times the constant plus the dot product
        | of A and B???
       | 
       | Please, if you write a public blog post, use your head. Not
       | everybody will understand your terse notes.
        
         | engmgrmgr wrote:
         | Not to be too snarky, but perhaps the onus is on you to do some
         | homework if you want to understand a niche article for which
         | you lack context?
         | 
         | Laying out matrices like that is pretty standard, especially
         | for a post about vectorization.
        
       | ausbah wrote:
       | shouldn't compilers handle stuff like this?
        
         | brrrrrm wrote:
         | In an ideal world, absolutely! It's a hard problem and there
         | are many attempts to make that happen automatically including
         | polyhedral optimization (Polly[1]) and tensor compiler
         | libraries (XLA[2] and TVM[3]). I work on a project called
         | LoopTool[4] which is researching ways to dramatically reduce
         | the representations of the other projects to simplify
         | optimization scope.
         | 
         | [1] https://polly.llvm.org
         | 
         | [2] https://www.tensorflow.org/xla
         | 
         | [3] https://tvm.apache.org
         | 
         | [4] https://github.com/facebookresearch/loop_tool
        
         | visarga wrote:
          | If they worked so well, AMD would not be in such a bad
          | position with their GPUs in ML. They would just need to
         | with their GPUs in ML. They would just need to compile to their
         | arch.
        
       ___________________________________________________________________
       (page generated 2022-01-25 23:00 UTC)