[HN Gopher] Faster TypedArrays: Vector Addition in WebAssembly
       ___________________________________________________________________
        
       Faster TypedArrays: Vector Addition in WebAssembly
        
       Author : brrrrrm
       Score  : 72 points
       Date   : 2022-01-02 16:43 UTC (6 hours ago)
        
 (HTM) web link (jott.live)
 (TXT) w3m dump (jott.live)
        
       | lmeyerov wrote:
       | Cool to see this stuff getting out into the wild, long time
       | coming!
       | 
        | What is the current state of these -- do simd wasm ops run in
        | ffox/chrome/edge/safari, and does the data have to be slowly bulk
        | copied back-and-forth from a worker (the Spectre-era removal of
        | zero-copy ownership transfers)?
       | 
       | We love js typed arrays and columnar analytics -- graphistry
       | contributed the first years of Apache Arrow JS -- but
       | intentionally didn't do wasm for kernels because of this kind of
       | stuff, despite promising internal multicore etc prototypes. A
       | surprise win for us of GPU python/js offloading to the server has
       | been not just scale but perf reliability. Curious how much it has
       | improved in the typical case, as it always made sense on paper!
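The zero-copy ownership transfer mentioned above can be sketched in a few lines. This is not from the thread; `structuredClone` with a transfer list is used here as a stand-in for what `worker.postMessage(data, [buf])` does across threads:

```javascript
// Sketch (not from the thread): transferring an ArrayBuffer moves
// ownership without copying the bytes, but detaches the buffer on the
// sending side. structuredClone with a transfer list models what
// worker.postMessage(data, [buf]) does between threads.
const buf = new Float32Array(1024).buffer;
console.log(buf.byteLength);   // 4096 before the transfer

const moved = structuredClone(buf, { transfer: [buf] });
console.log(moved.byteLength); // 4096: same bytes, new owner
console.log(buf.byteLength);   // 0: the original is now detached
```

By contrast, a plain structured clone (no transfer list) copies the bytes, which is the slow bulk copy the comment is asking about.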
        
         | brrrrrm wrote:
          | Yep, you still need to bulk copy memory around if it already
          | exists. If you own the memory, though, you can avoid copying
          | (a wasm module can import its Memory object), but you'll need
          | to manage that manually.
         | 
         | I've found the cleanest approach for me is to have modules
         | allocate space for inputs and outputs and then try to get
         | functions to write directly into the input space.
         | 
         | Either way, nothing available is super friendly for user
         | provided arrays or canvas interactions.
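The allocate-inside-the-module pattern described above can be sketched without a real wasm module. The offsets and the commented-out `vec_add` call are illustrative stand-ins for what a module's allocator and exports would provide:

```javascript
// Sketch of the pattern described above (names are illustrative): the
// wasm module owns its linear memory; JS gets typed-array views into
// it and writes inputs in place, so no bulk copy is needed.
const memory = new WebAssembly.Memory({ initial: 1 }); // one 64 KiB page
const N = 1024;

// Pretend the module's allocator handed back these byte offsets.
const aPtr = 0, bPtr = N * 4, outPtr = 2 * N * 4;
const a = new Float32Array(memory.buffer, aPtr, N);
const b = new Float32Array(memory.buffer, bPtr, N);
const out = new Float32Array(memory.buffer, outPtr, N);

a.fill(1); b.fill(2); // write inputs directly into wasm memory
// exports.vec_add(aPtr, bPtr, outPtr, N) // would run inside wasm
for (let i = 0; i < N; i++) out[i] = a[i] + b[i]; // stand-in for the call
console.log(out[0]); // 3
```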
        
       | IvanK_net wrote:
        | BTW, you can unroll loops in pure Javascript too, and it also
        | makes the code several times faster.
        | 
        | 4x unroll: https://jsfiddle.net/49j7htdz/1/
        | 
        | 8x unroll: https://jsfiddle.net/49j7htdz/2/
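A 4x unroll in plain JS looks roughly like the fiddles linked above. This sketch is not copied from them; `addUnrolled4` is an illustrative name, and it assumes `len` is divisible by 4:

```javascript
// 4x manually unrolled Float32Array addition: fewer loop-condition
// checks per element, and more independent work per iteration.
function addUnrolled4(a, b, out, len) {
  for (let i = 0; i < len; i += 4) {
    out[i]     = a[i]     + b[i];
    out[i + 1] = a[i + 1] + b[i + 1];
    out[i + 2] = a[i + 2] + b[i + 2];
    out[i + 3] = a[i + 3] + b[i + 3];
  }
}

const N = 1024;
const a = new Float32Array(N).fill(1);
const b = new Float32Array(N).fill(2);
const out = new Float32Array(N);
addUnrolled4(a, b, out, N);
console.log(out[N - 1]); // 3
```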
        
         | brrrrrm wrote:
          | It's definitely a bit faster with the unroll, but on my
          | machine not by much. I've added that idea to the interactive
          | benchmark if you'd like to check it out!
        
           | IvanK_net wrote:
            | You probably implemented it the wrong way. On my machine,
            | the JSFiddle version is 3.5x faster, while in your
            | benchmark it is 1.4x faster.
        
             | [deleted]
        
       | Matheus28 wrote:
        | If we're not counting the time to zero out the array, it seems
        | that f32 typed arrays are slower than plain javascript because
        | the engine converts between floats and doubles back and forth.
        | Try the same with Float64Array. My microbenchmark says f32 is
        | 15% slower than f64 on chrome: https://jsbench.me/gnkxxkjag8/1
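The conversion being described is visible even without a benchmark: JS numbers are doubles, so every `Float32Array` read widens to f64 and every write narrows to f32, while `Float64Array` stores the value exactly:

```javascript
// 0.1 has no exact binary representation; storing it in a Float32Array
// rounds it to the nearest f32, so reading it back (widened to f64)
// no longer compares equal to the f64 literal 0.1.
const f32 = new Float32Array([0.1]);
const f64 = new Float64Array([0.1]);
console.log(f32[0] === 0.1); // false: 0.1 was rounded to f32 on store
console.log(f64[0] === 0.1); // true: stored exactly as the f64 value
```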
        
         | brrrrrm wrote:
          | Great catch! I assumed the JIT would recognize pure-f32
          | arithmetic, but I guess that transformation isn't valid
          | numerically. I wonder if there's a way to use Math.fround[1]
          | in your benchmark to get the expected speedups?
         | 
         | [1] https://developer.mozilla.org/en-
         | US/docs/Web/JavaScript/Refe...
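A sketch of how Math.fround could be used in such a benchmark: rounding every intermediate to f32 makes single-precision arithmetic observably equivalent to the spec-mandated double arithmetic, which is what would let an engine use f32 internally (whether a given engine actually does is not guaranteed; `addF32` is an illustrative name):

```javascript
// Math.fround rounds a double to the nearest f32 value. If every
// intermediate result is fround-ed, an engine can legally perform the
// arithmetic in single precision, since the f64 result would be
// indistinguishable.
function addF32(a, b, out, len) {
  for (let i = 0; i < len; i++) {
    out[i] = Math.fround(a[i] + b[i]); // result rounded as if computed in f32
  }
}

const a = new Float32Array([0.1, 0.2]);
const b = new Float32Array([0.3, 0.4]);
const out = new Float32Array(2);
addF32(a, b, out, 2);

console.log(Math.fround(0.1));         // 0.10000000149011612
console.log(Math.fround(0.1) === 0.1); // false
```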
        
           | olliej wrote:
            | Yup, the spec requires the arithmetic to be performed as
            | doubles, and engines must do it that way because the end
            | result is observable.
            | 
            | I didn't know about fround, but I suspect its primary use
            | case is catching floating point overflow during double
            | arithmetic, because again the precision difference is
            | observable.
        
         | greggman3 wrote:
          | And, as usual for me, JavaScriptCore blows away V8 in most
          | microbenchmarks I've run.
         | 
         | Same machine, JSC (Safari) is 3x faster than Chrome
         | https://jsbenchit.org/?src=c792550e65de1d038b1f24b446c74592
        
       | visarga wrote:
       | Would have been nice to see CPU and GPU benchmark scores
       | alongside.
       | 
        | I get 1.5mil iterations in numpy on CPU, which is on par with
        | typed arrays and much slower than wasmblr.
        
         | brrrrrm wrote:
          | I believe numpy is bottlenecked by Python's interpreter
          | (which costs roughly 1us per dispatch to a C function, from
          | what I recall while working on PyTorch). The number you get
          | is definitely what I'd expect.
         | 
          | If you use larger arrays (and use the "out" variant:
          | `numpy.add(a, b, out=c)`), you might get similar total
          | throughput at least.
          | 
          |     import numpy as np
          |     import time
          | 
          |     N = 1024 * 128
          |     # cast to f32 so the 4-bytes-per-element factor below holds
          |     A = np.random.randn(N).astype(np.float32)
          |     B = np.random.randn(N).astype(np.float32)
          |     C = np.random.randn(N).astype(np.float32)
          | 
          |     # warm up
          |     for _ in range(1000):
          |         np.add(A, B, out=C)
          | 
          |     iters = 1000
          |     t = time.time()
          |     for _ in range(iters):
          |         np.add(A, B, out=C)
          |     d = time.time() - t
          |     # 3 arrays touched per add, 4 bytes per f32 element
          |     print(f"{iters * N * 3 * 4 / d / 1e9:.2f} GB/s")
        
       | formerly_proven wrote:
       | (~13 GFLOPS over a ~13 kB working set)
        
         | still_grokking wrote:
         | Oh, someone measuring cache bandwidth?
        
           | brrrrrm wrote:
            | mostly L1 data-cache, yea (50% of peak BW), which is a
            | good sign for an interpreted ISA.
        
             | olliej wrote:
              | Any WASM that runs for more than a few milliseconds is
              | going to be compiled to native code. I do wonder where
              | the remaining 50% of bandwidth is going. IIRC bounds
              | checking is only around a 10% penalty in most studies. I
              | guess there are other checks needed in JS (stripping
              | signaling NaNs, etc., but I don't think WASM requires
              | that).
        
       | twoodfin wrote:
       | I'd be interested to see if the manual loop unrolling is
       | necessary, or if the WASM toolchain would perform that
       | optimization automatically if `len` were a known constant.
        
         | brrrrrm wrote:
         | That's a good question, and I think there's some data in the
         | benchmark dump to help answer it. The tuning logic for the
         | results labeled `wasmblr (tuned X)` sweeps through a couple
         | values for X (the unroll size) and shows the best one. On some
         | browsers (and in node.js) this value goes to 1, which means
         | unrolling isn't found to be necessary for that size.
         | 
         | Mostly, from what I've seen, it tunes to a value greater than
         | 1. So I think generally it's still necessary today.
        
       ___________________________________________________________________
       (page generated 2022-01-02 23:00 UTC)