[HN Gopher] Faster TypedArrays: Vector Addition in WebAssembly
___________________________________________________________________
 
  Faster TypedArrays: Vector Addition in WebAssembly
 
  Author : brrrrrm
  Score  : 72 points
  Date   : 2022-01-02 16:43 UTC (6 hours ago)
 
  (HTM) web link (jott.live)
  (TXT) w3m dump (jott.live)
 
  | lmeyerov wrote:
  | Cool to see this stuff getting out into the wild -- long time
  | coming!
  |
  | What is the current state of these -- do SIMD wasm ops run in
  | Firefox/Chrome/Edge/Safari, and does the data have to be slowly
  | bulk-copied back and forth from a worker (the Spectre-era
  | removal of zero-copy ownership transfers)?
  |
  | We love JS typed arrays and columnar analytics -- Graphistry
  | contributed the first years of Apache Arrow JS -- but we
  | intentionally didn't do wasm for kernels because of this kind of
  | thing, despite promising internal multicore etc. prototypes. A
  | surprise win for us of GPU Python/JS offloading to the server
  | has been not just scale but perf reliability. Curious how much
  | it has improved in the typical case, as it always made sense on
  | paper!
  | brrrrrm wrote:
  | Yep, you still need to bulk-copy memory around if it already
  | exists. If you own the memory, though, you can avoid copying
  | (wasm has "import"), but you'll need to manage that manually.
  |
  | I've found the cleanest approach for me is to have modules
  | allocate space for inputs and outputs and then try to get
  | functions to write directly into the input space.
  |
  | Either way, nothing available is super friendly for
  | user-provided arrays or canvas interactions.
  | IvanK_net wrote:
  | BTW, you can unroll loops in pure JavaScript too, and it also
  | makes the code several times faster.
  |
  | 4x unroll: https://jsfiddle.net/49j7htdz/1/
  |
  | 8x unroll: https://jsfiddle.net/49j7htdz/2/
  | brrrrrm wrote:
  | It's definitely a bit faster with the unroll, but on my machine
  | it's not by much. I've added that idea to the interactive
  | benchmark if you'd like to check it out!
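[Editor's note: the fiddles above aren't reproduced in this dump. As a minimal sketch of the technique being discussed -- a 4x-unrolled typed-array add in plain JavaScript, with an illustrative function name not taken from the fiddles -- it might look like:]

```javascript
// 4x-unrolled element-wise add: out[i] = a[i] + b[i].
// Processing four elements per iteration cuts loop overhead;
// a scalar tail loop handles lengths that aren't a multiple of 4.
function addUnrolled4(a, b, out) {
  const n = a.length;
  const limit = n - (n % 4);
  let i = 0;
  for (; i < limit; i += 4) {
    out[i]     = a[i]     + b[i];
    out[i + 1] = a[i + 1] + b[i + 1];
    out[i + 2] = a[i + 2] + b[i + 2];
    out[i + 3] = a[i + 3] + b[i + 3];
  }
  for (; i < n; i++) out[i] = a[i] + b[i]; // scalar tail
  return out;
}
```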
  | IvanK_net wrote:
  | You probably implemented it in the wrong way. On my machine,
  | the JSFiddle version is 3.5x faster, while in your benchmark it
  | is 1.4x faster.
  | [deleted]
  | Matheus28 wrote:
  | If we're not counting the time to zero out the array, it seems
  | that typed arrays are slower than plain JavaScript because it's
  | converting back and forth between floats and doubles. Try the
  | same with Float64Array. My microbenchmark says f32 is 15% slower
  | than f64 on Chrome: https://jsbench.me/gnkxxkjag8/1
  | brrrrrm wrote:
  | Great catch! I assumed the JIT would identify f32 arithmetic,
  | but I guess that isn't really valid numerically. I wonder if
  | there's a way to use Math.fround [1] in your benchmark to get
  | the expected speedups?
  |
  | [1] https://developer.mozilla.org/en-
  | US/docs/Web/JavaScript/Refe...
  | olliej wrote:
  | Yup, the spec requires computation to be done as doubles, and so
  | any computation must be done that way, as the end result is
  | observable.
  |
  | I didn't know about fround, but I suspect its primary use case
  | is trying to catch floating-point overflow during double
  | arithmetic, because again the precision difference is
  | observable.
  | greggman3 wrote:
  | And, as usual for me, JavaScriptCore blows away V8 in most
  | microbenchmarks I've run.
  |
  | Same machine, JSC (Safari) is 3x faster than Chrome:
  | https://jsbenchit.org/?src=c792550e65de1d038b1f24b446c74592
  | visarga wrote:
  | Would have been nice to see CPU and GPU benchmark scores
  | alongside.
  |
  | I get 1.5M iterations in numpy on CPU, which is on par with
  | typed arrays and much slower than wasmblr.
  | brrrrrm wrote:
  | I believe numpy is bottlenecked by Python's interpreter (which
  | costs about ~1us per dispatch to a C function, from what I
  | recall while working on PyTorch). The number you get is
  | definitely what I'd expect.
  |
  | If you use larger arrays (and use the "out" variant:
  | `numpy.add(a, b, out=c)`), you might get similar total
  | throughput at least.
  |     import numpy as np
  |
  |     N = 1024 * 128
  |     A = np.random.randn(N)
  |     B = np.random.randn(N)
  |     C = np.random.randn(N)
  |
  |     import time
  |
  |     # warm up
  |     for _ in range(1000):
  |         np.add(A, B, out=C)
  |
  |     iters = 1000
  |     t = time.time()
  |     for _ in range(iters):
  |         np.add(A, B, out=C)
  |     d = time.time() - t
  |     # 3 arrays touched per add, 8 bytes per float64 element
  |     print(f"{iters * N * 3 * 8 / d / 1e9:.2f} GB/s")
  | formerly_proven wrote:
  | (~13 GFLOPS over a ~13 kB working set)
  | still_grokking wrote:
  | Oh, someone measuring cache bandwidth?
  | brrrrrm wrote:
  | Mostly L1 data cache, yeah (50% of peak BW). Which is a good
  | sign for an interpreted ISA.
  | olliej wrote:
  | Any WASM that runs for more than a few milliseconds is going to
  | be compiled to native code. I do wonder just where the remaining
  | 50% of bandwidth is going. IIRC bounds checking is only around a
  | 10% penalty in most studies. I guess there are other checks
  | needed in JS (stripping signaling NaNs, etc.), but I don't think
  | WASM requires that.
  | twoodfin wrote:
  | I'd be interested to see if the manual loop unrolling is
  | necessary, or if the WASM toolchain would perform that
  | optimization automatically if `len` were a known constant.
  | brrrrrm wrote:
  | That's a good question, and I think there's some data in the
  | benchmark dump to help answer it. The tuning logic for the
  | results labeled `wasmblr (tuned X)` sweeps through a couple of
  | values for X (the unroll size) and shows the best one. On some
  | browsers (and in node.js) this value goes to 1, which means
  | unrolling isn't found to be necessary for that size.
  |
  | Mostly, from what I've seen, it tunes to a value greater than 1.
  | So I think it's generally still necessary today.
___________________________________________________________________
(page generated 2022-01-02 23:00 UTC)
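[Editor's note: on the Math.fround question upthread -- because Float32Array elements are already exact single-precision values, wrapping every intermediate in Math.fround produces a result bit-identical to doing the arithmetic in f32, which is what lets an engine legally skip the f64 round trip (this is the old asm.js idiom). A hedged sketch, not code from the post or the linked benchmarks:]

```javascript
const f = Math.fround;

// f32-flavored add: rounding each intermediate to single precision
// makes the observable result identical to pure f32 arithmetic, so
// a JIT that recognizes the pattern may use f32 hardware adds.
function addF32(a, b, out) {
  for (let i = 0; i < a.length; i++) {
    out[i] = f(f(a[i]) + f(b[i]));
  }
  return out;
}
```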