[HN Gopher] What's in that .wasm? Introducing wasm-decompile
___________________________________________________________________
What's in that .wasm? Introducing wasm-decompile

Author : slow-typer
Score  : 187 points
Date   : 2020-04-28 12:58 UTC (10 hours ago)

(HTM) web link (v8.dev)
(TXT) w3m dump (v8.dev)

| dlojudice wrote:
| > Decompile to what?
|
| > `wasm-decompile` produces output that tries to look like a
| "very average programming language" while still staying close to
| the Wasm it represents.
|
| > #1 goal is readability
|
| > #2 goal is to still represent Wasm as 1:1 as possible
|
| It seems AssemblyScript would do the job
|
| [1] https://assemblyscript.org/
| Aardappel wrote:
| AssemblyScript would certainly do worse at #2, and possibly
| also at #1. Translating to Wasm and translating from Wasm lead
| to different optimal designs; see, for example, how these two
| systems deal with loads and stores.
| mmastrac wrote:
| This is super handy. Pseudocode is very useful for understanding
| flow - so much more than actual assembly. I've always found it an
| order of magnitude easier to understand bad asm-to-C
| decompilation from IDA or Ghidra than perfect disassembly.
| 3pt14159 wrote:
| It would be nice if the decompiled output were runnable through
| an interpreter so you could step through it with a debugger of
| some kind and rename or annotate the variables and functions as
| you reverse engineer what is going on.
| irrational wrote:
| When I first started learning JavaScript in the late 90s, the
| primary way I learned new things was from reading other people's
| code in my browser. Nowadays this isn't as easy since you often
| have to run obfuscated code through a prettifier to get it back
| into a human readable format, but it is still possible with some
| effort.
| I was concerned that WASM would make this impossible
| (despite the stated goal of "Be readable and debuggable --
| WebAssembly is a low-level assembly language, but it does have a
| human-readable text format (the specification for which is still
| being finalized) that allows code to be written, viewed, and
| debugged by hand."*), but WASM-decompile gives me hope.
|
| *https://developer.mozilla.org/en-US/docs/WebAssembly/Concept...
| [deleted]
| snazz wrote:
| This looks much nicer than the wasm2c output for that binary. I
| compiled it with `clang wasm.c -c -target wasm32 -O2` just like
| in the instructions (I'm on LLVM 10), and used the latest
| wasm2wat with `wasm2wat -f wasm.o` and got this instead:
|
|     (module
|       (type (;0;) (func (param i32 i32) (result f32)))
|       (import "env" "__linear_memory" (memory (;0;) 0))
|       (import "env" "__indirect_function_table" (table (;0;) 0 funcref))
|       (func (;0;) (type 0) (param i32 i32) (result f32)
|         (f32.add
|           (f32.add
|             (f32.mul
|               (f32.load
|                 (local.get 0))
|               (f32.load
|                 (local.get 1)))
|             (f32.mul
|               (f32.load offset=4
|                 (local.get 0))
|               (f32.load offset=4
|                 (local.get 1))))
|           (f32.mul
|             (f32.load offset=8
|               (local.get 0))
|             (f32.load offset=8
|               (local.get 1))))))
|
| wasm2c (also from WABT) returns this thing:
| https://paste.linux.community/view/7877995f
| Aardappel wrote:
| wasm2c has a different objective though: to be recompile-able
| again while preserving semantics. wasm-decompile was designed
| for readability first.
| snazz wrote:
| Fair enough. I'm still surprised at just how unreadable (for
| me) the wasm2c output was, though. The compiler must have
| done quite a bit of optimizing that wasm2c was unable to
| undo.
| Aardappel wrote:
| It doesn't actually try to undo anything, it just
| translates Wasm instructions 1:1 (they're in your link at
| lines 205-221). wasm-decompile does try to "undo" some
| things, but it is generally impossible given LLVM's
| optimized output and how low-level Wasm is (see also
| article).
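For readers following the folded `wasm2wat -f` listing quoted in this sub-thread: read inside-out, it multiplies the f32 values at offsets 0, 4 and 8 behind two pointers and sums the products, i.e. a three-component dot product. Below is a rough Python sketch of that reading. It assumes little-endian f32 values in linear memory; the names `f32_load`, `dot`, and the `bytearray` standing in for linear memory are illustrative and do not come from the thread or from WABT.

```python
import struct

def f32_load(memory, addr, offset=0):
    """Mimic Wasm's f32.load: read a little-endian 32-bit float at addr + offset."""
    return struct.unpack_from("<f", memory, addr + offset)[0]

def dot(memory, a, b):
    """Mirror the nesting of the folded listing: (x*x' + y*y') + z*z'.

    Python floats are doubles, so intermediate rounding can differ from
    real f32 arithmetic; this only illustrates the structure.
    """
    return (
        (f32_load(memory, a) * f32_load(memory, b)
         + f32_load(memory, a, 4) * f32_load(memory, b, 4))
        + f32_load(memory, a, 8) * f32_load(memory, b, 8)
    )

# Hypothetical linear memory holding two 3-float records back to back.
memory = bytearray(24)
struct.pack_into("<3f", memory, 0, 1.0, 2.0, 3.0)
struct.pack_into("<3f", memory, 12, 4.0, 5.0, 6.0)
print(dot(memory, 0, 12))  # 1*4 + 2*5 + 3*6 = 32.0
```

The two `i32` parameters in the listing play the role of `a` and `b` here: plain byte addresses into memory, which is why the Wasm text shows loads and offsets rather than field names.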
| fowl2 wrote:
| Can we compile it back to wasm again? ;P
| frosted-flakes wrote:
| No.
|
| > Its #1 goal is readability: help guide readers understand
| what is in a .wasm with as easy to follow code as possible. Its
| #2 goal is to still represent Wasm as 1:1 as possible, to not
| lose its utility as a disassembler. Obviously these two goals
| are not always unifiable.
|
| > This output is not meant to be an actual programming language
| and there is currently no way to compile it back into Wasm.
| vbezhenar wrote:
| Actually I thought about implementing a programming language
| which is a bit more pleasant to work with than raw wat
| format, but which still translates roughly 1 to 1 to wasm.
| Something like in this link, actually. But that seems outside
| of my capabilities and I'm not sure if it's really useful to
| anyone.
| rhencke wrote:
| You might enjoy AssemblyScript.
|
| http://assemblyscript.org
| DonHopkins wrote:
| The great thing about AssemblyScript is it makes it
| possible to share some of the same code and interfaces
| and data and tools between JavaScript and WebAssembly.
|
| If you're already developing in TypeScript, AssemblyScript
| is a good way to generate WASM code that interoperates
| nicely with it, which you can't do with plain old
| JavaScript.
| brabel wrote:
| There are a few languages that already do that, but mostly
| non-serious stuff.
|
| https://github.com/appcypher/awesome-wasm-langs
|
| I've implemented one but still need to get back to it to
| finish it... WASM is one of the easiest targets out there,
| hence so many languages already targeting it. But that will
| change once some of the WASM proposals become a standard,
| especially things like GC support and WASI, which will take
| a team (and long term investments) rather than a lone dev
| to implement:
|
| https://webassembly.org/docs/future-features/
| klodolph wrote:
| This is fascinating.
| For various reasons, WASM is less like a
| target bytecode format and more like a peculiar IR for compilers.
| I'm sure this has all sorts of effects on the tooling.
| k__ wrote:
| What's the difference?
| saagarjha wrote:
| Most bytecode isn't optimized much.
| klodolph wrote:
| If you were designing a bytecode as a compilation target, you
| would provide an easy correspondence in the bytecode to basic
| blocks.
|
| See: https://en.wikipedia.org/wiki/Basic_block
|
| WASM instead provides traditional control structures. So the
| compiler either has to preserve control structures through to
| the IR, or has to work backwards from basic blocks to control
| structures. Both options are undesirable, from the
| perspective of compiler writers, and would be unnecessary if
| the VM were a greenfield project.
| MaxBarraclough wrote:
| I get the impression WASM is really a very clunky
| representation, far from what any greenfield project would
| ever have arrived at. That is to say, its decisions aren't
| just tradeoffs that people disagree about, they're simply
| inappropriate for what it's trying to do. More than once
| I've encountered comments and blog-posts lamenting its
| fundamentals, e.g. [0].
|
| [0] http://troubles.md/wasm-is-not-a-stack-machine/
| klodolph wrote:
| The criticisms in the linked article are simply not
| grounded in fact, and the article was obviously written
| by someone without any expertise in how modern compilers
| work. There are also a number of basic errors in the
| article.
|
| Liveness information simply doesn't belong in the
| bytecode. SSA is trivial to recreate from local mutable
| variables (it would make a good homework assignment for
| someone in an undergrad "intro to compilers" class).
| WebAssembly is obviously a register machine.
|
| > With this, it becomes possible to get rid of locals
| entirely.
|
| Both factually incorrect and pointless. There is no
| tangible benefit in getting rid of locals entirely.
| You are merely changing out one representation (register
| machine) for a different one (stack machine).
|
| There are valid criticisms of WASM, but the linked
| article doesn't have any.
|
| The weird part of WASM is the control structures. The
| rest of it is a fairly sensible, actually rather nice
| register machine. You can see that older bytecode systems
| like the JVM are stack machines, but newer ones tend to
| be register machines. This isn't because people are
| getting stupid, it's because there are legitimate reasons
| to prefer register machines, and on the balance of
| things, my observations are that people with experience
| in the field tend to prefer register machines.
| drew-y wrote:
| > The weird part of WASM is the control structures.
|
| Out of curiosity, what do you find weird about the
| control structure portion?
|
| I just wrote a basic language that compiles to wasm and
| found the built in control structures made my life
| easier.
| klodolph wrote:
| Open any compilers book, WASM will look alien. Control
| flow analysis is normally done through basic blocks. If
| you make a toy language and don't do control flow
| analysis you won't notice what's so weird about WASM. If
| you take an existing compiler and port it to WASM, you'll
| cry, then gather yourself together and read how the
| relooper algorithm works.
| MaxBarraclough wrote:
| Interesting, thanks for the analysis.
|
| > You can see that older bytecode systems like the JVM
| are stack machines, but newer ones tend to be register
| machines.
|
| Like LLVM. Is there a good reason for WebAssembly not
| being SSA?
| klodolph wrote:
| There's no good reason for WebAssembly to be SSA. That is
| a good enough reason for it not to be.
|
| The purpose of SSA is to make it easier to write
| optimizations. However, you wouldn't be optimizing WASM
| anyway, you would necessarily translate it first into
| some kind of IR suitable for optimization passes.
| Might as well convert to SSA at that point, rather than bloat
| the bytecode by exposing implementation details of the
| target.
|
| Remember that WASM's purpose is to be portable and safe.
| Making it more complicated just in order to make
| sophisticated back-ends slightly faster is a net loss.
| Using SSA would also make naive/simple backends slower.
| readittwice wrote:
| I would still consider WASM a stack machine and not a
| register machine. Yes, there are mutable local variables
| in WASM but Java bytecode has them as well - which you
| consider a stack machine. BTW the designers of WASM
| explicitly call WASM a stack machine here:
| https://github.com/WebAssembly/design/blob/master/Rationale.....
| With WASM's MVP it was necessary to store e.g. loop state in
| local variables; thanks to recent changes this doesn't
| seem to be necessary anymore. I think this was the main
| reason that blog post considered WASM to be a register
| machine. javac also makes heavy use of variables in
| bytecode, but somehow no one considers the JVM a register
| machine.
|
| > my observations are that people with experience in the
| field tend to prefer register machines
|
| That's actually the opposite of my observation, they seem
| to prefer stack machines.
| kazinator wrote:
| The clang disassembly given in the article sure makes it
| look like WASM is a nested expression tree, which leaves
| the choice of stack versus register to the
| implementation.
|
|     (f32.add
|       (f32.add
|         (f32.mul
|           (f32.load
|             (local.get 0))
|           (f32.load
|             (local.get 1)))
|         (f32.mul
|           (f32.load offset=4 (local.get 0))
|           (f32.load offset=4 (local.get 1))))
|       (f32.mul
|         (f32.load offset=8
|           (local.get 0))
|         (f32.load offset=8
|           (local.get 1))))
|
| The outer f32.add could translate into a byte code
| instruction that finds its two operands on a stack, or to
| one which gets them from registers.
|
| The code only says that there is an f32.add call which has
| two operands that are the result of an f32.add and f32.mul
| and so on.
|
| The implementations will agree in their treatment of
| locals: that there are two locals 0 and 1, which support
| loading at offsets and such.
|
| Both stack and register machines can support locals.
| klodolph wrote:
| Yes, that's exactly true. The "stack machine" here can be
| seen as nothing more than a way of encoding the
| expression tree.
| klodolph wrote:
| Java has instructions like dup, swap etc. To me, that is
| the critical difference here, and where I draw the line
| between "stack machine" and "register machine".
| readittwice wrote:
| I have to admit this line seems arbitrary to me. So WASM
| is a register machine to you, but if they simply added
| those 2 instructions, would it suddenly become a stack
| machine? Those instructions would actually be
| trivial to add. I think those terms are relatively well
| defined and when you argue that WASM is a register
| machine even though the inventors explicitly claim it's a
| stack machine you should have really good arguments for
| that. Personally I would be surprised if you could point
| me to any literature that supports your definition.
| klodolph wrote:
| Turing completeness sounds like an arbitrary distinction
| to those outside of the field of CS, but it's not.
|
| To me, the distinction here is that the stack machine in
| WASM is restricted to the point that it corresponds 1:1
| with an expression tree--not even a graph, just a tree.
| This means that every function in WebAssembly can be
| thought of as a collection of statements and
| expressions, and the stack machine abstraction is nothing
| more than a serialization format for the expressions.
|
| Maybe dial it back a bit with the challenge to point at
| literature. The literature has not really caught up with
| the existence of WASM yet.
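The claim in this sub-thread that the operand stack is "nothing more than a serialization format" for expression trees can be made concrete with a small sketch: a post-order walk turns a tree into flat stack code, and replaying that code on a stack of subtrees recovers the same tree. This is only an illustration under stated assumptions: the tuple layout `(op, immediate, *children)`, the `ARITY` table, and the functions `flatten` and `rebuild` are hypothetical, the operator names merely imitate Wasm mnemonics, and control flow and multi-value results are ignored entirely.

```python
# How many stack operands each operator in this sketch pops (assumed table).
ARITY = {"local.get": 0, "f32.load": 1, "f32.mul": 2, "f32.add": 2}

def flatten(tree):
    """Post-order walk: emit the operands first, then the operator itself."""
    op, imm, *children = tree
    code = []
    for child in children:
        code.extend(flatten(child))
    code.append((op, imm))
    return code

def rebuild(code):
    """Replay flat code on a stack of subtrees to recover the original tree."""
    stack = []
    for op, imm in code:
        split = len(stack) - ARITY[op]   # index where this op's operands start
        children = stack[split:]
        del stack[split:]
        stack.append((op, imm, *children))
    assert len(stack) == 1, "a single expression leaves exactly one value"
    return stack[0]

# (f32.mul (f32.load (local.get 0)) (f32.load (local.get 1)))
tree = ("f32.mul", None,
        ("f32.load", 0, ("local.get", 0)),
        ("f32.load", 0, ("local.get", 1)))

code = flatten(tree)
for op, imm in code:
    print(op, "" if imm is None else imm)
print(rebuild(code) == tree)
```

The round trip works because every pushed value is consumed exactly once. An instruction like `dup` would let one value feed two consumers, so the flat code would describe a graph rather than a tree, which is roughly where the commenters above draw the line.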
| Aardappel wrote:
| The multi-value proposal breaks the ability to turn Wasm
| into expressions easily, and thus makes it even more of a
| stack machine than it already is. Dup and swap may still
| be added in the future.
|
| A defining feature of a register machine is that the
| actual instruction encoding has direct references to
| source and destination registers in it. Wasm doesn't have
| those, it has explicit get_local instructions instead.
|
| That said, if you turn off LLVM's WebAssemblyRegStackify
| pass, all LLVM IR's values will end up in locals, with
| little to no stack usage. Still no register machine, but
| a bit more of a grey area :)
| tedmielczarek wrote:
| I don't think that's true. It was designed for its
| purpose by people that knew what they were doing based on
| experience gathered from implementing previous
| incremental steps including emscripten's original
| JavaScript output and asm.js (a formalization of some
| techniques emscripten employed). The design rationale is
| right there in the spec repository:
| https://github.com/WebAssembly/design/blob/master/Rationale....
|
| It was built to be a better target than asm.js for
| compiled languages to run in JavaScript VMs and it seems
| to have succeeded on that front. That it's not a perfect
| fully-general bytecode format doesn't seem like a real
| knock against it, that's a much harder problem to solve.
| In any event it seems to be getting a lot more traction
| than any previous entries in the space, which is
| exciting!
| kazinator wrote:
| Nested, structured control can be broken into basic blocks
| without flattening it into if/goto first. It sounds easier,
| in fact.
|
| In fact, the construction of a basic block from flat code
| is a form of recovery of control structure. If the original
| code had nested loops, nesting will emerge in the basic
| block graph.
|
| A recursive traversal of the original structure could
| produce that graph more directly; it doesn't have to walk a
| flat list of instructions asking, "does this have a label
| on it which is the target of a branch".
|
| For instance, if we walk an if/then/else AST node, we can
| recursively get the basic block graph for the test, the
| then and else part, and then integrate that into a larger
| basic block graph according to a rigid pattern.
| tedmielczarek wrote:
| The latter is very achievable using the Relooper algorithm
| developed by Alon Zakai and described in the Emscripten
| paper: https://dl.acm.org/doi/10.1145/2048147.2048224
| klodolph wrote:
| Achievable, but extra front-end work that belongs in the
| backend.
| saagarjha wrote:
| Ooh, this is nice! No more having to read wasm2wat's mildly
| annoying format.
| cjbprime wrote:
| FWIW there's also wasm2c and wasm2js out there :)
| saagarjha wrote:
| There are, but they are also fairly annoying to read...
| DonHopkins wrote:
| I'd love to have wasm2assemblyscript!
|
| AssemblyScript: A Subset of TypeScript That Compiles to
| WebAssembly
|
| https://news.ycombinator.com/item?id=15187961
|
| https://github.com/AssemblyScript
|
| https://github.com/AssemblyScript/assemblyscript
|
| https://docs.assemblyscript.org/
| Aardappel wrote:
| I'm the author, if anyone has specific questions :)
| _hardwaregeek wrote:
| Loving the tooling around wasm getting better. I've been
| debugging my compiler output with hexl-mode and reading the
| binary format and while it's not _that_ bad, it'd be nice to do
| more advanced debugging with a text format.
|
| There was a project I saw too that intended to visualize
| WebAssembly's execution. That'd be extremely helpful too.
| cfallin wrote:
| > reading the binary format ... it'd be nice to do more
| advanced debugging with a text format.
|
| Do you know about `wasm2wat` (from the WebAssembly binary
| toolkit, "WABT")? It produces a 1-to-1 text representation of
| the bytecode and is meant to always roundtrip via `wat2wasm`
| back to the same bytecode.
| _hardwaregeek wrote:
| Yeah...I should probably use that. But does it work on
| mangled WASM? Part of the issue was that my compiler wasn't
| producing valid WASM
| cfallin wrote:
| Ah, no, probably doesn't do parsing recovery. But
| `wasm-validate` from the same toolkit will at least tell you the
| offset at which your wasm file has an error (I just flipped
| some bits in a wasm file to test this), which may be
| helpful!
___________________________________________________________________
(page generated 2020-04-28 23:00 UTC)