[HN Gopher] What's in that .wasm? Introducing wasm-decompile
       ___________________________________________________________________
        
       What's in that .wasm? Introducing wasm-decompile
        
       Author : slow-typer
       Score  : 187 points
       Date   : 2020-04-28 12:58 UTC (10 hours ago)
        
 (HTM) web link (v8.dev)
 (TXT) w3m dump (v8.dev)
        
       | dlojudice wrote:
       | > Decompile to what?
       | 
       | > `wasm-decompile` produces output that tries to look like a
       | "very average programming language" while still staying close to
       | the Wasm it represents.
       | 
       | > #1 goal is readability
       | 
       | > #2 goal is to still represent Wasm as 1:1 as possible
       | 
       | It seems AssemblyScript would do the job
       | 
       | [1] https://assemblyscript.org/
        
         | Aardappel wrote:
         | AssemblyScript would certainly do worse at #2, and possibly
         | also at #1. Translating to Wasm and translating from Wasm
         | lead to different optimal designs; see for example how these
         | two systems deal with loads and stores.
        
       | mmastrac wrote:
       | This is super handy. Pseudocode is very useful for understanding
       | flow - so much more than actual assembly. I've always found it an
       | order of magnitude easier to understand bad asm-to-C decompilation
       | from IDA or Ghidra than perfect disassembly.
        
       | 3pt14159 wrote:
       | It would be nice if the decompiled output were runnable through
       | an interpreter so you could step through it with a debugger of
       | some kind and rename or annotate the variables and functions as
       | you reverse engineer what is going on.
        
       | irrational wrote:
       | When I first started learning JavaScript in the late 90s, the
       | primary way I learned new things was from reading other people's
       | code in my browser. Nowadays this isn't as easy since you often
       | have to run obfuscated code through a prettifier to get it back
       | into a human readable format, but it is still possible with some
       | effort. I was concerned that WASM would make this impossible
       | (despite the stated goal of "Be readable and debuggable --
       | WebAssembly is a low-level assembly language, but it does have a
       | human-readable text format (the specification for which is still
       | being finalized) that allows code to be written, viewed, and
       | debugged by hand."*), but wasm-decompile gives me hope.
       | 
       | *https://developer.mozilla.org/en-US/docs/WebAssembly/Concept...
        
         | [deleted]
        
       | snazz wrote:
       | This looks much nicer than the wasm2c output for that binary. I
       | compiled it with `clang wasm.c -c -target wasm32 -O2` just like
       | in the instructions (I'm on LLVM 10), and used the latest
       | wasm2wat with `wasm2wat -f wasm.o` and got this instead:
       |     (module
       |       (type (;0;) (func (param i32 i32) (result f32)))
       |       (import "env" "__linear_memory" (memory (;0;) 0))
       |       (import "env" "__indirect_function_table" (table (;0;) 0 funcref))
       |       (func (;0;) (type 0) (param i32 i32) (result f32)
       |         (f32.add
       |           (f32.add
       |             (f32.mul
       |               (f32.load
       |                 (local.get 0))
       |               (f32.load
       |                 (local.get 1)))
       |             (f32.mul
       |               (f32.load offset=4
       |                 (local.get 0))
       |               (f32.load offset=4
       |                 (local.get 1))))
       |           (f32.mul
       |             (f32.load offset=8
       |               (local.get 0))
       |             (f32.load offset=8
       |               (local.get 1))))))
       | 
       | wasm2c (also from WABT) returns this thing:
       | https://paste.linux.community/view/7877995f
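For readers who find the folded text format hard to scan, here is a rough Python model of what that function body appears to compute: three `f32.load`s per pointer at offsets 0, 4 and 8, three multiplies and two adds. The `f32_load` and `dot` names and the memory layout are assumptions made for this sketch, and Python does its arithmetic in double precision, so it only approximates real `f32` rounding.

```python
import struct


def f32_load(memory: bytes, base: int, offset: int = 0) -> float:
    """Model of f32.load: read a little-endian 32-bit float at base + offset."""
    return struct.unpack_from("<f", memory, base + offset)[0]


def dot(memory: bytes, a: int, b: int) -> float:
    """Mirror the nesting of the WAT: (add (add (mul ..) (mul ..)) (mul ..))."""
    return (
        f32_load(memory, a) * f32_load(memory, b)
        + f32_load(memory, a, 4) * f32_load(memory, b, 4)
        + f32_load(memory, a, 8) * f32_load(memory, b, 8)
    )


# Hypothetical layout: two 3-float vectors packed back to back at addresses 0 and 12.
memory = struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
result = dot(memory, 0, 12)
```

The two `i32` parameters are treated here as plain byte addresses into linear memory, which is how the loads in the listing use them.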
        
         | Aardappel wrote:
         | wasm2c has a different objective though: to be recompile-able
         | again while preserving semantics. wasm-decompile was designed
         | for readability first.
        
           | snazz wrote:
           | Fair enough. I'm still surprised at just how unreadable (for
           | me) the wasm2c output was, though. The compiler must have
           | done quite a bit of optimizing that wasm2c was unable to
           | undo.
        
             | Aardappel wrote:
             | It doesn't actually try to undo anything, it just
             | translates Wasm instructions 1:1 (they're in your link at
             | lines 205-221). wasm-decompile does try to "undo" some
             | things, but it is generally impossible given LLVM's
             | optimized output and how low-level Wasm is (see also
             | article).
        
       | fowl2 wrote:
       | Can we compile it back to wasm again? ;P
        
         | frosted-flakes wrote:
         | No.
         | 
         | > Its #1 goal is readability: help guide readers understand
         | what is in a .wasm with as easy to follow code as possible. Its
         | #2 goal is to still represent Wasm as 1:1 as possible, to not
         | lose its utility as a disassembler. Obviously these two goals
         | are not always unifiable.
         | 
         | > This output is not meant to be an actual programming language
         | and there is currently no way to compile it back into Wasm.
        
           | vbezhenar wrote:
           | Actually I thought about implementing a programming language
           | which is a bit more pleasant to work with than raw wat
           | format, but which still translated roughly 1 to 1 to wasm.
           | Something like in this link, actually. But that seems outside
           | of my capabilities and I'm not sure if it's really useful to
           | anyone.
        
             | rhencke wrote:
             | You might enjoy AssemblyScript.
             | 
             | http://assemblyscript.org
        
               | DonHopkins wrote:
               | The great thing about AssemblyScript is it makes it
               | possible to share some of the same code and interfaces
               | and data and tools between JavaScript and WebAssembly.
               | 
               | If you're already developing in TypeScript, AssemblyScript
               | is a good way to generate WASM code that interoperates
               | nicely with it, which you can't do with plain old
               | JavaScript.
        
             | brabel wrote:
             | There are a few languages that already do that, but mostly
             | non-serious stuff.
             | 
             | https://github.com/appcypher/awesome-wasm-langs
             | 
             | I've implemented one but still need to get back to it to
             | finish it... WASM is one of the easiest targets out there,
             | hence so many languages already targeting it. But that will
             | change once some of the WASM proposals become a standard,
             | especially things like GC support and WASI, which will take
             | a team (and long term investments) rather than a lone dev
             | to implement:
             | 
             | https://webassembly.org/docs/future-features/
        
       | klodolph wrote:
       | This is fascinating. For various reasons, WASM is less like a
       | target bytecode format and more like a peculiar IR for compilers.
       | I'm sure this has all sorts of effects on the tooling.
        
         | k__ wrote:
         | What's the difference?
        
           | saagarjha wrote:
           | Most bytecode isn't optimized much.
        
           | klodolph wrote:
           | If you were designing a bytecode as a compilation target, you
           | would provide an easy correspondence in the bytecode to basic
           | blocks.
           | 
           | See: https://en.wikipedia.org/wiki/Basic_block
           | 
           | WASM instead provides traditional control structures. So the
           | compiler either has to preserve control structures through to
           | the IR, or has to work backwards from basic blocks to control
           | structures. Both options are undesirable, from the
           | perspective of compiler writers, and would be unnecessary if
           | the VM were a greenfield project.
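As a rough illustration of the "easy correspondence to basic blocks" mentioned above, this sketch partitions a flat, goto-style instruction list into basic blocks using the textbook "leader" rule. The tuple encoding of instructions (`"op"`, `"jmp"`, `"br_if"`) is invented for the example and is not any real bytecode.

```python
def split_basic_blocks(code):
    """Partition a flat instruction list into basic blocks.

    Each instruction is ("op", text), ("jmp", target_index) or
    ("br_if", target_index). This encoding is made up for the sketch.
    """
    leaders = {0}  # the first instruction always starts a block
    for i, ins in enumerate(code):
        if ins[0] in ("jmp", "br_if"):
            leaders.add(ins[1])        # a branch target starts a block
            if i + 1 < len(code):
                leaders.add(i + 1)     # so does the instruction after a branch
    starts = sorted(leaders)
    return [code[s:e] for s, e in zip(starts, starts[1:] + [len(code)])]


# A simple counting loop in flat form.
flat = [
    ("op", "i = 0"),       # 0
    ("br_if", 5),          # 1: leave the loop when i >= n
    ("op", "body"),        # 2
    ("op", "i = i + 1"),   # 3
    ("jmp", 1),            # 4: back edge to the loop test
    ("op", "return"),      # 5
]
blocks = split_basic_blocks(flat)
```

With arbitrary jumps like these, the blocks fall out of a single pass; Wasm's nested `block`/`loop`/`if` constructs are what force a compiler to go the other way.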
        
             | MaxBarraclough wrote:
             | I get the impression WASM is really a very clunky
             | representation, far from what any greenfield project would
             | ever have arrived at. That is to say, its decisions aren't
             | just tradeoffs that people disagree about, they're simply
             | inappropriate for what it's trying to do. More than once
             | I've encountered comments and blog-posts lamenting its
             | fundamentals, e.g. [0].
             | 
             | [0] http://troubles.md/wasm-is-not-a-stack-machine/
        
               | klodolph wrote:
               | The criticisms in the linked article are simply not
               | grounded in fact, and the article was obviously written
               | by someone without any expertise in how modern compilers
               | work. There are also a number of basic errors in the
               | article.
               | 
               | Liveness information simply doesn't belong in the
               | bytecode. SSA is trivial to recreate from local mutable
               | variables (it would make a good homework assignment for
               | someone in an undergrad "intro to compilers" class).
               | WebAssembly is obviously a register machine.
               | 
               | > With this, it becomes possible to get rid of locals
               | entirely.
               | 
               | Both factually incorrect and pointless. There is no
               | tangible benefit in getting rid of locals entirely. You
               | are merely changing out one representation (register
               | machine) for a different one (stack machine).
               | 
               | There are valid criticisms of WASM, but the linked
               | article doesn't have any.
               | 
               | The weird part of WASM is the control structures. The
               | rest of it is a fairly sensible, actually rather nice
               | register machine. You can see that older bytecode systems
               | like the JVM are stack machines, but newer ones tend to
               | be register machines. This isn't because people are
               | getting stupid, it's because there are legitimate reasons
               | to prefer register machines, and on the balance of
               | things, my observations are that people with experience
               | in the field tend to prefer register machines.
        
               | drew-y wrote:
               | > The weird part of WASM is the control structures.
               | 
               | Out of curiosity, what do you find weird about the
               | control structure portion?
               | 
               | I just wrote a basic language that compiles to wasm and
               | found the built in control structures made my life
               | easier.
        
               | klodolph wrote:
               | Open any compilers book, WASM will look alien. Control
               | flow analysis is normally done through basic blocks. If
               | you make a toy language and don't do control flow
               | analysis you won't notice what's so weird about WASM. If
               | you take an existing compiler and port it to WASM, you'll
               | cry, then gather yourself together and read how the
               | relooper algorithm works.
        
               | MaxBarraclough wrote:
               | Interesting, thanks for the analysis.
               | 
               | > You can see that older bytecode systems like the JVM
               | are stack machines, but newer ones tend to be register
               | machines.
               | 
               | Like LLVM. Is there a good reason for WebAssembly not
               | being SSA?
        
               | klodolph wrote:
               | There's no good reason for WebAssembly to be SSA. That is
               | a good enough reason for it not to be.
               | 
               | The purpose of SSA is to make it easier to write
               | optimizations. However, you wouldn't be optimizing WASM
               | anyway, you would necessarily translate it first into
               | some kind of IR suitable for optimization passes. Might
               | as well convert to SSA at that point, rather than bloat
               | the bytecode by exposing implementation details of the
               | target.
               | 
               | Remember that WASM's purpose is to be portable and safe.
               | Making it more complicated just in order to make
               | sophisticated back-ends slightly faster is a net loss.
               | Using SSA would also make naive/simple backends slower.
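To make "might as well convert to SSA at that point" a little more concrete, here is a minimal sketch of SSA renaming for straight-line code that uses mutable locals. The `(dest, op, operands)` statement format and the `to_ssa` helper are assumptions for this example. It deliberately ignores control flow, so no phi nodes appear; a real converter has to place those at join points.

```python
def to_ssa(statements):
    """Rename straight-line assignments so every name is assigned exactly once.

    Each statement is (dest, op, operands). Operands that name an already
    assigned variable are rewritten to that variable's latest version;
    anything else (literals, parameters) is left untouched.
    """
    version = {}
    out = []
    for dest, op, operands in statements:
        renamed = tuple(
            f"{x}_{version[x]}" if x in version else x for x in operands
        )
        version[dest] = version.get(dest, 0) + 1
        out.append((f"{dest}_{version[dest]}", op, renamed))
    return out


ssa = to_ssa([
    ("x", "const", ("1",)),
    ("x", "add", ("x", "x")),   # reads the old x, defines a new one
    ("y", "mul", ("x", "2")),
])
```

The second statement reads `x_1` and defines `x_2`, which is the whole trick in the branch-free case.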
        
               | readittwice wrote:
               | I would still consider WASM a stack machine and not a
               | register machine. Yes, there are mutable local variables
               | in WASM but Java bytecode has them as well - which you
               | consider a stack machine. BTW the designers of WASM
               | explicitly call WASM a stack machine here:
               | https://github.com/WebAssembly/design/blob/master/Rationale.....
               | With WASM's MVP it was necessary to store e.g. loop state in
               | local variables; thanks to recent changes this doesn't
               | seem to be necessary anymore. I think this was the main
               | reason that blog post considered WASM to be a register
               | machine. javac also makes heavy use of variables in
               | bytecode, but somehow no one considers the JVM a register
               | machine.
               | 
               | > my observations are that people with experience in the
               | field tend to prefer register machines
               | 
               | That's actually the opposite of my observation, they seem
               | to prefer stack machines.
        
               | kazinator wrote:
               | The clang disassembly given in the article sure makes it
               | look like WASM is a nested expression tree, which leaves
               | the choice of stack versus register to the
               | implementation.
               | 
               |     (f32.add
               |       (f32.add
               |         (f32.mul
               |           (f32.load
               |             (local.get 0))
               |           (f32.load
               |             (local.get 1)))
               |         (f32.mul
               |           (f32.load offset=4
               |             (local.get 0))
               |           (f32.load offset=4
               |             (local.get 1))))
               |       (f32.mul
               |         (f32.load offset=8
               |           (local.get 0))
               |         (f32.load offset=8
               |           (local.get 1))))
               | 
               | The outer f32.add could translate into a byte code
               | instruction that finds its two operands on a stack, or to
               | one which gets them from registers.
               | 
               | The code only says that there is a f32.add call which has
               | two operands that are the result of a f32.add and f32.mul
               | and so on.
               | 
               | The implementations will agree in their treatment of
               | locals: that there are two locals 0 and 1, which support
               | loading at offsets and such.
               | 
               | Both stack and register machines can support locals.
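The point that one expression tree can be lowered either way can be sketched as follows: the same tree is turned into stack code and into register code, and both evaluate to the same value. The instruction names and the two tiny interpreters are invented for illustration and do not correspond to any real VM.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}

# An expression is either a number (leaf) or a tuple (op, left, right).
tree = ("add", ("mul", 2.0, 3.0), ("mul", 4.0, 5.0))


def to_stack_code(expr):
    """Post-order walk: operands are pushed, then the operator consumes them."""
    if not isinstance(expr, tuple):
        return [("push", expr)]
    op, left, right = expr
    return to_stack_code(left) + to_stack_code(right) + [(op,)]


def run_stack(code):
    stack = []
    for ins in code:
        if ins[0] == "push":
            stack.append(ins[1])
        else:
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(OPS[ins[0]](lhs, rhs))
    return stack.pop()


def to_register_code(expr, code):
    """Same walk, but each result goes to a register numbered by instruction index."""
    if not isinstance(expr, tuple):
        code.append(("const", len(code), expr))
        return len(code) - 1
    op, left, right = expr
    lhs = to_register_code(left, code)
    rhs = to_register_code(right, code)
    code.append((op, len(code), lhs, rhs))
    return len(code) - 1


def run_registers(code):
    regs = {}
    for ins in code:
        if ins[0] == "const":
            regs[ins[1]] = ins[2]
        else:
            regs[ins[1]] = OPS[ins[0]](regs[ins[2]], regs[ins[3]])
    return regs[code[-1][1]]  # the last instruction holds the final result


stack_result = run_stack(to_stack_code(tree))
register_code = []
to_register_code(tree, register_code)
register_result = run_registers(register_code)
```

The stack version names no operands at all, while the register version names a destination and two sources in every arithmetic instruction; the tree itself does not care which one an engine picks.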
        
               | klodolph wrote:
               | Yes, that's exactly true. The "stack machine" here can be
               | seen as nothing more than a way of encoding the
               | expression tree.
        
               | klodolph wrote:
               | Java has instructions like dup, swap etc. To me, that is
               | the critical difference here, and where I draw the line
               | between "stack machine" and "register machine".
        
               | readittwice wrote:
               | I have to admit this line seems arbitrary to me. So WASM
               | is a register machine to you, but if they simply added
               | those two instructions it would suddenly become a stack
               | machine? Those instructions would actually be
               | trivial to add. I think those terms are relatively well
               | defined and when you argue that WASM is a register
               | machine even though the inventors explicitly claim it's a
               | stack machine you should have really good arguments for
               | that. Personally I would be surprised if you could point
               | me to any literature that supports your definition.
        
               | klodolph wrote:
               | Turing completeness sounds like an arbitrary distinction
               | to those outside of the field of CS, but it's not.
               | 
               | To me, the distinction here is that the stack machine in
               | WASM is restricted to the point that it corresponds 1:1
               | with an expression tree--not even a graph, just a tree.
               | This means that every function in WebAssembly can be
               | thought of as a collection of statements and
               | expressions, and the stack machine abstraction is nothing
               | more than a serialization format for the expressions.
               | 
               | Maybe dial it back a bit with the challenge to point at
               | literature. The literature has not really caught up with
               | the existence of WASM yet.
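One way to see the "serialization format for expressions" idea: a flat, operands-first instruction sequence can be folded back into a single nested expression using nothing but each instruction's arity. The `ARITY` table below covers only the four instructions used in the example, and `rebuild_tree` is a hypothetical helper, not part of any Wasm toolchain.

```python
# Number of stack operands each instruction consumes (only the ones used here).
ARITY = {"local.get": 0, "f32.load": 1, "f32.mul": 2, "f32.add": 2}


def rebuild_tree(instructions):
    """Fold a flat, postfix instruction list back into one nested expression."""
    stack = []
    for ins in instructions:
        n = ARITY[ins.split()[0]]
        cut = len(stack) - n
        args = stack[cut:]      # the n most recent sub-expressions
        del stack[cut:]
        stack.append((ins, *args))
    if len(stack) != 1:
        raise ValueError("a single expression must leave exactly one value")
    return stack[0]


flat = ["local.get 0", "f32.load", "local.get 1", "f32.load", "f32.mul"]
tree = rebuild_tree(flat)
```

Instructions such as `dup` or `swap`, or blocks returning multiple values, would break this one-to-one folding, which is the distinction being argued over in this subthread.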
        
               | Aardappel wrote:
               | The multi-value proposal breaks the ability to turn Wasm
               | into expressions easily, and thus makes it even more of a
               | stack machine than it already is. Dup and swap may still
               | be added in the future.
               | 
               | A defining feature of a register machine is that the
               | actual instruction encoding has direct references to
               | source and destination registers in it. Wasm doesn't have
               | those, it has explicit get_local instructions instead.
               | 
               | That said, if you turn off LLVM's WebAssemblyRegStackify
               | pass, all LLVM IR's values will end up in locals, with
               | little to no stack usage. Still no register machine, but
               | a bit more of a grey area :)
        
               | tedmielczarek wrote:
               | I don't think that's true. It was designed for its
               | purpose by people that knew what they were doing based on
               | experience gathered from implementing previous
               | incremental steps including emscripten's original
               | JavaScript output and asm.js (a formalization of some
               | techniques emscripten employed). The design rationale is
               | right there in the spec repository:
               | https://github.com/WebAssembly/design/blob/master/Rationale....
               | 
               | It was built to be a better target than asm.js for
               | compiled languages to run in JavaScript VMs and it seems
               | to have succeeded on that front. That it's not a perfect
               | fully-general bytecode format doesn't seem like a real
               | knock against it, that's a much harder problem to solve.
               | In any event it seems to be getting a lot more traction
               | than any previous entries in the space, which is
               | exciting!
        
             | kazinator wrote:
             | Nested, structured control can be broken into basic blocks
             | without flattening it into if/goto first. It sounds easier,
             | in fact.
             | 
             | In fact, the construction of a basic block from flat code
             | is a form of recovery of control structure. If the original
             | code had nested loops, nesting will emerge in the basic
             | block graph.
             | 
             | A recursive traversal of the original structure could
             | produce that graph more directly, without having to walk a
             | flat list of instructions asking questions like "is this
             | the target of a branch".
             | 
             | For instance, if we walk an if/then/else AST node, we can
             | recursively get the basic block graph for the test, the
             | then and else part, and then integrate that into a larger
             | basic block graph according to a rigid pattern.
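The if/then/else case described above can be sketched as a recursive walk that builds the basic block graph directly from the nested structure. The node shapes (`"stmt"`, `"seq"`, `"if"`) and the `CFG` class are assumptions for this example; loops and early exits are left out to keep it short.

```python
class CFG:
    def __init__(self):
        self.blocks = []   # blocks[i] is the list of statements in block i
        self.edges = []    # (from_block, to_block) pairs

    def new_block(self):
        self.blocks.append([])
        return len(self.blocks) - 1


def build(node, cfg, current):
    """Emit `node` starting in block `current`; return the block where control ends up."""
    kind = node[0]
    if kind == "stmt":
        cfg.blocks[current].append(node[1])
        return current
    if kind == "seq":
        for child in node[1:]:
            current = build(child, cfg, current)
        return current
    if kind == "if":
        _, cond, then_part, else_part = node
        cfg.blocks[current].append(f"branch on {cond}")
        then_entry = cfg.new_block()
        else_entry = cfg.new_block()
        join = cfg.new_block()
        cfg.edges += [(current, then_entry), (current, else_entry)]
        # Each arm is built recursively, then wired to the shared join block.
        cfg.edges.append((build(then_part, cfg, then_entry), join))
        cfg.edges.append((build(else_part, cfg, else_entry), join))
        return join
    raise ValueError(f"unknown node kind: {kind}")


program = (
    "seq",
    ("stmt", "x = 1"),
    ("if", "x > 0", ("stmt", "y = 1"), ("stmt", "y = 2")),
    ("stmt", "return y"),
)
cfg = CFG()
entry = cfg.new_block()
exit_block = build(program, cfg, entry)
```

No pass over a flat list is needed to discover branch targets here, because the nesting already says where each arm starts and where the arms rejoin.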
        
             | tedmielczarek wrote:
             | The latter is very achievable using the Relooper algorithm
             | developed by Alon Zakai and described in the Emscripten
             | paper: https://dl.acm.org/doi/10.1145/2048147.2048224
        
               | klodolph wrote:
               | Achievable, but extra front-end work that belongs in the
               | backend.
        
       | saagarjha wrote:
       | Ooh, this is nice! No more having to read wasm2wat's mildly
       | annoying format.
        
         | cjbprime wrote:
         | FWIW there's also wasm2c and wasm2js out there :)
        
           | saagarjha wrote:
           | There are, but they are also fairly annoying to read...
        
           | DonHopkins wrote:
           | I'd love to have wasm2assemblyscript!
           | 
           | AssemblyScript: A Subset of TypeScript That Compiles to
           | WebAssembly
           | 
           | https://news.ycombinator.com/item?id=15187961
           | 
           | https://github.com/AssemblyScript
           | 
           | https://github.com/AssemblyScript/assemblyscript
           | 
           | https://docs.assemblyscript.org/
        
       | Aardappel wrote:
       | I'm the author, if anyone has specific questions :)
        
       | _hardwaregeek wrote:
       | Loving the tooling around wasm getting better. I've been
       | debugging my compiler output with hexl-mode and reading the
       | binary format and while it's not _that_ bad, it'd be nice to do
       | more advanced debugging with a text format.
       | 
       | There was a project I saw too that intended to visualize
       | WebAssembly's execution. That'd be extremely helpful too.
        
         | cfallin wrote:
         | > reading the binary format ... it'd be nice to do more
         | advanced debugging with a text format.
         | 
         | Do you know about `wasm2wat` (from the WebAssembly binary
         | toolkit, "WABT")? It produces a 1-to-1 text representation of
         | the bytecode and is meant to always roundtrip via `wat2wasm`
         | back to the same bytecode.
        
           | _hardwaregeek wrote:
           | Yeah...I should probably use that. But does it work on
           | mangled WASM? Part of the issue was that my compiler wasn't
           | producing valid WASM
        
             | cfallin wrote:
             | Ah, no, probably doesn't do parsing recovery. But `wasm-
             | validate` from the same toolkit will at least tell you the
             | offset at which your wasm file has an error (I just flipped
             | some bits in a wasm file to test this), which may be
             | helpful!
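For a compiler author staring at bytes, a crude first check is possible even without WABT. The sketch below only verifies the 8-byte preamble (`\0asm` plus version 1) and walks the section headers, each a one-byte id followed by an unsigned LEB128 size, reporting the offset where that outer structure stops making sense. It is not a substitute for `wasm-validate`: it never looks inside a section, and the function names are made up for this example.

```python
def read_u32_leb128(data, pos):
    """Decode an unsigned LEB128 integer; return (value, next_position)."""
    value, shift = 0, 0
    while True:
        if pos >= len(data):
            raise ValueError(f"truncated LEB128 at offset {pos}")
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def walk_sections(data):
    """Return [(section_id, payload_offset, size)] or raise with the failing offset."""
    if data[:4] != b"\x00asm":
        raise ValueError("bad magic at offset 0")
    if data[4:8] != b"\x01\x00\x00\x00":
        raise ValueError("unsupported version at offset 4")
    pos, sections = 8, []
    while pos < len(data):
        start = pos
        section_id = data[pos]
        size, pos = read_u32_leb128(data, pos + 1)
        if pos + size > len(data):
            raise ValueError(
                f"section {section_id} at offset {start} overruns the file"
            )
        sections.append((section_id, pos, size))
        pos += size
    return sections


# Smallest interesting module: a type section (id 1) holding one () -> () signature.
module = b"\x00asm" + b"\x01\x00\x00\x00" + bytes([1, 4, 1, 0x60, 0, 0])
sections = walk_sections(module)
```

Chopping a byte off the end of `module` makes `walk_sections` raise and name offset 8, the start of the section whose declared size no longer fits.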
        
       ___________________________________________________________________
       (page generated 2020-04-28 23:00 UTC)