[HN Gopher] A First Look at the JIT
       ___________________________________________________________________
        
       A First Look at the JIT
        
       Author : lelf
       Score  : 139 points
       Date   : 2020-11-04 15:10 UTC (7 hours ago)
        
 (HTM) web link (blog.erlang.org)
 (TXT) w3m dump (blog.erlang.org)
        
       | whizzter wrote:
        | My take on it is that I like most of the design choices
        | they've made (I've been tinkering with a similar dynamic
        | runtime before).
       | 
        | With a dynamic language like Erlang (as with JS, Lua, etc.)
        | the main benefits won't come from better register allocation,
        | DCE, and so on (the primary reasons one would pick LLVM), but
        | rather from just getting rid of the interpreter overhead (and
        | later from the opportunity to do some higher-level
        | type-dependent optimizations that will have a higher impact
        | than register allocation and instruction selection).
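        | 
        | To make that concrete, here is a minimal sketch with a
        | hypothetical stack bytecode (nothing BEAM-specific); the win
        | comes from deleting the dispatch loop itself:
        | 
        |   #include <cstdint>
        |   #include <vector>
        | 
        |   enum Op : uint8_t { PUSH, ADD, HALT };
        | 
        |   // Every instruction pays for the dispatch branch;
        |   // that branch is what a template JIT removes.
        |   int64_t interpret(const std::vector<uint8_t>& c) {
        |       int64_t st[64]; int sp = 0; size_t pc = 0;
        |       for (;;) {
        |           switch (c[pc++]) {      // dispatch on every op
        |           case PUSH: st[sp++] = c[pc++]; break;
        |           case ADD:  --sp; st[sp-1] += st[sp]; break;
        |           case HALT: return st[sp-1];
        |           }
        |       }
        |   }
        |   // A JIT given "PUSH 1; PUSH 2; ADD; HALT" instead emits
        |   // straight-line machine code equivalent to:
        |   //   int64_t f() { return 1 + 2; }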
       | 
        | That type-dependent stuff is why LLVM is ill suited IMHO, and
        | why you only see LLVM mentioned as the highest optimization
        | tier in, for example, JavaScriptCore (Safari), where it only
        | runs when the runtime hasn't deoptimized the code and is
        | pretty certain of the types that will be used.
       | 
        | Also, they mentioned the complexity of maintaining
        | interpreter/JITed jumps, and I'm not surprised, since I
        | remember reading a paper about one of their earlier attempts
        | where they were maintaining dual stacks with quite complex
        | cross-boundary jumps.
       | 
        | Some might mention that V8 moved away from the always-JIT
        | paradigm to save on memory and startup time, but since Erlang
        | is mostly used on servers I think they might see this as a
        | good compromise.
        
         | chrisseaton wrote:
         | > that will have a higher impact than register allocation and
         | instruction selection
         | 
          | The most powerful JIT compiler that I have worked with -
          | Graal - generates code that often, on the face of it, seems
          | puzzlingly inefficient in terms of registers allocated and
          | instructions selected. Turns out maybe it's another thing
          | that's not as important as all the textbooks say?
         | 
          | The important bits are removing object allocation, removing
          | boxing, deep inlining, and specialisation... when you've
          | done all that work, the exact registers and instructions
          | don't seem to really make a huge difference.
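          | 
          | Roughly what that means in practice (a hypothetical
          | before/after sketch, not actual Graal output):
          | 
          |   // Before: a dynamic-language add conceptually
          |   // heap-boxes every value.
          |   struct Box { long v; };
          | 
          |   long add_boxed(long a, long b) {
          |       Box* x = new Box{a};            // allocation
          |       Box* y = new Box{b};            // allocation
          |       Box* r = new Box{x->v + y->v};  // allocation
          |       long out = r->v;                // unbox
          |       delete x; delete y; delete r;
          |       return out;
          |   }
          | 
          |   // Escape analysis proves the boxes never escape, so
          |   // scalar replacement reduces the whole thing to:
          |   long add(long a, long b) { return a + b; }
          | 
          | Once the compiler gets that far, which exact registers hold
          | a and b hardly moves the needle.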
        
           | Rochus wrote:
           | Are there any papers or articles about the mentioned
           | findings?
        
             | chrisseaton wrote:
             | About what findings, sorry?
             | 
             | Linear Scan is an example I reach for when I talk about
             | what parts of the compiler are really important, if that's
             | what you mean.
             | 
              | http://web.cs.ucla.edu/~palsberg/course/cs132/linearscan.pdf
             | 
             | > The linear scan algorithm is considerably faster than
             | algorithms based on graph coloring, is simple to implement,
             | and results in code that is almost as efficient as that
             | obtained using more complex and time-consuming register
             | allocators based on graph coloring.
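              | 
              | For a feel of how simple it is, a minimal sketch of the
              | idea (with a naive spill choice; the paper spills the
              | interval with the furthest end point):
              | 
              |   #include <algorithm>
              |   #include <vector>
              | 
              |   struct Ival { int start, end, reg; };
              | 
              |   // Walk intervals sorted by start; free the
              |   // registers of intervals that have ended;
              |   // spill when no register is available.
              |   void linearScan(std::vector<Ival>& ivs, int k) {
              |       std::sort(ivs.begin(), ivs.end(),
              |           [](const Ival& a, const Ival& b)
              |           { return a.start < b.start; });
              |       std::vector<Ival*> active;
              |       std::vector<int> avail;
              |       for (int r = 0; r < k; ++r)
              |           avail.push_back(r);
              |       for (Ival& i : ivs) {
              |           auto it = active.begin();
              |           while (it != active.end()) {
              |               if ((*it)->end < i.start) {
              |                   avail.push_back((*it)->reg);
              |                   it = active.erase(it);
              |               } else ++it;
              |           }
              |           if (avail.empty()) { i.reg = -1; continue; }
              |           i.reg = avail.back(); avail.pop_back();
              |           active.push_back(&i);
              |       }
              |   }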
             | 
              | Also things like escape analysis and inlining are often
              | called 'the mothers of optimisation' because they
              | fundamentally enable so many other optimisations. Not
              | sure there's really a citation for it, but I doubt
              | anyone would dispute it.
             | 
             | I wrote about the impact on optimising Ruby.
             | 
             | https://chrisseaton.com/truffleruby/pushing-pixels/
        
               | Rochus wrote:
               | > About what findings, sorry?
               | 
                | These findings: _The important bits are removing
                | object allocation, removing boxing, deep inlining,
                | and specialisation... when you've done all that work,
                | the exact registers and instructions don't seem to
                | really make a huge difference._
               | 
               | > Linear Scan is an example I reach for when I talk about
               | what parts of the compiler are really important
               | 
                | Sure, but you just said that register allocation is
                | not a relevant factor compared to other achievements
                | when using Graal VM. Or did I get that wrong?
        
               | chrisseaton wrote:
               | > These findings:
               | 
                | Right - didn't I cover those? I gave the example that
                | Graal is extremely powerful, with a huge amount of
                | money and effort behind it, but doesn't really bother
                | at all with serious register allocation or clever
                | instruction selection the way textbooks suggest you
                | should. It uses the most basic algorithm and sees no
                | need to do any more, even though the team is still
                | keen on tiny improvements to the output. It just
                | doesn't seem to be worth it.
               | 
               | But it does put a lot of effort into object allocation,
               | boxing, inlining, specialisation, etc. So those seem in
               | practice to be worth it.
        
               | Rochus wrote:
               | > But it does put a lot of effort into object allocation,
               | boxing, inlining, specialisation, etc. So those seem in
               | practice to be worth it.
               | 
                | Well, as I understand it, this is an assumption, not
                | the result of a dedicated study. I made similar
                | observations when using LuaJIT as a SOM backend (see
                | https://github.com/rochus-keller/Som), but it's not
                | clear why the Graal-based SOM implementations are
                | that much faster.
        
               | chrisseaton wrote:
                | > Well, as I understand it, this is an assumption,
                | not the result of a dedicated study.
               | 
               | I'm not sure what you're angling for?
               | 
                | Some kind of falsifiable proof about where it makes
                | sense to put engineering effort and complexity?
                | You're not going to find that, sorry; nobody's ever
                | been able to seriously quantify those kinds of
                | things.
               | 
               | > I made similar observations
               | 
               | Well then why are we arguing about it if it's apparent to
               | both of us?
        
               | Rochus wrote:
               | > I'm not sure what you're angling for?
               | 
                | You assume that _The important bits are removing
                | object allocation, removing boxing, deep inlining,
                | and specialisation... when you've done all that work,
                | the exact registers and instructions don't seem to
                | really make a huge difference._
               | 
                | But it would be very interesting to have something
                | like a scientific study about why Graal is indeed
                | faster than other approaches.
               | 
               | > Well then why are we arguing about it if it's apparent
               | to both of us?
               | 
               | Because I would like to understand the _true_ reason to
               | be able to improve my implementation (if feasible).
               | 
                | EDIT: you claim that textbooks about compiler design
                | are wrong or out of date, so my desire to have
                | someone substantiate that claim seems understandable,
                | doesn't it?
        
               | chrisseaton wrote:
               | I don't think it's correct to say I'm just assuming.
               | 
               | Linear Scan produces a program with apparently less
               | efficient register allocation. In practice, it does not
               | matter for the wider performance of the code. Is this not
               | evidence to support the assumption that sophisticated
               | register allocation does not matter as much as we
               | thought?
               | 
               | When you enable Graal's more sophisticated escape
               | analysis algorithms you get real-world speed ups in
               | workloads such as Twitter, worth hundreds of thousands of
               | dollars in costs saved. Is this not evidence to support
               | the assumption that sophisticated escape analysis
               | algorithms do matter?
               | 
               | The first is a formal scientific study. The second is not
               | but it's still a very large-scale empirical result
               | measured by an expert. They aren't falsifiable but as I
               | said I don't think that's a realistic expectation, and I
               | think these are enough for it to be more than an
               | assumption.
        
               | Rochus wrote:
               | > Is this not evidence to support the assumption that
               | sophisticated register allocation does not matter as much
               | as we thought?
               | 
               | It's an indication but it doesn't sufficiently support
               | the conclusion. There are so many other things to
               | consider.
               | 
               | > Is this not evidence to support the assumption that
               | sophisticated escape analysis algorithms do matter?
               | 
                | Would you as a scientist seriously accept this as
                | sufficient evidence for your claims?
               | 
                | But let's leave it at that for the moment. As far as
                | I know there are ongoing research projects which
                | could deliver more insight into why specifically a
                | Smalltalk VM runs faster on Graal than anywhere else.
        
               | chrisseaton wrote:
                | > Would you as a scientist seriously accept this as
                | sufficient evidence for your claims?
               | 
               | It was a comment on a web page dude... I didn't claim it
               | in a research paper for publication!
               | 
                | If we discourage others from casually sharing their
                | observations, as you're doing, we'll miss
                | opportunities to find things to form hypotheses from
                | in the first place! Casual discussion in the
                | community is part of science, something to be
                | encouraged, and you're sadly missing something by
                | dismissing it like this.
        
               | Rochus wrote:
               | Ok, that sounds like a response to my initial question
               | _Are there any papers or articles about the mentioned
               | findings?_
               | 
                | Casually sharing observations is a good thing, and it
                | is even better when some detailed information is
                | available which makes it possible to understand the
                | propositions sufficiently well and to assess how
                | certain the conclusions are.
        
           | MaxBarraclough wrote:
           | Could that be due to the cleverness of modern heavyweight
           | CPUs, with techniques like register renaming? Would things
           | change if you used less sophisticated processors?
        
             | chrisseaton wrote:
             | This is the explanation I get when I dig into these things,
             | yes.
             | 
              | For example, I'll see what seem to my less-experienced
              | eyes to be some completely redundant 'mov' instructions
              | shifting things around. I'll ask 'should we fix this?'
              | and the response is usually 'doesn't really matter,
              | it's basically free'.
        
               | MaxBarraclough wrote:
               | Interesting, so (to a rough approximation) the CPU is
               | smart enough to boil off many 'shallow' inefficiencies.
               | Doesn't cache behaviour still matter though? I'd have
               | thought code-compactness would still be worth carefully
               | optimising for.
        
               | chrisseaton wrote:
               | > Doesn't cache behaviour still matter though? I'd have
               | thought code-compactness would still be worth carefully
               | optimising for.
               | 
               | Well that's partly why I'm still surprised, but I think
               | for a lot of dynamic languages there's such a huge volume
               | of code anyway... that these little 'mov's are pretty
               | insignificant.
               | 
                | Removing them would have a downside: extra
                | compilation time.
        
         | nickjj wrote:
          | > Erlang is mostly used on servers I think they might see
          | this as a good compromise.
         | 
          | I'm not sure what the differences will be, but startup time
          | is still a very big deal for web servers.
         | 
          | For example, I can start up a non-Dockerized Python /
          | gunicorn app with 8,000 lines of code and 50+ package
          | dependencies in about 400ms on a $5-a-month DigitalOcean
          | server, without doing any optimizations.
         | 
          | For someone who just wants to deploy a project to one
          | server, that means there's only 400ms of downtime when I
          | restart my server.
         | 
          | If a runtime takes, let's say, 10 seconds to start up, that
          | could be enough of a deterrent to not use it if you know
          | you're going down the route of a single-server deploy.
        
           | dnautics wrote:
            | > For someone who just wants to deploy a project to one
            | server, that means there's only 400ms of downtime when I
            | restart my server.
           | 
            | Erlang already gives you robust primitives for things
            | like blue-green deploys, and even graceful transfer of
            | server state across versioning changes. If downtime
            | between releases is something you care about, you should
            | use those, and it's likely that your downtime can be in
            | the microseconds range regardless of the VM startup
            | latency.
        
           | cmrdporcupine wrote:
           | I don't think we're talking "start time" so much as "warm up
           | time"; if I understand these things correctly it's likely the
           | VM would start almost immediately, but would take a little
           | bit of time for hot code paths to JIT and become highly
           | performant. I don't think that would be much of a concern in
           | your example.
        
             | lawik wrote:
             | From my conversation with them on Elixir Mix I don't
             | believe warm-up is the issue. In this case the JIT is
             | simpler than that. But the BEAM VM isn't the quickest
             | starter in the world.
        
       | moonchild wrote:
       | Recommend changing the title to '...Erlang's JIT' or similar
       | (though perhaps the 'erlang.org' domain provides enough context?)
        
       | crazypython wrote:
       | How is this a JIT? Everything is compiled to machine code.
       | There's no advanced type specialization or lazy deoptimization.
       | It might as well be an in-memory, lazy, AOT compiler.
        
         | chrisseaton wrote:
         | > Everything is compiled to machine code.
         | 
         | Yes... but it does that just-in-time, which is what JIT stands
         | for.
         | 
         | > There's no advanced type specialization or lazy
         | deoptimization.
         | 
         | I don't think those things are required for it to be a JIT are
         | they? I'd say it's a static JIT, rather than a dynamic,
         | profiling, and specialising JIT, but both are JITs.
         | 
         | > It might as well be an in-memory, lazy, AOT compiler.
         | 
         | Well yeah... that describes a JIT.
        
         | rurban wrote:
         | A simple method jit is also a good jit, even without escape
         | analysis and advanced optimizations. It avoids the op branching
         | and enables code caching.
        
         | yxhuvud wrote:
         | > It might as well be an in-memory, lazy
         | 
         | Yeah, that is a jit. Perhaps not a very fancy one, the same way
         | ARC is a very unfancy GC, but it is still a jit.
        
       | haberman wrote:
       | > Data may only be kept (passed) in BEAM registers between
       | instructions.
       | 
       | > This may seem silly, aren't machine registers faster?
       | 
       | > Yes, but in practice not by much and it would make things more
       | complicated.
       | 
       | My understanding is that the latency of an L1 cache load is 5
       | cycles on recent Intel processors.
       | 
       | In tight code that is a really substantial overhead. I get that
       | register allocation is complicated, and I totally see the benefit
       | of simplicity, but it's hard not to think that this will
       | significantly limit performance.
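        | 
        | To make the cost concrete, here is the scheme as I read it,
        | sketched in C++ (hypothetical, not the actual BEAM sources):
        | 
        |   #include <cstdint>
        | 
        |   struct Process {
        |       int64_t x[1024];  // BEAM X registers live in memory
        |   };
        | 
        |   // JITed code for "x0 = x0 + x1": operands are loaded
        |   // from and stored back to the array between
        |   // instructions.
        |   void add_x0_x1(Process* p) {
        |       int64_t a = p->x[0];  // L1 load, ~5 cycles latency
        |       int64_t b = p->x[1];  // (the two loads can overlap)
        |       p->x[0] = a + b;      // store back; the next
        |   }                         // instruction reloads x0
        | 
        | Keeping x0 pinned in a machine register across instructions
        | would avoid the reload, at the price of a much more
        | complicated code generator.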
        
         | didibus wrote:
          | It does limit performance, but not in a relevant way for
          | the type of applications BEAM is designed for.
        
         | CJefferson wrote:
         | Often, if your function is that small, inlining it is a better
         | idea anyway?
        
       | Rochus wrote:
       | Interesting. Here is also a podcast about the topic:
       | https://thinkingelixir.com/podcast-episodes/017-jit-compiler...
       | 
        | I had a brief look at asmjit; it seems it only supports x86
        | and x86_64 and is not really an abstraction layer (i.e.
        | there is no platform-independent IR). I will try to find out
        | why they didn't use e.g. LLVM or sljit.
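        | 
        | For reference, here is roughly the hello-world of asmjit
        | (x86_64; API details from memory, so treat it as a sketch):
        | you emit concrete x86 instructions, with no portable IR layer
        | in between:
        | 
        |   #include <asmjit/x86.h>
        | 
        |   int main() {
        |       asmjit::JitRuntime rt;  // owns executable memory
        |       asmjit::CodeHolder code;
        |       code.init(rt.environment());
        |       asmjit::x86::Assembler a(&code);
        |       a.mov(asmjit::x86::eax, 1);  // return 1
        |       a.ret();
        |       int (*fn)(void) = nullptr;
        |       rt.add(&fn, &code);  // relocate + make executable
        |       return fn();         // == 1
        |   }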
       | 
        | EDIT: according to this article
        | (https://www.erlang-solutions.com/blog/performance-testing-th...)
        | the speed-up factor achieved by the JIT is about 1.3 to 2.3
        | (for comparison, the speed-up between the PUC Lua 5.1
        | interpreter and LuaJIT 2.0 is about a factor of 15 in
        | geometric mean over all benchmarks, see
        | http://luajit.org/performance_x86.html).
        
         | dnautics wrote:
          | Also keep in mind that the Lua speedup is interpreted vs.
          | compiled, whereas Erlang is already a compiled language
          | with a fairly performant intermediate bytecode.
        
           | [deleted]
        
           | Rochus wrote:
           | But the bytecode is interpreted, isn't it? The PUC Lua VM is
           | a bytecode interpreter as well.
        
         | alberth wrote:
          | Re: "it seems it only supports x86 and x86_64"
         | 
         | Probably because it uses DynASM and I believe only those
         | platforms are supported.
         | 
         | https://luajit.org/dynasm.html
        
           | Rochus wrote:
           | > Probably because it uses DynASM
           | 
           | I had a look at
           | https://github.com/asmjit/asmjit/tree/master/src/asmjit and
           | didn't find any indication that DynASM is used. Anyway,
           | DynASM and LuaJIT are available for many different
           | architectures, not only x86 and x86_64.
        
             | alberth wrote:
             | This link has more detail
             | 
              | https://github.com/erlang/otp/pull/2745#issuecomment-6914821...
             | 
             | "
             | 
              | LLVM is much slower at generating code when compared to
              | asmjit. LLVM can do a lot more, but its main purpose is
              | not to be a JIT compiler.
             | 
             | With asmjit we get full control over all the register
             | allocation and can do a lot of simplifications when
             | generating code.
             | 
             | On the downside we don't get any of LLVMs built-in
             | optimizations.
             | 
             | We also considered using dynasm, but found the tooling that
             | comes with asmjit to be better.
             | 
             | "
        
               | Rochus wrote:
               | This quote from your reference gives the answer: _We also
               | considered using dynasm, but found the tooling that comes
               | with asmjit to be better._
               | 
               | EDIT: But it still gives no answer why they preferred
               | asmjit in favour of other solutions than LLVM or DynASM
               | (where the latter is just an assembler which helps to
               | integrate with C code, not a platform abstraction like
               | LLVM or others).
        
         | tachyonbeam wrote:
          | LLVM is much too heavyweight for a JIT. It's slow at
          | generating code, and it makes it difficult to implement
          | common JIT optimizations such as inline caches, which rely
          | on code patching.
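          | 
          | Roughly the idea, sketched as plain C++ state rather than
          | actually patched machine code (hypothetical, not any
          | particular VM's implementation):
          | 
          |   #include <cstdint>
          |   #include <string>
          |   #include <unordered_map>
          | 
          |   using Method = int64_t (*)(void* self);
          | 
          |   struct Class {  // slow-path method table
          |       std::unordered_map<std::string, Method> vtable;
          |   };
          | 
          |   struct CallSite {  // in a real JIT this state is
          |       Class* cachedClass = nullptr;   // patched right
          |       Method cachedMethod = nullptr;  // into the code
          |   };
          | 
          |   int64_t call(CallSite& s, Class* cls, void* self) {
          |       if (cls == s.cachedClass)  // fast path: 1 compare
          |           return s.cachedMethod(self);
          |       s.cachedClass = cls;       // miss: slow lookup,
          |       s.cachedMethod = cls->vtable.at("run");  // "patch"
          |       return s.cachedMethod(self);
          |   }
          | 
          | LLVM compiles whole functions through its pipeline; there
          | is no cheap way to rewrite a single call site in place like
          | this after the fact.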
        
           | Rochus wrote:
           | Yes, LLVM is huge, but there are a couple of interesting
           | alternatives such as sljit, see e.g.
           | https://github.com/wdv4758h/awesome-jit
           | 
           | Even LuaJIT can be used as a backend for other languages than
           | Lua, see e.g. https://github.com/rochus-keller/oberon/.
        
             | kzrdude wrote:
              | Makes me wonder if a Python implemented in Lua wouldn't
              | just win over CPython.
        
               | Rochus wrote:
                | If using LuaJIT, it would be faster than CPython for
                | sure, maybe even as fast as PyPy. Here are some
                | measurement results comparing the C++ SOM interpreter
                | with a LuaJIT bytecode compiler:
                | https://github.com/rochus-keller/Som#a-som-to-luajit-bytecod...
                | The bytecode version of the SOM benchmark running on
                | LuaJIT is about a factor of 24 faster than the C++
                | based interpreter. But the Graal implementation of
                | SOM is even faster.
        
           | saurik wrote:
           | (Which is, of course, ironic given that the entire original
           | goal of LLVM was only to build a JIT; it was too slow, but
           | made an OK enough compiler backend ;P.)
        
       | kzrdude wrote:
        | With this approach, it shouldn't be too far off to experiment
        | with ahead-of-time compilation?
        
         | bcardarella wrote:
         | The Lumen project (https://github.com/lumen/lumen) is building
         | a BEAM bytecode compatible AOT compiler. Its primary
         | compilation target is WASM/WASI. Still under very heavy
         | development. (and some things waiting on updates to WASM spec
         | to land)
        
         | whizzter wrote:
          | I think Erlang-to-C compilers have existed in the past, and
          | new ones pop up every now and then. In general, though, AOT
          | is considered by many to be a dead end unless you have a
          | specific application for it (even though that was my uni
          | thesis subject).
          | 
          | You can look back all the way to a 1995 paper (1) (by
          | people who then worked on early Java JITs) that compared
          | JIT vs. AOT compilation for Self. While good AOT is
          | feasible in many ways, there are always situations in
          | languages not initially designed for AOT where an AOT
          | compiler is forced into conservative estimates that hamper
          | performance, while a JIT is not; see the sketch after the
          | reference below.
         | 
         | (1) https://www.cs.ucsb.edu/research/tech-reports/1995-04
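          | 
          | A sketch of the conservative-estimates point (hypothetical
          | guard-and-deoptimize code, not taken from the paper):
          | 
          |   #include <cstdint>
          | 
          |   enum class Tag { Int, Other };
          |   struct Value { Tag tag; int64_t i; };
          | 
          |   // An AOT compiler must keep the generic path wherever
          |   // it cannot prove the types. A JIT that has only ever
          |   // seen Ints here emits the fast path behind a cheap
          |   // guard and bails out if the assumption breaks:
          |   int64_t add_int_fast(Value a, Value b, bool& deopt) {
          |       if (a.tag != Tag::Int || b.tag != Tag::Int) {
          |           deopt = true;  // fall back to the slow path
          |           return 0;
          |       }
          |       return a.i + b.i;  // no tag checks downstream
          |   }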
        
       ___________________________________________________________________
       (page generated 2020-11-04 23:00 UTC)