[HN Gopher] A First Look at the JIT
___________________________________________________________________

A First Look at the JIT

Author : lelf
Score  : 139 points
Date   : 2020-11-04 15:10 UTC (7 hours ago)

(HTM) web link (blog.erlang.org)
(TXT) w3m dump (blog.erlang.org)

| whizzter wrote:
| My take on it is that I like most of the design choices they've
| made (I've been tinkering on a similar dynamic runtime before).
|
| With a dynamic language like Erlang (as with JS, Lua, etc.) the
| main benefits won't come from the better register allocation, DCE,
| etc. that are the primary reason one would pick LLVM, but rather
| from just getting rid of the interpreter overhead (and later the
| opportunity to do some higher-level type-dependent optimizations
| that will have a higher impact than register allocation and
| instruction selection).
|
| That type-dependent stuff is why LLVM is ill suited IMHO, and why
| you only see LLVM mentioned at the highest optimization level by,
| for example, JavaScriptCore (Safari), when the runtime hasn't
| deoptimized code and is pretty certain of the types that will be
| used.
|
| They also mentioned the complexity of maintaining
| interpreter/JITed jumps, and I'm not surprised, since I remember
| reading a paper about one of their earlier attempts where they
| were maintaining dual stacks with quite complex cross-boundary
| jumps.
|
| Some might mention that V8 moved away from the always-JIT
| paradigm to save on memory and startup time, but since Erlang is
| mostly used in servers I think they might see this as a good
| compromise.
| chrisseaton wrote:
| > that will have a higher impact than register allocation and
| instruction selection
|
| The most powerful JIT compiler that I have worked with - Graal
| - generates code that often, on the face of it, seems puzzlingly
| inefficient in terms of registers allocated and instructions
| selected. Turns out maybe it's another thing that's not as
| important as all the textbooks say?
|
| The important bit is removing object allocation, removing
| boxing, deep inlining, and specialisation... when you've done
| all that work the exact registers and instructions don't seem
| to really make a huge difference.
| Rochus wrote:
| Are there any papers or articles about the mentioned
| findings?
| chrisseaton wrote:
| About what findings, sorry?
|
| Linear Scan is an example I reach for when I talk about
| what parts of the compiler are really important, if that's
| what you mean.
|
| http://web.cs.ucla.edu/~palsberg/course/cs132/linearscan.pdf
|
| > The linear scan algorithm is considerably faster than
| algorithms based on graph coloring, is simple to implement,
| and results in code that is almost as efficient as that
| obtained using more complex and time-consuming register
| allocators based on graph coloring.
|
| Also, things like escape analysis and inlining are often
| called 'the mothers of optimisation' because they
| fundamentally enable so many other optimisations. Not sure
| there's really a citation for it, but I doubt anyone would
| dispute it.
|
| I wrote about the impact on optimising Ruby.
|
| https://chrisseaton.com/truffleruby/pushing-pixels/
| Rochus wrote:
| > About what findings, sorry?
|
| These findings: _The important bit is removing object
| allocation, removing boxing, deep inlining, and
| specialisation... when you've done all that work the
| exact registers and instructions don't seem to really
| make a huge difference._
|
| > Linear Scan is an example I reach for when I talk about
| what parts of the compiler are really important
|
| Sure, but you just said that register allocation is not a
| relevant factor compared to other achievements when using
| GraalVM. Or did I get that wrong?
| chrisseaton wrote:
| > These findings:
|
| Right - didn't I cover those?
I gave the example that
| Graal is extremely powerful, with a huge amount of money
| and effort behind it, but doesn't really bother at all
| with serious register allocation or clever instruction
| selection as textbooks suggest you should. It uses
| the most basic algorithm and doesn't see any need to do
| any more, even when they're still keen on tiny
| improvements to the output. It just doesn't seem to be
| worth it.
|
| But it does put a lot of effort into object allocation,
| boxing, inlining, specialisation, etc. So those seem in
| practice to be worth it.
| Rochus wrote:
| > But it does put a lot of effort into object allocation,
| boxing, inlining, specialisation, etc. So those seem in
| practice to be worth it.
|
| Well, as I understand it, this is an assumption, not the
| result of a dedicated study. I made similar observations
| when using LuaJIT as a SOM backend (see
| https://github.com/rochus-keller/Som), but it's not clear
| why the Graal-based SOM implementations are that much
| faster.
| chrisseaton wrote:
| > Well, as I understand it, this is an assumption, not the
| result of a dedicated study.
|
| I'm not sure what you're angling for?
|
| Some kind of falsifiable proof about where it makes sense
| to put engineering effort and complexity? You're not
| going to find that, sorry - nobody's ever been able to
| seriously quantify those kinds of things.
|
| > I made similar observations
|
| Well then why are we arguing about it, if it's apparent to
| both of us?
| Rochus wrote:
| > I'm not sure what you're angling for?
|
| You assume that _The important bit is removing object
| allocation, removing boxing, deep inlining, and
| specialisation... when you've done all that work the
| exact registers and instructions don't seem to really
| make a huge difference._
|
| But it would be very interesting to have something like a
| scientific study about why Graal is indeed faster than
| other approaches.
|
| > Well then why are we arguing about it, if it's apparent
| to both of us?
|
| Because I would like to understand the _true_ reason, to
| be able to improve my implementation (if feasible).
|
| EDIT: You claim that textbooks about compiler design are
| wrong or not up to date, so my desire to have someone
| change that seems understandable, doesn't it?
| chrisseaton wrote:
| I don't think it's correct to say I'm just assuming.
|
| Linear Scan produces a program with apparently less
| efficient register allocation. In practice, it does not
| matter for the wider performance of the code. Is this not
| evidence to support the assumption that sophisticated
| register allocation does not matter as much as we
| thought?
|
| When you enable Graal's more sophisticated escape
| analysis algorithms you get real-world speed-ups in
| workloads such as Twitter's, worth hundreds of thousands of
| dollars in costs saved. Is this not evidence to support
| the assumption that sophisticated escape analysis
| algorithms do matter?
|
| The first is a formal scientific study. The second is not,
| but it's still a very large-scale empirical result
| measured by an expert. They aren't falsifiable, but as I
| said, I don't think that's a realistic expectation, and I
| think these are enough for it to be more than an
| assumption.
| Rochus wrote:
| > Is this not evidence to support the assumption that
| sophisticated register allocation does not matter as much
| as we thought?
|
| It's an indication, but it doesn't sufficiently support
| the conclusion. There are so many other things to
| consider.
|
| > Is this not evidence to support the assumption that
| sophisticated escape analysis algorithms do matter?
|
| Would you as a scientist seriously accept this as
| sufficient evidence for your claims?
|
| But let's leave it at that for the moment.
As far as I
| know, there are ongoing research projects which could
| deliver more insight into why specifically a Smalltalk VM
| runs faster on Graal than anywhere else.
| chrisseaton wrote:
| > Would you as a scientist seriously accept this as
| sufficient evidence for your claims?
|
| It was a comment on a web page, dude... I didn't claim it
| in a research paper for publication!
|
| If we discourage others from casually sharing their
| observations, as you're doing, we'll miss opportunities to
| find things to form hypotheses from in the first place!
| Casual discussion in the community is part of science,
| something to be encouraged, and you're sadly missing
| something by dismissing it like this.
| Rochus wrote:
| Ok, that sounds like a response to my initial question:
| _Are there any papers or articles about the mentioned
| findings?_
|
| Casually sharing observations is a good thing, and even
| better when there is some detailed information available
| which makes it possible to understand the propositions
| sufficiently well and to assess how certain the
| conclusions are.
| MaxBarraclough wrote:
| Could that be due to the cleverness of modern heavyweight
| CPUs, with techniques like register renaming? Would things
| change if you used less sophisticated processors?
| chrisseaton wrote:
| This is the explanation I get when I dig into these things,
| yes.
|
| For example, I'll see what seem to my less-experienced eyes
| some completely redundant 'mov' instructions shifting
| things around. I'll ask 'should we fix this?' and the
| response is usually 'doesn't really matter, it's basically
| free'.
| MaxBarraclough wrote:
| Interesting, so (to a rough approximation) the CPU is
| smart enough to boil off many 'shallow' inefficiencies.
| Doesn't cache behaviour still matter though? I'd have
| thought code-compactness would still be worth carefully
| optimising for.
| chrisseaton wrote:
| > Doesn't cache behaviour still matter though?
I'd have
| thought code-compactness would still be worth carefully
| optimising for.
|
| Well, that's partly why I'm still surprised, but I think
| for a lot of dynamic languages there's such a huge volume
| of code anyway... that these little 'mov's are pretty
| insignificant.
|
| Removing them would have a downside - extra compilation
| time.
| nickjj wrote:
| > Erlang is mostly used in servers I think they might see this
| as a good compromise.
|
| I'm not sure what the differences will be, but startup time is
| still a very big deal for web servers.
|
| For example, I can start up a non-Dockerized Python / gunicorn
| app with 8,000 lines of code and 50+ package dependencies in
| about 400ms on a $5 a month DigitalOcean server without doing
| any type of optimizations.
|
| For someone who just wants to deploy a project to 1 server, that
| means there's only 400ms of downtime when I restart my server.
|
| If a runtime takes, let's say, 10 seconds to start up, that could
| be enough of a deterrent to not use it if you know you're going
| down the route of a single-server deploy.
| dnautics wrote:
| > For someone who just wants to deploy a project to 1 server,
| that means there's only 400ms of downtime when I restart my
| server.
|
| Erlang already gives you robust primitives to do things like
| blue-green deploys, and even graceful transfer of server
| state, even across versioning changes. If downtime between
| releases is something you care about, you should use
| those, and it's likely that your downtime can be in the
| microseconds range regardless of the VM startup latency.
| cmrdporcupine wrote:
| I don't think we're talking "start time" so much as "warm-up
| time"; if I understand these things correctly, it's likely the
| VM would start almost immediately, but would take a little
| bit of time for hot code paths to JIT and become highly
| performant. I don't think that would be much of a concern in
| your example.
| lawik wrote:
| From my conversation with them on Elixir Mix, I don't
| believe warm-up is the issue. In this case the JIT is
| simpler than that. But the BEAM VM isn't the quickest
| starter in the world.
| moonchild wrote:
| Recommend changing the title to '...Erlang's JIT' or similar
| (though perhaps the 'erlang.org' domain provides enough context?)
| crazypython wrote:
| How is this a JIT? Everything is compiled to machine code.
| There's no advanced type specialization or lazy deoptimization.
| It might as well be an in-memory, lazy, AOT compiler.
| chrisseaton wrote:
| > Everything is compiled to machine code.
|
| Yes... but it does that just-in-time, which is what JIT stands
| for.
|
| > There's no advanced type specialization or lazy
| deoptimization.
|
| I don't think those things are required for it to be a JIT,
| are they? I'd say it's a static JIT, rather than a dynamic,
| profiling, and specialising JIT, but both are JITs.
|
| > It might as well be an in-memory, lazy, AOT compiler.
|
| Well, yeah... that describes a JIT.
| rurban wrote:
| A simple method JIT is also a good JIT, even without escape
| analysis and advanced optimizations. It avoids the opcode
| dispatch branching and enables code caching.
| yxhuvud wrote:
| > It might as well be an in-memory, lazy
|
| Yeah, that is a JIT. Perhaps not a very fancy one, the same way
| ARC is a very unfancy GC, but it is still a JIT.
| haberman wrote:
| > Data may only be kept (passed) in BEAM registers between
| instructions.
|
| > This may seem silly, aren't machine registers faster?
|
| > Yes, but in practice not by much and it would make things more
| complicated.
|
| My understanding is that the latency of an L1 cache load is 5
| cycles on recent Intel processors.
|
| In tight code that is a really substantial overhead. I get that
| register allocation is complicated, and I totally see the benefit
| of simplicity, but it's hard not to think that this will
| significantly limit performance.
| didibus wrote:
| It does limit performance, but not in a relevant way for the
| type of applications BEAM is designed for.
| CJefferson wrote:
| Often, if your function is that small, inlining it is a better
| idea anyway?
| Rochus wrote:
| Interesting. Here is also a podcast about the topic:
| https://thinkingelixir.com/podcast-episodes/017-jit-compiler...
|
| Had a brief look at asmjit; it seems it only supports x86 and
| x86_64 and is not really an abstraction (i.e. a platform-
| independent IR). I will try to find out why they didn't use e.g.
| LLVM or sljit.
|
| EDIT: According to this article
| (https://www.erlang-solutions.com/blog/performance-testing-th...)
| the speed-up factor from the JIT is about 1.3 to 2.3 (as a
| comparison, the speed-up between the PUC Lua 5.1 interpreter and
| LuaJIT 2.0 is about a factor of 15 in geometric mean over all
| benchmarks, see http://luajit.org/performance_x86.html).
| dnautics wrote:
| Also keep in mind that the Lua speedup is interpreted vs.
| compiled, and Erlang is already a compiled language that had a
| fairly performant intermediate bytecode.
| [deleted]
| Rochus wrote:
| But the bytecode is interpreted, isn't it? The PUC Lua VM is
| a bytecode interpreter as well.
| alberth wrote:
| Re: "it seems it only supports x86 and x86_64"
|
| Probably because it uses DynASM, and I believe only those
| platforms are supported.
|
| https://luajit.org/dynasm.html
| Rochus wrote:
| > Probably because it uses DynASM
|
| I had a look at
| https://github.com/asmjit/asmjit/tree/master/src/asmjit and
| didn't find any indication that DynASM is used. Anyway,
| DynASM and LuaJIT are available for many different
| architectures, not only x86 and x86_64.
| alberth wrote:
| This link has more detail:
|
| https://github.com/erlang/otp/pull/2745#issuecomment-6914821...
|
| "
|
| LLVM is much slower at generating code when compared to
| asmjit. LLVM can do a lot more, but its main purpose is
| not to be a JIT compiler.
|
| With asmjit we get full control over all the register
| allocation and can do a lot of simplifications when
| generating code.
|
| On the downside, we don't get any of LLVM's built-in
| optimizations.
|
| We also considered using DynASM, but found the tooling that
| comes with asmjit to be better.
|
| "
| Rochus wrote:
| This quote from your reference gives the answer: _We also
| considered using DynASM, but found the tooling that comes
| with asmjit to be better._
|
| EDIT: But it still gives no answer as to why they preferred
| asmjit over solutions other than LLVM or DynASM (the latter
| is just an assembler which helps to integrate with C code,
| not a platform abstraction like LLVM or others).
| tachyonbeam wrote:
| LLVM is much too heavyweight for a JIT. It's slow at generating
| code, and it makes it difficult to implement common JIT
| optimizations such as inline caches, which rely on code
| patching.
| Rochus wrote:
| Yes, LLVM is huge, but there are a couple of interesting
| alternatives such as sljit, see e.g.
| https://github.com/wdv4758h/awesome-jit
|
| Even LuaJIT can be used as a backend for languages other than
| Lua, see e.g. https://github.com/rochus-keller/oberon/.
| kzrdude wrote:
| Makes me wonder if Python implemented in Lua wouldn't just
| win over CPython.
| Rochus wrote:
| If using LuaJIT it would be faster than CPython for sure,
| maybe even as fast as PyPy. Here are some measurement
| results comparing the C++ SOM interpreter with a LuaJIT
| bytecode compiler:
| https://github.com/rochus-keller/Som#a-som-to-luajit-bytecod...
| The bytecode version of the SOM benchmark running on LuaJIT
| is about a factor of 24 faster than the C++-based
| interpreter. But the Graal implementation of SOM is even
| faster.
| saurik wrote:
| (Which is, of course, ironic given that the entire original
| goal of LLVM was only to build a JIT; it was too slow, but
| made an OK enough compiler backend ;P.)
| kzrdude wrote:
| With this approach, it shouldn't be too far off to experiment
| with ahead-of-time compilation?
| bcardarella wrote:
| The Lumen project (https://github.com/lumen/lumen) is building
| a BEAM-bytecode-compatible AOT compiler. Its primary
| compilation target is WASM/WASI. Still under very heavy
| development (and some things are waiting on updates to the
| WASM spec to land).
| whizzter wrote:
| I think Erlang-to-C compilers have existed in the past, and new
| ones pop up every now and then. In general, though, AOT is in
| many ways considered a dead end unless you have a particular
| application for it (even if that was my uni thesis subject).
|
| You can look back all the way to a 1995 paper (1) (with people
| who then worked on early Java JITs) that did a comparison of
| JIT vs. AOT compilation for Self. While good AOT is feasible
| in many ways, there are always situations in languages not
| initially designed for AOT where a JIT won't be forced to
| go with the conservative estimates that hamper performance.
|
| (1) https://www.cs.ucsb.edu/research/tech-reports/1995-04
___________________________________________________________________
(page generated 2020-11-04 23:00 UTC)