[HN Gopher] Why is Rosetta 2 fast?
       ___________________________________________________________________
        
       Why is Rosetta 2 fast?
        
       Author : pantalaimon
       Score  : 443 points
       Date   : 2022-11-09 15:40 UTC (7 hours ago)
        
 (HTM) web link (dougallj.wordpress.com)
 (TXT) w3m dump (dougallj.wordpress.com)
        
       | lunixbochs wrote:
        | > To see ahead-of-time translated Rosetta code, I believe I had
        | to disable SIP, compile a new x86 binary, give it a unique name,
        | run it, and then run otool -tv /var/db/oah/*/*/unique-name.aot
        | (or use your tool of choice - it's just a Mach-O binary). This
        | was done on an old version of macOS, so things may have changed
        | and improved since then.
       | 
       | My aotool project uses a trick to extract the AOT binary without
       | root or disabling SIP:
       | https://github.com/lunixbochs/meta/tree/master/utils/aotool
        
       | karmakaze wrote:
       | Vertical integration. My understanding was it's because the Apple
       | silicon ARM has special support to make it fast. Apple has had
       | enough experience to know that some hardware support can go a
       | long way to making the binary emulation situation better.
        
         | saagarjha wrote:
          | That's not correct; the article goes into detail about why.
        
           | nwallin wrote:
            | That _is_ correct; the article goes into detail about why.
            | See the "Apple's Secret Extension" section as well as the
            | "Total Store Ordering" section.
           | 
           | The "Apple's Secret Extension" section talks about how the M1
           | has 4 flag bits and the x86 has 6 flag bits, and how
           | emulating those 2 extra flags would make every add/sub/cmp
           | instruction significantly slower. Apple has an undocumented
           | extension that adds 2 more flag bits to make the M1's flag
           | bits behave the same as x86.
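            | 
            | To see why that matters, here's a rough C sketch (purely
            | illustrative, not Rosetta's code) of the extra work a
            | translator would otherwise need after every add/sub/cmp to
            | recover just the two flags ARM lacks, x86's PF and AF:
            | 
            |     #include <stdint.h>
            | 
            |     /* PF: set when the result's low byte has an
            |        even number of 1 bits */
            |     static inline uint8_t x86_pf(uint8_t result) {
            |         return (uint8_t)(~__builtin_popcount(result) & 1);
            |     }
            | 
            |     /* AF: the carry out of bit 3 of an addition
            |        (used by the BCD adjustment instructions) */
            |     static inline uint8_t x86_af(uint8_t a, uint8_t b) {
            |         return (uint8_t)(((a & 0xF) + (b & 0xF)) >> 4);
            |     }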
           | 
           | The "Total Store Ordering" section talks about how Apple has
           | added a non-standard store ordering to the M1 than makes the
           | M1 order its stores in the same way x86 guarantees instead of
           | the way ARM guarantees. Without this, there's no good way to
           | translate instructions in code in and around an x86 memory
           | fence; if you see a memory fence in x86 code it's safe to
           | assume that it depends on x86 memory store semantics and if
           | you don't have that you'll need to emulate it with many
           | mostly unnecessary memory fences, which will be devastating
           | for performance.
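            | 
            | For the ordering part, a minimal C11 sketch of the kind of
            | idiom at stake (not Rosetta's output; just the textbook
            | message-passing pattern):
            | 
            |     #include <stdatomic.h>
            | 
            |     int payload;
            |     atomic_int ready;
            | 
            |     void producer(void) {
            |         payload = 42;
            |         /* On x86 a plain store is enough: stores are not
            |            reordered with earlier stores (TSO). On ARM
            |            without the TSO mode, the translator would have
            |            to emit a release store or barrier here - and,
            |            conservatively, around most memory accesses. */
            |         atomic_store_explicit(&ready, 1,
            |                               memory_order_release);
            |     }
            | 
            |     int consumer(void) {
            |         /* Likewise, a plain load suffices on x86 but needs
            |            acquire semantics on weakly ordered ARM. */
            |         while (!atomic_load_explicit(&ready,
            |                                      memory_order_acquire))
            |             ;
            |         return payload;
            |     }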
        
             | saagarjha wrote:
             | I'm aware of both of these extensions; they're not actually
             | necessary for most applications. Yes, you trade fidelity
             | with performance, but it's not _that_ big of a deal. The
             | majority of Rosetta's performance is good software
             | decisions and not hardware.
        
       | MikusR wrote:
        | The main reason, the M1/M2 being incredibly fast, is listed last.
        
         | dagmx wrote:
          | Perhaps if you're comparing against Intel processors, but even
          | on an Apple Silicon Mac, Rosetta 2 versions of apps are no
          | slouch compared to the native versions.
         | 
         | 20% overhead for a non-native executable is very commendable.
        
         | Someone wrote:
          | I don't think that's the main reason. The article lists a few
          | things, but I think the main reason is that they made several
          | parts of the CPU behave identically to x86. The M1 and M2 chips:
         | 
         | - can be told to do total store ordering, just as x86 does
         | 
          | - have a few status flags that x86 has, but regular ARM
          | doesn't
         | 
         | - can be told to make the FPU behave exactly as the x86 FPU
         | 
         | It also helps that ARM has many more registers than x86.
         | Because of that the emulator can map the x86 registers to ARM
         | registers, and have registers to spare for use by the emulator.
        
         | postalrat wrote:
         | That isn't the main reason.
         | 
         | If Rosetta ran x86 code at 10% the speed of native nobody would
         | be calling it fast.
        
       | superkuh wrote:
        
         | bogeholm wrote:
         | Thanks for your thoroughly objective insights. I especially
         | appreciate the concrete examples.
        
           | howinteresting wrote:
           | Here you go for a concrete example:
           | https://news.ycombinator.com/item?id=33493276
        
             | saagarjha wrote:
             | This has nothing to do with Rosetta being incomplete (it
             | has pretty good fidelity).
        
               | howinteresting wrote:
               | It was direct corroboration of:
               | 
               | > Apple users not being able to use the same hardware
               | peripherals or same software as other people is not a
               | problem, it's a feature. There's no doubt the M1/M2 chips
               | are fast. It's just a problem that they're only available
               | in crappy computers that can't run a large amount of
               | software or hardware.
        
       | spullara wrote:
       | The first time I ran into this technology was in the early 90s on
       | the DEC Alpha. They had a tool called "MX" that would translate
       | MIPS Ultrix binaries to Alpha on DEC Unix:
       | 
       | https://www.linuxjournal.com/article/1044
       | 
       | Crazy stuff. Rosetta 2 is insanely good. Runs FPS video games
       | even.
        
       | tomcam wrote:
       | > Every one-byte x86 push becomes a four byte ARM instruction
       | 
       | Can someone explain this to me? I don't know ARM but it just
       | seems to me a push should not be that expensive.
        
         | jasonwatkinspdx wrote:
         | The general principle is that RISC style instruction sets are
         | typically fixed length and with only a couple different
         | subformats. Like the prototypical RISC design has one format
         | with an opcode and 3 register fields, and then a second with an
         | opcode and an immediate field. This simplicity and regularity
          | make the fastest possible decoding hardware much simpler and
          | more efficient than for something like x86, which has a simply
          | dumbfounding number of possible variable-length formats.
         | 
         | The basic bet of RISC was that larger instruction encodings
          | would be worth it due to the microarchitectural advantages
          | they enabled. This has more or less been proven out, though
          | the distinction is less stark today, with x86 decoding into
          | uOps and recent ARM standards being quite complex beasts.
        
         | TazeTSchnitzel wrote:
         | x86 has variable-length instructions, so they can be anything
         | from 1 to 15 bytes long. AArch64 instructions are always 4
         | bytes long.
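          | 
          | A rough size comparison (encodings taken from the respective
          | ISA manuals; the AArch64 registers are made up for
          | illustration, this isn't Rosetta's actual output):
          | 
          |     /* push %rbp - a single byte on x86-64 */
          |     unsigned char x86_push[] = { 0x55 };
          | 
          |     /* str x5, [x22, #-8]! - like every AArch64 instruction,
          |        exactly four bytes */
          |     unsigned int a64_store = 0xF81F8EC5;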
        
       | iainmerrick wrote:
       | This is a great writeup. What a clever design!
       | 
       | I remember Apple had a totally different but equally clever
       | solution back in the days of the 68K-to-PowerPC migration. The
       | 68K had 16-bit instruction words, usually with some 16-bit
       | arguments. The emulator's core loop would read the next
       | instruction and branch directly into a big block of 64K x 8 bytes
       | of PPC code. So each 68K instruction got 2 dedicated PPC
       | instructions, typically one to set up a register and one to
       | branch to common code.
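        | 
        | A minimal C sketch of that dispatch idea (not Apple's code; the
        | real emulator branched into fixed 8-byte blocks of PPC rather
        | than through pointers):
        | 
        |     #include <stdint.h>
        | 
        |     typedef void (*handler_t)(uint16_t opcode);
        | 
        |     /* 64K entries, one per possible 68K opcode word */
        |     handler_t dispatch[65536];
        | 
        |     void run(const uint16_t *pc) {
        |         for (;;) {
        |             uint16_t op = *pc++;  /* fetch instruction word */
        |             /* real handlers also consume any extension words
        |                and update pc; omitted here */
        |             dispatch[op](op);
        |         }
        |     }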
       | 
       | What that solution and Rosetta 2 have in common is that they're
       | super pragmatic - fast to start up, with fairly regular and
       | predictable performance across most workloads, even if the
       | theoretical peak speed is much lower than a cutting-edge JIT.
       | 
       | Anyone know how they implemented PPC-to-x86 translation?
        
         | kijiki wrote:
         | > Anyone know how they implemented PPC-to-x86 translation?
         | 
          | They licensed Transitive's retargetable binary translator, and
         | renamed it Rosetta; very Apple.
         | 
         | It was originally a startup, but had been bought by IBM by the
         | time Apple was interested.
        
           | GeekyBear wrote:
           | > It was originally a startup, but had been bought by IBM by
           | the time Apple was interested.
           | 
           | Rosetta shipped in 2005.
           | 
           | IBM bought Transitive in 2008.
           | 
           | The last version of OS X that supported Rosetta shipped in
           | 2009.
           | 
            | I always wondered if the issue was that IBM tried to alter
            | the terms of the deal too much for Steve's taste.
        
             | savoytruffle wrote:
             | I agree it was a bit worryingly short-lived. However the
             | first version of Mac OS X that shipped without Rosetta 1
             | support was 10.7 Lion in summer 2011 (and many people
             | avoided it since it was problematic). So nearly-modern Mac
             | OS X with Rosetta support was realistic for a while longer.
        
               | GeekyBear wrote:
               | > However the first version of Mac OS X that shipped
               | without Rosetta 1 support was 10.7 Lion
               | 
               | Yes, but I was pointing out when the last version of OS X
               | that did support Rosetta shipped.
               | 
               | I have no concrete evidence that Apple dropped Rosetta
               | because IBM wanted to alter the terms of the deal after
               | they bought Transitive, but I've always found that timing
               | interesting.
               | 
               | In comparison, the emulator used during the 68k to PPC
               | transition was never removed from Classic MacOS, so the
               | change stood out.
        
             | r00fus wrote:
              | Apple is also not tied to backward compatibility.
             | 
             | Their customers are not enterprise, and consequently they
             | are probably the best company in the world at dictating
             | well-managed, reasonable shifts in customer behavior at
             | scale.
             | 
             | So they likely had no need for Rosetta as of 2009.
        
           | runjake wrote:
           | Link: https://en.wikipedia.org/wiki/QuickTransit
        
         | lostgame wrote:
          | From what I understand, they purchased a piece of software that
         | already existed to translate PPC to x86 in some form or another
         | and iterated on it. I believe the software may have already
         | even been called 'Rosetta'.
         | 
          | My memory is very hazy, though. While I experienced this
         | transition firsthand and was an early Intel adopter, that's
         | about all I can remember about Rosetta or where it came from.
         | 
         | I remember before Adobe had released the Universal Binary CS3
         | that running Photoshop on my Intel Mac was a total nightmare.
         | :( I learned to not be an early adopter from that whole
         | debacle.
        
           | saagarjha wrote:
           | Transitive.
        
           | runjake wrote:
           | Link: https://en.wikipedia.org/wiki/QuickTransit
        
         | Asmod4n wrote:
         | I don't know how they did it, but they did it very very slowly.
         | Anything "interactive" was unuseable.
        
           | lilyball wrote:
           | Assuming you're talking about PPC-to-x86, it was certainly
           | usable, though noticeably slower. Heck, I used to play Tron
           | 2.0 that way, the frame rate suffered but it was still quite
           | playable.
        
           | scarface74 wrote:
           | Interactive 68K programs were usually fast. The 68K programs
           | would still call native PPC QuickDraw code. It was processor
           | intensive code that was slow. Especially with the first
           | generation 68K emulator.
           | 
           | Connectix SpeedDoubler was definitely faster.
        
             | duskwuff wrote:
             | Most of the Toolbox was still running emulated 68k code in
             | early Power Mac systems. A few bits of performance-critical
             | code (like QuickDraw, iirc) were translated, but most
             | things weren't.
        
         | klelatti wrote:
         | That's really interesting. You might enjoy reading about the VM
         | embedded into the Busicom calculator that used the Intel 4004
         | [1]
         | 
         | They squeezed a virtual machine with 88 instructions into less
         | than 1k of memory!
         | 
         | [1] https://thechipletter.substack.com/p/bytecode-and-the-
         | busico...
        
           | wang_li wrote:
           | In the mists of history S. Wozniak wrote the SWEET-16
           | interpreter for the 6502. A VM with 29 instructions
           | implemented in 300 bytes.
           | 
           | https://en.wikipedia.org/wiki/SWEET16
        
           | iainmerrick wrote:
           | That is nifty! Sounds very similar to a Forth interpreter.
        
             | vaxman wrote:
             | Burn.
             | 
             | (unintentional, which makes it even funnier)
        
       | retskrad wrote:
       | Apple Silicon will be Tim Cook's legacy.
        
       | vaxman wrote:
       | Rosetta 3 will probably be semantic evaluation of the origin and
       | complete source-level reprogramming of the target. If it comes
       | from Apple, it will translate everything to ARM and then
       | digitally sign it to run in a native-mode sandbox under a version
       | of Safari with a supporting runtime.
        
       | hinkley wrote:
       | Apple is doing some really interesting but really quiet work in
       | the area of VMs. I feel like we don't give them enough credit but
       | maybe they've put themselves in that position by not bragging
       | enough about what they do.
       | 
       | As a somewhat related aside, I have been watching Bun (low
       | startup time Node-like on top of Safari's JavaScript engine) with
       | enough interest that I started trying to fix a bug, which is
       | somewhat unusual for me. I mostly contribute small fixes to tools
       | I use at work. I can't quite grok Zig code yet so I got stuck
       | fairly quickly. The "bug" turned out to be default behavior in a
        | Zig stdlib, rather than in JavaScript code. The rest is fairly
        | tangential, but suffice it to say I prefer self-hosted languages;
        | this probably falls under the startup-speed compromise.
       | 
        | Its low startup overhead makes their VM interesting, but the
        | fact that it benchmarks better than Firefox a lot of the time,
        | and occasionally faster than V8, shows quite a bit of quiet
        | competence.
        
         | jraph wrote:
          | > I feel like we don't give them enough credit but maybe they've
         | put themselves in that position by not bragging enough about
         | what they do.
         | 
         | And maybe also by keeping the technology closed and Apple-
         | specific. Many people who could be interested in using it don't
         | have access to it.
        
           | freedomben wrote:
            | Exactly. As someone who would be very interested in this but
            | doesn't use Apple products, it's just not exciting because
            | it's not accessible to me (I can't even test it as a user).
            | If
           | they wanted to write a whitepaper about it to share
           | knowledge, that might be interesting, but given that it's
           | Apple I'm not gonna hold my breath.
        
             | saagarjha wrote:
             | Apple (mostly WebKit) writes a significant amount about how
             | they designed their VMs.
        
           | jolux wrote:
           | WebKit B3 is open source: https://webkit.org/docs/b3/
        
             | [deleted]
        
       | Vt71fcAqt7 wrote:
        | I hope Rosetta is here to stay and continues development. And I
        | hope what is learned from it can be used to make a RISC-V version
        | of it. Translating native ARM to RISC-V should be much easier
       | than x86 to ARM as I understand it, so one could conceivably do
       | x86 -> ARM -> RISC-V.
        
         | rowanG077 wrote:
         | I hope not. Rosetta 2, as cool as it is, is a crutch to allow
          | Apple to transition away from x86. If it keeps being needed,
          | it's a massive failure for Apple and the ecosystem.
        
         | klelatti wrote:
          | More likely to be useful for RISC-V to Arm, so that Apple can
          | support running virtual machines for another architecture on
          | its machines.
        
         | masklinn wrote:
         | > I hope Rosetta is here to stay and continues developement.
         | 
         | It almost certainly is not. Odds are Apple will eventually
          | remove Rosetta II, as they did Rosetta back in the day, once
         | they consider the need for that bridge to be over (Rosetta was
         | added in 2006 in 10.4, and removed in 2011 from 10.7).
         | 
         | > And I hope what is learned from it can be used to make a
         | RISC-V version of it. translating native ARM to RISC-V should
         | be much easier than x86 to ARM as I understand it, so one could
         | conceivably do x86 -> ARM -> RISC-V.
         | 
         | That's not going to happen unless Apple decides to switch from
         | ARM to RISC-V, and... why would they? They've got 15 years
         | experience and essentially full control on ARM.
        
           | Vt71fcAqt7 wrote:
           | >That's not going to happen unless Apple decides to switch
           | from ARM to RISC-V, and... why would they? They've got 15
           | years experience and essentially full control on ARM.
           | 
           | Two points here.
           | 
            | * First off, Apple developers are not bound to Apple. The
            | knowledge gained can be used elsewhere. See Rivos and Nuvia
            | for example.
            | 
            | * Second, Apple reportedly has already ported many of its
            | secondary cores to RISC-V. It's not unreasonable that they
            | will switch in 10 years or so.
        
             | jrmg wrote:
              | _Apple reportedly has already ported many of its secondary
              | cores to RISC-V_
             | 
             | Really? In current hardware or is this speculation?
        
               | Symmetry wrote:
               | If you've got some management core somewhere in your
                | silicon you can, with RISC-V, give it an MMU but no FPU
                | and save die area. You're going to be writing custom embedded
               | code anyways so you get to save silicon by only
               | incorporating the features that you need instead of
               | having to meet the full ARM spec. And you can add your
               | own custom instructions for the job at hand pretty
               | easily.
               | 
               | That would all be a terrible idea if you were doing it
                | for a core intended to run user applications, but that's
                | not the case here: Apple, Western Digital, and NVidia
                | are embracing RISC-V for embedded cores. If I were ARM
                | I'd honestly be
               | much more worried about RISC-V's threat to my R and M
               | series cores than my A series cores.
        
               | my123 wrote:
               | Arm64 allows FPU-less designs. There are some around...
        
               | Symmetry wrote:
               | Sure. The FPU is optional on a Cortex M2, for instance.
               | But those don't have MMUs. You'd certainly need an
               | expensive architectural license to make something with an
               | MMU but no FPU if you wanted to and given all the
               | requirements ARM normally imposes for software
               | compatibility[1] between cores I'd tend to doubt that
               | they'd let you make something like that.
               | 
               | [1] Explicitly testing that you don't implement total
               | store ordering by default is one requirement I've heard
               | people talk about to get a custom core licensed.
        
               | masklinn wrote:
               | Apple has an architecture license (otherwise they could
               | not design their own cores, which they've been doing for
               | close to a decade), and already had the ability to take
               | liberties beyond what the average architecture licensee
               | can, owing to _being one of ARM's founders_.
        
               | saagarjha wrote:
               | Don't think any are shipping, but they're hiring RISC-V
               | engineers.
        
               | Vt71fcAqt7 wrote:
               | >Many dismiss RISC-V for its lack of software ecosystem
               | as a significant roadblock for datacenter and client
               | adoption, but RISC-V is quickly becoming the standard
               | everywhere that isn't exposed to the OS. For example,
               | Apple's A15 has more than a dozen Arm-based CPU cores
               | distributed across the die for various non-user-facing
               | functions. SemiAnalysis can confirm that these cores are
               | actively being converted to RISC-V in future generations
               | of hardware.[0]
               | 
                | So to answer your question, it is not currently in
                | hardware, but it is more than just speculation.
               | 
               | [0]https://www.semianalysis.com/p/sifive-powers-google-
               | tpu-nasa...
        
             | klelatti wrote:
             | > it's not unreasonable that they will switch in 10 years
             | or so.
             | 
             | You've not provided any rationale at all for why they
             | should switch their application cores let alone on this
             | specific timetable.
             | 
             | Switching is an expensive business and there has to be a
             | major business benefit for Apple in return.
        
             | chris_j wrote:
             | For me, those two points make it clear that it would be
              | _possible_ for Apple to port to RISC-V. But it's still not
             | clear what advantages they would gain from doing so, given
             | that their ARM license appears to let them do whatever they
             | want with CPUs that they design themselves.
        
               | Vt71fcAqt7 wrote:
               | The first point precludes Apple's gain from the
               | discussion.
        
           | quux wrote:
           | It would be funny/not funny if in a few years Apple removes
           | Rosetta 2 for Mac apps but keeps the Linux version forever so
           | docker can run at reasonable speeds.
        
           | kccqzy wrote:
           | > They've got 15 years experience
           | 
           | Did you only start counting from 2007 when the iPhone was
           | released? All the iPods prior to that were using ARM
           | processors. The Apple Newton was using ARM processors.
        
             | EricE wrote:
             | iPods and Newton were entirely different chips and OS's.
             | The first iPods weren't even on an OS that Apple created -
             | they licensed it.
        
             | masklinn wrote:
             | > All the iPods prior to that were using ARM processors.
             | 
             | Most of the original device was outsourced and contracted
             | out (for reasons of time constraint and lack of internal
             | expertise). PortalPlayer built the SoC and OS, not Apple.
              | Later SoCs were sourced from SigmaTel and Samsung, until the
             | 3rd gen Touch.
             | 
             | > The Apple Newton was using ARM processors.
             | 
             | The Apple Newton was a completely different Apple, and
             | there were several years' gap between Jobs killing the
             | Newton and the birth of iPod, not to mention the completely
             | different purpose and capabilities. There would be no
             | newton-type project until the iPhone.
             | 
             | Which is also when Apple started working with silicon
             | themselves: they acquired PA in 2008, Intrinsity in 2010,
             | and Passif in 2013, released their first partially in-house
             | SoC in 2010 (A4), and their first in-house core in 2013
             | (Cyclone, in the A7).
        
           | stu2b50 wrote:
           | Rosetta 1 had a ticking time bomb. Apple was licensing it
           | from a 3rd party. Rosetta 2 is all in house as far as we
           | know.
           | 
           | Different CEO as well. Jobs was more opinionated on
           | "principles" - Cook is more than happy to sell what people
           | will buy. I think Rosetta 2 will last.
        
             | masklinn wrote:
             | > Rosetta 1 had a ticking time bomb. Apple was licensing it
             | from a 3rd party.
             | 
             | Yes, I'm sure Apple had no way of extending the license.
             | 
             | > Cook is more than happy to sell what people will buy. I
             | think Rosetta 2 will last.
             | 
             | There's no "buy" here.
             | 
             | Rosetta is complexity to maintain, and an easy cut. It's
             | not even part of the base system.
             | 
             | And "what people will buy" certainly didn't prevent
             | essentially removing support for non-hidpi displays from
             | MacOS. Which is a lot more impactful than Rosetta as far as
             | I'm concerned.
        
               | NavinF wrote:
               | > removing support for non-hidpi displays from MacOS
               | 
               | Did that really reduce sales? Consider that the wide
               | availability of crappy low end hardware gave Windows
               | laptops a terrible reputation. Eg https://www.reddit.com/
               | r/LinusTechTips/comments/yof7va/frien...
        
               | masklinn wrote:
               | > Consider that the wide availability of crappy low end
               | hardware gave Windows laptops a terrible reputation.
               | 
               | Standard DPI displays are not "crappy low-end hardware"?
               | 
                | I don't think there's a single widescreen display out
                | there which qualifies as HiDPI; that more or less doesn't
               | exist: a 5K 34" is around 160 DPI (to say nothing of the
               | downright pedestrian 5K 49" like the G9 or the AOC Agon).
        
               | fredoralive wrote:
               | What do you mean non HiDPI display support being removed
               | from Mac OS? I've been using a pair of 1920x1080 monitors
               | with my Mac Mini M1 just fine? Have they somehow broken
               | something in Mac OS 13 / Ventura? (I haven't clicked the
               | upgrade button yet, I prefer to let others leap boldly
               | first).
        
             | bpye wrote:
             | They've also allowed Rosetta 2 in Linux VMs - if they are
             | serious about supporting those use cases then I think it'll
             | stay.
        
             | kitsunesoba wrote:
              | We'll see, but even Cook-era Apple historically hasn't
              | liked the idea of third parties leaning on bridge
             | technologies for too long. Things like Rosetta are offered
             | as temporary affordances to allow time for devs to migrate,
             | not as a permanent platform fixture.
        
             | vaxman wrote:
             | But that 3rd party was only legally at arm's length.
        
             | TillE wrote:
             | What important Intel-only macOS software is going to exist
             | in five years?
             | 
             | It's basically only games and weird tiny niches, and Apple
             | is pretty happy to abandon both those categories. The
              | saving grace is that there are very few interesting Mac-
             | exclusive games in the Intel era.
        
               | flomo wrote:
               | Yeah, Apple killed all "legacy" 32-bit support, so one
               | would think there's not much software which is both
               | x86-64 and not being actively developed.
        
               | vxNsr wrote:
                | 2006 Apple was very different from 2011 Apple; renewing
                | that license in 2011 was probably considered cost-
                | prohibitive for the negligible benefit.
        
               | rerx wrote:
               | Starting with Ventura, Linux VMs can use Rosetta 2 to run
               | x64 executables. I expect x64 Docker containers to remain
               | relevant for quite a few years to come. Running those at
               | reasonable speeds on Apple Silicon would be huge for
               | developers.
        
             | dmitriid wrote:
             | > Jobs was more opinionated on "principles" - Cook is more
             | than happy to sell what people will buy.
             | 
             | Well, the current "principle" is "iOS is enough, we're
             | going to run iOS apps on MacOS, and that's it".
             | 
             | Rosetta isn't needed for that.
        
               | dmitriid wrote:
               | It's strange to see people downvoting this when three
               | days ago App Store on MacOS literally defaulted to
               | searching iOS and iPad apps for me
               | https://twitter.com/dmitriid/status/1589179351572312066
        
           | CharlesW wrote:
           | > _Odds are Apple will eventually remove Rosetta II, as they
           | did Rosetta back in the days, once they consider the need for
           | that bridge to be over (Rosetta was added in 2006 in 10.4,
           | and removed in 2011 from 10.7)._
           | 
            | The difference is that Rosetta 1 was PPC-to-x86, so its
            | purpose ended once PPC was a fond memory.
            | 
            | Today's Rosetta is a generalized x86-to-ARM translation
           | environment that isn't just for macOS apps. For example, it
           | works with Apple's new virtualization framework to support
           | running x86_64 Linux apps in ARM Linux VMs.
           | 
           | https://developer.apple.com/documentation/virtualization/run.
           | ..
        
           | gumby wrote:
           | > That's not going to happen unless Apple decides to switch
           | from ARM to RISC-V, and... why would they? They've got 15
           | years experience and essentially full control on ARM.
           | 
            | 15? More than a quarter century. They were one of the
            | original investors in ARM and have produced plenty of ARM
            | devices since then, beyond the Newton and the iPod.
            | 
            | I'd bet they use a bunch of RISC-V internally too, if they
            | just need a little CPU to manage something locally on some
            | device and want to avoid paying a tiny fee to ARM, or just
            | want some experience with it.
           | 
            | But RISC-V as the main CPU? Yes, that's a long way away, if
            | ever. But Apple is good at the long game. I wouldn't be
            | surprised to hear that Apple has iOS running on RISC-V, but
            | even something like the Lightning-to-HDMI adapter runs iOS on
            | ARM.
        
             | masklinn wrote:
             | > 15? More than a quarter century. They were one of the
             | original investors in ARM and have produced plenty of arm
             | devices since then beyond the newton and the ipod.
             | 
             | They didn't design their own chips for most of that time.
        
               | gumby wrote:
               | At the same time as the ARM investment they had a Cray
               | for...chip design.
        
               | masklinn wrote:
               | Yes and?
               | 
               | Apple invested in ARM and worked with ARM/Acorn on what
                | would become the ARM6, in the early 90s. The Newton used
                | it (specifically the ARM610); it was a commercial
                | failure, and later models used updated ARM CPUs to which
                | AFAIK Apple didn't contribute (DEC's StrongARM, and
                | ARM's ARM710).
               | 
               | <15 years pass>
               | 
               | Apple starts working on bespoke designs again around the
               | time they start working on the iPhone, or possibly after
               | they realise it's succeeding.
               | 
                | That doesn't mean they stopped _using_ ARM in the
                | meantime (they certainly didn't).
               | 
               | The iPod's SoC was not even designed internally (it was
               | contracted out to PortalPlayer, later generations were
                | provided by Samsung). 15 years and the revolution of
                | Jobs' return (and his immediate killing of the Newton)
                | is a long time for an internal team of silicon designers.
        
           | preisschild wrote:
           | > They've got 15 years experience and essentially full
           | control on ARM.
           | 
           | Do they? ARM made it very clear that they consider all ARM
           | cores their own[1]
           | 
           | [1]: https://www.theregister.com/2022/11/07/opinion_qualcomm_
           | vs_a...
        
             | nicoburns wrote:
             | Apple is in a somewhat different position to Qualcomm in
             | that they were a founding member of ARM. I've also heard
             | rumours that aarch64 was designed by apple and donated to
             | ARM (hence why apple was so early to release an aarch64
              | processor). So I somewhat doubt ARM will be in a position to
             | sue them any time soon.
        
             | danaris wrote:
             | The Qualcomm situation is based on breaches of a specific
             | agreement that ARM had with Nuvia, which Qualcomm has now
             | bought. It's not a generalizable "ARM thinks everything
             | they license belongs to them fully in perpetuity" deal.
        
             | masklinn wrote:
             | > Do they?
             | 
             | They do, yes. They were one of the founding 3 members of
             | ARM itself, and the primary monetary contributor.
             | 
             | Through this they acquired privileges which remain extant:
             | they can literally add custom instructions to the ISA
             | (https://news.ycombinator.com/item?id=29798744), something
             | there is no available license for.
             | 
             | > ARM made it very clear that they consider all ARM cores
             | their own[1]
             | 
             | The Qualcomm situation is a breach of contract issue wrt
             | Nuvia, it's a very different issue, and by an actor with
             | very different privileges.
        
               | Vt71fcAqt7 wrote:
                | Is there a real source for this claim? It gets parroted a
                | lot on HN and elsewhere, but I've also heard it's greatly
                | exaggerated. I don't think Apple engineers get to read the
                | licences, and even if they did, how do we know they
                | understood it correctly and that it got repeated
                | correctly? I've never seen a valid source for this
                | claim.
        
               | masklinn wrote:
                | For what claim? That they co-founded ARM? That's
               | historical record. That they extended the ISA? That's
               | literally observed from decompilations. That they can do
               | so? They've been doing it for at least 2 years and ARM
               | has yet to sue.
               | 
               | > I've never seen a valid source for this claim.
               | 
               | What is "a valid source"? The linked comment is from
               | Hector Martin, the founder and lead of Asahi, who worked
               | on and assisted with reversing various facets of Apple
               | silicon, including the capabilities and extensions of the
               | ISA.
        
               | Vt71fcAqt7 wrote:
               | >For what claim?
               | 
               | that they have "essentially full control on ARM"
               | 
               | Having an ALA + some extras doesn't mean "full control."
               | 
               | he also says:
               | 
               | >And apparently in Apple's case, they get to be a little
               | bit incompatible
               | 
               | So he doesn't seem to actually know the full extent to
               | which Apple has more rights, even using the phrase "a
               | little bit" -- far from your claim. And he (and certainly
                | you) has not read the license. Perhaps they have to pay
                | for each core they release on the market that breaks
                | compatibility? Do you know? Of course not. A valid source
                | would be a statement from someone who read the license or
                | one of the companies. There is more to a core than just
                | the ISA. If not, why is Apple porting cores to RISC-V if
                | they have so much control?
        
               | ksherlock wrote:
               | Why does it need a "real source"? ARM sells architecture
               | licenses, Apple has a custom ARM architecture. 1 + 1 = 2.
               | 
               | https://www.cnet.com/tech/tech-industry/apple-seen-as-
               | likely...
               | 
               | "ARM Chief Executive Warren East revealed on an earnings
               | conference call on Wednesday that "a leading handset
               | OEM," or original equipment manufacturer, has signed an
               | architectural license with the company, forming ARM's
               | most far-reaching license for its processor cores. East
               | declined to elaborate on ARM's new partner, but EETimes'
               | Peter Clarke could think of only one smartphone maker who
               | would be that interested in shaping and controlling the
               | direction of the silicon inside its phones: Apple."
               | 
               | https://en.wikipedia.org/wiki/Mac_transition_to_Apple_sil
               | ico...
               | 
               | "In 2008, Apple bought processor company P.A. Semi for
               | US$278 million.[28][29] At the time, it was reported that
               | Apple bought P.A. Semi for its intellectual property and
               | engineering talent.[30] CEO Steve Jobs later claimed that
               | P.A. Semi would develop system-on-chips for Apple's iPods
               | and iPhones.[6] _Following the acquisition, Apple signed
               | a rare "Architecture license" with ARM, allowing the
               | company to design its own core, using the ARM instruction
               | set_.[31] The first Apple-designed chip was the A4,
               | released in 2010, which debuted in the first-generation
               | iPad, then in the iPhone 4. Apple subsequently released a
               | number of products with its own processors."
               | 
               | https://www.anandtech.com/show/7112/the-arm-diaries-
               | part-1-h...
               | 
               | "Finally at the top of the pyramid is an ARM architecture
               | license. Marvell, Apple and Qualcomm are some examples of
               | the 15 companies that have this license."
        
               | Vt71fcAqt7 wrote:
               | I should have been more explicit. I am questioning the
               | claim that Apple has "full control on ARM" with no
               | restriction on the cores they make, grandfathered in from
               | the 1980s. Nobody has ever substantiated that claim.
        
       | titzer wrote:
       | Rosetta 2 is great, except it apparently can't run statically-
       | linked (non-PIC) binaries. I am unsure why this limitation
        | exists, but it's pretty annoying because Virgil x86-64 binaries
       | cannot run under Rosetta 2, which means I resort to running on
       | the JVM on my M1...
        
         | randyrand wrote:
         | Why are static binaries with PIC so rare? I'm surprised
         | position dependent code is _ever_ used anymore in the age of
         | ASLR.
         | 
         | But static binaries are still great for portability. So you'd
         | think static binaries with PIC would be the default.
        
           | masklinn wrote:
           | > But static binaries are still great for portability.
           | 
           | macOS has not officially supported static binaries in...
           | ever? You can't statically link libSystem, and it absolutely
           | does not care for kernel ABI stability.
        
             | titzer wrote:
             | > it absolutely does not care for kernel ABI stability
             | 
             | That may be true on the mach system call side, but the UNIX
             | system calls don't appear to change. (Virgil actually does
             | call the kernel directly).
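              | 
              | For concreteness, a direct BSD syscall on x86-64 macOS
              | looks roughly like this (a hedged sketch, not Virgil's
              | actual code; 0x2000000 is the conventional "Unix class"
              | prefix and 4 is write):
              | 
              |     #include <stddef.h>
              | 
              |     static long raw_write(int fd, const void *buf,
              |                           size_t len) {
              |         long ret;
              |         /* errors come back via the carry flag, which
              |            this sketch ignores */
              |         __asm__ volatile ("syscall"
              |             : "=a"(ret)
              |             : "a"(0x2000004L), "D"((long)fd),
              |               "S"(buf), "d"(len)
              |             : "rcx", "r11", "memory");
              |         return ret;
              |     }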
        
               | masklinn wrote:
               | > That may be true on the mach system call side, but the
               | UNIX system calls don't appear to change.
               | 
               | They very much do, without warning, as the Go project
               | discovered (after having been warned multiple times)
               | during the Sierra betas:
               | https://github.com/golang/go/issues/16272
               | https://github.com/golang/go/issues/16606
               | 
                | That doesn't mean Apple goes out of its way to break
                | syscalls (unlike Microsoft), but there is no support for
                | direct syscalls. That is why, again, you can't statically
                | link libSystem.
                | 
                | > (Virgil actually does call the kernel directly).
                | 
                | That's completely unsupported ¯\_(ツ)_/¯
        
           | titzer wrote:
           | Virgil doesn't use ASLR. I'm not sure what value it adds to a
           | memory-safe language.
        
         | saagarjha wrote:
         | Rosetta can run statically linked binaries, but I don't think
         | anything supports binaries that aren't relocatable.
          | 
          |     $ file a.out
          |     a.out: Mach-O 64-bit executable x86_64
          |     $ otool -L a.out
          |     a.out:
          |     $ ./a.out
          |     Hello, world!
        
         | CharlesW wrote:
          | > _Rosetta 2 is great, except it apparently can't run
          | statically-linked (non-PIC) binaries._
         | 
         | Interestingly, it supports statically-linked x86 binaries when
         | used with Linux.
         | 
         | "Rosetta can run statically linked x86_64 binaries without
         | additional configuration. Binaries that are dynamically linked
         | and that depend on shared libraries require the installation of
         | the shared libraries, or library hierarchies, in the Linux
         | guest in paths that are accessible to both the user and to
         | Rosetta."
         | 
         | https://developer.apple.com/documentation/virtualization/run...
        
         | mirashii wrote:
         | Statically linked binaries are officially unsupported on MacOS
         | in general, so there's no reason to support it on Rosetta
         | either.
         | 
          | It's unsupported on macOS because static linking assumes
          | binary compatibility at the kernel system-call interface,
          | which is not guaranteed.
        
           | saagarjha wrote:
           | Rosetta was introduced with the promise that it supports
           | binaries that make raw system calls. (And it does indeed
           | support these by hooking the syscall instruction.)
        
       | darzu wrote:
       | Does anyone know the names of the key people behind Rosetta 2?
       | 
       | In my experience, exceptionally well executed tech like this
       | tends to have 1-2 very talented people leading. I'd like to
       | follow their blog or Twitter.
        
         | trollied wrote:
         | The original Rosetta was written by Transitive, which was
         | formed by spinning a Manchester University research group out.
         | See https://www.software.ac.uk/blog/2016-09-30-heroes-
         | software-e...
         | 
         | I know a few of their devs went to ARM, some to Apple & a few
         | to IBM (who bought Transitive). I do know a few of their ex
         | staff (and their twitter handles), but I don't feel comfortable
         | linking them here.
        
           | scrlk wrote:
           | IIRC the current VP of Core OS at Apple is ex-
           | Manchester/Transitive.
        
         | cwzwarich wrote:
         | I am the creator / main author of Rosetta 2. I don't have a
         | blog or a Twitter (beyond lurking).
        
           | darzu wrote:
           | Should you feel inspired to share your learnings, insights,
            | or future ideas about the computing spaces you know, I'm sure
            | many other people and I would be interested to listen!
           | 
           | My preferred way to learn about a new (to me) area of tech is
           | to hear the insights of the people who have provably advanced
            | that field. The signal-to-noise ratio in tech blogs is low.
        
           | darzu wrote:
           | If you're feeling inclined, here's a slew of questions:
           | 
           | What was the most surprising thing you learned while working
           | on Rosetta 2?
           | 
           | Is there anything (that you can share) that you would do
           | differently?
           | 
            | Can you recommend any great starting places for someone
           | interested in instruction translation?
           | 
           | Looking forward, did your work on Rosetta give you ideas for
           | unfilled needs in the virtualization/emulation/translation
           | space?
           | 
           | What's the biggest inefficiency you see today in the tech
           | stacks you interact most with?
           | 
           | A lot of hard decisions must have been made while building
           | Rosetta 2; can you shed light on some of those and how you
           | navigated them?
        
           | pcf wrote:
           | Thanks for your amazing work!
           | 
           | May I ask - would it be possible to implement support for
           | 32-bit VST and AU plugins?
           | 
           | This would be a major bonus, because it could e.g. enable
           | producers like me to open up our music projects from earlier
           | times, and still have the old plugins work.
        
             | [deleted]
        
           | Klonoar wrote:
           | Huh, this is timely. Incredibly random but: do you know if
           | there was anything that changed as of Ventura to where trying
           | to mmap below the 2/4GB boundary would no longer work in
           | Rosetta 2? I've an app where it's worked right up to Monterey
           | yet inexplicably just bombs in Ventura.
        
           | keepquestioning wrote:
           | Isn't Rosetta 2 "done"? What are you working on now?
        
           | bdash wrote:
           | Impressive work, Cameron! Hope you're doing well.
        
           | skrrtww wrote:
           | Are you able to speak at all to the known performance
           | struggles with x87 translation? Curious to know if we're
           | likely to see any updates or improvements there into the
           | future.
        
       | peatmoss wrote:
       | Not having any particular domain experience here, I've idly
       | wondered whether or not there's any role for neural net models in
       | translating code for other architectures.
       | 
       | We have giant corpuses of source code, compiled x86_64 binaries,
       | and compiled arm64 binaries. I assume the compiled binaries
       | represent approximately our best compiler technology. It seems
       | predicting an arm binary from an x86_64 binary would not be
       | insane?
       | 
       | If someone who actually knows anything here wants to disabuse me
       | of my showerthoughts, I'd appreciate being able to put the idea
       | out of my head :-)
        
         | Symmetry wrote:
         | Many branch predictors have traditionally used perceptrons,
          | which are sort of NN-like. And I think there's a lot of
          | research into incorporating deep learning models into chip
          | routing.
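          | 
          | For the curious, a toy sketch of the textbook perceptron
          | predictor (the Jimenez & Lin scheme, not any specific CPU's
          | implementation; the table size and threshold are made up):
          | 
          |     #include <stdbool.h>
          |     #include <stdlib.h>
          | 
          |     #define HIST 16
          |     static int w[1024][HIST + 1];  /* weights per entry */
          |     static int hist[HIST];         /* +1 taken, -1 not */
          | 
          |     static bool predict(unsigned pc, int *sum) {
          |         int *p = w[pc % 1024];
          |         *sum = p[0];               /* bias weight */
          |         for (int i = 0; i < HIST; i++)
          |             *sum += p[i + 1] * hist[i];
          |         return *sum >= 0;          /* predict taken */
          |     }
          | 
          |     static void train(unsigned pc, int sum, bool taken) {
          |         int *p = w[pc % 1024];
          |         int t = taken ? 1 : -1;
          |         /* update on mispredict or low confidence */
          |         if ((sum >= 0) != taken || abs(sum) < 32) {
          |             p[0] += t;
          |             for (int i = 0; i < HIST; i++)
          |                 p[i + 1] += t * hist[i];
          |         }
          |         for (int i = HIST - 1; i > 0; i--)
          |             hist[i] = hist[i - 1];
          |         hist[0] = t;
          |     }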
        
         | Someone wrote:
         | > It seems predicting an arm binary from an x86_64 binary would
         | not be insane?
         | 
         | If you start with a couple of megabytes of x64 code, and
         | predict a couple of megabytes of arm code from it, there will
         | be errors even if your model is 99.999% accurate.
         | 
         | How do you find the error(s)?
        
         | hinkley wrote:
         | I think we are on the cusp of machine aided rules generation
         | via example and counter example. It could be a very cool era of
         | "Moore's Law for software" (which I'm told software doubles in
         | speed roughly every 18 years).
         | 
         | Property based testing is a bit of a baby step here, possibly
         | in the same way that escape analysis in object allocation was
         | the precursor to borrow checkers which are the precursor to...?
         | 
         | These are my inputs, these are my expectations, ask me some
         | more questions to clarify boundary conditions, and then offer
         | me human readable code that the engine thinks satisfies the
         | criteria. If I say no, ask more questions and iterate.
         | 
         | If anything will ever allow machines to "replace" coders, it
         | will be that, but the scare quotes are because that shifts us
         | more toward information architecture from data munging, which I
         | see as an improvement on the status quo. Many of my work
         | problems can be blamed on structural issues of this sort. A
         | filter that removes people who can't think about the big
         | picture doesn't seem like a problem to me.
        
         | saagarjha wrote:
         | People have tried doing this, but not typically at the
         | instruction level. Two ways to go about this that I'm aware of
         | are trying to use machine learning to derive high-level
         | semantics about code, then lowering it to the new architecture.
        
         | brookst wrote:
          | I'm an ML dilettante and hope someone more knowledgeable chimes
         | in, but one thing to consider is the statistics of how many
         | instructions you're translating and the accuracy rate. Binary
         | execution is very unforgiving to minor mistakes in translation.
         | If 0.001% of instructions are translated incorrectly, that
         | program just isn't going to work.
        
         | qsort wrote:
         | You would need a hybrid architecture with a NN generating
         | guesses and a "watchdog" shutting down errors.
         | 
         | Neural models are basically universal approximators. Machine
         | code needs to be obscenely precise to work.
         | 
         | Unless you're doing something else in the backend, it's just a
         | turbo SIGILL generator.
        
           | throw10920 wrote:
           | This is all true - machine code needs to be "basically
           | perfect" to work.
           | 
            | However, there are lots of problems in CS where it's easier
            | to check a proposed solution than to find one in the first
            | place. It _may_ turn out to be the case that a well-tuned
            | model can quickly produce solutions to some code-generation
            | problems, that those solutions have a high enough likelihood
            | of being correct, that it's fast enough to check (and maybe
           | try again), and that this entire process is faster than
           | state-of-the-art classical algorithms.
           | 
           | However, if that were the case, I might also expect us to be
           | able to extract better algorithms from the model -
           | intuitively, machine code generation "feels" like something
           | that's just better implemented through classical algorithms.
           | Have you met a human that can do register allocation faster
           | than LLVM?
        
           | classichasclass wrote:
           | > turbo SIGILL generator
           | 
           | This gave me the delightful mental image of a CPU smashing
           | headlong into a brick wall, reversing itself, and doing it
           | again. Which is pretty much what this would do.
        
       | ericbarrett wrote:
       | Anybody know if Docker has plans to move from qemu to Rosetta on
       | M1/2 Macs? I've found qemu to be at least 100x slower than the
       | native arch.
        
       | jeffbee wrote:
       | I wonder how much hand-tuning there is in Rosetta 2 for known,
       | critical routines. One of the tricks Transmeta used to get
       | reasonable performance on their very slow Crusoe CPU was to
       | recognize critical Windows functions and replace them with a
       | library of hand-optimized native routines. Of course that's a
       | little different because Rosetta 2 is targeting an architecture
       | that is generally speaking at least as fast as the x86
       | architecture it is trying to emulate, and that's been true for
       | most cross-architecture translators historically like DEC's VEST
       | that ran VAX code on Alpha, but Transmeta CMS was trying to
       | target a CPU that was slower.
        
         | saagarjha wrote:
         | Haven't spotted any in particular.
        
       | sedatk wrote:
        | TL;DR: One-to-one instruction translation ahead of time instead
        | of complex JIT translation, betting on the M1's performance and
        | instruction-cache handling.
        
       | johnthuss wrote:
       | "I believe there's significant room for performance improvement
       | in Rosetta 2... However, this would come at the cost of
       | significantly increased complexity... Engineering is about making
       | the right tradeoffs, and I'd say Rosetta 2 has done exactly
       | that."
        
         | Gigachad wrote:
         | Would be a waste of effort when the tool is designed to be
         | obsolete in a few years as everything gets natively compiled.
        
       | saagarjha wrote:
       | One thing that's interesting to note is that the amount of effort
       | expended here is not actually all that large. Yes, there are
       | smart people working on this, but the performance of Rosetta 2
       | for the most part is probably the work of a handful of clever
       | people. I wouldn't be surprised if some of them have an interest
       | in compilers but the actual implementation is fairly
       | straightforward and there isn't much of the stuff you'd typically
       | see in an optimizing JIT: no complicated type theory or analysis
       | passes. Aside from a handful of hardware bits and some convenient
       | (perhaps intentionally selected) choices in where to make
       | tradeoffs there's nothing really specifically amazing here. What
       | really makes it special is that anyone (well, any company with a
       | bit of resources) could've done it but nobody really did. (But,
       | again, Apple owning the stack and having past experience probably
       | did help them get over the hurdle of actually putting effort into
       | this.)
        
       | agentcooper wrote:
       | I am interested in this domain, but lack the knowledge to fully
       | understand the post. Any recommendations on good
       | books/courses/tutorials related to low level programming?
        
         | saagarjha wrote:
         | I'd recommend going through a compilers curriculum, then
         | reading up on past binary translation efforts.
        
       | pjmlp wrote:
       | Back in the early days of Windows NT everywhere, the Alpha
       | version had a similar JIT emulation.
       | 
       | https://en.m.wikipedia.org/wiki/FX!32
       | 
       | Or for a more technical deep dive,
       | 
       | https://www.usenix.org/publications/library/proceedings/usen...
        
         | mosburger wrote:
         | OMG I forgot about FX!32. My first co-op was as a QA tester for
         | the DEC Multia, which they moved from the Alpha processor to
         | Intel midway through. I did a skunkworks project for the dev
         | team attempting to run the newer versions of Multia's software
         | (then Intel-based) on older Alpha Multias using FX!32. IIRC it
         | was still internal use only/beta, but it worked quite well!
        
       | hot_gril wrote:
       | Rosetta 2 has become the poster child for "innovation without
       | deprecation" where I work (not Apple).
        
         | Tijdreiziger wrote:
         | Apple is the king of deprecation, just look at what happened to
         | Rosetta 1 and 32-bit iOS apps.
        
           | hot_gril wrote:
           | Yes they are, and that makes Rosetta 2 even more special.
           | Though Rosetta 1 got support for 5 years, which is pretty
           | good.
        
       | kccqzy wrote:
       | > The instructions from FEAT_FlagM2 are AXFLAG and XAFLAG, which
       | convert floating-point condition flags to/from a mysterious
       | "external format". By some strange coincidence, this format is
       | x86, so these instruction are used when dealing with floating
       | point flags.
       | 
       | This really made me chuckle. They probably don't want to mention
       | Intel by name, but this just sounds funny.
       | 
       | https://developer.arm.com/documentation/100076/0100/A64-Inst...
        
       | manv1 wrote:
       | Apple's historically been pretty good at making this stuff. Their
       | first 68k -> PPC emulator (Davidian's) was so good that for some
       | things the PPC Mac was the fastest 68k mac you could buy. The
       | next-gen DR emulator (and SpeedDoubler etc) made things even
       | faster.
       | 
       | I suspect the ppc->x86 stuff was slower because x86 just doesn't
       | have the registers. There's only so much you can do.
        
         | scarface74 wrote:
         | > Their first 68k -> PPC emulator (Davidian's) was so good that
         | for some things the PPC Mac was the fastest 68k mac you could
         | buy.
         | 
         | This is not true. A 6100/60 running 68K code was about the
         | speed of my unaccelerated Mac LCII 68030/16. Even when using
         | SpeedDoubler, you only got speeds up to my LCII with a
         | 68030/40Mhz accelerator.
         | 
         | Even the highest end 8100/80 was slower than a high end 68k
         | Quadra.
         | 
         | The only time 68K code ran faster is when it made heavy use of
          | the Mac APIs that were native.
        
           | dev_tty01 wrote:
           | >The only time 68K code ran faster is when it made heavy use
            | of the Mac APIs that were native.
           | 
            | Yes, and that just confirms the original point. Mac apps
            | often spend much of their time in the OS APIs, so the 68K
            | code (the app) often ran faster on PPC than it did on 68K.
            | The earlier post said "so good that for some things the PPC
            | Mac was the fastest 68k mac." That is true.
           | 
           | In my own experience, I found most 68K apps felt as fast or
           | faster. Your app mix might have been different, but many
           | folks found the PPC faster.
        
             | classichasclass wrote:
             | Part of that was the greater clock speeds on the 601 and
             | 603, though. Those _started_ at 60MHz. Clock for clock 68K
             | apps were generally poorer on PowerPC until PPC clock
             | speeds made them competitive, and then the dynamic
             | recompiling emulator knocked it out of the park.
             | 
             | Similarly, Rosetta was clock-for-clock worse than Power
             | Macs at running Power Mac applications. The last generation
             | G5s would routinely surpass Mac Pros of similar or even
             | slightly greater clocks. On native apps, though, it was no
             | contest, and by the next generation the sheer processor
             | oomph put the problem completely away.
             | 
             | Rosetta 2 is notable in that it is so far Apple's only
             | processor transition where the new architecture was
              | unambiguously faster than the old one _on the old one's
              | own turf_.
        
         | Wowfunhappy wrote:
         | > Apple's historically been pretty good at making this stuff.
         | Their first 68k -> PPC emulator (Davidian's) was so good that
         | for some things the PPC Mac was the fastest 68k mac you could
         | buy.
         | 
         | Not arguing the facts here, but I'm curious--are these
         | successes related? And if so, how has Apple done that?
         | 
         | I would imagine that very few of the engineers who programmed
         | Apple's 68k emulator are still working at Apple today. So, why
         | is Apple still so good at this? Strong internal documentation?
         | Conducive management practices? Or were they just lucky both
         | times?
        
           | joshstrange wrote:
           | I mean they are one of very few companies who have done arch
           | changes like this and they had already done it twice before
           | Rosetta 2. The same engineers might not have been used for
           | all 3 but I'm sure there was at least a tiny bit of overlap
           | between 68k->PPC and PPC->Intel (and likewise overlap between
           | PPC->Intel and Intel->ARM) that coupled with passed down
           | knowledge within the company gives them a leg up. They know
            | the pitfalls, they've seen issues/advantages of using certain
           | approaches.
           | 
            | I think of it in the same way that I've migrated from old->new
           | versions of frameworks/languages in the past with breaking
           | changes and each time I've done it I've gotten better at
           | knowing what to expect, what to look for, places where it
           | makes sense to "just get it working" or "upgrade the code to
           | the new paradigm". The first time or two I did it was as a
           | junior working under senior developers so I wasn't as
           | involved but what did trickle down to me and/or my part in
           | the refactor/upgrade taught me things. Later times when I was
           | in charge (or on my own) I was able to draw on those past
           | experiences.
           | 
           | Obviously my work is nowhere near as complicated as arch
           | changes but if you squint and turn your head to the side I
           | think you can see the similarities.
           | 
           | > Or were they just lucky to have success both times?
           | 
           | I think 2 times might be explained with "luck" but being
           | successful 3 times points to a strong trend IMHO, especially
           | since Rosetta 2 seems to have done even better than Rosetta 1
           | for the last transition.
        
           | spacedcowboy wrote:
           | FWIW, I know several current engineers at Apple who wrote
           | ground-breaking stuff before the Mac even existed. Apple
           | certainly doesn't have any problem with older engineers, and
           | it turns out that transferring that expertise to new chips on
           | demand isn't particularly hard for them.
        
         | nordsieck wrote:
         | > I suspect the ppc->x86 stuff was slower because x86 just
         | doesn't have the registers.
         | 
         | My understanding is that part of the reason the G4/5 was sort
         | of able to keep up with x86 at the time was due to the heavy
         | use of SIMD in some apps. And I doubt that Rosetta would have
         | been able to translate that stuff into SSE (or whatever the x86
         | version of SIMD was at the time) on the fly.
        
           | bonzini wrote:
           | Apple had a library of SIMD subroutines (IIRC
           | Accelerate.framework) and Rosetta was able to use the x86
           | implementation when translating PPC applications that called
           | it.
        
           | masklinn wrote:
           | Rosetta actually did support Altivec. It didn't support G5
            | input at all though (likely because that was considered
           | pretty niche, as Apple only released a G5 iMac, a PowerMac,
           | and an XServe, due to the out-of-control power and thermals
           | of the PowerPC 970).
        
       | menaerus wrote:
       | > Rosetta 2 translates the entire text segment of the binary from
       | x86 to ARM up-front.
       | 
        | Do I understand correctly that Rosetta is basically a
       | transpiler from x86-64 machine code to ARM machine code which is
       | run prior to the binary execution? If so, does it affect the
       | application startup times?
        
         | nilsb wrote:
         | Yes, it does. The delay of the first start of an app is quite
         | noticeable. But the transpiled binary is apparently cached
         | somewhere.
        
           | saagarjha wrote:
           | /var/db/oah.
        
         | nicoburns wrote:
         | > If so, does it affect the application startup times?
         | 
         | It does, but only the very first time you run the application.
         | The result of the transpilation is cached so it doesn't have to
         | be computed again until the app is updated.
        
           | arianvanp wrote:
           | And deleting the cache is undocumented (it is not in the file
           | system) so if you run Mac machines as CI runners they will
           | trash and brick themselves running out of disk space over
           | time.
        
             | rowanG077 wrote:
             | What in the actual fuck. That is such an insane decision.
             | Where is it stored then? Some dark corner of the file
             | system inaccessible via normal means?
        
             | jonny_eh wrote:
             | You mean the cache is ever expanding?
        
             | koala_man wrote:
             | Really? This SO question says it's stored in /var/db/oah/
             | 
             | https://apple.stackexchange.com/questions/427695/how-can-
             | i-l...
        
           | dylan604 wrote:
           | Does that essentially mean each non-native app is doubled in
           | disk use? Maybe not doubled but requires more space to be
           | sure.
        
             | saagarjha wrote:
             | Yes.
        
             | varenc wrote:
             | Yes... you can see the cache in /var/db/oah/
             | 
              | Though it's only the actual binary that gets doubled. For
             | large apps it's usually not the binary that's taking up
             | most of the space.
        
           | kijiki wrote:
           | Similar to DEC's FX!32 in that regard. FX!32 allowed running
           | x86 Windows NT apps on Alpha Windows NT.
        
             | saltcured wrote:
             | There was also an FX!32 for Linux. But I think it may have
             | only included the interpreter part and left out the
             | transpiler part. My memory is vague on the details.
             | 
             | I do remember that I tried to use it to run the x86
             | Netscape binary for Linux on a surplus Alpha with RedHat
             | Linux. It worked, but so slowly that a contemporary Python-
             | based web browser had similar performance. In practice, I
             | settled on running Netscape from a headless 486 based PC
             | and displaying remotely on the Alpha's desktop over
             | ethernet. That was much more usable.
        
         | esskay wrote:
          | The first load is fairly slow, but once it's done, every load
         | after that is pretty much identical to what it'd be running on
         | an x86 mac due to the caching it does.
        
           | EricE wrote:
           | For me my M1 was fast enough that the first load didn't seem
           | that different - and more importantly subsequent loads were
            | lightning fast! It's astonishing how good Rosetta 2 is -
           | utterly transparent and faster than my Intel Mac thanks to
           | the M1.
        
             | savoytruffle wrote:
             | If installed using a packaged installer, or the App Store,
             | the translation is done during installation instead of at
             | first run. So, slow 1st launch may be uncommon for a lot of
             | apps or users.
        
       | hinkley wrote:
       | I remember years ago when Java adjacent research was all the
       | rage, HP had a problem that was "Rosetta lite" if you will. They
       | had a need to run old binaries on new hardware that wasn't
       | exactly backward compatible. They made a transpiler that worked
       | on binaries. It might have even been a JIT but that part of the
       | memory is fuzzy.
       | 
       | What made it interesting here was that as a sanity check they
       | made an A->A mode where they took in one architecture and spit
       | out machine code for the same architecture. The output was faster
       | than the input. Meaning that even native code has some room for
       | improvement with JIT technology.
       | 
       | I have been wishing for years that we were in a better place with
       | regard to compilers and NP complete problems where the compilers
       | had a fast mode for code-build-test cycles and a very slow
       | incremental mode for official builds. I recall someone telling me
       | the only thing they liked about the Rational IDE (C and C++?) was
       | that it cached precompiled headers, one of the Amdahl's Law areas
       | for compilers. If you changed a header, you paid the
       | recompilation cost and everyone else got a copy. I love whenever
       | the person that cares about something gets to pay the consequence
       | instead of externalizing it on others.
       | 
       | And having some CI machines or CPUs that just sit around chewing
       | on Hard Problems all day for that last 10% seems to me to be a
       | really good use case in a world that's seeing 16 core consumer
       | hardware. Also caching hints from previous runs is a good thing.
        
         | fuckstick wrote:
         | > The output was faster than the input.
         | 
         | So if you ran the input back through the output multiple times
         | then that means you could eventually get the runtime down to 0.
        
           | twic wrote:
           | But unfortunately, the memory use goes to infinity.
        
           | avidiax wrote:
           | Probably the output of the decade-old compiler that produced
           | the original binary had no optimizations.
        
             | hinkley wrote:
             | That too but the eternal riddle of optimizer passes is
             | which ones reveal structure and which obscure it. Do I loop
             | unroll or strength reduce first? If there are heuristics
             | about max complexity for unrolling or inlining then it
             | might be "both".
             | 
             | And then there's processor family versus this exact model.
        
         | zaphirplane wrote:
          | Is this for Itanium?
        
         | tomcam wrote:
         | I'm likely misunderstanding what you said, but I thought pre-
         | compiled headers were pretty much standard these days.
        
         | wmf wrote:
         | https://www.hpl.hp.com/techreports/1999/HPL-1999-78.html
        
           | travisgriggs wrote:
           | It was particularly poignant at the time because JITed
           | languages were looked down on by the "static compilation
           | makes us faster" crowd. So it was a sort of "wait a minute
           | Watson!" moment in that particular tech debate.
           | 
            | No one cares as much nowadays; we've moved our overrated
           | opinion battlegrounds to other portions of what we do.
        
             | pjmlp wrote:
             | I eventually changed my opinion into JIT being the only way
              | to make dynamic languages faster, while strongly typed ones
             | can benefit from having both AOT/JIT for different kinds of
             | deployment scenarios, and development workflows.
        
               | titzer wrote:
               | Dynamic languages need inline caches, type feedback, and
               | fairly heavy inlining to be competitive. Some of that can
               | be gotten offline, e.g. by doing PGO. But you can't, in
               | general, adapt to a program that suddenly changes phases,
               | or rebinds a global that was assumed a constant, etc.
               | Speculative optimizations with deopt are what make
               | dynamic languages fast.
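               | 
               | To make the inline-cache-plus-speculation idea concrete,
               | here's a rough C sketch (all names invented; a real JIT
               | bakes the speculation into generated machine code and
               | deoptimizes when it stops holding): a property read that
               | speculates on the last-seen object shape, with a generic
               | fallback that refills the cache:
               | 
               |   #include <stddef.h>
               |   #include <stdio.h>
               | 
               |   typedef struct { int shape_id; int x, y; } Obj;
               | 
               |   /* per-call-site cache: last shape + field offset */
               |   static int cached_shape = -1;
               |   static size_t cached_off;
               | 
               |   static int slow_path(Obj *o, const char *name) {
               |       /* generic lookup, then refill the cache */
               |       size_t off = (name[0] == 'x')
               |           ? offsetof(Obj, x) : offsetof(Obj, y);
               |       cached_shape = o->shape_id;
               |       cached_off = off;
               |       return *(int *)((char *)o + off);
               |   }
               | 
               |   static int get_prop(Obj *o, const char *name) {
               |       if (o->shape_id == cached_shape)  /* fast path */
               |           return *(int *)((char *)o + cached_off);
               |       return slow_path(o, name);
               |   }
               | 
               |   int main(void) {
               |       Obj a = {1, 10, 20};
               |       printf("%d\n", get_prop(&a, "y"));  /* miss */
               |       printf("%d\n", get_prop(&a, "y"));  /* hit */
               |       return 0;
               |   }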
        
               | hinkley wrote:
               | Before I talked myself out of writing my own programming
               | language, I used to have lunch conversations with my
               | mentor who was also speed obsessed about how JIT could
               | meet Knuth in the middle by creating a collections API
               | with feedback guided optimization, using it for algorithm
               | selection and tuning parameters by call site.
               | 
               | For object graphs in Java you can waste exorbitant
               | amounts of memory by having a lot of "children" members
               | that are sized for a default of 10 entries but the normal
               | case is 0-2. I once had to deoptimize code where someone
               | tried to do this by hand and the number they picked was 6
               | (just over half of the default). So when the average
               | jumped to 7, then the data structure ended up being 20%
               | larger than the default behavior instead of 30% smaller
               | as intended.
               | 
                | For a server workflow, having data structures tuned to
               | larger pools of objects with more complex comparison
               | operations can also be valuable, but I don't want that
               | kitchen sink stuff on mobile or in an embedded app.
               | 
               | I still think this is viable, but only if you are clever
               | about gathering data. For instance the incremental
               | increase in runtime for telemetry data is quite high on
               | the happy path. But corner cases are already expensive,
               | so telemetry adds only a few percent there instead of
               | double digits.
               | 
               | The nonstarter for this ended up being that most
               | collections APIs violate Liskov, so you almost need to
               | write your own language to pick a decomposition that
               | doesn't. Variance semantics help a ton but they don't
               | quite fix LSP.
        
               | mikepurvis wrote:
               | I think I landed in a place where it's basically "the
               | compiler has insufficient information to achieve ideal
               | optimization because some things can only be known at
               | runtime."
               | 
               | Which is not exclusively an argument for runtime JIT-- it
               | can also be an argument for instrumenting your runtime
               | environment, and feeding that profiling data back to the
               | compiler to help it make smarter decisions the next time.
               | But that's definitely a more involved process than just
               | baking it into the same JavaScript interpreter used by
               | everyone-- likely well worth it in the case of things
               | like game engines, though.
        
               | masklinn wrote:
               | It's also an argument for having much more expressive and
               | precise type systems, so the compiler has better
               | information.
               | 
               | Once you've managed to debug the codegen anyway (see: The
               | Long and Arduous Story of Noalias).
        
               | mikepurvis wrote:
               | Is it? I'd love to see a breakdown of what classes of
               | information can be gleaned from profile data, and how
               | much of an impact each one has in isolation in terms of
               | optimization.
               | 
               | Naively, I would have assumed that branch information
               | would be most valuable, in terms of being able to guide
               | execution toward the hot path and maximize locality for
               | the memory accesses occurring on the common branches. And
               | that info is not something that would be assisted by more
               | expressive types, I don't think.
        
               | titzer wrote:
               | Darn it, replied too early. See sibling comment I just
               | posted. The problem with dynamic languages is that you
               | need to speculate and be ready to undo that speculation.
        
               | notriddle wrote:
               | https://tomaszs2.medium.com/how-
               | rust-1-64-became-10-20-faste...
               | 
               | https://news.ycombinator.com/item?id=33306945
        
               | bluGill wrote:
                | The problem with JIT is that not all information known
                | at runtime is the correct information to optimize on.
                | 
                | In finance the performance-critical code path is often
                | the one run least often. That is, you have
                | if(unlikely_condition) {run_time_sensitive_trade();}. In
                | this case you need to tell the compiler to accept a
                | branch misprediction (and a pipeline stall) most of the
                | time, so that in the rare case that counts the pipeline
                | doesn't stall.
               | 
               | The above is a rare corner case for sure, but it is one
               | of those weird exceptions you always need to keep in mind
               | when trying to make any blanket rule.
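                | 
                | With GCC/Clang the usual knob for this is
                | __builtin_expect (the Linux kernel's likely()/unlikely()
                | macros wrap it). A rough sketch with made-up function
                | bodies, where the rare-but-critical branch is
                | deliberately marked as the expected one so it gets the
                | straight-line, optimized layout:
                | 
                |   #include <stdbool.h>
                |   #include <stdio.h>
                | 
                |   static bool unlikely_condition(void) {
                |       return false;
                |   }
                |   static void run_time_sensitive_trade(void) {
                |       puts("trade");
                |   }
                |   static void do_background_work(void) {}
                | 
                |   void poll_market(void)
                |   {
                |       /* rarely taken, but its latency is what
                |          matters, so claim it is the likely case */
                |       if (__builtin_expect(unlikely_condition(), 1))
                |           run_time_sensitive_trade();
                |       else
                |           do_background_work();
                |   }
                | 
                |   int main(void) { poll_market(); return 0; }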
        
               | dahfizz wrote:
               | The other issue with JIT is that it is unreliable. It
               | optimizes code by making assumptions. If one of the
               | assumptions is wrong, you pay a large latency penalty. In
               | my field of finance, having reliably low latency is
               | important. Being 15% faster on average, but every once in
               | a while you will be really slow, is not something
               | customers will go for.
        
             | saagarjha wrote:
             | I take it you are not very familiar with the website known
             | as Hacker News.
        
         | AussieWog93 wrote:
         | Outside of gaming, or hyper-CPU-critical workflows like video
         | editing, I'm not really sure if people actually even care about
         | that last 10% of performance.
         | 
         | I know most of the time I get frustrated by everyday software,
          | it's doing something unnecessary in a long loop, and possibly
         | forgetting to check for Windows messages too.
        
           | koala_man wrote:
           | Performance also translates into better battery life and
           | cheaper datacenters.
        
         | hamstergene wrote:
         | Could it be simply because many binaries were produced by much
          | older, outdated optimizers, or optimized for size?
         | 
         | Also, optimizers usually target "most common denominator" so
         | native binaries rarely use full power of current instruction
         | set.
         | 
         | Jumping from that peculiar finding to praising runtime JIT
         | feels like a longshot. To me it's more of an argument towards
         | distributing software in intermediate form (like Apple Bitcode)
         | and compiling on install, tailoring for the current processor.
        
           | jasonwatkinspdx wrote:
           | All reasonable points, but examples where JIT has an
           | advantage are well supported in research literature. The
           | typical workload that shows this is something with a very
           | large space of conditionals, but where at runtime there's a
           | lot of locality, eg matching and classification engines.
        
           | AceJohnny2 wrote:
           | > _Or optimized for size._
           | 
           | Note that on gcc (I think) and clang (I'm sure), -Oz is a
           | strict superset of -O2 (the "fast+safe" optimizations,
           | compared to -O3 that can be a bit too aggressive, given C's
           | minefield of Undefined Behavior that compilers can exploit).
           | 
           | I'd guess that, with cache fit considerations, -Oz can even
           | be faster than -O2.
        
           | astrange wrote:
           | > To me it's more of an argument towards distributing
           | software in intermediate form (like Apple Bitcode) and
           | compiling on install, tailoring for the current processor.
           | 
           | This turns out to be quite difficult, especially if you're
           | using bitcode as a compiler IL. You have to know what the
           | right "intermediate" level is; if assumptions change too much
           | under you then it's still too specific. And it means you
           | can't use things like inline assembly.
           | 
           | That's why bitcode is dead now.
           | 
           | By the way, I don't know why this thread is about how JITs
           | can optimize programs when this article is about how Rosetta
           | is not a JIT and intentionally chose a design that can't
           | optimize programs.
        
             | lmm wrote:
             | > This turns out to be quite difficult, especially if
             | you're using bitcode as a compiler IL. You have to know
             | what the right "intermediate" level is; if assumptions
             | change too much under you then it's still too specific. And
             | it means you can't use things like inline assembly.
             | 
             | > That's why bitcode is dead now.
             | 
             | Isn't this what Android does today? Applications are
             | distributed in bytecode form and then optimized for the
             | specific processor at install time.
        
         | chrisseaton wrote:
         | I've run Ruby C extensions on a JIT faster than on native, due
         | to things like inlining and profiling working more effectively
         | at runtime.
        
         | jeffbee wrote:
         | Post-build optimization of binaries without changing the target
         | CPU is common. See BOLT
         | https://github.com/facebookincubator/BOLT
        
         | mark_undoio wrote:
         | Something that fascinates me about this kind of A -> A
         | translation (which I associate with the original HP Dynamo
         | project on HPPA CPUs) is that it was able to effectively yield
         | the performance effect of one or two increased levels of -O
         | optimization flag.
         | 
         | Right now it's fairly common in software development to have a
         | debug build and a release build with potentially different
         | optimisation levels. So that's two builds to manage - if we
         | could build with lower optimisation and still effectively run
         | at higher levels then that's a whole load of build/test
         | simplification.
         | 
         | Moreover, debugging optimised binaries is fiddly due to
         | information that's discarded. Having the original, unoptimised,
         | version available at all times would give back the fidelity
         | when required (e.g. debugging problems in the field).
         | 
         | Java effectively lives in this world already as it can use high
         | optimisation and then fall back to interpreted mode when
         | debugging is needed. I wish we could have this for C/C++ and
         | other native languages.
        
           | foobiekr wrote:
            | One of the engineers I was working with on a project, who
            | was from Transitive (the company that made QuickTransit,
            | which became Rosetta), found that their JIT-based translator
            | could not deliver significant performance increases for A->A
            | outside of pathological cases, and it was very mature
            | technology at the time.
           | 
           | I think it's a hypothetical. The Mill Computing lectures talk
           | about a variant of this, which is sort of equivalent to an
           | install-time specializer for intermediate code which might
           | work, but that has many problems (for one thing, it breaks
           | upgrades and is very, very problematic for VMs being run on
           | different underlying hosts).
        
           | saagarjha wrote:
           | It depends greatly on which optimization levels you're going
            | through. -O0 to -O1 can easily be a 2-3x performance
           | improvement, which is going to be hard to get otherwise. -O2
           | to -O3 might be 15% if you're lucky, in which case -O+LTO+PGO
           | can absolutely get you wins that beat that.
        
             | bluGill wrote:
             | -O2 to -O3 has in some benchmarks made things worse. In
              | others it is a massive win, but in general going above -O2
              | should not be done without benchmarking the code. There are
              | some optimizations that can make things worse or better for
              | reasons that the compiler cannot know.
        
               | astrange wrote:
               | Over-optimizing your "cold" code can also make things
               | worse for the "hot" code, eg by growing code size so much
               | that briefly entering the cold space kicks everything out
               | of caches.
        
               | hinkley wrote:
               | I have often lamented not being able to hint to the JIT
               | when I've transitioned from startup code to normal
               | operation. I don't need my Config file parsing optimized.
               | But the code for interrogating the Config at runtime
               | better be.
               | 
               | Everything before listen() is probably run once. Except
                | not every program calls listen().
        
               | hinkley wrote:
               | And then there's always the outlier where optimizing for
               | size makes the working memory fit into cache and thus the
               | whole thing substantially faster.
        
         | freedomben wrote:
         | If JIT-ing a statically compiled input makes it faster, does
         | that mean that JIT-ing itself is superior or does it mean that
         | the static compiler isn't outputting optimal code? (real
         | question. asked another way, does JIT have optimizations it can
         | make that a static compiler can't?)
        
           | vips7L wrote:
           | Yes, the JIT has more profile guided data as to what your
           | program actually does at runtime, therefore it can optimize
           | better.
        
             | gpderetta wrote:
              | On the other hand some optimizations are so expensive that a
             | JIT just doesn't have the execution budget to perform them.
             | 
              | Probably the optimal system is a hybrid iterative JIT/AOT
             | compiler (which incidentally was the original objective of
             | LLVM).
        
           | mockery wrote:
           | In addition to the sibling comments, one simple opportunity
           | available to a JIT and not AOT is 100% confidence about the
           | target hardware and its capabilities.
           | 
           | For example AOT compilation often has to account for the
           | possibility that the target machine might not have certain
           | instructions - like SSE/AVX vector ops, and emit both SSE and
           | non-SSE versions of a codepath with, say, a branch to pick
           | the appropriate one dynamically.
           | 
           | Whereas a JIT knows what hardware it's running on - it
           | doesn't have to worry about any other CPUs.
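            | 
            | A rough sketch of that AOT-side dispatch with GCC/Clang on
            | x86 (function names made up); the feature check happens at
            | runtime on the user's machine, whereas a JIT already knows
            | the answer and can emit the AVX2 version directly:
            | 
            |   #include <stddef.h>
            |   #include <stdio.h>
            | 
            |   __attribute__((target("avx2")))
            |   static float sum_avx2(const float *a, size_t n)
            |   {
            |       float s = 0;  /* compiler may vectorize this */
            |       for (size_t i = 0; i < n; i++) s += a[i];
            |       return s;
            |   }
            | 
            |   static float sum_portable(const float *a, size_t n)
            |   {
            |       float s = 0;
            |       for (size_t i = 0; i < n; i++) s += a[i];
            |       return s;
            |   }
            | 
            |   float sum(const float *a, size_t n)
            |   {
            |       /* CPUID-backed feature check picks the path */
            |       if (__builtin_cpu_supports("avx2"))
            |           return sum_avx2(a, n);
            |       return sum_portable(a, n);
            |   }
            | 
            |   int main(void)
            |   {
            |       float v[4] = {1, 2, 3, 4};
            |       printf("%g\n", sum(v, 4));
            |       return 0;
            |   }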
        
             | duped wrote:
             | AOT compilers support this through a technique called
             | function multi-versioning. It's not free and only goes so
             | far, but it isn't reserved to JITs.
             | 
             | The classical reason to use FMV is for SIMD optimizations,
             | fwiw
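              | 
              | For reference, a minimal sketch of FMV with GCC/Clang's
              | target_clones attribute (hypothetical function): the
              | compiler emits the listed variants plus a resolver that
              | picks one to run based on the host CPU:
              | 
              |   #include <stddef.h>
              | 
              |   __attribute__((target_clones("avx2", "default")))
              |   void scale(float *a, size_t n, float k)
              |   {
              |       for (size_t i = 0; i < n; i++)
              |           a[i] *= k;
              |   }
              | 
              |   int main(void)
              |   {
              |       float v[4] = {1, 2, 3, 4};
              |       scale(v, 4, 2.0f);
              |       return 0;
              |   }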
        
             | acdha wrote:
             | One great example of this was back in the P4 era where
             | Intel hit higher clock speeds at the expense of much higher
             | latency. If you made a binary for just that processor a
             | smart compiler could use the usual tricks to hit very good
             | performance, but that came at the expense of other
             | processors and/or compatibility (one appeal to the AMD
             | Athlon & especially Opteron was that you could just run the
             | same binary faster without caring about any of that[1]). A
             | smart JIT could smooth that considerably but at the time
             | the memory & time constraints were a challenge.
             | 
             | 1. The usual caveats about benchmarking what you care about
             | apply, of course. The mix of webish things I worked on and
             | scientists I supported followed this pattern, YMMV.
        
           | andrewaylett wrote:
           | It depends on what the JIT does exactly, but in general _yes_
           | a JIT _may_ be able to make optimisations that a static
            | compiler won't be aware of because a JIT can optimise for
           | the specific data being processed.
           | 
           | That said, a sufficiently advanced CPU could also make those
           | optimisations on "static" code. That was one of the things
           | Transmeta had been aiming towards, I think.
        
           | kmeisthax wrote:
           | It's more the case that the ahead-of-time compilation is
           | suboptimal.
           | 
           | Modern compilers have a thing called PGO (Profile Guided
           | Optimization) that lets you take a compiled application, run
           | it and generate an execution profile for it, and then compile
           | the application again using information from the profiling
           | step. The reason why this works is that lots of optimization
           | involves time-space tradeoffs that only make sense to do if
           | the code is frequently called. JIT _only_ runs on frequently-
           | called code, so it has the advantage of runtime profiling
           | information, while ahead-of-time (AOT) compilers have to make
           | educated guesses about what loops are the most hot. PGO
           | closes that gap.
           | 
           | Theoretically, a JIT _could_ produce binary code hyper-
            | tailored to a particular user's habits and their computer's
           | specific hardware. However, I'm not sure if that has that
           | much of a benefit versus PGO AOT.
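            | 
            | For anyone who hasn't used it, the PGO loop with GCC looks
            | roughly like the comments below (Clang is similar, via
            | -fprofile-instr-generate and llvm-profdata); the toy
            | program just gives the profiler an obvious hot/cold split
            | to work with:
            | 
            |   /* build & run:
            |    *   gcc -O2 -fprofile-generate pgo.c -o pgo
            |    *   ./pgo                 # writes *.gcda profile
            |    *   gcc -O2 -fprofile-use pgo.c -o pgo
            |    */
            |   #include <stdio.h>
            | 
            |   static long work(long x) { return x * 3 + 1; }
            | 
            |   static void rare(long x) {
            |       fprintf(stderr, "unusual: %ld\n", x);
            |   }
            | 
            |   int main(void)
            |   {
            |       long acc = 0;
            |       for (long i = 0; i < 10000000; i++) {
            |           if (i % 1000000 == 0)    /* cold */
            |               rare(i);
            |           else                     /* hot */
            |               acc += work(i);
            |       }
            |       printf("%ld\n", acc);
            |       return 0;
            |   }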
        
             | com2kid wrote:
             | > Theoretically, a JIT could produce binary code hyper-
             | tailored to a particular user's habits and their computer's
             | specific hardware. However, I'm not sure if that has that
             | much of a benefit versus PGO AOT.
             | 
              | In theory JIT can be a _lot_ more efficient, optimizing
              | not only for the exact instruction set but also doing
              | per-CPU-architecture optimizations, such as instruction
              | length, pipeline depth, cache sizes, etc.
             | 
             | In reality I doubt most compiler or JIT development teams
             | have the resources to write and test all those potential
             | optimizations, especially as new CPUs are coming out all
             | the time, and each set of optimizations is another set of
             | tests that has to be maintained.
        
               | bluGill wrote:
               | gcc and clang at least have options so you can optimize
               | for specific CPUs. I'm not sure how good they are (most
               | people want a generic optimization that runs well on all
               | CPUs of the family, so there likely is lots of room for
               | improvement with CPU specific optimization), but they can
               | do that. This does (or at least can, again it probably
               | isn't fully implemented), account for instruction length,
               | pipeline depth, cache size.
               | 
               | The Javascript V8 engine, and the JVM both are popular
               | and supported enough that I expect the teams working on
               | them take advantage of every trick they can for specific
               | CPUs, they have a lot of resources for this. (at least
               | the major x86 and ARM chips - maybe they don't for MIPS
                | or some uncommon variant of ARM...). Of course there are
               | other JIT engines, some uncommon ones don't have many
               | resources and won't do this.
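                | 
                | For the gcc/clang side, the relevant switches are
                | -march (which ISA extensions may be used, e.g.
                | -march=native for the build machine) and -mtune
                | (scheduling heuristics only; output still runs on
                | older CPUs). A tiny sketch of a loop that benefits:
                | 
                |   /* cc -O2 loop.c           generic baseline
                |      cc -O2 -march=native loop.c
                |                             may use AVX2/FMA etc. */
                |   #include <stddef.h>
                | 
                |   void saxpy(float *y, const float *x,
                |              float a, size_t n)
                |   {
                |       /* auto-vectorized per the chosen -march */
                |       for (size_t i = 0; i < n; i++)
                |           y[i] += a * x[i];
                |   }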
        
               | titzer wrote:
               | > take advantage of every trick they can for specific
               | CPUs
               | 
               | Not to the extent clang and gcc do, no. V8 does, e.g. use
               | AVX instructions and some others if they are indicated to
               | be available by CPUID. TurboFan does global scheduling in
               | moving out of the sea of nodes, but that is not machine-
               | specific. There was an experimental local instruction
               | scheduler for TurboFan but it never really helped big
               | cores, while measurements showed it would have helped
               | smaller cores. It didn't actually calculate latencies; it
               | just used a greedy heuristic. I am not sure if it was
               | ever turned on. TurboFan doesn't do software pipelining
               | or unroll/jam, though it does loop peeling, which isn't
               | CPU-specific.
        
               | astrange wrote:
               | > gcc and clang at least have options so you can optimize
               | for specific CPUs. I'm not sure how good they are
               | 
               | They are not very good at it, and can't be. You can look
               | inside them and see the models are pretty simple; the
               | best you can do is optimize for the first step (decoder)
               | of the CPU and avoid instructions called out in the
               | optimization manual as being especially slow. But on an
               | OoO CPU there's not much else you can do ahead of time,
               | since branches and memory accesses are unpredictable and
               | much slower than in-CPU resource stalls.
        
               | duped wrote:
                | Like another commenter said, JIT compilers do this today.
               | 
               | The thing that makes this mostly theoretical is that the
               | underlying assumption is only true when you neglect that
               | an AOT has zero run-time cost while a JIT compiler has to
               | execute the code it's optimizing _and_ the code to decide
                | if it's worth optimizing and generate new code.
               | 
               | So JIT compiler optimizations are a bit different than
               | AOT optimizations since they have to both generate
                | faster/smaller code _and_ execute the code that performs
               | the optimization. The problem is that most optimizations
               | beyond peephole are quite expensive.
               | 
               | There's another thing that AOT compilers don't need to
                | deal with, which is being wrong. Production JITs have to
               | implement dynamic de-optimization in the case that an
               | optimization was built on a bad assumption.
               | 
               | That's why JITs are only faster in theory (today), since
               | there are performance pitfalls in the JIT itself.
        
               | titzer wrote:
               | Nearly all JS engines are doing concurrent JIT
               | compilation now, so some of the compilation cost is moved
               | off the main thread. Java JITs have had multiple compiler
               | threads for more than a decade.
        
               | saagarjha wrote:
               | The well funded production JIT compilers (HotSpot, V8,
               | etc.) absolutely do take advantage of these. The vector
               | ISA can sometimes be unwieldy to work with but things
               | like replacing atomics, using unaligned loads, or taking
                | advantage of differing pointer representations are common.
        
               | com2kid wrote:
               | They do some auto-vectorization, but AFAIK they don't do
               | micro-optimizations for different CPUs.
        
           | rowanG077 wrote:
           | A JIT can definitely make optimizations that a static
           | compiler can't. Simply by virtue of it having concrete
           | dynamic real-time information.
        
           | ketralnis wrote:
           | It means that in this case, the static compiler emitted code
           | that could be further optimised, that's all. It doesn't mean
            | that that's always the case, or that static compilers
            | _can't_ produce optimal code, or that either technique is
           | "better" than the other.
           | 
           | An easy example is code compiled for 386 running on a 586.
           | The A->A compiler can use CPU features that weren't available
           | to the 386. As with PGO you have branch prediction
           | information that's not available to the static compiler. You
           | can statically compile the dynamically linked dependencies,
           | allowing inlining that wasn't previously available.
           | 
           | On the other hand you have to do all of that. That takes
           | warmup time just like a JIT.
           | 
           | I think the road to enlightenment is letting go of phrasing
           | like "is superior". There are lots of upsides and downsides
           | to pretty much every technique.
        
         | sergimas15 wrote:
         | nice
        
         | hawflakes wrote:
         | People have mentioned the Dynamo project from HP. But I think
         | you're actually thinking of the Aries project (I worked in a
         | directly adjacent project) that allowed you to run PA-RISC
         | binaries on IA-64.
         | 
         | https://nixdoc.net/man-pages/HP-UX/man5/Aries.5.html
        
       | dynjo wrote:
       | It is quite astonishing how seamless Apple has managed to make
       | the Intel to ARM transition, there are some seriously smart minds
       | behind Rosetta. I honestly don't think I had a single software
       | issue during the transition!
        
         | wombat-man wrote:
         | There's an annoying dwarf fortress bug but other than that,
         | same
        
         | xxpor wrote:
         | They've almost made it too good. I have to run software that
         | ships an x86 version of CPython, and it just deeply offends me
         | on a personal level, even though I can't actually detect any
         | slowdown (probably because lol python in the first place)
        
         | ChuckNorris89 wrote:
         | If that blows your mind, you should see how Microsoft did the
          | emulation of the PowerPC based Xenon chip to X86 so you can play
         | Xbox 360 games on Xbox One.
         | 
         | There's an old pdf from Microsoft researchers with the details
         | but I can't seem to find it right now.
        
           | RedShift1 wrote:
           | Any good videos on that?
        
         | poulpy123 wrote:
         | having total control on the hardware and the software didn't
         | hurt for sure
        
           | manv1 wrote:
            | Qualcomm (and Broadcom) has total control on the hardware
           | and software side of a lot of stuff and their stuff is shit.
           | 
           | It's not about control, it's about good engineering.
        
             | stevefan1999 wrote:
             | It's about both control and engineering in Apple's case.
        
             | porcc wrote:
             | So many parts across the stack need to work well for this
             | to go well. Early support for popular software is a good
             | example. This goes from partnerships all the way down to
             | hardware designers.
             | 
             | I'd argue it's not about engineering more than it is about
             | good organizational structure.
        
               | iamstupidsimple wrote:
               | And having execs who design the organizational structure
               | around those goals is part of what makes good engineering
               | :)
        
             | zeusk wrote:
             | That's really not the case, if you're in Microsoft or
             | Linux's position you can't really change the OS
             | architecture or driver models for any particular vendor.
             | 
             | That generality and general knowledge separation between
             | different stacks leaves quite a lot of efficiency on the
             | table.
        
         | esskay wrote:
         | It has been extremely smooth sailing. I moved my own mac over
         | to it about a year ago, swapping a beefed up MPB for a budget
          | friendly M1 Air (which has massively smashed it out of the park
         | performance wise, far better than I was expecting). Didn't have
         | a single issue.
         | 
         | My work mac was upgraded to a MBP M1 Pro and again, very
         | smooth. I had one minor issue with a docker container not being
         | happy (it was an x86 instance) but one minor tweak to the
         | docker compose file and I was done.
         | 
          | It does still amaze me how good these new machines are. It's
          | almost enough to redeem Apple for the total pile of
         | overheating, underperforming crap that came directly before the
         | transition (aka any mac with a touchbar).
        
         | js2 wrote:
         | I have a single counter-example. Mailplane, a Gmail SSB. It's
         | Intel including its JS engine, making the Gmail UI too sluggish
         | to use.
         | 
         | I've fallen back to using Fluid, an ancient and also Intel-
         | specific SSB, but its web content runs in a separate WebKit ARM
         | process so it's plenty fast.
         | 
          | I've emailed the Mailplane author but they won't release a
          | Universal version of the app since they've EOL'd Mailplane.
         | 
         | I have yet to find a Gmail SSB that I'm happy with under ARM.
         | Fluid is a barely workable solution.
        
           | cmg wrote:
           | For what it's worth, I use Mailplane on an M1 MacBook Air
           | (8GB) with 2 Gmail tabs and a calendar tab without noticeable
           | issues.
           | 
           | Unfortunately the developers weren't able to get Google to
           | work with them on a policy change that impacted the app [0]
           | [1] and so gave up and have moved on to a new and completely
           | different customer support service.
           | 
           | [0] https://developers.googleblog.com/2020/08/guidance-for-
           | our-e... [1] https://mailplaneapp.com/blog/entry/mailplane_st
           | opped_sellin...
           | 
           | So unfortunately
        
         | perardi wrote:
         | I think the end of support for 32-bit applications in 2019
         | helped, slightly, with the run-up.
         | 
         | Assuming you weren't already shipping 64-bit
         | applications...which would be weird...updating the application
         | probably required getting everything into a contemporary
         | version of Xcode, cleaning out the cruft, and getting it
         | compiling nice and cleanly. After that, the ARM transition was
         | kind of a "it just works" scenario.
         | 
         | Now, I'm sure Adobe and other high-performance application
         | developers had to do some architecture-specific tweaks, but,
         | gotta think Apple clued them in ahead of time as to what was
         | coming.
        
         | chrchang523 wrote:
         | I finally started seriously using a M1 work laptop yesterday,
         | and I'm impressed. More than twice as fast on a compute-
         | intensive job as my personal 2015 MBP, with a binary compiled
         | for x86 and with hand-coded SIMD instructions.
        
           | robohoe wrote:
           | Are you me lol? I'm on my third day on M1 Pro. Battery life
           | is nuts. I can be on video calls and still do dev work
           | without worrying about charging. And the thing runs cool!
        
           | dexterdog wrote:
           | It helps that there were almost 2 years between the release
           | and your adoption. I had a very early M1 and it was not too
           | bad, but there were issues. I knew that going in.
        
             | EricE wrote:
             | I had an M1 Air early on and I didn't run into any issues.
             | Even the issues with apps like Homebrew were resolved
             | within 3-4 months of the M1 debut. It's amazing just how
              | seamless such a major architectural transition was and
             | continues to be!
        
         | radicaldreamer wrote:
         | Since this is the company's third big arch transition, cross-
         | compilation and compatibility is probably considered a core
         | competency for Apple to maintain internally.
        
           | mixmastamyk wrote:
            | And NeXT was multi-platform as well.
        
         | AnIdiotOnTheNet wrote:
         | It isn't their first rodeo: 68k->PPC->x86_64->ARM.
        
           | darzu wrote:
           | You gotta think there's been a lot of churn and lost
           | knowledge at the company between PPC->x86_64 (2006) and now
           | though.
        
           | esskay wrote:
           | Rosetta 1 and the PPC -> x86 move wasn't anywhere near as
           | smooth, I recall countless problems with that switch. Rosetta
           | 2 is a totally different experience, and so much better in
           | every way.
        
           | kevincox wrote:
            | But they've been on x86_64 for a _long_ time. How much of
            | that knowledge is still around? Probably some traces of it
            | have been institutionalized but it isn't the same as if they
            | just grabbed the same team and made them do it again a year
            | after the last transition.
        
           | toast0 wrote:
           | nitpick, they did PPC -> x86 (32), the x86_64 bit transition
           | was later (no translation layer though). They actually had
           | 64-bit PPC systems on the G5 when they switched to Intel
           | 32-bit, but Rosetta only does 32-bit PPC -> 32-bit x86; it
           | would have been rare to have released 64-bit PPC only
           | software.
        
             | EricE wrote:
              | They had a 64-bit Carbon translation layer, but spiked it to
             | force Adobe and some other large publishers to go native
             | Intel. There was a furious uproar at the time, but it
             | turned out to be the right decision.
        
       | rgiacobazzi wrote:
       | Great article!
        
       ___________________________________________________________________
       (page generated 2022-11-09 23:00 UTC)