[HN Gopher] Why is Rosetta 2 fast? ___________________________________________________________________ Why is Rosetta 2 fast? Author : pantalaimon Score : 443 points Date : 2022-11-09 15:40 UTC (7 hours ago) (HTM) web link (dougallj.wordpress.com) (TXT) w3m dump (dougallj.wordpress.com) | lunixbochs wrote: | > To see ahead-of-time translated Rosetta code, I believe I had | to disable SIP, compile a new x86 binary, give it a unique name, | run it, and then run otool -tv /var/db/oah/*/unique-name.aot | (or use your tool of choice - it's just a Mach-O binary). This | was done on an old version of macOS, so things may have changed and | improved since then. | | My aotool project uses a trick to extract the AOT binary without | root or disabling SIP: | https://github.com/lunixbochs/meta/tree/master/utils/aotool | karmakaze wrote: | Vertical integration. My understanding was it's because the Apple | silicon ARM has special support to make it fast. Apple has had | enough experience to know that some hardware support can go a | long way to making the binary emulation situation better. | saagarjha wrote: | That's not correct, the article goes into details why. | nwallin wrote: | That _is_ correct, the article goes into details why. See the | "Apple's Secret Extension" section as well as the "Total | Store Ordering" section. | | The "Apple's Secret Extension" section talks about how the M1 | has 4 flag bits and the x86 has 6 flag bits, and how | emulating those 2 extra flags would make every add/sub/cmp | instruction significantly slower. Apple has an undocumented | extension that adds 2 more flag bits to make the M1's flag | bits behave the same as x86. | | The "Total Store Ordering" section talks about how Apple has | added a non-standard store ordering to the M1 that makes the | M1 order its stores in the way x86 guarantees instead of | the way ARM guarantees. Without this, there's no good way to | translate code in and around an x86 memory | fence; if you see a memory fence in x86 code it's safe to | assume that it depends on x86 memory store semantics, and if | you don't have that you'll need to emulate it with many | mostly unnecessary memory fences, which will be devastating | for performance. | saagarjha wrote: | I'm aware of both of these extensions; they're not actually | necessary for most applications. Yes, you trade fidelity | for performance, but it's not _that_ big of a deal. The | majority of Rosetta's performance is good software | decisions and not hardware. | MikusR wrote: | The main reason, the M1/M2 being incredibly fast, is listed last. | dagmx wrote: | Perhaps if you're comparing against Intel processors, but even | on an Apple Silicon Mac, the Rosetta 2 versions of apps are no | slouch next to the native ones. | | 20% overhead for a non-native executable is very commendable. | Someone wrote: | I don't think that's the main reason. The article lists a few | things, but I think the main reason is that they made several | parts of the CPU behave identically to x86. The M1 and M2 chips: | | - can be told to do total store ordering, just as x86 does | | - have a few status flags that x86 has, but regular ARM | doesn't | | - can be told to make the FPU behave exactly as the x86 FPU | | It also helps that ARM has many more registers than x86. | Because of that the emulator can map the x86 registers to ARM | registers, and have registers to spare for use by the emulator. | postalrat wrote: | That isn't the main reason.
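|
| To make the store-ordering point concrete, here's a rough C
| analogy (just a sketch of the constraint, not Rosetta's actual
| code): under x86's total store ordering, every plain store
| effectively behaves like a release and every plain load like an
| acquire, so a translator targeting ordinary ARM has to treat
| each guest memory access roughly like this:
|
|       #include <stdatomic.h>
|
|       atomic_int data, flag;
|
|       /* x86 guest: mov [data], 42 ; mov [flag], 1 */
|       void producer(void) {
|           atomic_store_explicit(&data, 42, memory_order_release);
|           atomic_store_explicit(&flag, 1, memory_order_release);
|       }
|
|       /* x86 guest: mov eax, [flag] ; mov eax, [data] */
|       int consumer(void) {
|           if (atomic_load_explicit(&flag, memory_order_acquire))
|               return atomic_load_explicit(&data, memory_order_acquire);
|           return -1;
|       }
|
| On AArch64 those acquire/release accesses become ldar/stlr (or
| dmb-fenced loads and stores), which are much slower than plain
| ldr/str. With the M1's TSO mode enabled, plain loads and stores
| already order the way x86 promises, so the translated code needs
| no per-access barriers.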
| | If Rosetta ran x86 code at 10% the speed of native nobody would | be calling it fast. | superkuh wrote: | bogeholm wrote: | Thanks for your thoroughly objective insights. I especially | appreciate the concrete examples. | howinteresting wrote: | Here you go for a concrete example: | https://news.ycombinator.com/item?id=33493276 | saagarjha wrote: | This has nothing to do with Rosetta being incomplete (it | has pretty good fidelity). | howinteresting wrote: | It was direct corroboration of: | | > Apple users not being able to use the same hardware | peripherals or same software as other people is not a | problem, it's a feature. There's no doubt the M1/M2 chips | are fast. It's just a problem that they're only available | in crappy computers that can't run a large amount of | software or hardware. | spullara wrote: | The first time I ran into this technology was in the early 90s on | the DEC Alpha. They had a tool called "MX" that would translate | MIPS Ultrix binaries to Alpha on DEC Unix: | | https://www.linuxjournal.com/article/1044 | | Crazy stuff. Rosetta 2 is insanely good. Runs FPS video games | even. | tomcam wrote: | > Every one-byte x86 push becomes a four byte ARM instruction | | Can someone explain this to me? I don't know ARM but it just | seems to me a push should not be that expensive. | jasonwatkinspdx wrote: | The general principle is that RISC style instruction sets are | typically fixed length and with only a couple different | subformats. Like the prototypical RISC design has one format | with an opcode and 3 register fields, and then a second with an | opcode and an immediate field. This simplicity and regularity | makes the fastest possible decoding hardware much more simple | and efficient compared to something like x86 that has a simply | dumbfounding number of possible variable length formats. | | The basic bet of RISC was that larger instruction encodings | would be worth it due to the micro architectural advantages | they enabled. This more or less was proven out, though the | distinction is less distinct today with x86 decoding into uOps | and recent ARM standards being quite complex beasts. | TazeTSchnitzel wrote: | x86 has variable-length instructions, so they can be anything | from 1 to 15 bytes long. AArch64 instructions are always 4 | bytes long. | iainmerrick wrote: | This is a great writeup. What a clever design! | | I remember Apple had a totally different but equally clever | solution back in the days of the 68K-to-PowerPC migration. The | 68K had 16-bit instruction words, usually with some 16-bit | arguments. The emulator's core loop would read the next | instruction and branch directly into a big block of 64K x 8 bytes | of PPC code. So each 68K instruction got 2 dedicated PPC | instructions, typically one to set up a register and one to | branch to common code. | | What that solution and Rosetta 2 have in common is that they're | super pragmatic - fast to start up, with fairly regular and | predictable performance across most workloads, even if the | theoretical peak speed is much lower than a cutting-edge JIT. | | Anyone know how they implemented PPC-to-x86 translation? | kijiki wrote: | > Anyone know how they implemented PPC-to-x86 translation? | | They licensed Transitive's retargettable binary translator, and | renamed it Rosetta; very Apple. | | It was originally a startup, but had been bought by IBM by the | time Apple was interested. 
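|
| Circling back to the 68K jump-table design iainmerrick describes
| above: the clever bit is that the 16-bit opcode is itself the
| index into a table of fixed-size handler stubs, so dispatch is a
| single computed branch with no lookup logic. A loose C sketch of
| that idea (hypothetical names, not Apple's actual emulator code):
|
|       #include <stdint.h>
|
|       typedef struct { uint32_t d[8], a[8], pc; uint8_t *mem; } Cpu;
|       typedef void (*Handler)(Cpu *cpu, uint16_t op);
|
|       static void unimplemented(Cpu *cpu, uint16_t op) { (void)cpu; (void)op; }
|
|       /* One entry per possible 16-bit opcode: 65,536 entries. In the
|        * real emulator each entry was a fixed 8-byte pair of PPC
|        * instructions, so the opcode could be scaled straight into a
|        * branch target. */
|       static Handler dispatch[65536];
|
|       static void init_dispatch(void) {
|           for (int i = 0; i < 65536; i++) dispatch[i] = unimplemented;
|           /* e.g. dispatch[0x4E71] = handle_nop; and so on */
|       }
|
|       static void run(Cpu *cpu) {
|           for (;;) {
|               uint16_t op = (uint16_t)((cpu->mem[cpu->pc] << 8) | cpu->mem[cpu->pc + 1]);
|               cpu->pc += 2;
|               dispatch[op](cpu, op);  /* branch straight to this opcode's stub */
|           }
|       }
|
| A function-pointer table is the C approximation; the emulator
| avoided even the table load by making every stub the same size
| and jumping to table_base + opcode * 8.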
| GeekyBear wrote: | > It was originally a startup, but had been bought by IBM by | the time Apple was interested. | | Rosetta shipped in 2005. | | IBM bought Transitive in 2008. | | The last version of OS X that supported Rosetta shipped in | 2009. | | I always wondered if the issue was that IBM tried to alter | the terms of deal too much for Steve's taste. | savoytruffle wrote: | I agree it was a bit worryingly short-lived. However the | first version of Mac OS X that shipped without Rosetta 1 | support was 10.7 Lion in summer 2011 (and many people | avoided it since it was problematic). So nearly-modern Mac | OS X with Rosetta support was realistic for a while longer. | GeekyBear wrote: | > However the first version of Mac OS X that shipped | without Rosetta 1 support was 10.7 Lion | | Yes, but I was pointing out when the last version of OS X | that did support Rosetta shipped. | | I have no concrete evidence that Apple dropped Rosetta | because IBM wanted to alter the terms of the deal after | they bought Transitive, but I've always found that timing | interesting. | | In comparison, the emulator used during the 68k to PPC | transition was never removed from Classic MacOS, so the | change stood out. | r00fus wrote: | Apple is also not tied to reverse compatibility. | | Their customers are not enterprise, and consequently they | are probably the best company in the world at dictating | well-managed, reasonable shifts in customer behavior at | scale. | | So they likely had no need for Rosetta as of 2009. | runjake wrote: | Link: https://en.wikipedia.org/wiki/QuickTransit | lostgame wrote: | From what I understand; they purchased a piece of software that | already existed to translate PPC to x86 in some form or another | and iterated on it. I believe the software may have already | even been called 'Rosetta'. | | My memory is very hazy; though. While I experienced this | transition firsthand and was an early Intel adopter, that's | about all I can remember about Rosetta or where it came from. | | I remember before Adobe had released the Universal Binary CS3 | that running Photoshop on my Intel Mac was a total nightmare. | :( I learned to not be an early adopter from that whole | debacle. | saagarjha wrote: | Transitive. | runjake wrote: | Link: https://en.wikipedia.org/wiki/QuickTransit | Asmod4n wrote: | I don't know how they did it, but they did it very very slowly. | Anything "interactive" was unuseable. | lilyball wrote: | Assuming you're talking about PPC-to-x86, it was certainly | usable, though noticeably slower. Heck, I used to play Tron | 2.0 that way, the frame rate suffered but it was still quite | playable. | scarface74 wrote: | Interactive 68K programs were usually fast. The 68K programs | would still call native PPC QuickDraw code. It was processor | intensive code that was slow. Especially with the first | generation 68K emulator. | | Connectix SpeedDoubler was definitely faster. | duskwuff wrote: | Most of the Toolbox was still running emulated 68k code in | early Power Mac systems. A few bits of performance-critical | code (like QuickDraw, iirc) were translated, but most | things weren't. | klelatti wrote: | That's really interesting. You might enjoy reading about the VM | embedded into the Busicom calculator that used the Intel 4004 | [1] | | They squeezed a virtual machine with 88 instructions into less | than 1k of memory! | | [1] https://thechipletter.substack.com/p/bytecode-and-the- | busico... | wang_li wrote: | In the mists of history S. 
Wozniak wrote the SWEET-16 | interpreter for the 6502. A VM with 29 instructions | implemented in 300 bytes. | | https://en.wikipedia.org/wiki/SWEET16 | iainmerrick wrote: | That is nifty! Sounds very similar to a Forth interpreter. | vaxman wrote: | Burn. | | (unintentional, which makes it even funnier) | retskrad wrote: | Apple Silicon will be Tim Cook's legacy. | vaxman wrote: | Rosetta 3 will probably be semantic evaluation of the origin and | complete source-level reprogramming of the target. If it comes | from Apple, it will translate everything to ARM and then | digitally sign it to run in a native-mode sandbox under a version | of Safari with a supporting runtime. | hinkley wrote: | Apple is doing some really interesting but really quiet work in | the area of VMs. I feel like we don't give them enough credit but | maybe they've put themselves in that position by not bragging | enough about what they do. | | As a somewhat related aside, I have been watching Bun (low | startup time Node-like on top of Safari's JavaScript engine) with | enough interest that I started trying to fix a bug, which is | somewhat unusual for me. I mostly contribute small fixes to tools | I use at work. I can't quite grok Zig code yet so I got stuck | fairly quickly. The "bug" turned out to be default behavior in a | Zig stdlib, rather than in JavaScript code. The rest is fairly | tangential, but suffice it to say I prefer self-hosted languages; | this probably falls into the startup speed compromise. | | The low startup overhead makes their VM interesting, but the | fact that it benchmarks better than Firefox a lot of the time and | occasionally faster than v8 is quite a bit of quiet competence. | jraph wrote: | > feel like we don't give them enough credit but maybe they've | put themselves in that position by not bragging enough about | what they do. | | And maybe also by keeping the technology closed and Apple- | specific. Many people who could be interested in using it don't | have access to it. | freedomben wrote: | Exactly. As someone who would be very interested in this, but | doesn't use Apple products, it's just not exciting because it's | not accessible to me (I can't even test it as a user). If | they wanted to write a whitepaper about it to share | knowledge, that might be interesting, but given that it's | Apple I'm not gonna hold my breath. | saagarjha wrote: | Apple (mostly WebKit) writes a significant amount about how | they designed their VMs. | jolux wrote: | WebKit B3 is open source: https://webkit.org/docs/b3/ | [deleted] | Vt71fcAqt7 wrote: | I hope Rosetta is here to stay and continues development. And I | hope what is learned from it can be used to make a RISC-V version | of it. Translating native ARM to RISC-V should be much easier | than x86 to ARM as I understand it, so one could conceivably do | x86 -> ARM -> RISC-V. | rowanG077 wrote: | I hope not. Rosetta 2, as cool as it is, is a crutch to allow | Apple to transition away from x86. If it keeps being needed, | it's a massive failure for Apple and the ecosystem. | klelatti wrote: | More likely to be useful is RISC-V to Arm; then Apple can support | running virtual machines for another architecture on its | machines. | masklinn wrote: | > I hope Rosetta is here to stay and continues development. | | It almost certainly is not.
Odds are Apple will eventually | remove Rosetta II, as they did Rosetta back in the day, once | they consider the need for that bridge to be over (Rosetta was | added in 2006 in 10.4, and removed in 2011 from 10.7). | | > And I hope what is learned from it can be used to make a | RISC-V version of it. Translating native ARM to RISC-V should | be much easier than x86 to ARM as I understand it, so one could | conceivably do x86 -> ARM -> RISC-V. | | That's not going to happen unless Apple decides to switch from | ARM to RISC-V, and... why would they? They've got 15 years | experience and essentially full control on ARM. | Vt71fcAqt7 wrote: | >That's not going to happen unless Apple decides to switch | from ARM to RISC-V, and... why would they? They've got 15 | years experience and essentially full control on ARM. | | Two points here. | | * First off, Apple developers are not bound to Apple. The | knowledge gained can be used elsewhere. See Rivos and Nuvia | for example. | | * Second, Apple reportedly has already ported many of its | secondary cores to RISC-V. It's not unreasonable that they | will switch in 10 years or so. | jrmg wrote: | _Apple reportedly has already ported many of its | secondary cores to RISC-V_ | | Really? In current hardware or is this speculation? | Symmetry wrote: | If you've got some management core somewhere in your | silicon you can, with RISC-V, give it an MMU but no FPU | and save area. You're going to be writing custom embedded | code anyways so you get to save silicon by only | incorporating the features that you need instead of | having to meet the full ARM spec. And you can add your | own custom instructions for the job at hand pretty | easily. | | That would all be a terrible idea if you were doing it | for a core intended to run user applications, but that's | not the case here: Apple, Western Digital, and NVidia are | embracing RISC-V for embedded cores. If I were ARM I'd honestly be | much more worried about RISC-V's threat to my R and M | series cores than my A series cores. | my123 wrote: | Arm64 allows FPU-less designs. There are some around... | Symmetry wrote: | Sure. The FPU is optional on a Cortex M2, for instance. | But those don't have MMUs. You'd certainly need an | expensive architectural license to make something with an | MMU but no FPU if you wanted to, and given all the | requirements ARM normally imposes for software | compatibility[1] between cores, I'd tend to doubt that | they'd let you make something like that. | | [1] Explicitly testing that you don't implement total | store ordering by default is one requirement I've heard | people talk about to get a custom core licensed. | masklinn wrote: | Apple has an architecture license (otherwise they could | not design their own cores, which they've been doing for | close to a decade), and already had the ability to take | liberties beyond what the average architecture licensee | can, owing to _being one of ARM's founders_. | saagarjha wrote: | Don't think any are shipping, but they're hiring RISC-V | engineers. | Vt71fcAqt7 wrote: | >Many dismiss RISC-V for its lack of software ecosystem | as a significant roadblock for datacenter and client | adoption, but RISC-V is quickly becoming the standard | everywhere that isn't exposed to the OS. For example, | Apple's A15 has more than a dozen Arm-based CPU cores | distributed across the die for various non-user-facing | functions.
SemiAnalysis can confirm that these cores are | actively being converted to RISC-V in future generations | of hardware.[0] | | So to answer your question, it is not in currently in | hardware, but it is more than just speculation. | | [0]https://www.semianalysis.com/p/sifive-powers-google- | tpu-nasa... | klelatti wrote: | > it's not unreasonable that they will switch in 10 years | or so. | | You've not provided any rationale at all for why they | should switch their application cores let alone on this | specific timetable. | | Switching is an expensive business and there has to be a | major business benefit for Apple in return. | chris_j wrote: | For me, those two points make it clear that it would be | _possible_ for Apple to port to RISC-V. But it 's still not | clear what advantages they would gain from doing so, given | that their ARM license appears to let them do whatever they | want with CPUs that they design themselves. | Vt71fcAqt7 wrote: | The first point precludes Apple's gain from the | discussion. | quux wrote: | It would be funny/not funny if in a few years Apple removes | Rosetta 2 for Mac apps but keeps the Linux version forever so | docker can run at reasonable speeds. | kccqzy wrote: | > They've got 15 years experience | | Did you only start counting from 2007 when the iPhone was | released? All the iPods prior to that were using ARM | processors. The Apple Newton was using ARM processors. | EricE wrote: | iPods and Newton were entirely different chips and OS's. | The first iPods weren't even on an OS that Apple created - | they licensed it. | masklinn wrote: | > All the iPods prior to that were using ARM processors. | | Most of the original device was outsourced and contracted | out (for reasons of time constraint and lack of internal | expertise). PortalPlayer built the SoC and OS, not Apple. | Later SoC were sourced from SigmaTel and Samsung, until the | 3rd gen Touch. | | > The Apple Newton was using ARM processors. | | The Apple Newton was a completely different Apple, and | there were several years' gap between Jobs killing the | Newton and the birth of iPod, not to mention the completely | different purpose and capabilities. There would be no | newton-type project until the iPhone. | | Which is also when Apple started working with silicon | themselves: they acquired PA in 2008, Intrinsity in 2010, | and Passif in 2013, released their first partially in-house | SoC in 2010 (A4), and their first in-house core in 2013 | (Cyclone, in the A7). | stu2b50 wrote: | Rosetta 1 had a ticking time bomb. Apple was licensing it | from a 3rd party. Rosetta 2 is all in house as far as we | know. | | Different CEO as well. Jobs was more opinionated on | "principles" - Cook is more than happy to sell what people | will buy. I think Rosetta 2 will last. | masklinn wrote: | > Rosetta 1 had a ticking time bomb. Apple was licensing it | from a 3rd party. | | Yes, I'm sure Apple had no way of extending the license. | | > Cook is more than happy to sell what people will buy. I | think Rosetta 2 will last. | | There's no "buy" here. | | Rosetta is complexity to maintain, and an easy cut. It's | not even part of the base system. | | And "what people will buy" certainly didn't prevent | essentially removing support for non-hidpi displays from | MacOS. Which is a lot more impactful than Rosetta as far as | I'm concerned. | NavinF wrote: | > removing support for non-hidpi displays from MacOS | | Did that really reduce sales? 
Consider that the wide | availability of crappy low end hardware gave Windows | laptops a terrible reputation. Eg https://www.reddit.com/ | r/LinusTechTips/comments/yof7va/frien... | masklinn wrote: | > Consider that the wide availability of crappy low end | hardware gave Windows laptops a terrible reputation. | | Standard DPI displays are not "crappy low-end hardware"? | | I don't think there's a single widescreen display which | qualifies as hiDPI out there, that more or less doesn't | exist: a 5K 34" is around 160 DPI (to say nothing of the | downright pedestrian 5K 49" like the G9 or the AOC Agon). | fredoralive wrote: | What do you mean non HiDPI display support being removed | from Mac OS? I've been using a pair of 1920x1080 monitors | with my Mac Mini M1 just fine? Have they somehow broken | something in Mac OS 13 / Ventura? (I haven't clicked the | upgrade button yet, I prefer to let others leap boldly | first). | bpye wrote: | They've also allowed Rosetta 2 in Linux VMs - if they are | serious about supporting those use cases then I think it'll | stay. | kitsunesoba wrote: | We'll see, but even post-Cook Apple historically hasn't | liked the idea of third parties leaning on bridge | technologies for too long. Things like Rosetta are offered | as temporary affordances to allow time for devs to migrate, | not as a permanent platform fixture. | vaxman wrote: | But that 3rd party was only legally at arm's length. | TillE wrote: | What important Intel-only macOS software is going to exist | in five years? | | It's basically only games and weird tiny niches, and Apple | is pretty happy to abandon both those categories. The | saving grace is that there's very few interesting Mac- | exclusive games in the Intel era. | flomo wrote: | Yeah, Apple killed all "legacy" 32-bit support, so one | would think there's not much software which is both | x86-64 and not being actively developed. | vxNsr wrote: | 2006 Apple was very different from 2011 Apple, renewing | that license in 2011 was probably considered cost | prohibitive for the negligible benefit. | rerx wrote: | Starting with Ventura, Linux VMs can use Rosetta 2 to run | x64 executables. I expect x64 Docker containers to remain | relevant for quite a few years to come. Running those at | reasonable speeds on Apple Silicon would be huge for | developers. | dmitriid wrote: | > Jobs was more opinionated on "principles" - Cook is more | than happy to sell what people will buy. | | Well, the current "principle" is "iOS is enough, we're | going to run iOS apps on MacOS, and that's it". | | Rosetta isn't needed for that. | dmitriid wrote: | It's strange to see people downvoting this when three | days ago App Store on MacOS literally defaulted to | searching iOS and iPad apps for me | https://twitter.com/dmitriid/status/1589179351572312066 | CharlesW wrote: | > _Odds are Apple will eventually remove Rosetta II, as they | did Rosetta back in the days, once they consider the need for | that bridge to be over (Rosetta was added in 2006 in 10.4, | and removed in 2011 from 10.7)._ | | The difference is that Rosetta 1 was PPC - x86, so its | purpose ended once PPC was a fond memory. | | Today's Rosetta is a generalized x86 - ARM translation | environment that isn't just for macOS apps. For example, it | works with Apple's new virtualization framework to support | running x86_64 Linux apps in ARM Linux VMs. | | https://developer.apple.com/documentation/virtualization/run. | .. 
| gumby wrote: | > That's not going to happen unless Apple decides to switch | from ARM to RISC-V, and... why would they? They've got 15 | years experience and essentially full control on ARM. | | 15? More than a quarter century. They were one of the | original investors in ARM and have produced plenty of arm | devices since then beyond the newton and the ipod. | | I'd bet they use a bunch of risc v internally too if they | just need a little cpu to manage something locally on some | device and just want to avoid paying a tiny fee to ARM or | just want some experience with it. | | But RISC V as the main CPU? Yes, that's a long way away, if | ever. But apple is good at the long game. I wouldn't be | surprised to hear that Apple has iOS running on RISC V, but | even something like the lightning-to-HDMI adapter runs IOS on | ARM. | masklinn wrote: | > 15? More than a quarter century. They were one of the | original investors in ARM and have produced plenty of arm | devices since then beyond the newton and the ipod. | | They didn't design their own chips for most of that time. | gumby wrote: | At the same time as the ARM investment they had a Cray | for...chip design. | masklinn wrote: | Yes and? | | Apple invested in ARM and worked with ARM/Acorn on what | would become ARM6, in the early 90s. The newton uses it | (specifically the ARM610), it is a commercial failure, | later models use updated ARM CPUs to which AFAIK Apple | didn't contribute (DEC's StrongARM, and ARM's ARM710). | | <15 years pass> | | Apple starts working on bespoke designs again around the | time they start working on the iPhone, or possibly after | they realise it's succeeding. | | That doesn't mean they stopped _using_ ARM in the | meantime (they certainly didn 't). | | The iPod's SoC was not even designed internally (it was | contracted out to PortalPlayer, later generations were | provided by Samsung). 15 times and the revolution of | Jobs' return (and his immediate killing of the Newton) is | a long time for an internal team of silicon designers. | preisschild wrote: | > They've got 15 years experience and essentially full | control on ARM. | | Do they? ARM made it very clear that they consider all ARM | cores their own[1] | | [1]: https://www.theregister.com/2022/11/07/opinion_qualcomm_ | vs_a... | nicoburns wrote: | Apple is in a somewhat different position to Qualcomm in | that they were a founding member of ARM. I've also heard | rumours that aarch64 was designed by apple and donated to | ARM (hence why apple was so early to release an aarch64 | processor). So I somewhat doubt ARM will be a position to | sue them any time soon. | danaris wrote: | The Qualcomm situation is based on breaches of a specific | agreement that ARM had with Nuvia, which Qualcomm has now | bought. It's not a generalizable "ARM thinks everything | they license belongs to them fully in perpetuity" deal. | masklinn wrote: | > Do they? | | They do, yes. They were one of the founding 3 members of | ARM itself, and the primary monetary contributor. | | Through this they acquired privileges which remain extant: | they can literally add custom instructions to the ISA | (https://news.ycombinator.com/item?id=29798744), something | there is no available license for. | | > ARM made it very clear that they consider all ARM cores | their own[1] | | The Qualcomm situation is a breach of contract issue wrt | Nuvia, it's a very different issue, and by an actor with | very different privileges. | Vt71fcAqt7 wrote: | Is there a real source for this claim? 
It gets parroted a | lot on HN and elsewhere, but I've also heard it's greatly | exaggerated. I don't think Apple engineers get to read the | licences, and even if they did, how do we know they | understood it correctly and that it got repeated | correctly? I've never seen a valid source for this | claim. | masklinn wrote: | For what claim? That they co-founded ARM? That's | historical record. That they extended the ISA? That's | literally observed from decompilations. That they can do | so? They've been doing it for at least 2 years and ARM | has yet to sue. | | > I've never seen a valid source for this claim. | | What is "a valid source"? The linked comment is from | Hector Martin, the founder and lead of Asahi, who worked | on and assisted with reversing various facets of Apple | silicon, including the capabilities and extensions of the | ISA. | Vt71fcAqt7 wrote: | >For what claim? | | That they have "essentially full control on ARM" | | Having an ALA + some extras doesn't mean "full control." | | He also says: | | >And apparently in Apple's case, they get to be a little | bit incompatible | | So he doesn't seem to actually know the full extent to | which Apple has more rights, even using the phrase "a | little bit" -- far from your claim. And he (and certainly | you) has not read the license. Perhaps they have to pay | for each core they release on the market that breaks | compatibility? Do you know? Of course not. A valid source | would be a statement from someone who read the license or | one of the companies. There is more to a core than just | the ISA. If not, why is Apple porting cores to RISC-V if | they have so much control? | ksherlock wrote: | Why does it need a "real source"? ARM sells architecture | licenses, Apple has a custom ARM architecture. 1 + 1 = 2. | | https://www.cnet.com/tech/tech-industry/apple-seen-as- | likely... | | "ARM Chief Executive Warren East revealed on an earnings | conference call on Wednesday that "a leading handset | OEM," or original equipment manufacturer, has signed an | architectural license with the company, forming ARM's | most far-reaching license for its processor cores. East | declined to elaborate on ARM's new partner, but EETimes' | Peter Clarke could think of only one smartphone maker who | would be that interested in shaping and controlling the | direction of the silicon inside its phones: Apple." | | https://en.wikipedia.org/wiki/Mac_transition_to_Apple_sil | ico... | | "In 2008, Apple bought processor company P.A. Semi for | US$278 million.[28][29] At the time, it was reported that | Apple bought P.A. Semi for its intellectual property and | engineering talent.[30] CEO Steve Jobs later claimed that | P.A. Semi would develop system-on-chips for Apple's iPods | and iPhones.[6] _Following the acquisition, Apple signed | a rare "Architecture license" with ARM, allowing the | company to design its own core, using the ARM instruction | set_.[31] The first Apple-designed chip was the A4, | released in 2010, which debuted in the first-generation | iPad, then in the iPhone 4. Apple subsequently released a | number of products with its own processors." | | https://www.anandtech.com/show/7112/the-arm-diaries- | part-1-h... | | "Finally at the top of the pyramid is an ARM architecture | license. Marvell, Apple and Qualcomm are some examples of | the 15 companies that have this license." | Vt71fcAqt7 wrote: | I should have been more explicit.
I am questioning the | claim that Apple has "full control on ARM" with no | restriction on the cores they make, grandfathered in from | the 1980s. Nobody has ever substantiated that claim. | titzer wrote: | Rosetta 2 is great, except it apparently can't run statically- | linked (non-PIC) binaries. I am unsure why this limitation | exists, but it's pretty annoying because Virgil x86-64 binaries | cannot run under Rosetta 2, which means I resort to running on | the JVM on my M1... | randyrand wrote: | Why are static binaries with PIC so rare? I'm surprised | position dependent code is _ever_ used anymore in the age of | ASLR. | | But static binaries are still great for portability. So you'd | think static binaries with PIC would be the default. | masklinn wrote: | > But static binaries are still great for portability. | | macOS has not officially supported static binaries in... | ever? You can't statically link libSystem, and it absolutely | does not care for kernel ABI stability. | titzer wrote: | > it absolutely does not care for kernel ABI stability | | That may be true on the mach system call side, but the UNIX | system calls don't appear to change. (Virgil actually does | call the kernel directly). | masklinn wrote: | > That may be true on the mach system call side, but the | UNIX system calls don't appear to change. | | They very much do, without warning, as the Go project | discovered (after having been warned multiple times) | during the Sierra betas: | https://github.com/golang/go/issues/16272 | https://github.com/golang/go/issues/16606 | | That doesn't mean Apple goes out of its way to break | syscalls (unlike microsoft), but there is no support for | direct syscalls. That is why, again, you can't statically | link libSystem. | | > (Virgil actually does call the kernel directly). | | That's completely unsupported ¯\_(ツ)_/¯ | titzer wrote: | Virgil doesn't use ASLR. I'm not sure what value it adds to a | memory-safe language. | saagarjha wrote: | Rosetta can run statically linked binaries, but I don't think | anything supports binaries that aren't relocatable. | | $ file a.out | a.out: Mach-O 64-bit executable x86_64 | $ otool -L a.out | a.out: | $ ./a.out | Hello, world! | CharlesW wrote: | > _Rosetta 2 is great, except it apparently can't run | statically-linked (non-PIC) binaries._ | | Interestingly, it supports statically-linked x86 binaries when | used with Linux. | | "Rosetta can run statically linked x86_64 binaries without | additional configuration. Binaries that are dynamically linked | and that depend on shared libraries require the installation of | the shared libraries, or library hierarchies, in the Linux | guest in paths that are accessible to both the user and to | Rosetta." | | https://developer.apple.com/documentation/virtualization/run... | mirashii wrote: | Statically linked binaries are officially unsupported on MacOS | in general, so there's no reason to support them on Rosetta | either. | | It's unsupported in MacOS because it assumes binary | compatibility on the kernel system call interface, which is not | guaranteed. | saagarjha wrote: | Rosetta was introduced with the promise that it supports | binaries that make raw system calls. (And it does indeed | support these by hooking the syscall instruction.) | darzu wrote: | Does anyone know the names of the key people behind Rosetta 2? | | In my experience, exceptionally well executed tech like this | tends to have 1-2 very talented people leading. I'd like to | follow their blog or Twitter.
| trollied wrote: | The original Rosetta was written by Transitive, which was | formed by spinning a Manchester University research group out. | See https://www.software.ac.uk/blog/2016-09-30-heroes- | software-e... | | I know a few of their devs went to ARM, some to Apple & a few | to IBM (who bought Transitive). I do know a few of their ex | staff (and their twitter handles), but I don't feel comfortable | linking them here. | scrlk wrote: | IIRC the current VP of Core OS at Apple is ex- | Manchester/Transitive. | cwzwarich wrote: | I am the creator / main author of Rosetta 2. I don't have a | blog or a Twitter (beyond lurking). | darzu wrote: | Should you feel inspired to share your learnings, insights, | or future ideas about the computing spaces you know, me and | I'm sure many other people would be interested to listen! | | My preferred way to learn about a new (to me) area of tech is | to hear the insights of the people who have provably advanced | that field. There's a lot of noise to signal in tech blogs. | darzu wrote: | If you're feeling inclined, here's a slew of questions: | | What was the most surprising thing you learned while working | on Rosetta 2? | | Is there anything (that you can share) that you would do | differently? | | Can your recommend any great starting places for someone | interested in instruction translation? | | Looking forward, did your work on Rosetta give you ideas for | unfilled needs in the virtualization/emulation/translation | space? | | What's the biggest inefficiency you see today in the tech | stacks you interact most with? | | A lot of hard decisions must have been made while building | Rosetta 2; can you shed light on some of those and how you | navigated them? | pcf wrote: | Thanks for your amazing work! | | May I ask - would it be possible to implement support for | 32-bit VST and AU plugins? | | This would be a major bonus, because it could e.g. enable | producers like me to open up our music projects from earlier | times, and still have the old plugins work. | [deleted] | Klonoar wrote: | Huh, this is timely. Incredibly random but: do you know if | there was anything that changed as of Ventura to where trying | to mmap below the 2/4GB boundary would no longer work in | Rosetta 2? I've an app where it's worked right up to Monterey | yet inexplicably just bombs in Ventura. | keepquestioning wrote: | Isn't Rosetta 2 "done"? What are you working on now? | bdash wrote: | Impressive work, Cameron! Hope you're doing well. | skrrtww wrote: | Are you able to speak at all to the known performance | struggles with x87 translation? Curious to know if we're | likely to see any updates or improvements there into the | future. | peatmoss wrote: | Not having any particular domain experience here, I've idly | wondered whether or not there's any role for neural net models in | translating code for other architectures. | | We have giant corpuses of source code, compiled x86_64 binaries, | and compiled arm64 binaries. I assume the compiled binaries | represent approximately our best compiler technology. It seems | predicting an arm binary from an x86_64 binary would not be | insane? | | If someone who actually knows anything here wants to disabuse me | of my showerthoughts, I'd appreciate being able to put the idea | out of my head :-) | Symmetry wrote: | Many branch predictors have traditionally used perceptrons, | which are sort of NN like. And I think there's a lot of | research into involving incorporating deep learning models into | doing chip routings. 
| Someone wrote: | > It seems predicting an arm binary from an x86_64 binary would | not be insane? | | If you start with a couple of megabytes of x64 code, and | predict a couple of megabytes of arm code from it, there will | be errors even if your model is 99.999% accurate. | | How do you find the error(s)? | hinkley wrote: | I think we are on the cusp of machine aided rules generation | via example and counter example. It could be a very cool era of | "Moore's Law for software" (which I'm told software doubles in | speed roughly every 18 years). | | Property based testing is a bit of a baby step here, possibly | in the same way that escape analysis in object allocation was | the precursor to borrow checkers which are the precursor to...? | | These are my inputs, these are my expectations, ask me some | more questions to clarify boundary conditions, and then offer | me human readable code that the engine thinks satisfies the | criteria. If I say no, ask more questions and iterate. | | If anything will ever allow machines to "replace" coders, it | will be that, but the scare quotes are because that shifts us | more toward information architecture from data munging, which I | see as an improvement on the status quo. Many of my work | problems can be blamed on structural issues of this sort. A | filter that removes people who can't think about the big | picture doesn't seem like a problem to me. | saagarjha wrote: | People have tried doing this, but not typically at the | instruction level. Two ways to go about this that I'm aware of | are trying to use machine learning to derive high-level | semantics about code, then lowering it to the new architecture. | brookst wrote: | I'm a ML dilletante and hope someone more knowledgeable chimes | in, but one thing to consider is the statistics of how many | instructions you're translating and the accuracy rate. Binary | execution is very unforgiving to minor mistakes in translation. | If 0.001% of instructions are translated incorrectly, that | program just isn't going to work. | qsort wrote: | You would need a hybrid architecture with a NN generating | guesses and a "watchdog" shutting down errors. | | Neural models are basically universal approximators. Machine | code needs to be obscenely precise to work. | | Unless you're doing something else in the backend, it's just a | turbo SIGILL generator. | throw10920 wrote: | This is all true - machine code needs to be "basically | perfect" to work. | | However, there are lots of problems in CS that are easier to | check the answer to a solution than to solve in the first | place. It _may_ turn out to be the case that a well-tuned | model can quickly produce solutions to some code-generation | problems, that those solutions have a high enough likelihood | of being correct, that it 's fast enough to check (and maybe | try again), and that this entire process is faster than | state-of-the-art classical algorithms. | | However, if that were the case, I might also expect us to be | able to extract better algorithms from the model - | intuitively, machine code generation "feels" like something | that's just better implemented through classical algorithms. | Have you met a human that can do register allocation faster | than LLVM? | classichasclass wrote: | > turbo SIGILL generator | | This gave me the delightful mental image of a CPU smashing | headlong into a brick wall, reversing itself, and doing it | again. Which is pretty much what this would do. 
| ericbarrett wrote: | Anybody know if Docker has plans to move from qemu to Rosetta on | M1/2 Macs? I've found qemu to be at least 100x slower than the | native arch. | jeffbee wrote: | I wonder how much hand-tuning there is in Rosetta 2 for known, | critical routines. One of the tricks Transmeta used to get | reasonable performance on their very slow Crusoe CPU was to | recognize critical Windows functions and replace them with a | library of hand-optimized native routines. Of course that's a | little different because Rosetta 2 is targeting an architecture | that is generally speaking at least as fast as the x86 | architecture it is trying to emulate, and that's been true for | most cross-architecture translators historically like DEC's VEST | that ran VAX code on Alpha, but Transmeta CMS was trying to | target a CPU that was slower. | saagarjha wrote: | Haven't spotted any in particular. | sedatk wrote: | TL;DR: One-to-one instruction translation ahead of time instead | of complex JIT translations to bet on M1's performance and | instruction cache handling. | johnthuss wrote: | "I believe there's significant room for performance improvement | in Rosetta 2... However, this would come at the cost of | significantly increased complexity... Engineering is about making | the right tradeoffs, and I'd say Rosetta 2 has done exactly | that." | Gigachad wrote: | Would be a waste of effort when the tool is designed to be | obsolete in a few years as everything gets natively compiled. | saagarjha wrote: | One thing that's interesting to note is that the amount of effort | expended here is not actually all that large. Yes, there are | smart people working on this, but the performance of Rosetta 2 | for the most part is probably the work of a handful of clever | people. I wouldn't be surprised if some of them have an interest | in compilers but the actual implementation is fairly | straightforward and there isn't much of the stuff you'd typically | see in an optimizing JIT: no complicated type theory or analysis | passes. Aside from a handful of hardware bits and some convenient | (perhaps intentionally selected) choices in where to make | tradeoffs there's nothing really specifically amazing here. What | really makes it special is that anyone (well, any company with a | bit of resources) could've done it but nobody really did. (But, | again, Apple owning the stack and having past experience probably | did help them get over the hurdle of actually putting effort into | this.) | pjmlp wrote: | Back in the early days of Windows NT everywhere, the Alpha | version had a similar JIT emulation. | agentcooper wrote: | I am interested in this domain, but lacking knowledge to fully | understand the post. Any recommendations on good | books/courses/tutorials related to low level programming? | saagarjha wrote: | I'd recommend going through a compilers curriculum, then | reading up on past binary translation efforts. | pjmlp wrote: | Back in the early days of Windows NT everywhere, the Alpha | version had a similar JIT emulation. | | https://en.m.wikipedia.org/wiki/FX!32 | | Or for a more technical deep dive, | | https://www.usenix.org/publications/library/proceedings/usen... | mosburger wrote: | OMG I forgot about FX!32. My first co-op was as a QA tester for | the DEC Multia, which they moved from the Alpha processor to | Intel midway through. I did a skunkworks project for the dev | team attempting to run the newer versions of Multia's software | (then Intel-based) on older Alpha Multias using FX!32. 
IIRC it | was still internal use only/beta, but it worked quite well! | hot_gril wrote: | Rosetta 2 has become the poster child for "innovation without | deprecation" where I work (not Apple). | Tijdreiziger wrote: | Apple is the king of deprecation, just look at what happened to | Rosetta 1 and 32-bit iOS apps. | hot_gril wrote: | Yes they are, and that makes Rosetta 2 even more special. | Though Rosetta 1 got support for 5 years, which is pretty | good. | kccqzy wrote: | > The instructions from FEAT_FlagM2 are AXFLAG and XAFLAG, which | convert floating-point condition flags to/from a mysterious | "external format". By some strange coincidence, this format is | x86, so these instruction are used when dealing with floating | point flags. | | This really made me chuckle. They probably don't want to mention | Intel by name, but this just sounds funny. | | https://developer.arm.com/documentation/100076/0100/A64-Inst... | manv1 wrote: | Apple's historically been pretty good at making this stuff. Their | first 68k -> PPC emulator (Davidian's) was so good that for some | things the PPC Mac was the fastest 68k mac you could buy. The | next-gen DR emulator (and SpeedDoubler etc) made things even | faster. | | I suspect the ppc->x86 stuff was slower because x86 just doesn't | have the registers. There's only so much you can do. | scarface74 wrote: | > Their first 68k -> PPC emulator (Davidian's) was so good that | for some things the PPC Mac was the fastest 68k mac you could | buy. | | This is not true. A 6100/60 running 68K code was about the | speed of my unaccelerated Mac LCII 68030/16. Even when using | SpeedDoubler, you only got speeds up to my LCII with a | 68030/40Mhz accelerator. | | Even the highest end 8100/80 was slower than a high end 68k | Quadra. | | The only time 68K code ran faster is when it made heavy use of | the Mac APIS that were native. | dev_tty01 wrote: | >The only time 68K code ran faster is when it made heavy use | of the Mac APIS that were native. | | Yes, and that just confirms the original point. Mac apps | often spend a lot of time in the OS apis and therefore the | 68K code (the app) often ran faster on PPC than it did on 68K | because apps often spend much of their time in OS apis. The | earlier post said "so good that for some things the PPC Mac | was the fastest 68k mac." That is true. | | In my own experience, I found most 68K apps felt as fast or | faster. Your app mix might have been different, but many | folks found the PPC faster. | classichasclass wrote: | Part of that was the greater clock speeds on the 601 and | 603, though. Those _started_ at 60MHz. Clock for clock 68K | apps were generally poorer on PowerPC until PPC clock | speeds made them competitive, and then the dynamic | recompiling emulator knocked it out of the park. | | Similarly, Rosetta was clock-for-clock worse than Power | Macs at running Power Mac applications. The last generation | G5s would routinely surpass Mac Pros of similar or even | slightly greater clocks. On native apps, though, it was no | contest, and by the next generation the sheer processor | oomph put the problem completely away. | | Rosetta 2 is notable in that it is so far Apple's only | processor transition where the new architecture was | unambiguously faster than the old one _on the old one 's | own turf_. | Wowfunhappy wrote: | > Apple's historically been pretty good at making this stuff. 
| Their first 68k -> PPC emulator (Davidian's) was so good that | for some things the PPC Mac was the fastest 68k mac you could | buy. | | Not arguing the facts here, but I'm curious--are these | successes related? And if so, how has Apple done that? | | I would imagine that very few of the engineers who programmed | Apple's 68k emulator are still working at Apple today. So, why | is Apple still so good at this? Strong internal documentation? | Conducive management practices? Or were they just lucky both | times? | joshstrange wrote: | I mean they are one of very few companies who have done arch | changes like this and they had already done it twice before | Rosetta 2. The same engineers might not have been used for | all 3 but I'm sure there was at least a tiny bit of overlap | between 68k->PPC and PPC->Intel (and likewise overlap between | PPC->Intel and Intel->ARM) that coupled with passed down | knowledge within the company gives them a leg up. They know | the pitfalls, they've see issues/advantages of using certain | approaches. | | I think of it in same way that I've migrated from old->new | versions of frameworks/languages in the past with breaking | changes and each time I've done it I've gotten better at | knowing what to expect, what to look for, places where it | makes sense to "just get it working" or "upgrade the code to | the new paradigm". The first time or two I did it was as a | junior working under senior developers so I wasn't as | involved but what did trickle down to me and/or my part in | the refactor/upgrade taught me things. Later times when I was | in charge (or on my own) I was able to draw on those past | experiences. | | Obviously my work is nowhere near as complicated as arch | changes but if you squint and turn your head to the side I | think you can see the similarities. | | > Or were they just lucky to have success both times? | | I think 2 times might be explained with "luck" but being | successful 3 times points to a strong trend IMHO, especially | since Rosetta 2 seems to have done even better than Rosetta 1 | for the last transition. | spacedcowboy wrote: | FWIW, I know several current engineers at Apple who wrote | ground-breaking stuff before the Mac even existed. Apple | certainly doesn't have any problem with older engineers, and | it turns out that transferring that expertise to new chips on | demand isn't particularly hard for them. | nordsieck wrote: | > I suspect the ppc->x86 stuff was slower because x86 just | doesn't have the registers. | | My understanding is that part of the reason the G4/5 was sort | of able to keep up with x86 at the time was due to the heavy | use of SIMD in some apps. And I doubt that Rosetta would have | been able to translate that stuff into SSE (or whatever the x86 | version of SIMD was at the time) on the fly. | bonzini wrote: | Apple had a library of SIMD subroutines (IIRC | Accelerate.framework) and Rosetta was able to use the x86 | implementation when translating PPC applications that called | it. | masklinn wrote: | Rosetta actually did support Altivec. It didn't support G5 | input at all though (but likely because that was considered | pretty niche, as Apple only released a G5 iMac, a PowerMac, | and an XServe, due to the out-of-control power and thermals | of the PowerPC 970). | menaerus wrote: | > Rosetta 2 translates the entire text segment of the binary from | x86 to ARM up-front. 
| | Do I understand correctly that the Rosetta is basically a | transpiler from x86-64 machine code to ARM machine code which is | run prior to the binary execution? If so, does it affect the | application startup times? | nilsb wrote: | Yes, it does. The delay of the first start of an app is quite | noticeable. But the transpiled binary is apparently cached | somewhere. | saagarjha wrote: | /var/db/oah. | nicoburns wrote: | > If so, does it affect the application startup times? | | It does, but only the very first time you run the application. | The result of the transpilation is cached so it doesn't have to | be computed again until the app is updated. | arianvanp wrote: | And deleting the cache is undocumented (it is not in the file | system) so if you run Mac machines as CI runners they will | trash and brick themselves running out of disk space over | time. | rowanG077 wrote: | What in the actual fuck. That is such an insane decision. | Where is it stored then? Some dark corner of the file | system inaccessible via normal means? | jonny_eh wrote: | You mean the cache is ever expanding? | koala_man wrote: | Really? This SO question says it's stored in /var/db/oah/ | | https://apple.stackexchange.com/questions/427695/how-can- | i-l... | dylan604 wrote: | Does that essentially mean each non-native app is doubled in | disk use? Maybe not doubled but requires more space to be | sure. | saagarjha wrote: | Yes. | varenc wrote: | Yes... you can see the cache in /var/db/oah/ | | Though only the actual binary size that gets doubled. For | large apps it's usually not the binary that's taking up | most of the space. | kijiki wrote: | Similar to DEC's FX!32 in that regard. FX!32 allowed running | x86 Windows NT apps on Alpha Windows NT. | saltcured wrote: | There was also an FX!32 for Linux. But I think it may have | only included the interpreter part and left out the | transpiler part. My memory is vague on the details. | | I do remember that I tried to use it to run the x86 | Netscape binary for Linux on a surplus Alpha with RedHat | Linux. It worked, but so slowly that a contemporary Python- | based web browser had similar performance. In practice, I | settled on running Netscape from a headless 486 based PC | and displaying remotely on the Alpha's desktop over | ethernet. That was much more usable. | esskay wrote: | The first load is fairly slow, but once it's done it every load | after that is pretty much identical to what it'd be running on | an x86 mac due to the caching it does. | EricE wrote: | For me my M1 was fast enough that the first load didn't seem | that different - and more importantly subsequent loads were | lighting fast! It's astonishing how good Rosetta 2 is - | utterly transparent and faster than my Intel Mac thanks to | the M1. | savoytruffle wrote: | If installed using a packaged installer, or the App Store, | the translation is done during installation instead of at | first run. So, slow 1st launch may be uncommon for a lot of | apps or users. | hinkley wrote: | I remember years ago when Java adjacent research was all the | rage, HP had a problem that was "Rosetta lite" if you will. They | had a need to run old binaries on new hardware that wasn't | exactly backward compatible. They made a transpiler that worked | on binaries. It might have even been a JIT but that part of the | memory is fuzzy. | | What made it interesting here was that as a sanity check they | made an A->A mode where they took in one architecture and spit | out machine code for the same architecture. 
The output was faster | than the input. Meaning that even native code has some room for | improvement with JIT technology. | | I have been wishing for years that we were in a better place with | regard to compilers and NP complete problems where the compilers | had a fast mode for code-build-test cycles and a very slow | incremental mode for official builds. I recall someone telling me | the only thing they liked about the Rational IDE (C and C++?) was | that it cached precompiled headers, one of the Amdahl's Law areas | for compilers. If you changed a header, you paid the | recompilation cost and everyone else got a copy. I love whenever | the person that cares about something gets to pay the consequence | instead of externalizing it on others. | | And having some CI machines or CPUs that just sit around chewing | on Hard Problems all day for that last 10% seems to be to be a | really good use case in a world that's seeing 16 core consumer | hardware. Also caching hints from previous runs is a good thing. | fuckstick wrote: | > The output was faster than the input. | | So if you ran the input back through the output multiple times | then that means you could eventually get the runtime down to 0. | twic wrote: | But unfortunately, the memory use goes to infinity. | avidiax wrote: | Probably the output of the decade-old compiler that produced | the original binary had no optimizations. | hinkley wrote: | That too but the eternal riddle of optimizer passes is | which ones reveal structure and which obscure it. Do I loop | unroll or strength reduce first? If there are heuristics | about max complexity for unrolling or inlining then it | might be "both". | | And then there's processor family versus this exact model. | zaphirplane wrote: | Is this for itanium | tomcam wrote: | I'm likely misunderstanding what you said, but I thought pre- | compiled headers were pretty much standard these days. | wmf wrote: | https://www.hpl.hp.com/techreports/1999/HPL-1999-78.html | travisgriggs wrote: | It was particularly poignant at the time because JITed | languages were looked down on by the "static compilation | makes us faster" crowd. So it was a sort of "wait a minute | Watson!" moment in that particular tech debate. | | No one cares as much now days, we've moved our overrated | opinion battlegrounds to other portions of what we do. | pjmlp wrote: | I eventually changed my opinion into JIT being the only way | to make dynamic languages faster, while strong typed ones | can benefit from having both AOT/JIT for different kinds of | deployment scenarios, and development workflows. | titzer wrote: | Dynamic languages need inline caches, type feedback, and | fairly heavy inlining to be competitive. Some of that can | be gotten offline, e.g. by doing PGO. But you can't, in | general, adapt to a program that suddenly changes phases, | or rebinds a global that was assumed a constant, etc. | Speculative optimizations with deopt are what make | dynamic languages fast. | hinkley wrote: | Before I talked myself out of writing my own programming | language, I used to have lunch conversations with my | mentor who was also speed obsessed about how JIT could | meet Knuth in the middle by creating a collections API | with feedback guided optimization, using it for algorithm | selection and tuning parameters by call site. | | For object graphs in Java you can waste exorbitant | amounts of memory by having a lot of "children" members | that are sized for a default of 10 entries but the normal | case is 0-2. 
I once had to deoptimize code where someone | tried to do this by hand and the number they picked was 6 | (just over half of the default). So when the average | jumped to 7, the data structure ended up being 20% | larger than the default behavior instead of 30% smaller | as intended. | | For a server workflow, having data structures tuned to | larger pools of objects with more complex comparison | operations can also be valuable, but I don't want that | kitchen sink stuff on mobile or in an embedded app. | | I still think this is viable, but only if you are clever | about gathering data. For instance, the incremental | increase in runtime for telemetry data is quite high on | the happy path. But corner cases are already expensive, | so telemetry adds only a few percent there instead of | double digits. | | The nonstarter for this ended up being that most | collections APIs violate Liskov, so you almost need to | write your own language to pick a decomposition that | doesn't. Variance semantics help a ton but they don't | quite fix LSP. | mikepurvis wrote: | I think I landed in a place where it's basically "the | compiler has insufficient information to achieve ideal | optimization because some things can only be known at | runtime." | | Which is not exclusively an argument for runtime JIT-- it | can also be an argument for instrumenting your runtime | environment, and feeding that profiling data back to the | compiler to help it make smarter decisions the next time. | But that's definitely a more involved process than just | baking it into the same JavaScript interpreter used by | everyone-- likely well worth it in the case of things | like game engines, though. | masklinn wrote: | It's also an argument for having much more expressive and | precise type systems, so the compiler has better | information. | | Once you've managed to debug the codegen anyway (see: The | Long and Arduous Story of Noalias). | mikepurvis wrote: | Is it? I'd love to see a breakdown of what classes of | information can be gleaned from profile data, and how | much of an impact each one has in isolation in terms of | optimization. | | Naively, I would have assumed that branch information | would be most valuable, in terms of being able to guide | execution toward the hot path and maximize locality for | the memory accesses occurring on the common branches. And | that info is not something that would be assisted by more | expressive types, I don't think. | titzer wrote: | Darn it, replied too early. See sibling comment I just | posted. The problem with dynamic languages is that you | need to speculate and be ready to undo that speculation. | notriddle wrote: | https://tomaszs2.medium.com/how-rust-1-64-became-10-20-faste... | | https://news.ycombinator.com/item?id=33306945 | bluGill wrote: | The problem with JIT is that not all information known at | runtime is the correct information to optimize on. | | In finance the performance-critical code path is often | the one run least often. That is, you have an | if(unlikely_condition) {run_time_sensitive_trade();}. In | this case you need to tell the compiler to accept a | pipeline stall from a branch misprediction most of the | time, to ensure that the pipeline doesn't stall the one | time that counts. | | The above is a rare corner case for sure, but it is one | of those weird exceptions you always need to keep in mind | when trying to make any blanket rule. | dahfizz wrote: | The other issue with JIT is that it is unreliable.
It | optimizes code by making assumptions. If one of the | assumptions is wrong, you pay a large latency penalty. In | my field of finance, having reliably low latency is | important. Being 15% faster on average but really slow | every once in a while is not something | customers will go for. | saagarjha wrote: | I take it you are not very familiar with the website known | as Hacker News. | AussieWog93 wrote: | Outside of gaming, or hyper-CPU-critical workflows like video | editing, I'm not really sure if people actually even care about | that last 10% of performance. | | I know most of the time I get frustrated by everyday software, | it's doing something unnecessary in a long loop, and possibly | forgetting to check for Windows messages too. | koala_man wrote: | Performance also translates into better battery life and | cheaper datacenters. | hamstergene wrote: | Could it be simply because many binaries were produced by much | older, outdated optimizers, or optimized for size? | | Also, optimizers usually target the lowest common denominator, so | native binaries rarely use the full power of the current | instruction set. | | Jumping from that peculiar finding to praising runtime JIT | feels like a long shot. To me it's more of an argument towards | distributing software in intermediate form (like Apple Bitcode) | and compiling on install, tailoring for the current processor. | jasonwatkinspdx wrote: | All reasonable points, but examples where JIT has an | advantage are well supported in research literature. The | typical workload that shows this is something with a very | large space of conditionals, but where at runtime there's a | lot of locality, e.g. matching and classification engines. | AceJohnny2 wrote: | > _Or optimized for size._ | | Note that on gcc (I think) and clang (I'm sure), -Oz is a | strict superset of -O2 (the "fast+safe" optimizations, | compared to -O3 that can be a bit too aggressive, given C's | minefield of Undefined Behavior that compilers can exploit). | | I'd guess that, with cache fit considerations, -Oz can even | be faster than -O2. | astrange wrote: | > To me it's more of an argument towards distributing | software in intermediate form (like Apple Bitcode) and | compiling on install, tailoring for the current processor. | | This turns out to be quite difficult, especially if you're | using bitcode as a compiler IL. You have to know what the | right "intermediate" level is; if assumptions change too much | under you then it's still too specific. And it means you | can't use things like inline assembly. | | That's why bitcode is dead now. | | By the way, I don't know why this thread is about how JITs | can optimize programs when this article is about how Rosetta | is not a JIT and intentionally chose a design that can't | optimize programs. | lmm wrote: | > This turns out to be quite difficult, especially if | you're using bitcode as a compiler IL. You have to know | what the right "intermediate" level is; if assumptions | change too much under you then it's still too specific. And | it means you can't use things like inline assembly. | | > That's why bitcode is dead now. | | Isn't this what Android does today? Applications are | distributed in bytecode form and then optimized for the | specific processor at install time. | chrisseaton wrote: | I've run Ruby C extensions on a JIT faster than on native, due | to things like inlining and profiling working more effectively | at runtime.
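To make bluGill's branch-hint example a few comments up concrete: a
minimal sketch, assuming GCC or Clang, of steering an AOT build toward
the rare-but-latency-critical path with __builtin_expect.
run_time_sensitive_trade() comes from that comment;
check_market_conditions() and the stub bodies are hypothetical
stand-ins, and a JIT or profile-guided build would be pushed the
opposite way by the observed branch frequencies.

      #include <stdbool.h>
      #include <stdio.h>
      #include <stdlib.h>

      /* Hypothetical stand-ins for the comment's example. */
      static bool check_market_conditions(void) { return rand() % 10000 == 0; }
      static void run_time_sensitive_trade(void) { puts("trade!"); }

      static void poll_market(void)
      {
          /* The condition is almost never true, but it is the path where
           * latency matters, so hint the compiler to lay out code as if it
           * were the common case. A JIT or PGO pass, going by observed
           * frequencies, would optimize for the opposite branch. */
          if (__builtin_expect(check_market_conditions(), 1))
              run_time_sensitive_trade();
      }

      int main(void)
      {
          for (int i = 0; i < 100000; i++)
              poll_market();
          return 0;
      }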
| jeffbee wrote: | Post-build optimization of binaries without changing the target | CPU is common. See BOLT | https://github.com/facebookincubator/BOLT | mark_undoio wrote: | Something that fascinates me about this kind of A -> A | translation (which I associate with the original HP Dynamo | project on HPPA CPUs) is that it was able to effectively yield | the performance effect of one or two increased levels of the -O | optimization flag. | | Right now it's fairly common in software development to have a | debug build and a release build with potentially different | optimisation levels. So that's two builds to manage - if we | could build with lower optimisation and still effectively run | at higher levels then that's a whole load of build/test | simplification. | | Moreover, debugging optimised binaries is fiddly due to | information that's discarded. Having the original, unoptimised, | version available at all times would give back the fidelity | when required (e.g. debugging problems in the field). | | Java effectively lives in this world already as it can use high | optimisation and then fall back to interpreted mode when | debugging is needed. I wish we could have this for C/C++ and | other native languages. | foobiekr wrote: | One of the engineers I was working with on a project, who was from | Transitive (the company that made QuickTransit, which became | Rosetta), found that their JIT-based translator could not | deliver significant performance increases for A->A outside of | pathological cases, and it was very mature technology at the | time. | | I think it's a hypothetical. The Mill Computing lectures talk | about a variant of this, which is sort of equivalent to an | install-time specializer for intermediate code which might | work, but that has many problems (for one thing, it breaks | upgrades and is very, very problematic for VMs being run on | different underlying hosts). | saagarjha wrote: | It depends greatly on which optimization levels you're going | through. -O0 to -O1 can easily be a 2-3x performance | improvement, which is going to be hard to get otherwise. -O2 | to -O3 might be 15% if you're lucky, in which case LTO+PGO | can absolutely get you wins that beat that. | bluGill wrote: | -O2 to -O3 has in some benchmarks made things worse. In | others it is a massive win, but in general going above | -O2 should not be done without benchmarking the code. There | are some optimizations that can make things worse or better | for reasons the compiler cannot know. | astrange wrote: | Over-optimizing your "cold" code can also make things | worse for the "hot" code, e.g. by growing code size so much | that briefly entering the cold space kicks everything out | of caches. | hinkley wrote: | I have often lamented not being able to hint to the JIT | when I've transitioned from startup code to normal | operation. I don't need my Config file parsing optimized. | But the code for interrogating the Config at runtime | better be. | | Everything before listen() is probably run once. Except | not every program calls listen(). | hinkley wrote: | And then there's always the outlier where optimizing for | size makes the working memory fit into cache and thus the | whole thing substantially faster. | freedomben wrote: | If JIT-ing a statically compiled input makes it faster, does | that mean that JIT-ing itself is superior or does it mean that | the static compiler isn't outputting optimal code? (real | question;
asked another way, does JIT have optimizations it can | make that a static compiler can't?) | vips7L wrote: | Yes, the JIT has more profile-guided data as to what your | program actually does at runtime, therefore it can optimize | better. | gpderetta wrote: | On the other hand some optimizations are so expensive that a | JIT just doesn't have the execution budget to perform them. | | Probably the optimal system is a hybrid iterative JIT/AOT | compiler (which incidentally was the original objective of | LLVM). | mockery wrote: | In addition to the sibling comments, one simple opportunity | available to a JIT and not AOT is 100% confidence about the | target hardware and its capabilities. | | For example AOT compilation often has to account for the | possibility that the target machine might not have certain | instructions - like SSE/AVX vector ops - and emit both SSE and | non-SSE versions of a codepath with, say, a branch to pick | the appropriate one dynamically. | | Whereas a JIT knows what hardware it's running on - it | doesn't have to worry about any other CPUs. | duped wrote: | AOT compilers support this through a technique called | function multi-versioning. It's not free and only goes so | far, but it isn't reserved to JITs. | | The classical reason to use FMV is for SIMD optimizations, | fwiw | acdha wrote: | One great example of this was back in the P4 era where | Intel hit higher clock speeds at the expense of much higher | latency. If you made a binary for just that processor a | smart compiler could use the usual tricks to hit very good | performance, but that came at the expense of other | processors and/or compatibility (one appeal of the AMD | Athlon & especially Opteron was that you could just run the | same binary faster without caring about any of that[1]). A | smart JIT could smooth that considerably but at the time | the memory & time constraints were a challenge. | | 1. The usual caveats about benchmarking what you care about | apply, of course. The mix of webish things I worked on and | scientists I supported followed this pattern, YMMV. | andrewaylett wrote: | It depends on what the JIT does exactly, but in general _yes_, | a JIT _may_ be able to make optimisations that a static | compiler won't be aware of, because a JIT can optimise for | the specific data being processed. | | That said, a sufficiently advanced CPU could also make those | optimisations on "static" code. That was one of the things | Transmeta had been aiming towards, I think. | kmeisthax wrote: | It's more the case that the ahead-of-time compilation is | suboptimal. | | Modern compilers have a thing called PGO (Profile-Guided | Optimization) that lets you take a compiled application, run | it and generate an execution profile for it, and then compile | the application again using information from the profiling | step. The reason why this works is that lots of optimization | involves time-space tradeoffs that only make sense to do if | the code is frequently called. JIT _only_ runs on frequently-called | code, so it has the advantage of runtime profiling | information, while ahead-of-time (AOT) compilers have to make | educated guesses about which loops are hottest. PGO | closes that gap. | | Theoretically, a JIT _could_ produce binary code hyper-tailored | to a particular user's habits and their computer's | specific hardware. However, I'm not sure if that has that | much of a benefit versus PGO AOT.
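To make the PGO workflow just described concrete: a minimal sketch,
assuming GCC (Clang works similarly, but also needs an llvm-profdata
merge step before -fprofile-use). sum_filtered() and the toy workload
in main() are hypothetical stand-ins for the frequently called code a
real profile would identify as hot.

      /* pgo_demo.c -- the hot/cold split only becomes visible at runtime.
       *
       * A typical profile-guided build with GCC:
       *   gcc -O2 -fprofile-generate pgo_demo.c -o demo   (instrumented build)
       *   ./demo                                          (run a representative
       *                                                    workload; writes .gcda
       *                                                    profile data)
       *   gcc -O2 -fprofile-use pgo_demo.c -o demo        (recompile using profile)
       */
      #include <stddef.h>
      #include <stdio.h>

      long sum_filtered(const long *values, size_t n, long threshold)
      {
          long total = 0;
          for (size_t i = 0; i < n; i++) {
              /* The profile records how often this branch is taken, so the
               * compiler can pick layout, unrolling, and if-conversion with
               * the same frequency information a JIT collects while running. */
              if (values[i] > threshold)
                  total += values[i];
          }
          return total;
      }

      int main(void)
      {
          long v[] = {1, 5, 2, 8, 3, 9, 4, 7};
          long total = 0;
          /* Stand-in workload so the instrumented run has something to profile. */
          for (int i = 0; i < 1000000; i++)
              total += sum_filtered(v, 8, 4);
          printf("%ld\n", total);
          return 0;
      }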
| com2kid wrote: | > Theoretically, a JIT could produce binary code hyper-tailored | to a particular user's habits and their computer's | specific hardware. However, I'm not sure if that has that | much of a benefit versus PGO AOT. | | In theory a JIT can be a _lot_ more efficient, optimizing not | only for the exact instruction set but also doing per-CPU-architecture | optimizations, accounting for things such as instruction length, | pipeline depth, cache sizes, etc. | | In reality I doubt most compiler or JIT development teams | have the resources to write and test all those potential | optimizations, especially as new CPUs are coming out all | the time, and each set of optimizations is another set of | tests that has to be maintained. | bluGill wrote: | gcc and clang at least have options so you can optimize | for specific CPUs. I'm not sure how good they are (most | people want a generic optimization that runs well on all | CPUs of the family, so there likely is lots of room for | improvement with CPU-specific optimization), but they can | do that. This does (or at least can; again, it probably | isn't fully implemented) account for instruction length, | pipeline depth, and cache size. | | The JavaScript V8 engine and the JVM are both popular | and supported enough that I expect the teams working on | them to take advantage of every trick they can for specific | CPUs; they have a lot of resources for this (at least for | the major x86 and ARM chips - maybe they don't for MIPS | or some uncommon variant of ARM...). Of course there are | other JIT engines; some uncommon ones don't have many | resources and won't do this. | titzer wrote: | > take advantage of every trick they can for specific | CPUs | | Not to the extent clang and gcc do, no. V8 does, e.g., use | AVX instructions and some others if they are indicated to | be available by CPUID. TurboFan does global scheduling in | moving out of the sea of nodes, but that is not machine-specific. | There was an experimental local instruction | scheduler for TurboFan but it never really helped big | cores, while measurements showed it would have helped | smaller cores. It didn't actually calculate latencies; it | just used a greedy heuristic. I am not sure if it was | ever turned on. TurboFan doesn't do software pipelining | or unroll/jam, though it does loop peeling, which isn't | CPU-specific. | astrange wrote: | > gcc and clang at least have options so you can optimize | for specific CPUs. I'm not sure how good they are | | They are not very good at it, and can't be. You can look | inside them and see the models are pretty simple; the | best you can do is optimize for the first step (decoder) | of the CPU and avoid instructions called out in the | optimization manual as being especially slow. But on an | OoO CPU there's not much else you can do ahead of time, | since branches and memory accesses are unpredictable and | much slower than in-CPU resource stalls. | duped wrote: | Like another commenter said, JIT compilers do this today. | | The thing that makes this mostly theoretical is that the | underlying assumption is only true when you neglect that | an AOT compiler has zero run-time cost while a JIT compiler | has to execute the code it's optimizing _and_ the code to | decide if it's worth optimizing and generate new code. | | So JIT compiler optimizations are a bit different than | AOT optimizations since they have to both generate | faster/smaller code _and_ execute the code that performs | the optimization.
The problem is that most optimizations | beyond peephole are quite expensive. | | There's another thing that AOT compilers don't need to | deal with, which is being wrong. Production JITs have to | implement dynamic de-optimization in the case that an | optimization was built on a bad assumption. | | That's why JITs are only faster in theory (today), since | there are performance pitfalls in the JIT itself. | titzer wrote: | Nearly all JS engines are doing concurrent JIT | compilation now, so some of the compilation cost is moved | off the main thread. Java JITs have had multiple compiler | threads for more than a decade. | saagarjha wrote: | The well-funded production JIT compilers (HotSpot, V8, | etc.) absolutely do take advantage of these. The vector | ISA can sometimes be unwieldy to work with but things | like replacing atomics, using unaligned loads, or taking | advantage of differing pointer representations are common. | com2kid wrote: | They do some auto-vectorization, but AFAIK they don't do | micro-optimizations for different CPUs. | rowanG077 wrote: | A JIT can definitely make optimizations that a static | compiler can't, simply by virtue of it having concrete | dynamic real-time information. | ketralnis wrote: | It means that in this case, the static compiler emitted code | that could be further optimised, that's all. It doesn't mean | that that's always the case, or that static compilers _can't_ | produce optimal code, or that either technique is | "better" than the other. | | An easy example is code compiled for 386 running on a 586. | The A->A compiler can use CPU features that weren't available | to the 386. As with PGO you have branch prediction | information that's not available to the static compiler. You | can statically compile the dynamically linked dependencies, | allowing inlining that wasn't previously available. | | On the other hand you have to do all of that. That takes | warmup time just like a JIT. | | I think the road to enlightenment is letting go of phrasing | like "is superior". There are lots of upsides and downsides | to pretty much every technique. | sergimas15 wrote: | nice | hawflakes wrote: | People have mentioned the Dynamo project from HP. But I think | you're actually thinking of the Aries project (I worked in a | directly adjacent project) that allowed you to run PA-RISC | binaries on IA-64. | | https://nixdoc.net/man-pages/HP-UX/man5/Aries.5.html | dynjo wrote: | It is quite astonishing how seamless Apple has managed to make | the Intel to ARM transition; there are some seriously smart minds | behind Rosetta. I honestly don't think I had a single software | issue during the transition! | wombat-man wrote: | There's an annoying Dwarf Fortress bug but other than that, | same | xxpor wrote: | They've almost made it too good. I have to run software that | ships an x86 version of CPython, and it just deeply offends me | on a personal level, even though I can't actually detect any | slowdown (probably because lol python in the first place) | ChuckNorris89 wrote: | If that blows your mind, you should see how Microsoft did the | emulation of the PowerPC-based Xenon chip on x86 so you can play | Xbox 360 games on Xbox One. | | There's an old pdf from Microsoft researchers with the details | but I can't seem to find it right now. | RedShift1 wrote: | Any good videos on that?
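Circling back to the function multi-versioning mentioned a few
comments up: a minimal sketch, assuming GCC's target_clones attribute
on x86-64 Linux/glibc (the dispatch relies on ifuncs; newer Clang also
supports the attribute), of how an AOT binary can keep part of the
"knows the exact CPU" advantage a JIT gets for free. dot() is a
hypothetical example function; the compiler emits a generic version,
an AVX2 version, and a resolver that picks one at load time via CPUID.

      /* fmv_demo.c -- one source function, several machine-code versions. */
      #include <stddef.h>
      #include <stdio.h>

      __attribute__((target_clones("avx2", "default")))
      double dot(const double *a, const double *b, size_t n)
      {
          double acc = 0.0;
          for (size_t i = 0; i < n; i++)
              acc += a[i] * b[i];   /* auto-vectorized in the AVX2 clone */
          return acc;
      }

      int main(void)
      {
          double x[] = {1, 2, 3, 4}, y[] = {5, 6, 7, 8};
          printf("%f\n", dot(x, y, 4));   /* dispatches to the best clone */
          return 0;
      }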
| poulpy123 wrote: | having total control over the hardware and the software didn't | hurt, for sure | manv1 wrote: | Qualcomm (and Broadcom) has total control over the hardware | and software side of a lot of stuff and their stuff is shit. | | It's not about control, it's about good engineering. | stevefan1999 wrote: | It's about both control and engineering in Apple's case. | porcc wrote: | So many parts across the stack need to work well for this | to go well. Early support for popular software is a good | example. This goes from partnerships all the way down to | hardware designers. | | I'd argue it's not about engineering more than it is about | good organizational structure. | iamstupidsimple wrote: | And having execs who design the organizational structure | around those goals is part of what makes good engineering | :) | zeusk wrote: | That's really not the case; if you're in Microsoft or | Linux's position you can't really change the OS | architecture or driver models for any particular vendor. | | That generality and general knowledge separation between | different stacks leaves quite a lot of efficiency on the | table. | esskay wrote: | It has been extremely smooth sailing. I moved my own Mac over | to it about a year ago, swapping a beefed-up MBP for a budget-friendly | M1 Air (which has massively smashed it out of the park | performance-wise, far better than I was expecting). Didn't have | a single issue. | | My work Mac was upgraded to an MBP M1 Pro and again, very | smooth. I had one minor issue with a Docker container not being | happy (it was an x86 instance) but one minor tweak to the | Docker Compose file and I was done. | | It does still amaze me how good these new machines are. It's | almost enough to redeem Apple for the total pile of | overheating, underperforming crap that came directly before the | transition (aka any Mac with a Touch Bar). | js2 wrote: | I have a single counter-example: Mailplane, a Gmail SSB. It's | Intel-only, including its JS engine, making the Gmail UI too | sluggish to use. | | I've fallen back to using Fluid, an ancient and also Intel-specific | SSB, but its web content runs in a separate WebKit ARM | process so it's plenty fast. | | I've emailed the Mailplane author but they won't release a | Universal version of the app since they've EOL'd Mailplane. | | I have yet to find a Gmail SSB that I'm happy with under ARM. | Fluid is a barely workable solution. | cmg wrote: | For what it's worth, I use Mailplane on an M1 MacBook Air | (8GB) with 2 Gmail tabs and a calendar tab without noticeable | issues. | | Unfortunately the developers weren't able to get Google to | work with them on a policy change that impacted the app [0] | [1] and so gave up and have moved on to a new and completely | different customer support service. | | [0] https://developers.googleblog.com/2020/08/guidance-for-our-e... [1] https://mailplaneapp.com/blog/entry/mailplane_stopped_sellin... | | So unfortunately | perardi wrote: | I think the end of support for 32-bit applications in 2019 | helped, slightly, with the run-up. | | Assuming you weren't already shipping 64-bit | applications...which would be weird...updating the application | probably required getting everything into a contemporary | version of Xcode, cleaning out the cruft, and getting it | compiling nice and cleanly. After that, the ARM transition was | kind of an "it just works" scenario.
| | Now, I'm sure Adobe and other high-performance application | developers had to do some architecture-specific tweaks, but | you've gotta think Apple clued them in ahead of time as to what | was coming. | chrchang523 wrote: | I finally started seriously using an M1 work laptop yesterday, | and I'm impressed. More than twice as fast on a compute-intensive | job as my personal 2015 MBP, with a binary compiled | for x86 and with hand-coded SIMD instructions. | robohoe wrote: | Are you me lol? I'm on my third day on an M1 Pro. Battery life | is nuts. I can be on video calls and still do dev work | without worrying about charging. And the thing runs cool! | dexterdog wrote: | It helps that there were almost 2 years between the release | and your adoption. I had a very early M1 and it was not too | bad, but there were issues. I knew that going in. | EricE wrote: | I had an M1 Air early on and I didn't run into any issues. | Even the issues with apps like Homebrew were resolved | within 3-4 months of the M1 debut. It's amazing just how | seamless such a major architectural transition was and | continues to be! | radicaldreamer wrote: | Since this is the company's third big arch transition, cross-compilation | and compatibility are probably considered core | competencies for Apple to maintain internally. | mixmastamyk wrote: | And NeXT was multi-platform as well. | AnIdiotOnTheNet wrote: | It isn't their first rodeo: 68k->PPC->x86_64->ARM. | darzu wrote: | You gotta think there's been a lot of churn and lost | knowledge at the company between PPC->x86_64 (2006) and now | though. | esskay wrote: | Rosetta 1 and the PPC -> x86 move wasn't anywhere near as | smooth; I recall countless problems with that switch. Rosetta | 2 is a totally different experience, and so much better in | every way. | kevincox wrote: | But they've been on x86_64 for a _long_ time. How much of | that knowledge is still around? Probably some traces of it | have been institutionalized but it isn't the same as if they | just grabbed the same team and made them do it again a year | after the last transition. | toast0 wrote: | Nitpick: they did PPC -> x86 (32-bit); the x86_64 transition | was later (no translation layer though). They actually had | 64-bit PPC systems on the G5 when they switched to Intel | 32-bit, but Rosetta only does 32-bit PPC -> 32-bit x86; it | would have been rare to have released 64-bit-PPC-only | software. | EricE wrote: | They had a 64-bit Carbon translation layer, but spiked it | to force Adobe and some other large publishers to go native | Intel. There was a furious uproar at the time, but it | turned out to be the right decision. | rgiacobazzi wrote: | Great article! ___________________________________________________________________ (page generated 2022-11-09 23:00 UTC)