[HN Gopher] The Apple GPU and the impossible bug
___________________________________________________________________

The Apple GPU and the impossible bug

Author : stefan_
Score  : 719 points
Date   : 2022-05-13 13:32 UTC (9 hours ago)

(HTM) web link (rosenzweig.io)
(TXT) w3m dump (rosenzweig.io)

| stefan_ wrote:
| > The Tiled Vertex Buffer is the Parameter Buffer. PB is the PowerVR name, TVB is the public Apple name, and PB is still an internal Apple name.
|
| Patent lawyers love this one silly trick.
| robert_foss wrote:
| Seeing how Apple licensed the full PowerVR hardware before, they probably currently have a license for whatever hardware they based their design on.
| kimixa wrote:
| They originally claimed they completely redesigned it and announced they were therefore going to drop the PowerVR architecture license - that was the reason for the stock price crash and Imagination Technologies sale in 2017.
|
| They have since scrubbed the internet of all such claims and to this day pay for an architecture license. I think it's similar to an ARM architecture license - where it's a license for any derived technology and patents rather than actually being given the RTL for PowerVR-designed cores.
|
| I worked at PowerVR during that time (I have Opinions, but will try to keep them to myself), and my understanding was that Apple hadn't actually taken new PowerVR RTL for a number of years and had significant internal redesigns of large units (e.g. the shader ISA was rather different from the PowerVR designs of the time), but presumably they still use enough of the derived tech and ideas that paying for the architecture license is necessary. This transfer was only one way - we never saw anything internal about Apple's designs, so reverse engineering efforts like this are still interesting.
|
| And as someone who worked on the PowerVR cores (not the Apple derivatives), I can assure you that everything discussed in the original post is _extremely_ familiar.
| pyb wrote:
| Apple's claim is that they designed it themselves. https://en.wikipedia.org/wiki/Talk:Apple_M1#[dubious_%E2%80%...
| gjsman-1000 wrote:
| There's no reason that couldn't be a half-truth - it could be a PowerVR with certain components replaced, or even the entire GPU replaced but with PowerVR-like commands and structure for compatibility reasons. Kind of like how AMD designed their own x86 chip despite it being x86 (Intel's architecture).
|
| Also, if you read Hector Martin's tweets (he's doing the reverse-engineering), Apple replacing the actual logic while maintaining the "API" of sorts is not unheard of. It's what they do with ARM themselves - using their own ARM designs instead of the stock Cortex ones while maintaining ARM compatibility.*
|
| *Thus, Apple has a right to the name "Apple Silicon" because the chip is designed by Apple, and just happens to be ARM-compatible. Other chips from almost everyone else use stock ARM designs from ARM themselves. Otherwise, we might as well call AMD an "Intel design" because it's x86, by the same logic.
| quux wrote:
| Didn't Apple have a large or even dominant role in the design of the ARM64/AArch64 architecture? I remember reading somewhere that they developed ARM64 and essentially "gave it" to ARM, who accepted, but nobody could understand at the time why a 64-bit extension to ARM was needed so urgently, and why some of the details of the architecture had been designed the way they had. Years later, with Apple Silicon, it all became clear.
| kalleboo wrote:
| The source is a former Apple engineer (now at Nvidia, apparently):
|
| https://twitter.com/stuntpants/status/1346470705446092811
|
| > _arm64 is the Apple ISA, it was designed to enable Apple's microarchitecture plans. There's a reason Apple's first 64 bit core (Cyclone) was years ahead of everyone else, and it isn't just caches_
|
| > _Arm64 didn't appear out of nowhere, Apple contracted ARM to design a new ISA for its purposes. When Apple began selling iPhones containing arm64 chips, ARM hadn't even finished their own core design to license to others._
|
| > _ARM designed a standard that serves its clients and gets feedback from them on ISA evolution. In 2010 few cared about a 64-bit ARM core. Samsung & Qualcomm, the biggest mobile vendors, were certainly caught unaware by it when Apple shipped in 2013._
|
| > > _Samsung was the fab, but at that point they were already completely out of the design part. They likely found out that it was a 64 bit core from the diagnostics output. SEC and QCOM were aware of arm64 by then, but they hadn't anticipated it entering the mobile market that soon._
|
| > _Apple planned to go super-wide with low clocks, highly OoO, highly speculative. They needed an ISA to enable that, which ARM provided._
|
| > _M1 performance is not so because of the ARM ISA, the ARM ISA is so because of Apple core performance plans a decade ago._
|
| > > _ARMv8 is not arm64 (AArch64). The advantages over arm (AArch32) are huge. Arm is a nightmare of dependencies, almost every instruction can affect flow control, and must be executed and then dumped if its precondition is not met. Arm64 is made for reordering._
| quux wrote:
| Thanks!
| travisgriggs wrote:
| > > M1 performance is not so because of the ARM ISA, the ARM ISA is so because of Apple core performance plans a decade ago.
|
| This is such an interesting counterpoint to the occasional "Just ship it" screed (just one yesterday, I think?) we see on HN.
|
| I have to say, I find this long-form delivery of tech to be enlightening. That kind of foresight has to mean some level of technical savviness at high decision-making levels. Whereas many of us are caught at companies with short-sighted/tech-naive leadership who clamor to just ship it so we can start making money and recoup the money we're losing on these expensive tech-type developers.
| kif wrote:
| I think the "just ship it" method is necessary when you're small and starting out. Unless you are well funded, you can't afford to do what Apple did.
| pyb wrote:
| I haven't followed the announcements CPU-side - do Apple clearly claim that they designed their own CPU (with an ARM instruction set)?
| daneel_w wrote:
| They are one of a handful of companies that hold a license allowing them to both customize the reference core and to implement the Arm ISA through their own silicon design. Everyone else's SoCs all use the same Arm reference mask. Qualcomm also holds such a license, which is why their Snapdragon SoCs, just like Apple's A- and M-series, occupy a performance tier above everything else Arm.
| happycube wrote:
| The _only_ Qualcomm-designed 64-bit mobile core so far was the Kryo core in the 820. They then assigned that team to server chips (Centriq), then sacked the whole team when they felt they needed to cut cash flow to stave off Avago/Broadcom.
The "Kyro" cores from 835 on are | rebadged/adjusted ARM cores. | | IMO the Kyro/820 wasn't a _major_ failure, it turned out | a lot better than the 810 which had A53 /A57 cores. | | And _then_ they decided they needed a mobile CPU team | again and bought Nuvia for ~US$1 Billion. | masklinn wrote: | According to Hector Martin (the project lead of Asahi) in | previous threads of the subject[0], Apple actually has an | "architecture+" license which is completely exclusive to | them, thanks to having literally been at the origins of | ARM: not only can Apple implement the ISA on completely | custom silicon rather than license ARM cores, they can | _customise_ the ISA (as in add instructions, as well as | opt out of mandatory ISA features). | | [0] https://news.ycombinator.com/item?id=29798744 | pyb wrote: | Such a license is a big clue, but not quite what I was | enquiring about... | paulmd wrote: | To be blunt, you're asking about questions that could be | solved with a quick google and you are coming off as a | bit of a jerk asking for very specific citations with | exact specific wording for basic facts like this that, | again, could be solved by looking through the wikipedia | for "apple silicon" and then bouncing to a specific | source. People have answered your question and you're | brushing them off because you want it answered in an | exact specific way. | | https://en.wikipedia.org/wiki/Apple_silicon | | https://www.anandtech.com/show/7335/the- | iphone-5s-review/2 | | > NVIDIA and Samsung, up to this point, have gone the | processor license route. They take ARM designed cores | (e.g. Cortex A9, Cortex A15, Cortex A7) and integrate | them into custom SoCs. In NVIDIA's case the CPU cores are | paired with NVIDIA's own GPU, while Samsung licenses GPU | designs from ARM and Imagination Technologies. Apple | previously leveraged its ARM processor license as well. | Until last year's A6 SoC, all Apple SoCs leveraged CPU | cores designed by and licensed from ARM. | | > With the A6 SoC however, Apple joined the ranks of | Qualcomm with leveraging an ARM architecture license. At | the heart of the A6 were a pair of Apple designed CPU | cores that implemented the ARMv7-A ISA. I came to know | these cores by their leaked codename: Swift. | | Yes, Apple has been designing and using non-reference | cores since the A6 era, and were one of the first to the | table with ARMv8 (apple engineers claim it was designed | for them under contract to their specifications, but | _this_ part is difficult to verify with anything more | than citations from individual engineers). | | I expect that Apple has said as much in their | presentations somewhere, but if you're that keen on | finding such an incredibly specific attribution, then | knock yourself out. It'll be in an apple conference | somewhere, like WWDC. They probably have said "apple- | designed silicon" or "custom core" at some point, and | that would be your citation - but they also sell | products, not hardware, and they don't _extensively_ talk | about their architectures since they 're not really the | product, so you probably won't find a deep-dive like | Anandtech from Apple directly where they say "we have | 8-wide decode, 16-deep pipeline... etc" sorts of things. | [deleted] | gjsman-1000 wrote: | Qualcomm did use their own design called _Kyro_ for a | little while, but is now focusing on cores designed by | Nuvia which they just bought for the future. 
|
| As for Apple, they've designed their own cores since the Apple A6, which used the _Swift_ core. If you go to the Wikipedia page, you can actually see the names of their core designs, which they improve every year. For the M1 and A14, they use _Firestorm_ high-performance cores and _Icestorm_ efficiency cores. The A15 uses _Avalanche_ and _Blizzard_. If you visit AnandTech, they have deep dives on the technical details of many of Apple's core designs and how they differ from other core designs, including stock ARM.
|
| The Apple A5 and earlier were stock ARM cores, the last one they used being the Cortex A9.
|
| For this reason, Apple is about as much an ARM chip as AMD is an Intel chip. Technically compatible, implementation almost completely different. It's also why Apple calls it "Apple Silicon" - it is not just marketing, but actually justified, just as much as AMD not calling their chips Intel derivatives.
| GeekyBear wrote:
| > Qualcomm did use their own design called Kryo for a little while
|
| Before that, they had Scorpion and Krait, which were both quite successful 32-bit ARM-compatible cores at the time.
|
| Kryo started as an attempt to quickly launch a custom 64-bit ARM core, and the attempt failed badly enough that Qualcomm abandoned designing their own cores and turned to licensing semi-custom cores from ARM instead.
| amaranth wrote:
| Kryo started as custom but flopped in the Snapdragon 820, so they moved to a "semi-custom" design; it's unclear how different it really is from the stock Cortex designs.
| daneel_w wrote:
| The other-worldly performance-per-watt would be another.
| stephen_g wrote:
| They do, and their microarchitecture is unambiguously, hugely different to anything else (some details in [1]). The last Apple Silicon chip to use a standard Arm design was the A5X, whereas they were using customised PowerVR GPUs until, I think, the A11.
|
| 1. https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...
| rjsw wrote:
| > Apple replacing the actual logic while maintaining the "API" of sorts is not unheard of.
|
| They did this with ADB: early PowerPC systems contained a controller chip that had the same API that was implemented in software in the 6502 IOP coprocessor in the IIfx/Q900/Q950.
| brian_herman wrote:
| Also lawyers that can keep it in court long enough for a redesign.
| tambourine_man wrote:
| Few things are more enjoyable than reading a good bug story, even when it's not one's area of expertise. Well done.
| alimov wrote:
| I had the same thought. I really enjoy following along and getting a glimpse into the thought process of people working through challenges.
| danw1979 wrote:
| Alyssa and the rest of the Asahi team are basically magicians as far as I can tell.
|
| What amazing work, and great writing that takes an absolute graphics layman (me) on a very technical journey yet keeps it largely understandable.
| [deleted]
| nicoburns wrote:
| > Why the duplication? I have not yet observed Metal using different programs for each.
|
| I'm guessing whoever designed the system wasn't sure whether they would ever need to be different, and designed it so that they could be. It turned out that they didn't need to be, but it was either more work than it was worth to change it (considering that simply passing the same parameter twice is trivial), or they wanted to leave the flexibility in the system in case it's needed in future.
|
| I've definitely had APIs like this in a few places in my code before.
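
The replies below point at one concrete reason the two flush programs could legitimately differ: what each is allowed to throw away. As a hedged illustration (plain Vulkan, with arbitrary formats and layouts - nothing here is taken from the article's driver), an application can declare up front that the depth buffer does not need to survive the render pass, so the final tile flush may simply drop it, whereas a mid-pass partial render still has to spill depth so that later geometry can be depth-tested against it:

    /* Illustrative only: a render pass where color must be stored but depth
     * may be discarded once the pass ends.  A tiler can skip writing depth
     * in its final per-tile flush, but a partial render forced mid-pass
     * still has to write depth out so the rest of the pass can keep
     * depth-testing against it. */
    #include <vulkan/vulkan.h>

    VkAttachmentDescription color_attachment = {
        .format         = VK_FORMAT_B8G8R8A8_UNORM,
        .samples        = VK_SAMPLE_COUNT_1_BIT,
        .loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR,
        .storeOp        = VK_ATTACHMENT_STORE_OP_STORE,      /* needed after the pass */
        .stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
        .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
        .initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED,
        .finalLayout    = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR,
    };

    VkAttachmentDescription depth_attachment = {
        .format         = VK_FORMAT_D32_SFLOAT,
        .samples        = VK_SAMPLE_COUNT_1_BIT,
        .loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR,
        .storeOp        = VK_ATTACHMENT_STORE_OP_DONT_CARE,  /* only needed within it */
        .stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
        .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
        .initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED,
        .finalLayout    = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
    };

Metal exposes the same choice through MTLStoreAction (store versus dontCare), which is the storeAction setting pointed out a few replies down.
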
| pocak wrote:
| I don't understand why the programs are the same. The partial render store program has to write out both the color and the depth buffer, while the final render store should only write out color and throw away depth.
| kimixa wrote:
| Possibly pixel local storage - I think this can be accessed with extended raster order groups and image blocks in Metal.
|
| https://developer.apple.com/documentation/metal/resource_fun...
|
| E.g. in their example in the link above for deferred rendering (figure 4), the multiple G-buffers won't actually need to leave the on-chip tile buffer - unless there's a partial render before the final shading shader is run.
| hansihe wrote:
| Not necessarily - other render passes could need the depth data later.
| Someone wrote:
| So it seems it allows for optimization. If you know you don't need everything, one of the steps can do less than the other.
| johntb86 wrote:
| Most likely that would depend on what storeAction is set to: https://developer.apple.com/documentation/metal/mtlrenderpas...
| pocak wrote:
| Right, I had the article's bunny test program in mind, which looks like it has only one pass.
|
| In OpenGL, the driver would have to scan the following commands to see if it can discard the depth data. If it doesn't see the depth buffer get cleared, it has to be conservative and save the data. I assume mobile GPU drivers in general do make the effort to do this optimization, as the bandwidth savings are significant.
|
| In Vulkan, the application explicitly specifies which attachments (i.e. stencil, depth, color buffers) must be persisted at the end of a render pass, and which need not be. So that maps nicely to the "final render flush program".
|
| The quote is about Metal, though, which I'm not familiar with, but a sibling comment points out it's similar to Vulkan in this respect.
|
| So that leaves me wondering: did Rosenzweig happen to only try Metal apps that always use _MTLStoreAction.store_ in passes that overflow the TVB, or is the Metal driver skipping a useful optimization, or neither? E.g. because the hardware has another control for this?
| plekter wrote:
| I think multisampling may be the answer.
|
| For partial rendering, all samples must be written out, but for the final one you can resolve (average) them before writeout.
| [deleted]
| 542458 wrote:
| It's been said more than a few times in the past, but I cannot get over just how smart and motivated Alyssa Rosenzweig is - she's currently an undergraduate university student, and was leading the Panfrost project when she was still in high school! Every time I read something she wrote, I'm astounded at how competent and eloquent she is.
| frostwarrior wrote:
| While I was reading I was already thinking that. I can't believe how smart and awesome a developer she is.
| pciexpgpu wrote:
| Undergrad? I thought she was some Staff SWE at an OSS company. Seriously impressive, and ought to give anyone imposter syndrome.
| gjsman-1000 wrote:
| Well, Alyssa is, and works for Collabora while also being an undergrad.
| coverband wrote:
| I was about to post "very impressive", but that seems a huge understatement after finding out she's still in school...
| [deleted]
| aero-glide2 wrote:
| Have to admit, wherever I see people much younger than me do great things I get very depressed.
| kif wrote:
| I used to feel this way, too. However, every single one of us has their own unique circumstances.
|
| I can't give too many details, unfortunately. But there's a specific step I took in my career which was completely random at the time. I was still a student, and I decided not to work somewhere. I resigned two weeks in. Had I not done that, I wouldn't be where I am today. My situation would be totally different.
|
| Yes, some people are very talented. But it does take quite a lot of work and dedication. And yes, sometimes you cannot afford to dedicate your time to learning something because life happens.
| cowvin wrote:
| No need to be depressed. It's not a competition between you and them. You can find inspiration in what others achieve and try to achieve more yourself.
| ip26 wrote:
| I get that. But then I remember that at that age, I was only just cobbling together my very first computer from the scrap bin. An honest comparison is nearly impossible.
| pimeys wrote:
| And for me, her existence is enough to keep me from getting depressed about my industry. Whatever she's doing is keeping my hopes up for computer engineering.
| [deleted]
| ohgodplsno wrote:
| Be excited! This means amazing things are coming, from incredibly talented people. And even better when they put out their knowledge in public, in an easy-to-digest form, letting you learn from them.
| azinman2 wrote:
| Does anyone know if she has a proper interview somewhere? I'd love to know how she got so technical in high school as to be able to reverse engineer a GPU -- something I would have no idea how to start even with many more years of experience (although admittedly I know very little about GPUs and don't do graphics work).
| daenz wrote:
| That image gave me flashbacks of gnarly shader debugging I did once. IIRC, I was dividing by zero in some very rare branch of a fragment shader, and it caused those black tiles to flicker in and out of existence. Excruciatingly painful to debug on a GPU.
| thanatos519 wrote:
| What an entertaining story!
| ninju wrote:
| > Comparing a trace from our driver to a trace from Metal, looking for any relevant difference, we eventually _stumble on the configuration required_ to make depth buffer flushes work.
|
| > And with that, we get our bunny.
|
| So what was the configuration that needed to change? Don't leave us hanging!!!
| [deleted]
| dry_soup wrote:
| Very interesting and easy-to-follow writeup, even for a graphics ignoramus like myself.
| Jasper_ wrote:
| Huh, I always thought tilers re-ran their vertex shaders multiple times -- once with position-only outputs to do binning, and then _again_ computing all attributes for each tile; that's what the "forward tilers" like Adreno/Mali do. It's crazy that they dump all geometry to main memory rather than keeping it in the pipe. It explains why geometry is more of a limit on AGX/PVR than on Adreno/Mali.
| pocak wrote:
| That's what I thought, too, until I saw ARM's Hot Chips 2016 slides. Page 24 shows that they write transformed positions to RAM, and later write varyings to RAM. That's for Bifrost, but it's implied Midgard is the same, except that it doesn't filter out vertices from culled primitives.
|
| That makes me wonder whether the other GPUs with position-only shading - Intel and Adreno - do the same.
|
| As for PowerVR, I've never seen them described as position-only shaders - I think they've always done full vertex processing upfront.
|
| edit: slides are at https://old.hotchips.org/wp-content/uploads/hc_archives/hc28...
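
To make the "forward tiler" flow in the last two comments concrete, here is a small, self-contained C sketch of the binning half only. Everything in it - the 32x32 tile size, the structures, the fixed bin capacity - is invented for illustration and does not describe any real GPU or driver:

    /* Illustrative sketch of binning: a position-only pass projects each
     * triangle, takes its screen-space bounding box, and records the
     * triangle in every 32x32-pixel tile that box touches.  Clamping to
     * the render target and other real-world details are omitted. */
    #include <stdio.h>

    #define TILE_SIZE   32
    #define TILES_X     (1024 / TILE_SIZE)
    #define TILES_Y     (1024 / TILE_SIZE)
    #define MAX_PER_BIN 64

    typedef struct { float x, y; } Vec2;    /* already-projected screen position */

    typedef struct {
        int count;
        int tris[MAX_PER_BIN];              /* indices of triangles touching this tile */
    } TileBin;

    static TileBin bins[TILES_Y][TILES_X];

    static float min3(float a, float b, float c) { return a < b ? (a < c ? a : c) : (b < c ? b : c); }
    static float max3(float a, float b, float c) { return a > b ? (a > c ? a : c) : (b > c ? b : c); }

    static void bin_triangle(int tri, Vec2 v0, Vec2 v1, Vec2 v2)
    {
        int tx0 = (int)min3(v0.x, v1.x, v2.x) / TILE_SIZE;
        int ty0 = (int)min3(v0.y, v1.y, v2.y) / TILE_SIZE;
        int tx1 = (int)max3(v0.x, v1.x, v2.x) / TILE_SIZE;
        int ty1 = (int)max3(v0.y, v1.y, v2.y) / TILE_SIZE;

        for (int ty = ty0; ty <= ty1; ty++)
            for (int tx = tx0; tx <= tx1; tx++)
                if (bins[ty][tx].count < MAX_PER_BIN)   /* a real GPU's binning structure
                                                           grows in memory instead of
                                                           dropping triangles */
                    bins[ty][tx].tris[bins[ty][tx].count++] = tri;
    }

    int main(void)
    {
        bin_triangle(0, (Vec2){ 40,  40}, (Vec2){200,  60}, (Vec2){ 90, 300});
        bin_triangle(1, (Vec2){500, 500}, (Vec2){540, 510}, (Vec2){510, 550});

        for (int ty = 0; ty < TILES_Y; ty++)
            for (int tx = 0; tx < TILES_X; tx++)
                if (bins[ty][tx].count)
                    printf("tile (%2d,%2d): %d triangle(s)\n", tx, ty, bins[ty][tx].count);
        return 0;
    }

The design fork the thread is pointing at comes after this step: a "forward" tiler re-runs the full vertex shader per covered tile, while PowerVR/AGX-style designs shade vertices once up front and stream the results through the Parameter Buffer in memory - the same buffer whose overflow forces the partial renders the article is about.
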
| Jasper_ wrote:
| Mali's slides here still show them doing two vertex shading passes, one for positions, and again for the other attributes. I'm guessing "memory" here means high-performance in-unit memory like TMEM, rather than a full frame's worth of data, but I'm not sure!
| atq2119 wrote:
| I was under that impression as well. If they write out all attributes, what is really the remaining difference from a traditional immediate mode renderer? Nvidia has reportedly had vertex attributes going through memory for many generations already (and they are at least partially tiled...).
|
| I suppose the difference is whether the render target lives in the "SM" and is explicitly loaded and flushed (by a shader, no less!) or whether it lives in a separate hardware block that acts as a cache.
| Jasper_ wrote:
| NV has vertex attributes "in-pipe" (hence mesh shaders), and the appearance of a tiler is a misread; it's just a change to the macro-rasterizer about which quads get dispatched first, it's not a true tiler.
|
| The big difference is the end of the pipe, as mentioned: whether you have ROPs, or whether your shader cores load/store from a framebuffer segment. Basically, whether framebuffer clears are expensive (assuming no fast-clear cheats) or free.
| [deleted]
| bob1029 wrote:
| I really appreciate the writing and work that was done here.
|
| It is amazing to me how complicated these systems have become. I am looking over the source for the single-triangle demo. Most of this is just about getting information from point A to point B in memory. Over 500 lines' worth of GPU protocol overhead... Granted, this is a one-time cost once you get it working, but it's still a lot to think about and manage over time.
|
| I've written software rasterizers that fit neatly within 200 lines and provide very flexible pixel shading techniques. Certainly not capable of running a Cyberpunk 2077 scene, but interactive framerates otherwise. In the good case, I can go from a dead stop to final frame buffer in <5 milliseconds. Can you even get the GPU to wake up in that amount of time?
| mef wrote:
| with great optimization comes great complexity
| [deleted]
| quux wrote:
| Impressive work and a really interesting write-up. Thanks!
| VyseofArcadia wrote:
| > Yes, AGX is a mobile GPU, designed for the iPhone. The M1 is a screaming fast desktop, but its unified memory and tiler GPU have roots in mobile phones.
|
| PowerVR has its roots in a desktop video card with somewhat limited release and impact. It really took off when it was used in the Sega Dreamcast home console and the Sega Naomi arcade board. It was only later that people put them in phones.
| robert_foss wrote:
| But being a tiling rendering architecture, which is normal for mobile applications and not how desktop GPUs are architected, it would be fair to call it a mobile GPU.
| Veliladon wrote:
| Nvidia appears to be an immediate mode renderer to the user but has used a tiled rendering architecture under the hood since Maxwell.
| pushrax wrote:
| According to the sources I've read, it uses a tiled rasterizing architecture, but it's not deferred in the same way as a typical mobile TBDR, which bins all vertexes before starting rasterization, defers all rasterization until after all vertex generation, and flushes each tile to the framebuffer once.
|
| NV seems to rasterize vertexes in small batches (i.e. immediately) but buffers the rasterizer output on-die in tiles. There can still be significant overlap between vertex generation and rasterization. Those tiles are flushed to the framebuffer, potentially before they are fully rendered, and potentially multiple times per draw call depending on the vertex ordering. They do some primitive reordering to try to avoid flushing as much, but it's not a fully deferred architecture.
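
As a rough illustration of the "shade into on-chip tile memory, then flush the finished tile once" pattern described for classic TBDRs above - and of the kind of tiny software rasterizer mentioned a few comments up - here is a self-contained C sketch. It builds on the binning sketch earlier in the thread; everything in it is invented for illustration, and the flush step only loosely mirrors what a real tiler's store programs do:

    /* Illustrative only: rasterize one triangle into a small tile buffer
     * (standing in for on-chip tile memory), then write the whole tile to
     * the framebuffer exactly once.  The edge-function inside test is the
     * same core trick a small software rasterizer uses. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define TILE 32
    #define FB_W 64
    #define FB_H 64

    typedef struct { float x, y; } Vec2;

    static uint32_t framebuffer[FB_H][FB_W];   /* "main memory" */
    static uint32_t tile_buffer[TILE][TILE];   /* stand-in for on-chip tile memory */

    /* Signed edge function; with the vertex order used below, all three
     * edge tests are >= 0 for points inside the triangle. */
    static float edge(Vec2 a, Vec2 b, Vec2 p)
    {
        return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
    }

    static void rasterize_into_tile(int x0, int y0, Vec2 v0, Vec2 v1, Vec2 v2, uint32_t color)
    {
        for (int y = 0; y < TILE; y++) {
            for (int x = 0; x < TILE; x++) {
                Vec2 p = { x0 + x + 0.5f, y0 + y + 0.5f };
                if (edge(v0, v1, p) >= 0 && edge(v1, v2, p) >= 0 && edge(v2, v0, p) >= 0)
                    tile_buffer[y][x] = color;          /* shading stays on chip */
            }
        }
    }

    static void flush_tile(int x0, int y0)              /* the one write-out per tile */
    {
        for (int y = 0; y < TILE; y++)
            memcpy(&framebuffer[y0 + y][x0], tile_buffer[y], TILE * sizeof(uint32_t));
    }

    int main(void)
    {
        Vec2 v0 = { 4, 4 }, v1 = { 60, 10 }, v2 = { 20, 58 };

        for (int ty = 0; ty < FB_H; ty += TILE) {
            for (int tx = 0; tx < FB_W; tx += TILE) {
                memset(tile_buffer, 0, sizeof(tile_buffer));   /* clears are cheap on chip */
                rasterize_into_tile(tx, ty, v0, v1, v2, 0xFFFFFFFFu);
                flush_tile(tx, ty);
            }
        }

        printf("center pixel: %08x\n", (unsigned)framebuffer[FB_H / 2][FB_W / 2]);
        return 0;
    }

The contrast with the NVIDIA behaviour pushrax describes is the flush policy: here a tile is written out exactly once, whereas a design that rasterizes in small batches may spill the same tile to memory several times per pass.
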
| [deleted]
| monocasa wrote:
| Nvidia's is a tile-based immediate mode rasterizer. It's more a cache-friendly immediate renderer than a TBDR.
| tomc1985 wrote:
| I actually had one of those cards! The only games I could get it to work with were Half-Life, glQuake, and Jedi Knight, and the bilinear texture filtering had some odd artifacting, IIRC.
| wazoox wrote:
| Unified memory was introduced by SGI with the O2 workstation in 1996, then they used it again with their x86 workstations, the SGI 320 and 540, in 1999. So it was a workstation-class technology before being a mobile one :)
| andrekandre wrote:
| even the n64 had unified memory way back in 1995
| nwallin wrote:
| The N64's unified memory model had a pretty big asterisk though. The system had only 4kB for textures out of 4MB of total RAM. And textures are what uses the most memory in a lot of games.
| ChuckNorris89 wrote:
| The N64 chip was also SGI-designed.
| iforgotpassword wrote:
| Was it the Kyro 2? I had one of these but killed it by overclocking... Would make for a good retro system.
| smcl wrote:
| The Kyro and Kyro 2 were a little after the Dreamcast.
| sh33sh wrote:
| Really enjoyed the way it was written.
| GeekyBear wrote:
| Alyssa's writing style steps you through a technical mystery in a way that remains compelling even if you lack the domain knowledge to solve the mystery yourself.
___________________________________________________________________
(page generated 2022-05-13 23:00 UTC)