[HN Gopher] Single-chip processors have reached their limits ___________________________________________________________________ Single-chip processors have reached their limits Author : blopeur Score : 129 points Date : 2022-04-04 16:53 UTC (6 hours ago) (HTM) web link (spectrum.ieee.org) (TXT) w3m dump (spectrum.ieee.org) | Veliladon wrote: | The M1 Ultra is fabricated as a single chip. The 12900K is | fabricated as a single chip and is still a quarter the size of | the M1 Ultra. Ryzen 3 puts 8 cores on a CCX instead of four | because DDR memory controllers don't have infinite memory | bandwidth (contrary to AMD's wishful nomenclature) and make | shitty interconnects between banks of L3. | | Chiplets are valid strategies that are going to be used in the | future but there are still more tricks that CPU makers have up | their sleeves that they need to use out of necessity. They're | nowhere near their limits. | paulmd wrote: | "chip" is ambiguous the way you're using it. | | The M1 Ultra is two _dies_ in one _package_. The package is | what goes on the motherboard. | | You can also count the memory modules as dies/packages as well. | It's not incorrect to say that M1 Ultra has a bunch of LPDDR5 | _packages_ on it as well, each LPDDR5 _package_ may have | multiple _dies_ in it as well. | | But depending on context it also wouldn't be incorrect to say | the M1 ultra is a _package_ as a chip even if it 's got more | packages on it. From the context of the motherboard maker, the | CPU BGA unit is the "package". | | Anyway no, Ultra isn't a monolithic die in the sense you're | meaning, it's two dies that are joined, Apple just uses a | ridiculously fat pipe to do it (far beyond what AMD is using | for Ryzen) such that it basically appears to be a single die. | The same is true for AMD, Rome/Milan are notionally NUMA - | running in NPS4 mode can squeeze some extra performance in | extreme situations if applications are aware of it, and there's | some weird oddities caused by memory locality in "unbalanced" | configurations where each quadrant doesn't have the same amount | of channels. It just doesn't feel like it because AMD has done | a very good job hiding it. | | However you're also right that we haven't reached the end of | monolithic chips either. Splitting a chip into modules imposes | a power penalty for data movement, it's much more expensive to | move data off-chiplet than on-chiplet, and that imposes a limit | on how finely you can split your chiplets (doing let's say 64 | tiny chiplets on a package would use a huge amount of power | moving data around, since everything is off-chip). There are | various technologies like copper-copper bonding and EMIB that | will hopefully lower that power cost in the future, but it's | there. | | And even AMD uses monolithic chips for their laptop parts, | because of that. If _any_ cores are running, the IO die has to | be powered up, running its memory and infinity fabric links, | and at least one CCD has to be powered up, even if it 's just | to run "hello world". This seems to be around 15-20W, which is | significant in the context of a home or office PC. | | It's worth noting that Ryzen is not really a desktop-first | architecture. It's server-first, and AMD has found a clever way | to pump their volumes by using it for enthusiast hardware. | Servers don't generally run 100% idle, they are loaded or they | are turned off entirely and rebooted when needed. 
If you can't | stand the extra 20W at idle, AMD would probably tell you to buy | an APU instead. | 2OEH8eoCRo0 wrote: | > The M1 Ultra is fabricated as a single chip. | | I'm curious how much the M1 Ultra costs. It's such a massive | single piece of glass I'd guess it's $1,200+. If that's the | case it doesn't make sense to compare the M1 Ultra to $500 CPUs | from Intel and AMD. | mrtksn wrote: | Wouldn't the price be primarily based on capital investment | and not so much on the unit itself? After all, it's | essentially a print out on a crystal using reeeeeally | expensive printers. AFAIK Apple's relationship with TSMC is | more than a customer relationship. | 2OEH8eoCRo0 wrote: | In a parallel universe where Intel builds and sells this | CPU- what's the price? Single chip, die size of 860 square | mm, 114 billion transistors, on package memory. | | It just got me thinking the other day since all of these | benchmarks pit it against $500-$1000 CPUs and it doesn't | seem to fall in that price range at all. Look at this | thing: | | https://cdn.wccftech.com/wp- | content/uploads/2022/03/2022-03-... | headass wrote: | also all the other shit that's on the chip ram etc. | gameswithgo wrote: | It is commonly said that on the new M1 macs, that the ram | is on the chip, it is not. It is on the same substrate, but | its just normal (fast) dram chips soldered on nearby. | grishka wrote: | If there's a defective M1 Ultra, they can cut it in half and | say those are two low-end M1 Max. | 2OEH8eoCRo0 wrote: | Wouldn't they only get, at most, one low-end M1 Max if | there is a defect? | grishka wrote: | They sell cheaper models with some cores disabled, that's | what I meant by low-end. Ever wondered what's the deal | with the cheapest "7-core GPU" M1? | monocasa wrote: | If the defect is in the right place, Apple apparently | sells M1 Max chips with some GPU cores disabled. | wmf wrote: | I estimate that Apple's internal "price" for the M1 Ultra is | around $2,000. Since most of the chip is GPU, it should | really be compared to a combo like 5950X + 6800 XT or 12900K | + 3080. | 2OEH8eoCRo0 wrote: | It wouldn't surprise me. M1 Ultra has 114 billion | transistors and a total area of ~860 square mm. For | comparison, an RTX 3090 has 28 billion transistors and a | total area of 628 square mm. | sliken wrote: | Dunno, M1 Ultra includes a decent GPU, which the $500 CPUs | from Intel and AMD do not. Seems relatively comparable to a | $700 GPU (like a RTX 3070 if you can find one) depending on | what you are using. Sadly metal native games are rare, many | use some metal wrapper and/or Rosetta emulation. | | Seems pretty fair to compare an Intel alder lake or higher | end AMD Ryzen AND a GPU (rtx 3070 or radeon 6800) to the M1 | ultra, assuming you don't care about power, heat, or space. | touisteur wrote: | Has anyone managed to reach the actual advertised 21 FP32 | TFLOPS? I'm curious. Even BLAS or pure custom matmul stuff? | How much of that is actually available? I can almost | saturate and sustain an NVIDIA A40 or A4000 to their peak | perf, so, wondering whether anyone written something there? | cma wrote: | M1 ultra is two chips with an interconnect between them I | thought? Or is the interconnect already on die with them? | | (Edit: sounds like it is two: "Apple said fusing the two M1 | processors together required a custom-built package that uses a | silicon interposer to make the connection between chips. 
" | https://www.protocol.com/bulletins/apple-m1-ultra-chip ) | monocasa wrote: | > M1 ultra is two chips with an interconnect between them I | thought? Or is the interconnect already on die with them? | | It's either depending on how you look at it. The active | components of the interconnect are on the two M1 dies, but | the interconnect itself goes through the interposer as well. | fulafel wrote: | Some older stuff for reference: IBM POWER5 and POWER5+ | (2004&2005) are MCM designs, had 2-4 CPU chips plus cache chips | in same package. | | Link: https://en.wikipedia.org/wiki/POWER5 | sliken wrote: | Pentium pro from 1995 had two pieces of silicon in the package: | https://en.wikipedia.org/wiki/Pentium_Pro | h2odragon wrote: | PPros are quite hard to find now because the "gold | scavengers" loved them. As i recall, at the peak in 2008, | they were $100ea and more for the ceramic packages. All that | interconnect was tiny gold wires, apparently. | sliken wrote: | Heh, had no idea, they seemed to have a pretty limited run, | ran at up 200 MHz, but was pretty quickly replaced by a | Pentium-II at 233 Mhz on a single die. | marcodiego wrote: | Makes me remember the processor in the film terminator 2: | https://gndn.files.wordpress.com/2016/04/shot00332.jpg | bob1029 wrote: | Despite the limitations apparently present in single chip/CPU | systems, they can still provide an insane amount of performance | if used properly. | | There are also many problems that are literally impossible to | make faster or more correct than by simply running them on a | single thread/processor/core/etc. There always will be forever | and ever. This is not a "we lack the innovation" problem. It's an | information-theoretic / causality problem you can demonstrate | with actual math & physics. Does a future event's processing | circumstances maybe depend on all events received up until now? | If yes, congratulations. You now have a total ordering problem | just like pretty much everyone else. Yes, you can cheat and say | "well these pieces here and here dont have a hard dependency on | each other", but its incredibly hard to get this shit right if | you decide to go down that path. | | The most fundamental demon present in any distributed system is | latency. The difference between L1 and a network hop in the same | datacenter can add up very quickly. | | Again, for many classes of problems, there is simply no | handwaving this away. You either wait the requisite # of | microseconds for the synchronous ack to come back, or you hope | your business doesnt care if john doe gets duplicated a few times | in the database on a totally random basis. | AnthonyMouse wrote: | The alternative is speculative execution. If you can guess what | the result is going to be, you can proceed to the next | calculation and you get there faster if it turns out you were | right. | | If you have parallel processors, you can stop guessing and just | proceed under both assumptions concurrently and throw out the | result that was wrong when you find out which one it was. This | is going to be less efficient, but if your only concern is | "make latency go down," it can beat waiting for the result or | guessing wrong. | tsimionescu wrote: | Not necessarily. There are problems you can't speed up even | if you are given a literal infinity of processors - the | problems in EXP for example (well, EXP - NP). 
Even for NP | problems, the number of processors you need for a meaningful | speedup grows proportionally to the size of the problem | (assuming P!=NP). | AnthonyMouse wrote: | Computational complexity and parallelism are orthogonal. | Many EXP algorithms are embarrassingly parallel. You still | have to do 2^n calculations, but if you have 1000 | processors then it will take 1000 times less wall clock | time because you're doing 1000 calculations at once. | | The reason parallelism doesn't "solve" EXP problems is that | parallelism grows linearly against something whose time | complexity grows exponentially. It's not that it doesn't | work at all, it's that if you want to solve the problem for | 2n in the same time as for n, you need to double the number | of processors. So the number of processors you need to | solve the problem in whatever you define as a reasonable | amount of time grows exponentially with n, but having e.g. | 2^n processors when n is 1000 is Not Gonna Happen. | | Having 1000 processors will still solve the problem twice | as fast as 500, but that's not much help in practice when | it's the difference between 50 billion billion years and | 100 billion billion. | kzrdude wrote: | It's surprising it took that many cores before the limit was | reached! | ksec wrote: | Is spectrum.ieee.org becoming another mainstream (so to speak) | journalism outlet where everything is dumbed down to basically | Newspeak? The article is poorly written, the content is shallow and the | headline is click bait. | retrac wrote: | The best chiplet interconnect may turn out to be no interconnect | at all. Wafer scale integration [1] has come up periodically over | the years. In short, just make a physically larger integrated | circuit, potentially as large as the entire wafer -- like a foot | across. As I understand it, there's no particular technical | hurdle, and indeed the progress with self-healing and self-testing | designs with redundancy to improve yield for small | processors also makes really large designs more feasible than in | the past. The economics never worked out in favour of this | approach before, but now that we're at the scaling limit maybe that | will change. | | At least one company is pursuing this at the very high end. The | Cerebras WSE-2 [2] ("wafer scale engine") has 2.6 trillion | transistors with 800,000 cores and 48 gigabytes of RAM, on a | single, giant, integrated circuit (shown in the linked article). | I'm just an interested follower of the field, no expert, so what | do I know. But I think that we may see a shift in that direction | eventually. Everything on-die with a really big die. System on a | chip, but for the high end, not just tiny microcontrollers. | | [1] https://en.wikipedia.org/wiki/Wafer-scale_integration | | [2] https://www.zdnet.com/article/cerebras-continues-absolute- | do... | AceJohnny2 wrote: | To clarify and contextualize a bit what you're saying: | | The one big obstacle in creating larger chips is defects. | There's just a statistical chance of there being a defect on | any given surface area of the wafer, a defect which generally | breaks the chip that occupies that area of the wafer. | | So historically, the approach was to make more, smaller chips | and trash those chips on the wafer affected by defects. Then | came the "chiplet" approach where they can assemble those | functional chips into a larger meta-chip (like the Apple M1 | Ultra).
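A rough sketch of the yield math behind that point, assuming defects land randomly on the wafer (a simple Poisson model); the defect density and die sizes below are made-up illustrative numbers, not any foundry's figures:

    import math

    D = 0.003          # hypothetical defect density, defects per mm^2 (0.3 per cm^2)
    BIG_DIE = 800.0    # one large monolithic die, mm^2
    CHIPLET = 100.0    # the same silicon split into 8 chiplets of 100 mm^2 each

    def fraction_defect_free(area_mm2):
        """Probability that a die of the given area catches zero defects."""
        return math.exp(-D * area_mm2)

    print(f"big die defect-free:  {fraction_defect_free(BIG_DIE):.1%}")   # ~9%
    print(f"chiplet defect-free:  {fraction_defect_free(CHIPLET):.1%}")   # ~74%

    # Monolithic: one defect anywhere scraps all 800 mm^2, so only ~9% of the
    # silicon ends up in a sellable part. Chiplets: ~74% of the 100 mm^2 pieces
    # are clean and can be recombined into full packages, and many defective
    # ones can still be harvested as lower-core-count parts.

Under these toy numbers the chiplet approach turns most of the wafer into sellable silicon, which is the economic argument being described here.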
| | But as you're saying, changes in the way chips are designed can | make them resilient to defects, so you no longer need to trash | that chip on your wafer that's affected by such a defect, and | can thus design a larger chip without fear of defects. | | (Of course such an approach requires a level of redundancy in | the design, so there's a tradeoff) | zitterbewegung wrote: | The Cerebras WSE design is able, on each wafer, to disable / | efuse portions of itself to account for defects. This is | what you can do if you control the wafer. | candiddevmike wrote: | Wikipedia link on microlithography if you want a rabbit hole | about wafer making: | | https://wikipedia.org/wiki/Microlithography | | Being able to print something in nanometers is an overlooked | technical achievement for human manufacturing. | adhesive_wombat wrote: | If that rabbit hole appeals, the ITRS reports (now called | IRDS[2]) are a very good mid-level, year-by-year summary of | the state of the art in chipmaking, including upcoming | challenges and future directions. | | > Being able to print something in nanometers is an | overlooked technical achievement for human manufacturing. | | IMO, a semiconductor fab probably is _the_ highest human | achievement in terms of process engineering. Not only do | you "print" nanometric devices, you do it continuously, in | a multi-month pipelined system and sell the results for as | little as under a penny (micros, and even the biggest | baddest CPUs are "only" a thousand pounds, far less than | any other item with literally a billion functional designed | features on it). | | [1]: https://en.wikipedia.org/wiki/International_Technology | _Roadm... | | [2]: https://en.wikipedia.org/wiki/International_Roadmap_fo | r_Devi... | AceJohnny2 wrote: | I'll add that many DRAM chips already do something like this, | but ironically enough the re-routing mechanism adds | complexity _which is itself a source of problems_ (be it | manufacturing or design, such as broken timing promises) | | Also, NAND Flash storage (SSD) is designed around the very | concept of re-routing around bad blocks, because the very | technology means they have a wear-life. | Dylan16807 wrote: | > I'll add that many DRAM chips already do something like | this, but ironically enough the re-routing mechanism adds | complexity which is itself a source of problems, (be it | manufacturing or design, such as broken timing promises) | | The best-performing solution there is probably software. | Tell the OS about bad blocks and keep the hardware simple. | nine_k wrote: | I think this is already implemented both in Linux and in | Windows; you can tell the OS which RAM ranges are | defective. | | Doing this from the chip side is not there yet, | apparently. I wonder when this will be included in the | DRAM feature list, if ever. I suspect that detecting | defects from the RAM side is not trivial. | Dylan16807 wrote: | > I suspect that detecting defects from the RAM side is | not trivial. | | Factory testing or a basic self-test mode could easily | find any parts that are flat-out broken. And as internal | ECC rolls out as a standard feature, that could help find | weaker rows over time. | dylan42 wrote: | > change in the way chips are designed can make them | resilient to defects | | This is already happening for almost all modern chips | manufactured in the last 10+ years. DRAM chips have extra | rows/cols.
Even Intel CPUs have redundant cache lines, | internal bus lines and other redundant critical parts, which | are burned-in during initial chip testing. | ip26 wrote: | The other one big obstacle is chips are square while wafers | are round. | paulmd wrote: | it depends on the exact shape of your mask of course, but | typically losses around the edges are in the 2-3% range. | | It's not really possible to fix this either since wafers | need to be round for various manufacturing processes | (spinning the wafer for coating or washing stages) and | round obviously isn't a dense packing of the mask itself. | It just kinda is how it is, square mask and round wafer | means you lose a bit off the edges, fact of life. | paulmd wrote: | > changes in the way chips are designed can make them | resilient to defects, so you no longer need to trash that | chip on your wafer that's affected by such a defect, | | no, it's basically "chiplets but you don't cut the chiplets | apart". You design the chiplets to be nodes in a mesh | interconnect, and failed chiplets can simply be disabled | entirely and then routed around. But they're still "chiplets" | that have their own functionality and provide a coarser | conceptual block than a core itself and thus simplify some of | the rest of the chip design (communications/interconnect, | etc). | | note that technically (if you don't mind the complexity) | there's nothing wrong with harvesting at multiple levels like | this! You could have "this chiplet has 8 cores, that one has | 6, that one failed entirely and is disabled" and as long as | it doesn't adversely affect program characteristics too much | (data load piling up or whatever) that can be fine too. | | however, there's nothing about "changes in the way the chips | are designed that makes them more resilient to defects", you | still get the same failure rates per chiplet, and will still | get the same amount of failed (or partially failed) chiplets | per wafer, but instead of cutting out the good ones and then | repackaging, you just leave them all together around "route | around the bad ones". | | The advantage is that MCM-style chiplet/interposer packaging | actually makes data movement much more expensive, because you | have to run a more powerful interconnect, where this isn't | moving anything "off-chip", so you avoid a lot of that power | cost. There are other technologies like EMIB and copper- | copper bonding that potentially can lessen those costs for | chiplets of course. | | What Intel is looking at doing with "tiles" in their future | architectures with chiplets connected by EMIB at the edges | (especially if they use copper-copper bonding) is sort of a | half-step in engineering terms here but I think there are | still engineering benefits (and downsides of course) to doing | it as a single wafer rather than hopping through the bridge | even with a really good copper-copper bond. Actual full-on | MCM/interposer packaging is a step worse than cu-cu bonding | and requires more energy but even cu-cu bonding is not | perfect and thus not as good as just "on-chip" routing. So | WSI is designed to get everything "on-chip" but without the | yield problems of just a single giant chip. | wmf wrote: | Calling wafer-scale "no interconnect" is kind of misleading | since it's still very difficult to stitch reticles and it has | yield challenges. | galaxyLogic wrote: | Sounds like a great development if it works out. | | But consider also that you can stick chiplets on top of each | other vertically. 
That means you can put chiplets much closer | together than if they were constrained to exist on the same | single plane of the wafer. | | Now how about stacking wafers on top of wafers? That could be | super, but there might be technical difficulties, which maybe | sooner or later can be overcome. | AceJohnny2 wrote: | > _But consider also that you can stick chiplets on top of | each other vertically._ | | The problem there is heat dissipation. Already the | performance constraint on consumer chips like the Apple M1 is | how well it can dissipate heat in the product it's placed in | (see Macbook Air vs Mac Mini). Stacking the chips just makes | it worse. | GeekyBear wrote: | The fact that the M1 Macbook Air operates without needing a | fan is very unusual for that level of performance. | paulmd wrote: | AMD's 5800X3D and the upcoming generation of AMD/NVIDIA | GPUs (both of which are rumored to feature stacked cache | dies) are going to be real interesting. So far we haven't | ever seen a stacked _enthusiast_ die (MCM doesn 't feature | any active transistors on the interposer) and it will be | interesting to see how the thermals work out. | | This isn't even stacking _compute_ dies either, stacking | memory /cache is the low-hanging fruit but in the long term | what everyone really wants is stacking multiple compute | dies on top of each other, and _that 's_ going to get spicy | real quick. | | M1 is the other example but again, Apple's architecture is | sort of unique in that they've designed it to run from the | ground up at 3 GHz exactly, there's no overclocking/etc | like enthusiasts generally expect. AMD is having to disable | voltage control/overclocking on the 5800X3D as well | (although that may be more related to voltage control | rather than thermals - sounds like the cache die may run | off one of the voltage rails from the CPU, potentially a | FIVR could be used to drive that rail independently, or add | an additional V_mem rail...) | | And maybe that's the long-term future of things, that | overclocking goes away and you just design for a tighter | design envelope, one where you _know_ the thermals work for | the dies in the middle of the sandwich. Plus the Apple | design of "crazy high IPC and moderately low ~3 GHz | clocks" appears well-adapted for that reality. | Iwan-Zotow wrote: | the problem is signal propagation | | for light to cross 1 feet should take ca 1ns | paulmd wrote: | 3D circuits would be denser (shorter propagation | distances) than a planar circuit. In fact "computronium" | is sort of an idea about how dense you can conceptually | make computation. | | You just can't really cool it that well with current | technologies. Microfluidics are the current magic wand | that everyone wishes existed but it's a ways away yet. | nynx wrote: | There are some new chip manufacturing technique coming down the | pipeline, which will lead to prices dropping and likely "wafer- | scale" will get to the mainstream. | tragictrash wrote: | Could you elaborate? Would love to know more. | nynx wrote: | Unfortunately, I cannot. | AtlasBarfed wrote: | ... that's the exact opposite of every economic and yield | advantage that chiplet design addresses, isn't it? | | Want to fine tune your chip offering to some multiple of 8 | cores (arbitrary example of the # cores on the chiplet)? Just a | packaging issue. | | Want to upbin very large corecounts that generally overclock | quite well? 
For a massive unichip described, maybe there are | sections of the chip that are clocking well and sections that | aren't: you're stuck. With chiplets, you have better binning | granularity and packaging. | | Want to fine-tune various cache levels? I believe from what | I've read that AMD is doing L3 on a separate chiplet (and | vertically stacking it!). So you can custom-tune the cache size | for chip designs, | | You can custom-process different parts of the "CPU" with | different processes and fabrication, possibly even different | fab vendors. | | You can upgrade various things like memory support and other | evolving things in an isolated package, which should help | design and testing. | | The interconnects are the main problem. But then again, I can't | imaging what a foot-wide CPU introduces for intra-chip | communication, it probably would have it's own pseudo- | interconnect highway anyway. | | Maybe you don't even need to reengineer some chiplets between | processor generations. If the BFD of your new release is some | improvement to the higher or lower cpus in the High-Low | designs, but the other is the same, then that should be more | organizational efficiency. | | Intel and others have effectively moved from gigantic | integrated circuits decades ago: motherboard chipsets were | always done with a separate cheaper fab that was a gen or two | behind the CPU. | | Maybe when process tech has finally stabilized for a generation | now that process technology seems to be stagnating more, then | massive wafer designs will start to edge out chiplet designs, | but right now it appears that the opposite has happened and | will continue for the foreseeable future. | kwhitefoot wrote: | Nothing new under the sun. Ivor Catt was proposing wafer scale | computing in the '70s. Large numbers of processors with the | ability to route around defective units. | | https://www.ivorcatt.org/icrns86jun_0004.htm | Iwan-Zotow wrote: | speed of light ~ 1*10^9 feet/sec | | To cross one foot - no less than 10^-9 sec = 1ns | FredPret wrote: | Instead of microcircuits, megacircuits. I like it | truth_seeker wrote: | A chip with Semi-FPGA as well as Semi-ASIC strategy could work. | FPGA dev tools chain needs to improve. | tempnow987 wrote: | "Reached their limits" - I feel like I've heard this many many | times before. | | Not that I doubt it, but just I've also been impressed with the | ingenuity that folks come up with in this space. | mjreacher wrote: | Agreed. I would be wary of reaching fundamental limits set by | physics although I don't think we're there yet. | | "It would appear that we have reached the limits of what is | possible to achieve with computer technology, although one | should be careful with such statements, as they tend to sound | pretty silly in five years." | | - attributed to von Neumann, 1949. | tawaypol wrote: | "There's plenty of room at the bottom." | marcosdumay wrote: | Actually, we are getting out of room there. | | that speech is about 80 years old nowadays. There was plenty | of room at that time. | | Of course, it also speculated that we would move into quantum | computers at some point, what is still a possibility, but now | we know that quantum computers won't solve every issue. | syntheweave wrote: | We only have to solve one limitation per year to keep making | progress year over year, and as it is, the semiconductor | industry still seems to be solving large numbers of significant | issues yearly. 
So while we don't necessarily get smooth, | predictable improvement, a safe bet is that there will be | continue to be useful new developments 10-20 years out, even if | they don't translate to the same kinds of gains as in years | past. | macrolocal wrote: | For example, there's lots to explore in the VLIW space. | JonChesterfield wrote: | Compilers, largely. | macrolocal wrote: | Yep, and also architectures whose state is simpler to | model. | gameswithgo wrote: | aeturnum wrote: | I read articles like this as saying "reached their limits [as | we currently understand them]." Sometimes we learn we were | mistaken and more is possible but it's not reliable and, | crucially, when it happens it happens in unexpected ways. The | process of talking about when (and why) techniques have hit | their useful limits is often key to unearthing the next step. | anonymousDan wrote: | So is UCI-e a competitor/potential successor for something like | Intel's QPI (or whatever they are using now)? | RcouF1uZ4gsC wrote: | > UCIe is a start, but the standard's future remains to be seen. | "The founding members of initial UCIe promoters represent an | impressive list of contributors across a broad range of | technology design and manufacturing areas, including the HPC | ecosystem," said Nossokoff, "but a number of major organizations | have not as yet joined, including Apple, AWS, Broadcom, IBM, | NVIDIA, other silicon foundries, and memory vendors." | | The fact that the standard doesn't include anyone who is actually | building chips makes me very pessimistic about it. | ranger207 wrote: | Looks like a lot of people who actually build chips are in the | organization | | https://www.uciexpress.org/membership | AnimalMuppet wrote: | "More multi-chip processor designs" != "single-chip processors | have reached their limits". | refulgentis wrote: | I'm embarrassed to admit I still don't quite understand what a | chiplet is, would be very grateful for your input here. | | If a thread can run on multiple chiplets then this is awesome and | seems like a solution. | | If one thread == one chiplet, then*: | | - a chiplet is equivalent to a core, except with speedier | connections to other cores? | | - this isn't a solution, we're 15 years into cores and single- | threaded performance is still king. If separating work into | separate threads was a solution, cores would work more or less | just fine.** | | * put "in my totally uneducated opinion, it seems like..." before | each of these, internet doesn't communicate tone well and I'm | definitely not trying to pass judgement here, I don't know what | I'm talking about! | | ** generally, for consumer hardware and use cases, i.e. "I am | buying a new laptop and I want it to go brrrr", all sorts of | caveats there of course | [deleted] | sliken wrote: | AMD Epyc is (AFAIK) what popularized the term. Their current | design has a memory controller (PCIe controller, 8 x 64 bit | channels of ram, etc) and 8 chiplets which are pretty much just | 8 cores and a infinity fabric connection for a cache coherent | connection to other CPUs (in the same or other sockets) and | dram. | | So generally Epyc come with some multiple of 8 CPUs enabled (1 | per chiplet) and the latency between cores on the same chiplet | is lower than the latency to other chiplets. | | This allows AMD to target high end servers (up to 64 cores), | low end (down to 16), workstations with threadripper (4 | chiplets instead of 8), and high end desktops (2 chiplets | instead of 8) with the same silicon. 
This allows them to spend | less on fabs, R&D, etc. because they can amortize the silicon | over more products/volume. It also lets them bin them so | chiplets with bad cores can still be sold. It's one of the | things that lets AMD compete with the much larger volume Intel | has, and do pretty well against the numerous silicon designs Intel | ships. | hesdeadjim wrote: | A chiplet is a full-fledged CPU with many cores on it. The term | is used when multiple of these chips are stitched together with | a high speed interconnect and plugged into the single socket on | your motherboard. | | If you ripped the lid off a Ryzen "chip", you would see | multiple CPU dies underneath for the high end models. | tenebrisalietum wrote: | Additionally - MCM - multi-chip module - instead of putting | separate chips for various functions on a board, they're | fused together in what from the outside looks like a single | chip, but internally is 3 or 4 unrelated chips. | | Examples at the Wikipedia article: | https://en.wikipedia.org/wiki/Multi-chip_module | WalterBright wrote: | I remember back in the 80's the limit was considered to be 64K | RAM chips, because otherwise the defect rate would kill the | yield. | | Of course, there's always the "make a 4 core chip. If one core | doesn't work, sell it as a 3 core chip. And so on." | dboreham wrote: | Hmm. I worked for a memory manufacturer in the 80s and I do not | remember any limit. | ksec wrote: | That is mainstream news reporting for you since the 80s. | throwaway4good wrote: | "Single-Chip Processors Have Reached Their Limits | | Announcements from XYZ and ABC prove that chiplets are the | future, but interconnects remain a battleground" | | This could easily have been written 10 years ago, and I bet | someone will write it in 10 years again. | | We need these really big chips with their big powerful cores | because the nature of the computing we do only changes very slowly | towards being distributed and parallelizable and thus able to use | a massive number of smaller but far more efficient cores. | Dylan16807 wrote: | You're implying you can't put big powerful cores on chiplets, | but that's not true at all. | lazide wrote: | Hardly - performance/core hasn't flatlined, but has not | maintained the same growth over time (decades) in performance | we've traditionally had. That's the problem. | | So if you want better aggregate performance, more cores has | been the plan for a decade+ now. | | FLOP/s per core or whatever other metric you choose to use. | | Previously it was possible to get 20-50% or more performance | improvements even year to year for a core. | Dylan16807 wrote: | I wasn't talking about improvement at all. This was about | big strong cores versus efficient cores, which is a | tradeoff that always exists. | | You could choose between 20 strong cores or 48 efficient | cores on the same die space across four chiplets, for | example. | alain94040 wrote: | Correct. Also known as Rent's rule. According to Wikipedia, it | was first mentioned in the 1960s: | https://en.wikipedia.org/wiki/Rent%27s_rule | narag wrote: | I hope somebody with relevant knowledge can answer this question, | please: what % of the costs is "physical cost per unit" and what | % is maintaining the R&D, factories, channels...? | | In other words, if a chip with 100x size (100x gates, etc.) made | sense, would it cost 100x to produce or just 10x or just 2x? | | Edit: provided there wouldn't be additional design costs, just | stacking current tech.
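A back-of-the-envelope sketch of the per-unit side of that question, using the same simple Poisson yield assumption as the sketch earlier; the wafer price and defect density below are made-up illustrative numbers, not real foundry figures:

    import math

    WAFER_COST = 10_000.0   # hypothetical price per 300 mm wafer, USD
    WAFER_AREA = 70_000.0   # roughly the usable area of a 300 mm wafer, mm^2
    D = 0.001               # hypothetical defect density, defects per mm^2

    def cost_per_good_die(die_area_mm2):
        dies_per_wafer = WAFER_AREA / die_area_mm2      # ignores edge losses
        yield_fraction = math.exp(-D * die_area_mm2)    # Poisson defect model
        return WAFER_COST / (dies_per_wafer * yield_fraction)

    print(f"{cost_per_good_die(8.0):.2f}")     # ~1.15 USD per good 8 mm^2 die
    print(f"{cost_per_good_die(800.0):.2f}")   # ~254 USD per good 800 mm^2 die

    # A 100x jump in area costs more than 100x per good die, because silicon
    # is billed by the wafer and a bigger die is more likely to catch a defect.

So under these assumptions the marginal production cost scales at least linearly with area, and worse once yield losses kick in; the fixed R&D and fab costs are amortized on top of that.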
| ksec wrote: | >would it cost 100x to produce or just 10x or just 2x? | | Why would 100x something only cost 2x to produce? | | >what % of the costs is "physical cost per unit" and what % is | maintaining the R&D, factories, channels...? | | Without unit volume and a definition of the first "cost" in the | sentence no one could answer that question. But if you want to | know the BOM cost of a chip, it is simply wafer price divided by | total usable chips, which depends on yield, where yield is a factor | of both the current maturity of the node and whether your design | allows correction of defects for usable chips. Then add about | ~10% for testing and packaging. | mlyle wrote: | There are many limiting factors... one is the reticle limit. | | But most fundamental is the defect density on wafers. If you | have, say, 10 defects per wafer, and you have 1000 chips on it: | odds are you get 990 good chips. | | If you have 10 chips on the wafer, you get 2-3 good chips per | wafer. | | Of course, there are yield maximization strategies, like being | able to turn off portions of the die if it's defective (for | certain kinds of defects). | | For the upper limit, look at what Cerebras is doing with wafer | scale. Then you get into related, crazy problems, like getting | thousands of amperes into the circuit and cooling it. | nightfly wrote: | I'm not an expert, or even an amateur, here, but defects are | inevitable. So if you _need_ 100x the size without defects and | one defect ruins the chip, the cost might be 10000x to produce. | tonyarkles wrote: | It's been a while since I've been out of that industry, but | back around the 45nm days, one of the biggest concerns was | yield. If you've got 100x the surface area, the probability of | there being a manufacturing defect that wrecks the chip goes | up. Now, you could probably get away with selectively disabling | defective cores, but the chiplet idea seems, to me, like it | would give you a lot more flexibility. As an example, let's say | a chiplet i9 requires 8x flawless chips, and a chiplet Celeron | requires 4 chips, but they're allowed to have defects in the | cache because the Celeron is sold with a smaller cache anyway. | | In the "huge chip" case, you need the whole 8x area to be | flawless, otherwise the chip gets binned as a Celeron. In the | chiplet case, any single chip with a flaw can go into the | Celeron bin, and 8 flawless ones can be assembled into a | flawless CPU, and any defective ones go into the re-use bin. And | if you end up with a flawed chip that can't be used at the | smallest bin size, you're only tossing 1/4 or 1/8 of a CPU in | the trash. | wmf wrote: | The way TSMC amortizes those fixed costs is to charge by the | wafer, so if your chip is 100x larger it costs at least 100x | more. (You will have losses due to defects and around the edges | of the wafer.) You can play with a calculator like | https://caly-technologies.com/die-yield-calculator/ to get a | feel for the numbers. | hinkley wrote: | I hope we are going to get back to a more asymmetric multi- | processing arrangement in the near term where we abandon the | fiction of a processor or two running the whole show with | peripheral systems that have as little smarts as possible, and | promote them to at least second-class citizens. | | These systems are much more powerful than when these abstractions | were laid down, and at this point it feels like the difference | between redundant storage on the box versus three feet away is | more academic than anything else.
| wmf wrote: | That kind of exists since most I/O devices have CPU cores in | them, although usually hidden behind register-based interfaces. | Apple has taken it a little further by using the same core | everywhere and creating a standard IPC mechanism. | gotaquestion wrote: | The problem is AMP is very hard to program and debug. In | embedded, one core is a scheduler and another is doing some | real-time task (like arm BIG.little). In larger automotive | heterogeneous compute platforms, typically they are all treated | as accelerators, or with bespoke Tier-1 integration (or like | NVIDIA Xavier). And on top of that, OEMs always want to | "reclaim" those spare cycles when the other AMP cores are | underutilized, which is nigh impossible to do, so they fall | back to symmetric MP. I think embedded is the only place for | this to work right now. | | EDIT: I'm not an expert in this field but I have been asked to | do work in this domain, and this narrow sampling is what I | encountered, but I'd like to learn more about tooling and | strategies for more generic AMP deployments. | gnarbarian wrote: | Are we moving this way because bigger chips with many cores have | worse yields? So the answer is to make lots of little chips and | fuse them together? | [deleted] | sliken wrote: | Well, fusing is one possibility. The AMD Epyc generally has an | IO+memory controller die (called the IOD) + 8 chiplets that are 8 | cores each for most of the Epyc chips, however not all cores | are enabled depending on the SKU. | | However, Apple's approach does allow impressive bandwidth, | 2.5TB/sec, which is much higher than any of the chiplet | approaches I'm aware of. | monocasa wrote: | Yeah, in the very general case, chip errors are a function of | die area. Cutting a die into four pieces so that when an error | occurs in manufacturing, you only throw out a quarter of the | die area is becoming the right model for a lot of designs. | | Like all things chips, it's way more complicated than that, | fractally, as you start digging in. AMD started down this | road initially because of their contractual agreements with | GloFo to keep shipping GloFo dies, but wanted the bulk | of the logic on a smaller node than GloFo could provide, hence | the IO die and compute chiplets model that still exists in Zen. | It's still a good idea for other reasons but they lucked out a | bit by being forced in that direction before other major | fabless companies. | | This is also not a new idea, but sort of ebbs and flows with | the economics of the chip market. See the VAX 9000 multi-chip | modules for an 80s take on the same ideas and economic | pressures. | WithinReason wrote: | Their GPUs are likely to be multichip for the first time too | with NAVI 31 (while Nvidia's next gen will still be single | chip and likely fall behind AMD). It also seems that the | cache will be 6nm while the logic will be 5nm and bonded | together with some new TSMC technology. At least that can be | inferred from some leaks: | | https://www.tweaktown.com/news/84418/amd-rdna-3-gpu- | engineer... | thissiteb1lows wrote: | ceeplusplus wrote: | I've yet to see any sort of research out of AMD on MCM | mitigations for things like cache coherency and NUMA. | Nvidia on the other hand has published papers as far back | as 2017 on the subject.
On top of that even the M1 Ultra | has some rough scaling spots in certain workloads and Apple | is by far ahead of everyone else on the chiplet curve (if | you don't believe me, try testing lock-free atomic | load/store latency across CCX's in Zen3). | | Also AMD claimed the MI250X is "multichip" but it presents | itself as 2 GPUs to the OS and the interconnect is worse | than NVLink. | monocasa wrote: | There's a few ways to interpret that. Another | interpretation could be that they are simply taping out | Navi32 on two nodes, perhaps for AMD to better utilize the | 5nm slots they have access to. Perhaps when Nvidia is on | Samsung 10nm+++, then the large consumer AMD GPUs get a | node advantage already being at TSMC 7nm+++, and so they're | only using 5nm slots for places like integrated GPUs and | data center parts that care about perf/watt. | | But your interpretation is equally valid with the | information we have AFAICT. | IshKebab wrote: | This is what Tesla's Dojo does (it's really a TSMC technology | that they are the first to utilize). You can cut your wafer up | into chips, ditch the bad ones, then reassemble them into a | bigger wafery chip thing using some kind of glue. Then you can | do more layers to wire them up. | | I think they do it using identical chips but I guess there's no | real reason you couldn't have different chips connected in one | wafer. Expensive though! ___________________________________________________________________ (page generated 2022-04-04 23:00 UTC)