[HN Gopher] Intel 3rd gen Xeon Scalable (Ice Lake): generational... ___________________________________________________________________ Intel 3rd gen Xeon Scalable (Ice Lake): generationally big, competitively small Author : totalZero Score : 60 points Date : 2021-04-06 16:33 UTC (6 hours ago) (HTM) web link (www.anandtech.com) (TXT) w3m dump (www.anandtech.com) | ChuckMcM wrote: | I found the news of Intel releasing this chip quite encouraging. | If they have enough capacity on their 10nm node to put it into | production then they have tamed many of the problems that were | holding them back. My hope is that Gelsinger's renewed attention | to engineering excellence will allow the folks who know how to | iron out a process to work more freely than they did under the | previous leadership. | | That said, fixing Intel is a three step process right? First they | have to get their process issues under control (seems like they | are making progress there). Second, they need to figure out the | third party use of that process so that they can bank some some | of revenue that is out there from the chip shortage. And finally, | they need to answer the "jelly bean" market, and by that we know | that "jelly bean" type processors have become powerful enough to | be the only processor in a system so Intel needs to play there or | it will lose that whole segment to Nvidia/ARM. | sitkack wrote: | If they price it right, it could be amazing. Computing is | mostly about economics. The new node sizes greatly increase the | production capacity. Half the dimension in x and y gets you 4x | the transistors on the same wafer. It is like making 4x the | number of fabs. | | It also has speed and power advantages. | | I think this release is excellent news on many levels. | Retric wrote: | Intel _10nm_ is really just a marketing term at this point | and has nothing to do with transistor density. | [deleted] | judge2020 wrote: | Production for a datacenter CPU is not the same as production | for datacenter + enthusiast-grade consumer CPUs like Zen 3 | currently achieves, unfortunately. Rocket lake being backported | to 14nm is still not a good sign for actual production volume, | although it probably means next generation will be 10nm all the | way. | willis936 wrote: | Datacenter CPUs are much larger than consumer parts and yield | goes down with the square of the die area. They start with | these because the margins go up faster than square of the die | area. | Robotbeat wrote: | But modern techniques exist to deal with problems in a | large die (ie testing and then segmenting off cores with | mistakes on them), so the fact they're starting with large | chip die sizes doesn't really tell you much, no? | erik wrote: | The top end most profitable SKUs are fully enabled dies. | That they are now able to ship dies this large is a good | sign. The 10nm laptop chips they have produced so far | were rumored to have atrocious yield. | knz_ wrote: | > Rocket lake being backported to 14nm is still not a good | sign for actual production volume, | | I'm not seeing a good reason for thinking this is the case. | Server CPUs are harder to fab (much larger die area) and they | need to fab more of them (desktop CPUs are relatively niche | compared to mobile and server CPUs). | | If anything this is a sign that 10nm is fully ready. | bushbaba wrote: | I assume for intel, they make more server CPUs per year than | the entirety of AMDs output. 
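To put rough numbers behind the yield and wafer-capacity points above (willis936 and sitkack), here is a small back-of-the-envelope sketch in C++. It uses a simple Poisson defect model, and the defect density, die areas, and usable wafer area are made-up illustrative values, not Intel or TSMC figures:

    // Back-of-the-envelope yield sketch (illustrative numbers only): a simple
    // Poisson defect model, yield = exp(-D0 * A), shows why a large server die
    // is much harder to yield than a small consumer die, and why a full shrink
    // roughly quadruples the dies you get from the same wafer.
    #include <cmath>
    #include <cstdio>

    int main() {
        const double d0        = 0.0010;   // assumed defect density, defects per mm^2
        const double wafer_mm2 = 70000.0;  // ~300 mm wafer, ignoring edge loss

        const double die_mm2[] = {150.0, 660.0};  // hypothetical consumer vs server die
        for (double a : die_mm2) {
            double yield = std::exp(-d0 * a);        // chance a die has zero defects
            double good  = (wafer_mm2 / a) * yield;  // good dies per wafer
            std::printf("%6.0f mm^2 die: yield %5.1f%%, ~%4.0f good dies/wafer\n",
                        a, 100.0 * yield, good);
        }

        // Halving both dimensions cuts die area to 1/4, so the same wafer holds
        // roughly 4x as many dies (the "like building 4x the fabs" point above).
        std::printf("full shrink: %.0f mm^2 -> %.0f mm^2 per die\n", 660.0, 660.0 / 4.0);
        return 0;
    }

Under these assumptions the 660 mm^2 die yields barely half as often as the 150 mm^2 one, which is the crux of the monolithic-versus-chiplet trade-off discussed further down the thread.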
| bayindirh wrote: | Intel has momentum and some cult following, but they're | no match for AMD in certain aspects like PCIe lanes, memory | channels, and some types of computation which favor AMD's | architecture. | | Day by day, more data centers get AMD systems by choice or | by requirement (Oh, you want 8x A100 Nvidia-made modules | with maximum performance? You need an AMD CPU since it has | more PCIe lanes, for example). | | You don't see many AMD server CPUs around because the first | generation and most of the second generation were completely | bought up by FAANG, Dropbox, et al. | | As production ramps up with the newer generation, we can | buy the overflow parts after most of the output is | gobbled up by these buyers. | zepmck wrote: | In terms of PCIe lane efficiency, there is no competition | between Intel and AMD. Intel is well ahead of AMD. Don't | be impressed by the number of lanes available on the | board. | cptskippy wrote: | I can only assume you're referring to Intel's Rocket Lake | storage demonstration they tweeted out. This was using | PCMark 10's Quick Storage Benchmark which is more CPU | bound than anything else. | | All of the other benchmarks in the PCMark test suite push | the bottleneck down to the storage device. | | One would think Intel might want to build a storage array | that could stress the PCIe lanes but then that might show | an entirely different picture than the one Intel is | portraying. | bayindirh wrote: | > Don't be impressed by the number of lanes available | on the board. | | When you configure the system full-out with GPUs & HBAs, | the number of lanes becomes a matter of necessity rather | than a spec which you drool over. | | A PCIe lane is a PCIe lane. Its capacity, latency and | speed are fixed, and you need them with a minimum number of | PCIe switches to saturate the devices and servers you | have, at least in our scenario. | ryan_j_naughton wrote: | There is a slow but seismic shift to AMD within data | centers right now. | mhh__ wrote: | They still have something like 40% to go to even reach | parity with Intel though | [deleted] | totalZero wrote: | > Rocket lake being backported to 14nm is still not a good | sign for actual production volume | | I'm genuinely having trouble understanding what you mean by | this. | | Rocket Lake being backported to 14nm means that 10nm can be | allocated in greater proportion toward higher-priced chips | like Alder Lake and Ice Lake SP. Seems like it would be good | for production volume. | rincebrain wrote: | I think they mean that the fact that they needed to | backport Rocket Lake, versus just having all their | production on 10nm, implies a much more limited production | capacity than the other situation. | zamadatix wrote: | "Production volume" refers to the production volume of the | node as a whole still being low, not to risks to being able | to get volume of these particular SKUs using the node. | jlawer wrote: | I was under the impression the issue with 10nm is | frequency, which led to the Rocket Lake backport. | Unfortunately it seems that the 10nm node's efficiency | point is lower on the frequency curve. The reviewed Ice | Lake processor was 300 MHz lower than the previous | generation (though with much higher core count), despite | higher power draw and the process node shrink.
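A tiny sketch of the frequency-versus-IPC trade-off jlawer is describing: single-thread performance scales roughly as IPC x clock, so any clock regression on the new node has to be bought back with IPC. The two clock values below are illustrative round numbers, not specific product specs:

    // Rough break-even calculation for a clock regression on a new node.
    // Single-thread performance ~ IPC x clock; the clocks are example values.
    #include <cstdio>

    int main() {
        const double old_clock_ghz = 5.3;  // e.g. a 14nm desktop boost clock
        const double new_clock_ghz = 4.0;  // assumed lower boost clock on the new node

        // IPC improvement needed just to match the old single-thread performance:
        double ipc_gain_needed = old_clock_ghz / new_clock_ghz - 1.0;
        std::printf("need ~%.0f%% more IPC to offset %.1f GHz -> %.1f GHz\n",
                    100.0 * ipc_gain_needed, old_clock_ghz, new_clock_ghz);
        return 0;
    }

With those example clocks, roughly a third more IPC is needed just to break even on single-thread work, which is jlawer's argument for why the desktop parts stayed on the high-clocking 14nm node.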
| | In laptop processors they can easily show efficiency gains | from the 10nm process and the IPC improvement; it appears most | laptop processors end up running at a power envelope lower | than the ideal performance-per-watt point. Server | processors with higher core counts mean you can run more | workloads per server, again providing efficiency gains. | However desktop / gaming tends to be smaller core count + | higher frequency, with little concern for efficiency outside | of quality-of-life factors (i.e. don't make me use a 1 kW | chiller). Intel has been pushing 5 GHz processor frequency | for years, and Rocket Lake continues that push (5.3 GHz | boost); when they drop frequency to move to 10nm, it's hard | to see an IPC improvement that is able to paper over that. | | However Alder Lake CPUs will have a thread count advantage, | so at least with 24 threads they should be able to show | generational improvement over the current 8c/16t Rocket Lake | parts. That will allow them to at least argue their value | with select benchmarks and Intel-only features. Those 8 | efficiency cores will likely be a BIG win on laptop, but on | desktop I doubt they will compare favourably to the full | fat cores on a current Ryzen 5900X (i.e. a currently | available 12-core/24-thread processor). | | Intel is going to have at least 1 more BAD mainstream | desktop generation before they can truly compete on the | mainstream high end, however there is a chance they have | something like a HEDT part that would allow them to at | least save face. That being said, given a choice, Intel | will give up desktop market share for the faster-growing | laptop and server markets. | buu700 wrote: | What's a "jelly bean" processor? Trying to search for that just | gets a bunch of hits about Android 4.1. | madsushi wrote: | https://news.ycombinator.com/item?id=17376874 | | > [1] Jelly Bean chips are those that are made in batches of | 1 - 10 million with a set of functions that are fairly | specific to their application. | foobarian wrote: | Is that the chip-on-board packaging like described here: | https://electronics.stackexchange.com/questions/9137/what- | ki... ? | ChuckMcM wrote: | Sometimes referred to as an "application-specific processor" | (ASP) or "system on chip" (SoC). These are the bulk of | semiconductor sales these days as they have replaced all of | the miscellaneous gate logic on devices with a single | programmable block that has a bunch of built-in peripherals. | | Think Atmel ATmega parts, there are trillions of these in | various roles. When you consider that something like a 555 | timer[1] is now more cost-effectively and capably | replaced with an 8-pin microprocessor, you can get an idea of | the shift. | | While these are rarely built on the "leading edge" process | node, when a process node takes over for high margin chips, | the previous node gets used for lower margin chips, which | effectively does a shrink on their die, reducing their cost | (most of these chips seem to keep their performance specs | fairly constant, preferring cost reduction over performance | improvement.) | | Anyway, the zillions of these chips in lots of different | "flavors" are colloquially referred to as "jelly bean" chips. | dragontamer wrote: | http://sparks.gogo.co.nz/assets/_site_/downloads/smd- | discret...
| | > Jellybean is a common term for components that you keep in | your parts inventory for when your project just needs "a | transistor" or "a diode" or "a mosfet" | | ----------- | | For many hobbyists, a Raspberry Pi or Arduino is a good | example of a Jellybean. You buy 10x Raspberry Pis and stuff | your drawer full of them, because they're cheap enough to do | most tasks. You don't really know what you're going to use | all 10x Rasp. Pi for, but you know you'll find a use of it a | few weeks from now. | | --------- | | At least, in my Comp. Engineering brain, I think N2222 | transistors or 3904-transistors, or the 741 Op-amp. There are | better op-amps and better transistors for any particular job. | But I chose these parts because they're familiar, | comfortable, cheap and well understood by a wide variety of | engineers. | | Well, not the 741 OpAmp anymore anyway. 741 was a jellybean | back in the 12V days. Today I think 5V compatibility has | become the standard voltage (because of USB). So 5V op-amps | are a more important "jellybean". | klodolph wrote: | I don't know how old you are, but the 741 was obsolete in | the 1980s. It sticks around in EE textbooks because it's | such an easy way to demonstrate _problems_ with op-amps... | high input current, low gain-bandwidth product, low slew | rate, etc. | | I think your jellybean op-amps would more likely be TL072, | LM358, or NE5532. | dragontamer wrote: | Old, beaten up textbooks from the corner of my | neighborhood library was talking 741 back in the 2000s, | when I was in high school and started dabbling around | with electricity more seriously. | | Maybe it was fully obsolete by that point, but high | school + neighborhood libraries aren't exactly filled | with up-to-date textbooks or the latest and greatest. | | I remember that Radio Shack was still selling kits with | 741 in them, as well as breadboards and common | components... 12V wall-warts and the like. Online | shopping was beginning to get popular, but I was still a | mallrat who picked up components and dug through old | Radio Shack manuals into 2005 or 2006. | | It was the ability to walk around, and see those | component shelves sitting there in Radio Shack that got | me curious about the hobby and start researching it. I do | wonder how modern children are supposed to get interested | into hobbies now that malls are less popular (and | electronic shops like Radio Shack are basically | disappeared). | | ------------ | | I don't remember what we used in college. I knew that I | was more selective and understood the kinds of problems | various OpAmps had back then. Also you're not really rich | enough to invest into a private stockpile of chips, and | instead just use whatever the labs are stocked with in | college. | | LM358 is the jellybean that I keep in my drawer today. If | you're curious. Old habits die hard though, I still think | 741 as the jellybean even though it really is obsolete | today. | ChuckMcM wrote: | I've got a tube each of 358's and 1458's (dual version) | in my parts supplies. But my microwave stuff is finding | them lacking. | bavell wrote: | +1 for LM358 | carlhjerpe wrote: | What I don't understand is: ASML is building these machines for | making ICs. Why can TSMC use them for 7nm but Intel can only | use them for 10 right now? Doesn't ASML make the lenses as well | so that you're "only" stuck making the etching thingy (forgot | what it's called, but the reflective template of a CPU). 
| | It seems like nobody is talking about this, could anyone shine | some light? | dragontamer wrote: | Consider that the wavelength of red light is 700 nm, and the | wavelength of UV-C is 100nm to 280nm. | | And immediately, we see the problem about dropping to 10nm: | that's literally smaller than the distance that photons | vibrate on their way to the final target. | | And yeah, 10nm and 7nm is a marketing term, but that doesn't | change the fact that these processes are all smaller than the | wavelength of light. | | ------- | | So there are two ways to get around this problem. | | 1. Use smaller light: "Extreme UV" is even smaller than | normal UV at 13.5nm. Kind of the obvious solution, but higher | energy and changes the chemistry slightly, since the light is | a different color. Things are getting mighty close to literal | "X-Ray Lasers" as they are, so the power requirements are | getting quite substantial. | | 2. Multipatterning -- Instead of developing the entire thing | in one shot, do it in multiple shots, and "carefully line up" | the chips between different shots. As difficult as it sounds, | its been done before at 40nm and other processes. (https://en | .wikipedia.org/wiki/Multiple_patterning#EUV_Multip...) | | 3. Do both at the same time to reach 5nm, 4nm, or 3nm. Either | way, 10nm and 7nm is the point where the various companies | had to decide to do #1 first or #2 first. Either way, your | company needs to learn to do both in the long term. TSMC and | Samsung went with #1 EUV, and I think Intel though that #2 | multi-patterning would be easier. | | And the rest is history. Seems like EUV was easier after all, | and TSMC / Samsung's bets paid off. | | Mind you, I barely know any of the stuff I'm talking about. | I'm not a physicist or chemist. But the above is my general | understanding of the issues. I'm sure Intel had their reasons | to believe why multipatterning would be easier. Maybe it was | easier, but other company issues drove away engineers and | something unrelated caused Intel to fall behind. | vzidex wrote: | I'll take a crack at it, though I'm only in undergrad (took a | course on VLSI this semester). | | Making a device at a specific technology node (e.g. 14nm, | 10nm, 7nm) isn't just about the lithography, although litho | is crucial too. In effect, lithography is what allows you to | "draw" patterns onto a wafer, but then you still need to do | various things to that patterned wafer (deposition, etching, | polishing, cleaning, etc.). Going from "we have litho | machines capable of X nm spacing" to "we can manufacture a | CPU on this node at scale with good yield" requires a huge | amount of low-level design to figure out transistor sizings, | spacings, and then how to actually manufacture the designed | transistors and gates using the steps listed above. | mqus wrote: | TSMCs 7nm is roughly equivalent to intels 10nm, the numbers | don't really mean anything and are not comparable | lifeisstillgood wrote: | This might be a very dumb question but it always bothered me - | silicon wafers are always shown as great circles, but processor | dies are obviously square. But it looks like the etching etc goes | right to the circular edges - wouldn't it be better to leave the | dead space untouched? | pas wrote: | I think these are just press/PR wafers and real production ones | don't pattern on the edge. (First of all it takes time, and in | case of EUV it means things amortize even faster, because every | shot damages the "optical elements" a bit.) 
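As a rough illustration of the wavelength argument in dragontamer's reply above, here is the textbook Rayleigh resolution criterion (minimum half-pitch roughly k1 x lambda / NA) evaluated for immersion DUV and EUV. The k1 and NA values are typical ballpark assumptions, not any fab's actual parameters:

    // Rayleigh criterion sketch: smallest printable half-pitch ~ k1 * lambda / NA.
    // The NA and k1 figures below are generic ballpark values for illustration.
    #include <cstdio>

    struct Litho { const char* name; double lambda_nm, na, k1; };

    int main() {
        const Litho tools[] = {
            {"193 nm immersion DUV", 193.0, 1.35, 0.30},  // assumed aggressive k1
            {"13.5 nm EUV",           13.5, 0.33, 0.40},  // assumed relaxed k1
        };
        for (const Litho& t : tools) {
            double half_pitch_nm = t.k1 * t.lambda_nm / t.na;
            std::printf("%-22s -> ~%.0f nm minimum half-pitch per exposure\n",
                        t.name, half_pitch_nm);
        }
        // At ~40+ nm per exposure, DUV needs multiple patterning passes to draw
        // the tightest layers of a 7/10 nm-class process; EUV can do them in one.
        return 0;
    }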
| | edit: it also depends on how many dies the mask (reticle) has | on it. Intel uses one die reticles, so i. theory their real | wafers have no situation in which they have partial dies at the | edge. | w0utert wrote: | Most semiconductor production processes like etching, doping, | polish etc are done on the full wafer, not on individual | images/fields. So there is nothing to be gained there in terms | of production efficiency. | | The litho step could in theory be optimized by skipping | incomplete fields at the edges, but the reduction in exposure | time would be relatively small, especially for smaller designs | that fit multiple chips within a single image field. I imagine | it would als introduce yield risk because of things like uneven | wafer stress & temperature, higher variability in stage move | time when stepping edge fields vs center fields, etc. | andromeduck wrote: | Many of the process steps involve rotation so this is | impractical. | jvanderbot wrote: | From Anandtech[1]: | | "As impressive as the new Xeon 8380 is from a generational and | technical stand-point, what really matters at the end of the day | is how it fares up to the competition. I'll be blunt here; nobody | really expected the new ICL-SP parts to beat AMD or the new Arm | competition - and it didn't. The competitive gap had been so | gigantic, with silly scenarios such as where a competing 1-socket | systems would outperform Intel's 2-socket solutions. Ice Lake SP | gets rid of those more embarrassing situations, and narrows the | performance gap significantly, however the gap still remains, and | is still undeniable." | | This sounds about right for a company fraught with so many | process problems lately: Play catch up for a while and hope you | experience fewer in the future to continue to narrow the gap. | | "Narrow the gap significantly" sounds like good technical | progress for Intel. But the business message isn't wonderful. | | 1. https://www.anandtech.com/show/16594/intel-3rd-gen-xeon- | scal... | ajross wrote: | I don't know that it's all so bad. The final takeaway is that a | 660mm2 Intel die at 270W got about 70-80% of the performance | that AMD's 1000mm2 MCM gets at 250W. So performance per | transistor is similar, but per watt Intel lags. But then the | idle draw was significantly better (AMD's idle power remains a | problem across the Zen designs), so for many use cases it's | probably a draw. | | That sounds "competetive enough" to me in the datacenter world, | given the existing market lead Intel has. | marmaduke wrote: | It's impressive how you and parent comment copied over | to/from the dupe posting verbatim. | | _edit_ oops nevermind, I see my comment was also | mysteriously transported from the dupe. | Symmetry wrote: | I'm not sure that's a fair area comparison? AMD only has | around 600 mm2 of expensive leading edge 7nm silicon and uses | chiplets to up their yields. The rest is the connecting bits | from an older and cheaper process. Intel's full size is a | single monolithic die on a leading edge process. | ineedasername wrote: | Do chiplets underperform compared to a monolithic die? | wmf wrote: | Yes. | Symmetry wrote: | All things being equal a chiplet design will underperform | a monolithic die. But we've already seen the benchmarks | on the performance of Milan so looking at chiplets versus | monolithic is mostly about considering AMD's strategy and | constraints rather than how the chips perform. 
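Plugging the rough figures from ajross's comment above (a ~660 mm^2 monolithic die at 270 W versus a ~1000 mm^2 MCM at 250 W, at roughly 75% relative performance) into a quick performance-per-area and performance-per-watt calculation. These are the thread's ballpark numbers, not measurements, and as Symmetry notes the MCM area includes a cheaper, older-node I/O die, so the raw area comparison overstates AMD's leading-edge silicon:

    // Quick arithmetic on the ballpark figures quoted in the thread; performance
    // is normalized so the MCM part = 1.0. Not measured data.
    #include <cstdio>

    int main() {
        const double perf_a = 0.75, area_a = 660.0,  watts_a = 270.0;  // monolithic die
        const double perf_b = 1.00, area_b = 1000.0, watts_b = 250.0;  // chiplet MCM

        std::printf("perf per mm^2: %.5f vs %.5f\n", perf_a / area_a, perf_b / area_b);
        std::printf("perf per watt: %.5f vs %.5f\n", perf_a / watts_a, perf_b / watts_b);
        return 0;
    }

With these inputs the per-area numbers come out close while the per-watt gap stays visible, which is roughly the conclusion ajross draws.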
| monocasa wrote: | Pretty much any time you have signals going off chip, you | lose out on both bandwidth and latency. | ComputerGuru wrote: | I would argue that for high-end servers, idle draw is a bit | of a non-issue as presumably either you have only one of | these machines and it's sitting idle (so no matter how | inefficient it doesn't matter) or you have hundreds/thousands | of them and they'll be as far from idle as it's possible to | be. | | AMD's idle power consumption is a bigger issue for desktop, | laptop, and HEDT. | rbanffy wrote: | If it has 80% of the performance, it will still be | competitive at 80% of the price. | ShroudedNight wrote: | This sounds like a dangerous assumption to make. I would | expect that needing 25% more machines for the same | performance would be a non-starter for many potential | customers. | throwaway4good wrote: | I would expect most high-end servers in data centers to sit | idle most of the time? Do you know of any data on this? | ajross wrote: | Most servers are doing things for human beings, and we | have irregular schedules. Standard rule of thumb is that | you plan for a peak capacity of 10x average. A datacenter | that _doesn 't_ have significant idle capacity is one | that's some kind of weird special purpose thing like a | mining facility. | adrian_b wrote: | That's true, but I would expect that most idle servers | are turned off and they use Wake-on-LAN to become active | when there is work to do. | | Just a few servers could be kept idle, not off, to enable | a sub-second start-up time for some new work. | jeffbee wrote: | Certainly for bit players and corporate datacenters with | utilization < 1% you'd expect the median server to just | sit there. For larger (amazon, google, etc) players the | economic incentives against idleness are just too great. | JoshTriplett wrote: | > For larger (amazon, google, etc) players the economic | incentives against idleness are just too great. | | Not all workloads are CPU-bound. Cloud providers have | _many_ servers for which the CPUs are idle most of the | time, because they 're disk-bound, network-bound, other- | server-bound, bursty, or similar. They're going to aim to | minimize the idle time, but they can't eliminate it | entirely given that they have customer-defined workloads. | mamon wrote: | But if the workload is not CPU-bound then why would they | care about upgrading their CPUs to more performant ones, | like Ice Lake Xeons? | JoshTriplett wrote: | The workloads are determined by their customers, and | customers don't always pick the exact size system they | need (or there isn't always an option for the exact size | system they need). The major clouds are going to upgrade | and offer faster CPUs as an option, people are going to | use that option, and some of their workloads will end up | idling the CPU. Major cloud vendors almost certainly have | statistics for "here's how much idle time we have, so | here's approximately how much we'd save with lower power | consumption on idle". | ajross wrote: | Electricity costs for large datacenters are higher than the | equipment costs. They absolutely care about idle draw. | bostonsre wrote: | If you are the one paying the electricity bills for that | datacenter, then yes, it probably matters to you a lot. | If you are just renting a server from aws or gcp, it | probably matters less. Although, I assume costs born from | idle inefficiency will probably be passed to the | customer... 
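A rough sketch, in the spirit of ajross's point about electricity costs, of why idle watts show up on the bill at fleet scale. Every input (fleet size, idle fraction, extra idle draw per server, PUE, electricity price) is an assumed round number for illustration only:

    // Fleet-scale idle power cost sketch; all inputs are assumptions.
    #include <cstdio>

    int main() {
        const double servers        = 10000.0;
        const double idle_fraction  = 0.30;   // assumed share of time spent idle
        const double extra_idle_w   = 40.0;   // assumed extra idle draw per server (W)
        const double pue            = 1.4;    // cooling/overhead multiplier
        const double usd_per_kwh    = 0.08;
        const double hours_per_year = 8760.0;

        double kwh_per_year = servers * idle_fraction * extra_idle_w * pue
                              * hours_per_year / 1000.0;
        std::printf("~%.0f MWh/year of extra idle energy, ~$%.0f/year\n",
                    kwh_per_year / 1000.0, kwh_per_year * usd_per_kwh);
        return 0;
    }

Whether that figure matters relative to the cost of the servers themselves is exactly the disagreement in the comments above.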
| [deleted] | spideymans wrote: | Shouldn't datacenters attempt to minimize idle time | though? A server sitting at idle is a depreciating asset | that could likely be put to more productive use if tasks | were rescheduled to take advantage of idle time (this | would also reduce the total number of servers needed). | deelowe wrote: | Utilization is a very difficult problem to solve. The | difference between peak and off peak utilization can be | as much as 70% or more depending on the application. | gumby wrote: | That is definitely the objective but the reality is that | load is not* uniform over the day. So you are paying to | keep some number of servers hot (I don't know about | spinup/spindown practices in modern datacenters). | | I doubt this applies to HPC (the target market for this | part) as they either schedule jobs closely or could, I | imagine, shut them down. But I'm not in that space either | so this is merely conjecture. | | * I am sure there are corner cases where the load _is_ | uniform, but they are by definition few. | ComputerGuru wrote: | If you have enough servers for idle draw to be more than | a rounding error in your opex breakdown, then you have a | strategy to keep idle time to zero. It doesn't make any | financial sense (no matter how low idle draw is) to have | a server sit idle (or even powered off, but that's a | capex problem). | my123 wrote: | For a cloud infrastructure, you have a significant part | at idle, for when customers want to instantly spawn a VM. | zsmi wrote: | The target market for this part is not that kind of | datacenter. | | Based on the article they're targeting high performance | compute, i.e. "application codes used in earth system | modeling, financial services, manufacturing, as well as | life and material science." | klodolph wrote: | The opposite is true... a major advantage of running | cloud infrastructure is that you can run your CPUs near | 100% all the time. CPUs which are not running full bore | can have jobs moved to them. | jrockway wrote: | Yeah, I think it's hard to keep your computers at 100% | utilization for the entire day. You host services close | to your users, and your users go to bed at some point, | many of them at around the same time every day. Then your | computers have very little work to do. | | Some bigger companies have a lot of batch jobs that can | run overnight and steal idle cycles, but you have to be | gigantic before that's realistic. (My experience with | writing gigantic batch jobs is that I just requisitioned | the compute at "production quality" so I could work on | them during the day, rather than waiting for them to run | overnight. Not sure what other people did, and therefore | not really sure how much runs overnight at big | companies.) | | Cloud providers have spot instances that could take up | some of this slack, but I bet there is plenty of idle | capacity precisely because the cost can't go to $0 | because of electricity use. Or I could be completely | wrong about workloads, maybe everyone has their web | servers and CI systems running at 100% CPU all night. | I've never seen it, though. | thekrendal wrote: | Or for redundancy sake, if you're using any kind of sane | setup. (Yes, YMMV bigly with this particular idea.) | chomp wrote: | Can confirm, built out a datacenter space in a past life. | Power costs were of limited concern - cooling was the | limited resource. Even then, literally no one went down a | spec sheet and compared "hmm, this one has a tiny less | amount of watts idle". 
We just kept servers dark | regardless so that we can save on cooling. Nitpicking | idle draw for server processors just isn't realistic for | a lot of cases. | dahfizz wrote: | large datacenters have hardware orchestration systems | that let them turn off unused machines. There really is | no reason to have lots of machines on but unused. At | least, that is not a significant enough event to be a | determining factor in hardware purchasing. | neogodless wrote: | A bit off topic from the server CPU discussion, but I was | curious how well AMD is advancing idle power consumption. | | For example, the Ryzen 3000 desktop chips seemed to have | the issue[0], but the same Zen 2 cores seem to have found | some improvements in the Ryzen 4000 mobile chips[1]. | | I didn't want to just rely on Reddit forum comments, so I | found this measure of the Ryzen 3600[2]. | | > When one thread is active, it sits at 12.8 W, but as we | ramp up the cores, we get to 11.2 W per core. The non-core | part of the processor, such as the IO chip, the DRAM | channels and the PCIe lanes, even at idle still consume | around 12-18 W in the system. | | My interpretation was expect ~12 W or more idle consumption | (just from the CPU package), but I'm not sure I understand | it correctly. | | I couldn't find the same information for Ryzen 4000 | laptops, but the same APU is tested in a NUC, where the | total system draw (at the wall) at idle was about 10-11 W, | still nearly double that of a Core i7 U-series NUC[3], but | certainly lower than that of just the CPU package in the | Ryzen 3600. | | Anecdotally, my 45W Ryzen 7 4800H laptop with 15.6" 1080p | screen lasts about 4 hours on 80% of the 60Wh battery with | 95% brightness, doing various non-intensive tasks. Though I | don't know how well the battery holds up on complete non- | use standby. | | [0] https://old.reddit.com/r/AMDHelp/comments/cfm1xa/why_is | _ryze... | | [1] https://old.reddit.com/r/Amd/comments/haq4fg/the_idle_p | ower_... | | [2] https://www.anandtech.com/show/15787/amd- | ryzen-5-3600-review... | | [3] https://www.anandtech.com/show/16236/asrock-4x4-box4800 | u-ren... | bkor wrote: | > I couldn't find the same information for Ryzen 4000 | laptops | | I measured an Asus Mini PC PN50 with a Ryzen 4500U. The | idle power usage was 8.5 Watt for the system. This with | 32GB of memory and a SATA SSD installed. It would be nice | if it was lower than this, but it isn't too bad. | Interestingly the machine used 1.2 Watt while off after | it wasn't on power, 0.5 Watt after starting up and | shutting it down. | | Recently noticed some people focussing on low power but | powerful 24/7 home "servers". Systems that are on 24/7, | but often idle. One system used around 4.5 Watt in idle. | The "brick" / power adapter often uses too much power, | even when everything is off. | wtallis wrote: | Ryzen 3000 desktop processors use a chiplet design, with | the IO die built on an older process than the processor | dies. Ryzen 4000 mobile processors are monolithic dies, | so they don't have the extra power of the inter-chiplet | connections and they're entirely 7nm parts instead of a | mix of 7nm and 14nm. | monocasa wrote: | You can't really compare die sizes of a MCM and a single die | and expect to get transistor counts out of that. 
So much of | the area of the MCM is taken up by all the separate PHYs to | communicate between the chiplets and the I/O die, and the I/O | die itself is on GF 14nm (about equivalent to Intel 22nm) last | time I checked, not a new competitive logic node. | | There are probably a few more gates still on the AMD side, but | it's not the "half again larger" that you'd expect looking at | area alone. | jvanderbot wrote: | Furthermore: | | "At the end of the day, Ice Lake SP is a success. Performance | is up, and performance per watt is up. I'm sure if we were able | to test Intel's acceleration enhancements more thoroughly, we | would be able to corroborate some of the results and hype that | Intel wants to generate around its product. But even as a | success, it's not a traditional competitive success. The | generational improvements are there and they are large, and as | long as Intel is the market share leader, this should translate | into upgraded systems and deployments throughout the enterprise | industry. Intel is still in a tough competitive situation | overall with the high quality the rest of the market is | enabling." | jandrese wrote: | I found it a little weird that the conclusions section | didn't mention the AMD or ARM competition at all, given that | the Intel chip seemed to be behind them in most of the tests. | jvanderbot wrote: | You mean OP didn't? Yes, that's probably standard PR to | focus on strengths rather than competition. | jandrese wrote: | I mean the Anand piece. | jvanderbot wrote: | The conclusions section was quoted in my post and they | explicitly mention it. | | "As impressive as the new Xeon 8380 is from a | generational and technical stand-point, what really | matters at the end of the day is how it fares up to the | competition. I'll be blunt here; nobody really expected | the new ICL-SP parts to beat AMD or the new Arm | competition - and it didn't. " | ksec wrote: | It is certainly good enough to compete, prioritising fab | capacity for the server unit and locking in those important (swaying) | deals from clients. Sales and marketing can work their connections, | along with the software tools that HPC markets need, where Intel AFAIK is | still far ahead of AMD. | | And I can bet those prices have lots of room for special | discounts to clients. Since RAM and NAND storage dominate the | cost of a server, the price difference between Intel and AMD shrinks rapidly | in the grand scheme of things, giving Intel a chance to fight. | And there is something not mentioned enough: the importance of | PCIe 4.0 support. | | I wanted to rant about AMD, but I guess there is not much | point. ARM is coming. | quelsolaar wrote: | >This sounds about right for a company fraught with so many | process problems lately | | Publicly the problems have only appeared lately, but the things that | caused these problems happened much further back. | | I'm cautiously bullish on Intel. From what I gather, Intel is | in a much better place internally. They have much better focus, | there is less infighting, it's more engineering- than sales-led, | they have some very good people and they are no longer | complacent. It will however take years before this becomes | visible from the outside. | | Given the demand for CPUs and the competition's inability to | deliver, I think Intel will do OK even if they are no one's | first choice of CPU vendor, while they try to catch up. | intricatedetail wrote: | Why does Intel even bother releasing products that don't bring | anything new and worthwhile to the table?
This is such a massive | waste of time, resources and environment. | w0mbat wrote: | 10nm? I love retro-computing. | ajross wrote: | As gets repeated ad nauseum, industry numbering has gone wonky. | Intel still hews more or less to the ITRS labelling for its | nodes, which means that it's 10nm process has pitches and | density values along the same lines as TSMC or Samsung's 7nm | processes. | | This is, indeed, no longer an industry leading density and it | lags what you see on "5nm" parts from Apple and Qualcomm. But | it's the same density that AMD is using for the Zen 2/3 devices | against which this is competing in the datacenter. | adrian_b wrote: | Maybe the density is the same, but the 10-nm process variant | that Intel is forced to use for Ice Lake Server is much worse | than the 7-nm TSMC process. | | It is worse in the sense that at the same number of active | cores and the same power consumption, the 10-nm Ice Lake | Server can reach only a much lower clock frequency than the | 7-nm Epyc, which results in a much lower performance for | anything that does not use AVX-512. | | It is also worse in the sense that the maximum clock | frequency when the power limits are not reached is also much | worse for the 10-nm process used for Ice Lake Server. | | Ice Lake Server does not use the improved 10-nm process | (SuperFin) that is used for Tiger Lake and it is strongly | handicapped because of that. | ac29 wrote: | While I'd agree with you that TSMC's current 7nm seems to | be better than Intel's current 10nm, comparing Epyc to Ice | Lake SP isnt quite the same. Intel is putting (up to) 40 | cores on the same die, AMD only puts 8 cores. It looks like | AMD has the better method for overall performance, and | Intel will likely follow them - in addition to being able | to get more cores into a socket, I suspect Intel could also | crank frequency higher with less cores per die. | adrian_b wrote: | For the user it does not matter how many cores are on a | die. | | For the user it matters what is included in a package. | The new Ice Lake Server package (77.5 mm x 56.5 mm) has | finally reached about the same size as the Epyc package | (75.4 mm x 58.5 mm), because now Intel offers for the | first time 8 memory channels, like its competitors have | offered for many years. | | So in packages of the same size, Intel has 40 cores, | while AMD offers 64 cores. Moreover Intel requires an | extra package for the I/O controller, while AMD includes | it in the CPU package. | | So for general-purpose users, AMD offers much more in the | same space. | | On the other hand, Ice Lake Server has twice the number | of FMA units, so it has as many floating-point | multipliers as 80 AMD cores. This advantage is diminished | by the fact that the clock frequency for heavy AVX-512 | instructions is only 80% of the nominal frequency, but it | can still give an advantage to Ice Lake Server for the | programs that can use AVX-512. | totalZero wrote: | From a yield perspective, if core failures are | independent events, binning is probably easier with the | big chiplet approach. | | The Epyc 3 approach does have some drawbacks. Looking at | the Epyc 3 TDP numbers, there's probably a nontrivial | thermal cost to breaking out the dies as AMD has. Not to | mention the I/O for Epyc 3 is not on TSMC 7nm. 
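Turning adrian_b's FMA-unit point into arithmetic: peak FP64 throughput is roughly cores x FMA units per core x vector lanes x 2 (multiply plus add) x clock. The core counts and vector widths follow the thread; the sustained AVX-512 and AVX2 clocks below are assumed round numbers, not published specs:

    // Peak-FLOPS sketch for the "twice the FMA units" argument; clocks are assumptions.
    #include <cstdio>

    static double peak_gflops(double cores, double fma_units, double vec_bits, double ghz) {
        // FP64 lanes per vector = vec_bits / 64; each FMA counts as 2 FLOPs.
        return cores * fma_units * (vec_bits / 64.0) * 2.0 * ghz;
    }

    int main() {
        double icelake_like = peak_gflops(40.0, 2.0, 512.0, 2.6);  // 40 cores, 2x AVX-512 FMA, assumed 2.6 GHz
        double epyc_like    = peak_gflops(64.0, 2.0, 256.0, 3.0);  // 64 cores, 2x 256-bit FMA, assumed 3.0 GHz

        std::printf("hypothetical 40-core AVX-512 part: ~%.0f GFLOPS FP64\n", icelake_like);
        std::printf("hypothetical 64-core AVX2 part:    ~%.0f GFLOPS FP64\n", epyc_like);
        return 0;
    }

On paper the 40-core AVX-512 part lands in the same ballpark as the 64-core AVX2 part, which is adrian_b's point; the real outcome hinges on the sustained clock under heavy AVX-512 load, the ~80% downclock mentioned above.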
| mhh__ wrote: | Intel's process have been a disaster, however considering that | for the most part they aren't _that_ far behind (especially | financially) I don 't think they have to catch up much on | process at least to be right back in the fight - I will believe | that the pecking order has truly changed when AMD's | documentation and software is as good as Intel's. | Pr0GrasTiNati0n wrote: | And only 20 of those cores have back doors.....lulz | Sephr wrote: | As disappointing as the perf is for server workloads, what I'm | really interested in is SLI gaming performance. I can imagine | that this would be a boon for high end gaming with multiple x16 | PCIe 4.0 slots and 8 DDR4 channels. | | SLI really shines on HEDT platforms, and this is probably the | last non-multi-chip quasi-HEDT CPU for a while with this kind of | IO. | | (Yes, I know SLI is 'dead' with the latest generation of GPUs) | zamadatix wrote: | These would be absolute trash for SLI performance vs top end | standard consumer desktop parts. The best SKU has a peak boost | clock of 3.7 GHz, the core to core latencies are about twice as | high as the desktop parts, and the memory+PCIe bandwidth mean | little to nothing for gaming performance (remember SLI | bandwidth goes over a dedicate bridge as well) which is highly | sensitive to latencies instead. | marmaduke wrote: | Nice to see that AVX512 hasn't died with Xeon Phi. I see it | coming out in a number of high end but lightweight notebooks too | (Surface Pro with i7 10XXG7, MacBookPro 13" idem). This is a nice | way to avoid needing GPU for heavily vectorizable compute tasks, | assuming you don't need the CUDA ecosystem. | api wrote: | The 2020 Intel MacBook Air and 13" Pro have 10nm Ice Lake with | AVX512. The Ice Lake MacBook Air performs pretty well and very | close to the Ice Lake Pro, though of course the M1 destroys it. | mhh__ wrote: | > though of course the M1 destroys it. | | SIMD throughput? | api wrote: | Actually I don't know... I suspect Intel still wins in wide | SIMD. The M1 totally destroys Intel in general purpose code | performance, especially when you consider power | consumption. | bitcharmer wrote: | AVX-512 is an abomination in my field and we avoid it like the | plague. It looks like we're not the only ones. Linus has a lot | to say about it as well. | | https://www.phoronix.com/scan.php?page=news_item&px=Linus-To... | 37ef_ced3 wrote: | For example, AVX-512 neural net inference: https://NN-512.com | | Only interesting if you care about price (dollars spent per | inference) | | For raw speed (no matter the price) the GPU wins | dragontamer wrote: | GPGPU will never really be able to take over CPU-based SIMD. | | GPUs have far more bandwidth, but CPUs beat them in latency. | Being able to AVX512 your L1 cached data for a memcpy will | always be superior to passing data to the GPU. | | With Ice Lake's 1MB L2 cache, pretty much all tasks smaller | than 1MB will be superior in AVX512 rather than sending it to a | GPU. Sorting 250,000 Float32 elements? Better to SIMD Bitonic | sort / SIMD Mergepath | (https://web.cs.ucdavis.edu/~amenta/f15/GPUmp.pdf) on your | AVX512 rather than spend a 5us PCIe 4.0 traversal to the GPU. | | It is better to keep the data hot in your L2 / L3 cache, rather | than pipe it to a remote computer (even if the 16x PCIe 4.0 | pipe is 32GB/s and the HBM2 RAM is high bandwidth once it gets | there). | | -------- | | But similarly: CPU SIMD can never compete against GPGPUs at | what they do. 
GPUs have access to 8GBs @500GB/s VRAM on the | low-end and 40GBs @1000GB/s on the high end (NVidia's A100). | EDIT: Some responses have reminded me about the 80GB @ 2000GB/s | models NVidia recently released. | | CPUs barely scratch 200GB/s on the high end, since DDR4 is just | slower than GPU-RAM. For any problem where data-bandwidth and | parallelism is the bottleneck, that fits inside of GPU-VRAM | (such as many-many sequences of large scale matrix | multiplications), it will pretty much always be better to | compute that sort of thing on a GPU. | marmaduke wrote: | In my experience, the most important aspect missing in most | CPU GPU discussions, is that CPUs have a massive cache | compared to GPUs, and that cache has pretty good bandwidth | (~30 GB/core?), even if main memory doesn't. So even if your | task's hot data doesn't fit in L2 but in L3/core, AVX- | whatever per core processing is a good bet regardless of what | a GPU can do. | | Another aspect that seems like a hidden assumption in CPU-GPU | discussions is that you have the time-energy-expertise budget | to (re)build your application to fit GPUs. | dragontamer wrote: | On the memory perspective, I basically see problems in | roughly the following grouping of categories: | | 40TBs+ -- Storage-only solutions. "External Tape Merge sort | algorithm", "Sequential Table Scan", etc. etc. (SSDs or | even Hard drives if you go big enough) | | 4TB to 40TBs -- Multi-socket DDR4 RAM is king (8-way Ice | Lake Xeon Scalable Platinum will probably reach 40TBs). | Single-node distributed memory with NUMA / UPI to scale. | | 1TB to 4TB -- Single Socket DDR4 RAM (EPYC, even if at 4x | NUMA. Or Single-node Ice Lake). | | 80GB to 1TB -- DGX / NVlink distributed memory A100 ganging | up HBM2 together. GPU-distributed RAM is king. | | 256MBs to 80GBs -- HBM2 / GDDR6 Graphics RAM is king (80GB | A100 2TB/s). | | 1.5MBs to 256MBs -- L3 cache is king (8x32MBs EPYC L3 | cache, or POWER9 110MB+ L3 cache unified) | | 128kB to 1.5MBs -- L2 cache is king (1.25MB Ice Lake Xeons | L2, this article) | | 1kB to 128kB -- L1 cache is king. (128kB L1 cache on Apple | M1). Note: "GPU __Shared__" is a close analog to L1 and | competes against it, but is shared between 32 to 256 GPU | threads, so its not an apples-to-apples comparison. | | 1kB and below -- The realm of register-space solutions. | (See 64-bit chess engine bitboards and the like). Almost | fully CPU-constrained / GPU-constrained programming. 256x | 32-bit GPU registers per GPU-thread / SIMD thread. CPUs | have fewer nominal registers, but many "out of order" | buffers or "reorder buffers" that practically count as | register storage in a practical / pragmatic sense. CPUs | just use their "real registers" as a mechanism to | automatically discover parallelism in otherwise single- | thread written code. | | ------------ | | As you can see: GPUs win in some categories, but CPUs win | in others. And these numbers change every few months as a | new CPU and/or GPU comes out. And at the lowest levels: | CPUs and GPUs cannot be compared due to fundamental | differences in architecture. | | For example: GPU __shared__ memory has gather/scatter | capabilities (the NVidia PTX instructions / AMD GCN | instructions permute vs bpermute), while CPUs traditionally | only accelerate gather capabilities (pshufb), and leave | vgather/vscatter instructions to the L1 cache instead. 
GPUs | have 32x ports to __shared__, so every one of the | 32-threads in a wave-front can read/write every single | clock-tick (as long as all 32 they are on different | ports/alignment, or you have a special one-to-all | broadcast). CPUs only have 2 or 4 ports, so vscatter and | vgather operate slowly, as if a single thread were | reading/writing each of the memory locations. | | But CPU L1 cache has store-forwarding, MESI + cache | coherence, and other acceleration features that GPUs don't | have. | | GPUs are therefore more efficient at sharing data within | workgroups of ~256 threads, but CPUs are more efficient at | sharing data between cores, or even among out-of-die NUMA | solutions, thanks to robust MESI messaging. | ajross wrote: | FWIW: your DRAM numbers are quoting clock speeds and not | bandwidth. They aren't linear at all. In fact with enough | cores you can easily saturate memory that wide, and CPUs are | getting wider just as fast as GPUs are. The giant Epyc AMD | pushed out last fall has 8 (!) 64 bit DRAM channels, where | IIRC the biggest NVIDIA part is still at 6. | mrb wrote: | dragontamer is still correct. He quotes correct bandwidth | numbers. EPYC's 8 channels of DDR4-3200 gets it to 204.8 | GB/s (and, yes, that's _bandwidth_ ) | | Whereas Nvidia's A100 has over 2000 GB/s of memory | bandwidth. That's 10-fold better. | dragontamer wrote: | > 8 (!) 64 bit DRAM channels | | Yeah. And at 3200 Mbit/sec, that comes out to 200GB/s. | (3200 MHz x 8-bytes (aka 64-bit) == 25GB/s. x8 channels == | 200GB/s). | | > where IIRC the biggest NVIDIA part is still at 6. | | That's 6x *1024-bit* HBM2 channels. Total bandwidth is | 2000GBps, or over 10x the speed of the "8x channel EPYC". | Yeah, HBM2 is fat, extremely fat. | | ---------- | | *ONE* HBM2 channel offers over 300GBps bandwidth. And the | A100 has *SIX* of them. Literally ONE HBM2 channel beats | the speed of all 8x DDR4 EPYC memory channels working in | parallel. | ajross wrote: | You're still quoting clock speeds. That's not how this | works. Go check a timing diagram for a DRAM cycle in your | part of choice and do the math. | dragontamer wrote: | Do you know what 3200MHz / PC4-25600 DDR4 means? | | 25600 is the channel rate in (EDIT) MB/sec of the stick | of RAM. That's 25GB/s for a 3200 MHz DDR4 stick. x8 (for | 8-channels working in parallel) is 200GB/s. | | ----------- | | This has been measured in practice by Netflix: https://20 | 19.eurobsdcon.org/slides/NUMA%20Optimizations%20in... | | As you can see, Netflix's FreeBSD optimizations have | allowed EPYC to reach 194GB/s measured performance (or | just under the 200GB/s theoretical). And only with VERY | careful NUMA-tuning and extreme optimizations were they | able to get there. | gbl08ma wrote: | All of that is bandwidth and clock speed, not latency | dragontamer wrote: | Look, if CPUs were better at memory latency, the BVH- | traversal of raytracing would still be done on CPUs. | | BVH-tree traversals are done on the GPU now for a reason. | GPUs are better at latency hiding and taking advantage of | larger sets of bandwidth than CPUs. Yes, even on things | like pointer-chasing through a BVH-tree for AABB bounds | checking. | | GPUs have pushed latency down and latency-hiding up to | unimaginable figures. In terms of absolute latency, | you're right, GPUs are still higher latency than CPUs. 
| But in terms of "practical" effects (once accounting for | latency hiding tricks on the GPU, such as 8x way | occupancy (similar to hyperthreading), as well as some | dedicated datastructures / programming tricks (largely | taking advantage of the millions of rays processed in | parallel per frame), it turns out that you can convert | many latency-bound problems into bandwidth-constrained | problems. | | ----------- | | That's the funny thing about computer science. It turns | out that with enough RAM and enough parallelism, you can | convert ANY latency-bound problem into a bandwidth-bound | problem. You just need enough cache to hold the results | in the meantime, while you process other stuff in | parallel. | | Raytracing is an excellent example of this form of | latency hiding. Bouncing a ray off of your global data- | structure of objects involved traversing pointers down | the BVH tree. A ton of linked-list like current_node = | current_node->next like operations (depending on which | current_node->child the ray hit). | | From the perspective of any ray, it looks like its | latency-bound. But from the perspective of processing | 2.073 million rays across a 1920 x 1080 video game scene | with realtime-raytracing enabled, its bandwidth bound. | wmf wrote: | That presentation shows 194 gigabits/s which is only ~24 | gigabytes/s at the NIC; that requires ~96 gigabytes/s of | memory bandwidth. Usable memory bandwidth on Milan is | only <120 gigabytes/s which is about 60% of the | theoretical max. DRAM never gets more than ~80% of | theoretical max bandwidth because of command overhead | (which is what I think ajross keeps alluding to). | https://www.anandtech.com/show/16594/intel-3rd-gen-xeon- | scal... | dragontamer wrote: | I appreciate the correction. It seems like I made the | mistake of Gbit vs GByte confusion (little-b vs big-B). | | > (which is what I think ajross keeps alluding to) | | It seems like ajross is accusing me of underestimating | CPU-bandwidth. At least, that's my interpretation of the | discussion so far. As you've pointed out however, I'm | overestimating it. | | EDIT: But I'm overestimating it on both sides. A100 2000 | TB/s is the "channel bandwidth" as well, as the CAS and | RAS commands still need to go through the channel and get | interpreted. | volta83 wrote: | > Being able to AVX512 your L1 cached data for a memcpy will | always be superior to passing data to the GPU. | | The two last apps I worked on have been GPU-only. The CPU | process starts running and launches GPU work, and that's it, | the GPU does all the work until the process exits. | | There is no need to "pass data to the GPU" because data is | never on CPU memory, so there is nothing to pass from there. | All network and file I/O goes directly to the GPU. | | Once all your software runs on the GPU, passing data to the | CPU for some small task doesn't make much sense either. | dragontamer wrote: | So we know that GPUs are really good at raytracing and | matrix multiplication, two things that are needed for | graphics programming. | | However, the famous "Moana" scene for Disney-level | productions is a 93GB (!!!!) scene statically, with another | 131GBs (!!!) of animation data (trees blowing in the winds, | waves moving on the shore, etc. etc.). | | That's simply never going to fit on a 8GB, 40GB, or even | 80GB high-end GPU. 
The only way to work with that kind of | data is to think about how to split it up, and have the CPU | store lots of the data, while the GPU processes pieces of | the data in parallel. | | https://www.render-blog.com/2020/10/03/gpu-motunui/ | | Which has been done before, mind you. But it should be | noted that the discussion point for GPU-scale compute runs | into practical RAM-capacity constraints today, even on | movie-scale problems from 5 years ago (Moana was released | in 2016, and had to be rendered on hardware years older | than 2016). | | Moana scene is here if you're curious: | https://www.disneyanimation.com/resources/moana-island- | scene... | | ---------- | | But yes, if your data fits within the 8GBs GPU (or you can | afford a 40GB or 80GB VRAM GPU and your data fits in that), | doing everything on the GPU is absolutely an option. | oivey wrote: | We know that GPUs are really good at far more than ray | tracing and matrix multiplication. Oversimplifying a bit, | they're great at basically any massively parallel | operation that has minimal branching and can fit in | memory. Using a GPU to just add two images together | probably isn't worth it, but many real world workflows | allow you to operate solely on the GPU. | | If you're Disney, you can afford boxes with 10+ A100s | with NVLink sharing the memory in a single 400+ GB pool. | Unknown if that ends up being more economical than the | equivalent CPU version, but it's important to understand | in order to evaluate the future of GPUs. | volta83 wrote: | >That's simply never going to fit on a 8GB, 40GB, or even | 80GB high-end GPU. The only way to work with that kind of | data is to think about how to split it up, and have the | CPU store lots of the data, while the GPU processes | pieces of the data in parallel. | | There is always a problem size that does not fit into | memory. | | Whether that memory is the GPU memory, or the CPU memory, | doesn't really matter. | | We have been solving this problem for 60 years already. | It isn't rocket science. | | --- | | The CPU doesn't have to do anything. | | The GPU can map a file stored on hard disk to VRAM | memory, do random access into it, process chunks of it, | write the results into network sockets and send them over | the network, etc. | | The only thing the CPU has to do is launch a kernel: | int main(args...) { | main_kernel<<<...>>>(args...); synchronize(); | return 0; } | | and this is a relatively accurate depiction of how the | "main function" of the two latests apps I've worked on | look like: the GPU does everything. | | --- | | > However, the famous "Moana" scene for Disney-level | productions is a 93GB (!!!!) scene statically, with | another 131GBs [...] That's simply never going to fit on | a high-end GPU. | | LOL. | | V100 with 32Gbs and 8x per rack gave you 256 Gb of VRAM | addressable from any GPU in the rack. | | A100 with 80GB and 16x per rack give you 1.3 TB of VRAM | addressable from any GPU in the rack. | | You can fit Moana in GPU VRAM in a now old DGX-2. | | If you are willing to bet cash on Moana never fitting on | a single GPU, I'd take you on that bet. Sounds like free | money to me. | dragontamer wrote: | I'll post the link again: https://www.render- | blog.com/2020/10/03/gpu-motunui/ | | This person rendered the Moana scene on just 8GBs of GPU | VRAM. It does this by rendering 6.7GB chunks at a time on | the GPU, with the CPU keeping the RAM-heavy "big picture" | in mind. (EDITED paragraph. First wording of this | paragraph was poor). 
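A minimal CPU-side sketch of the out-of-core chunking idea described above: walk a working set that is larger than device memory in budget-sized chunks, process each chunk, and combine the partial results. process_chunk() is a stand-in for whatever GPU kernel would actually run per chunk; nothing here reflects the actual GPU-Motunui code:

    // Out-of-core chunking sketch; the "device budget" and the per-chunk work
    // are stand-ins, not a real renderer or a real GPU transfer.
    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    static double process_chunk(const std::vector<float>& chunk) {
        double sum = 0.0;                        // placeholder work: just accumulate
        for (float v : chunk) sum += v;
        return sum;
    }

    int main() {
        const std::size_t total_elems  = std::size_t(1) << 24;  // "scene" larger than the budget
        const std::size_t budget_bytes = std::size_t(8) << 20;  // pretend 8 MB of "device" memory
        const std::size_t chunk_elems  = budget_bytes / sizeof(float);

        double result = 0.0;
        std::vector<float> chunk;
        for (std::size_t start = 0; start < total_elems; start += chunk_elems) {
            std::size_t n = std::min(chunk_elems, total_elems - start);
            chunk.assign(n, 1.0f);           // stand-in for streaming this chunk from host RAM/disk
            result += process_chunk(chunk);  // real system: upload chunk, launch kernel, read back
        }
        std::printf("processed %zu elements in %zu-element chunks, result = %.0f\n",
                    total_elems, chunk_elems, result);
        return 0;
    }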
| | ------ | | Its not that these problems "cannot be solved", its that | these problems "become grossly more complicated" when | under RAM / VRAM constraints. They're still solvable, but | now you have to do strange techniques. | | ------ | | With regards to a Ray-tracer, tracing the ray-of-light | that's bouncing around could theoretically touch ANY of | the 93GBs of static object data (which could have been | shifted by any of the 131 GBs of animation data). That is | to say: a ray that bounces off of a any leaf on any tree | could bounce in any direction, hitting potentially any | other geometry in the scene. | | That pretty much forces you to keep the geometry in high- | speed RAM, and not do an I/O cycle between each ray- | bounce. | | As a rough reminder of the target performance: Raytracers | aim at ~30 million to 30-billion ray-bounces per second, | depending on movie-grade vs video-game optimized. Either | way, that level of performance is really only ever going | to be solved by keeping all of the geometry data in RAM. | | > A100 with 80GB and 16x per rack give you 1.3 TB of VRAM | addressable from any GPU in the rack. | | That doesn't mean it makes sense to traverse a BVH-tree | across a relatively high-latency NVLink connection off- | chip. I know GPUs have decent latency hiding but... | that's a lot of latency to hide. | | Again: your CPU-renderers can hit 10s of millions of rays | per second. I'm not sure if you're gonna get something | pragmatic by just dropping the entire geometry into | distributed NVSwitch'd memory and hoping for the best. | | Honestly, that's where the 8GB CPU+GPU team becomes | interesting to me. A methodology for clearly separating | the geometry and splitting up which local compute-devices | are responsible for handling which rays is going to scale | better than a naive dump reliant on remote-connections | pretending to be RAM. | | Video games hit Billions of rays/second. The promise of | GPU-compute is on that order, and I just doubt that | remote RAM accesses over NVLink will get you there. | | > If you are willing to bet cash on Moana never fitting | on a single GPU, I'd take you on that bet. Sounds like | free money to me. | | The issue is not Moana (or other movies from 2016), the | issue are the movies that will be made in 2022 and into | the future. Especially if they're near photorealistic | like Marvel-movies or Star Wars. | | ---------- | | The other problem is: what's cheaper? A DGX-system could | very well be faster than one CPU system. But would it be | faster than a cluster of Ice-Lake Xeons with AVX512 each | with the precise amount of RAM needed for the problem? | (Ex: 512GBs in some hypothetical future movie?) | | A team probably would be better: CPUs have expandable | RAM, that's their biggest advantage. GPUs have fixed RAM. | Slicing the problem up so that pieces of Raytracing fits | on GPUs, while the other, "bulkier" bits fit on CPU DDR4 | (or DDR5), would probably be the most cost-efficient way | at solving the raytracing problem. | | The GPU-Moana experiment showed that "collecting rays | that bounce outside of RAM" is an efficient methodology. | Slice the scene into 8GB chunks, process the rays that | are within that chunk, and the collate the rays together | to find where the rays go. | aviraldg wrote: | > There is no need to "pass data to the GPU" because data | is never on CPU memory, so there is nothing to pass from | there. All network and file I/O goes directly to the GPU. 
|
| This is very interesting - do you have a link that explains how it works / is implemented?
| dragontamer wrote:
| PS5, Xbox Series X, and NVidia have a "GPU Direct I/O" feature.
|
| https://www.nvidia.com/en-us/geforce/news/rtx-io-gpu-acceler...
|
| https://www.amd.com/en/products/professional-graphics/radeon...
|
| The GPU itself can send PCIe 4.0 messages out. So why not have the GPU make I/O requests on its own behalf? It's a bit obscure, but this feature has been around for a number of years now. The idea is to remove the CPU and DDR4 from the loop entirely, because those just bottleneck / slow down the GPU.
|
| --------
|
| From an absolute performance perspective, it seems good. But CPUs are really good and standardized at accessing I/O in very efficient ways. I'm personally of the opinion that blocking and/or event-driven I/O from the CPU (with the full benefit of threads / OS-level concepts) would be easier to think about than high-performance GPU code.
|
| But still, it's a neat concept, and it seems like there's a big demand for it (see PS5 / Xbox Series X).
| etaioinshrdlu wrote:
| The CPU is still acting as the PCIe controller though (right?), which kind of makes the CPU act like a network switch. PCIe is a point-to-point protocol, kind of like Ethernet. Old-school PCI was a shared bus, so devices might have been able to talk to each other directly, but I don't think that was ever actually used.
| d110af5ccf wrote:
| My understanding matches yours, but it's worth noting that (IIUC) memory and PCIe are (last time I checked?) a separate I/O subsystem that just happens to reside within the same package as the CPU on modern chips. So P2PDMA avoids burning CPU cycles and RAM bandwidth shuffling data around that you never wanted to use on the CPU anyway. (Also see: https://lwn.net/Articles/767281/)
| dragontamer wrote:
| Take a look at the Radeon more closely.
|
| I think the Radeon + Premiere Pro documentation makes it clear how it works: https://www.amd.com/system/files/documents/radeon-pro-ssg-pr...
|
| As you can see, the GPU is attached to the x16 slot, and the 4x NVMe SSDs are attached to the GPU. When the CPU wants to store data on the SSD, it communicates first with the GPU, which then passes the data through to the four SSDs.
|
| That's the simpler example.
|
| --------------
|
| In NVidia's case, they're building on top of GPUDirect Storage (https://developer.nvidia.com/blog/gpudirect-storage/), which seems to be based on enterprise technology where PCIe switches were used.
|
| NVidia's GPUs would command the PCIe switch to grab data, without having the PCIe switch send the data to the CPU (where it would most likely be dropped into DDR4, or maybe L3 in an optimized situation).
| ASpaceCowboi wrote:
| Will this work on the latest Mac Pro? Probably not, right?
| wmf wrote:
| No, it's a different socket.
| robbyt wrote:
| Classic Intel
| wmf wrote:
| You can't increase memory and PCIe channels while keeping the same socket. This isn't a cash grab; it's actual progress.
| paulpan wrote:
| TL;DR from Anandtech is that while this is a good improvement over the previous gen, it still falls behind its AMD (Epyc) and ARM (Altra) counterparts. What's somewhat alarming is that on a per-core comparison (28-core 205W designs), the performance increase can be a wash. That doesn't bode well for Intel, as both of its competitors are due for refreshes that will re-widen the gap.
|
| The key question will be how quickly Intel will shift to the next architecture, Sapphire Rapids. Will this release be like the consumer/desktop Rocket Lake, i.e. just a placeholder to essentially volume-test the 10nm fabrication for the datacenter? Probably at least a year out at this point, since Ice Lake SP was originally supposed to be released in 2H2020.
| gsnedders wrote:
| > The key question will be how quickly Intel will shift to the next architecture, Sapphire Rapids. Will this release be like the consumer/desktop Rocket Lake, i.e. just a placeholder to essentially volume-test the 10nm fabrication for the datacenter? Probably at least a year out at this point, since Ice Lake SP was originally supposed to be released in 2H2020.
|
| Alder Lake is meant to be a consumer part contemporary with Sapphire Rapids, which is server only. They're likely based on the same (performance) core, with Alder Lake additionally having low-power cores.
|
| Last I heard, the expectation was still that these new parts would enter the market at the end of this year.
| CSSer wrote:
| Lately Intel seems to be getting a lot of flak here. As a layperson in the space who's pretty out of the loop (I built a home PC about a decade ago), could someone explain to me why that is? Is Intel really falling behind, or dressing up metrics to mislead, or something like that? I also partly ask because I feel that I only really superficially understand why Apple ditched/is ditching Intel, although I understand if that is a bit off-topic for the current article.
| s_dev wrote:
| >Is Intel really falling behind
|
| Intel is already behind AMD -- they have no product segment where they are absolutely superior. That means AMD is setting the market pace.
|
| On top of this, Apple is switching to ARM-designed CPUs. This also looks to be a vote of no confidence in Intel.
|
| The consensus seems to be that Intel -- who have their own fabs -- never really nailed anything under 14nm and are now being outcompeted.
| meepmorp wrote:
| Apple designs its own chips; it doesn't use ARM's designs. They do use the ARM ISA, tho.
| totalZero wrote:
| > Intel is already behind AMD -- they have no product segment where they are absolutely superior.
|
| There are some who would dispute this claim, but I think it's at least a defensible one.
|
| Still, availability is an important factor that isn't captured by benchmarking. AMD has had CPU inventory trouble in the low-end laptop segment and high-end desktop segment alike.
|
| > The consensus seems to be that Intel -- who have their own fabs -- never really nailed anything under 14nm and are now being outcompeted.
|
| Intel has done well with 10nm laptop CPUs. They were just very late to the party. Desktop and server timelines have been quite a bit worse. I agree Intel did not nail 10nm, but they're definitely hanging in there. It's one process node at the cusp of the transition to EUV, so some of the defeatism around Intel may be overzealous if we keep in mind that 7nm process development has been somewhat parallel to 10nm because of the difference in lithographic technology.
| yoz-y wrote:
| Intel was unable to improve their fabrication process year after year, while repeatedly promising to do so. Now they have been practically lapped twice. Apple has a somewhat specific use case, but their CPUs have significantly better performance per watt.
| matmatmatmat wrote:
| Some of the other comments above have touched on this, but I think there is also a bit of latent anti-Intel sentiment in many people's minds. Intel extracted a non-trivial price premium out of consumers for many, many years (both for chips and by forcing people to upgrade motherboards by changing CPU sockets) while AMD could only catch up to them for brief periods of time. People paid that price premium for one reason or another, but it doesn't mean they were thrilled about it.
|
| Many people, I'd say especially enthusiasts, were quite happy when AMD was able to compete on a performance/$ basis and then outright beat Intel.
|
| Of course, now the tables have turned and AMD is able to extract that price premium while Intel cuts prices. Who knows how long this will last, but Intel is still the 800 lb gorilla in terms of capacity, engineering talent, and revenue. I don't think we've heard the last from them.
| blackoil wrote:
| A perfect storm. Intel had trouble with its 10nm/7nm process engineering, which TSMC has been able to achieve. AMD had a resurgence with the Zen architecture, and ARM/Apple/TSMC/Samsung put hundreds of billions into catching up with x86 performance.
|
| Intel is still the biggest player in the game, because even though they are stuck at 14nm, AMD isn't able to manufacture enough to take bigger chunks of the market. Apple won't sell to the PC/datacenter space, and the rest are still niche.
| ac29 wrote:
| > even though they are stuck at 14nm
|
| I think this isn't quite fair; their laptop 10nm chips have been shipping in volume since last year, and their server chips were released today, with 200k+ units already shipped (according to Anandtech). The only line left on 14nm is socketed desktop processors, which is a relatively small market compared to laptops and servers.
| colinmhayes wrote:
| Hacker News users generally aren't very interested in laptop processors. Sure, business-wise they're incredibly important, but as far as getting flak on Hacker News goes, laptop chips won't stop it. People here have been waiting for Intel 10nm on server and especially desktop for 6 years now.
| totalZero wrote:
| Unless you have scraped past posts to perform some kind of sentiment analysis, this is pure speculation intended to move the goalposts on GP.
| jimbob21 wrote:
| Yes, quite simply they have fallen behind while also promising things they have failed to deliver. As an example, their most recent flagship release is the 11900K, which has 2 fewer cores (now 8) than its predecessor (the 10900K had 10), and almost no improvement to speak of otherwise (in some games it's ~1% faster). On the other hand, AMD's flagship, which to be fair is $150 more expensive, has 16 cores, very similar clock speeds, and is much more energy efficient (Intel and AMD calculate TDP differently). Overall, AMD is the better choice by a large margin, and Intel is getting flak because it rested on its laurels for the last decade(?) and hasn't done anything to improve itself.
|
| To put it in numbers alone, look at this benchmark. Flagship vs flagship: https://www.cpubenchmark.net/compare/Intel-i9-11900K-vs-AMD-...
| formerly_proven wrote:
| Naturally the 11900K performs quite a bit worse than the 10900K in anything which uses all cores, but the remarkable thing about the 11900K is that it even performs worse in a bunch of game benchmarks, so as a product it genuinely doesn't make any sense.
| chx wrote:
| Absolutely.
| Intel has been stuck on the 14nm node for a very, very long time. 10nm CPUs were supposed to ship in 2015; they really only shipped in late 2019/2020. Meanwhile AMD caught up, and Intel has been doing the silliest shenanigans to appear as if they were competitive, like in 2018 when they demonstrated a 28-core 5GHz CPU and kinda forgot to mention the behind-the-scenes one-horsepower (~745W) industrial chiller keeping that beast running.
|
| Also, the first 10nm "Ice Lake" mobile CPUs were not really an improvement over the by-then many-times-refined 14nm "Comet Lake" chips. It's been a faecal pageant.
| mhh__ wrote:
| Intel's processes (i.e. turning files on a computer into chips) have been a complete disaster in recent years, to the point of basically _missing_ one of their key die shrinks entirely as far as I can tell.
|
| They are, in a certain sense, suffering from their own success, in that their competitors have basically been nonexistent up until Zen came about (and even then, only with Zen 3 have Intel truly been knocked off their single-thread perch). This has led to them getting cagey, and a bit ridiculous in the sense that they are not only backporting new designs to old processes but also pumping them up to genuinely ridiculous power budgets. With Apple, AMD, and TSMC they have basically been caught with their trousers down by younger and leaner companies.
|
| Ultimately this is where Intel need good leadership. The MBA solution is to just give up and do something else (e.g. spin off the fabs), but I think they should have the confidence (as far as I can tell this is what they are doing) to rise to the technical challenge - they will probably never have a run like they did from Nehalem to shortly before now, but throwing in the towel means that the probability is zero.
|
| Intel have been in situations like this before, e.g. when Itanium was clearly doomed and AMD were doing well (amd64), they came back with new processors and basically ran away to the bank for years - AMD's server market share is still pitiful compared to Intel's (10% at most), for example.
| Symmetry wrote:
| I don't want to counsel despair, but I'm not as sanguine as you either. Intel has had disastrous microarchitectures before: Itanium, P4, and earlier ones. But it's never had to worry about recovering from a _process_ disaster before. It might very well be able to, but I worry.
| mhh__ wrote:
| I'm not exactly optimistic either, I just think that the doomsaying is overblown (and sometimes looks like a tribal thing from Apple and AMD fans if I'm being honest - i.e. companies aren't your friends).
| ac29 wrote:
| > Intel's processes (i.e. turning files on a computer into chips) have been a complete disaster in recent years, to the point of basically missing one of their key die shrinks entirely as far as I can tell.
|
| Which one? I don't believe they missed a die shrink, it just took a _long_ time. Intel 14nm came out in 2014 with their Broadwell processors, and the next node, 10nm, came out in 2019 (technically 2018, but very few units shipped that year).
| totalZero wrote:
| Intel killed the longstanding "tick-tock" model in 2016 because of failures with 10nm yield and the higher-than-expected costs of 14nm.
| Intel got too aggressive with the timeline of the die shrink, which led to them trying to do 10nm on DUV rather than waiting for EUV technology, whose light is about an order of magnitude shorter in wavelength than DUV's (and thus able to resolve today's nanoscale features without all the RETs needed for DUV).
|
| From the 2015 10-K [0]:
|
| _"We expect to lengthen the amount of time we will utilize our 14nm and our next-generation 10nm process technologies, further optimizing our products and process technologies while meeting the yearly market cadence for product introductions."_
|
| Spoiler alert: in the five years after shelving the tick-tock model, Intel also missed the yearly market cadence for product introductions.
|
| [0] https://www.sec.gov/Archives/edgar/data/50863/00000508631600...
| mhh__ wrote:
| Cannon Lake, I believe, was basically cancelled.
| chx wrote:
| You wish. It was released because a bunch of Intel managers had bonuses tied to launching 10nm, and so they released it.
| ineedasername wrote:
| They can't get their next-gen fabs (chip factories) into production. It's been a problem long enough that they're not even next-gen anymore: it's current-gen, about to be previous-gen.
|
| So what you're seeing isn't really anti-Intel, it's probably often more like bitter disappointment that they haven't done better. Though I'm sure there's a tiny bit of fanboy-ism for & against Intel.
|
| There's definitely some of that pro-AMD fanboy sentiment in the gaming community where people build their own rigs: AMD chips are massively cheaper than comparable Intel chips.
| M277 wrote:
| Just a minor nitpick regarding your last paragraph: this is no longer the case. Intel is now significantly cheaper after they heavily cut prices across the board.
|
| For instance, you can now get an i7-10700K (which is roughly equivalent in single-thread and better in multi-thread performance) for less than an R5 5600X.
| robocat wrote:
| Nitpick: you are comparing price where you should be comparing performance per dollar, or are you cherry-picking the wrong comparison?
|
| My cherry-pick is one where the AMD chip is 30% more expensive, but multi-threaded performance is 100% better: https://www.cpubenchmark.net/compare/Intel-i9-11900K-vs-AMD-...
|
| Edit: picking individual processors to compare (especially low-volume ones) is often not useful when talking about how well a company is competing in the market.
| makomk wrote:
| The comment you're replying to is "cherry-picking" the current-gen AMD processor which offers the best value for most users. You're cherry-picking an Intel processor which almost no one has any reason to buy over other Intel options (the i9-11900K is much more expensive than the 11700K or 10700K for little extra performance; AMD had a few chips like this last gen, and they actually downplayed how much of a price increase this gen was by only comparing to those poor-value chips). One of these comparisons is a lot more useful than the other.
| MangoCoffee wrote:
| >So what you're seeing isn't really anti-Intel, it's probably often more like bitter disappointment that they haven't done better.
|
| It's back to where everyone designs their own chips for their own products but doesn't need a fab, thanks to foundries like TSMC and Samsung.
| tyingq wrote:
| Lots of shade because they first missed the whole mobile market, then got beaten by AMD's Zen by missing the chiplet concept and a successful current-gen process node, and then finally were also overshadowed by Apple's M1. The M1 thing is interesting, because it likely means the next set of ARM Neoverse CPUs for servers, from Amazon and others, will be really impressive. Intel is behind on many fronts.
| mhh__ wrote:
| >likely means the next set of ARM Neoverse CPUs from Amazon and others will be really impressive
|
| M1 is proof that it can be done; however, you can absolutely make a bad CPU for a good ISA, so I wouldn't take it for granted.
| tyingq wrote:
| Might be a hint as to how much of M1's prowess is just the process node and how much is Apple.
| JohnJamesRambo wrote:
| https://jamesallworth.medium.com/intels-disruption-is-now-co...
|
| I think that summarizes it pretty well in that one graph.
___________________________________________________________________
(page generated 2021-04-06 23:00 UTC)