[HN Gopher] eGPU: A 750 MHz Class Soft GPGPU for FPGA
___________________________________________________________________

eGPU: A 750 MHz Class Soft GPGPU for FPGA

Author : matt_d
Score  : 39 points
Date   : 2023-08-01 20:11 UTC (2 hours ago)

web link (arxiv.org)

| stefanpie wrote:
| One group at Georgia Tech in our building has been working on
| open-source GPU designs that can also target FPGAs and
| interoperate with RISC-V. They have several publications on the
| work they have built up. Thought I might share since it's not
| referenced in the submission paper.
|
| https://vortex.cc.gatech.edu/

| mepian wrote:
| They still haven't published the source code for their Skybox
| project, I wonder why. Unless I missed it in their repository?
| https://github.com/vortexgpgpu

| gsmecher wrote:
| Also discussed here:
| https://old.reddit.com/r/FPGA/comments/15fnb6u/egpu_a_750_mh...

| dragontamer wrote:
| For a GPU circuit, it basically comes down to the number of
| hardware multipliers on the FPGA, does it not?
|
| I remember synthesizing a 16-bit Wallace tree in a lab exercise
| back in college. I think that single multiplier used up 70% of
| my LUTs.
|
| You will only get massive numbers of parallel hardware
| multipliers if the underlying FPGA has a ton of hard multiplier
| blocks (like Xilinx's VLIW SIMD AI chips).
|
| -------
|
| At all computer sizes, a GPU will probably have more multiply
| circuits than an equivalent-cost FPGA, with the exception of
| maybe those AI chips from Xilinx (where the individual cores
| are basically pre-synthesized with a hardcoded ISA).
|
| Ex: at under 500 mW of power you'd probably prefer some ARM
| NEON SIMD or a TI DSP/VLIW. At cell-phone power levels you'd
| prefer a cell-phone GPU, and at desktop/server levels you'd
| prefer a desktop GPU.
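dragontamer's LUT-cost anecdote is easy to sanity-check with a toy model. A hypothetical Python sketch (not from the thread; the function name and the per-stage cost estimate are illustrative assumptions) counting the 3:2 carry-save reduction stages a Wallace tree needs:

```python
# Toy model of Wallace-tree multiplier cost (illustrative, not from
# the thread). An n-bit array multiplier generates n partial-product
# rows; a Wallace tree reduces them with 3:2 carry-save compressors
# until only two rows remain for a final carry-propagate adder.

def wallace_stages(rows: int) -> int:
    """Number of 3:2 compressor stages to reduce `rows` rows to 2."""
    stages = 0
    while rows > 2:
        # Each stage turns every group of 3 rows into 2; leftover
        # rows (rows % 3) pass through unchanged.
        rows = 2 * (rows // 3) + rows % 3
        stages += 1
    return stages

# A 16-bit multiply starts with 16 partial-product rows:
# 16 -> 11 -> 8 -> 6 -> 4 -> 3 -> 2, i.e. six compressor stages,
# each stage costing on the order of n full adders in LUTs -- which
# is why one soft multiplier can eat a large fraction of a small FPGA.
print(wallace_stages(16))  # -> 6
```

This is why FPGAs ship hard DSP blocks: the entire reduction tree above is a single pre-built silicon macro instead of hundreds of LUTs.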
| danhor wrote:
| > At all computer sizes, a GPU will probably have more multiply
| > circuits than an equivalent-cost FPGA
|
| Very likely yes, but FPGAs often have hundreds to thousands of
| hardware multipliers as part of their DSP blocks. See, for
| example, newer AMD FPGAs:
| https://eu.mouser.com/datasheet/2/903/ds890_ultrascale_overv...

| mathisfun123 wrote:
| I wish people would stop quoting marketing material as some
| kind of representation of what they know.
|
| You're giving completely the wrong impression about DSP slices
| - it is absolutely not 1 DSP slice per FP operator at any
| precision at which you would want to do floating-point
| arithmetic. It's definitely at least 2, plus a whole bunch of
| LUTs (~500), for FP16 with 4 pipeline stages or something like
| that. And if you want faster (fewer stages), then you need more
| slices. On the Alveo U280, which is an UltraScale part, I have
| never been able to effectively utilize more than ~4000 DSP
| slices (out of 9024) for 5,4 mults, and that cost basically 99%
| of the CLBs in SLR1 and SLR2.
|
| And even then, disconnected FPUs are completely meaningless
| without a datapath implementing e.g. matmul, and boy oh boy do
| you have no clue what you're in for there.
|
| Takeaway: it's pointless to compare raw spec-sheet numbers when
| _everything_ comes down to the datapath.

| pkaye wrote:
| How much would that FPGA cost?

| UncleOxidant wrote:
| The FPGAs with enough multipliers to be competitive against an
| actual GPU are going to be quite a bit more expensive than a
| GPU, aren't they?

| Lramseyer wrote:
| Full disclosure: I work for an FPGA company.
|
| The mind-blowing part of all of this is the fact that they were
| able to close timing at 771 MHz. That is insanely fast for an
| FPGA. For perspective, most modern FPGA designs run at around
| 300 MHz.* While most of the heavy lifting in this design uses
| hardened components like DSPs and FPUs, it's still very
| impressive to see!
|
| What I didn't see talked about much was how memory is loaded
| into and out of the processor. I'm curious to see what the
| memory bandwidth numbers look like, as well as the resource
| utilization of the higher-level routing.
|
| *For most hardware designs that aren't things like CPUs and
| GPUs, you don't always need a super-high clock speed. You have
| a lot more flexibility to compute in space rather than in time
| (think: more threads running slower). The pros and cons of such
| tradeoffs are a bit of a complicated topic, but should at least
| be noted.

| mathisfun123 wrote:
| > The mind-blowing part of all of this is the fact that they
| > were able to close timing at 771 MHz
|
| It's true, but I mean, this is Intel in-house research, right?
| If they couldn't get absolute peak fmax on their own parts,
| that would be a really bad look, right? Plus these Stratix
| parts have hard FP blocks (not just DSPs), so they're basically
| mostly scheduling stuff rather than building the whole
| datapath. But admittedly I haven't read the paper...
|
| > Full disclosure: I work for an FPGA company
|
| I currently do too (as an intern, maybe even the same one as
| you) and I haven't looked very hard, but I'm sure we have
| similar fmax-achieving projects (maybe even GPUs, since we're
| fighting hard to compete with Nvidia...).

| unwind wrote:
| Uh, non-native question: what is the word "class" doing in the
| title?
|
| Is a hyphen missing, so it should be "750 MHz-class"? I
| searched the linked page, but the word only appears in the
| title, sans hyphen.

| avmich wrote:
| Wonder if this could help to alleviate the momentary shortage
| of GPUs on the market.

| ZiiS wrote:
| 10-year-old entry-level GPUs have a hundred 750 MHz cores.

| monocasa wrote:
| "Cores" are really overstated in GPUs. CUDA cores are really
| SIMD lanes, and if you counted them the same way a CPU does,
| you'd get somewhere in the dozens-of-cores range even for
| modern GPUs.
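monocasa's counting convention can be made concrete. A minimal Python sketch (the RTX 4070 numbers are Nvidia's published specs; the "CPU-equivalent" framing is the comment's, and the helper name is made up for illustration):

```python
# Illustrative sketch of the "CUDA cores are SIMD lanes" argument.
# Nvidia counts each FP32 SIMD lane as a "CUDA core", while a
# CPU-style count would treat one whole SM as one core.

def cpu_equivalent_cores(cuda_cores: int, lanes_per_sm: int = 128) -> int:
    """Convert a marketing 'CUDA core' count to an SM count
    (128 FP32 lanes per SM on Ada-generation parts)."""
    return cuda_cores // lanes_per_sm

# RTX 4070: 5888 "CUDA cores", organized as 46 SMs of 128 lanes each.
print(cpu_equivalent_cores(5888))  # -> 46
```

The same exercise on a CPU runs in reverse: a 16-core CPU with 512-bit AVX-512 units would advertise 16 * 16 = 256 FP32 "cores" if it counted lanes the way GPUs do.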
| codedokode wrote:
| A proper method is counting ALUs instead of vague "cores".

| xigency wrote:
| That seems backwards to me. Sure, a GPU core is less general,
| but in terms of concurrent execution, memory bandwidth, and
| FLOPS, I would expect hundreds to thousands of cores for all
| new GPU offerings. Apple's double-digit GPU core counts, for
| instance, sound extremely understated.

| monocasa wrote:
| It's not. The best comparison is the SM count for Nvidia
| hardware, or the wavefront count for AMD hardware. So a 4070
| has 46 cores as you'd count them on a CPU.

| latchkey wrote:
| I don't think this will be momentary. The reality is that there
| have been shortages of GPUs for a long time now, and demand
| isn't going down. People are signing 3-year contracts with
| Lambda now.

| [deleted]

| monocasa wrote:
| If it's on an FPGA, then it doesn't really compete with GPUs
| you can buy, from just about any perspective other than
| openness.
___________________________________________________________________
(page generated 2023-08-01 23:00 UTC)