[HN Gopher] The History, Status, and Future of FPGAs Author : skovorodkin Score : 123 points Date : 2020-07-23 14:57 UTC (8 hours ago) web link (queue.acm.org) | jcranmer wrote: | As a bit of a counterpoint: | | One of my prior projects involved working with a lot of ex-FPGA | developers. This is obviously a rather biased group of people, | but I heard a lot of feedback that was very negative about | FPGAs. | | One telling comment is that since the 90s, FPGAs were seen | as the obvious "next big technology" for the HPC market... and then | Nvidia came out and pushed CUDA hard, and now GPGPUs have | cornered the market. FPGAs are still trying to make inroads (the | article here mentions it), but the general sense I have is that | success has not been forthcoming. | | The issue with FPGAs is that you start with a clock rate in the 100s | of MHz (the exact clock rate depends on how long the paths need | to be), compared with a few GHz for GPUs and CPUs. Thus you need | a 5x performance win from switching to an FPGA just to break | even, and you probably need another 2x on top of that to motivate | people to go through the pain of FPGA programming. Nvidia made | GPGPU work by demonstrating performance gains meaningful enough | to make the cost of rewriting code worth it; FPGAs have yet | to do that. | | Edit: It's worth noting that the programming model of FPGAs has | consistently been cited as the thing holding back FPGAs for the | past 20 years.
The success of GPGPU, despite the need to move to | a different programming model to achieve gains there, and the | inability of the FPGA community to furnish the necessary magic | programming model, suggests to me (and my FPGA-skeptic coworkers) | that the programming model isn't the actual issue preventing | FPGAs from succeeding, but that FPGAs have structural issues | (e.g., low clock speeds) that prevent their utility in wider | market classes. | lnsru wrote: | It's not the speed that holds FPGA adoption back. It's the | development process/time. While one can start with a GPU | immediately, with an FPGA you first need to develop a whole PCIe | infrastructure and efficient data movers. One is done with the GPU | while FPGA developers are just starting on the algorithms. As long as | one does not need real-time capability, the GPU is an obvious | choice. My 200 MHz design outcompetes every CPU and GPU out | there within a very narrow data processing window, but development | time is 5x compared to regular software. | sfgnilnio wrote: | You ever work with an FPGA? The programming model and the | tooling are a _huge_ part of the problem. | | Verilog and VHDL have basically nothing in common with any | language you've ever used. | | Compilation can take multiple _days_. This means that debugging | happens in simulation, at maybe 1/10,000th of the desired speed | of the circuit. | | If you try to make something too big, it just plain won't fit. | There is no graceful degradation in performance; an inefficient | design will simply not function, come Hell or high water. | | The existing compilers will happily build you _the wrong thing_ | if you write something ill-defined. There are a ton of things | expressible in a hardware description language that don't | actually map onto a real circuit (at least not one that can be | automatically derived). In any normal language, anything you can | express is well-defined and can be compiled and executed. Not | so in hardware.
| | Timing problems are a nightmare. _Every single logic element_ | acts like its own processor, writing directly into the | registers of its neighbours, with no primitives for | coordination. Imagine if you had to worry about race conditions | _inside of a single instruction!_ | | Maybe if all these problems were solved FPGAs still wouldn't | catch on, but let's not pretend the programming model isn't a | problem. Hardware is fundamentally hard to design and the | tooling is all 50 years out of date. | blargmaster33 wrote: | There is a reason EEs do the FPGA work. | formerly_proven wrote: | > You ever work with an FPGA? The programming model and the | tooling are a huge part of the problem. | | I'd argue FPGAs aren't programmed and don't have a | programming model. Complaints that the programming model of | FPGAs holds their adoption back are thus conceptually | ill-founded. (The tooling still sucks.) | alfalfasprout wrote: | I mean, the problem is that in the FPGA world the tooling | and synthesis languages are inextricably linked. HLS is an | approach that, IMO, is also the completely wrong direction, | since a general-purpose programming language like C/C++ | won't map nicely to the constructs you need in FPGA design. | | What we really need is a lightweight, open source toolchain | for FPGAs and one or more "higher level" synthesis | languages. I've always wondered if a DSL hosted in a | higher-level language like Python isn't a better way to do | this. Rather than try to transpile an entire language, just provide | building blocks and interfaces that can then be used to | generate Verilog/VHDL. | jlokier wrote: | > I've always wondered if a DSL hosted in a higher-level | language like Python isn't a better way to do this | | Like this? http://www.myhdl.org/ | borramakot wrote: | Just to throw in one more complication, I'll assert that the | only benefits of FPGAs over ASICs are one-time costs and time | to market.
Those are big benefits, but almost by definition, | they aren't as important for workloads that are large-scale and | stable. So, if you do have a workload that's an excellent match | for FPGAs, and if that workload will have lots of long-term | volume, you should make an ASIC for it. | | So, for FPGAs to be the next big thing in HPC, you'd need to | find a class of workloads that benefit from the FPGA | architecture, for long enough and with high enough volume to be | worth the work to move over, and are also unstable or low-volume | enough that it's not worth making them their own chip. | cbzoiav wrote: | That's not entirely true - the flexibility can have its own | value. Unlike an ASIC, you can handle multiple workloads or | update flows. | | For example, timing protocols on backbone equipment handling | 100-400 Gbps. Depending on how it's configured, you may need to | do different things. Additionally, you probably don't want to | replace six-figure hardware every generation. | | Another example is test equipment where you can't run the | tests in parallel. A single piece of hardware can be far more | portable / cost-effective. | borramakot wrote: | I may not have said it well, but I broadly agree with you. | If a workload needs high performance but not consistently | (e.g. because you're doing serial tests by swapping | bitstreams), predictably (e.g. because you need flexibility | for network stuff you can't predict at design time), or | with enough volume (e.g. costs in the low millions are | prohibitive), an ASIC isn't the right solution. | | But my point is that for FPGAs to come to prominence as a | major computation paradigm, it probably won't be because they | outperform GPUs on one really big workload like bitcoin or | genetic analysis or something. It'll have to be a | moderately large number of medium-scale workloads. | kyboren wrote: | GPUs work great for accelerating many applications, and it's | true that this reduces interest in FPGAs.
For applications that | map well to GPUs, you're absolutely correct that the higher | clock speeds (and greater effective logic area) make GPUs | superior as accelerators. | | However, some applications do not map well to GPUs. | Particularly those applications with a great deal of bit-level | parallelism can achieve enormous speedups with bespoke | hardware. For those applications where it doesn't make sense to | tape out an ASIC, FPGAs are beautiful--even if they only | operate at a few hundred MHz. | | I think the "programming model" _is_ actually the biggest | barrier to wider adoption. Your comment is suffused with what I | believe is the source of this disagreement: the idea that one | _programs_ an FPGA. One _designs hardware_ that is implemented | on an FPGA. The difference may sound pedantic, but it really is | not. There is a massive difference between software | programming and hardware design, and hardware design is | downright unnatural for software developers. They are | completely different skill sets. | | On top of that, add all the headaches that come with | implementing a physical device with physical constraints (the | article complains about P&R times but this is far from the only | burden) and it becomes clear that FPGAs are quite frankly a | massive pain in the ass compared to software running on CPUs or | GPUs. | exmadscientist wrote: | Very much this. | | (Also, in general, FPGA tools are just some of the lowest | quality garbage out there... and that is saying something. | They're _that bad_. This is a completely unnecessary | speedbump.) | | The rebuttal to your objection is always tools like "HLS" | (High-Level Synthesis), or in English "C to HDL". (FPGAs | are 'programmed' in the two Hardware Description Languages, | VHDL (bad) or Verilog (worse, but manageable if you learn | VHDL first).) These are not programming languages, they are | hardware description languages.
That means things like | "everything in a block always executes in parallel". (Take | that, Erlang?) In fact, everything on the chip always | executes in parallel, all the time, no exceptions; you "just" | select which output is valid. That's because this is how | hardware works. | | This model maps very, very poorly to traditional programming | languages. This makes FPGAs hard to learn for engineers and | hard to target for HLS tools. The tools can give you decent | enough output to meet low- to mid-performance needs, but if | you need high performance -- and if not, why are you going | through this masochism? -- you're going to need to write some | HDL yourself, which is hard and makes you use the industry's | worst tools. | | Thus, FPGAs languish. | panpanna wrote: | > FPGA tools are just some of the lowest quality garbage | out there | | I think things are about to change thanks to yosys and | other open source tools. | | > VHDL (bad) or Verilog (worse, | | VHDL (and its software counterpart Ada) are very well | thought out and great to use once you get to know them (and | understand why they are the way they are). Yeah, they are a | bit verbose, but I prefer a strong base to syntactic sugar. | adwn wrote: | > _VHDL (and its software counterpart Ada) are very well | thought out and great to use once you get to know them (and | understand why they are the way they are). Yeah, they are | a bit verbose, but I prefer a strong base to syntactic | sugar._ | | As a professional FPGA developer: VHDL (and Verilog even | more so) are _bad_ [1] at what they're used for today: | implementing and verifying digital hardware designs. In | fact, they're at most moderately tolerable at what they | were originally intended for: _describing hardware_. | | [1] They're not completely terrible - a completely | terrible idea would be to start with C and try to bend it | so that you can design FPGAs with it...
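The "everything on the chip always executes in parallel" model described above can be sketched in ordinary software: every logic element computes its next value from a snapshot of the current register state, and all registers then update at once on the clock edge. A minimal Python illustration (the 3-bit counter here is an invented example, not something from the thread):

```python
# Toy model of synchronous hardware: all logic evaluates "simultaneously"
# from the current register state, then every register updates at once.

def clock_edge(state):
    """One clock cycle of a 3-bit counter. Each next-value is computed
    from the *old* state only -- the order of these lines doesn't matter,
    just as physical gates all switch in parallel."""
    nxt = {
        "b0": 1 - state["b0"],                           # toggles every cycle
        "b1": state["b1"] ^ state["b0"],                 # toggles when b0 carries
        "b2": state["b2"] ^ (state["b1"] & state["b0"]),
    }
    return nxt

state = {"b0": 0, "b1": 0, "b2": 0}
for _ in range(5):
    state = clock_edge(state)

# After 5 cycles the counter reads 5 (0b101).
value = state["b2"] * 4 + state["b1"] * 2 + state["b0"]
print(value)  # prints 5
```

Note that in real HDL you never write this update loop: the tools infer it from the structure of the design, which is exactly why sequential-software intuition fails.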
| roastsquirrel wrote: | Parts of VHDL leave a little to be desired, but overall I | find it to be a really great language. To the extent that I | bought Ada 2012 by John Barnes, and I kind of like that | too after coding in C/C++ etc., but maybe I'm now biased | after many years of VHDL coding :) It's not uncommon to | see "VHDL is bad" and the like, and I do wonder what the | reasons are for those comments. | kyboren wrote: | > The rebuttal to your objection is always tools like "HLS" | | Yup. I know HLS has gotten a lot better recently, but my | impression is that, somewhat like fusion, HLS as a | first-class design paradigm is always a decade away. | | > FPGA tools are just some of the lowest quality garbage | out there | | Absolutely. I think the problem is vendors see FPGA tooling | as a cost center and a necessary evil in order to use their | real products, the chips themselves. Users are also highly | technical and traditionally have no alternative, so | (mostly) working but poor-quality software is simply pushed | out the door. "They'll figure it out". | | Finally, to expand on the difficulties imposed by physical | constraints, I think another huge blocker to wide adoption | is that FPGAs are physically incompatible. I cannot take a | bitstream compiled for one FPGA and program it to any other | FPGA. Hell, I can't even take a bitstream compiled for one | FPGA and use that bitstream for any other device _in the | same device family_. Without some kind of standardized | portability, FPGAs will remain niche devices used only for | very specific applications. | vzidex wrote: | > I think the problem is vendors see FPGA tooling as a | cost center and a necessary evil | | Yes, to a degree, but another part of the problem is the | "physical constraints" you mention. FPGA tooling has to | solve multiple hard problems, on the fly, at large scale | (some of the latest chips are edging up to 10M logic | elements).
Unfortunately for the FPGA industry, I think | that this is unavoidable - though a lot of interesting | work is being done around partial reconfiguration, which | should allow users to work with smaller designs on a | large chip. | kyboren wrote: | Well, that's an explanation for why FPGA compilation | flows take so much time, but it's not a good explanation | for why the software is so crap. | | I think partial reconfiguration is really sexy, but it's | been around for a long time. What's new and exciting | there? Genuinely curious. | phkahler wrote: | Not sure why they think chip details and bitstreams need | to be kept secret. If they would open up, people would | make better tools for them. | s_gourichon wrote: | > cannot take a bitstream compiled for one FPGA and | program it to any other FPGA. | | Like dumping memory content on a PC, reinjecting it on | another with a different RAM layout and devices, and | complaining the OS and programs can't continue running? | Is that a sane expectation? | | There are upstream formats targeting FPGAs that can be | shared, although yes, redoing place and route is slow. | | Should manufacturers provide new formats closer to final | form that would still allow binaries to be adjusted, kind | of like .a, .so, or even LLVM? | | Alternatively, would building whole images for many | families of FPGA make sense? Feels like programs | distributed as binaries for p OS variants times q | hardware architectures, each producing a different | binary... random example: | https://github.com/krallin/tini/releases/tag/v0.19.0 has | 114 assets. | kyboren wrote: | > Like dumping memory content on a PC, reinjecting it on | another with a different RAM layout and devices, and | complaining the OS and programs can't continue running? | Is that a sane expectation?
| | Even worse: it's more like that plus extracting the raw | microarchitectural state of a CPU, serializing it in a | somewhat arbitrary way, trying to shove that blob into a | different CPU, and still expecting everything to continue | running. | | I'm not necessarily complaining, just pointing out this | significant difference WRT software programs running on | CPUs. | | > There are upstream formats targeting FPGAs that can be | shared, although yes, redoing place and route is slow. | | Can you show me an example? I'd like to see this. You do | not mean FPGA overlays, correct? | | > Should manufacturers provide new formats closer to | final form that would still allow binaries to be adjusted, | kind of like .a, .so, or even LLVM? | | Like you say, at the very least you will need to re-do | place and route. But actually the problem is much worse | than this. Different FPGAs have different physical | resources. Not just differing amounts of logic area, but | different amounts of block RAM, different DSP blocks in | varying numbers, high-speed transceivers, etc. This | necessitates making different design trade-offs. Simply | shoehorning the same design into different FPGAs, even if | it were kind of possible, would not work well. | | > Alternatively, would building whole images for many | families of FPGA make sense? | | Currently I think that's the only real option. But the | extreme overhead, duplication of effort and maintenance | burden make it very unattractive. | | My napkin sketch is some sort of generalized array of | partial reconfiguration regions with standardized | resources in each region. Accelerator applications can | distribute versions targeting different numbers of | regions (e.g. one version for FPGAs supporting up to 8 | regions, one for FPGAs supporting up to 16 regions, | etc.). The FPGA gets loaded with a bitstream supporting a | PCIe endpoint and management engine, and some sort of | crossbar between regions.
At accelerator load time, | previously mapped, placed, and routed logical regions | used in the application are placed onto actual partial | reconfiguration regions, and connections between regions | are routed appropriately. The idea is to pre-compute as | much of the work as possible, leaving a lower-dimensional | problem to solve for final implementation. Timing closure | and clock management are left as exercises for the reader | :P. | ianhowson wrote: | > bitstream ... Is that a sane expectation? | | No. Bitstream formats are not in any way compatible | across devices. Because timing is a factor, even if you | had the same physical layout of LUTs and routing, it's | unlikely that your design would work. | | (From parent) | | > use that bitstream for any other device in the same | device family | | Not at the bitstream level. _However_, you can take a | placed-and-routed chunk of logic and treat it as a unit. You | can replicate it (without repeating P&R), move it around, | copy it onto other devices _in the same family_. This is | super useful, as most FPGA applications have large | repeating structures, but P&R doesn't know that it's a | factorable unit. It'll repeat P&R for each instance and | you'll get unpredictable timing characteristics. | | > Should manufacturers provide new formats closer to | final form that would still allow binaries to be adjusted, | kind of like .a, .so, or even LLVM? | | > would building whole images for many families of FPGA | make sense | | You can license libraries that are a P&R'd blob and drop | them into your design. There's no easy way to make this | generalizable across devices without shipping the | original RTL, and conversion from RTL->bitstream is where | most of the pain lies. | qppo wrote: | > HLS as a first-class design paradigm is always a decade | away. | | What about Chisel? | henrikeh wrote: | Chisel is not an HLS tool. Chisel is much closer to VHDL and | Verilog, since the hardware is directly described.
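The distinction drawn here -- Chisel directly describes hardware via a host-language program that elaborates a circuit, rather than compiling software into hardware the way HLS does -- can be caricatured in a few lines of Python. This is a hypothetical generator, not Chisel's actual API; the gate set and netlist format are invented for illustration:

```python
# Sketch of the generator style Chisel popularized: a host-language
# program *elaborates* a circuit (here, a netlist of gate tuples).
# The host language runs at build time; only the netlist is "hardware".

def full_adder(net, a, b, cin, prefix):
    # Append the five gates of a full adder; return (sum, carry-out) wires.
    s1, s2, c1, c2 = (f"{prefix}_{n}" for n in ("s1", "sum", "c1", "c2"))
    net += [("xor", s1, a, b), ("xor", s2, s1, cin),
            ("and", c1, a, b), ("and", c2, s1, cin)]
    cout = f"{prefix}_cout"
    net.append(("or", cout, c1, c2))
    return s2, cout

def build_adder(width):
    # Elaborate an n-bit ripple-carry adder out of full-adder cells.
    net, carry, sums = [], "zero", []
    for i in range(width):
        s, carry = full_adder(net, f"a{i}", f"b{i}", carry, f"fa{i}")
        sums.append(s)
    return net, sums, carry

def evaluate(net, wires):
    # The netlist was emitted in dependency order, so one pass suffices.
    ops = {"xor": lambda x, y: x ^ y, "and": lambda x, y: x & y,
           "or": lambda x, y: x | y}
    for op, out, x, y in net:
        wires[out] = ops[op](wires[x], wires[y])
    return wires

net, sums, cout = build_adder(4)
wires = {"zero": 0}
for i in range(4):                     # drive the inputs: 9 + 5
    wires[f"a{i}"] = (9 >> i) & 1
    wires[f"b{i}"] = (5 >> i) & 1
wires = evaluate(net, wires)
total = sum(wires[s] << i for i, s in enumerate(sums)) + (wires[cout] << 4)
print(total)  # prints 14
```

The point of the caricature: the Python here never becomes hardware; it is a program that *writes* the circuit, which is why this style is parametrizable in ways plain Verilog is not, yet is still not HLS.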
| qppo wrote: | Chisel would allow me to write, say, a codec algorithm | and compile it into hardware, correct? As well as specify | the hardware that is necessary to describe it? | | I'm a casual in that space, but I thought Chisel was an | HDL that could be used to support HLS. | jcranmer wrote: | Again, a counterpoint: | | I worked on hardware for something akin to an FPGA at a much | coarser granularity (kind of like coarse-grained | reconfigurable arrays)--close enough that you have to adapt | tools like place-and-route to compile to the hardware. The | programming for this was mostly done in pretty vanilla | C++, with some extra intrinsics thrown in. This C++ was | close enough to hand-coded performance that many people | didn't even bother trying to tune their applications by | resorting to hand-coding in the assembly-ish syntax. | | This helped bolster my opinion that FPGAs aren't really the | answer that most people are looking for, and that there are | useful nearby technologies that can leverage the benefits | of FPGAs while having programming models that are on par | with (say) GPGPU. | kyboren wrote: | For sure. FPGAs are probably not the answer that most | people are looking for. FPGAs are but one point in the | trade-off space, and they're not one you jump to "just | because". | | > [...] there are useful nearby technologies that can | leverage the benefits of FPGAs while having programming | models that are on par with (say) GPGPU | | I think CGRAs are really cool, but they're even more | niche, and I suspect your original point about GPUs | eating everyone's lunch applies particularly strongly to | CGRAs. The point is well taken, though, and I don't | necessarily disagree. | Stubb wrote: | Another big advantage of FPGAs is low latency and the ability | to hit precise timing deadlines. When working with radio | hardware, you still need an FPGA for automatic gain control | calculations and recording/playing out samples.
Similarly, | you need to do your CRC and other calculations in an FPGA if | you need to immediately respond to incoming signals, such as | the RTS->CTS->DATA->ACK exchange in 802.11. | daxfohl wrote: | I think that's _the_ big advantage of FPGAs. If you need | acceleration to hit a 10-microsecond latency target, an FPGA | is what you need. If your latency target is a | millisecond or longer, then a GPU can handle a lot more | throughput. But a GPU can't typically give you a 10-us | guarantee. | | Okay, bit-banging is another advantage of FPGAs that GPUs | don't handle as well. There are a few things. | rthomas6 wrote: | Take a look at Vitis. Xilinx is aware of this problem and is | seeking to capture the market of people who want magic | programming solutions to speed up existing software. Who knows | if it will be successful, but they are trying more than ever to | make FPGAs usable without having to know hardware | design and verification. | mdiesel wrote: | I work with FPGAs, but from LabVIEW. NI has put some effort | into making the same language work for everything including | FPGAs, and a graphical language is great for this kind of work. | | It's so easy that it's quite common to see people pass off work | onto the FPGA if it involves some slightly heavier data | processing, which is exactly how it should be. | tyingq wrote: | There is another traditional FPGA use case where you need | real-time data capture or signal generation. That seems to be | getting eaten from the bottom now that there are really | high-speed MCUs that are easier to program. It's less efficient, but | easier to develop for. | rogerbinns wrote: | A good example of this is XMOS. Their chips are divided into | "tiles" which can simultaneously run code, together with | multiple interfaces such as USB, I2S, I2C, and GPIO. Latency | is very deterministic because the tiles are not using caches, | interrupts, shared buses, etc.
| | Their development environment is Eclipse-based, with numerous | libraries such as audio processing, interface management, DFU, | etc. They use a variant of C (XC) that lets you send data | between channels/tiles and easily parallelize processing. | | An example use is in voice assistants, where multiple | microphones need to be analyzed simultaneously, echo and | background noise have to be eliminated, and the speaker | isolated into a single audio stream. I've used it for an | audio processing product that needed to match hardware timers | exactly, provide USB access, offer matched input and output, etc. | exmadscientist wrote: | The other problem with using an FPGA here is that | microcontrollers are cheap and have great cheap dev boards. | FPGAs, not so much. I've wanted to just "drop in" a small | FPGA in several designs, the way you can drop in a | microcontroller, but there's no available FPGA that's not a | massive headache in that use case. Trust me, I've looked. | | The iCE40 series is _almost_ there, but not quite. It's a bit | pricey (this is sometimes okay, sometimes a dealbreaker), but | its care and feeding is too annoying. Who wants to source a | separate configuration memory? Sometimes I don't have the | space for that crap. | | If any company can bring a small, cheap, low-power FPGA to | the market, preferably with onboard non-volatile | configuration memory, a microcontroller-like peripheral mix | (UART, I2C, SPI, etc.), easy configuration (re)loading, and | good tool and dev board support, they'll sell a lot of | units. They don't even have to be fast! | jburgess777 wrote: | Gowin might just fill this niche. They are working with | yosys on open source support as well.
| | https://www.gowinsemi.com/en/product/detail/2/ | | http://www.clifford.at/yosys/cmd_synth_gowin.html | panpanna wrote: | I think Cypress had a product line that combined a CPU and | a small programmable array, just big enough to implement | your own custom IO and protocols and maybe some minimal | logic beyond that. | | Maybe that's what most hobbyists need? | tyingq wrote: | In the hobbyist space, I also see a fair amount of CPLDs | used when something like a GAL | (https://en.m.wikipedia.org/wiki/Generic_array_logic) | would be much cheaper and easier. Doesn't work for | everything, but they can be handy. | gvb wrote: | The MiniZED is $89 and a ton of fun! It has an ARM | processor (Xilinx Zynq XC7Z007S SoC), Arduino-compatible | daughterboard connectors, a microcontroller-like peripheral | mix, and runs Linux. | | http://zedboard.org/product/minized | | https://www.avnet.com/shop/us/products/avnet-engineering-ser... | | Oh, and Vivado (the FPGA development IDE) is free (as in | beer) for that FPGA as well as Xilinx's other mid- to low-end | FPGAs. | cmrdporcupine wrote: | See, it's funny: I (a software guy) have recently started doing a | bunch of FPGA stuff on the side for "fun", and I find the | programming model to not be the biggest challenge. | | The tools, yes, because it seems like hardware engineers have a | fetish for all-encompassing, painful, vendor-specific IDEs with | half the features that we software developers have, and with a | crapload of vendor lock-in... but I digress. | | I find working in Verilog to be pretty pleasant. Yes, I can see | that with sufficient complexity it wouldn't scale out well. But | SystemVerilog does give you some pretty good tools for managing | modularity. | | On the other hand, I've never particularly enjoyed working with | GPUs, CUDA, etc. | | So I would agree with your statement that the structural issues | prevent their utility in wider market classes -- and those | really are as you say ...
lower clock speeds, cost, but also | vendor tooling. | | FPGAs could really do with GCC/LLVM-style open, universal, | modular tooling. I use fusesoc, which is about as close to that | as I will get (a declarative build that generates the Vivado | project behind the scenes), but it's still not perfect. | daxfohl wrote: | Agreed. I never thought the mental leap to Verilog was a big | hurdle. It's just C-like syntax with some new constructs | around signaling and parallelism. I found this interesting | rather than foreboding. | | The main challenge I had was compilation time. It can | sometimes take overnight to compile a simple application if | there's a lot of nested looping, only to have it run out of | gates. This can be a royal pain. | | I'd expect most HPC scenarios would have lots of nested | looping, and probably memory accesses, and thus have to spend | a lot of time writing state machines to get around gate-count | limitations and wait for memory responses, at which point | you're basically designing a 200 MHz CPU. | | So I don't see it as being very useful for general-purpose | acceleration, but it could be a good CPU offload for some very | specific use cases that are more bit-banging than computing. | Azure accelerates all its networking via FPGA, which seems | like the ideal use case. | jjoonathan wrote: | I don't mean to belittle your exploration, but are you sure | it's an apples-to-apples comparison? This suggests to me that | it isn't: | | > it seems like hardware engineers have a fetish for | all-encompassing, painful, vendor-specific IDEs | | Hardware engineers feel pain just like you do. The reason | they put up with those awful software suites is that they | have features they need that aren't available elsewhere. In | particular, they interface with IP blocks and hard blocks, | including at a debug + simulation level.
Those tend to evolve | quickly, and last time I looked -- which admittedly was a | while ago -- the open source FPGA tooling pretty much | completely ignored them, even though they're critical to | commercial development. | | If you are content to live without gigabit transceivers, PCIe | controllers, DRAM controllers, embedded ARM cores, and so on, | I suspect it would be relatively easy to use the open source | tooling, but you would only be able to address a small | fraction of FPGA applications. | tieze wrote: | LLVM folks have actually just started on such tooling: CIRCT. | With Chris Lattner at the helm, and industry players like | Xilinx and Intel seemingly on board. | d_silin wrote: | I wonder if it is possible to add a (small) FPGA to a personal | computer that could accelerate specific software tasks | (video/audio encoding, ML algorithms, compression, extra FPU | capabilities) _on user demand_. | geforce wrote: | IIRC some CPUs of the Intel Atom series already have an | embedded FPGA. | duskwuff wrote: | Intel has launched a couple of Xeon Gold CPUs (like a variant | of the 6138P) with integrated FPGAs for specific markets. | Nothing mass-market, though, and they don't seem to have | caught on much. | sod wrote: | Isn't that what Apple did with the Afterburner card for the | Mac Pro? I read in | https://www.anandtech.com/show/15646/apple-now-offering-stan... | that that card is an FPGA. | | I could imagine that Apple will include something like this in | their Apple Silicon SoC for ARM Macs. | | The Afterburner card is not user-programmable, but maybe it | will be in the future, and this was just a first try to get the | hardware into the field. | tails4e wrote: | There absolutely is. There are PCIe cards you can plug in and | use as accelerators, just like you would use a GPU. Of | course, programming them to do the task you want is harder, but | they can do anything.
Saw a great example where someone | implemented memcached on a single FPGA plug-in card and replaced | many Xeons with it. | jeffreyrogers wrote: | The problem with this will be the overhead of transferring data | to/from the FPGA, which, once accounted for, often causes doing | the computation on the CPU to make more sense. It's obviously | not a show-stopper -- GPUs have the same problem but are still | useful -- yet it's hard to find a workload that maps well | to this solution. | derefr wrote: | In a DAW, accelerating a heavy VST plugin might make sense. | But often those are amenable to being translated to GPGPU | code already. | | I guess the one place where GPGPU-based solutions _wouldn't_ | work is when the code you want to accelerate is necessarily | acting as some kind of Turing machine (i.e. emulation of | some other architecture). However, I can't think of a | situation where an FPGA programmed with the netlist for arch | A, running alongside a CPU running arch B, would make more | sense than just getting the arch-B CPU to emulate arch A; | unless, perhaps, the instructions in arch A are _very, very | CISC_, perhaps with analogue components (e.g. RF logic, like | a cellular baseband modem). | not2b wrote: | This is normally handled in emulation by putting the inner | parts of the testbench (the transactors) onto the FPGA as | well, to minimize the amount of data that has to be | transferred between the CPU and the FPGA. If the FPGA is to | be used as a peripheral, again a division of labor needs to | be found that minimizes the amount of data that needs to be | communicated. But if there is FPGA logic on the same chip as | the CPU cores, the overhead can be greatly reduced, and we're | seeing more of that now. | deelowe wrote: | I assumed this was kind of Intel's plan when they purchased | Altera.
I think the issue with this is the amount of time it takes | to load the bitstream, but I thought I saw some things recently | where progress was being made on this front. | vzidex wrote: | > issue with this is the amount of time it takes to load the | bitstream, but I thought I saw some things recently where | progress was being made on this front | | You saw correctly, work is indeed being done to build | "shells" that can accept workloads without the user having to | go through the FPGA tooling/build process. | daxfohl wrote: | It's been possible for a long time, but there are big | challenges to adoption. Every FPGA is different and the image | is tightly coupled to the chip, so you'd have to compile the | algorithm specifically for your chip before loading, which can | take hours. Then loading the image each time you change out | accelerators for a different application can take minutes. Then | the software that uses the accelerator would have to know which | chip and which image you're running and send data to it | accordingly. Then you have to remember that FPGAs aren't | really that great as accelerators sometimes, since they run at | such low clock speeds, have crummy memory interfaces, limited | gate support for floating point or even integer multiplication, | etc. CPUs commonly outperform them even at the things they're | supposed to be good at. | | So it's unlikely ever to gain broad acceptance, because the | software vendors would have to support such a high number of | permutations and the return can be questionable. This is why | you see far more accelerators based on ASICs that have higher | clock speeds and baked-in circuitry for specific tasks, with | standardized APIs. | | But sure, there's nothing preventing you from buying an FPGA | board, hooking it up to your PC, creating a few images that do | the accelerations you want, and writing software that uses | them, swapping the image in when your program loads.
You could | even write a smart driver that swaps the image in only if it's not | in use by another app, or whatever. It's just unlikely you'll | ever find a bunch of third-party software that supports it. | rustybolt wrote: | Yes, and it has been done. There are FPGAs that you can | connect to over PCIe, and you only have to pay the small price | of writing an FPGA implementation for your use case. It usually | takes just a couple of weeks (OK, maybe months). | SomeoneFromCA wrote: | You might actually go even faster than PCIe by pretending | to be a DDR4 memory stick. | [deleted] | Koshkin wrote: | I wonder what would be the advantages of using an FPGA to _test_ | a CPU design - compared to relying on a (presumably more | accurate) computer-based simulation. (I understand the reasons | one might want to _implement_ a CPU in an FPGA.) | dbcurtis wrote: | This idea is more than 30 years old. It has been done, and once | upon a time companies were built around this idea. | | First off, mapping an entire CPU to an FPGA cluster is a design | challenge in itself. Assuming you can build an FPGA cluster large | enough to hold your CPU, and reliable enough to get work done | on it, you have the problem of partitioning your design across | the FPGAs. Second problem: observability. In a simulator, you | can probe anywhere trivially; with an FPGA cluster, you must | route the probed signal to something you can observe. (I am not | even going to talk about getting stimulus in and results out, | since with FPGA or simulator, either way you have that problem, | it is just different mechanics.) | | The big problem is that an FPGA models each signal with two | states: 1 and 0. A logic simulator can use more states, in | particular U or "unknown". All latches should come up U, and | getting out of reset (a non-trivial problem), to grossly | oversimplify, is "chasing the U's away". An FPGA model could, | in theory, model signals with more than two states.
The model | size will grow quickly, though. | | Source: Once upon a time I was pre-silicon validation manager | for a CPU you have heard of, and maybe used. Once upon a time I | was architect of a hardware-implemented logic simulator that | used 192 states (not 2) to model the various vagaries of | wired-net resolution. Once upon a time I watched several | cube-neighbors wrestle with the FPGA model of another CPU you have | heard of, and maybe used. | | Note: What would 3-state truth tables look like, with states | 0,1,U? 0 and 1 is 0. 0 and U is 0. 1 and U is U -- etc. You can | work out the rest with that hint, I think. | | Edit to add: Why are U's important? They uncover a large class | of reset bugs and bus-clash bugs. I once worked on a mainframe | CPU where we simulated the design using a two-state simulator. | Most of the bugs in bring-up were getting out of reset. Once we | could do load-add-store-jump, the rest just mostly worked. | Reset bugs suck. | jacquesm wrote: | > Reset bugs suck. | | Indeed they do. And even if you have working chips you get | the next stage: board-level reset bugs. An MC68K board I | helped develop didn't want to boot; some nasty side effect of | a reset line that didn't stay at the same level long enough | stopped the CPU from resetting reliably when everything else | did just fine. That took a while to debug. | rcxdude wrote: | Because it's substantially faster. Simulating a large CPU | design in software is slow and it doesn't parallelise well, so | your tests will take a lot longer (and these aren't fast even | with FPGA acceleration: runtimes can be days or weeks if you're | running a large fraction of the design for even a tiny amount | of time in the simulation). | NotCamelCase wrote: | SW-based simulation is mostly about functional correctness and | robustness of an implementation.
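The 0/1/U truth tables hinted at above are small enough to write out in code. A toy sketch (the function names are made up for illustration; as noted above, real hardware-implemented simulators track many more states):

```python
# Three-valued logic: 0, 1, and 'U' (unknown -- what a latch holds
# before reset has resolved it).

def and3(a, b):
    """AND: a known 0 forces the output to 0; otherwise U is contagious."""
    if a == 0 or b == 0:
        return 0
    if a == 'U' or b == 'U':
        return 'U'
    return 1

def or3(a, b):
    """OR: a known 1 forces the output to 1; otherwise U is contagious."""
    if a == 1 or b == 1:
        return 1
    if a == 'U' or b == 'U':
        return 'U'
    return 0

# Matches the hint above: 0 and U is 0, 1 and U is U.
assert and3(0, 'U') == 0
assert and3(1, 'U') == 'U'
```

A two-state FPGA model has to collapse every U to a concrete 0 or 1, which is exactly how the reset and bus-clash bugs described above slip through.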
Even with cycle-accurate | simulations, there is a lot of timing- and performance-related | data you can't just extrapolate from simulation results. And | that's where emulating CPU/GPU/ASIC designs generally helps | the most. | bsder wrote: | The problem that FPGAs have is that they are only good for | low-volume solutions that require flexibility and have no power | constraints. | | That's a really narrow market. Telecom equipment and lab | equipment, basically. | | If I need volume, I need at least an ASIC. If I need to manage | power, I need a full custom design. | wwarner wrote: | This is really interesting. If a CPU hardware vulnerability like | Spectre could be repaired by patching an FPGA on the SoC, that | would be incredible. That type of functionality would overtake | the entire cloud market in about 3 days. | rustybolt wrote: | Amazon already has FPGAs in the cloud: | https://aws.amazon.com/ec2/instance-types/f1/ | | I don't think they are very popular though. Maybe they are used | sometimes for machine learning? | jeffreyrogers wrote: | FPGAs are too slow for that. I think you can get the clock rate | up to about 600 MHz, but that is only for very small portions of | the chip. Otherwise you run into timing issues. The clock speed | for most of the chip will be significantly lower. | rcxdude wrote: | Yup. If you just want a CPU, use a CPU. An FPGA is a terrible | substitute, and generally you only want to embed a CPU on | one if you are either developing a CPU or you want a not | very fast CPU as an add-on to a design which is already using | an FPGA (and generally for this nowadays the vendors make | FPGAs with a CPU on the same die, because it's so common and | frees up quite a lot of the FPGA fabric and power budget). | glitchc wrote: | It would also open up new attack vectors. | thehappypm wrote: | That's the real nightmare. Now all of a sudden, you can | program the CPU itself if you can access the update | mechanism.
CPUs being non-programmable is a feature as well | as a bug. | deelowe wrote: | CPUs are already "programmable" via microcode updates. | pjmlp wrote: | And have been for ages; that was one of the themes | in the RISC vs. CISC debate. | gtsteve wrote: | Microcode is loaded when the OS starts though, right? At | the very least it's not persistent. | deelowe wrote: | BIOS or OS | cwzwarich wrote: | Pretty much every new non-x86 CPU doesn't have updatable | microcode, so that's a very x86-centric problem. | rwmj wrote: | I'm afraid it doesn't work like this. That would only be | possible if the chip was using an FPGA fabric for the relevant | parts of the design. For example, if the L1 cache was | implemented as an FPGA you could in theory patch around L1TF. | But they wouldn't do that because it would be far slower/larger | than implementing it directly as an ASIC. | | Or you might imagine a chip that has an FPGA on the side (I | expected Intel would ship this after acquiring Altera, but it | never happened). But the FPGA would somehow have to have access | to the paths that caused the vulnerability, which is highly | unlikely, and would also be really slow compared to what they | actually do, which is hacking around it with microcode changes. | duskwuff wrote: | > Or you might imagine a chip that has an FPGA on the side (I | expected Intel would ship this after acquiring Altera, but it | never happened). | | They did: https://www.anandtech.com/show/12773/intel-shows- | xeon-scalab... | | But I get the sense this part was aimed at a few very | specific customers. It required some PCB-level power delivery | changes, so you couldn't even drop it into a standard server | motherboard. | lnsru wrote: | I am working right now on a bare-metal WebSocket implementation | on Xilinx 7 Series FPGAs. Currently it's a Zynq SoC, but the final | product will probably have a Kintex-7 inside, so no Linux. The tools | make me cry: no examples, application notes from 2014 with ancient | libraries.
I hope vendors will fix the tooling. But I see Xilinx | has released Vitis, so their focus is elsewhere; no interest in | the old crap. Using Git with Vivado is already enough pain. So I keep | my text sources in Git and complete zipped projects as releases. | Ouch! | tails4e wrote: | I posted this elsewhere; there are a lot of good resources and | examples for the tools: | | https://github.com/xupgit/FPGA-Design-Flow-using-Vivado/tree... | | https://www.xilinx.com/support/university.html | | https://www.xilinx.com/video/hardware/getting-started-with-t... | | There are others that cover the SDK side of things, but the HW | side/Vivado is well documented. | mindentropy wrote: | Have you looked at open source solutions? Tim Ansell is | managing some great projects on open source solutions. Check | out SymbiFlow, LiteX, Yosys, etc. | lnsru wrote: | Are these mature already? It took some time for KiCad to get | to its current usable state and I don't want to be an early | adopter. In fact, I want to have my private hardware MVP next year | with current tools. On the other hand, I can't imagine my | slacker colleagues using anything other than Vivado. Learning | Vivado for them was already mission impossible. | IshKebab wrote: | I wouldn't say KiCad is usable yet. I've made multiple | attempts to use it and it is just fundamentally | user-hostile. Unfortunately the devs see any attempt to improve | user friendliness as "dumbing down". | | Fortunately there is (finally!) an open source PCB design | program that doesn't suck: Horizon EDA. I've only made one | PCB with it but honestly it was pretty great and the author | fixed every usability bug I reported in a matter of hours, | which is an insane difference from KiCad's "you're holding | it wrong". | | The only thing I don't like about it is that it has an | unnecessarily powerful and confusing component system | (there are modules, entities, gates, etc.). But really it | is the best by far.
| | Anyway, on FPGAs, I think the tools are only vaguely mature | for iCE40, and even then you basically need to already be an | expert, unfortunately. | beefok wrote: | I feel you completely. The Vivado IDE/toolchain is absolutely | atrocious and the designers should be shamed for the horrifying | bloatware they push as the STANDARD. Sometimes I have better | luck doing everything in Tcl/command line there. | tails4e wrote: | Vivado is amazing compared with the ASIC counterparts: Design | Compiler is for RTL synthesis only, and you need years of | experience to get any decent QoR (quality of results) out of it. | In ASIC land you have separate tools for every step: synthesis, | STA, PnR, simulation, floor planning, power analysis, etc. | Vivado does all that in one seamless tool, and allows you to | cross-probe from a routed net right back to the RTL code it came | from. Try doing that with ASIC tools. So to me it's a matter of | perspective: once you understand how difficult the problem of | hardware design is to solve, and what some of the existing de | facto industry standard tools are like (for ASIC), you come | to appreciate Vivado for just how well it brings all of these | complex facets together. Of course, if you come from a SW | background you may think Vivado is terrible compared to | VS Code or some other IDE, but that's an unfair comparison. I | guess to reframe the question: show me a hardware design | environment that is better than Vivado. Also, I separate | Vivado from the Xilinx SDK, as they are different tools, and | Vivado is explicitly for the HW parts of the design. | jlokier wrote: | I added one small Verilog file to a Vivado project. | | It froze the IDE for _45 minutes_ before I could do | anything else. | | This was on a beefy machine at AWS too, not some cheap home | desktop thing. | | That wasn't compiling, no synthesis, P&R, nothing. | | There was no giant netlist I'd been working on either. Most | of the FPGA was empty.
| | That was literally just adding a small source file, which | the IDE auto-indexed so you could browse the contents. | | In Verilator, an open source Verilog simulator, that same | source file loaded, completed its simulation and checked | test results in less than a second. So it wasn't that hard | to compile and expand its contents. | | Vivado is excellent for some things. But the excellence is | not uniform, unfortunately. On that project, I had to do | most of the Verilog development outside Vivado because it | was vastly faster outside, only importing modules when they | were pretty much ready to use and behaviorally validated. | tails4e wrote: | That's definitely an anomaly. I use Vivado with ASIC code | regularly, on very large designs, and have not seen anything | like this. I use Vivado to elaborate and analyse code | intended for ASIC use, as it's better than other ASIC tools | for that purpose. Once I'm happy with it in Vivado, then | I push it through Design Compiler, etc. Elaborating a | design that takes 4 hours in DC synthesis is about 3 mins in | Vivado elaboration. | PanosJee wrote: | inaccel.com is making lots of steps to bring FPGAs to 2020: | Spark/k8s integration, abstraction of popular cores, Python | APIs, serverless deployments, etc. | rwmj wrote: | _> Intel, AMD, and many other companies use FPGAs to emulate | their chips before manufacturing them._ | | Really? I'm assuming if this is true it can only be for tiny | parts of the design, or they have some gigantic wafer-scale FPGA | that they're not telling anyone about :-) Anyway, I thought they | mainly used software emulation to verify their designs.
| | Also, there are prototyping boards specifically built for | emulation that integrate multiple FPGAs, although this does | introduce a partitioning problem that has to be solved either | manually or via dedicated emulator software. | jcranmer wrote: | The FPGA emulator for a chip I was working on involved an | entire rack of FPGAs... for a single core. | variaga wrote: | Of the half-dozen semiconductor-designing companies I've | worked for, _all_ of them used FPGAs for emulation. | | - modern FPGAs are huge. | | - when an ASIC design won't fit in a single FPGA, it's usually | possible to partition the design into multiple FPGAs | | - software emulation/simulation is not guaranteed to be "more | accurate". FPGAs can interact with a real-world environment in | ways that simulation simply cannot | | - simulations run 1000s of times slower than FPGAs. Months of | simulation time can be covered in minutes on the FPGA | | Edit: to be clear, they all use simulation too, but FPGAs are | used to accelerate the verification process | UncleOxidant wrote: | Having been involved with CPU emulation in the past, a couple of | comments: | | 1. It's not just a single FPGA but a large box full of them; | for example: | https://www.synopsys.com/verification/emulation/zebu-server.... | | 2. Software models are employed for parts of the system. (For | example, the southbridge and all the peripherals connected to | it are generally a software model which communicates with the | hardware-emulated portion in the FPGA via a PCIe model which is | partly in hardware and partly in software.) This saves a lot of | gates in the FPGA; those parts have already been well tested | anyway, so no need to put them into the hardware emulation. | formerly_proven wrote: | https://www.youtube.com/watch?v=650yVg9smfI | TomVDB wrote: | Many years ago, we had a custom-made board with 8 huge Xilinx | Virtex-5 FPGAs (the largest available at the time) to emulate a | large SoC.
Those FPGAs were something like $20K apiece. | | We had 10 such boards, good for millions of dollars in | hardware, and a small team to keep it running. | | These platforms were mostly used by the firmware team to develop | everything before real silicon came back. It could run the full | design at ~1 to 10 MHz, vs. 500+ MHz on silicon or 10 kHz in | simulation. | | After running for a while, that FPGA platform crashed on a case | where a FIFO in a memory controller overflowed. | | Our VP of engineering said that finding this one bug was | sufficient to justify the whole FPGA emulation investment. | mindentropy wrote: | Those multi-FPGA boards are generally from Dini Group, | right? Fantastic boards. | | Ref: https://www.dinigroup.com/web/index.php | iron2disulfide wrote: | There are a lot of companies who create multi-FPGA boards. | The market for FPGAs-for-ASIC-prototyping is substantial. | duskwuff wrote: | Dini's naming schemes are hilarious. They're all named like | monsters in B-movies -- their latest system, the DNVUF4A, | is called "Godzilla's Butcher on Steroids", for instance. | | Also, Dini got acquired by Synopsys a few years ago. | TomVDB wrote: | No, it was custom-made in-house for the purpose. | | Huge PCBs, ~2 ft by 2 ft. | jacquesm wrote: | Design verification is big business and your VP was exactly | right: a factor of 100 to 1000 speed increase would allow for | much more thorough testing and broader testing as well, for | instance hooked up to other hardware with reasonable fidelity | compared to the real thing. Still coarse, but a lot better | than nothing. Good call. It isn't rare at all to have a | respin if you don't do design verification. | | One of the nicer stories about the first ARM chip is that | they built a software simulator to verify the design and, as a | result, they found plenty of bugs in the hardware before | committing to silicon. The first delivered chips worked right | away.
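The rough clock rates quoted above (~10 kHz in simulation, ~1-10 MHz on the FPGA platform, 500+ MHz on silicon) translate directly into turnaround time. A quick back-of-envelope calculation using those figures (the 5 MHz value is simply the middle of the quoted FPGA range):

```python
# How long does it take to replay 1 second of silicon time at each
# verification rate? (Rates taken from the rough figures quoted above.)

SILICON_HZ = 500e6  # 500+ MHz target clock
FPGA_HZ = 5e6       # middle of the ~1 to 10 MHz FPGA-platform range
SIM_HZ = 10e3       # ~10 kHz software simulation

cycles = SILICON_HZ * 1.0            # cycles in 1 s of real silicon time
fpga_seconds = cycles / FPGA_HZ      # under 2 minutes on the FPGA platform
sim_hours = cycles / SIM_HZ / 3600   # over half a day in simulation

print(fpga_seconds, sim_hours)
```

That gap is why a bug like a FIFO overflowing only "after running for a while" is essentially unreachable in pure simulation but falls out of FPGA emulation.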
| retro_guy wrote: | Maybe you will find this article about Large-Scale | Field-Programmable Analog Arrays (FPAAs) interesting as well: | https://hasler.ece.gatech.edu/FPAA_IEEEXPlore_2020.pdf | justicezyx wrote: | FPGAs are good at _nothing_ at a scale that can challenge | non-configurable silicon... | | They are good at a lot of things at smaller scales, like general | prototyping/testing/simulation, telecom, special-purpose | real-time computing, etc. | | The behind-the-scenes logic is that FPGAs can never make things as | flexible as software. And flexible software always offsets the | inefficiency of non-configurable chips. Just comparing FPGAs | and CPUs/GPUs will never teach FPGA vendors this reality; or perhaps | they choose to ignore it after all... ___________________________________________________________________ (page generated 2020-07-23 23:00 UTC)