[HN Gopher] XLS: Accelerated HW Synthesis
___________________________________________________________________

XLS: Accelerated HW Synthesis

Author : victor82
Score : 104 points
Date : 2020-09-02 15:19 UTC (7 hours ago)

(HTM) web link (google.github.io)
(TXT) w3m dump (google.github.io)

| Traster wrote:
| >XLS is used inside of Google for generating feed-forward pipelines from "building block" routines
|
| For those that aren't familiar, control flow - i.e. anything that isn't a directed acyclic graph (DAG) - is the hard part of HLS. This looks like a fairly nice syntax compared to the bastardisations of C that Intel and Xilinx pursue for HLS, but I'm not sure this is bringing anything new to the table.
|
| As for the examples, I'm kind of flummoxed that they haven't given any details on what the examples synthesize to. For example, how many logic blocks does the CRC32 use? How many clock cycles? What about the throughput? I'm going to sound like a grumpy old man now, but it's important because it's very difficult to get performant code as a hardware engineer. Generally it involves having a fair idea of how the code is going to synthesize. What is damn near impossible is figuring out what you want to synthesize to, and then guessing the shibboleth that the compiler wants in order to produce that code. Given that they haven't tackled the difficult problems like control flow, folding, resource sharing, etc., it makes me hesitant to believe they've produced something phenomenal.
| aseipp wrote:
| The HLS tools from Xilinx and Intel (and maybe Cadence, I guess) can also actually compile your models as ordinary C++ code (i++ from Intel is literally just a fork of Clang, I think, and so are tools like LegUp), leading to their greatest benefit: simulations are way, way faster, and software compilers have vastly better iteration times than synthesizers.
|
| They seem to have a simulation framework for these tools that isn't just "re-use an existing simulator", and it apparently does use LLVM for codegen, but that's the easy part. Actual simulation performance numbers would be really interesting to see vs. actual RTL sims.
| learyg wrote:
| Hi, one of the collaborators here, thanks for the good points.
|
| We have been targeting some Lattice FPGAs for prototyping purposes, but we've mostly been doing designs for ASIC processes, which is why details are a little sparse for FPGAs you get off the shelf, but it's a priority for us to fill those in. We have some interactive demos that show FPGA synthesis stats (cell counts, generated Verilog, letting you toy with the pipeline frequency) and integrate with the [IR visualizer](https://google.github.io/xls/ir_visualization/#screenshot); we'll try to open source that as soon as possible. The OSS tools (SymbiFlow) that some of our colleagues collaborate on can do synthesis in just a few seconds, so it can feel pretty cool to see these things in near-real-time.
|
| We fold over resources in time with a sequential generator, but we still have a ways to go; we expect a bunch of problems will map nicely onto concurrent processes, which are Turing-complete and nice for the compiler to reason about.
|
| I'm a big believer that "phenomenal" is really effort and solving real-world pain points integrated over time -- it's a journey! We're intending to do blog posts as we hit big milestones, so keep an eye out!
| Traster wrote:
| Do you mind me asking what applications Google uses this for internally? Is this used in a flow that's ended up in production? Also, what are your thoughts on integrating optimized RTL blocks?
| learyg wrote:
| One of the things we have on our short list is "good FFI" for instantiating existing RTL blocks (and making their timing characteristics known to the compiler) and making import flows for Verilog/SystemVerilog types. The latter may be a bit your-Verilog-flow specific, but we think there are some universal components you can provide that folks can slot into their flows as appropriate.
|
| Being able to re-time pipelines without a rewrite is a useful capability. Although it's still experimental and we're actively building out the capabilities, we have it in real designs that have important datapaths.
| ampdepolymerase wrote:
| Reminds me of the old reconfigure.io, which used the ideas and syntax of Go's CSP and transformed them into async HDL code. Unfortunately the startup has been shuttered.
|
| http://docs.reconfigure.io/
| simonw wrote:
| XLS as an acronym for Accelerated HW Synthesis is a bit of a stretch!
| dirtypersian wrote:
| I believe it might come from the fact that this process of going from a high-level programming language to hardware is called "high-level synthesis". I think the "X" is meant to make it more generic, i.e. X-level synthesis.
| simonw wrote:
| That makes sense. Accelerated => XL just about works for me.
| high_derivative wrote:
| It's most likely inspired by XLA (Accelerated Linear Algebra) - same creator(s).
| Connect12A22 wrote:
| I love their RISC-V implementation in 500 lines of code: https://github.com/google/xls/blob/main/xls/examples/riscv_s...
| fmakunbound wrote:
| Comments indicate it implements a subset of various things.
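[Editor's note: for a concrete sense of what the CRC32 example questioned at the top of the thread computes, it is functionally the standard bit-serial CRC-32 (reflected polynomial 0xEDB88320), a pure feed-forward shift/XOR structure. A minimal Python model of that computation, as an illustrative sketch rather than XLS code:]

```python
import zlib

def crc32_bitserial(data: bytes) -> int:
    """Bit-at-a-time CRC-32 over `data`, using the reflected
    polynomial 0xEDB88320 with the usual init/final inversion."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; XOR in the polynomial if a 1 bit fell out.
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Sanity check against the stock implementation:
assert crc32_bitserial(b"hello") == zlib.crc32(b"hello")
```

[Each inner-loop step is fixed, unconditional dataflow, which is exactly the kind of routine that unrolls into a feed-forward pipeline.]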
| Traster wrote:
| It's kind of a good demonstration of the problem with software versus hardware. Here's the XLS solution (just for one function):
|
|     fn decode_i_instruction(ins: u32) -> (u12, u5, u3, u5, u7) {
|       let imm_11_0 = (ins >> u32:20);
|       let rs1 = (ins >> u32:15) & u32:0x1F;
|       let funct3 = (ins >> u32:12) & u32:0x07;
|       let rd = (ins >> u32:7) & u32:0x1F;
|       let opcode = ins & u32:0x7F;
|       (imm_11_0 as u12, rs1 as u5, funct3 as u3, rd as u5, opcode as u7)
|     }
|
| Here's the SystemVerilog solution:
|
|     {imm_11_0, rs1, funct3, rd, opcode} <= ins;
|
| Obviously, in software you can't slice data in the same way, since as far as I can tell it's assuming all variables are a certain size, so there's no natural way of bit slicing.
| FullyFunctional wrote:
| That's untrue. You need to include the declarations of imm_11_0, etc. for the above to work, and then you end up with just as much code. There's no reason they couldn't extend match to operate on bit slices also, which would make this identical.
|
| Frankly, combinational logic is not where I expect the most interesting differences. Sequential logic is surely more interesting.
| learyg wrote:
| Thanks again for the detailed thought! We actually [developed more advanced bit slicing syntax](https://github.com/google/xls/blob/1b6859dc384fe8fa39fb901af...) since that example was written; you can do things like a standard slice `x[5:8]` or a Verilog-style "width slice" that has explicit signedness `x[i +: u8]`. There's currently no facility for "destructuring" structs as bitfields like pattern matches, but there's no conceptual reason it can't be done; I think that'd be an interesting thing to prioritize if there's good bang for the buck. [Github issue to track!](https://github.com/google/xls/issues/131) Let me know if I missed out on details or rationale, thanks!
| Traster wrote:
| Hey, thanks for replying, the project looks like it has a lot of potential.
| You're right, bit slicing gets you like 99% of the way there (the rest is just syntax sugar). It's interesting because, from what I remember, there were some non-trivial issues for the people using LLVM for their IR because of fundamental assumptions in the representation, but bit-slicing is the core functionality. Is there a reason you guys decided on your own IR?
| rbanffy wrote:
| When I started playing with MAME, I somewhat dreamed of a way to turn its highly structured code into something that could not only be compiled into an emulator as it is, but also be synthesizable into hardware.
|
| The possibility of using a single codebase to generate both a software emulator and a hardware implementation is incredible from a hardware preservation point of view.
| mmastrac wrote:
| I love this. I did something similar using Java to build an RTL:
|
| https://github.com/mmastrac/oblivious-cpu/blob/master/hidecp...
|
| I was thinking about turning it into a full language at some point, but they beat me to it (and I love the Rust syntax!).
| asdfman123 wrote:
| If they rename it XLSM they can embed some neat VBA scripts into it and squeeze out more functionality.
|
| (I'm sorry.)
| rowanG077 wrote:
| DSLX seems like a nightmare. Does it support arbitrary C++?
| thotypous wrote:
| Google is also investing some developer time in Bluespec since it was open-sourced (https://github.com/B-Lang-org/bsc). I wonder if these projects are part of a bigger plan at Google.
| w_t_payne wrote:
| I've got a Kahn-process-network-based "simulation" framework, intended to provide a smooth conveyor belt of product maturation from prototypes written in high-level scripting languages like Python or MATLAB through to production code written in C or Ada. (Sort of like Simulink, but with a different set of warts.)
| Having some hardware synthesis capability is very much on the roadmap, and this looks like it's going to be worth investigating for that. Very excited to dive into it!
| jashmenn wrote:
| I've been programming for 20 years and yet I have no idea what this does. Can someone ELI5?
| jevogel wrote:
| As far as I can tell, it is a high-level synthesis tool for developing FPGA/ASIC applications. You write your circuit functions in a Rust-like DSL and it generates optimized Verilog/SystemVerilog code, which can then be synthesized into hardware. But you can also take the output of the DSL and simulate it first, which presumably is quicker than simulating Verilog.
| cokernel_hacker wrote:
| It is a project aimed at making the design of electronic logic easier.
|
| Often, such hardware is written using hardware description languages [1] like Verilog or VHDL. These languages are very low-level and, in the opinion of some, a little clumsy to use.
|
| XLS aims to provide a system for high-level synthesis [2]. The benefit of such systems is that you can more easily map interesting algorithms to hardware without being super low-level.
|
| [1] https://en.wikipedia.org/wiki/Hardware_description_language
|
| [2] https://en.wikipedia.org/wiki/High-level_synthesis
| pkaye wrote:
| I remember years ago reading about Handel-C. A lot like Go, with channels and threads and function calls. The way it synthesized the hardware was pretty simple conceptually. You could easily understand how the program flow was converted into a state machine in the hardware.
|
| Not sure what happened to it. Maybe it did not optimize things enough.
|
| https://en.wikipedia.org/wiki/Handel-C
|
| https://babbage.cs.qc.cuny.edu/courses/cs345/Manuals/HandelC...
| erikerikson wrote:
| Not like you're 5, and I'm definitely not an expert on this project, but here's my best shot...
|
| Most programs are loaded into memory, and parts of those programs are moved to registers and used to load data into other registers. That data is, in turn, sent to logic units like adders that add two registers together, or comparators that compare two registers' values. The generality comes at a cost in terms of power and time but offers flexibility in return.
|
| That is very different from something like a light switch, where you flip the switch and the result continuously reflects that input within the limits of the speed of light.
|
| If you are willing to sacrifice flexibility, translating your code into hardware gives you a device that runs the same processing on its inputs continuously at the speed of light, subject to your information processing constraints (e.g. derivations of the original input still need to be calculated prior to use).
|
| Traditionally, separate languages and greater hardware knowledge requirements made custom circuits less accessible. This project brings more standard, higher-level languages into the set of valid specifications for custom electronics.
| zelly wrote:
| Verilog for codemonkeys
| FullyFunctional wrote:
| That's a complete mischaracterization. The point of any and all HLSes is to raise the level of abstraction so you can be more productive. Even for highly skilled Verilog "monkeys", writing in an HLS is a great deal faster and less error-prone (assuming comparable mastery of the language), simply because you do not need to deal with a lot of low-level details.
|
| The $1M question, however, is how this experience pans out as you try to squeeze out the last bit of timing margin. I don't know, but I'm eager to find out.
|
| ADD: this parallels the situation with CUDA, where writing a first working implementation is usually easy, but by the time you have a heavily optimized version ...
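[Editor's note: as a software-side mirror of the decode function quoted upthread, here is the same shift-and-mask field extraction in plain Python, with field positions per the RV32I I-type layout. An illustrative sketch, not part of XLS:]

```python
def decode_i_instruction(ins: int):
    """Extract the RV32I I-type fields from a 32-bit instruction word,
    mirroring the shifts and masks in the DSLX example."""
    imm_11_0 = (ins >> 20) & 0xFFF  # 12-bit immediate
    rs1      = (ins >> 15) & 0x1F   # 5-bit source register
    funct3   = (ins >> 12) & 0x07   # 3-bit function code
    rd       = (ins >> 7)  & 0x1F   # 5-bit destination register
    opcode   = ins & 0x7F           # 7-bit opcode
    return imm_11_0, rs1, funct3, rd, opcode

# 0x00500093 encodes `addi x1, x0, 5`:
assert decode_i_instruction(0x00500093) == (5, 0, 0, 1, 0x13)
```

[The SystemVerilog one-liner achieves the same result because the field widths live in the declarations of imm_11_0 etc., so the concatenation assignment can split the word implicitly.]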
| nickysielicki wrote:
| HLS is going to improve, and you can either disregard it and be left behind or you can try to understand where it fits into a design. Your choice.
| patrickcteng wrote:
| ditto
| gadders wrote:
| Thank god I'm not alone.
| tlack wrote:
| You feed in Rust (a flavor called DSLX) or C++ and it generates code for your FPGA (in Verilog). You then upload this compiled "bitstream" to your FPGA, and now you have something akin to a custom microprocessor, but running just your program.
| est31 wrote:
| It looks really quite similar to Rust: https://github.com/google/xls/blob/main/xls/examples/dslx_in...
|
| Note that there are differences, though: seemingly no type inference; for .. in; different array syntax; match arms delimited by ";" instead of ",".
|
| But it has a lot of the cool stuff from Rust: pattern matching, expression orientedness (let ... = match { ... }), etc.
|
| Also other syntax is similar: fn foo() -> Type syntax, although something similar to that can be achieved in C++ as well.
| muizelaar wrote:
| Looks like the match arm difference is going away: https://github.com/google/xls/pull/127
| est31 wrote:
| Very cool. TBH, Rust's match arm delimiter story is a bit weird. Sometimes you need to put a ",", sometimes you don't. And macro_rules! macros have ";" instead of ",".
| foota wrote:
| I think it turns a C-ish language (from the looks, not sure about semantics) into a hardware language like HDL.
| R0b0t1 wrote:
| See also https://github.com/SpinalHDL/SpinalHDL.
| jeffreyrogers wrote:
| This is interesting. Overall I'm bearish on high-level synthesis for anything requiring high performance, since you typically need to think about how your code will be mapped to hardware if you want it to perform well, and adding abstractions interferes with that. I would like to know more about how Google uses this, since it doesn't seem like a good fit for the type of stuff I work on.
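[Editor's note: the two DSLX slice forms learyg mentions upthread (`x[5:8]` and the Verilog-style `x[i +: u8]`) can be modeled in ordinary Python for intuition. These helpers are illustrative only, not XLS's actual API:]

```python
def bit_slice(x: int, lo: int, hi: int) -> int:
    """Bits [lo, hi) of x, LSB-first -- like a DSLX slice x[lo:hi]."""
    return (x >> lo) & ((1 << (hi - lo)) - 1)

def width_slice(x: int, start: int, width: int) -> int:
    """Verilog-style indexed part-select x[start +: width]."""
    return (x >> start) & ((1 << width) - 1)

# On the I-type example: the opcode is the low 7 bits,
# and the 12-bit immediate starts at bit 20.
assert bit_slice(0x00500093, 0, 7) == 0x13
assert width_slice(0x00500093, 20, 12) == 5
```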
| typon wrote:
| This doesn't seem like HLS, more like a new HDL that's based on Rust. This has been done many times before with other functional languages (Clash, Chisel, Spinal, hardcaml and others). These projects never take off because hardware designers are inherently conservative and they won't let go of their horrible language (Verilog or SystemVerilog) no matter what.
|
| I'm sure Google will use XLS for their internal digital design work, but I don't expect this to ever gain widespread support. (Not because HLS is inherently bad, but because of the culture.)
| analognoise wrote:
| Hardware has gotten 1000x faster, and software has made that 1000x faster system slower than it was in the 1980s, and you think hardware people should learn the software style?
|
| ...Are you sure?
| jeffreyrogers wrote:
| They describe it as HLS, and it definitely looks like HLS to me. But maybe we have different definitions. Either way, it seems to be targeting a strange subset of problems: it doesn't look high-level enough to be easy to use for non-hardware designers (I don't think this goal is achievable, but it is at least a worthy goal), and it doesn't seem low-level enough to allow predictable performance.
| Traster wrote:
| > These projects never take off because hardware designers are inherently conservative and they won't let go of their horrible language (Verilog or SystemVerilog) no matter what.
|
| This is categorically not true. There have been repeated projects to re-invent hardware description languages. They don't fail because hardware engineers are conservative; they fail because they don't produce good enough results.
|
| Intel has a team of hundreds of engineers working on HLS, Xilinx probably has almost as many, and there are lots of smaller companies working on their own things, like Maxeler. They haven't taken off because it's an unsolved problem to automate some of the things you do in Verilog efficiently.
|
| Take this language for example - it cannot express any control flow. It's feed-forward only. Which essentially means it is impossible to express most of the difficult parts of the problems people solve in hardware. I hate Verilog, I would love a better solution, but this language is like designing a software programming language that has no concept of run-time conditionals.
| aseipp wrote:
| I mean, languages like Bluespec are very close to actual SystemVerilog semantically, and others like Clash are essentially structural by design, not behavioral (I can't speak for other alt-RTLs). You are in full control of using DFFs, the language perfectly reflects where combinational logic is done, the mappings of DFFs or IP to underlying RTL and device primitives can easily be done so there's no synthesis ambiguity, etc. In the hands of an experienced RTL engineer, you can more or less exactly understand/infer the logic footprint just from reading the code, just like Verilog. You can do Verilog annotations that get persisted in the compiler output to help the synthesizer, and all that stuff. Despite that, you still hear all the exact same complaints ("not good enough" because it used a few extra LUTs due to the synthesizer being needy, despite the fact that RTL people already admit to spending stupid amounts of time on pleasing synthesizers). Dyed-in-the-wool RTL engineers are certainly a conservative bunch, and cagey about this stuff no matter what; it's undeniable.
|
| I think a bigger problem is things like tooling, which is deeply invested in existing RTLs. High-end verification tools are more important than just the languages, but they're also very difficult to replicate and extend and acquire. That includes simulation, debuggers, formal tools, etc. Verification is where all the actual effort goes, anyway. You make that problem simpler, and you'll have a winner regardless of what anyone says.
|
| You mention Intel's and Xilinx's software groups, but frankly I believe it's a good example of the bigger culture/market problem in the FPGA world. FPGA companies desperately want to own every single part of the toolchain in a bid for vertical integration; in theory it seems nice, but in practice it sucks. This is the root of why everyone says Quartus/Vivado are shitware, despite being technically impressive engineering feats. Intel PSG and Xilinx just aren't software companies, even if they employ a lot of programmers who are smart. They aren't going to be the ones to encourage or support alternative RTLs, deliver integrated tools for verification, etc. It also creates perverse incentives where they can fuel device sales through the software. (Xilinx IP uses too much space? Guess you gotta buy a bigger device!) Oh sure, Xilinx _wants_ you to believe that they're uniquely capable of delivering P&R tools nobody else can -- the way RTL engineers talk about the mythical P&R algorithms, you'd think Xilinx programmers were godly superhumans, or were getting paid by Xilinx themselves -- that revealing chip details would immediately mean their designs would be copied by Other Electronics Companies and they would crumble overnight, despite the literal billions you would need up-front to establish profitability and a market position, and so on. The ASIC world figured out a long time ago that controlling the software just meant the software was substandard.
| gchadwick wrote:
| > These projects never take off because hardware designers are inherently conservative and they won't let go of their horrible language (Verilog or SystemVerilog) no matter what.
|
| As a hardware designer who's never been a fan of SystemVerilog but continues to use it, I think this is inaccurate. There are two main issues that mean I currently choose SystemVerilog (though I would certainly be happy to replace it).
|
| 1.
| Tooling: Verilog or SystemVerilog (at least bits of it) is widely supported across the EDA ecosystem. Any new HDL thus needs to compile down to Verilog to be usable for anything serious. Most do indeed do this, but there can be a major issue with mapping the language: any issues you get in the compiled Verilog need to be mentally mapped back to the initial language. Depending upon the HDL this can be rather hard, especially if there's serious name mangling going on.
|
| 2. New HDLs don't seem to optimize for the kinds of issues I have, and may make dealing with the issues I do have worse. Most of my career I've been working on CPUs and GPUs. Implementation results matter (so power, max frequency and silicon area), and to hit the targets you want to hit you often need to do some slightly crazy stuff. You also need a very good mental model of how the implemented design (i.e. what gates you get, where they get placed and how they're connected) is produced from the HDL, and in turn know how to alter the HDL to get a better result in gates. A typical example is dealing with timing paths: you may need to knock a few gates off a path to meet a frequency goal, which requires you to a) map the gates back to HDL constructs so you can see what bit of RTL is causing the issues and b) do some of the slightly crazy stuff -- hyper-specific optimisations that rely on a deep understanding of the micro-architecture.
|
| New HDLs often have nice things like decent type systems and generative capabilities, but lose the low-level, easy mapping of RTL to gates you get with Verilog. I don't find much of my time, for instance, is spent dealing with Verilog's awful type system (including the time spent dealing with bugs that arise from it). It's frustrating, but making it better wouldn't have a transformative effect on my work.
|
| I do spend lots of time mentally mapping gates back to RTL to then try and work out better ways to write the RTL to improve implementation results. This often comes back to, say, seeing that an input to an AND gate arrives very late, and realising you can make another version of that signal that won't break functional correctness 90% of the time, with a fix-up applied to deal with the other 10% of cases in some other, less timing-critical part of the design (e.g. in a CPU pipeline the fix-up would be causing a replay or killing an instruction further down the pipeline). Due to the mapping issue I brought up in 1., new HDLs often make this harder. Taking a higher-level approach to the design can also make such fixes very fiddly, or impossible to do without hacking up the design in a major way.
|
| That said, my only major experience with a CPU design not using Verilog/SystemVerilog was building a couple of CPUs for my PhD in Bluespec SystemVerilog. I kind of liked the language, but ultimately, due to 1. and 2., didn't think it really did much for me over SystemVerilog.
|
| If you're building hardware with less tight constraints then yes, some of the new HDLs around could work very well for you, and yes, hardware designers can be very conservative about changing their ways, but it simply isn't the case that this is the only thing holding back adoption of new HDLs.
|
| I do need to spend some more time getting to grips with what's now available and up-and-coming, but I can't say I've seen anything that, for my job at least, provides a major jump over SystemVerilog.
| learyg wrote:
| Hi, one of the collaborators here! One question to consider, and one that I consider pretty frequently, is what the hard difference really is between HLS and RTL. It seems up to interpretation, but I think of it more as a spectrum than anything that truly schisms the space.
| I think I personally associate the term HLS with "trying to uplevel the design process where we can".
|
| Even with modern RTL, we have a synthesizing compiler optimizing our design within a cycle boundary, trying to manage fanouts and close timing by duplicating paths and optimizing redundant boolean formulas. Some will even do some forms of cross-stage optimization.
|
| If you think of XLS's starting point as "mostly structural" akin to RTL (instead of "loops where you push a button and produce a whole chip"), it's really an up-leveling process, where there's a compiler layer underneath you that can assist you in exploring the design space, ideally more quickly and effectively, and trying to give you a flexible substrate to make that happen (by describing bits of functionality as much as possible in latency-insensitive ways).
|
| I like to think of it like [Advanced Chess](https://en.wikipedia.org/wiki/Advanced_chess) -- keep the human intuition but permit the use of lots of cycles for design-process assist. It appears from what we've seen so far that when you have a "lifted" representation of your design such that tools can work with it well, composition and exploration become more possible, fun, and fruitful! I _expect_ over time we'll have a mode where you still require that everything closes timing in a single cycle, for when you explicitly want all the control you had / don't care so much for the assist; then you just get the benefits of the tooling / fast simulation infrastructure that works with the same program representation. It's a great space to be working in as somebody who loves compilers, tools, and systems: there's so much you _could_ do, there's incredible opportunity!
___________________________________________________________________
(page generated 2020-09-02 23:00 UTC)