[HN Gopher] XLS: Accelerated HW Synthesis
       ___________________________________________________________________
        
       XLS: Accelerated HW Synthesis
        
       Author : victor82
       Score  : 104 points
       Date   : 2020-09-02 15:19 UTC (7 hours ago)
        
 (HTM) web link (google.github.io)
 (TXT) w3m dump (google.github.io)
        
       | Traster wrote:
       | >XLS is used inside of Google for generating feed-forward
       | pipelines from "building block" routines
       | 
       | For those that aren't familiar, control flow - or non "Directed
        | Acyclic Graphs" are the hard part of HLS. This looks like a
       | fairly nice syntax compared to the bastardisations of C that
       | Intel and Xilinx pursue for HLS but I'm not sure this is bringing
       | anything new to the table.
       | 
        | As for the examples, I'm kind of flummoxed that they haven't given
       | any details on what the examples synthesize to. For example, how
       | many logic blocks does the CRC32 use? How many clock cycles? What
       | about the throughput? I'm going to sound like a grumpy old man
        | now, but it's important because it's very difficult to get
       | performant code as a hardware engineer. Generally it involves
       | having a fair idea of how the code is going to synthesize. What
       | is damn near impossible is figuring out what you want to
       | synthesize to, and then guessing the shibboleth that the compiler
        | wants in order to produce that code. Given that they haven't
        | tackled the difficult problems like control flow, folding,
        | resource sharing, etc., it makes me hesitant to believe they've
        | produced something phenomenal.
        
         | aseipp wrote:
         | The HLS tools from Xilinx and Intel (and maybe Cadence I guess)
         | can also actually compile your models as ordinary C++ code (i++
         | from Intel is literally just a fork of Clang, I think, and so
         | are tools like LegUp), leading to their greatest benefit:
         | simulations are way, way faster and software compilers have
         | vastly better iteration times than synthesizers.
         | 
         | They seem to have a simulation framework for these tools that
         | isn't just "re-use an existing simulator", and it apparently
         | does use LLVM for codegen but that's the easy part. Actual
         | simulation performance numbers would be really interesting to
         | see vs actual RTL sims.
        
         | learyg wrote:
         | Hi, one of the collaborators here, thanks for the good points.
         | 
         | We have been targeting some Lattice FPGAs for prototyping
         | purposes, but we've mostly been doing designs for ASIC
         | processes, which is why details are a little sparse for FPGAs
         | you get off the shelf, but it's a priority for us to fill those
         | in. We have some interactive demos that show FPGA synthesis
         | stats (cell counts, generated Verilog, let you toy with the
          | pipeline frequency) and integrate with the [IR visualizer](
          | https://google.github.io/xls/ir_visualization/#screenshot), we'll
         | try to open source that as soon as possible. The OSS tools
         | (SymbiFlow) that some of our colleagues collaborate on can do
         | synthesis in just a few seconds, so it can feel pretty cool to
         | see these things in near-real-time.
         | 
          | We fold over resources in time with a sequential generator, but
          | we still have a ways to go. We expect a bunch of problems will
          | map nicely onto concurrent processes; they're Turing complete
          | and nice for the compiler to reason about.
         | 
          | I'm a big believer that phenomenal is really effort and
         | solving real-world pain points integrated over time -- it's a
         | journey! We're intending to do blog posts as we hit big
         | milestones, so keep an eye out!
        
           | Traster wrote:
           | Do you mind me asking what applications Google uses this for
           | internally? Is this used in a flow that's ended up in
           | production? Also, what are your thoughts on integrating
           | optimized RTL blocks?
        
             | learyg wrote:
             | One of the things we have on our short list is "good FFI"
             | for instantiating existing RTL blocks (and making their
             | timing characteristics known to the compiler) and making
             | import flows from Verilog/SystemVerilog types. The latter
             | may be a bit your-Verilog-flow specific, but we think there
             | are some universal components you can provide that folks
             | can slot in their flows as appropriate.
             | 
             | Being able to re-time pipelines without a rewrite is a
             | useful capability. Although it's still experimental and
             | we're actively building out the capabilities, we have it in
             | real designs that have important datapaths.
        
       | ampdepolymerase wrote:
       | Reminds me of the old reconfigure.io which used the ideas and
       | syntax of Go's CSP and transformed them into async HDL code.
       | Unfortunately the startup has been shuttered.
       | 
       | http://docs.reconfigure.io/
        
       | simonw wrote:
       | XLS as an acronym for Accelerated HW Synthesis is a bit of a
       | stretch!
        
         | dirtypersian wrote:
         | I believe it might come from the fact that this process of
         | going from high level programming language to hardware is
         | called "high level synthesis". I think the "X" is meant to make
         | it more generic, i.e. X level synthesis.
        
           | simonw wrote:
           | That makes sense. Accelerated => XL just about works for me.
        
         | high_derivative wrote:
         | It's most likely inspired by XLA (Accelerated Linear Algebra) -
         | same creator(s).
        
       | Connect12A22 wrote:
       | I love their RISC-V implementation in 500 lines of code:
       | https://github.com/google/xls/blob/main/xls/examples/riscv_s...
        
         | fmakunbound wrote:
         | Comments indicate it implements a subset of various things.
        
         | Traster wrote:
         | It's kind of a good demonstration of the problem with software
         | versus hardware, here's xls solution (just for one function):
          | fn decode_i_instruction(ins: u32) -> (u12, u5, u3, u5, u7) {
          |   let imm_11_0 = (ins >> u32:20);
          |   let rs1 = (ins >> u32:15) & u32:0x1F;
          |   let funct3 = (ins >> u32:12) & u32:0x07;
          |   let rd = (ins >> u32:7) & u32:0x1F;
          |   let opcode = ins & u32:0x7F;
          |   (imm_11_0 as u12, rs1 as u5, funct3 as u3, rd as u5, opcode as u7)
          | }
         | 
          | here's the SystemVerilog solution:
          | {imm_11_0, rs1, funct3, rd, opcode} <= ins;
         | 
          | Obviously, in software, you can't slice data in the same way
          | since, as far as I can tell, it's assuming all variables are a
          | certain size and so there's no natural way of bit slicing.
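A hedged sketch of the same decode in ordinary Python (illustration only, not XLS code; field positions follow the RISC-V I-type layout quoted above):

```python
def decode_i_instruction(ins: int):
    """Extract RISC-V I-type fields by shifting and masking,
    mirroring the DSLX example above."""
    imm_11_0 = (ins >> 20) & 0xFFF  # bits 31..20
    rs1 = (ins >> 15) & 0x1F        # bits 19..15
    funct3 = (ins >> 12) & 0x07     # bits 14..12
    rd = (ins >> 7) & 0x1F          # bits 11..7
    opcode = ins & 0x7F             # bits 6..0
    return imm_11_0, rs1, funct3, rd, opcode

# addi x1, x0, 5 encodes as 0x00500093:
assert decode_i_instruction(0x00500093) == (5, 0, 0, 1, 0x13)
```

The SystemVerilog concatenation assignment performs the same split in one line because each named net carries its width in its declaration.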
        
           | FullyFunctional wrote:
           | That's untrue. You need to include the declarations of
           | im_11_0, etc. for the above to work and then you end up with
           | just as much code. There's no reason they couldn't extend
           | match to operate on bit slices also which would make this
           | identical.
           | 
            | Frankly, combinational logic is not where I expect the most
           | interesting differences. Sequential logic is surely more
           | interesting.
        
           | learyg wrote:
            | Thanks again for the detailed thought! We actually [developed
            | more advanced bit slicing syntax](https://github.com/google/
            | xls/blob/1b6859dc384fe8fa39fb901af...) since that example
           | was written, you can do things like a standard slice `x[5:8]`
           | or a Verilog-style "width slice" that has explicit signedness
           | `x[i +: u8]`. There's currently no facility for
           | "destructuring" structs as bitfields like pattern matches,
           | but there's no conceptual reason it can't be done, I think
           | that'd be an interesting thing to prioritize if there's good
           | bang for the buck. [Github issue to
           | track!](https://github.com/google/xls/issues/131) Let me know
           | if I missed out on details or rationale, thanks!
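The two slicing styles mentioned can be modeled in Python for intuition (a sketch of the semantics only; the helper names are mine, and I'm assuming the standard slice uses a Python-style exclusive upper bound):

```python
def bit_slice(x: int, lo: int, hi: int) -> int:
    """Standard slice like x[5:8]: bits lo..hi-1 of x."""
    return (x >> lo) & ((1 << (hi - lo)) - 1)

def width_slice(x: int, start: int, width: int) -> int:
    """Verilog-style width slice like x[i +: u8]: `width` bits from `start`."""
    return (x >> start) & ((1 << width) - 1)

# 0b1110_0000 has bits 5, 6, and 7 set:
assert bit_slice(0b1110_0000, 5, 8) == 0b111
assert width_slice(0b1110_0000, 5, 3) == 0b111
```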
        
             | Traster wrote:
             | Hey, thanks for replying, the project looks like it has a
             | lot of potential. You're right, bit slicing gets you like
             | 99% of the way there (the rest is just syntax sugar). It's
             | interesting because from what I remember there were some
             | non-trivial issues for the people using LLVM for their IR
             | because of fundamental assumptions in the representation,
             | but bit-slicing is the core functionality. Is there a
             | reason you guys decided on your own IR?
        
       | rbanffy wrote:
       | When I started playing with MAME, I somewhat dreamed of a way to
       | turn its highly structured code into something that could not
       | only be compiled into an emulator as it is, but also be
       | synthesizable into hardware.
       | 
       | The possibility of using a single codebase to generate both a
       | software emulator and a hardware implementation is incredible,
       | from a hardware preservation point of view.
        
       | mmastrac wrote:
       | I love this. I did something similar with using Java to build an
       | RTL:
       | 
       | https://github.com/mmastrac/oblivious-cpu/blob/master/hidecp...
       | 
       | I was thinking about turning it into a full language at some
       | point, but they beat me to it (and I love the Rust syntax!).
        
       | asdfman123 wrote:
       | If they rename it XLSM they can embed some neat VBA scripts into
       | it and squeeze out more functionality.
       | 
       | (I'm sorry.)
        
       | rowanG077 wrote:
       | DSLX seems like a nightmare. Does it support arbitrary C++?
        
       | thotypous wrote:
       | Google is also investing some developer time on Bluespec since it
        | was opensourced (https://github.com/B-Lang-org/bsc). I wonder if
        | these projects are part of a bigger plan at Google.
        
       | w_t_payne wrote:
       | I've got a Kahn-process-network based "simulation" framework,
       | intended to provide a smooth conveyor belt of product maturation
       | from prototypes written in high level scripting languages like
       | Python or MATLAB through to production code written in C or Ada.
       | (Sort of like Simulink, but with a different set of warts).
       | Having some hardware synthesis capability is very much on the
       | roadmap, and this looks like it's going to be worth investigating
       | for that. Very excited to dive into it!
        
       | jashmenn wrote:
       | I've been programming for 20 years and yet I have no idea what
       | this does. Can someone ELI5?
        
         | jevogel wrote:
         | As far as I can tell, it is a high-level synthesis tool for
         | developing FPGA/ASIC applications. You write your circuit
         | functions in a Rust-like DSL and it generates optimized
         | Verilog/System Verilog code, which can then be synthesized into
         | hardware. But you can also take the output of the DSL and
         | simulate it first, which presumably is quicker than simulating
         | Verilog.
        
         | cokernel_hacker wrote:
          | It is a project aimed at making the design of electronic
          | logic easier.
         | 
         | Often, such hardware is written using hardware description
         | languages [1] like Verilog or VHDL. These languages are very
         | low level and, in the opinion of some, a little clumsy to use.
         | 
         | XLS aims to provide a system for High-level synthesis [2]. The
         | benefit of such systems is that you can more easily map
         | interesting algorithms to hardware without being super low
         | level.
         | 
         | [1] https://en.wikipedia.org/wiki/Hardware_description_language
         | 
         | [2] https://en.wikipedia.org/wiki/High-level_synthesis
        
           | pkaye wrote:
           | I remember years ago reading about Handel-C. A lot like Go
           | with channels and threads and function calls. The way it
           | synthesized the hardware was pretty simple conceptually. You
           | could easily understand how the program flow was converted
           | into a state machine in the hardware.
           | 
            | Not sure what happened to it. Maybe it did not optimize
           | things enough.
           | 
           | https://en.wikipedia.org/wiki/Handel-C
           | 
           | https://babbage.cs.qc.cuny.edu/courses/cs345/Manuals/HandelC.
           | ..
        
         | erikerikson wrote:
         | Not like you're 5 and I'm definitely not an expert on this
         | project but here's my best shot...
         | 
         | Most programs are loaded into memory and parts of those
         | programs are moved to registers and are used to load data into
         | other registers. That data is, in turn, sent to logic units
         | like adders that add two registers together or comparators that
          | compare two registers' values. The generality comes at a cost in
         | terms of power and time but offers flexibility in return.
         | 
         | That is very different from something like a light switch where
         | you flip the switch and the result continuously reflects that
         | input within the limits of the speed of light.
         | 
         | If you are willing to sacrifice flexibility, translating your
         | code into hardware gives you a device that runs the same
         | processing on its inputs continuously at the speed of light
         | subject to your information processing constraints (e.g.
         | derivations of the original input still need to be calculated
         | prior to use).
         | 
         | Traditionally, separate languages and greater hardware
         | knowledge requirements made custom circuits less accessible.
         | This project brings more standard, higher level languages into
         | the set of valid specifications for custom electronics.
        
         | zelly wrote:
         | Verilog for codemonkeys
        
           | FullyFunctional wrote:
           | That's a complete mischaracterization. The point of any and
           | all HLSes is to raise the level of abstraction so you can be
            | more productive. Even for highly skilled Verilog "monkeys",
           | writing in an HLS is a great deal faster and less error prone
           | (assuming comparable mastery of the language) simply because
           | you do not need to deal with a lot of low level details.
           | 
            | The $1M question, however, is how this experience pans out as
            | you try to squeeze out the last bit of timing margin. I don't
            | know, but I'm eager to find out.
           | 
            | ADD: this parallels the situation with CUDA, where writing a
            | first working implementation is usually easy, but by the time
            | you have a heavily optimized version ...
        
           | nickysielicki wrote:
           | HLS is going to improve, and you can either disregard it and
           | be left behind or you can try to understand where it fits
           | into a design. Your choice.
        
         | patrickcteng wrote:
         | ditto
        
           | gadders wrote:
           | Thank god I'm not alone.
        
         | tlack wrote:
         | You feed in Rust (a flavor called DSLX) or C++ and it generates
         | code for your FPGA (in Verilog). You then upload this compiled
         | "bitstream" to your FPGA and now you have something akin to a
         | custom microprocessor, but running just your program.
        
           | est31 wrote:
           | It looks really quite similar to Rust: https://github.com/goo
           | gle/xls/blob/main/xls/examples/dslx_in...
           | 
            | Note that there are differences though: seems there's no type
            | inference, for .. in, different array syntax, match arms
            | delimited by ";" instead of ",".
           | 
           | But it has a lot of the cool stuff from Rust: pattern
           | matching, expression orientedness (let ... = match { ... }),
           | etc.
           | 
           | Also other syntax is similar: fn foo() -> Type syntax,
           | although something similar to that can be achieved in C++ as
           | well.
        
             | muizelaar wrote:
             | Looks like the match arm difference is going away:
             | https://github.com/google/xls/pull/127
        
               | est31 wrote:
               | Very cool. TBH, Rust's match arm delimiter story is a bit
               | weird. Sometimes you need to put a ",", sometimes you
               | don't. And macro rules macros have ";" instead of ",".
        
         | foota wrote:
         | I think it turns a c-ish language (from the looks, not sure
         | about semantics) into a hardware language like HDL.
        
       | R0b0t1 wrote:
       | See also https://github.com/SpinalHDL/SpinalHDL.
        
       | jeffreyrogers wrote:
       | This is interesting. Overall I'm bearish on high-level synthesis
       | for anything requiring high performance, since you typically need
       | to think about how your code will be mapped to hardware if you
       | want it to perform well, and adding abstractions interferes with
       | that. I would like to know more about how Google uses this, since
       | it doesn't seem like a good fit for the type of stuff I work on.
        
         | typon wrote:
         | This doesn't seem like HLS, more like a new HDL that's based on
         | Rust. This has been done many times before with other
         | functional languages (Clash, Chisel, Spinal, hardcaml and
         | others). These projects never take off because hardware
          | designers are inherently conservative and they won't let go of
          | their horrible language (Verilog or SystemVerilog) no matter
          | what.
         | 
         | I'm sure Google will use XLS for their internal digital design
         | work, but I don't expect this to ever gain widespread support.
         | (not because HLS is inherently bad, but because of the culture)
        
           | analognoise wrote:
           | Hardware has gotten 1000x faster, and software has made that
           | 1000x faster system slower than it was in the 1980's, and you
           | think hardware people should learn the software style?
           | 
           | ...Are you sure?
        
           | jeffreyrogers wrote:
           | They describe it as HLS, and it definitely looks like HLS to
           | me. But maybe we have different definitions. Either way, it
           | seems to be targeting a strange subset of problems: it
           | doesn't look high level enough to be easy to use for non-
           | hardware designers (I don't think this goal is achievable,
           | but it is at least a worthy goal), and it doesn't seem low-
           | level enough to allow predictable performance.
        
           | Traster wrote:
           | > These projects never take off because hardware designers
           | are inherently conservative and they won't let go of their
            | horrible language (Verilog or SystemVerilog) no matter what.
           | 
           | This is categorically not true. There have been repeated
           | projects to re-invent hardware description languages. They
           | don't fail because hardware engineers are conservative, they
           | fail because they don't produce good enough results.
           | 
           | Intel has a team of hundreds of engineers working on HLS,
           | Xilinx probably has almost as many, there are lots of smaller
           | companies working on their own things like Maxeler. They
            | haven't taken off because it's an unsolved problem to automate
           | some of the things you do in Verilog efficiently.
           | 
           | Take this language for example - it cannot express any
           | control flow. It's feed forward only. Which essentially
           | means, it is impossible to express most of the difficult
           | parts of the problems people solve in hardware. I hate
           | Verilog, I would love a better solution, but this language is
           | like designing a software programming language that has no
           | concept of run-time conditionals.
        
             | aseipp wrote:
             | I mean, languages like Bluespec are very close to actual
             | SystemVerilog semantically, and others like Clash are
             | essentially structural by design, not behavioral (I can't
             | speak for other alt-RTLs). You are in full control of using
             | DFFs, the language perfectly reflects where combinatorial
             | logic is done, the mappings of DFFs or IP to underlying RTL
             | and device primitives can easily be done so there's no
             | synthesis ambiguity, etc. In the hands of an experienced
             | RTL engineer you can more or less exactly understand/infer
             | their logic footprint just from reading the code, just like
             | Verilog. You can do Verilog annotations that get persisted
             | in the compiler output to help the synthesizer and all that
             | stuff. Despite that, you still hear all the exact same
             | complaints ("not good enough" because it used a few extra
             | LUTs due to the synthesizer being needy, despite the fact
             | RTL people already admit to spending stupid amounts of time
              | on pleasing synthesizers already.) Dyed-in-the-wool RTL
             | engineers are certainly a conservative bunch, and cagey
             | about this stuff no matter what, it's undeniable.
             | 
             | I think a bigger problem is things like tooling which is
             | deeply invested in existing RTLs. High-end verification
             | tools are more important than just the languages, but
             | they're also very difficult to replicate and extend and
             | acquire. That includes simulation, debuggers, formal tools,
             | etc. Verification is where all the actual effort goes,
             | anyway. You make that problem simpler, and you'll have a
             | winner regardless of what anyone says.
             | 
             | You mention the Intel and Xilinx's software groups, but
             | frankly I believe it's a good example of the bigger
             | culture/market problem in the FPGA world. FPGA companies
             | desperately want to own every single part of the toolchain
             | in a bid for vertical integration; in theory it seems nice,
             | but it actually sucks. This is the root of why everyone
             | says Quartus/Vivado are shitware, despite being technically
             | impressive engineering feats. Intel PSG and Xilinx just
             | aren't software companies, even if they employ a lot of
             | programmers who are smart. They aren't going to be the ones
             | to encourage or support alternative RTLs, deliver
             | integrated tools for verification, etc. It also creates
             | perverse incentives where they can fuel device sales
             | through the software. (Xilinx IP uses too much space? Guess
             | you gotta buy a bigger device!) Oh sure, Xilinx _wants_ you
              | to believe that they're uniquely capable of delivering P&R
             | tools nobody else can -- the way RTL engineers talk about
             | the mythical P&R algorithms, you'd think Xilinx programmers
             | were godly superhumans, or they were getting paid by Xilinx
             | themselves -- that revealing chip details would immediately
             | mean their designs would be copied by Other Electronics
             | Companies and they would crumble overnight despite the
             | literal billions you would need up-front to establish
             | profitability and a market position, and so on. The ASIC
             | world figured out a long time ago that controlling the
             | software just meant the software was substandard.
        
           | gchadwick wrote:
           | > These projects never take off because hardware designers
           | are inherently conservative and they won't let go of their
            | horrible language (Verilog or SystemVerilog) no matter what.
           | 
            | As a hardware designer who's never been a fan of
           | SystemVerilog but continues to use it I think this is
           | inaccurate. There are two main issues that mean I currently
           | choose SystemVerilog (though would certainly be happy to
           | replace it).
           | 
            | 1. Tooling: Verilog or SystemVerilog (at least bits of it) is
           | widely supported across the EDA ecosystem. Any new HDL thus
           | needs to compile down to Verilog to be usable for anything
           | serious. Most do indeed do this but there can be a major
           | issue with mapping the language. Any issues you get in the
           | compiled Verilog need to be mentally mapped back to the
           | initial language. Depending upon the HDL this can be rather
           | hard, especially if there's serious name mangling going on.
           | 
           | 2. New HDLs don't seem to optimize for the kinds of issues I
           | have and may make dealing with the issues I do have worse.
           | Most of my career I've been working on CPUs and GPUs.
           | Implementation results matter (so power, max frequency and
           | silicon area) and to hit the targets you want to hit you
           | often need to do some slightly crazy stuff. You also need a
           | very good mental model of how the implemented design (i.e.
           | what gates you get, where they get placed and how they're
           | connected) is produced from the HDL and in turn know how to
           | alter the HDL to get a better result in gates. A typical
           | example is dealing with timing paths, you may need to knock a
           | few gates off a path to meet a frequency goal which requires
           | you to a) map the gates back to HDL constructs so you can see
           | what bit of RTL is causing the issues and b) do some of the
           | slightly crazy stuff, hyper-specific optimisations that rely
           | on a deep understanding of the micro-architecture.
           | 
            | New HDLs often have nice things like decent type systems and
            | generative capabilities but lose the low-level, easy mental
            | mapping of RTL to gates you get with Verilog. I don't find
           | much of my time for instance is spent dealing with Verilog's
           | awful type system (including the time spent dealing with bugs
           | that arise from it). It's frustrating but making it better
           | wouldn't have a transformative effect on my work.
           | 
            | I do spend lots of time mentally mapping gates back to RTL to
            | then try and work out better ways to write the RTL to
            | improve implementation results. This often comes back to, say,
            | seeing that an input to an AND gate is very late, realising
            | you can make another version of that signal that won't break
            | functional correctness 90% of the time with a fix-up applied
           | to deal with the other 10% of cases in some other less timing
           | critical part of the design (e.g. in a CPU pipeline the fix-
            | up would be causing a replay or killing an instruction further
           | down the pipeline). Due to the mapping issue I brought up in
           | 1. new HDLs often make this harder. Taking a higher level
           | approach to the design can also make such fixes very fiddly
           | or impossible to do without hacking up the design in a major
           | way.
           | 
           | That said my only major experience with a CPU design not
           | using Verilog/SystemVerilog was building a couple of CPUs for
           | my PhD in Bluespec SystemVerilog. I kind of liked the
           | language but ultimately due to 1. and 2. didn't think it
           | really did much for me over SystemVerilog.
           | 
            | If you're building hardware with less tight constraints then
            | yes, some of the new HDLs around could work very well for you,
            | and yes, hardware designers can be very conservative about
            | changing their ways, but it simply isn't the case that this is
            | the only thing holding back adoption of new HDLs.
           | 
           | I do need to spend some more time getting to grips with
           | what's now available and up and coming but I can't say I've
           | seen anything, that for my job at least, provides a major
           | jump over SystemVerilog.
        
         | learyg wrote:
         | Hi, one of the collaborators here! One question to consider,
         | and one that I consider pretty frequently, is what the hard
         | difference really is between HLS and RTL. It seems up to
         | interpretation, but I think of it more as a spectrum than
         | anything that truly schisms the space. I think I personally
         | associate the term HLS with "trying to uplevel the design
         | process where we can".
         | 
         | Even with modern RTL, we have a synthesizing compiler
         | optimizing our design within a cycle boundary, trying to manage
          | fanouts and close timing by duplicating paths and optimizing
         | redundant boolean formulas. Some will even do some forms of
         | cross stage optimization.
         | 
         | If you think of XLS's starting point as "mostly structural"
         | akin to RTL (instead of "loops where you push a button and
         | produce a whole chip") it's really an up-leveling process,
         | where there's a compiler layer underneath you that can assist
         | you in exploring the design space, ideally more quickly and
         | effectively, and trying to give you a flexible substrate to
         | make that happen (by describing bits of functionality as much
         | as possible in latency insensitive ways).
         | 
         | I like to think of it like [Advanced
         | Chess](https://en.wikipedia.org/wiki/Advanced_chess) -- keep
         | the human intuition but permit the use of lots of cycles for
         | design process assist. It appears from what we've seen so far
         | that when you have a "lifted" representation of your design
         | such that tools can work with it well, composition and
          | exploration become more possible, fun, and fruitful! I
          | _expect_ over time we'll have a mode where you still require
          | everything to close timing in a single cycle when you explicitly
         | want all the control you had / don't care so much for the
         | assist, then you just get the benefits of the tooling / fast
         | simulation infrastructure that works with the same program
         | representation. It's a great space to be working in as somebody
         | who loves compilers, tools, and systems: there's so much you
          | _could_ do, there's incredible opportunity!
        
       ___________________________________________________________________
       (page generated 2020-09-02 23:00 UTC)