ams1.josuah.net

       
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
       Wishbone B4: Standard or Pipelined?
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 (HTM)  While writing {HDL} to teach a chip new tricks, it is best to avoid drowning in
        the complexity.
       
 (HTM)  The famous {divide and rule} helps: splitting the design into modules that,
        like functions in a programming language, reduce the scope of what is worked
        on and hide the complexity from the parent module that calls them.
       
        But it quickly ends up as a sea of modules communicating in many different
        ways.
       
       Organising communication with a bus
       ───────────────────────────────────
 (HTM)  Adding another layer of organisation becomes necessary: a {bus} that acts as a
        central spine for communication across the whole design.
       
 (HTM)  Several bus protocols exist, {Wishbone} being the simplest and the most
        widely used among open source cores.
       
       What flavor?
       ────────────
        The Wishbone bus comes in multiple variants:
       
        • With or without the extra `CTI` signal: *Classic* or *Registered Feedback*;
       
        • Different timing constraints for `ACK`: *Synchronous* or *Asynchronous*;
       
        • Different meanings for `STB` and `CYC`: *Standard* or *Pipelined*;
       
        • Some extra optional signals.
       
        I suppose the aim was to cover the widest range of use-cases, so that
        Wishbone can be used in a standard way in most situations.
       
        This large range of options also makes it harder to support every combination,
        some being incompatible with each other, and it seems common to use the most
        basic Wishbone in every case.
       
        What remains is to decide which combination is the simplest.
       
       Standard and Pipelined
       ──────────────────────
        At first, I wanted to avoid the Pipelined mode, to keep things as simple as
        possible. But my opinion changed after looking at how both modes work:
       
        In **Standard mode**, when a master issues a request with `STB_O`, it keeps
        `STB_O` high for as long as the slave is not ready, until it sees `ACK_I`
        held high by the slave. `CYC_O` and `STB_O` are both still high on the clock
        where `ACK_I` is received, and it is only on the next clock that it is
        possible to issue a new request.
       
        ┊            ___     ___     ___     ___     ___
        ┊ CLK_I   __/   \___/   \___/   \___/   \___/   \__
        ┊            _______________________
        ┊ CYC_O   __/                       \______________
        ┊            _______________________
        ┊ STB_O   __/                       \______________
        ┊                            _______
        ┊ ACK_I   __________________/       \______________
        ┊ 
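       
        The Standard handshake can also be modelled clock by clock. This is a
        minimal Python sketch rather than HDL: `run_standard` and its
        `slave_latency` parameter are hypothetical names, and the slave is assumed to
        raise `ACK_I` a fixed number of clocks after each request starts.

```python
def run_standard(requests, slave_latency):
    """Return one (CYC_O, STB_O, ACK_I) tuple per clock for a Standard master.

    slave_latency is an assumed fixed number of clocks between the start of a
    request and the clock where the slave holds ACK_I high.
    """
    if slave_latency < 1:
        raise ValueError("slave_latency must be at least 1")
    trace = []
    for _ in range(requests):
        for clock in range(1, slave_latency + 1):
            ack = 1 if clock == slave_latency else 0
            # CYC_O and STB_O stay high until ACK_I is seen, ACK clock included.
            trace.append((1, 1, ack))
        # The next request can only start on the clock after ACK_I.
    trace.append((0, 0, 0))  # bus released after the last ACK_I
    return trace


for cyc, stb, ack in run_standard(requests=1, slave_latency=3):
    print(cyc, stb, ack)
```

        With a latency of 3, the trace has `CYC_O` and `STB_O` high for three
        clocks and `ACK_I` high on the third one, as in the waveform.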
       
        In **Pipelined mode**, a master issues a request by taking `STB_O` high, and
        instead of waiting for `ACK_I` before taking it back low, it checks `STALL_I`:
        if high, it waits; if low, it considers the request queued by the slave, and
        may submit another one right away. In that case, `ACK_I` only tells the
        master that a queued request has finished.
       
        ┊            ___     ___     ___     ___     ___
        ┊ CLK_I   __/   \___/   \___/   \___/   \___/   \__
        ┊            _______________________________
        ┊ CYC_O   __/                               \______
        ┊            _______________
        ┊ STB_O   __/               \______________________
        ┊                                    _______
        ┊ ACK_I   __________________________/       \______
        ┊            _______
        ┊ STALL_I __/       \______________________________
        ┊ 
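       
        The Pipelined handshake can be sketched the same way in Python. The names
        `run_pipelined`, `ack_delay` and `stall_first` are hypothetical, and the
        slave is assumed to stall only during the first `stall_first` clocks, then
        accept one request per clock and acknowledge each one `ack_delay` clocks
        after accepting it.

```python
from collections import deque


def run_pipelined(requests, ack_delay, stall_first=0):
    """Return one (CYC_O, STB_O, STALL_I, ACK_I) tuple per clock."""
    if ack_delay < 1:
        raise ValueError("ack_delay must be at least 1")
    trace = []
    to_send = requests   # requests not yet accepted by the slave
    in_flight = deque()  # clock numbers at which each ACK_I is due
    acked = 0
    clock = 0
    while acked < requests:
        stb = 1 if to_send > 0 else 0
        stall = 1 if clock < stall_first else 0
        ack = 1 if in_flight and in_flight[0] == clock else 0
        if ack:
            in_flight.popleft()
            acked += 1
        if stb and not stall:
            # STALL_I is low: the request is queued, STB_O may move on.
            to_send -= 1
            in_flight.append(clock + ack_delay)
        trace.append((1, stb, stall, ack))
        clock += 1
    trace.append((0, 0, 0, 0))  # CYC_O released after the last ACK_I
    return trace


for row in run_pipelined(requests=1, ack_delay=2, stall_first=1):
    print(*row)
```

        With one stalled clock and an `ack_delay` of 2, the trace reproduces the
        waveform: `STB_O` is high for two clocks, `STALL_I` for the first one, and
        `ACK_I` arrives on the fourth clock.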
       
        In both cases, `CYC_O` stays up through the whole transaction, and `ACK_I`
        announces that the request is done.
       
        Other signals, such as data, read/write or address have been omitted for
        clarity.
       
       Standard uses one less signal
       ─────────────────────────────
        Implementing a Pipelined slave does not turn out to be more complex in practice:
       
        • If the slave is simple and gives single-clock answers, the extra `STALL_I`
          can be tied low (`STALL_I = 0`) and ignored.
       
        • If the slave needs multiple cycles before taking a new request, the
          equivalent of `STALL_I` would have been needed in Standard mode anyway, in
          the form of an internal `busy` register.
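       
        The second case can be sketched as a small Python model of a slave whose
        stall output is driven straight from its busy state. `BusySlave`,
        `work_clocks` and `remaining` are hypothetical names, and `tick()` stands in
        for a rising clock edge; this is an illustration, not a reference design.

```python
class BusySlave:
    """Multi-cycle slave: the stall output mirrors an internal busy counter."""

    def __init__(self, work_clocks):
        self.work_clocks = work_clocks
        self.remaining = 0  # clocks of work left; non-zero means busy
        self.ack = 0        # registered acknowledge

    @property
    def stall(self):
        return 1 if self.remaining > 0 else 0

    def tick(self, cyc, stb):
        """One rising clock edge: sample the inputs, update the registers."""
        self.ack = 0
        if self.remaining > 0:
            self.remaining -= 1
            if self.remaining == 0:
                self.ack = 1  # work finished, acknowledge on the next clock
        elif cyc and stb:
            self.remaining = self.work_clocks  # request accepted


slave = BusySlave(work_clocks=2)
to_send, acks, log = 2, 0, []
while acks < 2:
    stb = 1 if to_send else 0
    stall, ack = slave.stall, slave.ack  # wire values during this clock
    log.append((stb, stall, ack))
    if stb and not stall:
        to_send -= 1
    acks += ack
    slave.tick(cyc=1, stb=stb)
print(log)
```

        The second request is held by the master while the stall output is high,
        and is only accepted once the first one has been acknowledged.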
       
        A Standard master is a bit simpler to implement, though: it does not have to
        wait for the request to be queued and then wait again for the slave to
        provide an answer. It only has to wait for `ACK_I`.
       
       Pipelined for better throughput
       ───────────────────────────────
        In the timing examples above, the slave takes 3 cycles to work on the request,
        and then sets the `ACK_I` signal.
       
        The Pipelined mode seems to take one more clock cycle to complete, but it
        still has a higher throughput: there is no need to wait for the result to be
        available before submitting a new request.
       
        This only works if the slave has a buffer, such as a FIFO, to queue the
        incoming requests and work on them later.
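       
        A rough count of clocks makes the trade-off concrete. The sketch below is
        Python arithmetic under stated assumptions: the function names are
        hypothetical, the Standard slave acknowledges after `latency` clocks, and
        the Pipelined slave stalls `stall_first` clocks, then accepts one request
        per clock and acknowledges each one `ack_delay` clocks later.

```python
def standard_clocks(requests, latency):
    # Each request occupies the bus until its ACK_I, so the costs add up.
    return requests * latency


def pipelined_clocks(requests, ack_delay, stall_first=0):
    # Requests are accepted back to back; only the last delay is exposed.
    return stall_first + requests + ack_delay


for n in (1, 4, 16):
    print(n, standard_clocks(n, latency=3),
          pipelined_clocks(n, ack_delay=2, stall_first=1))
# 1 3 4
# 4 12 7
# 16 48 19
```

        A single request costs one more clock in Pipelined mode under these
        assumptions, but the gap reverses as soon as several requests are queued.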
       
       Pipelined as easy to implement as Standard
       ──────────────────────────────────────────
        Pipelined mode may seem more difficult to implement, since it suggests that a
        complex queuing mechanism has to be written for it, but a pipeline is entirely
        optional even in Pipelined mode.
       
        Only `ACK_I` needs to be shifted by one clock, which is done by using a
        register instead of a wire for it. This adds the needed delay, because
        registers apply changes on the next clock.
       
        That way, it is still possible to write very simple modules that do everything
        in a single clock.
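       
        As an illustration, here is a Python model of such a single-clock slave
        with no queue at all. `SimpleSlave` and its attributes are hypothetical
        names, `tick()` stands in for a rising clock edge, and the `ack` attribute
        plays the role of the registered acknowledge that the master sees as
        `ACK_I`.

```python
class SimpleSlave:
    """Single-clock Pipelined slave: stall is constant 0, ACK is a register."""

    def __init__(self):
        self.memory = {}
        self.stall = 0     # never stalls: every request is taken at once
        self.ack = 0       # registered, so it changes on the next clock
        self.data_out = 0  # registered read data

    def tick(self, cyc, stb, we=0, adr=0, dat=0):
        """One rising clock edge: sample the inputs, update the registers."""
        accepted = bool(cyc and stb)
        if accepted:
            if we:
                self.memory[adr] = dat
            else:
                self.data_out = self.memory.get(adr, 0)
        self.ack = 1 if accepted else 0


slave = SimpleSlave()
seen = []
for stb in (1, 1, 0, 0):
    seen.append((stb, slave.ack))  # wire values during this clock
    slave.tick(cyc=1, stb=stb, we=1, adr=0, dat=7)
print(seen)  # [(1, 0), (1, 1), (0, 1), (0, 0)]
```

        The acknowledge column is the strobe column delayed by one clock, which
        is all the "pipeline" this slave needs.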
       
       Standard has a 1-clock better latency
       ─────────────────────────────────────
        A single clock cycle is indeed consumed by Wishbone in its Pipelined mode. This
        could lead to a higher overall latency, in particular if multiple Wishbone
        buses are chained together.
       
       Pipelined may help with timing
       ──────────────────────────────
        If operations that are too complex are done in a single clock cycle, it may
        take too much time for all the signals to settle down and stabilise before
        the next clock tick.
       
        With too long a chain of logic, the timing constraint (the clock rate) might
        be missed.
       
        A long chain of logic can be broken into two steps with registers, so that
        half of the work is done before the register and half after it, leaving
        roughly half of the work to be done in a single clock tick.
       
        If Wishbone is used in Standard mode, the signals have to propagate inside the
        master, then to the slave, then inside the slave, then back to the master,
        all of that probably within a single clock tick.
       
        Placing a register in the bus, by making `ACK_O` a register, breaks the long
        chain from master to slave and back to master: the register is an
        intermediate step where the signal pauses before going back to the master,
        making sure it had time to settle down in the slave.
       
        That way, if the timings of the slave are fine with one master, they have
        better chances of being fine with any other master, since the timings of the
        slave and the master no longer add up.
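       
        A back-of-the-envelope calculation shows the effect. The delays below are
        made-up numbers, the function and parameter names are hypothetical, and
        setup time, clock-to-output delay and routing are ignored, so this is only
        a sketch of the reasoning in Python.

```python
def critical_path_ns(request_path_ns, response_path_ns, registered_ack):
    """Longest delay that must fit in one clock period, in nanoseconds.

    request_path_ns: master registers to the slave's acknowledge logic.
    response_path_ns: the acknowledge back into the master's registers.
    """
    if registered_ack:
        # The register splits the loop: each half must settle on its own.
        return max(request_path_ns, response_path_ns)
    # Without a register, both halves add up within a single clock.
    return request_path_ns + response_path_ns


def max_clock_mhz(path_ns):
    return 1000.0 / path_ns


for registered in (False, True):
    path = critical_path_ns(6.0, 4.0, registered_ack=registered)
    print(registered, path, round(max_clock_mhz(path), 1))
# False 10.0 100.0
# True 6.0 166.7
```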
       
       Conclusion
       ──────────
        While Standard Wishbone seems more frequently used, the Pipelined mode seems
        a bit friendlier to timing, and most of its drawbacks, like the extra clock
        for ACK or the extra signal, would likely also appear in Standard mode.
       
        I am still new to Wishbone, and very curious about what you think about it:
        Which variant do you use? Anything that I have missed about the Standard
        mode? `me@josuah.net`
       
 (HTM)  Among notable Pipelined mode users is {ZipCPU}.
       
       Update
       ──────
 (HTM)  Looking at {this} ZipCPU article, its motivation for using Pipelined mode
        seems to be expressed in these passages:
       
        A reminder of how logic gates may "solve maths":
       
        ┊ One solution to sequencing operations is to create a giant state machine.
        ┊ The reality, though, is that an FPGA tends to create all the logic for
        ┊ every state at once, and then only select the correct answer at the end of
        ┊ each clock tick. In this fashion, a state machine can be very much like the
        ┊ simple ALU we've discussed.
       
        And the conclusion of what makes more sense:
       
        ┊ On the other hand, if the FPGA is going to implement all of the logic for
        ┊ the operation anyway, why not arrange each of those operations into a
        ┊ sequence, where each stage does something useful? This approach rearranges
        ┊ the algorithm into a pipeline.
       
 (HTM)  And its use of Wishbone is extensively explained in {https://raw.githubusercontent.com/ZipCPU/zipcpu/master/doc/orconf.pdf}.
       
       Links
       ─────
 (HTM)  • {http://cdn.opencores.org/downloads/wbspec_b4.pdf#page=91}
       
 (HTM)  • {http://zipcpu.com/zipcpu/2017/05/29/simple-wishbone.html}
       
 (HTM)  • {https://zipcpu.com/blog/2017/08/14/strategies-for-pipelining.html}
       
 (HTM)  • {https://raw.githubusercontent.com/ZipCPU/zipcpu/master/doc/orconf.pdf}