[HN Gopher] VRoom A high end RISC-V implementation
       ___________________________________________________________________
        
       VRoom A high end RISC-V implementation
        
       Author : cmurf
       Score  : 108 points
       Date   : 2022-03-21 16:07 UTC (6 hours ago)
        
 (HTM) web link (moonbaseotago.github.io)
 (TXT) w3m dump (moonbaseotago.github.io)
        
       | titzer wrote:
       | This is a very ambitious project, so respect and good luck.
       | 
       | I am wondering if the performance will pan out in practice, as it
       | doesn't seem to have a very deep pipeline, so getting high
        | clock speeds may be a challenge. In particular the 5 clock branch
        | mispredict penalty suggests the pipeline design is fairly simple.
       | Production CPUs live and die by the gate depth and hit/miss
       | latency of caches and predictors. A longer pipeline is the
       | typical answer to gate delay issues. Cache design (and register
       | file design!) is also super subtle; L1 is extremely important.
        
         | evilos wrote:
         | They mention in their arch slides that they expect to add at
         | least 2 more pipeline stages to hit higher clocks.
        
         | Taniwha wrote:
         | As mentioned here I expect that reality will intrude and the
          | pipe will get bigger - of course a good BTC (and spending lots
         | of gates on it) is important because that's what mitigates that
         | deep pipe.
         | 
          | I haven't published my latest work yet (end of the week) - I
          | have a minor bump to ~6.5 DMips/MHz - Dhrystone isn't
          | everything but it's still proving a useful tool to tweak the
          | architecture (which is what's going on now)
        
       | blacklion wrote:
       | > Eventually we'll do some instruction combining using this
       | information (best place may be at entry to I$0 trace cache), or
       | possibly at the rename stage
       | 
        | So much for "we will only do the simplest of instructions and
        | u-op fusing will fix performance".
       | 
       | It is why I'm very suspicious about this argument from RISC-V
       | proponents.
        
         | Taniwha wrote:
         | I think that we need lots of trace before we decide which ops
         | make sense to combine
        
           | blacklion wrote:
            | As far as I understand, RISC-V proponents want to have
            | "recommended" instruction sequences for compilers, to avoid
            | the situation where different RISC-V CPUs need different
            | compilations. If different RISC-V implementations have
            | different "fuseable" instruction sequences, we will be in
            | the dreadful situation where you need the exact "-mcpu" for
            | decent performance and binary packages will be very
            | suboptimal.
            | 
            | And such "conventions" are a bad idea, like comments in
            | code, IMHO. They cannot be checked by tools, etc.
        
             | tsmi wrote:
             | > you will need exact "-mcpu" for decent performance
             | 
             | For some definitions of decent, I think that ship has
             | sailed.
             | 
             | https://clang.llvm.org/docs/CrossCompilation.html
             | 
              | -target <triple>
              | 
              | The triple has the general format
              | <arch><sub>-<vendor>-<sys>-<abi>, where:
              | 
              |     arch   = x86_64, i386, arm, thumb, mips, etc.
              |     sub    = for ex. on ARM: v5, v6m, v7a, v7m, etc.
              |     vendor = pc, apple, nvidia, ibm, etc.
              |     sys    = none, linux, win32, darwin, cuda, etc.
              |     abi    = eabi, gnu, android, macho, elf, etc.
             | 
             | Note, none of those are exhaustive...
        
             | ncmncm wrote:
             | It is always frustrating when you have put in the work to
             | optimize code, and turn out to have pessimized it for the
             | next chip over.
             | 
             | The extremum for this is getting a 10x performance boost by
             | using, e.g., POPCNT, and suffering instead a 10-100x
             | pessimization because POPCNT is trapped and emulated.
        
       | themerone wrote:
       | What does GPL mean for a chip design?
       | 
       | I understand how it applies to the HDL, but I doubt that it
        | obligates you to open your code to users of physical chips.
        
         | Taniwha wrote:
         | Well (author here) - this is a private project - typically such
          | a project would be very proprietary - people don't get to show
         | their work.
         | 
         | But I'm looking to find someone to build this thing, it's been
         | a while since I last built chips (last CPU I helped design
          | never saw the light of day due to reasons that had little to do
         | with how well it worked). So I need a way to show it off, show
         | it's real. So GPLing it is a great way to do that - as is
         | showing up on HN (thanks to whoever posted this :-).
         | 
         | In practice the RTL level design of a processor is only a part
         | of making a real processor - a real VRoom! would likely have
          | hand built ALUs, shifters, caches, register files etc - those
         | things are all in the RTL at a high level but are really
         | different IP - likely they'd be entangled with GPL and a
         | manufacturer might feel that to be an issue.
         | 
         | However I'm happy to dual license (I want to get it built, and
         | maybe get paid to do it).
         | 
         | Also about half the companies building RISCVs are in China
         | (I've been building open source hardware in China for a decade
         | or so now, so I know there's lots of smart people there) - they
         | have a real problem (in the West) building something like this
         | - all the rumors about supply chain/etc stuff - having an open
          | sourced GPL'd reference that's cycle accurate is a way to help
         | build confidence.
        
           | Taniwha wrote:
           | One other comment about why GPLing something is important for
            | someone like me - publishing my 'secrets' is a great way to
           | turn them into "prior art" - you read it here first, you
           | can't patent it now - I can protect my ideas from becoming
           | part of someone else's protected IP by publishing it.
           | 
           | I spent a few years working on an x86 clone, I had maybe 10
           | (now expired) patents on how to get around stupidly obvious
           | things that Intel had patented - (or around ways to get
           | around ways to get around In tel that other's had patented) -
           | frankly from a technical POV it was all a lot of BS,
           | including my patents
        
         | wmf wrote:
         | It means "pay me to remove the GPL". It's fake GPL like MySQL
         | and MongoDB.
        
           | homarp wrote:
           | https://www.fsf.org/blogs/rms/selling-exceptions
           | 
           | RMS wrote "I've considered selling exceptions acceptable
           | since the 1990s, and on occasion I've suggested it to
           | companies. Sometimes this approach has made it possible for
           | important programs to become free software."
        
         | Someone wrote:
         | I guess you could argue that, if you bought a device with this
         | CPU, you should be able to replace the CPU with one of your own
         | that's derived from this one.
         | 
         | I think that's the spirit of the GPL in a hardware context, but
         | I don't think it's a given (by a long stretch) that courts
         | would accept that argument.
         | 
         | A somewhat clearer case would be if you bought a device that
         | implements a GPL licensed design in a FPGA. I think you could
         | argue such devices cannot disable the reprogrammability of the
         | FPGA.
        
         | dmoreno wrote:
          | IANAL, but as far as I know it's very important that it's
          | GPLv3, which includes the anti-tivoization clause, meaning that
          | hardware that uses this firmware must provide full source code
          | and a way to let you use your own firmware.
         | 
         | If somehow this code is not in a firmware... No idea.
        
         | marcodiego wrote:
          | AFAICS, it is the same as software: if you changed and
          | distributed it, you have to provide your changes if asked to.
        
         | bragr wrote:
         | Also IANAL, but as I understand it, the HDL would compile down
         | to a sequence of gates, and presumably we'd treat that the same
         | way as a binary - a "Non-Source Form" as the GPL calls it. So
         | anyone that receives a copy of those gates (either as a binary
         | blob for a FPGA, or pre-flashed on a FPGA, or made on actual
         | silicon) would be entitled to the source as per GPL3 section 6
         | "Conveying Non-Source Forms".
         | 
         | I don't think the GPL anti-tivoization clause has much bearing
         | there other than presumably you'd have to provide the full tool
         | chain that resulted in the final gates - presumably this would
         | affect companies producing actual chips the most since you
          | couldn't have any proprietary optimization or layout steps in
         | producing the actual chip design, though also no DRM for FPGAs
         | (is that even a thing?)
        
       | Taniwha wrote:
       | Author here (Paul Campbell) - AMA
        
         | ncmncm wrote:
         | Are you making any attempt at a learning branch predictor? Is
         | anything published about really-current methods?
        
           | Taniwha wrote:
           | Not yet - I have a pretty generic combined bimodal/global
           | predictor - there's a lot of research on BTCs - it's easy to
            | throw gates at this area - I can imagine chips spending
            | 20-30% of their area on BTC just to keep the rest running
           | 
           | My next set of work in this area will be integrating an L0
           | trace cache into the existing BTC - that will help me greatly
           | up the per-clock issue rate
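            | 
            | As a rough sketch (made-up names, not VRoom!'s actual RTL),
            | the bimodal half of such a predictor is just a PC-indexed
            | table of 2-bit saturating counters:
            | 
            |     module bimodal_predictor #(parameter IBITS = 10) (
            |         input              clk,
            |         // prediction port: indexed with low PC bits
            |         input  [IBITS-1:0] predict_index,
            |         output             predict_taken,
            |         // update port, driven from branch resolution
            |         input              update_valid,
            |         input  [IBITS-1:0] update_index,
            |         input              update_taken
            |     );
            |         reg [1:0] ctr [0:(1<<IBITS)-1];
            | 
            |         // counter MSB is the prediction:
            |         // 00/01 = not taken, 10/11 = taken
            |         assign predict_taken = ctr[predict_index][1];
            | 
            |         // saturate up on taken, down on not taken
            |         always @(posedge clk)
            |             if (update_valid) begin
            |                 if (update_taken && ctr[update_index] != 2'b11)
            |                     ctr[update_index] <= ctr[update_index] + 2'b01;
            |                 else if (!update_taken && ctr[update_index] != 2'b00)
            |                     ctr[update_index] <= ctr[update_index] - 2'b01;
            |             end
            |     endmodule
            | 
            | The global half of a combined predictor XORs a branch
            | history register into that index (as in gshare), with a
            | chooser table picking between the two.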
        
         | titzer wrote:
         | As a language VM implementor, I would really love to have a
         | conditional call instruction, like arm32. AFAICT this would be
         | a relatively simple instruction to implement in the CPU. Is
         | that accurate?
        
           | Taniwha wrote:
           | yes and no - there's a few issues here:
           | 
           | 1 - architectural - RISCV has a nice clean ISA, it's adding
            | instructions quickly, CMOV is a contentious issue there - I'm
           | not an expert on the history so I'll let others relitigate it
           | - it's easy to add new instructions to a RISCV machine,
           | unlike Intel/ARM it's even encouraged - however adding a new
           | instruction to ALL machines is more difficult and may take
           | many years. But unlike Intel/ARM there IS a process to adopt
           | new instructions that doesn't involve just springing them on
           | your customers
           | 
           | 2 - remember RISCV is a no-condition code architecture - that
           | would make CMOV require 3 register file ports (the only such
           | instruction that also requires an adder [for the compare]) -
           | register file ports are extremely expensive, especially for
           | just 1 instruction
           | 
           | 3 - micro-architectural - on simple pipes CMOV is pretty
           | simple (you just inhibit register write, plus do something
           | special with register bypass) I'd have to think very hard
           | about how to do it on something like VRoom! with out of
           | order, speculative, register renaming - I can see a naive way
           | to do it, but ideally there should be a way to nullify such
           | an instruction early in the pipe which would mean some sort
           | of renaming-on-the-fly hack
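            | 
            | On a simple in-order pipe that write-inhibit trick is
            | roughly the following (hypothetical signal names, just a
            | sketch of the idea):
            | 
            |     // is_cmov, wb_valid, wb_rd, wb_result and rs2_value are
            |     // assumed to come from the writeback pipeline register;
            |     // a CMOV-style op only commits when its condition holds
            |     wire cmov_cond = (rs2_value != 64'd0); // "move if rs2 != 0"
            |     wire wb_write_en = wb_valid && (!is_cmov || cmov_cond);
            | 
            |     always @(posedge clk)
            |         if (wb_write_en && wb_rd != 5'd0)
            |             regfile[wb_rd] <= wb_result;
            | 
            | The bypass network then also has to know whether the write
            | actually happened - that's the "something special" above.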
        
             | titzer wrote:
              | Note I was talking about a conditional _call_ instruction,
              | which is very useful for, e.g., safety checks.
        
               | Taniwha wrote:
               | conditional CALL is MUCH harder to implement well - it's
               | because the call part essentially happens up at the
               | PC/BTC end of the CPU while at the execution stage what
                | you're doing is writing the saved PC to the LR/etc and
                | doing the register compare (or accessing a condition code
                | that may not have been calculated yet).
               | may not have been calculated yet).
               | 
               | In many ways I guess it's a bit like a conditional branch
               | that needs a write port - in RISCV, without condition
               | codes, your conditional call relative branch distance
               | will be smaller because the instruction encoding will
               | need to encode 2-3 registers
        
               | dzaima wrote:
               | I imagine something like that might be viable in the to-
               | be-designed RISC-V J extension, as safety checks (mostly
               | in JITs) would be close to the only thing benefiting from
               | this.
               | 
               | Though, maybe instead of a conditional call, a
               | conditional signal could do, which'd clearly give no
               | expectation of performance if it's hit, simplifying the
               | hardware effort required.
        
               | Taniwha wrote:
               | Yeah, I can imagine that being particularly easy to
                | implement in VRoom! - exceptions are handled synchronously
               | at the end of the pipe (with everything before them
               | already committed, and everything after flushed).
               | Instructions that can convert to exceptions (like loads
                | and stores taking TLB misses) essentially hit two
                | different functional units - a conditional exception
                | would be tested in a branch/ALU unit and then either
                | transition into effectively a no-op or into an exception
                | and synchronise the pipe when they hit the commit stage
        
           | kragen wrote:
           | 8080 had it too, 8086 dropped it due to disuse. In a modern
           | context it's just a performance hack, an alternative to
           | macro-op fusion, but for high-performance RISC-V (or i386, or
           | amd64, or aarch64) you need macro-op fusion anyway.
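            | 
            | As a concrete instance of the kind of pair a fusing decoder
            | looks for, here's a sketch (made-up signal names) of
            | detecting a lui+addi pair that builds a 32-bit constant so
            | it can issue as one uop:
            | 
            |     wire [31:0] i0, i1;  // adjacent instructions from fetch
            |     wire i0_is_lui  = (i0[6:0] == 7'b0110111);
            |     wire i1_is_addi = (i1[6:0] == 7'b0010011) &&
            |                       (i1[14:12] == 3'b000);
            |     // same destination, and addi reads what lui wrote
            |     wire same_reg = (i0[11:7] == i1[19:15]) &&
            |                     (i0[11:7] == i1[11:7]);
            |     wire fuse_lui_addi = i0_is_lui && i1_is_addi && same_reg;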
        
         | sitkack wrote:
          | What does your benchmarking workflow look like? I am
          | interested in:
          | 
          |   * From a high level what does your dev iteration look like?
          |   * Getting instruction traces, timing and resimulating those
          |     traces
          |   * Power analysis, timing analysis (do you do this as part
          |     of performance simulation)?
          |   * Do you benchmark the whole chip or specific sub units?
          |   * How do you choose what to focus on in terms of
          |     performance enhancements?
          |   * What areas are you focusing on now?
          |   * What tools would make this easier?
        
           | Taniwha wrote:
            | At the moment I'm just starting to work my way up the
            | hierarchy of benchmarks, Dhrystone's been useful though it's
           | nearing the end of its use - I build the big FPGA version (on
           | an AWS FPGA instance) to give me a place to run bigger things
           | exactly like this.
           | 
           | I currently run low level simulations in Verilator where I
           | can easily take large internal architectural trace, and
           | bigger stuff on AWS (where that sort of trace is much much
           | harder)
           | 
           | I haven't got to the power analysis stage - that will need to
           | wait until we decide to build a real chip - timing will
           | depend on final tools if we get to build something real,
           | currently it's building on Vivado for the FPGA target.
           | 
           | Mostly I'm doing whole chip tests - getting everything to
           | work well together is sort of the area I'm focusing on at the
           | moment (correctness was the previous goal - being together
           | enough to boot linux), the past 3 months I've brought the
            | performance up by a factor of 4 - the trace cache might get
           | me 2x more if I'm lucky.
           | 
           | I spend a lot of time looking at low level performance, at
           | some level I want to get the IPC (instructions per clock) of
           | the main pipe as high as I can so I stare at the spots where
           | that doesn't happen
           | 
           | I'm using open source tools (thanks everyone!)
        
             | tromp wrote:
             | > dhrystone's been useful though it's nearing the end of
             | its use
             | 
             | Would my fhourstones [1] [2] benchmark be of any use?
             | 
             | [1] https://tromp.github.io/c4/fhour.html
             | 
             | [2] https://openbenchmarking.org/test/pts/fhourstones
        
               | Taniwha wrote:
               | thanks I'll have a look - I'm not so interested in raw
               | scores, more about relative numbers so I can judge
               | different architectural experiments
        
             | [deleted]
        
         | gary_0 wrote:
         | From what little I know about microarchitecture, this seems
         | extremely impressive. Hopefully these aren't dumb questions:
         | 
         | Are there GPL'd designs for PCIe, USB, etc, that could be used
         | to incorporate this into a SoC design? If not, how much work is
         | that compared to this?
         | 
         | Also, what other kind of technical considerations would be
         | involved to make this into a "real" chip on something like
         | 28nm?
        
           | Taniwha wrote:
           | Great questions - I'm using an open source UART from someone
            | else, and for the AWS FPGA system I have a 'fake' disk
           | driver plus timers/interrupt controllers etc
           | 
            | So far I haven't needed USB/ether/PCIe/etc - I've sort of
           | sketched out a place for those to live - I think that for a
           | high end system like this one you can't just plug something
           | in - real performance needs some consideration of how:
           | 
            | - cache coherency works
            | - VM and virtual memory works (essentially page tables for
            |   IO devices)
            | - PMAP protections from I/O space (so that devices can't
            |   bypass the CPU PMAPs that are used to manage secure
            |   enclaves in machine mode)
            | 
            | So in general I'm after something unique, or at least
            | slightly bespoke.
           | 
           | I also think there's a bit of a grand convergence going on in
            | this area around serdes, which are sort of becoming a new
            | generic interface - PCIe, high speed ether, new USBs, disk
            | drives etc are all essentially bunches of serdes with
           | different protocol engines behind them - a smart SoC is going
           | to split things this way for maximum flexibility
        
             | rwmj wrote:
             | Don't know much about the details, but this company /
             | person claims to have developed some open source IP:
             | http://www.enjoy-digital.fr/
        
           | Lramseyer wrote:
           | Not Paul Campbell, but I'll share what I know on the matter.
           | 
           | So GPL'd IO blocks - This is a great question, and something
           | I have definitely been asking myself! One thing to keep in
           | mind is that IO interfaces like PCIe, USB, and whatnot have a
           | Physical interface ("Phy" for short.) Those contain quite a
           | bit of analog circuitry, which is tied to the transistor
           | architecture that's used for the design.
           | 
            | That being said, a lot of interfaces that aren't DRAM
            | protocols use what's known as a SerDes Phy (short for
            | Serializer De-serializer Physical interface.) More or less,
            | they have an analog front end and a digital back end, and
            | the digital back end that connects to everything else is
            | somewhat standardized. So it wouldn't be unreasonable to
           | try to build something like an open PCIe controller that only
           | has the Transaction Layer and Data Link Layer. While there
           | are various timing concerns/constraints when not including a
           | Phy layer (lowest layer,) I don't think it's impossible.
           | 
           | The other big challenge is that anyone wanting to use an open
           | source design will definitely want the test benches and test
           | cases included in the repo (you can think of them like unit
           | tests.) Unfortunately, most of the software to actually
           | compile and run those simulations is cost prohibitive for an
           | individual, because it's licensed software. Also, the
           | companies that develop this software make a ton of money
           | selling things like USB and PCIe controllers, so I'll let you
           | draw your own conclusions about the incentives of that
           | industry.
           | 
           | Even if you were able to get your hands on the software, the
           | simulations are very computationally intensive, and
           | contribution by individuals would be challenging ...though
           | not impossible!
           | 
           | Despite those barriers, it's a direction that I desperately
           | want to see the industry move towards, and I think it's
           | becoming more and more inevitable as companies like Google
           | get involved with hardware, and try to make the ecosystem
           | more open. Chiplet architectures are also all the rage these
           | days, so it would be less of a risk for a company to attempt
           | to use an open source design.
           | 
           | I'd really be curious to hear Paul Campbell's take on this
           | question though. He definitely knows a lot more than I do!
        
             | tsmi wrote:
              | One advantage of SkyWater opening its PDK is that
              | universities are starting to backfill all the hardware
              | that is missing.
             | 
             | Here's a SerDes from Purdue. I don't think this particular
             | design has been validated in silicon yet though.
             | 
             | https://arxiv.org/abs/2105.13256
        
         | black_puppydog wrote:
         | Do you dance? :)
         | 
         | https://youtu.be/nlu0foF3WBk?t=182
         | 
         | I know, I'm leaning hard on that second "A" there. :D
        
           | Taniwha wrote:
           | heh! - I'm a Kiwi who lived and worked in Silicon Valley for
           | 20 years, moved back when the kids started high school, but
           | mostly still work there - while I was there I started a
           | company using "Taniwha" ... great for a logo, but a mistake
           | because of course no one in the US knows how to pronounce it
            | (pro-tip: the "wh" is closest to an English "f")
        
         | tsmi wrote:
         | Have you considered making an ASIC of your design?
         | https://efabless.com/open_shuttle_program
        
           | Taniwha wrote:
           | It's likely too big for those programs - I am (just now)
            | starting a build with the OpenLane/Sky tools, not with the
           | intent of actually taping out but more to squeeze the
           | architectural timing (above the slow FPGA I've been using for
           | bringing up Linux) so I can find the places where I'm being
           | stupidly unreasonable about timing (working on my own I can't
           | afford a Synopsys license)
        
             | tsmi wrote:
             | Gotcha. Did you run into any issues with yosys given that
             | it has limited system verilog support?
             | 
             | Ibex needed to add a pass with sv2v
             | https://github.com/lowRISC/ibex/tree/master/syn
        
               | Taniwha wrote:
               | I'm just starting this week, I've recently switched to
               | some use of SV interfaces and it does not like arrays of
               | them - sv2v seems the way to go - but even without that
                | yosys goes bang! - something's too big, though Vivado
                | compiles the same stuff - I rearchitected the bit that
                | might obviously be causing this but no luck so far.
        
         | tux3 wrote:
          | Any thoughts about higher level HDLs embedded in software
         | languages, like Chisel, nMigen, or others? Some other RISC-V
         | core designers claim they've had increased productivity with
         | those.
         | 
         | It seems that despite a lot of valid criticism against
          | (System)Verilog, nothing really seems to be on a trajectory to
         | replace it today. I'm not sure if that's purely inertia
         | (existing tooling, workflows, methodologies), other HDLs not
         | being attractive enough, or maybe Verilog is just good enough?
        
           | Taniwha wrote:
           | I think they're great - I earned my VLSI chops building stuff
           | in the 90s and I can write Verilog about as fast as I can
           | think so it's my goto language. I've also written a couple of
           | compilers over the years so I know it really well (you can
            | thank me for the '*' in "always @(*)"). That's just my
           | personal bias.
           | 
           | Inertia in tooling is a REALLY BIG deal - if you can't run
           | your design through simulation, (and FPGA simulation),
           | synthesis, layout/etc you'll never build a chip - it can take
            | 5-10 years for a new language feature to become ubiquitous
            | enough that you can depend on it to use it in a
           | design (I've been struggling with this using System Verilog
           | interfaces this month).
           | 
           | If you look closely at VRoom! you'll see I'm stepping beyond
           | some Verilog limitations by adding tiny programs that
           | generate bespoke bits of Verilog as part of the build process
           | - this stops me from fat fingering some bit in a giant
           | encoder but also helps me make things that SV doesn't do so
           | well (big 1-hot muxes, priority schedulers etc)
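            | 
            | For example, the generated style for a 4-way 1-hot mux is
            | just AND-masks ORed together - illustrative only, not the
            | actual generated output:
            | 
            |     module onehot_mux4 #(parameter WIDTH = 64) (
            |         input  [3:0]       sel,  // one-hot select
            |         input  [WIDTH-1:0] in0, in1, in2, in3,
            |         output [WIDTH-1:0] out
            |     );
            |         // each select bit masks its input, results are ORed
            |         assign out = ({WIDTH{sel[0]}} & in0) |
            |                      ({WIDTH{sel[1]}} & in1) |
            |                      ({WIDTH{sel[2]}} & in2) |
            |                      ({WIDTH{sel[3]}} & in3);
            |     endmodule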
        
           | snakke wrote:
            | As an aside, nMigen was rebranded a few months ago to
            | Amaranth, and the latest and active development can be found
            | here: https://github.com/amaranth-lang/amaranth . In case
            | people googled nMigen and came to the repository that hasn't
            | been updated in two years.
        
       | [deleted]
        
       | codedokode wrote:
       | The presentation was interesting; but I would like to write an
       | idea that is tangentially related to this CPU.
       | 
        | I noticed that modern CPUs are optimized for legacy monolithic
        | OS kernels like Linux or Windows. But having a large,
        | multimegabyte kernel is a bad idea from a security standpoint. A
        | single mistake or intentional error in some rarely used
        | component (like a temperature sensor driver) can get an attacker
        | full access to the system. Again, an error in any part of a
        | monolithic kernel can cause system failure. And the Linux kernel
        | doesn't even use static analysis to find bugs! It is obvious
        | that using microkernels could solve many of the issues above.
       | 
       | But microkernels tend to have poor performance. One of the
       | reasons for this could be high context switch latency. CPUs with
       | high context switch latency are only good for legacy OSes and not
       | ready for better future kernels. Therefore, either we will find a
       | way to make context switches fast or we will have to stay with
       | large, insecure kernels full of vulnerabilities.
       | 
       | So I was thinking what could be done here. For example, one thing
       | that could be improved is to get rid of address space switch. It
       | causes flushes of various caches and it hurts performance.
        | Instead, we could always use a single mapping from virtual to
        | physical addresses, but allocate each process a different
        | virtual address range. To implement this, we could add two
        | registers, which would hold the minimum and maximum accessible
        | virtual addresses. It should be easy to check the address
        | against them to prevent speculative out of bounds memory
        | accesses.
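        | 
        | Sketched in Verilog (hypothetical names, just to illustrate the
        | idea), the check is two comparators in front of the memory
        | pipe:
        | 
        |     // vaddr, vrange_min, vrange_max are 64-bit inputs; fault
        |     // any access outside the process's virtual range before
        |     // it can be used, even speculatively
        |     wire in_range = (vaddr >= vrange_min) &&
        |                     (vaddr <= vrange_max);
        |     wire access_fault = access_valid && !in_range;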
       | 
       | By the way, 32-bit x86 architecture had segments, that could be
       | used to divide single address space between processes.
       | 
       | Another thing that can take time is saving/restoring registers on
       | context switch. One way to solve the problem could be to use
       | multiple banks (say, 64 banks) of registers that can be quickly
        | switched; another way would be to zero out registers on return
        | from the kernel and let processes save them if they need to.
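        | 
        | A minimal sketch of the banked-register idea (hypothetical
        | names): the kernel writes a bank number on a context switch and
        | all register accesses go through it:
        | 
        |     module banked_regfile (
        |         input         clk,
        |         input  [5:0]  cur_bank,  // set by kernel on a switch
        |         input  [4:0]  rs1, rd,
        |         input         rd_write_en,
        |         input  [63:0] rd_result,
        |         output [63:0] rs1_value
        |     );
        |         reg [63:0] regs [0:63][0:31];  // 64 banks x 32 regs
        | 
        |         // x0 is hardwired to zero in every bank
        |         assign rs1_value = (rs1 == 5'd0) ? 64'd0
        |                                          : regs[cur_bank][rs1];
        | 
        |         always @(posedge clk)
        |             if (rd_write_en && rd != 5'd0)
        |                 regs[cur_bank][rd] <= rd_result;
        |     endmodule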
       | 
       | Or am I wrong somewhere and fast context switches cannot be
       | implemented this way?
        
         | db65edfc7996 wrote:
         | >But microkernels tend to have poor performance.
         | 
         | Citation needed. What kind of hit are we talking about? 5%?
         | 90%? We have supercomputers from the future that have capacity
         | to spare. I would be willing to take an enormous performance
         | hit for better security guarantees on essential infrastructure
         | (routers, firewalls, file servers, electrical grid, etc).
        
         | kragen wrote:
         | SASOSes are interesting, sometimes extending a 64-bit address
         | space to cover a whole cluster, but they aren't compatible with
         | anything that calls fork().
         | 
         | The various variants of L4 have pretty good context-switch
         | latency even on traditional CPUs, and seL4 in particular is
         | formally proven correct on a few platforms. Spectre+Meltdown
         | mitigation was painful for them, but they're still pretty good.
         | 
         | Lots of microcontrollers have no MMUs but do have MPUs to keep
         | a user task from cabbaging the memory of the kernel or other
         | tasks. Not sure if any of them use the PDP-11-style base+offset
         | segment scheme you're describing to define the memory regions.
         | 
         | Protected-memory multitasking on a multicore system doesn't
         | need to involve context switches, especially with per-core
         | memory.
         | 
         | Even on Linux, context switches are cheap when your memory map
         | is small. httpdito normally has five pages mapped and takes
         | about 100 microseconds (on a 2.8GHz amd64 laptop) to fork,
         | serve a request, and exit. I think I've measured context
         | switches a lot faster than that between two existing processes.
         | 
         | Multiple register banks for context switching go back to the
          | CDC 6600's peripheral processor (FEP) or maybe the TX-2 on
         | which Sutherland wrote SKETCHPAD; it has a lot of advantages
         | beyond potentially cheaper IPC. Register bank switching for
         | interrupt handling was one of the major features the Z80 had
          | over the 8080 (you can think of the interrupt handler as being
         | the kernel). The Tera MTA in the 01990s was at least widely
         | talked about if not widely imitated. Switching register sets is
         | how "SMT" works and also sort of how GPUs work. And today
         | Padauk's "FPPA" microcontrollers (starting around 12 cents
         | IIRC) use register bank switching to get much lower I/O latency
         | than competing microcontrollers that must take an interrupt and
         | halt background processing until I/O is complete.
         | 
         | Another alternative approach to memory protection is to do it
         | in software, like Java, Oberon, and Smalltalk do, and Liedtke's
         | EUMEL did; then an IPC can be just an ordinary function call.
         | Side-channel leaks like Spectre seem harder to plug in that
         | scenario. GC may make fault isolation difficult in such an
         | environment, particularly with regard to performance bugs that
         | make real-time tasks miss deadlines, and possibly Rust-style
         | memory ownership could help there.
        
           | codedokode wrote:
           | What I would like to have is a context switch latency
           | comparable to a function call. For example, if in a
           | microkernel system bus driver, network card driver, firewall,
           | TCP stack, socket service are all separate userspace
           | processes, then every time a packet arrives there would be a
           | context-switching festival.
           | 
           | As I understand, in microkernel OSes most system calls are
           | simply IPCs - for example, network card driver passes
           | incoming packet to the firewall. So there is almost no kernel
           | work except for context switch. That's why it has to be as
           | fast as possible and resemble a normal function call, maybe
           | even without invoking the kernel at all. Maybe something like
           | Intel's call gate, but fast.
           | 
           | > they aren't compatible with anything that calls fork().
           | 
           | I wouldn't miss it; for example, Windows works fine without
           | it.
        
         | kinghajj wrote:
         | You should look into the Mill CPU architecture.[0] Its design
         | should make microkernels much more viable.
         | 
         | * Single 64-bit address space. Caches use virtual addresses.
         | 
         | * Because of that, the TLB is moved _after_ the last level
          | cache, so it's not on the critical path.
         | 
         | * There's instead a PLB (protection lookaside buffer), which
         | can be searched in parallel with cache lookup. (Technically,
         | there's three: two instruction PLBs and one data PLB.)
         | 
         | [0]: https://millcomputing.com/
        
           | foobiekr wrote:
           | I was also going to mention the Mill, but it's become a bit
           | of a Flying Dutchman that people tell tales of but which
           | probably doesn't exist.
        
         | thechao wrote:
         | Segment registers are _precisely_ how NT does context
         | switching. I think it may be restricted to just switching from
          | user- to kernel- threads. I can't remember if there's thread-
         | to-thread switching using segment registers -- I feel like this
         | was a thing, or it was just a thing we did when we tried to
         | boot NT on Larrabee. (Blech.)
        
         | wrs wrote:
         | Long ago, we in the Newton project at Apple had that idea. We
         | (in conjunction with ARM) were defining the first ARM MMU, so
         | we took the opportunity to implement "domains" of memory
         | protection mappings that could be quickly swapped at a context
         | switch. So you get multiple threads in the same address space,
         | but with independent R/W permission mappings.
         | 
         | I think a few other ARM customers were intrigued by the
         | security possibilities, but the vast majority were more like
         | "what is this bizarre thing, I just want to run Unix", so the
         | feature disappeared eventually.
         | 
         | Here's some ARM documentation if you want to pull this thread:
          | https://developer.arm.com/documentation/dui0056/latest/cac...
        
           | wrs wrote:
           | Too late to edit, but here's a documentation link that works
           | better:
           | https://developer.arm.com/documentation/dui0056/d/caches-
           | and...
        
           | StillBored wrote:
            | It's similar to the original Mac OS, which used handles to
            | track/access/etc memory requested from the OS and swap it
            | to disk as needed. First you request the space, then you
            | request access, which pins it into RAM.
           | 
           | PalmOS was another one that worked similarly.
           | https://www.fuw.edu.pl/~michalj/palmos/Memory.html
        
         | Taniwha wrote:
         | These days there are few caches that need to be flushed at
         | context switch time - RISCV's ASIDs mean that you don't need to
          | flush the TLBs (mostly) when you context switch.
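          | 
          | A sketch of why (hypothetical field names) - each TLB entry
          | carries the ASID it was filled under, so the hit check
          | becomes:
          | 
          |     // match on VPN plus either the current ASID or a global
          |     // bit - switching process just changes the ASID CSR,
          |     // no flush needed
          |     wire entry_hit = entry_valid &&
          |                      (entry_vpn == lookup_vpn) &&
          |                      (entry_global ||
          |                       (entry_asid == current_asid));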
         | 
         | VRoom! largely has physically tagged caches so they don't need
         | to be flushed, the BTC is virtually tagged, but split into
          | kernel and user caches - you need to flush the user one on a
          | context switch (or both on a VM switch) - the trace cache
          | (L0 icache) will also be virtually tagged. VRoom! also doesn't
         | do speculative accesses past the TLBs.
         | 
         | Honestly saving and restoring kernel context is small compared
         | to the time spent in the kernel (and I've spent much of the
         | past year looking at how this works in depth).
         | 
         | Practically you have to design stuff to an architecture (like
         | RISCV) so that one can leverage off of the work of others
          | (compilers, libraries, kernels) - adding specialised stuff that
          | would (in this case) get into a critical timing path is
          | something that one has to consider very carefully - but that's
         | a lot of what RISCV is about - you can go and knock up that
         | chip yourself on an FPGA and start trialing it on your
         | microkernel
        
           | kragen wrote:
           | Thanks, this is really informative.
        
       | nynx wrote:
       | The Architectural presentation linked from the GitHub repository
       | for this project is an incredibly good resource on how these
       | kinds of things are designed.
        
         | avianes wrote:
         | Yes, there is a huge lack of open and approachable information
         | sources in micro-architecture.
         | 
         | Be aware though, the micro-architecture used here is very
         | interesting but differs in many ways from state of the art
         | industrial high-end micro-architectures for superscalar out-of-
          | order speculative processors.
         | 
         | I am quite curious about how the author came up with these
         | choices
        
           | Taniwha wrote:
           | Well, everyone was building tiny RISCVs, I kind of thought
           | "can I make a Xeon+ class RISCV if I throw gates at the
            | problem?" :-)
           | 
           | Seriously though I started out with the intent of building a
           | 4/8 instruction/clock decoder, and an O-O execution pipe that
            | could keep up - with the end goal of at least 4+
            | instructions/clock average (we peak now at 8) - the renamer,
            | dual register file, and commitQ are the core of what's
            | probably different here
        
             | avianes wrote:
             | Yes, the "dual register file" is probably the most
             | intriguing to me.
             | 
             | This looks like a renaming scheme used in some old micro-
             | architecture (Intel Core 2 maybe) where ROB receives
              | transient results and acts as a physical regfile; at
              | commit, reg values are copied to an arch regfile. But in
              | your uarch the physical regfile is decoupled from the
              | ROB, which must correspond to your commitQ.
             | 
             | I wonder if this solution is viable for a very large uarch
              | (8 way) because read ports to copy reg values from physical
             | regfile to arch regfile are additional read ports that can
             | be avoided with other (more complex) renaming scheme. These
             | additional read ports can be expensive on a regfile that
             | already has a bunch of ports.
             | 
             | Any thoughts about this?
             | 
             | But I haven't read much of your code yet, that's just a raw
             | observation
        
               | Taniwha wrote:
               | the commitQ entries are smart enough to 'see' the commits
               | into the architectural file and request the data from
                | its current location
               | 
               | It does mean lots of register read ports .... but you can
               | duplicate register files at some point (reducing read
               | ports but keeping the write ports) (you want to keep them
               | close to the ALUs/multipliers/etc) - in some ways these
               | are more implementation issues rather than
               | 'architectural'
        
               | avianes wrote:
               | I see, there are indeed solutions like regfile
                | duplication to handle a large port count but it's
                | expensive when the physical regfile becomes large. I still
               | think that the uarch's job is to ensure minimal
               | implementation cost ;).
               | 
               | Thank you for your opinion and thought process, it's very
                | valuable!
        
       | tasty_freeze wrote:
       | For a few years I worked with the guy behind this project, Paul
       | Campbell. He is a fearless coder, and moves between hardware and
       | software design with equal ease.
       | 
        | An example of his crazy coding chops: he was frustrated by the
        | lack of Verilog licenses at the place he worked back in the early
        | 90s. His solution was to whip up a compliant Verilog simulator,
        | then write a screen saver that would pick up verification tasks
       | from a pending queue. They had many macs around the office that
       | were powered 24/7, and they could chew through a lot of work
       | during the 16 hours a day when nobody was sitting in front of
       | them. When someone sat down at their computer in the morning or
       | came back from lunch, the screen saver would just abandon the
       | simulation job it was running and that job would go back to the
       | queue of work waiting to be completed.
        
         | evilos wrote:
         | That is terrifying.
        
         | thechao wrote:
         | Synthesizable verilog is a very small language compared to
         | system verilog -- especially in the 90s. Off the top of my head
         | I know of _six_ "just real quick" verilog simulators that I've
         | worked with (one of which I wrote). I'm not sure how I feel
         | about them. On one hand, I hate dealing with licenses; on the
         | other hand, now you've got to worry that your custom simulation
         | matches behavior with the synthesis tools. A lot of the
         | "nonstandard" interpretation for synthesizable verilog from the
         | bigs comes from practical understanding of the behavior for a
         | given node. Most of that is captured in ABC files ... but not
         | all of it.
        
           | Taniwha wrote:
           | It was more than simple synthesisable verilog, but not a lot
           | - it was also a compiler rather than an interpreter - at the
           | time VCS was just starting to be a thing, verilog as a
           | language was not at all well defined (lots of assumptions
           | about event ordering that no-one should have been making)
           | 
            | I was designing Mac graphics accelerators - I'd built it on
            | some similar infrastructure I'd built to capture trace from
            | people's machines to try and figure out where QuickDraw was
            | really spending its time - we ended up with a minimalistic
           | graphics accelerator that beat the pants off of everyone else
        
             | thechao wrote:
             | This is why I think Moore (LLHD), Verilator, and Yosys are
             | such awesome tools. They move a lot more slowly than (say)
             | GCC, but I personally think they're all close to the
             | tipping point.
        
               | Taniwha wrote:
               | I wrote a second, much more standard Verilog compiler
               | (because by then there was a standard) with the intent of
               | essentially selling cloud simulation time (being 3rd to a
               | marketplace means you have to innovate) - sadly I was a
               | bit ahead of my time ('cloud' was not yet a word) the
               | whole California/Enron "smartest guys in the room"
               | debacle kind of made a self financed startup like that
               | non-viable
               | 
               | So in the end I open sourced the compiler ('vcomp') but
               | it didn't take off
        
         | colejohnson66 wrote:
         | So, BOINC before BOINC?
        
           | jasonwatkinspdx wrote:
           | A lot of people have come up with something similar. Someone
           | I know implemented the Condor scheduler to run models on
           | workstations at night at a hedge fund. That Condor scheduler
           | dates to the 80s. Smaller 3d animation studios commonly do
           | this too.
        
       | Symmetry wrote:
       | The architectural details here were pretty interesting:
       | 
       | https://moonbaseotago.github.io/talk/index.html
       | 
       | It would be nice to get actual performance numbers rather than
       | just frequency scaled Dhrystone but I suppose we have to be
       | patient.
        
         | Taniwha wrote:
          | Dhrystone's just a place to start, it helps me make quick
         | tweaks, and I'm at that stage of the process - it's
         | particularly good because it's somewhat at odds with my big
         | wide decoders - VRoom! can decode bundles of up to 8
          | instructions per clock, while Dhrystone has lots of twisty
          | branches and only decodes ~3.7 instructions per bundle - it's
          | a great test for the architecture because it pushes at the
          | things it might not be as good at.
         | 
          | Having said that, I'm about reaching the point where it can't
          | be the only thing - being able to run bigger, longer
         | benchmarks is one of the reasons for bringing up linux on the
         | big FPGA on AWS
        
           | Taniwha wrote:
           | I'll add that freq scaled Dhrystone (DMIPS/MHz) is a
           | particularly useful number because it helps you compare
           | architectures rather than just clocks - you can figure out
           | questions like "If I can make this run at 5GHz how will it
           | compare with X?"
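            | 
            | (For example, at the ~6.5 DMIPS/MHz above, a hypothetical
            | 5GHz part would come in around 6.5 x 5000 = ~32,500 DMIPS -
            | the frequency-scaled number makes that multiplication
            | trivial for any target clock.)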
        
       | minroot wrote:
       | Any recommendations for resources on learning to makes things
       | like this in general?
        
         | Symmetry wrote:
         | Computer Architecture: A Quantitative Approach[1] is the
         | textbook that gets recommended the most on the topic, I
         | believe.
        
           | homarp wrote:
           | [1] https://www.elsevier.com/books/computer-
           | architecture/henness...
        
           | tsmi wrote:
           | If you're at the point in your career where you're not sure
           | which is the right textbook then "A Quantitative Approach" is
           | likely to be really tough to get through.
           | 
           | Computer Organization and Design, by the same authors, is
           | considered a better choice for a first book. I personally
           | loved it and couldn't put it down the first time I read it.
           | 
           | https://www.elsevier.com/books/computer-organization-and-
           | des...
        
             | camtarn wrote:
             | Definitely recommend this textbook as a great read - it
             | remains one of the very few textbooks I've read end-to-end
             | and genuinely enjoyed.
        
             | minroot wrote:
              | Any recommendations for books on (System)Verilog?
        
               | sitkack wrote:
               | You might like "Digital Design and Computer Architecture,
               | RISC-V Edition" by Harris and Harris.
               | 
               | https://www.google.com/books/edition/Digital_Design_and_C
               | omp...
               | 
               | This book definitely skews pragmatic, hands on and
               | doesn't assume much. Covers both VHDL and Verilog. Has
               | sections on branch prediction, register renaming, etc.
        
               | tsmi wrote:
               | I personally am not into the verilog specific books. For
               | me HDLs are hardware description languages, so first you
               | learn to design digital hardware, then you learn to
               | describe them.
               | 
               | For that I highly recommend: https://www.cambridge.org/us
               | /academic/subjects/engineering/c...
               | 
               | Great first book on the subject.
        
           | jasonwatkinspdx wrote:
           | Older editions of this are freely available online, and great
           | for learning about microarchitecture.
        
       ___________________________________________________________________
       (page generated 2022-03-21 23:00 UTC)