[HN Gopher] VRoom A high end RISC-V implementation
___________________________________________________________________

VRoom A high end RISC-V implementation

Author : cmurf
Score  : 108 points
Date   : 2022-03-21 16:07 UTC (6 hours ago)

(HTM) web link (moonbaseotago.github.io)
(TXT) w3m dump (moonbaseotago.github.io)

| titzer wrote:
| This is a very ambitious project, so respect and good luck.
|
| I am wondering if the performance will pan out in practice, as it
| doesn't seem to have a very deep pipeline, so getting high
| clock speeds may be a challenge. In particular the 5 clock branch
| mispredict penalty suggests the pipeline design is fairly simple.
| Production CPUs live and die by the gate depth and hit/miss
| latency of caches and predictors. A longer pipeline is the
| typical answer to gate delay issues. Cache design (and register
| file design!) is also super subtle; L1 is extremely important.
| evilos wrote:
| They mention in their arch slides that they expect to add at
| least 2 more pipeline stages to hit higher clocks.
| Taniwha wrote:
| As mentioned here I expect that reality will intrude and the
| pipe will get bigger - of course a good BTC (and spending lots
| of gates on it) is important because that's what mitigates a
| deep pipe.
|
| I haven't published my latest work yet (coming at the end of the
| week) - I have a minor bump to ~6.5 DMips/MHz - Dhrystone isn't
| everything but it's still proving a useful tool to tweak the
| architecture (which is what's going on now)
| blacklion wrote:
| > Eventually we'll do some instruction combining using this
| information (best place may be at entry to I$0 trace cache), or
| possibly at the rename stage
|
| So much for "we will do only the simplest of instructions and
| u-op fusing will fix performance".
|
| This is why I'm very suspicious about this argument from RISC-V
| proponents.
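[Editor's note: the instruction-combining question above is ultimately empirical - you profile a dynamic instruction trace and see which adjacent pairs are hot enough to be worth fusing. A minimal sketch of that kind of analysis in Python; the trace, its mnemonics, and the function name are invented for illustration, not taken from the VRoom! tooling:]

```python
from collections import Counter

def fusion_candidates(trace, top=5):
    """Count adjacent opcode pairs in a dynamic instruction trace.

    `trace` is a list of opcode mnemonics in execution order; the
    most frequent adjacent pairs are the ones worth considering
    for macro-op fusion.
    """
    pairs = Counter(zip(trace, trace[1:]))
    return pairs.most_common(top)

# A made-up trace: classic RISC-V fuseable patterns (e.g. lui+addi
# to build a 32-bit constant) show up as hot adjacent pairs.
trace = ["lui", "addi", "ld", "slli", "add", "ld",
         "lui", "addi", "beq", "lui", "addi", "sd"]
print(fusion_candidates(trace, top=3))
```

Real traces are of course billions of instructions long, and a real study would also filter pairs by whether they are architecturally fuseable (register dependencies, immediates), but the counting core looks like this.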
| Taniwha wrote:
| I think that we need lots of trace before we decide which ops
| make sense to combine
| blacklion wrote:
| As far as I understand, RISC-V proponents want to have
| "recommended" command sequences for compilers, to avoid the
| situation where different RISC-V CPUs need different
| compilations. If different RISC-V implementations have
| different "fuseable" command sequences, we will be in a
| dreadful situation where you need the exact "-mcpu" for
| decent performance and binary packages will be very
| suboptimal.
|
| And such "conventions" are a bad idea, like comments in code,
| IMHO. They cannot be checked by tools, etc.
| tsmi wrote:
| > you need the exact "-mcpu" for decent performance
|
| For some definitions of decent, I think that ship has
| sailed.
|
| https://clang.llvm.org/docs/CrossCompilation.html
|
| -target <triple> The triple has the general format
| <arch><sub>-<vendor>-<sys>-<abi>, where:
|   arch   = x86_64, i386, arm, thumb, mips, etc.
|   sub    = for ex. on ARM: v5, v6m, v7a, v7m, etc.
|   vendor = pc, apple, nvidia, ibm, etc.
|   sys    = none, linux, win32, darwin, cuda, etc.
|   abi    = eabi, gnu, android, macho, elf, etc.
|
| Note, none of those are exhaustive...
| ncmncm wrote:
| It is always frustrating when you have put in the work to
| optimize code, and turn out to have pessimized it for the
| next chip over.
|
| The extreme case is getting a 10x performance boost by
| using, e.g., POPCNT, and suffering instead a 10-100x
| pessimization because POPCNT is trapped and emulated.
| themerone wrote:
| What does GPL mean for a chip design?
|
| I understand how it applies to the HDL, but I doubt that it
| obligates you to open your code to users of physical chips.
| Taniwha wrote:
| Well (author here) - this is a private project - typically such
| a project would be very proprietary - people don't get to show
| their work.
|
| But I'm looking to find someone to build this thing, it's been
| a while since I last built chips (the last CPU I helped design
| never saw the light of day due to reasons that had little to do
| with how well it worked). So I need a way to show it off, show
| it's real. So GPLing it is a great way to do that - as is
| showing up on HN (thanks to whoever posted this :-).
|
| In practice the RTL level design of a processor is only a part
| of making a real processor - a real VRoom! would likely have
| hand built ALUs, shifters, caches, register files etc - those
| things are all in the RTL at a high level but are really
| different IP - likely they'd be entangled with the GPL and a
| manufacturer might feel that to be an issue.
|
| However I'm happy to dual license (I want to get it built, and
| maybe get paid to do it).
|
| Also about half the companies building RISCVs are in China
| (I've been building open source hardware in China for a decade
| or so now, so I know there's lots of smart people there) - they
| have a real problem (in the West) building something like this
| - all the rumors about supply chain/etc stuff - having an open
| sourced GPL'd reference that's cycle accurate is a way to help
| build confidence.
| Taniwha wrote:
| One other comment about why GPLing something is important for
| someone like me - publishing my 'secrets' is a great way to
| turn them into "prior art" - you read it here first, you
| can't patent it now - I can protect my ideas from becoming
| part of someone else's protected IP by publishing them.
|
| I spent a few years working on an x86 clone, I had maybe 10
| (now expired) patents on how to get around stupidly obvious
| things that Intel had patented - (or around ways to get
| around ways to get around Intel that others had patented) -
| frankly from a technical POV it was all a lot of BS,
| including my patents
| wmf wrote:
| It means "pay me to remove the GPL". It's fake GPL like MySQL
| and MongoDB.
| homarp wrote:
| https://www.fsf.org/blogs/rms/selling-exceptions
|
| RMS wrote "I've considered selling exceptions acceptable
| since the 1990s, and on occasion I've suggested it to
| companies. Sometimes this approach has made it possible for
| important programs to become free software."
| Someone wrote:
| I guess you could argue that, if you bought a device with this
| CPU, you should be able to replace the CPU with one of your own
| that's derived from this one.
|
| I think that's the spirit of the GPL in a hardware context, but
| I don't think it's a given (by a long stretch) that courts
| would accept that argument.
|
| A somewhat clearer case would be if you bought a device that
| implements a GPL licensed design in an FPGA. I think you could
| argue such devices cannot disable the reprogrammability of the
| FPGA.
| dmoreno wrote:
| IANAL, but as far as I know it's very important that it's GPLv3,
| which includes the anti-tivoization clause, meaning that
| hardware that uses this firmware must provide full source code
| and a way to let you use your own firmware.
|
| If somehow this code is not in a firmware... No idea.
| marcodiego wrote:
| AFAICS, it is the same as software: if you changed and
| distributed it, you have to provide your changes if asked to.
| bragr wrote:
| Also IANAL, but as I understand it, the HDL would compile down
| to a sequence of gates, and presumably we'd treat that the same
| way as a binary - a "Non-Source Form" as the GPL calls it. So
| anyone that receives a copy of those gates (either as a binary
| blob for an FPGA, or pre-flashed on an FPGA, or made on actual
| silicon) would be entitled to the source as per GPL3 section 6,
| "Conveying Non-Source Forms".
|
| I don't think the GPL anti-tivoization clause has much bearing
| there other than presumably you'd have to provide the full tool
| chain that resulted in the final gates - presumably this would
| affect companies producing actual chips the most since you
| couldn't have any proprietary optimization or layout steps in
| producing the actual chip design, though also no DRM for FPGAs
| (is that even a thing?)
| Taniwha wrote:
| Author here (Paul Campbell) - AMA
| ncmncm wrote:
| Are you making any attempt at a learning branch predictor? Is
| anything published about really-current methods?
| Taniwha wrote:
| Not yet - I have a pretty generic combined bimodal/global
| predictor - there's a lot of research on BTCs - it's easy to
| throw gates at this area - I can imagine chips hitting 20-30%
| BTC in area just to keep the rest running
|
| My next set of work in this area will be integrating an L0
| trace cache into the existing BTC - that will help me greatly
| up the per-clock issue rate
| titzer wrote:
| As a language VM implementor, I would really love to have a
| conditional call instruction, like arm32 has. AFAICT this would
| be a relatively simple instruction to implement in the CPU. Is
| that accurate?
| Taniwha wrote:
| yes and no - there's a few issues here:
|
| 1 - architectural - RISCV has a nice clean ISA, it's adding
| instructions quickly, CMOV is a contentious issue there - I'm
| not an expert on the history so I'll let others relitigate it
| - it's easy to add new instructions to a RISCV machine, unlike
| Intel/ARM it's even encouraged - however adding a new
| instruction to ALL machines is more difficult and may take
| many years.
But unlike Intel/ARM there IS a process to adopt
| new instructions that doesn't involve just springing them on
| your customers
|
| 2 - remember RISCV is a no-condition-code architecture - that
| would make CMOV require 3 register file ports (the only such
| instruction that also requires an adder [for the compare]) -
| register file ports are extremely expensive, especially for
| just 1 instruction
|
| 3 - micro-architectural - on simple pipes CMOV is pretty
| simple (you just inhibit the register write, plus do something
| special with register bypass) - I'd have to think very hard
| about how to do it on something like VRoom! with out-of-order,
| speculative, register renaming - I can see a naive way
| to do it, but ideally there should be a way to nullify such
| an instruction early in the pipe which would mean some sort
| of renaming-on-the-fly hack
| titzer wrote:
| Note I was talking about a conditional _call_ instruction,
| which is very useful for, e.g., safety checks.
| Taniwha wrote:
| conditional CALL is MUCH harder to implement well - it's
| because the call part essentially happens up at the
| PC/BTC end of the CPU while at the execution stage what
| you're doing is writing the saved PC to the LR/etc and
| the register compare (or accessing a condition code that
| may not have been calculated yet).
|
| In many ways I guess it's a bit like a conditional branch
| that needs a write port - in RISCV, without condition
| codes, your conditional call relative branch distance
| will be smaller because the instruction encoding will
| need to encode 2-3 registers
| dzaima wrote:
| I imagine something like that might be viable in the to-
| be-designed RISC-V J extension, as safety checks (mostly
| in JITs) would be close to the only thing benefiting from
| this.
|
| Though, maybe instead of a conditional call, a
| conditional signal could do, which'd clearly give no
| expectation of performance if it's hit, simplifying the
| hardware effort required.
| Taniwha wrote:
| Yeah, I can imagine that being particularly easy to
| implement in VRoom! - exceptions are handled synchronously
| at the end of the pipe (with everything before them
| already committed, and everything after flushed).
| Instructions that can convert to exceptions (like loads
| and stores taking TLB misses) essentially hit two
| different functional units - a conditional exception
| would be tested in a branch/ALU unit and then either
| transition into an effective no-op or into an exception
| and synchronise the pipe when they hit the commit stage
| kragen wrote:
| 8080 had it too, 8086 dropped it due to disuse. In a modern
| context it's just a performance hack, an alternative to
| macro-op fusion, but for high-performance RISC-V (or i386, or
| amd64, or aarch64) you need macro-op fusion anyway.
| sitkack wrote:
| What does your benchmarking workflow look like? I am interested
| in:
| * From a high level what does your dev iteration look like?
| * Getting instruction traces, timing and resimulating those traces
| * Power analysis, timing analysis (do you do this as part of
|   performance simulation)?
| * Do you benchmark the whole chip or specific sub units?
| * How do you choose what to focus on in terms of performance
|   enhancements?
| * What areas are you focusing on now?
| * What tools would make this easier?
| Taniwha wrote:
| At the moment I'm just starting to work my way up the
| hierarchy of benchmarks, dhrystone's been useful though it's
| nearing the end of its use - I build the big FPGA version (on
| an AWS FPGA instance) to give me a place to run bigger things
| exactly like this.
|
| I currently run low level simulations in Verilator where I
| can easily take large internal architectural traces, and
| bigger stuff on AWS (where that sort of trace is much much
| harder).
|
| I haven't got to the power analysis stage - that will need to
| wait until we decide to build a real chip - timing will
| depend on final tools if we get to build something real,
| currently it's building on Vivado for the FPGA target.
|
| Mostly I'm doing whole chip tests - getting everything to
| work well together is sort of the area I'm focusing on at the
| moment (correctness was the previous goal - being together
| enough to boot linux), the past 3 months I've brought the
| performance up by a factor of 4 - the trace cache might get
| me 2x more if I'm lucky.
|
| I spend a lot of time looking at low level performance, at
| some level I want to get the IPC (instructions per clock) of
| the main pipe as high as I can so I stare at the spots where
| that doesn't happen
|
| I'm using open source tools (thanks everyone!)
| tromp wrote:
| > dhrystone's been useful though it's nearing the end of
| its use
|
| Would my fhourstones [1] [2] benchmark be of any use?
|
| [1] https://tromp.github.io/c4/fhour.html
|
| [2] https://openbenchmarking.org/test/pts/fhourstones
| Taniwha wrote:
| thanks I'll have a look - I'm not so interested in raw
| scores, more about relative numbers so I can judge
| different architectural experiments
| [deleted]
| gary_0 wrote:
| From what little I know about microarchitecture, this seems
| extremely impressive. Hopefully these aren't dumb questions:
|
| Are there GPL'd designs for PCIe, USB, etc, that could be used
| to incorporate this into a SoC design? If not, how much work is
| that compared to this?
|
| Also, what other kind of technical considerations would be
| involved to make this into a "real" chip on something like
| 28nm?
| Taniwha wrote:
| Great questions - I'm using an open source UART from someone
| else, and for the AWS FPGA system I have a 'fake' disk
| driver plus timers/interrupt controllers etc
|
| So far I haven't needed USB/ether/PCIe/etc - I've sort of
| sketched out a place for those to live - I think that for a
| high end system like this one you can't just plug something
| in - real performance needs some consideration of how:
|
| - cache coherency works
| - VM and virtual memory works (essentially page tables for IO
|   devices)
| - PMAP protections from I/O space work (so that devices can't
|   bypass the CPU PMAPs that are used to manage secure enclaves
|   in machine mode)
|
| So in general I'm after something unique, or at least
| slightly bespoke.
|
| I also think there's a bit of a grand convergence going on in
| this area around serdes's which are sort of becoming a new
| generic interface - PCIe, high speed ether, new USBs, disk
| drives etc are all essentially bunches of serdes with
| different protocol engines behind them - a smart SoC is going
| to split things this way for maximum flexibility
| rwmj wrote:
| Don't know much about the details, but this company /
| person claims to have developed some open source IP:
| http://www.enjoy-digital.fr/
| Lramseyer wrote:
| Not Paul Campbell, but I'll share what I know on the matter.
|
| So GPL'd IO blocks - This is a great question, and something
| I have definitely been asking myself! One thing to keep in
| mind is that IO interfaces like PCIe, USB, and whatnot have a
| physical interface ("Phy" for short.) Those contain quite a
| bit of analog circuitry, which is tied to the transistor
| architecture that's used for the design.
|
| That being said, a lot of interfaces that aren't DRAM
| protocols use what's known as a SerDes Phy (short for
| Serializer/De-serializer Physical interface.)
More or less,
| they have an analog front end and a digital back end, and
| the digital back end that connects to everything else is
| somewhat standardized. So it wouldn't be unreasonable to
| try to build something like an open PCIe controller that only
| has the Transaction Layer and Data Link Layer. While there
| are various timing concerns/constraints when not including a
| Phy layer (the lowest layer), I don't think it's impossible.
|
| The other big challenge is that anyone wanting to use an open
| source design will definitely want the test benches and test
| cases included in the repo (you can think of them like unit
| tests.) Unfortunately, most of the software to actually
| compile and run those simulations is cost prohibitive for an
| individual, because it's licensed software. Also, the
| companies that develop this software make a ton of money
| selling things like USB and PCIe controllers, so I'll let you
| draw your own conclusions about the incentives of that
| industry.
|
| Even if you were able to get your hands on the software, the
| simulations are very computationally intensive, and
| contribution by individuals would be challenging ...though
| not impossible!
|
| Despite those barriers, it's a direction that I desperately
| want to see the industry move towards, and I think it's
| becoming more and more inevitable as companies like Google
| get involved with hardware and try to make the ecosystem
| more open. Chiplet architectures are also all the rage these
| days, so it would be less of a risk for a company to attempt
| to use an open source design.
|
| I'd really be curious to hear Paul Campbell's take on this
| question though. He definitely knows a lot more than I do!
| tsmi wrote:
| One advantage of SkyWater opening its PDK is that universities
| are starting to back-fill all the hardware that is missing.
|
| Here's a SerDes from Purdue. I don't think this particular
| design has been validated in silicon yet though.
|
| https://arxiv.org/abs/2105.13256
| black_puppydog wrote:
| Do you dance? :)
|
| https://youtu.be/nlu0foF3WBk?t=182
|
| I know, I'm leaning hard on that second "A" there. :D
| Taniwha wrote:
| heh! - I'm a Kiwi who lived and worked in Silicon Valley for
| 20 years, moved back when the kids started high school, but
| mostly still work there - while I was there I started a
| company using "Taniwha" ... great for a logo, but a mistake
| because of course no one in the US knows how to pronounce it
| (pro-tip: the "wh" is closest to an English "f")
| tsmi wrote:
| Have you considered making an ASIC of your design?
| https://efabless.com/open_shuttle_program
| Taniwha wrote:
| It's likely too big for those programs - I am (just now)
| starting a build with the OpenLane/Sky tools, not with the
| intent of actually taping out but more to squeeze the
| architectural timing (above the slow FPGA I've been using for
| bringing up Linux) so I can find the places where I'm being
| stupidly unreasonable about timing (working on my own I can't
| afford a Synopsys license)
| tsmi wrote:
| Gotcha. Did you run into any issues with yosys given that
| it has limited SystemVerilog support?
|
| Ibex needed to add a pass with sv2v
| https://github.com/lowRISC/ibex/tree/master/syn
| Taniwha wrote:
| I'm just starting this week, I've recently switched to
| some use of SV interfaces and it does not like arrays of
| them - sv2v seems the way to go - but even without that
| yosys goes bang! - something's too big - Vivado compiles the
| same stuff - I rearchitected the bit that might obviously
| be doing this but no luck so far.
| tux3 wrote:
| Any thoughts about higher level HDLs embedded in software
| languages, like Chisel, nMigen, or others? Some other RISC-V
| core designers claim they've had increased productivity with
| those.
|
| It seems that despite a lot of valid criticism against
| (System)Verilog, nothing really seems to be on a trajectory to
| replace it today.
I'm not sure if that's purely inertia
| (existing tooling, workflows, methodologies), other HDLs not
| being attractive enough, or maybe Verilog is just good enough?
| Taniwha wrote:
| I think they're great - I earned my VLSI chops building stuff
| in the 90s and I can write Verilog about as fast as I can
| think so it's my goto language. I've also written a couple of
| compilers over the years so I know it really well (you can
| thank me for the '*' in "always @(*)"). That's just my
| personal bias.
|
| Inertia in tooling is a REALLY BIG deal - if you can't run
| your design through simulation (and FPGA simulation),
| synthesis, layout/etc you'll never build a chip - it can take
| 5-10 years for a new language feature to become ubiquitous
| enough that you can depend on it in a design (I've been
| struggling with this using SystemVerilog interfaces this
| month).
|
| If you look closely at VRoom! you'll see I'm stepping beyond
| some Verilog limitations by adding tiny programs that
| generate bespoke bits of Verilog as part of the build process
| - this stops me from fat fingering some bit in a giant
| encoder but also helps me make things that SV doesn't do so
| well (big 1-hot muxes, priority schedulers etc)
| Taniwha wrote:
| err HN swallowed my * there, as in: "(you can thank me for
| the '*' in "always @(*)")"
| snakke wrote:
| As an aside, the latest and active development of nMigen was
| rebranded a few months ago to Amaranth and can be found
| here: https://github.com/amaranth-lang/amaranth , in case
| people googled nMigen and came to the repository that hasn't
| been updated in two years.
| [deleted]
| codedokode wrote:
| The presentation was interesting, but I would like to share an
| idea that is tangentially related to this CPU.
|
| I noticed that modern CPUs are optimized for legacy monolithic OS
| kernels like Linux or Windows. But having a large, multimegabyte
| kernel is a bad idea from a security standpoint.
A single mistake
| or intentional error in some rarely used component (like a
| temperature sensor driver) can give an attacker full access to
| the system. Again, an error in any part of the monolithic kernel
| can cause system failure. And the Linux kernel doesn't even use
| static analysis to find bugs! It is obvious that using
| microkernels could solve many of the issues above.
|
| But microkernels tend to have poor performance. One of the
| reasons for this could be high context switch latency. CPUs with
| high context switch latency are only good for legacy OSes and not
| ready for better future kernels. Therefore, either we will find a
| way to make context switches fast or we will have to stay with
| large, insecure kernels full of vulnerabilities.
|
| So I was thinking about what could be done here. For example, one
| thing that could be improved is to get rid of the address space
| switch. It causes flushes of various caches and it hurts
| performance. Instead, we could always use a single mapping from
| virtual to physical addresses, but allocate each process a
| different virtual address range. To implement this, we could add
| two registers, which would hold the minimum and maximum
| accessible virtual addresses. It should be easy to check each
| address against them to prevent speculative out-of-bounds memory
| accesses.
|
| By the way, the 32-bit x86 architecture had segments that could
| be used to divide a single address space between processes.
|
| Another thing that can take time is saving/restoring registers on
| a context switch. One way to solve the problem could be to use
| multiple banks (say, 64 banks) of registers that can be quickly
| switched; another way would be to zero out registers on return
| from the kernel and let processes save them if they need them.
|
| Or am I wrong somewhere, and fast context switches cannot be
| implemented this way?
| db65edfc7996 wrote:
| >But microkernels tend to have poor performance.
|
| Citation needed. What kind of hit are we talking about?
5%?
| 90%? We have supercomputers from the future that have capacity
| to spare. I would be willing to take an enormous performance
| hit for better security guarantees on essential infrastructure
| (routers, firewalls, file servers, electrical grid, etc).
| kragen wrote:
| SASOSes are interesting, sometimes extending a 64-bit address
| space to cover a whole cluster, but they aren't compatible with
| anything that calls fork().
|
| The various variants of L4 have pretty good context-switch
| latency even on traditional CPUs, and seL4 in particular is
| formally proven correct on a few platforms. Spectre+Meltdown
| mitigation was painful for them, but they're still pretty good.
|
| Lots of microcontrollers have no MMUs but do have MPUs to keep
| a user task from cabbaging the memory of the kernel or other
| tasks. Not sure if any of them use the PDP-11-style base+offset
| segment scheme you're describing to define the memory regions.
|
| Protected-memory multitasking on a multicore system doesn't
| need to involve context switches, especially with per-core
| memory.
|
| Even on Linux, context switches are cheap when your memory map
| is small. httpdito normally has five pages mapped and takes
| about 100 microseconds (on a 2.8GHz amd64 laptop) to fork,
| serve a request, and exit. I think I've measured context
| switches a lot faster than that between two existing processes.
|
| Multiple register banks for context switching go back to the
| CDC 6600's peripheral processor (FEP) or maybe the TX-0 on
| which Sutherland wrote SKETCHPAD; they have a lot of advantages
| beyond potentially cheaper IPC. Register bank switching for
| interrupt handling was one of the major features the Z80 had
| over the 8080 (you can think of the interrupt handler as being
| the kernel). The Tera MTA in the 01990s was at least widely
| talked about if not widely imitated. Switching register sets is
| how "SMT" works and also sort of how GPUs work.
And today
| Padauk's "FPPA" microcontrollers (starting around 12 cents
| IIRC) use register bank switching to get much lower I/O latency
| than competing microcontrollers that must take an interrupt and
| halt background processing until I/O is complete.
|
| Another alternative approach to memory protection is to do it
| in software, like Java, Oberon, and Smalltalk do, and Liedtke's
| EUMEL did; then an IPC can be just an ordinary function call.
| Side-channel leaks like Spectre seem harder to plug in that
| scenario. GC may make fault isolation difficult in such an
| environment, particularly with regard to performance bugs that
| make real-time tasks miss deadlines, and possibly Rust-style
| memory ownership could help there.
| codedokode wrote:
| What I would like to have is a context switch latency
| comparable to a function call. For example, if in a
| microkernel system the bus driver, network card driver,
| firewall, TCP stack, and socket service are all separate
| userspace processes, then every time a packet arrives there
| would be a context-switching festival.
|
| As I understand it, in microkernel OSes most system calls are
| simply IPCs - for example, the network card driver passes an
| incoming packet to the firewall. So there is almost no kernel
| work except for the context switch. That's why it has to be as
| fast as possible and resemble a normal function call, maybe
| even without invoking the kernel at all. Maybe something like
| Intel's call gate, but fast.
|
| > they aren't compatible with anything that calls fork().
|
| I wouldn't miss it; for example, Windows works fine without
| it.
| kinghajj wrote:
| You should look into the Mill CPU architecture.[0] Its design
| should make microkernels much more viable.
|
| * Single 64-bit address space. Caches use virtual addresses.
|
| * Because of that, the TLB is moved _after_ the last level
| cache, so it's not on the critical path.
|
| * There's instead a PLB (protection lookaside buffer), which
| can be searched in parallel with cache lookup. (Technically,
| there are three: two instruction PLBs and one data PLB.)
|
| [0]: https://millcomputing.com/
| foobiekr wrote:
| I was also going to mention the Mill, but it's become a bit
| of a Flying Dutchman that people tell tales of but which
| probably doesn't exist.
| thechao wrote:
| Segment registers are _precisely_ how NT does context
| switching. I think it may be restricted to just switching from
| user- to kernel-threads. I can't remember if there's
| thread-to-thread switching using segment registers -- I feel
| like this was a thing, or it was just a thing we did when we
| tried to boot NT on Larrabee. (Blech.)
| wrs wrote:
| Long ago, we in the Newton project at Apple had that idea. We
| (in conjunction with ARM) were defining the first ARM MMU, so
| we took the opportunity to implement "domains" of memory
| protection mappings that could be quickly swapped at a context
| switch. So you get multiple threads in the same address space,
| but with independent R/W permission mappings.
|
| I think a few other ARM customers were intrigued by the
| security possibilities, but the vast majority were more like
| "what is this bizarre thing, I just want to run Unix", so the
| feature disappeared eventually.
|
| Here's some ARM documentation if you want to pull this thread:
| https://developer.arm.com/documentation/dui0056/latest/cac...
| wrs wrote:
| Too late to edit, but here's a documentation link that works
| better:
| https://developer.arm.com/documentation/dui0056/d/caches-
| and...
| StillBored wrote:
| It's similar to the original Mac OS, which used handles to
| track/access/etc memory requested from the OS and swap them
| to disk as needed. First you request the space, then you
| request access, which pinned it into RAM.
|
| PalmOS was another one that worked similarly.
| https://www.fuw.edu.pl/~michalj/palmos/Memory.html
| Taniwha wrote:
| These days there are few caches that need to be flushed at
| context switch time - RISCV's ASIDs mean that you don't need to
| flush the TLBs (mostly) when you context switch.
|
| VRoom! largely has physically tagged caches so they don't need
| to be flushed; the BTC is virtually tagged, but split into
| kernel and user caches - you need to flush the user one on
| a context switch (or both on a VM switch) - the trace cache
| (L0 icache) will also be virtually tagged. VRoom! also doesn't
| do speculative accesses past the TLBs.
|
| Honestly, saving and restoring kernel context is small compared
| to the time spent in the kernel (and I've spent much of the
| past year looking at how this works in depth).
|
| Practically you have to design stuff to an architecture (like
| RISCV) so that one can leverage off of the work of others
| (compilers, libraries, kernels) - adding specialised stuff that
| would (in this case) get into a critical timing path is
| something that one has to consider very carefully - but that's
| a lot of what RISCV is about - you can go and knock up that
| chip yourself on an FPGA and start trialing it on your
| microkernel
| kragen wrote:
| Thanks, this is really informative.
| nynx wrote:
| The architectural presentation linked from the GitHub repository
| for this project is an incredibly good resource on how these
| kinds of things are designed.
| avianes wrote:
| Yes, there is a huge lack of open and approachable information
| sources on micro-architecture.
|
| Be aware though, the micro-architecture used here is very
| interesting but differs in many ways from state-of-the-art
| industrial high-end micro-architectures for superscalar
| out-of-order speculative processors.
| | I am quite curious about how the author came up with these | choices | Taniwha wrote: | Well, everyone was building tiny RISCVs, I kind of thought | "can I make a Xeon+ class RISCV if I throw gates at the | problem?" :-) | | Seriously though I started out with the intent of building a | 4/8 instruction/clock decoder, and an O-O execution pipe that | could keep up - with the end goal of at least 4+ | instructions/clock average (we peak now at 8) - the renamer, dual | register file, and commitQ are the core of what's probably | different here | avianes wrote: | Yes, the "dual register file" is probably the most | intriguing to me. | | This looks like a renaming scheme used in some old | micro-architecture (Intel Core 2 maybe) where the ROB receives | transient results and acts as a physical regfile; at commit, | reg values are copied to an arch regfile. But in your uarch | the physical regfile is decoupled from the ROB, which must | correspond to your commitQ. | | I wonder if this solution is viable for a very large uarch | (8 way) because the read ports needed to copy reg values from | the physical regfile to the arch regfile are additional read | ports that can be avoided with other (more complex) renaming | schemes. These additional read ports can be expensive on a | regfile that already has a bunch of ports. | | Any thoughts about this? | | But I haven't read much of your code yet, that's just a raw | observation | Taniwha wrote: | the commitQ entries are smart enough to 'see' the commits | into the architectural file and request the data from | its current location | | It does mean lots of register read ports ....
but you can | duplicate register files at some point (reducing read | ports but keeping the write ports) (you want to keep them | close to the ALUs/multipliers/etc) - in some ways these | are more implementation issues rather than | 'architectural' | avianes wrote: | I see, there are indeed solutions like regfile | duplication to handle large port numbers but it's | expensive when the physical regfile becomes large. I still | think that the uarch's job is to ensure minimal | implementation cost ;). | | Thank you for your opinion and thought process, it's very | valuable! | tasty_freeze wrote: | For a few years I worked with the guy behind this project, Paul | Campbell. He is a fearless coder, and moves between hardware and | software design with equal ease. | | An example of his crazy coding chops: he was frustrated by the | lack of verilog licenses at the place he worked back in the early | 90s. His solution was to whip up a compliant verilog simulator, | then write a screen saver that would pick up verification tasks | from a pending queue. They had many Macs around the office that | were powered 24/7, and they could chew through a lot of work | during the 16 hours a day when nobody was sitting in front of | them. When someone sat down at their computer in the morning or | came back from lunch, the screen saver would just abandon the | simulation job it was running and that job would go back to the | queue of work waiting to be completed. | evilos wrote: | That is terrifying. | thechao wrote: | Synthesizable verilog is a very small language compared to | system verilog -- especially in the 90s. Off the top of my head | I know of _six_ "just real quick" verilog simulators that I've | worked with (one of which I wrote). I'm not sure how I feel | about them. On one hand, I hate dealing with licenses; on the | other hand, now you've got to worry that your custom simulator | matches the behavior of the synthesis tools.
A lot of the | "nonstandard" interpretation of synthesizable verilog from the | bigs comes from practical understanding of the behavior for a | given node. Most of that is captured in ABC files ... but not | all of it. | Taniwha wrote: | It was more than simple synthesisable verilog, but not a lot | - it was also a compiler rather than an interpreter - at the | time VCS was just starting to be a thing, verilog as a | language was not at all well defined (lots of assumptions | about event ordering that no-one should have been making) | | I was designing Mac graphics accelerators; I'd built it on | some similar infrastructure I'd built to capture trace from | people's machines to try and figure out where QuickDraw was | really spending its time - we ended up with a minimalistic | graphics accelerator that beat the pants off of everyone else | thechao wrote: | This is why I think Moore (LLHD), Verilator, and Yosys are | such awesome tools. They move a lot more slowly than (say) | GCC, but I personally think they're all close to the | tipping point. | Taniwha wrote: | I wrote a second, much more standard Verilog compiler | (because by then there was a standard) with the intent of | essentially selling cloud simulation time (being 3rd to a | marketplace means you have to innovate) - sadly I was a | bit ahead of my time ('cloud' was not yet a word) and the | whole California/Enron "smartest guys in the room" | debacle kind of made a self-financed startup like that | non-viable | | So in the end I open sourced the compiler ('vcomp') but | it didn't take off | colejohnson66 wrote: | So, BOINC before BOINC? | jasonwatkinspdx wrote: | A lot of people have come up with something similar. Someone | I know implemented the Condor scheduler to run models on | workstations at night at a hedge fund. That Condor scheduler | dates to the 80s. Smaller 3d animation studios commonly do | this too.
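[Editor's note] The screen-saver simulation farm and Condor-style schedulers described above are instances of one pattern: a pull-based work queue where an abandoned job simply returns to the queue, so a worker can vanish at any moment without losing work. A minimal sketch of that scheduling idea (hypothetical names; not the actual system):

```python
from collections import deque

class WorkQueue:
    """Pull-based job queue with requeue-on-abandon semantics: a worker
    (e.g. a screen saver) can quit mid-job when the user sits down, and
    the job goes back on the queue to be finished by someone else."""
    def __init__(self, jobs):
        self.pending = deque(jobs)

    def take(self):
        # An idle machine pulls the next job, or None if the queue is empty.
        return self.pending.popleft() if self.pending else None

    def abandon(self, job):
        # Interrupted mid-run: the job is requeued, not lost.
        self.pending.append(job)

q = WorkQueue(["sim_a", "sim_b", "sim_c"])
job = q.take()          # an idle Mac picks up "sim_a"
q.abandon(job)          # user comes back from lunch; "sim_a" is requeued
completed = []
while (j := q.take()) is not None:
    completed.append(j) # overnight, every job eventually completes
assert sorted(completed) == ["sim_a", "sim_b", "sim_c"]
```

Note this gives at-least-once execution, not exactly-once: a job may be partially run several times before it completes, which is fine for idempotent work like simulation runs.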
| Symmetry wrote: | The architectural details here were pretty interesting: | | https://moonbaseotago.github.io/talk/index.html | | It would be nice to get actual performance numbers rather than | just frequency-scaled Dhrystone but I suppose we have to be | patient. | Taniwha wrote: | Dhrystone's just a place to start, it helps me make quick | tweaks, and I'm at that stage of the process - it's | particularly good because it's somewhat at odds with my big | wide decoders - VRoom! can decode bundles of up to 8 | instructions per clock, while Dhrystone has lots of twisty | branches and only averages ~3.7 instructions per bundle - it's | a great test for the architecture because it pushes at the | things the design might not be as good at. | | Having said that I'm about reaching the end of the stage where | it's the only thing - being able to run bigger, longer | benchmarks is one of the reasons for bringing up Linux on the | big FPGA on AWS | Taniwha wrote: | I'll add that frequency-scaled Dhrystone (DMIPS/MHz) is a | particularly useful number because it helps you compare | architectures rather than just clocks - you can figure out | questions like "If I can make this run at 5GHz how will it | compare with X?" | minroot wrote: | Any recommendations for resources on learning to make things | like this in general? | Symmetry wrote: | Computer Architecture: A Quantitative Approach[1] is the | textbook that gets recommended the most on the topic, I | believe. | homarp wrote: | [1] https://www.elsevier.com/books/computer-architecture/henness... | tsmi wrote: | If you're at the point in your career where you're not sure | which is the right textbook then "A Quantitative Approach" is | likely to be really tough to get through. | | Computer Organization and Design, by the same authors, is | considered a better choice for a first book. I personally | loved it and couldn't put it down the first time I read it. | | https://www.elsevier.com/books/computer-organization-and-des...
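[Editor's note] The frequency-scaled comparison Taniwha describes above is simple arithmetic: DMIPS/MHz is (roughly) clock-independent, so multiplying it by a hypothetical target clock projects absolute Dhrystone throughput. A sketch using the ~6.5 DMIPS/MHz figure mentioned in this thread (the 5 GHz clock is Taniwha's own hypothetical, not a claimed result):

```python
def projected_dmips(dmips_per_mhz, clock_mhz):
    # Frequency-scaled Dhrystone: the DMIPS/MHz figure characterizes the
    # architecture; scaling by a clock gives a rough absolute DMIPS number.
    return dmips_per_mhz * clock_mhz

# "If I can make this run at 5GHz how will it compare with X?"
print(projected_dmips(6.5, 5000))   # 32500.0 -- ~32.5k DMIPS at 5 GHz
```

The caveat, as the thread notes, is that the DMIPS/MHz number itself can shift once the pipeline is deepened to actually reach those clocks.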
| camtarn wrote: | Definitely recommend this textbook as a great read - it | remains one of the very few textbooks I've read end-to-end | and genuinely enjoyed. | minroot wrote: | Any recommendations for books on (System)Verilog? | sitkack wrote: | You might like "Digital Design and Computer Architecture, | RISC-V Edition" by Harris and Harris. | | https://www.google.com/books/edition/Digital_Design_and_Comp... | | This book definitely skews pragmatic and hands-on, and doesn't | assume much. Covers both VHDL and Verilog. Has sections on | branch prediction, register renaming, etc. | tsmi wrote: | I personally am not into the verilog-specific books. For | me HDLs are hardware description languages, so first you | learn to design digital hardware, then you learn to | describe it. | | For that I highly recommend: | https://www.cambridge.org/us/academic/subjects/engineering/c... | | Great first book on the subject. | jasonwatkinspdx wrote: | Older editions of this are freely available online, and great | for learning about microarchitecture. ___________________________________________________________________ (page generated 2022-03-21 23:00 UTC)