[HN Gopher] Linear Address Spaces: Unsafe at any speed
       ___________________________________________________________________
        
       Linear Address Spaces: Unsafe at any speed
        
       Author : gbrown_
       Score  : 115 points
       Date   : 2022-06-29 19:45 UTC (3 hours ago)
        
 (HTM) web link (queue.acm.org)
 (TXT) w3m dump (queue.acm.org)
        
       | Veserv wrote:
       | Of course things would be faster if we did away with coarse
       | grained virtual memory protection and instead merged everything
       | into a single address space and guaranteed protection using fine
       | grained permission mechanisms.
       | 
       | The problem with that is that a single error in the fine grained
       | mechanism anywhere in the entire system can quite easily cause
        | complete system compromise. Achieving any safety guarantee
        | then requires achieving a perfect safety guarantee across all
        | the arbitrary code in your entire deployed system. This is
        | astronomically harder than ensuring safety using virtual memory
        | protection, where you only need to analyze the small trusted
        | code base establishing the linear address space, and do not
        | need to analyze or even understand arbitrary code to enforce
        | safety and separation.
       | 
       | For that matter, fine grained permissions are a strict superset
       | of the prevailing virtual memory paradigm as you can trivially
       | model the existing coarse grained protection by just making the
       | fine grained protection more coarse. So, if you can make a safe
       | system using fine grained permissions then you can trivially
       | create a safe system using coarse grained virtual memory
        | protection. And, if you can do that, then you can create an
        | unhackable operating system right now using those techniques. So
       | where is it?
       | 
       | Anybody who claims to be able to solve this problem should first
       | start by demonstrating a mathematically proven unhackable
       | operating system as that is _strictly easier_ than what is being
        | proposed. Until they do that, the entire idea is a total pipe
        | dream with respect to multi-tenant systems.
        
         | VogonPoetry wrote:
          | I think that the plague of speculative execution bugs qualifies
          | as a single error in virtual memory systems that causes complete
          | system compromise. This was not a logic error in code, but a
         | flaw in the hardware. It isn't clear to me if CHERI would have
         | been immune to speculative execution problems, but access
         | issues would likely have shown up if the memory ownership tests
         | were in the wrong place.
         | 
         | I have been following CHERI. I note that in order to create the
         | first FPGA implementation they had to first define the HDL for
         | a virtual memory system -- all of the research "processor"
         | models that were available did not have working / complete VM
         | implementations. CHERI doesn't replace VM, it is in addition to
         | having VM.
         | 
         | I've found that memory bugs (including virtual memory ones) are
         | difficult to debug, because the error is almost never in the
         | place where the failures show up and there is no easy way to
         | track back who ought to own the object or how long ago the
         | error happened. CHERI can help with this by at least being able
         | to identify the owner.
         | 
         | Virtual memory systems are usually pretty complex. Take a look
         | at the list of issues for the design of L3
         | <https://pdos.csail.mit.edu/6.828/2007/lec/l3.html>. The
         | largest section there is for creating address spaces. For the
         | Linux kernel, in this diagram a lot of the MM code is colored
          | green <https://i.stack.imgur.com/1dyzH.png>; it is a
          | significant portion. More code means more bugs and makes it
          | much harder to formally verify.
         | 
          | I am not convinced by the argument that it is possible to take
          | a fine grained system and trivially expand it to a coarse
          | grained one. How would shared memory, mmap'ed dylibs, and
          | page-level copy-on-write be handled?
        
         | [deleted]
        
         | potatoalienof13 wrote:
         | You have misunderstood the article. It is not advocating for
         | the return to single address space systems. It is advocating
         | for potential alternatives to the linear address space model.
         | Here [1] is an operating system that I think fits under the
          | description of what you were talking about.
         | 
          | [1] https://en.wikipedia.org/wiki/Singularity_%28operating_syste...
        
           | Genbox wrote:
            | The more I research Singularity, the more I like it. I took a
            | deep dive into all the design docs years ago, and the amount
            | of rethinking of existing OS infrastructure is astounding.
           | 
           | Joe Duffy has some great blog posts on Midori (OS based on
           | Singularity) here:
           | http://joeduffyblog.com/2015/11/03/blogging-about-midori/
        
       | infogulch wrote:
       | The Mill's memory model is one of its most interesting features
       | IMO [1] and solves some of the same problems, but by going the
       | other way.
       | 
       | On the Mill the whole processor bank uses a global virtual
        | address space. TLB and mapping to physical memory happen at the
       | _memory controller_. Everything above the memory controller is in
       | the same virtual address space, including L1-L3+ caches. This
       | solves _a lot_ of problems, for example: If you go out to main
        | memory you're already paying ~300 cycles of latency, so having a
       | large silicon area / data structure for translation is no longer
       | a 1-cycle latency problem. Writes to main memory are flushed down
       | the same memory hierarchy that reads come from and succeed as
       | soon as they hit L1. Since all cache lines are in the same
       | virtual address space you don't have to track and synchronize
       | reads and writes across translation zones within the cache
       | hierarchy. When you request an unallocated page you get the whole
        | pre-zeroed page back _instantly_, since it doesn't need to be
        | mapped to physical pages until writes are flushed out of L3. This
        | means it's possible for a page to be allocated, written to, read,
       | and deallocated which _never actually touches physical memory_
       | throughout the whole sequence and the whole workload is served
       | purely within the cache hierarchy.
       | 
       | Protection is a separate system ("PLB") and can be much smaller
       | and more streamlined since it's not trying to do two jobs at
        | once. The PLB allows a process to give another process fine-
        | grained temporary access to a portion of its memory: RW, RO, WO,
        | byte-addressed ranges, for one call or longer, etc. Processes get
        | allocated available address space on start; they can't just
       | assume they own the whole address space or start at some specific
       | address (you should be using ASLR anyways so this should have no
       | effect on well-formed programs, though there is a legacy
       | fallback).
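        | 
        | To make the grant idea concrete, an entry would carry something
        | like the following (a sketch in C; all the names here are
        | invented, the real Mill interface differs):
        | 
        |     #include <stdint.h>
        | 
        |     /* Hypothetical PLB-style grant: one process lends part of
        |      * its (globally addressed) memory to another process. */
        |     typedef enum { GRANT_RO, GRANT_WO, GRANT_RW } grant_mode;
        | 
        |     struct grant {
        |         uint64_t   base;     /* byte-addressed start, global VA  */
        |         uint64_t   len;      /* arbitrary length, no page rounding */
        |         grant_mode mode;
        |         int        callee;   /* process receiving access         */
        |         int        one_call; /* auto-revoked when the call returns? */
        |     };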
       | 
       | [1]: My previous comment:
       | https://news.ycombinator.com/item?id=27952660
        
         | pclmulqdq wrote:
         | The Mill model is kind of cool, but today, many peripherals
         | (including GPUs and NICs) have the ability to dump bytes
         | straight into L3 cache. This improves latency in a lot of
         | tasks, including the server-side ones that the Mill is designed
         | for. This is possible due to the fact that MMUs are above the
         | L3 cache.
         | 
         | Honestly, I'm happy waiting for 4k pages to die and be replaced
         | by huge pages. Page tables were added to the x86 architecture
         | in 1985, when 1MB of memory was a ton of memory to have. Having
         | 256 pages worth of memory in your computer was weird and
         | exotic. Fast forward to today, and the average user has several
         | GB of memory - mainstream computers can be expanded to over 128
         | GB today - and we still mainly use 4k pages. That is the
         | problem here. If we could swap to 2M pages in most
         | applications, we would be able to reduce page table sizes by a
         | factor of 512, and they would still be a lot larger than page
         | tables when virtual memory was invented. And we wouldn't waste
         | much memory!
         | 
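          | Back-of-envelope, in C (my arithmetic, assuming 8-byte leaf
          | PTEs):
          | 
          |     #include <stdio.h>
          | 
          |     int main(void)
          |     {
          |         unsigned long long mem = 128ULL << 30;  /* 128 GiB */
          |         /* 4 KiB pages: ~33.5M PTEs -> 256 MiB of leaf tables */
          |         printf("%llu PTEs, %llu MiB of tables\n",
          |                mem >> 12, ((mem >> 12) * 8) >> 20);
          |         /* 2 MiB pages: 65536 PTEs -> 512 KiB of leaf tables */
          |         printf("%llu PTEs, %llu KiB of tables\n",
          |                mem >> 21, ((mem >> 21) * 8) >> 10);
          |         return 0;
          |     }
          | 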
         | But no, 4k pages for backwards compatibility. 4k pages forever.
         | And while we're at it, let's add features to Linux (like TCP
         | zero copy) that rely on having 4k pages.
        
       | a-dub wrote:
       | > Why do we even have linear physical and virtual addresses in
       | the first place, when pretty much everything today is object-
       | oriented?
       | 
       | are there alternatives to linearly growing call stacks?
        
         | robotresearcher wrote:
         | A stack is a list of objects with a LIFO interface. Doesn't
         | have to be a contiguous byte sequence.
        
           | a-dub wrote:
           | is there an example of machine code that doesn't make use of
           | a linear contiguous call stack?
           | 
           | what would the alternative be? compute the size of all the
            | stack frames a priori in the compiler and then spray them all
           | over main memory and then maintain a linear contiguous list
           | of addresses? doesn't the linear contiguous nature of
           | function call stacks in machine code preserve locality in
           | order to make more efficient use of caches? or would the
           | caches have to become smarter in order to know to preserve
           | "nearby" stack frames when possible?
           | 
           | also, why not just make the addresses wider and put the pid
           | in the high bits? they're already doing this masking stuff
           | for the security descriptors, why not just throw the pid in
           | there as well and be done with it?
        
             | robotresearcher wrote:
             | The linked article doesn't mention call stacks explicitly,
              | but describes how the R1000 arch was object+offset addressed
              | in HW. So unless they restricted the call stack to fit into
              | one object and used only offsets, then yes, they must have
             | chained objects together for the stack.
             | 
              | When you have a page-based memory model, you've made address
              | locality important. If you have an object-based
              | memory model, and the working set is of objects, not pages,
              | then address locality between objects doesn't matter.
              | 
              | Of course, page-based memory models are by FAR the
             | most common in practice.
             | 
             | (Note: pages ARE objects, but the objects are significant
             | to the VM system and not to your program. So strictly,
             | page-based models are a corner case of object-based models,
             | where the objects are obscure.)
        
               | a-dub wrote:
               | would be interesting to see how the actual call stack is
               | implemented. they must either have a fixed width object
               | as you mention or some kind of linear chaining like
               | you're describing.
               | 
                | found this on wikipedia: https://resources.sei.cmu.edu/asset_files/TechnicalReport/19...
               | 
               | memory and disk are unified into one address space, code
               | is represented by this "diana" structure which can be
               | compressed text, text, ast or machine code. would be
               | curious how procedures are represented in machine code.
               | 
               | what a fascinating machine!
        
             | Someone wrote:
             | > is there an example of machine code that doesn't make use
             | of a linear contiguous call stack?
             | 
             | Early CPUs didn't have support for a stack, and some early
             | languages such as COBOL and Fortran didn't need one. They
             | didn't allow recursive function calls, so return addresses
             | could be stored at fixed addresses, and a return could
             | either be an indirect jump reading from that address or a
             | direct jump whose target address got modified when writing
             | to that fixed address (see
             | https://people.cs.clemson.edu/~mark/subroutines.html for
             | the history of subroutine calls)
             | 
             | Both go (https://blog.cloudflare.com/how-stacks-are-
             | handled-in-go) and rust
             | (https://mail.mozilla.org/pipermail/rust-
             | dev/2013-November/00...) initially had split stacks
             | (https://releases.llvm.org/3.0/docs/SegmentedStacks.html,
             | https://gcc.gnu.org/wiki/SplitStacks)
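              | 
              | A toy C illustration of the old static style (the routine's
              | "activation record" lives at a fixed address, in the spirit
              | of early Fortran/COBOL, so you must iterate rather than
              | recurse):
              | 
              |     #include <stdio.h>
              | 
              |     /* fixed storage instead of a stack frame */
              |     static int fact_arg, fact_result;
              | 
              |     static void fact(void) /* re-entry would clobber state */
              |     {
              |         int n = fact_arg;
              |         fact_result = 1;
              |         while (n > 1)      /* so: iterate, don't recurse */
              |             fact_result *= n--;
              |     }
              | 
              |     int main(void)
              |     {
              |         fact_arg = 5;
              |         fact();
              |         printf("5! = %d\n", fact_result);  /* prints 120 */
              |         return 0;
              |     }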
        
       | kazinator wrote:
       | > _Why do we even have linear physical and virtual addresses in
       | the first place, when pretty much everything today is object-
       | oriented?_
       | 
       | Simple: we don't want some low level kernel memory management
       | dictating what constitutes an "object".
       | 
        | Not everything is object-oriented: e.g. large arrays and memory-
        | mapped files, including executables and libraries.
       | 
       | Linear memory sucks, but every other organization sucks more.
       | 
       | Segmented has been done; the benefit-to-clunk ratio was
       | negligible.
        
         | MarkSweep wrote:
         | The benefit-to-thunk ratio was not great either.
         | 
         | ( one reference to thunks involving segmented memory:
         | https://devblogs.microsoft.com/oldnewthing/20080207-00/?p=23...
         | )
        
           | kazinator wrote:
           | Real segmentation would have solved the problem described in
           | the article. Under virtual memory segments like on the 80386
            | (and mainframes before that), you can physically relocate a
            | segment while adjusting its descriptor so that the
            | addressing doesn't change.
           | 
           | The problem was mainly caused by having no MMU, so moving
           | around objects in order to save space required adjusting
           | pointers. Today, a copying garbage collector will do the same
           | thing; rewrite all the links among the moved objects. You'd
           | have similar hacks on Apple Macintoshes, with their MC68K
           | processors and flat space.
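            | 
            | The Mac trick was double indirection, roughly like this (a toy
            | sketch; the names are mine, not the real Toolbox API):
            | 
            |     #include <stdlib.h>
            |     #include <string.h>
            | 
            |     typedef void **Handle;  /* pointer to a master pointer */
            | 
            |     Handle toy_new_handle(size_t n)
            |     {
            |         Handle h = malloc(sizeof *h); /* master pointer cell */
            |         if (h) *h = malloc(n);        /* the movable block   */
            |         return h;
            |     }
            | 
            |     /* Compaction can move a block of size n at any time:
            |      * only the master pointer is rewritten, and every
            |      * Handle held by clients stays valid. */
            |     void toy_compact_move(Handle h, size_t n)
            |     {
            |         void *fresh = malloc(n);
            |         memcpy(fresh, *h, n);
            |         free(*h);
            |         *h = fresh;
            |     }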
        
       | mwcremer wrote:
       | tl;dr page-based linear addressing induces performance loss with
       | complicated access policies, e.g. multilevel page tables. Mr.
       | Kamp would prefer an object model of memory access and
       | protection. Also, CHERI
       | (https://dl.acm.org/doi/10.5555/2665671.2665740) increases code
       | safety by treating pointers and integers as distinct types.
        
       | gumby wrote:
       | The Multics system was designed to have segments (for this
       | discussion == pages) that were handled the way he described, down
       | to the pointer handling. Not bad for the 1960s, though Unix was
        | designed for machines with a lot fewer transistors, back when
        | that mattered a lot.
       | 
       | Things like TLBs (not a new invention, but going back to the
        | 1960s) really only matter to systems programmers, as he says, and
        | their judicious use has simplified programming for a
        | long time. I think if he really wants to go down this path he'll
       | discover that the worst case behavior (five probes to find a
       | page) really is worth it in the long run.
        
       | anewpersonality wrote:
       | CHERI is a gamechanger
        
       | gralx wrote:
       | Link didn't work for me. Direct link did:
       | 
       | https://dl.acm.org/doi/abs/10.1145/3534854
        
       | scottlamb wrote:
       | tl;dr: conventional design bad, me smart, capability-based
       | pointers (base+offset with provenance) can replace virtual
       | memory, CHERI good (a real modern implementation of capability-
       | based pointers).
       | 
       | The first two points are similar to other Poul-Henning Kamp
       | articles [1]. The last two are more interesting.
       | 
       | I'm inclined to agree with "CHERI good". Memory safety is a huge
       | problem. I'm a fan of improving it by software means (e.g. Rust)
       | but CHERI seems attractive at least for the huge corpus of
       | existing C/C++ software. The cost is doubling the size of
       | pointers, but I think it's worth it in many cases.
       | 
       | I would have liked to see more explanation of how capability-
       | based pointers replacing virtual memory would actually work on a
       | modern system.
       | 
       | * Would we give up fork() and other COW sorts of tricks?
       | Personally I'd be fine with that, but it's worth mentioning.
       | 
       | * What about paging/swap/mmap (to compressed memory contents,
       | SSD/disk, the recently-discussed "transparent memory offload"
       | [2], etc)? That seems more problematic. Or would we do a more
       | intermediate thing like The Mill [3] where there's still a
       | virtual address space but only one rather than per-process
       | mappings?
       | 
       | * What bookkeeping is needed, and how does it compare with the
       | status quo? My understanding with CHERI is that the hardware
       | verifies provenance [4]. The OS would still need to handle the
       | assignment. My best guess is the OS would maintain analogous data
       | structures to track assignment to processes (or maybe an extent-
       | based system rather than pages) but maybe the hardware wouldn't
       | need them?
       | 
       | * How would performance compare? I'm not sure. On the one hand,
       | double pointer size => more memory, worse cache usage. On the
       | other hand, I've seen large systems spend >15% of their time
       | waiting on the TLB. Huge pages have taken a chunk out of that
       | already, so maybe the benefit isn't as much as it seemed a few
       | years ago. Still, if this nearly eliminates that time, that may
       | be significant, and it's something you can measure with e.g.
       | "perf"/"pmu-tools"/"toplev" on Linux.
       | 
       | * etc
       | 
       | [1] eyeroll at https://queue.acm.org/detail.cfm?id=1814327
       | 
       | [2] https://news.ycombinator.com/item?id=31814804
       | 
       | [3] http://millcomputing.com/wiki/Memory#Address_Translation
       | 
        | [4] I haven't dug into _how_ this works when fetching pointers
        | from RAM rather than in pure register operations, but for the
        | moment I'll just assume it works, unless it's probabilistic?
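        | 
        | Conceptually, the check a capability dereference implies is
        | something like this (a sketch only; real CHERI enforces it in
        | hardware, with the validity tag stored out of band and a
        | compressed bounds encoding):
        | 
        |     #include <stdbool.h>
        |     #include <stddef.h>
        |     #include <stdint.h>
        | 
        |     typedef struct {
        |         uint64_t base;   /* start of the object           */
        |         uint64_t length; /* size of the object            */
        |         uint64_t cursor; /* address actually dereferenced */
        |         bool     tag;    /* provenance: only set by deriving
        |                             from another valid capability */
        |     } cap_t;
        | 
        |     static bool cap_check(cap_t c, size_t n)
        |     {
        |         return c.tag                       /* provenance intact */
        |             && c.cursor >= c.base          /* not below object  */
        |             && n <= c.length               /* fits at all       */
        |             && c.cursor - c.base <= c.length - n; /* in bounds  */
        |     }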
        
       | throw34 wrote:
       | "The R1000 addresses 64 bits of address space instantly in every
       | single memory access. And before you tell me this is impossible:
       | The computer is in the next room, built with 74xx-TTL
       | (transistor-transistor logic) chips in the late 1980s. It worked
       | back then, and it still works today."
       | 
       | That statement has to be coming with some hidden caveats. 64 bits
       | of address space is crazy huge so it's unlikely the entire range
       | was even present. If only a subset of the range was "instantly"
       | available, we have that now. Turn off main memory and run right
       | out of the L1 cache. Done.
       | 
       | We need to keep in mind, the DRAM ICs themselves have a hierarchy
       | with latency trade-offs.
       | https://www.cse.iitk.ac.in/users/biswap/CS698Y/lectures/L15....
       | 
       | This does seem pretty neat though. "CHERI makes pointers a
       | different data type than integers in hardware and prevents
       | conversion between the two types."
       | 
       | I'm definitely curious how the runtime loader works.
        
         | cmrdporcupine wrote:
         | _" We need to keep in mind, the DRAM ICs themselves have a
         | hierarchy with latency trade-offs_" Yes this is the thing --
         | I'm not a hardware engineer or hardware architecture expert,
         | but -- it seems to me that what we have now is a set of
         | abstractions presented by the hardware to the software based on
         | a model of what hardware "used to" look like, mostly what it
         | used to look like in a 1970s minicomputer, when most of the
         | intensive key R&D in operating systems architecture was done.
         | 
         | One can reasonably ask, like Mr Kamp is, why we should stick to
         | these architectural idols at this point in time. It's
         | reasonable enough, except that the alternative of heterodox,
         | alternative architectures is also heterogenous -- new concepts
         | that don't necessarily "play well with others." All our
         | compiler technology, all our OS conventions, our tooling, etc.
         | would need to be rethought under new abstractions.
         | 
         | And those are fun hobby or thought exercises, but in the real
         | world of industry, they just won't happen. (Though I guess from
         | TFA it could happen in a more specialized domain like
         | aerospace/defence)
         | 
         | In the meantime, hardware engineering is doing amazing things
          | building powerfully performing systems that give us some nice,
          | convenient, consistent (if sometimes insecure and awkward) myths
         | about how our systems work, and they're making them faster
         | every year.
        
           | bentcorner wrote:
           | Makes me wonder if 50 years from now we'll still be stuck
           | with the hardware equivalent of the floppy disk icon, only
           | because retooling the universe over from scratch is too
           | expensive.
        
           | nine_k wrote:
           | As they say, C was designed for the PDP-11 architecture, and
           | modern computers are forced to emulate it, because the tools
           | to describe software (languages and OSes) which we have can't
           | easily describe other architectures.
           | 
            | There were semi-successful modern attempts, though; see the
            | PS3 / Cell architecture. It did not stick.
           | 
           | I'd say that the modern heterodox architecture domain is
           | GPUs, but we have one proprietary and successful interface
           | for them (CUDA), and the open alternatives (openCL) are
           | markedly weaker yet. And it's not even touching the OS
           | abstractions.
        
       | jart wrote:
       | You can avoid the five levels of indirection by using "unreal
       | mode". I just wish it were possible to do with 64-bit code.
        
       | cmrdporcupine wrote:
       | "The R1000 has many interesting aspects ... the data bus is 128
       | bits wide: 64-bit for the data and 64-bit for data's type"
       | 
       |  _what what what?_
       | 
       | How on earth would you ever need to have a type enumeration 2^64
       | long?
       | 
       | Neat, though.
        
         | btilly wrote:
         | My guess is that it is an object oriented system. The data's
         | type is a pointer to the address that defines the type. Which
         | could be anywhere in the system.
         | 
         | This is also a security feature. If you find a way to randomly
          | change the data's type, you're unlikely to hit a value that
          | is a valid pointer to another type.
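          | 
          | Sketched in C (names invented; not documented R1000 behavior):
          | 
          |     #include <stdint.h>
          | 
          |     struct type_desc {        /* defines a type, somewhere */
          |         const char *name;     /* plus layout, ops, ...     */
          |     };
          | 
          |     struct tagged_word {
          |         uint64_t value;               /* 64 data bits        */
          |         const struct type_desc *type; /* 64 "type" bits = a  */
          |     };                                /* descriptor address  */
          | 
          |     /* Random corruption of .type almost never lands on the
          |      * address of another valid descriptor, hence the
          |      * security property. */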
        
         | kimixa wrote:
          | The other option is to use those 64 bits to double the total
         | bandwidth in the "Traditional" page-table system.
         | 
         | All this extra complexity and bus width doesn't come for free,
         | after all, there's opportunity cost.
        
         | KerrAvon wrote:
          | No idea, but consider that it could be an enum + bitfield rather
         | than strictly an enum.
        
         | robotresearcher wrote:
         | I don't know if this machine supported it, but it could allow
         | you to have a system-wide unique type for this-struct-in-this-
         | thread-in-this-process, with strong type checking all the way
         | through the compiler into run time. Which would be pretty cool.
         | 
         | GUIDs for types.
        
       | gpderetta wrote:
       | At Intel they probably still have nightmares about iAPX 432. They
       | are not going to try an OO architecture again.
       | 
       | Having said that, I wouldn't be surprised if some form of
       | segmentation became popular again.
        
         | KerrAvon wrote:
         | I'd hope that anyone at Intel with said nightmares would have
         | read this paper by now (wherein Bob Colwell, et al, argue that
         | the 432 could have been faster with some minor fixes, and
         | competitive with contemporary CPUs with some additional larger
         | modifications).
         | 
         | https://archive.org/details/432_complexity_paper/
        
         | gumby wrote:
          | The underexplored value of early segmentation was the
          | discretionary segment-level permissions enforced by hardware.
         | 
         | Years ago I prototyped a system that had filesystem permission
         | support at the segment level. The idea was you could have a
         | secure dynamic library for, say, manipulating the passwd file
         | (you can tell how long ago that was). You could call into it if
         | you had the execute bit set appropriately, even if you didn't
         | have the read bit set, so you couldn't read the memory but
         | could call into it at the allowed locations (i.e. PLT was x
         | only).
         | 
         | However it was clear everyone wanted to get rid of the segment
         | support, so that idea never went anywhere.
        
         | monocasa wrote:
         | They made a decent go at it again in 16 and 32 bit protected
         | mode. The GDT and LDT along with task gates were intended to be
          | used as a hardware object capability system like the iAPX
         | 432's.
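          | 
          | For reference, the 8-byte descriptor format those tables hold
          | (this layout is documented): each entry bundles a base, a
          | limit, and permission bits, which is what gives it the
          | capability flavor.
          | 
          |     #include <stdint.h>
          | 
          |     struct seg_desc {            /* 32-bit GDT/LDT entry */
          |         uint16_t limit_lo;       /* limit bits 15:0      */
          |         uint16_t base_lo;        /* base bits 15:0       */
          |         uint8_t  base_mid;       /* base bits 23:16      */
          |         uint8_t  access;         /* P, DPL, S, type      */
          |         uint8_t  flags_limit_hi; /* G/D + limit 19:16    */
          |         uint8_t  base_hi;        /* base bits 31:24      */
          |     } __attribute__((packed));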
        
       | kimixa wrote:
       | I'm a little confused about how the object base is looked up in
        | these systems; whether they're sparse or dense; whether they
        | have any size or total-object-count limitations; and whether
        | that ends up having the same limitations on total count as the
        | page tables that required the current multi-level approach.
       | 
       | As surely you could consider page table as effectively
       | implementing a fixed-size "object cache"? It is just a lookup for
       | an offset into physical memory, after all, with the "object ID"
       | just being the masked first part of the address? And if the
       | objects are variable sized, is it possible to end up with
       | physical address fragmentation as objects of different sizes are
       | allocated and freed?
       | 
       | The claim of single-cycle lookups today would require an on-chip
        | fixed-size (and small!) fast SRAM, as there's a pretty hard limit
        | on the amount of memory you can read in a single clock
        | cycle, no matter how fancy or simple the logic behind deciding to
        | look up. If we call this area the "TLB", haven't we got back to
        | page tables again?
       | 
        | And for the size of the SRAM holding the TLB/object-cache
        | entries: increasing the amount of data stored in each entry
        | means you have fewer entries in total, too. A current x86_64 CPU
        | supports 2^48 of physical address space, reduced to 36 bits if
        | you know it's 4k aligned - and 2^57 of virtual address space as
        | the tag, again reduced to 45 bits if we know it's 4k aligned.
        | That means to store the tag and physical address you need a
        | total of 81 bits of SRAM. A 64-bit object ID plus a 64-bit
        | physical address plus a 64-bit size is 192 bits, over 2x that,
        | so you could pack 2x the number of TLB entries into the same
        | SRAM block. To match the capabilities of the example above, 57
        | bits of physical address (which cannot be reduced, as arbitrary
        | sizes mean it's not aligned), plus an object ID and a size each
        | similarly reduced to 48 bits, still adds up to 153 bits, only
        | slightly less than 2x. I'm sure people could argue that reducing
        | the capabilities here has merit; I don't know how many objects,
        | or of what maximum size, such a system would need. And that's
        | the "worst case" of 4k pages for the page-table system too.
       | 
       | I can't see how this idea could be implemented without extreme
       | limitations - look at the TLB size of modern processors and
       | that's the maximum number of objects you could have while meeting
       | the claims of speed and simplicity. There may be some advantage
       | in making them flexible in terms of size, rather than fixed-size,
       | but then you run into the same fragmentation issues, and need to
       | keep that size somewhere in the extremely-tight TLB memory.
        
         | monocasa wrote:
         | > As surely you could consider page table as effectively
         | implementing a fixed-size "object cache"? It is just a lookup
         | for an offset into physical memory, after all, with the "object
         | ID" just being the masked first part of the address? And if the
         | objects are variable sized, is it possible to end up with
         | physical address fragmentation as objects of different sizes
         | are allocated and freed?
         | 
         | Because that's only a base, not a limit. The right pointer
         | arithmetic can spill over to any other object base's memory.
        
         | marshray wrote:
         | > with the "object ID" just being the masked first part of the
         | address?
         | 
         | Doesn't that imply the minimum-sized object requires 4K
         | physical ram?
         | 
         | Is that a problem?
        
           | kimixa wrote:
           | Maybe? If you just round up each "object" to 4k then you can
           | implement this using the current PTE on x86_64, but this
           | removes the (supposed) advantage of only requiring a single
           | PTE for each object (or "object cache" lookup entry or
           | whatever you want to call it) in the cases when an object
           | spans multiple page-sizes of data.
           | 
            | Having arbitrary sized objects will likely be possible in
            | hardware - it's just an extra size being stored in the PTE if
            | you can mask out the object ID from the address (in the
            | example in the original post, it's a whole 64-bit object ID,
            | allowing a full 64 bits of offset within each object, but
            | totaling, effectively, a HUGE 128-bit address).
           | 
            | But arbitrary sizes feel like they push the issues that many
            | current userspace allocators have to deal with today into the
            | hardware/microcode - namely, packing to cope with
            | fragmentation and the like (only instead of virtual address
            | space they'll have to deal with physical address space). The
            | solutions to this today are certainly non-trivial and can
            | still fail in many ways - far from solved, let alone solved
            | in a simple enough way to be implemented that close to
            | hardware.
        
       | avodonosov wrote:
        | Since this addressing scheme is <object, offset>, and these
        | pairs need to fit in 64 bits, I am curious: is the number of
        | bits for each part fixed, and what are those fixed widths? In
        | other words, what is the maximum possible offset within one
        | object, and the max number of objects?
        | 
        | Probably segment registers in x86 can be thought of as object
        | identifiers, thus allowing the same non-linear approach? (Isn't
        | that the purpose of segments, even?)
       | 
       | Update: BTW, another term for what the author calls "linear" is
       | "flat".
        
         | monocasa wrote:
         | Yeah, x86 segments in the protected modes were intended to be
         | used as a hardware object capability system like the author is
         | getting at.
         | 
          | And yeah, it's probably a fixed 64-bit lookup into an object
         | descriptor table.
        
           | marshray wrote:
           | Wouldn't it be hilarious if the 21st century brought about
           | the re-adoption of the security design features introduced in
           | the 80286 (1982)?
        
             | monocasa wrote:
             | I came this close to ordering custom "Make the LDT Great
             | Again" hats after spectre was released, lol.
        
       | dragontamer wrote:
       | > Why do we even have linear physical and virtual addresses in
       | the first place, when pretty much everything today is object-
       | oriented?
       | 
       | Well, GPU code is certainly not object-oriented, and I hope it
       | never becomes that. SIMD code won't be able to jump between
       | objects like typical CPU-oriented OOP does (unless all objects
       | within a warp/workgroup jump to the same function pointers?)
       | 
       | GPU code is common in video games. DirectX needs to lay out its
       | memory very specifically as you write out the triangles and other
       | vertex/pixel data for the GPU to later process. This memory
        | layout is then memcpy'd over PCIe using the linear address
        | space mechanism, and GPUs are now coherent with this space
        | (thanks to Shared Virtual Memory).
       | 
       | So today, thanks to shared virtual memory and advanced atomics,
       | we can have atomic compare-and-swap coordinate CPU and GPU code
       | operating over the same data (and copies of that data can be
       | cached in CPU-ram or GPU-VRAM and transferred over automatically
       | with PCIe memory barriers and whatnot).
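        | 
        | The CPU side of that is just ordinary C11 atomics on memory the
        | GPU can also see; a minimal sketch (my names, not any particular
        | API):
        | 
        |     #include <stdatomic.h>
        |     #include <stdint.h>
        | 
        |     /* 0 = free, 1 = CPU owns, 2 = GPU owns; lives in
        |      * SVM-visible memory. */
        |     _Atomic uint32_t owner;
        | 
        |     int cpu_try_claim(void)
        |     {
        |         uint32_t expected = 0;
        |         /* The GPU issues the equivalent atomic on the same
        |          * address; the fabric keeps both views coherent. */
        |         return atomic_compare_exchange_strong(&owner,
        |                                               &expected, 1);
        |     }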
       | 
       | ----------
       | 
        | Similarly, shared linear address spaces operate over RDMA
        | (remote direct memory access), a protocol that runs over fabrics
        | such as InfiniBand or Ethernet (RoCE). This means that your
        | linear memory space is mmap'd on your CPU, but then asks for
        | access to someone else's RAM over the network. The mmap then
        | causes all those "inefficient pointer traversals" to get turned
        | into network packets that share RAM between CPUs.
       | 
       | Ultimately, when you start dealing with high-speed data-sharing
       | between "external" compute units (ie: a GPU, or a ethernet-
       | connected far-away CPU), rather than "just" a NUMA-node or other
       | nearby CPU, the linear address space seems ideal.
       | 
       | --------
       | 
        | Even the most basic laptop, or even cell phone, these days is a
       | distributed system consisting of a CPU + GPU. Apple chips even
       | have a DSP and a few other elements. Passing data between all of
       | these things makes sense in a distributed linear address space
       | (albeit really wonky with PCIe, mmaps, base address pointers and
       | all sorts of complications... but they are figured out, and it
        | does work every day).
       | 
       | I/O devices working directly in memory is going to only become
       | more common. 100Gbps network connections exist in supercomputer
       | labs, 10Gbps Ethernet is around the corner for consumers. NVMe
        | drives are pushing I/O to such high bandwidths that they'd make
        | DDR2 RAM blush.
       | start turning into distributed chiplets soon. USB3.0 and beyond
       | are high-speed links that directly drop off data into linear
       | address spaces (or so I've been told). Etc. etc.
        
       | edave64 wrote:
        | There is often quite a significant distance between the
        | beautiful, elegant, and efficient design that brings tears to a
        | designer's eyes, and what is pragmatic and financially viable.
       | 
       | Building a new competitive processor architecture isn't feasible
       | if you can't at least ensure compile-time compatibility with
       | existing programs. People won't buy a processor that won't run
       | their programs.
        
       | ajb wrote:
       | This article compares CHERI to an 80's computer, the Rational
       | R1000 (which I'm glad to know of). It's worth noting that CHERI's
       | main idea was explored in the 70's by the CAP computer[1]. CAP
       | and CHERI are both projects of the University of Cambridge's
       | Computer Lab. It's fairly clear that CAP inspired CHERI.
       | 
       | [1] https://en.wikipedia.org/wiki/CAP_computer
        
         | yvdriess wrote:
         | Are you sure it wasn't done before by IBM in the '60s?
         | 
          | That's usually the case, for hardware at least.
         | 
         | For software, it usually was done before by Lisp in the '70s.
        
           | Animats wrote:
           | The original machines like that were the Burroughs 5000
           | (1961), and the Burroughs 5500 (1964), which was quite
           | successful. Memory was allocated by the OS in variable length
           | chunks. Addresses were not plain numbers; they were more like
           | Unix paths, as in /program/function/variable/arrayindex.
           | 
           | That model works, but is not compatible with C and UNIX.
        
             | heavenlyblue wrote:
             | How would you address recursive functions this way?
        
             | EvanAnderson wrote:
             | You beat me! CHERI totally made me think about those
             | machines.
             | 
             | There's some good background here for those who are
             | interested: https://www.smecc.org/The%20Architecture%20%20o
             | f%20the%20Bur...
             | 
             | The architecture of the B5000 / B5500 / B6500 lives on
             | today in the Unisys ClearPath line. I believe the OS, MCP,
             | is one of the longest-maintained software operating systems
             | still in active use, too.
        
           | monocasa wrote:
           | IBM didn't really play with hardware object capabilities
            | until the S/38, and even then it's a bit of a stretch to call
           | them that.
        
       | cmrdporcupine wrote:
       | Another system that had an object-based non-linear address space
       | I believe was the "Rekursiv" CPU developed at Linn (yes, the
       | Swedish audio/drum machine company; EDIT: Linn. Scottish. Not
       | drum machine. Thanks for the corrections. In fact I even knew
       | this at one time. Yay brain.) in the 80s.
       | 
       | https://en.wikipedia.org/wiki/Rekursiv
       | 
       | I actually have a copy of the book they wrote about it here
       | somewhere. I often fantasize about implementing a version of it
       | in FPGA someday.
        
         | Gordonjcp wrote:
         | > Linn (yes, the Swedish audio/drum machine company) in the 80s
         | 
         | Uhm.
         | 
         | Linn the audio company, known as Linn Products, are Scottish,
         | being based a little to the south of Glasgow, and named after
         | the park the original workshop was beside.
         | 
         | Linn the drum machine company, known as Linn Electronics, were
         | American, being founded by and named after Roger Linn.
         | 
         | Two totally different companies, run by totally different
         | people, not connected in any way, and neither of them Swedish.
         | 
         | The Linn Rekursiv was designed by the audio company, and was
         | largely unsuccessful, and none exist any more - not even bits
         | of them :-/
        
           | cmrdporcupine wrote:
           | oops :-)
        
         | kwhitefoot wrote:
         | Surely Linn is Scottish.
        
       | martincmartin wrote:
       | "Unsafe at Any Speed" is the name of Ralph Nader's book on car
       | manufacturers resisting car safety measures. It resulted in the
       | creation of the United States Department of Transportation in
       | 1966 and the predecessor agencies of the National Highway Traffic
       | Safety Administration in 1970.
        
       | akdor1154 wrote:
       | > They also made it a four-CPU system, with all CPUs operating in
       | the same 64-bit global address space. It also needed a good 1,000
       | amperes at 5 volts delivered to the backplane through a dozen
       | welding cables.
       | 
       | That is absolutely terrifying.
        
         | buildbot wrote:
          | These days you just use 12V and convert right next to, or on,
          | the die - but we are still in that range of amps for big chips!
          | Take for example a 3090 at 500W: it's fed at 12V, but the core
          | runs at 1.056V, so that's 473 amps!
        
       ___________________________________________________________________
       (page generated 2022-06-29 23:00 UTC)