[HN Gopher] Writing a file system from scratch in Rust
       ___________________________________________________________________
        
       Writing a file system from scratch in Rust
        
       Author : carlosgaldino
       Score  : 207 points
       Date   : 2020-07-27 17:04 UTC (5 hours ago)
        
 (HTM) web link (blog.carlosgaldino.com)
 (TXT) w3m dump (blog.carlosgaldino.com)
        
       | ravenstine wrote:
       | Is there any advantage in writing a custom file system for a
       | niche purpose? It seems like most file systems are just different
       | variations of managing where/when files are written
       | simultaneously. Could a file system written specifically for
       | something like PostgreSQL cut out the middle-man and increase
       | performance?
        
         | tene wrote:
         | You may be interested in a paper written by the Ceph team:
         | "File Systems Unfit as Distributed Storage Backends: Lessons
         | from 10 Years of Ceph Evolution"
         | 
         | https://www.pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf
         | 
         | There are definitely some significant benefits you can get from
         | managing your own storage, rather than using a filesystem.
        
         | [deleted]
        
         | topspin wrote:
         | Yes. Oracle has done this (ASM) to eliminate overhead,
         | implement fault tolerance and provide a storage management
         | interface based on SQL, for example.
         | 
         | I once made a 'file system' to mount cpio archives (read-only)
         | in an embedded system. Cpio is an extremely simple format to
         | generate and edit (in code) and mounting it directly was very
         | effective.
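          | 
          | For the curious, the cpio "newc" format really is that
          | simple: a 110-byte ASCII header (6-byte magic "070701" plus
          | 13 fields of 8 hex digits each), then the NUL-terminated
          | name and the file data, each padded to 4 bytes. A rough
          | sketch of walking one (field offsets per the cpio(5) man
          | page; error handling elided):
          | 
          |     fn hex(b: &[u8]) -> usize {
          |         usize::from_str_radix(
          |             std::str::from_utf8(b).unwrap(), 16).unwrap()
          |     }
          |     
          |     fn walk(archive: &[u8]) {
          |         let mut off = 0;
          |         loop {
          |             let hdr = &archive[off..off + 110];
          |             assert_eq!(&hdr[0..6], &b"070701"[..]);
          |             let filesize = hex(&hdr[54..62]);
          |             let namesize = hex(&hdr[94..102]);
          |             let name =
          |                 &archive[off + 110..off + 110 + namesize - 1];
          |             if name == &b"TRAILER!!!"[..] {
          |                 break; // end-of-archive marker entry
          |             }
          |             // Name and data are each padded to 4 bytes.
          |             let data = (off + 110 + namesize + 3) & !3;
          |             println!("{}: {} bytes",
          |                      String::from_utf8_lossy(name), filesize);
          |             off = (data + filesize + 3) & !3;
          |         }
          |     }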
        
           | formerly_proven wrote:
           | I suspect operating on block storage directly may both be
            | easier and more reliable for databases, since about 75% of
           | the complication in writing transactional I/O software is
           | working around the kernel's behavior.
        
             | zzz61831 wrote:
              | The kernel's fsyncing behavior is one thing, but just
              | relying on a massive amount of fragile C code running in
              | the kernel is a significant liability, especially if your
              | software is a centralized database, where crashes and
              | panics will bring down everything.
        
               | formerly_proven wrote:
               | Yes, and also the traditional answer was that the kernel
               | handles weird and complicated hardware and can talk to
               | RAID controllers properly, but nowadays hardware has much
               | less variance, and RAID is rare (and arguably unnecessary
               | for a direct-IO database).
               | 
                | I think it'd be viable for an enterprise-y database to
                | do IO directly over NVMe. Imagine the efficiency and
                | throughput gains you could get from a database that (1)
                | has a unified view of memory allocation in the system
                | and (2) directly performs its page-level IO on the
                | storage devices.
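                | 
                | As a minimal sketch of one step in that direction
                | (Linux-only, assumes the libc crate; the device path is
                | illustrative): open the device with O_DIRECT to bypass
                | the page cache, as a database doing its own buffer
                | management might.
                | 
                |     use std::fs::OpenOptions;
                |     use std::os::unix::fs::{FileExt, OpenOptionsExt};
                |     
                |     fn main() -> std::io::Result<()> {
                |         // O_DIRECT skips the kernel page cache; buffer
                |         // address and IO size must be sector-aligned.
                |         let f = OpenOptions::new()
                |             .read(true)
                |             .custom_flags(libc::O_DIRECT)
                |             .open("/dev/nvme0n1")?;
                |         // Vec<u8> guarantees no particular alignment,
                |         // so over-allocate and slice at a 4KiB-aligned
                |         // offset (crude but dependency-free).
                |         let mut raw = vec![0u8; 2 * 4096];
                |         let start = raw.as_ptr().align_offset(4096);
                |         let buf = &mut raw[start..start + 4096];
                |         f.read_exact_at(buf, 0)?; // first 4KiB, uncached
                |         Ok(())
                |     }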
        
       | dm319 wrote:
       | It would be nice if the intro had a brief explanation of why a
       | disk needs to be divided into blocks. Otherwise, I really enjoyed
       | this read from the perspective of a lay person.
        
         | unethical_ban wrote:
         | The disk / inodes need to know where to start looking for a
         | file's contents, like the address in memory for RAM. Or like
         | the mail: We subdivide by city, then ZIP, then street, then
         | address.
         | 
         | So the inode says "The data for my file starts at block 72 and
         | is 3 blocks long" (or something like that). The disk then goes
         | there, and reads blocks 72,73,74.
         | 
          | Each block is often 4KiB, so if you have a 10KiB file, you
          | still take up ceiling(file size / block size) blocks.
         | 
         | That's why there is a difference between "File size" and "Size
         | on disk" when you look at disk usage summaries.
        
           | saurik wrote:
           | This doesn't explain why blocks are valuable as you could use
           | byte addressing. The reason why blocks are valuable is for
           | similar reasons to why memory is divided into pages. (Which
           | is all I am going to write, as I don't have the time to
           | answer this well today. But "you need to know where things
           | are on disk" isn't an answer.)
        
             | sedatk wrote:
              | This isn't the reason for block addressing at all, since
              | it existed before paging was a thing. Addressing storage
              | content in blocks is closely aligned with how direct-
              | access storage is organized in "sectors", usually
              | 512-byte blocks. Disks don't have byte-access interfaces,
              | so it doesn't make sense to use one.
              | 
              | It also lets you address much larger data structures with
              | limited-size variables. With block addressing, you can
              | access terabytes of storage with a 32-bit integer.
        
         | masklinn wrote:
         | > It would be nice if the intro had a brief explanation of why
         | a disk needs to be divided into blocks.
         | 
         | One reason is that HDDs simply don't have a byte-wise
         | resolution, so there's little point talking to HDDs in sub-
         | sector units. Sectors are usually 512 bytes to 4k.
         | 
          | A second reason is being able to simply address the drive.
          | Using 32-bit indices, if you index bytewise you're limited
          | to 4GB, a capacity already available in the early 90s. With
          | 512-byte blocks, you get an addressing capacity of 2TB, and
          | with 4k blocks (the AF format), you get 16TB. In fact I
          | remember 'round the late 90s / early aughts we'd reformat
          | our drives with larger blocks because with the base block
          | size you couldn't see the entire thing.
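          | 
          | The addressing arithmetic, as a small sketch:
          | 
          |     // Max capacity = number of indices * unit size.
          |     fn main() {
          |         let indices = 1u64 << 32;             // 32-bit index
          |         println!("{} GiB bytewise", indices >> 30);        // 4
          |         println!("{} TiB at 512B", indices * 512 >> 40);   // 2
          |         println!("{} TiB at 4KiB", indices * 4096 >> 40);  // 16
          |     }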
        
           | monadic2 wrote:
            | Well, one benefit would be a relaxation of constraints on
            | filesystems. That might be worth the performance loss.
        
           | jagged-chisel wrote:
           | > HDDs simply don't have a byte-wise resolution
           | 
           | Sufficient explanation for code. Now why is it that disks
           | lack byte-wise resolution?
        
             | avianlyric wrote:
              | Cost. More resolution means more transistors, address
              | lines, and protocol overhead (your memory addresses take
              | up space too).
              | 
              | You could build a HDD with byte-wise resolution, which
              | did a whole load of clever things internally to handle
              | error correction (modern HDDs have far more complex error
              | correction than just block-wise checksums; they're a
              | piece of technological magic and analogue electronics).
              | 
              | But such a device would cost far more, and offer no real
              | benefits. Especially when you consider that reading a
              | single byte in isolation is a very rare task, and it
              | takes so long (compared to accessing a CPU cache) to find
              | the data and transfer it that you might as well grab a
              | whole load of bytes at the same time.
             | 
             | Byte-wise resolution in a HDD is like ordering rice by the
             | grain in individual envelopes. Sure you could do it, but
             | why the hell would you?
        
             | pkaye wrote:
              | Because of sector-based error correction. I remember some
              | early 8-bit disks had 128-byte sectors. Some newer flash
              | SSDs have a 1k-4k physical sector size for ECC coding
              | efficiency, so in those cases they will internally do a
              | read-modify-write to handle a 512-byte logical sector
              | size.
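              | 
              | A sketch of that read-modify-write, with hypothetical
              | device primitives (read_phys/write_phys are stand-ins,
              | not a real API):
              | 
              |     const PHYS: usize = 4096;   // physical sector size
              |     const LOGICAL: usize = 512; // logical sector size
              |     
              |     fn write_logical(
              |         lba: u64,
              |         data: &[u8; LOGICAL],
              |         read_phys: &dyn Fn(u64) -> [u8; PHYS],
              |         write_phys: &mut dyn FnMut(u64, &[u8; PHYS]),
              |     ) {
              |         let per = (PHYS / LOGICAL) as u64; // 8 per sector
              |         let mut sector = read_phys(lba / per);     // read
              |         let off = (lba % per) as usize * LOGICAL;
              |         sector[off..off + LOGICAL]
              |             .copy_from_slice(data);              // modify
              |         write_phys(lba / per, &sector);           // write
              |     }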
        
             | e12e wrote:
             | .. And does it hold true for ram disks?
        
               | avianlyric wrote:
                | Yes it would, but you would be working in cache lines
                | rather than disk blocks.
                | 
                | When working with RAM your PC will transfer a cache line
                | of memory from RAM to your CPU caches. Again this is a
                | feature of hardware limitations and trading off
                | granularity against speed.
                | 
                | Your CPU operates many times faster than RAM, so it
                | makes sense to request multiple bytes at once; your RAM
                | can access those bytes and line them up on its output
                | pins, ready for your CPU to come back and read them
                | into cache. This gives you better bandwidth. (It's a
                | little more complicated than this because the bytes are
                | transferred serially rather than in parallel, but
                | digging into the details here is a tad tricky.)
                | 
                | On the granularity point, the more granular your memory
                | access pattern is, the more bits you need to address
                | that memory. In RAM that means either more physical
                | pins and motherboard traces, or more time to
                | communicate the address. It also means more transistors
                | are needed to hold that address while you're looking it
                | up. All of those things mean more money.
                | 
                | And that's before we start looking at memory management
                | units, which let you do magical things like map a
                | 64-bit address space onto only 16GB of physical RAM.
                | Again, the greater the granularity, the more
                | transistors your MMU needs to store the mapping, and
                | thus the more cost.
                | 
                | So really the reason we don't have byte-level
                | addressing granularity is ultimately cost. There's no
                | reason you couldn't build a machine that did that; it
                | would just cost a fortune and wouldn't provide any
                | practical benefits. So engineers made a trade-off to
                | build something more useful.
        
             | maccam94 wrote:
              | I believe it's for performance and error-correction
              | features on spinning hard drives. Check out the tables
              | and graphic in the overview section here[1]. Any write to
              | a sector requires updating the checksum, which means the
              | whole sector has to be read before the drive can complete
              | the write. Since the sector is probably already in the
              | kernel's disk cache, the whole sector can be flushed to
              | disk at once and the drive can compute the checksum on
              | the fly (instead of flushing 1 byte and making the drive
              | read the sector, recompute the checksum, and then do a
              | write).
             | 
             | 1: https://en.wikipedia.org/wiki/Advanced_Format
        
             | tinus_hn wrote:
              | Try designing the format keeping the required features in
              | mind and you might see why they have them.
              | 
              | The hardware has to be able to read and write data at an
              | arbitrary location, do some kind of checksum for error
              | detection, and efficiently transfer this data to the CPU.
        
             | cesarb wrote:
             | > Now why is it that disks lack byte-wise resolution?
             | 
             | Overhead.
             | 
             | Each disk sector is not 512 bytes (or 4096 bytes in modern
             | disks), it's a bit more. It needs at least the sector
             | number (to detect when the head is flying over the correct
             | sector), a CRC or error-correcting code, a gap of several
             | hundred bits before the next sector (so writing to one
             | sector won't overwrite the next one), and a preamble with
             | an alternating pattern (to synchronize the analog to
             | digital conversion). All this overhead would be a huge
             | waste if the addressable unit were a single byte.
        
         | baruch wrote:
          | A drive (be it HDD or SSD) is a block device; all drives are
          | divided into blocks/sectors since they are a rather lossy
          | medium. Data written to a bit may not necessarily be read
          | back accurately, so the solution is to have an Error-
          | Correcting Code (ECC) that allows recovery from some number
          | of badly read bits (above that, you get a medium error from
          | the drive). Since an ECC for a single byte is a bit
          | excessive, drives use blocks of 512 or 4096 bytes; one
          | reason for the move from 512 to 4096 is that the ECC becomes
          | more efficient.
          | 
          | HDDs also have small "gaps" with headers to locate the
          | sectors (in the distant past you could do a low-level format
          | to rewrite these gaps as well).
          | 
          | SSDs do not have these gaps but they do need the ECC.
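          | 
          | To illustrate why larger sectors amortize the ECC better
          | (the overhead byte counts below are made-up round numbers,
          | not real drive figures):
          | 
          |     fn efficiency(data: f64, overhead: f64) -> f64 {
          |         data / (data + overhead)
          |     }
          |     
          |     fn main() {
          |         // Eight 512-byte sectors, each paying (say) 50 bytes
          |         // of ECC/framing, vs one 4096-byte sector protecting
          |         // the same data with a single, stronger 100-byte code.
          |         let small = efficiency(8.0 * 512.0, 8.0 * 50.0);
          |         let big = efficiency(4096.0, 100.0);
          |         println!("{:.1}% vs {:.1}%",
          |                  small * 100.0, big * 100.0); // 91.1% vs 97.6%
          |     }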
        
       | vinc wrote:
       | I recently wrote a very simple and naive filesystem in rust for a
       | toy OS I'm building and it was quite an interesting thing to do:
       | https://github.com/vinc/moros/blob/master/doc/filesystem.md
       | 
        | Then I implemented a little FUSE driver in Python to read the
        | disk image from the host system and it was wonderful to mount
        | it the first time and see the files!
        | https://github.com/vinc/moros-fuse
        
       | blackrock wrote:
       | Once you have the file system, and a scheduler, don't you have a
       | basic rudimentary operating system?
       | 
        | How soon until someone builds an operating system in Rust?
        | Maybe make it microkernel-based this time.
        
         | smt88 wrote:
         | > _How soon until someone builds an Operating System developed
         | in Rust?_
         | 
         | Redox[1] has been around for almost as long as Rust has. I
         | first heard about it 4-5 years ago.
         | 
         | They had an interesting competition a while back challenging
         | people to figure out how to crash it.
         | 
         | 1. https://www.redox-os.org/
        
           | blackrock wrote:
           | Yeah, I heard about this project. But there's a graveyard of
           | dead OS projects out there.
           | 
           | What's the progress and potential of Redox?
        
       | Immortal333 wrote:
        | Shameless plug: I did something similar in my OS course, but
        | in C. GitHub: https://github.com/immortal3/EbFS
        | 
        | Warning: terribly written, many hacks.
        
         | aquabeagle wrote:
         | _frankly, It never helped on my resume but I enjoyed writing._
         | 
         | Frankly, if I saw this code on a resume, I'd keep looking.
         | scanf("%s",buffer);            // Debug :: printf("hash :
         | %ld\n",hash("cdRoot"));       switch(hash(buffer))       {
         | //command : "ls"        case 5863588:
        
           | orf wrote:
           | Everyone starts somewhere, and everyone has written far worse
           | at some point.
           | 
           | Saying terribly unconstructive and disheartening things like
           | that makes everyone, yourself included, feel bad. Do better.
        
         | RealityVoid wrote:
          | Soooo... how does it work?
          | 
          | I'm not asking about the structure or how it's organized. I
          | mean... is the filesystem in a file or... how?
          | 
          | Background: I mostly do embedded stuff, so at a glance I
          | would have expected low-level primitives (like HW
          | interactions, registers and stuff) but I see none. So maybe
          | my expectation, when tackling a problem, of interacting with
          | the HW directly does not hold in modern environments.
          | 
          | Even better, but unrelated question: how the heck does an
          | x86 OS request data from the HDD?
        
           | pkaye wrote:
           | Looks like a filesystem in a file.
        
           | keithnz wrote:
            | As an aside, for our embedded systems we use
            | https://github.com/ARMmbed/littlefs for our flash file
            | system; it has a bit of a description of its design and its
            | copy-on-write system, so that it can handle random power
            | loss. It'd be nice to see some of these kinds of libraries
            | done in Nim or Rust.
        
           | mcpherrinm wrote:
            | You'd presumably have some "block device" abstraction
            | between your filesystem and your device driver; you don't
            | want to re-implement a FS for each type of hardware. On a
            | Linux system, you can read, e.g., /dev/sda1 from userspace,
            | which is what it looks like this filesystem probably does.
            | 
            | As for how you actually request data from the hard drive:
            | there are older ATA interfaces, and BIOS routines for them,
            | which I suspect is what most hobbyist OSes would use.
            | 
            | A more modern interface is AHCI. The OSDev wiki has an
            | overview, where you can see how the registers work:
            | https://wiki.osdev.org/AHCI
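            | 
            | A minimal sketch of that block-device abstraction (the
            | trait and names here are mine, not from the article):
            | 
            |     use std::fs::File;
            |     use std::io::{self, Read, Seek, SeekFrom, Write};
            |     
            |     const BLOCK_SIZE: u64 = 4096;
            |     
            |     // The filesystem talks only to this trait; the same
            |     // code runs against a disk image, /dev/sda1, or (in a
            |     // kernel) a real driver.
            |     trait BlockDevice {
            |         fn read_block(&mut self, n: u64, buf: &mut [u8])
            |             -> io::Result<()>;
            |         fn write_block(&mut self, n: u64, buf: &[u8])
            |             -> io::Result<()>;
            |     }
            |     
            |     // Userspace implementation over a plain file.
            |     impl BlockDevice for File {
            |         fn read_block(&mut self, n: u64, buf: &mut [u8])
            |             -> io::Result<()> {
            |             self.seek(SeekFrom::Start(n * BLOCK_SIZE))?;
            |             self.read_exact(buf)
            |         }
            |         fn write_block(&mut self, n: u64, buf: &[u8])
            |             -> io::Result<()> {
            |             self.seek(SeekFrom::Start(n * BLOCK_SIZE))?;
            |             self.write_all(buf)
            |         }
            |     }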
        
       | Ericson2314 wrote:
        | My dream is to add enough type parameters that in-memory
        | collections can also work as (not horribly tuned!) on-disk
        | data structures.
        | 
        | It's a nice ambitious goal which can really drive language and
        | library design.
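        | 
        | One way to sketch the idea (the Store trait is hypothetical,
        | not an existing library API):
        | 
        |     // A collection parameterized over where its nodes live.
        |     trait Store {
        |         type Handle: Copy;
        |         fn put(&mut self, bytes: &[u8]) -> Self::Handle;
        |         fn get(&self, h: Self::Handle) -> Vec<u8>;
        |     }
        |     
        |     // The same B-tree code could then be instantiated with an
        |     // in-memory store (Handle = Vec index) or an on-disk store
        |     // (Handle = block number); the tuning lives in the Store
        |     // impl rather than in the collection.
        |     struct BTree<S: Store> {
        |         store: S,
        |         root: Option<S::Handle>,
        |     }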
        
       | phjesusthatguy3 wrote:
       | We've attempted this as well and it's not as simple as it seems.
       | The issues we've run into have made us reconsider porting our FS
       | handlers to Rust, although we are cautiously optimistic about
       | later results.
        
         | ianlevesque wrote:
         | Any more specifics?
        
       | unethical_ban wrote:
        | I have read down to the implementation section, but for my
        | money, this is the best description of the high-level function
        | and behavior of a filesystem that I have ever seen.
        
         | ridiculous_fish wrote:
         | A very accessible (though dated) intro to filesystems is
         | Practical File System Design, by Dominic Giampaolo.
         | 
          | PDF link: http://www.nobius.org/practical-file-system-design.pdf
        
           | Q6T46nT668w6i3m wrote:
           | Frankly, not too much has changed since Giampaolo. In fact,
           | it is still standard reading in many graduate seminars on the
           | subject!
        
           | vondur wrote:
           | Is he the guy who did the BeOS filesystem?
        
             | peterkelly wrote:
             | Yes. Later went to Apple and did Spotlight.
             | 
             | https://en.wikipedia.org/wiki/Dominic_Giampaolo
        
       | azhenley wrote:
       | There's also this file system chapter from a series on writing an
       | OS in Rust: http://osblog.stephenmarz.com/ch10.html
        
         | est31 wrote:
         | And for code there is TFS https://github.com/redox-os/tfs
        
       | bluejekyll wrote:
        | Always fun to see this type of work. I notice the usage of
        | OsString, and it made me wonder: does the way an OS encodes its
        | strings potentially make this FS non-portable between OSes? If
        | I want to mount a drive formatted with this FS, would the
        | OsString be potentially non-portable?
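        | 
        | (For context: OsString has no single cross-OS encoding. On
        | Unix it is arbitrary bytes; on Windows it is potentially
        | ill-formed UTF-16. A portable on-disk format would pin one
        | representation; here is a Unix-only sketch, with hypothetical
        | function names, that stores names as raw bytes.)
        | 
        |     use std::ffi::OsString;
        |     use std::os::unix::ffi::{OsStrExt, OsStringExt};
        |     
        |     // Unix filenames are just bytes, so they round-trip
        |     // losslessly; Windows would need an explicit (and
        |     // possibly lossy) conversion step.
        |     fn serialize_name(name: &OsString) -> Vec<u8> {
        |         name.as_bytes().to_vec()
        |     }
        |     
        |     fn deserialize_name(bytes: Vec<u8>) -> OsString {
        |         OsString::from_vec(bytes)
        |     }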
       | 
        | There was a lot of discussion in the past around TFS
        | https://github.com/redox-os/tfs; my understanding is that
        | effort has kinda lost steam.
        
         | fiddlerwoaroof wrote:
         | This is really cool, I wish someone would fund it.
        
       ___________________________________________________________________
       (page generated 2020-07-27 23:00 UTC)