[HN Gopher] Writing a file system from scratch in Rust ___________________________________________________________________ Writing a file system from scratch in Rust Author : carlosgaldino Score : 207 points Date : 2020-07-27 17:04 UTC (5 hours ago) (HTM) web link (blog.carlosgaldino.com) (TXT) w3m dump (blog.carlosgaldino.com) | ravenstine wrote: | Is there any advantage in writing a custom file system for a | niche purpose? It seems like most file systems are just different | variations of managing where/when files are written | simultaneously. Could a file system written specifically for | something like PostgreSQL cut out the middleman and increase | performance? | tene wrote: | You may be interested in a paper written by the Ceph team: | "File Systems Unfit as Distributed Storage Backends: Lessons | from 10 Years of Ceph Evolution" | | https://www.pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf | | There are definitely some significant benefits you can get from | managing your own storage, rather than using a filesystem. | [deleted] | topspin wrote: | Yes. Oracle has done this (ASM) to eliminate overhead, | implement fault tolerance and provide a storage management | interface based on SQL, for example. | | I once made a 'file system' to mount cpio archives (read-only) | in an embedded system. Cpio is an extremely simple format to | generate and edit (in code) and mounting it directly was very | effective. | formerly_proven wrote: | I suspect operating on block storage directly may be both | easier and more reliable for databases, since about 75% of | the complication in writing transactional I/O software is | working around the kernel's behavior. | zzz61831 wrote: | The kernel's fsyncing behavior is one thing, but just relying | on a massive amount of fragile C code running in the kernel is | a significant liability, especially if your software is a | centralized database: crashes and panics will bring down | everything.
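To make the cpio point above concrete: a "newc" cpio archive is just a 110-byte ASCII header per entry, a NUL-terminated name, and the file data, each padded to a 4-byte boundary, which is why it is so easy to generate and walk in code. A minimal Rust sketch (the helpers `pad4`, `write_entry`, and `list_names` are invented for illustration; hardlinks, device nodes, and checksums are ignored):

```rust
// Sketch of a tiny cpio "newc" writer/reader. Illustrative only: it
// handles regular files, leaves the checksum field zero, and assumes a
// well-formed archive on the read side.

fn pad4(len: usize) -> usize {
    (4 - len % 4) % 4
}

// One entry: 110-byte ASCII header, NUL-terminated name, file data,
// with header+name and data each padded out to a 4-byte boundary.
fn write_entry(out: &mut Vec<u8>, name: &str, data: &[u8]) {
    let name_z = name.len() + 1; // name plus trailing NUL
    let header = format!(
        "070701{:08X}{:08X}{:08X}{:08X}{:08X}{:08X}{:08X}{:08X}{:08X}{:08X}{:08X}{:08X}{:08X}",
        1,          // ino
        0o100644,   // mode (regular file, 0644)
        0,          // uid
        0,          // gid
        1,          // nlink
        0,          // mtime
        data.len(), // filesize
        0,          // devmajor
        0,          // devminor
        0,          // rdevmajor
        0,          // rdevminor
        name_z,     // namesize (includes the NUL)
        0,          // check (unused for "newc")
    );
    out.extend_from_slice(header.as_bytes());
    out.extend_from_slice(name.as_bytes());
    out.push(0);
    out.resize(out.len() + pad4(110 + name_z), 0);
    out.extend_from_slice(data);
    out.resize(out.len() + pad4(data.len()), 0);
}

// Walk the archive and collect entry names until the TRAILER!!! record.
fn list_names(archive: &[u8]) -> Vec<String> {
    let mut names = Vec::new();
    let mut off = 0;
    loop {
        let hdr = &archive[off..off + 110];
        assert_eq!(&hdr[..6], &b"070701"[..], "not a newc header");
        // Field i is the i-th 8-char hex number after the magic.
        let field = |i: usize| {
            let s = std::str::from_utf8(&hdr[6 + 8 * i..6 + 8 * (i + 1)]).unwrap();
            usize::from_str_radix(s, 16).unwrap()
        };
        let filesize = field(6);
        let namesize = field(11);
        let name =
            String::from_utf8_lossy(&archive[off + 110..off + 110 + namesize - 1]).into_owned();
        off += 110 + namesize + pad4(110 + namesize);
        off += filesize + pad4(filesize);
        if name == "TRAILER!!!" {
            break;
        }
        names.push(name);
    }
    names
}

fn main() {
    let mut archive = Vec::new();
    write_entry(&mut archive, "etc/motd", b"hello from cpio\n");
    write_entry(&mut archive, "TRAILER!!!", b"");
    println!("{:?}", list_names(&archive)); // prints ["etc/motd"]
}
```

Because every offset is derivable from two hex fields in a fixed-width header, a read-only "mount" is little more than this walk plus an offset map.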
| formerly_proven wrote: | Yes, and also the traditional answer was that the kernel | handles weird and complicated hardware and can talk to | RAID controllers properly, but nowadays hardware has much | less variance, and RAID is rare (and arguably unnecessary | for a direct-IO database). | | I think it'd be viable for an enterprise-y database to do | IO directly over NVMe. Imagine the efficiency and | throughput gains you could get from a database that (1) | has a unified view of memory allocation in the system (2) | directly performs its page-level IO on the storage | devices. | dm319 wrote: | It would be nice if the intro had a brief explanation of why a | disk needs to be divided into blocks. Otherwise, I really enjoyed | this read from the perspective of a lay person. | unethical_ban wrote: | The disk / inodes need to know where to start looking for a | file's contents, like an address in RAM. Or like | the mail: We subdivide by city, then ZIP, then street, then | address. | | So the inode says "The data for my file starts at block 72 and | is 3 blocks long" (or something like that). The disk then goes | there, and reads blocks 72, 73, and 74. | | Each block is often 4 KiB, so if you have a 10 KiB file, | you still take up ceiling(file size / block size) = 3 blocks. | | That's why there is a difference between "File size" and "Size | on disk" when you look at disk usage summaries. | saurik wrote: | This doesn't explain why blocks are valuable, as you could use | byte addressing. Blocks are valuable for reasons | similar to why memory is divided into pages. (Which | is all I am going to write, as I don't have the time to | answer this well today. But "you need to know where things | are on disk" isn't an answer.) | sedatk wrote: | This isn't the reason for block addressing at all, since | it existed before paging was a thing.
Addressing storage | content in blocks is closely aligned with how direct access | storage is organized in "sectors", usually 512-byte blocks. | Disks don't have byte access interfaces, so it doesn't make | sense to use one. | | It also lets you address much larger data structures | with limited-size variables. With block addressing, you | can access terabytes of storage with a 32-bit integer. | masklinn wrote: | > It would be nice if the intro had a brief explanation of why | a disk needs to be divided into blocks. | | One reason is that HDDs simply don't have a byte-wise | resolution, so there's little point talking to HDDs in | sub-sector units. Sectors are usually 512 bytes to 4k. | | A second reason is being able to simply address the drive. | Using 32b indices, if you index bytewise you're limited to 4GB, | which was already available in the early 90s. With 512-byte blocks, | you get an addressing capacity of 2TB, and with 4k blocks (the | AF format), you get 16TB. In fact I remember 'round the late | 90s / early aughts we'd reformat our drives using larger | blocks because the base block size couldn't address the entire drive. | monadic2 wrote: | Well, one benefit would be a relaxation of constraints on | filesystems. This might be worth the performance loss. | jagged-chisel wrote: | > HDDs simply don't have a byte-wise resolution | | Sufficient explanation for code. Now why is it that disks | lack byte-wise resolution? | avianlyric wrote: | Cost: more resolution means more transistors, address | lines, and protocol overhead (your memory addresses take up | space too). | | You could build a HDD with byte-wise resolution, which did | a whole load of clever things internally to handle error | correction (modern HDDs have far more complex error | correction than just block-wise checksums; they're a piece | of technological magic and analogue electronics). | | But such a device would cost far more, and offer no real | benefits.
Especially when you consider that reading a | single byte in isolation is a very rare task, and it takes | so long (compared to accessing a CPU cache) to find the | data and transfer it that you might as well grab a whole | load of bytes at the same time. | | Byte-wise resolution in a HDD is like ordering rice by the | grain in individual envelopes. Sure you could do it, but | why the hell would you? | pkaye wrote: | Because of the sector-based error correction. I remember | some earlier 8-bit disks had 128-byte sectors. Some newer | flash SSDs have 1k-4k physical sector sizes for ECC coding | efficiency, so in those cases they will internally do a | read-modify-write to handle a 512-byte logical sector size. | e12e wrote: | .. And does it hold true for RAM disks? | avianlyric wrote: | Yes it would, but you would be working in cache lines | rather than disk blocks. | | When working with RAM your PC will transfer a cache line | of memory from RAM to your CPU caches. Again this is a | feature of hardware limitations and trading off | granularity against speed. | | Your CPU operates many times faster than RAM, so it makes | sense to request multiple bytes at once; then your RAM | can access those bytes, and line them up on its output | pins ready for your CPU to come back and read them into | cache. This gives you better bandwidth. (It's a little | more complicated than this because the bytes are | transferred in serial, rather than in parallel, but | digging into the details here is a tad tricky.) | | On the granularity point, the more granular your memory | access pattern is, the more bits you need to address that | memory. In RAM that either means more physical pins and | motherboard traces, or more time to communicate that | address. It also means more transistors are needed to | hold that address in memory while you're looking it up. | All of those things mean more money.
| | And that's before we start looking at memory-management | units, which let you do magical things like map a 64-bit | address space onto only 16GB of physical RAM. Again, the | greater the granularity, the more transistors your MMU | needs to store the mapping, and thus the more cost. | | So really the reason we don't have bit-level addressing | granularity is ultimately cost. There's no reason you | couldn't build a machine that did that; it would just cost a | fortune and wouldn't provide any practical benefits. So | engineers made a trade-off to build something more useful. | maccam94 wrote: | I believe it's for performance and error-correction features | on spinning hard drives. Check out the tables and graphic in | the overview section here[1]. Any write to a sector | requires updating the checksum, which means the whole | sector has to be read before the drive can complete the | write. Since the sector is probably in the kernel disk | cache already, the whole sector can be flushed to disk at | once and the drive can compute the checksum on the fly | (instead of flushing 1 byte, making the drive do a read of | the sector, recompute the checksum, and then do a write). | | 1: https://en.wikipedia.org/wiki/Advanced_Format | tinus_hn wrote: | Try designing the format keeping the required features in | mind and you might see why they do. | | The hardware has to be able to read and write data from an | arbitrary location, do some kind of checksum for error | detection, and efficiently transfer this data to the CPU. | cesarb wrote: | > Now why is it that disks lack byte-wise resolution? | | Overhead. | | Each disk sector is not 512 bytes (or 4096 bytes in modern | disks), it's a bit more.
It needs at least the sector | number (to detect when the head is flying over the correct | sector), a CRC or error-correcting code, a gap of several | hundred bits before the next sector (so writing to one | sector won't overwrite the next one), and a preamble with | an alternating pattern (to synchronize the analog-to-digital | conversion). All this overhead would be a huge | waste if the addressable unit were a single byte. | baruch wrote: | A drive (be it HDD or SSD) is a block device; all drives are | divided into blocks/sectors since they are a rather lossy | medium. Data written to a bit may not necessarily be read | back accurately, so the solution is to have an error-correcting | code (ECC) that allows recovering from some number of badly read | bits (above that and you get a medium error from the drive). | Since an ECC for a single byte would be excessive, drives use | blocks of 512 or 4096 bytes; one reason for the move from | 512 to 4096 is that the ECC becomes more efficient. | | HDDs also have small "gaps" that have headers to locate the | sectors (in the distant past you could do a low-level format to | correct these gaps as well). | | SSDs do not have these gaps, but they do need the ECC. | vinc wrote: | I recently wrote a very simple and naive filesystem in Rust for a | toy OS I'm building and it was quite an interesting thing to do: | https://github.com/vinc/moros/blob/master/doc/filesystem.md | | Then I implemented a little FUSE driver in Python to read the | disk image from the host system and it was wonderful to mount it | the first time and see the files! https://github.com/vinc/moros-fuse | blackrock wrote: | Once you have the file system, and a scheduler, don't you have a | basic rudimentary operating system? | | How soon until someone builds an Operating System developed in | Rust? Maybe make it microkernel-based this time.
| smt88 wrote: | > _How soon until someone builds an Operating System developed | in Rust?_ | | Redox[1] has been around for almost as long as Rust has. I | first heard about it 4-5 years ago. | | They had an interesting competition a while back challenging | people to figure out how to crash it. | | 1. https://www.redox-os.org/ | blackrock wrote: | Yeah, I heard about this project. But there's a graveyard of | dead OS projects out there. | | What's the progress and potential of Redox? | Immortal333 wrote: | Shameless plug. I did something similar in my OS course, but in C. | GitHub: https://github.com/immortal3/EbFS | | Warning: Terribly written, many hacks. | aquabeagle wrote: | _frankly, It never helped on my resume but I enjoyed writing._ | | Frankly, if I saw this code on a resume, I'd keep looking. | scanf("%s",buffer); // Debug :: printf("hash : | %ld\n",hash("cdRoot")); switch(hash(buffer)) { | //command : "ls" case 5863588: | orf wrote: | Everyone starts somewhere, and everyone has written far worse | at some point. | | Saying terribly unconstructive and disheartening things like | that makes everyone, yourself included, feel bad. Do better. | RealityVoid wrote: | Soooo... how does it work? | | I'm not asking about the structure or how it's organized. I | mean... is the filesystem in a file or... how? | | Background: I mostly do embedded stuff, so at a glance I would | have expected low-level primitives (like HW interactions, | registers and stuff) but I see none. So maybe my expectation, | when tackling a problem, of interacting with the HW directly | does not hold in modern environments. | | Even better, but unrelated question... how the heck does an x86 | OS request data from the HDD? | pkaye wrote: | Looks like a filesystem in a file.
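A filesystem-in-a-file works by treating an ordinary disk image as a block device: the FS code only ever reads and writes fixed-size blocks at block-sized offsets, and doesn't care whether the bytes behind them are a file, /dev/sda1, or a real driver. A rough Rust sketch of that seam (the `BlockDevice` trait and all names here are made up for illustration, not taken from the post's code):

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

const BLOCK_SIZE: usize = 4096;

// Hypothetical abstraction: the filesystem talks only to this trait,
// never to the hardware (or host file) directly.
trait BlockDevice {
    fn read_block(&mut self, n: u64, buf: &mut [u8; BLOCK_SIZE]) -> io::Result<()>;
    fn write_block(&mut self, n: u64, buf: &[u8; BLOCK_SIZE]) -> io::Result<()>;
}

// Backing the "device" with anything seekable covers both a disk image
// file opened from userspace and a raw device node like /dev/sda1.
impl<T: Read + Write + Seek> BlockDevice for T {
    fn read_block(&mut self, n: u64, buf: &mut [u8; BLOCK_SIZE]) -> io::Result<()> {
        self.seek(SeekFrom::Start(n * BLOCK_SIZE as u64))?;
        self.read_exact(buf)
    }
    fn write_block(&mut self, n: u64, buf: &[u8; BLOCK_SIZE]) -> io::Result<()> {
        self.seek(SeekFrom::Start(n * BLOCK_SIZE as u64))?;
        self.write_all(buf)
    }
}

fn main() -> io::Result<()> {
    // An in-memory "disk" of 16 blocks; a real build would open a file
    // (or a device node) instead of a Cursor.
    let mut disk = std::io::Cursor::new(vec![0u8; 16 * BLOCK_SIZE]);

    let mut block = [0u8; BLOCK_SIZE];
    block[..5].copy_from_slice(b"hello");
    disk.write_block(3, &block)?;

    let mut back = [0u8; BLOCK_SIZE];
    disk.read_block(3, &mut back)?;
    assert_eq!(&back[..5], &b"hello"[..]);
    println!("round-tripped block 3");
    Ok(())
}
```

Swapping the `Cursor` for `File::open("disk.img")` (or a device node, with the right permissions) is the whole difference between "filesystem in a file" and a mounted partition.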
| keithnz wrote: | as an aside, for our embedded system we use | https://github.com/ARMmbed/littlefs for our flash file | system; it has a bit of a description of its design and its | copy-on-write system so that it can handle random power loss. | It'd be nice to see some of these kinds of libraries done in Nim | or Rust. | mcpherrinm wrote: | You'd presumably have some "block device" abstraction between | your filesystem and your device driver. You don't want to | re-implement a FS for each type of hardware. On a Linux system, | you can read, e.g., /dev/sda1 from userspace, which is what it | looks like this filesystem probably does. | | As for how you actually request data from the hard drive: | there are older ATA interfaces, and BIOS routines for them, | which I suspect are what most hobbyist OSes would use. | | A more modern interface is AHCI. The OSDev wiki has an | overview, where you can see how the registers work: | https://wiki.osdev.org/AHCI | Ericson2314 wrote: | My dream is to add enough type parameters so that in-memory | collections can also work as (not horribly tuned!) on-disk | data structures. | | It's a nice ambitious goal which can really drive language and | library design. | phjesusthatguy3 wrote: | We've attempted this as well and it's not as simple as it seems. | The issues we've run into have made us reconsider porting our FS | handlers to Rust, although we are cautiously optimistic about | later results. | ianlevesque wrote: | Any more specifics? | unethical_ban wrote: | I have read down to the implementation section, but for my money, | this is the best way to describe the high-level function and | behavior of a filesystem that I have ever seen. | ridiculous_fish wrote: | A very accessible (though dated) intro to filesystems is | Practical File System Design, by Dominic Giampaolo. | | PDF link: http://www.nobius.org/practical-file-system-design.pdf | Q6T46nT668w6i3m wrote: | Frankly, not too much has changed since Giampaolo.
In fact, | it is still standard reading in many graduate seminars on the | subject! | vondur wrote: | Is he the guy who did the BeOS filesystem? | peterkelly wrote: | Yes. Later went to Apple and did Spotlight. | | https://en.wikipedia.org/wiki/Dominic_Giampaolo | azhenley wrote: | There's also this file system chapter from a series on writing an | OS in Rust: http://osblog.stephenmarz.com/ch10.html | est31 wrote: | And for code there is TFS: https://github.com/redox-os/tfs | bluejekyll wrote: | Always fun to see this type of work. I notice the usage of | OsString, and it made me wonder: does the way an OS encodes its | strings potentially make this FS non-portable between OSes? If I | want to mount a drive formatted with this FS, would the OsString | be potentially non-portable? | | There was a lot of discussion in the past around TFS | (https://github.com/redox-os/tfs); my understanding is that effort | has kinda lost steam. | fiddlerwoaroof wrote: | This is really cool, I wish someone would fund it. ___________________________________________________________________ (page generated 2020-07-27 23:00 UTC)