[HN Gopher] Investigating Linux phantom disk reads
___________________________________________________________________

Investigating Linux phantom disk reads

Author : kamaraju
Score  : 111 points
Date   : 2023-05-02 20:25 UTC (2 hours ago)

(HTM) web link (questdb.io)
(TXT) w3m dump (questdb.io)

| sytse wrote:
| TL;DR: "Ingestion of a high number of column files under memory
| pressure led to the kernel starting readahead disk read
| operations, which you wouldn't expect from a write-only load. The
| rest was as simple as using madvise in our code to disable the
| readahead in table writers."
|
| EE84M3i wrote:
| The article kind of dances around it, but AIUI the reason their
| "write-only load" caused reads (and thus readahead) is that they
| were writing to a mapped page that had already been evicted, so
| the kernel _was_ reading/faulting those pages back in: it can
| only write in block/page-sized chunks.
|
| In some sense this could be thought of as readahead in
| preparation for writing to those pages, which is undesirable in
| this case.
|
| However, what confused me about this article was: if the data
| files are append-only, how is there a "next" block to read ahead
| to? I guess maybe the files are pre-allocated, or the kernel is
| reading previous pages.
|
| [deleted]
|
| bremac wrote:
| Reading between the lines, it sounds as if they're using mmap.
| There is no "append" operation on a memory mapping, so the file
| would need to be preallocated before mapping it.
|
| If the preallocation is done with fallocate, or by just writing
| zeros, then by default it's backed by blocks on disk, and
| readahead must hit the disk since there is data there. On the
| other hand, preallocating with fallocate using
| FALLOC_FL_ZERO_RANGE, or (often) with ftruncate(), will just
| update the logical file length, and even if readahead is
| triggered it won't actually hit the disk.
|
| EE84M3i wrote:
| I understand the case where the file is entirely pre-allocated,
| but for the file-hole case I'm not sure why you'd get such high
| disk activity.
|
| If the index block also got evicted from the page cache, could
| reading into a file hole still trigger a fault? Or is the
| "holiness" of a page for a mapping stored in the page table?
|
| pengaru wrote:
| The readahead is a bit of a readaround, when I last checked, as
| in it'll pull in some stuff before the fault as well.
|
| There used to be a system-wide tunable in /sys to control how
| large an area readahead would extend to, but I'm not seeing it
| anymore on this 6.1 laptop. I think there's been some work
| changing stuff to be more clever in this area in recent years.
| It used to be interesting to make that value small vs. large and
| see how things like uncached journalctl (a heavy mmap user) were
| affected in terms of performance vs. IO generated.
|
| EE84M3i wrote:
| The article distinguishes "readaround" from a linearly predicted
| "readahead", but then says the output of blktrace indicates a
| "potential readahead", which is where I got confused.
|
| Does MADV_RANDOM disable both "readahead" and "readaround"?
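A minimal sketch of the madvise approach mentioned in the TL;DR
above, assuming an mmap-based writer and a preallocated file; the
path, size, and write pattern are placeholders, not QuestDB's actual
code:

    /* Map a preallocated file for writing and ask the kernel not to
     * read ahead around faults on it. Path and size are placeholders. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 1 << 20;               /* 1 MiB, arbitrary */
        int fd = open("column.d", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Preallocate so every mapped page is backed by file blocks. */
        int err = posix_fallocate(fd, 0, (off_t)len);
        if (err != 0) { fprintf(stderr, "posix_fallocate: %s\n", strerror(err)); return 1; }

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* MADV_RANDOM marks the mapping as randomly accessed, which
         * disables readahead for it; a fault still reads the single
         * faulting page from disk, just not a cluster of neighbours. */
        if (madvise(p, len, MADV_RANDOM) != 0) perror("madvise");

        for (size_t i = 0; i < len; i++)          /* append-style writes */
            p[i] = (char)(i & 0xff);

        munmap(p, len);
        close(fd);
        return 0;
    }

Whether MADV_RANDOM also suppresses the "readaround" behaviour asked
about above is a kernel-version detail this sketch doesn't settle.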
| pengaru wrote:
| Going through mmap for bulk ingest sucks because the kernel has
| to fault in the contents to make what's in core reflect what's
| on disk before your write access to the mapped memory occurs.
| It's basically a read-modify-write pattern, even when all you
| intended to do was write the entire page.
|
| When you just use a write call, you provide a unit of arbitrary
| size, and if you've done your homework that size is a multiple
| of the page size and the offset is page-aligned. Then there's no
| need for the kernel to load anything in for the written pages;
| you're providing everything in the single call. Then you go down
| the O_DIRECT rabbit hole every fast Linux database has
| historically gone down.
|
| davidhyde wrote:
| Seems like using memory-mapped files for a write-only load is a
| suboptimal choice. Maybe I'm mistaken, but surely using an
| append-only file handle would be simpler than changing how the
| memory-mapped files are cached, as they did for their solution?
|
| addisonj wrote:
| I am going to write this comment with a large preface: I don't
| think it is ever helpful to be an absolutist. For every best
| practice / "right way" to do things, there are circumstances
| when doing it another way makes sense. There can be a ton of
| reasons for that, be it technical, money/time, etc. The best
| engineering teams aren't those that just blindly follow what
| others say is a best practice, but those that understand the
| options and make an informed choice. None of the following
| comment is at all commentary on QuestDB; as they mention in the
| article, _many_ databases use similar tools.
|
| With that said, after reading the first paragraph I immediately
| searched the article for "mmap" and had a good sense of where
| the rest of this was going. Put simply, it is just really hard
| to reason about what the OS is going to do in all situations
| when using mmap. Based on my experience, I would guess that a
| _ton_ of people reading this comment have hit issues that, I
| would argue, are due to using mmap. (Particularly looking at
| you, Prometheus.)
|
| All told, this is a pretty innocuous incident of mmap causing
| problems, but I would encourage any aspiring DB engineers to
| read https://db.cs.cmu.edu/mmap-cidr2022 as it gives a great
| overview of the range of problems that can occur when using
| mmap.
|
| I think some would argue that mmap is "fine" for append-only
| workloads (and it is certainly more reasonable there than in a
| DB with arbitrary updates), but even here, lots of factors like
| metadata, a scaling number of tables, etc. will _eventually_
| bring you to hit some fundamental problems when using mmap.
|
| The interesting opportunity in my mind, especially with
| improvements in async IO (both at the FS level and in tools
| like Rust), is to build higher-level abstractions that bring
| the "simplicity" of mmap, but with more purpose-built semantics
| ideal for databases.
|
| 0xbadcafebee wrote:
| There are other methods you can use to increase performance
| under memory pressure, but you'd end up handling I/O directly
| and maintaining your own index of memory and disk accesses,
| page-aligned reads/writes, etc. It would be easier to just
| require your users to buy more memory, but when there's a hack
| like this available, that seems preferable to implementing your
| own VMM and disk I/O subsystem.
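For contrast, a minimal sketch of the write()-based path pengaru
describes above, assuming a 4 KiB alignment requirement and a
placeholder file name; with O_DIRECT the page cache is bypassed, so
there is no fault-in, no read-modify-write, and no readahead:

    /* Write one aligned block with O_DIRECT, bypassing the page cache.
     * File name, block size, and contents are placeholders. */
    #define _GNU_SOURCE                      /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t blk = 4096;             /* assumed alignment/block size */
        int fd = open("column.d", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        void *buf = NULL;
        int err = posix_memalign(&buf, blk, blk);
        if (err != 0) { fprintf(stderr, "posix_memalign: %s\n", strerror(err)); return 1; }
        memset(buf, 0x42, blk);              /* pretend this is one page of column data */

        /* Buffer address, length, and file offset are all multiples of
         * blk, as O_DIRECT requires, so the kernel writes the block
         * straight through without faulting anything in first. */
        if (pwrite(fd, buf, blk, 0) != (ssize_t)blk) { perror("pwrite"); return 1; }

        free(buf);
        close(fd);
        return 0;
    }

The cost is exactly the rabbit hole the comment mentions: with
O_DIRECT the application now owns buffering, alignment, and write
scheduling itself.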