[HN Gopher] Investigating Linux phantom disk reads
       ___________________________________________________________________
        
       Investigating Linux phantom disk reads
        
       Author : kamaraju
       Score  : 111 points
       Date   : 2023-05-02 20:25 UTC (2 hours ago)
        
 (HTM) web link (questdb.io)
 (TXT) w3m dump (questdb.io)
        
       | sytse wrote:
       | TLDR; "Ingestion of a high number of column files under memory
       | pressure led to the kernel starting readahead disk read
       | operations, which you wouldn't expect from a write-only load. The
       | rest was as simple as using madvise in our code to disable the
       | readahead in table writers."
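        | 
        | The fix is presumably along these lines (just a sketch, not
        | their actual code; the file name is made up): map the column
        | file and hint MADV_RANDOM so the kernel skips readahead on
        | faults in that mapping.
        | 
        |     #include <fcntl.h>
        |     #include <stdio.h>
        |     #include <sys/mman.h>
        |     #include <sys/stat.h>
        |     #include <unistd.h>
        | 
        |     int main(void)
        |     {
        |         int fd = open("column.d", O_RDWR);  /* hypothetical file */
        |         struct stat st;
        |         if (fd < 0 || fstat(fd, &st) != 0) {
        |             perror("open/fstat");
        |             return 1;
        |         }
        | 
        |         void *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
        |                        MAP_SHARED, fd, 0);
        |         if (p == MAP_FAILED) { perror("mmap"); return 1; }
        | 
        |         /* "Random access" hint: in practice this turns off
        |          * readahead for faults on this mapping. */
        |         if (madvise(p, st.st_size, MADV_RANDOM) != 0)
        |             perror("madvise");
        | 
        |         ((char *)p)[0] = 42;   /* the append path writes here */
        | 
        |         munmap(p, st.st_size);
        |         close(fd);
        |         return 0;
        |     }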
        
         | EE84M3i wrote:
          | The article kind of dances around it, but AIUI the reason
          | their "write-only load" caused reads (and thus readahead) was
          | that they were writing to a mapped page that had already been
          | evicted - so the kernel _was_ reading/faulting those pages,
          | because it can only write in block/page-sized chunks.
         | 
         | In some sense maybe this could be thought of as readahead in
         | preparation for writing to those pages, which is undesirable in
         | this case.
         | 
          | However, what confused me about this article was: if the data
          | files are append-only, how is there a "next" block to read
          | ahead to? I guess maybe the files are pre-allocated, or the
          | kernel is reading previous pages.
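          | 
          | One rough way to see the read-before-write from userspace (a
          | sketch; the file name is made up, and the file needs to be
          | block-backed and cold in the page cache, e.g. after dropping
          | caches): the program only ever stores into the mapping, yet
          | read_bytes in /proc/self/io should still grow.
          | 
          |     #include <fcntl.h>
          |     #include <stdio.h>
          |     #include <sys/mman.h>
          |     #include <sys/stat.h>
          |     #include <unistd.h>
          | 
          |     /* Bytes this process caused to be fetched from storage. */
          |     static long read_bytes(void)
          |     {
          |         FILE *f = fopen("/proc/self/io", "r");
          |         char line[128];
          |         long v = -1;
          |         if (!f) return -1;
          |         while (fgets(line, sizeof line, f))
          |             if (sscanf(line, "read_bytes: %ld", &v) == 1)
          |                 break;
          |         fclose(f);
          |         return v;
          |     }
          | 
          |     int main(void)
          |     {
          |         int fd = open("cold_column.d", O_RDWR);  /* made up */
          |         struct stat st;
          |         if (fd < 0 || fstat(fd, &st) != 0) {
          |             perror("open/fstat");
          |             return 1;
          |         }
          | 
          |         char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
          |                        MAP_SHARED, fd, 0);
          |         if (p == MAP_FAILED) { perror("mmap"); return 1; }
          | 
          |         long before = read_bytes();
          |         for (off_t i = 0; i < st.st_size; i += 4096)
          |             p[i] = 1;               /* store only, never load */
          |         long after = read_bytes();
          | 
          |         printf("read_bytes grew by %ld while only writing\n",
          |                after - before);
          |         munmap(p, st.st_size);
          |         close(fd);
          |         return 0;
          |     }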
        
           | [deleted]
        
           | bremac wrote:
           | Reading between the lines, it sounds as if they're using
           | mmap. There is no "append" operation on a memory mapping, so
           | the file would need to be preallocated before mapping it.
           | 
           | If the preallocation is done using fallocate or just writing
           | zeros, then by default it's backed by blocks on disk, and
           | readahead must hit the disk since there is data there. On the
           | other hand, preallocating with fallocate using
           | FALLOC_FL_ZERO_RANGE or (often) with ftruncate() will just
           | update the logical file length, and even if readahead is
           | triggered it won't actually hit the disk.
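            | 
            | To make that concrete, a quick sketch (no idea what QuestDB
            | actually does; the file name and size are invented) of the
            | different ways you might size the file before mapping it:
            | 
            |     #define _GNU_SOURCE          /* for fallocate() */
            |     #include <fcntl.h>
            |     #include <linux/falloc.h>    /* FALLOC_FL_ZERO_RANGE */
            |     #include <stdio.h>
            |     #include <unistd.h>
            | 
            |     int main(void)
            |     {
            |         const off_t len = 16 << 20;   /* 16 MiB, arbitrary */
            |         int fd = open("column.d", O_RDWR | O_CREAT, 0644);
            |         if (fd < 0) { perror("open"); return 1; }
            | 
            |         /* Sparse: only the logical length changes. */
            |         if (ftruncate(fd, len) != 0)
            |             perror("ftruncate");
            | 
            |         /* Reserve blocks up front (Linux-specific call). */
            |         if (fallocate(fd, 0, 0, len) != 0)
            |             perror("fallocate");
            | 
            |         /* Reserve and explicitly zero the range. */
            |         if (fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, len) != 0)
            |             perror("fallocate(FALLOC_FL_ZERO_RANGE)");
            | 
            |         close(fd);
            |         return 0;
            |     }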
        
             | EE84M3i wrote:
              | I understand the case where the file is entirely
              | pre-allocated, but for the file-hole case I'm not sure I
              | understand why you'd get such high disk activity.
              | 
              | If the index block also got evicted from the page cache,
              | then could reading into a file hole still trigger a fault?
              | Or is the "hole-ness" of a page for a mapping stored in
              | the page table?
        
           | pengaru wrote:
           | The readahead is a bit of a readaround when I last checked,
           | as in it'll pull in some stuff before the fault as well.
           | 
           | There used to be a sys-wide tunable in /sys to control how
           | large an area readahead would extend to, but I'm not seeing
           | it anymore on this 6.1 laptop. I think there's been some work
           | changing stuff to be more clever in this area in recent
           | years. It used to be interesting to make that value small vs.
           | large and see how things like uncached journalctl (heavy mmap
           | user) were affected in terms of performance vs. IO generated.
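            | 
            | (For what it's worth, a per-device knob still shows up on a
            | lot of systems as /sys/block/<dev>/queue/read_ahead_kb; the
            | device name in this little sketch is just an example.)
            | 
            |     #include <stdio.h>
            | 
            |     int main(void)
            |     {
            |         /* Example device; substitute your own disk. */
            |         FILE *f = fopen("/sys/block/sda/queue/read_ahead_kb",
            |                         "r");
            |         int kb = 0;
            |         if (f && fscanf(f, "%d", &kb) == 1)
            |             printf("readahead window: %d KiB\n", kb);
            |         if (f)
            |             fclose(f);
            |         return 0;
            |     }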
        
             | EE84M3i wrote:
              | The article distinguishes "readaround" from a linearly
              | predicted "readahead", but then says the output of
              | blktrace indicates a "potential readahead", which is where
              | I got confused.
             | 
             | Does MADV_RANDOM disable both "readahead" and "readaround"?
        
       | pengaru wrote:
       | Going through mmap for bulk-ingest sucks because the kernel has
       | to fault in the contents to make what's in-core reflect what's
       | on-disk before your write access to the mapped memory occurs.
       | It's basically a read-modify-write pattern even when all you
       | intended to do was write the entire page.
       | 
        | When you just use a write call, you provide a unit of arbitrary
        | size, and if you've done your homework that size is a multiple
        | of the page size and the offset is page-aligned. Then there's no
        | need for the kernel to load anything in for the written pages;
        | you're providing everything in the single call. From there you
        | go down the O_DIRECT rabbit hole every fast Linux database has
        | historically gone down.
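        | 
        | A sketch of that write() path (file name, sizes and the 4 KiB
        | alignment are assumptions, nothing from the article): aligned
        | buffer, aligned length, aligned offset, O_DIRECT.
        | 
        |     #define _GNU_SOURCE            /* for O_DIRECT */
        |     #include <fcntl.h>
        |     #include <stdio.h>
        |     #include <stdlib.h>
        |     #include <string.h>
        |     #include <unistd.h>
        | 
        |     int main(void)
        |     {
        |         const size_t blk = 4096;   /* assume 4 KiB alignment */
        |         int fd = open("column.d",
        |                       O_WRONLY | O_CREAT | O_DIRECT, 0644);
        |         if (fd < 0) { perror("open"); return 1; }
        | 
        |         void *buf;
        |         if (posix_memalign(&buf, blk, blk) != 0) {
        |             fprintf(stderr, "posix_memalign failed\n");
        |             return 1;
        |         }
        |         memset(buf, 0xAB, blk);    /* pretend column data */
        | 
        |         /* Aligned offset + aligned length: nothing has to be
        |          * read back in before the write completes, and O_DIRECT
        |          * bypasses the page cache entirely. */
        |         if (pwrite(fd, buf, blk, 0) != (ssize_t)blk)
        |             perror("pwrite");
        | 
        |         free(buf);
        |         close(fd);
        |         return 0;
        |     }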
        
       | davidhyde wrote:
        | Seems like using memory-mapped files for a write-only load is a
        | suboptimal choice. Maybe I'm mistaken, but surely using an
        | append-only file handle would be simpler than changing how the
        | memory-mapped files are cached, as they did for their solution?
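        | 
        | A sketch of that simpler approach (file name and payload are
        | invented): O_APPEND plus plain write(), no mapping involved.
        | 
        |     #include <fcntl.h>
        |     #include <stdio.h>
        |     #include <string.h>
        |     #include <unistd.h>
        | 
        |     int main(void)
        |     {
        |         int fd = open("column.d",
        |                       O_WRONLY | O_CREAT | O_APPEND, 0644);
        |         if (fd < 0) { perror("open"); return 1; }
        | 
        |         const char row[] = "one ingested row\n";
        |         /* Each write() lands at the current end of file; nothing
        |          * needs to be faulted in first, so no readahead. */
        |         if (write(fd, row, strlen(row)) < 0)
        |             perror("write");
        | 
        |         close(fd);
        |         return 0;
        |     }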
        
       | addisonj wrote:
       | I am going to write this comment with a large preface: I don't
       | think it is ever helpful to be an absolutist. For every best-
       | practice/"right way" to do things, there are circumstances when
        | doing it another way makes sense. There can be a ton of reasons
        | for that, be it technical, money/time, etc. The best engineering
        | teams aren't those that blindly follow what others say is a best
        | practice, but those that understand the options and make an
        | informed choice. None of the following is commentary on QuestDB
        | specifically; as they mention in the article, _many_ databases
        | use similar tools.
       | 
       | With that said, after reading the first paragraph I immediately
       | searched the article for "mmap" and had a good sense of where the
       | rest of this was going. Put simply, it is just really hard to
       | consider what the OS is going to do in all situations when using
        | mmap. Based on my experience, I would guess that a _ton_ of
        | people reading this comment have hit issues that, I would argue,
        | are due to using mmap. (Particularly looking at you, Prometheus.)
       | 
        | All told, this is a pretty innocuous incident of mmap causing
        | problems, but I would encourage any aspiring DB engineers to
        | read https://db.cs.cmu.edu/mmap-cidr2022 as it gives a great
        | overview of the range of problems that can occur when using
        | mmap.
       | 
        | I think some would argue that mmap is "fine" for append-only
        | workloads (and it is certainly more reasonable than for a DB
        | with arbitrary updates), but even here, factors like metadata
        | handling, scaling the number of tables, etc. will _eventually_
        | lead you to hit some fundamental problems with mmap.
       | 
        | The interesting opportunity, in my mind, especially with
        | improvements in async IO (both at the FS level and in tools like
        | Rust), is to build higher-level abstractions that bring the
        | "simplicity" of mmap but with more purpose-built semantics ideal
        | for databases.
        
       | 0xbadcafebee wrote:
       | There are other methods you can use to increase performance under
       | memory pressure, but you'd end up handling i/o directly and
       | maintaining your own index of memory and disk accesses, page-
        | aligned reads/writes, etc. It would be easier to just require
        | your users to buy more memory, but when there's a hack like this
       | available, that seems preferable to implementing your own VMM and
       | disk i/o subsystem.
        
       ___________________________________________________________________
       (page generated 2023-05-02 23:00 UTC)