[HN Gopher] Git ls-files is Faster Than Fd and Find
       ___________________________________________________________________
        
       Git ls-files is Faster Than Fd and Find
        
       Author : todsacerdoti
       Score  : 91 points
       Date   : 2021-11-21 17:33 UTC (5 hours ago)
        
 (HTM) web link (cj.rs)
 (TXT) w3m dump (cj.rs)
        
       | rmetzler wrote:
        | Small typo occurring twice: git ls-file instead of git ls-file
        | _s_.
       | 
       | Thanks for bringing it up, I didn't know about the command. I
       | also might have a use case for it.
        
         | c_joly wrote:
         | Thanks for pointing this out and for your kind words. A fix for
         | the typos should be live soon.
        
       | tyingq wrote:
       | A warm buffer cache makes a big difference too, so if you're
       | benchmarking things like find vs other tools, be sure to empty
       | the cache between runs.
       | 
       | For Linux:                 echo 3 > /proc/sys/vm/drop_caches
       | 
       | In this case, the author is using hyperfine with --warmup 10, so
       | the numbers are all using a warm buffer cache. A cold cache
       | probably would have been more realistic for comparison, since the
       | benchmark is traversing lots of directories.
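        | 
        | (For instance, a cold-cache variant of the comparison might look
        | roughly like this; hyperfine's --prepare command runs before
        | every timed run, dropping caches needs root, and the fd/find
        | invocations here are illustrative rather than the article's
        | exact ones.)
        | 
        |     hyperfine \
        |       --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
        |       'git ls-files' 'fd --type f' 'find . -type f'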
        
       | [deleted]
        
       | cb321 wrote:
       | I got run times from the simplest single-threaded directory walk
       | that are only 1.8x slower than git ls-files. (Min time of 10 runs
       | with the git repo housed by /dev/shm on Linux 5.15.)
       | 
        | The "simple" code is in
        | https://github.com/c-blake/cligen/blob/master/cligen/dents.n...
        | (a plain `dents find` does not require the special kernel batch-
        | system-call module to be fast; that kernel module is more about
        | statx batching, but io_uring can also do that these days. For
        | those unfamiliar with Nim, it's a high-productivity, high-
        | performance systems language.)
       | 
       | I believe that GNU find is slow because it is specifically
       | written to allow arbitrary filesystem depth as opposed to "open
       | file descriptor limit-limited depth". (I personally consider this
       | over-/mis-engineered from days of bygone systems with default
       | ulimits of, say, only 64 open fd's. There should at least be a
       | "fast mode" since let's be real - file hierarchies deeper than
       | 256..1024 levels, while possible, are rare and one should
       | optimize for the common case or at least allow manual instruction
       | that this case prevails. AFAIK there is no such "fast mode".
       | Maybe on some BSD?)
       | 
       | Meanwhile, I think the Rust fd is slow because of (probably
       | counterproductive) multi-threading (at least it does 11,000 calls
       | to futex).
        
         | aaaaaaaaaaab wrote:
         | >Meanwhile, I think the Rust fd is slow because of (probably
         | counterproductive) multi-threading (at least it does 11,000
         | calls to futex).
         | 
         | Afaik rustaceans call this "fearless concurrency".
        
         | R0b0t1 wrote:
         | From experience benchmarking this in the past, Window's poor
         | NTFS implementation sees speedup from multithreading (which I
         | can't make sense of) whereas Linux usually had a penalty for
         | multithreading directory traversals.
         | 
         | But, as mentioned below, this is faster because git-ls is using
         | a pregenerated index.
        
           | cb321 wrote:
            | No question/disagreement that it's faster from having an
            | index. It need only be about 1.8x faster, though -- not the
            | much higher ratio reported in the article (on Linux,
            | anyway).
            | 
            | I was mostly just adding some performance color: the
            | article's points of comparison are, in some sense, not very
            | fast.
        
             | R0b0t1 wrote:
              | Hm, yes, I reread the part about the performance of find.
              | GNU find is one of the fastest directory traversals out of
              | the box compared to many other implementations, e.g. the
              | ones that come in stdlibs. I was under the impression the
              | slowness came from applying the tests, when tests are
              | given.
        
               | cb321 wrote:
               | Running the tests like `-type l` or `-perm` or whatnot
               | surely can be slow, but that is not in play in the
               | article's use case.
               | 
               | In my experience, fast vs. slow "reputations" are sadly
               | very unreliable.
               | 
                | Here is some system call color.
                | 
                |     $ find | wc -l
                |     79348
                |     $ strace -c find >/dev/null
                |     % time     seconds  usecs/call     calls    errors syscall
                |     ------ ----------- ----------- --------- --------- ----------------
                |      29.57    0.039172           1     24453           fcntl
                |      24.51    0.032477           1     19615           close
                |      19.87    0.026329           1     14724           newfstatat
                |      16.78    0.022229           2      9786           getdents64
                |       7.48    0.009909           2      4948         6 openat
                |       0.86    0.001137           1       755           write
                |     ...
                |     ------ ----------- ----------- --------- --------- ----------------
                |     100.00    0.132487           1     74336         9 total
                | 
                |     $ dents find | wc -l
                |     79348
                |     $ strace -c dents f >/dev/null
                |     % time     seconds  usecs/call     calls    errors syscall
                |     ------ ----------- ----------- --------- --------- ----------------
                |      42.43    0.018491           3      5084           getdents64
                |      28.69    0.012503           2      4896           openat
                |      25.40    0.011071           2      4896           close
                |       3.47    0.001514           2       755           write
                |     ...
                |     ------ ----------- ----------- --------- --------- ----------------
                |     100.00    0.043579           2     15697         2 total
               | 
                | So, you know, almost 5x the system call count. (EDIT:
                | and yes, yes, not all syscalls are created equal. In
                | this hot-cache case there are so many that syscall
                | overhead alone could be a large fraction of the cost,
                | though.)
                | 
                | But you don't need to trust me. A recursive file tree
                | walk is probably about 50 lines of C, so you can code
                | one up and test it yourself.
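                | 
                | (For illustration, a minimal sketch of such a walk --
                | not the dents code -- assuming a filesystem that fills
                | in d_type; entries reporting DT_UNKNOWN would need an
                | fstatat fallback. Note it holds one open directory fd
                | per level of recursion, which is exactly the depth-vs-
                | fd-limit tradeoff mentioned above.)
                | 
                |     #include <dirent.h>
                |     #include <stdio.h>
                |     #include <string.h>
                | 
                |     /* Recurse with opendir/readdir; d_type avoids a stat per entry. */
                |     static void walk(const char *path) {
                |         DIR *d = opendir(path);
                |         if (!d)
                |             return;                  /* skip unreadable directories */
                |         struct dirent *e;
                |         while ((e = readdir(d)) != NULL) {
                |             if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
                |                 continue;
                |             char sub[4096];
                |             snprintf(sub, sizeof sub, "%s/%s", path, e->d_name);
                |             puts(sub);
                |             if (e->d_type == DT_DIR) /* descend into subdirectories */
                |                 walk(sub);
                |         }
                |         closedir(d);
                |     }
                | 
                |     int main(int argc, char **argv) {
                |         walk(argc > 1 ? argv[1] : ".");
                |         return 0;
                |     }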
        
         | bscphil wrote:
         | > I believe that GNU find is slow because it is specifically
         | written to allow arbitrary filesystem depth as opposed to "open
         | file descriptor limit-limited depth".
         | 
         | I haven't benchmarked find specifically, but I believe the most
         | common Rust library for the purpose, walkdir[1], also allows
         | arbitrary file system recursion depth, and is extremely fast.
         | It was fairly close to some "naive" limited depth code I wrote
         | in C for the same purpose. A lot of go-to C approaches seem to
         | needlessly call stat on every file, so they're even slower.
         | 
         | I'd be curious to see benchmarks of whether this actually makes
         | a difference.
         | 
         | [1] https://github.com/BurntSushi/walkdir
        
         | masklinn wrote:
         | > Meanwhile, I think the Rust fd is slow because of (probably
         | counterproductive) multi-threading (at least it does 11,000
         | calls to futex).
         | 
         | There's probably a switch to run single-threaded so that should
         | be testable.
        
           | cb321 wrote:
            | Agreed. Adding -j1 still leaves a very large (and actually
            | wildly varying) number of futex calls shown by strace, and
            | slows down execution by another 1.5x. So, a little boost
            | from threads, but not much. Someone more interested (than I
            | am) in "fixing" fd should study it more. Often people do not
            | "special case" things like "-j1" to actually be single-
            | threaded (with no MT runtime overheads) but instead "launch
            | only 1 worker thread" and such. It might take hacking on fd
            | to really test what is going on.
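            | 
            | (For the curious, the futex traffic is easy to see from the
            | outside; strace prints its summary to stderr, so only fd's
            | listing is discarded:)
            | 
            |     strace -f -c -e trace=futex fd -j1 >/dev/null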
        
             | psykotic wrote:
             | I've only profiled fd on Windows but one thing that stood
             | out was that it performed 2 stat syscalls per file via
             | NtQueryInformationFile (that number should be 0 per file on
             | Windows since stat metadata comes for free with
             | NtQueryDirectoryFile from the directory enumeration). When
             | I mentioned this finding on Twitter, someone confirmed that
             | it's also doubling up the stat syscalls on Linux. But if
             | the OP is actually trying to benchmark raw directory
             | enumeration speed vs git ls-files, they should make sure
             | they're benchmarking against something that's not making
             | per-file stat calls at all.
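              | 
              | (On Linux, one way to sanity-check that from the outside;
              | whether the calls show up as statx, newfstatat, or lstat
              | depends on the libc and the tool:)
              | 
              |     strace -f -c -e trace=statx,newfstatat,lstat fd >/dev/null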
        
       | filmor wrote:
       | Well, first doing `find > .my-index` and then measuring `cat .my-
       | index` would give you even better results... I don't find it
       | noteworthy that reading from an index is faster than actually
       | recursively walking the filesystem.
        
         | yakubin wrote:
         | Yes. I once wrote a tool which spends a lot of time traversing
         | the filesystem and to my surprise in one scenario most of its
         | time is spent on-CPU (!) in kernel-mode in the _readdir_r(2)_
          | syscall implementation. I still haven't dived into what it's
         | doing on the CPU, but it sure is interesting.
        
         | rakoo wrote:
          | No, it's not surprising, so why do we still not use indexes
          | for this?
          | 
          | NTFS maintains a journal of all file modifications
          | (https://en.wikipedia.org/wiki/USN_Journal). This is used by
          | Everything (https://www.voidtools.com/support/everything/) to
          | quickly and efficiently index _all_ files and folders. Thanks
          | to that, searching for a file is instantaneous because it's
          | "just an index read".
          | 
          | The feature is common: listing files. We know that indexes
          | help solve the issue. But we still use `find` because there's
          | no efficient file-indexing strategy. Yes, there's
          | updatedb/mlocate, but updating the db requires a full
          | traversal of the filesystem because there's no efficient
          | listing of changes in Linux, only filesystem-specific stuff.
          | 
          | So we will still have articles like this until the situation
          | changes, and they will still be relevant to the discussion
          | because the problem isn't "solved".
        
           | roywiggins wrote:
            | WizTree also uses NTFS metadata to perform _shockingly_ fast
            | reports on space usage. It's just much, much faster than
            | anything else I've seen, and I'm not sure there are any
            | equivalents for other filesystems.
        
             | rakoo wrote:
              | WizTree is so fast it makes me yearn for the day we have
              | the same functionality on Linux.
              | 
              | Not so much because I'm particularly interested in
              | visualizing free disk space on the regular, but because I
              | want to back up changed files as soon as possible, and the
              | only reliable way to do that is to have some kind of
              | journal.
        
           | ryantriangles wrote:
           | Everything's approach has important tradeoffs: the indexing
           | service must be run as administrator/root, and by reading the
           | USN journal it gives users a listing of _all_ files and
           | mtimes on the filesystem regardless of directory permissions.
           | That means that any user who can run a file search can also
            | see every other user's files, which includes their partial
           | web history (since most browsers, including Firefox and
           | Chrome, cache data in files named for the website they're
           | from, and the mtime will often be the last visit time). If
           | you want to avoid this, you can only switch to Everything's
           | traditional directory-traversal-based index, which has the
           | same performance as updatedb/mlocate, Baloo, or fsearch.
           | 
           | I think these tradeoffs are why there hasn't been as much
           | interest in replicating it, combined with the fact that
           | mlocate/fd/find/fsearch are already a lot faster in most
           | circumstances than the default Windows search and good enough
           | for the most common use cases (although there are certainly
            | use cases where they're not).
        
             | sillysaurusx wrote:
             | It feels like the index update should be a part of the
             | filesystem layer itself. Not a separate process like you're
             | saying.
        
             | rakoo wrote:
              | That's an extremely important tradeoff, but it's not
              | inherent to the approach: it's possible to split the work
              | so that indexing and searching happen in a root process
              | while querying happens in a user process that asks the
              | first one, with the root process filtering results by each
              | user's permissions. Nothing we haven't heard of before.
        
           | ww520 wrote:
            | Updating an index takes additional time and adds a random
            | seek to the index file's location. It also requires some
            | kind of transaction to group the update to the file metadata
            | with the update to the index. It just adds more complexity
            | with little benefit.
        
             | ziml77 wrote:
             | As someone who uses Everything many times per day, I can
             | say that there is a significant amount of benefit. I don't
             | think I'd be able to function at work without being able to
             | instantly search all my files. The lack of a similar
             | solution on Linux is one of the big barriers to me using
             | it. The best options I've seen there all refresh the index
             | on a schedule.
        
         | c_joly wrote:
         | > I don't find it noteworthy that reading from an index is
         | faster than actually recursively walking the filesystem
         | 
          | I agree; what I found noteworthy, though, is that git ls-files
          | uses an index you already have anyway.
         | 
         | (author here, "proof": https://cj.rs/contact/ &
         | https://keybase.io/leowzukw)
        
         | cseleborg wrote:
         | Yeah, that's not really the interesting part. It's clever
         | because if you're a developer editing source files, the Git
         | index is relatively up-to-date. So, as the author says, why not
         | take advantage of it?
        
       | nmz wrote:
       | Walk is faster than find. https://github.com/google/walk
        
       | gorgoiler wrote:
       | One thing I've never understood about Linux filesystems: given
       | how small and bounded the sets of directories and directory
       | entries are, why is filesystem traversal not instantaneous?
        
         | spicybright wrote:
          | Lack of caches, I'd assume. Not because it would be hard, but
          | because no one has taken the time to implement it.
        
           | toxik wrote:
           | They say there are two hard problems in computer science:
           | cache invalidation, naming, and off-by-one errors.
        
             | spicybright wrote:
             | Hehe. You're right, I forgot one of the golden rules.
             | There's probably a better reason why no one has implemented
             | it.
        
         | amelius wrote:
         | Also, why can tab-completion in Bash hang my shell?
        
           | chrsig wrote:
           | This is actually an incredibly frustrating thing -- when
           | doing disk i/o, the process goes into an uninterruptible
           | sleep state[0].
           | 
           | So if your shell is doing some tab completion that goes into
           | an uninterruptible sleep, there's no escape sequence you can
           | send it to cancel that operation.
           | 
            | Where I've seen this be the most vexing is in AWS when an
            | EBS volume dies (a more frequent occurrence than you'd
            | think!). It effectively prevents the machine from
           | cleanly shutting down, and you wind up having to wait for the
           | instance to be forcefully terminated.
           | 
           | [0] https://eklitzke.org/uninterruptible-sleep
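            | 
            | (You can spot a process stuck like this from another
            | terminal; a STAT of "D" is uninterruptible sleep:)
            | 
            |     ps axo pid,stat,wchan:20,comm | awk '$2 ~ /^D/'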
        
         | m000 wrote:
         | What you describe is an over-specialized optimization that very
         | few users would benefit from, but would still introduce
         | significant complexity.
         | 
          | Linux already transparently caches filesystem metadata. You
          | already get a good speedup if you attempt the same directory
          | walk twice and not much has changed in the filesystem.
        
           | gorgoiler wrote:
           | My desktop has ~2MB of basenames on it. Memory isn't free,
           | and there's more to an inode than a filename, but it seems
           | odd that this data that's the size of a cat photo doesn't get
           | special treatment over other vm cache data.
        
             | thanatos519 wrote:
             | That's tunable, and apparently the default is reasonable:
             | https://sysctl-explorer.net/vm/vfs_cache_pressure/
             | 
             | My workstation has 64GB of RAM and only ~7M directory
             | entries so I have 'vm.vfs_cache_pressure = 1' in
             | /etc/sysctl.conf and cache everything with a full directory
             | traversal via find. The first time it takes 52s; subsequent
             | times take 5s. It has never given me memory problems.
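              | 
              | (Something like this applies the setting without a reboot
              | and warms the cache; illustrative only, and -xdev just
              | keeps find on one filesystem:)
              | 
              |     sudo sysctl vm.vfs_cache_pressure=1   # apply now
              |     find / -xdev >/dev/null               # warm the dentry/inode caches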
        
               | spicybright wrote:
                | I'm assuming the cache isn't kept on disk, though,
                | right?
                | 
                | As in, 52s the first time since boot and ~5s for
                | subsequent runs (incorporating time for changed
                | files/directories)?
        
         | topspin wrote:
         | > why is filesystem traversal not instantaneous
         | 
         | Because a mounted file system isn't a simple indexed list of
         | paths. File systems are shared, mutable state.
         | 
         | The mechanism you're asking about is called the dentry cache[1]
         | and a decent narrative of its operation is found here[2]. It
         | has the fabulously complex job of atomically brokering the
         | state of an arbitrary number of local and remote file systems
         | between an arbitrary number of processes using arbitrary access
         | patterns without consuming all available RAM. Expecting the
         | dentry cache to yield 'instantaneous' results is unreasonable.
          | Comparing its performance to that of a VCS index is naive.
         | 
         | [1]
         | https://www.kernel.org/doc/Documentation/filesystems/vfs.txt
         | [2] https://www.halolinux.us/kernel-reference/the-dentry-
         | cache.h... (no endorsement of these people, but the dentry
         | cache description is both concise and sufficient.)
        
         | wvh wrote:
         | Probably having to check if the actual disk entries changed is
         | what slows it down. I wonder if it would be possible with
         | nowadays' memory sizes to keep all metadata in memory as a
         | write-through cache. Not sure if it'd be worth it though, my
         | system has close to half a million files, but I'm only
         | interested in about a hundred or so. I don't think file systems
         | are slow in practice for typical human-scale operations though,
         | with the exception of non-indexed "search all my files" type of
         | operations.
        
       | bee_rider wrote:
       | Of course scanning an index is faster than traversing the
       | filesystem.
       | 
       | Is locate/mlocate some obscure command? It works pretty well for
       | this sort of thing (and has the advantage that you wouldn't need
       | to put git repos everywhere, or something silly like that).
       | 
        | I often forget what I've named a pdf that I've downloaded, but
        | usually I'll put something related to the topic of a paper in
        | the file name, so a command like:
        | 
        |     locate -i "*svd*.pdf"
        | 
        | will have a decent chance of finding any papers about svd's that
        | I have, and is more or less instantaneous.
       | 
        | Although -- I think I did come across the locate command because
        | I was searching for alternatives to 'find.' I cannot for the
        | life of me remember the correct order of the path to start
        | searching in and the filename.
        
         | cb321 wrote:
          | plocate [1] is even faster (but of course with either you need
          | to re-scan whole filesystems and rebuild the indexes to have
          | an up-to-date view).
         | 
         | [1] https://www.linuxuprising.com/2021/09/plocate-is-much-
         | faster...
        
         | quicklime wrote:
         | The problem with locate is that it requires the index db to be
         | updated periodically (I think this happens daily by default?).
         | For some use cases, especially those where I'm searching for
         | files in a tree that I'm actively working in, this forces me to
         | fall back to find (or maybe git ls-files now).
         | 
         | I feel like it should be the job of the filesystem to maintain
         | an index and incrementally update it whenever files are created
         | or removed, but afaict no modern filesystem offers it.
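          | 
          | (You can at least refresh the database on demand before a
          | search; the search term here is just a placeholder:)
          | 
          |     sudo updatedb && locate -i something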
        
           | spicybright wrote:
           | It really is strange we don't have that yet.
           | 
            | We already pretend files are on the physical disk and sync
            | them in the background. Adding smarter indexing to that
            | shouldn't be a huge deal.
        
           | bee_rider wrote:
           | I suppose this is sort of a problem depending on how you use
           | it. I find that I don't really need to locate things that
           | I've recently used, because I already know where they are as
           | a result of having recently used them. But, other people have
           | different workflows of course (I can imagine, for example, if
           | somebody's workflow involved creating lots of files, only
           | some of which are immediately useful, they might want the
           | ability to scan through them for particular outputs).
        
           | ziml77 wrote:
           | As mentioned elsewhere in these comments, NTFS actually does
           | do that. There are a couple tools out there that take
           | advantage of it like Everything Search and WizTree. I don't
           | know why no open source filesystem has done the same thing.
        
           | Karellen wrote:
           | Sarcasm?
        
       | cryptonector wrote:
       | This is why I have a git alias `ls` for `ls-files`.
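        | 
        | (For anyone wanting the same shortcut, that's just:)
        | 
        |     git config --global alias.ls ls-files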
        
       | njharman wrote:
        | > source code management system (SCM), it's main business3 is not
        | to help you list your files!
        | 
        | This seems an utterly bizarre statement. What does the author
        | imagine an SCM does? Lots of things, but listing the files that
        | are the source it is managing is up there.
        | 
        | It's like saying that an FPS game's business is not pushing
        | triangles through the GPU.
        | 
        | The larger purpose is facilitated by, and requires, the lesser
        | purpose.
        
       | mikewhy wrote:
       | FYI to anyone looking to try this out: the `--others` flag, used
       | to include untracked files, will also print out files included in
        | your .gitignore. So if you have, e.g., a node_modules/ or venv/
        | folder, all of its contents will be listed.
       | 
       | This is often unwanted noise. I haven't been able to find if
       | there's some combination of flags that would get the desired
       | behaviour, but it's been a while since I've messed around with
       | this.
        
         | xyzzy_plugh wrote:
         | What's the desired behaviour you want? --cached --others
         | --exclude-standard will show tracked and untracked but not
         | ignored.
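          | 
          | (That is, the combination being described, which lists tracked
          | plus untracked-but-not-ignored files:)
          | 
          |     git ls-files --cached --others --exclude-standard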
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-11-21 23:00 UTC)