[HN Gopher] Git ls-files is Faster Than Fd and Find
___________________________________________________________________

Git ls-files is Faster Than Fd and Find

Author : todsacerdoti
Score  : 91 points
Date   : 2021-11-21 17:33 UTC (5 hours ago)

(HTM) web link (cj.rs)
(TXT) w3m dump (cj.rs)

| rmetzler wrote:
| Small typo, occurring twice: git ls-file instead of git ls-file_s_.
|
| Thanks for bringing it up, I didn't know about the command. I also might have a use case for it.
| c_joly wrote:
| Thanks for pointing this out and for your kind words. A fix for the typos should be live soon.
| tyingq wrote:
| A warm buffer cache makes a big difference too, so if you're benchmarking things like find vs other tools, be sure to empty the cache between runs.
|
| For Linux: echo 3 > /proc/sys/vm/drop_caches
|
| In this case, the author is using hyperfine with --warmup 10, so the numbers are all using a warm buffer cache. A cold cache probably would have been more realistic for comparison, since the benchmark is traversing lots of directories.
| [deleted]
| cb321 wrote:
| I got run times from the simplest single-threaded directory walk that are only 1.8x slower than git ls-files. (Min time of 10 runs with the git repo housed by /dev/shm on Linux 5.15.)
|
| The "simple" code is in https://github.com/c-blake/cligen/blob/master/cligen/dents.n... (just `dents find`; it does not require the special kernel batch system call module to be fast. That kernel module is more about statx batching, but io_uring can also do that these days. For those unfamiliar with Nim, it's a high-productivity, high-performance systems language.)
|
| I believe that GNU find is slow because it is specifically written to allow arbitrary filesystem depth as opposed to "open file descriptor limit-limited depth". (I personally consider this over-/mis-engineered from days of bygone systems with default ulimits of, say, only 64 open fd's.
There should at least be a "fast mode", since, let's be real, file hierarchies deeper than 256..1024 levels, while possible, are rare, and one should optimize for the common case or at least allow manual instruction that this case prevails. AFAIK there is no such "fast mode". Maybe on some BSD?)
|
| Meanwhile, I think the Rust fd is slow because of (probably counterproductive) multi-threading (at least it does 11,000 calls to futex).
| aaaaaaaaaaab wrote:
| >Meanwhile, I think the Rust fd is slow because of (probably counterproductive) multi-threading (at least it does 11,000 calls to futex).
|
| Afaik rustaceans call this "fearless concurrency".
| R0b0t1 wrote:
| From experience benchmarking this in the past, Windows' poor NTFS implementation sees a speedup from multithreading (which I can't make sense of), whereas Linux usually had a penalty for multithreaded directory traversals.
|
| But, as mentioned below, this is faster because git ls-files is using a pregenerated index.
| cb321 wrote:
| No question/disagreement that it's faster from having an index. It just turns out to be only 1.8x faster -- not the much higher ratio reported in the article (on Linux, anyway).
|
| I was mostly just adding some performance color: the points of comparison in the article are, in some sense, not very fast.
| R0b0t1 wrote:
| Hm, yes, I reread the part about the performance of find. GNU find is one of the fastest directory traversals out of the box compared to many other implementations, e.g. the ones that come in stdlibs. I was under the impression the slowness came from applying the tests, when they were applied.
| cb321 wrote:
| Running tests like `-type l` or `-perm` or whatnot surely can be slow, but that is not in play in the article's use case.
|
| In my experience, fast vs. slow "reputations" are sadly very unreliable.
|
| Here is some system call color.
| $ find | wc -l
| 79348
| $ strace -c find >/dev/null
| % time     seconds  usecs/call     calls    errors syscall
| ------ ----------- ----------- --------- --------- ----------------
|  29.57    0.039172           1     24453           fcntl
|  24.51    0.032477           1     19615           close
|  19.87    0.026329           1     14724           newfstatat
|  16.78    0.022229           2      9786           getdents64
|   7.48    0.009909           2      4948         6 openat
|   0.86    0.001137           1       755           write
|    ...
| ------ ----------- ----------- --------- --------- ----------------
| 100.00    0.132487           1     74336         9 total
| $ dents find | wc -l
| 79348
| $ strace -c dents f >/dev/null
| % time     seconds  usecs/call     calls    errors syscall
| ------ ----------- ----------- --------- --------- ----------------
|  42.43    0.018491           3      5084           getdents64
|  28.69    0.012503           2      4896           openat
|  25.40    0.011071           2      4896           close
|   3.47    0.001514           2       755           write
|    ...
| ------ ----------- ----------- --------- --------- ----------------
| 100.00    0.043579           2     15697         2 total
|
| So, you know, almost 5x the system call count. (EDIT: and yes, yes, not all syscalls are created equal. In this hot-cache case there are so many that syscall overhead alone could be a large fraction of the cost, though.)
|
| But you don't need to trust me. It's probably about 50 lines of C to code up a file tree walk recursion and test it yourself.
| bscphil wrote:
| > I believe that GNU find is slow because it is specifically written to allow arbitrary filesystem depth as opposed to "open file descriptor limit-limited depth".
|
| I haven't benchmarked find specifically, but I believe the most common Rust library for the purpose, walkdir[1], also allows arbitrary file system recursion depth, and is extremely fast. It was fairly close to some "naive" limited-depth code I wrote in C for the same purpose. A lot of go-to C approaches seem to needlessly call stat on every file, so they're even slower.
|
| I'd be curious to see benchmarks of whether this actually makes a difference.
|
| [1] https://github.com/BurntSushi/walkdir
| masklinn wrote:
| > Meanwhile, I think the Rust fd is slow because of (probably counterproductive) multi-threading (at least it does 11,000 calls to futex).
|
| There's probably a switch to run single-threaded, so that should be testable.
| cb321 wrote:
| Agreed. Adding -j1 still leaves a very large (and actually wildly varying) number of futex calls shown by strace and slows down execution by another 1.5x. So, a little boost from threads, but not much. Someone more interested (than I am) in "fixing" fd should study it more. Often people do not special-case things like "-j1" to actually be single-threaded (with no MT runtime overheads) but instead "launch only 1 worker thread" and such. It might take hacking on fd to really test what is going on.
| psykotic wrote:
| I've only profiled fd on Windows, but one thing that stood out was that it performed 2 stat syscalls per file via NtQueryInformationFile (that number should be 0 per file on Windows, since stat metadata comes for free with NtQueryDirectoryFile from the directory enumeration). When I mentioned this finding on Twitter, someone confirmed that it's also doubling up the stat syscalls on Linux. But if the OP is actually trying to benchmark raw directory enumeration speed vs git ls-files, they should make sure they're benchmarking against something that's not making per-file stat calls at all.
| filmor wrote:
| Well, first doing `find > .my-index` and then measuring `cat .my-index` would give you even better results... I don't find it noteworthy that reading from an index is faster than actually recursively walking the filesystem.
| yakubin wrote:
| Yes. I once wrote a tool which spends a lot of time traversing the filesystem, and to my surprise, in one scenario most of its time was spent on-CPU (!) in kernel mode in the _readdir_r(2)_ syscall implementation.
I still haven't dived into what it's doing on the CPU, but it sure is interesting.
| rakoo wrote:
| No, it's not surprising, so why do we still not use indexes for this?
|
| NTFS maintains a journal of all file modifications (https://en.wikipedia.org/wiki/USN_Journal). This is used by Everything (https://www.voidtools.com/support/everything/) to quickly and efficiently index _all_ files and folders. Thanks to that, searching for a file is instantaneous because it's "just an index read".
|
| The feature is common: listing files. We know that indexes help solve the issue. But we still use `find` because there's no efficient file-indexing strategy. Yes, there's updatedb/mlocate, but updating the db requires a full traversal of the filesystem because there's no efficient listing of changes in Linux, only filesystem-specific stuff.
|
| So we will still have articles like this until the situation changes, and it will still be relevant to the discussion because it's not "solved".
| roywiggins wrote:
| WizTree also uses NTFS metadata to perform _shockingly_ fast reports on space usage. Just much, much faster than anything else I've seen, and I'm not sure there are any equivalents for other filesystems.
| rakoo wrote:
| WizTree is so fast it makes me yearn for the day we have the same functionality on Linux.
|
| Not so much because I'm particularly interested in visualizing free disk space on the regular, but because I want to back up changed files as soon as possible, and the only reliable way to do that is to have some kind of journal.
| ryantriangles wrote:
| Everything's approach has important tradeoffs: the indexing service must be run as administrator/root, and by reading the USN journal it gives users a listing of _all_ files and mtimes on the filesystem regardless of directory permissions.
| That means that any user who can run a file search can also see every other user's files, which includes their partial web history (since most browsers, including Firefox and Chrome, cache data in files named for the website they're from, and the mtime will often be the last visit time). If you want to avoid this, you can only switch to Everything's traditional directory-traversal-based index, which has the same performance as updatedb/mlocate, Baloo, or fsearch.
|
| I think these tradeoffs are why there hasn't been as much interest in replicating it, combined with the fact that mlocate/fd/find/fsearch are already a lot faster in most circumstances than the default Windows search and good enough for the most common use cases (although there are certainly use cases where they're not).
| sillysaurusx wrote:
| It feels like the index update should be part of the filesystem layer itself, not a separate process like you're saying.
| rakoo wrote:
| That's an extremely important tradeoff, but it's not an inherent problem: it is possible to split the work between a root process that indexes and filters results according to each user's rights, and a user process that queries it. Nothing we haven't heard of before.
| ww520 wrote:
| Updating an index adds additional time, adding a random seek to the index file location. It also requires some transaction to group the update to the file metadata and the index. It just adds more complexity with little benefit.
| ziml77 wrote:
| As someone who uses Everything many times per day, I can say that there is a significant amount of benefit. I don't think I'd be able to function at work without being able to instantly search all my files. The lack of a similar solution on Linux is one of the big barriers to me using it. The best options I've seen there all refresh the index on a schedule.
| c_joly wrote:
| > I don't find it noteworthy that reading from an index is faster than actually recursively walking the filesystem
|
| I agree; what I found noteworthy, though, is that git ls-files uses this index you already have.
|
| (author here, "proof": https://cj.rs/contact/ & https://keybase.io/leowzukw)
| cseleborg wrote:
| Yeah, that's not really the interesting part. It's clever because if you're a developer editing source files, the Git index is relatively up-to-date. So, as the author says, why not take advantage of it?
| nmz wrote:
| Walk is faster than find. https://github.com/google/walk
| gorgoiler wrote:
| One thing I've never understood about Linux filesystems: given how small and bounded the sets of directories and directory entries are, why is filesystem traversal not instantaneous?
| spicybright wrote:
| Lack of caches, I'd assume. Not because it's not easy, but because no one has taken the time to implement it.
| toxik wrote:
| They say there are two hard problems in computer science: cache invalidation, naming, and off-by-one errors.
| spicybright wrote:
| Hehe. You're right, I forgot one of the golden rules. There's probably a better reason why no one has implemented it.
| amelius wrote:
| Also, why can tab-completion in Bash hang my shell?
| chrsig wrote:
| This is actually an incredibly frustrating thing -- when doing disk I/O, the process goes into an uninterruptible sleep state[0].
|
| So if your shell is doing some tab completion that goes into an uninterruptible sleep, there's no escape sequence you can send it to cancel that operation.
|
| Where I've seen this be the most vexing is in AWS, if an EBS volume dies (a more frequent occurrence than you'd think!). It effectively prevents the machine from cleanly shutting down, and you wind up having to wait for the instance to be forcefully terminated.
|
| [0] https://eklitzke.org/uninterruptible-sleep
| m000 wrote:
| What you describe is an over-specialized optimization that very few users would benefit from, but it would still introduce significant complexity.
|
| Linux already transparently caches filesystem metadata. You already get a good speedup if you attempt the same directory walk twice, provided not much has changed in the filesystem.
| gorgoiler wrote:
| My desktop has ~2MB of basenames on it. Memory isn't free, and there's more to an inode than a filename, but it seems odd that this data, which is the size of a cat photo, doesn't get special treatment over other vm cache data.
| thanatos519 wrote:
| That's tunable, and apparently the default is reasonable: https://sysctl-explorer.net/vm/vfs_cache_pressure/
|
| My workstation has 64GB of RAM and only ~7M directory entries, so I have 'vm.vfs_cache_pressure = 1' in /etc/sysctl.conf and cache everything with a full directory traversal via find. The first time it takes 52s; subsequent times take 5s. It has never given me memory problems.
| spicybright wrote:
| I'm assuming the cache isn't kept on disk though, right?
|
| As in, 52s the first time since boot, 5s+ for subsequent runs (incorporating time for changed files/directories).
| topspin wrote:
| > why is filesystem traversal not instantaneous
|
| Because a mounted file system isn't a simple indexed list of paths. File systems are shared, mutable state.
|
| The mechanism you're asking about is called the dentry cache[1], and a decent narrative of its operation is found here[2]. It has the fabulously complex job of atomically brokering the state of an arbitrary number of local and remote file systems between an arbitrary number of processes using arbitrary access patterns, without consuming all available RAM. Expecting the dentry cache to yield 'instantaneous' results is unreasonable. Comparing its performance to that of a VCS index is naive.
|
| [1] https://www.kernel.org/doc/Documentation/filesystems/vfs.txt
| [2] https://www.halolinux.us/kernel-reference/the-dentry-cache.h... (no endorsement of these people, but the dentry cache description is both concise and sufficient.)
| wvh wrote:
| Probably having to check whether the actual disk entries changed is what slows it down. I wonder if it would be possible, with today's memory sizes, to keep all metadata in memory as a write-through cache. Not sure if it'd be worth it though; my system has close to half a million files, but I'm only interested in about a hundred or so. I don't think file systems are slow in practice for typical human-scale operations, with the exception of non-indexed "search all my files" type operations.
| bee_rider wrote:
| Of course scanning an index is faster than traversing the filesystem.
|
| Is locate/mlocate some obscure command? It works pretty well for this sort of thing (and has the advantage that you wouldn't need to put git repos everywhere, or something silly like that).
|
| I often forget what I've named a pdf that I've downloaded, but usually I'll put something related to the topic of a paper in the file name, so a command like:
|
|     locate -i "*svd*.pdf"
|
| will have a decent chance of finding any papers about SVDs that I have, and is more or less instantaneous.
|
| Although -- I think I did come across the locate command because I was searching for alternatives to 'find'. I can not for the life of me remember the correct order for the path to start searching and the filename.
| cb321 wrote:
| plocate [1] is even faster (but of course with either you need to re-scan the whole FSes and rebuild the indexes to have an up-to-date view).
|
| [1] https://www.linuxuprising.com/2021/09/plocate-is-much-faster...
| quicklime wrote:
| The problem with locate is that it requires the index db to be updated periodically (I think this happens daily by default?).
| For some use cases, especially those where I'm searching for files in a tree that I'm actively working in, this forces me to fall back to find (or maybe git ls-files now).
|
| I feel like it should be the job of the filesystem to maintain an index and incrementally update it whenever files are created or removed, but afaict no modern filesystem offers it.
| spicybright wrote:
| It really is strange we don't have that yet.
|
| We already pretend files are on the physical disk and sync them in the background. Adding smarter indexing to that shouldn't be a huge deal.
| bee_rider wrote:
| I suppose this is sort of a problem depending on how you use it. I find that I don't really need to locate things that I've recently used, because I already know where they are as a result of having recently used them. But other people have different workflows, of course (I can imagine, for example, if somebody's workflow involved creating lots of files, only some of which are immediately useful, they might want the ability to scan through them for particular outputs).
| ziml77 wrote:
| As mentioned elsewhere in these comments, NTFS actually does do that. There are a couple of tools out there that take advantage of it, like Everything Search and WizTree. I don't know why no open source filesystem has done the same thing.
| Karellen wrote:
| Sarcasm?
| cryptonector wrote:
| This is why I have a git alias `ls` for `ls-files`.
| njharman wrote:
| > source code management system (SCM), its main business[3] is not to help you list your files!
|
| This seems an utterly bizarre statement. What does the author imagine an SCM does? Lots of things, but listing the files that are the source it is managing is up there.
|
| Like saying that since an FPS is a game, its business is not pushing triangles through the GPU.
|
| The larger purpose is facilitated by, and requires, the lesser purpose.
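cryptonector's alias is one line of configuration: `git config --global alias.ls ls-files`. As a quick illustration of what the command actually reads -- the Git index, so not even a commit is required -- here is a sketch against a throwaway repository. It assumes `git` is on the PATH; the file names are invented for the example.

```python
import os
import subprocess
import tempfile

def git(*args, cwd):
    """Run a git command, discarding output; raises on failure."""
    subprocess.run(["git", *args], cwd=cwd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

def ls_files(cwd, *flags):
    """Return the sorted output lines of `git ls-files` with the given flags."""
    out = subprocess.run(["git", "ls-files", *flags], cwd=cwd,
                         check=True, capture_output=True, text=True)
    return sorted(out.stdout.split())

repo = tempfile.mkdtemp()
git("init", cwd=repo)
for name in ("tracked.c", "untracked.c", "ignored.o"):
    open(os.path.join(repo, name), "w").close()
with open(os.path.join(repo, ".gitignore"), "w") as fh:
    fh.write("*.o\n")
git("add", "tracked.c", ".gitignore", cwd=repo)  # populates the index; no commit needed

print(ls_files(repo))              # tracked files only: ['.gitignore', 'tracked.c']
print(ls_files(repo, "--others"))  # untracked files, ignored ones included
print(ls_files(repo, "--cached", "--others", "--exclude-standard"))
```

The last invocation lists tracked plus untracked-but-not-ignored files, which is usually the listing you want when feeding a fuzzy finder.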
| mikewhy wrote:
| FYI to anyone looking to try this out: the `--others` flag, used to include untracked files, will also print out files matched by your .gitignore. So if you have e.g. a node_modules/ or venv/ folder, all its contents will be listed.
|
| This is often unwanted noise. I haven't been able to find whether there's some combination of flags that would get the desired behaviour, but it's been a while since I've messed around with this.
| xyzzy_plugh wrote:
| What's the desired behaviour you want? --cached --others --exclude-standard will show tracked and untracked but not ignored.
| [deleted]
___________________________________________________________________
(page generated 2021-11-21 23:00 UTC)