[HN Gopher] You can list a directory containing 8M files, but no...
___________________________________________________________________

You can list a directory containing 8M files, but not with ls

Author : _wldu
Score  : 50 points
Date   : 2021-08-15 19:21 UTC (3 hours ago)

(HTM) web link (be-n.com)
(TXT) w3m dump (be-n.com)

| tyingq wrote:
| perl -E 'opendir(my $d,".");say while readdir $d'

| marcodiego wrote:
| Makes me think that findfirst and findnext were not that bad
| after all.

| Y_Y wrote:
| If you haven't prepared for this eventuality, then odds are
| you're going to run out of inodes first. And it's probably not
| useful to just dump all those filenames to your terminal. And
| don't even say you were piping the output of `ls` to something
| else!
|
| Anyway, the coreutils shouldn't have arbitrary limits like this;
| at least if they do, the limits should be so high that you have
| to be really trying hard in order to reach them.

| cmeacham98 wrote:
| There isn't actually an arbitrary limit; it's just that glibc's
| readdir() implementation is really, really slow with millions of
| files, according to the article. Presumably if you waited a
| while, `ls` would eventually get the whole list.

| matheusmoreira wrote:
| The glibc functions are just bad wrappers for the real system
| calls, which no doubt work much more efficiently. I fully
| expected to find the system call solution in the article and was
| not disappointed.
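[For readers who want the raw-syscall version the article and
several comments allude to: below is a minimal sketch, not the
article's exact program, that calls getdents64 directly with a
large buffer and prints names unsorted. The 5 MiB buffer size is an
arbitrary choice, and this is Linux-specific.]

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Layout of the records getdents64 writes into the buffer. */
    struct linux_dirent64 {
        unsigned long long d_ino;
        long long          d_off;
        unsigned short     d_reclen;  /* total size of this record */
        unsigned char      d_type;
        char               d_name[];  /* NUL-terminated filename */
    };

    int main(int argc, char **argv) {
        const char *path = argc > 1 ? argv[1] : ".";
        int fd = open(path, O_RDONLY | O_DIRECTORY);
        if (fd < 0) { perror("open"); return 1; }

        /* Much larger than glibc's 32K buffer, so each syscall
           returns many more entries. */
        size_t buf_size = 5 * 1024 * 1024;
        char *buf = malloc(buf_size);
        if (!buf) { perror("malloc"); return 1; }

        for (;;) {
            long n = syscall(SYS_getdents64, fd, buf, buf_size);
            if (n < 0) { perror("getdents64"); return 1; }
            if (n == 0) break;  /* end of directory */
            for (long off = 0; off < n;) {
                struct linux_dirent64 *d =
                    (struct linux_dirent64 *)(buf + off);
                puts(d->d_name);  /* no sorting, no stat() */
                off += d->d_reclen;
            }
        }
        free(buf);
        close(fd);
        return 0;
    }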
| mercurialuser wrote:
| Did you try ls -1? In the far past I had the same problem
| listing millions of files in a dir. Edit: if I remember
| correctly, ls buffers the results for sorting; with -1 it just
| dumps the values.

| acdha wrote:
| It's not "ls -1" but "--sort=none" or "-U".

| loeg wrote:
| "ls -f" in POSIX ls (which GNU ls also implements). Also, avoid
| "-F", which will stat each file.

| the_arun wrote:
| `ls | more` works too, right?

| acdha wrote:
| No - it still does all of the work to sort the entries, which is
| the slow part, since it prevents the first entry from being
| displayed until the last has been retrieved.

| loeg wrote:
| At least with GNU ls, 'ls | more' does not disable sorting. It
| disables automatic color (which is important -- coloring
| requires 'stat(2)'ing every file in the directory).

| innagadadavida wrote:
| I think it also tries to neatly format into columns, and this
| requires it to know the name lengths of all files. If you do -l
| it basically outputs one file per line and can be done more
| efficiently.

| sigmonsays wrote:
| Seems the author didn't read the man page. ls -1f, as others
| have pointed out, is a much better solution.
|
| Additionally, having 8 million of anything in a single directory
| screams bad planning. It's common to plan for some hashed
| directory structure.

| unwind wrote:
| Meta, if the author is around: there seems to be some kind of
| encoding problem; on my system (Linux, Firefox) I see a lot of
| strange characters where there should probably be punctuation.
|
| The first section header reads "Why doesnâ€™t ls work?", for
| instance.

| FpUser wrote:
| Same here

| cmeacham98 wrote:
| This is because the page has no doctype, thus putting the
| browser in "quirks mode", defaulting to a charset of ISO-8859-1
| (as the page does not specify one). The author can fix this
| either by specifying the charset, or adding the HTML5 doctype
| (HTML5 defaults to UTF-8).
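[Concretely, a sketch of the usual fix, not the page's actual
markup: declare the doctype and the charset in the first 1024 bytes
of the document.]

    <!DOCTYPE html>
    <html>
    <head>
      <meta charset="utf-8">
    </head>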
| dheera wrote:
| Maybe browsers should default to UTF-8 already. It's 2021.

| lxgr wrote:
| Why? Defaulting to UTF-8 for modern HTML, and to ISO-8859-1 for
| legacy pages, makes a lot of sense.
|
| Pages that haven't been adapted to HTML 5 in the last 10 years
| or so are exceedingly unlikely to do so in year 11.

| dheera wrote:
| ISO-8859-1 is a subset of UTF-8, isn't it? No harm done by
| defaulting to the superset.

| [deleted]

| CodesInChaos wrote:
| No. ASCII is a subset of UTF-8; ISO-8859-1 is not. The first 256
| codepoints of Unicode match ISO-8859-1, which is probably the
| source of your confusion. However, codepoints 128-255 are
| encoded differently in UTF-8: they are represented by a single
| byte when encoded as ISO-8859-1, while they turn into two bytes
| when encoded in UTF-8.
|
| Plus, "ISO-8859-1" is treated as Windows-1252 by browsers, while
| Unicode uses ISO-8859-1 extended with the ISO 6429 control
| characters for its initial 256 codepoints.

| dheera wrote:
| Ah I see, thanks.

| anyfoo wrote:
| If it were, the characters in question would already display
| correctly for this website, since they are within ISO-8859-1.
| ASCII is a subset of UTF-8.

| magicalhippo wrote:
| We need to handle a lot of crappy data-in-text-files at work,
| and for most of them using the UTF-8 duck test seems to be the
| most reliable.
|
| If it decodes successfully as UTF-8, it's probably UTF-8.

| wolfgang42 wrote:
| That requires scanning the whole file before guessing the
| encoding, which browsers don't do for performance reasons (and
| also because an HTML document _may never end_; it's perfectly
| valid for the server to keep appending to the document
| indefinitely). The HTML5 spec does recommend doing this on the
| first 1024 bytes, though.

| magicalhippo wrote:
| Browsers are quite happy to re-render the whole document
| multiple times, though, so they could just switch and re-decode
| when UTF-8 decoding fails. Sure, it wouldn't be the fast path,
| but it sure beats looking at hieroglyphs.
|
| And yeah, add some sensible limits to this logic, of course.
| Most web pages aren't never-ending nor multi-GB of text.

| wolfgang42 wrote:
| _> HTML5 defaults to UTF-8_
|
| I'm not sure this is correct, though the WHATWG docs[1] are kind
| of confusing. From what I can tell, it seems like HTML5
| documents are required to be UTF-8, but also this is required to
| be explicitly declared either in the Content-Type header, a
| leading BOM, or a <meta> tag in the first 1024 bytes of the
| file. Reading this blog post[2], it sounds like there is a
| danger that if you don't do this then heuristics will kick in
| and try to guess the charset instead; the documented algorithm
| for this doesn't seem to consider the doctype at all.
|
| [1]: https://html.spec.whatwg.org/dev/semantics.html#charset
|
| [2]: https://blog.whatwg.org/the-road-to-html-5-character-
| encodin...

| Aardwolf wrote:
| The single quote in "doesn't" is an ASCII character though; why
| does that one become â€™?

| iudqnolq wrote:
| Here's the heuristic-based hypothesis of the Python package
| ftfy:
|
|     >>> ftfy.fix_and_explain("â€™")
|     ExplainedText(
|         text="'",
|         explanation=[
|             ('encode', 'sloppy-windows-1252'),
|             ('decode', 'utf-8'),
|             ('apply', 'uncurl_quotes')
|         ]
|     )

| wolfgang42 wrote:
| Note that uncurl_quotes is an FTFY fix unrelated to character
| encoding; it's basically just s/’/'/. (FTFY turns all of its
| fixes on by default, which sometimes results in it doing more
| than you might want it to.)
|
| You can play around with FTFY here (open the "Decoding steps" to
| see the explanation of what it did and why):
| https://www.linestarve.com/tools/mojibake/?mojibake=â€™

| bdowling wrote:
| It's not. It's a Unicode 'RIGHT SINGLE QUOTATION MARK' (U+2019),
| which in UTF-8 is encoded as 0xe2 0x80 0x99.
|
| 0xe2 is â in iso8859-1. 0x80 is not in iso8859-1, but is € in
| windows-1252. 0x99 is not in iso8859-1, but is ™ in
| windows-1252.
|
| So, the browser here appears to be defaulting to windows-1252.
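[A tiny self-contained demo of that byte-by-byte misreading; the
windows-1252 glyphs are hard-coded here for illustration, and the
source file is assumed to be UTF-8:]

    #include <stdio.h>

    int main(void) {
        /* U+2019 RIGHT SINGLE QUOTATION MARK, encoded as UTF-8. */
        const unsigned char utf8[] = { 0xE2, 0x80, 0x99 };
        /* What each byte looks like to a windows-1252 decoder. */
        const char *as_cp1252[] = { "â", "€", "™" };
        for (int i = 0; i < 3; i++)
            printf("0x%02X -> %s\n", utf8[i], as_cp1252[i]);
        /* Prints 0xE2 -> â, 0x80 -> €, 0x99 -> ™: exactly the
           "â€™" mojibake seen on the page. */
        return 0;
    }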
| kccqzy wrote:
| Use your browser to override the encoding. For example, in
| Firefox choose "View > Repair Text Encoding" from the menu, or
| in Safari choose "View > Text Encoding > Unicode (UTF-8)" from
| the menu. Many browsers still default to Latin-1, but this page
| is using UTF-8.
|
| (This used to happen a lot ~15 years ago. Did the dominance of
| UTF-8 make people forget about these encoding issues?)

| fintler wrote:
| https://github.com/hpc/mpifileutils handles this pretty well --
| with SYS_getdents64. It has a few other tricks in there in
| addition to this one.

| scottlamb wrote:
| tl;dr: try "ls -1 -f". It's fast.
|
| This doesn't pass my smell test:
|
| > Putting two and two together I could see that the reason it
| was taking forever to list the directory was because ls was
| reading the directory entries file 32K at a time, and the file
| was 513M. So it would take around 16416 system calls of
| getdents() to list the directory. That is a lot of calls,
| especially on a slow virtualized disk.
|
| 16,416 system calls is a little inefficient but not that
| noticeable in human terms. And the author is talking as if each
| one waits 10 ms for a disk head to move to the correct position.
| That's not true. The OS and drive both do readahead, and they're
| both quite effective. I recently tried to improve performance of
| a long-running sequential read on an otherwise-idle
| old-fashioned spinning disk by tuning the former ("sudo blockdev
| --setra 6144 /path/to/device"). I found it made no real
| difference: "iostat" showed OS-level readahead reduces the
| number of block operations (as expected) but also that total
| latency doesn't decrease. It turns out in this scenario the
| disk's cache is full of the upcoming bytes, so those extra
| operations are super fast anyway.
|
| The real reason "ls" takes a while to print stuff is that by
| default it will buffer everything before printing anything, so
| that it can sort it and (when stdout is a terminal) place it
| into appropriately-sized columns. It also (depending on the
| options you are using) will stat every file, which obviously
| will dwarf the number of getdents calls and access the inodes
| (which are more scattered across the filesystem).
|
| "ls -1 -f" disables both those behaviors. It's reasonably fast
| without changing the buffer size.
|
|     moonfire-nvr@nuc:/media/14tb/sample$ time ls -1f | wc -l
|     1042303
|
|     real    0m0.934s
|     user    0m0.403s
|     sys     0m0.563s
|
| That's on Linux with ext4.

| loeg wrote:
| Agree re: smell test. Those directory blocks are cached, even in
| front of a slow virtualized disk, and most of those syscalls are
| hitting in cache. The author is likely running into (1) stat
| calls and (2) buffer-and-sort behavior, exactly as you describe.

| iso1210 wrote:
| Interesting, tried it myself on a test VM:
|
|     ~/test$ time for I in `seq -w 1 1000000`; do touch $I; done
|
|     real    27m8.663s
|     user    14m15.410s
|     sys     12m24.411s
|
| OK
|
|     ~/test$ time ls -1f | wc -l
|     1000002
|
|     real    0m0.604s
|     user    0m0.180s
|     sys     0m0.422s
|
|     ~/test$ time ls -f | wc -l
|     1000002
|
|     real    0m0.574s
|
|     ~/test$ time perl -E 'opendir(my $d,".");say while readdir $d' | wc -l
|     1000002
|
|     real    0m0.597s
|
| All seems reasonable. The directory size alone is 23M, somewhat
| larger than the typical 4096 bytes.

| osswid wrote:
| ls -f

| wolfgang42 wrote:
|
|     -f      do not sort, enable -aU, disable -ls --color
|     -a      do not ignore entries starting with .
|     -U      do not sort; list entries in directory order
|     -l      use a long listing format
|     -s      print the allocated size of each file, in blocks
|     --color colorize the output
|
| I assume you mean to imply that by turning off
| sorting/filtering/formatting, ls will run in a more optimized
| mode where it can avoid buffering and just dump the dentries as
| described in the article?

| jjgreen wrote:
| Seems that way: https://github.com/wertarbyte/coreutils/blob/
| master/src/ls.c...

| loeg wrote:
| Yeah, exactly. OP is changing 3 variables and concluding that
| the getdents buffer size was the significant one, but actually
| the problem was likely (1) stat calls, for --color, and (2)
| buffer and sort, which adds O(N log N) sorting time to the total
| run+print time. (Both of which are avoided by using getdents
| directly.)

| majkinetor wrote:
| Yeah ...
|
| However, let's just accept that regular people don't know those
| tricks, and we should keep files in subfolders. I have that
| logic in any app that has the potential to spam a directory. You
| can still show them as a single folder (sometimes called a
| branch view) if you like, but every other tool that uses ls will
| work like a charm (such as your backup shell script). See the
| sketch below for one common layout.
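[One common shape for that: fan files out into fixed-depth
subdirectories keyed by a hash of the name. A minimal sketch; the
FNV-1a hash and the 256x256 fan-out are arbitrary choices, not
anything a specific app mentioned here uses:]

    #include <stdio.h>
    #include <stdint.h>

    /* FNV-1a; any stable hash of the filename works. */
    static uint32_t fnv1a(const char *s) {
        uint32_t h = 2166136261u;
        for (; *s; s++) { h ^= (unsigned char)*s; h *= 16777619u; }
        return h;
    }

    /* Maps "name" to e.g. "3f/a2/name": 65,536 buckets, so even
       8M files mean only ~122 entries per directory. */
    static void fanout_path(const char *name, char *out, size_t n) {
        uint32_t h = fnv1a(name);
        snprintf(out, n, "%02x/%02x/%s",
                 (h >> 8) & 0xff, h & 0xff, name);
    }

    int main(void) {
        char path[4096];
        fanout_path("some_file.dat", path, sizeof path);
        puts(path);  /* create parent dirs, then store the file here */
        return 0;
    }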
| yjftsjthsd-h wrote:
| Then anything working on it needs to recurse.

| bifrost wrote:
| Interesting point; this does appear to be Linux- and
| situation-specific though.
|
| It's interesting enough that I'm going to run my own test now.

| bifrost wrote:
| It's going to take me a bit to generate several million files,
| but so far I've got a single directory with 550k files in it; it
| takes 30s to ls it on a very busy system running FreeBSD.
|
| 1.1M files -> 120 seconds
|
| 1.8M files -> 270 seconds (this could be related to system load
| being over 90, heh)

| ipaddr wrote:
| At 3,000 my Windows 7 OS freezes. Not bad for a million.

| ygra wrote:
| You may want to disable short name generation on Windows when
| putting many files in one directory.

| loeg wrote:
| Try "ls -f" (don't sort)?
|
| Which filesystem you use will also make a big difference here.
| You could imagine some filesystem that uses the
| getdirentries(2) binary format for dirents, and that could
| literally memcpy cached directory pages for a syscall. In
| FreeBSD, UFS gets somewhat close, but 'struct direct' differs
| from the ABI 'struct dirent'. And the FS attempts to validate
| the disk format, too.
|
| FWIW, FreeBSD uses 4kB (the x86 system page size) where glibc
| uses 32kB in this article[1]. To the extent libc is actually the
| problem (I'm not confident of that yet based on the article),
| this will be worse than glibc's larger buffer.
|
| [1]: https://github.com/freebsd/freebsd-
| src/blob/main/lib/libc/ge...

| bifrost wrote:
| With "ls -f" on 1.9M files it's 45 seconds, much better than
| regular ls (and a system load of 94).
|
| 2.25M and it's 60 seconds.
|
| I'm also spamming about 16-18 thousand new files per second to
| disk using a very inefficient set of csh scripts...

| scottlamb wrote:
| A more efficient one-liner:
|
|     seq 1 8000000 | xargs touch

| avaika wrote:
| If you are going to have a directory with millions of files,
| there's probably one more interesting thing to consider.
|
| As you might know, ext* and some other FSs store filenames right
| in the directory file. That means the more files you have in the
| directory, the bigger the directory file gets. In the majority
| of cases nothing unusual happens, because people have maybe a
| few dozen dirs/files.
|
| However, if you put millions of files there, then the directory
| grows to a few megabytes in size. If you decide to clean up
| later, you'd probably expect the directory size to shrink. But
| that never happens, unless you run fsck or re-create the
| directory.
|
| That's because nobody believes the implementation effort is
| really worth it. Here's a link to the lkml discussion:
| https://lkml.org/lkml/2009/5/15/146
|
| PS. Here's a previous discussion of the very same article posted
| in this submission. It's been 10 years already :)
| https://news.ycombinator.com/item?id=2888820
|
| upd. Here's a code example:
|
|     $ mkdir niceDir && cd niceDir
|
|     # this might take a few moments
|     $ for ((i=1;i<133700;i++)); do touch long_long_looong_man_sakeru_$i ; done
|
|     $ ls -lhd .
|     drwxr-xr-x 2 user user 8.1M Aug  2 13:37 .
|
|     $ find . -type f -delete
|
|     $ ls -l
|     total 0
|
|     $ ls -lhd .
|     drwxr-xr-x 2 user user 8.1M Aug  2 13:37 .

| kalmi10 wrote:
| I once had a directory on OpenZFS with more than a billion
| files, and after cleaning it up, with only a handful of folders
| remaining, running ls in it still took a few seconds. I guess
| some large but almost empty tree structure remained.
|
| https://0kalmi.blogspot.com/2020/02/quick-moving-of-billion-...
___________________________________________________________________
(page generated 2021-08-15 23:00 UTC)