[HN Gopher] You can list a directory containing 8M files, but no...
       ___________________________________________________________________
        
       You can list a directory containing 8M files, but not with ls
        
       Author : _wldu
       Score  : 50 points
       Date   : 2021-08-15 19:21 UTC (3 hours ago)
        
 (HTM) web link (be-n.com)
 (TXT) w3m dump (be-n.com)
        
       | tyingq wrote:
       | perl -E 'opendir(my $d,".");say while readdir $d'
        
       | marcodiego wrote:
       | Makes me think that findfirst and findnext were not that bad
       | after all.
        
       | Y_Y wrote:
       | If you haven't prepared for this eventuality then odds are you're
       | going to run out of inodes first. And it's probably not useful to
       | just dump all those filenames to your terminal. And don't even
       | say you were piping the output of `ls` to something else!
       | 
        | Anyway, the coreutils shouldn't have arbitrary limits like this;
        | at least if they do, the limits should be so high that you have
        | to be really trying hard in order to reach them.
        
         | cmeacham98 wrote:
         | There isn't actually an arbitrary limit, it's just that glibc's
         | readdir() implementation is really really slow with millions of
         | files according to the article. Presumably if you waited awhile
         | `ls` would eventually get the whole list.
        
           | matheusmoreira wrote:
           | The glibc functions are just bad wrappers for the real system
           | calls which no doubt work much more efficiently. I fully
           | expected to find the system call solution in the article and
           | was not disappointed.
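            | 
            | For anyone curious, a rough way to watch those calls from a
            | shell (this assumes Linux with strace installed; /bigdir is a
            | placeholder path, not the article's directory):
            | 
            |     # each getdents64 line shows the buffer size passed in
            |     # (32 KiB with glibc's readdir, per the article)
            |     strace -e trace=getdents64 ls -1f /bigdir > /dev/null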
        
       | mercurialuser wrote:
        | Did you try ls -1? In the far past I had the same problem listing
        | millions of files in a dir.
        | 
        | Edit: if I remember correctly ls buffers the results for sorting;
        | with -1 it just dumps the values.
        
         | acdha wrote:
         | It's not "ls -1" but "--sort=none" or "-U".
        
           | loeg wrote:
           | "ls -f" in POSIX ls (which GNU ls also implements). Also,
           | avoid "-F", which will stat each file.
        
           | the_arun wrote:
           | `ls | more` works too, right?
        
             | acdha wrote:
             | No - it still does all of the work to sort the entries,
             | which is the slow part since it prevents the first entry
             | from being displayed until the last has been retrieved.
        
             | loeg wrote:
             | At least with GNU ls, 'ls | more' does not disable sorting.
             | It disables automatic color (which is important -- coloring
             | requires 'stat(2)'ing every file in the directory).
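              | 
              | A quick way to check that claim (assuming GNU ls and strace,
              | and a placeholder /bigdir; on current systems the stat-family
              | call shows up as newfstatat or statx in the summary):
              | 
              |     strace -c ls --color=always /bigdir > /dev/null  # ~one stat per entry
              |     strace -c ls --color=never  /bigdir > /dev/null  # stat calls mostly gone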
        
         | innagadadavida wrote:
          | I think it also tries to neatly format into columns and this
         | requires it to know name lengths for all files. If you do -l it
         | basically outputs one file per line and can be done more
         | efficiently.
        
       | sigmonsays wrote:
        | Seems the author didn't read the man page. ls -1f, as others have
        | pointed out, is a much better solution.
        | 
        | Additionally, having 8 million of anything in a single directory
        | screams bad planning. It's common to plan some kind of hashed
        | directory structure for this.
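        | 
        | For example, one common scheme (a sketch only; the file name and
        | the store/ path are made up) is to bucket files by the first
        | couple of hex characters of a hash, so no single directory ever
        | holds millions of entries:
        | 
        |     f=upload_123.bin
        |     h=$(sha1sum "$f" | cut -c1-2)   # e.g. "3a"
        |     mkdir -p "store/$h" && mv "$f" "store/$h/"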
        
       | unwind wrote:
       | Meta, if the author is around: there seems to be some kind of
       | encoding problem, on my system (Linux, Firefox) I see a lot of
       | strange characters where there should probably be punctuation.
       | 
        | The first section header reads "Why doesnâ€™t ls work?", for
       | instance.
        
         | FpUser wrote:
         | Same here
        
         | cmeacham98 wrote:
         | This is because the page has no doctype, thus putting the
         | browser in "quirks mode", defaulting to a charset of ISO-8859-1
         | (as the page does not specify one). The author can fix this
         | either by specifying the charset, or adding the HTML5 doctype
         | (HTML5 defaults to UTF-8).
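          | 
          | Concretely, something along these lines at the top of the page
          | should be enough (a sketch of the usual fix, not the author's
          | actual markup):
          | 
          |     <!DOCTYPE html>
          |     <meta charset="utf-8">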
        
           | dheera wrote:
           | Maybe browsers should default to UTF-8 already. It's 2021.
        
             | lxgr wrote:
             | Why? Defaulting to UTF-8 for modern HTML, and to ISO-8859-1
             | for legacy pages, makes a lot of sense.
             | 
             | Pages that haven't been adapted to HTML 5 in the last 10
             | years or so are exceedingly unlikely to do so in year 11.
        
               | dheera wrote:
               | ISO-8859-1 is a subset of UTF-8 isn't it? No harm done by
               | defaulting to the superset.
        
               | [deleted]
        
               | CodesInChaos wrote:
               | No. ASCII is a subset of UTF-8, ISO-8859-1 is not. The
               | first 256 codepoints of unicode match ISO-8859-1, which
                | is probably the source of your confusion. However,
                | codepoints 128-255 are encoded differently in UTF-8. They
               | are represented by a single byte when encoded as
               | ISO-8859-1, while they turn into two bytes encoded in
               | UTF-8.
               | 
               | Plus "ISO-8859-1" is treated as Windows-1252 by browsers,
               | while unicode uses ISO-8859-1 extended with the ISO 6429
               | control characters for its initial 256 codepoints.
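                | 
                | A quick way to see the difference (assuming a UTF-8
                | terminal with iconv and xxd available; 'é' is U+00E9, one
                | of those 128-255 codepoints):
                | 
                |     $ printf 'é' | xxd -p                        # UTF-8: two bytes
                |     c3a9
                |     $ printf 'é' | iconv -t ISO-8859-1 | xxd -p  # Latin-1: one byte
                |     e9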
        
               | dheera wrote:
               | Ah I see, thanks.
        
               | anyfoo wrote:
               | If it were, the characters in question would already
               | display correctly for this website, since they are within
               | ISO-8859-1. ASCII is a subset of UTF-8.
        
             | magicalhippo wrote:
             | We need to handle a lot of crappy data-in-text-files at
             | work, and for most of them using the UTF-8 duck test seems
             | to be the most reliable.
             | 
             | If it decodes successfully as UTF-8 it's probably UTF-8.
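              | 
              | A rough version of that test from the shell (assuming GNU
              | iconv, which exits non-zero on the first invalid byte
              | sequence):
              | 
              |     iconv -f UTF-8 -t UTF-8 file.txt > /dev/null \
              |         && echo "probably UTF-8" || echo "not UTF-8"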
        
               | wolfgang42 wrote:
               | That requires scanning the whole file before guessing the
               | encoding, which browsers don't do for performance reasons
                | (and also because an HTML document _may never end_, it's
               | perfectly valid for the server to keep appending to the
               | document indefinitely). The HTML5 spec does recommend
               | doing this on the first 1024 bytes, though.
        
               | magicalhippo wrote:
                | Browsers are quite happy to re-render the whole document
                | multiple times though, so they could just switch and
                | re-decode when UTF-8 fails. Sure, it wouldn't be the fast
                | path, but it sure beats looking at hieroglyphs.
               | 
               | And yeah, add some sensible limits to this logic of
               | course. Most web pages aren't never-ending nor multi-GB
               | of text.
        
           | wolfgang42 wrote:
           | _> HTML5 defaults to UTF-8_
           | 
           | I'm not sure this is correct, though the WHATWG docs[1] are
           | kind of confusing. From what I can tell, it seems like HTML5
           | documents are required to be UTF-8, but also this is required
           | to be explicitly declared either in the Content-Type header,
           | a leading BOM, or a <meta> tag in the first 1024 bytes of the
           | file. Reading this blog post[2] it sounds like there is a
           | danger that if you don't do this then heuristics will kick in
           | and try to guess the charset instead; the documented
           | algorithm for this doesn't seem to consider the doctype at
           | all.
           | 
           | [1]: https://html.spec.whatwg.org/dev/semantics.html#charset
           | 
           | [2]: https://blog.whatwg.org/the-road-to-html-5-character-
           | encodin...
        
           | Aardwolf wrote:
            | The single quote in "doesn't" is an ASCII character though, so
            | why does that one become â€™?
        
             | iudqnolq wrote:
              | Here's the heuristic-based hypothesis of the python package
              | ftfy:
              | 
              |     >>> ftfy.fix_and_explain("â€™")
              |     ExplainedText(
              |         text="'",
              |         explanation=[
              |             ('encode', 'sloppy-windows-1252'),
              |             ('decode', 'utf-8'),
              |             ('apply', 'uncurl_quotes')
              |         ]
              |     )
        
               | wolfgang42 wrote:
               | Note that uncurl_quotes is a FTFY fix unrelated to
                | character encoding, it's basically just s/’/'/. (FTFY
               | turns all of its fixes on by default, which sometimes
               | results in it doing more than you might want it to.)
               | 
                | You can play around with FTFY here (open the "Decoding
                | steps" to see the explanation of what it did and why):
                | https://www.linestarve.com/tools/mojibake/?mojibake=â€™
        
             | bdowling wrote:
             | It's not. It's a Unicode 'RIGHT SINGLE QUOTATION MARK'
             | (U+2019), which in UTF-8 is encoded as 0xe2 0x80 0x99.
             | 
              | 0xe2 is â in iso8859-1. 0x80 is not in iso8859-1, but is €
              | in windows-1252. 0x99 is not in iso8859-1, but is ™ in
              | windows-1252.
             | 
             | So, the browser here appears to be defaulting to
             | windows-1252.
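              | 
              | That round trip is easy to reproduce (assuming a UTF-8
              | terminal with iconv and xxd):
              | 
              |     $ printf '’' | xxd -p        # the three UTF-8 bytes
              |     e28099
              |     $ printf '’' | iconv -f WINDOWS-1252 -t UTF-8
              |     â€™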
        
         | kccqzy wrote:
         | Use your browser to override the encoding. For example in
         | Firefox choose "View > Repair Text Encoding" from the menu or
         | in Safari choose "View > Text Encoding > Unicode (UTF-8)" from
         | the menu. Many browsers still default to Latin 1, but this page
         | is using UTF-8.
         | 
         | (This used to happen a lot ~15 years ago. Did the dominance of
         | UTF-8 make people forget about these encoding issues?)
        
       | fintler wrote:
       | https://github.com/hpc/mpifileutils handles this pretty well --
       | with SYS_getdents64. It has a few other tricks in there in
       | addition to this one.
        
       | scottlamb wrote:
       | tl;dr: try "ls -1 -f". It's fast.
       | 
       | This doesn't pass my smell test:
       | 
       | > Putting two and two together I could see that the reason it was
       | taking forever to list the directory was because ls was reading
       | the directory entries file 32K at a time, and the file was 513M.
       | So it would take around 16416 system calls of getdents() to list
       | the directory. That is a lot of calls, especially on a slow
       | virtualized disk.
       | 
       | 16,416 system calls is a little inefficient but not that
        | noticeable in human terms. And the author is talking as if each
       | one waits 10 ms for a disk head to move to the correct position.
       | That's not true. The OS and drive both do readahead, and they're
       | both quite effective. I recently tried to improve performance of
       | a long-running sequential read on an otherwise-idle old-fashioned
       | spinning disk by tuning the former ("sudo blockdev --setra 6144
       | /path/to/device"). I found it made no real difference: "iostat"
       | showed OS-level readahead reduces the number of block operations
       | (as expected) but also that total latency doesn't decrease. It
       | turns out in this scenario the disk's cache is full of the
       | upcoming bytes so those extra operations are super fast anyway.
       | 
       | The real reason "ls" takes a while to print stuff is that by
       | default it will buffer everything before printing anything so
       | that it can sort it and (when stdout is a terminal) place it into
       | appropriately-sized columns. It also (depending on the options
       | you are using) will stat every file, which obviously will dwarf
       | the number of getdents calls and access the inodes (which are
       | more scattered across the filesystem).
       | 
       | "ls -1 -f" disables both those behaviors. It's reasonably fast
        | without changing the buffer size.
        | 
        |     moonfire-nvr@nuc:/media/14tb/sample$ time ls -1f | wc -l
        |     1042303
        | 
        |     real    0m0.934s
        |     user    0m0.403s
        |     sys     0m0.563s
       | 
       | That's on Linux with ext4.
        
         | loeg wrote:
         | Agree re smell test. Those directory blocks are cached, even in
         | front of a slow virtualized disk, and most of those syscalls
         | are hitting in cache. Author is likely running into (1) stat
         | calls and (2) buffer and sort behavior, exactly as you
         | describe.
        
           | iso1210 wrote:
            | Interesting, tried myself on a test VM
            | 
            |     ~/test$ time for I in `seq -w 1 1000000`; do touch $I; done
            |     real 27m8.663s   user 14m15.410s   sys 12m24.411s
            | 
            | OK
            | 
            |     ~/test$ time ls -1f | wc -l
            |     1000002
            |     real 0m0.604s   user 0m0.180s   sys 0m0.422s
            | 
            |     ~/test$ time ls -f | wc -l
            |     1000002
            |     real 0m0.574s
            | 
            |     ~/test$ time perl -E 'opendir(my $d,".");say while readdir $d' | wc -l
            |     1000002
            |     real 0m0.597s
            | 
            | All seems reasonable. Directory size alone is 23M, somewhat
            | larger than the typical 4096 bytes.
        
       | osswid wrote:
       | ls -f
        
         | wolfgang42 wrote:
          |     -f      do not sort, enable -aU, disable -ls --color
          |     -a      do not ignore entries starting with .
          |     -U      do not sort; list entries in directory order
          |     -l      use a long listing format
          |     -s      print the allocated size of each file, in blocks
          |     --color colorize the output
         | 
         | I assume you mean to imply that by turning off
         | sorting/filtering/formatting ls will run in a more optimized
         | mode where it can avoid buffering and just dump the dentries as
         | described in the article?
        
           | jjgreen wrote:
           | Seems that way: https://github.com/wertarbyte/coreutils/blob/
           | master/src/ls.c...
        
           | loeg wrote:
            | Yeah, exactly. OP is changing 3 variables and concluding that
            | the getdents buffer size was the significant one, but actually
            | the problem was likely (1) stat calls, for --color, and (2)
            | buffer and sort, which adds O(N log N) sorting time to the
            | total run+print time. (Both of which are avoided by using
            | getdents directly.)
        
       | majkinetor wrote:
       | Yeah ...
       | 
        | However, let's just accept that regular people don't know these
        | tricks and keep files in subfolders instead? I have that logic in
        | any app that has the potential to spam a directory. You can still
        | show them as a single folder (sometimes called a branch view) if
        | you like, but every other tool that uses ls will work like a charm
        | (such as your backup shell script).
        
         | yjftsjthsd-h wrote:
         | Then anything working on it needs to recurse.
        
       | bifrost wrote:
        | Interesting point, though this does appear to be Linux- and
        | situation-specific.
        | 
        | It's interesting enough that I'm going to run my own test now.
        
         | bifrost wrote:
          | It's going to take me a bit to generate several million files,
          | but so far I've got a single directory with 550k files in it;
          | it takes 30s to ls it on a very busy system running FreeBSD.
         | 
         | 1.1M files -> 120 seconds
         | 
         | 1.8M files -> 270 seconds (this could be related to system load
         | being over 90 heh)
        
           | ipaddr wrote:
            | At 3,000 my Windows 7 OS freezes. Not bad for a million.
        
             | ygra wrote:
              | You may want to disable short name generation on Windows
              | when putting many files in one directory.
        
           | loeg wrote:
           | Try "ls -f" (don't sort)?
           | 
           | Which filesystem you use will also make a big difference
           | here. You could imagine some filesystem that uses the
           | getdirentries(2) binary format for dirents, and that could
           | literally memcpy cached directory pages for a syscall. In
           | FreeBSD, UFS gets somewhat close, but 'struct direct' differs
           | from the ABI 'struct dirent'. And the FS attempts to validate
           | the disk format, too.
           | 
           | FWIW, FreeBSD uses 4kB (x86 system page size) where glibc
           | uses 32kB in this article[1]. To the extent libc is actually
           | the problem (I'm not confident of that yet based on the
           | article), this will be worse than glibc's larger buffer.
           | 
           | [1]: https://github.com/freebsd/freebsd-
           | src/blob/main/lib/libc/ge...
        
             | bifrost wrote:
             | with "ls -f" on 1.9M files its 45 seconds, much better than
             | regular ls (and system load of 94)
             | 
             | 2.25M and its 60 seconds
             | 
             | I'm also spamming about 16-18 thousand new files per second
             | to disk using a very inefficient set of csh scripts...
        
               | scottlamb wrote:
                | A more efficient one-liner:
                | 
                |     seq 1 8000000 | xargs touch
        
       | avaika wrote:
       | If you are going to have a directory with millions of files,
       | probably there's one more interesting thing to consider.
       | 
        | As you might know, ext* and some other FSs store filenames right
        | in the directory file, meaning the more files you have in the
        | directory, the bigger the directory file gets. In the majority of
        | cases nothing unusual happens, because people have maybe a few
        | dozen dirs / files.
        | 
        | However, if you put millions of files there, the directory file
        | grows to a few megabytes in size. If you decide to clean up
        | later, you'd probably expect the directory size to shrink. But
        | that never happens, unless you run fsck or re-create the
        | directory.
        | 
        | That's because nobody believes the implementation effort is
        | really worth it. Here's a link to the lkml discussion:
       | https://lkml.org/lkml/2009/5/15/146
       | 
       | PS. Here's a previous discussion of the very same article posted
       | in this submission. It's been 10 years already :)
       | https://news.ycombinator.com/item?id=2888820
       | 
       | upd. Here's a code example:
       | 
       | $ mkdir niceDir && cd niceDir
       | 
       | # this might take a few moments
       | 
        | $ for ((i=1;i<133700;i++)); do touch long_long_looong_man_sakeru_$i; done
       | 
       | $ ls -lhd .
       | 
       | drwxr-xr-x 2 user user 8.1M Aug 2 13:37 .
       | 
       | $ find . -type f -delete
       | 
       | $ ls -l
       | 
       | total 0
       | 
       | $ ls -lhd .
       | 
       | drwxr-xr-x 2 user user 8.1M Aug 2 13:37 .
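        | 
        | (For ext*, the usual way to reclaim that space, short of
        | re-creating the directory, is an offline fsck with directory
        | optimization on the unmounted filesystem; the device name below
        | is a placeholder:)
        | 
        | $ umount /dev/sdXN
        | 
        | $ e2fsck -fD /dev/sdXN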
        
       | kalmi10 wrote:
        | I once had a directory on OpenZFS with more than a billion files,
        | and after cleaning it up, with only a handful of folders remaining,
       | running ls in it still took a few seconds. I guess some large but
       | almost empty tree structure remained.
       | 
       | https://0kalmi.blogspot.com/2020/02/quick-moving-of-billion-...
        
       ___________________________________________________________________
       (page generated 2021-08-15 23:00 UTC)