[HN Gopher] Scaling Git's garbage collection
___________________________________________________________________

Scaling Git's garbage collection

Author : todsacerdoti
Score  : 52 points
Date   : 2022-09-13 16:02 UTC (6 hours ago)

(HTM) web link (github.blog)
(TXT) w3m dump (github.blog)

| forrestthewoods wrote:
| > At GitHub, we store a lot of Git data: more than 18.6 petabytes
| of it, to be precise.
|
| That actually seems kinda small.
|
| Git's lack of good support for large files means there's probably
| an exabyte of data that, imho, should be in source control but
| isn't.
| kccqzy wrote:
| That's indeed small. I'd guess that Google probably stores 4
| orders of magnitude more data than GitHub.
|
| (I was in fact asked a long time ago in an interview to
| estimate how much disk was needed to store Google's search
| index.)
| sulam wrote:
| Glad it was a long time ago. Those kinds of questions are
| awful.
| isatty wrote:
| Agreed that it isn't ideal, but about "awful" specifically
| I'm not too sure. I would never ask such a question, but I
| would assume the intent is just to find out how you think,
| not to get you to spit out a number. Would it be fun if
| the interviewer worked together with you to approximate it?
| ajb wrote:
| You can't actually put the Android source in GitHub because of
| the 4GB per-repo size limit. Niche problem, but it shows the
| scale of things.
| kortex wrote:
| It would be amazing if Github/lab provided a backing store for
| www.dvc.org . I've been using it to great effect, but I have to
| rely on a separate AWS integration for storing the large
| objects in s3.
| sc68cal wrote:
| I wish they had not gone with uint32_t for storing mtimes,
| since they now have to deal with the 2038 problem, sometime in
| the future.
|
| I am surprised they didn't directly use time_t, so that they
| wouldn't have to deal with this (since some platforms have
| already gone to 64-bit time_t).
| kevingadd wrote:
| Wouldn't that mean that if a platform changed time_t formats,
| it would invalidate all their stored files?
| [deleted]
| grogers wrote:
| Well, if they use unsigned 32-bit they at least extended it to
| Y2106 :-)
|
| But for this use case it's not really an issue. FTA, it
| sounded like they always write the mtime as now. It's unlikely
| they wouldn't GC the repo in 68 years, so wraparound won't be
| an issue.
| est31 wrote:
| For on-disk formats, time_t would probably not be a good
| choice, but indeed, they have a time_t to uint32_t conversion
| going on that is not even saturating, just cutting bits off:
|
| https://github.com/git/git/blob/e188ec3a735ae52a0d0d3c22f9df...
|
| https://github.com/git/git/blob/e188ec3a735ae52a0d0d3c22f9df...
| cesarb wrote:
| > I wish they had not gone with uint32_t for storing mtimes,
| since they now have to deal with the 2038 problem, sometime in
| the future.
|
| Since uint32_t is _unsigned_, wouldn't it be the Y2106 problem
| instead?
|
| > I am surprised they didn't directly use time_t, so that they
| wouldn't have to deal with this (since some platforms have
| already gone to 64-bit time_t)
|
| You mentioned the problem yourself without noticing: _some_
| platforms have gone to 64-bit time_t, but others haven't. This
| is a file format, which can be shared by multiple platforms, so
| it cannot use types whose size changes depending on the
| platform.
___________________________________________________________________
(page generated 2022-09-13 23:00 UTC)