[HN Gopher] Scaling Git's garbage collection
       ___________________________________________________________________
        
       Scaling Git's garbage collection
        
       Author : todsacerdoti
       Score  : 52 points
       Date   : 2022-09-13 16:02 UTC (6 hours ago)
        
 (HTM) web link (github.blog)
        
       | forrestthewoods wrote:
       | > At GitHub, we store a lot of Git data: more than 18.6 petabytes
       | of it, to be precise.
       | 
       | That actually seems kinda small.
       | 
       | Git's lack of good support for large files means there's probably
       | an exabyte of data that, imho, should be source control but
       | isn't.
        
         | kccqzy wrote:
         | That's indeed small. I'd guess that Google probably stores 4
         | orders of magnitude more data than GitHub.
         | 
         | (I was in fact asked a long time ago in an interview to
         | estimate how much disk was needed to store Google's search
         | index.)
        
           | sulam wrote:
           | Glad it was a long time ago. Those kinds of questions are
           | awful.
        
             | isatty wrote:
             | Agreed that it isn't ideal, but about "awful" specifically
             | - I'm not too sure. I would never ask such a question but I
             | would assume the intent is just to find out how you think
             | and not to get you to spit out a number. Would it be fun if
             | the interviewer worked together with you to approximate it?
        
         | ajb wrote:
         | You can't actually put the Android source in GitHub because of
         | the 4GB per repo size limit. Niche problem, but shows the scale
         | of things.
        
         | kortex wrote:
         | It would be amazing if GitHub/GitLab provided a backing store for
         | www.dvc.org. I've been using it to great effect, but I have to
         | rely on a separate AWS integration to store the large objects
         | in S3.
        
       | sc68cal wrote:
       | I wish they had not gone with uint32_t for storing mtimes, since
       | they now have to deal with the 2038 problem, sometime in the
       | future.
       | 
       | I am surprised they didn't directly use time_t, so that they
       | wouldn't have to deal with this (since some platforms have
       | already gone to 64 bit time_t)
        
         | kevingadd wrote:
         | Wouldn't that mean if a platform changed time_t formats it
         | would invalidate all their stored files?
        
         | grogers wrote:
         | Well if they use unsigned 32 bit they at least extended it to
         | Y2106 :-)
         | 
         | But for this use case it's not really an issue. FTA it sounded
         | like they always write the mtime as now. A repo would have to
         | go 68 years without a GC for wraparound to become an issue.
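
The Y2038 versus Y2106 distinction raised here can be checked with a little arithmetic. The following is a minimal sketch, assuming seconds are counted from the Unix epoch; the helper name `last_representable` is hypothetical and is not taken from Git or the article.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

INT32_MAX = 2**31 - 1   # largest value of a signed 32-bit counter
UINT32_MAX = 2**32 - 1  # largest value of an unsigned 32-bit counter


def last_representable(max_seconds: int) -> datetime:
    """Return the last UTC instant a counter with this maximum can hold."""
    return EPOCH + timedelta(seconds=max_seconds)


signed_limit = last_representable(INT32_MAX)
unsigned_limit = last_representable(UINT32_MAX)

print(signed_limit.isoformat())    # 2038-01-19T03:14:07+00:00
print(unsigned_limit.isoformat())  # 2106-02-07T06:28:15+00:00
```

The two limits are 2**31 seconds apart, which is roughly 68 years.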
        
         | est31 wrote:
         | For on-disk formats, time_t would probably not be a good
         | choice, but indeed, they have a time_t to uint32_t conversion
         | going on, that is not even saturating, just cutting bits off:
         | 
         | https://github.com/git/git/blob/e188ec3a735ae52a0d0d3c22f9df...
         | 
         | https://github.com/git/git/blob/e188ec3a735ae52a0d0d3c22f9df...
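
To illustrate the difference between "cutting bits off" and a saturating conversion, here is a small sketch. It is not Git's code: both function names are hypothetical, and Python integers stand in for a 64-bit time_t being narrowed to uint32_t.

```python
UINT32_MAX = 0xFFFFFFFF


def truncate_to_u32(seconds: int) -> int:
    """Mimic a plain narrowing cast: keep only the low 32 bits."""
    return seconds & UINT32_MAX


def saturate_to_u32(seconds: int) -> int:
    """Clamp to the representable range instead of wrapping around."""
    return max(0, min(seconds, UINT32_MAX))


one_past_limit = UINT32_MAX + 1  # first second that no longer fits

print(truncate_to_u32(one_past_limit))  # 0
print(saturate_to_u32(one_past_limit))  # 4294967295
```

With truncation, a timestamp just past the limit looks like the epoch itself, so the object would appear very old. With saturation it would stay pinned at the maximum and appear as recent as the format allows.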
        
         | cesarb wrote:
         | > I wish they had not gone with uint32_t for storing mtimes,
         | since they now have to deal with the 2038 problem, sometime in
         | the future.
         | 
         | Since uint32_t is _unsigned_, wouldn't it be the Y2106 problem
         | instead?
         | 
         | > I am surprised they didn't directly use time_t, so that they
         | wouldn't have to deal with this (since some platforms have
         | already gone to 64 bit time_t)
         | 
         | You mentioned the problem yourself without noticing: _some_
         | platforms have gone to 64-bit time_t, but others haven't. This
         | is a file format, which can be shared by multiple platforms, so
         | it cannot use types which change size depending on the
         | platform.
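
The point about platform-dependent sizes can be sketched with Python's `struct` module. This is an illustration only: the function names are hypothetical, and the big-endian byte order is an assumption made for the example, not a statement about the actual on-disk layout.

```python
import struct

FIXED_FORMAT = ">I"  # big-endian unsigned 32-bit: 4 bytes on every platform


def encode_mtime(seconds: int) -> bytes:
    """Serialize a timestamp into a fixed-width, platform-independent field."""
    return struct.pack(FIXED_FORMAT, seconds)


def decode_mtime(raw: bytes) -> int:
    """Read the timestamp back from its 4-byte encoding."""
    (seconds,) = struct.unpack(FIXED_FORMAT, raw)
    return seconds


example_mtime = 1663084920  # an arbitrary example value
encoded = encode_mtime(example_mtime)

print(len(encoded))            # 4
print(decode_mtime(encoded))   # 1663084920

# By contrast, the size of a native C long depends on the platform's ABI,
# so the value printed here is not the same everywhere.
print(struct.calcsize("@l"))
```

A file written with the fixed format reads back identically on any machine, whereas a field sized by a native type would change width between platforms.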
        
       ___________________________________________________________________
       (page generated 2022-09-13 23:00 UTC)