[HN Gopher] Building and operating a pretty big storage system c...
       ___________________________________________________________________
        
       Building and operating a pretty big storage system called S3
        
       Author : werner
       Score  : 157 points
       Date   : 2023-07-27 15:20 UTC (7 hours ago)
        
 (HTM) web link (www.allthingsdistributed.com)
 (TXT) w3m dump (www.allthingsdistributed.com)
        
       | Twirrim wrote:
       | > That's a bit error rate of 1 in 10^15 requests. In the real
       | world, we see that blade of grass get missed pretty frequently -
       | and it's actually something we need to account for in S3.
       | 
       | One of the things I remember from my time at AWS was
       | conversations about how 1 in a billion events end up being a
       | daily occurrence when you're operating at S3 scale. Things that
       | you'd normally mark off as so wildly improbable it's not worth
       | worrying about, have to be considered, and handled.
       | 
       | Glad to read about ShardStore, and especially the formal
       | verification, property-based testing, etc. The previous
       | generation of services was notoriously buggy, a very good
       | example of the usual perils of organic growth (but at least it
       | was really well designed to fail "safe", ensuring no data loss,
       | something S3 engineers obsessed over).
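       | 
       | (For anyone curious what property-based testing looks like in
       | practice, here is a minimal, generic sketch in Python using the
       | Hypothesis library; it is not S3's code, just the round-trip
       | style of property the post describes.)
       | 
       |   # Illustrative property-based test, not S3's test suite.
       |   # Property: decode(encode(x)) == x for arbitrary bytes.
       |   from hypothesis import given, strategies as st
       | 
       |   def encode(data: bytes) -> bytes:
       |       # toy "storage" encoding: length prefix + payload
       |       return len(data).to_bytes(8, "big") + data
       | 
       |   def decode(blob: bytes) -> bytes:
       |       n = int.from_bytes(blob[:8], "big")
       |       return blob[8:8 + n]
       | 
       |   @given(st.binary())
       |   def test_roundtrip(data):
       |       assert decode(encode(data)) == data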
        
         | Waterluvian wrote:
         | Ever see a UUID collision?
        
         | ignoramous wrote:
         | James Hamilton, AWS' chief architect, wrote about this
         | phenomenon in 2017: _At scale, rare events aren't rare_;
         | https://news.ycombinator.com/item?id=14038044
        
         | ilyt wrote:
         | I think Ceph hit similar problems and had to add more robust
         | checksumming to the system, since relying on TCP checksums
         | alone for data integrity, for example, was no longer enough.
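         | 
         | (The usual fix, roughly: carry an application-level checksum
         | with the data, so corruption that slips past TCP's 16-bit
         | checksum is still caught on read. A toy Python sketch of the
         | idea, not Ceph's actual implementation:)
         | 
         |   # End-to-end integrity check: store a CRC alongside the
         |   # data and verify it on every read.
         |   import zlib
         | 
         |   def put(store: dict, key: str, data: bytes) -> None:
         |       store[key] = (data, zlib.crc32(data))
         | 
         |   def get(store: dict, key: str) -> bytes:
         |       data, expected = store[key]
         |       if zlib.crc32(data) != expected:
         |           raise IOError(f"corrupted data for {key!r}")
         |       return data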
        
           | Twirrim wrote:
           | Yes, I remember TCP checksumming coming up as not sufficient
           | at one stage. I even saw S3 deal with a real head-scratcher
           | of a non-impacting event that came down to a single NIC in a
           | single machine corrupting the TCP checksum under very
           | specific circumstances.
        
         | mjb wrote:
         | > daily occurrence when you're operating at S3 scale
         | 
         | Yeah! With S3 averaging over 100M requests per second, 1 in a
         | billion happens every ten seconds. And it's not just S3. For
         | example, for Prime Day 2022, DynamoDB peaked at over 105M
         | requests per second (just for the Amazon workload):
         | https://aws.amazon.com/blogs/aws/amazon-prime-day-2022-aws-f...
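         | 
         | Back-of-the-envelope, if anyone wants to play with the numbers
         | (only the ~100M requests/second figure above is real; the rest
         | is just arithmetic):
         | 
         |   # Expected rate of "one in a billion" events at S3 scale.
         |   requests_per_second = 100_000_000   # ~S3 average
         |   p = 1e-9                            # one in a billion
         | 
         |   events_per_second = requests_per_second * p
         |   print(1 / events_per_second)        # ~10 seconds apart
         |   print(events_per_second * 86_400)   # ~8,640 per day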
         | 
         | In the post, Andy also talks about Lightweight Formal Methods
         | and the team's adoption of Rust. When even extremely low
         | probability events are common, we need to invest in multiple
         | layers of tooling and process around correctness.
        
         | ldjkfkdsjnv wrote:
         | Also worked at Amazon; saw some issues with major, well-known
         | open source libraries that broke in places nobody would ever
         | expect.
        
           | wrboyce wrote:
           | Any examples you can share?
        
         | rubiquity wrote:
         | Daily? A component I worked on that supported S3's Index could
         | hit a 1-in-a-billion issue multiple times a minute. Thankfully
         | we had good algorithms, and hardware is a lot more reliable
         | these days!
        
           | Twirrim wrote:
           | This was 7-8 years ago now. Lot of scaling up since those
           | days :)
        
       | jl6 wrote:
       | Great to see Amazon employees being allowed to talk openly about
       | how S3 works behind the scenes. I would love to hear more about
       | how Glacier works. As far as I know, they have never revealed
       | what the underlying storage medium is, leading to a lot of wild
       | speculation (tape? offline HDDs? custom HDDs?).
        
         | Twirrim wrote:
         | Glacier is a big "keep your lips sealed" one. I'd love AWS to
         | talk about everything there, and the entire journey it has
         | been on, because it is truly fascinating.
        
         | [deleted]
        
         | inopinatus wrote:
         | Never officially stated, but frequent leaks from insiders
         | confirm that Glacier is based on Very Large Arrays of Wax
         | Phonograph Records (VLAWPR) technology.
        
           | Twirrim wrote:
           | We came up with that idea in Glacier during the run-up to
           | April one year (2014, I think?) and half-jokingly suggested
           | it as an April Fools' Day joke, but Amazon quite reasonably
           | decided against doing such jokes.
           | 
           | One of the tagline ideas we had was "8 out of 10 customers
           | say they prefer the feel of their data after it is restored".
        
       | [deleted]
        
       | anderspitman wrote:
       | The things we could build if S3 specified a simple OAuth2-based
       | protocol for delegating read/write access. The world needs an
       | HTTP-based protocol for apps to access data on the user's behalf.
       | Google Drive is the closest to this but it only has a single
       | provider and other issues[0]. I'm sad remoteStorage never caught
       | on. I really hope Solid does well but it feels too complex to me.
       | My own take on the problem is https://gemdrive.io/, but it's
       | mostly on hold while I'm focused on other parts of the self-
       | hosting stack.
       | 
       | [0]: https://gdrivemusic.com/help
        
       | baq wrote:
       | > What's interesting here, when you look at the highest-level
       | block diagram of S3's technical design, is the fact that AWS
       | tends to ship its org chart. This is a phrase that's often used
       | in a pretty disparaging way, but in this case it's absolutely
       | fascinating.
       | 
       | I'd go even further: at this scale, it is essential if you want
       | to develop these kinds of projects with any sort of velocity.
       | 
       | Large organizations ship their communication structure by design.
       | The alternative is engineering anarchy.
        
         | hobo_in_library wrote:
         | This is also why reorgs tend to be pretty common at large tech
         | orgs.
         | 
         | They know they'll almost inevitably ship their org chart. And
         | they'll encounter tons of process-based friction if they don't.
         | 
         | The solution: Change your org chart to match what you want to
         | ship
        
         | Severian wrote:
         | Straight from The Mythical Man Month: Organizations which
         | design systems are constrained to produce systems which are
         | copies of the communication structures of these organizations.
        
       | epistasis wrote:
       | Working in genomics, I've dealt with lots of petabyte data stores
       | over the past decade. Having used AWS S3, GCP GCS, and a raft of
       | storage systems for co-located hardware (Ceph, Gluster, and an HP
       | system whose name I have blocked from my memory), I have no small
       | amount of appreciation for the effort that goes into operating
       | these sorts of systems.
       | 
       | And the benefit of sharing disk IOPS with untold numbers of
       | other customers is hard to overstate. I hadn't heard the term
       | "heat" as it's used in the article, but it's incredibly hard to
       | mitigate on a single system. For our co-located hardware
       | clusters, we had to customize the batch systems to treat IO as
       | an allocatable resource, the same as RAM or CPU, in order to
       | manage it correctly across large jobs (a rough sketch of the
       | idea is below). S3 and GCP are super expensive, but the
       | performance can be worth it.
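       | 
       | (Conceptually the change is just making IO a countable resource
       | like cores or memory. A toy Python admission check, not tied to
       | any particular batch scheduler:)
       | 
       |   # Treat IOPS as an allocatable resource alongside CPU/RAM.
       |   from dataclasses import dataclass
       | 
       |   @dataclass
       |   class Node:
       |       free_cpus: int
       |       free_ram_gb: int
       |       free_iops: int    # the extra dimension: disk IO budget
       | 
       |   @dataclass
       |   class Job:
       |       cpus: int
       |       ram_gb: int
       |       iops: int         # jobs declare their IO demand too
       | 
       |   def can_schedule(node: Node, job: Job) -> bool:
       |       return (job.cpus <= node.free_cpus
       |               and job.ram_gb <= node.free_ram_gb
       |               and job.iops <= node.free_iops)
       | 
       |   def allocate(node: Node, job: Job) -> None:
       |       node.free_cpus -= job.cpus
       |       node.free_ram_gb -= job.ram_gb
       |       node.free_iops -= job.iops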
       | 
       | This sort of article is some of the best of HN, IMHO.
        
       | deathanatos wrote:
       | > _Now, let's go back to that first hard drive, the IBM RAMAC
       | from 1956. Here are some specs on that thing:_
       | 
       | > _Storage Capacity: 3.75 MB_
       | 
       | > _Cost: ~$9,200 /terabyte_
       | 
       | Those specs can't possibly be correct. If you multiply the cost
       | by the storage, the cost of the drive works out to 3¢.
       | 
       | This site[1] states,
       | 
       | > _It stored about 2,000 bits of data per square inch and had a
       | purchase price of about $10,000 per megabyte_
       | 
       | So perhaps the specs should read $9,200 / _megabyte_? (Which
       | would put the drive's cost at $34,500, which seems more
       | plausible.)
       | 
       | [1]: https://www.historyofinformation.com/detail.php?entryid=952
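       | 
       | (A quick sanity check of both readings, taking the 3.75 MB
       | capacity as given:)
       | 
       |   # RAMAC cost under the two possible readings of the spec.
       |   capacity_mb = 3.75
       | 
       |   per_mb = 9_200 / 1_000_000     # if $9,200 per terabyte
       |   print(per_mb * capacity_mb)    # ~$0.035, implausibly cheap
       | 
       |   print(9_200 * capacity_mb)     # $34,500 if $9,200/megabyte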
        
         | acdha wrote:
         | https://en.m.wikipedia.org/wiki/IBM_305_RAMAC has the likely
         | source of the error: 30M bits (using the 6 data bits but not
         | parity), but it rented for $3k per month so you didn't have a
         | set cost the same as buying a physical drive outright - very
         | close to S3's model, though.
        
         | andywarfield wrote:
         | oh shoot. good catch, thanks!
        
         | birdyrooster wrote:
         | Must've put a decimal point in the wrong place or something. I
         | always do that. I always mess up some mundane detail.
        
           | S_A_P wrote:
           | Did you get the memo? Yeah I will go ahead and get you
           | another copy of that memo.
        
       | jakupovic wrote:
       | The part about distributing load takes me back to the S3 KeyMap
       | days and trying to migrate to it from the initial
       | implementation. What I learned is that even after you identify
       | the hottest objects/partitions/buckets, you cannot simply move
       | them and be done. Everything had to be sorted first. The actual
       | solution was to sort, divide each host's partition load into
       | quartiles, and move the second-quartile partitions onto the
       | least loaded hosts. If you tried to move the hottest buckets,
       | the 1st quartile, it would put even more load on the remaining
       | members, which would then fail, over and over again.
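       | 
       | (Roughly, as a simplified Python sketch rather than the real
       | KeyMap code, the quartile idea looks something like this:)
       | 
       |   # Pick the *second* quartile of a hot host's partitions and
       |   # pair them with the least loaded hosts; moving the hottest
       |   # quartile just overloads whoever receives it.
       |   def pick_moves(partitions, other_hosts):
       |       # partitions: [(name, load)], other_hosts: [(host, load)]
       |       ranked = sorted(partitions, key=lambda p: p[1],
       |                       reverse=True)
       |       q = len(ranked) // 4
       |       second = ranked[q:2 * q]     # skip the hottest quartile
       |       targets = sorted(other_hosts, key=lambda h: h[1])
       |       return list(zip(second, targets))
       | 
       |   moves = pick_moves(
       |       [("p1", 90), ("p2", 80), ("p3", 60), ("p4", 40),
       |        ("p5", 30), ("p6", 20), ("p7", 10), ("p8", 5)],
       |       [("hostA", 15), ("hostB", 35), ("hostC", 25)])
       |   print(moves)   # 2nd-quartile partitions -> coolest hosts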
       | 
       | Another side effect was that the error rate went from steady ~1%
       | to days without any errors. Consequently we updated the alerts to
       | be much stricter. This was around 2009 or so.
       | 
       | Also came from an academic background, UM, but instead of
       | getting my PhD I joined S3. It even rhymes :).
        
       | dsalzman wrote:
       | > Imagine a hard drive head as a 747 flying over a grassy field
       | at 75 miles per hour. The air gap between the bottom of the plane
       | and the top of the grass is two sheets of paper. Now, if we
       | measure bits on the disk as blades of grass, the track width
       | would be 4.6 blades of grass wide and the bit length would be one
       | blade of grass. As the plane flew over the grass it would count
       | blades of grass and only miss one blade for every 25 thousand
       | times the plane circled the Earth.
        
       | mcapodici wrote:
       | S3 is more than storage. It is a standard. I like how you can
       | get S3-compatible storage (usually with some small caveats) from
       | a few places. I am not sure how open the standard is, or whether
       | you have to pay Amazon to say you are "S3 compatible", but it is
       | pretty cool.
       | 
       | Examples:
       | 
       | iDrive has E2, Digital Ocean has Object Storage, Cloudflare has
       | R2, Vultr has Object Storage, Backblaze has B2
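       | 
       | Practically, most SDKs just let you point at a different
       | endpoint; e.g. with boto3 (the endpoint URL and credentials
       | below are placeholders):
       | 
       |   # Point a standard S3 client at any S3-compatible provider.
       |   import boto3
       | 
       |   s3 = boto3.client(
       |       "s3",
       |       endpoint_url="https://s3.example-provider.invalid",
       |       aws_access_key_id="YOUR_KEY",
       |       aws_secret_access_key="YOUR_SECRET",
       |   )
       |   s3.put_object(Bucket="my-bucket", Key="hi.txt", Body=b"hi")
       |   obj = s3.get_object(Bucket="my-bucket", Key="hi.txt")
       |   print(obj["Body"].read())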
        
       ___________________________________________________________________
       (page generated 2023-07-27 23:00 UTC)