[HN Gopher] Building and operating a pretty big storage system c...
___________________________________________________________________

Building and operating a pretty big storage system called S3

Author : werner
Score  : 157 points
Date   : 2023-07-27 15:20 UTC (7 hours ago)

(HTM) web link (www.allthingsdistributed.com)
(TXT) w3m dump (www.allthingsdistributed.com)

| Twirrim wrote:
| > That's a bit error rate of 1 in 10^15 requests. In the real world, we see that blade of grass get missed pretty frequently - and it's actually something we need to account for in S3.
|
| One of the things I remember from my time at AWS was conversations about how 1 in a billion events end up being a daily occurrence when you're operating at S3 scale. Things that you'd normally mark off as so wildly improbable it's not worth worrying about have to be considered and handled.
|
| Glad to read about ShardStore, and especially the formal verification, property-based testing, etc. The previous generation of services was notoriously buggy, a very good example of the usual perils of organic growth (but at least really well designed such that they'd fail "safe", ensuring no data loss, something S3 engineers obsessed about).
| Waterluvian wrote:
| Ever see a UUID collision?
| ignoramous wrote:
| James Hamilton, AWS' chief architect, wrote about this phenomenon in 2017: _At scale, rare events aren't rare_; https://news.ycombinator.com/item?id=14038044
| ilyt wrote:
| I think Ceph hit similar problems and they had to add more robust checksumming to the system, as relying on just TCP checksums for integrity, for example, was no longer enough.
| Twirrim wrote:
| Yes, I remember TCP checksumming coming up as not sufficient at one stage. Even saw S3 deal with a real head-scratcher of a non-impacting event that came down to a single NIC in a single machine corrupting the TCP checksum under very specific circumstances.
| mjb wrote:
| > daily occurrence when you're operating at S3 scale
|
| Yeah! With S3 averaging over 100M requests per second, 1 in a billion happens every ten seconds. And it's not just S3. For example, for Prime Day 2022, DynamoDB peaked at over 105M requests per second (just for the Amazon workload): https://aws.amazon.com/blogs/aws/amazon-prime-day-2022-aws-f...
|
| In the post, Andy also talks about Lightweight Formal Methods and the team's adoption of Rust. When even extremely low probability events are common, we need to invest in multiple layers of tooling and process around correctness.
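A quick back-of-the-envelope check of the rate described above, as a sketch in Python (the ~100M requests/second figure is the one mjb cites; everything else is just arithmetic):

    # How often does a "one in a billion" event occur at roughly S3's request rate?
    # Assumes the ~100M requests/second figure quoted above.
    requests_per_second = 100_000_000
    event_probability = 1e-9  # one in a billion

    events_per_second = requests_per_second * event_probability
    print(f"expected events per second: {events_per_second:.2f}")           # 0.10
    print(f"seconds between events:     {1 / events_per_second:.0f}")       # ~10
    print(f"expected events per day:    {events_per_second * 86_400:,.0f}") # ~8,640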
| ldjkfkdsjnv wrote:
| Also worked at Amazon, saw some issues with major well-known open source libraries that broke in places nobody would ever expect.
| wrboyce wrote:
| Any examples you can share?
| rubiquity wrote:
| Daily? A component I worked on that supported S3's Index could hit a 1 in a billion issue multiple times a minute. Thankfully we had good algorithms, and hardware is a lot more reliable these days!
| Twirrim wrote:
| This was 7-8 years ago now. Lots of scaling up since those days :)
| jl6 wrote:
| Great to see Amazon employees being allowed to talk openly about how S3 works behind the scenes. I would love to hear more about how Glacier works. As far as I know, they have never revealed what the underlying storage medium is, leading to a lot of wild speculation (tape? offline HDDs? custom HDDs?).
| Twirrim wrote:
| Glacier is a big "keep your lips sealed" one. I'd love AWS to talk about everything there, and the entire journey it was on, because it is truly fascinating.
| [deleted]
| inopinatus wrote:
| Never officially stated, but frequent leaks from insiders confirm that Glacier is based on Very Large Arrays of Wax Phonograph Records (VLAWPR) technology.
| Twirrim wrote:
| We came up with that idea in Glacier during the run-up to April one year (2014, I think?), half jokingly suggested it as an April Fool's Day joke, but Amazon quite reasonably decided against doing such jokes.
|
| One of the tagline ideas we had was "8 out of 10 customers say they prefer the feel of their data after it is restored".
| [deleted]
| anderspitman wrote:
| The things we could build if S3 specified a simple OAuth2-based protocol for delegating read/write access. The world needs an HTTP-based protocol for apps to access data on the user's behalf. Google Drive is the closest to this, but it only has a single provider and other issues[0]. I'm sad remoteStorage never caught on. I really hope Solid does well, but it feels too complex to me. My own take on the problem is https://gemdrive.io/, but it's mostly on hold while I'm focused on other parts of the self-hosting stack.
|
| [0]: https://gdrivemusic.com/help
| baq wrote:
| > What's interesting here, when you look at the highest-level block diagram of S3's technical design, is the fact that AWS tends to ship its org chart. This is a phrase that's often used in a pretty disparaging way, but in this case it's absolutely fascinating.
|
| I'd go even further: at this scale, shipping the org chart is essential if you want to develop these kinds of projects with any sort of velocity.
|
| Large organizations ship their communication structure by design. The alternative is engineering anarchy.
| hobo_in_library wrote:
| This is also why reorgs tend to be pretty common at large tech orgs.
|
| They know they'll almost inevitably ship their org chart. And they'll encounter tons of process-based friction if they don't.
|
| The solution: change your org chart to match what you want to ship.
| Severian wrote:
| Straight from The Mythical Man-Month: Organizations which design systems are constrained to produce systems which are copies of the communication structures of these organizations.
| epistasis wrote:
| Working in genomics, I've dealt with lots of petabyte data stores over the past decade. Having used AWS S3, GCP GCS, and a raft of storage systems for collocated hardware (Ceph, Gluster, and an HP system whose name I have blocked from my memory), I have no small amount of appreciation for the effort that goes into operating these sorts of systems.
|
| And the benefits of sharing disk IOPS with untold numbers of other customers are hard to overstate. I hadn't heard the term "heat" as it's used in the article, but it's incredibly hard to mitigate on a single system. For our co-located hardware clusters, we would have to customize the batch systems to treat IO as an allocatable resource, the same as RAM or CPU, in order to manage it correctly across large jobs. S3 and GCP are super expensive, but the performance can be worth it.
|
| This sort of article is some of the best of HN, IMHO.
| deathanatos wrote:
| > _Now, let's go back to that first hard drive, the IBM RAMAC from 1956. Here are some specs on that thing:_
|
| > _Storage Capacity: 3.75 MB_
|
| > _Cost: ~$9,200/terabyte_
|
| Those specs can't possibly be correct. If you multiply the cost by the storage, the cost of the drive works out to about 3¢.
|
| This site[1] states,
|
| > _It stored about 2,000 bits of data per square inch and had a purchase price of about $10,000 per megabyte_
|
| So perhaps the specs should read $9,200/_megabyte_? (Which would put the drive's cost at $34,500, which seems more plausible.)
|
| [1]: https://www.historyofinformation.com/detail.php?entryid=952
| acdha wrote:
| https://en.m.wikipedia.org/wiki/IBM_305_RAMAC has the likely source of the error: 30M bits (using the 6 data bits but not parity), but it rented for $3k per month, so you didn't have a set cost the same as buying a physical drive outright - very close to S3's model, though.
| andywarfield wrote:
| oh shoot. good catch, thanks!
| birdyrooster wrote:
| Must've put a decimal point in the wrong place or something. I always do that. I always mess up some mundane detail.
| S_A_P wrote:
| Did you get the memo? Yeah, I will go ahead and get you another copy of that memo.
| jakupovic wrote:
| The part about distributing loads takes me back to the S3 KeyMap days and me trying to migrate to it from the initial implementation. What I learned is that even after you identify the hottest objects/partitions/buckets, you cannot simply move them and be done. Everything had to be sorted. The actual solution was to sort and then divide the host's partition load into quartiles and move the second-quartile partitions onto the least loaded hosts. If one tried to move the hottest buckets, the 1st quartile, it would put even more load on the remaining members, which would fail, over and over again.
|
| Another side effect was that the error rate went from a steady ~1% to days without any errors. Consequently we updated the alerts to be much stricter. This was around 2009 or so.
|
| Also came from an academic background, UM, but instead of getting my PhD I joined S3. It even rhymes :).
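A toy sketch of the quartile-based rebalancing jakupovic describes, in Python. This is illustrative only (not S3's KeyMap code), and every host, partition, and load number below is invented: sort a hot host's partitions by load, skip the hottest quartile, and move the second quartile onto whichever hosts currently carry the least load.

    # Sketch of quartile-based rebalancing: sort the hottest host's partitions
    # by load, skip the first (hottest) quartile, and move the second quartile
    # onto the currently least-loaded hosts.
    def plan_moves(partition_load, placement):
        """partition_load: {partition: load}; placement: {partition: host}.
        Returns a list of (partition, from_host, to_host) moves."""
        host_load = {}
        for part, host in placement.items():
            host_load[host] = host_load.get(host, 0.0) + partition_load[part]

        hottest_host = max(host_load, key=host_load.get)
        local = sorted((p for p, h in placement.items() if h == hottest_host),
                       key=partition_load.get, reverse=True)

        q = max(1, len(local) // 4)
        second_quartile = local[q:2 * q]  # moving the 1st quartile would overload the receivers

        moves = []
        for part in second_quartile:
            target = min(host_load, key=host_load.get)  # least-loaded host right now
            moves.append((part, hottest_host, target))
            host_load[hottest_host] -= partition_load[part]
            host_load[target] += partition_load[part]
        return moves

    # Invented example: host-a is clearly hot.
    loads = {"p1": 9.0, "p2": 7.0, "p3": 5.0, "p4": 3.0, "p5": 1.0, "p6": 1.0}
    where = {"p1": "host-a", "p2": "host-a", "p3": "host-a", "p4": "host-a",
             "p5": "host-b", "p6": "host-c"}
    print(plan_moves(loads, where))  # [('p2', 'host-a', 'host-b')]

The quartile split is doing the work here: by never touching the hottest 25% of partitions, the rebalance itself avoids dumping the worst hot spots onto the receiving hosts.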
| dsalzman wrote:
| > Imagine a hard drive head as a 747 flying over a grassy field at 75 miles per hour. The air gap between the bottom of the plane and the top of the grass is two sheets of paper. Now, if we measure bits on the disk as blades of grass, the track width would be 4.6 blades of grass wide and the bit length would be one blade of grass. As the plane flew over the grass it would count blades of grass and only miss one blade for every 25 thousand times the plane circled the Earth.
| mcapodici wrote:
| S3 is more than storage. It is a standard. I like how you can get S3-compatible (usually with some small caveats) storage from a few places. I am not sure how open the standard is, or whether you have to pay Amazon to say you are "S3 compatible", but it is pretty cool.
|
| Examples:
|
| iDrive has E2, Digital Ocean has Object Storage, Cloudflare has R2, Vultr has Object Storage, Backblaze has B2
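As a concrete illustration of what "S3 compatible" usually means in practice: the same client library pointed at a different endpoint. A minimal sketch with boto3; the endpoint URL, bucket name, and credentials are placeholders, not real values.

    # Same S3 client, different provider: only the endpoint and credentials change.
    # All values below are placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.example-provider.com",  # e.g. an R2/B2/Spaces endpoint
        aws_access_key_id="YOUR_KEY_ID",
        aws_secret_access_key="YOUR_SECRET",
    )

    s3.put_object(Bucket="my-bucket", Key="hello.txt", Body=b"hello")
    print(s3.get_object(Bucket="my-bucket", Key="hello.txt")["Body"].read())

The "small caveats" mcapodici mentions tend to show up in the less common corners of the API (ACLs, versioning, lifecycle rules) rather than in basic get/put calls.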