[HN Gopher] The Next Backblaze Storage Pod
___________________________________________________________________

The Next Backblaze Storage Pod

Author : TangerineDream
Score  : 189 points
Date   : 2021-06-24 16:11 UTC (6 hours ago)

(HTM) web link (www.backblaze.com)
(TXT) w3m dump (www.backblaze.com)

| [deleted]
| dragontamer wrote:
| I have to imagine that by making Storage Pods 1.0 through 6.0,
| they may have "encouraged" Dell (and other manufacturers) to see
| this particular 60+ hard drive server as a good idea.
|
| And now that multiple "storage pod-like" systems exist in the
| marketplace (not just Dell, but also Supermicro) selling 60-bay
| or 90-bay 3.5" hard drive storage servers in 4U rack form
| factors, there's not much reason to build their own.
|
| At least, that's my assumption. After all, if the server chassis
| is a commodity now (and it absolutely is), there's no point making
| custom small runs for a hypothetical Storage Pod 7. Economies of
| scale are too big a benefit (worst case scenario: it's now Dell's
| or Supermicro's problem rather than Backblaze's).
|
| EDIT: I admit that I don't really work in IT, I'm just a
| programmer. So I don't really know how popular 4U / ~60 HDD
| servers were before Backblaze Storage Pod 1.0.
| bluedino wrote:
| You'd think they could build an ARM-powered, credit-card-sized
| controller for them with a disk breakout card and network IO.
| A PC motherboard and full-sized cards seem like overkill.
| francoisfeugeas wrote:
| A French SDS company, now owned by OVH, did exactly that a
| few years ago:
| https://fr.slideshare.net/openio/openio-serverless-storage-7...
|
| I don't think they actually sold any.
| yabones wrote:
| I have seen some people build Ceph clusters using the HC2
| board [1] before. I'm not sure what the performance is
| like, but it seems like a neat way to scale out storage.
| The only real shortcoming is that there's a single NIC...
| If there were two, you could use an HA stack for your
| network and have a very robust system for very cheap.
|
| [1] https://www.hardkernel.com/shop/odroid-hc2-home-cloud-two/
| dragontamer wrote:
| They're running a fair bit of math (probably Reed-Solomon
| matrix multiplications for error correction) over all the
| data accesses.
|
| Given the bandwidth of 60+ hard drives (150MB/s per hard
| drive x 60 == 9GB/s in/out), I'm pretty sure you need a
| decent CPU just to handle the PCIe traffic. At least PCIe 3.0
| x16 just for the hard drives, and then another x16 for
| network connections (multiple PHYs for fiber in/out that can
| handle that 9GB/s to a variety of switches).
|
| We're looking at PCIe 3.0 x32 just for HDDs and networking.
| Throw down an NVMe cache or other stuff and I'm not seeing any
| kind of small system working out here.
|
| ---------
|
| Then the math comes in: matrix multiplications over every bit
| of data to verify checksums, plus Reed-Solomon error correction,
| start to get expensive. Maybe if you had an FPGA or some
| kind of specialist DSP (lol, GPUs maybe, since they're good at
| matrix multiplication), you could handle the bandwidth. But it
| seems nontrivial to me.
|
| A server CPU seems to be the cheap and simple answer: you get a
| large number of PCIe I/O lanes plus a beefy CPU to handle the
| calculations. Maybe a cheap CPU with many I/O lanes going to
| a GPU / FPGA / ASIC for the error-checking math, but...
| specialized chips cost money. I don't think a cheap low-power
| CPU would be powerful enough to perform real-time error
| correction calculations over 9GB/s of data.
|
| --------
|
| We can leave Backblaze's workload and think about typical SAN
| or NAS workloads too. More I/O is needed if you add NVMe
| storage to cache hard drive reads/writes, and tons of RAM is
| needed if you plan to dedup.
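
A back-of-the-envelope check of the bandwidth arithmetic above (a
minimal sketch in Python; the 150 MB/s per drive and ~1 GB/s per
PCIe 3.0 lane per direction are the thread's working assumptions,
not measured figures):

    # Rough aggregate-bandwidth arithmetic from the comment above.
    DRIVES = 60
    MB_PER_DRIVE = 150     # assumed sequential MB/s per hard drive
    PCIE3_LANE_GBPS = 1.0  # ~usable GB/s per PCIe 3.0 lane, per direction

    aggregate_gbps = DRIVES * MB_PER_DRIVE / 1000        # 9.0 GB/s
    lanes_for_drives = aggregate_gbps / PCIE3_LANE_GBPS  # ~9 lanes

    print(f"aggregate drive bandwidth: {aggregate_gbps:.1f} GB/s")
    print(f"PCIe 3.0 lanes to carry it: ~{lanes_for_drives:.0f}")
    print("plus roughly the same again for the NIC side")

On these assumptions a PCIe 3.0 x16 link comfortably covers the
drives, consistent with both dragontamer's estimate and
Dylan16807's narrower x8 figure below.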
| gpm wrote:
| I'm not familiar with the algorithms, but matrix
| multiplication sounds well suited to GPUs. I wonder if
| you could get away with a much cheaper CPU and a cheaper
| GPU for less total cost?
| dragontamer wrote:
| But the main issue with GPUs (or FPGAs / ASICs) is that now
| you need to send 9GB/s to some other chip AND back again.
|
| That means 9GB/s downstream (to be processed by the GPU)
| + 9GB/s upstream (once the GPU is done with the data), or a
| total bandwidth of 18GB/s aggregate to the GPU / FPGA / ASIC /
| whatever coprocessor you're using.
|
| So that's what? Another 32 lanes of PCIe 3.0? Maybe a
| 16x PCIe 4.0 GPU can handle that kind of I/O... but you
| can see that moving all this data around is non-trivial,
| even if we assume the math is instantaneous.
|
| ---------
|
| Practically speaking, it seems like any CPU with enough
| PCIe bandwidth to handle this traffic is a CPU beefy
| enough to run the math itself.
| Dylan16807 wrote:
| PCIe 3.0 is 1GB/s per lane _in each direction_. A 3.0 8x
| link would do a good job of saturating the drives. And
| basically any CPU could run 8x to the storage controllers
| and 8x to a GPU. Get any Ryzen chip and you can run 4
| lanes directly to a network card too.
| nine_k wrote:
| If only the ASIC on the HDD could run these computations
| and correct bit errors right during data transfers!
| dragontamer wrote:
| The HDD ASIC certainly is doing those computations.
|
| The issue is that Backblaze has a second layer of error
| correction codes, and that layer needs to be calculated
| somewhere. If enough errors come from a drive, the
| administrators take down the box, replace the hard drives,
| and resilver the data.
|
| Backblaze physically distributes the data over 20
| separate computers in 20 separate racks. Some computer
| needs to run the math to "combine" the data (error
| correction + checksums and all) back into the original
| data on every single read. No single hard drive can do
| this math, because the data has been deliberately dispersed
| across so many different computers.
| dwild wrote:
| > Given the bandwidth of 60+ hard drives (150MB/s per hard
| drive x 60 == 9GB/s in/out)
|
| Given their scale and goals, it would be pretty wasteful to
| build it to max out the write speed of all the hard drives.
| Considering you rarely write to a given pod, you would be
| better off getting a fraction of that speed and writing to
| multiple pods at the same time to get the required peak
| performance.
|
| It actually makes much more sense to put that math on some
| ingest servers, so these hard drive servers would simply
| write the resulting data. That makes it much easier and
| faster to divide the data over 20 pods, like they currently
| do.
| bluedino wrote:
| Definitely limited by the 1Gbps or even 10Gbps network
| connection.
| e12e wrote:
| Any pod like this would normally have at least 1x40Gbps
| uplink, minimum?
|
| Like most blade setups, like (random example):
| https://www.storagereview.com/review/supermicro-x11-microbla...
| dragontamer wrote:
| Storage Pod 6.0 seems to be 2x10Gbps Ethernet:
| https://www.backblaze.com/blog/open-source-data-storage-serv...
| nine_k wrote:
| Read load alone can be pretty high.
|
| And no, you want to calculate checksums and fix bit
| errors right there in the RAM buffers you just read or
| received, because at such scales hardware is not error-
| free.
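
A toy illustration of the second-layer erasure coding discussed in
this subthread. Backblaze's Vault design spreads each file across
20 machines as Reed-Solomon shards (17 data + 3 parity, per their
blog); the sketch below substitutes a single XOR parity shard,
which survives only ONE loss, purely to show the shape of the
encode/recover math:

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def encode(data: bytes, k: int):
        """Split data into k equal shards plus one XOR parity shard."""
        shard_len = -(-len(data) // k)           # ceiling division
        data = data.ljust(shard_len * k, b"\0")  # pad to a multiple
        shards = [data[i * shard_len:(i + 1) * shard_len]
                  for i in range(k)]
        parity = shards[0]
        for s in shards[1:]:
            parity = xor_bytes(parity, s)
        return shards + [parity]                 # k + 1 shards total

    def recover(shards, lost: int) -> bytes:
        """Rebuild the shard at index `lost` by XORing the survivors."""
        survivors = [s for i, s in enumerate(shards) if i != lost]
        out = survivors[0]
        for s in survivors[1:]:
            out = xor_bytes(out, s)
        return out

    shards = encode(b"some user data to back up", k=4)
    assert recover(shards, lost=2) == shards[2]  # any one shard can die

Real Reed-Solomon replaces the XOR with GF(256) arithmetic so that
any 3 of the 20 shards can be lost; the per-byte cost is similar.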
| scottlamb wrote:
| > They're running a fair bit of math (probably Reed Solomon
| matrix multiplications for error correction) over all the
| data accesses.
|
| Do those run on this machine? I imagine Backblaze has
| redundancy at the cluster level rather than the machine level.
| That allows them to lose a single machine without any data
| becoming unavailable. It also means we shouldn't assume the
| erasure code calculations happen on a machine with 60
| drives attached. That's still possible, but alternatively
| the client [1] could do those calculations and the drive
| machines could simply read and write raw chunks. This
| can mean less network bandwidth [2] and better load
| balancing (heavier calculations done further from the
| stateful component).
|
| [1] Meaning a machine handling a user-facing request or re-
| replication after drive/machine loss.
|
| [2] Assume data is divided into slices that are
| reconstructed from N of M chunks, such that each chunk is
| smaller than its slice. [3] On read, the client-side
| erasure code design means N chunk transfers from drive
| machines to the client. If instead the client queries one of
| the relevant drive machines, that machine has to receive N-1
| chunks from the others and send back a full slice. (Similar
| for writes.) More network traffic on the drive machine and
| across the network in total, less on the client.
|
| [3] This assumption might not make sense if they care more
| about minimizing seeks on read than minimizing bytes
| stored. Then they might keep at least one full copy that
| doesn't require accessing the others.
| morei wrote:
| That's not really how it's done.
|
| RS is normally used as an erasure code: it's used when writing
| (to compute code blocks), and when reading _only when data
| is missing_. Checksums are used to detect corrupt data, which
| is then treated as missing, and RS is used to reconstruct
| it. Using RS to detect/correct corrupt data is very
| inefficient.
|
| Checksums are also normally free (CRC + memcpy on most
| modern CPUs runs in the same time that memcpy alone does:
| it's entirely memory bound).
|
| The generation of code blocks is also fairly cheap: certainly
| no large matrix multiplications! This is because the erasure
| code generally only spans a small number of blocks (e.g. 10
| data blocks), so every code byte depends on only 10 data
| bytes. The math for this is reasonably simple, and further
| simplified with some reasonably sized look-up tables.
|
| That's not to say that there is no CPU needed, but it's
| really not all that much, certainly nothing that needs
| acceleration support.
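
A minimal sketch of the look-up-table arithmetic morei describes:
each code byte is a small GF(2^8) dot product over ~10 data bytes,
i.e. a couple of table lookups and XORs per byte rather than a
large matrix multiplication. The primitive polynomial (0x11d) is
the conventional choice; the coefficients here are made up for
illustration (a real code derives them from its generator matrix):

    # Build GF(2^8) exp/log tables for fast multiplication.
    GF_EXP = [0] * 512
    GF_LOG = [0] * 256
    x = 1
    for i in range(255):
        GF_EXP[i] = x
        GF_LOG[x] = i
        x <<= 1
        if x & 0x100:          # reduce modulo the primitive polynomial
            x ^= 0x11D
    for i in range(255, 512):  # doubled table avoids a modulo in gf_mul
        GF_EXP[i] = GF_EXP[i - 255]

    def gf_mul(a: int, b: int) -> int:
        if a == 0 or b == 0:
            return 0
        return GF_EXP[GF_LOG[a] + GF_LOG[b]]

    def code_byte(data: bytes, coeffs: bytes) -> int:
        # One parity byte: XOR-sum of coefficient*data products.
        acc = 0
        for d, c in zip(data, coeffs):
            acc ^= gf_mul(d, c)
        return acc

    # e.g. one code byte spanning 10 data bytes:
    print(code_byte(b"0123456789", bytes(range(1, 11))))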
| bluedino wrote:
| Xeon E5-1620, last I saw.
| pram wrote:
| They existed for other applications, like NetApp and the ZFS
| appliance. Long, long before 2009.
| [deleted]
| walrus01 wrote:
| 3U and 4U x86 whitebox servers designed for any standard
| 12"x13" motherboard, where the entire front panel was hot-swap
| 3.5" HDD bays, were already a thing many, many years before
| Backblaze existed.
|
| What wasn't really a thing was servers with hot-swap HDD trays
| on both ends (like the Supermicros) and designs with vertical
| hard drives dropped down through a top-opening lid to achieve
| even higher density.
| briffle wrote:
| The Sun X4500 "Thumper" server had 48 drives, if I remember
| correctly, and came out in 2006ish.
|
| It had hot-swap SATA disks (up to 512GB disks initially!) and
| was actually pretty cool and forward-thinking.
|
| https://web.archive.org/web/20061128164442/http://www.sun.co...
| zrail wrote:
| IIRC the company I was at at the time had a set of Thumpers
| for something. Maybe a SAN?
|
| They were incredibly cool for the time.
| notyourday wrote:
| That's not the Backblaze design. The Backblaze design is that
| the drives are individually hot-swappable without a tray. 60
| commodity SATA drives that can be removed and serviced
| individually while a 4U server continues to operate normally
| is pretty amazing.
| dangerboysteve wrote:
| The company which manufactures the metal cases created a
| spinoff company called 45Drives, which sells commercially
| supported pods.
| notyourday wrote:
| They are fantastic.
| Wassight wrote:
| Just a reminder that Backblaze uses dark patterns for their
| account cancellations. You'll never be able to use all of the
| time you pay for.
| foodstances wrote:
| Did the surge in Chia mining affect global hard drive prices
| at all?
| that_lurker wrote:
| Not yet, but when Chia pools become available the mining will
| take off and HDD prices will most likely rise.
| richwater wrote:
| Chia is a literal scam.
|
| There's a massive amount of premined Chia controlled by the
| "Chia strategic reserve".
|
| It will take a decade for the amount of mined value to equal
| the pre-mined value.
| josefresco wrote:
| I looked into Chia mining as a hobby and was directed to
| Burstcoin. I don't know much about either, but Burstcoin
| advocates claim it's the "better" PoC.
|
| Note: I went to double-check something on the Burstcoin
| website and realized that today, June 24th, they changed their
| name to Signum - https://www.burst-coin.org/
| d33lio wrote:
| Not for hyperscalers like Backblaze. They have contracts with
| specific purchase quotas and guaranteed price deltas. Chia has
| certainly affected prices on the secondary markets, though;
| there hasn't been a better time in the past decade to be a
| secondary server "junk" seller on eBay! NetApp 4246 JBODs are
| going for $1000! Absolutely insane!
| sliken wrote:
| I've been watching drive prices, and it's hard to say exactly
| why, but around mid-April disk prices at Newegg and Amazon
| jumped significantly. One drive that had been $300 jumped to
| $400, then $500, and even spiked to $800 for a bit. By June
| 1st it had dropped to $550, and only this week has it dropped
| to $400. Still above the original $300, but at least not a
| terribly painful premium.
| ev1 wrote:
| Chia mining-before-transactions-are-released -> Chia
| transactions released -> Chia price at a high -> Chia price
| halves shortly after; it turns out it's virtually impossible
| for US users to get on any of the exchanges handling XCH.
| infogulch wrote:
| If low volume is a problem for manufacturers because you don't
| need that much, the obvious solution is to increase volume by
| selling them. Of course that would introduce even more
| problems to solve, but at least volume wouldn't be one of
| them.
| igravious wrote:
| FTA: "Right after we introduced Storage Pod 1.0 to the world,
| we had to make a decision as to whether or not to make and
| sell Storage Pods in addition to our cloud-based services. We
| did make and sell a few Storage Pods--we needed the money--but
| we eventually chose software."
| jleahy wrote:
| They already sold them, but personally I thought they were too
| expensive; presumably they were adding a mark-up to their
| build cost when selling them.
| bluedino wrote:
| I don't think Backblaze ever sold them, but 45Drives does.
| I'm not sure if they assemble them for BB or if they were
| just using the published design.
| igravious wrote:
| They did at one point. FTA: "Right after we introduced
| Storage Pod 1.0 to the world, we had to make a decision as
| to whether or not to make and sell Storage Pods in addition
| to our cloud-based services. We did make and sell a few
| Storage Pods--we needed the money--but we eventually chose
| software."
| jleahy wrote:
| Indeed, I was thinking of 45Drives.
| igravious wrote:
| Not any more they don't. FTA: "Right after we introduced
| Storage Pod 1.0 to the world, we had to make a decision as to
| whether or not to make and sell Storage Pods in addition to
| our cloud-based services. We did make and sell a few Storage
| Pods--we needed the money--but we eventually chose software."
| ineedasername wrote:
| Servicing and warranty management for hardware sales is a very
| different business from their core competency.
| ajaimk wrote:
| It's actually interesting to me that Backblaze has reached a
| size where global logistics plays a bigger part in costs than
| the actual servers. (And the servers got cheaper.)
|
| Also, Dell and Supermicro have storage servers inspired by the
| BB Pods.
|
| Glad to see this scrappy company hit this amount of scale; a
| long way from shucking hard drives.
| andrewtbham wrote:
| It's really amazing they made their own for so long...
| hardware is a commodity business.
| bluedino wrote:
| The original Storage Pod was only 1/7th the cost of a Dell
| solution, and that didn't include any labor, software, blah
| blah blah.
|
| They're still only buying the assembled hardware from Dell.
| jasode wrote:
| _> hardware is a commodity business._
|
| The _hardware components_ of the Backblaze Pod are
| commodities, but the entire finished unit is not a commodity.
| E.g. the rough equivalent from 45Drives is not a commodity:
| https://www.45drives.com/products/storinator-xl60-configurat...
| ineedasername wrote:
| I guess it's kind of like building your own gaming PC: even
| paying retail prices for the parts, you can build your own for
| significantly less than a comparable pre-built system. Since
| their business model is "extremely cheap unlimited backup
| storage" they had to go it alone, but now there are more COTS
| options similar to their needs.
| wilhil wrote:
| I'm curious what is being used for the drives (and to a lesser
| extent, memory) - Dell or OEM - and how support works.
|
| We sell a lot of Dell, and for base models it is very
| economical compared to self-built.
|
| The moment we add a few high-capacity hard drives or memory,
| however, all bets are off and it's usually 1.75-4x the price
| of a white-box part.
|
| I get them not supporting the part itself, but we once had
| them refuse to support a RAID card error (corrupt memory)
| after they saw we had a third-party drive... and we only buy a
| handful of servers a month. I can imagine this possibly being
| a huge problem for Backblaze, though...
| ocdtrekkie wrote:
| Flash storage especially goes through the roof at enterprise
| purchasing. I've bought the drive trays and used consumer SSDs
| in servers more than a few times with no real ill effects
| where SATA is acceptable. If you need SAS, you just need to
| accept the pain that is about to come when you order.
| wfleming wrote:
| Backblaze has usually sourced their own hard drives, and I
| suspect they still are/will. (The post didn't seem to indicate
| otherwise.)
|
| Every year they post a summary of what models they're working
| with and how they perform, which is usually good reading. This
| is last year's:
| https://www.backblaze.com/blog/backblaze-hard-drive-stats-fo...
| ineedasername wrote:
| _The post didn't seem to indicate otherwise_
|
| That appeared to depend on whether the vendor imposed massive
| markups on the drives. However, they also mentioned service,
| etc.: if they struck a deal with Dell, then Dell might be
| perfectly happy to sell the servers at a very modest profit
| while making their money on the service agreement.
| d33lio wrote:
| _But can it farm Chia?_
|
| Always cool to get insights into the business and technical
| challenges at Backblaze!
| amelius wrote:
| The title is misleading, as there is no next Backblaze storage
| pod, and there never will be, according to the article.
| wtallis wrote:
| > and there never will be, according to the article.
|
| The article doesn't say that. It says:
|
| > So the question is: Will there ever be a Storage Pod 7.0 and
| beyond? We want to say yes. We're still control freaks at
| heart, meaning we'll want to make sure we can make our own
| storage servers so we are not at the mercy of "Big Server
| Inc." In addition, we do see ourselves continuing to invest in
| the platform so we can take advantage of and potentially
| create new, yet practical ideas in the space (Storage Pod X
| anyone?). So, no, we don't think Storage Pods are dead,
| they'll just have a diverse group of storage server friends to
| work with.
| ceejayoz wrote:
| "The Next Backblaze Storage Pod" is commercially available
| storage from Dell. It's a little clickbaity, but it's both a)
| the title of the article, which HN encourages using, and b)
| accurate.
| choppaface wrote:
| > That's a trivial number of parts and vendors for a hardware
| company, but stating the obvious, Backblaze is a software
| company.
|
| Stating the obvious: Backblaze wants investors to value them
| like a SaaS company. This blog post suggests they're more of a
| logistics and product company: huge capex and depreciating
| assets on hand. As a customer, I like their product, but
| they're no Dropbox. If they would allow personal NAS use, then
| I could see them being a software company.
| ahmedalsudani wrote:
| It's easy to buy a bunch of hard drives and connect them in a
| data center. Managing petabytes per user for thousands of
| users is the hard part, and it's a software problem.
|
| Backblaze is definitely a SaaS company... though the quality
| of their offering certainly lags behind Dropbox, both in terms
| of feature set and user experience. They're also in a very
| competitive industry. Storage/backup is basically a commodity
| nowadays.
| ricardobeat wrote:
| I sync my NAS to Backblaze B2 without any issues, and the
| pricing is great.
| chx wrote:
| For me the big deal is Backblaze B2, especially when fronted
| by Cloudflare: zero traffic costs. Storage is cheap as far as
| cloud storage providers go, and traffic is decidedly the
| cheapest possible. (See the sketch below.)
| wmf wrote:
| Dropbox owns more hardware than Backblaze.
| edgeform wrote:
| I always love these articles and look forward to them without
| knowing it.
|
| This one is particularly interesting, as they discuss the
| logistical challenges of their own success in having to build
| more and more Storage Pods.
|
| As always, a super fascinating read, worth your time.
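
A minimal sketch of the B2 setup chx describes, assuming B2's
S3-compatible API via boto3. The endpoint, bucket, and credential
names are placeholders; the zero-egress part comes from serving
reads through a Cloudflare-proxied hostname (Backblaze and
Cloudflare are both Bandwidth Alliance members):

    import boto3

    # Hypothetical B2 client over the S3-compatible endpoint.
    b2 = boto3.client(
        "s3",
        endpoint_url="https://s3.us-west-002.backblazeb2.com",  # example region
        aws_access_key_id="YOUR_KEY_ID",
        aws_secret_access_key="YOUR_APPLICATION_KEY",
    )

    b2.upload_file("nas-backup.tar", "my-backup-bucket",
                   "2021/06/nas-backup.tar")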
| jleahy wrote:
| I'm surprised Dell doesn't make them buy hard drives from them
| at a substantial mark-up, as they allude to vendors doing
| earlier in the story.
| tpetry wrote:
| Or Dell does force them to buy their drives, but their markup
| is not as high as their competitors'?
| gnopgnip wrote:
| If you are buying in bulk, the pricing can be a lot better
| than the list prices. But generally there isn't a problem
| buying a server without drives; Dell will still support the
| server for non-disk-related warranty issues, and you don't
| need special firmware or disks.
| foobarbazetc wrote:
| Dell (and, really, all server providers apart from
| Supermicro) have crazy markups on storage.
|
| It's where they make most of their margin.
|
| And then, most of the time, the drives they sell come on
| custom sleds that they don't sell separately, as a form of
| DRM/lock-in.
|
| Then you get a nice little trade in Chinese-made sleds that
| sort of work, but not for anything recent like hot-swap NVMe
| drives.
|
| I'm sure BB were able to negotiate down a lot (Dell usually
| comes down 50% off the list price if you press them hard
| enough for one-off projects), but... yeah. That's how it
| generally goes.
| jleahy wrote:
| The default markup is awful; check the Dell website. I'd
| describe the process of buying a drive from Dell as a bit like
| getting mugged.
| Analemma_ wrote:
| I imagine if you're buying sixty pods a month, every month,
| you have some leverage with Dell to get better prices,
| especially if you have a demonstrated ability to just walk
| away and build your own if you don't like their offer.
| gpm wrote:
| Dell's not in the best negotiating position here, given that
| "build our own" is a valid alternative.
| ineedasername wrote:
| Dell may be very happy to have a high-volume customer with
| very standardized and predictable needs, and so they're happy
| with modest markups and extra profit on the service
| agreements, which is a nice benefit for Backblaze, since
| building their own pods doesn't give them any service
| guarantee/warranty.
| bluedino wrote:
| Any idea what Dell is actually selling them? The DVRs we buy
| (Avigilon) are white Dell 7x0s with a custom white bezel, but
| those only fit 18 3.5" drives.
| narism wrote:
| Dell's densest server is the PowerEdge XE7100 [1] (100 3.5"
| drives in 5U), but the bezel cover picture looks more like a
| standard 2U, maybe an R740xd2 (26 3.5" drives in 2U).
|
| [1] https://www.delltechnologies.com/asset/en-ae/products/server...
|
| https://www.servethehome.com/dell-emc-poweredge-xe7100-100-d...
| erikpt-work wrote:
| Could be the MD3060e with an R650 server?
|
| https://i.dell.com/sites/doccontent/shared-content/data-shee...
| toomuchtodo wrote:
| Your link 404s; I think it's the extra character on the end.
| maxclark wrote:
| I'd love to know this as well. Dell doesn't have anything
| remotely close to what Backblaze was designing/building
| themselves.
|
| So did they do something custom (unlikely at this volume) or
| did Backblaze change their hardware approach?
| philjohn wrote:
| They do since last year:
| https://www.servethehome.com/dell-emc-poweredge-xe7100-100-d...
| ineedasername wrote:
| Not quite the same, but they do have something like the Pods,
| just a bit more modular:
|
| It's their PowerEdge MX platform, which lets you slot in
| different "sleds" for storage/compute etc. as needed. It can
| take 7 storage sleds for a total of 112 drives per chassis.
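
For comparison, a quick sketch of the densities quoted in this
subthread (drive counts and rack units are as the commenters state
them; the 7U size of the MX chassis is an assumption, and Storage
Pod 6.0's 60 drives in 4U is from Backblaze's published design):

    # Drives per rack unit for the chassis mentioned above.
    chassis = {
        "Backblaze Storage Pod 6.0":  (60, 4),   # 60 drives in 4U
        "Dell PowerEdge XE7100":      (100, 5),  # 100 drives in 5U
        "Dell PowerEdge R740xd2":     (26, 2),   # 26 drives in 2U
        "Dell PowerEdge MX, 7 sleds": (112, 7),  # assuming a 7U chassis
    }
    for name, (drives, units) in chassis.items():
        print(f"{name}: {drives / units:.1f} drives/U")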
| brandon wrote:
| Based on the pictured bezel, it looks like they've got three
| rows of 3.5" 14TB SATA drives up front in 14th-generation
| carriers. My best guess would be something like an R740xd2,
| which has 26 total drive bays per 2U.
| wrikl wrote:
| The author recently commented:
| https://www.backblaze.com/blog/next-backblaze-storage-pod/#c...
|
| It's apparently the "Dell PowerEdge R740xd2 rack server".
| chx wrote:
| https://ifworlddesignguide.com/entry/281015-poweredge-r740xd...
| Super interesting design.
| encryptluks2 wrote:
| I can't say that I'm surprised, and honestly anyone can open
| source the architecture for a storage array of comparable use.
| I think the only unique thing here really is the chassis, but
| there are plenty of whitebox vendors that sell storage
| chassis. You may not get as many drives in one, but the other
| components in these things are usually pretty cheap, minus the
| storage. I don't really see this as a loss for the community
| at all, and maybe someone else will get creative and build
| something better.
___________________________________________________________________
(page generated 2021-06-24 23:00 UTC)