[HN Gopher] 6 Raspberry Pis, 6 SSDs on a Mini ITX Motherboard
       ___________________________________________________________________
        
       6 Raspberry Pis, 6 SSDs on a Mini ITX Motherboard
        
       Author : ingve
       Score  : 311 points
       Date   : 2022-08-17 14:45 UTC (8 hours ago)
        
 (HTM) web link (www.jeffgeerling.com)
 (TXT) w3m dump (www.jeffgeerling.com)
        
       | alberth wrote:
       | Imagine instead of 6 Pi's this was 6 M2 arm chips on a mini-ITX
       | board.
        
         | dis-sys wrote:
         | That would cost 6 arms and 6 legs.
        
           | alberth wrote:
            | As a thought experiment, consider:
            | 
            | - the 8GB Pi costs $75 (Geekbench score of 318 single / 808
            | multi-core)
            | 
            | - the 8GB M1 Mac mini costs $699 (Geekbench score of 1752
            | single / 7718 multi-core)
            | 
            | This isn't too far off linear price scaling.
            | 
            | The M1 Mac mini costs 9.3x more, but gets 5.5x faster single-
            | core & 9.5x faster multi-core performance.
           | 
           | https://browser.geekbench.com/v5/cpu/15902536
           | 
           | https://browser.geekbench.com/v5/cpu/15912650
        
             | rbanffy wrote:
              | > The M1 Mac mini costs 9.3x more, but gets 5.5x faster
              | single-core & 9.5x faster multi-core performance.
              | 
              | It's not often that we get a Mac with better
              | price/performance than any other computer ;-)
             | 
             | Their M-series is quite remarkable.
        
         | kllrnohj wrote:
         | You can get things like Epyc mini-ITX boards (
         | https://www.anandtech.com/show/16277/asrock-rack-deep-mini-i...
         | ) if you're just looking to ramp up the compute in a tiny
         | space. Divide it up into 6 VMs if you still want it to be a
         | cluster even.
        
         | rbanffy wrote:
         | I find these boards - that get little boards and network them
         | into a cluster - very interesting. I'd like to see more of
         | these in the future. I hope someone makes a Framework-
         | compatible motherboard with these at some point.
         | 
         | The Intel Edison module could have been a viable building block
         | for one (and it happened a long time ago in computing terms -
         | 2014) - it was self-contained, with RAM and storage on the
         | module - but it lacked ethernet to connect multiple ones on a
         | board - and I don't remember it having a fast external bus to
         | build a network on. And it was quickly discontinued.
        
       | tenebrisalietum wrote:
       | > Many people will say "just buy one PC and run VMs on it!", but
       | to that, I say "phooey."
       | 
        | I mean, with cross-VM leaks like Spectre (not sure how much
        | similar issues affect ARM, tbh), having physical barriers between
        | your CPUs can be seen as a positive thing.
        
         | mrweasel wrote:
         | Sure, it's just that the Raspberry Pi isn't really fast enough
         | for most production workloads. Having a cluster of them doesn't
          | really help; you'd still be better off with a single PC.
          | 
          | As a learning tool, having the ability to build a real hardware
          | cluster in a Mini-ITX case is awesome. I do sort of wonder what
          | the business case for these boards is, I mean are there
          | actually enough people who want to do something like this...
          | schools maybe? I still think it's beyond weird that there is so
          | much hardware available for building Pi clusters, but I can't
          | get an ARM desktop motherboard, with a PCI slot, capable of
          | actually being used as a desktop, for a reasonable price.
        
           | schaefer wrote:
           | The Nvidia Jetson AGX Orin Dev Kit is getting pretty damn
           | close to a performant Linux/ARM desktop system.
            | GB5 scores: 763 single / 7193 multi. That's roughly 80% of
            | the performance of my current x86 desktop.
            | 
            |     Ubuntu 20.04 LTS
            |     12-core Cortex-A78AE v8.2, 64-bit, 2.20 GHz
            |     32 GB LPDDR5, 256-bit, 204.8 GB/s
            |     NVIDIA graphics/AI acceleration
            |     PCIe slot
            |     64 GB eMMC 5.1
            |     M.2 PCIe Gen 4
            |     DisplayPort 1.4a +MST
            | 
            | The deal breaker, if there is one, is price: $2k
        
           | geerlingguy wrote:
           | I think a lot of these types of boards are built with the
           | business case of either "edge/IoT" (which still for some
           | reason causes people to toss money at them since they're hot
           | buzzwords... just need 5G too for the trifecta), or for
           | deploying many ARM cores/discrete ARM64 computers in a
           | space/energy-efficient manner. Some places need little ARM
           | build farms, and that's where I've seen the most non-hobbyist
           | interest in the CM4 blade, Turing Pi 2, and this board.
        
         | ReadTheLicense wrote:
          | The future of cloud is Zero Isolation... With all the
          | mitigations slowing things down, and energy prices high and
          | still rising, having super-small nodes that are always reserved
          | for one task seems interesting.
        
       | sitkack wrote:
       | Anyone else having problems with the shipping page? Says it
       | cannot ship to my address, formatted incorrectly ...
        
       | erulabs wrote:
       | Man, Ceph really doesn't get enough love. For all the distributed
       | systems hype out there - be it Kubernetes or blockchains or
       | serverless - the ol' rock solid distributed storage systems sat
       | in the background iterating like crazy.
       | 
        | We had a _huge_ Rook/Ceph installation in the early days of our
       | startup before we killed off the product that used it (sadly). It
       | did explode under some rare unusual cases, but I sometimes miss
       | it! For folks who aren't aware, a rough TLDR is that Ceph is to
       | ZFS/LVM what Kubernetes is to containers.
       | 
       | This seems like a very cool board for a Ceph lab - although -
       | extremely expensive - and I say that as someone who sells very
       | expensive Raspberry Pi based computers!
        
         | halbritt wrote:
         | I love it, but when it fails at scale, it can be hard to reason
         | about. Or at least that was the case when I was using it a few
         | years back. Still keen to try it again and see what's changed.
         | I haven't run it since bluestore was released.
        
           | teraflop wrote:
           | Yeah, I've been running a small Ceph cluster at home, and my
           | only real issue with it is the relative scarcity of good
           | _conceptual_ documentation.
           | 
           | I personally learned about Ceph from a coworker and fellow
           | distributed systems geek who's a big fan of the design. So I
           | kind of absorbed a lot of the concepts before I ever actually
           | started using it. There have been quite a few times where I
            | look at a command or config parameter, and think, "oh, I know
            | what that's _probably_ doing under the hood"... but when I
           | try to actually check that assumption, the documentation is
           | missing, or sparse, or outdated, or I have to "read between
           | the lines" of a bunch of different pages to understand what's
           | really happening.
        
         | geerlingguy wrote:
         | I think many people (myself included) had been burned by major
         | disasters on earlier clustered storage solutions (like early
         | Gluster installations). Ceph seems to have been under the radar
          | for a while by the time it got to a more stable/usable point,
          | and came more into the limelight once people started deploying
          | Kubernetes (and Rook, and more integrated/holistic clustered
          | storage solutions).
          | 
          | So I think a big part of Ceph's success (at least IMO) was its
          | timing, and its adoption into a more cloud-first ecosystem.
          | That narrowed the use cases down from what the earliest
          | networked storage software was trying to solve.
        
         | mcronce wrote:
         | Ceph is fantastic. I use it as the storage layer in my homelab.
         | I've done some things that I can only concisely describe as
          | _super fucked up_ to this Ceph cluster, and every single time
          | I've come out the other side with zero data loss, not having to
         | restore a backup.
        
           | erulabs wrote:
           | Haha "super fucked up" is a much better way of describing the
           | "usual, rare" situations I was putting it into as well :P
        
             | dylan604 wrote:
             | Care to provide examples of what these things were that you
             | were doing to a storage pool? I guess I'm just not
             | imaginative enough to think about ways of using a storage
             | pool other than storing data in it.
        
               | erulabs wrote:
               | In our case we were a free-to-use-without-any-signup way
               | of testing Kubernetes. You could just go to the site and
               | spin up pods. Looking back, it was a bit insane.
               | 
               | Anyways, you can imagine we had all sorts of attacks and
               | miners or other abusive software running. This on top of
               | using ephemeral nodes for our free service meant hosts
               | were always coming and going and ceph was always busy
               | migrating data around. The wrong combo of nodes dying and
               | bursting traffic and beta versions of Rook meant we ran
               | into a huge number of edge cases. We did some
               | optimization and re-design, but it turned out there just
               | weren't enough folks interested in paying for multi-
               | tenant Kubernetes. We did learn an absolute ton about
               | multi-tenant K8s, so, if anyone is running into those
               | challenges, feel free to hire us :P
        
               | Lex-2008 wrote:
               | not OP, but I would start with filling disk space up to
               | 100%, or creating zillions of empty files. In case of
               | distributed filesystems - maybe removing one node (under
               | heavy load preferably), or "cloning" nodes so they had
               | same UUIDs (preferably nodes storing some data on them -
               | to see if the data will be de-duplicated somehow).
               | 
               | Or just a disk with unreliable USB connection?
        
         | 3np wrote:
         | We're more and more feeling we made the wrong call with
         | gluster... The underlying bricks being a POSIX fs felt a lot
         | safer at the time but in hindsight ceph or one of the newer
         | ones would probably have been a better choice. So much
         | inexplicable behavior. For your sake I hope the grass really is
         | greener.
        
           | rwmj wrote:
           | Red Hat (owner of Gluster) has announced EOL in 2024:
           | https://access.redhat.com/support/policy/updates/rhs/
           | 
           | Ceph is where the action is now.
        
         | imiric wrote:
         | Can someone with experience with Ceph and MinIO or SeaweedFS
         | comment on how they compare?
         | 
         | I currently run a single-node SnapRAID setup, but would like to
         | expand to a distributed one, and would ideally prefer something
          | simple (which is why I chose SnapRAID over ZFS). Ceph feels too
         | enterprisey and complex for my needs, but at the same time, I
         | wouldn't want to entrust my data to a simpler project that can
         | have major issues I only discover years down the road.
         | 
         | SeaweedFS has an interesting comparison[1], but I'm not sure
         | how biased it is.
         | 
         | [1]: https://github.com/seaweedfs/seaweedfs#compared-to-ceph
        
         | nik736 wrote:
          | Ceph always seems to come up in connection with big block
          | storage outages. This is why I am very wary of using it. Has
          | this changed? Edit: rephrased a bit
        
           | antongribok wrote:
           | Ceph is incredibly stable and resilient.
           | 
           | I've run Ceph at two Fortune 50 companies since 2013 to now,
           | and I've not lost a single production object. We've had
           | outages, yes, but not because of Ceph, it was always
           | something else causing cascading issues.
           | 
           | Today I have a few dozen clusters with over 250 PB total
           | storage, some on hardware with spinning rust that's over 5
           | years old, and I sleep very well at night. I've been doing
           | storage for a long time, and no other system, open source or
           | enterprise, has given me such a feeling of security in
           | knowing my data is safe.
           | 
           | Any time I read about a big Ceph outage, it's always a bunch
           | of things that should have never been allowed in production,
           | compounded by non-existent monitoring, and poor understanding
           | of how Ceph works.
        
             | shrubble wrote:
             | Can you talk about the method that Ceph has for determining
             | whether there was bit rot in a system?
             | 
             | My understanding is that you have to run a separate
             | task/process that has Ceph go through its file structures
             | and check it against some checksums. Is it a separate step
             | for you, do you run it at night, etc.?
        
               | lathiat wrote:
               | That's called ceph scrub & deep-scrub.
               | 
               | By default it "scrubs" basic metadata daily and does a
               | deep scrub where it fully reads the object and confirms
               | the checksum is correct from all 3 replicas weekly for
               | all of the data in the cluster.
               | 
               | It's automatic and enabled by default.
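                | 
                | If you want to poke at it by hand, a rough sketch (the PG
                | id is just an example; exact commands can vary a bit by
                | release):
                | 
                |     # force an immediate deep scrub of one PG
                |     ceph pg deep-scrub 2.1f
                | 
                |     # report any scrub errors / inconsistent PGs
                |     ceph health detail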
        
               | shrubble wrote:
               | So what amount of disk bandwidth/usage is involved?
               | 
               | For instance, say that I have 30TB of disk space used and
                | it is across 3 replicas, thus 3 systems.
               | 
                | When I kick off the deep scrub operation, what amount of
               | reads will happen on each system? Just the smaller amount
               | of metadata or the actual full size of the files
               | themselves?
        
               | teraflop wrote:
               | In Ceph, objects are organized into placement groups
               | (PGs), and a scrub is performed on one PG at a time,
               | operating on all replicas of that PG.
               | 
               | For a normal scrub, only the metadata (essentially, the
               | list of stored objects) is compared, so the amount of
               | data read is very small. For a deep scrub, each replica
               | reads and verifies the contents of all its data, and
               | compares the hashes with its peers. So a deep scrub of
               | all PGs ends up reading the entire contents of every
               | disk. (Depending on what you mean by "disk space used",
               | that could be 30TB, or 30TBx3.)
               | 
               | The deep scrub frequency is configurable, so e.g. if each
               | disk is fast enough to sequentially read its entire
               | contents in 24 hours, and you choose to deep-scrub every
               | 30 days, you're devoting 1/30th of your total IOPS to
               | scrubbing.
               | 
               | Note that "3 replicas" is not necessarily the same as "3
               | systems". The normal way to use Ceph is that if you set a
               | replication factor of 3, each PG has 3 replicas that are
                | _chosen_ from your pool of disks/servers; a cluster with
               | N replicas and N servers is just a special case of this
               | (with more limited fault-tolerance). In a typical
               | cluster, any given scrub operation only touches a small
               | fraction of the disks at a time.
        
               | teraflop wrote:
               | Just to add to the other comment: Ceph checksums data and
               | metadata on every read/write operation. So even if you
               | completely disable scrubbing, if data on a disk becomes
               | corrupted, the OSD will detect it and the client will
               | transparently fail over to another replica, rather than
               | seeing bad data or an I/O error.
               | 
               | Scrubbing is only necessary to _proactively_ detect bad
               | sectors or silent corruption on infrequently-accessed
               | data, so that you can replace the drive early without
               | losing redundancy.
        
       | aftbit wrote:
       | Imagine being able to buy 6 Raspberry Pis! I have so many
       | projects I'd like to do, both personal and semi-commercial, but
       | it's been literal years since I've seen a Raspberry Pi 4
       | available in stock somewhere in the USA, let alone 6.
        
         | hackeraccount wrote:
         | Crap. When did that happen? I swear I bought like two or three
         | not that long ago and they were like 40 or 50 apiece.
        
         | tssva wrote:
         | Micro Center often has them in stock. My local Micro Center
         | currently has 17 RPi 4 4GB in stock. They are available only in
          | store, but you can check stock at their website. Find one close
          | to someone you know who is willing to purchase it for you and
          | ship it.
        
       | alexk307 wrote:
        | Cool, but I still can't find a single Raspberry Pi Compute Module
        | despite having been in the market for 2 years...
        
         | simcop2387 wrote:
         | https://rpilocator.com/ is probably the best place to keep an
         | eye out for them. This is unfortunately also the case for non-
          | CM Pis. Been wanting to get some more Pi 4s to replace some
          | rather old Pi 3s (non-plus) that I've got running, just because
          | I want UEFI boot on everything since it makes managing things
          | that much easier.
        
         | mongol wrote:
         | Yes when will this dry spell end?
        
           | geerlingguy wrote:
           | So far it seems like "maybe 2023"... this year supplies have
           | been _slightly_ better, but not amazing. Usually a couple CM4
           | models pop up over the course of a week on rpilocator.com.
        
       | onlyrealcuzzo wrote:
       | > I was able to get 70-80 MB/sec write speeds on the cluster, and
       | 110 MB/sec read speeds. Not too bad, considering the entire
       | thing's running over a 1 Gbps network. You can't really increase
       | throughput due to the Pi's IO limitations--maybe in the next
       | generation of Pi, we can get faster networking!
       | 
       | This isn't clear to me. What am I missing?
        
         | gorkish wrote:
         | The device has an onboard 8 port unmanaged gigabit switch. The
         | two external ports are just switch ports and cannot be
         | aggregated in any way. The entire cluster is therefore limited
         | effectively to 1gbps throughput.
         | 
         | IMO it ruins the product utterly and completely. They should
         | have integrated a switch IC similar to what's used in the
         | netgear gs110mx which has 8 gigabit and 2 multi-gig interfaces.
        
           | geerlingguy wrote:
           | It would be really cool if they could split out 2.5G
           | networking to all the Pis, but with the current generation of
           | Pi it only has one PCIe lane, so you'd have to add in a PCIe
           | switch for each Pi if you still wanted those M.2 NVMe
           | slots... that adds a lot of cost and complexity.
           | 
           | Failing that, a 2.5G external port would be the simplest way
           | to make this thing more viable as a storage cluster board,
           | but that would drive up the switch chip cost a bit (cutting
           | into margins). So the last thing would be allowing management
           | of the chip (I believe this Realtek chip does allow it, but
           | it's not exposed anywhere), so you could do link
           | aggregation... but that's not possible here either. So yeah,
           | the 1 Gbps is a bummer. Still fun for experimentation, and
           | very niche production use cases, but less useful generally.
        
         | lathiat wrote:
         | 110MB/s is gigabit. It's limited to gigabit networking and only
         | has 1Gbps out from the cluster board. So there's no way to do
         | an aggregate speed of more than 1Gbps/110MBs on this particular
         | cluster board.
         | 
         | If each pis Ethernet was broken out individually and you used a
         | 10G uplink switch or multiple 1G client ports then you could do
         | better.
         | 
         | The write speed being lower than the read speed will be because
         | writes have to be replicated to two other nodes in the ceph
         | cluster (everything has 3 replicas) which are also sharing
         | bandwidth on those same 1G links. Reads don't need to replicate
         | so can consume the full bandwidth.
         | 
         | So basically it's all network limited for this use case. Needs
         | a 2.5G uplink, LACP link aggregation or individual Ethernet
         | ports to do better.
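          | 
          | Back-of-the-envelope on those numbers: 1 Gbps is roughly 117
          | MB/s of usable payload, so 110 MB/s reads are essentially line
          | rate. With 3 replicas, every MB a client writes turns into
          | about 3 MB of traffic inside the cluster (one copy in from the
          | client plus two replica copies between nodes), which is
          | consistent with writes settling in the 70-80 MB/s range.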
        
           | sitkack wrote:
           | Which ICs would you use for this? Do you have something in
           | mind?
        
             | Retr0id wrote:
             | I'm not sure about specific ICs, but if you took a look
             | inside a Netgear GS110MX you'd have a good candidate IC.
        
             | magicalhippo wrote:
             | Just a random search on Mouser, but something like the
              | BCM53134O[1] with four 1GbE ports and one 2.5GbE port. A bit
             | pricier you have the BCM53156XU[2] with eight 1GbE ports
             | and a 10G port for fiber.
             | 
             | Not my field so could be other, better parts.
             | 
             | [1]: https://eu.mouser.com/ProductDetail/Broadcom-
             | Avago/BCM53134O...
             | 
             | [2]: https://eu.mouser.com/ProductDetail/Broadcom-
             | Avago/BCM53156X...
        
         | CameronNemo wrote:
          | 110 Megabytes/s == 880 Megabits/s, which is approaching the top
          | speed of the network interface, which is the main bottleneck.
         | A board with more IO, like the rk3568 which has 2x PCIe 2 lanes
         | and 2x PCIe 3 lanes, or a hypothetical rpi5, can deliver more
         | throughput.
        
       | naranha wrote:
       | Do you think Raspberries without ECC RAM are fine to use for a
       | Ceph storage cluster? I did some research yesterday on the same
       | topic, many say ECC Ram is essential for Ceph (and ZFS too). But
       | I'm not sure what to believe, sure data could get corrupted in
       | RAM before being written to the cluster, but how likely is that?
        
         | dpedu wrote:
         | ECC is not necessary for ZFS - this is a commonly repeated
         | myth.
         | 
         | https://news.ycombinator.com/item?id=14447297
        
         | dijit wrote:
          | Raspberry Pis have ECC memory; it's just on-die ECC.
         | 
         | (this was a surprise to me too)
        
           | EvanAnderson wrote:
           | It sounds like the ECC counters are completely hidden from
           | the SoC, though:
           | https://forums.raspberrypi.com/viewtopic.php?t=315415
        
             | naranha wrote:
              | Sounds like what DDR5 is doing too: errors are corrected
              | automatically, but not necessarily communicated (?)
        
               | amarshall wrote:
               | DDR5 on-die ECC is _not_ the same as traditional ECC. To
               | that point, there are DDR5 modules with full ECC. On-die
               | DDR5 ECC is there because it _needs_ to be for the
               | modules to really work at all.
        
       | jonhohle wrote:
       | This looks incredible. Is it possible to expose a full PCI
       | interface from an NVMe slot? I have an old SAS controller that I
        | want to keep running. If I could do that from a Pi, that would be
       | incredible.
        
         | geerlingguy wrote:
         | If you want to use SAS with the Pi, I've only gotten newer
         | generation Broadcom/LSI cards working so far--see my notes for
         | the storage controllers here:
         | https://pipci.jeffgeerling.com/#sata-cards-and-storage
        
           | jonhohle wrote:
           | Incredible resource, thanks! I'm currently using an older
           | MegaRAID card, but could upgrade if I can find a reasonable
           | configuration to migrate.
        
           | formerly_proven wrote:
           | > newer
           | 
           | Which is probably for the best - I don't know how these newer
           | cards behave, but a commonality of all the older RAID/HBA
           | cards seems to be "no power management allowed". Maybe they
           | improved that area, because it's pretty unreasonable for an
           | idle RAID card to burn double digit Watts if you ask me...
        
             | geerlingguy wrote:
             | The 9405W cards I most recently tested seem to consume
             | about 7W steady state (which is more than the Pi that was
             | driving it!), so yeah... they're still not quite as
             | efficient as running a smaller SATA card if you just need a
             | few drives. But if you want SAS or Tri-mode (NVMe/SAS/SATA)
             | and have an HBA or RAID card, this is a decent enough way
             | to do it!
        
         | mcronce wrote:
         | You can get M.2 to PCI-E add-in-card adapters, yeah - as long
         | as it's an M.2 slot that supports NVMe, not SATA only
        
           | crest wrote:
           | I don't see how they could have hooked a 2.5Gb/s Ethernet NIC
           | to the CM4 modules without using up the single PCI-e 2.0 lane
           | other than adding a power hungry, expensive and often out of
           | stock PCI-e switching chip.
        
         | wtallis wrote:
         | There's no such thing as an NVMe slot, just M.2 slots that have
         | PCIe lanes. NVMe is a protocol that runs on top of PCIe, and is
         | something that host systems support at the software level, not
         | in hardware. (Firmware support for NVMe is necessary to boot
         | off NVMe SSDs, but the Pi doesn't have that and must boot off
         | eMMC or SD cards.)
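          | 
          | A quick way to see what's actually sitting on the Pi's single
          | PCIe lane (assuming pciutils is installed) is lspci; an NVMe
          | SSD just shows up as an ordinary PCIe device:
          | 
          |     lspci -nn
          |     # e.g. 01:00.0 Non-Volatile memory controller [0108]: ...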
        
       | gorkish wrote:
       | Would buy this in an instant if it weren't hobbled as hell by the
       | onboard realtek switch. If it had an upstream 2.5/5/10g port it
       | would be instantly 6 times more capable.
        
       | antongribok wrote:
       | Aside from running Ceph as my day job, I have a 9-node Ceph
        | cluster on Raspberry Pi 4s at home that I've been running for a
       | year now, and I'm slowly starting to move things away from ZFS to
       | this cluster as my main storage.
       | 
       | My setup is individual nodes, with 2.5" external HDDs (mostly
        | SMR), so I actually get slightly better performance than this
       | cluster, and I'm using 4+2 erasure coding for the main data pool
       | for CephFS.
       | 
       | CephFS has so far been incredibly stable and all my Linux laptops
       | reconnect to it after sleep with no issues (in this regard it's
       | better than NFS).
       | 
       | I like this setup a lot better now than ZFS, and I'm slowly
       | starting to migrate away from ZFS, and now I'm even thinking of
       | setting up a second Ceph cluster. The best thing with Ceph is
       | that I can do a maintenance on a node at any time and storage
       | availability is never affected, with ZFS I've always dreaded any
       | kind of upgrade, and any reboot requires an outage. Plus with
       | Ceph I can add just one disk at a time to the cluster and disks
       | don't have to be the same size. Also, I can move the physical
       | nodes individually to a different part of my home, change
       | switches and network cabling without an outage now. It's a nice
       | feeling.
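        | 
        | For anyone curious, a kernel-client CephFS mount looks roughly
        | like this (monitor address, user name and secret file are
        | placeholders):
        | 
        |     mount -t ceph 192.168.1.21:6789:/ /mnt/cephfs \
        |         -o name=laptop,secretfile=/etc/ceph/laptop.secret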
        
         | bityard wrote:
         | Is 9 the minimum number of nodes you need for a reasonable ceph
         | setup or is that just what you arrived at for your use case?
        
           | geerlingguy wrote:
           | I've seen setups with as few as 2 nodes with osds and a
           | monitor (so 3 in total), but I believe 4-5 nodes is the
           | minimum recommendation.
        
           | antongribok wrote:
           | I would say the minimum is whatever your biggest replication
           | or erasure coding config is, plus 1. So, with just replicated
           | setups, that's 4 nodes, and with EC 4+2, that's 7 nodes. With
           | EC 8+3, which is pretty common for object storage workloads,
           | that's 12 nodes.
           | 
           | Note, a "node" or a failure domain, can be configured as a
           | disk, an actual node (default), a TOR switch, a rack, a row,
           | or even a datacenter. Ceph will spread the replicas across
           | those failure domains for you.
           | 
           | At work, our bigger clusters can withstand a rack going down.
           | Also, the more nodes you have, the less of an impact it is on
           | the cluster when a node goes down, and the faster the
           | recovery.
           | 
           | I started with 3 RPis then quickly expanded to 6, and the
           | only reason I have 9 nodes now is because that's all I could
           | find.
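            | 
            | As a concrete sketch (pool and profile names made up), an EC
            | 4+2 pool that spreads chunks across hosts looks something
            | like:
            | 
            |     ceph osd erasure-code-profile set ec42 \
            |         k=4 m=2 crush-failure-domain=host
            |     ceph osd pool create cephfs_data 128 128 erasure ec42
            | 
            | Swapping crush-failure-domain to osd, rack, etc. changes what
            | Ceph treats as the failure domain.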
        
             | mbreese wrote:
              | Can I ask an off-topic, in no way RPi-related question?
             | 
             | For larger ceph clusters, how many disks/SSD/nvme are
             | usually attached to a single node?
             | 
             | We are in the middle of transitioning from a handful of big
             | (3x60 disk, 1.5PB total) JBOD Gluster/ZFS arrays and I'm
             | trying to figure out how to migrate to a ceph cluster of
             | equivalent size. It's hard to figure out exactly what the
             | right size/configuration should be. And I've been using ZFS
             | for so long (10+ years) that thinking of _not_ having
             | healing zpools is a bit scary.
        
               | antongribok wrote:
               | For production, we have two basic builds, one for block
               | storage, which is all-flash, and one for object storage
               | which is spinning disks plus small NVMe for
               | metadata/Bluestore DB/WAL.
               | 
               | The best way to run Ceph is to build as small a server as
               | you can get away with economically and scale that
               | horizontally to 10s or 100s of servers, instead of trying
               | to build a few very large vertical boxes. I have run Ceph
               | on some 4U 72-drive SuperMicro boxes, but it was not fun
               | trying to manage hundreds of thousands of threads on a
               | single Linux server (not to mention NUMA issues with
               | multiple sockets). An ideal server would be one node to
               | one disk, but that's usually not very economical.
               | 
               | If you don't have access to custom ODM-type gear or
               | open-19 and other such exotics, what's been working for
               | me have been regular single socket 1U servers, both for
               | block and for object.
               | 
               | For block, this is a very normal 1U box with 10x SFF SAS
               | or NVMe drives, single CPU, a dual 25Gb NIC.
               | 
               | For spinning disk, again a 1U box, but with a deeper
               | chassis you can fit 12x LFF and still have room for a
               | PCI-based NVMe card, plus a dual 25Gb NIC. You can get
               | these from SuperMicro, Quanta, HP.
               | 
               | Your 3x60 disk setup sounds like it might fit in 12U
               | (assuming 3x 4U servers). With our 1U servers I believe
               | that can be done with 15x 1U servers (1.5 PiB usable
               | would need roughly 180x 16TB disks with EC 8+3, you'll
               | need more with 3x replication).
               | 
               | Of course, if you're trying to find absolute minimum
               | requirements that you can get away with, we'd have to
               | know a lot more details about your workload and existing
               | environment.
               | 
               | EDITING to add:
               | 
               | Our current production disk sizes are either 7.68 or
               | 15.36 TB for SAS/NVMe SSDs at 1 DWPD or less, and 8 TB
               | for spinning disk. I want to move to 16 TB drives, but
               | haven't done so for various tech and non-tech reasons.
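                | 
                | Back-of-the-envelope on the sizing above (rounded, my
                | numbers):
                | 
                |     raw needed = usable x (k+m)/k
                |                = 1.5 PiB x 11/8 ~= 2.1 PiB (~2.3 PB)
                |     2.3 PB / 16 TB ~= 145 disks at 100% full
                | 
                | Keeping the cluster under the usual ~80-85% full puts
                | you around 170-180 disks, i.e. roughly 15x 1U servers
                | with 12 LFF bays each.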
        
           | lathiat wrote:
           | For the standard 3x replicated setup, 3 nodes is the minimum
           | for any kind of practical redundancy but you really want 4 so
           | that after failure of 1 node all the data can be recovered
           | onto the other 3 and still have failure resiliency.
           | 
            | For erasure-coded setups, which are not really suited to
            | block storage but mainly to object storage via radosgw (S3)
            | or CephFS, you need a minimum of k+m and realistically k+m+1
            | nodes. That would translate to 6 minimum but realistically 7
            | nodes for k=4,m=2. That's 4 data chunks and 2 redundant
            | chunks, which means you use 1.5x the storage of the raw data
            | (half that of a replicated setup). You can do k=2,m=1 also,
            | so 4 nodes in that case.
        
           | [deleted]
        
         | kllrnohj wrote:
         | I was running glusterfs on an array of ODROID-HC2s (
         | https://www.hardkernel.com/shop/odroid-hc2-home-cloud-two/ )
         | and it was fun, but I've since migrated back to just a single
         | big honking box (specifically a threadripper 1920x running
         | unraid). Monitoring & maintaining an array of systems was its
         | own IT job that kinda didn't seem worth dealing with.
        
           | trhway wrote:
            | Looking at that ODROID-HC2, I wonder when the drive
            | manufacturers will just integrate such a general-purpose
            | computer board onto the drive itself.
        
         | Infernal wrote:
          | I want to preface this: I don't have a strong opinion here
          | already, and I'm curious about Ceph. As someone who runs a
          | 6-drive raidz2 at home (w/ ECC RAM), does your Ceph config give
         | similar data integrity guarantees to ZFS? If so, what are the
         | key points of the config that enable that?
        
           | antongribok wrote:
           | When Ceph migrated from Filestore to Bluestore, that enabled
           | data scrubbing and checksumming for data (older versions
           | before Bluestore were only verifying metadata).
           | 
           | Ceph (by default) does metadata scrubs every 24 hours, and
           | data scrubs (deep-scrub) weekly (configurable, and you can
           | manually scrub individual PGs at any time if that's your
           | thing). I believe the default checksum used is "crc32c", and
           | it's configurable, but I've not played with changing it. At
           | work we get scrub errors on average maybe weekly now, at home
           | I've not had a scrub error yet on this cluster in the past
           | year (I did have a drive that failed and still needs to be
           | replaced).
           | 
           | My RPi setup certainly does not have ECC RAM as far as I'm
           | aware, but neither does my current ZFS setup (also a 6 drive
           | RAIDZ2).
           | 
           | Nothing stopping you from running Ceph on boxes with ECC RAM,
           | we certainly do that at my job.
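            | 
            | For the curious, the relevant knobs look roughly like this
            | (the PG id is just a placeholder, and exact commands may
            | differ a bit between releases):
            | 
            |     ceph config get osd bluestore_csum_type   # crc32c
            | 
            |     # manually scrub / deep-scrub an individual PG
            |     ceph pg scrub 3.0
            |     ceph pg deep-scrub 3.0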
        
         | magicalhippo wrote:
         | If you take say old i7 4770k's, how many of those along with
         | how many disks would you need to get 1GB/s sustained sequential
         | access with Ceph?
         | 
         | My single ZFS box does that with ease, 3x mirrored vdevs = 6
         | disks total, but I'm curious as the flexibility of Ceph sounds
         | tempting.
        
           | antongribok wrote:
            | I just set up a test cluster at work to test this for you:
           | 
           | 4 nodes, each node with 2x SAS SSDs, dual 25Gb NICs (one for
           | front-end, one for back-end replication). The test pool is 3x
           | replicated with Snappy compression enabled.
           | 
            | On a separate client (also with 25Gb) I mapped an RBD image
            | with krbd and ran FIO:
            | 
            |     fio --filename=/dev/rbd1 --direct=1 --sync=1 --rw=write \
            |         --bs=4096K --numjobs=1 --iodepth=16 --ramp_time=5 \
            |         --runtime=60 --ioengine=libaio --time_based \
            |         --group_reporting --name=krbd-test --eta-newline=5s
            | 
            | I get a consistent 1.4 GiB/s:
            | 
            |     write: IOPS=357, BW=1431MiB/s (1501MB/s)(83.9GiB/60036msec)
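            | 
            | A matching sequential read pass would just flip the workload
            | (untested here, otherwise the same parameters):
            | 
            |     fio --filename=/dev/rbd1 --direct=1 --rw=read \
            |         --bs=4096K --numjobs=1 --iodepth=16 --ramp_time=5 \
            |         --runtime=60 --ioengine=libaio --time_based \
            |         --group_reporting --name=krbd-read --eta-newline=5s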
        
         | underwater247 wrote:
         | I would love to hear more about your Ceph setup. Specifically
         | how you are connecting your drives and how many drives per
          | node? I imagine with the Pi's limited USB bus bandwidth, your
         | cluster performs as more of an archive data store compared to
         | realtime read/write like the backing block storage of VMs. I
         | have been wanting to build a Ceph test cluster and it sounds
         | like this type of setup might do the trick.
        
       | aloer wrote:
        | Considering that this is custom made for the CM4 form factor, the
        | Turing Pi with carrier boards looks much more attractive because
        | it's more future-proof. If only it were already available.
       | 
       | It also has SATA and USB 3.0 which is nice
       | 
       | Until I can preorder one I will slowly stock up on CM4s and hope
       | I'll get there before pi5 comes out.
       | 
       | Are there other boards like this?
        
       | mhd wrote:
       | Can I run a Beowulf cluster on this?
        
       | oblak wrote:
       | Credit where it's due - this is some 18 watt awesomeness at idle.
       | Is it more "practical" than doing a Mini-ITX (or smaller, like
       | one of those super small tomtom with up to 5900HX) build and
        | equipping it with one or more NVMe expansion cards? Probably
       | not. But it's cool.
       | 
       | Now, if there were a new Pi to buy. Isn't it time for the 5? It's
        | been 3 years, for most of which they've been hard to find. Mine
       | broke and I really miss it because having a full blown desktop
       | doing little things makes no sense, especially during the summer.
        
         | formerly_proven wrote:
         | 18 W idle is kinda horrible if you just want a small server
         | (granted, this isn't one server, but instead six set-top boxes
         | in one). That's recycled entry-level rack server range, which
         | come with ILOM/BMC. Most old-ish fat clients can do <10 W, some
         | <5 W, no problem. If you want a desktop that consumes little
         | power when idle or not loaded a lot, just get basically any
         | Intel system with an IGP since 4th gen (Haswell). Avoid Ryzen
         | CPU with dGPU if that's your goal; those are gas guzzlers.
        
           | oblak wrote:
           | 1. I would bet at least half of all that wattage is the SSDs.
           | 
           | 2. Buddy, you're spewing BS at someone who used to run a
           | Haswell in a really small Mini-ITX case. It was a fine HTPC
           | back in 2014. But now everything, bar my dead Pi, is some
           | kind of Ryzen. All desktops and laptops. The various
           | 4800u/5800u/6800u and lower parts offer tremendous
           | performance at 15W nominal power levels. The 5800H I am
           | writing this message on is hardly a guzzler, especially when
            | compared to Intel's 11th/12th-gen Core parts.
           | 
           | This random drive-by intel shilling really took me by
           | surprise.
        
             | formerly_proven wrote:
             | > 1. I would bet at least half of all that wattage is the
             | SSDs.
             | 
             | SSDs are really good at consuming nearly nothing when not
             | servicing I/O.
             | 
             | > 2. Buddy, you're spewing BS ... various 4800u/5800u/6800u
             | ... 5800H ...
             | 
             | None of those SKUs are a "Ryzen CPU with dGPU".
        
               | oblak wrote:
               | SSD can get to really low power levels, depending on
               | state. Does not mean they were all in power sipping 10mW
               | mode during measurement.
               | 
               | > just get basically any Intel system with an IGP since
               | 4th gen (Haswell). Avoid Ryzen CPU with dGPU
               | 
               | Was your advice to me. So I took it at face value and
               | compared them, like you suggested, to the relevant
               | models.
        
       | technofiend wrote:
       | I will once again lament the fact that WD Labs built SBCs that
        | sat with their hard drives to make them individual Ceph nodes but
        | never took the hardware to production. It seems to me there's
        | still a market for SBCs that could serve a Ceph OSD on a per-
       | device basis, although with ever increasing density in the
       | storage and hyperconverged space that's probably more of a small
       | business or prosumer solution.
        
       | hinkley wrote:
       | I think there's something to be said for sizing a raspberry pi or
       | a clone to fit into a hard drive slot.
       | 
       | I also think the TuringPi people screwed up with the model 2.
       | Model 2 of a product should not have fewer slots than the
       | predecessor, and in the case of the Turing Pi, orchestrating 4
       | devices is not particularly compelling. It's not that difficult
       | to wire 4 Pi's together by hand. I had 6 clones wired together
       | using risers and powered by my old Anker charging station and an
       | 8 port switch, with a few magnets to hold the whole thing
       | together.
        
       | rabuse wrote:
       | If only Raspberry Pi's weren't damn near $200 now...
        
         | greggyb wrote:
         | Unless you are constrained in space to a single ITX case as in
         | this example, you can get whole x86 machines for <$100 with RAM
         | and storage included.
         | 
         | There is a lot of choice in the <$150 range. You could get
         | eight of these and a cheap 10-port switch for any kind of
         | clustering lab you want to set up.
         | 
         | Here is an example:
         | https://www.aliexpress.com/item/3256804328705784.html?spm=a2...
        
           | dpedu wrote:
            | Same CPU, half the RAM, a quarter the price if you don't mind
            | going used: https://www.ebay.com/itm/154960426458
            | 
            | These are thin clients, but flip an option in the BIOS and
            | it's a regular PC.
        
             | greggyb wrote:
             | Yes. I just figured I would compare new for new. I love
             | eBay for electronics shopping.
        
           | criddell wrote:
           | Would one of the boards from Pine work for this application?
           | 
           | https://pine64.com/product/pine-a64-lts/
        
             | RL_Quine wrote:
             | No, those are nasty slow.
        
           | Asdrubalini wrote:
           | What is the power consumption tho? It easily adds up over
           | time.
        
             | greggyb wrote:
             | The linked machine uses a 2W processor.
             | 
             | The successor product on the company's site uses a 12 volt,
              | 2 amp power adapter:
              | https://www.bee-link.net/products/t4-pro
             | 
             | Here is a YouTube review of the linked model with input
             | power listed at 12 volt, 1.5 amp (link to timestamp of
             | bottom of unit): https://youtu.be/56UA2Uto1ns?t=129
        
             | belval wrote:
             | A low-end x86 CPU will perform better than the RasPis. My
             | current NAS is an Intel G4560 with 40GB of RAM and 4 HDD
             | and it barely does over ~40W on average. The article's
             | cluster does 18W which is better, but even over a year
              | that's only a 192 kWh difference (assuming it runs all the
              | time), which would amount to about $40 at $0.20/kWh.
             | 
             | It's not really worth comparing further as the
              | configurations are significantly different, but if your goal
             | is doing 110MB/s R/W, even when accounting for power
             | consumption the product in the article is much more
             | expensive.
        
               | SparkyMcUnicorn wrote:
               | The HP 290 idles around 10W.
               | 
               | Picked one up off craigslist for ~$50 and use it as a
               | plex transcoder since it has QuickSync and can
               | simultaneously transcode around 20 streams of 1080p
               | content.
        
               | pbhjpbhj wrote:
               | I don't know much about NAS and thought they were just a
               | bundle of drives with some [media] access related apps on
               | a longer cable ... 40G RAM? What's that for, is it normal
               | for a NAS to be so loaded? I was looking at NAS and
               | people were talking about 1G as being standard (which
               | conversely seemed really low).
               | 
               | G4560 suggests you're not processing much, is the NAS
               | caching a lot?
        
               | belval wrote:
               | 40G is purely overkill and is not utilized. Initial build
               | had 8G and then I had 32G lying around so I added it.
               | 
                | 4G is probably enough, though Nextcloud does use a lot of
                | memory for thumbnail generation.
               | 
               | As for the G4560, I can stream 1080p with jellyfin so it
                | packs a surprising punch for its power envelope.
        
               | formerly_proven wrote:
                | Even for mainstream x86 Intel chips, idle power consumption
               | is mostly down to peripherals, power supply (if you build
               | a small NAS that idles on 2-3 W on the 12 V rail and
               | can't pull more than 50 W, don't use a 650 W PSU),
               | cooling fans, and whether someone forgot to enable power
               | management.
        
             | jotm wrote:
             | I hear it's still possible, through heretic magic, to limit
        | a CPU's power draw and, most importantly, it will not affect
             | speed on any level (load will increase).
             | 
             | There's even people selling their souls to the devil for
             | the ability to control the _actual voltage_ of their chips,
             | increasing performance per watt drawn!
             | 
             | But only Gods and top OSS contributors can control the
             | power draw of chips integrated into drives/extension
             | cards/etc
        
         | Rackedup wrote:
         | Adafruit had some in stock a few minutes ago:
         | https://twitter.com/rpilocator ... I think every Wednesday
         | around 11am ... I almost got one this time, but because they
          | had me set up 2FA I couldn't check out in time.
        
         | mmastrac wrote:
         | Is that just on the secondary market(s)? I'm still seeing them
         | available <$100 in various models, but not always in-stock.
        
         | lathiat wrote:
         | Try https://rpilocator.com/ - no promises though.
        
         | snak wrote:
         | Wow, just checked and Pi 3 is over 100EUR, and Pi 4 over
         | 200EUR.
         | 
         | What happened? I remember buying a Pi 3B+ in 2019 for less than
         | 50EUR.
        
       | kordlessagain wrote:
       | pastel-mature-herring~> Is this where compute is going?
       | 
       | awesome-zebra*> There is no definitive answer to this question,
       | as the direction of compute technology is always changing and
       | evolving. However, the trend in recent years has been towards
       | smaller, more powerful devices that are able to pack more
       | processing power into a smaller form factor. The DeskPi Super6c
       | is an example of this trend, as it offers a trim form factor and
       | six individual Raspberry Pi Compute Module 4s, each of which
       | offers a high level of processing power.
        
       | mgarfias wrote:
       | now if only we could get compute modules
        
       | rustdeveloper wrote:
       | This looks really cool! There was a tutorial posted on HN about
        | building a mobile proxy pool with an RPi that had obvious
        | limitations:
        | https://scrapingfish.com/blog/byo-mobile-proxy-for-web-scrap...
        | It seems this could be a solution to scale the capabilities of a
        | single RPi.
        
       | marcodiego wrote:
       | It is a shame we have nothing as simple as the old OpenMOSIX.
        
       | sschueller wrote:
        | If you're trying to find a Pi you can try the Telegram bot I
        | made for rpilocator.com. It will notify you as soon as there is
        | stock, with filters for specific Pis and your location/preferred
        | vendor.
       | 
       | The bot is here: https://t.me/rpilocating_bot
       | 
       | source: https://github.com/sschueller/rpilocatorbot
        
       | 3np wrote:
       | This could be neat for a 3xnomad + 3xvault cluster. Just add
       | runners and an LB.
        
       | marshray wrote:
       | Quite the bold design choice to put the removable storage on the
       | underside of the motherboard.
        
       | pishpash wrote:
       | 18W at idle seems like a lot of power.
        
         | justsomehnguy wrote:
          | Divide it across six Pis, six NVMe drives, and one switch.
        
           | pishpash wrote:
           | Why divide? You don't divide by how many cores are on a
           | regular PC, to which this has comparable power.
        
       | cptnapalm wrote:
       | Oh my God, I want this. I have no use for it, whatsoever, but oh
       | my God I want it anyway.
        
       | cosmiccatnap wrote:
       | I love fun projects like this. I would love to know if I could
        | make one a router since the NIC has two ports.
       | 
        | I have a PowerEdge now which works fine, but it's nowhere close
        | to 20 watts and I barely use a quarter of its CPU and memory.
        
       | antod wrote:
       | Is it bad that my first thought was "Imagine a Beowulf cluster of
       | these..."
        
       ___________________________________________________________________
       (page generated 2022-08-17 23:01 UTC)