[HN Gopher] Seagate Creates an NVMe Hard Disk Drive
___________________________________________________________________

Seagate Creates an NVMe Hard Disk Drive

Author : drewrem11
Score  : 64 points
Date   : 2021-11-13 12:56 UTC (1 day ago)

(HTM) web link (www.pcmag.com)
(TXT) w3m dump (www.pcmag.com)

| joenathanone wrote:
| >"Hence, using the faster NVME protocol may seem rather
| pointless."
|
| Isn't it the interface that is faster and not the protocol? PCIe
| vs SATA
|
| Edit: after reading more, this article is littered with
| inaccuracies
| wtallis wrote:
| It's both. Basic things like submitting a command to the drive
| require fewer round trips with NVMe than with AHCI+SATA,
| allowing for lower latency and lower CPU overhead. But the raw
| throughput advantage of multiple lanes of PCIe, each running at
| 8Gbps or higher, compared to a single SATA link at 6Gbps is far
| more noticeable.
| joenathanone wrote:
| I get that, but with NVMe being designed from the ground up
| specifically for SSDs, wouldn't using it for an HDD present
| extra overhead for the controller to deal with an HDD,
| negating any theoretical protocol advantages?
| wtallis wrote:
| NVMe as originally conceived was still based around the
| block storage abstraction implemented by hard drives. Any
| SSD you can buy at retail is still fundamentally emulating
| classic hard drive behavior, with some optional extra
| functionality to allow the host and drive to cooperate
| better (e.g. Trim/Deallocate). But out of the box, you're
| still dealing with reading and writing to 512-byte LBAs, so
| there's not actually much that needs to be added back in to
| make NVMe work well for hard drives.
|
| The low-level advantages of NVMe 1.0 were mostly about
| reducing overhead and improving scalability in ways that
| were not strictly necessary when dealing with mechanical
| storage and were not possible without breaking compatibility
| with old storage interfaces. Nothing about, e.g., the command
| submission and completion queue structures inherently favors
| SSDs over hard drives, except that allowing multiple queues
| per drive, each supporting queue lengths of hundreds or
| thousands of commands, is a bit silly in the context of a
| single hard drive (because you never actually want the OS to
| enqueue 18 hours worth of IO at once).
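|
| To put rough numbers on that "18 hours" figure, here's a quick
| sketch (the queue limits are the NVMe spec maximums; the IOPS
| value is an assumed ballpark for a 7200rpm drive doing random
| reads):
|
|     # Sketch: how long one hard drive would take to drain fully
|     # loaded NVMe queues. Illustrative numbers only.
|     MAX_QUEUE_ENTRIES = 65_536   # NVMe limit per I/O queue
|     HDD_RANDOM_IOPS = 150        # assumed ballpark for a 7200rpm HDD
|
|     for num_queues in (1, 16, 128):
|         commands = num_queues * MAX_QUEUE_ENTRIES
|         hours = commands / HDD_RANDOM_IOPS / 3600
|         print(f"{num_queues:3d} queues x {MAX_QUEUE_ENTRIES} entries"
|               f" = {commands:,} commands ~ {hours:.1f} hours of IO")
|
| Even a modest number of maximally deep queues works out to many
| hours of backlog at hard drive speeds.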
| londons_explore wrote:
| > because you never actually want the OS to enqueue 18 hours
| worth of IO at once
|
| As a thought experiment, I think there _are_ use cases for
| this kind of thing for a hard drive.
|
| The very nature of a hard drive is that sometimes accessing
| certain data happens to be very cheap - for example, if the
| head just happens to pass over a block of data on the way to
| another block of data I asked to read. In that case, the
| first read was 'free'.
|
| If the drive API could represent this, then very low priority
| operations, like reading and compressing dormant data,
| defragmentation, error checking existing data, rebuilding
| RAID arrays etc. might benefit from such a long queue. Pretty
| much, a super long queue of "read this data only if you can
| do so without delaying the actual high priority queue".
| wtallis wrote:
| When a drive only has one actuator for all of the heads,
| there's only a little bit of throughput to be gained from
| Native Command Queueing, and that only requires a dozen or
| so commands in the queue. What you're suggesting goes a
| little further than just plain NCQ, but I'd be surprised if
| it could yield more than another 5% throughput increase even
| in the absence of high-priority commands.
|
| But the big problem with having the drive's queue contain a
| full second or more worth of work (let alone the _hours_
| possible with NVMe at hard drive speeds) is that you start
| needing the ability to cancel, re-order or re-prioritize
| commands that have already been sent to the drive, unless
| you're working in an environment with absolutely no QoS
| targets whatsoever. The drive is the right place for
| scheduling IO at the millisecond scale, but over longer time
| horizons it's better to leave things to the OS, which may be
| able to fulfill a request using a different drive in the
| array, or provide some feedback/backpressure to the
| application, or simply have more memory available for
| buffering and combining operations.
| [deleted]
| vmception wrote:
| What other connectors are coming down the pipeline or may
| currently be in the draft specification phase?
|
| Admittedly, I totally did not know NVMe drives were becoming a
| thing until a year ago, as I had been in the laptop-only space
| for a while or didn't need to optimize storage speed when
| connecting an existing drive to a secondhand motherboard.
|
| I like being ahead of the curve and am now curious what's next
| ksec wrote:
| > I like being ahead of the curve and am now curious what's
| next
|
| Others correct me if I am wrong.
|
| NVMe in itself is an interface specification; people often use
| the term NVMe when they mean M.2, the connector.
|
| You won't get a new connector in the pipeline. But M.2 is
| essentially just 4 lanes of PCI Express, so every PCI Express
| update brings a speed bump: currently 4.0, with 5.0 around the
| corner, 6.0 in final draft, and 7.0 possibly within the next
| 3-4 years. So you can expect 14GB/s SSDs soon, 28GB/s in ~2024,
| and 50GB/s within this decade, assuming we can somehow get SSD
| controller power usage down to a reasonable level.
| vmception wrote:
| Hmm, insightful. Yes, I recall noticing that
| connector/specification thing when I was trying to get the
| right size cards; I had figured NVMe and M.2 were just
| synonyms, but I see the cause for the correlation now.
|
| So the NVMe card I added to an old motherboard's PCI-E slot
| is really just PCI-E on PCI-E? yo dawg
| wtallis wrote:
| > So the NVMe card I added to an old motherboard's PCI-E
| slot is really just PCI-E on PCI-E?
|
| Assuming that's a PCIe to M.2 adapter card, it's just
| rearranging the wires to a more compact connector. There's
| no nesting or layering of any protocols. Electrically,
| nothing changed about how the PCIe signals are carried
| (though M.2 lacks the 12V power supply), and either way you
| have NVMe commands encapsulated in PCIe packets (exactly
| analogous to how your network may have IP packets
| encapsulated in Ethernet frames).
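|
| For scale, the raw link-rate arithmetic behind the SATA-vs-PCIe
| and per-generation figures upthread, as a rough sketch (the
| efficiencies are the nominal line-code numbers; gen6/7 move to
| PAM4 with FEC, so those entries are approximate):
|
|     # Sketch: theoretical payload bandwidth per link, before any
|     # protocol overhead. Entries are (GT/s, line-code efficiency).
|     links = {
|         "SATA 3.0":    (6,       8 / 10),     # 8b/10b encoding
|         "PCIe 3.0 x1": (8,       128 / 130),
|         "PCIe 3.0 x4": (8 * 4,   128 / 130),
|         "PCIe 4.0 x4": (16 * 4,  128 / 130),
|         "PCIe 5.0 x4": (32 * 4,  128 / 130),
|         "PCIe 6.0 x4": (64 * 4,  1.0),        # FEC/flit overhead ignored
|         "PCIe 7.0 x4": (128 * 4, 1.0),        # projected
|     }
|     for name, (gt_per_s, efficiency) in links.items():
|         print(f"{name:12s} ~ {gt_per_s * efficiency / 8:5.2f} GB/s")
|
| That lines up with the ~14/28/50 GB/s expectations for x4 drives
| on gen5/6/7 mentioned above, and shows why even one PCIe lane
| beats a whole SATA link.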
| wtallis wrote:
| In the server space, SAS, U.2 and U.3 connectors are
| mechanically compatible with each other and partially
| compatible with SATA connectors. U.3 is probably the dead end
| for that family, but they won't disappear completely for a
| long time.
|
| Traditional PCIe add-in cards (PCIe CEM connector) are still
| around and also not going to be disappearing anytime soon, but
| are in decline as many use cases have switched over to other
| connectors and form factors, particularly for the sake of
| better hot-swap support.
|
| M.2 (primarily SSDs) is in rapid decline in the server space.
| It may hang on for a while for boot drives, but for everything
| else you want hot-swap and better power delivery (12V rather
| than 3.3V).
|
| The up-and-coming connector is SFF-TA-1002, used in the EDSFF
| form factors and a few other applications like the latest OCP
| NIC form factor. Its smaller configurations are only a bit
| larger than M.2, and the wider versions are quite a bit denser
| than regular PCIe add-in card slots. EDSFF provides a variety
| of form factors suitable for 1U and 2U servers, replacing 2.5"
| drives. The SFF-TA-1002 connector can also be used as a direct
| replacement for the PCIe CEM connector, but I'm not sure if
| that's actually going to happen anytime soon.
|
| I haven't seen any sign that EDSFF or SFF-TA-1002 will be
| showing up in consumer systems. Existing solutions like M.2
| and PCIe CEM are good enough for now and the foreseeable
| future. The older connectors sometimes need to have tolerances
| tightened up to support higher signal rates, but so far a
| backwards-incompatible redesign hasn't been necessary to
| support newer generations of PCIe (though the practical
| distances usable without retimers have been decreasing).
| h2odragon wrote:
| Disks have had complex CPUs on them for a while; might as well
| go full mainframe, admit they're smart storage subsystems, and
| put them on the first class bus. Is "DASD" still an IBM
| trademark?
|
| Of course there's a long history of "multiple interface"
| drives, which are always ugly hacks that turn up as rare
| collectors' items and examples of boondoggle.
| DaiPlusPlus wrote:
| > and admit they're smart storage subsystems and put them on
| the first class bus. Is "DASD" still an IBM trademark?
|
| Y'know that eventually we'll be running everything off Intel's
| Optane non-volatile RAM: we won't have block-addressable
| storage anymore, _everything_ will be directly
| byte-addressable. All of the storage abstractions that have
| popped up over the past decades (tracks, heads, cylinders, ew,
| block-addressing, unnecessarily large block sizes, etc.) will
| be obsolete because we'll already have _perfect-storage_.
|
| It's not quite DASD, but it's much better.
| trasz wrote:
| Optane didn't exactly take the market by storm. Also: flash
| memory doesn't work this way; it's inherently organized into
| blocks/pages.
| DaiPlusPlus wrote:
| The "Optane" that Intel released as their 3DXPoint NVMe
| brand (and quietly withdrew recently) isn't the same Optane
| as their byte-addressable non-volatile storage+RAM combo. It
| isn't flash memory with blocks/pages; it really is
| byte-addressable:
| https://dl.acm.org/doi/10.1145/3357526.3357568
|
| "True" Optane hasn't taken over the scene because, to my
| knowledge, there's no commercially supported OS that's built
| around a unified memory model (heck, why not have a single
| memory space?) for both storage and memory.
|
| We can't even write software that can just reanimate itself
| from a process image that would be automagically persisted
| in the unified storage model. We've got a long way to go
| before we'll see operating systems and applications that
| take advantage of it.
| spijdar wrote:
| But RAM is itself accessed in blocks. The process is hidden
| from software, but memory is always fetched in word-aligned
| blocks.
| It doesn't contradict your point, but just pointing out that
| even DRAM is pulled in chunks not unlike traditional drives
| (if you squint).
|
| (Of course, getting those chunks down to cache line sizes
| does open up a lot of possibilities...)
| trasz wrote:
| Well, yeah, the cacheline can be considered a kind of 64-byte
| block. But that's not because of how RAM works - you could
| access DRAM in words if you wished to; it's just that it
| doesn't make sense because of the CPU cache. For flash, the
| blocks (and pages) are inherent to its design and there is no
| way around it.
|
| Also, the RAM "block size" is 64B, while for flash it's more
| like 4kB. The CPU cache will "deblock" the 64B blocks, but it
| can't efficiently do that for 4kB ones.
|
| And then there's the speed. Does replacing PCIe with a memory
| bus actually make a performance difference that's measurable
| given flash latency?
| rlkf wrote:
| > commercially supported OS that's built around a unified
| memory model
|
| Doesn't OS/400 work that way? (Of course then there is the
| question to which degree "commercially" should imply "readily
| open to third-party software and hardware vendors".)
| t0mas88 wrote:
| NVMe also does away with the controller knowing about the disk
| layout and addressing, which may make sense for future disks
| and ever-increasing cache sizes. At some point you probably
| want to put all the logic in the drive itself to optimise it
| (as SSDs already do).
| trasz wrote:
| SCSI already got rid of knowledge of disk layout/addressing
| some 30 years ago.
| wmf wrote:
| Hasn't LBA been around since 1990 or so?
| wtallis wrote:
| Even the old cylinder/head/sector addressing almost never
| matched the real hard drive geometry, because the
| floppy-oriented BIOS CHS routines used a different number of
| bits for each address component than ATA did, so translation
| was required even for hard drives with capacities in the
| hundreds of MB.
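|
| As a quick sketch of that mismatch: the classic CHS-to-LBA
| mapping, and how the BIOS INT 13h and ATA field limits multiply
| out to the old ~504 MiB ceiling that made translation necessary
| (the limits below are the well-known field widths, not any
| particular drive's real geometry):
|
|     # Sketch: classic CHS -> LBA mapping and the untranslated limit.
|     SECTOR = 512  # bytes
|
|     def chs_to_lba(c, h, s, heads_per_cyl, sectors_per_track):
|         # Sector numbering starts at 1, hence the (s - 1).
|         return (c * heads_per_cyl + h) * sectors_per_track + (s - 1)
|
|     bios = dict(cylinders=1024, heads=256, sectors=63)   # INT 13h limits
|     ata = dict(cylinders=65536, heads=16, sectors=255)   # ATA CHS limits
|
|     # Without translation you are stuck with the smaller of each field:
|     naive = {k: min(bios[k], ata[k]) for k in bios}
|     total = naive["cylinders"] * naive["heads"] * naive["sectors"]
|     print(f"untranslated limit ~ {total * SECTOR / 2**20:.0f} MiB")
|     print("last LBA:", chs_to_lba(1023, 15, 63,
|                                    naive["heads"], naive["sectors"]))
|
| Once drives outgrew that, the firmware had to present a fake
| geometry anyway, which is why the reported CHS values stopped
| meaning anything physical.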
| oneplane wrote:
| Indeed, ideally the controller or HBA should really just
| provide the fabric and nothing more. A bit like Thunderbolt
| and USB4.
| oneplane wrote:
| Like the article explains, it makes sense if you can do this
| and reduce the number of now-'legacy' interfaces.
|
| We used to do IDE emulation over SATA for a while, then we got
| AHCI over SATA and AHCI over other fabrics. It makes sense to
| stop carrying all the legacy loads all the time. For people
| who really need it for compatibility reasons, we still have
| the vortex86-style solutions that generally fit the bill, as
| well as integrated controllers that do PCIe-to-PCI bridging
| with a classic IDE controller attached to that converted PCI
| bus. Options will stay, but cutting legacy by default makes
| sense to me. Except UART of course, that can stay forever.
|
| Edit: I stand corrected. AHCI (as the name implies: Advanced
| Host Controller Interface) is for the communication up to the
| controller. Essentially, ATA commands continue to be sent to
| the drive; the difference is whether the controller is
| commanded in IDE or AHCI mode. This is also why the controller
| then needs to know about the drive, whereas an NVMe controller
| doesn't, because it no longer needs to understand the ATA
| protocol (as posted here elsewhere).
| masklinn wrote:
| > then we got AHCI over SATA
|
| AHCI is the native SATA mode; AFAIK explicit mentions of AHCI
| mostly concern non-SATA interfaces (usually M.2, because an
| M.2 SSD can be SATA, AHCI over PCIe, or NVMe).
| wtallis wrote:
| AHCI is the driver protocol used for communication between
| the CPU/OS and the SATA host bus adapter. It stops there, and
| nothing traveling over the SATA cable can be correctly called
| AHCI.
|
| You can have fully standard SATA communication happening
| between a SATA drive and a SATA-compatible SAS HBA that uses
| a proprietary non-AHCI driver interface, and from the drive's
| end of the SATA cable this situation is completely
| indistinguishable from using a normal SATA HBA.
|
| Likewise, you can have AHCI communication to a PCIe SSD and
| the OS will _think_ it's talking to a single-port SATA HBA,
| with the peculiarity that sustained transfers in excess of
| 6Gbps are possible.
| AussieWog93 wrote:
| > It stops there, and nothing traveling over the SATA cable
| can be correctly called AHCI.
|
| Informally, a lot of BIOSes (another informality there!)
| would give you the choice back in the day between IDE or
| AHCI when it came to communicating with the drive.
|
| I think this is where most of the confusion came from.
| wtallis wrote:
| That choice was largely about which drivers your OS included.
| The IDE compatibility mode meant you could use an older OS
| that didn't include an AHCI driver. The toggle didn't
| actually change anything about the communication between the
| chipset and the drive, but _did_ change how the OS saw the
| storage controller built into the chipset on the motherboard.
|
| (Later, a third option appeared for proprietary RAID modes
| that required vendor-specific drivers, and that eventually
| led to more insanity and user misunderstandings when NVMe
| came onto the scene.)
| KingMachiavelli wrote:
| Would this reduce the need to have expensive and power-hungry
| RAID/HBA cards? I would assume splitting NVMe/PCIe is a lot
| simpler than PCIe to SATA.
| toast0 wrote:
| Seems like it. The article mentions a PCIe switch, but PCIe
| bifurcation may also be an option. (That's splitting a
| multiple-lane slot into multiple slots; it requires system
| firmware support though.)
| formerly_proven wrote:
| Bifurcation has never been a thing on desktop platforms and
| even most entry-level (single socket, desktop-equivalent)
| servers don't support it. It seems to be reserved for HEDT
| and real server platforms. (This is of course purely a market
| segmentation decision by Intel/AMD.)
| wtallis wrote:
| PCIe bifurcation works fine on AMD consumer platforms. I've
| used a passive quad-M.2 riser card on an AMD B550 motherboard
| with no trouble other than changing a firmware setting to
| turn on bifurcation. It's only Intel that is strict about
| this aspect of product segmentation.
| toast0 wrote:
| My A520 mini-ITX board supports it; can't get any more
| desktop than that. Although it has limited options: I think I
| can do either 2 x8, or one x8 and two x4. For this, it looks
| like each drive is expected to be x1, so you'd want one x16
| split into 16 x1s. It's doable, but not without mucking about
| in the firmware (either by the OEM, or dedicated
| enthusiasts), so a PCIe switch is probably advisable.
| formerly_proven wrote:
| Ah I see. It seems like previously bifurcation was not
| qualified for anything but the X-series chipset, but in the
| 500 series it's qualified for all.
| On top of that, it seems like some boards just allowed it
| regardless in prior generations.
|
| Another complication is of course that the non-PEG slots on
| the (non-HEDT) platforms are usually electrically only x4 or
| x1, so bifurcation really only makes sense in the PEG.
| wtallis wrote:
| PCIe root ports in CPUs are generally designed to provide an
| x16 with bifurcation down to x4x4x4x4 (or merely x8x4x4 for
| Intel consumer CPUs). Large-scale PCIe switches also commonly
| support bifurcation only down to x4 or sometimes x2, though
| x1 may start catching on with PCIe gen5.
|
| Smaller PCIe switches and motherboard chipsets usually
| support link widths from x4 down to x1. Treating each lane
| individually goes hand in hand with the fact that many of the
| lanes provided by a motherboard chipset can be reconfigured
| between some combination of PCIe, SATA and USB: they design a
| multi-purpose PHY, put down one or two dozen copies of it at
| the perimeter of the die, and connect them to an appropriate
| variety of MACs.
| formerly_proven wrote:
| Yeah, but what I meant above was that if you only have an x4
| slot electrically, sticking in an x16 -> 4x M.2 riser isn't
| going to do a whole lot, because the 12 lanes of 3 out of 4
| slots aren't hooked up to anything. So in this scenario you'd
| really want a riser with a switch in it instead (which are
| more expensive than almost all motherboards).
|
| So on the consumer platforms that give you two PEGs, the best
| you could do while still having a GPU is to stick that riser
| in the second PEG and use the x8/x8 split. Now the question
| becomes whether the UEFI allows you to use the x8/x8
| bifurcation meant for dual GPU or similar use in an
| x8/(x4+x4) triple bifurcation kind of setup.
|
| Realistically this entire thing just doesn't make a lot of
| sense on the consumer platforms because they just don't have
| enough PCIe lanes out of the CPU. Intel used to be slightly
| worse here with 20 (of which 4 are reserved for the PCH, but
| you know that), while AM4 has 28 (4 for the I/O hub again).
| On an HEDT platform with 40+ lanes, though...
|
| (When I say bifurcation I mean bifurcation of a slot on the
| mainboard, not the various ways the slots and ports on the
| board itself can be configured, though that's technically
| bifurcation as well - or even switching protocols.)
| wtallis wrote:
| > Yeah, but what I meant above was that if you only have an
| x4 slot electrically, sticking in an x16 -> 4x M.2 riser
| isn't going to do a whole lot, because the 12 lanes of 3 out
| of 4 slots aren't hooked up to anything. So in this scenario
| you'd really want a riser with a switch in it instead (which
| are more expensive than almost all motherboards).
|
| True; but given how PCIe speeds are no longer stalled, we may
| soon see motherboards offering an x4 slot that can be
| operated as x1x1x1x1. Currently, the only risers you're
| likely to find that split a slot into independent x1 ports
| are intended for crypto mining, and they require switches. A
| passive (or retimer-only) quad-M.2 riser that only provides
| one PCIe lane per drive currently sounds a bit
| bandwidth-starved and wouldn't work with current
| motherboards. But given PCIe gen5 SSDs or widespread
| availability of PCIe-native hard drives, those uses for an x4
| slot will start to make sense.
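|
| Rough per-lane math behind that, as a sketch (the drive
| throughput figures are assumed ballparks, not measurements):
|
|     # Sketch: payload bandwidth of one PCIe lane per generation vs.
|     # what a single hard drive can sustain. HDD figures are assumed.
|     PCIE_GTS = {"gen3": 8, "gen4": 16, "gen5": 32}   # GT/s per lane
|     ENCODING = 128 / 130                             # 128b/130b
|     HDD_MBPS = {"single-actuator HDD": 280,          # assumed
|                 "dual-actuator HDD": 550}            # assumed
|
|     for gen, gts in PCIE_GTS.items():
|         lane_mbps = gts * ENCODING * 1000 / 8
|         for drive, need in HDD_MBPS.items():
|             print(f"{gen} x1 ({lane_mbps:4.0f} MB/s) vs {drive}"
|                   f" ({need} MB/s): {lane_mbps / need:.1f}x headroom")
|
| Even a single gen3 lane is several times what one actuator can
| stream, so x1 per hard drive costs essentially nothing, while
| for SSDs a single lane only starts to look reasonable around
| gen5-class lane rates.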
| wtallis wrote:
| PCIe switches are a lot simpler and more standardized than
| RAID and HBA controllers, but I'm not sure they're any cheaper
| for similar bandwidth and port counts. Broadcom/Avago/PLX and
| Microchip/Microsemi are the only two vendors for large-scale
| current-generation PCIe switches, and starting with PCIe gen3
| they decided to price them _way_ out of the consumer market,
| contributing to the disappearance of multi-GPU from gaming PCs.
| inetknght wrote:
| Does this have the same reliability as the rest of Seagate's
| lineup? Or is this actually something that isn't a ripoff?
| wmf wrote:
| Obligatory reminder that there are only three hard disk
| vendors and all of them have made bad drives at one time.
| thijsvandien wrote:
| To be specific: Seagate, Toshiba and Western Digital.
| mastax wrote:
| Could you use multiple NVMe namespaces to represent the
| separate actuators in a multiple-actuator drive? Would there
| be a benefit? Do different namespaces get separate command
| queues or whatever?
| wtallis wrote:
| NVMe supports multiple queues even for a single namespace, and
| multiple queues are used more for efficiency on the software
| side (one queue per CPU core) than for exposing the
| parallelism of the storage hardware.
|
| There are several NVMe features intended to expose some
| information about the underlying segmentation and allocation
| of the storage media, for the sake of QoS. Since current
| multi-actuator drives merely have multiple actuators per
| spindle but still only one head per platter, this split could
| be exposed as separate namespaces, separate NVM Sets, or
| separate Endurance Groups. If we ever see multiple heads per
| platter come back (a la Conner Chinook), that would be best
| abstracted with multiple queues.
___________________________________________________________________
(page generated 2021-11-14 23:00 UTC)