[HN Gopher] Seagate Creates an NVMe Hard Disk Drive
       ___________________________________________________________________
        
       Seagate Creates an NVMe Hard Disk Drive
        
       Author : drewrem11
       Score  : 64 points
       Date   : 2021-11-13 12:56 UTC (1 day ago)
        
 (HTM) web link (www.pcmag.com)
 (TXT) w3m dump (www.pcmag.com)
        
       | joenathanone wrote:
       | >"Hence, using the faster NVME protocol may seem rather
       | pointless."
       | 
       | Isn't it the interface that is faster and not the protocol? PCIe
       | vs SATA
       | 
       | Edit: after reading more, this article is littered with
       | inaccuracies
        
         | wtallis wrote:
          | It's both. Basic operations like submitting a command to the
          | drive require fewer round trips with NVMe than with AHCI+SATA,
          | allowing for lower latency and lower CPU overhead. But the raw
          | throughput advantage of multiple PCIe lanes, each running at
          | 8Gbps or higher, over a single SATA link at 6Gbps is far more
          | noticeable.
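          | 
          | As a rough back-of-the-envelope (a minimal sketch in Python;
          | the encoding overheads are the published figures, everything
          | else is simplified):
          | 
          |   # Effective throughput after line encoding overhead.
          |   SATA3_GBPS = 6.0            # 8b/10b encoding
          |   PCIE3_GBPS_PER_LANE = 8.0   # 128b/130b encoding
          | 
          |   sata_mbs = SATA3_GBPS * (8 / 10) / 8 * 1000
          |   pcie_x4_mbs = (PCIE3_GBPS_PER_LANE
          |                  * (128 / 130) / 8 * 1000 * 4)
          | 
          |   print(f"SATA 3:      ~{sata_mbs:.0f} MB/s")     # ~600
          |   print(f"PCIe 3.0 x4: ~{pcie_x4_mbs:.0f} MB/s")  # ~3938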
        
           | joenathanone wrote:
            | I get that, but with NVMe being designed from the ground up
            | specifically for SSDs, wouldn't using it for an HDD present
            | extra overhead for the controller, negating any theoretical
            | protocol advantages?
        
             | wtallis wrote:
             | NVMe as originally conceived was still based around the
             | block storage abstraction implemented by hard drives. Any
             | SSD you can buy at retail is still fundamentally emulating
             | classic hard drive behavior, with some optional extra
             | functionality to allow the host and drive to cooperate
              | better (e.g. Trim/Deallocate). But out of the box, you're
             | still dealing with reading and writing to 512-byte LBAs, so
             | there's not actually much that needs to be added back in to
             | make NVMe work well for hard drives.
             | 
             | The low-level advantages of NVMe 1.0 were mostly about
             | reducing overhead and improving scalability in ways that
             | were not strictly necessary when dealing with mechanical
             | storage and were not possible without breaking
              | compatibility with old storage interfaces. Nothing about,
              | e.g., the command submission and completion queue
              | structures inherently favors SSDs over hard drives, except
              | that allowing multiple queues per drive, each supporting
              | queue lengths of hundreds or thousands of commands, is a
              | bit silly
             | in the context of a single hard drive (because you never
             | actually want the OS to enqueue 18 hours worth of IO at
             | once).
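              | 
              | A quick sanity check on that figure (a minimal sketch;
              | the queue count and the IOPS number are assumptions,
              | not anything NVMe mandates):
              | 
              |   # All figures below are assumed, not spec-mandated.
              |   QUEUES = 128     # NVMe allows up to 65535 queues
              |   ENTRIES = 65536  # max entries per queue
              |   HDD_IOPS = 150   # typical 7200rpm random reads
              | 
              |   outstanding = QUEUES * ENTRIES
              |   hours = outstanding / HDD_IOPS / 3600
              |   print(f"{outstanding} queued commands ->"
              |         f" ~{hours:.0f} hours of seeking")   # ~16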
        
               | londons_explore wrote:
               | > because you never actually want the OS to enqueue 18
               | hours worth of IO at once
               | 
                | As a thought experiment, I think there _are_ use cases
                | for this kind of thing for a hard drive.
               | 
               | The very nature of a hard drive is that sometimes
               | accessing certain data happens to be very cheap - for
               | example, if the head just happens to pass over a block of
               | data on the way to another block of data I asked to read.
               | In that case, the first read was 'free'.
               | 
                | If the drive API could represent this, then very low-
                | priority operations, like reading and compressing dormant
                | data, defragmenting, error-checking existing data, and
                | rebuilding RAID arrays, might benefit from such a long
                | queue. Pretty much a super-long queue of "read this data
                | only if you can do so without delaying the actual high-
                | priority queue".
        
               | wtallis wrote:
               | When a drive only has one actuator for all of the heads,
               | there's only a little bit of throughput to be gained from
               | Native Command Queueing, and that only requires a dozen
               | or so commands in the queue. What you're suggesting goes
               | a little further than just plain NCQ, but I'd be
               | surprised if it could yield more than another 5%
               | throughput increase even in the absence of high-priority
               | commands.
               | 
               | But the big problem with having the drive's queue contain
               | a full second or more worth of work (let alone the
               | _hours_ possible with NVMe at hard drive speeds) is that
                | you start needing the ability to cancel or re-order/re-
               | prioritize commands that have already been sent to the
               | drive, unless you're working in an environment with
               | absolutely no QoS targets whatsoever. The drive is the
               | right place for scheduling IO at the millisecond scale,
               | but over longer time horizons it's better to leave things
               | to the OS, which may be able to fulfill a request using a
               | different drive in the array, or provide some
               | feedback/backpressure to the application, or simply have
               | more memory available for buffering and combining
               | operations.
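                | 
                | As a toy illustration of that split (purely a sketch
                | of the idea, not any real scheduler): the host keeps
                | the drive's queue shallow and only feeds it background
                | work when nothing high-priority is waiting.
                | 
                |   from collections import deque
                | 
                |   DRIVE_QUEUE_DEPTH = 32   # assumed shallow limit
                | 
                |   def dispatch(high, background, in_flight):
                |       # High-priority first; background only fills
                |       # whatever room is left.
                |       while len(in_flight) < DRIVE_QUEUE_DEPTH:
                |           if high:
                |               in_flight.append(high.popleft())
                |           elif background:
                |               in_flight.append(background.popleft())
                |           else:
                |               break
                | 
                |   high = deque(["read A", "read B"])
                |   bg = deque(f"scrub {i}" for i in range(100))
                |   q = []
                |   dispatch(high, bg, q)
                |   print(len(q), q[:3])   # 32 ['read A', ...]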
        
         | [deleted]
        
       | vmception wrote:
        | What other connectors are coming down the pipeline or may
        | currently be in the draft specification phase?
        | 
        | Admittedly, I totally did not know NVMe drives were becoming a
        | thing until a year ago, as I had been in the laptop-only space
        | for a while or didn't need to optimize storage speed when
        | connecting an existing drive to a secondhand motherboard.
        | 
        | I like being ahead of the curve and am now curious what's next
        
         | ksec wrote:
          | >I like being ahead of the curve and am now curious what's next
          | 
          | Others correct me if I am wrong.
          | 
          | NVMe in itself is an interface specification; people often use
          | the term NVMe when they mean M.2, the connector.
          | 
          | You won't get a new connector in the pipeline. But M.2 is
          | essentially just four lanes of PCIe, so bandwidth goes up
          | every time PCIe does: 4.0 is current, 5.0 is around the
          | corner, 6.0 is in final draft, and 7.0 is possibly within
          | the next 3-4 years. So you can expect 14GB/s SSDs soon,
          | 28GB/s in ~2024, and 50GB/s within this decade, assuming we
          | can somehow get SSD controller power usage down to a
          | reasonable level.
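          | 
          | Rough per-generation ceilings for an x4 link (a minimal
          | sketch; raw link rates per the PCIe specs, ignoring encoding
          | and protocol overhead, so real SSDs land a bit lower):
          | 
          |   # GT/s per lane for each PCIe generation.
          |   gts = {"3.0": 8, "4.0": 16, "5.0": 32,
          |          "6.0": 64, "7.0": 128}
          |   LANES = 4
          |   for gen, rate in gts.items():
          |       gbs = rate * LANES / 8   # 1 bit per transfer
          |       print(f"PCIe {gen} x4: ~{gbs:.0f} GB/s")
          |   # 3.0: 4, 4.0: 8, 5.0: 16, 6.0: 32, 7.0: 64 GB/s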
        
           | vmception wrote:
            | Hmm, insightful. Yes, I recall noticing that
            | connector/specification thing when I was trying to get the
            | right size of card; I had figured NVMe and M.2 were just
            | synonyms, but I see the cause for the correlation now.
            | 
            | So the NVMe card I added to an old motherboard's PCI-E slot
            | is really just PCI-E on PCI-E? yo dawg
        
             | wtallis wrote:
             | > So the NVMe card I added to an old motherboard's PCI-E
             | slot is really just PCI-E on PCI-E?
             | 
             | Assuming that's a PCIe to M.2 adapter card, it's just
             | rearranging the wires to a more compact connector. There's
             | no nesting or layering of any protocols. Electrically,
             | nothing changed about how the PCIe signals are carried
             | (though M.2 lacks the 12V power supply), and either way you
             | have NVMe commands encapsulated in PCIe packets (exactly
             | analogous to how your network may have IP packets
             | encapsulated in Ethernet frames).
        
         | wtallis wrote:
         | In the server space, SAS, U.2 and U.3 connectors are
         | mechanically compatible with each other and partially
         | compatible with SATA connectors. U.3 is probably the dead end
         | for that family, but they won't disappear completely for a long
         | time.
         | 
         | Traditional PCIe add-in cards (PCIe CEM connector) are still
         | around and also not going to be disappearing anytime soon, but
         | are in decline as many use cases have switched over to other
         | connectors and form factors, particularly for the sake of
         | better hot-swap support.
         | 
         | M.2 (primarily SSDs) is in rapid decline in the server space.
         | It may hang on for a while for boot drives, but for everything
         | else you want hot-swap and better power delivery (12V rather
         | than 3.3V).
         | 
          | The up-and-coming connector is SFF-TA-1002, used in the EDSFF
         | form factors and a few other applications like the latest OCP
         | NIC form factor. Its smaller configurations are only a bit
         | larger than M.2, and the wider versions are quite a bit denser
         | than regular PCIe add-in card slots. EDSFF provides a variety
         | of form factors suitable for 1U and 2U servers, replacing 2.5"
          | drives. The SFF-TA-1002 connector can also be used as a direct
          | replacement for the PCIe CEM connector, but I'm not sure if
         | that's actually going to happen anytime soon.
         | 
         | I haven't seen any sign that EDSFF or SFF-TA-1002 will be
         | showing up in consumer systems. Existing solutions like M.2 and
         | PCIe CEM are good enough for now and the foreseeable future.
         | The older connectors sometimes need to have tolerances
         | tightened up to support higher signal rates, but so far a
         | backwards-incompatible redesign hasn't been necessary to
         | support newer generations of PCIe (though the practical
         | distances usable without retimers have been decreasing).
        
       | h2odragon wrote:
        | Disks have had complex CPUs on them for a while; might as well
        | go full mainframe, admit they're smart storage subsystems, and
        | put them on the first-class bus. Is "DASD" still an IBM
        | trademark?
        | 
        | Of course, there's a long history of "multiple interface"
        | drives, which are always ugly hacks that turn up as rare
        | collectors' items and examples of boondoggles.
        
         | DaiPlusPlus wrote:
         | > and admit they're smart storage subsystems and put them on
         | the first class bus. is "DASD" still an IBM trademark?
         | 
          | Y'know that eventually we'll be running everything off Intel's
          | Optane non-volatile RAM: we won't have block-addressable
          | storage anymore; _everything_ will be directly byte-
          | addressable. All of the storage abstractions that have popped
          | up over the past decades (tracks, heads, cylinders, ew, block
          | addressing, unnecessarily large block sizes, etc.) will be
          | obsolete because we'll already have _perfect-storage_.
          | 
          | It's not quite DASD, but it's much better.
        
           | trasz wrote:
           | Optane didn't exactly take the market by storm. Also: flash
            | memory doesn't work this way; it's inherently organized into
           | blocks/pages.
        
             | DaiPlusPlus wrote:
             | The "Optane" that Intel released as their 3DXPoint NVMe
             | brand, (and quietly withdrew it recently) isn't the same
             | Optane as their byte-addressable non-volatile storage+RAM
             | combo. It isn't Flash memory with blocks/pages, it really
             | is byte-addressable:
             | https://dl.acm.org/doi/10.1145/3357526.3357568
             | 
             | "true" Optane hasn't taken over the scene because, to my
             | knowledge, there's no commercially supported OS that's
             | built around a unified memory model (heck, why not have a
             | single memory-space?) for both storage and memory.
             | 
             | We can't even write software that can just reanimate itself
             | from a process image that would be automagically persisted
             | in the unified storage model. We've got a long way to go
             | before we'll see operating systems and applications that
             | take advantage of it.
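              | 
              | For a feel of the programming model: on Linux, a file
              | on a DAX-capable filesystem backed by persistent memory
              | can be mapped and updated with plain loads and stores.
              | A minimal sketch (the mount point is an assumption; on
              | an ordinary disk-backed file this runs but gives no
              | persistence guarantees beyond normal write-back):
              | 
              |   import mmap
              |   import os
              | 
              |   # Assumed path on a DAX-mounted pmem filesystem,
              |   # e.g. mount -o dax /dev/pmem0 /mnt/pmem
              |   PATH = "/mnt/pmem/counter.bin"
              | 
              |   fd = os.open(PATH, os.O_RDWR | os.O_CREAT, 0o644)
              |   os.ftruncate(fd, 8)
              |   buf = mmap.mmap(fd, 8)
              | 
              |   # Byte-addressable update: one 8-byte store, no
              |   # block-sized read-modify-write in the application.
              |   value = int.from_bytes(buf[:8], "little") + 1
              |   buf[:8] = value.to_bytes(8, "little")
              |   buf.flush()  # msync; real pmem code would use CLWB
              |   os.close(fd)
              |   print("counter =", value)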
        
               | spijdar wrote:
                | But RAM is itself accessed in blocks. The process is
                | hidden from software, but memory is always fetched in
                | word-aligned blocks. It doesn't contradict your point;
                | I'm just pointing out that even DRAM is pulled in chunks
                | not unlike traditional drives (if you squint).
               | 
               | (Of course, getting those chunks down to cache line sizes
               | does open up a lot of possibilities...)
        
               | trasz wrote:
                | Well, yeah, the cacheline can be considered a kind of
                | 64-byte block. But it doesn't work like this because of
                | how RAM works - you could access DRAM in words if you
                | wished to; it's just that it doesn't make sense because
                | of the CPU cache. For flash, the blocks (and pages) are
                | inherent to its design and there is no way around it.
                | 
                | Also, RAM "block size" is 64B, while for flash it's more
                | like 4kB. The CPU cache will "deblock" the 64B blocks,
                | but it can't efficiently do that for 4kB ones.
               | 
               | And then there's the speed. Does replacing PCIe with a
               | memory bus actually make a performance difference that's
               | measurable given flash latency?
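                | 
                | Rough orders of magnitude behind that question (all
                | of the latency figures below are ballpark assumptions,
                | not measurements):
                | 
                |   # Ballpark latencies in nanoseconds (assumed).
                |   lat = {
                |       "NAND flash read":    80_000,
                |       "Optane DIMM access":    350,
                |   }
                |   BUS_NS = 5_000   # assumed NVMe/PCIe stack cost
                |   for name, ns in lat.items():
                |       share = BUS_NS / (ns + BUS_NS) * 100
                |       print(f"{name}: bus ~{share:.0f}% of total")
                |   # flash: ~6%; Optane: ~93% -- the bus only
                |   # matters once the media itself is that fast.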
        
               | rlkf wrote:
               | > commercially supported OS that's built around a unified
               | memory model
               | 
               | Doesn't OS/400 work that way? (Of course then there is
               | the question to which degree "commercially" should imply
               | "readily open to third-party software and hardware
               | vendors")
        
       | t0mas88 wrote:
        | NVMe also does away with the controller knowing about the disk
        | layout and addressing, which may make sense for future disks and
        | ever-increasing cache sizes. At some point you probably want to
        | put all the logic in the drive itself to optimise it (as SSDs
        | already do).
        
         | trasz wrote:
         | SCSI already got rid of knowledge of disk layout/addressing
         | some 30 years ago.
        
         | wmf wrote:
         | Hasn't LBA been around since 1990 or so?
        
           | wtallis wrote:
           | Even the old cylinder/head/sector addressing almost never
           | matched the real hard drive geometry, because the floppy-
           | oriented BIOS CHS routines used a different number of bits
           | for each address component than ATA did, so translation was
           | required even for hard drives with capacities in the hundreds
           | of MB.
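            | 
            | The translation itself is simple arithmetic. A minimal
            | sketch (the geometry constants are just example values for
            | a translated geometry, not any particular drive):
            | 
            |   # Classic CHS -> LBA translation.
            |   HEADS_PER_CYL = 16
            |   SECTORS_PER_TRACK = 63
            | 
            |   def chs_to_lba(c, h, s):
            |       # Sector numbers start at 1, hence "s - 1".
            |       return ((c * HEADS_PER_CYL + h) * SECTORS_PER_TRACK
            |               + (s - 1))
            | 
            |   print(chs_to_lba(0, 0, 1))   # 0, the first sector
            |   print(chs_to_lba(2, 3, 4))   # 2208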
        
         | oneplane wrote:
         | Indeed, ideally the controller or HBA should really just
         | provide the fabric and nothing more. A bit like Thunderbolt and
         | USB4.
        
       | oneplane wrote:
        | As the article explains, it makes sense if you can do this and
        | reduce the number of now 'legacy' interfaces.
        | 
        | We used to do IDE emulation over SATA for a while, then we got
        | AHCI over SATA and AHCI over other fabrics. It makes sense to
        | stop carrying all the legacy loads all the time. For people
        | who really need it for compatibility reasons, we still have
        | the Vortex86-style solutions that generally fit the bill, as
        | well as integrated controllers that do PCIe-to-PCI bridging
        | with a classic IDE controller attached to that converted PCI
        | bus. Options will stay, but cutting legacy by default makes
        | sense to me. Except UART, of course; that can stay forever.
        | 
        | Edit: I stand corrected. AHCI (as the name implies: Advanced
        | Host Controller Interface) is for the communication up to the
        | controller. Essentially, ATA commands continue to be sent to
        | the drive; the difference is whether the controller is
        | commanded in IDE or AHCI mode. This is also why the controller
        | needs to know about the drive, whereas with NVMe it doesn't,
        | because the controller no longer has to understand the ATA
        | protocol (as posted here elsewhere).
        
         | masklinn wrote:
         | > then we got AHCI over SATA
         | 
          | AHCI is the native SATA mode; AFAIK explicit mentions of AHCI
          | are mostly about non-SATA interfaces (usually M.2, because an
          | M.2 SSD can be SATA, AHCI over PCIe, or NVMe).
        
           | wtallis wrote:
           | AHCI is the driver protocol used for communication between
           | the CPU/OS and the SATA host bus adapter. It stops there, and
           | nothing traveling over the SATA cable can be correctly called
           | AHCI.
           | 
           | You can have fully standard SATA communication happening
           | between a SATA drive and a SATA-compatible SAS HBA that uses
           | a proprietary non-AHCI driver interface, and from the drive's
           | end of the SATA cable this situation is completely
           | indistinguishable from using a normal SATA HBA.
           | 
           | Likewise, you can have AHCI communication to a PCIe SSD and
            | the OS will _think_ it's talking to a single-port SATA HBA,
           | with the peculiarity that sustained transfers in excess of
           | 6Gbps are possible.
        
             | AussieWog93 wrote:
             | >It stops there, and nothing traveling over the SATA cable
             | can be correctly called AHCI.
             | 
              | Informally, a lot of BIOSes (another informality there!)
              | would give you the choice back in the day between IDE and
              | AHCI modes when it came to communicating with the drive.
             | 
             | I think this is where most of the confusion came from.
        
               | wtallis wrote:
               | That choice was largely about which drivers your OS
               | included. The IDE compatibility mode meant you could use
               | an older OS that didn't include an AHCI driver. The
               | toggle didn't actually change anything about the
               | communication between the chipset and the drive, but
                | _did_ change how the OS saw the storage controller built
                | into the chipset on the motherboard.
               | 
               | (Later, a third option appeared for proprietary RAID
               | modes that required vendor-specific drivers, and that
               | eventually led to more insanity and user
               | misunderstandings when NVMe came onto the scene.)
        
       | KingMachiavelli wrote:
        | Would this reduce the need to have expensive and power-hungry
        | RAID/HBA cards? I would assume splitting NVMe/PCIe is a lot
        | simpler than PCIe to SATA.
        
         | toast0 wrote:
          | Seems like it. The article mentions a PCIe switch, but PCIe
          | bifurcation may also be an option. (That's splitting a
          | multi-lane slot into multiple slots; it requires system
          | firmware support though.)
        
           | formerly_proven wrote:
           | Bifurcation has never been a thing on desktop platforms and
           | even most entry-level (single socket, desktop-equivalent)
           | servers don't support it. It seems to be reserved for HEDT
           | and real server platforms. (This is of course purely a market
           | segmentation decision by Intel/AMD).
        
             | wtallis wrote:
             | PCIe bifurcation works fine on AMD consumer platforms. I've
             | used a passive quad-M.2 riser card on an AMD B550
             | motherboard with no trouble other than changing a firmware
             | setting to turn on bifurcation. It's only Intel that is
             | strict about this aspect of product segmentation.
        
             | toast0 wrote:
              | My A520 mini-ITX board supports it; can't get any more
              | desktop than that. It has limited options, though: I
              | think I can do either two x8, or one x8 and two x4. For
              | this, it looks like each drive is expected to be x1, so
              | you'd want one x16 split into sixteen x1s. It's doable,
              | but not without mucking about in the firmware (either by
              | the OEM or dedicated enthusiasts), so a PCIe switch is
              | probably advisable.
        
               | formerly_proven wrote:
               | Ah I see. It seems like previously bifurcation was not
               | qualified for anything but the X-series chipset, but in
               | the 500 series it's qualified for all. On top of that, it
               | seems like some boards just allowed it regardless in
               | prior generations.
               | 
               | Another complication is of course that the non-PEG slots
               | on the (non-HEDT) platforms are usually electrically only
               | x4 or x1, so bifurcation really only makes sense in the
               | PEG.
        
               | wtallis wrote:
               | PCIe root ports in CPUs are generally designed to provide
               | an x16 with bifurcation down to x4x4x4x4 (or merely
               | x8x4x4 for Intel consumer CPUs). Large-scale PCIe
               | switches also commonly support bifurcation only down to
               | x4 or sometimes x2, though x1 may start catching on with
               | PCIe gen5.
               | 
               | Smaller PCIe switches and motherboard chipsets usually
               | support link widths from x4 down to x1. Treating each
               | lane individually goes hand in hand with the fact that
               | many of the lanes provided by a motherboard chipset can
               | be reconfigured between some combination of PCIe, SATA
               | and USB: they design a multi-purpose PHY, put down one or
               | two dozen copies of it at the perimeter of the die, and
               | connect them to an appropriate variety of MACs.
        
               | formerly_proven wrote:
               | Yeah, but what I meant above was that if you only have an
               | x4 slot electrically, sticking in an x16 -> 4x M.2 riser
               | isn't going to do a whole lot, because the 12 lanes of 3
               | out of 4 slots aren't hooked up to anything. So in this
               | scenario you'd really want a riser with a switch in it
               | instead (which are more expensive than almost all
               | motherboards).
               | 
                | So on the consumer platforms that give you two PEGs, the
                | best you could do while still having a GPU is to stick
                | that riser in the second PEG and use the x8/x8 split.
                | Now the question becomes whether the UEFI allows you to
                | use the
               | x8/x8 bifurcation meant for dual GPU or similar use in an
               | x8/(x4+x4) triple bifurcation kind of setup.
               | 
               | Realistically this entire thing just doesn't make a lot
               | of sense on the consumer platforms because they just
               | don't have enough PCIe lanes out of the CPU. Intel used
               | to be slightly worse here with 20 (of which 4 are
               | reserved for the PCH but you know that), while AM4 has 28
               | (4 for the I/O hub again). On an HEDT platform with 40+
               | lanes though...
               | 
                | (When I say bifurcation I mean bifurcation of a slot on
                | the mainboard, not the various ways the slots and ports
                | on the board itself can be configured, though that's
                | technically bifurcation as well (or even switching
                | protocols).)
        
               | wtallis wrote:
               | > Yeah, but what I meant above was that if you only have
               | an x4 slot electrically, sticking in an x16 -> 4x M.2
               | riser isn't going to do a whole lot, because the 12 lanes
               | of 3 out of 4 slots aren't hooked up to anything. So in
               | this scenario you'd really want a riser with a switch in
               | it instead (which are more expensive than almost all
               | motherboards).
               | 
               | True; but given how PCIe speeds are no longer stalled, we
               | may soon see motherboards offering an x4 slot that can be
               | operated as x1x1x1x1. Currently, the only risers you're
               | likely to find that split a slot into independent x1
               | ports are intended for crypto mining, and they require
               | switches. A passive (or retimer-only) quad-M.2 riser that
               | only provides one PCIe lane per drive currently sounds a
               | bit bandwidth-starved and wouldn't work with current
               | motherboards. But given PCIe gen5 SSDs or widespread
               | availability of PCIe-native hard drives, those uses for
               | an x4 slot will start to make sense.
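                | 
                | For scale, even a single lane dwarfs a hard drive (a
                | minimal sketch; the HDD number is an assumed typical
                | sequential rate):
                | 
                |   # Rough per-lane PCIe bandwidth vs. a typical
                |   # HDD's sustained transfer rate.
                |   lane_gbs = {"3.0": 1.0, "4.0": 2.0, "5.0": 4.0}
                |   HDD_MBS = 280   # assumed sequential MB/s
                | 
                |   for gen, gbs in lane_gbs.items():
                |       ratio = gbs * 1000 / HDD_MBS
                |       print(f"PCIe {gen} x1: ~{ratio:.0f}x HDD")
                |   # ~4x, ~7x, ~14x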
        
         | wtallis wrote:
         | PCIe switches are a lot simpler and more standardized than RAID
         | and HBA controllers, but I'm not sure they're any cheaper for
         | similar bandwidth and port counts. Broadcom/Avago/PLX and
         | Microchip/Microsemi are the only two vendors for large-scale
         | current generation PCIe switches, and starting with PCIe gen3
         | they decided to price them _way_ out of the consumer market,
         | contributing to the disappearance of multi-GPU from gaming PCs.
        
       | inetknght wrote:
        | Does this have the same reliability as the rest of Seagate's
        | lineup? Or is this actually something that isn't a ripoff?
        
         | wmf wrote:
         | Obligatory reminder that there are only three hard disk vendors
         | and all of them have made bad drives at one time.
        
           | thijsvandien wrote:
           | To be specific: Seagate, Toshiba and Western Digital.
        
       | mastax wrote:
       | Could you use multiple NVMe namespaces to represent the separate
       | actuators in a multiple-actuator drive? Would there be a benefit?
       | Do different namespaces get separate command queues or whatever?
        
         | wtallis wrote:
         | NVMe supports multiple queues even for a single namespace, and
         | multiple queues are used more for efficiency on the software
         | side (one queue per CPU core) than for exposing the parallelism
         | of the storage hardware.
         | 
         | There are several NVMe features intended to expose some
         | information about the underlying segmentation and allocation of
         | the storage media, for the sake of QoS. Since current multi-
         | actuator drives are merely multiple actuators per spindle but
         | still only one head per platter, this split could be exposed as
         | separate namespaces or separate NVM sets or separate endurance
         | groups. If we ever see multiple heads per platter come back (a
         | la Conner Chinook), that would be best abstracted with multiple
         | queues.
        
       ___________________________________________________________________
       (page generated 2021-11-14 23:00 UTC)