[HN Gopher] ECC RAM should be a human right
       ___________________________________________________________________
        
       ECC RAM should be a human right
        
       Author : zdw
       Score  : 49 points
       Date   : 2023-01-21 21:44 UTC (1 hour ago)
        
 (HTM) web link (dmitrybrant.com)
 (TXT) w3m dump (dmitrybrant.com)
        
       | greenbit wrote:
       | Commodity PCs back in the 80s and 90s didn't have error
       | correction but they did have parity (iirc). Correction requires
       | three extra bits per byte compared to parity only carrying one
       | extra bit per byte. I recall around 1990 you could get your 30-pin
       | SIMMs as 9-bit (parity) or 8-bit (no-parity), and virtually all of
       | the PCs at the time wanted the 9 bit modules. Parity can't
       | correct errors, but at least it can cause an exception when you
       | read something that's had a bit flip.
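To make the parity-versus-correction distinction concrete, here is a minimal Python sketch of even parity over one byte, assuming the ninth bit is chosen so the total count of 1s is even (the function names are made up for illustration). A single flipped bit changes the count and is caught; two flips in the same byte cancel out and slip through. In neither case does the parity bit say which bit went wrong, so the data cannot be repaired, only flagged.

```python
def parity_bit(byte: int) -> int:
    """Even parity: the extra (9th) bit makes the total number of 1s even."""
    return bin(byte & 0xFF).count("1") % 2


def store(byte: int) -> tuple[int, int]:
    """Model a 9-bit memory cell: 8 data bits plus 1 parity bit."""
    return byte & 0xFF, parity_bit(byte)


def check(byte: int, stored_parity: int) -> bool:
    """True if the byte still agrees with its stored parity bit."""
    return parity_bit(byte) == stored_parity


data, p = store(0b1011_0010)
one_flip = data ^ 0b0000_0100   # one bit flipped
two_flips = data ^ 0b0000_0110  # two bits flipped

print(check(data, p))       # True
print(check(one_flip, p))   # False: detected, but we can't tell which bit
print(check(two_flips, p))  # True: the two flips cancel out, undetected
```

On the old 9-bit-SIMM machines, a failed check like the second one is what triggered the exception (often a system halt) rather than any repair.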
        
         | Felger wrote:
         | Yes, I can indeed recall systems being shipped with 4x 9-bit
         | sticks.
         | 
         | And god, the horrific price of those sticks...
        
       | phkahler wrote:
       | Unfortunately DDR5 is going to complicate rather than fix the
       | story.
        
         | zdw wrote:
         | It's very strange that DDR5 mandates internal ECC within each
         | physical package, but not on the longer and possibly more
         | EMI-sensitive connections between the memory chips and controller.
         | 
         | I would have thought that the additional cost would be minimal
         | (additional wiring on the logic board in some cases), but maybe
         | this is just more artificial market segmentation?
        
           | ilyt wrote:
           | They have internal ECC because that allows them to have higher
           | yields: what would have been considered a faulty chip in DDR4
           | can now be sold in DDR5. So it is effectively a cost-reducing
           | measure for them. Exposing that to the user would not only
           | cost extra pennies, but could potentially have users go "hey,
           | this stick is shit, look at how many correctable errors it is
           | producing, please replace it"
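On Linux, when a machine does have side-band ECC and an EDAC driver for its memory controller is loaded, the kernel exposes exactly this kind of counter under /sys/devices/system/edac/mc/. Below is a minimal sketch that reads the per-controller ce_count (corrected) and ue_count (uncorrected) files; which controllers appear depends on the platform, and as far as I know corrections made silently by DDR5 on-die ECC never reach these counters. The function name is hypothetical.

```python
from pathlib import Path

# Standard Linux EDAC sysfs location; empty or missing if no ECC driver is loaded.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ecc_counters(root: Path = EDAC_MC_ROOT) -> dict[str, dict[str, int]]:
    """Return {"mc0": {"ce": n, "ue": n}, ...} for each memory controller found."""
    counters: dict[str, dict[str, int]] = {}
    for mc in sorted(root.glob("mc[0-9]*")):
        ce_file = mc / "ce_count"  # errors the controller corrected
        ue_file = mc / "ue_count"  # errors it could detect but not correct
        if ce_file.is_file() and ue_file.is_file():
            counters[mc.name] = {
                "ce": int(ce_file.read_text().strip()),
                "ue": int(ue_file.read_text().strip()),
            }
    return counters


if __name__ == "__main__":
    found = read_ecc_counters()
    if not found:
        print("No EDAC memory controllers found (no ECC, or no driver loaded).")
    for name, counts in found.items():
        print(f"{name}: {counts['ce']} corrected, {counts['ue']} uncorrected")
```

A steadily climbing corrected count on one controller is the "please replace it" signal described above.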
        
         | arp242 wrote:
         | Why is that?
        
           | loeg wrote:
           | DDR5 has some minimal ECC on each chip but, critically, does
           | not mandate ECC coverage all the way to the CPU. Or in
           | Wikipedia's words:
           | 
           | > Unlike DDR4, all DDR5 chips have on-die ECC, where errors
           | are detected and corrected before sending data to the CPU.
           | This, however, is not the same as true ECC memory with an
           | extra data correction chip on the memory module. DDR5's on-
           | die error correction is to improve reliability and to allow
           | denser RAM chips which lowers the per-chip defect rate. There
           | still exist non-ECC and ECC DDR5 DIMM variants; the ECC
           | variants have extra data lines to the CPU to send error-
           | detection data, letting the CPU detect and correct errors
           | that occurred in transit.
           | 
           | So in some ways it is better than previous generations, but
           | it gives vendors another excuse not to implement full-
           | coverage ECC. That's my guess of why GP said it complicates
           | things.
           | 
           | https://en.wikipedia.org/wiki/DDR5_SDRAM
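A toy model of the difference the Wikipedia passage describes, assuming a Hamming(12,8) single-error-correcting code over one byte (real DIMMs use wider codes, and DDR5's on-die code is not this one). With side-band ECC the check bits travel to the CPU along with the data, so a flip on the bus is corrected; with on-die ECC only, the chip corrects internally and then sends 8 bare bits, so a flip in transit arrives undetected.

```python
# Toy Hamming(12,8) SEC code: 8 data bits + 4 check bits at positions 1..12.
# Check bits live at the power-of-two positions; data bits fill the rest.
CHECK_POSITIONS = [1, 2, 4, 8]
DATA_POSITIONS = [p for p in range(1, 13) if p not in CHECK_POSITIONS]


def encode(byte: int) -> list[int]:
    """Return a 13-element list; index 0 is unused, 1..12 hold the codeword."""
    code = [0] * 13
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (byte >> i) & 1
    for c in CHECK_POSITIONS:
        # Set check bit c so the XOR over all positions with bit c set is 0.
        for pos in range(1, 13):
            if pos & c and pos != c:
                code[c] ^= code[pos]
    return code


def decode(code: list[int]) -> int:
    """Correct at most one flipped bit, then extract the data byte."""
    code = code[:]
    syndrome = 0
    for pos in range(1, 13):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all 1 bits
    if 0 < syndrome <= 12:
        code[syndrome] ^= 1  # a single flip makes the syndrome equal its position
    byte = 0
    for i, pos in enumerate(DATA_POSITIONS):
        byte |= code[pos] << i
    return byte


value = 0xA5

# Side-band ECC: the whole codeword crosses the bus and the CPU decodes it.
on_bus = encode(value)
on_bus[6] ^= 1                  # bit flip in transit
side_band_result = decode(on_bus)

# On-die ECC only: the chip decodes internally and sends 8 raw data bits.
raw = decode(encode(value))
on_die_result = raw ^ (1 << 3)  # a flip in transit, with nothing to catch it

print(hex(side_band_result))  # 0xa5, repaired
print(hex(on_die_result))     # 0xad, silently wrong
```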
        
       | adhoc32 wrote:
       | Latest Intel desktop CPUs (e.g. i9-13900KF) support ECC with the
       | W680 chipset.
        
         | fortran77 wrote:
         | One of the main reasons I buy Xeon desktops is the ECC. With
         | 128 GB of memory, and 1 bitflip/GB/year average error rate, it
         | seems too risky to not use ECC for production work.
        
           | Retric wrote:
           | Real world numbers are closer to 1 bitflip/GB/hour than year
           | because bit flips are highly correlated.
           | 
           | "A large-scale study based on Google's very large number of
           | servers was presented at the SIGMETRICS/Performance '09
           | conference.[6] The actual error rate found was several orders
           | of magnitude higher than the previous small-scale or
           | laboratory studies, with between 25,000 (2.5 x 10^-11
           | error/bit*h) and 70,000 (7.0 x 10^-11 error/bit*h, or 1 bit
           | error per gigabyte of RAM per 1.8 hours) errors per billion
           | device hours per megabit. More than 8% of DIMM memory modules
           | were affected by errors per year."
           | https://en.wikipedia.org/wiki/ECC_memory
           | 
           | A random stick of non-ECC memory might be far better than
           | average, but you don't know.
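The unit conversion in the quoted study is easy to check. Here is a small sketch, assuming the quote uses decimal units (10^6 bits per megabit and 8 x 10^9 bits per gigabyte, which is what reproduces its 2.5 x 10^-11 and 1.8-hour figures), with fortran77's 128 GB at 1 flip/GB/year alongside for comparison. Keep in mind the study's own point that errors cluster on a minority of DIMMs, so an average rate says little about any one stick.

```python
# FIT-style rate: errors per billion device-hours per megabit.
HOURS_PER_BILLION = 1e9
BITS_PER_MBIT = 1e6  # decimal megabit (assumed; matches the quoted figures)
BITS_PER_GB = 8e9    # decimal gigabyte (assumed)


def errors_per_bit_hour(fit_per_mbit: float) -> float:
    """Convert errors per 10^9 device-hours per Mbit to errors per bit-hour."""
    return fit_per_mbit / (HOURS_PER_BILLION * BITS_PER_MBIT)


for fit in (25_000, 70_000):
    per_bit = errors_per_bit_hour(fit)
    per_gb_hour = per_bit * BITS_PER_GB
    print(f"{fit} FIT/Mbit: {per_bit:.1e} err/bit*h, "
          f"one error per GB every {1 / per_gb_hour:.1f} h")

# fortran77's estimate: 1 flip per GB per year on a 128 GB machine.
ram_gb = 128
print(f"{ram_gb} GB at 1 flip/GB/year: one flip every {365 / ram_gb:.1f} days")
```

This prints one error per GB every 5.0 h at the low end and every 1.8 h at the high end, versus one flip every 2.9 days for the 1/GB/year rule of thumb on 128 GB.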
        
         | skunkworker wrote:
         | I wish those motherboards didn't cost $450+. I've contemplated
         | building a home server with a 13th-gen CPU + ECC because you
         | also get Quick Sync onboard.
        
           | coder543 wrote:
           | Exactly. $450 for a motherboard just to get ECC support is
           | ridiculous. I don't know how it is with AM5, but on AM4, you
           | could use ECC memory with many normally-priced motherboards.
           | 
           | Mentioning W680 feels pointless. You've _always_ been able to
           | buy high end motherboards and stick ECC in them. The entire
           | point of the article is that _all_ computers should be using
           | ECC RAM, not just the expensive, workstation class computers.
        
         | Dylan16807 wrote:
         | It's worth keeping in mind that the chipset has zero
         | involvement in ECC. The CPU is directly attached to the memory
         | slots. They're using the chipset as an expensive dongle.
        
       | NelsonMinar wrote:
       | Still wild to me we lost ECC RAM. It used to be standard in PCs.
       | 
       | Does Apple hardware come with ECC RAM? If anyone could make it
       | make sense as a business, it's them.
        
         | pram wrote:
         | The Xeon-based Macs had ECC, of course. None of the ARM ones do
         | (yet).
        
         | MBCook wrote:
         | When was it standard? It's been the high-end extra thing for as
         | long as I can remember.
        
           | MisterTea wrote:
           | I know the Pentium Pro/2/3 chipsets and motherboards all(?)
           | supported it. I'm unsure about the Pentium 1, as the 430TX on
           | my Tyan Tomcat IV doesn't support it, and that is a
           | dual-processor board. 486 and earlier likely depended on the
           | chipset, as there were many.
           | 
           | At work I have two working Slot 1 PIII 800s, each with 1GB
           | ECC (4x 256MB DIMMs) on a regular Asus board (doing nothing
           | but waiting to go home with me one day). The board reports
           | the RAM is in fact ECC and that it is enabled.
        
           | NelsonMinar wrote:
           | I was thinking of 386-era computers, and strictly speaking it
           | was just parity RAM, not ECC. Which often led to annoyances
           | when a single parity error would cause your whole computer to
           | halt.
           | 
           | Wikipedia says "By the mid-1990s, most DRAM had dropped
           | parity checking as manufacturers felt confident that it was
           | no longer necessary.".
           | https://en.wikipedia.org/wiki/RAM_parity
           | 
           | I'd love to read a technical deep dive on RAM reliability
           | over time. You'd think with increasing memory cell density
           | and overall larger RAM the number of absolute errors on a
           | desktop computer would be going up over time.
        
           | Felger wrote:
           | I can remember 486 motherboards in Packard Bell (quite the
           | entry-level brand...) systems frequently using 36-bit ECC FP
           | SIMMs.
           | 
           | Printers and plotters from this era used ECC modules most of
           | the time.
           | 
           | But by the end of the century, they were replaced by
           | unbuffered, unregistered 16/32/64-bit modules.
           | 
           | Every mid-range server still uses ECC. Entry-level HPE servers
           | use ECC UREG (unregistered, 9 chips) modules, while mid-range
           | and higher use ECC REG modules (9 chips + interface
           | controller onboard). Ironically, ECC UREG modules are more
           | expensive than ECC REG ones.
           | 
           | Also, most workstations used ECC modules, though less
           | frequently over the last 4-5 years.
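The 9-chip figure, and greenbit's per-byte estimate at the top of the thread, both follow from the Hamming bound: a single-error-correcting (SEC) code over k data bits needs the smallest r check bits with 2^r >= k + r + 1, and one more overall parity bit makes it SECDED (it also detects double errors). The quick sketch below shows that SEC over a single byte costs 4 bits, but SECDED over a 64-bit word costs 8, the same one-bit-per-byte overhead as parity, which is why a 72-bit ECC module carries nine 8-bit-wide chips instead of eight.

```python
def sec_check_bits(k: int) -> int:
    """Smallest r with 2**r >= k + r + 1 (Hamming single-error correction)."""
    r = 0
    while 2 ** r < k + r + 1:
        r += 1
    return r


for k in (8, 32, 64):
    sec = sec_check_bits(k)
    secded = sec + 1  # an extra overall parity bit adds double-error detection
    print(f"{k:>2} data bits: parity {k // 8}, SEC {sec}, "
          f"SECDED {secded} ({k + secded}-bit word)")
```

The last line printed is "64 data bits: parity 8, SEC 7, SECDED 8 (72-bit word)".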
        
       | dale_glass wrote:
       | ECC RAM would actually be a boon to everyone, including gamers.
       | 
       | ECC means not only that you know precisely when you've gone too
       | far with overclocking, but also potentially allows overclocking a
       | bit further, relying on the fact that some amount of trouble can
       | now be tolerated.
       | 
       | It also means you're not going to break your OS by playing with
       | this stuff. Memory corruption carries a huge risk of disk
       | corruption, which can mean things like corrupt data, random
       | crashes or an unbootable system that persists even after
       | reverting everything to defaults.
        
         | p1necone wrote:
         | The sweet spot for overclocking ECC ram is still before it
         | starts malfunctioning. If it's clocked higher but is correcting
         | for errors it will still be slower.
        
           | ilyt wrote:
           | Entirely depends on error rate
        
         | RealityVoid wrote:
         | I doubt that would actually be useful with overclocking. I
         | don't know the arch of the modern PC well enough to say with
         | 100% confidence, but on embedded arches, the RAM has the parity
         | bits checked when they get placed on the bus. If the error
         | happens on data retrieval (or was already present), then the
         | ECC saves you, but if it happens anywhere else... not really? I
         | don't know if, for example, ALUs automatically include the
         | parity bits in the computation.
        
           | p1mrx wrote:
           | You're talking about overclocking the CPU. ECC is more
           | relevant when overclocking the RAM itself, which also affects
           | gaming performance.
        
           | Dylan16807 wrote:
           | They specifically mean overclocking the memory.
        
         | jjtheblunt wrote:
         | totally embarrassingly naive question: why bother overclocking?
        
           | dale_glass wrote:
           | I think it's mostly pointless in this day and age.
           | 
           | I'm just saying that it has a potential appeal for gamers
           | too, so it's not just a datacenter type of technology that
           | some nerds want to play with.
           | 
           | At the very least it'd make overclocking safer and easier, so
           | any manufacturer making gamer type boards with a lot of
           | overclocking settings in the BIOS should like the idea of it.
        
           | eric__cartman wrote:
           | Some people prefer to trade off stability for a slight
           | performance improvement. With modern hardware I don't think
           | it's worth it to be honest. I want my computer to work day in
           | and day out even if it means a 2% lower score in some
           | benchmark.
        
             | jjeaff wrote:
             | There are also a lot of cases where you can overclock
             | without sacrificing stability. The standard clock speed for
             | any line of processors is simply the minimum it is tested
             | for. But you sometimes get lucky and can get a better chip
             | with more viable transistors. So you can boost the clock on
             | those and reap the benefits without any drawbacks.
             | 
             | There are sites and services that do "binning" where they
             | test the specific chips and you can buy ones that have been
             | vetted to clock higher.
        
       | LanternLight83 wrote:
       | I'm with you, but it's worth noting that errant bit-flips are
       | also the most convincing argument for vertically integrated file-
       | systems like ZFS and BTRFS.
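The idea behind those checksumming filesystems, reduced to a sketch: store a checksum with every block and verify it on every read, so silent corruption becomes a loud error. This is not how ZFS or Btrfs actually lay things out (ZFS, for instance, keeps each block's checksum in its parent block pointer and can repair from a redundant copy); the function names and the 4 KiB block size here are illustrative assumptions. It also shows the limit of the approach: if RAM flips a bit before the checksum is computed, the checksum faithfully protects the corrupted data, so checksums complement ECC rather than replace it.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size (assumption)


def write_blocks(data: bytes) -> list[tuple[bytes, bytes]]:
    """Split data into blocks, storing each block with its SHA-256 digest."""
    blocks = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        blocks.append((block, hashlib.sha256(block).digest()))
    return blocks


def read_blocks(blocks: list[tuple[bytes, bytes]]) -> bytes:
    """Reassemble the data, raising if any block no longer matches its digest."""
    out = bytearray()
    for i, (block, digest) in enumerate(blocks):
        if hashlib.sha256(block).digest() != digest:
            raise OSError(f"checksum mismatch in block {i}")
        out += block
    return bytes(out)


stored = write_blocks(bytes(range(256)) * 40)  # 10,240 bytes -> 3 blocks
block, digest = stored[1]
stored[1] = (bytes([block[0] ^ 0x01]) + block[1:], digest)  # flip one bit "on disk"
try:
    read_blocks(stored)
except OSError as err:
    print(err)  # checksum mismatch in block 1
```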
        
       | whitepoplar wrote:
       | At least make it user-configurable! I'd trade off a bit of memory
       | capacity for ECC protection in a heartbeat.
        
       | PaulKeeble wrote:
       | ECC has been used as an artificial market segmentation mechanism
       | for a long time, and it needs to come to an end. RAM, just like
       | SSDs and HDDs, ought to have some amount of self-protection
       | against basic errors; every place where data is stored, even for
       | short periods, needs this.
        
       | thinking001001 wrote:
       | Digital privacy should be a human right. ECC RAM is just another
       | iteration.
        
         | theandrewbailey wrote:
         | * * *
        
       ___________________________________________________________________
       (page generated 2023-01-21 23:00 UTC)