[HN Gopher] Non-ECC memory corrupted my hard drive image [video]
       ___________________________________________________________________
        
       Non-ECC memory corrupted my hard drive image [video]
        
       Author : zeristor
       Score  : 82 points
       Date   : 2022-12-25 11:17 UTC (11 hours ago)
        
 (HTM) web link (www.youtube.com)
 (TXT) w3m dump (www.youtube.com)
        
       | tibbydudeza wrote:
       | Got a Gen 10 HPE Microserver for my NAS running some AMD dual core
       | SoC - factory fitted with ECC memory running Unraid.
       | 
       | Think it mysteriously crashed once or twice in the 4 years I had
       | it and the HP diagnostic light came on.
        
       | encryptluks2 wrote:
       | I watched the full video. It was long but very informative. The
       | humor at times made up for the length and the presenter showed a
       | lot of deep knowledge that most people won't have. My biggest
       | gripe is that they just didn't try replacing the RAM sticks in
       | the first place. I get that they wanted to do a root cause
       | analysis, but geez the time and patience they had to do all those
       | memory tests. No wonder they did a video about it, cause
       | otherwise that lost time would have been painful. I was baffled
       | as well that dd and ddrescue work differently in how they utilize
       | the RAM. Caught me off guard.
       | 
       | Onto the discussion of ECC RAM. In a perfect world, all memory
       | would be ECC... but try finding some high performance 16GB sticks
       | of ECC DDR4 RAM like what you'll see on gaming computers. I don't
       | even think they make anything comparable in terms of speed and
       | definitely not costs. I guess you don't really know that you
       | needed ECC until it's too late.
        
         | morelikeborelax wrote:
         | > " I guess you don't really know that you needed ECC until
         | it's too late."
         | 
         | I spent many years on hardware consultation and was amazed at
         | all the times I had to explain that it was just "what if"
         | insurance, like any of the other things their business was
         | mitigating against. Sometimes they'd even decided they needed
         | to save costs with non-ECC RAM when the difference was $4 a
         | GB, or (during the FB-DIMM era) when there wasn't even an
         | option to avoid it.
         | 
         | Never really understood the resistance towards it.
         | 
         | Maybe the lack of evidence before the Google study and people
         | thinking RAM manufacturers were trying to rip them off or
         | something.
         | 
         | The "never had a problem so why would I need" it attitude with
         | no way to know if an issue was caused by a bit flip was most
         | baffling.
        
         | simoncion wrote:
         | > ...but try finding some high performance 16GB sticks of ECC
         | DDR4 RAM like what you'll see on gaming computers.
         | 
         | Here ya go:
         | 
         | https://nemixram.com/16gb-ddr4-3200-pc4-25600-ecc-udimm-2rx8...
         | 
         | It doesn't have pretty lights on it, but it does seem to be in
         | the same speed class that gets called "gaming RAM" by a _whole_
         | bunch of retailers.
        
       | jeroenhd wrote:
       | It's good that with DDR5 consumer memory will get some super
       | basic ECC on die, so hopefully the next generation of memory will
       | make the problematic sticks more obvious (or prevent damage in
       | the very least). ECC won't save you from memory corruption, but
       | it'll save your data at least.
       | 
       | Personally, I would've just checksummed the individual failing
       | files rather than the disk image and only back up the bad files
       | separately. There are all kinds of ways for a disk image to fail
       | and I wouldn't spend a second longer on it than absolutely
       | necessary. The whole memtest permutation setup also would've been
       | too much work for me. I would just declare the motherboard faulty
       | when two sticks that otherwise pass the test fail in specific
       | configurations. A new motherboard is cheaper than super specific
       | RAM sticks.
        
       | tibbydudeza wrote:
       | Afaik ECC memory is slower than normal memory, so it does not
       | impress the folks who base their purchase decisions on benchmark
       | scores rather than utility and best bang for the buck.
        
       | ksec wrote:
       | This is especially true for NAS. And why you need BTRFS or
       | preferably ZFS. Unfortunately none of the consumer NAS offers
       | ZFS, and BTRFS is still not a default option. Neither Synology nor
       | Qnap seems to care.
        
         | aborsy wrote:
         | Synology offers ECC in 2023 consumer models such as the 923.
         | 
         | Still the experience that synology btrfs provides is nowhere as
         | good as ZFS (due to a lot of limitations).
        
           | metadat wrote:
           | I bought a 6 disk Synology a few months ago and it came with
           | ECC by default. I did a cursory web search about this just
           | now and ECC support appears to be the norm for 22 (as in the
           | year 2022) model revisions and newer (thankfully!).
        
             | layer8 wrote:
             | It's because they use AMD CPUs now (already in the 21
             | models). The trade-off is that those CPUs have worse
             | hardware codec support than the Intel ones they previously
             | used, if you want to do video transcoding.
        
           | [deleted]
        
         | gmokki wrote:
         | I had RAM go bad after running 18T (5 HDDs in raid1) btrfs
         | system in closet for years. Btrfs of course noticed it and
         | fixed most of them automatically when some of blocks were
         | corrupt. But eventually the system failed: the tree that
         | contains the checksums for all the other trees corrupted itself
         | on both copies of one node. Fixed the HW problem and then had
         | to use hex editor to set the checksum manually to correct value
         | (I modified the kernel to print the expected value). Now the
         | system has been again stable for 3 years.
        
       | [deleted]
        
       | moloch-hai wrote:
       | We don't have ECC mainly because Intel has long been hostile to
       | "consumer" access to ECC.
       | 
       |  _Apparently_ this was conceived as a market segmentation scheme:
       | people outfitting servers could get ECC when they pay a huge
       | premium. They would thereby not be tempted to cheap out and buy
       | consumer-grade equipment, otherwise wholly adequate to meet all
       | their needs at a radically cheaper price.
       | 
       | That we cannot get laptops or even desk machines with ECC, and so
       | have them crash frequently, is seen as a trivial side effect of
       | the strategy. If you did not hate Intel enough before, you may
       | increase your hatred accordingly. Intel doesn't hate you back;
       | they simply care not even a little how you feel.
       | 
       | (Historically, just running Microsoft software was overwhelmingly
       | more likely to be the cause of a crash than a memory bit-flip;
       | and there were orders of magnitude fewer RAM bits at risk.
       | Microsoft succeeded in getting customers to accept and even
       | expect frequent crashes; before MS, a program crashing was
       | grounds for a refund.)
        
         | helf wrote:
         | I see this a lot but I really don't think it was/is that
         | simplistic.
         | 
         | It's added complexity and cost for something that rarely would
         | benefit most consumers. Now, you can argue that the complexity
         | and cost is a nonissue on modern setups and I would probably
         | agree.
         | 
         | But Intel has long had desktop grade hardware with ECC support.
         | The 440GX chipset supported ECC and I ran a Dell GX1 SFF with
         | 768MB of ECC PC100 for yeeeears with a 450 MHz P3 and later
         | upgraded to 1.4ghz tualatin-256 via a slotket adapter.
         | 
         | The 440HX /socket 7/ chipset supported ECC. And that's a
         | Pentium 1 chipset.
         | 
         | The 440BX/GX and 450NX supported ECC and that's with desktop
         | pentium 2 and 3 chips.
         | 
         | The 820/820E/840 supported ECC with desktop celeron and
         | pentium2/3 chips
         | 
         | 845/845e/850/850e/860 pentium4 chipsets support ECC
         | 
         | 875/e7205/e7221/e7230 did with desktop pentium 4 and pentium d
         | chips
         | 
         | 925/925xe/955x/975x did with desktop pentium 4/pentium d/core 2
         | 
         | It's more sparse now that they moved to the IMC, granted. But
         | Intel has long had multiple chipsets per generation with ECC
         | support for desktop grade hardware.
        
         | eternityforest wrote:
         | I used to be a big fan of Intel, up until the latest chips from
         | other companies that seem to have beat them on
         | performance/watt. My next laptop will probably be AMD if the
         | situation hasn't changed.
        
           | vladvasiliu wrote:
           | I'm happy with my AMD based laptop. But I haven't seen any
           | that support ECC.
           | 
           | But I did see a Lenovo model, IIRC, that had some kind of
           | Xeon and ECC. Not sure what the noise and battery life
           | situations on that thing were, though.
        
             | OJFord wrote:
             | I realise that's blurrier when it comes to laptops, but
             | AIUI it's more a case of whether the motherboard supports
             | it than about the AMD chip, i.e. given a desktop CPU, as
             | far as I know you can put it in a motherboard that either
             | does or does not support ECC RAM.
        
               | eternityforest wrote:
               | I'm surprised nobody has made RAM with the ECC logic
               | built into the ram itself, that just looks like normal
               | ram to the CPU.
        
               | IYasha wrote:
               | Having ECC being checked inside the CPU is actually
               | useful as data loss may be induced by EMI (and other
               | factors) on PCB data lines.
        
               | AdrianB1 wrote:
               | It is called DDR5 - it has ECC built in the module
               | itself. Making such a module does not make much sense if
               | you cannot report the rate of errors, so if it is just
               | hiding you have a bad RAM stick there is only so much
               | value in having ECC.
        
               | simoncion wrote:
               | It is my understanding that the ECC that you're talking
               | about only protects data-in-flight between the module and
               | whatever is reading or writing the data. It does not
               | protect against corruption of data-at-rest, which is what
               | is protected with ECC in DDR4 and older.
               | 
               | It's also my understanding that the DDR5 data-in-flight
               | ECC is a _mandatory_ feature because the link between the
               | memory modules and everything else is so error-prone that
               | the system would simply not function without it.
        
               | eternityforest wrote:
               | Taking a quick glance at the articles, I think it's the
               | opposite, DDR5 protects data at rest only, because they
               | want to make the chips so unreliable it can't work
               | without it, not the bus.
               | 
               | But in practice, it will probably be more reliable than
               | DDR4 without ECC, since now you need 2 cosmic ray flips,
               | or 1 plus a manufacturing defect flip, and the defect
               | flips will probably be uncommon-ish.
               | 
               | It's too bad data in flight isn't protected without old
               | fashioned ECC on top of that, but it will probably be a
               | big step up, the same way that flash memory is now very
               | reliable even though the actual uncorrected errors are
               | probably worse under the hood.
        
               | toast0 wrote:
               | The problem with the DDR5 approach is there's no
               | reporting mechanism, so while it will reduce the error
               | rate of a marginal module, it doesn't let you know so you
               | can replace it. In my experience with ECC modules, a
               | module with some errors is a lot more likely to get more
               | errors than one that's operating with zero errors.
        
               | my123 wrote:
               | Ryzen APUs, which include almost all AMD laptops,
               | actually have ECC fused off in silicon unless you buy the
               | "Pro" variant.
        
               | vladvasiliu wrote:
               | Huh. I didn't know that.
               | 
               | My particular laptop does have a "pro" CPU. However, I
               | would be surprised to no end to learn that it supports
               | ECC. This particular model sports an MBP-level price tag
               | [0], but is absurdly cheaply built. Even for "customer
               | facing components", that are easy to compare, such as the
               | screen (terrible colors) and case (creaks if you look at
               | it wrong). HP doesn't offer ECC RAM, not even as an
               | upgrade, so I really don't think the additional lines are
               | physically present.
               | 
               | ---
               | 
               | [0] I don't remember the specific number, but it was
               | within 100 EUR of a 14" M1 MBP with 32 GB RAM and 512 GB
               | SSD. That's counting a RAM (8 -> 32) and SSD (256 -> 512)
               | upgrade which were made with components bought separately
               | (though they were rather high-end).
        
               | vladvasiliu wrote:
               | That's right, but seeing how laptops seem to do the bare
               | minimum, I would be really surprised to learn that a
               | random model, _which doesn't advertise it_, actually
               | supports it.
        
         | erk__ wrote:
         | > That we cannot get laptops or even desk machines with ECC
         | 
         | The Xeon series of laptop processors does support ECC just at a
         | quite large premium.
        
         | gruez wrote:
         | > That we cannot get laptops or even desk machines with ECC,
         | and so have them crash frequently, is seen as a trivial side
         | effect of the strategy
         | 
         | I'm not sure what you mean by "frequently", but my non-ECC
         | machines definitely do not crash "frequently".
         | 
         | > before MS, a program crashing was grounds for a refund
         | 
         | Source?
        
           | Brian_K_White wrote:
           | The problem with untrustworthy memory (or any other component)
           | is not that your system crashes, it's that it _doesn't_.
        
             | gruez wrote:
             | I don't doubt that non-ECC hardware experiences some non-
             | zero number of bitflips per year. I'm just doubting the
             | parent commenter's claim that non-ECC ram is causing
             | computers to crash "frequently".
        
               | layer8 wrote:
               | And the parent is pointing out that _not_ crashing on bit
               | flips is exactly the problem.
        
           | AdrianB1 wrote:
           | "frequently" is very subjective or relative in this context.
           | 25 years ago I had a crash per hour on almost any regular
           | computer, but zero crashes per month on servers with ECC. In
           | the past couple of years I think I had a few cases of frozen
           | apps, but I don't remember of any OS level problem. At the
           | same time, on servers I see from time to time ECC fixing a
           | bit, but on the desktop or laptop I have no idea how many
           | times corrupted bits went undetected and what is the
           | consequence.
        
             | gruez wrote:
             | >25 years ago I had a crash per hour on almost any regular
             | computer, but zero crashes per month on servers with ECC
             | 
             | If it's crashing once per hour, it's probably unstable
             | drivers/software or flaky hardware that needs to be RMAed,
             | not random bitflips.
        
             | navjack27 wrote:
             | I think something else is wrong if you've had a "crash" per
             | hour.
        
         | IYasha wrote:
         | For this reason for my first truly made-from-scratch home NAS I
         | went AMD64 with ECC UDIMMs. It was some very basic Athlon64,
         | but it COULD do ECC. Since then I moved to Opterons and Xeons
         | but I still remember that choice.
        
         | Gordonjcp wrote:
         | > That we cannot get laptops or even desk machines with ECC,
         | and so have them crash frequently, is seen as a trivial side
         | effect of the strategy.
         | 
         | How frequently would you say you encounter a crash that you can
         | pin down to a lack of ECC memory in your laptop or desktop?
        
           | Filligree wrote:
           | You can't, that's the thing, right?
           | 
           | I have a Ryzen desktop with ECC, and it registers about one
           | bit-flip per week. I don't know how many of those would
           | become crashes, but I'm more worried about the ones that
           | wouldn't.
        
       | dale_glass wrote:
       | Yup, been there.
       | 
       | Way back I had a Pentium 133 doing firewall duty in a closet. It
       | did approximately nothing besides iptables, but of course any
       | machine has logs, updates and so on going on.
       | 
       | After running fine for months one day it suddenly died. I
       | rebooted it. A few days later it died again. Another reboot. Then
       | it died for the last time and failed to boot at all. Examination
       | showed the disk was corrupt and couldn't be mounted. Further
       | examination showed that one of the memory modules was loose for
       | some reason, could be that it was never firmly in and I just
       | bumped the box when messing with something else.
       | 
       | Then came the wasted weekend of dealing with that my normal
       | internet connection relied on the thing that was now completely
       | broken.
       | 
       | And that was the luckiest case I can imagine, when the broken
       | machine contains no data of actual value. Since then I'm very
       | paranoid, always run memtest on any new RAM I buy overnight, and
       | have ECC where it's possible to have it.
        
         | vladvasiliu wrote:
         | > Since then I'm very paranoid, always run memtest on any new
         | RAM I buy overnight, and have ECC where it's possible to have
         | it.
         | 
         | Yeah, I do the same, but I've learned that you have to do it
         | regularly.
         | 
         | In one of my desktop machines, the RAM ran fine for like two
         | years. Then, all of a sudden, random Firefox segfaults, etc.
         | 
         | Whipped up a memtest ISO, and sure enough, one of the sticks
         | was bad.
        
           | dale_glass wrote:
           | That's the nice thing about ECC, it acts like an always
           | running memory test.
           | 
           | You normally have a scrub time that can be configured in the
           | BIOS, which also adds a regular verification of the entire
           | RAM at regular intervals, just in case something goes wrong
           | in some rarely used part of the memory.
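
On Linux, the corrections that ECC and scrubbing make are usually visible through the kernel's EDAC sysfs counters. A minimal sketch for reading them, assuming the EDAC driver for the memory controller is loaded; `read_edac_counts` is a name invented for this example, and the function simply returns an empty dict when the directory is missing:

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from Linux EDAC sysfs."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # EDAC driver not loaded, or not a Linux system
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A corrected count that keeps climbing on one controller is the kind of early warning a non-ECC machine cannot give.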
        
             | IYasha wrote:
             | Unfortunately, background scrubbing significantly increases
             | power consumption and impacts performance as well.
        
               | metadat wrote:
               | Do you want (1) a higher rate of stable and correct
               | computations to be performed at a slightly higher energy
               | cost, or (2) a demonstrably less reliable device at a
               | slightly lower energy efficiency?
               | 
               | I'll go for #1 in most cases, as long as the system is to
               | be relied upon for anything deemed important.
        
               | IYasha wrote:
               | Me too, of course. I'm just highlighting downsides so
               | people know what to expect.
        
           | ilyt wrote:
           | Weirdly enough I had same case but memory turned out to be
           | fine, replacing power supply fixed the issue. I ran test, saw
           | memory is bad, replaced sticks, same problem, put the sticks
           | back in and decided to just run it (it was gaming PC).
           | 
           | A few months later the power supply outright died (it was ~8
           | years old at that point), replaced it with a good one, no
           | memory errors.
        
             | vladvasiliu wrote:
             | In my case it was clearly a bad RAM stick. Took it out, OK.
             | Switched them around: errors. Replaced it with a new one,
             | back to OK.
             | 
             | In this particular case, a bad PSU would be the end of the
             | PC. It's an HP desktop mini. Basically a laptop without a
             | screen, powered by an external adaptor that puts out a
             | single 12V line. All further conversions are done on the
             | motherboard somehow.
        
         | consp wrote:
         | Badly socketed ram was one of the reasons my PC started failing
         | after being on for a while. When everything was cool it was all
         | fine, when the case and everything heated up a bit it failed
         | eventually. Re-seating the ram fixed it and ran for quite a
         | time without issues. This was in the early Athlon days though.
        
       | lizardactivist wrote:
       | I'm curious what the actual manufacturing cost for ECC DRAM is
       | compared to regular DRAM. Is it considerably more expensive, or
       | just the usual over-charge because it's better?
        
         | jeffbee wrote:
         | It's exactly 1/8th more.
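
The 1/8 figure lines up with the extra storage a standard ECC DIMM carries: 8 check bits alongside every 64 data bits, for a 72-bit-wide module. Whether the price tracks that ratio is a separate question. A rough sketch of where the 8 comes from, assuming a textbook Hamming code plus one overall parity bit (SECDED); `secded_check_bits` is a name made up for this illustration, and real memory controllers may use different codes:

```python
def secded_check_bits(data_bits):
    """Check bits for a Hamming code with one extra overall parity bit."""
    r = 0
    # Single-error correction needs 2**r >= data_bits + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # the extra parity bit adds double-error detection


for width in (8, 16, 32, 64, 128):
    check = secded_check_bits(width)
    print(f"{width:>3} data bits: {check} check bits ({check / width:.1%} overhead)")
```

For 64 data bits this gives 8 check bits, i.e. 12.5% overhead, and the relative overhead shrinks as the protected word gets wider.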
        
       | aortega wrote:
       | I have 128 GB of non-ecc memory in my notebook, never detected a
       | single error, and has been on 24/7 for more than 4 years.
       | 
       | Unless you live over 4000 meters over the sea level, like to
       | compile while flying or live close to an unshielded nuclear
       | reactor, you don't need ECC.
       | 
       | And most memory problems you can fix by better cooling, and
       | better shielding.
        
       | H8crilA wrote:
       | This problem has no ultimate solution. I've seen all components
       | flip bits, CPUs, networking cards, RAM, most often you just can't
       | know for sure what did it. You can remedy it a bit (like with
       | ECC), but ultimately there will always be corruption if you
       | process hundreds of petabytes of data. Get used to it, your
       | computer executes an instruction with a probability extremely
       | close to 1, but not equal to 1.
       | 
       | Deep in the archives of a well known tech company is a very well
       | documented case of a bit flip that caused the wrong function to
       | be executed in a C++ v-table. The big oof was that this function
       | was the equivalent of an SQL "drop table", and just happened to
       | be 32 bytes off of a very benign function that did something like
       | stat(). Really funny stuff once the crisis is over :)
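
The "32 bytes off" detail is consistent with a single flipped bit, since flipping bit 5 of an address or offset moves it by exactly 2^5 = 32. A tiny sketch of that arithmetic; the offset `0x40` and the name `single_bit_neighbours` are hypothetical and not taken from the incident described above:

```python
def single_bit_neighbours(offset, width=8):
    """All offsets reachable from `offset` by flipping exactly one of `width` bits."""
    return [offset ^ (1 << bit) for bit in range(width)]


benign = 0x40  # hypothetical v-table offset of the harmless function
neighbours = single_bit_neighbours(benign)
print([hex(o) for o in neighbours])
print("32 bytes away reachable:", benign + 32 in neighbours)
```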
        
         | dale_glass wrote:
         | ECC isn't a terribly complicated technology, and can be used in
         | all those cases.
         | 
         | In limited cases, a checksum is good enough. If you checksum
         | outgoing data, and verify it on reception, then it being
         | corrupted in transit whether on the network card or the cable
         | can be detected and transparently compensated for.
         | 
         | Really, we can do much better than to "get used to it".
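
The checksum-on-send, verify-on-receive idea can be sketched in a few lines. This is only an illustration using CRC-32 from Python's standard `zlib` module; the function names `frame`, `unframe` and `flip_bit` are invented here, and a real protocol would add retransmission on mismatch:

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload (4 bytes, big-endian)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unframe(message: bytes) -> bytes:
    """Return the payload, or raise ValueError if the CRC does not match."""
    payload, received = message[:-4], int.from_bytes(message[-4:], "big")
    if zlib.crc32(payload) != received:
        raise ValueError("checksum mismatch: ask the sender to retransmit")
    return payload


def flip_bit(data: bytes, bit: int) -> bytes:
    """Simulate a single bit flip in transit."""
    corrupted = bytearray(data)
    corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)


sent = frame(b"hello")
print(unframe(sent))  # intact message round-trips
try:
    unframe(flip_bit(sent, 3))
except ValueError as err:
    print("detected:", err)
```

This only detects corruption between `frame` and `unframe`; data that was already wrong before the checksum was computed passes verification.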
        
           | H8crilA wrote:
           | You are under the impression that CPUs and other chips always
           | perform the same instructions as are written in the code, and
           | only RAM can flip bits because DRAM is DRAM :)
           | 
           | It can (and should! whenever possible) be improved, not
           | fixed. There's always that pesky gamma that can hit a
           | specific transistor, even if it is deep underground. Gamma
           | cannot be fully stopped. At certain scales data corruption
           | becomes directly measurable. And yes, corruption levels vary
           | between pieces of hardware.
        
             | ilyt wrote:
             | Sure but once your registers, cache, data bus and address
             | bus has ECC you have vastly smaller area that can flip.
             | 
             | You can even _just buy_ (well, chipaggedon aside) ARM cores
             | that have 2 chips running in parallel and faulting when the
             | result is different
        
               | my123 wrote:
               | > You can even just buy (well, chipaggedon aside) ARM
               | cores that have 2 chips running in parallel and faulting
               | when the result is different
               | 
               | See dual-core lock-step Arm chips (used for automotive).
        
             | dale_glass wrote:
             | Of course not. I'm not saying we can have perfection. I'm
             | saying that we can do much better, using methods and
             | technologies that are very old at this point.
             | 
             | The reason why we don't is laziness and market
             | segmentation, mostly.
        
             | sobriquet9 wrote:
             | The probability of a bit flip depends on the size of the
             | transistor used. RAM tends to pack many small transistors.
        
         | don-code wrote:
         | Is this something that's documented publicly? I'd love to read
         | more.
        
         | fpoling wrote:
         | One can process petabytes without bit flips if one uses proper
         | checksums and error correction codes. While that does impose
         | overhead, it is not big and, thanks to Shannon's theorem, can be
         | made arbitrarily small with sufficiently big blocks.
        
       | IYasha wrote:
       | It would be so nice if DD and DDRescue did calculate hashes while
       | copying.
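
Until the tools do it themselves, hashing during the copy is easy to approximate. A minimal sketch, assuming plain files or device nodes readable with `open`; `copy_with_hash` and `file_hash` are names made up for this example and are not part of dd or ddrescue:

```python
import hashlib


def copy_with_hash(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst, returning the SHA-256 of the bytes that were read."""
    digest = hashlib.sha256()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while chunk := fin.read(chunk_size):
            digest.update(chunk)  # hash the chunk before writing it out
            fout.write(chunk)
    return digest.hexdigest()


def file_hash(path, chunk_size=1024 * 1024):
    """SHA-256 of a file, read back in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `copy_with_hash(src, dst)` against a later `file_hash(dst)` flags a copy that changed after it was hashed. It cannot catch a bit that flipped in the read buffer before hashing, and the read-back may be served from the page cache rather than the disk.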
        
       | chunk_waffle wrote:
       | For anyone curious (as I was once and looked into it) Dell and
       | Lenovo both ship Laptops with Xeon's and ECC memory though they
       | are very expensive.
        
         | thekombustor wrote:
         | Yep, you can get it on the higher end trims of the P-series
         | Thinkpads.
        
       | TacticalCoder wrote:
       | > Why ECC Memory Is So Important
       | 
       | Except that it's not for many use cases. It's great for servers
       | but for people on their personal and/or work computer, it's
       | simply not that useful.
       | 
       | Seriously: which percentage of developers have ECC on their
       | development machine(s)?
       | 
       | As developers we live in a world of SSH, cryptographic hashes,
       | checksums everywhere, Git repositories (that is a big one),
       | Merkle trees, digital signatures, reproducible builds (which are
       | gaining traction), etc.
       | 
       | Heck, I'm torrenting the latest Debian or Devuan .iso image. My
       | torrent client is using every known trick under the sun to make
       | sure that should anything go wrong, the broken data shall be
       | discarded and re-downloaded. Download is done, I dd the image to
       | some installation medium. I can then verify its checksum matches
       | the official one. A bit flip didn't slip by unnoticed.
       | 
       | All the music I carefully ripped from my audio CDs? They're all
       | cross-checked with an online DB of known bit-perfect rips.
       | There's an accompanying file containing each song's hash and I
       | can verify at anytime that all my files are 100% correct.
       | 
       | But really most of all I live in a world of Git repositories. My
       | entire Emacs config is versioned under Git (I know YMMV but I
       | like it that way). Some people version under Git their entire
       | user dir.
       | 
       | Tell me how my lack of ECC is going to really make life miserable
       | here?
       | 
       | I have nothing _against_ ECC... But if I want to upgrade my AMD
       | 3700X to a 7700X, apparently I cannot get ECC.
       | 
       | And that's totally fine: I certainly won't discard the 7700X
       | because I cannot get ECC for it.
       | 
       | And if _anything_ looks suspicious, running Memtest is the first
       | thing you should do.
       | 
       | I've had bad RAM at times. I'm still there.
        
         | craftkiller wrote:
         | > Seriously: which percentage of developers have ECC on their
         | development machine(s)?
         | 
         | Every single one of my non-mini computers uses ECC ram except
         | for my laptop. If someone would release a framework laptop
         | motherboard that supports ECC ram (and preferably risc-v) I'd
         | finally be able to close the reliability gap. It blows my mind
         | that we say "Well sure, we COULD make infallible ram, but that
         | would cost a tiny bit extra so instead let's just hope nothing
         | bad happens." That's right up there with not wearing a seat-
         | belt because I haven't needed one yet.
        
           | digitallyfree wrote:
           | I develop on VMs on my server (with ECC and RAIDZ) which i
           | access using SSH or VDI protocols. I would love to have ECC
           | on all my machines but that isn't feasible until the status
           | quo changes, so I stick with the remote approach. To me
           | that's an acceptable tradeoff as the non-ECC desktops/laptops
           | are just used as dumb terminals while the real work happens
           | on the reliable server.
           | 
           | I can't speak for ECC at the moment but ZFS has definitely
           | saved me from data corruption that would have been left to
           | manifest otherwise.
        
           | Brian_K_White wrote:
           | "If someone would release a framework laptop motherboard that
           | supports ECC ram..."
           | 
           | This. If I could, I would. The only reason my laptop doesn't
           | have ecc is because the manufacturer doesn't offer the option
           | in any machines I otherwise want.
           | 
           | That comment was very misguided in trying to suggest that
           | there is any valid excuse to tolerate unreliable execution
           | hardware. git and ssh and md5sums do _not_ mean that it's ok
           | if your very brain can't be trusted to deliver data from one
           | part to another within itself, or spit back the same data
           | that was put in a cell. Everything else is built _upon_ that!
        
         | leguminous wrote:
         | Checksums don't save you from memory corruption. If your data
         | gets corrupted in memory, you will just end up checksumming and
         | committing bad data. Or your checksum could get corrupted, and
         | you commit a checksum that doesn't match your data. Checksums
         | are more useful for safeguarding against disk or network
         | corruption (although you shouldn't have network corruption
         | issues over TLS or SSH).
         | 
         | Apparently Ryzen 7000 cpus can use ECC. I've heard reports that
         | AMD needs to release an AGESA update, though, and ECC DDR5
         | memory availability is terrible. I'm hopeful that the situation
         | will improve, because I also want to update my desktop. I've
         | been using ECC memory since losing a filesystem on a desktop
         | when a DIMM went bad.
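         | 
         | A minimal Python sketch of the checksum point above, assuming
         | SHA-256 as the checksum and a hypothetical flip_bit() helper
         | standing in for a RAM error (neither comes from the thread):

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (hypothetical RAM-error stand-in)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


original = b"important document contents"

# Case 1: the buffer is corrupted in memory *before* it is checksummed.
in_memory = flip_bit(original, 3)
stored_digest = hashlib.sha256(in_memory).hexdigest()
# A later verification passes, although the data is not what was written.
verifies = hashlib.sha256(in_memory).hexdigest() == stored_digest
matches_original = in_memory == original

# Case 2: the data is fine but the stored checksum itself gets corrupted.
good_digest = hashlib.sha256(original).digest()
bad_digest = flip_bit(good_digest, 0)
false_alarm = hashlib.sha256(original).digest() != bad_digest

print(verifies, matches_original, false_alarm)
```

         | In the first case the bad data verifies against its own
         | digest; in the second, good data is flagged as corrupt.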
        
         | dale_glass wrote:
         | > But really most of all I live in a world of Git repositories.
         | My entire Emacs config is versioned under Git (I know YMMV but
         | I like it that way). Some people version under Git their entire
         | user dir.
         | 
         | > Tell me how my lack of ECC is going to really make life
         | miserable here?
         | 
         | Git will break if RAM is bad just like anything else. That it
         | checksums everything won't save you from checking in corrupt
         | data, the filesystem itself being corrupt, or some internal git
         | structure becoming corrupt. Losing your repo because something
         | in it was written wrong is very much a possibility.
         | 
         | Having multiple machines involved helps, but it's not a
         | complete fix, because the possibility exists of something
         | damaged being transmitted from a broken machine to a good one,
         | ensuring there's no good copy anywhere.
         | 
         | There's really nothing software can do to operate correctly
         | with bad RAM all of the time. Instructions for the software are
         | in RAM. The OS that the software expects to behave right is in
         | RAM. Various buffers used for disk access and networking are in
         | RAM. An application like git assumes all of that is performing
         | correctly, and can't compensate for every possible malfunction
         | that could happen.
        
           | everybodyknows wrote:
           | How trustworthy is git-fsck for detecting random-bit
           | corruption?
        
             | dale_glass wrote:
             | Depends on how you look at it.
             | 
             | For actual verification, unless there's a bug my
             | understanding is that it's very trustworthy. But by that point
             | it's already too late. Okay, you know something is broken,
             | but that won't give your good data back.
             | 
             | But that only tells you that Git data is intact and that
             | all the hashes match. If git got a corrupted file to start
             | with, then correctly hashed it, everything will verify 100%
             | and still be broken.
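             | 
             | A rough Python sketch of that last case, assuming a
             | default SHA-1 repository, where a blob ID is the SHA-1 of
             | a "blob <size>" header, a NUL byte, and the content. The
             | file content and the flipped bit are made up:

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Blob object ID as git computes it in SHA-1 repos: header, NUL, then content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


intended = b"total = 100\n"

# Simulate one bit flipping in RAM before git reads the file:
# byte 8 is '1' (0x31); flipping bit 3 turns it into '9' (0x39).
buf = bytearray(intended)
buf[8] ^= 0x08
corrupted = bytes(buf)

stored_id = git_blob_id(corrupted)               # what gets committed
fsck_ok = git_blob_id(corrupted) == stored_id    # a later re-hash still matches
same_as_intended = stored_id == git_blob_id(intended)

print(corrupted, fsck_ok, same_as_intended)
```

             | The re-hash matches, so verification is happy, yet the
             | stored content is not what was intended.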
        
         | rhn_mk1 wrote:
         | > which percentage of developers have ECC on their development
         | machine(s)?
         | 
         | That question won't help you evaluate the demand for ECC simply
         | because the supply is strangled. Those who want it have to make
         | compromises to get ECC: get a Xeon, or get one of few AMD
         | motherboards with matching CPUs and overpriced RAM without a
         | guarantee that it will end up working.
        
           | simoncion wrote:
           | > ...get one of few AMD motherboards with matching CPUs and
           | overpriced RAM...
           | 
           | I mean, if you want a _guarantee_, then sure, get one of
           | those certified-for-ECC motherboards.
           | 
           | But, like, as far as I know, going _at least_ as far back as the
           | Phenom II (released in 2008) AMD desktop processors have
           | always supported ECC RAM. And -as far as I know- ASUS
           | motherboards for said processors have always supported
           | dropping in ECC RAM (and Linux and memtest and friends have
           | always agreed that ECC was enabled and functioning in such a
           | system).
           | 
           | Source: Personal experience with Phenom II, Threadripper, and
           | Ryzen 5 CPUs and ASUS motherboards, and looking-from-a-
           | distance at the rest of the AMD CPUs between the Phenom and
           | the Ryzen 5.
        
         | tasubotadas wrote:
         | When people complain about these random OS crashes and freezes
         | it's usually RAM corruption at fault.
         | 
         | Per 1gb of RAM, you can expect to see 266 bit errors per
         | month[1-2] if you are using your PC 16h per day. Multiply that
         | by 64GB or 128GB of RAM and it's crazy to think that you won't
         | run into any of the stability issues.
         | 
         | [1-2]
         | https://static.googleusercontent.com/media/research.google.c...
         | [1-2] https://en.wikipedia.org/wiki/ECC_memory#Research
        
           | pflanze wrote:
           | > Per 1gb of RAM
           | 
           | The Google study you linked says "25,000 to 70,000 errors per
           | billion device hours per Mbit".
           | 
           | Assuming bits, {25000 to 70000}/1e9 * 30 * 16 * 1000 = 12 to
           | 33.6 bit errors per month. Assuming bytes, it's 96 to 268
           | bit errors per month.
           | 
           | Apparently you've meant bytes, not bits. (I'm a pedant, but
           | was also just unsure and interested in the numbers.)
           | 
           | FWIW, a comment I ran across that confirms toast0's view[1]:
           | "It's a bimodal distribution - you either have many errors
           | (due to a defect somewhere) or basically zero. If you're on
           | the good side of the distribution, with only extremely rare
           | errors, then you probably don't need ECC. But without ECC,
           | you don't know whether you need ECC!"
           | 
           | [1]
           | https://www.realworldtech.com/forum/?threadid=198497&curpost...
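           | 
           | The arithmetic above, spelled out as a small Python sketch.
           | The rates are the quoted study figures; 30 days at 16 h/day
           | and 1000 Mbit per Gbit are the assumptions used above:

```python
# Quoted range: errors per billion device-hours per Mbit.
RATE_LOW, RATE_HIGH = 25_000, 70_000
HOURS_PER_MONTH = 30 * 16        # 30 days at 16 hours per day
MBIT_PER_GBIT = 1000
MBIT_PER_GBYTE = 8 * 1000        # 8 bits per byte


def errors_per_month(rate: int, mbit: int) -> float:
    """Scale a per-Mbit, per-1e9-hour rate to one month of use."""
    return rate * HOURS_PER_MONTH * mbit / 1e9


for label, mbit in (("1 Gbit", MBIT_PER_GBIT), ("1 GByte", MBIT_PER_GBYTE)):
    low = errors_per_month(RATE_LOW, mbit)
    high = errors_per_month(RATE_HIGH, mbit)
    print(f"{label}: {low:.1f} to {high:.1f} errors per month")
```

           | The upper end of the byte reading lands near the 266 figure,
           | which is why bytes looks like the intended unit.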
        
           | ubercow13 wrote:
           | >266 bit errors per month
           | 
           | That seems like an overestimate. How can memtest ever pass on
           | non-ECC RAM if errors are that frequent?
        
             | moloch-hai wrote:
             | Because memtest only looks at values it wrote out very
             | recently, before they have had a chance to flip.
             | 
             | Memtest is looking for reliable failures, not evanescent
             | one-off events.
        
               | willis936 wrote:
               | SDRAM is continuously refreshing all cells. How long ago
               | data was written doesn't make a big difference (aside
               | from the case where you're reading data immediately after
               | writing or reading that data).
        
               | moloch-hai wrote:
               | This does not, of course, make any sense. Any given
               | memory cell will be read once per, say, millisecond, and
               | the value written back. Once it has flipped once, the
               | wrong value will then be written back, and after that the
               | wrong value is read back out and rewritten again,
               | indefinitely. Errors are sticky, and accumulate.
               | (Flipping back again is negligibly likely.)
               | 
               | With ECC in the refresh path, such an error could be
               | corrected and the right value written back over the bad
               | one. Then errors would not accumulate, but
               | would instead be "scrubbed". Mainframe machines scrub
               | their RAM. Disks too.
        
               | ilyt wrote:
               | > SDRAM is continuously refreshing all cells
               | 
               | ...so ? we're not talking about slow deterioration (which
               | is why refresh is needed), we're talking about bit flip
               | from cosmic rays where cell changes state completely.
        
               | [deleted]
        
           | toast0 wrote:
           | I don't think this rate is reasonable to use in this manner.
           | I've run thousands of servers, with lots of ECC ram (quite a
           | few servers had 768GB of ram, but most were more like 32 or
           | 64). The vast majority ran for years with zero reported
           | errors. A small handful would have major errors of thousands
           | in an hour, but we would replace for hundreds in an hour. A
           | couple servers developed a periodic report of one or two
           | (correctable) errors per day. If 266 bit errors per month per
           | 1Gb was a usable rate, all of our servers would have been
           | throwing ECC correctable errors all the time.
           | 
           | But I didn't have time or desire to publish a study on our
           | experience, so there's no hard numbers.
        
         | kevingadd wrote:
         | I went out of my way for ECC because losing files can mean
         | losing days of work. With how much a sweng gets paid it's worth
         | the ECC premium if it saves me a few days of work over the life
         | of the machine.
         | 
         | Recently I discovered that one of my SSDs was quietly failing
         | without setting off any warnings; doing a chkdsk showed that
         | some files had already gotten corrupted. One of them was my
         | backblaze backup index!
         | 
         | Even though I have automated backups (backblaze + macrium
         | backups to a NAS), recovering files from them is non-trivial.
         | If I were to lose work to non-ECC ram who knows how long it
         | would take me to reconstruct a known-good work environment and
         | file set. Imagine if you're working on something huge and hard
         | to validate like neural net weights where corruption can occur
         | silently and be hard to detect after it's happened?
        
           | newZWhoDis wrote:
           | My favorite is when data silently corrupts, and then is
           | happily propagated to your off-site recovery :/
        
             | kevingadd wrote:
             | Yeah, I don't know when the drive started failing, so in
             | practice my backups are probably all screwed too. My only
             | option would be to get a spare drive, restore a backup to
             | it, then do a block level diff of the current and backup
             | volumes and try to figure out whether any of the
             | differences are file corruption.
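             | 
             | A minimal sketch of such a block-level diff in Python,
             | assuming both volumes have been dumped to image files. The
             | differing_blocks() name, the 4096-byte block size and the
             | demo files are all made up for illustration:

```python
import os
import tempfile
from typing import List


def differing_blocks(path_a: str, path_b: str, block_size: int = 4096) -> List[int]:
    """Return indices of fixed-size blocks whose bytes differ between two files."""
    diffs = []
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        index = 0
        while True:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if not block_a and not block_b:
                break                      # both files exhausted
            if block_a != block_b:
                diffs.append(index)
            index += 1
    return diffs


# Demo on two tiny stand-in "images" that differ by a single bit.
with tempfile.TemporaryDirectory() as tmp:
    backup = os.path.join(tmp, "backup.img")
    current = os.path.join(tmp, "current.img")
    data = bytes(range(256)) * 64          # 16 KiB, i.e. four 4096-byte blocks
    damaged = bytearray(data)
    damaged[5000] ^= 0x01                  # byte 5000 lies in block 1
    with open(backup, "wb") as f:
        f.write(data)
    with open(current, "wb") as f:
        f.write(damaged)
    changed = differing_blocks(current, backup)

print(changed)
```

             | This only narrows down where the volumes differ. Deciding
             | which side is the corrupt one is still manual work.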
        
               | Filligree wrote:
               | Probably I'm saying nothing you haven't thought of, but
               | ZFS is great for preventing this. If you also pair it
               | with ECC, then you've eliminated most ways to cause
               | corruption.
        
         | rwmj wrote:
         | But if I open up an editor and write a document then that could
         | be corrupted in RAM and then the corrupted data saved to disk.
         | The document is likely to be more important than some ripped
         | CDs or a git repo that I can download again.
         | 
         | The CPU, RAM and mobo manufacturers need to get together and
         | make ECC RAM mandatory. It's absurd that we have machines with
         | gigabytes of storage using microscopic (nanoscopic??)
         | capacitors that don't have this basic protection. And
         | honestly this should have happened _years_ ago.
         | 
         | (Edit: And before anyone says DDR5 is ECC by default, that's
         | not quite true although the difference is a bit subtle:
         | https://en.wikipedia.org/wiki/DDR5_SDRAM#DIMMs_versus_memory...)
        
           | ilyt wrote:
           | >But if I open up an editor and write a document then that
           | could be corrupted in RAM and then the corrupted data saved
           | to disk. The document is likely to be more important than
           | some ripped CDs or a git repo that I can download again.
           | 
           | Devil's advocate: if it's just some bits in a character that
           | flipped, it's entirely recoverable, while flipping some bits
           | in a compressed video stream would corrupt more.
        
             | rwmj wrote:
             | Agreed. One problem with RAM errors (which I've actually
             | experienced) is they are insidious. You probably won't
             | notice them immediately so the error can be propagated, and
             | they're very very difficult to diagnose if you do notice
             | them.
             | 
             | Back in the 90s we had a database server which had a stuck
             | bit in memory normally mapped to the page cache. This
             | caused sectors to be written to the backing software RAID
             | which couldn't be read back in (because I think some
             | checksum was corrupted when written and then failed when
             | read back). It took an absolute age to diagnose this. I
             | think I only worked it out by eliminating everything else.
        
         | justsomehnguy wrote:
         | And condoms are only 99.8% effective, so let's not use them? You
         | can always pull out in time or verify checksu^W^W Plan B?
         | 
         | > But really most of all I live in a world of Git repositories
         | 
         | That's great _for you_, but 99% don't even know what Git is,
         | just like _checksums, cryptographic hashes, Merkle trees,
         | digital signatures and reproducible builds_.
         | 
         | You just miss the one important thing: ECC isn't that helpful
         | where _you have_ the means to check and verify the data. ECC is
         | the only way to at least know that something is happening with
         | data when there is no way to check.
         | 
         | To give you a slight idea I would tell you an anecdote from my
         | L1 life almost two decades ago:
         | 
         | I visited a client who claimed that the PC was working
         | erratically and constantly threw weird error messages.
         | 
         | Welp, the usual deal, just some ugly software or a virus. In
         | the first 3 minutes I got like 5 errors about failing to load a
         | .dll from C:\WINDOWS\SYSTEM33\USER32.DLL, nothing unusual, just
         | need to.. WAIT. Why system _33_? Stupid virus masking for a
         | well-known folder? Doubt. So I go to C:\Windows and I see:
         | System32       System33       System34
         | 
         | If you are a smart fellow you probably already understood that
         | this was a bit-flip error which somehow managed to be in the
         | in-memory copy of the MFT of the system drive. And this is the only
         | reason the user noticed it - because sometimes programs
         | wouldn't start and sometimes there were weird error messages. If
         | that error was in the data area of some program - it wouldn't
         | be discovered at all.
        
           | [deleted]
        
           | dmitrybrant wrote:
           | Hang on... to go from "System32" to "System33" is one bit
           | flip, but to go from "System33" to "System34" is three bit
           | flips at once. Doesn't that seem astronomically more likely
           | to be a malicious program than bad RAM?
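           | 
           | A quick Python check of those counts, assuming the names
           | are compared as plain ASCII (bit_flips() is just a helper
           | name made up here):

```python
def bit_flips(a: str, b: str) -> int:
    """Count differing bits between two equal-length ASCII strings."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    pairs = zip(a.encode("ascii"), b.encode("ascii"))
    return sum(bin(x ^ y).count("1") for x, y in pairs)


# '2' is 0x32, '3' is 0x33 and '4' is 0x34 in ASCII.
print(bit_flips("System32", "System33"))  # 1
print(bit_flips("System33", "System34"))  # 3
print(bit_flips("System32", "System34"))  # 2
```

           | Measured from the original "System32" rather than from
           | "System33", "System34" is two flips away.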
        
       | stefantalpalaru wrote:
       | [dead]
        
       | baeaz wrote:
       | [flagged]
        
       | mrlonglong wrote:
       | We've got a server that keeps rebooting due to a bad ECC DIMM
       | chip. I thought the whole point of ECC was to keep the server
       | going until we can replace the DIMM?
        
         | AaronFriel wrote:
         | ECC can typically correct 1 bit errors and usually detect (and
         | fault) on 2 bit errors.
         | 
         | Your server is faulting and preventing itself from corrupting
         | data.
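         | 
         | A toy Python sketch of that correct-one, detect-two behaviour,
         | using an extended Hamming(8,4) code. This is only an
         | illustration: real ECC DIMMs protect much wider words, and
         | the encode()/decode() names are made up here:

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as Hamming(7,4) plus one overall parity bit."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                     # index 0: overall parity; 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2        # makes the total number of 1 bits even
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, value): correct one flipped bit, refuse on two."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos            # XOR of set positions is 0 for a valid word
    parity_ok = sum(code) % 2 == 0
    fixed = list(code)
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        fixed[syndrome] ^= 1           # single flip; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return "uncorrectable", None   # two flips: detected, cannot be repaired
    value = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, value


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[6] ^= 1

print(decode(word), decode(one_flip), decode(two_flips))
```

         | The "uncorrectable" branch is the analogue of the server
         | faulting instead of carrying on with bad data.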
        
       ___________________________________________________________________
       (page generated 2022-12-25 23:00 UTC)