[HN Gopher] Non-ECC memory corrupted my hard drive image [video] ___________________________________________________________________ Non-ECC memory corrupted my hard drive image [video] Author : zeristor Score : 82 points Date : 2022-12-25 11:17 UTC (11 hours ago) (HTM) web link (www.youtube.com) (TXT) w3m dump (www.youtube.com) | tibbydudeza wrote: | Got a Gen 10 HPE Microserver for my NAS run some AMD dual core | SoC - factory fitted with ECC memory running Unraid. | | Think it mysteriously crashed once or twice in the 4 years I had | it and the HP diagnostic light came on. | encryptluks2 wrote: | I watched the full video. It was long but very informative. The | humor at times made up for the length and the presenter showed a | lot of deep knowledge that most people won't have. My biggest | gripe is that they just didn't try replacing the RAM sticks in | the first place. I get that they wanted to do a root cause | analysis, but geez the time and patience they had to do all those | memory tests. No wonder they did a video about it, cause | otherwise that lost time would have been painful. I was baffled | as well that dd and ddrescue work differently in how they utilize | the RAM. Caught me offguard. | | Onto the discussion of ECC RAM. In a perfect world, all memory | would be ECC... but try finding some high performance 16GB sticks | of ECC DDR4 RAM like what you'll see on gaming computers. I don't | even think they make anything comparable in terms of speed and | definitely not costs. I guess you don't really know that you | needed ECC until it's too late. | morelikeborelax wrote: | > " I guess you don't really know that you needed ECC until | it's too late." | | I spent many years on hardware consultation and was amazed at | the all the times I had to explain it was just a what if | insurance like any other things their business was mitigating | against. 
Sometimes they'd even decided they needed to save | costs in non-ecc ram when it was $4 a gb in difference, or | (during the FB-DIMM era) there wasn't even an option to avoid | it. | | Never really understood the resistance towards it. | | Maybe the lack of evidence before the Google study and people | thinking RAM manufacturers were trying to rip them off or | something. | | The "never had a problem so why would I need it" attitude with | no way to know if an issue was caused by a bit flip was most | baffling. | simoncion wrote: | > ...but try finding some high performance 16GB sticks of ECC | DDR4 RAM like what you'll see on gaming computers. | | Here ya go: | | https://nemixram.com/16gb-ddr4-3200-pc4-25600-ecc-udimm-2rx8... | | It doesn't have pretty lights on it, but it does seem to be in | the same speed class that gets called "gaming RAM" by a _whole_ | bunch of retailers. | jeroenhd wrote: | It's good that with DDR5 consumer memory will get some super | basic ECC on die, so hopefully the next generation of memory will | make the problematic sticks more obvious (or prevent damage at | the very least). ECC won't save you from memory corruption, but | it'll save your data at least. | | Personally, I would've just checksummed the individual failing | files rather than the disk image and only backed up the bad files | separately. There are all kinds of ways for a disk image to fail | and I wouldn't spend a second longer on it than absolutely | necessary. The whole memtest permutation setup also would've been | too much work for me; I would just declare the motherboard faulty | when two sticks that otherwise pass the test fail in specific | configurations. A new motherboard is cheaper than super specific | RAM sticks. | tibbydudeza wrote: | Afaik ECC memory is slower than normal memory, so it does not | impress the folks who base their purchase decisions on benchmark | scores rather than utility and best bang for the buck.
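A rough sketch of the per-file checksumming jeroenhd describes, for anyone who wants to try it. This is only an illustration under assumptions: the function names (`sha256_of`, `build_manifest`, `mismatched_files`) are made up for this example, SHA-256 is an arbitrary choice of hash, and it is not what the video's author did. It hashes every file under two directory trees (say, two mounted copies of the same image) and lists the files that differ, so only those need to be rescued separately.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash one file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def mismatched_files(reference: dict[str, str], candidate: dict[str, str]) -> list[str]:
    """Return files that are missing from, or differ in, the candidate copy."""
    return sorted(
        name for name, digest in reference.items()
        if candidate.get(name) != digest
    )
```

Usage would be along the lines of `mismatched_files(build_manifest(Path("/mnt/copy_a")), build_manifest(Path("/mnt/copy_b")))`, where both mount points are hypothetical. Note the caveat raised elsewhere in the thread: if the hashing itself runs on bad RAM, the digests are only as trustworthy as the machine computing them.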
| ksec wrote: | This is especially true for NAS. And why you need BTRFS or | preferably ZFS. Unfortunately none of the consumer NAS offers | ZFS, and BTRFS is still not a default option. Neither Synology nor | Qnap seems to care. | aborsy wrote: | Synology offers ECC in 2023 consumer models such as 923. | | Still the experience that synology btrfs provides is nowhere near as | good as ZFS (due to a lot of limitations). | metadat wrote: | I bought a 6 disk Synology a few months ago and it came with | ECC by default. I did a cursory web search about this just | now and ECC support appears to be the norm for 22 (as in the | year 2022) model revisions and newer (thankfully!). | layer8 wrote: | It's because they use AMD CPUs now (already in the 21 | models). The trade-off is that those CPUs have worse | hardware codec support than the Intel ones they previously | used, if you want to do video transcoding. | [deleted] | gmokki wrote: | I had RAM go bad after running 18T (5 HDDs in raid1) btrfs | system in closet for years. Btrfs of course noticed it and | fixed most errors automatically when some of the blocks were | corrupt. But eventually the system failed: the tree that | contains the checksums for all the other trees corrupted itself | on both copies of one node. Fixed the HW problem and then had | to use hex editor to set the checksum manually to correct value | (I modified the kernel to print the expected value). Now the | system has been again stable for 3 years. | [deleted] | moloch-hai wrote: | We don't have ECC mainly because Intel has long been hostile to | "consumer" access to ECC. | | _Apparently_ this was conceived as a market segmentation scheme: | people outfitting servers could get ECC when they pay a huge | premium. They would thereby not be tempted to cheap out and buy | consumer-grade equipment, otherwise wholly adequate to meet all | their needs at a radically cheaper price.
| | That we cannot get laptops or even desk machines with ECC, and so | have them crash frequently, is seen as a trivial side effect of | the strategy. If you did not hate Intel enough before, you may | increase your hatred accordingly. Intel doesn't hate you back; | they simply care not even a little how you feel. | | (Historically, just running Microsoft software was overwhelmingly | more likely to be the cause of a crash than a memory bit-flip; | and there were orders of magnitude fewer RAM bits at risk. | Microsoft succeeded in getting customers to accept and even | expect frequent crashes; before MS, a program crashing was | grounds for a refund.) | helf wrote: | I see this a lot but I really don't think it was/is that | simplistic. | | It's added complexity and cost for something that rarely would | benefit most consumers. Now, you can argue that the complexity | and cost is a nonissue on modern setups and I would probably | agree. | | But Intel has long had desktop grade hardware with ECC support. | The 440GX chipset supported ECC and I ran a Dell GX1 SFF with | 768MB of ECC PC100 for yeeeears with a 450 MHz P3 and later | upgraded to 1.4ghz tualatin-256 via a slotket adapter. | | The 440HX /socket 7/ chipset supported ECC. And that's a | Pentium 1 chipset. | | The 440BX/GX and 450NX supported ECC and that's with desktop | pentium 2 and 3 chips. | | The 820/820E/840 supported ECC with desktop celeron and | pentium2/3 chips | | 845/845e/850/850e/860 pentium4 chipsets support ECC | | 875/e7205/e7221/e7230 did with desktop pentium 4 and pentium d | chips | | 925/925xe/955x/975x did with desktop pentium 4/pentium d/core 2 | | It's more sparse now that they moved to the IMC, granted. But | Intel has long had multiple chipsets per generation with ECC | support for desktop grade hardware. | eternityforest wrote: | I used to be a big fan of Intel, up until the latest chips from | other companies that seem to have beat them on | performance/watt. 
My next laptop will probably be AMD if the | situation hasn't changed. | vladvasiliu wrote: | I'm happy with my AMD based laptop. But I haven't seen any | that support ECC. | | But I did see a Lenovo model, IIRC, that had some kind of | Xeon and ECC. Not sure what the noise and battery life | situations on that thing were, though. | OJFord wrote: | I realise that's blurrier when it comes to laptops, but | AIUI it's more a case of whether the motherboard | supports it than about the AMD chip. i.e. given a desktop | CPU, as far as I know you can put it in a motherboard that | either does or does not support ECC RAM. | eternityforest wrote: | I'm surprised nobody has made RAM with the ECC logic | built into the ram itself, that just looks like normal | ram to the CPU. | IYasha wrote: | Having ECC being checked inside the CPU is actually | useful as data loss may be induced by EMI (and other | factors) on PCB data lines. | AdrianB1 wrote: | It is called DDR5 - it has ECC built in the module | itself. Making such a module does not make much sense if | you cannot report the rate of errors, so if it is just | hiding you have a bad RAM stick there is only so much | value in having ECC. | simoncion wrote: | It is my understanding that the ECC that you're talking | about only protects data-in-flight between the module and | whatever is reading or writing the data. It does not | protect against corruption of data-at-rest, which is what | is protected with ECC in DDR4 and older. | | It's also my understanding that the DDR5 data-in-flight | ECC is a _mandatory_ feature because the link between the | memory modules and everything else is so error-prone that | the system would simply not function without it. | eternityforest wrote: | Taking a quick glance at the articles, I think it's the | opposite, DDR5 protects data at rest only, because they | want to make the chips so unreliable it can't work | without it, not the bus.
| | But in practice, it will probably be more reliable than | DDR4 without ECC, since now you need 2 cosmic ray flips, | or 1 plus a manufacturing defect flip, and the defect | flips will probably be uncommon-ish. | | It's too bad data in flight isn't protected without old | fashioned ECC on top of that, but it will probably be a | big step up, the same way that flash memory is now very | reliable even though the actual uncorrected errors are | probably worse under the hood. | toast0 wrote: | The problem with the DDR5 approach is there's no | reporting mechanism, so while it will reduce the error | rate of a marginal module, it doesn't let you know so you | can replace it. In my experience with ECC modules, a | module with some errors is a lot more likely to get more | errors than one that's operating with zero errors. | my123 wrote: | Ryzen APUs, which include almost all AMD laptops, | actually have ECC fused off in silicon unless you buy the | "Pro" variant. | vladvasiliu wrote: | Huh. I didn't know that. | | My particular laptop does have a "pro" CPU. However, I | would be surprised to no end to learn that it supports | ECC. This particular model sports an MBP-level price tag | [0], but is absurdly cheaply built. Even for "customer | facing components", that are easy to compare, such as the | screen (terrible colors) and case (creaks if you look at | it wrong). HP doesn't offer ECC RAM, not even as an | upgrade, so I really don't think the additional lines are | physically present. | | --- | | [0] I don't remember the specific number, but it was | within 100 EUR of a 14" M1 MBP with 32 GB RAM and 512 GB | SSD. That's counting a RAM (8 -> 32) and SSD (256 -> 512) | upgrade which were made with components bought separately | (though they were rather high-end). | vladvasiliu wrote: | That's right, but seeing how laptops seem to do the bare-minimum, | I would be really surprised to learn that a | random model, _which doesn't advertise it_, actually | supports it.
| erk__ wrote: | > That we cannot get laptops or even desk machines with ECC | | The Xeon series of laptop processors does support ECC just at a | quite large premium. | gruez wrote: | > That we cannot get laptops or even desk machines with ECC, | and so have them crash frequently, is seen as a trivial side | effect of the strategy | | I'm not sure what you mean by "frequently", but my non-ECC | machines definitely do not crash "frequently". | | > before MS, a program crashing was grounds for a refund | | Source? | Brian_K_White wrote: | The problem with untrustworthy memory (or any other component) | is not that your system crashes, it's that it _doesn't_. | gruez wrote: | I don't doubt that non-ECC hardware experiences some non-zero | number of bitflips per year. I'm just doubting the | parent commenter's claim that non-ECC ram is causing | computers to crash "frequently". | layer8 wrote: | And the parent is pointing out that _not_ crashing on bit | flips is exactly the problem. | AdrianB1 wrote: | "frequently" is very subjective or relative in this context. | 25 years ago I had a crash per hour on almost any regular | computer, but zero crashes per month on servers with ECC. In | the past couple of years I think I had a few cases of frozen | apps, but I don't remember any OS level problem. At the | same time, on servers I see from time to time ECC fixing a | bit, but on the desktop or laptop I have no idea how many | times corrupted bits went undetected and what is the | consequence. | gruez wrote: | >25 years ago I had a crash per hour on almost any regular | computer, but zero crashes per month on servers with ECC | | If it's crashing once per hour, it's probably unstable | drivers/software or flaky hardware that needs to be RMAed, | not random bitflips. | navjack27 wrote: | I think something else is wrong if you've had a "crash" per | hour. | IYasha wrote: | For this reason for my first truly made-from-scratch home NAS I | went AMD64 with ECC UDIMMs.
It was some very basic Athlon64, | but it COULD do ECC. Since then I moved to Opterons and Xeons | but I still remember that choice. | Gordonjcp wrote: | > That we cannot get laptops or even desk machines with ECC, | and so have them crash frequently, is seen as a trivial side | effect of the strategy. | | How frequently would you say you encounter a crash that you can | pin down to a lack of ECC memory in your laptop or desktop? | Filligree wrote: | You can't, that's the thing, right? | | I have a Ryzen desktop with ECC, and it registers about one | bit-flip per week. I don't know how many of those would | become crashes, but I'm more worried about the ones that | wouldn't. | dale_glass wrote: | Yup, been there. | | Way back I had a Pentium 133 doing firewall duty in a closet. It | did approximately nothing besides iptables, but of course any | machine has logs, updates and so on going on. | | After running fine for months one day it suddenly died. I | rebooted it. A few days later it died again. Another reboot. Then | it died for the last time and failed to boot at all. Examination | showed the disk was corrupt and couldn't be mounted. Further | examination showed that one of the memory modules was loose for | some reason, could be that it was never firmly in and I just | bumped the box when messing with something else. | | Then came the wasted weekend of dealing with that my normal | internet connection relied on the thing that was now completely | broken. | | And that was the luckiest case I can imagine, when the broken | machine contains no data of actual value. Since then I'm very | paranoid, always run memtest on any new RAM I buy overnight, and | have ECC where it's possible to have it. | vladvasiliu wrote: | > Since then I'm very paranoid, always run memtest on any new | RAM I buy overnight, and have ECC where it's possible to have | it. | | Yeah, I do the same, but I've learned that you have to do it | regularly. 
| | In one of my desktop machines, the RAM ran fine for like two | years. Then, all of a sudden, random Firefox segfaults, etc. | | Whipped up a memtest ISO, and sure enough, one of the sticks | was bad. | dale_glass wrote: | That's the nice thing about ECC, it acts like an always | running memory test. | | You normally have a scrub time that can be configured in the | BIOS, which also adds a regular verification of the entire | RAM at regular intervals, just in case something goes wrong | in some rarely used part of the memory. | IYasha wrote: | Unfortunately, background scrubbing significantly increases | power consumption and impacts performance as well. | metadat wrote: | Do you want (1) a higher rate of stable and correct | computations to be performed at a slightly higher energy | cost, or (2) a demonstrably less reliable device at a | slightly lower energy efficiency? | | I'll go for #1 in most cases, as long as the system is to | be relied upon for anything deemed important. | IYasha wrote: | Me too, of course. I'm just highlighting downsides so | people know what to expect. | ilyt wrote: | Weirdly enough I had same case but memory turned out to be | fine, replacing power supply fixed the issue. I ran test, saw | memory is bad, replaced sticks, same problem, put the sticks | back in and decided to just run it (it was gaming PC). | | Few months later powersupply outright died (had ~8 years at | that point), replaced it with good one, no memory errors. | vladvasiliu wrote: | In my case it was clearly a bad RAM stick. Took it out, OK. | Switched them around: errors. Replaced it with a new one, | back to OK. | | In this particular case, a bad PSU would be the end of the | PC. It's an HP desktop mini. Basically a laptop without a | screen, powered by an external adaptor that puts out a | single 12V line. All further conversions are done on the | motherboard somehow.
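On the "always running memory test" point: on Linux, when an EDAC driver is loaded for the memory controller, corrected and uncorrected error counts are exposed through sysfs as `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/mcN/`. The snippet below is a minimal sketch of polling them; the function name `read_edac_counts` and the returned dictionary layout are invented for this example, and it assumes the EDAC driver is actually present and active on the machine.

```python
from pathlib import Path


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> dict[str, dict[str, int]]:
    """Return corrected/uncorrected error counts per memory controller.

    An empty dict means no EDAC memory controllers were found, which is
    what a machine without ECC (or without the driver loaded) would give.
    """
    counts: dict[str, dict[str, int]] = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for controller in sorted(base.glob("mc[0-9]*")):
        try:
            corrected = int((controller / "ce_count").read_text().strip())
            uncorrected = int((controller / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[controller.name] = {"corrected": corrected, "uncorrected": uncorrected}
    return counts
```

Calling `read_edac_counts()` from a cron job and alerting when `corrected` grows is one way to catch a stick that is starting to go bad before it produces uncorrectable errors.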
| consp wrote: | Badly socketed ram was one of the reasons my PC started failing | after being on for a while. When everything was cool it was all | fine, when the case and everything heated up a bit it failed | eventually. Re-seating the ram fixed it and ran for quite a | time without issues. This was in the early Athlon days though. | lizardactivist wrote: | I'm curious what the actual manufacturing costs for ECC DRAM is | compared to regular DRAM. Is it considerably more expensive, or | just the usual over-charge because it's better? | jeffbee wrote: | It's exactly 1/8th more. | aortega wrote: | I have 128 GB of non-ecc memory in my notebook, never detected a | single error, and has been on 24/7 for more than 4 years. | | Unless you live over 4000 meters over the sea level, like to | compile while flying or live close to an unshielded nuclear | reactor, you don't need ECC. | | And most memory problems you can fix by better cooling, and | better shielding. | H8crilA wrote: | This problem has no ultimate solution. I've seen all components | flip bits, CPUs, networking cards, RAM, most often you just can't | know for sure what did it. You can remedy it a bit (like with | ECC), but ultimately there will always be corruption if you | process hundreds of petabytes of data. Get used to it, your | computer executes an instruction with a probability extremely | close to 1, but not equal to 1. | | Deep in the archives of a well known tech company is a very well | documented case of a bit flip that caused the wrong function to | be executed in a C++ v-table. The big oof was that this function | was the equivalent of an SQL "drop table", and just happened to | be 32 bytes off of a very benign function that did something like | stat(). Really funny stuff once the crisis is over :) | dale_glass wrote: | ECC isn't a terribly complicated technology, and can be used in | all those cases. | | In limited cases, a checksum is good enough. 
If you checksum | outgoing data, and verify it on reception, then it being | corrupted in transit whether on the network card or the cable | can be detected and transparently compensated for. | | Really, we can do much better than to "get used to it". | H8crilA wrote: | You are under the impression that CPUs and other chips always | perform the same instructions as are written in the code, and | only RAM can flip bits because DRAM is DRAM :) | | It can (and should! whenever possible) be improved, not | fixed. There's always that pesky gamma that can hit a | specific transistor, even if it is deep underground. Gamma | cannot be fully stopped. At certain scales data corruption | becomes directly measurable. And yes, corruption levels vary | between pieces of hardware. | ilyt wrote: | Sure but once your registers, cache, data bus and address | bus has ECC you have vastly smaller area that can flip. | | You can even _just buy_ (well, chipaggedon aside) ARM cores | that have 2 chips running in parallel and faulting when the | result is different | my123 wrote: | > You can even just buy (well, chipaggedon aside) ARM | cores that have 2 chips running in parallel and faulting | when the result is different | | See dual-core lock-step Arm chips (used for automotive). | dale_glass wrote: | Of course not. I'm not saying we can have perfection. I'm | saying that we can do much better, using methods and | technologies that are very old at this point. | | The reason why we don't is laziness and market | segmentation, mostly. | sobriquet9 wrote: | The probability of a bit flip depends on the size of the | transistor used. RAM tends to pack many small transistors. | don-code wrote: | Is this something that's documented publicly? I'd love to read | more. | fpoling wrote: | One can process petabytes without bit flips if one use proper | checksums and error correction codes. 
While that does impose | overhead, it is not big and, thanks to Shannon's theorem, can be | made arbitrarily small with sufficiently big blocks. | water8 wrote: | [dead] | IYasha wrote: | It would be so nice if DD and DDRescue did calculate hashes while | copying. | chunk_waffle wrote: | For anyone curious (as I was once and looked into it) Dell and | Lenovo both ship laptops with Xeons and ECC memory though they | are very expensive. | thekombustor wrote: | Yep, you can get it on the higher end trims of the P-series | Thinkpads. | TacticalCoder wrote: | > Why ECC Memory Is So Important | | Except that it's not for many use cases. It's great for servers | but for people on their personal and/or work computer, it's | simply not that useful. | | Seriously: which percentage of developers have ECC on their | development machine(s)? | | As developers we live in a world of SSH, cryptographic hashes, | checksums everywhere, Git repositories (that is a big one), | Merkle trees, digital signatures, reproducible builds (which are | gaining traction), etc. | | Heck, I'm torrenting the latest Debian or Devuan .iso image. My | torrent client is using every known trick under the sun to make | sure that should anything go wrong, the broken data shall be | discarded and re-downloaded. Download is done, I dd the image to | some installation medium. I can then verify its checksum matches | the official one. A bit flip didn't slip by unnoticed. | | All the music I carefully ripped from my audio CDs? They're all | cross-checked with an online DB of known bit-perfect rips. | There's an accompanying file containing each song's hash and I | can verify at anytime that all my files are 100% correct. | | But really most of all I live in a world of Git repositories. My | entire Emacs config is versioned under Git (I know YMMV but I | like it that way). Some people version under Git their entire | user dir. | | Tell me how my lack of ECC is going to really make life miserable | here?
| | I have nothing _against_ ECC... But if I want to upgrade my AMD | 3700X to a 7700X, apparently I cannot get ECC. | | And that's totally fine: I certainly won't discard the 7700X | because I cannot get ECC for it. | | And if _anything_ looks suspicious, running Memtest is the first | thing you should do. | | I've had bad RAM at times. I'm still there. | craftkiller wrote: | > Seriously: which percentage of developers have ECC on their | development machine(s)? | | Every single one of my non-mini computers uses ECC ram except | for my laptop. If someone would release a framework laptop | motherboard that supports ECC ram (and preferably risc-v) I'd | finally be able to close the reliability gap. It blows my mind | that we say "Well sure, we COULD make infallible ram, but that | would cost a tiny bit extra so instead let's just hope nothing | bad happens." That's right up there with not wearing a seat-belt | because I haven't needed one yet. | digitallyfree wrote: | I develop on VMs on my server (with ECC and RAIDZ) which I | access using SSH or VDI protocols. I would love to have ECC | on all my machines but that isn't feasible until the status | quo changes, so I stick with the remote approach. To me | that's an acceptable tradeoff as the non-ECC desktops/laptops | are just used as dumb terminals while the real work happens | on the reliable server. | | I can't speak for ECC at the moment but ZFS has definitely | saved me from data corruption that would have been left to | manifest otherwise. | Brian_K_White wrote: | "If someone would release a framework laptop motherboard that | supports ECC ram..." | | This. If I could, I would. The only reason my laptop doesn't | have ecc is because the manufacturer doesn't offer the option | in any machines I otherwise want. | | That comment was very misguided in trying to suggest that | there is any valid excuse to tolerate unreliable execution | hardware.
git and ssh and md5sums do _not_ mean that it's ok | if your very brain can't be trusted to deliver data from one | part to another within itself, or spit back the same data | that was put in a cell. Everything else is built _upon_ that! | leguminous wrote: | Checksums don't save you from memory corruption. If your data | gets corrupted in memory, you will just end up checksumming and | committing bad data. Or your checksum could get corrupted, and | you commit a checksum that doesn't match your data. Checksums | are more useful for safeguarding against disk or network | corruption (although you shouldn't have network corruption | issues over TLS or SSH). | | Apparently Ryzen 7000 cpus can use ECC. I've heard reports that | AMD needs to release an AGESA update, though, and ECC DDR5 | memory availability is terrible. I'm hopeful that the situation | will improve, because I also want to update my desktop. I've | been using ECC memory since losing a filesystem on a desktop | when a DIMM went bad. | dale_glass wrote: | > But really most of all I live in a world of Git repositories. | My entire Emacs config is versioned under Git (I know YMMV but | I like it that way). Some people version under Git their entire | user dir. | | > Tell me how my lack of ECC is going to really make life | miserable here? | | Git will break if RAM is bad just like anything else. That it | checksums everything won't save you from checking in corrupt | data, the filesystem itself being corrupt, or some internal git | structure becoming corrupt. Losing your repo because something | in it was written wrong is very much a possibility. | | Having multiple machines involved helps, but it's not a | complete fix, because the possibility exists of something | damaged being transmitted from a broken machine to a good one, | ensuring there's no good copy anywhere. | | There's really nothing software can do to operate correctly | with bad RAM all of the time. Instructions for the software are | in RAM.
The OS that the software expects to behave right is in | RAM. Various buffers used for disk access and networking are in | RAM. An application like git assumes all of that is performing | correctly, and can't compensate for every possible malfunction | that could happen. | everybodyknows wrote: | How trustworthy is git-fsck for detecting random-bit | corruption? | dale_glass wrote: | Depends on how you look at it. | | For actual verification, unless there's a bug my | understanding is that very trustworthy. But by that point | it's already too late. Okay, you know something is broken, | but that won't give your good data back. | | But that only tells you that Git data is intact and that | all the hashes match. If git got a corrupted file to start | with, then correctly hashed it, everything will verify 100% | and still be broken. | rhn_mk1 wrote: | > which percentage of developers have ECC on their development | machine(s)? | | That question won't help you evaluate the demand for ECC simply | because the supply is strangled. Those who want it have to make | compromises to get ECC: get a Xeon, or get one of few AMD | motherboards with matching CPUs and overpriced RAM without a | guarantee that it will end up working. | simoncion wrote: | > ...get one of few AMD motherboards with matching CPUs and | overpriced RAM... | | I mean, if you want a _guarantee_, then sure, get one of | those certified-for-ECC motherboards. | | But, like, as far as I know, going _at least_ far back as the | Phenom II (released in 2008) AMD desktop processors have | always supported ECC RAM. And -as far as I know- ASUS | motherboards for said processors have always supported | dropping in ECC RAM (and Linux and memtest and friends have | always agreed that ECC was enabled and functioning in such a | system). 
| | Source: Personal experience with Phenom II, Threadripper, and | Ryzen 5 CPUs and ASUS motherboards, and looking-from-a-distance | at the rest of the AMD CPUs between the Phenom and | the Ryzen 5. | tasubotadas wrote: | When people complain about these random OS crashes and freezes | it's usually RAM corruption at fault. | | Per 1gb of RAM, you can expect to see 266 bit errors per | month[1-2] if you are using your PC 16h per day. Multiply that | by 64GB or 128GB of RAM and it's crazy to think that you won't | run into any of the stability issues. | | [1-2] | https://static.googleusercontent.com/media/research.google.c... | [1-2] https://en.wikipedia.org/wiki/ECC_memory#Research | pflanze wrote: | > Per 1gb of RAM | | The Google study you linked says "25,000 to 70,000 errors per | billion device hours per Mbit". | | Assuming bits, {25000 to 70000}/1e9 * 30 * 16 * 1000 = 12 to | 33.6 bit errors per month. Assuming bytes, it's 96 to 268 | bit errors per month. | | Apparently you've meant bytes, not bits. (I'm a pedant, but | was also just unsure and interested in the numbers.) | | FWIW, a comment I ran across that confirms toast0's view[1]: | "It's a bimodal distribution - you either have many errors | (due to a defect somewhere) or basically zero. If you're on | the good side of the distribution, with only extremely rare | errors, then you probably don't need ECC. But without ECC, | you don't know whether you need ECC!" | | [1] https://www.realworldtech.com/forum/?threadid=198497&curpost... | ubercow13 wrote: | >266 bit errors per month | | That seems like an overestimate. How can memtest ever pass on | non-ECC RAM if errors are that frequent? | moloch-hai wrote: | Because memtest only looks at values it wrote out very | recently, before they have had a chance to flip. | | Memtest is looking for reliable failures, not evanescent | one-off events. | willis936 wrote: | SDRAM is continuously refreshing all cells.
How long ago | data was written doesn't make a big difference (aside | from the case where you're reading data immediately after | writing or reading that data). | moloch-hai wrote: | This does not, of course, make any sense. Any given | memory cell will be read once per, say, millisecond, and | the value written back. Once it has flipped once, the | wrong value will then be written back, and after that the | wrong value is read back out and rewritten again, | indefinitely. Errors are sticky, and accumulate. | (Flipping back again is negligibly likely.) | | With ECC in the refresh path, such an error could be | corrected and the right value would be overwritten over | top of the bad one. Then errors would not accumulate, but | would instead be "scrubbed". Mainframe machines scrub | their RAM. Disks too. | ilyt wrote: | > SDRAM is continuously refreshing all cells | | ...so? we're not talking about slow deterioration (which | is why refresh is needed), we're talking about bit flip | from cosmic rays where cell changes state completely. | [deleted] | toast0 wrote: | I don't think this rate is reasonable to use in this manner. | I've run thousands of servers, with lots of ECC ram (quite a | few servers had 768GB of ram, but most were more like 32 or | 64). The vast majority ran for years with zero reported | errors. A small handful would have major errors of thousands | in an hour, but we would replace for hundreds in an hour. A | couple servers developed a periodic report of one or two | (correctable) errors per day. If 266 bit errors per month per | 1Gb was a usable rate, all of our servers would have been | throwing ECC correctable errors all the time. | | But I didn't have time or desire to publish a study on our | experience, so there's no hard numbers. | kevingadd wrote: | I went out of my way for ECC because losing files can mean | losing days of work.
With how much a sweng gets paid it's worth | the ECC premium if it saves me a few days of work over the life | of the machine. | | Recently I discovered that one of my SSDs was quietly failing | without setting off any warnings; doing a chkdsk showed that | some files had already gotten corrupted. One of them was my | backblaze backup index! | | Even though I have automated backups (backblaze + macrium | backups to a NAS), recovering files from them is non-trivial. | If I were to lose work to non-ECC ram who knows how long it | would take me to reconstruct a known-good work environment and | file set. Imagine if you're working on something huge and hard | to validate like neural net weights where corruption can occur | silently and be hard to detect after it's happened? | newZWhoDis wrote: | My favorite is when data silently corrupts, and then is | happily propagated to your off-site recovery :/ | kevingadd wrote: | Yeah, I don't know when the drive started failing, so in | practice my backups are probably all screwed too. My only | option would be to get a spare drive, restore a backup to | it, then do a block level diff of the current and backup | volumes and try to figure out whether any of the | differences are file corruption. | Filligree wrote: | Probably I'm saying nothing you haven't thought of, but | ZFS is great for preventing this. If you also pair it | with ECC, then you've eliminated most ways to cause | corruption. | rwmj wrote: | But if I open up an editor and write a document then that could | be corrupted in RAM and then the corrupted data saved to disk. | The document is likely to be more important than some ripped | CDs or a git repo that I can download again. | | The CPU, RAM and mobo manufacturers need to get together and | make ECC RAM mandatory. It's absurd that we have machines with | gigabytes of storage using microscopic (nanoscopic??) | capacitors that don't have this basic protection.
And | honestly this should have happened _years_ ago. | | (Edit: And before anyone says DDR5 is ECC by default, that's | not quite true although the difference is a bit subtle: https:/ | /en.wikipedia.org/wiki/DDR5_SDRAM#DIMMs_versus_memory...) | ilyt wrote: | >But if I open up an editor and write a document then that | could be corrupted in RAM and then the corrupted data saved | to disk. The document is likely to be more important than | some ripped CDs or a git repo that I can download again. | | Devil's advocate: if it is just some bits in a character flipped | it's entirely recoverable while flipping some bits in a | compressed stream for video would corrupt more. | rwmj wrote: | Agreed. One problem with RAM errors (which I've actually | experienced) is they are insidious. You probably won't | notice them immediately so the error can be propagated, and | they're very very difficult to diagnose if you do notice | them. | | Back in the 90s we had a database server which had a stuck | bit in memory normally mapped to the page cache. This | caused sectors to be written to the backing software RAID | which couldn't be read back in (because I think some | checksum was corrupted when written and then failed when | read back). It took an absolute age to diagnose this. I | think I only worked it out by eliminating everything else. | justsomehnguy wrote: | And condoms are only 99.8% effective, so let's not use them? You | can always pull out in time or verify checksu^W^W Plan B? | | > But really most of all I live in a world of Git repositories | | That's great _for you_ , but 99% don't even know what Git is, | just like _checksums, cryptographic hashes, Merkle trees, | digital signatures and reproducible builds_. | | You just miss the one important thing: ECC isn't that helpful | where _you have_ the means to check and verify the data. ECC is | the only way to at least know that something is happening with | data when there is no way to check.
| | To give you a slight idea I'll tell you an anecdote from my | L1 life almost two decades ago: | | I visited a client who claimed that the PC was working | erratically and constantly threw weird error messages. | | Welp, the usual deal, just some ugly software or a virus. In | the first 3 minutes I got like 5 errors about failing to load a | .dll from C:\WINDOWS\SYSTEM33\USER32.DLL, nothing unusual, just | need to.. WAIT. Why system _33_? Stupid virus masquerading as a | well-known folder? Doubt. So I go to C:\Windows and I see: | System32 System33 System34 | | If you are a smart fellow you probably already understood that | this was a bit-flip error which somehow managed to be in the | in-memory copy of the MFT of the system drive. And this is the only | reason the user noticed it - because sometimes programs | wouldn't start and sometimes there were weird error messages. If | that error was in the data area of some program - it wouldn't | be discovered at all. | [deleted] | dmitrybrant wrote: | Hang on... to go from "System32" to "System33" is one bit | flip, but to go from "System33" to "System34" is three bit | flips at once. Doesn't that seem astronomically more likely | to be a malicious program than bad RAM? | stefantalpalaru wrote: | [dead] | baeaz wrote: | [flagged] | mrlonglong wrote: | We've got a server that keeps rebooting due to a bad ECC DIMM | chip. I thought the whole point of ECC was to keep the server | going until we can replace the DIMM? | AaronFriel wrote: | ECC can typically correct 1 bit errors and usually detect (and | fault) on 2 bit errors. | | Your server is faulting and preventing itself from corrupting | data. ___________________________________________________________________ (page generated 2022-12-25 23:00 UTC)
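For anyone who wants to check dmitrybrant's bit-flip counts for the System32 / System33 / System34 names, here is a minimal sketch. It is not from any commenter; the helper name bit_flips is hypothetical, and it assumes the names are compared as plain ASCII bytes. (NTFS stores names as UTF-16, but for these characters the extra byte is zero, so the counts should come out the same.)

```python
def bit_flips(a: str, b: str) -> int:
    """Count the bits that differ between two equal-length ASCII strings.

    Hypothetical helper for illustration; this is the Hamming distance
    over the raw bytes of the two names.
    """
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    raw_a = a.encode("ascii")
    raw_b = b.encode("ascii")
    # XOR leaves a 1 in every bit position where the two bytes differ.
    return sum(bin(x ^ y).count("1") for x, y in zip(raw_a, raw_b))


if __name__ == "__main__":
    pairs = [
        ("System32", "System33"),
        ("System33", "System34"),
        ("System32", "System34"),
    ]
    for left, right in pairs:
        print(left, "->", right, ":", bit_flips(left, right), "bit(s) differ")
```

Under that assumption, 32 to 33 is one bit and 33 to 34 is three, as the comment says, while 32 to 34 directly is two. The sketch only counts bits; it says nothing about whether bad RAM or malware was the actual cause.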