[HN Gopher] ECC matters
___________________________________________________________________

ECC matters

Author : rajesh-s
Score  : 624 points
Date   : 2021-01-03 15:38 UTC (7 hours ago)

(HTM) web link (www.realworldtech.com)
(TXT) w3m dump (www.realworldtech.com)

| sys_64738 wrote:
| ECC memory is predominantly used in servers, where failure
| absolutely must be identified and logged. The desktop market uses
| it to a lesser extent, due to the lack of mission-critical tasks
| being run there.
| dijit wrote:
| There are situations, though, where you're working on a document
| and the document's "save" format is a memory dump. Corruption
| for things of that type (Adobe RAW, for example) would remove
| data.
|
| It might present itself as a one-pixel colour difference, but it
| could be more damaging (incorrect finances in accounting
| software, for example). Software trusts memory; but memory can
| lie.
|
| That's dangerous.
| MaxBarraclough wrote:
| That's an interesting point. In an extreme case, an order or
| money transfer might be placed for an incorrect quantity, or
| to an incorrect recipient.
| KingMachiavelli wrote:
| Well, maybe. Rather than having to trust memory completely,
| it would be better to use a binary format where each bit is
| verifiable, so that at least a single bit flip would be
| immediately obvious. For example, a bit flip in a TLS
| session causes the whole session to fail rather than a
| random page element to change.
| knorker wrote:
| That doesn't help if the memory is corrupted before the
| verification code is applied (the code will simply put a
| signature on incorrect data).
|
| Or after it's been checked (time-of-check vs time-of-use).
| MaxBarraclough wrote:
| Right, exactly. TCP protects us from data corruption in
| network streams, and ECC protects us from data corruption
| in RAM.
I doubt any sort of software solution could
| practically compete against hardware ECC; even if it
| could be done, it would presumably be disastrous for
| performance.
| knorker wrote:
| The best integrity checking is "end to end". The problem
| with non-ECC is that there are no "ends" that are
| trustworthy.
|
| I guess in theory some software could produce signed data
| in CPU cache, and "commit" it to RAM as a verified block.
|
| But the overhead would be enormous. Would you slow down
| your CPU by half in order to not pay 12.5% more for RAM?
|
| Hmm, I wonder what SGX and similar do about this.
| mark-r wrote:
| That's the principle behind Gray code counting:
| https://en.wikipedia.org/wiki/Gray_code
| sys_64738 wrote:
| Those corner cases might occur rarely, and are probably
| inconsequential given how rarely they happen versus how
| critical they are -- the markup probably isn't justified for
| most. In a data center you're processing millions of
| transactions per minute, so an occurrence is much more
| impactful.
| knorker wrote:
| I would EASILY pay 12.5% more (that's the bit overhead) for
| memory that actually works.
|
| If my data is fine being corrupted to save 12.5% on RAM
| costs, then why am I even bothering processing the data?
| Apparently it's worthless.
|
| People today weigh the cost of maybe 16 vs 32GB on a mid-
| tier desktop, ~doubling the cost for twice the RAM. Yes,
| paying 12.5% more for ECC RAM is a no-brainer.
| xxs wrote:
| You need 1/8 more memory -- that's the real cost. It's
| pretty much Intel's fault for the segmentation.
| jkbbwr wrote:
| To be fair, if your save mechanism is just a straight memory
| dump with no checksums and validation, you have bigger
| issues.
| dijit wrote:
| That happens more than you think, though. Most things that
| output PNG are making an in-memory data structure and
| dumping it to disk.
| xxs wrote:
| Why does it matter if it =HAD= a checksum? The numbers would
| have been altered prior to the save.
It means you store a one but
| you get a two when you read it later. If the format calculated
| immediate checksums on blocks, it would at best detect memory
| corruption. The extreme downside is that such a part is
| untestable under normal conditions, hard to maintain, and
| costs more in development than the ECC does.
| projektfu wrote:
| Perhaps consumer-grade software that needs guarantees of
| correctness should be using error correction in software. For
| example, database records for financial software, DNS, e-mail
| addresses, etc.
| wicket wrote:
| Over the years, I don't think I've ever been able to explain to
| anyone that their memory error could have been caused by a cosmic
| ray without being laughed at.
| amelius wrote:
| Does Apple use ECC in its M1 laptop?
| dijit wrote:
| No. It uses a unified package of LPDDR4X SDRAM.
| my123 wrote:
| LPDDR4X systems with ECC exist, but it indeed looks like
| Apple M1 systems aren't among them...
| graeme wrote:
| This is my one worry. I have an iMac Pro and anecdotally it has
| been a LOT more reliable than my old MacBook Pro. The iMac Pro
| has ECC.
| dijit wrote:
| I beg to differ; every time this conversation comes up it's the
| same answer: "I don't see a problem".
|
| It's so easy to chalk these kinds of errors up to other issues: a
| little corruption here, a running program going berserk there.
| Could be a buggy program or a little accidental memory overwrite;
| a reboot will fix it.
|
| But I ran many thousands of physical machines, with petabytes of
| RAM. I tracked memory flip errors and they were _common_; common
| even in less dense memory, in thick metal enclosures surrounded
| by mesh -- and density and shielding affect bitflips a lot.
|
| My own experience tracking bitflips across my fleet led me to buy
| a Xeon laptop with ECC memory (Precision 5520) and it has
| (anecdotally) been significantly more reliable than my desktop.
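The 12.5% overhead quoted in this thread corresponds to the classic SECDED arrangement: 8 check bits protecting each 64-bit word, able to correct any single flipped bit and detect any double flip. A toy Python sketch of such a code (illustrative only; `encode`/`decode` are hypothetical helpers, and real memory controllers implement different codes, e.g. Hsiao codes, in hardware):

```python
from functools import reduce
from operator import xor

# Toy SECDED (single-error-correct, double-error-detect) Hamming code:
# 64 data bits + 7 Hamming check bits + 1 overall parity bit = 72 bits,
# i.e. the 12.5% overhead of ECC DIMMs.

CHECK = (1, 2, 4, 8, 16, 32, 64)  # check-bit positions (powers of two)
N = 72                            # position 0 holds the overall parity bit
DATA_POS = [i for i in range(1, N) if i not in CHECK]  # 64 data positions

def encode(data):
    """data: 64 ints in {0,1} -> 72-bit codeword (list of ints)."""
    code = [0] * N
    for pos, bit in zip(DATA_POS, data):
        code[pos] = bit
    for p in CHECK:  # even parity over every position whose index has bit p
        code[p] = reduce(xor, (code[i] for i in range(1, N) if i & p), 0)
    code[0] = reduce(xor, code[1:], 0)  # overall parity enables DED
    return code

def decode(code):
    """Return (data, status); corrects one flipped bit, detects two."""
    syndrome = reduce(xor, (i for i in range(1, N) if code[i]), 0)
    parity_bad = reduce(xor, code, 0) != 0
    if syndrome and parity_bad:      # single-bit error at index `syndrome`
        code = code[:]
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome:                   # syndrome set but overall parity OK
        return None, "double-error detected"
    else:                            # at most the parity bit itself flipped
        status = "corrected" if parity_bad else "ok"
    return [code[i] for i in DATA_POS], status
```

Flipping any one of the 72 stored bits is silently repaired; flipping two is reported rather than mis-corrected, which is exactly the trade-off the thread describes.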
| [deleted] | derefr wrote: | Were you around for enough DRAM generations to notice an effect | of DRAM _density_ / cell-size on reported ECC error rate? | | I've always believed that, ECC aside, DRAM made intentionally | with big cells would be less prone to spurious bit-flips (and | that this is one of the things NASA means when they talk about | "radiation hardening" a computer: sourcing memory with ungodly- | large DRAM cells, willingly trading off lower memory capacity | for higher per-cell level-shift activation-energy.) | | _If_ that's true, then that would mean that the per-cell error | rate would have actually been _increasing_ over the years, as | DRAM cell-size decreased, in the same way cell-size decrease | and voltage-level tightening have increased error rate for | flash memory. Combined with the fact that we just have N times | more memory now, you'd think we'd be seeing a _quadratic_ | increase in faults compared to 40 years ago. But do we? It | doesn't seem like it. | | I've _also_ heard a counter-effect proposed, though: maybe | there really are far more "raw" bit-flips going on -- but far | less of main memory is now in the causal chain for corrupting a | workload than it used to be. In the 80s, on an 8-bit micro, | POKEing any random address might wreck a program, since there's | only 64k addresses to POKE and most of the writable ones are in | use for something critical. Today, most RAM is some sort of | cache or buffer that's going to be used once to produce some | ephemeral IO effect (e.g. the compressed data for a video | frame, that might decompress incorrectly, but only cause 16ms | of glitchiness before the next frame comes along to paper over | it); or, if it's functional data, it's part of a fault-tolerant | component (e.g. 
a TCP packet, that's going to checksum-fail | when passed to the Ethernet controller and so not even be sent, | causing the client to need to retry the request; or, even if | accidentally checksums correctly, the server will choke on the | malformed request, send an error... and the client will need to | retry the request. One generic retry-on-exception handler | around your net request, and you get memory fault-tolerance for | free!) | | If both effects are real, this would imply that regular PCs | without ECC _should_ still seem quite stable -- but that it | would be a far worse idea to run a non-ECC machine as a | densely-packed multitenant VM hypervisor today (i.e. to tile | main memory with OS kernels), than it would have been ~20 years | ago when memory densities were lower. Can anyone attest to | this? | | (I'd just ask for actual numbers on whether per-cell per-second | errors have increased over the years, but I don't expect anyone | has them.) | jeffreygoesto wrote: | Sorry, I don't have the numbers you asked for. But afaik one | other effect is that "modern" semiconductor processes like | FinFET and Fully-Depleted Silicon-on-Insulator are less prone | to single event upsets and especially result in only a single | bit flipping and no drain of a whole region of transistors | from a single alpha particle. | mlyle wrote: | I think it's been quadratic with a pretty low contribution | from the order 2 term. | | Think of the number of events that can flip a bit. If you | make bits smaller, you get a modestly larger number of events | in a given area capable of flipping a bit, spread across a | larger number of bits in that area. | | That is, it's flip event rate * memory die area, not flip | event rate * number of memory bits. | | In recent generations, I understand it's even been a bit | paradoxical-- smaller geometries mean less of the die is | actual memory bits, so you can actually end up with _fewer_ | flips from shrinking geometries. 
|
| And sure, your other effect is true: there are a whole lot
| fewer bitflips that "matter". Flip a bit in some framebuffer
| used in compositing somewhere -- and that's a lot of my
| memory -- and I don't care.
| smoyer wrote:
| There is no guarantee of state at the quantum level ... just a
| high degree of assurance of a state. After 40 years in the
| electronics, optics, and software business, I've learned that
| there is absolutely the possibility of unexplained "blips".
| loeg wrote:
| Yeah, it's real obnoxious of Intel to silo ECC support off into
| the Xeon line, isn't it? I switched to ECC memory in 2013 or
| 2014 with a Xeon E3 (fundamentally a Core i7 without the ECC
| support fused off) and of course a Xeon-supporting motherboard
| (with weird "server board" quirks: e.g., no on-board sound
| device).
|
| I love that AMD doesn't intentionally break ECC on its consumer
| desktop platforms; I upgraded to a Threadripper in 2017.
| defanor wrote:
| I've considered using an AMD CPU instead of Intel's Xeon on
| the primary desktop computer, but even low-end Ryzen
| Threadripper CPUs have a TDP of 180W, which is a bit higher
| than I'd like. And though ECC is not disabled in Ryzen CPUs,
| AFAIK it's not tested in (or advertised for) those, so one
| won't be able to return/replace a CPU if it doesn't work with
| ECC memory, AIUI, making it risky. Though I don't know how
| common it is for ECC to not be handled properly in an
| otherwise functioning CPU; are there any statistics or
| estimates around?
| BlueTemplar wrote:
| > one won't be able to return/replace a CPU if it doesn't
| work with ECC memory
|
| I don't know where you live, but around here (if you buy
| new?) the vendor MUST take back items up to 15 days after
| they were delivered, for ANY reason.
|
| So, as long as you synchronize your buying of the CPU, RAM,
| (and motherboard), you should be fine.
| marcosdumay wrote:
| Keep in mind that Intel lies about its TDP.
| magila wrote: | There's been a lot of misinformation spread about what | TDP means for modern CPUs. In Intel's case TDP is the | steady state power consumption of the CPU in its default | configuration while executing a long running workload. | Long meaning more than a minute or two. The CPU | implements this by keeping an exponentially weighted | moving average (EWMA) of the CPU's power consumption. The | CPU will modulate its frequency to keep this moving | average at-or-below the TDP. | | One consequence of using a moving average is that if the | CPU has been idle for a long time then starts running a | high power workload instantaneous power consumption can | momentarily exceed the TDP while the average catches up. | This is often misleadingly referred to as "turbo mode" by | hardware review sites. It's not a mode, there's no state | machine at work here, it's just a natural result of using | a moving average. The use of EWMA is meant to model the | heat capacity of the cooling solution. When the CPU has | been idle for a while and the heatsink is cool, the CPU | can afford to use more power while the heatsink warms up. | | Another factor which confuses things is motherboard | firmware disabling power limits without the user's | knowledge. Motherboards marketed to enthusiasts often do | this to make the boards look better in review benchmarks. | This is where a lot of the "Intel is lying" comes from, | but it's really the motherboard manufacturers being | underhanded. | | The situation on the AMD side is of course a bit | different. AMD's power and frequency scaling is both more | complex and much less documented than Intel's so it's | hard to say exactly what the CPU is doing. What is known | is that none of the actual power limits programmed into | the CPU align with the TDP listed in the spec. In | practice the steady state power consumption of AMD CPUs | under load is typically about 1.35x the TDP. 
|
| Unlike Intel, firmware for AMD motherboards does not mess
| with the CPU's power limit settings unless the user does
| so explicitly. Presumably this is because AMD's CPU
| warranty is voided by changing those settings, while
| Intel's is not.
| xxs wrote:
| Intel measures TDP at base frequency... that's
| disingenuous.
| colejohnson66 wrote:
| They don't. They just measure it differently than AMD.
| Intel measures at base clock, but AMD measures at
| sustained max clock IIRC. It's definitely deceptive, but
| it's not a lie as long as Intel tells you (which they
| do).
| wtallis wrote:
| Intel's TDP numbers are at best an indicator of which
| product segment a chip falls into. They are wildly
| inaccurate and unreliable indicators of power draw under
| _any_ circumstance. For example, here's a "58W" TDP
| Celeron that can't seem to get above 20W:
| https://twitter.com/IanCutress/status/1345656830907789312
|
| And on the flip side, if you're building a desktop PC
| with a more high-end Intel processor, you will usually
| have to change a _lot_ of motherboard firmware settings
| to get the behavior to resemble Intel's own
| recommendations that their TDP numbers are supposedly
| based on. Without those changes, lots of consumer retail
| motherboards default to having most or all of the power
| limits effectively disabled. So out of the box, a "65W"
| i7-10700 and a "125W" i7-10700K will both hit 190-200W
| when all 8 cores/16 threads are loaded.
|
| If a metric can in practice be off by a factor of three
| in either direction, it's really quite useless and should
| not be quantified with a scientific unit like Watts.
| marcosdumay wrote:
| Well, it's a power measurement that isn't total and can't
| be used for design... So, it's a lie.
|
| If they gave it some other name, it would be only
| misleading. Calling it TDP is a lie.
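The moving-average behaviour magila describes above can be sketched as a toy simulation. All constants and the `simulate` helper here are illustrative assumptions, not Intel's actual algorithm or time constants:

```python
# Toy model of EWMA-based power limiting: the chip tracks a moving
# average of package power and clamps to TDP once the average catches
# up, which is why a burst after idle can briefly draw well above TDP
# ("turbo"). All constants are made-up assumptions for illustration.

TDP = 65.0      # watts, steady-state target
P_MAX = 120.0   # unconstrained package power under full load (assumed)
P_IDLE = 2.0    # idle package power (assumed)
ALPHA = 0.02    # EWMA smoothing factor per tick (assumed)

def simulate(idle_ticks, load_ticks):
    """Return (per-tick power trace, final moving average)."""
    avg = P_IDLE
    trace = []
    for t in range(idle_ticks + load_ticks):
        if t < idle_ticks:
            power = P_IDLE
        elif avg < TDP:
            power = P_MAX   # average hasn't caught up yet: free to burst
        else:
            power = TDP     # average at the limit: clamp to TDP
        avg = (1 - ALPHA) * avg + ALPHA * power
        trace.append(power)
    return trace, avg
```

Running `simulate(200, 2000)` shows exactly the behaviour described: the first loaded ticks draw full power while the average catches up, then the chip settles at the TDP indefinitely.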
| ksec wrote:
| It is a lie when they change the definition of TDP
| without telling you first, and later redefine the word to mean
| a different thing once they get caught.
|
| Maybe we should use a new term for it, something like
| iTDP.
| mlyle wrote:
| They both lie, but Intel lies worse :D
| paulmd wrote:
| Nah. Both brands pull more than TDP when boosting at max;
| AMD will pull up to 30% above the specified TDP for an
| indefinite period of time (they call this number the
| "PPT" instead).
|
| Intel mobile processors actually obey this better than
| AMD processors do -- Tiger Lake has a hard limit: when you
| configure a 15W TDP, then it really is 15W once steady-
| state boost expires. AMD mobile products will pull up to
| _50%_ more than configured.
|
| https://images.anandtech.com/doci/16084/Power%20-%2015W%20Co...
|
| "The brands measure it differently" is true, but not in
| the sense people think.
|
| On AMD it is literally just a number they pick that goes
| into the boost algorithm. Robert Hallock did some dumb
| handwavy shit about how it's measured with some delta-t
| above ambient with a reference cooler, but the fact is
| that the chip itself basically determines how high it'll
| boost based on the number they configure, so that is a
| self-fulfilling prophecy: the delta-t above ambient is
| dependent on the number they configure the chip to run
| at.
|
| In practice: what's the difference between a 3600 and a
| 3600X? One is configured with a TDP of 65W and one is
| configured with a TDP of 95W; the latter lets you boost
| higher and therefore it clocks higher.
|
| Intel nominally states that it's measured as a worst-case
| load at base clocks, something like Prime95 that
| absolutely nukes the processor (and even then many
| processors do not actually hit it). But really it is also
| just a number that they pick. The number has shifted over
| time: previously they used to undershoot a lot; now they
| tend to match the official TDP.
It's not an actual
| measurement, it's just a "power category" that they
| classify the processors into; it's _informed_ by real
| numbers but it's ultimately a human decision which tier
| they put them in.
|
| In the real world you will always boost above base clocks on
| both brands at stock TDP, at least on real-world loads. You
| won't hit full boost on either brand without exceeding TDP;
| the "AMD measures at full boost" claim is categorically false
| despite the fact that it's commonly repeated. AMD PPT lets
| them boost above the official TDP for an unlimited period of
| time; they cannot run full boost when limited to the official
| TDP.
| numlock86 wrote:
| Can you cite something? Sounds interesting.
| colejohnson66 wrote:
| It's not true. Sort of. Intel measures at base clock
| while AMD does at sustained peak clock. Deceptive? Yes.
| Lie? No.
| CydeWeys wrote:
| > but even low-end Ryzen Threadripper CPUs have TDP of
| 180W, which is a bit higher than I'd like.
|
| Why does it matter? It doesn't idle that high; it only goes
| that high if you're using it flat out, in which case the
| extra power usage is justified because it's giving that
| much more performance over a 100 W TDP CPU. Now I totally
| get it if you don't want to go Threadripper just for ECC
| because it's more _expensive_, but max power draw, which
| you don't even have to use? I've never seen anyone shop a
| desktop CPU by TDP, rather than by performance and price.
| defanor wrote:
| I prefer to pick a PSU and fans (for both CPU and chassis)
| that can handle it comfortably (preferably while staying
| silent and with some reserve) with the maximum TDP in mind,
| and given that I don't need that many cores or high clock
| speed either, a powerful CPU with a high TDP is undesirable
| because it just makes picking other parts harder. I've
| mentioned TDP explicitly because I wouldn't mind if it
| was a (possibly even high-end) Threadripper that somehow
| didn't produce as much heat.
Although price also matters,
| indeed.
| phkahler wrote:
| >> I've never seen anyone shop a desktop CPU by TDP,
| rather than by performance and price.
|
| Oh oh, me! Back in the day I bought a 65W CPU for a
| system that could handle a 90W one. I wanted quiet and
| figured that would keep fan noise down at a modest
| performance penalty. It should also last longer, being
| the same design but running cooler. I ran that from 2005
| until a few years ago (it still runs fine but is in
| storage).
|
| Planning to continue this strategy. I suspect it's common
| among SFF enthusiasts.
| koolba wrote:
| SFF?
| lostlogin wrote:
| The Intel NUC and Mac mini are good examples of this --
| however, the NUC doesn't have its PSU inside; it's a
| brick. Great for fixing failures, horrible in general, as
| a built-in PSU is so much tidier.
| oconnor663 wrote:
| "small form factor" as far as I can tell
| sam_lowry_ wrote:
| Hm... My 2013 NUC in a fanless Akasa enclosure runs 24/7 on
| a 6W CPU. I recently looked at the options, and the 2019
| 6W offering changes little in performance. Yes, memory
| got faster, but that's it.
|
| My passively cooled desktop is also running a slightly
| throttled-down 65W CPU.
|
| So yes, there are people who choose their hardware by
| TDP.
| francis-io wrote:
| When looking for a CPU for a server that sits in my
| living room, I went down the thought process of getting a
| low TDP. I don't have a quote, but I seem to remember
| coming to the conclusion that TDP is the max temp
| threshold, not the consistent power draw. If you have a
| computer idling, I believe you won't see a difference in
| temp between CPUs, but you will have the performance when
| you need it.
|
| These days, a quiet PWM fan with good thermal paste (and
| maybe some Linux CPU throttling) more than achieves my
| needs for a "silent" PC 99% of the time.
|
| I would love to be told my above assumptions are wrong if
| they are.
| mlyle wrote: | Yah-- one should look at performance within a given power | envelope. Being able to dissipate more and then either | end up with the fan running or the processor throttling | back somewhat is good, IMO. | | The worst bit is, AMD and Intel define TDP differently-- | neither is the maximum power the processor can draw-- | though Intel is far more optimistic. | mlyle wrote: | On AMD, with Ryzen Master, you can set the TDP-envelope | of the processor to what you want. Then the | boost/frequency/voltage envelope it chooses to operate in | under sustained load is different. | | IMO, shopping by performance/watt makes sense. Shopping | by TDP doesn't. (Especially since there is no comparing | the AMD and Intel TDP numbers as they're defined | differently; neither is the maximum the processor can | draw, and Intel significantly exceeds the specified TDP | on normal workloads). | ReactiveJelly wrote: | Back when my daily driver was a Core 2 laptop, someone | told me that capping the clock frequency would make it | unusable. | | As a petty "Take that", I dropped the max frequency from | 2.0 GHz to 1.0 GHz. I ran a couple benchmarks to prove | the cap was working, and then just kept it at 1.0 for a | few months, to prove my point. | | It made a bigger difference on my ARM SBC, where I tried | capping the 1,000 MHz chip to 200 or 400 MHz. That chip | was already CPU-bound for many tasks and could barely | even run Firefox. Amdahl's Law kicked in - Halving the | frequency made _everything_ twice as slow, because almost | everything was waiting on the CPU. | mlyle wrote: | The funny thing is, on modern processors-- throttling TDP | only affects when running flat out all-core workloads. A | subset of cores can still boost aggressively, and you can | run all-core max-boost for short intervals. | | And the relationship between power and performance isn't | linear as processor voltages climb trying to squeeze out | the last bit of performance. 
| | So if you want to take a 105W CPU and ask it to operate | in a 65W envelope, you're not giving up even 1/3rd of | peak performance, and much less than that of typical | performance. | vvanders wrote: | TDP matters a fair bit in SFF(Small Form Factor) PCs. For | instance the 3700x is a fantastic little CPU since it has | a 65W TDP but pretty solid performance. | | In a sandwich style case you're usually limited to low | profile coolers like Noctua L9i/L9a since vertical height | is pretty limited. | mlyle wrote: | Performance/watt matters. You can just set TDP to what | you want with throttling choices. | | If you want a 45W TDP from the 3700X, you can just pop | into Ryzen Master and ask for a 45W TDP. Boom, you're | running in that envelope. | | I think shopping based on TDP is not the best, because | it's not comparable between manufacturers and because | it's something you can effectively "choose". | mongol wrote: | How do you do that? Is it a setting in the bios? Or can | it be done runtime? If so, how? It sounds interesting if | I can run a beefy rig as a power efficient device, for | always-on scenarios, and then boost it when I need. | mlyle wrote: | > How do you do that? Is it a setting in the bios? Or can | it be done runtime? | | On AMD, it's a utility you run. I believe you may require | a reboot to apply it. On some Intel platforms, it's been | settings in the BIOS. | | > It sounds interesting if I can run a beefy rig as a | power efficient device, for always-on scenarios, and then | boost it when I need. | | This is what the processor is doing internally anyways. | It throttles voltage and frequency and gates cores based | on demanded usage. Changing the TDP doesn't change the | performance under a light-to-moderate workload scenario | at all. | | Ryzen Master lets you change some of the tuning for the | choices it makes about when and how aggressively to | boost, though, too. 
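The claim above, that a 105W part confined to a 65W envelope gives up far less than a third of peak performance, follows from a rough cube-law power model: dynamic power scales roughly with f*V^2, and voltage rises roughly with frequency near the top of the curve, so power grows roughly with f^3. A back-of-the-envelope sketch (a crude assumed model, not measured data; `freq_ratio` is a hypothetical helper):

```python
# Crude power model: P ~ f^3 near the top of the V/f curve. Under this
# assumption, capping a 105W part to a 65W envelope costs far less
# all-core frequency than the 38% cut in power would suggest.

def freq_ratio(power_ratio, exponent=3.0):
    """All-core frequency retained when power is scaled by power_ratio."""
    return power_ratio ** (1.0 / exponent)

retained = freq_ratio(65 / 105)  # roughly 0.85: ~15% lower clocks
```

So a 38% power cut costs only about 15% of sustained all-core frequency in this model, and typical performance drops even less, since light workloads never hit the cap.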
| Cloudef wrote:
| Ryzen Master doesn't seem to be available for Linux, so you
| end up with a bunch of unofficial hacks that may or may not
| work. I run an SFF setup myself; I originally wanted to get a
| 3600 but it was out of stock, and the next TDP-friendly
| processor was the 3700X.
| mlyle wrote:
| That's an annoyance, but on Linux you have infinitely more
| control of thermal throttling and you can get whatever
| thermal behavior you want. Thermald has been really good
| on Intel, and now that Google contributed RAPL support
| you can get the same benefits on AMD -- pick exactly your
| power envelope and thermal limits.
| vvanders wrote:
| Yeah, but can I get a metric ton of benchmarks at that 45W
| setpoint?
|
| I don't really see the reason in paying a 100W TDP
| premium if I'm just going to scale it down to 65W.
| bayindirh wrote:
| > I've never seen anyone shop a desktop CPU by TDP,
| rather than by performance and price.
|
| That's me. When I start to plan for a new system, I
| select the processor first and read its thermal design
| guidelines (Intel used to have nice load vs. max temp
| graphs in their docs), then select every component around
| it for sustained max load.
|
| This results in a system that is more silent at idle, and
| peace of mind when loading it for an extended duration.
| 411111111111111 wrote:
| That's not necessarily correct.
|
| You can passively cool Threadrippers if you underclock
| them enough and have good ventilation in the case.
| bayindirh wrote:
| If my only interest were ECC, I might do that, but I
| develop scientific software for research purposes. I need
| every bit of performance from my system.
|
| In my case, loading means maxing out all cores, and an
| extended period of time can be anything from five minutes
| to hours.
| mlyle wrote:
| The problem is-- you can't compare the TDP nor even the
| system cooling design guidelines between AMD and Intel.
|
| Both are optimistic lies, but-- if you look at the
| documents it looks like currently AMD needs more cooling,
| but actually dissipates less power in most cases and
| definitely has higher performance/watt.
| bayindirh wrote:
| > The problem is-- you can't compare the TDP nor even the
| system cooling design guidelines between AMD and Intel.
|
| Doesn't matter for me, since I'm not interested in
| comparing them.
|
| > Both are optimistic lies, but-- if you look at the
| documents it looks like currently AMD needs more cooling,
| but actually dissipates less power in most cases and
| definitely has higher performance/watt.
|
| I'm aware of the situation, and I always inflate the
| numbers 10-15% to increase headroom in my systems. The
| code I'm running is not a _most cases_ code: an FPU-heavy,
| "I will abuse all your cores and memory bandwidth" type of
| heavily optimized scientific software. I can sometimes
| hear my system swearing at me for the repeated test runs.
|
| I don't like to add this paragraph, but I'm one of the
| administrators of one of the biggest HPC clusters in my
| country. I know how a system can surpass its TDP and how
| CPU manufacturers can skew these TDP numbers to fit in
| envelopes. We make these servers blow flames from their
| exhausts.
| ethanpil wrote:
| Built a NAS. My #1 concern for choosing a CPU was TDP. This
| machine is on 24/7, and power use is a primary concern
| where I live because electricity is NOT cheap.
| mlyle wrote:
| This is a poor way to make the choice. TDP is supposed to
| specify the highest power you can get the processor to
| dissipate, not typical or idle use. And since different
| manufacturers specify TDP differently, you can't even
| compare the number.
|
| Performance/watt metrics and idle consumption would have
| been a far better way to make this choice.
|
| If you have a choice between A) something that can
| dissipate 65W peak for 100 units of performance, but
| would dissipate 4W average under your workload, and B)
| something that can dissipate 45W peak for 60 units of
| performance, but would dissipate 4.5W under your
| workload... I'm not sure why you'd ever pick B.
| mongol wrote:
| Is there a metric to look for to understand what power
| consumption is at "idle", or something close to that? That
| is what confuses me. I don't want to spend a lot of money
| on something that will be always on, and usually idling,
| and find that its power usage is way higher than I
| thought. But perhaps there is a metric that tells that. I
| have not looked closely at it.
|
| Also, even though the CPU may draw less, can the power
| supply still waste more, just because it is beefy?
| Comparing with a sports car: they have great performance,
| but also use more gas in ordinary traffic. Can a computer
| be compared with that?
| mlyle wrote:
| > Is there a metric to look for to understand what power
| consumption is at "idle", or something close to that? That
| is what confuses me. I don't want to spend a lot of money
| on something that will be always on, and usually idling,
| and find that its power usage is way higher than I
| thought.
|
| Community benchmarks, from Tom's Hardware, etc.
|
| The vendor numbers are make-believe -- you can't use them
| for power supply sizing or for thermal path sizing. If
| you look at the cited TDP numbers today, they can be
| misleading -- e.g. often Intel 45W TDP parts use more
| power at peak than AMD 65W parts.
|
| On modern systems, almost none of the idle consumption is
| the processor. The power supply's idle use and
| motherboard functions dominate.
|
| > Also, even though the CPU may draw less, can the power
| supply still waste more, just because it is beefy?
| | Yes, having to select a larger power supply can result in | more idle consumption, though this is more of a problem | on the very low end. | vvanders wrote: | I don't think Threadripper is a hard requirement for ECC. | There's some pretty reasonable TDP processors if you step | down from Threadripper. | usefulcat wrote: | It's not. I have a low end Epyc machine with ECC. It has | a TDP of something like 30 watts. | defanor wrote: | I didn't consider embedded CPUs (I guess that's about an | embedded EPYC, not a server one), those look neat. But | there's no official ECC support (i.e., it's similar to | Ryzen CPUs), is there? | | Edit: as detaro mentioned in the reply, there is, and | here's the source [0] -- that's what they mean by "RAS" | on promotional pages [1]. That indeed looks like a nice | option. | | [0] https://www.amd.com/system/files/documents/updated-30 | 00-fami... | | [1] https://www.amd.com/en/products/embedded- | epyc-3000-series | loeg wrote: | RAS covers more than just DRAM, but yes. Historically, | the reporting interface is called MCA (Machine Check | Architecture) / MCE. I think both AMD and Intel have | extensions with other names, but MCA/MCE points you in | the right direction. | detaro wrote: | All EPYC, including the embedded ones, do officially have | ECC support | adrian_b wrote: | For embedded applications, there is official ECC support | for all CPUs named Epyc or Ryzen Vxxxx or Ryzen Rxxxx. | | There are computers in the Intel NUC form factor, with | ECC support (e.g. with Ryzen V2718), e.g from ASRock | Industrial. | detaro wrote: | what kind of machine is that? Been vaguely looking for | one a while back, and everything seemed difficult to get | (since the main target is large-volume customers I guess) | cuu508 wrote: | I haven't seen definite details and test results on these | (but haven't looked recently). | | What specific configurations (CPU, MB, RAM) are known to | work? 
| | Let's say I have a Ryzen system, how can I check if ECC | really works? Like, can I see how many bit flips got | corrected in, say, the last 24h? | xxs wrote: | Every Ryzen (non-APU) supports it.* Check the motherboard | of your choice; they would declare it in big bold | letters, e.g. [0] | | *not officially, and the memory controller provides no | report for 'fixed' errors. | | 0: http://www.asrock.com/mb/AMD/X570%20Taichi/ | cturner wrote: | Regarding verification. There is a Debian package called | edac-utils. As I recall, you overclock your RAM and run | your system at load in order to generate failures. | | Looking back at my notes, the output of journalctl -b | should say something like, "Node 0: DRAM ECC | enabled." | | Then 'edac-ctl --status' should tell you that drivers are | loaded. | | Then you run 'edac-util -v' to report on what it has | seen:
|       mc0: 0 Uncorrected Errors with no DIMM info
|       mc0: 0 Corrected Errors with no DIMM info
|       mc0: csrow2: 0 Uncorrected Errors
|       mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
|       mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
|       mc0: csrow3: 0 Uncorrected Errors
|       mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
|       mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
|       edac-util: No errors to report.
| a1369209993 wrote: | > As I recall you overclock your RAM and run your system | at load in order to generate failures. | | You can also use memtest86+ for this, although I don't | recall if it requires specific configuration for ECC | testing. | p_l wrote: | All AMD CPUs with integrated memory controllers support | ECC. The CPU also exposes an interface usable by the | operating system to verify ECC works - the same interface | is used to provide monitoring of memory fault data | provided by ECC. | | They aren't tested on it, so it's possible to get a dud, | but it's a minuscule chance that isn't worth worrying about.
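To answer the "can I see how many bit flips got corrected?" question programmatically: the counters that edac-util reports are plain files under sysfs, so a small script can poll them. This is a sketch assuming the standard Linux EDAC sysfs layout; the `read_edac_counts` helper is hypothetical, not part of edac-utils:

```python
# Sketch: read Linux EDAC error counters from sysfs (the same data
# edac-util reports). Assumes the standard layout
# /sys/devices/system/edac/mc/mc<N>/{ce_count,ue_count}, which appears
# once an EDAC driver (e.g. amd64_edac) is loaded.
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac"):
    """Return (corrected, uncorrected) error totals across all memory controllers."""
    corrected = uncorrected = 0
    for mc in sorted(Path(root).glob("mc/mc*")):
        corrected += int((mc / "ce_count").read_text())
        uncorrected += int((mc / "ue_count").read_text())
    return corrected, uncorrected
```

Sampling this periodically (and diffing) would give the "corrected flips in the last 24h" number, assuming the motherboard and firmware actually report ECC events.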
| | Now, to _actual_ issues you can encounter: _motherboards_ | | The problem is that ECC means you need to have, IIRC, 8 | more data lines between CPU and memory module, which of | course means more physical connections (I don't remember how | many right now). Those also need to be properly done and | tested, and you might encounter a motherboard where that | wasn't done. Not sure how common, unfortunately. | | Another issue is motherboard firmware. Even though AMD | supplies the memory init code, the configuration can be | tweaked by the motherboard vendor, and they might simply | break ECC support accidentally (even by something as | simple as making a toggle default to _false_ and then | forgetting to expose it in the configuration menu). | | Those are the two issues you can encounter. | | The difference with Threadripper PRO and EPYC, AFAIK, is | that AMD includes ECC in its test and certification | programs for them, which kind of enforces support. | jtl999 wrote: | > Another issue is motherboard firmware. Even though AMD | supplies the memory init code, the configuration can be | tweaked by the motherboard vendor, and they might simply | break ECC support accidentally (even by something as | simple as making a toggle default to false and then | forgetting to expose it in the configuration menu). | | I think some Gigabyte boards are infamous for this in | certain circles. | | OTOH: Gigabyte _might_ have a Threadripper PRO | motherboard (WRX80 chipset) coming out in the future. | p_l wrote: | Gigabyte is also infamous for trying to claim that they | implemented UEFI by dropping a build of DUET (UEFI that | boots on top of BIOS, used for early development) into the | BIOS image... | adrian_b wrote: | All desktop Ryzen CPUs without integrated GPU, i.e. with | the exception of APUs, support ECC. | | You must check the specifications of the motherboard to | see if ECC memory is supported. | | As a rule, all ASRock MBs support ECC, and some ASUS | MBs support ECC too, e.g. all ASUS workstation motherboards.
| | I have no experience with Windows and Ryzen, but I assume | that ECC should also work there. | | With Linux, you must use a kernel with all the relevant | EDAC options enabled, including CONFIG_EDAC_AMD64. | | For the new Zen 3 CPUs, i.e. Ryzen 5xxx, you must use | kernel 5.10 or later for ECC support. | | On Linux, there are various programs, e.g. edac-utils, to | monitor the ECC errors. | | To be more certain that the ECC error reporting really | works, the easiest way is to change the BIOS settings to | overclock the memory until memory errors appear. | theevilsharpie wrote: | On Windows, to check if ECC is working, run the command | 'wmic memphysical get memoryerrorcorrection':
|       PS C:\> wmic memphysical get memoryerrorcorrection
|       MemoryErrorCorrection
|       6
| | SuperUser has a convenient decoder[1], but modern systems | will report "6" here if ECC is working. | | When Windows detects a memory error, it will record it in | the system event log, under the WHEA source. As a side | note, this is also how memory errors within the CPU's | caches are reported under Windows. | | [1] https://superuser.com/questions/893560/how-do-i-tell-if-my-m... | stefan_ wrote: | I don't understand. Whatever the TDP of Intel processors, | you are straight up getting less bang per watt given their | ancient process. Same reason smartphones burst to high | clocks and power; getting the task done faster is on | average much more efficient. | loeg wrote: | > I've considered using an AMD CPU instead of Intel's Xeon | on the primary desktop computer, but even low-end Ryzen | Threadripper CPUs have TDP of 180W, which is a bit higher | than I'd like. | | Any apples-to-apples comparable Intel CPU will have | comparable power use. The difficulty is that Intel didn't | really have anything like Threadripper -- their i9 series | was the most comparable (high clocks and moderate core | counts), but i9 explicitly did not support ECC memory, | nullifying the comparison.
| | You're looking at the 2950X, probably? That's a Zen+ (previous | gen) model: 16 cores / 32 threads, 3.5 GHz base clock, | launched August 2018. | | The comparable Intel Xeon timeline is Coffee Lake at the | latest, Kaby Lake before that. As far as I can tell, _no_ | Kaby Lake or Coffee Lake Xeons even have 16 cores. | | The closest Skylake I've found is an (OEM) Xeon Gold 6149: | 16/32 cores/threads, 3.1 GHz base clock, 205W nominal TDP | (and it's a special OEM part, not available for you). The | closest buyable part is probably the Xeon Gold 6154 with 18/36 | cores/threads, 3GHz clock, and 200W nominal TDP. | | Looking at i9 from around that time, you had Skylake-X and | a single Coffee Lake-S (i9-9900K). The 9900K only has 8 cores. | The Skylake-X i9-9960X part has 16/32 cores/threads, a base | clock of 3.1GHz, and a nominal TDP of 165W. That's somewhat | comparable to the AMD 2950X, ignoring ECC support. | | Another note that might interest you: you could run the | Threadripper part at substantially lower power by | sacrificing a small amount of performance, if thermals are | the most important factor and you are unwilling to trust | Ryzen ECC: | http://apollo.backplane.com/DFlyMisc/threadripper.txt | | Or just buy an Epyc, if you want a low-TDP, ECC-definitely-supported part: the EPYC 7302P has 16/32 cores, 3GHz base | clock, and 155W nominal TDP. The EPYC 7282 has 16/32 cores, 2.8 | GHz base, and 120W nominal TDP. These are all Zen 2 (vs. the | 2950X's Zen+) and will outperform Zen+ on a clock-for-clock | basis. | | > And though ECC is not disabled in Ryzen CPUs, AFAIK it's | not tested in (or advertised for) those, so one won't be | able to return/replace a CPU if it doesn't work with ECC | memory, AIUI, making it risky. | | If your vendor won't accept defective CPU returns, buy | somewhere else. | | > Though I don't know how common it is for ECC to not be | handled properly in an otherwise functioning CPU; are there | any statistics or estimates around?
| | ECC support requires motherboard support; that's the main | thing to be aware of shopping for Ryzen ECC setups. If the | board doesn't have the traces, there's nothing the CPU can | do. | theevilsharpie wrote: | > And though ECC is not disabled in Ryzen CPUs, AFAIK it's | not tested in (or advertised for) those | | ECC isn't validated by AMD for AM4 Ryzen models, but it's | present and supported if the motherboard also supports it. | Many motherboards have ECC support (the manual will say for | sure), and a handful of models even explicitly advertise it | as a feature. | | I have a Ryzen 9 3900X on an ASRock B450M Pro4 and 64 GB of | ECC DRAM, and ECC functionality is active and working. | colejohnson66 wrote: | What do you mean by "validated"? There's the silicon, but | they don't test it? | Laforet wrote: | More like "The feature is present in silicon but | motherboard makers are not required to turn it on". At | the end of the day, ECC support does require extra copper | traces in the PCB and some low end models may | deliberately choose to skip them, thus the expectation | has to be managed. | loeg wrote: | IMO, "validated" is intentionally wishy-washy and mostly | means that AMD would prefer it if enterprises paid them | more money by buying EPYC (or Ryzen Pro) parts instead of | consumer Ryzen parts. Much like how Intel prefers selling | higher-margin Xeons over Core i5. It's market | segmentation, but friendlier to consumers than Intel's | approach. | cturner wrote: | I went through this about a year ago, to build a low-TDP | ECC workstation. I do not have stats on failure rates, just | this anecdotal experience. Asrock and Asus seem to be the | boards to get. For RAM, I got two sticks of Samsung | M391A4G43MB1, and verified. The advice I remember from the | forums was to stick to unbuffered ram (UDIMMS). | everybodyknows wrote: | Did you consider any off-the-shelf ECC boxes? | | Found some here -- bottom of the EPYC product line starts | at $2849 ...! 
| | https://www.velocitymicro.com/wizard.php?iid=337 | loeg wrote: | Yes, the consumer parts only support UDIMMs. If you want | RDIMMs, you have to pay for EPYC. | CalChris wrote: | Yeah, the iMac Pro has the Xeon W and ECC. T'would be nice if | the Apple Silicon MacBook Pro had it. There's not much of a | reason to pay for the Pro over the Air. But like Linus, I'm | going to blame Intel for this situation in the market. Maybe | Apple will strike out on its own with Apple Silicon but since | their dominant use case is phones, I'll not hold my breath. | DCKing wrote: | Unless something weird happens, the next generation of the | Apple M-line will use LPDDR5 memory instead of the LPDDR4X | used in the Apple M1. While it probably won't support error | correction _monitoring_ , LPDDR5 has built in error | correction that silently corrects single bit flips. That | alone should be a huge reliability improvement. | | LPDDR5 will enable some much needed level of error | correction in a metric ton of other future SoC designs too. | I look forward to the future Raspberry Pi with built in | error correction capabilities. | rhn_mk1 wrote: | Doesn't intel make ECC available on the i3 line of CPUs? | xxs wrote: | Not any more[0] 10300. It used to[1] - 9300: | | 0: https://ark.intel.com/content/www/us/en/ark/products/199 | 281/... | | 1: https://ark.intel.com/content/www/us/en/ark/products/134 | 886/... | minot wrote: | I was going to say no but I just checked and at least ONE | latest generation i3 processor supports ECC | | https://ark.intel.com/content/www/us/en/ark/compare.html?pr | o... | | https://ark.intel.com/content/www/us/en/ark/products/208074 | /... | | Problem is this processor is an Embedded processor so | probably not for us | | > Industrial Extended Temp, Embedded Broad Market Extended | Temp | | My understanding is Intel does not support ECC on the | desktop unless you pay extra. | hollerith wrote: | That i3 is for file servers. 
| makomk wrote: | Yeah, that appears to be a BGA-packaged processor | designed to be permanently soldered to the board of some | embedded device, not something that you can install in | your desktop at all. I'm not sure why Intel decided to | brand their embedded processors with ECC as i3, though I | suspect the reason this range exists at all is that | companies were going with competitors like AMD instead | due to their across-the-board ECC support. | opencl wrote: | They used to support ECC in the desktop i3 lineup; the current | gen does not have ECC except in some embedded SKUs. | | https://ark.intel.com/content/www/us/en/ark/products/199280/... | vbezhenar wrote: | You can find non-Xeons with ECC support. But they are rare | and usually suitable for some kinds of micro servers. | fortran77 wrote: | While it's true that Intel only has ECC support on Xeon (and | several other chips targeted at the embedded market), it's not | true that ECC is supported well on AMD. | | We _only_ use Xeons on developer desktops and production | machines here, precisely because of ECC. It's about 1 bit | flip/month/gigabyte. That's too much risk when doing | something critical for a client. | loeg wrote: | > it's not true that ECC is supported well on AMD. | | That's an extreme claim. Why do you say so? | theevilsharpie wrote: | > it's not true that ECC is supported well on AMD | | ECC is supported on most Ryzen models[1], as long as the | motherboard supports it. In fact, ASUS and ASRock (possibly | others) have Ryzen motherboards designed for | workstation/server use where ECC support is specifically | advertised. | | [1] The only exception is the Ryzen CPUs with integrated | graphics. | js2 wrote: | Depends what you mean by supported. Semi-officially: | | _ECC is not disabled. It works, but not validated for | our consumer client platform. | | Validated means run it through server/workstation grade | testing.
For the first Ryzen processors, focused on the | prosumer / gaming market, this feature is enabled and | working but not validated by AMD. You should not have | issues creating a whitebox homelab or NAS with ECC memory | enabled._ | | https://old.reddit.com/r/Amd/comments/5x4hxu/we_are_amd_c | rea... | loeg wrote: | Your quote is for consumer platforms (Ryzen) only; GP's | statement was that ECC is not well-supported on AMD _at | all_ , which is obviously false (EPYC, Threadripper). | adrian_b wrote: | Yes there is a risk to buy a Ryzen CPU with non- | functional ECC. | | However, I use only computers with ECC, previously only | Xeons, but in the last years I have replaced many of them | with Ryzens, all of which work OK with ECC memory. | | When having to choose between a very small risk of losing | the price of a CPU and having to use for sure, during | many years, an Intel CPU with half of the AMD speed, the | choice was very obvious for me. | theevilsharpie wrote: | AMD may claim not to validate ECC on Ryzen, but it's | working well enough for major motherboard vendors to | market Ryzen motherboards with ECC advertised as a | feature. | | ECC support not being "validated," for all practical | purposes, simply means that board vendors can advertise a | board lacking ECC support as compatible with AMD's AM4 | platform, without getting a nasty letter from AMD's | lawyers. | jeffbee wrote: | > While it's true that Intel only has ECC support on Xeon | | That's not true. There are Core i3, Atom, Celeron, and | Pentium SKUs with ECC. E.g. the Core i3-9300 | | https://en.wikichip.org/wiki/intel/core_i3/i3-9300 | lighttower wrote: | Can you get decent battery life with this ecc memory in a | laptop? | dijit wrote: | Yes. ECC memory uses only marginally more power than non-ECC | memory. And memory isn't the largest consumer of battery life | by a country mile. 
| | Screen, Wi-Fi, and to a much lesser extent (unless under | load) the CPU are the biggest culprits of low battery | life. | indolering wrote: | It can actually reduce power consumption, because refresh | rates don't need to be so high: | | https://media-www.micron.com/-/media/client/global/documents... | hosteur wrote: | How did you track memory errors across thousands of physical | machines? | core-questions wrote: | https://github.com/netdata/netdata/issues/1508 | | Looks like `mcelog --client` might be a starting place? Feed | that into your metrics pipeline and alert on it like anything | else... | jeffbee wrote: | Newer Linux kernels have replaced mcelog with edac-util. I think | most shops operating systems at that scale are getting | their ECC errors out of band with IPMI SEL, though. | gsvelto wrote: | It's rasdaemon these days: | https://www.setphaserstostun.org/posts/monitoring-ecc-memory... | ikiris wrote: | The same way you do it with everything else: export the | telemetry and store it in time series... | incrudible wrote: | When you say bitflips were "common" on thousands of physical | machines, does that mean you observed thousands of bitflips? | | Otherwise, I would think that an unlikely event becoming 1000x | more likely by sheer numbers would have warped your perception. | | I believe that hardware reliability is mostly irrelevant, | because software reliability is already far worse. It doesn't | matter whether a bitflip (unlikely) or some bug (likely) causes | a node to spuriously fail; what matters is that this failure is | handled gracefully. | ikiris wrote: | It's enough that graphs can show you solar weather. | | I can't give my source, but it's far higher than most people | think. Just pay the money. | dkersten wrote: | Another comment[1] mentioned 1 bitflip per gigabyte per | month. If you have a lot of RAM, that's rather a lot.
| | > It doesn't matter whether a bitflip (unlikely) or some bug | (likely) causes a node to spuriously fail | | Except that a bitflip can go undetected. It _may_ crash your | software or system, but it also may simply leak errors into | your data, which can be far more catastrophic. | | [1] https://news.ycombinator.com/item?id=25623206 | jhasse wrote: | So can a bug. | dkersten wrote: | Yes. And? That doesn't suddenly make bitflips benign. | incrudible wrote: | The point is that you can't prevent failure by just | buying something. You have to deal with the fact that | failure _cannot be prevented_. | | In other words, if a single defective DIMM somewhere in | your deployment is causing catastrophic failure, your | mistake was not buying the wrong RAM modules. Your | mistake was relying on a single point of failure for | mission-critical data. | tyoma wrote: | It depends where the failure happens. Sometimes you really | lose the "failure in the wrong place" lottery. For example, | in a domain name: http://dinaburg.org/bitsquatting.html | jjeaff wrote: | Ya, I'm not buying that bitflips are a problem. Or maybe | modern software can correct better for this? Because I use my | desktop all day every day running tons of software on 64 GB of | RAM and I don't get errors or crashes often enough to | remember ever having one. | ChrisLomont wrote: | > I'm not buying that bitflips are a problem. | | Google and read up - it is a problem, has killed people, | has thrown election results, and much more. | | It's such a common problem that bitsquatting is a real | thing :) | | Want to do an experiment? Pick a bitsquatted domain for a | common site, and see how often you get hits. | | https://en.wikipedia.org/wiki/Bitsquatting | incrudible wrote: | Nobody denies that bitflips _happen_. On the whole, you | fail to make a case that preventing bitflips is the | solution to a problem. Bitsquatting is not a real | problem, it's a curiosity.
| | As for the case of bitflips killing someone: bitflips are | not the root cause here. The root cause is that somebody | engineered something life-critical that mistakenly | assumed hardware cannot fail. Bitflips are just one of | many reasons for hardware failure. | ChrisLomont wrote: | > Bitflips are not the root cause here. | | So those systems didn't fail when a bitflip happened? | | > The root cause is that somebody engineered something | life-critical that mistakenly assumed hardware cannot | fail. | | The systems I am aware of were designed with bitflips in | mind. NO software can handle arbitrary amounts of | bitflips. ALL software designed to mitigate bitflips only | lowers the odds via various forms of redundancy. (For | context, I've written code for NASA, written a few | proposals on making things more radiation hardened, and | my PhD thesis was on a new class of error correcting | codes - so I do know a little about making redundant | software and hardware specifically designed to mitigate | bitflips.) | | By claiming a bitflip didn't kick off the problems, and | trying to push the cause elsewhere, you may as well blame | all of engineering for making a device that can kill on | failure. | | So your argument is a red herring. | | > On the whole, you fail to make a case that preventing | bitflips is the solution to a problem | | Yes, had those bitflips been prevented, or not happened, | those fatalities would not have happened. | | > Ya, I'm not buying that bitflips are a problem. | | If bitflips are not a problem then we don't need ECC RAM | (or ECC almost anything!), which is clearly used a lot. So | bitflips are enough of a problem that a massively | widespread technology is in place to handle precisely | that problem. | | I guess you've never written a program and watched bits | flip on computers you control? You should try it - it's a | good exercise to see how often it does happen.
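The redundancy ChrisLomont describes can be illustrated with a toy single-error-correcting Hamming code. This is a sketch of the principle only (a Hamming(12,8) code over 8 data bits); real ECC DIMMs implement a (72,64) SECDED code in hardware:

```python
# Toy single-error-correcting Hamming(12,8) code: 8 data bits plus
# 4 parity bits at power-of-two positions. Illustrates the principle
# behind ECC RAM, not the actual DIMM implementation.

def encode(data_bits):
    """data_bits: 8 ints (0/1). Returns a 1-indexed 12-bit codeword list."""
    code = [0] * 13                     # index 0 unused
    it = iter(data_bits)
    for pos in range(1, 13):
        if pos & (pos - 1):             # not a power of two -> data position
            code[pos] = next(it)
    for p in (1, 2, 4, 8):              # parity bit p covers positions with bit p set
        code[p] = sum(code[pos] for pos in range(1, 13) if pos & p) % 2
    return code

def decode(code):
    """Correct up to one flipped bit in place, return the 8 data bits."""
    syndrome = 0
    for pos in range(1, 13):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions = error location
    if syndrome:                        # nonzero syndrome pinpoints the flip
        code[syndrome] ^= 1
    return [code[pos] for pos in range(1, 13) if pos & (pos - 1)]
```

Flipping any single bit of a codeword still decodes to the original data, which is exactly the "lower the odds via redundancy" trade: 4 extra bits per 8, versus the 8 extra bits per 64 that ECC DIMMs pay.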
| | I guess you define something being a problem differently | than I or the ECC RAM industry do. | dkersten wrote: | Crashes aren't such a big problem. You can detect them and | reboot or whatever. Silent data corruption is the real | issue IMHO. | | See also this comment above: | https://news.ycombinator.com/item?id=25623764 | adrian_b wrote: | On a single computer with a large memory, e.g. 32 GB or more, | the time between errors can be a few months, if you are | lucky enough to have good modules. Moreover, some of the errors will | have no effect, if they happen to affect free memory. | | Nevertheless, anyone who uses the computer for anything else | besides games or movie watching will greatly benefit from | having ECC memory, because that is the only way to learn when | the memory modules become defective. | | Modern memories have a shorter lifetime than old memories, and | very frequently they begin to have bit errors from time to | time long before breaking down completely. | | Without ECC, you will become aware that a memory module is | defective only when the computer crashes or no longer boots, | and severe data corruption in your files could have happened | months before that. | | For myself, this was the most obvious reason why ECC was | useful, because in several cases I was able to replace memory | modules that began to have frequent correctable errors, after | many years with little or no errors, without losing any | precious data and without downtime. | ikiris wrote: | The good modules bit is important. I'm told by some | colleagues that most of the bit flips are from alpha | particles from the RAM casings, surprisingly enough. | petermcneeley wrote: | I would also add that row hammer attacks are much harder to | pull off on ECC. | | When I first tried to replicate the row hammer attack I was not | getting any results. Turns out I was doing this on ECC. On non-ECC | memory the same test easily replicated the row hammer attack.
| | https://en.wikipedia.org/wiki/Row_hammer | rahimiali wrote: | I have trouble parsing information from this rant. Is someone | willing to translate this into an argument (a string of facts | tied by logical steps)? | mark-r wrote: | 1. Linux sometimes has crashes, not due to software errors but | because of memory glitches. 2. ECC would prevent memory | glitches. 3. ECC is hard to find on desktop PCs because Intel | uses the feature to differentiate desktop CPUs from server | CPUs, so it can charge more for servers. 4. Even when someone | like AMD makes the feature available, the market doesn't have | ECC DRAM modules or motherboards readily available because | Intel killed the demand for it. | phh wrote: | I don't know if ECC is that important, but the reliability of RAM (or | any storage) feels pretty crazy to me. 128GB being refreshed | every second for a month without error requires that the per-bit refresh | process has a reliability of 99.9999999999999999% to be flawless. | Considering we are dealing with quantum effects (which are | inherently probabilistic), I wouldn't trust myself to design | anything like that. | | Now back to ECC. I'll probably be corrected, but I don't think | ECC helps gain more than two orders of magnitude, so we still | need incredibly reliable RAM. If we move to ECC RAM by default | everywhere, aren't we simply going to get less reliable RAM in | the end? | formerly_proven wrote: | RAM is not as reliable as you think. Some ECC memory hardly | ever finds an error; some machines see them at a very | consistent rate, e.g. 50 errors per TB-day. That would | translate to 1-2 errors per day in a 32 GB PC. Without ECC you | cannot know which bucket you are in. | trevyn wrote: | If true, that seems like... a very straightforward bucket to | test if you're in. | toast0 wrote: | The bucket can change over time though. If you want to be | sure, you need to test often, which gets in the way of | using the computer.
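phh's back-of-the-envelope figure can be sanity-checked with a quick Poisson estimate, using the numbers from the comment (128 GB, one refresh per second, one month). The model is purely illustrative, not a DRAM datasheet; it shows that even 18 nines of per-bit-refresh reliability would still leave roughly 3 expected errors per month:

```python
import math

bits = 128 * 2**30 * 8            # 128 GB of DRAM, in bits (~1.1e12)
seconds = 30 * 24 * 3600          # one month
bit_refreshes = bits * seconds    # ~2.85e18 bit-refresh events

# Suppose each event is 99.9999999999999999% reliable (18 nines,
# i.e. a 1e-18 failure chance). Expected errors over the month:
expected_errors = bit_refreshes * 1e-18      # ~2.85
# Poisson probability of a completely flawless month: only ~6%.
p_flawless = math.exp(-expected_errors)
```

So the required per-bit reliability for a genuinely flawless month is actually somewhat higher than the 18 nines quoted, which only reinforces the comment's point about how demanding the engineering is.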
| bitcharmer wrote: | A system on Earth, at sea level, with 4 GB of RAM has a 96% | chance of having a bit error in three days without ECC | RAM. With ECC RAM, that goes down to 1.67e-10, or about one | chance in six billion. | | So I'd say ECC _is_ not only important but insanely impactful. | There's a reason why many organizations don't even want to | hear about getting rigs with non-ECC memory. | gzalo wrote: | That number is flawed, and the author did a follow-up with | better results: http://lambda-diode.com/opinion/ecc-memory-2 | | "33 to 600 days to get a 96% chance of getting a bit error." | Still, it seems way too high. I guess anyone with ECC RAM | could confirm whether they are getting those sorts of corrected | error rates? | mrlala wrote: | So, I hear what you are saying. But, on the other hand, I | have been using 2 non-ECC desktops for a workstation/server | for the past ~6 years.. and I would be hard pressed to come | up with a single situation where either of the machines | randomly crashed or applications did anything 'unexpected' | (to my knowledge, of course). | | My point is, when you say there is a "96% chance of having an | error in THREE DAYS", one would EXPECT to be having issues | like.. all the time? So I'm not disagreeing with you, but | with the amount of non-ECC machines all over the world and | how insanely stable modern machines are, it still seems like | a very low risk. | | Now of course I agree that if you want to take every | precaution, go ECC, but simple observation proves that this | "problem" can't be as bad as the numbers are saying. | bitcharmer wrote: | Your questions are perfectly valid. It's just that out of | all the random bit flips that happen over a period of time | on a non-ECC platform, only a minuscule percentage will | manifest to you in any noticeable way. | | Most will escape your attention.
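For reference, figures like "96% within three days" come from modeling bit flips as a Poisson process. A sketch of that model, showing the per-day error rates implied by both the original claim and the revised 33-to-600-day estimate (the rates themselves are the contested assumption, not facts):

```python
import math

def p_at_least_one(rate_per_day, days):
    """Chance of >= 1 bit error, modeling flips as a Poisson process."""
    return 1 - math.exp(-rate_per_day * days)

# Working backwards: "96% chance within N days" implies an error
# rate of -ln(1 - 0.96) / N errors per day.
original   = -math.log(0.04) / 3     # ~1.07 errors/day  (96% in 3 days)
revised_hi = -math.log(0.04) / 33    # ~0.10 errors/day  (96% in 33 days)
revised_lo = -math.log(0.04) / 600   # ~0.005 errors/day (96% in 600 days)
```

The spread between the original and revised rates is about 200x, which is why anecdotes like "two weeks of memtest, zero flips" downthread can be consistent with the revised numbers while flatly contradicting the original one.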
| johndough wrote: | I ran a memory test for two weeks straight on a consumer | laptop with 8 GB RAM and could not get a single bit flip, so | your mileage may vary. | bitcharmer wrote: | How did you run those tests? From what I understand of the | topic, for your results to be statistically significant you | need at least hundreds of machines and a very rigid testing | methodology. | avian wrote: | As someone who also ran a similar test and hasn't | seen a bit flip, I'm also skeptical of the 96% figure. | | I'm too lazy to run the exact numbers right now, but with | "4 GB, 96% chance, three days" as the hypothesis, | I think you'll find that an experimental result of "8 GB, | 0% chance, 14 days" is highly statistically significant. | | Edit: rough back-of-napkin estimate - you're not seeing | an event in roughly 10x trials (2x the number of bits and ~5x | the number of days). Given the hypothesis is true, your | experimental result has a probability of (1-0.96)^10 = | very, very small. Conclusion: the hypothesis is false. | bitcharmer wrote: | The 96% figure comes from Google and was obtained in a | large-scale experiment over many months. I've been in | this business long enough to have witnessed adverse | effects of cosmic rays on non-ECC memory multiple times | myself. I don't think your sample gets anywhere near | statistical significance. Not to mention the testing | methodology. | toast0 wrote: | My anecdotal evidence is far from rigorous, but the | Google data from ten years ago doesn't match up with my | experience running thousands of ECC-enabled servers up to | a few years ago.
Their rates seem a lot higher than what | my servers experienced; we would page on any RAM errors, | correctable or not (uncorrectable would halt the machine, | so we would have to inspect the console to confirm; when | we knowingly tried machines with uncorrectable errors | after a halt, they nearly all failed again within 24 | hours, so those we didn't inspect the console of probably | were counted on their second failure), and while there | were pages from time to time, it felt like a lot less | than 8% of the machines having an error. | | There are a lot of variables that go into RAM errors, | including manufacturing quality and condition of the RAM, | the DIMM, the DIMM slot, the motherboard generally, the | power supply, the wiring, and the temperature of all of | those. Google was known for cost cutting in their | servers, especially early on, so I wouldn't be surprised | if some of that resulted in a higher bitflip rate than | running in commercially available servers. Things like | running bare motherboards, supported only on the edges, | cause excess strain and can impact the resistance and | capacitance of traces on the board (and in extreme cases, | break the traces). | tomxor wrote: | I like when people back up their claims with numbers, but | would you mind describing roughly what that 96% probability | of error is based upon? | | I understand altitude has some kind of proportionality to | cosmic ray exposure, and the number of bits will multiply the | probability of _an_ error.. I'm presuming there is also an | inherent error rate to DRAM separate from the environment. But | what are those numbers? | bitcharmer wrote: | Apologies, you're totally right. I should have linked to | the source: | | http://lambda-diode.com/opinion/ecc-memory#:~:text=A%20syste.... | tomxor wrote: | Great, thanks!
| | [edit] | | Looks like the calculation was revised [0] after | criticism: | | > Under these assumptions, you'll have to wait about 33 | to 600 days to get a 96% chance of getting a bit error. | | What's more worrying is the variance: the above | calculation is based on expected, well-behaved DRAM.. yet | some computers just seem to have manufacturing defects | that make the incidence of errors high enough to be a | regular problem. | | [0] http://lambda-diode.com/opinion/ecc-memory-2 | dejj wrote: | And even higher in the vicinity of radioactive cattle: | https://www.jakepoz.com/debugging-behind-the-iron-curtain/ | davidw wrote: | Could you measure altitude with memory? | asimpletune wrote: | That's a very interesting idea, and I think you totally | could. You run some benchmarks, measure the bit flips, and | after enough runs you'd be able to say with a degree of | confidence what your altitude is. I wonder, though, what | accuracy could be achieved with this? | cyberlurker wrote: | If the 96% every 3 days is true, you could approximate | based on that. But it would be a really slow measurement. | tomxor wrote: | :D yes, although I expect you would need either a | prohibitively large quantity of memory or an extremely slow | rate of change in altitude to effectively measure it. | rafaelturk wrote: | A little bit off-topic: again it seems that Intel is the one | lowering the bar. | b0rsuk wrote: | I browsed some online listings for ECC memory modules, and they | seem to be sold one module at a time. Standard DDR4 modules are | sold in pairs, to benefit from dual-channel mode. | | Does ECC memory support dual channel?? | KingMachiavelli wrote: | Is there such a thing as 'software' ECC, where a segment in memory | also has a checksum stored in memory and the CPU just verifies it | when the memory segment is accessed? | | It would be a lot slower than real ECC, but it could be used just | for operations that would be especially vulnerable to bit flips.
| It would also not know for certain if the memory segment of data | or the memory segment holding the checksum was corrupted, besides | their relative sizes (the checksum is much smaller, so it is less likely | to have had a bit flip in its memory region). | a1369209993 wrote: | Actually... there _is_ a word of memory that you already have | to _read_ every time you access a region of memory: the page | table entry for that region. If you have 64-byte cache lines, | that's 64 lines per (4KB) page, so you could load a second | 64-bit word from the page table[0], and use that as a parity | bit for each cache line, storing it back on write the same way | you store active and dirty bits in the PTE proper. Actual E[ | _correcting_ ]C would require inflating the effective PTEs from | 8 (original)-16 (parity) bytes to about 64 (7 bits per line, | insufficient)-128 (15, excessive), which is probably untenable, | but you could at least get parity checks this way. | | There's also the obvious tactic of just storing every logical | 64-bit word as 128 bits of physical memory, which gives you | room for all kinds of crap[1], at the expense of halving your | effective memory and memory bandwidth. | | 0: This is extremely cheap since you're loading a 64- vs | 128-bit value, with no extra round trip time, and it still fits in | a cache line, so you're likely just paying extra memory use | from larger page tables. | | 1: Offhand, I think you could fit triple or even quadruple | error _correction_ into that kind of space (there's room for | _eight_ layers of SECDED, but I don't remember how well bit-level | ECC scales). | temac wrote: | Intel has some recent patents on that. | zdw wrote: | Good news is that for DDR5, ECC is a required part of the spec | and should be a feature of every module: | | https://www.anandtech.com/show/15912/ddr5-specification-rele...
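A minimal sketch of the "software ECC" idea discussed above: detection only, with a checksum stored next to each block and verified on read. The class and method names here are invented for illustration, and, as noted in the thread, this cannot catch corruption that happens before the checksum is computed or after it is verified (time-of-check vs time-of-use).

```python
import zlib

class CheckedStore:
    """Toy detection-only store: each block carries a CRC32, checked on read.
    Illustrative sketch; not a substitute for hardware ECC."""

    def __init__(self):
        self._blocks = {}

    def write(self, key, payload: bytes):
        # Checksum is computed at write time; corruption before this point
        # is invisible (the CRC would simply sign bad data).
        self._blocks[key] = (payload, zlib.crc32(payload))

    def read(self, key) -> bytes:
        payload, crc = self._blocks[key]
        if zlib.crc32(payload) != crc:   # a flip in the payload (or the CRC)
            raise ValueError(f"corruption detected in block {key!r}")
        return payload
```

Note this only detects: unlike SECDED hardware it cannot say which bit flipped, so the only safe response is to fail loudly, much like a TLS session aborting on a bad MAC.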
| [deleted] | rajesh-s wrote: | A whitepaper on DDR4 ECC by Micron that goes over some of the | implementation challenges | | https://media-www.micron.com/-/media/client/global/documents... | toast0 wrote: | On die ECC is great for increasing reliability, if all else is | equal, but if it doesn't report to the memory controller, and | if the memory controller doesn't report to the OS, I think it | will be worse than status quo, because all else won't be equal. | With no feedback, systems are going to continue to run on the | edge, but now detectable failures will all be multi-bit; | because single bit errors are hidden. | cududa wrote: | Huh? Why would the memory controller not be updated | accordingly? Also I have no idea about Linux or Mac, but | Windows has had ECC support and active management for | decades? | indolering wrote: | It's part of the firmware first trend of fixing things at | the firmware level before reporting problems up the stack. | This makes it a real nightmare for systems integrators to | do root cause analysis. | mlyle wrote: | Normally, ECC has meant just the DIMM stores some extra | bits, and the memory controller itself implements ECC-- | writing the extra parity, and recovering when errors emerge | (and halting when non-recoverable errors happen). | | DDR5 includes on-die ECC, where the RAM fixes the errors | before sending them over the memory bus. | | This means if the bus between the processor and ram | corrupts the bits-- tough luck, they're still corrupted. | And it's unclear whether we're going to get the quality of | memory error reporting that we're used to or get the | desired halt-on-non-recoverable error behavior (I've not | been able to obtain/read the DDR5 specification as yet). | cududa wrote: | Thank you! | [deleted] | hinkley wrote: | Is it built in as an added feature, or as the only way to make | DDR5 reliable? My inner cynic is screaming the latter. 
| | When the value-add feature becomes a necessity, it's not a | value add any more. | CoolGuySteve wrote: | I always wondered why ECC isn't built into the memory | controller; the same hardware that runs the bus into L3 or the | page mapper could checksum groups of cachelines. | | It seems redundant to have every module come with its own | checking hardware. | p_l wrote: | ECC is a function of the memory controller, not the memory, on | current systems. There's also usually some form of ECC on | whatever passes for the system bus, and internal caches have ECC | as well. | | For the memory controller, parity/ECC/chipkill/RAIM usually | involves simply adding additional memory planes to store | correction data. I believe the rare exceptions are fully | buffered memories, where you effectively have a separate memory | controller on each module (or add-in card with DIMMs). | kasabali wrote: | AFAIK it is built into the memory controller, at least for | ECC UDIMM. There's an extra DRAM chip on the module for | parity (generally 8+1), but it is the memory controller's | responsibility to utilize it (that's why not all CPUs support | ECC). | bradfa wrote: | I read it to say that on-die ECC is recommended but that dimm-wide | ECC is still optional. | | And now you have 8 bits of ECC per 32 data bits versus older DDR | having 8 bits of ECC per 64 data bits. Hence the cost for dimm-wide | ECC is going up. | cbanek wrote: | As someone who has had to read thousands of random game crash | reports from all over the interwebs (you know when Windows says | you might want to send that crash log? like that), I totally | agree. | | Of all the things to be worried about, like OS bugs, bad hardware | configuration, etc., bad memory is one of those really troubling | things. You look at the code and say "it can't make it here, | because this was set", but when you can't trust your memory you | can't trust anything.
| | And as the timeline goes to infinity, you may also get one of | these reports and be asked to fix it... good luck. | lighttower wrote: | Someone reads those reports!?! Wow, how do I write them to | ensure someone who reads them takes them seriously? | apankrat wrote: | Aye. I have an assert in the code that fronts a _very_ pedantic | test of the context. In all cases when this assert was tripped | (and reported), an overnight memtest86 test surfaced RAM issues. | | - Edit - | | Also, bit flips in non-ECC memory are _the_ cause of the | "bitrot" phenomenon. That is when you write out X to a storage | device, but you get Y when you read it back. A common | explanation is that the corruption happens _at rest_. However, | all drives from the last 30+ years have FEC support, so in | reality the only way bit rot can happen is if the data is | damaged _in transit_, while in RAM, on the way to/from the | storage media. | | So, if you are ever deciding whether to get ECC RAM, get it. It's very | much worth it. | pkaye wrote: | I wonder how much of those crashes are due to gamers | aggressively overclocking their systems? | faitswulff wrote: | Do the crash reports include whether the machine has ECC | memory? | jackric wrote: | Do the crash reports include recent solar activity? | cbanek wrote: | Well, I've had to actually worry about radiation bitflips | as well. It does happen. But usually not so much on Earth! | dharmab wrote: | I once got to tell a CTO the reason our shiny new point- | to-point connection was suddenly trash was due to solar | flares. | jgalentine007 wrote: | One of the tire pressure sensors in my car tires had a | bit flip a couple years ago and I had to reprogram its | ID. Luckily it was a Subaru, so only a light came on in | the dash. | | My old Honda CRV, however, would turn traction control on | if your pressure was low - which worked by applying | brakes to wheels that were slipping.
If you were going up | a slippery hill you would soon have no power, sliding | backwards nearly off the road in nowhere West Virginia on | the way to a ski resort. | jjeaff wrote: | How in the world would you ever know that problem was | caused by a bit flip and not just one of the countless | other reasons that a sensor could fail? | jgalentine007 wrote: | I have a TPMS programming tool (ATEQ QuickSet) and reader | (Autel TS401), because I like to swap my winter / summer | tires on my own. The TPMS light came on one day and | inflating tires didn't help - I used the reader and found | that one sensor's ID had changed. When I compared the ID | (it was in hex) to the last programming - it was a single | bit off. I couldn't reprogram the sensor itself, but I | was able to update the ECU with the changed ID using the | ATEQ. | | I live in Denver but spend a lot of time skiing around | 11k feet, maybe the higher elevation means more | radiation. | dharmab wrote: | Similar story, we saw that one particular IP address in a | public cloud network had a 3% TLS handshake error rate. | We diverted traffic and then analyzed with wireshark. We | found one particular bit was being pulled low (i.e. 0 -> | 0 and 1 -> 0). HTTP connections didn't notice but TLS | checksum verifications would randomly fail. Had a hell of | a time convincing the cloud provider they had a hardware | fault- turned out to be a bug which disabled ECC on some | of their hardware. | | Aside: I'm surprised you got a TPMS programming tool | instead of a set of steelies. Big wheels? Multiple winter | vehicles? | jgalentine007 wrote: | I have 2 cars. I like the TPMS to work since I've had 3 | nails in tires in 4 years (newer construction area). Also | the TPMS light in my impreza is almost as bright as the | sun. | jacquesm wrote: | Timestamp + location should be enough to figure that out. 
| ant6n wrote: | It would be interesting to see whether there is a | correlation between solar activity and game crashes -- | which in turn may provide an indication whether crashes | are due to bugs or bit flips. | Triv888 wrote: | most gaming desktops don't use ECC RAM anyways (at least | those from a few years ago) | jacquesm wrote: | On Intel consumer boxes it is pretty safe to assume that they | don't; on AMD it might be the case but it usually isn't. | Springcleaning wrote: | Worse than a game crash is your data. | | It is incomprehensible that there are still NAS devices being | sold without ECC support. | | Synology took a step in the right direction to offer prosumer | devices with ECC but it is not really advertised as such. It is | actually difficult to find which do have ECC and which ones | don't. | ksec wrote: | >Synology took a step in the right direction to offer | prosumer devices with ECC | | I just looked it up, because if it were true it would have been | news to me. Synology has been known to be stingy with | hardware specs. But none of what I'd call prosumer, the Plus | series, has ECC memory by default. And there are "Value" and | "J" series below that. | | Edit: Only two models from the new xx21 series, using the AMD Ryzen | V, have ECC memory by default. | BlueTemplar wrote: | Yeah, here's one example among many more: | | https://forums.factorio.com/viewtopic.php?p=405060#p405060 | dboreham wrote: | You don't need to look at kernel crashes to speculate about bus | and memory errors -- just check the logs on a few systems that do | have ECC. Pretty soon you'll see correctable errors being | reported. | maddyboo wrote: | I don't know much about this topic, but is it possible that ECC | memory is more prone to single bit errors than non-ECC memory | because there is less pressure on companies to minimize such | errors? If this were the case, it would skew the data.
| belzebalex wrote: | Asked myself, would it be possible to build a Geiger counter with | RAM? | johnklos wrote: | From the fortune database: | | As far as we know, our computer has never had an undetected | error. -- Weisert | otterley wrote: | D. J. Bernstein (of qmail/daemontools fame) spoke of it over a | decade ago as well. https://cr.yp.to/hardware/ecc.html | slim wrote: | these days he's more famous for the NaCl crypto library | loup-vaillant wrote: | For which bit flips are even more relevant: EdDSA has this | nasty tendency of leaking the private key if the wrong bits | are flipped (there are papers on fault injection attacks). | People who sign lots of stuff all the time, say _Let's | Encrypt_, could conceivably gain some peace of mind with ECC. | | _(Note: EdDSA is still much much better than ECDSA, most | notably because it's easier to implement correctly.)_ | 1996 wrote: | Linus is absolutely right. | | I am trying to get a laptop with dual NVMe (for ZFS) and ECC RAM. | I can't get that, at all - even without the other fancy things I | would like, such as a 4k OLED with pen/touchscreen. | | In 2020, even the Dell XPS stopped shipping OLED (goodbye dear | 7390!) | | I will gladly give my money to anyone who sells an AMD laptop with | ECC. Hopefully, it will show there's demand for "high end yet non | bulky laptops". | miahi wrote: | The Lenovo P53 has 3 NVMe slots, a 4k OLED with touchscreen (and | optional pen) and up to 128GB of ECC RAM if you choose the Xeon | processor. It's big and heavy, but it exists. | | I hope AMD will create a better market for ECC laptop | memory (right now it's hard to find + expensive). | 1996 wrote: | I know - I had my eye on this very model, as you can even add | an mSATA drive in the WWAN slot to get a 4th drive. | | Unfortunately, Lenovo is not selling the P53 anymore, which | is exactly why I say I can't get that even in a "bulky" | version.
| otterley wrote: | About 1/3 of Google's machines and 8% of Google's DIMMs in their | fleet suffer at least one correctable memory error per year: | http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf | jjeaff wrote: | Which means, assuming Google is running very large machines | with lots of memory, that one might expect a single correctable | error once every 6-10 years on your average workstation or | small server. That's generously assuming your workstation has | 1/3 as much memory as the average Google server. | Nebasuke wrote: | Google does not use very large or even large machines for | most of their fleet. You can quickly see in the paper this is | for 1, 2, and 4 GB RAM machines (in 2006-2008). | mauri870 wrote: | In case the page is not loading, refer to the wayback machine[1] | for a copy | | [1] | https://web.archive.org/web/*/https://www.realworldtech.com/... | JumpCrisscross wrote: | What is the status of ECC on Macs? | CalChris wrote: | The iMac Pro, which has a Xeon W. There's a good chance that will go | away with the new Apple Silicon iMac Pro due out this year. | The MacRumors roundup article doesn't mention ECC. | | https://www.macrumors.com/roundup/imac/ | MAXPOOL wrote: | Well shit. | | I run some large ML models on my home PC and I get NaNs and some | out-of-range floats every month or so. I have spent hours | debugging, but doing the same computation with the same random | seeds does not recreate the problem. | | How about GPUs and their GDDR SDRAM? Do they have parity bits? | layer8 wrote: | Some pro-level Nvidia GPUs have ECC RAM; they are very | expensive though. I don't think regular gaming GPUs have | parity, due to the extra cost, performance impact (probably | minor but measurable) and irrelevance for gaming. | vbezhenar wrote: | Cheap pro-level GPUs don't have ECC RAM either. And it's not | easy to find out; it might be buried somewhere. | [deleted] | JoeAltmaier wrote: | ECC works if done right.
Accessing a memory location can fix bit-flips | (ECC is a 'correcting' code). But systems that don't | regularly visit every memory location can accumulate risk. Those | dark corners of RAM can eventually get double-bit errors and be | uncorrectable. So an OS might 'wash' RAM during idle moments, | reading every location in a round-robin manner to get ECC to kick | in and auto-correct. Doesn't matter how fast (1M every hour or | whatever) as long as somehow ECC has a chance to work. | jacquesm wrote: | Interesting, similar to scrubbing raid arrays. How often do | those double bitflips appear though? You'd have to have a | pretty long-running server for that to be a problem, no? | jeffbee wrote: | According to Google's old paper on the subject, about 1% of | their machines suffered from an uncorrectable (i.e. multi-bit) | error in a year. | temac wrote: | The RAM already needs to be refreshed, and IIRC it is done by | the memory controller when not in sleep mode. | | However I don't remember if there are provisions for ECC | checking in case there are some dedicated refresh commands. I | hope so, but I'm not sure. | musingsole wrote: | A double-bit error in many cases is fine. If the error is at | least detectable at the time of a read, your protection worked. | What's scary is a triple-flip event. Most of those will still | look like corrupted data, but if it happens to flip into | looking like a fixable, single-bit error, you're out of luck | and won't even know it. | a1369209993 wrote: | > Most of those will still look like corrupted data, | | Not if you're using a typical 72-bit SECDED code[0]. | | You have two error indicators: a summary parity bit (even | number of errors: 0, 2, etc. vs odd number of errors: 1, etc.), | and an error index: 0 for no errors, or the bitwise xor of the | locations of each bit error.
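The two-indicator scheme just described can be shown concretely with a toy extended Hamming(8,4) SECDED code instead of the 72-bit one (my own illustrative code, not anyone's production decoder):

```python
def encode(nibble):
    """Extended Hamming SECDED: 4 data bits -> 8-bit codeword.
    cw[0] is the summary (overall) parity bit; cw[1..7] is Hamming(7,4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    cw = [0] * 8
    cw[3], cw[5], cw[6], cw[7] = d                 # data bit positions
    cw[1] = cw[3] ^ cw[5] ^ cw[7]                  # parity over positions with bit 0 set
    cw[2] = cw[3] ^ cw[6] ^ cw[7]                  # ... bit 1 set
    cw[4] = cw[5] ^ cw[6] ^ cw[7]                  # ... bit 2 set
    cw[0] = cw[1] ^ cw[2] ^ cw[3] ^ cw[4] ^ cw[5] ^ cw[6] ^ cw[7]
    return cw

def data(cw):
    return cw[3] | cw[5] << 1 | cw[6] << 2 | cw[7] << 3

def decode(cw):
    syndrome = 0                      # xor of the positions whose parity fails
    for i in range(1, 8):
        if cw[i]:
            syndrome ^= i
    overall = 0                       # summary parity: 0 = even #flips, 1 = odd
    for bit in cw:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        return "ok", cw
    if overall == 1:                  # odd flip count: assume single, "fix" it
        fixed = list(cw)
        fixed[syndrome] ^= 1          # syndrome 0 means the summary bit itself
        return "corrected", fixed
    return "uncorrectable", cw        # even flips, nonzero syndrome: double error

cw = encode(0b1011)
bad = list(cw)
for i in (1, 2, 6):                   # triple error: syndrome = 1^2^6 = 5
    bad[i] ^= 1
status, fixed = decode(bad)           # "corrected" -- but to the WRONG data
```

Flipping bits 1, 2 and 6 leaves the summary parity odd and a syndrome of 5, so the decoder "corrects" position 5 and silently hands back the wrong nibble: exactly the triple-flip miscorrection case being discussed here.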
| | For a triple error at bits a, b, and c, you'll have a summary | parity of 1 (odd number of errors, assumed to be 1), and an | error index of a^b^c, in the range 0..127, of which 0..71[1] | (56.25%, a clear albeit not overwhelming majority) will | correspond to legitimate single-bit errors. | | 0: https://en.wikipedia.org/wiki/Hamming_code#Hamming_codes_wit... | | 1: or 72 out of 128 anyway; the active bits might not all be | assigned contiguous indexes starting from zero, but it | doesn't change the probability, and it's simpler to analyse if | summary is bit 0 and index bit i is substrate bit 2^i. | electricshampo1 wrote: | Patrol scrub is basically this | (https://www.intel.com/content/dam/www/public/us/en/documents...); | it is built into the memory | controller, no OS involvement is needed. | electricshampo1 wrote: | working link: | | https://www.intel.com/content/dam/www/public/us/en/documents... | wagslane wrote: | It really does. I did a write-up recently on it as I was diving | in and understanding the benefits: | https://qvault.io/2020/09/17/very-basic-intro-to-elliptic-cu... | avianes wrote: | Be careful not to confuse ECC memory with ECC encryption. | | ECC memory = memory with Error-Correcting Code | | ECC encryption = Elliptic Curve Cryptography | _0ffh wrote: | Please someone correct me if I'm wrong, but as far as I can | remember, memory with extra capacity for error detection used to | be a rather common thing on early PCs. That really only changed a | couple of decades in, in order to be able to offer lower prices | to home users who didn't know or care about the difference. | Probably about the time, or earlier, when with some hard disk | manufacturers megabytes suddenly shrank to 10^6 bytes (before | kibibytes or mebibytes were a thing, btw). | wmf wrote: | Yes, PCs used to use parity memory. | musingsole wrote: | It's a shame we don't have ECC for individuals.
How many of | society's bugs come from someone wandering around with a bit | flipped? | ratiolat wrote: | I have: Asus PRIME A520M-K motherboard, 2x M391A2K43DB1-CVF | (Samsung 16GiB ECC unbuffered RAM), AMD Ryzen 5 3600. | | I specifically was looking for bang for buck, low(er) wattage and | ECC. | IanCutress wrote: | Those AMD motherboards with consumer CPUs are a bit iffy. They | run ECC memory, but it's hard to tell if it is running in ECC | mode. Even some of the tools that identify ECC is running will | say it is, even when it isn't, because the motherboard will | report it is, even when it isn't. ECC isn't a qualified metric | on the consumer boards, hence all the confusion. | linsomniac wrote: | This reminds me of last year, when we ordered a new $14K server; it | arrived and we ran it through our burn-in process, which included | running memtest86 on it, and it would, after around 7 hours, | generate errors. | | Support was only interested in whether their built-in memory tester, | which even on its most thorough setting would only run for ~3 hours, | would show errors, which it wouldn't. IIRC, the BMC was logging | "correctable memory errors", but I may be misremembering that. | | "We've run this test on every server we've gotten from you, | including several others that were exactly the same config as | this; this is the only one that's ever thrown errors". Usually | support is really great, but they really didn't care in this | case. | | We finally contacted sales. "Uh, how long do we have to return | this server for a refund?" All of a sudden support was willing to | ship us out a replacement memory module (memtest86 identified | which slot was having the problem), which resolved the problem. | | They were all too willing to have us go to production relying on | ECC to handle the memory error. | FartyMcFarter wrote: | Does anyone know why ECC memory requires the CPU to support it?
| | Naively, I can understand why error _reporting_ has dependencies | on other parts of the system, but it would seem possible for | error _correction_ to work transparently. | TomVDB wrote: | I think the memory just provides additional storage bits to | detect the issue, but doesn't contain the logic. | | This is in line with all technical parameters of DRAM: | everything must be as cheap as possible, and all the difficult | parts are moved to the memory controller. | | Which is the right thing to do, because you can share one | memory controller with multiple DRAM chips. | wmf wrote: | Historically the detection and correction is performed in the | memory controller, not the DRAM. | toast0 wrote: | As implemented today, ECC is a feature of the memory | controller. You need special RAM, because instead of 8 parallel | RAMs per bank, you need 9, and all the extra data lines to go | to the controller. | | Modern CPUs have integrated memory controllers, so that's why | the CPU needs to support it. | | Correction without reporting isn't great; anyway, you _need_ a | reporting mechanism for uncorrectable errors, or all you've | done is ensure any memory errors you do experience are worse. | nix23 wrote: | I always have that conversation when ZFS comes up. Some people | think ZFS NEEDS ECC, but in fact ZFS needs ECC as much as every | single FS on Linux does. And every single reliable machine needs | ECC. | paulie_a wrote: | There was a great defcon talk a while back regarding using ECC. | The concept was called "dns jitter". | | Basically you can register domains using small bit differences | for domains and start getting email and such for that domain. | | If I recall correctly the example given was a variation of | microsoft.com. | | All because so much equipment doesn't use ECC. | zx2c4 wrote: | Voila http://media.blackhat.com/bh-us-11/Dinaburg/BH_US_11_Dinabur... | tyoma wrote: | There were some great follow-up talks as well!
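The linked Dinaburg talk called this "bitsquatting": registering the domains that are one memory bit-flip away from a popular one. Enumerating the candidates is mechanical; a throwaway sketch of my own (not from the talk's tooling):

```python
def bitsquats(label):
    """All one-bit-flip variants of a DNS label that are still
    plausible label characters (letters, digits, hyphen)."""
    ok = set("abcdefghijklmnopqrstuvwxyz0123456789-")
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):                    # flip each of the 8 bits in turn
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            if flipped in ok and flipped != ch:
                variants.add(label[:i] + flipped + label[i + 1:])
    return sorted(variants)

# e.g. "micposoft" and "mic2osoft" are each one flipped bit from "microsoft"
```

A bit flip in a client's RAM before DNS resolution sends the request to whichever of these an attacker has registered, which is why the talk measured real traffic arriving at such domains.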
It turns out a | viable attack vector was also MX records. And there was the | guy who registered kremlin.re (versus kremlin.ru). | jeffbee wrote: | micposoft.com is only one bit away from microsoft.com. Used to | see these problems all the time when I worked on gmail. | | At Google, even with ECC everywhere, there wasn't enough | systematic error detection and correction to prevent the global | database of monitoring metrics from filling up with garbage. | /rpc/server/count was supposed to exist, but also in there would | be /lpc/server/count and /rpc/sdrver/count and every other | thing. Reminded me daily of the terrors of flipped bits. | [deleted] | louwrentius wrote: | ECC matters, even on the desktop; it's not even a discussion, to | me. | | If you think it doesn't matter: how do you know? If you don't run | with ECC memory, you'll never know if memory was corrupted (and | recovered). | | That blue screen, that sudden reboot, that program crashing. That | corrupted picture of your kid. | | Who knows. | | I'll tell you, who knows. God damn every sysadmin (or the modern | equivalent) can tell you how often they get ECC errors. And at | even a small scale you'll encounter them. I have, on servers and | even on a SAN storage controller, for crying out loud. | supernovae wrote: | I've got nearly 30 years of experience and not once has non-ECC | memory led to corruption. Maybe a crash, maybe a panic, maybe | a kernel dump... | | But.. in all my time operating servers over 3 decades, it's | always been bad drivers, bad code and problematic hardware | that's caused most of my headaches. | | Have I seen ECC error correction in logs? Yeah.. I don't | advocate against it, but I've found for most people you design | around multiple failure scenarios more than you design around | preventing specific ones. | | Take the average web app - you run it on 10 commodity systems | and distribute the load..
if one crashes, so what. Chances are, | a node will crash for many more reasons other than memory | issues. | | If you have an app that requires massive amounts of RAM, or you | do put all of your eggs in one basket, then ECC makes sense... | | I just know I like going horizontal and I avoid vertical | monoliths. | louwrentius wrote: | The problem with memory corruption is not just crashes; those | are the more benign outcomes. | | The real killer is data corruption. How would you even begin | to know that data is corrupted until it is too late? | ajnin wrote: | > I've got nearly 30 years of experience and not once has | non-ECC memory led to corruption | | How do you know? | ptx wrote: | > if one crashes, so what | | Crashes might not matter, but silent data corruption does. | The owner/user of that data will care when they eventually | discover that it at some point mysteriously got corrupted. | alkonaut wrote: | I know what it does, but I still don't care (so long as it | costs money or even 1% performance). | | It's a tradeoff between money/performance and the frequency of | crashes, corruption etc. | | Bit rot is just one of many threats to my data. Backups take | care of that as well as other threats like theft, fire, | accidental deletion.
| | No, that's the big mistake people make: backups just backup | bit-rotted data, until it is too late and the last good | version is rotated out and lost forever. | alkonaut wrote: | I'm aware. But the risk is extremely small (and 99.9% of | important data is not created on the machine but goes | directly from e.g iOS camera to backup). | | My desktop machine is basically a gaming rig with | disposable data. Hence the "performance over integrity". | | I also never rotate anything out. Every version of | everything is in the backups. Storage is that cheap these | days. | mark-r wrote: | Backups can't fix what was already corrupted when it was | written to disk. | kensai wrote: | "ECC availability matters a lot - exactly because Intel has been | instrumental in killing the whole ECC industry with it's horribly | bad market segmentation." | | Its. | | There, I finally corrected Linus Torvalds in something. :)) | hugey010 wrote: | He uses "do do" instead of "to do" which is a more obvious | typo. Linus usually comes across as borderline arrogant, and | deservedly so, but not necessarily perfect in his writing. I | think it's an effective strategy to communicate his priorities | and wrangle smart but easily intimidated folk "do do" what he | believes is right! | mark-r wrote: | I have a simple way of remembering when to leave out the | apostrophe. His, hers, its are all possessive and none of them | have an apostrophe. | Glanford wrote: | In this particular case 'it's' can also be possessive | although it's considered non-standard, so to be correct you | can always treat it like a contraction of 'it is'. | raverbashing wrote: | Yeah I'm always annoyed with this kind of mistake. Especially | as non-native speakers should know better than the native ones | (which usually don't give a f.). | | Now the point about internally doing ECC is an interesting one, | could be a way out of this mess. 
And apparently ECC is more | available in AMD land | tssva wrote: | The really annoying thing is that auto correct on mobile | device keyboards will often want to incorrectly change "its" | to "it's" or vice versa. | raverbashing wrote: | Yes, auto-corrects compound the problem. | simias wrote: | For a 2nd language speaker making these homophonic mistakes | is actually a sign of fluency. It means that you just | transcribe a mental flow of words instead of consciously | constructing the language. | | The first time I wrote "your" instead of "you're" in English | I thought it was quite a milestone! | raverbashing wrote: | > For a 2nd language speaker making these homophonic | mistakes is actually a sign of fluency. | | I kinda disagree because while the homophony works in | (spoken) English in written it stands as a sore thumb. So | yeah you will make it if you only heard it but doesn't know | the written form. | | (And in their native language it's probably two unrelated | words, so that might intensify the feeling of wrongness) | simias wrote: | I mean, my native language is French where "your" is | "ton" and "you're" is "tu es", yet it (rarely) happens | that I mix them up in English. If I proofread I'll spot | it almost every single time, but if I'm just typing my | "stream of consciousness" my brain's speech-to-text | module sometimes messes up. | leetcrew wrote: | meh, plenty of (intelligent!) native english speakers do | not know all the canonical grammar rules. english | contains a lot of what could be considered error | correction bits, so it doesn't usually impede | understanding. syntactically perfect english with | weird/misused idioms (common among non-native speakers | with lots of formal education) is harder to understand in | my experience. I imagine this is true of most natural | languages. | protomolecule wrote: | For what its worth as a non-native speaker I too started | making this kind of errors when my English became fluent | enough. 
| [deleted] | andi999 wrote: | Yes. I noticed this. When I was younger, I thought: how can | you mix up 'their, they're, there'? People who do this must | be the opposite of smart. This lasted for 4 years living in | an English-speaking country.... | harperlee wrote: | As an "English as a second language" user, I can't see | myself writing e.g. "should of" instead of "should have", | however fluent I am. I think you don't make that kind of | typo unless you have learnt English before grammar. | simias wrote: | I also wouldn't do this one, but that's because in my | English accent I simply wouldn't pronounce them the same | way. Also the word sequence "should of" is extremely | uncommon in proper English, so it catches the eye more | easily I think. | | "You're/your", "their/they're", "its/it's" and the like | are a different story, because I do pronounce those the | same and they're all very common. | lolc wrote: | I was quite surprised when it started happening to me. | harperlee wrote: | Wow that's interesting! | young_unixer wrote: | I've realized that when I'm engaged in the writing (angry or | emotional in some way) I tend to commit more of these | mistakes, even though I know the difference between "it's" | and "its". Linus is always angry, so that probably makes him | commit more orthographic mistakes. | touisteur wrote: | I think it's available on consumer SKUs on AMD and not just | for servers like in 'Xeon-land'... How I've wanted an ECC-ready | NUC... | jeffbee wrote: | The AMD parts all have the ECC feature, but the platform | support outside of EPYC may as well not exist. Most | motherboards for the Ryzen segment don't do it properly or | don't do it at all; some support it but aren't capable of | reporting events to the operating system, which is dumb. | Ryzen laptops don't have it either. | | The closest you can come to a NUC with ECC is, I think, a mini | server equipped with one of the four-core i3 parts that | have ECC.
| erkkie wrote: | Probably not what you meant but https://ark.intel.com/content/www/us/en/ark/products/190108/... has support for Xeon | (and ECC). Now how to actually practically source 32GB ECC | enabled SO-DIMM sticks .. | africanboy wrote: | As a non native speaker, my phone has both the Italian and | English dictionary, when I write its it always auto corrects | to it's as soon as I hit space and sometimes it goes | unnoticed. | phkahler wrote: | >> But is ECC more available in AMD land? | | Yes it is. The problem is they don't really advertise it. I'm | not certain but it might even be standard on AMD chips, but | if they don't say so and board makers are also unclear, who | knows... | ethbr0 wrote: | It's a market size problem. | | For consumer motherboard OEMs, only AMD effectively has ECC | support (Intel's has been so spotty and haphazard from | product to product), and of AMD users, only a small number | care about ECC. | | So motherboard companies, being resource and time-starved | as they are, don't make it a priority to address such a | small user-base. | | If Intel started shipping ECC on everything, it would go a | long way towards shifting the market. | [deleted] | jacquesm wrote: | How is your Finnish? | jankeymeulen wrote: | Or Swedish for that matter, as I believe Torvalds' maternal | language is Swedish | [deleted] | jacquesm wrote: | Finnish is stupendously hard. Far harder than Swedish, at | least, by my estimation. | dancek wrote: | Yes. Swedish is also easy compared to English and French, | the other two languages I've learned after early | childhood. The only thing that makes it hard is that you | never really have use for it and you're forced to learn | it nevertheless here in Finland. | | I'm happy to see people here on HN respect the difficulty | of learning languages. Most foreigners that speak Finnish | do it very poorly at first and even after decades they | still sound like foreigners.
But it shows huge respect to | our small country for someone to make the effort, and we | really appreciate it. I'm hoping other people see | learning their own mother tongue the same way. Sure, most | of us need English, but learning it _well_ is still a | huge task. | dehrmann wrote: | It is. Swedish and English are both Germanic languages, | so there are a lot of commonalities. Finnish is in a | completely different language family. English and Swedish | are more closely related to Persian and Hindi than to | Finnish. | young_unixer wrote: | Yes. https://www.youtube.com/watch?v=0rL-0LAy04E | Igelau wrote: | It could use some Polish. | jacquesm wrote: | Dobrze ;) | xxs wrote: | Linus must have English as his '1st' language now. For non- | originally-native speaker mistakes like 'it's vs its', 'than | vs then', etc. are pretty uncommon. | Tade0 wrote: | I guess this is what happens when someone first learns to | _speak_ the language, learning how to write in it only | later on - as it often is the case with children. | | I spent my preschool years in a multicultural environment | and English was our _lingua franca_ (ironically the school- | mandated language was French), so I didn't properly learn | contractions until grade school - same with similarly | sounding words like "than vs then" and "your vs you're". | jacquesm wrote: | I've spent my whole life speaking multiple languages and | this still trips me up every now and then, in fact quotes | as such are a problem for me and I keep using them wrong, | no idea why, it just won't register. So unless I slow down | to 1/10th of my normal writing speed I will definitely make | mistakes like that. 
Good thing we have proofreaders :) | dehrmann wrote: | (guessing you mean apostrophes) | | It's because they have two different uses (three if you | count nested quotes, but those aren't common and are | pretty easy to figure out), contractions and possession, | and they seemingly collide on words like "its" where | you'd think it could mean either. | | Not sure if you've already learned this (or if it helps), | but English used to be declined, and its pronouns still | are, e.g. they/their/them. That's why "its" isn't | contracted; the possessive marker is already in the word. | mixmastamyk wrote: | His, hers, its | JosephRedfern wrote: | Maybe he composed the message using a machine with non-ECC RAM | and suffered a bit flip which, through some chain of events, | led to the ' being added. Best to give him the benefit of | the doubt, I think! | notretarded wrote: | The mistake was that it was included. | JosephRedfern wrote: | Oops, that was dumb. Fixed, thanks. | spacedcowboy wrote: | Seems likely that "bad ram" was the reason for the recent AT&T | fiber issues, given that 1 bit was being flipped reliably in data | packets [1] | | [1]: | https://twitter.com/catfish_man/status/1335373029245775872?l... | p_l wrote: | I have in the past encountered an issue where a line card was | stripping exactly one bit of address data. I don't know of the | follow-up investigation, but it probably wasn't TCAM | SV_BubbleTime wrote: | I think you meant seems _un_ likely | MarkusWandel wrote: | This is one justified Linus rant! My personal history includes | data loss twice because of defective RAM, and many more RAMs | discarded after the now obligatory overnight run of MemTest86+ | (these were all secondhand RAMs - I would never buy a new one | without a refund guarantee). My very first "PC" still had the ECC | capability and I used it.
My own now very dated rant on the | subject: http://wandel.ca/homepage/memory_rant.html | mixmastamyk wrote: | A few years back memtest86 wouldn't run on newer machines, has | that been fixed? | IgorPartola wrote: | I wish this was more of a cohesive argument. He says he thinks | it's important and points to row-hammer problems but doesn't | explain why. Probably because the audience it was written for | already knows the arguments of why, but this is not the best | argument. | | If in doubt, get ECC. Do your own research on how it works and | why. This post won't explain it, just will blame Intel (probably | rightfully so). | turminal wrote: | It's a message in a thread from a technological forum. I think | its intended audience are people already familiar with ECC | unlike here on HN. | IgorPartola wrote: | Exactly my point :) | eloy wrote: | He does explain it: | | > We have decades of odd random kernel oopses that could never | be explained and were likely due to bad memory. And if it | causes a kernel oops, I can guarantee that there are several | orders of magnitude more cases where it just caused a bit-flip | that just never ended up being so critical. | | It might be false, but I think it's a reasonable assumption. | IgorPartola wrote: | To someone on HN who isn't familiar with what ECC does that | explains nothing about how ECC works and how it could have | prevented these situations. Or how often they really happen. | simias wrote: | The problem is that, if you don't have ECC to detect the | errors, it's very hard to know what exactly caused a | random, non-reproducible crash. Especially in kernel mode | where there's little memory protection and basically any | driver could be writing anywhere at any time. 
| | I can understand Linus's frustration from that point of | view: without ECC RAM, when you get some super weird crash | report where some pointer got corrupted for no apparent | reason, you can't be sure if it was just a random bitflip | or if it's actually hiding a bigger problem. | andi999 wrote: | You could run memtest on a PC without ECC for a couple of | days to estimate the error rate, or not? | fuster wrote: | Pretty sure most memory test tools like memtest86 write | the memory and then read it back shortly thereafter in | relatively small blocks. This makes the window for errors | to be introduced dramatically smaller. Most memory in a | computer is not being continually rewritten under normal | use. | simias wrote: | If you manage to replicate bitflips every few days your | RAM is broken. | | It's the "once every other year" type of bitflip that's | the problem. The proverbial "cosmic ray" hitting your | DRAM and flipping a bit. That will be caught by ECC but | it'll most likely remain a total mystery if it causes | your non-ECC hardware to crash. | zlynx wrote: | It isn't only cosmic rays. Regular old radiation can also | cause it. I've read about a server that had many repeated | problems and the techs replaced the entire motherboard at | one point. | | Then one of them brought in his personal Geiger counter | and found the radiation coming off the steel in that rack | case was significantly higher than background. | | You may never know when the metal you use was recycled | from something used to hold radioactive materials. | reader_mode wrote: | It takes 5 seconds to Google ECC memory if you're really | interested, and if you're working on kernel related stuff | you 99.9999% know what it is. | IgorPartola wrote: | Right. My point is that TFA serves zero purpose to most | people on here. Those that know how ECC works already | know that it is a must have.
Those that don't will learn | very little from the post because it fails to explain | what ECC is and why you need it aside from general | statements about memory errors. It will reaffirm for | those that know what ECC RAM is that it's a good | idea, but they already know it anyways. It reads a lot | like an article about why vitamin C is a good thing. | nix23 wrote: | To someone on HN who isn't familiar with what Google does | that explains nothing about how Google works ;) | TheCoelacanth wrote: | Google is like an evil version of Duck Duck Go. | Danieru wrote: | Nah, to Google is just a generic verb. For example I too | do all my googling at Duck Duck Go. | | Hi alphabet lawyers. | vorticalbox wrote: | I believe there was a suit against alphabet about this | very thing. | | They argued that 'Google' has now become a verb meaning | 'to search the Internet for' and as such alphabet should | have the name taken away. | chalst wrote: | From https://en.m.wikipedia.org/wiki/ECC_memory - | | > A large-scale study based on Google's very large number | of servers was presented at the SIGMETRICS/Performance '09 | conference.[6] The actual error rate found was several | orders of magnitude higher than the previous small-scale or | laboratory studies, with between 25,000 (2.5 x 10^-11 | error/bit*h) and 70,000 (7.0 x 10^-11 error/bit*h, or 1 bit | error per gigabyte of RAM per 1.8 hours) errors per billion | device hours per megabit. More than 8% of DIMM memory | modules were affected by errors per year | unixhero wrote: | Fantastic burn by Linus Torvalds, who also had some skin in the | CPU game. | | Offtopic, I wonder if he trawls that site regularly. And | eventually I wonder, is he here also? :) | knorker wrote: | I have multiple times postponed buying new computers for YEARS, | because I'm waiting for intel to get their head out of their ass | and actually let me buy something that does ECC for desktop.
| (incl laptops) | | I would have bought computers when I "wanted one". Now I buy them | when I _need_ one. Because buying a non-ECC computer just feels | like buying a defective product. | | In the last 10 years I would have bought TWICE as many computers | if they hadn't segmented their market. | | Fuck intel. I sense that Linus self-censored himself in this | post, and like me is even angrier than the text implies. | vbezhenar wrote: | There are plenty of Xeons which are suitable for desktops and | there are plenty of laptops with Xeons. | | Price is not nice though. | skibbityboop wrote: | Have you finally stopped buying Intel? Current Ryzens are a | much better CPU anyhow, just dump Intel and be happy with your | ECC and everything else. | jhoechtl wrote: | I definitely do not want Linus Torvalds yelling at me in that | tone --- but reading his utterings is certainly entertaining. | indolering wrote: | My favorite example is a bit flip altering election results: | | https://www.wnycstudios.org/podcasts/radiolab/articles/bit-f... | qwerty456127 wrote: | ECC should be everywhere. It seems outrageous to me almost no | laptops have ECC. | arendtio wrote: | It would be interesting to see how many more kernel oops appear | on machines without ECC compared to those with ECC. | nostrademons wrote: | I still remember Craig Silverstein being asked what his biggest | mistake at Google was and him answering "Not pushing for ECC | memory." | | Google's initial strategy (c. 2000) around this was to save a few | bucks on hardware, get non-ECC memory, and then compensate for it | in software. It turns out this is a terrible idea, because if you | can't count on memory being robust against cosmic rays, you also | can't count on the software being stored in that memory being | robust against cosmic rays. And when you have thousands of | machines with petabytes of RAM, those bitflips do happen. 
Google | wasted many man-years tracking down corrupted GFS files and index | shards before they finally bit the bullet and just paid for ECC. | maria_weber23 wrote: | ECC memory can't eliminate the chances of these failures | entirely. They can still happen. Making software resilient | against bitflips in memory seems very difficult though, since | it not only affects data, but also code. So in theory the | behavior of software under random bit flips is well... Random. | You probably would have to use multiple computers doing the | same calculation and then take the answer from the quorum. I | could imagine that doing so would still be cheaper than using | ECC RAM, at least around 2000. | | Generally this goes against software engineering principles. | You don't try to eliminate the chances of failure and hope for | the best. You need to create these failures constantly (within | reasonable bounds) and make sure your software is able to | handle them. Using ECC RAM is the opposite. You just make it so | unlikely to happen that you will generally not encounter these | errors at scale anymore, but nonetheless they can still happen | and now you will be completely unprepared to deal with them, | since you chose to ignore this class of errors and sweep it | under the rug. | | Another interesting side effect of quorum is that it also makes | certain attacks more difficult to pull off, since now you have | to make sure that a quorum of machines gives the same "wrong" | answer for an attack to work. | colejohnson66 wrote: | > You probably would have to use multiple computers doing the | same calculation and then take the answer from the quorum. | | The Apollo missions (or was it the Space Shuttle?) did this. | They had redundant computers that would work with each other | to determine the "true" answer. | EvanAnderson wrote: | The Space Shuttle had redundant computers.
The Apollo | Guidance Computer was not redundant (though there were two | AGCs onboard-- one in the CM and one in the LEM). The | aerospace industry has a history of using redundant | dissimilar computers (different CPU architectures, multiple | implementations of the control software developed by | separate teams in different languages, etc) in voting-based | architectures to hedge against various failure modes. | haolez wrote: | Sounds similar to smart contracts running on a blockchain | :) | buildbuildbuild wrote: | This remains common in aerospace, each voting computer is | referred to as a "string". | https://space.stackexchange.com/questions/45076/what-is-a- | fl... | sroussey wrote: | In aerospace where this is common, you often had multiple | implementations, as you wanted to avoid software bugs | made by humans. Problem was, different teams often | created the same error at the same place, so it wasn't as | effective as it would have seemed. | tomxor wrote: | > Making software resilient against bitflips in memory seems | very difficult though, since it not only affects data, but | also code. | | There is an OS that pretty much fits the bill here. There was | a show where Andrew Tanenbaum had a laptop running Minix 3 | hooked up to a button that injected random changes into | module code while it was running to demonstrate its | resilience to random bugs. Quite fitting that this discussion | was initiated by Linus! | | Although it was intended to protect against bad software I | don't see why it wouldn't also go a long way in protecting | the OS against bitflips. Minix 3 uses a microkernel with a | "reincarnation server" which means it can automatically | reload any misbehaving code not part of the core kernel on | the fly (which for Minix is almost everything). This even | includes disk drivers.
In the case of misbehaving code there | is some kind of triple redundancy mechanism much like the | "quorum" you suggest, but that is where my crude | understanding ends. | slumdev wrote: | Error-correcting code (the "ECC" in ECC) is just a quorum at | the bit level. | sobriquet9 wrote: | Modern error correction codes can do much better than that. | eevilspock wrote: | I'm surprised that the other replies don't grasp this. | _This_ is the proper level to do the quorum. | | Doing quorum at the computer level would require | synchronizing parallel computers, and unless that | synchronization were to happen for each low level | instruction, then it would have to be written into the | software to take a vote at critical points. This is going | to be greatly detrimental both to throughput and software | complexity. | | I guess you could implement the quorum at the CPU level... | e.g. have redundant cores each with their own memory. But | unless there was a need to protect against CPU cores | themselves being unreliable, I don't see this making sense | either. | | At the end of the day, _at some level_ , it will always | come down to probabilities. "Software engineering | principles" will never eliminate that. | slumdev wrote: | I would highly recommend a graduate-level course in | computer architecture for anyone who thinks ECC is a | 1980s solution to a modern problem. | | There are a lot of seemingly high-level problems that are | solved (ingeniously) in hardware with very simple, very | low-level solutions. | bollu wrote: | Could you please link me to such a course that displays | the hardware level solutions? I'm super interested! | slumdev wrote: | https://www.udacity.com/course/high-performance-computer- | arc... 
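The "quorum at the bit level" framing above can be made concrete with the classic Hamming(7,4) code: 3 parity bits protect 4 data bits and pinpoint any single flipped bit, far cheaper than triplicating every bit and voting. A toy sketch (real DIMM ECC uses wider SECDED codes over 64-bit words; the function names here are just illustrative):

```python
# Hamming(7,4): 4 data bits + 3 parity bits. Any single bit flip in
# the 7-bit codeword can be located -- and therefore corrected.

def encode(d):
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(c):
    # each syndrome bit re-checks one parity group; together the three
    # bits spell out the 1-based position of a single flipped bit
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # covers positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # covers positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # covers positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3
    c = list(c)
    if syndrome:                      # non-zero -> flip the bad bit back
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]   # recover d1..d4

data = [1, 0, 1, 1]
word = encode(data)
word[4] ^= 1                          # simulate a cosmic-ray bit flip
assert decode(word) == data           # the flip is located and corrected
```

Note the asymmetry versus naive quorum: 3 check bits protect 4 data bits here (and in real SECDED, 8 check bits protect 64), whereas bit-level voting would need 8 extra bits just to protect 4.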
| andrewaylett wrote: | https://en.wikipedia.org/wiki/NonStop_(server_computers) | | My first employer out of Uni had an option for their | primary product to use a NonStop for storage -- I think | HP funded development, and I'm not sure we ever sold any | licenses for it. | sobriquet9 wrote: | If you use multiple computers doing the same calculation and | then take the answer from the quorum, how do you ensure the | computer that does the comparison is not affected by memory | failures? Remember that _all_ queries have to through it, so | it has to be comparable in scale and power. | rovr138 wrote: | > how do you ensure the computer that does the comparison | is not affected by memory failures? | | You do the comparison on multiple nodes too. Get the | calculations. Pass them to multiple nodes, validate again | and if it all matches, you use it. | sobriquet9 wrote: | > validate again | | Recursion, see recursion. | Guvante wrote: | I mean raft and similar algorithms run multiple | verification machines because a single point of failure | is a single point of failure. | wtallis wrote: | See also Byzantine fault tolerance: https://scholar.harva | rd.edu/files/mickens/files/thesaddestmo... | hn3333 wrote: | Bit flips can happen, but regardless if they can get repaired | by ECC code or not, the OS is notified, iirc. It will signal | a corruption to the process that is mapped to the faulty | address. I suppose that if the memory contains code, the | process is killed (if ECC correction failed). | wtallis wrote: | > I suppose that if the memory contains code, the process | is killed (if ECC correction failed). | | Generally, it would make the most sense to kill the process | if the corrupted page is _data_ , but if it's code, then | maybe re-load that page from the executable file on non- | volatile storage. (You might also be able to rescue some | data pages from swap space this way.) 
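The computer-level quorum discussed above bottoms out in a majority function like this toy sketch (`vote` is an illustrative name; the genuinely hard parts -- synchronizing replicas, and the "who checks the checker" objection -- are deliberately left out):

```python
from collections import Counter

def vote(replies):
    # Majority vote over results computed by independent replicas.
    # A strict majority must agree; otherwise the fault was not masked
    # and the only safe move is to fail loudly.
    value, count = Counter(replies).most_common(1)[0]
    if count * 2 <= len(replies):
        raise RuntimeError("no majority -- redundancy defeated")
    return value

# one corrupted replica is outvoted by the other two
assert vote([42, 42, 41]) == 42
```

With two replicas you can only detect disagreement, not resolve it, which is why voting architectures use three or more strings.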
| gizmo686 wrote: | If you go that route, you should be able to avoid the | code/data distinction entirely, as data pages can also be | completely backed by files. I believe the kernel already | keeps track of what pages are a clean copy of data from | the filesystem, so I would think it would be a simple | matter of essentially paging out the corrupted data. | | What would be interesting is if userspace could mark a | region of memory as recomputable. If the kernel is | notified of memory corruption there, it triggers a | handler in the userspace process to rebuild the data. | Granted, given the current state of hardware, I can't | imagine that is anywhere near worth the effort to | implement. | AaronFriel wrote: | It can't eliminate it but: | | 1. Single bitflip correction along with Google's metrics | could help them identify algorithms they've got, customers' | VMs that are causing bitflips via rowhammer, and machines | which have errors regardless of workload | | 2. Double bitflip detection lets Google decide if, say, they | want to panic at that point and take the machine out of | service, and they can report on what software was running or | why. Their SREs are world-class and may be able to deduce if | this was a fluke (orders of magnitude less likely than a | single bit flip), if a workload caused it, or if hardware | caused it. | | The advantage the 3 major cloud providers have is scale. If a | Fortune 500 were running their own datacenters, how likely | would it be that they have the same level of visibility into | their workloads, the quality of SREs to diagnose, and the | sheer statistical power of scale? | | I sincerely hope Google is not simply silencing bitflip | corrections and detections. That would be a profound waste. | tjoff wrote: | ECC seems like a trivial thing to log and keep track of. | Surely any Fortune 500 could do it and would have enough | scale to get meaningful data out of it?
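Logging ECC events really is nearly trivial on Linux once EDAC is active: each memory controller exposes running counts of corrected and uncorrected errors as plain sysfs files. A minimal sketch (assumes the standard `ce_count`/`ue_count` layout under the EDAC sysfs tree):

```python
from pathlib import Path

def ecc_error_counts(edac_root="/sys/devices/system/edac/mc"):
    """Collect corrected/uncorrected ECC error counts per memory
    controller from the EDAC sysfs tree (mc0, mc1, ...)."""
    counts = {}
    for mc in sorted(Path(edac_root).glob("mc*")):
        counts[mc.name] = {
            "corrected": int((mc / "ce_count").read_text()),
            "uncorrected": int((mc / "ue_count").read_text()),
        }
    return counts
```

Ship these numbers into whatever metrics pipeline a fleet already has and the "statistical power of scale" argument above applies at Fortune 500 scale too.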
| giantrobot wrote: | I don't think ECC is going to give anyone a false sense of | security. The issue at Google's scale is they had to spend | thousands of person-hours implementing in software what they | would have gotten for "free" with ECC RAM. Lacking ECC (and | generally using consumer-level hardware) compounded scale and | reliability problems, or at least made them more expensive | than they might otherwise have been. | | Using consumer hardware and making up reliability with | redundancy and software was not a bad idea for early Google | but it did end up with an unforeseen cost. Even a thousand | machines in a cosmic ray proof bunker will end up with memory | errors ECC will correct for free. It's just reducing the | surface area of "potential problems". | Animats wrote: | _consumer hardware..._ | | That's Intel's PR. Only "enterprise hardware", with a | bigger markup, supports ECC memory. Adding ECC today should | add only 12% to memory cost. | | AMD decided to break Intel's pricing model. Good for them. | Now if we can get ECC at the retail level... | | The original IBM PC AT had parity in memory. | ksec wrote: | >I still remember Craig Silverstein being asked what his | biggest mistake at Google was and him answering "Not pushing | for ECC memory." | | Did they ( Google ) or he ( Craig Silverstein ) ever officially | admit it on record? I did a Google search and results that came | up were all on HN. Did they at least make a few PR pieces | saying that they are using ECC memory now? Because I don't see | any when searching. Admitting they made a mistake without | officially saying it? | | I mean, the whole "servers or computers might not need ECC" | insanity was started entirely because of Google [1] [2], with | news and articles published even in the early 00s [3]. And | after that it spread like wildfire and became a commonly | accepted fact that even Google doesn't need ECC.
Just like the claim that | Apple was using custom ARM instructions to achieve their fast | JS VM performance became a "fact". ( For the last time, no they | didn't ). And proponents of ECC memory have been fighting this | misinformation like mad for decades. To the point of giving up and | only ranting about it every now and then. [3] | | [1] https://blog.codinghorror.com/building-a-computer-the- | google... | | [2] https://blog.codinghorror.com/to-ecc-or-not-to-ecc/ | | [3] https://danluu.com/why-ecc/ | tyoma wrote: | Figure this is as good of a time as any to ask this: | | There are many various DRAMs in a server (say, for disk cache). | Has Google or anyone who operates at a similar scale seen | single bit errors in these components? | [deleted] | gh02t wrote: | The supercomputing community has looked at some of the effects | on different parts of the GPU. | | https://ieeexplore.ieee.org/abstract/document/7056044 | bsder wrote: | This is as old as computing and predates Google. | | When America Online was buying EV6 servers as fast as DEC | could produce them, they used to see about 1 _double_ | bit error per day across their server farm that would reboot | the whole machine. | | DRAM has only gotten worse--not better. | gigatexal wrote: | I mean early on, sure, at a startup where you're not printing | money I can see how saving on hardware makes sense. But surely | you don't need an MBA to know that hardware will continue to | get cheaper whereas developers and their time will only get | more expensive: better to let the hardware deal with it than to | burden developers with it ... I'd have made the case for ECC | but hindsight being what it is ... | colejohnson66 wrote: | But if you can save $1M+ now, then throw the cost of fixing | it onto the person who replaces you, why do you care? You | already got your bonus and jumped ship. | starfallg wrote: | Recent advances have blurred the lines a bit.
The ECC memory | that we all know and love is mainly side-band ECC, with the | memory bus widened to accommodate the ECC bits driven by the | memory controller. However, as process sizes shrink, bit flips | become more likely, to the point that now many types of memory | have on-die ECC, where the error correction is handled | internally on the DRAM modules themselves. This is present on | some DDR4 and DDR5 modules, but information on this is kept | internal by the DRAM makers and not usually public. | | https://semiengineering.com/what-designers-need-to-know-abou... | | There has been a lot of debate regarding this that was | summarised in this post - | | https://blog.codinghorror.com/to-ecc-or-not-to-ecc/ | type0 wrote: | Consumer awareness of ECC needs to be better. With recent | security implications I simply can't understand why more | motherboard manufacturers don't support it on AMD. Intel of | course is all to blame on the blue side, I stopped buying their | overpriced Xeons because of this. | rajesh-s wrote: | Good point on the need for awareness! | | The industry has convinced the average user of consumer | hardware that PPA (Power, Performance, Area) is all that needs to | get better with generational improvements. Hoping that the | concerning aspects of security and reliability that have come | to light in the recent past change this. | aborsy wrote: | For the average user, what's the impact of bit flips in memory in | practical terms? | | I am not talking about servers dealing with critical data. | | Suppose that I maintain a repository (documents, audio and | video), one copy in a ZFS-ECC system and one in an ext4-nonECC | system. | | Would I notice a difference between these two copies after 5-10 | years? | | That tells us if ECC matters for most people. | throwaway9870 wrote: | This isn't about disk storage, this is about DRAM. A bit flip | in DRAM might corrupt data, but could also cause random crashes | and system hangs.
That generally matters to everyone. | [deleted] | theevilsharpie wrote: | > For the average user, what's the impact of bit flips in | memory in practical terms? | | The most likely impact (other than nothing, if bits are flipped | in unused memory) is program crashes or system lock-ups for no | apparent reason. | elgfare wrote: | For those out of the loop like me, ECC does indeed stand for | error correcting code. https://en.m.wikipedia.org/wiki/ECC_memory | vlovich123 wrote: | A couple of years ago there were advancements that claimed to make | Rowhammer work on ECC RAM even with DDR4 [1]. Is that no longer a | concern for some reason? | | I would think the only guaranteed solutions to Rowhammer are | actually cryptographic digests and/or guard pages. | | [1] https://www.zdnet.com/article/rowhammer-attacks-can-now- | bypa... | theevilsharpie wrote: | ECC isn't a direct mitigation against Rowhammer attacks, as | memory errors caused by three or more flipped bits would still | go undetected (unless you're using ChipKill, but that's a rare | setup). | | However, flipping three bits simultaneously isn't trivial, and | the attempts that flip fewer bits will be detected and logged. | GregarianChild wrote: | Isn't ChipKill just another form of ECC? If so, there is a | number of bitflips that ChipKill can no longer correct / | detect. [1] seems to say that they observed some flips in | DRAM with ChipKill, although the paper is a bit vague here. | | [1] B. Schroeder et al, _DRAM Errors in the Wild: A Large- | Scale Field Study_ | http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf | rajesh-s wrote: | Right! Section 1.3 of this publication discusses possible | mitigations for the row hammer problem and where ECC fits in | | https://users.ece.cmu.edu/~omutlu/pub/rowhammer-summary.pdf | GregarianChild wrote: | The paper you cite is from 2014 and the mitigations | discussed there have all been circumvented.
| | [1] J. S. Kim et al, _Revisiting RowHammer: An Experimental | Analysis of Modern DRAM Devices and Mitigation Techniques_ | https://arxiv.org/abs/2005.13121 | rajesh-s wrote: | Thanks for pointing that out! | simias wrote: | I used to be pretty skeptical of ECC for consumer-grade hardware, | mainly because I felt that I'd always prefer cheaper/more RAM | over ECC RAM even if it meant that I'd get a couple of crashes | every year due to rogue bitflips. For servers it's a different | story, but for a desktop I'm fine dealing with some instability | for better performance. | | But these days with the RAM density being so high and bitflipping | attacks being more than a theoretical threat it seems like | there's really no good reason not to switch to ECC everywhere. | ekianjo wrote: | > no good reason not to switch to ECC everywhere. | | Not all CPUs support ECC however. | josefx wrote: | Just Intel fucking over security by making ECC a non-feature | on consumer grade hardware - wouldn't be surprised if it was | just a single bit flipped in a feature mask. | jjeaff wrote: | Well, with as common as a bunch of people in this thread | seem to think bit flips are, it should just be a matter of | time until that bit gets flipped on your cpu and activates | the ecc feature. | josefx wrote: | That bit probably is either burned in or stored with the | firmware in something more permanent than RAM. Modern RAM | has the issue that it is optimized for capacity and speed | to a point where state changes can leak into nearby bits. | loeg wrote: | (Intel) | tokamak-teapot wrote: | Are there any Ryzen boards that support ECC and _actually | correct errors_? | gruez wrote: | quick search: | | https://rog.asus.com/forum/showthread.php?112750-List- | Asus-M... | bcrl wrote: | Most Ryzen ASRock boards support ECC as well. I'm happily | using one right now. | loeg wrote: | > Most | | Circa Zen1 launch, ASRock claimed _all_ of their consumer | boards would support ECC.
| [deleted] | fulafel wrote: | The functionality seems to all be in the memory controller | integrated into the CPU. | loeg wrote: | Yes. E.g., all ASRock boards. | freeqaz wrote: | I bought ECC RAM for my laptop and it was definitely about 4x the | price. It's valuable to me for a few reasons -- peace of mind | being a big one. | | Bit flips happen and are real. I really wish ECC were plentiful | and not brutally expensive! | washadjeffmad wrote: | For the price, it made more sense for me to buy an R630 and | populate it with a few less expensive, higher-capacity ECC | RDIMMs. I don't really need ECC as a local feature, so this | lets me run on the mobile I want. | temac wrote: | Note that the price is mostly due to market segmentation, in | your case _most_ of it by the laptop vendor (of course some for | Intel, but not _that_ much compared to the laptop vendor). | | Xeons with ECC are not that overpriced compared with similar | Cores without. Likewise, RAM sticks with ECC are cheap to | produce (basically just one more chip to populate per side per | module). Likewise, soldered RAM would simply add maybe $10 or | $20 of extra chips. | bitcharmer wrote: | This is the first time I've heard of a laptop that supports ECC | memory. Could you please share the make and model? | bluedino wrote: | Lenovo (P series) and HP workstation models also support ECC. | xxs wrote: | Lenovo has Xeon laptops[0], and technically Intel used to | support ECC on i3 (and Celeron, etc.) | | 0: https://www.lenovo.com/us/en/laptops/thinkpad/thinkpad-p/Thi... | lb1lf wrote: | -My boss has a Xeon Dell - a 7550, methinks - luggable. | | It is filled to the gunwales with ECC RAM. | | Cost him the equivalent of $7k or so. Eeek. | dijit wrote: | I have a Dell Precision 5520 (chassis of an XPS 15) which has | a Xeon and ECC memory. | | Finding a memory upgrade seems difficult, though.
| markonen wrote: | I was looking at getting the Xeon-based NUC recently, and | one of the reasons I decided against it was that ECC SO-DIMMs | seem to be a really marginal product. If you want | ECC, something that takes full-size DIMMs seems _much | easier_ to buy memory for. | jjeaff wrote: | You should be able to check logs for corrected errors, right? | | I'm guessing you won't find any. | londons_explore wrote: | I simply care that my computer executes code perfectly. Let's | settle on "one instance of unintended behaviour per hundred | years" for that metric. | | If it needs ECC memory to do that, then fit it with ECC memory. | If there are other ways to achieve that (for example, deeper DRAM | cells that are more robust to cosmic rays), that's fine too. | | Just meet the reliability spec - I don't care how. | simias wrote: | Then you'll have to pay a huge premium for that privilege. I can | assure you that your standard computer components are not rated | for century-scale use. | | That's why I've always been on the fence with this ECC thing. | For servers it's vital because you need stability and security. | | For desktops, I think that for a long time it was fine without | ECC. If I have to choose between having, say, 30% more RAM or | avoiding a potential crash once a year, I'll probably take the | additional RAM. | | The problem is that now these problems can be exploited by | malicious code instead of merely happening because of | cosmic rays. That's the main argument in favour of ECC IMO; the | rest is just a tradeoff to consider. | ClumsyPilot wrote: | But it isn't just a crash, it's also silent data corruption | that will never be detected. | dev_tty01 wrote: | This. How many user documents have memory flip errors | introduced that are never detected? Impossible to say, but | it is not a small number given the worldwide use of DRAM. | Most are in trivial and unimportant documents, but some | aren't...
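The silent-corruption point above can be made concrete with a short sketch: a checksum recorded at save time makes later corruption at least detectable, even though (as knorker noted earlier in the thread) it cannot help if the data was already corrupt when the checksum was computed. The file contents and flipped bit below are made up for illustration:

```python
# A stored checksum turns silent corruption into detectable corruption.
# Hypothetical example: a single bit flip in an in-memory document.
import hashlib

document = bytearray(b"invoice total: 1000.00 EUR")
saved_digest = hashlib.sha256(document).hexdigest()  # recorded at save time

document[15] ^= 0x04   # one flipped bit: '1' (0x31) becomes '5' (0x35)
print(document.decode())                             # invoice total: 5000.00 EUR

# Without a checksum, nothing flags the bogus amount. With one, any
# later re-read can compare digests and refuse the corrupted copy.
assert hashlib.sha256(document).hexdigest() != saved_digest
print("corruption detected")
```

Note that this only detects corruption after the fact; unlike ECC, it can neither correct the data nor protect it between the time it is produced and the time it is hashed.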
| simias wrote: | It can be a concern, that's true, but personally most of | the stuff I edit ends up checked into a git repository or | something similar. | | And I mean, we all spend all day editing text messages and | comments and files on non-ECC hardware, yet bitflip-induced | corruption is rare enough that I can't say that I've | witnessed a single instance of it in my life, despite | spending a good chunk of it looking at screens. | | It's just not a problem that occurs in practice, in my | experience. If you're compiling the release build of a | critical piece of software, you probably want ECC. If | you're building the dev version of your webapp or writing | an email to your boss, you'll probably survive without it. | ClumsyPilot wrote: | Can you make that statement with any certainty? My personal | and family computers have crashed quite a few times, and | have corrupted photos and files, some of them valuable | (taxes, healthcare, etc. - personal computers hold | valuable data these days). | | I couldn't tell, as a user, which of those corruptions | and crashes were caused by bitflips. Could you? | loup-vaillant wrote: | > _I can assure you that your standard computer components | are not rated for century-scale use._ | | And that's probably not what GP asked for. There's a | difference between guaranteeing an error rate of 1 error per | century of use on average, and guaranteeing it over the | course of an _actual century_. It might be okay to guarantee | that error rate for only 5 years of uninterrupted use, and | degrade after that. For instance: Years 1-5: 1 error per | century. Years 6-10: 3 errors per century. Years 10-15: 10 | errors per century. Years 15-20: 20 errors per century. | Years 20-30: 1 error per *year*. Years 30+: the chip is | broken. | | Now, given how energy-hungry and polluting the whole computer | industry actually is, it might be a good idea to shoot for | extreme durability and reliability anyway.
Say, sustain 1 | error per century over the course of _fifty years_. It will | be slower and more expensive, but at least it won't burn the | planet as fast as our current electronics. | temac wrote: | In "theory" it needs ECC because you must also protect the link | between the CPU and the RAM. So with ECC fully in DRAM but no | protection on the bus, you risk some errors during the | transfer. However, maybe this kind of error is rare enough | that you would have less than one per century. It probably | depends on the motherboard design and fabrication quality, | though, and the environment where it is used. | z3t4 wrote: | Memory often comes with lifetime guarantees. If it had ECC, it | would be much easier to detect bad memory... | jkuria wrote: | For those, like me, wondering what ECC is, here's an explanation: | | https://www.tomshardware.com/reviews/ecc-memory-ram-glossary... ___________________________________________________________________ (page generated 2021-01-03 23:00 UTC)