[HN Gopher] ECC matters
       ___________________________________________________________________
        
       ECC matters
        
       Author : rajesh-s
       Score  : 624 points
       Date   : 2021-01-03 15:38 UTC (7 hours ago)
        
 (HTM) web link (www.realworldtech.com)
 (TXT) w3m dump (www.realworldtech.com)
        
       | sys_64738 wrote:
        | ECC memory is predominantly used in servers, where failure
        | absolutely must be identified and logged. The desktop market
        | uses it to a lesser extent, due to the lack of mission-critical
        | tasks being run there.
        
         | dijit wrote:
          | There are situations though, where you're working on a document
          | and the document's "save" format is a memory dump. Corruption
          | in files of that type (Adobe RAW, for example) would remove
          | data.
          | 
          | It might present itself as a 1-pixel colour difference, but it
          | could be more damaging (incorrect finances in accounting
          | software, for example). Software trusts memory; but memory can
          | lie.
         | 
         | That's dangerous.
        
           | MaxBarraclough wrote:
           | That's an interesting point. In an extreme case, an order or
           | money transfer might be placed for an incorrect quantity, or
           | to an incorrect recipient.
        
             | KingMachiavelli wrote:
             | Well maybe. Rather than having to trust memory completely,
             | it would just be better to use a binary format where each
             | bit is verifiable so then at least a single bit flip would
             | be immediately obvious. For example, a bit flip in a TLS
             | session causes the whole session to fail rather than a
             | random page element to change.
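              | 
              | A minimal sketch of that idea (hypothetical, not taken from
              | any real save format): Python's zlib.crc32 makes a stored
              | blob self-verifying, so a flipped bit in the saved copy is
              | caught on load instead of silently propagating.
              | 
              |     import struct, zlib
              | 
              |     def save_blob(payload: bytes) -> bytes:
              |         # Prepend a CRC32 so corruption of the stored
              |         # copy is detectable on read-back.
              |         crc = struct.pack("<I", zlib.crc32(payload))
              |         return crc + payload
              | 
              |     def load_blob(blob: bytes) -> bytes:
              |         stored = struct.unpack("<I", blob[:4])[0]
              |         payload = blob[4:]
              |         if zlib.crc32(payload) != stored:
              |             raise ValueError("stored data corrupted")
              |         return payload
              | 
              |     data = save_blob(b"account=42;balance=100.00")
              |     # flip one bit in the stored copy
              |     data = data[:10] + bytes([data[10] ^ 1]) + data[11:]
              |     load_blob(data)  # raises instead of returning bad data
              | 
              | Note that, as pointed out below, this only covers the
              | stored copy; a flip that happens before the CRC is computed
              | still gets "signed" as good.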
        
               | knorker wrote:
               | That doesn't help if the memory is corrupted before the
               | verification code is applied. (the code will simply put a
               | signature on incorrect data)
               | 
               | Or after it's been checked. (time-of-check vs time-of-
               | use)
        
               | MaxBarraclough wrote:
               | Right, exactly. TCP protects us from data-corruption in
               | network streams, and ECC protects us from data-corruption
                | in RAM. I doubt any sort of software solution could
                | practically compete against hardware ECC; even if it
                | could be done, it would presumably be disastrous for
                | performance.
        
               | knorker wrote:
               | The best integrity checking is "end to end". The problem
               | with non-ECC is that there are no "ends" that are
               | trustworthy.
               | 
               | I guess in theory some software could produce signed data
               | in CPU cache, and "commit" it to RAM as a verified block.
               | 
               | But the overhead would be enormous. Would you slow down
               | your CPU by half in order to not pay 12.5% more for RAM?
               | 
               | Hmm, I wonder what SGX and similar do about this.
        
               | mark-r wrote:
               | That's the principle behind Gray Code counting:
               | https://en.wikipedia.org/wiki/Gray_code
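                | 
                | For reference, the usual binary <-> Gray conversion
                | (standard textbook form, nothing specific to DRAM here):
                | 
                |     def to_gray(n: int) -> int:
                |         # adjacent values differ in exactly one bit
                |         return n ^ (n >> 1)
                | 
                |     def from_gray(g: int) -> int:
                |         n = 0
                |         while g:
                |             n ^= g
                |             g >>= 1
                |         return n
                | 
                |     assert all(from_gray(to_gray(i)) == i
                |                for i in range(1024))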
        
           | sys_64738 wrote:
            | Those corner cases might occur rarely and are probably
            | inconsequential given the rate of occurrence versus their
            | criticality - it probably doesn't justify the markup for
            | most. In a data center you're processing millions of
            | transactions per minute, so an occurrence is much more
            | impactful.
        
             | knorker wrote:
             | I would EASILY pay 12.5% more (that's the bit overhead) for
             | memory that actually works.
             | 
             | If my data is fine being corrupted to save 12.5% on RAM
             | costs, then why am I even bothering processing the data?
             | Apparently it's worthless.
             | 
             | People today weigh the cost of maybe 16 vs 32GB on a mid-
             | tier desktop. ~doubling the cost for twice the RAM. Yes,
             | paying 12.5% more for ECC RAM is a no-brainer.
        
             | xxs wrote:
              | You need 1/8 more memory - that's the real cost. It's
              | pretty much Intel's fault for the segmentation.
        
           | jkbbwr wrote:
            | To be fair, if your save mechanism is just a straight memory
            | dump with no checksums or validation, you have bigger
            | issues.
        
             | dijit wrote:
             | That happens more than you think though. Most* things that
             | output PNG are making an in-memory data structure and
             | dumping it to disk.
        
             | xxs wrote:
              | Why does it matter if it =HAD= a checksum? The numbers would
              | have been altered prior to the save, so you store one value
              | but read back another later. If the format calculates
              | immediate checksums on blocks, it would detect memory
              | corruption at best. The extreme downside is that such a part
              | is untestable under normal conditions, hard to maintain, and
              | costs more in development than the ECC does.
        
           | projektfu wrote:
           | Perhaps consumer-grade software that needs guarantees of
           | correctness should be using error correction in software. For
           | example, database records for financial software, DNS, e-mail
           | addresses, etc.
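            | 
            | One crude way to do that in software is plain triple
            | redundancy with a majority vote (a sketch only; a real
            | implementation would more likely use a proper SECDED or
            | Reed-Solomon code, and the record contents here are made up):
            | 
            |     def store_record(value: bytes) -> list:
            |         # keep three independent copies of the record
            |         return [bytes(value) for _ in range(3)]
            | 
            |     def read_record(copies: list) -> bytes:
            |         # bytewise majority vote: one bad copy is outvoted
            |         a, b, c = copies
            |         return bytes((x & y) | (y & z) | (x & z)
            |                      for x, y, z in zip(a, b, c))
            | 
            |     copies = store_record(b"acct=42;amount=100.00")
            |     # corrupt one bit in one copy
            |     copies[1] = bytes([copies[1][0] ^ 0x40]) + copies[1][1:]
            |     assert read_record(copies) == b"acct=42;amount=100.00"
            | 
            | The storage overhead is obviously far worse than the 12.5%
            | that hardware ECC costs, which is part of the argument above
            | for just having ECC in the first place.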
        
       | wicket wrote:
       | Over the years, I don't think I've ever been able to explain to
        | anyone that their memory error could have been caused by a
        | cosmic ray without being laughed at.
        
       | amelius wrote:
       | Does Apple use ECC in its M1 laptop?
        
         | dijit wrote:
          | No. It uses a unified package of LPDDR4X SDRAM.
        
           | my123 wrote:
           | LPDDR4X systems with ECC exist, but it indeed looks like
            | Apple M1 systems aren't among them...
        
         | graeme wrote:
          | This is my one worry. I have an iMac Pro and anecdotally it has
          | been a LOT more reliable than my old MacBook Pro. The iMac Pro
          | has ECC.
        
       | dijit wrote:
        | I hear this every time this conversation comes up; it's the same
        | answer: "I don't see a problem".
        | 
        | It's so easy to chalk these kinds of errors up to other issues: a
        | little corruption here, a running program goes berserk there -
        | could be a buggy program or a little accidental memory overwrite.
        | A reboot will fix it.
        | 
        | But I ran many thousands of physical machines, with petabytes of
        | RAM, and I tracked memory flip errors: they were _common_ - common
        | even in less dense memory, in thick metal enclosures surrounded by
        | mesh, where density and shielding should reduce bitflips a lot.
       | 
       | My own experience tracking bitflips across my fleet led me to buy
       | a Xeon laptop with ECC memory (precision 5520) and it has
       | (anecdotally) been significantly more reliable than my desktop.
        
         | [deleted]
        
         | derefr wrote:
         | Were you around for enough DRAM generations to notice an effect
         | of DRAM _density_ / cell-size on reported ECC error rate?
         | 
         | I've always believed that, ECC aside, DRAM made intentionally
         | with big cells would be less prone to spurious bit-flips (and
         | that this is one of the things NASA means when they talk about
         | "radiation hardening" a computer: sourcing memory with ungodly-
         | large DRAM cells, willingly trading off lower memory capacity
         | for higher per-cell level-shift activation-energy.)
         | 
         |  _If_ that's true, then that would mean that the per-cell error
         | rate would have actually been _increasing_ over the years, as
         | DRAM cell-size decreased, in the same way cell-size decrease
         | and voltage-level tightening have increased error rate for
         | flash memory. Combined with the fact that we just have N times
         | more memory now, you'd think we'd be seeing a _quadratic_
         | increase in faults compared to 40 years ago. But do we? It
         | doesn't seem like it.
         | 
         | I've _also_ heard a counter-effect proposed, though: maybe
         | there really are far more "raw" bit-flips going on -- but far
         | less of main memory is now in the causal chain for corrupting a
         | workload than it used to be. In the 80s, on an 8-bit micro,
         | POKEing any random address might wreck a program, since there's
         | only 64k addresses to POKE and most of the writable ones are in
         | use for something critical. Today, most RAM is some sort of
         | cache or buffer that's going to be used once to produce some
         | ephemeral IO effect (e.g. the compressed data for a video
         | frame, that might decompress incorrectly, but only cause 16ms
         | of glitchiness before the next frame comes along to paper over
         | it); or, if it's functional data, it's part of a fault-tolerant
         | component (e.g. a TCP packet, that's going to checksum-fail
         | when passed to the Ethernet controller and so not even be sent,
         | causing the client to need to retry the request; or, even if
         | accidentally checksums correctly, the server will choke on the
         | malformed request, send an error... and the client will need to
         | retry the request. One generic retry-on-exception handler
         | around your net request, and you get memory fault-tolerance for
         | free!)
         | 
         | If both effects are real, this would imply that regular PCs
         | without ECC _should_ still seem quite stable -- but that it
         | would be a far worse idea to run a non-ECC machine as a
         | densely-packed multitenant VM hypervisor today (i.e. to tile
         | main memory with OS kernels), than it would have been ~20 years
         | ago when memory densities were lower. Can anyone attest to
         | this?
         | 
         | (I'd just ask for actual numbers on whether per-cell per-second
         | errors have increased over the years, but I don't expect anyone
         | has them.)
        
           | jeffreygoesto wrote:
           | Sorry, I don't have the numbers you asked for. But afaik one
           | other effect is that "modern" semiconductor processes like
           | FinFET and Fully-Depleted Silicon-on-Insulator are less prone
           | to single event upsets and especially result in only a single
           | bit flipping and no drain of a whole region of transistors
           | from a single alpha particle.
        
           | mlyle wrote:
           | I think it's been quadratic with a pretty low contribution
           | from the order 2 term.
           | 
           | Think of the number of events that can flip a bit. If you
           | make bits smaller, you get a modestly larger number of events
           | in a given area capable of flipping a bit, spread across a
           | larger number of bits in that area.
           | 
           | That is, it's flip event rate * memory die area, not flip
           | event rate * number of memory bits.
           | 
           | In recent generations, I understand it's even been a bit
           | paradoxical-- smaller geometries mean less of the die is
           | actual memory bits, so you can actually end up with _fewer_
           | flips from shrinking geometries.
           | 
           | And sure, your other effect is true: there's a whole lot
           | fewer bitflips that "matter". Flip a bit in some framebuffer
           | used in compositing somewhere-- and that's a lot of my
           | memory-- and I don't care.
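            | 
            | Made-up numbers, purely to illustrate the two scaling
            | assumptions (nothing here is a measured rate):
            | 
            |     base = 1e-4   # hypothetical flips per die-hour at gen 0
            |     for gen in range(5):
            |         bits = 2 ** gen          # bits per die, doubling
            |         per_area = base          # die area roughly constant
            |         per_bit = base * bits    # naive per-bit scaling
            |         print(gen, per_area, per_bit)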
        
         | smoyer wrote:
          | There is no guarantee of state at the quantum level ... just a
          | high degree of assurance of a state. After 40 years in the
          | electronics, optics, and software business, I've learned that
          | there is absolutely the possibility of unexplained "blips".
        
         | loeg wrote:
         | Yeah, it's real obnoxious of Intel to silo ECC support off into
         | the Xeon line, isn't it? I switched to ECC memory in 2013 or
         | 2014 with a Xeon E3 (fundamentally a Core i7 without the ECC
         | support fused off) and of course a Xeon-supporting motherboard
         | (with weird "server board" quirks: e.g., no on-board sound
         | device).
         | 
        | I love that AMD doesn't intentionally break ECC on its consumer
        | desktop platforms, and I upgraded to a Threadripper in 2017.
        
           | defanor wrote:
           | I've considered using an AMD CPU instead of Intel's Xeon on
           | the primary desktop computer, but even low-end Ryzen
            | Threadripper CPUs have a TDP of 180W, which is a bit higher
           | than I'd like. And though ECC is not disabled in Ryzen CPUs,
           | AFAIK it's not tested in (or advertised for) those, so one
           | won't be able to return/replace a CPU if it doesn't work with
           | ECC memory, AIUI, making it risky. Though I don't know how
           | common it is for ECC to not be handled properly in an
           | otherwise functioning CPU; are there any statistics or
           | estimates around?
        
             | BlueTemplar wrote:
             | > one won't be able to return/replace a CPU if it doesn't
             | work with ECC memory
             | 
             | I don't know where you live, but around here, (if you buy
             | new?), the vendor MUST take back items up to 15 days after
             | they were delivered, for ANY reason.
             | 
             | So, as long as you synchronize your buying of CPU, RAM,
             | (motherboard), you should be fine.
        
             | marcosdumay wrote:
             | Keep in mind that Intel lies about its TDP.
        
               | magila wrote:
               | There's been a lot of misinformation spread about what
               | TDP means for modern CPUs. In Intel's case TDP is the
               | steady state power consumption of the CPU in its default
               | configuration while executing a long running workload.
               | Long meaning more than a minute or two. The CPU
               | implements this by keeping an exponentially weighted
               | moving average (EWMA) of the CPU's power consumption. The
               | CPU will modulate its frequency to keep this moving
               | average at-or-below the TDP.
               | 
               | One consequence of using a moving average is that if the
               | CPU has been idle for a long time then starts running a
               | high power workload instantaneous power consumption can
               | momentarily exceed the TDP while the average catches up.
               | This is often misleadingly referred to as "turbo mode" by
               | hardware review sites. It's not a mode, there's no state
               | machine at work here, it's just a natural result of using
               | a moving average. The use of EWMA is meant to model the
               | heat capacity of the cooling solution. When the CPU has
               | been idle for a while and the heatsink is cool, the CPU
               | can afford to use more power while the heatsink warms up.
               | 
               | Another factor which confuses things is motherboard
               | firmware disabling power limits without the user's
               | knowledge. Motherboards marketed to enthusiasts often do
               | this to make the boards look better in review benchmarks.
               | This is where a lot of the "Intel is lying" comes from,
               | but it's really the motherboard manufacturers being
               | underhanded.
               | 
               | The situation on the AMD side is of course a bit
               | different. AMD's power and frequency scaling is both more
               | complex and much less documented than Intel's so it's
               | hard to say exactly what the CPU is doing. What is known
               | is that none of the actual power limits programmed into
               | the CPU align with the TDP listed in the spec. In
               | practice the steady state power consumption of AMD CPUs
               | under load is typically about 1.35x the TDP.
               | 
               | Unlike Intel, firmware for AMD motherboards does not mess
               | with the CPU's power limit settings unless the user does
               | so explicitly. Presumably this is because AMD's CPU
               | warranty is voided by changing those settings, while
               | Intel's is not.
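                | 
                | A toy simulation of the EWMA limiter described above (all
                | constants are made up; the real limiter modulates
                | frequency rather than clamping power, but the shape is
                | the same):
                | 
                |     TDP = 65.0      # watts, target for the average
                |     ALPHA = 0.05    # EWMA smoothing factor
                |     BURST = 120.0   # unconstrained workload power
                | 
                |     avg = 5.0       # start after a long idle period
                |     for t in range(60):
                |         # full power while the average is under TDP,
                |         # then throttle so the average settles at TDP
                |         power = BURST if avg < TDP else TDP
                |         avg = ALPHA * power + (1 - ALPHA) * avg
                |         if t % 10 == 0:
                |             print(t, round(power), round(avg, 1))
                | 
                | Right after the idle period the chip draws well above
                | TDP while the average catches up, then settles at the
                | TDP - the "turbo" effect described above.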
        
               | xxs wrote:
               | Intel measures TDP at base frequency... that's
               | disingenuous.
        
               | colejohnson66 wrote:
               | They don't. They just measure it differently than AMD.
               | Intel measures at base clock, but AMD measures at
               | sustained max clock IIRC. It's definitely deceptive, but
               | it's not a lie as long as Intel tells you (which they
               | do).
        
               | wtallis wrote:
               | Intel's TDP numbers are at best an indicator of which
               | product segment a chip falls into. They are wildly
               | inaccurate and unreliable indicators of power draw under
                | _any_ circumstance. For example, here's a "58W" TDP
                | Celeron that can't seem to get above 20W:
               | https://twitter.com/IanCutress/status/1345656830907789312
               | 
               | And on the flip side, if you're building a desktop PC
               | with a more high-end Intel processor, you will usually
               | have to change a _lot_ of motherboard firmware settings
                | to get the behavior to resemble Intel's own
               | recommendations that their TDP numbers are supposedly
               | based on. Without those changes, lots of consumer retail
               | motherboards default to having most or all of the power
               | limits effectively disabled. So out of the box, a "65W"
               | i7-10700 and a "125W" i7-10700K will both hit 190-200W
               | when all 8 cores/16 threads are loaded.
               | 
               | If a metric can in practice be off by a factor of three
               | in either direction, it's really quite useless and should
               | not be quantified with a scientific unit like Watts.
        
               | marcosdumay wrote:
               | Well, it's a power measurement that isn't total and can't
               | be used for design... So, it's a lie.
               | 
               | If they gave it some other name, it would be only
               | misleading. Calling it TDP is a lie.
        
               | ksec wrote:
                | It is a lie when they change the definition of TDP
                | without telling you first, and later redefine the word to
                | mean something different once they get caught.
                | 
                | Maybe we should use a new term for it, something like
                | iTDP.
        
               | mlyle wrote:
               | They both lie, but Intel lies worse :D
        
               | paulmd wrote:
               | Nah. Both brands pull more than TDP when boosting at max,
               | AMD will pull up to 30% above the specified TDP for an
               | indefinite period of time (they call this number the
               | "PPT" instead).
               | 
               | Intel mobile processors actually obey this better than
               | AMD processors do - Tiger Lake has a hard limit, when you
               | configure a 15W TDP then it really is 15W once steady-
               | state boost expires, AMD mobile products will pull up to
               | _50%_ more than configured.
               | 
               | https://images.anandtech.com/doci/16084/Power%20-%2015W%2
               | 0Co...
               | 
               | "the brands measure it differently" is true but not in
               | the sense people think.
               | 
               | On AMD it is literally just a number they pick that goes
                | into the boost algorithm. Robert Hallock did some dumb
                | handwavy shit about how it's measured with some delta-T
                | above ambient with a reference cooler, but the fact is
                | that the chip itself basically determines how high it'll
                | boost based on the number they configure, so that is a
                | self-fulfilling prophecy: the delta-T above ambient is
                | dependent on the number they configure the chip to run
                | at.
               | 
               | In practice: what's the difference between a 3600 and a
               | 3600X? One is configured with a TDP of 65W and one is
               | configured with a TDP of 95W, the latter lets you boost
               | higher and therefore it clocks higher.
               | 
               | Intel nominally states that it's measured as a worst-case
               | load at base clocks, something like Prime95 that
               | absolutely nukes the processor (and even then many
               | processors do not actually hit it). But really it is also
               | just a number that they pick. The number has shifted over
               | time, previously they used to undershoot a lot, now they
               | tend to match the official TDP. It's not an actual
               | measurement, it's just a "power category" that they
                | classify the processors as; it's _informed_ by real
                | numbers but it's ultimately a human decision which tier
                | they put them in.
               | 
               | Real-world you will always boost above base clocks on
               | both brands at stock TDP, at least on real-world loads.
               | You won't hit full boost on either brand without
               | exceeding TDP, the "AMD measures at full boost" is
               | categorically false despite the fact that it's commonly
               | repeated. AMD PPT lets them boost above the official TDP
               | for an unlimited period of time, they cannot run full
               | boost when limited to official TDP.
        
               | numlock86 wrote:
               | Can you cite something? Sounds interesting.
        
               | colejohnson66 wrote:
                | It's not true. Sort of. Intel measures at base clock
               | while AMD does at sustained peak clock. Deceptive? Yes.
               | Lie? No.
        
             | CydeWeys wrote:
             | > but even low-end Ryzen Threadripper CPUs have TDP of
             | 180W, which is a bit higher than I'd like.
             | 
             | Why does it matter? It doesn't idle that high; it only goes
              | that high if you're using it flat out, in which case the
              | extra power usage is justified because it's giving that
              | much more performance over a 100 W TDP CPU. Now I totally
              | get it if you don't want to go Threadripper just for ECC
              | because it's more _expensive_, but max power draw, which
             | you don't even have to use? I've never seen anyone shop a
             | desktop CPU by TDP, rather than by performance and price.
        
               | defanor wrote:
               | I prefer to pick PSU and fans (for both CPU and chassis)
               | that can handle it comfortably (preferably while staying
               | silent and with some reserve) with maximum TDP in mind,
               | and given that I don't need that many cores or high clock
               | speed either, a powerful CPU with high TDP is undesirable
               | because it just makes picking other parts harder. I've
               | mentioned TDP explicitly because I wouldn't mind if it
               | was a (possibly even high-end) Threadripper that somehow
               | didn't produce as much heat. Although price also matters,
               | indeed.
        
               | phkahler wrote:
               | >> I've never seen anyone shop a desktop CPU by TDP,
               | rather than by performance and price.
               | 
               | Oh oh, me! Back in the day I bought a 65W CPU for a
               | system that could handle a 90W. I wanted quiet and
               | figured that would keep fan noise down at a modest
               | performance penalty. It should also last longer, being
               | the same design but running cooler. I ran that from 2005
               | until a few years ago (it still run fine but is in
               | storage).
               | 
               | Planning to continue this strategy. I suspect it's common
               | among SFF enthusiasts.
        
               | koolba wrote:
               | SFF?
        
               | lostlogin wrote:
                | The Intel NUC and Mac mini are good examples of this -
                | however the NUC doesn't have its PSU inside; it's a
                | brick. Great for fixing failures, horrible in general, as
                | a built-in PSU is so much tidier.
        
               | oconnor663 wrote:
               | "small form factor" as far as I can tell
        
               | sam_lowry_ wrote:
                | Hm... My 2013 NUC in a fanless Akasa enclosure runs 24/7
                | on a 6W CPU. I recently looked at the options, and the
                | 2019 6W offering changes little in performance. Yes,
                | memory got faster, but that's it.
                | 
                | My passively cooled desktop is also running a slightly
                | throttled-down 65W CPU.
                | 
                | So yes, there are people who choose their hardware by
                | TDP.
        
               | francis-io wrote:
                | When looking for a CPU for a server that sits in my
                | living room, I went down the thought process of getting a
                | low TDP. I don't have a quote, but I seem to remember
                | coming to the conclusion that TDP is the max temp
                | threshold, not the consistent power draw. If you have a
                | computer idling I believe you won't see a difference in
                | temp between CPUs, but you will have the performance when
                | you need it.
                | 
                | These days, a quiet PWM fan with good thermal paste (and
                | maybe some Linux CPU throttling) more than achieves my
                | needs for a "silent" PC 99% of the time.
               | 
               | I would love to be told my above assumptions are wrong if
               | they are.
        
               | mlyle wrote:
               | Yah-- one should look at performance within a given power
               | envelope. Being able to dissipate more and then either
               | end up with the fan running or the processor throttling
               | back somewhat is good, IMO.
               | 
               | The worst bit is, AMD and Intel define TDP differently--
               | neither is the maximum power the processor can draw--
               | though Intel is far more optimistic.
        
               | mlyle wrote:
               | On AMD, with Ryzen Master, you can set the TDP-envelope
               | of the processor to what you want. Then the
               | boost/frequency/voltage envelope it chooses to operate in
               | under sustained load is different.
               | 
               | IMO, shopping by performance/watt makes sense. Shopping
               | by TDP doesn't. (Especially since there is no comparing
               | the AMD and Intel TDP numbers as they're defined
               | differently; neither is the maximum the processor can
               | draw, and Intel significantly exceeds the specified TDP
               | on normal workloads).
        
               | ReactiveJelly wrote:
               | Back when my daily driver was a Core 2 laptop, someone
               | told me that capping the clock frequency would make it
               | unusable.
               | 
               | As a petty "Take that", I dropped the max frequency from
               | 2.0 GHz to 1.0 GHz. I ran a couple benchmarks to prove
               | the cap was working, and then just kept it at 1.0 for a
               | few months, to prove my point.
               | 
               | It made a bigger difference on my ARM SBC, where I tried
               | capping the 1,000 MHz chip to 200 or 400 MHz. That chip
               | was already CPU-bound for many tasks and could barely
                | even run Firefox. Amdahl's Law kicked in - halving the
               | frequency made _everything_ twice as slow, because almost
               | everything was waiting on the CPU.
        
               | mlyle wrote:
               | The funny thing is, on modern processors-- throttling TDP
               | only affects when running flat out all-core workloads. A
               | subset of cores can still boost aggressively, and you can
               | run all-core max-boost for short intervals.
               | 
               | And the relationship between power and performance isn't
               | linear as processor voltages climb trying to squeeze out
               | the last bit of performance.
               | 
               | So if you want to take a 105W CPU and ask it to operate
               | in a 65W envelope, you're not giving up even 1/3rd of
               | peak performance, and much less than that of typical
               | performance.
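                | 
                | Rough illustration of that, under the common (and here
                | assumed) approximation that power scales with the cube of
                | frequency once voltage has to rise along with clocks:
                | 
                |     full_power = 105.0
                |     capped_power = 65.0
                | 
                |     # P ~ f^3  =>  f ~ P^(1/3)
                |     rel_freq = (capped_power / full_power) ** (1 / 3)
                |     print(f"capped clock ~ {rel_freq:.0%} of peak")
                |     print(f"perf lost    ~ {1 - rel_freq:.0%}")
                | 
                | That comes out to roughly 15% of peak all-core
                | performance given up for a ~38% cut in power, which is
                | the kind of trade-off described above.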
        
               | vvanders wrote:
               | TDP matters a fair bit in SFF(Small Form Factor) PCs. For
               | instance the 3700x is a fantastic little CPU since it has
               | a 65W TDP but pretty solid performance.
               | 
               | In a sandwich style case you're usually limited to low
               | profile coolers like Noctua L9i/L9a since vertical height
               | is pretty limited.
        
               | mlyle wrote:
               | Performance/watt matters. You can just set TDP to what
               | you want with throttling choices.
               | 
               | If you want a 45W TDP from the 3700X, you can just pop
               | into Ryzen Master and ask for a 45W TDP. Boom, you're
               | running in that envelope.
               | 
               | I think shopping based on TDP is not the best, because
               | it's not comparable between manufacturers and because
               | it's something you can effectively "choose".
        
               | mongol wrote:
               | How do you do that? Is it a setting in the bios? Or can
               | it be done runtime? If so, how? It sounds interesting if
               | I can run a beefy rig as a power efficient device, for
               | always-on scenarios, and then boost it when I need.
        
               | mlyle wrote:
               | > How do you do that? Is it a setting in the bios? Or can
               | it be done runtime?
               | 
               | On AMD, it's a utility you run. I believe you may require
               | a reboot to apply it. On some Intel platforms, it's been
               | settings in the BIOS.
               | 
               | > It sounds interesting if I can run a beefy rig as a
               | power efficient device, for always-on scenarios, and then
               | boost it when I need.
               | 
               | This is what the processor is doing internally anyways.
               | It throttles voltage and frequency and gates cores based
               | on demanded usage. Changing the TDP doesn't change the
               | performance under a light-to-moderate workload scenario
               | at all.
               | 
               | Ryzen Master lets you change some of the tuning for the
               | choices it makes about when and how aggressively to
               | boost, though, too.
        
               | Cloudef wrote:
                | Ryzen Master doesn't seem to be available for Linux, so
                | you end up with a bunch of unofficial hacks that may or
                | may not work. I run an SFF setup myself; I originally
                | wanted to get a 3600 but it was out of stock, and the
                | next TDP-friendly processor was the 3700X.
        
               | mlyle wrote:
                | That's an annoyance, but on Linux you have infinitely
                | more control of thermal throttling and you can get
                | whatever thermal behavior you want. Thermald has been
                | really good on Intel, and now that Google contributed
                | RAPL support you can get the same benefits on AMD-- pick
                | exactly your power envelope and thermal limits.
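                | 
                | For example, on kernels that expose the powercap
                | interface, the package power limit is just a sysfs file
                | (the path below is the usual Intel RAPL node; whether an
                | equivalent node shows up on a given AMD system depends on
                | the kernel's RAPL support, so treat this as a sketch):
                | 
                |     from pathlib import Path
                | 
                |     pkg = Path("/sys/class/powercap/intel-rapl:0")
                |     name = (pkg / "name").read_text().strip()
                |     limit = int((pkg / "constraint_0_power_limit_uw")
                |                 .read_text())
                |     print(f"{name}: limit {limit / 1e6:.1f} W")
                | 
                |     # lowering the envelope to 45 W (needs root):
                |     # (pkg / "constraint_0_power_limit_uw")
                |     #     .write_text(str(45 * 10**6))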
        
               | vvanders wrote:
                | Yeah, but can I get a metric ton of benchmarks at that
                | 45W setpoint?
                | 
                | I don't really see the reason in paying a premium for a
                | 100W TDP part if I'm just going to scale it down to 65W.
        
               | bayindirh wrote:
               | > I've never seen anyone shop a desktop CPU by TDP,
               | rather than by performance and price.
               | 
               | That's me. When I start to plan for a new system, I
               | select the processor first and read its thermal design
               | guidelines (Intel used to have nice load vs. max temp
               | graphs in their docs) and select every component around
               | it for sustained max load.
               | 
               | This results in a more silent system for idle and peace
               | of mind for loading it for extended duration.
        
               | 411111111111111 wrote:
               | That's not necessarily correct.
               | 
                | You can passively cool Threadrippers if you underclock
                | them enough and have good ventilation in the case.
        
               | bayindirh wrote:
                | If my only interest were ECC, I might do that, but I
                | develop scientific software for research purposes. I need
                | every bit of performance from my system.
                | 
                | In my case loading means maxing out all cores, and an
                | extended period of time can be anything from five minutes
                | to hours.
        
               | mlyle wrote:
               | The problem is-- you can't compare the TDP nor even the
               | system cooling design guidelines between AMD and Intel.
               | 
               | Both are optimistic lies, but-- if you look at the
               | documents it looks like currently AMD needs more cooling,
               | but actually dissipates less power in most cases and
               | definitely has higher performance/watt.
        
               | bayindirh wrote:
               | > The problem is-- you can't compare the TDP nor even the
               | system cooling design guidelines between AMD and Intel.
               | 
               | Doesn't matter for me since I'm not interested in
               | comparing them.
               | 
               | > Both are optimistic lies, but-- if you look at the
               | documents it looks like currently AMD needs more cooling,
               | but actually dissipates less power in most cases and
               | definitely has higher performance/watt.
               | 
                | I'm aware of the situation, and I always inflate the
                | numbers 10-15% to increase headroom in my systems. The
                | code I'm running is not a _most case_ code. It's an
                | FPU-heavy, "I will abuse all your cores and memory
                | bandwidth" type of heavily optimized scientific software.
                | I can sometimes hear my system swearing at me for
                | repeatedly running tests.
                | 
                | I don't like to add this paragraph, but I'm one of the
                | administrators of one of the biggest HPC clusters in my
                | country. I know how a system can surpass its TDP, and how
                | CPU manufacturers can skew these TDP numbers to fit into
                | envelopes. We make these servers blow flames from their
                | exhausts.
        
               | ethanpil wrote:
                | Built a NAS. My #1 concern when choosing a CPU was TDP.
                | This machine is on 24/7 and power use is a primary
                | concern where I live because electricity is NOT cheap.
        
               | mlyle wrote:
               | This is a poor way to make the choice. TDP is supposed to
               | specify the highest power you can get the processor to
               | dissipate, not typical or idle use. And since different
               | manufacturers specify TDP differently, you can't even
               | compare the number.
               | 
               | Performance/watt metrics and idle consumption would have
               | been a far better way to make this choice.
               | 
               | If you have a choice between A) something that can
               | dissipate 65W peak for 100 units of performance, but
               | would dissipate 4W average under your workload, and B)
               | something that can dissipate 45W peak for 60 units of
               | performance, but would dissipate 4.5W under your
               | workload... I'm not sure why you'd ever pick B.
        
               | mongol wrote:
               | Is there a metric to look for to understand what power
               | consumption is at "idle" or something close to that? That
               | is what confuses me. I don't want to spend a lot of money
               | on something that will be always on, and usually idling,
               | and finding that its power usage is way higher than I
               | thought. But perhaps there is a metric that tells that. I
               | have not looked closely at it.
               | 
                | Also, even though the CPU may draw less, can the power
                | supply still waste more, just because it is beefy?
                | Comparing with a sports car: they have great performance,
                | but also use more gas in ordinary traffic. Can a computer
                | be compared with that?
        
               | mlyle wrote:
               | > Is there a metric to look for to understand what power
               | consumption is at "idle" or something close to that? That
               | is what confuses me. I don't want to spend a lot of money
               | on something that will be always on, and usually idling,
               | and finding that its power usage is way higher than I
               | thought.
               | 
               | Community benchmarks, from Tom's Hardware, etc.
               | 
               | The vendor numbers are make believe-- you can't use them
               | for power supply sizing or for thermal path sizing. If
               | you look at the cited TDP numbers today-- it can be
               | misleading-- e.g. often Intel 45W TDP parts use more
               | power at peak than AMD 65W parts.
               | 
               | On modern systems, almost none of the idle consumption is
               | the processor. The power supply's idle use and
               | motherboard functions dominate.
               | 
               | > Also, even though the CPU may draw less, can still the
               | power supply waste more, just because it is beefy?
               | 
               | Yes, having to select a larger power supply can result in
               | more idle consumption, though this is more of a problem
               | on the very low end.
        
             | vvanders wrote:
             | I don't think Threadripper is a hard requirement for ECC.
              | There are some processors with pretty reasonable TDPs if
              | you step down from Threadripper.
        
               | usefulcat wrote:
               | It's not. I have a low end Epyc machine with ECC. It has
               | a TDP of something like 30 watts.
        
               | defanor wrote:
               | I didn't consider embedded CPUs (I guess that's about an
               | embedded EPYC, not a server one), those look neat. But
               | there's no official ECC support (i.e., it's similar to
               | Ryzen CPUs), is there?
               | 
               | Edit: as detaro mentioned in the reply, there is, and
               | here's the source [0] -- that's what they mean by "RAS"
               | on promotional pages [1]. That indeed looks like a nice
               | option.
               | 
               | [0] https://www.amd.com/system/files/documents/updated-30
               | 00-fami...
               | 
               | [1] https://www.amd.com/en/products/embedded-
               | epyc-3000-series
        
               | loeg wrote:
               | RAS covers more than just DRAM, but yes. Historically,
               | the reporting interface is called MCA (Machine Check
               | Architecture) / MCE. I think both AMD and Intel have
               | extensions with other names, but MCA/MCE points you in
               | the right direction.
        
               | detaro wrote:
               | All EPYC, including the embedded ones, do officially have
               | ECC support
        
               | adrian_b wrote:
               | For embedded applications, there is official ECC support
               | for all CPUs named Epyc or Ryzen Vxxxx or Ryzen Rxxxx.
               | 
                | There are computers in the Intel NUC form factor with
                | ECC support (e.g. with a Ryzen V2718), e.g. from ASRock
                | Industrial.
        
               | detaro wrote:
                | What kind of machine is that? I was vaguely looking for
                | one a while back, and everything seemed difficult to get
                | (since the main target is large-volume customers, I
                | guess).
        
               | cuu508 wrote:
               | I haven't seen definite details and test results on these
               | (but haven't looked recently).
               | 
               | What specific configurations (CPU, MB, RAM) are known to
               | work?
               | 
               | Let's say I have a Ryzen system, how can I check if ECC
               | really works? Like, can I see how many bit flips got
               | corrected in, say, last 24h?
        
               | xxs wrote:
                | Every Ryzen (non-APU) supports it.* Check the motherboard
                | of your choice; they would declare it in big bold
                | letters, e.g. [0]
                | 
                | *Not officially, and the memory controller provides no
                | report for 'fixed' errors.
               | 
               | 0: http://www.asrock.com/mb/AMD/X570%20Taichi/
        
               | cturner wrote:
                | Regarding verification: there is a Debian package called
                | edac-utils. As I recall you overclock your RAM and run
                | your system at load in order to generate failures.
                | 
                | Looking back at my notes, the output of 'journalctl -b'
                | should say something like, "Node 0: DRAM ECC enabled."
                | 
                | Then 'edac-ctl --status' should tell you that drivers are
                | loaded.
                | 
                | Then you run 'edac-util -v' to report on what it has
                | seen:
                | 
                |     mc0: 0 Uncorrected Errors with no DIMM info
                |     mc0: 0 Corrected Errors with no DIMM info
                |     mc0: csrow2: 0 Uncorrected Errors
                |     mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
                |     mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
                |     mc0: csrow3: 0 Uncorrected Errors
                |     mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
                |     mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
                |     edac-util: No errors to report.
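                | 
                | If you'd rather not install the package, the counters
                | edac-util reads are plain sysfs files (assuming the EDAC
                | driver for your memory controller is loaded; exact paths
                | can vary between kernels):
                | 
                |     from pathlib import Path
                | 
                |     mcs = Path("/sys/devices/system/edac/mc")
                |     for mc in sorted(mcs.glob("mc*")):
                |         ce = int((mc / "ce_count").read_text())
                |         ue = int((mc / "ue_count").read_text())
                |         print(mc.name, ce, "corrected,", ue,
                |               "uncorrected")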
        
               | a1369209993 wrote:
               | > As I recall you overclock your RAM and run your system
               | at load in order to generate failures.
               | 
               | You can also use memtest86+ for this, although I don't
               | recall if it requires specific configuration for ECC
               | testing.
        
               | p_l wrote:
               | All AMD CPUs with integrated memory controllers support
               | ECC. The CPU also exposes an interface usable by the
               | operating system to verify ECC works - the same interface
               | is used to provide monitoring of memory fault data
               | provided by ECC.
               | 
               | They aren't tested on it, so it's possible to get a dud,
                | but it's a minuscule chance that isn't worth bothering
                | about.
               | 
               | Now, to _actual_ issues you can encounter: _motherboards_
               | 
                | The problem is that ECC means you need to have, IIRC, 8
                | more data lines between the CPU and the memory module,
                | which of course means more physical connections (I don't
                | remember how many right now). Those also need to be
                | properly done and tested, and you might encounter a
                | motherboard where it wasn't done. Not sure how common
                | that is, unfortunately.
               | 
                | Another issue is motherboard firmware. Even though AMD
                | supplies the memory init code, the configuration can be
                | tweaked by the motherboard vendor, and they might simply
                | break ECC support accidentally (even by something as
                | simple as making a toggle default to _false_ and then
                | forgetting to expose it in the configuration menu).
               | 
               | Those are the two issues you can encounter.
               | 
                | The difference with Threadripper PRO and EPYC, AFAIK, is
                | that AMD includes ECC in its test and certification
                | programs for them, which kind of enforces support.
        
               | jtl999 wrote:
              | > Another issue is motherboard firmware. Even though AMD
              | supplies the memory init code, the configuration can be
              | tweaked by the motherboard vendor, and they might simply
              | break ECC support accidentally (even by something as
              | simple as making a toggle default to false and then
              | forgetting to expose it in the configuration menu).
               | 
               | I think some Gigabyte boards are infamous for this in
              | certain circles.
               | 
               | OTOH: Gigabyte _might_ have a Threadripper PRO
               | motherboard (WRX80 chipset) coming out in the future
        
               | p_l wrote:
               | Gigabyte is also infamous for trying to claim that they
               | implemented UEFI by dropping a build of DUET (UEFI that
               | boots on top of BIOS, used for early development) into
               | BIOS image...
        
               | adrian_b wrote:
               | All desktop Ryzen CPUs without integrated GPU, i.e. with
               | the exception of APUs, support ECC.
               | 
               | You must check the specifications of the motherboard to
               | see if ECC memory is supported.
               | 
               | As a rule, all ASRock MBs support ECC and also some ASUS
               | MBs support ECC, e.g. all ASUS workstation motherboards.
               | 
               | I have no experience with Windows and Ryzen, but I assume
               | that ECC should work also there.
               | 
               | With Linux, you must use a kernel with all the relevant
               | EDAC options enabled, including CONFIG_EDAC_AMD64.
               | 
               | For the new Zen 3 CPUs, i.e. Ryzen 5xxx, you must use a
               | kernel 5.10 or later, for ECC support.
               | 
               | On Linux, there are various programs, e.g. edac-utils, to
               | monitor the ECC errors.
               | 
               | To be more certain that the ECC error reporting really
               | works, the easiest way is to change the BIOS settings to
               | overclock the memory, until memory errors appear.
        
               | theevilsharpie wrote:
                | On Windows, to check if ECC is working, run the command
                | 'wmic memphysical get memoryerrorcorrection':
                | 
                |     PS C:\> wmic memphysical get memoryerrorcorrection
                |     MemoryErrorCorrection
                |     6
               | 
               | SuperUser has a convenient decoder[1], but modern systems
               | will report "6" here if ECC is working.
               | 
               | When Windows detects a memory error, it will record it in
               | the system event log, under the WHEA source. As a side
               | note, this is also how memory errors within the CPU's
               | caches are reported under Windows.
               | 
               | [1] https://superuser.com/questions/893560/how-do-i-tell-
               | if-my-m...
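                | 
                | The same value can be pulled and decoded
                | programmatically; a small sketch (the value-to-meaning
                | table follows the Win32_PhysicalMemoryArray docs as I
                | recall them, so double-check against the linked decoder):
                | 
                |     import subprocess
                | 
                |     DECODE = {3: "None", 4: "Parity",
                |               5: "Single-bit ECC",
                |               6: "Multi-bit ECC", 7: "CRC"}
                | 
                |     out = subprocess.run(
                |         ["wmic", "memphysical", "get",
                |          "memoryerrorcorrection"],
                |         capture_output=True, text=True).stdout
                |     code = int(out.split()[-1])
                |     print(DECODE.get(code, f"unknown ({code})"))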
        
             | stefan_ wrote:
             | I don't understand. Whatever the TDP of Intel processors,
              | you are straight up getting less bang per watt given their
             | ancient process. Same reason smartphones burst to high
             | clocks and power; getting the task done faster is on
             | average much more efficient.
        
             | loeg wrote:
             | > I've considered using an AMD CPU instead of Intel's Xeon
             | on the primary desktop computer, but even low-end Ryzen
             | Threadripper CPUs have TDP of 180W, which is a bit higher
             | than I'd like.
             | 
             | Any apples-to-apples comparable Intel CPU will have
             | comparable power use. The difficulty is that Intel didn't
             | really have anything like Threadripper -- their i9 series
             | was the most comparable (high clocks and moderate core
             | counts), but i9 explicitly did not support ECC memory,
             | nullifying the comparison.
             | 
             | You're looking at 2950X, probably? That's a Zen+ (previous
             | gen) model. 16 core / 32 thread, 3.5 GHz base clock,
             | launched August 2018.
             | 
             | Comparable Intel Xeon timeline is Coffee Lake at the
              | latest, Kaby Lake before that. As far as I can tell, _no_
              | Kaby Lake or Coffee Lake Xeons even have 16 cores.
             | 
             | The closest Skylake I've found is an (OEM) Xeon Gold 6149:
             | 16/32 core/thread, 3.1 GHz base clock, 205W nominal TDP
             | (and it's a special OEM part, not available for you). The
             | closest buyable part is probably Xeon Gold 6154 with 18/36
             | core/threads, 3GHz clock, and 200W nominal TDP.
             | 
             | Looking at i9 from around that time, you had Skylake-X and
              | a single Coffee Lake-S (i9-9900K). 9900K only has 8 cores.
             | The Skylake i9-9960X part has 16/32 cores/threads, base
             | clock of 3.1GHz, and a nominal TDP of 165W. That's somewhat
             | comparable to the AMD 2950X, ignoring ECC support.
             | 
             | Another note that might interest you: you could run the
             | Threadripper part at substantially lower power by
             | sacrificing a small amount of performance, if thermals are
             | the most important factor and you are unwilling to trust
             | Ryzen ECC:
             | http://apollo.backplane.com/DFlyMisc/threadripper.txt
             | 
             | Or just buy an Epyc, if you want a low-TDP ECC-definitely-
             | supported part: EPYC 7302P has 16/32 cores, 3GHz base
             | clock, and 155W nominal TDP. EPYC 7282 has 16/32 cores, 2.8
             | GHz base, and 120W nominal TDP. These are all zen2 (vs
             | 2950X's zen+) and will outperform zen+ on a clock-for-clock
             | basis.
             | 
             | > And though ECC is not disabled in Ryzen CPUs, AFAIK it's
             | not tested in (or advertised for) those, so one won't be
             | able to return/replace a CPU if it doesn't work with ECC
             | memory, AIUI, making it risky.
             | 
             | If your vendor won't accept defective CPU returns, buy
             | somewhere else.
             | 
             | > Though I don't know how common it is for ECC to not be
             | handled properly in an otherwise functioning CPU; are there
             | any statistics or estimates around?
             | 
             | ECC support requires motherboard support; that's the main
             | thing to be aware of shopping for Ryzen ECC setups. If the
             | board doesn't have the traces, there's nothing the CPU can
             | do.
        
             | theevilsharpie wrote:
             | > And though ECC is not disabled in Ryzen CPUs, AFAIK it's
             | not tested in (or advertised for) those
             | 
             | ECC isn't validated by AMD for AM4 Ryzen models, but it's
             | present and supported if the motherboard also supports it.
             | Many motherboards have ECC support (the manual will say for
             | sure), and a handful of models even explicitly advertise it
             | as a feature.
             | 
             | I have a Ryzen 9 3900X on an ASRock B450M Pro4 and 64 GB of
             | ECC DRAM, and ECC functionality is active and working.
        
               | colejohnson66 wrote:
               | What do you mean by "validated"? There's the silicon, but
               | they don't test it?
        
               | Laforet wrote:
               | More like "The feature is present in silicon but
               | motherboard makers are not required to turn it on". At
               | the end of the day, ECC support does require extra copper
               | traces in the PCB and some low end models may
               | deliberately choose to skip them, thus the expectation
               | has to be managed.
        
               | loeg wrote:
               | IMO, "validated" is intentionally wishy-washy and mostly
               | means that AMD would prefer it if enterprises paid them
               | more money by buying EPYC (or Ryzen Pro) parts instead of
               | consumer Ryzen parts. Much like how Intel prefers selling
               | higher-margin Xeons over Core i5. It's market
               | segmentation, but friendlier to consumers than Intel's
               | approach.
        
             | cturner wrote:
             | I went through this about a year ago, to build a low-TDP
             | ECC workstation. I do not have stats on failure rates, just
              | this anecdotal experience. ASRock and ASUS seem to be the
              | boards to get. For RAM, I got two sticks of Samsung
              | M391A4G43MB1, and verified that ECC works. The advice I
              | remember from the forums was to stick to unbuffered RAM
              | (UDIMMs).
        
               | everybodyknows wrote:
               | Did you consider any off-the-shelf ECC boxes?
               | 
               | Found some here -- bottom of the EPYC product line starts
               | at $2849 ...!
               | 
               | https://www.velocitymicro.com/wizard.php?iid=337
        
               | loeg wrote:
               | Yes, the consumer parts only support UDIMMs. If you want
               | RDIMMs, you have to pay for EPYC.
        
           | CalChris wrote:
           | Yeah, the iMac Pro has the Xeon W and ECC. T'would be nice if
           | the Apple Silicon MacBook Pro had it. There's not much of a
           | reason to pay for the Pro over the Air. But like Linus, I'm
           | going to blame Intel for this situation in the market. Maybe
           | Apple will strike out on its own with Apple Silicon but since
           | their dominant use case is phones, I'll not hold my breath.
        
             | DCKing wrote:
             | Unless something weird happens, the next generation of the
             | Apple M-line will use LPDDR5 memory instead of the LPDDR4X
             | used in the Apple M1. While it probably won't support error
              | correction _monitoring_, LPDDR5 has built-in error
             | correction that silently corrects single bit flips. That
             | alone should be a huge reliability improvement.
             | 
             | LPDDR5 will enable some much needed level of error
             | correction in a metric ton of other future SoC designs too.
             | I look forward to the future Raspberry Pi with built in
             | error correction capabilities.
        
           | rhn_mk1 wrote:
           | Doesn't intel make ECC available on the i3 line of CPUs?
        
             | xxs wrote:
             | Not any more[0] - the current i3-10300 doesn't. It used
             | to[1] - the i3-9300 did:
             | 
             | 0: https://ark.intel.com/content/www/us/en/ark/products/199
             | 281/...
             | 
             | 1: https://ark.intel.com/content/www/us/en/ark/products/134
             | 886/...
        
             | minot wrote:
             | I was going to say no but I just checked and at least ONE
             | latest generation i3 processor supports ECC
             | 
             | https://ark.intel.com/content/www/us/en/ark/compare.html?pr
             | o...
             | 
             | https://ark.intel.com/content/www/us/en/ark/products/208074
             | /...
             | 
             | The problem is that this is an embedded processor, so it's
             | probably not for us:
             | 
             | > Industrial Extended Temp, Embedded Broad Market Extended
             | Temp
             | 
             | My understanding is Intel does not support ECC on the
             | desktop unless you pay extra.
        
               | hollerith wrote:
               | That i3 is for file servers.
        
               | makomk wrote:
               | Yeah, that appears to be a BGA-packaged processor
               | designed to be permanently soldered to the board of some
               | embedded device, not something that you can install in
               | your desktop at all. I'm not sure why Intel decided to
               | brand their embedded processors with ECC as i3, though I
               | suspect the reason this range exists at all is because
               | companies were going with competitors like AMD instead
               | due to their across-the-board ECC support.
        
             | opencl wrote:
             | They used to support ECC in the desktop i3 lineup, current
             | gen does not have ECC except in some embedded SKUs.
             | 
             | https://ark.intel.com/content/www/us/en/ark/products/199280
             | /...
        
           | vbezhenar wrote:
           | You can find non-Xeons with ECC support. But they are rare
           | and usually suitable for some kinds of micro servers.
        
           | fortran77 wrote:
           | While it's true that Intel only has ECC support on Xeon (and
           | several other chips targeted at the embedded market) it's not
           | true that ECC is supported well on AMD.
           | 
           | We _only_ use Xeons on developer desktops and production
           | machines here precisely because of ECC. It's about 1 bit
           | flip/month/gigabyte. That's too much risk when doing
           | something critical for a client.
        
             | loeg wrote:
             | > it's not true that ECC is supported well on AMD.
             | 
             | That's an extreme claim. Why do you say so?
        
             | theevilsharpie wrote:
             | > it's not true that ECC is supported well on AMD
             | 
             | ECC is supported on most Ryzen models[1], as long as the
             | motherboard supports it. In fact, ASUS and ASRock (possibly
             | others) have Ryzen motherboards designed for
             | workstation/server use where ECC support is specifically
             | advertised.
             | 
             | [1] The only exception is the Ryzen CPUs with integrated
             | graphics.
        
               | js2 wrote:
               | It depends what you mean by supported. Semi-officially:
               | 
               |  _ECC is not disabled. It works, but not validated for
               | our consumer client platform.
               | 
               | Validated means run it through server/workstation grade
               | testing. For the first Ryzen processors, focused on the
               | prosumer / gaming market, this feature is enabled and
               | working but not validated by AMD. You should not have
               | issues creating a whitebox homelab or NAS with ECC memory
               | enabled._
               | 
               | https://old.reddit.com/r/Amd/comments/5x4hxu/we_are_amd_c
               | rea...
        
               | loeg wrote:
               | Your quote is for consumer platforms (Ryzen) only; GP's
               | statement was that ECC is not well-supported on AMD _at
               | all_, which is obviously false (EPYC, Threadripper).
        
               | adrian_b wrote:
               | Yes, there is a risk of buying a Ryzen CPU with non-
               | functional ECC.
               | 
               | However, I use only computers with ECC, previously only
               | Xeons, but in the last years I have replaced many of them
               | with Ryzens, all of which work OK with ECC memory.
               | 
               | When the choice was between a very small risk of losing
               | the price of a CPU and the certainty of using, for many
               | years, an Intel CPU with half the speed of the AMD one,
               | the decision was very obvious for me.
        
               | theevilsharpie wrote:
               | AMD may claim not to validate ECC on Ryzen, but it's
               | working well enough for major motherboard vendors to
               | market Ryzen motherboards with ECC advertised as a
               | feature.
               | 
               | ECC support not being "validated," for all practical
               | purposes, simply means that board vendors can advertise a
               | board lacking ECC support as compatible with AMD's AM4
               | platform, without getting a nasty letter from AMD's
               | lawyers.
        
             | jeffbee wrote:
             | > While it's true that Intel only has ECC support on Xeon
             | 
             | That's not true. There are Core i3, Atom, Celeron, and
             | Pentium SKUs with ECC. E.g. the Core i3-9300
             | 
             | https://en.wikichip.org/wiki/intel/core_i3/i3-9300
        
         | lighttower wrote:
         | Can you get decent battery life with this ecc memory in a
         | laptop?
        
           | dijit wrote:
           | Yes. ECC memory uses only marginally more power than non-ECC
           | memory. And memory isn't the largest consumer of battery life
           | by a country mile.
           | 
           | Screen, Wi-Fi, and to a much lesser extent (unless under
           | load) the CPU are the biggest culprits of low battery life.
        
             | indolering wrote:
             | It can actually reduce power consumption, because refresh
             | rates don't need to be so high:
             | 
             | https://media-
             | www.micron.com/-/media/client/global/documents...
        
         | hosteur wrote:
         | How did you track memory errors across thousands of physical
         | machines?
        
           | core-questions wrote:
           | https://github.com/netdata/netdata/issues/1508
           | 
           | Looks like `mcelog --client` might be a starting place? Feed
           | that into your metrics pipeline and alert on it like anything
           | else...
        
             | jeffbee wrote:
             | Newer Linux has replaced mcelog with edac-util. I think
             | most shops operating systems at that scale are getting
             | their ECC errors out of band via IPMI SEL, though.
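             |
             | For a rough sense of what that looks like, here's a minimal
             | sketch that polls the standard Linux EDAC sysfs counters
             | (the print is a stand-in for whatever metrics pipeline you
             | feed):
             |
             |     import glob, time
             |
             |     def read_edac_counts():
             |         # Each memory controller exposes corrected (ce) and
             |         # uncorrected (ue) error counts via the EDAC sysfs tree.
             |         counts = {}
             |         pattern = "/sys/devices/system/edac/mc/mc*/ce_count"
             |         for p in glob.glob(pattern):
             |             mc = p.split("/")[-2]
             |             with open(p) as f:
             |                 ce = int(f.read())
             |             with open(p.replace("ce_count", "ue_count")) as f:
             |                 ue = int(f.read())
             |             counts[mc] = (ce, ue)
             |         return counts
             |
             |     while True:
             |         for mc, (ce, ue) in read_edac_counts().items():
             |             # Replace print with a push to your metrics system.
             |             print(f"{mc} corrected={ce} uncorrected={ue}")
             |         time.sleep(60)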
        
               | gsvelto wrote:
               | It's rasdaemon these days:
               | https://www.setphaserstostun.org/posts/monitoring-ecc-
               | memory...
        
           | ikiris wrote:
           | The same way you do it with everything else, export the
           | telemetry and store it in time series...
        
         | incrudible wrote:
         | When you say bitflips were "common" on thousands of physical
         | machines, does that mean you observed thousands of bitflips?
         | 
         | Otherwise, I would think that an unlikely event becoming 1000x
         | more likely by sheer numbers would have warped your perception.
         | 
         | I believe that hardware reliability is mostly irrelevant,
         | because software reliability is already far worse. It doesn't
         | matter whether a bitflip (unlikely) or some bug (likely) causes
         | a node to spuriously fail, what matters is that this failure is
         | handled gracefully.
        
           | ikiris wrote:
           | It's enough that the graphs can show you solar weather.
           | 
           | I can't give my source, but its far higher than most people
           | think. Just pay the money.
        
           | dkersten wrote:
           | Another comment[1] mentioned 1 bitflip per gigabyte per
           | month. With a lot of RAM that adds up fast: at that rate, a
           | 64 GB machine would see roughly two flips a day.
           | 
           | > It doesn't matter whether a bitflip (unlikely) or some bug
           | (likely) causes a node to spuriously fail
           | 
           | Except that a bitflip can go undetected. It _may_ crash your
           | software or system, but it also may simply leak errors into
           | your data, which can be far more catastrophic.
           | 
           | [1] https://news.ycombinator.com/item?id=25623206
        
             | jhasse wrote:
             | So can a bug.
        
               | dkersten wrote:
               | Yes. And? That doesn't suddenly make bitflips benign.
        
               | incrudible wrote:
               | The point is that you can't prevent failure by just
               | buying something. You have to deal with the fact that
               | failure _can not be prevented_.
               | 
               | In other words, if a single defective DIMM somewhere in
               | your deployment is causing catastrophic failure, your
               | mistake wasn't buying the wrong RAM modules. Your
               | mistake was relying on a single point of failure for
               | mission-critical data.
        
           | tyoma wrote:
           | It depends where the failure happens. Sometimes you really
           | lose the "failure in the wrong place" lottery. For example,
           | in a domain name: http://dinaburg.org/bitsquatting.html
        
           | jjeaff wrote:
           | Ya, I'm not buying that bitflips are a problem. Or maybe
           | modern software can correct better for this? Because I use my
           | desktop all day every day running tons of software on 64 GB
           | of RAM and I don't get errors or crashes often enough to
           | remember ever having one.
        
             | ChrisLomont wrote:
             | > I'm not buying that bitflips are a problem.
             | 
             | Google and read up - it is a problem, has killed people,
             | has thrown election results, and much more.
             | 
             | It's such a common problem that bitsquatting is a real
             | thing :)
             | 
             | Want to do an experiment? Pick a bitsquatted domain for a
             | common site, and see how often you get hits.
             | 
             | https://en.wikipedia.org/wiki/Bitsquatting
        
               | incrudible wrote:
               | Nobody denies that bitflips _happen_. On the whole, you
               | fail to make a case that preventing bitflips is the
               | solution to a problem. Bitsquatting is not a real
               | problem, it 's a curiosity.
               | 
               | As for the case of bitflips killing someone: Bitflips are
               | not the root cause here. The root cause is that somebody
               | engineered something life-critical that mistakenly
               | assumed hardware can not fail. Bitflips are just one of
               | many reasons for hardware failure.
        
               | ChrisLomont wrote:
               | >Bitflips are not the root cause here.
               | 
               | So those systems didn't fail when a bitflip happened?
               | 
               | > The root cause is that somebody engineered something
               | life-critical that mistakenly assumed hardware can not
               | fail.
               | 
               | The systems I am aware of were designed with bitflips in
               | mind. NO software can handle arbitrary amounts of
               | bitflips. ALL software designed to mitigate bitflips only
               | lowers the odds via various forms of redundancy. (For
               | context, I've written code for NASA, written a few
               | proposals on making things more radiation hardened, and
               | my PhD thesis was on a new class of error correcting
               | codes - so I do know a little about making redundant
               | software and hardware specifically designed to mitigate
               | bitflips).
               | 
               | By claiming a bitflip didn't kick off the problems, and
               | trying to push the cause elsewhere, you may as well blame
               | all of engineering for making a device that can kill on
               | failure.
               | 
               | So your argument is a red herring.
               | 
               | >On the whole, you fail to make a case that preventing
               | bitflips is the solution to a problem
               | 
               | Yes, had those bitflips been prevented, or not happened,
               | those fatalities would not have happened.
               | 
               | >Ya, I'm not buying that bitflips are a problem.
               | 
               | If bitflips are not a problem then we don't need ECC ram
               | (or ECC almost anything!) which is clearly used a lot. So
               | bitflips are enough of a problem that a massively
               | widespread technology is in place to handle precisely
               | that problem.
               | 
               | I guess you've never written a program and watched bits
               | flip on computers you control? You should try it - it's a
               | good exercise to see how often it does happen.
               | 
               | I guess you define something being a problem differently
               | than I or the ECC ram industry do.
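               |
               | A toy version of that experiment, for the curious (just a
               | sketch: it assumes a non-ECC box with enough free RAM that
               | the region never gets swapped out, and a flip landing in
               | the interpreter's own structures would show up as a crash
               | instead):
               |
               |     import time
               |
               |     SIZE = 512 * 1024 * 1024       # 512 MiB canary region
               |     PATTERN = 0xAA
               |     buf = bytearray([PATTERN]) * SIZE
               |
               |     while True:
               |         time.sleep(3600)           # rescan once an hour
               |         if buf.count(PATTERN) != SIZE:    # fast check
               |             for i, b in enumerate(buf):   # locate the flip
               |                 if b != PATTERN:
               |                     print(f"flip at offset {i}: {b:#04x}")
               |                     buf[i] = PATTERN      # keep watching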
        
             | dkersten wrote:
             | Crashes aren't such a big problem. You can detect them and
             | reboot or whatever. Silent data corruption is the real
             | issue IMHO.
             | 
             | See also this comment above:
             | https://news.ycombinator.com/item?id=25623764
        
           | adrian_b wrote:
           | On a single computer with a large memory, e.g. 32 GB or more,
           | the time between errors can be of a few months, if you are
           | lucky to have good modules. Moreover, some of the errors will
           | have no effect, if they happened to affect free memory.
           | 
           | Nevertheless, anyone who uses the computer for anything else
           | besides games or movie watching, will greatly benefit from
           | having ECC memory, because that is the only way to learn when
           | the memory modules become defective.
           | 
           | Modern memories have a shorter lifetime than old memories and
           | very frequently they begin to have bit errors from time to
           | time long before breaking down completely.
           | 
           | Without ECC, you will become aware that a memory module is
           | defective only when the computer crashes or no longer boots
           | and severe data corruption in your files could have happened
           | some months before that.
           | 
           | For myself, this was the most obvious reason why ECC was
           | useful, because I was able in several cases to replace memory
           | modules that began to have frequent correctable errors, after
           | many years with little or no errors, without losing any
           | precious data and without downtime.
        
             | ikiris wrote:
             | The good modules bit is important. I'm told by some
             | colleagues that most of the bit flips are from alpha
             | particles from the ram casings surprisingly enough.
        
       | petermcneeley wrote:
       | I would also add that Row Hammer Attacks are much harder on ECC.
       | 
       | When I first tried to replicate the row hammer attack I was not
       | getting any results. Turns out I was doing this on ECC. On
       | non-ECC memory the same test easily replicated the row hammer
       | attack.
       | 
       | https://en.wikipedia.org/wiki/Row_hammer
        
       | rahimiali wrote:
       | I have trouble parsing information from this rant. Is someone
       | willing to translate this into an argument (a string of facts
       | tied by logical steps)?
        
         | mark-r wrote:
         | 1. Linux sometimes has crashes, not due to software errors but
         | because of memory glitches.
         |
         | 2. ECC would prevent memory glitches.
         |
         | 3. ECC is hard to find on desktop PCs because Intel uses the
         | feature to differentiate desktop CPUs from server CPUs, so it
         | can charge more for servers.
         |
         | 4. Even when someone like AMD makes the feature available, the
         | market doesn't have ECC DRAM modules or motherboards readily
         | available, because Intel killed the demand for it.
        
       | phh wrote:
       | I don't know if ECC is that important, but reliability of RAM (or
       | any storage) feels pretty crazy to me. For 128GB being refreshed
       | every second to go a month without error, the per-bit refresh
       | process needs a reliability of 99.9999999999999999%.
       | Considering we are dealing with quantum effects (which are
       | inherently probabilistic), I wouldn't trust myself to design
       | anything like that.
       | 
       | Now back to ECC, I'll probably be corrected, but I don't think
       | ECC helps gain more than two orders of magnitude, so we still
       | need incredibly reliable RAM. If we move to ECC RAM by default
       | everywhere, aren't we simply going to get less reliable RAM in
       | the end?
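       |
       | Back-of-the-envelope version of that number (a sketch that takes
       | "one refresh per bit per second" at face value and asks for a
       | 50/50 chance of a flawless month):
       |
       |     import math
       |
       |     bits = 128 * 2**30 * 8       # 128 GB of RAM, in bits
       |     seconds = 30 * 24 * 3600     # roughly one month
       |     events = bits * seconds      # one refresh per bit per second
       |     # For a 50% chance the whole month is flawless:
       |     # (1 - q)**events = 0.5, so q ~= -ln(0.5) / events
       |     q = -math.log(0.5) / events
       |     print(f"{events:.2e} bit-refreshes, q must stay below {q:.1e}")
       |     # -> ~2.8e18 events; q below ~2.4e-19 per bit per refresh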
        
         | formerly_proven wrote:
         | RAM is not as reliable as you think. Some ECC memory hardly
         | ever finds an error, some machines see them at a very
         | consistent rate, e.g. 50 errors per TB-day. That would
         | translate to 1-2 errors per day in a 32 GB PC. Without ECC you
         | cannot know in which bucket you are.
        
           | trevyn wrote:
           | If true, that seems like... a very straightforward bucket to
           | test if you're in.
        
             | toast0 wrote:
             | The bucket can change over time though. If you want to be
             | sure, you need to test often, which gets in the way of
             | using the computer.
        
         | bitcharmer wrote:
         | A system on Earth, at sea level, with 4 GB of RAM has a 96%
         | percent chance of having a bit error in three days without ECC
         | RAM. With ECC RAM, that goes down to 1.67e-10 or about one
          | chance in six billion.
         | 
         | So I'd say ECC _is_ not only important but insanely impactful.
         | There 's a reason why many organizations don't even want to
         | hear about getting rigs with non-ECC memory.
        
           | gzalo wrote:
           | That number is flawed, and the author did a follow-up with
           | better results: http://lambda-diode.com/opinion/ecc-memory-2
           | 
           | "33 to 600 days to get a 96% chance of getting a bit error."
           | Still, it seems way too high. I guess anyone with ECC RAM
           | could confirm that they are getting those sort of recovered
           | error rates?
        
           | mrlala wrote:
           | So, I hear what you are saying. But, on the other hand, I
           | have been using 2 non-ECC desktops for a workstation/server
           | for the past ~6 years.. and I would be hard pressed to come
           | up with a single situation where either of the machines
           | randomly crashed or applications did anything 'unexpected'
           | (to my knowledge, of course).
           | 
           | My point is, when you say there is a "96% chance of having an
           | error in THREE DAYS", one would EXPECT to be having issues
           | like.. all the time? So I'm not disagreeing with you, but
           | with the amount of non-ECC machines all over the world and
           | how insanely stable modern machines are, it still seems like
           | a very low risk.
           | 
           | Now of course I agree that if you want to take every
           | precaution, go ECC, but simple observation proves that this
           | "problem" can't be as bad as the numbers say.
        
             | bitcharmer wrote:
             | Your questions are perfectly valid. It's just that out of
             | all the random bit flips that happen over a period of time
             | on a non-ECC platform only a miniscule percentage will
             | manifest to you in any noticeable way.
             | 
             | Most will escape your attention.
        
           | johndough wrote:
           | I ran a memory test for two weeks straight on a consumer
           | laptop with 8 GB RAM and could not get a single bit flip, so
           | your mileage may vary.
        
             | bitcharmer wrote:
             | How did you run those tests? From what I understand on the
             | topic, for your results to be statistically significant you
             | need at least hundreds of machines and very rigid testing
             | methodology.
        
               | avian wrote:
               | As someone who also ran a similar test myself and haven't
               | seen a bit flip, I'm also skeptical of the 96% figure.
               | 
               | I'm too lazy to run the exact numbers right now, but with
               | "4 GB, 96% percent chance, three days" as the hypothesis,
               | I think you'll find that an experimental result of "8 GB,
               | 0% chance, 14 days" is highly statistically significant.
               | 
               | Edit: rough back of napkin estimate - you're not seeing
               | an event in roughly 10x trials (2x number of bits and ~5x
               | number of days). Given hypothesis is true your
               | experimental result has a probability of (1-0.96)^10 =
               | very very small. Conclusion: hypothesis is false.
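               |
               | Putting slightly more careful numbers on it (a sketch that
               | assumes flips arrive as a Poisson process and scale
               | linearly with memory size):
               |
               |     import math
               |
               |     # Claim: P(at least one flip) = 0.96 in 3 days on 4 GB
               |     rate = -math.log(1 - 0.96) / 3   # flips/day per 4 GB
               |     # The test above: 8 GB (2x) for 14 days, zero flips.
               |     expected = rate * 2 * 14
               |     p_none = math.exp(-expected)
               |     print(f"rate ~{rate:.2f}/day, expected ~{expected:.0f}, "
               |           f"P(seeing none) ~{p_none:.1e}")
               |     # -> ~1.1/day, ~30 expected flips, P(none) ~ 1e-13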
        
               | bitcharmer wrote:
               | The 96% figure comes from Google and was obtained in a
               | large scale experiment over many months. I've been in
               | this business long enough to have witnessed adverse
               | effects of cosmic rays on non-ECC memory multiple times
               | myself. I don't think your sample gets anywhere near
               | statistical significance, not to mention the testing
               | methodology.
        
               | toast0 wrote:
               | My anecdotal evidence is far from rigorous, but the
               | Google data from ten years ago doesn't match up with my
               | experience running thousands of ECC enabled servers up to
               | a few years ago. Their rates seem a lot higher than what
               | my servers experienced; we would page on any ram errors,
               | correctable or not (uncorrectable would halt the machine,
               | so we would have to inspect the console to confirm; when
               | we knowingly tried machines with uncorrectable errors
               | after a halt, they nearly all failed again within 24
               | hours, so those we didn't inspect the console of probably
               | were counted on their second failure), and while there
               | were pages from time to time, it felt like a lot less
               | than 8% of the machines having an error in a year.
               | 
               | There's a lot of variables that go into RAM errors,
               | including manufacturing quality and condition of the ram,
               | the dimm, the dimm slot, the motherboard generally, the
               | power supply, the wiring, and the temperature of all of
               | those. Google was known for cost cutting in their
               | servers, especially early on; so I wouldn't be surprised
               | if some of that resulted in higher bitflip rate than
               | running in commercially available servers. Things like
               | running bare motherboards, supported only on the edges
               | cause excess strain and can impact resistance and
               | capacitance of traces on the board (and in extreme cases,
               | break the traces).
        
           | tomxor wrote:
           | I like when people back up their claims with numbers, but
           | would you mind describing roughly what that 96% probability
           | of error is based upon?
           | 
           | I understand altitude has some kind of proportionality to
           | cosmic ray exposure, and number of bits will multiply the
           | probability of _an_ error... I'm presuming there is also an
           | inherent error rate to DRAM separate from environment. But
           | what are those numbers?
        
             | bitcharmer wrote:
             | Apologies, you're totally right. I should have linked to
             | the source:
             | 
             | http://lambda-diode.com/opinion/ecc-
             | memory#:~:text=A%20syste....
        
               | tomxor wrote:
               | Great thanks!
               | 
               | [edit]
               | 
               | Looks like the calculation was revised [0] after
               | criticism:
               | 
               | > Under these assumptions, you'll have to wait about 33
               | to 600 days to get a 96% chance of getting a bit error.
               | 
               | What's more worrying is the variance, the above
               | calculation is based on expected well behaved DRAM.. yet
               | some computers just seem to have manufacturing defects
               | that make the incidence of errors high enough to be a
               | regular problem.
               | 
               | [0] http://lambda-diode.com/opinion/ecc-memory-2
        
           | dejj wrote:
           | And even higher in the vicinity of radioactive cattle:
           | https://www.jakepoz.com/debugging-behind-the-iron-curtain/
        
           | davidw wrote:
           | Could you measure altitude with memory?
        
             | asimpletune wrote:
             | That's a very interesting idea, and I think you totally
             | could. You run some benchmarks, measure the bit flips, and
             | after enough runs you'd be able to say with a degree of
             | confidence what your altitude is. I wonder though what
             | accuracy could be achieved with this?
        
             | cyberlurker wrote:
             | If the 96% every 3 days is true, you could approximate
             | based on that. But it would be a really slow measurement.
        
             | tomxor wrote:
             | :D yes, although I expect you would need either a
             | prohibitively large quantity of memory or an extremely slow
             | rate of change in altitude to effectively measure it.
        
       | rafaelturk wrote:
       | Little bit offtopic: Again seems that Intel? what?! is the one
       | lowering the bar.
        
       | b0rsuk wrote:
       | I browsed some online listings for ECC memory modules, and they
       | seem to be sold one module at a time. Standard DDR4 modules are
       | sold in pairs, to benefit from dual channel mode.
       | 
       | Does ECC memory support dual channel??
        
       | KingMachiavelli wrote:
       | Is there such a thing as 'software' ECC where a segment in memory
       | also has a checksum stored in memory and the CPU just verifies it
       | when the memory segment is accessed?
       | 
       | It would be a lot slower than real ECC but it could just be used
       | for operations that would be especially vulnerable to bit flips.
       | It would also not know for certain whether the memory segment of
       | data or the memory segment holding the checksum was corrupted,
       | beyond what their relative sizes suggest (the checksum is much
       | smaller, so it is less likely to have had a bit flip in its
       | memory region).
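       |
       | The detection half of this is easy enough to sketch in user space
       | (a toy example: it only covers data the program explicitly wraps,
       | not its code, stack, or the runtime itself):
       |
       |     import zlib
       |
       |     def protect(data: bytes) -> bytes:
       |         # Append a CRC32; a bit flip in either the payload or
       |         # the checksum makes verification fail on access.
       |         return data + zlib.crc32(data).to_bytes(4, "big")
       |
       |     def access(blob: bytes) -> bytes:
       |         data, crc = blob[:-4], int.from_bytes(blob[-4:], "big")
       |         if zlib.crc32(data) != crc:
       |             raise RuntimeError("block failed checksum")
       |         return data
       |
       |     blob = bytearray(protect(b"important bytes"))
       |     blob[3] ^= 0x10              # simulate a bit flip
       |     access(bytes(blob))          # raises instead of using bad data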
        
         | a1369209993 wrote:
         | Actually... there _is_ a word of memory that you already have
         | to _read_ every time you access a region of memory: the page
         | table entry for that region. If you have 64-byte cache lines,
         | that 's 64 lines per (4KB) page, so you could load a second
         | 64-bit word from the page table[0], and use that as a parity
         | bit for each cache line, storing it back on write the same way
         | you store active and dirty bits in the PTE proper. Actual E[
         | _correcting_ ]C would require inflating the effective PTEs from
         | 8(original)-16(parity) bytes to about 64(7 bits per line,
         | insufficient)-128(15, excessive), which is probably untenable,
         | but you could at least get parity checks this way.
         | 
         | There's also the obvious tactic of just storing every logical
         | 64-bit word as 128 bits of physical memory, which gives you
         | room for all kinds of crap[1], at the expense of halving your
         | effective memory and memory bandwidth.
         | 
         | 0: This is extremely cheap since you're loading a 64- vs
         | 128-bit value, with no extra round trip time and still fits in
         | a cache line, so you're likely just paying extra memory use
         | from larger page tables.
         | 
         | 1: Offhand, I think you could fit triple or even quadruple
         | error _correction_ into that kind of space (there 's room for
         | _eight_ layers of SECDED, but I don 't remember how well bit-
         | level ECC scales).
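         |
         | The bookkeeping for the parity-only variant is tiny; here is a
         | sketch of the per-page computation (the interesting part, doing
         | it in hardware on every line fill and writeback, is of course
         | not shown):
         |
         |     def page_parity_word(page: bytes) -> int:
         |         # One parity bit per 64-byte cache line of a 4 KiB page,
         |         # packed into a single 64-bit word.
         |         assert len(page) == 4096
         |         word = 0
         |         for line in range(64):
         |             chunk = page[line * 64:(line + 1) * 64]
         |             x = 0
         |             for byte in chunk:
         |                 x ^= byte
         |             parity = bin(x).count("1") & 1
         |             word |= parity << line
         |         return word
         |
         | On a read you recompute the line's parity and compare it with
         | the stored bit; a mismatch means something flipped.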
        
         | temac wrote:
         | Intel has some recent patents on that.
        
       | zdw wrote:
       | Good news is that for DDR5, ECC is a required part of the spec
       | and should be a feature of every module:
       | 
       | https://www.anandtech.com/show/15912/ddr5-specification-rele...
        
         | [deleted]
        
         | rajesh-s wrote:
         | A whitepaper on DDR4 ECC by Micron that goes over some of the
         | implementation challenges
         | 
         | https://media-www.micron.com/-/media/client/global/documents...
        
         | toast0 wrote:
         | On die ECC is great for increasing reliability, if all else is
         | equal, but if it doesn't report to the memory controller, and
         | if the memory controller doesn't report to the OS, I think it
         | will be worse than status quo, because all else won't be equal.
         | With no feedback, systems are going to continue to run on the
         | edge, but now detectable failures will all be multi-bit,
         | because single-bit errors are hidden.
        
           | cududa wrote:
           | Huh? Why would the memory controller not be updated
           | accordingly? Also I have no idea about Linux or Mac, but
           | Windows has had ECC support and active management for
           | decades?
        
             | indolering wrote:
             | It's part of the firmware first trend of fixing things at
             | the firmware level before reporting problems up the stack.
             | This makes it a real nightmare for systems integrators to
             | do root cause analysis.
        
             | mlyle wrote:
             | Normally, ECC has meant just the DIMM stores some extra
             | bits, and the memory controller itself implements ECC--
             | writing the extra parity, and recovering when errors emerge
             | (and halting when non-recoverable errors happen).
             | 
             | DDR5 includes on-die ECC, where the RAM fixes the errors
             | before sending them over the memory bus.
             | 
             | This means if the bus between the processor and ram
             | corrupts the bits-- tough luck, they're still corrupted.
             | And it's unclear whether we're going to get the quality of
             | memory error reporting that we're used to or get the
             | desired halt-on-non-recoverable error behavior (I've not
             | been able to obtain/read the DDR5 specification as yet).
        
               | cududa wrote:
               | Thank you!
        
         | [deleted]
        
         | hinkley wrote:
         | Is it built in as an added feature, or as the only way to make
         | DDR5 reliable? My inner cynic is screaming the latter.
         | 
         | When the value add feature becomes a necessity, it's not a
         | value add any more.
        
         | CoolGuySteve wrote:
         | I always wondered why ECC isn't built into the memory
         | controller; the same hardware that runs the bus into L3 or the
         | page mapper could checksum groups of cachelines.
         | 
         | It seems redundant to have every module come with its own
         | checking hardware.
        
           | p_l wrote:
           | ECC is a function of memory controller, not memory, on
           | current systems. There's also usually some form of ECC on
           | whatever passes for system bus, and internal caches have ECC
           | as well.
           | 
           | For the memory controller, parity/ECC/chipkill/RAIM usually
           | involves simply adding additional memory planes to store the
           | correction data. I believe the rare exceptions are fully
           | buffered memories, where you effectively have a separate
           | memory controller on each module (or add-in card with DIMMs).
        
           | kasabali wrote:
           | AFAIK it is built into the memory controller, at least for
           | ECC UDIMM. There's an extra DRAM chip on the module for
           | parity (generally 8+1), but it is the memory controller's
           | responsibility to utilize it (that's why not all CPUs support
           | ECC).
        
         | bradfa wrote:
         | I read it to say that on die ecc is recommended but that dimm-
         | wide ecc is still optional.
         | 
         | And now you have 8 bits of ecc per 32 data versus older DDR
         | having 8 bits of ecc per 64 data. Hence the cost for dimm-wide
         | ecc is going up.
        
       | cbanek wrote:
       | As someone who has had to read thousands of random game crash
       | reports from all over the interwebs (you know when Windows says
       | you might want to send that crash log? like that), I totally
       | agree.
       | 
       | Of all the things to be worried about, like OS bugs, bad hardware
       | configuration, etc. bad memory is one of those really troubling
       | things. You look at the code and say "it can't make it here,
       | because this was set" but when you can't trust your memory you
       | can't trust anything.
       | 
       | And as the timeline goes to infinity, you may also get one of
       | these reports and be asked to fix it... good luck.
        
         | lighttower wrote:
         | Someone reads those reports!?! Wow, how do I write them to
         | ensure someone who reads them takes them seriously?
        
         | apankrat wrote:
         | Aye. I have an assert in the code that fronts a _very_ pedantic
         | test of the context. In all cases when this assert was tripped
         | (and reported) an overnight memtest86 test surfaced RAM issues.
         | 
         | - Edit -
         | 
         | Also, bit flips in the non-ECC memory are _the_ cause of the
         | "bitrot" phenomenon. That is when you write out X to a storage
         | device, but you get Y when you read it back. A common
         | explanation is that the corruption happens _at rest_. However
         | all drives from the last 30+ years have FEC support, so in
         | reality the only way a bit rot can happen is if the data is
         | damaged _in transit_, while in RAM, on the way to/from the
         | storage media.
         | 
         | So, if you are ever deciding whether to get ECC RAM, get it.
         | It's very much worth it.
        
         | pkaye wrote:
         | I wonder how much of those crashes are due to gamers
         | aggressively overclocking their systems?
        
         | faitswulff wrote:
         | Do the crash reports include whether the machine has ECC
         | memory?
        
           | jackric wrote:
           | Do the crash reports include recent solar activity?
        
             | cbanek wrote:
             | Well, I've had to actually worry about radiation bitflips
             | as well. It does happen. But usually not so much on Earth!
        
               | dharmab wrote:
               | I once got to tell a CTO the reason our shiny new point
               | to point connection was suddenly trash was due to solar
               | flares.
        
               | jgalentine007 wrote:
               | One of the tire pressure sensors in my car tires had a
               | bit flip a couple years ago and I had to reprogram its
               | ID. Luckily it was a Subaru, so only a light came on in
               | the dash.
               | 
               | My old Honda CRV however would turn traction control on
               | if your pressure was low - which worked by applying
               | brakes to wheels that were slipping. If you were going up
               | a slippery hill you would soon have no power, sliding
               | backwards nearly off the road in nowhere West Virginia on
               | the way to a ski resort.
        
               | jjeaff wrote:
               | How in the world would you ever know that problem was
               | caused by a bit flip and not just one of the countless
               | other reasons that a sensor could fail?
        
               | jgalentine007 wrote:
               | I have a TPMS programming tool (ATEQ QuickSet) and reader
               | (Autel TS401), because I like to swap my winter / summer
               | tires on my own. The TPMS light came on one day and
               | inflating tires didn't help - I used the reader and found
               | that one sensor's ID had changed. When I compared the ID
               | (it was in hex) to the last programming - it was a single
               | bit off. I couldn't reprogram the sensor itself, but I
               | was able to update the ECU with the changed ID using the
               | ATEQ.
               | 
               | I live in Denver but spend a lot of time skiing around
               | 11k feet, maybe the higher elevation means more
               | radiation.
        
               | dharmab wrote:
               | Similar story, we saw that one particular IP address in a
               | public cloud network had a 3% TLS handshake error rate.
               | We diverted traffic and then analyzed with wireshark. We
               | found one particular bit was being pulled low (i.e. 0 ->
               | 0 and 1 -> 0). HTTP connections didn't notice but TLS
               | checksum verifications would randomly fail. Had a hell of
               | a time convincing the cloud provider they had a hardware
               | fault- turned out to be a bug which disabled ECC on some
               | of their hardware.
               | 
               | Aside: I'm surprised you got a TPMS programming tool
               | instead of a set of steelies. Big wheels? Multiple winter
               | vehicles?
        
               | jgalentine007 wrote:
               | I have 2 cars. I like the TPMS to work since I've had 3
               | nails in tires in 4 years (newer construction area). Also
               | the TPMS light in my impreza is almost as bright as the
               | sun.
        
             | jacquesm wrote:
             | Timestamp + location should be enough to figure that out.
        
               | ant6n wrote:
               | It would be interesting to see whether there is a
               | correlation between solar activity and game crashes --
               | which in turn may provide an indication whether crashes
               | are due to bugs or bit flips.
        
           | Triv888 wrote:
           | most gaming desktops don't use ECC RAM anyways (at least
           | those from a few years ago)
        
           | jacquesm wrote:
           | On intel consumer boxes it is pretty safe to assume that they
           | don't, on AMD it might be the case but it usually isn't.
        
         | Springcleaning wrote:
         | Worse than a game crash is losing your data.
         | 
         | It is incomprehensible that there are still NAS devices being
         | sold without ECC support.
         | 
         | Synology took a step in the right direction to offer prosumer
         | devices with ECC but it is not really advertised as such. It is
         | actually difficult to find which do have ECC and which ones
         | don't.
        
           | ksec wrote:
           | >Synology took a step in the right direction to offer
           | prosumer devices with ECC
           | 
           | I just looked it up, because if it were true it would have
           | been news to me. Synology has been known to be stingy with
           | hardware specs. But none of what I would call prosumer - the
           | Plus series - have ECC memory by default. And there are
           | "Value" and "J" series below that.
           |
           | Edit: Only two models from the new xx21 series, using AMD
           | Ryzen V, have ECC memory by default.
        
         | BlueTemplar wrote:
         | Yeah, here's one example among many more:
         | 
         | https://forums.factorio.com/viewtopic.php?p=405060#p405060
        
       | dboreham wrote:
       | You don't need to look at kernel crashes to speculate about bus
       | and memory errors -- just check the logs on a few systems that do
       | have ecc. Pretty soon you'll see correctable errors being
       | reported.
        
         | maddyboo wrote:
         | I don't know much about this topic, but is it possible that ECC
         | memory is more prone to single bit errors than non-ECC memory
         | because there is less pressure on companies to minimize such
         | errors? If this were the case, it would skew the data.
        
       | belzebalex wrote:
       | Asked myself, would it be possible to build a Geiger counter with
       | RAM?
        
       | johnklos wrote:
       | From the fortune database:
       | 
       | As far as we know, our computer has never had an undetected
       | error. -- Weisert
        
       | otterley wrote:
       | D. J. Bernstein (of qmail/daemontools fame) spoke of it over a
       | decade ago as well. https://cr.yp.to/hardware/ecc.html
        
         | slim wrote:
         | these days he's more famous for the NaCl crypto library
        
           | loup-vaillant wrote:
           | For which bit flips are even more relevant: EdDSA has this
           | nasty tendency of leaking the private key if the wrong bits
           | are flipped (there are papers on fault injection attacks).
           | People who sign lots of stuff all the time, say _Let's
           | Encrypt_, could conceivably gain some peace of mind with ECC.
           | 
           |  _(Note: EdDSA is still much much better than ECDSA, most
           | notably because it's easier to implement correctly.)_
        
       | 1996 wrote:
       | Linus is absolutely right.
       | 
       | I am trying to get a laptop with dual NVMe (for ZFS) and ECC RAM.
       | I can't get that, at all - even without the other fancy things I
       | would like such as a 4k OLED with pen/touchscreen.
       | 
       | In 2020, even the Dell XPS stopped shipping OLED (goodbye dear
       | 7390!)
       | 
       | I will gladly give my money to anyone who sells AMD laptop with
       | ECC. Hopefully, it will show there's demand for "high end yet non
       | bulky laptops"
        
         | miahi wrote:
         | Lenovo P53 has 3 NVMe slots, 4k OLED with touchscreen (and
         | optional pen) and up to 128GB ECC RAM if you choose the Xeon
         | processor. It's big and heavy, but it exists.
         | 
         | I hope AMD will create a better market for the ECC laptop
         | memory (right now it's hard to find + expensive).
        
           | 1996 wrote:
           | I know- I had my eye on this very model, as you can even add
           | a mSata on the WWAN slot to get a 4th drive.
           | 
           | Unfortunately, Lenovo is not selling the P53 anymore, which
           | is exactly why I say I can't get that even in a "bulky"
           | version.
        
       | otterley wrote:
       | About 1/3 of Google's machines and 8% of Google's DIMMs in their
       | fleet suffer at least one correctible memory error per year:
       | http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
        
         | jjeaff wrote:
         | Which means, assuming google is running very large machines
         | with lots of memory that one might expect a single correctable
         | error once every 6-10 years on your average workstation or
         | small server. That's generously assuming your workstation has
         | 1/3 as much memory as the average google server.
        
           | Nebasuke wrote:
           | Google does not use very large or even large machines for
           | most of their fleet. You can quickly see in the paper this is
           | for 1, 2, and 4 GB RAM machines (in 2006-2008).
        
       | mauri870 wrote:
       | In case the page is not loading, refer to the Wayback Machine[1]
       | for a copy
       | 
       | [1]
       | https://web.archive.org/web/*/https://www.realworldtech.com/...
        
       | JumpCrisscross wrote:
       | What is the status of ECC on Macs?
        
         | CalChris wrote:
         | The iMac Pro, which has a Xeon W. There's a good chance that
         | will go away with the new Apple Silicon iMac Pro due out this
         | year. MacRumors' roundup article doesn't mention ECC.
         | 
         | https://www.macrumors.com/roundup/imac/
        
       | MAXPOOL wrote:
       | Well shit.
       | 
       | I run some large ML models on my home PC and I get NaNs and some
       | out of range floats every month or so. I have spent hours
       | debugging but doing the same computation with the same random
       | seeds does not recreate the problem.
       | 
       | How about GPUs and their GDDR SDRAM? Do they have parity bits?
        
         | layer8 wrote:
         | Some pro-level Nvidia GPUs have ECC RAM, they are very
         | expensive though. I don't think regular gaming GPUs have
         | parity, due to the extra cost, performance impact (probably
         | minor but measurable) and irrelevance for gaming.
        
           | vbezhenar wrote:
           | Cheap pro-level GPUs don't have ECC RAM either. And it's not
           | easy to find out, it might be buried somewhere.
        
         | [deleted]
        
       | JoeAltmaier wrote:
       | ECC works if done right. Accessing a memory location can fix bit-
       | flips (ECC is a 'correcting' code). But systems that don't
       | regularly visit every memory location can accumulate risk. Those
       | dark corners of RAM can eventually get double-bit errors and be
       | uncorrectable. So an OS might 'wash' RAM during idle moments,
       | reading every location in a round-robin manner to get ECC to kick
       | in and auto-correct. Doesn't matter how fast (1M every hour or
       | whatever) as long as somehow ECC has a chance to work.
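       |
       | The scheduling idea is simple enough to sketch (a toy only: a
       | real scrubber lives in the kernel or the memory controller, since
       | it has to generate actual DRAM reads rather than cache hits):
       |
       |     import time
       |
       |     def scrub(buf, chunk=1 << 20, interval=1.0):
       |         # Walk the region round-robin, touching one chunk per
       |         # interval. The rate barely matters, as long as every
       |         # location eventually gets read so ECC can correct it.
       |         pos = 0
       |         while True:
       |             end = min(pos + chunk, len(buf))
       |             _ = sum(buf[pos:end])   # force a read of every byte
       |             pos = 0 if end == len(buf) else end
       |             time.sleep(interval)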
        
         | jacquesm wrote:
         | Interesting, similar to scrubbing raid arrays. How often do
         | those double bitflips appear though? You'd have to have a
         | pretty long running server for that to be a problem, no?
        
           | jeffbee wrote:
           | According to Google's old paper on the subject, about 1% of
           | their machines suffered from an uncorrectable (i.e. multi-
           | bit) error in a year.
        
         | temac wrote:
         | The RAM already needs to be refreshed and IIRC it is done by
         | the memory controller when not in sleep mode.
         | 
         | However I don't remember if there are provisions for ECC
         | checking in case there are some dedicated refresh commands. I
         | hope so, but I'm not sure.
        
         | musingsole wrote:
         | A double-bit error in many cases is fine. If the error is at
         | least detectable at the time of a read, your protection worked.
         | What's scary is a triple-flip event. Most of those will still
         | look like corrupted data, but if it happens to flip into
         | looking like a fixable, single-bit error, you're out of luck
         | and won't even know it.
        
           | a1369209993 wrote:
           | > Most of those will still look like corrupted data,
           | 
           | Not if you're using a typical 72-bit SECDED code[0].
           | 
           | You have two error indicators: a summary parity bit (even
           | number of errors: 0,2,etc vs odd number of errors: 1,etc),
           | and an error index: 0 for no errors, or the bitwise xor of
           | the locations of each bit error.
           | 
           | For a triple error at bits a,b, and c, you'll have summary
           | parity of 1 (odd number of errors, assumed to be 1), and an
           | error index of a^b^c, in the range 0..127, of which 0..71[1]
           | (56.25%, a clear albeit not overwhelming majority) will
           | correspond to legitimate single-bit errors.
           | 
           | 0: https://en.wikipedia.org/wiki/Hamming_code#Hamming_codes_w
           | it...
           | 
           | 1: or 72 out of 128 anyway; the active bits might not all be
           | assigned contiguous indexes starting from zero, but it
           | doesn't change the probability and it's simpler to analyse if
           | summary is bit 0 and index bit i is substrate bit 2^i.
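           |
           | A toy version with the (8,4) extended Hamming code shows the
           | same effect at a size you can check by hand (same structure
           | as the 72-bit code, just 4 data bits):
           |
           |     def encode(d):                  # d = [d1, d2, d3, d4]
           |         c = [0] * 8                 # positions 1..7 used
           |         c[3], c[5], c[6], c[7] = d
           |         c[1] = c[3] ^ c[5] ^ c[7]   # p1 covers 1,3,5,7
           |         c[2] = c[3] ^ c[6] ^ c[7]   # p2 covers 2,3,6,7
           |         c[4] = c[5] ^ c[6] ^ c[7]   # p3 covers 4,5,6,7
           |         overall = 0
           |         for i in range(1, 8):
           |             overall ^= c[i]         # summary parity bit
           |         return c, overall
           |
           |     def decode(c, overall):
           |         s = 0
           |         if c[1] ^ c[3] ^ c[5] ^ c[7]: s |= 1
           |         if c[2] ^ c[3] ^ c[6] ^ c[7]: s |= 2
           |         if c[4] ^ c[5] ^ c[6] ^ c[7]: s |= 4
           |         parity = overall
           |         for i in range(1, 8):
           |             parity ^= c[i]
           |         if s == 0 and parity == 0:
           |             return "clean"
           |         if parity == 1:
           |             return f"'single' error at position {s}"
           |         return "uncorrectable double error"
           |
           |     c, ov = encode([1, 0, 1, 1])
           |     for pos in (1, 2, 7):           # flip three bits
           |         c[pos] ^= 1
           |     print(decode(c, ov))   # bogus single error at position 4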
        
         | electricshampo1 wrote:
         | Patrol scrub is basically this (https://www.intel.com/content/
         | dam/www/public/us/en/documents...); it is built into the memory
         | controller, no OS involvement is needed.
        
           | electricshampo1 wrote:
           | working link:
           | 
           | https://www.intel.com/content/dam/www/public/us/en/documents.
           | ..
        
       | wagslane wrote:
       | It really does. I did a write-up recently on it as I was diving
       | in and understanding the benefits:
       | https://qvault.io/2020/09/17/very-basic-intro-to-elliptic-cu...
        
         | avianes wrote:
         | Be careful not to confuse ECC memory with ECC encryption.
         | 
         | ECC memory = memory with Error-Correcting Code
         | 
         | ECC encryption = Elliptic Curve Cryptography
        
       | _0ffh wrote:
       | Please someone correct me if I'm wrong, but as far as I can
       | remember memory with extra capacity for error detection used to
       | be a rather common thing on early PCs. That really only changed a
       | couple of decades in, in order to be able to offer lower prices
       | to home users who didn't know or care about the difference.
       | Probably about the time, or earlier, when for some hard disk
       | manufacturers megabytes suddenly shrank to 10^6 bytes (before
       | kibibytes or mebibytes were a thing, btw).
        
         | wmf wrote:
         | Yes, PCs used to use parity memory.
        
       | musingsole wrote:
       | It's a shame we don't have ECC for individuals. How many of
       | society's bugs come from someone wandering around with a bit
       | flipped?
        
       | ratiolat wrote:
       | I have: an Asus PRIME A520M-K motherboard, 2x M391A2K43DB1-CVF
       | (Samsung 16GiB ECC unbuffered RAM), and an AMD Ryzen 5 3600.
       | 
       | I specifically was looking for bang for buck, low(er) wattage and
       | ECC.
        
         | IanCutress wrote:
         | Those AMD motherboards with consumer CPUs are a bit iffy. They
         | run ECC memory, but it's hard to tell if it is running in ECC
         | mode. Even some of the tools that identify ECC is running will
         | say it is, even when it isn't, because the motherboard will
         | report it is, even when it isn't. ECC isn't a qualified metric
         | on the consumer boards, hence all the confusion.
        
       | linsomniac wrote:
       | This reminds me of last year we ordered a new $14K server, it
       | arrived and we ran it through our burn-in process which included
       | running memtest86 on it, and it would, after around 7 hours,
       | generate errors.
       | 
       | Support was only interested if their built-in memory tester
       | showed errors, which it wouldn't - even on its most thorough
       | setting it would only run for ~3 hours.
       | "correctable memory errors", but I may be misremembering that.
       | 
       | "We've run this test on every server we've gotten from you,
       | including several others that were exactly the same config as
       | this, this is the only one that's ever thrown errors". Usually
       | support is really great, but they really didn't care in this
       | case.
       | 
       | We finally contacted sales. "Uh, how long do we have to return
       | this server for a refund?" All of a sudden support was willing to
       | ship us out a replacement memory module (memtest86 identified
       | which slot was having the problem), which resolved the problem.
       | 
       | They were all too willing to have us go to production relying on
       | ECC to handle the memory error.
        
       | FartyMcFarter wrote:
       | Does anyone know why ECC memory requires the CPU to support it?
       | 
       | Naively, I can understand why error _reporting_ has dependencies
       | on other parts of the system, but it would seem possible for
       | error _correction_ to work transparently.
        
         | TomVDB wrote:
         | I think the memory just provides additional storage bits to
         | detect the issue, but doesn't contain the logic.
         | 
         | This is in line with all technical parameters of DRAM:
         | everything must be as cheap as possible, and all the difficult
         | parts are moved to the memory controller.
         | 
         | Which is the right thing to do, because you can share one
         | memory controller with multiple DRAM chips.
        
         | wmf wrote:
         | Historically, the detection and correction has been
         | performed in the memory controller, not in the DRAM.
        
         | toast0 wrote:
         | As implemented today, ECC is a feature of the memory
         | controller. You need special ram, because instead of 8 parallel
         | rams per bank, you need 9, and all the extra data lines to go
         | to the controller.
         | 
         | Modern CPUs have integrated memory controllers, so that's why
         | the CPU needs to support it.
         | 
         | Correction without reporting isn't great; anyway, you _need_
         | a reporting mechanism for uncorrectable errors, or all
         | you've done is ensure any memory errors you do experience
         | are worse.
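         | 
         | To make the extra bits concrete: a standard ECC DIMM is 72
         | bits wide instead of 64, and the controller typically uses
         | the 8 extra bits for a SECDED (single-error-correct,
         | double-error-detect) code over each 64-bit word. Below is a
         | toy Python sketch of the same idea over a single byte; the
         | bit layout and names are made up for illustration and are
         | not the actual DDR encoding:
         | 
         |     def encode(byte):
         |         bits = [0] * 13          # bits[0] = overall parity
         |         dpos = [3, 5, 6, 7, 9, 10, 11, 12]
         |         for i, p in enumerate(dpos):
         |             bits[p] = (byte >> i) & 1
         |         for p in (1, 2, 4, 8):   # Hamming check bits
         |             for j in range(1, 13):
         |                 if j != p and (j & p):
         |                     bits[p] ^= bits[j]
         |         bits[0] = sum(bits[1:]) % 2
         |         return bits
         | 
         |     def decode(bits):
         |         syndrome = 0
         |         for p in (1, 2, 4, 8):
         |             s = 0
         |             for j in range(1, 13):
         |                 if j & p:
         |                     s ^= bits[j]
         |             if s:
         |                 syndrome |= p
         |         if syndrome and sum(bits) % 2:
         |             bits[syndrome] ^= 1      # fix the single flip
         |         elif syndrome:
         |             raise ValueError("uncorrectable double flip")
         |         dpos = [3, 5, 6, 7, 9, 10, 11, 12]
         |         return sum(bits[p] << i for i, p in enumerate(dpos))
         | 
         |     word = encode(0xA5)
         |     word[6] ^= 1              # simulate a cosmic-ray flip
         |     assert decode(word) == 0xA5  # corrected transparently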
        
       | nix23 wrote:
       | I always have that conversation when ZFS comes up. Some
       | people think ZFS NEEDS ECC, but in fact ZFS needs ECC just as
       | much as every single FS in Linux does. And every reliable
       | machine needs ECC.
        
       | paulie_a wrote:
       | There was a great defcon talk a while back regarding ECC (or
       | the lack of it). The concept was called "dns jitter"
       | 
       | Basically you can register domains that differ by small bit
       | flips from a target domain and start getting email and such
       | for that domain
       | 
       | If I recall correctly the example given was a variation of
       | microsoft.com
       | 
       | All because so much equipment doesn't use ECC
        
         | zx2c4 wrote:
         | Voila http://media.blackhat.com/bh-
         | us-11/Dinaburg/BH_US_11_Dinabur...
        
           | tyoma wrote:
           | There were some great follow up talks as well! It turns out a
           | viable attack vector was also MX records. And there was the
           | guy who registered kremlin.re ( versus kremlin.ru ).
        
         | jeffbee wrote:
         | micposoft.com is only one bit away from microsoft.com. Used
         | to see these problems all the time when I worked on gmail.
         | 
         | At Google even with ECC everywhere there wasn't enough
         | systematic error detection and correction to prevent the global
         | database of monitoring metrics from filling up with garbage.
         | /rpc/server/count was supposed to exist but also in there would
         | be /lpc/server/count and /rpc/sdrver/count and every other
         | thing. Reminded me daily of the terrors of flipped bits.
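         | 
         | For the curious, it's easy to enumerate which hostnames sit
         | a single flipped bit away from a given name while still
         | being made of valid DNS characters. A quick sketch (mine,
         | just for illustration; the talk linked above explores this
         | far more thoroughly):
         | 
         |     import string
         | 
         |     ALLOWED = set(string.ascii_lowercase
         |                   + string.digits + "-.")
         | 
         |     def bitsquats(name):
         |         out = set()
         |         for i, ch in enumerate(name):
         |             for bit in range(8):
         |                 c = chr(ord(ch) ^ (1 << bit)).lower()
         |                 if c in ALLOWED and c != ch:
         |                     out.add(name[:i] + c + name[i + 1:])
         |         return sorted(out)
         | 
         |     print(bitsquats("microsoft.com"))
         |     # includes 'eicrosoft.com', 'micposoft.com', ...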
        
       | [deleted]
        
       | louwrentius wrote:
       | ECC matters, even on the desktop, it's not even a discussion, to
       | me.
       | 
       | If you think it doesn't matter: how do you know? If you don't run
       | with ECC memory, you'll never know if memory was corrupted (and
       | recovered).
       | 
       | That blue screen, that sudden reboot, that program crashing. That
       | corrupted picture of your kid.
       | 
       | Who knows.
       | 
       | I'll tell you who knows: every goddamn sysadmin (or the modern
       | equivalent) can tell you how often they get ECC errors. And at
       | even a small scale you'll encounter them. I have, on servers
       | and even on a SAN storage controller, for crying out loud.
       | 
       | If you care about your data, use ECC memory in your computers.
        
         | supernovae wrote:
         | I've got nearly 30 years of experience and not once has
         | non-ECC memory led to corruption. Maybe a crash, maybe a
         | panic, maybe a kernel dump...
         | 
         | But.. in all my time operating servers over 3 decades, it's
         | always been bad drivers, bad code and problematic hardware
         | that's caused most of my headaches.
         | 
         | Have I seen ECC error correction in logs? Yeah. I don't
         | advocate against it, but I've found that for most people you
         | design around multiple failure scenarios more than you
         | design around preventing specific ones.
         | 
         | Take the average web app - you run it on 10 commodity
         | systems and distribute the load... if one crashes, so what.
         | Chances are, a node will crash for many reasons other than
         | memory issues.
         | 
         | If you have an app that requires massive amounts of RAM or
         | you do put all of your eggs in one basket, then ECC makes
         | sense...
         | 
         | I just know I like going horizontal and I avoid vertical
         | monoliths.
        
           | louwrentius wrote:
           | The problem with memory corruption is not just crashes, those
           | are the more benign outcomes.
           | 
           | The real killer is data corruption. How would you even
           | begin to know that data is corrupted until it is too late?
        
           | ajnin wrote:
           | > I've got nearly 30 years of experience and not once has
           | non-ECC memory led to corruption
           | 
           | How do you know?
        
           | ptx wrote:
           | > if one crashes, so what
           | 
           | Crashes might not matter, but silent data corruption does.
           | The owner/user of that data will care when they eventually
           | discover that it at some point mysteriously got corrupted.
        
         | alkonaut wrote:
         | I know what it does, but I still don't care (so long as it
         | costs money or even 1% performance).
         | 
         | It's a tradeoff between money/performance and the frequency of
         | crashes, corruption etc.
         | 
         | Bit rot is just one of many threats to my data. Backups take
         | care of that as well as other threats like theft, fire,
         | accidental deletion.
         | 
         | This is similar to my reasoning around the recent side channel
         | attacks on intel CPUs. If I had a choice I'd like to run with
         | max performance without the security fixes even though it would
         | be less secure. Not because I don't care about security but
         | because 1% or 5% perf is a lot and I'd rather simply avoid
         | doing anything security critical on the machine entirely than
         | take that hit.
        
           | louwrentius wrote:
           | > Bit rot is just one of many threats to my data. Backups
           | take care of that as well as other threats like theft, fire,
           | accidental deletion.
           | 
           | No, that's the big mistake people make: backups just back
           | up bit-rotted data, until it is too late and the last good
           | version is rotated out and lost forever.
        
             | alkonaut wrote:
             | I'm aware. But the risk is extremely small (and 99.9% of
             | important data is not created on the machine but goes
             | directly from e.g. an iOS camera to backup).
             | 
             | My desktop machine is basically a gaming rig with
             | disposable data. Hence the "performance over integrity".
             | 
             | I also never rotate anything out. Every version of
             | everything is in the backups. Storage is that cheap these
             | days.
        
           | mark-r wrote:
           | Backups can't fix what was already corrupted when it was
           | written to disk.
        
       | kensai wrote:
       | "ECC availability matters a lot - exactly because Intel has been
       | instrumental in killing the whole ECC industry with it's horribly
       | bad market segmentation."
       | 
       | Its.
       | 
       | There, I finally corrected Linus Torvalds in something. :))
        
         | hugey010 wrote:
         | He uses "do do" instead of "to do" which is a more obvious
         | typo. Linus usually comes across as borderline arrogant, and
         | deservedly so, but not necessarily perfect in his writing. I
         | think it's an effective strategy to communicate his priorities
         | and wrangle smart but easily intimidated folk "do do" what he
         | believes is right!
        
         | mark-r wrote:
         | I have a simple way of remembering when to leave out the
         | apostrophe. His, hers, its are all possessive and none of them
         | have an apostrophe.
        
           | Glanford wrote:
           | In this particular case 'it's' can also be possessive
           | although it's considered non-standard, so to be correct you
           | can always treat it like a contraction of 'it is'.
        
         | raverbashing wrote:
         | Yeah, I'm always annoyed with this kind of mistake,
         | especially as non-native speakers should know better than
         | the native ones (who usually don't give a f.).
         | 
         | Now the point about internally doing ECC is an interesting one,
         | could be a way out of this mess. And apparently ECC is more
         | available in AMD land
        
           | tssva wrote:
           | The really annoying thing is that auto correct on mobile
           | device keyboards will often want to incorrectly change "its"
           | to "it's" or vice versa.
        
             | raverbashing wrote:
             | Yes, auto-corrects compound the problem.
        
           | simias wrote:
           | For a 2nd language speaker making these homophonic mistakes
           | is actually a sign of fluency. It means that you just
           | transcribe a mental flow of words instead of consciously
           | constructing the language.
           | 
           | The first time I wrote "your" instead of "you're" in English
           | I thought it was quite a milestone!
        
             | raverbashing wrote:
             | > For a 2nd language speaker making these homophonic
             | mistakes is actually a sign of fluency.
             | 
             | I kinda disagree, because while the homophony works in
             | (spoken) English, in writing it sticks out like a sore
             | thumb. So yeah, you will make the mistake if you've only
             | heard the word but don't know the written form.
             | 
             | (And in their native language it's probably two unrelated
             | words, so that might intensify the feeling of wrongness)
        
               | simias wrote:
               | I mean, my native language is French where "your" is
               | "ton" and "you're" is "tu es", yet it (rarely) happens
               | that I mix them up in English. If I proofread I'll spot
               | it almost every single time, but if I'm just typing my
               | "stream of consciousness" my brain's speech-to-text
               | module sometimes messes up.
        
               | leetcrew wrote:
               | meh, plenty of (intelligent!) native english speakers do
               | not know all the canonical grammar rules. english
               | contains a lot of what could be considered error
               | correction bits, so it doesn't usually impede
               | understanding. syntactically perfect english with
               | weird/misused idioms (common among non-native speakers
               | with lots of formal education) is harder to understand in
               | my experience. I imagine this is true of most natural
               | languages.
        
               | protomolecule wrote:
               | For what it's worth, as a non-native speaker I too
               | started making these kinds of errors when my English
               | became fluent enough.
        
               | [deleted]
        
             | andi999 wrote:
             | Yes, I noticed this. When I was younger, I thought: how
             | can you mix up 'their, they're, there'? People who do
             | this must be the opposite of smart. This lasted for 4
             | years of living in an English-speaking country....
        
             | harperlee wrote:
             | As an "english as a second language" user, I can't see
             | myself writing e.g. "should of" instead of "should have",
             | however fluent I am. I think you don't make that kind of
             | typo unless you have learnt english before grammar.
        
               | simias wrote:
               | I also wouldn't do this one, but that's because in my
               | English accent I simply wouldn't pronounce them the same
               | way. Also the word sequence "should of" is extremely
               | uncommon in proper English, so it catches the eye more
               | easily I think.
               | 
               | "You're/your", "their/they're", "its/it's" and the like
               | are a different story, because I do pronounce those the
               | same and they're all very common.
        
               | lolc wrote:
               | I was quite surprised when it started happening to me.
        
               | harperlee wrote:
               | Wow that's interesting!
        
           | young_unixer wrote:
           | I've realized that when I'm engaged in the writing (angry or
           | emotional in some way) I tend to commit more of these
           | mistakes, even though I know the difference between "it's"
           | and "its". Linus is always angry, so that probably makes him
           | commit more orthographic mistakes.
        
           | touisteur wrote:
           | I think it's available for consumer SKUs on AMD and not
           | just for servers like in 'Xeon-land'... How I've wanted an
           | ECC-ready NUC...
        
             | jeffbee wrote:
             | The AMD parts all have the ECC feature but the platform
             | support outside of EPYC may as well not exist. Most
             | motherboards for the Ryzen segment don't do it properly or
             | don't do it at all, some support it but aren't capable of
             | reporting events to the operating system which is dumb.
             | Ryzen laptops don't have it either.
             | 
             | Closest you can come to a nuc with ecc is I think a mini
             | server equipped with one of the four-core i3 parts that
             | have ecc.
        
             | erkkie wrote:
             | Probably not what you meant but https://ark.intel.com/conte
             | nt/www/us/en/ark/products/190108/... has support for Xeon
             | (and ECC). Now how to actually practically source 32GB ECC
             | enabled SO-DIMM sticks ..
        
           | africanboy wrote:
           | As a non-native speaker, my phone has both the Italian
           | and English dictionaries; when I write "its" it always
           | auto-corrects to "it's" as soon as I hit space, and
           | sometimes it goes unnoticed.
        
           | phkahler wrote:
           | >> But is ECC more available in AMD land?
           | 
           | Yes it is. The problem is they don't really advertise it.
           | I'm not certain, but it might even be standard on AMD
           | chips; but if they don't say so and board makers are also
           | unclear, who knows...
        
             | ethbr0 wrote:
             | It's a market size problem.
             | 
             | For consumer motherboard OEMs, only AMD effectively has ECC
             | support (Intel's has been so spotty and haphazard from
             | product to product), and of AMD users, only a small number
             | care about ECC.
             | 
             | So motherboard companies, being resource and time-starved
             | as they are, don't make it a priority to address such a
             | small user-base.
             | 
             | If Intel started shipping ECC on everything, it would go a
             | long way towards shifting the market.
        
               | [deleted]
        
         | jacquesm wrote:
         | How is your Finnish?
        
           | jankeymeulen wrote:
           | Or Swedish for that matter, as I believe Torvalds' mother
           | tongue is Swedish
        
             | [deleted]
        
             | jacquesm wrote:
             | Finnish is stupendously hard. Far harder than Swedish, at
             | least, by my estimation.
        
               | dancek wrote:
               | Yes. Swedish is also easy compared to English and French,
               | the other two languages I've learned after early
               | childhood. The only thing that makes it hard is that you
               | never really have use for it and you're forced to learn
               | it nevertheless here in Finland.
               | 
               | I'm happy to see people here on HN respect the difficulty
               | of learning languages. Most foreigners that speak Finnish
               | do it very poorly at first and even after decades they
               | still sound like foreigners. But it shows huge respect to
               | our small country for someone to make the effort, and we
               | really appreciate it. I'm hoping other people see
               | learning their own mother tongue the same way. Sure, most
               | of us need English, but learning it _well_ is still a
               | huge task.
        
               | dehrmann wrote:
               | It is. Swedish and English are both Germanic languages,
               | so there are a lot of commonalities. Finnish is in a
               | completely different language family. English and Swedish
               | are more closely related to Persian and Hindi than to
               | Finnish.
        
             | young_unixer wrote:
             | Yes. https://www.youtube.com/watch?v=0rL-0LAy04E
        
           | Igelau wrote:
           | It could use some Polish.
        
             | jacquesm wrote:
             | Dobrze ;)
        
           | xxs wrote:
           | Linus must have English as his '1st' language now. For
           | someone who isn't originally a native speaker, mistakes
           | like 'it's vs its', 'than vs then', etc. are pretty
           | uncommon.
        
             | Tade0 wrote:
             | I guess this is what happens when someone first learns
             | to _speak_ the language, learning how to write it only
             | later on - as is often the case with children.
             | 
             | I spent my preschool years in a multicultural environment
             | and English was our _lingua franca_ (ironically the school-
             | mandated language was French), so I didn't properly learn
             | contractions until grade school - same with similarly
             | sounding words like "than vs then" and "your vs you're".
        
             | jacquesm wrote:
             | I've spent my whole life speaking multiple languages and
             | this still trips me up every now and then, in fact quotes
             | as such are a problem for me and I keep using them wrong,
             | no idea why, it just won't register. So unless I slow down
             | to 1/10th of my normal writing speed I will definitely make
             | mistakes like that. Good we have proofreaders :)
        
               | dehrmann wrote:
               | (guessing you mean apostrophes)
               | 
               | It's because they have two different uses (three if you
               | count nested quotes, but those aren't common and are
               | pretty easy to figure out), contractions and possession,
               | and they seemingly collide on words like "its" where
               | you'd think it could mean either.
               | 
               | Not sure if you've already learned this (or if it helps),
               | but English used to be declined, and its pronouns still
               | are, e.g. they/their/them. That's why "its" isn't
               | contracted; the possessive marker is already in the word.
        
               | mixmastamyk wrote:
               | His, hers, its
        
         | JosephRedfern wrote:
         | Maybe he composed the message using a machine with non-ECC RAM
         | and suffered a bit flip, which through some chain of events,
         | led to the ' being added. Best to give him the benefit of
         | the doubt, I think!
        
           | notretarded wrote:
           | The mistake was that it was included.
        
             | JosephRedfern wrote:
             | Oops, that was dumb. Fixed, thanks.
        
       | spacedcowboy wrote:
       | Seems likely that "bad ram" was the reason for the recent AT&T
       | fiber issues, given that 1 bit was being flipped reliably in data
       | packets [1]
       | 
       | [1]:
       | https://twitter.com/catfish_man/status/1335373029245775872?l...
        
         | p_l wrote:
         | I have in the past encountered an issue where a line card
         | was stripping exactly one bit of address data. Don't know of
         | the follow-up investigation, but it probably wasn't TCAM
        
         | SV_BubbleTime wrote:
         | I think you meant seems _un_likely
        
       | MarkusWandel wrote:
       | This is one justified Linus rant! My personal history includes
       | data loss twice because of defective RAM, and many more RAMs
       | discarded after the now obligatory overnight run of MemTest86+
       | (these were all secondhand RAMs - I would never buy a new one
       | without a refund guarantee). My very first "PC" still had the ECC
       | capability and I used it. My own now very dated rant on the
       | subject: http://wandel.ca/homepage/memory_rant.html
        
         | mixmastamyk wrote:
         | A few years back memtest86 wouldn't run on newer machines, has
         | that been fixed?
        
       | IgorPartola wrote:
       | I wish this was more of a cohesive argument. He says he thinks
       | it's important and points to row-hammer problems but doesn't
       | explain why. Probably because the audience it was written for
       | already knows the arguments of why, but this is not the best
       | argument.
       | 
       | If in doubt, get ECC. Do your own research on how it works and
       | why. This post won't explain it; it just blames Intel
       | (probably rightfully so).
        
         | turminal wrote:
         | It's a message in a thread on a technical forum. I think
         | its intended audience is people already familiar with ECC,
         | unlike here on HN.
        
           | IgorPartola wrote:
           | Exactly my point :)
        
         | eloy wrote:
         | He does explain it:
         | 
         | > We have decades of odd random kernel oopses that could never
         | be explained and were likely due to bad memory. And if it
         | causes a kernel oops, I can guarantee that there are several
         | orders of magnitude more cases where it just caused a bit-flip
         | that just never ended up being so critical.
         | 
         | It might be false, but I think it's a reasonable assumption.
        
           | IgorPartola wrote:
           | To someone on HN who isn't familiar with what ECC does that
           | explains nothing about how ECC works and how it could have
           | prevented these situations. Or how often they really happen.
        
             | simias wrote:
             | The problem is that, if you don't have ECC to detect the
             | errors, it's very hard to know what exactly caused a
             | random, non-reproducible crash. Especially in kernel mode
             | where there's little memory protection and basically any
             | driver could be writing anywhere at any time.
             | 
             | I can understand Linus's frustration from that point of
             | view: without ECC RAM when you get some super weird crash
             | report where some pointer got corrupted for no apparent
             | reason, you can't be sure if it was just a random
             | bitflip or if it's actually hiding a bigger problem.
        
               | andi999 wrote:
               | You could run memtest on a PC without ECC for a couple
               | of days and estimate the error rate, no?
        
               | fuster wrote:
               | Pretty sure most memory test tools like memtest86 write
               | the memory and then read it back shortly thereafter in
               | relatively small blocks. This makes the window for errors
               | to be introduced dramatically smaller. Most memory in a
               | computer is not being continually rewritten under normal
               | use.
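               | 
               | Roughly the write-then-verify idea, as a toy sketch
               | (mine; a real tester like memtest86 runs bare-metal,
               | so it controls physical addresses, access patterns
               | and timing in ways a user-space script can't):
               | 
               |     MIB = 2 ** 20
               |     buf = bytearray(16 * MIB)
               |     for i in range(len(buf)):    # write a pattern
               |         buf[i] = (i * 0x9E) & 0xFF
               |     # ... let the data sit for a while ...
               |     bad = [i for i in range(len(buf))
               |            if buf[i] != (i * 0x9E) & 0xFF]
               |     print(len(bad), "mismatched bytes")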
        
               | simias wrote:
               | If you manage to replicate bitflips every few days your
               | RAM is broken.
               | 
               | It's the "once every other year" type of bitflip that's
               | the problem. The proverbial "cosmic ray" hitting your
               | DRAM and flipping a bit. That will be caught by ECC but
               | it'll most likely remain a total mystery if it causes
               | your non-ECC hardware to crash.
        
               | zlynx wrote:
               | It isn't only cosmic rays. Regular old radiation can also
               | cause it. I've read about a server that had many repeated
               | problems and the techs replaced the entire motherboard at
               | one point.
               | 
               | Then one of them brought in his personal Geiger counter
               | and found the radiation coming off the steel in that rack
               | case was significantly higher than background.
               | 
               | You may never know when the metal you use was recycled
               | from something used to hold radioactive materials.
        
             | reader_mode wrote:
             | It takes 5 seconds to Google ECC memory if you're really
             | interested and if you're working on kernel related stuff
             | you 99.9999% know what it is.
        
               | IgorPartola wrote:
               | Right. My point is that TFA serves zero purpose for
               | most people on here. Those who know how ECC works
               | already know that it is a must-have. Those who don't
               | will learn very little from the post, because it fails
               | to explain what ECC is and why you need it, aside from
               | general statements about memory errors. It will
               | reaffirm for those who know what ECC RAM is that it's
               | a good idea, but they already know it anyway. It reads
               | a lot like an article about why vitamin C is a good
               | thing.
        
               | nix23 wrote:
               | To someone on HN who isn't familiar with what Google does
               | that explains nothing about how Google works ;)
        
               | TheCoelacanth wrote:
               | Google is like an evil version of Duck Duck Go.
        
               | Danieru wrote:
               | Nah, to Google is just a generic verb. For example I too
               | do all my googling at Duck Duck Go.
               | 
               | Hi alphabet lawyers.
        
               | vorticalbox wrote:
               | I believe there was a suit against alphabet about this
               | very thing.
               | 
               | They argued that 'Google' has now become a verb meaning
               | 'to search the Internet for' and as such alphabet should
               | have the name taken away.
        
             | chalst wrote:
             | From https://en.m.wikipedia.org/wiki/ECC_memory -
             | 
             | > A large-scale study based on Google's very large
             | number of servers was presented at the
             | SIGMETRICS/Performance '09 conference.[6] The actual
             | error rate found was several orders of magnitude higher
             | than the previous small-scale or laboratory studies,
             | with between 25,000 (2.5 x 10^-11 error/bit*h) and
             | 70,000 (7.0 x 10^-11 error/bit*h, or 1 bit error per
             | gigabyte of RAM per 1.8 hours) errors per billion device
             | hours per megabit. More than 8% of DIMM memory modules
             | were affected by errors per year
        
       | unixhero wrote:
       | Fantastic burn by Linus Torvalds, who also had some skin in
       | the CPU game.
       | 
       | Offtopic, I wonder if he trawls that site regularly. And I
       | also wonder: is he here too? :)
        
       | knorker wrote:
       | I have multiple times postponed buying new computers for YEARS,
       | because I'm waiting for intel to get their head out of their ass
       | and actually let me buy something that does ECC for desktop.
       | (incl laptops)
       | 
       | I would have bought computers when I "wanted one". Now I buy them
       | when I _need_ one. Because buying a non-ECC computer just feels
       | like buying a defective product.
       | 
       | In the last 10 years I would have bought TWICE as many computers
       | if they hadn't segmented their market.
       | 
       | Fuck intel. I sense that Linus censored himself in this post,
       | and like me is even angrier than the text implies.
        
         | vbezhenar wrote:
         | There are plenty of Xeons which are suitable for desktops and
         | there are plenty of laptops with Xeons.
         | 
         | Price is not nice though.
        
         | skibbityboop wrote:
         | Have you finally stopped buying Intel? Current Ryzens are a
         | much better CPU anyhow, just dump Intel and be happy with your
         | ECC and everything else.
        
       | jhoechtl wrote:
       | I definitely do not want Linus Torvalds yelling at me in that
       | tone --- but reading his utterances is certainly entertaining.
        
       | indolering wrote:
       | My favorite example is a bit flip altering election results:
       | 
       | https://www.wnycstudios.org/podcasts/radiolab/articles/bit-f...
        
       | qwerty456127 wrote:
       | ECC should be everywhere. It seems outrageous to me almost no
       | laptops have ECC.
        
       | arendtio wrote:
       | It would be interesting to see how many more kernel oopses
       | appear on machines without ECC compared to those with ECC.
        
       | nostrademons wrote:
       | I still remember Craig Silverstein being asked what his biggest
       | mistake at Google was and him answering "Not pushing for ECC
       | memory."
       | 
       | Google's initial strategy (c. 2000) around this was to save a few
       | bucks on hardware, get non-ECC memory, and then compensate for it
       | in software. It turns out this is a terrible idea, because if you
       | can't count on memory being robust against cosmic rays, you also
       | can't count on the software being stored in that memory being
       | robust against cosmic rays. And when you have thousands of
       | machines with petabytes of RAM, those bitflips do happen. Google
       | wasted many man-years tracking down corrupted GFS files and index
       | shards before they finally bit the bullet and just paid for ECC.
        
         | maria_weber23 wrote:
         | ECC memory can't eliminate the chances of these failures
         | entirely. They can still happen. Making software resilient
         | against bitflips in memory seems very difficult though, since
         | it not only affects data, but also code. So in theory the
         | behavior of software under random bit flips is well... Random.
         | You probably would have to use multiple computers doing the
         | same calculation and then take the answer from the quorum. I
         | could imagine that doing so would still be cheaper than using
         | ECC ram, at least around 2000.
         | 
         | Generally this goes against software engineering principles.
         | You don't try to eliminate the chances of failure and hope for
         | the best. You need to create these failures constantly (within
         | reasonable bounds) and make sure your software is able to
         | handle them. Using ECC RAM is the opposite. You just make
         | it so unlikely to happen that you will generally not
         | encounter these errors at scale anymore, but nonetheless
         | they can still happen, and now you will be completely
         | unprepared to deal with them, since you chose to ignore this
         | class of errors and sweep it under the rug.
         | 
         | Another interesting side effect of quorum is that it also
         | makes certain attacks more difficult to pull off, since now
         | you have to make sure that a quorum of machines gives the
         | same "wrong" answer for an attack to work.
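         | 
         | As a sketch of the quorum idea (the names and structure here
         | are mine, not anything Google actually ran): run the same
         | computation on several machines and only accept a result
         | that a strict majority agrees on.
         | 
         |     from collections import Counter
         | 
         |     def quorum(results):
         |         value, count = Counter(results).most_common(1)[0]
         |         return value if count > len(results) // 2 else None
         | 
         |     # three workers, one of which suffered a bit flip
         |     print(quorum([42, 42, 42 ^ (1 << 7)]))  # -> 42
         |     print(quorum([1, 2, 3]))                # -> None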
        
           | colejohnson66 wrote:
           | > You probably would have to use multiple computers doing the
           | same calculation and then take the answer from the quorum.
           | 
           | The Apollo missions (or was it the Space Shuttle?) did this.
           | They had redundant computers that would work with each other
           | to determine the "true" answer.
        
             | EvanAnderson wrote:
             | The Space Shuttle had redundant computers. The Apollo
             | Guidance Computer was not redundant (though there were two
             | AGCs onboard-- one in the CM and one in the LEM). The
             | aerospace industry has a history of using redundant
             | dissimilar computers (different CPU architectures, multiple
             | implementations of the control software developed by
             | separate teams in different languages, etc) in voting-based
             | architectures to hedge against various failure modes.
        
             | haolez wrote:
             | Sounds similar to smart contracts running on a blockchain
             | :)
        
             | buildbuildbuild wrote:
             | This remains common in aerospace, each voting computer is
             | referred to as a "string".
             | https://space.stackexchange.com/questions/45076/what-is-a-
             | fl...
        
               | sroussey wrote:
               | In aerospace where this is common, you often had multiple
               | implementations, as you wanted to avoid software bugs
               | made by humans. Problem was, different teams often
               | created the same error at the same place, so it wasn't as
               | effective as it would have seemed.
        
           | tomxor wrote:
           | > Making software resilient against bitflips in memory seems
           | very difficult though, since it not only affects data, but
           | also code.
           | 
           | There is an OS that pretty much fits the bill here. There was
           | a show where Andrew Tanenbaum had a laptop running Minix 3
           | hooked up to a button that injected random changes into
           | module code while it was running to demonstrate its
           | resilience to random bugs. Quite fitting that this
           | discussion
           | was initiated by Linus!
           | 
           | Although it was intended to protect against bad software I
           | don't see why it wouldn't also go a long way in protecting
           | the OS against bitflips. Minix 3 uses a microkernel with a
           | "reincarnation server" which means it can automatically
           | reload any misbehaving code not part of the core kernel on
           | the fly (which for Minix is almost everything). This even
           | includes disk drivers. In the case of misbehaving code there
           | is some kind of triple redundancy mechanism much like the
           | "quorum" you suggest, but that is where my crude
           | understanding ends.
        
           | slumdev wrote:
           | Error-correcting code (the "ECC" in ECC) is just a quorum at
           | the bit level.
        
             | sobriquet9 wrote:
             | Modern error correction codes can do much better than that.
        
             | eevilspock wrote:
             | I'm surprised that the other replies don't grasp this.
             | _This_ is the proper level to do the quorum.
             | 
             | Doing quorum at the computer level would require
             | synchronizing parallel computers, and unless that
             | synchronization were to happen for each low level
             | instruction, then it would have to be written into the
             | software to take a vote at critical points. This is going
             | to be greatly detrimental both to throughput and software
             | complexity.
             | 
             | I guess you could implement the quorum at the CPU level...
             | e.g. have redundant cores each with their own memory. But
             | unless there was a need to protect against CPU cores
             | themselves being unreliable, I don't see this making sense
             | either.
             | 
             | At the end of the day, _at some level_, it will always
             | come down to probabilities. "Software engineering
             | principles" will never eliminate that.
        
               | slumdev wrote:
               | I would highly recommend a graduate-level course in
               | computer architecture for anyone who thinks ECC is a
               | 1980s solution to a modern problem.
               | 
               | There are a lot of seemingly high-level problems that are
               | solved (ingeniously) in hardware with very simple, very
               | low-level solutions.
        
               | bollu wrote:
               | Could you please link me to such a course that displays
               | the hardware level solutions? I'm super interested!
        
               | slumdev wrote:
               | https://www.udacity.com/course/high-performance-computer-
               | arc...
        
               | andrewaylett wrote:
               | https://en.wikipedia.org/wiki/NonStop_(server_computers)
               | 
               | My first employer out of Uni had an option for their
               | primary product to use a NonStop for storage -- I think
               | HP funded development, and I'm not sure we ever sold any
               | licenses for it.
        
           | sobriquet9 wrote:
           | If you use multiple computers doing the same calculation and
           | then take the answer from the quorum, how do you ensure the
           | computer that does the comparison is not affected by memory
           | failures? Remember that _all_ queries have to go through
           | it, so it has to be comparable in scale and power.
        
             | rovr138 wrote:
             | > how do you ensure the computer that does the comparison
             | is not affected by memory failures?
             | 
             | You do the comparison on multiple nodes too. Get the
             | calculations. Pass them to multiple nodes, validate again
             | and if it all matches, you use it.
        
               | sobriquet9 wrote:
               | > validate again
               | 
               | Recursion, see recursion.
        
               | Guvante wrote:
               | I mean raft and similar algorithms run multiple
               | verification machines because a single point of failure
               | is a single point of failure.
        
               | wtallis wrote:
               | See also Byzantine fault tolerance: https://scholar.harva
               | rd.edu/files/mickens/files/thesaddestmo...
        
           | hn3333 wrote:
           | Bit flips can happen, but regardless of whether they can
           | be repaired by the ECC code or not, the OS is notified,
           | iirc. It will signal a corruption to the process that is
           | mapped to the faulty address. I suppose that if the memory
           | contains code, the process is killed (if ECC correction
           | failed).
        
             | wtallis wrote:
             | > I suppose that if the memory contains code, the process
             | is killed (if ECC correction failed).
             | 
             | Generally, it would make the most sense to kill the process
             | if the corrupted page is _data_, but if it's code, then
             | maybe re-load that page from the executable file on non-
             | volatile storage. (You might also be able to rescue some
             | data pages from swap space this way.)
        
               | gizmo686 wrote:
               | If you go that route, you should be able to avoid the
               | code/data distinction entirely, as data pages can
               | also be completely backed by files. I believe the
               | kernel already keeps track of which pages are a clean
               | copy of data from the filesystem, so I would think it
               | would be a simple matter of essentially paging out the
               | corrupted data.
               | 
               | What would be interesting is if userspace could mark a
               | region of memory as recomputable. If the kernel is
               | notified of memory corruption there, it triggers a
               | handler in the userspace process to rebuild the data.
               | Granted, given the current state of hardware, I can't
               | imagine that is anywhere near worth the effort to
               | implement.
        
           | AaronFriel wrote:
           | It can't eliminate it but:
           | 
           | 1. Single bitflip correction along with Google's metrics
           | could help them identify algorithms they've got,
           | customers' VMs that are causing bitflips via rowhammer,
           | and machines which have errors regardless of workload
           | 
           | 2. Double bitflip detection lets Google decide if they,
           | say, want to panic at that point and take the machine out
           | of service, and they can report on what software was
           | running or
           | why. Their SREs are world-class and may be able to deduce if
           | this was a fluke (orders of magnitude less likely than a
           | single bit flip), if a workload caused it, or if hardware
           | caused it.
           | 
           | The advantage the 3 major cloud providers have is scale. If a
           | Fortune 500 were running their own datacenters, how likely
           | would it be that they have the same level of visibility into
           | their workloads, the quality of SREs to diagnose, and the
           | sheer statistical power of scale?
           | 
           | I sincerely hope Google is not simply silencing bitflip
           | corrections and detections. That would be a profound waste.
        
             | tjoff wrote:
             | ECC seems like a trivial thing to log and keep track of.
             | Surely any Fortune 500 could do it and would have enough
             | scale to get meaningful data out of it?
        
           | giantrobot wrote:
           | I don't think ECC is going to give anyone a false sense of
           | security. The issue at Google's scale is they had to spend
           | thousands of person-hours implementing in software what they
           | would have gotten for "free" with ECC RAM. Lacking ECC (and
           | generally using consumer-level hardware) compounded scale and
           | reliability problems or at least made them more expensive
           | than they might otherwise have been.
           | 
           | Using consumer hardware and making up reliability with
           | redundancy and software was not a bad idea for early Google
           | but it did end up with an unforeseen cost. Just a thousand
           | machines in a cosmic ray proof bunker will end up with memory
           | errors ECC will correct for free. It's just reducing the
           | surface area of "potential problems".
        
             | Animats wrote:
             | _consumer hardware..._
             | 
             | That's Intel's PR. Only "enterprise hardware", with a
             | bigger markup, supports ECC memory. Adding ECC today should
             | add only 12% to memory cost.
             | 
             | AMD decided to break Intel's pricing model. Good for them.
             | Now if we can get ECC at the retail level...
             | 
             | The original IBM PC AT had parity in memory.
        
         | ksec wrote:
         | >I still remember Craig Silverstein being asked what his
         | biggest mistake at Google was and him answering "Not pushing
         | for ECC memory."
         | 
         | Did they (Google) or he (Craig Silverstein) ever officially
         | admit it on record? I did a Google search and the results
         | that came up were all on HN. Did they at least publish a few
         | PR pieces saying that they are using ECC memory now? I don't
         | see any when searching. Admitting they made a mistake
         | without officially saying it?
         | 
         | I mean, the whole "servers or computers might not need ECC"
         | insanity was started entirely because of Google [1] [2],
         | with news and articles published even in the early 00s [3].
         | After that it spread like wildfire and became a commonly
         | accepted "fact" that even Google doesn't need ECC. Just like
         | "Apple were using custom ARM instructions to achieve their
         | fast JS VM performance" became a "fact". (For the last time,
         | no they didn't.) And proponents of ECC memory have been
         | fighting this misinformation like mad for decades, to the
         | point of giving up and only ranting about it every now and
         | then. [3]
         | 
         | [1] https://blog.codinghorror.com/building-a-computer-the-
         | google...
         | 
         | [2] https://blog.codinghorror.com/to-ecc-or-not-to-ecc/
         | 
         | [3] https://danluu.com/why-ecc/
        
         | tyoma wrote:
         | Figure this is as good of a time as any to ask this:
         | 
         | There are many various DRAMs in a server (say, for disk cache).
         | Has Google or anyone who operates at a similar scale seen
         | single bit errors in these components?
        
           | [deleted]
        
           | gh02t wrote:
           | The supercomputing community has looked at some of the
           | effects on different parts of the GPU.
           | 
           | https://ieeexplore.ieee.org/abstract/document/7056044
        
           | bsder wrote:
           | This is as old as computing and predates Google.
           | 
           | When America Online was buying EV6 servers as fast as DEC
           | could produce them, they used to see about 1 _double_
           | bit error per day across their server farm that would reboot
           | the whole machine.
           | 
           | DRAM has only gotten worse--not better.
        
         | gigatexal wrote:
         | I mean early on sure at a startup where you're not printing
         | money I can see how saving on hardware makes sense. But surely
         | you don't need an MBA to know that hardware will continue to
         | get cheaper whereas developers and their time will only get
         | more expensive: better to let the hardware deal with it than to
         | burden developers with it ... I'd have made the case for ECC
         | but hindsight being what it is ...
        
           | colejohnson66 wrote:
           | But if you can save $1M+ now, then throw the cost of fixing
           | it onto the person who replaces you, why do you care? You
           | already got your bonus and jumped ship.
        
         | starfallg wrote:
         | Recent advances have blurred the lines a bit. The ECC memory
         | that we all know and love is mainly side-band ECC, with the
         | memory bus widened to accommodate the ECC bits driven by the
         | memory controller. However, as process sizes shrink, bit
         | flips become more likely, to the point that many types of
         | memory now have on-die ECC, where the error correction is
         | handled internally on the DRAM chips themselves. This is
         | present on some DDR4 and DDR5 modules, but information on
         | this is kept internal by the DRAM makers and not usually
         | public.
         | 
         | https://semiengineering.com/what-designers-need-to-know-abou...
         | 
         | There has been a lot of debate regarding this that was
         | summarised in this post -
         | 
         | https://blog.codinghorror.com/to-ecc-or-not-to-ecc/
        
       | type0 wrote:
       | Consumer awareness about ECC needs to be better, with recent
       | security implications I simply can't understand why more
       | motherboard manufacturers don't support it on AMD. Intel of
       | course is all to blame on the blue side, I stopped buying their
       | overpriced Xeons because of this.
        
         | rajesh-s wrote:
         | Good point on the need for awareness!
         | 
         | The industry has convinced the average user of consumer
         | hardware that PPA (Power,Performance,Area) is all that needs to
         | get better with generational improvements. Hoping that the
         | concerning aspects of security and reliability that have come
         | to light in the recent past change this.
        
       | aborsy wrote:
       | For the average user, what's the impact of bit flips in memory in
       | practical terms?
       | 
       | I am not talking about servers dealing with critical data.
       | 
       | Suppose that I maintain a repository (documents, audio and
       | video), one copy in a ZFS-ECC system and one in an ext4-nonECC
       | system.
       | 
       | Would I notice a difference between these two copies after 5-10
       | years?
       | 
       | That tells us if ECC matters for most people.
        
         | throwaway9870 wrote:
         | This isn't about disk storage, this is about DRAM. A bit flip
         | in DRAM might corrupt data, but could also cause random crashes
         | and system hangs. That generally matters to everyone.
        
           | [deleted]
        
         | theevilsharpie wrote:
         | > For the average user, what's the impact of bit flips in
         | memory in practical terms?
         | 
         | The most likely impact (other than nothing, if bits are flipped
         | in unused memory) is program crashes or system lock-ups for no
         | apparent reason.
        
       | elgfare wrote:
       | For those out of the loop like me, ECC does indeed stand for
       | error correcting code. https://en.m.wikipedia.org/wiki/ECC_memory
        
       | vlovich123 wrote:
       | A couple of years ago there were advancements that claimed to
       | make Rowhammer work on ECC RAM even with DDR4 [1]. Is that no
       | longer a concern for some reason?
       | 
       | I would think the only guaranteed solutions to Rowhammer are
       | actually cryptographic digests and/or guard pages.
       | 
       | [1] https://www.zdnet.com/article/rowhammer-attacks-can-now-
       | bypa...
        
         | theevilsharpie wrote:
         | ECC isn't a direct mitigation against Rowhammer attacks, as
         | memory errors caused by three or more flipped bits would still
         | go undetected (unless you're using ChipKill, but that's a rare
         | setup).
         | 
         | However, flipping three bits simultaneously isn't trivial,
         | and attempts that flip fewer bits will be detected and
         | logged.
        
           | GregarianChild wrote:
           | Isn't ChipKill just another form of ECC? If so there is a
           | number of bitflips that ChipKill can no longer correct /
           | detect. [1] seems to say that they observed some flips in
           | DRAM with ChipKill, although the paper is a bit vague here.
           | 
           | [1] B. Schroeder et al, _DRAM Errors in the Wild: A Large-
           | Scale Field Study_
           | http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
        
           | rajesh-s wrote:
           | Right! Section 1.3 of this publication discusses possible
           | mitigations for the row hammer problem and where ECC fits in
           | 
           | https://users.ece.cmu.edu/~omutlu/pub/rowhammer-summary.pdf
        
             | GregarianChild wrote:
             | The paper you cite is from 2014 and the mitigations
             | discussed there have all been circumvented. [1] is from
             | 2020 and a better read for Rowhammer mitigation.
             | 
             | [1] J. S. Kim et al, _Revisiting RowHammer: An Experimental
             | Analysis of Modern DRAM Devices and Mitigation Techniques_
             | https://arxiv.org/abs/2005.13121
        
               | rajesh-s wrote:
               | Thanks for pointing that out!
        
       | simias wrote:
       | I used to be pretty skeptical of ECC for consumer-grade hardware,
       | mainly because I felt that I'd always prefer cheaper/more RAM
       | over ECC RAM even if it meant that I'd get a couple of
       | crashes every year due to rogue bitflips. For servers it's a
       | different story, but for a desktop I'm fine dealing with some
       | instability for better performance.
       | 
       | But these days with the RAM density being so high and bitflipping
       | attacks being more than a theoretical threat it seems like
       | there's really no good reason not to switch to ECC everywhere.
        
         | ekianjo wrote:
         | > no good reason not to switch to ECC everywhere.
         | 
         | Not all CPUs support ECC however.
        
           | josefx wrote:
           | Just Intel fucking over security by making ECC a non feature
           | on consumer grade hardware - wouldn't be surprised if it was
           | just a single bit flipped in a feature mask.
        
             | jjeaff wrote:
             | Well, with as common as a bunch of people in this thread
             | seem to think bit flips are, it should just be a matter of
             | time until that bit gets flipped on your cpu and activates
             | the ecc feature.
        
               | josefx wrote:
               | That bit probably is either burned in or stored with the
               | firmware in something more permanent than RAM. Modern RAM
               | has the issue that it is optimized for capacity and speed
               | to a point where state changes can leak into nearby bits.
        
           | loeg wrote:
           | (Intel)
        
         | tokamak-teapot wrote:
         | Are there any Ryzen boards that support ECC and _actually
         | correct errors_?
        
           | gruez wrote:
           | quick search:
           | 
           | https://rog.asus.com/forum/showthread.php?112750-List-
           | Asus-M...
        
             | bcrl wrote:
             | Most Ryzen ASRock boards support ECC as well. I'm happily
             | using one right now.
        
               | loeg wrote:
               | > Most
               | 
               | Circa Zen1 launch, ASRock claimed _all_ of their consumer
               | boards would support ECC.
        
           | [deleted]
        
           | fulafel wrote:
           | The functionality seems to all be in the memory controller
           | integrated into the CPU.
        
           | loeg wrote:
           | Yes. E.g., all ASRock boards.
        
       | freeqaz wrote:
       | I bought ECC RAM for my laptop and it definitely was about 4x the
       | price. It's valuable to me for a few reasons -- peace of mind
       | being a big one.
       | 
       | Bit flips happen and are real. I really wish ECC was plentiful
       | and not brutally expensive!
        
         | washadjeffmad wrote:
         | For the price, it made more sense for me to buy an R630 and
         | populate it with a few less expensive, higher-capacity ECC
         | RDIMMs. I don't really need ECC as a local feature, so this
         | lets me use the mobile machine I want.
        
         | temac wrote:
         | Note that the price is mostly due to market segmentation, in
         | your case _most_ of it from the laptop vendor (some from Intel
         | too, but not _that_ much compared to the laptop vendor).
         | 
         | Xeons with ECC are not that overpriced compared with similar
         | Cores without it. Likewise, RAM sticks with ECC are cheap to
         | produce (basically just one more chip to populate per side per
         | module), and soldered RAM would simply need maybe $10 or $20 of
         | extra chips.
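         | 
         | For a typical unbuffered DIMM built from x8 chips, that "one
         | more chip" works out to the often-quoted 12.5% overhead
         | (standard widths assumed, not any specific vendor's layout):
         | 
         |     \frac{72\ \text{bits}\ (64\ \text{data} + 8\ \text{ECC})}
         |          {64\ \text{data bits}}
         |       = \frac{9\ \text{chips}}{8\ \text{chips}}
         |       = 1.125 \;\Rightarrow\; 12.5\%\ \text{extra chips}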
        
         | bitcharmer wrote:
         | This is the first time I've heard of a laptop that supports ECC
         | memory. Could you please share the make and model?
        
           | bluedino wrote:
           | Lenovo (P series) and HP workstation models also support ECC
        
           | xxs wrote:
           | Lenovo has Xeon laptops[0], and technically Intel used to
           | support ECC on i3 (and celeron, etc.)
           | 
           | 0: https://www.lenovo.com/us/en/laptops/thinkpad/thinkpad-p/T
           | hi...
        
           | lb1lf wrote:
           | -My boss has a Xeon Dell - a 7550, methinks - luggable.
           | 
           | It is filled to the gunwales with ECC RAM.
           | 
           | Cost him the equivalent of $7k or so. Eeek.
        
           | dijit wrote:
           | I have a Dell Precision 5520 (chassis of an XPS 15) which has
           | a Xeon and ECC memory.
           | 
           | Finding a memory upgrade seems difficult though.
        
             | markonen wrote:
             | I was looking at getting the Xeon-based NUC recently and
             | one of the reasons I decided against it was that ECC SO-
             | DIMMs seem to be a really marginal product. If you want
             | ECC, something that takes full-size DIMMs seems _much
             | easier_ to buy memory for.
        
         | jjeaff wrote:
         | You should be able to check logs for corrected errors, right?
         | 
         | I'm guessing you won't find any.
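         | 
         | On Linux, assuming the kernel has an EDAC driver loaded for the
         | memory controller, corrected/uncorrected counts show up under
         | /sys/devices/system/edac/. A minimal sketch for reading them:
         | 
         |     # Reads the standard EDAC sysfs counters; prints nothing if
         |     # no EDAC memory controller is registered on this machine.
         |     from pathlib import Path
         | 
         |     edac = Path("/sys/devices/system/edac/mc")
         |     for mc in sorted(edac.glob("mc*")):
         |         ce = (mc / "ce_count").read_text().strip()  # corrected
         |         ue = (mc / "ue_count").read_text().strip()  # uncorrected
         |         print(f"{mc.name}: corrected={ce} uncorrected={ue}")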
        
       | londons_explore wrote:
       | I simply care that my computer executes code perfectly. Let's
       | settle on "one instance of unintended behaviour per hundred
       | years" for that metric.
       | 
       | If it needs ECC memory to do that, then fit it with ECC memory.
       | If there are other ways to achieve that (for example deeper dram
       | cells to be more robust to cosmic rays) that's fine too.
       | 
       | Just meet the reliability spec - I don't care how.
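       | 
       | For scale, a back-of-the-envelope conversion of that spec into
       | FIT (failures per billion device-hours, the unit the DRAM field
       | study cited earlier in the thread reports its rates in):
       | 
       |     # "One unintended behaviour per hundred years" as a FIT rate
       |     # for the system as a whole (rough arithmetic only).
       |     hours_per_century = 100 * 8766      # ~8,766 hours per year
       |     fit = 1 / hours_per_century * 1e9
       |     print(round(fit))                   # ~1141 FIT budget total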
        
         | simias wrote:
         | Then you'll have to pay a huge premium for that privilege. I can
         | assure you that your standard computer components are not rated
         | for century-scale use.
         | 
         | That's why I've always been on the fence with this ECC thing.
         | For servers it's vital because you need stability and security.
         | 
         | For desktops I think that for a long time it was fine without
         | ECC. If I have to choose between having, say, 30% more RAM or
         | avoiding a potential crash once a year, I'll probably take the
         | additional RAM.
         | 
         | The problem is that now these problems can be exploited by
         | malicious code instead of merely happening because of
         | cosmic rays. That's the main argument in favour of ECC IMO, the
         | rest is just a tradeoff to consider.
        
           | ClumsyPilot wrote:
           | But it isn't just a crash, it's also silent data corruption
           | that will never be detected
        
             | dev_tty01 wrote:
             | This. How many user documents have had bit-flip errors
             | introduced that were never detected? Impossible to say, but
             | it is not a small number given the world-wide use of DRAM.
             | Most are in trivial and unimportant documents, but some
             | aren't...
        
             | simias wrote:
             | It can be a concern, that's true, but personally most of
             | the stuff I edit ends up checked into a git repository or
             | something similar.
             | 
             | And I mean, we all spend all day editing text messages and
             | comments and files on non-ECC hardware, yet bitflip-induced
             | corruption is rare enough that I can't say that I've
             | witnessed a single instance of it in my life, despite
             | spending a good chunk of it looking at screens.
             | 
             | It's just not a problem that occurs in practice in my
             | experience. If you're compiling the release build of a
             | critical piece of software, you probably want ECC. If
             | you're building the dev version of your webapp or writing
             | an email to your boss, you'll probably survive without it.
        
               | ClumsyPilot wrote:
               | Can you make that statement with any certainty? My personal
               | and family computers have crashed quite a few times, and
               | have corrupted photos and files, some of them valuable
               | (taxes, healthcare, etc.; personal computers hold valuable
               | data these days).
               | 
               | I couldn't tell, as a user, which of those corruptions and
               | crashes were caused by bitflips. Could you?
        
           | loup-vaillant wrote:
           | > _I can assure you that your standard computer components
           | are not rated for century-scale use._
           | 
           | And that's probably not what GP asked for. There's a
           | difference between guaranteeing an error rate of 1 error per
           | century of use on average, and guaranteeing it over the
           | course of an _actual century_. It might be okay to guarantee
           | that error rate for only 5 years of uninterrupted use, and
           | degrade after that. For instance:
           | 
           |     Years  1-5 :  1 error  per century.
           |     Years  6-10:  3 errors per century.
           |     Years 10-15: 10 errors per century.
           |     Years 15-20: 20 errors per century.
           |     Years 20-30:  1 error  per *year*.
           |     Years 30+  :  the chip is broken.
           | 
           | Now, given how energy hungry and polluting the whole computer
           | industry actually is, it might be a good idea to shoot for
           | extreme durability and reliability anyway. Say, sustain 1
           | error per century, over the course of _fifty years_. It will
           | be slower and more expensive, but at least it won't burn the
           | planet as fast as our current electronics.
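           | 
           | As a rough worked example of what that illustrative schedule
           | adds up to (rates taken straight from the table above):
           | 
           |     # Expected errors over the first 30 years under the tiered
           |     # schedule; each tier is (years, errors per century).
           |     schedule = [(5, 1), (5, 3), (5, 10), (5, 20), (10, 100)]
           |     expected = sum(y * rate / 100 for y, rate in schedule)
           |     print(round(expected, 1))   # 11.7 errors over 30 years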
        
         | temac wrote:
         | In "theory" it needs ECC because you must also protect the link
         | between the CPU and the RAM. So with ECC fully in DRAM but no
         | protection on the bus, you risk some errors during the
         | transfer. However, maybe these kinds of errors are rare enough
         | that you would have fewer than one per century. It probably
         | depends on the motherboard design and fabrication quality
         | though, and the environment where it is used.
        
       | z3t4 wrote:
       | Memory modules often come with lifetime guarantees. If they had
       | ECC it would be much easier to detect bad memory...
        
       | jkuria wrote:
       | For those, like me, wondering what ECC is, here's an explanation:
       | 
       | https://www.tomshardware.com/reviews/ecc-memory-ram-glossary...
        
       ___________________________________________________________________
       (page generated 2021-01-03 23:00 UTC)