[HN Gopher] What Flips Your Bit: Cosmic Ray Errors at Mozilla
___________________________________________________________________
What Flips Your Bit: Cosmic Ray Errors at Mozilla
Author : dannyobrien
Score : 146 points
Date : 2022-04-13 15:42 UTC (7 hours ago)
(HTM) web link (blog.mozilla.org)
(TXT) w3m dump (blog.mozilla.org)

| [deleted]

| spullara wrote:
| I bit squatted cloudfront.net years ago and got many, many
| requests. Most of them were for *.js, which would, if I were
| malicious, have allowed me to do just about anything. It was
| interesting to see that the errors definitely happened in
| different places. For instance, sometimes the Host header was the
| original domain and sometimes it matched my domain.

| robotsteve2 wrote:
| Any sort of hardware or software error seems much more likely.
| Computers are incredibly complex and approximations are used
| everywhere (in the design of the hardware, in the theory of
| operation). I don't think inference-based experiments or analysis
| on cosmic ray bit flips are appropriate.
|
| You really need some kind of dedicated cosmic ray detector nearby
| as a control. If the flux of cosmic rays into the detector is
| orders of magnitude lower than the rate of bit errors you ascribe
| to cosmic rays, it's probably some hardware/software issue and
| not the cosmic rays.

| AshamedCaptain wrote:
| I believe people use "cosmic rays" as a catch-all phrase for all
| these very low probability error causes (just because of the
| coolness of cosmic rays), but in practice _any_ other cause is
| much more common than cosmic rays.
|
| Even at the processor level every single transistor on it has a
| rated mean time between failures a.k.a. MTBF. Sure it may be
| astronomical, but you do have a lot of transistors, so in
| practice a random bitflip is not such a rare event. Designers
| actually explore MTBF vs power usage trade-offs here, and there
| is even a fascinating area of "fault resilient computing"
| research.
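As a rough illustration of the MTBF arithmetic in the comment above: if N elements fail independently, each at a constant rate with the same MTBF, their failure rates add, and the combined MTBF is the per-element MTBF divided by N. The sketch below assumes exactly that simplified model. The function names and all numbers are hypothetical, chosen only to show the scaling; they are not taken from any datasheet or from the thread.

```python
def combined_mtbf_years(per_element_mtbf_years: float, n_elements: int) -> float:
    """Combined MTBF of n independent, identical elements.

    Assumes constant (exponential) failure rates, so the rates simply add:
    combined_rate = n / per_element_mtbf.
    """
    if per_element_mtbf_years <= 0 or n_elements <= 0:
        raise ValueError("MTBF and element count must be positive")
    return per_element_mtbf_years / n_elements


def expected_failures(per_element_mtbf_years: float, n_elements: int,
                      years: float) -> float:
    """Expected number of failures over `years` under the same model."""
    return years / combined_mtbf_years(per_element_mtbf_years, n_elements)


if __name__ == "__main__":
    # Hypothetical figure: a "billions of years" per-element MTBF.
    per_element = 1e9  # years
    for n in (1_000, 1_000_000, 1_000_000_000):
        print(f"{n:>13,} elements -> combined MTBF "
              f"{combined_mtbf_years(per_element, n):,.3f} years")
```

With these made-up inputs, a billion elements bring the combined MTBF down to about one year, which is the point being made: a very large per-element MTBF does not survive division by a very large element count.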
|
| Every single clock domain crossing has another MTBF (google
| metastability). Again they are very high (billions of years if
| done properly), but you will have plenty of such crossings (and
| the number keeps growing with modern, more asynchronous
| design).
|
| Processors are quite unreliable things.

| gnufx wrote:
| Yes, but what you'd want to do is look for coincidences between
| a detector for a cosmic ray shower around (above?) the
| electronics you're monitoring and whatever it is these days
| that instruments ECC events. The time resolution would be
| pathetic for a nuclear physics experiment, but probably good
| enough.
|
| If you look at the ambient gamma-ray spectrum in a
| semiconductor detector (which would be germanium rather than
| silicon) the main background you see is typically from
| concrete; I'm ashamed to say I've forgotten the energy from
| K-40, but in the region of 1500 keV. (Ironically, large
| concrete blocks used for shielding would be regarded as a
| significant radiation hazard if all the activity in them was
| concentrated.)

| jldugger wrote:
| Indeed, there was a study in IEEE pointing out the absurdity of
| cosmic rays as causes -- one point cited was that the vast
| majority of bitflips happen at specific points in the address
| space, page boundaries between chips essentially.

| bqmjjx0kac wrote:
| I'm curious why that is evidence against the cosmic ray
| explanation.
|
| Couldn't it have something to do with the physical layout of
| memory? Perhaps those page-boundary-adjacent addresses
| present a larger physical target, perhaps on the bus.
|
| Of course I am wildly speculating right now. I'd love to see
| the article if you have a link!

| 323 wrote:
| Modern devices have tiny features which are extremely
| fragile to any sort of interference, which is much more
| abundant than cosmic rays.
|
| See the row-hammer attack where you can flip an unrelated
| bit just by repeated reads/writes to adjacent memory rows
| from software!!!

| cozzyd wrote:
| I'd be very interested in reading that article if you have a
| link (or title, or doi...)

| Avlin67 wrote:
| What about overclocking? Does it cause bit flips? Especially
| low-grade DDR4 pushed to its limit...

| anonymousiam wrote:
| This is why I always buy ECC/EDAC capable servers. SEUs are a
| real thing.

| ThePhysicist wrote:
| One of the first things you'll learn when studying experimental
| physics is how to come up with all kinds of alternative
| mechanisms that might explain the result you've observed in your
| experiment, and then think of ways to test that the results
| weren't actually caused by those unwanted mechanisms. Most
| Nobel-prize winning physics experiments were carefully designed
| to compensate for any relevant secondary effects, and I would
| even go as far as saying that this is often the largest challenge
| when doing high-precision experiments.
|
| So the first question I'd ask myself when thinking about
| cosmic-ray induced errors is how I would ensure that the bit
| flips are not caused by e.g. problems on the hard drives or the
| NAND array (which are probably much more likely to occur than
| cosmic ray events, at least on the surface of the earth).

| mherdeg wrote:
| Yeah that was one of the gotchas in the story at
| https://blogs.oracle.com/linux/post/attack-of-the-cosmic-ray...
| -- MAYBE it was a bit flip due to a cosmic ray, or MAYBE it was
| a bit flip due to another layer of the system that makes RAM
| chips store and retrieve data.
|
| I like the idea of a physicist who thinks about this and says -
| "well, why should we shrug and say 'maybe it was a cosmic ray?'
| Surely we can test this! Let's put the computer in a lead-lined
| enclosure and benchmark the memory failure rate and see if it
| changes", or whatever.
|
| That's a great extension of the classic computer-hacker view
| that "of course we can understand why this bug happened, we
| don't have to shrug and say it segfaults until we restart
| sometimes, we can just dig some more." How far can you go?

| grog454 wrote:
| On the subject of bit flips, I am able to detect these in the
| client to server UDP packets in my game. With specific logging
| enabled I would see an error about once per minute while
| receiving about 15,000 of one type of packet per second. I was
| able to estimate about 1/1,000,000 packets contained a single
| flipped bit.

| dextercd wrote:
| The '1 error for every 256MB memory a month' sounds like way too
| much to me.
|
| A program I wrote launches every time I start my computer. It
| allocates some memory and scans it periodically for unexpected
| changes. After an equivalent of 15.8 256MB/months no anomalies
| have been found yet.
|
| Would really like to see more authoritative figures for modern
| consumer hardware.

| axg11 wrote:
| This is fascinating and hints at a future possible scientific
| study: using phones across the globe to map cosmic ray events.
| I'm not a physicist so I can't speak for the value of such data.
| If cosmic ray events do not occur uniformly across the globe then
| mapping events from 100,000s of phones could give interesting
| insights.

| antognini wrote:
| As they say, there's an app for that:
|
| https://cosmicrayapp.com/
|
| Basically it monitors your camera for the streaks produced by
| cosmic rays. You can see the real time stream of events here:
|
| https://cosmicrayobserver.com/#0.4/0/0

| arc-in-space wrote:
| Wait, this works? That's amazing. I suppose you could do the
| same with any camera sensor?

| li2uR3ce wrote:
| Looks like the rays only hit populated areas.
| https://xkcd.com/1138/

| seanw444 wrote:
| A fascinating revelation indeed.

| axg11 wrote:
| This is amazing - thank you for sharing!

| trollied wrote:
| I know HN has a decent Factorio fanbase.
| Factorio properly stresses PC hardware, and borderline memory is
| usually ok for a casual gamer until you start a Factorio
| megabase. A decent example is Warger who does speedruns:
| https://forums.factorio.com/viewtopic.php?f=7&t=100646
| https://www.speedrun.com/factorio#100
| Those that have played the game - speedruns are amazing to
| watch, if you haven't already.

| geophile wrote:
| As the article points out, using collected client data is
| problematic, because some errors will often be undetectable, as
| in numeric data. And in general, you would have to control for
| bit flips somehow caused by software.
|
| I wonder whether a SETI approach would be useful here. Allocate,
| say, 1MB of memory. Fill it with some known bit pattern.
| Periodically check the memory and look for discrepancies. Do this
| once an hour, on 10M devices, and that is a LOT of monitoring.
| Report discrepancies along with time, location (including
| elevation), hardware and OS information.
|
| I would think that this approach would provide a lot of
| interesting information about when and where bit flips occur,
| especially when matched against information on solar and
| atmospheric events (as in the article). Perhaps sensitive
| hardware and OS environments would be detected. Even completely
| negative results would be interesting: no bit flips observed
| would suggest that purported bit flips elsewhere might have other
| explanations.

| simne wrote:
| A few years ago I heard that the US gov't created an app which
| periodically checked for random light flashes on a smartphone's
| camera sensor and sent the info to the cloud, to use phones as a
| large distributed network for detecting illegal nukes. (btw,
| Soviets and Russians really love to collect such weird facts).
|
| The idea looks very realistic, except that it would drain the
| battery and could also be used as a surveillance tool, so I have
| not seen a real app.
|
| Strictly speaking, any digital camera sensor is an excellent
| tool for such things, much better than RAM; you only need to
| cover the lens with something opaque to light but transparent to
| particles (any thin plastic fits) and, of course, run a
| monitoring program and store the logs in the cloud.
|
| And in real life, you can buy a Geiger counter shield for
| Arduino on Ali, and one pal of mine even exposed such a sensor to
| the internet.
|
| - Better to use two such sensors and a simple logic circuit, so
| that they register not all events but only those where both
| sensors detect something simultaneously; that way you see the
| direction the cosmic particle came from.
|
| - This method of many sensors with logic and logs was used in
| the research that detected hidden rooms in the Great Pyramids of
| Giza (real cosmic rays are so powerful that they can be detected
| with simple equipment at depths of up to a hundred meters, so
| ordinary concrete is like cardboard to them). And I even saw a
| post from one guy who installed such machinery at home (he used 8
| or 16 counters, I forgot the details), and after a few months the
| logs clearly showed where the window in his room is :)

| EscargotCult wrote:
| This could be crowdsourced. Imagine some sort of reporting
| network of volunteers, running the simple program (memory
| allocation and periodic checks) on any hardware they're loaning
| time on, and submitting their location and altitude as well.

| tconfrey wrote:
| I worked at Sonus Networks (now Ribbon[0]) in the early 2000's
| building VoIP solutions for telcos. We had a bunch of unexplained
| errors in a new installation in Denver. After much head
| scratching the engineers on the problem concluded that the higher
| altitude significantly increased the likelihood of impact by
| alpha particles and that that was the cause of the problem!
|
| (IIRC we increased the shielding on the devices.)
|
| [0] https://ribboncommunications.com/

| perihelions wrote:
| Minor clarification: alpha radiation doesn't come from cosmic
| rays, rather from U/Th contamination in the circuit materials.
| The altitude dependent component you'd be seeing is rather
| muons and neutrons.

| cozzyd wrote:
| you can get alphas from spallation, but you wouldn't really
| call that alpha radiation.

| jaytaylor wrote:
| Maybe it's a stupid question, but did the additional shielding
| completely resolve the discrepancy?

| li2uR3ce wrote:
| > In almost every case we cannot find any plausible explanation
| or bug
|
| Observe the natural state of every software developer. I kid...
| or do I?
|
| > What if it wasn't just some fantastical explanation?
|
| Doesn't sound nearly as fantastical but bad RAM is probably more
| common than one would expect. You seldom really know the quality
| of hardware you run on. Just say'n, sometimes you don't need a
| helping cosmic ray.

| IncRnd wrote:
| "Bitsquatting is a form of cybersquatting which relies on
| bit-flip errors that occur during the process of making a DNS
| request. These bit-flips may occur due to factors such as faulty
| hardware or cosmic rays. When such an error occurs, the user
| requesting the domain may be directed to a website registered
| under a domain name similar to a legitimate domain, except with
| one bit flipped in their respective binary representations.
|
| "A 2011 Black Hat paper detailed an analysis where eight
| legitimate domains were targeted with thirty one bitsquat
| domains. Over the course of one day, 3,434 requests were made to
| bitsquat domains." [1]
|
| Cisco presented a paper on bitsquatting at DEF CON, "Examining
| the Bitsquatting Attack Surface". From the paper, "The conclusion
| is that the possibility of bitsquat attacks is more widespread
| than originally thought, but several techniques exist for
| mitigating the effects of these new attacks."
| [2]
|
| [1] https://en.wikipedia.org/wiki/Bitsquatting
|
| [2] https://media.defcon.org/DEF%20CON%2021/DEF%20CON%2021%20pre...

| chadwittman wrote:
| Amazing comment, this is wild. Thank you for sharing this!

| anonymousiam wrote:
| I had the pleasure of working with the author for a brief time,
| and I attended his presentation. Great stuff. What I found
| particularly interesting is some later work that characterized
| the probability of error based upon the device type, and the
| ambient temperature (based on IP Geo-location).

| pitaj wrote:
| Is one mitigation TLS certificate verification?

| legalcorrection wrote:
| Depends on where in the stack the error happened.

| simulate-me wrote:
| It depends on whether or not the client specifically
| requested, or is expecting, traffic over HTTPS. But yes, if
| the user's client requests encrypted traffic, the attacker
| will not be able to produce a valid certificate. This attack
| isn't that different than a MITM.

| tedunangst wrote:
| Nothing prevents an attacker from getting a cert for
| snytimg.com or oslashdot.org.

| legalcorrection wrote:
| That only helps the attacker if the error happened before
| reaching the DNS-specific path. If the error happens
| inside the DNS path, then the browser is still expecting
| to get a certificate for the correct website.

| xenophonf wrote:

| rat9988 wrote:
| > Why not read the linked paper,
|
| You answered your own question. The answer is on page 12,
| which means there is too much information. He is not
| interested in the whole topic, just in this question. So
| he asks, maybe someone is charitable enough to answer.
| Nothing wrong with it.

| xenophonf wrote:
| How could I possibly answer their question better than
| the experts who wrote the paper?

| rat9988 wrote:
| Maybe you can't but someone else can. The question is
| open to anyone who can and wants to answer.

| nixpulvis wrote:
| Skim until you are in the right section?
|
| Literally titled: "Section II - Mitigation of
| bitsquatting attacks"

| tedunangst wrote:

| dahfizz wrote:
| How does a cosmic bit flip make it past the Ethernet CRC?

| incomingpain wrote:
| I had the opportunity to design my SOC from scratch. Mostly
| ripping off Berkeley's public design.
|
| Something I have documented in the last 2 years. Solar flare
| activity is what causes problems. All memory is ECC but it still
| happens.
|
| Faraday cage incoming?
|
| Wait? Faraday cage racks million $ idea?

| jeffreygoesto wrote:
| Using an FD-SOI process can help reduce soft errors.

| legalcorrection wrote:
| I suspect without great evidence that cosmic ray bitflips are
| mostly a scapegoat for imperfect hardware and are in fact one or
| two orders of magnitude less common than popular wisdom would
| suggest.

| zepearl wrote:
| I don't know folks.
|
| 2 years ago I took a laptop which I wasn't using (16 GiB RAM
| non-ECC) => I created in Linux with Python an array ("bytes"?
| Don't remember exactly anymore) of ~10 or 12 GiB containing
| random integers => computed the array's hash and saved it.
|
| Then for ~1-2 months I recomputed from time to time the hash of
| that array (in between the laptop was in suspend-to-RAM) and
| compared it to the original result => it always matched, I never
| had any bitflips.
|
| I therefore doubt that the estimation of "1/256MB/month" is
| correct - I could not prove that, at least not with my laptop.

| deckard1 wrote:
| I've always been a bit skeptical of published numbers. I
| usually just chalk it up to vastly different operating
| conditions and scale.
|
| On my home server w/ ECC you can check the corrected and
| uncorrected (multibit) errors. Assuming my Ryzen is correctly
| reporting them to Linux, I have 0 errors corrected and 0
| uncorrected with an 80 day uptime. I've checked a few other
| times and never seen an error. Others with ECC often report the
| same.
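For readers who want to check the counters described in the comment above, here is a minimal sketch that reads the Linux EDAC error counts from sysfs. It assumes the EDAC driver for the memory controller is loaded and that each controller directory under /sys/devices/system/edac/mc exposes `ce_count` (corrected errors) and `ue_count` (uncorrected errors). The function name `read_edac_counts` is made up for illustration, and a zero only means the kernel has reported no errors, not that none occurred.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} from EDAC sysfs.

    Returns an empty dict when the directory is missing, for example when
    no EDAC driver is loaded or the machine has no ECC memory.
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    counts = {}
    for mc in sorted(base.glob("mc[0-9]*")):
        ce_file, ue_file = mc / "ce_count", mc / "ue_count"
        if not (ce_file.is_file() and ue_file.is_file()):
            continue  # skip controllers that do not expose both counters
        counts[mc.name] = (
            int(ce_file.read_text().strip()),
            int(ue_file.read_text().strip()),
        )
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Taking the directory as a parameter keeps the function usable against a copy of the sysfs tree, which also makes it easy to try out on a machine without ECC.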
|
| My understanding of modern RAM is that it has checks built in
| to the modules which are somewhat equivalent to ECC already
| (the correcting part, not the reporting part). Which is a
| necessity in order to hit the density we are at today.

| cozzyd wrote:
| A server with 64 GB of ECC ram sitting at an altitude of 3.2 km
| on the Greenland ice sheet is reporting... 0 bit errors
| (whether correctable or uncorrectable) in the 244 days it's
| been up.
|
| A server with 16 GB of ECC ram at an altitude of 3.8 km in
| California is reporting.... 0 bit errors in the 146 days it's
| been up.
|
| Maybe I shouldn't believe what /sys/devices/system/edac/mc is
| reporting? These are EL8 systems...

| tclancy wrote:
| >I therefore doubt that the estimation of "1/256MB/month" is
| correct
|
| As someone who did incredibly poorly in high school physics,
| this line in the article bothered me as well: the study is from
| the 1990s when the density of memory would have been much
| lower. I would think the percentage per megabyte has dropped
| significantly in 30 or so years. It also assumes a constant
| form factor for the memory, doesn't it?

| nomel wrote:
| > I therefore doubt that the estimation of "1/256MB/month" is
| correct
|
| The probability is related to the physical volume the memory
| takes, since it's caused by a physical particle going through
| that volume. So, this rate will continuously drop as memory
| density increases.
___________________________________________________________________
(page generated 2022-04-13 23:00 UTC)