[HN Gopher] What Flips Your Bit: Cosmic Ray Errors at Mozilla
       ___________________________________________________________________
        
       What Flips Your Bit: Cosmic Ray Errors at Mozilla
        
       Author : dannyobrien
       Score  : 146 points
       Date   : 2022-04-13 15:42 UTC (7 hours ago)
        
 (HTM) web link (blog.mozilla.org)
 (TXT) w3m dump (blog.mozilla.org)
        
       | spullara wrote:
       | I bit squatted cloudfront.net years ago and got many, many
       | requests. Most of them *.js which would, if I were malicious,
       | have allowed me to do just about anything. It was interesting to
       | see that the errors definitely happened in different places. For
       | instance, sometimes the Host header was the original domain and
       | sometimes it matched my domain.
        
       | robotsteve2 wrote:
       | Any sort of hardware or software error seems much more likely.
       | Computers are incredibly complex and approximations are used
       | everywhere (in the design of the hardware, in the theory of
       | operation). I don't think inference-based experiments or analysis
       | on cosmic ray bit flips are appropriate.
       | 
       | You really need some kind of dedicated cosmic ray detector nearby
       | as a control. If the flux of cosmic rays into the detector is
       | orders of magnitude lower than the rate of bit errors you ascribe
       | to cosmic rays, it's probably some hardware/software issue and
       | not the cosmic rays.
        
         | AshamedCaptain wrote:
         | I believe people use "cosmic rays" as a catch-all phrase for all
         | these very low probability error causes (just because of the
         | coolness of cosmic rays), but in practice _any_ other cause is
         | much more common than cosmic rays.
         | 
         | Even at the processor level every single transistor on it has a
         | rated mean time between failures a.k.a. MTBF. Sure it may be
         | astronomical, but you do have a lot of transistors, so in
         | practice a random bitflip is not such a rare event. Designers
         | actually explore MTBF vs power usage trade-offs here, and there
         | is even a fascinating area of "fault resilient computing"
         | research.
         | 
         | Every single clock domain crossing has another MTBF (google
         | metastability). Again they are very high (billions of years if
         | done properly), but you will have plenty of such crossings (and
         | the number keeps growing with modern, more asynchronous
         | design).
         | 
         | Processors are quite unreliable things.
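A rough sketch of that scaling argument, assuming independent elements with constant failure rates (so the rates simply add). The per-crossing MTBF, crossing count, and fleet size below are hypothetical numbers picked only to show the effect, not figures from any datasheet.

```python
def system_mtbf_years(element_mtbf_years: float, n_elements: int) -> float:
    """Aggregate MTBF of n independent elements, each with a constant
    failure rate of 1/element_mtbf_years (failure rates simply add)."""
    if n_elements <= 0:
        raise ValueError("n_elements must be positive")
    return element_mtbf_years / n_elements

# Hypothetical numbers, chosen only to show the scaling:
per_crossing_years = 1e9   # "billions of years" for one clock domain crossing
crossings = 10_000         # assumed number of crossings on one chip
machines = 1_000_000       # assumed fleet size

per_chip = system_mtbf_years(per_crossing_years, crossings)
per_fleet = system_mtbf_years(per_crossing_years, crossings * machines)
print(f"one chip:  one event every {per_chip:,.0f} years")
print(f"the fleet: one event every {per_fleet * 365.25 * 24:,.1f} hours")
```

An astronomical per-element figure shrinks quickly once it is divided by the number of elements and the number of machines running them.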
        
         | gnufx wrote:
         | Yes, but what you'd want to do is look for coincidences between
         | a detector for a cosmic ray shower around (above?) the
         | electronics you're monitoring with whatever it is these days
         | that instruments ECC events. The time resolution would be
         | pathetic for a nuclear physics experiment, but probably good
         | enough.
         | 
         | If you look at the ambient gamma-ray spectrum in a
         | semiconductor detector (which would be germanium rather than
         | silicon) the main background you see is typically from
         | concrete; I'm ashamed to say I've forgotten the energy from
         | K-40, but in the region of 1500 keV. (Ironically, large
         | concrete blocks used for shielding would be regarded as a
         | significant radiation hazard if all the activity in them was
         | concentrated.)
        
         | jldugger wrote:
         | Indeed, there was a study in IEEE pointing out the absurdity of
         | cosmic rays as causes -- one point cited was that the vast
         | majority of bitflips happen at specific points in the address
         | space, page boundaries between chips essentially.
        
           | bqmjjx0kac wrote:
           | I'm curious why that is evidence against the cosmic ray
           | explanation.
           | 
           | Couldn't it have something to do with the physical layout of
           | memory? Perhaps those page-boundary-adjacent addresses
           | present a larger physical target, perhaps on the bus.
           | 
           | Of course I am wildly speculating right now. I'd love to see
           | the article if you have a link!
        
             | 323 wrote:
             | Modern devices have tiny features that are extremely
             | sensitive to all sorts of interference, most of which is
             | far more abundant than cosmic rays.
             | 
             | See the row-hammer attack where you can flip an unrelated
             | bit just by read/writes to adjacent bits from software!!!
        
           | cozzyd wrote:
           | I'd be very interested in reading that article if you have a
           | link (or title, or doi...)
        
       | Avlin67 wrote:
       | What about overclocking? Does it cause bit flips, especially with
       | low-grade DDR4 pushed to its limit?
        
       | anonymousiam wrote:
       | This is why I always buy ECC/EDAC capable servers. SEUs are a
       | real thing.
        
       | ThePhysicist wrote:
       | One of the first things you'll learn when studying experimental
       | physics is how to come up with all kinds of alternative
       | mechanisms that might explain the result you've observed in your
       | experiment, and then think of ways to test that the results
       | weren't actually caused by those unwanted mechanisms. Most Nobel-
       | prize winning physics experiments were carefully designed to
       | compensate for any relevant secondary effects, and I would even
       | go as far as saying that this is often the largest challenge when
       | doing high-precision experiments.
       | 
       | So the first question I'd ask myself when thinking about cosmic-
       | ray induced errors is how I would ensure that the bit flips are
       | not caused by e.g. problems on the hard drives or the NAND array
       | (which are probably much more likely to occur than cosmic ray
       | events, at least on the surface of the earth).
        
         | mherdeg wrote:
         | Yeah that was one of the gotchas in the story at
         | https://blogs.oracle.com/linux/post/attack-of-the-cosmic-ray...
         | -- MAYBE it was a bit flip due to a cosmic ray, or MAYBE it was
         | a bit flip due to another layer of the system that makes RAM
         | chips store and retrieve data.
         | 
         | I like the idea of a physicist who thinks about this and says -
         | "well, why should we shrug and say 'maybe it was a cosmic ray?'
         | Surely we can test this! Let's put the computer in a lead-lined
         | enclosure and benchmark the memory failure rate and see if it
         | changes", or whatever.
         | 
         | That's a great extension of the classic computer-hacker view
         | that "of course we can understand why this bug happened, we
         | don't have to shrug and say it segfaults until we restart
         | sometimes, we can just dig some more." How far can you go?
        
       | grog454 wrote:
       | On the subject of bit flips, I am able to detect these in the
       | client to server UDP packets in my game. With specific logging
       | enabled I would see an error about once per minute while
       | receiving about 15,000 of one type of packet per second. I was
       | able to estimate about 1/1,000,000 packets contained a single
       | flipped bit.
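The arithmetic behind that estimate, as a small sketch. The function name is made up for illustration; the inputs are the figures quoted in the comment.

```python
def flipped_fraction(packets_per_second: float, errors_per_minute: float) -> float:
    """Fraction of packets that arrive with a detected error."""
    packets_per_minute = packets_per_second * 60
    return errors_per_minute / packets_per_minute

rate = flipped_fraction(15_000, 1)
print(f"about 1 in {1 / rate:,.0f} packets")
```

15,000 packets per second is 900,000 per minute, so one error per minute works out to roughly one in 900,000, which is in line with the quoted 1/1,000,000.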
        
       | dextercd wrote:
       | The '1 error for every 256MB memory a month' sounds like way too
       | much to me.
       | 
       | A program I wrote launches every time I start my computer. It
       | allocates some memory and scans it periodically for unexpected
       | changes. After an equivalent of 15.8 256MB/months no anomalies
       | have been found yet.
       | 
       | Would really like to see more authoritative figures for modern
       | consumer hardware.
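To put a number on how surprising that null result would be, here is a sketch that assumes errors arrive as a Poisson process at the quoted rate of one per 256MB-month. The Poisson model and the uniform rate are assumptions, not something the article states.

```python
import math

def prob_zero_errors(rate_per_unit: float, exposure_units: float) -> float:
    """P(no events) for a Poisson process: exp(-rate * exposure)."""
    return math.exp(-rate_per_unit * exposure_units)

# 15.8 "256MB-months" of exposure at 1 expected error per 256MB-month
p = prob_zero_errors(1.0, 15.8)
print(f"chance of seeing zero errors: {p:.2e}")
```

Under those assumptions the chance of observing nothing is on the order of one in ten million, so either the quoted rate does not apply to this hardware or the test is not seeing the errors.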
        
       | axg11 wrote:
       | This is fascinating and hints at a future possible scientific
       | study: using phones across the globe to map cosmic ray events.
       | I'm not a physicist so I can't speak for the value of such data.
       | If cosmic ray events do not occur uniformly across the globe then
       | mapping events from 100,000s of phones could give interesting
       | insights.
        
         | antognini wrote:
         | As they say, there's an app for that:
         | 
         | https://cosmicrayapp.com/
         | 
         | Basically it monitors your camera for the streaks produced by
         | cosmic rays. You can see the real time stream of events here:
         | 
         | https://cosmicrayobserver.com/#0.4/0/0
        
           | arc-in-space wrote:
           | Wait, this works? That's amazing. I suppose you could do the
           | same with any camera sensor?
        
           | li2uR3ce wrote:
           | Looks like the rays only hit populated areas.
           | https://xkcd.com/1138/
        
             | seanw444 wrote:
             | A fascinating revelation indeed.
        
           | axg11 wrote:
           | This is amazing - thank you for sharing!
        
       | trollied wrote:
       | I know HN has a decent Factorio fanbase. Factorio properly
       | stresses PC hardware, and borderline memory is usually ok for a
       | casual gamer until you start a Factorio megabase. A decent
       | example is Warger, who does speedruns:
       | https://forums.factorio.com/viewtopic.php?f=7&t=100646
       | https://www.speedrun.com/factorio#100
       | For those who have played the game: speedruns are amazing to
       | watch, if you haven't already.
        
       | geophile wrote:
       | As the article points out, using collected client data is
       | problematic, because some errors will often be undetectable, as
       | in numeric data. And in general, you would have to control for
       | bit flips somehow caused by software.
       | 
       | I wonder whether a SETI approach would be useful here. Allocate,
       | say, 1MB of memory. Fill it with some known bit pattern.
       | Periodically check the memory and look for discrepancies. Do this
       | once an hour, on 10M devices, and that is a LOT of monitoring.
       | Report discrepancies along with time, location (including
       | elevation), hardware and OS information.
       | 
       | I would think that this approach would provide a lot of
       | interesting information about when and where bit flips occur,
       | especially when matched against information on solar and
       | atmospheric events (as in the article). Perhaps sensitive
       | hardware and OS environments would be detected. Even completely
       | negative results would be interesting: no bit flips observed
       | would suggest that purported bit flips elsewhere might have other
       | explanations.
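A minimal sketch of the fill-and-check idea described above. The function names, the 0xAA pattern, and the 1 MiB size are illustrative choices. Note that a user-space buffer is virtual memory: the OS may page it, and reads may be served from CPU cache, so this only approximates watching a fixed piece of DRAM.

```python
def make_buffer(size: int, pattern: int = 0xAA) -> bytearray:
    """Allocate `size` bytes filled with a known byte pattern."""
    return bytearray([pattern]) * size

def find_flips(buf: bytearray, pattern: int = 0xAA) -> list:
    """Return (byte offset, XOR mask of flipped bits) for each corrupted byte."""
    if buf == bytes([pattern]) * len(buf):   # fast path: nothing changed
        return []
    return [(i, b ^ pattern) for i, b in enumerate(buf) if b != pattern]

buf = make_buffer(1 << 20)        # 1 MiB, as suggested in the comment
print(find_flips(buf))            # clean buffer: no discrepancies

buf[1234] ^= 0x08                 # simulate a single flipped bit
print(find_flips(buf))            # reports the offset and which bit changed
```

A real monitor would call `find_flips` on a timer, log the result together with a timestamp and hardware details, and rewrite the pattern after each detection.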
        
         | simne wrote:
         | I heard a few years ago that the US government created an app
         | that periodically checked for random light flashes on a
         | smartphone's camera sensor and sent the data to the cloud, to
         | use phones as a large distributed network for detecting illegal
         | nukes. (By the way, Soviets and Russians really love to collect
         | such weird facts.)
         | 
         | The idea looks very realistic, except that it would drain the
         | battery and could also be used as a surveillance tool, so I
         | have not seen a real app.
         | 
         | Strictly speaking, any digital camera sensor is an excellent
         | tool for such things, much better than RAM. You only need to
         | cover the lens with something opaque to light but transparent
         | to particles (any thin plastic fits), then run a monitoring
         | program and store the logs in the cloud.
         | 
         | And in real life, you can buy a Geiger counter shield for
         | Arduino on AliExpress; one pal of mine even exposed such a
         | sensor to the internet.
         | 
         | - Better to use two such sensors and a simple logic circuit, so
         | that they register not all events but only those that both
         | sensors detect simultaneously; then you can see the direction
         | the cosmic particle came from.
         | 
         | - This method of many sensors with logic and logs was used in
         | the research that detected hidden rooms in the Great Pyramids
         | of Giza (real cosmic rays are so powerful that they can be
         | detected with simple equipment at depths of up to a hundred
         | meters, so ordinary concrete is like cardboard to them). I
         | even saw a post from one guy who installed such machinery at
         | home (he used 8 or 16 counters, I forget the details), and
         | after a few months the logs clearly showed where the window in
         | his room is :)
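The two-sensor trick can also be done in software on logged timestamps instead of with a logic circuit. A sketch, assuming each sensor produces a sorted list of event times in seconds; the function name and the coincidence window are hypothetical.

```python
def coincidences(a: list, b: list, window: float) -> list:
    """Pair up events from two sorted timestamp lists that occur
    within `window` seconds of each other."""
    pairs, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        dt = a[i] - b[j]
        if abs(dt) <= window:
            pairs.append((a[i], b[j]))
            i += 1
            j += 1
        elif dt < 0:
            i += 1      # event in a is too early to match anything left in b
        else:
            j += 1      # event in b is too early to match anything left in a
    return pairs

upper = [1.0, 5.0, 9.0]           # made-up timestamps from the top sensor
lower = [1.0005, 7.0, 9.0002]     # made-up timestamps from the bottom sensor
print(coincidences(upper, lower, window=0.001))
```

Events seen by only one sensor (local radioactivity, noise) are dropped; what remains are particles that passed through both, which constrains their direction to the line joining the two sensors.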
        
         | EscargotCult wrote:
         | This could be crowdsourced. Imagine some sort of reporting
         | network of volunteers, running the simple program (memory
         | allocation and periodic checks) on any hardware they're loaning
         | time on, and submitting their location and altitude as well.
        
       | tconfrey wrote:
       | I worked at Sonus Networks (now Ribbon[0]) in the early 2000s
       | building VoIP solutions for telcos. We had a bunch of unexplained
       | errors in a new installation in Denver. After much head
       | scratching the engineers on the problem concluded that the higher
       | altitude significantly increased the likelihood of impact by
       | alpha particles and that that was the cause of the problem!
       | 
       | (IIRC we increased the shielding on the devices.)
       | 
       | [0] https://ribboncommunications.com/
        
         | perihelions wrote:
         | Minor clarification: alpha radiation doesn't come from cosmic
         | rays, rather from U/Th contamination in the circuit materials.
         | The altitude dependent component you'd be seeing is rather
         | muons and neutrons.
        
           | cozzyd wrote:
           | you can get alphas from spallation, but you wouldn't really
           | call that alpha radiation.
        
         | jaytaylor wrote:
         | Maybe it's a stupid question, but did the additional shielding
         | completely resolve the discrepancy?
        
       | li2uR3ce wrote:
       | > In almost every case we cannot find any plausible explanation
       | or bug
       | 
       | Observe the natural state of every software developer. I kid...
       | or do I?
       | 
       | > What if it wasn't just some fantastical explanation?
       | 
       | Doesn't sound nearly as fantastical but bad RAM is probably more
       | common than one would expect. You seldom really know the quality
       | of hardware you run on. Just say'n, sometimes you don't need a
       | helping cosmic ray.
        
       | IncRnd wrote:
       | "Bitsquatting is a form of cybersquatting which relies on bit-
       | flip errors that occur during the process of making a DNS
       | request. These bit-flips may occur due to factors such as faulty
       | hardware or cosmic rays. When such an error occurs, the user
       | requesting the domain may be directed to a website registered
       | under a domain name similar to a legitimate domain, except with
       | one bit flipped in their respective binary representations.
       | 
       | "A 2011 Black Hat paper detailed an analysis where eight
       | legitimate domains were targeted with thirty one bitsquat
       | domains. Over the course of one day, 3,434 requests were made to
       | bitsquat domains." [1]
       | 
       | Cisco presented a paper on bitsquatting at defcon, "Examining the
       | Bitsquatting Attack Surface". From the paper, "The conclusion is
       | that the possibility of bitsquat attacks is more widespread than
       | originally thought, but several techniques exist for mitigating
       | the effects of these new attacks." [2]
       | 
       | [1] https://en.wikipedia.org/wiki/Bitsquatting
       | 
       | [2]
       | https://media.defcon.org/DEF%20CON%2021/DEF%20CON%2021%20pre...
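For a sense of what "one bit flipped" means in practice, here is a sketch that enumerates the candidate bitsquat names for a domain. It is an illustration and not the method from either paper. It assumes ASCII hostnames, treats case-only changes as the same name (DNS is case-insensitive), and skips flips that would produce a dot and change the label structure.

```python
import string

HOSTNAME_CHARS = set(string.ascii_lowercase + string.digits + "-")

def bitsquats(domain: str) -> list:
    """All hostnames that differ from `domain` by exactly one flipped bit
    and are still made of valid hostname characters."""
    name = domain.lower()
    found = set()
    for i, byte in enumerate(name.encode("ascii")):
        if byte == ord("."):
            continue                      # keep the label boundaries intact
        for bit in range(8):
            flipped = chr(byte ^ (1 << bit)).lower()
            if flipped not in HOSTNAME_CHARS or flipped == chr(byte):
                continue                  # invalid character or case-only change
            candidate = name[:i] + flipped + name[i + 1:]
            labels = candidate.split(".")
            if any(l.startswith("-") or l.endswith("-") for l in labels):
                continue                  # labels may not start or end with "-"
            found.add(candidate)
    return sorted(found)

names = bitsquats("cloudfront.net")
print(len(names), "candidates, e.g.", names[:5])
```

Each candidate is a name an attacker could register. Flipping one bit of the "r" in cloudfront, for example, gives "cloudfvont.net".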
        
         | chadwittman wrote:
         | Amazing comment, this is wild. Thank you for sharing this!
        
         | anonymousiam wrote:
         | I had the pleasure of working with the author for a brief time,
         | and I attended his presentation. Great stuff. What I found
         | particularly interesting is some later work that characterized
         | the probability of error based upon the device type, and the
         | ambient temperature (based on IP Geo-location).
        
         | pitaj wrote:
         | Is one mitigation TLS certificate verification?
        
           | legalcorrection wrote:
           | Depends on where in the stack the error happened.
        
           | simulate-me wrote:
           | It depends on whether or not the client specifically
           | requested, or is expecting, traffic over HTTPS. But yes, if
           | the user's client requests encrypted traffic, the attacker
           | will not be able to produce a valid certificate. This attack
           | isn't that different than a MITM.
        
             | tedunangst wrote:
             | Nothing prevents an attacker from getting a cert for
             | snytimg.com or oslashdot.org.
        
               | legalcorrection wrote:
               | That only helps the attacker if the error happened before
               | reaching the DNS-specific path. If the error happens
               | inside the DNS path, then the browser is still expecting
               | to get a certificate for the correct website.
        
           | xenophonf wrote:
        
             | rat9988 wrote:
             | > Why not read the linked paper,
             | 
             | You answered your own question. The answer is in page 12,
             | which means there is too much information. He is not
             | interested in the whole topic, just about this question. So
             | he asks, maybe someone is charitable enough to answer.
             | Nothing wrong with it.
        
               | xenophonf wrote:
               | How could I possibly answer their question better than
               | the experts who wrote the paper?
        
               | rat9988 wrote:
               | Maybe you can't but someone else can. The question is
               | open to anyone who can and wants to answer.
        
               | nixpulvis wrote:
               | Skim until you are in the right section?
               | 
               | Literally titled: "Section II - Mitigation of
               | bitsquatting attacks"
        
         | dahfizz wrote:
         | How does a cosmic bit flip make it past the Ethernet CRC?
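One possible answer, sketched under simplifying assumptions: the Ethernet frame check sequence only protects the frame while it is on the wire. If the bit flips in memory before the sender computes the CRC, the corrupted data goes out with a perfectly valid checksum. `zlib.crc32` uses the same CRC-32 polynomial as the Ethernet FCS; the `send`/`receive_ok` helpers are hypothetical stand-ins for the NICs, not real framing.

```python
import zlib

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)

def send(payload: bytes) -> tuple:
    """Sender side: attach a CRC-32 over whatever is in the buffer."""
    return payload, zlib.crc32(payload)

def receive_ok(payload: bytes, crc: int) -> bool:
    """Receiver side: recompute the CRC-32 and compare."""
    return zlib.crc32(payload) == crc

query = b"cloudfront.net"

# Case 1: the flip happens on the wire, after the CRC was computed.
payload, crc = send(query)
on_wire_ok = receive_ok(flip_bit(payload, 50), crc)

# Case 2: the flip happens in the sender's memory, before the CRC.
payload, crc = send(flip_bit(query, 50))
in_memory_ok = receive_ok(payload, crc)

print("flip on the wire accepted:", on_wire_ok)
print("flip in memory accepted: ", in_memory_ok, payload)
```

The same reasoning applies on the receiving side after the NIC has verified and stripped the checksum, and to every hop that recomputes it.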
        
       | incomingpain wrote:
       | I had the opportunity to design my SOC from scratch. Mostly
       | ripping off Berkeley's public design.
       | 
       | Something I have documented in the last 2 years. Solar flare
       | activity is what causes problems. All memory is ECC but it still
       | happens.
       | 
       | Faraday cage incoming?
       | 
       | Wait? Faraday cage racks million $ idea?
        
         | jeffreygoesto wrote:
         | Using an FD-SOI process can help reduce soft errors.
        
       | legalcorrection wrote:
       | I suspect without great evidence that cosmic ray bitflips are
       | mostly a scapegoat for imperfect hardware and are in fact one or
       | two orders of magnitude less common than popular wisdom would
       | suggest.
        
       | zepearl wrote:
       | I don't know folks.
       | 
       | 2 years ago I took a laptop which I wasn't using (16 GiB RAM non-
       | ECC) => I created in Linux with Python an array ("bytes"? Don't
       | remember exactly anymore) of ~10 or 12 GiB containing random
       | integers => computed the array's hash and saved it.
       | 
       | Then for ~1-2 months I recomputed from time to time the hash of
       | that array (in between the laptop was in suspend-to-RAM) and
       | compared it to the original result => it always matched, I never
       | had any bitflips.
       | 
       | I therefore doubt that the estimation of "1/256MB/month" is
       | correct - I could not prove that, at least not with my laptop.
        
         | deckard1 wrote:
         | I've always been a bit skeptical of published numbers. I
         | usually just chalk it up to vastly different operating
         | conditions and scale.
         | 
         | On my home server w/ ECC you can check the corrected and
         | uncorrected (multibit) errors. Assuming my Ryzen is correctly
         | reporting them to Linux, I have 0 errors corrected and 0
         | uncorrected with an 80-day uptime. I've checked a few other
         | times and never seen an error. Others with ECC often report the
         | same.
         | 
         | My understanding of modern RAM is that it has checks built in
         | to the modules which are somewhat equivalent to ECC already
         | (the correcting part, not the reporting part). Which is a
         | necessity in order to hit the density we are at today.
        
         | cozzyd wrote:
         | A server with 64 GB of ECC ram sitting at an altitude of 3.2 km
         | on the Greenland ice sheet is reporting... 0 bit errors
         | (whether correctable or uncorrectable) in the 244 days it's
         | been up.
         | 
         | A server with 16 GB of ECC ram at an altitude of 3.8 km in
         | California is reporting.... 0 bit errors in the 146 days it's
         | been up.
         | 
         | Maybe I shouldn't believe what /sys/devices/system/edac/mc is
         | reporting? These are EL8 systems...
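A small sketch for reading those counters, assuming the usual Linux EDAC sysfs layout in which each memory controller directory (`mc0`, `mc1`, ...) exposes `ce_count` and `ue_count`. The function name and the returned dictionary shape are made up for illustration.

```python
from pathlib import Path

def edac_counts(root: str = "/sys/devices/system/edac/mc") -> dict:
    """Read corrected (ce_count) and uncorrected (ue_count) error totals
    for each memory controller found under `root`."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue                      # unreadable or malformed counter
        counts[mc.name] = {"corrected": ce, "uncorrected": ue}
    return counts

print(edac_counts())
```

An empty result means no memory controller was registered at all (for example, no EDAC driver loaded for the platform), which is a different situation from a controller that is present and reporting zero errors.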
        
         | tclancy wrote:
         | >I therefore doubt that the estimation of "1/256MB/month" is
         | correct
         | 
         | As someone who did incredibly poorly in high school physics,
         | this line in the article bothered me as well: the study is from
         | the 1990s when the density of memory would have been much
         | lower. I would think the percentage per megabyte has dropped
         | significantly in 30 or so years. It also assumes a constant
         | form factor for the memory, doesn't it?
        
         | nomel wrote:
         | > I therefore doubt that the estimation of "1/256MB/month" is
         | correct
         | 
         | The probability is related to the physical volume the memory
         | takes, since it's caused by a physical particle going through
         | that volume. So, this rate will continuously drop as memory
         | density increases.
        
       ___________________________________________________________________
       (page generated 2022-04-13 23:00 UTC)