[HN Gopher] Everything Is Broken: Shipping Rust-Minidump at Mozilla
       ___________________________________________________________________
        
       Everything Is Broken: Shipping Rust-Minidump at Mozilla
        
       Author : mthermidor
       Score  : 229 points
       Date   : 2022-06-14 15:19 UTC (7 hours ago)
        
 (HTM) web link (hacks.mozilla.org)
 (TXT) w3m dump (hacks.mozilla.org)
        
       | js2 wrote:
       | Thank you for this work!
       | 
       | I've been involved with minidumps in one way or another since
       | around 2010. Was at a startup at the time that had a browser
       | based on Chromium and we needed crash reporting for our own app.
       | So I wrote a pretty simple backend that received minidumps, ran
       | them through the breakpad processor and shoved the output into
       | Splunk. That was our crash-reporting system.
       | 
       | Circa 2013 the company gets acquired by Yahoo which at the time
       | was using Crittercism for its mobile apps but Yahoo wasn't happy
       | with it. Somehow I was now the mobile app crash reporting expert
       | at the company though so I built a whole new in-house crash
       | reporting solution.
       | 
       | For iOS I wrote an SDK around PLCrashReporter because unwinding
       | stacks on the client works out way better on iOS than dealing
       | with a minidump.
       | 
       | For Android I had to deal with both JVM (er, Dalvik, er ART)
       | stack traces, easy enough, but also native code crashes. For the
       | latter I used breakpad's crash handler and minidumps. But it
       | turns out that minidumps from Android devices are almost useless
       | for two reasons:
       | 
       | 1) If the crashes originate in managed code or calls into managed
       | code you can't trace back through the managed code frames from a
       | minidump. Especially if you don't have frame pointers.
       | 
       | 2) You basically cannot get the symbols for all the different
       | flavors of Android. Without symbols any stack trace that breakpad
       | reconstructs is pretty useless.
       | 
       | Eventually I abandoned minidumps on Android and instead unwound
       | on the phone using corkscrew, wait no, libbacktrace, wait no,
       | libunwind. But that still doesn't give useful stack traces very
       | often. In the end, I ended up capturing logcat output when
       | restarting after a crash which actually tends to have the most
       | useful stack traces.
       | 
       | Which is all to say, both Apple and Google make it really hard
       | for a mobile app to find out why it crashed. Both Android and iOS
       | create a crash report for any app which crashes, but the app
       | can't access those. So we're all shipping apps with third-party
       | crash handlers built-in that try to capture a stack or minidump
       | in-process and make sense of it later.
        
       | avgcorrection wrote:
       | Gankra is the most entertaining Rust author (Rust programmer who
       | writes about Rust). Easily.
        
         | robby_w_g wrote:
         | There's an unreasonably grumpy commenter below that disagrees,
         | but I personally agree with you and found this to be a fun
         | read.
         | 
         | I was interested in the topic before reading, but it could have
         | easily been a slog of technical minutiae. I'm glad that wasn't
         | the case!
         | 
         | Edit: the comment I referenced was deleted in the time I took
         | to post this. It's probably for the best
        
         | tialaramex wrote:
         | Mmm. I think @m_ou_se is probably the most _entertaining_ at
         | least if we consider that both Saturday Night Live and
         | Nightmare On Elm Street are entertainment.
         | 
         | For example, Rust deliberately doesn't have the ternary
         | operator, and random other types don't get silently coerced as
         | booleans - so you can't write a = x ? 1 : -1; however you can
         | write a = if x != 0 { 1 } else { -1 }; with the same effect.
         | But Mara isn't satisfied with this verbose yet sensible answer,
         | and proposes you could instead, for example:
         | 
         | a = x.count_ones().count_ones().count_ones().count_ones() as
         | i32 * 2 - 1;
         | 
         | Hilarious? Or maybe terrifying? Entertaining certainly.
         | https://twitter.com/m_ou_se/status/1404034056405368833?lang=...
         | 
         | Aria is _more informative_ but I'm not going to end up choking
         | and spilling my beverage all over the desk.
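
A minimal sketch of the two forms quoted above, assuming `x` is a `u32` (the comment does not say which type is meant). The function names are made up for illustration. Each round of `count_ones` shrinks the range of possible values until only 0 or 1 is left, which is why the joke version agrees with the plain `if` expression.

```rust
/// The readable version: `if` is an expression in Rust, so it fills
/// the role of C's `x ? 1 : -1`.
fn sign_if(x: u32) -> i32 {
    if x != 0 { 1 } else { -1 }
}

/// The joke version. Four rounds of popcount collapse any non-zero
/// u32 down to 1 and leave 0 as 0:
///   round 1 gives 0..=32, round 2 gives 0..=5,
///   round 3 gives 0..=2,  round 4 gives 0 or 1.
/// `as` binds tighter than `*`, so this is `(... as i32) * 2 - 1`.
fn sign_popcount(x: u32) -> i32 {
    x.count_ones().count_ones().count_ones().count_ones() as i32 * 2 - 1
}
```

Both functions return -1 for zero and 1 for every other input.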
        
       | singhrac wrote:
       | > Rust is a really good language for writing parsers. C++ really
       | isn't.
       | 
       | One thing I appreciate about writing Rust is that ADT support
       | implies writing parsers is simpler under the "parse don't
       | validate" mindset (which was clarified for me I think in this [0]
       | article).
       | 
       | [0]: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-
       | va...
        
         | marcosdumay wrote:
         | It's not ADT, it's strict types. The reason you can't do
         | "parse, don't validate" in C++ is because you can't assume
         | anything is valid at the point you use the data.
         | 
         | What ADT does is give you enough flexibility so that a strict
         | typing system doesn't suck.
        
           | singhrac wrote:
           | > you can't assume anything is valid at the point you use the
           | data.
           | 
           | Just to double check my understanding: are you talking about
           | raw pointers (i.e. void*) being common in C++ and not in
           | Rust? You're right that I was using ADT a bit loosely; to be
           | honest the main value add for me has been the first class
           | data-holding enums/sum types. C++ has std::variant, but the
           | syntax support in Rust feels nicer.
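
A small sketch of what "parse, don't validate" can look like with data-holding enums. The record layout (one tag byte followed by a tag-specific payload) and every name here are hypothetical, chosen only to illustrate the idea; nothing below is taken from rust-minidump.

```rust
/// A parsed record. Each variant carries only the data that is valid
/// for it, so code holding a `Record` never re-checks a tag byte.
#[derive(Debug, PartialEq)]
enum Record {
    Ping,
    Thread { id: u32 },
    Note(String),
}

#[derive(Debug, PartialEq)]
enum ParseError {
    Empty,
    Truncated,
    BadUtf8,
    UnknownTag(u8),
}

/// Parse once at the boundary: a tag byte, then a tag-specific payload.
fn parse_record(bytes: &[u8]) -> Result<Record, ParseError> {
    let (&tag, payload) = bytes.split_first().ok_or(ParseError::Empty)?;
    match tag {
        0 => Ok(Record::Ping),
        1 => {
            // Thread ids are assumed to be 4 little-endian bytes.
            let raw = payload.get(..4).ok_or(ParseError::Truncated)?;
            let id = u32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]);
            Ok(Record::Thread { id })
        }
        2 => std::str::from_utf8(payload)
            .map(|s| Record::Note(s.to_owned()))
            .map_err(|_| ParseError::BadUtf8),
        other => Err(ParseError::UnknownTag(other)),
    }
}
```

Callers then `match` on `Record`, and the compiler insists that every variant is handled.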
        
             | marcosdumay wrote:
             | C++ has a series of issues.
             | 
             | You can't trust pointers have a value, or that the value is
             | valid, you can't trust that your enums have a value inside
             | their interval, or in fact you can't trust that any value
             | from any type is inside its interval at all.
             | 
             | You also can't really trust that your values have the
             | correct size.
             | 
             | We choose some of those to ignore, otherwise we wouldn't be
             | able to program at all, but C++ gives you no guarantees at
             | all about anything. The point is that if you do a parsing
             | run in C++ and encode your value, you will still get many
             | of the above problems because of bugs in your code.
        
               | swolchok wrote:
               | > you can't trust that your enums have a value inside
               | their interval
               | 
               | If you don't set the underlying type, assigning a value
               | that doesn't match an enumerator via `static_cast` is
               | undefined behavior. See
               | https://en.cppreference.com/w/cpp/language/enum . (Doing
               | weird pointer casting things is also undefined behavior
               | per the strict aliasing rule, though, come to think of
               | it, I'm not sure whether memcpying an out-of-range value
               | into an enum through the "reinterpret_cast to `char*`"
               | loophole is undefined behavior.)
        
               | marcosdumay wrote:
               | Even if that covered all the problem space (instead of
               | replacing it with a much larger one), if your code is
               | flawless, parsing and validating are equivalent.
               | 
               | Choosing one just makes a difference because code has
               | problems.
        
               | nemetroid wrote:
               | I'm assuming you are referring to this part:
               | 
               | > If the underlying type is not fixed and the source
               | value is out of range, the behavior is undefined.
               | 
               | Note the fine print about the meaning of "out of range":
               | 
               | > (The source value, as converted to the enumeration's
               | underlying type if floating-point, is in range if it
               | would fit in the smallest bit field large enough to hold
               | all enumerators of the target enumeration.)
               | 
               | So this is _not_ undefined:
               | 
               |     enum E { A = 0, B = 1, C = 2 };
               |     E valid = static_cast<E>(3);
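
For contrast with the C++ cast above, a sketch of how the same three-value enum is usually handled in safe Rust. The enum `Level` and its error type are invented for this example. Safe Rust has no cast from an integer to an enum, so the conversion has to be written out, and the value 3 is rejected at that point.

```rust
use std::convert::TryFrom;

#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(u8)]
enum Level {
    Low = 0,
    Mid = 1,
    High = 2,
}

impl TryFrom<u8> for Level {
    /// On failure, hand back the raw value that did not match.
    type Error = u8;

    fn try_from(raw: u8) -> Result<Self, Self::Error> {
        match raw {
            0 => Ok(Level::Low),
            1 => Ok(Level::Mid),
            2 => Ok(Level::High),
            other => Err(other),
        }
    }
}
```

Going the other way (`Level::High as u8`) is always allowed, because every `Level` is a valid `u8`.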
        
       | danShumway wrote:
       | It's linked at the bottom of the article, but reminder that
       | Gankra's blog (https://gankra.github.io/blah/) has a ton of other
       | great writing like this.
       | 
       | In particular, I always recommend "Text Rendering Hates You."
        
         | nindalf wrote:
         | I link that article every time I see someone on the internet
         | say "that sounds easy, why don't you just"
         | 
         | And the answer is always well, things are more complicated than
         | they look. Even something as _trivial_ as rendering text on a
         | screen.
        
       | secondcoming wrote:
       | Maybe I'm missing something, but they ported from C++ (because
       | 'C++ is bad donchaknow') to Rust and still ran into problems
       | parsing crash dumps?
       | 
       | If the dump is corrupt then just stop trying to parse/make sense
       | of it; it's garbage.
        
         | Gankra wrote:
         | No, we removed many random crashes that the C++ code had. You
         | cannot "simply" discard a crash report if something is slightly
         | off because then you would discard most crash reports. And most
         | debuginfo too.
         | 
         | You can't expect "thing that runs when a process may have just
         | experienced memory corruption" and "all builds of your
         | application for all eternity" and "every toolchain you ever
         | built your program with for all eternity" to be even vaguely
         | reliable, because those things are in the past and we're trying
         | to figure out how to fix the bugs people are experiencing in
         | production today.
         | 
         | It is a horribly miserable answer to tell your coworkers "yeah
         | sorry I know users are getting thousands of crashes this
         | morning but the crash-dumper didn't sign its name in cursive so
         | I'm gonna refuse to let you read the letter it sent at all".
         | 
         | And just an incoherent answer to say "yeah I know this is a
         | stack overflow but it left the stack in a mildly corrupt state
         | so I absolutely refuse to try to even look at the stack and
         | figure anything out about it". Like, that is the entire purpose
         | of a crashreporter, to investigate a program in an invalid
         | state!
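
A rough sketch of the "keep going and report what you can" approach described above. The data layout (a run of 4-byte little-endian thread ids, where id 0 is treated as invalid) is an assumption made up for this example and is not the minidump format. Problems become warnings attached to the result instead of errors that abort the whole parse.

```rust
/// What we managed to recover from one (hypothetical) dump section.
#[derive(Debug, Default, PartialEq)]
struct Report {
    thread_ids: Vec<u32>,
    warnings: Vec<String>,
}

/// Read as many thread ids as possible. Bad entries and a trailing
/// partial entry are noted and skipped, never treated as fatal.
fn read_thread_ids(bytes: &[u8]) -> Report {
    let mut report = Report::default();
    let mut chunks = bytes.chunks_exact(4);
    for chunk in &mut chunks {
        let id = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        if id == 0 {
            // Assumed rule for this sketch: id 0 is never valid.
            report.warnings.push("skipped thread entry with id 0".to_owned());
            continue;
        }
        report.thread_ids.push(id);
    }
    let leftover = chunks.remainder().len();
    if leftover != 0 {
        report.warnings.push(format!("ignored {} trailing bytes", leftover));
    }
    report
}
```

The caller still gets every id that could be read, plus a list of what looked wrong along the way.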
        
           | rockdoe wrote:
           | Reminds me of "Your program shouldn't have bugs in it isn't
           | an acceptable position to take for a debugger", from the rr
           | folks. Unfortunately I can't find the source of the quote any
           | more, but it stuck in my mind.
        
             | Gankra wrote:
             | Yeah computing backtraces in a crashreporter is extremely
             | similar to a debugger in that you need a lot of fudge-
             | factor heuristics and fallback modes for known toolchain
             | bugs or common corruptions.
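
One way to picture "fallback modes" is a chain of unwinding techniques tried from most to least trustworthy. The three techniques and their order below (unwind tables, then frame pointers, then scanning the stack) are an assumption for illustration, and all the types are hypothetical. This is not a description of rust-minidump's actual code.

```rust
/// How much to trust a recovered frame, best first.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Trust {
    Cfi,
    FramePointer,
    Scan,
}

#[derive(Debug, PartialEq)]
struct CallerFrame {
    pc: u64,
    trust: Trust,
}

/// Hypothetical inputs: what each technique could recover for one frame.
/// `None` means that technique failed (missing debuginfo, corrupt stack...).
struct FrameEvidence {
    cfi_pc: Option<u64>,
    frame_pointer_pc: Option<u64>,
    scanned_pc: Option<u64>,
}

/// Try each technique in turn and remember which one produced the answer.
fn find_caller(ev: &FrameEvidence) -> Option<CallerFrame> {
    ev.cfi_pc
        .map(|pc| CallerFrame { pc, trust: Trust::Cfi })
        .or_else(|| {
            ev.frame_pointer_pc
                .map(|pc| CallerFrame { pc, trust: Trust::FramePointer })
        })
        .or_else(|| {
            ev.scanned_pc
                .map(|pc| CallerFrame { pc, trust: Trust::Scan })
        })
}
```

Recording the trust level lets whoever reads the backtrace see which frames came from a guess.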
        
             | khuey wrote:
             | You're probably remembering
             | https://pernos.co/blog/tzcnt-portability/
        
             | gpm wrote:
             | Speaking of the rr folk, they also had the fascinating
             | point that you can reliably generate a "stack trace" by
             | figuring out which `call` instructions were executed with
             | what values (also other jump instructions I suppose),
             | instead of walking the stack. Thereby skipping the whole
             | "parsing the stack is insanely difficult and unreliable"
             | issue.
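
A toy sketch of that idea: if you have a record of which calls and returns actually executed, you can replay it into a "shadow stack" and never look at the real stack memory. The `Event` type is invented for this example. A real recording would contain far more, and this makes no claim about how rr or Pernosco store or process theirs.

```rust
/// One entry in a (hypothetical) recorded control-flow trace.
#[derive(Debug, Clone, Copy)]
enum Event {
    /// A call instruction ran and jumped to `target`.
    Call { target: u64 },
    /// A return instruction ran.
    Ret,
}

/// Replay the trace and return the functions still active at the end,
/// outermost first.
fn stack_at_end(trace: &[Event]) -> Vec<u64> {
    let mut stack = Vec::new();
    for event in trace {
        match *event {
            Event::Call { target } => stack.push(target),
            Event::Ret => {
                // A stray return with nothing to pop is simply ignored.
                stack.pop();
            }
        }
    }
    stack
}
```

The result depends only on the recorded events, so a smashed stack in the crashed process does not affect it.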
        
               | glandium wrote:
               | FWIW, that's from pernosco, not rr.
        
               | gpm wrote:
               | I think it's the same people?
        
         | structural wrote:
         | This is the excessively fun part of dealing with crash dumps in
         | general. Many of them are going to be 1% corrupt, 99% fine, and
         | somewhere in them likely has vital information about what
         | caused the corruption.
         | 
         | So the entire reason for being of things like rust-minidump
         | is to make enough sense out of files that are known to be
         | corrupt garbage to be able to find bugs.
        
       | mrlonglong wrote:
       | Do tell us more, don't leave us hanging ! Loved it.
        
       | ComputerGuru wrote:
       | A better/more technical article on the same tech, from Mozilla's
       | collaborators on this project:
       | https://jake-shadle.github.io/crash-reporting/
        
         | Gankra wrote:
         | That article is about the client-side (generating the minidump
         | for a crashed process) to this article's server-side
         | (processing/analyzing the minidump).
        
       | nindalf wrote:
       | This is a fantastic article, thank you for writing it. Looking
       | forward to part 2!
        
       | pierrebai wrote:
       | If the follow-up post does not make it to HN front page, I'll
       | have a hole in my life.
        
       | Gankra wrote:
       | Extra shoutouts to the folks at Sentry who also flipped rust-
       | minidump on as their default backend and had to deal with way
       | more exotic issues than I did (and fixed them!) because although
       | Firefox sees some horrendous stuff and gets a bajillion crash
       | reports, it's still one application with one basically stable
       | minidump writing configuration.
       | 
       | They have to deal with basically random apps doing whatever they
       | want and it sounds like hell.
        
       | yjftsjthsd-h wrote:
       | > how we got absolutely owned by simple fuzzing
       | 
       | > You are reading part 1, wherein we build up our hubris.
       | 
       | Props to anyone willing to own their faults this readily:)
        
       | j3s wrote:
       | What a fun read! :3 I really like your writing style. Deploying
       | stuff to production is always so nerve-wracking, I related to
       | that very hard. I recently developed a golang alternative to an
       | old erlang-ruby-hodgepodge, and when it worked in production I
       | found myself constantly not believing that nothing went wrong.
        
         | tclancy wrote:
         | Ha, weeks and months of thinking, "Please just work" and then
         | it does and it's always a shock.
        
       ___________________________________________________________________
       (page generated 2022-06-14 23:00 UTC)