[HN Gopher] Everything Is Broken: Shipping Rust-Minidump at Mozilla
___________________________________________________________________
Author : mthermidor
Score  : 229 points
Date   : 2022-06-14 15:19 UTC (7 hours ago)

(HTM) web link (hacks.mozilla.org)
(TXT) w3m dump (hacks.mozilla.org)

| js2 wrote:
| Thank you for this work!
|
| I've been involved with minidumps in one way or another since
| around 2010. Was at a startup at the time that had a browser
| based on Chromium and we needed crash reporting for our own app.
| So I wrote a pretty simple backend that received minidumps, ran
| them through the breakpad processor and shoved the output into
| Splunk. That was our crash-reporting system.
|
| Circa 2013 the company gets acquired by Yahoo which at the time
| was using Crittercism for its mobile apps but Yahoo wasn't happy
| with it. Somehow I was now the mobile app crash reporting expert
| at the company though so I built a whole new in-house crash
| reporting solution.
|
| For iOS I wrote an SDK around PLCrashReporter because unwinding
| stacks on the client works out way better on iOS than dealing
| with a minidump.
|
| For Android I had to deal with both JVM (er, Dalvik, er ART)
| stack traces, easy enough, but also native code crashes. For the
| latter I used breakpad's crash handler and minidumps. But it
| turns out that minidumps from Android devices are almost useless
| for two reasons:
|
| 1) If the crashes originate in managed code or calls into managed
| code you can't trace back through the managed code frames from a
| minidump. Especially if you don't have frame pointers.
|
| 2) You basically cannot get the symbols for all the different
| flavors of Android. Without symbols any stack trace that breakpad
| reconstructs is pretty useless.
|
| Eventually I abandoned minidumps on Android and instead tried
| unwinding on the phone using corkscrew, wait no, libbacktrace,
| wait no, libunwind. But that still doesn't give useful stack
| traces very often. In the end, I ended up capturing logcat output
| when restarting after a crash which actually tends to have the
| most useful stack traces.
|
| Which is all to say, both Apple and Google make it really hard
| for a mobile app to find out why it crashed. Both Android and iOS
| create a crash report for any app which crashes, but the app
| can't access those. So we're all shipping apps with third-party
| crash handlers built-in that try to capture a stack or minidump
| in-process and make sense of it later.
| avgcorrection wrote:
| Gankra is the most entertaining Rust author (Rust programmer who
| writes about Rust). Easily.
| robby_w_g wrote:
| There's an unreasonably grumpy commenter below that disagrees,
| but I personally agree with you and found this to be a fun
| read.
|
| I was interested in the topic before reading, but it could have
| easily been a slog of technical minutiae. I'm glad that wasn't
| the case!
|
| Edit: the comment I referenced was deleted in the time I took
| to post this. It's probably for the best
| tialaramex wrote:
| Mmm. I think @m_ou_se is probably the most _entertaining_ at
| least if we consider that both Saturday Night Live and
| Nightmare On Elm Street are entertainment.
|
| For example, Rust deliberately doesn't have the ternary
| operator, and random other types don't get silently coerced as
| booleans - so you can't write a = x ? 1 : -1; however you can
| write a = if x != 0 { 1 } else { -1 }; with the same effect.
| But Mara isn't satisfied with this verbose yet sensible answer,
| and proposes you could instead, for example:
|
| a = x.count_ones().count_ones().count_ones().count_ones() as
| i32 * 2 - 1;
|
| Hilarious? Or maybe terrifying? Entertaining certainly.
| https://twitter.com/m_ou_se/status/1404034056405368833?lang=...
|
| Aria is _more informative_ but I'm not going to end up choking
| and spilling my beverage all over the desk.
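For readers who want to check the popcount joke quoted above, here is a minimal sketch. It assumes `x` is a `u32` (the comment does not say which integer type is meant), and the function names `sign_if` and `sign_popcount` are made up for illustration. Four rounds of `count_ones` collapse any non-zero `u32` to 1 and leave 0 at 0, so the chain agrees with the plain `if` expression.

```rust
/// Plain `if` expression: the usual Rust stand-in for `x ? 1 : -1`.
fn sign_if(x: u32) -> i32 {
    if x != 0 { 1 } else { -1 }
}

/// The popcount chain from the comment, specialised to `u32` (an assumption).
/// `as` binds tighter than `*`, so this is `(chain as i32) * 2 - 1`.
fn sign_popcount(x: u32) -> i32 {
    x.count_ones().count_ones().count_ones().count_ones() as i32 * 2 - 1
}

fn main() {
    // Spot-check a few inputs, including both extremes.
    for x in [0u32, 1, 2, 3, 255, 65_535, u32::MAX] {
        assert_eq!(sign_if(x), sign_popcount(x));
    }
    println!("both forms agree on the sampled inputs");
}
```

Wider integer types may need more rounds of `count_ones` before every non-zero input collapses to 1, which is a good reason to keep the `if` expression in real code.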
| wly_cdgr wrote:
| singhrac wrote:
| > Rust is a really good language for writing parsers. C++ really
| isn't.
|
| One thing I appreciate about writing Rust is that ADT support
| implies writing parsers is simpler under the "parse don't
| validate" mindset (which was clarified for me I think in this [0]
| article).
|
| [0]: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
| marcosdumay wrote:
| It's not ADT, it's strict types. The reason you can't do
| "parse, don't validate" in C++ is because you can't assume
| anything is valid at the point you use the data.
|
| What ADT does is give you enough flexibility so that a strict
| typing system doesn't suck.
| singhrac wrote:
| > you can't assume anything is valid at the point you use the
| data.
|
| Just to double check my understanding: are you talking about
| raw pointers (i.e. void*) being common in C++ and not in
| Rust? You're right that I was using ADT a bit loosely; to be
| honest the main value add for me has been the first class
| data-holding enums/sum types. C++ has std::variant, but the
| syntax support in Rust feels nicer.
| marcosdumay wrote:
| C++ has a series of issues.
|
| You can't trust pointers have a value, or that the value is
| valid, you can't trust that your enums have a value inside
| their interval, or in fact you can't trust that any value
| from any type is inside its interval at all.
|
| You also can't really trust that your values have the
| correct size.
|
| We choose some of those to ignore, otherwise we wouldn't be
| able to program at all, but C++ gives you no guarantees at
| all about anything. The point is that if you do a parsing
| run in C++ and encode your value, you will still get many
| of the above problems because of bugs in your code.
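The "parse, don't validate" idea discussed above can be sketched with a data-holding enum. Everything here is hypothetical and invented for illustration: the `Record` and `ParseError` types, the one-byte tag layout, and the `parse_record` function are not taken from rust-minidump or from the minidump format. The point is only that after parsing succeeds, later code holds a `Record` and cannot see an unknown tag or invalid UTF-8.

```rust
/// Hypothetical record kinds, invented for this sketch.
#[derive(Debug, PartialEq)]
enum Record {
    ThreadCount(u8),
    ModuleName(String),
}

/// Every way the (made-up) input format can be malformed.
#[derive(Debug, PartialEq)]
enum ParseError {
    Empty,
    UnknownTag(u8),
    Truncated,
    BadUtf8,
}

/// Turns raw bytes into a `Record` exactly once, at the boundary.
/// Assumed layout: one tag byte, followed by a tag-specific payload.
fn parse_record(bytes: &[u8]) -> Result<Record, ParseError> {
    let (&tag, payload) = bytes.split_first().ok_or(ParseError::Empty)?;
    match tag {
        // Tag 0: a single count byte.
        0 => payload
            .first()
            .copied()
            .map(Record::ThreadCount)
            .ok_or(ParseError::Truncated),
        // Tag 1: the rest of the buffer is a UTF-8 name.
        1 => std::str::from_utf8(payload)
            .map(|name| Record::ModuleName(name.to_owned()))
            .map_err(|_| ParseError::BadUtf8),
        other => Err(ParseError::UnknownTag(other)),
    }
}

fn main() {
    assert_eq!(parse_record(&[0, 4]), Ok(Record::ThreadCount(4)));
    assert_eq!(parse_record(&[9]), Err(ParseError::UnknownTag(9)));
    println!("{:?}", parse_record(b"\x01xul"));
}
```

A `match` on `Record` elsewhere in the program is checked for exhaustiveness by the compiler, which is the convenience the thread attributes to sum types on top of strict typing.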
| swolchok wrote:
| > you can't trust that your enums have a value inside
| their interval
|
| If you don't set the underlying type, assigning a value
| that doesn't match an enumerator via `static_cast` is
| undefined behavior. See
| https://en.cppreference.com/w/cpp/language/enum . (Doing
| weird pointer casting things is also undefined behavior
| per the strict aliasing rule, though, come to think of
| it, I'm not sure whether memcpying an out-of-range value
| into an enum through the "reinterpret_cast to `char*`"
| loophole is undefined behavior.)
| marcosdumay wrote:
| Even if that covered all the problem space (instead of
| replacing it with a much larger one), if your code is
| flawless, parsing and validating are equivalent.
|
| Choosing one just makes a difference because code has
| problems.
| nemetroid wrote:
| I'm assuming you are referring to this part:
|
| > If the underlying type is not fixed and the source
| value is out of range, the behavior is undefined.
|
| Note the fine print about the meaning of "out of range":
|
| > (The source value, as converted to the enumeration's
| underlying type if floating-point, is in range if it
| would fit in the smallest bit field large enough to hold
| all enumerators of the target enumeration.)
|
| So this is _not_ undefined:
|
|     enum E { A = 0, B = 1, C = 2 };
|     E valid = static_cast<E>(3);
| danShumway wrote:
| It's linked at the bottom of the article, but reminder that
| Gankra's blog (https://gankra.github.io/blah/) has a ton of other
| great writing like this.
|
| In particular, I always recommend "Text Rendering Hates You."
| nindalf wrote:
| I link that article every time I see someone on the internet
| say "that sounds easy, why don't you just"
|
| And the answer is always well, things are more complicated than
| they look. Even something as _trivial_ as rendering text on a
| screen.
| draw_down wrote:
| secondcoming wrote:
| Maybe I'm missing something, but they ported from C++ (because
| 'C++ is bad donchaknow') to Rust and still ran into problems
| parsing crash dumps?
|
| If the dump is corrupt then just stop trying to parse/make sense
| of it; it's garbage.
| Gankra wrote:
| No we removed many random crashes that the C++ code had. You
| cannot "simply" discard a crash report if something is slightly
| off because then you would discard most crash reports. And most
| debuginfo too.
|
| You can't expect "thing that runs when a process may have just
| experienced memory corruption" and "all builds of your
| application for all eternity" and "every toolchain you ever
| built your program with for all eternity" to be even vaguely
| reliable, because those things are in the past and we're trying
| to figure out how to fix the bugs people are experiencing in
| production today.
|
| It is a horribly miserable answer to tell your coworkers "yeah
| sorry I know users are getting thousands of crashes this
| morning but the crash-dumper didn't sign its name in cursive so
| I'm gonna refuse to let you read the letter it sent at all".
|
| And just an incoherent answer to say "yeah I know this is a
| stack overflow but it left the stack in a mildly corrupt state
| so I absolutely refuse to try to even look at the stack and
| figure anything out about it". Like, that is the entire purpose
| of a crashreporter, to investigate a program in an invalid
| state!
| rockdoe wrote:
| Reminds me of "Your program shouldn't have bugs in it isn't
| an acceptable position to take for a debugger", from the rr
| folks. Unfortunately I can't find the source of the quote any
| more, but it stuck in my mind.
| Gankra wrote:
| Yeah computing backtraces in a crashreporter is extremely
| similar to a debugger in that you need a lot of fudge-factor
| heuristics and fallback modes for known toolchain
| bugs or common corruptions.
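One way to picture the "don't discard the whole report" stance described above is a parser that salvages what it can and records warnings instead of returning an error. This is a minimal sketch under loud assumptions: the length-prefixed layout, the `Recovered` type, and `parse_lenient` are all invented here and have nothing to do with the real minidump format or the rust-minidump API.

```rust
/// What the lenient pass managed to salvage, plus notes on what looked wrong.
#[derive(Debug, PartialEq)]
struct Recovered {
    records: Vec<Vec<u8>>,
    warnings: Vec<String>,
}

/// Assumed toy layout: repeated `[len: u8][payload: len bytes]` records.
/// A record that claims more bytes than remain is kept as a partial payload
/// and flagged, rather than causing the whole input to be rejected.
fn parse_lenient(mut input: &[u8]) -> Recovered {
    let mut out = Recovered { records: Vec::new(), warnings: Vec::new() };
    while let Some((&len, rest)) = input.split_first() {
        let len = len as usize;
        if len > rest.len() {
            // Fallback mode: keep whatever bytes are left and stop.
            out.warnings.push(format!(
                "record claims {} bytes but only {} remain; keeping partial payload",
                len,
                rest.len()
            ));
            out.records.push(rest.to_vec());
            break;
        }
        let (payload, tail) = rest.split_at(len);
        out.records.push(payload.to_vec());
        input = tail;
    }
    out
}

fn main() {
    // Second record is truncated: it claims 3 bytes but only 1 is present.
    let report = parse_lenient(&[2, 0xAA, 0xBB, 3, 0xCC]);
    assert_eq!(report.records, vec![vec![0xAA, 0xBB], vec![0xCC]]);
    assert_eq!(report.warnings.len(), 1);
    println!("{:?}", report);
}
```

The design choice is that callers always get a `Recovered` value and decide for themselves how much to trust it, which is closer to the "fallback modes" the comment describes than a strict `Result` would be.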
| khuey wrote:
| You're probably remembering
| https://pernos.co/blog/tzcnt-portability/
| gpm wrote:
| Speaking of the rr folk, they also had the fascinating
| point that you can reliably generate a "stack trace" by
| figuring out which `call` instructions were executed with
| what values (also other jump instructions I suppose),
| instead of walking the stack. Thereby skipping the whole
| "parsing the stack is insanely difficult and unreliable"
| issue.
| glandium wrote:
| FWIW, that's from pernosco, not rr.
| gpm wrote:
| I think it's the same people?
| structural wrote:
| This is the excessively fun part of dealing with crash dumps in
| general. Many of them are going to be 1% corrupt, 99% fine, and
| somewhere in them there is likely vital information about what
| caused the corruption.
|
| So the entire reason for things like rust-minidump to exist
| is to make enough sense out of files that are known to be
| corrupt garbage to be able to find bugs.
| mrlonglong wrote:
| Do tell us more, don't leave us hanging! Loved it.
| ComputerGuru wrote:
| A better/more technical article on the same tech, from Mozilla's
| collaborators on this project:
| https://jake-shadle.github.io/crash-reporting/
| Gankra wrote:
| That article is about the client-side (generating the minidump
| for a crashed process) to this article's server-side
| (processing/analyzing the minidump).
| nindalf wrote:
| This is a fantastic article, thank you for writing it. Looking
| forward to part 2!
| pierrebai wrote:
| If the follow-up post does not make it to HN front page, I'll
| have a hole in my life.
| Gankra wrote:
| Extra shoutouts to the folks at Sentry who also flipped
| rust-minidump on as their default backend and had to deal with way
| more exotic issues than I did (and fixed them!) because although
| Firefox sees some horrendous stuff and gets a bajillion crash
| reports, it's still one application with one basically stable
| minidump writing configuration.
|
| They have to deal with basically random apps doing whatever they
| want and it sounds like hell.
| yjftsjthsd-h wrote:
| > how we got absolutely owned by simple fuzzing
|
| > You are reading part 1, wherein we build up our hubris.
|
| Props to anyone willing to own their faults this readily :)
| j3s wrote:
| What a fun read! :3 I really like your writing style. Deploying
| stuff to production is always so nerve-wracking, I related to
| that very hard. I recently developed a golang alternative to an
| old erlang-ruby-hodgepodge, and when it worked in production I
| found myself constantly not believing that nothing went wrong.
| tclancy wrote:
| Ha, weeks and months of thinking, "Please just work" and then
| it does and it's always a shock.
___________________________________________________________________
(page generated 2022-06-14 23:00 UTC)