[HN Gopher] We found and fixed a rare race condition in our sess... ___________________________________________________________________ We found and fixed a rare race condition in our session handling Author : todsacerdoti Score : 214 points Date : 2021-03-18 20:00 UTC (2 hours ago) web link (github.blog) | maddyboo wrote: | I'm very impressed by the level of detail provided in this post-mortem, but it leaves me with a few questions. | | What we know based on this report: | | - This bug was live in production for at least 3 days. | | - It was common enough to happen numerous times in the wild. | | - GitHub says that they identified some specific user sessions that may have been impacted by this bug by analyzing their logs. | | What we do not know: | | - How many leaked user sessions were identified in the follow-up investigation? | | - Is GitHub confident that they identified all leaked user sessions? | | - Was there any evidence of malicious use of these sessions? | | - Did GitHub notify the users whose sessions were leaked? | | - How long/for how many requests could these sessions have been valid between their leak and when they were definitively revoked on the 8th? | | - Is it possible that any private user data was leaked? If so, what data? | azernik wrote: | > - Is GitHub confident that they identified all leaked user sessions? | | Given that they revoked _all_ current logins, rather than just the affected ones, I assume that they have no way to tell which sessions were affected. | dfcowell wrote: | Most of that information is available in the initial report linked to in the first paragraph of this blog post: https://github.blog/2021-03-08-github-security-update-a-bug-... | maddyboo wrote: | That provides more information but doesn't fully answer any of my questions.
| | Things that are still unclear: | | - GitHub provides an estimate of the prevalence at 0.001%, | but doesn't state how many leaked sessions were actually | identified. | | - Is GitHub confident that they identified all leaked user | sessions? This has bearing on my question about notification | of users whose sessions were leaked; if they did not identify | all leaked sessions, they obviously could not contact those | users. | | - GitHub states that this bug could not be intentionally | triggered by a malicious party (why?), but they don't state | whether any of the natural occurrences may have been used | maliciously. | | - How long/for how many requests could these sessions have | been valid between their leak and when they were definitively | revoked on the 8th? | | - Is it possible that any private user data was leaked? If | so, what data? | | (Correction to my original comment: the bug was live for a | cumulative period of "less than 2 weeks") | the_mitsuhiko wrote: | I think this issue is relevant for Python programmers as well. | uwsgi for many years shipped a default allocation mode for the | WSGI environment which just cleared it between requests. One had | to switch over to `wsgi-env-behaviour=holy` to prevent it. Holy | is now the default but many people still set it to cheap because | that prevented some segfaults for a while in uWSGI. | shay_ker wrote: | Brutal - Unicorn didn't use Rack.env in a thread safe way. It's | tougher given that this was in a widely used app-server and not | even in Rails code. | | Impressed with GH folks in finding this, though I wonder how many | more of these types of bugs are out there in Ruby libs. It'd be | nice to make this more difficult to do, a la FP languages, by | making it really intentional to modify state. Maybe in Ruby 4?? 
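The env-reuse hazard being discussed (one mutable hash cleared between requests, as the write-up describes for Unicorn) can be sketched in a few lines of Ruby. All names here are illustrative, not Unicorn's actual internals:

```ruby
# One Hash reused across requests (the old behavior described in the
# post): Hash#clear empties it in place, so any reference that a
# background task captured now observes the *next* request's data.
shared_env = {}

# Request 1: a background task captures a reference to the live env.
shared_env["session.id"] = "alice-session"
captured = shared_env

# Between requests the server clears, rather than replaces, the hash:
shared_env.clear

# Request 2 fills the very same object...
shared_env["session.id"] = "bob-session"

# ...so the task reads Bob's session through Alice's reference.
leaked = captured["session.id"]   # "bob-session", not "alice-session"
```

The upstreamed fix mentioned elsewhere in the thread is the obvious one: allocate a fresh Hash per request, so captured references stay pinned to the request that created them.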
| alexeiz wrote: | > We initially thought this to be an internal reporting problem only and that we would see some data logged for an otherwise unrelated request from the background thread. | | So they had already conceded that their code is fundamentally not thread-safe, and they decided to let it slide. | tyingq wrote: | I may be missing something here. I do see that the patch[1] now creates a new request object, and thus a new "env" Ruby hash. But I don't see the old behavior described as _"allocates one single Ruby Hash that is then cleared (using Hash#clear) between each request"_ ...even after poking around in http_request.rb and other places. | | [1] https://yhbt.net/unicorn-public/66A68DD8-83EF-4C7A-80E8-3F1F... | phjesusthatguy3 wrote: | [flagged] | [deleted] | lambda_obrien wrote: | Always log important events with a timestamp and save them as long as possible. | oars wrote: | Wow, impressive! | lambda_obrien wrote: | I'm guessing that's sarcasm, I edited my comment. | gowld wrote: | > Always log important events with a timestamp and save them as long as possible. | | Except when those events are associated with private user data or behavior. | lambda_obrien wrote: | The personal data itself should not be logged, but the event itself should be logged with lots of details that aren't private info. | closeparen wrote: | Nope, _especially_ in those cases. The more sensitive the data, the more important it is to have records of who accessed or changed it and when. | temp667 wrote: | Huh? I want my bank to be VERY clear on when I transferred money for my home closing.
They need to keep records, and the browser used, IP used, authentication flow, etc. Why does this have to be thrown away? | ghiculescu wrote: | Anyone know if puma would be susceptible to the same bug or if it's unicorn specific? | jakswa wrote: | This is what I'm going to be digging for later... I'm wondering if puma also shares `env` objects across requests. | Freaky wrote: | Each server does share a "proto env" object, but it is duplicated for each Client: | | https://github.com/puma/puma/blob/0cc3f7d71d1550dfa8f545ece8... | | and Client objects are not reused: | | https://github.com/puma/puma/blob/0cc3f7d71d1550dfa8f545ece8... | pianoben wrote: | This is some fantastic debugging. I love war stories like this! They give me confidence that the world is full of mere mortals like myself, and we can do great things nonetheless. | benmmurphy wrote: | This post-mortem and changes to unicorn should be useful for a lot of other ruby shops. I suspect github aren't the only people who are not defensively copying rack data when sharing it with background threads. | thecodemonkey wrote: | Really interesting write-up. Curious how they do logging, and how they were able to get such a detailed look at the previous requests, including the HTTP response. Wouldn't it be a massive amount of data if all HTTP requests/responses were logged? (Not to mention the security implications of that.) | firebaze wrote: | As expensive as it is, Datadog provides the required level of insight - via detailed flamegraphs, among other drill-down methods - to get to the bottom of problems like this one. | | No, not affiliated with Datadog, and not convinced the cost/benefit ratio is positive for us in the long run. | ckuehl wrote: | Purely a guess, but I've sometimes been able to work backwards from the content length (very often included in HTTP server logs) to figure out what the body had to be.
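The Puma behaviour Freaky points at above - a shared "proto env" that is duplicated for each client - can be sketched as follows (simplified and hypothetical; not Puma's actual code):

```ruby
# A frozen template env owned by the server...
PROTO_ENV = { "SERVER_SOFTWARE" => "puma", "rack.multithread" => true }.freeze

# ...shallow-duplicated into a fresh Hash for every request, so one
# request's writes can never show up through another request's env.
def env_for_request(request_specific)
  PROTO_ENV.dup.merge!(request_specific)
end

env_a = env_for_request("session.id" => "alice")
env_b = env_for_request("session.id" => "bob")
env_a["session.id"]   # "alice": env_b's write touched a different object
```

One remaining caveat: `dup` is shallow, so mutating a *value* held inside the proto env would still be visible across requests; only top-level writes are isolated.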
| bob1029 wrote: | It would indeed be a massive amount of data, but bear in mind that only a small fraction of HTTP requests hitting GitHub are actually both authenticated and attempting to change the state of the system. Most requests are unauthenticated, read-only, and cause no state changes in GitHub. | vlovich123 wrote: | It's a reasonable theory, but step one required to trigger this bug is an unauthenticated request. It's unclear whether their logs indicated that, or whether they got lucky while trying to repro: someone thought to try an unauthenticated request first when the back-to-back sessions didn't repro. | whimsicalism wrote: | From the writeup, it _sounds_ like the latter, but we obviously can't be sure. | aidos wrote: | Could be that there's some tell-tale in the logs. | | We recently fixed a bug with incomplete requests by noticing in nginx logs that the size of the headers + the size of the body was the size of the buffer of a linux socket. | xvector wrote: | Distributed tracing with a small sample size or small persistence duration? | dmlittle wrote: | Given the rarity of this bug I doubt a small sample size would have captured the right requests to be able to debug this. The probability would be the probability of this bug occurring * (sample percentage)^3 [you'd need to have sampled the correct 3 consecutive requests]. | mtnygard wrote: | I'm looking forward to learning about how Rust makes this kind of bug impossible. | vlovich123 wrote: | It doesn't. The issue here seems like a more vanilla logical data race + object reuse that Rust can't really do anything about. Rust only makes sure that only one thread at a time writes and other reads don't race that write. It can't make guarantees that the relative ordering of reads and writes makes sense for your use-case. | | It can simplify certain abstractions of course so that you can provide easier tools for devs to avoid footguns.
| cogman10 wrote: | What Rust would do for this particular situation is make it painful to share an object like this across threads and make it obvious that such a share is happening. That wouldn't prevent this problem, but it would make it harder to stumble into. | skytreader wrote: | > On March 2, 2021, we received a report via our support team from a user who, while using GitHub.com logged in as their own user, was suddenly authenticated as another user. | | Sobering thought for me is how probable it is that a similar issue might be present in other platforms. The title calls it "rare" but given (a) the number of complex-enough web/app platforms in use today, (b) just the endless number of ways/infra permutations for "doing x", and (c) the size of a typical company's codebase and dependencies, could it be a more common occurrence than we think? | | Total anecdote, up for you to believe: Back in April 2016, one of my then-housemates bought a brand-new MacBook Air. A week into his ownership, he exclaims surprise that he is logged in to another Facebook account, someone we are totally not acquainted with. I borrow his MBA and try to investigate but, alas, I lack the expertise for it. The only relevant thing I can eke out is that they both went to the same event a week or so ago and probably shared a Wi-Fi AP, where a "leak" could've happened. Or maybe a hash collision on FB's side? | | I would've reported it to FB but I don't have enough details; with the two conjectures I have, I'm basically holding water with a sieve. But now the thought that someone I totally don't know might end up logged in to my FB account is a possibility that bothers me. And I have no idea how to prevent it other than minimizing what could be compromised (and no, "Get off FB!" just isn't in the cards, I'm sorry). | | Kudos to GitHub for investigating and fixing this issue.
Another thought that occurred to me while writing this is how many users of other platforms might be filing issues like this but can't provide enough detail, and so their reports are eventually marked as "Could Not Reproduce". Not exactly leaky sessions, but just general security mishaps that they would struggle to describe to support. | [deleted] | exdsq wrote: | Don't sell yourself short, I'm sure you're more than qualified to get your own MBA! | ficklepickle wrote: | If it were a hash collision, I would think the odds of it involving two users geographically close to each other would be quite small. | | That's a weird one, thanks for sharing it! | thraway123412 wrote: | Also on GOG: https://www.gog.com/forum/general/warning_massive_security_r... | | > Hello everyone! This is forum regular fronzelneekburm, posting from the account of some poor Chinese guy that gog randomly logged me into. I'll explain the situation in further detail in another post from my actual account once I logged out of this one. | benlivengood wrote: | This is the kind of bug that makes me leery of frameworks and tall application stacks. Having to trust more than one layer of multithreaded code between an application and the kernel gets riskier with every layer. | | Also another reason not to mix the main application code with the authentication/authorization code. Setting security cookies should be done by dedicated code in a dedicated process to avoid this kind of interaction, and the load-balancer layer should filter the setting of security cookies from non-security backends to enforce it. | gizmo wrote: | There is a lesson here: the implicit way we go about sharing memory between threads is fundamentally broken. Adding extra threading in order to improve performance is clearly the wrong tradeoff when you're in an environment where threads can introduce bugs like these.
When it takes 10 senior engineers to figure out why one user session gets data from another session, you've already lost. | | Shared-memory environments are very unsafe, and we should be working towards a memory model where data that belongs to different organizations can never get mixed. This isn't impossible, but it requires new programming languages that allow us to express which memory areas can't get mixed up. | | We've got to start designing our backend infrastructure differently such that this entire class of memory errors can't ever occur. | lstamour wrote: | In many languages and web frameworks, the assumption is that the application server will do this for you. For example, by issuing a new env hash with each request so that requests can't easily share memory with other requests, though as noted that might not stop background threads from retaining such memory. | | In other cases, you might assume containers would be enough separation, or maybe different databases (sharding, for example). One approach I've considered but not yet implemented is to take end-user JWT cookies and use them to authenticate to a database such that a user can only ever see the rows they should be allowed access to, and all other SQL queries would fail. | | But no matter how you implement a security control, it's possible to make mistakes and allow others to see data you didn't intend to share with them. Even Rust, famous for its explicit memory ownership model, could still have similar bugs; they just might be a bit more obvious to spot in some session connection pool implemented incorrectly, for example, or maybe exposed via a compiler optimization gone wrong. Code review will still likely be one of the primary defenses against this kind of bug, though hopefully better tooling and language support to detect race conditions or shared-ownership warnings will help over time?
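One concrete defense in the spirit of the comments above (and of the earlier point about defensively copying rack data) is to snapshot just the fields a background thread needs, rather than handing it the live env object. A hedged sketch with illustrative names:

```ruby
LOG = Queue.new   # stands in for whatever the background logger writes to

# Copy the needed scalars out of env *before* spawning the thread, so
# later mutation (or reuse) of env cannot change what gets logged.
def enqueue_background_log(env)
  snapshot = {
    "REQUEST_PATH" => env["REQUEST_PATH"].dup,
    "session.id"   => env["session.id"].dup,
  }.freeze
  Thread.new { LOG << snapshot }
end

env = { "REQUEST_PATH" => "/repo", "session.id" => "alice-session" }
t = enqueue_background_log(env)
env.clear          # simulate the server recycling the env for the next request
t.join
entry = LOG.pop    # the frozen snapshot still says "alice-session"
```

The snapshot-then-spawn ordering matters: the copy must happen on the request thread, while the env is still guaranteed to belong to this request.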
| mcherm wrote: | I just want to make clear that for me, write-ups containing this level of detail analyzing the error that occurred give me greater confidence in the company rather than less confidence. Greater even than if I had never heard of the original bug. There are many bugs I have never heard about, but any company that uses this level of care and openness is likely to be very reliable. | TedShiller wrote: | This is such a n00b mistake though, the "greater confidence" is quickly canceled out. I end up having less confidence in this company. | petters wrote: | I agree with you in general, but in this particular case they knowingly deployed thread-unsafe code in production while increasing the threaded concurrency. That should have broken internal development rules. | eli wrote: | Now imagine what happens behind the scenes at other service providers. I'd bet the ones that have policies that bury the details of incidents, on average, have worse problems. | Complexicate wrote: | Agreed. However, I smirked a bit when I read your comment, since this is ultimately coming from Microsoft. Let's hope some of this mentality seeps into other Microsoft projects. | thitcanh wrote: | GitHub is owned by Microsoft but as far as I know it operates pretty independently. Let's hope it stays that way. | craftinator wrote: | At this point we're all just waiting for that last E. | edoceo wrote: | Or, proactively moving to GitLab - either self-hosted or paid - and using GH as a simple mirror for discovery. | arthurcolle wrote: | I don't understand this reference, can you elucidate? | zelon88 wrote: | > We believe that transparency is key in earning and keeping the trust of our users and want to share more about this bug. | | I find it ironic that the largest repository of open source code is closed source and owned by the largest closed source software company on the planet, yet they somehow still "believe in transparency."
| | That's like McDonald's "transparently" recalling burgers because | of a chemical contamination, and then when you ask what else is | in the burgers they tell you it's a secret. | spaetzleesser wrote: | I wonder if open sourcing their code would even be a problem | for them now. I don't think the code is worth much vs the brand | name and scale. | baggy_trough wrote: | The smartest engineers in the world are not smart enough to write | bug free multithreaded code. | fanf2 wrote: | This is the kind of multithreading bug that Rust's borrow | checker can prevent: the compiler would have complained that | the background task and the main request handler have write | access to the env object at the same time, which is not | allowed. | astrange wrote: | Writing your website in PHP would also have prevented it, | because its amateur process per request design is actually | the best way to do things, scales well, and has the right | security properties. Having an app server that can have | requests from different users in the same memory space is a | mistake. | WJW wrote: | Nobody is smart enough to write bug free code, but at least | some companies have the smarts to fix their code and blog about | how they did it. | robotresearcher wrote: | You're right but the parent's point is that multithreaded | code with shared objects is known to be particularly | difficult to do perfectly. It's so difficult that the | rational strategy may be to architect to avoid it. We will | have bugs, but not _those_ bugs. ___________________________________________________________________ (page generated 2021-03-18 23:00 UTC)