[HN Gopher] We found and fixed a rare race condition in our sess...
       ___________________________________________________________________
        
       We found and fixed a rare race condition in our session handling
        
       Author : todsacerdoti
       Score  : 214 points
       Date   : 2021-03-18 20:00 UTC (2 hours ago)
        
 (HTM) web link (github.blog)
 (TXT) w3m dump (github.blog)
        
       | maddyboo wrote:
       | I'm very impressed by the level of detail provided in this post
       | mortem, but it leaves me with a few questions.
       | 
       | What we know based on this report:
       | 
       | - This bug was live in production for at least 3 days.
       | 
       | - It was common enough to happen numerous times in the wild.
       | 
       | - GitHub says that they identified some specific user sessions
       | that may have been impacted by this bug by analyzing their logs.
       | 
       | What we do not know:
       | 
       | - How many leaked user sessions were identified in the followup
       | investigation?
       | 
       | - Is GitHub confident that they identified all leaked user
       | sessions?
       | 
       | - Was there any evidence of malicious use of these sessions?
       | 
       | - Did GitHub notify the users whose sessions were leaked?
       | 
       | - How long/for how many requests could these sessions have been
       | valid between their leak and when they were definitively revoked
       | on the 8th?
       | 
       | - Is it possible that any private user data was leaked? If so,
       | what data?
        
         | azernik wrote:
         | > - Is GitHub confident that they identified all leaked user
         | sessions?
         | 
         | Given that they revoked _all_ current logins, rather than just
         | the affected ones, I assume that they have no way to tell which
         | sessions were affected.
        
         | dfcowell wrote:
         | Most of that information is available in the initial report
         | linked to in the first paragraph of this blog post:
         | https://github.blog/2021-03-08-github-security-update-a-bug-...
        
           | maddyboo wrote:
           | That provides more information but doesn't fully answer any
           | of my questions.
           | 
           | Things that are still unclear:
           | 
           | - GitHub provides an estimate of the prevalence at 0.001%,
           | but doesn't state how many leaked sessions were actually
           | identified.
           | 
           | - Is GitHub confident that they identified all leaked user
           | sessions? This has bearing on my question about notification
           | of users whose sessions were leaked; if they did not identify
           | all leaked sessions, they obviously could not contact those
           | users.
           | 
           | - GitHub states that this bug could not be intentionally
           | triggered by a malicious party (why?), but they don't state
           | whether any of the natural occurrences may have been used
           | maliciously.
           | 
           | - How long/for how many requests could these sessions have
           | been valid between their leak and when they were definitively
           | revoked on the 8th?
           | 
           | - Is it possible that any private user data was leaked? If
           | so, what data?
           | 
           | (Correction to my original comment: the bug was live for a
           | cumulative period of "less than 2 weeks")
        
       | the_mitsuhiko wrote:
       | I think this issue is relevant for Python programmers as well.
       | uwsgi for many years shipped a default allocation mode for the
       | WSGI environment which just cleared it between requests. One had
       | to switch over to `wsgi-env-behaviour=holy` to prevent it. Holy
       | is now the default but many people still set it to cheap because
       | that prevented some segfaults for a while in uWSGI.
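        | 
        | For reference, this is roughly how that option looks in a uWSGI
        | ini config (a sketch only, using the option name exactly as
        | given above; defaults have changed across versions):
        | 
        |     [uwsgi]
        |     # "myapp:application" is a placeholder module path
        |     module = myapp:application
        |     # allocate a fresh WSGI environ per request instead of
        |     # reusing a cleared one
        |     wsgi-env-behaviour = holy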
        
       | shay_ker wrote:
        | Brutal - Unicorn didn't handle the Rack env in a thread-safe
        | way. It's tougher given that this was in a widely used
        | app-server and not even in Rails code.
       | 
       | Impressed with GH folks in finding this, though I wonder how many
       | more of these types of bugs are out there in Ruby libs. It'd be
       | nice to make this more difficult to do, a la FP languages, by
       | making it really intentional to modify state. Maybe in Ruby 4??
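        | 
        | You can get some of that intentionality today with freezing. A
        | minimal sketch (hypothetical code, not GitHub's or Unicorn's):
        | freeze a copy of the env before handing it to a background
        | thread, so an accidental write raises immediately instead of
        | silently leaking state into another request.
        | 
        |     env = { "PATH_INFO" => "/home", "rack.session" => {} }
        |     # dup is a shallow copy; nested values are still shared
        |     snapshot = env.dup.freeze
        | 
        |     Thread.new(snapshot) do |req_env|
        |       begin
        |         req_env["PATH_INFO"] = "/other"  # raises FrozenError
        |       rescue FrozenError => e
        |         warn "background write attempt: #{e.message}"
        |       end
        |     end.join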
        
       | alexeiz wrote:
       | > We initially thought this to be an internal reporting problem
       | only and that we would see some data logged for an otherwise
       | unrelated request from the background thread.
       | 
       | So they already conceded that their code is fundamentally not
       | thread-safe and they decided to let it slide.
        
       | tyingq wrote:
       | I may be missing something here. I do see that the patch[1] now
       | creates a new request object, and thus a new "env" Ruby hash. But
        | I don't see the old behavior described as _"allocates one single
        | Ruby Hash that is then cleared (using Hash#clear) between each
        | request"_ ... even after poking around in http_request.rb and
        | other places.
       | places.
       | 
       | [1] https://yhbt.net/unicorn-
       | public/66A68DD8-83EF-4C7A-80E8-3F1F...
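        | 
        | For anyone else digging, here is a simplified sketch of the two
        | allocation strategies being discussed (parse_into is a
        | hypothetical stand-in for the header parsing, not unicorn's
        | real API):
        | 
        |     # Old behaviour as described: one reused hash, so any stale
        |     # reference sees the next request's data.
        |     REUSED_ENV = {}
        |     def handle_with_reuse(raw_request)
        |       REUSED_ENV.clear
        |       parse_into(REUSED_ENV, raw_request)
        |       REUSED_ENV
        |     end
        | 
        |     # Patched behaviour: a fresh hash per request, so a stale
        |     # reference can never be repopulated with another session.
        |     def handle_with_fresh_env(raw_request)
        |       env = {}
        |       parse_into(env, raw_request)
        |       env
        |     end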
        
        | [flagged]
        
       | [deleted]
        
       | lambda_obrien wrote:
       | Always log important events with a timestamp and save them as
       | long as possible.
        
         | oars wrote:
         | Wow, impressive!
        
           | lambda_obrien wrote:
           | I'm guessing that's sarcasm, I edited my comment.
        
         | gowld wrote:
         | > Always log important events with a timestamp and save them as
         | long as possible.
         | 
         | Except when those events are associated with private user data
         | or behavior.
        
           | lambda_obrien wrote:
            | The personal data itself should not be logged, but the event
            | should be logged with plenty of details that aren't private
            | info.
        
           | closeparen wrote:
           | Nope, _especially_ in those cases. The more sensitive the
           | data, the more important it is to have records of who
           | accessed or changed it and when.
        
           | temp667 wrote:
           | Huh? I want my bank to be VERY clear on when I transferred
           | money for my home closing (and keep that record). They need
            | to keep records of the browser used, IP used, authentication
            | flow, etc. Why does this have to be thrown away?
        
       | ghiculescu wrote:
       | Anyone know if puma would be susceptible to the same bug or if
       | it's unicorn specific?
        
         | jakswa wrote:
         | This is what I'm going to be digging for later... I'm wondering
         | if puma also shares `env` objects across requests.
        
           | Freaky wrote:
           | Each server does share a "proto env" object, but it is
           | duplicated for each Client:
           | 
           | https://github.com/puma/puma/blob/0cc3f7d71d1550dfa8f545ece8.
           | ..
           | 
           | and Client objects are not reused:
           | 
           | https://github.com/puma/puma/blob/0cc3f7d71d1550dfa8f545ece8.
           | ..
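            | 
            | i.e. roughly this shape, as a simplified sketch rather than
            | Puma's actual source:
            | 
            |     # a shared, frozen template of the static env entries...
            |     PROTO_ENV = {
            |       "rack.version"    => [1, 6],
            |       "SERVER_SOFTWARE" => "example"
            |     }.freeze
            | 
            |     # ...dup'd per client, so one request's mutations can't
            |     # show up in another request's env
            |     def env_for_new_client
            |       PROTO_ENV.dup
            |     end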
        
       | pianoben wrote:
       | This is some fantastic debugging. I love war stories like this!
       | They give me confidence that the world is full of mere mortals
       | like myself, and we can do great things nonetheless.
        
       | benmmurphy wrote:
       | This post mortem and changes to unicorn should be useful for a
       | lot of other ruby shops. I suspect github aren't the only people
       | who are not defensively copying rack data when sharing it with
       | background threads.
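        | 
        | One defensive pattern, as a sketch with hypothetical names
        | rather than any specific library's API: copy only the plain
        | values the background job needs before spawning the thread, so
        | it never holds a reference to the live (and possibly reused)
        | env hash.
        | 
        |     def log_request_async(env)
        |       snapshot = {
        |         path:       env.fetch("PATH_INFO", "").dup,
        |         user_agent: env.fetch("HTTP_USER_AGENT", "").dup,
        |       }
        |       # RequestLogger is a hypothetical background logger
        |       Thread.new(snapshot) { |data| RequestLogger.write(data) }
        |     end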
        
       | thecodemonkey wrote:
       | Really interesting write-up. Curious how they do logging, and how
       | they were able to get such a detailed look at the previous
       | requests, including the HTTP response. Wouldn't it be a massive
       | amount of data if all HTTP requests/responses were logged? (Not
       | to mention the security implications of that)
        
         | firebaze wrote:
          | As expensive as it is, Datadog provides the required level of
          | insight to get to the bottom of problems like this one, via
          | detailed flamegraphs among other drill-down methods.
         | 
         | No, not affiliated with datadog, and not convinced the
         | cost/benefit ratio is positive for us in the long run.
        
         | ckuehl wrote:
         | Purely a guess, but I've sometimes been able to work backwards
         | from the content length (very often included in HTTP server
         | logs) to figure out what the body had to be.
        
         | bob1029 wrote:
         | It would indeed be a massive amount of data, but bear in mind
         | that only a small fraction of HTTP requests hitting GitHub are
         | actually both authenticated and attempting to change the state
         | of the system. Most requests are unauthenticated read-only and
         | cause no state changes in GitHub.
        
           | vlovich123 wrote:
            | It's a reasonable theory, but step one required to trigger
            | this bug is an unauthenticated request. It's unclear whether
            | their logs indicated that, or whether they got lucky during
            | repro and someone thought to try an unauthenticated request
            | first when back-to-back authenticated sessions didn't
            | reproduce it.
        
             | whimsicalism wrote:
              | From the writeup, it _sounds_ like the latter, but we
              | obviously can't be sure.
        
             | aidos wrote:
              | Could be that there's some telltale in the logs.
              | 
              | We recently fixed a bug with incomplete requests by
              | noticing in the nginx logs that the size of the headers
              | plus the size of the body equaled the size of a Linux
              | socket buffer.
        
         | xvector wrote:
         | Distributed tracing with a small sample size or small
         | persistence duration?
        
           | dmlittle wrote:
            | Given the rarity of this bug I doubt a small sample size
            | would have captured the right requests to be able to debug
            | this. The probability would be the probability of this bug
            | occurring * (sample percentage)^3 [you'd need to have
            | sampled the correct 3 consecutive requests].
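            | 
            | Using the 0.001% prevalence mentioned upthread and an
            | assumed 1% trace sampling rate, the back-of-the-envelope
            | version:
            | 
            |     p_bug    = 0.00001   # ~0.001% of requests affected
            |     sampling = 0.01      # assumed 1% trace sampling
            |     p_seen   = p_bug * sampling ** 3
            |     # => 1.0e-11; all three consecutive requests would
            |     #    have to land in the sample, which essentially
            |     #    never happens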
        
       | mtnygard wrote:
       | I'm looking forward to learning about how Rust makes this kind of
       | bug impossible.
        
         | vlovich123 wrote:
         | It doesn't. The issue here seems like a more vanilla logical
         | data race + object reuse that Rust can't really do anything
         | about. Rust only makes sure that only one thread at a time
         | writes and other reads don't race that write. It can't make
          | guarantees that the relative ordering of reads and writes makes
         | sense for your use-case.
         | 
         | It can simplify certain abstractions of course so that you can
         | provide easier tools for devs to avoid footguns.
        
           | cogman10 wrote:
            | What Rust would do for this particular situation is make it
           | painful to share an object like this across threads and make
           | it obvious that such a share is happening. That wouldn't
           | prevent this problem, but it would make it harder to stumble
           | into.
        
       | skytreader wrote:
       | > On March 2, 2021, we received a report via our support team
       | from a user who, while using GitHub.com logged in as their own
       | user, was suddenly authenticated as another user.
       | 
       | Sobering thought for me is how probable it is that a similar
       | issue might be present in other platforms. The title calls it
       | "rare" but given (a) the number of complex-enough web/app
       | platforms in use today, (b) just the endless number of ways/infra
       | permutations for "doing x", and (c) the size of a typical
       | company's codebase and dependencies, could it be a more common
       | occurrence than we think?
       | 
       | Total anecdote, up for you to believe: Back in April 2016, one of
       | my then-housemates bought a brand-new Macbook Air. A week into
        | his ownership, he exclaims in surprise that he is logged in to
       | another Facebook account, someone we are totally not acquainted
       | with. I borrow his MBA and try to investigate but, alas, I lack
       | expertise for it. The only relevant thing I can eke out is that
       | they both went to the same event a week or so ago and probably
       | shared a Wifi AP, where a "leak" could've happened. Or maybe a
       | hash collision in FB's side?
       | 
       | I would've reported it to FB but then I don't have enough
       | details; with the two conjectures I have, I'm basically holding
       | water with a sieve. But now the thought that someone I totally
        | don't know might end up logged in to my FB account is a
       | possibility that bothers me. And I have no idea how to prevent it
       | other than minimizing what could be compromised (and no, "Get off
       | FB!" just isn't in the cards, I'm sorry).
       | 
        | Kudos to GitHub for investigating and fixing this issue. Another
       | thought that occurred to me while writing this is how many users
       | of other platforms might be filing issues like this but can't
       | provide enough detail and so their reports are eventually marked
       | as "Could Not Reproduce". Not exactly leaky sessions but just
       | general security mishaps that they would struggle to describe to
       | support.
        
         | [deleted]
        
         | exdsq wrote:
         | Don't sell yourself short, I'm sure you're more than
         | experienced to get your own MBA!
        
         | ficklepickle wrote:
         | If it were a hash collision, I would think the odds of it
         | involving two users geographically close to each other would be
         | quite small.
         | 
         | That's a weird one, thanks for sharing it!
        
         | thraway123412 wrote:
         | Also on GOG:
         | https://www.gog.com/forum/general/warning_massive_security_r...
         | 
         | > Hello everyone! This is forum regular fronzelneekburm,
         | posting from the account of some poor Chinese guy that gog
         | randomly logged me into. I'll explain the situation in further
         | detail in another post from my actual account once I logged out
         | of this one.
        
       | benlivengood wrote:
       | This is the kind of bug that makes me leery of frameworks and
       | tall application stacks. Having to trust more than one layer of
       | multithreaded code between an application and the kernel gets
       | riskier with every layer.
       | 
       | Also another reason not to mix the main application code with the
       | authentication/authorization code. Setting security cookies
       | should be done by dedicated code in a dedicated process to avoid
       | this kind of interaction, and the load-balancer layer should
       | filter the setting of security cookies from non-security backends
       | to enforce it.
        
       | gizmo wrote:
        | There is a lesson here: the way we go about sharing memory
        | between threads is implicit and fundamentally broken. Adding
       | extra threading in order to improve performance is clearly the
       | wrong tradeoff when you're in an environment where threads can
       | introduce bugs like these. When it takes 10 senior engineers to
       | figure out why one user session gets data from another session
       | you've already lost.
       | 
       | Shared memory environments are very unsafe and we should be
       | working towards a memory model where data that belongs to
       | different organizations can never get mixed. This isn't
       | impossible, but it requires new programming languages that allow
       | us to express which memory areas can't get mixed up.
       | 
       | We've got to start designing our backend infrastructure
       | differently such that this entire class of memory errors can't
       | ever occur.
        
         | lstamour wrote:
         | In many languages and web frameworks, the assumption is that
         | the application server will do this for you. For example, by
          | issuing a new env object with each request such that requests
          | can't easily share memory with other requests, though as noted
          | that might not stop background threads from retaining
          | references to that memory.
         | 
         | In other cases, you might assume containers would be enough
         | separation, or maybe different databases (sharding, for
         | example). One approach I've considered but not yet implemented
         | is to take end-user JWT cookies and use them to authenticate to
         | a database such that a user can only ever see the rows they
         | should be allowed access to, and all other SQL queries would
         | fail.
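          | 
          | An app-level approximation of that idea, sketched with
          | hypothetical table and column names (db is assumed to be a PG
          | connection): derive the user id from the signed cookie itself
          | and scope every query by it, so nothing read from shared
          | request state can widen what a session sees.
          | 
          |     require "jwt"
          | 
          |     def notes_for(jwt_cookie, secret, db)
          |       claims, _header = JWT.decode(jwt_cookie, secret, true,
          |                                    { algorithm: "HS256" })
          |       # the "sub" claim identifies the user; return only
          |       # rows that user owns
          |       db.exec_params(
          |         "SELECT * FROM notes WHERE owner_id = $1",
          |         [claims["sub"]]
          |       )
          |     end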
         | 
         | But no matter how you implement a security control, it's
         | possible to make mistakes and allow others to see data you
          | didn't intend to share with them. Even Rust, famous for its
          | explicit memory ownership model, could still have similar
          | bugs; they just might be a bit more obvious to spot in, say,
          | an incorrectly implemented session connection pool, or maybe
          | exposed via a compiler optimization gone wrong. Code review
          | will still likely be one of the primary defenses against this
          | kind of bug, though hopefully better tooling and language
          | support for detecting race conditions and shared ownership
          | will help over time.
        
       | mcherm wrote:
        | I just want to make clear that for me, write-ups containing this
       | level of detail analyzing the error that occurred give me greater
       | confidence in the company rather than less confidence. Greater
       | even than if I had never heard of the original bug. There are
       | many bugs I have never heard about, but any company that uses
       | this level of care and openness is likely to be very reliable.
        
         | TedShiller wrote:
          | This is such a n00b mistake, though, that the "greater
          | confidence" is quickly canceled out. I end up having less
          | confidence in this company.
        
         | petters wrote:
         | I agree with you in general, but in this particular case they
         | knowingly deployed thread-unsafe code in production while
         | increasing the threaded concurrency. That should have broken
         | internal development rules.
        
           | eli wrote:
           | Now imagine what happens behind the scenes at other service
           | providers. I'd bet the ones that have policies that bury the
           | details of incidents, on average, have worse problems.
        
         | Complexicate wrote:
         | Agreed. However, I smirked a bit when I read your comment,
         | since this is ultimately coming from Microsoft. Let's hope some
          | of this mentality seeps into other Microsoft projects.
        
           | thitcanh wrote:
           | GitHub is owned by Microsoft but as far as I know it operates
           | pretty independently. Let's hope it stays that way.
        
             | craftinator wrote:
             | At this point we're all just waiting for that last E.
        
               | edoceo wrote:
               | Or, pro-actively moving to GitLab - either self-hosted or
               | paid and using GH as a simple mirror for discovery.
        
               | arthurcolle wrote:
               | I don't understand this reference, can you elucidate
        
       | zelon88 wrote:
       | > We believe that transparency is key in earning and keeping the
       | trust of our users and want to share more about this bug.
       | 
       | I find it ironic that the largest repository of open source code
        | is closed source and owned by the largest closed source
       | software company on the planet, yet they somehow still "believe
       | in transparency."
       | 
       | That's like McDonald's "transparently" recalling burgers because
       | of a chemical contamination, and then when you ask what else is
       | in the burgers they tell you it's a secret.
        
         | spaetzleesser wrote:
         | I wonder if open sourcing their code would even be a problem
         | for them now. I don't think the code is worth much vs the brand
         | name and scale.
        
       | baggy_trough wrote:
       | The smartest engineers in the world are not smart enough to write
       | bug free multithreaded code.
        
         | fanf2 wrote:
         | This is the kind of multithreading bug that Rust's borrow
         | checker can prevent: the compiler would have complained that
         | the background task and the main request handler have write
         | access to the env object at the same time, which is not
         | allowed.
        
           | astrange wrote:
           | Writing your website in PHP would also have prevented it,
           | because its amateur process per request design is actually
           | the best way to do things, scales well, and has the right
           | security properties. Having an app server that can have
           | requests from different users in the same memory space is a
           | mistake.
        
         | WJW wrote:
         | Nobody is smart enough to write bug free code, but at least
         | some companies have the smarts to fix their code and blog about
         | how they did it.
        
           | robotresearcher wrote:
           | You're right but the parent's point is that multithreaded
           | code with shared objects is known to be particularly
           | difficult to do perfectly. It's so difficult that the
           | rational strategy may be to architect to avoid it. We will
           | have bugs, but not _those_ bugs.
        
       ___________________________________________________________________
       (page generated 2021-03-18 23:00 UTC)