[HN Gopher] March 20 ChatGPT outage: Here's what happened
       ___________________________________________________________________
        
       March 20 ChatGPT outage: Here's what happened
        
       Author : zerojames
       Score  : 234 points
       Date   : 2023-03-24 16:08 UTC (6 hours ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | Kuinox wrote:
        | I managed to manually produce this bug 2 months ago. As they
        | don't have a bug bounty, I didn't submit it. By starting a
        | conversation and refreshing before ChatGPT had time to answer, I
        | managed to reproduce this bug 2-3 times in January.
        
         | breckenedge wrote:
         | did you reach out via https://openai.com/security.txt?
        
       | rvz wrote:
        | First GitHub, then OpenAI. Two of Microsoft's finest(!) (majority
        | owned and acquired) companies at the top of HN announcing a
       | serious security incident.
       | 
       | It's quite unsettling to see this leak of highly sensitive
       | information and a private key exposure as well. Doesn't look good
       | and seems like they don't take security seriously.
        
         | skybrian wrote:
         | In the case of OpenAI, the product is more of a research demo
         | that had to be drastically scaled up, though. From an
         | operations point of view it's more like a startup.
        
         | deltree7 wrote:
          | Nobody cares, and this is yet another case study of HN being
          | out of touch with reality.
        
           | rvz wrote:
           | > _" Nobody cares"_
           | 
           | Yet another case study of absolutism, which can be simply
           | dismissed.
           | 
            | People paying for ChatGPT care when it goes down and their
            | details and chats get leaked, and that certainly extends
            | beyond HN. Same with GitHub. The two have ~100M users
            | between them.
           | 
           | That's the reality.
        
             | sebzim4500 wrote:
             | I'm paying for ChatGPT and I don't care about this any more
             | than the many, many other services I use that have at some
              | point had an embarrassing security issue.
        
             | deltree7 wrote:
             | I'm paying and I don't care. If I write perfect bug-free
             | code, lead a perfect life, live in a perfect world, I'd be
             | upset.
             | 
             | But, I know that shit happens and the reliability meter
             | should be flexible for different things (bridges, heart
             | surgery and chat agent).
             | 
              | If I trained my brain to bitch, whine, and moan about
              | everything, I wouldn't have the resources to care about
              | really important things.
        
       | kkarpkkarp wrote:
        | Is it only me who hasn't seen any chat history since yesterday?
        | Chat generally does not work for me either (you can type a
        | message, but clicking the button or hitting enter / ctrl+enter
        | has no effect).
        | 
        | In the chat history there is a "retry" button, but clicking it
        | and inspecting the result shows "internal server error".
        
       | LarsDu88 wrote:
       | I called it:
       | https://news.ycombinator.com/item?id=35267569#35270165
        
       | fintechie wrote:
       | I reported this race condition via ChatGPT's internal feedback
        | system after I saw other users' chat titles loading in my sidebar
       | a couple of times (around 7-8 weeks ago). Didn't get a response,
       | so I assumed it was fixed...
       | 
       | Hopefully they'll start a bug bounty program soon, and prioritise
       | bug reports over features.
        
         | totallyunknown wrote:
          | Same for me. Actually, only the summary of the history was
          | from a different user; the content itself was mine.
        
           | sebzim4500 wrote:
           | The claim made at the time was that the titles were not from
           | other people and were in fact caused by the model
           | hallucinating after the input query timed out (or something
           | like that). Obviously that sounds a little suspect now, but
           | it might be true.
        
             | nwienert wrote:
              | If so, that's a lie. If you look at the Reddit threads,
              | there's no way those were not other specific users'
              | histories, as they read with the logical coherence of real
              | browsing history. E.g., one I saw had entries like "what
              | is X" followed by "how to X" or something similar. Some
              | were all in Japanese, others all in Chinese. If it were
              | random, you wouldn't see clear logical consistency across
              | the list.
        
         | jetrink wrote:
         | The explanation at the time was that unavailable chat data (due
         | to, e.g. high load) resulted in a null input sometimes being
         | presented to the chat summary system, which in turn caused the
         | system to hallucinate believable chat titles. It's possible
         | that they misdiagnosed the issue or that both bugs were present
         | and they caught the benign one before the serious one.
        
       | ElijahLynn wrote:
       | That is a pretty good disclosure that creates trust.
        
       | kristianpaul wrote:
        | This is more a data leak than an outage...
        
         | sebzim4500 wrote:
         | It was down for quite a while, so I would call it an outage.
        
       | qwertox wrote:
        | Nice writeup; it's fair in the content presented to us.
       | 
       | Yet I'm wondering why there is no checking if the response does
       | actually belong to the issued query.
       | 
       | The client issuing a query can pass a token and verify upon
       | answer that this answer contains the token.
       | 
        | TBH, as a user of the client I would expect the library to have
        | this feature built-in. If I'm using the library to solve a
        | problem and it doesn't handle this edge case, doing so myself
        | would be a fairly low priority, probably because I'm lazy.
       | 
       | I hope that the fix they offered to Redis Labs does contain a
        | solution to this problem and that every one of us using this
        | library will be able to profit from the effort put into resolving
       | the issue.
       | 
       | It doesn't [0], so the burden is still on the developer using the
       | library.
       | 
       | [0] https://github.com/redis/redis-
       | py/commit/66a4d6b2a493dd3a20c...
       | 
       | ---
       | 
       | Edit: Now I'm confused, this issue [1] was raised on March 17 and
       | fixed on March 22, was this a regression? Or did OpenAI start
       | using this library on March 19-20?
       | 
        | Interesting comment:
       | 
       | > drago-balto commented 3 hours ago
       | 
       | > Yep, that's the one, and the #2641 has not fixed it fully, as I
       | already commented here: #2641 (comment)
       | 
        | > I am asking for this ticket to be re-opened, since I can still
        | reproduce the problem in the latest 4.5.3. version
       | 
       | [1] https://github.com/redis/redis-
       | py/issues/2624#issue-16293351...
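The check qwertox suggests (the client passes a token with each query and verifies that the answer echoes it) can be sketched as a toy. The class and queue mechanics below are hypothetical illustrations, not redis-py's API:

```python
import uuid
from collections import deque

class TokenCheckedClient:
    """Toy client: every request gets a unique token, and a reply is
    accepted only if it echoes the token of the expected request."""

    def __init__(self):
        self._pending = deque()   # tokens of in-flight requests, FIFO
        self._wire = deque()      # (token, payload) replies from the server

    def send(self, payload: str) -> str:
        token = str(uuid.uuid4())
        self._pending.append(token)
        # A cooperating server echoes the token alongside its reply.
        self._wire.append((token, payload.upper()))
        return token

    def recv(self) -> str:
        expected = self._pending.popleft()
        token, payload = self._wire.popleft()
        if token != expected:
            # Streams are out of sync: drop the reply instead of
            # handing one user's data to another.
            raise RuntimeError("response does not match issued query")
        return payload

client = TokenCheckedClient()
client.send("hello")
print(client.recv())          # HELLO

client.send("alice's chats")
client.send("bob's chats")
client._pending.popleft()     # simulate a cancelled request whose reply
                              # is still sitting on the connection
try:
    client.recv()             # bob would have received alice's reply
except RuntimeError as e:
    print("caught:", e)
```

With a check like this, a desynchronized connection produces a loud error rather than silently serving another user's data.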
        
         | [deleted]
        
         | menzoic wrote:
         | That sounds more like a hindsight thing. In most systems
         | authorization doesn't happen at the storage layer. Most queries
         | fetch data by an identifier which is only assumed to be valid
         | based on authorization that typically happens at the edge and
         | then everything below relies on that result.
         | 
         | It's not the safest design but I wouldn't say the client should
         | be expected to implement it. That security concern is at the
         | application layer and the actual needs of the implementation
         | can be wildly different depending on the application. You can
         | imagine use cases for redis where this isn't even relevant,
         | like if it's being used to store price data for stocks that
         | update every 30 seconds. There's no private data involved
         | there. It's out of scope for a storage client to implement.
        
         | [deleted]
        
       | benmmurphy wrote:
       | This is a common bug with a lot of software. For example some
       | HTTP clients that do pooling won't invalidate the connection
       | after timing out waiting for the response.
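The safe pattern for that pooling pitfall can be sketched roughly like this (hypothetical pool and connection classes, not any particular HTTP library): on a read timeout the connection must be closed, never returned to the pool, because the late reply may still arrive and would be delivered to the next request that reuses the connection.

```python
import socket

class FakeConnection:
    """Stand-in for a pooled network connection."""
    def __init__(self, hangs=False):
        self.hangs = hangs
        self.closed = False

    def send(self, data: bytes) -> None:
        pass  # pretend to write the request

    def recv(self) -> bytes:
        if self.hangs:
            raise socket.timeout("read timed out")
        return b"HTTP/1.1 200 OK"

    def close(self) -> None:
        self.closed = True

class ConnectionPool:
    def __init__(self):
        self.idle = []

    def get(self) -> FakeConnection:
        return self.idle.pop() if self.idle else FakeConnection()

    def put(self, conn: FakeConnection) -> None:
        self.idle.append(conn)

def fetch(pool: ConnectionPool, conn=None) -> bytes:
    conn = conn or pool.get()
    try:
        conn.send(b"GET / HTTP/1.1\r\n\r\n")
        data = conn.recv()
    except socket.timeout:
        conn.close()   # the eventual reply would desynchronize the stream
        raise          # crucially: do NOT return conn to the pool
    pool.put(conn)     # clean request/response cycle: safe to reuse
    return data
```

The whole fix lives in the `except` branch: a timed-out connection is poisoned and must be discarded, even though closing it feels wasteful.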
        
       | picodguyo wrote:
       | If you're subscribed to their status page, you'll know it's
       | actually unusual for a day to go by without an outage alert from
       | OpenAI. They don't usually write them up like this but I guess
        | this counts as PII leak disclosure for them? For having raised
        | billions of dollars, they are comically immature from a
        | reliability and support perspective.
        
         | thequadehunter wrote:
          | To be fair, they accidentally made a game-changing breakthrough
         | that gained millions of users overnight, and I don't think they
         | were ready for it.
         | 
         | Before chatgpt, most normal people had never heard of OpenAI.
         | Their flagship product was basically an API that only
         | programmers could make useful.
         | 
         | Team leaders at OpenAI have stated that they were not expecting
         | the success, let alone the highest adoption rate for any
         | product in history. In their minds, it was just a cleaned-up
         | version of a 2-year old product. It was billed as a research
         | preview.
         | 
         | So, all of a sudden you go from hiring mostly researchers
         | because you only have to maintain an API and some mid-traffic
         | web infra, to suddenly having the fastest growing web product
         | in history and having to scale up as fast as you can. Keep in
         | mind that they didn't get backing from Microsoft until January
         | 23, 2023-- that was only 2 months ago.
         | 
         | I'd say we should cut them some slack.
        
           | picodguyo wrote:
           | These problems predate ChatGPT. Their API has been on the
           | market for nearly 3 years. And they raised their first $1B in
           | 2019. That's plenty of money and time to hire capable
           | leadership.
        
             | [deleted]
        
       | construct0 wrote:
       | The bug: https://github.com/redis/redis-py/issues/2624
        
         | braindead_in wrote:
         | Was this written by ChatGPT? Maybe it found the bug as well,
         | who knows.
        
         | photochemsyn wrote:
         | > "If a request is canceled after the request is pushed onto
         | the incoming queue, but before the response popped from the
         | outgoing queue, we see our bug: the connection thus becomes
         | corrupted and the next response that's dequeued for an
         | unrelated request can receive data left behind in the
         | connection."
         | 
         | The OpenAI API was incredibly slow and lots of requests
         | probably got cancelled (I certainly was doing that) for some
         | days. I imagine someone could write a whole blog post about how
         | that worked, it would be interesting reading.
        
         | construct0 wrote:
         | .... "I am asking for this ticket to be re-opened, since I can
         | still reproduce the problem in the latest 4.5.3. version"
        
         | chatmasta wrote:
         | The PR: https://github.com/redis/redis-py/pull/2641
         | 
         | According to the latest comments there, the bug is only
         | partially fixed.
        
       | chatmasta wrote:
       | Why did it take them _9 hours_ to notice? The problem was
       | immediately obvious to anyone who used the web interface, as
       | evidenced by the many threads on Reddit and HN.
       | 
       | > between 1 a.m. and 10 a.m. Pacific time.
       | 
       | Oh... so it was because they're based in San Francisco. Do they
       | really not have a 24/7 SRE on-call rotation? Given the size of
       | their funding, and the number of users they have, there is really
       | no excuse not to at least have some basic monitoring system in
       | place for this (although it's true that, ironically, this
       | particular class of bug is difficult to detect in a monitoring
       | system that doesn't explicitly check for it, despite being
       | immediately obvious to a human observer).
       | 
       | Perhaps they should consider opening an office in Europe, or
       | hiring remotely, at least for security roles. Or maybe they could
       | have GPT-4 keep an eye on the site!
        
         | guessmyname wrote:
          | > _[...] it was because they're based in San Francisco. Do
          | they really not have a 24/7 SRE on-call rotation?_
         | 
         | OpenAI is hiring Site Reliability Engineers (SRE) in case you,
         | or anyone you know, is interested in working for them:
         | https://openai.com/careers/it-engineer-sre . Unfortunately, the
         | job is an onsite role that requires 5 days a week in their San
         | Francisco office, so they do not appear to be planning to have
         | a 24/7 on-call rotation any time soon.
         | 
         | Too bad because I could support them in APAC (from Japan).
         | 
         | Over 10 years of industry experience, if anyone is interested.
        
           | p1esk wrote:
           | Also, I heard their interviews (for any technical position)
           | are very tough.
        
         | eep_social wrote:
         | Staffing an actual 24x7 rotation of SREs costs about a million
         | dollars a year in base salary as a floor and there are few SREs
         | for hire. A metrics-based monitor probably would have triggered
         | on the increased error rate but it wouldn't have been
         | immediately obvious that there was also a leaking cache. The
         | most plausible way to detect the problem from the user
         | perspective would be a synthetic test running some affected
         | workflow, built to check that the data coming back matches
         | specific, expected strings (not just well-formed). All possible
         | but none of this sounds easy to me. Absolutely none of this is
         | plausible when your startup business is at the top of the news
         | cycle every single day for the past several months.
        
           | namaria wrote:
            | Every system failure prompts people to exclaim "why aren't
            | there safeguards?". Every time. Well, guess what: if we try
            | to do new stuff, we will run into new problems.
        
             | wouldbecouldbe wrote:
             | There is nothing new about using redis for cache, or
             | returning a list for a user.
        
               | namaria wrote:
               | Are you trying to say cache invalidation in a distributed
               | system is a trivial problem?
        
               | oulu2006 wrote:
                | It's non-trivial, but it's also not that hard; there
                | are well-known strategies for achieving it. Especially
                | if you relax guarantees and only promise eventual
                | consistency, it becomes fairly easy - we do this, for
                | example, and have few problems with it.
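One common way to relax guarantees to eventual consistency, as described above, is a simple TTL: stale reads are possible but bounded in time, and nothing ever needs explicit invalidation. A minimal in-process sketch (the class is illustrative, with an injectable clock for clarity):

```python
import time

class TTLCache:
    """Eventually consistent cache: entries silently expire after `ttl`
    seconds, so staleness is bounded without explicit invalidation."""

    def __init__(self, ttl: float = 30.0):
        self.ttl = ttl
        self._store = {}   # key -> (value, expiry timestamp)

    def set(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._store[key] = (value, now + self.ttl)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if now >= expires:
            del self._store[key]   # lazy eviction on read
            return None
        return value

cache = TTLCache(ttl=30.0)
cache.set("price:AAPL", 157.2, now=0.0)
print(cache.get("price:AAPL", now=10.0))   # 157.2 (still fresh)
print(cache.get("price:AAPL", now=31.0))   # None (expired; refetch)
```

A reader can see at most `ttl` seconds of staleness, which is the whole contract; Redis offers the same semantics natively via `SET key value EX 30`.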
        
               | chatmasta wrote:
               | I'm not disagreeing with you, and I'm not the commenter
               | you're replying to, but it's worth noting that cache
               | leakage and cache invalidation are two different
               | problems.
        
               | namaria wrote:
               | You're right. Thanks for pointing that out. My original
               | point still stands, distributed systems are hard and
               | people demanding zero failures are setting an impossible
               | standard.
        
           | sosodev wrote:
           | "there are few SREs for hire"
           | 
            | How do you figure? If you mean there are few SREs with
            | several years of experience, you might be right. SRE is a
            | fairly new title, so that's not too surprising.
           | 
            | However, my experience from a recent job search is that most
            | companies aren't hiring SREs right now because they consider
            | reliability a luxury. In fact, I was in search of a new SRE
            | position because I was laid off for that very reason.
        
             | chatmasta wrote:
             | You don't even need an SRE to have an on-call rotation; you
             | could ping a software engineer who could at least recognize
             | the problem and either push a temporary fix, or try to wake
             | someone else to put a mitigation in place (e.g. disabling
             | the history API, which is what they eventually did).
             | 
             | However, I think the GP's point about this class of bug
             | being difficult to detect in a monitoring system is the
             | more salient issue.
        
               | eep_social wrote:
               | Well hang on! Your question was why was the time to
               | detect so high and you specifically mentioned 24x7 SRE so
               | I thought that's what we were talking about ;)
               | 
               | And I do think the answer is that monitoring is easy but
               | good monitoring takes a whole lot of work. Devops teams
               | tend to get to sufficient observability where a SRE team
               | should be dedicating its time to engineering great
               | observability because the SRE team is not being pushed by
               | product to deliver features. A functional org will
               | protect SRE teams from that pressure, a great one will
               | allow the SRE team to apply counter-pressure from the
               | reliability and non-functional perspective to the product
               | perspective. This equilibrium is ideal because it allows
               | speed but keeps a tight leash on tech debt by developing
               | rigor around what is too fast or too many errors or
               | whatever your relevant metrics are.
        
             | eep_social wrote:
             | I've anecdotally observed the opposite. I have noticed SRE
             | jobs remain posted, even by companies laying off or
             | announcing some kind of hiring slowdown over the last
             | quarter or so. More generally, businesses that have decided
             | that they need SRE are often building out from some kind of
             | devops baseline that has become unsustainable for the dev
             | team. When you hit that limit and need to split out a
             | dedicated team, there aren't a ton of alternatives to
             | getting a SRE or two in and shuffling some prod-oriented
             | devs to the new SRE team (or building a full team from
             | scratch which is what the $$ was estimating above). Among
              | other things, the SRE bailiwick includes capacity planning
             | and resource efficiency; SRE will save you money in the
             | long term.
             | 
             | On a personal note, I am sorry to hear that your job search
             | has not yet been fruitful. Presumably I am interested in
             | different criteria from you --- I have found several
             | postings that are quite appealing to the point where I am
             | updating my CV and applying, despite being weakly motivated
             | at the moment.
        
           | pharmakom wrote:
           | They raised a billion dollars.
        
           | dharmab wrote:
           | You don't necessarily need a full team of SREs- you can also
           | have a lightly staffed ops center with escalation paths.
        
             | eep_social wrote:
             | I don't think that model has the properties you think it
             | does. Someone still has to take call to back the operators.
             | Someone has to build the signals that the ops folks watch.
             | Someone has to write criteria for what should and should
             | not be escalated, and in a larger org they will also need
             | to know which escalation path is correct. And on and on --
             | the work has to get done somewhere!
        
               | majormajor wrote:
               | The way those criteria usually get written in a startup
               | with mission-critical customer-facing stuff (like this
               | privacy issue) is that _first_ the person watching
               | Twitter and email and whatever else pages the engineers,
                | and _then_ there's a retro on whether or not that
               | particular one was necessary, lather, rinse, repeat.
               | 
               | All you need on day 1 is someone to watch the
               | (metaphorical) phones + a way to page an engineer. Don't
               | start by spending a million bucks a year, start by having
               | a first aid kit at the ready.
               | 
               | Perhaps they could also help this person out by looking
               | into some sort of fancy software to automatically
               | summarize messages that were being sent to them, or their
               | mentions on Reddit, or something, even?
        
           | scarmig wrote:
           | Since it now handles visual inputs, I wonder how hard it'd be
           | to get GPT to monitor itself. Have it constantly observe a
           | set of screenshares of automated processes starting and
           | repeating ChatGPT sessions on prod, alert the on-call when it
           | notices something "weird."
        
         | inconceivable wrote:
         | nobody qualified wants the 24/7 SRE job unless it pays an
         | enormous amount of money. i wouldn't do it for less than 500
         | grand cash. getting woken up at 3am constantly or working 3rd
         | shift is the kind of thing you do with a specific monetary goal
         | in mind (i.e., early retirement) or else it's absolute hell.
         | 
         | combine that with ludicrous requirements (the same as a senior
         | software engineer) and you get gaps in coverage. ask yourself
         | what senior software engineer on earth would tolerate getting
         | called CONSTANTLY at 3am, or working 3rd shift.
         | 
         | the vast majority of computer systems just simply aren't as
         | important as hospitals or nuclear power plants.
        
           | nijave wrote:
            | Not only that, but you probably need to follow the sun if
            | you want a <30 minute response time.
           | 
           | Given a system that collects minute-based metrics, it
           | generally takes around 5-10 minutes to generate an alert.
           | Another 5-10 minutes for the person to get to their computer
           | unless it's already in their hand (what if you get unlucky
           | and on-call was taking a shower or using the toilet?). After
           | that, another 5-10 minutes to see what's going on with the
           | system.
           | 
           | After all that, it usually takes some more minutes to
           | actually fix the problem.
           | 
            | Dropbox has a nice article on all the changes they made to
            | streamline incident response:
           | https://dropbox.tech/infrastructure/lessons-learned-in-
           | incid...
        
           | mnahkies wrote:
           | Timezones are a thing - your 3am is someone's 9am and may be
           | a significant part of your customer base.
           | 
           | Being paged constantly is a sign of bad alerts or bad systems
           | IMO - either adjust the alert to accept the current reality
           | or improve the system
        
             | inconceivable wrote:
              | spinning up a subsidiary in another country (especially one
              | with very strict labor laws, like in european countries) is
              | not as easy as "find some guy on the internet and pay him
              | to watch your dashboard". and then you'd have to give him
              | root so he can actually fix stuff without calling your
              | domestic team, which would defeat the whole purpose.
             | 
             | also, even getting paged ONCE a month at 3am will fuck up
             | an entire week at a time if you have a family. if it
             | happens twice a month, that person is going to quit unless
             | they're young and need the experience.
        
               | mnahkies wrote:
               | Sorry to be clear I was replying to this part of your
               | comment
               | 
               | > the vast majority of computer systems just simply
               | aren't as important as hospitals or nuclear power plants.
               | 
               | I agree that the stakes are lower in terms of harm, but
               | was trying to express that whilst it might not be life
               | and death, it might be hindering someone being able to do
               | their job / use your product - eg: it still impacts
               | customer experience and your (business) reputation.
               | 
               | False pages for transient errors are bad - ideally you
               | only get paged if human intervention is required, and
               | this should form a feedback cycle to determine how to
               | avoid it in future. If all the pages are genuine problems
               | requiring human action then this should feed into tickets
               | to improve things
        
               | chatmasta wrote:
               | It's really not that difficult, and there are providers
               | like Deel who can manage it all for you, to the point you
               | just ACH them every month.
               | 
               | Source: co-founder of a remote startup with employees in
               | five countries
        
               | inconceivable wrote:
               | like you said, timezones are a thing. now you're managing
               | a global team.
        
               | Godel_unicode wrote:
               | That sounds harder than it is, especially if you already
               | allow remote work. It mostly just forces you to have
               | better docs.
        
           | oulu2006 wrote:
           | I did that for a few years, and wasn't on 500k a year, but
           | I'm also the company co-founder, so you could argue that a
           | "specific monetary goal" was applicable.
        
         | [deleted]
        
         | cloudking wrote:
         | Probably because they launched ChatGPT as an experiment and
         | didn't think it would blow up, needing full time SRE etc. I
         | don't think it was designed for scale and reliability when they
         | launched.
        
         | majormajor wrote:
         | You don't need 24/7 SREs, you could do it with 24/7 first-line
         | customer support staff monitoring Twitter, Reddit, and official
         | lines of comms that have the ability to page the regular
         | engineering team.
         | 
         | That's a lot easier to hire, and lower cost. More training
         | required of what is worth waking people up over; way less in
         | terms of how to fix database/cache bugs.
        
         | CubsFan1060 wrote:
          | Do events like this cause them to lose enough revenue that it
          | would make sense to hire a bunch of SREs?
        
           | nijave wrote:
            | Probably the real reason. I assume they intend to make money
            | off enterprise contracts, which would include SLAs. Then
            | they'd set their support level based on that.
        
             | chatmasta wrote:
             | Given the Microsoft partnership, they might not even need
             | to manage any real infrastructure. Just hand it off to
             | Azure and let them handle the details.
        
       | killerstorm wrote:
       | Serious question: Why do people feel it's necessary to use a
       | redis cluster?
       | 
       | I understand in early 2000s we were using spinning disks and it
       | was the only way. Well, we don't use spinning disks any more, do
       | we?
       | 
       | A modern server can easily have terabytes of RAM and petabytes of
       | NVMe, so what's stopping people from just using postgres?
       | 
       | A cluster of radishes is an anti-pattern.
        
         | lofaszvanitt wrote:
         | People know it, that's all.
        
         | cplli wrote:
         | For caching the query results you get from your database. Also
         | it's easier to spin up Redis and replicate it closer to your
         | user than doing that with your main database. From my
         | experience anyway.
        
           | killerstorm wrote:
           | > For caching the query results you get from your database.
           | 
           | This only makes sense if queries are computationally
           | intensive. If you're fetching a single row by index you
           | aren't winning much (or anything).
        
             | dpkirchner wrote:
             | Of course? I'm not really sure what the original question
             | actually is if you know that users benefit from caching the
             | results of computationally intensive queries.
        
               | killerstorm wrote:
               | OpenAI uses redis to store pieces of text. Fetching
               | pieces of text is not computationally intensive.
        
               | mannyv wrote:
               | Most likely they have them in an rdbms, so it's more like
               | joining a forum thread together. Not expensive, but why
               | not prebuild and store it instead?
        
             | acuozzo wrote:
             | > This only makes sense if queries are computationally
             | intensive.
             | 
             | Or if the link to your DB is higher latency than you're
             | comfortable with.
        
           | mike_hearn wrote:
           | I think the idea is that if your db can hold the working set
           | in RAM and you're using a good db + prepared queries, you can
           | just let it absorb the full workload because the act of
           | fetching the data from the db is nearly as cheap as fetching
           | it from redis.
        
         | xp84 wrote:
         | I'm confused on why the need to complicate something as
         | seemingly-straightforward as a KV store into a series of queues
         | that can get all mixed up. I asked ChatGPT to explain it
         | though, and it sounds like the justification for its existence
         | is that it doesn't "block the event loop" while a request is
         | "waiting for a response from Redis."
         | 
         | Last time I checked, Redis doesn't take that long to provide a
         | response. And if your Redis servers actually are that
         | overloaded that you're seeing latency in your requests, it
         | seems like simple key-based sharding would allow horizontally
         | scaling your Redis cluster.
         | 
         |  _Disclaimer: I am probably less smart than most people who
          | work at OpenAI so I'm sure I'm missing some details. Also this
         | is apparently a Python thing and I don't know it beyond surface
         | familiarity._
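The key-based sharding mentioned above can be sketched in a few lines: hash each key to pick one of N Redis nodes. The node list here is hypothetical; real deployments usually prefer consistent hashing or Redis Cluster's 16384 hash slots over a bare modulo, since modulo remaps most keys whenever N changes.

```python
# Hedged sketch of key-based sharding across Redis nodes.
import zlib

NODES = ["redis-0:6379", "redis-1:6379", "redis-2:6379"]  # hypothetical

def node_for_key(key: str) -> str:
    # crc32 is a stable, fast hash; modulo selects the shard.
    return NODES[zlib.crc32(key.encode()) % len(NODES)]
```

Because every client hashes the same way, a given key always routes to the same node, so load spreads across the cluster without any coordination.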
        
           | zmj wrote:
           | I'm not familiar with the Python client specifically, but
           | Redis clients generally multiplex concurrent requests onto a
           | single connection per Redis server. That necessitates some
           | queueing.
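The queueing that multiplexing necessitates can be sketched like this (illustrative only, not redis-py's actual code): many callers share one connection, the server answers in request order, so the client keeps a FIFO of pending requests and pairs each incoming response with the oldest entry.

```python
from collections import deque

class MultiplexedClient:
    def __init__(self):
        self.pending = deque()  # requests awaiting responses, oldest first

    def send(self, command):
        self.pending.append(command)  # the actual socket write is elided

    def on_response(self, payload):
        # FIFO pairing: the response is assumed to belong to the oldest
        # outstanding request. If the queue ever slips by one, every
        # later caller receives someone else's data.
        command = self.pending.popleft()
        return (command, payload)

client = MultiplexedClient()
client.send("GET user:1")
client.send("GET user:2")
print(client.on_response("alice"))  # ('GET user:1', 'alice')
print(client.on_response("bob"))    # ('GET user:2', 'bob')
```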
        
         | adrr wrote:
         | My redis clusters are 10x more cost effective than my
         | postgresdb in handling load.
        
         | amtamt wrote:
         | For caching somewhat larger objects based on ETag?
        
         | eldenring wrote:
         | Yes! I have been spending the last couple months pulling out
         | completely unnecessary redis caching from some of our internal
         | web servers.
         | 
          | The only loss here is network latency, which is negligible when
         | you're colocated in AWS.
         | 
          | Postgres's caches end up pulling a lot more weight, too, when
          | the db is hit on every request instead of only on a cache miss
          | from the web server.
        
           | [deleted]
        
         | aadvark69 wrote:
          | Better concurrency (~10k max connections vs ~200 for
          | Postgres). ~20x faster than Postgres at key-value read/write
          | operations. (Mostly) single-threaded, so atomicity is achieved
          | without the synchronization overhead found in an RDBMS.
         | 
         | Thus, it's much cheaper to run at massive scale like OpenAI's
         | for certain workloads, including KV caching
         | 
         | also:
         | 
         | - robust, flexible data structures and atomic APIs to
         | manipulate them are available out-of-the box
         | 
         | - large and supportive community + tooling
        
         | manv1 wrote:
          | 1. Redis can handle a lot more connections, more quickly, than
          | a database can.
          | 
          | 2. It's still faster than a database, especially a database
          | that's busy.
         | 
         | #2 is an interesting point. When you benchmark, the normal
         | process is to just set up a database then run a shitload of
         | queries against it. I don't think a lot of people put actual
         | production load on the database then run the same set of
         | queries against it...usually because you don't have a
         | production load in the prototyping phase.
         | 
         | However, load does make a difference. It made more of a
         | difference in the HDD era, but it still makes a difference
         | today.
         | 
          | I mean, redis is a cache, and you do need to ensure that stuff
          | works if you purge redis (i.e., be sure the rebuild process
          | works), etc, etc.
         | 
         | But just because it's old doesn't mean it's bad. OS/390 and
         | AS/400 boxes are still out there doing their jobs.
        
           | hobobaggins wrote:
           | and those have reliable backup/restore infrastructure. Using
           | redis as a cache is fine, just don't use it as your primary
           | DB.
        
           | nijave wrote:
           | A pretty small Redis server can handle 10k clients and
           | saturate a 1Gbps NIC. You'd need a pretty heavy duty Postgres
           | database and definitely need a connection pooler to come
           | anywhere close.
        
             | anarazel wrote:
             | I agree that redis can handle some query volumes and client
             | counts that postgres can't.
             | 
             | But FWIW I can easily saturate a 10GBit ethernet link with
             | primary key-lookup read-only queries, without the results
             | being ridiculously wide or anything.
             | 
             | Because it didn't need any setup, I just used:
             | SELECT * FROM pg_class WHERE oid = 'pg_class'::regclass;
             | 
              | I don't immediately have access to a faster network, but
              | connecting via TCP to localhost and using some moderate
              | pipelining (common in the redis world afaik), I get up to
              | 19GB/s on my workstation.
        
       | qwerty3344 wrote:
       | using chatGPT to summarize:
       | 
       | This section summarizes the issue with the ChatGPT service which
       | caused some users to be able to see another user's chat history
       | and payment information. OpenAI took the ChatGPT service offline
       | and published technical details of the problem. The bug was found
       | in the Redis client open-source library and has been patched.
       | OpenAI has reached out to affected users and improved their
       | systems to prevent this from happening again. They have also
       | expressed their commitment to protecting user privacy and their
       | dedication to the open-source community.
        
         | zamnos wrote:
         | Interesting. I pasted in the outage report, and then asked
         | ChatGPT to give me a single sentence summary and it gave me an
         | answer way longer than a sentence. I pointed out that it wasn't
         | a single sentence and it gave me:
         | 
         | > ChatGPT experienced an outage due to a bug in an open-source
         | library that allowed some users to see titles from another
         | active user's chat history and payment-related information of
         | 1.2% of ChatGPT Plus subscribers was exposed; the bug has since
         | been patched and actions have been taken to prevent future
         | incidents.
         | 
         | I then asked it _why_ there was the outage and it spit out:
         | 
         | > The exact cause of the Facebook outage is unknown, but it is
         | believed to be related to a configuration change in the
         | company's routers.
         | 
         | It's likely because I ran out of tokens because the OpenAI
         | outage report is long. Pasting in the text of the outage
         | report, and then re-asking about why, it was able to give a
         | much better answer:
         | 
         | > There was an outage due to a bug in an open-source library
         | that allowed some users to see titles from another active
         | user's chat history and also unintentionally exposed payment-
         | related information of 1.2% of ChatGPT Plus subscribers who
         | were active during a specific nine-hour window.
         | 
         | Querying it further, again having to repeat the whole OpenAI
         | outage report, and asking it a few different ways I eventually
         | managed to get this succinct answer:
         | 
         | > The bug was caused by the redis-py library's shared pool of
         | connections becoming corrupted and returning cached data
         | belonging to another user when a request was cancelled before
         | the corresponding response was received, due to a spike in
         | Redis request cancellations caused by a server change on March
         | 20.
         | 
         | It did take me more than a few minutes to get to there, so just
         | actually reading the report would have been faster, and I ended
         | up having to read the report to verify that answer was correct
         | and not a hallucination anyway, so our jobs are safe for now.
        
           | flangola7 wrote:
            | Try with GPT 4. The token window is quadruple the size.
        
       | layer8 wrote:
       | That sounds like the kind of bug that could be prevented by
       | modeling with TLA+.
        
       | m00dy wrote:
       | maybe they've just scrolled over issue lists of popular tech
       | stacks and cherry-picked the most compelling one to bury the
       | dirt.
        
       | w10-1 wrote:
       | It's interesting (read: wrong) for an AI company to bother
       | writing the user interface for their web application.
       | 
       | This was a failure of integration testing and defensive design,
       | whether the component was open-source or not. There's no reason
       | to believe that an AI company would have the diligence and
       | experience to do the grunt work of hardening a site.
       | 
       | But management obviously understood the level and character of
       | interest. Actual users include probably 10,000 curiosity seekers
       | for every actual AI researcher, with 1,000 of those being
       | commercial prospects -- people who might buy their service.
       | 
        | This is a clear sign that the managers who've made technical
        | breakthroughs in AI are not capable even of deploying the
       | service at scale -- no less managing the societal consequences of
       | AI.
       | 
       | The difficulty with the board getting adults in the room is that
       | leaders today give the appearance of humility and cooperation,
       | with transparent disclosures and incorporation of influencers
        | into advisory committees. The leaders may believe in their own
        | abilities because their underlings don't challenge them. So
       | there's no obvious domineering friction, but the risk is still
       | there, because of inability to manage.
       | 
        | Delegation is the key to scaling, in code and in organizations. "Know
       | thyself" is about knowing your limits, and having the humility to
       | get help instead of basking in the puffery of being in control.
       | 
       | This isn't a PR problem. It's the Achilles' heel of capitalism,
       | and the capitalists in OpenAI's board should nip this incipient
       | Musk in the bud or risk losing 2-3 orders of magnitude return on
       | their investment.
        
       | stygiansonic wrote:
       | The key part:
       | 
       |  _If a request is canceled after the request is pushed onto the
       | incoming queue, but before the response popped from the outgoing
       | queue, we see our bug: the connection thus becomes corrupted and
       | the next response that's dequeued for an unrelated request can
       | receive data left behind in the connection._
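A minimal, self-contained simulation of that race (illustrative only, not redis-py's real code): responses are matched to requests purely by queue order, so cancelling a caller after its request is enqueued but before its response is dequeued shifts every later response by one.

```python
from collections import deque

class SharedConnection:
    """Toy model: responses pair with requests by order alone."""
    def __init__(self):
        self.waiting = deque()    # callers awaiting a response
        self.responses = deque()  # responses arriving from the server

    def request(self, user, query):
        self.waiting.append(user)
        # The server always answers; it has no idea the caller may cancel.
        self.responses.append(f"data for {user}: {query}")

    def cancel_last_caller(self):
        # The caller disappears, but its response stays in the pipe:
        # this is the corruption step described in the postmortem.
        self.waiting.pop()

    def next_response(self):
        return self.waiting.popleft(), self.responses.popleft()

conn = SharedConnection()
conn.request("alice", "chat history")
conn.cancel_last_caller()            # alice cancels mid-flight
conn.request("bob", "chat history")
print(conn.next_response())          # bob is handed alice's data
```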
        
         | [deleted]
        
       | qwertox wrote:
       | This reminds me of a comment I made 1.5 months ago [0]:
       | 
       | I was logging in during heavy load, and after typing the question
       | I started getting responses to questions which I didn't ask.
       | 
       | gdb answered on that comment "these are not actually messages
       | from other users, but instead the model generating something
       | ~random due to hitting a bug on our backend where, rather than
       | submitting your question, we submitted an empty query to the
       | model."
       | 
       | I wonder if it was the same redis-py issue back then, but just at
        | another point in the backend. His answer didn't really convince
        | me.
       | 
       | [0] https://news.ycombinator.com/item?id=34614796&p=2#34615875
        
       | lopkeny12ko wrote:
       | The original issue report is here:
       | https://github.com/redis/redis-py/issues/2624
       | 
       | This bit is particularly interesting:
       | 
       | > I am asking for this ticket to be re-oped, since I can still
       | reproduce the problem in the latest 4.5.3. version
       | 
       | Sounds like the bug has not actually been fixed, per drago-balto.
        
       | nvartolomei wrote:
       | I wonder how much time passed between the first case of
        | corruptions leading to exceptions (which they ignored as "eh,
        | not great, not terrible, we'll look at it later") and users
        | reporting seeing other users' data?
        
       | jchw wrote:
       | Does anyone else find it a bit off-putting how much emphasis they
       | keep putting on "open source library"? I don't think I've read
       | about this without the word open source appearing more than once
       | in their own messaging about it. Why is it so important to
       | emphasize that the library with the bug is open source?
       | 
       | The cynic in me wants to believe that it's a way of deflecting
       | blame somehow, to make it seem like they did their due diligence
       | but were thwarted by something outside of their control. I don't
       | think it holds. If you use an open source library with no
       | warranty, you are responsible (legally and otherwise) to ensure
       | that it is sufficient. For example, if you break HIPAA compliance
       | due to an open source library, it is still you who is responsible
       | for that.
       | 
       | But of course, they're not claiming it's anyone else's fault
       | anywhere explicitly, so it's uncharitable to just assume that's
        | what they meant. Still, it rubs me the wrong way. I can't shake
        | the feeling that it's a _wink wink nudge nudge_ to give them more
        | slack than they'd otherwise get. It feels like it's inviting you
       | to just criticize redis-py and give them a break.
       | 
       | The open postmortem and whatnot is appreciated and everything,
       | but sometimes it's important to be mindful of what you emphasize
       | in your postmortems. People read things even if you don't write
       | them, sometimes.
        
         | babl-yc wrote:
         | I don't find it over-emphasized. Many in the Twitter-sphere are
         | acting as if they aren't being appreciative of open source
         | software and I don't see it that way.
         | 
         | The technical root cause was in the open source library.
         | There's a patch available and more likely than not OpenAI will
         | continue to use the library.
         | 
          | Being overly sensitive to blame would distract from the
          | technical issue at hand. It's great they are posting this
          | postmortem to raise awareness that the libraries you use can
          | have bugs, and to consider that risk when building systems.
        
           | fabianhjr wrote:
            | Root cause analysis would likely also include the lack of
            | threat modeling / security evaluation of their dependencies.
            | 
            | It would likely also question the lack of resources allocated
            | to these open source projects by companies that profit, in
            | part, from using them.
        
         | bitxbitxbitcoin wrote:
         | Not surprising from a company that calls itself openai. The
         | "open source" keyword stuffing is so people associate the open
         | from openai with open source. Psyops I mean marketing 101.
        
         | ishaanbahal wrote:
          | The emphasis could also be there to educate folks using this
          | combination to check their own setups.
          | 
          | The version reference should also have been posted in the
          | postmortem as general guidance to their readers, but at least
          | a quick Google search leads you to it.
         | 
         | https://github.com/redis/redis-py/releases/tag/v4.5.3
         | 
          | For anyone reading this and using a combination of asyncio and
          | redis-py, please bump your versions.
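Concretely, that means setting a version floor at or above the fixed release, e.g. in a requirements.txt (the minimum is taken from the v4.5.3 release linked above):

```
redis>=4.5.3
```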
         | 
          | I've encountered similar issues in the past with asyncio
          | Python and Postgres when trying to pool connections. They're
          | really not easy to debug either.
        
         | layer8 wrote:
         | I think you're overreacting. What bothered me is that they
         | didn't link to the actual bug or provide a reference ID.
        
         | mlsu wrote:
         | The gaping hole in this write-up goes something like:
         | 
         | "In order to prevent a bug like this from happening in the
         | future, we have stepped up our review process for external
         | dependencies. In addition, we are conducting audits around code
         | that involves sensitive information."
         | 
         | Of course, we all know what actually happened here:
         | 
         | - we did no auditing;
         | 
         | - because our audit process consists of "blame someone else
         | when our consumers are harmed";
         | 
         | - because we would rather not waste dev time on making sure our
         | consumers are not harmed
         | 
         | If you want to know why no software "engineering" is happening
         | here, this is your answer. Can you imagine if a bridge
         | collapsed, and the builder of the bridge said, "iunno, it's the
         | truck's fault for driving over the bridge."
        
           | marshmellman wrote:
           | Are you confident that an audit would have uncovered this
           | bug? I'd be surprised if audits are effective at finding
           | subtle bugs and race conditions, but I could be wrong.
        
         | cwkoss wrote:
         | If the FTC had teeth and good judgement, they'd force OpenAI to
         | rename themselves.
        
         | kobalsky wrote:
         | the library is provided by the redis team themselves and the
         | bug is awful [1]. I know it's not redis' fault but this bug
         | could hit anyone. Connections may be left in a dirty state
         | where they return data from the previous request in the
         | following one.
         | 
         | [1] https://github.com/redis/redis-py/issues/2624
        
         | adrianmonk wrote:
         | I noticed it too, but it doesn't necessarily bother me.
         | Possibly they're just trying to say, "This incident may have
         | made us look like we're complete amateurs who don't have any
         | clue about security, but it wasn't like that."
         | 
         | Using someone else's library doesn't absolve you of
         | responsibility, but failing to be vigilant at thoroughly
         | vetting and testing external dependencies is a different kind
         | of mistake than creating a terrible security bug yourself
         | because your engineers don't know how to code or everyone is in
         | too much of a rush to care about anything.
        
           | thefreeman wrote:
           | They really skirt around the fact that they apparently
           | introduced a bug which quite consistently initiated redis
           | requests and terminated the connection before receiving the
           | result.
        
           | mewpmewp2 wrote:
           | Yes, I agree with that sentiment, and I thought precisely the
           | same. I know as an engineer that I would feel compelled to
           | mention that it was an obscure bug in an open source library,
           | if that was the case. Not to excuse myself of responsibility,
           | but because I would feel so ashamed if I myself introduced
           | such an obvious security flaw. I would still of course
           | consider myself responsible for what happened.
           | 
            | A lot of the time when people make mistakes, they explain
            | themselves because they are afraid of being perceived as
            | completely stupid or incompetent for making that mistake,
            | not to dodge responsibility, even though people frequently
            | think that an excuse or explanation means you are trying to
            | absolve yourself of what you did.
           | 
            | There's a huge difference to me between having an obscure bug
            | like this and introducing that type of security issue because
            | you couldn't logically consider it. The first can be resolved
            | in the future by introducing processes and making sure all
            | open source libraries come from trusted sources; the second
            | implies that you are fundamentally unable to reason it
            | through, and therefore probably unable to improve.
        
           | mlsu wrote:
           | Why?
           | 
           | The result for the end consumer is identical whether they
           | have their PII leaked from "an external library" vs a
           | vendor's own home-baked solution.
           | 
           | It's not really a different kind of mistake, it's exactly the
           | same kind of mistake, because it is exactly the same mistake!
           | This is talking the talk, and not walking the walk, when it
           | comes to security.
           | 
           | Publishing a writeup that passes the buck to some (unnamed)
            | overworked and underpaid open source maintainer is _worse_,
           | not better!
        
           | Veserv wrote:
           | I agree, it is a different kind of mistake; it is immensely
           | worse than creating a terrible security bug yourself.
           | 
            | Outsourcing your development work without acceptance
            | criteria and without validating fitness for purpose is
           | complete, abject engineering incompetence. Do you think
           | bridge builders look at the rivets in the design and then
           | just waltz over to Home Depot and just pick out one that
           | looks kind of like the right size? No, they have exact
           | specifications and it is their job to source rivets that meet
           | those specifications. They then either validate the rivets
           | themselves or contract with a reputable organization that
           | _legally guarantees_ they meet the specifications and it
           | might be prudent to validate it again anyways just to be
           | sure.
           | 
           | The fact that, in software, not validating your dependencies,
           | i.e. the things your system _depends_ on, is viewed as not so
            | bad is a major reason why software security is such an utter
           | joke and why everybody keeps making such utterly egregious
           | security errors. If one of the worst engineering practices is
           | viewed as normal and not so bad, it is no wonder the entire
           | thing is utterly rotten.
        
           | jchw wrote:
           | I do not believe it's necessarily _nefarious_ in nature, but
            | maybe more specifically it feels kind of like they're
           | implying that this is actually a valid escape hatch: "Sorry,
           | we can't possibly audit this code because who audits all of
           | their open source deps, amirite?"
           | 
           | But the truth is that actually, maybe that hints at a deeper
           | problem. It was a direct dependency to their application code
           | in a critical path. I mean, don't get me wrong, I don't think
           | everyone can be expected to audit or fund auditing for every
           | single line of code that they wind up running in production,
           | and frankly even doing that might not be good enough to
           | prevent most bugs anyways. Like clearly, every startup fully
           | auditing the Linux kernel before using it to run some HTTP
           | server is just not sustainable. But let's take it back a
           | step: if the point of a postmortem is to analyze what went
           | wrong to prevent it in the future, then this analysis has
           | failed. It almost reads as "Bug in an open source project
           | screwed us over, sorry. It will happen again." I realize
           | that's not the most charitable reading, but the one takeaway
           | I had is this: They don't actually know how to prevent this
           | from happening again.
           | 
           | Open source software helps all of us by providing us a wealth
           | of powerful libraries that we can use to build solutions, be
           | we hobbyists, employees, entrepreneurs, etc. There are many
           | wrinkles to the way this all works, including obviously
           | discussions regarding sustainability, but I think there is
           | more room for improvement to be had. Wouldn't it be nice if
           | we periodically had actual security audits on even just the
           | most popular libraries people use in their service code?
           | Nobody in particular has an impetus to fund such a thing, but
           | in a sense, everyone has an impetus to fund such work, and
           | everyone stands to gain from it, too. Today it's not the
           | norm, but perhaps it could become the norm some day in the
           | future?
           | 
           | Still, in any case... I don't really mean to imply that
           | they're being nefarious with it, but I do feel it comes off
           | as at _best_ a bit tacky.
        
             | xxpor wrote:
             | I mean, if there were ever a company in a position to
             | figure out a scalable way to audit OSS before usage, it'd
             | be OpenAI, right?
        
           | jvm___ wrote:
            | Doesn't bother me either. All the car companies issue recalls
            | regularly; sometimes an issue only shows up when the system
            | hits capacity or you run into an edge case.
        
         | skybrian wrote:
         | I think you're reading too much into it. Being an open source
         | library is relevant because it means it's third party and
         | doesn't come with a support agreement, so fixing a bug is a
         | somewhat different process than if it were in your own code or
         | from a proprietary vendor.
         | 
          | Yes, it's technically up to you to vet all your dependencies,
          | but in practice that often doesn't happen; people assume the
          | code works, and that's relevant too.
        
           | fabianhjr wrote:
           | Open source can be fixed as if it was your own code. (And
            | that is a strong tenet of free/open source software)
           | 
           | Not only do most open/free source libraries come without
           | support agreements: they come with the broadest possible
           | limitation of warranties. (As they should)
           | 
           | So the company, knowing that what they are using comes
           | without any warranty either of quality or fitness to the use-
           | case, have a very strong burden of due diligence / vetting.
        
           | danenania wrote:
           | Also, vetting a dependency != auditing and testing every line
           | of code to find all possible bugs.
           | 
           | If this bug was an open issue in the project's repo, that
           | might be concerning and indicate that proper vetting wasn't
           | done. Ditto if the project is old and unmaintained, doesn't
           | have tests, etc. But if they were the first to trigger the
           | bug and it only occurs under heavy load in production
           | conditions, well, running into some of those occasionally is
           | inevitable. The alternative is not using any dependencies, in
           | which case you'd just be introducing these bugs yourself
           | instead. Even with very thorough testing and QA, you're never
           | going to perfectly mimic high load production conditions.
        
           | JohnFen wrote:
           | > in practice, often it doesn't happen, people make
           | assumptions that the code works
           | 
           | True, but that's an inexcusable practice and always has been.
           | We as an industry need to stop accepting it.
        
             | isopede wrote:
             | What do you mean by "stop accepting it?"
             | 
             | All of us rely on millions of lines of code that we have
             | not personally audited every single day. Have you audited
             | every framework you use? Your kernel? Drivers? Your
             | compiler? Your CPU microcode? Your bootrom? The firmware in
             | every gizmo you own?
             | 
             | If "Reflections on Trusting Trust" has taught us anything,
             | it's turtles all the way down. At some point, you have to
             | either trust something, or abandon all hope and trust
             | nothing.
        
               | JohnFen wrote:
               | > Have you audited every framework you use? Your
               | compiler? Your CPU microcode? Your bootrom?
               | 
               | Of course not. I exclude the CPU microcode, bootrom, and
               | the like from the discussion because that's not part of
               | the product being shipped.
               | 
               | But it's also true that I don't do a deep dive analyzing
               | every library I use, etc. I'm not saying that we should
               | have to.
               | 
               | What I'm saying is that when a bug pops up, that's on us
               | as developers even when the bug is in a library, the
               | compiler, etc. A lot of developers seem to think that
               | just because the bug was in code they didn't personally
               | write, that means that their hands are clean.
               | 
               | That's just not a viable stance to take. The bug should
               | have been caught in testing, after all.
               | 
               | If your car breaks down because of a design failure in a
               | component the auto manufacturer bought from another
               | supplier, you'll still (rightfully) hold the auto
               | manufacturer responsible.
        
               | skybrian wrote:
               | > when a bug pops up
               | 
               | That's reacting to a bug you know about. Do you mean to
               | talk about how developers aren't good enough at reacting
               | to bugs found in third party libraries, or how they
               | should do more prevention?
               | 
               | In this case, it seems like OpenAI reacted fairly
               | appropriately, though perhaps they could have caught it
               | sooner since people reported it privately.
               | 
               | "Holding someone responsible" is somewhat ambiguous about
               | what you expect. It seems reasonable that a car
               | manufacturer should be prepared to do a recall and to pay
               | damages without saying that they should be perfect and
               | recalls should never happen.
        
               | JohnFen wrote:
               | > Do you mean to talk about how developers aren't good
               | enough at reacting to bugs found in third party
               | libraries, or how they should do more prevention?
               | 
               | My point was neither of these. My point is very simple:
               | the developers of a product are responsible for how that
               | product behaves.
               | 
               | I'm not saying developers have to be perfect, I'm just
               | saying that there appears to be a tendency, when
               | something goes wrong because of external code, to deflect
               | blame and responsibility away from them and onto the
               | external code.
               | 
               | I think this is an unseemly thing. If I ship a product
               | and it malfunctions, that's on me. The customer will
               | rightly blame me, and it's up to me to fix the problem.
               | 
               | Whether the bug was in code I wrote or in a library I
               | used isn't relevant to that point.
        
         | JohnFen wrote:
         | > The cynic in me wants to believe that it's a way of
         | deflecting blame somehow
         | 
         | That's how it reads to me as well.
         | 
         | Of course, it doesn't deflect blame at all. Any time you
         | include code in your project, no matter where the code came
         | from, you are responsible for the behavior of that code.
        
         | amtamt wrote:
          | Was the postmortem generated by ChatGPT?
        
         | dilap wrote:
         | I half agree, but I also half-sympathize with them, because it
         | really wasn't their fault -- it was a quite-bad bug in a very
         | fundamental library.
         | 
         | Bugs happen, though. Especially in Python.
        
           | airstrike wrote:
           | _> Especially in Python._
           | 
           | as opposed to...?
        
             | moffkalast wrote:
             | As opposed to not in Python.
        
               | deathanatos wrote:
               | ... like JavaScript? Bash? C? PHP?
               | 
               | Certainly none of those are widely used and have a
               | reputation for making it easy to keep the gun aimed
               | squarely at the foot.
        
               | moffkalast wrote:
               | Those would be roughly similar. The main difference would
               | be between dynamically typed interpreted languages and
               | statically typed compiled ones I guess. At least I think
                | I make fewer mistakes when the compiler literally tells me
               | what's wrong before I even run the thing. It's awful and
               | slow to develop that way, but it is more reliable for
               | when that's a requirement.
               | 
               | So compared to ones like Kotlin or Rust.
        
             | dilap wrote:
             | Go, for one.
             | 
             | In my experience errors are more common (for both cultural
             | and technological reasons) in Python than in Go.
             | 
             | I would guess something similar applies to Rust, though I
             | don't have personal experience.
             | 
             | There's wide variation in C, but with careful
             | discrimination, you can find very high-quality libraries or
             | software (redis itself being an excellent example).
             | 
              | I don't have rigorous data to back this stuff up, but I'm
             | pretty convinced it's true, based on my own experience.
        
           | qwertox wrote:
           | I was upvoting you, but then reading
           | 
           | > Especially in Python.
           | 
           | made me unvote.
        
           | kljhghfgdfjkgh wrote:
            | it really _was_ their fault. they chose to ship the bug. it
            | doesn't matter in the least that someone else previously
           | published the code under a license with no warranty
           | whatsoever.
        
           | gkbrk wrote:
           | Instead of spending engineering time, they used a free and
           | open-source library to do less work.
           | 
           | The license they agreed to in order to use this library has
           | this in capital letters. [THE SOFTWARE IS PROVIDED "AS IS",
           | WITHOUT WARRANTY OF ANY KIND].
           | 
           | After agreeing to this license and using the library for
           | free, they charged people money and sold them a service. And
           | when that library they got for free, which they read and
           | agreed that had no warranty of any kind, had a software bug,
           | they wrote a blog post and blamed the outage of their paid
           | service on this free library.
           | 
           | This is not another open-source project, or a small business.
           | This is a company that got billions of dollars in investment,
           | and a lot of income by selling services to businesses and
           | individuals. They don't get to use free, no-warranty code
           | written by others to save their own money, and then blame it
           | and complain about it loudly for bugs.
        
           | JohnFen wrote:
           | > it really wasn't their fault -- it was a quite-bad bug in a
           | very fundamental library.
           | 
           | It's still their fault. When you ship code, you are
           | responsible for how that code behaves regardless of where the
           | code came from.
        
             | JamesBarney wrote:
             | Only for some incredibly broad definition of fault that
             | almost no one uses.
             | 
             | How many people make sure all of the open source libraries
             | they're using are bug free?
             | 
             | Anyone besides maybe NASA?
        
               | JohnFen wrote:
               | > Only for some incredibly broad definition of fault that
               | almost no one uses.
               | 
               | It's a definition most laypeople use. It's developers who
               | tend to use a very narrow definition.
               | 
               | I don't think it should be controversial to say that when
               | you ship a product, you are responsible for how that
               | product behaves.
        
               | pjmlp wrote:
               | Anyone that has to pay from their own pocket when things
                | go wrong, like consulting warranties, liability in
                | security exploits, ...
        
               | majormajor wrote:
               | I've never cared per se that a library was bug free but
               | I've put a lot of effort/$ into making sure _the features
               | that used the libraries in my product_ were bug free
               | (with the amount of effort depending on the sensitivity
               | of the feature, data, etc).
               | 
               | Usually "fix the original library" wasn't as easy or
               | immediate a fix as "hack around it" which is sad just re:
               | the overall OSS ecosystem but still the person releasing
               | a product's responsibility.
               | 
               | Unfortunately these sorts of bugs are wildly difficult to
               | predict. Yet it's also a wildly common architecture.
               | _That 's_ what's sad for all of us as engineers as a
               | whole. But "caching credit card details and home
               | addresses", for instance, is... particularly dicey.
               | That's very sensitive, and you're tossing it into more
               | DBs, without good access control restrictions?
        
               | rschoultz wrote:
                | Anywhere you handle payments-related or any other PII
                | data, transitive dependencies, framework and language
                | choices, memory sharing, and other risks must be taken
                | into account as something that you, as the one
                | developing and operating the service, are solely
                | responsible for.
        
             | practice9 wrote:
              | There have been several reports of this issue in Feb/early
              | March on the r/ChatGPT subreddit - OpenAI could have known
              | if they listened to the community.
              | 
              | Alternatively, they knew about it and didn't fix the bug
              | until it bit them.
        
         | JamesBarney wrote:
         | This doesn't come across this way to me at all. They just
         | described what happened. Do you expect them to jump in front of
         | a bus for the library they're using, and beg for forgiveness
         | for not ensuring the widely used libraries they're leveraging
         | are bug free?
         | 
         | There are very few companies that couldn't get caught by this
         | type of bug.
        
         | nickvincent wrote:
         | Basically agree -- feels off-putting, but not technically a
         | wrong detail to add. An additional reason it rubs me the wrong
         | way, however, is that I believe open-source software code is
         | especially critical to ChatGPT family's capabilities. Not just
         | for code-related queries, but for everything! (e.g. see this
         | "lineage-tracing" blog post: https://yaofu.notion.site/How-
         | does-GPT-Obtain-its-Ability-Tr...)
         | 
         | Thus, I honestly think firms operating generative AI should be
         | walking on eggshells to avoid placing blame on "open-source".
          | Rather, they really should be going out of their way to channel
         | much positive energy towards it as possible.
         | 
          | Still, agree the charitable interpretation is that this is just
          | purely descriptive.
        
         | jatins wrote:
          | I think you are reading a bit between the lines. I didn't feel
          | they were blaming the library as much as stating that the bug
          | happened because of an issue in the library. Maybe they could
          | have sugarcoated it under 10 layers of corporate jargon, but
          | I'd rather take this over that.
        
         | thequadehunter wrote:
         | Personally, I think it was partially a virtue signal to show
         | that they use open source software and collaborate with the
         | maintainers.
        
         | chamakits wrote:
         | I've also noticed it, and I can't help but interpret it as
         | their way of shifting blame. Which is irresponsible. It's their
         | product, and they need to take accountability for the bug
         | occurring.
         | 
         | It's a serious bug, but in the grand scheme of things, not
         | earth shattering, and not something that I think would
         | discourage usage of their product. But their treatment of the
         | bug causes more concerns than the bug itself. They are shifting
         | the blame away from the work they did using a library with a
         | bug, rather than their process by which that library made it
         | into their product. And I don't understand how they can't see
         | how that reflects poorly on them as an AI company.
         | 
         | I find it so confusing that at the end of the day, OpenAI's
         | biggest product is having created a good process by which to
         | create value out of a massive amount of data, and build a good
         | API on top of it. And the open source library is effectively
         | something they processed into their product and built an API
         | based off of it. So it creates (to me) some amount of doubt
         | about how they will react when faced with similar challenges to
         | their core product. How will they behave when the data they
         | consume impacts their product negatively? From limited
         | experience, they'll shift the blame to the data, not their
         | process, and keep it pushing.
         | 
         | It seems likely that this is only the beginning of OpenAI
         | having a large customer base, with a high impact on many
         | products. This is a disappointing result on their first test on
         | how they'll manage issues and bugs with their products.
        
         | metanonsense wrote:
         | I don't know. To me it's simply an explanation of what has
          | happened. I think it's exactly what I would have written if I
          | were in their position. And show me the one company that has
         | audited all source code of all used open source projects, at
         | least in a way that is able to rule out complex bugs like this.
          | I once found a memory corruption bug in Berkeley DB
         | wrecking our huge production database, which I would have never
         | found in any pre-emptive source code audit, however detailed.
         | 
         | Edit: On second thought, maybe they could have just written
         | "external library" instead of "open source library".
        
       | davedx wrote:
       | They were/are storing payment data in redis? LOL!
        
         | taxman22 wrote:
         | The postmortem doesn't say that. It just says they were caching
         | "user information". Maybe that includes a Stripe customer or
         | subscription ID that they look up before sending an email, for
         | example.
        
           | tmpz22 wrote:
            | Yeah, probably the session id; when the wrong session id is
            | returned, other operations like GET user details would pull
            | their data from relational storage.
        
       | galnagli wrote:
        | Well - they have had more bugs, and will have more bugs to
        | worry about.
       | 
       | https://twitter.com/naglinagli/status/1639343866313601024
        
       | abujazar wrote:
       | The disclosure is provides valuable information, but the
       | introduction suggests someone else or <<open-source>> is to
       | blame:
       | 
       | >We took ChatGPT offline earlier this week due to a bug in an
       | open-source library which allowed some users to see titles from
       | another active user's chat history.
       | 
       | Blaming an open-source library for a fault in closed-source
       | product is simply unfair. The MIT licensed dependency explicitly
       | comes without any warranties. After all, the bug went unnoticed
       | until ChatGPT put it under pressure, and it was ChatGPT that
       | failed to rule out the bug in their release QA.
        
       | ajhai wrote:
       | > In the hours before we took ChatGPT offline on Monday, it was
       | possible for some users to see another active user's first and
       | last name, email address, payment address, the last four digits
       | (only) of a credit card number, and credit card expiration date
       | 
        | This is a lot of sensitive data. It says 1.2% of ChatGPT Plus
        | subscribers active during a 9-hour window, which, considering
        | their user base, must be a lot.
        
         | mach1ne wrote:
          | It's a bit unclear whether this means that 1.2% of all ChatGPT
          | Plus subscribers were active during that 9-hour window.
        
       | jkern wrote:
       | Funnily enough I've had a very similar bug occur in an entirely
       | separate redis library. It was a pretty troubling failure mode to
       | suddenly start getting back unrelated data
        
       | pixl97 wrote:
       | There are 2 hard problems in computer science: cache
       | invalidation, naming things, and off-by-1 errors.
        
         | deathanatos wrote:
          | ... in this case this variant seems more appropriate:
          | 
          |     There are 3 hard problems in Computer Science:
          |     1. naming things
          |     2. cache invalidation
          |     3.
          |     4. off-by-one errors
          |     concurrency
        
       | DeathArrow wrote:
        | Am I the only one terribly bored by the assault of trivial AI
        | news these last months?
       | 
        | Every fart some AI-related person makes becomes huge news. And
        | it's followed by tens of random blog posts, all submitted to HN.
        
         | Nuzzerino wrote:
         | At least it isn't about the Rust language this time _grumbles_
        
           | DeathArrow wrote:
           | Because Rust hasn't conquered AI the way it conquered crypto.
           | 
           | But we will see AI stuff rewritten in Rust quite soon.
        
           | spprashant wrote:
            | For some reason I liked reading about Rust (or any other
            | technology) a lot more than the AI.
           | 
            | Part of it is that the average engineer could understand and
           | grok what those articles were talking about, and I could
           | appreciate, relate, and if applicable criticize it.
           | 
           | The AI news just seems to swing between hype and doomsday
           | prophecies, and little discussion about the technical aspects
           | of it.
           | 
           | Obviously OpenAI choosing to keep it closed source makes any
           | in-depth discussion close to impossible, but also some of
           | this is so beyond the capabilities of an average engineer
           | with a laptop. It can be frustrating.
        
       | [deleted]
        
       | polyrand wrote:
       | Commit fixing the bug:
       | 
       | https://github.com/redis/redis-py/commit/66a4d6b2a493dd3a20c...
        
       | ketchupdebugger wrote:
        | It's surprising that OpenAI seems to be the only one being
       | affected. If the issue is with redis-py reusing connections then
       | wouldn't more companies/products be affected by this?
        
         | zzzeek wrote:
         | their description of the problem seemed kind of obtuse, in
         | practice, these connection-pool related issues have to do with
         | 1. request is interrupted 2. exception is thrown 3. catch
         | exception, return connection to pool, move on. The thing that
         | has to be implemented is 2a. clean up the state of the
         | connection when the interrupted exception is caught, _then_
         | return to the pool.
         | 
         | that is, this seems like a very basic programming mistake and
         | not some deep issue in Redis. the strange way it was described
         | makes it seem like they're trying to conceal that a bit.
        
           | roberttod wrote:
           | It's an open source library, I assume that logic is
            | abstracted within it and that the "basic mistake" was one of
            | the maintainers'.
        
       | 19h wrote:
       | It boggles my mind how they're not absolutely checking the user &
       | conversation id for EVERY message in the queue given the possible
       | sensitivity of the requests. How is this even remotely
       | acceptable?
       | 
       | In the one reddit post first surfacing this the user saw
       | conversations related to politics in china and other rather
       | sensitive topics related to CCP.
       | 
        | This can absolutely get people hurt and they absolutely must
        | take this seriously.
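The kind of check 19h is asking for could look something like the sketch below: stamp each cached payload with its owner, and refuse to serve it to anyone else. This is a hypothetical defense-in-depth illustration; `CachedChat`, `fetch_chat_title`, and the cache layout are invented for the example and are not OpenAI's actual code.

```python
from dataclasses import dataclass


@dataclass
class CachedChat:
    user_id: str  # owner stamped into the payload when it was cached
    title: str


class OwnershipMismatch(Exception):
    pass


def fetch_chat_title(cache: dict, key: str, requesting_user_id: str) -> str:
    """Even if the cache or connection layer misroutes a response,
    refuse to serve a payload stamped with a different owner."""
    payload = cache[key]
    if payload.user_id != requesting_user_id:
        # Fail closed instead of leaking another user's data.
        raise OwnershipMismatch(
            f"cached payload belongs to {payload.user_id!r}, "
            f"not {requesting_user_id!r}"
        )
    return payload.title
```

The check costs one comparison per read and would have turned this incident's silent cross-user leak into a logged error.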
        
         | zaroth wrote:
         | It doesn't boggle my mind at all. Session data appears, and is
         | used to render the page. Do you verify every time the actual
         | cookie and go back to the DB to see what user it pointed to?
         | 
         | No, everyone assumes their session object is instantiated with
         | the right values at that level of the code.
        
       | m_0x wrote:
        | Did they use ChatGPT to fix the bug?
        
       | sergiotapia wrote:
       | It sounds like their redis key was not unique enough and yada
        | yada yada it returned sensitive info to the wrong people.
        
         | Jabrov wrote:
         | Did you read the article? That's not at all what happened.
        
       ___________________________________________________________________
       (page generated 2023-03-24 23:00 UTC)