[HN Gopher] March 20 ChatGPT outage: Here's what happened ___________________________________________________________________ March 20 ChatGPT outage: Here's what happened Author : zerojames Score : 234 points Date : 2023-03-24 16:08 UTC (6 hours ago) (HTM) web link (openai.com) (TXT) w3m dump (openai.com) | Kuinox wrote: | I managed to manually produce this bug 2 months ago. As they | don't have any bug bounty, I didn't submit it. By starting a | conversation and refreshing before ChatGPT has time to answer, I | managed to reproduce this bug 2-3 times in January. | breckenedge wrote: | did you reach out via https://openai.com/security.txt? | rvz wrote: | First GitHub, then OpenAI. Two of Microsoft's finest(!) (majority | owned and acquired) companies at the top of HN announcing a | serious security incident. | | It's quite unsettling to see this leak of highly sensitive | information and a private key exposure as well. Doesn't look good | and seems like they don't take security seriously. | skybrian wrote: | In the case of OpenAI, the product is more of a research demo | that had to be drastically scaled up, though. From an | operations point of view it's more like a startup. | deltree7 wrote: | Nobody cares, and yet another case study of HN being | out-of-touch with reality | rvz wrote: | > _"Nobody cares"_ | | Yet another case study of absolutism, which can be simply | dismissed. | | People paying for ChatGPT care once it goes down and their | details and chats get leaked, and that certainly extends beyond | HN. Same with GitHub. Both have ~100M users between them. | | That's the reality. | sebzim4500 wrote: | I'm paying for ChatGPT and I don't care about this any more | than the many, many other services I use that have at some | point had an embarrassing security issue. | deltree7 wrote: | I'm paying and I don't care. If I wrote perfect bug-free | code, led a perfect life, lived in a perfect world, I'd be | upset.
| | But, I know that shit happens and the reliability meter | should be flexible for different things (bridges, heart | surgery and chat agents). | | If I trained my brain to bitch, whine, and moan about | everything, I'd not have resources to care about really | important things. | kkarpkkarp wrote: | is it only me who doesn't see any chat history since yesterday, | and generally chat does not work (you can type the message, but | clicking the button or hitting enter / ctrl+enter does not have | any effect)? | | in chat history there is a button to "retry" but clicking it and | inspecting the result, you see "internal server error" | LarsDu88 wrote: | I called it: | https://news.ycombinator.com/item?id=35267569#35270165 | fintechie wrote: | I reported this race condition via ChatGPT's internal feedback | system after I saw other users' chat titles loading on my sidebar | a couple of times (around 7-8 weeks ago). Didn't get a response, | so I assumed it was fixed... | | Hopefully they'll start a bug bounty program soon, and prioritise | bug reports over features. | totallyunknown wrote: | same for me. actually only the summary of the history was from a | different user. the content itself was mine. | sebzim4500 wrote: | The claim made at the time was that the titles were not from | other people and were in fact caused by the model | hallucinating after the input query timed out (or something | like that). Obviously that sounds a little suspect now, but | it might be true. | nwienert wrote: | That's a lie if so; if you look at the Reddit threads, | there's no way those were not specific other users' | histories, as they had the logical flow of a browsing | history. E.g., one I saw had stuff like "what is X", | then the next would be "How to X" or something. Some were | all in Japanese, others all in Chinese. If it was random | you wouldn't see clear logical consistency across the list.
| jetrink wrote: | The explanation at the time was that unavailable chat data (due | to, e.g., high load) resulted in a null input sometimes being | presented to the chat summary system, which in turn caused the | system to hallucinate believable chat titles. It's possible | that they misdiagnosed the issue or that both bugs were present | and they caught the benign one before the serious one. | ElijahLynn wrote: | That is a pretty good disclosure that creates trust. | kristianpaul wrote: | This is more a data leak than an outage... | sebzim4500 wrote: | It was down for quite a while, so I would call it an outage. | qwertox wrote: | Nice writeup, it's fair in the content presented to us. | | Yet I'm wondering why there is no check that the response | actually belongs to the issued query. | | The client issuing a query can pass a token and verify upon | answer that this answer contains the token. | | TBH as a user of the client I would kind of expect the library to | have this feature built-in, and if I'm starting to use the | library to solve a problem, handling this edge case would be of a | somewhat low priority to me if the library didn't implement it, | probably because I'm lazy. | | I hope that the fix they offered to Redis Labs does contain a | solution to this problem and that every one of us using this | library will be able to benefit from the effort put into resolving | the issue. | | It doesn't [0], so the burden is still on the developer using the | library. | | [0] https://github.com/redis/redis-py/commit/66a4d6b2a493dd3a20c... | | --- | | Edit: Now I'm confused, this issue [1] was raised on March 17 and | fixed on March 22, was this a regression? Or did OpenAI start | using this library on March 19-20?
| | Interesting comment: | | > drago-balto commented 3 hours ago | | > Yep, that's the one, and the #2641 has not fixed it fully, as I | already commented here: #2641 (comment) | | > I am asking for this ticket to be re-opened, since I can still | reproduce the problem in the latest 4.5.3 version | | [1] https://github.com/redis/redis-py/issues/2624#issue-16293351... | [deleted] | menzoic wrote: | That sounds more like a hindsight thing. In most systems | authorization doesn't happen at the storage layer. Most queries | fetch data by an identifier which is only assumed to be valid | based on authorization that typically happens at the edge, and | then everything below relies on that result. | | It's not the safest design, but I wouldn't say the client should | be expected to implement it. That security concern is at the | application layer, and the actual needs of the implementation | can be wildly different depending on the application. You can | imagine use cases for redis where this isn't even relevant, | like if it's being used to store price data for stocks that | update every 30 seconds. There's no private data involved | there. It's out of scope for a storage client to implement. | [deleted] | benmmurphy wrote: | This is a common bug with a lot of software. For example, some | HTTP clients that do pooling won't invalidate the connection | after timing out waiting for the response. | picodguyo wrote: | If you're subscribed to their status page, you'll know it's | actually unusual for a day to go by without an outage alert from | OpenAI. They don't usually write them up like this, but I guess | this counts as a PII leak disclosure for them? For having raised | billions of dollars, they are comically immature from a reliability | and support perspective. | thequadehunter wrote: | To be fair, they accidentally made a game-changing breakthrough | that gained millions of users overnight, and I don't think they | were ready for it.
| | Before ChatGPT, most normal people had never heard of OpenAI. | Their flagship product was basically an API that only | programmers could make useful. | | Team leaders at OpenAI have stated that they were not expecting | the success, let alone the highest adoption rate for any | product in history. In their minds, it was just a cleaned-up | version of a 2-year-old product. It was billed as a research | preview. | | So, all of a sudden you go from hiring mostly researchers | because you only have to maintain an API and some mid-traffic | web infra, to suddenly having the fastest-growing web product | in history and having to scale up as fast as you can. Keep in | mind that they didn't get backing from Microsoft until January | 23, 2023-- that was only 2 months ago. | | I'd say we should cut them some slack. | picodguyo wrote: | These problems predate ChatGPT. Their API has been on the | market for nearly 3 years. And they raised their first $1B in | 2019. That's plenty of money and time to hire capable | leadership. | [deleted] | construct0 wrote: | The bug: https://github.com/redis/redis-py/issues/2624 | braindead_in wrote: | Was this written by ChatGPT? Maybe it found the bug as well, | who knows. | photochemsyn wrote: | > "If a request is canceled after the request is pushed onto | the incoming queue, but before the response popped from the | outgoing queue, we see our bug: the connection thus becomes | corrupted and the next response that's dequeued for an | unrelated request can receive data left behind in the | connection." | | The OpenAI API was incredibly slow, and lots of requests | probably got cancelled (I certainly was doing that) for some | days. I imagine someone could write a whole blog post about how | that worked; it would be interesting reading. | construct0 wrote: | .... "I am asking for this ticket to be re-opened, since I can | still reproduce the problem in the latest 4.5.3 version" | chatmasta wrote: | The PR: https://github.com/redis/redis-py/pull/2641 | | According to the latest comments there, the bug is only | partially fixed. | chatmasta wrote: | Why did it take them _9 hours_ to notice? The problem was | immediately obvious to anyone who used the web interface, as | evidenced by the many threads on Reddit and HN. | | > between 1 a.m. and 10 a.m. Pacific time. | | Oh... so it was because they're based in San Francisco. Do they | really not have a 24/7 SRE on-call rotation? Given the size of | their funding, and the number of users they have, there is really | no excuse not to at least have some basic monitoring system in | place for this (although it's true that, ironically, this | particular class of bug is difficult to detect in a monitoring | system that doesn't explicitly check for it, despite being | immediately obvious to a human observer). | | Perhaps they should consider opening an office in Europe, or | hiring remotely, at least for security roles. Or maybe they could | have GPT-4 keep an eye on the site! | guessmyname wrote: | > _[...] it was because they're based in San Francisco. Do | they really not have a 24/7 SRE on-call rotation?_ | | OpenAI is hiring Site Reliability Engineers (SRE) in case you, | or anyone you know, is interested in working for them: | https://openai.com/careers/it-engineer-sre . Unfortunately, the | job is an onsite role that requires 5 days a week in their San | Francisco office, so they do not appear to be planning to have | a 24/7 on-call rotation any time soon. | | Too bad, because I could support them in APAC (from Japan). | | Over 10 years of industry experience, if anyone is interested. | p1esk wrote: | Also, I heard their interviews (for any technical position) | are very tough. | eep_social wrote: | Staffing an actual 24x7 rotation of SREs costs about a million | dollars a year in base salary as a floor, and there are few SREs | for hire.
A metrics-based monitor probably would have triggered | on the increased error rate, but it wouldn't have been | immediately obvious that there was also a leaking cache. The | most plausible way to detect the problem from the user | perspective would be a synthetic test running some affected | workflow, built to check that the data coming back matches | specific, expected strings (not just well-formed). All possible, | but none of this sounds easy to me. Absolutely none of this is | plausible when your startup business is at the top of the news | cycle every single day for the past several months. | namaria wrote: | Every system failure prompts people to exclaim "why aren't there | safeguards?". Every time. Well, guess what: if we try to do | new stuff, we will run into new problems. | wouldbecouldbe wrote: | There is nothing new about using redis as a cache, or | returning a list for a user. | namaria wrote: | Are you trying to say cache invalidation in a distributed | system is a trivial problem? | oulu2006 wrote: | It's non-trivial, but it's also not that hard; there are | well-known strategies for achieving it. Especially if you | relax guarantees and only promise eventual consistency, | then it becomes fairly trivial - we do this, for example, | and have few problems with it. | chatmasta wrote: | I'm not disagreeing with you, and I'm not the commenter | you're replying to, but it's worth noting that cache | leakage and cache invalidation are two different | problems. | namaria wrote: | You're right. Thanks for pointing that out. My original | point still stands: distributed systems are hard, and | people demanding zero failures are setting an impossible | standard. | sosodev wrote: | "there are few SREs for hire" | | How do you figure? If you mean there are few SREs with several | years of experience, you might be right. SRE is a fairly new | title, so that's not too surprising.
| | However, my experience with a recent job search is that most | companies aren't hiring SREs right now because they consider | reliability a luxury. In fact, I was in search of a new SRE | position because I was laid off for that very reason. | chatmasta wrote: | You don't even need an SRE to have an on-call rotation; you | could ping a software engineer who could at least recognize | the problem and either push a temporary fix, or try to wake | someone else to put a mitigation in place (e.g. disabling | the history API, which is what they eventually did). | | However, I think the GP's point about this class of bug | being difficult to detect in a monitoring system is the | more salient issue. | eep_social wrote: | Well, hang on! Your question was why the time to | detect was so high, and you specifically mentioned 24x7 SRE, so | I thought that's what we were talking about ;) | | And I do think the answer is that monitoring is easy, but | good monitoring takes a whole lot of work. Devops teams | tend to stop at sufficient observability, whereas an SRE team | can dedicate its time to engineering great | observability because the SRE team is not being pushed by | product to deliver features. A functional org will | protect SRE teams from that pressure; a great one will | allow the SRE team to apply counter-pressure from the | reliability and non-functional perspective to the product | perspective. This equilibrium is ideal because it allows | speed but keeps a tight leash on tech debt by developing | rigor around what is too fast or too many errors or | whatever your relevant metrics are. | eep_social wrote: | I've anecdotally observed the opposite. I have noticed SRE | jobs remain posted, even by companies laying off or | announcing some kind of hiring slowdown over the last | quarter or so. More generally, businesses that have decided | that they need SRE are often building out from some kind of | devops baseline that has become unsustainable for the dev | team.
When you hit that limit and need to split out a | dedicated team, there aren't a ton of alternatives to | getting an SRE or two in and shuffling some prod-oriented | devs to the new SRE team (or building a full team from | scratch, which is what the $$ was estimating above). Among | other things, the SRE bailiwick includes capacity planning | and resource efficiency; SRE will save you money in the | long term. | | On a personal note, I am sorry to hear that your job search | has not yet been fruitful. Presumably I am interested in | different criteria from you --- I have found several | postings that are quite appealing, to the point where I am | updating my CV and applying, despite being weakly motivated | at the moment. | pharmakom wrote: | They raised a billion dollars. | dharmab wrote: | You don't necessarily need a full team of SREs -- you can also | have a lightly staffed ops center with escalation paths. | eep_social wrote: | I don't think that model has the properties you think it | does. Someone still has to take call to back the operators. | Someone has to build the signals that the ops folks watch. | Someone has to write criteria for what should and should | not be escalated, and in a larger org they will also need | to know which escalation path is correct. And on and on -- | the work has to get done somewhere! | majormajor wrote: | The way those criteria usually get written in a startup | with mission-critical customer-facing stuff (like this | privacy issue) is that _first_ the person watching | Twitter and email and whatever else pages the engineers, | and _then_ there's a retro on whether or not that | particular one was necessary; lather, rinse, repeat. | | All you need on day 1 is someone to watch the | (metaphorical) phones + a way to page an engineer. Don't | start by spending a million bucks a year; start by having | a first aid kit at the ready.
| | Perhaps they could also help this person out by looking | into some sort of fancy software to automatically | summarize messages that were being sent to them, or their | mentions on Reddit, or something, even? | scarmig wrote: | Since it now handles visual inputs, I wonder how hard it'd be | to get GPT to monitor itself. Have it constantly observe a | set of screenshares of automated processes starting and | repeating ChatGPT sessions on prod, and alert the on-call when it | notices something "weird." | inconceivable wrote: | nobody qualified wants the 24/7 SRE job unless it pays an | enormous amount of money. i wouldn't do it for less than 500 | grand cash. getting woken up at 3am constantly or working 3rd | shift is the kind of thing you do with a specific monetary goal | in mind (i.e., early retirement) or else it's absolute hell. | | combine that with ludicrous requirements (the same as a senior | software engineer) and you get gaps in coverage. ask yourself | what senior software engineer on earth would tolerate getting | called CONSTANTLY at 3am, or working 3rd shift. | | the vast majority of computer systems just simply aren't as | important as hospitals or nuclear power plants. | nijave wrote: | Not only that, but you probably need follow-the-sun coverage if you | want <30 minute response time. | | Given a system that collects minute-based metrics, it | generally takes around 5-10 minutes to generate an alert. | Another 5-10 minutes for the person to get to their computer, | unless it's already in their hand (what if you get unlucky | and on-call was taking a shower or using the toilet?). After | that, another 5-10 minutes to see what's going on with the | system. | | After all that, it usually takes some more minutes to | actually fix the problem. | | Dropbox has a nice article on all the changes they made to | streamline incident response: | https://dropbox.tech/infrastructure/lessons-learned-in-incid...
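The synthetic test eep_social describes upthread (drive a known workflow and check that the data coming back matches specific, expected strings, not just that it is well-formed) might be sketched as follows. Everything here is hypothetical for illustration: the canary account, the sentinel titles, and the `fetch_history`/`page` callables are assumptions, not OpenAI's actual tooling. The point is that a cross-user leak fails this check even while every response is a 200 with valid-looking JSON, so a plain error-rate monitor never fires.

```python
# Hypothetical synthetic probe: a canary account owns a fixed set of
# conversation titles; the probe fetches its history and pages on-call
# whenever anything else comes back.
SENTINEL_TITLES = {"canary-conversation-1", "canary-conversation-2"}

def run_probe(fetch_history, page):
    """fetch_history() calls the history API as the canary account;
    page(msg) wakes up the on-call. Returns True when everything matched."""
    try:
        titles = set(fetch_history())
    except Exception as exc:
        page(f"history endpoint failing: {exc}")
        return False
    if titles != SENTINEL_TITLES:
        # Well-formed response, wrong content: possible cross-user leak.
        page(f"history mismatch, possible cross-user leak: {sorted(titles)}")
        return False
    return True

# Stub run: a healthy response passes, a leaked response pages on-call.
alerts = []
ok = run_probe(lambda: ["canary-conversation-1", "canary-conversation-2"],
               alerts.append)
leak = run_probe(lambda: ["how do I cook rice?"], alerts.append)
```

Run on a schedule, a check like this bounds time-to-detect at one probe interval, rather than the nine hours it took a human to notice.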
| mnahkies wrote: | Timezones are a thing - your 3am is someone's 9am, and that may be | a significant part of your customer base. | | Being paged constantly is a sign of bad alerts or bad systems, | IMO - either adjust the alert to accept the current reality | or improve the system. | inconceivable wrote: | spinning up a subsidiary in another country (especially one | with very strict labor laws, like in european countries) is | not as easy as "find some guy on the internet and pay him | to watch your dashboard." and then you have to give him root so he can | actually fix stuff without calling your domestic team, | which would defeat the whole purpose. | | also, even getting paged ONCE a month at 3am will fuck up | an entire week at a time if you have a family. if it | happens twice a month, that person is going to quit unless | they're young and need the experience. | mnahkies wrote: | Sorry, to be clear, I was replying to this part of your | comment: | | > the vast majority of computer systems just simply | aren't as important as hospitals or nuclear power plants. | | I agree that the stakes are lower in terms of harm, but I | was trying to express that whilst it might not be life | and death, it might be hindering someone's ability to do | their job / use your product - e.g. it still impacts | customer experience and your (business) reputation. | | False pages for transient errors are bad - ideally you | only get paged if human intervention is required, and | this should form a feedback cycle to determine how to | avoid it in future. If all the pages are genuine problems | requiring human action, then this should feed into tickets | to improve things. | chatmasta wrote: | It's really not that difficult, and there are providers | like Deel who can manage it all for you, to the point that you | just ACH them every month. | | Source: co-founder of a remote startup with employees in | five countries. | inconceivable wrote: | like you said, timezones are a thing. now you're managing | a global team.
| Godel_unicode wrote: | That sounds harder than it is, especially if you already | allow remote work. It mostly just forces you to have | better docs. | oulu2006 wrote: | I did that for a few years, and wasn't on 500k a year, but | I'm also the company co-founder, so you could argue that a | "specific monetary goal" was applicable. | [deleted] | cloudking wrote: | Probably because they launched ChatGPT as an experiment and | didn't think it would blow up, needing full-time SREs etc. I | don't think it was designed for scale and reliability when they | launched. | majormajor wrote: | You don't need 24/7 SREs; you could do it with 24/7 first-line | customer support staff monitoring Twitter, Reddit, and official | lines of comms, who have the ability to page the regular | engineering team. | | That's a lot easier to hire for, and lower cost. More training | required on what is worth waking people up over; way less in | terms of how to fix database/cache bugs. | CubsFan1060 wrote: | Do events like this cause them to lose enough revenue that it | would make sense to hire a bunch of SREs? | nijave wrote: | Probably the real reason. I assume they intend to make money | off enterprise contracts, which would include SLAs. Then | they'd set their support based off that. | chatmasta wrote: | Given the Microsoft partnership, they might not even need | to manage any real infrastructure. Just hand it off to | Azure and let them handle the details. | killerstorm wrote: | Serious question: Why do people feel it's necessary to use a | redis cluster? | | I understand in the early 2000s we were using spinning disks and it | was the only way. Well, we don't use spinning disks any more, do | we? | | A modern server can easily have terabytes of RAM and petabytes of | NVMe, so what's stopping people from just using postgres? | | A cluster of radishes is an anti-pattern. | lofaszvanitt wrote: | People know it, that's all. | cplli wrote: | For caching the query results you get from your database.
Also | it's easier to spin up Redis and replicate it closer to your | user than doing that with your main database. From my | experience anyway. | killerstorm wrote: | > For caching the query results you get from your database. | | This only makes sense if queries are computationally | intensive. If you're fetching a single row by index you | aren't winning much (or anything). | dpkirchner wrote: | Of course? I'm not really sure what the original question | actually is if you know that users benefit from caching the | results of computationally intensive queries. | killerstorm wrote: | OpenAI uses redis to store pieces of text. Fetching | pieces of text is not computationally intensive. | mannyv wrote: | Most likely they have them in an rdbms, so it's more like | joining a forum thread together. Not expensive, but why | not prebuild and store it instead? | acuozzo wrote: | > This only makes sense if queries are computationally | intensive. | | Or if the link to your DB is higher latency than you're | comfortable with. | mike_hearn wrote: | I think the idea is that if your db can hold the working set | in RAM and you're using a good db + prepared queries, you can | just let it absorb the full workload because the act of | fetching the data from the db is nearly as cheap as fetching | it from redis. | xp84 wrote: | I'm confused on why the need to complicate something as | seemingly-straightforward as a KV store into a series of queues | that can get all mixed up. I asked ChatGPT to explain it | though, and it sounds like the justification for its existence | is that it doesn't "block the event loop" while a request is | "waiting for a response from Redis." | | Last time I checked, Redis doesn't take that long to provide a | response. And if your Redis servers actually are that | overloaded that you're seeing latency in your requests, it | seems like simple key-based sharding would allow horizontally | scaling your Redis cluster. 
| | _Disclaimer: I am probably less smart than most people who | work at OpenAI so I'm sure I'm missing some details. Also this | is apparently a Python thing and I don't know it beyond surface | familiarity._ | zmj wrote: | I'm not familiar with the Python client specifically, but | Redis clients generally multiplex concurrent requests onto a | single connection per Redis server. That necessitates some | queueing. | adrr wrote: | My redis clusters are 10x more cost-effective than my | postgres db in handling load. | amtamt wrote: | For caching somewhat larger objects based on ETag? | eldenring wrote: | Yes! I have been spending the last couple of months pulling out | completely unnecessary redis caching from some of our internal | web servers. | | The only loss here is network latency, which is negligible when | you're colocated in AWS. | | Postgres's caches end up pulling a lot more weight too when | you're not only hitting the db on a cache miss from the web | server. | [deleted] | aadvark69 wrote: | Better concurrency (10k vs ~200 max connections in | postgres). ~20x faster than Postgres at key-value read/write | operations. (Mostly) single-threaded, so atomicity is achieved | without the synchronization overhead found in an RDBMS. | | Thus, it's much cheaper to run at massive scale like OpenAI's | for certain workloads, including KV caching. | | Also: | | - robust, flexible data structures and atomic APIs to | manipulate them are available out-of-the-box | | - large and supportive community + tooling | manv1 wrote: | 1. Redis can handle a lot more connections, more quickly, than | a database can. 2. It's still faster than a database, | especially a database that's busy. | | #2 is an interesting point. When you benchmark, the normal | process is to just set up a database then run a shitload of | queries against it.
I don't think a lot of people put actual | production load on the database and then run the same set of | queries against it... usually because you don't have a | production load in the prototyping phase. | | However, load does make a difference. It made more of a | difference in the HDD era, but it still makes a difference | today. | | I mean, redis is a cache, and you do need to ensure that stuff | works if you purge redis (i.e. be sure the rebuild process | works), etc, etc. | | But just because it's old doesn't mean it's bad. OS/390 and | AS/400 boxes are still out there doing their jobs. | hobobaggins wrote: | and those have reliable backup/restore infrastructure. Using | redis as a cache is fine, just don't use it as your primary | DB. | nijave wrote: | A pretty small Redis server can handle 10k clients and | saturate a 1Gbps NIC. You'd need a pretty heavy-duty Postgres | database and definitely need a connection pooler to come | anywhere close. | anarazel wrote: | I agree that redis can handle some query volumes and client | counts that postgres can't. | | But FWIW I can easily saturate a 10GBit ethernet link with | primary-key-lookup read-only queries, without the results | being ridiculously wide or anything. | | Because it didn't need any setup, I just used: | SELECT * FROM pg_class WHERE oid = 'pg_class'::regclass; | | I don't immediately have access to a faster network, but | connecting via tcp to localhost, and using some moderate | pipelining (common in the redis world afaik), I get up to | 19GB/s on my workstation. | qwerty3344 wrote: | using ChatGPT to summarize: | | This section summarizes the issue with the ChatGPT service, which | caused some users to be able to see another user's chat history | and payment information. OpenAI took the ChatGPT service offline | and published technical details of the problem. The bug was found | in the Redis client open-source library and has been patched.
| OpenAI has reached out to affected users and improved their | systems to prevent this from happening again. They have also | expressed their commitment to protecting user privacy and their | dedication to the open-source community. | zamnos wrote: | Interesting. I pasted in the outage report, and then asked | ChatGPT to give me a single sentence summary and it gave me an | answer way longer than a sentence. I pointed out that it wasn't | a single sentence and it gave me: | | > ChatGPT experienced an outage due to a bug in an open-source | library that allowed some users to see titles from another | active user's chat history and payment-related information of | 1.2% of ChatGPT Plus subscribers was exposed; the bug has since | been patched and actions have been taken to prevent future | incidents. | | I then asked it _why_ there was the outage and it spit out: | | > The exact cause of the Facebook outage is unknown, but it is | believed to be related to a configuration change in the | company's routers. | | It's likely because I ran out of tokens because the OpenAI | outage report is long. Pasting in the text of the outage | report, and then re-asking about why, it was able to give a | much better answer: | | > There was an outage due to a bug in an open-source library | that allowed some users to see titles from another active | user's chat history and also unintentionally exposed payment- | related information of 1.2% of ChatGPT Plus subscribers who | were active during a specific nine-hour window. 
| | Querying it further, again having to repeat the whole OpenAI | outage report, and asking it a few different ways, I eventually | managed to get this succinct answer: | | > The bug was caused by the redis-py library's shared pool of | connections becoming corrupted and returning cached data | belonging to another user when a request was cancelled before | the corresponding response was received, due to a spike in | Redis request cancellations caused by a server change on March | 20. | | It did take me more than a few minutes to get there, so just | actually reading the report would have been faster, and I ended | up having to read the report to verify that the answer was correct | and not a hallucination anyway, so our jobs are safe for now. | flangola7 wrote: | Try with GPT-4. The token window is four times larger. | layer8 wrote: | That sounds like the kind of bug that could be prevented by | modeling with TLA+. | m00dy wrote: | maybe they've just scrolled over issue lists of popular tech | stacks and cherry-picked the most compelling one to bury the | dirt. | w10-1 wrote: | It's interesting (read: wrong) for an AI company to bother | writing the user interface for their web application. | | This was a failure of integration testing and defensive design, | whether the component was open-source or not. There's no reason | to believe that an AI company would have the diligence and | experience to do the grunt work of hardening a site. | | But management obviously understood the level and character of | interest. Actual users include probably 10,000 curiosity seekers | for every actual AI researcher, with 1,000 of those being | commercial prospects -- people who might buy their service. | | This is a clear sign that the managers who've made technical | breakthroughs in AI are not capable even of deploying the | service at scale -- let alone managing the societal consequences of | AI.
| | The difficulty with the board getting adults in the room is that | leaders today give the appearance of humility and cooperation, | with transparent disclosures and incorporation of influencers | into advisory committees. The leaders may believe their own | abilities because their underlings don't challenge them. So | there's no obvious domineering friction, but the risk is still | there, because of inability to manage. | | Delegation is the key to scaling, code and organizations. "Know | thyself" is about knowing your limits, and having the humility to | get help instead of basking in the puffery of being in control. | | This isn't a PR problem. It's the Achilles' heel of capitalism, | and the capitalists in OpenAI's board should nip this incipient | Musk in the bud or risk losing 2-3 orders of magnitude return on | their investment. | stygiansonic wrote: | The key part: | | _If a request is canceled after the request is pushed onto the | incoming queue, but before the response popped from the outgoing | queue, we see our bug: the connection thus becomes corrupted and | the next response that's dequeued for an unrelated request can | receive data left behind in the connection._ | [deleted] | qwertox wrote: | This reminds me of a comment I made 1.5 months ago [0]: | | I was logging in during heavy load, and after typing the question | I started getting responses to questions which I didn't ask. | | gdb answered on that comment "these are not actually messages | from other users, but instead the model generating something | ~random due to hitting a bug on our backend where, rather than | submitting your question, we submitted an empty query to the | model." | | I wonder if it was the same redis-py issue back then, but just at | another point in the backend. His answer didn't really convince | me back then. 
| | [0] https://news.ycombinator.com/item?id=34614796&p=2#34615875 | lopkeny12ko wrote: | The original issue report is here: | https://github.com/redis/redis-py/issues/2624 | | This bit is particularly interesting: | | > I am asking for this ticket to be re-oped, since I can still | reproduce the problem in the latest 4.5.3. version | | Sounds like the bug has not actually been fixed, per drago-balto. | nvartolomei wrote: | I wonder how much time passed between the first case of | corruptions leading to exceptions (and they ignored it as "eh, | not great, not terrible, we'll look at it later") and users | reporting seeing other users' data? | jchw wrote: | Does anyone else find it a bit off-putting how much emphasis they | keep putting on "open source library"? I don't think I've read | about this without the word open source appearing more than once | in their own messaging about it. Why is it so important to | emphasize that the library with the bug is open source? | | The cynic in me wants to believe that it's a way of deflecting | blame somehow, to make it seem like they did their due diligence | but were thwarted by something outside of their control. I don't | think it holds. If you use an open source library with no | warranty, you are responsible (legally and otherwise) to ensure | that it is sufficient. For example, if you break HIPAA compliance | due to an open source library, it is still you who is responsible | for that. | | But of course, they're not claiming it's anyone else's fault | anywhere explicitly, so it's uncharitable to just assume that's | what they meant. Still, it rubs me the wrong way. I can't fight | the feeling that it's a _wink wink nudge nudge_ to give them more | slack than they'd otherwise get. It feels like it's inviting you | to just criticize redis-py and give them a break. | | The open postmortem and whatnot is appreciated and everything, | but sometimes it's important to be mindful of what you emphasize | in your postmortems.
People read things even if you don't write | them, sometimes. | babl-yc wrote: | I don't find it over-emphasized. Many in the Twitter-sphere are | acting as if they aren't being appreciative of open source | software and I don't see it that way. | | The technical root cause was in the open source library. | There's a patch available and more likely than not OpenAI will | continue to use the library. | | Being overly sensitive to blame would be distracting from the | technical issue at hand. It's great they are posting this post-mortem | to raise awareness that the libraries you use can have | bugs and to consider that risk when building systems. | fabianhjr wrote: | Root cause analysis would likely also include the lack of | threat modeling / security evaluation of their dependencies | | Would likely also question the lack of resources allocated to | these open source projects by companies with a lot of profits | from, in part, using those open source projects. | bitxbitxbitcoin wrote: | Not surprising from a company that calls itself openai. The | "open source" keyword stuffing is so people associate the open | from openai with open source. Psyops I mean marketing 101. | ishaanbahal wrote: | The emphasis could also have been done to educate folks using | this combination to check their setup. | | Though the version reference in the postmortem should also be | posted, as general guidance to their own readers, but at | least a quick google search leads you to it. | | https://github.com/redis/redis-py/releases/tag/v4.5.3 | | For anyone reading this and using a combination of asyncio and | redis-py, please bump your versions. | | Similar issues I've encountered with asyncio python and | postgres too in the past when trying to pool connections. It's | really not easy to debug them either. | layer8 wrote: | I think you're overreacting. What bothered me is that they | didn't link to the actual bug or provide a reference ID.
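The version-bump advice above can be enforced at startup with a small guard. A minimal sketch, assuming (per the release link in the thread) that the fix shipped in redis-py 4.5.3; the `parse_version` and `is_patched` helpers are illustrative, not part of redis-py:

```python
# Minimal sketch: refuse to run with a redis-py older than the patched
# release mentioned above. Helper names here are illustrative only.

def parse_version(s):
    """Turn '4.5.3' (or '4.5.3b1') into a comparable (4, 5, 3) tuple."""
    parts = []
    for piece in s.split(".")[:3]:
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break  # stop at pre-release suffixes like 'b1'
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # pad '4.5' to (4, 5, 0)
    return tuple(parts)

def is_patched(installed, fixed="4.5.3"):
    """True if the installed redis-py version includes the fix."""
    return parse_version(installed) >= parse_version(fixed)
```

At application startup one could then check `is_patched(redis.__version__)` before building the connection pool and fail loudly if it returns False.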
| mlsu wrote: | The gaping hole in this write-up goes something like: | | "In order to prevent a bug like this from happening in the | future, we have stepped up our review process for external | dependencies. In addition, we are conducting audits around code | that involves sensitive information." | | Of course, we all know what actually happened here: | | - we did no auditing; | | - because our audit process consists of "blame someone else | when our consumers are harmed"; | | - because we would rather not waste dev time on making sure our | consumers are not harmed | | If you want to know why no software "engineering" is happening | here, this is your answer. Can you imagine if a bridge | collapsed, and the builder of the bridge said, "iunno, it's the | truck's fault for driving over the bridge." | marshmellman wrote: | Are you confident that an audit would have uncovered this | bug? I'd be surprised if audits are effective at finding | subtle bugs and race conditions, but I could be wrong. | cwkoss wrote: | If the FTC had teeth and good judgement, they'd force OpenAI to | rename themselves. | kobalsky wrote: | the library is provided by the redis team themselves and the | bug is awful [1]. I know it's not redis' fault but this bug | could hit anyone. Connections may be left in a dirty state | where they return data from the previous request in the | following one. | | [1] https://github.com/redis/redis-py/issues/2624 | adrianmonk wrote: | I noticed it too, but it doesn't necessarily bother me. | Possibly they're just trying to say, "This incident may have | made us look like we're complete amateurs who don't have any | clue about security, but it wasn't like that." 
| | Using someone else's library doesn't absolve you of | responsibility, but failing to be vigilant at thoroughly | vetting and testing external dependencies is a different kind | of mistake than creating a terrible security bug yourself | because your engineers don't know how to code or everyone is in | too much of a rush to care about anything. | thefreeman wrote: | They really skirt around the fact that they apparently | introduced a bug which quite consistently initiated redis | requests and terminated the connection before receiving the | result. | mewpmewp2 wrote: | Yes, I agree with that sentiment, and I thought precisely the | same. I know as an engineer that I would feel compelled to | mention that it was an obscure bug in an open source library, | if that was the case. Not to excuse myself of responsibility, | but because I would feel so ashamed if I myself introduced | such an obvious security flaw. I would still of course | consider myself responsible for what happened. | | A lot of the time when people make mistakes, they explain | themselves because they are afraid of being perceived as | completely stupid or incompetent for making that mistake, not | to excuse themselves from taking responsibility, even though | people frequently think that excuses or explanations mean | that you are trying to absolve yourself of what you did. | | There's a huge difference to me between having an obscure bug | like this and introducing that type of security issue because | you couldn't logically consider it. The first one can be resolved | in the future by introducing processes and making sure all open | source libraries are from trusted sources, but the second one | implies that you are fundamentally unable to think it through and | therefore probably unable to improve on that. | mlsu wrote: | Why? | | The result for the end consumer is identical whether they | have their PII leaked from "an external library" vs a | vendor's own home-baked solution.
| | It's not really a different kind of mistake, it's exactly the | same kind of mistake, because it is exactly the same mistake! | This is talking the talk, and not walking the walk, when it | comes to security. | | Publishing a writeup that passes the buck to some (unnamed) | overworked and underpaid open source maintainer is _worse_, | not better! | Veserv wrote: | I agree, it is a different kind of mistake; it is immensely | worse than creating a terrible security bug yourself. | | Outsourcing your development work without acceptance | criteria and without validation for fitness of purpose is | complete, abject engineering incompetence. Do you think | bridge builders look at the rivets in the design and then | just waltz over to Home Depot and just pick out one that | looks kind of like the right size? No, they have exact | specifications and it is their job to source rivets that meet | those specifications. They then either validate the rivets | themselves or contract with a reputable organization that | _legally guarantees_ they meet the specifications, and it | might be prudent to validate it again anyways just to be | sure. | | The fact that, in software, not validating your dependencies, | i.e. the things your system _depends_ on, is viewed as not so | bad is a major reason why software security is such an utter | joke and why everybody keeps making such utterly egregious | security errors. If one of the worst engineering practices is | viewed as normal and not so bad, it is no wonder the entire | thing is utterly rotten. | jchw wrote: | I do not believe it's necessarily _nefarious_ in nature, but | maybe more specifically it feels kind of like they're | implying that this is actually a valid escape hatch: "Sorry, | we can't possibly audit this code because who audits all of | their open source deps, amirite?" | | But the truth is that actually, maybe that hints at a deeper | problem.
It was a direct dependency to their application code | in a critical path. I mean, don't get me wrong, I don't think | everyone can be expected to audit or fund auditing for every | single line of code that they wind up running in production, | and frankly even doing that might not be good enough to | prevent most bugs anyways. Like clearly, every startup fully | auditing the Linux kernel before using it to run some HTTP | server is just not sustainable. But let's take it back a | step: if the point of a postmortem is to analyze what went | wrong to prevent it in the future, then this analysis has | failed. It almost reads as "Bug in an open source project | screwed us over, sorry. It will happen again." I realize | that's not the most charitable reading, but the one takeaway | I had is this: They don't actually know how to prevent this | from happening again. | | Open source software helps all of us by providing us a wealth | of powerful libraries that we can use to build solutions, be | we hobbyists, employees, entrepreneurs, etc. There are many | wrinkles to the way this all works, including obviously | discussions regarding sustainability, but I think there is | more room for improvement to be had. Wouldn't it be nice if | we periodically had actual security audits on even just the | most popular libraries people use in their service code? | Nobody in particular has an impetus to fund such a thing, but | in a sense, everyone has an impetus to fund such work, and | everyone stands to gain from it, too. Today it's not the | norm, but perhaps it could become the norm some day in the | future? | | Still, in any case... I don't really mean to imply that | they're being nefarious with it, but I do feel it comes off | as at _best_ a bit tacky. | xxpor wrote: | I mean, if there were ever a company in a position to | figure out a scalable way to audit OSS before usage, it'd | be OpenAI, right? | jvm___ wrote: | Doesn't bother me either. 
All the car companies issue recalls | regularly; sometimes an issue only shows up when the system | hits capacity or you run into an edge case. | skybrian wrote: | I think you're reading too much into it. Being an open source | library is relevant because it means it's third party and | doesn't come with a support agreement, so fixing a bug is a | somewhat different process than if it were in your own code or | from a proprietary vendor. | | Yes, it's technically up to you to vet all your dependencies, | but in practice, often it doesn't happen, people make | assumptions that the code works, and that's relevant too. | fabianhjr wrote: | Open source can be fixed as if it was your own code. (And | that is a strong tenet of free/open source software) | | Not only do most open/free source libraries come without | support agreements: they come with the broadest possible | limitation of warranties. (As they should) | | So the company, knowing that what they are using comes | without any warranty either of quality or fitness to the use- | case, has a very strong burden of due diligence / vetting. | danenania wrote: | Also, vetting a dependency != auditing and testing every line | of code to find all possible bugs. | | If this bug was an open issue in the project's repo, that | might be concerning and indicate that proper vetting wasn't | done. Ditto if the project is old and unmaintained, doesn't | have tests, etc. But if they were the first to trigger the | bug and it only occurs under heavy load in production | conditions, well, running into some of those occasionally is | inevitable. The alternative is not using any dependencies, in | which case you'd just be introducing these bugs yourself | instead. Even with very thorough testing and QA, you're never | going to perfectly mimic high load production conditions.
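The failure mode the thread keeps circling can be reproduced in miniature without Redis at all. A toy sketch (NOT redis-py's actual code) of a pooled connection whose response FIFO gets out of sync when a request is cancelled between send and receive:

```python
# Toy model of the bug discussed above: a connection keeps a FIFO of
# server replies, and a request cancelled between "send" and "receive"
# leaves its reply behind for whoever uses the connection next.
from collections import deque

class ToyConnection:
    def __init__(self):
        self.responses = deque()  # replies queued by the "server"

    def send(self, key):
        # Pretend the server answers instantly with the cached value.
        self.responses.append(f"value-for-{key}")

    def receive(self):
        return self.responses.popleft()

conn = ToyConnection()

conn.send("user-A-session")   # request A is sent...
# ...but A is cancelled here, before its reply is ever read.

conn.send("user-B-session")   # request B reuses the dirty connection
wrong = conn.receive()        # -> "value-for-user-A-session": B gets A's data
```

Once the queues are off by one they stay off by one, which is why a single spike of cancellations (like the server change on March 20) can keep serving wrong answers until the connection is torn down.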
| JohnFen wrote: | > in practice, often it doesn't happen, people make | assumptions that the code works | | True, but that's an inexcusable practice and always has been. | We as an industry need to stop accepting it. | isopede wrote: | What do you mean by "stop accepting it?" | | All of us rely on millions of lines of code that we have | not personally audited every single day. Have you audited | every framework you use? Your kernel? Drivers? Your | compiler? Your CPU microcode? Your bootrom? The firmware in | every gizmo you own? | | If "Reflections on Trusting Trust" has taught us anything, | it's turtles all the way down. At some point, you have to | either trust something, or abandon all hope and trust | nothing. | JohnFen wrote: | > Have you audited every framework you use? Your | compiler? Your CPU microcode? Your bootrom? | | Of course not. I exclude the CPU microcode, bootrom, and | the like from the discussion because that's not part of | the product being shipped. | | But it's also true that I don't do a deep dive analyzing | every library I use, etc. I'm not saying that we should | have to. | | What I'm saying is that when a bug pops up, that's on us | as developers even when the bug is in a library, the | compiler, etc. A lot of developers seem to think that | just because the bug was in code they didn't personally | write, that means that their hands are clean. | | That's just not a viable stance to take. The bug should | have been caught in testing, after all. | | If your car breaks down because of a design failure in a | component the auto manufacturer bought from another | supplier, you'll still (rightfully) hold the auto | manufacturer responsible. | skybrian wrote: | > when a bug pops up | | That's reacting to a bug you know about. Do you mean to | talk about how developers aren't good enough at reacting | to bugs found in third party libraries, or how they | should do more prevention? 
| | In this case, it seems like OpenAI reacted fairly | appropriately, though perhaps they could have caught it | sooner since people reported it privately. | | "Holding someone responsible" is somewhat ambiguous about | what you expect. It seems reasonable that a car | manufacturer should be prepared to do a recall and to pay | damages without saying that they should be perfect and | recalls should never happen. | JohnFen wrote: | > Do you mean to talk about how developers aren't good | enough at reacting to bugs found in third party | libraries, or how they should do more prevention? | | My point was neither of these. My point is very simple: | the developers of a product are responsible for how that | product behaves. | | I'm not saying developers have to be perfect, I'm just | saying that there appears to be a tendency, when | something goes wrong because of external code, to deflect | blame and responsibility away from them and onto the | external code. | | I think this is an unseemly thing. If I ship a product | and it malfunctions, that's on me. The customer will | rightly blame me, and it's up to me to fix the problem. | | Whether the bug was in code I wrote or in a library I | used isn't relevant to that point. | JohnFen wrote: | > The cynic in me wants to believe that it's a way of | deflecting blame somehow | | That's how it reads to me as well. | | Of course, it doesn't deflect blame at all. Any time you | include code in your project, no matter where the code came | from, you are responsible for the behavior of that code. | amtamt wrote: | Was the postmortem generated by ChatGPT? | dilap wrote: | I half agree, but I also half-sympathize with them, because it | really wasn't their fault -- it was a quite-bad bug in a very | fundamental library. | | Bugs happen, though. Especially in Python. | airstrike wrote: | _> Especially in Python._ | | as opposed to...? | moffkalast wrote: | As opposed to not in Python. | deathanatos wrote: | ... like JavaScript? Bash?
C? PHP? | | Certainly none of those are widely used and have a | reputation for making it easy to keep the gun aimed | squarely at the foot. | moffkalast wrote: | Those would be roughly similar. The main difference would | be between dynamically typed interpreted languages and | statically typed compiled ones I guess. At least I think | I make fewer mistakes when the compiler literally tells me | what's wrong before I even run the thing. It's awful and | slow to develop that way, but it is more reliable for | when that's a requirement. | | So compared to ones like Kotlin or Rust. | dilap wrote: | Go, for one. | | In my experience errors are more common (for both cultural | and technological reasons) in Python than in Go. | | I would guess something similar applies to Rust, though I | don't have personal experience. | | There's wide variation in C, but with careful | discrimination, you can find very high-quality libraries or | software (redis itself being an excellent example). | | I don't have rigorous data to back this stuff up, but I'm | pretty convinced it's true, based on my own experience. | qwertox wrote: | I was upvoting you, but then reading | | > Especially in Python. | | made me unvote. | kljhghfgdfjkgh wrote: | it really _was_ their fault. they chose to ship the bug. it | doesn't matter in the least that someone else previously | published the code under a license with no warranty | whatsoever. | gkbrk wrote: | Instead of spending engineering time, they used a free and | open-source library to do less work. | | The license they agreed to in order to use this library has | this in capital letters. [THE SOFTWARE IS PROVIDED "AS IS", | WITHOUT WARRANTY OF ANY KIND]. | | After agreeing to this license and using the library for | free, they charged people money and sold them a service.
And | when that library they got for free, which they read and | agreed had no warranty of any kind, had a software bug, | they wrote a blog post and blamed the outage of their paid | service on this free library. | | This is not another open-source project, or a small business. | This is a company that got billions of dollars in investment, | and a lot of income by selling services to businesses and | individuals. They don't get to use free, no-warranty code | written by others to save their own money, and then blame it | and complain about it loudly for bugs. | JohnFen wrote: | > it really wasn't their fault -- it was a quite-bad bug in a | very fundamental library. | | It's still their fault. When you ship code, you are | responsible for how that code behaves regardless of where the | code came from. | JamesBarney wrote: | Only for some incredibly broad definition of fault that | almost no one uses. | | How many people make sure all of the open source libraries | they're using are bug free? | | Anyone besides maybe NASA? | JohnFen wrote: | > Only for some incredibly broad definition of fault that | almost no one uses. | | It's a definition most laypeople use. It's developers who | tend to use a very narrow definition. | | I don't think it should be controversial to say that when | you ship a product, you are responsible for how that | product behaves. | pjmlp wrote: | Anyone that has to pay from their own pocket when things | go wrong, like consulting warranties, liability in | security exploits,... | majormajor wrote: | I've never cared per se that a library was bug free but | I've put a lot of effort/$ into making sure _the features | that used the libraries in my product_ were bug free | (with the amount of effort depending on the sensitivity | of the feature, data, etc).
| | Usually "fix the original library" wasn't as easy or | immediate a fix as "hack around it" which is sad just re: | the overall OSS ecosystem but still the person releasing | a product's responsibility. | | Unfortunately these sorts of bugs are wildly difficult to | predict. Yet it's also a wildly common architecture. | _That's_ what's sad for all of us as engineers as a | whole. But "caching credit card details and home | addresses", for instance, is... particularly dicey. | That's very sensitive, and you're tossing it into more | DBs, without good access control restrictions? | rschoultz wrote: | Anywhere where you have payments related or any other PII | data, then transitive dependencies, framework and | language choices, memory sharing and other risks are | taken into account as something that you as someone | developing and operating a service is solely responsible | for. | practice9 wrote: | There have been several reports of this issue in Feb/early | March on the r/ChatGPT subreddit - OpenAI could have known if | they had listened to the community. | | Alternatively, they knew about it, and didn't fix the bug | until it bit them | JamesBarney wrote: | This doesn't come across that way to me at all. They just | described what happened. Do you expect them to jump in front of | a bus for the library they're using, and beg for forgiveness | for not ensuring the widely used libraries they're leveraging | are bug free? | | There are very few companies that couldn't get caught by this | type of bug. | nickvincent wrote: | Basically agree -- feels off-putting, but not technically a | wrong detail to add. An additional reason it rubs me the wrong | way, however, is that I believe open-source software code is | especially critical to ChatGPT family's capabilities. Not just | for code-related queries, but for everything! (e.g. see this | "lineage-tracing" blog post: https://yaofu.notion.site/How- | does-GPT-Obtain-its-Ability-Tr...)
| | Thus, I honestly think firms operating generative AI should be | walking on eggshells to avoid placing blame on "open-source". | Rather, they really should be going out of their way to channel as | much positive energy towards it as possible. | | Still, I agree the charitable interpretation is that this is just | purely descriptive. | jatins wrote: | I think you are reading a bit between the lines, and didn't | feel them blaming the library as much as stating that the bug | happened because of an issue in the library. Maybe they could | have sugarcoated it between 10 layers of corporate jargon but | I'd rather take this over that | thequadehunter wrote: | Personally, I think it was partially a virtue signal to show | that they use open source software and collaborate with the | maintainers. | chamakits wrote: | I've also noticed it, and I can't help but interpret it as | their way of shifting blame. Which is irresponsible. It's their | product, and they need to take accountability for the bug | occurring. | | It's a serious bug, but in the grand scheme of things, not | earth shattering, and not something that I think would | discourage usage of their product. But their treatment of the | bug causes more concerns than the bug itself. They are shifting | the blame away from the work they did using a library with a | bug, rather than their process by which that library made it | into their product. And I don't understand how they can't see | how that reflects poorly on them as an AI company. | | I find it so confusing that at the end of the day, OpenAI's | biggest product is having created a good process by which to | create value out of a massive amount of data, and build a good | API on top of it. And the open source library is effectively | something they processed into their product and built an API | based off of it. So it creates (to me) some amount of doubt | about how they will react when faced with similar challenges to | their core product.
How will they behave when the data they | consume impacts their product negatively? From limited | experience, they'll shift the blame to the data, not their | process, and keep it pushing. | | It seems likely that this is only the beginning of OpenAI | having a large customer base, with a high impact on many | products. This is a disappointing result on their first test of | how they'll manage issues and bugs with their products. | metanonsense wrote: | I don't know. To me it's simply an explanation of what has | happened. I think it's exactly what I would have written if I | were in their position. And show me the one company that has | audited all source code of all used open source projects, at | least in a way that is able to rule out complex bugs like this. | I once found a memory corruption bug in Berkeley DB | wrecking our huge production database, which I would have never | found in any pre-emptive source code audit, however detailed. | | Edit: On second thought, maybe they could have just written | "external library" instead of "open source library". | davedx wrote: | They were/are storing payment data in redis? LOL! | taxman22 wrote: | The postmortem doesn't say that. It just says they were caching | "user information". Maybe that includes a Stripe customer or | subscription ID that they look up before sending an email, for | example. | tmpz22 wrote: | Yeah probably the session id and when the wrong session id is | returned other operations like GET User details would pull | its data from relational storage. | galnagli wrote: | Well - they have had more bugs and will have more bugs to worry | about.
| | https://twitter.com/naglinagli/status/1639343866313601024 | abujazar wrote: | The disclosure provides valuable information, but the | introduction suggests someone else or "open-source" is to | blame: | | >We took ChatGPT offline earlier this week due to a bug in an | open-source library which allowed some users to see titles from | another active user's chat history. | | Blaming an open-source library for a fault in a closed-source | product is simply unfair. The MIT licensed dependency explicitly | comes without any warranties. After all, the bug went unnoticed | until ChatGPT put it under pressure, and it was ChatGPT that | failed to rule out the bug in their release QA. | ajhai wrote: | > In the hours before we took ChatGPT offline on Monday, it was | possible for some users to see another active user's first and | last name, email address, payment address, the last four digits | (only) of a credit card number, and credit card expiration date | | This is a lot of sensitive data. It says 1.2% of ChatGPT Plus | subscribers active during a 9-hour window, which considering | their user base must be a lot. | mach1ne wrote: | It's a bit unclear if this means that 1.2% of all ChatGPT Plus | subscribers were active during that 9-hour window | jkern wrote: | Funnily enough I've had a very similar bug occur in an entirely | separate redis library. It was a pretty troubling failure mode to | suddenly start getting back unrelated data | pixl97 wrote: | There are 2 hard problems in computer science: cache | invalidation, naming things, and off-by-1 errors. | deathanatos wrote: | ... in this case this variant seems more appropriate:
        There are 3 hard problems in Computer Science:
        1. naming things
        2. cache invalidation
        3.
        4. off-by-one errors
        concurrency
| DeathArrow wrote: | Am I the only one terribly bored by the assault of trivial AI | news these last months? | | Every fart some AI related person makes becomes huge news.
And | it's followed by tens of random blog postings all posted to HN. | Nuzzerino wrote: | At least it isn't about the Rust language this time _grumbles_ | DeathArrow wrote: | Because Rust hasn't conquered AI the way it conquered crypto. | | But we will see AI stuff rewritten in Rust quite soon. | spprashant wrote: | For some reason I liked reading about Rust (or any other | technology) a lot more than the AI. | | Part of it is that the average engineer could understand and | grok what those articles were talking about, and I could | appreciate, relate, and if applicable criticize it. | | The AI news just seems to swing between hype and doomsday | prophecies, with little discussion about the technical aspects | of it. | | Obviously OpenAI choosing to keep it closed source makes any | in-depth discussion close to impossible, but also some of | this is so beyond the capabilities of an average engineer | with a laptop. It can be frustrating. | [deleted] | polyrand wrote: | Commit fixing the bug: | | https://github.com/redis/redis-py/commit/66a4d6b2a493dd3a20c... | ketchupdebugger wrote: | It's surprising that openai seems to be the only one being | affected. If the issue is with redis-py reusing connections then | wouldn't more companies/products be affected by this? | zzzeek wrote: | their description of the problem seemed kind of obtuse, in | practice, these connection-pool related issues have to do with | 1. request is interrupted 2. exception is thrown 3. catch | exception, return connection to pool, move on. The thing that | has to be implemented is 2a. clean up the state of the | connection when the interrupted exception is caught, _then_ | return to the pool. | | that is, this seems like a very basic programming mistake and | not some deep issue in Redis. the strange way it was described | makes it seem like they're trying to conceal that a bit.
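The cleanup step zzzeek describes (2a: scrub the connection on interruption, _then_ return it to the pool) can be sketched in a few lines. This is a toy illustration under assumed names (`ToyPool`, `PooledConnection`, `run_request` are all hypothetical, not redis-py's API):

```python
# Sketch of the cleanup pattern described above: when a request is
# interrupted between send and receive, reset the connection before
# returning it to the pool, so stale replies can't leak into the next
# request. All names here are illustrative.
from collections import deque

class PooledConnection:
    def __init__(self):
        self.responses = deque()  # replies not yet read by a client

    def reset(self):
        self.responses.clear()    # drop anything left unread

class ToyPool:
    def __init__(self):
        self.free = deque()

    def get(self):
        return self.free.popleft() if self.free else PooledConnection()

    def release(self, conn, dirty=False):
        if dirty:
            conn.reset()          # the crucial "step 2a"
        self.free.append(conn)

def run_request(pool, payload, cancel_before_receive=False):
    conn = pool.get()
    dirty = True
    try:
        conn.responses.append(payload)            # "send"; reply queued
        if cancel_before_receive:
            raise TimeoutError("client gave up")  # simulated cancellation
        result = conn.responses.popleft()         # "receive"
        dirty = False
        return result
    finally:
        pool.release(conn, dirty=dirty)
```

Without the `dirty` scrub in `release`, the cancelled request's reply would sit in `responses` and be handed to the next caller on the same pooled connection, which is exactly the mix-up the postmortem describes.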
| roberttod wrote: | It's an open source library, I assume that logic is | abstracted within it and that the "basic mistake" was one of | the maintainers'. | 19h wrote: | It boggles my mind how they're not absolutely checking the user & | conversation id for EVERY message in the queue given the possible | sensitivity of the requests. How is this even remotely | acceptable? | | In the one reddit post first surfacing this the user saw | conversations related to politics in china and other rather | sensitive topics related to CCP. | | This can absolutely get people hurt and they absolutely must take | this seriously. | zaroth wrote: | It doesn't boggle my mind at all. Session data appears, and is | used to render the page. Do you verify the actual cookie every | time and go back to the DB to see what user it pointed to? | | No, everyone assumes their session object is instantiated with | the right values at that level of the code. | m_0x wrote: | Did they use chat-gpt to fix the bug? | sergiotapia wrote: | It sounds like their redis key was not unique enough and yada | yada yada it returned sensitive info to the wrong people. | Jabrov wrote: | Did you read the article? That's not at all what happened. ___________________________________________________________________ (page generated 2023-03-24 23:00 UTC)