[HN Gopher] Discovering Azure's unannounced breaking change with...
___________________________________________________________________
Discovering Azure's unannounced breaking change with Cosmos DB

Author : jmartens
Score  : 99 points
Date   : 2022-10-13 18:55 UTC (4 hours ago)

(HTM) web link (metrist.io)
(TXT) w3m dump (metrist.io)

| rroot wrote:
| girvo wrote:
| andrewstuart2 wrote:
| What are you doing here if you're not going to RTFA? The fifth paragraph pretty clearly describes the issue before they go into depth on how they determined that Azure did indeed publish a backwards-incompatible change without notice.
| speedgoose wrote:
| I believe it is easy for well-made software to immediately detect and report what goes wrong, with Sentry, ELK, or whatever else.
|
| So, let's say I'm woken up in the middle of the night because my black box database as a service suddenly returns errors. If I'm not incompetent, I should have error messages and stacktraces available in a few seconds. If I'm a rich cloud customer, I can call the premium cloud support and ask for an explanation. If not, I would probably have to debug it myself.
|
| With your service, I understand that I can blame the cloud provider faster. Maybe it can make the debugging session slightly faster when your monitoring also returns errors. End users don't care whether it's my code or the cloud provider's code crashing, so it's a developer tool for emergencies. Did I understand correctly?
| jmartens wrote:
| You got it right, it's a developer tool. It's not hard to get alerted about an issue, or to suspect a cloud dependency. Verifying it, which is typically required to take action, is what can take 10-30 minutes.
| dec0dedab0de wrote:
| This seems like an accident. Microsoft should treat it as a bug, and set the default on their backend to fix it.
| dharmab wrote:
| Back around 2017-2018, unannounced breaking changes in Azure services were so common that my team coined a term "Cloud Monday" (echoing Patch Tuesday), because usually our integration tests would break between 8 and 10 AM Pacific Time on Mondays. (They did eventually become far less frequent.)
| TurkTurkleton wrote:
| > my team coined a term "Cloud Monday"
|
| Azure being a shade of blue, you should've called it "Blue Monday"[0]. Could've even rigged up something to play the song when integration tests mysteriously failed. _How does it feel / to treat me like you do?/ When you've laid your hands upon me/ and told me who you are?_...
|
| [0]: https://www.youtube.com/watch?v=c1GxjzHm5us
| hupt wrote:
| Cosmos was originally created for hosting massive datasets internally within Microsoft. For example, they use it for the OS telemetry sent in from customer machines, and raw data for threat intelligence. As part of Microsoft's move of everything hosted on-premise to their cloud, they decided to open up Cosmos to other users of Azure. But the primary customer is and will likely always be Microsoft themselves. Which is probably why we see these breaking changes; they are most likely in response to some internal ticket.
| CurtHagenlocher wrote:
| CosmosDB is not the same as the internal Cosmos system.
| int0x2e wrote:
| Cosmos != CosmosDB.
|
| The two have nothing in common (and trust me, it sure is fun having to constantly make sure which of the two someone is actually referring to every time...).
| prepend wrote:
| This reinforces my idea that no one uses Cosmos because it is utter garbage.
|
| It sounds cool, but I was surprised when, after what I think should be the worst and dumbest security design flaw breach [0], there wasn't much uproar.
|
| I thought maybe no one is using it so there wasn't much impact.
|
| Pushing out breaking changes without telling your customers is also explained by there not being any users (or not many, since these folks found it).
|
| Could you imagine how big of a deal it would be if a breaking change or elevated privs bug were in actually used products?
|
| [0] https://www.techtarget.com/searchsecurity/news/252505973/Res...
| DishyDev wrote:
| As someone whose job involves maintaining uptime of a critical system that's dependent on Cosmos DB, this sort of thing is scary. When there have been other reliability issues with Cosmos before, we've not had an understanding customer base, and it feels very out of my control.
|
| I'm finding a lot of the reliability guarantees of Azure PaaS services are overblown or come with big caveats when you start to work with them in a serious way. For example, I've had some bad reliability issues with Azure Functions not firing, or their premium function runtimes becoming unresponsive. And it seems like that's just the start of the outstanding issues with them: https://github.com/Azure/azure-functions-host/issues
|
| I think people need to look more carefully at these PaaS guarantees and look at what that 99.999% reliability Microsoft are claiming actually means.
| rrdharan wrote:
| Even as bad as their reliability issues are, I'd still be more worried about their security issues:
|
| https://www.wiz.io/blog/chaosdb-explained-azures-cosmos-db-v...
| https://msrc-blog.microsoft.com/2021/08/27/update-on-vulnera...
| xiwenc wrote:
| Do you know what blew my mind? When Azure executes maintenance on, for instance, PostgreSQL servers, there is no record of that activity in the activity logs and nothing to note in service health, even though the service was unavailable during the maintenance. Worse still, when the database is unusable due to an incident, with the CPU maxed out and no connection succeeding, nothing is detected.
|
| How can this be a premium IaaS/PaaS?
| Azure feels like the MS Teams of teleconferencing. Companies buy in because they are already in the MS world, not because Azure is better.
| nobodyandproud wrote:
| For new projects, why wouldn't anyone use Postgres?
| semicolon_storm wrote:
| It's a lot easier to sell Microsoft products to management when working at a Microsoft shop.
| jen20 wrote:
| One reason is wanting zero-downtime failover.
| johndfsgdgdfg wrote:
| Just be glad that they didn't shut down the service unannounced like Google.
| pb7 wrote:
| Reading through your past comments, it's clear that you have a strong dislike of Google[0] and a history of reactionary comments lacking both substance and clarification when challenged[1,2,3,4,5,6,7]. If you're not going to post anything worthwhile, perhaps it's best for you to skip over posts about Google since it's clear you have an axe to grind and nothing more.
|
| >HN used to be a place for interesting discussions. Now it's a grievance forum for entitled freeloaders.[8]
|
| Be the change you seek.
|
| [0] https://news.ycombinator.com/item?id=33120431
| [1] https://news.ycombinator.com/item?id=33183900
| [2] https://news.ycombinator.com/item?id=33158451
| [3] https://news.ycombinator.com/item?id=33102921
| [4] https://news.ycombinator.com/item?id=33102794
| [5] https://news.ycombinator.com/item?id=33102761
| [6] https://news.ycombinator.com/item?id=32937987
| [7] https://news.ycombinator.com/item?id=32868992
| [8] https://news.ycombinator.com/item?id=32657508
| metadat wrote:
| Thanks for pointing this out. As a self-admitted Google disliker, I would prefer to at least see more variation in the rhetoric. The same message spouted over and over makes for extremely dull reading.
| NicoJuicy wrote:
| Which service did Google shut down unannounced in their cloud offering?
| metadat wrote:
| Stadia, their cloud gaming platform.
| It was cancelled without warning two weeks ago:
|
| https://news.ycombinator.com/item?id=33022768
| dagss wrote:
| This rhymes with my overall impression of Cosmos. It took us a while to see through the smokescreen because when talking to Microsoft support and representatives it is the Best Thing Ever and they sound so confident about it. But it really is a beta demo product sold with an alpha premium price tag.
|
| If your traffic pattern is exactly right, and you always scale traffic up and never ever down and do not have spikes, I guess it is probably OK. The main problem is the docs are (or at least were 2 years ago) not clear about all the caveats and restrictions but pretend it is a generic database that just works. So one has to discover all the caveats oneself.
|
| Microsoft thinks the partitioning should work so well that you don't need to know its exact workings in detail. But if your use case is slightly off, you end up really needing to know. I know at least one team who routinely copy all their data from one Cosmos instance to another and switch over traffic to the copy just to get a partitioning reset; it is one thing to have to do it, another to discover in production, with no prior warning, that it has to be done.
|
| Also: The ipython+portal+Cosmos security meltdown from 1 1/2 years ago alone should be reason to look elsewhere.
|
| (No, not a competitor, just have spent way way way too much engineering time moving first on and then off Cosmos and yes I am bitter)
| VWWHFSfQ wrote:
| After suffering through the AWS SimpleDB disaster 10 years ago I will never use any of the cloud providers' harebrained databases ever again. I'll use bog-standard Postgres or MySQL if they host it but nothing else.
| nobodyandproud wrote:
| Is this analogous to NTFS?
|
| For you young'uns, back in the 1990s Microsoft was so convinced that NTFS made file fragmentation impossible that they didn't provide a way to defrag for a very long time.
| MrBuddyCasino wrote:
| I used it a few months ago, it still is a half-baked piece of shit. Code quality of the client libs is even worse than AWS's.
|
| It could have been easy. We could have used Postgres.
| PaulWaldman wrote:
| >(No, not a competitor, just have spent way way way too much engineering time moving first on and then off Cosmos and yes I am bitter)
|
| Can you share what you migrated onto and the results?
| jmartens wrote:
| We used our own product to learn about and debug the issue. It's rather wild that they'd roll out this change so incrementally, which my colleague outlines here.
| Scaevolus wrote:
| Gradual rollouts are pretty typical to give the team a chance to do a rollback before they cause a complete outage. This particular usage pattern probably just didn't appear as a significant enough spike in error rates.
| jmartens wrote:
| Ya, that makes sense, it really isn't a normal use case. I wish we had kept tracking the other regions to see if they have had this change roll out to them yet, or if it's still slow rolling.
| twodave wrote:
| Funny, just last week I was having an argument with one of our team leads. I'd told him to create a specific container without a partition key (which I wouldn't recommend except in certain circumstances), and he said he couldn't. I assumed he was just doing it wrong.
| int0x2e wrote:
| In a document store, what does it even mean to create a container without a partition key? The document store has to partition the data somehow, and doing so implicitly sounds dangerous to me since all you're doing is creating a hotspot on one of the partitions...
| whalesalad wrote:
| This is very typical Microsoft behavior, unfortunately.
___________________________________________________________________
(page generated 2022-10-13 23:00 UTC)