[HN Gopher] Discovering Azure's unannounced breaking change with...
       ___________________________________________________________________
        
       Discovering Azure's unannounced breaking change with Cosmos DB
        
       Author : jmartens
       Score  : 99 points
       Date   : 2022-10-13 18:55 UTC (4 hours ago)
        
 (HTM) web link (metrist.io)
 (TXT) w3m dump (metrist.io)
        
       | rroot wrote:
        
         | girvo wrote:
        
         | andrewstuart2 wrote:
         | What are you doing here if you're not going to RTFA? The fifth
         | paragraph pretty clearly describes the issue before they go
         | into depth on how they determined that Azure did indeed publish
         | a backwards-incompatible change without notice.
        
       | speedgoose wrote:
       | I believe it is easy for well-made software to immediately
       | detect and report what goes wrong. With Sentry, ELK, or whatever
       | else.
       | 
       | So, let's say I'm woken up in the middle of the night because my
       | black box database as a service suddenly returns errors. If I'm
       | not incompetent, I should have error messages and stacktraces
       | available in a few seconds. If I'm a rich cloud customer, I can
       | call the premium cloud support and ask for an explanation. If
       | not, I would probably have to debug it myself.
       | 
       | With your service, I understand that I can blame the cloud
       | provider faster. Maybe it can make the debugging session slightly
       | faster when your monitoring also returns errors. End users don't
       | care whether it's my code or the cloud provider code crashing, so
       | it's a developer tool for emergencies. Did I understand correctly?
        
         | jmartens wrote:
         | You got it right, it's a developer tool. It's not hard to get
         | alerted about an issue, or to suspect a cloud dependency.
         | Verifying it, which is typically required to take action, is
         | what can take 10-30 minutes.
        
       | dec0dedab0de wrote:
       | This seems like an accident. Microsoft should treat it as a bug,
       | and set the default on their backend to fix it.
        
       | dharmab wrote:
       | Back around 2017-2018 unannounced breaking changes in Azure
       | services were so common, my team coined a term "Cloud Monday"
       | (echoing Patch Tuesday) because usually our integration tests
       | would break between 8-10AM Pacific Time on Mondays. (They did
       | eventually become far less frequent.)
        
         | TurkTurkleton wrote:
         | > my team coined a term "Cloud Monday"
         | 
         | Azure being a shade of blue, you should've called it "Blue
         | Monday"[0]. Could've even rigged up something to play the song
         | when integration tests mysteriously failed. _How does it feel /
         | to treat me like you do? / When you've laid your hands upon me /
         | and told me who you are?_...
         | 
         | [0]: https://www.youtube.com/watch?v=c1GxjzHm5us
        
       | hupt wrote:
       | Cosmos was originally created for hosting massive datasets
       | internally within Microsoft. For example they use it for the OS
       | telemetry sent in from customer machines, and raw data for threat
       | intelligence. As part of Microsoft's move of everything hosted
       | on-premises to their cloud, they decided to open up Cosmos to
       | other users of Azure. But the primary customer is and will likely
       | always be Microsoft themselves, which is probably why we see
       | these breaking changes: they'll be in response to some internal
       | ticket, most likely.
        
         | CurtHagenlocher wrote:
         | CosmosDB is not the same as the internal Cosmos system.
        
         | int0x2e wrote:
         | Cosmos != CosmosDB.
         | 
         | The two have nothing in common (and trust me, it sure is fun
         | having to constantly make sure which of the two someone is
         | actually referring to every time...).
        
       | prepend wrote:
       | This reinforces my idea that no one uses Cosmos because it is
       | utter garbage.
       | 
       | It sounds cool, but I was surprised when after what I think
       | should be the worst and dumbest security design flaw breach [0]
       | there wasn't much uproar.
       | 
       | I thought maybe no one is using it so there wasn't much impact.
       | 
       | Pushing out breaking changes without telling your customers also
       | gets explained by there not being any (or many since these folks
       | found it) users.
       | 
       | Could you imagine how big of a deal it would be if a breaking
       | change or elevated privs bug were in actually used products?
       | 
       | [0]
       | https://www.techtarget.com/searchsecurity/news/252505973/Res...
        
       | DishyDev wrote:
       | As someone whose job involves maintaining uptime of a critical
       | system that's dependent on Cosmos DB this sort of thing is scary.
       | Where there have been other reliability issues with Cosmos before,
       | we've not had an understanding customer base, and it feels very
       | out of my control.
       | 
       | I'm finding a lot of the reliability guarantees of Azure PaaS
       | services are overblown or come with big caveats when you start to
       | work with them in a serious way. For example I've had some bad
       | reliability issues with Azure Functions not firing, or their
       | premium function runtimes becoming unresponsive. And it seems
       | like that's just the start of the outstanding issues with them
       | https://github.com/Azure/azure-functions-host/issues
       | 
       | I think people need to look more carefully at these PaaS
       | guarantees and look at what that 99.999% reliability Microsoft
       | are claiming actually means.
        
         | rrdharan wrote:
         | Even as bad as their reliability issues are, I'd still be more
         | worried about their security issues:
         | 
         | https://www.wiz.io/blog/chaosdb-explained-azures-cosmos-db-v...
         | https://msrc-blog.microsoft.com/2021/08/27/update-on-vulnera...
        
         | xiwenc wrote:
         | Do you know what blew my mind? When Azure executes maintenance
         | on, for instance, PostgreSQL servers, there is no record of that
         | activity in the activity logs or anything to note in service
         | health. The service was unavailable during the maintenance. And
         | worse yet, when the database is unusable due to an incident (the
         | CPU is maxed out and it doesn't allow any successful connection),
         | nothing is detected.
         | 
         | How can this be a premium IaaS/PaaS? Azure feels like the MS
         | Teams of teleconferencing. Companies buy in because they are
         | already in the MS world, not because Azure is better.
        
       | nobodyandproud wrote:
       | For new projects, why wouldn't anyone use postgres?
        
         | semicolon_storm wrote:
         | It's a lot easier to sell Microsoft products to management when
         | working at a Microsoft shop.
        
         | jen20 wrote:
         | One reason is wanting zero-downtime failover.
        
       | johndfsgdgdfg wrote:
       | Just be glad that they didn't shut down the service unannounced
       | like Google.
        
         | pb7 wrote:
         | Reading through your past comments, it's clear that you have a
         | strong dislike of Google[0] and a history of reactionary
         | comments lacking both substance and clarification when
         | challenged[1,2,3,4,5,6,7]. If you're not going to post anything
         | worthwhile, perhaps it's best for you to skip over posts about
         | Google since it's clear you have an axe to grind and nothing
         | more.
         | 
         | >HN used to be a place for interesting discussions. Now it's a
         | grievance forum for entitled freeloaders.[8]
         | 
         | Be the change you seek.
         | 
         | [0] https://news.ycombinator.com/item?id=33120431
         | [1] https://news.ycombinator.com/item?id=33183900
         | [2] https://news.ycombinator.com/item?id=33158451
         | [3] https://news.ycombinator.com/item?id=33102921
         | [4] https://news.ycombinator.com/item?id=33102794
         | [5] https://news.ycombinator.com/item?id=33102761
         | [6] https://news.ycombinator.com/item?id=32937987
         | [7] https://news.ycombinator.com/item?id=32868992
         | [8] https://news.ycombinator.com/item?id=32657508
        
           | metadat wrote:
           | Thanks for pointing this out. As a self-admitted Google
           | disliker, I would prefer to at least see more variation in
           | the rhetoric. The same message spouted over and over makes
           | for extremely dull reading.
        
         | NicoJuicy wrote:
         | Which service did Google shut down unannounced in their cloud
         | offering?
        
           | metadat wrote:
           | Stadia, their cloud gaming platform. It was cancelled without
           | warning two weeks ago:
           | 
           | https://news.ycombinator.com/item?id=33022768
        
       | dagss wrote:
       | This rhymes with my overall impression of Cosmos. It took us a
       | while to see through the smokescreen because when talking to
       | Microsoft support and representatives it is the Best Thing Ever
       | and they sound so confident about it. But it really is a beta
       | demo product sold with an alpha premium price tag.
       | 
       | If your traffic pattern is exactly right, and you always scale
       | traffic up and never ever down and do not have spikes, I guess it
       | is probably OK. The main problem is the docs are (or, at least
       | were 2 years ago) not clear about all the caveats and
       | restrictions but pretend it is a generic database that just
       | works. So one has to discover all the caveats oneself.
       | 
       | Microsoft thinks the exact workings of the partitioning are
       | something that should work so well you don't need to know them in
       | detail. But if your use case is slightly off, you end up really
       | needing to know. I know at least one team who routinely copy all
       | their data from one Cosmos instance to another and switch over
       | traffic to the copy just to get a partitioning reset. It is one
       | thing to have to do it; another to discover in production
       | yourself, with no prior warning, that it has to be done.
       | 
       | Also: The ipython+portal+Cosmos security meltdown from 1 1/2
       | years ago alone should be reason to look elsewhere.
       | 
       | (No, not a competitor, just have spent way way way too much
       | engineering time moving first on and then off Cosmos and yes I am
       | bitter)
        
         | VWWHFSfQ wrote:
         | After suffering through the AWS SimpleDB disaster 10 years ago
         | I will never use any of the cloud providers' harebrained
         | databases ever again. I'll use bog-standard Postgres or MySQL
         | if they host it but nothing else.
        
         | nobodyandproud wrote:
         | Is this analogous to NTFS?
         | 
         | For you young uns, back in the 1990s Microsoft was so convinced
         | that NTFS made file fragmentation impossible that they didn't
         | provide a way to defrag for a very long time.
        
         | MrBuddyCasino wrote:
         | I used it a few months ago, it still is a half-baked piece of
         | shit. Code quality of client libs even worse than AWS.
         | 
         | It could have been easy. We could have used Postgres.
        
         | PaulWaldman wrote:
         | >(No, not a competitor, just have spent way way way too much
         | engineering time moving first on and then off Cosmos and yes I
         | am bitter)
         | 
         | Can you share what you migrated onto and the results?
        
       | jmartens wrote:
       | We used our own product to learn about and debug the issue. It's
       | rather wild that they'd roll out this change so incrementally,
       | which my colleague outlines here.
        
         | Scaevolus wrote:
         | Gradual rollouts are pretty typical to give the team a chance
         | to do a rollback before they cause a complete outage. This
         | particular usage pattern probably just didn't appear as a
         | significant enough spike in error rates.
        
           | jmartens wrote:
           | Ya, that makes sense, it really isn't a normal use-case. I
           | wish we had kept tracking the other regions to see if they have
           | had this change roll out to them yet, or if it's still slowly
           | rolling out.
        
       | twodave wrote:
       | Funny, I was just last week having an argument with one of our
       | team leads. I'd told him to create a specific container without a
       | partition key (which I wouldn't recommend except in certain
       | circumstances), and he said he couldn't. I assumed he was just
       | doing it wrong.
        
         | int0x2e wrote:
         | In a document store, what does it even mean to create a
         | container without a partition key? The document store has to
         | partition the data somehow, and doing so implicitly sounds
         | dangerous to me since all you're doing is creating a hotspot on
         | one of the partitions...
        
       | whalesalad wrote:
       | This is very typical Microsoft behavior, unfortunately.
        
       ___________________________________________________________________
       (page generated 2022-10-13 23:00 UTC)