[HN Gopher] Cloudflare Dashboard and API Service Issues ___________________________________________________________________ Cloudflare Dashboard and API Service Issues Author : easywiththe Score : 90 points Date : 2020-04-15 15:49 UTC (7 hours ago) (HTM) web link (www.cloudflarestatus.com) (TXT) w3m dump (www.cloudflarestatus.com) | nikisweeting wrote: | All our sites using Argo Tunnels go down whenever the API goes | down, but the dashboard still claims Argo Tunnels are unaffected | :/ | | At least we'll get a good blog post out of it in a few days. | rohansingh wrote: | Dashboard must have updated, because it now says that Argo | Tunnels is offline. | inapis wrote: | Does Argo help much with performance? How much of a difference | are we talking about here? | robertcope wrote: | Argo is kind of an overloaded term... Argo (not tunneling) is | better routing from the edge, which can have a measurable | impact on performance. Argo (tunneling) can also impact | performance in a positive way but it is also a nice way to | secure things, provide access to secure/internal sites, | provide ssh/rdp access into a datacenter, and other things. | buildbuildbuild wrote: | Argo Tunnels are most useful for security: you can serve HTTP | from a server that allows no inbound connections. Or, you can | use it to reliably serve a public site from behind NAT. | | Argo routing's ability to improve page load times depends on | how international your user base is, and how poor your | origin's transit quality is. It's great at improving the | latency and reliability of a cheap host. | jbergstroem wrote: | Yes, same here -- which equals site down. Nothing on status | page yet. 
| | > Apr 15 17:01:37 mysite cloudflared[12412]: | time="2020-04-15T17:01:37Z" level=info msg="Connected to SJC" | connectionID=0 | | > Apr 15 17:03:09 mysite cloudflared[12412]: | time="2020-04-15T17:03:09Z" level=error msg="Register tunnel | error from server side" connectionID=0 error="Server error: | Reached maximum retry 5: dial tcp 198.41.248.96:9100: connect: | connection timed out" | | > Apr 15 17:03:09 mysite cloudflared[12412]: | time="2020-04-15T17:03:09Z" level=info msg="Retrying in 8s | seconds" connectionID=0 | nikisweeting wrote: | Here's our logs in case it helps anyone: | | > hera ra | time="2020-04-15T17:08:09Z" level=error | msg="Unable to dial edge" error="DialContext error: dial tcp | 198.41.200.233:7844: i/o timeout" | | > hera | time="2020-04-15T17:08:09Z" level=info msg="Retrying | in 1s seconds" hera | time="2020-04-15T17:08:09Z" level=error | msg="Unable to dial edge" error="Handshake with edge error: | read tcp 172.18.0.2:36016->198.41.192.227:7844: i/o timeout" | | > hera | time="2020-04-15T17:08:09Z" level=info msg="Retrying | in 1s seconds" hera | time="2020-04-15T17:08:11Z" level=error | msg="Unable to dial edge" error="Handshake with edge error: | read tcp 172.18.0.2:44626->198.41.192.7:7844: i/o timeout" | | > hera | time="2020-04-15T17:08:11Z" level=info msg="Retrying | in 4s seconds" | | > hera | time="2020-04-15T17:08:13Z" level=info | msg="Connected to EWR" | | > hera | time="2020-04-15T17:08:13Z" level=error | msg="Register tunnel error from server side" connectionID=0 | error="Server error: Reached maximum retry 5: dial tcp | 198.41.248.96:9100: connect: connection timed out" | | > hera | time="2020-04-15T17:08:13Z" level=warning | msg="Tunnel disconnected due to error" error="Server error: | Reached maximum retry 5: dial tcp 198.41.248.96:9100: | connect: connection timed out" | | > hera | time="2020-04-15T17:08:19Z" level=error msg="Unable | to dial edge" error="Handshake with edge error: read tcp | 
172.18.0.2:51974->198.41.192.107:7844: i/o timeout" | buildbuildbuild wrote: | Confirmed, and an interesting opportunity to spot who is using | Argo Tunnels based on their current downtime. | | Tunnels are one of Cloudflare's best features for developers: | instant NAT traversal for home-hosted demos and prototypes. | It's a shame you have to pay for Argo on your entire domain in | order to use Argo Tunnels even on one subdomain. | | Cloudflare, please offer origin tunnels as a separate service | rather than bundling it with Argo routing on the client side. | jbergstroem wrote: | I believe they are separate; it's just naming confusion. | | One is called "tiered routing" and is a dropdown in settings | whereas the "nat" solution nowadays is implemented as | cloudflared. | robertcope wrote: | Yup, I emailed my account rep about Argo tunnels being down | and the status page not really reflecting that issue... which | is much larger than the dashboard being offline or APIs not | working, IMHO. | rkwasny wrote: | What's worse is that the status page is making it look like we | are down and cloudflare is up... | [deleted] | js4ever wrote: | It's triggering the regular cloudflare error when trying to | access their dashboard. | | Error 522: If you're the owner of this website: Contact your | hosting provider letting them know your web server is not | completing requests. An Error 522 means that the request was able | to connect to your web server, but that the request didn't | finish. The most likely cause is that something on your server is | hogging resources. | | I can't wait to see the postmortem. I wonder if it's a DDoS, | network/hardware issue, or a deployment error. | pul wrote: | They've added the cause on their status page now: > [...] a | disruption that occurred during a maintenance. | | https://www.cloudflarestatus.com/incidents/g7nd3k80rxdb | bithavoc wrote: | They pushed a change to fix the DNS propagation 45 minutes ago[0]. 
| Edge servers continue to proxy but no new records are being | served. | | [0]https://www.cloudflarestatus.com/incidents/57shkf1841kh | gazelleeatslion wrote: | Noticed this yesterday around 5pm EST. | | Edit: Also noticed that when generating API keys, the dropdown | wouldn't list all my accounts for setting permissions. Just | assumed it was all related or something. | | Either way, overall a super insanely reliable product/service that I | could not live without. | rkwasny wrote: | Update from CEO: | https://twitter.com/eastdakota/status/1250479501726760960 | zymhan wrote: | How on earth can rented "remote hands" at a datacenter take | down the ENTIRE Cloudflare management layer? | MichaelApproved wrote: | And for over 3 hours? What could they do accidentally that | would cause an outage for this long? | | The tweet says they're failing over to their backup facility. | I would've expected that failover to happen much faster. | | Seems like they have two issues going on. First, the remote | hands could take down the datacenter. Second, their failover | is taking this long to come online. | | I also wonder how much Covid impacted the process, if at all. | | I'm looking forward to the details after they get things back | online. | | Good luck CF engineers! | foobarbazetc wrote: | Probably nuked the main DB. | eb0la wrote: | Three hours for a _physical_ problem like this is very | little time. Once you discover your equipment is not there, | you still have to put it back in its place, power up, reconnect | all cabling, and check networking. | | From this kind of mess, you need at least 12h-16h to get back to | a minimally functional infrastructure. It probably | takes 2-3 days to have that node working as before. | | When I was on call, I always joked with my colleagues that the | worst incident you could have in a datacenter was someone | swapping two cables. 
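The cloudflared logs quoted upthread show reconnect attempts with delays stepping up (1s, 4s, 8s) until a maximum retry count is reached. A generic sketch of that capped exponential-backoff pattern, in Python -- illustrative only, not cloudflared's actual implementation; all names here are made up:

```python
import random

def backoff_delays(max_retries=5, base=1.0, factor=2.0, cap=60.0):
    """Yield capped, exponentially increasing retry delays,
    like the 1s -> 4s -> 8s progression in the tunnel logs."""
    delay = base
    for _ in range(max_retries):
        yield min(delay, cap)
        delay *= factor

def backoff_with_jitter(max_retries=5, base=1.0, cap=60.0):
    """'Full jitter' variant: randomize each delay so many
    reconnecting clients don't hammer the edge in lockstep."""
    for d in backoff_delays(max_retries, base, cap=cap):
        yield random.uniform(0, d)

print(list(backoff_delays()))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

The jitter variant matters in exactly the scenario in this thread: when an edge comes back, thousands of tunnels reconnecting on identical schedules would otherwise arrive in synchronized waves.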
| hinkley wrote: | "If things are painful you should do them more often" isn't | supposed to be about calluses, but a lot of people take it | that way. People with calluses do not experience the | activity the way everybody else does. Sounds like a bit of | that happened here. | | I want most of my mid-level people (and all of the ones | bucking for a promotion) to be comfortable running 80-90% of | routine maintenance operations and to have pretty decent guesses | on what should be done for the next 5-10%. I don't often get | my way. | [deleted] | mcpherrinm wrote: | I don't know how big Cloudflare's management layer is, but | I'm assuming it's relatively small (maybe dozens of racks at | most), hosted in some sort of datacenter (Equinix, Digital | Realty, Coresite, etc.) that provides the remote hands. | | Maybe it was some piece of core network equipment. | | A small colo may have a single pair of routers, switches, or | firewalls at its edge. If one had failed for some reason, and | the remote hands removed the wrong one, it is possible you | could knock the entire colo offline. | | There's a bunch of other possible components: storage | platforms, power, maybe something like an HSM storing | secrets, or even just a key database server. | | Their failover to their backup facility may be impaired by | the fact that, well, their management plane is down. They | probably rely on their own services. Avoiding chicken-and-egg | issues can require careful ahead-of-time planning. | bogomipz wrote: | From your link: | | >"During planned maintenance remote hands decommissioned some | equipment that they shouldn't have. We're failing over to a | backup facility and working to get the equipment back online." | | Some questions: | | When the gear was first powered off, why wasn't it just powered | back up again as soon as the on-call person's phone started | blowing up? Why does this require a "failover" to another | datacenter? 
| | Why are remote hands decommissioning Cloudflare's gear in the | first place? Isn't this supposed to be a security-focused | company? For context, "remote hands" is the term for people who | work for the colocation facility, such as Equinix, Telehouse, | etc. They are not employees of the tenant (Cloudflare). Remote | hands are great resources for things like checking a cable or | running new cross connects, etc., but certainly not for | decommissioning gear without some form of tenant supervision. | | Why is this a single point of failure? | davidu wrote: | Just FYI -- most colo facilities are prohibiting customer | access during the COVID-19 lockdowns and have gone to 'smart | hands' only for health and safety reasons. | bogomipz wrote: | Sure, but that would seem reason enough to postpone the | scheduled maintenance then, no? I would think trusting | "remote hands" to decommission production gear would carry | some serious weight in the risk/reward analysis of keeping | the maintenance window. At any rate, I would think that | those same remote hands should have been able to | immediately power that gear back up as soon as they were | made aware of the error. | tedk-42 wrote: | If the risk of the change was minimal, why would they not | proceed? | | How can you plan for things occurring out of your | control? CF engineers are people as well. Things like | this happen and there will be learnings to take out of it | (like how to fail over faster). | bogomipz wrote: | What do you mean, "what if"? Clearly the risk was not minimal | if it resulted in a 3-hour outage and affected customers. | Someone at Cloudflare should have been able to identify | that the services that were colocated at this facility | had no redundancy prior to asking a third party to power | down gear. Foregoing the maintenance until proper | redundancy was in place was not out of their control. 
| | Circulating a MOP (method of procedure) for data center | maintenance among all stakeholders is pretty standard. The | purpose of the MOP is so that everyone can vet the | plan (roll forward and roll back) and identify the risks. | arn wrote: | Can't clear our cache via API or manually, so our cached HTML | pages are stuck until the natural expiry happens -- and those are set | somewhat long. Not great for a blog / news site. For example, if | we publish a story, our front page won't reflect it. | verroq wrote: | By "degraded performance" they mean down completely. | haik90 wrote: | Yes, completely down. | | I've been trying for 10 minutes or so and it just keeps loading | (tested from various locations using a VPN). | mmm_grayons wrote: | Yeah, dashboard still appears to be completely down. | gramakri wrote: | All DNS APIs are failing :( | zymhan wrote: | As are our CircleCI pipelines that call out to Cloudflare. | | Time to get some fresh air! | [deleted] | capableweb wrote: | Wonder if this is somehow related to "The Devastating Decline | of a Brilliant Young Coder" article that was published today: | https://www.wired.com/story/lee-holloway-devastating-decline... | (discussed here: https://news.ycombinator.com/item?id=22878136) | | One can wonder what kind of access Lee might still have to | Cloudflare infrastructure. | | CEO says "Not attack related" but it's a weird coincidence that it | happened on the very same day. | jasongill wrote: | Hanlon's Razor: "Never attribute to malice that which is | adequately explained by stupidity" | searchableguy wrote: | I am worried about you. | robinhouston wrote: | This has broken publishing in our app, which purges the file from | the Cloudflare cache when something is republished. We're | ignoring errors from the Cloudflare API, but that isn't enough in | this case because it isn't returning an error - it's just hanging | till the request times out. 
| | We're pushing an emergency config change to skip the cache | invalidation, which will stop it timing out but means republished | projects won't update (because the old version will still be | cached). | | Godspeed to the Cloudflare engineers who are presumably | scrambling to fix this! | darkerside wrote: | Do you not have some kind of task runner you can offload this | to? That seems like it would be of general benefit. | robinhouston wrote: | The trouble with returning to the user before the cache has | been purged is that they think it hasn't worked, because they | still see the old cached version. | partiallypro wrote: | While they are fixing this, could they please roll out a feature | to allow me to assign users to only specific domains? Biggest | complaint about Cloudflare, heck even GoDaddy lets you do that at | no cost. | nikisweeting wrote: | https://www.cloudflare.com/en-ca/products/cloudflare-access/ | | https://developers.cloudflare.com/access/setting-up-access/ | partiallypro wrote: | This is not what I'm talking about, I'm talking about the | ability to edit a zone. | judge2020 wrote: | You can, but only once you upgrade to Enterprise - the | delegated dashboard per-zone access functionality is part of | their business model. | | See: | https://www.sec.gov/Archives/edgar/data/1477333/000119312519... | (RBAC) | partiallypro wrote: | I said at no cost. Other registrars/DNS providers provide | this for free. | jgrahamc wrote: | Coming back now. | redm wrote: | I wish cloudflarestatus.com (powered by StatusPage) offered a | subscription (like https://status.box.com/, also on StatusPage) | so you could get a pro-active notice about outages. | | I had to debug customer issues to find out that this was down. | | Even if they don't want to offer this to the general public (free | customers), they should have another notice mechanism for | enterprise customers. 
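Absent a subscriber feature, the public status API can be polled as a stopgap. A minimal sketch in Python -- the `/api/v2/incidents/unresolved.json` path follows the generic Atlassian Statuspage convention and is assumed here rather than taken from Cloudflare documentation:

```python
import json
import urllib.request

STATUS_API = "https://www.cloudflarestatus.com/api/v2/incidents/unresolved.json"

def unresolved_names(payload):
    """Extract (name, status) pairs from a Statuspage-style
    incidents payload; returns [] when all is clear."""
    return [(i["name"], i["status"]) for i in payload.get("incidents", [])]

def poll_status(url=STATUS_API, timeout=10):
    """Fetch and parse the unresolved-incidents feed, with a
    client-side timeout so a hung endpoint can't block the caller."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return unresolved_names(json.load(resp))

if __name__ == "__main__":
    for name, status in poll_status():
        print(f"ONGOING: {name} ({status})")
```

Run it from cron every few minutes and forward non-empty output to a Slack channel or pager; polling the history.atom feed works the same way.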
| htilford wrote: | It's pull rather than push, but you can use the public API to | get notified of incidents: | https://www.cloudflarestatus.com/api/v2 | pmccarren wrote: | StatusPage has native Subscriber Notifications [0]. Cloudflare | must not have it set up on their end. | | [0] https://status.io/features | redm wrote: | I know, that's my point. My guess is they have not enabled it | because of the cost. | dijital wrote: | You could hook something up to their Status Page's RSS feed. We | have a #provider-updates Slack channel for things like this. | | https://www.cloudflarestatus.com/history.atom | htilford wrote: | https://www.cloudflarestatus.com/history.rss and | https://www.cloudflarestatus.com/history.json work as well. | mattashii wrote: | > "Cloudflare is continuing to investigate a network connectivity | issue in the data center which serves API and dashboard | functions." | | This implies that CF hosts its API and dashboard all in one DC, | which I find an _interesting_ observation. One would expect a | company like CF to host its critical infrastructure in a | redundant fashion. | tyingq wrote: | It's certainly not ideal. But it's not unusual to spend a lot | more time on making the runtime very redundant, and less | time/money on dashboards and configuration-change | underpinnings. That doesn't work well in this case, since it kills | invalidating cache items for customers. | | The comments seem to imply that having a redundant way to | refresh the page cache, even if it were global/domain versus | per-page, would be an okay backup for many. | mattashii wrote: | I agree that the first priority would be data integrity (which | would be the runtime). But a large part of the CF experience | of a CF customer would be the availability of their | management APIs/dashboards, and that would be another part to | optimize for. 
| | I'm really surprised that they hosted all those non-vital but | still quite critical services in just one DC, or somehow had | one DC as a single point of failure. Network issues happen | "regularly enough" that you want to protect against that, or at least | have mitigations available. | bostik wrote: | To be fair, you have these latent single points of failure | even in the most resilient distributed systems. | | Such as S3: the bucket names are globally unique, which | means that their source of truth is in a single location. | (Virginia, IIRC.) | | Now... a small thought exercise. If I wanted to take down a | Cloudflare datacenter and I had access to a few suitably | careless remote hands, I'd take out the power supplies to | the core routers, and while the external network is out of | commission, power down the racks where they have their PXE | servers. That should keep anything within the DC | from being able to recover on its own. ___________________________________________________________________ (page generated 2020-04-15 23:01 UTC)