[HN Gopher] Cloudflare Dashboard and API Service Issues
       ___________________________________________________________________
        
       Cloudflare Dashboard and API Service Issues
        
       Author : easywiththe
       Score  : 90 points
       Date   : 2020-04-15 15:49 UTC (7 hours ago)
        
 (HTM) web link (www.cloudflarestatus.com)
 (TXT) w3m dump (www.cloudflarestatus.com)
        
       | nikisweeting wrote:
       | All our sites using Argo Tunnels go down whenever the API goes
       | down, but the dashboard still claims Argo Tunnels are unaffected
       | :/
       | 
       | At least we'll get a good blog post out of it in a few days.
        
         | rohansingh wrote:
         | Dashboard must have updated, because it now says that Argo
         | Tunnels is offline.
        
         | inapis wrote:
         | Does Argo help much with performance? How much of a difference
         | are we talking about here?
        
           | robertcope wrote:
           | Argo is kind of an overloaded term... Argo (not tunneling) is
           | better routing from the edge, which can have a measurable
           | impact on performance. Argo (tunneling) can also impact
           | performance in a positive way but it is also a nice way to
           | secure things, provide access to secure/internal sites,
           | provide ssh/rdp access into a datacenter, and other things.
        
           | buildbuildbuild wrote:
           | Argo Tunnels are most useful for security: you can serve HTTP
           | from a server that allows no inbound connections. Or, you can
           | use it to reliably serve a public site from behind NAT.
           | 
           | Argo routing's ability to improve page load times depends on
           | how international your user base is, and how poor your
           | origin's transit quality is. It's great at improving the
           | latency and reliability of a cheap host.
        
         | jbergstroem wrote:
          | Yes, same here -- which means the site is down. Nothing on
          | the status page yet.
         | 
         | > Apr 15 17:01:37 mysite cloudflared[12412]:
         | time="2020-04-15T17:01:37Z" level=info msg="Connected to SJC"
         | connectionID=0
         | 
         | > Apr 15 17:03:09 mysite cloudflared[12412]:
         | time="2020-04-15T17:03:09Z" level=error msg="Register tunnel
         | error from server side" connectionID=0 error="Server error:
         | Reached maximum retry 5: dial tcp 198.41.248.96:9100: connect:
         | connection timed out"
         | 
         | > Apr 15 17:03:09 mysite cloudflared[12412]:
         | time="2020-04-15T17:03:09Z" level=info msg="Retrying in 8s
         | seconds" connectionID=0
        
           | nikisweeting wrote:
            | Here are our logs in case they help anyone:
           | 
           | > hera ra | time="2020-04-15T17:08:09Z" level=error
           | msg="Unable to dial edge" error="DialContext error: dial tcp
           | 198.41.200.233:7844: i/o timeout"
           | 
           | > hera | time="2020-04-15T17:08:09Z" level=info msg="Retrying
           | in 1s seconds" hera | time="2020-04-15T17:08:09Z" level=error
           | msg="Unable to dial edge" error="Handshake with edge error:
           | read tcp 172.18.0.2:36016->198.41.192.227:7844: i/o timeout"
           | 
           | > hera | time="2020-04-15T17:08:09Z" level=info msg="Retrying
           | in 1s seconds" hera | time="2020-04-15T17:08:11Z" level=error
           | msg="Unable to dial edge" error="Handshake with edge error:
           | read tcp 172.18.0.2:44626->198.41.192.7:7844: i/o timeout"
           | 
           | > hera | time="2020-04-15T17:08:11Z" level=info msg="Retrying
           | in 4s seconds"
           | 
           | > hera | time="2020-04-15T17:08:13Z" level=info
           | msg="Connected to EWR"
           | 
           | > hera | time="2020-04-15T17:08:13Z" level=error
           | msg="Register tunnel error from server side" connectionID=0
           | error="Server error: Reached maximum retry 5: dial tcp
           | 198.41.248.96:9100: connect: connection timed out"
           | 
           | > hera | time="2020-04-15T17:08:13Z" level=warning
           | msg="Tunnel disconnected due to error" error="Server error:
           | Reached maximum retry 5: dial tcp 198.41.248.96:9100:
           | connect: connection timed out"
           | 
           | > hera | time="2020-04-15T17:08:19Z" level=error msg="Unable
           | to dial edge" error="Handshake with edge error: read tcp
           | 172.18.0.2:51974->198.41.192.107:7844: i/o timeout"
        
         | buildbuildbuild wrote:
         | Confirmed, and an interesting opportunity to spot who is using
         | Argo Tunnels based on their current downtime.
         | 
          | Tunnels are one of Cloudflare's best features for developers:
         | instant NAT traversal for home-hosted demos and prototypes.
         | It's a shame you have to pay for Argo on your entire domain in
         | order to use Argo Tunnels even on one subdomain.
         | 
          | Cloudflare, please offer origin tunnels as a separate
          | service rather than bundling them with Argo routing on the
          | client side.
        
           | jbergstroem wrote:
            | I believe they are separate; it's just naming confusion.
           | 
           | One is called "tiered routing" and is a dropdown in settings
            | whereas the "NAT" solution nowadays is implemented as
           | cloudflared.
        
         | robertcope wrote:
          | Yup, I emailed my account rep about Argo Tunnels being down
         | and the status page not really reflecting that issue... which
         | is much larger than the dashboard being offline or APIs not
         | working, IMHO.
        
         | rkwasny wrote:
          | What's worse is that the status page is making it look like
          | we are down and Cloudflare is up...
        
           | [deleted]
        
       | js4ever wrote:
        | It's showing the regular Cloudflare error when trying to
        | access their dashboard.
       | 
       | Error 522: If you're the owner of this website: Contact your
       | hosting provider letting them know your web server is not
       | completing requests. An Error 522 means that the request was able
       | to connect to your web server, but that the request didn't
       | finish. The most likely cause is that something on your server is
       | hogging resources.
       | 
        | I can't wait to see the postmortem. I wonder if it's a DDoS, a
        | network/hardware issue, or a deployment error.
        
         | pul wrote:
          | They've added the cause to their status page now: > [...] a
         | disruption that occurred during a maintenance.
         | 
         | https://www.cloudflarestatus.com/incidents/g7nd3k80rxdb
        
       | bithavoc wrote:
        | They pushed a change to fix the DNS propagation 45 minutes ago [0].
       | Edge servers continue to proxy but no new records are being
       | served.
       | 
        | [0] https://www.cloudflarestatus.com/incidents/57shkf1841kh
        
         | gazelleeatslion wrote:
         | Noticed this yesterday around 5pm EST.
         | 
         | Edit: Also noticed that when generating API keys, the dropdown
         | wouldn't list all my accounts for setting permissions. Just
         | assumed it was all related or something.
         | 
          | Either way, it's overall an insanely reliable
          | product/service that I could not live without.
        
       | rkwasny wrote:
       | Update from CEO
       | https://twitter.com/eastdakota/status/1250479501726760960
        
         | zymhan wrote:
         | How on earth can rented "remote hands" at a datacenter take
         | down the ENTIRE Cloudflare management layer?
        
           | MichaelApproved wrote:
            | And for over 3 hours? What could they have done
            | accidentally that would cause an outage this long?
           | 
           | The tweet says they're failing over to their backup facility.
            | I would've expected that failover to happen much faster.
           | 
            | Seems like they have two issues going on. First, that the
            | remote hands could take down the datacenter at all.
            | Second, that their failover is taking this long to come
            | online.
           | 
           | I also wonder how much Covid impacted the process, if at all.
           | 
           | I'm looking forward to the details after they get things back
           | online.
           | 
           | Good luck CF engineers!
        
             | foobarbazetc wrote:
             | Probably nuked the main DB.
        
             | eb0la wrote:
              | Three hours for a _physical_ problem like this is very
              | little time. Once you discover your equipment is not
              | there, you still have to put it back in its place, power
              | it up, reconnect all the cabling, and check networking.
              | 
              | Getting from this kind of mess back to a minimally
              | functional infrastructure takes at least 12-16 hours. It
              | probably takes 2-3 days to have that node working as it
              | did before.
              | 
              | When I was on call, I always joked with my colleagues
              | that the worst incident you could have in a datacenter
              | was someone swapping two cables.
        
           | hinkley wrote:
           | "If things are painful you should do them more often," isn't
            | supposed to be about calluses, but a lot of people take it
            | that way. People with calluses do not experience the
           | activity the way everybody else does. Sounds like a bit of
           | that happened here.
           | 
            | I want most of my mid-level people (and all of the ones
           | bucking for a promotion) to be comfortable running 80-90% of
           | routine maintenance operations and have pretty decent guesses
           | on what should be done for the next 5-10%. I don't often get
           | my way.
        
           | [deleted]
        
           | mcpherrinm wrote:
           | I don't know how big Cloudflare's management layer is, but
           | I'm assuming it's relatively small (maybe dozens of racks at
           | most), hosted in some sort of datacenter (Equinix, Digital
           | Realty, Coresite, etc) that provides the remote hands.
           | 
            | Maybe it was some piece of core network equipment.
           | 
           | A small colo may have a single pair of routers, switches, or
           | firewalls at its edge. If one had failed for some reason, and
           | the remote hands removed the wrong one, it is possible you
           | could knock the entire colo offline.
           | 
           | There's a bunch of other possible components: Storage
           | platforms, power, maybe something like an HSM storing
           | secrets, or even just a key database server.
           | 
           | Their failover to their backup facility may be impaired by
            | the fact that, well, their management plane is down. They
           | probably rely on their own services. Avoiding chicken-and-egg
           | issues can require careful ahead-of-time planning.
        
         | bogomipz wrote:
         | From your link:
         | 
         | >"During planned maintenance remote hands decommissioned some
         | equipment that they shouldn't have. We're failing over to a
         | backup facility and working to get the equipment back online."
         | 
         | Some questions:
         | 
         | When the gear was first powered off why wasn't it just powered
         | back up again as soon as the on-call person's phone started
         | blowing up? Why does this require a "failover" to another
         | datacenter?
         | 
         | Why are remote hands decommissioning Cloudflare's gear in the
          | first place? Isn't this supposed to be a security-focused
          | company? For context, "remote hands" is the term for people
          | who work for the colocation facility, such as Equinix,
          | Telehouse, etc. They are not employees of the tenant
          | (Cloudflare). Remote hands are great resources for things
          | like checking a cable or running new cross connects, but
          | they certainly should not be
         | decommissioning gear without some form of tenant supervision.
         | 
         | Why is this a single point of failure?
        
           | davidu wrote:
           | Just FYI -- Most colo facilities are prohibiting customer
           | access during the COVID-19 lockdowns and have gone to 'smart
           | hands' only for health and safety reasons.
        
             | bogomipz wrote:
              | Sure, but that would seem reason enough to postpone the
              | scheduled maintenance then, no? I would think trusting
              | "remote hands" to decommission production gear would
              | carry some serious weight in the risk/reward analysis of
              | keeping the maintenance window. At any rate, I would
              | think that those same remote hands should have been able
              | to immediately power that gear back up as soon as they
              | were
             | made aware of the error.
        
               | tedk-42 wrote:
               | If the risk of the change was minimal, why would they not
               | proceed?
               | 
               | How can you plan for things occurring out of your
                | control? CF engineers are people as well. Things like
                | this happen, and there will be lessons to take from it
                | (like how to fail over faster).
        
               | bogomipz wrote:
                | What do you mean, "what if"? Clearly the risk was not
                | minimal if it resulted in a 3-hour outage and affected
                | customers.
               | Someone at Cloudflare should have been able to identify
               | that the services that were colocated at this facility
               | had no redundancy prior to asking a third party to power
               | down gear. Foregoing the maintenance until proper
               | redundancy was in place was not out of their control.
               | 
                | Circulating a MOP (method of procedure) for data
                | center maintenance among all stakeholders is pretty
                | standard. The purpose of the MOP is so that everyone
                | can vet the plan (roll forward and roll back) and
                | identify the risks.
        
       | arn wrote:
        | Can't clear our cache via the API or manually, so our cached
        | HTML pages are stuck until they expire naturally -- and those
        | TTLs are set somewhat long. Not great for a blog / news site.
        | For example, if we publish a story, our front page won't
        | reflect it.
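        | 
        | For anyone unfamiliar, the purge we'd normally run is just the
        | standard v4 purge_cache endpoint -- roughly the sketch below
        | (the zone ID and API token are placeholders, and the URLs
        | would be whatever pages need refreshing):
        | 
        |   import requests
        | 
        |   API = "https://api.cloudflare.com/client/v4"
        |   ZONE_ID = "<zone-id>"      # placeholder
        |   API_TOKEN = "<api-token>"  # placeholder
        | 
        |   def purge_urls(urls):
        |       # Ask Cloudflare to drop specific cached URLs,
        |       # e.g. the front page right after publishing.
        |       resp = requests.post(
        |           f"{API}/zones/{ZONE_ID}/purge_cache",
        |           headers={"Authorization": f"Bearer {API_TOKEN}"},
        |           json={"files": urls},
        |           timeout=10,
        |       )
        |       resp.raise_for_status()
        |       return resp.json()
        | 
        |   # purge_urls(["https://example.com/"])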
        
       | verroq wrote:
        | By "degraded performance" they mean completely down.
        
         | haik90 wrote:
         | yes, completely down.
         | 
          | I've been trying for 10 minutes or so; it just keeps loading
          | (tested from various locations using a VPN).
        
         | mmm_grayons wrote:
         | Yeah, dashboard still appears to be completely down.
        
       | gramakri wrote:
       | All DNS APIs are failing :(
        
         | zymhan wrote:
          | As are our CircleCI pipelines that call out to Cloudflare.
         | 
         | Time to get some fresh air!
        
       | [deleted]
        
       | capableweb wrote:
        | Wonder if this is somehow related to "The Devastating Decline
       | of a Brilliant Young Coder" article that was published today
       | https://www.wired.com/story/lee-holloway-devastating-decline...
       | (discussed here: https://news.ycombinator.com/item?id=22878136)
       | 
        | One can wonder what kind of access Lee might still have to
        | Cloudflare infrastructure.
       | 
        | The CEO says "Not attack related", but it's a weird
        | coincidence that it happened on the very same day.
        
         | jasongill wrote:
         | Hanlon's Razor: "Never attribute to malice that which is
         | adequately explained by stupidity"
        
         | searchableguy wrote:
         | I am worried about you.
        
       | robinhouston wrote:
       | This has broken publishing in our app, which purges the file from
       | the Cloudflare cache when something is republished. We're
       | ignoring errors from the Cloudflare API, but that isn't enough in
       | this case because it isn't returning an error - it's just hanging
       | till the request times out.
       | 
       | We're pushing an emergency config change to skip the cache
       | invalidation, which will stop it timing out but means republished
       | projects won't update (because the old version will still be
       | cached).
       | 
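        | The shape of the stopgap, roughly (a sketch, not our actual
        | code; SKIP_CF_PURGE is just an arbitrary name for the
        | emergency switch):
        | 
        |   import os
        |   import requests
        | 
        |   def purge_cache(zone_id, token, urls):
        |       # Emergency switch: skip the purge entirely.
        |       if os.environ.get("SKIP_CF_PURGE") == "1":
        |           return
        |       url = ("https://api.cloudflare.com/client/v4"
        |              f"/zones/{zone_id}/purge_cache")
        |       try:
        |           requests.post(
        |               url,
        |               headers={"Authorization": f"Bearer {token}"},
        |               json={"files": urls},
        |               timeout=5,  # fail fast rather than hang
        |           )
        |       except requests.RequestException:
        |           # Publishing shouldn't fail just because
        |           # the purge did.
        |           pass
        | 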
       | Godspeed to the Cloudflare engineers who are presumably
       | scrambling to fix this!
        
         | darkerside wrote:
         | Do you not have some kind of task runner you can offload this
         | to? That seems like it would be of general benefit.
        
           | robinhouston wrote:
           | The trouble with returning to the user before the cache has
           | been purged is that they think it hasn't worked, because they
           | still see the old cached version.
        
       | partiallypro wrote:
       | While they are fixing this, could they please roll out a feature
        | to allow me to assign users to only specific domains? It's my
        | biggest complaint about Cloudflare; heck, even GoDaddy lets
        | you do that at no cost.
        
         | nikisweeting wrote:
         | https://www.cloudflare.com/en-ca/products/cloudflare-access/
         | 
         | https://developers.cloudflare.com/access/setting-up-access/
        
           | partiallypro wrote:
            | This is not what I'm talking about; I'm talking about the
           | ability to edit a zone.
        
         | judge2020 wrote:
         | You can, but only once you upgrade to Enterprise - the
         | delegated dashboard per-zone access functionality is part of
         | their business model.
         | 
         | See:
         | https://www.sec.gov/Archives/edgar/data/1477333/000119312519...
         | (RBAC)
        
           | partiallypro wrote:
           | I said at no cost. Other registrars/DNS providers provide
           | this for free.
        
       | jgrahamc wrote:
       | Coming back now.
        
       | redm wrote:
       | I wish cloudflarestatus.com (powered by StatusPage) offered a
       | subscription (like https://status.box.com/, also on StatusPage)
        | so you could get proactive notice of outages.
       | 
       | I had to debug customer issues to find out that this was down.
       | 
       | Even if they don't want to offer this to the general public (free
       | customers), they should have another notice mechanism for
       | enterprise customers.
        
         | htilford wrote:
         | It's pull rather than push, but you can use the public API to
         | get notified of incidents:
         | https://www.cloudflarestatus.com/api/v2
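          | 
          | E.g. a quick poll of the unresolved-incidents endpoint (a
          | sketch; assumes the standard Statuspage v2 response shape):
          | 
          |   import requests
          | 
          |   STATUS = "https://www.cloudflarestatus.com/api/v2"
          | 
          |   def unresolved_incidents():
          |       # Standard Statuspage v2 endpoint.
          |       url = f"{STATUS}/incidents/unresolved.json"
          |       resp = requests.get(url, timeout=10)
          |       resp.raise_for_status()
          |       return resp.json().get("incidents", [])
          | 
          |   for inc in unresolved_incidents():
          |       print(inc["status"], "-", inc["name"])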
        
         | pmccarren wrote:
         | StatusPage has native Subscriber Notifications [0]. Cloudflare
          | must not have it set up on their end.
         | 
         | [0] https://status.io/features
        
           | redm wrote:
           | I know, that's my point. My guess is they have not enabled it
           | because of the cost.
        
         | dijital wrote:
         | You could hook something up to their Status Page's RSS feed. We
         | have a #provider-updates Slack channel for things like this.
         | 
         | https://www.cloudflarestatus.com/history.atom
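          | 
          | The glue is tiny -- something along these lines (a sketch
          | using the feedparser package; the webhook URL is a
          | placeholder for a Slack incoming webhook):
          | 
          |   import time
          | 
          |   import feedparser
          |   import requests
          | 
          |   FEED = "https://www.cloudflarestatus.com/history.atom"
          |   # placeholder Slack incoming-webhook URL
          |   WEBHOOK = "https://hooks.slack.com/services/XXX"
          | 
          |   seen = set()
          |   while True:
          |       # Post entries we haven't seen yet to the channel.
          |       for entry in feedparser.parse(FEED).entries:
          |           if entry.id not in seen:
          |               seen.add(entry.id)
          |               msg = f"{entry.title}\n{entry.link}"
          |               requests.post(WEBHOOK, json={"text": msg},
          |                             timeout=10)
          |       time.sleep(300)  # poll every five minutes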
        
           | htilford wrote:
           | https://www.cloudflarestatus.com/history.rss and
           | https://www.cloudflarestatus.com/history.json work as well.
        
       | mattashii wrote:
       | > "Cloudflare is continuing to investigate a network connectivity
       | issue in the data center which serves API and dashboard
       | functions."
       | 
       | This implies that CF hosts its API and dashboard all in one DC,
       | which I find an _interesting_ observation. One would expect a
       | company like CF to host its critical infrastructure in a
       | redundant fashion.
        
         | tyingq wrote:
          | It's certainly not ideal. But it's not unusual to spend a
          | lot more time making the runtime very redundant and less
          | time/money on the dashboard and configuration-change
          | underpinnings. That doesn't work well in this case, since it
          | kills cache invalidation for customers.
         | 
          | The comments seem to imply that having a redundant way to
          | refresh the page cache, even if it only worked at the
          | global/domain level rather than per page, would be an okay
          | backup for many.
        
           | mattashii wrote:
            | I agree that the first priority would be data integrity
            | (which would be the runtime). But a large part of the
            | experience for a CF customer is the availability of the
            | management APIs/dashboards, so that would be another thing
            | to optimize for.
           | 
            | I'm really surprised that they hosted all those non-vital
            | but still quite critical services in just one DC, or
            | somehow had one DC as a single point of failure. Network
            | issues happen "regularly enough" that you want to
            | protect against that, or at least
           | have mitigations available.
        
             | bostik wrote:
              | To be fair, you have these latent single points of failure
             | even in the most resilient distributed systems.
             | 
             | Such as S3. The bucket names are globally unique, which
             | means that their source of truth is in a single location.
             | (Virginia, IIRC.)
             | 
             | Now... a small thought exercise. If I wanted to take down a
             | Cloudflare datacenter and I had access to a few suitably
             | careless remote hands, I'd take out the power supplies to
             | the core routers, and while the external network is out of
             | commission, power down the racks where they have their PXE
              | servers. That should keep anything within the DC from
              | recovering on its own.
        
       ___________________________________________________________________
       (page generated 2020-04-15 23:01 UTC)