[HN Gopher] Last week's Let's Encrypt downtime
       ___________________________________________________________________
        
       Last week's Let's Encrypt downtime
        
       Author : agwa
       Score  : 150 points
       Date   : 2023-06-22 14:42 UTC (8 hours ago)
        
 (HTM) web link (www.agwa.name)
 (TXT) w3m dump (www.agwa.name)
        
       | AdamJacobMuller wrote:
       | Did we kill crt.sh?
       | 
       | FATAL:  terminating connection due to conflict with recovery
       | DETAIL:  User query might have needed to see row versions that
       | must be removed.
       | CONTEXT:  SQL statement "SELECT c.ID, x509_print(c.CERTIFICATE,
       | NULL, 196608), ca.ID, cac.CA_ID,
       | digest(c.CERTIFICATE, 'sha1'::text),
       | digest(c.CERTIFICATE, 'sha256'::text),
       | x509_serialNumber(c.CERTIFICATE),
       | digest(x509_publicKey(c.CERTIFICATE), 'sha256'::text),
       | x509_rsamodulus(c.CERTIFICATE),
       | x509_hasROCAFingerprint(c.CERTIFICATE),
       | x509_hasClosePrimes(c.CERTIFICATE), c.CERTIFICATE
       | FROM certificate c
       | LEFT OUTER JOIN ca ON (c.ISSUER_CA_ID = ca.ID)
       | LEFT OUTER JOIN ca_certificate cac ON (c.ID = cac.CERTIFICATE_ID)
       | WHERE digest(c.CERTIFICATE, 'sha256') = t_bytea"
       | PL/pgSQL function web_apis(text,text[],text[]) line 1757 at SQL
       | statement
       | ERROR:  server conn crashed?
       | server closed the connection unexpectedly
       | This probably means the server terminated abnormally before or
       | while processing the request.
       | 
       | and now it's just a 502 error!
        
         | agwa wrote:
         | Unfortunately, crt.sh is chronically overloaded.
        
           | AdamJacobMuller wrote:
           | I've never seen it happen before, but, you would know better!
        
       | mjw1007 wrote:
       | > these certificates were already being rejected by Chrome and
       | Safari for having invalid SCTs
       | 
       | What's a good way to make an equivalent check from a script, if I
       | want (in future) to be able to check whether I have a website
       | whose certificate has such a problem?
        
         | agwa wrote:
         | Excellent question! The sctcheck command from
         | https://github.com/google/certificate-transparency-go/ can be
         | used to check the signatures of the embedded SCTs in a
         | certificate.
         | 
         | I've also got an online tool which you can use to test a site
         | for CT policy compliance:
         | https://sslmate.com/labs/ct_policy_analyzer/
         | 
         | Example of a working site:
         | https://sslmate.com/labs/ct_policy_analyzer/?sslmate.com
         | 
         | Example of one of the sites affected by the Let's Encrypt
         | incident:
         | https://sslmate.com/labs/ct_policy_analyzer/?thecandyshake.c...
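
For anyone who wants to script part of this check themselves, here is a minimal sketch in standard-library Python of decoding an embedded SCT list as laid out in RFC 6962, section 3.3. It assumes you have already extracted the raw SignedCertificateTimestampList bytes from the certificate's SCT extension (that step is not shown), and the names `SCT` and `parse_sct_list` are made up for illustration. It only decodes fields; it does not verify the SCT signatures, which is what sctcheck is for.

```python
import struct
from datetime import datetime, timedelta, timezone
from typing import NamedTuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MIN_SCT_LEN = 1 + 32 + 8 + 2 + 1 + 1 + 2  # fixed-size fields of a v1 SCT


class SCT(NamedTuple):  # hypothetical container, not from any library
    version: int
    log_id: bytes
    timestamp: datetime
    extensions: bytes
    hash_alg: int
    sig_alg: int
    signature: bytes


def parse_sct_list(data: bytes) -> list[SCT]:
    """Decode a serialized SCT list; raises ValueError on malformed input."""
    if len(data) < 2:
        raise ValueError("buffer too short for an SCT list")
    (total,) = struct.unpack_from(">H", data, 0)
    if total != len(data) - 2:
        raise ValueError("SCT list length does not match the buffer")
    scts, pos = [], 2
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated SCT length prefix")
        (sct_len,) = struct.unpack_from(">H", data, pos)
        pos += 2
        end = pos + sct_len
        if sct_len < MIN_SCT_LEN or end > len(data):
            raise ValueError("truncated or undersized SCT")
        version = data[pos]
        log_id = data[pos + 1 : pos + 33]
        (millis,) = struct.unpack_from(">Q", data, pos + 33)
        (ext_len,) = struct.unpack_from(">H", data, pos + 41)
        ext_end = pos + 43 + ext_len
        if ext_end + 4 > end:
            raise ValueError("SCT extensions overrun the SCT")
        extensions = data[pos + 43 : ext_end]
        hash_alg, sig_alg, sig_len = struct.unpack_from(">BBH", data, ext_end)
        if ext_end + 4 + sig_len != end:
            raise ValueError("SCT fields do not fill the declared length")
        signature = data[ext_end + 4 : end]
        scts.append(SCT(version, log_id, EPOCH + timedelta(milliseconds=millis),
                        extensions, hash_alg, sig_alg, signature))
        pos = end
    return scts
```

Comparing each decoded `log_id` against a current list of recognized logs is a cheap first check, but it would not have caught an SCT whose signature does not match the certificate.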
        
       | jimmyl02 wrote:
       | This is a great writeup and intro to certificate transparency
       | overall. Glad to see that certificate authorities are being held
       | accountable and to learn more about how it's done!
        
       | jrpelkonen wrote:
       | > I find it alarming that a week after the incident, 40% of the
       | affected certificates are still in use, despite being rejected by
       | the most popular browsers and despite affected subscribers being
       | emailed by Let's Encrypt.
       | 
       | This is perhaps a consequence of how well-oiled a machine LE
       | typically is: people stop paying attention to it.
        
         | tredre3 wrote:
         | That is true but how come certbot had no awareness of
         | revoked/withdrawn certificates before now? It seems like one of
         | the things a CA is supposed to solve for you, and the fact that
         | it doesn't is a bit alarming in itself.
         | 
         | Though, as the following sentence points out, they were already
         | working on it before the outage, so clearly they knew it was
         | needed.
         | 
         | 1. https://datatracker.ietf.org/doc/draft-ietf-acme-ari/
        
           | 411111111111111 wrote:
           | The CA can't solve it for you.
           | 
           | The certificate authority signs certificate requests,
           | creating certificates. The revocation process is necessary as
           | well, but the CA doesn't have the ability to change the
           | already issued certificate, thus it cannot take action.
           | 
           | Software like certbot can solve it for you, but that's not
           | affiliated with your CA.
        
             | agwa wrote:
             | The CA is part of the solution by using ARI to inform ACME
             | clients to replace impacted certificates.
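
To make the ARI mechanism concrete, here is a minimal Python sketch of the client-side decision. As I read the draft linked above, the renewalInfo response carries a `suggestedWindow` with `start` and `end` timestamps, and the client picks a uniformly random time inside it, renewing right away if that time has already passed. The function names are made up for illustration, and fetching the response is left out because the URL construction changed between draft versions.

```python
import json
import random
from datetime import datetime, timezone
from typing import Optional


def parse_rfc3339(value: str) -> datetime:
    # fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def pick_renewal_time(renewal_info: str, rng: random.Random) -> datetime:
    """Pick a uniformly random instant inside the suggested window."""
    window = json.loads(renewal_info)["suggestedWindow"]
    start = parse_rfc3339(window["start"])
    end = parse_rfc3339(window["end"])
    if end < start:
        raise ValueError("suggested window ends before it starts")
    return start + (end - start) * rng.random()


def should_renew_now(renewal_info: str, now: datetime,
                     rng: Optional[random.Random] = None) -> bool:
    """True when the chosen renewal time is already in the past."""
    rng = rng or random.Random()
    return pick_renewal_time(renewal_info, rng) <= now


if __name__ == "__main__":
    # A CA that wants certificates replaced can move the window into the past.
    info = ('{"suggestedWindow": {"start": "2023-06-13T00:00:00Z",'
            ' "end": "2023-06-15T00:00:00Z"}}')
    print(should_renew_now(info, datetime.now(timezone.utc)))
```

A window entirely in the past makes every polling client renew on its next check, which is how a CA can signal "replace this now" without touching the certificate itself.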
        
               | mcpherrinm wrote:
               | Even before ARI, some integrated ACME/Web servers use
               | OCSP as a way of knowing to renew if a cert was revoked.
               | Plus if you're doing that you can pin the OCSP response
               | while you're at it.
        
               | 411111111111111 wrote:
               | My point was that the CA can't solve it for you, they can
               | only give you APIs and processes with which you can solve
               | it yourself.
               | 
               | If your webserver supports checking the certificate
               | validity then it's not solved by the CA, it's been solved
               | by the developers of that software and by you installing
               | it.
        
         | tialaramex wrote:
         | I haven't looked at a list of revoked certificates, because I
         | was busy, (and I no longer operate my own CT auditing software,
         | so I'd have to poke around in crt.sh which is not much fun) but
         | let's suppose these are a random sample of Let's Encrypt's ~2
         | million issuances per day.
         | 
         | What %age of the world's HTTPS web sites are "parked" and so
         | there is nobody who expects them to actually work?
         | BrandFromATVShow.example ? TeenDanceISawOnTikTok.example ?
         | SomeShortEnglishWord.example ? Nobody cares, if they do visit,
         | and there's a certificate failure, they realise that's not
         | where they meant to go and leave.
         | 
         | Then what %age are somebody's fever dream / retirement plan /
         | abandoned start-up idea and so although the owner may notice
         | _eventually_ that it's broken, that might not happen before
         | automatic renewal "fixes" the problem anyway if ever.
         | MyTownOlympicSwimmingPool.example JimAndBethsCakeShop.example
         | and LikeAWSForDogsSomehow.example
         | 
         | And then how about all the outfits which folded weeks, months,
         | even in some cases years ago, but the ISP bill was paid, so,
         | the web site continues to exist until somebody removes it, but
         | of course nobody cares ? BoughtByGoogle.example and
         | YetAnotherBayAreaCryptoStartup.example together with
         | DefinitelyViableProduct.example and
         | OopsWalmartAlreadySellsThatForLessMoney.example
         | 
         | If it was 95% I'd be more worried, at 40% I'd need to actually
         | check at least a decent sample and see for myself. In the time
         | I was writing this post I checked one, it wasn't replaced...
         | exactly, because the actual web site uses a certificate issued
         | five days earlier. Chances are they've got a bunch of duplicate
         | certificates, so the fact that some they don't use are broken
         | has never come up - that's just rude (wastes other people's
         | resources) but it works fine technically.
        
           | tedunangst wrote:
           | Renewing a cert without immediately deploying it seems like a
           | reasonable practice in the face of CAs that will misissue
           | through no fault of your own.
        
             | schoen wrote:
             | When we wrote Certbot, we thought (by analogy with prior
             | practice) that many sysadmins would want to manually
             | inspect certificates before deploying them! That's one
             | reason that we kept old certificates around and used a
             | symlink-updating system.
             | 
             | As it turned out, misissued and invalid certs account for
             | an incredibly small fraction of Let's Encrypt's issuance
             | volume (I'm going to say < 1/10^8 offhand?) and manual
             | inspection kind of gets in the way of automation, so the
             | idea of separating these steps has come to seem kind of
             | quaint, for me at least. I've also helped thousands of
             | people on the Let's Encrypt forum and I think at most 2
             | have said they were interested in looking at their new
             | certs' contents before starting to use them.
        
               | tedunangst wrote:
               | I may not inspect it myself (which wouldn't even catch
               | this issue), but letting it simmer for a week isn't hard.
        
               | agwa wrote:
               | That's a pretty good idea, and would also mitigate
               | clients with slow clocks rejecting a certificate for not
               | being valid yet.
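
The "let it simmer" idea can be sketched as a small deployment gate. This is only an illustration in Python: the function name, the one-week soak and the three-day safety margin are assumptions, not anything Certbot or another client implements under these names.

```python
from datetime import datetime, timedelta, timezone


def ready_to_deploy(new_not_before: datetime,
                    old_not_after: datetime,
                    now: datetime,
                    soak: timedelta = timedelta(days=7),
                    safety_margin: timedelta = timedelta(days=3)) -> bool:
    """Decide whether to switch from the old certificate to the new one.

    soak and safety_margin are illustrative defaults, not recommendations.
    """
    if now < new_not_before:
        # Never serve a certificate that is not valid yet.
        return False
    if now >= old_not_after - safety_margin:
        # The old certificate is about to expire: deploy without waiting.
        return True
    # Otherwise wait out the soak period, which gives a misissuance time
    # to be noticed and lets slow client clocks catch up.
    return now >= new_not_before + soak


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(ready_to_deploy(now - timedelta(days=8), now + timedelta(days=20), now))
```

The safety margin matters because a soak period must never be the reason a site ends up serving an expired certificate.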
        
         | NovemberWhiskey wrote:
         | Based on my experience, the capability model for certificate
         | management usually went like:
         | 
         | 1) Chaos: certificates requested and installed manually, either
         | in response to incidents caused by expiration or calendar
         | reminders
         | 
         | 2) Monitoring: certificates requested and installed manually,
         | in response to noisy alerting by probers looking for
         | indications of pending expiration or other ill-health
         | 
         | 3) Automation: continuous certificate provisioning,
         | distribution and enablement either through platform or
         | integration
         | 
         | The Let's Encrypt revolution has taken a lot of people from
         | stage 1 to stage 3 without stage 2 in between.
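
For the "stage 2" monitoring described above, a prober can be quite small. The following is a minimal sketch in standard-library Python; the function names and the 21-day threshold are assumptions chosen for illustration.

```python
import socket
import ssl
import time


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the notAfter string of the certificate served by host."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between now (seconds since the epoch) and the notAfter time."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def needs_attention(not_after: str, now: float, threshold_days: float = 21) -> bool:
    # threshold_days is an illustrative default, not a recommendation.
    return days_until_expiry(not_after, now) < threshold_days


if __name__ == "__main__":
    import sys
    host = sys.argv[1]
    left = days_until_expiry(fetch_not_after(host), time.time())
    print(f"{host}: {left:.1f} days until expiry")
```

Note that the default context raises an `ssl.SSLCertVerificationError` for an already expired or otherwise untrusted certificate, which a prober should treat as an alert in its own right. Python's default verification does not check SCTs, so this probe would not have flagged the certificates in this incident.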
        
         | hinkley wrote:
         | Vernor Vinge has dominated the Singularity space in science
         | fiction pretty much from the beginning of the concept.
         | 
         | Rainbows End plays around in a time frame right around where we
         | are now, just a bit before the sorts of doglegs we predict
         | would presage a Singularity in your lifetime.
         | 
         | At one point the protagonists need to attack a bad actor, and
         | to make it work they need chaos on the internet. I don't recall
         | exactly how this plays out, but the way they decide to achieve
         | it is that one of the collaborators believes that they can
         | reject a CA cert that affects 10% of all certificates in the
         | wild, and the resulting pandemonium will give them
         | approximately the sort of chaos they need.
         | 
         | Sounds to me like maybe that is either no longer true, or never
         | was.
        
           | tialaramex wrote:
           | [Spoilers]
           | 
           | They don't need Chaos. They want to disable Rabbit, and they
           | know Rabbit's certificates mostly tie back to a single CA,
           | Credit Suisse. So they "revoke" Credit Suisse and accept the
           | consequences, which (they acknowledge) are career ending for
           | the Europeans. This is mostly a plot convenience because
           | Rabbit is much too powerful to allow what Vinge wants to
           | happen next.
           | 
           | No, you can't actually "revoke" a root CA, the decision to
           | trust (or not) a root is local. So this part of the novel is
           | a fantasy. But even if you assume it means that the European
           | authorities can somehow reach into Credit Suisse and cause it
           | to revoke all the intermediates (which _maybe_ is a plausible
           | reading) and so on down to end entity certificates, that
           | doesn't really work either. Not on the time scale Vinge
           | needs for the novel.
           | 
           | Hours are conceivable but unlikely. Days maybe. A week. But
           | the novel needs it to be seconds.
           | 
           | There are two big obstacles to even the revocation which does
           | really exist. Firstly humans are _much_ more enthusiastic
           | about seeing Dancing Pigs than they are about safety, because
           | safety is a very abstract idea, whereas seeing dancing pigs
           | is an immediate reward. This is the Dancing Pigs problem, and
           | we've put some effort in: it's _less_ likely that a random Chrome
           | user would get their face ripped to pieces because they
           | wanted Dancing Pigs and so bypassed the security checks that
           | would have protected them than, say, fifteen years ago, but
           | only somewhat.
           | 
           | Secondly though, there's not a great enthusiasm technically
           | for this sort of counter-measure. It's so rarely beneficial
           | in practice. Most of the time those humans were right, we
           | were just denying them Dancing Pigs. Their face _might_ get
           | ripped to pieces, but to be honest it's as likely to be
           | because they deliberately went to "Rip My Face To
           | Pieces.example" as through anything we could have prevented.
           | This is only barely a technical problem. So, when there are
           | things we could do to get closer to what's in the novel, why
           | would we?
           | 
           | Building the PKI which exists in Vinge's novel is probably a
           | bad expenditure of resources.
        
       | francislavoie wrote:
       | FWIW, if those websites used Caddy as their ACME client, then it
       | would have detected the certificate being revoked as soon as
       | possible via OCSP stapling and would have had the certificate
       | renewed. It's a shame that other ACME clients aren't as robust to
       | problems like this. (Disclaimer: I work on Caddy as a volunteer)
        
         | agwa wrote:
         | Note that the certificates were not revoked until 2023-06-19 at
         | 18:00. In contrast, ARI was updated on 2023-06-15 at 22:43 to
         | tell ARI-supporting clients (such as lego) to renew
         | immediately. That means Caddy served broken certificates for
         | almost 4 days longer than necessary.
         | 
         | Are there plans for Caddy to support ARI?
        
           | mholt wrote:
           | > That means Caddy served broken certificates for almost 4
           | days longer than necessary.
           | 
           | This would be news to me. Do you have a source for Caddy
           | serving any of the affected certificates? I'd like as much
           | info as possible.
           | 
           | > Are there plans for Caddy to support ARI?
           | 
           | If ARI can be made into an effective mechanism, then yes.
           | ACMEz already supports the current draft.
           | 
           | I know Francis linked to a forum category, here's some more
           | specific links for background:
           | 
           | - https://community.letsencrypt.org/t/can-ari-conforming-clien...
           | 
           | - https://community.letsencrypt.org/t/thoughts-from-starting-t...
        
             | agwa wrote:
             | _> This would be news to me. Do you have a source for Caddy
             | serving any of the affected certificates? I'd like as much
             | info as possible._
             | 
             | That's news to you? I informed you last week that Caddy
             | would serve broken certificates in this situation:
             | https://news.ycombinator.com/item?id=36344549
             | 
             | I omitted "would" from my previous comment, but I think
             | it's pretty clear from Francis' comment that we're
             | discussing a hypothetical situation, and neither of us knows
             | if any of the 645 affected certificates were requested by
             | Caddy or not.
             | 
             | I skimmed the forum links (it would be productive if you
             | could send an email summarizing your thoughts to the IETF
             | ACME WG) and it seems like your complaints could also be
             | said of OCSP so it's hard to figure out why OCSP is OK for
             | Caddy but ARI isn't.
             | 
             | FWIW, there's currently a ballot in the CABF which would
             | make OCSP optional for CAs, so OCSP may be on the way out
             | in the WebPKI.
        
               | mholt wrote:
               | You said:
               | 
               | > Caddy served broken certificates
               | 
               | So yes, that would be news to me. I'm asking for more
               | information. If Caddy did not serve broken certificates,
               | then I would appreciate clarification there so I know
               | where to spend my energy.
               | 
               | > (it would be productive if you could send an email
               | summarizing your thoughts to the IETF ACME WG)
               | 
               | I did this once and it was like talking into a black
               | hole. All the responses I got to the issue I brought up
               | were laced with complacency.
               | 
               | > I skimmed the forum links and it seems like your
               | complaints could also be said of OCSP so it's hard to
               | figure out why OCSP is OK for Caddy but ARI isn't.
               | 
               | Because OCSP does what it's intended to do. ARI does not.
               | 
               | > FWIW, there's currently a ballot in the CABF which
               | would make OCSP optional for CAs, so OCSP may be on the
               | way out in the WebPKI.
               | 
               | I am tracking that proposal and get daily notifications.
               | It is only for short-lived certs. I would be thrilled if
               | we could replace revocation -- and OCSP -- with short-
               | lived certs.
        
               | agwa wrote:
               | _> So yes, that would be news to me. I'm asking for more
               | information. If Caddy did not serve broken certificates,
               | then I would appreciate clarification there so I know
               | where to spend my energy._
               | 
               | This is not engaging in good faith.
               | 
               | _> I am tracking that proposal and get daily
               | notifications. It is only for short-lived certs._
               | 
               | It would make OCSP optional for all certificates. CRLs
               | would be optional only for short-lived certs.
        
               | mholt wrote:
               | > This is not engaging in good faith.
               | 
               | Sorry, come again? Why so combative?
        
           | francislavoie wrote:
           | > Note that the certificates were not revoked until
           | 2023-06-19 at 18:00.
           | 
           | Ah okay, I missed that.
           | 
           | > Are there plans for Caddy to support ARI?
           | 
           | It's... complicated. Matt argues that ARI does not make sense
           | for a variety of reasons. You can find the complex and deep
           | discussions about it on the LE forums. Do a Ctrl+F for ARI in
           | https://community.letsencrypt.org/c/client-dev/14 to find
           | them, there's a lot.
        
       | ElongatedMusket wrote:
       | Thanks for following through on this writeup! I knew LE certs
       | were publicly logged but didn't know the logs were decentralized
       | or how they hold the CA accountable. Appreciate the layman
       | explanation.
        
       | fruitreunion1 wrote:
       | Will non-browser clients like curl/requests ever support checking
       | CT logs? It's great that some browsers have it, but browsers are
       | not the only clients using TLS with CAs. Also doesn't help that a
       | lot of software can't use CA root stores with much granularity:
       | https://news.ycombinator.com/item?id=33876949
        
         | agwa wrote:
         | Hopefully, although there are challenges to overcome. CT is a
         | fast-moving ecosystem, with logs coming and going, and policies
         | changing regularly. This requires CT-enforcing clients to be
         | very on-the-ball with updates, both in the sense that the
         | developers need to pay attention and update their code in time,
         | and any users of the apps need to upgrade frequently. Browser
         | makers can handle this because they are competently-staffed and
         | well-resourced. The authors of non-browser apps need to know
         | what they're getting into.
         | 
         | A cautionary tale: there is a library for adding CT enforcement
         | to Android apps. Earlier this year, every app using this
         | library was suddenly unable to establish any TLS connections
         | because Google stopped publishing a JSON file which the library
         | should never have been consuming in the first place. There was
         | plenty of warning that this would happen, but the author of the
         | library was not on-the-ball.
         | https://groups.google.com/g/certificate-transparency/c/38Lr9...
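
One way a non-browser client might soften the failure mode in that cautionary tale is to stop enforcing CT when its log list is too old, instead of rejecting every connection. The Python sketch below illustrates that design choice only. The function name, the 70-day maximum age and the two-SCT minimum are placeholders, not any browser's actual policy, and real policies look at more than a simple count.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def ct_decision(sct_log_ids: Iterable[bytes],
                known_log_ids: Iterable[bytes],
                log_list_timestamp: datetime,
                now: datetime,
                max_list_age: timedelta = timedelta(days=70),
                min_scts: int = 2) -> str:
    """Return "accept", "reject", or "skip" (enforcement disabled).

    max_list_age and min_scts are illustrative placeholders.
    """
    if now - log_list_timestamp > max_list_age:
        # A stale log list says nothing reliable about today's logs, so
        # skip CT enforcement instead of breaking every TLS connection.
        return "skip"
    # Count SCTs from distinct logs that the (fresh) list recognizes.
    recognized = set(sct_log_ids) & set(known_log_ids)
    return "accept" if len(recognized) >= min_scts else "reject"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    known = [b"log-a", b"log-b", b"log-c"]
    print(ct_decision([b"log-a", b"log-b"], known, now - timedelta(days=5), now))
```

Skipping enforcement trades security for availability, so whether it is the right default depends on the application; the point is that the stale-list case needs an explicit decision.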
        
         | NovemberWhiskey wrote:
         | The elephant in the room is that TLS implementations for
         | browsers and those in the libraries of common programming
         | languages have diverged really substantially: Web PKI is
         | massively more restrictive and depends on a bunch of technology
         | that's not in the baseline PKI.
        
       ___________________________________________________________________
       (page generated 2023-06-22 23:00 UTC)