[HN Gopher] Retries - An interactive study of request retry methods
       ___________________________________________________________________
        
       Retries - An interactive study of request retry methods
        
       Author : whenlambo
       Score  : 188 points
       Date   : 2023-11-23 13:32 UTC (9 hours ago)
        
 (HTM) web link (encore.dev)
 (TXT) w3m dump (encore.dev)
        
       | samwho wrote:
       | Thanks for sharing!
       | 
       | I'm the author of this post, and happy to answer any questions :)
        
         | j1elo wrote:
          | There's a subtle insight that could be added to the post if you
          | consider it worth it, and it's something that's actually _there_
          | already, but difficult to realize: clients in your simulation
          | have an absolute maximum number of retries.
         | 
          | I noticed this mid-read, when looking at one of the animations
          | with 28 clients: they would hammer the server but then suddenly
          | go into a wait state, for no apparent reason.
         | 
         | Later in the final animation with debug mode enabled, the
         | reason becomes apparent for those who click on the Controls
         | button:
         | 
         | Retry Strategy > Max Attempts = 10
         | 
          | It makes sense, because in the worst case, when everything goes
          | wrong, a client should reach a point where it gives up and just
          | aborts with a "service not available" error.
        
           | samwho wrote:
           | You know, I hadn't actually considered mentioning it. Another
           | commenter brought it up, too. It's so second nature I forgot
           | about it entirely.
           | 
            | I'll look at giving it a nod in the text, thank you for
            | the feedback. :)
        
             | fiddlerwoaroof wrote:
             | Exponential retries can effectively have a maximum number
             | of requests if the gap between retries gets long enough
             | quickly enough. In practice, the user will refresh or close
             | the page if things look broken for too long.
        
               | marcosdumay wrote:
               | Oh, please don't do that.
               | 
                | Unbounded exponential backoff is a horrible experience,
                | and improves basically nothing.
               | 
               | If it makes sense to completely fail the request, do it
               | before the waiting becomes noticeable. If it's something
               | that can't just fail, set a maximum waiting time and add
               | jitter.
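                | 
                | A minimal sketch of what I mean in TypeScript (sendRequest
                | is a hypothetical stand-in for the real call):
                | 
                |     // Capped exponential backoff with full jitter.
                |     async function withRetries<T>(
                |       sendRequest: () => Promise<T>,
                |       maxAttempts = 10,  // give up eventually...
                |       baseMs = 100,
                |       capMs = 10_000,    // ...and never wait longer than this
                |     ): Promise<T> {
                |       for (let attempt = 0; ; attempt++) {
                |         try {
                |           return await sendRequest();
                |         } catch (err) {
                |           if (attempt + 1 >= maxAttempts) throw err;
                |           const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
                |           const delayMs = Math.random() * ceiling; // full jitter
                |           await new Promise((r) => setTimeout(r, delayMs));
                |         }
                |       }
                |     }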
        
         | codebeaker wrote:
          | What technology did you use for the animations? I have a bunch
          | of itches I'd like to scratch that would be improved by some
          | canvas-animated explainers or UI, but nothing has ever clicked
          | with me. I used D3 back in the day.
         | 
          | A quick look at the source code showed a <traffic-simulation/>
          | element, but I'm not up to date enough with web standards to
          | know where to look for that in your JS bundle to work out the
          | framework!
        
           | samwho wrote:
            | It uses PixiJS (https://pixijs.com/) for the 2D rendering and
            | GSAP3 (https://gsap.com/) for the animation. The
            | <traffic-simulation /> blocks are custom HTML elements
            | (https://developer.mozilla.org/en-
            | US/docs/Web/API/Web_compone...) which I use to encapsulate
            | the logic.
           | 
           | I've been thinking about creating a separate repo to house
           | the source code of posts I've finished so people can see it.
           | I don't like all the bundling and minification but sadly it
           | serves a very real purpose to the end user experience (faster
           | load speeds on slow connections).
           | 
           | Until then feel free to email me (you'll find my address at
           | the bottom of my site) and I'd be happy to share a zip of
           | this post with you.
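            | 
            | If it helps, here's a much-simplified sketch of the custom
            | element approach (not the actual code from the post; the
            | PixiJS setup is illustrative only):
            | 
            |     import * as PIXI from "pixi.js";
            | 
            |     // A custom element that owns its own PixiJS canvas.
            |     class TrafficSimulation extends HTMLElement {
            |       connectedCallback() {
            |         const app = new PIXI.Application({ width: 600, height: 200 });
            |         this.appendChild(app.view as HTMLCanvasElement);
            |         // ...create clients/servers and drive them with GSAP
            |         // timelines here...
            |       }
            |     }
            | 
            |     // After this, <traffic-simulation></traffic-simulation>
            |     // works anywhere in the page's HTML.
            |     customElements.define("traffic-simulation", TrafficSimulation);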
        
             | samwho wrote:
             | I've uploaded the code for all of my visualisation posts
             | here: https://github.com/samwho/visualisations.
             | 
             | Enjoy! :)
        
       | self_awareness wrote:
        | Really nice animations. I especially liked the demonstration of
        | the effect where, after some servers "explode", any server that
        | gets restarted is automatically DoS'ed until we throw a bunch of
        | extra temporary servers into the system. Thanks.
        
         | samwho wrote:
         | Yeah! An insidious problem that's not obvious when you're
         | picking a retry interval.
         | 
         | I had fun with the details of the explosion animation. When it
         | explodes, the number of requests that come out is the actual
         | number of in-progress requests.
        
       | christophberger wrote:
       | A must-read (or rather: must-see) for anyone who thinks
       | exponential backoff is overrated.
        
         | rewmie wrote:
         | > A must-read (or rather: must-see) for anyone who thinks
         | exponential backoff is overrated.
         | 
         | I don't think exponential backoffs were ever accused of being
         | overrated. Retries in general have been criticized for being
         | counterproductive in multiple aspects, including the risk of
         | creating self-inflicted DDOS attacks, and exponential backoffs
         | can result in untenable performance and usability problems
          | without adding any upside. These are known problems, but none
          | of them really amounts to "overrating".
        
       | whenlambo wrote:
        | Remember to limit the exponential backoff interval if you are not
        | limiting the number of retries.
        
       | fadhilkurnia wrote:
       | The animations are so cool!!!
       | 
        | In general the phenomenon is known as _metastable failure_, which
        | can be triggered when there is more work to do during a failure
        | than during normal operation.
        | 
        | With retries, the client does more work within the same amount of
        | time, compared to doing nothing or backing off exponentially.
        
       | lclarkmichalek wrote:
       | This still isn't what I'd call "safe". Retries are amazing at
       | supporting clients in handling temporary issues, but horrible for
       | helping them deal with consistently overloaded servers. While
       | jitter & exponential backoff help with the timing, they don't
       | reduce the overall load sent to the service.
       | 
       | The next step is usually local circuit breakers. The two easiest
       | to implement are terminating the request if the error rate to the
       | service over the last <window> is greater than x%, and
       | terminating the request (or disabling retries) if the % of
       | requests that are retries over the last <window> is greater than
       | x%.
       | 
       | i.e. don't bother sending a request if 70% of requests have
       | errored in the last minute, and don't bother retrying if 50% of
       | the requests we've sent in the last minute have already been
       | retries.
       | 
        | The Google SRE book describes lots of other basic techniques to
        | make retries safe.
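        | 
        | A rough sketch of the first heuristic in TypeScript (window and
        | threshold values are illustrative, not recommendations; the
        | retry-percentage variant is the same shape, just counting retries
        | instead of errors):
        | 
        |     // Local circuit breaker keyed on recent error rate.
        |     class ErrorRateBreaker {
        |       private outcomes: { at: number; ok: boolean }[] = [];
        |       constructor(
        |         private windowMs = 60_000,   // "the last minute"
        |         private maxErrorRate = 0.7,  // "70% of requests errored"
        |       ) {}
        | 
        |       record(ok: boolean) {
        |         this.outcomes.push({ at: Date.now(), ok });
        |       }
        | 
        |       allowRequest(): boolean {
        |         const cutoff = Date.now() - this.windowMs;
        |         this.outcomes = this.outcomes.filter((o) => o.at >= cutoff);
        |         if (this.outcomes.length === 0) return true;
        |         const errors = this.outcomes.filter((o) => !o.ok).length;
        |         return errors / this.outcomes.length < this.maxErrorRate;
        |       }
        |     }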
        
         | samwho wrote:
         | Totally! Thanks for bringing those up. I tried to keep the
         | scope specifically on retries and client-side mitigation.
         | There's a whole bunch of cool stuff to visualise on the server-
         | side, and I'm hoping to get to it in the future.
        
           | Axsuul wrote:
           | Do you have a newsletter?
        
             | samwho wrote:
             | Not a newsletter as such but I do have an email list where
             | I post whenever I write something new. You can find it
             | here: https://buttondown.email/samwho
        
           | cowsandmilk wrote:
           | Your response makes it sound like you think circuit breakers
            | are server side and not related to retries. They are not;
            | they are a client-side mitigation and a critical part of
            | a mature retry library.
        
             | korm wrote:
             | The client can track its own error rate to the service, but
             | it would need information from a server to get the overall
             | health of the service, which is what the author probably
              | means. Furthermore, the load balancer can add a Retry-After
              | header to have more control over the client's retries.
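              | 
              | A client that honours the seconds form of Retry-After might
              | look roughly like this (sketch only; the header can also be
              | an HTTP date, which this ignores):
              | 
              |     // Wait as long as the server asks before one retry.
              |     async function fetchRespectingRetryAfter(
              |       url: string,
              |     ): Promise<Response> {
              |       const res = await fetch(url);
              |       if (res.status === 429 || res.status === 503) {
              |         const seconds = Number(res.headers.get("Retry-After") ?? "1");
              |         const delayMs = Number.isFinite(seconds) ? seconds * 1000 : 1000;
              |         await new Promise((r) => setTimeout(r, delayMs));
              |         return fetch(url); // a single follow-up attempt, for brevity
              |       }
              |       return res;
              |     }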
        
               | samwho wrote:
               | I think I've misunderstood what circuit breakers are for
               | years! I did indeed think they were a server-side
                | mechanism. The original commenter's description of them
                | is great: you can essentially create a heuristic based on
                | the observed behaviour of the server and decide against
                | overwhelming it further if you think it's unhealthy.
               | 
               | TIL! Seems like it can have tricky emergent behaviour. I
               | bet if you implement it wrong you can end up in very
               | weird situations. I should visualise it. :)
        
               | lclarkmichalek wrote:
               | I mean, they can and should be both. Local decisions can
               | be cheap, and very simple to implement. But global
               | decisions can be smarter, and more predictable. In my
               | experience, it's incredibly hard to make good decisions
               | in pathological situations locally, as you often don't
               | know you're in a pathological situation with only local
               | data. But local data is often enough to "do less harm" :)
        
         | spockz wrote:
         | Finagle fixes this with Retry Budgets:
         | https://finagle.github.io/blog/2016/02/08/retry-budgets/
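          | 
          | Very roughly (a TypeScript sketch of the idea, not Finagle's
          | actual Scala implementation): every original request deposits a
          | fraction of a token, and every retry has to withdraw a whole
          | one.
          | 
          |     // Finagle-style retry budget: retries are capped at a
          |     // percentage of recent original requests.
          |     class RetryBudget {
          |       private tokens = 0;
          |       constructor(
          |         private percentCanRetry = 0.2, // ~20% extra load at most
          |         private maxTokens = 100,
          |       ) {}
          | 
          |       onRequest() {
          |         this.tokens = Math.min(
          |           this.maxTokens,
          |           this.tokens + this.percentCanRetry,
          |         );
          |       }
          | 
          |       canRetry(): boolean {
          |         if (this.tokens >= 1) {
          |           this.tokens -= 1;
          |           return true;
          |         }
          |         return false; // budget exhausted, drop the retry
          |       }
          |     }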
        
       | usrbinbash wrote:
        | This is the client side of things. And I think this is a great
        | resource that everyone who writes clients for anything should
        | see.
       | 
        | But there is an additional piece of info everyone who writes
        | clients needs to see: what people like me, who implement backend
        | services, may do if clients ignore such wisdom.
       | 
       | Because: I'm not gonna let bad clients break my service.
       | 
        | What that means in practice: Clients are given a choice: They can
        | behave, or they can
        | 
        |     HTTP 429 Too Many Requests
        
         | rewmie wrote:
         | > This is the client side of things.
         | 
         | The article is about making requests, and strategies to
         | implement when the request fails. By definition, these are
         | clients. Was there any ambiguity?
         | 
         | > But there is an additional piece of info everyone who writes
         | clients needs to see: And that's what people like me, who
         | implement backend services, may do if clients ignore such
         | wisdom.
         | 
          | I don't think this is the obscure detail you are making it out
          | to be. A few of the most basic and popular retry strategies are
          | designed explicitly to a) handle throttled responses from the
          | servers, and b) mitigate the risk of causing self-inflicted DDoS
          | attacks. This article covers a few of those, such as exponential
          | backoff and jitter.
        
           | usrbinbash wrote:
           | > Was there any ambiguity?
           | 
           | Did I say there was?
           | 
           | > I don't think this is the obscure detail you are making it
           | out to be
           | 
           | Where did I call this detail "obscure"?
           | 
           | My post is meant as a light-hearted, humorous note pointing
           | out one of the many reasons why it is in general a good idea
           | for clients to implement the principles outlined in the
           | article.
        
             | samwho wrote:
             | Throttling, tarpitting, and circuit-breakers are something
             | I'd love to visualise in future, too. Throttling on its own
             | is such a massive topic!
        
       | tyingq wrote:
        | This is one of those things that sort of exposes our industry's
        | maturity versus engineering disciplines that have been around
        | longer. You
       | would think by now that the various frameworks for remote calls
       | would have standardized down to include the best practice retry
       | patterns, with standard names, setting ranges, etc. But we mostly
       | still roll our own for most languages/frameworks. And that's full
       | of footguns around DNS caching, when/how to retry on certain
       | failures (unauthorized, for example), and so on.
       | 
       | (Yes, there should also be the non-abstracted direct path for
       | cases where you do want to roll your own).
        
       | sesm wrote:
       | Summary of the article: use exponential backoff + jitter for
       | retry intervals.
       | 
        | What the author didn't mention: sometimes you want to add jitter
        | to delay the first request too, if the request happens immediately
        | after some event from the server (like the server waking up). If
        | you don't do this, you may crash the server, and if your
        | exponential backoff counter is not global you can even put the
        | server into a cyclic restart.
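        | 
        | Something like this sketch (connect() here is a hypothetical
        | reconnection routine):
        | 
        |     // Stagger the *first* request after a shared event (e.g. the
        |     // server waking up) so clients don't all hit it at once.
        |     async function reconnectWithInitialJitter(connect: () => Promise<void>) {
        |       const initialJitterMs = Math.random() * 5_000; // spread over ~5s
        |       await new Promise((r) => setTimeout(r, initialJitterMs));
        |       await connect(); // then apply the normal retry/backoff policy
        |     }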
        
         | whenlambo wrote:
         | If you can crash the server with an improperly timed request,
         | then you have a much bigger problem than client-side stuff.
        
           | andenacitelli wrote:
            | Yes. The worst that should happen is getting a 404 or
            | something. A crash due to requesting a piece of data that has
            | not yet been created is poor design.
        
           | samwho wrote:
            | I think what they mean is something that would cause clients
            | to all do something at the same time (could be all sorts of
            | things: a synchronised crash, timers aligned to clock-time,
            | etc.). If the requests aren't user-driven then yes, you likely
            | would want to include some jitter in the first request too.
           | 
           | Funnily, you'll notice that some of the visualisations have
           | the clients staggering their first request. It's exactly for
           | this reason. I wanted the visualisations to be as
           | deterministic as possible while still feeling somewhat
           | realistic. This staggering was a bit of a compromise.
           | 
           | Not sure what is meant by "if your exponential backoff
           | counter is not global", though. Would love to know more about
           | that.
        
           | sroussey wrote:
           | True, but you can imagine something like a websocket to all
           | clients getting reset and everyone re-connecting, re-
           | authenticating, and getting a new payload.
        
           | __turbobrew__ wrote:
            | One example is a datacenter losing power: when all the hosts
            | get turned back on at the same time, they can all send
            | requests at once and crash a server.
        
         | fooey wrote:
          | Yup, the classic Thundering Herd problem.
        
       | cratermoon wrote:
       | I worked at a company with a self-inflicted wound related to
       | retries.
       | 
       | At some point in the distant (internet time) past, a sales
       | engineer, or the equivalent, had written a sample script to
       | demonstrate basic uses of the API. As many of you quickly
       | guessed, customers went on a copy/paste rampage and put this
       | sample script into production.
       | 
       | The script went into a tight loop on failure, naively using a
       | simple library that did not include any back-off or retry in the
       | request. I'm not deeply familiar with how the company dealt with
       | this situation. I am aware there was a complex load balancing
       | system across distributed infrastructure, but also, just a lot of
       | horsepower.
       | 
       | Lesson for anyone offering an API product: don't hand out example
       | code with a self-own, because it will become someone's production
       | code.
        
       | joshka wrote:
       | For a lot of things, retry once and only once (at the outermost
       | layer to avoid multiplicative amplification) is more correct. At
       | a large enough scale, failing twice is often significantly (like
       | 90%+) correlated with the likelihood of failing a third time
       | regardless of backoff / jitter. This means that the second retry
       | only serves to add more load to an already failing service.
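        | 
        | A sketch of the difference (stepA and stepB are hypothetical calls
        | that must not do their own retries). If the outer layer and each
        | step all retried 3 times, one user action could drive up to
        | 3 x 3 = 9 attempts of each step against an already struggling
        | backend.
        | 
        |     // Retry once, and only at the outermost layer.
        |     async function handleUserAction(
        |       stepA: () => Promise<void>,
        |       stepB: () => Promise<void>,
        |     ) {
        |       const run = async () => {
        |         await stepA(); // no inner retries
        |         await stepB(); // no inner retries
        |       };
        |       try {
        |         await run();
        |       } catch {
        |         await run(); // the single outer retry
        |       }
        |     }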
        
         | tomwt wrote:
         | Retrying end-to-end instead of stepwise greatly reduces the
         | reliability of a process with a reasonable number of steps.
         | 
         | That being said, processes should ideally be failing in ways
         | which make it clear whether an error is retryable or not.
        
         | xer wrote:
          | Correct. It's also the case that human-generated requests lose
          | their relevance within seconds; a quick retry is all they're
          | worth. As for machine-generated requests, a dead letter queue
          | would make more sense: poorly engineered backend services would
          | OOM and well-engineered ones would load shed, and if the
          | requests are queued on the application servers they are doomed
          | to be lost anyway.
        
       | davidw wrote:
       | I have been thinking about queueing theory lately. I don't have
       | the math abilities to do anything deep with it, but it seems like
       | even basic applications of certain things could prove valuable in
       | real world situations where people are just kind of winging it
       | with resource allocation.
        
       ___________________________________________________________________
       (page generated 2023-11-23 23:00 UTC)