[HN Gopher] Retries - An interactive study of request retry methods
___________________________________________________________________

Retries - An interactive study of request retry methods

Author : whenlambo
Score : 188 points
Date : 2023-11-23 13:32 UTC (9 hours ago)

(HTM) web link (encore.dev)
(TXT) w3m dump (encore.dev)

| samwho wrote:
| Thanks for sharing!
|
| I'm the author of this post, and happy to answer any questions :)
| j1elo wrote:
| There's a subtle insight that could be added to the post if you
| consider it worth it, and it's something that's actually _there_
| already, but difficult to realize: clients in your simulation
| have an absolute maximum number of retries.
|
| I noticed this mid-read, looking at one of the animations with 28
| clients: they would hammer the server but suddenly go into a wait
| state, for no apparent reason.
|
| Later, in the final animation with debug mode enabled, the reason
| becomes apparent for those who click on the Controls button:
|
| Retry Strategy > Max Attempts = 10
|
| It makes sense, because in the worst case, when everything goes
| wrong, a client should reach a point where it desists and just
| aborts with a "service not available" error.
| samwho wrote:
| You know, I hadn't actually considered mentioning it. Another
| commenter brought it up, too. It's so second nature I forgot
| about it entirely.
|
| I'll look at giving it a nod in the text, thank you for the
| feedback. :)
| fiddlerwoaroof wrote:
| Exponential retries can effectively have a maximum number of
| requests if the gap between retries gets long enough quickly
| enough. In practice, the user will refresh or close the page if
| things look broken for too long.
| marcosdumay wrote:
| Oh, please don't do that.
|
| Unbounded exponential backoff is a horrible experience, and
| improves basically nothing.
|
| If it makes sense to completely fail the request, do it before
| the waiting becomes noticeable. If it's something that can't just
| fail, set a maximum waiting time and add jitter (see the sketch
| below).
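A minimal sketch of the approach this sub-thread converges on:
exponential backoff capped at a maximum wait, full jitter, and an
absolute maximum number of attempts. The function name and the
defaults here are illustrative, not taken from the post.

    // Retry with capped exponential backoff, full jitter, and a
    // maximum attempt count (illustrative sketch).
    async function retryWithBackoff<T>(
      fn: () => Promise<T>,
      maxAttempts = 10,    // absolute maximum before giving up
      baseDelayMs = 100,   // starting point for the backoff
      maxDelayMs = 10_000, // cap so waits never grow unbounded
    ): Promise<T> {
      for (let attempt = 1; ; attempt++) {
        try {
          return await fn();
        } catch (err) {
          // Out of attempts: surface the "service not available" error.
          if (attempt >= maxAttempts) throw err;
          // Capped exponential delay, then "full jitter": sleep a
          // uniformly random time between 0 and that delay.
          const cappedMs = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
          await new Promise((r) => setTimeout(r, Math.random() * cappedMs));
        }
      }
    }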
| codebeaker wrote:
| What technology did you use for the animations? I have a bunch of
| itches I'd like to scratch that would be improved by some canvas-
| animated explainers or UI, but I never clicked with anything. D3
| back in the day.
|
| A quick look at the source code showed a <traffic-simulation/>
| element, but I'm not up to date enough with web standards to know
| where to look for that in your JS bundle to work out the
| framework!
| samwho wrote:
| It uses PixiJS (https://pixijs.com/) for the 2D rendering and
| GSAP3 (https://gsap.com/) for the animation. The
| <traffic-simulation /> blocks are custom HTML elements
| (https://developer.mozilla.org/en-US/docs/Web/API/Web_compone...)
| which I use to encapsulate the logic.
|
| I've been thinking about creating a separate repo to house the
| source code of posts I've finished so people can see it. I don't
| like all the bundling and minification, but sadly it serves a
| very real purpose for the end user experience (faster load speeds
| on slow connections).
|
| Until then, feel free to email me (you'll find my address at the
| bottom of my site) and I'd be happy to share a zip of this post
| with you.
| samwho wrote:
| I've uploaded the code for all of my visualisation posts here:
| https://github.com/samwho/visualisations.
|
| Enjoy! :)
| self_awareness wrote:
| Really nice animations. I especially liked the demonstration of
| the effect where, after some servers "explode", any server that
| gets restarted is automatically DoS'ed until we throw a bunch of
| extra temporary servers into the system. Thanks.
| samwho wrote:
| Yeah! An insidious problem that's not obvious when you're picking
| a retry interval.
|
| I had fun with the details of the explosion animation. When a
| server explodes, the number of requests that come out is the
| actual number of in-progress requests.
| christophberger wrote:
| A must-read (or rather: must-see) for anyone who thinks
| exponential backoff is overrated.
| rewmie wrote:
| > A must-read (or rather: must-see) for anyone who thinks
| > exponential backoff is overrated.
|
| I don't think exponential backoff was ever accused of being
| overrated. Retries in general have been criticized as
| counterproductive in multiple respects, including the risk of
| creating self-inflicted DDoS attacks, and exponential backoff can
| result in untenable performance and usability problems without
| adding any upside. These are known problems, but none of them
| amounts to calling the technique "overrated".
| whenlambo wrote:
| Remember to limit the exponential backoff interval if you are not
| limiting the number of retries.
| fadhilkurnia wrote:
| The animations are so cool!!!
|
| In general, the phenomenon is known as _metastable failure_,
| which can be triggered when there is more work to do during
| failure than during a normal run.
|
| With retries, the client does more work within the same amount of
| time, compared to doing nothing or backing off exponentially.
| lclarkmichalek wrote:
| This still isn't what I'd call "safe". Retries are amazing at
| helping clients handle temporary issues, but horrible at helping
| them deal with consistently overloaded servers. While jitter and
| exponential backoff help with the timing, they don't reduce the
| overall load sent to the service.
|
| The next step is usually local circuit breakers. The two easiest
| to implement are: terminate the request if the error rate to the
| service over the last <window> is greater than x%, and terminate
| the request (or disable retries) if the percentage of requests
| that are retries over the last <window> is greater than x%.
|
| I.e. don't bother sending a request if 70% of requests have
| errored in the last minute, and don't bother retrying if 50% of
| the requests we've sent in the last minute were already retries
| (see the sketch further down).
|
| The Google SRE book describes lots of other basic techniques to
| make retries safe.
| samwho wrote:
| Totally! Thanks for bringing those up. I tried to keep the scope
| specifically on retries and client-side mitigation. There's a
| whole bunch of cool stuff to visualise on the server side, and
| I'm hoping to get to it in the future.
| Axsuul wrote:
| Do you have a newsletter?
| samwho wrote:
| Not a newsletter as such, but I do have an email list where I
| post whenever I write something new. You can find it here:
| https://buttondown.email/samwho
| cowsandmilk wrote:
| Your response makes it sound like you think circuit breakers are
| server-side and not related to retries. They are not; they are a
| client-side mitigation and a critical part of a mature retry
| library.
| korm wrote:
| The client can track its own error rate to the service, but it
| would need information from a server to get the overall health of
| the service, which is probably what the author means.
| Furthermore, the load balancer can add a Retry-After header to
| have more control over the client's retries.
| samwho wrote:
| I think I've misunderstood circuit breakers for years! I did
| indeed think they were a server-side mechanism. The original
| commenter's description of them is great: you can essentially
| create a heuristic based on the observed behaviour of the server
| and decide against overwhelming it further if you think it's
| unhealthy.
|
| TIL! Seems like it can have tricky emergent behaviour. I bet if
| you implement it wrong you can end up in very weird situations. I
| should visualise it. :)
| lclarkmichalek wrote:
| I mean, they can and should be both. Local decisions can be cheap
| and very simple to implement, but global decisions can be smarter
| and more predictable. In my experience, it's incredibly hard to
| make good decisions in pathological situations locally, as you
| often don't know you're in a pathological situation with only
| local data. But local data is often enough to "do less harm" :)
| spockz wrote:
| Finagle fixes this with retry budgets:
| https://finagle.github.io/blog/2016/02/08/retry-budgets/
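A minimal sketch of the local circuit breaker described above,
assuming a simple sliding window of recent request outcomes. The
class name and window size are illustrative; the thresholds come
from the 70%/50% example in the comment.

    // Sliding-window circuit breaker (illustrative). Refuses new
    // requests when the recent error rate is too high, and refuses
    // retries when too much recent traffic was already retries.
    class CircuitBreaker {
      private samples: { at: number; error: boolean; retry: boolean }[] = [];

      constructor(
        private windowMs = 60_000,   // "the last minute"
        private maxErrorRate = 0.7,  // stop sending at 70% errors
        private maxRetryRate = 0.5,  // stop retrying at 50% retries
      ) {}

      // Drop samples that have fallen out of the window.
      private prune(now: number) {
        this.samples = this.samples.filter((s) => now - s.at <= this.windowMs);
      }

      // Call once per completed request (or timeout).
      record(error: boolean, retry: boolean) {
        this.samples.push({ at: Date.now(), error, retry });
      }

      allowRequest(): boolean {
        this.prune(Date.now());
        if (this.samples.length === 0) return true;
        const errors = this.samples.filter((s) => s.error).length;
        return errors / this.samples.length < this.maxErrorRate;
      }

      allowRetry(): boolean {
        this.prune(Date.now());
        if (this.samples.length === 0) return true;
        const retries = this.samples.filter((s) => s.retry).length;
        return retries / this.samples.length < this.maxRetryRate;
      }
    }

Callers would check allowRequest() before a first attempt,
allowRetry() before each retry, and call record() as each response
or timeout comes back.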
| usrbinbash wrote:
| This is the client side of things, and I think it's a great
| resource that everyone who writes clients for anything should
| see.
|
| But there is an additional piece of info everyone who writes
| clients needs to see: what people like me, who implement backend
| services, may do if clients ignore such wisdom.
|
| Because: I'm not gonna let bad clients break my service.
|
| What that means in practice: clients are given a choice. They can
| behave, or they can HTTP 429 Too Many Requests.
| rewmie wrote:
| > This is the client side of things.
|
| The article is about making requests, and strategies to implement
| when a request fails. By definition, these are clients. Was there
| any ambiguity?
|
| > But there is an additional piece of info everyone who writes
| > clients needs to see: what people like me, who implement
| > backend services, may do if clients ignore such wisdom.
|
| I don't think this is the obscure detail you are making it out to
| be. A few of the most basic and popular retry strategies are
| designed explicitly to a) handle throttled responses from servers
| and b) mitigate the risk of causing self-inflicted DDoS attacks.
| The article covers a few of those, such as exponential backoff
| and jitter.
| usrbinbash wrote:
| > Was there any ambiguity?
|
| Did I say there was?
|
| > I don't think this is the obscure detail you are making it out
| > to be
|
| Where did I call this detail "obscure"?
|
| My post is meant as a light-hearted, humorous note pointing out
| one of the many reasons why it is, in general, a good idea for
| clients to implement the principles outlined in the article.
| samwho wrote:
| Throttling, tarpitting, and circuit breakers are something I'd
| love to visualise in future, too. Throttling on its own is such a
| massive topic!
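A sketch of the client's side of that bargain, combining the 429
point above with korm's Retry-After note upthread: back off when the
server answers 429, honouring any Retry-After header before falling
back to jittered backoff. The helper name and defaults are
illustrative; it uses the standard fetch API.

    // Illustrative: respect HTTP 429 and the Retry-After header,
    // falling back to capped, jittered exponential backoff.
    async function fetchWithRetryAfter(
      url: string,
      maxAttempts = 5,
    ): Promise<Response> {
      for (let attempt = 1; ; attempt++) {
        const res = await fetch(url);
        if (res.status !== 429 || attempt >= maxAttempts) return res;
        const header = res.headers.get("Retry-After");
        let waitMs: number;
        if (header && /^\d+$/.test(header)) {
          waitMs = Number(header) * 1000; // delta-seconds form
        } else if (header && !Number.isNaN(Date.parse(header))) {
          waitMs = Math.max(0, Date.parse(header) - Date.now()); // HTTP-date form
        } else {
          // No usable hint from the server: capped backoff + full jitter.
          waitMs = Math.random() * Math.min(10_000, 100 * 2 ** (attempt - 1));
        }
        await new Promise((r) => setTimeout(r, waitMs));
      }
    }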
| tyingq wrote:
| This is one of those things that sort of exposes our industry's
| maturity versus other engineering disciplines that have been
| around longer. You would think that by now the various frameworks
| for remote calls would have standardized on the best-practice
| retry patterns, with standard names, setting ranges, etc. But we
| mostly still roll our own in most languages/frameworks, and
| that's full of footguns around DNS caching, when/how to retry on
| certain failures (unauthorized, for example), and so on.
|
| (Yes, there should also be a non-abstracted direct path for cases
| where you do want to roll your own.)
| sesm wrote:
| Summary of the article: use exponential backoff + jitter for
| retry intervals.
|
| What the author didn't mention: sometimes you want to add jitter
| to delay the first request too, if the request happens
| immediately after some event from the server, like the server
| waking up (see the sketch below). If you don't do this, you may
| crash the server, and if your exponential backoff counter is not
| global you can even put the server into a cyclic restart.
| whenlambo wrote:
| If you can crash the server with an improperly timed request,
| then you have a much bigger problem than client-side stuff.
| andenacitelli wrote:
| Yes. The worst that should happen is getting a 404 or something.
| A crash due to requesting a piece of data that has not yet been
| created is poor design.
| samwho wrote:
| I think what they mean is an event that causes all clients to act
| at the same time (it could be all sorts of things: a synchronised
| crash, timers aligned to clock time, etc.). If the requests
| aren't user-driven then yes, you likely would want to include
| some jitter in the first request too.
|
| Funnily enough, you'll notice that some of the visualisations
| have the clients staggering their first request. It's exactly for
| this reason. I wanted the visualisations to be as deterministic
| as possible while still feeling somewhat realistic, and this
| staggering was a bit of a compromise.
|
| Not sure what is meant by "if your exponential backoff counter is
| not global", though. Would love to know more about that.
| sroussey wrote:
| True, but you can imagine something like a websocket to all
| clients getting reset and everyone re-connecting,
| re-authenticating, and getting a new payload.
| __turbobrew__ wrote:
| One example: if a datacenter loses power and then all the hosts
| get turned on at the same time, they can all send requests at the
| same time and crash a server.
| fooey wrote:
| Yup, the classic Thundering Herd Problem.
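A minimal sketch of the mitigation sesm suggests above: jitter the
very first request as well, so clients that all wake at the same
moment (a websocket reconnect storm, a datacenter power-on) don't
arrive at the server together. The helper name and the 5-second
spread are illustrative.

    // Illustrative: stagger the *first* request too, so a herd of
    // simultaneously woken clients doesn't hit the server at once.
    async function connectWithInitialJitter(
      connect: () => Promise<void>,
      maxInitialDelayMs = 5_000, // spread first attempts over 5s
    ): Promise<void> {
      // Uniform random delay before the very first attempt.
      await new Promise((r) =>
        setTimeout(r, Math.random() * maxInitialDelayMs),
      );
      await connect();
    }

Failures after the first attempt would then go through a capped,
jittered retry loop like the earlier sketch.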
| cratermoon wrote:
| I worked at a company with a self-inflicted wound related to
| retries.
|
| At some point in the distant (internet time) past, a sales
| engineer, or the equivalent, had written a sample script to
| demonstrate basic uses of the API. As many of you quickly
| guessed, customers went on a copy/paste rampage and put this
| sample script into production.
|
| The script went into a tight loop on failure, naively using a
| simple library that did not include any back-off or retry
| handling in the request. I'm not deeply familiar with how the
| company dealt with this situation; I am aware there was a complex
| load balancing system across distributed infrastructure, but
| also just a lot of horsepower.
|
| Lesson for anyone offering an API product: don't hand out example
| code with a self-own, because it will become someone's production
| code.
| joshka wrote:
| For a lot of things, retrying once and only once (at the
| outermost layer, to avoid multiplicative amplification) is more
| correct. At a large enough scale, failing twice is often
| significantly (like 90%+) correlated with the likelihood of
| failing a third time, regardless of backoff/jitter. This means
| that the second retry only serves to add more load to an already
| failing service.
| tomwt wrote:
| Retrying end-to-end instead of stepwise greatly reduces the
| reliability of a process with a reasonable number of steps.
|
| That being said, processes should ideally fail in ways which make
| it clear whether an error is retryable or not.
| xer wrote:
| Correct. It's also the case that human-generated requests lose
| their relevance within seconds; a quick retry is all they're
| worth. As for machine-generated requests, a dead letter queue
| would make more sense. Poorly engineered backend services will
| OOM and well-engineered ones will load-shed; if the requests are
| queued on the application servers, they are doomed to be lost
| anyway.
| davidw wrote:
| I have been thinking about queueing theory lately. I don't have
| the math abilities to do anything deep with it, but it seems like
| even basic applications of certain things could prove valuable in
| real-world situations where people are just kind of winging it
| with resource allocation.
___________________________________________________________________
(page generated 2023-11-23 23:00 UTC)