[HN Gopher] How percentile approximation works and why it's more...
___________________________________________________________________

How percentile approximation works and why it's more useful than averages

Author : od0
Score  : 415 points
Date   : 2021-09-14 16:14 UTC (6 hours ago)

(HTM) web link (blog.timescale.com)
(TXT) w3m dump (blog.timescale.com)

| axpy906 wrote:
| There's something called a five number summary in statistics: mean, median, standard deviation, 25th percentile and 75th percentile.
|
| The bonus is that the 75th - 50th gives you the interquartile range.
|
| Mean is not a robust measure and as such you need to look at variety to truly understand the spread of your data.
| bluesmoon wrote:
| IQR is 75th - 25th, aka, the middle-50%
| axpy906 wrote:
| You're right, I mistyped.
| monkeydust wrote:
| Ok this, box plots are a good way to visualize and show distribution esp to a not so stat heavy audience.
| 10000truths wrote:
| If you're going to use multiple quantities to summarize a distribution, wouldn't using percentiles for all of them give you the most information? The mean and standard deviation could then be estimated from that data.
| [deleted]
| achenatx wrote:
| I've been trying to get the marketing team to always include a std deviation with averages. Average alone is simply not useful; standard deviation is a simple way to essentially include percentiles.
|
| They regularly compare experiments to the mean but don't use a T test to ensure the results are actually different from the mean.
| doctorsher wrote:
| I heavily caution against the feeling that "standard deviation is a simple way to essentially include percentiles." The usefulness of the standard deviation depends on the distributions that you are working with. Heavy tailed distributions appear a fair amount in practice, and the combo of summary statistics mentioned would not do well on those. Also, Madars' comment in this thread is a beautiful example of this: 4 completely different distributions, with identical mean and standard deviation (among other things). Histograms and percentiles, and if necessary their approximations, are more desirable for the above reasons.
| slaymaker1907 wrote:
| I assume most of the distributions a marketing department would be dealing with are generally normal, in which case stddev is a great way to analyze the data. This can be easily verified by just plotting said data and making sure the tails don't look weird.
| djk447 wrote:
| NB: Post author here.
|
| Std deviation definitely helps a lot, still often not as good as percentiles; was actually thinking about adding some of that in the post but it was already getting so long. It's funny how things you think are simple sometimes take the most effort to explain, definitely found that on this one.
| waynecochran wrote:
| Yeah -- std deviation has a similar problem to the mean in that it doesn't give you a full picture unless the distribution is close to normal / gaussian.
| esyir wrote:
| Pretty much why summary statistics often give the IQR, which gives some idea to the skew and shape of the distribution as well.
|
| Unfortunately, BD and marketing just want a single number to show that the value is bigger and hate anything more complicated than a barchart.
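To make the mean-vs-percentile point above concrete, here is a minimal sketch (Python standard library only) comparing the mean and standard deviation with the median and IQR on a long-tailed sample; the lognormal data is just an assumed stand-in for something like response times:

```
import random
import statistics

# A long-tailed sample: most values are small, a few are very large.
random.seed(1)
data = [random.lognormvariate(0, 1) for _ in range(10_000)]

q1, q2, q3 = statistics.quantiles(data, n=4)  # 25th, 50th, 75th percentiles
print(f"mean   = {statistics.mean(data):.2f}")
print(f"stdev  = {statistics.stdev(data):.2f}")
print(f"median = {q2:.2f}")
print(f"IQR    = {q3 - q1:.2f}  (75th minus 25th percentile)")
```

The mean lands well above the median because the tail drags it upward, which is the usual argument for reporting percentiles alongside (or instead of) the average.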
| varelaz wrote:
| Barchart is basically your percentiles (just more of them) so why not show it? Bars and whiskers could be more complicated for them but still the same sort of data
| fwip wrote:
| Barcharts across categorical data :P
|
| That is, the first bar is "Our Number" and the second bar is "Competitor's number."
| djk447 wrote:
| NB: Post author here.
|
| We've been meaning to add IQR as an accessor function for these, may have to go back and do it...the frequency trails [1] stuff from Brendan Gregg also goes into some of this and it's really cool as a visualization.
|
| [1]: https://www.brendangregg.com/FrequencyTrails/mean.html
| LoriP wrote:
| One thing I like about this post is that it explains things in an accessible way before getting into a deep dive. Might be worth sharing with the marketing team as they'll "get" long tail in the context of web search, so the concept is fairly transferable to stuff that they would know about.
| 123pie123 wrote:
| I had to explain the data usage of an interface that looked extremely busy from the standard graphs
|
| I did a percentile graph of the usage - the data was only typically using 5% of the maximum throughput, no-one could really understand the graph though
|
| so I did a zoomed-in version of the normal data usage graph and it looked like a blip lasting 1/20 of the time - everyone got that - eg it was peaking every few seconds and then doing nothing for ages
| madars wrote:
| Good opportunity to plug https://en.wikipedia.org/wiki/Anscombe%27s_quartet : if you don't know much about the underlying distribution, simple statistics don't describe it well.
|
| From Wikipedia description: Anscombe's quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it, and the effect of outliers and other influential observations on statistical properties.
| doctorsher wrote:
| This is excellent information, thank you for posting this! I was not familiar with this example previously, but it is a _perfect_ example of summary statistics not capturing certain distributions well. It's very approachable, even if you had to limit the discussion to mean and variance alone. Bookmarked, and much appreciated.
| pdpi wrote:
| There's a fun paper by Autodesk where they make datasets that look whatever way you want them to.
|
| https://www.autodesk.com/research/publications/same-stats-di...
| djk447 wrote:
| NB: Post author here.
|
| This is great! So fun...will have to use in the future...
| pdpi wrote:
| While I have your attention...
|
| > For the median and average to be equal, the points less than the median and greater than the median must have the same distribution (i.e., there must be the same number of points that are somewhat larger and somewhat smaller and much larger and much smaller).
|
| [0, 2, 5, 9, 9] has both median and mean = 5, but the two sides don't really have the same distribution.
| djk447 wrote:
| Totally true...thoughts on how I could rephrase? I guess it's more the "weight" of points greater than and less than the median should be the same, so symmetric distributions definitely have it, asymmetric may or may not. Definitely open to revising...
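Anscombe's quartet mentioned above is easy to look at for yourself; a small sketch, assuming seaborn (which bundles the quartet as a sample dataset) and pandas are installed:

```
import seaborn as sns

# The quartet ships with seaborn as a sample dataset.
df = sns.load_dataset("anscombe")          # columns: dataset, x, y

# Nearly identical means and standard deviations for all four sets...
print(df.groupby("dataset").agg(["mean", "std"]))

# ...but plotting shows four very different shapes (call matplotlib's
# show() if running outside a notebook).
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2)
```

pdpi's [0, 2, 5, 9, 9] example makes a similar point in miniature: statistics.mean and statistics.median both return 5, even though the values below and above the median are not distributed symmetrically.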
| pdpi wrote:
| > symmetric distributions definitely have it, asymmetric may or may not
|
| Doesn't have to be any more complicated than that. It's more a curio than an important point anyway :)
| s_gourichon wrote:
| Yes, a funny and insightful lesson on how weak basic indicators (mean, standard deviation, correlation) can be, the interest and limits of box-plots with quartiles and whiskers, and the benefit of the violin-plot.
|
| Definitely worth a quick look.
| djk447 wrote:
| NB: Post author here.
|
| That's really nifty, wish I'd heard about it earlier. Might go back and add a link to it in the post at some point too! Very useful. Definitely know I wasn't breaking new ground or anything, but fun to see it represented so succinctly.
| [deleted]
| pachico wrote:
| Surprisingly, many software engineers I know have never used percentiles and keep using the mean average. True story.
| yongjik wrote:
| Bah, I'll be happy if I could even get correct averages. I see pipelines getting value X1 from a server that served 100 requests, another value X2 from a server that served one request, and then it returns (X1+X2)/2.
| mschuetz wrote:
| The mean is something you can easily compute progressively and with trivial resources. Median and percentiles, on the other hand, can be super expensive and potentially unsuitable for some real-time applications, since you need to maintain a sorted list of all relevant samples.
| groaner wrote:
| Not surprising, because computing mean is O(n) and median is O(n log n).
|
| Lack of resources or pure laziness doesn't make it the right measure to use though.
| gpderetta wrote:
| Introselect is O(n), right?
| lewispb wrote:
| Side note, but I love the animations, code snippet design and typography in this blog post.
|
| Will think about how I can improve my own blog with these ideas.
| djk447 wrote:
| Thank you! Huge shout out to Shane, Jacob and others on our team who helped with the graphics / design elements!
| solumos wrote:
| looking at a histogram is probably the best thing you can do to understand how data is distributed
| airstrike wrote:
| Sorry for the minor nitpick, but I find it a bit unusual (disappointing?) that there's an image of a candlestick chart at the very top, but the article only uses API response times as examples...
| Lightbody wrote:
| Whenever this topic comes up, I always encourage folks to watch this 2011 classic 15m talk at the Velocity conference by John Rauser:
|
| https://www.youtube.com/watch?v=coNDCIMH8bk
| mherdeg wrote:
| I've skimmed some of the literature here when I've spent time trying to help people with their bucket boundaries for Prometheus-style instrumentation of things denominated in "seconds", such as processing time and freshness.
|
| My use case is a little different from what's described here or in a lot of the literature. Some of the differences:
|
| (1) You have to pre-decide on bucket values, often hardcoded or stored in code-like places, and realistically won't bother to update them often unless the data look unusably noisy.
|
| (2) Your maximum number of buckets is pretty small -- like, no more than 10 or 15 histogram buckets probably. This is because my metrics are very high cardinality (my times get recorded alongside other dimensions that may have 5-100 distinct values, things like server instance number, method name, client name, or response status).
|
| (3) I think I know what percentiles I care about -- I'm particularly interested in minimizing error for, say, p50, p95, p99, p999 values and don't care too much about others.
|
| (4) I think I know what values I care about knowing precisely! Sometimes people call my metrics "SLIs" and sometimes they even set an "SLO" which says, say, I want no more than 0.1% of interactions to take more than 500ms. (Yes, those people say, we have accepted that this means that 0.1% of people may have an unbounded bad experience.) So, okay, fine, let's force a bucket boundary at 500ms and then we'll always be measuring that SLO with no error.
|
| (5) I know that the test data I use as input don't always reflect how the system will behave over time. For example I might feed my bucket-designing algorithm yesterday's freshness data and that might have been a day when our async data processing pipeline was never more than 10 minutes backlogged. But in fact in the real world every few months we get a >8 hour backlog and it turns out we'd like to be able to accurately measure the p99 age of processed messages even if they are very old... So despite our very limited bucket budget we probably do want some buckets at 1, 2, 4, 8, 16 hours, even if at design time they seem useless.
|
| I have always ended up hand-writing my own error approximation function which takes as input something like
|
| (1) sample data - a representative subset of the actual times observed in my system yesterday
|
| (2) proposed buckets - a bundle of, say, 15 bucket boundaries
|
| (3) percentiles I care about
|
| then returns as output info about how far off (%age error) each estimated percentile is from the actual value for my sample data.
|
| Last time I looked at this I tried using libraries that purport to compute very good bucket boundaries but they give me, like, 1500 buckets with very nice tiny error, but no clear way to make a real-world choice about collapsing this into a much smaller set of buckets with comparatively huge, but manageable, error.
|
| I ended up just advising people to
|
| * set bucket boundaries at SLO boundaries, and be sure to update when the SLO does
|
| * actually look at your data and understand the data's shape
|
| * minimize error for the data set you have now; logarithmic bucket sizes with extra buckets near the distribution's current median value seem to work well
|
| * minimize worst-case error if the things you're measuring grow very small or very large and you care about being able to observe that (add extra buckets)
| convolvatron wrote:
| look at Greenwald-Khanna (and the follow-on work). It adapts the bucket size to minimize the total error (and the number of buckets is proportional to log-epsilon I think).
|
| with all the competition and 'innovation', you would think the data dogs of this world would grow up beyond time series.
| slaymaker1907 wrote:
| One thing you could try is variance-optimal histograms. These are histograms which set bucket boundaries such that the weighted average variance in the buckets is minimized. Unfortunately, this algorithm is quadratic with the number of data points, but you could try approximating the optimum by taking random subsets of the data, computing the histogram, then seeing how well the histogram does on the whole dataset.
|
| http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/pap...
| djk447 wrote:
| NB: Post author here.
|
| This is interesting and I totally get at least some of the problems you're facing. I wonder if you could take some of the strategies from t-digest and modify a bit to accomplish it...I'd be interested in seeing some sort of implementation of this and would love to see if we can get it into our toolkit if you do...or you can also open up a ticket for us and we'll see if we can prioritize to work on something like it ourselves.
|
| I do think there are some interesting ways you can cut corners if you know things about the "SLO" or other sort of cutoff values, but I'd have to think more deeply about it to say more. Essentially we want a variable error rate based on the distance away from a known value, where you have little error in the values relatively close to the known value, care little if at all for fine distinctions on the low end and, once you get past some high end value, you could probably bucket everything above it into a "large outliers" bucket or something too...meaning your p999 could get out of hand if you start getting lots of stuff above a threshold, but, if that starts happening, probably everything's already burning down, so it might not be that useful to know it's burning down in a very specific way...
| louisnow wrote:
| ```To calculate the 10th percentile, let's say we have 10,000 values. We take all of the values, order them from largest to smallest, and identify the 1001st value (where 1000 or 10% of the values are below it), which will be our 10th percentile.```
|
| Isn't this contradictory? If we order the values from largest to smallest and take the 1001st value, then 10% of the values are above/larger and not below/smaller. I believe it should say order from smallest to largest.
| emgo wrote:
| Yes, this looks like a typo in the article. It should be smallest to largest.
| djk447 wrote:
| NB: Post author here.
|
| Oops, yep, that should probably be order from smallest to largest. Thanks for the correction!
| djk447 wrote:
| Fixed!
| tiffanyh wrote:
| Statistics 101. Mean, median and mode.
| k__ wrote:
| Shouldn't averages & variance be enough for a start?
| deft wrote:
| Looks like timescale did a big marketing push this morning only for their whole service to go down minutes later. lol.
| satvikpendem wrote:
| I often see HN articles crop up soon after a related post, in this case this Ask HN poster [0] being driven crazy by people averaging percentiles and them not seeing that it's a big deal. It's pretty funny to see such tuples of posts appearing.
|
| [0]: https://news.ycombinator.com/item?id=28518795
| michaelmdresser wrote:
| Looking particularly at latency measurements, I found the "How NOT to Measure Latency" [1] talk very illuminating. It goes quite deep into discussing how percentiles can be used and abused for measurement.
|
| [1]: https://www.infoq.com/presentations/latency-response-time/
| lanstin wrote:
| I watch this video once a year and send it to my co-workers whenever averages or medians show up in a graph for public consumption.
| peheje wrote:
| Are the points written in a readable format anywhere?
| varelaz wrote:
| Percentiles for sure are better than averages if you want to explore a distribution: there are several percentiles compared to a single average. However, the median is harder to use if you want to do calculations based on the metric; for example, the distribution of the sample median can be a problem if you want a confidence interval for it.
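For reference, the exact calculation the quoted passage describes (after the smallest-to-largest correction) is short; a rough Python sketch, using a simple nearest-rank style index rather than any particular interpolation method:

```
def percentile_exact(values, p):
    """Exact percentile: sort ascending and pick the value with
    roughly p% of the data below it (nearest-rank style, no
    interpolation)."""
    ordered = sorted(values)                      # smallest to largest
    k = int(round(p / 100 * (len(ordered) - 1)))  # rank of the target value
    return ordered[k]

data = list(range(10_000))          # 10,000 values, as in the article's example
print(percentile_exact(data, 10))   # the 1001st smallest value: 1000 values sit below it
```

This is the calculation that needs the full sorted dataset, which is exactly what the approximation methods discussed in the article are trying to avoid.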
| djk447 wrote:
| NB: Post author here.
|
| Totally can be true. In our case, we use these approximation methods that allow you to get multiple percentiles "for free", but you definitely need to choose the right ones for the job. (We talk a bit more about the whole approach where we do the aggregation then the accessor thing in the previous post on two-step aggregation [1]). But there are definitely times when averages/stddev and potentially the 3rd and 4th moments are more useful etc.
|
| [1]: https://blog.timescale.com/blog/how-postgresql-aggregation-w...
| motohagiography wrote:
| Recovering product manager in me sees 90th percentile queries with outlier high latency and starts asking, instead of how to reduce it, how we can spin it out into a dedicated premium query feature: if they're willing to wait, they're probably also willing to pay.
|
| Highly recommend modelling your solution using queueing theory with this: https://queueing-tool.readthedocs.io/en/latest/
|
| As an exercise in discovering the economics of how your platform works, even just thinking about it in these terms can save a great deal of time.
| tunesmith wrote:
| I just recently tried giving a presentation to my department (they're developers, I'm architect) about this stuff and they all just sort of blinked at me. It brought in Little's Law and Kingman's Formula, in an attempt to underscore why we need to limit variation in the response times of our requests.
|
| There are a bunch of queuing theory formulas that are really cool but don't exactly apply if your responses vary a lot like the article describes. I think the assumption is that response time distributions are exponential distributions, which I don't think is a good assumption (is an Erlang distribution an exponential distribution?). Nevertheless, hooking the equations up to some models is a good way to get directional intuition. I didn't realize how steep the performance drop-off is for server utilization until I started moving sliders around.
|
| Our ops team doesn't really follow this either. We're not a huge department though - is this the kind of stuff that an SRE team usually pays attention to?
| djk447 wrote:
| NB: Post author here.
|
| I found it surprisingly difficult to explain well. Took a lot of passes and a lot more words than I was expecting. It seems like such a simple concept. I thought the post was gonna be the shortest of my recent ones, and then after really explaining it and getting lots of edits and rewriting, it was 7000 words and ...whoops! But I guess it's what I needed to explain it well (hope you thought so anyway).
|
| It's somewhat exponential, but yeah, not necessarily; it's definitely long-tailed in some way, and it sort of doesn't matter what the theoretical description is (at least in my mind). The point is these types of distributions really don't get described well by a lot of typical statistics.
|
| Can't talk too much about SREs at the ad analytics company I mentioned, we were the backend team that wrote a lot of the backend stores / managed databases / ran the APIs and monitored this stuff a bit (and probably not all that well). It was a bit more ad hoc I guess, probably now that company is large enough they have a dedicated team for that...
| tunesmith wrote:
| The article hit the pocket pretty exactly for how I felt I needed to explain it.
| (Actually I had the same experience where I thought I would be able to go over it quickly and then I felt like I was getting mired.) The graphs look great too - I've been able to explore it pretty well with JupyterLab but I can't export that into something super-interactive that is read-only. I've thought about creating an idyll page out of it to help others explore but that's a bit overkill for the work style our team has.
|
| I _think_ the weirdly-shaped long-tail graphs we come across are just sums of more naturally-distributed response times, for different types of responses. Another reason to limit variation I think.
| benlivengood wrote:
| > I think the weirdly-shaped long-tail graphs we come across are just sums of more naturally-distributed response times, for different types of responses. Another reason to limit variation I think.
|
| The best method I've found for digging into latency is profiling RPC traces to see where time is spent. This at least separates the code and parameters that have simple latency distributions from those that don't. Some distributions will be very data-dependent (SQL queries).
| bqe wrote:
| A while ago I wrote a Python library called LiveStats[1] that computed any percentile for any amount of data using a fixed amount of memory per percentile. It uses an algorithm I found in an old paper[2] called P^2. It uses a polynomial to find good approximations.
|
| The reason I made this was an old Amazon interview question. The question was basically, "Find the median of a huge data set without sorting it," and the "correct" answer was to have a fixed size sorted buffer and randomly evict items from it and then use the median of the buffer. However, a candidate I was interviewing had a really brilliant insight: if we estimate the median and move it a small amount for each new data point, it would be pretty close. I ended up doing some research on this and found P^2, which is a more sophisticated version of that insight.
|
| [1]: https://github.com/cxxr/LiveStats
|
| [2]: https://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf
| [deleted]
| [deleted]
| cyral wrote:
| There are some newer data structures that take this to the next level such as T-Digest[1], which remains extremely accurate even when determining percentiles at the very tail end (like 99.999%)
|
| [1]: https://arxiv.org/pdf/1902.04023.pdf / https://github.com/tdunning/t-digest
| fotta wrote:
| yep I had to implement t-digest in a monitoring library. another alternative (although older) that the prometheus libraries use is CKMS quantiles [0].
|
| [0] http://dimacs.rutgers.edu/~graham/pubs/papers/bquant-icde.pd...
| djk447 wrote:
| NB: Post author here.
|
| Yeah, that was one of the reasons we chose it as one of the ones to implement, seemed like that was a really interesting tradeoff. We also used uddsketch[1] which provides relative error guarantees, which is pretty nifty. We thought they provided different enough tradeoffs that we wanted to implement both.
|
| [1]: https://arxiv.org/abs/1908.10693
| vvern wrote:
| Is it using https://github.com/tvondra/tdigest under the hood, or a separate implementation?
| Lockerman wrote:
| a separate implementation
| mfreed wrote:
| If folks are interested:
|
| https://github.com/timescale/timescaledb-toolkit/blob/main/e...
|
| (The TimescaleDB Toolkit is also implemented in Rust)
| riskneutral wrote:
| Also relevant: Single-Pass Online Statistics Algorithms
|
| [1] http://www.numericalexpert.com/articles/single_pass_stat/
| breuleux wrote:
| That's pretty neat! Can these be used to efficiently compute rolling percentiles (over windows of the data), or just incremental?
| [deleted]
| WireBaron wrote:
| The UDDSketch (default) implementation will allow rolling percentiles, though we still need a bit of work on our end to support it. There isn't a way to do this with TDigest however.
| jeffbee wrote:
| Sure there is. You simply maintain N phases of digests, and every T time you evict a phase and recompute the summary (because T-digests are easily merged).
| cyral wrote:
| This is what I do, it's not a true rolling digest but it works well enough for my purposes.
| djk447 wrote:
| I think this would be a tumbling window rather than a true "rolling" tdigest. I suppose you could decrement the buckets, but it gets a little weird as splits can't really be unsplit. The tumbling window one would probably work, though Tdigest is a little weird on merge etc as it's not completely deterministic with respect to ordering and merging (Uddsketch is), so it's _likely_ you get something that is more than good enough, but it wouldn't be the same as if you just calculated it directly, so it gets a little confusing and difficult.
|
| (NB: Post author here).
| convolvatron wrote:
| i think the new ones started with Greenwald-Khanna. but i definitely agree - p^2 can be a little silly and misleading. in particular it is really poor at finding those little modes on the tail that correspond to interesting system behaviours.
| cyral wrote:
| That sounds familiar, I remember reading about Greenwald-Khanna before I found T-Digest after I ran into the "how to find a percentile of a massive data set" problem myself.
| djk447 wrote:
| NB: Post author here.
|
| Thanks for sharing! Hadn't heard of that algorithm, have seen a number of other ones out there, we chose a couple that we knew about / were requested by users. (And we are open to more user requests if folks want to use other ones! https://github.com/timescale/timescaledb-toolkit and open an issue!)
| tantalor wrote:
| > Find the median ... randomly evict items
|
| So, not find, but approximate. That's a different thing.
| Izikiel43 wrote:
| > The question was basically, "Find the median of a huge data set without sorting it,"
|
| Isn't this done using a min heap and a max heap in conjunction?
| 5faulker wrote:
| It's funny that this is often left out from a data & algorithm class.
| lanstin wrote:
| Because unlike many dynamic programming algorithms, it is something anyone running a large system will need.
| thaumasiotes wrote:
| Is the minheap-maxheap approach faster than sorting the data? The obvious approach (process each element, one by one, into the appropriate heap, and rebalance the heaps so they are of equal size) takes n log n time and linear space. You can use the same resources to just produce a sorted copy of the input, which is a much better thing to have than two heaps that center on the median.
| ilikebits wrote:
| The real constraint here is probably "find the median of a huge data set without holding the entire data set in memory".
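As a concrete illustration of the "estimate the median and move it a small amount for each new data point" insight bqe describes above (a much-simplified cousin of P^2, not the actual P^2 algorithm), here is a rough constant-memory sketch in Python; the fixed step size is an assumption and needs tuning to the scale of the data:

```
import random

def streaming_quantile(stream, q, step=0.1):
    """Rough streaming estimate of the q-th quantile (0 < q < 1) using
    a single running value: nudge the estimate up or down on each new
    observation, with probabilities chosen so it settles near the
    desired quantile."""
    estimate = None
    for x in stream:
        if estimate is None:
            estimate = x          # seed with the first observation
            continue
        if x > estimate and random.random() < q:
            estimate += step      # value landed above us: drift up
        elif x < estimate and random.random() < (1 - q):
            estimate -= step      # value landed below us: drift down
    return estimate

# Example: estimate the median (q=0.5) of a long-tailed latency-like stream
# and compare against the exact answer.
data = [random.lognormvariate(0, 0.5) for _ in range(100_000)]
print(streaming_quantile(data, 0.5), sorted(data)[50_000])
```

The estimate only ever holds one number, which is the appeal; the price is that accuracy depends heavily on the step size and on how the data arrive, which is exactly the kind of weakness the digest/sketch structures discussed above are designed to avoid.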
| robotresearcher wrote:
| 'Estimate the median of an arbitrary sized data set using a constant amount of memory'.
| tantalor wrote:
| No, for 2 reasons,
|
| 1. huge data set: heaps require storing the whole set, but "huge" means "more than you can store"
|
| 2. without sorting it: heaps are basically semi-sorted, so you are breaking the rules
| xyzzyz wrote:
| You can actually construct the heap from unsorted data in O(n) time, so constructing the heap is definitely not sorting. However, yeah, to actually use the heap to find the median in O(n) time, you need to do something similar to the magic-five (median of medians) algorithm.
| thaumasiotes wrote:
| > but "huge" means "more than you can store"
|
| Really? Where's it coming from?
| namrog84 wrote:
| I think they mean having more than you can store simultaneously on a single device.
|
| With a few exceptions this is a pretty common scenario.
| Dylan16807 wrote:
| More than _you_ can store.
|
| And possibly it's live data.
| abnry wrote:
| One way to think about why we tend to use averages instead of medians is that it is related to a really deep theorem in probability: The Central Limit Theorem.
|
| But I think we can twist our heads and see in a way that this is backwards. Mathematically, the mean is much easier to work with because it is linear and we can do algebra with it. That's how we got the Central Limit Theorem. Percentiles and the median, except for symmetric distributions, are not as easy to work with. They involve solving for the inverse of the cumulative distribution function.
|
| But in many ways, the median and percentiles are a more relevant and intuitive number to think about. Especially in contexts where linearity is inappropriate!
| a-dub wrote:
| i think of it as: if the data is gaussian, use a mean, otherwise go non-parametric (medians/percentiles).
|
| or put another way, if you can't model it, you're going to have to sort, or estimate a sort, because that's all that's really left to do.
|
| this shows up in things from estimating centers with means/percentiles to doing statistical tests with things like the wilcoxon tests.
| lanstin wrote:
| Assume up front none of your measured latencies from a software networked system will be Gaussian, or <exaggeration> you will die a painful death </exaggeration>. Even ping times over the internet have no mean. The only good thing about means is you can combine them easily, but since they are probably a mathematical fiction, combining them is even worse. Use T-Digest or one of the other algorithms being highlighted here.
| a-dub wrote:
| yep, have made that mistake before. even turned in a write-up for a measurement project in a graduate level systems course that reported network performance dependent measurements with means over trials with error bars from standard deviations.
|
| sadly, the instructor just gave it an A and moved on. (that said, the amount of work that went into a single semester project was a bit herculean, even if i do say so myself)
| bluesmoon wrote:
| This is exactly the algorithm we developed at LogNormal (now part of Akamai) 10 years ago for doing fast, low-memory percentiles on large datasets.
|
| It's implemented in this Node library: https://github.com/bluesmoon/node-faststats
|
| Side note: I wish everyone would stop using the term Average to refer to the Arithmetic mean. "Average" just means some statistic used to summarize a dataset.
| It could be the Arithmetic Mean, Median, Mode(s), Geometric Mean, Harmonic Mean, or any of a bunch of other statistics. We're stuck with AVG because that's the function used by early databases and Lotus 123.
| JackFr wrote:
| No, we're stuck with it because average was used colloquially for arithmetic mean for decades.
|
| I wish people would stop bad-mouthing the arithmetic mean. If you have to convey information about a distribution and you've got only one number to do it, the arithmetic mean is for you.
| bluesmoon wrote:
| Yes, Lotus 123 came out 38 years ago :)
| jrochkind1 wrote:
| I think it still depends on the nature of the data and your questions about it, and median is often the better choice if you have to pick only one.
| hnuser123456 wrote:
| Gamers have an intuitive sense of this. Your average framerate can be arbitrarily high, but if you have a big stutter every second between the smooth moments, then a lower but more consistent framerate may be preferable, typically expressed as the 1% and 0.1% slowest frames, which, at a relatively typical 100fps, represent the slowest frame every second and every 10 seconds.
| slaymaker1907 wrote:
| Yep, I often find it useful to limit framerate to a number I know my GPU can handle so that I don't get as much stuttering. It's better to run at 60 FPS all the time than 80 FPS most of the time but with stuttering.
| djk447 wrote:
| NB: Post author here.
|
| _Love_ this example. Might have to use that in a future post. Feel like a lot of us are running into a similar thing with remote work and video calls these days...
| tzs wrote:
| I have no hope of finding a cite for this, but a long time ago I read some command line UI research that found if you had a system where commands ranged from instant to taking a small but noticeable time, and you introduced delays in the faster commands so that all commands took the same small but noticeable time, people would think that the system was now faster overall.
| wruza wrote:
| I guess that's because our minds (and animal minds as well) are always aware of the pace of repetitive events. If something is off, the anxiety alarm rings. One old book on the brain machinery described an example of a cat that was relaxing near the metronome and became alert when it was suddenly stopped. Unpredictable delays are disturbing, because a mispredicted event means you may be in a dangerous situation and have to recalibrate _now_.
| [deleted]
| kazinator wrote:
| Gamers do not have a more intuitive sense for this than movie watchers, video/voice talkers, not to mention users who type text into bloated web browsers or lagged remote login sessions.
|
| I would rather have a rock-steady consistent 500 ms response time when typing text, than to have a 100 ms average response time which randomly spikes to outliers that go past one second.
|
| A rock-steady, though poor, event rate in a paint application is better for drawing a freehand curve (especially if the program interpolates well) than a really fast rate that suddenly has a glitch in it, spoiling your work.
| hnuser123456 wrote:
| Maybe not all, but ones who create and consume analysis like this
|
| https://www.techpowerup.com/review/msi-geforce-rtx-3090-gami...
| exolymph wrote:
| "Slow is smooth and smooth is fast"
| rahimnathwani wrote:
| For some things, you can't even sensibly measure the mean.
| For example, if you're measuring the mean response time for a service, a single failure/timeout makes the mean response time infinite (because 100 years from now the response still hasn't been received).
|
| "Why Averages Suck and Percentiles are Great": https://www.dynatrace.com/news/blog/why-averages-suck-and-pe...
| djk447 wrote:
| totally. that blog was also one of the sources I mentioned in the post! Good stuff
|
| NB: Post author here.
| contravariant wrote:
| It's indeed always worth pointing out that a mean may or may not exist. Same with variances/standard-deviation.
|
| The central limit theorem seems to have given people the slightly wrong idea that things will _always_ average out eventually.
| maxnoe wrote:
| Coughs Cauchy
| ruchin_k wrote:
| Spent several years in venture capital investing and averages were always misleading - as Nassim Taleb says "Never cross a river that is on average 4 feet deep"
| robbrown451 wrote:
| Median/50th percentile isn't a whole lot better in that case.
| aidenn0 wrote:
| Be careful translating percentiles of requests to percentiles of users; if less than 10% of your requests take over 1 second, but a typical user makes 10 requests, it's possible that the majority of your users are seeing a request take over 1 second.
| djk447 wrote:
| NB: Post author here.
|
| Yep! Briefly noted that in the post, but deserves re-stating! It's definitely a more complex analysis to figure out the percentage of users affected (though often more important). It could be the majority, or it could also be one user who has some data scientist programmatically making hundreds of long API calls for some task... (can you tell that I ran into that? Even worse, it was one of our own data scientists ;) ).
___________________________________________________________________
(page generated 2021-09-14 23:00 UTC)