[HN Gopher] How percentile approximation works and why it's more...
       ___________________________________________________________________
        
       How percentile approximation works and why it's more useful than
       averages
        
       Author : od0
       Score  : 415 points
       Date   : 2021-09-14 16:14 UTC (6 hours ago)
        
 (HTM) web link (blog.timescale.com)
 (TXT) w3m dump (blog.timescale.com)
        
       | axpy906 wrote:
       | There's something called a five number summary in statistics:
       | mean, median, standard deviation, 25th percentile and 75th
       | percentile.
       | 
       | The bonus is that the 75th - 50th gives you the interquartile
       | range.
       | 
        | The mean is not a robust measure, and as such you need to look
        | at the variability to truly understand the spread of your data.
        
         | bluesmoon wrote:
         | IQR is 75th - 25th, aka, the middle-50%
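A quick sketch of quartiles and the IQR using only Python's stdlib (the latency numbers are made up for illustration):

```python
import statistics

# Hypothetical latency samples in ms (illustrative data only)
data = [12, 15, 14, 10, 110, 13, 16, 11, 14, 500, 12, 15]

q1, median, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                    # IQR = 75th - 25th: spread of the middle 50%
mean = statistics.mean(data)     # dragged far upward by the two outliers

print(f"mean={mean:.1f} median={median} IQR={iqr}")
```

The two outliers (110 and 500) pull the mean well above the third quartile, while the median and IQR stay anchored to the bulk of the data.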
        
           | axpy906 wrote:
            | You're right, I mistyped.
        
           | monkeydust wrote:
            | On top of this, box plots are a good way to visualize and
            | show the distribution, esp. to a not-so-stats-heavy
            | audience.
        
         | 10000truths wrote:
         | If you're going to use multiple quantities to summarize a
         | distribution, wouldn't using percentiles for all of them give
         | you the most information? The mean and standard deviation could
         | then be estimated from that data.
        
       | [deleted]
        
       | achenatx wrote:
        | I've been trying to get the marketing team to always include a
        | std deviation with averages. An average alone is simply not
        | useful; standard deviation is a simple way to essentially
        | include percentiles.
        | 
        | They regularly compare experiments to the mean but don't use a
        | t-test to ensure the results are actually different from the
        | mean.
        
         | doctorsher wrote:
         | I heavily caution against the feeling that "standard deviation
         | is a simple way to essentially include percentiles." The
         | usefulness of the standard deviation depends on the
         | distributions that you are working with. Heavy tailed
         | distributions appear a fair amount in practice, and the combo
         | of summary statistics mentioned would not do well on those.
         | Also, Madars' comment in this thread is a beautiful example of
         | this: 4 completely different distributions, with identical mean
         | and standard deviation (among other things). Histograms and
         | percentiles, and if necessary their approximations, are more
         | desirable for the above reasons.
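To make the heavy-tail point concrete, here's a sketch (synthetic data): two samples engineered to share nearly the same mean and standard deviation, where the heavy-tailed one hides a much worse 99th percentile:

```python
import random

random.seed(0)
N = 100_000

# lognormal(0, 1) has mean e^0.5 ~ 1.649 and std dev ~ 2.161
heavy = [random.lognormvariate(0, 1) for _ in range(N)]
# a normal sample tuned to (approximately) the same mean and std dev
light = [random.gauss(1.649, 2.161) for _ in range(N)]

def p99(xs):
    return sorted(xs)[int(0.99 * len(xs))]

# Means and std devs nearly agree, but p99 is ~6.7 for the normal
# sample vs ~10.2 for the heavy-tailed one.
```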
        
           | slaymaker1907 wrote:
           | I assume most of the distributions a marketing department
           | would be dealing with are generally normal in which case
           | stddev is a great way to analyze the data. This can be easily
           | verified by just plotting said data and making sure the tails
           | don't look weird.
        
         | djk447 wrote:
         | NB: Post author here.
         | 
         | Std deviation definitely helps a lot, still often not as good
         | as percentiles, was actually thinking about adding some of that
         | in the post but it was already getting so long. It's funny how
         | things you think are simple sometimes take the most effort to
         | explain, definitely found that on this one.
        
           | waynecochran wrote:
           | Yeah -- std deviation has a similar problem to the mean in
           | that it doesn't give you a full picture unless the
           | distribution is close to normal / gaussian.
        
             | esyir wrote:
             | Pretty much why summary statistics often give the IQR,
             | which gives some idea to the skew and shape of the
             | distribution as well.
             | 
             | Unfortunately, BD and marketing just want a single number
             | to show that the value is bigger and hate anything more
             | complicated than a barchart.
        
               | varelaz wrote:
                | A bar chart is basically your percentiles (just more
                | of them), so why not show it? Box-and-whisker plots
                | could be more complicated for them, but it's still the
                | same sort of data.
        
               | fwip wrote:
               | Barcharts across categorical data :P
               | 
               | That is, the first bar is "Our Number" and the second bar
               | is "Competitor's number."
        
               | djk447 wrote:
               | NB: Post author here.
               | 
               | We've been meaning to add IQR as an accessor function for
               | these, may have to go back and do it...the frequency
               | trails [1] stuff from Brendan Gregg also goes into some
               | of this and it's really cool as a visualization.
               | 
               | [1]:
               | https://www.brendangregg.com/FrequencyTrails/mean.html
        
         | LoriP wrote:
         | One thing I like about this post is that it explains things in
         | an accessible way before getting into a deep dive. Might be
         | worth sharing with the marketing team as they'll "get" long
         | tail in the context of web search, so the concept is fairly
         | transferable to stuff that they would know about.
        
       | 123pie123 wrote:
       | I had to explain the data usage of an interface that looked
       | extremely busy from the standard graphs
       | 
       | I did a percentile graph of the usage - the data was only
       | typically using 5% of the maximum throughput, no-one could really
       | understand the graph though
       | 
       | so I did a zoomed-in version of the normal data usage graph and
       | it looked like a blip lasting 1/20 of the time - everyone got
        | that - e.g. it was peaking every few seconds and then doing nothing
       | for ages
        
       | madars wrote:
       | Good opportunity to plug
       | https://en.wikipedia.org/wiki/Anscombe%27s_quartet : if you don't
       | know much about the underlying distribution, simple statistics
       | don't describe it well.
       | 
       | From Wikipedia description: Anscombe's quartet comprises four
       | data sets that have nearly identical simple descriptive
       | statistics, yet have very different distributions and appear very
       | different when graphed. Each dataset consists of eleven (x,y)
       | points. They were constructed in 1973 by the statistician Francis
       | Anscombe to demonstrate both the importance of graphing data
       | before analyzing it, and the effect of outliers and other
       | influential observations on statistical properties.
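Checking that claim directly with the published quartet values:

```python
import statistics

# Anscombe's four y-series (the x-series matter too for the regressions;
# sets I-III share x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
ys = {
    "I":   [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II":  [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV":  [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

for name, y in ys.items():
    # every set: mean ~7.50 and sample variance ~4.12 -- yet the medians
    # (and the plots) differ noticeably
    print(name, round(statistics.mean(y), 2),
          round(statistics.variance(y), 2), statistics.median(y))
```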
        
         | doctorsher wrote:
         | This is excellent information, thank you for posting this! I
         | was not familiar with this example previously, but it is a
         | _perfect_ example of summary statistics not capturing certain
          | distributions well. It's very approachable, even if you had to
         | limit the discussion to mean and variance alone. Bookmarked,
         | and much appreciated.
        
         | pdpi wrote:
         | There's a fun paper by Autodesk where they make datasets that
         | look whatever way you want them to.
         | 
         | https://www.autodesk.com/research/publications/same-stats-di...
        
           | djk447 wrote:
           | NB: Post author here.
           | 
           | This is great! So fun...will have to use in the future...
        
             | pdpi wrote:
             | While I have your attention...
             | 
             | > For the median and average to be equal, the points less
             | than the median and greater than the median must have the
             | same distribution (i.e., there must be the same number of
             | points that are somewhat larger and somewhat smaller and
             | much larger and much smaller).
             | 
             | [0, 2, 5, 9, 9] has both median and mean = 5, but the two
             | sides don't really have the same distribution.
        
               | djk447 wrote:
               | Totally true...thoughts on how I could rephrase? I guess
               | it's more the "weight" of points greater than and less
               | than the median should be the same, so symmetric
               | distributions definitely have it, asymmetric may or may
               | not. Definitely open to revising...
        
               | pdpi wrote:
               | > symmetric distributions definitely have it, asymmetric
               | may or may not
               | 
               | Doesn't have to be any more complicated than that. It's
               | more a curio than an important point anyway :)
        
           | s_gourichon wrote:
           | Yes, a both funny and insightful lesson on how weak basic
           | indicators (mean, standard deviation, correlation) can be,
           | the interest and limits of box-plot with quartiles and
           | whiskers, the benefit of the violin-plot.
           | 
           | Definitely worth a quick look.
        
         | djk447 wrote:
         | NB: Post author here.
         | 
         | That's really nifty, wish I'd heard about it earlier. Might go
         | back and add a link to it in the post at some point too! Very
         | useful. Definitely know I wasn't breaking new ground or
         | anything, but fun to see it represented so succinctly.
        
         | [deleted]
        
       | pachico wrote:
        | Surprisingly, many software engineers I know have never used
        | percentiles and keep using the plain mean. True story.
        
         | yongjik wrote:
          | Bah, I'd be happy if I could even get correct averages. I see
         | pipelines getting value X1 from a server that served 100
         | requests, another value X2 from a server that served one
         | request, and then it returns (X1+X2)/2.
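The fix is to carry the request counts along and take a weighted mean (the server numbers here are illustrative):

```python
# (mean_latency_ms, request_count) reported by each server -- made-up numbers
reports = [(20.0, 100),   # server A: 100 requests averaging 20 ms
           (500.0, 1)]    # server B: 1 request at 500 ms

naive = sum(m for m, _ in reports) / len(reports)           # 260.0: wrong
total = sum(n for _, n in reports)
weighted = sum(m * n for m, n in reports) / total           # ~24.75: right
```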
        
         | mschuetz wrote:
         | The mean is something you can easily compute progressively and
         | with trivial resources. Median and percentiles, on the other
         | hand, can be super expensive and potentially unsuitable for
         | some real-time applications, since you need to maintain a
         | sorted list of all relevant samples.
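The progressive mean really is one line per sample, which is why it's so tempting compared to a percentile's sorted buffer (a minimal sketch):

```python
class RunningMean:
    """One-pass mean in O(1) memory: no sorted list of samples required."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def push(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n   # incremental mean update

rm = RunningMean()
for x in [3.0, 5.0, 10.0]:
    rm.push(x)
# rm.mean == 6.0
```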
        
         | groaner wrote:
         | Not surprising, because computing mean is O(n) and median is
         | O(n log n).
         | 
         | Lack of resources or pure laziness doesn't make it the right
         | measure to use though.
        
           | gpderetta wrote:
           | Introselect is O(n), right?
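Expected O(n) is achievable with quickselect; introselect adds a median-of-medians fallback to pin down the worst case. A sketch:

```python
import random

def quickselect(xs, k):
    """Return the k-th smallest element (0-indexed) in expected O(n) time."""
    xs = list(xs)
    while True:
        pivot = random.choice(xs)
        lo = [x for x in xs if x < pivot]
        eq = [x for x in xs if x == pivot]
        if k < len(lo):
            xs = lo                     # answer lies in the low partition
        elif k < len(lo) + len(eq):
            return pivot                # pivot is the k-th smallest
        else:
            k -= len(lo) + len(eq)      # continue in the high partition
            xs = [x for x in xs if x > pivot]

data = random.sample(range(1000), 101)
median = quickselect(data, 50)
```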
        
       | lewispb wrote:
       | Side note, but I love the animations, code snippet design and
       | typography in this blog post.
       | 
       | Will think about how I can improve my own blog with these ideas.
        
         | djk447 wrote:
         | Thank you! Huge shout out to Shane, Jacob and others on our
         | team who helped with the graphics / design elements!
        
       | solumos wrote:
       | looking at a histogram is probably the best thing you can do to
       | understand how data is distributed
        
       | airstrike wrote:
       | Sorry for the minor nitpick, but I find it a bit unusual
       | (disappointing?) that there's an image of a candlestick chart at
       | the very top, but the article only uses API response times as
       | examples...
        
       | Lightbody wrote:
       | Whenever this topic comes up, I always encourage folks to watch
       | this 2011 classic 15m talk at the Velocity conference by John
       | Rauser:
       | 
       | https://www.youtube.com/watch?v=coNDCIMH8bk
        
       | mherdeg wrote:
       | I've skimmed some of the literature here when I've spent time
       | trying to help people with their bucket boundaries for
       | Prometheus-style instrumentation of things denominated in
       | "seconds", such as processing time and freshness.
       | 
       | My use case is a little different from what's described here or
       | in a lot of the literature. Some of the differences:
       | 
       | (1) You have to pre-decide on bucket values, often hardcoded or
       | stored in code-like places, and realistically won't bother to
       | update them often unless the data look unusably noisy.
       | 
       | (2) Your maximum number of buckets is pretty small -- like, no
       | more than 10 or 15 histogram buckets probably. This is because my
       | metrics are very high cardinality (my times get recorded
       | alongside other dimensions that may have 5-100 distinct values,
       | things like server instance number, method name, client name, or
       | response status).
       | 
       | (3) I think I know what percentiles I care about -- I'm
       | particularly interested in minimizing error for, say, p50, p95,
       | p99, p999 values and don't care too much about others.
       | 
       | (4) I think I know what values I care about knowing precisely!
       | Sometimes people call my metrics "SLIs" and sometimes they even
       | set an "SLO" which says, say, I want no more than 0.1% of
       | interactions to take more than 500ms. (Yes, those people say, we
       | have accepted that this means that 0.1% of people may have an
       | unbounded bad experience.) So, okay, fine, let's force a bucket
       | boundary at 500ms and then we'll always be measuring that SLO
       | with no error.
       | 
       | (5) I know that the test data I use as input don't always reflect
       | how the system will behave over time. For example I might feed my
       | bucket-designing algorithm yesterday's freshness data and that
       | might have been a day when our async data processing pipeline was
       | never more than 10 minutes backlogged. But in fact in the real
       | world every few months we get a >8 hour backlog and it turns out
       | we'd like to be able to accurately measure the p99 age of
       | processed messages even if they are very old... So despite our
       | very limited bucket budget we probably do want some buckets at 1,
       | 2, 4, 8, 16 hours, even if at design time they seem useless.
       | 
       | I have always ended up hand-writing my own error approximation
       | function which takes as input like
       | 
       | (1) sample data - a representative subset of the actual times
       | observed in my system yesterday
       | 
       | (2) proposed buckets - a bundle of, say, 15 bucket boundaries
       | 
       | (3) percentiles I care about
       | 
       | then returns as output info about how far off (%age error) each
       | estimated percentile is from the actual value for my sample data.
       | 
        | Last time I looked at this I tried using libraries that purport
        | to compute very good bucket boundaries, but they give me, like,
        | 1500 buckets with very nice tiny error, and no clear way to
        | make a real-world choice about collapsing this into a much
        | smaller set of buckets with comparatively huge, but manageable,
        | error.
       | 
       | I ended up just advising people to
       | 
       | * set bucket boundaries at SLO boundaries, and be sure to update
       | when the SLO does
       | 
       | * actually look at your data and understand the data's shape
       | 
       | * minimize error for the data set you have now; logarithmic
       | bucket sizes with extra buckets near the distribution's current
       | median value seems to work well
       | 
       | * minimize worst-case error if the things you're measuring grow
       | very small or very large and you care about being able to observe
       | that (add extra buckets)
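A sketch of that hand-written error checker (the within-bucket linear interpolation mimics what Prometheus's histogram_quantile does; the function name and shape are my own):

```python
import bisect

def percentile_errors(samples, bounds, percentiles):
    """Relative error of bucket-reconstructed percentiles vs exact ones.

    samples:     representative observations
    bounds:      proposed bucket upper boundaries, ascending
    percentiles: e.g. [0.50, 0.95, 0.99]
    """
    xs = sorted(samples)
    n = len(xs)
    counts = [0] * (len(bounds) + 1)           # final slot = overflow bucket
    for x in xs:
        counts[bisect.bisect_left(bounds, x)] += 1

    errors = {}
    for p in percentiles:
        exact = xs[min(n - 1, int(p * n))]
        target, cum = p * n, 0
        for i, c in enumerate(counts):         # find the bucket holding rank p*n
            if cum + c >= target and c > 0:
                lo = bounds[i - 1] if i > 0 else 0.0
                hi = bounds[i] if i < len(bounds) else xs[-1]
                est = lo + (hi - lo) * (target - cum) / c   # linear interp
                break
            cum += c
        errors[p] = abs(est - exact) / exact
    return errors

# Uniform 1..1000 against 100ms-wide buckets: every error is tiny here;
# skewed real-world data with the same buckets would fare much worse.
errs = percentile_errors(range(1, 1001),
                         [100 * i for i in range(1, 10)],
                         [0.5, 0.95, 0.99])
```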
        
         | convolvatron wrote:
         | look at Greenwald-Khanna (and the followon work). It adapts the
         | bucket size to minimize the total error (and the number of
         | buckets is proportional to log-epsilon I think).
         | 
          | With all the competition and 'innovation', you would think
          | the data dogs of this world would grow up beyond time
          | series.
        
         | slaymaker1907 wrote:
         | One thing you could try to use are variance optimal histograms.
         | These are histograms which set bucket boundaries such that the
         | weighted average variance in the buckets is minimized.
         | Unfortunately, this algorithm is quadratic with the number of
         | data points, but you could try approximating the optimum by
         | taking random subsets of the data, computing the histogram,
         | then seeing how well the histogram does on the whole dataset.
         | 
         | http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/pap...
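The quadratic algorithm is a straightforward dynamic program over prefix sums (a sketch; production implementations add pruning and approximation):

```python
def v_optimal(xs, k):
    """Variance-optimal histogram: split sorted xs into k buckets minimizing
    total within-bucket squared error. O(k * n^2) time via DP."""
    xs = sorted(xs)
    n = len(xs)
    ps = [0.0] * (n + 1)                  # prefix sums of x and x^2
    ps2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        ps[i + 1] = ps[i] + x
        ps2[i + 1] = ps2[i] + x * x

    def sse(i, j):                        # squared error of segment xs[i:j]
        s, s2, m = ps[j] - ps[i], ps2[j] - ps2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    cost = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for b in range(1, k + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):     # last bucket covers xs[i:j]
                c = cost[b - 1][i] + sse(i, j)
                if c < cost[b][j]:
                    cost[b][j], cut[b][j] = c, i

    splits, j = [], n                     # walk the cuts back to recover them
    for b in range(k, 1, -1):
        j = cut[b][j]
        splits.append(j)
    return cost[k][n], sorted(splits)

err, splits = v_optimal([1, 1, 1, 10, 10, 10], 2)
# err == 0.0 and splits == [3]: each cluster gets its own bucket
```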
        
         | djk447 wrote:
         | NB: Post author here.
         | 
         | This is interesting and I totally get at least some of the
         | problems you're facing. I wonder if you could take some of the
         | strategies from t-digest and modify a bit to accomplish...I'd
         | be interested in seeing some sort of implementation of this and
         | would love to see if we can get it into our toolkit if you
         | do...or you can also open up a ticket for us and we'll see if
         | we can prioritize to work on something like it ourselves.
         | 
         | I do think there are some interesting ways you can cut corners
         | if you know things about the "SLO" or other sort of cutoff
         | values, but I'd have to think more deeply about it to say more.
         | Essentially we want a variable error rate based on the distance
         | away from a known value, where you have little error in the
         | values relatively close to the known value, care little if at
         | all for fine distinctions on the low end and, once you get past
         | some high end value you could probably bucket everything above
         | it into a "large outliers" bucket or something too...meaning
          | your p999 could get out of hand if you start getting lots of
         | stuff above a threshold, but, if that starts happening,
         | probably everything's already burning down, so might not be
         | that useful to know it's burning down in a very specific way...
        
       | louisnow wrote:
        | > To calculate the 10th percentile, let's say we have 10,000
        | > values. We take all of the values, order them from largest to
        | > smallest, and identify the 1001st value (where 1000 or 10% of
        | > the values are below it), which will be our 10th percentile.
       | 
       | Isn't this contradictory? If we order the values from largest to
        | smallest and take the 1001st value, then 10% of the values are
       | above/larger and not below/smaller. I believe it should say order
       | from smallest to largest.
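With the ascending sort, the arithmetic checks out (a quick sketch on synthetic data):

```python
import random

random.seed(1)
values = [random.random() for _ in range(10_000)]

ordered = sorted(values)    # smallest to largest
p10 = ordered[1000]         # the 1001st value: 1000 values (10%) sit below it

below = sum(v < p10 for v in values)   # == 1000 (all values are distinct)
```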
        
         | emgo wrote:
         | Yes, this looks like a typo in the article. It should be
         | smallest to largest.
        
         | djk447 wrote:
         | NB: Post author here.
         | 
         | Oops, yep, that should probably be order from smallest to
         | largest. Thanks for the correction!
        
           | djk447 wrote:
           | Fixed!
        
       | tiffanyh wrote:
       | Statistics 101. Mean, median and mode.
        
       | k__ wrote:
        | Shouldn't averages & variance be enough for a start?
        
       | deft wrote:
       | Looks like timescale did a big marketing push this morning only
       | for their whole service to go down minutes later. lol.
        
       | satvikpendem wrote:
        | I often see HN articles crop up soon after a related post, in
        | this case this Ask HN poster [0] being driven crazy by people
        | averaging percentiles and not seeing that it's a big deal.
        | It's pretty funny to see such tuples of posts appearing.
       | 
       | https://news.ycombinator.com/item?id=28518795
        
       | michaelmdresser wrote:
       | Looking particularly at latency measurements, I found the "How
       | NOT to Measure Latency" [1] talk very illuminating. It goes quite
       | deep into discussing how percentiles can be used and abused for
       | measurement.
       | 
       | [1]: https://www.infoq.com/presentations/latency-response-time/
        
         | lanstin wrote:
         | I watch this video once a year and send it to my co-workers
         | whenever averages or medians shows up in a graph for public
         | consumption.
        
           | peheje wrote:
           | Are the points written in a readable format anywhere?
        
       | varelaz wrote:
       | Percentiles for sure are better than average if you want to
       | explore distribution: there are several percentiles comparing to
       | single average. However median is harder to use if you want to do
       | calculations based on this metric. For example distribution of
       | sample median could be a problem, if you want to understand
       | confidence interval for it for example.
        
         | djk447 wrote:
         | NB: Post author here.
         | 
          | Totally can be true. In our case, we use these approximation
          | methods that allow you to get multiple percentiles "for
          | free", but you definitely need to choose the right ones for
          | the job. (We talk a bit more about the whole approach, where
          | we do the aggregation then the accessor thing, in the
          | previous post on two-step aggregation [1].) But there are
          | definitely times when averages/stddev and potentially the
          | 3rd and 4th moments are more useful, etc.
         | 
         | [1]: https://blog.timescale.com/blog/how-postgresql-
         | aggregation-w...
        
       | motohagiography wrote:
        | The recovering product manager in me sees 90th-percentile
        | queries with outlier high latency and, instead of asking how to
        | reduce it, starts asking how we can spin it out into a
        | dedicated premium query feature: if they're willing to wait,
        | they're probably also willing to pay.
       | 
       | Highly recommend modelling your solution using queueing theory
       | with this: https://queueing-tool.readthedocs.io/en/latest/
       | 
       | As an exercise in discovering the economics of how your platform
       | works, even just thinking about it in these terms can save a
       | great deal of time.
        
       | tunesmith wrote:
       | I just recently tried giving a presentation to my department
       | (they're developers, I'm architect) about this stuff and they all
       | just sort of blinked at me. It brought in Little's Law and
       | Kingman's Formula, in an attempt to underscore why we need to
       | limit variation in the response times of our requests.
       | 
       | There are a bunch of queuing theory formulas that are really cool
       | but don't exactly apply if your responses vary a lot like the
       | article describes. I think the assumption is that response time
       | distributions are exponential distributions, which I don't think
       | is a good assumption (is an Erlang distribution an exponential
       | distribution?) Nevertheless, hooking the equations up to some
       | models is a good way to get directional intuition. I didn't
       | realize how steep the performance drop-off is for server
       | utilization until I started moving sliders around.
       | 
       | Our ops team doesn't really follow this either. We're not a huge
       | department though - is this the kind of stuff that an SRE team
       | usually pays attention to?
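Kingman's formula itself gives that directional intuition in a few lines (an illustrative sketch; real systems deviate from the G/G/1 assumptions):

```python
def kingman_wait(util, ca2, cs2, service_time):
    """Kingman's G/G/1 approximation for mean time spent waiting in queue:
    W ~ rho/(1-rho) * (ca^2 + cs^2)/2 * E[service]."""
    return util / (1 - util) * (ca2 + cs2) / 2 * service_time

# Exponential arrivals and service (ca^2 = cs^2 = 1), 10 ms mean service:
# utilization 0.5 -> 10 ms queueing, 0.9 -> 90 ms, 0.99 -> 990 ms.
# The steep drop-off lives entirely in the rho/(1-rho) factor, and
# doubling service-time variability (cs^2 = 3) doubles every wait.
```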
        
         | djk447 wrote:
         | NB: Post author here.
         | 
         | I found it surprisingly difficult to explain well. Took a lot
         | of passes and a lot more words than I was expecting. It seems
         | like such a simple concept. I thought the post was gonna be the
         | shortest of my recent ones, and then after really explaining it
         | and getting lots of edits and rewriting, it was 7000 words and
         | ...whoops! But I guess it's what I needed to explain it well
         | (hope you thought so anyway).
         | 
          | It's somewhat exponential, but yeah, not necessarily. It's
          | definitely long-tailed in some way, and it sort of doesn't
          | matter what the theoretical description is (at least in my
          | mind); the point is that these types of distributions really
          | don't get described well by a lot of typical statistics.
         | 
         | Can't talk too much about SREs at the ad analytics company I
         | mentioned, we were the backend team that wrote a lot of the
         | backend stores / managed databases / ran the APIs and monitored
         | this stuff a bit (and probably not all that well). It was a bit
         | more ad hoc I guess, probably now that company is large enough
         | they have a dedicated team for that...
        
           | tunesmith wrote:
           | The article hit the pocket pretty exactly for how I felt I
           | needed to explain it. (Actually I had the same experience
           | where I thought I would be able to go over it quickly and
           | then I felt like I was getting mired.) The graphs look great
           | too - I've been able to explore it pretty well with
           | JupyterLab but I can't export that into something super-
           | interactive that is read-only. I've thought about creating an
           | idyll page out of it to help others explore but that's a bit
           | overkill for the work style our team has.
           | 
           | I _think_ the weirdly-shaped long-tail graphs we come across
           | are just sums of more naturally-distributed response times,
           | for different types of responses. Another reason to limit
           | variation I think.
        
             | benlivengood wrote:
             | > I think the weirdly-shaped long-tail graphs we come
             | across are just sums of more naturally-distributed response
             | times, for different types of responses. Another reason to
             | limit variation I think.
             | 
             | The best method I've found for digging into latency is
             | profiling RPC traces to see where time is spent. This at
             | least separates the code and parameters that have simple
             | latency distributions from those that don't. Some
             | distributions will be very data-dependent (SQL queries).
        
       | bqe wrote:
       | Awhile ago I wrote a Python library called LiveStats[1] that
       | computed any percentile for any amount of data using a fixed
       | amount of memory per percentile. It uses an algorithm I found in
       | an old paper[2] called P^2. It uses a polynomial to find good
       | approximations.
       | 
       | The reason I made this was an old Amazon interview question. The
       | question was basically, "Find the median of a huge data set
       | without sorting it," and the "correct" answer was to have a fixed
       | size sorted buffer and randomly evict items from it and then use
       | the median of the buffer. However, a candidate I was interviewing
       | had a really brilliant insight: if we estimate the median and
       | move it a small amount for each new data point, it would be
       | pretty close. I ended up doing some research on this and found
       | P^2, which is a more sophisticated version of that insight.
       | 
       | [1]: https://github.com/cxxr/LiveStats
       | 
       | [2]: https://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf
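The candidate's insight is essentially the "Frugal"-style streaming sketch: nudge the estimate one step per observation. P^2 refines this with parabolic interpolation over five markers, but the one-unit version already works surprisingly well (a sketch):

```python
import random

def frugal_median(stream, step=1.0):
    """O(1)-memory median estimate: move toward each new point by `step`."""
    est = 0.0
    for x in stream:
        if x > est:
            est += step
        elif x < est:
            est -= step
    return est

random.seed(7)
data = [random.gauss(100, 10) for _ in range(100_000)]
est = frugal_median(data)    # lands within a few units of the true median
```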
        
         | [deleted]
        
         | [deleted]
        
         | cyral wrote:
         | There are some newer data structures that take this to the next
         | level such as T-Digest[1], which remains extremely accurate
         | even when determining percentiles at the very tail end (like
         | 99.999%)
         | 
         | [1]: https://arxiv.org/pdf/1902.04023.pdf /
         | https://github.com/tdunning/t-digest
        
           | fotta wrote:
            | Yep, I had to implement t-digest in a monitoring library.
            | Another alternative (although older) that the Prometheus
            | libraries use is CKMS quantiles [0].
           | 
           | [0] http://dimacs.rutgers.edu/~graham/pubs/papers/bquant-
           | icde.pd...
        
           | djk447 wrote:
           | NB: Post author here.
           | 
           | Yeah, that was one of the reasons we chose it as one of the
           | ones to implement, seemed like that was a really interesting
           | tradeoff, we also used uddsketch[1] which provides relative
           | error guarantees, which is pretty nifty. We thought they
           | provided different enough tradeoffs that we wanted to
           | implement both.
           | 
           | [1]: https://arxiv.org/abs/1908.10693
        
             | vvern wrote:
             | Is it using https://github.com/tvondra/tdigest under the
             | hood, or a separate implementation?
        
               | Lockerman wrote:
               | a separate implementation
        
               | mfreed wrote:
               | If folks are interested:
               | 
               | https://github.com/timescale/timescaledb-
               | toolkit/blob/main/e...
               | 
               | (The TimescaleDB Toolkit is also implemented in Rust)
        
           | riskneutral wrote:
           | Also relevant: Single-Pass Online Statistics Algorithms
           | 
           | [1] http://www.numericalexpert.com/articles/single_pass_stat/
        
           | breuleux wrote:
           | That's pretty neat! Can these be used to efficiently compute
           | rolling percentiles (over windows of the data), or just
           | incremental?
        
             | [deleted]
        
             | WireBaron wrote:
             | The UDDSketch (default) implementation will allow rolling
             | percentiles, though we still need a bit of work on our end
             | to support it. There isn't a way to do this with TDigest
             | however.
        
               | jeffbee wrote:
               | Sure there is. You simply maintain N phases of digests,
               | and every T time you evict a phase and recompute the
               | summary (because T-digests are easily merged).
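A sketch of that phase scheme, with plain lists standing in for the digests (merging real t-digests instead is what keeps the memory bounded):

```python
from collections import deque

class WindowedQuantile:
    """Approximate sliding-window quantiles from N tumbling phases."""
    def __init__(self, phases=4):
        # deque with maxlen: appending a new phase evicts the oldest
        self.phases = deque([[] for _ in range(phases)], maxlen=phases)

    def add(self, x):
        self.phases[-1].append(x)     # record into the current phase

    def rotate(self):                 # call every T seconds
        self.phases.append([])

    def quantile(self, q):
        merged = sorted(x for ph in self.phases for x in ph)
        return merged[min(len(merged) - 1, int(q * len(merged)))]

w = WindowedQuantile(phases=2)
for x in range(100):
    w.add(x)
w.rotate()
for x in range(100, 200):
    w.add(x)
mid_before = w.quantile(0.5)   # 100: median across both live phases
w.rotate()                     # the 0..99 phase falls out of the window
mid_after = w.quantile(0.5)    # 150: median of the surviving phase
```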
        
               | cyral wrote:
               | This is what I do, it's not a true rolling digest but it
               | works well enough for my purposes.
        
               | djk447 wrote:
               | I think this would be a tumbling window rather than a
               | true "rolling" tdigest. I suppose you could decrement the
               | buckets, but it gets a little weird as splits can't
               | really be unsplit. The tumbling window one would probably
               | work, though Tdigest is a little weird on merge etc as
               | it's not completely deterministic with respect to
               | ordering and merging (Uddsketch is) so it's _likely_ you
                | get something that is more than good enough, but
                | wouldn't be the same as if you just calculated it
                | directly, so it gets a little confusing and difficult.
               | 
               | (NB: Post author here).
        
           | convolvatron wrote:
            | i think the new ones started with Greenwald-Khanna. but i
            | definitely agree - p^2 can be a little silly and misleading.
           | in particular it is really poor at finding those little modes
           | on the tail that correspond to interesting system behaviours.
        
             | cyral wrote:
             | That sounds familiar, I remember reading about Greenwald-
             | Khanna before I found T-Digest after I ran into the "how to
             | find a percentile of a massive data set" problem myself.
        
         | djk447 wrote:
         | NB: Post author here.
         | 
         | Thanks for sharing! Hadn't heard of that algorithm, have seen a
         | number of other ones out there, we chose a couple that we knew
         | about / were requested by users. (And we are open to more user
         | requests if folks want to use other ones!
         | https://github.com/timescale/timescaledb-toolkit and open an
         | issue!)
        
         | tantalor wrote:
         | > Find the median ... randomly evict items
         | 
         | So, not find, but approximate. That's a different thing.
        
         | Izikiel43 wrote:
         | > The question was basically, "Find the median of a huge data
         | set without sorting it,"
         | 
          | Isn't this done using a min heap and a max heap in conjunction?
        
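[Editor's note: the two-heap trick Izikiel43 alludes to, sketched in Python. An illustration for reference, not from the post.]

```python
import heapq

class RunningMedian:
    """Maintain a max-heap of the lower half and a min-heap of the
    upper half; the median is read off the heap tops in O(1), with
    O(log n) per insertion."""

    def __init__(self):
        self.lo = []  # max-heap via negated values (lower half)
        self.hi = []  # min-heap (upper half)

    def add(self, x):
        heapq.heappush(self.lo, -x)
        # Move the largest of the lower half up, so that every element
        # of lo is <= every element of hi.
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        # Rebalance so len(lo) equals len(hi) or exceeds it by one.
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2
```

As the replies note, this still stores every element, so it answers "without sorting" but not "without holding the data set in memory".
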
           | 5faulker wrote:
            | It's funny that this is often left out of a data structures
            | & algorithms class.
        
             | lanstin wrote:
             | Because unlike many dynamic programming algorithms, it is
             | something anyone running a large system will need.
        
             | thaumasiotes wrote:
             | Is the minheap-maxheap approach faster than sorting the
             | data? The obvious approach (process each element, one by
             | one, into the appropriate heap, and rebalance the heaps so
             | they are of equal size) takes n log n time and linear
             | space. You can use the same resources to just produce a
             | sorted copy of the input, which is a much better thing to
             | have than two heaps that center on the median.
        
           | ilikebits wrote:
           | The real constraint here is probably "find the median of a
           | huge data set without holding the entire data set in memory".
        
             | robotresearcher wrote:
             | 'Estimate the median of an arbitrary sized data set using a
             | constant amount of memory'.
        
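[Editor's note: robotresearcher's constant-memory framing is what reservoir sampling gives you. A sketch using Vitter's Algorithm R; an illustration, not the post's algorithm.]

```python
import random

def reservoir_median(stream, k=1001, seed=0):
    """Estimate the median of an arbitrarily long stream in O(k)
    memory: keep a uniform random sample of k items (Algorithm R),
    then return the sample's median."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            # Each later item replaces a random slot with probability k/(i+1),
            # keeping the sample uniform over everything seen so far.
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = x
    sample.sort()
    return sample[len(sample) // 2]
```

The estimate's error shrinks with the reservoir size k, independent of the stream length.
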
           | tantalor wrote:
           | No, for 2 reasons,
           | 
           | 1. huge data set: heaps requires storing the whole set, but
           | "huge" means "more than you can store"
           | 
           | 2. without sorting it: heaps are basically semi-sorted, so
           | you are breaking the rules
        
             | xyzzyz wrote:
             | You can actually construct the heap from unsorted data in
             | O(n) time, so constructing the heap is definitely not
             | sorting. However, yeah, to actually use the heap to find
             | median in O(n) time, you need to do something similar to
             | magic-five (median of medians) algorithm.
        
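[Editor's note: the "magic-five" (median-of-medians) selection xyzzyz mentions, sketched for reference; worst-case O(n), an illustration only.]

```python
def select(items, k):
    """Return the k-th smallest element (0-indexed) in worst-case O(n)
    time using the median-of-medians pivot rule."""
    if len(items) <= 5:
        return sorted(items)[k]
    # Pivot = median of the medians of groups of 5; this guarantees a
    # constant-fraction split, which is what makes the worst case O(n).
    medians = [sorted(group)[len(group) // 2]
               for group in (items[i:i + 5] for i in range(0, len(items), 5))]
    pivot = select(medians, len(medians) // 2)
    lo = [x for x in items if x < pivot]
    eq = [x for x in items if x == pivot]
    hi = [x for x in items if x > pivot]
    if k < len(lo):
        return select(lo, k)
    if k < len(lo) + len(eq):
        return pivot
    return select(hi, k - len(lo) - len(eq))

def median(items):
    return select(list(items), len(items) // 2)
```
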
             | thaumasiotes wrote:
             | > but "huge" means "more than you can store"
             | 
             | Really? Where's it coming from?
        
               | namrog84 wrote:
               | I think they mean having more than you can store
               | simultaneously on a single device.
               | 
                | With a few exceptions, this is a pretty common scenario.
        
               | Dylan16807 wrote:
               | More than _you_ can store.
               | 
               | And possibly it's live data.
        
       | abnry wrote:
       | One way to think about why we tend to use averages instead of
       | medians is that it is related to a really deep theorem in
       | probability: The Central Limit Theorem.
       | 
        | But I think that is, in a way, backwards. Mathematically, the
        | mean is much easier to work with because it is linear and we can
        | do algebra with it. That's how we got the Central Limit Theorem.
        | Percentiles and the median, except for symmetric distributions,
        | are not as easy to work with: they involve solving for the
        | inverse of the cumulative distribution function.
       | 
       | But in many ways, the median and percentiles are a more relevant
       | and intuitive number to think about. Especially in contexts where
       | linearity is inappropriate!
        
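[Editor's note: abnry's linearity point, made concrete. The mean distributes over sums of paired data; the median does not. An illustration with made-up numbers.]

```python
import statistics

x = [0, 1, 100]
y = [0, 100, 1]
sums = [a + b for a, b in zip(x, y)]   # [0, 101, 101]

# The mean is linear, so it distributes over the pairwise sum:
print(statistics.mean(sums), statistics.mean(x) + statistics.mean(y))

# The median is not: median(x) + median(y) = 2, but the median of the
# sums is 101.
print(statistics.median(x) + statistics.median(y), statistics.median(sums))
```
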
         | a-dub wrote:
         | i think of it as: if the data is gaussian, use a mean,
         | otherwise go non-parametric (medians/percentiles).
         | 
         | or put another way, if you can't model it, you're going to have
         | to sort, or estimate a sort, because that's all that's really
         | left to do.
         | 
         | this shows up in things from estimating centers with
         | means/percentiles to doing statistical tests with things like
         | the wilcoxon tests.
        
           | lanstin wrote:
            | Assume up front that none of your measured latencies from a
            | networked software system will be Gaussian, or
            | <exaggeration> you will die a painful death </exaggeration>.
           | Even ping times over the internet have no mean. The only good
           | thing about means is you can combine them easily, but since
           | they are probably a mathematical fiction, combining them is
           | even worse. Use T-Digest or one of the other algorithms being
           | highlighted here.
        
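[Editor's note: lanstin's point, illustrated with synthetic heavy-tailed "latencies" drawn from a Pareto distribution; an illustration, not real network data.]

```python
import random

rng = random.Random(42)
# Pareto(alpha=1.1) is heavy-tailed: the sample mean is dominated by a
# handful of extreme draws, while the percentiles barely move.
latencies = sorted(rng.paretovariate(1.1) for _ in range(100_000))

mean = sum(latencies) / len(latencies)
p50 = latencies[len(latencies) // 2]
p99 = latencies[int(0.99 * len(latencies))]
print(f"mean={mean:.2f}  p50={p50:.2f}  p99={p99:.2f}")
```

Re-running with a different seed leaves p50 and p99 nearly unchanged while the mean jumps around, which is exactly why reporting the mean of such latencies is misleading.
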
             | a-dub wrote:
             | yep, have made that mistake before. even turned in a write-
             | up for a measurement project in a graduate level systems
             | course that reported network performance dependent
             | measurements with means over trials with error bars from
             | standard deviations.
             | 
             | sadly, the instructor just gave it an A and moved on. (that
             | said, the amount of work that went into a single semester
             | project was a bit herculean, even if i do say so myself)
        
       | bluesmoon wrote:
       | This is exactly the algorithm we developed at LogNormal (now part
       | of Akamai) 10 years ago for doing fast, low-memory percentiles on
       | large datasets.
       | 
       | It's implemented in this Node library:
       | https://github.com/bluesmoon/node-faststats
       | 
       | Side note: I wish everyone would stop using the term Average to
       | refer to the Arithmetic mean. "Average" just means some statistic
       | used to summarize a dataset. It could be the Arithmetic Mean,
       | Median, Mode(s), Geometric Mean, Harmonic Mean, or any of a bunch
       | of other statistics. We're stuck with AVG because that's the
       | function used by early databases and Lotus 123.
        
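[Editor's note: bluesmoon's point in code. Python's statistics module ships several of these "averages", and they can disagree a lot on the same data. Requires Python 3.8+ for geometric_mean.]

```python
import statistics

data = [1, 2, 4, 8, 100]

# All of these are "averages" in the broad sense, on the same data:
print("arithmetic mean:", statistics.mean(data))            # 23
print("median:         ", statistics.median(data))          # 4
print("geometric mean: ", statistics.geometric_mean(data))
print("harmonic mean:  ", statistics.harmonic_mean(data))
```
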
         | JackFr wrote:
          | No, we're stuck with it because "average" was used colloquially
          | for the arithmetic mean for decades.
         | 
         | I wish people would stop bad-mouthing the arithmetic mean. If
         | you have to convey information about a distribution and you've
         | got only one number to do it, the arithmetic mean is for you.
        
           | bluesmoon wrote:
           | Yes, Lotus 123 came out 38 years ago :)
        
           | jrochkind1 wrote:
           | I think it still depends on the nature of the data and your
           | questions about it, and median is often the better choice if
           | you have to pick only one.
        
       | hnuser123456 wrote:
        | Gamers have an intuitive sense of this. Your average framerate
        | can be arbitrarily high, but if you get a big stutter every
        | second between the smooth moments, a lower but more consistent
        | framerate may be preferable. This is typically expressed via the
        | 1% and 0.1% lows, which at a fairly typical 100 fps represent
        | the slowest frame in every second and every 10 seconds,
        | respectively.
        
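[Editor's note: the "1% low" metric is easy to compute from per-frame times; a sketch, using made-up frame times for illustration.]

```python
def one_percent_low(frame_times_ms):
    """'1% low' FPS as gamers report it: the average frame rate over
    the slowest 1% of frames, computed from per-frame times in ms."""
    worst = sorted(frame_times_ms, reverse=True)
    n = max(1, len(worst) // 100)
    slowest = worst[:n]
    return 1000 * n / sum(slowest)

# 990 smooth frames at 10 ms plus 10 stutters at 100 ms each:
frames = [10.0] * 990 + [100.0] * 10
avg_fps = 1000 * len(frames) / sum(frames)   # ~92 fps, looks great
print(f"avg={avg_fps:.0f} fps, 1% low={one_percent_low(frames):.0f} fps")
```

The average says ~92 fps, but the 1% low says the worst second-to-second experience is 10 fps, which matches how the stutter actually feels.
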
         | slaymaker1907 wrote:
         | Yep, I often find it useful to limit framerate to a number I
         | know my GPU can handle so that I don't get as much stuttering.
         | It's better to run at 60 FPS all the time than 80 FPS most of
         | the time but with stuttering.
        
         | djk447 wrote:
         | NB: Post author here.
         | 
         |  _Love_ this example. Might have to use that in a future post.
         | Feel like a lot of us are running into a similar thing with
         | remote work and video calls these days...
        
         | tzs wrote:
          | I have no hope of finding a cite for this, but a long time ago
          | I read some command-line UI research on a system where commands
          | ranged from instant to taking a small but noticeable time. When
          | delays were introduced into the faster commands so that every
          | command took the same small but noticeable time, people felt
          | the system was faster overall.
        
           | wruza wrote:
           | I guess that's because our minds (and animal minds as well)
           | are always aware of the pace of repetitive events. If
           | something is off, the anxiety alarm rings. One old book on
           | the brain machinery described an example of a cat that was
           | relaxing near the metronome and became alert when it was
           | suddenly stopped. Unpredictable delays are disturbing,
           | because a mispredicted event means you may be in a dangerous
           | situation and have to recalibrate _now_.
        
         | [deleted]
        
         | kazinator wrote:
         | Gamers do not have a more intuitive sense for this than movie
         | watchers, video/voice talkers, not to mention users who type
         | text into bloated web browsers or lagged remote login sessions.
         | 
          | I would rather have a rock-steady, consistent 500 ms response
          | time when typing text than a 100 ms average response time that
          | randomly spikes to outliers past one second.
         | 
         | A rock-steady, though poor, event rate in a paint application
         | is better for drawing a freehand curve (especially if the
         | program interpolates well) than a really fast rate that
         | suddenly has a glitch in it, spoiling your work.
        
           | hnuser123456 wrote:
           | Maybe not all, but ones who create and consume analysis like
           | this
           | 
           | https://www.techpowerup.com/review/msi-geforce-
           | rtx-3090-gami...
        
         | exolymph wrote:
         | "Slow is smooth and smooth is fast"
        
       | rahimnathwani wrote:
       | For some things, you can't even sensibly measure the mean. For
       | example, if you're measuring the mean response time for a
       | service, a single failure/timeout makes the mean response time
       | infinite (because 100 years from now the response still hasn't
       | been received).
       | 
       | "Why Averages Suck and Percentiles are Great":
       | https://www.dynatrace.com/news/blog/why-averages-suck-and-pe...
        
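[Editor's note: rahimnathwani's point in miniature, with a timeout modeled as an infinite latency; an illustration with made-up numbers.]

```python
import math

# Nine fast responses and one request that never completes:
latencies_ms = [12, 15, 11, 14, 13, 12, 16, 11, 13, math.inf]

mean = sum(latencies_ms) / len(latencies_ms)              # inf: one timeout poisons it
p90 = sorted(latencies_ms)[int(0.9 * len(latencies_ms)) - 1]   # still finite
print(f"mean={mean}  p90={p90}")
```
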
         | djk447 wrote:
         | totally. that blog was also one of the sources I mentioned in
         | the post! Good stuff
         | 
         | NB: Post author here.
        
         | contravariant wrote:
          | It's indeed always worth pointing out that a mean may or may
          | not exist. Same with variances/standard deviations.
         | 
         | The central limit theorem seems to have given people the
         | slightly wrong idea that things will _always_ average out
         | eventually.
        
           | maxnoe wrote:
           | Coughs Cauchy
        
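[Editor's note: the Cauchy distribution maxnoe alludes to has no mean at all, yet its median is perfectly well behaved. A quick simulation for illustration.]

```python
import math
import random

rng = random.Random(1)
# Standard Cauchy draws via the inverse CDF: tan(pi * (U - 1/2)).
xs = sorted(math.tan(math.pi * (rng.random() - 0.5)) for _ in range(100_001))

sample_median = xs[len(xs) // 2]   # concentrates near the true median, 0
sample_mean = sum(xs) / len(xs)    # does not converge: no mean exists
print(f"median={sample_median:.4f}  mean={sample_mean:.4f}")
```

Re-running with different seeds, the sample median stays pinned near 0 while the sample mean swings wildly, because occasional enormous draws never "average out".
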
       | ruchin_k wrote:
       | Spent several years in venture capital investing and averages
       | were always misleading - as Nassim Taleb says "Never cross a
       | river that is on average 4 feet deep"
        
         | robbrown451 wrote:
          | The median/50th percentile isn't a whole lot better in that
          | case.
        
       | aidenn0 wrote:
        | Be careful translating percentiles of requests into percentiles
        | of users: if less than 10% of your requests take over 1 second,
        | but a typical user makes 10 requests, it's possible that the
        | majority of your users see a request take over 1 second.
        
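[Editor's note: aidenn0's point, quantified under an independence assumption; the 8% figure is hypothetical, chosen to satisfy "less than 10% of requests".]

```python
# If 8% of requests are slow and a typical user makes 10 independent
# requests, the majority of users see at least one slow request, even
# though the request-level p90 looks fine.
p_slow_request = 0.08
requests_per_user = 10

p_user_sees_slow = 1 - (1 - p_slow_request) ** requests_per_user
print(f"{p_user_sees_slow:.0%} of users see at least one slow request")
```
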
         | djk447 wrote:
         | NB: Post author here.
         | 
          | Yep! Briefly noted that in the post, but it deserves
          | re-stating! It's definitely a more complex analysis to figure
          | out the percentage of users affected (though often more
          | important). The "majority" could also be one user who has some
          | data scientist programmatically making hundreds of long API
          | calls for some task... (Can you tell that I ran into that?
          | Even worse, it was one of our own data scientists ;) ).
        
       ___________________________________________________________________
       (page generated 2021-09-14 23:00 UTC)