[HN Gopher] Histogram vs. ECDF
___________________________________________________________________
Histogram vs. ECDF

Author : r4um
Score  : 57 points
Date   : 2022-09-03 05:42 UTC (1 days ago)

(HTM) web link (brooker.co.za)
(TXT) w3m dump (brooker.co.za)

| mike-the-mikado wrote:
| I recommend Kernel Density Estimation as an alternative to
| histograms if you are specifically interested in the density -
| e.g. which values are particularly likely to occur (perhaps for
| multimodal distributions).
|
| https://en.wikipedia.org/wiki/Kernel_density_estimation

  | iamcreasy wrote:
  | I've been told box plots with kernel density estimation on the
  | side (axis) are very useful.

| ttpphd wrote:
| I'm a behavioral scientist and I find both are useful. If you
| never look at a histogram it's surprisingly easy to fool yourself
| about what exactly the ECDF is telling you in certain situations,
| particularly when comparing distributions.

| aquafox wrote:
| The ECDF is particularly useful to compare two distributions. And
| it has a nice connection to the Kolmogorov-Smirnov test for
| testing if two distributions are different: its test statistic
| is the maximum vertical distance between the two ECDFs.

  | lukego wrote:
  | This test seems really underrated. It's my go-to for comparing
  | computer system performance (e.g. between versions on CI) since
  | they often have very peculiar distributions and are relatively
  | cheap to produce enough samples from.

    | bigbillheck wrote:
    | I've stopped using it as I found it far too sensitive to
    | small differences.

      | klipt wrote:
      | Perhaps this comes from a misunderstanding of what
      | statistical significance means. A test reporting a
      | statistically significant difference doesn't mean the
      | difference is big, just that it's big enough to separate
      | out as a "signal" from the underlying random noise.
      |
      | It's basically saying "yes, given this data I am very
      | confident that there is an underlying difference that is
      | not just an artefact of random sampling".

    | mapme wrote:
    | Hmm, I was trying to do the same with the KS test on
    | performance data but it seemed extremely sensitive to
    | outliers even when an eyeball test of the two distributions
    | looks near exact. Have you run into any of those issues?

      | lukego wrote:
      | Not with outliers, no, I think it handles them
      | particularly well (conservatively). But yes to small
      | consistent differences (e.g. uniform 1%) that affect the
      | relative order of results but not by a very important
      | amount. So you have to consider effect size even when the
      | test statistic is strong.

| dafelst wrote:
| While this is nice, it seems like without bucketing you would run
| into complexity issues with large amounts of data, right? i.e. to
| plot a true eCDF you need a sorted list of all the collected
| datapoints. I guess for actual plotting you have to effectively
| bucketize based on the number of pixels in your plot, but that
| seems fairly arbitrary.
|
| Histograms are nice in that they effectively compress non-trivial
| datasets (at least those that have a reasonable bounded domain)
| to something quite manageable.
|
| I guess there is nothing stopping you from doing the same thing
| here, but it does kind of discount the author's claim of not
| being able to go between histogram and eCDF.
|
| Am I missing something?

  | AstralStorm wrote:
  | Computing a sorted list online is an amortized O(1) operation,
  | O(log n) worst case.

    | ironSkillet wrote:
    | Adding elements one at a time is O(log n) into an already
    | sorted list. But producing a complete sorted list requires
    | doing that n times, so you end up with n log n anyways. Am I
    | missing something?

  | marcosdumay wrote:
  | If you have more data points than horizontal pixels, yes, you
  | will bucket the data on your display resolution. That happens
  | with any kind of plotting.
  |
  | Which is a completely different thing from the arbitrary
  | bucketing for histograms. A CDF doesn't go to zero or become
  | misleading if you bucket it wrong. You just lose the details.

    | dafelst wrote:
    | My point is more that you need to store n values (where n is
    | the number of samples) or 2k if there are dupes (where k is
    | the number of distinct values) for an eCDF, and if you did
    | that anyway, you could generate a histogram from the same
    | data.
    |
    | If there are duplicate sample values, you can still store a
    | sorted list of (sample, count) pairs here and generate either
    | a histogram OR an eCDF, or any other plot really.
    |
    | Effectively it is not a fair comparison to compare the two
    | methods since they both have storage tradeoffs that are not
    | really discussed.

| uluyol wrote:
| This is a nice article, but one thing that's not quite right: you
| can go from a histogram to an eCDF (basically view the bucketing
| as a loss in measurement precision).
|
| I mention this because histograms, especially HDR histograms, are
| a very compact way of measuring distributions, and it's nice that
| you can keep those benefits and still convert to an eCDF.

| TTPrograms wrote:
| I think there's an issue with the histogram rendering in this
| post. The rapid descent from the spike on the left is not
| consistent with high ECDF impact and the apparent binning
| resolution visible in the piecewise line-segments. In general
| histograms should not be visualized with connected line-graphs in
| this way - the standard bar graph depiction makes the bin-width
| apparent and resolves some of the issues the article needs the
| ECDF for (e.g. relative impact can be assessed visually by
| comparing the relative areas of the associated bars). The bar
| visualization also makes it possible to use varying bin sizes,
| which is extremely useful with any distribution that has tails.
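
The two storage ideas raised in the thread, dafelst's sorted (sample, count) list and uluyol's histogram-to-eCDF conversion, can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the bins are taken as right-closed intervals (lo, hi] so that the cumulative bin counts agree with the exact eCDF at each upper bin edge.

```python
from bisect import bisect_left
from collections import Counter
from itertools import accumulate


def to_value_counts(samples):
    """Compress raw samples into a sorted list of (value, count) pairs."""
    return sorted(Counter(samples).items())


def ecdf_from_counts(value_counts):
    """Exact eCDF: for each distinct value, the fraction of samples <= value."""
    total = sum(count for _, count in value_counts)
    running = accumulate(count for _, count in value_counts)
    return [(value, cum / total)
            for (value, _), cum in zip(value_counts, running)]


def histogram_from_counts(value_counts, edges):
    """Bin counts for right-closed bins (edges[i], edges[i+1]] (an assumption).

    Values outside (edges[0], edges[-1]] are silently dropped.
    """
    bins = [0] * (len(edges) - 1)
    for value, count in value_counts:
        i = bisect_left(edges, value) - 1  # index of the bin holding value
        if 0 <= i < len(bins):
            bins[i] += count
    return bins


def ecdf_from_histogram(bins, edges):
    """Coarse eCDF known only at the upper edge of each bin."""
    total = sum(bins)
    return [(upper, cum / total)
            for upper, cum in zip(edges[1:], accumulate(bins))]


if __name__ == "__main__":
    samples = [1, 2, 2, 3, 3, 3, 7, 9]       # made-up data for illustration
    pairs = to_value_counts(samples)
    edges = [0, 3, 6, 9]                     # made-up bin edges
    bins = histogram_from_counts(pairs, edges)
    print(ecdf_from_counts(pairs))
    print(bins)
    print(ecdf_from_histogram(bins, edges))
```

The coarse eCDF is exact at the bin edges and says nothing about what happens inside a bin, which is the "loss in measurement precision" uluyol describes. Going the other way, from the exact eCDF back to a histogram, needs nothing beyond the same (sample, count) pairs.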

| chrsig wrote:
| for anyone finding themselves doing a bit of analysis using an
| eCDF, seaborn[0] has a plot for it
|
| https://seaborn.pydata.org/generated/seaborn.ecdfplot.html
___________________________________________________________________
(page generated 2022-09-04 23:01 UTC)