[HN Gopher] Histogram vs. ECDF
       Author : r4um
       Score  : 57 points
       Date   : 2022-09-03 05:42 UTC (1 days ago)
 (HTM) web link (brooker.co.za)
 (TXT) w3m dump (brooker.co.za)
       | mike-the-mikado wrote:
       | I recommend Kernel Density Estimation as an alternative to
       | histograms if you are specifically interested in the density -
       | e.g. which values are particularly likely to occur (perhaps for
       | multimodal distributions).
       | https://en.wikipedia.org/wiki/Kernel_density_estimation
         | iamcreasy wrote:
         | I've been told that box plots with a kernel density estimate on
         | the side (axis) are very useful.
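
A minimal sketch of the kernel density estimation idea recommended above, using only the Python standard library. The Gaussian kernel, the fixed bandwidth of 0.3, the sample data, and the name `gaussian_kde` are all assumptions chosen for illustration; real analyses usually pick the bandwidth with a rule of thumb or cross-validation.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centred on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Hypothetical bimodal data: one cluster near 1 and another near 5.
data = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]
kde = gaussian_kde(data, bandwidth=0.3)

# The estimate is high at both modes and low in the gap between them.
for x in (1.0, 3.0, 5.0):
    print(f"density at {x}: {kde(x):.4f}")
```

Unlike a histogram, the result does not depend on where bin edges fall, although it does depend on the bandwidth.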
       | ttpphd wrote:
       | I'm a behavioral scientist and I find both are useful. If you
       | never look at a histogram it's surprisingly easy to fool yourself
       | about what exactly the ecdf is telling you in certain situations,
       | particularly when comparing distributions.
       | aquafox wrote:
       | The ECDF is particularly useful for comparing two distributions.
       | And it has a nice connection to the Kolmogorov-Smirnov test for
       | testing whether two distributions differ: its test statistic
       | is the maximum distance between the two ECDFs.
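
The two-sample statistic described above can be sketched in a few lines of standard-library Python. The function names and the latency figures are hypothetical; the sketch computes only the statistic (the largest vertical gap between the two ECDFs), not a p-value.

```python
from bisect import bisect_right

def ecdf(sorted_samples, x):
    """Fraction of samples less than or equal to x."""
    return bisect_right(sorted_samples, x) / len(sorted_samples)

def ks_statistic(a, b):
    """Largest vertical gap between the ECDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    # Both ECDFs are step functions that only change at observed values,
    # so the maximum gap must occur at one of those values.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

# Hypothetical latency samples (ms) from two builds.
old = [10, 11, 12, 13, 14, 15, 16, 17]
new = [12, 13, 14, 15, 16, 17, 18, 19]
print(ks_statistic(old, new))  # 0.25
```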
         | lukego wrote:
         | This test seems really underrated. It's my go-to for comparing
         | computer system performance (e.g. between versions on CI) since
         | they often have very peculiar distributions and are relatively
         | cheap to produce enough samples from.
           | bigbillheck wrote:
           | I've stopped using it as I found it far too sensitive to
           | small differences.
             | klipt wrote:
             | Perhaps this comes from a misunderstanding of what
             | statistical significance means. A test reporting a
             | statistically significant difference doesn't mean the
             | difference is big, just that it's big enough to separate
             | out as a "signal" from the underlying random noise.
             | It's basically saying "yes, given this data I am very
             | confident that there is an underlying difference that is
             | not just an artefact of random sampling".
           | mapme wrote:
           | Hmm, I was trying to do the same with the KS test on
           | performance data, but it seemed extremely sensitive to
           | outliers even when the two distributions looked nearly
           | identical by eye. Have you run into any of those issues?
             | lukego wrote:
             | Not with outliers, no, I think it handles them particularly
             | well (conservatively). But yes to small consistent
             | differences (e.g. uniform 1%) that affect the relative
             | order of results but not by a very important amount. So
             | you have to consider effect size even when the test
             | statistic is strong.
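
One way to see why the test flags small differences once samples are plentiful: under the usual large-sample approximation, the rejection threshold for the two-sample KS statistic shrinks roughly like one over the square root of the sample size. The sketch below assumes that approximation; `ks_critical_value` is a hypothetical helper name, and the 5% significance level is an arbitrary choice.

```python
import math

def ks_critical_value(n, m, alpha=0.05):
    """Approximate large-sample threshold for the two-sample KS statistic."""
    # Coefficient from the asymptotic Kolmogorov distribution,
    # about 1.358 for alpha = 0.05.
    c = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c * math.sqrt((n + m) / (n * m))

# The gap between ECDFs needed for "significance" at various sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, round(ks_critical_value(n, n), 5))
```

With a million samples per side the threshold is on the order of 0.002, so a tiny but consistent shift can be statistically significant while still being unimportant in practice.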
       | dafelst wrote:
       | While this is nice, it seems like without bucketing you would run
       | into complexity issues with large amounts of data, right? i.e. to
       | plot a true eCDF you need a sorted list of all the collected
       | datapoints. I guess for actual plotting you have to effectively
       | bucketize based on the number of pixels in your plot, but that
       | seems fairly arbitrary.
       | Histograms are nice in that they effectively compress non-trivial
       | datasets (at least those that have a reasonable bounded domain)
       | to something quite manageable.
       | I guess there is nothing stopping you from doing the same thing
       | here, but it does kind of discount the author's claim of not
       | being able to go between histogram and eCDF.
       | Am I missing something?
         | AstralStorm wrote:
         | Computing a sorted list online is an amortized O(1) operation,
         | O(log n) worst case.
           | ironSkillet wrote:
           | Adding elements one at a time is O(log n) into an already
           | sorted list. But producing a complete sorted list requires
           | doing that n times, so you end up with n log n anyways. Am I
           | missing something?
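
A small sketch of keeping a list sorted as values arrive, using the standard-library `bisect.insort`. The function name `sorted_online` and the sample stream are illustrative. It is consistent with the point above: each insertion needs a logarithmic number of comparisons, and with a plain Python list the insertion also shifts elements, so building the full sorted list is no cheaper than sorting once at the end.

```python
from bisect import insort

def sorted_online(stream):
    """Keep a list sorted while values arrive one at a time."""
    result = []
    for value in stream:
        # Binary search finds the slot in O(log n) comparisons, but a Python
        # list then shifts later elements, which is O(n) in the worst case.
        insort(result, value)
    return result

stream = [5, 1, 4, 2, 3]
print(sorted_online(stream))  # [1, 2, 3, 4, 5]
```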
         | marcosdumay wrote:
         | If you have more data points than horizontal pixels, yes, you
         | will bucket the data on your display resolution. That happens
         | with any kind of plotting.
         | Which is a completely different thing from the arbitrary
         | bucketing for histograms. A CDF doesn't go to zero or become
         | misleading if you bucket it wrong. You just lose the details.
           | dafelst wrote:
           | My point is more that you need to store n values (where n is
           | the number of samples) or 2k if there are dupes (where k is
           | the number of distinct values) for an eCDF, and if you are
           | storing that anyway, you could generate a histogram from the
           | same data.
           | If there are duplicate sample values, you can still store a
           | sorted list of (sample,count) here and generate either a
           | histogram OR an eCDF, or any other plot really.
           | Effectively it is not a fair comparison to compare the two
           | methods since they both have storage tradeoffs that are not
           | really discussed.
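
The sorted (sample, count) representation described above can be sketched as follows. The sample values, the bin width of 2, and the helper name `histogram` are made up for illustration; the point is only that both plots can be derived from the same compact store.

```python
from collections import Counter
from itertools import accumulate

samples = [3, 1, 2, 3, 3, 2, 7, 1, 3, 2]

# Compact storage: one (value, count) pair per distinct value, sorted.
pairs = sorted(Counter(samples).items())

# eCDF: running total of counts, divided by the sample size.
n = sum(count for _, count in pairs)
running = accumulate(count for _, count in pairs)
ecdf_points = [(value, total / n) for (value, _), total in zip(pairs, running)]

def histogram(pairs, bin_width):
    """Re-bucket the stored pairs with a bin width chosen after the fact."""
    bins = Counter()
    for value, count in pairs:
        bins[(value // bin_width) * bin_width] += count
    return sorted(bins.items())

print(ecdf_points)                    # [(1, 0.2), (2, 0.5), (3, 0.9), (7, 1.0)]
print(histogram(pairs, bin_width=2))  # [(0, 2), (2, 7), (6, 1)]
```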
       | uluyol wrote:
       | This is a nice article, but one thing that's not quite right:
       | you can in fact go from a histogram to an eCDF (basically view
       | the bucketing as a loss in measurement precision).
       | I mention this because histograms, especially HDR histograms, are
       | a very compact way of measuring distributions, and it's nice that
       | you can keep those benefits and still convert to an eCDF.
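
As a rough illustration of going from a histogram to an eCDF, a cumulative sum over the bin counts gives the eCDF exactly at each bin's upper edge; what is lost is the shape inside each bin. The bin edges and counts below are hypothetical, and the bins are assumed to be half-open intervals that include their lower edge.

```python
from itertools import accumulate

# Hypothetical histogram: bin edges (ms) and the count in each bin.
# Bins are [0, 10), [10, 20), [20, 50), [50, 100).
edges = [0, 10, 20, 50, 100]
counts = [50, 30, 15, 5]

total = sum(counts)
# At each bin's upper edge the cumulative fraction is known exactly;
# inside a bin it can only be interpolated.
cdf_at_upper_edges = [c / total for c in accumulate(counts)]

for upper, p in zip(edges[1:], cdf_at_upper_edges):
    print(f"fraction below {upper}: {p:.2f}")
```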
       | TTPrograms wrote:
       | I think there's an issue with the histogram rendering in this
       | post. The rapid descent from the spike on the left is not
       | consistent with high ECDF impact and the apparent binning
       | resolution visible in the piecewise line-segments. In general
       | histograms should not be visualized with connected line-graphs in
       | this way - the standard bar graph depiction makes the bin-width
       | apparent and resolves some of the issues the article needs the
       | ECDF for (e.g. relative impact can be assessed visually by
       | comparing the relative areas of the associated bars). The bar
       | visualization also makes it possible to use varying bin sizes,
       | which is extremely useful with any distribution that has tails.
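
A short sketch of the area argument above for bar-style histograms with varying bin widths. The edges and counts are invented for illustration. Each bar's height is the count divided by the sample size and the bin width, so the bar's area is the fraction of samples in that bin regardless of how wide the bin is.

```python
# Hypothetical variable-width histogram with a long right tail.
edges = [0, 1, 2, 4, 8, 64]
counts = [400, 300, 200, 80, 20]

n = sum(counts)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

# Bar height is a density: count / (sample size * bin width).
heights = [count / (n * width) for count, width in zip(counts, widths)]

# Bar area (height * width) recovers the fraction of samples in the bin.
areas = [height * width for height, width in zip(heights, widths)]

for lo, hi, height, area in zip(edges, edges[1:], heights, areas):
    print(f"[{lo:>2}, {hi:>2})  height={height:.5f}  area={area:.3f}")
```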
       | chrsig wrote:
       | For anyone finding themselves doing a bit of analysis using an
       | eCDF, seaborn has a plot for it:
       | https://seaborn.pydata.org/generated/seaborn.ecdfplot.html
