[HN Gopher] Interpreting A/B test results: false positives and s...
       ___________________________________________________________________
        
       Interpreting A/B test results: false positives and statistical
       significance
        
       Author : ciprian_craciun
       Score  : 37 points
       Date   : 2021-10-29 19:25 UTC (3 hours ago)
        
 (HTM) web link (netflixtechblog.com)
 (TXT) w3m dump (netflixtechblog.com)
        
       | palae wrote:
       | It's probably a good idea to remind (or inform) people that at
       | least in scientific research, null hypothesis statistical testing
       | and "statistical significance" in particular have come under fire
       | [1,2]. From the American Statistical Association (ASA) in 2019
       | [2]:
       | 
       | "We conclude, based on our review of the articles in this special
       | issue and the broader literature, that it is time to stop using
       | the term "statistically significant" entirely. Nor should
       | variants such as "significantly different," "p < 0.05," and
       | "nonsignificant" survive, whether expressed in words, by
       | asterisks in a table, or in some other way.
       | 
       | Regardless of whether it was ever useful, a declaration of
       | "statistical significance" has today become meaningless."
       | 
       | [1] The ASA Statement on p-Values: Context, Process, and Purpose
       | - https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1...
       | 
       | [2] Moving to a World Beyond "p < 0.05" -
       | https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1...
        
         | antihipocrat wrote:
         | I read through the first link you posted and couldn't find any
         | ideas about what we could use instead of p-Values.
         | 
          | When applied correctly, statistical tests are a useful and
          | objective method for determining whether the outcomes of one
          | thing/activity are more desirable than those of another.
         | 
          | One solution could be to set a higher bar for education in
          | statistical analysis. Or perhaps a more thorough,
          | statistically focused vetting and peer-review process for
          | published material?
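          | 
          | As a rough sketch of what a correctly applied test can look
          | like for a simple A/B comparison (the conversion counts
          | below are made up), a two-proportion z-test in Python might
          | be:
          | 
          |     import numpy as np
          |     from scipy.stats import norm
          |     
          |     # hypothetical counts: conversions / visitors per arm
          |     conv_a, n_a = 420, 10000
          |     conv_b, n_b = 480, 10000
          |     
          |     p_a, p_b = conv_a / n_a, conv_b / n_b
          |     # pooled rate under the null of no difference
          |     pool = (conv_a + conv_b) / (n_a + n_b)
          |     se = np.sqrt(pool * (1 - pool) * (1/n_a + 1/n_b))
          |     z = (p_b - p_a) / se
          |     p_value = 2 * norm.sf(abs(z))  # two-sided
          |     print(f"lift={p_b - p_a:.4f} z={z:.2f} p={p_value:.4f}")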
        
           | brian_spiering wrote:
           | Bayesian hypothesis testing, including Bayes factors, might
           | be more useful.
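            | 
            | As a toy sketch (the model choice and the counts here are
            | my own, not from the thread): for conversion data, a
            | Beta-Binomial Bayes factor can compare H0 "both arms
            | share one rate" against H1 "each arm has its own rate
            | with a Beta(1,1) prior":
            | 
            |     import numpy as np
            |     from scipy.special import betaln
            |     
            |     conv_a, n_a = 420, 10000   # made-up counts
            |     conv_b, n_b = 480, 10000
            |     fail_a, fail_b = n_a - conv_a, n_b - conv_b
            |     
            |     # log marginal likelihoods; the binomial
            |     # coefficients are the same under H0 and H1,
            |     # so they cancel in the ratio
            |     log_m1 = (betaln(1 + conv_a, 1 + fail_a)
            |               + betaln(1 + conv_b, 1 + fail_b))
            |     log_m0 = betaln(1 + conv_a + conv_b,
            |                     1 + fail_a + fail_b)
            |     
            |     bf10 = np.exp(log_m1 - log_m0)
            |     print(f"BF10 = {bf10:.2f}")
            | 
            | Unlike a p-value, BF10 reads directly as evidence: how
            | many times more probable the data are under "the rates
            | differ" than under "the rates are the same".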
        
         | kristjansson wrote:
          | It's worth pulling the principles from the ASA's statement
          | [2] as well:
          | 
          |     1. P-values can indicate how incompatible the data are
          |        with a specified statistical model.
          |     2. P-values do not measure the probability that the
          |        studied hypothesis is true, or the probability that
          |        the data were produced by random chance alone.
          |     3. Scientific conclusions and business or policy
          |        decisions should not be based only on whether a
          |        p-value passes a specific threshold.
          |     4. Proper inference requires full reporting and
          |        transparency.
          |     5. A p-value, or statistical significance, does not
          |        measure the size of an effect or the importance of a
          |        result.
          |     6. By itself, a p-value does not provide a good measure
          |        of evidence regarding a model or hypothesis.
         | 
          | The basic criticism is one of brittleness - that unless very
          | carefully planned, executed, and interpreted, p-values from
          | hypothesis tests do not support the claims some would like
          | to make based on their results, and that meeting that first
          | condition is so difficult that the technique should not be
          | recommended. One _should_ look for 'significant' results,
          | but using measures that align better with colloquial
          | understandings of significance, i.e. with how users are
          | misinterpreting p-values now.
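          | 
          | A quick simulation (my own illustration, not from the ASA
          | statement) of the false positives the article is about:
          | even with no true effect at all, a 0.05 threshold still
          | flags about 5% of A/A comparisons as "significant":
          | 
          |     import numpy as np
          |     from scipy.stats import ttest_ind
          |     
          |     rng = np.random.default_rng(0)
          |     trials, hits = 2000, 0
          |     for _ in range(trials):
          |         # both "arms" are drawn from the same distribution,
          |         # so every rejection is a false positive
          |         a = rng.normal(size=500)
          |         b = rng.normal(size=500)
          |         if ttest_ind(a, b).pvalue < 0.05:
          |             hits += 1
          |     print(f"false positive rate ~ {hits / trials:.3f}")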
        
         | samch93 wrote:
         | The ASA recently published a new statement which is more
         | optimistic about the use of p-values [1]. I myself also think
         | that correctly used p-values are in many situations a good tool
         | for making sense out of data. Of course, a decision should
          | never be based on a p-value alone, but the same could also
         | be said about confidence/credible intervals, Bayes factors,
         | relative belief ratios, and any other inferential tool
         | available (and I'm saying this as someone who is doing research
         | in Bayesian hypothesis testing methodology). Data analysts
         | always need to use common sense and put the data at hand into
         | broader context.
         | 
         | [1] https://projecteuclid.org/journals/annals-of-applied-
         | statist...
        
       | jonathanbentz wrote:
       | I am interested to see what they will be testing in some of the
       | upcoming posts in this series. It would be fun to be scrolling
       | Netflix and have the transparency to know that I'm seeing the 'B'
       | test.
        
         | kristjansson wrote:
          | As in all controlled experiments, though, the experimenter
          | wants to hide that information from the subject (the user, in
          | this instance) to measure how they respond to the change
          | itself, rather than to the change plus being told about it.
        
       | dmitriid wrote:
        | Before interpreting A/B results, the main question that needs
        | to be asked is: "what is it that you're A/B testing?"
       | 
       | For too many companies, it's testing "engagement" which leads to
       | hiding functionality (more clicks is more engagement), reducing
       | info density (more time spent is more engagement) etc.
       | 
       | And coming from Netflix... I don't think there's a single person
       | who likes that when you browse Netflix it autoplays random videos
       | (not even trailers) with audio at full volume. But yeah, A/B
       | tests something something. So I wish Netflix learned from their
       | own teachings.
        
         | dafelst wrote:
         | People may not like that feature (I sure don't), but I would
         | bet a decent sum that feature didn't drive increases in
         | negative metrics like churn, and increased positive metrics
         | like hours watched, perhaps by causing people to scroll through
         | more of the library faster, or perhaps drawing people in with
          | the previews. Or maybe they just saw an improvement in a non-
         | core metric like "distance scrolled" with no other negative
         | effects and said, "meh, ship it". Both seem likely.
         | 
         | Of course this is the danger of any sort of behavioral metric
         | driven optimization strategy - you may trade negative customer
         | sentiment for positive business outcomes. That's where the real
         | decision making comes about, i.e. are you willing to make that
         | trade? It seems that in this case, Netflix was.
        
         | mobjack wrote:
          | I've A/B tested hiding functionality and reducing info
          | density, and both changes increased the number of people
          | spending money on the site.
         | 
          | I was completely shocked when I first saw those results and
          | dove deep to look for any other negative effects from these
          | changes, but could not find any. I've repeated similar tests
          | and the results are often similar.
         | 
         | From that experience, I've learned that most people are not
         | like me or the HN crowd. The things that you complain about
         | could actually make things easier for the majority.
        
         | type_enthusiast wrote:
         | (disclaimer: I work for Netflix. Edit: I should clarify that I
         | wasn't involved with this article in any way)
         | 
         | You can disable the behavior you mentioned. Go to your profile
         | settings, and under "Playback Settings" you can uncheck
         | "Autoplay previews while browsing on all devices".
        
       ___________________________________________________________________
       (page generated 2021-10-29 23:00 UTC)