[HN Gopher] Interpreting A/B test results: false positives and s... ___________________________________________________________________ Interpreting A/B test results: false positives and statistical significance Author : ciprian_craciun Score : 37 points Date : 2021-10-29 19:25 UTC (3 hours ago) (HTM) web link (netflixtechblog.com) (TXT) w3m dump (netflixtechblog.com) | palae wrote: | It's probably a good idea to remind (or inform) people that at | least in scientific research, null hypothesis statistical testing | and "statistical significance" in particular have come under fire | [1,2]. From the American Statistical Association (ASA) in 2019 | [2]: | | "We conclude, based on our review of the articles in this special | issue and the broader literature, that it is time to stop using | the term "statistically significant" entirely. Nor should | variants such as "significantly different," "p < 0.05," and | "nonsignificant" survive, whether expressed in words, by | asterisks in a table, or in some other way. | | Regardless of whether it was ever useful, a declaration of | "statistical significance" has today become meaningless." | | [1] The ASA Statement on p-Values: Context, Process, and Purpose | - https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1... | | [2] Moving to a World Beyond "p < 0.05" - | https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1... | antihipocrat wrote: | I read through the first link you posted and couldn't find any | ideas about what we could use instead of p-Values. | | Statistical tests are a very useful and objective method of | determining whether the outcomes of one thing/activity are more | desirable than another when applied correctly. | | Some solutions could be to set a higher bar for statistical | analysis education. Or perhaps a more thorough statistically | focussed vetting and peer review process for published | material? 
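The false positives in the article's title can be made concrete with a small simulation. This is an illustrative sketch, not code from the article or the thread: it runs repeated A/A tests (both arms share the same true conversion rate) and shows that a p < 0.05 threshold still flags roughly 5% of them as "significant". All function names and parameters here are made up for illustration.

```python
# Sketch: why "statistically significant" results appear even when
# A and B are identical. Under the null, p-values are roughly uniform,
# so about alpha of all tests fall below the alpha cutoff.
import math
import random

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    # Normal-approximation two-sided tail probability via erf.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def null_false_positive_rate(trials=1000, n=2000, rate=0.10, alpha=0.05):
    """Simulate A/A tests with identical true conversion rates and
    count how often the p-value falls below alpha anyway."""
    hits = 0
    for _ in range(trials):
        a = sum(random.random() < rate for _ in range(n))
        b = sum(random.random() < rate for _ in range(n))
        if two_proportion_p_value(a, n, b, n) < alpha:
            hits += 1
    return hits / trials

random.seed(0)
print(null_false_positive_rate())  # hovers around alpha = 0.05
```

This is the uncontroversial core both sides of the thread agree on: the ~5% hit rate under the null is by construction, which is why a single p < 0.05 result, on its own, cannot tell you whether you are looking at a real effect or one of these false positives.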
| brian_spiering wrote: | Bayesian hypothesis testing, including Bayes factors, might | be more useful. | kristjansson wrote: | It's worth pulling the principles from the ASA's statement [2] | as well: 1. P-values can indicate how | incompatible the data are with a specified statistical model. | 2. P-values do not measure the probability that the studied | hypothesis is true, or the probability that the data were | produced by random chance alone. 3. Scientific | conclusions and business or policy decisions should not be | based only on whether a p-value passes a specific threshold. | 4. Proper inference requires full reporting and transparency. | 5. A p-value, or statistical significance, does not measure the | size of an effect or the importance of a result. | 6. By itself, a p-value does not provide a good measure of | evidence regarding a model or hypothesis. | | The basic criticism is one of brittleness: unless very | carefully planned, executed, and interpreted, p-values from | hypothesis tests do not support the claims some would like to | make about their results, and meeting that first condition is | so difficult that the technique should not be recommended. One | _should_ look for 'significant' results, but using measures | that align better with colloquial understandings of | significance, i.e. with how users are misinterpreting p-values | now. | samch93 wrote: | The ASA recently published a new statement which is more | optimistic about the use of p-values [1]. I myself also think | that correctly used p-values are in many situations a good tool | for making sense out of data. Of course, a decision should | never be based on a p-value alone, but the same could also | be said about confidence/credible intervals, Bayes factors, | relative belief ratios, and any other inferential tool | available (and I'm saying this as someone who is doing research | in Bayesian hypothesis testing methodology). 
Data analysts | always need to use common sense and put the data at hand into | broader context. | | [1] https://projecteuclid.org/journals/annals-of-applied- | statist... | jonathanbentz wrote: | I am interested to see what they will be testing in some of the | upcoming posts in this series. It would be fun to be scrolling | Netflix and have the transparency to know that I'm seeing the 'B' | test. | kristjansson wrote: | Like all controlled experiments, though, the experimenter wants | to hide that information from the subject (the user in this | instance) to measure how they respond to the change itself, | rather than to the change plus being told about it. | dmitriid wrote: | Before interpreting A/B results, the main question that needs to | be asked is: "what is it that you're A/B testing?" | | For too many companies, it's testing "engagement", which leads to | hiding functionality (more clicks is more engagement), reducing | info density (more time spent is more engagement), etc. | | And coming from Netflix... I don't think there's a single person | who likes that when you browse Netflix it autoplays random videos | (not even trailers) with audio at full volume. But yeah, A/B | tests something something. So I wish Netflix learned from their | own teachings. | dafelst wrote: | People may not like that feature (I sure don't), but I would | bet a decent sum that feature didn't drive increases in | negative metrics like churn, and increased positive metrics | like hours watched, perhaps by causing people to scroll through | more of the library faster, or perhaps by drawing people in with | the previews. Or maybe they just saw an improvement in a non- | core metric like "distance scrolled" with no other negative | effects and said, "meh, ship it". Either seems likely. | | Of course this is the danger of any sort of behavioral-metric- | driven optimization strategy - you may trade negative customer | sentiment for positive business outcomes. 
That's where the real | decision-making comes in, i.e. are you willing to make that | trade? It seems that in this case, Netflix was. | mobjack wrote: | I've A/B tested hiding functionality and reducing info density, | and both increased the number of people spending money on the | site. | | I was completely shocked by those results initially and | dove deep to look for any other negative effects from these | changes but could not find any. I've repeated similar tests and | the results are often similar. | | From that experience, I've learned that most people are not | like me or the HN crowd. The things that you complain about | could actually make things easier for the majority. | type_enthusiast wrote: | (disclaimer: I work for Netflix. Edit: I should clarify that I | wasn't involved with this article in any way) | | You can disable the behavior you mentioned. Go to your profile | settings, and under "Playback Settings" you can uncheck | "Autoplay previews while browsing on all devices". ___________________________________________________________________ (page generated 2021-10-29 23:00 UTC)