[HN Gopher] Run fewer, better A/B tests
___________________________________________________________________
 
Run fewer, better A/B tests
 
Author : econti
Score  : 67 points
Date   : 2021-06-26 14:44 UTC (8 hours ago)
 
(HTM) web link (edoconti.medium.com)
(TXT) w3m dump (edoconti.medium.com)
 
| btilly wrote:
| The notifications examples make me wonder what fundamental mistakes they are making.
| 
| People respond to change. If you A/B test, say, a new email headline, the change usually wins, even if it isn't better, just because it is different. Then you roll it out in production, look at it a few months later, and it is probably worse.
| 
| If you don't understand downsides like this, then A/B testing has a lot of pitfalls that you won't even know you fell into.
| uyt wrote:
| I think it's known as the "novelty effect" in the industry.
| jonathankoren wrote:
| I'm pretty skeptical of this. I've run a lot of ML-based A/B tests over my career. I've talked to a lot of people who have also run ML A/B tests over their careers. And the one constant everyone has discovered is that offline evaluation metrics are only somewhat directionally correlated with online metrics.
| 
| Seriously. A/B tests are kind of a crapshoot. The systems are constantly changing. The online inference data drifts from the historical training data. User behavior changes.
| 
| I've seen positive offline models perform flat. I've seen _negative_ offline metrics perform positively. There's just a lot of variance between offline and online performance.
| 
| Just run the test. Lower the friction for running tests, and just run them. It's the only way to be sure.
| zwaps wrote:
| Maybe I am too cynical, but what we are really talking about here is causal inference on observational data based on more or less structural statistical models.
| 
| Any researcher will tell you: this is really hard. It is more than an engineering problem. You need to know not only how to deal with problems, but what problems may arise and what you can actually identify. Most importantly, you need to figure out what you cannot identify.
| 
| There are, at least here in academia, only a limited set of people who are really good at this.
| 
| Long story short: even if offline analysis is viable, I doubt every team has the right people for it, making it potentially not worthwhile.
| 
| It is infinitely easier to produce a statistical analysis that looks good but isn't, than one that is good. An overwhelming number of useless offline models would, statistically speaking, be expected ;)
| varsketiz wrote:
| Recently I've heard booking.com given as an example of a company that runs a lot of A/B tests. Anyone from Booking reading this? How does it look from the inside? Is it worth it to run hundreds at a time?
| tootie wrote:
| I've seen a lot of really sophisticated data pipelines and testing frameworks at a lot of shops. I've seen precious few who were able to make well-considered product decisions based on the data.
| bruce343434 wrote:
| Between the emojis in the headings and the 2009-era memes, this was a bit of a cringey read. Also, the author seems to avoid at all costs going in depth about the actual implementation of OPE, and I still don't quite understand how I would go about implementing it. Machine learning based on past A/B tests that finds similarities between the UI changes???
| econti wrote:
| Author here! I implemented every method described in the post in the pip library the post uses.
| 
| In case you missed it: https://github.com/banditml/offline-policy-evaluation
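 
For readers asking what implementing OPE actually looks like, below is a minimal sketch of inverse propensity scoring (IPS), one standard off-policy estimator. It assumes the logging policy recorded the probability (propensity) of each action it took; the function and variable names are illustrative and are not the banditml library's actual interface.
 
    # Minimal IPS sketch: estimate what a candidate policy would have earned,
    # using only data logged under the old (logging) policy.
    # Names are illustrative, not the banditml API.
    import numpy as np
 
    def ips_estimate(rewards, logging_propensities, candidate_propensities):
        """Average reward the candidate policy is estimated to achieve.
 
        rewards                : observed reward for each logged action (e.g. click = 1)
        logging_propensities   : P(logged action) under the policy that collected the data
        candidate_propensities : P(same action) under the policy being evaluated offline
        """
        rewards = np.asarray(rewards, dtype=float)
        weights = np.asarray(candidate_propensities) / np.asarray(logging_propensities)
        return float(np.mean(weights * rewards))
 
    # Hypothetical example: the logging policy sent each push with probability 0.5;
    # the candidate policy would send those same pushes with probability 0.8
    # (or 0.2 for users it would rather skip).
    print(ips_estimate(
        rewards=[1, 0, 1, 1, 0],
        logging_propensities=[0.5, 0.5, 0.5, 0.5, 0.5],
        candidate_propensities=[0.8, 0.2, 0.8, 0.8, 0.2],
    ))
 
Note that an estimator like this only works where the logged policy actually explored: if it never took an action the candidate policy wants to take, the importance weight is undefined, which is relevant to the data question raised downthread.
 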
| xivzgrev wrote:
| Yea, me too.
| 
| My biggest question is where you get the user data to run the simulation. Take the simple push example - if to date you've only sent pushes on day 1, and you want to explore days 2, 3, 4, 5, etc., where does that user response data come from? It seems like you need to get the data first, and then you can simulate various permutations of a policy. But then why not just run a multi-armed bandit?
| austincheney wrote:
| When I was the A/B test guy for Travelocity I was fortunate to have an excellent team. The largest bias we discovered is that our tests were executed with amazing precision and durability. My dedicated QA was the shining star that made that happen. Unfortunately, when the resulting feature entered the site in production as a released feature there was always some defect, or some conflict, or some oversight. The actual business results would then underperform compared to the team's analyzed prediction.
| tartakovsky wrote:
| What is your advice, or can you share more detail on the types of challenges you came across and how you handled this discrepancy? I would imagine the data shifts a bit, and that standard assumptions don't hold up around how the difference you were measuring between your "A" and your "B" remains fixed after the testing period.
| austincheney wrote:
| The biggest technical challenge we came across is that when I had to hire my replacement we couldn't find competent developer talent. Everybody NEEDED everything to be jQuery and querySelectors, but those were far too slow and buggy (jQuery was buggy cross-browser). A good A/B test must not look like a test. It has to feel and look like the real thing. Some of our tests were extremely complex, spanning multiple pages and performing varieties of interactions. We couldn't be dicking around with basic code literacy and fumbling through entry-level beginner defects.
| 
| I was the team developer and not the team analyst, so I cannot speak to business assumption variance. The business didn't seem to care about this, since the release cycle was slow and defects were common. They were more concerned with the inverse proportion: cheap tests bringing stellar business wins.
| eximius wrote:
| I'm going to stick with multi-armed bandit testing.
| gingerlime wrote:
| What tools/frameworks are you using for running the tests and analysing results?
| nxpnsv wrote:
| This is a much better approach...
| sbierwagen wrote:
| > Now you might be thinking OPE is only useful if you have Facebook-level quantities of data. Luckily that's not true. If you have enough data to A/B test policies with statistical significance, you probably have more than enough data to evaluate them offline.
| 
| Isn't there a multiple comparisons problem here? If you have enough data to do a single A/B test, how can you do a hundred historical comparisons and still have the same p-value?
| dr_dshiv wrote:
| The challenge I've seen is to have a combination of good, small-scale human-centered design research (watching people work, for instance) and good, large-scale testing. It can be really hard to learn the "why" from A/B tests otherwise.
___________________________________________________________________
(page generated 2021-06-26 23:00 UTC)