[HN Gopher] Run fewer, better A/B tests
       ___________________________________________________________________
        
       Run fewer, better A/B tests
        
       Author : econti
       Score  : 67 points
       Date   : 2021-06-26 14:44 UTC (8 hours ago)
        
 (HTM) web link (edoconti.medium.com)
 (TXT) w3m dump (edoconti.medium.com)
        
       | btilly wrote:
        | The notification examples make me wonder what fundamental
        | mistakes they are making.
       | 
        | People respond to change. If you A/B test, say, a new email
        | headline, the change usually wins, even if it isn't better,
        | just because it is different. Then you roll it out in
        | production, look at it a few months later, and it is probably
        | worse.
       | 
       | If you don't understand downsides like this, then A/B testing is
       | going to have a lot of pitfalls that you won't even know that you
       | fell into.
        
         | uyt wrote:
         | I think it's known as the "novelty effect" in the industry.
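
        A minimal sketch of one way to check for this after the fact:
        segment the logged results by how long each user has been exposed
        to the variant and compare the early lift to the late lift. The
        file and column names below are hypothetical, not from the post.

          import pandas as pd
          from scipy import stats

          # Hypothetical experiment log: one row per user-day with columns
          # "variant" ("A"/"B"), "days_since_exposure", and "converted" (0/1).
          df = pd.read_csv("experiment_log.csv")

          def lift_in_window(df, lo, hi):
              """B-vs-A conversion lift for rows with lo <= days_since_exposure < hi."""
              w = df[(df["days_since_exposure"] >= lo) & (df["days_since_exposure"] < hi)]
              a = w.loc[w["variant"] == "A", "converted"]
              b = w.loc[w["variant"] == "B", "converted"]
              _, p = stats.ttest_ind(b, a, equal_var=False)  # rough significance check
              return b.mean() - a.mean(), p

          early_lift, early_p = lift_in_window(df, 0, 7)     # first week after exposure
          late_lift, late_p = lift_in_window(df, 28, 10**9)  # a month or more in

          # If the early lift is large but the late lift shrinks toward zero,
          # the "win" may be novelty rather than a real improvement.
          print(f"first week: {early_lift:.4f} (p={early_p:.3f})")
          print(f"after 4 weeks: {late_lift:.4f} (p={late_p:.3f})")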
        
       | jonathankoren wrote:
       | I'm pretty skeptical of this. I've run a lot of ML based A/B
       | tests over my career. I've talked to a lot of people that have
       | also run ML A/B tests over their careers. And the one constant
       | everyone has discovered is that offline evaluation metrics are
       | only somewhat directionally correlated with online metrics.
       | 
       | Seriously. A/B tests are kind of a crap shoot. The systems are
       | constantly changing. The online inference data drifts from the
       | historical training data. User behavior changes.
       | 
       | I've seen positive offline models perform flat. I've seen
       | _negative_ offline metrics perform positively. There's just a lot
       | of variance between offline and online performance.
       | 
       | Just run the test. Lower the friction for running the tests, and
       | just run them. It's the only way to be sure.
        
         | zwaps wrote:
          | Maybe I am too cynical, but what we are really talking about
          | here is causal inference on observational data, based on more
          | or less structural statistical models.
         | 
          | Any researcher will tell you: this is really hard. It is more
          | than an engineering problem. You need to know not only how to
          | deal with problems, but also what problems may arise and what
          | you can actually identify. Most importantly, you need to
          | figure out what you cannot identify.
         | 
         | There are, at least here in academia, only a limited set of
         | people who are really good at this.
         | 
          | Long story short: even if offline analysis is viable, I doubt
          | every team has the right people for it, which may make it not
          | worthwhile.
         | 
          | It is far easier to produce a statistical analysis that looks
          | good but isn't than one that actually is good. Statistically
          | speaking, an overwhelming number of useless offline models
          | would be expected ;)
        
       | varsketiz wrote:
        | Recently I've heard booking.com given as an example of a
        | company that runs a lot of A/B tests. Anyone from Booking
        | reading this? How does it look from the inside? Is it worth
        | running hundreds at a time?
        
       | tootie wrote:
       | I've seen a lot of really sophisticated data pipelines and
       | testing frameworks at a lot of shops. I've seen precious few who
       | were able to make well-considered product decisions based on the
       | data.
        
       | bruce343434 wrote:
        | Between the emojis in the headings and the 2009-era memes, this
        | was a bit of a cringey read. Also, the author seems to avoid
        | going into any depth about the actual implementation of OPE,
        | and I still don't quite understand how I would go about
        | implementing it. Machine learning based on past A/B tests that
        | finds similarities between the UI changes???
        
         | econti wrote:
          | Author here! I implemented every method described in the post
          | in the pip library used in it.
          | 
          | In case you missed it:
          | https://github.com/banditml/offline-policy-evaluation
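
        For readers who want the gist without opening the repo, the
        simplest estimator in this family is inverse propensity scoring
        (IPS). The sketch below is a from-scratch illustration of the
        idea, not the library's actual API; all names and numbers are
        made up.

          import numpy as np

          def ips_estimate(rewards, logging_probs, new_policy_probs):
              """Inverse propensity scoring: reweight each logged reward by how
              much more (or less) likely the candidate policy is to take the
              logged action than the logging policy was, then average."""
              weights = new_policy_probs / logging_probs
              return np.mean(weights * rewards)

          # Toy example: the logging policy chose between two actions uniformly
          # (propensity 0.5); the candidate policy picks action 1 90% of the time.
          rng = np.random.default_rng(0)
          n = 10_000
          actions = rng.integers(0, 2, size=n)
          rewards = rng.binomial(1, np.where(actions == 1, 0.12, 0.08))  # action 1 is better
          logging_probs = np.full(n, 0.5)
          new_policy_probs = np.where(actions == 1, 0.9, 0.1)

          # Expect roughly 0.9 * 0.12 + 0.1 * 0.08 = 0.116, estimated without
          # ever running the candidate policy live.
          print(ips_estimate(rewards, logging_probs, new_policy_probs))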
        
         | xivzgrev wrote:
          | Yeah, me too.
          | 
          | My biggest question is where you get the user data to run the
          | simulation. Take the simple push example: if to date you've
          | only sent pushes on day 1, and you want to explore days 2, 3,
          | 4, 5, etc., where does that user response data come from? It
          | seems like you need to collect that data first, and only then
          | can you simulate various permutations of a policy. But then
          | why not just run a multi-armed bandit?
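
        One hedged way to frame the answer: the logged data has to come
        from a policy that already randomizes its choices (and records
        the propensities), even if only a little. A hypothetical
        epsilon-greedy logger for the push example might look like the
        sketch below; if it never sends on a given day, no offline method
        can evaluate that day, and a bandit is essentially this same loop
        with "current_best" updated online as responses arrive.

          import random

          SEND_DAYS = [1, 2, 3, 4, 5]  # candidate days on which to send the push

          def choose_send_day(epsilon=0.2, current_best=1):
              """Epsilon-greedy logging policy: usually send on the current best
              day, occasionally explore another day, and record the propensity
              with which the chosen day was selected (needed later by IPS)."""
              if random.random() < epsilon:
                  day = random.choice(SEND_DAYS)
                  prob = epsilon / len(SEND_DAYS) + (1 - epsilon) * (day == current_best)
              else:
                  day = current_best
                  prob = (1 - epsilon) + epsilon / len(SEND_DAYS)
              return day, prob

          # Each logged row then holds (user, chosen day, propensity, response),
          # which is exactly what offline evaluation needs.
          day, propensity = choose_send_day()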
        
       | austincheney wrote:
       | When I was the A/B test guy for Travelocity I was fortunate to
        | have an excellent team. The largest bias we discovered is that
        | our tests were executed with amazing precision and durability;
        | my dedicated QA was the shining star that made that happen.
        | Unfortunately, when the resulting feature entered the site in
        | production as a released feature, there was always some defect,
        | some conflict, or some oversight. The actual business results
        | would then underperform compared to the team's prediction from
        | the analysis.
        
         | tartakovsky wrote:
          | What is your advice? Can you share more details on the types
          | of challenges you came across and how you handled this
          | discrepancy? I would imagine the data shifts a bit, and that
          | the standard assumption that the difference you were measuring
          | between your "A" and your "B" remains fixed after the testing
          | period doesn't hold up.
        
           | austincheney wrote:
            | The biggest technical challenge we came across is that when
            | I had to hire my replacement, we couldn't find competent
            | developer talent. Everybody NEEDED everything to be jQuery
            | and querySelectors, but those were far too slow and buggy
            | (jQuery was buggy cross-browser). A good A/B test must not
            | look like a test; it has to feel and look like the real
            | thing. Some of our tests were extremely complex, spanning
            | multiple pages and performing a variety of interactions. We
            | couldn't be dicking around with basic code literacy and
            | fumbling through entry-level beginner defects.
           | 
            | I was the team developer and not the team analyst, so I
            | cannot speak to business assumption variance. The business
            | didn't seem to care about this, since the release cycle was
            | slow and defects were common. They were more concerned with
            | the inverse proportion: cheap tests bringing stellar
            | business wins.
        
       | eximius wrote:
        | I'm going to stick with multi-armed bandit testing.
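
        For readers who haven't used one, a multi-armed bandit can be as
        small as the Thompson sampling sketch below (plain Python, not
        tied to any particular framework): sample a plausible conversion
        rate for each variant and serve the best sample, so traffic
        shifts toward the better variant on its own.

          import random

          class BetaBandit:
              """Thompson sampling over variants with binary outcomes."""

              def __init__(self, variants):
                  # Beta(1, 1) prior on each variant's conversion rate.
                  self.successes = {v: 1 for v in variants}
                  self.failures = {v: 1 for v in variants}

              def choose(self):
                  # Sample a conversion rate per variant; serve the best sample.
                  samples = {v: random.betavariate(self.successes[v], self.failures[v])
                             for v in self.successes}
                  return max(samples, key=samples.get)

              def update(self, variant, converted):
                  if converted:
                      self.successes[variant] += 1
                  else:
                      self.failures[variant] += 1

          # Per request: pick a variant, record the outcome; no fixed horizon.
          bandit = BetaBandit(["A", "B"])
          variant = bandit.choose()
          bandit.update(variant, converted=True)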
        
         | gingerlime wrote:
         | What tools/frameworks are you using for running and analysing
         | results?
        
         | nxpnsv wrote:
         | This is a much better approach...
        
       | sbierwagen wrote:
       | >Now you might be thinking OPE is only useful if you have
       | Facebook-level quantities of data. Luckily that's not true. If
       | you have enough data to A/B test policies with statistical
       | significance, you probably have more than enough data to evaluate
       | them offline.
       | 
        | Isn't there a multiple comparisons problem here? If you have
        | enough data to do a single A/B test, how can you do a hundred
        | historical comparisons and still have the same p value?
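
        There is a real multiple-comparisons concern once many candidate
        policies are scored against the same logs; a standard mitigation
        is to control the false discovery rate rather than read each p
        value at face value. A minimal Benjamini-Hochberg sketch
        (illustrative, not something the post prescribes):

          import numpy as np

          def benjamini_hochberg(p_values, alpha=0.05):
              """Return a boolean mask of which comparisons survive at false
              discovery rate alpha (Benjamini-Hochberg step-up procedure)."""
              p = np.asarray(p_values)
              m = len(p)
              order = np.argsort(p)
              # Largest k such that p_(k) <= (k / m) * alpha; reject the k smallest.
              thresholds = (np.arange(1, m + 1) / m) * alpha
              passed = p[order] <= thresholds
              k = passed.nonzero()[0].max() + 1 if passed.any() else 0
              reject = np.zeros(m, dtype=bool)
              reject[order[:k]] = True
              return reject

          # A hundred uncorrected comparisons at alpha = 0.05 would produce about
          # five false positives even on pure noise; correction keeps that in check.
          p_vals = np.random.default_rng(1).uniform(size=100)
          print(benjamini_hochberg(p_vals).sum(), "comparisons survive correction")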
        
       | dr_dshiv wrote:
        | The challenge I've seen is to have a combination of good,
        | small-scale Human-Centered Design research (watching people
        | work, for instance) and good, large-scale testing. Otherwise it
        | can be really hard to learn the "why" from A/B tests.
        
       ___________________________________________________________________
       (page generated 2021-06-26 23:00 UTC)