[HN Gopher] Calculating the sample size required for developing ...
       ___________________________________________________________________
        
       Calculating the sample size required for developing a clinical
       prediction model
        
       Author : rbanffy
       Score  : 100 points
       Date   : 2020-09-07 13:46 UTC (9 hours ago)
        
 (HTM) web link (www.bmj.com)
 (TXT) w3m dump (www.bmj.com)
        
       | bonoboTP wrote:
       | Not sure if any of the test-driven development people have
       | thought about this, but the same principle also applies there: if
       | you debug and fix the code until it passes the tests, you can
       | overfit the tests. It's no longer a good measure for code
       | quality, once it has been explicitly optimized. You'd need new,
       | previously unseen test cases. It would be an obvious mistake in
       | machine learning to simply add failed test examples into the
        | training set and rejoice that the new model can now deal with
       | those previously difficult cases.
       | 
        | It's related to Goodhart's law: "When a measure becomes a
       | target, it ceases to be a good measure".
       | 
       | It also works in one's personal life. If you find that you made a
       | mistake, don't just fix that particular thing, go back and see
        | what other similar things may need fixing. It connects to the
        | idea of fixing the deeper cause rather than just the symptoms.
        | "Deeper cause" just means something that generalizes.
       | 
       | By the way, you need to be careful about your data split. This
       | always depends on how you intend to use the model. If you intend
       | to use it on new, unseen patients, then the train and test data
       | cannot overlap with respect to people. Another obvious case is
       | videos: You can't just take every even frame as a training sample
       | and every odd frame as test. Even though technically you would be
       | testing on non-overlapping data, the performance measure would be
       | biased compared to real world performance. Or if you want to
       | classify burglars from security camera footage, you may need to
       | test it on new camera setups from different houses if you intend
       | to deploy it to new houses. If your scenario is such that you'd
       | perform training on each new site and run a location specific
       | model, you can test on images from the same site.
       | 
       | You always have to use your brain to decide what to do.
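        | 
        | A rough sketch of what a group-aware split can look like,
        | using scikit-learn's GroupShuffleSplit (the toy data, group
        | sizes and test fraction below are made up, just to illustrate
        | the "no patient in both sets" rule):
        | 
        |     import numpy as np
        |     from sklearn.model_selection import GroupShuffleSplit
        | 
        |     # Toy data: several samples (frames, visits) per person.
        |     X = np.random.rand(100, 5)                # features
        |     y = np.random.randint(0, 2, 100)          # labels
        |     patient_id = np.repeat(np.arange(20), 5)  # 20 people
        | 
        |     # Split so no person appears in both train and test.
        |     gss = GroupShuffleSplit(n_splits=1, test_size=0.25,
        |                             random_state=0)
        |     train_idx, test_idx = next(
        |         gss.split(X, y, groups=patient_id))
        | 
        |     assert set(patient_id[train_idx]).isdisjoint(
        |         patient_id[test_idx])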
        
         | shajznnckfke wrote:
         | > Not sure if any of the test-driven development people have
         | thought about this, but the same principle also applies there:
         | if you debug and fix the code until it passes the tests, you
         | can overfit the tests. It's no longer a good measure for code
         | quality, once it has been explicitly optimized. You'd need new,
         | previously unseen test cases. It would be an obvious mistake in
         | machine learning to simply add failed test examples into the
          | training set and rejoice that the new model can now deal with
         | those previously difficult cases.
         | 
         | This is a great point, and it articulates some challenges that
         | I've seen with test-driven development.
         | 
         | I would say that if you can write tests that fully specify the
         | desired functionality, rather than merely check a few possible
         | inputs, it's less of an issue. This is a reason to try to build
          | things with fewer knobs to twiddle, so the space of inputs has
         | lower dimensionality.
        
           | virgilp wrote:
           | Yes but then the tests would formally describe the
            | requirements - i.e. they would be a de facto implementation.
           | Congratulations, you just coded an (indirect) solution to
           | your problem (a solution that you postulate is correct/bug-
           | free)
        
         | mumblemumble wrote:
         | This is a big reason why I prefer property testing to unit
         | testing. One can think of focusing on the abstract behavior and
         | invariants of the code, rather than just the expected outputs
         | for specific inputs, as being somewhat akin to taking the time
         | to plan out your experiment and the statistical tests you want
         | to perform ahead of time, do your power analysis, etc. It's a
         | fair bit more up-front work, but helps to ensure that you have
         | a crystal clear idea of what you're going to do _before_ you
         | start doing it.
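          | 
          | For instance, a minimal property-test sketch (using the
          | hypothesis library, which is just my tool of choice here)
          | that pins down invariants instead of specific outputs:
          | 
          |     from collections import Counter
          |     from hypothesis import given, strategies as st
          | 
          |     # Invariants of sorting, checked on generated inputs
          |     # rather than a few hand-picked examples.
          |     @given(st.lists(st.integers()))
          |     def test_sorted_invariants(xs):
          |         out = sorted(xs)
          |         # output is ordered
          |         assert all(a <= b for a, b in zip(out, out[1:]))
          |         # same multiset of elements
          |         assert Counter(out) == Counter(xs)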
        
         | throwaway4007 wrote:
         | I was super into TDD and never thought that there was such a
         | thing as "TDD going too far" until I read that post by an ex-
         | Oracle employee: https://news.ycombinator.com/item?id=18442941
        
       | peatmoss wrote:
       | And if you're a Bayesian, you can choose a stopping criterion
       | based on your desired degree of certainty. This can have
       | practical benefits of not having to run an experiment to its end.
       | 
       | A family member was part of the control group for a cancer
       | treatment study a few years back. The study chose a stopping
       | criterion based on Bayesian methods. Relatively early into the
       | study they were able to determine that it made sense to move
       | people from the control group to the treatment group.
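        | 
        | A minimal sketch of what such a stopping rule can look like
        | for two response rates (a Beta-Binomial model; the interim
        | counts and the 99% threshold below are made up):
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        | 
        |     # Hypothetical interim tallies: successes / patients.
        |     succ_t, n_t = 42, 60   # treatment arm
        |     succ_c, n_c = 25, 60   # control arm
        | 
        |     # Beta(1, 1) priors -> Beta(1 + succ, 1 + fail) posteriors.
        |     post_t = rng.beta(1 + succ_t, 1 + n_t - succ_t, 100_000)
        |     post_c = rng.beta(1 + succ_c, 1 + n_c - succ_c, 100_000)
        | 
        |     p_better = (post_t > post_c).mean()
        |     if p_better > 0.99:   # pre-specified certainty threshold
        |         print("stop early; treatment better with prob.",
        |               p_better)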
        
         | [deleted]
        
         | icegreentea2 wrote:
         | You don't have to be a bayesian to do that.
         | 
         | Once you're dealing with real people, early termination is a
         | huge deal, and it quickly goes beyond just the interpretation
         | of statistics. You almost always want to have 3rd party
         | referees involved anyways.
        
           | MiroF wrote:
           | > You don't have to be a bayesian to do that.
           | 
           | The reasoning for doing so is much more coherent (imo) in a
           | Bayesian framework where you don't have to say that you are
           | "going beyond" statistics.
        
             | mattkrause wrote:
             | By "beyond statistics", I think they meant that trials need
             | on-going monitoring for things beyond the outcome variable
             | itself: safety, data quality, and even feasibility of
             | completing the trial itself. Once you've got that
             | infrastructure in place, interim monitoring of the outcome
             | isn't much extra work.
             | 
             | Some of the frequentist approaches to early-stopping seem
              | pretty coherent to me. Curtailment, for example, just
             | stops collecting data once it will no longer change the
             | outcome of a test. The frequentist emphasis on error
             | control _for the procedure_ also seems like a reasonable
              | fit for a regulatory regime where you're actually making
             | decisions (no argument that it is...weirder for basic
             | research)
             | 
             | At any rate, some approaches for early-stopping have
             | sensible frequentist and Bayesian properties, which is
             | nice.
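              | 
              | A made-up curtailment example: a fixed-n binomial
              | test that stops as soon as its verdict is locked in
              | (the planned n and critical value are arbitrary):
              | 
              |     import numpy as np
              | 
              |     rng = np.random.default_rng(1)
              |     data = rng.integers(0, 2, size=20)  # 0/1 outcomes
              | 
              |     n_planned = 20   # planned sample size
              |     k_crit = 15      # reject H0 at >= 15 successes
              | 
              |     succ = 0
              |     for i, x in enumerate(data, start=1):
              |         succ += x
              |         left = n_planned - i
              |         # stop once the decision can no longer change
              |         if succ >= k_crit or succ + left < k_crit:
              |             break
              |     print("stopped after", i, "obs; reject:",
              |           succ >= k_crit)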
        
               | icegreentea2 wrote:
               | Yes, I meant for clinical trials.
               | 
               | Agreed that things are weirder for basic research.
        
       | 2rsf wrote:
        | This is part of the reason why A/B testing might fail in
        | reality: it is not just about collecting some arbitrary number
        | of responses about A and B.
        | 
        | When doing a test you should plan it in advance based on other
        | data, execute it as planned, and analyze the results
        | accordingly. Failing any of these steps will result in wrong,
        | or even totally wrong, conclusions.
        
         | activatedgeek wrote:
         | > When doing a test you should plan it in advance based on
         | other data
         | 
         | I think this is not entirely true. In areas where
         | experimentation feedback is fast (and experiments are probably
          | cheaper to run), the problem is solved much more accurately
          | and sample-efficiently by Thompson Sampling [1,2], which in
          | fact requires you to have a broad enough prior over the
          | solution space and then lets the posterior dictate your
          | conclusions.
         | 
         | [1]: https://www.microsoft.com/en-us/research/wp-
         | content/uploads/...
         | 
         | [2]: http://www.economics.uci.edu/~ivan/asmb.874.pdf
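          | 
          | A bare-bones Thompson Sampling sketch for two variants
          | with Bernoulli rewards (the "true" rates are simulated
          | here and the Beta(1, 1) priors are my own choice):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     true_rates = [0.04, 0.05]  # unknown in reality
          |     succ = np.zeros(2)
          |     fail = np.zeros(2)
          | 
          |     for _ in range(10_000):
          |         # Draw a plausible rate per variant from its
          |         # Beta posterior, then serve the variant with
          |         # the highest draw.
          |         draws = rng.beta(1 + succ, 1 + fail)
          |         arm = int(np.argmax(draws))
          |         reward = rng.random() < true_rates[arm]
          |         succ[arm] += reward
          |         fail[arm] += 1 - reward
          | 
          |     print("traffic per variant:", succ + fail)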
        
         | mumblemumble wrote:
         | A concrete example:
         | 
         | For starters, consider this xkcd comic: https://xkcd.com/882/
         | 
         | If you're doing A/B testing where you continuously watch the
         | results, and conclude your test as soon as you see the numbers
         | significantly deviate, what you're doing is more-or-less
         | equivalent to what's happening in the comic, only it's the
         | number of jelly beans that you vary instead of their color. Not
         | exactly the same, since in the numbers case your draws from the
         | random variable aren't independent, but still.
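          | 
          | You can see the inflation directly in a quick simulation
          | of "peeking" at a z-test after every new observation,
          | with A and B actually identical (all numbers arbitrary):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     n_sims, n_max, z_crit = 2000, 1000, 1.96
          | 
          |     hits = 0
          |     for _ in range(n_sims):
          |         # True effect is zero, so any "significant"
          |         # result is a false positive.
          |         d = rng.normal(0, 1, n_max)  # per-user diffs
          |         n = np.arange(1, n_max + 1)
          |         z = np.cumsum(d) / np.sqrt(n)
          |         if np.any(np.abs(z[20:]) > z_crit):
          |             hits += 1
          | 
          |     # Far above the nominal 5% false positive rate.
          |     print("false positive rate:", hits / n_sims)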
        
       | ereinertsen wrote:
       | Estimating a minimum required sample size is one of the most
       | common questions asked by clinical or biomedical collaborators
       | before embarking on a research project. This is especially true
       | when ML is an option. This paper provides rules of thumb and a
       | digestible amount of theory that could inform such conversations,
       | and will surely become a popular reference.
       | 
        | Note that intuition from traditional statistics does not
        | universally apply to deep learning and/or extremely
        | high-dimensional data.
       | For example, deep neural networks with 1-4 orders of magnitude
       | more parameters than training examples can still generalize well
       | to unseen data.
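        | 
        | For comparison, the old events-per-variable heuristic (not
        | the paper's criteria, just the back-of-the-envelope number
        | many collaborations start from; all inputs below are made
        | up) is a one-liner:
        | 
        |     n_params = 12      # candidate predictor parameters
        |     event_rate = 0.10  # expected outcome prevalence
        |     epv = 10           # classic (and criticised) heuristic
        | 
        |     min_events = epv * n_params        # 120 events
        |     min_n = min_events / event_rate    # ~1200 patients
        |     print(min_events, round(min_n))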
        
       | andersource wrote:
        | The link is not to the paper itself but to a response to the
        | paper; this might be (is probably?) intentional, but the title
        | is a bit confusing.
       | 
       | Essentially the response, submitted by several ML / CS / math
       | researchers, addresses a note in the paper which recommends
       | against train/test split in model training, calling it
       | "inefficient". The response is dedicated to explaining what
       | generalization error is and why estimating it is important, and
       | how that's basically impossible without any sort of train/test
       | split or cross-validation.
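          | 
          | The point is easy to demonstrate: a model scored on its
          | own training data can look perfect even on pure noise,
          | while a held-out estimate (here 5-fold cross-validation
          | with scikit-learn, my choice of tooling) sits at chance:
          | 
          |     import numpy as np
          |     from sklearn.model_selection import cross_val_score
          |     from sklearn.tree import DecisionTreeClassifier
          | 
          |     rng = np.random.default_rng(0)
          |     X = rng.normal(size=(200, 20))
          |     y = rng.integers(0, 2, size=200)  # labels are noise
          | 
          |     model = DecisionTreeClassifier(random_state=0)
          |     cv_acc = cross_val_score(model, X, y, cv=5)
          |     print(model.fit(X, y).score(X, y))  # ~1.0
          |     print(cv_acc.mean())                # ~0.5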
        
         | mumblemumble wrote:
         | The article is there, if you click on the "Article" tab.
         | 
         | Perhaps if any staff are paying attention on the holiday, we
         | could get the "/rr" chopped off the end of the link, so that we
         | get to the main article instead?
        
           | andersource wrote:
           | My guess is that OP intended to submit the response, I was
           | really surprised to see the note in the article and found the
           | response interesting (even the fact of its existence). But
           | that's just my take :)
        
             | mumblemumble wrote:
             | I suppose, but then you'd expect a different title on the
             | HN submission, since the response has barely anything to do
             | with sample sizes.
        
       ___________________________________________________________________
       (page generated 2020-09-07 23:00 UTC)