[HN Gopher] Calculating the sample size required for developing ...
___________________________________________________________________

Calculating the sample size required for developing a clinical
prediction model

Author : rbanffy
Score  : 100 points
Date   : 2020-09-07 13:46 UTC (9 hours ago)

(HTM) web link (www.bmj.com)
(TXT) w3m dump (www.bmj.com)

| bonoboTP wrote:
| Not sure if any of the test-driven development people have
| thought about this, but the same principle also applies there: if
| you debug and fix the code until it passes the tests, you can
| overfit the tests. It's no longer a good measure of code quality
| once it has been explicitly optimized. You'd need new, previously
| unseen test cases. It would be an obvious mistake in machine
| learning to simply add failed test examples into the training set
| and rejoice that the new model can now deal with those previously
| difficult cases.
|
| It's related to Goodhart's law: "When a measure becomes a target,
| it ceases to be a good measure."
|
| It also works in one's personal life. If you find that you made a
| mistake, don't just fix that particular thing; go back and see
| what other, similar things may need fixing. It connects to the
| idea of fixing the deeper cause rather than the symptoms. "Deeper
| cause" just means something that generalizes.
|
| By the way, you need to be careful about your data split, and the
| right split always depends on how you intend to use the model. If
| you intend to use it on new, unseen patients, then the train and
| test data cannot overlap with respect to people. Another obvious
| case is video: you can't just take every even frame as a training
| sample and every odd frame as a test sample. Even though you
| would technically be testing on non-overlapping data, the
| performance measure would be biased compared to real-world
| performance. Or, if you want to classify burglars from security
| camera footage, you may need to test on new camera setups from
| different houses if you intend to deploy to new houses. If your
| scenario is such that you'd train on each new site and run a
| location-specific model, you can test on images from the same
| site.
|
| You always have to use your brain to decide what to do.
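A minimal sketch of the person-level split described above. It
assumes scikit-learn's GroupShuffleSplit; the toy data and the
variable names (X, y, patient_id) are illustrative, not from the
article:

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    # Toy data: 100 visits from 20 patients, several visits each.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))               # one feature row per visit
    y = rng.integers(0, 2, size=100)            # binary outcome per visit
    patient_id = rng.integers(0, 20, size=100)  # which patient each row is

    # Split on patients rather than rows, so that no person appears
    # in both the training set and the test set.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.25,
                                 random_state=0)
    train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))

    # Sanity check: the two sets share no patient ids.
    assert not set(patient_id[train_idx]) & set(patient_id[test_idx])

The same idea covers the video and security-camera examples: use the
recording, or the camera site, as the group key instead of the
patient.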
| shajznnckfke wrote:
| > Not sure if any of the test-driven development people have
| thought about this, but the same principle also applies there: if
| you debug and fix the code until it passes the tests, you can
| overfit the tests. It's no longer a good measure of code quality
| once it has been explicitly optimized. You'd need new, previously
| unseen test cases. It would be an obvious mistake in machine
| learning to simply add failed test examples into the training set
| and rejoice that the new model can now deal with those previously
| difficult cases.
|
| This is a great point, and it articulates some challenges that
| I've seen with test-driven development.
|
| I would say that if you can write tests that fully specify the
| desired functionality, rather than merely check a few possible
| inputs, it's less of an issue. This is a reason to try to build
| things with fewer knobs to twiddle, so the space of inputs has
| lower dimensionality.

| virgilp wrote:
| Yes, but then the tests would formally describe the requirements,
| i.e. they would be a de facto implementation. Congratulations,
| you just coded an (indirect) solution to your problem (a solution
| that you postulate is correct/bug-free).

| mumblemumble wrote:
| This is a big reason why I prefer property testing to unit
| testing. Focusing on the abstract behavior and invariants of the
| code, rather than just the expected outputs for specific inputs,
| is somewhat akin to taking the time to plan out your experiment
| and the statistical tests you want to perform ahead of time, do
| your power analysis, etc. It's a fair bit more up-front work, but
| it helps to ensure that you have a crystal clear idea of what
| you're going to do _before_ you start doing it.

| throwaway4007 wrote:
| I was super into TDD and never thought that there was such a
| thing as "TDD going too far" until I read that post by an
| ex-Oracle employee: https://news.ycombinator.com/item?id=18442941

| peatmoss wrote:
| And if you're a Bayesian, you can choose a stopping criterion
| based on your desired degree of certainty. This has the practical
| benefit that you don't always have to run an experiment to its
| end.
|
| A family member was part of the control group for a cancer
| treatment study a few years back. The study chose a stopping
| criterion based on Bayesian methods. Relatively early into the
| study they were able to determine that it made sense to move
| people from the control group to the treatment group.

| [deleted]

| icegreentea2 wrote:
| You don't have to be a Bayesian to do that.
|
| Once you're dealing with real people, early termination is a huge
| deal, and it quickly goes beyond just the interpretation of
| statistics. You almost always want to have 3rd party referees
| involved anyway.

| MiroF wrote:
| > You don't have to be a Bayesian to do that.
|
| The reasoning for doing so is much more coherent (imo) in a
| Bayesian framework, where you don't have to say that you are
| "going beyond" statistics.

| mattkrause wrote:
| By "beyond statistics", I think they meant that trials need
| ongoing monitoring for things beyond the outcome variable itself:
| safety, data quality, and even the feasibility of completing the
| trial at all. Once you've got that infrastructure in place,
| interim monitoring of the outcome isn't much extra work.
|
| Some of the frequentist approaches to early stopping seem pretty
| coherent to me. Curtailment, for example, just stops collecting
| data once it can no longer change the outcome of the test. The
| frequentist emphasis on error control _for the procedure_ also
| seems like a reasonable fit for a regulatory regime where you're
| actually making decisions (no argument that it is... weirder for
| basic research).
|
| At any rate, some approaches to early stopping have sensible
| frequentist and Bayesian properties, which is nice.

| icegreentea2 wrote:
| Yes, I meant for clinical trials.
|
| Agreed that things are weirder for basic research.

| 2rsf wrote:
| This is part of the reason why A/B testing might fail in
| practice: it is not just about collecting an arbitrary number of
| responses about A and B.
|
| When doing a test you should plan it in advance based on other
| data, execute it as planned, and analyze the results accordingly;
| failing at any of these steps leads to wrong, or totally wrong,
| conclusions.

| activatedgeek wrote:
| > When doing a test you should plan it in advance based on other
| data
|
| I think this is not entirely true. In areas where experimentation
| feedback is fast (and experiments are probably cheaper to run),
| the problem is solved much more accurately and sample-efficiently
| by Thompson sampling [1,2], which in fact requires a broad enough
| prior over the solution space and then lets the posterior dictate
| your conclusions.
|
| [1]: https://www.microsoft.com/en-us/research/wp-content/uploads/...
| [2]: http://www.economics.uci.edu/~ivan/asmb.874.pdf

| mumblemumble wrote:
| A concrete example:
|
| For starters, consider this xkcd comic: https://xkcd.com/882/
|
| If you're doing A/B testing where you continuously watch the
| results and conclude your test as soon as you see the numbers
| significantly deviate, what you're doing is more or less
| equivalent to what happens in the comic, only it's the number of
| jelly beans that you vary instead of their color. Not exactly the
| same, since in the sample-size case your successive looks at the
| data aren't independent, but still.
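A small simulation of the "peeking" problem described above,
assuming NumPy and SciPy; the batch size, number of looks, and
thresholds are arbitrary choices for illustration:

    import numpy as np
    from scipy import stats

    # Under the null (A and B are identical), stopping the first
    # time p < 0.05 while checking after every batch inflates the
    # false positive rate far above the nominal 5%.
    rng = np.random.default_rng(0)
    n_sims, batch, n_batches = 2000, 50, 40
    false_positives = 0
    for _ in range(n_sims):
        a = rng.normal(size=batch * n_batches)
        b = rng.normal(size=batch * n_batches)
        for n in range(batch, batch * n_batches + 1, batch):
            _, p = stats.ttest_ind(a[:n], b[:n])
            if p < 0.05:  # looks "significant" -- stop and celebrate
                false_positives += 1
                break
    print(false_positives / n_sims)  # well above 0.05

Fixing the sample size in advance, or using a sequential design
whose error rates account for the repeated looks, restores the
nominal rate.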
| ereinertsen wrote:
| Estimating a minimum required sample size is one of the most
| common questions asked by clinical or biomedical collaborators
| before embarking on a research project, especially when ML is an
| option. This paper provides rules of thumb and a digestible
| amount of theory that could inform such conversations, and it
| will surely become a popular reference.
|
| Note that intuition from traditional statistics does not
| universally apply to deep learning and/or extremely
| high-dimensional data. For example, deep neural networks with 1-4
| orders of magnitude more parameters than training examples can
| still generalize well to unseen data.

| andersource wrote:
| The link is not to the paper itself but to a response to the
| paper. This might be (is probably?) intentional, but the title is
| a bit confusing.
|
| Essentially, the response, submitted by several ML / CS / math
| researchers, addresses a note in the paper that recommends
| against a train/test split in model training, calling it
| "inefficient". The response is dedicated to explaining what
| generalization error is, why estimating it is important, and how
| that is basically impossible without some sort of train/test
| split or cross-validation (see the sketch at the end of the
| page).

| mumblemumble wrote:
| The article is there, if you click on the "Article" tab.
|
| Perhaps, if any staff are paying attention on the holiday, we
| could get the "/rr" chopped off the end of the link, so that we
| get the main article instead?

| andersource wrote:
| My guess is that OP intended to submit the response; I was really
| surprised to see the note in the article and found the response
| interesting (even the fact of its existence). But that's just my
| take :)

| mumblemumble wrote:
| I suppose, but then you'd expect a different title on the HN
| submission, since the response has barely anything to do with
| sample sizes.
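A minimal sketch of the train/test-vs-cross-validation point
referenced above, assuming scikit-learn; the synthetic dataset and
the logistic model are illustrative stand-ins for a real clinical
prediction problem:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for a clinical dataset.
    X, y = make_classification(n_samples=300, n_features=10,
                               random_state=0)
    model = LogisticRegression(max_iter=1000)

    # Accuracy on the training data rewards overfitting; k-fold
    # cross-validation scores each fold on data the model never
    # saw, which is what makes it an estimate of generalization.
    train_acc = model.fit(X, y).score(X, y)
    cv_acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"train accuracy:     {train_acc:.3f}")
    print(f"5-fold CV accuracy: {cv_acc:.3f}")

The gap between the two numbers is exactly what a model evaluated
only on its own training data cannot reveal.
___________________________________________________________________
(page generated 2020-09-07 23:00 UTC)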