[HN Gopher] Leakage and the reproducibility crisis in ML-based s... ___________________________________________________________________ Leakage and the reproducibility crisis in ML-based science Author : randomwalker Score : 37 points Date : 2022-07-15 19:07 UTC (3 hours ago) (HTM) web link (reproducible.cs.princeton.edu) (TXT) w3m dump (reproducible.cs.princeton.edu) | a-dub wrote: | this is sort of one of the weird problems that shows up at the | intersection between science in the public interest and a market-driven | system of production. | | pure science that is publicly funded in the public interest would | publish all raw data along with re-runnable processing pipelines | that will literally reproduce the figures of interest. | | but the funding is often provided by governments with the aim of | producing commercializable new technology that can make life | better for society. | | the problem is that if you do the science in the open, then it | can be literally picked off by large incumbents before smaller | inventors have a chance to try and spin up commercialization of | their life's work. | | so we have this system today where science is semi-closed in | order to protect the inventors, but sometimes to the detriment of | the science itself. | adminprof wrote: | I think you're missing two fatal problems with this "publish all | raw data and code" mindset. I don't think the desire for | commercialization is high on the list of fatal problems | preventing people from publishing data+software. | | 1) How do you handle research in domains where the data is | about people, so that releasing it harms their privacy? | Healthcare, web activity, finances. Sure, you can try to | anonymize it, but anonymization is imperfect, and even fully | anonymized data can be joined to other data sources to | de-identify people; k-anonymity only works in a closed ecosystem. 
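The k-anonymity point can be made concrete. A minimal sketch in Python, with invented records and a hypothetical `is_k_anonymous` helper (not from the thread): a dataset is k-anonymous over a set of quasi-identifiers if every combination of their values is shared by at least k rows. As the comment notes, passing this check is necessary but not sufficient once outside data sources can be joined in.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """True if every quasi-identifier combination appears in >= k rows.

    A necessary privacy condition, not a sufficient one: a k-anonymous
    release can still be de-identified by joining against outside data.
    """
    counts = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(count >= k for count in counts.values())

# Invented toy records for illustration only.
records = [
    {"zip": "08540", "age_band": "30-39", "diagnosis": "flu"},
    {"zip": "08540", "age_band": "30-39", "diagnosis": "asthma"},
    {"zip": "08544", "age_band": "40-49", "diagnosis": "flu"},
]

print(is_k_anonymous(records, ["zip", "age_band"], 2))
# False: the ("08544", "40-49") group contains only one row,
# so that person is uniquely identifiable from the quasi-identifiers.
```

Dropping or generalizing the singleton row would restore 2-anonymity, which is exactly the kind of data destruction that makes "just anonymize it" costly for reuse.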
| If we live in a world where search engine companies don't | publish their research because of this constraint, that seems | worse than the current system. | | 2) How does one define "re-runnable processing"? Software rots, | dependencies disappear, operating systems become incompatible | with software, permission models change. Does every researcher | now need a docker expert to publish? Who verifies that | something is re-runnable, and how are they paid for it? | nicoco wrote: | From my experience in the digital health sector, concern for | privacy is always the reason given for not sharing anything | valuable and/or useful to others. But it's just a convenient | way of hiding the 'desire of commercialisation'. | a-dub wrote: | this is also true, and it also runs within science itself. | if someone spends two years collecting some data that is | very hard to collect and it has a few papers' worth of | insights within it, they're going to want to keep that data | private until they can get those papers out themselves, lest | someone else come along, download their data and scoop them | before they have a chance to see the fruits of their hard | labor. | | while it's not great for science at large, i don't blame | them either. | a-dub wrote: | > 1) How do you handle research in domains where the data is | about people, so that releasing it harms their privacy? | | that's an interesting problem that i have not thought about. | | i think maybe that this is not a technical problem, but more | an ethical one. under the open data approach, if you want to | study humans you probably would need to get express informed | consent that indicates that their data will be public and | that it could be linked back to them. | | > 2) How does one define "re-runnable processing"? Software | rots, dependencies disappear, operating systems become | incompatible with software, permission models change. Does | every researcher now need a docker expert to publish? 
Who | verifies that something is re-runnable, and how are they paid | for it? | | one defines it by building a specialized system for the | purpose of reproducible research computing. i would envision | this as a sort of distributed abstract virtual machine and | source code packaging standard where the entire environment | that was used to process the data is packaged and shipped | with the paper. the success of this system would depend on | the designers getting it right such that researchers | _wouldn't_ have to worry about weird systems-level kludges | like docker. as it would behave as a hermetically sealed | virtual machine (or cluster of virtual machines), there would | be no concerns about bitrot unless one needed to make changes | or build a new image based on an existing one. | | the good news is that most data processing and simulation | code is pretty well suited to this sort of paradigm. often it | just does cpu/gpu computations and file i/o. internet | connectivity or outside dependencies are pretty much out of | scope. | | i don't think it's hard... there just hasn't been the will or | financial backing to build this out right, and therefore it | does not exist. | a-dub wrote: | ...also, if a technique appears in a paper, an expert on that | technique should be a reviewer and/or a standard rubric should | be applied (i think nature and science have gotten much more | rigorous about this in recent years in the wake of the | psychology replication crisis). | AtNightWeCode wrote: | There is even tech that claims to solve the train-test split | "under the hood". You are also surprised by the low number of | data points some of these ML people think is necessary. Far off | from what you learn in basic statistics classes. | | Not providing an accurate way to reproduce something claimed in | a paper means that the paper is invalid. 
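The kind of leakage the submission is about, and that "under the hood" train-test tooling can quietly hide, is easy to produce by fitting preprocessing on the full dataset before splitting. A toy pure-Python sketch (the numbers are invented for illustration):

```python
# Preprocessing leakage in miniature: centering features with a mean
# computed on ALL data lets information about the held-out test point
# leak into the training features.

def mean(xs):
    return sum(xs) / len(xs)

data = [1.0, 2.0, 3.0, 4.0, 100.0]   # last point is the held-out test set
train, test = data[:4], data[4:]

# Leaky: fit the centering on train + test combined.
leaky_center = mean(data)
leaky_train = [x - leaky_center for x in train]

# Correct: fit the preprocessing on the training split only.
clean_center = mean(train)
clean_train = [x - clean_center for x in train]

print(leaky_center, clean_center)  # 22.0 2.5
# The single extreme test point shifted every training feature by ~20,
# so the "training" data now encodes information about the test set.
```

The same failure mode applies to scaling, imputation, feature selection, and target encoding: anything fitted on the pooled data inflates held-out performance, which is why a paper that does not report where the split happened is hard to trust.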
| dekhn wrote: | Recently, I saw that people were tagging their input records | (test records in git repos) specifically so that later data | loaders would reject those records in appropriate conditions. I | forget what the tech was called but it was interesting. | jokoon wrote: | Machine learning isn't really science, since it's only | statistical methods. It doesn't provide insight into what | intelligence is. It's only techniques, so it's just engineering. | It's brute-force hacking at best, and when it sort of works, it's | impossible to figure out why it does because it's black boxes all | the way down. | | So of course there are cool things like GPT, but it's not like | it's scientific progress. It doesn't really help us understand how | brains work, or what general intelligence really is. | randomwalker wrote: | It's possible you may have misunderstood the title of the post. | It isn't about the science of ML, or GPT-3, or brains. Rather, | it's about using ML as a tool to do actual science, like | medicine or political science or chemistry or whatnot. The | first sentence of the post explains this. | [deleted] | nestorD wrote: | Machine learning is _not_ about getting insight into what | intelligence is (it might do so as a byproduct, but very few | people are using it with that goal in mind). | | However, ML _is_ useful to generalist science as long as you | are aware of its shortcomings and not just trying to replace | something with ML without thinking about it. | | To give you an example I worked on (to be published): I worked | with some physicists who use an incredibly slow and expensive | iterative solver to get information on particles. We | introduced a machine learning algorithm that predicts the end | result. 
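A hedged sketch of this ML-accelerated-solver pattern, using Newton's method for square roots as a stand-in for the expensive physics solver and a hardcoded `cheap_guess` standing in for the learned predictor (both names are hypothetical, not from the actual project):

```python
# Warm-starting an iterative solver with a learned guess. The solver
# still iterates to the same tolerance, so final accuracy does not
# depend on how good the guess was -- only the iteration count does.

def newton_sqrt(a, x0, tol=1e-6):
    """Iterate x <- (x + a/x) / 2 until x*x is within tol of a."""
    x, steps = x0, 0
    while abs(x * x - a) > tol:
        x = 0.5 * (x + a / x)
        steps += 1
    return x, steps

def cheap_guess(a):
    # Stand-in for the ML model: a deliberately imperfect but
    # inexpensive estimate of the answer.
    return a ** 0.45

a = 1e6
root_cold, cold_steps = newton_sqrt(a, 1.0)             # naive start
root_warm, warm_steps = newton_sqrt(a, cheap_guess(a))  # "learned" start
print(cold_steps, warm_steps)  # warm start needs far fewer iterations
```

Both runs converge to the same root; the warm start just skips most of the iterations, which is where the "orders of magnitude" in the comment comes from when each iteration is expensive.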
It does _not_ replace the solver (you could not trust | its results, unlike a physics-based numerical algorithm) | but, using its guess as a starting point for the iterative | solver, you can make the overall solving process orders of | magnitude faster. | YeBanKo wrote: | > It does not replace the solver (you could not trust its | results, unlike a physics-based numerical algorithm) | | And I guess the outcome variable in the train set for the ML | model was produced by the solver? | notrealyme123 wrote: | Statistics are the backbone of many natural sciences. | | It is also valid to make scientific progress just inside a | field and not in the grand scheme of things. | deelowe wrote: | I feel that the deterministic computing theologians are going | to be in for a rude awakening over time. Computing need not | be perfect to work, and the thing about recent advancements in | ML is that they scale extremely well. | antipaul wrote: | Not a bad checklist ("model info sheet"). | | But rather than standing alone, it should be incorporated into | publications. | | In my experience, only a minority of applied machine learning | papers provide even a minority of the info requested by the info | sheet. | | Meaning, you really have no proper idea how cross-validation was | done, what preprocessing was done, etc., in actually published | papers. ___________________________________________________________________ (page generated 2022-07-15 23:01 UTC)