[HN Gopher] Leakage and the reproducibility crisis in ML-based s...
       ___________________________________________________________________
        
       Leakage and the reproducibility crisis in ML-based science
        
       Author : randomwalker
       Score  : 37 points
       Date   : 2022-07-15 19:07 UTC (3 hours ago)
        
 (HTM) web link (reproducible.cs.princeton.edu)
 (TXT) w3m dump (reproducible.cs.princeton.edu)
        
       | a-dub wrote:
        | this is one of those weird problems that shows up at the
        | intersection of science in the public interest and a
        | market-driven system of production.
       | 
       | pure science that is publicly funded in the public interest would
       | publish all raw data along with re-runnable processing pipelines
       | that will literally reproduce the figures of interest.
       | 
       | but, the funding is often provided by governments with the aim of
       | producing commercializable new technology that can make life
       | better for society.
       | 
       | the problem is that if you do the science in the open, then it
       | can be literally picked off by large incumbents before smaller
       | inventors have a chance to try and spin up commercialization of
       | their life's work.
       | 
       | so we have this system today where science is semi-closed in
       | order to protect the inventors, but sometimes to the detriment of
       | the science itself.
        
         | adminprof wrote:
          | I think you're missing two fatal problems with this "publish
          | all raw data and code" mindset. I don't think the desire for
          | commercialization is high on the list of problems preventing
          | people from publishing data+software.
         | 
          | 1) How do you handle research in domains where the data is
          | about people, so that releasing it harms their privacy?
          | Healthcare, web activity, finances. Sure, you can try to
          | anonymize it, but anonymization is imperfect, and even fully
          | anonymized data can be joined to other data sources to re-
          | identify people; k-anonymity only works in a closed ecosystem.
          | If we live in a world where search engine companies don't
          | publish their research because of this constraint, that seems
          | worse than the current system.
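          | 
          | To make the linkage risk concrete, here's a rough sketch
          | (made-up rows; pandas just for illustration) of how a release
          | stripped of names can still be joined back to people via the
          | quasi-identifiers it keeps:
          | 
          |   import pandas as pd
          | 
          |   # "anonymized" research release: names dropped, quasi-identifiers kept
          |   released = pd.DataFrame({
          |       "zip": ["08540", "08540", "98103"],
          |       "birth_year": [1975, 1990, 1983],
          |       "diagnosis": ["diabetes", "asthma", "hypertension"],
          |   })
          | 
          |   # outside dataset (think voter roll) sharing those quasi-identifiers
          |   public = pd.DataFrame({
          |       "name": ["Alice", "Bob", "Carol"],
          |       "zip": ["08540", "08540", "98103"],
          |       "birth_year": [1975, 1990, 1983],
          |   })
          | 
          |   # a join on the quasi-identifiers re-attaches names to diagnoses
          |   joined = released.merge(public, on=["zip", "birth_year"])
          |   print(joined[["name", "diagnosis"]])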
         | 
         | 2) How does one define "re-runnable processing"? Software rots,
         | dependencies disappear, operating systems become incompatible
         | with software, permission models change. Does every researcher
         | now need a docker expert to publish? Who verifies that
         | something is re-runnable, and how are they paid for it?
        
           | nicoco wrote:
            | From my experience in the digital health sector, privacy
            | concerns are always the reason given for not sharing anything
            | valuable and/or useful to others. But it's just a convenient
            | way of hiding the 'desire for commercialisation'.
        
             | a-dub wrote:
              | this is also true, and it runs within science itself. if
              | someone spends two years collecting data that is very hard
              | to collect, and that data holds a few papers' worth of
              | insights, they're going to want to keep it private until
              | they can get those papers out themselves, lest someone else
              | come along, download their data and scoop them before they
              | see the fruits of their hard labor.
             | 
             | while it's not great for science at large, i don't blame
             | them either.
        
           | a-dub wrote:
           | > 1) How do you handle research in domains where the data is
           | about people, so that releasing it harms their privacy?
           | 
           | that's an interesting problem that i have not thought about.
           | 
            | i think this is maybe not a technical problem, but more of an
            | ethical one. under the open data approach, if you want to
           | study humans you probably would need to get express informed
           | consent that indicates that their data will be public and
           | that it could be linked back to them.
           | 
           | > 2) How does one define "re-runnable processing"? Software
           | rots, dependencies disappear, operating systems become
           | incompatible with software, permission models change. Does
           | every researcher now need a docker expert to publish? Who
           | verifies that something is re-runnable, and how are they paid
           | for it?
           | 
           | one defines it by building a specialized system for the
           | purpose of reproducible research computing. i would envision
           | this as a sort of distributed abstract virtual machine and
           | source code packaging standard where the entire environment
           | that was used to process the data is packaged and shipped
           | with the paper. the success of this system would depend on
           | the designers getting it right such that researchers
            | _wouldn't_ have to worry about weird systems-level kludges
            | like docker. as it would behave as a hermetically sealed
           | virtual machine (or cluster of virtual machines), there would
           | be no concerns about bitrot unless one needed to make changes
           | or build a new image based on an existing one.
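            | 
            | as a rough sketch, something like the manifest below is what
            | i have in mind (the field names are made up; no such standard
            | exists today):
            | 
            |   # hypothetical manifest for a hermetically sealed research image;
            |   # nothing like this exists yet, the fields below are illustrative
            |   manifest = {
            |       "image_base": "sha256:<content-address>",  # immutable root image
            |       "packages": ["numpy==1.23.1", "scipy==1.8.1"],  # fully pinned deps
            |       "inputs": [{"path": "data/raw.h5", "sha256": "<hash>"}],
            |       "entrypoint": "python pipeline.py --out figures/",  # one command -> figures
            |       "network": False,  # no outside connectivity, so no bitrot
            |   }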
           | 
           | the good news is that most data processing and simulation
           | code is pretty well suited to this sort of paradigm. often it
           | just does cpu/gpu computations and file i/o. internet
           | connectivity or outside dependencies are pretty much out of
           | scope.
           | 
           | i don't think it's hard... there just hasn't been the will or
           | financial backing to build this out right and therefore it
           | does not exist.
        
         | a-dub wrote:
         | ...also, if a technique appears in a paper, an expert on that
         | technique should be a reviewer and/or a standard rubric should
          | be applied (i think Nature and Science have gotten much more
         | rigorous about this in recent years in the wake of the
         | psychology replication crisis).
        
       | AtNightWeCode wrote:
        | There is even tech that claims to solve the train-test split
        | "under the hood". You also get surprised by how few data points
        | some of these ML people think are necessary, far off from what
        | you learn in basic statistics classes.
        | 
        | Not providing an accurate way of reproducing something claimed
        | in a paper means that the paper is invalid.
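        | 
        | For what it's worth, a sketch of the classic leakage pattern
        | that such under-the-hood handling can hide (random data,
        | scikit-learn only for illustration):
        | 
        |   import numpy as np
        |   from sklearn.model_selection import train_test_split
        |   from sklearn.preprocessing import StandardScaler
        | 
        |   rng = np.random.default_rng(0)
        |   X, y = rng.normal(size=(200, 10)), rng.integers(0, 2, size=200)
        | 
        |   # leaky: the scaler sees the test rows before the split
        |   X_all = StandardScaler().fit_transform(X)
        |   X_tr, X_te, y_tr, y_te = train_test_split(X_all, y, random_state=0)
        | 
        |   # correct: fit preprocessing on the training rows only
        |   X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        |   scaler = StandardScaler().fit(X_tr)
        |   X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)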
        
       | dekhn wrote:
        | Recently, I saw that people were tagging their input records
        | (test records in git repos) specifically so that later data
        | loaders would reject those records under appropriate conditions.
        | I forget what the tech was called, but it was interesting.
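        | 
        | Something roughly in this spirit (the tag value and record
        | format below are made up, not whatever that tool actually used):
        | 
        |   # records carrying the sentinel are dropped so test material
        |   # never ends up in a training set; the tag value is made up
        |   CANARY = "DO-NOT-TRAIN-7f3a9c"
        | 
        |   records = [
        |       {"text": "ordinary training document"},
        |       {"text": f"held-out benchmark item {CANARY}"},
        |   ]
        | 
        |   train_ready = [r for r in records if CANARY not in r["text"]]
        |   print(train_ready)  # the tagged record is rejected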
        
       | jokoon wrote:
       | Machine learning isn't really science, since it's only
       | statistical methods. It doesn't provide insight into what
       | intelligence is. It's only techniques, so it's just engineering.
        | It's brute-force hacking at best, and when it sort of works,
        | it's impossible to figure out why, because it's black boxes all
        | the way down.
        | 
        | So of course there are cool things like GPT, but it's not like
        | it's scientific progress. It doesn't really help us understand
        | how brains work, or what general intelligence really is.
        
         | randomwalker wrote:
         | It's possible you may have misunderstood the title of the post.
         | It isn't about the science of ML, or GPT-3, or brains. Rather,
         | it's about using ML as a tool to do actual science, like
         | medicine or political science or chemistry or whatnot. The
         | first sentence of the post explains this.
        
         | [deleted]
        
         | nestorD wrote:
         | Machine learning is _not_ about getting insight into what
         | intelligence is (it might do so as a byproduct but very few
         | people are using it with that goal in mind).
         | 
          | However, ML _is_ useful to science in general as long as you
          | are aware of its shortcomings and are not just trying to
          | replace something with ML without thinking about it.
         | 
          | To give you an example I worked on (to be published): I worked
          | with some physicists who use an incredibly slow and expensive
          | iterative solver to get information on particles. We introduced
          | a machine learning algorithm that predicts the end result. It
          | does _not_ replace the solver (you could not trust its results,
          | unlike those of a physics-based numerical algorithm) but, using
          | its guess as a starting point for the iterative solver, you can
          | make the overall solving process orders of magnitude faster.
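          | 
          | Stripped down to a toy Newton solve, the pattern looks roughly
          | like this (not our actual physics code; ml_guess just stands in
          | for the model's prediction):
          | 
          |   def newton(f, fprime, x0, tol=1e-10, max_iter=100):
          |       """plain Newton iteration; returns (root, iterations used)"""
          |       x = x0
          |       for i in range(max_iter):
          |           step = f(x) / fprime(x)
          |           x -= step
          |           if abs(step) < tol:
          |               return x, i + 1
          |       return x, max_iter
          | 
          |   f = lambda x: x**3 - 2.0     # toy equation, root is 2**(1/3)
          |   fp = lambda x: 3.0 * x**2
          | 
          |   _, cold_iters = newton(f, fp, x0=100.0)   # naive starting point
          |   ml_guess = 1.3                            # stand-in for the ML prediction
          |   _, warm_iters = newton(f, fp, x0=ml_guess)
          |   print(cold_iters, warm_iters)  # warm start needs far fewer iterations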
        
           | YeBanKo wrote:
            | > It does not replace the solver (you could not trust its
            | results, unlike those of a physics-based numerical algorithm)
           | 
           | And I guess the outcome variable in the train set for the ML
           | model was produced by the solver?
        
         | notrealyme123 wrote:
         | Statistics are the backbone of many natural sciences.
         | 
         | It is also valid to make scientific progress just inside of a
         | field and not in the grand scheme of things.
        
           | deelowe wrote:
            | I feel that the deterministic computing theologians are going
            | to be in for a rude awakening over time. Computing need not
            | be perfect to work, and the thing about recent advances in ML
            | is that they scale extremely well.
        
       | antipaul wrote:
       | Not a bad checklist ("model info sheet")
       | 
        | But rather than being stand-alone, it should be incorporated
        | into publications.
        | 
        | In my experience, only a minority of applied machine learning
        | papers provide even a fraction of the info requested by the
        | info sheet.
        | 
        | Meaning, you really have no proper idea how cross-validation
        | was done, what preprocessing was done, etc., in actually
        | published papers.
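        | 
        | As a sketch, this is the kind of machine-readable record I'd
        | want attached to the paper itself (field names and values are
        | illustrative, not the paper's exact checklist items):
        | 
        |   # sketch of a "model info sheet"; fields are illustrative only
        |   model_info_sheet = {
        |       "split": {"scheme": "grouped 5-fold CV", "grouping": "patient_id"},
        |       "preprocessing": ["fit scaler on training folds only"],
        |       "features": {"selection": "inside CV folds"},
        |       "leakage_checks": ["no duplicates across folds",
        |                          "no features derived from the outcome"],
        |       "metrics": {"primary": "AUROC", "ci": "bootstrap 95%"},
        |   }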
        
       ___________________________________________________________________
       (page generated 2022-07-15 23:01 UTC)