[HN Gopher] Making the collective knowledge of chemistry open an...
       ___________________________________________________________________
        
       Making the collective knowledge of chemistry open and machine
       actionable
        
       Author : bryanrasmussen
       Score  : 81 points
       Date   : 2022-06-14 20:53 UTC (3 days ago)
        
 (HTM) web link (www.nature.com)
 (TXT) w3m dump (www.nature.com)
        
       | pfisherman wrote:
       | Good luck with that...lol. The ontological / informatics space
       | for chemicals is a mess.
       | 
       | To make the collective knowledge of chemistry open and available,
       | you need to represent, organize, and index it. This problem is
       | not as sexy, but it is orders of magnitude more important.
        
         | convolvatron wrote:
         | this is a huge problem. arguably one of the primary technical
         | reasons that 'web 2.0' was such a dud.
        
       | gtmitchell wrote:
       | Chemist here. Every few years, someone has the novel idea that we
       | should have open data for all chemistry laboratories, so then we
       | can do Better Science. And like every other proposal I've seen,
       | this one will get approximately zero traction because it doesn't
       | address any of the core issues behind why laboratory data is
       | currently closed.
       | 
       | I try not to be too pessimistic about it, because it really would
       | be great if there were more open chemical data. I just really
       | doubt anything could accomplish that without remaking the US
       | university research system from top to bottom.
        
         | bjelkeman-again wrote:
         | What are the core issues?
        
           | mint2 wrote:
           | Probably dealing with enough metadata to capture things like
           | "the reaction only works because the supplier of one of the
           | reagents used by that lab had ppm-level copper impurities".
        
           | gtmitchell wrote:
           | Off the top of my head:
           | 
           | -Academic researchers are already overworked, underpaid, and
           | undertrained. Asking them to spend even more of their time to
           | meticulously upload all their notes and data to an electronic
           | notebook is going to be an uphill battle.
           | 
           | -Academic scientists live or die by their ability to publish.
           | Open data, especially if you're sharing in real time, makes
           | you vulnerable to being scooped by competing researchers.
           | Even disclosures of data after the fact make it easier for
           | others to benefit from work you did with no benefit to the
           | ones who collected the data. Given how cut-throat academia
           | is, you're also not going to get many researchers on board
           | with this idea.
           | 
           | -Interoperability of most laboratory software is poor. People
           | have been trying to get laboratory instrument manufacturers
           | to support open data standards for years with little success.
           | They don't have any financial incentive to allow competitors
           | to have easy access to their data.
        
             | Hellbanevil wrote:
             | If I were in charge of awarding federal grants, I would
             | require recipients to open-source the data and upload
             | everything in an orderly manner.
             | 
             | It would simply be: if you want this money, do the above.
        
             | JPLeRouzic wrote:
             | > _Open data, especially if you're sharing in real time,
             | makes you vulnerable to being scooped by competing
             | researchers._
             | 
             | Why didn't something like standards and patents emerge in
             | the scientific world?
        
               | airstrike wrote:
               | No economic incentive
        
               | barry-cotter wrote:
               | The scientific world rewards people in glory and honor
               | much more than money. If you want more money go
               | corporate. If you want to reward people more with money
               | then they'll pay less attention to the glory but that's
               | really expensive.
        
         | BenoitP wrote:
         | There are initiatives in the EU to require -by law- that if
         | it's public research, then it must be released to the public.
         | And there are official guidelines on how to do so:
         | 
         | https://hal.archives-ouvertes.fr/hal-03318932
         | 
         | I believe such an initiative for chemistry could very well
         | succeed, even if it takes 10 years.
         | 
         | Hopefully this can percolate to other countries and continents
         | too, through EU's normative power.
        
           | elcritch wrote:
           | That could be very valuable. In many ways it's like material
           | science and parts of chemistry are skimping along on the
           | fumes of basic science done in the 1950's up to the 70's at
           | national labs. Good experimentalists made solid careers doing
           | core research without chasing endless grants or the latest
           | fads. Seems pretty much all publicly available chemical and
           | material databases come from that era. Some specialty areas
           | have progressed way beyond that but it's rarely
           | systematically collected, unless you're willing and able to
           | pay lots of money for private databases. Those private
           | databases of course largely build from publicly funded
           | research.
           | 
           | I hope this pans out.
        
       | cellis wrote:
       | Can someone with more knowledge of Chemistry enlighten me why
       | chemistry experimentation isn't the killer app for the Metaverse,
       | at least for low-order reactions? I know that e.g. the protein
       | folding class of problems is prohibitively computationally
       | expensive, but surely there's some low-hanging fruit?
        
         | photochemsyn wrote:
         | If you're talking about computational modeling of chemical
         | reactions, for example getting a computer to figure out a novel
         | low-cost synthesis route for an important molecule, well...
         | This becomes incredibly complicated very quickly. It's
         | generally more likely to get a result using the traditional
         | experimental methods, with some exceptions for very small
         | molecules perhaps.
         | 
         | The field of physical inorganic/organic chemistry is one of the
         | more difficult ones to build accurate models for. A first step
         | is to calculate the electronic structure of products,
         | reactants, possible intermediates, and this blows up fast for
         | even moderately complex molecules. A lot of work has been done
         | with simpler systems like 2 H2O -> 2 H2 + O2 but even that's
         | ridiculously complicated, as you have to model the catalyst and
         | the surrounding environment as well, and then get the kinetic
         | model right. The computational power required is on the
         | supercomputer scale, and the level of background knowledge
         | required is pretty high to even start to implement something
         | like that, for a taste see:
         | 
         | https://h2awsm.org/capabilities/dft-and-ab-initio-calculatio...
         | 
         | This is an area where quantum computers may have applications
         | (2021):
         | 
         | https://www.energy.gov/science/ascr/articles/quantum-computi...
        
       | ur-whale wrote:
       | This kind of endeavor should be a common theme to all science,
       | not just chemistry.
        
         | shpongled wrote:
         | It's certainly a goal to work towards. However, it's pretty
         | difficult to build One ELN to Rule Them All given how flexible
         | many kinds of biological experimental designs are - especially
         | when you're working on the bleeding edge.
         | 
         | A good first step is to require that supplemental materials be
         | published in a machine-readable format (e.g. not manually
         | thrown-together Excel files that lack any kind of normalization
         | or rational schema).
        
           | ur-whale wrote:
           | But then there are things like GPT-3, which means stashing
           | everything in a rigid schema isn't as hard-core of a
           | requirement as it used to be.
           | 
           | OTOH, facilitating:
           | 
           |   1. access to the raw data
           |   2. access to the metadata
           |   3. access to the source code of whatever software was used /
           |      created to run the experiment
           |   4. making sure everything is computer readable (i.e. not a
           |      256x128 graph as a PNG embedded in a bloody PDF)
           | 
           | should be a requirement for any scientific publication worth
           | its salt.
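
A minimal sketch of what a bundle covering the four points above might look like, assuming a plain JSON record. The field names, the instrument, the repository URL, and the numbers are all made up for illustration; none of them come from the article or from any existing standard.

```python
import csv
import io
import json

# Hypothetical record layout: field names and values are illustrative only.
record = {
    "raw_data": {
        "columns": ["time_s", "absorbance"],
        "rows": [[0, 0.012], [10, 0.034], [20, 0.071]],
    },
    "metadata": {
        "instrument": "example UV-Vis spectrometer",
        "units": {"time_s": "second", "absorbance": "dimensionless"},
    },
    "source_code": {
        "repository": "https://example.org/lab/analysis",  # placeholder URL
        "revision": "placeholder-revision",
    },
}

REQUIRED_SECTIONS = ("raw_data", "metadata", "source_code")


def missing_sections(rec):
    """Return the required top-level sections that the record lacks."""
    return [name for name in REQUIRED_SECTIONS if name not in rec]


def raw_data_as_csv(raw):
    """Render the raw data as CSV text instead of a picture of a graph."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(raw["columns"])
    writer.writerows(raw["rows"])
    return buf.getvalue()


# Round-trip through JSON to show that the record is computer readable.
serialized = json.dumps(record, indent=2, sort_keys=True)
restored = json.loads(serialized)
```

Something like `missing_sections(restored)` could then run as a submission check, and `raw_data_as_csv(restored["raw_data"])` would give reviewers the numbers behind a figure rather than the figure alone.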
        
           | abraxaz wrote:
           | > it's pretty difficult to build One ELN to Rule Them All
           | given how flexible many kinds of biological experimental
           | designs are - especially when you're working on the bleeding
           | edge.
           | 
           | RDF is quite flexible and using a combination of domain
           | specific ontologies like cheminf[1] and other top level
           | ontologies like BFO[2] should allow you to capture most of
           | the semantics.
           | 
           | [1]: https://www.ebi.ac.uk/ols/ontologies/cheminf
           | 
           | [2]: https://en.wikipedia.org/wiki/Basic_Formal_Ontology?wprov=sf...
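
To make the RDF suggestion concrete, here is a rough sketch that writes a few triples about one experiment in N-Triples syntax using only the Python standard library. Every IRI under `https://example.org/lab/` is a hypothetical placeholder, as are the experiment and its yield value. A real record would use terms from ontologies such as the ones linked above, which are not reproduced here.

```python
# Hypothetical namespace: a stand-in for real ontology terms.
EX = "https://example.org/lab/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value, datatype=None):
    """Format a literal, escaping backslashes, quotes, and newlines."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    text = f'"{escaped}"'
    return f"{text}^^<{datatype}>" if datatype else text


def ntriples(triples):
    """Serialize (subject, predicate, object) terms, one triple per line."""
    return "".join(f"{s} {p} {o} .\n" for s, p, o in triples)


experiment = iri(EX + "experiment/1")
reagent = iri(EX + "reagent/1")

# Illustrative statements only; the yield value is made up.
triples = [
    (experiment, iri(RDF_TYPE), iri(EX + "Experiment")),
    (experiment, iri(EX + "usedReagent"), reagent),
    (reagent, iri(EX + "supplierNote"), literal("ppm-level copper impurity")),
    (experiment, iri(EX + "yieldPercent"), literal(72.5, XSD_DOUBLE)),
]

document = ntriples(triples)
```

The flexibility comes from the fact that a new kind of observation is just another triple with a new predicate, so no schema migration is needed. The hard part, as the thread notes, is agreeing on which predicates to use.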
        
       | apienx wrote:
       | "Alchemists turned into chemists when they stopped keeping
       | secrets." -- Eric S. Raymond
       | 
       | Open Science (in the publishing sense) used to be fringe just a
       | decade ago. It's very much mainstream now.
       | 
       | Open Data will be a much tougher (and long-term) battle, but it's
       | inevitable.
        
       | photochemsyn wrote:
       | The notion of open-source scientific discovery is a good one, but
       | some of the suggestions here seem very unlikely to catch much
       | traction, and even if they do, problems will remain.
       | 
       | For example, say an academic chemical research group synthesizes
       | a series of novel compounds in the lab - they're not going to
       | just release the raw data on everything they did immediately. The
       | thinking might be, 'we can give this MS student this compound to
       | work out a better synthesis route for, or this PhD student can
       | try to extend the synthesis and make other compounds'.
       | 
       | A more realistic scenario mentioned in the article would be to
       | require publication of the raw data to a database as a condition
       | of publication. This is already done to some extent in journals,
       | but materials and methods sections are notorious for leaving out
       | some key factor or other, meaning repeatability is an issue and
       | other labs will generally only try to replicate the more
       | interesting results (possible new antibiotic, etc.).
       | 
       | This worked out fairly well with GenBank, the database of
       | published gene sequences, and also with the protein
       | crystallography databases, but everyone in the molecular biology
       | world knows that all sequence data is not of the same quality,
       | and so cross-referencing by the more reputable researchers and
       | reading their papers to see if their methods are transparent and
       | robust or not is still an important step. A database clogged with
       | low-quality data isn't as valuable as a more carefully curated
       | one, certainly.
       | 
       | It would be nice though, to have a database where you could look
       | up everything there is to know about something like the
       | antibiotic ciprofloxacin, including all the spectral identification
       | data, optimal reaction conditions, etc. - but this is also a
       | molecule that researchers are busy making derivatives of, likely
       | with the hopes of patenting some novel knockoff and getting
       | an exclusive license distribution deal with a major pharma corp,
       | and so they won't be releasing any data, or even publishing in a
       | timely manner (at least not until the patent application goes
       | through, and maybe not even then).
       | 
       | That leads to a controversial question: should research
       | universities and academics financed by taxpayers behave like for-
       | profit startups pitching to a VC outfit?
        
       | statuslover9000 wrote:
       | For chemical reaction prediction, see the Open Reaction Database,
       | a collaboration including the Coley lab at MIT (surprisingly not
       | cited by OP):
       | 
       | Paper: https://pubs.acs.org/doi/10.1021/jacs.1c09820
       | 
       | Docs: https://docs.open-reaction-database.org/en/latest/overview.h...
       | 
       | It's an incredible effort to collate and clean this data, and
       | even then a substantial portion of it will not be reproducible
       | due to experimental variability or outright errors.
       | 
       | For computational methods development it's extremely useful,
       | maybe even necessary, to have a substantial amount of money and
       | one's own lab space to collect new data and experimentally test
       | prospective predictions under tightly controlled conditions. The
       | historical data is certainly useful but is not a panacea.
        
         | mlinksva wrote:
         | Relatedly (and also not citing) from a couple weeks ago
         | https://news.ycombinator.com/item?id=31566200 Call for a Public
         | Open Database of All Chemical Reactions
        
       | RationPhantoms wrote:
       | It would be wonderful to see something like the Materials Project
       | (https://materialsproject.org/) but for Chemical
       | research/knowledge.
        
       | JPLeRouzic wrote:
       | Can someone in the field explain how this "machine actionable"
       | would be different from Galaxy Pipeline [0], or a Chemputer [1]?
       | 
       | [0] https://en.wikipedia.org/wiki/Galaxy_(computational_biology)
       | 
       | [1] https://www.chem.gla.ac.uk/cronin/news/cronin-group-builds-c...
        
       ___________________________________________________________________
       (page generated 2022-06-17 23:01 UTC)