[HN Gopher] Chemical space is really big (2014)
       ___________________________________________________________________
        
       Chemical space is really big (2014)
        
       Author : optimalsolver
       Score  : 85 points
       Date   : 2021-06-25 18:26 UTC (4 hours ago)
        
 (HTM) web link (www.chemistryworld.com)
 (TXT) w3m dump (www.chemistryworld.com)
        
       | xwdv wrote:
       | What kind of chemicals are we currently searching for in the
       | chemical space?
        
         | d_silin wrote:
         | Catalysts!
        
           | optimalsolver wrote:
           | https://opencatalystproject.org/
        
         | hypertele-Xii wrote:
         | Cure for cancer? Better batteries. Materials for fusion
         | reactors. Replacement for plastic. All sorts of things.
        
         | BeFlatXIII wrote:
         | A better LSD
        
       | captainmuon wrote:
       | This reminds me of something I always wanted to ask. Are the
       | majority of molecules in the body "named" or "purposeful"
       | molecules, like haemoglobin, vitamins, water, lipids, DNA, etc.,
       | or is there a lot of random stuff, where just some atoms are
       | arranged arbitrarily? Ignoring for a second the trivial thing that
       | you can make really long polymers, you have mutations in DNA and
       | so on - I would count those in the first case. What is the
       | ratio of "encyclopaedic" molecules (discovered or not) to "random
       | stuff" (useless or not)?
        
         | correcthorse123 wrote:
         | I couldn't give you a ratio but I'd think it's quite a high
         | ratio. There probably aren't many molecules that don't have
         | either a chemical (i.e. have some function in a pathway) or
         | physicochemical influence.
        
           | Judgmentality wrote:
           | Not saying you're wrong, but why do you think that? I'd have
           | thought the same thing about DNA, but I keep being told most
           | DNA is junk (although I wouldn't be surprised to find out
           | later we just don't know what it's for).
        
         | malux85 wrote:
         | http://biochemical-pathways.com/#/map/1
        
         | frisco wrote:
         | Everything has a name, but it is generally a "systematic"
         | name[1] rather than a one-off descriptive name. Even DNA is a
         | systematic name for the monomer (de-oxy-ribose-nucleic-acid is
         | one of the defined nucleic acids bound to a ribose sugar
         | missing an oxygen at the 2-position carbon).
         | 
         | Biology uses an enormous space of small molecule structures (to
         | say nothing of proteins, which have their own naming schemes)
         | and few have names you might recognize generally, but all have
         | useful systematic names that biologists and chemists can
         | quickly parse.
         | 
         | As a twist, most systematic naming schemes don't produce unique
         | labels, so there are often multiple ways to say the same thing,
         | and different discipline subcultures have different biases in
         | this regard.
         | 
         | Edit: re-reading OP, another interpretation is that they're
         | asking what percent of molecules in the body aren't involved in
         | biology. The answer to that is probably something that
         | approximates 0%. At the end of the day, the combined
         | interaction of all of this chemistry is what biology _is_, and
         | everything is more or less everywhere. (...concentration is
         | everything.)
         | 
         | [1]
         | https://en.m.wikipedia.org/wiki/Systematic_name#In_chemistry
        
       | adt2bt wrote:
       | I wonder how effective AI will be at enabling us to navigate
       | chemical space for certain desired compounds. Knowing nothing
       | about the problem, is it something akin to the protein folding
       | challenge that AlphaFold[0] recently did well at?
       | 
       | Side note: I love Derek Lowe's writings. I don't know what it is,
       | but every time I see a chemistry related link bubble up in HN, I
       | have a gut feeling it was written by him. And I'm usually
       | impressed. His Things I Won't Work With[1] series is amazingly
       | well written.
       | 
       | [0] https://deepmind.com/blog/article/alphafold-a-solution-
       | to-a-...
       | 
       | [1]
       | https://blogs.sciencemag.org/pipeline/search/Things+I+Wont+W...
        
         | krab wrote:
         | I think that the biggest issue isn't the chemical space but the
         | complexity of biological systems. It's hard to tell what the
         | molecule will do. We just don't have a good enough simulator.
         | AlphaFold is definitely a helpful step but more are needed in
         | the same direction.
         | 
         | (In 2010 - 2012 I worked in a laboratory that did small
         | compounds screening and I was building some tools to explore
         | the chemical space)
        
         | timr wrote:
         | Derek is right about the vastness of chemical space, but I go
         | back and forth on the claim (frequently made by those in drug
         | discovery) that AI cannot possibly extrapolate to spaces of
         | this size, for at least three reasons:
         | 
         | * Image space and text space are also vast, and yet we've had
         | good success applying AI in these areas. I have yet to see a
         | convincing argument that these spaces aren't equally large.
         | 
         | * It's a bit of a red herring: actual drug discovery programs
         | are not exploring "all of chemical space". They're usually
         | focused on "lead series" of much more constrained molecules.
         | 
         | * There are actually context-independent signals that can be
         | used to generalize AI methods. The generalization is far from
         | perfect, but it's not like every one of those 10^60 molecules
         | is entirely different from every other molecule in the set.
         | There are clusters and patterns and trends that can be
         | exploited for gain -- this is what makes "medicinal chemistry"
         | an academic field, and not merely an exercise in fortune-
         | telling.
         | 
         | Personally, I think the bigger problem applying AI and ML to
         | drug discovery is less the "vastness of chemical space" (a
         | proposition that makes med-chemists feel secure about their
         | jobs), and more that the datasets in drug discovery _suck_.
         | There's tons of siloing of data, none of it is consistent, and
         | you can't even depend on two assays for the same target,
         | measured in the same lab, years apart, yielding consistent
         | data. It's a total mess.
        
           | dnautics wrote:
           | So text space is trivially vectorizable at the character
           | level, and even difficult languages like Russian are
           | chunk-vectorizable with some care. How do you encode the
           | difference between houamine A and atrop-houamine A, while
           | keeping the similarities, without resorting to empirical
           | measurements and classification, which could yield reasonable
           | vectors, but will take 2-5 years of a highly trained grad
           | student's labor to obtain and put into the training corpus?
        
             | timr wrote:
             | > How do you encode the difference between houamine A and
             | atrop-houamine A,
             | 
             | There are now lots of ways of encoding molecules. So many,
             | in fact, that it's not really worth debating the merits of
             | any particular method.
             | 
             | ECFP fingerprints shoved into a fully connected NN work
             | surprisingly well for a large class of problems. Molecular
             | graph convolutions (of which there are now many flavors)
             | also work well. The field is to the point where people are
             | doing ensembles of different encodings, and seeing what
             | works for any particular problem.
             | 
             | > without resorting to empirical measurements and
             | classification, which could yield reasonable vectors, but
             | will take 2-5 years of a highly trained grad student's
             | labor to obtain and put into the training corpus
             | 
             | Well, you're sort of touching on my last paragraph with
             | this. The classifier, featurization, etc., usually matters
             | less (a lot less?) than the quality of the assay data. So I
             | agree in that respect.
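
The "fingerprint shoved into a fully connected NN" idea above depends on turning each molecule into a fixed-length bit vector. The sketch below is a toy stand-in, not ECFP: real ECFP fingerprints are derived from atom neighborhoods in the molecular graph by a cheminformatics toolkit, whereas this version only hashes substrings of a SMILES string. The names `toy_fingerprint`, `to_vector`, and `tanimoto`, along with the 1024-bit length, are assumptions chosen for illustration.

```python
import hashlib


def toy_fingerprint(smiles, n_bits=1024, max_ngram=3):
    """Hash every substring of length 1..max_ngram to an 'on' bit position.

    Toy illustration only: it works on the text of a SMILES string,
    not on the molecular graph as ECFP does.
    """
    bits = set()
    for size in range(1, max_ngram + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            digest = hashlib.sha256(fragment.encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:4], "big") % n_bits)
    return bits


def to_vector(bits, n_bits=1024):
    """Expand the set of 'on' bits into a fixed-length 0/1 list (a model input)."""
    return [1 if i in bits else 0 for i in range(n_bits)]


def tanimoto(bits_a, bits_b):
    """Shared 'on' bits divided by all 'on' bits across both fingerprints."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 1.0


ethanol = toy_fingerprint("CCO")
ethylamine = toy_fingerprint("CCN")
print(len(to_vector(ethanol)), tanimoto(ethanol, ethylamine))
```

A text-based hash like this sees only what the string spells out, so it says nothing by itself about the atropisomer question raised earlier in the thread.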
        
       | jpollock wrote:
       | What makes 1 billion rows a large search space?
       | 
       | What makes 150 billion rows incomprehensibly large?
       | 
       | With a molecular weight of 500 we're talking something on the
       | order of terabytes of data (for 1b molecules)?
       | 
       | It certainly sounds like a tractable amount of data.
       | 
       | What computer problems are stopping us from generating a
       | compound, computationally testing it for stability, and adding it
       | to the list and then searching?
        
         | whatshisface wrote:
         | I have learned over time that state space size has nothing to
         | do with problem difficulty. Sorting finds an answer in a space
         | with n! possibilities in n log(n) time.
         | 
         | Chemistry is difficult not because of the large number of
         | chemicals but because there hasn't been a lot of structure
         | discovered in them to allow the sort of compounding subcase
         | solving that makes searching a sorted list tractable. The
         | structure that has been discovered can be found in chemistry
         | textbooks and has names like "so-and-so's rule" which can be
         | applied to boroalkanes with between 5 and 12 vertices excepting
         | 6, unless the cage is charged in which case you should treat it
         | like it has one fewer vertex, unless the charge is -2 and the
         | original vertex count is between 8 and 13, in which case...
         | 
         | Those rules are much better than a table as measured by
         | information compression but you can't discover them unless you
         | start with the table mostly filled out already.
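
The sorting comparison above can be made concrete with a few lines of arithmetic. This is a minimal sketch using only the standard library. The function name `search_space_vs_work` and the list sizes are arbitrary choices for illustration, and n*log2(n) is used as a rough stand-in for the comparison count of a good sort.

```python
import math


def search_space_vs_work(n):
    """Return (number of possible orderings n!, rough comparison count n*log2(n))."""
    orderings = math.factorial(n)
    comparisons = n * math.log2(n) if n > 1 else 0.0
    return orderings, comparisons


for n in (10, 20, 50):
    orderings, comparisons = search_space_vs_work(n)
    digits = len(str(orderings))
    print(f"n={n}: n! has {digits} digits, about {comparisons:.0f} comparisons")
```

The gap between the two numbers comes from structure: each comparison rules out a large share of the remaining orderings, which is the kind of leverage the comment says chemistry mostly lacks.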
        
         | rodrigosetti wrote:
         | Chemical simulation is very hard (involves solving multiple
         | NP-complete problems).
         | 
         | But that is one of the expected applications of quantum
         | computers (simulate quantum systems).
        
         | _ihaque wrote:
         | Even at 500Da or less, it's much, much larger than that.
         | 
         | You may be interested in the work of Jean-Louis Reymond's
         | group, who have done more or less exactly what you suggest:
         | https://gdb.unibe.ch/downloads/
         | 
         | GDB-11, with 11 heavy atoms, has 26.4M structures (110.9M
         | stereoisomers -- molecules aren't 2D). Going up to 13 gives you
         | 970M molecules. Going up to 17 (still mostly below 250Da) is
         | 166,400M.
         | 
         | There's a lot of space up there below 500Da.
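
The GDB counts quoted above give a feel for how fast the enumeration grows per added heavy atom. The sketch below just does that arithmetic on the three quoted figures. The 50 bytes per molecule used for the storage estimate is an assumption for illustration (roughly one short text line per structure), not a figure from the GDB project.

```python
# Structure counts as quoted in the comment above (heavy atoms -> molecules).
GDB_COUNTS = {11: 26.4e6, 13: 970e6, 17: 166.4e9}

# Assumption for illustration only: about 50 bytes to store one molecule as text.
BYTES_PER_MOLECULE = 50


def growth_per_atom(n_small, n_large, counts=GDB_COUNTS):
    """Geometric-mean growth factor per added heavy atom between two sizes."""
    ratio = counts[n_large] / counts[n_small]
    return ratio ** (1.0 / (n_large - n_small))


def storage_terabytes(n_molecules, bytes_each=BYTES_PER_MOLECULE):
    """Rough storage estimate in terabytes under the bytes-per-molecule assumption."""
    return n_molecules * bytes_each / 1e12


print(f"11 -> 13 atoms: x{growth_per_atom(11, 13):.2f} per atom")
print(f"13 -> 17 atoms: x{growth_per_atom(13, 17):.2f} per atom")
print(f"17-atom set: ~{storage_terabytes(GDB_COUNTS[17]):.1f} TB")
```

Storing the 17-atom set is manageable under this assumption; the difficulty is that the count keeps multiplying with every additional atom on the way to 500 Da.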
        
           | sseagull wrote:
           | Also note that that only includes organic molecules. There are
           | another 85-ish natural elements in the periodic table that
           | could be important, but are much harder to synthesize or
           | compute.
           | 
           | Although including heavier elements can blow past 500 Da
           | pretty easily.
        
             | whatshisface wrote:
             | The heavier elements start acting more like continuous
             | systems and less like quanta legos, as you get more and
             | more states per eV. Transition metals and lanthanides don't
             | get their own combinatoric explosion until literal sticks
             | are stuck on the smooth balls in coordination chemistry.
        
       | Throw6away wrote:
       | "There are, I think, two reactions to this. One is despair, of
       | course, which is always an option in research, but not a very
       | useful one."
       | 
       | These are words to live by.
        
       | j-wags wrote:
       | If you'd like to check out what current chemical database files
       | look like, "Enamine REAL" is a fairly widely-known one.[1] My
       | understanding is that this file is a mix of their ACTUAL in-stock
       | inventory, as well as the product of running a small number of
       | high-reliability reactions on each compound in that inventory. So
       | it serves as a "vendor catalog" file, where everything in here
       | can be ordered from Enamine and synthesized+delivered to your
       | door in a few weeks.
       | 
       | Another approach I've heard of for iterating through every
       | molecule in a large region of chemical space is to START with a
       | large molecule dataset, then for each molecule, predict the
       | result of performing simple reactions on it. For each reaction
       | product, do your full analysis, and only store the result if the
       | analysis indicates it is noteworthy. This, in effect, lets you
       | scan over a larger region of chemical space than you can fit in
       | memory.
       | 
       | [1] https://enamine.net/compound-collections/real-
       | compounds/real...
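
The second approach described above (expand each known molecule with simple reactions, analyze each product, keep only the noteworthy ones) maps naturally onto a generator pipeline, so the full product set never has to sit in memory. In the sketch below everything is hypothetical: strings stand in for molecules, string edits stand in for reactions, and `toy_score` stands in for the "full analysis". None of it is real chemistry or any vendor's actual workflow.

```python
def enumerate_products(seeds, reactions):
    """Lazily yield one candidate product per (seed, reaction) pair."""
    for molecule in seeds:
        for react in reactions:
            product = react(molecule)
            if product is not None:  # a reaction may not apply to a molecule
                yield product


def scan(seeds, reactions, score, threshold):
    """Score each product as it is generated; keep only those at or above threshold."""
    hits = []
    for product in enumerate_products(seeds, reactions):
        value = score(product)
        if value >= threshold:
            hits.append((product, value))
    return hits


# Toy stand-ins: strings play the role of molecules, string edits play the
# role of reactions, and the score just counts nitrogen characters.
def add_methyl(molecule):
    return molecule + "C"


def add_amine(molecule):
    return molecule + "N"


def toy_score(molecule):
    return molecule.count("N")


print(scan(["CCO", "CCC"], [add_methyl, add_amine], toy_score, threshold=1))
```

Because `enumerate_products` is a generator, memory use is bounded by the number of hits kept rather than by the number of products visited.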
        
       | jamestimmins wrote:
       | Can anyone give an ELI5 for what the limitation is in terms of
       | processing these computationally? Is the challenge that it's
       | difficult to model how a molecule will interact with another
       | molecule, so you have to do it with atoms and test the
       | interaction across every other molecule in the search space?
       | 
       | For context I got a B- in high school chemistry and haven't
       | looked back.
       | 
       | *Edit: "do it with atoms" is confusing in this context. I mean do
       | it in the real world outside of bits.
        
         | whatshisface wrote:
         | Molecules obey known laws of physics and can in principle be
         | simulated exactly. That is not practical with present-day
         | computers because it's quantum-mechanical and has an
         | exponentially large state space. Heuristic and approximate
         | methods are used to pare this down, sacrificing absolute truth,
         | leading to results that are not very reliable. That is why
         | experiments are still done in chemistry labs even though
         | everything that happens in a chemistry lab has been
         | "understood" since the 1930s.
         | 
         | Chemists focus in on the least simulatable problems because
         | most interesting chemistry happens right on the border of not
         | happening at all. Molecules that are very easy to calculate are
         | ones that small energy errors don't matter for. That makes them
         | either incredibly stable or incredibly unstable, but chemistry
         | happens near the boundary.
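
The "exponentially large state space" mentioned above is easy to put numbers on. The sketch below assumes the crudest possible picture: each site (for example an orbital) is treated as a two-level system, and the full quantum state is stored as one complex amplitude per basis state at 16 bytes each. Real electronic-structure methods avoid storing this vector, which is exactly where the approximations come in. The function name `state_vector_bytes` is a choice made for this illustration.

```python
def state_vector_bytes(n_sites, bytes_per_amplitude=16):
    """Bytes needed to store every complex amplitude of n two-level sites."""
    return (2 ** n_sites) * bytes_per_amplitude


for n_sites in (10, 30, 50, 100):
    print(f"{n_sites} sites: {state_vector_bytes(n_sites):.3e} bytes")
```

Under these assumptions, 30 sites already need 16 GiB, and every additional site doubles the requirement.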
        
       | ChrisArchitect wrote:
       | Anything new on this since 2014?
        
       | euske wrote:
       | I also want to add that the programming space, or software
       | specification space, is also mindbogglingly big, if not bigger
       | than the chemical space. People should know how many
       | possibilities of small details exist for implementing a teeny
       | trivial feature, because that's the way it is. Everything around
       | us has a billion-gazillion parameter space, and all we're seeing
       | is just a chance occurrence.
        
       | Severian wrote:
       | "I mean, you may think it's a long way down the road to the
       | chemist's, but that's just peanuts to space."
        
       | Y_Y wrote:
       | There's a couple of xkcd comics relevant to this. Anyway the
       | space of possible compounds is mind-bogglingly huge and that's
       | impressive. At the same time it's countable, and as countable
       | things go, it's not even so big. The kind of hugeness that keeps
       | me up at night is the "long line" or the phase space of the
       | cosmic fluid.
        
         | carl_dr wrote:
         | Genuine question: what are the "long line" or the phase space
         | of the cosmic fluid?
         | 
         | I found the Wikipedia page
         | https://en.m.wikipedia.org/wiki/Long_line_(topology) but like a
         | lot of such pages, they are opaque unless you are familiar with
         | the topic. Consequently, I have no idea if this is the long
         | line you are referring to.
         | 
         | Oh, and which xkcd comics?
        
       | gpcr1949 wrote:
       | It's important to note that although chemical space is quite
       | large, most of this space is not easy to synthesize and also is
       | not chemically feasible, stable or desirable. Another interesting
       | "small" subset of chemical space is ZINC [0] which is a database
       | of about a billion commercially offered compounds, meaning that
       | manufacturers at a minimum think they can easily make them (and
       | effectively the fulfilment is quite high when random compounds
       | are ordered, e.g. 95% in this paper where they did molecular
       | docking simulations on the entirety of this database to find new
       | melatonin receptor modulators [1]). Concerning exploration of
       | chemical space, one area that might be of interest here is the
       | quite effective smooth(ish) movement through structure-property
       | space using VAEs.[2]
       | 
       | [0] https://zinc.docking.org/ [1] "Virtual discovery of melatonin
       | receptor ligands to modulate circadian rhythms"
       | https://www.nature.com/articles/s41586-020-2027-0.pdf [2]
       | "Automatic Chemical Design Using a Data-Driven Continuous
       | Representation of Molecules",
       | https://arxiv.org/pdf/1610.02415.pdf
        
         | jhirshman wrote:
         | We've been working on these types of chemical search
         | optimization problems across a variety of industries, and I'd
         | like to echo this comment. Despite the fact that most of the
         | space is unexplored, the act of exploring it for the sake of
         | exploring it is often unwise. A vast, vast majority of the time
         | a naive or even statistically driven search will fail if the
         | goal is to find something "new." The reality is that the path
         | to a truly new innovative chemical is hard to anticipate and
         | even harder to optimize for, plus the curse of dimensionality
         | means that our intuition for how hard that search really is is
         | hopelessly misguided.
         | 
         | If you're interested in related problems, my company,
         | Uncountable, is looking for software engineers.
         | https://www.uncountable.com/careers. We emphasize that the most
         | important thing for organizations to do today is structure
         | their data. It's the best chance to take specialized internal
         | knowledge and put it to use to find new chemicals.
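
One standard way to see why intuition fails under the curse of dimensionality mentioned above is to ask what fraction of a unit hypercube lies inside the ball inscribed in it. This is a generic textbook illustration, not the commenter's method, and `inscribed_ball_fraction` is a name chosen for this sketch.

```python
import math


def inscribed_ball_fraction(dim):
    """Fraction of the unit hypercube's volume inside its inscribed ball."""
    radius = 0.5
    # Volume of a dim-dimensional ball: pi^(dim/2) / Gamma(dim/2 + 1) * r^dim
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim


for dim in (2, 3, 10, 50):
    print(f"dim={dim}: {inscribed_ball_fraction(dim):.3e}")
```

In two dimensions the ball covers most of the square, but by ten dimensions it holds well under one percent of the cube, so nearly all of the volume sits in the corners, far from any central reference point.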
        
       ___________________________________________________________________
       (page generated 2021-06-25 23:00 UTC)