[HN Gopher] Chemical space is really big (2014)
___________________________________________________________________
Chemical space is really big (2014)

Author : optimalsolver
Score  : 85 points
Date   : 2021-06-25 18:26 UTC (4 hours ago)

(HTM) web link (www.chemistryworld.com)
(TXT) w3m dump (www.chemistryworld.com)

| xwdv wrote:
| What kind of chemicals are we currently searching for in the
| chemical space?
| d_silin wrote:
| Catalysts!
| optimalsolver wrote:
| https://opencatalystproject.org/
| hypertele-Xii wrote:
| Cure for cancer? Better batteries. Materials for fusion
| reactors. Replacement for plastic. All sorts of things.
| BeFlatXIII wrote:
| A better LSD
| [deleted]
| captainmuon wrote:
| This reminds me of something I always wanted to ask. Are the
| majority of molecules in the body "named" or "purposeful"
| molecules, like haemoglobin, vitamins, water, lipids, DNA, etc.,
| or is there a lot of random stuff, where just some atoms are
| arranged arbitrarily? Ignoring for a second the trivial thing that
| you can make really long polymers, you have mutations in DNA and
| so on - I would count those in the first case. What is the
| ratio of "encyclopaedic" molecules (discovered or not) to "random
| stuff" (useless or not)?
| correcthorse123 wrote:
| I couldn't give you a ratio but I'd think it's quite a high
| ratio. There probably aren't many molecules that don't have
| either a chemical (i.e. have some function in a pathway) or
| physicochemical influence.
| Judgmentality wrote:
| Not saying you're wrong, but why do you think that? I'd have
| thought the same thing about DNA, but I keep being told most
| DNA is junk (although I wouldn't be surprised to find out
| later we just don't know what it's for).
| malux85 wrote:
| http://biochemical-pathways.com/#/map/1
| frisco wrote:
| Everything has a name, but it is generally a "systematic"
| name[1] rather than a one-off descriptive name.
| Even DNA is a
| systematic name for the monomer (de-oxy-ribose-nucleic-acid is
| one of the defined nucleic acids bound to a ribose sugar
| missing an oxygen at the 2-position carbon).
|
| Biology uses an enormous space of small molecule structures (to
| say nothing of proteins, which have their own naming schemes)
| and few have names you might recognize generally, but all have
| useful systematic names that biologists and chemists can
| quickly parse.
|
| As a twist, most systematic naming schemes don't produce unique
| labels, so there are often multiple ways to say the same thing,
| and different discipline subcultures have different biases in
| this regard.
|
| Edit: re-reading OP, another interpretation is that they're
| asking what percent of molecules in the body aren't involved in
| biology. The answer to that is probably something that
| approximates 0%. At the end of the day, the combined
| interaction of all of this chemistry is what biology _is_, and
| everything is more or less everywhere. (...concentration is
| everything.)
|
| [1] https://en.m.wikipedia.org/wiki/Systematic_name#In_chemistry
| adt2bt wrote:
| I wonder how effective AI will be at enabling us to navigate
| chemical space for certain desired compounds. Knowing nothing
| about the problem, is it something akin to the protein folding
| challenge that AlphaFold[0] recently did well at?
|
| Side note: I love Derek Lowe's writings. I don't know what it is,
| but every time I see a chemistry related link bubble up in HN, I
| have a gut feeling it was written by him. And I'm usually
| impressed. His Things I Won't Work With[1] series is amazingly
| well written.
|
| [0] https://deepmind.com/blog/article/alphafold-a-solution-to-a-...
|
| [1] https://blogs.sciencemag.org/pipeline/search/Things+I+Wont+W...
| krab wrote:
| I think that the biggest issue isn't the chemical space but the
| complexity of biological systems. It's hard to tell what the
| molecule will do.
| We just don't have a good enough simulator.
| AlphaFold is definitely a helpful step but more are needed in
| the same direction.
|
| (In 2010 - 2012 I worked in a laboratory that did small
| compounds screening and I was building some tools to explore
| the chemical space)
| timr wrote:
| Derek is right about the vastness of chemical space, but I go
| back and forth on the claim (frequently made by those in drug
| discovery) that AI cannot possibly extrapolate to spaces of
| this size, for at least three reasons:
|
| * Image space and text space are also vast, and yet we've had
| good success applying AI in these areas. I have yet to see a
| convincing argument that these spaces aren't equally large.
|
| * It's a bit of a red herring: actual drug discovery programs
| are not exploring "all of chemical space". They're usually
| focused on "lead series" of much more constrained molecules.
|
| * There are actually context-independent signals that can be
| used to generalize AI methods. The generalization is far from
| perfect, but it's not like every one of those 10^60 molecules
| is entirely different from every other molecule in the set.
| There are clusters and patterns and trends that can be
| exploited for gain -- this is what makes "medicinal chemistry"
| an academic field, and not merely an exercise in
| fortune-telling.
|
| Personally, I think the bigger problem applying AI and ML to
| drug discovery is less the "vastness of chemical space" (a
| proposition that makes med-chemists feel secure about their
| jobs), and more that the datasets in drug discovery _suck_.
| There's tons of siloing of data, none of it is consistent, and
| you can't even count on two assays for the same target,
| measured in the same lab, years apart, yielding consistent
| data. It's a total mess.
| dnautics wrote:
| So text space is trivially vectorizable, at the character
| level and even for difficult languages like Russian
| chunk-vectorisable with some care.
| How do you encode the difference
| between haouamine A and atrop-haouamine A, while keeping the
| similarities, without resorting to empirical measurements and
| classification, which could yield reasonable vectors, but
| will take 2-5 years of a highly trained grad student's labor
| to obtain and put into the training corpus?
| timr wrote:
| > How do you encode the difference between haouamine A and
| atrop-haouamine A,
|
| There are now lots of ways of encoding molecules. So many,
| in fact, that it's not really worth debating the merits of
| any particular method.
|
| ECFP fingerprints shoved into a fully connected NN work
| surprisingly well for a large class of problems. Molecular
| graph convolutions (of which there are now many flavors)
| also work well. The field is to the point where people are
| doing ensembles of different encodings, and seeing what
| works for any particular problem.
|
| > without resorting to empirical measurements and
| classification, which could yield reasonable vectors, but
| will take 2-5 years of a highly trained grad student's
| labor to obtain and put into the training corpus
|
| Well, you're sort of touching on my last paragraph with
| this. The classifier, featurization, etc., usually matters
| less (a lot less?) than the quality of the assay data. So I
| agree in that respect.
| [deleted]
| jpollock wrote:
| What makes 1 billion rows a large search space?
|
| What makes 150 billion rows incomprehensibly large?
|
| With a molecular weight of 500 we're talking something on the
| order of terabytes of data (for 1b molecules)?
|
| It certainly sounds like a tractable amount of data.
|
| What computer problems are stopping us from generating a
| compound, computationally testing it for stability, and adding it
| to the list and then searching?
| whatshisface wrote:
| I have learned over time that state space size has nothing to
| do with problem difficulty. Sorting finds an answer in a space
| with n! possibilities in n log(n) time.
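The sorting point can be checked with a small sketch. This is an illustration only, not something posted in the thread: `merge_sort` and `compare_search_space` are hypothetical helper names, and merge sort is assumed here simply as one convenient O(n log n) algorithm. It counts the comparisons actually made and sets them against the size of the space of orderings, reported as log10(n!).

```python
import math
import random


def merge_sort(items, counter):
    """Return a sorted copy of items; add the comparisons made to counter[0]."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], counter)
    right = merge_sort(items[mid:], counter)
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        counter[0] += 1  # one comparison per loop iteration
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def compare_search_space(n, seed=0):
    """Sort a shuffled range(n); return (comparisons, log10 of n!)."""
    rng = random.Random(seed)
    items = list(range(n))
    rng.shuffle(items)
    counter = [0]
    result = merge_sort(items, counter)
    assert result == sorted(items)
    # lgamma(n + 1) == ln(n!), so this is log10(n!) without building a huge int
    log10_orderings = math.lgamma(n + 1) / math.log(10)
    return counter[0], log10_orderings


if __name__ == "__main__":
    for n in (8, 64, 1024):
        comparisons, log10_orderings = compare_search_space(n)
        print(f"n={n}: {comparisons} comparisons, "
              f"about 10^{log10_orderings:.0f} possible orderings")
```

The comparison count grows roughly like n log n while the number of candidate orderings grows factorially, which is the gap the comment is pointing at: structure in the problem, not the size of the space, decides tractability.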
|
| Chemistry is difficult not because of the large number of
| chemicals but because there hasn't been a lot of structure
| discovered in them to allow the sort of compounding subcase
| solving that makes searching a sorted list tractable. The
| structure that has been discovered can be found in chemistry
| textbooks and has names like "so-and-so's rule" which can be
| applied to boroalkanes with between 5 and 12 vertices excepting
| 6, unless the cage is charged in which case you should treat it
| like it has one fewer vertex, unless the charge is -2 and the
| original vertex count is between 8 and 13, in which case...
|
| Those rules are much better than a table as measured by
| information compression but you can't discover them unless you
| start with the table mostly filled out already.
| rodrigosetti wrote:
| Chemical simulation is very hard (involves solving multiple
| NP-complete problems).
|
| But that is one of the expected applications of quantum
| computers (simulate quantum systems).
| _ihaque wrote:
| Even at 500Da or less, it's much, much larger than that.
|
| You may be interested in the work of Jean-Louis Reymond's
| group, who have done more or less exactly what you suggest:
| https://gdb.unibe.ch/downloads/
|
| GDB-11, with 11 heavy atoms, has 26.4M structures (110.9M
| stereoisomers -- molecules aren't 2D). Going up to 13 gives you
| 970M molecules. Going up to 17 (still mostly below 250Da) is
| 166,400M.
|
| There's a lot of space up there below 500Da.
| sseagull wrote:
| Also note that that only includes organic molecules. There's
| another 85-ish natural elements in the periodic table that
| could be important, but are much harder to synthesize or
| compute.
|
| Although including heavier elements can blow past 500 Da
| pretty easily.
| whatshisface wrote:
| The heavier elements start acting more like continuous
| systems and less like quanta legos, as you get more and
| more states per eV.
| Transition metals and lanthanides don't
| get their own combinatoric explosion until literal sticks
| are stuck on the smooth balls in coordination chemistry.
| Throw6away wrote:
| "There are, I think, two reactions to this. One is despair, of
| course, which is always an option in research, but not a very
| useful one."
|
| These are words to live by.
| j-wags wrote:
| If you'd like to check out what current chemical database files
| look like, "Enamine REAL" is a fairly widely-known one.[1] My
| understanding is that this file is a mix of their ACTUAL in-stock
| inventory, as well as the product of running a small number of
| high-reliability reactions on each compound in that inventory. So
| it serves as a "vendor catalog" file, where everything in here
| can be ordered from Enamine and synthesized+delivered to your
| door in a few weeks.
|
| Another approach I've heard of for iterating through every
| molecule in a large region of chemical space is to START with a
| large molecule dataset, then for each molecule, predict the
| result of performing simple reactions on it. For each reaction
| product, do your full analysis, and only store the result if the
| analysis indicates it is noteworthy. This, in effect, lets you
| scan over a larger region of chemical space than you can fit in
| memory.
|
| [1] https://enamine.net/compound-collections/real-compounds/real...
| jamestimmins wrote:
| Can anyone give an ELI5 for what the limitation is in terms of
| processing these computationally? Is the challenge that it's
| difficult to model how a molecule will interact with another
| molecule, so you have to do it with atoms and test the
| interaction across every other molecule in the search space?
|
| For context I got a B- in high school chemistry and haven't
| looked back.
|
| *Edit: "do it with atoms" is confusing in this context. I mean do
| it in the real world outside of bits.
| whatshisface wrote:
| Molecules obey known laws of physics and can in principle be
| simulated exactly. That is not practical with present-day
| computers because it's quantum-mechanical and has an
| exponentially large state space. Heuristic and approximate
| methods are used to pare this down, sacrificing absolute truth,
| leading to results that are not very reliable. That is why
| experiments are still done in chemistry labs even though
| everything that happens in a chemistry lab has been
| "understood" since the 1930s.
|
| Chemists focus in on the least simulatable problems because
| most interesting chemistry happens right on the border of not
| happening at all. Molecules that are very easy to calculate are
| ones that small energy errors don't matter for. That makes them
| either incredibly stable or incredibly unstable, but chemistry
| happens near the boundary.
| [deleted]
| ChrisArchitect wrote:
| Anything new on this since 2014?
| [deleted]
| euske wrote:
| I also want to add that the programming space, or software
| specification space, is mindbogglingly big, if not bigger
| than the chemical space. People should know how many
| possibilities of small details exist for implementing a teeny
| trivial feature, because that's the way it is. Everything around
| us has a billion-gazillion parameter space, and all we're seeing
| is just a chance occurrence.
| Severian wrote:
| "I mean, you may think it's a long way down the road to the
| chemist's, but that's just peanuts to space."
| Y_Y wrote:
| There's a couple of xkcd comics relevant to this. Anyway the
| space of possible compounds is mind-bogglingly huge and that's
| impressive. At the same time it's countable, and as countable
| things go, it's not even so big. The kind of hugeness that keeps
| me up at night is the "long line" or the phase space of the
| cosmic fluid.
| carl_dr wrote:
| Genuine question: what are the "long line" or the phase space
| of the cosmic fluid?
|
| I found the Wikipedia page
| https://en.m.wikipedia.org/wiki/Long_line_(topology) but like a
| lot of such pages, it is opaque unless you are familiar with
| the topic. Consequently, I have no idea if this is the long
| line you are referring to.
|
| Oh, and which xkcd comics?
| gpcr1949 wrote:
| It's important to note that although chemical space is quite
| large, most of this space is not easy to synthesize and also is
| not chemically feasible, stable or desirable. Another interesting
| "small" subset of chemical space is ZINC [0] which is a database
| of about a billion commercially offered compounds, meaning that
| manufacturers at a minimum think they can easily make them (and
| effectively the fulfilment is quite high when random compounds
| are ordered, e.g. 95% in this paper where they did molecular
| docking simulations on the entirety of this database to find new
| melatonin receptor modulators [1]). Concerning exploration of
| chemical space, one area that might be of interest here is the
| quite effective smooth(ish) movement through structure-property
| space using VAEs.[2]
|
| [0] https://zinc.docking.org/
| [1] "Virtual discovery of melatonin receptor ligands to modulate
| circadian rhythms"
| https://www.nature.com/articles/s41586-020-2027-0.pdf
| [2] "Automatic Chemical Design Using a Data-Driven Continuous
| Representation of Molecules",
| https://arxiv.org/pdf/1610.02415.pdf
| jhirshman wrote:
| We've been working on these types of chemical search
| optimization problems across a variety of industries, and I'd
| like to echo this comment. Despite the fact that most of the
| space is unexplored, the act of exploring it for the sake of
| exploring it is often unwise. A vast, vast majority of the time
| a naive or even statistically driven search will fail if the
| goal is to find something "new."
| The reality is that the path
| to a truly new innovative chemical is hard to anticipate and
| even harder to optimize for, plus the curse of dimensionality
| means that our intuition for how hard that search really is is
| hopelessly misguided.
|
| If you're interested in related problems, my company,
| Uncountable, is looking for software engineers.
| https://www.uncountable.com/careers. We emphasize that the most
| important thing for organizations to do today is structure
| their data. It's the best chance to take specialized internal
| knowledge and put it to use to find new chemicals.
___________________________________________________________________
(page generated 2021-06-25 23:00 UTC)