[HN Gopher] Improved protein structure prediction using potentia...
___________________________________________________________________

Improved protein structure prediction using potentials from deep learning

Author : lawrenceyan
Score  : 27 points
Date   : 2020-02-21 20:31 UTC (2 hours ago)

(HTM) web link (www.nature.com)
(TXT) w3m dump (www.nature.com)

| allovernow wrote:
| Paywalled, can anyone with access to the text expand upon the ML algorithm? Specifically, is AlphaFold based on AlphaGo's MCTS, or is the name similarity incidental?

  | asparagui wrote:
  | Not paywalled; try this link: https://rdcu.be/b0mtx
  | 
  | Just a naming similarity: the common part is a pun on a popular Google naming scheme, e.g. Alpha(bet).

  | deepnotderp wrote:
  | Literally no similarity; it's an IBM Watson-like branding move.

| lawrenceyan wrote:
| Predicting protein structure from a given DNA/RNA sequence has been a field of study for quite a while now. Two primary methodologies have been explored. One is to try to simulate the actual physical dynamics of a given system at the molecular/atomic level. At places like D.E. Shaw or with Folding@Home, you'll see approaches like these being taken, with relative success. Generally, though, purely physics-based solutions quickly run into exponentially growing simulation time scales, as well as a lack of accuracy due to a currently insufficient understanding of molecular mechanics.
| 
| The other approach has been to treat the task purely as a translation problem, ignoring simulation of the intermediary steps, and go directly from sequence to folded protein target.
| 
| With the advent of deep learning and a massive repository of data in the Protein Data Bank (PDB) and similar databases, this approach has become increasingly popular, and for protein structure competitions like CASP it has quickly become the state of the art in the field.
| DeepMind's recent breakthrough with AlphaFold in the paper above is just another solid step in the right direction.

  | dekhn wrote:
  | PDB is not a massive repository. It's a very biased, tiny dataset (~20K structures), and an enormous amount of data cleaning has to be done before you can do anything related to big-data machine learning with it.
  | 
  | What's far more important is evolutionary data -- for example, making alignments of many similar proteins and computing correlated variations across them. Those variations are often the best structural clues -- better and cheaper to obtain than protein structures.
  | 
  | I wouldn't really call DM's work a "breakthrough"; other groups were exploring similar ideas. DM executed well (they're a games company and understand the rules of the competition) and had a huge amount of compute resources, which handles a lot of the challenges of optimizing a process like this.

    | lawrenceyan wrote:
    | My summary is pretty generalized, aimed more towards a lay audience, so I'm definitely missing pieces. Co-evolutionary couplings between different protein sequences provide a very rich source of information, and are definitely very important!
    | 
    | For those of you curious about what these couplings represent, the basic idea is that, since proteins are a product of evolution, intuitively there might be a large amount of conserved structure from one protein to another. Co-evolutionary coupling is just an attempt to quantify this relationship in a rigorous statistical manner.

      | dekhn wrote:
      | I mainly don't want ML folks to suddenly think protein folding is easy because the PDB is a good training set. It's not.

      | [deleted]

        | lawrenceyan wrote:
        | Why frame things in such a pessimistic manner? It seems like it would only be a net benefit to have more people become interested in protein folding as a field of study.
        | Is there really a need for this type of gatekeeping here?

          | dekhn wrote:
          | Yes, personally I think there is. I've spent a tremendous amount of my career watching computer scientists misunderstand how to work on protein folding and waste a lot of people's time. Because the concept of protein folding is so unbelievably complex, most CS and ML folks get the basic talk -- "nearly all proteins fold reversibly to a global minimum energy structure which is completely defined by the sequence of the protein" -- which isn't remotely true (basically a weak form of Anfinsen's dogma and Levinthal's paradox). It's easy to explain, and CS and ML people get excited and go off to work on the problem. This led to a lot of publications that focused on rapidly finding heuristics that could sample enough space to find an approximation of the lowest-energy structure. These methods typically failed to make good predictions, although eventually methods like Rosetta did start making good predictions around 15-20 years ago. (Amusingly, the author of Rosetta told me: "the larger the PDB (training data set) gets, the worse the predictions we make".)
          | 
          | But people who spend a long time getting a biological education know why this is true: most proteins don't fold to their energetic minimum; they fold to a collection of kinetically accessible states, rarely finding their true minimum (some small proteins do fold quickly, and we typically can predict their structure). And many of the physical approximations that are used lead to inaccuracies (for example, some variables are constrained to specific values to save time, but making good predictions requires them to be unconstrained).
          | 
          | Some of my work made significant contributions to changing these beliefs, and I'm very thankful for the CS and ML folks who contributed to that, but all of them spent a lot of time learning about proteins before they were useful contributors.
          | 
          | Myself, I've had to "unlearn" a lot of the early things that were explained to me when I was a layperson, because when you're first learning something, if somebody gives you a simplified view, it can be really hard to move on to the more subtle and nuanced details of the field (for example, many people learn Mendelian genetics and then spend years struggling to understand why most traits don't follow Mendelian statistics).
          | 
          | My goal here is to prevent wasted time on the part of the experienced contributors in the field. I do appreciate good attempts at explaining the field to laypeople, but I want to set expectations about contributing accurately.
___________________________________________________________________
(page generated 2020-02-21 23:00 UTC)
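
The "correlated variations across alignments" idea discussed in the thread can be sketched with a toy mutual-information score between alignment columns. This is a deliberate simplification: real coupling methods (e.g. direct coupling analysis) correct for transitive and phylogenetic effects, and none of this is DeepMind's actual pipeline. The alignment and function name below are made up for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2

def column_mi(msa, i, j):
    """Mutual information (bits) between columns i and j of an alignment.

    A crude co-evolution score: columns whose residues vary together
    across sequences (often because the positions are in 3D contact)
    score higher than columns that vary independently.
    """
    n = len(msa)
    pi = Counter(seq[i] for seq in msa)             # marginal counts, column i
    pj = Counter(seq[j] for seq in msa)             # marginal counts, column j
    pij = Counter((seq[i], seq[j]) for seq in msa)  # joint counts
    return sum(
        (c / n) * log2((c / n) / ((pi[a] / n) * (pj[b] / n)))
        for (a, b), c in pij.items()
    )

# Hypothetical 4-column alignment: columns 0 and 1 co-vary perfectly
# (A always pairs with R, G always with K), while columns 0 and 3
# vary independently.
msa = ["ARND", "ARNE", "GKND", "GKNE"]

scores = {(i, j): column_mi(msa, i, j) for i, j in combinations(range(4), 2)}
# scores[(0, 1)] -> 1.0 bit; scores[(0, 3)] -> 0.0 bits
```

High-scoring column pairs are the "structural clues" dekhn mentions: on real alignments of thousands of homologs, they are used as predicted residue-residue contacts that constrain the fold.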