[HN Gopher] Improved protein structure prediction using potentia...
       ___________________________________________________________________
        
       Improved protein structure prediction using potentials from deep
       learning
        
       Author : lawrenceyan
       Score  : 27 points
       Date   : 2020-02-21 20:31 UTC (2 hours ago)
        
 (HTM) web link (www.nature.com)
 (TXT) w3m dump (www.nature.com)
        
       | allovernow wrote:
       | Paywalled, can anyone with access to the text expand upon the ML
       | algorithm? Specifically, is AlphaFold based on AlphaGo MCTS? Or
       | is the name similarity incidental?
        
         | asparagui wrote:
         | not paywalled, try this link: https://rdcu.be/b0mtx
         | 
         | Just a naming similarity. The common part is a pun that is a
         | popular Google naming scheme, e.g. alpha-bet.
        
         | deepnotderp wrote:
         | Literally no similarity, it's an IBM Watson-like branding move.
        
       | lawrenceyan wrote:
       | Predicting protein structure from a given DNA/RNA sequence has
       | been a field of study for quite a while now. Two primary
       | methodologies have been explored. One is to simulate the actual
       | physical dynamics of a given system at a molecular/atomic level.
       | At places like D.E. Shaw or with Folding@Home, you'll see
       | approaches like these being taken, with relative success.
       | With purely physics-based solutions, though, you quickly run
       | into exponentially growing simulation time scales as well as a
       | lack of accuracy due to a currently insufficient understanding
       | of molecular mechanics.
       | 
       | The other approach has been to take the problem and look at it
       | purely as a translation problem, ignoring simulation of
       | intermediary steps, to go directly from sequence to folded
       | protein target.
       | 
       | With the advent of deep learning and a massive repository of data
       | from the existing Protein Data Bank (PDB) and similar resources,
       | this approach has become increasingly popular and, in protein
       | structure competitions like CASP, has quickly become state of
       | the art. DeepMind's recent breakthrough with AlphaFold in the
       | paper above is another solid step in the right direction.
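
A rough way to get a feel for the "exponentially growing" cost mentioned above is the back-of-the-envelope counting behind Levinthal's paradox, which comes up further down the thread. This is a sketch of conformational counting only, not a cost model for molecular dynamics. The three states per residue and the sampling rate of 1e13 conformations per second are illustrative assumptions, and `exhaustive_search_years` is a hypothetical helper name.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_search_years(n_residues, states_per_residue=3,
                            samples_per_second=1e13):
    """Years needed to visit every conformation once (assumed parameters).

    states_per_residue and samples_per_second are illustrative assumptions,
    not measured quantities.
    """
    # Each residue independently picks one of a few states, so the number of
    # conformations grows exponentially with chain length.
    conformations = states_per_residue ** n_residues
    seconds = conformations / samples_per_second
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, "residues:", exhaustive_search_years(n), "years")
```

Under these assumptions, each extra residue multiplies the search time by three, which is why brute-force enumeration is hopeless for chains of realistic length.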
        
         | dekhn wrote:
         | PDB is not a massive repository. It's a very biased, tiny
         | dataset (~20K structures) and an enormous amount of data
         | cleaning has to be done to do anything related to big-data
         | machine learning on it.
         | 
         | What's far more important is evolutionary data: for example,
         | making alignments of many similar proteins and computing
         | correlated variations across them. Those variations are often
         | the best structural clues, and they are better and cheaper to
         | obtain than protein structures.
         | 
         | I wouldn't really call DM's work a "breakthrough", other groups
         | were exploring similar ideas. DM executed well (they're a games
         | company and understand the rules of the competition) and had a
         | huge amount of compute resources which handles a lot of the
         | challenges of optimizing a process like this.
        
           | lawrenceyan wrote:
           | My summary is pretty generalized, aimed more towards a layman
           | audience, and so I definitely am missing pieces. Co-
           | evolutionary couplings between different protein sequences
           | provide a very rich source of information, and are definitely
           | very important!
           | 
           | For those of you who are curious about what these couplings
           | represent, the basic idea is this: given that proteins are a
           | product of evolution, you can intuitively see that there
           | might be a large amount of conserved structure from one
           | protein to another. Co-evolutionary coupling is just an
           | attempt at quantifying this relationship in a rigorous
           | statistical manner.
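
The "correlated variations" across an alignment discussed above can be sketched with one of the simplest possible statistics: mutual information between pairs of alignment columns. This is a minimal illustration only. The toy alignment is made up, the function names are hypothetical, and nothing here is claimed to be what AlphaFold or any other published pipeline does; real methods use larger alignments and more elaborate statistics.

```python
import math
from collections import Counter
from itertools import combinations

# Toy alignment (made-up sequences, not real proteins). Rows are related
# sequences, columns are aligned positions.
ALIGNMENT = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
    "GKIE",
    "GRID",
]


def column(alignment, i):
    """Return the residues found at aligned position i."""
    return [seq[i] for seq in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def ranked_pairs(alignment):
    """Score every pair of columns and sort from most to least coupled."""
    length = len(alignment[0])
    scores = {
        (i, j): mutual_information(column(alignment, i), column(alignment, j))
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for pair, score in ranked_pairs(ALIGNMENT):
        print(pair, round(score, 3))
```

In the toy data, positions 1 and 3 always change together (K with E, R with D), as do positions 0 and 2, so those pairs score highest. Pairs that vary independently score near zero. The intuition is that positions which mutate together may be in contact in the folded structure.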
        
             | dekhn wrote:
             | I mainly don't want ML folks to suddenly think protein
             | folding is easy because the PDB is a good training set.
             | It's not.
        
               | [deleted]
        
               | lawrenceyan wrote:
               | Why frame things in such a pessimistic manner? It seems
               | like it would only be a net benefit to have more people
               | become interested in protein folding as a field of study.
               | Is there really a need for this type of gatekeeping here?
        
               | dekhn wrote:
               | Yes, personally I think there is. I've spent a tremendous
               | amount of my career watching computer scientists
               | misunderstand how to work on protein folding and waste a
               | lot of people's time. Because the concept of protein
               | folding is so unbelievably complex, most CS and ML folks
               | get the basic talk: "nearly all proteins fold reversibly
               | to a global minimum energy structure which is completely
               | defined by the sequence of the protein", which isn't
               | remotely true (basically a weak form of Anfinsen's dogma
               | and Levinthal's paradox). It's easy to explain, and CS
               | and ML people get excited and go off to work on the
               | problem. This led to a lot of publications that focused
               | on rapidly finding heuristics that could sample enough
               | space to find an approximation of the lowest energy
               | structure. These methods typically failed to make good
               | predictions, although eventually methods like Rosetta did
               | start making good predictions around 15-20 years ago
               | (amusingly, the author of Rosetta told me: "the larger
               | the PDB (training data set) gets, the worse the
               | predictions we make").
               | 
               | But people who spend a long time getting a biological
               | education know why this is true: most proteins don't fold
               | to their energetic minimum, they fold to a collection of
               | kinetically accessible states, rarely finding their true
               | minimum (some small proteins do fold quickly, and we
               | typically can predict their structure). And, many of the
               | physical approximations that are used lead to
               | inaccuracies (for example, some variables are constrained
               | to specific values to save time, but making good
               | predictions requires them to be unconstrained).
               | 
               | Some of my work made significant contributions to
               | changing these beliefs, and I'm very thankful for the CS
               | and ML folks who contributed to that, but all of them
               | spent a lot of time learning about proteins before they
               | were useful contributors.
               | 
               | I myself have had to "unlearn" a lot of the early things
               | that were explained to me when I was a layperson, because
               | when you're first learning something, if somebody gives
               | you a simplified view, it can be really hard to move on
               | to the more subtle and nuanced details in the field (for
               | example, many people learn Mendelian genetics and then
               | spend years struggling to understand why most traits
               | don't follow Mendelian statistics).
               | 
               | My goal here is to prevent wasted time on behalf of the
               | experienced contributors in the field. I do appreciate
               | good attempts at explaining the field to laypeople, but
               | want to set the expectations on contributing accurately.
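
The distinction drawn above between a global energy minimum and kinetically accessible states can be illustrated with a deliberately tiny toy. The 1-D "energy landscape" below is invented, and `greedy_descent` is a hypothetical stand-in for any process that only ever moves downhill. It is not a model of real folding kinetics, only a sketch of how the end state can depend on the starting point rather than on where the lowest energy is.

```python
# Invented energies for ten neighbouring "conformations" (arbitrary units).
ENERGY = [5.0, 3.0, 2.0, 4.0, 6.0, 4.0, 1.0, 0.0, 2.0, 5.0]


def greedy_descent(energy, start):
    """Step to the lower-energy neighbour until no neighbour is lower."""
    pos = start
    while True:
        neighbours = [p for p in (pos - 1, pos + 1) if 0 <= p < len(energy)]
        best = min(neighbours, key=lambda p: energy[p])
        if energy[best] >= energy[pos]:
            return pos  # stuck: a local minimum, not necessarily the global one
        pos = best


def global_minimum(energy):
    """Index of the lowest-energy conformation, found by exhaustive scan."""
    return min(range(len(energy)), key=lambda p: energy[p])


if __name__ == "__main__":
    print("global minimum at", global_minimum(ENERGY))
    for start in (0, 4, 9):
        print("start", start, "-> ends at", greedy_descent(ENERGY, start))
```

Starting from the left side of this landscape, the downhill walk ends in the shallow well at index 2 and never crosses the barrier to the deeper well at index 7. Only a start on the right side reaches the global minimum.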
        
       ___________________________________________________________________
       (page generated 2020-02-21 23:00 UTC)