[HN Gopher] AlphaFold reveals the structure of the protein universe ___________________________________________________________________ AlphaFold reveals the structure of the protein universe Author : MindGods Score : 376 points Date : 2022-07-28 11:18 UTC (5 hours ago) (HTM) web link (www.deepmind.com) (TXT) w3m dump (www.deepmind.com) | codedokode wrote: | Today I learned that there are bacteria that have a protein | helping to form ice on plants [1] to destroy them and extract | nutrients (however I didn't understand how bacteria themselves | survive this). | | Machine learning typically uses existing data to predict new | data. Please explain: Does it mean that AlphaFold can only use | known types of interactions between atoms and will mispredict the | structure of proteins that use not yet known interactions? | | And why we cannot just simulate protein behaviour and | interactions using quantum mechanics? | | [1] https://pubs.acs.org/doi/10.1021/acs.jpcb.1c09342 | flobosg wrote: | > And why we cannot just simulate protein behaviour and | interactions using quantum mechanics? | | QM calculations have been done in proteins, but they're | computationally very expensive. IIRC, there are hybrid | approaches where only a small portion of interest in the | protein structure is modelled by QM and the rest by classical | molecular mechanics. | beanwood wrote: | >And why we cannot just simulate protein behaviour and | interactions using quantum mechanics? | | If you wanted to simulate the behaviour of an entire protein | using quantum mechanics, the sheer number of calculations | required would be infeasible. | | For what it's worth, I have a background in computational | physics and am studying a PhD in structural biology. For any | system (of any size) that you want to simulate, you have to | consider how much information you're willing to 'ignore' in | order to focus on the information you would like to 'get out' | of a set of simulations. 
Being aware of the approximations you | make and how this impacts your results is crucial. | | For example, if I am interested in how the electrons of a group | of Carbon atoms (radius ~ 170 picometres) behave, I may want to | use Density Functional Theory (DFT), a quantum mechanical | method. | | For a single, small protein (e.g. ubiquitin, radius ~ 2 | nanometres), I may want to use atomistic molecular dynamics | (AMD), which models the motion of every single atom in response | to thermal motion, electrostatic interactions, etc. using | Newton's 2nd law. Electron/proton detail has been approximated | away to focus on overall atomic motion. | | In my line of work, we are interested in how big proteins (e.g. | the dynein motor protein, ~ 40 nanometres in length) move | around and interact with other proteins at longer time (micro- | to millisecond) and length (nano- to micrometre) scales than | DFT or AMD. We 'coarse-grain' protein structures by | representing groups of atoms as tetrahedra in a continuous mesh | (continuum mechanics). We approximate away atomic detail to | focus on long-term motion of the whole protein. | | Clearly, it's not feasible to calculate the movement of dynein | for hundreds of nanoseconds using DFT! The motor domain alone | in dynein contains roughly one million atoms (and it has | several more 'subunits' attached to it). Assuming these are | mostly Carbon, Oxygen or Nitrogen, then you're looking at | around ten million electrons in your DFT calculations, for a | single step in time (rounding up). If you're dealing with the | level of atomic bonds, you're probably going to use time | steps between a femtosecond (10^-15 s) and a picosecond (10^-12 s). The | numbers get a bit ridiculous. There are techniques that combine | QM and AMD, although I am not too knowledgeable in this area. 
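The back-of-envelope numbers in the comment above can be sketched in a few lines of Python. Everything here is an assumption taken from the comment's rough figures (about one million atoms, mostly C, N or O, hydrogens ignored as the comment does); the 100 ns span is a hypothetical stand-in for "hundreds of nanoseconds".

```python
# Rough scale of a DFT run on the dynein motor domain.
# Inputs are the approximate figures quoted in the comment, not measurements.

N_ATOMS = 1_000_000                              # "roughly one million atoms"
ELECTRONS_PER_ATOM = {"C": 6, "N": 7, "O": 8}    # electrons in each neutral atom


def electron_bounds(n_atoms):
    """Lower and upper electron counts if every atom were the lightest or heaviest listed."""
    counts = ELECTRONS_PER_ATOM.values()
    return n_atoms * min(counts), n_atoms * max(counts)


def n_time_steps(simulated_seconds, step_seconds):
    """Number of integration steps needed to cover the simulated span."""
    return round(simulated_seconds / step_seconds)


low, high = electron_bounds(N_ATOMS)

SIMULATED_SECONDS = 100e-9                       # assumption: 100 ns of simulated time
fs_steps = n_time_steps(SIMULATED_SECONDS, 1e-15)  # femtosecond time step
ps_steps = n_time_steps(SIMULATED_SECONDS, 1e-12)  # picosecond time step

print(f"electrons per step: {low:,} to {high:,}")
print(f"steps at 1 fs: {fs_steps:,}; steps at 1 ps: {ps_steps:,}")
```

Six to eight million electrons is where the "around ten million (rounding up)" figure comes from, and each of the hundred million femtosecond steps would need its own electronic-structure calculation.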
| | Some further reading, if you're interested (I find Wikipedia | articles on these topics to generally be quite good): | | DFT: https://en.wikipedia.org/wiki/Density_functional_theory | | Biological continuum mechanics: | https://doi.org/10.1371/journal.pcbi.1005897 | | Length scales in biological simulations: | https://doi.org/10.1107/S1399004714026777 | | Electronic time scales: | https://www.pnas.org/doi/10.1073/pnas.0601855103 | kache_ wrote: | This is an incredible gift to humanity. A huge positive impact. | The team should be proud | carbocation wrote: | The press release is a bit difficult to place into historical | context. I believe that the first AlphaFold release was mostly | human and mouse proteins, and this press release marks the | release of structures for additional species. | azangru wrote: | > I believe that the first AlphaFold release was mostly human | and mouse proteins, | | More than that. The press release actually contains an | infographic comparing the amount of published protein models | for different clades of organisms. The infographic shows that | the previous release (~1mln proteins) contained proteins of | some animal, plant, bacterial, and fungal species. | cm2187 wrote: | How do you know that the predicted structure will be correct? I | presume researchers will need to validate the structure | empirically. Do we know how good the model has been at predicting | so far? | donut2d wrote: | A fun way I've been thinking about all this is what | nanotech/nanobots are actually going to look like. Tiny little | protein machines doing what they've been doing since the dawn of | life. We now have a library of components, and as we start | figuring out what they can do, and how to stack them, we can | start building truly complex machinery for whatever crazy tasks | we can imagine. The impact goes so far beyond drugs and | treatments. | candiddevmike wrote: | Is folding@home obsolete now? 
| foxhop wrote: | I would say no, the two approaches may be used to validate each | other. | flobosg wrote: | Folding@home answers a related but different question. While | AlphaFold returns the picture of a folded protein in its most | energetically stable conformation, Folding@home returns a video | of the protein undergoing folding, traversing its energy | landscape. | dekhn wrote: | It's not, but the question is (and has long been) whether the | energy expended by folding@home is worth the scientific result. | IMHO- probably not. | seventytwo wrote: | Good question... I'd imagine that other methods of folding | solutions are still valuable, because AlphaFold needs to be | checked. | klemola wrote: | As an aside, the protein structure visualizations in the article | are pretty. Is there a good source for more? | alphabetting wrote: | https://alphafold.ebi.ac.uk/ | flobosg wrote: | * https://pdb101.rcsb.org/motm/ | | * https://ccsb.scripps.edu/goodsell/ | | * https://pdb101.rcsb.org/sci-art/geis-archive/irving-geis | | * https://www.digizyme.com/portfolio.html | | * https://www.drewberry.com/ | | * https://biochem.web.utah.edu/iwasa/projects.html | | * http://onemicron.com/ | | * The art of Jane Richardson, of which I couldn't find a link | | * This blog has plenty of good links: | https://blogs.oregonstate.edu/psquared/ | yuan43 wrote: | > Today, I'm incredibly excited to share the next stage of this | journey. In partnership with EMBL's European Bioinformatics | Institute (EMBL-EBI), we're now releasing predicted structures | for nearly all catalogued proteins known to science, which will | expand the AlphaFold DB by over 200x - from nearly 1 million | structures to over 200 million structures - with the potential to | dramatically increase our understanding of biology. | | And later: | | > Today's update means that most pages on the main protein | database UniProt will come with a predicted structure. 
All 200+ | million structures will also be available for bulk download via | Google Cloud Public Datasets, making AlphaFold even more | accessible to scientists around the world. | | This is the actual announcement. | | UniProt is a large database of protein structure and function. | The inclusion of the predicted structures alongside the | experimental data makes it easier to include the predictions in | workflows already set up to work with the other experimental and | computed properties. | | It's not completely clear from the article whether any of the | 200+ million predicted structures deposited to UniProt have not | been previously released. | | Protein structure determines function. Before AlphaFold, | experimental structure determination was the only option, and | that's very costly. AlphaFold's predictions appear to be good | enough to jumpstart investigations without an experimental | structure determination. That has the potential to accelerate | many areas of science and could percolate up to therapeutics. | | One area that doesn't get much discussion in the press is the | difference between solid state structure and solution state | structure. It's possible to obtain a solid state structure | determination (x-ray) that has nothing to do with actual behavior | in solution. Given that AlphaFold was trained to a large extent | on solid state structures, it could be propagating that bias into | its predicted structures. | | This paper talks about that: | | > In the recent Critical Assessment of Structure Prediction | (CASP) competition, AlphaFold2 performed outstandingly. Its worst | predictions were for nuclear magnetic resonance (NMR) structures, | which has two alternative explanations: either the NMR structures | were poor, implying that AlphaFold may be more accurate than | NMR, or there is a genuine difference between crystal and | solution structures. 
Here, we use the program Accuracy of NMR | Structures Using RCI and Rigidity (ANSURR), which measures the | accuracy of solution structures, and show that one of the NMR | structures was indeed poor. We then compare AlphaFold | predictions to NMR structures and show that AlphaFold tends to | be more accurate than NMR ensembles. There are, however, some | cases where the NMR ensembles are more accurate. These tend to be | dynamic structures, where AlphaFold had low confidence. We | suggest that AlphaFold could be used as the model for NMR- | structure refinements and that AlphaFold structures validated by | ANSURR may require no further refinement. | | https://pubmed.ncbi.nlm.nih.gov/35537451/ | sseagull wrote: | > Before AlphaFold, experimental structure determination was | the only option | | Other computational methods have existed for a long time. | Folding@home was founded 22 years ago. | dekhn wrote: | folding@home doesn't predict structures, it simulates protein | folding. Different area with some overlap. | COGlory wrote: | Before my comment gets dismissed, I will disclaim I am a | professional structural biologist that works in this field every | day. | | These threads are always the same: lots of comments about protein | folding, how amazing DeepMind is, how AlphaFold is a success | story, how it has flipped an entire field on its head, etc. The | language from Google is so deceptive about what they've actually | done, I think it's actually intentionally disingenuous. | | At the end of the day, AlphaFold is amazing homology modeling. I | love it, I think it's an awesome application of machine learning, | and I use it frequently. But it's doing the same thing we've been | doing for 2 decades: pattern matching sequences of proteins with | unknown structure to sequences of proteins with known structure, | and about 2x as well as we used to be able to. | | That's extremely useful, but it's not knowledge of protein | folding. 
It can't predict a fold de novo, it can't predict folds | that haven't been seen (EDIT: this is maybe not strictly true, | depending on how you slice it), it fails in a number of edge | cases (remember, in biology, edge cases are everything) and | again, I can't stress this enough, we have no new information on | how proteins fold. We know all the information (most of it at least) | for a protein's final fold is in the sequence. But we don't know | much about the in-between. | | I like AlphaFold, it's convenient and I use it (although for | anything serious or anything interacting with anything else, I | still need a real structure), but I feel as though it has been | intentionally and deceptively oversold. There are 3-4 other deep | learning projects I think have had a much greater impact on my | field. | | EDIT: See below: https://news.ycombinator.com/item?id=32265662 | for information on predicting new folds. | flobosg wrote: | > AlphaFold is amazing homology modeling | | If it is homology modelling, then how can it work without input | template structures? | COGlory wrote: | It has template structures. AlphaFold uses the following | databases: BFD, MGnify, | PDB70, PDB (structures in the mmCIF format), | PDB seqres - only for AlphaFold-Multimer, Uniclust30, | UniProt - only for AlphaFold-Multimer, UniRef90. | flobosg wrote: | Those databases are used to derive the evolutionary | couplings and distance matrices used by the algorithm. | Several of those databases aren't even structural ones. | Furthermore, AlphaFold can function with only a MSA as an | input, without retrieving a single PDB coordinate. | COGlory wrote: | It's all about boosting signal by finding other proteins | that are similar, until you get to the point that you can | identify a fold to assign to a region of the protein. | That's why some are structural, and some are not. | | >Furthermore, AlphaFold can function with only a MSA as | an input, without retrieving a single PDB coordinate. 
| | Yes, it has a very nice model of what sequences should | look like in 3D. That model is derived from experimental | data. So if I give AlphaFold an MSA of a new, unknown | protein fold (substantively away from any known fold), it | cannot predict it. | flobosg wrote: | > Yes, it has a very nice model of what sequences should | look like in 3D. | | A structural model, you would say. | | > That model is derived from experimental data. | | That doesn't make it a template-based model, or a | homology one. | | > if I give AlphaFold an MSA of a new, unknown protein | fold (substantively away from any known fold), it cannot | predict it | | That will depend on the number of effective sequences | found to derive couplings. Domains with novel folds | usually have a low number of remotely homolog sequences | and for that reason the method will fail, not just | because they are novel. | zack-m wrote: | Yup. It's great, but there are still many aspects to unpack and | work on. Hence why Rosetta is a thing. | flobosg wrote: | Rosetta methods are also moving towards ML. Here's an article | from last week: | https://www.science.org/doi/10.1126/science.abn2100 | [deleted] | Lloydksk wrote: | I mean like whats this about AlphaFold is gone | adamsmith143 wrote: | > it can't predict folds that haven't been seen | | This seems strange to me. The entire point of these types of | models is to predict things on unseen data. Are you saying | Deepmind is completely lying about their model? | | Deepmind solved CASP, isn't the entire point of that | competition to predict unseen structures? | | If AlphaFold doesn't predict anything then what are you using | it to do? | COGlory wrote: | AlphaFold figures out that my input sequence (which has no | structural data) is similar to this other protein that has | structural data. Or maybe different parts of different | proteins. It does this extremely well. | flobosg wrote: | This is a gross misrepresentation of the method. 
| COGlory wrote: | Perhaps you'd care to explain how? AlphaFold does not | work on new folds. It ultimately relies on mapping | sequence to structure. It does it better than anyone | else, and in ways a human probably couldn't, but if you | give it a brand new fold with no relation to other folds, | it cannot predict it. I routinely see areas of extremely low | confidence in many of my AlphaFold models. I work in | organisms that have virtually 0 sequence identity. This | is a problem I deal with every day. I wish AlphaFold | worked in the way you are suggesting, but it just flat | out does not. | dekhn wrote: | No organisms have virtually 0 sequence identity. That's | nonsense. Can you give an example? Even some random | million-year-isolated archaeon shares the majority of its | genes with common bacteria. | flobosg wrote: | > It ultimately relies on mapping sequence to structure. | | So does every structural prediction method. | | > if you give it a brand new fold with no relation to | other folds, it cannot predict it | | That will depend on the number of effective sequences, | not the actual fold. | | > I work in organisms that have virtually 0 sequence | identity. | | Then the problem is low sequence coverage, not the | protein fold. On a side note, there are sensitive | homology search protocols that rely very little on actual | sequence identity. | [deleted] | mupuff1234 wrote: | > There are 3-4 other deep learning projects I think have had a | much greater impact on my field. | | Don't leave us hanging... which projects? 
| COGlory wrote: | 1) Isonet - takes low SNR cryo-electron tomography images | (that are extremely dose limited, so just incredibly blurry | and frequently useless) and does two things: | | * Deconvolutes some image aberrations and "de-noises" the | images | | * Compensates for missing wedge artifacts (missing wedge is | the fact that the tomography isn't done -90deg --> +90deg, | but usually instead -60deg --> +60deg, leaving a 30deg wedge | on the top and bottom of basically no information) which | usually are some sort of directionality in image density. So | if you have a sphere, the top and bottom will be extremely | noisy and stretched up and down (in Z). | | https://www.biorxiv.org/content/10.1101/2021.07.17.452128v1 | | 2) Topaz, but topaz really counts as 2 or 3 different | algorithms. Topaz has denoising of tomograms and of flat | micrographs (i.e. images taken with a microscope, as opposed | to 3D tomogram volumes). That denoising is helpful because it | increases contrast (which is the fundamental problem in Cryo- | EM for looking at biomolecules). Topaz also has a deep | learning particle picker which is good at finding views of | your protein that are under-represented, or otherwise | missing, which again, normally results in artifacts when you | build your 3D structure. | | https://emgweb.nysbc.org/topaz.html | | 3) EMAN2 convolutional neural network for tomogram | segmentation/Amira CNN for segmentation/flavor of the week | CNN for tomogram segmentation. Basically, we can get a 3D | volume of a cell or virus or whatever, but then they are | noisy. To do anything worthwhile with it, even after | denoising, we have to say "this is cell membrane, this is | virus, this is nucleic acid" etc. CNNs have proven to be | substantially better at doing this (provided you have an | adequate "ground truth") than most users. 
| | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5623144/ | [deleted] | dekhn wrote: | I've directly communicated with the leaders of CASP and at DM | that they should stop representing this as a form of protein | folding and just call it "crystal/cryoEM structure prediction" | (they filter out all the NMR structures from PDB since they | aren't good for prediction). They know it's disingenuous and | they do it on purpose to give it more impact than it really | deserves. | | I would like to correct something here- it does predict | structures de novo and predict folds that haven't been seen | before. That's because of the design of the NN- it uses | sequence information to create structural constraints. If those | constraints push the modeller in the direction of a novel fold, | it will predict that. | | To me what's important about this is that it demonstrated the | obvious (I predicted this would happen eventually, shortly | after losing CASP in 2000). | COGlory wrote: | >I would like to correct something here- it does predict | structures de novo and predict folds that haven't been seen | before. That's because of the design of the NN- it uses | sequence information to create structural constraints. If | those constraints push the modeller in the direction of a | novel fold, it will predict that. | | Could you expand on this? Basically it looks at the data, and | figures out what's an acceptable position in 3D space for | residues to occupy, based on what's known about other | structures? | | I will update my original post to point out I may be not | entirely correct there. | | The distinction I'm trying to make is that there's a | difference between looking at pre-existing data and modeling | (ultimately homology modeling, but maybe slightly different) | and understanding how protein folding works, being able to | predict de novo how an amino acid sequence will become a 3D | structure. | | Also thank you for contacting CASP about this. 
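As a toy illustration of the "structural constraints" idea in the exchange above, the sketch below shows that a complete set of pairwise distances is enough to rebuild a 3D shape, up to rotation, translation and mirror image. This is not AlphaFold's pipeline: it assumes exact, complete distances and four non-coplanar reference points, whereas real restraints are noisy, partial ranges. The function name `embed_from_distances` and the sample points are made up for the example.

```python
import math


def embed_from_distances(d):
    """Return 3D coordinates whose pairwise distances reproduce matrix d.

    Assumes d is a complete, exact Euclidean distance matrix and that the
    first four points are not coplanar (they define the reference frame).
    """
    coords = []
    for k in range(len(d)):
        x = [0.0, 0.0, 0.0]                 # point 0 stays at the origin
        if k > 0:
            r2 = d[0][k] ** 2               # squared distance from the origin
            fixed = min(k - 1, 3)           # coordinates pinned by earlier anchors
            for axis in range(fixed):
                p = coords[axis + 1]        # anchor that is nonzero only up to `axis`
                # |x - p|^2 = d^2 rearranges to p.x = (|p|^2 + |x|^2 - d^2) / 2
                rhs = (sum(c * c for c in p) + r2 - d[axis + 1][k] ** 2) / 2.0
                x[axis] = (rhs - sum(p[j] * x[j] for j in range(axis))) / p[axis]
            if k <= 3:
                # Frame-defining points put their leftover length on a new axis.
                rest = r2 - sum(x[j] ** 2 for j in range(fixed))
                x[k - 1] = math.sqrt(max(rest, 0.0))
        coords.append(x)
    return coords


def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


# Hypothetical "atoms" used only to generate a consistent distance matrix.
true_points = [
    (1.0, 2.0, 0.5), (2.5, 2.0, 0.0), (0.0, 0.0, 1.0),
    (3.0, -1.0, 2.0), (1.5, 1.5, 1.5), (-1.0, 0.5, 2.0),
]
d = [[dist(a, b) for b in true_points] for a in true_points]

recovered = embed_from_distances(d)
n = len(d)
max_error = max(
    abs(dist(recovered[i], recovered[j]) - d[i][j])
    for i in range(n) for j in range(n)
)
print("largest distance mismatch:", max_error)
```

The recovered coordinates differ from the originals by a rigid motion (and possibly a reflection), which is all that distances alone can determine.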
| dekhn wrote: | From what I can tell, the model DM built is mining subtle | relationships between aligned columns of multiple sequence | alignments and any structural information which is tangibly | related to those sequences. Those relationships can be used | to infer rough atomic distances ("this atom should be | within 3 and 7 angstroms of this other atom"). A large | matrix (partially filled out) of distances is output, and | those distances are used as constraints in a force field | (which also includes lots of prior knowledge about protein | structure) and then they run simulations which attempt to | minimize both the force field and constraint terms. | | In principle you don't even need a physical force field- if | you have enough distance information between pairs of | atoms, you can derive a plausible structure by embedding | the distances in R3 | (https://en.wikipedia.org/wiki/Distance_geometry and https: | //citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21... | | Presumably, the signal they extract includes both rich | local interactions (amino acids near in sequence) and | distant ones inferred through sequence/structure | relationships, and the constraints could in fact push a | model towards a novel fold, presumably through some | extremely subtle statistical relationships to other | evolutionarily related proteins that adopt a different | fold. | ramraj07 wrote: | Not sure if you should be reminded of how alpha fold started, | it started by winning a competition thought un winnable by | academics. Top labs working in protein structure prediction | have fundamentally changed direction after alpha fold and are | working to do the same even better. | | This is not the first (or even tenth) time I'm seeing an | academic trying to undermine genuine progress almost to the | level of gaslighting. Comparing alphafold to conventional | homology modeling is disingenuous at its most charitable | interpretation. | | Not sure what else to say. 
Structural biology has always been | the weirdest field I've seen, the way students are abused | (crystallize and publish in nature or go bust), and how every | nature issue will have three structure papers as if that cures | cancer every day. I suppose it warps one's perception of | outsiders after being in such a bubble? | | signed, someone with a PhD in biomedical engineering, did a ton | of bio work. | shpongled wrote: | > Comparing alphafold to conventional homology modeling is | disingenuous at its most charitable interpretation. | | It's really not - have you played around with AF at all? Made | mutations to protein structures and asked it to model them? | Go look up the crystal structures for important proteins like | FOXA1 [1], AR [2], EWSR1 [3], etc (i.e. pretty much any | protein target we really care about and haven't previously | solved) and tell me with a straight face that AF has "solved" | protein folding - it's just a fancy language model that's | pattern matching to things it's already seen solved before. | | signed, someone with a PhD in biochemistry. | | [1] https://alphafold.ebi.ac.uk/entry/P55317 [2] | https://alphafold.ebi.ac.uk/entry/P10275 [3] | https://alphafold.ebi.ac.uk/entry/Q01844 | COGlory wrote: | > Not sure if you should be reminded of how alpha fold | started, it started by winning a competition thought un | winnable by academics. Top labs working in protein structure | prediction have fundamentally changed direction after alpha | fold and are working to do the same even better. | | Not sure what part of "it does homology modeling 2x better" | you didn't see in my comment? AlphaFold scored something like | 85% in CASP in 2020, in CASP 2016, I-TASSER had I think 42%? | So it's ~2x as good as I-TASSER which is exactly what I said | in my comment. | | >This is not the first (or even tenth) time I'm seeing an | academic trying to undermine genuine progress almost to the | level of gaslighting. 
Comparing alphafold to conventional | homology modeling is disingenuous at its most charitable | interpretation. | | It literally is homology modeling. The deep learning aspect | is to boost otherwise unnoticed signal that most homology | modeling software couldn't tease out. Also, I don't think I'm | gaslighting, but maybe I'm wrong? If anything, I felt gaslit | by the language around AlphaFold. | | >Not sure what else to say. Structural biology has always | been the weirdest field I've seen, the way students are | abused (crystallize and publish in nature or go bust), and | how every nature issue will have three structure papers as if | that cures cancer every day. I suppose it warps one's | perception of outsiders after being in such a bubble? | | What on earth are you even talking about? The vast, VAST | majority of structures go unpublished ENTIRELY, let alone | published in nature. There are almost 200,000 structures on | deposit in the PDB. | underdeserver wrote: | > Not sure what part of "it does homology modeling 2x | better" you didn't see in my comment? AlphaFold scored | something like 85% in CASP in 2020, in CASP 2016, I-TASSER | had I think 42%? So it's ~2x as good as I-TASSER which is | exactly what I said in my comment. | | Wait, stop, I don't know anything about proteins but 84% | success is not ~2x better than 42%. | | It doesn't really make sense to talk about 2x better in | terms of success percentages, but if you want a feel, I | would measure 1/error instead (a 99% correct system is 10 | times better than a 90% correct system), making AlphaFold | around 3.6 times better. | palmtree3000 wrote: | I think odds ratio ( p/(1-p) ) is the thing I'd use here. 
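The three ways of comparing the two scores raised in this subthread (plain ratio, ratio of error rates, and ratio of odds p/(1-p)) can be worked through directly. The scores are the thread's half-remembered figures ("something like 85%", "I think 42%", and 84% in the reply), so treat the outputs as illustrations of the metrics rather than as CASP results.

```python
def success_ratio(p_new, p_old):
    """How many times larger the new success rate is."""
    return p_new / p_old


def error_ratio(p_new, p_old):
    """How many times smaller the new error rate is (the 1/error view)."""
    return (1 - p_old) / (1 - p_new)


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)


def odds_ratio(p_new, p_old):
    """Ratio of the new odds to the old odds."""
    return odds(p_new) / odds(p_old)


P_OLD = 0.42  # the figure recalled for I-TASSER in the thread
for p_new in (0.85, 0.84):
    print(
        f"p_new={p_new}: "
        f"success x{success_ratio(p_new, P_OLD):.2f}, "
        f"1/error x{error_ratio(p_new, P_OLD):.2f}, "
        f"odds x{odds_ratio(p_new, P_OLD):.2f}"
    )
```

With these inputs the same pair of scores reads as roughly 2x, between 3.6x and 3.9x, or between 7x and 8x better, depending on which metric is chosen.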
| It gives the right limiting behavior (at p ~= 0, doubling | p is twice as good, and at p~=1, halving 1-p is twice as | good) and it's the natural way to express Bayes rule, | meaning you can say "I'm twice as sure (in odds ratio | terms) based on this evidence" and have that be solely a | property of the update, not the prior. | dekhn wrote: | What ramraj is talking about: if you go into a competitive | grad program to get a PhD in structural biology, your | advisor will probably expect that in 3-4 years you will: | crystallize a protein of interest, collect enough data to | make a model, and publish that model in a major journal. | Many people in my program could not graduate until they had | a Nature or Science paper (my advisor was not an asshole, I | graduated with just a paper in Biochemistry). | | In a sense both of you are right- DeepMind is massively | overplaying the value of what they did, trying to expand | its impact far beyond what they actually achieved (this is | common in competitive biology), but what they did was such | an improvement over the state of the art that it's | considered a major accomplishment. It also achieved the | target of CASP- which was to make predictions whose scores | are indistinguishable from experimentally determined | structures. | | I don't think academics thought CASP was unwinnable but | most groups were very surprised that an industrial player | using 5 year old tech did so well. | flobosg wrote: | Hear, hear. This is probably the best take. | valarauko wrote: | > What ramraj is talking about: if you go into a | competitive grad program to get a PhD in structural | biology, your advisor will probably expect that in 3-4 | years you will: crystallize a protein of interest, | collect enough data to make a model, and publish that | model in a major journal. 
| | All of that is generally applicable to molecular biology | in general, and I don't see how the field of structural | biology is especially egregious, the way ramraj is making | it out to be. | flobosg wrote: | Protein crystallization can be very difficult and there | is no general solution. Kits that screen for crystal | growth conditions usually help but optimization is needed | in most cases. Then, that crystal must have certain | properties that allow for good data acquisition at the | X-ray facility. That's another problem by itself and | months or years can pass until you get a suitable protein | crystal and X-ray diffraction dataset where you can model | your structure. | valarauko wrote: | I'm familiar with protein crystallization and the | difficulties associated with it. What I don't agree with | is the characterization of the field as especially | difficult, above and beyond modern biology in general. | Nor can I support the assertion that structural biology | students are subject to special abuse that regular grad | students are not. | | > ... can be very difficult and there is no general | solution | | This is true of pretty much any graduate work in | molecular biology. | dekhn wrote: | I did rotations in multiple types of lab as part of my | program and I can't say I ever found that students in | regular molecular biology labs had nearly as hard a time | as structural biologists; SB is its own class of hell. | Given the number of papers published in molecular biology | that turn out to be "gel was physically cut and | reassembled to show the results the authors desired" | (it's much harder to cheat on a protein structure)... | lucidrains wrote: | To add to this, the deep learning field has already moved | on towards MSA-less structure prediction. None of this | would be possible without building on top of the work | open sourced by Deepmind. 
| | https://www.biorxiv.org/content/10.1101/2022.07.21.500999 | v1 https://www.biorxiv.org/content/10.1101/2022.07.20.500 | 902v1 | | To be overly dismissive is to lack imagination. | jakosz wrote: | Now we can start guessing what futures they are betting on: | these, in which open-sourcing the whole thing commoditises | critical complements. | | --- | | https://www.gwern.net/Complement | jarenmf wrote: | This is probably one of the best applications of AI in science in | terms of impact so far. I can't think of any other problem with | the same potential impact. | | EDIT: grammar | 323 wrote: | AI translate has probably a bigger worldwide impact so far. | jebarker wrote: | jarenmf said "in science" - but it is an interesting question | how much automated translation has helped scientists | translate papers from other languages. | hijodelsol wrote: | It even goes both ways - it allows non-native English | speakers to publish their work in correct | technical/scientific English with far less barriers. | calebkaiser wrote: | This is definitely one of the most exciting spaces in AI right | now. Another somewhat-related startup is PostEra (medicinal | chemistry for drug discovery via AI) https://postera.ai/about/ | axg11 wrote: | AlphaFold is the best counterpoint to tech cynics. | | One of the largest public tech companies in the world funded a | multi-year scientific project, executed the research flawlessly | and moved forward an entire scientific field. They then went on | to openly release the code _and_ data, working with a publicly | funded organization (EMBL-EBI) to ensure researchers across the | globe can easily access the outputs. | | I'm not arguing that every tech company is a net positive for | humanity. Google itself isn't perfect. Google + DeepMind is | setting a bloody high bar though. | t00 wrote: | You are right and when thinking about it I can see 2 problems | which I hope in the future can have even more impact: | | 1. 
Using AI to determine the most efficient methods of doing | mathematical expressions, transformations and computation | algorithms - division, square root, maybe traveling salesman - | the ones that take a relatively high number of CPU cycles to | compute and are used everywhere. If inputs and outputs can be | assigned to it, AI can eventually build a transformation which | can be reproduced in silicon. | | 2. Physics phenomena in general, not only organic protein, can | be measured and with sufficient ability to quantize them to | inputs and experimentally obtained outputs to train the | network, we could in theory establish new formulas or constants | and progress the understanding of the Universe. | lrhegeba wrote: | the groundwork, at least partially, was happening as you typed | this: https://www.nature.com/articles/d41586-021-01627-2 | dekhn wrote: | Demis and John will probably win either the Chemistry or Physics | Nobel Prize in the next couple of years. | thomasahle wrote: | Some people are using "AI wins a Nobel prize" as the new Turing | test. Maybe that is going to happen sooner than they expect. Or | maybe the owners of the AI will always claim it on its behalf. | dekhn wrote: | there's no AI here. This is just ML. All deepmind did here | was use multiple excellent resources- large numbers of | protein sequences, and small numbers of protein structures, | to create an approximation function of protein structure, | without any of the deep understanding of "why". | Interestingly, the technology they used to do this didn't | exist 5 years ago! | dalbasal wrote: | Can someone put AlphaFold's problem space into perspective for | me? | | Why is protein folding important? Theoretical importance? Can we | do something with protein folding knowledge? If so, what? | | I've been hearing about AlphaFold from the CS side. There they | seem to focus on protein folding primarily as an interesting | space to apply their CS efforts.
| axg11 wrote: | If we knew: | | (a) the structure of every protein (what DeepMind is doing | here) | | (b) how different protein structures interact (i.e. protein | complexes - DeepMind is working on this but not there yet) | | Then we could use those two building blocks to design new | proteins (drugs) that do what we want. If we solve those two | problems with very high accuracy, we can also reduce the time | it takes to go from starting a drug discovery programme to | approved medicine. | | Obtaining all protein structures and determining how they | interact is a key step towards making biology more predictable. | Previously, solving the structure of a protein was very time | consuming. As a result, we didn't know the structure for a | majority of proteins. Now that it's much faster, downstream | research can move faster. | | Caveat: we should remember that these are all computational | predictions. AlphaFold's predictions can be wrong and protein | structures will still need to be validated. Having said that, | lots of validation has already occurred and confidence in the | predictions grows with every new iteration of AlphaFold. | lamename wrote: | How are the predictions validated? Waiting for the old | fashioned way for... very difficult crystal structure | experiments? Or something else? | RandomLensman wrote: | If the predictions are generally good enough, could also | skip the validation and directly try to get a desired | effect or reaction. That isn't strictly speaking validating | the structure, but depending on the use case might be | easier to just go for an outcome - really a question of | application and cost efficiency. | siver_john wrote: | I mean nothing is stopping you from skipping validation | with pre-alphafold techniques and say for drug discovery | to already do drug screening using the predicted | structure. It's just the drug screening software is | already error prone so you are still going to have to do | some validation. 
However having an idea on a potential | structure means that you can do other techniques that are | simpler to validate it that are less expensive/time | consuming (I'm thinking of things similar to FRET). | | Another idea is these may come into play for anti- | verification, so if you are drug screening against a | known structure. You could potentially use these more | flawed structures of proteins you don't want to target | but may be similar, and try to reduce the drug's efficacy | at binding them. Or something to that effect. All of that | is fun ideas that are currently being explored in that | space but we'll see where it takes us. | flobosg wrote: | > very difficult crystal structure experiments? | | Apart from X-ray crystallography there are other methods | for structure determination such as nuclear magnetic | resonance (NMR) or cryo-electron microscopy (cryo-EM). The | latter has seen a dramatic improvement in resolution over | the last decade. | misnome wrote: | For a lot of X-ray crystallography cases, some of the | difficulty is working out with no prior information, the | actual structure from the collected data. This makes a lot | of that... much easier because with | https://en.wikipedia.org/wiki/Molecular_replacement | something that is "close, but not correct" can be used to | bootstrap the actual structure from. | f38zf5vdt wrote: | Most of them are not, just estimations based on previous | results given sequences with known structure. | | Every couple years there is a massive competition called | CASP where labs submit previously unresolved protein | structures derived from experimental EM, x-ray | crystallography, or NMR studies and other labs attempt to | predict these structures using their software. AlphaFold2 | absolutely destroyed the other labs in the main contest | (regular monomeric targets, predominantly globular) for | structure resolution two years ago, in CASP 14. 
| | https://predictioncenter.org/casp14/zscores_final.cgi | | The latest contest, CASP15, is currently underway and | expected to end this year. As with all ML, the usual | caveats apply to the models Google generated -- the dangers | of overfitting to existing structures, artifacts based on | the way the problem was modelled, etc | jcranmer wrote: | > Then we could use those two building blocks to design new | proteins (drugs) that do what we want. If we solve those two | problems with very high accuracy, we can also reduce the time | it takes to go from starting a drug discovery programme to | approved medicine. | | Drugs are usually not proteins, but instead small molecules | that are designed to help or interfere with the operation of | proteins instead. | axg11 wrote: | That is only true because of our current tools and | capabilities. With improved manufacturing techniques and | AlphaFold++ I think biologics will dominate. Even still, | there are ~2000 approved biologics [0]. | | [0] - https://purplebooksearch.fda.gov/advanced-search | bckr wrote: | Yep, proteins are so much more flexible / precise than | small molecules. Also we can get the body to produce | them. Think mRNA vaccines. | empiricus wrote: | The endgame of protein folding is nanotechnology. Life is also | nanotechnology. So basically the end game is to take full | control of life: change existing life forms, design new life | forms, new ecosystems. Eat the earth first, then the universe. | Not yet there, but getting much closer. | green-eclipse wrote: | There are a million articles and podcasts explaining exactly | your question. Those will be better than HN responses. I | suggest you take 15 seconds to Google it. | dev_tty01 wrote: | Yes there are a million articles. That is why asking a | question here on HN is useful. The HN community more often | than not offers intelligent insight as well as curated | recommended links for learning more about a topic. 
Yes, the | signal-to-noise ratio isn't perfect on HN, but it is a lot | better than random Google searches. If nothing else, it often | leads to enough basic understanding so that someone can then | perform more refined and therefore more productive Google | searches. I appreciate the basic questions and the generous | nature of many HN commenters who offer thoughtful responses. | feet wrote: | As others have already mentioned, proteins are the machinery of | the cell. They perform an immense array of functions and they | must fold in a certain way to perform these functions. This is | part of what's known as the structure-function relationship. | | Misfolded proteins are contributors to numerous pathological | conditions and the more we can understand about how and why | this folding happens, the better we can treat these conditions. | | Another aspect is that while we can at least partially | determine the primary structure (the amino acid sequence) of | proteins from DNA and RNA, we don't necessarily know their | secondary or tertiary structures (3-dimensional conformation). | This is a key piece of the puzzle for figuring out how these | proteins do their proteiny things and how they interact with | other proteins and even how they form quaternary structures | with other proteins (an assembly of multiple proteins that | performs some function; many pores are assemblies like this). | Once we know these structures and understand how they work on a | structural and chemical level, we can manipulate them far more | easily. | | In order to do rational drug design, which is designing a drug | for a specific target or active site on a protein, we need to | understand these structures. Working to solve protein folding | is a key step in treating disease states and understanding how | cells work on a fundamental level. The impact is hard to | overstate. | turkeygizzard wrote: | My understanding is that protein folding is a major cost | bottleneck in drug design.
| | Researchers can come up with candidate molecule formulas that | might work as good drugs, but the problem is that these | proteins organize/fold themselves physically in a hard-to- | predict way. And how they fold directly affects their | properties as drugs. | | If AlphaFold can accurately predict folding, it'll allow | researchers to prioritize drug candidates more accurately which | will reduce research time and costs. Supposedly the major | pharmaceutical companies can spend up to billions when | designing a single drug. Optimistically, predicting protein | folding better will allow for much more rapid and cheaper drug | development | matheusmoreira wrote: | Proteins are the molecular machines of all living beings on | this planet. They do almost everything. We need to understand | their structure and how that structure changes when they | perform some function. Essentially we have to reverse engineer | nature's machines. | | Biology has traditionally studied these things from the outside | in, top down. Advances in biology are correlated with advances | in microscope technology. We went from seeing with the naked | eye to being able to magnify tissue to see cells and even | what's inside of them. | | Studying biology from the inside out, bottom up is | significantly harder. You have chemistry which is hard science. | On top of that you have biochemistry which is equally rigorous. | Above that level of abstraction things start to get a lot less | well defined. We go from organic chemistry (carbon, hydrogen, | oxygen, etc.) to more complex things found in life (aminoacids, | carbohydrates, lipids) to polymers of those things (proteins, | polysaccharides, triglycerides) to a point where they get so | big and complex their tridimensional structure starts | influencing how they interact with other molecules then we | start seeing aggregates (quaternary proteins) of these complex | things! 
| | Important proteins like ATP synthase are well understood (it's | a friggin' turbine like in a hydroelectric power plant!) but | there are so many others haven't received the same attention. | Bioinformatics technology is changing that. The human genome's | been mapped. Now it's time to map the proteome. | Freebytes wrote: | The applications and importance has been discussed, but let me | explain why what we are doing right now does not work which | will also emphasize the importance of this. | | At this time, we create drugs, test them on animals, and see | what the side effects and results actually are. We are very | limited in our capabilities and basically throw mud at the wall | and see what sticks. This would allow us to try potential drug | candidates without so much randomness. | fabiospampinato wrote: | You are basically made of proteins, which are basically folded | sequences of amino acids, proteins are molecular machines that | are the fundamental building block of animals, plants, | bacteria, fungi, viruses etc. | | So yeah the applications are enormous, from medicine to better | industrial chemical processes, from warfare to food | manufacturing. | jebarker wrote: | > proteins are molecular machines | | Does that imply proteins have some dynamics that need to be | predicted too? I remember seeing animations of molecular | machines that appeared to be "walking" inside the body - are | those proteins or more complex structures? | gilleain wrote: | Yes, very much so. Even for proteins that seems like they | are just scaffolding for a catalytic centre can have | important dynamics. | | A classic example is haemoglobin, that 'just' binds to | oxygen at the iron in the middle of the haem. Other binding | sites remote from the oxygen binding one can bind to other | molecules - notably carbon dioxide. The 'Bohr effect' | mechanism is outlined here : https://en.wikipedia.org/wiki/ | Bohr_effect#Allosteric_interac... 
| | Even at the lowest level, there is some evidence that | ligand binding can affect the structure of the backbone of | the protein. For example, peptide plane flipping | https://en.wikipedia.org/wiki/Peptide_plane_flipping | although I'm not sure where the research is on this | nowadays. | jebarker wrote: | Fascinating. I'm a mathematician turned AI researcher by | day. The more I learn about biology the more I feel that | our human engineering (hard and soft) is trivially simple | compared to what evolution has already created and left | for us to study. | gilleain wrote: | Heh, I'm somewhat the opposite, as I went from a degree | in biochemistry into bioinformatics (protein folds and | structure) and have an amateur interest in maths. | | I often recommend the book 'Cats' Paws and Catapults' by | Steven Vogel, which discusses the differences between | human and biological engineering. There are very | different constraints, as well as the more obvious | difference between intelligently directed design (by | humans) and evolutionary design. | | It's not totally true that our engineering is simpler | than biological systems. After all, we can now manipulate | matter down to the atomic level. There have been attempts | to move towards self-assembling systems, or even self- | repairing ones. Not quite on the same scale or | sophistication as biological systems, of course. | | Is a modern computer chip more complex than a virus? Is a | large connected system of computers more complex than a | mycelial network? Are there more parts connected in more | ways in a nuclear power station as there are in a simple | bacterial cell? | | I think it is definitely worth considering the | similarities and differences. | jebarker wrote: | Definitely interesting comparisons. | | I think my sense that biology is more complex than human | engineering is that our engineering seems much more | homogeneous. 
Computer chips are certainly very dense but | the individual number of different component types are | very small. Biology on the other hand seems to have a | huge number of fundamental building blocks, e.g. | proteins, with different behaviors. I suppose that maybe | that's the wrong comparison though and software is where | our engineering diversity is present. | | It may well just be the case that my lack of familiarity | with biology makes it seem more complex to me. | fabiospampinato wrote: | Yes proteins can be dynamic, and multiple proteins can come | together to form bigger structures (like a cell's membrane | for example), and they can very much interact with other | proteins and molecules in complex ways. | | I think AlphaFold gives us more of a static picture that | doesn't tell us much about how the protein will interact | with the world. Predicting these dynamic aspects, protein- | to-protein interactions, and ultimately an entire cell, is | probably what they'll like to do next. | | I don't know if the walking buddy is a single protein, it | may be a handful of different ones joined together. | andrepd wrote: | https://en.m.wikipedia.org/wiki/Protein_folding | epups wrote: | Proteins are what makes everything in a cell work. They are | produced as a "linear" structure that must fold into a proper | shape to execute its function, such as acting as a pore that | only lets a specific chemical through the cell membrane. | | The importance here is to figure out potential targets for | treatments that take into account particularities of certain | proteins. That could produce better drugs with less side | effects. | flobosg wrote: | This might be an interesting resource for you: | https://pdb101.rcsb.org/ | pbw wrote: | The genome, all of our DNA combined, is just a bunch of 1D | strings like "cgtattctgcttgta". Those strings encode proteins, | which fold up into a 3D shape once created. 
This 3D shape is | what determines what the protein actually does inside the cell. | Without understanding protein folding we don't understand what | the DNA actually does. | swayvil wrote: | I had a dream about this a few days ago. About complexly | wrinkled/crumpled/convolved things. | | Like a fresh crepe stuffed into the toe of a boot. Bewilderingly | complex. | | But I have a question. Does such contortion work for 3d | "membranes" in a 4d space? It's something I'm chewing on. Hard to | casually visualize, obviously. | gspr wrote: | Of course! The term you might wanna start off googling is | "curvature of manifolds". What's even neater than "3d thing | curving in 4d space" is that these notions can be made precise | also without the "in [whatever] space" part (see "intrinsic | curvature" and "Riemannian manifold"). | swayvil wrote: | Thank you very much. | naves wrote: | Just imagine if the tech world puts all programatic advertising | development on hold for a year and the collective brain power is | channeled to science instead... | alphabetting wrote: | _Obtaining this dataset prior to alphafold would have cost on the | order of $200 trillion._ | https://twitter.com/wintonARK/status/1552653527670857729 | | Anyone knowledgeable know if this estimate is accurate? Insane if | true | green-eclipse wrote: | It's impossible to really put a number on it, because the task | itself was impossible. PHDs and the field's top scientists | simply couldn't figure out many complicated protein structures | after years of attempts, and the fact that there's so many | (200M+) mean that the problem space is vast. | dekhn wrote: | It doesn't make any sense on multiple levels. This is a | computational prediction and there was no computational | alternative- for many of these proteins would never have had a | structure solved even if you spent the money. 
They are just | taking $cost_per_structure_solved * | number_of_remaining_structures and assuming that things scale | linearly like that. | | Note that crystallographers are now using these predictions to | bootstrap models of proteins they've struggled to work with, | which indicates the level of trust in the structural community | for these predictions is pretty high. | shauryamanu wrote: | Even if that's exaggerated, it might have taken significant | time to reach this stage. Probably on the order of >50 | years. | [deleted] | crispyambulance wrote: | I got a 5th grader question about how proteins are | used/represented graphically that I've never been able to find a | satisfying answer for. | | Basically, you see these 3D representations of specific proteins | as a crumple of ribbons-- literally like someone ran multi- | colored ribbons through scissors to make curls and dumped it on | the floor (like a grade school craft project). | | So... I understand that proteins are huge organic molecules | composed of thousands of atoms, right? Their special capabilities | arise from their structure/shape. So basically the molecule | contorts itself to a low energy state which could be very complex | but which enables it to "bind?" to other molecules expressly | because of this special shape and do the special things that | proteins do-- that form the basis of living things. Hence the | efforts, like Alphafold, to compute what these shapes are for any | given protein molecule. | | But what does one "do" with such 3D shapes? | | They seem intractably complex. Are people just browsing these | shapes and seeing patterns in them? What do the "ribbons" | signify? Are they just some specific arrangement of C,H,O? Why | are some ribbons different colors? Why are there also thread-like | things instead of all ribbons? | | Also, is that what proteins would really look like if you could | see at sub-optical wavelength resolutions? Are they really like | that?
I recall from school the equipartition theorem-- 1/2 KT of | kinetic energy for each degree of freedom. These things obviously | have many degrees of freedom. So wouldn't they be "thrashing | around" like rag doll in a blender at room temperature? It seems | strange to me that something like that could be so central to | life, but it is. | | Just trying to get myself a cartoonish mental model of how these | shapes are used! Anyone? | dekhn wrote: | The ribbons and helices you see in those pictures are abstract | representations of the underlying positions of specific | arrangements of carbon atoms along the backbone. | | There are tools such as DSSP | https://en.wikipedia.org/wiki/DSSP_(hydrogen_bond_estimation... | which will take the 3d structure determined by | crystallography and spit out the ribbons and helices- for | example, for helices, you can see a specific arrangement of | carbons along the protein's backbone in 3d space (each carbon | interacts with a carbon 4 amino acids down the chain). | | Protein motion at room temperature varies depending on the | protein- some proteins are rocks that stay pretty much in the | same single conformation forever once they fold, while others | do thrash around wildly and others undergo complex, whole- | structure rearrangements that almost seem magical if you try to | think about them using normal physics/mechanical rules. | | Having a magical machine that could output the full manifold of | a protein during the folding process at subatomic resolution | would be really nice, but there would be a lot of data to | process. | crispyambulance wrote: | Thanks, awesome! So what do molecular biologists do with | these 3D representations once they have them? Do they | literally just see how they fit to other proteins? | buntsai wrote: | Often the ribbons (alpha-helices and beta-sheets) form | "protein domains".
Canonically, these are stable, folded | structures with conserved shapes and functions that serve | as the building blocks of proteins, like lego pieces. These | protein domains can be assembled in different ways to form | proteins of different function. Different protein domains | that have the same evolutionary origin have conserved | structure even when the underlying amino acid sequence, or | DNA sequence has changed beyond recognition over millions | of years of evolution. In other words, molecular biologists | use structure as a proxy for function. Looking at how the | same protein domains works in different proteins in | different species can give us clues as to how a protein | might work in human biology or disease. | dekhn wrote: | There are many uses for structure. Personally, I find the | 3d structures to be useful as a mental guide for picturing | things, and certainly people do try to "dock" proteins that | have complementary structures, but unfortunately, the | biophysics of protein complexes suggests that the | conformation change on binding is so large that the | predicted structures aren't super-helpful. | | Certainly, in a corpo like mine (Genentech/Roche) protein | structures have a long history of being used in drug | discovery- not typically a simple "dock a ligand to a | protein" but more for constructing lab experiments that | help elucidate the actual mechanistic biology going on. | That is only a tiny part of a much larger process to work | on disease targets to come up with effective treatments. | Genentech is different from most pharma in that their | treatments are themselves typically proteins, rather than | small molecules. | cletus wrote: | The ability for another molecule (probably another protein) to | "react" or interact with the protein depends not only on the | chemistry but also the shape. An otherwise compatible sequence | of atoms might not be able to react because it and the binding | site are just incompatibly shaped. 
| | This is hugely important for developing drugs and vaccines. | | To see the effect of this look no further than prions. Prions | are the exact same proteins, just folded in weird ways. | Worse, they can "transmit" this misfolded shape to other | otherwise normal proteins. Prions behave differently just | because of the different shape and can lead to disease. This is | exactly what Mad Cow Disease (BSE) is. | | What we get taught in high school about chemistry is incredibly | oversimplified. | | One example of this I like is the geometry of a water molecule. | When we first learn about atoms, we learn the "solar system" | model (aka Bohr). The reality is instead that we have 3D | probability distributions of where electrons might be. These | clouds are in pairs. I believe this is to do with the inverted | wavefunction, but really we're getting beyond my knowledge of | quantum mechanics here so that's just a guess. | | Well those clouds additionally form valence shells. We learn | about these and how atoms want to form complete valence | shells. So Oxygen has 8 electrons i.e. 4 pairs of electrons. When | bonding with 2 hydrogen atoms we end up with a weird geometry | of ~104.5 degrees between the two hydrogen atoms because of how | these pairs interact. The naive assumption might expect that | the two hydrogen atoms are 180 degrees apart. | | So back to proteins, you may have learned about hydrogen bonds. | This affects molecular shape because when a hydrogen atom | shares an electron, it is often positively charged. That | positive charge pushes away other positive charges. This is the | real difficulty in protein folding because with a molecule of | thousands of atoms and weird geometry you may find distant | parts of the molecule interacting with hydrogen bonds. | | So a single cell contains thousands (IIRC) of different | proteins. Figuring out those interactions is important but | incredibly difficult.
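To make the water geometry in the comment above concrete, here is a minimal sketch that compares the bent (~104.5 degree) arrangement with the naive 180 degree one by computing the distance between the two hydrogens. The O-H bond length of 0.9572 angstrom is an assumption (a commonly quoted gas-phase value, not something stated in the thread), and the function name `hh_distance` is made up for illustration.

```python
import math

def hh_distance(bond_length, angle_deg):
    """Distance between the two hydrogens of an H-O-H molecule.

    The two O-H bonds form an isosceles triangle with the H-H line,
    so d = 2 * r * sin(theta / 2).
    """
    half_angle = math.radians(angle_deg) / 2.0
    return 2.0 * bond_length * math.sin(half_angle)

OH_BOND = 0.9572  # angstrom; assumed value, not from the thread

bent = hh_distance(OH_BOND, 104.5)    # geometry quoted in the comment
linear = hh_distance(OH_BOND, 180.0)  # the "naive" straight-line guess

print(f"H-H distance, bent (104.5 deg): {bent:.3f} A")
print(f"H-H distance, linear (180 deg): {linear:.3f} A")
```

In the bent case both hydrogens sit on the same side of the oxygen, which is what gives water its dipole; in the linear case they would sit on opposite sides and the charges would cancel.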
| comicjk wrote: | All of the loops and swirls are summary representations of | known atomic positions: really, knowing a protein structure | means knowing the position of every atomic nucleus, relative to | the nuclei, down to some small resolution, and assuming a low | temperature. | | The atoms do wiggle around a bit at room temperature (and even | more at body temperature), which means that simulating them | usefully typically requires sampling from a probability | distribution defined by the protein structure and some prior | knowledge about how atoms move (often a potential energy | surface fitted to match quantum mechanics). | | There are many applications of these simulations. One of the | most important is drug design: knowing the structure of the | protein, you can zoom in on a binding pocket and design a set | of drug molecules which might disable it. Within the computer | simulation, you can mutate a known molecule into each of your | test molecules and measure the change in binding affinity, | which tells you pretty accurately which ones will work. Each of | these simulations requires tens of millions of samples from the | atomic probability distribution, which typically takes a few | hours on a GPU given a good molecular dynamics program. | bckr wrote: | > Are people just browsing these shapes and seeing patterns in | them | | That's one approach. | | The thing to understand is that proteins form "binding sites": | areas that are more likely to attract other particular regions | of proteins or other molecules, or even atoms. Think about | hemoglobin. The reason it holds onto oxygen atoms is because it | has binding sites. | | Binding sites are great because they represent more freedom to | do things than molecules typically have. Normal chemistry | consists of forming strong electronic bonds between atoms, or | forming rigid lattices/crystals. 
| | Binding sites allow molecules to do things like temporarily | attach to each other and let each other go under certain | circumstances, for instance when another binding site is | active/inactive. This can happen through "conformation change", | where a molecule bound/unbound on some binding site makes the | protein change shape slightly. This is how proteins can act | like machines. | | > What do the "ribbons" signify | | Different regions of the protein have different sequences of | amino acids. Amino Acids have somewhat different shapes from | each other. The ribbons are actually broader than the spindles | (or threads), and less flexible. Not sure about the different | colors, maybe someone else can fill in. | | > Also, is that what proteins would really look like if you | could see at sub-optical wavelength resolutions? | | Not really, it's an abstraction. They're big molecules, so if | you look closely they're made of atoms, which are (kinda, sorta | not really, quantum stuff) spherical. | | > So wouldn't they be "thrashing around" like rag doll in | blender at room temperature? | | Yes, but the attractions between the different parts of the | molecule keeps it somewhat under control. So more like an | undulating little creature, jellyfish perhaps. | | > It seems strange to me that something like that could be so | central to life | | Yep, gotta remember that it's all statistical. These things are | getting made, do their job, breaking, and getting degraded some | insane number of times per second. Swarm behavior, sort of. | ghkbrew wrote: | In addition to /u/dekhn 's excellent description, this | phenomenon is referred to as a protein's "secondary structure" | [0] | | [0] https://en.m.wikipedia.org/wiki/Protein_secondary_structure | biofox wrote: | Short answer is that the ribbon representation is a visual | simplification based on known structures -- they are actually | composed of atoms. 
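For the thermal-motion question in this subthread, a rough back-of-the-envelope sketch may help. It computes the thermal energy scale kT (per mole) at room temperature and the Boltzmann weight of a state that costs 1 or 2 kcal/mol, the hydrogen-bond range quoted elsewhere in the thread. The room temperature of 298.15 K and the per-mole units are assumptions, and the helper names are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)
T_ROOM = 298.15       # kelvin; assumed "room temperature"

def thermal_energy_kcal(temperature_k):
    """kT expressed per mole (i.e. RT), in kcal/mol."""
    return R_KCAL * temperature_k

def boltzmann_weight(energy_kcal, temperature_k):
    """Relative weight exp(-E / kT) of a state costing energy_kcal."""
    return math.exp(-energy_kcal / thermal_energy_kcal(temperature_k))

kt = thermal_energy_kcal(T_ROOM)
print(f"kT at {T_ROOM} K: {kt:.3f} kcal/mol (1/2 kT = {kt / 2:.3f})")

# Hydrogen-bond energies of roughly 1-2 kcal/mol, as quoted in the thread.
for energy in (1.0, 2.0):
    weight = boltzmann_weight(energy, T_ROOM)
    print(f"E = {energy} kcal/mol is {energy / kt:.1f} kT, weight {weight:.3f}")
```

Under these assumptions a hydrogen bond is only a few kT, so thermal fluctuations large enough to break one are not rare, which fits the picture of constrained but constant jiggling described in this subthread.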
| | They certainly do "thrash around", but that thrashing is | constrained by the bonds that are formed, which greatly limits | the degrees of freedom. Here's a short video of a simulation to | demonstrate: | | https://www.youtube.com/watch?v=fggqPtaZj8g | paraph1n wrote: | > I recall from school the equipartition theorem-- 1/2 KT of | kinetic energy for each degree of freedom. These things | obviously have many degrees of freedom. So wouldn't they be | "thrashing around" like rag doll in a blender at room | temperature? | | It's funny you say that, because the first image on the English | Wikipedia page for Equipartition Theorem[1] is an animation of | the thermal motion of a peptide. | | [1]: https://en.wikipedia.org/wiki/Equipartition_theorem | dekhn wrote: | BTW, in terms of protein dynamics, before you even think | about the thrashing around- 1.2kt at room temperature is | enough to form and break hydrogen bonds in real time (around | 1-2kcal) so presumably, protein h-bonds are breaking and | reforming spontaneously at scale. | sabujp wrote: | Many thanks to Deepmind for releasing predicted structures of all | known protein _monomers_. What I 'd like next is for Alphafold | (or some other software) to be able to show us multimeric | structures based on the single monomer/subunit predictions and | protein-protein interactions (i.e. docking). For example the one | I helped work on back in my structural biology days was the | circadian clock protein KaiC : | https://www.rcsb.org/structure/2GBL, that's the "complete" | hexameric structure that shows how each of the subunits pack. 
The | prediction for the single monomer that forms a hexamer is very | close to the experimental | https://alphafold.ebi.ac.uk/entry/Q79PF4 and in fact shows the | correct structure of AA residues 500 - 519 which we were never | able to validate until 12 years later | (https://www.rcsb.org/structure/5C5E) when we expressed those | residues along with another protein called KaiA which we knew | binds to the "top" CII terminal (AAs 497-519) of KaiC. If we | would have had this data then, it would have allowed us to not | only make better predictions about biological function and | protein-protein interactions but would have helped better guide | future experiments. | | What we can do with this data now is use methods such as cryo-em | to see the "big picture", i.e. multi-subunit protein-protein | interactions where we can plug in the Alphafold predicted | structure into the cryo-em 3d density map and get predicted | angstrom level views of what's happening without necessarily | having to resort to slower methods such as NMR or x-ray | crystallography to elucidate macromolecular interactions. | | A small gripe about the alphafold ebi website: it doesn't seem to | show the known experimental structure, it just shows | "Experimental structures: None available in PDB". For example the | link to the alphafold structure above should link to the 2GBL, | 1TF7, or any of the other kaic structures from organism PCC7942 | at RCSB. This would require merging/mapping data from RCSB with | EBI and at least doing some string matching, hopefully they're | working on it! | inspirerhetoric wrote: | Does anyone know what it would cost to download this whole | dataset? Google Cloud Datasets only allow 1 TB/month for free to | download, I believe | [deleted] | bifftastic wrote: | How do they know their structures are correct? | tomrod wrote: | This is the right line of questioning. | | As we solve viewability into the complex coding of proteins, we | need to be right. 
Next, hopefully, comes causal effect | identification, then construction ability. | | If medicine can use broad capacity to create bespoke proteins, | our world becomes both weird and wonderful. | __rito__ wrote: | They won a decades-long standing challenge predicting the | protein structures of a much smaller (yet significantly quite | large) set of proteins using a model (AlphaFold). | | Then they use the model to predict more. | | Although we don't know if they are _correct_, these structures | are the best (or the least bad) we have for now. | ArnoVW wrote: | We know the structure of some proteins. It's not that it's | impossible to measure, it's just very expensive. This is why | having a model that can "predict" it is so useful. | luma wrote: | Same as any other prediction I'd presume. Run it against a | known protein and see how the answer lines up. Predict the | structure of an unknown protein, then use traditional methods | (x-ray crystallography, maybe STEM, etc) to verify. | iandanforth wrote: | "Verify" is almost correct. The crystallography data is taken | to be "ground truth" and the predicted protein structure from | AlphaFold is taken to be a good guess starting point. Then | _other_ software can produce a model that is a best fit to | the ground truth data starting from the good guess. So even | if the guess is wrong in detail it's still useful to reduce | the search space. | christudor wrote: | This is exactly right. | gilleain wrote: | As a simple example, one measure used to compare a predicted | structure against a reference is the RMSD (root mean square | deviation). | | https://en.m.wikipedia.org/wiki/Root-mean-square_deviation_o... | | The lower the RMSD between two structures, the better (up to | some limit). | lrem wrote: | Disclaimer: I work in Google, organizationally far away from | DeepMind and my PhD is in something very unrelated. | | They can't possibly know that.
What they know is that their | guesses are very significantly better than the previous best | and that they could do this for the widest range in history. | Now, verifying the guess for a single (of the hundreds of | millions in the db) protein is up to two years of expensive | project. Inevitably some will show discrepancies. These will be | fed to regression learning, giving us a new generation of even | better guesses at some point in the future. That's what I | believe to be standard operating practice. | | A more important question is: is today's db good enough to be a | breakthrough for something useful, e.g. pharma or agriculture? | I have no intuition here, but the reporting claims it will be. | f38zf5vdt wrote: | The press release reads like an absurdity. It's not the | "protein universe", it's the "list of presumed globular | proteins Google found and some inferences about their | structure as given by their AI platform". | | Proteins don't exist as crystals in a vacuum, that's just how | humans solved the structure. Many of the non-globular | proteins were solved using sequence manipulation or other | tricks to get them to crystallize. Virtually all proteins | exist to have their structures interact dynamically with the | environment. | | Google is simply supplying a list of what it presumes to be | low RMSD models based on their tooling, for some sequences | they found, and the tooling is based itself on data mostly | from X-ray studies that may or may not have errors. Heck, | we've barely even sequenced most of the DNA on this planet, | and with methods like alternative splicing the transcriptome | and hence proteome has to be many orders of magnitude larger | than what we have knowledge of. | | But sure, Google has solved the structure of the "protein | universe", whatever that is. 
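For readers unfamiliar with the RMSD figure that keeps coming up in this subthread, here is a minimal sketch of the calculation. It assumes the two structures are already superimposed and that atoms are listed in the same order; real comparisons normally do an optimal superposition first, which is left out here. The function name `rmsd` and the toy coordinates are made up for illustration and are not taken from AlphaFold or any structure library.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) points.

    Assumption: the structures are already superimposed and atom i in
    coords_a corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    if not coords_a:
        raise ValueError("need at least one atom")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between the paired atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy coordinates in angstroms (invented, not real protein data).
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (1.5, 1.0, 0.0), (3.0, 0.0, 2.0)]
print(rmsd(reference, predicted))  # about 1.29
```

Identical structures give 0, and the value grows as paired atoms drift apart, which is why a lower RMSD against an experimental structure is read as a better prediction.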
| dekhn wrote: | People have been making grand statements about the | structure of the protein universe for quite some time (I've | seen a fair number of papers on this, such as | https://faseb.onlinelibrary.wiley.com/doi/abs/10.1096/fasebj... and | https://faseb.onlinelibrary.wiley.com/doi/abs/10.1096/fasebj... | from a previous collaborator of mine). | | Google didn't solve the structure of the protein universe | (thank you for saying that). But the idea of the protein | structure universe is fairly simple: it's a latent space | that allows for direct movement over what are presumably the | rules of protein structures along orthogonal directions. It | would encompass all the "rules" in a fairly compact and | elegant way. Presumably, superfamilies would automagically | cluster in this space, and proteins in different | superfamilies would not. | lrem wrote: | I recognize your superior knowledge in the topic and assume | you're right. | | But you also ignore where we're at in the standard cycle: | | https://phdcomics.com/comics/archive_print.php?comicid=1174 | | ;) | f38zf5vdt wrote: | That's exactly what this is, but it's embarrassing that | it's coming from somewhere purported to be a lab. Any of | the hundreds or more of labs working in protein structure | prediction for the past 50 years could have made this | press release at any time and said, "look, we used a | computer and it told us these are the structures, we | solved the protein universe!" | | It's not to diminish the monumental accomplishment that | was the application of modern machine learning techniques | to outpace structure prediction in labs, but other famous | labs have already moved to ML predictions and are | competitive with DeepMind now. | VikingCoder wrote: | > but other famous labs have already moved to ML | predictions and are competitive with DeepMind now. | | That's great! AlphaFold DB has made 200 million structure | predictions available for everyone.
How many structure | predictions have other famous labs made available for | everyone? | f38zf5vdt wrote: | As many as you wanted to throw at them, considering the | vast majority are open source and could be run on your | own server cluster. CASP15 is ongoing so by the end of | the year we will know how much absolute progress has been | made by others. | | Google has the advantage of the biggest guns here: the | fastest TPUs with the most memory in the biggest | clusters, so running inference with a massive number of | protein sequences is much easier for them. | dekhn wrote: | Many teams have been predicting structures for enormous | numbers of proteins for some time now. It's just that | nobody had any confidence in the predictions. | gfodor wrote: | > but other famous labs have already moved to ML | predictions and are competitive with DeepMind now. | | it seems obvious this was going to happen, because | https://github.com/deepmind/alphafold | gilleain wrote: | edit: I should have read the post first! What do you mean | 'only globular proteins'? They say they have predictions | for all of UniProt... | | --------------- | | Yes, the idea of a 'protein universe' seems like it should | at least encompass 'fold space'. | | For example, WR Taylor: | https://pubmed.ncbi.nlm.nih.gov/11948354/ | | I think the rough estimate was that there were around 1000 | folds - depending on how fine-grained you want to go. | | Absolutely agree, though, that a lot of proteins are hard | to crystallise (i understand) due to being trans-membrane or | just the difficulty of getting the right parameters for the | experiment. | f38zf5vdt wrote: | I don't think non-globular proteins are well represented | by the predictions.
All our predictions for proteins are | based on proteins we were able to crystallize, so my | guess is that even if many of them aren't globular | proteins the predictions themselves are made from the | foundations of structures we do have, which are | predominantly globular proteins and it's presumed that | the inference treats folding as if they were globular and | crystallized (non-dynamic). X-ray crystallography and | fitting to electron density maps itself is a bit of an | art form. | | For example for transmembrane proteins, there is a gross | under-representation of structures derived from | experimental evidence, so we would expect that whatever | your algorithm is "solving" is going to have a much | higher degree of error than globular proteins, and likely | artifacts associated with learning from much more | abundant globular proteins. | | edit: As an example, "Sampling the conformational | landscapes of transporters and receptors with | AlphaFold2". AF2 was able to reproduce the alternative | conformations of GPCRs, but only with non-default | settings. With default settings there is clear evidence | of overfitting. | | > Overall, these results demonstrate that highly accurate | models adopting both conformations of all eight protein | targets could be predicted with AF2 by using MSAs that | are far shallower than the default. However, because the | optimal MSA depth and choice of templates varied for each | protein, they also argue against a one-size-fits-all | approach for conformational sampling. | gilleain wrote: | Fair point. I guess if their training data is biased | towards existing known structures (via xray or nmr or | whatever) then there is the risk of incorrect | predictions. | | At a guess, the core packing in non-globular proteins | might be different? Also the distribution of secondary | structure might also vary between classes. Might be worth | someone studying how much structural constraints depend | on fold (if they have not already). 
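The quoted passage above talks about running predictions with MSAs "far shallower than the default". As a rough illustration of what making an alignment shallower can mean, here is a sketch that randomly subsamples an MSA while always keeping the query sequence. This is an assumption for illustration only: `subsample_msa` is a hypothetical helper, the toy alignment is invented, and nothing here is claimed to be the procedure used in the quoted paper or an AlphaFold API.

```python
import random


def subsample_msa(msa, depth, seed=0):
    """Return a shallower MSA with at most `depth` sequences.

    Assumptions: `msa` is a list of aligned sequence strings and msa[0]
    is the query, which is always kept. The rest are drawn at random.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if not msa:
        raise ValueError("MSA must contain at least the query sequence")
    query, others = msa[0], msa[1:]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    keep = min(depth - 1, len(others))
    return [query] + rng.sample(others, keep)


# Toy alignment (invented sequences, gaps shown as '-').
toy_msa = ["MKTAYIAK", "MKTAYLAK", "MRTA-IAK", "MKSAYIAR", "MKTGYIAK"]
shallow = subsample_msa(toy_msa, depth=3)
print(len(shallow), shallow[0])
```

Varying `depth` and `seed` per target would be one way to explore the point made in the quote that no single depth works for every protein.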
| christudor wrote: | This video goes some way to explaining how they know the | structures are correct: | https://www.youtube.com/watch?v=vXZzftX03VY | DevX101 wrote: | They compare the predicted structure (computed) to a known | structure (physical x-ray crystallography). There's a biennial | competition CASP (Critical Assessment of protein Structure | Prediction) that does X-Ray crystallography on a protein. The | identity of this protein is held secret by the organizers. Then | research teams across the world present their models and | attempt to predict, without advance knowledge, the structure of | the protein from its amino acid sequence. Think of CASP as a | validation data set used to evaluate a machine learning model. | | DeepMind crushes everyone else at this competition. | liuliu wrote: | The worry is about dataset shifting. Previously, the data | were collected for a few hundred thousand structures, now | it is 200m. I think there could be doubts on distributions | and how that could play a role in prediction accuracy. | seydor wrote: | they don't but they are more correct than what others have | predicted. Some of their predictions can be compared with | structures determined with x-ray crystallography | cupofpython wrote: | did they come up with their structures independently of the | x-ray crystallography, or was that part of a ML dataset for | predicting structure | unlikelymordant wrote: | The CASP competition that they won consists of a bunch of | new proteins, the structures of which haven't been published. | So the test set is for brand new proteins in that case. | cupofpython wrote: | nice, very cool then | gz5 wrote: | >we're now releasing predicted structures for nearly all | catalogued proteins known to science | | is the result that researchers will now much more quickly | 'manually' validate or invalidate the predicted structures for | proteins they are working with?
i understand it is traditionally | a long and complex process, but i imagine it is expedited by | having a predicted structure to test as the baseline? | epups wrote: | AlphaFold is a phenomenal tool that demonstrates how AI can | already outclass humans for certain tasks. It is a prime example | of a problem space where conventional approaches are simply | inferior, and that AI is not just a fancy name but can be | extraordinarily powerful. | hackernewds wrote: | This reads like something created by an AI. And the mere fact | that I can't be sure if it is or not now is testament to the | statement this is making | desindol wrote: | It's the superstition and the amount of parasite words. | jcims wrote: | Just for shits & giggles. | | Prompt: | | Write a short abstract for an article on why AlphaFold isn't | just a tremendous technological achievement, but an example | of how Artificial Intelligence will allow humanity to | continue to achieve technological progress in the future: | | Response: | | When DeepMind's AlphaFold was announced as the winner of the | CASP13 protein folding prediction competition, it was hailed | as a major breakthrough in both artificial intelligence (AI) | and biology. | | While AlphaFold is undoubtedly a remarkable achievement, it | is also a clear example of how AI can be used to solve | complex problems that have traditionally been seen as beyond | the reach of machines. | | In particular, AlphaFold represents a significant advance in | the ability of AI systems to learn from limited data. Unlike | previous AI approaches to protein folding, which relied on | large datasets of known protein structures, AlphaFold was | able to produce accurate predictions with only a small number | of known protein structures to work from. 
| | This ability to learn from limited data is likely to be | increasingly important as we face problems that are too | complex for us to fully understand and collect all the data | we would need to solve them using traditional methods. | | In the future, AI systems like AlphaFold will become | increasingly important in helping us to make progress on a | wide range of pressing challenges, from developing new | medicines to tackling climate change. | beanjuice wrote: | To add to this: | | Prompt: Write a hackernews comment about why artificial | intelligence will never be profitable | | Response: Artificial intelligence will never be profitable | because the cost of developing and maintaining the hardware | and software is too high. Also, the market for AI is still | too small and the demand is not high enough to justify the | costs. | TremendousJudge wrote: | Almost perfect, needs a bit more snark | btdmaster wrote: | Prompt: Write a hackernews comment about why artificial | intelligence will never be profitable | | Response: Artificial intelligence will never be | profitable because the cost of developing and maintaining | the hardware and software is too high. Also, the market | for AI is still too small and the demand is not high | enough to justify the costs. | | Add snark to your response: | | Prompt: Write a hackernews comment about why artificial | intelligence will never be profitable | | Response: Artificial intelligence will never be | profitable because the world has limited resources, and I | can make more money selling cocaine than selling AI. | | From GPT-NeoX 20B | TremendousJudge wrote: | Well, it's not wrong. But I think that comment would get | flagged, it parses too obviously an off-topic joke | epups wrote: | It's a new tool, AlphaFool ___________________________________________________________________ (page generated 2022-07-28 17:00 UTC)