Ok, best prediction page

Author	Message
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0	Message 1070 - Posted: 7 Oct 2005, 17:12:23 UTC In my struggle with the forces of darkness in my understanding of Rosetta ...
In the "best" graph we see the coverage dots over most of the x/y range, with the best shown in blue. Ok, fine. Why did we not begin to approach the actual? What I mean is, we got close, but not that close. Did we get caught in a local minimum?
I guess what I am trying to ask is how we got to the end state. The graph has no context, but in its current state it seems to me to show abject failure. So, for example, if we started to throw out work units over the range of the x/y coordinate space, why did we not have coverage over the entire range? Would this graph be better represented as a three-dimensional space with the energy state being the z axis?
Still drowning ... I read the referred-to Wiki pages and as best as I can tell they are written with the full intent of not conveying meaning to anyone ... then again, maybe it is just me ...

Keith E. Laidig Volunteer moderator Project developer Joined: 1 Jul 05 Posts: 154 Credit: 117,189,961 RAC: 0	Message 1117 - Posted: 8 Oct 2005, 16:56:14 UTC Last modified: 8 Oct 2005, 17:00:25 UTC Briefly: The methodology used by Rosetta to search the energy hypersurface for the protein is a variation of the Monte Carlo search method, analogous to the method known as simulated annealing. Simply put, you start with a randomized structure and then start 'optimizing' the structure, moving this and rearranging that to reduce the energy, until you reach the best energy you can. The final energy is the red cross you see on the plot. So, what you see on those diagrams is the end points of thousands of searches. [If you plotted all the energies of all the structures you'd see a huge mess stretching far and wide off the right and way up over the top of the plot.]
This starting randomization makes sure you avoid searching only the immediate locality. The process can be thought of as 'heating' the molecule up and then cooling it back down again, i.e. annealing, in the hopes that you move out of a local low spot and find the best low spot {which is presumed to be the active 'native' state of the protein - more on that in another post}. So a complete study can be thought of as repeatedly heating and cooling the molecule to allow it to explore all around the energy surface.
The reason that you don't see a smooth blob all the way down to the best result is an interesting thing. It has been presumed (by the science folk) that the 'real' energy surface of the protein is a funnel shape {see Figure 1.3 of this page for an example}. What the Baker group has only recently found using their 'energy function' is that their energy surface around the best result is a wide flattish area with the best structures being down a narrow well. What?
Here's your thought experiment for today: Imagine a largish putting green, gently rolling, with a cup somewhere in there. Your job is to kick the ball from the edge into the cup while wearing a blindfold. So you have to kick a large number of balls all over the place and hope for the best. Although I'm not sure how DK has partitioned the job, we can presume that each WU is one kick of the protein 'from the rough'. That's why we need a lot of these 'trajectories' to ensure that we've covered the surface well enough to find the best structure.
There are other issues that are raised here: Why would the Baker energy surface be different from the 'real' surface? What is this 'energy' and how do you calculate it? Why do you assume that the best possible structure is the 'native' structure? How do you know you've found the best possible surface? Is this crazy method really the best way to do this? But I've got yard work waiting - so this will have to wait until later....
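The heat-and-cool search with random restarts described in this post can be sketched in a few lines of Python. This is only a toy under loud assumptions: `toy_energy` is a made-up one-dimensional "putting green" with a single narrow cup, and the step size, temperature schedule, and bounds are arbitrary illustration values. None of it is Rosetta's actual energy function or move set.

```python
import math
import random


def toy_energy(x):
    """Hypothetical 1-D 'putting green': gentle bumps plus one narrow cup at x = 2."""
    green = 0.1 * math.sin(3.0 * x)                      # the gently rolling green
    cup = -5.0 * math.exp(-((x - 2.0) / 0.05) ** 2)      # narrow, deep well
    return green + cup


def anneal(rng, steps=2000, t_start=1.0, t_end=0.01, lo=-10.0, hi=10.0):
    """One trajectory: random start, then Metropolis moves while cooling."""
    x = rng.uniform(lo, hi)          # random starting point, within constraints
    e = toy_energy(x)
    for i in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (i / (steps - 1))
        cand = min(hi, max(lo, x + rng.gauss(0.0, 0.2)))
        e_cand = toy_energy(cand)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if e_cand <= e or rng.random() < math.exp(-(e_cand - e) / t):
            x, e = cand, e_cand
    return x, e                      # only the end point is reported


def run_searches(n, seed=0):
    """Many independent 'kicks'; each returns one (position, energy) end point."""
    rng = random.Random(seed)
    return [anneal(rng) for _ in range(n)]


if __name__ == "__main__":
    ends = run_searches(200)
    best_x, best_e = min(ends, key=lambda p: p[1])
    print(f"best end point: x={best_x:.3f}, energy={best_e:.3f}")
```

Each call to `anneal` plays the role of one kick from the rough: only the end point is kept, which is why a plot of the results shows end points and nothing about how each search got there.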

Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0	Message 1124 - Posted: 8 Oct 2005, 19:58:34 UTC Still paddling, and we should also capture this discussion :)
But I thought I had read that the work units were created by assuming a "surface" with contours on which we would be searching for the minimum. To do that search we did a large number of random "drops" over the surface from which we would begin the calculation processes. So, the essence of my question is that, from appearances, we did a drop, but the subsequent drops were made in relation to the originals and not really randomly across the entire surface. THUS, we seem to have been caught in a local minimum and did not climb out to find that we "missed" the real minimum.
Perhaps I am reading too much into the single state, because we do not have the "history" of the growth of the result state as shown. Hmm, new question: is THAT data captured? In other words, are you looking at the way that the process evolves? To me that is almost an even more interesting question, rising as it does from the life-long study of data and its patterns (usually in non-typical fields and manners). In other words, what would a "movie" of the growth of the "dots" on the screen look like?

Keith E. Laidig Volunteer moderator Project developer Joined: 1 Jul 05 Posts: 154 Credit: 117,189,961 RAC: 0	Message 1125 - Posted: 8 Oct 2005, 20:35:31 UTC The growth of the top-predictions diagrams would look like rain falling; a drop here, a drop there, building over time....
The resulting top-predictions diagrams can't be viewed as having a simple relationship to the energy surface. The RMSD metric is the root mean squared deviation of all the atoms of the predicted structure in comparison to the 'desired' result. This metric flattens things drastically, since one can have two different structures that have the same RMSD. This is why there is a vertical spread in energies for a given RMSD - many structures just as 'similar' to the desired result but some with better energies than others.
The distribution of starting structures is quite broad (which would be shown on the RMSD vs E diagram as a broad blob of large RMSD and less negative energy). In fact, the Monte Carlo algorithm uses a random starting point for the start of each local search (random within constraints - having all the atoms 100 miles apart probably wouldn't be much use). It is this random (stochastic) component that is designed to move the problem from the local area. You generate a random starting structure, optimize it, generate a new random starting structure, optimize it, and so on and so on. Rosetta has been shown to generate quite a broad range of starting structures in general.
Now, is it possible that you don't generate the one starting structure random enough to reach the true minimum? Sure! But as I pointed out in the earlier post, the group has found that - in general - the 'correct' result is surrounded by a broad plateau and the best prediction is a narrow well {in fact, the nature of the broad plateau and the similarity between the structures found there is used as a metric for testing the success of the search if the well hasn't been found}.
Since the broad plateau is roughly flat, there isn't much information to provide a systematic approach to finding a better result....everything is roughly the same... So it boils down to having enough starting structures that hit the plateau that you find the narrow well....
More questions: Are we certain that the well even exists using our potential for all proteins or just some?
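The remark that two different structures can share the same RMSD is easy to check with a small Python sketch. It assumes the structures are already superimposed on the reference (a real comparison would first find the best rigid-body alignment), and the three-atom coordinates are invented purely for illustration.

```python
import math


def rmsd(coords, reference):
    """Root mean squared deviation over all atoms.

    RMSD = sqrt( (1/N) * sum_i |r_i - r_i_ref|^2 )
    Assumes both structures are already superimposed; no alignment is done here.
    """
    if len(coords) != len(reference):
        raise ValueError("structures must have the same number of atoms")
    squared = sum(
        (a - b) ** 2
        for atom, ref_atom in zip(coords, reference)
        for a, b in zip(atom, ref_atom)
    )
    return math.sqrt(squared / len(reference))


# Made-up three-atom 'reference' and two different perturbations of it.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
structure_a = [(1.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]  # first atom moved along x
structure_b = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 1.0, 0.0)]  # last atom moved along y

if __name__ == "__main__":
    print(rmsd(structure_a, reference), rmsd(structure_b, reference))
```

Both perturbed structures land at the same x-position on an energy vs RMSD plot even though they are different shapes, so their energies are free to differ. That is the vertical spread described in the post.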

Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0	Message 1130 - Posted: 8 Oct 2005, 22:58:35 UTC - in response to Message 1125.
The growth of the top-predictions diagrams would look like rain falling; a drop here, a drop there, building over time....
I need to think about the rest of this ... But one of the best tests I ever saw for random number generators did something along this line. You "pulled" numbers from the generator and then plotted a dot at the x/y coordinate on the screen. Bad random number generators would, after a few seconds, "freeze". For example, the Apple II's RND function had a period of 17,000 and change (that will also tell you how long ago I was looking at this ...). The best I saw filled the screen. For the ones with long periods you could either query the dot and increment its color number, or pull another number and plot a random color from the list of colors. Watching the "twinkling" told you a lot ...
Getting back to your explanation, I would have expected the position of the "funnel" to be in the near vicinity of the actual structure. But we did not seem to get close enough to begin to fall down the slope ... which is back to the question: is there something wrong with our raindrops? The constraints?
Paul? :)
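The dot-plotting test for random number generators described in this post can be imitated without a screen by counting which pixels ever get lit. In this Python sketch, `TinyLCG` is a deliberately poor, hypothetical generator with a period of only 1024 (its parameters are made up and have nothing to do with the Apple II's RND), and the 32x32 "screen" is an arbitrary choice.

```python
import random


class TinyLCG:
    """Deliberately poor linear congruential generator (hypothetical parameters).

    With a=5, c=3, m=1024 the state cycles through all 1024 values and then repeats.
    """

    def __init__(self, seed=1, a=5, c=3, m=1024):
        self.state, self.a, self.c, self.m = seed % m, a, c, m

    def random(self):
        self.state = (self.a * self.state + self.c) % self.m
        return self.state / self.m       # value in [0, 1)


def lit_pixels(gen, draws, size=32):
    """Pull two numbers per dot and record which (x, y) pixels are ever lit."""
    lit = set()
    for _ in range(draws):
        x = int(gen.random() * size)
        y = int(gen.random() * size)
        lit.add((x, y))
    return lit


if __name__ == "__main__":
    for draws in (512, 5000, 50000):
        poor = len(lit_pixels(TinyLCG(), draws))
        good = len(lit_pixels(random.Random(0), draws))
        print(f"{draws:6d} dots: poor generator lit {poor}, Python's generator lit {good}")
```

Because each dot uses two numbers, the poor generator repeats its dots after 512 draws and the picture "freezes" with at most half of the 1024 pixels lit, while a long-period generator keeps filling in the screen.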

Keith E. Laidig Volunteer moderator Project developer Joined: 1 Jul 05 Posts: 154 Credit: 117,189,961 RAC: 0	Message 1134 - Posted: 9 Oct 2005, 2:58:05 UTC - in response to Message 1130.
Getting back to your explanation, I would have expected the position of the "funnel" to be in the near vicinity of the actual structure. But we did not seem to get close enough to begin to fall down the slope ... which is back to the question: is there something wrong with our raindrops? The constraints?
Perhaps I wasn't clear enough. The randomness of the initial portion of the search is not shown in any direct way in the plot of energy vs RMSD. Since one only sees the final structure from each 'search', you don't know from whence each end point came. In fact, one could have a large number of starting points end at the same end point. Take, for example, the situation in which a particular basin of the potential is searched with a large number of different starting points, all ending at the same 'best' energy structure at the bottom. None of these searches would be evident in the E vs. RMSD plot.
In fact, the Rosetta potential doesn't show a funnel; rather, it is a sharp, narrow hole in the energy in the immediate vicinity of the best structure - hence the analogy of the putting green representing the region about the best prediction and the narrow cylinder of the cup representing the narrow region of lower energy immediately about the best structure.
The plotting of the energy vs RMSD is only one way to look at the results. It is commonly done for historical reasons; namely, until the recent developments by P. Bradley and DB, the algorithm didn't find the narrow wells except in special cases. All they had was the broad plateaus. Subsequent statistical analyses were used to find the best predicted structure within the distributed results, and the E vs RMSD view of the results proved useful in that context....so, they're used to looking at things that way.
Lastly, that the potential doesn't result in the experimental observation is a statement about the accuracy of the potential. This is an area of endless work - with no end in sight. Is the 33% success rate for modest proteins a shortcoming of the potential or a shortcoming in the search or both? We will all find out as our project continues. Science is messy - even the computational kind.
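The point in this post that many different starting points can end at the same end point, leaving no trace of the individual searches in the final plot, can be shown with a very small Python sketch. It swaps in plain gradient descent on a made-up single-basin energy, (x - 1)^2, purely for illustration; it is not Rosetta's Monte Carlo search or its potential.

```python
import random


def descend(x, steps=200, rate=0.1):
    """Plain gradient descent on the toy basin E(x) = (x - 1)^2."""
    for _ in range(steps):
        x -= rate * 2.0 * (x - 1.0)      # dE/dx = 2 (x - 1)
    return x


rng = random.Random(0)
starts = [rng.uniform(-3.0, 5.0) for _ in range(100)]   # 100 different starting points
ends = [round(descend(x), 6) for x in starts]           # where each search finished

if __name__ == "__main__":
    print(f"{len(set(starts))} distinct starts, {len(set(ends))} distinct end point(s)")
```

A plot of only the end points would show a single dot here, no matter how widely the starting points were scattered.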

Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0	Message 1146 - Posted: 9 Oct 2005, 9:13:03 UTC I read the words, but I am completely missing the point you are trying to make. SO, methinks we have to start someplace else ... when I remember where that was I will start you off there ...