To continue this discussion, I am very much interested in what others
think is the best approach to getting 3-D structural properties for
disordered regions or secondary-structure elements, perhaps for the
purpose of training a machine learning algorithm or guiding MD
simulations of a presumed homolog with unknown structure.
Should it be DSSP, PDB, or something else that I am not aware of?

On 23 Apr 2008, at 19:20, Kevin Karplus wrote:
>> Narges Habibi wrote:
>>
>> I'm doing a project on "Protein Contact Map Prediction" and I use
>> features for the neural network's input, including the secondary
>> structure of a given amino acid. There are several ways to get it:
>>
>> 1- getting the DSSP file for each PDB file (from the FTP server)
>> 2- extracting it from the PDB file (the HELIX, SHEET, and TURN sections)
>> 3- getting the ss file from www.pdb.org (as far as I can see, the
>> sequences in this file don't match the PDB files; why?)
>>
>> What do you suggest? Which method is more accurate?
>
> None of the above.
>
> Predicting contact maps using known structure is cheating. You should
> be predicting the local structure, not extracting it from known
> structures. Any way that data from known structures can creep into
> your inputs invalidates your testing, and makes it impossible to say
> with confidence that your method does anything useful. Given the
> rather low quality of contact prediction at the current state of the
> art, even small amounts of information from the real structure can
> make a big difference.
>
> The following paper by my student is a pretty good summary of the
> best method as of CASP7---improvements since then have been modest:
>
> George Shackelford and Kevin Karplus.
> Contact Prediction using Mutual Information and Neural Nets.
> Proteins: Structure, Function, and Bioinformatics,
> 69(S8):159-164, 2007. (CASP7 special issue).
>
> I see a lot of "prediction" work that is complete garbage, because the
> authors fooled themselves by using data that could only come from
> knowing the real structures. The even more common problem is
> insufficient separation of train and test sets, in which computer
> scientists assume that the random partition of a data set is all that
> is needed---but the data sets we have aren't independent samples, so
> one has to go to some effort to ensure that the test set does not
> contain examples that are very close to training set examples.
>
> Kevin Karplus    karplus from soe.ucsc.edu    http://www.soe.ucsc.edu/~karplus
> Professor of Biomolecular Engineering, University of California,
> Santa Cruz
> Undergraduate Director, Bioinformatics
> (Senior member, IEEE)
> (Board of Directors & Chair of Education Committee, ISCB)
> Affiliations for identification only.
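
Regarding the point quoted above about separating train and test sets:
one way to avoid near-duplicates straddling the split is to assign whole
groups of related chains to one side or the other, rather than
partitioning individual chains at random. Below is a minimal Python
sketch of that idea, not anyone's published protocol. It assumes the
chains have already been grouped by some similarity criterion (sequence
identity, fold, or whatever is appropriate); the function name
split_by_cluster, the cluster_of mapping, and the chain and cluster IDs
in the usage note are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, test_fraction=0.2, seed=0):
    """Split chains into train/test so that no cluster straddles the split.

    cluster_of: dict mapping chain ID -> cluster ID. This is an assumed
    input: the clustering itself must be computed beforehand by whatever
    similarity criterion is appropriate for the problem.
    """
    # Collect the chains belonging to each cluster.
    members = defaultdict(list)
    for chain, cluster in cluster_of.items():
        members[cluster].append(chain)

    # Sort before shuffling so the result depends only on the seed.
    clusters = sorted(members)
    random.Random(seed).shuffle(clusters)

    # Move whole clusters into the test set until it reaches the target
    # size; every remaining cluster goes to the training set.
    target = test_fraction * len(cluster_of)
    train, test = [], []
    for cluster in clusters:
        if len(test) < target:
            test.extend(members[cluster])
        else:
            train.extend(members[cluster])
    return sorted(train), sorted(test)
```

For example, split_by_cluster({"1abcA": "c1", "2xyzB": "c1", "3defA": "c2"},
test_fraction=0.3) will never put 1abcA and 2xyzB on opposite sides. The
test set can come out somewhat larger than requested, because clusters
are moved whole. How strict the clustering needs to be is a separate
question that this sketch does not answer.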