XtalPred-RF server JCMM | JCSG
Godzik Lab

Job submission

Input sequence

Paste your sequence or sequences in FASTA format. At most 10 sequences can be submitted in a single job. The XtalPred server accepts sequences between 50 and 1000 residues long. Longer or shorter proteins are usually very difficult to crystallize.
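
The limits above can be checked locally before a job is submitted. The following Python sketch is not part of the server; the function names are hypothetical, and it assumes that the 50 and 1000 residue bounds are inclusive, which the text does not state explicitly.

```python
MAX_SEQUENCES = 10   # per-job limit stated above
MIN_LENGTH = 50      # shortest accepted sequence, in residues (assumed inclusive)
MAX_LENGTH = 1000    # longest accepted sequence, in residues (assumed inclusive)


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
        else:
            raise ValueError("sequence data found before the first '>' header")
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_submission(text):
    """Return a list of problems; an empty list means the input passes."""
    records = parse_fasta(text)
    problems = []
    if not records:
        problems.append("no FASTA records found")
    if len(records) > MAX_SEQUENCES:
        problems.append(
            f"{len(records)} sequences submitted; the limit is {MAX_SEQUENCES}"
        )
    for header, seq in records:
        if not MIN_LENGTH <= len(seq) <= MAX_LENGTH:
            problems.append(
                f"{header}: length {len(seq)} is outside {MIN_LENGTH}-{MAX_LENGTH}"
            )
    return problems
```

Calling check_submission on the text you intend to paste returns an empty list when the sequence count and all sequence lengths are within the stated limits.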

Optional features

Select the checkbox "Find close bacterial homologs more likely to crystallize" to get a list of the target's homologs in complete microbial genomes, together with their crystallizability classes. The list also contains links to detailed information about each homolog. Currently (June 2007) XtalPred contains pre-calculated results for 487 complete genomes provided by NCBI, i.e. 1,549,504 protein sequences. Selecting this feature increases the total calculation time.

Select the checkbox "Find close homologs in the full NR database" to get a list of the target's homologs in the full NR database. XtalPred uses the NR database from June 2007 with 4,970,641 protein sequences. Selecting this feature increases the total calculation time.

Enter a valid email address. Once the job is finished, the results will be sent to this address.
You can provide the email address at the time of submission or later.

Server policy

The XtalPred web server is free for academic users only. Other users are requested to contact adam@burnham.org.
Be aware that XtalPred uses several external programs that are freely available only to academic users. See the External software section for details.

Retrieve the results

Paste the job ID provided during job submission. IDs of jobs submitted from the web page start with 'webdb/', for example: webdb/1182365731.4638. IDs of jobs for local databases start with 'db/', for example: db/Microbial_genomes/Acidobacteria_bacterium_Ellin345.
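
The two prefixes can be told apart with a simple string check. This is a minimal sketch for illustration only; the function name and the returned labels are hypothetical and are not part of the server.

```python
def classify_job_id(job_id):
    """Classify a job ID by its prefix: 'web', 'local', or 'unknown'."""
    job_id = job_id.strip()
    if job_id.startswith("webdb/"):
        return "web"      # submitted from the web page
    if job_id.startswith("db/"):
        return "local"    # pre-calculated results in a local database
    return "unknown"
```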

All results for jobs submitted from the webpage will be deleted within 7 days.

Crystallization prediction

Crystallization analysis

The web server compares nine of the protein's biochemical and biophysical features with the corresponding probability distributions obtained from TargetDB. A plot is automatically generated for each feature. It shows the distributions of failures and successes in the learning sets extracted from TargetDB, the interpolated empirical distribution of crystallization probability, and the position of the protein in those distributions. See the example in Figure 1.

Figure 1. Example of comparison of target features with distributions of crystallization probabilities obtained from TargetDB.
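
The idea of an interpolated empirical probability distribution can be illustrated with a short Python sketch. It is not the server's implementation: the binning, the use of linear interpolation, the clamping outside the sampled range, and all function names are assumptions made for illustration. For one feature, the success fraction is computed in each bin as successes / (successes + failures), and the value for a target is then interpolated between bin centers.

```python
from bisect import bisect_right


def empirical_curve(centers, successes, failures):
    """Return (bin center, success fraction) pairs, skipping empty bins."""
    curve = []
    for center, s, f in zip(centers, successes, failures):
        if s + f > 0:
            curve.append((center, s / (s + f)))
    return curve


def probability_at(curve, x):
    """Linearly interpolate the success fraction at feature value x.

    curve must be sorted by bin center; values outside the sampled
    range are clamped to the nearest bin (an assumption of this sketch).
    """
    xs = [c for c, _ in curve]
    ps = [p for _, p in curve]
    if x <= xs[0]:
        return ps[0]
    if x >= xs[-1]:
        return ps[-1]
    i = bisect_right(xs, x)          # xs[i - 1] <= x < xs[i]
    x0, x1 = xs[i - 1], xs[i]
    p0, p1 = ps[i - 1], ps[i]
    return p0 + (p1 - p0) * (x - x0) / (x1 - x0)


# Made-up counts for a hypothetical feature (not TargetDB data).
toy_curve = empirical_curve(
    centers=[100, 300, 500],
    successes=[30, 20, 10],
    failures=[70, 80, 90],
)
toy_value = probability_at(toy_curve, 200)  # halfway between the first two bins
```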

Crystallization classification

The prediction is made by combining the individual crystallization probabilities into a single crystallization score. Based on this score, the protein is assigned to one of five crystallization classes: optimal, suboptimal, average, difficult and very difficult. Each class corresponds to a different crystallization success rate observed in TargetDB (see Figure 2). Finally, the server shows the distribution of the crystallization probability across the crystallization classes and the position of the protein within this distribution. See the example in Figure 3.

Figure 2. Distribution of targets into crystallization classes and observed successes and failures in protein crystallization.

Figure 3. Example of crystallization classification.

Crystallization classification by Random Forest Classifier (XtalPred-RF)

In XtalPred-RF, the list of protein features has been extended with predicted surface ruggedness, hydrophobicity, side-chain entropy of surface residues, and amino-acid composition of the predicted protein surface. A Random Forest classifier is then used to predict the protein's crystallizability class. This resulted in an almost two-fold improvement in the prediction of crystallization success compared with the original XtalPred version from 2007.

Training and testing sets used in the development of XtalPred-RF

Lists of TargetTrack IDs of PSI targets used as training and testing sets for XtalPred-RF and their sequences are available at: http://ffas.burnham.org/XtalPred/data.tar

File names of the lists of positive and negative cases in training and testing sets:

learn.pos.txt - training set, positive cases (2265 solved structures)
learn.neg.txt - training set, negative cases (2355 targets which failed to crystallize)

test.pos.txt - testing set, positive cases (2445 solved structures)
test.neg.txt - testing set, negative cases (2440 targets which failed to crystallize)

fasta.txt - sequences of all targets from the training and testing sets.
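
After unpacking the archive, the list sizes can be compared with the counts given above. The Python sketch below assumes that each list file contains one TargetTrack ID per line; this layout is an assumption, not something documented here, and the function names are hypothetical.

```python
# Expected list sizes, taken from the file descriptions above.
EXPECTED_COUNTS = {
    "learn.pos.txt": 2265,
    "learn.neg.txt": 2355,
    "test.pos.txt": 2445,
    "test.neg.txt": 2440,
}


def parse_ids(text):
    """Return the non-empty lines of a list file (assumed: one ID per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def count_mismatches(id_lists):
    """Compare list sizes with the documented counts.

    id_lists maps a file name to its list of IDs. The result maps each
    file name whose size differs to a (found, expected) pair.
    """
    mismatches = {}
    for name, expected in EXPECTED_COUNTS.items():
        found = len(id_lists.get(name, []))
        if found != expected:
            mismatches[name] = (found, expected)
    return mismatches
```

To use it, read each of the four files with open(name).read(), pass the text through parse_ids, and hand the resulting dictionary to count_mismatches. An empty result means the list sizes agree with the numbers stated above.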


XtalPred crystallizability classification is based on statistics for non-secreted wild-type microbial proteins and is optimized for identifying the most promising crystallization targets in large protein families. XtalPred is also helpful in construct design, although the crystallizability class alone is usually not sufficient to determine precise construct boundaries.



The XtalPred server
Slabinski, L., Jaroszewski, L., Rychlewski, L., Wilson, I.A., Lesley, S.A., Godzik, A. XtalPred: a web server for prediction of protein crystallizability. Bioinformatics, 2007 23(24):3403-5. [PubMed]

The method
Slabinski, L., Jaroszewski, L., Rodrigues, A.P.C., Rychlewski, L., Wilson, I.A., Lesley, S.A., Godzik, A. The challenge of protein structure determination - lessons from structural genomics. Protein Science, 2007 16(11):2472-82. [PubMed]

Jahandideh, S., Jaroszewski, L., Godzik, A. Improving the chances of successful protein structure determination with a random forest classifier. Acta Crystallogr D Biol Crystallogr, 2014 70(Pt 3):627-35. [PubMed]

External software

The XtalPred web server uses several programs (freely available only for academic users) for calculation and prediction of protein features:

PSI-BLAST - homology searches. Ref.: Altschul, SF, W Gish, W Miller, EW Myers, and DJ Lipman. Basic local alignment search tool. J Mol Biol, 1990 215(3):403-10.

PSIPRED - secondary structure prediction. Ref.: Jones, D.T. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol, 1999 292: 195-202.

DISOPRED2 - prediction of structurally disordered regions. Ref.: Ward, J.J., Sodhi, J.S., McGuffin, L.J., Buxton, B.F., and Jones, D.T. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol, 2004 337: 635-645.

COILS - prediction of coiled-coil regions. Ref.: Lupas, A., Van Dyke, M., and Stock, J. 1991. Predicting coiled coils from protein sequences. Science 252: 1162-1164.

TMHMM - prediction of transmembrane helices. Ref.: Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E.L. 2001. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305: 567-580.

SEG - calculation of low-complexity regions. Ref.: Wootton, J.C. 1994. Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem 18: 269-285.

RPSP - prediction of signal peptides. Ref.: Plewczynski, D., Slabinski, L., Tkacz, A., Kajan, L., Holm, L., Ginalski, K., Rychlewski, L. The RPSP: web server for prediction of signal peptides. 2007. Polymer 48: 5493-5496.


This server is supported by the NIH Protein Structure Initiative grants: P20 GM076221 (JCMM) and U54 GM074898 (JCSG).

Last update August 30, 2007