Improving the chances of successful protein structure determination with a random forest classifier

Jahandideh, S.; Jaroszewski, L.; Godzik, A.

doi:10.1107/S1399004713032070

Improving the chances of successful protein structure determination with a random forest classifier

S. Jahandideh, L. Jaroszewski and A. Godzik

Obtaining diffraction quality crystals remains one of the major bottlenecks in structural biology. The ability to predict the chances of crystallization from the amino-acid sequence of the protein can, at least partly, address this problem by allowing a crystallographer to select homologs that are more likely to succeed and/or to modify the sequence of the target to avoid features that are detrimental to successful crystallization. In 2007, the now widely used XtalPred algorithm [Slabinski et al. (2007), Protein Sci. 16, 2472–2482] was developed. XtalPred classifies proteins into five `crystallization classes' based on a simple statistical analysis of the physicochemical features of a protein. Here, towards the same goal, advanced machine-learning methods are applied and, in addition, the predictive potential of additional protein features such as predicted surface ruggedness, hydrophobicity, side-chain entropy of surface residues and amino-acid composition of the predicted protein surface are tested. The new XtalPred-RF (random forest) achieves significant improvement of the prediction of crystallization success over the original XtalPred. To illustrate this, XtalPred-RF was tested by revisiting target selection from 271 Pfam families targeted by the Joint Center for Structural Genomics (JCSG) in PSI-2, and it was estimated that the number of targets entered into the protein-production and crystallization pipeline could have been reduced by 30% without lowering the number of families for which the first structures were solved. The prediction improvement depends on the subset of targets used as a testing set and reaches 100% (i.e. twofold) for the top class of predicted targets.

Keywords: structural genomics; target selection; machine-learning methods; XtalPred.

Read article Similar articles

Format		BIBTeX
		EndNote
		RefMan
		Refer
		Medline
		CIF
		SGML
		Text
		Plain Text

Format		BIBTeX
		EndNote
		RefMan
		Refer
		Medline
		CIF
		SGML
		Text
		Plain Text

Search IUCr Journals		doi		Advanced search
Author		volume	page