Application of Feature Selection Technology Based on Incremental of Diversity in Prediction of Flexible Regions from Protein Sequences

Author(s): Suqing Yang, Shisai Hu, Ying Zhang*, Jun Lv.

Journal Name: Letters in Organic Chemistry

Volume 14, Issue 9, 2017

Background: The flexibility of protein structures is often related to protein function. Feature selection (FS) is critical in many machine learning applications that deal with small samples and high-dimensional data. To predict flexible regions from protein sequences, it is important to build a machine learning methodology based on an effective feature selection technique. Such a methodology may also provide new insight into the protein folding process.

Method: First, the frequencies of k-spaced amino acid pairs are taken as a representation of the local sequences. Second, these representations are processed by feature selection based on the increment of diversity (FSID) to reduce their dimensionality. Finally, logistic regression is applied to integrate the selected features into a scheme that discriminates flexible regions from rigid ones (referred to as FSID_FRP).
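As a rough illustration of the first two steps, the sketch below counts k-spaced amino acid pairs and computes an increment of diversity between two count vectors. It assumes the commonly used definitions: a k-spaced pair is two residues separated by k intervening positions, the diversity of a count vector is D(X) = N log N - sum(n_i log n_i), and the increment of diversity is ID(X, Y) = D(X + Y) - D(X) - D(Y). The function names are hypothetical, and the abstract does not say how FSID ranks or thresholds features, so that step is not shown.

```python
from collections import Counter
from itertools import product
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def k_spaced_pair_frequencies(seq, k):
    """Relative frequency of each residue pair separated by k positions.

    Assumption: k = 0 means adjacent residues. Pairs containing a
    non-standard letter are skipped.
    """
    counts = Counter()
    for i in range(len(seq) - k - 1):
        a, b = seq[i], seq[i + k + 1]
        if a in AMINO_ACIDS and b in AMINO_ACIDS:
            counts[(a, b)] += 1
    total = sum(counts.values())
    return {
        pair: (counts[pair] / total if total else 0.0)
        for pair in product(AMINO_ACIDS, repeat=2)
    }


def diversity(counts):
    """D(X) = N log N - sum(n_i log n_i) over the non-zero counts."""
    n_total = sum(counts)
    if n_total == 0:
        return 0.0
    return n_total * log(n_total) - sum(n * log(n) for n in counts if n > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y) for two equal-length count vectors."""
    if len(x) != len(y):
        raise ValueError("count vectors must have the same length")
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)
```

Under these definitions, two count vectors with identical proportions give an increment of zero, and vectors with no overlap give the largest increment, which is why the measure can serve as a score of how differently a feature is distributed in two classes.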

Results: From a set of 66 sequences, 74 features are selected, comprising 26 flexible patterns and 48 rigid patterns. Most of the flexible patterns are associated with Glycine or Proline, whereas the rigid patterns are associated with Leucine or Valine. Using the FSID_FRP method, in which logistic regression is applied to the 74-feature representation, we obtained 79.41% accuracy and an MCC of 0.51. These results are comparable to those of the FlexRP method, which uses 95 features.
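For reference, accuracy and the Matthews correlation coefficient (MCC) are both computed from the four cells of a binary confusion matrix. The sketch below uses the standard formulas. The function names are hypothetical, and the counts in the usage lines are made up for illustration: they are not the paper's data.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to +1.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Illustrative counts only (flexible = positive class, rigid = negative class).
example_accuracy = accuracy(tp=40, tn=45, fp=5, fn=10)
example_mcc = mcc(tp=40, tn=45, fp=5, fn=10)
```

Unlike accuracy, MCC takes all four cells into account, so it stays informative when the flexible and rigid classes differ in size.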

Conclusion: The simple feature selection method FSID is shown to be very efficient in predicting the flexible and rigid regions of protein sequences. It is more appropriate for small-sample classification than the entropy-based feature selection method. The proposed FSID_FRP method achieved nearly 80% prediction accuracy and stronger generalization ability.

Keywords: Feature selection, increment of diversity, k-spaced amino acid pairs, logistic regression, protein flexible regions, protein sequences.


Article Details

Year: 2017
Page: [642 - 647]
Pages: 6
DOI: 10.2174/1570178614666170221145333
