Background: The flexibility of a protein's structure is often related to its function.
Feature selection (FS) is critical to many machine learning applications that deal with
small-sample, high-dimensional data. To predict flexible regions from protein sequences,
it is important to build a machine learning method on top of an effective feature
selection technique. Such a method may also provide new insight into the protein folding process.
Method: First, the frequencies of k-spaced amino acid pairs (two residues separated by k other residues)
are taken as a representation of the local sequence. Second, these representations are processed by
feature selection based on the increment of diversity (FSID) to reduce the dimensionality. Finally,
logistic regression is applied to integrate the selected features into a scheme that discriminates
flexible from rigid regions (referred to as FSID_FRP).
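
The first step of the method can be illustrated with a minimal Python sketch that computes k-spaced amino acid pair frequencies for one local sequence segment. The function name, the `max_k` parameter, the per-gap normalization, and the feature naming (`GxG` for a pair with one residue in between) are assumptions made for illustration; the abstract does not state the range of k or the segment length that FSID_FRP uses.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def k_spaced_pair_frequencies(segment, max_k=4):
    """Return a dict mapping each k-spaced pair to its frequency in `segment`.

    A k-spaced pair is two residues with exactly k residues between them
    (k = 0 means adjacent). `max_k` is a hypothetical choice, not a value
    taken from the paper.
    """
    features = {}
    for k in range(max_k + 1):
        counts = {pair: 0 for pair in product(AMINO_ACIDS, repeat=2)}
        total = 0
        # Slide over every position that has a partner k + 1 steps ahead.
        for i in range(len(segment) - k - 1):
            pair = (segment[i], segment[i + k + 1])
            if pair in counts:  # skip non-standard residue codes
                counts[pair] += 1
                total += 1
        for (a, b), count in counts.items():
            # Normalize within each gap size so frequencies for a given k sum to 1.
            features[f"{a}{'x' * k}{b}"] = count / total if total else 0.0
    return features


# Usage on a made-up segment (not data from the paper):
feats = k_spaced_pair_frequencies("GAGPG", max_k=1)
print(feats["GA"], feats["GxG"])
```

With 20 residue types there are 400 possible pairs per gap size, so the raw representation grows quickly with `max_k`, which is why a dimensionality-reducing feature selection step follows.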
Results: From a set of 66 sequences, 74 features are selected, comprising 26 flexible patterns and 48
rigid patterns. Most of the flexible patterns are associated with glycine or proline, and the rigid patterns
are associated with leucine or valine. Applying logistic regression to the representation built from these
74 features, the FSID_FRP method obtained 79.41% accuracy and a Matthews correlation coefficient (MCC) of 0.51.
These results are comparable to those of the FlexRP method, which uses 95 features.
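
For readers unfamiliar with the two reported metrics, the following Python sketch shows how accuracy and the Matthews correlation coefficient (MCC) are computed from a binary confusion matrix. The function name and the example counts are illustrative assumptions; they are not the confusion matrix behind the figures reported above.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Compute accuracy and MCC from binary confusion-matrix counts.

    tp/tn: correctly predicted flexible/rigid cases;
    fp/fn: rigid predicted as flexible / flexible predicted as rigid.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal sum is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Hypothetical counts chosen only to exercise the formula:
acc, mcc = accuracy_and_mcc(tp=40, tn=40, fp=10, fn=10)
print(acc, mcc)
```

MCC ranges from -1 to 1 and, unlike accuracy, accounts for all four cells of the confusion matrix, so it is less easily inflated when the flexible and rigid classes differ in size.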
Conclusion: FSID, a simple feature selection method, is shown to be effective for predicting
the flexible/rigid regions of protein sequences. It is more appropriate for small-sample
classification than the entropy-based feature selection method. The proposed FSID_FRP method
achieved close to 80% prediction accuracy (79.41%) and stronger generalization ability.