Background: Proteins play a crucial role in life activities, such as catalyzing metabolic
reactions, replicating DNA, and responding to stimuli. Identification of protein structures and
functions is critical for both basic research and applications. Because traditional experiments
for studying the structures and functions of proteins are expensive and time-consuming,
computational approaches are highly desired. The key to computational methods is how to
efficiently extract features from protein sequences. During the last decade, many powerful
feature extraction algorithms have been proposed, significantly promoting the study of
protein structures and functions.
Objective: To help researchers catch up with recent developments in this important field,
this study gives an updated review focusing on the sequence-based feature extraction of proteins.
Method: The sequence-based features of proteins were grouped into three categories:
composition-based features, autocorrelation-based features, and profile-based features. The
features in each group were introduced in detail, and their advantages and disadvantages were
discussed. Some useful tools for generating these features were also introduced.
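As a rough illustration of the simplest category, composition-based features, the following minimal sketch computes a normalized k-mer composition vector for one sequence. The function name kmer_composition, the 20-letter alphabet constant, and the choice to ignore k-mers containing non-standard residues are assumptions made for this example, not definitions taken from the review.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids (one-letter codes); an assumption of this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(sequence, k=2):
    """Return a dict mapping every possible k-mer to its relative frequency.

    k-mers containing characters outside AMINO_ACIDS are ignored, so the
    returned frequencies sum to 1 whenever at least one valid k-mer exists.
    """
    sequence = sequence.upper()
    # Slide a window of width k over the sequence and count each k-mer.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Fixed vocabulary of 20**k k-mers gives a fixed-length feature vector.
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    total = sum(counts[word] for word in vocabulary)
    return {word: (counts[word] / total if total else 0.0) for word in vocabulary}


features = kmer_composition("ACACA", k=2)
```

For k = 2 the vector has 400 entries regardless of sequence length, which is what makes such features convenient as input to standard classifiers.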
Results: Generally, autocorrelation-based features outperform composition-based features, and
profile-based features outperform autocorrelation-based features. The reason is that profile-based
features consider the evolutionary information, which is useful for identification of protein
structures and functions. However, profile-based features are more time-consuming to compute,
because they require a multiple sequence alignment.
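To make the idea of evolutionary information concrete, the sketch below builds a per-column residue frequency profile from a small alignment. It assumes the alignment has already been computed and uses a hand-written toy example; the function name frequency_profile and the gap handling are hypothetical choices for illustration, not the procedure used by any specific tool discussed in the review.

```python
# The 20 standard amino acids (one-letter codes); an assumption of this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_profile(aligned_sequences):
    """Return one dict per alignment column with relative residue frequencies.

    Gaps and non-standard characters are skipped when counting a column.
    """
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in range(length):
        # Collect the standard residues observed at this alignment position.
        residues = [seq[column] for seq in aligned_sequences
                    if seq[column] in AMINO_ACIDS]
        n = len(residues)
        profile.append({aa: (residues.count(aa) / n if n else 0.0)
                        for aa in AMINO_ACIDS})
    return profile


# Toy alignment written by hand for illustration only.
profile = frequency_profile(["ACD", "ACE", "A-D"])
```

The profile itself is cheap to compute; the expensive part in practice is obtaining the alignment, which this sketch takes as given.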
Conclusion: In this study, some recently proposed sequence-based features were introduced and
discussed, such as basic k-mers, PseAAC, auto-cross covariance, and top-n-gram. These features
have made great contributions to the development of protein sequence analysis. Future studies can
focus on exploring combinations of these features. In addition, techniques from other fields,
such as signal processing, natural language processing (NLP), and image processing, would also
contribute to this important field, because natural languages (such as English) and protein
sequences share some similarities. Proteins can therefore be treated as documents, and
features such as k-mers, top-n-grams, and motifs can be treated as the words of the language.
Techniques from these fields will give some new ideas and strategies for extracting the features