Background: Post-translational modifications (PTMs) are a key regulatory mechanism in cellular
processes, so identifying them quickly and accurately is important. Both next-generation sequencing
and bioinformatics techniques have greatly facilitated the discovery of PTMs. Most bioinformatics
techniques follow the machine learning framework, in which feature extraction occupies a key
position.
Conclusion: This article reviews the feature extraction methods used by machine learning-based PTM
predictors, drawing on protein sequence, structure, function, physicochemical and biochemical
properties, and evolutionary conservation. Binary encoding, amino acid composition, pseudo amino acid
composition, composition of K-spaced amino acid pairs, autocorrelation functions, position weight
amino acid composition and position-specific amino acid propensity extract features directly from
protein sequences. Encoding based on grouped weight is a hybrid approach that integrates
physicochemical and biochemical properties with sequence information. Protein structural information,
especially secondary structure, accessible surface area and disorder, has also been used to encode
proteins. Features derived from evolutionary conservation include the position-specific scoring
matrix and the k-nearest neighbor score. In addition, we discuss some open problems in feature
extraction.
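To make three of the sequence-based encodings named above concrete, the following is a minimal Python sketch of binary encoding, amino acid composition and composition of K-spaced amino acid pairs. It is an illustration under stated assumptions, not the implementation used by any particular predictor: the function names, the 20-letter alphabet ordering and the example peptide are hypothetical, and non-standard residues are simply ignored.

```python
from itertools import product

# Assumption: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binary_encoding(peptide):
    """One-hot encoding: 20 bits per residue (all zeros for a non-standard residue)."""
    features = []
    for residue in peptide:
        features.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return features


def amino_acid_composition(peptide):
    """Fraction of the peptide made up by each of the 20 standard amino acids."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    return [peptide.count(aa) / len(peptide) for aa in AMINO_ACIDS]


def k_spaced_pair_composition(peptide, k):
    """Normalized counts of residue pairs separated by exactly k other residues."""
    n_windows = len(peptide) - k - 1
    if k < 0 or n_windows <= 0:
        raise ValueError("peptide is too short for the requested spacing k")
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs
    counts = dict.fromkeys(pairs, 0)
    for i in range(n_windows):
        pair = peptide[i] + peptide[i + k + 1]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return [counts[p] / n_windows for p in pairs]


# Hypothetical 7-residue window centred on a lysine.
peptide = "ACDKACD"
one_hot = binary_encoding(peptide)                   # 7 * 20 = 140 values
composition = amino_acid_composition(peptide)        # 20 values
spaced_pairs = k_spaced_pair_composition(peptide, 1) # 400 values
```

In practice such vectors are usually computed for several values of k and concatenated with other descriptors before being passed to a classifier.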
Keywords: Machine learning, feature extraction, PSSM, PTMs, pseudo amino acid composition, position-specific amino acid
propensity, composition of K-spaced amino acid pairs, autocorrelation functions.