Background: As a newly discovered post-translational modification on the
ε-amino group of lysine residues, protein malonylation has been found to be involved in
metabolic pathways and certain diseases. Apart from experimental approaches, several
computational methods based on machine learning algorithms have recently been proposed to
predict malonylation sites. However, previous methods did not address the imbalance
between the numbers of positive and negative samples.
Objective: In this study, we identified the significant features of malonylation sites using a
novel computational method that applies machine learning algorithms and balances the
sample sizes with the synthetic minority over-sampling technique (SMOTE).
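
To illustrate the balancing step named above, the following is a minimal sketch of the synthetic minority over-sampling idea in plain Python. It is not the implementation used in this study: the function name `smote`, its parameters, and the toy two-dimensional points are assumptions made for illustration only. Each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority neighbours.

    Hypothetical helper for illustration; not the study's implementation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (squared Euclidean distance),
        # excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

# Toy minority (positive) samples; the values are made up.
positives = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.3, 0.3]]
new_points = smote(positives, n_new=6, k=2)
balanced_positives = positives + new_points
```

Only the minority class is over-sampled; the negative samples are left unchanged, so the two classes approach equal size as `n_new` grows.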
Method: Four types of features, namely amino acid (AA) composition, position-specific
scoring matrix (PSSM), AA factor, and disorder, were used to encode residues in protein
segments. Then, a two-step feature selection procedure comprising maximum relevance
minimum redundancy (mRMR) and incremental feature selection (IFS), together with the
random forest algorithm, was applied to the constructed hybrid feature vector.
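
The incremental feature selection step can be sketched as follows, assuming the features have already been ranked (for example by maximum relevance minimum redundancy). The sketch scores the nested subsets top-1, top-2, and so on, and keeps the best one. The feature names and the `toy_evaluate` scoring function are hypothetical stand-ins; in practice the score would come from a cross-validated classifier such as a random forest.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Score nested subsets (top-1, top-2, ...) and return the best one."""
    best_score, best_subset = float("-inf"), []
    for n in range(1, len(ranked_features) + 1):
        subset = ranked_features[:n]
        score = evaluate(subset)
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score

# Hypothetical ranked feature names, most relevant first.
ranked = ["pssm_7", "disorder_3", "aa_factor_2", "aac_K", "pssm_1"]
informative = {"pssm_7", "disorder_3", "aa_factor_2"}

def toy_evaluate(subset):
    # Stand-in for classifier performance: reward informative features
    # and apply a small penalty for each feature in the subset.
    hits = sum(1 for f in subset if f in informative)
    return hits - 0.1 * len(subset)

best_subset, best_score = incremental_feature_selection(ranked, toy_evaluate)
```

With this toy scoring function the score rises while informative features are being added and falls once uninformative ones enter, so the search settles on the first three ranked features.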
Results: An optimal classifier was built from the optimal feature subset and achieved an
F1-measure of 0.356. Several of the most important selected features were then analyzed.
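
For reference, the F1-measure is the harmonic mean of precision and recall, which makes it more informative than accuracy when negative samples far outnumber positive ones. The short sketch below computes it from confusion-matrix counts. The counts are made up for illustration and are not results from this study.

```python
def f1_measure(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion-matrix counts."""
    if tp == 0:
        return 0.0  # no true positives, so F1 is defined as zero here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up counts: 8 true positives, 2 false positives, 4 false negatives.
f1 = f1_measure(tp=8, fp=2, fn=4)
```

True negatives do not appear in the formula, so a classifier cannot raise its F1-measure simply by labelling the abundant negative class correctly.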
Conclusion: The results showed that certain types of PSSM and disorder features may be
closely associated with malonylation of lysine residues. Our study contributes to the
development of computational approaches for predicting malonyllysine and provides
insights into the molecular mechanism of malonylation.