Background: Revealing the subcellular location of a newly discovered protein can bring insight into its function and guide research at the cellular level. The experimental methods currently used to identify protein subcellular locations are both time-consuming and expensive. It is therefore highly desirable to develop computational methods that identify protein subcellular locations efficiently and effectively. In particular, the rapidly increasing number of protein sequences entering the genome databanks calls for the development of automated analysis methods.
Methods: In this review, we describe recent advances in predicting protein subcellular locations with machine learning from the following aspects: i) construction of protein subcellular location benchmark datasets, ii) protein feature representation and feature descriptors, iii) common machine learning algorithms, iv) cross-validation test methods and assessment metrics, and v) web servers.
Conclusions: Given the large numbers of protein sequences generated by high-throughput technologies, three future directions for predicting protein subcellular locations with machine learning deserve attention. The first is the selection of novel and effective features (e.g., statistical, physicochemical, evolutionary) from the sequences and structures of proteins. The second is the design of more powerful predictors. The third is the prediction of proteins with multiple location sites.