Background: Thermophilic proteins remain active at high temperatures, so studying them is
important for understanding the thermal stability of proteins.
Objective: To address the low precision and low efficiency of thermophilic protein prediction,
this paper proposes a prediction method based on feature fusion and machine learning.
Method: For the selected thermophilic data sets, the protein sequences were first characterized
by feature fusion, combining g-gap dipeptide, entropy density and autocorrelation coefficient
features. Kernel principal component analysis (KPCA) was then used to reduce the dimension of
the resulting sequence features in order to reduce the training time and
improve efficiency. Finally, the classification model was designed by using the classification
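
As an illustration of the first feature named in the Method, the following is a minimal sketch of a g-gap dipeptide frequency vector, not the authors' implementation. The function name `g_gap_dipeptide`, the 20-letter alphabet ordering, and the choice to skip pairs containing non-standard residues and normalize by the number of counted pairs are all assumptions made for this example.

```python
from itertools import product

# The 20 standard amino acids; the ordering is an assumption for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 dipeptides


def g_gap_dipeptide(sequence, g=0):
    """Return a 400-dimensional vector of g-gap dipeptide frequencies.

    A g-gap dipeptide is the pair of residues at positions i and i + g + 1,
    i.e. two residues separated by g intervening residues.
    """
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for i in range(len(sequence) - g - 1):
        pair = sequence[i] + sequence[i + g + 1]
        if pair in counts:  # assumption: ignore pairs with non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:  # sequence too short for this gap
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]
```

With `g=0` this reduces to the ordinary adjacent dipeptide composition; larger `g` values capture correlations between residues that are further apart in the sequence.
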
Results: Several classification algorithms were trained and tested on the selected thermophilic
dataset. In the comparison, the support vector machine (SVM) achieved an accuracy of over 92%
under the jackknife method, and the other evaluation indicators also showed that the SVM
performed best.
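
The jackknife evaluation mentioned above corresponds to leave-one-out cross-validation. The sketch below shows one way it could be set up with scikit-learn, chaining KPCA and an SVM; it is an assumption-laden illustration rather than the configuration used in the paper. The function name `jackknife_accuracy`, the RBF kernels, the standardization step and the default `n_components=10` are hypothetical choices.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def jackknife_accuracy(X, y, n_components=10):
    """Estimate accuracy by leave-one-out (jackknife) cross-validation.

    X: (n_samples, n_features) fused feature matrix; y: class labels.
    Scaling and KPCA are refit inside every fold, so the held-out
    sample never influences the dimension reduction.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    model = make_pipeline(
        StandardScaler(),                                     # assumed preprocessing
        KernelPCA(n_components=n_components, kernel="rbf"),   # assumed kernel
        SVC(kernel="rbf"),                                    # assumed kernel
    )
    # Each fold scores one held-out sample as 0 or 1; the mean is the accuracy.
    scores = cross_val_score(model, X, y, cv=LeaveOneOut())
    return float(scores.mean())
```

Because the model is refit once per sample, the cost grows linearly with the dataset size, which is one reason reducing the feature dimension beforehand helps keep training time manageable.
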
Conclusion: Owing to its effective feature representation method and robust classifier, the
proposed method is suitable for predicting thermophilic proteins and is superior to most
reported methods.