Obtaining soluble proteins at sufficient concentrations increases the overall success rate of many experimental
studies. The solubility of a protein is an individual trait ultimately determined by its primary sequence. Exploring
the relationship between protein solubility and sequence composition is instrumental for prioritizing
targets in large-scale proteomics projects. In this paper, the amino acid composition (20 dimensions) and the dipeptide
composition (400 dimensions) were extracted to form the total candidate feature pool (420 dimensions). The features were
sorted by the absolute value of their correlation coefficient and then added to the feature vector one at a time in that order.
Finally, using a Support Vector Machine (SVM) as the prediction engine, we evaluated and recorded the results for each of
the 420 resulting feature vectors. According to the SVM results, the first 208 of the 420 ranked features were selected as the
effective ones. By analyzing the composition of these 208 features, we found that protein solubility was significantly
influenced by the occurrence frequencies of the acidic amino acids, the basic amino acids, the non-polar hydrophobic
amino acids and two polar neutral amino acids (C, Q) in the protein sequence. Additionally, we found that
dipeptides composed of acidic amino acids (D, E) and basic amino acids (K, R and H), especially those composed
of acidic amino acids (D, E), were strongly associated with protein solubility.
Keywords: Protein solubility, support vector machine, correlation coefficient, hydrophobic amino acids, dipeptide, vector, proteomics, protein sequence
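
The feature construction and ranking summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: compositions are normalized as frequencies (counts divided by the number of residues or residue pairs), the correlation coefficient is taken to be Pearson's r between a feature and a binary solubility label (1 for soluble, 0 for insoluble), and all names (`composition_features`, `pearson`, `rank_features`) are hypothetical. The SVM evaluation step is left out; in the procedure described above, the first k ranked features would be passed to an SVM for k = 1 to 420.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs
FEATURE_NAMES = list(AMINO_ACIDS) + DIPEPTIDES  # 20 + 400 = 420 features


def composition_features(sequence):
    """Return the 420-dimensional composition vector of one sequence.

    Assumption: frequencies are counts divided by the number of residues
    (amino acids) or of adjacent residue pairs (dipeptides).
    """
    seq = sequence.upper()
    n = len(seq)
    aa_counts = {a: 0 for a in AMINO_ACIDS}
    dp_counts = {d: 0 for d in DIPEPTIDES}
    for residue in seq:
        if residue in aa_counts:  # non-standard symbols are ignored
            aa_counts[residue] += 1
    for i in range(n - 1):
        pair = seq[i:i + 2]
        if pair in dp_counts:
            dp_counts[pair] += 1
    aa = [aa_counts[a] / n if n else 0.0 for a in AMINO_ACIDS]
    dp = [dp_counts[d] / (n - 1) if n > 1 else 0.0 for d in DIPEPTIDES]
    return aa + dp


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    if max(xs) == min(xs) or max(ys) == min(ys):
        return 0.0
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_features(sequences, labels):
    """Return (feature name, r) pairs sorted by |r|, largest first.

    Assumption: labels are numeric solubility labels, e.g. 1 or 0.
    """
    matrix = [composition_features(s) for s in sequences]
    scores = []
    for j, name in enumerate(FEATURE_NAMES):
        column = [row[j] for row in matrix]
        scores.append((name, pearson(column, labels)))
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)
```

Calling `rank_features(sequences, labels)` on a labeled set of sequences returns all 420 features in descending order of absolute correlation; taking the first k entries of that list gives the k-feature subset to evaluate at each step.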