Current Bioinformatics


ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

Identification of Disease-specific Single Amino Acid Polymorphisms Using a Simple Random Forest at Protein-level

Author(s): Jian He, Rongao Yuan, Lei Xu, Yanzhi Guo* and Menglong Li*

Volume 16, Issue 10, 2021

Published on: 24 August, 2021

Page: [1278 - 1287] Pages: 10

DOI: 10.2174/1574893616666210825094751


Background: The number of human genetic variants deposited in publicly available databases has been increasing exponentially. Among these variants, non-synonymous single nucleotide polymorphisms (nsSNPs), also known as single amino acid polymorphisms (SAPs), have been shown to correlate strongly with phenotypic variation in traits and diseases.

Objective: However, the detailed mechanisms governing the disease association of SAPs remain unclear. Further investigation of new attributes and improved prediction are therefore increasingly urgent, since a large number of uncharacterized disease-related SAPs remain to be investigated.

Methods: Based on the Random Forest (RF) algorithm, we first constructed a new, effective model that predicts SAPs associated with a particular disease from protein sequences. Four common sequence feature extraction methods were evaluated separately to select the optimal features. The SAP peptide length was then optimized over the range 12 to 202.
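As a rough illustration of the sequence-based feature step, the sketch below cuts a peptide around a variant site and turns it into an amino acid composition vector that a Random Forest classifier could consume. The abstract does not name the four feature extraction methods or define how peptide length is measured, so everything here is an assumption: composition is chosen only as a familiar example, the peptide is taken as `flank` residues on each side of the variant site with the site itself excluded (so the length is 2 × `flank`), and the function names `sap_peptide` and `composition` are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sap_peptide(sequence, position, flank):
    """Return the residues flanking a variant site.

    Assumption (not from the paper): the peptide is `flank` residues on each
    side of the 0-based `position`, the variant residue itself is excluded,
    and windows running past a terminus are padded with 'X'.
    """
    if not 0 <= position < len(sequence):
        raise ValueError("position is outside the sequence")
    left = sequence[max(0, position - flank):position]
    right = sequence[position + 1:position + 1 + flank]
    return left.rjust(flank, "X") + right.ljust(flank, "X")


def composition(peptide):
    """Return the fraction of each standard amino acid in the peptide.

    Padding and non-standard residues are ignored in the denominator.
    """
    counts = Counter(peptide)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy example: a made-up 10-residue sequence with the variant at index 4.
peptide = sap_peptide("MKTAYIAKQR", 4, 6)   # length 12, the shortest length in the paper
features = composition(peptide)             # 20-dimensional feature vector
```

Under these assumptions, scanning peptide lengths from 12 to 202 would correspond to repeating the extraction with `flank` from 6 to 101 and comparing model performance at each setting.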

Results: The optimal models achieve accuracy above 90% and an Area Under the Curve (AUC) above 0.9 on all 11 external testing datasets. Finally, performance on an independent test set, with accuracy above 95%, demonstrates the superiority of our method.
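For readers less familiar with the two reported metrics, the following minimal sketch shows how accuracy and AUC can be computed for a binary disease-association classifier. It is not the authors' evaluation code: the labels and scores are invented toy values, the helper names `accuracy` and `roc_auc` are hypothetical, and AUC is computed through its pairwise-ranking definition (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Labels are 1 (disease-associated) or 0 (not associated).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy values for illustration only; they are not results from the paper.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy depends on the chosen decision threshold (0.5 here), whereas AUC summarizes ranking quality across all thresholds, which is why the two are reported together.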

Conclusion: In this paper, based on Random Forest (RF), we constructed 11 disease-association prediction models for SAPs at the protein sequence level. All models yield prediction accuracy above 90% and an Area Under the Curve (AUC) above 0.9. Because our method uses only protein sequence information, it is more universal than methods that depend on additional information or predictions about the proteins.

Keywords: Single amino acid polymorphisms, random forest, protein sequence, disease-specific prediction, binding site, optimal features.

© 2022 Bentham Science Publishers