Nglyc: A Random Forest Method for Prediction of N-Glycosylation Sites in Eukaryotic Protein Sequence

Author(s): Ganesan Pugalenthi, Varadharaju Nithya, Kuo-Chen Chou, Govindaraju Archunan*

Journal Name: Protein & Peptide Letters

Volume 27, Issue 3, 2020

Abstract:

Background: N-glycosylation is one of the most important post-translational mechanisms in eukaryotes. It occurs predominantly in the N-X-[S/T] sequon, where X is any amino acid other than proline. However, not all N-X-[S/T] sequons in proteins are glycosylated. Accurate prediction of N-glycosylation sites is therefore essential to understanding the N-glycosylation mechanism.
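
As an illustration of the sequon definition above, the following minimal Python sketch lists candidate N-X-[S/T] positions in a sequence. It is not part of Nglyc: the function name `find_sequons` and the example sequence are hypothetical, and the scan only finds candidate sequons; it does not predict which of them are actually glycosylated.

```python
import re

# N-X-[S/T] with X != proline. The lookahead consumes only the Asn,
# so overlapping sequons (e.g. the two in "NNST") are both reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence):
    """Return 1-based positions of Asn residues that start an N-X-[S/T] sequon."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


# Made-up sequence for illustration only (not taken from the article's datasets).
example = "MNGTANPSNNSTQ"
print(find_sequons(example))  # NGT, NNS and NST are reported; NPS is skipped
```

Each position returned this way is only a candidate site, which is why a classifier is needed to separate glycosylated from non-glycosylated sequons.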

Objective: This study aims to develop a computational method to predict N-glycosylation sites in eukaryotic protein sequences.

Methods: We report a random forest method, Nglyc, that predicts N-glycosylation sites from protein sequence using 315 sequence features. The method was trained on a dataset of 600 N-glycosylation sites and 600 non-glycosylation sites and tested on a dataset containing 295 N-glycosylation sites and 253 non-glycosylation sites. Nglyc predictions were compared with those of the NetNGlyc, EnsembleGly and GPP methods. Further, the performance of Nglyc was evaluated using human and mouse N-glycosylation sites.

Results: The Nglyc method achieved an overall training accuracy of 0.8033 with all 315 features. Performance comparison with the NetNGlyc, EnsembleGly and GPP methods shows that Nglyc performs better than the other methods, with high sensitivity and specificity.

Conclusion: Our method achieved an overall accuracy of 0.8248, with a sensitivity of 0.8305 and a specificity of 0.8182. The comparison study shows that our method performs better than the other methods. The applicability and success of our method were further evaluated using human and mouse N-glycosylation sites. The Nglyc method is freely available at https://github.com/bioinformaticsML/Ngly.
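
For readers unfamiliar with these measures, the short Python sketch below shows how sensitivity, specificity and accuracy follow from a confusion matrix. The abstract does not report the confusion matrix itself; the counts used here are an assumption, back-calculated so that they match the reported rates on a set of 295 glycosylation sites and 253 non-glycosylation sites (the test-set sizes given in the Methods). The function name `rates` is hypothetical.

```python
def rates(tp, fn, tn, fp):
    """Compute sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # fraction of true sites recovered
    specificity = tn / (tn + fp)                 # fraction of non-sites rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # fraction of all calls that are correct
    return sensitivity, specificity, accuracy


# ASSUMED counts, inferred from the reported rates; not stated in the abstract.
tp, fn = 245, 50   # 295 N-glycosylation sites
tn, fp = 207, 46   # 253 non-glycosylation sites

sens, spec, acc = rates(tp, fn, tn, fp)
print(f"sensitivity={sens:.4f} specificity={spec:.4f} accuracy={acc:.4f}")
```

Under these assumed counts the three rates round to 0.8305, 0.8182 and 0.8248, which suggests the figures in the Conclusion refer to the 548-site test set.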

Keywords: N-glycosylation, protein sequence, protein function, glycosites, glycoproteins, machine learning method.




Article Details

VOLUME: 27
ISSUE: 3
Year: 2020
Page: [178 - 186]
Pages: 9
DOI: 10.2174/0929866526666191002111404
