Background: N-Glycosylation is one of the most important post-translational
mechanisms in eukaryotes. N-glycosylation predominantly occurs in the N-X-[S/T] sequon, where X is
any amino acid other than proline. However, not all N-X-[S/T] sequons in proteins are
glycosylated. Therefore, accurate prediction of N-glycosylation sites is essential to understand N-glycosylation.
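As an illustration of the sequon definition above, the following minimal sketch scans a protein sequence for candidate N-X-[S/T] positions using a regular expression. The function name `find_sequons` and the example sequence are hypothetical and not taken from Nglyc; the sketch only locates candidate sequons and does not predict whether they are glycosylated.

```python
import re

# N, then any residue except proline, then S or T.
# A lookahead is used so that overlapping sequons (e.g. "NNTS") are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence):
    """Return the 1-based position of the Asn in every N-X-[S/T] sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]


# Hypothetical toy sequence: "NAS" matches, "NPT" is excluded (X is proline),
# and "NNT" / "NTS" overlap but are both reported.
print(find_sequons("MNASNPTNNTS"))  # [2, 8, 9]
```

Each reported position is only a candidate site; a classifier is still needed to decide which candidates are actually glycosylated.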
Objective: We aim to develop a computational method to predict N-glycosylation
sites in eukaryotic protein sequences.
Methods: We report a random forest method, Nglyc, to predict N-glycosylation sites
from protein sequence using 315 sequence features. The method was trained on a dataset of 600
N-glycosylation sites and 600 non-glycosylation sites and tested on a dataset containing 295
N-glycosylation sites and 253 non-glycosylation sites. Nglyc predictions were compared with those of the
NetNGlyc, EnsembleGly and GPP methods. Further, the performance of Nglyc was evaluated using
human and mouse N-glycosylation sites.
Result: Nglyc achieved an overall training accuracy of 0.8033 with all 315 features.
Performance comparison with the NetNGlyc, EnsembleGly and GPP methods shows that Nglyc
performs better than the other methods, with high sensitivity and specificity.
Conclusion: Our method achieved an overall accuracy of 0.8248 with 0.8305 sensitivity and
0.8182 specificity. The comparison study shows that our method performs better than the other
methods. The applicability and success of our method were further evaluated using human and mouse
N-glycosylation sites. Nglyc is freely available at https://github.com/bioinformaticsML/
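
The sensitivity, specificity and accuracy quoted above follow from confusion-matrix counts in the usual way. The sketch below shows the arithmetic. The counts it uses are not reported in the abstract: they are hypothetical values, back-calculated on the assumption that the quoted rates refer to the test set of 295 N-glycosylation sites and 253 non-glycosylation sites.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                # fraction of true sites recovered
    specificity = tn / (tn + fp)                # fraction of non-sites rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # fraction of all calls that are correct
    return sensitivity, specificity, accuracy


# Assumed counts (not from the source): 245 of 295 positives and
# 207 of 253 negatives classified correctly.
sens, spec, acc = confusion_metrics(tp=245, fn=50, tn=207, fp=46)
print(f"sensitivity={sens:.4f} specificity={spec:.4f} accuracy={acc:.4f}")
# sensitivity=0.8305 specificity=0.8182 accuracy=0.8248
```

Under that assumption the three reported figures are mutually consistent, since (245 + 207) / 548 rounds to 0.8248.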