
Current Bioinformatics


ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

PoGB-pred: Prediction of Antifreeze Proteins Sequences Using Amino Acid Composition with Feature Selection Followed by a Sequential-based Ensemble Approach

Author(s): Affan Alim*, Abdul Rafay and Imran Naseem

Volume 16, Issue 3, 2021

Published on: 07 July, 2020

Page: [446 - 456] Pages: 11

DOI: 10.2174/1574893615999200707141926



Background: Proteins contribute to nearly every task of cellular life. Their functions include building and repairing tissues in humans and other organisms; they are the building blocks of bones, muscles, cartilage, skin, and blood. Antifreeze proteins (AFPs) are of prime importance to organisms that live in very cold environments. With the help of these proteins, cold-water organisms can survive at sub-zero temperatures and resist ice crystallization, which may otherwise rupture internal cells and tissues. AFPs have also attracted interest in the food industry and in cryopreservation.

Objective: With the growing availability of genomic sequence data for proteins, an automated tool for AFP recognition and identification is badly needed. The sequences and structures of AFPs are highly diverse, so most proposed methods fail to show promising results across different structures. A consolidated method is proposed that delivers competitive performance on highly diverse AFP structures.

Methods: In this study, a machine learning approach that combines Principal Component Analysis (PCA) with Gradient Boosting (GB) is proposed for antifreeze protein identification. To evaluate and validate the proposed model, various combinations of amino acid and dipeptide compositions computed over two sequence segments are used as features. PCA is used for dimensionality reduction while retaining the high-variance directions of the data, and is followed by an ensemble method, gradient boosting, for modeling and classification.
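The feature construction described in the Methods can be illustrated with a minimal sketch. It assumes the sequence is split at its midpoint into two segments and that the amino acid composition (20 values) and dipeptide composition (400 values) are computed for each segment. The abstract does not state how the segments are defined, so the midpoint split, the function names, and the toy sequence are all hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def amino_acid_composition(seq):
    """Return the fraction of each standard residue in seq (20 values)."""
    total = sum(seq.count(a) for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / total for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return the fraction of each adjacent residue pair in seq (400 values)."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard symbols
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def two_segment_features(seq):
    """Concatenate both compositions for each half of the sequence.

    Assumption: the two segments are the halves obtained by a midpoint split.
    """
    seq = seq.upper()
    mid = len(seq) // 2
    features = []
    for segment in (seq[:mid], seq[mid:]):
        features.extend(amino_acid_composition(segment))
        features.extend(dipeptide_composition(segment))
    return features


vector = two_segment_features("MKTAYIAKQRAAAA")  # hypothetical toy sequence
print(len(vector))  # 2 * (20 + 400) = 840
```

Under these assumptions each protein maps to a fixed-length vector regardless of sequence length, which is what makes a downstream dimension-reduction step such as PCA applicable.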

Results: The proposed method achieved superior performance on the PDB, Pfam, and UniProt datasets compared with the RAFP-Pred method. In Experiment 3, using only 150 PCA components, it achieved an accuracy of 89.63%, compared with the 87.41% reported for RAFP-Pred using 300 significant features. Experiment 2 was conducted using two different data sources: non-AFPs from the PISCES server and AFPs from the Protein Data Bank. In this experiment, the proposed method attained a sensitivity of 79.16%, which is 12.50% better than the state-of-the-art RAFP-Pred method.
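For readers unfamiliar with the reported metrics, the following sketch shows how accuracy and sensitivity are conventionally computed from confusion-matrix counts, with AFPs treated as the positive class. The counts used here are invented for illustration and are not taken from the paper; the function name is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute standard binary-classification metrics from confusion counts.

    tp/fn: AFPs predicted correctly / missed; tn/fp: non-AFPs predicted
    correctly / wrongly flagged as AFPs.
    """
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true AFPs recovered
        "specificity": tn / (tn + fp),  # fraction of non-AFPs rejected
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Made-up counts, for illustration only (not results from the study).
metrics = classification_metrics(tp=8, fn=2, tn=85, fp=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Because AFPs are typically far rarer than non-AFPs in such test sets, sensitivity can differ substantially from accuracy, which is why the two are reported separately.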

Conclusion: AFPs share a common function but have distinct structures. Therefore, a single model developed for different sequences often fails for AFPs. The proposed model showed robust results across diverse training and testing datasets and outperformed the previous AFP prediction method, RAFP-Pred. The model consists of PCA for dimension reduction followed by gradient boosting for classification. Owing to its simplicity, scalability, and high performance, the model can be easily extended to the analysis of proteomic and genomic datasets.
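The two-stage design (PCA, then gradient boosting) can be sketched with scikit-learn, assuming that library and NumPy are available. This is not the authors' implementation: the data below are random synthetic vectors, the 840-feature width and the hyperparameters are illustrative assumptions, and only the choice of 150 components echoes the figure quoted for Experiment 3.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for composition features (NOT real protein data).
rng = np.random.default_rng(0)
X = rng.random((400, 840))
y = rng.integers(0, 2, size=400)
X[y == 1, :20] += 0.5  # inject a weak class signal so there is something to learn

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Stage 1 reduces 840 features to 150 principal components;
# stage 2 fits a gradient-boosted tree ensemble on the reduced data.
model = Pipeline([
    ("pca", PCA(n_components=150, random_state=0)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
])
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("held-out accuracy on synthetic data:", (pred == y_test).mean())
```

Wrapping both stages in a single `Pipeline` ensures the PCA projection is fitted on the training split only and then reused unchanged on held-out data, which avoids leaking test information into the reduction step.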

Keywords: Protein, antifreeze protein, PCA, gradient boosting, classifier, identification.

