Title:Identification of Cancerlectins By Using Cascade Linear Discriminant Analysis and Optimal g-gap Tripeptide Composition
VOLUME: 15 ISSUE: 6
Author(s):Liangwei Yang, Hui Gao*, Keyu Wu, Haotian Zhang, Changyu Li and Lixia Tang
Affiliation:Center for Informational Biology, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Center for Informational Biology, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu
Keywords:Cancerlectin, cascade LDA, g-gap tripeptide composition, SVM, protein, ANOVA.
Abstract:
Background: Lectins are a diverse group of glycoproteins or glycoconjugate proteins
that can be extracted from plants, invertebrates and higher animals. Cancerlectins, a class of lectins,
play a key role in the interactions between tumor cells and are being employed
as therapeutic agents. A full understanding of cancerlectins is important because it could help
guide the future direction of cancer therapy.
Objective: To develop an accurate, practical and time-saving tool to identify cancerlectins.
A novel sequence-based method is proposed, along with an accompanying webserver that provides
access to it.
Methods: First, protein features were extracted using a new feature construction scheme termed
g-gap tripeptide composition. A proposed cascade linear discriminant analysis (Cascade
LDA) was then used to alleviate the difficulties of high dimensionality, with Analysis of Variance
(ANOVA) as the feature importance criterion. Finally, a Support Vector Machine (SVM) was used as
the classifier to identify cancerlectins.
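The feature extraction step can be illustrated with a minimal sketch. The abstract does not give the exact definition of g-gap tripeptide composition, so the code below assumes one common reading: three residues are taken with the same gap of g positions between consecutive residues, and each of the 8000 possible tripeptides is reported as a frequency over all valid windows. The function name, the single shared gap, and the normalisation are assumptions made for illustration, not the authors' implementation.

```python
from itertools import product

# The 20 standard amino acids; windows containing other symbols are skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20^3 = 8000 tripeptides in a fixed order (hypothetical feature order).
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]


def g_gap_tripeptide_composition(sequence, g):
    """Return an 8000-dimensional frequency vector for one protein sequence.

    Assumption: residues at positions i, i+g+1 and i+2(g+1) form one
    g-gap tripeptide, so g = 0 gives the ordinary tripeptide composition.
    """
    step = g + 1          # distance between consecutive residues in a window
    span = 2 * step       # distance from the first to the third residue
    counts = dict.fromkeys(TRIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - span):
        tri = sequence[i] + sequence[i + step] + sequence[i + span]
        if tri in counts:  # ignore windows with non-standard residues
            counts[tri] += 1
            total += 1
    if total == 0:         # sequence too short for this gap
        return [0.0] * len(TRIPEPTIDES)
    return [counts[t] / total for t in TRIPEPTIDES]


# Usage: a toy sequence, not real data.
features = g_gap_tripeptide_composition("ACDEF", g=1)
```

With 8000 raw features per gap value, some form of feature ranking and dimensionality reduction is needed before classification, which is the role the Methods text assigns to ANOVA and Cascade LDA.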
Results: The proposed method achieved an accuracy of 91.34%, a sensitivity of 89.89%, a specificity
of 92.48% and a Matthews correlation coefficient of 0.8318 using only 13 fusion features
in jackknife cross-validation. This result is superior to those of other published methods in this domain.
Conclusion: In this study, a new method based only on the primary structure of proteins is proposed,
and experimental results show that it could be a promising tool for identifying cancerlectins. An open-access
webserver is made available to facilitate related work.