Title: Discrimination of Thermophilic and Mesophilic Proteins Using Reduced Amino Acid Alphabets with n-Grams
VOLUME: 7 ISSUE: 2
Author(s): Aydin Albayrak and Ugur O. Sezerman
Affiliation: Biological Sciences and Bioengineering, Sabanci University, Orhanli, Tuzla, Istanbul, Turkey.
Keywords: Amino acid composition, dipeptide, N-grams, reduced amino acid alphabets, statistically significant features,
thermostability, tripeptide, Xylella fastidiosa, homologous proteins
Abstract: Protein thermostabilization has been the focus of recent research due to growing interest in the production of
enzymes that can operate at temperatures that are industrially beneficial. Understanding the determinants of
thermostabilization at the level of sequence and structure is important for designing such enzymes. A bioinformatics
approach was used to determine the extent to which reduced amino acid alphabets (RAAA) with n-grams (subsequences
of length n), subjected to a t-test-based feature selection procedure, can be used to discriminate proteins from
thermophiles and mesophiles. Classification performance of 65 different protein alphabets with 3 different n-gram sizes
was systematically evaluated using support vector machines on a test set that contained 707 proteins from mesophilic
Xylella fastidiosa and thermophilic Aquifex aeolicus. A classification accuracy of 91.796% was achieved with Hsdm16
RAAA with 13 features: EK-ILV-ST-A-G-F-H-Q-N-R-M-W-Y. The t-test-based feature selection procedure reduced the
classification time without significantly affecting classification accuracy. The overall combination of methods in this
paper is useful and computationally fast for classifying protein sequences from thermophiles and mesophiles using
sequence information alone.
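
The feature pipeline summarized in the abstract (reduce the alphabet, count n-grams, rank features with a t-test) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The grouping string is copied from the abstract, but several choices are assumptions: each group is represented by its first letter, residues not listed in the grouping are left unchanged, n-gram counts are normalized to frequencies, and Welch's unequal-variance form of the t statistic is used. The function names `reduce_sequence`, `ngram_frequencies`, and `welch_t` are hypothetical. The support vector machine step is omitted because it would require a third-party library.

```python
from collections import Counter
from math import copysign, inf, sqrt
from statistics import mean, variance

# Grouping string as printed in the abstract.
GROUPS = "EK-ILV-ST-A-G-F-H-Q-N-R-M-W-Y".split("-")

# Assumption: every residue is replaced by the first letter of its group.
RESIDUE_TO_GROUP = {res: group[0] for group in GROUPS for res in group}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet.

    Assumption: residues absent from GROUPS are passed through unchanged.
    """
    return "".join(RESIDUE_TO_GROUP.get(r, r) for r in seq.upper())


def ngram_frequencies(seq, n):
    """Return relative frequencies of overlapping n-grams in seq."""
    total = len(seq) - n + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + n] for i in range(total))
    return {gram: c / total for gram, c in counts.items()}


def welch_t(a, b):
    """Welch's t statistic for one feature measured in two classes.

    a and b are lists of feature values (at least two values each).
    """
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # Both classes are constant: no evidence if equal, else unbounded.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / se


if __name__ == "__main__":
    reduced = reduce_sequence("KLVT")
    print(reduced)
    print(ngram_frequencies(reduced, 2))
```

In a workflow of this shape, a feature (one n-gram) would be kept when the absolute value of its t statistic across the thermophilic and mesophilic training proteins exceeds a chosen threshold, and only the retained features would be passed to the classifier.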