Aim and Objective: The rapid growth in available protein sequence data has created an urgent
need for novel computational algorithms to analyze and compare these sequences. This study
aims to develop an efficient computational approach for rapidly encoding protein sequences and
extracting the information hidden in them.
Methods: Based on two physicochemical properties of amino acids, a protein primary sequence was
converted into a three-letter sequence, from which a simple graph (one without loops or multiple
edges) and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino
acid composition) model was then constructed to characterize the protein sequence numerically.
Results: Using the proposed mathematical descriptor of a protein sequence, similarity
comparisons were made among the β-globin proteins of 17 species and among 72 coronavirus spike
proteins. The resulting clusters agreed well with the established taxonomic groups. In
addition, a generalized PseAAC-based SVM (support vector machine) model was developed to
identify DNA-binding proteins. Experimental results showed that our method performed better than
DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206
in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded
with negative samples, the presented approach again outperformed the four previous methods,
improving ACC by 2.49-19.12%, MCC by 0.05-0.32, and F1M by 3.82-33.85%.
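For readers unfamiliar with the three evaluation measures quoted above, the following minimal sketch shows how they are conventionally computed from the confusion counts of a binary classifier. It assumes that ACC denotes accuracy, MCC the Matthews correlation coefficient, and F1M the F1-measure; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return ACC, MCC and F1M (as fractions, not percentages).

    tp, tn, fp, fn are the counts of true positives, true negatives,
    false positives and false negatives. The name and interface are
    illustrative assumptions, not the authors' code.
    """
    total = tp + tn + fp + fn
    acc = (tp + tn) / total

    # MCC is conventionally set to 0 when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"ACC": acc, "MCC": mcc, "F1M": f1}


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the formulas.
    print(binary_metrics(tp=40, tn=45, fp=5, fn=10))
```

The function returns ACC and F1M as fractions in [0, 1], whereas the abstract reports them as percentages; MCC lies in [-1, 1] and is therefore reported as an absolute difference.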
Conclusion: These results suggest that the generalized PseAAC model is very efficient for the
comparison and analysis of protein sequences, and very competitive in identifying DNA-binding