Background: Proteins are important organic molecules in living organisms, and they vary in
sequence, structure, and function. Protein evolutionary classification is a popular
research topic in computational bioinformatics. Many studies have used protein sequence
information to classify the evolutionary relationships of proteins. As the amount of protein
sequence data increases, efficient computational tools are needed to classify proteins by
evolutionary relationship with high accuracy in the big data paradigm.
Methods: In this study, we propose a simple and efficient computational approach that uses
normalized mutual information rates to quantify the relationship between protein sequences.
We then use the “distances” defined on these relationships to perform evolutionary classification
of proteins. The new method is computationally efficient, model-free, and unsupervised, so it does
not require training data when performing classifications.
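
As a rough illustration of the idea of an information-based “distance” between two sequences, the following minimal sketch computes a plug-in normalized mutual information from position-wise symbol counts and converts it to a distance, d = 1 - I(X;Y)/H(X,Y). This is an assumption made for illustration only: it is not the mutual information rate estimator used in this study, it simply truncates both sequences to the shorter length instead of aligning them, and the function names `entropy` and `nmi_distance` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(counts, total):
    """Plug-in Shannon entropy (in bits) of an empirical count table."""
    return -sum((c / total) * log2(c / total) for c in counts.values())


def nmi_distance(seq_a, seq_b):
    """Illustrative distance 1 - I(X;Y) / H(X,Y) between two sequences.

    Assumption: symbols are paired position by position after truncating
    to the shorter length. This is a simplification, not the paper's
    mutual information rate estimator.
    """
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        raise ValueError("both sequences must be non-empty")
    a, b = seq_a[:n], seq_b[:n]

    h_a = entropy(Counter(a), n)            # marginal entropy H(X)
    h_b = entropy(Counter(b), n)            # marginal entropy H(Y)
    h_ab = entropy(Counter(zip(a, b)), n)   # joint entropy H(X,Y)

    if h_ab == 0.0:
        # Both sequences are constant, so they carry no information
        # that could distinguish them under this measure.
        return 0.0

    mutual_info = h_a + h_b - h_ab          # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return 1.0 - mutual_info / h_ab


if __name__ == "__main__":
    print(nmi_distance("ACDEFG", "ACDEFG"))  # identical sequences
    print(nmi_distance("AACC", "ACAC"))      # position-wise independent
```

A matrix of such pairwise distances could then be passed to any standard unsupervised procedure, such as hierarchical clustering, which matches the training-free setting described above. Note that this position-wise measure is invariant to relabeling of symbols, one reason it should be read as a sketch and not as a substitute for the rate-based estimator.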
Results: Simulation studies on various examples demonstrate the efficiency of the new method.
We use precision-recall curves to compare the performance of our new method with that of
traditional methods. The results show that the new method outperforms the traditional methods
in most cases when performing evolutionary classifications.
Conclusion: The new method is simple and has been shown to be efficient for protein evolutionary
classification, which makes it useful for future evolutionary analyses, particularly in the big data paradigm.