Title:An Information Gain-based Method for Evaluating the Classification Power of Features Towards Identifying Enhancers
VOLUME: 15 ISSUE: 6
Author(s):Tianjiao Zhang, Rongjie Wang, Qinghua Jiang* and Yadong Wang*
Affiliation:School of Computer Science and Technology, Harbin Institute of Technology, Harbin, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, School of Life Science and Technology, Harbin Institute of Technology, Harbin, School of Computer Science and Technology, Harbin Institute of Technology, Harbin
Keywords:Enhancer, gene expression regulation, sequence features, transcriptional features, epigenetic features, information gain.
Abstract:
Background: Enhancers are cis-regulatory DNA sequences that enhance gene expression.
Since most enhancers are located far from transcription start sites, they are difficult
to identify. As with other regulatory elements, the regions around enhancers carry a variety of
features that can help in enhancer recognition.
Objective: The classification power of features differs significantly, so the performance of existing
methods that use one or a few features to identify enhancers varies greatly. Evaluating
the classification power of each feature can therefore help improve enhancer prediction.
Methods: We present an evaluation method based on Information Gain (IG), which measures how much
each feature reduces the entropy of enhancer recognition. To validate the
method, we measured the Single Feature Prediction Accuracy (SFPA) of
each feature.
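As an illustration of the IG idea described in Methods, the following minimal sketch computes the information gain of a single numeric feature with respect to enhancer/non-enhancer labels. The abstract does not state how feature values are discretized, so the equal-width binning, the function names `entropy` and `information_gain`, and the `n_bins` parameter are all assumptions made for this example, not the authors' implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels, n_bins=2):
    """IG = H(Y) - H(Y | binned X).

    Assumption: the feature is discretized into equal-width bins;
    the source does not specify a discretization scheme.
    """
    lo, hi = min(feature_values), max(feature_values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature
    bins = {}
    for x, y in zip(feature_values, labels):
        idx = min(int((x - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        bins.setdefault(idx, []).append(y)
    n = len(labels)
    conditional = sum(len(ys) / n * entropy(ys) for ys in bins.values())
    return entropy(labels) - conditional


# Toy data (made up for illustration): 1 = enhancer, 0 = non-enhancer.
labels = [1, 1, 0, 0]
separating_feature = [0.9, 0.8, 0.1, 0.2]     # splits the classes cleanly
uninformative_feature = [0.9, 0.1, 0.9, 0.1]  # carries no class information
ig_separating = information_gain(separating_feature, labels)
ig_uninformative = information_gain(uninformative_feature, labels)
```

A feature that separates the two classes perfectly recovers the full label entropy, while a feature unrelated to the labels yields an IG of zero; real features fall between these extremes.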
Results: The average IG values of the sequence, transcriptional, and epigenetic
features are 0.068, 0.213, and 0.299, respectively. Under SFPA, the corresponding average AUC values
are 0.534, 0.605, and 0.647.
The verification results are consistent with our IG-based evaluation results.
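To make the AUC figures above concrete, here is a hedged sketch of how an AUC can be computed from a single feature. It assumes the raw feature value is used directly as the prediction score; the abstract does not say which classifier SFPA uses, so this is only one plausible reading. The function name `single_feature_auc` and the toy data are hypothetical.

```python
def single_feature_auc(scores, labels):
    """AUC as the probability that a random positive example outscores
    a random negative example, counting ties as 0.5.

    Assumption: the feature value itself serves as the score.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration): 1 = enhancer, 0 = non-enhancer.
labels = [1, 1, 0, 0]
auc_separating = single_feature_auc([0.9, 0.8, 0.1, 0.2], labels)
auc_uninformative = single_feature_auc([0.9, 0.1, 0.8, 0.2], labels)
```

An AUC of 0.5 corresponds to chance-level ranking, which puts the reported sequence-feature average of 0.534 only slightly above chance.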
Conclusion: This IG-based method can effectively evaluate the classification power of features for
identifying enhancers. Compared with sequence features, epigenetic features are more effective for
recognizing enhancers.