An Information Gain-based Method for Evaluating the Classification Power of Features Towards Identifying Enhancers

Author(s): Tianjiao Zhang, Rongjie Wang, Qinghua Jiang*, Yadong Wang*

Journal Name: Current Bioinformatics

Volume 15, Issue 6, 2020



Abstract:

Background: Enhancers are cis-regulatory DNA elements that increase the expression of their target genes. Because most enhancers are located far from transcription start sites, they are difficult to identify. As with other regulatory elements, the regions around enhancers carry a variety of features that can aid enhancer recognition.

Objective: The classification power of features differs significantly, and the performance of existing methods that use one or a few features to identify enhancers varies greatly. Evaluating the classification power of each feature can therefore help improve the performance of enhancer prediction.

Methods: We present an evaluation method based on Information Gain (IG) that captures the change in the entropy of enhancer recognition attributable to each feature. To validate the method, we conducted experiments measuring the Single Feature Prediction Accuracy (SFPA) of each feature.
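As an illustration of the quantity involved, the following is a minimal sketch of information gain for a single numeric feature, computed as the entropy of the class labels minus the entropy that remains once the feature value is known. The equal-width binning, the function names, and the toy data are assumptions made for this example; the abstract does not say how feature values are discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels, n_bins=2):
    """IG = H(labels) - H(labels | binned feature).

    Equal-width binning is an assumption of this sketch, not
    necessarily the discretization used in the paper.
    """
    lo, hi = min(feature_values), max(feature_values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    bins = [min(int((v - lo) / width), n_bins - 1) for v in feature_values]

    total = len(labels)
    conditional = 0.0
    for b in set(bins):
        subset = [y for x, y in zip(bins, labels) if x == b]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


# Toy data (hypothetical): 1 = enhancer, 0 = non-enhancer.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
informative = [0.9, 0.8, 0.7, 0.95, 0.1, 0.2, 0.15, 0.3]
uninformative = [0.1, 0.9, 0.1, 0.9, 0.1, 0.9, 0.1, 0.9]

ig_informative = information_gain(informative, labels)
ig_uninformative = information_gain(uninformative, labels)
```

In this toy setting the feature that separates the two classes perfectly attains the maximum gain for a balanced binary problem (1 bit), while the feature that is independent of the label gains nothing.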

Results: The average IG values of the sequence, transcriptional, and epigenetic features are 0.068, 0.213, and 0.299, respectively. Under SFPA, the corresponding average AUC values are 0.534, 0.605, and 0.647. These verification results are consistent with the IG-based evaluation: both rank epigenetic features highest and sequence features lowest.
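For readers unfamiliar with the AUC figures quoted here, the sketch below shows one way a per-feature AUC can be computed: as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as half. Using the raw feature value directly as the score is an assumption of this example; the abstract does not specify which classifier SFPA uses, and the function name and toy data are hypothetical.

```python
def single_feature_auc(scores, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs in which
    the positive example has the higher score; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = enhancer, 0 = non-enhancer.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
separating_feature = [0.9, 0.8, 0.7, 0.95, 0.1, 0.2, 0.15, 0.3]
uninformative_feature = [0.1, 0.9, 0.1, 0.9, 0.1, 0.9, 0.1, 0.9]

auc_separating = single_feature_auc(separating_feature, toy_labels)
auc_uninformative = single_feature_auc(uninformative_feature, toy_labels)
```

An AUC of 0.5 corresponds to chance-level ranking, which is why the average of 0.534 for sequence features indicates weak discriminative power compared with 0.647 for epigenetic features.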

Conclusion: This IG-based method can effectively evaluate the classification power of features for identifying enhancers. Compared with sequence features, epigenetic features are more effective for recognizing enhancers.

Keywords: Enhancer, gene expression regulation, sequence features, transcriptional features, epigenetic features, information gain.




Article Details

VOLUME: 15
ISSUE: 6
Year: 2020
Published on: 11 November, 2020
Page: [574 - 580]
Pages: 7
DOI: 10.2174/1574893614666191120141032
