The “Gene Cube”: A Novel Approach to Three-dimensional Clustering of Gene Expression Data

George I. Lambrou; Maria Sdraka; Dimitrios Koutsouris

Abstract

Background: A very popular technique for isolating significant genes from cancerous tissues is the application of various clustering algorithms on data obtained by DNA microarray experiments.

Aim: The objective of the present work is to take the chromosomal identity of every gene into account before clustering, by creating a three-dimensional structure of the form Chromosomes×Genes×Samples. The k-means algorithm and a triclustering technique called δ-TRIMAX are then applied independently to this structure.
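As a minimal sketch of the Chromosomes×Genes×Samples structure, the following Python snippet rearranges an ordinary genes×samples expression matrix into a 3D array. The function name `build_gene_cube`, the toy data, and the choice to pad chromosomes that carry fewer genes with NaN are assumptions made for illustration; the abstract does not state how unequal gene counts are handled.

```python
import numpy as np

def build_gene_cube(expression, chromosomes):
    """Arrange a genes x samples matrix into a chromosomes x genes x samples cube.

    expression  : 2D array-like of shape (n_genes, n_samples)
    chromosomes : one chromosome label per gene (row of `expression`)

    Chromosomes carry different numbers of genes, so shorter slices are
    padded with NaN (an assumption of this sketch).
    """
    expression = np.asarray(expression, dtype=float)
    chromosomes = np.asarray(chromosomes)
    labels = sorted(set(chromosomes.tolist()))
    # The gene axis must be long enough for the most populated chromosome.
    max_genes = max(int(np.sum(chromosomes == c)) for c in labels)
    cube = np.full((len(labels), max_genes, expression.shape[1]), np.nan)
    for i, c in enumerate(labels):
        rows = expression[chromosomes == c]  # all genes on chromosome c
        cube[i, :rows.shape[0], :] = rows
    return labels, cube

# Toy data: three genes, two samples, two chromosomes (hypothetical values).
expression = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
chromosomes = ["chr1", "chr2", "chr1"]
labels, cube = build_gene_cube(expression, chromosomes)
```

Here `cube[i, j, s]` holds the expression of the j-th gene of chromosome `labels[i]` in sample `s`, and padded positions are NaN.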

Materials and Methods: The present algorithm was developed in the Python programming language (v. 3.5.1). We used two distinct public datasets containing healthy control samples and tissue samples from bladder cancer patients. Background correction was performed by subtracting the median global background from the median local background from the signal intensity. The quantile normalization method was applied for sample normalization. Three known algorithms were applied to test the “gene cube”: a classical k-means, a transformed 3D k-means and δ-TRIMAX.
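Quantile normalization can be sketched in a few lines of NumPy. This is a generic illustration of the technique rather than the authors' code: the function name `quantile_normalize` and the toy matrix are hypothetical, and the sketch assumes a complete genes×samples matrix with no missing values and no special handling of ties.

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns (samples) of a genes x samples matrix.

    Every sample ends up with the same distribution: the mean of the sorted
    values across samples. Ties are broken by position (an assumption of
    this sketch; tie averaging is not implemented).
    """
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0)        # per-sample sort order
    ranks = np.argsort(order, axis=0)    # rank of each gene within its sample
    reference = np.sort(x, axis=0).mean(axis=1)  # mean value at each rank
    return reference[ranks]              # replace each value by its rank's mean

# Toy matrix: three genes (rows), two samples (columns), hypothetical values.
raw = np.array([[2.0, 4.0],
                [1.0, 6.0],
                [3.0, 2.0]])
normalized = quantile_normalize(raw)
```

After normalization each column contains the same set of values (here 1.5, 3.0 and 4.5), while the rank order of the genes within each sample is preserved.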

Results: Our proposed data structure consists of a 3D matrix of the form Chromosomes×Genes×Samples. Clustering analysis of that structure identified gene expression patterns among samples, genes and chromosomes.

Discussion: To the best of our knowledge, this is the first time that such a structure has been reported, and it constitutes a useful tool for gene classification from high-throughput gene expression experiments.
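To illustrate how a clustering algorithm might be run on such a structure, the sketch below applies a plain Lloyd-style k-means to the gene profiles of a small cube and maps the cluster labels back onto the chromosome×gene grid. It corresponds only to the classical k-means idea; it is not the transformed 3D k-means or δ-TRIMAX, whose details the abstract does not give. The function names, the NaN padding convention, and the toy values are assumptions.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on the rows of `points`."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every row to its nearest center (squared Euclidean distance).
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Move each center to the mean of its rows; empty clusters stay put.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def cluster_cube(cube, k):
    """Cluster gene profiles of a chromosomes x genes x samples cube.

    Returns a chromosomes x genes grid of cluster labels; positions padded
    with NaN in the cube (an assumption of this sketch) are labeled -1.
    """
    n_chrom, n_genes, n_samples = cube.shape
    flat = cube.reshape(-1, n_samples)       # one row per (chromosome, gene)
    valid = ~np.isnan(flat).any(axis=1)      # skip padded positions
    labels, _ = kmeans(flat[valid], k)
    grid = np.full(n_chrom * n_genes, -1)
    grid[valid] = labels
    return grid.reshape(n_chrom, n_genes)

# Toy cube: 2 chromosomes x 2 genes x 2 samples, one padded position.
toy_cube = np.array([[[0.0, 0.0], [10.0, 10.0]],
                     [[0.5, 0.5], [np.nan, np.nan]]])
grid = cluster_cube(toy_cube, k=2)
```

Because the labels are returned on the chromosome×gene grid, patterns can be read along either axis, for example by counting how often each cluster occurs per chromosome.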

Conclusions: Such approaches could prove useful for understanding disease mechanisms, and those of tumors in particular.

Keywords: Machine learning, clustering, chromosomes, gene expression, DNA microarrays, algorithms.



DOI: https://dx.doi.org/10.2174/1574893614666190116170406
Print ISSN: 1574-8936
Online ISSN: 2212-392X
Publisher Name: Bentham Science Publisher

Current Bioinformatics
