The “Gene Cube”: A Novel Approach to Three-dimensional Clustering of Gene Expression Data

Author(s): George I. Lambrou*, Maria Sdraka, Dimitrios Koutsouris

Journal Name: Current Bioinformatics

Volume 14, Issue 8, 2019

Background: A widely used technique for isolating significant genes from cancerous tissues is the application of clustering algorithms to data obtained from DNA microarray experiments.

Aim: The objective of the present work is to take the chromosomal identity of every gene into account before clustering, by creating a three-dimensional structure of the form Chromosomes×Genes×Samples. The k-means algorithm and a triclustering technique called δ-TRIMAX are then applied independently to this structure.
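As a rough illustration of the Chromosomes×Genes×Samples structure, the sketch below arranges a genes × samples matrix into a 3D NumPy array. It is not the authors' implementation: the function name `build_gene_cube`, the NaN padding for chromosomes with fewer genes, and the toy values are all assumptions made for this example.

```python
import numpy as np

def build_gene_cube(expr, chromosomes):
    """Arrange a genes x samples matrix into a chromosomes x genes x samples cube.

    Hypothetical helper: chromosomes with fewer genes than the largest one
    are padded with NaN rows (an assumption, not taken from the paper).
    """
    expr = np.asarray(expr, dtype=float)
    chromosomes = list(chromosomes)
    labels = sorted(set(chromosomes))
    max_genes = max(chromosomes.count(c) for c in labels)
    cube = np.full((len(labels), max_genes, expr.shape[1]), np.nan)
    next_slot = {c: 0 for c in labels}
    for row, chrom in zip(expr, chromosomes):
        # Place each gene's sample profile in the next free slot of its chromosome.
        cube[labels.index(chrom), next_slot[chrom]] = row
        next_slot[chrom] += 1
    return cube, labels

# Toy data: three genes, two samples, two chromosomes.
toy_expr = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_chromosomes = ["1", "X", "1"]
toy_cube, toy_labels = build_gene_cube(toy_expr, toy_chromosomes)
print(toy_cube.shape, toy_labels)
```

The first axis indexes chromosomes, the second indexes gene slots within a chromosome, and the third indexes samples, so `toy_cube[0]` holds every gene assigned to chromosome "1".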

Materials and Methods: The present algorithm was developed in the Python programming language (v. 3.5.1). We used two distinct public datasets containing healthy control samples and tissue samples from bladder cancer patients. Background correction was performed by subtracting the median global background from the median local background from the signal intensity. Quantile normalization was applied for sample normalization. Three known algorithms were applied to test the “gene cube”: a classical k-means, a transformed 3D k-means and δ-TRIMAX.
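Quantile normalization, the sample normalization step named above, can be sketched in a few lines of NumPy. This is a minimal textbook version and not the authors' code: the function name `quantile_normalize` is hypothetical, the input is assumed to be a genes × samples matrix without missing values, and ties are broken by sort order rather than averaged.

```python
import numpy as np

def quantile_normalize(matrix):
    """Force every sample (column) to share the same value distribution."""
    x = np.asarray(matrix, dtype=float)
    # Rank of each value within its own column (0 = smallest).
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    # Reference distribution: mean of the k-th smallest values across samples.
    reference = np.sort(x, axis=0).mean(axis=1)
    # Replace each value by the reference value at its rank.
    return reference[ranks]

# Toy matrix: four genes (rows) measured in three samples (columns).
raw = [[5, 4, 3],
       [2, 1, 4],
       [3, 6, 6],
       [4, 2, 8]]
normalized = quantile_normalize(raw)
print(normalized)
```

After normalization every column contains the same set of values, so differences between samples reflect only the ranking of genes within each sample.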

Results: Our proposed data structure consists of a 3D matrix of the form Chromosomes×Genes×Samples. Clustering analysis of that structure allowed us to identify gene expression patterns among samples, genes and chromosomes.

Discussion: To the best of our knowledge, this is the first time that such a structure has been reported, and it constitutes a useful tool for gene classification from high-throughput gene expression experiments.
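One way k-means clustering could be run on a Chromosomes×Genes×Samples array is sketched below. The abstract does not specify how the "transformed 3D k-means" works, so everything here is an assumption: each (chromosome, gene slot) cell is treated as one point described by its sample profile, NaN-padded cells are skipped, and plain Lloyd iterations with a seeded random start are used. The name `kmeans_on_cube` and the toy cube are hypothetical.

```python
import numpy as np

def kmeans_on_cube(cube, k, n_iter=50, seed=0):
    """Cluster the gene profiles stored in a chromosomes x genes x samples cube."""
    n_chrom, n_slots, n_samples = cube.shape
    flat = cube.reshape(-1, n_samples)
    filled = ~np.isnan(flat).any(axis=1)      # ignore padded (empty) cells
    points = flat[filled]
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every profile to its nearest center (Euclidean distance).
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        # Move each center to the mean of its members; keep it if it has none.
        new_centers = np.array([
            points[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = np.full(n_chrom * n_slots, -1)   # -1 marks padded cells
    labels[filled] = assign
    return labels.reshape(n_chrom, n_slots), centers

# Toy cube: two chromosomes, three gene slots, two samples.
demo_cube = np.array([
    [[0.0, 0.0], [0.1, 0.0], [np.nan, np.nan]],
    [[10.0, 10.0], [10.0, 10.1], [0.0, 0.1]],
])
demo_labels, demo_centers = kmeans_on_cube(demo_cube, k=2)
print(demo_labels)
```

Because the labels keep the chromosome × gene-slot layout, cluster membership can be read per chromosome, which is the kind of pattern the cube is meant to expose.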

Conclusions: Such approaches could prove useful for understanding disease mechanisms, and those of tumors in particular.

Keywords: Machine learning, clustering, chromosomes, gene expression, DNA microarrays, algorithms.


Article Details

Year: 2019
Page: [721 - 727]
Pages: 7
DOI: 10.2174/1574893614666190116170406