Different Approaches for Missing Data Handling in Fuzzy Clustering: A Review

Author(s): Sonia Goel, Meena Tushir*

Journal Name: Recent Advances in Electrical & Electronic Engineering
Formerly Recent Patents on Electrical & Electronic Engineering

Volume 13, Issue 6, 2020




Abstract:

Introduction: Incomplete data sets containing missing attributes are a prevailing problem in many research areas. Attributes may be missing for several reasons: human error in tabulating or recording the data, machine failure, errors in data acquisition, or the refusal of a patient or customer to answer a few questions in a questionnaire or survey. Clustering such data sets is therefore a challenge.

Objective: In this paper, we present a critical review of various methodologies proposed for handling missing data in clustering. The focus is a comparison of fuzzy c-means (FCM) clustering based on various imputation techniques with the four clustering strategies proposed by Hathaway and Bezdek.
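As an illustration of a non-imputation approach, the partial distance strategy is one of the four strategies usually attributed to Hathaway and Bezdek (the others being the whole data, optimal completion, and nearest prototype strategies). The sketch below is a minimal reading of that idea, not the implementation evaluated in this paper: distances are computed over the observed features only and rescaled by the fraction of features observed. The function name `partial_sq_distances` and the use of NaN to mark missing values are assumptions made for this example.

```python
import numpy as np

def partial_sq_distances(X, V):
    """Squared partial distances between rows of X (NaN = missing) and prototypes V.

    Only observed features contribute; the sum is rescaled by s / (number observed),
    where s is the total number of features.
    """
    X = np.asarray(X, dtype=float)
    V = np.asarray(V, dtype=float)
    observed = ~np.isnan(X)                                   # (n, s) mask
    # Differences on observed features, zero where the feature is missing.
    diff = np.where(observed[:, None, :], X[:, None, :] - V[None, :, :], 0.0)
    n_obs = observed.sum(axis=1)                              # observed count per row
    if (n_obs == 0).any():
        raise ValueError("every row needs at least one observed feature")
    scale = X.shape[1] / n_obs
    return scale[:, None] * (diff ** 2).sum(axis=2)           # shape (n, c)

# Illustrative data: the first point is missing its second feature.
X = np.array([[1.0, np.nan],
              [3.0, 4.0]])
V = np.array([[0.0, 0.0],
              [3.0, 4.0]])
D = partial_sq_distances(X, V)
```

In an FCM loop, a distance matrix like `D` would replace the usual squared Euclidean distances in the membership update, so no missing value has to be filled in beforehand.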

Methods: In this paper, we handle the missing values in incomplete data sets with various imputation and non-imputation techniques. For the imputation techniques, the data set is first completed and the conventional fuzzy clustering algorithm is then applied to obtain the clustering results.
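The impute-then-cluster pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: missing values are marked as NaN, linear interpolation is applied column by column over the row index (which presumes the row order is meaningful, as in a time series), and the standard FCM updates are run with fuzzifier m = 2. The function names `interpolate_columns` and `fcm` and the toy data are hypothetical.

```python
import numpy as np

def interpolate_columns(X):
    """Fill NaNs in each column by linear interpolation over the row index."""
    X = np.asarray(X, dtype=float).copy()
    rows = np.arange(X.shape[0])
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        if missing.any() and not missing.all():
            # np.interp holds the nearest observed value beyond the first/last one.
            X[missing, j] = np.interp(rows[missing], rows[~missing], X[~missing, j])
    return X

def fcm(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means on complete data; returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)          # each row of memberships sums to 1
    for _ in range(n_iter):
        W = U ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]  # membership-weighted cluster centers
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)                 # guard against division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return V, U

# Toy data with two well-separated groups and two missing entries.
X = np.array([[1.0, 2.0], [np.nan, 2.2], [1.4, 2.4],
              [8.0, 9.0], [8.2, np.nan], [8.4, 9.4]])
X_filled = interpolate_columns(X)
centers, U = fcm(X_filled, c=2)
labels = U.argmax(axis=1)                       # hard labels from fuzzy memberships
```

Because the interpolation here runs along the row index, shuffling the rows changes the imputed values; for data without a natural ordering, a different interpolation axis or imputation method would be needed.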

Results: Experiments are carried out on various synthetic data sets and on real data sets from the UCI repository. To evaluate the performance of the various imputation and non-imputation based FCM clustering algorithms, several performance criteria and statistical tests are considered. Experimental results on these data sets show that linear interpolation based FCM clustering performs significantly better than the other imputation and non-imputation techniques.

Conclusion: It is concluded that clustering performance is data specific: no clustering technique gives good results on all data sets. Performance depends on both the data type and the percentage of missing attributes in the data set. Through this study, we have shown that the linear interpolation based FCM clustering algorithm can be used effectively for clustering incomplete data sets.

Keywords: FCM Clustering, incomplete data sets, imputation, missing data, regression, interpolation.

[1]
J.C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters", J. Cybern., vol. 3, pp. 32-57, 1973.
[http://dx.doi.org/10.1080/01969727308546046]
[2]
J.C. Bezdek, Pattern recognition with fuzzy objective function algorithms., Plenum: New York, 1981.
[http://dx.doi.org/10.1007/978-1-4757-0450-1]
[3]
A.R.T. Donders, G.J.M.G. van der Heijden, T. Stijnen, and K.G. Moons, "A gentle introduction to imputation of missing values", J. Clin. Epidemiol., vol. 59, pp. 1087-1091, 2006.
[http://dx.doi.org/10.1016/j.jclinepi.2006.01.014] [PMID: 16980149]
[4]
K.I. Penny, and T. Chesney, "Imputation methods to deal with missing values when data mining trauma injury data", In: 28th International Conference on Information Technology Interfaces, 2006, pp. 213-218.
[http://dx.doi.org/10.1109/ITI.2006.1708480]
[5]
K.I. Penny, and T. Chesney, "A comparison of missing value imputation methods for classifying patient outcome following trauma injury", In: ITI 2008 - 30th International Conference on Information Technology Interfaces, Dubrovnik, Croatia, 2008.
[http://dx.doi.org/10.1109/ITI.2008.4588437]
[6]
P. Hayati Rezvan, K.J. Lee, and J.A. Simpson, "The rise of multiple imputation: a review of the reporting and implementation of the method in medical research", BMC Med. Res. Methodol., vol. 15, p. 30, 2015.
[http://dx.doi.org/10.1186/s12874-015-0022-1] [PMID: 25880850]
[7]
A. Morisot, F. Bessaoud, P. Landais, X. Rébillard, B. Trétarre, and J.P. Daurès, "Prostate cancer: net survival and cause-specific survival rates after multiple imputation", BMC Med. Res. Methodol., vol. 15, p. 54, 2015.
[http://dx.doi.org/10.1186/s12874-015-0048-4] [PMID: 26216355]
[8]
P. Kalyani, "Approaches to partition medical data using clustering algorithms", Int. J. Comput. Appl., vol. 49, 2012.
[http://dx.doi.org/10.5120/7941-1102]
[9]
T.R. Sullivan, K.J. Lee, P. Ryan, and A.B. Salter, "Multiple imputation for handling missing outcome data when estimating the relative risk", BMC Med. Res. Methodol., vol. 17, no. 1, p. 134, 2017.
[http://dx.doi.org/10.1186/s12874-017-0414-5] [PMID: 28877666]
[10]
H. Junninen, H. Niska, K. Tuppurainen, J. Ruuskanen, and M. Kolehmainen, "Methods for imputation of missing values in air quality data sets", Atmos. Environ., vol. 38, pp. 2895-2907, 2004.
[http://dx.doi.org/10.1016/j.atmosenv.2004.02.026]
[11]
N.M. Noor, M.M. Al Bakri Abdullah, A.S. Yahaya, and N.A. Ramli, "Comparison of linear interpolation method and mean method to replace the missing values in environmental data set", Mater. Sci. Forum, vol. 803, pp. 278-281, 2015.
[http://dx.doi.org/10.4028/www.scientific.net/MSF.803.278]
[12]
N.A. Zainuri, A.A. Jemain, and N. Muda, "A comparison of various imputation methods for missing values in air quality data", Sains Malays., vol. 44, pp. 449-456, 2015.
[http://dx.doi.org/10.17576/jsm-2015-4403-17]
[13]
H. Li, X. Deng, and E. Smith, "Missing data imputation for paired stream and air temperature sensor data", Environmetrics, vol. 28, no. e2426, 2017.
[http://dx.doi.org/10.1002/env.2426]
[14]
J. Han, J. Pei, and M. Kamber, Data mining concepts and techniques., Elsevier, 2011.
[15]
P. Li, Z. Chen, Y. Hu, Y. Leng, and Q. Li, "A weighted fuzzy c-means clustering algorithm for incomplete big sensor data", In: China Conference on Wireless Sensor Networks, 2017, pp. 55-63.
[16]
S.C. Chapra, and R.P. Canale, Numerical Methods for Engineers., McGraw-Hill Higher Education: Boston, 2010.
[17]
R.J. Little, and D.B. Rubin, Statistical Analysis with Missing Data., John Wiley & Sons, 2010.
[18]
Z. Jia, Z. Yu, and C. Zhang, "Fuzzy c-means clustering algorithm based on incomplete data", In: IEEE International Conference on Information Acquisition, 2006, pp. 600-604.
[http://dx.doi.org/10.1109/ICIA.2006.305793]
[19]
M. Sarkar, and T.Y. Leong, "Fuzzy K-means clustering with missing values", In: Proceedings of the AMIA Symposium, 2001, pp. 588-592.
[20]
K.L. Wagstaff, and V.G. Laidler, "Making the most of missing values: Object clustering with partial data in astronomy", In: Astronomical Data Analysis Software and Systems XIV, vol. 347, 2005, p. 172.
[21]
B. Twala, M. Cartwright, and M. Shepperd, "Comparison of various methods for handling incomplete data in software engineering databases", In: International Symposium on Empirical Software Engineering, 2005, pp. 105-114.
[http://dx.doi.org/10.1109/ISESE.2005.1541819]
[22]
J.T. Chi, E.C. Chi, and R.G. Baraniuk, "K-pod: A method for k-means clustering of missing data", Am. Stat., vol. 70, pp. 91-99, 2016.
[http://dx.doi.org/10.1080/00031305.2015.1086685]
[23]
K. Honda, R. Nonoguchi, A. Notsu, and H. Ichihashi, "PCA-guided k-means clustering with incomplete data", In: IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2011), 2011, pp. 1710-1714.
[http://dx.doi.org/10.1109/FUZZY.2011.6007312]
[24]
T. Karkkainen, and S. Ayramo, Robust clustering methods for incomplete and erroneous data., WIT Transactions on Information and Communication Technologies, vol. 33, 2004.
[25]
R.J. Hathaway, and J.C. Bezdek, "Fuzzy c-means clustering of incomplete data", IEEE Trans. Syst. Man Cybern. B Cybern., vol. 31, no. 5, pp. 735-744, 2001.
[http://dx.doi.org/10.1109/3477.956035] [PMID: 18244838]
[26]
D.Q. Zhang, and S.C. Chen, "Clustering incomplete data using kernel-based fuzzy c-means algorithm", Neural Process. Lett., vol. 18, pp. 155-162, 2003.
[http://dx.doi.org/10.1023/B:NEPL.0000011135.19145.1b]
[27]
T. Li, L. Zhang, W. Lu, H. Hou, X. Liu, W. Pedrycz, and C. Zhong, "Interval kernel Fuzzy C-Means clustering of incomplete data", Neurocomput., vol. 237, pp. 316-333, 2017.
[http://dx.doi.org/10.1016/j.neucom.2017.01.017]
[28]
H. Timm, C. Döring, and R. Kruse, "Differentiated treatment of missing values in fuzzy clustering", Inter. Fuzzy Syst. Assoc. World Cong., 2003, pp. 354-361
[http://dx.doi.org/10.1007/3-540-44967-1_42]
[29]
H. Timm, C. Döring, and R. Kruse, "Different approaches to fuzzy clustering of incomplete datasets", Int. J. Approx. Reason., vol. 35, pp. 239-249, 2004.
[http://dx.doi.org/10.1016/j.ijar.2003.08.004]
[30]
L. Himmelspach, and S. Conrad, "Fuzzy clustering of incomplete data based on cluster dispersion", In: International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, 2010, pp. 59-68.
[http://dx.doi.org/10.1007/978-3-642-14049-5_7]
[31]
L. Himmelspach, and S. Conrad, "Clustering approaches for data with missing values: Comparison and evaluation", In: 2010 Fifth International Conference on Digital Information Management (ICDIM), 2010, pp. 19-28.
[http://dx.doi.org/10.1109/ICDIM.2010.5664691]
[32]
K. Siminski, "Clustering with missing values", Fundam. Inform., vol. 123, pp. 331-350, 2013.
[33]
Q. Zhang, and Z. Chen, "A distributed weighted possibilistic c-means algorithm for clustering incomplete big sensor data", Int. J. Distrib. Sens. Netw., vol. 10, no. 5, p. 430814, 2014.
[http://dx.doi.org/10.1155/2014/430814]
[34]
L. Zhang, W. Lu, X. Liu, W. Pedrycz, and C. Zhong, "Fuzzy c-means clustering of incomplete data based on probabilistic information granules of missing values", Knowl. Base. Syst., vol. 99, pp. 51-70, 2016.
[http://dx.doi.org/10.1016/j.knosys.2016.01.048]
[35]
J. Li, S. Song, Y. Zhang, and Z. Zhou, "Robust k-median and k-means clustering algorithms for incomplete data", Math. Probl. Eng., pp. 1-8, 2016.
[http://dx.doi.org/10.1155/2016/4321928]
[36]
H. Kang, "The prevention and handling of the missing data", Korean J. Anesthesiol., vol. 64, no. 5, p. 402, 2013.
[http://dx.doi.org/10.4097/kjae.2013.64.5.402]
[37]
Q. Wang, and J.N.K. Rao, "Empirical likelihood for linear regression models under imputation for missing responses", Can. J. Stat., vol. 29, pp. 597-608, 2001.
[http://dx.doi.org/10.2307/3316009]
[38]
H. Toutenburg, C. Heumann, and T. Nittner, "Linear regression models with incomplete categorical covariates", Comput. Stat., vol. 17, pp. 215-232, 2002.
[http://dx.doi.org/10.1007/s001800200103]
[39]
L. Beretta, and A. Santaniello, "Nearest neighbor imputation algorithms: a critical evaluation", BMC Med. Inform. Decis. Mak., vol. 16, suppl. 3, p. 74, 2016.
[http://dx.doi.org/10.1186/s12911-016-0318-z] [PMID: 27454392]
[40]
S. Hwang, J.H. Oh, J. Cox, S.J. Tang, and H.F. Tibbals, Blood detection in wireless capsule endoscopy using expectation maximization clustering.
[http://dx.doi.org/10.1117/12.654109]
[41]
A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum likelihood estimation from incomplete data via the EM algorithm", J. R. Stat. Soc. Series B Stat. Methodol., vol. 39, pp. 1-38, 1977.
[42]
F.V. Nelwamondo, S. Mohamed, and T. Marwala, "Missing data: A comparison of neural network and expectation maximization techniques", Curr. Sci., vol. 93, no. 11, pp. 1514-1521, 2007.
[43]
C. Sammut, and G.I. Webb, Encyclopedia of machine learning., Springer Science & Business Media, 2011.
[44]
Y.G. Jung, M.S. Kang, and J. Heo, "Clustering performance comparison using K-means and expectation maximization algorithms", 2014.
[45]
W. Al-Mudhafer, "Maximum likelihood & multiple imputation of incomplete static and dynamic reservoir data", In: 12th EAGE International Conference on Geoinformatics-Theoretical and Applied Aspects, 2013.
[http://dx.doi.org/10.3997/2214-4609.20142491]
[46]
H. Fang, "MI Fuzzy clustering for incomplete longitudinal data in smart health", Smart Health (Amst), vol. 1-2, pp. 50-65, 2017.
[http://dx.doi.org/10.1016/j.smhl.2017.04.002] [PMID: 28993813]
[47]
I.R. White, P. Royston, and A.M. Wood, "Multiple imputation using chained equations: Issues and guidance for practice", Stat. Med., vol. 30, no. 4, pp. 377-399, 2011.
[http://dx.doi.org/10.1002/sim.4067] [PMID: 21225900]
[48]
J.K. Dixon, "Pattern recognition with partly missing data", IEEE Trans. Syst. Man Cybern., vol. 9, pp. 617-621, 1979.
[http://dx.doi.org/10.1109/TSMC.1979.4310090]
[49]
K. Bache, and M. Lichman, UCI Machine Learning Repository, Irvine, CA: University of California School of Information and Computer Science. Available at: http://archive.ics.uci.edu/ml
[50]
Z. Huang, and M.K. Ng, "A fuzzy k-modes algorithm for clustering categorical data", IEEE Trans. Fuzzy Syst., vol. 7, pp. 446-452, 1999.
[http://dx.doi.org/10.1109/91.784206]
[51]
W.M. Rand, "Objective criteria for the evaluation of clustering methods", J. Am. Stat. Assoc., vol. 66, pp. 846-850, 1971.
[http://dx.doi.org/10.1080/01621459.1971.10482356]
[52]
L. Hubert, and P. Arabie, "Comparing partitions", J. Classif., vol. 2, pp. 193-218, 1985.
[http://dx.doi.org/10.1007/BF01908075]
[53]
A. Strehl, and J. Ghosh, "Cluster ensembles-a knowledge reuse framework for combining multiple partitions", J. Mach. Learn. Res., vol. 3, pp. 583-617, 2002.



Article Details

VOLUME: 13
ISSUE: 6
Year: 2020
Page: [833 - 846]
Pages: 14
DOI: 10.2174/2352096512666191127121710
