
Recent Patents on Engineering

ISSN (Print): 1872-2121
ISSN (Online): 2212-4047

Research Article

Improved K-means Clustering Algorithm and its Applications

Author(s): Hui Qi, Jinqing Li, Xiaoqiang Di*, Weiwu Ren and Fengrong Zhang

Volume 13, Issue 4, 2019

Page: [403 - 409] Pages: 7

DOI: 10.2174/1872212113666181203110611

Abstract

Background: The K-means algorithm is implemented in two stages: initialization and subsequent iterations. Initialization selects the initial cluster centers, while the subsequent iterations repeatedly update the centers until they no longer change or the maximum number of iterations is reached. K-means is so sensitive to the cluster centers selected during initialization that a different choice of initial centers can change the algorithm's performance. Improving the initialization process has therefore become an important means of improving K-means performance.
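
The two-stage procedure described above can be summarized in a short sketch. The code below is a minimal, illustrative implementation of standard K-means with random initialization, not the paper's method; the function name kmeans and the parameters max_iter and tol are assumptions introduced for illustration.

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=None):
    rng = np.random.default_rng(seed)
    # Initialization: choose k distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assignment: attach each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each center to the mean of its assigned points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the centers no longer change (within tolerance).
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels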

Methods: This paper uses a new strategy to select the initial cluster centers. It first calculates the minimum and maximum values of the data along a chosen index (for lower-dimensional data, such as two-dimensional data, a feature with large variance or the distance to the origin can be selected; for higher-dimensional data, PCA can be used to select the principal component with the largest variance), and then divides this range into equally-sized sub-ranges. Next, the sub-ranges are adjusted according to the data distribution so that each sub-range contains as much data as possible. Finally, the mean of the data in each sub-range is calculated and used as an initial cluster center.
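
The following is a rough sketch of this initialization idea under stated assumptions, not the paper's exact procedure: the data are projected onto the principal component with the largest variance (the option the abstract suggests for high-dimensional data), and plain equal-width sub-ranges are used because the distribution-based boundary adjustment is only summarized in the abstract. The function name subrange_init is illustrative.

import numpy as np

def subrange_init(X, k):
    # Project the centered data onto the principal component with the largest
    # variance; for 2-D data a high-variance feature or the distance to the
    # origin could be used instead, as the abstract notes.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ vt[0]
    # Divide [min, max] of the projection into k equally-sized sub-ranges.
    edges = np.linspace(proj.min(), proj.max(), k + 1)
    bins = np.digitize(proj, edges[1:-1])  # sub-range index in 0..k-1
    # Use the mean of the data in each sub-range as an initial center;
    # fall back to the global mean if a sub-range happens to be empty.
    centers = np.array([
        X[bins == j].mean(axis=0) if np.any(bins == j) else X.mean(axis=0)
        for j in range(k)
    ])
    return centers

These centers can then replace the random initialization in a standard K-means loop such as the sketch above.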

Results: The theoretical analysis shows that although the time complexity of the initialization process is linear, the algorithm exhibits the characteristics of superlinear initialization methods. The algorithm is applied to two-dimensional GPS data analysis and to high-dimensional network attack detection. Experimental results show that it achieves both high clustering quality and high clustering speed.

Conclusion: This paper reduces the number of subsequent iterations of the K-means algorithm without compromising clustering performance, which makes it suitable for large-scale data clustering. The algorithm is applicable not only to low-dimensional data clustering but also to high-dimensional data.

Keywords: Clustering, K-means, GPS, network attack detection, algorithm, applications.
