International Journal of Sensors, Wireless Communications and Control

ISSN (Print): 2210-3279
ISSN (Online): 2210-3287

Review Article

A Study of Variance and its Utility in Machine Learning

Author(s): Krishna Gopal Sharma* and Yashpal Singh

Volume 12, Issue 5, 2022

Published on: 22 July, 2022

Pages: 333-343 (11 pages)

DOI: 10.2174/2210327912666220617153359

Abstract

With inexpensive storage and sensing devices widely available, collecting and storing data is now simpler than ever. Biotechnology, pharmacy, business, online marketing websites, Twitter, Facebook, and blogs are only some of the sources of this data. Understanding the data is crucial, as every activity, private or public, from hospitals to mega-marts, benefits from it. However, the explosive volume of data makes manual analysis nearly impossible. As of 2022, roughly 2.5 quintillion bytes of data are created every day; one quintillion bytes is one billion gigabytes, and approximately 90% of all existing data was created in the last two years. An automatic technique for analyzing the data is therefore a necessity, and data mining is performed with the help of machine learning tools to analyze and understand the data. Because data mining and machine learning depend heavily on statistical tools and techniques, machine learning is sometimes called "statistical learning". Many machine learning techniques exist in the literature, and improvement is a continuous process, as no model is perfect. This paper examines the influence of variance, a statistical measure of spread, on various machine learning approaches and explores how this concept can be used to improve performance.

Keywords: Statistical learning, machine learning, data mining, variance, k-distance, KNN.
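
To ground the paper's central quantity: the variance of a random variable X is Var(X) = E[(X - E[X])^2], the expected squared deviation from the mean, and low-variance features tend to carry little information for distance-based learners such as KNN. As a minimal sketch of this idea (an illustration, not the authors' method), the Python snippet below computes per-feature variance and screens out near-constant features before fitting a KNN classifier; the iris dataset, the 0.01 threshold, and k = 5 are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Sample variance of each feature: Var(x) = E[(x - E[x])^2].
print("per-feature variance:", np.var(X, axis=0, ddof=1))

# Drop near-constant features; on iris every feature passes the
# (illustrative) 0.01 threshold, but on wider tables this prunes noise.
X_reduced = VarianceThreshold(threshold=0.01).fit_transform(X)

# A plain KNN classifier on the variance-screened features.
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(f"KNN accuracy after variance screening: {knn.score(X_test, y_test):.3f}")
```

One practical caveat: raw variances (and hence the Euclidean distances KNN relies on) depend on feature units, so in practice features are usually scaled before applying a variance threshold; the raw-feature version above is kept deliberately simple.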

