International Journal of Sensors, Wireless Communications and Control

ISSN (Print): 2210-3279
ISSN (Online): 2210-3287

Review Article

A Study of Variance and its Utility in Machine Learning

Author(s): Krishna Gopal Sharma* and Yashpal Singh

Volume 12, Issue 5, 2022

Published on: 22 July, 2022

Pages: 333-343 (11 pages)

DOI: 10.2174/2210327912666220617153359

Abstract

With inexpensive storage and sensing devices widely available, collecting and storing data is now simpler than ever. Biotechnology, pharmacy, business, online marketing websites, Twitter, Facebook, and blogs are only some of the sources of this data. Understanding the data is crucial, as every activity, private or public, from hospitals to mega-marts, benefits from it. However, the explosive volume of data makes manual analysis nearly impossible. As of 2022, roughly 2.5 quintillion bytes of data are created every day; one quintillion bytes is one billion gigabytes, and approximately 90% of all existing data was created in the last two years. An automatic technique for analyzing the data is therefore a necessity, and data mining is performed with the help of machine learning tools to analyze and understand the data. Because data mining and machine learning depend heavily on statistical tools and techniques, machine learning is sometimes called "statistical learning". Many machine learning techniques exist in the literature, and improvement is a continuous process, as no model is perfect. This paper examines the influence of variance, a statistical measure of spread, on various machine learning approaches and explores how this concept can be used to improve performance.

Keywords: Statistical learning, machine learning, data mining, variance, k-distance, KNN.
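
To ground the paper's central quantity: the variance of a random variable X is Var(X) = E[(X - E[X])^2], the expected squared deviation from the mean, and low-variance features tend to carry little information for distance-based learners such as KNN. As a minimal sketch of this idea (an illustration, not the authors' method), the Python snippet below computes per-feature variance and screens out near-constant features before fitting a KNN classifier; the iris dataset, the 0.01 threshold, and k = 5 are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Sample variance of each feature: Var(x) = E[(x - E[x])^2].
print("per-feature variance:", np.var(X, axis=0, ddof=1))

# Drop near-constant features; on iris every feature passes the
# (illustrative) 0.01 threshold, but on wider tables this prunes noise.
X_reduced = VarianceThreshold(threshold=0.01).fit_transform(X)

# A plain KNN classifier on the variance-screened features.
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(f"KNN accuracy after variance screening: {knn.score(X_test, y_test):.3f}")
```

One practical caveat: raw variances (and hence the Euclidean distances KNN relies on) depend on feature units, so in practice features are usually scaled before applying a variance threshold; the raw-feature version above is kept deliberately simple.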

