On Performance of Feature Normalization in Classification with Distance-Based Case-Based Reasoning
Jun-Ling Yu and Hui Li
Affiliation: School of Economics and Management, Zhejiang Normal University, P.O. Box 62, 688 Ying Bin Da Dao, Jinhua, Zhejiang 321004, PR China.
Keywords: Bankruptcy prediction, case-based reasoning (CBR), credit scoring, distance-based case retrieval, feature normalization, k-nearest neighbors, feature normalization methods, data set information, distance-based KNN
Distance-based k-nearest-neighbors (KNN) retrieval is usually adopted to find the nearest neighbors of a target case in case-based reasoning (CBR). In practice, however, similarity measures calculated in this way are often biased because different features have different units and magnitudes. To avoid this problem, feature normalization is usually applied before case retrieval. In this study of CBR classification performance, distance-based case retrieval was run on eight binary classification data sets, including two credit scoring data sets and one bankruptcy prediction data set, after each of seven feature normalization methods had been applied to them. The results show that: 1) T3 (scaling feature values into the range 0 to 1 by revised minimum-maximum normalization) and T7 (the normalization utility function) helped CBR perform best on 3 out of 8 data sets; 2) T4 (dividing by the maximum value of the feature, or maximum normalization) helped CBR perform better on 7 out of 8 data sets than the other methods did; 3) T7 helped CBR improve classification performance on 6 out of 8 data sets. Overall, T7 and T4, followed by T3/T2 (minimum-maximum normalization) and T1 (Z-score)/T6 (standard deviation normalization), are the best feature normalization methods for distance-based CBR in classification (including bankruptcy prediction and credit scoring problems), as they obtained the top ranking scores in the experiment. This patent paper contributes to the literature on how to choose the most adequate normalization method for the data collected in a particular system.
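The effect described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes the textbook forms of Z-score, minimum-maximum, and maximum normalization; the abstract does not give the exact formulas, and the revised minimum-maximum method (T3) and the normalization utility function (T7) are not reproduced here. All function names and the four-case toy data set are hypothetical, chosen only to show how a large-magnitude feature can dominate Euclidean distance.

```python
import math
from collections import Counter


def column_stats(rows):
    """Per-feature min, max, mean and population std, computed on the case base."""
    stats = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        stats.append({"min": min(col), "max": max(col), "mean": mean, "std": std})
    return stats


def z_score(rows, stats):
    """Textbook Z-score: (x - mean) / std (assumed form; constant features map to 0)."""
    return [[(v - s["mean"]) / s["std"] if s["std"] else 0.0
             for v, s in zip(row, stats)] for row in rows]


def min_max(rows, stats):
    """Textbook min-max: (x - min) / (max - min) (assumed form)."""
    return [[(v - s["min"]) / (s["max"] - s["min"]) if s["max"] != s["min"] else 0.0
             for v, s in zip(row, stats)] for row in rows]


def max_norm(rows, stats):
    """Maximum normalization: x / max, assuming non-negative features."""
    return [[v / s["max"] if s["max"] else 0.0
             for v, s in zip(row, stats)] for row in rows]


def knn_predict(case_base, labels, target, k=3):
    """Majority vote among the k stored cases nearest to the target (Euclidean)."""
    ranked = sorted(zip(case_base, labels), key=lambda cl: math.dist(cl[0], target))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy case base: [annual income, debt ratio], label 1 = default.
train_x = [[20000, 0.9], [21000, 0.8], [52000, 0.2], [50000, 0.1]]
train_y = [1, 1, 0, 0]
query = [30000, 0.15]

# Statistics come from the case base only and are reused for the target case.
stats = column_stats(train_x)

raw_label = knn_predict(train_x, train_y, query, k=3)
norm_labels = {
    name: knn_predict(fn(train_x, stats), train_y, fn([query], stats)[0], k=3)
    for name, fn in [("z_score", z_score), ("min_max", min_max), ("max_norm", max_norm)]
}
print("raw:", raw_label, "normalized:", norm_labels)
```

On this toy data the raw distances are driven almost entirely by income, so the target case is assigned label 1, whereas all three normalized variants assign label 0 because the debt ratio then carries comparable weight. The sketch says nothing about which method ranks best; that comparison is the subject of the paper's experiments.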