Combinatorial Chemistry & High Throughput Screening

ISSN (Print): 1386-2073
ISSN (Online): 1875-5402

Research Article

Variable Screening for Near Infrared (NIR) Spectroscopy Data Based on Ridge Partial Least Squares Regression

Author(s): Naifei Zhao, Qingsong Xu, Man-lai Tang and Hong Wang*

Volume 23, Issue 8, 2020

Page: [740 - 756] Pages: 17

DOI: 10.2174/1386207323666200428114823

Abstract

Aim and Objective: Near Infrared (NIR) spectroscopy data are characterized by a few dozen to many thousands of samples and by highly correlated variables. Quantitative analysis of such data usually requires combining analytical methods with variable selection or screening methods. Commonly used variable screening methods fail to recover the true model when (i) some of the variables are highly correlated, and (ii) the sample size is less than the number of relevant variables. In these cases, approaches based on Partial Least Squares (PLS) regression can be useful alternatives.

Materials and Methods: In this research, a fast variable screening strategy, namely preconditioned screening for ridge partial least squares regression (PSRPLS), is proposed for modelling NIR spectroscopy data with high-dimensional and highly correlated covariates. Under rather mild assumptions, we prove that, by using the Puffer transformation, the proposed approach transforms the problem of variable screening with highly correlated predictor variables into one with weakly correlated covariates, at little extra computational cost.
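
To make the preconditioning idea concrete, the following is a minimal sketch rather than the PSRPLS procedure itself. It assumes the Puffer transformation in the form given by Jia and Rohe, F = U D^{-1} U^T, where X = U D V^T is the thin singular value decomposition, and it replaces the ridge PLS step with plain marginal screening on the transformed data. The function names `puffer_transform` and `marginal_screen`, the tolerance, and the simulated data are illustrative assumptions, not taken from the article.

```python
import numpy as np


def puffer_transform(X, y, tol=1e-10):
    """Precondition (X, y) with F = U D^{-1} U^T, where X = U D V^T (thin SVD).

    Assumed form of the Puffer transformation; singular values below
    tol * max(d) are dropped to avoid dividing by (near) zero.
    """
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    keep = d > tol * d.max()
    U, d = U[:, keep], d[keep]
    F = (U / d) @ U.T  # U diag(1/d) U^T
    return F @ X, F @ y


def marginal_screen(X, y, k):
    """Indices of the k columns with the largest |x_j^T y| / ||x_j||."""
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0  # guard against all-zero columns
    scores = np.abs(X.T @ y) / norms
    return np.argsort(scores)[::-1][:k]


# Hypothetical data: 30 samples, 200 variables sharing one common factor,
# so the raw columns are strongly correlated.
rng = np.random.default_rng(0)
n, p = 30, 200
factor = rng.standard_normal((n, 1))
X = rng.standard_normal((n, p)) + 3.0 * factor
beta = np.zeros(p)
beta[:3] = [2.0, -2.0, 1.5]
y = X @ beta + 0.1 * rng.standard_normal(n)

X_tilde, y_tilde = puffer_transform(X, y)
selected = marginal_screen(X_tilde, y_tilde, k=10)
print(sorted(selected.tolist()))
```

When X has full row rank with fewer samples than variables, the transformed matrix F X equals U V^T, so its rows are orthonormal; this is the sense in which the preconditioning weakens the dependence among covariates before screening. The sketch makes no claim about which variables are recovered on real spectra.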

Results: We show that our proposed method leads to theoretically consistent model selection results. Four simulation studies and two real examples are then analyzed to illustrate the effectiveness of the proposed approach.

Conclusion: By introducing the Puffer transformation, the PSRPLS procedure we construct mitigates the high-correlation problem. Employing ridge partial least squares (RPLS) regression makes the approach simpler and more computationally efficient when the model size is larger than the sample size, while maintaining high prediction precision.

Keywords: Puffer transformation, preconditioning, sure independence screening (SIS), ridge partial least squares regression, variable screening, near infrared (NIR) spectroscopy data.


© 2024 Bentham Science Publishers