Aim and Objective: Near-infrared (NIR) spectroscopy data are characterized by a few dozen
to many thousands of samples and by highly correlated variables. Quantitative analysis of
such data usually requires a combination of analytical methods with variable selection
or screening methods. Commonly used variable screening methods fail to recover the
true model when (i) some of the variables are highly correlated, and (ii) the sample size
is smaller than the number of relevant variables. In these cases, approaches based on
partial least squares (PLS) regression can be useful alternatives.
Materials and Methods: In this research, a fast variable screening strategy, namely
the preconditioned screening for ridge partial least squares regression (PSRPLS), is
proposed for modelling NIR spectroscopy data with high-dimensional and highly
correlated covariates. Under rather mild assumptions, we prove that, by using the Puffer
transformation, the proposed approach converts the problem of variable
screening with highly correlated predictor variables into one with weakly correlated
covariates, at little extra computational cost.
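As a rough illustration of the preconditioning step only (not the full PSRPLS procedure, whose ridge PLS screening is not reproduced here), the Puffer transformation can be sketched as follows. The sketch assumes the standard definition F = U D^{-1} U^T from the thin singular value decomposition X = U D V^T with n <= p and full row rank. The function names `puffer_transform` and `mean_abs_corr` and the synthetic "spectra" are hypothetical choices made for this example.

```python
import numpy as np


def puffer_transform(X):
    """Return F = U D^{-1} U^T from the thin SVD X = U D V^T.

    Assumes n <= p and that X has full row rank (all singular values > 0).
    """
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    # Dividing column j of U by d[j] is the same as U @ diag(1 / d).
    return (U / d) @ U.T


def mean_abs_corr(X):
    """Mean absolute off-diagonal correlation between the columns of X."""
    C = np.corrcoef(X, rowvar=False)
    off_diagonal = ~np.eye(C.shape[0], dtype=bool)
    return np.abs(C[off_diagonal]).mean()


# Synthetic stand-in for spectra (an assumption, not data from the paper):
# every column shares one latent factor, so columns are highly correlated.
rng = np.random.default_rng(0)
n, p = 30, 200
latent = rng.normal(size=(n, 1))
X = latent + 0.1 * rng.normal(size=(n, p))

F = puffer_transform(X)
X_tilde = F @ X  # preconditioned design, equal to U V^T

# All singular values of X_tilde are 1, so X_tilde X_tilde^T = I_n.
gram_deviation = np.abs(X_tilde @ X_tilde.T - np.eye(n)).max()

corr_before = mean_abs_corr(X)
corr_after = mean_abs_corr(X_tilde)
print(f"max |X~ X~^T - I| = {gram_deviation:.2e}")
print(f"mean |corr| before = {corr_before:.3f}, after = {corr_after:.3f}")
```

In a regression setting the same F would also be applied to the response, so that the model is fitted on (FX, Fy) rather than (X, y). The extra cost is one SVD of the n by p design matrix.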
Results: We show that our proposed method leads to theoretically consistent model
selection results. Four simulation studies and two real examples are then analyzed to
illustrate the effectiveness of the proposed approach.
Conclusion: By introducing the Puffer transformation, the high-correlation problem can be
mitigated using the PSRPLS procedure we construct. Employing ridge PLS (RPLS) regression
in our approach makes it simpler and more computationally efficient in
the situation where the model size is larger than the sample size, while maintaining a high