A novel approach is developed for modeling situations in which the modeled property is an algebraically
transformed version of the original experimental data. In many cases such a transformation results in a data set with a
significantly smaller data range. Here we explore the effects of range-of-data on modeling statistics. We illustrate a two-step
method using data on the mass spectrometry collision energy (CE) that is required to decompose 50% of precursor
ions to fragments (CE50). Earlier we showed that a nonlinear center-of-mass transformation, yielding Ecom50, produces
values less dependent on the specific mass spectrometric experimental conditions. For this data set the Ecom50 range is
13.5% of the CE50 range. The method has two steps. First, the original experimental data, CE50 (the larger
range-of-data), are modeled by a standard modeling method (PLS). Second, the calculated dependent variable resulting from
the modeling is algebraically transformed (not modeled) according to the center-of-mass transformation, providing the
generally more useful data, Ecom50. As shown here, use of this two-step method for predicting Ecom50 (from previously
published data) produces a standard error 21% smaller than direct modeling of Ecom50 and correspondingly narrows the confidence interval for prediction.
Some specific implications for prediction are given for a published data set. This work is part of the ongoing development
of a system of models to assist in the development of human metabolites.
Keywords: Collision energy at 50% reduction (CE50), molconn structure descriptors, PLS models, range of data significance,
PubChem structures prediction.
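
The two-step method described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the common center-of-mass relation for collision with a stationary gas, Ecom = CE × m_gas / (m_gas + m_ion), and it assumes nitrogen as the collision gas; the abstract states neither the exact form of the transformation nor the gas used. The function names and the stand-in model `toy_ce50_model` are hypothetical; in practice step 1 would be a fitted PLS model.

```python
# Assumption: nitrogen (N2) collision gas; the abstract does not name the gas.
N2_MASS = 28.0134  # Da


def ce50_to_ecom50(ce50, precursor_mass, gas_mass=N2_MASS):
    """Step 2: algebraic center-of-mass transformation (no modeling involved).

    Assumed form: Ecom = CE * m_gas / (m_gas + m_ion).
    The result is nonlinear in the precursor ion mass.
    """
    if precursor_mass <= 0 or gas_mass <= 0:
        raise ValueError("masses must be positive")
    return ce50 * gas_mass / (gas_mass + precursor_mass)


def predict_ecom50(predict_ce50, descriptors, precursor_mass, gas_mass=N2_MASS):
    """Two-step prediction: model CE50 first, then transform the prediction."""
    ce50_hat = predict_ce50(descriptors)  # step 1: model the wide-range property
    return ce50_to_ecom50(ce50_hat, precursor_mass, gas_mass)  # step 2


def toy_ce50_model(descriptors):
    """Hypothetical stand-in for a fitted PLS model (illustrative coefficients)."""
    weights = [2.0, 0.5]
    intercept = 10.0
    return intercept + sum(w * x for w, x in zip(weights, descriptors))


if __name__ == "__main__":
    # Illustrative descriptor values and precursor mass, not published data.
    ecom50 = predict_ecom50(toy_ce50_model, [5.0, 20.0], precursor_mass=272.0)
    print(f"Predicted Ecom50: {ecom50:.3f}")
```

The point of the design is that only CE50, the property with the larger range, is ever fitted; Ecom50 is obtained from the predicted CE50 by arithmetic alone.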