Rolf Sundberg
Mathematical Statistics, Stockholm University
Lelystad, May 2000
Situation:
Calibration is a process in which we establish the relation between x and y.
We can only use a sample – not the whole population.
The relationship is used for prediction of the true value for new pigs.
There is interplay between prediction and estimation.

Prediction: (R)MSEP
Estimation: (R)MSE
(Imagined known) theoretical distribution for (x, y).
Predict a future y_0 for known x_0 by ŷ_0.
Prediction error = residual y_0 − ŷ_0.
Accuracy measure: MSEP = E[(y_0 − ŷ_0)²].
Minimum MSEP is attained for ŷ_0 = E(y | x = x_0).
Under some conditions this is a straight line, E(y | x) = α + βx,
and then MSEP = Var(y | x_0) = σ², independent of x_0.
Calibration data available instead of the theoretical distribution:
random sample or controlled x.
Estimated straight-line regression (e.g. by OLS = Ordinary Least Squares).
Properties? How good?
If the sample size n is large, the line is precisely determined.
The precision must be estimated, by
MSE = Σ (y_i − ŷ_i)² / (n − p − 1),
where p is the dimension of x; in this example p = 1.
MSE is not (quite) fair as a prediction error measure, because y_i has already influenced the fitted value ŷ_i.
(1) Cross-validation, leave-one-out. Computational burden?
(2) Variation of (1): leave out larger subsets of data, not only one at a time.
(3) Sort of extreme of (2): calibration set / validation set (test set). Two MSEP values when the two sets exchange roles.
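As an illustration of (1), here is a minimal numpy sketch of leave-one-out cross-validation for a straight-line OLS fit. The data are simulated and all names are my own:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)       # true line with noise sd 1

X = np.column_stack([np.ones(n), x])          # design matrix: intercept + slope

# Leave-one-out: fit on n - 1 observations, predict the left-out one
press = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    press += (y[i] - X[i] @ b) ** 2
msep_cv = press / n

# Apparent MSE from the fit on all data; tends to be smaller (too optimistic)
b_all = np.linalg.lstsq(X, y, rcond=None)[0]
mse = np.sum((y - X @ b_all) ** 2) / (n - 2)  # p = 1, so divisor n - p - 1
```

The computational burden mentioned in (1) is visible here: n separate fits, although for OLS there are well-known shortcut formulas via leverage values.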
For large n and OLS regression, with p << n, MSEP ≈ MSE.
More precisely, MSEP ≈ (1 + (p + 1)/n) · MSE.
Examples from a Danish study (1996):

Method           n     p     MSEP/MSE     1 + (p+1)/n
OLS (FOM/MK)     202   4     1.03         1.025
PCR (CC)         344   11    1.07         1.03
PLS (Autofom)    344   127   1.09 !!!!    1.4 !!!!
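The approximation MSEP ≈ (1 + (p+1)/n)·MSE in the last column can be checked by simulation. A small sketch under idealised OLS assumptions (simulated data, variable names my own):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, reps = 50, 4, 400
mse_sum = msep_sum = 0.0
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    beta = np.ones(p + 1)
    y = X @ beta + rng.normal(size=n)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    mse_sum += np.sum((y - X @ b) ** 2) / (n - p - 1)
    # fresh data from the same distribution give the true prediction error
    Xnew = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    ynew = Xnew @ beta + rng.normal(size=n)
    msep_sum += np.mean((ynew - Xnew @ b) ** 2)

ratio = msep_sum / mse_sum   # should be near 1 + (p + 1) / n = 1.1
```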
General conclusion: MSE is too optimistic, a little or much.
PLS is one of several shrinkage methods (regularisation methods).
Others: PCR, CR, RR, LSRR.
Why shrink?
To compensate for (near) collinearity.
There is an obvious risk of an extreme slope of the OLS-fitted plane, just by chance.
For safety, reduce this slope.
Near-collinearity: some linear combinations of x-variables are almost constant (over observations).
How to detect near-collinearity?
The Corr(X) matrix is near singular: some very small eigenvalue(s).
Statistical consequence for OLS: Var(b) = σ² (XᵀX)⁻¹ has large elements
⇒ b likely to have large coefficients (by chance).
This is unavoidable if p ≥ n.
Near-collinearity is typical if p is large, but may occur also for p small.
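A small numpy sketch of both the diagnosis (a small eigenvalue of the correlation matrix) and the consequence (large elements in (XᵀX)⁻¹, hence unstable b). Simulated data, names my own:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # x2 - x1 almost constant: near-collinear
X = np.column_stack([x1, x2])

# Diagnosis: the correlation matrix is near singular
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))

# Consequence for OLS: Var(b) is proportional to (X^T X)^{-1},
# whose entries are huge here, so b can get large coefficients by chance
y = x1 + rng.normal(size=n)
b = np.linalg.lstsq(X, y, rcond=None)[0]
cov_unscaled = np.linalg.inv(X.T @ X)
```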
A different approach:
What are the “principal properties” of the measurement system?
What is the natural (chemical/biological) rank of the system (the data)?
Variation typically takes place essentially only in a low-dimensional subspace of x-space.
Estimator: no systematic error in OLS, but can be far from truth/causality ⇒ misinterpretations.
Predictor: works if the shrunk (omitted) directions have little influence, or if new data vary little in the shrunk directions (like the calibration data).
Estimation for description & interpretation: MSE.
Predictivity measures:
internal (simulated)
external (true)
Representativity of calibration/test set.
What is to be predicted? True y.
What can be achieved? Measured y.
Pretreatment: shrinkage methods are not invariant under e.g. individual rescaling of x.
1. Principal components analysis = PCA; regression = PCR.
(t_1, t_2) are equivalent to (x_1, x_2), but whereas x_1, x_2 vary equally much and are strongly correlated, t_1 varies much more than t_2 (≈ constant), and t_1, t_2 are uncorrelated.
It is much more likely that t_1 can explain variation in y than t_2.
PC1: t_1 along the direction that explains most variation.
PC2: t_2 in the orthogonal direction that explains next most variation, etc., if there are more dimensions.
So: PCA yields t_1, …, t_p, which replace x_1, …, x_p:
t_1 = c_11 x_1 + … + c_1p x_p
The t_i are called scores; the c_ij are called loadings.
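The scores/loadings construction can be sketched with an SVD of the centred data matrix. Simulated two-variable data as above; all names are my own:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=n)  # strongly correlated
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)                 # centre before PCA

# SVD of the centred data: scores T = U S, loadings C = V (columns c_1, c_2)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s                               # scores t_1, t_2
C = Vt.T                                # loadings c_ij, each column a unit vector

var_t = T.var(axis=0, ddof=1)           # Var(t_1) >> Var(t_2)
cov_t = np.cov(T, rowvar=False)[0, 1]   # scores are uncorrelated
```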
PCR: regress y on only t_1, …, t_k, instead of the full regression on t_1, …, t_p (equivalent to y on x_1, …, x_p).
Choose the number k by cross-validation.
Possible inefficiency: there may be PCs which do not influence y. Why include them in the regression? They only contribute uncertainty.
PLS is more efficient in this respect, but otherwise similar to PCR.
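A sketch of PCR with k chosen by leave-one-out cross-validation, under the same conventions as above (simulated data; the helper name is my own):

```python
import numpy as np

def pcr_fit_predict(Xtr, ytr, Xte, k):
    """Regress y on the first k principal-component scores of X; predict for Xte."""
    mx, my = Xtr.mean(axis=0), ytr.mean()
    U, s, Vt = np.linalg.svd(Xtr - mx, full_matrices=False)
    T = (U * s)[:, :k]                        # scores of the calibration data
    g = np.linalg.lstsq(T, ytr - my, rcond=None)[0]
    b = Vt[:k].T @ g                          # back to coefficients on x
    return my + (Xte - mx) @ b

rng = np.random.default_rng(3)
n, p = 60, 8
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + rng.normal(size=n)

# Leave-one-out MSEP for each k; pick the k with smallest MSEP
msep = []
for k in range(1, p + 1):
    errs = [(y[i] - pcr_fit_predict(np.delete(X, i, 0), np.delete(y, i),
                                    X[i:i + 1], k)[0]) ** 2 for i in range(n)]
    msep.append(np.mean(errs))
k_best = 1 + int(np.argmin(msep))
```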
OLS maximises Corr(y, t(x)) over t, and in particular Corr(y, t_1) over t_1.
PLS maximises Cov(y, t(x)) over t, and Cov(y, t_1) over t_1.
PCR maximises Var(t_1).
The c_i values form a direction vector, normalised so that ‖c_1‖ = 1.
PLS is a compromise between the two extremes:
the wish to have the highest possible correlation,
the wish to have high(est possible) variance in t_1,
since Cov(y, t_1)² = Corr(y, t_1)² · Var(t_1) · Var(y).
A more general approach:
1) Maximise some expression f(Corr(t_1, y), Var(t_1)) with respect to the direction c_1, where f is increasing both in Corr and in Var. This yields some direction c_1.
2) OLS-regress y on t_1 to form a predictor.
3) Calculate the residuals from 2), and repeat the procedure on them, if desirable.
It can then be shown that the resulting coefficient vector is of ridge regression (RR) type, b_RR ∝ (XᵀX + δI)⁻¹ Xᵀy, for some δ.
Note to step 2 above: this is an upscaling of b_RR by least squares, so-called LSRR.
In typical use, RR ≈ LSRR (in the sense b_RR ≈ b_LSRR for small δ).
Choose δ by cross-validation.
Now OLS, PLS and PCR all satisfy criteria of this type ⇒ LSRR form, for particular values of δ:
OLS: δ = 0.
PLS: δ → ∞ gives the first factor (first latent variable).
PCR: δ → −λ_max gives the first PC,
where λ_max = maximal eigenvalue of XᵀX.
So all these methods are strongly related mutually.
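The δ-limits can be checked numerically from the ridge-type direction b_δ ∝ (XᵀX + δI)⁻¹ Xᵀy. A sketch with simulated data (names my own):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 80, 5
X = rng.normal(size=(n, p))
X[:, 1] += 2 * X[:, 0]                   # make the eigenvalues unequal
y = X @ rng.normal(size=p) + rng.normal(size=n)

S, g = X.T @ X, X.T @ y
lam, V = np.linalg.eigh(S)               # eigenvalues in ascending order
lam_max, v_max = lam[-1], V[:, -1]

def unit(v):
    return v / np.linalg.norm(v)

b_ols = unit(np.linalg.solve(S, g))                                  # delta = 0
b_pls1 = unit(np.linalg.solve(S + 1e9 * np.eye(p), g))               # delta -> infinity
b_pcr1 = unit(np.linalg.solve(S - (lam_max - 1e-6) * np.eye(p), g))  # delta -> -lam_max

# delta -> infinity gives the direction of X^T y (the PLS first factor);
# delta -> -lam_max gives the dominant eigenvector (the first PC)
```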
One more such method: maximise Corr(t_1, y)² · Var(t_1)^γ with respect to c_1.
Choose the γ that yields the best cross-validation.
Repeat on the y-residuals to form the next factor t_2, etc.
This is Continuum regression (Stone & Brooks).
Any of these shrinkage methods is justifiable, and they typically yield quite similar predictors. Perhaps PCR and PLS are conceptually preferable, and PLS is slightly more efficient.
Schematic picture of PLS or PCR.
Shrinkage methods are not invariant under transformations of x.
Autoscaling (may be difficult): sometimes it is reasonable, often not, for instance with spectral measurements or a difference spectrum.
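The non-invariance is easy to demonstrate: one-component PCR gives different fitted values before and after autoscaling of x. A sketch on simulated data (names my own):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
X = np.column_stack([rng.normal(size=n), 100 * rng.normal(size=n)])
y = X[:, 0] + 0.01 * X[:, 1] + rng.normal(scale=0.1, size=n)

def pcr1_fitted(X, y):
    """Fitted values from PCR with one principal component."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    t1 = (U * s)[:, :1]
    g = np.linalg.lstsq(t1, y - y.mean(), rcond=None)[0]
    return y.mean() + (t1 @ g).ravel()

f_raw = pcr1_fitted(X, y)
f_scaled = pcr1_fitted(X / X.std(axis=0), y)  # after autoscaling of x
diff = np.max(np.abs(f_raw - f_scaled))       # nonzero: PCR is not invariant
```

OLS fitted values, by contrast, are unchanged by such rescalings of x.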
How about weight and fat thickness for pigs?
Two different situations: calibration and prediction.
Calibrate for individuals which might be in the population
(with PCR or PLS: t_1 and t_2 from the calibration).
Larger samples
Wider samples
More / other variables
Better model (transform variables, include interactions, include nonlinearities)
Better predictor.
Double regression fits well with PCR/PCA.
Proposed procedure:
Use PCA to find the principal components describing the variation in (X, Z) jointly, from the total data set on (X, Z). Say the result is t = t(X, Z).
Use OLS regression of y on t to construct a predictor based on t = t(X, Z).
Use cross-validation to choose the number of PCs in (X, Z) that best predicts y.
When only x is available, predict y via the PCs, see next page.
Both X and Z are selected to be able to describe y, so there will probably be a large extent of collinearity in (X, Z). Hence PCR and similar methods are probably motivated.
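The first two steps of the proposed procedure might be sketched as follows. The data are simulated, the number of PCs is fixed where cross-validation would be used, and all names are my own:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
X = rng.normal(size=(n, 3))                         # cheap measurements x
Z = X[:, :2] @ np.array([[1.0, 0.5], [0.5, 1.0]]) \
    + 0.1 * rng.normal(size=(n, 2))                 # expensive measurements z
y = Z[:, 0] + rng.normal(scale=0.5, size=n)

# Step 1: PCA of the joint data (X, Z)
W = np.column_stack([X, Z])
Wc = W - W.mean(axis=0)
U, s, Vt = np.linalg.svd(Wc, full_matrices=False)
k = 2                                               # would be chosen by cross-validation
T = (U * s)[:, :k]                                  # joint scores t = t(X, Z)

# Step 2: OLS regression of y on the scores t
A = np.column_stack([np.ones(n), T])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
fitted = A @ coef
rss = np.sum((y - fitted) ** 2)
tss = np.sum((y - y.mean()) ** 2)
```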
(Application to double regression)
Predict y_0 by the OLS regression of y on the scores t = t(X, Z).
Missing data are more easily handled by PLS & PCR than by methods like ridge regression.
(Data from Brockhoff et al. 1993)
Concerns: smell of apples after storage under n = 48 different conditions.
Y = preference of smell on a 0–5 point scale, averaged over a trained sensory panel of 10 assessors.
X = (x_1, …, x_15) = intensities of p = 15 GC peaks, corresponding to 15 volatiles.
Questions:
Can y be predicted from x?
Can y be understood from x?
X data (GC peak areas): n = 48 samples, p = 15 variables.
y on x plots for some x-variables.
MSEP.
Regression coefficients when Ordinary Least Squares regression is used.
Star: PLS, one LV. Box: PCR, one PC.
Regression of y = “preference” on the GC peak intensities.
Karlsson, Karlberg & Olsson, KTH & SU, Anal. Chim. Acta 1995.
125 specimens, spectra at 316 wavelengths (UV–visible), and nitrate by a reference method.
Data centered for each wavelength.
Minimum norm LS.
Shows that we gain a little by restricting the data to the first 100 wavelengths (the remaining ones appear not to contain any information).
PLS, centered data. Two curves: all wavelengths, and the first 100 wavelengths only.
See how similar the curves are! (Curves of regression coefficients as a function of wavelength.)
PCR(20), PLS(20) and LSRR(0.003), chosen to have their highest peaks of about the same amplitude.
PCR: dotted line, PLS: dashed line, LSRR: solid line.
Regressions of nitrate on the absorbances at the first 100 wavelengths. CV leave-one-out MSEP values for PLS and PCR, plotted against the number of factors. PCR: dashed line, PLS: solid line.
Calibration set and test set after a random split ⇒ “leave one out” and the test set yield about the same MSEP values.
Calibration set and test set separated in time ⇒ “leave one out” is too optimistic in its MSEP.
The same situation as on the previous page, but with PCR instead of PLS.
Brown, P. J. (1993). Measurement, Regression and Calibration. Oxford University Press.

Martens, H. & Næs, T. (1989). Multivariate Calibration. Wiley.

Sundberg, R. (1999). Multivariate calibration – direct and indirect regression methodology. Scand. J. Statist., Vol. 26, pp. 161–207 (with discussion). (Review paper; contains the spectroscopy example.)

Sundberg, R. (2000). Aspects of statistical regression in sensometrics. Food Qual. & Preference, Vol. 11, pp. 17–26. (Contains the sensometrics example.)