Regression is a process where we establish a relation between variables. We can only use a sample, not the whole population. The estimated relationship is then used for prediction of the true value for new pigs. There is an interplay between prediction and estimation.
Suppose first that we have a (known) theoretical distribution for (x, y).
Predict a future y0 for known x0 by the conditional mean E(y | x = x0).
Prediction error = residual.
This predictor gives the minimum MSEP. Under some conditions it is a straight line:
In practice, calibration data are available instead of the theoretical distribution. The x-values may be a random sample or controlled. A straight-line regression is estimated, e.g. by OLS (Ordinary Least Squares).
Properties? How good?
If the sample size n is large, the line is precisely estimated.
The precision must be estimated, e.g. by MSE = RSS / (n − p − 1), where p is the dimension of x; in this example p = 1.
MSE is not (quite) fair as a prediction error measure, because each yi has already influenced the fitted line.
Variation of (1): leave out larger subsets of the data, not only one at a time.
A sort of extreme of (2): Calibration set – Validation set (test set).
Two MSEP values result when the two sets exchange roles.
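As a minimal numpy sketch of these error measures (the data and function names are invented for illustration), here is a leave-one-out cross-validated MSEP for a straight-line OLS fit, compared with the apparent (training) MSE:

```python
import numpy as np

def loo_msep_ols(X, y):
    """Leave-one-out cross-validated MSEP for OLS with intercept:
    each y_i is predicted from a fit on the other n-1 observations."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])   # add intercept column
    errs = []
    for i in range(n):
        keep = np.arange(n) != i
        b, *_ = np.linalg.lstsq(X1[keep], y[keep], rcond=None)
        errs.append(y[i] - X1[i] @ b)
    return np.mean(np.square(errs))

# Toy calibration data: y roughly linear in x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 30)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3, 30)

X1 = np.column_stack([np.ones(30), x])
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
mse_train = np.mean((y - X1 @ b) ** 2)      # apparent error: optimistic
msep = loo_msep_ols(x.reshape(-1, 1), y)    # honest prediction error
```

Because each left-out residual equals the ordinary residual inflated by 1/(1 − h_ii), the leave-one-out MSEP is never smaller than the apparent MSE, illustrating that the apparent MSE is optimistic.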
For large n and OLS regression with p << n, the difference between MSE and MSEP is small.
Example from a Danish study (1996).
General conclusion: MSE is too optimistic, a little or much.
PLS is one of several shrinkage methods (regularisation methods); others are PCR, CR, RR, LSRR.
Why shrink? To compensate for (near-)collinearity: there is a risk of an extreme slope of the OLS-fitted plane, just by chance. For safety, reduce this slope.
Near-collinearity means that some linear combinations of the x-variables are almost constant (over observations).
How to detect near-collinearity? The matrix XᵀX is near singular, i.e. has some very small eigenvalue(s).
Consequence for OLS:
⇒ b is likely to have large coefficients (by chance).
This is unavoidable if p ≥ n.
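A small numpy sketch (invented data) of this eigenvalue diagnostic: two nearly identical x-variables produce one tiny eigenvalue of XᵀX and a huge condition number:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Eigenvalues of X^T X: near-collinearity shows up as a tiny eigenvalue
eigvals = np.linalg.eigvalsh(X.T @ X)      # returned in ascending order
cond = eigvals[-1] / eigvals[0]            # condition number of X^T X
```

The direction c with Xc nearly constant is the eigenvector of the smallest eigenvalue; OLS coefficients along such directions are wildly unstable.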
What are the “principal properties” of the measurement system?
What is the natural (chemical/biological) rank of the system (the data)?
Variation typically takes place essentially only in a low-dimensional space (the x-space).
There is no systematic error (bias) in OLS, but it can be far from truth/causality.
Shrunk (omitted) directions have little influence, or new data vary little in the shrunk directions (like the calibration data).
Estimation for description & interpretation.
Representativity of the calibration/test set.
What is to be predicted?
What can be achieved?
Pretreatment: shrinkage methods are not invariant under e.g. individual rescaling of x.
Principal components analysis = PCA.
Principal components regression = PCR.
The scores (t1, t2) are equivalent to (x1, x2), but whereas x1, x2 vary equally much and are strongly correlated, t1 varies much more than t2 (t2 ≈ constant) and t1, t2 are uncorrelated. It is much more likely that t1 can explain variation in y than t2.
PC1: t1 along the direction that explains most variation.
PC2: t2 in the orthogonal direction that explains next most variation, etc., if there are more dimensions.
The principal components t1, …, tp replace x1, …, xp:
t1 = c11 x1 + ⋯ + c1p xp
The ti are called scores when calculated for an observation item.
The cij are called loadings.
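The score/loading construction can be sketched in numpy (illustrative data): the eigenvectors of XᵀX give the loadings cij, and the scores are the projections ti = Xci:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.2, size=n)  # strongly correlated pair
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)                        # centre before PCA

# Loadings: eigenvectors of Xc^T Xc, ordered by decreasing eigenvalue
eigval, eigvec = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(eigval)[::-1]
loadings = eigvec[:, order]                    # columns = c_1, c_2
scores = Xc @ loadings                         # columns = t_1, t_2

var_t = scores.var(axis=0)                     # Var(t1) >> Var(t2)
corr_t = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]  # uncorrelated
```

The output reproduces the picture described above: t1 carries almost all the variation, t2 is nearly constant, and the two scores are uncorrelated.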
PCR: regress y on only t1, …, tk, instead of the full regression on t1, …, tp (equivalent to y on x1, …, xp).
Choose the number k by cross-validation.
There may be PCs which do not influence y; why include them in the regression? They would only add noise.
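A minimal numpy sketch of PCR with k chosen by leave-one-out cross-validation (all data and helper names are invented for illustration):

```python
import numpy as np

def pcr_fit(X, y, k):
    """Principal components regression: regress y on the first k PC
    scores of centred X; return (intercept, coefficients) expressed
    back in the original x-variables."""
    xm, ym = X.mean(axis=0), y.mean()
    Xc = X - xm
    eigval, eigvec = np.linalg.eigh(Xc.T @ Xc)
    C = eigvec[:, np.argsort(eigval)[::-1][:k]]     # first k loadings
    T = Xc @ C                                      # scores t1..tk
    g, *_ = np.linalg.lstsq(T, y - ym, rcond=None)  # regress y on scores
    b = C @ g                                       # back to x-coefficients
    return ym - xm @ b, b

def loo_msep_pcr(X, y, k):
    n = len(y)
    errs = []
    for i in range(n):
        m = np.arange(n) != i
        a, b = pcr_fit(X[m], y[m], k)
        errs.append(y[i] - (a + X[i] @ b))
    return np.mean(np.square(errs))

# Toy data; choose k as the number of PCs with smallest CV MSEP
rng = np.random.default_rng(3)
n, p = 60, 5
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=n)
msep_by_k = {k: loo_msep_pcr(X, y, k) for k in range(1, p + 1)}
best_k = min(msep_by_k, key=msep_by_k.get)
```

In practice one often plots msep_by_k against k and picks the first clear minimum rather than the absolute one.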
PLS is more efficient in this respect, but otherwise similar to PCR.
OLS maximises Corr(y, t(x)) over t, and Corr(y, t1) over t1.
PLS maximises Cov(y, t(x)) over t, and Cov(y, t1) over t1.
PCA maximises Var(t1).
The ci values form a direction vector in x-space; t1 is the projection of x onto this direction.
PLS is a compromise between the two extremes:
the wish to have the highest possible correlation with y (OLS);
the wish to have high(est possible) variance in t1 (PCA).
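The covariance criterion for the first PLS factor has a closed form: on centred data the maximising direction is proportional to Xᵀy. A numpy sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 80
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# First PLS direction: c1 proportional to Xc^T yc maximises
# |Cov(yc, Xc c)| over all unit-length directions c
c1 = Xc.T @ yc
c1 /= np.linalg.norm(c1)
t1 = Xc @ c1                                   # first PLS score vector

# Any other unit-length direction has smaller covariance with y
e1 = np.array([1.0, 0.0, 0.0, 0.0])
cov_pls = abs(t1 @ yc) / n
cov_e1 = abs((Xc @ e1) @ yc) / n
```

By the Cauchy–Schwarz inequality, cov_pls is the maximum over all unit directions, which is exactly what "PLS maximises Cov(y, t1)" means.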
A more general approach:
1) Maximise some expression f(Corr(t1, y), Var(t1)) with respect to the direction c1, where f is increasing both in Corr and in Var. This yields some direction c1.
2) OLS-regress y on t1 to form a predictor.
3) Calculate the residuals from 2) and repeat the procedure on them, if more factors are wanted.
It can be shown then that OLS, PLS and PCR all arise as special cases of such criteria.
Note to step 2 above: in ridge regression (RR), with bRR = (XᵀX + λI)⁻¹Xᵀy, the estimator can be upscaled by least squares (OLS regression of y on the fitted values X bRR), so-called LSRR.
In typical use, RR ≈ LSRR (in the sense bRR ≈ bLSRR for small λ).
Choose λ by cross-validation.
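A numpy sketch (toy data) of RR and its least-squares upscaling LSRR; note that the ridge coefficient vector is genuinely shrunk relative to OLS, and the upscaling factor is at least 1:

```python
import numpy as np

def ridge(X, y, lam):
    """b_RR = (X^T X + lam*I)^{-1} X^T y on centred data."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(5)
n, p = 40, 3
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)
y -= y.mean()

lam = 0.5                      # in practice chosen by cross-validation
b_ols = ridge(X, y, 0.0)       # lam = 0 recovers OLS
b_rr = ridge(X, y, lam)        # shrunk coefficients

# LSRR: upscale b_RR by OLS of y on the ridge fitted values X b_RR
fit = X @ b_rr
scale = (fit @ y) / (fit @ fit)
b_lsrr = scale * b_rr
```

For small λ the upscaling factor is close to 1, so bRR ≈ bLSRR, as stated above.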
OLS, PLS, PCR satisfy criteria of this type ⇒ within the ridge family bRR(λ):
λ = 0 gives bOLS = (XᵀX)⁻¹Xᵀy;
λ → ∞ gives the first factor, first latent variable (PLS);
λ → −λmax gives the first PC, where λmax = the maximal eigenvalue of XᵀX.
So all these methods are strongly related mutually.
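These limiting relations are easy to verify numerically (toy data): at λ = 0 the ridge solution equals OLS, and as λ → ∞ its direction tends to Xᵀy, the first PLS direction:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 30, 3
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=0.2, size=n)
y -= y.mean()

def b_rr(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

b_ols = np.linalg.solve(X.T @ X, X.T @ y)      # the lam = 0 case

d_inf = b_rr(1e8)
d_inf = d_inf / np.linalg.norm(d_inf)          # direction for huge lam
d_pls = X.T @ y
d_pls = d_pls / np.linalg.norm(d_pls)          # first PLS direction
cosine = abs(d_inf @ d_pls)                    # close to 1
```

The coefficient magnitudes collapse as λ grows, but the direction stabilises at the first PLS factor; only the direction matters once an LSRR-type rescaling is applied.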
One more such method: maximise f(Corr(t1, y), Var(t1)) with respect to c1, using the compromise (the balance between correlation and variance) that yields the best cross-validation; then repeat on the y-residuals to form the next factor t2, etc. This is Continuum regression (Stone & Brooks).
Each of these shrinkage methods is justifiable, and they typically yield quite similar predictors. Perhaps PCR and PLS are conceptually preferable, and PLS is slightly more efficient.
(Geometric picture of PLS or PCR.)
These methods are not invariant under transformations of x.
Autoscaling (which may be difficult) is sometimes reasonable, often not, for instance with spectral measurements or difference spectra.
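The scaling point can be demonstrated directly (invented data): OLS predictions are unchanged when one x-variable is rescaled, but ridge predictions (and likewise PCR/PLS) are not:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 40, 3
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=0.3, size=n)
y -= y.mean()

D = np.diag([10.0, 1.0, 1.0])       # rescale x1 by a factor of 10
Xs = X @ D

def fitted(X, y, lam):
    """Fitted values from ridge regression (lam = 0 gives OLS)."""
    b = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return X @ b

# OLS is equivariant: fitted values identical under rescaling
ols_gap = np.max(np.abs(fitted(X, y, 0.0) - fitted(Xs, y, 0.0)))
# Ridge is not: the rescaled variable is shrunk differently
rr_gap = np.max(np.abs(fitted(X, y, 1.0) - fitted(Xs, y, 1.0)))
```

This is why the pretreatment (scaling) decision matters for all shrinkage methods but is irrelevant for OLS.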
What about weight and fat thickness for pigs? Prediction is wanted for individuals which might be in the population. Compute (by PCR or PLS) the scores t1 and t2 from the calibration.
Alternatives: include other variables, or refine the model (transform variables, include interactions, include nonlinearities).
Double regression fits well with PCR/PCA:
1) Use PCA to find the principal components describing the variation in (X, Z) jointly, from the total data set on (X, Z). Say the result is t = t(X, Z).
2) Use regression of y on t to construct a predictor based on t = t(X, Z).
3) Use cross-validation to choose the number of PCs in (X, Z) that best predicts y.
When only x is available, predict y via the PCs, see next page.
X and Z are selected to be able to describe y, so there will probably be a large extent of collinearity in (X, Z). Hence PCR or a similar method is probably motivated.
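The joint-PCA double-regression recipe above, sketched in numpy with invented X, Z and y (Z deliberately constructed to be nearly collinear with X):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50
X = rng.normal(size=(n, 4))
# Z is essentially a linear function of X plus a little noise
Z = X[:, :2] @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(n, 3))
y = X[:, 0] + Z[:, 0] + rng.normal(scale=0.3, size=n)

# Step 1: PCA on (X, Z) jointly, giving t = t(X, Z)
W = np.column_stack([X, Z])
Wc = W - W.mean(axis=0)
eigval, eigvec = np.linalg.eigh(Wc.T @ Wc)
order = np.argsort(eigval)[::-1]

k = 4                                  # in practice chosen by cross-validation
T = Wc @ eigvec[:, order[:k]]          # scores of the first k joint PCs

# Step 2: regress y on t(X, Z)
g, *_ = np.linalg.lstsq(T, y - y.mean(), rcond=None)
yhat = y.mean() + T @ g

collinearity_ratio = eigval.max() / eigval.min()   # (X, Z) near-collinear
fit_corr = np.corrcoef(yhat, y)[0, 1]
```

The huge eigenvalue ratio confirms the collinearity argument above: a full OLS on (X, Z) would be unstable, while the low-dimensional scores give a stable fit.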
(More on double regression.) Predict y0 by:
Such data are more easily handled by PLS & PCR than by methods like ridge regression.
(Data from Brockhoff et al. 1993.)
Smell of apples after storage under n = 48 different conditions.
y = preference of smell on a 0–5 point scale, averaged over a trained sensory panel of 10 assessors.
x = intensities of p = 15 GC peaks, corresponding to 15 volatiles.
Can y be predicted from x? Can y be understood from x?
Data (GC peak areas): n = 48 samples, p = 15 variables.
y-on-x plots for some x-variables.
Regression coefficients when Ordinary Least Squares regression is used.
PLS with one LV; PCR with one PC.
Regression of y = “preference” on the GC peaks.
Karlberg & Olsson, KTH & SU, Anal. Chim. Acta 1995.
Specimens with spectra at 316 wavelengths (UV–visible) and nitrate determined by a reference method; one absorbance for each wavelength.
It appears that we gain a little by restricting the data to the first 100 wavelengths (the remaining ones appear not to contain any information).
Note how similar the curves are! (Regression coefficients as a function of wavelength.)
PCR(20), PLS(20) and LSRR(0.003) chosen to have their highest peaks of about the same amplitude. PCR: dotted line, PLS: dashed line, LSRR: solid line.
Regression of nitrate on absorbances at the first 100 wavelengths. CV leave-one-out MSEP for PLS and PCR plotted against the number of factors. PCR: dashed line, PLS: solid line.
Calibration set and test set after a random split ⇒ “leave one out” and the test set yield about the same MSEP values.
Calibration set and test set separated in time ⇒ “leave one out” is too optimistic in its MSEP.
The same situation as on the previous page, but with PCR instead of PLS.
Brown P. J. (1993): Measurement, Regression and Calibration.
Martens H., Næs T. (1989): Multivariate Calibration. Wiley.
Sundberg R. (1999): Multivariate calibration – direct and indirect regression methodology. Scand. J. Statist., Vol 26, pp 161–207 (with discussion). (Review paper; contains the spectroscopy example.)
… of statistical regression in sensometrics. Food Qual. & Preference, Vol 11, pp 17–26. (Source of the sensometrics example.)