Rolf Sundberg                                                                           

Mathematical Statistics

Stockholm University



Calibration and prediction aspects on pig grading


Lelystad, May 2000




Calibration is a process where we establish a relation between what the instrument measures (x) and the true value (y)



We can only use a sample – not the whole population


The relationship is used for prediction of the true value for new pigs.

Quantification of prediction precision.

There is interplay between prediction and estimation.








(a) Idealised case

(Imagined known) theoretical distribution for (x,y)


Predict a future y0 for known x0 by some predictor ŷ0 = f(x0)

Prediction error = residual  y0 − ŷ0

Accuracy measure:  MSEP = E(y0 − ŷ0)²  (mean squared error of prediction)


Minimum MSEP for  ŷ0 = E(y | x = x0)

Under some conditions this is a straight line:

ŷ0 = α + β x0


with MSEP = Var(y | x0), independent of x0


(b) More realistic case


Calibration data available instead of theoretical distribution

Random sample or controlled x

Estimated straight line regression (e.g. by OLS = Ordinary Least Squares)


Properties? How good?

If the sample size n is large, the line is precisely determined


The precision must be estimated, by

MSE = Σ (yi − ŷi)² / (n − p − 1)


p is the dimension of x, in this example p = 1.
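The estimate above is just the residual sum of squares divided by the residual degrees of freedom n − p − 1. A minimal numerical sketch (Python/NumPy, with made-up calibration data standing in for real pig measurements):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibration data: n pigs, one x-variable (p = 1)
n, p = 30, 1
x = rng.uniform(60, 110, n)                 # e.g. a carcass measurement
y = 2.0 + 0.5 * x + rng.normal(0, 1.5, n)   # "true" line plus noise

# OLS fit of a straight line (design matrix with intercept column)
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b

# MSE with denominator n - p - 1 (residual degrees of freedom)
mse = np.sum(resid**2) / (n - p - 1)
```

With the simulated noise standard deviation 1.5, mse should estimate 1.5² = 2.25 up to sampling variability.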


MSE is not (quite) fair as a prediction error measure, because yi has already influenced the fitted line


Better:   Simulate prediction


(1)                 Cross-validation, leave-one-out



Computational burden?


(2)     Variation of (1): leave out larger subsets of data, not only one at a time


(3)     Sort of extreme of (2): Calibration set – Validation set (test set).


        Two MSEPs when the two sets exchange roles
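The leave-one-out loop of (1), and the standard hat-matrix shortcut that answers the computational-burden question for OLS, can be sketched as follows (Python/NumPy, simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = rng.uniform(0, 10, n)
y = 1.0 + 0.8 * x + rng.normal(0, 0.5, n)
X = np.column_stack([np.ones(n), x])

# Leave-one-out: refit without observation i, then predict y_i
# from the model that never saw it
press = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    press += (y[i] - X[i] @ b) ** 2
msep_loo = press / n

# For OLS the same quantity follows without any refitting:
# the deleted residual is e_i / (1 - h_ii), with h_ii the
# leverage (diagonal of the hat matrix X (X'X)^-1 X')
b_all = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b_all
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)
msep_shortcut = np.mean((resid / (1 - h)) ** 2)
```

For shrinkage methods such as PLS or PCR no such closed form is available, so the refitting loop (and hence the computational burden) is real there.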



Equivalent measures in Cross-validation, leave-one-out




Several x-variables


More on relation   MSE  --  MSEP


For large n and OLS regression, with p ≪ n:  MSEP ≈ MSE



More precisely


Examples from Danish study (1996)

















PLS (Autofom):  1.09 (!!)  vs  1.4 (!!)


General conclusion: MSE too optimistic, a little or much




PLS is one of several shrinkage methods (regularisation methods)

Others: PCR (principal components regression), CR (continuum regression), RR (ridge regression), LSRR (least squares ridge regression)


Why shrink ?


To compensate for (near-) collinearity



Near – collinearity

Obvious risk for an extreme slope of the OLS-fitted plane, just by chance.

For safety, reduce this slope


Collinearity - What is the problem?


Some linear combinations of x-variables are almost constant (over observations)


How detect near-collinearity ?


Corr(X) matrix near singular: some very small eigenvalue(s)


Statistical consequence for OLS:


=> b likely to have large coefficients (by chance)


This is unavoidable if p ≥ n


Near-collinearity is typical if p is large

Near-collinearity may occur for p small
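A small simulation (Python/NumPy, artificial data) illustrates both the diagnosis (a near-zero eigenvalue of Corr(X)) and the consequence (individual OLS coefficients that are large just by chance):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50

# Two nearly collinear predictors: x2 is x1 plus tiny noise
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.01, n)
X = np.column_stack([x1, x2])

# Diagnosis: Corr(X) is near singular, one eigenvalue near zero
eig_min = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)).min()

# Consequence: over repeated noise in y (the truth depends only on
# the common direction), the individual OLS coefficients vary wildly
b1_draws = []
for _ in range(200):
    y = x1 + rng.normal(0, 1, n)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    b1_draws.append(b[0])
spread = float(np.std(b1_draws))
```

The sum b1 + b2 is well determined here; it is the difference direction, the one in which X hardly varies, that makes each coefficient individually unstable.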


Different approach:


Variation typically takes place essentially only in a low-dimensional subspace of x-space

Under near-collinearity:



OLS has no systematic error, but its coefficients can be far from truth/causality => misinterpretations

Shrinkage works if

 shrunk (omitted) directions have little influence, or

 new data vary little in the shrunk directions (like the calibration data)


Discussion aspects


Shrinkage methods


“Trade bias for variance”


1.     Principal components analysis = PCA;  principal components regression = PCR


(t1, t2) are equivalent to (x1, x2), but whereas x1, x2 vary equally much and are strongly correlated, t1 varies much more than t2 (≈ constant) and t1, t2 are uncorrelated.


It is much more likely that t1 can explain variation in y than that t2 does.


PC1: t1 along direction that explains most variation


PC2: t2 in orthogonal direction that explains next most variation etc, if there are more dimensions.



PCA:     t1, …, tp   which replace  x1, …, xp

              t1 = c11 x1 + ⋯ + c1p xp        PC1


              the ti are called scores when calculated for an observation item

              the cij are called loadings


PCR:     Regress y on only t1, …, tk,

              instead of full regression on t1, …, tp, or equivalently y on x1, …, xp


Choose number k by cross-validation
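A sketch of PCR along these lines (Python/NumPy, made-up data; the scores t and loadings c are obtained via the SVD of the centred X):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 5
X = rng.normal(0, 1, (n, p))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(0, 1, n)   # build in near-collinearity
y = X[:, 0] - X[:, 1] + rng.normal(0, 0.5, n)

# Centre, then PCA via the SVD: rows of Vt hold the loadings c_ij,
# and T = U * s holds the score vectors t_1, ..., t_p
Xc = X - X.mean(axis=0)
yc = y - y.mean()
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s

def pcr_coefficients(k):
    """Regress y on the first k score vectors, map back to x-scale."""
    g = np.linalg.lstsq(T[:, :k], yc, rcond=None)[0]
    return Vt[:k].T @ g

b_k2 = pcr_coefficients(2)     # shrunk: only the first 2 PCs
b_full = pcr_coefficients(p)   # k = p recovers OLS on centred data
b_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]
```

In practice k would be chosen by the cross-validation loop above rather than fixed in advance.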


Possible inefficiency:  there may be PCs which do not influence y.

Why include them in regression? Only contributing uncertainty.


PLS is more efficient in this respect, but otherwise similar to PCR

OLS maximises Corr(y,t(x)) over t and Corr(y,t1) over t1


PLS maximises Cov(y,t(x)) over t and Cov(y,t1) over t1


PCR   maximises  Var (t1)


The ci values form a direction vector, normalised so that Σ ci² = 1





PLS is a compromise between extremes
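The three criteria give three different first directions c1, which can be compared numerically (Python/NumPy, artificial near-collinear data; each direction maximises its own criterion over unit vectors):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 80, 3
X = rng.normal(0, 1, (n, p))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(0, 1, n)
y = X[:, 0] + rng.normal(0, 0.5, n)
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def unit(v):
    return v / np.linalg.norm(v)

# OLS direction: maximises Corr(y, t1); proportional to (X'X)^-1 X'y
c_ols = unit(np.linalg.solve(Xc.T @ Xc, Xc.T @ yc))

# PLS direction: maximises Cov(y, t1); proportional to X'y
c_pls = unit(Xc.T @ yc)

# PCR direction: maximises Var(t1); leading eigenvector of X'X
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
c_pcr = unit(eigvecs[:, -1])
```

Since Cov(y, t1) = Corr(y, t1) · sd(t1) · sd(y), the PLS direction indeed sits between the pure-correlation (OLS) and pure-variance (PCR) extremes.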


More general approach:


1)     Maximise some expression  f(Corr(t1, y), Var(t1))  with respect to the direction vector c1,


where f is increasing both in Corr and in Var. This yields some direction c1.


2)       OLS regress y on t1 to form predictor




3)     Calculate residuals from 2), and repeat the procedure on them, if desirable


It can be shown then that



Note to step 2 above:

Upscaling of bRR by least squares, so-called LSRR.


In typical use, RR ≈ LSRR (in the sense bRR ≈ bLSRR for small δ)


Choose δ by cross-validation


Now OLS, PLS, PCR satisfy criteria of this type ⇒ LSRR with particular δ:


OLS:     δ = 0            bOLS = (XᵀX)⁻¹Xᵀy

PLS:     δ → ∞           gives the first factor, the first latent variable

PCR:     δ → −λmax       gives the first PC


λmax = maximal eigenvalue of (XᵀX)


So all these methods are strongly related mutually.
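The two limiting cases above can be checked numerically (Python/NumPy, made-up data; δ denotes the ridge constant, and the "upscaling" step is the simple OLS regression of y on the ridge fit):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 60, 4
X = rng.normal(0, 1, (n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(0, 1, n)   # near-collinearity
y = X[:, 0] + rng.normal(0, 0.5, n)
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def unit(v):
    return v / np.linalg.norm(v)

def ridge(delta):
    """Ridge regression: b_RR = (X'X + delta*I)^-1 X'y."""
    return np.linalg.solve(Xc.T @ Xc + delta * np.eye(p), Xc.T @ yc)

def lsrr(delta):
    """LSRR: rescale b_RR by OLS of y on the ridge fit X b_RR."""
    b_rr = ridge(delta)
    fit = Xc @ b_rr
    return (fit @ yc) / (fit @ fit) * b_rr

b_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
b_delta0 = lsrr(1e-8)        # delta -> 0: recovers OLS
pls_dir = unit(lsrr(1e8))    # delta -> infinity: direction of X'y, i.e. first PLS factor
```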


One more such method:


Maximise

Corr(t1, y)² · Var(t1)^γ

with respect to c1


Choose the γ that yields the best cross-validation.

Repeat on y-residuals to form next factor t2, etc.

This is Continuum regression (Stone & Brooks).


Any of these shrinkage methods is justifiable, and they typically yield quite similar predictors. Perhaps PCR and PLS are conceptually preferable, and PLS is slightly more efficient.


Schematic picture of PLS or PCR



Shrinkage methods are not invariant under transformations of x

Autoscaling (the choice may be difficult): sometimes it is reasonable, often not, for instance with spectral measurements (difference spectra).
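A small demonstration of the non-invariance (Python/NumPy, artificial data): rescaling one x-variable changes the first PLS direction, while OLS gives equivalent coefficients.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
X = rng.normal(0, 1, (n, 2))
y = X[:, 0] + X[:, 1] + rng.normal(0, 0.5, n)
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def unit(v):
    return v / np.linalg.norm(v)

# First PLS direction (proportional to X'y) before and after
# rescaling x2, say from kg to g
c_before = unit(Xc.T @ yc)
X2 = Xc.copy()
X2[:, 1] *= 1000.0
c_after = unit(X2.T @ yc)

# OLS, in contrast, is equivariant: the coefficient simply rescales
b_before = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
b_after = np.linalg.solve(X2.T @ X2, X2.T @ yc)
```

So the units chosen for weight and fat thickness genuinely matter for a shrinkage method, which is what makes the scaling question above non-trivial.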


How about weight and fat thickness for pigs?


How to detect (and treat) outliers

Two different situations:



Calibrate for individuals, which might be in the population.


Prediction phase 

(with PCR or PLS) t1 and t2 from calibration.


How to reduce prediction errors ?

Aspects on double regression

Double regression fits well with PCR/PCA








Proposed procedure:

  1.  Use PCA to find the principal components describing the variation in (X, Z) jointly, from the total data set on (X, Z). Say the result is t = t (X, Z).

  2.  Use OLS regression of y on t to construct a predictor based on t = t (X, Z).

  3.  Use cross-validation to choose the number of PC’s in (X, Z) that best predicts y.

  4. When only x is available, predict y via PC’s, see next page.


Both X and Z are selected to be able to describe y, so there will probably be a large degree of collinearity in (X, Z). Hence PCR and similar methods are probably motivated.

PLS/PCR and missing data

(Application on double regression)


Predict y0  by

Missing data more easily handled by PLS & PCR than by methods like ridge regression.
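A sketch of the proposed procedure, including the prediction step when only x is observed (Python/NumPy, hypothetical latent-factor data; estimating the new pig's score by least squares on the x-part of the loadings is an assumption about the intended step 4):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
# Hypothetical double-regression setup: cheap variables X, expensive
# variable Z, all driven by a common latent factor f
f = rng.normal(0, 1, n)
X = np.column_stack([f + 0.1 * rng.normal(0, 1, n),
                     f + 0.1 * rng.normal(0, 1, n)])
Z = (f + 0.1 * rng.normal(0, 1, n)).reshape(-1, 1)
y = f + 0.3 * rng.normal(0, 1, n)

W = np.column_stack([X, Z])
Wc = W - W.mean(axis=0)
yc = y - y.mean()

# Steps 1-2: PCA on (X, Z) jointly, then OLS of y on the first k scores
U, s, Vt = np.linalg.svd(Wc, full_matrices=False)
k = 1                                   # in practice chosen by CV (step 3)
T = (U * s)[:, :k]
g = np.linalg.lstsq(T, yc, rcond=None)[0]

# Step 4: new pig with only x observed -- estimate its score(s) by
# least squares on the x-part of the loadings, then predict y
x0 = np.array([0.8, 0.7]) - X.mean(axis=0)
Vx = Vt[:k, :2].T                       # loadings restricted to the x-variables
t0 = np.linalg.lstsq(Vx, x0, rcond=None)[0]
y0_hat = y.mean() + t0 @ g
```

Because the PC loadings are known, the missing Z enters only through the score estimate, which is why PCR/PLS handle this more gracefully than ridge regression.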


Sensometrics example

(Data from Brockhoff et al. 1993)

Concerns: Smell of apples after storage under n = 48 different conditions.


Y= preference of smell on a 0-5 point scale, averaged over a trained sensory panel of 10 assessors.


X = (x1, …, x15) = intensities of p = 15 GC peaks, corresponding to 15 volatiles.



Can y be predicted from x?

Can y be understood from x?


X data (GC peak areas), n=48 samples, p=15 variables.

y on x plots for some x-variables



Regression coefficients when Ordinary Least Squares regression is used.




Star: PLS one LV

Box: PCR one PC

Regression of y = “preference” on GC






Spectroscopy example: Determination of nitrate (NO3-) in municipal waste water

Karlsson, Karlberg & Olsson, KTH & SU, Analytica Chimica Acta 1995.

125 specimens, spectra at 316 wavelengths (UV – visible) and nitrate by reference method.

Centered data: the mean over specimens is subtracted for each wavelength.

Minimum norm LS

Shows that we gain a little by restricting data to the first 100 wavelengths (the remaining ones appear not to contain any information).


PLS, centered data.



_____ All wavelengths

--------100 first wavelengths


See how similar the curves are !

(of regression coefficients as function of wavelength)


PCR(20), PLS(20) and LSRR(0.003), chosen to have their highest peaks of about the same amplitude.

PCR: dotted line, PLS: dashed line, LSRR: solid line

Regressions of nitrate on absorbances at the first 100 wavelengths. Leave-one-out CV MSEP values.

PLS and PCR  plotted against number of factors. PCR: dashed line, PLS: solid line

Calibration set and test set after random split ⇒ "leave-one-out" and the test set yield about the same MSEP values.

Calibration set and test set separated in time ⇒ "leave-one-out" is too optimistic in its MSEP.


The same situation as on the previous page, but with PCR instead of PLS.

Some references

Brown P. J. (1993):

Measurement, regression and calibration

Oxford University Press


Martens H., Næs T. (1989)

Multivariate calibration

Wiley



Sundberg R. (1999)

Multivariate calibration – direct and indirect regression methodology.

Scand. J. Statist. , Vol 26, pp 161 – 207 (with discussion)

(Review paper; contains the spectroscopy example).


Sundberg R. (2000)

Aspects of statistical regression in sensometrics.

Food Qual. & Preference, Vol 11, pp 17 – 26

(Contains the sensometrics example)