Rolf Sundberg                                                                           

Mathematical Statistics

Stockholm University

 

 

Calibration and prediction aspects of pig grading

 

Lelystad, May 2000

 

Situation:

 

Calibration is a process where we establish a relation between the measured x-variables and the true value y that is to be predicted.

 

 

We can only use a sample – not the whole population

 

The relationship is then used for prediction of the true value for new pigs.


Quantification of prediction precision.

There is interplay between prediction and estimation.

 

Prediction:   (R)MSEP = (root) mean squared error of prediction

Estimation:   (R)MSE = (root) mean squared error

 

(a) Idealised case

(Imagined known) theoretical distribution for (x,y)

 

Predict a future y0, for known x0, by ŷ0 = E(y | x = x0).

Prediction error = residual: y0 − ŷ0.

Accuracy measure: MSEP = E[(y0 − ŷ0)²].

Minimum MSEP is attained for ŷ0 = E(y | x = x0).

Under some conditions (e.g. joint normality) E(y | x = x0) is a straight line in x0,

and MSEP = Var(y | x = x0),

independent of x0.
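For concreteness (an added illustration, standard bivariate-normal facts, not from the original notes): with means μx, μy, standard deviations σx, σy and correlation ρ,

```latex
\[
  \hat y_0 \;=\; \mathrm{E}(y \mid x = x_0)
          \;=\; \mu_y + \rho\,\frac{\sigma_y}{\sigma_x}\,(x_0 - \mu_x),
\]
\[
  \mathrm{MSEP} \;=\; \mathrm{E}\!\left[(y_0 - \hat y_0)^2 \mid x_0\right]
                \;=\; \mathrm{Var}(y \mid x = x_0)
                \;=\; \sigma_y^2\,(1 - \rho^2),
\]
```

so the regression is a straight line and the MSEP is the same whatever the value of x0.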


 

(b) More realistic case

 

Calibration data available instead of theoretical distribution

Random sample or controlled x

Estimated straight line regression (e.g. by OLS = Ordinary Least Squares)

 

Properties? How good?

If the sample size n is large, the line is precisely determined, and we are essentially back in the idealised case (a).

 

The precision must be estimated, by

      MSE = RSS / (n − p − 1) = (1/(n − p − 1)) Σ_i (y_i − ŷ_i)²

where p is the dimension of x; in this example p = 1.

 

MSE is not (quite) fair as a prediction error measure, because y_i has already influenced the fitted line (and thereby its own fitted value ŷ_i).

 

Better:   Simulate prediction

 

(1)   Cross-validation, leave-one-out:

            MSEP_CV = (1/n) Σ_i (y_i − ŷ_(i))²,  where ŷ_(i) is the prediction of y_i from the regression fitted without observation i.  (A code sketch is given below, after this list.)

 

Computational burden?

 

(2)                 Variation of (1) : Leave out larger subsets of data, not only one at a time

 

(3)   A sort of extreme of (2): split the data into a calibration set and a validation set (test set).

            Two MSEP values are obtained when the two sets exchange roles.

                          

 

Equivalent measures in leave-one-out cross-validation: PRESS = Σ_i (y_i − ŷ_(i))² = n · MSEP_CV, and RMSEP = √MSEP_CV.
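As an illustration of leave-one-out cross-validation for an OLS regression, here is a minimal Python/numpy sketch (an editor's addition, not part of the original notes; function and variable names are illustrative only):

```python
import numpy as np

def loo_msep_ols(X, y):
    """Leave-one-out cross-validation MSEP for OLS regression with intercept.

    X: (n, p) matrix of x-variables, y: length-n response vector.
    Each observation is left out in turn, the regression is refitted,
    and the left-out y_i is predicted from its x_i.
    """
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])      # add an intercept column
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i               # all observations except i
        b, *_ = np.linalg.lstsq(Xd[keep], y[keep], rcond=None)
        y_hat_i = Xd[i] @ b                    # prediction of the left-out y_i
        press += (y[i] - y_hat_i) ** 2         # PRESS = sum of squared prediction errors
    return press / n                           # MSEP_CV

# illustration with simulated straight-line data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 2.0 + 1.5 * X[:, 0] + rng.normal(scale=0.5, size=50)
print(loo_msep_ols(X, y))
```

Each of the n refits is cheap here, but the explicit loop illustrates the computational burden referred to above; leaving out larger subsets, as in (2), reduces the number of refits.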

 

 

 

Several x-variables: the same applies when y is regressed on x1, …, xp, i.e. ŷ = b0 + b1 x1 + … + bp xp (e.g. by OLS).


 

More on relation   MSE  --  MSEP

 

For large n and OLS regression, with p << n:

      MSEP ≈ MSE

More precisely:

      MSEP ≈ MSE · (1 + (p+1)/n)

 

Examples from Danish study (1996)

 

 

                   n      p    MSEP/MSE    1 + (p+1)/n
OLS (FOM/MK)      202      4     1.03         1.025
PCR (CC)          344     11     1.07         1.03
PLS (Autofom)     344    127     1.09 !!!!    1.4 !!!!

 

General conclusion: MSE is too optimistic as a measure of prediction error, by a little or by much.
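To see the "MSE too optimistic" point numerically, here is a small simulation sketch (an added illustration; the model, sample sizes and seed are arbitrary assumptions, not taken from the Danish study). It fits OLS to simulated calibration data and compares the in-sample MSE with the prediction error on fresh data from the same model; the simulated ratio MSEP/MSE should come out near the factor 1 + (p+1)/n in the table above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, reps = 100, 4, 1.0, 2000
ratios = []

for _ in range(reps):
    # calibration data from a linear model with p x-variables
    X = rng.normal(size=(n, p))
    Xd = np.column_stack([np.ones(n), X])
    beta = rng.normal(size=p + 1)
    y = Xd @ beta + rng.normal(scale=sigma, size=n)

    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ b
    mse = resid @ resid / (n - p - 1)          # in-sample MSE (estimates sigma^2)

    # fresh data from the same model: honest prediction errors
    Xnew = rng.normal(size=(n, p))
    Xnewd = np.column_stack([np.ones(n), Xnew])
    ynew = Xnewd @ beta + rng.normal(scale=sigma, size=n)
    msep = np.mean((ynew - Xnewd @ b) ** 2)

    ratios.append(msep / mse)

print(np.mean(ratios), 1 + (p + 1) / n)        # simulated ratio vs. the approximation
```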

 


 

Shrink

PLS is one of several shrinkage methods (regularisation methods)

Others: PCR (principal components regression), CR (continuum regression), RR (ridge regression) and LSRR (least squares ridge regression)

 

Why shrink ?

 

To compensate for (near-) collinearity

 


 

Near-collinearity

There is an obvious risk of an extreme slope of the OLS-fitted plane, just by chance.

For safety, reduce (shrink) this slope.

 

Collinearity - What is the problem?

 

Some linear combinations of x-variables are almost constant (over observations)

 

How detect near-collinearity ?

 

The correlation matrix Corr(X) is near-singular: it has some very small eigenvalue(s).

 

Statistical consequence for OLS:

 

⇒  b is likely to have some large coefficients (just by chance)

 

This is unavoidable if p ≥ n.

 

Near- collinearity is typical if p is large

Near-collinearity may occur for p small
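A minimal sketch (an editor's addition, illustrative names only) of how near-collinearity can be detected from the eigenvalues of Corr(X):

```python
import numpy as np

def collinearity_diagnostics(X):
    """Eigenvalues of Corr(X) and the resulting condition number.

    Very small eigenvalues (a near-singular correlation matrix) mean that some
    linear combination of the x-variables is almost constant over the observations.
    """
    R = np.corrcoef(X, rowvar=False)       # correlation matrix of the x-variables
    eigvals = np.linalg.eigvalsh(R)        # eigenvalues in ascending order
    return eigvals, eigvals[-1] / eigvals[0]

# illustration: x3 is almost x1 + x2, so one eigenvalue is close to zero
rng = np.random.default_rng(2)
x1, x2 = rng.normal(size=100), rng.normal(size=100)
x3 = x1 + x2 + 0.01 * rng.normal(size=100)
eigvals, cond = collinearity_diagnostics(np.column_stack([x1, x2, x3]))
print(eigvals, cond)
```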

 

A different approach:

Variation in x typically takes place essentially only in a low-dimensional subspace of x-space.


Under near-collinearity:

Estimator: with OLS there is no systematic error, but b can be far from the truth/causality ⇒ misinterpretations.

Predictor: works if

      the shrunk (omitted) directions have little influence on y, or

      the new data vary little in the shrunk directions (like the calibration data).

 

Discussion aspects


 

Shrinkage methods

 

“Trade bias for variance”

 

1.     Principal components analysis = PCA; the corresponding regression = PCR

 

(t1, t2) are equivalent to (x1, x2), but whereas x1 and x2 vary equally much and are strongly correlated, t1 varies much more than t2 (≈ constant), and t1 and t2 are uncorrelated.

It is therefore much more likely that t1 can explain variation in y than that t2 does.

 

PC1: t1 along the direction that explains the most variation.

PC2: t2 in the orthogonal direction that explains the next most variation, and so on, if there are more dimensions.

 

So:

PCA:      t1, …, tp, which replace x1, …, xp

              t1 = c11 x1 + … + c1p xp        (PC1)

              The ti are called scores when calculated for an observation item.

              The cij are called loadings.

 

PCR:     Regress y on only t1, …, tk,

              instead of the full regression on t1, …, tp, or equivalently of y on x1, …, xp.

 

Choose number k by cross-validation

 

Possible inefficiency: there may be PCs which do not influence y.

Why include them in the regression? They only contribute uncertainty.
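The PCA/PCR recipe above can be written compactly. The following is a minimal numpy sketch (an editor's addition, not the notes' own code; pcr_fit, pcr_predict and choose_k_by_cv are illustrative names), with k chosen by leave-one-out cross-validation:

```python
import numpy as np

def pcr_fit(X, y, k):
    """PCR: regress y on the first k principal component scores of X (centred data)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    C = Vt[:k].T                                   # loadings c_1, ..., c_k (columns)
    T = Xc @ C                                     # scores t_1, ..., t_k
    q, *_ = np.linalg.lstsq(T, yc, rcond=None)     # OLS of y on the scores
    b = C @ q                                      # coefficients expressed in the x-variables
    return x_mean, y_mean, b

def pcr_predict(model, Xnew):
    x_mean, y_mean, b = model
    return y_mean + (Xnew - x_mean) @ b

def choose_k_by_cv(X, y, kmax):
    """Pick the number of components k with the smallest leave-one-out MSEP."""
    n = len(y)
    best_k, best_msep = 1, np.inf
    for k in range(1, kmax + 1):
        press = 0.0
        for i in range(n):
            keep = np.arange(n) != i
            model = pcr_fit(X[keep], y[keep], k)
            press += (y[i] - pcr_predict(model, X[i:i + 1])[0]) ** 2
        if press / n < best_msep:
            best_k, best_msep = k, press / n
    return best_k, best_msep
```

For large n, leaving out larger subsets instead of single observations keeps the cross-validation cheaper.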

 

PLS is more efficient in this respect, but otherwise similar to PCR.


OLS maximises Corr(y,t(x)) over t and Corr(y,t1) over t1

 

PLS maximises Cov(y,t(x)) over t and Cov(y,t1) over t1

 

PCR   maximises  Var (t1)

 

The ci values form a direction vector c1 = (c11, …, c1p), normalised to unit length (Σ ci² = 1).

 

 

 

 

PLS is a compromise between extremes
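The three criteria lead to three different first directions c1. A small numpy sketch (an editor's addition; it assumes centred data and unit-length directions) makes the contrast explicit:

```python
import numpy as np

def first_directions(X, y):
    """First direction c1 (unit length, centred data) under the three criteria.

    PCR: maximises Var(t1)     -> leading right singular vector of Xc
    PLS: maximises Cov(t1, y)  -> direction proportional to Xc' yc
    OLS: maximises Corr(t1, y) -> direction proportional to the OLS coefficient vector
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()

    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    c_pcr = Vt[0]

    w = Xc.T @ yc
    c_pls = w / np.linalg.norm(w)

    b, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    c_ols = b / np.linalg.norm(b)

    return c_ols, c_pls, c_pcr
```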

 

More general approach:

 

1)     Maximise some expression f(Corr(t1, y), Var(t1)), with t1 = c1ᵀx, with respect to the direction c1 (normalised to |c1| = 1).

Here f is increasing both in Corr and in Var. This yields some direction c1.

 

2)       OLS regress y on t1 to form predictor

 

 

 

3)     Calculate residuals from 2), and repeat the procedure on them, if desirable

 

It can be shown then that

 

 

Note to step 2 above:

Ridge regression (RR): b_RR = (XᵀX + d I)⁻¹ Xᵀy, for a ridge constant d.
Upscaling of b_RR by least squares (OLS regression of y on the single variable X b_RR) gives the so-called LSRR.

In typical use RR ≈ LSRR (in the sense b_RR ≈ b_LSRR) for small d.

Choose d by cross-validation.

Now OLS, one-factor PLS and one-component PCR are all obtained as LSRR for particular values of d:

OLS:     d = 0,             b_OLS = (XᵀX)⁻¹ Xᵀy

PLS:     d → ∞              yields the first factor (first latent variable)

PCR:     d → −λ_max         yields the first PC

λ_max = the maximal eigenvalue of XᵀX

So all these methods are closely related to each other.
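A minimal sketch of RR and its least-squares upscaling LSRR on centred data (an editor's addition; d and ridge_and_lsrr are illustrative names):

```python
import numpy as np

def ridge_and_lsrr(X, y, d):
    """Ridge regression on centred data, b_RR = (X'X + d I)^{-1} X'y,
    and its least-squares upscaling ('LSRR'): OLS of y on the single
    variable t = X b_RR, which just rescales b_RR.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = Xc.shape[1]
    b_rr = np.linalg.solve(Xc.T @ Xc + d * np.eye(p), Xc.T @ yc)
    t = Xc @ b_rr                       # one-dimensional 'factor'
    scale = (t @ yc) / (t @ t)          # OLS slope of y on t
    return b_rr, scale * b_rr           # (b_RR, b_LSRR)

# d = 0 reproduces OLS; a very large d gives a direction proportional to X'y
# (the first PLS direction); d close to -lambda_max approaches the first PC direction.
```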

 

One more such method:

Maximise

      Corr(t1, y) · Var(t1)^γ

with respect to c1 (with |c1| = 1).

Choose the γ that yields the best cross-validation.

Repeat on the y-residuals to form the next factor t2, etc.

This is continuum regression (Stone & Brooks, 1990).

 

Any of these shrinkage methods is justifiable, and they typically yield quite similar predictors. Perhaps PCR and PLS are conceptually preferable, and PLS is slightly more efficient.

 

Schematic picture of PLS or PCR


 

Pretreatment

Shrinkage methods are not invariant under transformations (e.g. rescaling) of x.

Autoscaling (scaling each x-variable to unit variance) may be difficult; sometimes it is reasonable, often not, for instance with spectral measurements (difference spectra).

 

How about weight and fat thickness for pigs?
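For reference, autoscaling itself is a simple operation; a sketch (an editor's addition, not from the original notes):

```python
import numpy as np

def autoscale(X):
    """Centre each x-variable and scale it to unit standard deviation.

    PCR/PLS results change under such rescaling (they are not invariant under
    transformations of x), so the decision to autoscale is part of the model.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    return (X - mean) / std, mean, std
```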

 

How to detect (and treat) outliers

Two different situations:

 

Calibration

Calibrate for individuals of the kind that might occur in the population.


 

Prediction phase 

(With PCR or PLS) the scores t1 and t2 of a new observation can be compared with those from the calibration.

 

How to reduce prediction errors ?

Aspects of double regression

Double regression fits well with PCR/PCA

 

                  

 

 

 

 

 

Proposed procedure:

  1.  Use PCA to find the principal components describing the variation in (X, Z) jointly, from the total data set on (X, Z). Say the result is t = t (X, Z).

  2.  Use OLS regression of y on t to construct a predictor based on t = t (X, Z).

  3.  Use cross-validation to choose the number of PC’s in (X, Z) that best predicts y.

  4. When only x is available, predict y via PC’s, see next page.

 

Both X and Z are selected to be able to describe y, so there will probably be a large degree of collinearity in (X, Z). Hence PCR and similar methods are probably motivated.

PLS/PCR and missing data

(Application to double regression)

Predict y0 via the PCs, with the scores estimated from the available x-variables only (a sketch follows below).

Missing data more easily handled by PLS & PCR than by methods like ridge regression.
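One standard way to carry out steps 1, 2 and 4 is sketched below (an editor's addition; the least-squares estimation of the scores from the X-part of the loadings is an assumed implementation, not necessarily the exact formula in the original notes):

```python
import numpy as np

def fit_joint_pca_regression(XZ, y, k):
    """Steps 1-2: PCA of the joint (X, Z) data, then OLS of y on the first k scores."""
    mean, y_mean = XZ.mean(axis=0), y.mean()
    A = XZ - mean
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    P = Vt[:k].T                                   # loadings of the joint (X, Z) variables
    T = A @ P                                      # scores
    q, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    return mean, y_mean, P, q

def predict_from_x_only(model, x0, p_x):
    """Step 4: only the first p_x variables (the X-block) are available.

    The scores are estimated by least squares from the X-part of the loadings,
    and y0 is then predicted from the estimated scores.
    """
    mean, y_mean, P, q = model
    Px = P[:p_x]                                   # loading rows belonging to the X-block
    t_hat, *_ = np.linalg.lstsq(Px, x0 - mean[:p_x], rcond=None)
    return y_mean + t_hat @ q
```

This kind of score estimation is what makes missing (Z-) data easy to handle with PCR/PLS-type models.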

 

Sensometrics example

(Data from Brockhoff et al. 1993)

Concerns: Smell of apples after storage under n = 48 different conditions.

 

Y= preference of smell on a 0-5 point scale, averaged over a trained sensory panel of 10 assessors.

 

X = (x1, …, x15) = intensities of p = 15 GC peaks, corresponding to 15 volatiles.

 

Questions:

Can y be predicted from x?

Can y be understood from x?

 

X data (GC peak areas), n=48 samples, p=15 variables.


y on x plots for some x-variables


MSEP

 


Regression coefficients when Ordinary Least Squares regression is used.

 


 

 

Regression of y = “preference” on the GC peaks. Star: PLS with one LV; box: PCR with one PC.

 


 


 


 


 

Spectroscopy example: Determination of nitrate (NO3-) in municipal waste water

Karlsson, Karlberg & Olsson (KTH & SU), Anal. Chim. Acta, 1995.

125 specimens, spectra at 316 wavelengths (UV – visible) and nitrate by reference method.


Centered data (the mean subtracted) for each wavelength.


Minimum norm LS


This shows that we gain a little by restricting the data to the first 100 wavelengths (the remaining ones appear not to contain any information).

 

PLS, centered data.

 

 

Solid line: all wavelengths. Dashed line: the first 100 wavelengths.


 

Note how similar the curves (of regression coefficients as functions of wavelength) are!

 

PCR(20), PLS(20) and LSRR(0.003), chosen to have their highest peaks of about the same amplitude. PCR: dotted line; PLS: dashed line; LSRR: solid line.


Regressions of nitrate on absorbances at the first 100 wavelengths. CV leave-one-out MSEP values for PLS and PCR, plotted against the number of factors. PCR: dashed line; PLS: solid line.


Calibration set and test set after a random split ⇒ “leave-one-out” and the test set yield about the same MSEP values.


Calibration set and test set separated by a split in time ⇒ “leave-one-out” is too optimistic in its MSEP.

 


The same situation as on the previous page, but with PCR instead of PLS.


Some references

Brown P. J. (1993):

Measurement, regression and calibration

Oxford University Press

 

Martens H., Næs T. (1989)

Multivariate calibration

Wiley

 

Sundberg R. (1999)

Multivariate calibration – direct and indirect regression methodology.

Scand. J. Statist. , Vol 26, pp 161 – 207 (with discussion)

(Review paper; contains the spectroscopy example).

 

Sundberg R. (2000)

Aspects of statistical regression in sensometrics.

Food Qual. & Preference, Vol 11, pp 17 – 26

(Contains the sensometrics example)