Rolf Sundberg

Mathematical Statistics

Stockholm University

# Calibration and prediction aspects of pig grading

Situation:

Calibration is the process of establishing a relation between instrumental measurements x and the true value y. We can only use a sample – not the whole population.

The relationship is then used to predict the true value for new pigs.

## Quantification of prediction precision

There is an interplay between prediction and estimation:

| | Prediction | Estimation |
|---|---|---|
| Accuracy measure | (R)MSEP | (R)MSE |

## (a) Idealised case

(Imagined) known theoretical distribution for (x, y).

Predict a future y0 for known x0 by ŷ0 = E(y | x0). The prediction error y0 − ŷ0 is then a residual, and the accuracy measure MSEP = E(y0 − ŷ0)² is minimised by this choice of ŷ0. Under some conditions (e.g. joint normality) E(y | x) is a straight line, E(y | x) = α + βx, and Var(y | x0) = σ², independent of x0.

## (b) More realistic case

Calibration data available instead of the theoretical distribution:

- Random sample, or controlled x
- Estimated straight-line regression (e.g. by OLS = Ordinary Least Squares)
- Properties? How good? If the sample size n is large, the line is precisely determined.

The precision must be estimated, e.g. by MSE = Σ(yi − ŷi)²/(n − p − 1), where p is the dimension of x; in this example p = 1.

MSE is not (quite) fair as a prediction-error measure, because yi has already influenced the fitted value ŷi.

## Better: simulate prediction

(1)                 Cross-validation, leave-one-out. (Computational burden?)

(2)                 Variation of (1): leave out larger subsets of the data, not only one observation at a time.

(3)                 A sort of extreme of (2): calibration set – validation set (test set).

Two MSEP values result when the two sets exchange roles.

### Equivalent measures in cross-validation, leave-one-out

## Several x-variables

## More on the relation MSE – MSEP
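The MSE–MSEP contrast can be sketched numerically. The following is an illustrative simulation, not from the slides (artificial data, NumPy assumed available): leave-one-out cross-validation yields an MSEP somewhat larger than the in-sample MSE for OLS.

```python
import numpy as np

# Artificial calibration data (illustrative only).
rng = np.random.default_rng(0)
n, p = 100, 1                          # sample size and dimension of x
x = rng.normal(size=(n, p))
y = 2.0 + 1.5 * x[:, 0] + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])   # design matrix with intercept

# In-sample MSE: residuals from the fit that used all n observations.
beta = np.linalg.lstsq(X, y, rcond=None)[0]
mse = np.sum((y - X @ beta) ** 2) / (n - p - 1)

# Leave-one-out MSEP: each y_i is predicted from a fit without observation i.
sq_errors = []
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    sq_errors.append((y[i] - X[i] @ b_i) ** 2)
msep = np.mean(sq_errors)

print(mse, msep, msep / mse)
```

The ratio msep/mse comes out close to the factor 1 + (p + 1)/n that the slides quote for OLS with p ≪ n.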

For large n and OLS regression with p ≪ n, MSEP ≈ MSE. More precisely, MSEP/MSE ≈ 1 + (p + 1)/n.

Examples from a Danish study (1996):

| Method | n | p | MSEP/MSE | 1 + (p+1)/n |
|---|---|---|---|---|
| OLS (FOM/MK) | 202 | 4 | 1.03 | 1.025 |
| PCR (CC) | 344 | 11 | 1.07 | 1.03 |
| PLS (Autofom) | 344 | 127 | 1.09 (!) | 1.4 (!) |

General conclusion: MSE is too optimistic, by a little or by much.

## Shrink

PLS is one of several shrinkage methods (regularisation methods)

Others: PCR, CR (continuum regression), RR (ridge regression), LSRR (least-squares ridge regression).

Why shrink?

To compensate for (near-)collinearity.

## Near-collinearity

There is an obvious risk of an extreme slope of the OLS-fitted plane, just by chance. For safety, reduce this slope.

## Collinearity - What is the problem?

Some linear combinations of x-variables are almost constant (over observations)

How detect near-collinearity ?

Corr (X) – Matrix near singular,  some very small eigenvalue(s)

Statistical consequence for OLS: b is likely to have large coefficients (by chance).

This is unavoidable if p ≥ n.

Near- collinearity is typical if p is large

Near-collinearity may occur for p small
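Near-collinearity can be detected exactly as stated above: look for very small eigenvalues of the correlation matrix. A minimal NumPy sketch on artificial data (x2 constructed to nearly duplicate x1; not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)    # nearly a copy of x1: near-collinearity
x3 = rng.normal(size=n)                # unrelated variable
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)     # eigenvalues in ascending order
print(eigvals)

# A very small eigenvalue flags an almost-constant linear combination;
# the condition number summarises the problem in one figure.
print(eigvals[-1] / eigvals[0])
```

Here the smallest eigenvalue is nearly zero, corresponding to the almost-constant combination x1 − x2.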

Different approach:

• What are the “principal properties” of the measurement system?

• What is the natural (chemical/biological) rank of the system (the data)?

Variation typically takes place essentially only in a low-dimensional subspace of x-space.

## Under near-collinearity

- Estimator (OLS): no systematic error, but the coefficients can be far from truth/causality ⇒ misinterpretations.
- Predictor (shrunk): works if the shrunk (omitted) directions have little influence on y, or if new data vary little in the shrunk directions (like the calibration data).

## Discussion aspects

• Estimation for description & interpretation, MSE

• Predictivity measures:

  – internal (simulated)

  – external (true)

• Representativity of calibration/test set

• What is to be predicted?      True y

• What can be achieved?        Measured y

• Pretreatment: shrinkage methods are not invariant under e.g. individual rescaling of the x-variables.

## Shrinkage methods

1.     Principal components analysis = PCA; regression on the components = PCR.

(t1, t2) are equivalent to (x1, x2), but whereas x1, x2 vary equally much and are strongly correlated, t1 varies much more than t2 (≈ constant), and t1, t2 are uncorrelated.

Much more likely t1 can explain variation in y, than t2 explaining it.

PC1: t1 along direction that explains most variation

PC2: t2 in orthogonal direction that explains next most variation etc, if there are more dimensions.

So:

PCA:     t1, …, tp, which replace x1, …, xp

t1 = c11x1 + ⋯ + c1pxp        (PC1)

The ti are called scores when calculated for an observation item.

PCR:     regress y on t1, …, tk only, instead of the full regression on t1, …, tp (equivalently, y on x1, …, xp).

Choose number k by cross-validation

Possible inefficiency:  there may be PCs which do not influence y.

Why include them in regression? Only contributing uncertainty.

PLS is more efficient in this respect, but otherwise similar to PCR.
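A compact PCR sketch (illustrative NumPy code on simulated near-collinear data; the function names are my own, not from the slides): compute the principal-component scores, regress y on the first k of them, and choose k by leave-one-out cross-validation.

```python
import numpy as np

def pcr_coef(X, y, k):
    """Coefficients (on centred x) of PCR with the first k components."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Score t_j = s_j u_j; OLS coefficient of yc on t_j is (u_j' yc) / s_j.
    gamma = (U[:, :k].T @ yc) / s[:k]
    return Vt[:k].T @ gamma            # back on the original x scale

def loo_msep(X, y, k):
    """Leave-one-out MSEP of PCR with k components."""
    n = len(y)
    errors = []
    for i in range(n):
        keep = np.arange(n) != i
        b = pcr_coef(X[keep], y[keep], k)
        pred = y[keep].mean() + (X[i] - X[keep].mean(axis=0)) @ b
        errors.append((y[i] - pred) ** 2)
    return np.mean(errors)

# Simulated near-collinear data: six x-variables of "natural rank" three.
rng = np.random.default_rng(3)
n = 80
Z = rng.normal(size=(n, 3))
X = np.column_stack([Z, Z + 0.05 * rng.normal(size=(n, 3))])
y = Z @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

msep_by_k = {k: loo_msep(X, y, k) for k in range(1, 7)}
best_k = min(msep_by_k, key=msep_by_k.get)
print(msep_by_k, best_k)
```

Because the data have natural rank three, the cross-validated MSEP drops sharply up to about k = 3 and gains little from the remaining, nearly constant components.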

OLS maximises Corr(y, t(x)) over t, and Corr(y, t1) over t1.

PLS maximises Cov(y, t(x)) over t, and Cov(y, t1) over t1.

PCR maximises Var(t1).

The c1i values form a direction vector. PLS is a compromise between the two extremes:

•  Wish to have highest possible correlation

•  Wish to have high(est possible) variance in t1
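The compromise can be seen in a small simulation (illustrative NumPy sketch on artificial data, not from the slides). The first PLS weight vector is proportional to Xᵀy, which balances correlation with y against variance of t1, whereas the first PCA direction follows variance alone:

```python
import numpy as np

# Artificial data: x1 has high variance but is weakly related to y;
# x2 has low variance but is strongly related to y.
rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(scale=3.0, size=n)
x2 = rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2])
y = 0.1 * x1 + 2.0 * x2 + rng.normal(scale=0.2, size=n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# First PLS direction: proportional to X'y (maximises Cov(y, Xc c) for ||c|| = 1).
c_pls = Xc.T @ yc
c_pls /= np.linalg.norm(c_pls)

# First PCA direction: leading right singular vector (maximises Var(Xc c)).
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
c_pca = Vt[0]

# PCA all but ignores x2; PLS gives it substantial weight.
print(c_pls, c_pca)
```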

More general approach:

1)     Maximise some expression f(Corr(t1, y), Var(t1)) with respect to the direction c1, where f is increasing both in Corr and in Var. This yields some direction c1.

2)     OLS-regress y on t1 to form a predictor.

3)     Calculate the residuals from 2), and repeat the procedure on them, if desirable.

It can then be shown that any criterion of this type leads to a predictor of ridge (LSRR) type; see below.

### Note to step 2 above:

Upscaling of bRR by least squares gives the so-called LSRR.

In typical use, RR ≈ LSRR (in the sense bRR ≈ bLSRR for small δ).

Choose δ by cross-validation.

Now OLS, PLS and PCR all satisfy criteria of this type ⇒ LSRR, with parameter δ:

OLS:     δ = 0,          bOLS = (XᵀX)⁻¹Xᵀy

PLS:     δ → ∞,         first factor, first latent variable

PCR:     δ → −λmax,     first factor, first PC

λmax = maximal eigenvalue of XᵀX

So all these methods are strongly related mutually.
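A minimal sketch of RR and its least-squares upscaling LSRR (illustrative NumPy code on artificial data; the value of δ is arbitrary here): ridge shrinks the coefficient vector relative to OLS, and LSRR rescales the ridge fit by an OLS regression of y on it.

```python
import numpy as np

# Two near-collinear x-variables (illustrative data).
rng = np.random.default_rng(5)
n = 100
z = rng.normal(size=n)
X = np.column_stack([z + 0.02 * rng.normal(size=n),
                     z + 0.02 * rng.normal(size=n)])
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.3, size=n)
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge(Xc, yc, delta):
    """Ridge-regression coefficients b = (X'X + delta I)^(-1) X'y."""
    return np.linalg.solve(Xc.T @ Xc + delta * np.eye(Xc.shape[1]), Xc.T @ yc)

b_ols = ridge(Xc, yc, 0.0)
b_rr = ridge(Xc, yc, 10.0)             # delta chosen arbitrarily here

# LSRR: rescale the ridge coefficient vector by OLS of y on the ridge fit.
t = Xc @ b_rr
b_lsrr = b_rr * (t @ yc) / (t @ t)

print(np.linalg.norm(b_ols), np.linalg.norm(b_rr), np.linalg.norm(b_lsrr))
```

In practice δ would be chosen by cross-validation, as the slides say, not fixed in advance.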

One more such method:

Maximise

Corr²(t1, y) · Var(t1)^γ

with respect to c1

Choose the γ that yields the best cross-validation.

Repeat on y-residuals to form next factor t2, etc.

This is Continuum regression (Stone & Brooks).
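For p = 2 the continuum-regression criterion can be sketched by a grid search over unit directions (illustrative NumPy code, not from the slides, using the criterion Corr²(t1, y) · Var(t1)^γ): γ = 0 recovers the OLS direction, γ = 1 the covariance-maximising PLS direction, and large γ approaches the first PC.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 150
X = np.column_stack([rng.normal(scale=2.0, size=n),
                     rng.normal(scale=0.5, size=n)])
y = 0.2 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.2, size=n)
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def cr_direction(gamma, n_grid=2000):
    """First continuum-regression direction by grid search over unit vectors."""
    best, best_c = -np.inf, None
    for a in np.linspace(0.0, np.pi, n_grid, endpoint=False):
        c = np.array([np.cos(a), np.sin(a)])
        t = Xc @ c
        corr2 = (t @ yc) ** 2 / ((t @ t) * (yc @ yc))
        crit = corr2 * (t @ t) ** gamma   # Corr^2(t1, y) * Var(t1)^gamma
        if crit > best:
            best, best_c = crit, c
    return best_c

print(cr_direction(0.0))   # gamma = 0: the OLS direction
print(cr_direction(1.0))   # gamma = 1: the PLS direction (max covariance)
```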

Any of these shrinkage methods is justifiable, and they typically yield quite similar predictors. Perhaps PCR and PLS are conceptually preferable, and PLS is slightly more efficient.

(Schematic picture of PLS or PCR.)

## Pretreatment

Shrinkage methods are not invariant under transformations of x. Autoscaling (which may be difficult) is sometimes reasonable, often not, for instance with spectral measurements (difference spectra).

How about weight and fat thickness for pigs?
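The non-invariance is easy to demonstrate (illustrative NumPy sketch on artificial data; the helper function is my own): rescaling one x-variable, e.g. quoting a thickness in different units, changes the one-component PCR fit, whereas OLS fitted values would be unchanged.

```python
import numpy as np

def pcr1_fitted(X, y):
    """Fitted values from regressing y on the first principal component of X."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    t1 = Xc @ Vt[0]                    # scores on the first PC direction
    b = (t1 @ yc) / (t1 @ t1)
    return y.mean() + b * t1

rng = np.random.default_rng(2)
n = 60
X = np.column_stack([rng.normal(scale=1.0, size=n),
                     rng.normal(scale=1.2, size=n)])
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.3, size=n)

fit_raw = pcr1_fitted(X, y)
X_scaled = X.copy()
X_scaled[:, 0] *= 100.0                # change of units for one variable only
fit_scaled = pcr1_fitted(X_scaled, y)

# The one-component PCR fits differ: shrinkage is not scale-invariant.
print(np.max(np.abs(fit_raw - fit_scaled)))
```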

## How to detect (and treat) outliers

Two different situations:

• Calibration

• Prediction

### Calibration

Calibrate for individuals which might occur in the population.

### Prediction phase

(With PCR or PLS:) check new observations against the scores t1 and t2 from the calibration.

## How to reduce prediction errors?

• Larger samples

• Wider samples

• More / other variables

• Better model (transform variables, include interactions, include nonlinearities).

• Better predictor.

## Aspects of double regression

Double regression fits well with PCR/PCA.

Proposed procedure:

1.  Use PCA to find the principal components describing the variation in (X, Z) jointly, from the total data set on (X, Z). Say the result is t = t (X, Z).

2.  Use OLS regression of y on t to construct a predictor based on t = t (X, Z).

3.  Use cross-validation to choose the number of PC’s in (X, Z) that best predicts y.

4. When only x is available, predict y via PC’s, see next page.

Both X and Z are selected to be able to describe y, so there will probably be a large extent of collinearity in (X, Z). Hence PCR and similar methods are probably motivated.

## PLS/PCR and missing data

(Application to double regression: predict y0 when only x is available.) Missing data are more easily handled by PLS & PCR than by methods like ridge regression.

## Sensometrics example

(Data from Brockhoff et al. 1993)

Concerns: Smell of apples after storage under n = 48 different conditions.

Y= preference of smell on a 0-5 point scale, averaged over a trained sensory panel of 10 assessors.

X=(x1,……,x15) = intensities of p = 15 GC peaks, corresponding to 15 volatiles.

Questions:

Can y be predicted from x?

Can y be understood from x?

(Figures: X data (GC peak areas), n = 48 samples, p = 15 variables; y-on-x plots for some x-variables; MSEP; regression coefficients when Ordinary Least Squares regression is used. Star: PLS, one LV. Box: PCR, one PC. Regression of y = "preference" on GC.)

## Spectroscopy example: determination of nitrate (NO3−) in municipal waste water

Karlsson, Karlberg & Olsson (KTH & SU), Anal. Chim. Acta, 1995. 125 specimens, spectra at 316 wavelengths (UV–visible), and nitrate by a reference method.

Data centred for each wavelength. Minimum-norm LS shows that we gain a little by restricting the data to the first 100 wavelengths (the remaining ones appear not to contain any information).

PLS, centered data. _____ All wavelengths

--------100 first wavelengths

See how similar the curves are !

(Regression coefficients as a function of wavelength.) PCR(20), PLS(20) and LSRR(0.003), chosen to have their highest peaks of about the same amplitude.

PCR: dotted line, PLS: dashed line, LSRR: solid line

Regressions of nitrate on absorbances at the first 100 wavelengths. CV leave-one-out MSEP values for PLS and PCR, plotted against the number of factors. PCR: dashed line, PLS: solid line.

Calibration set and test set after a random split ⇒ "leave one out" and the test set yield about the same MSEP values.

Calibration set and test set separated in time ⇒ "leave one out" is too optimistic in its MSEP.

The same situation as on the previous page, but with PCR instead of PLS.

## Some references

Brown, P. J. (1993). Measurement, Regression and Calibration. Oxford University Press.

Martens, H. & Næs, T. (1989). Multivariate Calibration. Wiley.

Sundberg, R. (1999). Multivariate calibration – direct and indirect regression methodology. Scand. J. Statist., Vol. 26, pp. 161–207 (with discussion). (Review paper; contains the spectroscopy example.)

Sundberg, R. (2000). Aspects of statistical regression in sensometrics. Food Qual. & Preference, Vol. 11, pp. 17–26. (Contains the sensometrics example.)