## Statistics versus Efficacy

• When there is no need for statistics, there is no need for the "help" of a statistician
• When there is some need for (elementary) statistics, the "help" of a statistician can be avoided (fortunately)
• When there is some need for (less elementary) statistics, the "help" of a statistician can be avoided (only for a while, unfortunately)
• When the problem is still complicated, the "help" becomes necessary.
• When the "help" has been requested, the problem becomes intractable, but it is too late to backtrack: the fox is minding the geese.

## What are Statistics ?

• a theoretically well-established methodology to model and analyze problems concerned with uncertainty (mathematics)
• an empirical and know-how approach concerned with extraction of valuable information contained in data (engineering)

===> the great difficulty is to manage these two aspects jointly.

## What to do ?

• try to find the midpoint between efficacy and simplicity

## Are there problems with regression ?

• No problem at all as far as theory is concerned: it has been settled for years.
• Some problems when you try to use regression in a specific context.
• Many problems when you try to define rules of application and to achieve some transparency.

## What are the remaining problems ?

• Model (models)
• Criterion (criteria)
• Sampling
• Precision
• Outliers
• Transparency

## Introduction: Concentration of Lactose versus Number of Leucocytes

## Different types of models

• Linear models (the most used: Classical Regression).
• "Extended Linear Models" (PCR, LRR,  PLS).
• Generalized Linear Models (Used by NL to repredict sex).
• Non Linear Models (at least Neural Nets).
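As a hedged illustration of the "extended linear model" idea, the sketch below shrinks the least-squares slope with a ridge-type penalty on a single centred predictor. The data and the penalty value are invented for the example, not taken from the slides.

```python
# Sketch: ridge-type shrinkage for a single centred predictor.
# With penalty lam, the slope is Sxy / (Sxx + lam): lam = 0 gives
# ordinary least squares, lam > 0 shrinks the slope toward zero.

def ridge_slope(x, y, lam):
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / (sxx + lam)

x = [1.0, 2.0, 3.0, 4.0]          # invented data
y = [1.1, 2.0, 2.9, 4.2]
b_ols = ridge_slope(x, y, 0.0)    # ordinary least squares slope
b_ridge = ridge_slope(x, y, 5.0)  # shrunken slope
```

The point of the sketch is only the shrinkage mechanism; PCR and PLS replace the penalty with a projection of the predictors, but serve the same purpose of stabilising an ill-conditioned linear fit.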

## Individual influence

• Robustness of the estimators

===> Sensitivity to specific data

• outliers
• leverage points
• subpopulations (?)
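A minimal sketch of how this sensitivity can be measured, assuming simple regression with invented data: the hat-matrix diagonal flags leverage points, and the residuals flag outliers in Y.

```python
# Sketch: leverage and residual diagnostics for simple regression.
# h_i = 1/n + (x_i - xbar)^2 / sum((x_j - xbar)^2) is the hat-matrix
# diagonal; a point far from xbar gets a large h_i (a leverage point).

def fit_ols(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    return a, b

def leverages(x):
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

x = [1, 2, 3, 4, 10]              # x = 10 is isolated: a leverage point
y = [1.1, 1.9, 3.2, 3.9, 10.2]
a, b = fit_ols(x, y)
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
h = leverages(x)
print(max(h) == h[-1])  # → True: the isolated x-value has the largest leverage
```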

## Outliers with respect to Y

• High residual

## Outliers with respect to the x_i

• Leverage points

## Leverage points with low residuals

## Individual influence diagnostics

### Diagnostics based on the residuals

## Examination of residuals

## Prediction influence diagnostics

## Comparison of individual influence diagnostics

## Robust regression

• Linked with the identification of outliers

Either individual diagnostics combined with a classical (non-robust) adjustment,

or a robust adjustment followed by detection of the extremes.

• Evaluation of robustness: breakdown point, minimal contamination level

## Alternatives

• Least absolute deviations method.
• M-regression.
• Least median squares method.
• Least trimmed squares method.
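As one concrete instance of these alternatives, M-regression with the Huber loss can be sketched by iteratively reweighted least squares (IRLS). The tuning constant, the data and the iteration count below are illustrative choices, not taken from the slides.

```python
# Sketch: Huber M-regression by iteratively reweighted least squares.
# Points with large residuals (relative to a MAD scale estimate) get
# weight c*s/|r| < 1 instead of 1, so an outlier in Y loses influence.

def wls_fit(x, y, w):
    # weighted least squares for y = a + b*x
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y)) \
        / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return ybar - b * xbar, b

def huber_fit(x, y, c=1.345, iters=30):
    w = [1.0] * len(x)
    for _ in range(iters):
        a, b = wls_fit(x, y, w)
        r = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # scale estimate: median absolute deviation (MAD), rescaled
        s = sorted(abs(ri) for ri in r)[len(r) // 2] / 0.6745 or 1.0
        w = [1.0 if abs(ri) <= c * s else c * s / abs(ri) for ri in r]
    return a, b

x = list(range(11))
y = [float(i) for i in range(10)] + [50.0]   # last point: outlier in Y
a, b = huber_fit(x, y)                       # slope stays near 1
```

Ordinary least squares on the same data gives a slope near 2.8; the Huber weights pull the fit back toward the bulk of the data, which is exactly the behaviour the breakdown-point criterion above tries to quantify.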

## Least absolute deviations

• Estimation of β by minimisation of Σ_i |y_i − x_i'β|
• Robustness to outliers in Y
• Weak robustness to leverage points
• Efficiency: 64% (relative to least squares under normal errors)
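A toy sketch of the least absolute deviations fit: the grid search below stands in for the usual linear-programming solution, and the data are invented. Four points lie exactly on y = x and one is a gross outlier in Y; the LAD slope stays at 1.

```python
# Sketch: least absolute deviations (LAD) by crude grid search.
# The objective is sum |y_i - (a + b*x_i)|; a real implementation
# would solve this as a linear program.

def sad(a, b, x, y):
    return sum(abs(yi - (a + b * xi)) for xi, yi in zip(x, y))

def lad_fit(x, y, steps=200):
    best = None
    for i in range(steps):
        for j in range(steps):
            a = -5 + 10 * i / steps
            b = -5 + 10 * j / steps
            s = sad(a, b, x, y)
            if best is None or s < best[0]:
                best = (s, a, b)
    return best[1], best[2]

x = [1, 2, 3, 4, 5]
y = [1.0, 2.0, 3.0, 4.0, 25.0]   # last point is an outlier in Y
a, b = lad_fit(x, y)
print(round(b, 1))  # → 1.0: the LAD slope ignores the outlier
```

A least-squares fit on the same data would be dragged far above slope 1, which is the robustness-to-Y-outliers property claimed above.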

### Example

## M regression

• Estimation of β by minimisation of Σ_i ρ(y_i − x_i'β), for a suitable loss function ρ
• Robustness to outliers in Y
• Weak robustness to leverage points

## GM regression

## Least median squares method

### Example

## Least trimmed squares method

## Sampling

### At least four aspects in sampling

• representativity (sample survey)
• sample size
• sub-populations
• selection on predictors (precision improvement)

### Selection on predictors

• selection has usually been made directly on the predictors

*    note that the variables on which the selection operates MUST be included in the model (BE)

---> In this case no problem.

• selection has been made on combinations of predictors

*    recall that selection cannot (and should not) be done on the responses.

### Precision increasing

The variance of the estimator of the slope of the regression is

Var(b̂) = σ² / Σ_i (x_i − x̄)²

When x is randomly drawn from a population, the expectation of Σ_i (x_i − x̄)² is (n − 1) Var(X), so the variance of the slope estimator is approximately σ² / ((n − 1) Var(X)). It can therefore be useful to increase the variance of X by selecting at the extremes.
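A numeric sketch of this formula, with invented design points: spreading the x values toward the extremes enlarges Σ (x_i − x̄)² and therefore lowers the variance of the slope estimator.

```python
# Sketch: Var(b_hat) = sigma^2 / sum((x_i - xbar)^2) for the slope of a
# simple regression. Two invented designs with the same sample size:
# one with x chosen at the extremes, one concentrated in the middle.

def slope_variance(x, sigma2=1.0):
    xbar = sum(x) / len(x)
    return sigma2 / sum((xi - xbar) ** 2 for xi in x)

x_spread = [-4, -4, -1, 1, 4, 4]   # x selected at the extremes
x_middle = [-1, -1, 0, 0, 1, 1]    # x concentrated in the middle
print(slope_variance(x_spread) < slope_variance(x_middle))  # → True
```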

### What has been done ?

• random selection
• over selection of the extremes (40 /20 /40) on a predictor (?)
• over selection of the extremes on a linear combination of the predictors (the predicted lean meat %).
• over selection of the extremes on some predictors.
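The effect of over-selecting the extremes can be checked by simulation. The sketch below compares, by Monte Carlo, the slope-estimator variance under random sampling and under a 40/20/40-style thinning of the middle of the distribution; the population, sample size, cut point and thinning rate are illustrative choices, not those of the study.

```python
# Sketch: Monte Carlo comparison of slope-estimator variance under
# random sampling versus over-selection of the extremes of X.
import random

random.seed(0)

def draw_sample(n, scheme):
    xs = []
    while len(xs) < n:
        x = random.gauss(0, 1)
        # "extremes": keep middle draws (|x| <= 1) only 1 time in 4
        if scheme == "random" or abs(x) > 1 or random.random() < 0.25:
            xs.append(x)
    return xs

def slope(x, y):
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
           sum((xi - xbar) ** 2 for xi in x)

def mc_variance(scheme, reps=500, n=30):
    est = []
    for _ in range(reps):
        x = draw_sample(n, scheme)
        y = [2 * xi + random.gauss(0, 1) for xi in x]   # true slope = 2
        est.append(slope(x, y))
    m = sum(est) / reps
    return sum((e - m) ** 2 for e in est) / (reps - 1)

v_random = mc_variance("random")
v_extreme = mc_variance("extremes")
print(v_extreme < v_random)  # → True: extreme selection improves precision
```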

### Univariate selection procedures

### Precision increasing

| Sampling scheme | Variance |
| --- | --- |
| random sampling | 0,05075342 |
| 40/20/40: ]-4,-1[ / [-1,1] / ]1,4[ | 0,03537267 |
| 40/20/40: ]-2,-1[ / [-1,1] / ]1,2[ | 0,03863843 |

### Precision increasing

Correlation of the estimators:

| Sampling scheme | Variance (1) | Variance (2) | Correlation |
| --- | --- | --- | --- |
| random sampling | 0,05011554 | 0,05131047 | -0,01838 |
| selection on the predictors | 0,03452182 | 0,03477434 | 0,01189 |
| selection on a linear combination of the predictors | 0,04430179 | 0,04344855 | -0,3664 |