
OPUS QUANT2 (PLS) for advanced users

Dr. Andreas Niemöller


Bruker Optik GmbH

Innovation with Integrity


Successful calibration setup starts long before using PLS

• Reliable wet chemistry as reference (component values)
• Right sample presentation and correct NIR measurement (spectra)
• Comparable amount of sample analyzed by wet chemistry and NIR
• Even the many different and powerful chemometric tools and algorithms, like PLS, cannot derive good results from a bad data set

2
Classical univariate calibration model

• linear regression
• extrapolation allowed
• sensitivity directly definable
• evaluation of a single measured value

[Figure: linear calibration line; x-axis: Concentration / % (0 to 0.45), y-axis: Measured Value (0 to 1.2)]

3
Multivariate calibration model

• multivariate regression after factorization
• extrapolation NOT allowed
• sensitivity NOT definable
• evaluation of the spectrum

PLS = Partial Least Squares

[Figure: multivariate calibration; x-axis: Concentration / % (0 to 0.45)]

4

Principles of Factor Analysis

5
Principles and Properties of Factor Analysis

• Variance analysis: ‘looking for changes in the data set’
• Common statistical method for data analysis
• Different names used in chemometrics:
  - Factor analysis
  - Principal Component Analysis (PCA)
• Orthogonal transformation of the data
• Enormous data compression: the data set is represented by a few latent variables

6
Factor Analysis of Spectra

Factor analysis breaks apart the spectral data into the most
common spectral variations (factors, loadings, principal
components) and the corresponding scaling coefficients (scores).

Spectral data matrix (n × p) = Scores (n × d) × Loadings (d × p)

Data matrix: n spectra with p data points
Scores: d score values for each spectrum (d < n)
Factors: d factors with p data points each (d < n)
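The decomposition into scores and loadings can be sketched in a few lines. The following is a minimal illustration using the NIPALS iteration in plain Python; the function names, the tiny data matrix and the choice not to mean-center the data are assumptions made for this sketch, not a description of what OPUS does internally.

```python
import math

def nipals_factors(X, d, max_iter=500, tol=1e-14):
    """Split X (n spectra x p points) into scores (n x d) and loadings (d x p)."""
    E = [row[:] for row in X]          # residual matrix, deflated factor by factor
    n, p = len(E), len(E[0])
    T, P = [], []
    for _ in range(d):
        # start from the data-point column with the largest sum of squares
        j = max(range(p), key=lambda c: sum(E[i][c] ** 2 for i in range(n)))
        t = [E[i][j] for i in range(n)]
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            load = [sum(E[i][c] * t[i] for i in range(n)) / tt for c in range(p)]
            norm = math.sqrt(sum(v * v for v in load))
            load = [v / norm for v in load]            # loading scaled to unit length
            t_new = [sum(E[i][c] * load[c] for c in range(p)) for i in range(n)]
            change = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if change <= tol * sum(v * v for v in t):
                break
        T.append(t)
        P.append(load)
        # remove the variance described by this factor before extracting the next
        E = [[E[i][c] - t[i] * load[c] for c in range(p)] for i in range(n)]
    scores = [[T[k][i] for k in range(len(T))] for i in range(n)]
    return scores, P

def reconstruct(scores, loadings):
    """Inverse factor analysis: spectrum = sum over factors of score x loading."""
    p = len(loadings[0])
    return [[sum(s * loadings[k][c] for k, s in enumerate(row)) for c in range(p)]
            for row in scores]

# Hypothetical data: 3 "spectra" with 3 data points that contain only 2 factors
X = [[2.0, 1.0, 0.0],
     [4.0, 2.0, 0.0],
     [0.0, 0.0, 3.0]]
scores, loadings = nipals_factors(X, d=2)
X_hat = reconstruct(scores, loadings)
```

Here two score values per spectrum are enough to rebuild all three data points, which is the data compression described above on a very small scale.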

7
Factor Analysis of the Spectral Variance (without Property Values)

Scores of the 5 spectra (the loadings of factors 1 to 5 are shown as plots):

Spectrum   Factor 1   Factor 2   Factor 3    Factor 4    Factor 5
1          5.216      -0.216     1.73E-02    -1.52E-02   3.17E-02
2          5.95       -0.103     4.97E-04    4.33E-02    5.65E-03
3          7.731      -0.699     3.67E-04    -1.15E-02   -2.04E-02
4          5.768      0.693      2.97E-02    -3.76E-03   -1.27E-02
5          7.13       0.441      -3.75E-02   -9.54E-03   4.46E-03

8
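Inverse Factor Analysis: Reconstruction of a Spectrum using all Factors

Spectrum = sum of (score of the spectrum × loading) over all factors:

Spectrum = 7.731 × Loading 1 + (-0.699) × Loading 2 + 3.67E-04 × Loading 3 + (-1.15E-02) × Loading 4 + (-2.04E-02) × Loading 5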

9
Reconstruction using two Factors: >99% of Information Content retained

Spectrum ≈ 7.731 × Loading 1 + (-0.699) × Loading 2

In the software the spectra are not reconstructed. The spectra are
represented just by the few score values (data compression), which are
used in the modeling calculations.

Spectral residuals: difference between the original spectra and the spectra
reconstructed using n loadings

10
Moving from factor analysis to PLS

• PLS is a factor analysis (variance analysis) that takes component or property values (e.g. concentrations) into account
• For each component or property a set of PLS factors is calculated
• The factors are calculated based only on the spectral variance correlated with the given component or property values
• PLS can be seen as a variance analysis including a kind of regression step
• PLS is very effective in making use of correlated information and discriminating against non-useful information
• Even overlapping bands and structures in the spectra can be separated
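To make the "variance analysis including a kind of regression step" concrete, here is a minimal sketch of PLS for a single property (PLS1) using the textbook NIPALS scheme in plain Python. It is an illustration under stated assumptions: the function names and the small data set are hypothetical, and the sketch does not claim to reproduce the QUANT2 implementation.

```python
import math

def pls1_fit(X, y, n_factors):
    """PLS1 by the NIPALS scheme: factors follow the covariance of X with y."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[c] for row in X) / n for c in range(p)]
    y_mean = sum(y) / n
    E = [[row[c] - x_mean[c] for c in range(p)] for row in X]   # centered spectra
    f = [v - y_mean for v in y]                                  # centered property values
    factors = []
    for _ in range(n_factors):
        w = [sum(E[i][c] * f[i] for i in range(n)) for c in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:           # nothing left in X that correlates with y
            break
        w = [v / norm for v in w]                                      # weight vector
        t = [sum(E[i][c] * w[c] for c in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][c] * t[i] for i in range(n)) / tt for c in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt                    # regression step
        E = [[E[i][c] - t[i] * load[c] for c in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        factors.append((w, load, q))
    return x_mean, y_mean, factors

def pls1_predict(model, spectrum):
    """Predict the property value of one spectrum factor by factor."""
    x_mean, y_mean, factors = model
    e = [v - m for v, m in zip(spectrum, x_mean)]
    y_hat = y_mean
    for w, load, q in factors:
        t = sum(a * b for a, b in zip(e, w))       # score of the new spectrum
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, load)]   # deflate before the next factor
    return y_hat

# Hypothetical data: 5 "spectra" with 3 data points, y = x1 + 2*x2 - x3
X = [[1.0, 2.0, 3.0], [2.0, 1.0, 0.0], [0.0, 1.0, 4.0], [3.0, 3.0, 1.0], [1.0, 0.0, 2.0]]
y = [2.0, 4.0, -2.0, 8.0, -1.0]
model = pls1_fit(X, y, n_factors=3)
fitted = [pls1_predict(model, row) for row in X]
new_value = pls1_predict(model, [2.0, 2.0, 2.0])
```

The weight vector `w` is built from the covariance between the spectra and the property values, which is how the property values enter the factor calculation; the coefficient `q` is the regression step on each factor.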

11
PLS Loadings for components A and B

[Figure: spectra of components A and B; PLS loadings 1 to 3 for Comp. A and PLS loadings 1 to 3 for Comp. B]

12
Analysis of Spectra using PCA or PLS models based on Scores and Loadings

Measured spectrum ≈ 7.731 × Loading 1 + (-0.699) × Loading 2
(scores of the spectrum × loadings in the model)

For the measured spectrum the scores are calculated according to the
factors (loadings) stored in the model.
The scores are used for the final evaluation in the PCA model
(identification) or PLS model (quantification).
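As a rough illustration of this step, the sketch below projects a measured spectrum onto loadings stored in a model and also computes a spectral residual. It assumes orthonormal PCA-type loadings, for which a score is simply the dot product of spectrum and loading; PLS scores are usually obtained with weight vectors and deflation instead. All names and numbers are hypothetical.

```python
def scores_for(spectrum, loadings):
    """Project a measured spectrum onto orthonormal loadings stored in a model."""
    return [sum(x * l for x, l in zip(spectrum, load)) for load in loadings]

def spectral_residual(spectrum, loadings):
    """Length of the difference between a spectrum and its reconstruction."""
    t = scores_for(spectrum, loadings)
    recon = [sum(t[k] * loadings[k][c] for k in range(len(loadings)))
             for c in range(len(spectrum))]
    return sum((x - r) ** 2 for x, r in zip(spectrum, recon)) ** 0.5

# Hypothetical model with two orthonormal loadings over four data points
loadings = [[0.5, 0.5, 0.5, 0.5],
            [0.5, -0.5, 0.5, -0.5]]
measured = [4.0, 2.0, 4.0, 2.0]   # fully described by the two loadings
unusual = [4.0, 2.0, 4.0, 3.0]    # contains variation the model has not seen

measured_scores = scores_for(measured, loadings)
residual_measured = spectral_residual(measured, loadings)
residual_unusual = spectral_residual(unusual, loadings)
```

A spectrum that the stored loadings describe well has a residual close to zero, while a spectrum with unmodeled variation leaves a larger residual.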

13
Experiment to show Capabilities of PLS

Reflectance spectra from glucose with admixtures of 1.0-1.9% talc

[Figure: spectra; x-axis: Wavenumber / cm-1 (10000 to 5000), y-axis: Absorbance (0.1 to 0.7)]

14
Spectra after Vector Normalization

[Figure: vector-normalized spectra; x-axis: Wavenumber / cm-1 (10000 to 5000), y-axis: Absorbance]

15
Not optimized PLS Model with a broad Spectral Range used

R2 = 96.63
RMSECV = 0.05%
Rank 2

[Figure: NIR prediction / % vs. Reference value / %, both axes 0.9 to 1.9]

Parameter: vector normalization, 10000-4500 cm-1


16
Regression Coefficients of not optimized PLS Model

Weighting of wavenumbers of the calibrated property.
PLS contains an automatic ‘search’ for relevant wavenumbers.

[Figure: regression coefficients; x-axis: Wavenumber / cm-1 (10000 to 5000)]

Parameter: vector normalization, 10000-4500 cm-1


17
PLS Factors (loadings) of the not optimized PLS Model

Both factors contain parts of the spectral variation caused by the talc content.

[Figure: Factor 1 and Factor 2; x-axis: Wavenumber / cm-1 (10000 to 5000)]

Parameter: vector normalization, 10000-4500 cm-1


18
Optimized PLS Model

R2 = 99.68
RMSECV = 0.02%
Rank 2

[Figure: NIR prediction / % vs. Reference value / %, both axes 0.9 to 1.9]

Parameter: 1st derivative, 7500-6100 cm-1


19
Spectra after 1st Derivative

[Figure: 1st derivative spectra; x-axis: Wavenumber / cm-1 (10000 to 5000)]

20
Regression Coefficients for the optimized PLS Model

In the optimized model only the talc peak is considered.

[Figure: regression coefficients; x-axis: Wavenumber / cm-1 (7400 to 6200)]

Parameter: 1st derivative, 7500-6100 cm-1


21
PLS factors (loadings) for the optimized PLS Model

Only the first factor contains useful information for the prediction of the talc content.

[Figure: Factor 1 and Factor 2; x-axis: Wavenumber / cm-1 (7400 to 6200)]

Parameter: 1st derivative, 7500-6100 cm-1


22

Basics of Calibration Setup and Modeling

23
Data sets for model setup and method validation

• Calibration Set only: setup with cross validation for small data sets (feasibility)
• Calibration Set + Val Set: setup with cross validation and check with validation set
• Calibration Set + Test Set: setup with test set validation
• Calibration Set + Test Set + Val Set: setup with test set and validation set

Method setup ‘today’

Val Set = dataset of independent samples


24
Methods must be validated over time (model maintenance)!

[Diagram: the four data set layouts (Calibration; Calibration + Val Set; Calibration + Test Set; Calibration + Test Set + Val Set) at method setup ‘today’, followed along the time axis by further Val Sets for method validation ‘in the future’]

Val Set = dataset of independent samples


25
Methods must be validated over time (model maintenance)!

Validation with independent samples is the ONLY way to
• check the accuracy, reproducibility and robustness of PLS methods,
• select methods for routine use.

[Diagram: successive Val Sets along the time axis, method validation ‘in the future’]

Val Set = dataset of independent samples


26
Updating of methods and data sets with new samples (new batches, new recipes)

[Diagram: Calibration Set and Test Set at method setup, followed by successive Val Sets during method validation over time; label: robustness of model]

27
Principles of method development

1. Well measured spectra and reference values
2. Checking of spectra and data sets for outliers, unusual effects and samples
3. Setup of first methods for control and selection of spectral ranges for optimization
4. Optimization
5. Selection and review of models from the optimization list
6. Validation of models with independent samples, if possible considering the timeline (more new or newer samples)

Repeat point 6 frequently in routine use!

28
Principles of method development

When calibration samples are selected, care should be taken to ensure
that all major factors affecting the accuracy of calibration are covered
within the limits of the defined application area. These factors include
the following:
1. recipes: combinations and composition ranges of major and minor sample components (analytes and non-analytes)
2. seasonal, geographic and genetic effects on sample material or raw materials
3. processing techniques and conditions
4. storage and storage conditions
5. sample and instrument temperatures and their changes
6. instrument variations

Such requirements for calibration development are given e.g. in ISO 12099 (Feed) and ISO 21543 | IDF 201 (Dairy).

29
Distribution of samples

[Figure: Prediction vs. Reference value; most samples lie in the typical concentration range, a distant sample is marked as „rare sample“ or outlier]

The concentration range of the calibration should extend beyond the expected analysis range if possible.

30
General parameters influencing the modeling and the model accuracy

• Quality of instruments, e.g. resolution, stability, signal/noise ratio, precision, robustness
• Parameters for measurement
• Sample preparation and sample presentation
• Accuracy of reference method (e.g. wet chemistry)
  • In many cases the accuracy of an IR or NIR method depends only on the accuracy of the reference method.
  • On average NIR can be more accurate due to better reproducibility.

31
Selection of calibration and test samples

• Calibration and test set samples should be well distributed over the entire property range
• As many samples as possible should be used for the test set, but important samples must be in the calibration set. Big data sets are split with 50% of the samples in the calibration set and 50% in the test set.
• Required number of samples
  • feasibility study: ~ 20 samples minimum
  • typical applications: ~ 50-100 samples
  • complex application: > 150 samples

32
Selection of spectral ranges for calibration

• Avoid spectral noise, e.g. at the left and right border of the spectra where the detector has low sensitivity or a cut-off
• Avoid spectral ranges with total absorbance (absorbance > 2.0 AU)
• A quantitative evaluation is only possible up to 2 AU, but starting from the baseline.

[Figure: example spectrum; x-axis: Wavenumber (10000 to 2000), y-axis: 0 to 4]

33
Troubleshooting in case of poor prediction

• Selection of suitable spectral ranges?
• Were ranges with spectral noise included in the calibration?
• Were ranges with total absorption included in the calibration?
• Selection of correct experiment for measurements?
• Selection of a robust Quant2 method?
• Selection of suitable data preprocessing?
• Were the property values of the calibration samples well distributed over the selected range?

34
Troubleshooting in case of outliers

• Was the sample not homogenized properly?
• No temperature control with critical liquid samples?
• Probe not properly immersed?
• Was there an air bubble in the optical gap?
• Selection of the wrong method or measuring experiment?
• Measurements through vials: identical vials for calibration and measurement?
• Comparable measuring conditions (e.g. angle of attack of the probe, ...)?

35
Troubleshooting in case of outliers

Problem: Higher calibration errors due to bad reference values

Solution:
• Review of the reference analysis method (2nd reference technique, old chemicals, operator?)
• Review of accuracy, error limits and reproducibility of the reference analysis
• Repetition and/or multiple determination of the reference values for some samples

36
NIR-Spectra of Water at various temperatures

[Figure: x-axis: Wavenumber cm-1 (9000 to 6000), y-axis: Absorbance Units (0.0 to 1.4)]
37
1st derivative NIR-Spectra of Water at various temperatures

[Figure: 1st derivative spectra; x-axis: 7600 to 6200]


38
Shifting of Water band due to increasing temperature

[Figure: spectra from 0 °C to 100 °C in steps of 5 °C; x-axis: Wavenumber cm-1 (7100 to 6700), y-axis: Absorbance Units (1.25 to 1.45)]

39
Difference Spectra of Water band with increasing temperature

[Figure: difference spectra from 0 °C to 100 °C in steps of 5 °C; x-axis: Wavenumber cm-1 (7500 to 6000), y-axis: Absorbance Units (-0.4 to 0.4)]
40
Toluene NIR-Spectra at various temperatures

[Figure: x-axis: Wavenumber cm-1 (9000 to 5000), y-axis: Absorbance Units (-0.01 to 0.06)]
41
Toluene NIR-Spectra at various temperatures

[Figure: detail of spectra from 25 °C to 95 °C in steps of 5 °C; x-axis: Wavenumber cm-1 (5958 to 5948), y-axis: Absorbance Units (1.135 to 1.165)]

42
Error values for characterizing calibration performance and validation

• Multivariate PLS models can’t be checked by the regression coefficient r2 and the slope of a regression line.
• The standard deviation of NIR predictions from the true values (reference) is calculated as Root Mean Square Error of ...
• Depending on the data set used for prediction, different errors are defined:
  Calibration: RMSEE or RMSEC
  Cross validation: RMSECV
  Test set validation: RMSEP
• Another parameter is the R2 value, which should be as close as possible to 100%

43
Error values for characterizing calibration performance and validation

Root Mean Square Error of ...
• Estimation or Calibration (RMSEE or RMSEC): for the predictions of all samples using the calibration model based on all samples
• Cross Validation (RMSECV): for the predictions during the cross validation, i.e. samples are temporarily independent
• Prediction (RMSEP, test set validation): for the prediction of independent samples
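The three error values share the same root mean square form and differ only in which predictions are compared with the reference values. A minimal sketch, with a hypothetical helper name and made-up numbers:

```python
import math

def rmse(predicted, reference):
    """Root mean square error between NIR predictions and reference values."""
    diffs = [p - r for p, r in zip(predicted, reference)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Hypothetical reference values and NIR predictions in %
reference = [1.0, 1.2, 1.5, 1.7, 1.9]
predicted = [1.1, 1.2, 1.4, 1.7, 2.0]
error = rmse(predicted, reference)
```

Fed with test set predictions this gives an RMSEP, with cross validation predictions an RMSECV. Some software divides by a reduced number of degrees of freedom for the calibration error; that correction is not shown here.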

44
What our customers expect…

Customer: “RMSEP = 0.5? Great! All my results within a range of +/- 0.5. Excellent accuracy!”

[Figure: Customer and YOU]

Normal (Gaussian) Distribution

We always find the same share of events within the following intervals:
• +/- 1 standard deviation (+/- 1s): 68.3%
• +/- 2 standard deviations (+/- 2s): 95.5%
• +/- 3 standard deviations (+/- 3s): 99.7%

RMSEP / RMSECV are identical to the standard deviation!
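The percentages follow directly from the normal distribution and can be checked with the error function. This small sketch assumes normally distributed prediction errors without bias; the function name is hypothetical.

```python
import math

def share_within(k):
    """Fraction of a normal distribution lying within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2.0))

shares = {k: 100.0 * share_within(k) for k in (1, 2, 3)}

# Consequence for an RMSEP of 0.5: interval half-width needed to cover about 95%
rmsep = 0.5
half_width_95 = 2 * rmsep
```

With an RMSEP of 0.5, only about 68% of the results are expected within +/- 0.5; about 95% fall within +/- 1.0.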


In practice:
Ruminant Feed – Fat: Test Set Validation

Interval                          +/- 1s    +/- 2s    +/- 3s
Reality (counting the results)    69%       94%       100%
Theory (normal distribution)      68.3%     95.5%     99.7%

R2 and its meaning: expresses the relation of error bar and value range

[Figure: three example plots with R2 = 66.4%, R2 = 81.4% and R2 = 98.9%]

R2 is the coefficient of determination and is not the same as r2, the regression coefficient.

R2: Calibration of Fat in Milk

n      RMSEP    R2       Range
500    0.06     99.51    5.03
480    0.06     99.45    3.35
400    0.05     99.22    2.17
300    0.05     98.41    1.36
200    0.05     96.51    0.82
100    0.05     88.79    0.46
80     0.05     81.71    0.35
60     0.04     73.87    0.26
50     0.04     70.24    0.21
40     0.04     65.70    0.16
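The table shows an almost constant RMSEP while R2 drops as the value range shrinks. The effect can be reproduced with a small sketch that uses a common definition of the coefficient of determination, R2 = 100 × (1 - SSE/SST); the data are made up, and the identical errors in both cases are an assumption chosen to isolate the influence of the range.

```python
def r2_percent(predicted, reference):
    """Coefficient of determination in percent: 100 * (1 - SSE / SST)."""
    mean_ref = sum(reference) / len(reference)
    sse = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    sst = sum((r - mean_ref) ** 2 for r in reference)
    return 100.0 * (1.0 - sse / sst)

errors = [0.1, -0.1, 0.1, -0.1, 0.0]    # identical prediction errors in both cases
wide = [1.0, 2.0, 3.0, 4.0, 5.0]        # wide reference value range
narrow = [2.8, 2.9, 3.0, 3.1, 3.2]      # narrow reference value range

r2_wide = r2_percent([r + e for r, e in zip(wide, errors)], wide)
r2_narrow = r2_percent([r + e for r, e in zip(narrow, errors)], narrow)
```

The prediction errors are the same in both cases, yet R2 is close to 100% for the wide range and only about 60% for the narrow one, so R2 should always be read together with the value range.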

Methods of Data Pretreatment

50
Principle of Vector Normalization

Normalization of spectra to vector norm value 1

[Figure: spectra of acetylsalicylic acid and salicylic acid; x-axis: Wavenumber / cm-1 (12000 to 4000), y-axis: Absorption (0.0 to 1.0)]
51
Principle of Vector Normalization

[Figure: spectra of acetylsalicylic acid and salicylic acid; x-axis: Wavenumber / cm-1 (12000 to 4000), y-axis: Absorption (0.0 to 0.5)]
52
Spectra of Glucose before Vector Normalization

[Figure: glucose spectra for the ranges 12500 - 3500 cm-1, 12000 - 3800 cm-1 and 9000 - 3800 cm-1; x-axis: Wavenumber / cm-1 (12000 to 4000), y-axis: Absorption (0.5 to 1.5)]
53
Glucose Spectra normalized in different spectral ranges

[Figure: glucose spectra normalized in the ranges 12500 - 3500 cm-1, 12000 - 3800 cm-1 and 9000 - 3800 cm-1; x-axis: Wavenumber / cm-1 (9000 to 4000), y-axis: Absorption (-0.02 to 0.06)]
54
Principle of Derivatives (Savitzky Golay)

[Figure: absorption band with 5 pt smoothing windows marked, and its 1st derivative]
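The idea of a Savitzky-Golay derivative can be sketched as follows: a polynomial is fitted by least squares in a moving window and its slope at the window center is taken as the derivative. The sketch assumes a quadratic fit and equally spaced data points, in which case the weights reduce to i / Σi²; the polynomial order used by the software is not stated in the slides, and the function name is hypothetical.

```python
def sg_first_derivative(y, window):
    """First derivative from a least-squares quadratic fit in a moving window.

    Values are returned only where the full window fits, so window // 2
    points are dropped at each end. The result is per data-point step.
    """
    if window % 2 == 0 or window < 3:
        raise ValueError("window must be an odd number >= 3")
    m = window // 2
    denom = sum(i * i for i in range(-m, m + 1))
    return [sum(i * y[k + i] for i in range(-m, m + 1)) / denom
            for k in range(m, len(y) - m)]

# Hypothetical test signal: a parabola, whose exact derivative is 2 * x
x = list(range(20))
parabola = [v * v for v in x]
d5 = sg_first_derivative(parabola, 5)     # 5 pt window
d13 = sg_first_derivative(parabola, 13)   # 13 pt window
```

A wider window averages over more points, which suppresses noise more strongly but also flattens narrow bands, so the result depends on the window size.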

55
Spectra of Glucose after 1st Derivative

[Figure: 1st derivative with 5 pt, 13 pt and 25 pt smoothing; x-axis: Wavenumber / cm-1 (12000 to 4000), y-axis: Absorption (-0.004 to 0.004)]
56
Spectra of Glucose after 1st Derivative (detail)

[Figure: 1st derivative with 5 pt, 13 pt and 25 pt smoothing; x-axis: Wavenumber / cm-1 (9000 to 8000), y-axis: Absorption (-0.0004 to 0.0002)]
57
Spectra of Glucose after 1st and 2nd Derivative

[Figure: 1st derivative 13 pt and 2nd derivative 13 pt; x-axis: Wavenumber / cm-1 (9000 to 4000), y-axis: Absorption (-0.15 to 0.10)]
58
Spectra of Glucose after 1st and 2nd Derivative (detail)

[Figure: 1st derivative 13 pt and 2nd derivative 13 pt; x-axis: Wavenumber / cm-1 (9000 to 8000), y-axis: Absorption (-0.20 to 0.15)]
59
Advantages and Disadvantages of Vector Normalization

• Advantages of vector normalization
  • Shape of spectra retained
  • Interpretation of spectra is easier
• Disadvantages of vector normalization
  • Result depends on the spectral range used

60
Advantages and Disadvantages of Derivatives

• Advantages of derivatives
  • Contrast enhanced, more details visible
  • Result does not depend on the spectral range used
• Disadvantages of derivatives
  • Noise enhanced, smoothing step needed
  • Result depends on the window size used

61
Recommended smoothing Point Settings for 1st Derivatives

• Resolution 8 cm-1
  • Quant2: 13 to 21 pt, mainly 17 pt
  • Ident: 9 to 17 pt, mainly 13 pt
• Resolution 16 cm-1
  • Quant2: 9 to 17 pt, mainly 13 pt
  • Ident: 9 to 17 pt, mainly 9 pt

62
Other Pre-processing Methods in Quant2

• No Spectral Data Preprocessing
  Only used in very rare applications where the offset shift carries the required information on physical effects, e.g. scattering effects caused by changing particle sizes
• Constant Offset Elimination and Straight Line Elimination
  Only applicable for spectra with a horizontal baseline, e.g.
  • NIR spectra of liquids
  • MIR spectra
  • Raman spectra

63
Other Pre-processing Methods in Quant2

• Min-Max Normalization
  Only useful if you have a more or less constant highest peak or if you are looking for peak ratios in the selected spectral range.
  Not really useful for NIR, quite risky in most cases
• Internal Standard
  Used only if an internal standard is used for scaling spectra

64
Other Pre-processing Methods in Quant2

• Second Derivative
  For elimination of offsets and skewed baselines.
  Common for dispersive systems to increase the contrast for low resolution spectra.
  Noise is highly increased, which is why using the 1st derivative plus vector normalization is better

65
Other Pre-processing Methods in Quant2

• MSC and 1st Derivative + MSC
  Common method to correct baseline effects due to wavelength-dependent scatter effects.
  Not advisable for small data sets and spectra with different effects, because an MSC model is derived from the calibration spectra and could partly fail for new independent samples.
  Vector normalization gives comparable results and is applied to each spectrum individually, which is more robust.
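The dependence of MSC on the calibration spectra becomes visible in a short sketch of the textbook procedure: every spectrum is regressed against a reference spectrum (here the mean of the calibration spectra) and corrected with the fitted offset and slope. Function names and data are hypothetical, and the sketch does not claim to match the Quant2 implementation.

```python
def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    p = len(spectra[0])
    if reference is None:
        reference = [sum(s[c] for s in spectra) / len(spectra) for c in range(p)]
    ref_mean = sum(reference) / p
    ref_ss = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / p
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(reference, s)) / ref_ss
        offset = s_mean - slope * ref_mean
        corrected.append([(v - offset) / slope for v in s])   # undo offset and scaling
    return corrected, reference

# Hypothetical calibration spectra: one band shape with different offsets and scalings
base = [1.0, 2.0, 4.0, 3.0]
calibration = [base,
               [2.0 * v + 1.0 for v in base],
               [0.5 * v - 0.2 for v in base]]
corrected, mean_spectrum = msc(calibration)

# A new spectrum has to be corrected with the stored calibration mean spectrum
new_corrected, _ = msc([[3.0 * v + 0.5 for v in base]], reference=mean_spectrum)
```

Because the reference spectrum comes from the calibration set, the correction of a new sample is only as good as that reference represents it; vector normalization needs no such stored reference.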

66

Method setup and spectra table

67
Load method with overview on spectra
and parameters

68
Component definition with units and
decimal point settings

69
Adding dummy components as category
variables

• Adding dummy components which are not calibrated can be useful for sorting and selection of samples
  Examples:
  – Sample type
  – Year and/or month
  – Origin of samples (country, supplier, facility, plant, vessel)
  – Special properties (additives, temperature)

70
Adding dummy components as category
variables

• Example: gasoline samples
  There are samples with and without ethanol added. Ethanol has a strong influence on the spectra. It could be helpful or important to develop models with and without the ethanol-containing samples.
  The categories YES or NO must be entered in the spectra table as values, e.g. 1 and 0. By sorting the column, the samples can easily be marked for color settings or for exclusion.

71
Spectra table for spectra and
component values

72
Check spectra before loading them!

The spectra are shown in the preview window. By toggling with the cursor keys you can easily check the quality of all spectra, which helps to avoid trouble later on.

73
Missing values are handled as a blank

Even with missing values you can copy/paste tables from e.g. Excel to the spectra table.

74
Set sample number

• The sample number indicates which spectra belong to the same sample. This is very important if repeated measurements of one sample are made (replicates or refills).
• During cross validation or test set validation, whole samples are always considered, i.e. all spectra of the same sample are validated at the same time. If one spectrum of a sample remained in the calibration set, the validation of the other spectra of the same sample would not be independent.
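The role of the sample number in validation can be sketched as a fold generator that always removes all spectra of one sample together. This is a generic illustration in plain Python with hypothetical names, not the Quant2 code.

```python
def leave_sample_out_folds(sample_numbers):
    """Yield (calibration_indices, validation_indices), one fold per sample number."""
    groups = {}
    for index, number in enumerate(sample_numbers):
        groups.setdefault(number, []).append(index)
    for held_out in groups.values():
        calibration = [i for i in range(len(sample_numbers)) if i not in held_out]
        yield calibration, held_out

# Hypothetical spectra table: samples 1 and 3 were measured three times each
sample_numbers = [1, 1, 1, 2, 3, 3, 3]
folds = list(leave_sample_out_folds(sample_numbers))
```

No replicate of a held-out sample stays in the calibration part of a fold, so each validation prediction is independent of the spectra it is compared with.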

75
New method based on mean spectra for
each sample (sample no.)

76
New method based on mean spectra for
each sample (sample no.)

• Spectra assigned to the same sample number are automatically averaged.
• The mean spectra are stored and a new corresponding QUANT2 method file is created automatically.
• The new method can be further developed and new samples can be added, even with repeated measurements.
• Samples with just one spectrum are retained as they are.

77
New method based on mean spectra for
each sample (sample no.)

Mean spectra

New method

78
New method based on mean spectra for
each sample (sample no.)

79
Component correlations

• For robust calibration models, only spectral information that correlates solely with the calibrated component should be used.
• In case of collinearity (e.g. by dilution), some information might be used which is not related to the component and could cause trouble in the future if the collinear relation changes.
• Example: Active Ingredient (Component A) and Excipients (Component B)

80
Component correlations

81
Calibration design

82
Calibration design

83
Calibration design

84
Dataset settings

• Dataset splitting into calibration and test set
• Data set assignment of selected (marked) spectra
• Color assignment of selected (marked) spectra
• Special options for excluding spectra with missing component values

85
Set data set

86
Spectra without reference values can be
set

For selected components, spectra can be excluded if the entry is blank or if the component value is 0 or -1.

87
Automatic selection of test samples on component values (Kennard-Stone)

• The selection is performed across all components with an optimum distribution of samples in all dimensions (4 components = 4-dimensional property space)
• Samples with the lowest and highest property values are in the calibration set, the next inner ones in the test set. All other samples are selected according to the chosen percentage of test samples
• The automatic selection is not available for data sets which are too small
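The core of the Kennard-Stone idea, picking samples that are as far as possible from those already selected, can be sketched as follows. This is the classic max-min variant in plain Python; it does not reproduce the additional rule that the samples with extreme values stay in the calibration set, it assumes that the component values are on comparable scales, and all names and values are hypothetical.

```python
import math
from itertools import combinations

def kennard_stone(points, n_select):
    """Classic Kennard-Stone: pick points that spread evenly over the property space."""
    def dist(a, b):
        return math.dist(points[a], points[b])

    # start with the two points that are farthest apart
    first, second = max(combinations(range(len(points)), 2), key=lambda ab: dist(*ab))
    selected = [first, second]
    remaining = [i for i in range(len(points)) if i not in selected]
    while len(selected) < n_select and remaining:
        # next point: the one whose nearest selected neighbour is farthest away
        best = max(remaining, key=lambda i: min(dist(i, s) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical component values of six samples (two components per sample)
values = [(0.0, 0.0), (1.0, 0.2), (2.0, 0.1), (3.0, 0.3), (10.0, 1.0), (5.0, 0.5)]
picked = kennard_stone(values, 3)
```

In this example the two most distant samples are selected first and the third pick is the sample in the middle between them.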

88
Automatic selection of test samples on component values (Kennard-Stone)

Samples with the lowest and highest property values are in the calibration set, the next inner ones in the test set.

89
Automatic selection of test samples on component values (Kennard-Stone)

The next test sample is chosen with the maximum distance from the already selected ones in all dimensions (properties). Here it is found in the middle.

[Figure label: Next test sample]

90
Automatic selection of test samples on component values (Kennard-Stone)

10 % Test samples

The next test sample is chosen with the maximum distance from the already selected ones in all dimensions (properties), until the required percentage of test samples is reached.

91
Automatic selection of test samples on
component values (Kennard-Stone)

20 % Test samples

92
Automatic selection of test samples on
component values (Kennard-Stone)

50 % Test samples

93
Automatic selection of test samples
(Kennard-Stone) in scores space (PCA)

94
Quant2 OPUS 7: exclude redundant samples

• Many methods have a lot of redundant samples, accumulated over time when many samples with the same properties are added, e.g. due to product specifications.
• Such samples do not contribute to the model because they
  • do not introduce new information
  • increase data set size and computation time
  • change the statistics in calibration and validation
• Reducing the data set size will help to achieve better models and reduce computation time (optimization).
• The function can be used to find redundant samples in advance which should not go to the reference lab. In this way costs for wet chemistry can be avoided.

Quant2 OPUS 7: exclude redundant samples

• The new algorithm looks for the k nearest neighbors (kNN) and kicks out redundant samples which are very close to a given sample.
• This is the opposite approach to the Kennard-Stone algorithm, which is used to find and select samples that cover the range of samples well.
• To work on big data sets you can now:
  • Reduce the data set with the kNN algorithm
  • Select the test set by Kennard-Stone (on values or PCA scores)
  • Optimize
  • Check models with new samples and Quant2 Filelist
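The details of the kNN-based reduction are not given in the slides. As a stand-in, the following sketch shows a simple distance-threshold thinning that removes samples lying very close to a sample that has already been kept; it illustrates the general idea only, and the function name, the threshold and the data are assumptions.

```python
import math

def thin_out(points, min_distance):
    """Keep a sample only if no already-kept sample lies closer than min_distance."""
    kept = []
    for index, point in enumerate(points):
        if all(math.dist(point, points[k]) >= min_distance for k in kept):
            kept.append(index)
    return kept

# Hypothetical PCA scores: a dense cluster of similar samples plus two distinct ones
scores = [(0.00, 0.00), (0.01, 0.00), (0.00, 0.02), (0.02, 0.01),
          (1.00, 0.50), (-0.80, 0.90)]
kept = thin_out(scores, min_distance=0.1)
```

Only one representative of the dense cluster remains, while the two distinct samples are kept.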
Quant2 OPUS 7: exclude redundant samples

View of the PCA scores plot of the IV method with 7330 spectra. About 6500 spectra from Indonesia (blue) are quite similar.

Quant2 OPUS 7: exclude redundant samples

Detail view of the PCA scores plot of the IV method with 7330 spectra. About 6500 spectra from Indonesia (blue) are quite similar.

Quant2 OPUS 7: exclude redundant samples

Detail view of the PCA scores plot of the IV method with the spectra selected for exclusion. Data set reduced from 7330 spectra to 1162.

Quant2 OPUS 7: exclude redundant samples

Total view of the PCA scores plot of the IV method with the spectra selected for exclusion. Data set reduced from 7330 spectra to 1162.

Quant2 OPUS 7: exclude redundant samples

Test set validation with 687 spectra and 7330 spectra in the calibration set.

RMSEP = 0.73

Quant2 OPUS 7: exclude redundant samples

Calibration with 7330 spectra.

RMSEE = 0.54

Quant2 OPUS 7: exclude redundant samples

Test set validation with 687 spectra and 1162 spectra in the calibration set.

RMSEP = 0.73

(Before, with 7330 spectra in the calibration set: RMSEP = 0.73)

Quant2 OPUS 7: exclude redundant samples

Calibration with 1162 spectra.

RMSEE = 0.81

(Before, with 7330 spectra in the calibration set: RMSEE = 0.54)

Quant2 OPUS 7: Set Color in PCA Score
Plot
Quant2 OPUS 7: Set Color in PCA Score
Plot

Zoom in.
Quant2 OPUS 7: Set Color in PCA Score
Plot

Set color.
Quant2 OPUS 7: Set Color in PCA Score
Plot

Done.
Set dataset

• Selected spectra can be assigned to the calibration or test set or can be excluded

110
Set color for plots on page Graph

• Colors can be assigned to selected spectra for display in plots
• Colors may indicate:
  – Samples of different type or origin
  – Time of measurement, e.g. year
  – Special samples
  – Samples with very low or high property values

111
Set color for plots on page Graph

112
Set color for plots on page Graph

113

Parameter and validation settings

114
Parameter page for data pretreatment
and spectral regions

115
Data pretreatment in any order and in
any spectral ranges

116
Data pretreatment in any order and in
any spectral ranges

CAUTION!
Everything possible,
but maybe not useful!

117
Data pretreatment in any order and in
any spectral ranges

118
Data pretreatment in spectral regions
selected for modeling

119
Interactive selection of spectral regions

120
Display preprocessed spectra

121
Display preprocessed spectra but only
every x th sample

122
Statistics for repeated measurements
(replicates) on preprocessed spectra

123
Statistics for repeated measurements
(replicates) on preprocessed spectra

124
Statistics for repeated measurements
(replicates) on preprocessed spectra

125
Model calculation including validation

126
Model calculation including validation

• For each activated component a separate PLS model is calculated based on the selected dataset(s)
• The maximum rank limits the complexity of the model (default is rank 10).
• Lower values for the maximum rank save calculation time. This is only useful if it is known that fewer factors are sufficient.
• More than 10 factors may be required for more complex applications, but the risk of unstable models increases.

127
Internal Validation

At present, two different types of validation are accepted:

1) cross-validation
2) test-set-validation

Important: Independent samples for internal validation

128
(Full) Cross Validation

Validation by successively excluding samples and putting them back.
During the cross validation all samples are temporarily independent from the calibration set.

[Diagram: Calibration Data Set and Test Sample]

129
(Full) Cross Validation

A temporary calibration model is calculated based on n-1 samples and used to predict the test sample.
The comparison of NIR prediction and reference value is part of the calculation of the Root Mean Square Error of Cross Validation (RMSECV).

[Diagram: Calibration Data Set and Test Sample]

130
(Full) Cross Validation

Calibration Data Set Test Sample

131
(Full) Cross Validation

This procedure is continued until all samples have been taken out, tested and put back into the calibration set.

132
(Full) Cross Validation

Advantages of Cross Validation:

• All samples are used for calibration and validation, which is helpful for small data sets

133
(Full) Cross Validation

Disadvantages of Cross Validation:

• The RMSECV is usually lower than the Root Mean Square Error for independent samples (RMSEP)
• Long calculation times during optimization

134
Test Set Validation

Definition of two different data sets (for example 50:50):
• Calibration Data Set: development of the model
• Test Set: validation of the model

Samples from the Test Set need to be independent from the Calibration Data Set

135
Test Set Validation

Problem: Only 50% of the samples are used for calibration setup.
• Calibration Data Set: development of the model
• Test Set: validation of the model

Good tool for data sets with a sufficient number of samples

136
Cross validation, (full) cross validation

• The number of leave-out samples for cross validation depends on the number of samples in the calibration set
  – Too many leave-out samples lead to bad results because the temporarily calculated models are unstable
  – Leaving out one sample is not a challenge and gives over-optimistic, low RMSECV errors
• Rule of thumb: number of samples divided by 30 (= 30 passes during cross validation)
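The rule of thumb and the cross validation loop can be sketched together. In this illustration a straight-line fit stands in for the PLS model so that the example stays short, and the samples are left out in consecutive blocks; both choices, the function names and the data are assumptions, not the Quant2 procedure.

```python
import math

def fit_line(xs, ys):
    """Least-squares straight line; stands in for a PLS model in this sketch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def rmsecv(xs, ys, leave_out):
    """Leave out `leave_out` samples per pass, refit, predict them, pool the errors."""
    n = len(xs)
    squared_errors = []
    for start in range(0, n, leave_out):
        held = set(range(start, min(start + leave_out, n)))
        cal_x = [xs[i] for i in range(n) if i not in held]
        cal_y = [ys[i] for i in range(n) if i not in held]
        slope, intercept = fit_line(cal_x, cal_y)
        squared_errors += [(slope * xs[i] + intercept - ys[i]) ** 2 for i in held]
    return math.sqrt(sum(squared_errors) / n)

n_samples = 90
leave_out = max(1, n_samples // 30)     # rule of thumb: about 30 passes
passes = math.ceil(n_samples / leave_out)

# Hypothetical data: a linear relation with an alternating error of +/- 0.05
xs = [0.1 * i for i in range(n_samples)]
ys = [2.0 * x + 1.0 + (0.05 if i % 2 else -0.05) for i, x in enumerate(xs)]
error = rmsecv(xs, ys, leave_out)
```

With 90 samples the rule of thumb gives 3 leave-out samples and 30 passes, and the pooled error comes out close to the 0.05 built into the data.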

137

Calibration results and statistics

138
NIR predictions vs. true values (reference) in the model validation

The green line is the ideal line on which the NIR prediction is equal to the true value (reference). It is not a regression line!

139
NIR predictions vs. true values
(reference) for the calibration

140
Statistics for the model validation

Residual Prediction Deviation (RPD)

RPD = SD/SECV
or
RPD = SD/SEP

SD = standard deviation of the true values (reference)

RPD > 3: acceptable model

141
Statistics for the model validation

RPD           Classification    Application
<1.0          very poor         not recommended
1.0 - 2.4     poor              not recommended
2.5 - 2.9     fair              rough screening
3.0 - 3.9     reasonable        screening
4.0 - 5.9     good              QC
6.0 - 7.9     very good         QA
8.0 - 10.0    excellent         any application
>10.0         superior          as good as reference

Residual Prediction Deviation: RPD = SD/SECV or RPD = SD/SEP
SD = standard deviation of the true values (reference)
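A short sketch of the RPD calculation and the class lookup. It assumes the sample standard deviation (n - 1 in the denominator) for SD and treats values that fall between two table rows, e.g. 2.45, as belonging to the lower class; both are assumptions, as are the names and the example values.

```python
import statistics

def rpd(reference_values, sep):
    """Residual prediction deviation: SD of the reference values divided by SEP or SECV."""
    return statistics.stdev(reference_values) / sep

def classify(rpd_value):
    """Map an RPD value to the classification used in the table."""
    table = [(1.0, "very poor"), (2.5, "poor"), (3.0, "fair"), (4.0, "reasonable"),
             (6.0, "good"), (8.0, "very good")]
    for upper, label in table:
        if rpd_value < upper:
            return label
    return "excellent" if rpd_value <= 10.0 else "superior"

# Hypothetical reference values and a hypothetical SEP of 1.0
reference_values = [31.0, 33.0, 35.0, 37.0, 39.0, 41.0, 43.0, 45.0]
value = rpd(reference_values, sep=1.0)
label = classify(value)
```

Like R2, the RPD relates the prediction error to the spread of the reference values, so the same SEP gives a lower RPD for a narrower value range.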

142
Statistics for the model validation

143
Regression line, ideal case

Regression
line (blue)

144
Regression line, non ideal case

Regression
line (blue)

145
Statistics for the model validation

146
Differences vs. true values (reference)

The distribution of the deviations, and especially the range between minimum and maximum deviation, helps to check model performance.

147
Error vs. rank

Each factor contributes helpful information for lowering the error. After reaching a minimum the error increases again (overfitting).

148
Mahalanobis distance (MD) and spectral residuals

Only spectra in the upper right corner are potential outliers, but not spectra of samples with very low or high property values.

149
Quant2 OPUS 7: New Mahalanobis Distance threshold

To check MD settings go to calibration!!!

For cross validation results the MD values are sometimes extreme because the samples are outside the calibration set when those values are obtained.

Quant2 OPUS 7: New Mahalanobis Distance threshold

Before OPUS 7 the default threshold including factor 2 was always too low. Adjustments were needed before storing the method or afterwards in OPUS LAB.

Quant2 OPUS 7: New Mahalanobis
Distance threshold

In OPUS 7 the
threshold is set based
on the calibration set
statistic.
Almost all calibration
spectra will be below
the threshold. This is
logical because those
samples belong to the
calibration set.
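
The idea of deriving the threshold from the calibration set can be sketched as follows. This is not the OPUS formula, which is not given here. The sketch assumes that the scores of the factors are uncorrelated, so that the Mahalanobis distance reduces to a variance-scaled Euclidean distance in score space, and it uses a hypothetical threshold rule (mean plus k standard deviations of the calibration MD values).

```python
from math import sqrt
from statistics import mean, stdev


def mahalanobis_distances(scores, calibration_scores):
    """MD of each row in `scores` relative to the calibration scores.

    Assumption: score columns are uncorrelated, so only their centers and
    variances are needed (diagonal covariance matrix).
    """
    n = len(calibration_scores)
    n_factors = len(calibration_scores[0])
    centers = [mean([row[k] for row in calibration_scores]) for k in range(n_factors)]
    variances = [
        sum((row[k] - centers[k]) ** 2 for row in calibration_scores) / (n - 1)
        for k in range(n_factors)
    ]
    return [
        sqrt(sum((row[k] - centers[k]) ** 2 / variances[k] for k in range(n_factors)))
        for row in scores
    ]


def threshold_from_calibration(calibration_distances, k=3.0):
    """Hypothetical rule: mean + k * SD of the calibration MD values."""
    return mean(calibration_distances) + k * stdev(calibration_distances)


# Hypothetical scores (two factors) for illustration only
calibration = [(1.0, 0.2), (-1.1, 0.1), (0.3, -0.9), (-0.2, 0.6), (0.9, -0.4), (0.1, 0.4)]
cal_md = mahalanobis_distances(calibration, calibration)
limit = threshold_from_calibration(cal_md)
new_md = mahalanobis_distances([(3.0, 3.0)], calibration)[0]
print(round(limit, 2), round(new_md, 2), new_md > limit)
```

With such a rule the calibration spectra themselves fall almost entirely below the limit, while a spectrum far away from the calibration set is flagged.
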
Scores plot showing PLS scores

Statistics based on the predictions for
repeated measurements

Regression coefficients (b-vector)

The regression coefficients show the weighting of each data point
(wavenumber or wavelength) in the model.

PLS loadings (factors)

The loadings show where the spectral variance that is coded in this
factor is located. It is important to look for noise loadings.

All plots as values in the full report

Component Value Density

These values can be used to define a threshold in OPUS LAB for
indicating interesting samples for calibration updates.

Detection of relevant samples for
calibration expansion by the prediction

Detection of relevant samples for
calibration expansion by the prediction

[Figure: model NIR prediction vs. true value (reference), with the component value density plotted over the same true value axis]

Statistics based on the predictions for
repeated measurements

OPUS QUANT2 (PLS) for advanced users

Optimization tool and its settings

Optimization with NIR, A or B algorithm

Optimization with NIR, A or B algorithm

• The NIR optimization calculates models using all combinations of five
pre-defined or user-defined spectral ranges
• For the A and B optimization the test range is divided into 10 equally
large (or user-defined) parts, and these are combined:
   - For General A, starting from 1, regions are successively added
   - For General B, starting from 10, regions are successively removed
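
The three search strategies can be sketched as follows. The exhaustive enumeration follows directly from "all combinations"; the greedy add and remove rules, including the stop criterion, are assumptions about how an A-style and a B-style search could work and are not taken from the software. `error_of` is a hypothetical callback that builds a model for a list of regions and returns its error (e.g. RMSECV).

```python
from itertools import combinations


def all_region_combinations(regions):
    """NIR-style search space: every non-empty combination (2**n - 1 in total)."""
    combos = []
    for size in range(1, len(regions) + 1):
        combos.extend(combinations(regions, size))
    return combos


def forward_selection(regions, error_of):
    """A-style sketch: start from one region, add regions while the error drops."""
    selected, remaining, best_error = [], list(regions), float("inf")
    while remaining:
        candidate = min(remaining, key=lambda r: error_of(selected + [r]))
        candidate_error = error_of(selected + [candidate])
        if candidate_error >= best_error:
            break  # no further improvement
        selected.append(candidate)
        remaining.remove(candidate)
        best_error = candidate_error
    return selected, best_error


def backward_elimination(regions, error_of):
    """B-style sketch: start from all regions, remove regions while the error drops."""
    selected = list(regions)
    best_error = error_of(selected)
    while len(selected) > 1:
        candidate = min(selected, key=lambda r: error_of([s for s in selected if s != r]))
        reduced = [s for s in selected if s != candidate]
        reduced_error = error_of(reduced)
        if reduced_error >= best_error:
            break  # removing any region makes the model worse
        selected, best_error = reduced, reduced_error
    return selected, best_error


def toy_error(selection):
    """Invented error function: R2 and R4 help, every other region adds noise."""
    chosen, useful = set(selection), {"R2", "R4"}
    return 1.0 - 0.3 * len(chosen & useful) + 0.1 * len(chosen - useful)


regions = ["R1", "R2", "R3", "R4", "R5"]
print(len(all_region_combinations(regions)))  # 31 models for five ranges
print(forward_selection(regions, toy_error))
print(backward_elimination(regions, toy_error))
```

Five ranges give 31 combinations, which is still feasible to test exhaustively; ten ranges would give 1023, which is one reason a stepwise search is attractive there.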

Direct transfer of settings to the
parameter page for the selected model

Basic settings with a broad maximum
test range

Pre-defined spectral ranges for NIR
optimization

Pre-defined spectral ranges for NIR
optimization

10 spectral ranges for A & B
optimization by splitting the test range

10 spectral ranges for A & B
optimization by splitting the test range

User defined spectral ranges for A & B
optimization

User defined spectral ranges for A & B
optimization

OPUS QUANT2 (PLS) for advanced users

User defined dedicated optimization ranges

Overview NIR spectral regions

O-H
C-H
N-H

User defined regions for A opt. of C-H
and N-H (w/o water and water vapour)

9000 - 8000 cm-1
8000 - 7450 cm-1
6900 - 6770 cm-1
6770 - 6400 cm-1
6400 - 6030 cm-1
6030 - 5500 cm-1
4950 - 4770 cm-1
4770 - 4600 cm-1
4600 - 4500 cm-1
4500 - 3850 cm-1

[Figure: NIR spectrum with the O-H, C-H and N-H regions marked]

Regions above 9000 cm-1 are normally not considered for reflection spectra obtained with an integrating sphere.

User defined regions for NIR
optimization of water (moisture)

10550 - 9250 cm-1
7100 - 6800 cm-1
6800 - 6400 cm-1
6400 - 6030 cm-1
5300 - 4950 cm-1

[Figure: NIR spectrum with the O-H, C-H and N-H regions marked]

Regions above 9000 cm-1 are normally not considered for reflection spectra obtained with an integrating sphere.

Suggested spectral regions for user
defined optimization with Quant2 (PLS)

A optimization for C-H and N-H:
9000 - 8000 cm-1
8000 - 7450 cm-1
6900 - 6770 cm-1
6770 - 6400 cm-1
6400 - 6030 cm-1
6030 - 5500 cm-1
4950 - 4770 cm-1
4770 - 4600 cm-1
4600 - 4500 cm-1
4500 - 3850 cm-1

NIR optimization for O-H:
10550 - 9250 cm-1
7100 - 6800 cm-1
6800 - 6400 cm-1
6400 - 6030 cm-1
5300 - 4950 cm-1

[Figure: NIR spectrum with the O-H, C-H and N-H regions marked]

Regions above 9000 cm-1 are normally not considered for reflection spectra obtained with an integrating sphere.
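
Outside the software, the suggested regions can be kept as plain data, for example to select the matching data points of an exported spectrum. The sketch below assumes a list of wavenumbers in cm-1 and treats the region borders as inclusive; the names are hypothetical.

```python
# Suggested regions as (upper, lower) wavenumber limits in cm-1
CH_NH_REGIONS = [
    (9000, 8000), (8000, 7450), (6900, 6770), (6770, 6400), (6400, 6030),
    (6030, 5500), (4950, 4770), (4770, 4600), (4600, 4500), (4500, 3850),
]
WATER_REGIONS = [
    (10550, 9250), (7100, 6800), (6800, 6400), (6400, 6030), (5300, 4950),
]


def in_regions(wavenumber, regions):
    """True if the wavenumber lies inside at least one region (borders included)."""
    return any(lower <= wavenumber <= upper for upper, lower in regions)


def select_points(wavenumbers, regions):
    """Indices of the data points that fall into the given regions."""
    return [i for i, w in enumerate(wavenumbers) if in_regions(w, regions)]


# Hypothetical, very coarse wavenumber axis for illustration
axis = [9500, 8500, 7200, 6500, 5100, 4000]
print(select_points(axis, CH_NH_REGIONS))
print(select_points(axis, WATER_REGIONS))
```

The C-H and N-H list leaves out the strong water bands around 7200 and 5100 cm-1, which is exactly where the water list has its regions.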

OPUS QUANT2 (PLS) for advanced users

Quant2 file list for model validation

Different models can be tested at once
with a list of spectra

Adding true values (reference) for
comparison with predictions

Copy/paste of true values (reference)
for comparison with predictions

Copy/paste of true values (reference)
for comparison with predictions

Predictions overview

Prediction vs. true value (reference)
with target and regression line (blue)

Easy comparison of different models

Difference vs. true value (reference)
with bias line (blue)

Quant2 Filelist OPUS 7: marking of MD
and calibration range outliers

Marking according to the indication in the table on page ‘Analysis Results’:

• MD/range OK
• MD not OK
• out of range
• MD and range not OK
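
The four markings correspond to two independent checks, which can be sketched as below. The function name is hypothetical, and treating values exactly on a limit as "OK" is an assumption.

```python
def marking(md, md_threshold, prediction, calibration_min, calibration_max):
    """Return the marking for one predicted spectrum."""
    # Assumption: values exactly on a limit still count as OK
    md_ok = md <= md_threshold
    range_ok = calibration_min <= prediction <= calibration_max
    if md_ok and range_ok:
        return "MD/range OK"
    if range_ok:
        return "MD not OK"
    if md_ok:
        return "out of range"
    return "MD and range not OK"


# Hypothetical values: MD limit 0.5, calibration range 31 to 45
print(marking(0.2, 0.5, 38.0, 31.0, 45.0))
print(marking(0.9, 0.5, 47.5, 31.0, 45.0))
```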

Result statistics

74 PLS models for API in tablets:
calibration results

[Figure: RMSEP or RMSECV of calibration for each of the 74 models]

74 PLS models for API in tablets:
calibration and validation results

[Figure: RMSEP or RMSECV of calibration and RMSEP of validation for each of the 74 models]

Region with API information but bad
influence

Spectra of tablets and pure API

Maximized spectra of tablets and pure
API

Select regions related to API

Remove the API spectrum before starting the optimization!

Model robustness check by prediction of
independent samples across
instruments
• Sunflower samples were scanned on 3 Bruker instruments
• Each sample was scanned twice, with re-filling in between
• The same cup filling was measured on all instruments
• Predictions were done with 5 models obtained during the model
optimization process
• All models showed very similar calibration results but behave differently
in terms of
   - prediction repeatability between re-fills on one instrument
   - prediction repeatability between the instruments
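
Both kinds of repeatability can be put into numbers with a sketch like the following. The data layout and function names are hypothetical and the predictions are invented. The refill repeatability uses the usual standard deviation from duplicate measurements, the instrument repeatability pools the per-sample standard deviation of the instrument means.

```python
from math import sqrt
from statistics import mean, stdev


def refill_repeatability(predictions, instrument):
    """Pooled SD between the two fillings of each sample on one instrument."""
    differences = [by_instrument[instrument][0] - by_instrument[instrument][1]
                   for by_instrument in predictions.values()]
    # SD from duplicates: sqrt(sum(d^2) / (2 * number of pairs))
    return sqrt(sum(d * d for d in differences) / (2 * len(differences)))


def instrument_repeatability(predictions):
    """Pooled SD of the mean prediction per sample across the instruments."""
    per_sample_sd = [stdev([mean(pair) for pair in by_instrument.values()])
                     for by_instrument in predictions.values()]
    return sqrt(mean([sd * sd for sd in per_sample_sd]))


# predictions[sample][instrument] = (filling 1, filling 2), invented values
predictions = {
    "sample 1": {"MPA 1": (20.0, 20.2), "MPA 2": (20.4, 20.4), "MATRIX-I": (19.8, 20.0)},
    "sample 2": {"MPA 1": (30.0, 30.4), "MPA 2": (30.6, 30.2), "MATRIX-I": (29.6, 29.9)},
}
for instrument in ("MPA 1", "MPA 2", "MATRIX-I"):
    print(instrument, round(refill_repeatability(predictions, instrument), 3))
print("between instruments", round(instrument_repeatability(predictions), 3))
```

Calculating these two numbers for every candidate model gives a simple basis for choosing between models with similar calibration results.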

Model robustness check by prediction of
independent samples across
instruments

[Figure: protein predictions with Model 1 on MPA 1, MPA 2 and MATRIX-I; RMSECV = 1.0, SEP = 1.3]

Model robustness check by prediction of
independent samples across
instruments

[Figure: protein predictions with Model 2 on MPA 1, MPA 2 and MATRIX-I; RMSECV = 0.99, SEP = 1.7]

Model robustness check by prediction of
independent samples across
instruments

[Figure: protein predictions with Model 3 on MPA 1, MPA 2 and MATRIX-I; RMSECV = 1.1, SEP = 1.7]

Model robustness check by prediction of
independent samples across
instruments

[Figure: protein predictions with Model 4 on MPA 1, MPA 2 and MATRIX-I; RMSECV = 1.1, SEP = 1.7]

Model robustness check by prediction of
independent samples across
instruments

[Figure: protein predictions with Model 5 on MPA 1, MPA 2 and MATRIX-I; RMSECV = 1.2, SEP = 2.5]

Modeling with big spectra data sets
transferred from Foss and new Bruker
data
At the time when Foss spectra are transferred, the number of available samples
is limited. Sometimes the reference values are not available or too old (e.g. for
moisture).
Nevertheless, as many samples as possible should be measured on the Bruker
instrument. Reference values are required for the calibration samples, but not
for the transfer samples.
For the modeling and the model selection it is helpful to scan samples several
times, so that models can be checked and selected by repeatability.

Never use transferred Foss spectra alone to create a model!

Modeling with big spectra data sets
transferred from Foss and new Bruker
data
The modeling must be guided towards the characteristics of
Bruker spectra by a proper splitting of the data sets:

• Calibration set: as many Bruker spectra as available
• Test set: a good mix of Bruker and Foss data (e.g. 50:50)
• Validation set: 100% Bruker, if possible from different instruments
• Avoid overfitting by selecting a lower rank (fewer PCs)

[Diagram: Model development (test set optimization) uses the calibration set (transferred Foss data plus Bruker data) and the test set (Foss plus Bruker data). Model check & selection uses the validation set (Bruker data).]

Innovation with Integrity

Copyright © 2011 Bruker Corporation. All rights reserved.
www.bruker.com
