2017 - OPUS Quant Advanced PDF

OPUS QUANT2 (PLS) for advanced users
Dr. Andreas Niemöller

Bruker Optik GmbH
Innovation with Integrity

Successful calibration setup starts long before using PLS
• Reliable wet chemistry as reference (component values)
• Right sample presentation and correct NIR measurement (spectra)
• Comparable amount of sample analyzed by wet chemistry and NIR
• Even the most powerful chemometric tools and algorithms, like PLS, cannot derive good results from a bad data set
Classical univariate calibration model
• linear regression
• extrapolation allowed
• sensitivity directly definable
• evaluation of a single measured value
[Figure: measured value (0-1.2) vs. concentration / % (0-0.45)]
Multivariate calibration model
• multivariate regression after factorization
• extrapolation NOT allowed
• sensitivity NOT definable
• evaluation of spectrum
PLS = Partial Least Squares
[Figure: multivariate calibration vs. concentration / % (0-0.45)]
Principles of Factor Analysis
Principles and Properties of Factor Analysis
• Variance analysis: 'looking for changes in the data set'
• Common statistical method for data analysis
• Different names used in chemometrics:
- Factor analysis
- Principal Component Analysis (PCA)
• Orthogonal transformation of the data
• Enormous data compression: representation of the data set by a few latent variables
Factor Analysis of Spectra
Factor analysis breaks the spectral data apart into the most common spectral variations (factors, loadings, principal components) and the corresponding scaling coefficients (scores).
Spectral data matrix (n x p) = Scores (n x d) · Loadings (d x p)
Data matrix: n spectra with p data points
Scores: d score values for each spectrum (d < n)
Factors: d factors with p data points (d < n)
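As a minimal sketch of the factorization described on this slide, the following NumPy example splits a small synthetic data matrix into scores and loadings. The use of a singular value decomposition, the omission of mean centering and all names (`factorize`, `X`, `shapes`) are assumptions made for illustration; the slides do not state how OPUS computes its factors.

```python
import numpy as np

def factorize(X, d):
    """Split X (n spectra x p data points) into scores (n x d) and loadings (d x p)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :d] * s[:d]      # d score values for each spectrum
    loadings = Vt[:d, :]           # d factors, each with p data points
    return scores, loadings

# Synthetic data (assumption): 5 "spectra" with 50 data points, built from 2 band shapes
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
shapes = np.vstack([np.exp(-((x - 0.3) / 0.05) ** 2),
                    np.exp(-((x - 0.7) / 0.10) ** 2)])
X = rng.uniform(0.5, 2.0, size=(5, 2)) @ shapes

scores, loadings = factorize(X, d=2)
print(scores.shape, loadings.shape)
```

Because the synthetic matrix is built from only two shapes, two factors reproduce it completely, and the loadings come out mutually orthogonal, in line with the "orthogonal transformation" mentioned earlier.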
Factor Analysis of the Spectral Variance (without Property Values)
Scores of the 5 spectra (the spectra and the 5 loadings are shown as plots on the slide):
Spectrum   Factor 1   Factor 2   Factor 3    Factor 4    Factor 5
1          5.216      -0.216     1.73E-02    -1.52E-02   3.17E-02
2          5.95       -0.103     4.97E-04    4.33E-02    5.65E-03
3          7.731      -0.699     3.67E-04    -1.15E-02   -2.04E-02
4          5.768      0.693      2.97E-02    -3.76E-03   -1.27E-02
5          7.13       0.441      -3.75E-02   -9.54E-03   4.46E-03
Inverse Factor Analysis: Reconstruction of a Spectrum using all Factors
Spectrum = 7.731 × Loading 1 + (-0.699) × Loading 2 + 3.67E-04 × Loading 3 + (-1.15E-02) × Loading 4 + (-2.04E-02) × Loading 5
(the factors are the scores of the spectrum)
Reconstruction using two Factors: >99% of Information Content retained
Spectrum ≈ 7.731 × Loading 1 + (-0.699) × Loading 2
In the software the spectra are not reconstructed. The spectra are represented just by the few score values (data compression), which are used in the modeling calculations.
Spectral residuals: difference between the original spectra and the spectra reconstructed using n loadings
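The reconstruction with a reduced number of factors and the resulting spectral residual can be sketched as follows. This is an illustration on synthetic data, assuming an SVD-based factorization without mean centering; the function names and the definition of the retained information as a share of the total sum of squares are assumptions, not taken from OPUS.

```python
import numpy as np

def reconstruct(X, d):
    """Reconstruct X from its first d factors; return (X_hat, residual per spectrum)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_hat = (U[:, :d] * s[:d]) @ Vt[:d, :]
    residual = np.sqrt(np.sum((X - X_hat) ** 2, axis=1))   # one value per spectrum
    return X_hat, residual

def retained_fraction(X, d):
    """Share of the total sum of squares captured by the first d factors."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s[:d] ** 2) / np.sum(s ** 2))

# Synthetic data (assumption): two underlying shapes plus a little noise
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 60)
base = np.vstack([np.exp(-((x - 0.4) / 0.08) ** 2), x])
X = rng.uniform(0.5, 2.0, size=(5, 2)) @ base + 1e-3 * rng.standard_normal((5, 60))

X_hat, residual = reconstruct(X, d=2)
print(retained_fraction(X, 2))
print(residual)
```

With all five factors the reconstruction is exact; with two factors only the noise remains in the residual.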
Moving from factor analysis to PLS
• PLS is a factor analysis (variance analysis) that takes component or property values (e.g. concentrations) into account
• For each component or property a set of PLS factors is calculated
• The factors are calculated based only on the spectral variance correlated with the given component or property values
• PLS can be seen as a variance analysis including a kind of regression step
• PLS is very effective in making use of correlated information and in discriminating against information that is not useful
• Even overlapping bands and structures in the spectra can be separated
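To make the "variance analysis including a kind of regression step" concrete, here is a compact sketch of the textbook PLS1 algorithm (NIPALS variant) in NumPy. It is not the OPUS implementation; the function names, the mean centering and the synthetic two-band data are assumptions chosen to show that overlapping bands can be separated.

```python
import numpy as np

def pls1_fit(X, y, rank):
    """Fit a PLS1 model for one component; return (b, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(rank):
        w = E.T @ f                      # spectral variance correlated with y
        w /= np.linalg.norm(w)
        t = E @ w                        # scores of this factor
        tt = t @ t
        p = E.T @ t / tt                 # loading of this factor
        c = (f @ t) / tt                 # regression of y on the scores
        E = E - np.outer(t, p)           # remove the factor from the spectra
        f = f - c * t                    # remove the factor from the values
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)  # regression coefficients (b-vector)
    return b, x_mean, y_mean

def pls1_predict(X, model):
    b, x_mean, y_mean = model
    return (X - x_mean) @ b + y_mean

# Synthetic data (assumption): two strongly overlapping bands A and B
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 80)
band_a = np.exp(-((x - 0.45) / 0.10) ** 2)
band_b = np.exp(-((x - 0.55) / 0.10) ** 2)
conc_a = rng.uniform(1.0, 1.9, 30)
conc_b = rng.uniform(0.0, 5.0, 30)
X = np.outer(conc_a, band_a) + np.outer(conc_b, band_b)

model = pls1_fit(X, conc_a, rank=2)
pred = pls1_predict(X, model)
print(np.max(np.abs(pred - conc_a)))
```

On this noise-free example two factors are enough to recover component A despite the overlap with band B; real spectra contain noise and further variation, so the error will not vanish.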
PLS Loadings for components A and B
[Figure: spectra of components A and B, and PLS loadings 1-3 for component A and for component B]
Analysis of Spectra using PCA or PLS models based on Scores and Loadings
Measured spectrum ≈ 7.731 × Loading 1 + (-0.699) × Loading 2 (scores of the spectrum × loadings in the model)
For the measured spectrum the scores are calculated according to the factors (loadings) stored in the model.
The scores are used for the final evaluation in the PCA model (identification) or PLS model (quantification).
Experiment to show Capabilities of PLS
Reflectance spectra from glucose with admixtures of 1.0-1.9% talc
[Figure: absorbance (0.1-0.7) vs. wavenumber / cm-1, 10000-5000 cm-1]
Spectra after Vector Normalization
[Figure: absorbance vs. wavenumber / cm-1, 10000-5000 cm-1]
Not optimized PLS Model with a broad Spectral Range used
R2 = 96.63, RMSECV = 0.05%, Rank 2
Parameter: vector normalization, 10000-4500 cm-1
[Figure: NIR prediction / % vs. reference value / %, 0.9-1.9%]
Regression Coefficients of not optimized PLS Model
Weighting of the wavenumbers for the calibrated property. PLS contains an automatic 'search' for relevant wavenumbers.
[Figure: regression coefficients vs. wavenumber / cm-1, 10000-5000 cm-1]
PLS Factors (loadings) of the not optimized PLS Model
Both factors contain parts of the spectral variation caused by the talc content.
[Figure: Factor 1 and Factor 2 vs. wavenumber / cm-1, 10000-5000 cm-1]
Optimized PLS Model
R2 = 99.68, RMSECV = 0.02%, Rank 2
Parameter: 1st derivative, 7500-6100 cm-1
[Figure: NIR prediction / % vs. reference value / %, 0.9-1.9%]
Spectra after 1st Derivative
[Figure: 1st derivative spectra vs. wavenumber / cm-1, 10000-5000 cm-1]
Regression Coefficients for the optimized PLS Model
In the optimized model only the talc peak is considered.
[Figure: regression coefficients vs. wavenumber / cm-1, 7400-6200 cm-1]
PLS factors (loadings) for the optimized PLS Model
Only the first factor contains useful information for the prediction of the talc content.
[Figure: Factor 1 and Factor 2 vs. wavenumber / cm-1, 7400-6200 cm-1]
Basics of Calibration Setup and Modeling
Data sets for model setup and method validation
• Calibration Set only: setup with cross validation for small data sets (feasibility)
• Calibration Set + Val Set: setup with cross validation and check with validation set
• Calibration Set + Test Set: setup with test set validation
• Calibration Set + Test Set + Val Set: setup with test set and validation set
All of these belong to the method setup 'today'.
Val Set = dataset of independent samples
Methods must be validated over time (model maintenance)!
[Diagram: calibration, test and validation set combinations at method setup 'today', followed by repeated Val Sets along the method validation time 'in the future']
Val Set = dataset of independent samples
Methods must be validated over time (model maintenance)!
Validation with independent samples is the ONLY way to
• check the accuracy, reproducibility and robustness of PLS methods,
• select methods for routine use.
[Diagram: repeated Val Sets along the method validation time 'in the future']
Val Set = dataset of independent samples
Updating of methods and data sets with new samples (new batches, new recipes)
[Diagram: Calibration Set and Test Set at method setup, followed by repeated Val Sets along the method validation time; the robustness of the model is checked over time]
Principles of method development
1. Well measured spectra and reference values
2. Checking of spectra and data sets for outliers, unusual effects and samples
3. Setup of first methods for control and selection of spectral ranges for optimization
4. Optimization
5. Selection and review of models from the optimization list
6. Validation of models with independent samples, if possible considering the timeline (more new or newer samples)
Frequent repetition of point 6 in routine usage!
Principles of method development
When calibration samples are selected, care should be taken to ensure that all major factors affecting the accuracy of calibration are covered within the limits of the defined application area. These factors include the following:
1. Recipes: combinations and composition ranges of major and minor sample components: analytes and non-analytes
2. Seasonal, geographic and genetic effects on sample material or raw materials
3. Processing techniques and conditions
4. Storage and storage conditions
5. Sample and instrument temperatures and their changes
6. Instrument variations
Such requirements for calibration development are given e.g. in ISO 12099 (Feed) and ISO 21543 | IDF 201 (Dairy).
Distribution of samples
[Figure: prediction vs. reference value, with the typical concentration range and a "rare sample" or outlier marked]
The concentration range of the calibration should extend beyond the expected analysis range if possible.
General parameters influencing the modeling and the model accuracy
• Quality of instruments, e.g. resolution, stability, signal/noise ratio, precision, robustness
• Parameters for measurement
• Sample preparation and sample presentation
• Accuracy of reference method (e.g. wet chemistry)
  - In many cases the accuracy of an IR or NIR method depends only on the accuracy of the reference method.
  - On average NIR can be more accurate because of its better reproducibility.
Selection of calibration and test samples
• Calibration and test set samples should be well distributed over the entire property range
• As many samples as possible should be used for the test set, but important samples must be in the calibration set. For big data sets the split is 50% in the calibration set and 50% in the test set.
• Required number of samples
  - feasibility study: ~ 20 samples minimum
  - typical applications: ~ 50-100 samples
  - complex application: > 150 samples
Selection of spectral ranges for calibration
• Avoid spectral noise, e.g. at the left and right border of the spectra where the detector has low sensitivity or a cut-off
• Avoid spectral ranges with total absorbance (absorbance > 2.0 AU)
• A quantitative evaluation is only possible up to 2 AU, but starting from the baseline.
[Figure: absorbance (0-4 AU) vs. wavenumber, 10000-2000]
Troubleshooting in case of poor prediction
• Selection of suitable spectral ranges?
  - Were ranges with spectral noise included in the calibration?
  - Were ranges with total absorption included in the calibration?
• Selection of correct experiment for measurements?
• Selection of a robust Quant2 method?
• Selection of suitable data preprocessing?
• Were the property values of the calibration samples well distributed over the selected range?
Troubleshooting in case of outliers
• Was the sample not homogenized properly?
• No temperature control with critical liquid samples?
• Probe not properly immersed?
• Was an air bubble in the optical gap?
• Selection of the wrong method or measuring experiment?
• Measurements through vials: identical vials for calibration and measurement?
• Comparable measuring conditions (e.g. angle of attack of the probe, ...)?
Troubleshooting in case of outliers
Problem: Higher calibration errors due to bad reference values
Solution:
• Review of the reference analysis method (2nd reference technique, old chemicals, operator?)
• Review of accuracy, error limits and reproducibility of the reference analysis
• Repetition and/or multiple determination of the reference values for some samples
NIR-Spectra of Water at various temperatures
[Figure: absorbance units (0.0-1.4) vs. wavenumber cm-1, 9000-6000 cm-1]
1st derivative NIR-Spectra of Water at various temperatures
[Figure: 1st derivative spectra vs. wavenumber cm-1, 7600-6200 cm-1]
Shifting of Water band due to increasing temperature
[Figure: absorbance units (1.25-1.45) vs. wavenumber cm-1, 7100-6700 cm-1; one spectrum every 5 °C from 0 °C to 100 °C]
Difference Spectra of Water band on increasing temperature
[Figure: absorbance units (-0.4 to 0.4) vs. wavenumber cm-1, 7500-6000 cm-1; one spectrum every 5 °C from 0 °C to 100 °C]
Toluene NIR-Spectra at various temperatures
[Figure: absorbance units (-0.01 to 0.06) vs. wavenumber cm-1, 9000-5000 cm-1]
Toluene NIR-Spectra at various temperatures (detail)
[Figure: absorbance units (1.135-1.165) vs. wavenumber cm-1, 5958-5948 cm-1; one spectrum every 5 °C from 25 °C to 95 °C]
Error values for characterizing calibration performance and validation
• Multivariate PLS models can't be checked by the regression coefficient r2 and the slope of a regression line.
• The standard deviation of NIR predictions from the true values (reference) is calculated as Root Mean Square Error of ...
• Depending on the data set used for prediction, different errors are defined:
  - Calibration: RMSEE or RMSEC
  - Cross validation: RMSECV
  - Test set validation: RMSEP
• Another parameter is the R2 value, which should be as close as possible to 100%
Error values for characterizing calibration performance and validation
• Root Mean Square Error of ...
  - Estimation or Calibration (RMSEE or RMSEC): for the predictions of all samples using the calibration model based on all samples
  - Cross Validation (RMSECV): for the predictions during the cross validation, i.e. samples are temporarily independent
  - Prediction (RMSEP): for the prediction of independent samples (test set validation)
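The error values named on these slides can be illustrated with a short calculation. The sketch below uses the plain textbook definitions of a root mean square error and of the coefficient of determination in percent; whether OPUS applies additional corrections (for example for degrees of freedom in RMSEE) is not stated in the slides, and the example numbers are invented for illustration only.

```python
import math

def rmse(predicted, reference):
    """Root mean square error between NIR predictions and reference values."""
    diffs = [p - r for p, r in zip(predicted, reference)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def r2_percent(predicted, reference):
    """Coefficient of determination in percent: 100 * (1 - SSE / total sum of squares)."""
    mean_ref = sum(reference) / len(reference)
    sse = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    sst = sum((r - mean_ref) ** 2 for r in reference)
    return 100.0 * (1.0 - sse / sst)

# Invented example values (assumption), not data from the slides
reference = [1.0, 1.2, 1.4, 1.6, 1.8]
predicted = [1.02, 1.18, 1.41, 1.63, 1.77]
print(rmse(predicted, reference), r2_percent(predicted, reference))
```

The same `rmse` function yields RMSECV or RMSEP depending on whether the predictions come from the cross validation or from an independent test set.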
What our customers expect...
Customer: "RMSEP = 0.5? Great! All my results within a range of +/- 0.5. Excellent accuracy!"
Normal (Gaussian) Distribution
We always find the same share of events within the following intervals:
• +/- 1 standard deviation (68.3%)
• +/- 2 standard deviations (95.5%)
• +/- 3 standard deviations (99.7%)
RMSEP / RMSECV correspond to the standard deviation of the prediction errors!
In practice:
Ruminant Feed, Fat: Test Set Validation
                                  +/- 1s   +/- 2s   +/- 3s
Reality (counting the results)    69%      94%      100%
Theory (normal distribution)      68.3%    95.5%    99.7%
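The theoretical percentages follow directly from the normal distribution and can be checked with the standard library. The helper names below are hypothetical; `observed_fraction` shows how the "counting the results" row could be obtained from a list of prediction errors.

```python
import math

def fraction_within(k):
    """Expected fraction of normally distributed errors within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2.0))

def observed_fraction(errors, rmsep, k):
    """Fraction of actual prediction errors within +/- k * RMSEP."""
    return sum(abs(e) <= k * rmsep for e in errors) / len(errors)

for k in (1, 2, 3):
    print(k, 100.0 * fraction_within(k))
```

The exact two-sigma value is about 95.45%, which the slide rounds to 95.5%. For the customer this means that with RMSEP = 0.5 only about two thirds of the results are expected within +/- 0.5.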
R2 and its meaning: expresses the relation of error bar and value range
[Figures: three calibrations with R2 = 66.4%, R2 = 81.4% and R2 = 98.9%]
R2 is the coefficient of determination and is not the same as r2, the regression coefficient.
R2: Calibration of Fat in Milk
n      RMSEP   R2      Range
500    0.06    99.51   5.03
480    0.06    99.45   3.35
400    0.05    99.22   2.17
300    0.05    98.41   1.36
200    0.05    96.51   0.82
100    0.05    88.79   0.46
80     0.05    81.71   0.35
60     0.04    73.87   0.26
50     0.04    70.24   0.21
40     0.04    65.70   0.16
Methods of Data Pretreatment
Principle of Vector Normalization
Normalization of spectra to vector norm value 1
[Figure: spectra of acetylsalicylic acid and salicylic acid, absorption (0.0-1.0) vs. wavenumber / cm-1, 12000-4000 cm-1]
[Figure: spectra of acetylsalicylic acid and salicylic acid, absorption (0.0-0.5) vs. wavenumber / cm-1, 12000-4000 cm-1]
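A minimal sketch of a vector normalization is given below. The slide only states that spectra are normalized to vector norm 1; centering each spectrum on its mean before scaling is an assumption made here, as are the function name and the example values.

```python
import math

def vector_normalize(spectrum):
    """Center a spectrum on its mean and scale it to vector norm 1."""
    mean = sum(spectrum) / len(spectrum)
    centered = [a - mean for a in spectrum]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

# Invented example (assumption): same band shape with a different offset and scale
spectrum = [0.20, 0.25, 0.60, 0.30, 0.22]
scaled_and_shifted = [2.0 * a + 0.1 for a in spectrum]

print(vector_normalize(spectrum))
print(vector_normalize(scaled_and_shifted))
```

Both inputs give the same normalized spectrum, which is why offset and scaling differences disappear. Because mean and norm are computed over the selected data points only, the result depends on the spectral range used.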
Spectra of Glucose before Vector Normalization
[Figure: absorption (0.5-1.5) vs. wavenumber / cm-1, 12000-4000 cm-1; spectral ranges 12500-3500 cm-1, 12000-3800 cm-1 and 9000-3800 cm-1]
Glucose Spectra normalized in different spectral ranges
[Figure: absorption (-0.02 to 0.06) vs. wavenumber / cm-1, 9000-4000 cm-1; normalized in 12500-3500 cm-1, 12000-3800 cm-1 and 9000-3800 cm-1]
Principle of Derivatives (Savitzky Golay)
[Figure: absorption band and its 1st derivative, 5 pt smoothing window]
Spectra of Glucose after 1st Derivative
[Figure: 5 pt, 13 pt and 25 pt smoothing; absorption (-0.004 to 0.004) vs. wavenumber / cm-1, 12000-4000 cm-1]
Spectra of Glucose after 1st Derivative (detail)
[Figure: 5 pt, 13 pt and 25 pt smoothing; absorption (-0.0004 to 0.0002) vs. wavenumber / cm-1, 9000-8000 cm-1]
Spectra of Glucose after 1st and 2nd Derivative
[Figure: 1st derivative 13 pt and 2nd derivative 13 pt; absorption (-0.15 to 0.10) vs. wavenumber / cm-1, 9000-4000 cm-1]
Spectra of Glucose after 1st and 2nd Derivative (detail)
[Figure: 1st derivative 13 pt and 2nd derivative 13 pt; absorption (-0.20 to 0.15) vs. wavenumber / cm-1, 9000-8000 cm-1]
Advantages and Disadvantages of Vector Normalization
• Advantages of vector normalization
  - Shape of spectra retained
  - Interpretation of spectra easier
• Disadvantages of vector normalization
  - Result depends on the spectral range used
Advantages and Disadvantages of Derivatives
• Advantages of derivatives
  - Contrast enhanced, more details visible
  - Result does not depend on the spectral range used
• Disadvantages of derivatives
  - Noise enhanced, smoothing step needed
  - Result depends on the window size used
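The smoothing derivative can be sketched in a few lines. The example assumes a Savitzky-Golay first derivative with a quadratic fit, for which the weights within a window of 2m+1 points are simply k / sum(k^2); the polynomial order, the edge handling (points without a full window are dropped) and the function name are assumptions, and the derivative is expressed per data point rather than per cm-1.

```python
def first_derivative(spectrum, points):
    """First derivative with Savitzky-Golay smoothing (quadratic fit, odd window size)."""
    if points % 2 == 0 or points < 3:
        raise ValueError("points must be an odd number >= 3")
    m = points // 2
    denom = sum(k * k for k in range(-m, m + 1))
    out = []
    for i in range(m, len(spectrum) - m):      # edges without a full window are dropped
        out.append(sum(k * spectrum[i + k] for k in range(-m, m + 1)) / denom)
    return out

# Invented example (assumption): a straight baseline with and without an offset
line = [3.0 * i + 2.0 for i in range(12)]
shifted = [v + 0.5 for v in line]
print(first_derivative(line, 5))
print(first_derivative(shifted, 5))
```

Both outputs are identical, which shows the offset elimination; a larger window smooths more strongly but also broadens narrow features, which is the window size dependence listed among the disadvantages.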
Recommended smoothing Point Settings for 1st Derivatives
• Resolution 8 cm-1
  - Quant2: 13 to 21 pt, mainly 17 pt
  - Ident: 9 to 17 pt, mainly 13 pt
• Resolution 16 cm-1
  - Quant2: 9 to 17 pt, mainly 13 pt
  - Ident: 9 to 17 pt, mainly 9 pt
Other Pre-processing Methods in Quant2
• No Spectral Data Preprocessing
Only used in very rare applications where the offset shift carries the required information on physical effects, e.g. scattering effects caused by changing particle sizes
• Constant Offset Elimination and Straight Line Elimination
Only applicable for spectra with a horizontal baseline, e.g.
  - NIR spectra of liquids
  - MIR spectra
  - Raman spectra
• Min-Max Normalization
Only useful if you have a more or less constant highest peak or if you are looking for peak ratios in the selected spectral range
Not really useful for NIR, quite risky in most cases
• Internal Standard
Only applicable if an internal standard is used for scaling spectra
• Second Derivative
For elimination of offsets and skewed baselines
Common for dispersive systems to increase the contrast for low-resolution spectra
Noise is strongly increased, which is why 1st derivative plus vector normalization is the better choice
• MSC (Multiplicative Scatter Correction) and 1st Derivative + MSC
Common method to correct baseline effects due to wavelength-dependent scatter effects.
Not advisable for small data sets and spectra with different effects, because an MSC model is derived from the calibration spectra and could partly fail for new independent samples.
Vector normalization gives comparable results and is applied to each spectrum individually, which is more robust.
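The dependence of MSC on the calibration spectra becomes visible in a small sketch. It follows the common textbook form of multiplicative scatter correction: each spectrum is regressed against the mean calibration spectrum and then corrected with the fitted offset and slope. Function names and numbers are invented for illustration, and the exact OPUS procedure may differ.

```python
def mean_spectrum(spectra):
    """Mean spectrum of the calibration set; this is the stored MSC reference."""
    n = len(spectra)
    return [sum(col) / n for col in zip(*spectra)]

def msc(spectrum, reference):
    """Fit spectrum = a + b * reference by least squares, return (spectrum - a) / b."""
    n = len(reference)
    mx, mr = sum(spectrum) / n, sum(reference) / n
    b = (sum((r - mr) * (x - mx) for r, x in zip(reference, spectrum))
         / sum((r - mr) ** 2 for r in reference))
    a = mx - b * mr
    return [(x - a) / b for x in spectrum]

# Invented calibration set (assumption): one band shape with different offset and scale
calibration = [[0.10, 0.30, 0.80, 0.40],
               [0.25, 0.65, 1.65, 0.85],
               [0.05, 0.15, 0.40, 0.20]]
reference = mean_spectrum(calibration)      # derived from the calibration spectra
corrected = [msc(s, reference) for s in calibration]
print(corrected)
```

New spectra have to be corrected against the same stored `reference`, so a reference derived from few or unrepresentative calibration spectra carries over to every later prediction, whereas vector normalization needs no such reference.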
Method setup and spectra table
Load method with overview on spectra and parameters
Component definition with units and decimal point settings
Adding dummy components as category variables
• Adding dummy components which are not calibrated can be useful for sorting and selection of samples
Examples:
- Sample type
- Year and/or month
- Origin of samples (country, supplier, facility, plant, vessel)
- Special properties (additives, temperature)
• Example: gasoline samples
There are samples with and without ethanol added. Ethanol has a strong influence on the spectra. It could be helpful or important to develop models with and without ethanol-containing samples.
The categories YES or NO must be entered in the spectra table as values, e.g. 1 and 0. By sorting the column, the samples can easily be marked for color settings or for exclusion.
Spectra table for spectra and component values
Check spectra before loading them!
The spectra are shown in the preview window. By toggling with the cursor keys you can easily check the quality of all spectra, which helps to avoid trouble later on.
Missing values are handled as a blank
Even with missing values you can copy/paste tables from e.g. Excel to the spectra table.
Set sample number
• The sample number indicates which spectra belong to the same sample. This is very important if repeated measurements of one sample are made (replicates or refills).
• During cross validation or test set validation, samples are always considered as a whole, i.e. all spectra of the same sample are validated at the same time. If one spectrum of a sample remained in the calibration set, the validation of the other spectra of the same sample would not be independent.
New method based on mean spectra for each sample (sample no.)
• Spectra assigned to the same sample number are automatically averaged.
• The mean spectra are stored and a new corresponding QUANT2 method file is created automatically.
• The new method can be further developed and new samples can be added, even with repeated measurements.
• Samples with just one spectrum are retained unchanged.
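The averaging by sample number can be pictured with a small helper. This is only an illustration of the idea; the function name and the data layout (a list of sample numbers and a parallel list of spectra) are assumptions and do not reflect the QUANT2 file format.

```python
def average_by_sample(sample_numbers, spectra):
    """Average all spectra that share a sample number; single spectra are kept as they are."""
    groups = {}
    for number, spectrum in zip(sample_numbers, spectra):
        groups.setdefault(number, []).append(spectrum)
    return {number: [sum(col) / len(members) for col in zip(*members)]
            for number, members in groups.items()}

# Invented example (assumption): sample 1 measured twice, sample 2 once
sample_numbers = [1, 1, 2]
spectra = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(average_by_sample(sample_numbers, spectra))
```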
[Screenshots: mean spectra and the new method]
Component correlations
• For robust calibration models, only spectral information that correlates exclusively with the calibrated component should be used.
• In case of co-linearity (e.g. by dilution) some information might be used which is not related to the component and could cause trouble in the future if the co-linear relation changes.
• Example: Active Ingredient (Component A) and Excipients (Component B)
Calibration design
Dataset settings
• Dataset splitting into calibration and test set
• Data set assignment of selected (marked) spectra
• Color assignment of selected (marked) spectra
• Special options for excluding spectra with missing component values
Set data set
Spectra without reference values can be excluded
For selected components the spectra can be excluded for blank entries or for spectra with component values of 0 or -1.
Automatic selection of test samples on component values (Kennard-Stone)
• The selection is performed across all components with an optimum distribution of samples in all dimensions (4 components = 4-dimensional property space)
• Samples with the lowest and highest property values are in the calibration set, the next inner ones in the test set. All other samples are selected according to the selected percentage of test samples
• The automatic selection is not available for data sets that are too small
[Figure] Samples with lowest and highest property values are in the calibration set, the next inner ones in the test set.
[Figure] The next test sample is chosen with the maximum distance from the already selected ones in all dimensions (properties). Here it is found in the middle.
[Figure: 10 % test samples] The next test sample is chosen with the maximum distance from the already selected ones in all dimensions (properties) until the required percentage of test samples is reached.
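The maximum-distance rule used for picking the next sample can be sketched as follows. This is the generic Kennard-Stone selection step, not the exact OPUS procedure (which additionally keeps the extreme samples in the calibration set); the starting pair, the Euclidean distance and the example values are assumptions.

```python
import math

def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the maximum-minimum-distance rule."""
    def dist(i, j):
        return math.dist(points[i], points[j])
    n = len(points)
    # start with the two points that are farthest apart
    first, second = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                        key=lambda ij: dist(*ij))
    selected = [first, second]
    while len(selected) < n_select:
        remaining = [i for i in range(n) if i not in selected]
        # next point: largest distance to its nearest already selected point
        nxt = max(remaining, key=lambda i: min(dist(i, s) for s in selected))
        selected.append(nxt)
    return selected

# Invented one-dimensional property values (assumption)
points = [(0.0,), (1.0,), (2.0,), (5.0,), (10.0,)]
print(kennard_stone(points, 3))
```

After the two extremes, the third pick is the value 5.0 in the middle, matching the behaviour described in the figure captions. With several components each point simply has more coordinates.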
[Figure: 20 % test samples]
[Figure: 50 % test samples]
Automatic selection of test samples (Kennard-Stone) in scores space (PCA)
Quant2 OPUS 7: exclude redundant samples
• Many methods contain a lot of redundant samples, accumulated over time when many samples with the same properties are added, e.g. due to product specifications.
• Such samples do not contribute to the model because they
  - do not introduce new information
  - increase data set size and computation time
  - change the statistics in calibration and validation
• Reducing the data set size helps to achieve better models and reduces computation time (optimization).
• The function can be used to find redundant samples in advance which should not go to the reference lab. In this way costs for wet chemistry can be avoided.
• The new algorithm looks for k nearest neighbors (kNN) and kicks out redundant samples which are very close to a given sample.
• This is the opposite approach to the Kennard-Stone algorithm, which is used to find and select samples that cover the range of samples well.
• To work on big data sets you can now:
  - Reduce data set by kNN algorithm
  - Select Test Set by Kennard-Stone (on values or PCA scores)
  - Optimize
  - Check models with new samples and Quant2 Filelist
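The idea of removing samples that lie very close to an already kept sample can be illustrated with a simple distance-threshold thinning. This is a stand-in chosen for brevity and not the kNN algorithm of OPUS 7, whose details are not given in the slides; the threshold, the function name and the example score values are assumptions.

```python
import math

def thin_redundant(points, min_distance):
    """Keep a sample only if no already kept sample lies closer than min_distance."""
    kept = []
    for i, p in enumerate(points):
        if all(math.dist(p, points[k]) >= min_distance for k in kept):
            kept.append(i)
    return kept

# Invented PCA score values (assumption): two tight clusters and one isolated sample
points = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.02),
          (1.0, 1.0), (1.01, 1.0), (3.0, 0.5)]
print(thin_redundant(points, min_distance=0.1))
```

Each tight cluster is reduced to one representative while the isolated sample is kept, so the covered range stays the same with far fewer spectra.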
[Figure] View on PCA scores plot of IV method with 7330 spectra. About 6500 spectra are from Indonesia (blue) and are quite similar.
[Figure] Detail view on PCA scores plot of IV method with 7330 spectra. About 6500 spectra are from Indonesia (blue) and are quite similar.
[Figure] Detail view on PCA scores plot of IV method with the spectra selected for exclusion. Data set reduced from 7330 spectra to 1162.
[Figure] Total view on PCA scores plot of IV method with the spectra selected for exclusion. Data set reduced from 7330 spectra to 1162.
[Figure] Test set validation with 687 spectra and 7330 spectra in calibration set. RMSEP = 0.73
[Figure] Calibration with 7330 spectra. RMSEE = 0.54
[Figure] Test set validation with 687 spectra and 1162 spectra in calibration set. RMSEP = 0.73 (before, with 7330 spectra: RMSEP = 0.73)
[Figure] Calibration with 1162 spectra. RMSEE = 0.81 (before, with 7330 spectra: RMSEE = 0.54)
Quant2 OPUS 7: Set Color in PCA Score Plot
[Screenshots: zoom in, set color, done]
Set dataset
• Selected spectra can be assigned to the calibration or test set or can be excluded
Set color for plots on page Graph
• Colors can be assigned to selected spectra for display in plots
• Colors may indicate:
  - Samples of different type or origin
  - Time of measurement, e.g. year
  - Special samples
  - Samples with very low or high property values
Parameter and validation settings
Parameter page for data pretreatment and spectral regions
Data pretreatment in any order and in any spectral ranges
CAUTION! Everything possible, but maybe not useful!
Data pretreatment in spectral regions selected for modeling
Interactive selection of spectral regions
Display preprocessed spectra
Display preprocessed spectra but only every x-th sample
Statistics for repeated measurements (replicates) on preprocessed spectra
Model calculation including validation
• For each activated component a separate PLS model is calculated based on the selected dataset(s)
• The maximum rank limits the complexity of the model (default is rank 10).
• Lower values for the maximum rank save calculation time. This is only useful if it is known that fewer factors are sufficient.
• More than 10 factors may be required for more complex applications, but the risk of unstable models increases.
Internal Validation
At present, two different types of validation are accepted:
1) cross-validation
2) test-set-validation
Important: Independent samples for internal validation
(Full) Cross Validation
Validation by successively excluding samples and putting them back.
During the cross validation all samples are temporarily independent of the calibration set.
[Diagram: Calibration Data Set and Test Sample]
Calculation of a temporary calibration model based on n-1 samples and prediction of the test sample.
The comparison of NIR prediction and reference value is part of the calculation of the Root Mean Square Error of Cross Validation (RMSECV).
This procedure is continued until all samples have been taken out, tested and put back into the calibration set.
Advantages of Cross Validation:
• All samples are used for calibration and validation, helpful for small data sets
Disadvantages of Cross Validation:
• The RMSECV is lower than the Root Mean Square Error for independent samples (RMSEP)
• Long calculation times during optimization
Test Set Validation
Definition of two different data sets (for example 50:50):
• Calibration Data Set: development of model
• Test Set: validation of model
Samples from the Test Set need to be independent of the Calibration Data Set.
Problem: Only 50% of the samples are used for calibration setup.
Good tool for data sets with a sufficient number of samples.
Cross validation, (full) cross validation
• The number of leave-out samples for cross validation depends on the number of samples in the calibration set
  - Too many leave-out samples lead to bad results because the temporarily calculated models are unstable
  - Leaving out one sample is not a challenge and gives over-optimistic, low RMSECV errors
• Rule of thumb: number of samples divided by 30 (= 30 passes during cross validation)
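The rule of thumb can be turned into a small generator of cross validation passes. The sketch leaves out whole samples, so that all spectra with the same sample number stay together, and uses consecutive blocks of samples; both the block-wise grouping and the function name are assumptions for illustration.

```python
def cross_validation_passes(sample_numbers, n_passes=30):
    """Yield (calibration_indices, left_out_indices); spectra of one sample stay together."""
    samples = sorted(set(sample_numbers))
    leave_out = max(1, len(samples) // n_passes)     # rule of thumb: samples / 30
    for start in range(0, len(samples), leave_out):
        block = set(samples[start:start + leave_out])
        left_out = [i for i, s in enumerate(sample_numbers) if s in block]
        calibration = [i for i, s in enumerate(sample_numbers) if s not in block]
        yield calibration, left_out

# Invented example (assumption): 120 samples with 2 spectra each
sample_numbers = [i // 2 + 1 for i in range(240)]
passes = list(cross_validation_passes(sample_numbers))
print(len(passes), len(passes[0][1]))
```

With 120 samples, four samples (eight spectra) are left out per pass, giving 30 passes. For each pass a temporary model is built from the calibration indices and used to predict the left-out spectra.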
Calibration results and statistics
NIR predictions vs. true values (reference) in the model validation
The green line is the ideal line on which the NIR prediction is equal to the true value (reference). It is not a regression line!
NIR predictions vs. true values (reference) for the calibration
Statistics for the model validation
Residual Prediction Deviation: RPD = SD/SECV or RPD = SD/SEP
SD = standard deviation of the true values (reference)
RPD > 3: acceptable model
RPD          Classification   Application
<1.0         very poor        not recommended
1.0 - 2.4    poor             not recommended
2.5 - 2.9    fair             rough screening
3.0 - 3.9    reasonable       screening
4.0 - 5.9    good             QC
6.0 - 7.9    very good        QA
8.0 - 10.0   excellent        any application
>10.0        superior         as good as reference
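A short calculation shows how RPD and its classification could be derived from a validation. The definition of SEP as the bias-corrected standard error of the predictions and the treatment of values between the listed classes (for example 2.45) are assumptions; the numbers are invented for illustration.

```python
import statistics

def rpd(reference, predicted):
    """RPD = SD of the reference values / bias-corrected standard error of prediction."""
    diffs = [p - r for p, r in zip(predicted, reference)]
    bias = sum(diffs) / len(diffs)
    sep = (sum((d - bias) ** 2 for d in diffs) / (len(diffs) - 1)) ** 0.5
    return statistics.stdev(reference) / sep

def classify(value):
    """Map an RPD value to the classification used in the table."""
    bounds = [(1.0, "very poor"), (2.5, "poor"), (3.0, "fair"),
              (4.0, "reasonable"), (6.0, "good"), (8.0, "very good")]
    for upper, label in bounds:
        if value < upper:
            return label
    return "excellent" if value <= 10.0 else "superior"

# Invented example values (assumption)
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.1, 3.9, 5.1]
value = rpd(reference, predicted)
print(value, classify(value))
```

Since SD grows with the covered value range, the same prediction error gives a higher RPD for a wider calibration range, similar to the behaviour of R2.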
Regression line, ideal case
[Figure: regression line (blue)]
Regression line, non ideal case
[Figure: regression line (blue)]
Differences vs. true values (reference)
The distribution of the deviations, and especially the range between minimum and maximum deviation, helps to check model performance.
Error vs. rank
Each factor contributes helpful information for lowering the error. After reaching a minimum the error increases again (overfitting).
Mahalanobis distance (MD) and spectral residuals
Only spectra in the upper right corner are potential outliers, but not spectra of samples with very low or high property values.
Quant2 OPUS 7: New Mahalanobis Distance threshold
To check the MD settings go to calibration!
For cross validation results the MD values are sometimes extreme because samples are outside the calibration when those values are obtained.
Before OPUS 7 the default threshold (including factor 2) was always too low. Adjustments were needed before storing the method or afterwards in OPUS LAB.
In OPUS 7 the threshold is set based on the calibration set statistics. Almost all calibration spectra will be below the threshold. This is logical because those samples belong to the calibration set.
Scores plot showing PLS scores
Statistics based on the predictions for repeated measurements
Regression coefficients (b-vector)
The regression coefficients show the weighting of the data points (wavenumbers or wavelengths) in the model.
PLS loadings (factors)
The loadings show where the spectral variance coded in this factor is located. It is important to look for noise loadings.
All plots as values in the full report
Component Value Density
These values can be used to define a threshold in OPUS LAB for indicating interesting samples for calibration updates.
Detection of relevant samples for calibration expansion by the prediction
[Figure: component value density (0-60) and model NIR prediction vs. true value (reference), 31-45]
Statistics based on the predictions for repeated measurements
Optimization tool and its settings
Optimization with NIR, A or B algorithm
• The NIR optimization calculates models using all combinations of five pre-defined or user-defined spectral ranges
• For the A and B optimization the test area is divided into 10 equally large (or user-defined) parts and these are combined:
  - For General A, starting from 1, regions are successively added
  - For General B, starting from 10, regions are successively removed
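The number of models tested by the NIR optimization follows from the number of region combinations. The sketch below enumerates every non-empty combination of five regions; reading "all combinations" as "every non-empty subset" is an assumption, and the five ranges are the user-defined regions for water (moisture) listed later in this document, used here only as example input.

```python
from itertools import combinations

def nir_combinations(regions):
    """All non-empty combinations of the given spectral regions."""
    return [combo
            for r in range(1, len(regions) + 1)
            for combo in combinations(regions, r)]

# Regions in cm-1 (upper limit, lower limit)
regions = [(10550, 9250), (7100, 6800), (6800, 6400), (6400, 6030), (5300, 4950)]
combos = nir_combinations(regions)
print(len(combos))
```

Five regions give 2^5 - 1 = 31 combinations per data pretreatment, which keeps the exhaustive search feasible; with ten regions the same approach would already need 1023 combinations, which is presumably why the A and B optimizations add or remove regions step by step instead.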
Direct transfer of settings to the parameter page for the selected model
Basic settings with a broad maximum test range
Pre-defined spectral ranges for NIR optimization
10 spectral ranges for A & B optimization by splitting the test range
User defined spectral ranges for A & B optimization
User defined dedicated optimization ranges
Overview NIR spectral regions
O-H
C-H
N-H
174
User defined regions for A optimization of C-H
and N-H (without water and water vapour)
9000 - 8000 cm-1
8000 - 7450 cm-1
6900 - 6770 cm-1
6770 - 6400 cm-1
6400 - 6030 cm-1
6030 - 5500 cm-1
4950 - 4770 cm-1
4770 - 4600 cm-1
4600 - 4500 cm-1
4500 - 3850 cm-1
[Figure legend: O-H, C-H, N-H]
Regions above 9000 cm-1 are normally not considered for reflection spectra obtained with an integrating sphere.
175
User defined regions for NIR
optimization of water (moisture)
10550 - 9250 cm-1
7100 - 6800 cm-1
6800 - 6400 cm-1
6400 - 6030 cm-1
5300 - 4950 cm-1
[Figure legend: O-H, C-H, N-H]
176
Suggested spectral regions for user
defined optimization with Quant2 (PLS)
A optimization for C-H and N-H:
9000 - 8000 cm-1
8000 - 7450 cm-1
6900 - 6770 cm-1
6770 - 6400 cm-1
6400 - 6030 cm-1
6030 - 5500 cm-1
4950 - 4770 cm-1
4770 - 4600 cm-1
4600 - 4500 cm-1
4500 - 3850 cm-1
NIR optimization for O-H:
10550 - 9250 cm-1
7100 - 6800 cm-1
6800 - 6400 cm-1
6400 - 6030 cm-1
5300 - 4950 cm-1
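For work outside OPUS, the suggested regions can be kept as plain data and turned into a selection mask over the wavenumber axis. The sketch below is only an illustration: the names `CH_NH_REGIONS`, `OH_REGIONS` and `region_mask` are hypothetical, and the wavenumber axis (12000 to 4000 cm-1 in steps of 8 cm-1) is an assumed example, not an instrument setting from the slides.

```python
import numpy as np

# Region limits in cm-1 (high, low), copied from the lists above.
CH_NH_REGIONS = [(9000, 8000), (8000, 7450), (6900, 6770), (6770, 6400),
                 (6400, 6030), (6030, 5500), (4950, 4770), (4770, 4600),
                 (4600, 4500), (4500, 3850)]
OH_REGIONS = [(10550, 9250), (7100, 6800), (6800, 6400), (6400, 6030),
              (5300, 4950)]

def region_mask(wavenumbers, regions):
    """Boolean mask that is True for data points inside any of the regions."""
    wn = np.asarray(wavenumbers, dtype=float)
    mask = np.zeros(wn.shape, dtype=bool)
    for high, low in regions:
        mask |= (wn <= high) & (wn >= low)
    return mask

# Assumed example axis: 12000 ... 4000 cm-1 with a spacing of 8 cm-1.
wavenumbers = np.arange(12000, 3998, -8)

ch_nh_mask = region_mask(wavenumbers, CH_NH_REGIONS)
oh_mask = region_mask(wavenumbers, OH_REGIONS)
print(ch_nh_mask.sum(), "data points selected for C-H / N-H")
print(oh_mask.sum(), "data points selected for O-H")
```

Applying such a mask to the columns of a spectra matrix keeps only the data points of the chosen regions before any further modeling.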
177
Quant2 file list for model validation
178
Different models can be tested at once
with a list of spectra
179
Adding true values (reference) for
comparison with predictions
180
Copy/paste of true values (reference)
for comparison with predictions
182
Predictions overview
183
Prediction vs. true value (reference)
with target and regression line (blue)
184
Easy comparison of different models
185
Difference vs. true value (reference)
with bias line (blue)
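The quantities behind these two plots can be reproduced with a few lines. This is a sketch using the common textbook definitions, not OPUS output: the sign convention (difference = prediction minus reference) is an assumption and may differ from the one used in the software, and the function name and example values are made up.

```python
import numpy as np

def prediction_statistics(true_values, predictions):
    """Bias, RMSEP, SEP and regression line of prediction vs. reference."""
    true_values = np.asarray(true_values, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    diff = predictions - true_values                  # assumed sign convention
    bias = diff.mean()                                # level of the bias line
    rmsep = np.sqrt(np.mean(diff ** 2))               # includes the bias
    sep = np.sqrt(np.sum((diff - bias) ** 2) / (diff.size - 1))  # bias-corrected
    slope, offset = np.polyfit(true_values, predictions, 1)      # regression line
    return {"bias": float(bias), "rmsep": float(rmsep), "sep": float(sep),
            "slope": float(slope), "offset": float(offset)}

# Made-up example: every prediction is 0.5 too high.
stats = prediction_statistics([31.0, 35.0, 39.0, 43.0], [31.5, 35.5, 39.5, 43.5])
print(stats)
```

In this example the error is pure bias: RMSEP equals the bias, SEP is zero, and the regression line is parallel to the target line.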
186
Quant2 Filelist OPUS 7: marking of MD
and calibration range outliers
Marking according to the indication in the table on page ‘Analysis Results’:
• MD/range OK
• MD not OK
• out of range
• MD and range not OK
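The four markings correspond to two independent checks, which can be sketched as below. MD is taken here to be the Mahalanobis distance of the spectrum; the function name, the limit and the calibration range values are hypothetical and only illustrate the logic, not how OPUS determines its limits.

```python
def outlier_status(md, md_limit, prediction, cal_min, cal_max):
    """Combine the MD check and the calibration range check into one label."""
    md_ok = md <= md_limit                        # spectrum similar to calibration spectra
    range_ok = cal_min <= prediction <= cal_max   # prediction inside calibration range
    if md_ok and range_ok:
        return "MD/range OK"
    if range_ok:
        return "MD not OK"
    if md_ok:
        return "out of range"
    return "MD and range not OK"

# Hypothetical limit and calibration range, for illustration only.
print(outlier_status(md=0.12, md_limit=0.30, prediction=38.2, cal_min=31.0, cal_max=45.0))
print(outlier_status(md=0.45, md_limit=0.30, prediction=47.1, cal_min=31.0, cal_max=45.0))
```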
Result statistics
190
74 PLS models for API in tablets:
calibration results
[Figure: RMSEP or RMSECV of calibration (y-axis, 0 to 10) for each of the 74 models (x-axis: Model)]
191
74 PLS models for API in tablets:
calibration and validation results
[Figure: RMSEP or RMSECV of calibration and RMSEP of validation (y-axis, 0 to 10) for each of the 74 models (x-axis: Model)]
192
Region with API information but bad
influence
193
Spectra of tablets and pure API
194
Maximized spectra of tablets and pure
API
195
Select regions related to API
Remove the API spectrum before starting the optimization!
196
Model robustness check by prediction of
independent samples across
instruments
• Sunflower samples were scanned on 3 Bruker instruments
• Each sample was scanned twice, with re-filling between the scans
• The same cup filling was measured on all instruments
• Predictions were made with 5 models obtained during the model
optimization process
• All models showed very similar calibration results but behave differently in
terms of
- prediction repeatability between re-fills on one instrument
- prediction repeatability between the instruments
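Both kinds of repeatability can be expressed as pooled standard deviations. The sketch below is one reasonable way to compute them and is not taken from the slides: the array layout (instruments x samples x re-fills), the function name and the numbers are assumptions for illustration.

```python
import numpy as np

def repeatability(predictions):
    """Pooled standard deviations for an array (instruments, samples, refills)."""
    p = np.asarray(predictions, dtype=float)
    # Between re-fills: variance over the re-fills of each sample,
    # averaged over the samples, one value per instrument.
    within = np.sqrt(p.var(axis=2, ddof=1).mean(axis=1))
    # Between instruments: variance over instruments of the per-sample
    # mean prediction, averaged over the samples.
    between = np.sqrt(p.mean(axis=2).var(axis=0, ddof=1).mean())
    return within, float(between)

# Made-up predictions: 2 instruments, 2 samples, 2 re-fills each.
predictions = np.array([[[10.0, 10.2], [20.0, 19.8]],
                        [[10.4, 10.6], [20.4, 20.2]]])

within, between = repeatability(predictions)
print("between re-fills, per instrument:", within)
print("between instruments:", between)
```

Computing these two numbers for every candidate model gives a direct basis for choosing among models with similar calibration errors.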
197
[Figures: protein predictions (y-axis, 18 to 38) plotted against the measurement index (1 to 191) for the instruments MPA 1, MPA 2 and MATRIX-I, one plot per model]

Model     RMSECV   SEP
Model 1   1.0      1.3
Model 2   0.99     1.7
Model 3   1.1      1.7
Model 4   1.1      1.7
Model 5   1.2      2.5
202
Modeling with big spectra data sets
transferred from Foss and new Bruker
data
At the time the Foss spectra are transferred, the number of available samples
is limited. Sometimes the reference values are not available or too old (e.g. for
moisture).
Nevertheless, as many samples as possible should be measured on the Bruker instrument.
Reference values are required for the calibration samples, but not for the
transfer samples.
For the modeling and the model selection, it is helpful to scan samples several
times so that models can be checked and selected by repeatability.
Never use transferred Foss spectra alone to create a model!
203
The modeling must be guided towards the characteristics of
Bruker spectra by a proper splitting of the data sets:
• Calibration set: as many Bruker spectra as available
• Test set: a good mix of Bruker and Foss data (e.g. 50:50)
• Validation set: 100% Bruker, if possible from different instruments
• Avoid overfitting by selecting a lower rank (fewer PCs)

Stage                                       Data set          Data
Model development (test set optimization)   Calibration set   Transferred Foss data + Bruker data
Model development (test set optimization)   Test set          Foss data + Bruker data
Model check & selection                     Validation set    Bruker data
204
Innovation with Integrity
© 2011 Bruker Corporation. All rights reserved.
www.bruker.com
