MACHINE LEARNING IN HIGH
ENERGY PHYSICS
LECTURE #3
Alex Rogozhnikov, 2015
RECAPITULATION
logistic regression and SVM
projections, kernels and regularizations
overfitting (2 definitions)
stochastic optimization
NEURAL NETWORKS
DECISION TREES FOR CLASSIFICATION
DECISION TREES FOR REGRESSION
PRUNING
COMPOSITIONS
Basic motivation: improve the quality of classification by
reusing the strong sides of different classifiers.
SIMPLE VOTING
Averaging predictions of individual classifiers:
$\hat{y} = [-1, +1, +1, +1, -1] \Rightarrow P_{+1} = 0.6,\ P_{-1} = 0.4$
Averaging predicted probabilities:
$P_{\pm 1}(x) = \frac{1}{J} \sum_{j=1}^{J} p_{\pm 1, j}(x)$
Averaging decision functions:
$D(x) = \frac{1}{J} \sum_{j=1}^{J} d_j(x)$
WEIGHTED VOTING
A way to introduce the importance of individual classifiers:
$D(x) = \sum_j \alpha_j\, d_j(x)$
GENERAL CASE OF ENSEMBLING:
$D(x) = f\left( d_1(x), d_2(x), \dots, d_J(x) \right)$
PROBLEMS
base classifiers are very similar to each other
need to keep variation between them
while still keeping good quality of the base classifiers
DECISION TREE
GENERATING TRAINING SUBSET
subsampling: taking a fixed part of the samples (sampling without replacement)
bagging (Bootstrap AGGregating): sampling with replacement
If #generated samples = length of dataset, the fraction of unique samples in the new dataset is $1 - \frac{1}{e} \approx 63.2\%$ (checked numerically below).
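A quick numerical check of the $1 - \frac{1}{e}$ figure, as a minimal numpy sketch (the dataset size is illustrative):

import numpy as np

n = 100000
# bootstrap: draw n indices with replacement from a dataset of size n
indices = np.random.randint(0, n, size=n)
# fraction of unique entries is close to 1 - 1/e ~ 0.632
print(len(np.unique(indices)) / float(n))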
RANDOM SUBSPACE MODEL (RSM)
Generate a random subspace of features by taking a random subset of the features.
RANDOM FOREST [LEO BREIMAN, 2001]
A random forest is a composition of decision trees.
Each tree is trained by
bagging samples
taking $m$ random features
Predictions are obtained via simple voting.
[Figure: data with the optimal boundary; random forest decision boundaries for 50 trees and 2000 trees]
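For reference, a minimal scikit-learn sketch of the random forest described above; parameter values are illustrative and X, y are assumed to be the training data:

from sklearn.ensemble import RandomForestClassifier

# each split considers a random subset of sqrt(n_features) features
forest = RandomForestClassifier(n_estimators=50, max_features='sqrt')
forest.fit(X, y)
proba = forest.predict_proba(X)   # averaged predictions of the individual trees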
OVERFITTING
overfitted in the sense that predictions for train and test differ
doesn't overfit in the sense that increasing complexity (adding more trees) doesn't spoil the classifier
Works with features of different nature
Stable to noise in data
From 'Testing 179 Classifiers on 121 Datasets'
The classifiers most likely to be the bests
are the random forest (RF) versions, the
best of which [...] achieves 94.1% of the
maximum accuracy overcoming 90% in the
84.3% of the data sets.
RANDOM FOREST SUMMARY
Impressively simple
Trees can be trained in parallel
Doesn't overfit
Doesn't require much tuning
Effectively only one parameter:
number of features used in each tree
Recommendation: $N_{\text{used}} = \sqrt{N_{\text{features}}}$
Hardly interpretable
COMPARING DISTRIBUTIONS
1d: Kolmogorov-Smirnov test
with more features this becomes a problem, but we can compute KS over each of the variables
the 1d results can hardly be combined together
COMPARING DISTRIBUTIONS
COMPARING DISTRIBUTIONS OF
POSITIVE AND NEGATIVE TRACKS
USING CLASSIFIER
Want to compute significance?
Use ROC AUC + Mann-Whitney U test
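A hedged sketch of this check with scipy and scikit-learn; predictions and labels (numpy arrays, 1 for positive tracks, 0 for negative) are placeholders:

from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(labels, predictions)
# Mann-Whitney U test between the two groups of classifier outputs
u_stat, p_value = mannwhitneyu(predictions[labels == 1], predictions[labels == 0])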
SAMPLE WEIGHTS IN ML
Can be used with many estimators.
$x_i, y_i, w_i$, where $i$ is the index of an event
weight corresponds to the frequency of observation
expected behavior: $w_i = n$ is the same as having $n$ copies of the $i$-th event
global normalization doesn't matter
Example for logistic regression:
$\mathcal{L} = \sum_i w_i\, L(x_i, y_i) \to \min$
Weights (parameters) of a classifier ≠ sample weights
In code:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X, y, sample_weight=weights)
Sample weights are a convenient way to regulate the importance of training events.
Only sample weights are used in this lecture.
ADABOOST [FREUND, SCHAPIRE, 1995]
Bagging: information from previously built trees is not taken into account.
Adaptive Boosting is a weighted composition of weak learners:
$D(x) = \sum_j \alpha_j\, d_j(x)$
We assume $d_j(x) = \pm 1$ and labels $y_i = \pm 1$;
the $j$-th weak learner misclassified the $i$-th event iff $y_i\, d_j(x_i) = -1$.
ADABOOST
$D(x) = \sum_j \alpha_j\, d_j(x)$
Weak learners are built in sequence;
each next classifier is trained using different weights
initially $w_i = 1$ for each training sample
After building the $j$-th base classifier:
1. $\alpha_j = \frac{1}{2} \ln \left( \frac{w_{\text{correct}}}{w_{\text{wrong}}} \right)$
2. increase weight of misclassified events: $w_i \leftarrow w_i \times e^{-\alpha_j\, y_i\, d_j(x_i)}$ (sketched below)
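The rules above fit in a few lines; a rough numpy/scikit-learn sketch of the training loop with decision stumps (not a reference implementation; X, y with labels ±1 are assumed):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

w = np.ones(len(y))                       # initially w_i = 1
alphas, stumps = [], []
for j in range(100):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    w_correct = w[pred == y].sum()
    w_wrong = w[pred != y].sum()
    alpha = 0.5 * np.log(w_correct / w_wrong)
    w = w * np.exp(-alpha * y * pred)     # misclassified events get larger weights
    alphas.append(alpha)
    stumps.append(stump)

# D(x) = sum_j alpha_j d_j(x)
D = sum(a * s.predict(X) for a, s in zip(alphas, stumps))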
ADABOOST EXAMPLE
Decision trees of depth 1 will be used.
ADABOOST SECRET
$D(x) = \sum_j \alpha_j\, d_j(x)$
$\mathcal{L} = \sum_i L(x_i, y_i) = \sum_i \exp\left( -y_i\, D(x_i) \right) \to \min$
$\alpha_j$ is obtained as the result of analytical optimization
the sample weight is equal to the penalty for the event:
$w_i = L(x_i, y_i) = \exp\left( -y_i\, D(x_i) \right)$
LOSS FUNCTION OF ADABOOST
ADABOOST SUMMARY
able to combine many weak learners
takes mistakes into account
simple, overhead is negligible
too sensitive to outliers
x MINUTES BREAK
DECISION TREES FOR REGRESSION
GRADIENT BOOSTING [FRIEDMAN, 1999]
a composition of weak learners,
$D(x) = \sum_j \alpha_j\, d_j(x)$
$p_{+1}(x) = \sigma(D(x)), \quad p_{-1}(x) = \sigma(-D(x))$
Optimization of the log-likelihood:
$\mathcal{L} = \sum_i L(x_i, y_i) = \sum_i \ln\left( 1 + e^{-y_i\, D(x_i)} \right) \to \min$
GRADIENT BOOSTING
$D(x) = \sum_j \alpha_j\, d_j(x)$
$\mathcal{L} = \sum_i \ln\left( 1 + e^{-y_i\, D(x_i)} \right) \to \min$
Optimization problem: find all $\alpha_j$ and weak learners $d_j$
Mission impossible
Main point: greedy optimization of the loss function by training one more weak learner $d_j$
Each new estimator follows the gradient of the loss function
GRADIENT BOOSTING
Gradient boosting ~ steepest gradient descent.
$D_j(x) = \sum_{j'=1}^{j} \alpha_{j'}\, d_{j'}(x)$
$D_j(x) = D_{j-1}(x) + \alpha_j\, d_j(x)$
At the $j$-th iteration (sketched below):
pseudo-residual: $z_i = -\left. \frac{\partial \mathcal{L}}{\partial D(x_i)} \right|_{D(x) = D_{j-1}(x)}$
train regressor $d_j$ to minimize MSE: $\sum_i \left( d_j(x_i) - z_i \right)^2 \to \min$
find optimal $\alpha_j$
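As an illustration of the steps above, a minimal sketch of gradient boosting for the LogLoss with regression trees; for brevity the step size is kept fixed instead of being optimized, and X, y (labels ±1) are assumed:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

D = np.zeros(len(y))          # decision function on the training set
trees, alpha = [], 0.5        # fixed step size instead of a line search
for j in range(100):
    # pseudo-residuals of LogLoss: z_i = y_i / (1 + exp(y_i * D(x_i)))
    z = y / (1.0 + np.exp(y * D))
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, z)            # regressor trained with MSE on pseudo-residuals
    D = D + alpha * tree.predict(X)
    trees.append(tree)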
ADDITIONAL GB TRICKS
to make training more stable, add a learning rate $\eta$:
$D_j(x) = \eta \sum_j \alpha_j\, d_j(x)$
randomization to fight noise and build different trees:
subsampling of features and training samples
AdaBoost is a particular case of gradient boosting with a different target loss function*:
$\mathcal{L} = \sum_i e^{-y_i\, D(x_i)} \to \min$
This loss function is called ExpLoss or AdaLoss.
*(AdaBoost also expects that $d_j(x_i) = \pm 1$)
LOSS FUNCTIONS
Gradient boosting can optimize different smooth loss functions.
regression, $y \in \mathbb{R}$:
Mean Squared Error: $\sum_i \left( d(x_i) - y_i \right)^2$
Mean Absolute Error: $\sum_i \left| d(x_i) - y_i \right|$
binary classification, $y_i = \pm 1$:
ExpLoss (aka AdaLoss): $\sum_i e^{-y_i\, d(x_i)}$
LogLoss: $\sum_i \log\left( 1 + e^{-y_i\, d(x_i)} \right)$
EXAMPLE: REGRESSION WITH GB
using regression trees of depth=2
number of trees = 1, 2, 3, 100
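A possible way to reproduce this kind of example with scikit-learn (parameter values are illustrative, X, y are assumed training data):

from sklearn.ensemble import GradientBoostingRegressor

gb = GradientBoostingRegressor(n_estimators=100, max_depth=2, learning_rate=0.5)
gb.fit(X, y)
# predictions after 1, 2, 3, ..., 100 trees (as in the plots above)
stage_predictions = list(gb.staged_predict(X))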
ADAPTING BOOSTING
By modifying the boosting procedure or changing the loss function we can solve different problems:
classification
regression
ranking
We can also add restrictions, e.g. fight correlation with mass.
LOSS FUNCTION: RANKING EXAMPLE
In ranking we need to order items by $y_i$:
$y_i < y_j \Rightarrow d(x_i) < d(x_j)$
We can add a penalization term for misordering (sketched below):
$\mathcal{L} = \sum_{ij} L(x_i, x_j, y_i, y_j)$
$L(x_i, x_j, y_i, y_j) = \begin{cases} \sigma\left( d(x_i) - d(x_j) \right), & y_i < y_j \\ 0, & \text{otherwise} \end{cases}$
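A brute-force numpy sketch of the pairwise penalty above (quadratic in the number of items, purely illustrative):

import numpy as np

def pairwise_ranking_loss(d, y):
    """Sum of sigmoid penalties over all pairs with y_i < y_j."""
    loss = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] < y[j]:
                # we want d[i] < d[j]: penalize when d[i] is not below d[j]
                loss += 1.0 / (1.0 + np.exp(-(d[i] - d[j])))
    return loss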
BOOSTING TO UNIFORMITY
The point of uniform boosting is to have constant efficiency along some variable.
Examples:
flat background efficiency along mass
flat signal efficiency for different flight time
flat signal efficiency along Dalitz variable
EXAMPLE: NON-FLAT BACKGROUND
EFFICIENCY ALONG MASS
High correlation with mass will create a false peaking signal out of pure background.
Aim: $\text{FPR} = \text{const}$ for different regions in mass.
uBoostBDT
A variation of the AdaBoost approach, aim: $\text{FPR}_{\text{region}} = \text{const}$.
fix a target efficiency (say $\text{FPR}_{\text{target}} = 30\%$), find the corresponding threshold
train a tree, its decision function is $d_j(x)$
increase weight for misclassification: $w_i \leftarrow w_i \exp\left( -\alpha\, y_i\, d_j(x_i) \right)$
increase weight of signal events in regions with high FPR: $w_i \leftarrow w_i \exp\left( \beta \left( \text{FPR}_{\text{region}} - \text{FPR}_{\text{target}} \right) \right)$
uBoost
uBoost is an ensemble over uBoostBDTs; each uBoostBDT uses its own global FPR.
A uBoostBDT returns 0 or 1 (whether the event passed the threshold corresponding to its target FPR); simple voting is used to obtain predictions.
drives the selection towards uniformity
very complex training
many classifiers
estimation of the threshold in uBoostBDT may be biased
MEASURING NON-UNIFORMITY
$\text{CvM} = \sum_{\text{region}} \int \left| F_{\text{region}}(s) - F_{\text{global}}(s) \right|^2 \, dF_{\text{global}}(s)$
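A rough numpy sketch of this measure; decisions are classifier outputs and regions is an array of region labels (e.g. mass bins). This is an illustration of the formula above, not the exact definition used in any particular package:

import numpy as np

def cvm_nonuniformity(decisions, regions):
    """Sum over regions of the squared difference between the per-region
    and global CDFs of the decisions, integrated over dF_global."""
    s = np.sort(decisions)                  # integration points ~ dF_global
    F_global = np.searchsorted(s, s, side='right') / float(len(s))
    total = 0.0
    for region in np.unique(regions):
        in_region = np.sort(decisions[regions == region])
        F_region = np.searchsorted(in_region, s, side='right') / float(len(in_region))
        total += np.mean((F_region - F_global) ** 2)
    return total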
FLATNESS LOSS
Put an additional term in the loss function which penalizes non-uniformity:
$\mathcal{L} = \mathcal{L}_{\text{exploss}} + c\, \text{FL}$
Flatness loss approximates the (non-differentiable) CvM metric:
$\text{FL} = \sum_{\text{region}} \int \left| F_{\text{region}}(s) - F_{\text{global}}(s) \right|^2 \, ds$
$\frac{\partial\, \text{FL}}{\partial D(x_i)} \cong \left. 2 \left( F_{\text{region}}(s) - F_{\text{global}}(s) \right) \right|_{s = D(x_i)}$
EXAMPLE (EFFICIENCY OVER BACKGROUND)
GRADIENT BOOSTING
general-purpose flexible algorithm
usually over trees
state-of-the-art results in many areas
can overfit
usually needs tuning
THE END
