MACHINE LEARNING IN HIGH
ENERGY PHYSICS
LECTURE #3
Alex Rogozhnikov, 2015
RECAPITULATION
logistic regression and SVM
projections, kernels and regularizations
overfitting (2 definitions)
stochastic optimization
NEURAL NETWORKS
DECISION TREES FOR CLASSIFICATION
DECISION TREES FOR REGRESSION
PRUNING
COMPOSITIONS
Basic motivation: improve the quality of classification by
reusing the strong sides of different classifiers.
SIMPLE VOTING
Averaging predictions of individual classifiers:
$\hat{y} = [-1, +1, +1, +1, -1] \Rightarrow P_{+1} = 0.6,\ P_{-1} = 0.4$
Averaging predicted probabilities:
$P_{\pm 1}(x) = \frac{1}{J} \sum_{j=1}^{J} p_{\pm 1, j}(x)$
Averaging decision functions:
$D(x) = \frac{1}{J} \sum_{j=1}^{J} d_j(x)$
WEIGHTED VOTING
A way to introduce the importance of individual classifiers:
$D(x) = \sum_j \alpha_j\, d_j(x)$
GENERAL CASE OF ENSEMBLING:
$D(x) = f\left( d_1(x), d_2(x), \dots, d_J(x) \right)$
PROBLEMS
base classifiers are very similar to each other
need to keep variation between them
while still keeping good quality of the base classifiers
DECISION TREE
GENERATING TRAINING SUBSET
subsampling: taking a fixed part of the samples (sampling without replacement)
bagging (Bootstrap AGGregating): sampling with replacement
If #generated samples = length of dataset, the fraction of unique samples in the new dataset is $1 - \frac{1}{e} \approx 63.2\%$ (checked numerically below).
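A quick numerical check of the $1 - \frac{1}{e}$ figure, as a minimal numpy sketch (the dataset size is illustrative):

import numpy as np

n = 100000
# bootstrap: draw n indices with replacement from a dataset of size n
indices = np.random.randint(0, n, size=n)
# fraction of unique entries is close to 1 - 1/e ~ 0.632
print(len(np.unique(indices)) / float(n))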
RANDOM SUBSPACE MODEL (RSM)
Generate a random subspace of features by taking a random subset of the features.
RANDOM FOREST [LEO BREIMAN, 2001]
A random forest is a composition of decision trees.
Each tree is trained by
bagging samples
taking $m$ random features
Predictions are obtained via simple voting.
[Figure: data with the optimal boundary; random forest decision boundaries for 50 trees and 2000 trees]
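For reference, a minimal scikit-learn sketch of the random forest described above; parameter values are illustrative and X, y are assumed to be the training data:

from sklearn.ensemble import RandomForestClassifier

# each split considers a random subset of sqrt(n_features) features
forest = RandomForestClassifier(n_estimators=50, max_features='sqrt')
forest.fit(X, y)
proba = forest.predict_proba(X)   # averaged predictions of the individual trees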
OVERFITTING
overfitted in the sense that predictions for train and test differ
doesn't overfit in the sense that increasing complexity (adding more trees) doesn't spoil the classifier
Works with features of different nature
Stable to noise in data
From 'Testing 179 Classifiers on 121 Datasets'
The classifiers most likely to be the bests
are the random forest (RF) versions, the
best of which [...] achieves 94.1% of the
maximum accuracy overcoming 90% in the
84.3% of the data sets.
RANDOM FOREST SUMMARY
Impressively simple
Trees can be trained in parallel
Doesn't overfit
Doesn't require much tuning
Effectively only one parameter:
number of features used in each tree
Recommendation: $N_{\text{used}} = \sqrt{N_{\text{features}}}$
Hardly interpretable
COMPARING DISTRIBUTIONS
1d: Kolmogorov-Smirnov test
with more features this becomes a problem, but we can compute KS over each of the variables
the 1d results can hardly be combined together
COMPARING DISTRIBUTIONS
COMPARING DISTRIBUTIONS OF
POSITIVE AND NEGATIVE TRACKS
USING CLASSIFIER
Want to compute significance?
Use ROC AUC + Mann-Whitney U test
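A hedged sketch of this check with scipy and scikit-learn; predictions and labels (numpy arrays, 1 for positive tracks, 0 for negative) are placeholders:

from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(labels, predictions)
# Mann-Whitney U test between the two groups of classifier outputs
u_stat, p_value = mannwhitneyu(predictions[labels == 1], predictions[labels == 0])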
SAMPLE WEIGHTS IN ML
Can be used with many estimators.
$x_i, y_i, w_i$, where $i$ is the index of an event
weight corresponds to the frequency of observation
expected behavior: $w_i = n$ is the same as having $n$ copies of the $i$-th event
global normalization doesn't matter
Example for logistic regression:
$\mathcal{L} = \sum_i w_i\, L(x_i, y_i) \to \min$
Weights (parameters) of a classifier ≠ sample weights
In code:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X, y, sample_weight=weights)
Sample weights are a convenient way to regulate the importance of training events.
Only sample weights are used in this lecture.
ADABOOST [FREUND, SCHAPIRE, 1995]
Bagging: information from previously built trees is not taken into account.
Adaptive Boosting is a weighted composition of weak learners:
$D(x) = \sum_j \alpha_j\, d_j(x)$
We assume $d_j(x) = \pm 1$ and labels $y_i = \pm 1$;
the $j$-th weak learner misclassified the $i$-th event iff $y_i\, d_j(x_i) = -1$.
ADABOOST
$D(x) = \sum_j \alpha_j\, d_j(x)$
Weak learners are built in sequence;
each next classifier is trained using different weights
initially $w_i = 1$ for each training sample
After building the $j$-th base classifier:
1. $\alpha_j = \frac{1}{2} \ln \left( \frac{w_{\text{correct}}}{w_{\text{wrong}}} \right)$
2. increase weight of misclassified events: $w_i \leftarrow w_i \times e^{-\alpha_j\, y_i\, d_j(x_i)}$ (sketched below)
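The rules above fit in a few lines; a rough numpy/scikit-learn sketch of the training loop with decision stumps (not a reference implementation; X, y with labels ±1 are assumed):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

w = np.ones(len(y))                       # initially w_i = 1
alphas, stumps = [], []
for j in range(100):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    w_correct = w[pred == y].sum()
    w_wrong = w[pred != y].sum()
    alpha = 0.5 * np.log(w_correct / w_wrong)
    w = w * np.exp(-alpha * y * pred)     # misclassified events get larger weights
    alphas.append(alpha)
    stumps.append(stump)

# D(x) = sum_j alpha_j d_j(x)
D = sum(a * s.predict(X) for a, s in zip(alphas, stumps))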
ADABOOST EXAMPLE
Decision trees of depth 1 will be used.
ADABOOST SECRET
$D(x) = \sum_j \alpha_j\, d_j(x)$
$\mathcal{L} = \sum_i L(x_i, y_i) = \sum_i \exp\left( -y_i\, D(x_i) \right) \to \min$
$\alpha_j$ is obtained as the result of analytical optimization
the sample weight is equal to the penalty for the event:
$w_i = L(x_i, y_i) = \exp\left( -y_i\, D(x_i) \right)$
LOSS FUNCTION OF ADABOOST
ADABOOST SUMMARY
able to combine many weak learners
takes mistakes into account
simple, overhead is negligible
too sensitive to outliers
x MINUTES BREAK
DECISION TREES FOR REGRESSION
GRADIENT BOOSTING [FRIEDMAN, 1999]
a composition of weak learners,
$D(x) = \sum_j \alpha_j\, d_j(x)$
$p_{+1}(x) = \sigma(D(x)), \quad p_{-1}(x) = \sigma(-D(x))$
Optimization of the log-likelihood:
$\mathcal{L} = \sum_i L(x_i, y_i) = \sum_i \ln\left( 1 + e^{-y_i\, D(x_i)} \right) \to \min$
GRADIENT BOOSTING
$D(x) = \sum_j \alpha_j\, d_j(x)$
$\mathcal{L} = \sum_i \ln\left( 1 + e^{-y_i\, D(x_i)} \right) \to \min$
Optimization problem: find all $\alpha_j$ and weak learners $d_j$
Mission impossible
Main point: greedy optimization of the loss function by training one more weak learner $d_j$
Each new estimator follows the gradient of the loss function
GRADIENT BOOSTING
Gradient boosting ~ steepest gradient descent.
$D_j(x) = \sum_{j'=1}^{j} \alpha_{j'}\, d_{j'}(x)$
$D_j(x) = D_{j-1}(x) + \alpha_j\, d_j(x)$
At the $j$-th iteration (sketched below):
pseudo-residual: $z_i = -\left. \frac{\partial \mathcal{L}}{\partial D(x_i)} \right|_{D(x) = D_{j-1}(x)}$
train regressor $d_j$ to minimize MSE: $\sum_i \left( d_j(x_i) - z_i \right)^2 \to \min$
find optimal $\alpha_j$
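As an illustration of the steps above, a minimal sketch of gradient boosting for the LogLoss with regression trees; for brevity the step size is kept fixed instead of being optimized, and X, y (labels ±1) are assumed:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

D = np.zeros(len(y))          # decision function on the training set
trees, alpha = [], 0.5        # fixed step size instead of a line search
for j in range(100):
    # pseudo-residuals of LogLoss: z_i = y_i / (1 + exp(y_i * D(x_i)))
    z = y / (1.0 + np.exp(y * D))
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, z)            # regressor trained with MSE on pseudo-residuals
    D = D + alpha * tree.predict(X)
    trees.append(tree)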
ADDITIONAL GB TRICKS
to make training more stable, add a learning rate $\eta$:
$D_j(x) = \eta \sum_j \alpha_j\, d_j(x)$
randomization to fight noise and build different trees:
subsampling of features and training samples
AdaBoost is a particular case of gradient boosting with a different target loss function*:
$\mathcal{L} = \sum_i e^{-y_i\, D(x_i)} \to \min$
This loss function is called ExpLoss or AdaLoss.
*(AdaBoost also expects that $d_j(x_i) = \pm 1$)
LOSS FUNCTIONS
Gradient boosting can optimize different smooth loss functions.
regression, $y \in \mathbb{R}$:
Mean Squared Error: $\sum_i \left( d(x_i) - y_i \right)^2$
Mean Absolute Error: $\sum_i \left| d(x_i) - y_i \right|$
binary classification, $y_i = \pm 1$:
ExpLoss (aka AdaLoss): $\sum_i e^{-y_i\, d(x_i)}$
LogLoss: $\sum_i \log\left( 1 + e^{-y_i\, d(x_i)} \right)$
EXAMPLE: REGRESSION WITH GB
using regression trees of depth=2
number of trees = 1, 2, 3, 100
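A possible way to reproduce this kind of example with scikit-learn (parameter values are illustrative, X, y are assumed training data):

from sklearn.ensemble import GradientBoostingRegressor

gb = GradientBoostingRegressor(n_estimators=100, max_depth=2, learning_rate=0.5)
gb.fit(X, y)
# predictions after 1, 2, 3, ..., 100 trees (as in the plots above)
stage_predictions = list(gb.staged_predict(X))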
ADAPTING BOOSTING
By modifying the boosting procedure or changing the loss function we can solve different problems:
classification
regression
ranking
We can also add restrictions, e.g. fight correlation with mass.
LOSS FUNCTION: RANKING EXAMPLE
In ranking we need to order items by $y_i$:
$y_i < y_j \Rightarrow d(x_i) < d(x_j)$
We can add a penalization term for misordering (sketched below):
$\mathcal{L} = \sum_{ij} L(x_i, x_j, y_i, y_j)$
$L(x_i, x_j, y_i, y_j) = \begin{cases} \sigma\left( d(x_i) - d(x_j) \right), & y_i < y_j \\ 0, & \text{otherwise} \end{cases}$
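A brute-force numpy sketch of the pairwise penalty above (quadratic in the number of items, purely illustrative):

import numpy as np

def pairwise_ranking_loss(d, y):
    """Sum of sigmoid penalties over all pairs with y_i < y_j."""
    loss = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] < y[j]:
                # we want d[i] < d[j]: penalize when d[i] is not below d[j]
                loss += 1.0 / (1.0 + np.exp(-(d[i] - d[j])))
    return loss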
BOOSTING TO UNIFORMITY
The point of uniform boosting is to have constant efficiency along some variable.
Examples:
flat background efficiency along mass
flat signal efficiency for different flight time
flat signal efficiency along Dalitz variable
EXAMPLE: NON-FLAT BACKGROUND
EFFICIENCY ALONG MASS
High correlation with mass will create a false peaking signal out of pure background.
Aim: $\text{FPR} = \text{const}$ for different regions in mass.
uBoostBDT
A variation of the AdaBoost approach, aim: $\text{FPR}_{\text{region}} = \text{const}$.
fix a target efficiency (say $\text{FPR}_{\text{target}} = 30\%$), find the corresponding threshold
train a tree, its decision function is $d_j(x)$
increase weight for misclassification: $w_i \leftarrow w_i \exp\left( -\alpha\, y_i\, d_j(x_i) \right)$
increase weight of signal events in regions with high FPR: $w_i \leftarrow w_i \exp\left( \beta \left( \text{FPR}_{\text{region}} - \text{FPR}_{\text{target}} \right) \right)$
uBoost
uBoost is an ensemble over uBoostBDTs; each uBoostBDT uses its own global FPR.
A uBoostBDT returns 0 or 1 (whether the event passed the threshold corresponding to its target FPR); simple voting is used to obtain predictions.
drives the selection towards uniformity
very complex training
many classifiers
estimation of the threshold in uBoostBDT may be biased
MEASURING NON-UNIFORMITY
$\text{CvM} = \sum_{\text{region}} \int \left| F_{\text{region}}(s) - F_{\text{global}}(s) \right|^2 \, dF_{\text{global}}(s)$
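A rough numpy sketch of this measure; decisions are classifier outputs and regions is an array of region labels (e.g. mass bins). This is an illustration of the formula above, not the exact definition used in any particular package:

import numpy as np

def cvm_nonuniformity(decisions, regions):
    """Sum over regions of the squared difference between the per-region
    and global CDFs of the decisions, integrated over dF_global."""
    s = np.sort(decisions)                  # integration points ~ dF_global
    F_global = np.searchsorted(s, s, side='right') / float(len(s))
    total = 0.0
    for region in np.unique(regions):
        in_region = np.sort(decisions[regions == region])
        F_region = np.searchsorted(in_region, s, side='right') / float(len(in_region))
        total += np.mean((F_region - F_global) ** 2)
    return total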
FLATNESS LOSS
Put an additional term in the loss function which penalizes non-uniformity:
$\mathcal{L} = \mathcal{L}_{\text{exploss}} + c\, \text{FL}$
Flatness loss approximates the (non-differentiable) CvM metric:
$\text{FL} = \sum_{\text{region}} \int \left| F_{\text{region}}(s) - F_{\text{global}}(s) \right|^2 \, ds$
$\frac{\partial\, \text{FL}}{\partial D(x_i)} \cong \left. 2 \left( F_{\text{region}}(s) - F_{\text{global}}(s) \right) \right|_{s = D(x_i)}$
EXAMPLE (EFFICIENCY OVER BACKGROUND)
GRADIENT BOOSTING
general-purpose flexible algorithm
usually over trees
state-of-the-art results in many areas
can overfit
usually needs tuning
THE END
