Anomaly / Outlier Detection
Introduction
• An anomaly is a pattern in the data that does not conform to the expected behaviour; such patterns are also called outliers, exceptions, peculiarities, surprises, etc.
• Anomalies are data points that are considerably different from the remaining data.
• Outliers are data points that are considered out of the ordinary or abnormal; this includes noise.
• Anomalies are a special kind of outlier that carries significant, critical, or actionable information of interest to analysts.
• Anomalous events occur infrequently.
• Anomalies translate to significant (often critical) real life entities
• Cyber intrusions
• Credit card fraud
• Applications:
• Credit card fraud detection
• Telecommunication fraud detection
• Network intrusion detection
• Fault detection
Example
• N1 and N2 are regions of normal behavior
• Points o1 and o2 are anomalies
• Points in region O3 are anomalies
[Figure: two-dimensional scatter plot (axes X and Y) with dense normal regions N1 and N2, isolated anomalous points o1 and o2, and a small anomalous region O3]
Aspects of Anomaly Detection Problem
• Nature of input data
• Availability of supervision
• Type of anomaly: point, contextual, collective
• Output of anomaly detection
• Evaluation of anomaly detection techniques
Input Data
• The most common form of data handled by anomaly detection techniques is record data
• Univariate
• Multivariate
Tid  SrcIP          Start time  Dest IP         Dest Port  Number of bytes  Attack
1    206.135.38.95  11:07:20    160.94.179.223  139        192              No
2    206.163.37.95  11:13:56    160.94.179.219  139        195              No
3    206.163.37.95  11:14:29    160.94.179.217  139        180              No
4    206.163.37.95  11:14:30    160.94.179.255  139        199              No
5    206.163.37.95  11:14:32    160.94.179.254  139        19               Yes
6    206.163.37.95  11:14:35    160.94.179.253  139        177              No
7    206.163.37.95  11:14:36    160.94.179.252  139        172              No
8    206.163.37.95  11:14:38    160.94.179.251  139        285              Yes
9    206.163.37.95  11:14:41    160.94.179.250  139        195              No
10   206.163.37.95  11:14:44    160.94.179.249  139        163              Yes
Input Data – Nature of Attributes
• Nature of attributes
• Binary
• Categorical
• Continuous
Tid  SrcIP          Duration  Dest IP         Number of bytes  Internal
1    206.163.37.81  0.10      160.94.179.208  150              No
2    206.163.37.99  0.27      160.94.179.235  208              No
3    160.94.123.45  1.23      160.94.179.221  195              Yes
4    206.163.37.37  112.03    160.94.179.253  199              No
5    206.163.37.41  0.32      160.94.179.244  181              No
Input Data – Complex Data Types
• Relationship among data instances
• Sequential
• Temporal
• Spatial
• Spatio-temporal
• Graph
Example (sequence data – a DNA fragment):
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Data Labels
• Supervised Anomaly Detection
• Labels available for both normal data and anomalies
• Semi-supervised Anomaly Detection
• Labels available only for normal data
• Unsupervised Anomaly Detection
• No labels assumed
• Based on the assumption that anomalies are very rare compared to normal data
Type of Anomaly
• Point Anomalies
• Contextual Anomalies
• Collective Anomalies
Point Anomalies
• An individual data instance is anomalous with respect to the rest of the data
[Figure: the same scatter plot as before – points o1 and o2 and region O3 are point anomalies relative to the normal regions N1 and N2]
Contextual Anomalies
• An individual data instance is anomalous within a context
• Requires a notion of context
• Also called conditional anomalies
[Figure: time-series plot in which one value is labelled "Anomaly" because it is unusual for its context, while a similar value elsewhere is labelled "Normal"]
Collective Anomalies
• A collection of related data instances is anomalous.
• Requires a relationship among data instances
• Sequential Data
• Spatial Data
• Graph Data
• The individual instances within a collective anomaly are not anomalous by themselves (a simple detection sketch for sequential data follows the figure).
[Figure: time series with a highlighted anomalous subsequence]
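As a minimal illustration of the sequential case (my sketch, not from the slides): in a periodic signal, a "flatlined" stretch is a collective anomaly even though each individual value lies within the normal range. The window length and the variability cutoff below are assumed values.

```python
import numpy as np

def flat_subsequences(series, window=20, frac=0.1):
    """Return start indices of windows whose variability is below frac of the
    overall variability (window and frac are illustrative choices)."""
    x = np.asarray(series, dtype=float)
    stds = np.array([x[i:i + window].std() for i in range(len(x) - window + 1)])
    return np.where(stds < frac * x.std())[0]

# A sine wave "stuck" at 0.0 for a stretch: each stuck value is within the
# normal range, but the flat subsequence as a whole is anomalous.
t = np.arange(400)
signal = np.sin(2 * np.pi * t / 40)
signal[200:260] = 0.0
print(flat_subsequences(signal))
```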
Output of Anomaly Detection
• Label
• Each test instance is given a normal or anomaly label
• Score
• Each test instance is assigned an anomaly score
• Allows the output to be ranked
• Requires an additional threshold parameter (see the sketch below)
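As a quick illustration (my example, not from the slides) of turning a score output into labels: the score values and the threshold below are assumptions, and higher scores mean more anomalous.

```python
import numpy as np

# Hypothetical anomaly scores from some detector (higher = more anomalous)
scores = np.array([0.1, 0.3, 0.2, 4.5, 0.4, 0.2, 3.9])
threshold = 1.0  # assumed value; must be tuned for the application

labels = np.where(scores > threshold, "anomaly", "normal")  # label output
ranking = np.argsort(-scores)                               # scores also allow ranking
print(labels)
print(ranking)
```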
Anomaly Detection Schemes
• General Steps
• Build a profile of the “normal” behavior
• Profile can be patterns or summary statistics for the overall population
• Use the “normal” profile to detect anomalies
• Anomalies are observations whose characteristics differ significantly from the normal profile (a statistical-based sketch follows this list)
• Types of anomaly detection schemes
• Graphical & Statistical-based
• Distance-based
• Model-based
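A minimal sketch of the general steps using a statistical-based scheme (my example, not from the slides): the "normal" profile is summarized by the training mean and standard deviation, and observations whose z-score exceeds a cutoff are flagged. The synthetic data and the cutoff of 3 are assumptions.

```python
import numpy as np

# Step 1: build a profile of "normal" behaviour (summary statistics)
train = np.random.default_rng(0).normal(loc=50.0, scale=5.0, size=1000)
mu, sigma = train.mean(), train.std()

# Step 2: use the profile to detect anomalies
def is_anomaly(x, z_cutoff=3.0):
    """Flag x if it deviates from the normal profile by more than z_cutoff
    standard deviations (the cutoff is an illustrative choice)."""
    return abs(x - mu) / sigma > z_cutoff

print(is_anomaly(52.0))  # close to the profile -> False
print(is_anomaly(90.0))  # far from the profile -> True
```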
Clustering Based Anomaly Detection
• Key assumption: normal data records belong to large and dense clusters, while anomalies do not belong to any cluster or form very small clusters
• Categorization according to labels
• Semi-supervised – Cluster the normal data to create models of normal behavior. A new instance that does not belong to any cluster, or is not close to any cluster, is declared an anomaly
• Unsupervised – Post-processing is needed after the clustering step: the size of each cluster and the distance of a point from its cluster determine whether the point is an anomaly
• Anomalies detected using clustering based methods can be (see the sketch below):
1. Data records that do not fit into any cluster (residuals from clustering)
2. Small clusters
3. Low density clusters or local anomalies (points far from the other points within the same cluster)
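A small sketch of cases 1 and 2 (my example, not from the slides), using scikit-learn's DBSCAN: points labeled -1 fit no cluster (residuals), and clusters whose size falls below a chosen minimum can be treated as anomalous. The synthetic data, the DBSCAN parameters, and the minimum cluster size are all assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense "normal" blobs, a few scattered points, and one tiny tight group
rng = np.random.default_rng(0)
normal = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])
stray = rng.uniform(-3, 8, (5, 2))
small_group = rng.normal([-2.0, 6.0], 0.1, (6, 2))
X = np.vstack([normal, stray, small_group])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Case 1: residuals -- points that fit no cluster are labeled -1 by DBSCAN
residuals = np.where(labels == -1)[0]

# Case 2: small clusters -- clusters below an assumed minimum size
ids, sizes = np.unique(labels[labels != -1], return_counts=True)
small_ids = ids[sizes < 10]
small_cluster_points = np.where(np.isin(labels, small_ids))[0]

print("residual points:", residuals)
print("points in small clusters:", small_cluster_points)
```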
Anomaly Detection: Clustering Approach
Anomaly score function:
• Given a data point x from a dataset D, an anomaly score f(x) can be defined in several ways (a sketch of definitions 1 and 2 follows):
1. f(x) = distance between x and its closest centroid
2. f(x) = relative distance: the ratio of x's distance from its closest centroid to the median distance of all data points in that cluster from the centroid
3. f(x) = improvement in the goodness of a cluster (as measured by an objective function) when x is removed
Anomaly Detection: K-means Clustering Approach
Step 1: Select k random data points from the training dataset as the centroids of the clusters C1, C2, ...Ck.
Step 2: For each training data point x:
a. Compute the Euclidean distance D(Ci, x), i = 1...k
b. Find cluster Cq that is closest to data point x.
c. Assign data point x to Cq. Update the centroid of Cq. (The centroid of a cluster is the
arithmetic mean of the data points in the cluster.)
Step 3: Repeat Step 2 until the centroids of clusters C1, C2, ...Ck stabilize in terms of convergence criterion.
Step 4: For each test (new) data point y:
a. Compute the Euclidean distance D(Ci, y), i = 1...k, and find the cluster Cr closest to y. The distance to Cr is the anomaly score of y.
b. Apply a threshold t to this score: y is an outlier iff score > t; otherwise y is a normal data point (a code sketch of Steps 1–4 follows the figures below).
• Points in small clusters can also be flagged as anomalies
[Figure: K-means with 2 clusters, scoring each point by its distance from the closest centroid (point D is not considered an outlier)]
[Figure: the same data scored with the relative distance from the closest centroid, which adjusts for the difference in densities among the clusters]
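A compact sketch of Steps 1–4 (my example, using scikit-learn's KMeans in place of the hand-written loop of Steps 1–3; the synthetic data, k = 2, and the threshold t are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
train = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(10, 1, (100, 2))])
test = np.array([[0.2, -0.5], [10.3, 9.8], [5.0, 5.0]])  # last point is far from both clusters

# Steps 1-3: fit k-means on the training data (centroid initialization,
# assignment, and centroid updates run internally until convergence)
km = KMeans(n_clusters=2, n_init=10, random_state=2).fit(train)

# Step 4a: anomaly score = Euclidean distance to the closest centroid
scores = km.transform(test).min(axis=1)  # transform() returns distances to all centroids

# Step 4b: threshold the score (t is an assumed value; in practice it would be
# tuned, e.g. from the distribution of training-point scores)
t = 3.0
labels = np.where(scores > t, "outlier", "normal")
print(scores)
print(labels)
```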
Clustering Based Anomaly Detection
• Advantages:
• No need to be supervised
• Easily adaptable to on-line anomaly detection from temporal data
• Drawbacks
• Computationally expensive – time complexity is O(cn), where c is the number of clusters and n the number of data points
• Using indexing structures (k-d tree, R* tree) may alleviate this problem
• If normal points do not create any clusters the techniques may fail
• In high dimensional spaces, data is sparse and distances between any two data records may
become quite similar.
• Clustering algorithms may not give any meaningful clusters
Visualization Based Techniques
• Use visualization tools to observe the data
• Provide alternate views of data for manual inspection
• Anomalies are detected visually (see the sketch below)
• Advantages
• Keeps a human in the loop
• Disadvantages
• Works well only for low-dimensional data
• Can provide only aggregated or partial views for high dimension data
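As a trivial illustration (my example, not from the slides), a scatter plot of low-dimensional data already lets an analyst spot isolated points by eye; the synthetic data and the two injected outliers are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(0, 1, (200, 2)), [[6.0, 6.0], [-5.0, 4.0]]])  # two injected outliers

# A plain scatter plot is often enough for manual inspection in two dimensions
plt.scatter(data[:, 0], data[:, 1], s=10)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Manual inspection: isolated points stand out visually")
plt.show()
```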
Application of Dynamic Graphics
• Apply dynamic graphics to the exploratory analysis of spatial data.
• Visualization tools are used to examine local variability to detect anomalies
• Manual inspection of plots of the data that display its marginal and multivariate distributions
Applications of Anomaly Detection
• Network intrusion detection
• Insurance / Credit card fraud detection
• Healthcare Informatics / Medical diagnostics
• Industrial Damage Detection
• Image Processing / Video surveillance
• Novel Topic Detection in Text Mining