VVB_DWM Module no 2_Data exploration and data preprocessing .pptx
This is Module 2 of Data Warehousing and Mining for Computer Science and Engineering; it covers topics such as data exploration and data preprocessing.
1.
Module No :02
Data Exploration & Data Preprocessing
1
2.
General meaning of mining
Mining is the extraction of valuable minerals or other
geological materials from the earth. Mining of stones, diamonds and
metal has been a human activity since pre-historic times.
2
3.
What is Data Mining? Non-technical view
1. There are things that we know that we know…
2. There are things that we know that we don’t know…
3. There are things that we don’t know we don’t know.
Data mining is not about finding what we already know; it is all about discovering what we do not know.
e.g. Columbus “discovered” America
3
4.
What is data mining?
• Data mining is the computing process of discovering patterns in
large data sets to predict future trends.
• Data mining refers to extracting or mining knowledge from large amounts of data
• It is similar to mining gold from rocks or sand (gold mining)
• Finding a small set of precious things in a great deal of raw material
4
5.
What is data mining?
• Data mining is the process of exploration and analysis of large quantities of data in order to discover meaningful patterns and rules
• Data mining means extracting meaningful patterns and rules from large quantities of information
• Similar terms
1. knowledge mining from databases
2. knowledge extraction
3. pattern analysis
4. data archaeology
5. data dredging
5
6.
What is data mining?
• Data mining formula:
Data + Interestingness Criteria = Hidden Patterns
Interestingness Criteria may be
1. frequency
2. correlation
3. length of occurrence
4. repeating/ periodicity
5. consistency
6. abnormal behaviour
6
7.
Architecture of a typical data mining system
[Figure: raw data passes through data filtering/cleaning and data integration into the data warehouse; the data warehouse server feeds the data mining engine, followed by the pattern evaluation module and the graphical user interface, with a knowledge base guiding the mining engine and pattern evaluation]
7
8.
Architecture of a typical data mining system
• Data warehouse: an information repository, where data integration, data cleaning, filtering, transformation and loading processes are performed on raw data
• Data warehouse server: responsible for fetching the relevant data from the DW, based on the user's data mining request
• Knowledge base: the domain knowledge that is used to guide the search, e.g. concept hierarchies, interestingness measures
• Data mining engine: consists of a set of functional modules for tasks like classification, association, clustering etc.
• Pattern evaluation module: uses interestingness measures to filter out discovered patterns
• Graphical user interface: communicates between users and the data mining system
8
9.
Knowledge Discovery in Databases (KDD)
• Knowledge discovery in database (KDD) is the process of finding
useful information and patterns in data
• KDD is a process consisting of many steps, while data mining is only one of these steps
• Data mining is the use of algorithms to extract the information and patterns derived by the KDD process
• There are many alternative names for KDD
1. process of discovering hidden patterns in data
2. knowledge extraction
3. information discovery
4. exploratory data analysis
5. information harvesting
6. unsupervised pattern recognition
9
10.
Knowledge Discovery in Databases (KDD)
• Knowledge discovery in database (KDD) process consists of the
following five steps
1. Selection: obtaining data from heterogeneous sources in the form of various database files and non-electronic sources
2. Pre-processing: data correction, removal, cleaning, missing-data supply, prediction of missing data
3. Transformation: data conversion into a standardized common format for processing
4. Data mining: application of algorithms on the transformed data to generate the desired results
5. Interpretation/evaluation: presentation of data mining results to the users through different GUI strategies or by various visualizations
10
11.
Knowledge Discovery in Databases (KDD)
• Visualization means visual presentation of data, which includes
1. Graphical: bar charts, pie charts, histograms, line graphs
2. Geometric: box plots, scatter diagrams
3. Icon-based: figures, colours which improve presentation of results
4. Pixel-based: each data value is shown as a uniquely coloured pixel
5. Hierarchical: hierarchically divide the display into regions
6. Hybrid: all the above can be combined into one display
11
12.
Knowledge Discovery in Databases (KDD)
• KDD Process
12
Initial data → (Selection) → Target data → (Pre-processing) → Pre-processed data → (Transformation) → Transformed data → (Data mining) → Model → (Interpretation) → Knowledge
13.
Data mining techniques:
1. Classification
2. Estimation
3. Prediction
4. Affinity grouping/Association (Frequent Pattern Mining)
5. Clustering
6. Visualization and Description
13
14.
Data mining techniques:
1. Classification:
• Classification consists of examining the features of a newly presented tuple/object/record and assigning it to a predefined class
• The classification task is to build a model that can be applied to unclassified data in order to classify it.
e.g.
1. Classifying credit card applications as low, medium and high credit limits
2. Assigning a class (branch-wise) to a newly admitted student in a college
14
15.
Data mining techniques:
2. Estimation:
• Estimation deals with continuously valued outcomes
• Given some input data, we use estimation to come up with a value for some unknown continuous variable (such as income, height, credit card balance, etc.)
• Estimation is often used to perform a classification task
e.g.
1. Estimating a family’s total household income.
2. Estimating the value of a piece of real estate.
‘Classification and estimation are used together to predict future behavior’
15
16.
Data mining techniques:
3. Prediction:
• From past behavior (historical data), classification and estimation, we can predict future behavior or estimate a future value.
• The historical data is used to build a model that explains the current observed behavior. When this model is applied to current inputs, the result is a prediction of future behavior.
• The only way to check the accuracy of the prediction is to wait and see.
e.g.
1. Predicting which customers will leave within the next six months
2. Predicting the result of a BE student by referring to the results of S.S.C. and H.S.C.
16
17.
Data mining techniques:
4. Affinity grouping/Association (Frequent Pattern Mining/Market Basket Analysis)
• The task of affinity grouping is to determine which things go together.
– e.g. tea and sugar
• The purchase of one product when another product is purchased represents an association rule
• Affinity grouping can also be used to identify cross-selling opportunities and to design attractive offers and services.
– e.g. tea and coffee
17
18.
Data mining techniques:
5. Clustering
• Clustering is the technique of segmenting a diverse group into a number of more similar subgroups or clusters.
• Clustering does not rely on predefined classes.
• The records are grouped together on the basis of self-similarity.
e.g.
1. A cluster of symptoms might indicate a particular disease.
2. Clusters of videos and music might indicate the culture of a society.
18
19.
Data mining techniques:
6. Visualization and Description
Description:
• Description of a complicated database increases understanding of the database and suggests where to look for explanations.
– e.g. understanding people, products or processes
Visualization:
• Human beings can easily extract meaning from visual scenes.
• So data visualization is a powerful data mining activity.
• One meaningful picture or graph creates more impact than any other form of data.
19
Data mining algorithms:
Different algorithms we are going to use for data mining:
1. Decision tree
2. Naïve Bayes classification (Bayesian)
3. Bootstrap algorithm
4. Random forest
5. K-means clustering
6. Agglomerative algorithm
7. Divisive algorithm
8. BIRCH algorithm
9. DBSCAN and OPTICS algorithms
10. Apriori algorithm
11. Market basket analysis
21
22.
Applications of Data Mining:
Data mining is widely used in diverse areas; a number of commercial data mining systems are available
1. Financial data analysis
2. Retail industry
3. Telecommunication industry
4. Biological data analysis
5. Intrusion detection
6. Other scientific applications (geosciences, astronomy, climate and
ecosystem modeling, chemical engineering, fluid dynamics etc)
22
23.
Issues in Data Mining:
• Human interaction: domain knowledge experts and technical experts are needed to formulate queries, identify training data sets and visualize desired results
• Interpretation of results: experts are required to correctly interpret the results
• Visualization of results: visualization helps users easily view and understand the output of data mining algorithms
• Large datasets: massive datasets create problems with different modeling applications
• High dimensionality: using too many attributes may increase overall complexity and decrease efficiency
23
24.
Issues in Data Mining:
• Multimedia data: the use of multimedia data complicates or
invalidates many proposed algorithms
• Missing data: missing data can lead to invalid results(incorrect)
• Irrelevant data: some attributes might not be of interest to data
mining or may not be useful
• Noisy data: attribute values might be incorrect or invalid
• Changing data: databases cannot be assumed to be static, e.g. address, marital status or age change over time
24
Data Exploration Definition:
Data exploration is an approach similar to initial data analysis,
whereby a data analyst uses visual exploration to understand what is
in a dataset and the characteristics of the data, rather than through
traditional data management systems.
26
27.
Attributes
• An attribute is a data field, representing a characteristic or feature
of a data object.
• The nouns attribute, dimension, feature, and variable are often used
interchangeably in the literature.
• The term dimension is commonly used in data warehousing.
• Machine learning literature tends to use the term feature, while
statisticians prefer the term variable.
• Data mining and database professionals commonly use the term
attribute
• Attributes describing a customer object can include: customer ID,
name, and address.
27
28.
Types of Attributes
1. Nominal attributes
2. Binary attributes
3. Ordinal attributes
4. Numeric attributes
a) Interval-scaled attributes
b) Ratio-scaled attributes
28
29.
Types of Attributes: 1. Nominal attributes
• Nominal means “relating to names.” The values of a nominal attribute
are symbols or names of things.
• Each value represents some kind of category, code, or state, and so
nominal attributes are also referred to as categorical.
• The values do not have any meaningful order.
• Example 1: hair color and marital status are two attributes describing
person objects, then possible values for hair color are black, brown,
blond, red, auburn, gray, and white.
• The attribute marital status can take on the values single, married,
divorced, and widowed.
• Both hair color and marital status are nominal attributes.
• Example 2: occupation, with the values teacher, dentist, programmer,
farmer, and so on.
29
30.
Types of Attributes: 2. Binary attributes
• A binary attribute is a nominal attribute with only two categories or
states: 0 or 1, where 0 typically means that the attribute is absent,
and 1 means that it is present.
• Binary attributes are referred to as Boolean if the two states
correspond to true and false.
• Example: The attribute medical test is binary, where a value of 1
means the result of the test for the patient is positive, while 0
means the result is negative.
• For a symmetric binary attribute there is no preference on which outcome should be coded as 0 or 1, e.g. the attribute gender having the states male and female.
30
31.
Types of Attributes: 3. Ordinal attributes
• An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude
between successive values is not known.
• Example: drink size corresponds to the size of drinks available at a fast-food restaurant. This ordinal attribute has three possible values: small, medium, and large.
• grade (e.g., A++, A+, A, B++, B+, B, C++ and so on)
31
32.
Types of Attributes: 4. Numeric attributes
• A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values.
• Numeric attributes can be interval-scaled or ratio-scaled.
a) Interval-Scaled Attributes
• Interval-scaled attributes are measured on a scale of equal-size units.
The values of interval-scaled attributes have order and can be positive,
0, or negative
• such attributes allow us to compare and quantify the difference
between values.
• Example 1: temperature (20 degree Celsius is five degrees higher than
a temperature of 15 degree Celsius)
• Example 2: calendar dates (the years 2002 and 2010 are eight years
apart)
32
33.
Types of Attributes: 4. Numeric attributes
b) Ratio-Scaled Attributes
• A ratio-scaled attribute is a numeric attribute with an inherent zero-point, i.e. if a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value
• The values are ordered, and we can also compute the difference between values, as well as the mean, median, and mode
• Example 1: year_of_experience
• Example 2: no-of-words (in a document)
• Example 3: weight, height, latitude and longitude
33
34.
Statistical Description of Data
• For data pre-processing to be successful, it is essential to have an overall picture of your data.
• Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers.
• Following are the different ways to describe data statistically:
1. Mean 2. Median
3. Mode 4. Midrange
5. Range 6. Quartiles
7. Interquartile Range 8. Five-Number Summary
9. Boxplots 10. Outliers
11. Variance 12. Standard Deviation
13. Histograms 14. Scatter Plots
15. Data Correlation
34
35.
Statistical Description of Data: 1. Mean (Average Value)
Let X1, X2, ..., XN be a set of N values or observations, such as for some numeric attribute X. The mean of this set of values is
mean = (X1 + X2 + ... + XN) / N
e.g. for the data set 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 the mean is 696/12 = 58
36.
Statistical Description of Data: 2. Median (middle value)
Let X1, X2, ..., XN be a set of N values or observations, such as for some numeric attribute X, like salary. The median of this set of values is the middle value of the ordered set if N is odd, and the average of the two middle values if N is even.
e.g. for 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 (N = 12), the median is (52 + 56)/2 = 54
38.
Statistical Description of Data: 3. Mode
Let X1, X2, ..., XN be a set of N values or observations, such as for some attribute X. The mode of this set of values is the value repeating the maximum number of times.
Data set: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
This set of data is bimodal, i.e. there are two modes: 52 and 70.
39.
Statistical Description of Data: 4. Midrange
• Let X1, X2, ..., XN be a set of N values or observations, such as for some numeric attribute X.
• Data set: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
• The midrange is the average of the largest and smallest values in the set. The midrange of this set of values is (30 + 110)/2 = 70
40.
Statistical Description of Data: 5. Range
• Let X1, X2, ..., XN be a set of N values or observations, such as for some numeric attribute X.
• The range of the set is the difference between the largest (max) and smallest (min) values.
• e.g. 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
• Range of this data set is 110 - 30 = 80
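As an illustrative aside (not part of the original slides), a minimal Python sketch computing the five measures above on the running example data set:

```python
# Python 3.8+ (for statistics.multimode)
from statistics import mean, median, multimode

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("Mean:    ", mean(data))                   # 696 / 12 = 58
print("Median:  ", median(data))                 # (52 + 56) / 2 = 54
print("Mode(s): ", multimode(data))              # bimodal: [52, 70]
print("Midrange:", (min(data) + max(data)) / 2)  # (30 + 110) / 2 = 70
print("Range:   ", max(data) - min(data))        # 110 - 30 = 80
```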
41.
Statistical Description of Data: 6. Quartiles
Let X1, X2, ..., XN be a set of N values or observations, such as for some numeric attribute X.
We can pick certain data points so as to split the data distribution into equal-size consecutive sets, as shown in the figure.
41
42.
Statistical Description of Data: 6. Quartiles
• Q1 = Lower Quartile, Q2 = Median, Q3 = Upper Quartile
• The first quartile, denoted by Q1, is the 25th percentile; it cuts off the lowest 25% of the data. The second quartile (Q2) is the median (50th percentile), and the third quartile (Q3) is the 75th percentile.
• In this e.g.: Q1 = 52, Q2 = 54, Q3 = 58
42
43.
Statistical Description of Data: 7. Interquartile Range (IQR)
43
• The distance between the first and third quartiles is a simple
measure of spread that gives the range covered by the middle half of
the data.
• This distance is called the interquartile range (IQR) and is defined
as IQR=Q3-Q1
Statistical Description of Data: 8. Five-Number Summary
• The five-number summary of a distribution consists of the median (Q2), the
quartiles Q1 and Q3, and the smallest and largest individual observations, written
in the order of Minimum, Q1, Median, Q3, Maximum.
• E.g. 2,3,3,4,5,6,8,9
• Minimum = 2
• Q1= 3
• Median = 4.5
• Q3= 7
• Maximum= 9
46
47.
Statistical Description of Data: 9. Boxplots
• Boxplots are a popular way of visualizing a distribution. A boxplot
incorporates the five-number summary as follows:
• E.g. 2,3,3,4,5,6,8,9
The five-number summary
• Minimum = 2
• Q1= 3
• Median = 4.5
• Q3= 7
• Maximum= 9
47
48.
Statistical Description of Data: 10. Outliers
• Outliers are extremely high or extremely low values in the data set. We can identify outliers as follows:
• Values greater than Q3 + 1.5(IQR)
• Values less than Q1 - 1.5(IQR)
48
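A small NumPy sketch (an addition, not from the slides) tying together quartiles, the IQR, the five-number summary and the 1.5×IQR outlier fences on the example data set from the preceding slides; it assumes NumPy 1.22+ for the method= keyword, and the "midpoint" rule reproduces the slide's quartile values:

```python
import numpy as np

data = np.array([2, 3, 3, 4, 5, 6, 8, 9])

# Quartiles via the "midpoint" rule: Q1/Q3 are averages of the two
# values straddling the cut, matching the slide's hand computation.
q1, q2, q3 = np.percentile(data, [25, 50, 75], method="midpoint")
iqr = q3 - q1                                    # 7.0 - 3.0 = 4.0

print("Five-number summary:", data.min(), q1, q2, q3, data.max())
# -> 2 3.0 4.5 7.0 9

# Outlier fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr       # -3.0 and 13.0
print("Outliers:", data[(data < low) | (data > high)])  # none here
```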
Statistical Description of Data: 13. Histograms
• “Histos” means pole, and “gram” means chart, so a histogram is a chart
of poles
• Plotting histograms is a graphical method for summarizing the
distribution of a given attribute, X
50
51.
Statistical Description of Data: 14. Scatter Plots
• A scatter plot is one of the most effective graphical methods for determining if there
appears to be a relationship, pattern, or trend between two numeric attributes.
• To construct a scatter plot, each pair of values is treated as a pair of coordinates in an
algebraic sense and plotted as points in the plane.
51
52.
Statistical Description of Data: 15. Data Correlation
• Two attributes, X and Y, are correlated if one attribute implies the other. Correlations can be positive, negative, or null (uncorrelated)
• Figure shows examples of positive and negative correlations between two attributes.
• If the plotted points pattern slopes from lower left to upper right, this means that the values of X increase
as the values of Y increase, suggesting a positive correlation
• If the pattern of plotted points slopes from upper left to lower right, the values of X increase as the
values of Y decrease, suggesting a negative correlation
52
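To make the three cases concrete, here is a short illustrative sketch (synthetic data, not from the slides) that quantifies each case with Pearson's correlation coefficient r:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_pos = 2 * x + rng.normal(scale=0.5, size=200)   # slopes lower-left to upper-right
y_neg = -2 * x + rng.normal(scale=0.5, size=200)  # slopes upper-left to lower-right
y_null = rng.normal(size=200)                     # no relationship

for name, y in [("positive", y_pos), ("negative", y_neg), ("null", y_null)]:
    r = np.corrcoef(x, y)[0, 1]       # Pearson correlation coefficient
    print(f"{name:8s} r = {r:+.2f}")  # near +1, near -1, near 0
```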
Measuring data similarity and dissimilarity
54
• In data mining applications (like clustering, classification, etc.) we are interested in comparing objects on the basis of their similarities and dissimilarities
• Similarity and dissimilarity can be measured in the following ways:
1. Data Matrix
2. Dissimilarity Matrix
3. Minkowski Distance
a) Manhattan(City block) Distance
b) Euclidean Distance
c) Supremum Distance
4. Cosine similarity
55.
Measuring data similarity and dissimilarity
Data Matrix:
• This structure stores the n data objects in the form of a relational
table, or n-by-p matrix (n objects, p attributes)
55
56.
Measuring data similarity and dissimilarity
Data Matrix:
• Data points
• X1(1,2), X2(3,5), X3(2,0), X4(4,5)
56
57.
Measuring data similarity and dissimilarity
Dissimilarity Matrix:
• This structure stores a collection of proximities that are
available for all pairs of n objects.
• It is often represented by an n-by-n table
57
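A minimal sketch (added for illustration) that builds the n-by-n dissimilarity matrix for the four data points above under the three Minkowski variants, plus cosine similarity for one pair:

```python
import numpy as np

# The slide's data matrix: 4 objects (rows) x 2 attributes (columns)
X = np.array([[1, 2],   # X1
              [3, 5],   # X2
              [2, 0],   # X3
              [4, 5]])  # X4

def dissimilarity_matrix(X, p):
    """Minkowski distance matrix: p=1 Manhattan, p=2 Euclidean,
    p=inf supremum (Chebyshev)."""
    diff = np.abs(X[:, None, :] - X[None, :, :])
    if np.isinf(p):
        return diff.max(axis=-1)
    return (diff ** p).sum(axis=-1) ** (1 / p)

for p in (1, 2, np.inf):
    print(f"p = {p}:\n{np.round(dissimilarity_matrix(X, p), 2)}\n")

# Cosine similarity between X1 and X2 (cosine of the angle between them)
a, b = X[0], X[1]
print("cos(X1, X2) =", a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```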
Data Visualization
• Data visualization aims to communicate data clearly and effectively through graphical representation.
• Visualization techniques are used to discover data relationships that are otherwise not easily observable by looking at the raw data
• Data visualization techniques are divided into the following types:
1. Pixel-Oriented Visualization Techniques
2. Geometric Projection Visualization Techniques
3. Icon-Based Visualization Techniques
4. Hierarchical Visualization Techniques
64
65.
Data Visualization: 1.Pixel-Oriented Visualization Techniques
• A simple way to visualize the value of a dimension is to use a pixel
where the color of the pixel reflects the dimension’s value.
• For a data set of m dimensions, pixel-oriented techniques create m
windows on the screen, one for each dimension.
• The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows.
• The colors of the pixels reflect the corresponding values.
• A customer information table, which consists of four dimensions:
income, credit limit, transaction volume, and age.
• We can sort all customers in income-ascending order, and use this order to lay out the customer data in the four visualization windows, as shown in the figure
65
66.
Data Visualization: 1.Pixel-Oriented Visualization Techniques
• The pixel colors are chosen so that the smaller the value, the lighter the
shading.
• Using pixel-based visualization, we can easily observe the following: credit limit increases as income increases; customers whose income is in the middle range are more likely to purchase more; there is no clear correlation between income and age.
66
Data Visualization: 2. Geometric Projection Visualization Techniques
• Geometric projection techniques help users find
interesting projections of multidimensional data sets.
• Figure shows an example, where X and Y are two spatial
attributes and the third dimension is represented by
different shapes.
• Through this visualization, we can see that points of types “+” and “×” tend to be co-located.
• A 3-D scatter plot uses three axes in a Cartesian coordinate
system. If it uses color, it can display up to 4-D data points
68
Why Pre-processing?
• Data have quality if they satisfy the requirements of the intended use.
• There are many factors affecting data quality, including
– Accuracy (inaccurate data containing errors or values that
deviate from the expected data)
– Completeness (incomplete data: lacking attributes values, certain
attributes of interest or containing only aggregate data)
– Consistency (inconsistency in data contain variation, differences,
disparity, deviation, mismatch in the data)
– Timeliness (submit records on time e.g. at the end of the month)
– Believability (reflects how much the data is trusted by users)
– Interpretability (reflects how easy the data is understood)
78
79.
Major Tasks involved in Data Preprocessing
• Following are the major steps involved in data preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation.
79
Data Preprocessing: Data Cleaning
• Data cleaning routines work to “clean” the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving
inconsistencies
• Missing Values
– Ignore the tuple
– Fill in the missing value manually
– Use a global constant to fill in the missing value
– Use a measure of central tendency for the attribute (e.g., the mean or
median) to fill in the missing value
– Use the attribute mean or median for all samples belonging to the same
class as the given tuple
– Use the most probable value to fill in the missing value
• Noisy Data
• Outlier analysis
81
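A brief pandas sketch of the missing-value strategies listed above; the column names and values are invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income":  [45000, np.nan, 52000, np.nan, 61000],
    "segment": ["A", "A", "B", "B", "B"],  # stands in for a class label
})

# 1. Ignore (drop) tuples with missing values
dropped = df.dropna()

# 2. Fill with a global constant
constant = df.fillna({"income": -1})

# 3. Fill with a measure of central tendency (here the mean)
overall = df.fillna({"income": df["income"].mean()})

# 4. Fill with the mean of all samples in the same class as the tuple
by_class = df.copy()
by_class["income"] = df.groupby("segment")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(by_class)  # A-row gets 45000, B-row gets (52000 + 61000) / 2
```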
Data Preprocessing: Data Integration
• Data integration: this involves integrating multiple databases, data cubes, or files
• It combines data from multiple sources to form a coherent data store.
• The following concepts contribute to smooth data integration:
– The resolution of semantic heterogeneity
– metadata
– correlation analysis
– tuple duplication detection
– data conflict detection
83
84.
Data Preprocessing: Data Reduction
• Data reduction obtains a reduced representation of the data set that
is much smaller in volume, yet produces the same (or almost the
same) analytical results.
• Data reduction strategies include
– data cube aggregation
– dimensionality reduction
– data compression
– numerosity reduction.
84
85.
Data Preprocessing: Data Reduction
• Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
• In dimensionality reduction, data encoding schemes are applied so
as to obtain a reduced or “compressed” representation of the original
data (e.g., removing irrelevant attributes)
• Data compression, where encoding mechanisms are used to reduce
the data set size
• In numerosity reduction, the data are replaced by alternative,
smaller representations using parametric models (e.g., regression or
log-linear models) or nonparametric models (e.g., histograms,
clusters, sampling, or data aggregation)
85
86.
Data reduction: Histograms
•“Histos” means pole, and “gram” means chart, so a histogram is a
chart of poles
• Plotting histograms is a graphical method for summarizing the
distribution of a given attribute, X
86
87.
Data Preprocessing: Data Reduction
Attribute subset selection
• Attribute subset selection is a method of dimensionality reduction in
which irrelevant, weakly relevant, or redundant attributes or
dimensions are detected and removed
• The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained
using all attributes
• It reduces the number of attributes appearing in the discovered
patterns, helping to make the patterns easier to understand.
87
88.
Data Preprocessing: Data Reduction
Clustering and sampling
• Clustering techniques consider data tuples as objects. They partition
the objects into groups, or clusters, so that objects within a cluster
are “similar” to one another and “dissimilar” to objects in other
clusters.
• Sampling can be used as a data reduction technique because it
allows a large data set to be represented by a much smaller random
data sample (or subset)
88
89.
Data transformation: Normalization of data
• Normalization is used to scale values so they fit in a specific range
(adjusting the value range is important when dealing with attributes
of different units and scales)
• E.g. when using the Euclidean distance all attributes should have the same scale for a fair comparison
• An attribute is normalized by scaling its values so that they fall within a small specific range such as 0.0 to 1.0
• Normalization is particularly useful for classification algorithms
• Methods for data normalization
1. Min-max Normalization
2. Z-score Normalization
3. Decimal scaling
89
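A minimal NumPy sketch of the three normalization methods; the sample values are illustrative assumptions:

```python
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# 1. Min-max normalization to the range [0.0, 1.0]
min_max = (x - x.min()) / (x.max() - x.min())     # [0. 0.125 0.25 0.5 1.]

# 2. Z-score normalization: zero mean, unit standard deviation
z_score = (x - x.mean()) / x.std()

# 3. Decimal scaling: divide by 10^j for the smallest integer j
#    such that max(|x'|) < 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1  # j = 4 here
decimal = x / 10 ** j                             # [0.02 0.03 0.04 0.06 0.1]

print(min_max, z_score, decimal, sep="\n")
```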
Data transformation: Binning
• Data values are grouped together into bins
• e.g. process of binning of processors of mobile phones
• Data binning or bucketing is a data preprocessing technique used to
reduce the effect of minor observation errors
• Statistical data binning is a way to group a number of more or less continuous values into a smaller number of bins
• E.g. 1. if you have data about a group of people, you may arrange their ages into a smaller number of age intervals
• E.g. 2. Histograms are an example of data binning used in order to
observe underlying distributions
• Histograms are typically used for ease of data visualization
94
95.
Data transformation: Binning
•Binning methods smooth a sorted data value by consulting its
‘neighbourhood’ (i.e. values around it)
• The sorted values are distributed into number of buckets or bins
• e.g. sorted data for price, partitioned into equi-depth bins of depth 3:
4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into equi-depth bins:
(4-15) Bin 1: 4, 8, 15
(16-24) Bin 2: 21, 21, 24
(25-34) Bin 3: 25, 28, 34
95
96.
Data transformation: Binning
• Partition into equi-depth bins:
(4-15) Bin 1: 4, 8, 15
(16-24) Bin 2: 21, 21, 24
(25-34) Bin 3: 25, 28, 34
• Smoothing by bin means:
Bin 1: 9,9,9
Bin 2: 22, 22, 22
Bin 3: 29,29,29
• Smoothing by bin boundaries:
Bin 1: 4,4,15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
96
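A small Python sketch (added here) that reproduces the slide's smoothing results:

```python
# Equi-depth binning with depth 3, then the two smoothing rules
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
depth = 3
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

# Smoothing by bin means: every value becomes its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of
# its bin's minimum and maximum
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(bins)       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```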
97.
Data Discretization
• Data discretization methods are used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
• where the raw values of a numeric attribute (e.g., age) are replaced
by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels
(e.g., youth, adult, senior).
• Such methods can be used to automatically generate concept
hierarchies for the data, which allows for mining at multiple levels
of granularity.
• Histogram analysis
• Concept hierarchy generation
97
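A brief pandas sketch of both labeling styles; the ages and the youth/adult/senior cut points (20 and 60) are assumptions for illustration:

```python
import pandas as pd

ages = pd.Series([3, 7, 15, 22, 34, 41, 58, 67, 81])

# Interval labels: equal-width bins of width 10 (0-10, 11-20, ...)
intervals = pd.cut(ages, bins=range(0, 91, 10))

# Conceptual labels: replace raw ages with youth / adult / senior
concepts = pd.cut(ages, bins=[0, 20, 60, 120],
                  labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```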
98.
Data discretization: Histogram analysis
• A frequency distribution shows how often each different value in a
set of data occurs.
• A histogram is the most commonly used graph to show frequency
distributions. It looks very much like a bar chart
98
99.
Data discretization: Concept hierarchy generation
• The concept hierarchies can be used to transform the data into
multiple levels of granularity
• There are four methods for the generation of concept hierarchies for nominal data:
1. Specification of a partial ordering of attributes explicitly at the schema level by
users or experts
2. Specification of a portion of a hierarchy by explicit data grouping
3. Specification of a set of attributes, but not of their partial ordering
4. Specification of only a partial set of attributes
99
100.
Data discretization: Concept hierarchy generation
1. Specification of a partial ordering of attributes explicitly at the
schema level by users or experts
• A user or expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level.
• e.g. location dimension may contain the attributes (specifying the
ordering) such as street < city < province or state < country
2. Specification of a portion of a hierarchy by explicit data grouping
• a user could define some intermediate levels manually
• E.g. {ABC road, Hyderabad, A.P, India} subset of South India
• {XYZ road, Amritsar, Punjab, India} subset of North India
100
101.
Data discretization: Concept hierarchy generation
3. Specification of a set of attributes, but not of their partial ordering
• The system automatically generates the attribute ordering so as to construct a meaningful concept hierarchy.
• a concept hierarchy can be automatically generated based on the
number of distinct values per attribute in the given attribute set.
• The attribute with the most distinct values is placed at the lowest
hierarchy level.
• The lower the number of distinct values an attribute has, the higher
it is in the generated concept hierarchy.
• E.g. see the diagram on next slide
101
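A toy sketch of this distinct-value heuristic; the location table below is invented for illustration:

```python
import pandas as pd

# The attribute with the MOST distinct values goes to the lowest level
df = pd.DataFrame({
    "street":  ["MG Rd", "FC Rd", "JM Rd", "SB Rd", "DP Rd", "LB Rd"],
    "city":    ["Pune", "Pune", "Mumbai", "Mumbai", "Nashik", "Panaji"],
    "state":   ["MH", "MH", "MH", "MH", "MH", "GA"],
    "country": ["India"] * 6,
})

order = df.nunique().sort_values()    # fewest distinct values first
print(" < ".join(order.index[::-1]))  # street < city < state < country
```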
Data discretization: Concept hierarchy generation
4. Specification of only a partial set of attributes
• Sometimes a user has only a vague idea about what should be included in a hierarchy.
• Consequently, the user may have included only a small subset of the
relevant attributes in the hierarchy specification
103