Pre-processing, Transformation and Cleaning of Data
(based on the slides of the book Data Mining: C & T)
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Front-end applications of DW

- Information processing: querying, basic statistical analysis, and reporting using crosstabs, tables, charts or graphs
- Analytical processing: multidimensional data analysis through basic OLAP operations (slice/dice, drill-down, roll-up, pivoting, etc.)
- Data mining: knowledge discovery by finding hidden patterns and associations, building analytical models, performing classification and prediction, and presenting results through visualization tools

Application context

- Construction of a data repository for data analysis
  - also called pre-processing (data mining context) or ETL process (DW context)
  - querying, reporting, analytical processing and data mining require quality data
- Migration of data from a source schema to a target schema
  - from poorly structured to structured data
  - to support application migration
- Enhancement of a single data source
  - eliminating errors, duplicates and inconsistencies

Data Preprocessing

- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation

Example (1)

Sales Fact Table
- Keys: time_key, item_key, branch_key, location_key
- Measures: units_sold, dollars_sold, avg_sales

Dimension tables
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, province_or_street, country

Example (2)

- Suppose we want to analyze the company's data with respect to the sales at a given branch
- Select the attributes and dimensions to be included in the analysis: item, price, units_sold, etc.
- May find out that....

Why Data Preprocessing?

Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - e.g., occupation=“”
- noisy: containing errors or outliers (spelling, phonetic and typing errors, word transpositions, multiple values in a single free-form field)
  - e.g., Salary=“-10”
- inconsistent: containing discrepancies in codes or names (synonyms and nicknames, prefix and suffix variations, abbreviations, truncation and initials)
  - e.g., Age=“42” Birthday=“03/07/1997”
  - e.g., was rating “1, 2, 3”, now rating “A, B, C”
  - e.g., discrepancy between duplicate records

Why Is Data Dirty?

- Incomplete data comes from:
  - data values not available when the data was collected
  - different criteria between the time when the data was collected and when it is analyzed
  - human/hardware/software problems
- Noisy data comes from:
  - data collection: faulty instruments
  - data entry: human or computer errors
  - data transmission
- Inconsistent (and redundant) data comes from:
  - different data sources, hence non-uniform naming conventions/data codes
  - functional dependency and/or referential integrity violations

Why Is Data Preprocessing Important?

- A data warehouse needs consistent integration of quality data
  - Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data (e.g., duplicate or missing data may cause incorrect or even misleading statistics)

Major Tasks in Data Preprocessing

- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a representation reduced in volume that produces the same or similar analytical results
- Data discretization
  - Part of data reduction but with particular importance, especially for numerical data

Forms of data preprocessing

One methodology for the ETL process (L. English)

- Parsing
- Correction: ZIP or postal codes, addresses (field)
- Standardization: casing, soundex/phonetic equivalent, dictionary spelling, column splitting or merging, filtering out stopwords, conversion to a standard format (e.g. dates)
- Matching or record linkage: exact matches, wild card, soundex, keying fields or combination of fields, text indexing, edit distance, signatures
- Consolidation (enhancement and merging): duplicate with more information is kept, source priority, most recent update, most frequently occurring, random choice, field contents, take an equal number of fields from each source

Descriptive data summarization

- Motivation
  - To better understand the data: central tendency, variation and spread
- Measures of central tendency
  - Mean, median, mode, midrange
- Measures of data dispersion
  - Quartiles, interquartile range, outliers, variance, etc.
- Goal: efficiently compute these measures in large DBs

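Measuring the Central Tendency

- Mean (algebraic measure):
  $$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
  - Weighted arithmetic mean:
    $$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
  - Trimmed mean: the mean computed after chopping off extreme values
- Median (holistic measure):
  - Middle value if there is an odd number of values, or the average of the middle two values otherwise
  - Estimated by interpolation (for grouped data):
    $$median = L_1 + \left( \frac{n/2 - (\sum f)_l}{f_{median}} \right) c$$
    where $L_1$ is the lower boundary of the median interval, $n$ the number of values, $(\sum f)_l$ the sum of the frequencies of the intervals below the median interval, $f_{median}$ the frequency of the median interval, and $c$ the width of the median interval
- Mode
  - Value that occurs most frequently in the data
  - Unimodal, bimodal, trimodal
  - Empirical formula (for moderately skewed unimodal data):
    $$mean - mode = 3 \times (mean - median)$$
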
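The measures of central tendency listed above can be sketched in a few lines of Python. This is only an illustration: the helper names `weighted_mean` and `trimmed_mean` are hypothetical, the trimming rule (drop the same fraction at each end) is one possible convention, and the sample values are the price data used later in these slides.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values at each end (assumed convention)."""
    data = sorted(values)
    k = int(len(data) * fraction)  # number of values chopped off per side
    return mean(data[k:len(data) - k])


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

print("mean:", mean(prices))
print("median:", median(prices))          # average of the two middle values
print("mode(s):", multimode(prices))      # most frequent value(s)
print("trimmed mean:", trimmed_mean(prices, 0.25))
print("weighted mean:", weighted_mean([1, 2, 3], [1, 1, 2]))
```

The mean and the weighted mean are algebraic (they can be computed from partial sums), whereas the median needs the whole sorted data set, which is why the slides call it holistic.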
Median, mean and mode of symmetric data

Positively Skewed Data

- The mode occurs at a value smaller than the median

Negatively Skewed Data

- The mode occurs at a value greater than the median

Negatively skewed data (example)

Measuring the Dispersion of Data (1)

- Quartiles, outliers and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Interquartile range: IQR = Q3 - Q1
  - Five-number summary: min, Q1, M, Q3, max
  - Boxplot: the ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually
  - Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1

Quartiles

- Kth percentile of a set of data in numerical order: the value x such that k % of the data entries lie at or below x
  - The median is the 50th percentile
- Quartiles: the most commonly used percentiles; they give an indication of the center, spread and shape of a distribution
  - Q1: 25th percentile; Q3: 75th percentile
  - Interquartile range: IQR = Q3 - Q1
  - Outliers: values more than 1.5 x IQR above Q3 or below Q1

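A minimal sketch of the five-number summary and the 1.5 x IQR outlier rule follows. Quartile definitions differ between textbooks and tools; this example assumes the "inclusive" interpolation method of Python's `statistics.quantiles`, and the function names and the data (the price values from these slides plus one artificial extreme value, 80) are illustrative only.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartile interpolation."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, med, q3, data[-1]


def iqr_outliers(values):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 80]

print(five_number_summary(prices))
print(iqr_outliers(prices))
```

With a different quartile convention the fences may move slightly, so borderline values can be classified differently.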
Boxplot Analysis

- Five-number summary of a distribution:
  - Minimum, Q1, M, Q3, Maximum
- Boxplot
  - Data is represented with a box
  - The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  - The median is marked by a line within the box
  - Whiskers: two lines outside the box extend to the Minimum and Maximum

Parallel boxplots (example)

[Figure: parallel boxplots of life expectancy at birth (in years, axis from 30 to 100) for females, males and both sexes, in 1998 and 2025. Countries labelled as outliers include Kenya, Zimbabwe, Namibia, Botswana, Swaziland, Rwanda, Zambia, Ethiopia, Malawi, Uganda, Côte d'Ivoire, Niger, Burundi, Tanzania, Burkina Faso and Lesotho.]

Visualization of Data Dispersion: Boxplot Analysis

Measuring the Dispersion of Data (2)

- Variance and standard deviation
  - Variance $s^2$:
    $$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$$
  - The standard deviation $s$ is the square root of the variance $s^2$
    - it measures spread about the mean
    - $s = 0$ when there is no spread, i.e., all observations have the same value
  - Both are algebraic measures, allowing scalable computation

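The two forms of the variance formula above can be compared with a short sketch. The second form only needs the running totals n, sum of x and sum of x squared, which is what makes it computable in a single scan over a large table. The function names and the sample data are illustrative assumptions; note also that the one-pass form can lose precision in floating point when the mean is large relative to the spread.

```python
from math import sqrt


def variance_two_pass(xs):
    """s^2 = 1/(n-1) * sum((x_i - mean)^2): needs the mean first, then a second pass."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_one_pass(xs):
    """s^2 = 1/(n-1) * [sum(x_i^2) - (sum(x_i))^2 / n]: one scan, three accumulators."""
    n, total, total_sq = 0, 0.0, 0.0
    for x in xs:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance_two_pass(data), variance_one_pass(data))
print("standard deviation:", sqrt(variance_one_pass(data)))
```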
Properties of Normal Distribution Curve

- The normal (distribution) curve (μ: mean, σ: standard deviation)
  - From μ-σ to μ+σ: contains about 68% of the measurements
  - From μ-2σ to μ+2σ: contains about 95% of the measurements
  - From μ-3σ to μ+3σ: contains about 99.7% of the measurements

Graphic Displays of Basic Statistical Descriptions

- Graph displays of basic statistical class descriptions
  - Boxplot
  - Histogram
  - Quantile plot
  - Quantile-quantile (q-q) plot
  - Scatter plot
  - Loess (local regression) curve

Histogram Analysis

- Frequency histograms
  - A univariate graphical method
  - Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data

Histograms (example)

[Figure: two frequency histograms, "Both sexes 1998" and "Both sexes 2025"; horizontal axis from 30 to 90, vertical axis (frequency) from 0 to 100.]

Quantile Plot

- Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
- Plots quantile information
  - For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i

Quantile Plot (example)

[Figure: quantile plots for both sexes; horizontal axis (f-value) from 0.1 to 0.9, vertical axis from 50 to 80.]

Quantile-Quantile (Q-Q) Plot

- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Allows the user to view whether there is a shift in going from one distribution to another

Q-Q Plot (example)

[Figure: Q-Q plot of AS1998 (horizontal axis, 40 to 80) against AS2025 (vertical axis, 40 to 80).]

Scatter plot

- Provides a first look at bivariate data to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as a point in the plane

Scatter plot (example)

Loess Curve

- Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
- The loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression

Positively and Negatively Correlated Data

Not Correlated Data

Data Cleaning

- Importance
  - “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
  - “Data cleaning is the number one problem in data warehousing” (DCI survey)
- Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration

Missing Data

- Data is not always available
  - Ex: many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - equipment malfunction
  - data that was inconsistent with other recorded data and was therefore deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
- Missing data may need to be inferred.

How to Handle Missing Data?

- Ignore the tuple
  - not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually
  - tedious and infeasible with large data sets
- Fill it in automatically with
  - a global constant, e.g., “unknown”; not recommended
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or a decision tree

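Two of the automatic strategies above, the attribute mean and the per-class attribute mean, can be sketched as follows. This is an illustration under assumptions: missing values are represented by `None`, the function names are hypothetical, and the income values and class labels are made up for the example.

```python
from collections import defaultdict
from statistics import mean


def fill_with_mean(values):
    """Replace each missing value (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    m = mean(known)
    return [m if v is None else v for v in values]


def fill_with_class_mean(values, classes):
    """Replace each missing value with the mean of the known values of the same class.

    Assumes every class has at least one known value; otherwise a KeyError is raised.
    """
    by_class = defaultdict(list)
    for v, c in zip(values, classes):
        if v is not None:
            by_class[c].append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, classes)]


income = [30, None, 50, 100, None, 120]           # hypothetical attribute values
risk = ["low", "low", "low", "high", "high", "high"]  # hypothetical class labels

print(fill_with_mean(income))
print(fill_with_class_mean(income, risk))
```

The global mean pulls both missing values to the same number, while the class mean keeps the two groups apart, which is why the slide calls it the smarter option.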
Noisy Data

- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems that require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?

- Binning
  - first sort the data and partition it into (equal-frequency) bins
  - then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
- Clustering
  - detect and remove outliers
- Combined computer and human inspection
  - detect suspicious values and have them checked by a human (e.g., deal with possible outliers)
- Regression
  - smooth by fitting the data to regression functions

Simple Discretization Methods: Binning

- Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
  - The most straightforward, but outliers may dominate the presentation
  - Skewed data is not handled well
- Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky

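The two partitioning schemes above can be contrasted with a small sketch. The function names are hypothetical, the equal-width version assumes the attribute has at least two distinct values (so that W > 0), and the data are the price values used in the binning example of these slides.

```python
def equal_width_bins(values, n_bins):
    """Split the range [A, B] into n_bins intervals of width W = (B - A) / n_bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value falls on the upper edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return bins


def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (almost) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)  # first `extra` bins get one more
        bins.append(data[start:end])
        start = end
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))
print(equal_depth_bins(prices, 3))
```

On this data the equal-width bins hold 3, 3 and 6 values, while the equal-depth bins hold 4 values each, which illustrates how equal-width partitioning reacts to uneven distributions.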
Binning for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

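The smoothing results of the price example can be reproduced with a short sketch. The function names are hypothetical, bin means are rounded to the nearest integer as in the slide, and ties in boundary smoothing are resolved towards the lower boundary (an assumption; the slide does not say how ties are handled).

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) mean of that bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (assumed convention).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Equal-frequency bins of the sorted price data
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```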
Cluster Analysis

- Similar values are organized into groups
- May be used to detect outliers

Regression

[Figure: scatter plot with the fitted line y = x + 1; a data point (X1, Y1) is replaced by its value Y1' on the line.]

- Data can be smoothed by fitting it to a function
  - Ex: linear regression can be used so that one variable can be used to predict the other

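Smoothing by linear regression can be sketched as an ordinary least-squares fit followed by replacing each observed y by its fitted value. The function names and the noisy sample points are illustrative assumptions, not data from the slides.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def smooth(xs, ys):
    """Replace each y by the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]  # hypothetical noisy observations around y = x + 1

print(fit_line(xs, ys))
print(smooth(xs, ys))
```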
Data Integration

- Data integration: combines data from multiple sources into a coherent store
- Schema integration: integrate metadata from different sources
- Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
  - Also known as record linkage or duplicate elimination

Related problems

- Detecting and resolving data value conflicts
  - For the same real world entity, attribute values from different sources are different
  - Possible reasons: different representations, different scales, e.g., metric vs. British units
- Redundant data occur often when integrating multiple databases
  - Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
  - Redundant attributes may be detected by correlation analysis

Correlation analysis (for numerical data)

- Correlation coefficient (also called Pearson's product moment coefficient)
  $$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n \bar{A} \bar{B}}{(n-1)\,\sigma_A \sigma_B}$$
  where $n$ is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum (AB)$ is the sum of the AB cross-product.
- If $r_{A,B} > 0$, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
- $r_{A,B} = 0$: no linear correlation (the slides say "independent", but a zero coefficient alone does not establish independence); $r_{A,B} < 0$: negatively correlated

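The second form of the correlation formula above translates directly into code. In this sketch the standard deviations are sample standard deviations (divisor n - 1), which is the choice that makes the (n - 1) factor in the denominator give a coefficient between -1 and 1; the function name and the sample vectors are illustrative.

```python
from math import sqrt


def pearson(a, b):
    """r = (sum(AB) - n * mean_A * mean_B) / ((n - 1) * sd_A * sd_B)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Sample standard deviations (divisor n - 1), assumed here.
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum(x * y for x, y in zip(a, b))
    return (cross - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)


a = [1, 2, 3, 4, 5]
print(pearson(a, [2, 4, 6, 8, 10]))   # perfectly positively correlated
print(pearson(a, [10, 8, 6, 4, 2]))   # perfectly negatively correlated
print(pearson(a, [2, 1, 4, 3, 5]))    # positive, but weaker
```

A coefficient close to 1 or -1 between two attributes suggests that one of them may be redundant after integration.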
Correlation Analysis (for categorical data)

- χ² (chi-square) test
  $$\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$$
- The larger the χ² value, the more likely the variables are related
- The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count

Chi-Square: An Example

                            Play chess    Not play chess    Sum (row)
Like science fiction        250 (90)      200 (360)         450
Not like science fiction    50 (210)      1000 (840)        1050
Sum (col.)                  300           1200              1500

- χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):
  $$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$
- It shows that like_science_fiction and play_chess are correlated in the group

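The chi-square example can be checked with a few lines of Python. The sketch derives each expected count from the row and column totals (row total x column total / grand total) and then sums the cell contributions; the function name is hypothetical.

```python
def chi_square(observed):
    """Return (chi2, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count of a cell = row total * column total / grand total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return chi2, expected


# Rows: like / not like science fiction; columns: play / not play chess
observed = [[250, 200], [50, 1000]]
chi2, expected = chi_square(observed)
print(expected)
print(chi2)
```

The expected counts match the values in parentheses in the table, and the statistic agrees with the 507.93 of the slide up to rounding in the last digit.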
Data Transformation

- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: values scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction
  - New attributes constructed from the given ones

Normalization (for numerical data)

- min-max normalization
  $$v' = \frac{v - min_A}{max_A - min_A} (new\_max_A - new\_min_A) + new\_min_A$$
- z-score normalization ($\mu_A$: mean, $\sigma_A$: standard deviation)
  $$v' = \frac{v - \mu_A}{\sigma_A}$$
- normalization by decimal scaling
  $$v' = \frac{v}{10^j}$$
  where $j$ is the smallest integer such that $\max(|v'|) < 1$

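The three normalization formulas can be sketched as small functions. The function names and the numbers (an income attribute ranging from 12000 to 98000 with mean 54000 and standard deviation 16000, and values between -986 and 917) are assumptions chosen for illustration. The decimal-scaling loop only searches non-negative j, which is a simplification for values whose magnitude is at least 0.1.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations away from the mean."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10^j, with j the smallest non-negative integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


print(min_max(73600, 12000, 98000))    # about 0.716
print(z_score(73600, 54000, 16000))    # about 1.225
print(decimal_scaling([-986, 917]))    # j = 3
```

Min-max normalization needs the attribute's extremes and is sensitive to outliers, whereas the z-score only needs the mean and standard deviation.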
Data Reduction

- A data warehouse may store terabytes of data
  - Complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction
  - Obtain a reduced representation of the data set that is much smaller in volume and yet produces the same (or almost the same) analytical results

Data reduction strategies

- Data cube aggregation
- Dimensionality reduction
  - remove unimportant attributes
- Data compression
- Numerosity reduction
  - fit data into models
- Discretization and concept hierarchy generation

Data Cube Aggregation

- Multiple levels of aggregation in data cubes
  - Further reduce the size of the data to deal with
- Queries regarding aggregated information should be answered using the smallest available cuboid

Example of a Data Cube w/ materialized aggregate data

[Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum) and Country (U.S.A, Canada, Mexico, sum); the highlighted cell holds the total annual sales of TV in U.S.A.]

Dimensionality Reduction

- Data sets may contain hundreds of attributes
  - Some are irrelevant or redundant
- Feature selection (i.e., attribute subset selection):
  - Select a minimum set of features such that the probability distribution of the data classes given the values for those features is as close as possible to the original distribution given the values of all features
  - Reduces the number of attributes appearing in the discovered patterns, making them easier to understand

Heuristic Feature Selection Methods

- There are 2^d possible feature subsets of d features
- Several heuristic feature selection methods:
  - Best single features under the feature independence assumption: choose by statistical significance tests
  - Stepwise feature selection
    - The best single feature is picked first
    - Then the next best feature conditioned on the first, ...
  - Stepwise feature elimination
    - Repeatedly eliminate the worst feature
  - Best combined feature selection and elimination
  - Decision tree induction

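Stepwise (forward) feature selection can be sketched as a greedy loop. Everything specific here is an assumption made for illustration: the `score` callback stands in for whatever quality measure is used (a significance test, classifier accuracy, etc.), the stopping rule is a fixed number of features `k`, and the toy weights are invented.

```python
def forward_selection(features, score, k):
    """Greedily add, one at a time, the feature that most improves score(subset)."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        # Evaluate each candidate together with the features already chosen.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical usefulness of each attribute; a real score would be measured on data.
toy_weights = {"A1": 0.5, "A2": 0.1, "A3": 0.0, "A4": 0.9, "A5": 0.05, "A6": 0.3}


def toy_score(subset):
    return sum(toy_weights[f] for f in subset)


print(forward_selection(toy_weights, toy_score, k=3))
```

The loop evaluates on the order of d x k subsets instead of all 2^d, which is the point of using a heuristic. Stepwise elimination is the mirror image: start from all features and repeatedly drop the one whose removal hurts the score least.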
Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Figure: decision tree with root test A4?, whose two branches lead to the tests A1? and A6?; each of these tests has the leaves Class 1 and Class 2.]

=> Reduced attribute set: {A1, A4, A6}

Data Compression
  [Figure: lossless compression maps Original Data to Compressed Data and back; lossy compression only recovers "Original Data Approximated"]
Data Compression (examples)
  String compression
    There are extensive theories and well-tuned algorithms
    Typically lossless
    But only limited manipulation is possible without expansion
  Audio/video compression
    Typically lossy compression, with progressive refinement
    Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
Wavelet Transformation (1)
  Discrete wavelet transform (DWT): linear signal processing
  Transforms a data vector D into a numerically different vector D', with the same length, of wavelet coefficients
  Compressed approximation: store only a small fraction of the strongest wavelet coefficients
  Good results on sparse or skewed data and on data with ordered attributes; can be applied to multidimensional data
Wavelet Transformation (2)
  Similar to the discrete Fourier transform (DFT), but with better lossy compression, localized in space
  Method (hierarchical pyramid algorithm):
    The length, L, must be an integer power of 2 (padding with 0s when necessary)
    Each transform has 2 functions: smoothing and difference
    They are applied to pairs of data, resulting in two sets of data of length L/2
    The two functions are applied recursively until the desired length is reached
    The wavelet coefficients are a selection of the values obtained
  [Figure: Haar-2 and Daubechies-4 wavelets]
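The pyramid algorithm can be sketched for the Haar wavelet as follows. This is a simplified illustration, assuming the unnormalized variant in which smoothing is the pairwise average and difference is half the pairwise difference; the function names are made up for the example.

```python
def haar_dwt(data):
    """Haar wavelet transform via the pyramid algorithm.

    len(data) must be a power of 2 (pad with zeros beforehand if needed).
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    out = [float(x) for x in data]
    length = n
    while length > 1:
        half = length // 2
        smooth = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        detail = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:length] = smooth + detail  # recurse on the smoothed half only
        length = half
    return out


def haar_inverse(coeffs):
    """Rebuild the data vector from the coefficients produced by haar_dwt."""
    out = list(coeffs)
    length = 1
    while length < len(out):
        merged = []
        for s, d in zip(out[:length], out[length:2 * length]):
            merged += [s + d, s - d]
        out[:2 * length] = merged
        length *= 2
    return out


def keep_strongest(coeffs, k):
    """Compressed approximation: zero all but the k largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


coeffs = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print(coeffs)
print(haar_inverse(keep_strongest(coeffs, 4)))
```

With all coefficients kept the inverse reproduces the input exactly; with only the strongest ones it returns an approximation of the same length. Ranking unnormalized coefficients by magnitude is a simplification; an orthonormal variant would scale each level first.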
Principal Component Analysis
  Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
  Each data vector is a linear combination of the c principal component vectors
  Works for numeric data only
  Used when the number of dimensions is large; computationally inexpensive; can be applied to ordered and unordered attributes; can handle sparse and skewed data
PCA basic procedure
  Input data are normalized
    Attributes w/ large domains do not dominate attributes w/ smaller domains
  Computes c orthogonal unit vectors: the principal components
    Each input data vector is a linear combination of the principal components
  Principal components are sorted according to decreasing strength (variance among the data)
  Size of the data is reduced by eliminating the weaker components (w/ lower variance)
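A rough sketch of this procedure in plain Python. It is not a production implementation: the components are found by power iteration with deflation (a choice made here to avoid external libraries), and the function names and sample rows are invented for the example.

```python
import math


def normalize(rows):
    """Z-score every attribute so that large domains do not dominate."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def covariance(rows):
    """Covariance matrix of rows that are already centred."""
    n, k = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(k)]
            for i in range(k)]


def dominant_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    k = len(matrix)
    vec = [float(i + 1) for i in range(k)]  # arbitrary non-zero start vector
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:
            break  # no variance left in this matrix
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * matrix[i][j] * vec[j]
                for i in range(k) for j in range(k))
    return value, vec


def pca(rows, c):
    """Return the c strongest (variance, component) pairs and the reduced data."""
    data = normalize(rows)
    cov = covariance(data)
    k = len(cov)
    components = []
    for _ in range(c):
        value, vec = dominant_eigenpair(cov)
        components.append((value, vec))
        # Deflation: remove the variance explained by this component.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(k)]
               for i in range(k)]
    reduced = [[sum(x * w for x, w in zip(row, vec)) for _, vec in components]
               for row in data]
    return components, reduced


rows = [(2.5, 24.0), (0.5, 7.0), (2.2, 29.0), (1.9, 22.0), (3.1, 30.0), (2.3, 27.0)]
components, reduced = pca(rows, c=1)
print(components)
print(reduced)
```

Because each round extracts the dominant direction of what is left, the components come out already sorted by decreasing variance; keeping only the first c of them is the data reduction step.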
Principal Component Analysis
  [Figure: data plotted on the original axes X1 and X2, with the principal component axes Y1 and Y2 overlaid]
Numerosity Reduction
  Parametric methods
    Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  Non-parametric methods
    Do not assume models
    Major families: histograms, clustering, sampling
Regression Models
  Linear regression: data are modeled to fit a straight line, $Y = \alpha + \beta X$
    Two parameters, $\alpha$ and $\beta$, specify the line and are estimated from the data at hand
    They are obtained by applying the least squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
  Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector, e.g. $Y = b_0 + b_1 X_1 + b_2 X_2$
    Many nonlinear functions can be transformed into the above
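For the straight-line case, the least squares estimates have a closed form. The sketch below assumes the usual formulas, $\beta = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and $\alpha = \bar{y} - \beta \bar{x}$; the function name and the sample values are illustrative.

```python
def fit_line(xs, ys):
    """Least squares estimates (alpha, beta) for Y = alpha + beta * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all X values are equal; the slope is undefined")
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Illustrative data lying exactly on Y = 1 + 2X.
alpha, beta = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(alpha, beta)
```

Once the two parameters are stored, the original (X, Y) pairs can be discarded and approximated by the line, which is the numerosity reduction.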
Histograms (example)
  [Figure: two histograms, "Both sexes 1998" and "Both sexes 2025"; x-axis from 30 to 90, y-axis from 0 to 100]
Histograms
  A popular data reduction technique; uses binning to approximate data distributions
  Divides data into buckets and stores the average frequency for each bucket
  Different partitioning rules: equal-width, equal-frequency, V-optimal, max-diff
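The first two partitioning rules can be sketched as follows. The function names and the sample values are assumptions for the example; each bucket is returned as a (low, high, count) triple.

```python
def equal_width_buckets(values, n_buckets):
    """Buckets that each span the same range of values."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [(lo, hi, len(values))]
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # The maximum value falls into the last bucket.
        idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


def equal_frequency_buckets(values, n_buckets):
    """Buckets that each hold (roughly) the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for i in range(n_buckets):
        chunk = ordered[i * n // n_buckets:(i + 1) * n // n_buckets]
        if chunk:
            buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets


values = [0, 1, 2, 3, 10, 11, 12, 30]
print(equal_width_buckets(values, 3))
print(equal_frequency_buckets(values, 4))
```

On skewed data such as this sample, equal-width buckets end up with very different counts, while equal-frequency buckets adapt their boundaries to the data.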
Cluster Analysis
  Similar values are organized into groups
  May be used to detect outliers
Clustering
  Partition the data set into clusters, and one can store the cluster representation only
  Cluster quality can be measured by its diameter or the centroid distance
  Can have hierarchical clustering and be stored in multi-dimensional index tree structures
  There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8
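A small sketch of the two quality measures, under one common reading of the terms: the diameter as the largest distance between two objects of the cluster, and the centroid distance as the average distance of the objects to the centroid. The function names and the sample points are invented for the example.

```python
import math
from itertools import combinations


def centroid(points):
    """Component-wise mean of the points in a cluster."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def diameter(points):
    """Largest distance between any two points of the cluster."""
    return max((math.dist(p, q) for p, q in combinations(points, 2)), default=0.0)


def centroid_distance(points):
    """Average distance from each point to the cluster centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


cluster = [(0, 0), (2, 0), (0, 2), (2, 2)]
# Storing only this summary instead of the points is the data reduction step.
summary = (centroid(cluster), diameter(cluster), centroid_distance(cluster))
print(summary)
```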
Sampling
  Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
    Proportional to the sample size
  Choose a representative subset of the data
    Simple random sampling may have very poor performance in the presence of skew
  Develop adaptive sampling methods
    Stratified sampling:
      Approximate the percentage of each class (or subpopulation of interest) in the overall database
      Used in conjunction with skewed data
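The sampling schemes can be sketched with the standard `random` module. The function names, the per-stratum rounding rule and the sample data are assumptions made for this illustration.

```python
import random
from collections import defaultdict


def srswor(data, n, rng):
    """Simple random sample without replacement."""
    return rng.sample(data, n)


def srswr(data, n, rng):
    """Simple random sample with replacement."""
    return [rng.choice(data) for _ in range(n)]


def stratified_sample(data, key, fraction, rng):
    """Sample the same fraction from every stratum (at least one item each)."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Skewed data: 90 "young" records and only 10 "senior" records.
data = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
print(len(srswor(data, 10, rng)), len(srswr(data, 10, rng)))
print(stratified_sample(data, key=lambda r: r[0], fraction=0.1, rng=rng))
```

A simple random sample of 10 records may contain no "senior" record at all, whereas the stratified sample keeps the class proportions of the full data set.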
Sampling
  [Figure: Raw Data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]
Sampling
  [Figure: Raw Data next to the corresponding Cluster/Stratified Sample]
Data Preprocessing
  Why preprocess the data?
  Data cleaning
  Data integration and transformation
  Data reduction
  Discretization and concept hierarchy generation
Discretization
  Three types of attributes:
    Nominal: values from an unordered set
    Ordinal: values from an ordered set
    Continuous: real numbers
  Discretization:
    Divide the range of a continuous attribute into intervals
    Interval labels are used to replace actual values
    Some classification algorithms only accept categorical attributes
    Reduce data size by discretization
Discretization and Concept Hierarchy
  Discretization
    Reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals
  Concept hierarchies
    Reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior)
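Replacing numeric ages by higher level concepts can be sketched as a lookup over interval boundaries. The cut points 30 and 60 are hypothetical; the slides do not say where young, middle-aged and senior begin.

```python
from bisect import bisect_right

AGE_CUTS = [30, 60]  # hypothetical interval boundaries
AGE_LABELS = ["young", "middle-aged", "senior"]


def age_concept(age):
    """Map a numeric age to the label of the interval that contains it."""
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]


ages = [23, 35, 59, 60, 71]
print([age_concept(a) for a in ages])
```

After the replacement the attribute has three distinct values instead of one per age, which is the reduction the slide refers to.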
Discretization techniques
  Supervised vs. unsupervised: whether or not the discretization process uses class information
  Top-down (splitting): finds one or a few points to split the entire range of the attribute and then repeats this recursively on the resulting intervals
  Bottom-up (merging): starts with all the continuous values as potential split points, merges neighboring values into intervals, and then merges recursively
Discretization and Concept Hierarchy Generation for Numeric Data
  Binning
  Histogram analysis
  Clustering analysis
  Entropy-based discretization
  Segmentation by natural partitioning
Entropy-Based Discretization (1)
  Given a set of samples S, if S is partitioned into two intervals $S_1$ and $S_2$ using boundary T, the expected information (weighted entropy) after partitioning is
    $I(S, T) = \frac{|S_1|}{|S|} Entropy(S_1) + \frac{|S_2|}{|S|} Entropy(S_2)$
  Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of $S_1$ is
    $Entropy(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
  where $p_i$ is the probability of class i in $S_1$
Entropy-Based Discretization (2)
  The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
  The process is recursively applied to the partitions obtained until some stopping criterion is met
  Such a boundary may reduce data size and improve classification accuracy
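The recursive procedure can be sketched as below. The stopping criterion used here (stop when the entropy reduction falls under `min_gain`) is an assumption, since the slides only say "some stopping criterion"; candidate boundaries are taken as midpoints between consecutive distinct values, and the names and sample data are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_boundary(samples):
    """Return (T, I(S, T)) for the boundary T with the lowest weighted entropy.

    samples is a list of (value, class_label) pairs.
    """
    ordered = sorted(samples, key=lambda s: s[0])
    n = len(ordered)
    best = None
    for i in range(1, n):
        left_value, right_value = ordered[i - 1][0], ordered[i][0]
        if left_value == right_value:
            continue  # no boundary between equal values
        s1 = [label for _, label in ordered[:i]]
        s2 = [label for _, label in ordered[i:]]
        info = len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)
        if best is None or info < best[1]:
            best = ((left_value + right_value) / 2, info)
    return best


def discretize(samples, min_gain=0.1):
    """Recursively split; return the sorted list of selected boundaries."""
    found = best_boundary(samples)
    if found is None:
        return []
    boundary, info = found
    if entropy([label for _, label in samples]) - info < min_gain:
        return []  # assumed stopping criterion: split no longer pays off
    left = [s for s in samples if s[0] <= boundary]
    right = [s for s in samples if s[0] > boundary]
    return discretize(left, min_gain) + [boundary] + discretize(right, min_gain)


samples = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
print(best_boundary(samples))
print(discretize(samples))
```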
Interval Merge by $\chi^2$ Analysis
  Merging-based (bottom-up)
    Finds the best neighboring intervals and merges them to form larger intervals recursively
  ChiMerge
    Initially, each distinct value of a numerical attribute A is considered to be one interval
    $\chi^2$ tests are performed for every pair of adjacent intervals
    Adjacent intervals with the least $\chi^2$ values are merged together
    This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.)
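A compact sketch of ChiMerge, assuming the Pearson $\chi^2$ statistic on the class counts of two adjacent intervals and two stopping criteria (a $\chi^2$ threshold and an optional maximum number of intervals). The function names and the sample data are illustrative.

```python
from collections import Counter


def chi2(a, b, classes):
    """Pearson chi-square statistic for the class counts of two intervals."""
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:
                stat += (row[c] - expected) ** 2 / expected
    return stat


def chimerge(samples, threshold, max_intervals=None):
    """Return the lower bounds of the intervals left after merging.

    samples is a list of (value, class_label) pairs.
    """
    classes = sorted({label for _, label in samples})
    counts = {}
    for value, label in samples:
        counts.setdefault(value, Counter())[label] += 1
    # Initially every distinct value is its own interval.
    intervals = [(value, counts[value]) for value in sorted(counts)]
    while len(intervals) > 1:
        scores = [chi2(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        i = min(range(len(scores)), key=scores.__getitem__)
        too_many = max_intervals is not None and len(intervals) > max_intervals
        if scores[i] > threshold and not too_many:
            break  # every adjacent pair differs significantly
        low, left_counts = intervals[i]
        intervals[i:i + 2] = [(low, left_counts + intervals[i + 1][1])]
    return [low for low, _ in intervals]


samples = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
# 2.706 is the chi-square critical value at the 0.10 level for 1 degree of freedom.
print(chimerge(samples, threshold=2.706))
```

A low $\chi^2$ value means the two intervals have similar class distributions, so merging them loses little class information.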
Segmentation by Natural Partitioning
  A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals
    If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 intervals (equi-width for 3, 6 and 9; grouped 2-3-2 for 7)
    If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
    If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
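One level of the 3-4-5 rule can be sketched as follows. Several details are assumptions made for this illustration: the most significant digit is taken from the larger absolute bound, the bounds are rounded outwards to that digit, a count of 7 is split 2-3-2, and any other unsupported count raises an error.

```python
import math


def msd_unit(low, high):
    """Unit of the most significant digit, e.g. 1000 for a bound of 1838."""
    largest = max(abs(low), abs(high))
    if largest == 0:
        raise ValueError("range must not be degenerate at zero")
    return 10 ** math.floor(math.log10(largest))


def partition_345(low, high):
    """Split [low, high] into 3, 4 or 5 'natural' intervals."""
    unit = msd_unit(low, high)
    lo = math.floor(low / unit) * unit   # round down to the msd
    hi = math.ceil(high / unit) * unit   # round up to the msd
    distinct = round((hi - lo) / unit)   # distinct values at the msd
    if distinct == 7:
        edges = [lo, lo + 2 * unit, lo + 5 * unit, hi]  # 2-3-2 grouping
    else:
        if distinct in (3, 6, 9):
            parts = 3
        elif distinct in (2, 4, 8):
            parts = 4
        elif distinct in (1, 5, 10):
            parts = 5
        else:
            raise ValueError(f"rule does not cover {distinct} distinct values")
        step = (hi - lo) / parts
        edges = [lo + i * step for i in range(parts + 1)]
    return list(zip(edges, edges[1:]))


# 5%-tile and 95%-tile of a profit attribute.
print(partition_345(-159, 1838))
# Second level: refine one of the resulting intervals.
print(partition_345(0, 1000))
```

Applying the function again to each resulting interval gives the next level of the concept hierarchy.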
Example of 3-4-5 Rule
  [Figure: distribution of profit (x-axis) against count (y-axis)]
  Step 1: Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700
  Step 2: msd = 1,000, Low = -$1,000, High = $2,000
  Step 3: (-$1,000 - $2,000) is partitioned into (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)
  Step 4: (-$400 - $5,000) is partitioned into
    (-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
    (0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
    ($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
    ($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
Concept Hierarchy Generation for Categorical Data
  Specification of a partial ordering of attributes explicitly at the schema level by users or experts
    street < city < state < country
  Specification of a portion of a hierarchy by explicit data grouping
    {Urbana, Champaign, Chicago} < Illinois
  Specification of a set of attributes
    The system automatically generates the partial ordering by analysis of the number of distinct values
    E.g., street < city < state < country
  Specification of only a partial set of attributes
    E.g., only street < city, not others
Automatic Concept Hierarchy Generation
  Some concept hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the given data set
    The attribute with the most distinct values is placed at the lowest level of the hierarchy
    Note the exceptions: weekday, month, quarter, year
  [Figure: hierarchy from top to bottom]
    country: 15 distinct values
    province_or_state: 65 distinct values
    city: 3567 distinct values
    street: 674,339 distinct values
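The ordering heuristic amounts to sorting the attributes by their number of distinct values. A minimal sketch, with invented function names and a tiny made-up record set for the counting step:

```python
def count_distinct(records):
    """Number of distinct values per attribute in a list of dict records."""
    values = {}
    for record in records:
        for attribute, value in record.items():
            values.setdefault(attribute, set()).add(value)
    return {attribute: len(seen) for attribute, seen in values.items()}


def generate_hierarchy(distinct_counts):
    """Order attributes from the top level (fewest distinct values) down."""
    return sorted(distinct_counts, key=distinct_counts.get)


# Distinct-value counts taken from the figure.
counts = {"street": 674339, "country": 15, "city": 3567, "province_or_state": 65}
print(" > ".join(generate_hierarchy(counts)))

# Counting from (hypothetical) records instead of using known counts.
records = [
    {"country": "Portugal", "city": "Lisboa"},
    {"country": "Portugal", "city": "Porto"},
    {"country": "Portugal", "city": "Oeiras"},
]
print(generate_hierarchy(count_distinct(records)))
```

The heuristic fails for attributes such as weekday, which has only 7 distinct values but does not sit above month or year, so the generated order should be reviewed by a user or expert.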
Bibliography
  (Book) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Chapter 3 in the 2001 book, Chapter 2 in the draft)
  (Report) Expectativa de vida ao nascer, por Região, País e Sexo: 1998 e 2025, Lurdes Jesus, FCUL, 2003