Preprocessing, Transformation and Cleaning of Data
(based on the slides of the book Data Mining: C & T)

2003/04 Sistemas de Apoio à Decisão (Decision Support Systems)

(LEIC Tagus)

Front-end applications of DW

Information processing: querying, basic statistical analysis, and reporting using crosstabs, tables, charts or graphs
Analytical processing: multidimensional data analysis through basic OLAP operations (slice/dice, drill-down, roll-up, pivoting, etc.)
Data mining: knowledge discovery by finding hidden patterns and associations, building analytical models, performing classification and prediction, and presenting results through visualization tools

Application context

Construction of a data repository for data analysis:
- also called pre-processing (data mining context) or ETL process (DW context)
- querying, reporting, analytical processing and data mining require quality data
Migration of data from a source to a target schema:
- poorly structured to structured data
- to support application migration
Enhancement of a single data source:
- eliminating errors, duplicates, inconsistencies

Data Preprocessing

Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation

Example (1)

Sales Fact Table:
- keys: time_key, item_key, branch_key, location_key
- measures: units_sold, dollars_sold, avg_sales
Dimension time: time_key, day, day_of_the_week, month, quarter, year
Dimension location: location_key, street, city, province_or_street, country
Dimension item: item_key, item_name, brand, type, supplier_type
Dimension branch: branch_key, branch_name, branch_type

Example (2)

Suppose we want to analyze the company's data with respect to the sales at a given branch
Select attributes and dimensions to be included in the analysis: item, price, units_sold, etc.
May find out that....

Why Data Preprocessing?

Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., occupation=“”
- noisy: containing errors or outliers (spelling, phonetic and typing errors, word transpositions, multiple values in a single free-form field)
  e.g., Salary=“-10”
- inconsistent: containing discrepancies in codes or names (synonyms and nicknames, prefix and suffix variations, abbreviations, truncation and initials)
  e.g., Age=“42”, Birthday=“03/07/1997”
  e.g., was rating “1, 2, 3”, now rating “A, B, C”
  e.g., discrepancy between duplicate records

Why Is Data Dirty?

Incomplete data comes from:
- data value not available when collected
- different criteria between the time when the data was collected and when it is analyzed
- human/hardware/software problems
Noisy data comes from:
- data collection: faulty instruments
- data entry: human or computer errors
- data transmission
Inconsistent (and redundant) data comes from:
- different data sources, hence non-uniform naming conventions/data codes
- functional dependency and/or referential integrity violations

Why Is Data Preprocessing Important?

Data warehouse needs consistent integration of quality data:
- data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
No quality data, no quality mining results!
- quality decisions must be based on quality data (e.g., duplicate or missing data may cause incorrect or even misleading statistics)

Major Tasks in Data Preprocessing

Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration: integration of multiple databases, data cubes, or files
Data transformation: normalization and aggregation
Data reduction: obtains a representation that is reduced in volume but produces the same or similar analytical results
Data discretization: part of data reduction but with particular importance, especially for numerical data

Forms of data preprocessing

One methodology for the ETL process (L. English)

Parsing
Correction: ZIP or postal codes, addresses (field)
Standardization: casing, soundex/phonetic equivalent, dictionary spelling, column splitting or merging, filtering out stopwords, conversion to a standard format (e.g., dates)
Matching or record linkage: exact matches, wild card, soundex, keying fields or combination of fields, text indexing, edit distance, signatures
Consolidation (enhancement and merging): duplicate with more information is kept, source priority, most recent update, most frequently occurring, random choice, field contents, take an equal number of fields from each source










Descriptive data summarization

Motivation: to better understand the data (central tendency, variation and spread)
Measures of central tendency: mean, median, mode, midrange
Measures of data dispersion: quartiles, interquartile range, outliers, variance, etc.
Goal: efficiently compute these measures in large DBs

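Measuring the Central Tendency

Mean (algebraic measure): $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
- Trimmed mean: chopping extreme values
Median (holistic measure):
- Middle value if odd number of values, or average of the middle two values otherwise
- Estimated by interpolation (for grouped data): $median = L_1 + \left( \frac{n/2 - (\sum f)_l}{f_{median}} \right) c$
  where $L_1$ is the lower boundary of the median interval, $n$ the number of values, $(\sum f)_l$ the sum of the frequencies of the intervals below the median interval, $f_{median}$ the frequency of the median interval, and $c$ the interval width
Mode:
- Value that occurs most frequently in the data
- Unimodal, bimodal, trimodal
- Empirical formula: $mean - mode = 3 \times (mean - median)$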
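As a rough illustration of the central tendency measures, the following Python sketch computes them with the standard library. The helper names (`weighted_mean`, `trimmed_mean`, `midrange`) and the sample values are assumptions made for this example and are not part of the slides.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after chopping `fraction` of the values at each end of the sorted data."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values dropped at each end
    return mean(ordered[k:len(ordered) - k])


def midrange(values):
    """Average of the smallest and the largest value."""
    return (min(values) + max(values)) / 2


# Illustrative sample (hypothetical prices in dollars).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

print("mean:", mean(prices))
print("median:", median(prices))        # average of the two middle values
print("mode(s):", multimode(prices))    # a list, since data may be multimodal
print("midrange:", midrange(prices))
print("trimmed mean (25% each end):", trimmed_mean(prices, 0.25))
```

The mean and weighted mean are algebraic measures and can be computed from partial sums, whereas the median needs the whole sorted sample, which is why the slides call it holistic.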

Median, mean and mode of symmetric data

Positively Skewed Data

Mode appears at a point smaller than the median

Negatively Skewed Data

Mode appears at a point greater than the median

Negatively skewed data (example)

Measuring the Dispersion of Data (1)

Quartiles, outliers and boxplots:
- Quartiles: Q1 (25th percentile), Q3 (75th percentile)
- Interquartile range: IQR = Q3 - Q1
- Five-number summary: min, Q1, M, Q3, max
- Boxplot: the ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually
- Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1

Quartiles

Kth percentile of a set of data in numerical order: value x such that k% of the data entries lie at or below x
- The median is the 50th percentile
Quartiles: most commonly used percentiles, give an indication of the center, spread and shape of a distribution
- Q1: 25th percentile; Q3: 75th percentile
- Interquartile range: IQR = Q3 - Q1
- Outliers: values more than 1.5 x IQR above Q3 or below Q1
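The quartile-based outlier rule can be sketched in Python as follows. Several conventions exist for computing quartiles; this example assumes the "inclusive" interpolation of `statistics.quantiles`, and the function names and sample values are hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartile interpolation."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]


def iqr_outliers(values):
    """Values more than 1.5 x IQR above Q3 or below Q1."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Illustrative sample with one suspiciously large value.
sample = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 90]
summary = five_number_summary(sample)
outliers = iqr_outliers(sample)
print(summary, outliers)
```

A different quartile convention may move Q1 and Q3 slightly, so values close to the 1.5 x IQR fences can be classified differently.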

Boxplot Analysis

Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
Boxplot:
- Data is represented with a box
- The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
- The median is marked by a line within the box
- Whiskers: two lines outside the box extend to Minimum and Maximum

Parallel boxplots (example)

[Figure: parallel boxplots of life expectancy at birth (in years, axis from 30 to 100) for females, males and both sexes, in 1998 and 2025. Labeled countries include Kenya, Zimbabwe, Namibia, Botswana, Swaziland, Rwanda, Zambia, Ethiopia, Malawi, Uganda, Côte d'Ivoire, Niger, Burundi, Tanzania, Burkina Faso and Lesotho.]

Visualization of Data Dispersion: Boxplot Analysis

Measuring the Dispersion of Data (2)

Variance and standard deviation:
- Variance $s^2$: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
- Standard deviation $s$ is the square root of the variance $s^2$
- $s$ measures spread about the mean
- $s = 0$ when there is no spread, i.e., all observations have the same value
- Both are algebraic measures, allowing scalable computation
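The two forms of the variance formula can be compared with a short Python sketch. The second form needs only the running totals of the values and of their squares, which is what makes a single scan over a large database possible. The function names and the sample are assumptions for illustration.

```python
from math import sqrt


def variance_two_pass(values):
    """s^2 = 1/(n-1) * sum((x_i - mean)^2); needs the mean first, hence two passes."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)


def variance_one_pass(values):
    """s^2 = 1/(n-1) * [sum(x_i^2) - (sum(x_i))^2 / n], accumulated in a single scan."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


sample = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
s2 = variance_one_pass(sample)
print("variance:", s2, "standard deviation:", sqrt(s2))
```

With floating-point arithmetic the one-pass form can lose precision when the mean is large relative to the spread, so it trades some numerical robustness for scalability.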

Properties of Normal Distribution Curve

The normal (distribution) curve (μ: mean, σ: standard deviation):
- From μ–σ to μ+σ: contains about 68% of the measurements
- From μ–2σ to μ+2σ: contains about 95% of the measurements
- From μ–3σ to μ+3σ: contains about 99.7% of the measurements

Graphic Displays of Basic Statistical Descriptions

Graph displays of basic statistical class descriptions:
- Boxplot
- Histogram
- Quantile plot
- Quantile-quantile (q-q) plot
- Scatter plot
- Loess (local regression) curve

Histogram Analysis

Frequency histograms:
- A univariate graphical method
- Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data

Histograms (example)

[Figure: two frequency histograms, "Ambos os sexos 1998" and "Ambos os sexos 2025" (both sexes); x-axis from 30 to 90, y-axis from 0 to 100]

Quantile Plot

Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
Plots quantile information:
- For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i

Quantile Plot (example)

[Figure: quantile plots for both sexes ("Quantis Ambos os sexos"); x-axis from 0.1 to 0.9, y-axis from 50 to 80]

Quantile-Quantile (Q-Q) Plot

Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Allows the user to view whether there is a shift in going from one distribution to another

Q-Q Plot (example)

[Figure: Q-Q plot of AS1998 (x-axis, 40 to 80) against AS2025 (y-axis, 40 to 80)]

Scatter plot

Provides a first look at bivariate data to see clusters of points, outliers, etc.
Each pair of values is treated as a pair of coordinates and plotted as a point in the plane

Scatter plot (example)

Loess Curve

Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression

Positively and Negatively Correlated Data

Not Correlated Data










Data Cleaning

Importance:
- “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
- “Data cleaning is the number one problem in data warehousing” (DCI survey)
Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration

Missing Data

Data is not always available:
- Ex: many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to:
- equipment malfunction
- data that was inconsistent with other recorded data and thus deleted
- data not entered due to misunderstanding
- certain data not being considered important at the time of entry
- history or changes of the data not being registered
Missing data may need to be inferred.

How to Handle Missing Data?

Ignore the tuple:
- not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually:
- tedious and infeasible with large data sets
Fill it in automatically with:
- a global constant, e.g., “unknown”; not recommended
- the attribute mean
- the attribute mean for all samples belonging to the same class: smarter
- the most probable value: inference-based, such as a Bayesian formula or a decision tree
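The "attribute mean for all samples belonging to the same class" strategy can be sketched in Python as below. The record layout (a list of dictionaries with `None` marking a missing value), the attribute names and the function name are assumptions made for this example.

```python
def fill_with_class_mean(rows, attribute, class_attribute):
    """Return copies of `rows` in which missing (None) values of `attribute`
    are replaced by the mean of that attribute among rows of the same class."""
    sums = {}
    counts = {}
    for row in rows:
        value = row[attribute]
        if value is not None:
            key = row[class_attribute]
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1

    filled = []
    for row in rows:
        new_row = dict(row)
        key = row[class_attribute]
        # Leave the value missing if the class has no observed value at all.
        if new_row[attribute] is None and key in counts:
            new_row[attribute] = sums[key] / counts[key]
        filled.append(new_row)
    return filled


# Hypothetical customer records with a class label "risk".
customers = [
    {"income": 30000, "risk": "low"},
    {"income": 50000, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 20000, "risk": "high"},
    {"income": None, "risk": "high"},
]
filled = fill_with_class_mean(customers, "income", "risk")
print(filled)
```

Using the plain attribute mean instead amounts to treating all rows as one class; the class-conditional version keeps the filled values closer to what is typical for each group.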

Noisy Data

Noise: random error or variance in a measured variable
Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitation
- inconsistency in naming convention
Other data problems that require data cleaning:
- duplicate records
- incomplete data
- inconsistent data

How to Handle Noisy Data?

Binning:
- first sort the data and partition it into (equal-frequency) bins
- then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
Clustering:
- detect and remove outliers
Combined computer and human inspection:
- detect suspicious values and have them checked by a human (e.g., deal with possible outliers)
Regression:
- smooth by fitting the data to regression functions

Simple Discretization Methods: Binning

Equal-width (distance) partitioning:
- Divides the range into N intervals of equal size: uniform grid
- If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
- The most straightforward, but outliers may dominate the presentation
- Skewed data is not handled well
Equal-depth (frequency) partitioning:
- Divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling
- Managing categorical attributes can be tricky
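A minimal Python sketch of the two partitioning schemes follows. It assumes numeric values with B > A, places the maximum value in the last equal-width interval, and uses hypothetical function names and sample prices.

```python
def equal_width_bins(values, n_bins):
    """Split the range [A, B] into n_bins intervals of width W = (B - A) / n_bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum falls exactly on the upper edge, so clamp it into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        bins[index].append(v)
    return bins


def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (approximately) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))   # bins cover [4, 14), [14, 24), [24, 34]
print(equal_depth_bins(prices, 3))   # four values per bin
```

On this sample the equal-width bins hold 3, 3 and 6 values, which illustrates how uneven data fills a uniform grid unevenly, while the equal-depth bins always hold 4 values each.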

Binning for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
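The smoothing steps of this example can be reproduced with a short Python sketch. The function names are hypothetical; bin means are rounded to the nearest integer, as the slide appears to do, and a value equidistant from both boundaries is assigned to the lower one (an assumption, since the example has no ties).

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) mean of the bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value in a bin by the closer of the bin's minimum and maximum."""
    result = []
    for b in bins:
        lo, hi = min(b), max(b)
        result.append([lo if v - lo <= hi - v else hi for v in b])
    return result


# Equal-frequency bins from the sorted price data.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(by_means)
print(by_boundaries)
```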

Cluster Analysis

Similar values are organized into groups
May be used to detect outliers

Regression

Data can be smoothed by fitting it to a function
- Ex: linear regression can be used so that one variable can be used to predict the other

[Figure: data points in the x-y plane with the fitted line y = x + 1; for the point (X1, Y1), the value on the line is Y1']







Data Integration

Data integration: combines data from multiple sources into a coherent store
Schema integration: integrate metadata from different sources
Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
- Also known as record linkage or duplicate elimination

Related problems

Detecting and resolving data value conflicts:
- For the same real world entity, attribute values from different sources are different
- Possible reasons: different representations, different scales, e.g., metric vs. British units
Redundant data occur often when integrating multiple databases:
- Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
- Redundant attributes may be detected by correlation analysis

Correlation analysis (for numerical data)

Correlation coefficient (also called Pearson’s product moment coefficient):

$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n \bar{A} \bar{B}}{(n-1)\,\sigma_A \sigma_B}$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum (AB)$ is the sum of the AB cross-product.

If $r_{A,B}$ > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
$r_{A,B}$ = 0: no linear correlation (this alone does not guarantee independence); $r_{A,B}$ < 0: negatively correlated
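The correlation coefficient can be computed directly from its definition, as in this Python sketch. It uses the sample standard deviation (dividing by n - 1) so that the (n - 1) factor in the formula cancels consistently; the function name and the data are assumptions for illustration.

```python
from math import sqrt


def pearson(a, b):
    """r = (sum(AB) - n * mean_A * mean_B) / ((n - 1) * sigma_A * sigma_B)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (n - 1 in the denominator).
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum(x * y for x, y in zip(a, b))
    return (cross - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)


a = [1, 2, 3, 4, 5]
print(pearson(a, [2, 4, 6, 8, 10]))   # B grows with A
print(pearson(a, [10, 8, 6, 4, 2]))   # B shrinks as A grows
```

A value close to +1 or -1 for two attributes suggests that one of them is redundant and could be dropped during integration.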


Correlation Analysis (for categorical data)

Χ² (chi-square) test:

$\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$

The larger the Χ² value, the more likely the variables are related
The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count

Chi-Square: An Example

                            Play chess    Not play chess    Sum (row)
Like science fiction        250 (90)      200 (360)         450
Not like science fiction    50 (210)      1000 (840)        1050
Sum (col.)                  300           1200              1500

Χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$

It shows that like_science_fiction and play_chess are correlated in the group
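The expected counts and the Χ² value of this example can be recomputed with a few lines of Python. The expected count of a cell is taken as (row sum x column sum) / total, which matches the numbers in parentheses in the table; the function name is hypothetical.

```python
def chi_square(observed):
    """Return (chi2, expected) for a contingency table given as a list of rows."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)

    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, obs in enumerate(row):
            exp = row_sums[i] * col_sums[j] / total
            expected_row.append(exp)
            chi2 += (obs - exp) ** 2 / exp
        expected.append(expected_row)
    return chi2, expected


# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200], [50, 1000]]
chi2, expected = chi_square(observed)
print(expected)
print(chi2)   # about 507.9, in line with the slide's 507.93
```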

Data Transformation

Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: values scaled to fall within a small, specified range
- min-max normalization
- z-score normalization
- normalization by decimal scaling
Attribute/feature construction:
- New attributes constructed from the given ones


Normalization (for numerical data)

min-max normalization:

$v' = \frac{v - min_A}{max_A - min_A} (new\_max_A - new\_min_A) + new\_min_A$

z-score normalization (μ: mean, σ: standard deviation):

$v' = \frac{v - \mu_A}{\sigma_A}$

normalization by decimal scaling:

$v' = \frac{v}{10^j}$, where j is the smallest integer such that Max(|v'|) < 1
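The three normalization formulas translate almost literally into Python. This sketch assumes a non-constant attribute (max > min and σ > 0), uses the population standard deviation for the z-score, and restricts j to non-negative integers; these choices and the function names are assumptions, not prescribed by the slides.

```python
from math import sqrt


def min_max(values, new_min=0.0, new_max=1.0):
    """v' = (v - min) / (max - min) * (new_max - new_min) + new_min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """v' = (v - mean) / std, with the population standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """v' = v / 10^j with the smallest j >= 0 such that max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(min_max([10, 20, 30]))
print(z_score([10, 20, 30]))
print(decimal_scaling([-986, 200, 917]))   # j = 3 for this range
```

Min-max normalization preserves the relative spacing of values but is sensitive to outliers, since a single extreme value stretches the range; the z-score is usually the safer choice when the true minimum and maximum are unknown.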







Data Reduction

A data warehouse may store terabytes of data:
- Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction:
- Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

Data reduction strategies

Data cube aggregation
Dimensionality reduction:
- remove unimportant attributes
Data compression
Numerosity reduction:
- fit data into models
Discretization and concept hierarchy generation

Data Cube Aggregation

Multiple levels of aggregation in data cubes:
- Further reduce the size of the data to deal with
Queries regarding aggregated information should be answered using the smallest available cuboid

Example of a Data Cube w/ materialized aggregate data

[Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum) and Country (U.S.A, Canada, Mexico, sum); one highlighted cell holds the total annual sales of TV in U.S.A.]

Dimensionality Reduction

Data sets may contain hundreds of attributes:
- Some are irrelevant or redundant
Feature selection (i.e., attribute subset selection):
- Select a minimum set of features such that the probability distribution of the data classes given the values for those features is as close as possible to the original distribution given the values of all features
- Reduces the number of attributes appearing in the discovered patterns, making them easier to understand

Heuristic Feature Selection Methods

There are 2^d possible subsets of d features
Several heuristic feature selection methods:
- Best single features under the feature independence assumption: choose by statistical significance tests
- Stepwise feature selection:
  - The best single feature is picked first
  - Then the next best feature conditioned on the first, ...
- Stepwise feature elimination:
  - Repeatedly eliminate the worst feature
- Best combined feature selection and elimination
- Decision tree induction

Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Figure: decision tree with root test A4?, whose branches lead to the tests A1? and A6?; each of these has the leaves Class 1 and Class 2]

Reduced attribute set: {A1, A4, A6}

Data Compression

[Figure: Original Data and Compressed Data, related by lossless compression; Original Data Approximated, obtained by lossy compression]

Data Compression (examples)

String compression:
- There are extensive theories and well-tuned algorithms
- Typically lossless
- But only limited manipulation is possible without expansion
Audio/video compression:
- Typically lossy compression, with progressive refinement
- Sometimes small fragments of the signal can be reconstructed without reconstructing the whole

Wavelet Transformation (1)

Discrete wavelet transform (DWT): linear signal processing
- Transforms a data vector D into a numerically different vector D’, with the same length, of wavelet coefficients
Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients
Good results on sparse or skewed data and on data with ordered attributes; can be applied to multidimensional data

Wavelet Transformation (2)

Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space
Method (hierarchical pyramid algorithm):
- Length, L, must be an integer power of 2 (padding with 0s, when necessary)
- Each transform has 2 functions: smoothing, difference
- Applies to pairs of data, resulting in two sets of data of length L/2
- Applies the two functions recursively, until it reaches the desired length
- Wavelet coefficients are a selection of the values obtained

[Figure: Haar-2 and Daubechies-4 wavelets]
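The pyramid algorithm can be illustrated with the simplest wavelet, the Haar wavelet. In this Python sketch the smoothing function is the pairwise average and the difference function is half the pairwise difference (an unnormalized variant chosen for readability); the recursion runs down to a single overall average. The function names and the input vector are assumptions for illustration.

```python
def haar_dwt(data):
    """Haar wavelet transform: returns [overall average, detail coefficients...]."""
    # Pad with zeros so that the length is an integer power of 2.
    n = 1
    while n < len(data):
        n *= 2
    current = list(data) + [0.0] * (n - len(data))

    details = []
    while len(current) > 1:
        pairs = [(current[i], current[i + 1]) for i in range(0, len(current), 2)]
        smooth = [(a + b) / 2 for a, b in pairs]   # smoothing function
        diff = [(a - b) / 2 for a, b in pairs]     # difference function
        details = diff + details                   # coarser details go in front
        current = smooth                           # recurse on the half-length vector
    return current + details


def keep_strongest(coefficients, k):
    """Compressed approximation: keep the k largest-magnitude coefficients, zero the rest."""
    order = sorted(range(len(coefficients)), key=lambda i: abs(coefficients[i]), reverse=True)
    kept = set(order[:k])
    return [c if i in kept else 0.0 for i, c in enumerate(coefficients)]


coefficients = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print(coefficients)
print(keep_strongest(coefficients, 2))
```

The transformed vector has the same length as the padded input; the reduction comes only from storing the few strongest coefficients and treating the others as zero.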

Principal Component Analysis

Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
Each data vector is a linear combination of the c principal component vectors
Works for numeric data only
Used when the number of dimensions is large; computationally inexpensive, can be applied to ordered and unordered attributes, and can handle sparse and skewed data

PCA basic procedure

Input data are normalized:
- Attributes w/ large domains do not dominate attributes w/ smaller domains
Computes c orthogonal unit vectors, the principal components:
- Input data is a linear combination of them
Principal components are sorted according to decreasing strength (variance among the data)
Size of the data is reduced by eliminating the weaker components (w/ lower variance)
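As a hedged illustration of the procedure, the Python sketch below finds only the first (strongest) principal component, using power iteration on the covariance matrix. Power iteration is a technique swapped in here to keep the example dependency-free; it is not named in the slides. The sketch only centers the data (any further normalization is assumed to have been done beforehand) and assumes a non-zero covariance matrix; all names and data points are hypothetical.

```python
from math import sqrt


def covariance_matrix(rows):
    """Return (covariance matrix, mean-centered rows) for a list of numeric vectors."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(k)]
           for a in range(k)]
    return cov, centered


def first_component(cov, iterations=200):
    """Unit vector of largest variance and that variance, by power iteration."""
    k = len(cov)
    v = [1.0 / sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(k)) for a in range(k))
    return v, variance


# Two strongly related attributes: one component captures most of the variance.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
cov, centered = covariance_matrix(points)
component, variance = first_component(cov)
# Reduced representation: one coordinate per data vector instead of two.
projected = [sum(c * v for c, v in zip(row, component)) for row in centered]
print(component, variance, projected)
```

Keeping only `projected` instead of both original attributes is the reduction step: the discarded second component carries only a small share of the total variance for data like this.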

Principal Component Analysis

[Figure: data plotted on the original axes X1 and X2, with the principal component axes Y1 and Y2]

Numerosity Reduction

Parametric methods:
- Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
Non-parametric methods:
- Do not assume models
- Major families: histograms, clustering, sampling

Regression Models

Linear regression: data are modeled to fit a straight line
  Y = α + β X
  The two parameters, α and β, specify the line and are estimated from the data at hand
  They are obtained by applying the least squares criterion to the known values of Y1, Y2, …, X1, X2, …
Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
  Y = b0 + b1 X1 + b2 X2
  Many nonlinear functions can be transformed into the above
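
As a minimal sketch of regression used for numerosity reduction, the snippet below estimates the two parameters of the line by least squares and keeps only them. The function name fit_line and the sample points are hypothetical; the closed-form estimates are the standard ones for simple linear regression.

```python
import numpy as np

def fit_line(x, y):
    """Least squares estimates (alpha, beta) for the line Y = alpha + beta * X."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Slope: covariance of X and Y divided by the variance of X.
    beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # Intercept: the fitted line passes through the point of means.
    alpha = y.mean() - beta * x.mean()
    return alpha, beta

# Hypothetical data lying exactly on Y = 1 + 2 X.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
alpha, beta = fit_line(xs, ys)
```

Storing alpha and beta instead of the five points is the data reduction step; values of Y are then approximated as alpha + beta * X.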



Histograms (example)

[Figure: two histograms, "Both sexes 1998" and "Both sexes 2025"; horizontal axis from 30 to 90, vertical axis from 0 to 100]



Histograms

A popular data reduction technique that uses binning to approximate data distributions
Divides the data into buckets and stores the average frequency for each bucket
Different partitioning rules: equal-width, equal-frequency, V-optimal, max-diff
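
The first two partitioning rules can be compared on a small example. This sketch assumes NumPy; the price values are hypothetical, and each bucket is summarized by its bounds and its count.

```python
import numpy as np

# Hypothetical sorted attribute values.
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: every bucket spans the same range of values.
counts, edges = np.histogram(prices, bins=3)
equal_width = list(zip(edges[:-1], edges[1:], counts))

# Equal-frequency: every bucket holds (roughly) the same number of values.
equal_freq = [(int(b.min()), int(b.max()), len(b))
              for b in np.array_split(np.sort(prices), 3)]
```

With equal-width buckets the counts differ per bucket, while the equal-frequency buckets all hold four values but cover ranges of different widths.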



Cluster Analysis

Similar values are organized into groups
May be used to detect outliers



Clustering

Partition the data set into clusters, and store only the cluster representation
Cluster quality can be measured by its diameter or the centroid distance
Clustering can be hierarchical and stored in multi-dimensional index tree structures
There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8
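
The two quality measures can be computed as follows. The slide does not define them, so this sketch assumes the usual definitions: the diameter is the maximum distance between any two objects in the cluster, and the centroid distance is the average distance of the objects from the cluster centroid. The function names and the sample cluster are hypothetical.

```python
import numpy as np

def centroid(points):
    """Mean point of the cluster; this is what would be stored instead of the data."""
    return points.mean(axis=0)

def diameter(points):
    """Maximum Euclidean distance between any two objects in the cluster."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).max()

def centroid_distance(points):
    """Average Euclidean distance of the objects from the centroid."""
    return np.linalg.norm(points - centroid(points), axis=1).mean()

# Hypothetical cluster: the four corners of a 2 x 2 square.
cluster = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]])
```

For this cluster the centroid is (1, 1), the diameter is the diagonal of the square, and every corner lies at the same distance from the centroid.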



Sampling

Allows a mining algorithm to run with a complexity that is potentially sub-linear in the size of the data
  The cost is proportional to the sample size
Choose a representative subset of the data
  Simple random sampling may perform very poorly in the presence of skew
Develop adaptive sampling methods
  Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
  Used in conjunction with skewed data
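
A minimal sketch of simple random sampling (without and with replacement) and of stratified sampling, using only the Python standard library. The function names, the 10% fraction, and the customer rows are hypothetical; each stratum keeps at least one row, which is a design choice of this sketch.

```python
import random
from collections import defaultdict

def srswor(data, n, rng):
    """Simple random sample without replacement: no row is drawn twice."""
    return rng.sample(data, n)

def srswr(data, n, rng):
    """Simple random sample with replacement: a row may be drawn again."""
    return [rng.choice(data) for _ in range(n)]

def stratified_sample(data, key, fraction, rng):
    """Sample the same fraction from every stratum (class) defined by key."""
    strata = defaultdict(list)
    for row in data:
        strata[key(row)].append(row)
    sample = []
    for rows in strata.values():
        n = max(1, round(len(rows) * fraction))  # keep rare strata represented
        sample.extend(rng.sample(rows, n))
    return sample

rng = random.Random(42)
# Hypothetical skewed data: 90 "young" customers and 10 "senior" customers.
customers = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
sample = stratified_sample(customers, key=lambda row: row[0], fraction=0.1, rng=rng)
plain = srswor(customers, 10, rng)
```

The stratified sample always contains nine young and one senior customer, whereas a simple random sample of ten rows may miss the senior group entirely.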



Sampling

[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]



Sampling

[Figure: raw data and the corresponding cluster/stratified sample]



Discretization

Three types of attributes:
  Nominal: values from an unordered set
  Ordinal: values from an ordered set
  Continuous: real numbers
Discretization:
  Divide the range of a continuous attribute into intervals
  Interval labels are used to replace actual values
  Some classification algorithms only accept categorical attributes
  Reduce data size by discretization



Discretization and Concept Hierarchy

Discretization
  Reduces the number of values of a given continuous attribute by dividing the range of the attribute into intervals
Concept hierarchies
  Reduce the data by collecting low-level concepts (such as numeric values for the attribute age) and replacing them with higher-level concepts (such as young, middle-aged, or senior)
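
The age example can be written as a small lookup. The cut points 35 and 60 are assumptions made for illustration only (the slides give the concept names but no boundaries), and the function name age_to_concept is hypothetical.

```python
import bisect

AGE_CUTS = [35, 60]  # hypothetical boundaries, not taken from the slides
AGE_LABELS = ["young", "middle-aged", "senior"]

def age_to_concept(age):
    """Replace a numeric age by the label of the interval it falls into."""
    # bisect_right counts how many cut points are <= age.
    return AGE_LABELS[bisect.bisect_right(AGE_CUTS, age)]

ages = [23, 35, 47, 59, 60, 81]
concepts = [age_to_concept(a) for a in ages]
```

Six distinct numeric values are replaced by three labels, which is the reduction the concept hierarchy provides.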



Discretization techniques

Supervised vs. unsupervised: whether or not the discretization process uses class information
Top-down (splitting): finds one or a few points to split the entire range of the attribute, then repeats this recursively on the resulting intervals
Bottom-up (merging): starts by treating all the continuous values as potential split points, merges neighboring values into intervals, then applies the merge recursively



Discretization and Concept Hierarchy Generation for Numeric Data

Binning
Histogram analysis
Clustering analysis
Entropy-based discretization
Segmentation by natural partitioning



Entropy-Based Discretization (1)

Given a set of samples $S$, if $S$ is partitioned into two intervals $S_1$ and $S_2$ using boundary $T$, the entropy after partitioning (the quantity that is minimized in order to maximize the information gain) is

$$I(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$$

Entropy is calculated based on the class distribution of the samples in the set. Given $m$ classes, the entropy of $S_1$ is

$$\mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

where $p_i$ is the probability of class $i$ in $S_1$



Entropy-Based Discretization (2)

The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
The process is applied recursively to the partitions obtained, until some stopping criterion is met
Such a boundary may reduce data size and improve classification accuracy
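
One binary split can be sketched as below. The candidate boundaries are taken as midpoints between consecutive distinct values, which is an assumption of this sketch; the function names and the sample data are hypothetical, and the recursion and stopping criterion are left out.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a set of samples, computed from its class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_boundary(values, labels):
    """Return (T, I(S,T)) for the boundary T with the lowest weighted entropy."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_info, best_t = float("inf"), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2  # assumed candidate: midpoint
        s1 = [label for _, label in pairs[:i]]
        s2 = [label for _, label in pairs[i:]]
        info = len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)
        if info < best_info:
            best_info, best_t = info, t
    return best_t, best_info

# Hypothetical samples: low values belong to class "a", high values to class "b".
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
boundary, info = best_boundary(values, labels)
```

The selected boundary falls between 3 and 10, where both resulting intervals are pure and the weighted entropy is zero.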



Interval Merge by χ² Analysis

Merging-based (bottom-up)
  Finds the best neighboring intervals and merges them to form larger intervals, recursively
ChiMerge
  Initially, each distinct value of a numerical attribute A is considered to be one interval
  χ² tests are performed for every pair of adjacent intervals
  Adjacent intervals with the lowest χ² values are merged together, since a low χ² value indicates similar class distributions
  This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.)
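
The ChiMerge steps above can be sketched as follows. This is an illustration under assumptions: the only stopping criterion implemented is a maximum number of intervals, cells with an expected count of zero are skipped, and the function names and sample data are hypothetical.

```python
from collections import Counter

def chi2(counts_a, counts_b, classes):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    total = total_a + total_b
    stat = 0.0
    for row, row_total in ((counts_a, total_a), (counts_b, total_b)):
        for c in classes:
            expected = row_total * (counts_a[c] + counts_b[c]) / total
            if expected > 0:  # assumption: skip cells with zero expected count
                stat += (row[c] - expected) ** 2 / expected
    return stat

def chimerge(values, labels, max_intervals):
    """Return the lower bounds of the intervals left after merging."""
    classes = sorted(set(labels))
    grouped = {}
    for v, label in zip(values, labels):
        grouped.setdefault(v, Counter())[label] += 1
    # Initially, each distinct value is its own interval.
    intervals = [[v, grouped[v]] for v in sorted(grouped)]
    while len(intervals) > max_intervals:
        scores = [chi2(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))  # most similar adjacent pair
        intervals[i][1] = intervals[i][1] + intervals[i + 1][1]
        del intervals[i + 1]
    return [lower for lower, _ in intervals]

values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
bounds = chimerge(values, labels, max_intervals=2)
```

On this data the neighbors with identical class distributions are merged first, leaving one interval starting at 1 and one starting at 10.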



Segmentation by Natural Partitioning

A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals
  If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 intervals (equal-width for 3, 6 and 9; a 2-3-2 grouping for 7)
  If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 equal-width intervals
  If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 equal-width intervals



Example of 3-4-5 Rule

[Figure: distribution of profit values (count vs. profit), partitioned step by step]

Step 1: Min = -$351, Low (i.e., 5th percentile) = -$159, High (i.e., 95th percentile) = $1,838, Max = $4,700
Step 2: msd = 1,000, Low = -$1,000, High = $2,000
Step 3: (-$1,000 - $2,000) is partitioned into
  (-$1,000 - 0)
  (0 - $1,000)
  ($1,000 - $2,000)
Step 4: the range is adjusted to (-$400 - $5,000) to cover Min and Max, and each interval is partitioned again
  (-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
  (0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
  ($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
  ($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
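
One level of the 3-4-5 rule can be sketched as a single function. It assumes that low and high have already been rounded to multiples of msd, and it treats the 7-value case as a 2-3-2 grouping; the function name partition_345 is hypothetical. The calls reuse ranges from the profit example.

```python
def partition_345(low, high, msd):
    """Split the range from low to high by the 3-4-5 rule (one level only)."""
    n = round((high - low) / msd)  # distinct values at the most significant digit
    if n == 7:
        widths = [2 * msd, 3 * msd, 2 * msd]  # 2-3-2 grouping
    elif n in (3, 6, 9):
        widths = [(high - low) / 3] * 3
    elif n in (2, 4, 8):
        widths = [(high - low) / 4] * 4
    elif n in (1, 5, 10):
        widths = [(high - low) / 5] * 5
    else:
        raise ValueError(f"the 3-4-5 rule does not cover {n} distinct values")
    intervals, start = [], low
    for width in widths:
        intervals.append((start, start + width))
        start += width
    return intervals

top_level = partition_345(-1000, 2000, 1000)   # Step 3 of the example
negative_part = partition_345(-400, 0, 100)    # first interval of Step 4
middle_part = partition_345(0, 1000, 1000)     # second interval of Step 4
upper_part = partition_345(2000, 5000, 1000)   # last interval of Step 4
```

The function does not perform the percentile selection, the rounding to msd, or the adjustment of the boundaries to Min and Max; those steps are assumed to happen before each call.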



Concept Hierarchy Generation for Categorical Data

Specification of a partial ordering of attributes explicitly at the schema level by users or experts
  street < city < state < country
Specification of a portion of a hierarchy by explicit data grouping
  {Urbana, Champaign, Chicago} < Illinois
Specification of a set of attributes
  The system automatically generates the partial ordering by analyzing the number of distinct values
  E.g., street < city < state < country
Specification of only a partial set of attributes
  E.g., only street < city, not others



Automatic Concept Hierarchy Generation

Some concept hierarchies can be generated automatically, based on the analysis of the number of distinct values per attribute in the given data set
  The attribute with the most distinct values is placed at the lowest level of the hierarchy
  Note: there are exceptions, e.g. weekday, month, quarter, year

Hierarchy, from top to bottom:
  country: 15 distinct values
  province_or_state: 65 distinct values
  city: 3567 distinct values
  street: 674,339 distinct values
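
The heuristic can be sketched by counting distinct values per attribute and sorting. The function name and the four address rows are hypothetical; as noted for weekday, month, quarter and year, the resulting order is only a proposal that may need manual correction.

```python
def hierarchy_from_distinct_counts(records, attributes):
    """Order attributes from the top of the hierarchy (fewest distinct values) down."""
    counts = {a: len({r[a] for r in records}) for a in attributes}
    return sorted(attributes, key=lambda a: counts[a]), counts

# Hypothetical rows of a location dimension.
records = [
    {"street": "Main St", "city": "Urbana", "state": "Illinois", "country": "USA"},
    {"street": "State St", "city": "Chicago", "state": "Illinois", "country": "USA"},
    {"street": "Lake St", "city": "Chicago", "state": "Illinois", "country": "USA"},
    {"street": "Pearl St", "city": "Albany", "state": "New York", "country": "USA"},
]
order, counts = hierarchy_from_distinct_counts(
    records, ["street", "city", "state", "country"])
```

Here the proposed order is country, state, city, street, matching street < city < state < country.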



Bibliography

(Book) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Chapter 3 in the 2001 book, Chapter 2 in the draft)
(Report) Expectativa de vida ao nascer, por Região, País e Sexo: 1998 e 2025, Lurdes Jesus, FCUL, 2003
