Post on 24-May-2015

Aprendizagem Computacional – Gladys Castillo, Universidade de Aveiro

Bayesian Network Classifiers
Part I – Naive Bayes
The Supervised Classification Problem

A classifier is a function f: X → C that assigns a class label c ∈ C = {c1, …, cm} to objects described by a set of attributes X = {X1, X2, …, Xn}.

Given: a dataset D with N labeled examples of <X, C>.
Build: a classifier, a hypothesis hC: X → C that can correctly predict the class labels of new objects.

Learning Phase: a supervised learning algorithm takes the dataset D = {<x(1), c(1)>, <x(2), c(2)>, …, <x(N), c(N)>} as input and outputs the hypothesis hC.

Classification Phase: the class attached to a new example x(N+1) is c(N+1) = hC(x(N+1)) ∈ C.
Inputs: attribute values of x(N+1). Output: class of x(N+1).
[Figure: examples plotted in the (X1, X2) attribute space; crosses and circles mark the two classes "give credit" and "don't give credit".]
Statistical Classifiers

Treat the attributes X = {X1, X2, …, Xn} and the class C as random variables. A random variable is characterized by its probability density function f(x).

[Figure: probability density function of a random variable and a few observations.]

Statistical classifiers give the probability P(cj | x) that x belongs to a particular class, rather than a simple classification: instead of having the map X → C, we have X → P(C | X).

The class c* attached to an example is the class with the largest P(cj | x).
Bayesian Classifiers

"Bayesian" because the class c* attached to an example x is determined by Bayes' theorem. From P(X, C) = P(X | C)·P(C) and P(X, C) = P(C | X)·P(X) it follows that

P(C | X) = P(C)·P(X | C) / P(X)

Bayes' theorem is the main tool in Bayesian inference: we can combine the prior distribution and the likelihood of the observed data in order to derive the posterior distribution.
Bayes' Theorem – Example

Given: a doctor knows that meningitis causes stiff neck 50% of the time; the prior probability of any patient having meningitis is 1/50,000; the prior probability of any patient having stiff neck is 1/20.

If a patient has stiff neck, what is the probability he/she has meningitis?

P(M | S) = P(S | M)·P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

posterior ∝ prior × likelihood:   P(C | X) = P(C)·P(X | C) / P(X)

Before observing the data, our prior beliefs can be expressed in a prior probability distribution that represents the knowledge we have about the unknown features. After observing the data, our revised beliefs are captured by a posterior distribution over the unknown features.
Bayesian Classifier

How to determine P(cj | x) for each class cj? By Bayes' theorem:

P(cj | x) = P(cj)·P(x | cj) / P(x)

P(x) can be ignored because it is the same for all the classes (a normalization constant).

Maximum a posteriori classification: the Bayesian classification rule assigns x to the class with the largest posterior probability:

hBayes(x) = argmax_{j=1…m} P(cj)·P(x | cj)
Aprendizagem Computacional Gladys Castillo, U.A. Aprendizagem Computacional Gladys Castillo, U.A.7
“Naïve” because of its very naïve independence assumption:
Naïve Bayes (NB) Classifier
all the attributes are conditionally independent given the class
Duda and Hart (1973); Langley (1992)
P(x | cj) can be decomposed into a product
of n terms, one term for each attribute
“Bayes” because the class c* attached to an example x is
determined by the Bayes’ Theorem
)|()(max)(1
*jj
...mjBayes cPcParghc xx
when the attribute space is high dimensional direct estimation is hard unless
we introduce some assumptions
n
ijiij
...mjNB cxXPcParghc
11
* )|()(max)(xNB
Classification Rule
Naïve Bayes (NB) – Learning Phase (Statistical Parameter Estimation)

Given a training dataset D of N labeled examples (assuming complete data):

1. Estimate P(cj) for each class cj:

P̂(cj) = Nj / N

2. Estimate P(Xi = xk | cj) for each value xk of the attribute Xi and for each class cj.

Xi discrete:

P̂(Xi = xk | cj) = Nijk / Nj

where Nj is the number of examples of the class cj, and Nijk is the number of examples of the class cj having the value xk for the attribute Xi.

Xi continuous, two options:
- The attribute is discretized and then treated as a discrete attribute.
- A Normal distribution is usually assumed:

P(Xi = xk | cj) = g(xk; μij, σij),   g(x; μ, σ) = (1 / (√(2π)·σ)) · exp(−(x − μ)² / (2σ²))

where the mean μij and the standard deviation σij are estimated from D.
Continuous Attributes – Normal or Gaussian Distribution

2. Estimate P(Xi = xk | cj) for a value of the attribute Xi and for each class cj.

For real attributes a Normal distribution is usually assumed:

Xi | cj ~ N(μij, σ²ij), where the mean μij and the standard deviation σij are estimated from D:

P(Xi = xk | cj) = g(xk; μij, σij),   g(x; μ, σ) = (1 / (√(2π)·σ)) · exp(−(x − μ)² / (2σ²))

The probability density function f(x) is symmetrical around its mean.

[Figure: density curve of N(0, 2).]

Example: for a variable X ~ N(74, 36), the density at the value 66 is given by f(66) = g(66; 74, 6) = 0.0273.
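The example above can be checked numerically. A minimal sketch of the Gaussian density g (the function name is mine, not from the slides):

```python
import math

def gaussian_density(x, mu, sigma):
    """Normal density g(x; mu, sigma) used by naive Bayes for continuous attributes."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# X ~ N(74, 36), i.e. mu = 74, sigma = 6: density at x = 66
print(round(gaussian_density(66, 74, 6), 4))  # 0.0273
```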
Naïve Bayes – Probability Estimates

Binary classification problem (example from John & Langley, 1995). Two classes: + (positive) and − (negative). Two attributes: X1, discrete, which takes the values a and b; X2, continuous.

Training dataset D:

Class  X1  X2
+      a   1.0
+      b   1.2
+      a   3.0
−      b   4.4
−      b   4.5

1. Estimate P(cj) for each class cj:
P̂(C = +) = 3/5,  P̂(C = −) = 2/5

2. Estimate P(X1 = xk | cj) for each value of X1 and each class cj:
P̂(X1 = a | +) = 2/3,  P̂(X1 = b | +) = 1/3
P̂(X1 = a | −) = 0/2,  P̂(X1 = b | −) = 2/2

For X2 a Normal distribution is assumed:
P(X2 = x | +) = g(x; 1.73, 1.10), with μ2+ = 1.73, σ2+ = 1.10
P(X2 = x | −) = g(x; 4.45, 0.07), with μ2− = 4.45, σ2− = 0.07
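All of these estimates can be reproduced from the five training examples. A sketch (function and variable names are mine):

```python
from collections import Counter
from statistics import mean, stdev

# Training set D from John & Langley (1995): (X1, X2, class)
D = [("a", 1.0, "+"), ("b", 1.2, "+"), ("a", 3.0, "+"),
     ("b", 4.4, "-"), ("b", 4.5, "-")]

# 1. Class priors: P^(cj) = Nj / N
N = len(D)
priors = {c: n / N for c, n in Counter(c for _, _, c in D).items()}

# 2a. Discrete attribute X1: P^(X1 = xk | cj) = Nijk / Nj
def p_x1(value, cls):
    in_class = [x1 for x1, _, c in D if c == cls]
    return in_class.count(value) / len(in_class)

# 2b. Continuous attribute X2: sample mean and standard deviation per class
x2_params = {}
for cls in ("+", "-"):
    xs = [x2 for _, x2, c in D if c == cls]
    x2_params[cls] = (mean(xs), stdev(xs))

print(priors)                           # {'+': 0.6, '-': 0.4}
print(p_x1("a", "+"), p_x1("a", "-"))   # 2/3 and 0.0
print({c: (round(m, 2), round(s, 2)) for c, (m, s) in x2_params.items()})
# mu2+ = 1.73, sigma2+ = 1.10; mu2- = 4.45, sigma2- = 0.07
```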
Probability Estimates – Discrete Attributes
Adapted from © Tan, Steinbach, Kumar, Introduction to Data Mining.

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

For each class, P̂(cj) = Nj / N:
P(No) = 7/10, P(Yes) = 3/10

For each attribute value and class, P̂(Xi = xk | cj) = Nijk / Nj, where to compute Nijk we count the number of examples of the class cj having the value xk for the attribute Xi.

Examples:
P(Status = Married | No) = 4/7
P(Refund = Yes | Yes) = 0
Probability Estimates – Continuous Attributes
Adapted from © Tan, Steinbach, Kumar, Introduction to Data Mining.

For each attribute-class pair (Xi, cj):

P(Xi = xk | cj) = g(xk; μij, σij) = (1 / (√(2π)·σij)) · exp(−(xk − μij)² / (2σ²ij))

Example for (Income, Class = No), using the training dataset of the previous slide. If Class = No, the sample mean is 110 and the sample standard deviation is 54.54 (σ² = 2975), so

P(Income = 120 | No) = (1 / (√(2π) × 54.54)) · exp(−(120 − 110)² / (2 × 2975)) = 0.0072
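These numbers can be reproduced from the Class = No rows of the table. A sketch (variable names are mine):

```python
import math
from statistics import mean, stdev

# Taxable income (in K) of the 7 examples with Evade = No
income_no = [125, 100, 70, 120, 60, 220, 75]

mu = mean(income_no)      # 110
sigma = stdev(income_no)  # sample standard deviation, ~54.54 (sigma^2 ~ 2975)

# Gaussian density of Income = 120 under the Class = No parameters
p = math.exp(-((120 - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
print(round(p, 4))  # 0.0072 = P(Income = 120 | No)
```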
The Balance-scale Problem

Dataset from the UCI repository; it was generated to model psychological experimental results.

Each example has 4 numerical attributes: the left weight (Left_W), the left distance (Left_D), the right weight (Right_W) and the right distance (Right_D).

Each example is classified into one of 3 classes describing the balance scale: it tips to the right (Right), tips to the left (Left), or is balanced (Balanced).

3 target rules:
If Left_D × Left_W > Right_D × Right_W, it tips to the left.
If Left_D × Left_W < Right_D × Right_W, it tips to the right.
If Left_D × Left_W = Right_D × Right_W, it is balanced.
The Balance-scale Problem
Adapted from © João Gama's slides "Aprendizagem Bayesiana".

Balance-Scale dataset:

Left_W  Left_D  Right_W  Right_D  Class
1       5       4        2        Right
2       5       3        2        Left
3       4       6        2        Balanced
…       …       …        …        …

Discretization is applied: each attribute is mapped to 5 intervals.
The Balance-scale Problem – Learning Phase

Build the contingency tables from the 565 training examples.

Class counters:
Left: 260, Balanced: 45, Right: 260 (Total: 565)

Attribute: Left_W
Class     I1  I2  I3  I4  I5
Left      14  42  61  71  72
Balanced  10   8   8  10   9
Right     86  66  49  34  25

Attribute: Left_D
Class     I1  I2  I3  I4  I5
Left      16  38  59  70  77
Balanced   8  10   9  10   8
Right     90  57  49  37  27

Attribute: Right_W
Class     I1  I2  I3  I4  I5
Left      87  63  49  33  28
Balanced   8  10  10   9   8
Right     16  37  58  70  79

Attribute: Right_D
Class     I1  I2  I3  I4  I5
Left      91  65  44  35  25
Balanced   8  10   8  10   9
Right     17  37  57  67  82

Assuming complete data, the computation of all the required estimates requires a single scan through the data, an operation of time complexity O(N·n), where N is the number of training examples and n the number of attributes.
The Balance-scale Problem – Classification Phase

How does NB classify this example?

Left_W  Left_D  Right_W  Right_D  Class
1       5       4        2        ?

We need to estimate the posterior probabilities P(cj | x) for each class. The class counters and contingency tables are used to compute them:

P(cj | x) ∝ P(cj) × P(Left_W = 1 | cj) × P(Left_D = 5 | cj) × P(Right_W = 4 | cj) × P(Right_D = 2 | cj),  cj ∈ {Left, Balanced, Right}

The class attached to this example is the class with the largest posterior probability:

P(Left | x) = 0.277796,  P(Balanced | x) = 0.135227,  P(Right | x) = 0.586978 (max)  →  Class = Right

hNB(x) = argmax_{j=1…m} P(cj) · ∏_{i=1…n} P(Xi = xi | cj)
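This step can be sketched with the class counters and contingency tables above. A sketch, assuming the counts were recovered correctly from the tables; the exact posterior values depend on details such as smoothing, but the winning class is Right:

```python
# Class counters and per-attribute contingency tables (counts per interval I1..I5)
counters = {"Left": 260, "Balanced": 45, "Right": 260}
N = 565
tables = {
    "Left_W":  {"Left": [14, 42, 61, 71, 72], "Balanced": [10, 8, 8, 10, 9], "Right": [86, 66, 49, 34, 25]},
    "Left_D":  {"Left": [16, 38, 59, 70, 77], "Balanced": [8, 10, 9, 10, 8], "Right": [90, 57, 49, 37, 27]},
    "Right_W": {"Left": [87, 63, 49, 33, 28], "Balanced": [8, 10, 10, 9, 8], "Right": [16, 37, 58, 70, 79]},
    "Right_D": {"Left": [91, 65, 44, 35, 25], "Balanced": [8, 10, 8, 10, 9], "Right": [17, 37, 57, 67, 82]},
}

x = {"Left_W": 1, "Left_D": 5, "Right_W": 4, "Right_D": 2}  # intervals, 1-based

def nb_scores(example):
    """Unnormalized P(cj) * prod_i P(Xi = xi | cj) for each class."""
    scores = {}
    for cls, nj in counters.items():
        s = nj / N  # prior P(cj)
        for attr, interval in example.items():
            s *= tables[attr][cls][interval - 1] / nj  # P(Xi = xk | cj)
        scores[cls] = s
    return scores

scores = nb_scores(x)
total = sum(scores.values())
posterior = {c: s / total for c, s in scores.items()}
print(max(posterior, key=posterior.get))  # Right
```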
Iris Dataset

Dataset from the UCI repository, due to the statistician Ronald Fisher.

Three flower types (classes): Setosa, Virginica, Versicolour.
Four continuous attributes: sepal width and length, petal width and length.

[Photo: Iris virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]
Iris Dataset

[Figures: dataset overview.]
Scatter Plot Array of Iris Attributes

The attributes petal width and petal length provide a moderate separation of the Iris species.
Naïve Bayes – Iris Dataset, Continuous Attributes
Normal Probability Density Functions

Attribute: PetalWidth. Densities P(PetalWidth | Setosa), P(PetalWidth | Versicolor), P(PetalWidth | Virginica):

Class Iris-setosa: mean 0.244, standard deviation 0.107
Class Iris-versicolor: mean 1.326, standard deviation 0.198
Class Iris-virginica: mean 2.026, standard deviation 0.275
Naïve Bayes – Iris Dataset, Continuous Attributes
Model

Class Iris-setosa (0.327):
Attribute sepallength: mean 5.006, standard deviation 0.352
Attribute sepalwidth: mean 3.418, standard deviation 0.381
Attribute petallength: mean 1.464, standard deviation 0.174
Attribute petalwidth: mean 0.244, standard deviation 0.107

Class Iris-versicolor (0.327):
Attribute sepallength: mean 5.936, standard deviation 0.516
Attribute sepalwidth: mean 2.770, standard deviation 0.314
Attribute petallength: mean 4.260, standard deviation 0.470
Attribute petalwidth: mean 1.326, standard deviation 0.198

Class Iris-virginica (0.327):
Attribute sepallength: mean 6.588, standard deviation 0.636
Attribute sepalwidth: mean 2.974, standard deviation 0.322
Attribute petallength: mean 5.552, standard deviation 0.552
Attribute petalwidth: mean 2.026, standard deviation 0.275
Naïve Bayes – Iris Dataset, Continuous Attributes
Classification Phase

[Screenshot: classified examples; the predicted class is the one with the maximum posterior value.]
Naïve Bayes – Iris Dataset, Continuous Attributes

How does NB classify this example? Estimate P(cj | x) for each class:

P(cj | x) = P(cj) × P(sepalLength = 5 | cj) × P(sepalWidth = 3 | cj) × P(petalLength = 2 | cj) × P(petalWidth = 2 | cj),  cj ∈ {setosa, versicolor, virginica}

The class attached to this example is the class with the largest posterior probability:

P(setosa | x) = 0,  P(versicolor | x) = 0.995 (max),  P(virginica | x) = 0.005  →  Class = versicolor
Naïve Bayes – Iris Dataset, Continuous Attributes

Estimate P(cj | x) for the versicolor class. From the model, class Iris-versicolor (0.327): attribute sepallength mean 5.936, standard deviation 0.516; sepalwidth mean 2.770, standard deviation 0.314; petallength mean 4.260, standard deviation 0.470; petalwidth mean 1.326, standard deviation 0.198.

P(versicolor | x) = P(versicolor) × P(sepalLength = 5 | versicolor) × P(sepalWidth = 3 | versicolor) × P(petalLength = 2 | versicolor) × P(petalWidth = 2 | versicolor)
P(versicolor | x) = 0.327 × g(5; 5.936, 0.516) × g(3; 2.770, 0.314) × g(2; 4.260, 0.470) × g(2; 1.326, 0.198)

hNB(x) = argmax_{j=1…m} P(cj) · ∏_{i=1…n} P(Xi = xi | cj)
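Carrying out this computation for all three classes, with the parameter values listed in the model slide, recovers the posteriors above. A sketch (the data layout is mine):

```python
import math

def g(x, mu, sigma):
    """Gaussian density g(x; mu, sigma)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# (mean, standard deviation) per attribute, per class, from the NB model
model = {
    "setosa":     {"prior": 0.327, "params": [(5.006, 0.352), (3.418, 0.381), (1.464, 0.174), (0.244, 0.107)]},
    "versicolor": {"prior": 0.327, "params": [(5.936, 0.516), (2.770, 0.314), (4.260, 0.470), (1.326, 0.198)]},
    "virginica":  {"prior": 0.327, "params": [(6.588, 0.636), (2.974, 0.322), (5.552, 0.552), (2.026, 0.275)]},
}

x = [5, 3, 2, 2]  # sepallength, sepalwidth, petallength, petalwidth

scores = {}
for cls, m in model.items():
    s = m["prior"]
    for xi, (mu, sigma) in zip(x, m["params"]):
        s *= g(xi, mu, sigma)
    scores[cls] = s

total = sum(scores.values())
posterior = {c: s / total for c, s in scores.items()}
print(max(posterior, key=posterior.get))  # versicolor
print(round(posterior["versicolor"], 3))  # 0.995
```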
Naïve Bayes – Continuous Attributes: Discretization

For continuous attributes, discretization can be used instead of assuming a Normal distribution.

[Screenshot: attribute histograms after BinDiscretization, bins = 3.]
Naïve Bayes – Continuous Attributes: Discretization
Model

Class Iris-setosa (0.327):
Attribute sepallength: range1 0.940, range2 0.060, range3 0.000
Attribute sepalwidth: range1 0.020, range2 0.720, range3 0.260
Attribute petallength: range1 1.000, range2 0.000, range3 0.000
Attribute petalwidth: range1 1.000, range2 0.000, range3 0.000

Class Iris-versicolor (0.327):
Attribute sepallength: range1 0.220, range2 0.720, range3 0.060
Attribute sepalwidth: range1 0.460, range2 0.540, range3 0.000
Attribute petallength: range1 0.000, range2 0.960, range3 0.040
Attribute petalwidth: range1 0.000, range2 0.980, range3 0.020

Class Iris-virginica (0.327):
Attribute sepallength: range1 0.020, range2 0.640, range3 0.340
Attribute sepalwidth: range1 0.380, range2 0.580, range3 0.040
Attribute petallength: range1 0.000, range2 0.120, range3 0.880
Attribute petalwidth: range1 0.000, range2 0.100, range3 0.900
Naïve Bayes – Continuous Attributes: Discretization

We can build a conditional probability table (CPT) for each attribute.

Class probabilities:
Setosa 0.327, Versicolor 0.327, Virginica 0.327

Attribute Sepallength:
Class       Range1  Range2  Range3
Setosa      0.940   0.060   0.000
Versicolor  0.220   0.720   0.060
Virginica   0.020   0.640   0.340

Attribute Sepalwidth:
Class       Range1  Range2  Range3
Setosa      0.020   0.720   0.260
Versicolor  0.460   0.540   0.000
Virginica   0.380   0.580   0.040
Naïve Bayes – Continuous Attributes: Discretization

Attribute Petallength:
Class       Range1  Range2  Range3
Setosa      1.000   0.000   0.000
Versicolor  0.000   0.960   0.040
Virginica   0.000   0.100   0.900

Attribute Petalwidth:
Class       Range1  Range2  Range3
Setosa      1.000   0.000   0.000
Versicolor  0.000   0.980   0.020
Virginica   0.000   0.100   0.900
Naïve Bayes – Iris Dataset, Discretized Attributes
Classification (Implementation) Phase

[Screenshot: discretized examples and their posterior values; the predicted class is the one with the maximum value.]
Naïve Bayes – Classification Phase

How to classify this example?

sepallength  sepalwidth  petallength  petalwidth
5            3           2            2

Example with discretized attributes:

sepallength  sepalwidth  petallength  petalwidth
r1           r2          r1           r3

We need to compute the posterior probabilities for each class:

P(setosa | x) = P(setosa) × P(sepalLength = r1 | setosa) × P(sepalWidth = r2 | setosa) × P(petalLength = r1 | setosa) × P(petalWidth = r3 | setosa)

P(versicolor | x) = P(versicolor) × P(sepalLength = r1 | versicolor) × P(sepalWidth = r2 | versicolor) × P(petalLength = r1 | versicolor) × P(petalWidth = r3 | versicolor)

P(virginica | x) = P(virginica) × P(sepalLength = r1 | virginica) × P(sepalWidth = r2 | virginica) × P(petalLength = r1 | virginica) × P(petalWidth = r3 | virginica)
Naïve Bayes – Continuous Attributes: Discretization

Using the class probabilities and CPTs for the discretized example (sepallength = r1, sepalwidth = r2, petallength = r1, petalwidth = r3):

P(setosa | x) = P(setosa) × P(sepalLength = r1 | setosa) × P(sepalWidth = r2 | setosa) × P(petalLength = r1 | setosa) × P(petalWidth = r3 | setosa)
P(setosa | x) = 0.327 × 0.940 × 0.720 × 1.000 × 0.000 = 0
Naïve Bayes – Continuous Attributes: Discretization

P(versicolor | x) = P(versicolor) × P(sepalLength = r1 | versicolor) × P(sepalWidth = r2 | versicolor) × P(petalLength = r1 | versicolor) × P(petalWidth = r3 | versicolor)
P(versicolor | x) = 0.327 × 0.220 × 0.540 × 0.000 × 0.020 = 0
Naïve Bayes – Continuous Attributes: Discretization

P(virginica | x) = P(virginica) × P(sepalLength = r1 | virginica) × P(sepalWidth = r2 | virginica) × P(petalLength = r1 | virginica) × P(petalWidth = r3 | virginica)
P(virginica | x) = 0.327 × 0.020 × 0.580 × 0.000 × 0.900 = 0

If all the class probabilities are zero, we cannot determine the class for this example.
Naïve Bayes – Laplace Correction

To avoid zero probabilities due to zero counters, we can implement the Laplace correction. To calculate the conditional probabilities, instead of using the estimate

P̂(Xi = xk | cj) = Nijk / Nj

we use the Laplace correction:

P̂(Xi = xk | cj) = (Nijk + 1) / (Nj + k)

where Nijk is the number of examples in D such that Xi = xk and C = cj, Nj is the number of examples in D of class cj, and k is the number of possible values of Xi.
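A sketch of the corrected estimate. The example counts assume the standard iris setup sketched above: k = 3 ranges per attribute and Nj = 50 examples per class:

```python
def laplace_estimate(n_ijk, n_j, k):
    """P^(Xi = xk | cj) with Laplace correction: (Nijk + 1) / (Nj + k)."""
    return (n_ijk + 1) / (n_j + k)

# A zero counter no longer produces a zero probability:
print(laplace_estimate(0, 50, 3))   # 1/53 ~ 0.0189 instead of 0.0
# Non-zero counters are only slightly shrunk, e.g. 47/50 = 0.94 becomes:
print(laplace_estimate(47, 50, 3))  # 48/53 ~ 0.906
```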
Bayesian SPAM Filters – Binary Classification Problem

Spambase dataset from the UCI repository.

Two classes: 0 – is not SPAM, 1 – is SPAM.
57 continuous attributes: some word frequencies, some character frequencies, capital-letter frequencies.
Bayesian SPAM Filters – Implementation in RapidMiner

[Screenshot: RapidMiner process with a discretization method, a feature subset selection method, learning, testing, and an evaluation method.]

In this dataset there are no missing values; otherwise we would first need to use a method to replace the missing values.
Bayesian SPAM Filters – Confusion Matrix

Concept learning problem: is an e-mail SPAM?

True Positive (TP) = number of examples classified as positive which truly are positive
False Positive (FP) = number of examples classified as positive which are negative
False Negative (FN) = number of examples classified as negative which are positive

SPAM precision – percentage of the e-mails classified as SPAM which truly are SPAM:
precision = TP / (TP + FP)

SPAM recall (sensitivity, true positive rate) – percentage of the e-mails that are SPAM which are classified as SPAM:
recall = TPR = TP / (TP + FN)
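Both measures follow directly from the confusion-matrix counts. A sketch; the counts below are hypothetical, not results from the Spambase experiment:

```python
def precision(tp, fp):
    # fraction of e-mails classified as SPAM that truly are SPAM
    return tp / (tp + fp)

def recall(tp, fn):
    # fraction of the true SPAM e-mails that are classified as SPAM
    return tp / (tp + fn)

# Hypothetical confusion-matrix counts: TP = 80, FP = 20, FN = 10
print(precision(80, 20))  # 0.8
print(recall(80, 10))     # 0.888...
```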
Experimental Results

[Chart: accuracy with continuous attributes, with discretization after FSS, and with discretization before FSS.]

Legend – order of the pre-processing operators used in each case:
1. Feature Selection (CFS)
2. Genetic Algorithm (CFS)
3. Wrapper
4. Feature Selection (CFS), then Minimal Entropy Discretization
5. Genetic Algorithm (CFS), then Minimal Entropy Discretization
6. Feature Selection (CFS), then Frequency Discretization
7. Wrapper, then Minimal Entropy Discretization
8. Minimal Entropy Discretization, then Feature Selection (CFS)
9. Minimal Entropy Discretization, then Genetic Algorithm (CFS)
10. Minimal Entropy Discretization, then Wrapper
Naïve Bayes Performance

NB is one of the simplest and most effective classifiers, but it has a very strong, unrealistic independence assumption: all the attributes are conditionally independent given the value of the class.

[Chart: bias-variance decomposition of the test error of Naive Bayes on the Nursery dataset, for training sets of 500 to 12500 examples.]

In practice the independence assumption is violated → HIGH BIAS, which can lead to poor classification.

However, NB remains effective thanks to its low variance: fewer parameters to estimate → LOW VARIANCE.
Improving Naïve Bayes

Two directions:
- Reducing the bias resulting from the modeling error, by relaxing the attribute independence assumption. One natural extension: Bayesian network classifiers.
- Reducing the bias of the parameter estimates, by improving the probability estimates computed from data.

Relevant works:
Webb and Pazzani (1998) – "Adjusted probability naive Bayesian induction", in LNCS v. 1502
J. Gama (2001, 2003) – "Iterative Bayes", in Theoretical Computer Science, v. 292
Friedman, Geiger and Goldszmidt (1998) – "Bayesian Network Classifiers", in Machine Learning, 29
Pazzani (1995) – "Searching for attribute dependencies in Bayesian Network Classifiers", in Proc. of the 5th Workshop on Artificial Intelligence and Statistics
Keogh and Pazzani (1999) – "Learning augmented Bayesian classifiers…", in Theoretical Computer Science, v. 292