Naive Bayes
4. We put the data into the program for calculating the posterior probability from
the observations and get the following answer:
[['No', 'Yes', 'Yes', 'No', {'Yes': 0.0, 'No': 1.0}]]
According to this calculation, the tested patient should not be suffering from the
illness. However, the probability of No comes out quite high, at exactly 1.0; it may
be a good idea to gather more data to obtain a more precise estimate of the
probability that the patient is healthy. A sketch of how such a posterior can be
computed is given below.
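Although the chapter's program and training table are given earlier in the book, a
minimal sketch of how such a posterior can be computed from frequency counts might
look as follows (the patient records below are invented for illustration and are not
the chapter's data):

def naive_bayes_posterior(training, observed):
    # Posterior P(class | observed) under the naive independence assumption.
    # training: list of tuples, each ending with the class label.
    # observed: tuple of feature values, in the same order as the training rows.
    scores = {}
    for cls in set(row[-1] for row in training):
        rows = [row for row in training if row[-1] == cls]
        # Start from the prior P(class), then multiply in each P(feature | class).
        score = len(rows) / len(training)
        for i, value in enumerate(observed):
            score *= sum(1 for row in rows if row[i] == value) / len(rows)
        scores[cls] = score
    total = sum(scores.values())
    # Normalize so the posteriors sum to 1 (assumes at least one class has a
    # non-zero score for the observed combination of feature values).
    return {cls: score / total for cls, score in scores.items()}

# Hypothetical patient records: (test_1, test_2, symptom, illness).
records = [
    ('Yes', 'Yes', 'Yes', 'Yes'),
    ('Yes', 'No', 'Yes', 'Yes'),
    ('No', 'Yes', 'Yes', 'No'),
    ('No', 'No', 'No', 'No'),
]
print(naive_bayes_posterior(records, ('No', 'Yes', 'Yes')))
# For this toy table, the posterior is {'Yes': 0.0, 'No': 1.0} (key order may vary).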
5. a) The result of the algorithm is as follows:
[['Yes', 'No', 'Yes', 'No', 'Yes', {'Yes': 0.8459918784779665, 'No': 0.15400812152203341}]]
So, according to the naive Bayes algorithm, when applied to the data in the
table, the email is spam with a probability of about 85%.
b) This method may not be as good, since the occurrences of certain words in a
spam email are not independent of one another. For example, a spam email
containing the word money will typically try to convince the victim that they
could somehow get money from the spammer, so other words such as rich, secret,
or free are more likely to appear in that email as well. A nearest neighbor
algorithm would seem to perform better at spam email classification; one could
verify how the two methods actually compare by using cross-validation, as in
the sketch below.
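As a sketch of such a cross-validation comparison (assuming scikit-learn is
available; the tiny email corpus below is invented purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

emails = [
    "free money rich secret",        # hypothetical spam
    "meeting agenda for Monday",     # hypothetical ordinary email
    "claim your free prize money",   # hypothetical spam
    "lunch with the project team",   # hypothetical ordinary email
]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words features: one word count per column.
X = CountVectorizer().fit_transform(emails)

for name, model in [("naive Bayes", MultinomialNB()),
                    ("nearest neighbor", KNeighborsClassifier(n_neighbors=1))]:
    scores = cross_val_score(model, X, labels, cv=2)
    print(name, scores.mean())

With a realistic corpus, the method with the higher cross-validated accuracy
would be the one to prefer.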
6. For this problem, we will use the extended Bayes' theorem for both continuous
and discrete random variables:

P(male | height=172cm, weight=60kg, hair=long) = R / (R + ~R)

where:

R = P(height=172cm | male) * P(weight=60kg | male) * P(hair=long | male) * P(male)
~R = P(height=172cm | female) * P(weight=60kg | female) * P(hair=long | female) * P(female)
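Since height and weight are continuous, the terms P(height=172cm | class) and
P(weight=60kg | class) are evaluated as probability density values, typically
under a normal distribution fitted to the training data. A minimal sketch of the
calculation, with all class-conditional statistics invented for illustration:

import math

def gaussian_density(x, mean, std):
    # Density of a normal distribution with the given mean and standard deviation.
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Hypothetical class-conditional statistics: (mean, std) for the continuous
# variables, a probability of long hair, and the class prior.
stats = {
    'male':   {'height': (177.0, 7.0), 'weight': (75.0, 10.0), 'long_hair': 0.1, 'prior': 0.5},
    'female': {'height': (165.0, 6.0), 'weight': (60.0, 8.0),  'long_hair': 0.8, 'prior': 0.5},
}

def posterior(height, weight, long_hair):
    scores = {}
    for cls, s in stats.items():
        scores[cls] = (gaussian_density(height, *s['height'])
                       * gaussian_density(weight, *s['weight'])
                       * (s['long_hair'] if long_hair else 1 - s['long_hair'])
                       * s['prior'])
    # R / (R + ~R): normalize the two scores into posterior probabilities.
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}

print(posterior(172, 60, long_hair=True))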