Page 38 - Data Science Algorithms in a Week

Classification Using K Nearest Neighbors

4. Map of Italy - choosing the value of k: We are given a partial map of Italy, as in
the Map of Italy problem. But suppose that the complete data is not available,
so we cannot calculate the error rate on all the predicted points for different
values of k. How should one choose the value of k for the k-NN algorithm to
complete the map of Italy with the greatest possible accuracy?
5. House ownership: Using the data from the section concerned with the problem
of house ownership, find the closest neighbor to Peter using the Euclidean metric:

a) without rescaling the data,
b) using the scaled data.

Is the closest neighbor in a) the same as the neighbor in b)? Which of the
neighbors owns the house?

6. Text classification: Suppose you would like to find books or documents in
Gutenberg's corpus (www.gutenberg.org) that are similar to a selected book from
the corpus (for example, the Bible) using a certain metric and the 1-NN algorithm.
How would you design a metric measuring the distance, that is, the degree of
dissimilarity, between two documents?
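As a starting point for problem 6, one possible design (not necessarily the intended answer) is to represent each document by its word counts and to use the cosine distance between the count vectors. The sketch below is written under that assumption. The function names word_counts, cosine_distance, and nearest_document are illustrative, and cosine distance is not a metric in the strict mathematical sense, because it can violate the triangle inequality.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word occurrences in a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(text_a, text_b):
    """Return 1 - cosine similarity of the two word-count vectors."""
    a, b = word_counts(text_a), word_counts(text_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # A text with no words shares nothing with any other text.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def nearest_document(query_text, corpus):
    """1-NN search: return the title in corpus (title -> text) closest to query_text."""
    return min(corpus, key=lambda title: cosine_distance(query_text, corpus[title]))
```

Calling nearest_document(selected_text, corpus) with a dictionary that maps titles to full texts returns the title of the closest document. On a real corpus, very common words would dominate the counts, so some form of weighting or stop-word removal would likely be needed.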

Analysis:

1. -8 degrees Celsius is closer to 20 degrees Celsius than to -50 degrees Celsius. So,
the algorithm would conclude that Mary feels warm at -8 degrees Celsius.
But common sense and everyday knowledge tell us that this is unlikely to be
true. In more complex examples, we may be seduced by the results of the
analysis into drawing false conclusions because we lack expertise. But remember
that data science makes use of substantive and expert knowledge, not only data
analysis. To draw good conclusions, we should have a good understanding of the
problem and our data.

The algorithm further says that at 22 degrees Celsius, Mary should feel
warm, and there is no doubt about that, as 22 degrees Celsius is higher than 20
degrees Celsius and a human being feels warmer at a higher temperature;
again, a trivial use of our knowledge. For 15 degrees Celsius, the algorithm
would deem Mary to feel warm, but by common sense we may not be that
certain of this statement.
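The 1-NN reasoning in this analysis can be reproduced with a short Python sketch. It assumes the known data consist of just two observations inferred from the text: Mary feels cold at -50 degrees Celsius and warm at 20 degrees Celsius. The names known_samples and classify_1nn are illustrative.

```python
# Assumed training data, inferred from the analysis:
# (temperature in degrees Celsius, how Mary feels)
known_samples = [(-50, "cold"), (20, "warm")]


def classify_1nn(temperature, samples=known_samples):
    """Return the label of the sample whose temperature is nearest."""
    nearest = min(samples, key=lambda s: abs(s[0] - temperature))
    return nearest[1]


for t in (-8, 15, 22):
    # |-8 - 20| = 28 is smaller than |-8 - (-50)| = 42, so -8 is labeled "warm".
    print(t, classify_1nn(t))
```

All three temperatures are labeled warm, including -8 degrees Celsius. This is why the result has to be checked against what we know about the problem.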
