Page 38 - Data Science Algorithms in a Week

Classification Using K Nearest Neighbors


                   4.  Map of Italy - choosing the value of k: We are given a partial map of Italy, as in
                      the problem Map of Italy, but suppose that the complete data is not available.
                      Thus, we cannot calculate the error rate on all the predicted points for different
                      values of k. How should one choose the value of k for the k-NN algorithm to
                      complete the map of Italy with maximum accuracy?
                   5.  House ownership: Using the data from the section concerned with the problem
                      of house ownership, find the closest neighbor to Peter using the Euclidean metric:


                                   a) without rescaling the data,
                                   b) using the scaled data.

                          Is the closest neighbor in a) the same as the neighbor in b)? Which of the
                          neighbors owns the house?

                   6.  Text classification: Suppose you would like to find books or documents in
                      Gutenberg's corpus (www.gutenberg.org) that are similar to a selected book from
                      the corpus (for example, the Bible) using a certain metric and the 1-NN algorithm.
                      How would you design a metric that measures the distance (that is, the
                      dissimilarity) between two documents?
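To see why problem 5 asks for both variants, the following minimal sketch compares the closest neighbor before and after rescaling. The names, ages, and incomes are hypothetical and are not the data from the house ownership section. Min-max scaling to the interval [0, 1] is assumed here as one common choice of rescaling.

```python
import math

# HYPOTHETICAL data (name, age in years, annual income in USD).
# These rows are illustrative only; they are not the book's table.
people = [
    ("Ann", 23, 80500),
    ("Bob", 51, 82000),
    ("Cid", 60, 40000),
]
query = (50, 80000)  # hypothetical person whose closest neighbor we seek


def euclidean(p, q):
    """Euclidean distance between two feature tuples of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest(points, target):
    """Return the name of the point whose features are closest to target."""
    return min(points, key=lambda row: euclidean(row[1:], target))[0]


def min_max_scaler(points):
    """Build a function that maps each feature to [0, 1] using the known points."""
    columns = list(zip(*(row[1:] for row in points)))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return lambda features: tuple(
        (x - lo) / span for x, lo, span in zip(features, lows, spans)
    )


scale = min_max_scaler(people)
scaled_people = [(name, *scale((age, income))) for name, age, income in people]

print(nearest(people, query))                # unscaled: income differences dominate
print(nearest(scaled_people, scale(query)))  # scaled: age and income weigh comparably
```

With these made-up values, the two calls return different neighbors, because an income gap of a few hundred dollars outweighs an age gap of decades until both features are brought to the same scale.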

            Analysis:

                   1.  -8 degrees Celsius is closer to 20 degrees Celsius than to -50 degrees Celsius
                      (a distance of 28 degrees versus 42 degrees). So, the algorithm would classify
                      that Mary should feel warm at -8 degrees Celsius. But our common sense and
                      knowledge tell us that this is likely not true. In more complex examples, we
                      may be seduced by the results of the analysis into drawing false conclusions
                      due to our lack of expertise. But remember that data science makes use of
                      substantive and expert knowledge, not only data analysis. To draw sound
                      conclusions, we should have a good understanding of the problem and our
                      data.

                          The algorithm further says that at 22 degrees Celsius, Mary should feel
                          warm, and there is no doubt about that, as 22 degrees Celsius is higher than
                          20 degrees Celsius and a human being feels warmer at a higher temperature;
                          again, a trivial use of our knowledge. For 15 degrees Celsius, the algorithm
                          would deem Mary to feel warm, but using our common sense, we may not be
                          that certain of this statement.
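The reasoning above can be reproduced with a minimal 1-NN sketch. It assumes, as the analysis implies, that the only known data points are that Mary feels cold at -50 degrees Celsius and warm at 20 degrees Celsius. The function name classify_1nn is our own choice for illustration.

```python
# Known data points implied by the analysis: (temperature in Celsius, feeling).
training = [(-50.0, "cold"), (20.0, "warm")]


def classify_1nn(temperature, samples):
    """Return the label of the sample whose temperature is nearest."""
    nearest = min(samples, key=lambda sample: abs(sample[0] - temperature))
    return nearest[1]


for t in (-8, 22, 15):
    print(t, classify_1nn(t, training))
```

With only two data points, the decision boundary lies at their midpoint, -15 degrees Celsius, so every temperature above it is classified as warm. This is why the algorithm's answer for -8 degrees Celsius conflicts with common sense.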









