Page 34 - Data Science Algorithms in a Week
P. 34

Classification Using K Nearest Neighbors


            Analysis:

            Using, for example, the 1-NN algorithm and the Manhattan or Euclidean distance would
            result in the classification of the document in question to the class of mathematics.
            However, intuitively, we should instead use a different metric to measure the distance, as
            the document in question has a much higher count of the word computer than other known
            documents in the class of mathematics.
            Another candidate metric for this problem is a metric that would measure the proportion of
            the counts for the words, or the angle between the instances of documents. Instead of the
            angle, one could take the cosine of the angle cos(θ), and then use the well-known dot
            product formula to calculate the cos(θ).

            Let a=(a ,a ), b=(b ,b ), then instead this formula:
                   x
                     y
                              y
                            x

            One derives:






            Using the cosine distance metric, one could classify the document in question to the class of
            informatics:





























                                                     [ 22 ]
   29   30   31   32   33   34   35   36   37   38   39