Page 34 - Data Science Algorithms in a Week

P. 34

Classification Using K Nearest Neighbors

Analysis:

Using, for example, the 1-NN algorithm and the Manhattan or Euclidean distance would
result in the classification of the document in question to the class of mathematics.
However, intuitively, we should instead use a different metric to measure the distance, as
the document in question has a much higher count of the word computer than other known
documents in the class of mathematics.
Another candidate metric for this problem is a metric that would measure the proportion of
the counts for the words, or the angle between the instances of documents. Instead of the
angle, one could take the cosine of the angle cos(θ), and then use the well-known dot
product formula to calculate the cos(θ).

Let a=(a ,a ), b=(b ,b ), then instead this formula:
x
y
y
x

One derives:

Using the cosine distance metric, one could classify the document in question to the class of
informatics:

[ 22 ]

29 30 31 32 33 34 35 36 37 38 39