Page 37 - Data Science Algorithms in a Week
Classification Using K Nearest Neighbors
Summary
The k-nearest neighbor algorithm is a classification algorithm that assigns to a given data
point the majority class among its k nearest neighbors. The distance between two points is
measured by a metric. Examples of distance metrics include: Euclidean distance, Manhattan
distance, Minkowski distance, Hamming distance, Mahalanobis distance, Tanimoto
distance, Jaccard distance, tangential distance, and cosine distance. Experiments with
various parameters, together with cross-validation, can help to establish which value of k
and which metric should be used.
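The voting scheme described above can be sketched in a few lines of Python. This is an illustrative sketch, not the book's own implementation; the function name `knn_classify` and the toy training points are assumptions made for the example, and only the Euclidean and Manhattan metrics from the list above are shown.

```python
from collections import Counter

def knn_classify(train, query, k=3, metric="euclidean"):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; a point is a tuple of numbers.
    """
    def distance(a, b):
        if metric == "manhattan":
            # Manhattan distance: sum of absolute coordinate differences
            return sum(abs(x - y) for x, y in zip(a, b))
        # Default: Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Sort the training points by distance to the query and keep the k nearest
    neighbors = sorted(train, key=lambda pair: distance(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B")]
print(knn_classify(train, (7, 7), k=3))  # the three nearest points are all "B"
```

Trying different values of k and both metrics on held-out data is exactly the kind of experiment the cross-validation advice above refers to.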
The dimensionality and position of a data point in the space are determined by its qualities.
A large number of dimensions can result in low accuracy of the k-NN algorithm. Removing
the dimensions that correspond to qualities of smaller importance can increase accuracy.
Similarly, to increase accuracy further, the distance along each dimension should be scaled
according to the importance of the quality of that dimension.
Problems
1. Mary and her temperature preferences: Imagine that you know that your friend
Mary feels cold when it is -50 degrees Celsius, but she feels warm when it is 20
degrees Celsius. What would the 1-NN algorithm say about Mary: would she feel
warm or cold at the temperatures 22, 15, and -10? Do you think that the algorithm
predicted Mary's perception of the temperature correctly? If not, give your
reasons, suggest why the algorithm did not give appropriate results, and say
what would need to improve in order for the algorithm to make a better
classification.
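To check the mechanics of problem 1, a 1-NN classifier on one-dimensional data reduces to picking the label of the closest known temperature. This sketch computes the algorithm's raw predictions only; the interpretive parts of the question are left to the reader. The function name `one_nn` is an assumption for the example.

```python
def one_nn(train, query):
    """Return the label of the single nearest training point (1-NN)."""
    return min(train, key=lambda pair: abs(pair[0] - query))[1]

# Mary's known preferences from the problem statement
train = [(-50, "cold"), (20, "warm")]
for t in (22, 15, -10):
    print(t, one_nn(train, t))
```

Note that -10 is 40 degrees away from -50 but only 30 degrees away from 20, which determines the algorithm's answer there.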
2. Mary and temperature preferences: Do you think that using the 1-NN
algorithm would yield better results than using the k-NN algorithm for k>1?
3. Mary and temperature preferences: We collected more data and found out that
Mary feels warm at 17 degrees Celsius, but cold at 18 degrees Celsius. By common
sense, Mary should feel warmer at a higher temperature. Can you explain a
possible cause of the discrepancy in the data? How could we improve the analysis
of our data? Should we also collect some non-temperature data? Supposing that
only temperature data is available, do you think that the 1-NN algorithm would
still yield better results with data like this? How should we choose k for the
k-NN algorithm to perform well?