Page 37 - Data Science Algorithms in a Week

Classification Using K Nearest Neighbors


            Summary

            The k-nearest neighbor algorithm is a classification algorithm that assigns to a given data
            point the majority class among its k nearest neighbors. The distance between two points is
            measured by a metric. Examples of distance metrics include the Euclidean, Manhattan,
            Minkowski, Hamming, Mahalanobis, Tanimoto, Jaccard, tangent, and cosine distances.
            Experiments with various parameters and cross-validation can help to establish which
            parameter k and which metric should be used.
            The dimensionality and position of a data point in the space are determined by its features.
            A large number of dimensions can result in low accuracy of the k-NN algorithm. Removing
            the dimensions that correspond to features of lesser importance can increase accuracy.
            Similarly, to increase accuracy further, the distance along each dimension should be scaled
            according to the importance of the feature in that dimension.
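The algorithm summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation; the function and variable names are my own, and the metric is passed in as a parameter so that any of the distances listed above could be swapped in:

```python
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two points given as coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_classify(train, point, k=3, metric=euclidean):
    """Return the majority class among the k nearest training points.

    train is a list of (features, label) pairs; metric is any distance
    function, so Manhattan, Minkowski, etc. can be substituted.
    """
    neighbors = sorted(train, key=lambda item: metric(item[0], point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two small clusters of training points with labels 'a' and 'b'.
train = [((0, 0), 'a'), ((1, 0), 'a'), ((5, 5), 'b'), ((6, 5), 'b')]
print(knn_classify(train, (0.5, 0), k=3))  # → a
```

Scaling each dimension by the importance of its feature, as suggested above, amounts to replacing `euclidean` with a weighted variant that multiplies each squared coordinate difference by a per-dimension weight before summing.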



            Problems


                   1.  Mary and her temperature preferences: Imagine that you know that your friend
                       Mary feels cold when it is -50 degrees Celsius, but she feels warm when it is 20
                       degrees Celsius. What would the 1-NN algorithm say about Mary: would she feel
                       warm or cold at the temperatures 22, 15, and -10 degrees Celsius? Do you think
                       that the algorithm predicted Mary's perception of the temperature correctly? If
                       not, please give your reasons, suggest why the algorithm did not give
                       appropriate results, and say what would need to be improved for the algorithm
                       to make a better classification.
                   2.  Mary and temperature preferences: Do you think that the 1-NN algorithm would
                       yield better results than the k-NN algorithm for k>1?
                   3.  Mary and temperature preferences: We collected more data and found out that
                       Mary feels warm at 17 degrees Celsius, but cold at 18 degrees Celsius. By our
                       common sense, Mary should feel warmer at a higher temperature. Can you explain
                       a possible cause of the discrepancy in the data? How could we improve the
                       analysis of our data? Should we also collect some non-temperature data?
                       Suppose that we have only temperature data available; do you think that the
                       1-NN algorithm would still yield better results with data like this? How
                       should we choose k for the k-NN algorithm to perform well?
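The setup of Problem 1 can be checked directly with a one-dimensional 1-NN sketch (a hypothetical helper of my own, assuming that the label of the closest known temperature decides the class):

```python
def one_nn(train, x):
    """1-NN on one-dimensional data: the label of the closest training point."""
    return min(train, key=lambda item: abs(item[0] - x))[1]

# Mary's known preferences from Problem 1: cold at -50 °C, warm at 20 °C.
train = [(-50, 'cold'), (20, 'warm')]
print([one_nn(train, t) for t in (22, 15, -10)])  # → ['warm', 'warm', 'warm']
```

All three query temperatures are closer to 20 than to -50, so 1-NN predicts warm even at -10 degrees Celsius, which is exactly the kind of questionable prediction the problem asks you to explain.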






