
Classification Using K Nearest Neighbors


      To be able to use our algorithm to yield better results, we should collect more
      data. For example, if we find out that Mary feels cold at 14 degrees Celsius,
      then we have a data instance that is very close to 15 degrees, and we can
      therefore guess with higher certainty that Mary would feel cold at a
      temperature of 15 degrees.

   2.  The data we are dealing with is just one-dimensional and is partitioned into
       two parts, cold and warm, with the property that the higher the temperature,
       the warmer a person feels. Also, even if we know how Mary feels at the
       temperatures -40, -39, ..., 39, 40, we still have a very limited number of
       data instances: just one around every degree Celsius. For these reasons, it
       is best to look at just the one closest neighbor.
                   3.  The discrepancies in the data can be caused by inaccuracy in the tests carried out.
                      This could be mitigated by performing more experiments.

      Apart from inaccuracy, there could be other factors that influence how Mary
      feels: for example, the wind speed, the humidity, the sunshine, how warmly
      Mary is dressed (whether she has a coat with jeans, just shorts with a
      sleeveless top, or even a swimming suit), and whether she is wet or dry. We
      could add these additional dimensions (wind speed and how she is dressed)
      to the vectors of our data points. This would provide more, and better
      quality, data for the algorithm, and consequently better results could be
      expected (the first sketch after this list illustrates such multidimensional
      data points).
      If we have only temperature data, but more of it (for example, 10 instances
      of classification for every degree Celsius), then we could increase k and
      look at more neighbors to determine how Mary feels at a given temperature
      more accurately. But this relies purely on the availability of the data. We
      could adapt the algorithm to yield the classification based on all the
      neighbors within a certain distance d, rather than on the k closest
      neighbors. This would make the algorithm work well in both cases: when we
      have a lot of data within a close distance, and when we have just one data
      instance close to the instance that we want to classify (the second sketch
      after this list shows this variant).

   4.  For this purpose, one can use cross-validation (consult the Cross-validation
       section in Appendix A - Statistics) to determine the value of k with the
       highest accuracy. One could separate the available data from the partial map
       of Italy into learning data and test data. For example, 80% of the classified
       pixels on the map would be given to the k-NN algorithm to complete the map.
       Then the remaining 20% of the classified pixels from the partial map would be
       used to calculate the percentage of pixels classified correctly by the k-NN
       algorithm (the third sketch after this list outlines this procedure).
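
      To make the idea of extending the data points with extra dimensions more
      concrete, here is a minimal Python sketch. The feature values, the labels,
      and the simple majority-vote helper are invented for illustration; they are
      not measurements from the exercise.

          import math

          # Each hypothetical data point is (temperature in degrees Celsius,
          # wind speed in km/h, clothing level on a 0-2 scale), labelled with
          # how Mary reported feeling. All values are made up.
          data = [
              ((15.0, 20.0, 0.0), 'cold'),
              ((15.0, 0.0, 2.0), 'warm'),
              ((25.0, 10.0, 1.0), 'warm'),
              ((5.0, 5.0, 2.0), 'cold'),
          ]

          def distance(a, b):
              # Euclidean distance between two feature vectors. In practice the
              # features should be rescaled first, since km/h and degrees
              # Celsius are not directly comparable.
              return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

          def knn_classify(query, data, k=1):
              # Majority label among the k data points closest to the query.
              neighbors = sorted(data, key=lambda p: distance(p[0], query))[:k]
              labels = [label for _, label in neighbors]
              return max(set(labels), key=labels.count)

          print(knn_classify((15.0, 10.0, 1.0), data, k=3))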
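
      The distance-based variant could then look like the following sketch, again
      with invented one-dimensional temperature data. Returning None when no
      neighbor lies within the distance d is just one possible design choice for
      the empty case.

          def radius_classify(query_temp, data, d):
              # Collect every instance within distance d of the query
              # temperature and take the majority label among them.
              labels = [label for temp, label in data
                        if abs(temp - query_temp) <= d]
              if not labels:
                  return None  # no data instance close enough to decide
              return max(set(labels), key=labels.count)

          data = [(10, 'cold'), (14, 'cold'), (20, 'warm'), (25, 'warm')]
          # Only the instance at 14 degrees lies within d of 15, so the
          # result is 'cold'.
          print(radius_classify(15, data, d=2))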
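
      The cross-validation procedure from answer 4 could be sketched as follows.
      The synthetic pixels and the diagonal boundary merely stand in for the
      partial map of Italy, and the k-NN helper is the same illustrative one as
      above; none of this is the book's implementation.

          import math
          import random

          def distance(a, b):
              return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

          def knn_classify(query, data, k):
              neighbors = sorted(data, key=lambda p: distance(p[0], query))[:k]
              labels = [label for _, label in neighbors]
              return max(set(labels), key=labels.count)

          # Hypothetical classified pixels ((x, y), label); a made-up diagonal
          # boundary stands in for the coastline on the map.
          pixels = []
          for _ in range(200):
              x, y = random.random(), random.random()
              pixels.append(((x, y), 'Italy' if x + y > 1.0 else 'sea'))

          # Split the classified pixels into 80% learning data and 20% test
          # data, as described in answer 4.
          random.shuffle(pixels)
          split = int(0.8 * len(pixels))
          learning, test = pixels[:split], pixels[split:]

          # Score each candidate k by the percentage of test pixels that the
          # k-NN algorithm classifies correctly.
          for k in (1, 3, 5, 7):
              correct = sum(knn_classify(point, learning, k) == label
                            for point, label in test)
              print(k, 100.0 * correct / len(test))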




