Classification Using K Nearest Neighbors
To enable our algorithm to yield better results, we should collect more
data. For example, if we find out that Mary feels cold at 14 degrees Celsius,
then we have a data instance that is very close to 15 degrees and, thus, we
can guess with higher certainty that Mary would feel cold at a temperature
of 15 degrees.
2. The data we are dealing with is just one-dimensional and is partitioned into
two classes, cold and warm, with the property that the higher the
temperature, the warmer a person feels. Moreover, even if we knew how Mary feels
at the temperatures -40, -39, ..., 39, 40, we would still have a very limited
number of data instances: just one for every degree Celsius. For these reasons,
it is best to look at just the one closest neighbor, as in the sketch below.
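A minimal sketch of such a 1-NN decision on one-dimensional data; the sample temperatures and class labels here are made up purely for illustration:

def one_nearest_neighbor(data, query):
    """Return the class of the single closest data point."""
    temperature, feeling = min(data, key=lambda item: abs(item[0] - query))
    return feeling

# Hypothetical (temperature in degrees Celsius, how Mary feels) pairs.
data = [(5, 'cold'), (10, 'cold'), (14, 'cold'), (20, 'warm'), (25, 'warm')]
print(one_nearest_neighbor(data, 15))   # 'cold', since 14 is the closest point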
3. The discrepancies in the data can be caused by inaccuracy in the tests carried out.
This could be mitigated by performing more experiments.
Apart from inaccuracy, there could be other factors that influence how Mary
feels: for example, the wind speed, the humidity, the sunshine, how warmly Mary
is dressed (whether she has a coat with jeans, just shorts with a sleeveless
top, or even a swimming suit), and whether she was wet or dry. We could add
these additional dimensions (wind speed and how warmly she is dressed) to the
vectors of our data points. This would provide more, and better quality, data
for the algorithm and, consequently, better results could be expected, as in
the sketch below.
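A sketch of k-NN over such multi-dimensional vectors; the feature values and the clothing-level encoding (0 = swimming suit up to 2 = coat) are assumptions chosen only to illustrate the idea, and in practice the features would need to be scaled to comparable ranges:

import math
from collections import Counter

def knn_classify(data, query, k):
    # Sort data points by Euclidean distance to the query vector.
    by_distance = sorted(data, key=lambda item: math.dist(item[0], query))
    # Majority vote among the k closest neighbors.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical vectors: (temperature in Celsius, wind speed in km/h,
# clothing level from 0 = swimming suit to 2 = coat).
data = [
    ((15, 20, 0), 'cold'),
    ((15, 0, 2), 'warm'),
    ((25, 5, 1), 'warm'),
    ((5, 10, 2), 'cold'),
]
print(knn_classify(data, (15, 10, 1), 3))   # 'cold' by a 2-to-1 vote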
If we have only temperature data, but more of it (for example, 10 instances of
classification for every degree Celsius), then we could increase k and look
at more neighbors to determine the class more accurately. However, this
relies purely on the availability of the data. We could adapt the algorithm to
yield the classification based on all the neighbors within a certain distance d
rather than on the k closest neighbors. This would make the algorithm work
well both when we have a lot of data within a close distance and when we have
just one data instance close to the instance that we want to classify. A
sketch of this variant follows.
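A sketch of the distance-based variant on one-dimensional data, with a fall-back to the single nearest neighbor when no data point lies within d; the data and the value of d are hypothetical:

from collections import Counter

def radius_classify(data, query, d):
    # Labels of all data points within distance d of the query.
    neighbors = [label for temp, label in data if abs(temp - query) <= d]
    if not neighbors:
        # No point within d: fall back to the single nearest neighbor.
        return min(data, key=lambda item: abs(item[0] - query))[1]
    return Counter(neighbors).most_common(1)[0][0]

data = [(5, 'cold'), (10, 'cold'), (14, 'cold'), (20, 'warm'), (25, 'warm')]
print(radius_classify(data, 15, 3))   # only 14 lies within 3 degrees: 'cold'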
4. For this purpose, one can use cross-validation (consult the Cross-validation section
in Appendix A - Statistics) to determine the value of k with the highest
accuracy. One could separate the available data from the partial map of Italy into
learning and test data. For example, 80% of the classified pixels on the map
would be given to the k-NN algorithm to complete the map. Then the remaining
20% of the classified pixels from the partial map would be used to calculate the
percentage of pixels classified correctly by the k-NN algorithm.
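A sketch of this 80/20 validation, assuming a generic list of (point, label) pairs in place of the classified map pixels; the data below is synthetic and generated only for illustration:

import math
import random
from collections import Counter

def knn_classify(train, query, k):
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    return Counter(label for _, label in by_distance[:k]).most_common(1)[0][0]

def accuracy_for_k(data, k, train_fraction=0.8):
    shuffled = random.sample(data, len(data))        # random 80/20 split
    split = int(train_fraction * len(shuffled))
    train, test = shuffled[:split], shuffled[split:]
    correct = sum(1 for point, label in test
                  if knn_classify(train, point, k) == label)
    return correct / len(test)

# Synthetic 2D "pixels" with two classes, standing in for the map data.
random.seed(0)
data = [((x, y), 'A' if x + y < 10 else 'B')
        for x in range(10) for y in range(10)]

# Pick the k with the highest accuracy on the held-out 20%.
best_k = max(range(1, 10), key=lambda k: accuracy_for_k(data, k))
print(best_k)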