Page 40 - Data Science Algorithms in a Week
Classification Using K Nearest Neighbors
5. a) Without data rescaling, Peter's closest neighbor has an annual income of 78,000
USD and is aged 25. This neighbor does not own a house.
b) After data rescaling, Peter's closest neighbor has an annual income of 60,000 USD
and is aged 40. This neighbor owns a house.
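The effect described in 5 a) and b) can be sketched in a few lines. The two neighbors below are the ones named in the answer; Peter's own age and income and the minimum and maximum of each feature are not given here, so the values used are hypothetical and chosen only to illustrate how min-max rescaling can change which neighbor is closest.

```python
import math

# The two neighbors named in the answer: (age, annual income in USD, owns a house).
neighbors = [
    (25, 78_000, False),
    (40, 60_000, True),
]

# ASSUMPTION: Peter's values and the feature ranges are hypothetical,
# picked only for illustration; they do not come from the text.
peter = (50, 80_000)
age_range = (20, 70)
income_range = (30_000, 130_000)


def identity(point):
    """Leave the (age, income) pair unchanged (no rescaling)."""
    return point


def min_max_scaled(point):
    """Rescale age and income to the interval [0, 1] using the assumed ranges."""
    age, income = point
    return (
        (age - age_range[0]) / (age_range[1] - age_range[0]),
        (income - income_range[0]) / (income_range[1] - income_range[0]),
    )


def closest(point, candidates, transform):
    """Return the candidate with the smallest Euclidean distance to point."""
    target = transform(point)
    return min(candidates, key=lambda c: math.dist(target, transform(c[:2])))


raw_nearest = closest(peter, neighbors, identity)
scaled_nearest = closest(peter, neighbors, min_max_scaled)
print(raw_nearest, scaled_nearest)
```

Without rescaling, the income difference (thousands of USD) dominates the age difference (tens of years), so the neighbor with the most similar income wins. After rescaling, both features contribute on a comparable scale.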
6. To design a metric that accurately measures the distance between
two documents, we need to select the important words that will form the dimensions
of the frequency vectors for the documents. Words that do not determine the
semantic meaning of a document tend to have approximately the same
frequency count across all the documents. Thus, instead of raw counts, we could produce a list
of relative word frequency counts for a document. For example, we could
use the following definition:
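One form consistent with the description above, stated here as an assumption rather than as the book's own formula, divides a word's count in the document by its total count across all the documents:

```latex
\mathrm{relative\_frequency\_count}(w, d)
  = \frac{\mathrm{frequency\_count}(w, d)}
         {\sum_{d' \in D} \mathrm{frequency\_count}(w, d')}
```

Here $w$ is a word, $d$ is the document, and $D$ is the set of all documents (with $d \in D$); these symbols are introduced only for this sketch. A word that appears equally often in every document gets the same low value everywhere, while a word concentrated in $d$ gets a value close to 1.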
Then the document could be represented by an N-dimensional vector
consisting of the word frequencies for the N words with the highest relative
frequency count. Such a vector will tend to consist of more important
words than a vector built from the N words with the highest raw frequency count.
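The idea can be sketched as follows. This is a minimal illustration, assuming that the relative frequency count of a word is its count in the document divided by its total count across all the documents (an assumed definition), that words are split on whitespace, and that the document itself is part of the corpus. The function names and the tiny corpus are hypothetical.

```python
from collections import Counter


def word_counts(text):
    """Count lowercase, whitespace-separated words in a text."""
    return Counter(text.lower().split())


def top_relative_words(doc, corpus, n):
    """Return the n words of doc with the highest relative frequency count.

    ASSUMPTION: relative count = count in doc / total count over the corpus,
    and doc is one of the texts in corpus (so totals are never zero).
    """
    doc_counts = word_counts(doc)
    total_counts = Counter()
    for text in corpus:
        total_counts.update(word_counts(text))
    relative = {w: c / total_counts[w] for w, c in doc_counts.items()}
    # Highest relative count first; ties broken by raw count, then alphabetically.
    ranked = sorted(relative, key=lambda w: (-relative[w], -doc_counts[w], w))
    return ranked[:n]


def frequency_vector(doc, words):
    """Represent doc as a vector of raw counts for the chosen words."""
    counts = word_counts(doc)
    return [counts[w] for w in words]


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird is on the tree",
]
doc = corpus[0]

by_relative = top_relative_words(doc, corpus, 2)
by_raw = [w for w, _ in word_counts(doc).most_common(1)]
print(by_relative, frequency_vector(doc, by_relative), by_raw)
```

In this toy corpus the most frequent word of the first document is "the", which says nothing about its content, whereas the words chosen by relative frequency count are specific to that document.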