Page 40 - Data Science Algorithms in a Week
P. 40

Classification Using K Nearest Neighbors


                   5.  a) Without data rescaling, Peter's closest neighbor has an annual income of 78,000
                      USD and is aged 25. This neighbor does not own a house.
                      b) After data rescaling, Peter's closet neighbor has annual income of 60,000 USD
                      and is aged 40. This neighbor owns a house.
                   6.  To design a metric that accurately measures the similarity distance between the
                      two documents, we need to select important words that will form the dimensions
                      of the frequency vectors for the documents. The words that do not determine the
                      semantic meaning of a documents tend to have an approximately similar
                      frequency count across all the documents. Thus, instead, we could produce a list
                      with the relative word frequency counts for a document. For example, we could
                      use the following definition:






                          Then the document could be represented by an N-dimensional vector
                          consisting of the word frequencies for the N words with the highest relative
                          frequency count. Such a vector will tend to consist of the more important
                          words than a vector of the N words with the highest frequency count.




































                                                     [ 28 ]
   35   36   37   38   39   40   41   42   43   44   45