Page 39 - FINAL CFA II SLIDES JUNE 2019 DAY 3

UNSUPERVISED LEARNING
READING 8: MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

     ALGORITHMS
     MODULE 8.10: SUPERVISED AND UNSUPERVISED MACHINE LEARNING





     Clustering: Given a data set, clustering is the process of grouping observations into categories based on similarities in their
     attributes. For example, stocks can be assigned to categories based on their past performance rather than on standard sector
     classifications (e.g., finance, healthcare, technology). Clustering can be bottom-up or top-down. Bottom-up clustering starts
     with each observation as its own cluster and then either adds similar observations to an existing cluster or forms another
     non-overlapping cluster. Top-down clustering starts with one giant cluster containing all observations and then partitions that
     cluster into smaller and smaller clusters.
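
     The bottom-up process described above can be sketched in Python. This is a minimal, hypothetical illustration (not from the reading): it implements simple agglomerative clustering with single-linkage distance, and the function name and example data are assumptions for illustration only.

```python
import numpy as np

def bottom_up_cluster(points, n_clusters):
    """Bottom-up (agglomerative) clustering: start with each observation
    as its own cluster, then repeatedly merge the two closest clusters
    until only n_clusters non-overlapping clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-linkage distance: closest pair of observations
                # across the two clusters (an assumed choice of linkage).
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge the two closest clusters
        del clusters[b]
    return clusters

# Hypothetical example: four "stocks" whose attributes form two groups.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
groups = bottom_up_cluster(points, 2)
```

     Running this on the example data groups observations 0 and 1 together and 2 and 3 together, based purely on similarity of their attributes rather than any predefined labels.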


     Dimension Reduction: Problems associated with too much noise often arise when the number of features in a data set (its
     dimension) is excessive. Dimension reduction seeks to remove the noise (i.e., those attributes that do not contain much
     information). One method is principal component analysis (PCA), which summarizes the information in a large number of
     correlated factors into a much smaller set of uncorrelated factors. The first factor in PCA is the most important factor in
     explaining the variation across observations, the second factor is the second most important, and so on, up to the
     number of uncorrelated factors specified by the researcher.
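
     The PCA idea above can be sketched with numpy. This is a minimal illustration (not from the reading): it assumes PCA via eigendecomposition of the covariance matrix, and the function name and example data are hypothetical.

```python
import numpy as np

def pca(X, n_components):
    """Summarize correlated features into a smaller set of
    uncorrelated factors, ordered by explained variance."""
    # Center the data so each feature has zero mean.
    Xc = X - X.mean(axis=0)
    # Eigen-decompose the covariance matrix of the features.
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns ascending order; reverse so the first factor
    # explains the most variation across observations.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Project onto the top n_components uncorrelated factors.
    scores = Xc @ eigvecs[:, :n_components]
    explained = eigvals[:n_components] / eigvals.sum()
    return scores, explained

# Hypothetical example: 6 observations, 3 features, two of them correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 1))
X = np.hstack([base,
               base * 2 + rng.normal(scale=0.1, size=(6, 1)),
               rng.normal(size=(6, 1))])
scores, explained = pca(X, 2)
```

     The researcher specifies the number of factors (here, 2); the resulting factor scores are uncorrelated with one another, and the explained-variance fractions are in decreasing order.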

     STEPS IN MODEL TRAINING


      LOS 8.r: Describe the steps in model training.


      1. Specify the algorithm.
      2. Specify the hyperparameters (before the processing begins).
      3. Divide data into training and validation samples. In the case of cross validation, the training and validation
         samples are randomly generated every learning cycle.
      4. Evaluate the training using a performance parameter, P, in the validation sample.
      5. Repeat the training until an adequate level of performance is achieved. In choosing the number of times to repeat,
         the researcher must take care to avoid overfitting the model.
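
      The five steps above can be sketched as a training loop in Python. This is a hypothetical illustration (not from the reading): the "algorithm" is a one-parameter gradient-descent fit, the hyperparameter is a learning rate, and all names and data are assumptions for illustration only.

```python
import numpy as np

# Step 1: specify the algorithm -- here, fit y = w * x by gradient descent.
def train_once(X_train, y_train, hyper_lr):
    w = 0.0
    for _ in range(200):
        grad = -2 * np.mean((y_train - w * X_train) * X_train)
        w -= hyper_lr * grad
    return w

# Step 4: a performance parameter P -- mean squared error on the
# validation sample (an assumed choice).
def performance(w, X_val, y_val):
    return np.mean((y_val - w * X_val) ** 2)

# Hypothetical data with a true slope of 3.0.
rng = np.random.default_rng(1)
X = rng.normal(size=100)
y = 3.0 * X + rng.normal(scale=0.1, size=100)

# Steps 3-5: randomly generate train/validation samples each learning
# cycle (cross validation), evaluate P, and repeat -- with a fixed cap
# on cycles to avoid overfitting.
best_p, best_w = np.inf, None
for cycle in range(5):
    idx = rng.permutation(100)
    train, val = idx[:70], idx[70:]
    # Step 2: the hyperparameter (learning rate) is fixed before
    # processing begins.
    w = train_once(X[train], y[train], hyper_lr=0.1)
    p = performance(w, X[val], y[val])
    if p < best_p:
        best_p, best_w = p, w
```

      Each cycle re-splits the data at random, so the validation sample changes every learning cycle; capping the number of cycles is one simple way to guard against over-tuning to the validation results.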