Page 39 - FINAL CFA II SLIDES JUNE 2019 DAY 3

UNSUPERVISED LEARNING
READING 8: MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

     ALGORITHMS
     MODULE 8.10: SUPERVISED AND UNSUPERVISED MACHINE LEARNING





     Clustering: Given a data set, clustering is the process of grouping observations into categories based on similarities in their
     attributes. For example, stocks can be assigned to categories based on their past performance rather than on standard sector
     classifications (e.g., finance, healthcare, technology). Clustering can be bottom-up or top-down. Bottom-up clustering starts
     with each observation as its own cluster and then either adds similar observations to an existing cluster or forms another
     non-overlapping cluster. Top-down clustering starts with one giant cluster containing all observations and then partitions that
     cluster into smaller and smaller clusters.
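
     The bottom-up process described above can be sketched in Python. This is a minimal, hypothetical illustration (not from the reading): it implements simple agglomerative clustering with single-linkage distance, and the function name and example data are assumptions for illustration only.

```python
import numpy as np

def bottom_up_cluster(points, n_clusters):
    """Bottom-up (agglomerative) clustering: start with each observation
    as its own cluster, then repeatedly merge the two closest clusters
    until only n_clusters non-overlapping clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-linkage distance: closest pair of observations
                # across the two clusters (an assumed choice of linkage).
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge the two closest clusters
        del clusters[b]
    return clusters

# Hypothetical example: four "stocks" whose attributes form two groups.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
groups = bottom_up_cluster(points, 2)
```

     Running this on the example data groups observations 0 and 1 together and 2 and 3 together, based purely on similarity of their attributes rather than any predefined labels.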


     Dimension Reduction: Problems associated with too much noise often arise when the number of features in a data set (its
     dimension) is excessive. Dimension reduction seeks to remove the noise (i.e., those attributes that do not contain much
     information). One method is principal component analysis (PCA), which summarizes the information in a large number of
     correlated factors into a much smaller set of uncorrelated factors. The first factor in PCA is the most important factor in
     explaining the variation across observations, the second factor is the second most important, and so on, up to the
     number of uncorrelated factors specified by the researcher.
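
     The PCA idea above can be sketched with numpy. This is a minimal illustration (not from the reading): it assumes PCA via eigendecomposition of the covariance matrix, and the function name and example data are hypothetical.

```python
import numpy as np

def pca(X, n_components):
    """Summarize correlated features into a smaller set of
    uncorrelated factors, ordered by explained variance."""
    # Center the data so each feature has zero mean.
    Xc = X - X.mean(axis=0)
    # Eigen-decompose the covariance matrix of the features.
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns ascending order; reverse so the first factor
    # explains the most variation across observations.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Project onto the top n_components uncorrelated factors.
    scores = Xc @ eigvecs[:, :n_components]
    explained = eigvals[:n_components] / eigvals.sum()
    return scores, explained

# Hypothetical example: 6 observations, 3 features, two of them correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 1))
X = np.hstack([base,
               base * 2 + rng.normal(scale=0.1, size=(6, 1)),
               rng.normal(size=(6, 1))])
scores, explained = pca(X, 2)
```

     The researcher specifies the number of factors (here, 2); the resulting factor scores are uncorrelated with one another, and the explained-variance fractions are in decreasing order.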

     STEPS IN MODEL TRAINING


      LOS 8.r: Describe the steps in model training.


      1. Specify the algorithm.
      2. Specify the hyperparameters (before the processing begins).
      3. Divide data into training and validation samples. In the case of cross validation, the training and validation
         samples are randomly generated every learning cycle.
      4. Evaluate the training using a performance parameter, P, in the validation sample.
      5. Repeat the training until an adequate level of performance is achieved. In choosing the number of times to repeat,
         the researcher must take care to avoid overfitting the model.
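
      The five steps above can be sketched as a training loop in Python. This is a hypothetical illustration (not from the reading): the "algorithm" is a one-parameter gradient-descent fit, the hyperparameter is a learning rate, and all names and data are assumptions for illustration only.

```python
import numpy as np

# Step 1: specify the algorithm -- here, fit y = w * x by gradient descent.
def train_once(X_train, y_train, hyper_lr):
    w = 0.0
    for _ in range(200):
        grad = -2 * np.mean((y_train - w * X_train) * X_train)
        w -= hyper_lr * grad
    return w

# Step 4: a performance parameter P -- mean squared error on the
# validation sample (an assumed choice).
def performance(w, X_val, y_val):
    return np.mean((y_val - w * X_val) ** 2)

# Hypothetical data with a true slope of 3.0.
rng = np.random.default_rng(1)
X = rng.normal(size=100)
y = 3.0 * X + rng.normal(scale=0.1, size=100)

# Steps 3-5: randomly generate train/validation samples each learning
# cycle (cross validation), evaluate P, and repeat -- with a fixed cap
# on cycles to avoid overfitting.
best_p, best_w = np.inf, None
for cycle in range(5):
    idx = rng.permutation(100)
    train, val = idx[:70], idx[70:]
    # Step 2: the hyperparameter (learning rate) is fixed before
    # processing begins.
    w = train_once(X[train], y[train], hyper_lr=0.1)
    p = performance(w, X[val], y[val])
    if p < best_p:
        best_p, best_w = p, w
```

      Each cycle re-splits the data at random, so the validation sample changes every learning cycle; capping the number of cycles is one simple way to guard against over-tuning to the validation results.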