Categorical encoding is then applied to convert non-numeric variables into a
format that machine learning algorithms can process. Label encoding was
chosen because it converts categories into numerical values efficiently, which
is required by algorithms that need numerical input. This procedure is carried
out for variables such as 'gender', 'smoking_status', and 'Residence_type',
among others, so that the information carried by the categorical data is
preserved in numerical form.
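As a minimal sketch of this encoding step, the listing below applies scikit-learn's LabelEncoder to the named columns; the tiny DataFrame and its category values are illustrative stand-ins for the actual dataset, not the data used in this report.

# Minimal sketch of the label-encoding step; the DataFrame below is an
# illustrative placeholder for the real dataset.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female'],
    'smoking_status': ['smokes', 'never smoked', 'formerly smoked'],
    'Residence_type': ['Urban', 'Rural', 'Urban'],
})

encoders = {}
for col in ['gender', 'smoking_status', 'Residence_type']:
    le = LabelEncoder()
    # Replace each category with an integer code and keep the fitted encoder
    # so the mapping can be inspected, inverted, or reused on new data.
    df[col] = le.fit_transform(df[col])
    encoders[col] = le

print(df)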
The Synthetic Minority Oversampling Technique (SMOTE) is used to resolve
the imbalance in the dataset, which could otherwise lead to a biased model. As
seen in Figure 4.46, it artificially synthesises additional samples of the
minority class to provide a balanced representation of classes. The
argument for utilising SMOTE is its ability to attenuate the skewed class
distribution, thereby improving the model's generalisability.
Figure 4.46 The code to employ SMOTE.
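The code in Figure 4.46 is not reproduced here; the following sketch shows what such a SMOTE step might look like with the imbalanced-learn library, using synthetic data as a stand-in for the report's training split.

# Illustrative sketch of a SMOTE step; the synthetic data stands in for the
# report's actual training split.
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X_train, y_train = make_classification(
    n_samples=1000, weights=[0.95, 0.05], random_state=42
)

smote = SMOTE(random_state=42)
# fit_resample synthesises new minority-class samples so that both classes
# end up equally represented in the resampled training set.
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)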
Based on the literature review in the preceding chapter, the
RandomForestClassifier was selected for its resilience and efficacy in dealing
with both categorical and numerical data. It is trained on the balanced
dataset; the classifier was also chosen for its capacity to minimise overfitting
and to provide feature importance scores. The performance of the model is
thoroughly evaluated using conventional measures such as accuracy, precision,
recall, and the F1 score, which together give a full picture of its predictive
capability. Table 4.6 presents these metrics and the assessment findings.
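The listing below is a hedged, end-to-end sketch of the training and evaluation step described above; the synthetic data, the train/test split, and the hyperparameters are assumptions for illustration and do not reproduce the exact configuration or results reported in Table 4.6.

# End-to-end sketch: balance the training split with SMOTE, fit a random
# forest, and compute the four reported metrics on a held-out test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Balance only the training split, then fit the forest on the balanced data.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_bal, y_bal)

y_pred = clf.predict(X_test)
print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1 score :', f1_score(y_test, y_pred))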