Categorical encoding is then applied to convert non-numeric variables into a
format that machine learning algorithms can process. Label encoding was
chosen because it converts categories into numerical values efficiently, which
is required by algorithms that need numerical input. This procedure is carried
out for variables such as 'gender', 'smoking_status', and 'Residence_type',
among others, so that the information carried by the categorical data is
preserved in numerical form.
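As a minimal sketch of this encoding step, the listing below applies scikit-learn's LabelEncoder to the named columns; the tiny DataFrame and its category values are illustrative stand-ins for the actual dataset, not the data used in this report.

# Minimal sketch of the label-encoding step; the DataFrame below is an
# illustrative placeholder for the real dataset.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female'],
    'smoking_status': ['smokes', 'never smoked', 'formerly smoked'],
    'Residence_type': ['Urban', 'Rural', 'Urban'],
})

encoders = {}
for col in ['gender', 'smoking_status', 'Residence_type']:
    le = LabelEncoder()
    # Replace each category with an integer code and keep the fitted encoder
    # so the mapping can be inspected, inverted, or reused on new data.
    df[col] = le.fit_transform(df[col])
    encoders[col] = le

print(df)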
The Synthetic Minority Oversampling Technique (SMOTE) is used to resolve
the imbalance in the dataset, which could otherwise lead to a biased model. As
seen in Figure 4.46, it artificially synthesises additional samples of the
minority class to provide a balanced representation of classes. The
argument for utilising SMOTE is its ability to attenuate the skewed class
distribution, thereby improving the model's generalisability.
Figure 4.46 The code to employ SMOTE.
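The code in Figure 4.46 is not reproduced here; the following sketch shows what such a SMOTE step might look like with the imbalanced-learn library, using synthetic data as a stand-in for the report's training split.

# Illustrative sketch of a SMOTE step; the synthetic data stands in for the
# report's actual training split.
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X_train, y_train = make_classification(
    n_samples=1000, weights=[0.95, 0.05], random_state=42
)

smote = SMOTE(random_state=42)
# fit_resample synthesises new minority-class samples so that both classes
# end up equally represented in the resampled training set.
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)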
Based on the literature review in the preceding chapter, the
RandomForestClassifier was selected for its resilience and efficacy in dealing
with both categorical and numerical data. It is trained on the balanced
dataset; the classifier was also chosen for its capacity to minimise overfitting
and to provide feature importance scores. The performance of the model is
thoroughly evaluated using conventional measures such as accuracy, precision,
recall, and the F1 score, which together give a full picture of its predictive
capability. Table 4.6 presents these metrics and the assessment findings.
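The listing below is a hedged, end-to-end sketch of the training and evaluation step described above; the synthetic data, the train/test split, and the hyperparameters are assumptions for illustration and do not reproduce the exact configuration or results reported in Table 4.6.

# End-to-end sketch: balance the training split with SMOTE, fit a random
# forest, and compute the four reported metrics on a held-out test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Balance only the training split, then fit the forest on the balanced data.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_bal, y_bal)

y_pred = clf.predict(X_test)
print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1 score :', f1_score(y_test, y_pred))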