Page 120 - FULL REPORT 30012024


4.3.2 Prediction Module

This section covers the development and refinement of a predictive
analytics model for stroke risk assessment, which comprises two
fundamental phases: data cleaning and model training.

4.3.2.1 Data Cleaning

The "healthcare-dataset-stroke-data.csv" dataset used for the stroke
prediction analysis was already largely clean, partly because its Kaggle
host maintains it carefully and incorporates user comments for continual
improvement. Even so, some data cleaning was still required.

The dataset was processed in a Jupyter Notebook using Python and the
pandas library. The main purpose of the cleaning procedure was to prepare
the dataset for scenarios in which users submit their own data, which
requires clear and relevant data fields. First, the rows labelled 'Other'
in the 'gender' column were removed, narrowing the focus to the male and
female categories. Similarly, rows with the value 'Unknown' in the
'smoking_status' column were removed so that the smoking data is
unambiguous. Figure 4.41 shows the data cleaning code applied to the
prediction dataset.

Figure 4.41 The data cleaning Python code.
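
The two filtering steps described in this section can be sketched in pandas
as follows. This is a minimal illustration and not necessarily the exact code
shown in Figure 4.41: the function name clean_stroke_data and the small
in-memory sample are assumptions made for the example, while the column names
'gender' and 'smoking_status' and the removed values 'Other' and 'Unknown'
come from the description above.

```python
import pandas as pd


def clean_stroke_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with gender 'Other' or smoking_status 'Unknown'.

    Hypothetical helper written for illustration only.
    """
    # Build one boolean mask so both conditions are applied together.
    keep = (df["gender"] != "Other") & (df["smoking_status"] != "Unknown")
    # Reset the index so the cleaned frame is numbered 0..n-1 again.
    return df.loc[keep].reset_index(drop=True)


# Made-up sample rows (not taken from the real dataset) to show the effect.
sample = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Other", "Female"],
        "smoking_status": ["smokes", "Unknown", "never smoked", "formerly smoked"],
    }
)

cleaned = clean_stroke_data(sample)
print(cleaned)
```

With the actual file, the same function could be applied to the result of
pd.read_csv("healthcare-dataset-stroke-data.csv"), assuming the file sits in
the notebook's working directory.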

