Page 120 - FULL REPORT 30012024
P. 120
4.3.2 Prediction Module
This section is dedicated to the development and enhancement of a predictive
analytics model for stroke risk assessment, which encompasses two
fundamental phases, data cleaning and training model.
4.3.2.1 Data Cleaning
The "healthcare-dataset-stroke-data.csv" dataset, sourced for the stroke
prediction analysis, displayed a high degree of cleanliness, partly attributable
to the careful maintenance of its Kaggle host, who included user comments
for continual development. Despite this, some data cleansing was still
required.
The dataset was processed using Python and the Pandas module in Jupyter
Notebook. The fundamental purpose of the cleaning procedure was to adapt
the dataset for situations where users submit data, mandating clarity and
relevance in the data fields. At first, the rows labelled as 'Other' in the gender
column were removed, resulting in a narrower emphasis on the male and
female categories. Similarly, values designated as 'Unknown' in the
'smoking_status' column were eliminated to enhance clarity in smoking data.
Figure 4.41 depicts the data cleaning code applied for the prediction dataset.
Figure 4.41 The data cleaning python code.
103