Page 120 - FULL REPORT 30012024
P. 120

4.3.2  Prediction Module




                                This section is dedicated to the development and enhancement of a predictive
                                analytics  model  for  stroke  risk  assessment,  which  encompasses  two

                                fundamental phases, data cleaning and training model.


                        4.3.2.1 Data Cleaning



                                The  "healthcare-dataset-stroke-data.csv"  dataset,  sourced  for  the  stroke
                                prediction analysis, displayed a high degree of cleanliness, partly attributable

                                to the careful maintenance of its Kaggle host, who included user comments
                                for  continual  development.  Despite  this,  some  data  cleansing  was  still

                                required.


                                The dataset was processed using Python and the Pandas module in Jupyter

                                Notebook. The fundamental purpose of the cleaning procedure was to adapt
                                the  dataset  for  situations  where  users  submit  data,  mandating  clarity  and

                                relevance in the data fields. At first, the rows labelled as 'Other' in the gender
                                column were removed, resulting in a narrower emphasis on the male and

                                female  categories.  Similarly,  values  designated  as  'Unknown'  in  the

                                'smoking_status' column were eliminated to enhance clarity in smoking data.
                                Figure 4.41 depicts the data cleaning code applied for the prediction dataset.






















                                                   Figure 4.41 The data cleaning python code.



                                                               103
   115   116   117   118   119   120   121   122   123   124   125