Page 51 - Reclaim YOUR DIGITAL GOLD (without audio)

Data Collection Harvesting



            2.  Free and Open Dataset Access
            Open datasets are often the quickest and most
            straightforward way to collect data for your machine
            learning model. Thousands of them are available online,
            much like reusable code snippets. Most are free, easy
            to find, and save the time of collecting data yourself.
            Even when a public dataset is large, rich, and detailed,
            it may still need cleaning before it meets your specific
            requirements.

            The following are some of the best places to look for
            free public datasets:

                   ● Amazon
                   ● Kaggle
                   ● Microsoft
                   ● Government Datasets (e.g., statistics data)
                   ● Lionbridge AI
                   ● Google Dataset Search
                   ● UCI Machine Learning Repository
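
            The cleaning step mentioned above can be sketched in a
            few lines. This is only a minimal illustration, assuming
            a small CSV file: the sample data, the column names, and
            the clean_rows helper are all hypothetical and do not
            come from any of the sources listed. It drops rows whose
            price is missing or not a number and removes exact
            duplicates.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded public dataset.
RAW_CSV = """product,price,country
Widget,19.99,US
Gadget,,US
Widget,19.99,US
Gizmo,n/a,DE
Doohickey, 5.50 ,FR
"""


def clean_rows(raw_text):
    """Drop rows with missing or non-numeric prices and remove exact duplicates."""
    cleaned = []
    seen = set()
    for row in csv.DictReader(io.StringIO(raw_text)):
        price_text = (row.get("price") or "").strip()
        try:
            price = float(price_text)
        except ValueError:
            continue  # price is empty or not a number: skip the row
        key = (row["product"].strip(), price, row["country"].strip())
        if key in seen:
            continue  # exact duplicate of a row we already kept
        seen.add(key)
        cleaned.append({"product": key[0], "price": price, "country": key[2]})
    return cleaned


rows = clean_rows(RAW_CSV)
for row in rows:
    print(row)
```

            What counts as "clean" depends on the model you plan to
            train, so the rules in clean_rows are examples rather
            than a fixed recipe.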


            3.  Scanning for Data on the Internet

            Suppose we want to collect product information from
            Amazon, such as descriptions and prices. We could do
            this by hand, retyping or copy-pasting each entry, but
            Amazon lists far too many items, and their prices change
            far too often, for that to be feasible. This is what web
            scraping tools are for. They sift through pages on the
            Internet and extract the data you need. They can also
            check for new or updated data, either on a schedule or
            when run manually, and store it for later use.
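
            To make the idea concrete, here is a rough sketch of what
            a scraper does, using only Python's standard library. The
            HTML snippet, the class names ("product", "name",
            "price"), and the helper names are assumptions made for
            illustration; they are not Amazon's real page structure,
            and a real tool would first download the page. The sketch
            extracts names and prices, then compares two runs to find
            new or updated items.

```python
from html.parser import HTMLParser

# Hypothetical product-listing markup; real sites use different,
# frequently changing structures.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">USB Cable</span><span class="price">$7.99</span></li>
  <li class="product"><span class="name">Desk Lamp</span><span class="price">$24.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collect the text of name/price spans inside each product item."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # which field the current text belongs to

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.products.append({})
        elif tag == "span" and css_class in ("name", "price") and self.products:
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_products(html):
    """Return a list of {'name', 'price'} dicts found in the HTML text."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return [
        {"name": p["name"], "price": float(p["price"].lstrip("$"))}
        for p in parser.products
        if "name" in p and "price" in p
    ]


def updated_items(previous, current):
    """Return products that are new or whose price changed since the last run."""
    old_prices = {p["name"]: p["price"] for p in previous}
    return [p for p in current if old_prices.get(p["name"]) != p["price"]]


products = scrape_products(SAMPLE_HTML)
last_run = [{"name": "USB Cable", "price": 7.99}]  # hypothetical stored result
changes = updated_items(last_run, products)
print(products)
print(changes)
```

            Running the comparison on a schedule and saving each
            result is what lets a scraping tool keep up with prices
            that change often.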




