Page 51 - Reclaim YOUR DIGITAL GOLD (without audio)
Data Collection Harvesting
2. Free and Open Dataset Access
Open-source datasets are often the quickest and most
straightforward way to collect data for your machine
learning model. Thousands of them are available online,
much like reusable code snippets. They are free to
download, easy to find, and save considerable time.
However rich and detailed a public dataset appears, it
may still require cleaning before it meets your
specific requirements.
The following are some of the best places to look for
free public datasets:
● Amazon
● Kaggle
● Microsoft
● Government Datasets (e.g., statistics data)
● Lionbridge AI
● Google Dataset Search
● UCI Machine Learning Repository
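As noted above, a public dataset may still need cleaning before it fits your requirements. The sketch below shows one possible cleaning pass using only Python's standard library. The sample data, the column names (name, age, city), and the function name clean_rows are hypothetical and chosen for illustration only; a real dataset from any of the sources listed would have its own columns and its own problems.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded public CSV file.
# It contains a missing value, a near-duplicate, and a non-numeric age.
RAW_CSV = """name,age,city
Alice,34,Paris
Bob,,London
alice,34,paris
Carol,abc,Berlin
Dave,29, Madrid
"""


def clean_rows(text):
    """Return rows with trimmed fields, integer ages, and no duplicates."""
    reader = csv.DictReader(io.StringIO(text))
    seen = set()
    cleaned = []
    for row in reader:
        # Normalize whitespace and capitalization so near-duplicates match.
        name = (row["name"] or "").strip().title()
        city = (row["city"] or "").strip().title()
        try:
            age = int((row["age"] or "").strip())
        except ValueError:
            continue  # skip rows whose age is missing or not a number
        if not name:
            continue  # skip rows without a name
        key = (name, age, city)
        if key in seen:
            continue  # skip duplicates of a row already kept
        seen.add(key)
        cleaned.append({"name": name, "age": age, "city": city})
    return cleaned


if __name__ == "__main__":
    for record in clean_rows(RAW_CSV):
        print(record)
```

Of the five sample rows, this pass keeps two: it drops the row with a missing age, the row with a non-numeric age, and the duplicate that differs only in capitalization. Which rows count as invalid depends on your own requirements, so treat these rules as a starting point.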
3. Scanning for Data on the Internet
Suppose we want to collect product information from
Amazon, such as descriptions and prices. We could type
or copy and paste each entry by hand. However, Amazon
lists far too many items, and their prices change far too
often, for this to be feasible.
This is what web scraping tools are for. They sift through
web pages and extract the data you need. These tools can
also check for new or updated data, either automatically
on a schedule or when run manually, and store it for
your convenience.
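To make the idea concrete, here is a minimal sketch of the extraction step a scraping tool performs, written with Python's built-in html.parser module. The markup in SAMPLE_HTML, the class names (product, title, price), and the names ProductParser and scrape_products are assumptions made for illustration. They do not reflect Amazon's actual page structure, and a real tool would first download the page over HTTP.

```python
from html.parser import HTMLParser

# Hypothetical product-listing markup. A real page would be fetched over
# HTTP and would use different tags and class names.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="title">USB Cable</span><span class="price">$5.99</span></li>
  <li class="product"><span class="title">Desk Lamp</span><span class="price">$24.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collect the title and price text of each product list item."""

    def __init__(self):
        super().__init__()
        self.products = []   # one dict per <li class="product">
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.products.append({})
        elif tag == "span" and css_class in ("title", "price") and self.products:
            self._field = css_class

    def handle_data(self, data):
        # Only store text found inside a title or price span.
        if self._field and self.products:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_products(html):
    """Return a list of {'title': str, 'price': float} dicts."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return [
        {"title": p["title"], "price": float(p["price"].lstrip("$"))}
        for p in parser.products
        if "title" in p and "price" in p
    ]


if __name__ == "__main__":
    for item in scrape_products(SAMPLE_HTML):
        print(item["title"], item["price"])
```

Running the same function again later and comparing the results with the stored prices is one simple way to detect new or updated data.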