Page 12 - ChatGPT Prompts Book: Precision Prompts, Role Prompting, Training & AI Writing Techniques for Mortals
dataset contains a lot of errors or inconsistencies, the model
may learn to produce outputs that are similarly flawed.
Understanding the sources of the dataset used to train a
language model is therefore important for assessing the
model's biases and reliability. In the case of ChatGPT, the
model was trained on an extensive selection of data sources
including the following types:
1) Websites: Content from millions of websites,
including news articles and blog posts covering a
wide array of topics, such as science, technology,
politics, history, and culture.
2) Books: Excerpts from books, both fiction and
non-fiction, exposing the model to a variety of
writing styles, genres, and narrative structures.
3) Online forums: This includes content collected
from online forums and discussion boards, such as
Reddit and Stack Overflow, providing ChatGPT with
examples of informal language and conversation, as
well as a variety of opinions and viewpoints.
4) Social media: Text from social media platforms,
including Twitter and Facebook, was used to help
ChatGPT understand shorter and more casual forms
of text, including slang and abbreviations.
5) Conversational data: Dialogue drawn from
customer support logs, public chat rooms, and
other sources, used to improve ChatGPT's ability
to engage in dialogue and understand context in a
conversational setting.
ChatGPT-4
At the time of writing, ChatGPT’s model is powered by the
GPT-4 architecture. It is the latest in a series of
GPT models and the culmination of decades of
research and innovation in language modeling.