Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

  • Paper
  • Sep 30, 2021
  • #ArtificialIntelligence #ComputerScience
Jesse Dodge
@JesseDodge
(Author)
Margaret Mitchell
@mmitchell_ai
(Author)
arxiv.org

Large language models have led to remarkable progress on many NLP tasks, and researchers are turning to ever-larger text corpora to train them. Some of the largest corpora available are made by scraping significant portions of the internet, and are frequently introduced with only minimal documentation. In this work we provide some of the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl. We begin by investigating where the data came from, and find a significant amount of text from unexpected sources like patents and US military websites. Then we explore the content of the text itself, and find machine-generated text (e.g., from machine translation systems) and evaluation examples from other benchmark NLP datasets. To understand the impact of the filters applied to create this dataset, we evaluate the text that was removed, and show that blocklist filtering disproportionately removes text from and about minority individuals. Finally, we conclude with some recommendations for how to create and document web-scale datasets from a scrape of the internet.

Mentions
Nitasha Tiku @nitashatiku · Apr 19, 2023
  • Post
  • From Twitter
Ever since I read this excellent paper from @JesseDodge at @AllenInstitute @mmitchell_ai et al and saw their graph of the top websites in Google's C4 (a popular dataset for training AI), I've wanted to dig further. Today's the day!