Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Paper
Sep 30, 2021
#ArtificialIntelligence #ComputerScience

Jesse Dodge

@JesseDodge

(Author)

Margaret Mitchell

@mmitchell_ai

(Author)

Read on arxiv.org

Large language models have led to remarkable progress on many NLP tasks, and researchers are turning to ever-larger text corpora to train them. Some of the largest corpora available...

Mentions

Nitasha Tiku @nitashatiku · Apr 19, 2023

Post
From Twitter

Ever since I read this excellent paper from @JesseDodge at @AllenInstitute @mmitchell_ai et al and saw their graph of the top websites in Google's C4 (a popular dataset for training AI), I've wanted to dig further. Today's the day!