upcarta
Mentions
Nitasha Tiku @nitashatiku · Apr 19, 2023
  • From Twitter

Ever since I read this excellent paper from @JesseDodge at @AllenInstitute @mmitchell_ai et al and saw their graph of the top websites in Google's C4, a popular dataset for training AI, I've wanted to dig further. Today's the day!

Paper Sep 30, 2021
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
by Jesse Dodge and Margaret Mitchell