Datasets


Reuters News dataset

(Older) purely classification-based dataset with text from the newswire. Commonly used in tutorials.

Penn Treebank

Used for next word prediction or next character prediction.
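As a rough illustration of what "next word prediction" means, the sketch below counts which word most often follows each word and predicts that one. It is a toy bigram counter, not how models are actually trained on Penn Treebank: the corpus string and the function names `train_bigram_model` and `predict_next` are assumptions made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count, for each token, how often each following token occurs."""
    following = defaultdict(Counter)
    for current_token, next_token in zip(tokens, tokens[1:]):
        following[current_token][next_token] += 1
    return following


def predict_next(model, token):
    """Return the most frequent follower of `token`, or None if it was never seen with a follower."""
    if token not in model:
        return None
    return model[token].most_common(1)[0][0]


# Toy corpus standing in for real treebank text (hypothetical data).
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" once
```

Passing a plain string instead of a list of words gives the next-character variant of the same idea, since the function only pairs each item with its successor.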




UCI's Spambase

(Older) classic spam email dataset from the famous UCI Machine Learning Repository. Due to details of how the dataset was curated, this can be an intere...

Broadcast News

Large text dataset, classically used for next word prediction.


The Stanford Question Answering Dataset

Broadly useful question answering and reading comprehension dataset, where every answer to a question is posed as...

Billion Words dataset

A large general-purpose language modeling dataset. Often used to train distributed word representations such as word2vec.

Common Crawl

Petabyte-scale crawl of the web, most frequently used for learning word embeddings. Available for free from Amazon S3. Can also be useful as a network dat...

Text Classification Datasets

From Zhang et al., 2015. An extensive set of eight datasets for text classification. These are the benchmark for new text classification baselines. Sam...

20 newsgroups

Classification task, mapping word occurrences to newsgroup ID. One of the classic datasets for text classification, usually useful as a benchmark for eit...