| [Reddit comments (2015-2018)](https://github.com/PolyAI-LDN/conversational-datasets/tree/master/reddit) | paper | 726,484,430 |
| S2ORC Citation pairs (Abstracts) | [paper](https://aclanthology.org/2020.acl-main.447/) | 116,288,806 |
| [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs | paper | 77,427,422 |
| PAQ (Question, Answer) pairs | paper | 64,371,441 |
| S2ORC Citation pairs (Titles) | [paper](https://aclanthology.org/2020.acl-main.447/) | 52,603,982 |
| S2ORC (Title, Abstract) | [paper](https://aclanthology.org/2020.acl-main.447/) | 41,769,185 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Body) pairs | - | 25,316,456 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title+Body, Answer) pairs | - | 21,396,559 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs | - | 21,396,559 |
| MS MARCO triplets | paper | 9,144,553 |
| GOOAQ: Open Question Answering with Diverse Answer Types | paper | 3,012,496 |
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 1,198,260 |
| Code Search | - | 1,151,414 |
| COCO Image captions | [paper](https://link.springer.com/chapter/10.1007%2F978-3-319-10602-1_48) | 828,395 |
| SPECTER citation triplets | [paper](https://doi.org/10.18653/v1/2020.acl-main.207) | 684,100 |
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 681,164 |
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 659,896 |
| SearchQA | paper | 582,261 |
| Eli5 | [paper](https://doi.org/10.18653/v1/p19-1346) | 325,475 |
| Flickr 30k | paper | 317,695 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles) | - | 304,525 |
| AllNLI (SNLI and MultiNLI) | [paper SNLI](https://doi.org/10.18653/v1/d15-1075), [paper MultiNLI](https://doi.org/10.18653/v1/n18-1101) | 277,230 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (bodies) | - | 250,519 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles+bodies) | - | 250,460 |
| [Sentence Compression](https://github.com/google-research-datasets/sentence-compression) | [paper](https://www.aclweb.org/anthology/D13-1155/) | 180,000 |
| Wikihow | paper | 128,542 |
| Altlex | [paper](https://aclanthology.org/P16-1135.pdf) | 112,696 |
| [Quora Question Triplets](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) | - | 103,663 |
| Simple Wikipedia | [paper](https://www.aclweb.org/anthology/P11-2117/) | 102,225 |
| Natural Questions (NQ) | paper | 100,231 |
| [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) | [paper](https://aclanthology.org/P18-2124.pdf) | 87,599 |
| TriviaQA | - | 73,346 |
| Total | | 1,170,060,424 |