Corpora
Over 40 corpora and corpus samples are included with the NLTK Corpus Distribution (750 MB). NLTK also provides corpus readers for easy access to many of these corpora from Python programs. If X is the name of a corpus, then some or all of the following methods will be defined:
>>> nltk.corpus.X.raw()           # raw data from the corpus file(s)
>>> nltk.corpus.X.words()         # a list of words and punctuation tokens
>>> nltk.corpus.X.sents()         # words() grouped into sentences
>>> nltk.corpus.X.tagged_words()  # a list of (word,tag) pairs
>>> nltk.corpus.X.tagged_sents()  # tagged_words() grouped into sentences
>>> nltk.corpus.X.parsed_sents()  # a list of parse trees
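Since only some of these methods are defined for a given corpus, it can help to check before calling them. The following is a minimal sketch, not part of NLTK: the helper name `supported_methods` and the constant `ACCESS_METHODS` are made up for illustration, and the check only tests that an attribute exists and is callable.

```python
# The six access methods listed above; a given reader defines some or all of them.
ACCESS_METHODS = ("raw", "words", "sents",
                  "tagged_words", "tagged_sents", "parsed_sents")


def supported_methods(reader):
    """Return the names in ACCESS_METHODS that `reader` defines as callables."""
    return [name for name in ACCESS_METHODS
            if callable(getattr(reader, name, None))]


# Usage, assuming NLTK and the Brown corpus data are installed:
#   from nltk.corpus import brown
#   print(supported_methods(brown))
```

A method that exists may still fail at call time if the corpus data has not been downloaded, so this is a convenience check rather than a guarantee.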
Parsed Corpora
The following corpora contain parsed text, and have a corpus reader that supports these methods: raw(), words(), sents(), tagged_words(), tagged_sents(), and parsed_sents().
- alpino: Alpino Treebank (Dutch)
- cess_cat: CESS-CAT Treebank (Catalan)
- cess_esp: CESS-ESP Treebank (Spanish)
- floresta: Floresta Treebank (Portuguese)
- treebank: Penn Treebank Corpus Sample (English)
- sinica: Sinica Treebank Corpus Sample (Chinese)
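As a rough illustration of parsed_sents(), the sketch below pulls one parse tree from a reader and reports its root label and its tokens. The helper name `parse_summary` is hypothetical, and the code assumes a recent NLTK release in which each parse tree provides label() and leaves().

```python
def parse_summary(reader, index=0):
    """Return (root label, leaf tokens) for one tree from reader.parsed_sents().

    Assumes each tree offers label() and leaves(), as nltk.Tree does
    in recent NLTK releases.
    """
    tree = reader.parsed_sents()[index]
    return tree.label(), tree.leaves()


# Usage, assuming NLTK and the Penn Treebank sample are installed:
#   from nltk.corpus import treebank
#   label, tokens = parse_summary(treebank)
#   print(label, tokens)
```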
Tagged Corpora
The following corpora contain tagged text, and have a corpus reader that supports these methods: raw(), words(), sents(), tagged_words(), and tagged_sents().
- brown: Brown Corpus
- indian: Indian Language POS-Tagged Corpus (Bangla, Hindi, Marathi, Telugu)
- mac_morpho: MacMorpho POS-Tagged Corpus (Brazilian Portuguese)
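Because tagged_words() yields (word, tag) pairs, a tag frequency table takes only a few lines. This is a small sketch; `most_common_tags` is an illustrative name, not an NLTK function, and the tag names returned depend on the tagset of the corpus in question.

```python
from collections import Counter


def most_common_tags(reader, n=5):
    """Count the tags in reader.tagged_words() and return the n most frequent."""
    counts = Counter(tag for _word, tag in reader.tagged_words())
    return counts.most_common(n)


# Usage, assuming NLTK and the Brown corpus data are installed:
#   from nltk.corpus import brown
#   print(most_common_tags(brown))
```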
Text Corpora
The following corpora contain plain text, and have a corpus reader that supports the following methods: raw() and words().
- abc: Australian Broadcasting Commission 2006: Science News, Rural News
- genesis: Genesis Corpus
- gutenberg: Project Gutenberg Selections
- inaugural: US Presidential Inaugural Address Corpus
- udhr: Universal Declaration of Human Rights Corpus
- state_union: US Presidential State of the Union Address Corpus
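With only words() available, simple vocabulary statistics are still possible. The sketch below computes the ratio of distinct lowercased tokens to total tokens. The function name `lexical_diversity` and the optional `fileid` argument are choices made for this example; it assumes that the reader's words() accepts a file identifier or None, as NLTK's plain-text readers do.

```python
def lexical_diversity(reader, fileid=None):
    """Ratio of distinct (lowercased) tokens to all tokens in reader.words()."""
    tokens = [w.lower() for w in reader.words(fileid)]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Usage, assuming NLTK and the Gutenberg selections are installed:
#   from nltk.corpus import gutenberg
#   print(lexical_diversity(gutenberg, "austen-emma.txt"))
```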
Lexicons
The following corpora contain lexical data, and have a corpus reader that supports the following methods: raw() and words().
- cmudict: Carnegie Mellon Pronouncing Dictionary
- names: Names Corpus
- propbank: Proposition Bank Corpus
- stopwords: Stopwords Corpus (Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish)
- toolbox: Toolbox Data Samples
- verbnet: VerbNet Corpus
- words: Wordlist (English)
NLTK also has an interface to WordNet together with the WordNet Similarity measures (nltk.wordnet).
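A common use of a lexicon such as stopwords is filtering a token list. The following sketch takes the stoplist as a plain argument so that it does not depend on any particular corpus; `remove_stopwords` is a hypothetical helper, and matching is case-insensitive by choice.

```python
def remove_stopwords(tokens, stoplist):
    """Return the tokens whose lowercased form is not in stoplist."""
    stop = {w.lower() for w in stoplist}
    return [t for t in tokens if t.lower() not in stop]


# Usage, assuming NLTK and the stopwords corpus are installed:
#   from nltk.corpus import stopwords
#   english = stopwords.words("english")
#   print(remove_stopwords(["The", "cat", "is", "here"], english))
```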
Categorized Corpora
The following corpora contain categorized data, and have a corpus reader that supports access by category.
- brown: Brown Corpus
- movie_reviews: Sentiment Polarity Dataset
- qc: Question Classification Corpus
- reuters: Reuters-21578 Corpus
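Access by category can be sketched as below for readers that expose categories() and accept a categories argument to words(), as the brown reader does in NLTK. Not every corpus in this list necessarily follows that interface, so treat the helper `words_per_category` (a name invented here) as an illustration only.

```python
def words_per_category(reader):
    """Map each category name to the number of tokens in that category.

    Assumes reader.categories() and reader.words(categories=...) exist.
    """
    return {cat: len(reader.words(categories=cat))
            for cat in reader.categories()}


# Usage, assuming NLTK and the Brown corpus data are installed:
#   from nltk.corpus import brown
#   for cat, n in sorted(words_per_category(brown).items()):
#       print(cat, n)
```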
Miscellaneous
- chat80: Chat-80 Database
- conll2000: CoNLL 2000 Chunking Corpus
- conll2002: CoNLL 2002 Named Entity Corpus (Dutch, Spanish)
- ieer: NIST 1999 Information Extraction: Entity Recognition Corpus
- paradigms: Paradigm Corpus
- ppattach: PP Attachment Corpus
- rte: RTE Corpus (Challenges 1, 2 and 3)
- senseval: SENSEVAL 2 Corpus
- shakespeare: Shakespeare XML Corpus Sample
- timit: TIMIT Corpus Sample
- wordnet: WordNet



