Development
- nltk-devel - the mailing list where development plans are discussed
- Developers Guide - how we collaborate
- Eclipse - how to set up an IDE for NLTK development
If you're interested in working on any of the following tasks, please let us know, and we'll help to get you started.
Or if you'd like to contribute in other ways, take a look at our Projects page, or post a message to nltk-devel describing your interests and we'll suggest other tasks.
| Task | Priority | Time Required | Instructions | Developer |
|---|---|---|---|---|
| WordNet Browser | High | Weeks | Develop a standalone WordNet browser. Cf. wntw and webwn. | Jussi Salmela |
| Dependency parser | High | Days-Weeks | Add a dependency parser to NLTK. | Jason Narad |
| Web demos | High | Days-Weeks | Deliver NLTK graphical demos via a browser plugin or web services. | Adnan Balagam |
| Corpora | | | | |
| Non-English Corpora | High | Hours | We would like to add a variety of annotated corpora for languages other than English to the NLTK corpus package. If you have an appropriate corpus, which is freely redistributable, please let us know. | |
| Packaging | | | | |
| OS X Packaging | High | Hours | We would like NLTK to be available on OS X as a set of two native packages -- one containing the NLTK source, and the other containing the corpora. The corpora should be installed to a standard location. | Joshua Ritterman |
| Egg Packaging | Medium | Hours | We would like NLTK to be available via setuptools as "eggs" -- one containing the NLTK source, and the other containing the corpora. | (none) |
| Debian Packaging | Medium | Hours | We would like NLTK to be available as Debian packages -- one containing the NLTK source, and the other containing the corpora. | (none) |
| Testing | | | | |
| BuildBot | Medium | Days | BuildBot is a system that automates the testing of software projects on multiple architectures. Set up a build bot (and a couple of clients) to automate NLTK regression testing. | (none) |
| Regression Tests | Medium | Hours-Weeks | Using doctest, write regression test cases for some of the NLTK modules that are currently lacking regression testing. | (none) |
| Audit API Docs | Low | Hours-Weeks | Check the epydoc API documentation strings to make sure they're still up to date and accurate, and fill in missing documentation strings where necessary. | |
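As a rough illustration of the "Regression Tests" task, here is a minimal sketch of a doctest-based regression test. The `bigrams` function and the `run_docstring_tests` helper are hypothetical names invented for this example and are not part of NLTK; only the standard-library `doctest` module is assumed.

```python
import doctest


def bigrams(tokens):
    """Return the list of adjacent token pairs in ``tokens``.

    (Hypothetical example function, not an NLTK API.)

    >>> bigrams(['the', 'cat', 'sat'])
    [('the', 'cat'), ('cat', 'sat')]
    >>> bigrams([])
    []
    """
    return list(zip(tokens, tokens[1:]))


def run_docstring_tests(func):
    """Run the doctest examples found in ``func``'s docstring.

    Returns a (failed, attempted) result tuple from the doctest runner.
    """
    parser = doctest.DocTestParser()
    # The globals dict makes the function visible to its own examples.
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


if __name__ == "__main__":
    print(run_docstring_tests(bigrams))
```

Keeping the examples in the docstring means the same text serves as documentation and as a regression check; a whole module can usually be checked in one go with `python -m doctest module.py`.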
- Known bugs
- some of the WordNet similarity functions
- certain re-entrant structures in featstruct
- when the LHS of an edge contains an ApplicationExpression, variable values in the RHS bindings aren't copied over when the fundamental rule applies
- HMM tagger tags everything as ' in some situations
NLTK-Lite Version 0.9.3 (April)
- Provide access to WordNet senses, so that we can navigate from a synset to its component word senses (not words), as requested by Tim Mahrt
NLTK-Lite Version 0.9.1 (January)
- good off-the-shelf tokenizer, tagger and chunker
- interface to WordNet index
- docstrings up-to-date and epydoc errors fixed
- house coding style (PEP8)
- replace ConditionalFreqDist with defaultdict(FreqDist)
- change FreqDist to wrap a dictionary and pass through: getitem, iter, contains, len
- add plot method to FreqDist
- add verbnet lexicon and corpus reader
- add doc_contrib to top level Makefile
- add OLAC records for each corpus
- epydoc display of docstrings in treetransforms.py
- doctest code in function docstrings?
- improve installation instructions on CD-ROM
- web-as-corpus interface with caching
- access to semcor frequency data in wordnet api
- nltk.tree.Tree documentation (including node attribute)
- movie reviews corpus and reader
- Reuters corpus and reader
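The two FreqDist items in the list above (wrapping a dictionary with pass-through methods, and replacing ConditionalFreqDist with `defaultdict(FreqDist)`) could look roughly like the following sketch. `SimpleFreqDist`, its `inc` method, and `conditional_counts` are hypothetical stand-ins written for this illustration; they are not the actual NLTK classes or their final interface.

```python
from collections import defaultdict


class SimpleFreqDist:
    """Hypothetical stand-in for FreqDist: wraps a plain dict of counts."""

    def __init__(self, samples=()):
        self._counts = {}
        for sample in samples:
            self.inc(sample)

    def inc(self, sample, count=1):
        """Add ``count`` occurrences of ``sample``."""
        self._counts[sample] = self._counts.get(sample, 0) + count

    # Pass-through methods named in the task list.
    def __getitem__(self, sample):
        # Unseen samples report a count of 0 instead of raising KeyError.
        return self._counts.get(sample, 0)

    def __iter__(self):
        return iter(self._counts)

    def __contains__(self, sample):
        return sample in self._counts

    def __len__(self):
        return len(self._counts)


def conditional_counts(pairs):
    """Count samples per condition using defaultdict(SimpleFreqDist)."""
    cfd = defaultdict(SimpleFreqDist)
    for condition, sample in pairs:
        cfd[condition].inc(sample)
    return cfd


if __name__ == "__main__":
    cfd = conditional_counts([("news", "the"), ("news", "the"), ("fiction", "a")])
    print(cfd["news"]["the"], cfd["fiction"]["a"])
```

One consequence of the `defaultdict` approach is that merely looking up an unseen condition creates an empty distribution for it, which a dedicated ConditionalFreqDist class would be free to avoid.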
NLTK-Lite Version 1.0 / NLTK Version 2.0 (early 2008)
Once it reaches version 1.0, NLTK-Lite's name will be changed back to NLTK and assigned version 2.0. This will coincide with the publication of the NLTK Book. From this point onwards, names and interfaces will be frozen for at least a year. Subsequent changes will be conservative and will support backwards compatibility wherever possible.
Unscheduled tasks
- final naming
- tokenize -> token (or something else?)
- explain reason for name change from old NLTK
- add OLAC support: read and write an OLAC static repository
- Switch to setuptools
- get rpm build working on mac
- material on writing (adapting?) a corpus reader
- Simple n-gram language modeling, interpolated and backoff language models
- Text Tiling
- WordNet similarity: Gloss Vector similarity
- efficiency improvements for chunk parser (slower than chart parser for some grammars, incl toolbox parser grammar)
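For the "simple n-gram language modeling" item above, a minimal sketch of a linearly interpolated bigram model is shown here. The function names and the default weight `lam=0.7` are assumptions made for illustration only; this is not an existing NLTK interface, and a true backoff model would additionally need discounting.

```python
from collections import Counter


def train_bigram_model(tokens):
    """Collect unigram and bigram counts from a token sequence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def interpolated_prob(word, prev, unigrams, bigrams, lam=0.7):
    """Estimate P(word | prev) = lam * P_bigram + (1 - lam) * P_unigram.

    Both component estimates are maximum-likelihood relative frequencies.
    If ``prev`` was never seen as a context, fall back to the unigram estimate.
    """
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total if total else 0.0
    # Number of bigrams that start with ``prev``.
    context = sum(c for (first, _), c in bigrams.items() if first == prev)
    if not context:
        return p_uni
    p_bi = bigrams[(prev, word)] / context
    return lam * p_bi + (1 - lam) * p_uni


if __name__ == "__main__":
    uni, bi = train_bigram_model("a b a b a c".split())
    print(interpolated_prob("b", "a", uni, bi, lam=0.5))
```

Because both components are proper distributions over the vocabulary, the interpolated estimate also sums to one for any seen context.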
Software
- Marshalling
- integrate more student projects (incl TAG, textcat, paradigms)
- add sequence values to FeatureStructure
- decision list classifier
- sequence classifier
- collocation support (chi-sq, PMI, spearman rank correlation, etc)
- new material on data modelling (interlinear text, paradigms)
- maxent package and tutorial
- lexical semantics
- information extraction (e.g. from biomedical literature)
- regular expressions for extracting temporal expressions
- terminological difference in chart parsing with Jurafsky and Martin textbook
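To make the collocation item above more concrete, here is a minimal sketch of scoring adjacent word pairs by pointwise mutual information (PMI), one of the measures listed. The `pmi_scores` function and its `min_count` parameter are hypothetical names for this illustration and do not describe an existing NLTK API.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=1):
    """Score each adjacent word pair by PMI = log2(P(x, y) / (P(x) * P(y))).

    Probabilities are simple relative frequencies; pairs seen fewer than
    ``min_count`` times are skipped, since PMI is unreliable for rare pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    scores = pmi_scores(["new", "york", "is", "new", "york"])
    for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 3))
```

Other association measures such as chi-square could be added as alternative scoring functions over the same unigram and bigram counts.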
Corpora
- port more NLTK corpus readers
- more LDC corpus samples (Fisher?)
- add Li & Roth question classification data
- SRL corpus and reader
- more tree data
- MUC 6 or 7 data
- MWE corpus (Nicholson)
- Mawu corpus sample
- Yemba lexicon
Housekeeping
- get epydoc docstrings to compile cleanly
- Unicode compliance
- check graphical demos on Windows machines (add cf.mainloop()?)



