1   Introduction to Language Processing and Python

1.1   The Language Challenge

Today, people from all walks of life — including professionals, students, and the general population — are confronted by unprecedented volumes of information, the vast bulk of which is stored as unstructured text. In 2003, it was estimated that the annual production of books amounted to 8 Terabytes. (A Terabyte is 1,000 Gigabytes, i.e., equivalent to 1,000 pickup trucks filled with books.) It would take a human being about five years to read the new scientific material that is produced every 24 hours. Although these estimates are based on printed materials, increasingly the information is also available electronically. Indeed, there has been an explosion of text and multimedia content on the World Wide Web. For many people, a large and growing fraction of work and leisure time is spent navigating and accessing this universe of information.

The presence of so much text in electronic form is a huge challenge to natural language processing (NLP). Arguably, the only way for humans to cope with the information explosion is to exploit computational techniques that can sift through huge bodies of text.

Although existing search engines have been crucial to the growth and popularity of the Web, humans require skill, knowledge, and some luck, to extract answers to such questions as What tourist sites can I visit between Philadelphia and Pittsburgh on a limited budget? What do expert critics say about digital SLR cameras? What predictions about the steel market were made by credible commentators in the past week? Getting a computer to answer them automatically is a realistic long-term goal, but would involve a range of language processing tasks, including information extraction, inference, and summarization, and would need to be carried out on a scale and with a level of robustness that is still beyond our current capabilities.

1.1.1   The Richness of Language

Language is the chief manifestation of human intelligence. Through language we express basic needs and lofty aspirations, technical know-how and flights of fantasy. Ideas are shared over great separations of distance and time. The following samples from English illustrate the richness of language:

(1)

a. Overhead the day drives level and grey, hiding the sun by a flight of grey spears. (William Faulkner, As I Lay Dying, 1930)

b. When using the toaster please ensure that the exhaust fan is turned on. (sign in dormitory kitchen)

c. Amiodarone weakly inhibited CYP2C9, CYP2D6, and CYP3A4-mediated activities with Ki values of 45.1-271.6 μM (Medline, PMID: 10718780)

d. Iraqi Head Seeks Arms (spoof news headline)

e. The earnest prayer of a righteous man has great power and wonderful results. (James 5:16b)

f. Twas brillig, and the slithy toves did gyre and gimble in the wabe (Lewis Carroll, Jabberwocky, 1872)

g. There are two ways to do this, AFAIK :smile: (internet discussion archive)

Thanks to this richness, the study of language is part of many disciplines outside of linguistics, including translation, literary criticism, philosophy, anthropology and psychology. Many less obvious disciplines investigate language use, such as law, hermeneutics, forensics, telephony, pedagogy, archaeology, cryptanalysis and speech pathology. Each applies distinct methodologies to gather observations, develop theories and test hypotheses. Yet all serve to deepen our understanding of language and of the intellect that is manifested in language.

The importance of language to science and the arts is matched in significance by the cultural treasure embodied in language. Each of the world's ~7,000 human languages is rich in unique respects, in its oral histories and creation legends, down to its grammatical constructions and its very words and their nuances of meaning. Threatened remnant cultures have words to distinguish plant subspecies according to therapeutic uses that are unknown to science. Languages evolve over time as they come into contact with each other and they provide a unique window onto human pre-history. Technological change gives rise to new words like blog and new morphemes like e- and cyber-. In many parts of the world, small linguistic variations from one town to the next add up to a completely different language in the space of a half-hour drive. For its breathtaking complexity and diversity, human language is like a colorful tapestry stretching through time and space.

1.1.2   The Promise of NLP

As we have seen, NLP is important for scientific, economic, social, and cultural reasons. NLP is experiencing rapid growth as its theories and methods are deployed in a variety of new language technologies. For this reason it is important for a wide range of people to have a working knowledge of NLP. Within industry, this includes people in human-computer interaction, business information analysis, and Web software development. Within academia, this includes people in areas from humanities computing and corpus linguistics through to computer science and artificial intelligence. We hope that you, a member of this diverse audience reading these materials, will come to appreciate the workings of this rapidly growing field of NLP and will apply its techniques in the solution of real-world problems.

This book presents a carefully-balanced selection of theoretical foundations and practical applications, and equips readers to work with large datasets, to create robust models of linguistic phenomena, and to deploy them in working language technologies. By integrating all of this into the Natural Language Toolkit (NLTK), we hope this book opens up the exciting endeavor of practical natural language processing to a broader audience than ever before.

The rest of this chapter provides a non-technical overview of Python and will cover the basic programming knowledge needed for the rest of the chapters in Part 1. It contains many examples and exercises; there is no better way to learn to program than to dive in and try these yourself. Before you know it you will be programming!

The goal of this chapter is to answer the following questions:

  1. what can we achieve by combining simple programming techniques with large quantities of text?
  2. how can we automatically extract representative words from a large text?
  3. is the Python programming language suitable for such work?

Along the way you will be introduced to a selection of elementary concepts in linguistics and computer science. However, this is deliberately not systematic, but only a taster, intended to give you the flavour of what will come later, and motivate you to work through the more systematic material that will follow.

1.2   Computing with Language: Texts and Words

As we will see, it is easy to get our hands on large quantities of text. What can we do with it, assuming we can write some simple programs? Here we will treat the text as data for the programs we write, programs that manipulate and analyze it in a variety of interesting ways. The first step is to get started with the Python interpreter.

1.2.1   Getting Started

One of the friendly things about Python is that it allows you to type directly into the interactive interpreter — the program that will be running your Python programs. You can run the Python interpreter using a simple graphical interface called the Interactive DeveLopment Environment (IDLE). On a Mac you can find this under Applications→MacPython, and on Windows under All Programs→Python. Under Unix you can run Python from the shell by typing python. The interpreter will print a blurb about your Python version; simply check that you are running Python 2.4 or greater (here it is 2.5):

 
Python 2.5 (r25:51918, Sep 19 2006, 08:49:13)
[GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

The >>> prompt indicates that the Python interpreter is now waiting for input. When copying examples from this book be sure not to type in the >>> prompt yourself. Now, let's begin by using Python as a calculator:

 
>>> 1 + 5 * 2 - 3
8
>>>

Once the interpreter has finished calculating the answer and displaying it, the prompt reappears. This means the Python interpreter is waiting for another instruction.

Try a few more expressions of your own. You can use asterisk (*) for multiplication and slash (/) for division, and parentheses for bracketing expressions. One strange thing you might come across is that division doesn't always behave as you might expect; it does integer division or floating point division depending on how you specify the inputs:

 
>>> 3/3
1
>>> 1/3
0
>>> 1.0/3.0
0.33333333333333331
>>>

These examples demonstrate how you can work interactively with the interpreter, allowing you to experiment and explore. As you will see later, your intuitions about numerical expressions will be useful for manipulating language data in Python.
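The integer results shown above belong to the Python 2 interpreter used in this chapter (in Python 3 the / operator always performs floating point division). As a small sketch, the floor division operator // makes the intent explicit, and writing one operand as a float requests floating point division. The variable names below are made up for illustration.

```python
# Floor division: the fractional part is discarded.
whole = 7 // 2

# The remainder that floor division leaves behind.
left_over = 7 % 2

# Writing one operand as a float gives floating point division.
exact = 7 / 2.0

print(whole)      # 3
print(left_over)  # 1
print(exact)      # 3.5
```

Spelling out // when you want a whole-number result means the expression should behave the same way whichever interpreter version you happen to be running.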

Now let's try a nonsensical expression to see how the interpreter handles it:

 
>>> 1 +
Traceback (most recent call last):
  File "<stdin>", line 1
    1 +
      ^
SyntaxError: invalid syntax
>>>

Here we have produced a syntax error. It doesn't make sense to end an instruction with a plus sign. The Python interpreter indicates the line where the problem occurred.

1.2.2   Searching Text

Now that we can use the Python interpreter, let's see how we can harness its power to process text. The first step is to type a line of magic at the Python prompt, telling the interpreter to load some texts for us to explore: from nltk.book import * (i.e. import all names from NLTK's book module). After printing a welcome message, it loads the text of several books, including Moby Dick. Type the following, taking care to get spelling and punctuation exactly right:

 
>>> from nltk.book import *
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text2
<Text: Sense and Sensibility by Jane Austen 1811>
>>>

We can examine the contents of a text in a variety of ways. A concordance view shows us a given word in its context. Here we look up the word monstrous. Try searching for other words; you can use the up-arrow key to access the previous command and modify the word being searched.

 
>>> text1.concordance("monstrous")
mong the former , one was of a most monstrous size . ... This came towards us , o
ION OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have re
all over with a heathenish array of monstrous clubs and spears . Some were thickl
ed as you gazed , and wondered what monstrous cannibal and savage could ever have
 that has survived the flood ; most monstrous and most mountainous ! That Himmale
 they might scout at Moby Dick as a monstrous fable , or still worse and more det
ath of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere lo
ling Scenes . In connexion with the monstrous pictures of whales , I am strongly
>>>

You can now try concordance searches on some of the other texts we have included. For example, search Sense and Sensibility for the word affection, using text2.concordance("affection"). Search the book of Genesis to find out how long some people lived, using: text3.concordance("lived"). You could look at text4, the US Presidential Inaugural Addresses to see examples of English dating back to 1789, and search for words like nation, terror, god. We've also included text5, the NPS Chat Corpus: search this for unconventional words like im, ur, lol. (Note that this corpus is uncensored!)
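To get a feel for what a concordance search involves, here is a minimal sketch written in plain Python. It is not NLTK's implementation: the function name concordance_lines, the window parameter, and the short token list are all assumptions made up for this illustration.

```python
def concordance_lines(tokens, target, window=3):
    """Return each occurrence of target with up to `window` tokens of context per side."""
    lines = []
    for position, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, position - window):position]
            right = tokens[position + 1:position + 1 + window]
            lines.append(" ".join(left + [token] + right))
    return lines

# A made-up token list standing in for a loaded text.
tokens = ["the", "monstrous", "whale", "rose", "and", "the",
          "crew", "saw", "a", "monstrous", "wave", "."]

for line in concordance_lines(tokens, "monstrous"):
    print(line)
```

Each printed line shows one occurrence of the target word together with its neighbours, which is the essential idea behind the concordance view.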

Once you've spent some time examining these texts, we hope you have a new sense of the richness and diversity of language. In the next chapter you will learn how to access a broader range of text, including text in languages other than English.

If we can find words in a text, we can also take note of their position within the text. We produce a dispersion plot, where each bar represents an instance of a word and each row represents the entire text. In Figure 1.1 we see some striking patterns of word usage over the last 220 years. You can produce this plot as shown below (so long as you have Numpy and Pylab installed). You might like to try different words, and different texts. As before, take care to get the quotes, commas, brackets and parentheses exactly right.

 
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
>>>

Figure 1.1: Lexical Dispersion Plot for Words in Presidential Inaugural Addresses
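The raw material for a plot like this is simply the list of positions at which each word occurs. The sketch below computes those positions for a made-up token list; the helper name offsets is hypothetical, and no plotting library is assumed.

```python
def offsets(tokens, target):
    """Return the positions at which target occurs in tokens."""
    return [position for position, token in enumerate(tokens) if token == target]

# A made-up token list standing in for a loaded text.
tokens = ["we", "the", "citizens", "value", "freedom", ",",
          "and", "freedom", "asks", "duties", "of", "citizens", "."]

for word in ["citizens", "freedom", "duties"]:
    print(word, offsets(tokens, word))
```

Drawing one mark per position, with one row per word, would give a small dispersion plot of this toy text.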

A concordance permits us to see words in context, e.g. we saw that monstrous appeared in the context the monstrous pictures. What other words appear in the same contexts that monstrous appears in? We can find out as follows:

 
>>> text1.similar("monstrous")
subtly impalpable curious abundant perilous trustworthy untoward
singular imperial few maddens loving mystifying christian exasperate
puzzled fearless uncommon domineering candid
>>> text2.similar("monstrous")
great very so good vast a exceedingly heartily amazingly as sweet
remarkably extremely
>>>

Observe that we get different results for different books.
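One simple way to look for words that share contexts is to record the word before and the word after each occurrence, then compare those pairs across the vocabulary. The sketch below does this for a made-up token list. It is only an illustration of the idea and should not be read as a description of how NLTK computes its results; the function names are hypothetical.

```python
def contexts(tokens, target):
    """Collect the (previous word, next word) pairs that surround target."""
    found = set()
    for i in range(1, len(tokens) - 1):
        if tokens[i] == target:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found

def similar_words(tokens, target):
    """Return the other words that occur in at least one context of target."""
    target_contexts = contexts(tokens, target)
    vocabulary = set(tokens) - set([target])
    return sorted(w for w in vocabulary if contexts(tokens, w) & target_contexts)

# A made-up token list standing in for a loaded text.
tokens = ["a", "monstrous", "size", ".", "a", "curious", "size", ".",
          "the", "monstrous", "fable", ".", "the", "singular", "fable", "."]

print(similar_words(tokens, "monstrous"))
```

Here curious and singular are reported because each appears between the same pair of neighbours as one occurrence of monstrous.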

Now, just for fun, let's try generating some random text in the various styles we have just seen. To do this, we type the name of the text followed by the "generate" function, e.g. text3.generate():

 
>>> text3.generate()
In the beginning of his brother is a hairy man , whose top may reach
unto heaven ; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month , upon the earth . So shall thy
wages be ? And they made their father ; and Isaac was old , and kissed
him : and Laban with his cattle in the midst of the hands of Esau thy
first born , and Phichol the chief butler unto his son Isaac , she
>>>

Note that the first time you run this, it is slow because it gathers statistics about word sequences. Each time you run it, you will get different output text. Now try generating random text in the style of an inaugural address or an internet chat room.
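To make the idea of "statistics about word sequences" concrete, here is a sketch of one very simple generator: it records which words follow which, then walks from word to word at random. This is an assumption-laden toy and may differ from what NLTK does internally; the names build_successors and generate, and the seed parameter, are made up for illustration.

```python
import random

def build_successors(tokens):
    """Map each word to the list of words that follow it somewhere in tokens."""
    successors = {}
    for current, following in zip(tokens, tokens[1:]):
        successors.setdefault(current, []).append(following)
    return successors

def generate(tokens, length=10, seed=0):
    """Produce up to `length` words by repeatedly picking a recorded successor."""
    rng = random.Random(seed)   # fixed seed so that runs are repeatable
    successors = build_successors(tokens)
    word = tokens[0]
    output = [word]
    while len(output) < length and word in successors:
        word = rng.choice(successors[word])
        output.append(word)
    return output

# A made-up token list standing in for a loaded text.
tokens = ["In", "the", "beginning", "God", "created", "the",
          "heaven", "and", "the", "earth", "."]

print(" ".join(generate(tokens)))
```

Every adjacent pair of words in the output occurred somewhere in the input, which is why such text looks locally plausible even though it makes no sense overall.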

Note

When text is printed, punctuation has been split off from the previous word. Although this is not correct formatting for English text, we do this to make it clear that punctuation does not belong to the word. This is called "tokenization", and you will learn about it in Chapter 2.
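As a rough preview, the sketch below splits punctuation away from words using a regular expression from Python's standard re module. It is a deliberately crude stand-in, not the tokenizer that NLTK provides, and the function name simple_tokenize is made up for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("So shall thy wages be? And they made their father;"))
```

The question mark and semicolon come out as separate items, just as in the printed texts above.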

1.2.3   Counting Vocabulary

The most obvious fact about texts that emerges from the previous section is that they differ in the vocabulary they use. In this section we will see how to use the computer to count the words in a text, in a variety of useful ways. As before you will jump right in and experiment with the Python interpreter, even though you may not have studied Python systematically yet.

Let's begin by finding out the length of a text from start to finish, in terms of the words and punctuation symbols that appear. We'll use the text of Moby Dick again:

 
>>> len(text1)
260819
>>>

That's a quarter of a million words long! But how many distinct words does this text contain? To work this out in Python we have to pose the question slightly differently. The vocabulary of a text is just the set of words that it uses, and in Python we can list the vocabulary of text3 with the command: set(text3). This will produce many screens of words. Now try the following:

 
>>> sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)',
'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech',
'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', ...]
>>> len(set(text3))
2789
>>> len(text3) / len(set(text3))
16
>>>

Here we can see a sorted list of vocabulary items, beginning with various punctuation symbols and continuing with words starting with A. Words starting with a will appear much later, after the last "Z" word, Zoroaster. We discover the size of the vocabulary indirectly, by asking for the length of the set. Finally, we can calculate a measure of the lexical richness of the text and learn that each word is used 16 times on average.
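The same steps can be tried on a list small enough to check by eye. The token list below is made up for illustration; note how the capitalized The sorts ahead of every lowercase word, and how float() is used so that the ratio keeps its fractional part.

```python
# A made-up token list standing in for a loaded text.
tokens = ["The", "whale", "and", "the", "ship", "and", "the", "sea", "."]

vocabulary = sorted(set(tokens))   # distinct items, in sorted order
print(vocabulary)
print(len(tokens))                 # number of tokens
print(len(vocabulary))             # number of distinct items

# Average number of uses per distinct item.
richness = float(len(tokens)) / len(vocabulary)
print(richness)
```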

Next, let's focus in on particular words. We can count how often a word occurs in a text, and compute what percentage of the text is taken up by a specific word:

 
>>> text3.count("smote")
5
>>> 100.0 * text4.count('a') / len(text4)
1.4587672822333748
>>>
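Here is the same percentage calculation on a made-up list of eight tokens, where the answer is easy to verify by hand. Multiplying by 100.0 rather than 100 keeps the division in floating point.

```python
# A made-up token list standing in for a loaded text.
tokens = ["the", "whale", "saw", "the", "ship", "and", "dived", "."]

occurrences = tokens.count("the")            # how often the word occurs
share = 100.0 * occurrences / len(tokens)    # as a percentage of all tokens

print(occurrences)  # 2
print(share)        # 25.0
```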

You might like to repeat such calculations on several texts, but it is tedious to keep retyping them for different texts. Instead, we can come up with our own name for this task, e.g. "score", and define a function that can be re-used as often as we like:

 
>>> def score(text):
...     return len(text) / len(set(text))
...
>>> score(text3)
16
>>> score(text4)
4
>>>

Note

The Python interpreter changes the prompt from >>> to ... after encountering the colon at the end of the first line. The ... prompt indicates that Python expects an indented code block to appear next. It is up to you to do the indentation, by typing four spaces. To finish the indented block just enter a blank line.

Notice that we used the score function by typing its name, followed by an open parenthesis, the name of the text, then a close parenthesis. This is just what we did for the len and set functions earlier. These parentheses will show up often: their role is to separate the name of a task, such as score, from the data that the task is to be performed on, such as text3. Functions are an advanced concept in programming and we only mention them at the outset to give newcomers a sense of the power and flexibility of programming. We'll come back to them towards the end of this chapter. Later we'll see how to use such functions when tabulating data: by performing the same task many times over using functions, we can easily build up tables like Table 1.1.

Table 1.1: Lexical Diversity of Various Genres in the Brown Corpus

Genre               Token Count   Type Count   Score
skill and hobbies         82345        11935     6.9
humor                     21695         5017     4.3
fiction: science          14470         3233     4.5
press: reportage         100554        14394     7.0
fiction: romance          70022         8452     8.3
religion                  39399         6373     6.2
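A table of this shape can be built by calling one function on each text in turn. The sketch below uses two made-up miniature "texts" rather than the Brown Corpus, and its score function divides in floating point so that the result keeps one decimal place as in the table; both choices are assumptions made for illustration.

```python
def score(text):
    """Average number of times each distinct item is used."""
    return float(len(text)) / len(set(text))

# Made-up miniature texts; real genres would hold thousands of tokens.
samples = {
    "greeting": ["hello", "hello", "hello", "world"],
    "sentence": ["Call", "me", "Ishmael", "."],
}

print("%-10s %6s %6s %6s" % ("Name", "Tokens", "Types", "Score"))
for name in sorted(samples):
    text = samples[name]
    print("%-10s %6d %6d %6.1f" % (name, len(text), len(set(text)), score(text)))
```

Adding another row only requires adding another entry to samples; the counting and formatting are reused unchanged.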

1.2.4   Exercises

  1. ☼ How many words are there in text2? How many distinct words are there?
  2. ☼ Compare the lexical diversity scores for humor and romance fiction in Table 1.1. Which genre is more lexically diverse?
  3. ☺ Compare the lexical dispersion plot with Google Trends, which shows the frequency with which a term has been referenced in news reports or been used in search terms over time.
  4. ☼ Produce a dispersion plot of the four main protagonists in Sense and Sensibility: Elinor, Marianne, Edward, Willoughby. What can you observe about the different roles played by the males and females in this novel? Can you identify the couples?
  5. ☼ According to Strunk and White's Elements of Style, the word however, used at the start of a sentence, means "in whatever way" or "to whatever extent", and not "nevertheless". They give this example of correct usage: However you advise him, he will probably do as he thinks best. (http://www.bartleby.com/141/strunk3.html) Use the concordance tool to study actual usage of this word in the various texts we have been considering.
  6. ◑ Consider the following Python expression: len(set(text4)). State the purpose of this expression. Describe the two steps involved in performing this computation.
  7. ◑ How many times does the word lol appear in text5? How much is this as a percentage of the total number of words in this text?
  8. ◑ Pick a pair of texts and study the differences between them, in terms of vocabulary, vocabulary richness, genre, etc. Can you find pairs of words which have quite different meanings across the two texts, such as monstrous in Moby Dick and in Sense and Sensibility?
  9. ◑ Compare the frequency of use of the modal verbs will and could in text2 (romance fiction) and text7 (news). Which modal verb is more common in which genre?

We will leave our quest for characteristic words of a text, and explore a rather different approach that uses the ratio of word frequencies.

1.3   A Closer Look at Python: Texts as Lists of Words

You've seen some important building blocks of the Python programming language. Here we take a break from language processing to take a closer look at Python.

1.3.1   Lists

What is a text? At one level, it is a sequence of symbols on a page, such as this one. At another level, it is a sequence of chapters, made up of a sequence of sections, where each section is a sequence of paragraphs, and so on. However, for our purposes, we will think of a text as nothing more than a sequence of words and punctuation. Here's how we represent text in Python, in this case the opening sentence of Moby Dick:

 
>>> sent1 = ["Call", "me", "Ishmael", "."]
>>>

After the prompt we've given a name we made up, sent1, followed by the equals sign, and then some quoted words, separated with commas, and surrounded with brackets. This bracketed material is known as a list in Python: it is how we store a text. Each individual word must be quoted, using double or single quotes, like "this" or like 'this'. (When using single quotes, use the close quote character at the start and the end.) Here, we've given this list the name sent1. We can inspect it by typing the name, and we can ask for its length:

 
>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> len(sent1)
4
>>> score(sent1)
1
>>>

We can even apply our own "score" function to it. Some more lists have been defined for you, one for the opening sentence of each of our texts, sent2 through sent8. We inspect two of them here; you can see the rest for yourself using the Python interpreter.

 
>>> sent2
['The', 'family', 'of', 'Dashwood', 'had', 'long',
'been', 'settled', 'in', 'Sussex', '.']
>>> sent3
['In', 'the', 'beginning', 'God', 'created', 'the',
'heaven', 'and', 'the', 'earth', '.']

You can type these in or else make up a few sentences of your own. Now let's repeat some of the other Python operations we saw above in Section 1.2:

 
>>> sorted(sent3)
['.', 'God', 'In', 'and', 'beginning', 'created', 'earth',
'heaven', 'the', 'the', 'the']
>>> len(set(sent3))
9
>>> sent3.count("the")
3
>>>

We can also do arithmetic operations with lists in Python. Multiplying a list by a number, e.g. sent1 * 2, creates a longer list containing multiple copies of the items in the original list. Adding two lists, e.g. sent4 + sent1, creates a new list containing everything from the first list, followed by everything from the second list:

 
>>> sent1 * 2
['Call', 'me', 'Ishmael', '.', 'Call', 'me', 'Ishmael', '.']
>>> sent4 + sent1
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the',
'House', 'of', 'Representatives', ':', 'Call', 'me', 'Ishmael', '.']
>>>

This special use of the addition operation is called concatenation; it links the lists together into a single list. We can concatenate sentences to build up a text.

1.3.2   Indexing

As we have seen, a text in Python is just a list of words, represented using a particular combination of brackets and quotes. Just as with an ordinary page of text, we can count up the total number of words (len(text1)), and count the occurrences of a particular word (text1.count("heaven")). And just as we can pick out the first, tenth, or even 14,278th word in a printed text, we can identify the elements of a list by their number, or index, by following the name of the text with the index inside brackets. We can also find the index of the first occurrence of any word:

 
>>> text4[173]
'awaken'
>>> text4.index("awaken")
173
>>>

Indexes turn out to be a common way to access the words of a text, or — more generally — the elements of a list. Python permits us to access sublists as well, extracting manageable pieces of language from large texts, a technique known as slicing.

 
>>> text5[1040:1060]
['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is',
'so', 'good', 'because', 'you', 'can', 'actually', 'play',
'a', 'full', 'game', 'without', 'buying', 'it']

Indexes have some subtleties, and we'll explore these with the help of an artificial sentence:

 
>>> sent = ["word1", "word2", "word3", "word4", "word5",
...         "word6", "word7", "word8", "word9", "word10",
...         "word11", "word12", "word13", "word14", "word15",
...         "word16", "word17", "word18", "word19", "word20"]
>>> sent[0]
'word1'
>>> sent[19]
'word20'
>>>

Notice that our indexes start from zero: sent element zero, written sent[0], is the first word, 'word1', while sent element 19 is 'word20'. This is initially confusing, but typical of modern programming languages. (If you've mastered the system of counting centuries where 19XY is a year in the 20th century, or if you live in a country where walking up 1 flight of stairs puts you on level 2 of a building, you'll quickly get the hang of this.) The moment Python accesses the content of a list from the computer's memory, it is already at the first element; we have to tell it how many elements forward to go.

Let's take a closer look at slicing, using our artificial sentence again:

 
>>> sent[17:20]
['word18', 'word19', 'word20']
>>> sent[17]
'word18'
>>> sent[18]
'word19'
>>> sent[19]
'word20'
>>>

Thus, the slice 17:20 includes sent elements 17, 18, and 19. By convention, m:n means elements m through n-1. We can omit the first number if the slice begins at the start of the list, and we can omit the second number if the slice goes to the end:

 
>>> sent[:3]
['word1', 'word2', 'word3']
>>> text2[141525:]
['among', 'the', 'merits', 'and', 'the', 'happiness', 'of', 'Elinor',
'and', 'Marianne', ',', 'let', 'it', 'not', 'be', 'ranked', 'as', 'the',
'least', 'considerable', ',', 'that', 'though', 'sisters', ',', 'and',
'living', 'almost', 'within', 'sight', 'of', 'each', 'other', ',',
'they', 'could', 'live', 'without', 'disagreement', 'between', 'themselves',
',', 'or', 'producing', 'coolness', 'between', 'their', 'husbands', '.',
'THE', 'END']
>>>
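Two consequences of the m:n convention are worth checking for yourself: a slice m:n contains exactly n - m elements, and the slices [:k] and [k:] fit together with nothing lost or repeated. The sketch below rebuilds the artificial sentence (using a shortcut to avoid typing all twenty words) and confirms both.

```python
# Rebuild the artificial sentence 'word1' ... 'word20'.
sent = ["word%d" % number for number in range(1, 21)]

m, n = 17, 20
piece = sent[m:n]
print(piece)

# A slice m:n holds n - m elements.
print(len(piece) == n - m)

# Splitting at any point and concatenating restores the original list.
print(sent[:3] + sent[3:] == sent)
```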

We can modify an element of a list by assigning to one of its index values, e.g. putting sent[0] on the left of the equals sign. We can also replace an entire slice with new material:

 
>>> sent[0] = "First Word"
>>> sent[19] = "Last Word"
>>> sent[1:19] = ["Second Word", "Third Word"]
>>> sent
['First Word', 'Second Word', 'Third Word', 'Last Word']
>>>

1.3.3   Variables

From the start of Section 1.2, you have had access to texts called text1, text2, and so on. It saved a lot of typing to be able to refer to a 250,000-word book with a short name like this! In general, we can make up names for anything we care to calculate. We did this ourselves in the previous sections, e.g. defining a variable sent1 as follows:

 
>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>>

Such lines have the form: variable = expression. Python will evaluate the expression, and save its result to the variable. This process does not generate any output; you have to type the variable on a line of its own to inspect its contents. The equals sign is slightly misleading, since information is copied from the right side to the left. The variable can be anything you like, e.g. my_sent, sentence, xyzzy. It must start with a letter, and can include numbers and underscores. It cannot be any of Python's reserved words, such as if, not, and import. Here are some examples:

 
>>> mySent = ["The", "family", "of", "Dashwood", "had", "long",
...          "been", "settled", "in", "Sussex", "."]
>>> noun_phrase = mySent[:4]
>>> noun_phrase
['The', 'family', 'of', 'Dashwood']
>>> wOrDs = sorted(noun_phrase)
>>> wOrDs
['Dashwood', 'The', 'family', 'of']
>>>

It is good to choose meaningful variable names to help you — and anyone who reads your Python code — to understand what your code is meant to do. Python does not try to make sense of the names; it blindly follows your instructions, and does not object if you do something confusing, such as one = "two" or two = 3.
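If you are unsure whether a name is one of Python's reserved words, the standard keyword module can tell you. The short sketch below checks the example names mentioned above; the list called candidates is made up for illustration.

```python
import keyword

# Names mentioned in this section: three usable ones and three reserved words.
candidates = ["my_sent", "sentence", "xyzzy", "if", "not", "import"]

reserved = [name for name in candidates if keyword.iskeyword(name)]
usable = [name for name in candidates if not keyword.iskeyword(name)]

print(reserved)
print(usable)
```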

We can use variables to hold intermediate steps of a computation. This may make the Python code easier to follow. Thus len(set(text1)) could also be written:

 
>>> vocab = set(text1)
>>> vocab_size = len(vocab)
>>> vocab_size
19317
>>>

1.3.4   Exercises

  1. ☼ Create a variable phrase containing a list of words. Experiment with the operations described above, including addition, multiplication, indexing, slicing, and sorting.
  2. ☼ The index of 'the' in sent3 is 1, because sent3[1] gives us 'the'. What are the indexes of the two other occurrences of this word in sent3?
  3. ☼ Our artificial sentence had 20 elements. What does the interpreter do when you enter sent[20]? Why?
  4. ☼ We can count backwards from the end of a list using negative indexes. The last element of a list always has index -1. See what happens when you enter text2[-1].
  5. ◑ Use text6.index(??) to find the index of the word sunset. By a process of trial and error, find the slice for the complete sentence that contains this word.
  6. ◑ Use the addition, set, and sorted operations to compute the vocabulary of the sentences defined above (sent1 ...).
  7. ◑ Write the slice expression that produces the last two words of text2.

1.4   Computing with Language: Simple Statistics

Let's return to our exploration of the ways we can bring our computational resources to bear on large quantities of text. We began this discussion in Section 1.2, and we saw how to search for words in context, how to compile the vocabulary of a text, how to generate random text in the same style, and so on.

In this section we pick up the question of what makes a text distinct, and use automatic methods to find characteristic words and collocations of a text. As in Section 1.2, you will try new features of the Python language by copying them into the interpreter, and you'll learn about these features systematically in the following section.

Before continuing with this section, check your understanding of the previous section by predicting the output of the following code, and using the interpreter to check if you got it right. If you found it difficult to do this task, it would be a good idea to review the previous section before continuing further.

 
>>> saying = ["After", "all", "is", "said", "and", "done", ",",
...           "more", "is", "said", "than", "done", "."]
>>> words = set(saying)
>>> words = sorted(words)
>>> words[-2:]

1.4.1   Frequency Distributions

How could we automatically identify the words of a text that are most informative about the topic and genre of the text? Let's begin by finding the most frequent words of the text. Imagine how you might go about finding the 50 most frequent words of a book. One method would be to keep a tally for each vocabulary item, like that shown in Figure 1.2. We would need thousands of counters and it would be a laborious process, so laborious that we would rather assign the task to a machine.


Figure 1.2: Counting Words Appearing in a Text

The table in Figure 1.2 is known as a frequency distribution, and it tells us the frequency of each vocabulary item in the text. It is a "distribution" since it tells us how the total number of words in the text (260,819 in the case of Moby Dick) is distributed across the vocabulary items. Since we often need frequency distributions in language processing, NLTK provides built-in support for them. Let's use a FreqDist to find the 50 most frequent words of Moby Dick. Be sure to try this for yourself, taking care to use the correct parentheses and uppercase letters. (This code assumes that you have already done from nltk.book import * during your Python session.)

 
>>> fdist1 = FreqDist(text1)
>>> fdist1
<FreqDist with 260819 samples>
>>> vocabulary1 = fdist1.sorted()
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-',
'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for',
'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on',
'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were',
'now', 'which', '?', 'me', 'like']
>>> fdist1["whale"]
906
>>>

Do any words in the above list help us grasp the topic or genre of this text? Only one word, whale, is slightly informative! It occurs over 900 times. This list tells us almost nothing about the text; these words just represent the plumbing of English text. What proportion of English text is taken up with such words? We can generate a cumulative frequency plot for these words, using fdist1.plot(), to produce the graph shown in Figure 1.3. From this, it looks like these 50 words account for almost half the words of the book!

../images/fdist-moby.png

Figure 1.3: Cumulative Frequency Plot for 50 Most Frequent Words in Moby Dick
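The idea behind the cumulative plot can be sketched without NLTK. The following is a minimal illustration in Python 3 syntax, assuming the standard library's collections.Counter as a stand-in for FreqDist; the function name cumulative_coverage and the toy token list are hypothetical and are not part of NLTK or of Moby Dick.

```python
from collections import Counter
from itertools import accumulate

def cumulative_coverage(tokens, n):
    """Return the running fraction of all tokens covered by the n most frequent types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Counts of the n most frequent types, from most to least frequent.
    top_counts = [count for _, count in counts.most_common(n)]
    # accumulate() yields running totals; dividing by total gives fractions.
    return [running / total for running in accumulate(top_counts)]

# Toy token list (an assumption for illustration only).
tokens = "the whale , the sea , and the ship of the whale .".split()
coverage = cumulative_coverage(tokens, 3)
print(coverage)
```

The last value in the returned list is the share of the text accounted for by the n most frequent words, which is the quantity read off the right-hand edge of a cumulative frequency plot.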

If the frequent words don't help us, how about the words that occur once only, the so-called hapaxes? See them using fdist1.hapaxes(). This list contains lexicographer, cetological, contraband, expostulations, and about 9,000 others! It seems that there are too many rare words, and without seeing the context we probably can't guess what half of them mean in any case.
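To make the notion of a hapax concrete, here is a small sketch in Python 3 syntax that finds the words occurring exactly once. It uses collections.Counter rather than FreqDist, and the helper name find_hapaxes and the toy token list are assumptions made for this illustration.

```python
from collections import Counter

def find_hapaxes(tokens):
    """Return the sorted list of word types that occur exactly once."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)

# Toy token list (an assumption for illustration only).
tokens = "the whale , the sea , and the ship of the whale .".split()
once_only = find_hapaxes(tokens)
print(once_only)
```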

Next let's look at the long words of a text; perhaps these will be more characteristic and informative. For this we adapt some notation from set theory. We would like to find the words from the vocabulary of the text that are more than 15 characters long. We can express this in mathematical notation as follows:

(2)   {w | w ∈ V & P(w)}, where V is the vocabulary and P(w) is true if and only if w is more than 15 characters long.

In other words, we want to find all w such that w is in the vocabulary and w is longer than 15 characters. We can translate this expression into Python as follows:

 
>>> v = set(text1)
>>> sorted(w for w in v if len(w) > 15)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness',
'cannibalistically', 'characteristically', 'circumnavigating',
'circumnavigation', 'circumnavigations', 'comprehensiveness',
'hermaphroditical', 'indiscriminately', 'indispensableness',
'irresistibleness', 'physiognomically', 'preternaturalness',
'responsibilities', 'simultaneousness', 'subterraneousness',
'supernaturalness', 'superstitiousness', 'uncomfortableness',
'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']

The expression w for w in v could have equally been written word for word in vocab, and means "give me all words, where each word is an element of the vocabulary set". For each such word, we check that its length is greater than 15; all other words will be ignored. We will discuss this more carefully later. For now you should simply try out the above statements in the Python interpreter, and try changing the text, and changing the length condition.
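As a rough illustration of what the expression does, the sketch below (Python 3 syntax) writes the same selection twice: once as a generator expression and once as an explicit loop. The small vocab set is a made-up example, not the vocabulary of any NLTK text.

```python
# A toy vocabulary (an assumption for illustration only).
vocab = {"whale", "circumnavigation", "the", "uncomfortableness", "sea"}

# Generator-expression form, mirroring {w | w in V and P(w)}.
long_words = sorted(word for word in vocab if len(word) > 15)

# The same selection written as an explicit loop.
long_words_loop = []
for word in vocab:
    if len(word) > 15:          # P(w): longer than 15 characters
        long_words_loop.append(word)
long_words_loop.sort()

print(long_words)
print(long_words == long_words_loop)
```

Both forms visit every element of the set and keep only those that pass the length test; the name of the loop variable (w or word) makes no difference to the result.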

Let's return to our task of finding words that characterize a text. Notice that the long words in text4 reflect its national focus: constitutionally, transcontinental, while those in text5 reflect its informal content: boooooooooooglyyyyyy and yuuuuuuuuuuuummmmmmmmmmmm. Have we succeeded in automatically extracting words that typify a text? Well, these very long words are often hapaxes (i.e. unique) and perhaps it would be better to find frequently occurring long words. This seems promising since it eliminates frequent short words (e.g. the) and infrequent long words (e.g. antiphilosophists). Here are all words from the chat corpus that are longer than 5 characters, that occur more than 5 times:

 
>>> fdist5 = FreqDist(text5)
>>> sorted(w for w in set(text5) if len(w) > 5 and fdist5[w] > 5)
['#14-19teens', '<empty>', 'ACTION', 'anybody', 'anyone', 'around',
'cute.-ass', 'everybody', 'everyone', 'female', 'listening', 'minutes',
'people', 'played', 'player', 'really', 'seconds', 'should', 'something',
'watching']

Notice how we have used two conditions: len(w) > 5 ensures that the words are longer than 5 characters, and fdist5[w] > 5 ensures that these words occur more than five times. Looking up the count in the frequency distribution avoids rescanning the whole text for every word. At last we have managed to automatically identify the frequently occurring content-bearing words of the text.
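The same two-condition filter can be sketched with the standard library alone. This is a minimal illustration in Python 3 syntax, with collections.Counter standing in for FreqDist; the function name frequent_long_words, its parameter names, and the toy token list are all hypothetical.

```python
from collections import Counter

def frequent_long_words(tokens, min_length, min_count):
    """Return sorted word types longer than min_length that occur more than min_count times."""
    counts = Counter(tokens)    # count every token once, up front
    return sorted(word for word, count in counts.items()
                  if len(word) > min_length and count > min_count)

# Toy token list (an assumption for illustration only).
tokens = ["everyone"] * 3 + ["hi"] * 5 + ["something"] + ["people"] * 2
selected = frequent_long_words(tokens, 5, 1)
print(selected)
```

Here "hi" is frequent but too short, and "something" is long but too rare, so neither survives the combined test.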

1.4.3   Statistics Over Secondary Data

[statistics over word lengths (multiple plots on one graph), eliminating words, normalizing words, freqdists over letters]

1.4.4   Exercises

  1. The demise of teen language: Read the BBC News article: UK's Vicky Pollards 'left behind' http://news.bbc.co.uk/1/hi/education/6173441.stm. The article gives the following statistic about teen language: "the top 20 words used, including yeah, no, but and like, account for around a third of all words." How many word types account for a third of all word tokens, for a variety of text sources? What do you conclude about this statistic? Read more about this on LanguageLog, at http://itre.cis.upenn.edu/~myl/languagelog/archives/003993.html.

1.5   Back to Python: Making Decisions and Taking Control

So far, our simple programs have been able to manipulate sequences of words, and perform some operation on each one. We applied this to lists consisting of a few words, but the approach works the same for lists of arbitrary size, containing thousands of items. Thus, such programs have some interesting qualities: (i) the ability to work with language, and (ii) the potential to save human effort through automation. Another useful feature of programs is their ability to make decisions on our behalf; this is our focus in this section.

1.5.1   Conditionals

Python supports a wide range of operators like < and >= for testing the relationship between values. The full set of these relational operators is shown in Table 1.2.

Table 1.2: Numerical Comparison Operators

Operator   Relationship
<          less than
<=         less than or equal to
==         equal to (note this is two = signs, not one)
!=         not equal to
>          greater than
>=         greater than or equal to

We can use these to select different words from a sentence of news text. Here are some examples — only the operator is changed from one line to the next.

 
>>> [w for w in sent7 if len(w) < 4]
[',', '61', 'old', ',', 'the', 'as', 'a', '29', '.']
>>> [w for w in sent7 if len(w) <= 4]
[',', '61', 'old', ',', 'will', 'join', 'the', 'as', 'a', 'Nov.', '29', '.']
>>> [w for w in sent7 if len(w) == 4]
['will', 'join', 'Nov.']
>>> [w for w in sent7 if len(w) != 4]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'the', 'board',
'as', 'a', 'nonexecutive', 'director', '29', '.']
>>>

The above expressions involve numerical comparisons. We can also test various properties of words, using the functions listed in Table 1.3.

Table 1.3: Some Word Comparison Operators

Function          Meaning
s.startswith(t)   s starts with t
s.endswith(t)     s ends with t
t in s            t is contained inside s
s.islower()       all cased characters in s are lowercase
s.isupper()       all cased characters in s are uppercase
s.isalpha()       all characters in s are alphabetic
s.isalnum()       all characters in s are alphanumeric
s.isdigit()       all characters in s are digits
s.istitle()       s is titlecased

Here are some examples of these operators being used to select words from our texts.

 
>>> sorted(w for w in set(text1) if w.endswith("ableness"))
['comfortableness', 'honourableness', 'immutableness',
'indispensableness', 'indomitableness', 'intolerableness',
'palpableness', 'reasonableness', 'uncomfortableness']
>>> sorted(w for w in set(text4) if "gnt" in w)
['Sovereignty', 'sovereignties', 'sovereignty']
>>> sorted(w for w in set(sent7) if w.isdigit())
['29', '61']
>>>

We can also use and, or, and not:

 
>>> sorted(w for w in set(text7) if "-" in w and "index" in w)
['Stock-index', 'index-arbitrage', 'index-fund',
'index-options', 'index-related', 'stock-index']
>>> sorted(w for w in set(text3) if w.istitle() and len(w) > 10)
['Abelmizraim', 'Allonbachuth', 'Beerlahairoi', 'Canaanitish',
'Chedorlaomer', 'Girgashites', 'Hazarmaveth', ...]
>>> sorted(w for w in set(sent7) if not w.islower())
[',', '.', '29', '61', 'Nov.', 'Pierre', 'Vinken']
>>> sorted(w for w in set(text2) if "cie" in w or "cei" in w)
['ancient', 'ceiling', 'conceit', 'conceited', 'conceive', 'conscience',
'conscientious', 'conscientiously', 'deceitful', 'deceive', ...]

1.5.2   Control Structures

Most programming languages permit us to execute a block of code when a conditional expression, or if statement, is satisfied. In the following program, we have created a variable called word containing the string value 'cat'. The if statement then checks whether the condition len(word) < 5 is true. Because the conditional expression is true, the body of the if statement is invoked and the print statement is executed, and displays a message to the user.

 
>>> word = "cat"
>>> if len(word) < 5:
...     print 'word length is less than 5'
...
word length is less than 5
>>>

When we type a control structure at the Python interpreter, we have to enter an extra blank line so that it can detect that the nested block is complete.

If we change the conditional expression to len(word) >= 5, to check that the length of word is greater than or equal to 5, then the conditional expression will no longer be true. This time, the body of the if statement will not be executed, and no message is shown to the user:

 
>>> if len(word) >= 5:
...     print 'word length is greater than or equal to 5'
...
>>>

An if statement is known as a control structure because it controls whether the code in the indented block will be run. Another control structure is the for loop:

 
>>> for word in ['Call', 'me', 'Ishmael', '.']:
...     print word
...
Call
me
Ishmael
.
>>>

This is called a loop because Python executes the code in circular fashion. It starts by doing word = 'Call', effectively using the word variable to name the first item of the list. Then it displays the value of word to the user. Next, it moves on to the second item of the list, and so on. It stops once every item of the list has been processed.

Now we can combine the if and for statements. We will loop over every item of the list, and only print the item if it ends with the letter "l". We'll pick another name for the variable to demonstrate that Python doesn't try to make sense of variable names.

 
>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>> for xyzzy in sent1:
...     if xyzzy.endswith("l"):
...         print xyzzy
...
Call
Ishmael
>>>

You will notice that if and for statements have a colon at the end of the line, before the indentation begins. In fact, all Python control structures end with a colon. The colon indicates that the current statement relates to the indented block that follows.

We can also specify an action to be taken if the condition of the if statement is not met. Here we see the elif ("else if") statement, and the else statement. Notice that these also have colons before the indented code.

 
>>> for token in sent1:
...     if token.islower():
...         print "lowercase word"
...     elif token.istitle():
...         print "titlecase word"
...     else:
...         print "punctuation"
...
titlecase word
lowercase word
titlecase word
punctuation
>>>
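One point worth seeing in action is that the else branch catches every token that fails the earlier tests, not only punctuation. The sketch below, in Python 3 syntax, wraps the same if/elif/else chain in a function; the name describe, the label "other", and the extra tokens '29' and 'NASA' are assumptions added for illustration.

```python
def describe(token):
    """Label a token using the same if/elif/else chain, tested in order."""
    if token.islower():
        return "lowercase word"
    elif token.istitle():
        return "titlecase word"
    else:
        # Reached by anything that is neither lowercase nor titlecase.
        return "other"

tokens = ['Call', 'me', 'Ishmael', '.', '29', 'NASA']
labels = [describe(token) for token in tokens]
for token, label in zip(tokens, labels):
    print(token, label)
```

Digits and all-uppercase words fall through to the final branch along with the full stop, so a label such as "punctuation" is only accurate when the input is known to contain nothing else.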

As you can see, even with this small amount of Python knowledge, you can start to build multi-line Python programs. It's important to develop such programs in pieces, testing that each piece does what you expect before combining them into a program. This is why the Python interactive interpreter is so valuable, and why you should get comfortable using it.

Finally, let's combine the idioms we've been exploring. First we create a list of cie and cei words, then we loop over each item and print it. Notice the comma at the end of the print statement, which tells Python to produce its output on a single line.

 
>>> confusing = sorted(w for w in set(text2) if "cie" in w or "cei" in w)
>>> for word in confusing:
...     print word,
...
ancient ceiling conceit conceited conceive conscience
conscientious conscientiously deceitful deceive ...
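The sessions in this section use the Python 2 print statement. In Python 3, where print is a function, a trailing comma no longer suppresses the newline; a rough equivalent, assuming Python 3 and a made-up three-word list, is to pass end=' ' or to join the words first:

```python
# A short stand-in list (an assumption for illustration only).
words = ['ancient', 'ceiling', 'conceit']

# Option 1: tell print() to end each item with a space instead of a newline.
for word in words:
    print(word, end=' ')
print()  # finish the line

# Option 2: build the whole line first, then print it once.
line = ' '.join(words)
print(line)
```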

1.5.4   Frequency Distributions

Some of the methods defined on NLTK frequency distributions are shown in Table 1.4. [More discussion and examples...]

Table 1.4: Methods Defined in the Frequency Distribution Module

Example                   Description
fdist['monstrous']        count of the number of times a given sample occurred
fdist.freq('monstrous')   frequency of a given sample
fdist.N()                 total number of samples
fdist.sorted()            the samples sorted in order of decreasing frequency
for sample in fdist:      iterate over the samples
fdist.max()               sample with the greatest count
fdist.plot()              graphical plot of the frequency distribution
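For readers who want to experiment without NLTK installed, most rows of Table 1.4 have a rough counterpart in the standard library's collections.Counter. The following sketch (Python 3 syntax) shows one possible mapping; it is an analogy under that assumption, not a description of how FreqDist is implemented, and the toy token list is made up.

```python
from collections import Counter

# Toy token list (an assumption for illustration only).
tokens = "the whale , the sea , and the ship of the whale .".split()
counts = Counter(tokens)

count_the = counts['the']                 # like fdist['the']
total = sum(counts.values())              # like fdist.N()
freq_the = counts['the'] / total          # like fdist.freq('the')
by_frequency = [word for word, _ in counts.most_common()]  # like fdist.sorted()
most_frequent = counts.most_common(1)[0][0]                # like fdist.max()

for sample in counts:                     # like: for sample in fdist
    print(sample, counts[sample])
```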

1.5.5   Exercises

  1. ☼ Assign a new value to sent, namely the sentence ["she", "sells", "sea", "shells", "by", "the", "sea", "shore"], then write code to perform the following tasks:
    1. Print all words beginning with 'sh'.
    2. Print all words longer than 4 characters.
    3. Generate a new sentence that adds the popular hedge word 'like' before every word beginning with 'se'.
  2. ◑ What does the following Python do? sum(len(w) for w in text1) Can you use it to work out the average word length of a text?
  3. ◑ What is the difference between the test w.isupper() and not w.islower()?

1.6   Computing with Language: Accessing Text Corpora

A text corpus is a large body of text, containing a careful balance of material in one or more genres. We have already seen some small corpora, such as the presidential inaugural addresses. This corpus actually contains dozens of individual texts — one per address — but we glued them end-to-end and treated them like chapters of a book, i.e. as a single text. In this section we will examine a variety of text corpora and will see how to select individual texts, and how to compare them.

1.6.1   The Gutenberg Corpus

NLTK includes a selection of texts from the Project Gutenberg electronic text archive. Let's find out what it contains. We begin by telling the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.files(), the files in NLTK's corpus of Gutenberg texts:

 
>>> import nltk
>>> nltk.corpus.gutenberg.files()
('austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt',
'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt',
'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt')
>>>

Let's pick out the first of these texts — Emma by Jane Austen — and give it a short name emma, then find out how many words it contains:

 
>>> emma = nltk.corpus.gutenberg.words("austen-emma.txt")
>>> len(emma)
192432
>>>

Note

In NLTK 0.9.5 you cannot do concordancing (and other tasks from Section 1.2) using a text defined this way. Instead you have to do the following:

 
>>> emma = nltk.Text(nltk.corpus.gutenberg.words("austen-emma.txt"))
>>>

It might get cumbersome to type nltk.corpus.gutenberg all the time, and there's nothing to stop us giving this a name in the usual way, e.g. by defining a name gutenberg = nltk.corpus.gutenberg, then using this instead. This is so common that Python provides direct support for it:

 
>>> from nltk.corpus import gutenberg
>>> gutenberg.files()
('austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt',
'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt',
'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt')
>>>

Let's write a short program to display other information about each text:

 
>>> for filename in gutenberg.files():
...     r = gutenberg.raw(filename)
...     w = gutenberg.words(filename)
...     s = gutenberg.sents(filename)
...     v = set(w)
...     print filename, len(r)/len(w), len(w)/len(s), len(w)/len(v)
...
austen-emma.txt 4 21 24
austen-persuasion.txt 4 23 16
austen-sense.txt 4 24 20
bible-kjv.txt 4 33 73
blake-poems.txt 4 18 4
chesterton-ball.txt 4 17 10
chesterton-brown.txt 4 19 10
chesterton-thursday.txt 4 16 10
melville-moby_dick.txt 4 24 13
milton-paradise.txt 4 52 9
shakespeare-caesar.txt 4 12 7
shakespeare-hamlet.txt 4 13 6
shakespeare-macbeth.txt 4 13 5
whitman-leaves.txt 4 35 10
>>>

This program has displayed the filename, followed by three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (a measure of lexical diversity). Observe that all texts have an average word length of 4 (evidently a reliable property of English text, although the figure is slightly inflated because len(r) also counts the whitespace characters in the raw text), but that they vary greatly in sentence length (12-52 words per sentence) and diversity score (5-73).
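The three ratios can be tried out on a tiny hand-made text. The sketch below uses Python 3 syntax, so / gives a fractional result rather than the truncated integers shown in the session above; the function name text_statistics and the sample raw string, word list, and sentence list are assumptions standing in for gutenberg.raw(), gutenberg.words(), and gutenberg.sents().

```python
def text_statistics(raw, words, sents):
    """Return (average word length, average sentence length, average uses per vocabulary item)."""
    vocab = set(words)
    return (len(raw) / len(words),     # characters per word, whitespace included
            len(words) / len(sents),   # words per sentence
            len(words) / len(vocab))   # tokens per distinct word type

# Hand-made stand-ins for the corpus reader's output (assumptions).
raw = "the cat sat . the cat ran ."
words = raw.split()
sents = [['the', 'cat', 'sat', '.'], ['the', 'cat', 'ran', '.']]

stats = text_statistics(raw, words, sents)
print(stats)
```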

This example also showed how we can access the "raw" text of the book, not split up into words. The raw() function gives us the contents of the file without any linguistic processing. So, for example,