1.1 The Language Challenge
Today, people from all walks of life — including professionals, students,
and the general population — are confronted by
unprecedented volumes of information, the vast bulk of which is stored
as unstructured text. In 2003, it was estimated that the annual
production of books amounted to 8 Terabytes. (A Terabyte is 1,000
Gigabytes, i.e., equivalent to 1,000 pickup trucks filled with books.)
It would take a human being about five years to read the new
scientific material that is produced every 24 hours. Although these
estimates are based on printed materials, increasingly the information
is also available electronically. Indeed, there has been an explosion of text
and multimedia content on the World Wide Web. For many people, a
large and growing fraction of work and leisure time is spent
navigating and accessing this universe of information.
The presence of so much text in electronic form is a huge challenge to
NLP. Arguably, the only way for humans to cope with the information
explosion is to exploit computational techniques that can sift
through huge bodies of text.
Although existing search engines have been crucial to the growth and
popularity of the Web, humans require skill, knowledge, and some luck,
to extract answers to such questions as What tourist sites can I
visit between Philadelphia and Pittsburgh on a limited budget?
What do expert critics say about digital SLR cameras? What
predictions about the steel market were made by credible commentators
in the past week? Getting a computer to answer them automatically
is a realistic long-term goal, but would involve a range of language
processing tasks, including information extraction, inference, and
summarization, and would need to be carried out on a scale and with a
level of robustness that is still beyond our current capabilities.
1.1.1 The Richness of Language
Language is the chief manifestation of human intelligence. Through
language we express basic needs and lofty aspirations, technical
know-how and flights of fantasy. Ideas are shared over great
separations of distance and time. The following samples from English
illustrate the richness of language:
(1)
a. Overhead the day drives level and grey, hiding the sun by a flight of grey spears. (William Faulkner, As I Lay Dying, 1930)
b. When using the toaster please ensure that the exhaust fan is turned on. (sign in dormitory kitchen)
c. Amiodarone weakly inhibited CYP2C9, CYP2D6, and CYP3A4-mediated activities with Ki values of 45.1-271.6 μM (Medline, PMID: 10718780)
d. Iraqi Head Seeks Arms (spoof news headline)
e. The earnest prayer of a righteous man has great power and wonderful results. (James 5:16b)
f. Twas brillig, and the slithy toves did gyre and gimble in the wabe (Lewis Carroll, Jabberwocky, 1872)
g. There are two ways to do this, AFAIK :smile: (internet discussion archive)
Thanks to this richness, the study of language is part of many
disciplines outside of linguistics, including translation, literary
criticism, philosophy, anthropology and psychology. Many less obvious
disciplines investigate language use, such as law, hermeneutics,
forensics, telephony, pedagogy, archaeology, cryptanalysis and speech
pathology. Each applies distinct methodologies to gather
observations, develop theories and test hypotheses. Yet all serve to
deepen our understanding of language and of the intellect that is
manifested in language.
The importance of language to science and the arts is matched in
significance by the cultural treasure embodied in language.
Each of the world's ~7,000 human languages is rich in unique respects,
in its oral histories and creation legends, down to its grammatical
constructions and its very words and their nuances of meaning.
Threatened remnant cultures have words to distinguish plant subspecies
according to therapeutic uses that are unknown to science. Languages
evolve over time as they come into contact with each other and they
provide a unique window onto human pre-history. Technological change
gives rise to new words like blog and new morphemes like e- and
cyber-. In many parts of the world, small linguistic variations
from one town to the next add up to a completely different language in
the space of a half-hour drive. For its breathtaking complexity and
diversity, human language is like a colorful tapestry stretching
through time and space.
1.1.2 The Promise of NLP
As we have seen, NLP is important
for scientific, economic, social, and cultural reasons. NLP is
experiencing rapid growth as its theories and methods are deployed in
a variety of new language technologies. For this reason it is
important for a wide range of people to have a working knowledge of
NLP.
Within industry, this includes people in
human-computer interaction, business information analysis,
and Web software development.
Within academia, this includes people in areas from
humanities computing and corpus linguistics
through to computer science and artificial intelligence.
We hope that you, a member of this diverse
audience reading these materials, will come to appreciate the workings
of this rapidly growing field of NLP and will apply its techniques in
the solution of real-world problems.
This book presents a
carefully-balanced selection of theoretical foundations and practical
applications, and equips readers to work with large datasets, to create
robust models of linguistic phenomena, and to deploy them in working
language technologies. By integrating all of this into the Natural
Language Toolkit (NLTK), we hope this book opens up the exciting
endeavor of practical natural language processing to a broader
audience than ever before.
The rest of this chapter provides a non-technical overview of Python and will
cover the basic programming knowledge needed for the rest of
the chapters in Part 1. It contains many examples and exercises;
there is no better way to learn to program than to dive in and try
these yourself. Before you know it you will be programming!
The goal of this chapter is to answer the following questions:
- what can we achieve by combining simple programming techniques with large quantities of text?
- how can we automatically extract representative words from a large text?
- is the Python programming language suitable for such work?
Along the way you will be introduced to a selection of elementary concepts
in linguistics and computer science. However, this is deliberately not
systematic, but only a taster, intended to give you the flavour of what
will come later, and motivate you to work through the more systematic
material that will follow.
1.2 Computing with Language: Texts and Words
As we will see, it is easy to get our hands on large quantities of text.
What can we do with it, assuming we can write some simple programs?
Here we will treat the text as data for the programs we write,
programs that manipulate and analyze it in a variety of interesting ways.
The first step is to get started with the Python interpreter.
1.2.1 Getting Started
One of the friendly things about Python is that it allows you
to type directly into the interactive interpreter —
the program that will be running your Python programs.
You can run the Python interpreter using a simple graphical interface
called the Interactive DeveLopment Environment (IDLE).
On a Mac you can find this under Applications→MacPython,
and on Windows under All Programs→Python.
Under Unix you can run Python from the shell by typing python.
The interpreter will print a blurb about your Python version;
simply check that you are running Python 2.4 or greater (here it is 2.5):
| |
Python 2.5 (r25:51918, Sep 19 2006, 08:49:13)
[GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
|
|
Note
If you are unable to run the Python interpreter, you probably don't
have Python installed correctly. Please visit http://nltk.org/ for
detailed instructions.
The >>> prompt indicates that the Python interpreter is now waiting
for input. When copying examples from this book be sure not to type
in the >>> prompt yourself. Now, let's begin by using Python as a calculator:
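A calculator session of this kind can be sketched as follows. The expression is an assumed example chosen only for illustration, and it is written as a short script, so there is no >>> prompt to type:

```python
# Illustrative arithmetic (an assumed example, not taken from the text).
# Multiplication is carried out before addition and subtraction.
result = 1 + 5 * 2 - 3
print(result)
```

Typed at the interactive prompt, the expression alone is enough; the interpreter displays the value without needing print.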
Once the interpreter has finished calculating the answer and displaying it, the
prompt reappears. This means the Python interpreter is waiting for another instruction.
Try a few more expressions of your own. You can use asterisk (*)
for multiplication and slash (/) for division, and parentheses for
bracketing expressions. One strange thing you might come across is
that division doesn't always behave as you might expect; it does integer
division or floating point division depending on how you specify the inputs:
| |
>>> 3/3
1
>>> 1/3
0
>>> 1.0/3.0
0.33333333333333331
>>>
|
|
These examples demonstrate how you can work interactively with the
interpreter, allowing you to experiment and explore.
As you will see later, your intuitions about numerical expressions
will be useful for manipulating language data in Python.
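The division results above reflect Python 2, where a slash between two integers performs integer division. As a hedged aside for readers on Python 3 (an assumption, since this chapter targets Python 2.4 or later), the slash there always gives a floating point result, and a double slash requests integer division. The sketch below uses only spellings that give the same answers under both versions:

```python
# Floor (integer) division: the double slash discards the fractional part.
whole = 1 // 3

# Floating point division: the operands are written with a decimal point.
fraction = 1.0 / 3.0

# Parentheses group sub-expressions, as in ordinary arithmetic.
grouped = (2 + 3) * 4

print(whole)
print(fraction)
print(grouped)
```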
Now let's try a nonsensical expression to see how the interpreter handles it:
| |
>>> 1 +
Traceback (most recent call last):
File "<stdin>", line 1
1 +
^
SyntaxError: invalid syntax
>>>
|
|
Here we have produced a syntax error. It doesn't make sense
to end an instruction with a plus sign. The Python interpreter indicates
the line where the problem occurred.
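To see what the interpreter reports without leaving a script, the incomplete expression can be handed to Python's built-in compile function inside a try block. This is only a sketch: the variable names and the "<example>" label are made up for illustration.

```python
# Ask Python to parse the incomplete expression "1 +" and catch the complaint.
source = "1 +"
try:
    compile(source, "<example>", "eval")
    outcome = "no error"
except SyntaxError as err:
    outcome = "syntax error"
    # err.lineno holds the line on which the problem was detected.
    print("Syntax error on line %d of %r" % (err.lineno, source))
```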
1.2.2 Searching Text
Now that we can use the Python interpreter, let's see how we can harness its
power to process text. The first step is to type a line of magic at the
Python prompt, telling the interpreter to load some texts for us to explore:
from nltk.book import * (i.e. import all names from NLTK's book module).
After printing a welcome message, it loads
the text of several books, including Moby Dick. Type the following,
taking care to get spelling and punctuation exactly right:
| |
>>> from nltk.book import *
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text2
<Text: Sense and Sensibility by Jane Austen 1811>
>>>
|
|
We can examine the contents of a text in a variety
of ways. A concordance view shows us a given word in its context. Here we
look up the word monstrous. Try searching for other words; you can use the up-arrow
key to access the previous command and modify the word being searched.
| |
>>> text1.concordance("monstrous")
mong the former , one was of a most monstrous size . ... This came towards us , o
ION OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have re
all over with a heathenish array of monstrous clubs and spears . Some were thickl
ed as you gazed , and wondered what monstrous cannibal and savage could ever have
that has survived the flood ; most monstrous and most mountainous ! That Himmale
they might scout at Moby Dick as a monstrous fable , or still worse and more det
ath of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere lo
ling Scenes . In connexion with the monstrous pictures of whales , I am strongly
>>>
|
|
You can now try concordance searches on some of the other texts we have included.
For example, search Sense and Sensibility for the word
affection, using text2.concordance("affection"). Search the book of Genesis
to find out how long some people lived, using:
text3.concordance("lived"). You could look at text4, the
US Presidential Inaugural Addresses to see examples of English dating
back to 1789, and search for words like nation, terror, god.
We've also included text5, the NPS Chat Corpus: search this for
unconventional words like im, ur, lol.
(Note that this corpus is uncensored!)
Once you've spent some time examining these texts, we hope you have a new
sense of the richness and diversity of language. In the next chapter
you will learn how to access a broader range of text, including text in
languages other than English.
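To make the idea of a concordance concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation: the function name concordance_lines, the width parameter, and the toy sentence are all made up for illustration, and the code assumes Python 3.

```python
def concordance_lines(tokens, target, width=3):
    """Return (left context, word, right context) for each match of target."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            # Take up to `width` tokens on either side of the match.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# A toy text (made up for illustration), split on whitespace.
tokens = "the monstrous whale and the most monstrous fable of all".split()

# Right-align the left context so the matched words line up in a column.
for left, word, right in concordance_lines(tokens, "monstrous"):
    print("%20s  %s  %s" % (left, word, right))
```

Aligning the matched word in a central column is what makes recurring neighbours easy to spot when scanning down the page.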
If we can find words in a text, we can also take note of their position within
the text. We produce a dispersion plot, where each bar represents an instance
of a word and each row represents the entire text. In Figure 1.1 we
see some striking patterns of word usage over the last 220 years. You can
produce this plot as shown below (so long as you have Numpy and Pylab installed).
You might like to try different words, and different texts. As before, take
care to get the quotes, commas, brackets and parentheses exactly right.
| |
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
>>>
|
|
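The data behind a dispersion plot is simply the list of positions at which each word occurs. The following sketch computes those positions and draws a crude text-only version of the plot. The helper word_offsets and the toy text are assumptions for illustration, not part of NLTK.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of positions where it occurs."""
    offsets = dict((target, []) for target in targets)
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# A toy text (made up for illustration), split on whitespace.
tokens = "freedom and duties of citizens mean freedom for all citizens".split()
offsets = word_offsets(tokens, ["citizens", "freedom", "duties"])

# One row per word: a bar (|) where the word occurs, a dot elsewhere.
for word in sorted(offsets):
    row = "".join("|" if i in offsets[word] else "." for i in range(len(tokens)))
    print("%-10s %s" % (word, row))
```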
A concordance permits us to see words in context, e.g. we saw that
monstrous appeared in the context the monstrous pictures.
What other words appear in the same contexts that monstrous
appears in? We can find out as follows:
| |
>>> text1.similar("monstrous")
subtly impalpable curious abundant perilous trustworthy untoward
singular imperial few maddens loving mystifying christian exasperate
puzzled fearless uncommon domineering candid
>>> text2.similar("monstrous")
great very so good vast a exceedingly heartily amazingly as sweet
remarkably extremely
>>>
|
|
Observe that we get different results for different books.
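One simple way to think about "the same contexts" is to treat the context of a word as the pair of words on either side of it. The sketch below follows that assumption; it is not necessarily how NLTK's similar method is implemented, and the function name and toy text are made up for illustration.

```python
from collections import defaultdict

def similar_words(tokens, target):
    """Return words that share at least one (previous, next) context with target."""
    contexts = defaultdict(set)
    # Skip the first and last tokens, which lack a neighbour on one side.
    for i in range(1, len(tokens) - 1):
        contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    target_contexts = contexts[target]
    return sorted(word for word, seen in contexts.items()
                  if word != target and seen & target_contexts)

# A toy text (made up for illustration), split on whitespace.
tokens = "a monstrous size . a great size . a very fine day . a curious size".split()
print(similar_words(tokens, "monstrous"))
```

Here great and curious are returned because, like monstrous, they occur between a and size.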
Now, just for fun, let's try generating some random text in the various
styles we have just seen. To do this, we type the name of the text
followed by the "generate" function, e.g. text3.generate():
| |
>>> text3.generate()
In the beginning of his brother is a hairy man , whose top may reach
unto heaven ; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month , upon the earth . So shall thy
wages be ? And they made their father ; and Isaac was old , and kissed
him : and Laban with his cattle in the midst of the hands of Esau thy
first born , and Phichol the chief butler unto his son Isaac , she
>>>
|
|
Note that the first time you run this, it is slow because it gathers statistics
about word sequences. Each time you run it, you will get different output text.
Now try generating random text in the style of an inaugural address or an
internet chat room.
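The phrase "statistics about word sequences" can be illustrated with a very small model that records which words follow which, then walks through that table at random. This is a hedged sketch of the general idea, not NLTK's generate method; the function names, the seed parameter, and the sample text are assumptions.

```python
import random
from collections import defaultdict

def build_followers(tokens):
    """Record, for every word, the words observed immediately after it."""
    followers = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        followers[current].append(following)
    return followers

def generate(followers, start, length=8, seed=None):
    """Produce up to `length` words by repeatedly picking a random follower."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < length:
        options = followers.get(words[-1])
        if not options:  # dead end: this word was never followed by anything
            break
        words.append(rng.choice(options))
    return words

tokens = ("In the beginning God created the heaven and the earth . "
          "And the earth was without form , and void .").split()
followers = build_followers(tokens)
print(" ".join(generate(followers, "In")))
```

Because each step is a random choice, the output differs from run to run unless a seed is supplied.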
Note
When text is printed, punctuation has been split off
from the previous word. Although this is not correct formatting
for English text, we do this to make it clear that punctuation does
not belong to the word. This is called "tokenization", and you will learn
about it in Chapter 2.
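As a rough preview, punctuation can be split off from words with a regular expression. This is a deliberately crude sketch using Python's standard re module; the function name simple_tokenize is made up, and real tokenizers handle many cases (such as contractions and abbreviations) that this one does not.

```python
import re

def simple_tokenize(text):
    """Split text into words and single punctuation marks."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Call me Ishmael. Some years ago, never mind how long!")
print(tokens)
```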
1.2.3 Counting Vocabulary
The most obvious fact about texts that emerges from the previous section is that
they differ in the vocabulary they use. In this section we will see how to use the
computer to count the words in a text, in a variety of useful ways.
As before you will jump right in and experiment with
the Python interpreter, even though you may not have studied Python systematically
yet.
Let's begin by finding out the length of a text from start to finish,
in terms of the words and punctuation symbols that appear. We'll use
the text of Moby Dick again:
| |
>>> len(text1)
260819
>>>
|
|
That's a quarter of a million words long! But how many distinct words does this text
contain? To work this out in Python we have to pose the question slightly
differently. The vocabulary of a text is just the set of words that it uses,
and in Python we can list the vocabulary of text3 with the command: set(text3).
This will produce many screens of words. Now try the following:
| |
>>> sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)',
'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech',
'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', ...]
>>> len(set(text3))
2789
>>> len(text3) / len(set(text3))
16
>>>
|
|
Here we can see a sorted list of vocabulary items, beginning with various
punctuation symbols and continuing with words starting with A. Words
starting with a will appear much later, after the last "Z" word, Zoroaster.
We discover the size of the vocabulary indirectly, by asking
for the length of the set. Finally, we can calculate a measure of the lexical
richness of the text and learn that each word is used 16 times on average.
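The same three steps can be tried on a text small enough to check by hand. The sketch below uses the opening sentence of Genesis as a plain list, so it does not need NLTK. The variable names are made up, and the calculation uses floating point division so that the fractional part is kept (an intentional difference from the integer result shown above).

```python
tokens = ["In", "the", "beginning", "God", "created", "the",
          "heaven", "and", "the", "earth", "."]

# The vocabulary is the set of distinct tokens; sorting puts punctuation
# first, then capitalized words, then lowercase words.
vocabulary = sorted(set(tokens))

token_count = len(tokens)     # every word and punctuation mark
type_count = len(vocabulary)  # distinct items only

# Average number of times each vocabulary item is used.
average_uses = float(token_count) / type_count

print(vocabulary)
print(average_uses)
```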
Next, let's focus in on particular words. We can count how often a word occurs
in a text, and compute what percentage
of the text is taken up by a specific word:
| |
>>> text3.count("smote")
5
>>> 100.0 * text4.count('a') / len(text4)
1.4587672822333748
>>>
|
|
You might like to repeat such calculations on several texts,
but it is tedious to keep retyping them for different texts. Instead,
we can come up with our own name for this task, e.g. "score", and
define a function that can be re-used as often as we like:
| |
>>> def score(text):
... return len(text) / len(set(text))
...
>>> score(text3)
16
>>> score(text4)
4
>>>
|
|
Note
The Python interpreter changes the prompt from
>>> to ... after encountering the colon at the
end of the first line. The ... prompt indicates
that Python expects an indented code block to appear next.
It is up to you to do the indentation, by typing four
spaces. To finish the indented block just enter a blank line.
Notice that we used the score function by typing its name, followed
by an open parenthesis, the name of the text, then a close parenthesis.
This is just what we did for the len and set functions earlier.
These parentheses will show up often: their role is to separate
the name of a task — such as score — from the data
that the task is to be performed on — such as text3.
Functions are an advanced concept in programming and we only
mention them at the outset to give newcomers a sense of the
power and flexibility of programming. We'll come back to them
towards the end of this chapter. Later we'll see how to use
such functions when tabulating data: by performing the
same task many times over using functions, we can easily build
up tables like Table 1.1.
| Genre | Token Count | Type Count | Score |
| skill and hobbies | 82345 | 11935 | 6.9 |
| humor | 21695 | 5017 | 4.3 |
| fiction: science | 14470 | 3233 | 4.5 |
| press: reportage | 100554 | 14394 | 7.0 |
| fiction: romance | 70022 | 8452 | 8.3 |
| religion | 39399 | 6373 | 6.2 |
Table 1.1: Lexical Diversity of Various Genres in the Brown Corpus
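A table of this shape can be produced by calling one function on each text in turn. The sketch below shows the pattern on two tiny made-up samples rather than on the Brown Corpus. The samples dictionary is an assumption for illustration, and this version of score uses floating point division so that it yields decimal values like those in the table.

```python
def score(text):
    """Average number of times each distinct word is used."""
    return float(len(text)) / len(set(text))

# Toy samples standing in for real genres (made up for illustration).
samples = {
    "nursery rhyme": "row row row your boat gently down the stream".split(),
    "proverb": "a stitch in time saves nine".split(),
}

# Apply the same calculations to every sample and collect one row per sample.
rows = []
for name in sorted(samples):
    text = samples[name]
    rows.append((name, len(text), len(set(text)), round(score(text), 1)))

print("| Genre | Token Count | Type Count | Score |")
for row in rows:
    print("| %s | %d | %d | %.1f |" % row)
```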
1.2.4 Exercises
- ☼ How many words are there in text2? How many
distinct words are there?
- ☼ Compare the lexical diversity scores for humor
and romance fiction in Table 1.1. Which genre is
more lexically diverse?
- ☺ Compare the lexical dispersion plot with Google Trends, which
shows the frequency with which a term has been referenced in news reports
or been used in search terms over time.
- ☼ Produce a dispersion plot of the four main protagonists in
Sense and Sensibility: Elinor, Marianne, Edward, Willoughby.
What can you observe about the different roles played by the males
and females in this novel? Can you identify the couples?
- ☼ According to Strunk and White's Elements of Style,
the word however, used at the start of a sentence,
means "in whatever way" or "to whatever extent", and not
"nevertheless". They give this example of correct usage:
However you advise him, he will probably do as he thinks best.
(http://www.bartleby.com/141/strunk3.html)
Use the concordance tool to study actual usage of this word
in the various texts we have been considering.
- ◑ Consider the following Python expression: len(set(text4)).
State the purpose of this expression. Describe the two steps
involved in performing this computation.
- ◑ How many times does the word lol appear in text5?
How much is this as a percentage of the total number of words
in this text?
- ◑ Pick a pair of texts and study the differences between them,
in terms of vocabulary, vocabulary richness, genre, etc. Can you
find pairs of words which have quite different meanings across the
two texts, such as monstrous in Moby Dick and in Sense and Sensibility?
- ◑ Compare the frequency of use of the modal verbs will and
could in text2 (romance fiction) and text7 (news).
Which modal verb is more common in which genre?
We will leave our quest for characteristic words of a text,
and explore a rather different approach that uses the
ratio of word frequencies.
1.3 A Closer Look at Python: Texts as Lists of Words
You've seen some important building blocks of the Python programming language.
Here we take a break from language processing to take a closer look at Python.
1.3.1 Lists
What is a text? At one level, it is a sequence of symbols on a page, such
as this one. At another level, it is a sequence of chapters, made up
of a sequence of sections, where each section is a sequence of paragraphs,
and so on. However, for our purposes, we will think of a text as nothing
more than a sequence of words and punctuation. Here's how we represent
text in Python, in this case the opening sentence of Moby Dick:
| |
>>> sent1 = ["Call", "me", "Ishmael", "."]
>>>
|
|
After the prompt we've given a name we made up, sent1, followed
by the equals sign, and then some quoted words, separated with
commas, and surrounded with brackets. This bracketed material
is known as a list in Python: it is how we store a text.
Each individual word must be quoted, using double or single quotes,
like "this" or like 'this'.
(When using single quotes, use the same close quote character, ', at both the start and the end.)
Here, we've given this list the name sent1. We can inspect
it by typing the name, and we can ask for its length:
| |
>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> len(sent1)
4
>>> score(sent1)
1
>>>
|
|
We can even apply our own "score" function to it.
Some more lists have been defined for you,
one for the opening sentence of each of our texts,
sent2 … sent8. We inspect two of them
here; you can see the rest for yourself using the Python interpreter.
| |
>>> sent2
['The', 'family', 'of', 'Dashwood', 'had', 'long',
'been', 'settled', 'in', 'Sussex', '.']
>>> sent3
['In', 'the', 'beginning', 'God', 'created', 'the',
'heaven', 'and', 'the', 'earth', '.']
|
|
You can type these in or else make up a few sentences of your own.
Now let's repeat some of the other Python operations we saw above in
Section 1.2:
| |
>>> sorted(sent3)
['.', 'God', 'In', 'and', 'beginning', 'created', 'earth',
'heaven', 'the', 'the', 'the']
>>> len(set(sent3))
9
>>> sent3.count("the")
3
>>>
|
|
We can also do arithmetic operations with lists in Python.
Multiplying a list by a number, e.g. sent1 * 2,
creates a longer list containing multiple
copies of the items in the original list. Adding two
lists, e.g. sent4 + sent1, creates a new list
containing everything from the first list, followed
by everything from the second list:
| |
>>> sent1 * 2
['Call', 'me', 'Ishmael', '.', 'Call', 'me', 'Ishmael', '.']
>>> sent4 + sent1
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the',
'House', 'of', 'Representatives', ':', 'Call', 'me', 'Ishmael', '.']
>>>
|
|
This special use of the addition operation is called concatenation;
it links the lists together into a single list. We can concatenate
sentences to build up a text.
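Building a text by concatenation can be sketched as follows. The two sentences are typed in directly so that the example does not depend on NLTK, and the variable names doubled, combined, and text are made up for illustration.

```python
sent1 = ["Call", "me", "Ishmael", "."]
sent3 = ["In", "the", "beginning", "God", "created", "the",
         "heaven", "and", "the", "earth", "."]

# Multiplication repeats the items; addition joins two lists end to end.
doubled = sent1 * 2
combined = sent3 + sent1

# Start with an empty list and add one sentence at a time.
text = []
for sentence in [sent1, sent3]:
    text = text + sentence

print(len(doubled))
print(len(combined))
print(text)
```

Neither operation changes the original lists: sent1 still has four items afterwards.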
1.3.2 Indexing
As we have seen, a text in Python is just a list of words, represented
using a particular combination of brackets and quotes. Just as with an
ordinary page of text, we can count up the total number of words
(len(text1)), and count the occurrences of a particular word
(text1.count("heaven")). And just as we can pick out the
first, tenth, or even 14,278th word in a printed text, we can identify
the elements of a list by their number, or index, by following
the name of the text with the index inside brackets. We can
also find the index of the first occurrence of any word:
| |
>>> text4[173]
'awaken'
>>> text4.index("awaken")
173
>>>
|
|
Indexes turn out to be a common way to access the words of a text,
or — more generally — the elements of a list.
Python permits us to access sublists as well, extracting
manageable pieces of language from large texts, a technique
known as slicing.
| |
>>> text5[1040:1060]
['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is',
'so', 'good', 'because', 'you', 'can', 'actually', 'play',
'a', 'full', 'game', 'without', 'buying', 'it']
|
|
Indexes have some subtleties, and we'll explore these with
the help of an artificial sentence:
| |
>>> sent = ["word1", "word2", "word3", "word4", "word5",
...         "word6", "word7", "word8", "word9", "word10",
...         "word11", "word12", "word13", "word14", "word15",
...         "word16", "word17", "word18", "word19", "word20"]
>>> sent[0]
'word1'
>>> sent[19]
'word20'
>>>
|
|
Notice that our indexes start from zero:
sent element zero, written sent[0],
is the first word, 'word1', while
sent element 19 is 'word20'.
This is initially confusing,
but typical of modern programming languages.
(If you've mastered the system of counting
centuries where 19XY is a year in the 20th century,
or if you live in a country where walking up
1 flight of stairs puts you on level 2
of a building, you'll quickly get the hang of this.)
The moment Python accesses the content of a list from
the computer's memory, it is already at the first element;
we have to tell it how many elements forward to go.
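The "how many elements forward" view of an index can be checked with a short script. Here the artificial sentence is built in one line instead of being typed out; that shortcut and the variable names are assumptions for illustration.

```python
# Build the artificial sentence word1 ... word20 (range(1, 21) counts 1 to 20).
sent = ["word%d" % n for n in range(1, 21)]

first = sent[0]              # zero steps forward: the first element
last = sent[len(sent) - 1]   # the last index is always the length minus one
where = sent.index("word7")  # the seventh word is six steps forward

# enumerate pairs each element with its index, starting from 0.
for index, word in enumerate(sent[:3]):
    print("%d %s" % (index, word))
```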
Let's take a closer look at slicing, using our artificial sentence again:
| |
>>> sent[17:20]
['word18', 'word19', 'word20']
>>> sent[17]
'word18'
>>> sent[18]
'word19'
>>> sent[19]
'word20'
>>>
|
|
Thus, the slice 17:20 includes sent elements 17, 18, and 19.
By convention, m:n means elements m…n-1.
We can omit the first number if the slice begins at the start of the
list, and we can omit the second number if the slice goes to the end:
| |
>>> sent[:3]
['word1', 'word2', 'word3']
>>> text2[141525:]
['among', 'the', 'merits', 'and', 'the', 'happiness', 'of', 'Elinor',
'and', 'Marianne', ',', 'let', 'it', 'not', 'be', 'ranked', 'as', 'the',
'least', 'considerable', ',', 'that', 'though', 'sisters', ',', 'and',
'living', 'almost', 'within', 'sight', 'of', 'each', 'other', ',',
'they', 'could', 'live', 'without', 'disagreement', 'between', 'themselves',
',', 'or', 'producing', 'coolness', 'between', 'their', 'husbands', '.',
'THE', 'END']
>>>
|
|
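A few consequences of the m:n convention are worth checking directly. The sketch below rebuilds the artificial sentence in one line (an illustrative shortcut), and the variable names are made up.

```python
# The artificial sentence word1 ... word20.
sent = ["word%d" % n for n in range(1, 21)]

middle = sent[17:20]  # elements 17, 18 and 19
start = sent[:3]      # same as sent[0:3]
end = sent[17:]       # same as sent[17:20] for a 20-element list

# A slice m:n always contains n - m elements (when both are within range).
size = len(sent[5:12])

# Splitting a list at any point and rejoining the halves gives back the original.
rejoined = sent[:3] + sent[3:]

print(middle)
print(size)
```

Having the second number excluded is what makes the last two properties hold so neatly.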
We can modify an element of a list by assigning to one of its index values,
e.g. putting sent[0] on the left of the equals sign. We can also
replace an entire slice with new material:
| |
>>> sent[0] = "First Word"
>>> sent[19] = "Last Word"
>>> sent[1:19] = ["Second Word", "Third Word"]
>>> sent
['First Word', 'Second Word', 'Third Word', 'Last Word']
>>>
|
|
1.3.3 Variables
From the start of Section 1.2, you have had
access to texts called text1, text2, and so on. It saved a lot
of typing to be able to refer to a 250,000-word book with a short name
like this! In general, we can make up names for anything we care
to calculate. We did this ourselves in the previous sections, e.g.
defining a variable sent1 as follows:
| |
>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>>
|
|
Such lines have the form: variable = expression. Python will evaluate
the expression, and save its result to the variable. This process does
not generate any output; you have to type the variable on a line of its
own to inspect its contents. The equals sign is slightly misleading,
since information is copied from the right side to the left.
The variable can be anything you like, e.g. my_sent, sentence, xyzzy.
It must start with a letter or underscore, and can include numbers and underscores.
It cannot be any of Python's reserved words, such as if, not,
and import. Here are some examples:
| |
>>> mySent = ["The", "family", "of", "Dashwood", "had", "long",
... "been", "settled", "in", "Sussex", "."]
>>> noun_phrase = mySent[:4]
>>> noun_phrase
['The', 'family', 'of', 'Dashwood']
>>> wOrDs = sorted(noun_phrase)
>>> wOrDs
['Dashwood', 'The', 'family', 'of']
>>>
|
|
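The naming rules can also be checked mechanically. The following sketch assumes Python 3, where strings have an isidentifier method, and uses the standard keyword module. The list of candidate names is made up for illustration.

```python
import keyword

candidates = ["my_sent", "sentence", "xyzzy", "2nd_sent", "import", "not"]

usable = []
for name in candidates:
    # isidentifier() checks the spelling rules (no leading digit, no spaces);
    # keyword.iskeyword() checks whether the name is a reserved word.
    if name.isidentifier() and not keyword.iskeyword(name):
        usable.append(name)

print(usable)
```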
It is good to choose meaningful variable names to help you — and anyone
who reads your Python code — to understand what your code is meant to do.
Python does not try to make sense of the names; it blindly follows your instructions,
and does not object if you do something confusing, such as one = "two" or two = 3.
We can use variables to hold intermediate steps of a computation. This may make
the Python code easier to follow. Thus len(set(text1)) could also be written:
| |
>>> vocab = set(text1)
>>> vocab_size = len(vocab)
>>> vocab_size
19317
>>>
|
|
1.3.4 Exercises
- ☼ Create a variable phrase containing a list of words.
Experiment with the operations described above, including addition,
multiplication, indexing, slicing, and sorting.
- ☼ The index of 'the' in sent3 is 1, because sent3[1]
gives us 'the'. What are the indexes of the two other occurrences
of this word in sent3?
- ☼ Our artificial sentence had 20 elements. What does the interpreter
do when you enter sent[20]? Why?
- ☼ We can count backwards from the end of a list using negative indexes.
The last element of a list always has index -1.
See what happens when you enter text2[-1].
- ◑ Use text6.index(??) to find the index of the word sunset.
By a process of trial and error, find the slice for the complete sentence that
contains this word.
- ◑ Use the addition, set, and sorted operations to compute the
vocabulary of the sentences defined above (sent1 ...).
- ◑ Write the slice expression that produces the last two
words of text2.
1.4 Computing with Language: Simple Statistics
Let's return to our exploration of the ways we can bring our computational
resources to bear on large quantities of text. We began this discussion in
Section 1.2, and we saw how to search for words
in context, how to compile the vocabulary of a text, how to generate random
text in the same style, and so on.
In this section we pick up the question of what makes a text distinct,
and use automatic methods to find characteristic words and collocations
of a text. As in Section 1.2, you will try
new features of the Python language by copying them into the interpreter,
and you'll learn about these features systematically in the following section.
Before continuing with this section, check your understanding of the
previous section by predicting the output of the following code, and using the
interpreter to check if you got it right. If you found it difficult
to do this task, it would be a good idea to review the previous section
before continuing further.
| |
>>> saying = ["After", "all", "is", "said", "and", "done", ",",
... "more", "is", "said", "than", "done", "."]
>>> words = set(saying)
>>> words = sorted(words)
>>> words[-2:]
|
|
1.4.1 Frequency Distributions
How could we automatically identify the words of a text that are most
informative about the topic and genre of the text? Let's begin by
finding the most frequent words of the text. Imagine how you might
go about finding the 50 most frequent words of a book. One method
would be to keep a tally for each vocabulary item, like that shown in Figure 1.2.
We would need thousands of counters and it would be a laborious process,
so laborious that we would rather assign the task to a machine.
The table in Figure 1.2 is known as a frequency distribution,
and it tells us the frequency of each vocabulary item in the text. It is a "distribution"
since it tells us how the total number of words in the text (260,819
in the case of Moby Dick) is distributed across the vocabulary items.
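The tallying process can be sketched with Python's standard collections.Counter class, used here as a stand-in for illustration (it is not NLTK's FreqDist, and it needs Python 2.7 or later). The toy text is made up.

```python
from collections import Counter

# A toy text (made up for illustration), split on whitespace.
tokens = "the whale , the sea and the ship".split()

# Counter keeps one tally per distinct item.
tally = Counter(tokens)

# The individual tallies add up to the total number of tokens.
total = sum(tally.values())

print(tally["the"])
print(tally.most_common(1))
print(total)
```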
Since we often need frequency distributions in language processing, NLTK
provides built-in support for them. Let's use a FreqDist to find the
50 most frequent words of Moby Dick. Be sure to try this for yourself,
taking care to use the correct parentheses and uppercase letters.
(This code assumes that you have already done
from nltk.book import * during your Python session.)
| |
>>> fdist1 = FreqDist(text1)
>>> fdist1
<FreqDist with 260819 samples>
>>> vocabulary1 = fdist1.sorted()
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-',
'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for',
'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on',
'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were',
'now', 'which', '?', 'me', 'like']
>>> fdist1["whale"]
906
>>>
|
|
Do any words in the above list help us grasp the topic or genre of this text?
Only one word, whale, is slightly informative! It occurs over 900 times.
The rest of the list tells us almost nothing about the text; these words are just the
plumbing of English text.
What proportion of English text is taken up with such words?
We can generate a cumulative frequency plot for these words,
using fdist1.plot(), to produce the graph shown in Figure 1.3.
From this, it looks like these 50 words account for almost half the
words of the book!
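To check a claim like this numerically rather than by reading a plot, we can add up the counts of the most frequent words and divide by the length of the text. The following is a minimal sketch using collections.Counter on a made-up sentence; the function name cumulative_share is hypothetical and is not part of NLTK.

```python
from collections import Counter

def cumulative_share(tokens, n):
    """Return the fraction of all tokens covered by the n most frequent types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / float(len(tokens))

# Illustrative data only: 13 tokens, of which "the", "sat" and "on"
# are the three most frequent types.
tokens = "the cat sat on the mat and the dog sat on the log".split()
print(cumulative_share(tokens, 3))
```

With a real text loaded as a list of tokens, calling the same function with n set to 50 gives the proportion of the text taken up by its 50 most frequent words.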
If the frequent words don't help us, how about the words that occur only
once, the so-called hapaxes? See them using fdist1.hapaxes().
This list contains lexicographer, cetological,
contraband, expostulations, and about 9,000 others!
It seems that there are too many rare words, and without seeing the
context we probably can't guess what half of them mean in any case.
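The idea behind hapaxes is simple enough to sketch directly: count every word, then keep the ones whose count is exactly one. The version below assumes collections.Counter instead of an NLTK frequency distribution, and the function name hapaxes is our own choice for this illustration.

```python
from collections import Counter

def hapaxes(tokens):
    """Return the sorted list of word types that occur exactly once."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)

saying = ["After", "all", "is", "said", "and", "done", ",",
          "more", "is", "said", "than", "done", "."]
print(hapaxes(saying))
```

In this small example is, said and done each occur twice, so they are left out of the result.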
Next let's look at the long words of a text; perhaps these will be
more characteristic and informative. For this we adapt some notation
from set theory. We would like to find the words from the vocabulary
of the text that are more than 15 characters long. We can
express this in mathematical notation as follows:
(2)   {w | w ∈ V & P(w)},
where P(w) is true if and only if w is more than 15 characters long.
In other words, we want to find all w such that w
is in the vocabulary and w is longer than 15 characters.
We can translate this expression into Python as follows:
| |
>>> v = set(text1)
>>> sorted(w for w in v if len(w) > 15)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness',
'cannibalistically', 'characteristically', 'circumnavigating',
'circumnavigation', 'circumnavigations', 'comprehensiveness',
'hermaphroditical', 'indiscriminately', 'indispensableness',
'irresistibleness', 'physiognomically', 'preternaturalness',
'responsibilities', 'simultaneousness', 'subterraneousness',
'supernaturalness', 'superstitiousness', 'uncomfortableness',
'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
|
|
The expression w for w in v could equally have been written
word for word in vocab, and means "give me all words, where each
word is an element of the vocabulary set". For each such word,
we check that its length is greater than 15; all other words will
be ignored. We will discuss this more carefully later. For now
you should simply try out the above statements in the Python interpreter,
and try changing the text, and changing the length condition.
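If the compact notation seems opaque, it may help to see the same selection written out as an explicit loop. The sketch below uses a tiny hand-made vocabulary (an assumption for illustration, not one of the book's texts) and builds the result both ways.

```python
# A tiny stand-in vocabulary; with NLTK loaded this would be set(text1).
vocab = set(["whale", "circumnavigation", "sea", "characteristically"])

# Compact form, as in the text: keep w only if the condition holds.
long_words = sorted(w for w in vocab if len(w) > 15)

# The same selection written as an explicit loop.
long_words_loop = []
for word in vocab:
    if len(word) > 15:          # all other words are ignored
        long_words_loop.append(word)
long_words_loop.sort()

print(long_words)
print(long_words == long_words_loop)
```

Both forms visit each element of the vocabulary once and keep only those that pass the length test.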
Let's return to our task of finding words that characterize a text.
Notice that the long words in text4 reflect its national focus:
constitutionally, transcontinental, while
those in text5 reflect its informal content:
boooooooooooglyyyyyy and yuuuuuuuuuuuummmmmmmmmmmm.
Have we succeeded in automatically extracting words that typify
a text? Well, these very long words are often hapaxes (i.e. unique)
and perhaps it would be better to find frequently occurring
long words. This seems promising since it eliminates
frequent short words (e.g. the) and infrequent long words
(e.g. antiphilosophists).
Here are all words from the chat corpus
that are longer than 5 characters, that occur more than 5 times:
| |
>>> fdist5 = FreqDist(text5)
>>> sorted(w for w in set(text5) if len(w) > 5 and text5.count(w) > 5)
['#14-19teens', '<empty>', 'ACTION', 'anybody', 'anyone', 'around',
'cute.-ass', 'everybody', 'everyone', 'female', 'listening', 'minutes',
'people', 'played', 'player', 'really', 'seconds', 'should', 'something',
'watching']
|
|
Notice how we have used two conditions: len(w) > 5 ensures that the
words are longer than 5 characters, and text5.count(w) > 5 ensures that
these words occur more than five times. At last we have managed to
automatically identify the frequently occurring content-bearing
words of the text.
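Calling count() inside the condition rescans the whole text once for every vocabulary item, which can be slow on a large text. One possible alternative is to count all the words once and then look the counts up. The sketch below assumes collections.Counter and a made-up scrap of chat; the function name frequent_long_words is hypothetical.

```python
from collections import Counter

def frequent_long_words(tokens, length, count):
    """Return sorted types longer than `length` that occur more than `count` times."""
    counts = Counter(tokens)          # one pass over the text
    return sorted(w for w in counts
                  if len(w) > length and counts[w] > count)

# Illustrative data only, not taken from the chat corpus.
chat = ["hi", "everyone", "lol", "everyone", "anyone",
        "here", "everyone", "lol", "lol", "anyone"]
print(frequent_long_words(chat, 5, 1))
```

The two conditions play the same roles as in the text: the first filters on word length, the second on frequency.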
1.4.2 Collocations
Frequency distributions are very powerful. Here we briefly explore
a more advanced application that uses word pairs, also known as bigrams.
We can convert a list of words to a list of bigrams as follows:
| |
>>> bigrams(["more", "is", "said", "than", "done"])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>
|
|
Here we see that the pair of words than-done is a bigram, and we write
it in Python as ('than', 'done'). Now, collocations are essentially
just frequent bigrams, except that we want to pay more attention to the
cases that involve rare words. In particular, we want to find
bigrams that occur more often than we would expect based on
the frequency of individual words. The collocations() function
does this for us (we will see how it works later).
| |
>>> text4.collocations()
United States; fellow citizens; has been; have been; those who;
Declaration Independence; Old World; Indian tribes;
District Columbia; four years; Chief Magistrate; and the;
the world; years ago; Santo Domingo; Vice President;
the people; for the; specie payments; Western Hemisphere
|
|
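To make the idea of "more often than we would expect" concrete, here is a rough sketch in plain Python. It builds bigrams with zip and scores each one by dividing its observed count by the count we would expect if the two words occurred independently. This simple ratio is an assumption made for illustration; it is not necessarily the measure that collocations() uses, and the function name association_scores is our own.

```python
from collections import Counter

def bigrams(tokens):
    """Return the list of adjacent word pairs in tokens."""
    return list(zip(tokens, tokens[1:]))

def association_scores(tokens):
    """Map each bigram to observed count / count expected by chance.

    The expected count assumes the two words are independent:
    count(w1) * count(w2) / N, where N is the number of tokens.
    """
    n = float(len(tokens))
    word_counts = Counter(tokens)
    pair_counts = Counter(bigrams(tokens))
    return dict(
        (pair, count * n / (word_counts[pair[0]] * word_counts[pair[1]]))
        for pair, count in pair_counts.items()
    )

print(bigrams(["more", "is", "said", "than", "done"]))

# Illustrative data only.
tokens = "the United States and the people of the United States".split()
scores = association_scores(tokens)
print(scores[("United", "States")])
print(scores[("the", "United")])
```

Here United States scores higher than the United even though both pairs occur twice, because the is common everywhere while United and States almost always appear together.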
1.4.3 Statistics Over Secondary Data
[statistics over word lengths (multiple plots on one graph), eliminating words, normalizing words, freqdists over letters]
1.4.4 Exercises
- ◑ The demise of teen language:
Read the BBC News article: UK's Vicky Pollards 'left behind' http://news.bbc.co.uk/1/hi/education/6173441.stm.
The article gives the following statistic about teen language:
"the top 20 words used, including yeah, no, but and like, account for around a third of all words."
How many word types account for a third
of all word tokens, for a variety of text sources? What do you conclude about this statistic?
Read more about this on LanguageLog, at http://itre.cis.upenn.edu/~myl/languagelog/archives/003993.html.
1.5 Back to Python: Making Decisions and Taking Control
So far, our simple programs have been able to manipulate sequences of
words, and perform some operation on each one. We applied this to lists
consisting of a few words, but the approach works the same for lists of
arbitrary size, containing thousands of items. Thus, such programs
have some interesting qualities: (i) the ability to work with
language, and (ii) the potential to save human effort through
automation. Another useful feature of programs is their ability to
make decisions on our behalf; this is our focus in this section.
1.5.1 Conditionals
Python supports a wide range of operators like < and >= for
testing the relationship between values. The full set of these relational
operators is shown in Table 1.2.
| Operator | Relationship |
| < | less than |
| <= | less than or equal to |
| == | equal to (note this is two = signs, not one) |
| != | not equal to |
| > | greater than |
| >= | greater than or equal to |
Table 1.2: Numerical Comparison Operators
We can use these to select different words from a sentence of news text.
Here are some examples — only the operator is changed from one
line to the next.
| |
>>> [w for w in sent7 if len(w) < 4]
[',', '61', 'old', ',', 'the', 'as', 'a', '29', '.']
>>> [w for w in sent7 if len(w) <= 4]
[',', '61', 'old', ',', 'will', 'join', 'the', 'as', 'a', 'Nov.', '29', '.']
>>> [w for w in sent7 if len(w) == 4]
['will', 'join', 'Nov.']
>>> [w for w in sent7 if len(w) != 4]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'the', 'board',
'as', 'a', 'nonexecutive', 'director', '29', '.']
>>>
|
|
The above expressions involve numerical comparisons. We can also
test various properties of words, using the functions listed in
Table 1.3.
| Function | Meaning |
| s.startswith(t) | s starts with t |
| s.endswith(t) | s ends with t |
| t in s | t is contained inside s |
| s.islower() | all cased characters in s are lowercase |
| s.isupper() | all cased characters in s are uppercase |
| s.isalpha() | all characters in s are alphabetic |
| s.isalnum() | all characters in s are alphanumeric |
| s.isdigit() | all characters in s are digits |
| s.istitle() | s is titlecased |
Table 1.3: Some Word Comparison Operators
Here are some examples of these operators being used to
select words from our texts.
| |
>>> sorted(w for w in set(text1) if w.endswith("ableness"))
['comfortableness', 'honourableness', 'immutableness',
'indispensableness', 'indomitableness', 'intolerableness',
'palpableness', 'reasonableness', 'uncomfortableness']
>>> sorted(w for w in set(text4) if "gnt" in w)
['Sovereignty', 'sovereignties', 'sovereignty']
>>> sorted(w for w in set(sent7) if w.isdigit())
['29', '61']
>>>
|
|
We can also use and, or, and not:
| |
>>> sorted(w for w in set(text7) if "-" in w and "index" in w)
['Stock-index', 'index-arbitrage', 'index-fund',
'index-options', 'index-related', 'stock-index']
>>> sorted(w for w in set(text3) if w.istitle() and len(w) > 10)
['Abelmizraim', 'Allonbachuth', 'Beerlahairoi', 'Canaanitish',
'Chedorlaomer', 'Girgashites', 'Hazarmaveth', ...]
>>> sorted(w for w in set(sent7) if not w.islower())
[',', '.', '29', '61', 'Nov.', 'Pierre', 'Vinken']
>>> sorted(w for w in set(text2) if "cie" in w or "cei" in w)
['ancient', 'ceiling', 'conceit', 'conceited', 'conceive', 'conscience',
'conscientious', 'conscientiously', 'deceitful', 'deceive', ...]
|
|
1.5.2 Control Structures
Most programming languages permit us to execute a block of code only when a
condition is satisfied; in Python this is done with an if statement. In
the following program, we have created a variable called word
containing the string value 'cat'. The if statement then
checks whether the condition len(word) < 5 is true. Because the
condition is true, the body of the if statement is
executed: the print statement runs and displays a
message to the user.
| |
>>> word = "cat"
>>> if len(word) < 5:
...     print 'word length is less than 5'
...
word length is less than 5
>>>
|
|
When we use the Python interpreter we have to enter an extra blank line
so that it can detect that the nested block is complete.
If we change the conditional expression to len(word) >= 5,
to check that the length of word is greater than or equal to 5,
then the conditional expression will no longer be true.
This time, the body of the if statement will not be executed,
and no message is shown to the user:
| |
>>> if len(word) >= 5:
...     print 'word length is greater than or equal to 5'
...
>>>
|
|
An if statement is known as a control structure
because it controls whether the code in the indented block will be run.
Another control structure is the for loop:
| |
>>> for word in ['Call', 'me', 'Ishmael', '.']:
...     print word
...
Call
me
Ishmael
.
>>>
|
|
This is called a loop because Python executes the code in
circular fashion. It starts by doing word = 'Call',
effectively using the word variable to name the first
item of the list. Then it displays the value of word
to the user. Next, it moves on to the second item of the
list, and so on. It stops once every item of the list has
been processed.
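One way to picture what the loop is doing is to write out the bookkeeping by hand. The sketch below is only a mental model (Python does not literally keep a position counter like this); the names looped, unrolled and position are ours.

```python
sent = ['Call', 'me', 'Ishmael', '.']

# The for loop, collecting each item instead of printing it.
looped = []
for word in sent:
    looped.append(word)

# The same steps with the bookkeeping made explicit.
unrolled = []
position = 0
while position < len(sent):
    word = sent[position]     # use word to name the current item
    unrolled.append(word)     # run the body of the loop
    position += 1             # move on to the next item

print(looped == unrolled)
```

The for loop saves us from managing the position ourselves, which is why it is the usual way to process every item of a list.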
Now we can combine the if and for statements.
We will loop over every item of the list, and only print
the item if it ends with the letter "l". We'll pick another
name for the variable to demonstrate that Python doesn't
try to make sense of variable names.
| |
>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>> for xyzzy in sent1:
...     if xyzzy.endswith("l"):
...         print xyzzy
...
Call
Ishmael
>>>
|
|
You will notice that if and for statements
have a colon at the end of the line,
before the indentation begins. In fact, all Python
control structures end with a colon. The colon
indicates that the current statement relates to the
indented block that follows.
We can also specify an action to be taken if
the condition of the if statement is not met.
Here we see the elif (short for "else if") statement, and
the else statement. Notice that these also have
colons before the indented code.
| |
>>> for token in sent1:
...     if token.islower():
...         print "lowercase word"
...     elif token.istitle():
...         print "titlecase word"
...     else:
...         print "punctuation"
...
titlecase word
lowercase word
titlecase word
punctuation
>>>
|
|
As you can see, even with this small amount of Python knowledge,
you can start to build multi-line Python programs.
It's important to develop such programs in pieces,
testing that each piece does what you expect before
combining them into a program. This is why the Python
interactive interpreter is invaluable, and why you should get
comfortable using it.
Finally, let's combine the idioms we've been exploring.
First we create a list of cie and cei words,
then we loop over each item and print it. Notice the
comma at the end of the print statement, which tells
Python to produce its output on a single line.
| |
>>> confusing = sorted(w for w in set(text2) if "cie" in w or "cei" in w)
>>> for word in confusing:
...     print word,
...
ancient ceiling conceit conceited conceive conscience
conscientious conscientiously deceitful deceive ...
|
|
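The trailing comma belongs to the print statement of the Python version used in this chapter. In Python 3, where print is a function, the same effect is obtained with the end keyword argument. The short list below is a stand-in for confusing so that the sketch runs without NLTK.

```python
# A few of the words from the output above, used as sample data.
confusing = ['ancient', 'ceiling', 'conceit', 'conceited', 'conceive']

# Python 3: end=' ' replaces the newline that print normally adds.
for word in confusing:
    print(word, end=' ')
print()   # finish the line

# An alternative that avoids the issue: build the line, then print it once.
line = ' '.join(confusing)
print(line)
```

Joining the words first also avoids the extra space that the loop leaves at the end of the line.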
1.5.3 Functions
It often happens that part of a program needs to be used several times
over. For example, suppose we were writing a program that needed to
be able to form the plural of a singular noun, and that this needed to
be done at various places during the program. Rather than repeating
the same code several times over, it is more efficient (and reliable)
to localize this work inside a function. A function is a
programming construct that can be called with one or more inputs and
which returns an output. We define a function using the keyword
def followed by the function name and any input parameters,
followed by a colon; this in turn is followed by the body of the
function. We use the keyword return to indicate the value that is
produced as output by the function. The best way to convey this is
with an example. Our function plural() in Listing 1.1
takes a singular noun and generates a plural form (one which is not always
correct).
| |
def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    return word + 's'
|
|
| |
>>> plural('fairy')
'fairies'
>>> plural('woman')
'women'
|
|
Listing 1.1 (plural.py): Example of a Python function
(There is much more to be said about functions, but
we will hold off until Section 5.4.)
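To see why the result is "not always correct", it helps to call the function on a few more nouns. The block below repeats the definition from Listing 1.1 so that it can be run on its own; the extra test words are our own choice.

```python
def plural(word):
    """Guess the plural of a singular noun (same rules as Listing 1.1)."""
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    return word + 's'

# The first three come out right; "boy" and "fan" trigger rules
# that do not apply to them.
for noun in ['fairy', 'woman', 'dish', 'boy', 'fan']:
    print(noun + ' -> ' + plural(noun))
```

Because the rules look only at the last letters of the word, boy is treated like fairy and fan like woman.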
1.5.4 Frequency Distributions
Some of the methods defined on NLTK frequency distributions are shown in Table 1.4.
[More discussion and examples...]
| Example | Description |
| fdist['monstrous'] | count of the number of times a given sample occurred |
| fdist.freq('monstrous') | frequency of a given sample |
| fdist.N() | total number of samples |
| fdist.sorted() | the samples sorted in order of decreasing frequency |
| for sample in fdist: | iterate over the samples |
| fdist.max() | sample with the greatest count |
| fdist.plot() | graphical plot of the frequency distribution |
Table 1.4: Methods Defined in the Frequency Distribution Module
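For readers who want to experiment without NLTK installed, most of these operations have close analogues in the standard library's collections.Counter. The sketch below is an approximation under that assumption: it mirrors the table row by row on a made-up sentence, but it is not NLTK's own implementation, and plotting is left out.

```python
from collections import Counter

# Illustrative data only.
tokens = "the whale and the sea and the ship".split()
fdist = Counter(tokens)                     # stand-in for FreqDist(tokens)

count = fdist['the']                        # like fdist['the']
total = sum(fdist.values())                 # like fdist.N()
freq = fdist['the'] / float(total)          # like fdist.freq('the')
by_frequency = [w for w, _ in fdist.most_common()]   # like fdist.sorted()
most_frequent = fdist.most_common(1)[0][0]  # like fdist.max()

for sample in fdist:                        # iterating yields the samples
    print(sample + ' ' + str(fdist[sample]))
```

Note that freq is a proportion between 0 and 1, while count is the raw number of occurrences.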
1.5.5 Exercises
- ☼ Assign a new value to sent, namely the sentence
["she", "sells", "sea", "shells", "by", "the", "sea", "shore"],
then write code to perform the following tasks:
  - Print all words beginning with 'sh'.
  - Print all words longer than 4 characters.
  - Generate a new sentence that adds the popular
    hedge word 'like' before every word
    beginning with 'se'.
- ◑ What does the following Python code do? sum(len(w) for w in text1)
Can you use it to work out the average word length of a text?
- ◑ What is the difference between the test w.isupper() and
not w.islower()?
1.6 Computing with Language: Accessing Text Corpora
A text corpus is a large body of text, containing a careful balance of material in
one or more genres. We have already seen some small corpora, such as the
presidential inaugural addresses. This corpus actually contains dozens of
individual texts — one per address — but we glued them end-to-end
and treated them like chapters of a book, i.e. as a single text. In this
section we will examine a variety of text corpora and will see how to select
individual texts, and how to compare them.
1.6.1 The Gutenberg Corpus
NLTK includes a selection of texts from the Project Gutenberg electronic text archive. Let's find
out what it contains. We begin
by telling the Python interpreter to load the NLTK package,
then ask to see nltk.corpus.gutenberg.files(), the files in
NLTK's corpus of Gutenberg texts:
| |
>>> import nltk
>>> nltk.corpus.gutenberg.files()
('austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt',
'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt',
'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt')
>>>
|
|
Let's pick out the first of these texts — Emma by Jane Austen — and
give it a short name emma, then find out how many words it contains:
| |
>>> emma = nltk.corpus.gutenberg.words("austen-emma.txt")
>>> len(emma)
192432
>>>
|
|
Note
In NLTK 0.9.5 you cannot do concordancing (and other tasks from
Section 1.2) using a text
defined this way. Instead you have to do the following:
| |
>>> emma = nltk.Text(nltk.corpus.gutenberg.words("austen-emma.txt"))
>>>
|
|
It might get cumbersome to type nltk.corpus.gutenberg all the time, and there's
nothing to stop us giving this a name in the usual way, e.g. by defining a name
gutenberg = nltk.corpus.gutenberg, then using this instead. This is so common
that Python provides direct support for it:
| |
>>> from nltk.corpus import gutenberg
>>> gutenberg.files()
('austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt',
'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt',
'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt')
>>>
|
|
Let's write a short program to display other information about each text:
| |
>>> for filename in gutenberg.files():
...     r = gutenberg.raw(filename)
...     w = gutenberg.words(filename)
...     s = gutenberg.sents(filename)
...     v = set(w)
...     print filename, len(r)/len(w), len(w)/len(s), len(w)/len(v)
...
austen-emma.txt 4 21 24
austen-persuasion.txt 4 23 16
austen-sense.txt 4 24 20
bible-kjv.txt 4 33 73
blake-poems.txt 4 18 4
chesterton-ball.txt 4 17 10
chesterton-brown.txt 4 19 10
chesterton-thursday.txt 4 16 10
melville-moby_dick.txt 4 24 13
milton-paradise.txt 4 52 9
shakespeare-caesar.txt 4 12 7
shakespeare-hamlet.txt 4 13 6
shakespeare-macbeth.txt 4 13 5
whitman-leaves.txt 4 35 10
>>>
|
|
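This program has displayed the filename, followed by three statistics for each text:
average word length, average sentence length, and the number of times each vocabulary
item appears in the text on average (a measure of lexical diversity). Because each division
involves two integers, Python has truncated the results to whole numbers. Observe that all
texts have an average word length of 4 (evidently a reliable property of English text, although
the character count includes spaces, so the true figure is somewhat lower), but that they vary
greatly in sentence length (12-52 words per sentence) and diversity score (5-73).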
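To see the three statistics without rounding, we can compute them with floating-point division. The sketch below works on a tiny hand-built text rather than a Gutenberg file: raw, sents and words are stand-ins for what gutenberg.raw(), gutenberg.sents() and gutenberg.words() would return, and the function name text_statistics is hypothetical.

```python
def text_statistics(raw, words, sents):
    """Return (characters per word, words per sentence, tokens per vocabulary item)."""
    vocab = set(words)
    return (len(raw) / float(len(words)),
            len(words) / float(len(sents)),
            len(words) / float(len(vocab)))

# Illustrative data only.
raw = "The cat sat. The dog sat."
sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "."]]
words = [w for sent in sents for w in sent]

print(text_statistics(raw, words, sents))
```

The first figure is characters of raw text per token, so spaces are included in the count and punctuation marks count as words.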
This example also showed how we can access the "raw" text of the book,
not split up into words. The raw() function gives us the contents of the file
without any linguistic processing. So, for example,