Natural Language Toolkit

...software, data sets and tutorials for natural language processing...


.. -*- mode: rst -*-
.. include:: ../definitions.txt

.. TODO: add exercises for Unicode section?
.. TODO: add bullet points on regular expressions to summary
.. TODO: update cspy reference to more recent book
.. TODO: add pointers to regexp toolkits (e.g. Kodos)

.. _chap-programming:


2. Programming Fundamentals and Python

This chapter provides an introduction to the Python programming language and covers the basic programming concepts needed for the remaining chapters of the first part of the book. It contains many examples and exercises; there is no better way to learn to program than by programming and trying things out for yourself. Don't be afraid to adapt the examples to your own needs. Before you know it you will be programming!

.. _sect-calculator:


Getting Started


One of the things that makes Python easy to use is that it lets you type directly into the interactive interpreter: the program that will run your Python programs. You can run the Python interpreter through a simple graphical interface called the Interactive DeveLopment Environment (|IDLE|). On a Mac you can find the interpreter under ``Application -> MacPython``, and on Windows under ``All Programs -> Python``. Under Unix you can run Python from the shell by typing ``python``. The interpreter prints some information about your Python version; check that you are running version 2.4 or later (2.5 in this example):

.. doctest-ignore::

   Python 2.5 (r25:51918, Sep 19 2006, 08:49:13) 
   [GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin
   Type "help", "copyright", "credits" or "license" for more information.
   >>>

.. note::

  If you cannot run the Python interpreter, Python is probably not installed correctly.
  Visit |NLT-URL| for detailed instructions.


The ``>>>`` prompt indicates that the Python interpreter is waiting for input. Let's begin by using Python as a calculator:

   >>> 3 + 2 * 5 - 1
   12
   >>>

There are several things to notice here. First, once the interpreter has finished the calculation and displayed the result, the ``>>>`` prompt reappears. This means that the Python interpreter is waiting for another instruction. Second, notice that Python handles the order of operations correctly (unlike some calculators), so the multiplication ``2 * 5`` is calculated before it is added to ``3``.

Try a few more expressions of your own. You can use the asterisk (``*``) for multiplication and the slash (``/``) for division, and you can use parentheses to group expressions. You may notice something odd about division, which does not give the expected result:

   >>> 3/3
   1
   >>> 1/3
   0
   >>>

The second case is surprising because we would expect the answer to be ``0.333333``.

We will explain this behavior later in this chapter. For now, simply observe that you can work interactively with the interpreter, which lets you experiment and explore. Also, as we will see later, your intuitions about numerical expressions will be useful for manipulating other kinds of data in Python.
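As a brief preview, and only as a sketch that is not part of the original text: the result ``0`` arises because both operands are integers, and the Python 2 interpreter used in this chapter then discards the fractional part. The snippet below uses the ``//`` operator and a float operand, which behave the same way in Python 2 and Python 3; the variable names ``whole`` and ``fraction`` are made up for illustration, and ``print`` is written as a function so that the code also runs on Python 3.

```python
# Floor division keeps only the whole-number part of the quotient.
whole = 1 // 3

# Making one operand a float asks for ordinary (fractional) division.
fraction = 1.0 / 3

print(whole)
print(round(fraction, 6))
```

Writing ``1.0`` instead of ``1`` is the smallest change that produces the fractional answer most readers expect.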

You can also try nonsensical expressions to see how the interpreter handles them:

   >>> 1 +
   Traceback (most recent call last):
     File "<stdin>", line 1
       1 +
         ^
   SyntaxError: invalid syntax
   >>>

.. The actual output does not contain the "Traceback ..." line, but
   doctest does not recognize it as an exception without this line.

Here we have produced a `syntax error`:dt:. It makes no sense to end an instruction with a plus sign. The Python interpreter indicates the line where the problem occurred.

.. _sect-basics:


Understanding the Basics: Strings and Variables


Representing Text


We cannot simply type text directly into the interpreter, because it would try to interpret the text as part of the Python language:

   >>> Hello World
   Traceback (most recent call last):
     File "<stdin>", line 1
       Hello World
                 ^
   SyntaxError: invalid syntax
   >>>

.. The actual output does not contain the "Traceback ..." line, but
   doctest does not recognize it as an exception without this line.

Here we see an error message. Notice that the interpreter is confused about the position of the error: it points to the end of the text rather than the start.

Python represents a piece of text using a `string`:dt:. Strings are `delimited`:dt: |mdash| or separated from the rest of the program |mdash| by quotation marks:

   >>> 'Hello World'
   'Hello World'
   >>> "Hello World"
   'Hello World'
   >>>

We can use either single or double quotation marks, as long as we use the same ones at both the start and the end of the string.

Now we can perform calculator-like operations on strings. For example, we can guess the result of adding two strings together:

   >>> 'Hello' + 'World'
   'HelloWorld'
   >>>

When the ``+`` sign is applied to two strings, the operation is called `concatenation`:dt:. It produces a new string that is a copy of the two original strings joined end to end. Notice that concatenation does not put a space between the words. The Python interpreter has no way of knowing that you want a space: it does `exactly`:em: what it is told. Knowing what the ``+`` symbol does, you can guess the result of multiplication:

   >>> 'Hi' + 'Hi' + 'Hi'
   'HiHiHi'
   >>> 'Hi' * 3
   'HiHiHi'
   >>>

The point to take from this (apart from learning about strings) is that in Python our intuition usually works very well, so it is worth trying things out to see how they behave. You are very unlikely to break anything, so go ahead and experiment.
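Since concatenation never adds a space by itself, the space has to be supplied as a string of its own. The following is a minimal sketch, not taken from the original chapter; the names ``greeting`` and ``echo`` are hypothetical, and ``print`` is written as a function so that the code also runs on Python 3.

```python
# The space is a one-character string, concatenated like any other.
greeting = 'Hello' + ' ' + 'World'

# Repetition copies the whole string, including its trailing space.
echo = ('Hi' + ' ') * 3

print(greeting)
print(repr(echo))
```

Printing ``repr(echo)`` rather than ``echo`` makes the trailing space visible inside the quotation marks.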

Storing and Reusing Values


After a while, it can get quite tiresome to keep retyping the same expressions over and over. It would be useful to store the `value`:dt: of an expression like ``'Hi' + 'Hi' + 'Hi'`` so that we can reuse it. We do this by saving the result to a location in the computer's memory and giving that location a name. Such a named place in memory is called a `variable`:dt:. In Python we create variables by `assignment`:dt:, which involves putting a value into the variable:

   >>> msg = 'Hello World'                           # [_hw-assignment]
   >>> msg                                           # [_hw-display]
   'Hello World'                                     # [_hw-output]
   >>>

In line hw-assignment_ we have created a variable called ``msg`` (short for 'message') and set it to the string value ``'Hello World'``. We used the ``=`` operation, which `assigns`:dt: the value of the expression on its right to the variable on its left. Notice that the Python interpreter does not print any output; it only prints output when an expression returns a value, and an assignment does not return a value. In line hw-display_ we inspect the value of the variable by typing its name at the command line, that is, the name ``msg``. The interpreter prints the contents of the variable in line hw-output_.

Variables behave like values, so instead of writing ``'Hi' * 3`` we can assign the value ``'Hi'`` to the variable ``msg`` and the value ``3`` to ``num``, and then perform the multiplication using the variable names:

   >>> msg = 'Hi'
   >>> num = 3
   >>> msg * num
   'HiHiHi'
   >>>

The names we choose for variables are up to us. Instead of ``msg`` and ``num``, we could have used names such as:

   >>> marta = 'Hi'
   >>> foo123 = 3
   >>> marta * foo123
   'HiHiHi'
   >>>

|nopar| It is therefore advisable to use meaningful variable names, so that the code is more readable and both you and anyone else can easily understand what the program does. Python does not try to make sense of variable names; it blindly follows your instructions and does not object if you do something potentially confusing, such as assigning the value ``3`` to a variable called ``dos`` (Spanish for "two") with the assignment statement ``dos = 3``.

Notice that we can also assign a new value to a variable simply by using assignment again:

   >>> msg = msg * num
   >>> msg
   'HiHiHi'
   >>>

Here we have taken the value of ``msg``, multiplied it by ``3``, and stored the new string (``HiHiHi``) back in the variable ``msg``.

Printing and Inspecting Strings


So far, when we have wanted to see the contents of a variable or the result of a calculation, we have simply typed the variable name into the interpreter. We can also see the contents of the variable ``msg`` using ``print msg``:

   >>> msg = 'Hello World'
   >>> msg
   'Hello World'
   >>> print msg
   Hello World
   >>>

Looking closely, we can see that the quotation marks indicating that ``Hello World`` is a string are missing in the second case. This is because inspecting a variable by typing its name into the interactive interpreter prints Python's `representation`:dt: of its value. The ``print`` statement prints only the value itself, which in this case is just the text contained in the string.
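The two displays can also be produced explicitly. As a rough sketch that goes slightly beyond the original text, the built-in functions ``repr()`` and ``str()`` return the two forms as strings; the variable names below are hypothetical, and ``print`` is written as a function so that the code also runs on Python 3.

```python
msg = 'Hello World'

# repr() gives the representation the interpreter shows for a bare name.
shown_by_interpreter = repr(msg)

# str() gives the plain contents that print writes.
shown_by_print = str(msg)

print(shown_by_interpreter)
print(shown_by_print)
```

The first line of output keeps the quotation marks; the second does not.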

In fact, you can use a sequence of comma-separated values in a ``print`` statement:

   >>> msg2 = 'Goodbye'
   >>> print msg, msg2
   Hello World Goodbye
   >>> 

.. note::

  If you have created a variable ``v`` and want to find out more about it,
  type ``help(v)`` to read the help entry for this kind of object.
  Type ``dir(v)`` to see a list of operations defined on this object.

You need to take a little care when choosing names (or `identifiers`:dt:) for Python variables. Some choices can cause errors. First, a name must start with a letter, optionally followed by digits (``0`` to ``9``) or letters. Thus ``abc23`` is fine, but ``23abc`` will cause a syntax error. You can use underscores (``_``), both within and at the start of the variable name, but not a hyphen (``-``), since this is interpreted as an arithmetic operator. The following fragment shows a second problem:

   >>> not = "don't do this"
     File "<stdin>", line 1
       not = "don't do this"
       ^
   SyntaxError: invalid syntax

Why does an error occur here? Because ``not`` is one of Python's 30 or so reserved words. These identifiers are used in specific syntactic contexts and cannot be used as variable names. Reserved words are easy to spot if you use |IDLE|, since they are highlighted in orange.
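The reserved words can also be listed from within Python. The sketch below is an addition to the original text and relies on the standard library's ``keyword`` module; the exact number of reserved words depends on the Python version running the code, so no count is claimed here. ``print`` is written as a function, which assumes Python 3.

```python
import keyword

# keyword.kwlist holds the reserved words of the interpreter running this code.
print(len(keyword.kwlist))

# iskeyword() reports whether a name is reserved and so unusable as a variable.
for name in ['not', 'msg']:
    print(name, keyword.iskeyword(name))
```

Checking a candidate name this way avoids the ``SyntaxError`` shown above.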

Creating Programs with a Text Editor


The Python interactive interpreter executes instructions as you type them. It is often more convenient to compose a multi-line program in a text editor and then have Python run the whole program at once. In |IDLE| you can do this by going to the ``File`` menu and opening a new window. Try this now, and enter the following one-line program:

    msg = 'Hello World'

Save this program in a file called ``test.py``, then go to the ``Run`` menu and select ``Run Module``. The result shown in the main |IDLE| window will look like this:

.. doctest-ignore::

   >>> ================================ RESTART ================================
   >>> 
   >>>     

But where is the output showing the value of ``msg``? The answer is that the program in ``test.py`` will show a value only if you explicitly tell it to, using the ``print`` statement. So add another line, so that ``test.py`` looks like this:

    msg = 'Hello World'
    print msg

Select ``Run Module`` again, and this time you should get output like the following:

.. doctest-ignore::

   >>> ================================ RESTART ================================
   >>> 
   Hello World
   >>> 

From now on you can choose between the interactive interpreter and a text editor for creating your programs. It is often convenient to try out your ideas in the interpreter, revising a line of code until it does what you expect, and consulting the interactive help facilities. Once everything works, you can paste the code (minus any ``>>>`` prompts) into the text editor, continue to expand the program in this way, and finally save it to a file so that you don't have to retype it.

Exercises


1. |Fácil| Start up the Python interpreter (e.g. by running |IDLE|). Try the examples in section sect-calculator_, and experiment with using Python as a calculator.

2. |Fácil| Try the examples in this section, then try the following:

  a) Create a variable called ``msg`` and put a message of your own in this variable. Remember that strings need to be quoted, so you will need to type something like:

     >>> msg = "I like NLP!"

  b) Now print the contents of this variable in two ways: first by simply typing the variable name and pressing enter, then by using the ``print`` statement.

  c) Try various arithmetic expressions using this string, e.g. ``msg + msg`` and ``5 * msg``.

  d) Define a new string ``hello`` and then try ``hello + msg``. Change the ``hello`` string so that it ends with a space character, and try ``hello + msg`` again.

3. |Debate| Describe the steps you would go through to find the ten most frequent words in a two-page document.


Slicing and Dicing


Strings are so important that we will spend some more time on them. In this section we will learn how to access the individual `characters`:dt: that make up a string, how to extract arbitrary `substrings`:dt:, and how to reverse a string.

Accessing Individual Characters


The positions in a string are numbered, starting from zero. To access a position within a string, we specify the position inside square brackets:

   >>> msg = 'Hello World'
   >>> msg[0]
   'H'
   >>> msg[3]
   'l'
   >>> msg[5]
   ' '
   >>>

This is called `indexing`:dt: or `subscripting`:dt: the string. The position we specify inside the square brackets is called the `index`:dt:. We can retrieve not only letters but any character, such as the space at index ``5``.

.. note:: Be careful to distinguish between the string ``' '``, which is a single whitespace character, and ``''``, which is the empty string.

The fact that strings are indexed from zero may seem counter-intuitive. To avoid mistakes, you can think of indexes as giving you the position immediately `before`:em: a character in the string, as shown in figure indexing01_.


.. _indexing01:
.. figure:: ../images/indexing01.png
   :scale: 40

   String indexing

And now, what happens if we try to access an index that is outside the string?

   >>> msg[11]
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   IndexError: string index out of range
   >>>

The index ``11`` is outside the range of valid indices (i.e., 0 to 10) for the string ``'Hello World'``. This results in an error message. This time it is not a syntax error; the program fragment is syntactically correct. Instead, the error occurred while the program was running. The ``Traceback`` message indicates which line the error occurred on (line 1 of "standard input"). It is followed by the name of the error, ``IndexError``, and a brief explanation.

In general, how do we know which indices we can access? If we know that the length of the string is `n`:math:, the highest valid index will be `n-1`:math:. We can get the length of a string using the ``len()`` function.

   >>> len(msg)
   11
   >>>

Informally, a `function`:dt: is a named piece of code that provides a service to our program when we `call`:dt: or execute it by name. We call the ``len()`` function by putting parentheses after the name and giving it the string ``msg`` whose length we want to know. Because ``len()`` is built into the Python interpreter, |IDLE| colors it purple.

We have seen what happens when the index is too large. What about when it is too small? Let's see what happens when we use values less than zero:

   >>> msg[-1]
   'd'
   >>>

This does not generate an error. Negative indices count from the `end`:em: of the string, so ``-1`` indexes the last character, which is ``'d'``.

   >>> msg[-3]
   'r'
   >>> msg[-6]
   ' '
   >>>

Here the computer works out the location in memory relative to the string's address plus its length, minus the index, e.g. ``3136 + 11 - 1 = 3146``.

We can also visualize negative indices as shown in figure indexing02_.

.. _indexing02:
.. figure:: ../images/indexing02.png
   :scale: 40

   Negative indices

Thus we have two ways to access the characters in a string, from the start or from the end. For example, we can access the space between ``Hello`` and ``World`` with either ``msg[5]`` or ``msg[-6]``; both refer to the same position, because ``5 = len(msg) - 6``.
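The relationship between the two numbering schemes can be checked for every position. The following is a minimal sketch that is not part of the original text; it uses the built-in ``range()`` function, which this chapter has not introduced, and writes ``print`` as a function so that the code also runs on Python 3.

```python
msg = 'Hello World'

# For every k from 1 to len(msg), index -k names the same character
# as index len(msg) - k.
for k in range(1, len(msg) + 1):
    assert msg[-k] == msg[len(msg) - k]

same_space = msg[5] == msg[-6]
print(same_space)
```

The loop finishes without an error only if every negative index agrees with its non-negative counterpart.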

Accessing Substrings


In |PLN| we usually want to access more than one character at a time. This is also quite simple: we just need to specify a start and an end index. For example, the following code accesses the substring starting at index ``1``, up to (but not including) index ``4``:

   >>> msg[1:4]
   'ell'
   >>>

The notation ``1:4`` is known as a `slice`:dt:. Here we see that the characters are ``'e'``, ``'l'`` and ``'l'``, which correspond to ``msg[1]``, ``msg[2]`` and ``msg[3]``, but not ``msg[4]``. This is because a slice `starts`:em: at the first index but finishes `one before`:em: the end index. This is consistent with indexing: indices also start from zero and go up to `one before`:em: the length of the string. We can see this by slicing with the value of ``len()``:

   >>> len(msg)
   11
   >>> msg[0:11]
   'Hello World'
   >>>

We can also slice with negative indices |mdash| the same basic rule of starting from the start index and stopping one before the end index applies; in this example we stop before the space character:

   >>> msg[0:-6]
   'Hello'
   >>>

Python provides two shortcuts for commonly used slice values. If the start index is ``0`` you can leave it out, and if the end index is the length of the string you can leave it out as well:

   >>> msg[:3]
   'Hel'
   >>> msg[6:]
   'World'
   >>>

The first example selects the first three characters of the string, and the second selects from the character at index 6, namely ``'W'``, to the end of the string.
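One consequence of the "stop one before the end index" rule is that the two shortcuts fit together without overlap. The sketch below is an illustration added here, with hypothetical variable names ``head`` and ``tail``; ``print`` is written as a function so that the code also runs on Python 3.

```python
msg = 'Hello World'

head = msg[:3]   # same as msg[0:3]
tail = msg[3:]   # same as msg[3:len(msg)]

# Splitting at an index and concatenating the halves rebuilds the string.
print(head + tail == msg)

# A slice whose indices are in range contains (end - start) characters.
print(len(msg[1:4]))
```

Because the character at index 3 belongs only to ``tail``, nothing is duplicated or lost.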

Exercises


1. |fácil| Define a string ``s = 'colorless'``. Write a Python statement that changes this to "colourless" using only the slice and concatenation operations.

2. |fácil| Try the slice examples from this section using the interactive interpreter. Then try some more of your own. Guess what the result will be before executing the command.

3. |fácil| We can use the slice notation to remove morphological endings from words. For example, ``'dogs'[:-1]`` removes the last character of ``dogs``, leaving ``dog``. Use slice notation to remove the affixes from the following words (we have inserted a hyphen to indicate the affix boundary, but omit this from your strings): ``dish-es``, ``run-ning``, ``nation-ality``, ``un-do``, ``pre-heat``.

4. |fácil| We saw how to generate an ``IndexError`` by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?

5. |fácil| We can also specify a "step" size for the slice. The following examples return every second character within the slice, in a forward or reverse direction:

      >>> msg[6:11:2]
      'Wrd'
      >>> msg[10:5:-2]
      'drW'
      >>>

   Experiment with different step values.

6. |fácil| What happens if you ask the interpreter to evaluate ``msg[::-1]``? Explain why this is a reasonable result.



Strings, Sequences, and Sentences


We have seen how words like `Hello`:lx: can be stored as a string ``'Hello'``. Whole sentences can also be stored in strings and manipulated as before, as we can see here for `Chomsky's famous nonsense sentence <http://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_furiously>`_:

   >>> sent = 'colorless green ideas sleep furiously'
   >>> sent[16:21]
   'ideas'
   >>> len(sent)
   37
   >>>

However, it is not a very good idea to treat a sentence as a sequence of characters, because this makes it hard to access each word. Instead, we may prefer to represent a sentence as a sequence of its `words`:em:\ ; as a result, indexing a sentence gives access to its words rather than its characters. We will now see how to do this.

Lists


A `list`:dt: is designed to store a sequence of values. A list is similar to a string in some ways, except that the individual items need not be characters; they can be arbitrary strings, integers, or even other lists.

A Python list is represented as a sequence of comma-separated items, delimited by square brackets. Here are some lists:

   >>> squares = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
   >>> shopping_list = ['juice', 'muffins', 'bleach', 'shampoo']

We can also store sentences and phrases using lists. In the following example we build part of Chomsky's sentence as a list and put it in a variable called ``cgi``:

   >>> cgi = ['colorless', 'green', 'ideas']
   >>> cgi
   ['colorless', 'green', 'ideas']
   >>>

Because lists and strings are both kinds of sequence, they can be processed in similar ways; just like strings, lists support the ``len()`` function as well as indexing and slicing. The following example applies these operations to the list ``cgi``:

   >>> len(cgi)
   3
   >>> cgi[0]
   'colorless'
   >>> cgi[-1]
   'ideas'
   >>> cgi[-5]
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   IndexError: list index out of range
   >>>

Here, ``cgi[-5]`` generates an error because the fifth-last item in a three-item list would occur before the list started, i.e., it is undefined. We can also slice lists in exactly the same way as strings:

   >>> cgi[1:3]
   ['green', 'ideas']
   >>> cgi[-2:]
   ['green', 'ideas']
   >>>

Lists can be concatenated just like strings. Here we put the resulting list into a new variable ``chomsky``. The original variable ``cgi`` is not changed in the process:

   >>> chomsky = cgi + ['sleep', 'furiously']
   >>> chomsky
   ['colorless', 'green', 'ideas', 'sleep', 'furiously']
   >>> cgi
   ['colorless', 'green', 'ideas']
   >>>

Now, lists and strings do not have exactly the same functionality. Lists have the added power that you can change their elements. Suppose we want to change the 0th element of ``cgi`` to ``'colorful'``; we can do that by assigning the new value to the index ``cgi[0]``:

   >>> cgi[0] = 'colorful'
   >>> cgi
   ['colorful', 'green', 'ideas']
   >>>

|nopar| On the other hand, if we try to do that with a *string* |mdash| changing the 0th character in ``msg`` to ``'J'`` |mdash| we get:

.. doctest-ignore::

   >>> msg[0] = 'J'
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   TypeError: object does not support item assignment
   >>>

|nopar| This is because strings are `immutable`:dt: |mdash| you cannot change a string once you have created it. Lists, however, are `mutable`:dt:, and their contents can be modified at any time. As a result, lists support a number of operations, or `methods`:dt:, that modify the original value rather than returning a new value. A method is a function associated with a particular object. To call a method on an object, we give the name of the object, then a period, then the name of the method, and finally the parentheses containing any arguments. The following example uses the methods ``sort()`` and ``reverse()``:

   >>> chomsky.sort()
   >>> chomsky.reverse()
   >>> chomsky
   ['sleep', 'ideas', 'green', 'furiously', 'colorless']
   >>>

As you can see, the prompt reappears on the line immediately after ``chomsky.sort()`` and ``chomsky.reverse()``. That is because these methods do not produce a new list, but instead modify the original list stored in the variable ``chomsky``.
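The difference between modifying a list in place and producing a new one can be made visible by comparing the ``sort()`` method with the built-in ``sorted()`` function. ``sorted()`` is not covered in the original text, so treat this as a hedged side-by-side sketch; the names ``words``, ``ordered_copy`` and ``result`` are hypothetical, and ``print`` is written as a function so that the code also runs on Python 3.

```python
words = ['colorless', 'green', 'ideas', 'sleep', 'furiously']

# sorted() builds a new list and leaves `words` untouched.
ordered_copy = sorted(words)
unchanged = words[0] == 'colorless' and words[-1] == 'furiously'

# list.sort() reorders `words` itself and returns None.
result = words.sort()

print(ordered_copy)
print(unchanged, result)
print(words == ordered_copy)
```

Because ``sort()`` returns ``None``, the interpreter has nothing to display after the call, which is why only the prompt reappears.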

Lists also have an ``append()`` method for adding items to the end of the list and an ``index()`` method for finding the index of particular items in the list:

   >>> chomsky.append('said')
   >>> chomsky.append('Chomsky')
   >>> chomsky
   ['sleep', 'ideas', 'green', 'furiously', 'colorless', 'said', 'Chomsky']
   >>> chomsky.index('green')
   2
   >>>

Finally, just as a reminder, you can create lists of any values you like. As the following lexical-entry example shows, the values in a list do not even have to have the same type (although mixing types is usually discouraged, as we will explain in section sec-back-to-the-basics_).

   >>> bat = ['bat', [[1, 'n', 'flying mammal'], [2, 'n', 'striking instrument']]]
   >>>

Working on Sequences One Item at a Time


We have seen how to create lists, and how to index and manipulate them in various ways. Often it is useful to step through a list and process each item. We can do this with a ``for`` loop. This is our first example of a `control structure`:dt: in Python, a statement that `controls`:em: how other statements are run:

   >>> for num in [1, 2, 3]:
   ...     print 'The number is', num
   ... 
   The number is 1
   The number is 2
   The number is 3

The interactive interpreter changes the prompt from ``>>>`` to ``...`` after encountering the colon at the end of the first line. This prompt indicates that the interpreter is expecting an indented block of code to appear next. However, it is up to you to do the indentation. To finish the indented block just enter a blank line.

The ``for`` loop has the general form: ``for`` *variable* ``in`` *sequence* followed by a colon, then an indented block of code. The first time through the loop, the variable is assigned to the first item in the sequence, i.e. ``num`` has the value ``1``. This program runs the statement ``print 'The number is', num`` for this value of ``num``, before returning to the top of the loop and assigning the second item to the variable. Once all items in the sequence have been processed, the loop finishes.

Now let's try the same idea with a list of words:

   >>> chomsky = ['colorless', 'green', 'ideas', 'sleep', 'furiously']
   >>> for word in chomsky:
   ...     print len(word), word[-1], word
   ...
   9 s colorless
   5 n green
   5 s ideas
   5 p sleep
   9 y furiously

The first time through this loop, the variable is assigned the value ``'colorless'``. This program runs the statement ``print len(word), word[-1], word`` for this value, to produce the output line: ``9 s colorless``. This process is known as `iteration`:dt:. Each iteration of the ``for`` loop starts by assigning the next item of the list ``chomsky`` to the `loop variable`:dt: ``word``. Then the indented `body`:dt: of the loop is run. Here the body consists of a single command, but in general the body can contain as many lines of code as you want, so long as they are all indented by the same amount. (We recommend that you always use exactly 4 spaces for indentation, and that you never use tabs.)
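To illustrate a loop body with more than one line, here is a small sketch that is not from the original text. It is written for Python 3, where ``print`` is a function rather than the statement used elsewhere in this chapter, and the variable names ``first`` and ``last`` are made up for the example.

```python
chomsky = ['colorless', 'green', 'ideas', 'sleep', 'furiously']

for word in chomsky:
    # All three statements share the same 4-space indentation, so they
    # all belong to the loop body and run once per word.
    first = word[0]
    last = word[-1]
    print(first + '...' + last, len(word))

# This line is not indented, so it runs once, after the loop has finished.
print('done')
```

After the loop ends, the loop variable still holds the last item that was processed.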

We can run another ``for`` loop over the Chomsky nonsense sentence, and calculate the average word length. As you will see, this program uses the ``len()`` function in two ways: to count the number of characters in a word, and to count the number of words in a phrase. Note that ``x += y`` is shorthand for ``x = x + y``; this idiom allows us to `increment`:dt: the ``total`` variable each time the loop is run.

   >>> total = 0
   >>> for word in chomsky:
   ...     total += len(word)
   ...
   >>> total / len(chomsky)
   6
   >>>
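The result ``6`` is a whole number because ``total`` and ``len(chomsky)`` are both integers, and Python 2 discards the fractional part when dividing integers. The word lengths sum to 33, so the exact average is 6.6. As a minimal sketch under the assumption that a fractional average is wanted, one operand can be converted with the built-in ``float()`` function, which this section has not introduced; ``print`` is written as a function so that the code also runs on Python 3.

```python
chomsky = ['colorless', 'green', 'ideas', 'sleep', 'furiously']

total = 0
for word in chomsky:
    total += len(word)

# Converting one operand to a float keeps the fractional part of the average.
average = float(total) / len(chomsky)
print(total, average)
```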

We can also write ``for`` loops to iterate over the characters in strings. This ``print`` statement ends with a trailing comma, which is how we tell Python not to print a newline at the end.

   >>> sent = 'colorless green ideas sleep furiously'
   >>> for char in sent:
   ...     print char,
   ... 
   c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y
   >>>

A note of caution: we have now iterated over words and characters, using expressions like ``for word in chomsky:`` and ``for char in sent:``. Remember that, to Python, ``word`` and ``char`` are meaningless variable names, and we could just as well have written ``for foo123 in sent:``. The interpreter simply iterates over the items in the sequence, quite oblivious to what kind of object they represent, e.g.:

   >>> for foo123 in 'colorless green ideas sleep furiously':
   ...     print foo123,
   ... 
   c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y
   >>> for foo123 in ['colorless', 'green', 'ideas', 'sleep', 'furiously']:
   ...     print foo123,
   ... 
   colorless green ideas sleep furiously
   >>>

However, you should try to choose 'sensible' names for loop variables because it will make your code more readable.

String Formatting


The output of a program is usually structured to make the information easily digestible by a reader. Instead of running some code and then manually inspecting the contents of a variable, we would like the code to tabulate some output. We already saw this above in the first ``for`` loop example that used a list of words, where each line of output was similar to ``5 p sleep``, consisting of a word length, the last character of the word, then the word itself.

There are many ways we might want to format such output. For instance, we might want to place the length value in parentheses `after`:em: the word, and print all the output on a single line:

   >>> for word in chomsky:
   ...     print word, '(', len(word), '),',
   ...
   colorless ( 9 ), green ( 5 ), ideas ( 5 ), sleep ( 5 ), furiously ( 9 ),
   >>>

However, this approach has a couple of problems. First, the ``print`` statement intermingles variables and punctuation, making it a little difficult to read. Second, the output has spaces around every item that was printed. A cleaner way to produce structured output uses Python's `string formatting expressions`:dt:. Before diving into clever formatting tricks, however, let's look at a really simple example. We are going to use a special symbol, ``%s``, as a placeholder in strings. Once we have a string containing this placeholder, we follow it with a single ``%`` and then a value ``v``. Python then returns a new string where ``v`` has been slotted in to replace ``%s``:

   >>> "I want a %s right now" % "coffee"
   'I want a coffee right now'
   >>>

In fact, we can have a number of placeholders, but following the ``%`` operator we need to put in a tuple with exactly the same number of values:

   >>> "%s wants a %s %s" % ("Lee", "sandwich", "for lunch")
   'Lee wants a sandwich for lunch'
   >>>

We can also provide the values for the placeholders indirectly. Here's an example using a ``for`` loop:

   >>> menu = ['sandwich', 'spam fritter', 'pancake']
   >>> for snack in menu:
   ...     "Lee wants a %s right now" % snack
   ... 
   'Lee wants a sandwich right now'
   'Lee wants a spam fritter right now'
   'Lee wants a pancake right now'
   >>>

We oversimplified things when we said that placeholders were of the form ``%s``; in fact, this is a complex object, called a `conversion specifier`:dt:. This has to start with the ``%`` character, and ends with a conversion character such as ``s`` or ``d``. The ``%s`` specifier tells Python that the corresponding variable is a string (or should be converted into a string), while the ``%d`` specifier indicates that the corresponding variable should be converted into a decimal representation. The string containing conversion specifiers is called a `format string`:dt:.

Picking up on the ``print`` example that we opened this section with, here's how we can use two different kinds of conversion specifier:

   >>> for word in chomsky:
   ...     print "%s (%d)," % (word, len(word)),
   colorless (9), green (5), ideas (5), sleep (5), furiously (9), 
   >>>

To summarize, string formatting is accomplished with a three-part object having the syntax: `format`:ph: ``%`` `values`:ph:. The `format`:ph: section is a string containing format specifiers such as ``%s`` and ``%d`` that Python will replace with the supplied values. The `values`:ph: section of a formatting string is a tuple containing exactly as many items as there are format specifiers in the `format`:ph: section. In the case that there is just one item, the parentheses can be left out. (We will discuss Python's string-formatting expressions in more detail in Section sec-lining-things-up_).
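The summary above can be sketched in a few lines. This is only an illustration: the variable names are made up for the example, and the code avoids ``print`` so that it should run unchanged under both Python 2 and Python 3.

```python
word = 'sleep'

# One value: the parentheses around the value can be left out.
one = "Word = %s" % word

# Two specifiers: the values must be supplied as a tuple of two items.
two = "%s (%d)" % (word, len(word))

# Too few values for the specifiers raises a TypeError.
mismatch_raised = False
try:
    "%s (%d)" % (word,)
except TypeError:
    mismatch_raised = True
```

Here ``one`` is ``'Word = sleep'`` and ``two`` is ``'sleep (5)'``; the last expression fails because the format string has two specifiers but the tuple holds only one value.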

In the above example, we used a trailing comma to suppress the printing of a newline. Suppose, on the other hand, that we want to introduce some additional newlines in our output. We can accomplish this by inserting the "special" character ``\n`` into the ``print`` string:

   >>> for word in chomsky:
   ...     print "Word = %s\nIndex = %s\n*****" % (word, chomsky.index(word))
   ...
   Word = colorless
   Index = 0
   *****
   Word = green
   Index = 1
   *****
   Word = ideas
   Index = 2
   *****
   Word = sleep
   Index = 3
   *****
   Word = furiously
   Index = 4
   *****
   >>> 
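One caveat about the example above: ``index()`` reports the position of the *first* matching item, so it only gives the expected numbering when every word in the list is different. The sketch below, which uses a made-up word list and the built-in ``enumerate()`` function (not otherwise covered in this section), shows the difference.

```python
words = ['the', 'cat', 'saw', 'the', 'dog']

# index() always finds the first occurrence, so the second 'the' maps to 0.
by_index = []
for w in words:
    by_index.append(words.index(w))

# enumerate() pairs each item with its actual position in the list.
by_position = []
for i, w in enumerate(words):
    by_position.append(i)
```

With this list, ``by_index`` is ``[0, 1, 2, 0, 4]`` while ``by_position`` is ``[0, 1, 2, 3, 4]``.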


Converting Between Strings and Lists


Often we want to convert between a string containing a space-separated list of words and a list of strings. Let's first consider turning a list into a string. One way of doing this is as follows:

   >>> s = ''
   >>> for word in chomsky:
   ...     s += ' ' + word
   ...
   >>> s
   ' colorless green ideas sleep furiously'
   >>>

One drawback of this approach is that we have an unwanted space at the start of ``s``. It is more convenient to use the ``join()`` method. We specify the string to be used as the "glue", followed by a period, followed by the ``join()`` function.

   >>> sent = ' '.join(chomsky)
   >>> sent
   'colorless green ideas sleep furiously'
   >>>

So ``' '.join(chomsky)`` means: take all the items in ``chomsky`` and concatenate them as one big string, using ``' '`` as a spacer between the items.

Now let's try to reverse the process: that is, we want to convert a string into a list. Again, we could start off with an empty list ``[]`` and ``append()`` to it within a ``for`` loop. But as before, there is a more succinct way of achieving the same goal. This time, we will `split`:em: the new string ``sent`` on whitespace:

.. why did we have this? >>> sent.split(' ')

   >>> sent.split()
   ['colorless', 'green', 'ideas', 'sleep', 'furiously']
   >>> 

To consolidate your understanding of joining and splitting strings, let's try the same thing using a semicolon as the separator:

   >>> sent = ';'.join(chomsky)
   >>> sent
   'colorless;green;ideas;sleep;furiously'
   >>> sent.split(';')
   ['colorless', 'green', 'ideas', 'sleep', 'furiously']
   >>>
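As a quick check of the idea that ``join()`` and ``split()`` undo each other, here is a minimal sketch that tries several separators in turn. The list of separators is an arbitrary choice for illustration.

```python
chomsky = ['colorless', 'green', 'ideas', 'sleep', 'furiously']

# For each "glue" string, joining and then splitting on the same
# string should give back the original list of words.
round_trips = []
for glue in [' ', ';', ', ']:
    joined = glue.join(chomsky)
    round_trips.append(joined.split(glue) == chomsky)
```

Every entry of ``round_trips`` is ``True``. The round trip works here because none of the words contains the separator itself.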

To be honest, many people find the notation for ``join()`` rather unintuitive. There is another function for converting lists to strings, again called ``join()``, which takes the list as its argument. It uses whitespace by default as the "glue". However, we need to explicitly `import`:dt: this function into our code. One way of doing this is as follows:

   >>> import string
   >>> string.join(chomsky)
   'colorless green ideas sleep furiously'  
   >>>

Here, we imported something called ``string``, and then called the function ``string.join()``. In passing, if we want to use something other than whitespace as "glue", we just specify this as a second parameter:

   >>> string.join(chomsky, ';')
   'colorless;green;ideas;sleep;furiously'
   >>>

We will see other examples of statements with ``import`` later in this chapter. In general, we use ``import`` statements when we want to get access to Python code that doesn't already come as part of core Python. This code will exist somewhere as one or more files. Each such file corresponds to a Python `module`:dt: |mdash| this is a way of grouping together code and data that we regard as reusable. When you write down some Python statements in a file, you are in effect creating a new Python module. And you can make your code depend on another module by using the ``import`` statement. In our example earlier, we imported the module ``string`` and then used the ``join()`` function from that module. By adding ``string.`` to the beginning of ``join()``, we make it clear to the Python interpreter that the definition of ``join()`` is given in the ``string`` module. An alternative, and equally valid, approach is to use the ``from`` *module* ``import`` *identifier* statement, as shown in the next example:

   >>> from string import join
   >>> join(chomsky)
   'colorless green ideas sleep furiously'  
   >>>

In this case, the name ``join`` is added to all the other identifiers that we have defined in the body of our program, and we can use it to call a function like any other.
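The two import styles can be compared side by side. Note that ``string.join()`` exists only in Python 2, so this sketch substitutes ``string.capwords()``, another function from the same module that is available in both Python 2 and Python 3; the choice of function is ours, not the chapter's.

```python
# Style 1: import the module, then qualify the function name with "string."
import string
qualified = string.capwords('colorless green ideas')

# Style 2: import just the identifier and call it without a prefix.
from string import capwords
unqualified = capwords('colorless green ideas')
```

Both calls return ``'Colorless Green Ideas'``; the only difference is how the name is written at the point of use.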

.. Note:: If you are creating a file to contain some of your Python
   code, do *not* name your file ``nltk.py``: it may get imported in
   place of the "real" NLTK package. (When it imports modules, Python
   first looks in the current folder / directory.)

Mini-Review


Strings and lists are both kinds of `sequence`:dt:. As such, they can both be indexed and sliced:

   >>> query = 'Who knows?'
   >>> beatles = ['john', 'paul', 'george', 'ringo']
   >>> query[2]
   'o'
   >>> beatles[2]
   'george'
   >>> query[:2]
   'Wh'
   >>> beatles[:2]
   ['john', 'paul']
   >>>

Similarly, strings can be concatenated and so can lists (though not with each other!):

   >>> newstring = query + " I don't"
   >>> newlist = beatles + ['brian', 'george']

What's the difference between strings and lists as far as NLP is concerned? As we will see in Chapter chap-words_, when we open a file for reading into a Python program, what we get initially is a string, corresponding to the contents of the whole file. If we try to use a ``for`` loop to process the elements of this string, all we can pick out are the individual characters in the string |mdash| we don't get to choose the granularity. By contrast, the elements of a list can be as big or small as we like: for example, they could be paragraphs, sentences, phrases, words or characters. So lists have this huge advantage, that we can be really flexible about the elements they contain, and correspondingly flexible about what the downstream processing will act on. So one of the first things we are likely to do in a piece of NLP code is convert a string into a list (of strings). Conversely, when we want to write our results to a file, or to a terminal, we will usually convert them to a string.
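The point about granularity can be seen in a small sketch. The text used here is an arbitrary example; the same loop is run once over the raw string and once over the list produced by ``split()``.

```python
text = 'how now brown cow'

# Looping over a string yields one character at a time.
chars = []
for item in text:
    chars.append(item)

# Looping over a list of words yields one word at a time.
words = []
for item in text.split():
    words.append(item)
```

The string yields 17 single-character items (including the spaces), whereas the list yields the four words ``['how', 'now', 'brown', 'cow']``.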

Exercises


1. |easy| Using the Python interactive interpreter, experiment with
   the examples in this section.  Think of a sentence and
   represent it as a list of strings, e.g. ``['Hello', 'world']``.
   Try the various operations for indexing, slicing and sorting the elements
   of your list.  Extract individual items (strings), and perform
   some of the string operations on them.

#. |easy| Split ``sent`` on some other character, such as ``'s'``.

#. |easy| We pointed out that when ``phrase`` is a list, ``phrase.reverse()``
   modifies ``phrase`` itself rather than creating a new list. On
   the other hand, we can use the slice trick mentioned in the
   exercises for the previous section, ``[::-1]``, to create a `new`:em: reversed list
   without changing ``phrase``. Show how you can confirm this
   difference in behavior.

#. |easy| We have seen how to represent a sentence as a list of words, where
   each word is a sequence of characters.  What does ``phrase1[2][2]`` do?
   Why?  Experiment with other index values.

#. |easy| Write a ``for`` loop to print out the characters of a string, one per line.

#. |easy| What is the difference between calling ``split`` on a string
   with no argument or with ``' '`` as the argument,
   e.g. ``sent.split()`` versus ``sent.split(' ')``?  What happens
   when the string being split contains tab characters, consecutive
   space characters, or a sequence of tabs and spaces?  (In IDLE you
   will need to use ``'\t'`` to enter a tab character.)

#. |easy| Create a variable ``words`` containing a list of words.
   Experiment with ``words.sort()`` and ``sorted(words)``.
   What is the difference?

#. |easy| Earlier, we asked you to use a text editor to create a file
   called ``test.py``, containing the single line ``msg = 'Hello
   World'``. If you haven't already done this (or can't find the file),
   go ahead and do it now. Next, start up a new session with the
   Python interpreter, and enter the expression ``msg`` at the prompt.
   You will get an error from the interpreter. Now, try the following
   (note that you have to leave off the ``.py`` part of the filename):

      >>> from test import msg
      >>> msg

   This time, Python should return with a value. You can also try
   ``import test``, in which case Python should be able to
   evaluate the expression ``test.msg`` at the prompt.

#. |soso| Process the list ``chomsky`` using a ``for`` loop, and store the
   length of each word in a new list ``lengths``.  Hint: begin by assigning the
   empty list to ``lengths``, using ``lengths = []``. Then each time
   through the loop, use ``append()`` to add another length value to
   the list.

#. |soso| Define a variable ``silly`` to contain the string:
   ``'newly formed bland ideas are inexpressible in an infuriating
   way'``.  (This happens to be the legitimate interpretation that
   bilingual English-Spanish speakers can assign to Chomsky's
   famous phrase, according to Wikipedia).  Now write code to perform
   the following tasks:

   a) Split ``silly`` into a list of strings, one per
      word, using Python's ``split()`` operation, and save
      this to a variable called ``bland``.
   b) Extract the second letter of each word in ``silly`` and join
      them into a string, to get ``'eoldrnnnna'``.
   c) Combine the words in ``bland`` back into a single string, using ``join()``.
      Make sure the words in the resulting string are separated with
      whitespace.
   d) Print the words of ``silly`` in alphabetical order, one per line.

#. |soso| The ``index()`` function can be used to look up items in sequences.
   For example, ``'inexpressible'.index('e')`` tells us the index of the
   first position of the letter ``e``.

   a) What happens when you look up a substring, e.g. ``'inexpressible'.index('re')``?

   b) Define a variable ``words`` containing a list of words.  Now use ``words.index()``
      to look up the position of an individual word.

   c) Define a variable ``silly`` as in the exercise above.
      Use the ``index()`` function in combination with list slicing to
      build a list ``phrase`` consisting of all the words up to (but not
      including) ``in`` in ``silly``.



Making Decisions


So far, our simple programs have been able to manipulate sequences of words, and perform some operation on each one. We applied this to lists consisting of a few words, but the approach works the same for lists of arbitrary size, containing thousands of items. Thus, such programs have some interesting qualities: (i) the ability to work with language, and (ii) the potential to save human effort through automation. Another useful feature of programs is their ability to `make decisions`:em: on our behalf; this is our focus in this section.

Making Simple Decisions


Most programming languages permit us to execute a block of code when a `conditional expression`:dt:, or ``if`` statement, is satisfied. In the following program, we have created a variable called ``word`` containing the string value ``'cat'``. The ``if`` statement then checks whether the condition ``len(word) < 5`` is true. Because the conditional expression is true, the body of the ``if`` statement is invoked and the ``print`` statement is executed.

   >>> word = "cat"
   >>> if len(word) < 5:
   ...   print 'word length is less than 5'
   ... 
   word length is less than 5
   >>>

If we change the conditional expression to ``len(word) >= 5``, to check that the length of ``word`` is greater than or equal to ``5``, then the conditional expression will no longer be true, and the body of the ``if`` statement will not be run:

   >>> if len(word) >= 5:
   ...   print 'word length is greater than or equal to 5'
   ... 
   >>>

The ``if`` statement, just like the ``for`` statement above, is a `control structure`:dt:. An ``if`` statement is a control structure because it controls whether the code in the body will be run. You will notice that both ``if`` and ``for`` have a colon at the end of the line, before the indentation begins. That's because the first line of every Python control structure ends with a colon.

What if we want to do something when the conditional expression is not true? The answer is to add an ``else`` clause to the ``if`` statement:

   >>> if len(word) >= 5:
   ...   print 'word length is greater than or equal to 5'
   ... else:
   ...   print 'word length is less than 5'
   ... 
   word length is less than 5
   >>>

Finally, if we want to test multiple conditions in one go, we can use an ``elif`` clause that acts like an ``else`` and an ``if`` combined:

   >>> if len(word) < 3:
   ...   print 'word length is less than three'
   ... elif len(word) == 3:
   ...   print 'word length is equal to three'
   ... else:
   ...   print 'word length is greater than three'
   ... 
   word length is equal to three
   >>>

It's worth noting that in the condition part of an ``if`` statement, a nonempty string or list is evaluated as true, while an empty string or list evaluates as false.

   >>> mixed = ['cat', '', ['dog'], []]
   >>> for element in mixed:
   ...     if element: 
   ...         print element
   ... 
   cat
   ['dog']

That is, we *don't* need to spell out the test in the condition, e.g. as ``if len(element) > 0:``.

What's the difference between using ``if...elif`` as opposed to using a couple of ``if`` statements in a row? Well, consider the following situation:

   >>> animals = ['cat', 'dog']
   >>> if 'cat' in animals:
   ...     print 1
   ... elif 'dog' in animals:
   ...     print 2
   ... 
   1
   >>>

Since the ``if`` clause of the statement is satisfied, Python never tries to evaluate the ``elif`` clause, so we never get to print out ``2``. By contrast, if we replaced the ``elif`` by an ``if``, then we would print out both ``1`` and ``2``. So an ``elif`` clause potentially gives us more information than a bare ``if`` clause; when it evaluates to true, it tells us not only that the condition is satisfied, but also that the condition of the main ``if`` clause was *not* satisfied.
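The contrast between ``if ... elif`` and two ``if`` statements in a row can be checked directly. In this sketch the numbers are collected in lists (``chained`` and ``separate`` are names chosen for the illustration) instead of being printed.

```python
animals = ['cat', 'dog']

# if ... elif: the second test is skipped once the first one succeeds.
chained = []
if 'cat' in animals:
    chained.append(1)
elif 'dog' in animals:
    chained.append(2)

# Two separate if statements: each test is evaluated independently.
separate = []
if 'cat' in animals:
    separate.append(1)
if 'dog' in animals:
    separate.append(2)
```

Afterwards ``chained`` is ``[1]`` while ``separate`` is ``[1, 2]``.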


Conditional Expressions


Python supports a wide range of operators like ``<`` and ``>=`` for testing the relationship between values. The full set of these `relational operators`:dt: is shown in Table inequalities_.

.. _inequalities:

 ======== ==============
 Operator Relationship
 ======== ==============
 ``<``    less than
 ``<=``   less than or equal to
 ``==``   equal to (note this is two not one ``=`` sign)
 ``!=``   not equal to
 ``>``    greater than
 ``>=``   greater than or equal to
 ======== ==============
 Conditional Expressions

Normally we use conditional expressions as part of an ``if`` statement. However, we can test these relational operators directly at the prompt:

   >>> 3 < 5
   True
   >>> 5 < 3
   False
   >>> not 5 < 3
   True
   >>>

Here we see that these expressions have `Boolean`:dt: values, namely ``True`` or ``False``. ``not`` is a Boolean operator, and flips the truth value of a Boolean expression.

Strings and lists also support conditional operators:

   >>> word = 'sovereignty'
   >>> 'sovereign' in word
   True
   >>> 'gnt' in word
   True
   >>> 'pre' not in word
   True
   >>> 'Hello' in ['Hello', 'World']
   True
   >>> 'Hell' in ['Hello', 'World']
   False
   >>>

Strings also have methods for testing what appears at the beginning and the end of a string (as opposed to just anywhere in the string):

   >>> word.startswith('sovereign')
   True
   >>> word.endswith('ty')
   True
   >>>


Iteration, Items, and ``if``


Now it is time to put some of the pieces together. We are going to take the string ``'how now brown cow'`` and print out all of the words ending in ``'ow'``. Let's build the program up in stages. The first step is to split the string into a list of words:

   >>> sentence = 'how now brown cow'
   >>> words = sentence.split()
   >>> words
   ['how', 'now', 'brown', 'cow']
   >>>

Next, we need to iterate over the words in the list. Just so we don't get ahead of ourselves, let's print each word, one per line:

   >>> for word in words:
   ...     print word
   ... 
   how
   now
   brown
   cow


The next stage is to only print out the words if they end in the string ``'ow'``. Let's check that we know how to do this first:

   >>> 'how'.endswith('ow')
   True
   >>> 'brown'.endswith('ow')
   False
   >>>

Now we are ready to put an ``if`` statement inside the ``for`` loop. Here is the complete program:

   >>> sentence = 'how now brown cow'
   >>> words = sentence.split()
   >>> for word in words:
   ...     if word.endswith('ow'):
   ...         print word
   ... 
   how
   now
   cow
   >>>

As you can see, even with this small amount of Python knowledge it is possible to develop useful programs. The key idea is to develop the program in pieces, testing that each one does what you expect, and then combining them to produce whole programs. This is why the Python interactive interpreter is so invaluable, and why you should get comfortable using it.
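A small variation on the complete program is to collect the matching words in a list rather than printing them, so that later code can reuse the result. This is a sketch of that idea; the name ``matches`` is our own.

```python
sentence = 'how now brown cow'

# Same loop and test as before, but each hit is stored instead of printed.
matches = []
for word in sentence.split():
    if word.endswith('ow'):
        matches.append(word)
```

After the loop, ``matches`` holds ``['how', 'now', 'cow']``, and the same pieces (split, loop, test) can each still be tried separately at the prompt.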

A Taster of Data Types


Integers, strings and lists are all kinds of `data types`:dt: in Python, and have types ``int``, ``str`` and ``list`` respectively. In fact, every value in Python has a type. Python's ``type()`` function will tell you what an object's type is:

   >>> oddments = ['cat', 'cat'.index('a'), 'cat'.split()]
   >>> for e in oddments:
   ...     type(e)
   ... 
   <type 'str'>
   <type 'int'>
   <type 'list'>
   >>> 

The type determines what operations you can perform on the data value. So, for example, we have seen that we can index strings and lists, but we can't index integers:

     >>> one = 'cat'
     >>> one[0]
     'c'
     >>> two = [1, 2, 3]
     >>> two[1]
     2
     >>> three = 1234
     >>> three[2]
     Traceback (most recent call last):
       File "<pyshell#95>", line 1, in -toplevel-
         three[2]
     TypeError: 'int' object is unsubscriptable
     >>> 

The fact that this is a problem with types is signalled by the class of error, i.e., ``TypeError``; an object being "unsubscriptable" means we can't index into it.

Similarly, we can concatenate strings with strings, and lists with lists, but we cannot concatenate strings with lists:

   >>> query = 'Who knows?'
   >>> beatles = ['john', 'paul', 'george', 'ringo']
   >>> query + beatles
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   TypeError: cannot concatenate 'str' and 'list' objects

You may also have noticed that our analogy between operations on strings and numbers at the beginning of this chapter broke down pretty soon:

   >>> 'Hi' * 3 
   'HiHiHi'
   >>> 'Hi' - 'i'
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   TypeError: unsupported operand type(s) for -: 'str' and 'str'
   >>> 6 / 2     
   3
   >>> 'Hi' / 2
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   TypeError: unsupported operand type(s) for /: 'str' and 'int'
   >>> 

These error messages are another example of Python telling us that we have got our data types in a muddle. In the first case, we are told that the operation of subtraction (i.e., ``-``) cannot apply to objects of type ``str``, while in the second, we are told that division cannot take ``str`` and ``int`` as its two operands.
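Because a type mismatch shows up as a ``TypeError``, a program can test whether an operation is supported by attempting it. The sketch below relies on ``def`` and ``try``/``except``, which this chapter has not introduced yet, and ``can_index`` is a hypothetical helper written just for this illustration. It assumes the value, if it is a sequence, is non-empty.

```python
def can_index(value):
    """Return True if value[0] works, False if the type forbids indexing."""
    try:
        value[0]
    except TypeError:
        # Raised for types such as int that do not support indexing.
        return False
    return True

results = [can_index('cat'), can_index([1, 2, 3]), can_index(1234)]
```

Strings and lists pass the check, while the integer does not, so ``results`` is ``[True, True, False]``.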

Exercises


1. |easy| Assign a new value to ``sentence``, namely the string
   ``'she sells sea shells by the sea shore'``, then
   write code to perform the following tasks:

   a) Print all words beginning with ``'sh'``.
   b) Print all words longer than 4 characters.
   c) Generate a new sentence that adds the popular
      hedge word ``'like'`` before every word
      beginning with ``'se'``.  Your result should
      be a single string.

#. |easy| Write code to abbreviate text by removing all the vowels.
   Define ``sentence`` to hold any string you like, then initialize
   a new string ``result`` to hold the empty string ``''``.  Now write
   a ``for`` loop to process the string, one character at a time,
   and append any non-vowel characters to the result string.

#. |easy| We pointed out that when empty strings and empty lists occur
   in the condition part of an ``if`` clause, they evaluate to
   false. In this case, they are said to be occurring in a `Boolean
   context`:dt:.
   Experiment with different kinds of non-Boolean expressions in Boolean
   contexts, and see whether they evaluate as true or false.

#. |easy| Review conditional expressions, such as ``'row' in 'brown'``
   and ``'row' in ['brown', 'cow']``.

   a) Define ``sent`` to be the string ``'colorless green ideas sleep furiously'``,
      and use conditional expressions to test for the presence of particular words
      or substrings.

   b) Now define ``words`` to be a list of words contained in the sentence, using
      ``sent.split()``, and use conditional expressions to test for the presence
      of particular words or substrings.

#. |soso| Write code to convert text into *hAck3r*, where characters are
   mapped according to the following table:

   +---------+---+---+---+----+---+--------+-----+
   | Input:  | e | i | o | l  | s | .      | ate |
   +---------+---+---+---+----+---+--------+-----+
   | Output: | 3 | 1 | 0 | \| | 5 | 5w33t! | 8   |
   +---------+---+---+---+----+---+--------+-----+


.. _getting-organized:


Getting Organized



Strings and lists are a simple way to organize data. In particular, they `map`:dt: from integers to values. We can "look up" a character in a string using an integer, and we can look up a word in a list of words using an integer. These cases are shown in Figure maps01_.

.. _maps01:
.. figure:: ../images/maps01.png
   :scale: 25

   Sequence Look-up


However, we need a more flexible way to organize and access our data. Consider the examples in Figure maps02_.

.. _maps02:
.. figure:: ../images/maps02.png
   :scale: 25

   Dictionary Look-up

In the case of a phone book, we look up an entry using a `name`:em:, and get back a number. When we type a domain name in a web browser, the computer looks this up to get back an IP address. A word frequency table allows us to look up a word and find its frequency in a text collection. In all these cases, we are mapping from names to numbers, rather than the other way round as with indexing into sequences. In general, we would like to be able to map between arbitrary types of information. Table linguistic-objects_ lists a variety of linguistic objects, along with what they map.

.. _linguistic-objects:

   +--------------------+-------------------------------------------------+
   | Linguistic Object  |                      Maps                       |
   |                    +------------+------------------------------------+
   |                    |    from    | to                                 |
   +====================+============+====================================+
   |Document Index      |Word        |List of pages (where word is found) |
   |                    |            |                                    |
   +--------------------+------------+------------------------------------+
   |Thesaurus           |Word sense  |List of synonyms                    |
   +--------------------+------------+------------------------------------+
   |Dictionary          |Headword    |Entry (part of speech, sense        |
   |                    |            |definitions, etymology)             |
   |                    |            |                                    |
   +--------------------+------------+------------------------------------+
   |Comparative Wordlist|Gloss term  |Cognates (list of words, one per    |
   |                    |            |language)                           |
   +--------------------+------------+------------------------------------+
   |Morph Analyzer      |Surface form|Morphological analysis (list of     |
   |                    |            |component morphemes)                |
   |                    |            |                                    |
   +--------------------+------------+------------------------------------+
   Linguistic Objects as Mappings from Keys to Values

Most often, we are mapping from a string to some structured object. For example, a document index maps from a word (which we can represent as a string), to a list of pages (represented as a list of integers). In this section, we will see how to represent such mappings in Python.
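As a preview of where this is heading, a document index of the kind just described might look as follows in Python. The words and page numbers are invented for the illustration, and the dictionary notation used here is explained in the remainder of this section.

```python
# Hypothetical document index: each word maps to the pages it appears on.
doc_index = {
    'parsing': [12, 40, 41],
    'tagging': [7],
}

# Look up a word (a string) and get back a list of integers.
pages = doc_index['parsing']
first_page = pages[0]
```

The key is a plain string, while the value is a structured object (here a list) that can itself be indexed.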

Accessing Data with Data


Python provides a `dictionary`:dt: data type that can be used for mapping between arbitrary types.

.. Note:: A Python dictionary is somewhat like a linguistic dictionary
   |mdash| they both give you a systematic means of looking things up,
   and so there is some potential for confusion. However, we hope that
   it will usually be clear from the context which kind of dictionary
   we are talking about.

Here we define ``pos`` to be an empty dictionary and then add three entries to it, specifying the part-of-speech of some words. We add entries to a dictionary using the familiar square bracket notation:

   >>> pos = {}
   >>> pos['colorless'] = 'adj'
   >>> pos['furiously'] = 'adv'
   >>> pos['ideas'] = 'n'
   >>>

So, for example, ``pos['colorless'] = 'adj'`` says that the look-up value of ``'colorless'`` in ``pos`` is the string ``'adj'``.

.. Monkey-patching to get our dict examples to print consistently:

   >>> from nltk import SortedDict
   >>> pos = SortedDict(pos)

To look up a value in ``pos``, we again use indexing notation, except now the thing inside the square brackets is the item whose value we want to recover:

   >>> pos['ideas']
   'n'
   >>> pos['colorless']
   'adj'
   >>>

The item used for look-up is called the `key`:dt:, and the data that is returned is known as the `value`:dt:. As with indexing a list or string, we get an exception when we try to access the value of a key that does not exist:

   >>> pos['missing']
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   KeyError: 'missing'
   >>>

This raises an important question. Unlike lists and strings, where we can use ``len()`` to work out which integers will be legal indices, how do we work out the legal keys for a dictionary? Fortunately, we can check whether a key exists in a dictionary using the ``in`` operator:

   >>> 'colorless' in pos
   True
   >>> 'missing' in pos
   False
   >>> 'missing' not in pos
   True
   >>>

Notice that we can use ``not in`` to check if a key is `missing`:em:. Be careful with the ``in`` operator for dictionaries: it only applies to the keys and not their values. If we check for a value, e.g. ``'adj' in pos``, the result is ``False``, since ``'adj'`` is not a key. We can loop over all the entries in a dictionary using a ``for`` loop.

   >>> for word in pos:
   ...     print "%s (%s)" % (word, pos[word])
   ... 
   colorless (adj)
   furiously (adv)
   ideas (n)
   >>>
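The warning that ``in`` tests keys rather than values can be verified with a short sketch. To test for a value, one option is to apply ``in`` to the result of the dictionary's ``values()`` method.

```python
pos = {'colorless': 'adj', 'furiously': 'adv', 'ideas': 'n'}

key_found = 'colorless' in pos            # 'colorless' is a key
value_as_key = 'adj' in pos               # 'adj' is a value, not a key
value_found = 'adj' in pos.values()       # search the values explicitly
```

Here ``key_found`` and ``value_found`` are ``True``, while ``value_as_key`` is ``False``.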

We can see what the contents of the dictionary look like by inspecting the variable ``pos``. Note the presence of the colon character to separate each key from its corresponding value:

   >>> pos
   {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}
   >>>

Here, the contents of the dictionary are shown as `key-value pairs`:dt:. As you can see, the order of the key-value pairs is different from the order in which they were originally entered. This is because dictionaries are not sequences but mappings. The keys in a mapping are not inherently ordered, and any ordering that we might want to impose on the keys exists independently of the mapping. As we shall see later, this gives us a lot of flexibility.

We can use the same key-value pair format to create a dictionary:

   >>> pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}
   >>>

.. Monkey-patching to get our dict examples to print consistently:

   >>> pos = SortedDict(pos)

Using the dictionary methods ``keys()``, ``values()`` and ``items()``, we can access the keys and values as separate lists, and also the key-value pairs:

   >>> pos.keys()
   ['colorless', 'furiously', 'ideas']
   >>> pos.values()
   ['adj', 'adv', 'n']
   >>> pos.items()
   [('colorless', 'adj'), ('furiously', 'adv'), ('ideas', 'n')]
   >>> for (key, val) in pos.items():
   ...     print "%s ==> %s" % (key, val)
   ...
   colorless ==> adj
   furiously ==> adv
   ideas ==> n
   >>>

Note that keys are forced to be unique. Suppose we try to use a dictionary to store the fact that the word `content`:lx: is both a noun and a verb:

   >>> pos['content'] = 'n'
   >>> pos['content'] = 'v'
   >>> pos
   {'content': 'v', 'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}
   >>>

Initially, ``pos['content']`` is given the value ``'n'``, and this is immediately overwritten with the new value ``'v'``. In other words, there is only one entry for ``'content'``. If we wanted to store multiple values in that entry, we could use a list, e.g. ``pos['content'] = ['n', 'v']``.
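One way to build such list-valued entries incrementally is sketched below. The list of (word, tag) pairs is made-up input for the illustration; each key starts with an empty list, and tags are appended as they are seen.

```python
# Hypothetical input: (word, tag) pairs, with 'content' appearing twice.
tagged = [('content', 'n'), ('ideas', 'n'), ('content', 'v')]

pos = {}
for (word, tag) in tagged:
    if word not in pos:
        pos[word] = []          # first time we see this word
    pos[word].append(tag)       # keep every tag instead of overwriting
```

Afterwards ``pos['content']`` is ``['n', 'v']``, so neither tag has been lost.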

Counting with Dictionaries


The values stored in a dictionary can be any kind of object, not just a string |mdash| the values can even be dictionaries. The most common kind is actually an integer. It turns out that we can use a dictionary to store `counters`:dt: for many kinds of data. For instance, we can have a counter for all the letters of the alphabet; each time we get a certain letter we increment its corresponding counter:

   >>> phrase = 'colorless green ideas sleep furiously'
   >>> count = {}
   >>> for letter in phrase:
   ...     if letter not in count:
   ...         count[letter] = 0
   ...     count[letter] += 1
   >>> count
   {'a': 1, ' ': 4, 'c': 1, 'e': 6, 'd': 1, 'g': 1, 'f': 1, 'i': 2,
    'l': 4, 'o': 3, 'n': 1, 'p': 1, 's': 5, 'r': 3, 'u': 2, 'y': 1}
   >>>

Observe that ``in`` is used here in two different ways: ``for letter in phrase`` iterates over every letter, running the body of the ``for`` loop. Inside this loop, the conditional expression ``if letter not in count`` checks whether the letter is missing from the dictionary. If it is missing, we create a new entry and set its value to zero: ``count[letter] = 0``. Now we are sure that the entry exists, and it may have a zero or non-zero value. We finish the body of the ``for`` loop by incrementing this particular counter using the ``+=`` assignment operator. Finally, we print the dictionary, to see the letters and their counts. This method of maintaining many counters will find many uses, and you will become very familiar with it. To make counting much easier, we can use ``defaultdict``, a special kind of container introduced in Python 2.5. This is also included in NLTK for the benefit of readers who are using Python 2.4, and can be imported as shown below.

   >>> phrase = 'colorless green ideas sleep furiously'
   >>> from nltk import defaultdict
   >>> count = defaultdict(int)
   >>> for letter in phrase:
   ...     count[letter] += 1
   >>> count
   {'a': 1, ' ': 4, 'c': 1, 'e': 6, 'd': 1, 'g': 1, 'f': 1, 'i': 2,
    'l': 4, 'o': 3, 'n': 1, 'p': 1, 's': 5, 'r': 3, 'u': 2, 'y': 1}
   >>>

.. note:: Calling ``defaultdict(int)`` creates a special kind of dictionary.

  When that dictionary is accessed with a non-existent key
  |mdash| i.e. the first time a particular letter is encountered |mdash|
  then ``int()`` is called to produce the initial value for this key (i.e. ``0``).
  You can test this by running the above code, then typing ``count['X']``
  and seeing that it returns a zero value (and not a ``KeyError`` as in the
  case of normal Python dictionaries).
  The ``defaultdict`` type is very handy and will be used in many places later on.
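
The difference between the two kinds of dictionary can be checked with a short sketch. It assumes we import ``defaultdict`` from Python's standard ``collections`` module instead of from NLTK; the names ``plain`` and ``missing_raises`` are ours.

```python
from collections import defaultdict

count = defaultdict(int)        # missing keys start at int(), i.e. 0
for letter in 'colorless green ideas sleep furiously':
    count[letter] += 1

# An ordinary dictionary raises KeyError for a missing key.
plain = {'a': 1}
try:
    plain['X']
    missing_raises = False
except KeyError:
    missing_raises = True

e_count = count['e']            # 6
x_count = count['X']            # 0, and no KeyError
```

One thing to watch for: merely looking up ``count['X']`` creates an entry for ``'X'``, so the key will show up if we list the dictionary's contents afterwards.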

There are other useful ways to display the result, such as sorting alphabetically by the letter:

   >>> sorted(count.items())
   [(' ', 4), ('a', 1), ('c', 1), ('d', 1), ('e', 6), ('f', 1), ...,
   ...('y', 1)]
   >>>

.. Note:: The function ``sorted()`` is similar to the ``sort()``
  method on lists, but rather than sorting in-place, it produces
  a new sorted copy of its argument. Moreover, as we will see very
  soon, ``sorted()`` will work on a wider variety of data types,
  including dictionaries.
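
As a rough illustration of the note above, the following sketch contrasts ``sorted()`` with ``sort()``, and also orders a small dictionary of counts by frequency. The toy data and the variable names are our own, and sorting by count is just one further display option.

```python
letters = ['c', 'a', 'b']
ordered = sorted(letters)        # a new list; letters itself is unchanged

in_place = list(letters)
result = in_place.sort()         # sorts in_place and returns None

count = {'b': 2, 'a': 1, 'c': 3}
keys = sorted(count)             # iterating over a dictionary yields its keys
# Sort the (letter, count) pairs by the count, largest first.
by_count = sorted(count.items(), key=lambda pair: pair[1], reverse=True)
```

Here ``keys`` is ``['a', 'b', 'c']`` and ``by_count`` begins with the most frequent letter.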


Getting Unique Entries


Sometimes, we don't want to count at all, but just want to make a record of the items that we have seen, regardless of repeats. For example, we might want to compile a vocabulary from a document. This is a sorted list of the words that appeared, regardless of frequency. At this stage we have two ways to do this. The first uses lists.

   >>> sentence = "she sells sea shells by the sea shore".split()
   >>> words = []
   >>> for word in sentence:
   ...     if word not in words:
   ...         words.append(word)
   ...
   >>> sorted(words)
   ['by', 'sea', 'sells', 'she', 'shells', 'shore', 'the']
   >>>

There is a better way to do this task using Python's `set`:dt: data type. We can convert ``sentence`` into a set, using ``set(sentence)``:

   >>> set(sentence)
   set(['shells', 'sells', 'shore', 'she', 'sea', 'the', 'by'])  
   >>>

The order of items in a set is not significant, and they will usually appear in a different order to the one they were entered in. The main point here is that converting a list to a set removes any duplicates. We convert it back into a list, sort it, and print. Here is the complete program:

   >>> sentence = "she sells sea shells by the sea shore".split()
   >>> sorted(set(sentence))
   ['by', 'sea', 'sells', 'she', 'shells', 'shore', 'the']

Here we have seen that there is sometimes more than one way to solve a problem with a program. In this case, we used two different built-in data types, a list and a set. The ``set`` data type most closely modeled our task, so it required the least amount of work.

Scaling Up


We can use dictionaries to count word occurrences. For example, the following code uses |NLTK|\ 's corpus reader to load *Macbeth* and count the frequency of each word.

Before we can use |NLTK| we need to tell Python to load it, using the statement ``import nltk``.

   >>> import nltk
   >>> count = nltk.defaultdict(int)                     # initialize a dictionary
   >>> for word in nltk.corpus.gutenberg.words('shakespeare-macbeth.txt'): # tokenize Macbeth
   ...     word = word.lower()                           # normalize to lowercase
   ...     count[word] += 1                              # increment the counter
   ...
   >>>

You will learn more about accessing corpora in Section sec-extracting-text-from-corpora_. For now, you just need to know that ``gutenberg.words()`` returns a list of words, in this case from Shakespeare's play *Macbeth*, and we are iterating over this list using a ``for`` loop. We convert each word to lowercase using the string method ``word.lower()``, and use a dictionary to maintain a set of counters, one per word. Now we can inspect the contents of the dictionary to get counts for particular words:

   >>> count['scotland']
   12
   >>> count['the']
   692
   >>>

Exercises


1. |talk| Review the mappings in Table linguistic-objects_. Discuss any other
   examples of mappings you can think of.  What type of information do they map
   from and to?

#. |easy| Using the Python interpreter in interactive mode, experiment with
   the examples in this section.  Create a dictionary ``d``, and add
   some entries.  What happens if you try to access a non-existent
   entry, e.g. ``d['xyz']``?

#. |easy| Try deleting an element from a dictionary, using the syntax
   ``del d['abc']``.  Check that the item was deleted.

#. |easy| Create a dictionary ``e``, to represent a single lexical entry
   for some word of your choice.
   Define keys like ``headword``, ``part-of-speech``, ``sense``, and
   ``example``, and assign them suitable values.

#. |easy| Create two dictionaries, ``d1`` and ``d2``, and add some entries to
   each.  Now issue the command ``d1.update(d2)``.  What did this do?
   What might it be useful for?

#. |soso| Write a program that takes a sentence expressed as a single string,
   splits it and counts up the words.  Get it to print out each word and the
   word's frequency, one per line, in alphabetical order.

Regular Expressions


For a moment, imagine that you are editing a large text, and you have a strong dislike of repeated occurrences of the word `very`:lx:. How could you find all such cases in the text? To be concrete, let's suppose that we assign the following text to the variable ``s``:

   >>> s = """Google Analytics is very very very nice (now)
   ... By Jason Hoffman 18 August 06
   ... Google Analytics, the result of Google's acquisition of the San
   ... Diego-based Urchin Software Corporation, really really opened its
   ... doors to the world a couple of days ago, and it allows you to
   ... track up to 10 sites within a single google account.
   ... """
   >>>

|nopar| Python's triple quotes ``"""`` are used here since they allow us to break a string across lines.

One approach to our task would be to convert the string into a list, and look for adjacent items that are both equal to the string ``'very'``. We use the ``range(n)`` function in this example to create a list of consecutive integers from 0 up to, but not including, ``n``:

   >>> text = s.split()
   >>> for n in range(len(text) - 1):      # stop one short: text[n+1] must exist
   ...     if text[n] == 'very' and text[n+1] == 'very':
   ...         print n, n+1
   ...
   3 4
   4 5
   >>>

|nopar| However, such an approach is not very flexible or convenient. In this section, we will present Python's `regular expression`:dt: module ``re``, which supports powerful search and substitution inside strings. As a gentle introduction, we will start out using a utility function ``re_show()`` to illustrate how regular expressions match against substrings. ``re_show()`` takes two arguments, a pattern that it is looking for, and a string in which the pattern might occur.

   >>> import nltk
   >>> nltk.re_show('very very', s)
   Google Analytics is {very very} very nice (now)
   ...
   >>>

|nopar| (We have only displayed the first part of ``s`` that is returned, since the rest is irrelevant for the moment.) As you can see, ``re_show`` places curly braces around the first occurrence it has found of the string ``'very very'``. So an important part of what ``re_show`` is doing is searching for any substring of ``s`` that `matches`:dt: the pattern in its first argument.
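
To get a feel for what such a utility does, here is a minimal sketch of a similar helper built on ``re.sub()``, which is covered later in this section. The function name ``brace_matches`` is hypothetical, and we are not claiming that this is how ``re_show()`` itself is written.

```python
import re

def brace_matches(pattern, text):
    """Return text with every match of pattern wrapped in curly braces."""
    # \g<0> in the replacement stands for the whole matched substring.
    return re.sub(pattern, r'{\g<0>}', text)

line = 'Google Analytics is very very very nice (now)'
marked = brace_matches('very very', line)
print(marked)
```

Matches are found from left to right and do not overlap, which is why only the first two occurrences of ``'very'`` end up inside the braces.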

Now we might want to modify the example so that ``re_show`` highlights cases where there are two `or more`:em: adjacent sequences of ``'very'``. To do this, we need to use a `regular expression operator`:dt:, namely ``'+'``. If `s`:ph: is a string, then `s`:ph:\ ``+`` means: 'match one or more occurrences of `s`:ph:\ '. Let's first look at the case where `s`:ph: is a single character, namely the letter ``'o'``:

.. doctest-ignore::

   >>> nltk.re_show('o+', s)
   G{oo}gle Analytics is very very very nice (n{o}w)
   ...
   >>>

|nopar| ``'o+'`` is our first proper regular expression. You can think of it as matching an `infinite set`:em: of strings, namely the set {``'o'``, ``'oo'``, ``'ooo'``, ...}. But we would really like to match sequences of at least two ``'o'``\ s; for this, we need the regular expression ``'oo+'``, which matches any string consisting of ``'o'`` followed by one or more occurrences of ``'o'``.

.. doctest-ignore::

   >>> nltk.re_show('oo+', s)
   G{oo}gle Analytics is very very very nice (now)
   ...
   >>>
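
The same comparison can be made without ``re_show()``. The sketch below uses ``re.findall()``, which is introduced later in this section, on the first line of our text; the variable names are our own.

```python
import re

line = 'Google Analytics is very very very nice (now)'
one_or_more = re.findall('o+', line)     # runs of one or more 'o'
two_or_more = re.findall('oo+', line)    # an 'o' followed by one or more 'o'
print(one_or_more)
print(two_or_more)
```

The single ``'o'`` of ``now`` appears in the first result but not in the second.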

Let's return to the task of identifying multiple occurrences of ``'very'``. Some initially plausible candidates won't do what we want. For example, ``'very+'`` would match ``'veryyy'`` (but not ``'very very'``), since the ``+`` scopes over the immediately preceding expression, in this case ``'y'``. To widen the scope of ``+``, we need to use parentheses, as in ``'(very)+'``. Will this match ``'very very'``? No, because we've forgotten about the whitespace between the two words; instead, it will match strings like ``'veryvery'``. However, the following `does`:em: work:

.. doctest-ignore::

   >>> nltk.re_show('(very\s)+', s)
   Google Analytics is {very very very }nice (now)
   >>>
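
The scoping behaviour just described can be verified with a small sketch. It assumes a helper of our own, ``first_match()``, built on the standard ``re.search()`` function; the test strings are invented for the purpose.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text matched by pattern, or None."""
    found = re.search(pattern, text)
    return found.group(0) if found else None

a = first_match(r'very+', 'veryyy nice')          # '+' applies to 'y' only
b = first_match(r'(very)+', 'veryvery nice')      # '+' applies to 'very'
c = first_match(r'(very)+', 'very very nice')     # stops at the first space
d = first_match(r'(very\s)+', 'very very nice')   # the whitespace is matched too
```

Here ``a`` is ``'veryyy'``, ``b`` is ``'veryvery'``, ``c`` is just ``'very'``, and ``d`` is ``'very very '`` with its trailing space.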

|nopar| Characters preceded by a ``\``, such as ``'\s'``, have a special interpretation inside regular expressions; thus, ``'\s'`` matches a whitespace character. We could have used ``' '`` in our pattern, but ``'\s'`` is better practice in general. One reason is that the sense of "whitespace" we are using is more general than you might have imagined; it includes not just inter-word spaces, but also tabs and newlines. If you try to inspect the variable ``s``, you might initially get a shock:

.. doctest-ignore::

   >>> s
   "Google Analytics is very very very nice (now)\nBy Jason Hoffman 
   18 August 06\nGoogle
   ...
   >>>

|nopar| You might recall that ``'\n'`` is a special character that corresponds to a newline in a string. The following example shows how newline is matched by ``'\s'``.

   >>> s2 = "I'm very very\nvery happy"
   >>> nltk.re_show('very\s', s2)
   I'm {very }{very
   }{very }happy
   >>>
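
A quick way to see how general ``'\s'`` is: the sketch below uses a variant of ``s2`` in which we have replaced one space with a tab (our own change), and compares ``'\s'`` with a literal space using ``re.findall()``, which is introduced in the next paragraph.

```python
import re

s2_tab = "I'm very very\nvery\thappy"
any_space = re.findall(r'very\s', s2_tab)   # space, newline and tab all count
literal = re.findall(r'very ', s2_tab)      # only an ordinary space matches
print(any_space)
print(literal)
```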

Python's ``re.findall(``\ `patt`:ph:, `s`:ph:\ ``)`` function is a useful way to find all the substrings in `s`:ph: that are matched by `patt`:ph:. Before illustrating, let's introduce two further special characters, ``'\d'`` and ``'\w'``: the first will match any digit, and the second will match any alphanumeric character. Before we can use ``re.findall()`` we have to load Python's regular expression module, using ``import re``.

   >>> import re
   >>> re.findall('\d\d', s)
   ['18', '06', '10']
   >>> re.findall('\s\w\w\w\s', s)
   [' the ', ' the ', ' its\n', ' the ', ' and ', ' you ']
   >>> 

|nopar| As you will see, the second example matches three-letter words. However, this regular expression is not quite what we want. First, the leading and trailing spaces are extraneous. Second, it will fail to match against strings such as ``'the San'``, where two three-letter words are adjacent. To solve this problem, we can use another special character, namely ``'\b'``. This is sometimes called a "zero-width" character; it matches against the empty string, but only at the beginning and end of words:

   >>> re.findall(r'\b\w\w\w\b', s)
   ['now', 'the', 'the', 'San', 'its', 'the', 'ago', 'and', 'you']

.. Note:: This example uses a Python `raw string`:dt:\:
  ``r'\b\w\w\w\b'``. The specific justification here is that in an
  ordinary string, ``\b`` is interpreted as a backspace character.
  Python will convert it to a backspace in a regular expression
  unless you use the ``r`` prefix to create a raw string as shown
  above. Another use for raw strings is to match strings that
  include backslashes. Suppose we want to match 'either\\or'. In order
  to create a regular expression, the backslash needs to be escaped,
  since it is a special character;
  so we want to pass the pattern ``\\`` to the regular expression
  interpreter. But to express this as a Python string literal, each
  backslash must be escaped again, yielding the string
  ``'\\\\'``. However, with a raw string, this reduces down to
  ``r'\\'``.
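
The points made in this note can be checked directly. The following is a small sketch using only the standard ``re`` module; the variable names and the test strings are our own.

```python
import re

# In an ordinary string '\b' is a single backspace character;
# in a raw string it stays as a backslash followed by 'b'.
lengths = [len('\b'), len(r'\b')]

# Both spellings denote the same two-character pattern: \\
same_pattern = ('\\\\' == r'\\')

target = 'either\\or'                  # the text either\or
hits = re.findall(r'\\', target)       # finds the one literal backslash

words = re.findall(r'\b\w\w\w\b', 'now the San')
```

``lengths`` comes out as ``[1, 2]``, and ``hits`` contains a single backslash character.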

Returning to the case of repeated words, we might want to look for cases involving ``'very'`` or ``'really'``, and for this we use the disjunction operator ``|``.

   >>> nltk.re_show('((very|really)\s)+', s)
   Google Analytics is {very very very }nice (now)
   By Jason Hoffman 18 August 06
   Google Analytics, the result of Google's acquisition of the San
   Diego-based Urchin Software Corporation, {really really }opened its
   doors to the world a couple of days ago, and it allows you to
   track up to 10 sites within a single google account.
   >>>

In addition to the matches just illustrated, the regular expression ``'((very|really)\s)+'`` will also match cases where the two disjuncts occur with each other, such as the string ``'really very really '``.
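
The mixed case is easy to confirm with a sketch. The sentence below is made up for the purpose, and we use the standard ``re.search()`` function to retrieve the first matching stretch of text.

```python
import re

mixed = 'It was really very really good'
found = re.search(r'((very|really)\s)+', mixed)
matched = found.group(0)      # the whole stretch covered by the repetition
print(matched)
```

The match is ``'really very really '``, trailing space included, since each repetition may pick either disjunct.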

Let's now look at how to perform substitutions, using the ``re.sub()`` function. In the first instance we replace all instances of ``l`` with ``s``. Note that this generates a string as output, and doesn't modify the original string. Then we replace any instances of ``green`` with ``red``.

   >>> sent = "colorless green ideas sleep furiously"
   >>> re.sub('l', 's', sent)
   'cosorsess green ideas sseep furioussy'
   >>> re.sub('green', 'red', sent)
   'colorless red ideas sleep furiously'
   >>>

We can also disjoin individual characters using a square bracket notation. For example, ``[aeiou]`` matches any of ``a``, ``e``, ``i``, ``o``, or ``u``, that is, any vowel. The expression ``[^aeiou]`` matches any single character that is `not`:em: a vowel. In the following example, we match sequences consisting of a non-vowel followed by a vowel.

   >>> nltk.re_show('[^aeiou][aeiou]', sent)
   {co}{lo}r{le}ss g{re}en{ i}{de}as s{le}ep {fu}{ri}ously
   >>>

|nopar| Using the same regular expression, the function ``re.findall()`` returns a list of all the substrings in ``sent`` that are matched:

   >>> re.findall('[^aeiou][aeiou]', sent)
   ['co', 'lo', 'le', 're', ' i', 'de', 'le', 'fu', 'ri']
   >>>

Groupings


Returning briefly to our earlier problem with unwanted whitespace around three-letter words, we note that ``re.findall()`` behaves slightly differently if we create `groups`:dt: in the regular expression using parentheses; it only returns strings that occur within the groups:

   >>> re.findall('\s(\w\w\w)\s', s)
   ['the', 'the', 'its', 'the', 'and', 'you']
   >>>

|nopar| The same device allows us to select only the non-vowel characters that appear before a vowel:

   >>> re.findall('([^aeiou])[aeiou]', sent)
   ['c', 'l', 'l', 'r', ' ', 'd', 'l', 'f', 'r']
   >>>

By delimiting a second group in the regular expression, we can even generate pairs (or `tuples`:dt:) that we may then go on and tabulate.

   >>> re.findall('([^aeiou])([aeiou])', sent)
   [('c', 'o'), ('l', 'o'), ('l', 'e'), ('r', 'e'), (' ', 'i'),
   ('d', 'e'), ('l', 'e'), ('f', 'u'), ('r', 'i')]
   >>>
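
One way such pairs might be tabulated is with the counting technique from earlier in this chapter. This is a sketch only; it assumes ``defaultdict`` is imported from the standard ``collections`` module, and the name ``pair_count`` is ours.

```python
import re
from collections import defaultdict

sent = 'colorless green ideas sleep furiously'
pairs = re.findall('([^aeiou])([aeiou])', sent)

# Count how often each (non-vowel, vowel) pair occurs.
pair_count = defaultdict(int)
for pair in pairs:
    pair_count[pair] += 1

# Print one pair per line, in alphabetical order, with its count.
for pair, n in sorted(pair_count.items()):
    print('%s%s %d' % (pair[0], pair[1], n))
```

Tuples can serve as dictionary keys, so each pair gets its own counter; in this sentence only ``('l', 'e')`` occurs more than once.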

Our next example also makes use of groups. One further special character is the so-called wildcard element, ``'.'``; this has the distinction of matching any single character (except ``'\n'``). Given the string ``s3``, our task is to pick out login names and email domains:

   >>> s3 = """
   ... <hart@vmd.cso.uiuc.edu>
   ... Final editing was done by Martin Ward <Martin.Ward@uk.ac.durham>
   ... Michael S. Hart <hart@pobox.com>
   ... Prepared by David Price, email <ccx074@coventry.ac.uk>"""

The task is made much easier by the fact that all the email addresses in the example are delimited by angle brackets, and we can exploit this feature in our regular expression:

   >>> re.findall(r'<(.+)@(.+)>', s3)
   [('hart', 'vmd.cso.uiuc.edu'), ('Martin.Ward', 'uk.ac.durham'), 
   ('hart', 'pobox.com'), ('ccx074', 'coventry.ac.uk')]
   >>>

|nopar| Since ``'.'`` matches any single character, ``'.+'`` will match any non-empty `string`:em: of characters, including punctuation symbols such as the period.

One question that might occur to you is how do we specify a match against a period? The answer is that we have to place a ``'\'`` immediately before the ``'.'`` in order to escape its special interpretation.

   >>> re.findall(r'(\w+\.)', s3)
   ['vmd.', 'cso.', 'uiuc.', 'Martin.', 'uk.', 'ac.', 'S.', 
   'pobox.', 'coventry.', 'ac.']
   >>>

Now, let's suppose that we wanted to match occurrences of both ``'Google'`` and ``'google'`` in our sample text. If you have been following up till now, you would reasonably expect that this regular expression with a disjunction would do the trick: ``'(G|g)oogle'``. But look what happens when we try this with ``re.findall()``:

   >>> re.findall('(G|g)oogle', s)
   ['G', 'G', 'G', 'g']
   >>> 

|nopar| What is going wrong? We innocently used the parentheses to indicate the scope of the operator ``'|'``, but ``re.findall()`` has interpreted them as marking a group. In order to tell ``re.findall()`` "don't try to do anything special with these parentheses", we need an extra piece of notation:

   >>> re.findall('(?:G|g)oogle', s)
   ['Google', 'Google', 'Google', 'google']
   >>> 

|nopar| Placing ``'?:'`` immediately after the opening parenthesis makes it explicit that the parentheses are just being used for scoping.
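
The three behaviours can be compared side by side. The sketch below uses a shortened text of our own, and also shows that for single characters the square bracket notation introduced earlier avoids the parentheses altogether.

```python
import re

text = 'Google Analytics lets you track sites in a single google account.'
with_group = re.findall('(G|g)oogle', text)        # only the group is returned
non_capturing = re.findall('(?:G|g)oogle', text)   # the whole match is returned
char_class = re.findall('[Gg]oogle', text)         # no parentheses needed
```

``with_group`` is ``['G', 'g']``, while the other two both give ``['Google', 'google']``.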

Practice Makes Perfect


Regular expressions are very flexible and very powerful. However, they often don't do what you expect. For this reason, you are strongly encouraged to try out a variety of tasks using ``re_show()`` and ``re.findall()`` in order to develop your intuitions further; the exercises below should help get you started. We suggest that you build up a regular expression in small pieces, rather than trying to get it completely right first time. Here are some operators and sequences that are commonly used in natural language processing.

 ===========  ====================================================================
        Commonly-used Operators and Sequences
 ---------------------------------------------------------------------------------
 ``*``        Zero or more, e.g. ``a*``, ``[a-z]*``
 -----------  --------------------------------------------------------------------
 ``+``        One or more, e.g. ``a+``, ``[a-z]+``
 -----------  --------------------------------------------------------------------
 ``?``        Zero or one (i.e. optional), e.g. ``a?``, ``[a-z]?``
 -----------  --------------------------------------------------------------------
 ``[..]``     A set or range of characters, e.g. ``[aeiou]``, ``[a-z0-9]``
 -----------  --------------------------------------------------------------------
 ``(..)``     Grouping parentheses, e.g. ``(the|a|an)``
 -----------  --------------------------------------------------------------------
 ``\b``       Word boundary (zero width)
 -----------  --------------------------------------------------------------------
 ``\d``       Any decimal digit (``\D`` is any non-digit)
 -----------  --------------------------------------------------------------------
 ``\s``       Any whitespace character (``\S`` is any non-whitespace character)
 -----------  --------------------------------------------------------------------
 ``\w``       Any alphanumeric character (``\W`` is any non-alphanumeric character)
 -----------  --------------------------------------------------------------------
 ``\t``       The tab character
 -----------  --------------------------------------------------------------------
 ``\n``       The newline character
 ===========  ====================================================================
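
Several of the operators in the table can be tried out in a few lines. The following sketch uses invented test strings and ``re.findall()``; raw strings are used throughout so that the backslash sequences reach the ``re`` module intact.

```python
import re

text = 'In 2006 the colour (or color) of 3 cats\tand 12 dogs changed.'

optional = re.findall(r'colou?r', text)             # ?  the 'u' is optional
digits = re.findall(r'\d+', text)                   # +  one or more digits
zero_or_more = re.findall(r'ca*t', 'ct cat caat')   # *  zero or more 'a'
articles = re.findall(r'\b(?:the|a|an)\b', text)    # grouping plus word boundaries
words = re.findall(r'\w+', 'cats\tand 12 dogs')     # runs of alphanumerics
chunks = re.findall(r'\S+', 'cats\tand 12 dogs')    # runs of non-whitespace
```

Building up a pattern from small pieces like these, and checking each piece on a short string, is usually easier than debugging one long expression.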

Exercises


1. |easy| Describe the class of strings matched by the following regular
   expressions. Note that ``'*'`` means: match zero or more
   occurrences of the preceding regular expression.

   a) ``[a-zA-Z]+``
   #) ``[A-Z][a-z]*``
   #) ``\d+(\.\d+)?``
   #) ``([bcdfghjklmnpqrstvwxyz][aeiou][bcdfghjklmnpqrstvwxyz])*``
   #) ``\w+|[^\w\s]+``

   Test your answers using ``re_show()``.

#. |easy| Write regular expressions to match the following classes of strings:

   a) A single determiner (assume that `a`:lx:, `an`:lx:, and `the`:lx:
      are the only determiners).
   #) An arithmetic expression using integers, addition, and
      multiplication, such as ``2*3+8``.

#. |soso| The above example of extracting (name, domain) pairs from
   text does not work when there is more than one email address
   on a line, because the ``+`` operator is "greedy" and consumes
   too much of the input.

   a) Experiment with input text containing more than one email address
      per line, such as that shown below.  What happens?
   #) Using ``re.findall()``, write another regular expression
      to extract email addresses, replacing the period character
      with a range or negated range, such as ``[a-z]+`` or ``[^ >]+``.
   #) Now try to match email addresses by changing the regular
      expression ``.+`` to its "non-greedy" counterpart, ``.+?``

      >>> s = """
      ... austen-emma.txt:hart@vmd.cso.uiuc.edu  (internet)  hart@uiucvmd (bitnet)
      ... austen-emma.txt:Internet (72600.2026@compuserve.com); TEL: (212-254-5093)
      ... austen-persuasion.txt:Editing by Martin Ward (Martin.Ward@uk.ac.durham)
      ... blake-songs.txt:Prepared by David Price, email ccx074@coventry.ac.uk
      ... """

#. |soso| Write code to convert text into Pig Latin. This involves two steps:
   move any consonant (or consonant cluster) that appears at the start of the word
   to the end, then append `ay`:lx:, e.g. `string`:lx: |rarr| `ingstray`:lx:,
   `idle`:lx: |rarr| `idleay`:lx:.  ``http://en.wikipedia.org/wiki/Pig_Latin``

#. |soso| Write code to convert text into *hAck3r* again, this time using regular expressions
   and substitution, where
   ``e`` |rarr| ``3``,
   ``i`` |rarr| ``1``,
   ``o`` |rarr| ``0``,
   ``l`` |rarr| ``|``,
   ``s`` |rarr| ``5``,
   ``.`` |rarr| ``5w33t!``,
   ``ate`` |rarr| ``8``.
   Normalize the text to lowercase before converting it.
   Add more substitutions of your own.  Now try to map
   ``s`` to two different values: ``$`` for word-initial ``s``,
   and ``5`` for word-internal ``s``.

#. |hard| Read the Wikipedia entry on *Soundex*. Implement this
   algorithm in Python.

Summary


  • Text is represented in Python using strings, and we type these with
    single or double quotes: ``'Hello'``, ``"World"``.
  • The characters of a string are accessed using indexes, counting from zero:
    ``'Hello World'[1]`` gives the value ``e``.  The length of a string is
    found using ``len()``.
  • Substrings are accessed using slice notation: ``'Hello World'[1:5]``
    gives the value ``ello``.  If the start index is omitted, the
    substring begins at the start of the string; if the end index is omitted,
    the slice continues to the end of the string.
  • Sequences of words are represented in Python using lists of strings:
    ``['colorless', 'green', 'ideas']``.  We can use indexing, slicing
    and the ``len()`` function on lists.
  • Strings can be split into lists: ``'Hello World'.split()`` gives
    ``['Hello', 'World']``.  Lists can be joined into strings:
    ``'/'.join(['Hello', 'World'])`` gives ``'Hello/World'``.
  • Lists can be sorted in-place: ``words.sort()``. To produce a separate,
    sorted copy, use: ``sorted(words)``.
  • We process each item in a string or list using a ``for`` statement:
    ``for word in phrase``.  This must be followed by the colon character
    and an indented block of code, to be executed each time through the loop.
  • We test a condition using an ``if`` statement: ``if len(word) < 5``.
    This must be followed by the colon character and an indented block of
    code, to be executed only if the condition is true.
  • A dictionary is used to map between arbitrary types of information,
    such as a string and a number: ``freq['cat'] = 12``.  We create
    dictionaries using the brace notation: ``pos = {}``,
    ``pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}``.
  • Some functions are not available by default, but must be accessed using
    Python's ``import`` statement.
  • Regular expressions are a powerful and flexible method of specifying
    patterns. Once we have imported the ``re`` module, we can use
    ``re.findall()`` to find all substrings in a string that match a pattern,
    and we can use ``re.sub()`` to replace substrings of one sort with another.

Further Reading


Python


Two freely available online texts are the following:

  • Josh Cogliati, *Non-Programmer's Tutorial for Python*,
    http://en.wikibooks.org/wiki/Non-Programmer's_Tutorial_for_Python/Contents
  • Allen B. Downey, Jeffrey Elkner and Chris Meyers,
    *How to Think Like a Computer Scientist: Learning with Python*,
    http://www.ibiblio.org/obp/thinkCSpy/

*An Introduction to Python* [vanRossum2006IP]_ is a Python tutorial by Guido van Rossum, the inventor of Python, and Fred L. Drake, Jr., the official editor of the Python documentation. It is available online at http://docs.python.org/tut/tut.html. A more detailed but still introductory text is [Lutz2003LP]_, which covers the essential features of Python, and also provides an overview of the standard libraries.

..

   A more advanced text, [vanRossum2006IPLR]_ is the official reference
   for the Python language itself, and describes the syntax of Python and
   its built-in datatypes in depth. It is also available online at
   http://docs.python.org/ref/ref.html.

[Beazley2006PER]_ is a succinct reference book; although not suitable as an introduction to Python, it is an excellent resource for intermediate and advanced programmers.

Finally, it is always worth checking the official *Python Documentation* at http://docs.python.org/.

Regular Expressions


There are many references for regular expressions, both practical and theoretical. [Friedl2002MRE]_ is a comprehensive and detailed manual on using regular expressions, covering their syntax in most major programming languages, including Python.

For an introductory tutorial to using regular expressions in Python with the ``re`` module, see A. M. Kuchling, *Regular Expression HOWTO*, http://www.amk.ca/python/howto/regex/.

Chapter 3 of [Mertz2003TPP]_ provides a more extended tutorial on Python's facilities for text processing with regular expressions.

http://www.regular-expressions.info/ is a useful online resource, providing a tutorial and references to tools and other sources of information.




.. include:: footer.txt
