name: title
layout: true
class: center, middle, inverse

---

# Handling raw text #

---
layout: false

There are a few things that we probably need to do to text in order to prepare it for useful analysis. If we just have the text of a novel, we will probably want to break it into words. Mostly so far, we've used corpora that have already been broken into words, but we can take an arbitrary piece of text and tokenize it.

```python
>>> text = "Hello, isn't this interesting?"
>>> text.split()
['Hello,', "isn't", 'this', 'interesting?']
```

NLTK contains a more sophisticated tokenizer that separates out punctuation and contractions.

```python
>>> import nltk
>>> from nltk import word_tokenize
>>> word_tokenize(text)
['Hello', ',', 'is', "n't", 'this', 'interesting', '?']
```

This makes it easier to see patterns among related things, since "Hello" is no longer counted as a different word from "Hello,", and so on.

In chapter 3, the NLTK book walks through how to retrieve a page of text from the web and extract it; maybe we will come back to this. It's riddled with technical things that may or may not be familiar: HTML, Javascript, web requests, RSS feeds.

---

## Regular Expressions ##

Regular expressions provide a way to do pattern matching, which is super useful. Regular expressions are notoriously complicated, though the idea is pretty straightforward. There is basically a little grammar of "patterns" to learn.

### Task 1 ###

Suppose that we want to find all the words that end in "ed". How might we proceed? To begin, let's get a list of words to search in.

```python
import nltk
nltk.corpus.words.words('en')[:10]
# get English words, and remove proper names by using only lowercase ones
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
```

Now we want to find the words that end in "ed". Regular expressions allow us to search for a pattern, and so the pattern we want to search for is: "an e, followed by a d, which ends the word."

---

To bring in useful RE functions:

```python
import re
```

And now search through `wordlist` for just those words that match the pattern. The way you indicate the end of the word is `$`.

To search for a pattern, we can use `re.search()` with the pattern `ed$`. The idea here is that it is going to find anything that contains the pattern e-then-d-then-end (which necessarily will be at the end of the word). The way that `re.search()` works is to take the pattern, and then the string to search.

```python
>>> edwords = [w for w in wordlist if re.search("ed$", w)]
>>> edwords[:5]
['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', ...]
```

There are other things we can do with the search string.

---

`ed$` looked for exactly `e` then `d` then the end of the word. If you don't want to match specifically "d", but just get any word whose penultimate letter is "e", you can do this by using the "match any character" indicator, `.`.

```python
>>> exwords = [w for w in wordlist if re.search("e.$", w)]
>>> exwords[:5]
['abaiser', 'abaissed', 'abandoned', 'abandonee', 'abandoner', ...]
```

(If you need to match an actual ".", then you need to "escape" the `.` by putting a `\` before it, as in `\.`. The `\` means "take the next character literally." So if you need to match an actual "\", then you need to use `\\` in your regular expression.)
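To see that escaping business in action, here is a tiny sketch using a couple of made-up strings rather than `wordlist`:

```python
>>> [w for w in ["etc.", "etch"] if re.search("etc.$", w)]   # "." matches any character
['etc.', 'etch']
>>> [w for w in ["etc.", "etch"] if re.search(r"etc\.$", w)] # "\." matches only a literal period
['etc.']
```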
In the same way `$` marks the end of a word, the `^` character marks the beginning. So, how to do "starts with 'un'"?

--

```python
>>> unwords = [w for w in wordlist if re.search("^un", w)]
>>> unwords[:5]
['un', 'unabandoned', 'unabased', 'unabasedly', 'unabashable', ...]
```

--

Five letter words that start with "c", end with "h", and have a "u" in the middle?

--

```python
>>> couchwords = [w for w in wordlist if re.search("^c.u.h$", w)]
>>> couchwords
['cauch', 'couch', 'cough', 'couth', 'crush', 'cruth']
```

---

If you don't want to match just any character, but you don't want one specific character either, you can match any one of a set of characters by enclosing the options in square brackets, like `[aeo]`.

```python
>>> [w for w in wordlist if re.search("^c.[aeo].h$", w)]
['cheth', 'clash', 'closh', 'cloth', 'coach', 'crash', 'cyath']
```

This is getting pretty abstract, so let's look at an application.

For a few years, when SMS was taking off but phones were largely still dumb, you might have wanted to text "FOOD" to somebody when all you had were numbers. There has long been a mnemonic number-to-letter association on phones; you can still see it even on the iPhone's virtual keypad. This was generally to help remember a number, rather than to encode a word, and there are mismatches in both directions: 0 and 1 do not correspond to letters, and Q and Z were not originally assigned to numbers.

| | | |
|-----------|--------|-----------|
| 1 | 2 ABC | 3 DEF |
| 4 GHI | 5 JKL | 6 MNO |
| 7 PRS (Q) | 8 TUV | 9 WXY (Z) |
| | 0 OPER | |

---

| | | |
|-----------|--------|-----------|
| 1 | 2 ABC | 3 DEF |
| 4 GHI | 5 JKL | 6 MNO |
| 7 PRS (Q) | 8 TUV | 9 WXY (Z) |
| | 0 OPER | |

So: FOOD might be rendered as 3663. Unless you meant DOME. The way T9 predictive text works is that it figures out what the options are based on what you've typed, picks the most frequent one as the first guess, and allows a way to advance to the next most frequent word after that.

So what are the other options for 3663? What we want is four-letter words that start and end with one of the letters D, E, F, and have two letters drawn from M, N, O in the middle.

--

Or, if you type 4653, it could be any of...

```python
[w for w in wordlist if re.search('^[ghi][mno][jkl][def]$', w)]
['gold', 'golf', 'hold', 'hole']
```

---

We can go on a side-trip to generalize this. Why not? Let's say we want to take a list of numbers, and find the words that we can make from them. The way that we'll do this is that we'll *build* the regular expression as a string, and then use it as an argument to `re.search()`.

Game plan:

- take a list of numbers like `[3, 6]`
- end with `[w for w in wordlist if re.search('^[def][mno]$', w)]`

There are a few steps in the middle there. Here's how I'd approach it:

- convert `[3, 6]` to `['def', 'mno']`, mapping numbers to letter groups
- build a string starting with `^[`, ending with `]$`, with the letter groups between
- do a list comprehension using the string as the argument to `re.search()`

The first thing we need to do is build a mapping from numbers to character classes. So, e.g.: `mapping[3] == 'def'` and `mapping[6] == 'mno'`. How would we create this? Also: what should `mapping[0]` be? What should `mapping[1]` be?

---

```python
mapping = ['', '', 'abc', 'def', 'ghi', 'jkl', 'mno', 'pqrs', 'tuv', 'wxyz']
```

Now, we want to get from `[3, 6]` to `['def', 'mno']`. How would we approach this? That is, what list comprehension would you use?

--

```python
nums = [3, 6]
stringified_nums = [mapping[n] for n in nums]
```

Now, check this out:

```python
>>> 'PIZZA'.join(stringified_nums)
'defPIZZAmno'
```
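The `'PIZZA'` there is just to make the behavior visible: `join` glues the string on the left between the elements of the list. Swapping in `']['` as the separator (a quick check, using the letter groups from above) gets us most of the way to a pattern:

```python
>>> ']['.join(['def', 'mno'])
'def][mno'
```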
We want to build a search string starting with `^[` and ending with `]$`, with the letter groups in between. So, if `nums` were just `[3]`, then `stringified_nums` would be just `['def']`, and the string we're building should be `^[def]$`. And for `[3, 6]`, we want `^[def][mno]$`.

Ok, the pieces are all there; how do we build this pattern string?

--

```python
stringified_nums = [mapping[n] for n in nums]
pattern = '^['+(']['.join(stringified_nums))+']$'
[w for w in wordlist if re.search(pattern, w)]
```

---

So, if we define this as a function, we can now check number lists pretty easily.

```python
def t9words(nums):
    mapping = ['', '', 'abc', 'def', 'ghi', 'jkl', 'mno', 'pqrs', 'tuv', 'wxyz']
    stringified_nums = [mapping[n] for n in nums]
    pattern = '^['+(']['.join(stringified_nums))+']$'
    return [w for w in wordlist if re.search(pattern, w)]
```

```python
>>> t9words([3, 6, 6, 3])
['dome', 'done', 'food']
>>> t9words([2, 4, 3, 3, 7, 3])
['cheese']
```

---

Ok, back to regular expressions. There are even more complicated things we can do. The basic thing is that there is essentially a grammar/syntax/semantics to these regular expression patterns. We've seen how you can group things to provide disjunction (`[]`), and how to mark the beginning (`^`) and end (`$`) of a string.

```python
>>> [w for w in wordlist if re.search("^.lame$", w)]
['blame', 'clame', 'flame']
```

```python
>>> [w for w in wordlist if re.search("^.?lame$", w)]
['blame', 'clame', 'flame', 'lame']
```

```python
>>> [w for w in wordlist if re.search("^[bf]lame$", w)]
['blame', 'flame']
```

```python
>>> [w for w in wordlist if re.search("^[laeou]ser$", w)]
['user']
```

```python
>>> [w for w in wordlist if re.search("^[laeou]+ser$", w)]
['easer', 'laser', 'leaser', 'looser', 'loser', 'user']
```

```python
>>> [w for w in wordlist if re.search("^[laeou]*ser$", w)]
['easer', 'laser', 'leaser', 'looser', 'loser', 'ser', 'user']
```

The `+` means "1 or more of the preceding thing." A `*` is like `+` but means "0 or more of the preceding thing." (The `?` used above means "0 or 1 of the preceding thing," which is why `lame` itself showed up in the second list.)

---

This kind of pattern matching is pretty generally useful. Here's one thing you can do with it: break the suffix off a word. Suppose we have *quickly* and *singing* and *watches*, and we want to pull the suffix and the stem apart. So, we want to have *quick* and *ly*, and *sing* and *ing*, and *watch* and *es*.

New function: `re.findall()` -- this takes a regular expression and a string, and matches the regular expression as many times as it can in the string. The "splitting" can be accomplished by grouping things together with parentheses. To wit:

```python
>>> re.findall(r'^(.*)(ing|ly|es|s)$', "quickly")
[('quick', 'ly')]
```

The `r` in front of the string makes it a "raw" string, so backslashes get passed through to the regular expression engine instead of being interpreted by Python; it's a good habit when writing regular expressions. The first group is `(.*)`, which needs to follow the beginning of the line `^`. The second group is a disjunctive expression, with `|` separating the options: `(ing|ly|es|s)`. The second group needs to be right before the end of the string. The output will contain *just* the things that were in groups (so the `^` and `$` were used to find the matches, but they are discarded when recording the results).

---

This works pretty well.

```python
>>> re.findall(r'^(.*)(ing|ly|es|s)$', "singing")
[('sing', 'ing')]
```

```python
>>> re.findall(r'^(.*)(ing|ly|es|s)$', "watches")
[('watche', 's')]
```

Oops. What went wrong? This is pretty subtle. It turns out that `.*` is "greedy" and will match everything it can. So, it would prefer to match `watche` and let the second group match just `s`. There's a modifier `?` that will relax the `*` to be non-greedy ("lazy"), stopping as soon as it can; this effectively lets the second group grab the longest suffix it can.

```python
>>> re.findall(r'^(.*?)(ing|ly|es|s)$', "watches")
[('watch', 'es')]
```
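Before moving on, here is a minimal sketch of how you might bundle this pattern into a helper function. The name `split_suffix` and the short suffix list are just for illustration; a serious stemmer would need a much longer suffix list and more care.

```python
import re

def split_suffix(word):
    """Split a word into (stem, suffix) using the non-greedy pattern above."""
    matches = re.findall(r'^(.*?)(ing|ly|es|s)$', word)
    # if no suffix matches, treat the whole word as the stem
    return matches[0] if matches else (word, '')

[split_suffix(w) for w in ["quickly", "singing", "watches", "blame"]]
# [('quick', 'ly'), ('sing', 'ing'), ('watch', 'es'), ('blame', '')]
```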
---

# Stemming, segmentation, tagging #

---
layout: false

We only just kind of started looking at "stemming," but let's finish this off. In the book, it defines `raw` like this and then runs the stemmers on it.

```python
>>> from nltk import word_tokenize
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = word_tokenize(raw)
```

But I'm lazy and I know that's in the Monty Python script. So:

```python
>>> from nltk.book import *
```

---

Suppose that we want to find it. Lists have an `index()` method that tells us where an item is:

```python
>>> ll = [2, 3, 5, 7, 11, 13, 17, 19]
>>> ll.index(7)     # position of the first 7 in the list
3
>>> ll.index(1)     # raises ValueError, since 1 is not in the list
```

So, find a unique-looking word:

```python
>>> text6.index("farcical")
>>> tokens = text6[1817:1856]
```

---

```python
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> wnl = nltk.WordNetLemmatizer()
>>> [porter.stem(w) for w in tokens]
['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut',
 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'Suprem',
 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from', 'the', 'mass', ',',
 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']
```

--

```python
>>> [lancaster.stem(w) for w in tokens]
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut',
 'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem',
 'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not',
 'from', 'som', 'farc', 'aqu', 'ceremony', '.']
```

--

```python
>>> [wnl.lemmatize(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond',
 'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of',
 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a',
 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical',
 'aquatic', 'ceremony', '.']
```

---

Tokenizing. NLTK has a built-in tokenizer that actually does a pretty good job.

```python
r = nltk.corpus.treebank_raw.raw()
r[:100]
# join the tokens with "/ " to make the token boundaries visible
'/ '.join(word_tokenize(r)[:100])
```

---

POS tagging

```python
>>> text = word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
 ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
```

```python
>>> t = nltk.Text(w.lower() for w in text6)
>>> t.similar("sword")     # words that show up in contexts similar to "sword"
>>> t.similar("coconut")
```
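As a small follow-up to the tagging example above (a sketch; `tag_fd` is just an illustrative name), you can tally how often each part-of-speech tag occurs with `nltk.FreqDist`:

```python
import nltk
from nltk import word_tokenize

text = word_tokenize("They refuse to permit us to obtain the refuse permit")
tagged = nltk.pos_tag(text)

# count the tags, ignoring the words themselves
tag_fd = nltk.FreqDist(tag for (word, tag) in tagged)
tag_fd.most_common()
# e.g. [('PRP', 2), ('TO', 2), ('VB', 2), ('NN', 2), ('VBP', 1), ('DT', 1)]
```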