name: title
layout: true
class: center, middle, inverse

---

# Handling raw text #

---
layout: false

There are a few things that we probably need to do to text in order to prepare it for useful analysis. If we just have the text of a novel, we will probably want to break it into words. Mostly so far, we've used corpora that have already been broken into words, but we can take an arbitrary piece of text and tokenize it.

```python
>>> text = "Hello, isn't this interesting?"
>>> text.split()
['Hello,', "isn't", 'this', 'interesting?']
```

NLTK contains a more sophisticated tokenizer that separates out punctuation and contractions.

```python
>>> import nltk
>>> from nltk import word_tokenize
>>> word_tokenize(text)
['Hello', ',', 'is', "n't", 'this', 'interesting', '?']
```

This makes it easier to see patterns among related things, since "Hello" is no longer counted as a different word from "Hello,", and so on.

In chapter 3, the NLTK book walks through how to retrieve a page of text from the web and extract it; maybe we will come back to this. It's riddled with technical things that may or may not be familiar: HTML, Javascript, web requests, RSS feeds.

---

## Regular Expressions ##

Regular expressions provide a way to do pattern matching, which is super useful. Regular expressions are notoriously complicated, though the idea is pretty straightforward. There is basically a little grammar of "patterns" to learn.

### Task 1 ###

Suppose that we want to find all the words that end in "ed". How might we proceed? To begin, let's get a list of words to search in.

```python
import nltk
nltk.corpus.words.words('en')[:10]
# get English words, and remove proper names by using only lowercase ones
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
```

Now we want to find the words that end in "ed". Regular expressions allow us to search for a pattern, and so the pattern we want to search for is: "an e, followed by a d, which ends the word."

---

To bring in useful RE functions:

```python
import re
```

And now search through `wordlist` for just those words that match the pattern. The way you indicate the end of the word is `$`.

To search for a pattern, we can use `re.search()` with the pattern `ed$`. The idea here is that it is going to find anything that contains the pattern e-then-d-then-end (which necessarily will be at the end of the word). The way that `re.search()` works is to take the pattern, and then the string to search.

```python
>>> edwords = [w for w in wordlist if re.search("ed$", w)]
>>> edwords[:5]
['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', ...]
```

There are other things we can do with the search string.

---

`ed$` looked for exactly `e` then `d` then the end of the word. If you don't want to match specifically "d", but just get any word whose penultimate letter is "e", you can do this by using the "match any character" indicator, `.`.

```python
>>> exwords = [w for w in wordlist if re.search("e.$", w)]
>>> exwords[:5]
['abaiser', 'abaissed', 'abandoned', 'abandonee', 'abandoner', ...]
```

(If you need to match an actual ".", then you need to "escape" the `.` by putting a `\` before it, as in `\.`. The `\` means "take the next character literally." So if you need to match an actual "\", then you need to use `\\` in your regular expression.)
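To see that escaping business in action, here is a tiny sketch using a couple of made-up strings rather than `wordlist`:

```python
>>> [w for w in ["etc.", "etch"] if re.search("etc.$", w)]   # "." matches any character
['etc.', 'etch']
>>> [w for w in ["etc.", "etch"] if re.search(r"etc\.$", w)] # "\." matches only a literal period
['etc.']
```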
In the same way `$` marks the end of a word, the `^` character marks the beginning. So, how to do "starts with 'un'"?

--

```python
>>> unwords = [w for w in wordlist if re.search("^un", w)]
>>> unwords[:5]
['un', 'unabandoned', 'unabased', 'unabasedly', 'unabashable', ...]
```

--

Five letter words that start with "c", end with "h", and have a "u" in the middle?

--

```python
>>> couchwords = [w for w in wordlist if re.search("^c.u.h$", w)]
>>> couchwords
['cauch', 'couch', 'cough', 'couth', 'crush', 'cruth']
```

---

If you don't want to match just any character, but you don't want one specific character either, you can match any one of a set of characters by enclosing the options in square brackets, like `[aeo]`.

```python
>>> [w for w in wordlist if re.search("^c.[aeo].h$", w)]
['cheth', 'clash', 'closh', 'cloth', 'coach', 'crash', 'cyath']
```

This is getting pretty abstract, so let's look at an application.

For a few years, when SMS was taking off but phones were largely still dumb, you might have wanted to text "FOOD" to somebody when all you had were numbers. There has long been a mnemonic number-to-letter association on phones; you can still see it even on the iPhone's virtual keypad. This was generally to help remember a number, rather than to encode a word, and there are mismatches in both directions: 0 and 1 do not correspond to letters, and Q and Z were not originally assigned to numbers.

| | | |
|-----------|--------|-----------|
| 1 | 2 ABC | 3 DEF |
| 4 GHI | 5 JKL | 6 MNO |
| 7 PRS (Q) | 8 TUV | 9 WXY (Z) |
| | 0 OPER | |

---

| | | |
|-----------|--------|-----------|
| 1 | 2 ABC | 3 DEF |
| 4 GHI | 5 JKL | 6 MNO |
| 7 PRS (Q) | 8 TUV | 9 WXY (Z) |
| | 0 OPER | |

So: FOOD might be rendered as 3663. Unless you meant DOME. The way T9 predictive text works is that it figures out what the options are based on what you've typed, picks the most frequent one as the first guess, and allows a way to advance to the next most frequent word after that.

So what are the other options for 3663? What we want is four-letter words that start and end with one of the letters D, E, F, and have two letters drawn from M, N, O in the middle.

--

Or, if you type 4653, it could be any of...

```python
[w for w in wordlist if re.search('^[ghi][mno][jkl][def]$', w)]
['gold', 'golf', 'hold', 'hole']
```

---

We can go on a side-trip to generalize this. Why not? Let's say we want to take a list of numbers, and find the words that we can make from them. The way that we'll do this is that we'll *build* the regular expression as a string, and then use it as an argument to `re.search()`.

Game plan:

- take a list of numbers like `[3, 6]`
- end with `[w for w in wordlist if re.search('^[def][mno]$', w)]`

There are a few steps in the middle there. Here's how I'd approach it:

- convert `[3, 6]` to `['def', 'mno']`, mapping numbers to letter groups
- build a string starting with `^[`, ending with `]$`, with the letter groups between
- do a list comprehension using the string as the argument to `re.search()`

The first thing we need to do is build a mapping from numbers to character classes. So, e.g.: `mapping[3] == 'def'` and `mapping[6] == 'mno'`. How would we create this? Also: what should `mapping[0]` be? What should `mapping[1]` be?

---

```python
mapping = ['', '', 'abc', 'def', 'ghi', 'jkl', 'mno', 'pqrs', 'tuv', 'wxyz']
```

Now, we want to get from `[3, 6]` to `['def', 'mno']`. How would we approach this? That is, what list comprehension would you use?

--

```python
nums = [3, 6]
stringified_nums = [mapping[n] for n in nums]
```

Now, check this out:

```python
>>> 'PIZZA'.join(stringified_nums)
'defPIZZAmno'
```
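The `'PIZZA'` there is just to make the behavior visible: `join` glues the string on the left between the elements of the list. Swapping in `']['` as the separator (a quick check, using the letter groups from above) gets us most of the way to a pattern:

```python
>>> ']['.join(['def', 'mno'])
'def][mno'
```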
We want to build a search string starting with `^[` and ending with `]$`, with the letter groups in between. So, if `nums` were just `[3]`, then `stringified_nums` would be just `['def']`, and the string we're building should be `^[def]$`. And for `[3, 6]`, we want `^[def][mno]$`.

Ok, the pieces are all there; how do we build this pattern string?

--

```python
stringified_nums = [mapping[n] for n in nums]
pattern = '^['+(']['.join(stringified_nums))+']$'
[w for w in wordlist if re.search(pattern, w)]
```

---

So, if we define this as a function, we can now check number lists pretty easily.

```python
def t9words(nums):
    mapping = ['', '', 'abc', 'def', 'ghi', 'jkl', 'mno', 'pqrs', 'tuv', 'wxyz']
    stringified_nums = [mapping[n] for n in nums]
    pattern = '^['+(']['.join(stringified_nums))+']$'
    return [w for w in wordlist if re.search(pattern, w)]
```

```python
>>> t9words([3, 6, 6, 3])
['dome', 'done', 'food']
>>> t9words([2, 4, 3, 3, 7, 3])
['cheese']
```

---

Ok, back to regular expressions. There are even more complicated things we can do. The basic thing is that there is essentially a grammar/syntax/semantics to these regular expression patterns. We've seen how you can group things to provide disjunction (`[]`), and how to mark the beginning (`^`) and end (`$`) of a string.

```python
>>> [w for w in wordlist if re.search("^.lame$", w)]
['blame', 'clame', 'flame']
```

```python
>>> [w for w in wordlist if re.search("^.?lame$", w)]
['blame', 'clame', 'flame', 'lame']
```

```python
>>> [w for w in wordlist if re.search("^[bf]lame$", w)]
['blame', 'flame']
```

```python
>>> [w for w in wordlist if re.search("^[laeou]ser$", w)]
['user']
```

```python
>>> [w for w in wordlist if re.search("^[laeou]+ser$", w)]
['easer', 'laser', 'leaser', 'looser', 'loser', 'user']
```

```python
>>> [w for w in wordlist if re.search("^[laeou]*ser$", w)]
['easer', 'laser', 'leaser', 'looser', 'loser', 'ser', 'user']
```

The `+` means "1 or more of the preceding thing." A `*` is like `+` but means "0 or more of the preceding thing." (The `?` used above means "0 or 1 of the preceding thing," which is why `lame` itself showed up in the second list.)

---

This kind of pattern matching is pretty generally useful. Here's one thing you can do with it: break the suffix off a word. Suppose we have *quickly* and *singing* and *watches*, and we want to pull the suffix and the stem apart. So, we want to have *quick* and *ly*, and *sing* and *ing*, and *watch* and *es*.

New function: `re.findall()` -- this takes a regular expression and a string, and matches the regular expression as many times as it can in the string. The "splitting" can be accomplished by grouping things together with parentheses. To wit:

```python
>>> re.findall(r'^(.*)(ing|ly|es|s)$', "quickly")
[('quick', 'ly')]
```

The `r` in front of the string makes it a "raw" string, so backslashes get passed through to the regular expression engine instead of being interpreted by Python; it's a good habit when writing regular expressions. The first group is `(.*)`, which needs to follow the beginning of the line `^`. The second group is a disjunctive expression, with `|` separating the options: `(ing|ly|es|s)`. The second group needs to be right before the end of the string. The output will contain *just* the things that were in groups (so the `^` and `$` were used to find the matches, but they are discarded when recording the results).

---

This works pretty well.

```python
>>> re.findall(r'^(.*)(ing|ly|es|s)$', "singing")
[('sing', 'ing')]
```

```python
>>> re.findall(r'^(.*)(ing|ly|es|s)$', "watches")
[('watche', 's')]
```

Oops. What went wrong? This is pretty subtle. It turns out that `.*` is "greedy" and will match everything it can. So, it would prefer to match `watche` and let the second group match just `s`. There's a modifier `?` that will relax the `*` to be non-greedy ("lazy"), stopping as soon as it can; this effectively lets the second group grab the longest suffix it can.

```python
>>> re.findall(r'^(.*?)(ing|ly|es|s)$', "watches")
[('watch', 'es')]
```
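Before moving on, here is a minimal sketch of how you might bundle this pattern into a helper function. The name `split_suffix` and the short suffix list are just for illustration; a serious stemmer would need a much longer suffix list and more care.

```python
import re

def split_suffix(word):
    """Split a word into (stem, suffix) using the non-greedy pattern above."""
    matches = re.findall(r'^(.*?)(ing|ly|es|s)$', word)
    # if no suffix matches, treat the whole word as the stem
    return matches[0] if matches else (word, '')

[split_suffix(w) for w in ["quickly", "singing", "watches", "blame"]]
# [('quick', 'ly'), ('sing', 'ing'), ('watch', 'es'), ('blame', '')]
```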
---

# Stemming, segmentation, tagging #

---
layout: false

We only just kind of started looking at "stemming," but let's finish this off. In the book, it defines `raw` like this and then runs the stemmers on it.

```python
>>> from nltk import word_tokenize
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = word_tokenize(raw)
```

But I'm lazy and I know that's in the Monty Python script. So:

```python
>>> from nltk.book import *
```

---

Suppose that we want to find it. Lists have an `index()` method that tells us where an item is:

```python
>>> ll = [2, 3, 5, 7, 11, 13, 17, 19]
>>> ll.index(7)     # position of the first 7 in the list
3
>>> ll.index(1)     # raises ValueError, since 1 is not in the list
```

So, find a unique-looking word:

```python
>>> text6.index("farcical")
>>> tokens = text6[1817:1856]
```

---

```python
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> wnl = nltk.WordNetLemmatizer()
>>> [porter.stem(w) for w in tokens]
['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut',
 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'Suprem',
 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from', 'the', 'mass', ',',
 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']
```

--

```python
>>> [lancaster.stem(w) for w in tokens]
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut',
 'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem',
 'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not',
 'from', 'som', 'farc', 'aqu', 'ceremony', '.']
```

--

```python
>>> [wnl.lemmatize(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond',
 'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of',
 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a',
 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical',
 'aquatic', 'ceremony', '.']
```

---

Tokenizing. NLTK has a built-in tokenizer that actually does a pretty good job.

```python
r = nltk.corpus.treebank_raw.raw()
r[:100]
# join the tokens with "/ " to make the token boundaries visible
'/ '.join(word_tokenize(r)[:100])
```

---

POS tagging

```python
>>> text = word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
 ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
```

```python
>>> t = nltk.Text(w.lower() for w in text6)
>>> t.similar("sword")     # words that show up in contexts similar to "sword"
>>> t.similar("coconut")
```
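As a small follow-up to the tagging example above (a sketch; `tag_fd` is just an illustrative name), you can tally how often each part-of-speech tag occurs with `nltk.FreqDist`:

```python
import nltk
from nltk import word_tokenize

text = word_tokenize("They refuse to permit us to obtain the refuse permit")
tagged = nltk.pos_tag(text)

# count the tags, ignoring the words themselves
tag_fd = nltk.FreqDist(tag for (word, tag) in tagged)
tag_fd.most_common()
# e.g. [('PRP', 2), ('TO', 2), ('VB', 2), ('NN', 2), ('VBP', 1), ('DT', 1)]
```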