name: title
layout: true
class: center, middle, inverse

---

# Categorizing, tagging, training #

---

layout: false

Apart from content, I want to see whether we can get Jupyter Notebook working for everyone as well. I am unsure whether it will solve people's path problems or just create new and different ones, but it seems like a pleasant environment to work in once we get used to it.

# Reminder: POS tagging #

```python
>>> import nltk
>>> text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
>>> nltk.help.upenn_tagset('DT')
>>> nltk.help.upenn_tagset('N.*')
```

```python
>>> from nltk.book import *
>>> t = nltk.Text(w.lower() for w in text6)
>>> t.similar("unladen")
african
```

---

```python
>>> from nltk.corpus import brown
>>> ' '.join(brown.sents()[0])
>>> brown.categories()
>>> ' '.join(brown.sents(categories='reviews')[0])
>>> brown.tagged_words()[:10]
>>> brown.tagged_words(tagset="universal")[:10]
>>> brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
>>> brown_news_tagged[:5]
```

What are the most common tags? (Or: what are the tags, in decreasing frequency order?)

```python
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.most_common()
>>> tag_fd.plot()
>>> tag_fd.plot(cumulative=True)
```

---

Answering some questions one might imagine having. Nouns: what are they usually preceded by? (Though this part is a bit shaky.) We look at bigrams, and then at the preceders.

```python
>>> word_tag_pairs = list(nltk.bigrams(brown_news_tagged))
>>> len(word_tag_pairs)
>>> word_tag_pairs[:3]
>>> noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
>>> fdist = nltk.FreqDist(noun_preceders)
>>> [tag for (tag, _) in fdist.most_common()]
```

What are the most common verbs? Well, first let's get a list of the verbs.

```python
>>> print(nltk.corpus.treebank.readme())
>>> wsj = nltk.corpus.treebank.tagged_words(tagset='universal')
>>> print(wsj[:10])
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> print(word_tag_fd.most_common()[:10])
```

---

Now that we have an ordered list of tagged words, get just the verbs.

```python
>>> wsjvs = [wt[0] for (wt, _) in word_tag_fd.most_common() if wt[1] == 'VERB']
>>> len(wsjvs)
>>> print(wsjvs[:10])
```

Given the (word, tag) structure, we can build a conditional frequency distribution (CFD); the first element of each pair is the condition. So we can answer the question: when the word is "yield", how often is it a verb?

```python
>>> cfd1 = nltk.ConditionalFreqDist(wsj)
>>> cfd1['yield']
>>> cfd1['yield'].most_common()
>>> cfd1['cut'].most_common()
```

Let's experiment with the Treebank's own tags. Now a CFD with the tag as the condition, collecting the words. (We also rebuild `cfd1` over the Treebank tags, since the next slide looks words up by those tags.)

```python
>>> wsj = nltk.corpus.treebank.tagged_words()
>>> cfd1 = nltk.ConditionalFreqDist(wsj)
>>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
>>> list(cfd2['VBN'])
>>> list(cfd2['VBD'])
```

---

What verbs can be both a VBN and a VBD?

```python
>>> print(cfd1.conditions()[:10])
>>> cfd1['cigarette']
>>> 'NN' in cfd1['cigarette']
>>> cfd1['kicked']
>>> 'NN' in cfd1['kicked']
>>> 'VBD' in cfd1['kicked']
>>> boths = [w for w in cfd1.conditions() if 'VBD' in cfd1[w] and 'VBN' in cfd1[w]]
>>> print(boths)
```
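(A quick check of my own, not from the chapter: how many such verbs are there, and what share of all VBD words is that?)

```python
>>> len(boths)
>>> vbd_words = [w for w in cfd1.conditions() if 'VBD' in cfd1[w]]
>>> len(boths) / len(vbd_words)  # share of past-tense verbs that also occur as VBN
```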
*Kicked* is one of them; let's have a look at the context before each.

```python
>>> idx1 = wsj.index(('kicked', 'VBD'))
>>> wsj[idx1-4:idx1+1]
>>> idx2 = wsj.index(('kicked', 'VBN'))
>>> wsj[idx2-4:idx2+1]
```

Problem from the chapter: is a past participle (VBN) usually preceded by a form of *have*, as opposed to a VBD? What can the preceders look like?

```python
>>> wsj_tag_pairs = list(nltk.bigrams(wsj))
>>> wsj_tag_pairs[0]
>>> vbn_preceders = [p1 for (p1, p2) in wsj_tag_pairs if p2[1] == 'VBN']
>>> vbn_preceders[0]
>>> fdp = nltk.FreqDist(w for (w, t) in vbn_preceders)
>>> print(fdp.most_common()[:20])
```

---

What tags are there? Let's see. We'll make a CFD with the tag as the condition, so we need to reverse the pairs in the text; and we want to return a `dict` so that we can look things up by tag. First, a reminder of how `dict` works on a list of pairs:

```python
>>> pairs = [('a', 4), ('b', [2, 4, 5])]
>>> pdict = dict(pairs)
>>> pdict
>>> pdict['a']
>>> pdict['b']
```

So, if we want to find all tags that start with `NN`:

```python
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())

>>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
>>> for tag in sorted(tagdict):
...     print(tag, tagdict[tag])
NN [('year', 137), ('time', 97), ('state', 88), ('week', 85), ('man', 72)]
NN$ [("year's", 13), ("world's", 8), ("state's", 7), ("nation's", 6), ("company's", 6)]
NN$-HL [("Golf's", 1), ("Navy's", 1)]
NN$-TL [("President's", 11), ("Army's", 3), ("Gallery's", 3), ("University's", 3), ("League's", 3)]
NN-HL [('sp.', 2), ('problem', 2), ('Question', 2), ('business', 2), ('Salary', 2)]
NN-NC [('eva', 1), ('aya', 1), ('ova', 1)]
NN-TL [('President', 88), ('House', 68), ('State', 59), ('University', 42), ('City', 41)]
NN-TL-HL [('Fort', 2), ('Dr.', 1), ('Oak', 1), ('Street', 1), ('Basin', 1)]
NNS [('years', 101), ('members', 69), ('people', 52), ('sales', 51), ('men', 46)]
NNS$ [("children's", 7), ("women's", 5), ("janitors'", 3), ("men's", 3), ("taxpayers'", 2)]
NNS$-HL [("Dealers'", 1), ("Idols'", 1)]
NNS$-TL [("Women's", 4), ("States'", 3), ("Giants'", 2), ("Bros.'", 1), ("Writers'", 1)]
NNS-HL [('comments', 1), ('Offenses', 1), ('Sacrifices', 1), ('funds', 1), ('Results', 1)]
NNS-TL [('States', 38), ('Nations', 11), ('Masters', 10), ('Rules', 9), ('Communists', 9)]
NNS-TL-HL [('Nations', 1)]
```

---

What words follow "often"? Let's look in the "learned" subcorpus.

```python
>>> brown_learned_text = brown.words(categories='learned')
>>> sorted(set(b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'often'))
[',', '.', 'accomplished', 'analytically', 'appear', 'apt', 'associated', 'assuming',
'became', 'become', 'been', 'began', 'call', 'called', 'carefully', 'chose', ...]
```

What POSes follow it?

```python
>>> brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')
>>> tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']
>>> fd = nltk.FreqDist(tags)
>>> fd.tabulate()
 PRT  ADV  ADP    . VERB  ADJ
   2    8    7    4   37    6
```

What *verb to verb* sequences do we see?

```python
from nltk.corpus import brown

def process(sentence):
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if t1.startswith('V') and t2 == 'TO' and t3.startswith('V'):
            print(w1, w2, w3)

>>> for tagged_sent in brown.tagged_sents():
...     process(tagged_sent)
...
combined to achieve
continue to place
serve to protect
wanted to wait
allowed to place
expected to become
...
```
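A variant (my own sketch, not from the chapter): collect the triples into a `FreqDist` so we see the most frequent patterns instead of a raw stream.

```python
>>> vtv = nltk.FreqDist((w1.lower(), w2, w3.lower())
...                     for sent in brown.tagged_sents()
...                     for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sent)
...                     if t1.startswith('V') and t2 == 'TO' and t3.startswith('V'))
>>> vtv.most_common(5)
```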
---

Find words that have ambiguous POSes (four or more), listed in frequency order.

```python
>>> brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
...                                 for (word, tag) in brown_news_tagged)
>>> for word in sorted(data.conditions()):
...     if len(data[word]) > 3:
...         tags = [tag for (tag, _) in data[word].most_common()]
...         print(word, ' '.join(tags))
...
best ADJ ADV NP V
better ADJ ADV V DET
close ADV ADJ V N
cut V N VN VD
even ADV DET ADJ V
grant NP N V -
hit V VD VN N
lay ADJ V NP VD
left VD ADJ N VN
like CNJ V ADJ P -
near P ADV ADJ DET
open ADJ V N ADV
past N ADJ DET P
present ADJ ADV V N
read V VN VD NP
right ADJ N DET ADV
second NUM ADV DET N
set VN V VD N -
that CNJ V WH DET
```

---

Check out the concordance tool `nltk.app.concordance()` and see how it works. (Though I can't get it to work; the kernel just dies on me. So.)

After this, we dive into dictionaries and default dictionaries. Here I might just talk through what is in the actual book.

---

Making our own taggers, as a way of understanding what NLTK is doing. What's the most likely tag?

```python
>>> tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
>>> nltk.FreqDist(tags).max()
'NN'
```

So we'd have moderate success if we just tagged *everything* as `NN`.

```python
>>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
>>> tokens = nltk.word_tokenize(raw)
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'),
('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'),
('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'),
('I', 'NN'), ('am', 'NN'), ('!', 'NN')]
```

```python
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> default_tagger.evaluate(brown_tagged_sents)
0.13089484257215028
```

---

A regular-expression tagger: the patterns are tried in order, and the first match wins.

```python
>>> patterns = [
...     (r'.*ing$', 'VBG'),                # gerunds
...     (r'.*ed$', 'VBD'),                 # simple past
...     (r'.*es$', 'VBZ'),                 # 3rd singular present
...     (r'.*ould$', 'MD'),                # modals
...     (r'.*\'s$', 'NN$'),                # possessive nouns
...     (r'.*s$', 'NNS'),                  # plural nouns
...     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
...     (r'.*', 'NN')                      # nouns (default)
... ]
>>> regexp_tagger = nltk.RegexpTagger(patterns)
>>> brown_sents = brown.sents(categories='news')
>>> regexp_tagger.tag(brown_sents[3])
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'),
('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'),
("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'),
('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ...]
>>> regexp_tagger.evaluate(brown_tagged_sents)
0.20326391789486245
```
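As a quick spot check (my own example, not from the book), we can tag a hand-picked list of words to see which pattern fires for each:

```python
>>> regexp_tagger.tag(['running', 'kicked', 'would', "John's", 'cats', '3.14', 'the'])
[('running', 'VBG'), ('kicked', 'VBD'), ('would', 'MD'), ("John's", 'NN$'),
('cats', 'NNS'), ('3.14', 'CD'), ('the', 'NN')]
```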
---

A lookup tagger: find the 100 most frequent words and the most likely tag for each.

```python
>>> fd = nltk.FreqDist(brown.words(categories='news'))
>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
>>> most_freq_words = fd.most_common(100)
>>> likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)
>>> baseline_tagger.evaluate(brown_tagged_sents)
0.45578495136941344
```

```python
>>> sent = brown.sents(categories='news')[3]
>>> baseline_tagger.tag(sent)
[('``', '``'), ('Only', None), ('a', 'AT'), ('relative', None), ('handful', None),
('of', 'IN'), ('such', None), ('reports', None), ('was', 'BEDZ'), ('received', None),
("''", "''"), (',', ','), ('the', 'AT'), ('jury', None), ('said', 'VBD'), (',', ','),
('``', '``'), ('considering', None), ('the', 'AT'), ('widespread', None),
('interest', None), ('in', 'IN'), ('the', 'AT'), ('election', None), (',', ','),
('the', 'AT'), ('number', None), ('of', 'IN'), ('voters', None), ('and', 'CC'),
('the', 'AT'), ('size', None), ('of', 'IN'), ('this', 'DT'), ('city', None),
("''", "''"), ('.', '.')]
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags,
...                                      backoff=nltk.DefaultTagger('NN'))
>>> baseline_tagger.tag(sent)
```

---

How does it do with the 100 most frequent words? With 1,000? With 10,000? Powers of two give us a nice spread of model sizes:

```python
import pylab
pylab.arange(15)
2 ** pylab.arange(15)
```

```python
def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))
```

```python
word_freqs = nltk.FreqDist(brown.words(categories='news')).most_common()
words_by_freq = [w for (w, _) in word_freqs]
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
performance(cfd, words_by_freq[:64])
performance(cfd, words_by_freq[:128])
```

Let's collect these together into a graph. (This is going to be slow.)

```python
sizes = 2 ** pylab.arange(15)
perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
pylab.plot(sizes, perfs, '-bo')
pylab.title('Lookup Tagger Performance with Varying Model Size')
pylab.xlabel('Model Size')
pylab.ylabel('Performance')
pylab.show()
```
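`pylab` is matplotlib's legacy interface; here is an equivalent plot with the current `matplotlib.pyplot` API (my own variant; the log-scaled x-axis is my addition, since the model sizes grow geometrically):

```python
import matplotlib.pyplot as plt
import numpy as np

sizes = 2 ** np.arange(15)
perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
plt.plot(sizes, perfs, '-bo')
plt.xscale('log')  # sizes are powers of two, so log spacing reads better
plt.title('Lookup Tagger Performance with Varying Model Size')
plt.xlabel('Model Size')
plt.ylabel('Performance')
plt.show()
```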
---

N-gram tagging: unigram tagging, then bigram tagging.

```python
>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
>>> unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
>>> unigram_tagger.tag(brown_sents[2007])
>>> unigram_tagger.evaluate(brown_tagged_sents)
>>> size = int(len(brown_tagged_sents) * 0.9)
>>> size
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> unigram_tagger = nltk.UnigramTagger(train_sents)
>>> unigram_tagger.evaluate(test_sents)
```

```python
>>> bigram_tagger = nltk.BigramTagger(train_sents)
>>> bigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'),
('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','),
('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'),
('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'),
('.', '.')]
>>> unseen_sent = brown_sents[4203]
>>> bigram_tagger.tag(unseen_sent)
>>> bigram_tagger.evaluate(test_sents)
>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
>>> t2.evaluate(test_sents)
```

---

Save the trained tagger with pickle, and load it back. (`pwd` is an IPython convenience, to check where the file landed.)

```python
>>> from pickle import dump
>>> output = open('t2.pkl', 'wb')
>>> dump(t2, output, -1)
>>> output.close()
>>> pwd
```

```python
>>> from pickle import load
>>> input = open('t2.pkl', 'rb')
>>> tagger = load(input)
>>> input.close()
```
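To confirm the round trip worked, tag a fresh sentence with the reloaded tagger (this check is the one the book uses in its section on storing taggers):

```python
>>> text = """The board's action shows what free enterprise
... is up against in our complex maze of regulatory laws ."""
>>> tokens = text.split()
>>> tagger.tag(tokens)
```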