name: title
layout: true
class: center, middle, inverse

---

# Categorizing, tagging, training #

---

layout: false

Apart from content, I want to see whether we can get Jupyter Notebook working for everyone as well. I am unsure whether it will solve people's path problems or just create new and different ones, but it seems like a pleasant environment to work in once we get used to it.

# Reminder: POS tagging #

```python
>>> import nltk
>>> text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
>>> nltk.help.upenn_tagset('DT')
>>> nltk.help.upenn_tagset('N.*')
```

```python
>>> from nltk.book import *
>>> t = nltk.Text(w.lower() for w in text6)
>>> t.similar("unladen")
african
```

---

```python
>>> from nltk.corpus import brown
>>> ' '.join(brown.sents()[0])
>>> brown.categories()
>>> ' '.join(brown.sents(categories='reviews')[0])
>>> brown.tagged_words()[:10]
>>> brown.tagged_words(tagset="universal")[:10]
>>> brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
>>> brown_news_tagged[:5]
```

What are the most common tags? (Or: what are the tags, in decreasing frequency order?)

```python
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.most_common()
>>> tag_fd.plot()
>>> tag_fd.plot(cumulative=True)
```

---

Answering some questions one might imagine having. Nouns: what are they usually preceded by? (Though this part is a bit shaky.) We look at bigrams, and then at the preceders.

```python
>>> word_tag_pairs = list(nltk.bigrams(brown_news_tagged))
>>> len(word_tag_pairs)
>>> word_tag_pairs[:3]
>>> noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
>>> fdist = nltk.FreqDist(noun_preceders)
>>> [tag for (tag, _) in fdist.most_common()]
```

What are the most common verbs? Well, first let's get a list of the verbs.

```python
>>> print(nltk.corpus.treebank.readme())
>>> wsj = nltk.corpus.treebank.tagged_words(tagset='universal')
>>> print(wsj[:10])
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> print(word_tag_fd.most_common()[:10])
```

---

Now that we have an ordered list of tagged words, get just the verbs.

```python
>>> wsjvs = [wt[0] for (wt, _) in word_tag_fd.most_common() if wt[1] == 'VERB']
>>> len(wsjvs)
>>> print(wsjvs[:10])
```

Given the (word, tag) structure, we can build a conditional frequency distribution (CFD); the first element of each pair is the condition. So we can answer the question: when the word is "yield", how often is it a verb?

```python
>>> cfd1 = nltk.ConditionalFreqDist(wsj)
>>> cfd1['yield']
>>> cfd1['yield'].most_common()
>>> cfd1['cut'].most_common()
```

Let's experiment with the Treebank's own tags. Now a CFD with the tag as the condition, collecting the words. (We also rebuild `cfd1` over the Treebank tags, since the next slide looks words up by those tags.)

```python
>>> wsj = nltk.corpus.treebank.tagged_words()
>>> cfd1 = nltk.ConditionalFreqDist(wsj)
>>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
>>> list(cfd2['VBN'])
>>> list(cfd2['VBD'])
```

---

What verbs can be both a VBN and a VBD?

```python
>>> print(cfd1.conditions()[:10])
>>> cfd1['cigarette']
>>> 'NN' in cfd1['cigarette']
>>> cfd1['kicked']
>>> 'NN' in cfd1['kicked']
>>> 'VBD' in cfd1['kicked']
>>> boths = [w for w in cfd1.conditions() if 'VBD' in cfd1[w] and 'VBN' in cfd1[w]]
>>> print(boths)
```
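(A quick check of my own, not from the chapter: how many such verbs are there, and what share of all VBD words is that?)

```python
>>> len(boths)
>>> vbd_words = [w for w in cfd1.conditions() if 'VBD' in cfd1[w]]
>>> len(boths) / len(vbd_words)  # share of past-tense verbs that also occur as VBN
```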
*Kicked* is one of them; let's have a look at the context before each.

```python
>>> idx1 = wsj.index(('kicked', 'VBD'))
>>> wsj[idx1-4:idx1+1]
>>> idx2 = wsj.index(('kicked', 'VBN'))
>>> wsj[idx2-4:idx2+1]
```

Problem from the chapter: is a past participle (VBN) usually preceded by a form of *have*, as opposed to a VBD? What can the preceders look like?

```python
>>> wsj_tag_pairs = list(nltk.bigrams(wsj))
>>> wsj_tag_pairs[0]
>>> vbn_preceders = [p1 for (p1, p2) in wsj_tag_pairs if p2[1] == 'VBN']
>>> vbn_preceders[0]
>>> fdp = nltk.FreqDist(w for (w, t) in vbn_preceders)
>>> print(fdp.most_common()[:20])
```

---

What tags are there? Let's see. We'll make a CFD with the tag as the condition, so we need to reverse the pairs in the text; and we want to return a `dict` so that we can look things up by tag. First, a reminder of how `dict` works on a list of pairs:

```python
>>> pairs = [('a', 4), ('b', [2, 4, 5])]
>>> pdict = dict(pairs)
>>> pdict
>>> pdict['a']
>>> pdict['b']
```

So, if we want to find all tags that start with `NN`:

```python
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())

>>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
>>> for tag in sorted(tagdict):
...     print(tag, tagdict[tag])
NN [('year', 137), ('time', 97), ('state', 88), ('week', 85), ('man', 72)]
NN$ [("year's", 13), ("world's", 8), ("state's", 7), ("nation's", 6), ("company's", 6)]
NN$-HL [("Golf's", 1), ("Navy's", 1)]
NN$-TL [("President's", 11), ("Army's", 3), ("Gallery's", 3), ("University's", 3), ("League's", 3)]
NN-HL [('sp.', 2), ('problem', 2), ('Question', 2), ('business', 2), ('Salary', 2)]
NN-NC [('eva', 1), ('aya', 1), ('ova', 1)]
NN-TL [('President', 88), ('House', 68), ('State', 59), ('University', 42), ('City', 41)]
NN-TL-HL [('Fort', 2), ('Dr.', 1), ('Oak', 1), ('Street', 1), ('Basin', 1)]
NNS [('years', 101), ('members', 69), ('people', 52), ('sales', 51), ('men', 46)]
NNS$ [("children's", 7), ("women's", 5), ("janitors'", 3), ("men's", 3), ("taxpayers'", 2)]
NNS$-HL [("Dealers'", 1), ("Idols'", 1)]
NNS$-TL [("Women's", 4), ("States'", 3), ("Giants'", 2), ("Bros.'", 1), ("Writers'", 1)]
NNS-HL [('comments', 1), ('Offenses', 1), ('Sacrifices', 1), ('funds', 1), ('Results', 1)]
NNS-TL [('States', 38), ('Nations', 11), ('Masters', 10), ('Rules', 9), ('Communists', 9)]
NNS-TL-HL [('Nations', 1)]
```

---

What words follow "often"? Let's look in the "learned" subcorpus.

```python
>>> brown_learned_text = brown.words(categories='learned')
>>> sorted(set(b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'often'))
[',', '.', 'accomplished', 'analytically', 'appear', 'apt', 'associated', 'assuming',
'became', 'become', 'been', 'began', 'call', 'called', 'carefully', 'chose', ...]
```

What POSes follow it?

```python
>>> brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')
>>> tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']
>>> fd = nltk.FreqDist(tags)
>>> fd.tabulate()
 PRT  ADV  ADP    . VERB  ADJ
   2    8    7    4   37    6
```

What *verb to verb* sequences do we see?

```python
from nltk.corpus import brown

def process(sentence):
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if t1.startswith('V') and t2 == 'TO' and t3.startswith('V'):
            print(w1, w2, w3)

>>> for tagged_sent in brown.tagged_sents():
...     process(tagged_sent)
...
combined to achieve
continue to place
serve to protect
wanted to wait
allowed to place
expected to become
...
```
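A variant (my own sketch, not from the chapter): collect the triples into a `FreqDist` so we see the most frequent patterns instead of a raw stream.

```python
>>> vtv = nltk.FreqDist((w1.lower(), w2, w3.lower())
...                     for sent in brown.tagged_sents()
...                     for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sent)
...                     if t1.startswith('V') and t2 == 'TO' and t3.startswith('V'))
>>> vtv.most_common(5)
```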
---

Find words that have ambiguous POSes (four or more), listed in frequency order.

```python
>>> brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
...                                 for (word, tag) in brown_news_tagged)
>>> for word in sorted(data.conditions()):
...     if len(data[word]) > 3:
...         tags = [tag for (tag, _) in data[word].most_common()]
...         print(word, ' '.join(tags))
...
best ADJ ADV NP V
better ADJ ADV V DET
close ADV ADJ V N
cut V N VN VD
even ADV DET ADJ V
grant NP N V -
hit V VD VN N
lay ADJ V NP VD
left VD ADJ N VN
like CNJ V ADJ P -
near P ADV ADJ DET
open ADJ V N ADV
past N ADJ DET P
present ADJ ADV V N
read V VN VD NP
right ADJ N DET ADV
second NUM ADV DET N
set VN V VD N -
that CNJ V WH DET
```

---

Check out the concordance tool `nltk.app.concordance()` and see how it works. (Though I can't get it to work; the kernel just dies on me. So.)

After this, we dive into dictionaries and default dictionaries. Here I might just talk through what is in the actual book.

---

Making our own taggers, as a way of understanding what NLTK is doing. What's the most likely tag?

```python
>>> tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
>>> nltk.FreqDist(tags).max()
'NN'
```

So we'd have moderate success if we just tagged *everything* as `NN`.

```python
>>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
>>> tokens = nltk.word_tokenize(raw)
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'),
('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'),
('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'),
('I', 'NN'), ('am', 'NN'), ('!', 'NN')]
```

```python
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> default_tagger.evaluate(brown_tagged_sents)
0.13089484257215028
```

---

A regular-expression tagger: the patterns are tried in order, and the first match wins.

```python
>>> patterns = [
...     (r'.*ing$', 'VBG'),                # gerunds
...     (r'.*ed$', 'VBD'),                 # simple past
...     (r'.*es$', 'VBZ'),                 # 3rd singular present
...     (r'.*ould$', 'MD'),                # modals
...     (r'.*\'s$', 'NN$'),                # possessive nouns
...     (r'.*s$', 'NNS'),                  # plural nouns
...     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
...     (r'.*', 'NN')                      # nouns (default)
... ]
>>> regexp_tagger = nltk.RegexpTagger(patterns)
>>> brown_sents = brown.sents(categories='news')
>>> regexp_tagger.tag(brown_sents[3])
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'),
('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'),
("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'),
('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ...]
>>> regexp_tagger.evaluate(brown_tagged_sents)
0.20326391789486245
```
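As a quick spot check (my own example, not from the book), we can tag a hand-picked list of words to see which pattern fires for each:

```python
>>> regexp_tagger.tag(['running', 'kicked', 'would', "John's", 'cats', '3.14', 'the'])
[('running', 'VBG'), ('kicked', 'VBD'), ('would', 'MD'), ("John's", 'NN$'),
('cats', 'NNS'), ('3.14', 'CD'), ('the', 'NN')]
```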
---

A lookup tagger: find the 100 most frequent words and the most likely tag for each.

```python
>>> fd = nltk.FreqDist(brown.words(categories='news'))
>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
>>> most_freq_words = fd.most_common(100)
>>> likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)
>>> baseline_tagger.evaluate(brown_tagged_sents)
0.45578495136941344
```

```python
>>> sent = brown.sents(categories='news')[3]
>>> baseline_tagger.tag(sent)
[('``', '``'), ('Only', None), ('a', 'AT'), ('relative', None), ('handful', None),
('of', 'IN'), ('such', None), ('reports', None), ('was', 'BEDZ'), ('received', None),
("''", "''"), (',', ','), ('the', 'AT'), ('jury', None), ('said', 'VBD'), (',', ','),
('``', '``'), ('considering', None), ('the', 'AT'), ('widespread', None),
('interest', None), ('in', 'IN'), ('the', 'AT'), ('election', None), (',', ','),
('the', 'AT'), ('number', None), ('of', 'IN'), ('voters', None), ('and', 'CC'),
('the', 'AT'), ('size', None), ('of', 'IN'), ('this', 'DT'), ('city', None),
("''", "''"), ('.', '.')]
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags,
...                                      backoff=nltk.DefaultTagger('NN'))
>>> baseline_tagger.tag(sent)
```

---

How does it do with the 100 most frequent words? With 1,000? With 10,000? Powers of two give us a nice spread of model sizes:

```python
import pylab
pylab.arange(15)
2 ** pylab.arange(15)
```

```python
def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))
```

```python
word_freqs = nltk.FreqDist(brown.words(categories='news')).most_common()
words_by_freq = [w for (w, _) in word_freqs]
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
performance(cfd, words_by_freq[:64])
performance(cfd, words_by_freq[:128])
```

Let's collect these together into a graph. (This is going to be slow.)

```python
sizes = 2 ** pylab.arange(15)
perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
pylab.plot(sizes, perfs, '-bo')
pylab.title('Lookup Tagger Performance with Varying Model Size')
pylab.xlabel('Model Size')
pylab.ylabel('Performance')
pylab.show()
```
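`pylab` is matplotlib's legacy interface; here is an equivalent plot with the current `matplotlib.pyplot` API (my own variant; the log-scaled x-axis is my addition, since the model sizes grow geometrically):

```python
import matplotlib.pyplot as plt
import numpy as np

sizes = 2 ** np.arange(15)
perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
plt.plot(sizes, perfs, '-bo')
plt.xscale('log')  # sizes are powers of two, so log spacing reads better
plt.title('Lookup Tagger Performance with Varying Model Size')
plt.xlabel('Model Size')
plt.ylabel('Performance')
plt.show()
```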
---

N-gram tagging: unigram tagging, then bigram tagging.

```python
>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
>>> unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
>>> unigram_tagger.tag(brown_sents[2007])
>>> unigram_tagger.evaluate(brown_tagged_sents)
>>> size = int(len(brown_tagged_sents) * 0.9)
>>> size
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> unigram_tagger = nltk.UnigramTagger(train_sents)
>>> unigram_tagger.evaluate(test_sents)
```

```python
>>> bigram_tagger = nltk.BigramTagger(train_sents)
>>> bigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'),
('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','),
('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'),
('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'),
('.', '.')]
>>> unseen_sent = brown_sents[4203]
>>> bigram_tagger.tag(unseen_sent)
>>> bigram_tagger.evaluate(test_sents)
>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
>>> t2.evaluate(test_sents)
```

---

Save the trained tagger with pickle, and load it back. (`pwd` is an IPython convenience, to check where the file landed.)

```python
>>> from pickle import dump
>>> output = open('t2.pkl', 'wb')
>>> dump(t2, output, -1)
>>> output.close()
>>> pwd
```

```python
>>> from pickle import load
>>> input = open('t2.pkl', 'rb')
>>> tagger = load(input)
>>> input.close()
```
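To confirm the round trip worked, tag a fresh sentence with the reloaded tagger (this check is the one the book uses in its section on storing taggers):

```python
>>> text = """The board's action shows what free enterprise
... is up against in our complex maze of regulatory laws ."""
>>> tokens = text.split()
>>> tagger.tag(tokens)
```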