name: title layout: true class: center, middle, inverse --- # Corpora and more sophistication --- layout: false # Getting to some of the included corpora NLTK provides access to a few basic corpora. For example, samples from Project Gutenberg. ```python >>> import nltk >>> nltk.corpus.gutenberg
>>> nltk.corpus.gutenberg.fileids() ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt'] ``` --- layout: false `austen-emma.txt` is just a text file, but it's been "wrapped" in a corpus reader. ```text [Emma by Jane Austen 1816] VOLUME I CHAPTER I Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her. She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period. Her mother had died too long ago for her to have more than an indistinct remembrance of her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection. ``` ```python >>> from nltk.corpus import gutenberg >>> emma = gutenberg.words('austen-emma.txt') >>> emma[0:15] ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER', 'I', 'Emma', 'Woodhouse', ',', 'handsome'] ``` --- layout: false We can ask for `words`, `sents`, `raw` (text) as well. 
```python
>>> emmaw = gutenberg.words('austen-emma.txt')
>>> emmaw[0:15]
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER',
'I', 'Emma', 'Woodhouse', ',', 'handsome']
>>> emmas = gutenberg.sents('austen-emma.txt')
>>> emmas[0:4]
[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'],
['CHAPTER', 'I'], ['Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',',
'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy',
'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best',
'blessings', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly',
'twenty', '-', 'one', 'years', 'in', 'the', 'world', 'with', 'very',
'little', 'to', 'distress', 'or', 'vex', 'her', '.']]
>>> emmar = gutenberg.raw('austen-emma.txt')
>>> emmar[0:100]
'[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a'
```

---

layout: false

```python
def corpus_stat(fileid):
    return {'chars': len(gutenberg.raw(fileid))}
```

```python
for f in gutenberg.fileids():
    stats = corpus_stat(f)
    print(stats['chars'])
```

```python
def corpus_stats(fileid):
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    distinct_words = set(w.lower() for w in gutenberg.words(fileid))
    num_vocab = len(distinct_words)
    avg_word = round(num_chars/num_words)
    avg_sent = round(num_words/num_sents)
    avg_occ = round(num_words/num_vocab)
    return {'word': avg_word, 'sent': avg_sent, 'occ': avg_occ}
```

```python
for f in gutenberg.fileids():
    stats = corpus_stats(f)
    print(stats['word'], stats['sent'], stats['occ'], f)
```

---

layout: false

# The Brown Corpus

From 1961, divided into genres. A subset is available via NLTK.
```python
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government',
'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion',
'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.sents(categories=['news', 'reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an',
'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election',
'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities',
'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end',
'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',',
'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``',
'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of',
'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election',
'was', 'conducted', '.'], ...]
```

---

layout: false

# Comparing genres

```python
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> news_text = [w.lower() for w in brown.words(categories='news')]
>>> news_fdist = nltk.FreqDist(news_text)
>>> news_fdist['may']
93
>>> rom_text = [w.lower() for w in brown.words(categories='romance')]
>>> rom_fdist = nltk.FreqDist(rom_text)
>>> for m in modals: print(m + ':', news_fdist[m], end=' ')
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
>>> for m in modals: print(m + ':', rom_fdist[m], end=' ')
can: 79 could: 195 may: 11 might: 51 must: 46 will: 49
```

---

layout: false

# Conditional Frequency Distributions

These are basically a bundle of Frequency Distributions, one per condition, and they can be pretty useful. The idea is that you collect a FreqDist for each of several subsets of the data.

For example: you might have `news` as one condition, and `romance` as your other.

It basically allows us to do what we just did, but with a single structure.
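The idea can be sketched with plain standard-library pieces: a dictionary holding one `Counter` per condition. (This is just an illustrative toy with made-up counts, not NLTK's actual implementation.)

```python
from collections import defaultdict, Counter

# One frequency count per condition: a dict mapping each condition
# to its own Counter. This is the shape of a conditional frequency
# distribution, using only the standard library.
cfd = defaultdict(Counter)
observations = [('news', 'the'), ('news', 'jury'), ('romance', 'the'),
                ('news', 'the'), ('romance', 'kiss')]
for condition, word in observations:
    cfd[condition][word] += 1

print(cfd['news']['the'])      # 'the' was seen twice under 'news'
print(cfd['romance']['kiss'])  # and 'kiss' once under 'romance'
```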
Behind the scenes, it's basically counting pairs, where the first member of the pair is the condition, and the second member of the pair is the "event." ```python >>> news_text[0:5] ['the', 'fulton', 'county', 'grand', 'jury'] >>> rom_text[0:5] ['they', 'neither', 'liked', 'nor', 'disliked'] >>> pairs = [('news', w) for w in news_text] >>> pairs += [('romance', w) for w in rom_text] >>> len(pairs) 170576 >>> pairs[0:3] [('news', 'the'), ('news', 'fulton'), ('news', 'county')] >>> pairs[100000:100003] [('news', 'governed'), ('news', 'by'), ('news', 'a')] >>> pairs[170000:170003] [('romance', '.'), ('romance', "we're"), ('romance', 'here')] ``` --- layout: false # Conditional Frequency Distributions ```python >>> cfd = nltk.ConditionalFreqDist(pairs) >>> cfd.conditions() ['romance', 'news'] >>> print(cfd['news'])
>>> cfd['news'].most_common(10) ``` We can replicate our comparison of the modals between `news` and `romance`. ```python >>> cfd.tabulate(samples=modals) can could may might must will news 94 87 93 38 53 389 romance 79 195 11 51 46 49 >>> cfd.tabulate(samples=modals, conditions=['news']) can could may might must will news 94 87 93 38 53 389 >>> print(cfd['spongebob'])
>>> cfd.tabulate(samples=modals)
            can could  may might must will
     news    94    87   93    38   53  389
  romance    79   195   11    51   46   49
spongebob     0     0    0     0    0    0
```

---

layout: false

# Complex comprehensions

There are a couple of pretty complex list comprehensions in the NLTK book at this point. It is possible to have multiple `for` clauses, which get evaluated one after the other in the following way:

```python
>>> [x for x in ['a', 'b', 'c']]
['a', 'b', 'c']
>>> [(x,y) for x in ['a', 'b', 'c'] for y in ['1', '2']]
[('a', '1'), ('a', '2'), ('b', '1'), ('b', '2'), ('c', '1'), ('c', '2')]
>>> [(x,y) for x in ['a', 'b', 'c'] for y in [x + '1', x + '2']]
[('a', 'a1'), ('a', 'a2'), ('b', 'b1'), ('b', 'b2'), ('c', 'c1'), ('c', 'c2')]
```

Notice that it runs through all the `y`s for each of the `x`s. And, in fact, you can refer to the current `x` while you're running through the `y`s. This only works in one direction; it is a matter of "scope."

```python
>>> [(x,y) for x in [y + 'a', y + 'b', y + 'c'] for y in ['1', '2']]
NameError: name 'y' is not defined
```

---

layout: false

# Complex comprehensions

```python
>>> bcfd = nltk.ConditionalFreqDist(
        (genre, word)
        for genre in brown.categories()
        for word in brown.words(categories=genre))
```

```python
>>> from nltk.corpus import inaugural
>>> print(inaugural.fileids())
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt',
'1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt',
'1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt',
'1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt',
'1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt',
'1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt',
'1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt',
'1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt',
'1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt',
'1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt',
'1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt'] ``` ```python >>> ipairs = [(target, fileid[:4]) for fileid in inaugural.fileids() for w in inaugural.words(fileid) for target in ['america', 'citizen'] if w.lower().startswith(target)] >>> icfd = nltk.ConditionalFreqDist(ipairs) >>> icfd.tabulate(samples=['1789','1889', '1989']) 1789 1889 1989 america 2 6 11 citizen 5 12 3 >>> icfd.plot() ``` --- layout: false # Generating text Suppose that we want to make the computer write some new stuff in the style of the KJV Bible. ```python >>> import nltk >>> from nltk.util import bigrams >>> print(nltk.corpus.genesis.fileids()) ['english-kjv.txt', 'english-web.txt', 'finnish.txt', 'french.txt', 'german.txt', 'lolcat.txt', 'portuguese.txt', 'swedish.txt'] >>> text = nltk.corpus.genesis.words('english-kjv.txt') >>> bigrams = nltk.bigrams(text) >>> bigrams[:5] TypeError: 'generator' object is not subscriptable >>> bigrams
>>> print(list(bigrams)[:5])
[('In', 'the'), ('the', 'beginning'), ('beginning', 'God'), ('God', 'created'), ('created', 'the')]
>>> print(list(bigrams)[:5])
[]
>>> bigrams = nltk.bigrams(text)
>>> cfd = nltk.ConditionalFreqDist(bigrams)
```

Strategy: Whatever word we're at, look at what usually follows it, and go with that.

---

layout: false

# Generating text

So, suppose that we start with the word "living". When it appears, it is followed by "creature" most of the time, so follow it with "creature".

```python
>>> cfd['living']
FreqDist({',': 1, '.': 1, 'creature': 7, 'soul': 1, 'substance': 2, 'thing': 4})
>>> cfd['living'].max()
'creature'
>>> cfd['creature'].max()
'that'
```

```python
def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()
```

```python
>>> generate_model(cfd, 'living')
living creature that he said , and the land of the land of the land
```

This is not optimal. It gets stuck. We want it not to get stuck.

---

layout: false

# Goal: improving this (a little)

How can we keep it from getting in a loop?

--

Let's try making it take, rather than the topmost one, a random one. How could we characterize what that would be doing?

--

But let's try to make it more plausible, by making it take them in proportion to the likelihood they'd have shown up in the source text.

--

Let's start with a toy:

```python
>>> from nltk.util import bigrams
>>> txt = ['the', 'dog', 'chased', 'the', 'cat', '.', 'the', 'dog', 'barked', '.']
>>> bgrams = nltk.bigrams(txt)
>>> bgrams
>>> blist = [x for x in bgrams] >>> blist [('the', 'dog'), ('dog', 'chased'), ('chased', 'the'), ('the', 'cat'), ('cat', '.'), ('.', 'the'), ('the', 'dog'), ('dog', 'barked'), ('barked', '.')] >>> cfd = nltk.ConditionalFreqDist(blist) >>> cfd
>>> cfd['the']
FreqDist({'dog': 2, 'cat': 1})
```

---

```python
>>> cfd['the']
FreqDist({'dog': 2, 'cat': 1})
```

We want to pick a word to follow "the" arbitrarily, so we don't pick the same thing every time.

--

But we don't want to pick them equally; we want to pick "dog" twice as often as we pick "cat".

--

We can use `random.choice(some list)` to pick an arbitrary one, so we need to get a list that has the properties we want. That is, we want:

```python
['cat', 'dog', 'dog']
```

---

This is what I ultimately ended up with:

```python
>>> l = []
>>> for k in cfd['the'].keys():
        to_add = [k] * cfd['the'][k]
        l.extend(to_add)
>>> l
['cat', 'dog', 'dog']
```

So, maybe more concisely, if more opaquely:

```python
>>> words = [k for k in cfd['the'] for n in range(cfd['the'][k])]
>>> words
['cat', 'dog', 'dog']
```

```python
import random
random.choice(words)
```

---

Ok, now let's fix this:

```python
import random

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        next_words = [k for k in cfdist[word] for n in range(cfdist[word][k])]
        word = random.choice(next_words)

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
generate_model(cfd, 'The', 100)
```

It's better.
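As an aside, the standard library can do the weighted pick directly: `random.choices` accepts per-item weights, so the expanded list isn't strictly necessary. A sketch using the toy counts from the "the dog chased the cat" example (plain dicts here rather than NLTK objects):

```python
import random

# Counts of the words observed after 'the' in the toy text.
following = {'dog': 2, 'cat': 1}

# random.choices draws in proportion to the weights, so 'dog' should
# come up roughly twice as often as 'cat' over many draws.
draws = random.choices(list(following), weights=list(following.values()), k=10)
print(draws)
```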
```python
from nltk.book import *
bigrams = nltk.bigrams(text6)
cfd = nltk.ConditionalFreqDist(bigrams)
generate_model(cfd, 'The', 100)
```

---

# Comparative wordlists #

```python
>>> from nltk.corpus import swadesh
```

There are several languages:

```python
>>> swadesh.fileids()
['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it',
'la', 'mk', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk']
```

Here are the words in English (en):

```python
>>> swadesh.words('en')
['I', 'you (singular), thou', 'he', 'we', 'you (plural)', 'they', 'this',
'that', 'here', 'there', 'who', 'what', 'where', 'when', 'how', 'not',
'all', 'many', 'some', 'few', 'other', 'one', 'two', 'three', 'four',
'five', 'big', 'long', 'wide', ...]
```

---

And we can use this to "translate":

```python
>>> fr2en = swadesh.entries(['fr', 'en'])
>>> fr2en
[('je', 'I'), ('tu, vous', 'you (singular), thou'), ('il', 'he'), ...]
>>> translate = dict(fr2en)
>>> translate['chien']
'dog'
>>> translate['jeter']
'throw'
```

Or compare words across a set of languages.

```python
>>> languages = ['en', 'de', 'nl', 'es', 'fr', 'pt', 'la']
>>> for i in [139, 140, 141, 142]:
...     print(swadesh.entries(languages)[i])
... 
('say', 'sagen', 'zeggen', 'decir', 'dire', 'dizer', 'dicere') ('sing', 'singen', 'zingen', 'cantar', 'chanter', 'cantar', 'canere') ('play', 'spielen', 'spelen', 'jugar', 'jouer', 'jogar, brincar', 'ludere') ('float', 'schweben', 'zweven', 'flotar', 'flotter', 'flutuar, boiar', 'fluctuare') ``` --- # Pronouncing dictionary # ```python entries = nltk.corpus.cmudict.entries() for entry in entries[42371:42379]: print(entry) ('fir', ['F', 'ER1']) ('fire', ['F', 'AY1', 'ER0']) ('fire', ['F', 'AY1', 'R']) ('firearm', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M']) ('firearm', ['F', 'AY1', 'R', 'AA2', 'R', 'M']) ('firearms', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M', 'Z']) ('firearms', ['F', 'AY1', 'R', 'AA2', 'R', 'M', 'Z']) ('fireball', ['F', 'AY1', 'ER0', 'B', 'AO2', 'L']) ``` That's kind of cool, it gives us a list of words and how to pronounce them. --- Suppose that we wanted to find all the words that end in the sound "-nicks": ```python >>> entries = nltk.corpus.cmudict.entries() >>> syllable = ['N', 'IH0', 'K', 'S'] >>> [word for word, pron in entries if pron[-4:] == syllable] ["atlantic's", 'audiotronics', 'avionics', 'beatniks', 'calisthenics', 'centronics', 'chamonix', 'chetniks', "clinic's", 'clinics', 'conics', 'conics', 'cryogenics', 'cynics', 'diasonics', "dominic's", 'ebonics', 'electronics', "electronics'", ...] ``` Great. Perfect for writing bad poetry. What's this doing? ```python >>> [w for w, pron in entries if pron[-1] == 'M' and w[-1] == 'n'] ['autumn', 'column', 'condemn', 'damn', 'goddamn', 'hymn', 'solemn'] ``` And this? ```python >>> sorted(set(w[:2] for w, pron in entries if pron[0] == 'N' and w[0] != 'n')) ['gn', 'kn', 'mn', 'pn'] ``` --- More bad poetry aids: Suppose that we want to find something with a particular stress pattern. This will help. ```python >>> def stress(pron): ... 
return [char for phone in pron for char in phone if char.isdigit()]
```

Once we see how that works, we can try it out:

```python
>>> [w for w, pron in entries if stress(pron) == ['0', '1', '0', '2', '0']]
['abbreviated', 'abbreviated', 'abbreviating', 'accelerated', 'accelerating',
'accelerator', 'accelerators', 'accentuated', 'accentuating', 'accommodated',
'accommodating', 'accommodative', 'accumulated', 'accumulating',
'accumulative', ...]
```

```python
>>> [w for w, pron in entries if stress(pron) == ['0', '2', '0', '1', '0']]
['abbreviation', 'abbreviations', 'abomination', 'abortifacient',
'abortifacients', 'academicians', 'accommodation', 'accommodations',
'accreditation', 'accreditations', 'accumulation', 'accumulations',
'acetylcholine', 'acetylcholine', 'adjudication', ...]
```

---

layout: false

# WordNet: Semantic relations

A thesaurus gives you lists of synonyms. What are the synonyms of "broadcast"?

The answer to that is not obvious: what do you mean by "broadcast"?

- a radio or television program?
- the act of transmission?
- the general dissemination of information?

To know what the synonyms of "broadcast" are, we first need to isolate the different *senses* that "broadcast" can have. Dictionaries will provide you with this. WordNet will also provide you with this.

---

WordNet is a kind of dictionary/thesaurus designed to help with semantic processing.

Load up WordNet; we'll call it `wn`.

```python
from nltk.corpus import wordnet as wn
```

Let's see what we can figure out about "broadcast". In WordNet terminology, every individual word-sense has a list of synonyms. So, the synonyms of "broadcast over the airwaves, as in radio or television" are

```
'air', 'send', 'broadcast', 'beam', 'transmit'
```

This is a **synset**. A synset is a collection of synonyms. There is one of these for each sense of "broadcast".
The synset corresponding to the sense "a radio or television show" contains:

```
'broadcast', 'program', 'programme'
```

Together, the collection of synsets is referred to as, well, **synsets**.

---

To get this out of WordNet, we can do the following:

```python
>>> bss = wn.synsets("broadcast")
>>> bss
[Synset('broadcast.n.01'), Synset('broadcast.n.02'), Synset('air.v.03'),
Synset('broadcast.v.02'), Synset('circulate.v.02')]
```

Those are the senses of "broadcast" mentioned before. Each one of those is a set of synonyms, with a designated representative as its label.

If we want to see the words that are synonyms for the third sense ("air.v.03"), we ask it for the **lemma_names**:

```python
>>> bss[2]
Synset('air.v.03')
>>> bss[2].lemma_names()
['air', 'send', 'broadcast', 'beam', 'transmit']
```

---

WordNet has definitions for the synsets and (sometimes) also has examples.

```python
>>> bss[2].definition()
'broadcast over the airwaves, as in radio or television'
>>> bss[2].examples()
['We cannot air this X-rated song']
```

So, we could write a little function to make a dictionary entry:

```python
def webster(word):
    synsets = wn.synsets(word)
    for s in synsets:
        print(s.definition())
```

And then

```python
>>> webster('broadcast')
message that is transmitted by radio or television
a radio or television show
broadcast over the airwaves, as in radio or television
sow over a wide area, especially by hand
cause to become widely known
```

---

We can include more information too.
```python
def webster(word):
    synsets = wn.synsets(word)
    for s in synsets:
        print(s.definition())
        print(s.examples())
        print(s.lemma_names())
```

```python
>>> webster('broadcast')
message that is transmitted by radio or television
[]
['broadcast']
a radio or television show
['did you see his program last night?']
['broadcast', 'program', 'programme']
broadcast over the airwaves, as in radio or television
['We cannot air this X-rated song']
['air', 'send', 'broadcast', 'beam', 'transmit']
sow over a wide area, especially by hand
['broadcast seeds']
['broadcast']
cause to become widely known
['spread information', 'circulate a rumor', 'broadcast the news']
['circulate', 'circularize', 'circularise', 'distribute', 'disseminate',
'propagate', 'broadcast', 'spread', 'diffuse', 'disperse', 'pass_around']
```

That's pretty ugly, though.

---

We can make the `print` statements do a little bit of formatting for us. It's a little better.

```python
def webster(word):
    synsets = wn.synsets(word)
    print(word)
    for s in synsets:
        print('- ', end='')
        print(s.definition())
        print(' Syn:', end='')
        print(s.lemma_names())
        print(' Exx:', end='')
        print(s.examples())
```

```python
>>> webster('broadcast')
broadcast
- message that is transmitted by radio or television
 Syn:['broadcast']
 Exx:[]
- a radio or television show
 Syn:['broadcast', 'program', 'programme']
 Exx:['did you see his program last night?']
- broadcast over the airwaves, as in radio or television
 Syn:['air', 'send', 'broadcast', 'beam', 'transmit']
 Exx:['We cannot air this X-rated song']
- sow over a wide area, especially by hand
 Syn:['broadcast']
 Exx:['broadcast seeds']
- cause to become widely known
 Syn:['circulate', 'circularize', 'circularise', 'distribute', 'disseminate',
'propagate', 'broadcast', 'spread', 'diffuse', 'disperse', 'pass_around']
 Exx:['spread information', 'circulate a rumor', 'broadcast the news']
```

---

We can make this more compact by using string formatting.
(Ch 3, sec 3.9) ```python def webster(word): synsets = wn.synsets(word) print(word) for s in synsets: print("- {}".format(s.definition())) print(" Syn: {}".format(s.lemma_names())) print(" Exx: {}".format(s.examples())) ``` The structure of a formatted string is that a template string is asked to `format` itself, filling in each of the `{}` blanks with values provided as parameters to `format`. ```python >>> webster('broadcast') broadcast - message that is transmitted by radio or television Syn: ['broadcast'] Exx: [] - a radio or television show Syn: ['broadcast', 'program', 'programme'] Exx: ['did you see his program last night?'] - broadcast over the airwaves, as in radio or television Syn: ['air', 'send', 'broadcast', 'beam', 'transmit'] Exx: ['We cannot air this X-rated song'] - sow over a wide area, especially by hand Syn: ['broadcast'] Exx: ['broadcast seeds'] - cause to become widely known Syn: ['circulate', 'circularize', 'circularise', 'distribute', 'disseminate', 'propagate', 'broadcast', 'spread', 'diffuse', 'disperse', 'pass_around'] Exx: ['spread information', 'circulate a rumor', 'broadcast the news'] ``` --- Sometimes there are no examples. We can make it only print an example line if there are examples. And let's add part of speech as well. ```python def webster(word): synsets = wn.synsets(word) print(word) for s in synsets: print("- {}. {}".format(s.pos(), s.definition())) print(" Syn: {}".format(s.lemma_names())) if len(s.examples()) > 0: print(" Exx: {}".format(s.examples())) ``` ```python >>> webster('broadcast') broadcast - n. message that is transmitted by radio or television Syn: ['broadcast'] - n. a radio or television show Syn: ['broadcast', 'program', 'programme'] Exx: ['did you see his program last night?'] - v. broadcast over the airwaves, as in radio or television Syn: ['air', 'send', 'broadcast', 'beam', 'transmit'] Exx: ['We cannot air this X-rated song'] - v. 
sow over a wide area, especially by hand Syn: ['broadcast'] Exx: ['broadcast seeds'] - v. cause to become widely known Syn: ['circulate', 'circularize', 'circularise', 'distribute', 'disseminate', 'propagate', 'broadcast', 'spread', 'diffuse', 'disperse', 'pass_around'] Exx: ['spread information', 'circulate a rumor', 'broadcast the news'] ``` --- And we probably don't need to keep the original word in the list of synonyms. ```python def webster(word): synsets = wn.synsets(word) print(word) for s in synsets: print("- {}. {}".format(s.pos(), s.definition())) syns = [w for w in s.lemma_names() if w != word] if len(syns) > 0: print(" Syn: {}".format(syns)) if len(s.examples()) > 0: print(" Exx: {}".format(s.examples())) ``` Incrementally improving. ```python >>> webster('broadcast') broadcast - n. message that is transmitted by radio or television - n. a radio or television show Syn: ['program', 'programme'] Exx: ['did you see his program last night?'] - v. broadcast over the airwaves, as in radio or television Syn: ['air', 'send', 'beam', 'transmit'] Exx: ['We cannot air this X-rated song'] - v. sow over a wide area, especially by hand Exx: ['broadcast seeds'] - v. cause to become widely known Syn: ['circulate', 'circularize', 'circularise', 'distribute', 'disseminate', 'propagate', 'spread', 'diffuse', 'disperse', 'pass_around'] Exx: ['spread information', 'circulate a rumor', 'broadcast the news'] ``` --- String formatting is even more useful for making pretty numbers. You can put a formatting instruction inside the `{}` on a format string. These start with `:`. The simplest one is just the width. You can also specify alignment (with `<`), or specify for "floating point" (`f`) numbers the width and digits after the decimal point. You can precede it with `0` to make it "pad with zeros." 
```python def printlist(numbers): for n in numbers: print("{:6} - {:<6} - {:5.2f} - {:06.2f} - {} :)".format(n, n, n, n, n)) ``` ```python >>> numberlist = [1, 2, 42, 3.14, 7.5] >>> printlist(numberlist) 1 - 1 - 1.00 - 001.00 - 1 :) 2 - 2 - 2.00 - 002.00 - 2 :) 42 - 42 - 42.00 - 042.00 - 42 :) 3.14 - 3.14 - 3.14 - 003.14 - 3.14 :) 7.5 - 7.5 - 7.50 - 007.50 - 7.5 :) ``` --- But back to WordNet. We were here. ```python >>> bss = wn.synsets("broadcast") >>> bss [Synset('broadcast.n.01'), Synset('broadcast.n.02'), Synset('air.v.03'), Synset('broadcast.v.02'), Synset('circulate.v.02')] >>> bss[1].definition() 'a radio or television show' ``` If we already have prior knowledge of the senses, we can ask for a specific synset by name: ```python show = wn.synset('broadcast.n.02') ``` If we want to limit our words to verbs, we can do this: ```python >>> bssv = wn.synsets('broadcast', pos=wn.VERB) >>> bssv [Synset('air.v.03'), Synset('broadcast.v.02'), Synset('circulate.v.02')] ``` Options are ADJ, ADJ_SAT, NOUN, ADV, VERB. Or: 'a', 's', 'n', 'r', 'v'. ```python bssv = wn.synsets('broadcast', pos='v') ``` --- A Lemma object is a disambiguated word. The format of a Lemma's name is ```
word.pos.nn.lemma
``` where `word` is the identifier of the synset, and `lemma` is the specific form within the synset that we're looking at. ```python >>> bss[2] Synset('air.v.03') >>> bsls = bss[2].lemmas() >>> bsls [Lemma('air.v.03.air'), Lemma('air.v.03.send'), Lemma('air.v.03.broadcast'), Lemma('air.v.03.beam'), Lemma('air.v.03.transmit')] >>> bsls[1].name() 'send' >>> bsls[1].synset() Synset('air.v.03') ``` --- We can also use synsets to find relationships between words. Synsets are linked to more general (hypernyms) and more specific words (hyponyms): ```python >>> bss[2] Synset('air.v.03') >>> bss[2].definition() 'broadcast over the airwaves, as in radio or television' >>> bss[2].hypernyms() [Synset('publicize.v.01')] >>> bss[2].hyponyms() [Synset('interrogate.v.01'), Synset('rerun.v.01'), Synset('satellite.v.01'), Synset('sportscast.v.01'), Synset('telecast.v.01')] >>> bss[2].root_hypernyms() [Synset('act.v.01')] >>> bss[2].min_depth() 6 >>> wn.synset('sportscast.v.01').min_depth() 7 >>> wn.synset('publicize.v.01').min_depth() 5 ``` --- We can also look at how word-senses (synsets) are related. ```python >>> bss[1].definition() 'a radio or television show' >>> bss[0].definition() 'message that is transmitted by radio or television' >>> bss[1].lowest_common_hypernyms(bss[0]) [Synset('abstraction.n.06')] >>> bss[0].lowest_common_hypernyms(bss[1]) [Synset('abstraction.n.06')] >>> wn.synset('abstraction.n.06').definition() 'a general concept formed by extracting common features from specific examples' >>> bss[0].path_similarity(bss[1]) 0.1111111111111111 >>> bss[0].path_similarity(wn.synset('abstraction.n.06')) 0.25 >>> wn.synset('eat.v.01').definition() 'take in solid food' >>> wn.synset('eat.v.01').entailments() [Synset('chew.v.01'), Synset('swallow.v.01')] ``` --- If there's time, maybe we can do one more thing with our `webster` function. 
```python from nltk.corpus import cmudict pro = cmudict.dict() ``` ```python def webster(word): synsets = wn.synsets(word) prons = pro[word] print("{} - {}".format(word, prons)) for s in synsets: print("- {}. {}".format(s.pos(), s.definition())) syns = [w for w in s.lemma_names() if w != word] if len(syns) > 0: print(" Syn: {}".format(syns)) if len(s.examples()) > 0: print(" Exx: {}".format(s.examples())) ``` ```python >>> webster('broadcast') broadcast - [['B', 'R', 'AO1', 'D', 'K', 'AE2', 'S', 'T']] - n. message that is transmitted by radio or television - n. a radio or television show Syn: ['program', 'programme'] Exx: ['did you see his program last night?'] - v. broadcast over the airwaves, as in radio or television Syn: ['air', 'send', 'beam', 'transmit'] Exx: ['We cannot air this X-rated song'] - v. sow over a wide area, especially by hand Exx: ['broadcast seeds'] - v. cause to become widely known Syn: ['circulate', 'circularize', 'circularise', 'distribute', 'disseminate', 'propagate', 'spread', 'diffuse', 'disperse', 'pass_around'] Exx: ['spread information', 'circulate a rumor', 'broadcast the news'] ``` --- Can't stop now. ```python >>> from nltk.corpus import swadesh >>> en2fr = swadesh.entries(['en', 'fr']) >>> translate = dict(en2fr) ``` ```python def webster(word): synsets = wn.synsets(word) prons = pro[word] print("{} - {}".format(word, prons)) if word in translate: print("Fr.: {}".format(translate[word])) for s in synsets: print("- {}. {}".format(s.pos(), s.definition())) syns = [w for w in s.lemma_names() if w != word] if len(syns) > 0: print(" Syn: {}".format(syns)) if len(s.examples()) > 0: print(" Exx: {}".format(s.examples())) ``` --- layout: false ```python >>> webster('dog') dog - [['D', 'AO1', 'G']] Fr.: chien - n. 
a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds Syn: ['domestic_dog', 'Canis_familiaris'] Exx: ['the dog barked all night'] - n. a dull unattractive unpleasant girl or woman Syn: ['frump'] Exx: ['she got a reputation as a frump', "she's a real dog"] - n. informal term for a man Exx: ['you lucky dog'] - n. someone who is morally reprehensible Syn: ['cad', 'bounder', 'blackguard', 'hound', 'heel'] Exx: ['you dirty dog'] - n. a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll Syn: ['frank', 'frankfurter', 'hotdog', 'hot_dog', 'wiener', 'wienerwurst', 'weenie'] - n. a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward Syn: ['pawl', 'detent', 'click'] - n. metal supports for logs in a fireplace Syn: ['andiron', 'firedog', 'dog-iron'] Exx: ['the andirons were too hot to touch'] - v. go after with the intent to catch Syn: ['chase', 'chase_after', 'trail', 'tail', 'tag', 'give_chase', 'go_after', 'track'] Exx: ['The policeman chased the mugger down the alley', 'the dog chased the rabbit'] ``` ---
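One last robustness note: `pro[word]` will raise a `KeyError` for any word the pronouncing dictionary doesn't know, so a `.get` guard keeps this kind of entry-builder usable on arbitrary input. A minimal sketch (toy stand-in dicts here, so it runs without the NLTK data):

```python
# Toy stand-ins for cmudict.dict() and the Swadesh translate dict,
# so this sketch runs without downloading the NLTK corpora.
pro = {'dog': [['D', 'AO1', 'G']]}
translate = {'dog': 'chien'}

def safe_entry(word):
    prons = pro.get(word, [])   # [] instead of a KeyError for unknown words
    fr = translate.get(word)    # None if there is no French equivalent
    line = "{} - {}".format(word, prons)
    if fr is not None:
        line += "\nFr.: {}".format(fr)
    return line

print(safe_entry('dog'))
print(safe_entry('xyzzy'))   # unknown word: no crash
```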