name: title layout: true class: center, middle, inverse --- # Corpora and more sophistication --- layout: false # Getting to some of the included corpora NLTK provides access to a few basic corpora. For example, samples from Project Gutenberg. ```python >>> import nltk >>> nltk.corpus.gutenberg
>>> nltk.corpus.gutenberg.fileids() ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt'] ``` --- layout: false `austen-emma.txt` is just a text file, but it's been "wrapped" in a corpus reader. ```text [Emma by Jane Austen 1816] VOLUME I CHAPTER I Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her. She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period. Her mother had died too long ago for her to have more than an indistinct remembrance of her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection. ``` ```python >>> from nltk.corpus import gutenberg >>> emma = gutenberg.words('austen-emma.txt') >>> emma[0:15] ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER', 'I', 'Emma', 'Woodhouse', ',', 'handsome'] ``` --- layout: false We can ask for `words`, `sents`, `raw` (text) as well. 
```python
>>> emmaw = gutenberg.words('austen-emma.txt')
>>> emmaw[0:15]
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER',
'I', 'Emma', 'Woodhouse', ',', 'handsome']
>>> emmas = gutenberg.sents('austen-emma.txt')
>>> emmas[0:4]
[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'],
['CHAPTER', 'I'], ['Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',',
'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy',
'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best',
'blessings', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly',
'twenty', '-', 'one', 'years', 'in', 'the', 'world', 'with', 'very',
'little', 'to', 'distress', 'or', 'vex', 'her', '.']]
>>> emmar = gutenberg.raw('austen-emma.txt')
>>> emmar[0:100]
'[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a'
```

---

layout: false

```python
def corpus_stat(fileid):
    return {'chars': len(gutenberg.raw(fileid))}
```

```python
for f in gutenberg.fileids():
    stats = corpus_stat(f)
    print(stats['chars'])
```

```python
def corpus_stats(fileid):
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    distinct_words = set(w.lower() for w in gutenberg.words(fileid))
    num_vocab = len(distinct_words)
    avg_word = round(num_chars/num_words)
    avg_sent = round(num_words/num_sents)
    avg_occ = round(num_words/num_vocab)
    return {'word': avg_word, 'sent': avg_sent, 'occ': avg_occ}
```

```python
for f in gutenberg.fileids():
    stats = corpus_stats(f)
    print(stats['word'], stats['sent'], stats['occ'], f)
```

---

layout: false

# The Brown Corpus

From 1961, divided into genres. A subset is available via NLTK.
```python
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government',
'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion',
'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.sents(categories=['news', 'reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an',
'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election',
'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities',
'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end',
'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',',
'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``',
'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of',
'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election',
'was', 'conducted', '.'], ...]
```

---

layout: false

# Comparing genres

```python
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> news_text = [w.lower() for w in brown.words(categories='news')]
>>> news_fdist = nltk.FreqDist(news_text)
>>> news_fdist['may']
93
>>> rom_text = [w.lower() for w in brown.words(categories='romance')]
>>> rom_fdist = nltk.FreqDist(rom_text)
>>> for m in modals: print(m + ':', news_fdist[m], end=' ')
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
>>> for m in modals: print(m + ':', rom_fdist[m], end=' ')
can: 79 could: 195 may: 11 might: 51 must: 46 will: 49
```

---

layout: false

# Conditional Frequency Distributions

These are basically a bundle of Frequency Distributions, one per condition, and they can be pretty useful. The idea is that you collect a FreqDist for each of several subsets of the data.

For example: you might have `news` as one condition, and `romance` as your other.

It basically allows us to do what we just did, but with a single structure.
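The idea can be sketched with plain standard-library pieces: a dictionary holding one `Counter` per condition. (This is just an illustrative toy with made-up counts, not NLTK's actual implementation.)

```python
from collections import defaultdict, Counter

# One frequency count per condition: a dict mapping each condition
# to its own Counter. This is the shape of a conditional frequency
# distribution, using only the standard library.
cfd = defaultdict(Counter)
observations = [('news', 'the'), ('news', 'jury'), ('romance', 'the'),
                ('news', 'the'), ('romance', 'kiss')]
for condition, word in observations:
    cfd[condition][word] += 1

print(cfd['news']['the'])      # 'the' was seen twice under 'news'
print(cfd['romance']['kiss'])  # and 'kiss' once under 'romance'
```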
Behind the scenes, it's basically counting pairs, where the first member of the pair is the condition, and the second member of the pair is the "event." ```python >>> news_text[0:5] ['the', 'fulton', 'county', 'grand', 'jury'] >>> rom_text[0:5] ['they', 'neither', 'liked', 'nor', 'disliked'] >>> pairs = [('news', w) for w in news_text] >>> pairs += [('romance', w) for w in rom_text] >>> len(pairs) 170576 >>> pairs[0:3] [('news', 'the'), ('news', 'fulton'), ('news', 'county')] >>> pairs[100000:100003] [('news', 'governed'), ('news', 'by'), ('news', 'a')] >>> pairs[170000:170003] [('romance', '.'), ('romance', "we're"), ('romance', 'here')] ``` --- layout: false # Conditional Frequency Distributions ```python >>> cfd = nltk.ConditionalFreqDist(pairs) >>> cfd.conditions() ['romance', 'news'] >>> print(cfd['news'])
>>> cfd['news'].most_common(10) ``` We can replicate our comparison of the modals between `news` and `romance`. ```python >>> cfd.tabulate(samples=modals) can could may might must will news 94 87 93 38 53 389 romance 79 195 11 51 46 49 >>> cfd.tabulate(samples=modals, conditions=['news']) can could may might must will news 94 87 93 38 53 389 >>> print(cfd['spongebob'])
>>> cfd.tabulate(samples=modals)
            can could  may might must will
     news    94    87   93    38   53  389
  romance    79   195   11    51   46   49
spongebob     0     0    0     0    0    0
```

---

layout: false

# Complex comprehensions

There are a couple of pretty complex list comprehensions in the NLTK book at this point. It is possible to have multiple `for` clauses, which get evaluated one after the other in the following way:

```python
>>> [x for x in ['a', 'b', 'c']]
['a', 'b', 'c']
>>> [(x,y) for x in ['a', 'b', 'c'] for y in ['1', '2']]
[('a', '1'), ('a', '2'), ('b', '1'), ('b', '2'), ('c', '1'), ('c', '2')]
>>> [(x,y) for x in ['a', 'b', 'c'] for y in [x + '1', x + '2']]
[('a', 'a1'), ('a', 'a2'), ('b', 'b1'), ('b', 'b2'), ('c', 'c1'), ('c', 'c2')]
```

Notice that it runs through all the `y`s for each of the `x`s. And, in fact, you can refer to the current `x` while you're running through the `y`s. This only works in one direction; it is a matter of "scope."

```python
>>> [(x,y) for x in [y + 'a', y + 'b', y + 'c'] for y in ['1', '2']]
NameError: name 'y' is not defined
```

---

layout: false

# Complex comprehensions

```python
>>> bcfd = nltk.ConditionalFreqDist(
        (genre, word)
        for genre in brown.categories()
        for word in brown.words(categories=genre))
```

```python
>>> from nltk.corpus import inaugural
>>> print(inaugural.fileids())
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt',
'1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt',
'1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt',
'1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt',
'1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt',
'1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt',
'1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt',
'1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt',
'1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt',
'1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt',
'1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt'] ``` ```python >>> ipairs = [(target, fileid[:4]) for fileid in inaugural.fileids() for w in inaugural.words(fileid) for target in ['america', 'citizen'] if w.lower().startswith(target)] >>> icfd = nltk.ConditionalFreqDist(ipairs) >>> icfd.tabulate(samples=['1789','1889', '1989']) 1789 1889 1989 america 2 6 11 citizen 5 12 3 >>> icfd.plot() ``` --- layout: false # Generating text Suppose that we want to make the computer write some new stuff in the style of the KJV Bible. ```python >>> import nltk >>> from nltk.util import bigrams >>> print(nltk.corpus.genesis.fileids()) ['english-kjv.txt', 'english-web.txt', 'finnish.txt', 'french.txt', 'german.txt', 'lolcat.txt', 'portuguese.txt', 'swedish.txt'] >>> text = nltk.corpus.genesis.words('english-kjv.txt') >>> bigrams = nltk.bigrams(text) >>> bigrams[:5] TypeError: 'generator' object is not subscriptable >>> bigrams
>>> print(list(bigrams)[:5])
[('In', 'the'), ('the', 'beginning'), ('beginning', 'God'), ('God', 'created'), ('created', 'the')]
>>> print(list(bigrams)[:5])
[]
>>> bigrams = nltk.bigrams(text)
>>> cfd = nltk.ConditionalFreqDist(bigrams)
```

Strategy: Whatever word we're at, look at what usually follows it, and go with that.

---

layout: false

# Generating text

So, suppose that we start with the word "living". When it appears, it is followed by "creature" most of the time, so follow it with "creature".

```python
>>> cfd['living']
FreqDist({',': 1, '.': 1, 'creature': 7, 'soul': 1, 'substance': 2, 'thing': 4})
>>> cfd['living'].max()
'creature'
>>> cfd['creature'].max()
'that'
```

```python
def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()
```

```python
>>> generate_model(cfd, 'living')
living creature that he said , and the land of the land of the land
```

This is not optimal. It gets stuck. We want it not to get stuck.

---

layout: false

# Goal: improving this (a little)

How can we keep it from getting in a loop?

--

Let's try making it take, rather than the topmost one, a random one. How could we characterize what that would be doing?

--

But let's try to make it more plausible, by making it take them in proportion to the likelihood they'd have shown up in the source text.

--

Let's start with a toy:

```python
>>> from nltk.util import bigrams
>>> txt = ['the', 'dog', 'chased', 'the', 'cat', '.', 'the', 'dog', 'barked', '.']
>>> bgrams = nltk.bigrams(txt)
>>> bgrams
>>> blist = [x for x in bgrams] >>> blist [('the', 'dog'), ('dog', 'chased'), ('chased', 'the'), ('the', 'cat'), ('cat', '.'), ('.', 'the'), ('the', 'dog'), ('dog', 'barked'), ('barked', '.')] >>> cfd = nltk.ConditionalFreqDist(blist) >>> cfd
>>> cfd['the']
FreqDist({'dog': 2, 'cat': 1})
```

---

```python
>>> cfd['the']
FreqDist({'dog': 2, 'cat': 1})
```

We want to pick a word to follow "the" arbitrarily, so we don't pick the same thing every time.

--

But we don't want to pick them equally; we want to pick "dog" twice as often as we pick "cat".

--

We can use `random.choice(some list)` to pick an arbitrary one, so we need to get a list that has the properties we want. That is, we want:

```python
['cat', 'dog', 'dog']
```

---

This is what I ultimately ended up with:

```python
>>> l = []
>>> for k in cfd['the'].keys():
        to_add = [k] * cfd['the'][k]
        l.extend(to_add)
>>> l
['cat', 'dog', 'dog']
```

So, maybe more concisely, if more opaquely:

```python
>>> words = [k for k in cfd['the'] for n in range(cfd['the'][k])]
>>> words
['cat', 'dog', 'dog']
```

```python
import random
random.choice(words)
```

---

Ok, now let's fix this:

```python
import random

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        next_words = [k for k in cfdist[word] for n in range(cfdist[word][k])]
        word = random.choice(next_words)

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
generate_model(cfd, 'The', 100)
```

It's better.
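As an aside, the standard library can do the weighted pick directly: `random.choices` accepts per-item weights, so the expanded list isn't strictly necessary. A sketch using the toy counts from the "the dog chased the cat" example (plain dicts here rather than NLTK objects):

```python
import random

# Counts of the words observed after 'the' in the toy text.
following = {'dog': 2, 'cat': 1}

# random.choices draws in proportion to the weights, so 'dog' should
# come up roughly twice as often as 'cat' over many draws.
draws = random.choices(list(following), weights=list(following.values()), k=10)
print(draws)
```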
```python
from nltk.book import *
bigrams = nltk.bigrams(text6)
cfd = nltk.ConditionalFreqDist(bigrams)
generate_model(cfd, 'The', 100)
```

---

# Comparative wordlists #

```python
>>> from nltk.corpus import swadesh
```

There are several languages:

```python
>>> swadesh.fileids()
['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it',
'la', 'mk', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk']
```

Here are the words in English (en):

```python
>>> swadesh.words('en')
['I', 'you (singular), thou', 'he', 'we', 'you (plural)', 'they', 'this',
'that', 'here', 'there', 'who', 'what', 'where', 'when', 'how', 'not',
'all', 'many', 'some', 'few', 'other', 'one', 'two', 'three', 'four',
'five', 'big', 'long', 'wide', ...]
```

---

And we can use this to "translate":

```python
>>> fr2en = swadesh.entries(['fr', 'en'])
>>> fr2en
[('je', 'I'), ('tu, vous', 'you (singular), thou'), ('il', 'he'), ...]
>>> translate = dict(fr2en)
>>> translate['chien']
'dog'
>>> translate['jeter']
'throw'
```

Or compare words across a set of languages.

```python
>>> languages = ['en', 'de', 'nl', 'es', 'fr', 'pt', 'la']
>>> for i in [139, 140, 141, 142]:
...     print(swadesh.entries(languages)[i])
... 
('say', 'sagen', 'zeggen', 'decir', 'dire', 'dizer', 'dicere') ('sing', 'singen', 'zingen', 'cantar', 'chanter', 'cantar', 'canere') ('play', 'spielen', 'spelen', 'jugar', 'jouer', 'jogar, brincar', 'ludere') ('float', 'schweben', 'zweven', 'flotar', 'flotter', 'flutuar, boiar', 'fluctuare') ``` --- # Pronouncing dictionary # ```python entries = nltk.corpus.cmudict.entries() for entry in entries[42371:42379]: print(entry) ('fir', ['F', 'ER1']) ('fire', ['F', 'AY1', 'ER0']) ('fire', ['F', 'AY1', 'R']) ('firearm', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M']) ('firearm', ['F', 'AY1', 'R', 'AA2', 'R', 'M']) ('firearms', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M', 'Z']) ('firearms', ['F', 'AY1', 'R', 'AA2', 'R', 'M', 'Z']) ('fireball', ['F', 'AY1', 'ER0', 'B', 'AO2', 'L']) ``` That's kind of cool, it gives us a list of words and how to pronounce them. --- Suppose that we wanted to find all the words that end in the sound "-nicks": ```python >>> entries = nltk.corpus.cmudict.entries() >>> syllable = ['N', 'IH0', 'K', 'S'] >>> [word for word, pron in entries if pron[-4:] == syllable] ["atlantic's", 'audiotronics', 'avionics', 'beatniks', 'calisthenics', 'centronics', 'chamonix', 'chetniks', "clinic's", 'clinics', 'conics', 'conics', 'cryogenics', 'cynics', 'diasonics', "dominic's", 'ebonics', 'electronics', "electronics'", ...] ``` Great. Perfect for writing bad poetry. What's this doing? ```python >>> [w for w, pron in entries if pron[-1] == 'M' and w[-1] == 'n'] ['autumn', 'column', 'condemn', 'damn', 'goddamn', 'hymn', 'solemn'] ``` And this? ```python >>> sorted(set(w[:2] for w, pron in entries if pron[0] == 'N' and w[0] != 'n')) ['gn', 'kn', 'mn', 'pn'] ``` --- More bad poetry aids: Suppose that we want to find something with a particular stress pattern. This will help. ```python >>> def stress(pron): ... 
return [char for phone in pron for char in phone if char.isdigit()]
```

Once we see how that works, we can try it out:

```python
>>> [w for w, pron in entries if stress(pron) == ['0', '1', '0', '2', '0']]
['abbreviated', 'abbreviated', 'abbreviating', 'accelerated', 'accelerating',
'accelerator', 'accelerators', 'accentuated', 'accentuating', 'accommodated',
'accommodating', 'accommodative', 'accumulated', 'accumulating',
'accumulative', ...]
```

```python
>>> [w for w, pron in entries if stress(pron) == ['0', '2', '0', '1', '0']]
['abbreviation', 'abbreviations', 'abomination', 'abortifacient',
'abortifacients', 'academicians', 'accommodation', 'accommodations',
'accreditation', 'accreditations', 'accumulation', 'accumulations',
'acetylcholine', 'acetylcholine', 'adjudication', ...]
```

---

layout: false

# WordNet: Semantic relations

A thesaurus gives you lists of synonyms. What are the synonyms of "broadcast"?

The answer to that is not obvious: what do you mean by "broadcast"?

- a radio or television program?
- the act of transmission?
- the general dissemination of information?

To know what the synonyms of "broadcast" are, we first need to isolate the different *senses* that "broadcast" can have. Dictionaries will provide you with this. WordNet will also provide you with this.

---

WordNet is a kind of dictionary/thesaurus designed to help with semantic processing.

Load up WordNet; we'll call it `wn`.

```python
from nltk.corpus import wordnet as wn
```

Let's see what we can figure out about "broadcast". In WordNet terminology, every individual word-sense has a list of synonyms. So, the synonyms of "broadcast over the airwaves, as in radio or television" are

```
'air', 'send', 'broadcast', 'beam', 'transmit'
```

This is a **synset**. A synset is a collection of synonyms. There is one of these for each sense of "broadcast".
The synset corresponding to the sense "a radio or television show" contains:

```
'broadcast', 'program', 'programme'
```

Together, the collection of synsets is referred to as, well, **synsets**.

---

To get this out of WordNet, we can do the following:

```python
>>> bss = wn.synsets("broadcast")
>>> bss
[Synset('broadcast.n.01'), Synset('broadcast.n.02'), Synset('air.v.03'),
Synset('broadcast.v.02'), Synset('circulate.v.02')]
```

Those are the senses of "broadcast" mentioned before. Each one of those is a set of synonyms, with a designated representative as its label.

If we want to see the words that are synonyms for the third sense ("air.v.03"), we ask it for the **lemma_names**:

```python
>>> bss[2]
Synset('air.v.03')
>>> bss[2].lemma_names()
['air', 'send', 'broadcast', 'beam', 'transmit']
```

---

WordNet has definitions for the synsets and (sometimes) also has examples.

```python
>>> bss[2].definition()
'broadcast over the airwaves, as in radio or television'
>>> bss[2].examples()
['We cannot air this X-rated song']
```

So, we could write a little function to make a dictionary entry:

```python
def webster(word):
    synsets = wn.synsets(word)
    for s in synsets:
        print(s.definition())
```

And then

```python
>>> webster('broadcast')
message that is transmitted by radio or television
a radio or television show
broadcast over the airwaves, as in radio or television
sow over a wide area, especially by hand
cause to become widely known
```

---

We can include more information too.
```python
def webster(word):
    synsets = wn.synsets(word)
    for s in synsets:
        print(s.definition())
        print(s.examples())
        print(s.lemma_names())
```

```python
>>> webster('broadcast')
message that is transmitted by radio or television
[]
['broadcast']
a radio or television show
['did you see his program last night?']
['broadcast', 'program', 'programme']
broadcast over the airwaves, as in radio or television
['We cannot air this X-rated song']
['air', 'send', 'broadcast', 'beam', 'transmit']
sow over a wide area, especially by hand
['broadcast seeds']
['broadcast']
cause to become widely known
['spread information', 'circulate a rumor', 'broadcast the news']
['circulate', 'circularize', 'circularise', 'distribute', 'disseminate',
'propagate', 'broadcast', 'spread', 'diffuse', 'disperse', 'pass_around']
```

That's pretty ugly, though.

---

We can make the `print` statements do a little bit of formatting for us. It's a little better.

```python
def webster(word):
    synsets = wn.synsets(word)
    print(word)
    for s in synsets:
        print('- ', end='')
        print(s.definition())
        print(' Syn:', end='')
        print(s.lemma_names())
        print(' Exx:', end='')
        print(s.examples())
```

```python
>>> webster('broadcast')
broadcast
- message that is transmitted by radio or television
 Syn:['broadcast']
 Exx:[]
- a radio or television show
 Syn:['broadcast', 'program', 'programme']
 Exx:['did you see his program last night?']
- broadcast over the airwaves, as in radio or television
 Syn:['air', 'send', 'broadcast', 'beam', 'transmit']
 Exx:['We cannot air this X-rated song']
- sow over a wide area, especially by hand
 Syn:['broadcast']
 Exx:['broadcast seeds']
- cause to become widely known
 Syn:['circulate', 'circularize', 'circularise', 'distribute', 'disseminate',
'propagate', 'broadcast', 'spread', 'diffuse', 'disperse', 'pass_around']
 Exx:['spread information', 'circulate a rumor', 'broadcast the news']
```

---

We can make this more compact by using string formatting.
(Ch 3, sec 3.9) ```python def webster(word): synsets = wn.synsets(word) print(word) for s in synsets: print("- {}".format(s.definition())) print(" Syn: {}".format(s.lemma_names())) print(" Exx: {}".format(s.examples())) ``` The structure of a formatted string is that a template string is asked to `format` itself, filling in each of the `{}` blanks with values provided as parameters to `format`. ```python >>> webster('broadcast') broadcast - message that is transmitted by radio or television Syn: ['broadcast'] Exx: [] - a radio or television show Syn: ['broadcast', 'program', 'programme'] Exx: ['did you see his program last night?'] - broadcast over the airwaves, as in radio or television Syn: ['air', 'send', 'broadcast', 'beam', 'transmit'] Exx: ['We cannot air this X-rated song'] - sow over a wide area, especially by hand Syn: ['broadcast'] Exx: ['broadcast seeds'] - cause to become widely known Syn: ['circulate', 'circularize', 'circularise', 'distribute', 'disseminate', 'propagate', 'broadcast', 'spread', 'diffuse', 'disperse', 'pass_around'] Exx: ['spread information', 'circulate a rumor', 'broadcast the news'] ``` --- Sometimes there are no examples. We can make it only print an example line if there are examples. And let's add part of speech as well. ```python def webster(word): synsets = wn.synsets(word) print(word) for s in synsets: print("- {}. {}".format(s.pos(), s.definition())) print(" Syn: {}".format(s.lemma_names())) if len(s.examples()) > 0: print(" Exx: {}".format(s.examples())) ``` ```python >>> webster('broadcast') broadcast - n. message that is transmitted by radio or television Syn: ['broadcast'] - n. a radio or television show Syn: ['broadcast', 'program', 'programme'] Exx: ['did you see his program last night?'] - v. broadcast over the airwaves, as in radio or television Syn: ['air', 'send', 'broadcast', 'beam', 'transmit'] Exx: ['We cannot air this X-rated song'] - v. 
sow over a wide area, especially by hand Syn: ['broadcast'] Exx: ['broadcast seeds'] - v. cause to become widely known Syn: ['circulate', 'circularize', 'circularise', 'distribute', 'disseminate', 'propagate', 'broadcast', 'spread', 'diffuse', 'disperse', 'pass_around'] Exx: ['spread information', 'circulate a rumor', 'broadcast the news'] ``` --- And we probably don't need to keep the original word in the list of synonyms. ```python def webster(word): synsets = wn.synsets(word) print(word) for s in synsets: print("- {}. {}".format(s.pos(), s.definition())) syns = [w for w in s.lemma_names() if w != word] if len(syns) > 0: print(" Syn: {}".format(syns)) if len(s.examples()) > 0: print(" Exx: {}".format(s.examples())) ``` Incrementally improving. ```python >>> webster('broadcast') broadcast - n. message that is transmitted by radio or television - n. a radio or television show Syn: ['program', 'programme'] Exx: ['did you see his program last night?'] - v. broadcast over the airwaves, as in radio or television Syn: ['air', 'send', 'beam', 'transmit'] Exx: ['We cannot air this X-rated song'] - v. sow over a wide area, especially by hand Exx: ['broadcast seeds'] - v. cause to become widely known Syn: ['circulate', 'circularize', 'circularise', 'distribute', 'disseminate', 'propagate', 'spread', 'diffuse', 'disperse', 'pass_around'] Exx: ['spread information', 'circulate a rumor', 'broadcast the news'] ``` --- String formatting is even more useful for making pretty numbers. You can put a formatting instruction inside the `{}` on a format string. These start with `:`. The simplest one is just the width. You can also specify alignment (with `<`), or specify for "floating point" (`f`) numbers the width and digits after the decimal point. You can precede it with `0` to make it "pad with zeros." 
```python def printlist(numbers): for n in numbers: print("{:6} - {:<6} - {:5.2f} - {:06.2f} - {} :)".format(n, n, n, n, n)) ``` ```python >>> numberlist = [1, 2, 42, 3.14, 7.5] >>> printlist(numberlist) 1 - 1 - 1.00 - 001.00 - 1 :) 2 - 2 - 2.00 - 002.00 - 2 :) 42 - 42 - 42.00 - 042.00 - 42 :) 3.14 - 3.14 - 3.14 - 003.14 - 3.14 :) 7.5 - 7.5 - 7.50 - 007.50 - 7.5 :) ``` --- But back to WordNet. We were here. ```python >>> bss = wn.synsets("broadcast") >>> bss [Synset('broadcast.n.01'), Synset('broadcast.n.02'), Synset('air.v.03'), Synset('broadcast.v.02'), Synset('circulate.v.02')] >>> bss[1].definition() 'a radio or television show' ``` If we already have prior knowledge of the senses, we can ask for a specific synset by name: ```python show = wn.synset('broadcast.n.02') ``` If we want to limit our words to verbs, we can do this: ```python >>> bssv = wn.synsets('broadcast', pos=wn.VERB) >>> bssv [Synset('air.v.03'), Synset('broadcast.v.02'), Synset('circulate.v.02')] ``` Options are ADJ, ADJ_SAT, NOUN, ADV, VERB. Or: 'a', 's', 'n', 'r', 'v'. ```python bssv = wn.synsets('broadcast', pos='v') ``` --- A Lemma object is a disambiguated word. The format of a Lemma's name is ```
word.pos.nn.lemma
``` where `word` is the identifier of the synset, and `lemma` is the specific form within the synset that we're looking at. ```python >>> bss[2] Synset('air.v.03') >>> bsls = bss[2].lemmas() >>> bsls [Lemma('air.v.03.air'), Lemma('air.v.03.send'), Lemma('air.v.03.broadcast'), Lemma('air.v.03.beam'), Lemma('air.v.03.transmit')] >>> bsls[1].name() 'send' >>> bsls[1].synset() Synset('air.v.03') ``` --- We can also use synsets to find relationships between words. Synsets are linked to more general (hypernyms) and more specific words (hyponyms): ```python >>> bss[2] Synset('air.v.03') >>> bss[2].definition() 'broadcast over the airwaves, as in radio or television' >>> bss[2].hypernyms() [Synset('publicize.v.01')] >>> bss[2].hyponyms() [Synset('interrogate.v.01'), Synset('rerun.v.01'), Synset('satellite.v.01'), Synset('sportscast.v.01'), Synset('telecast.v.01')] >>> bss[2].root_hypernyms() [Synset('act.v.01')] >>> bss[2].min_depth() 6 >>> wn.synset('sportscast.v.01').min_depth() 7 >>> wn.synset('publicize.v.01').min_depth() 5 ``` --- We can also look at how word-senses (synsets) are related. ```python >>> bss[1].definition() 'a radio or television show' >>> bss[0].definition() 'message that is transmitted by radio or television' >>> bss[1].lowest_common_hypernyms(bss[0]) [Synset('abstraction.n.06')] >>> bss[0].lowest_common_hypernyms(bss[1]) [Synset('abstraction.n.06')] >>> wn.synset('abstraction.n.06').definition() 'a general concept formed by extracting common features from specific examples' >>> bss[0].path_similarity(bss[1]) 0.1111111111111111 >>> bss[0].path_similarity(wn.synset('abstraction.n.06')) 0.25 >>> wn.synset('eat.v.01').definition() 'take in solid food' >>> wn.synset('eat.v.01').entailments() [Synset('chew.v.01'), Synset('swallow.v.01')] ``` --- If there's time, maybe we can do one more thing with our `webster` function. 
```python from nltk.corpus import cmudict pro = cmudict.dict() ``` ```python def webster(word): synsets = wn.synsets(word) prons = pro[word] print("{} - {}".format(word, prons)) for s in synsets: print("- {}. {}".format(s.pos(), s.definition())) syns = [w for w in s.lemma_names() if w != word] if len(syns) > 0: print(" Syn: {}".format(syns)) if len(s.examples()) > 0: print(" Exx: {}".format(s.examples())) ``` ```python >>> webster('broadcast') broadcast - [['B', 'R', 'AO1', 'D', 'K', 'AE2', 'S', 'T']] - n. message that is transmitted by radio or television - n. a radio or television show Syn: ['program', 'programme'] Exx: ['did you see his program last night?'] - v. broadcast over the airwaves, as in radio or television Syn: ['air', 'send', 'beam', 'transmit'] Exx: ['We cannot air this X-rated song'] - v. sow over a wide area, especially by hand Exx: ['broadcast seeds'] - v. cause to become widely known Syn: ['circulate', 'circularize', 'circularise', 'distribute', 'disseminate', 'propagate', 'spread', 'diffuse', 'disperse', 'pass_around'] Exx: ['spread information', 'circulate a rumor', 'broadcast the news'] ``` --- Can't stop now. ```python >>> from nltk.corpus import swadesh >>> en2fr = swadesh.entries(['en', 'fr']) >>> translate = dict(en2fr) ``` ```python def webster(word): synsets = wn.synsets(word) prons = pro[word] print("{} - {}".format(word, prons)) if word in translate: print("Fr.: {}".format(translate[word])) for s in synsets: print("- {}. {}".format(s.pos(), s.definition())) syns = [w for w in s.lemma_names() if w != word] if len(syns) > 0: print(" Syn: {}".format(syns)) if len(s.examples()) > 0: print(" Exx: {}".format(s.examples())) ``` --- layout: false ```python >>> webster('dog') dog - [['D', 'AO1', 'G']] Fr.: chien - n. 
a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds Syn: ['domestic_dog', 'Canis_familiaris'] Exx: ['the dog barked all night'] - n. a dull unattractive unpleasant girl or woman Syn: ['frump'] Exx: ['she got a reputation as a frump', "she's a real dog"] - n. informal term for a man Exx: ['you lucky dog'] - n. someone who is morally reprehensible Syn: ['cad', 'bounder', 'blackguard', 'hound', 'heel'] Exx: ['you dirty dog'] - n. a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll Syn: ['frank', 'frankfurter', 'hotdog', 'hot_dog', 'wiener', 'wienerwurst', 'weenie'] - n. a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward Syn: ['pawl', 'detent', 'click'] - n. metal supports for logs in a fireplace Syn: ['andiron', 'firedog', 'dog-iron'] Exx: ['the andirons were too hot to touch'] - v. go after with the intent to catch Syn: ['chase', 'chase_after', 'trail', 'tail', 'tag', 'give_chase', 'go_after', 'track'] Exx: ['The policeman chased the mugger down the alley', 'the dog chased the rabbit'] ``` ---
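One last robustness note: `pro[word]` will raise a `KeyError` for any word the pronouncing dictionary doesn't know, so a `.get` guard keeps this kind of entry-builder usable on arbitrary input. A minimal sketch (toy stand-in dicts here, so it runs without the NLTK data):

```python
# Toy stand-ins for cmudict.dict() and the Swadesh translate dict,
# so this sketch runs without downloading the NLTK corpora.
pro = {'dog': [['D', 'AO1', 'G']]}
translate = {'dog': 'chien'}

def safe_entry(word):
    prons = pro.get(word, [])   # [] instead of a KeyError for unknown words
    fr = translate.get(word)    # None if there is no French equivalent
    line = "{} - {}".format(word, prons)
    if fr is not None:
        line += "\nFr.: {}".format(fr)
    return line

print(safe_entry('dog'))
print(safe_entry('xyzzy'))   # unknown word: no crash
```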