name: title
layout: true
class: center, middle, inverse
---
# WordNet
---
layout: false

# Semantic relations between words

A thesaurus gives you lists of synonyms.

What are the synonyms of "broadcast"?

The answer is not obvious: what do you mean by "broadcast"?

- a radio or television program?
- the act of transmission?
- the general dissemination of information?

To know what the synonyms of "broadcast" are, we first need to isolate the different *senses* that "broadcast" can have.

Dictionaries will provide you with this. WordNet will also provide you with this.

---

WordNet is a kind of dictionary/thesaurus to help with semantic processing.

Load up WordNet, we'll call it `wn`.

```python
from nltk.corpus import wordnet as wn
```

Let's see what we can figure out about "broadcast".

In WordNet terminology, every individual word-sense has a list of synonyms. So, the synonyms of "broadcast over the airwaves, as in radio or television" are

```
'air', 'send', 'broadcast', 'beam', 'transmit'
```

This is a **synset**. A synset is a collection of synonyms. There is one of these for each sense of "broadcast". The synset corresponding to the sense "a radio or television show" contains:

```
'broadcast', 'program', 'programme'
```

Together, the collection of synsets is referred to as, well, **synsets**.

---

To get this out of WordNet, we can do the following:

```python
>>> bss = wn.synsets("broadcast")
>>> bss
[Synset('broadcast.n.01'), Synset('broadcast.n.02'), Synset('air.v.03'), Synset('broadcast.v.02'), Synset('circulate.v.02')]
```

Those are the senses of "broadcast" mentioned before. Each one of those is a set of synonyms, with a designated representative as its label.

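
A label such as `broadcast.n.01` packs three pieces together: the representative word, a part-of-speech letter, and a sense number. As a rough sketch that needs no WordNet data, the hypothetical helper `split_label` (an illustration only, not part of NLTK) pulls those pieces apart with plain string operations:

```python
def split_label(label):
    """Split a synset label like 'air.v.03' into (word, pos, sense number).

    Hypothetical helper for illustration; not part of NLTK.
    """
    # Split from the right so a word that itself contains a period stays whole.
    word, pos, number = label.rsplit('.', 2)
    return word, pos, int(number)


for label in ['broadcast.n.01', 'air.v.03', 'circulate.v.02']:
    print(split_label(label))
```

So `air.v.03` reads as "the third verb sense of *air*", which happens to be the sense that also covers "broadcast".
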
If we want to see the words that are synonyms for the third sense ("air.v.03"), we ask it for the **lemma_names**:

```python
>>> bss[2]
Synset('air.v.03')
>>> bss[2].lemma_names()
['air', 'send', 'broadcast', 'beam', 'transmit']
```

---

WordNet has definitions for the synsets and (sometimes) also has examples.

```python
>>> bss[2].definition()
'broadcast over the airwaves, as in radio or television'
>>> bss[2].examples()
['We cannot air this X-rated song']
```

So, we could write a little function to make a dictionary entry:

```python
def webster(word):
    synsets = wn.synsets(word)
    for s in synsets:
        print(s.definition())
```

And then

```python
>>> webster('broadcast')
message that is transmitted by radio or television
a radio or television show
broadcast over the airwaves, as in radio or television
sow over a wide area, especially by hand
cause to become widely known
```

---

We can include more information too.

```python
def webster(word):
    synsets = wn.synsets(word)
    for s in synsets:
        print(s.definition())
        print(s.examples())
        print(s.lemma_names())
```

```python
>>> webster('broadcast')
message that is transmitted by radio or television
[]
['broadcast']
a radio or television show
['did you see his program last night?']
['broadcast', 'program', 'programme']
broadcast over the airwaves, as in radio or television
['We cannot air this X-rated song']
['air', 'send', 'broadcast', 'beam', 'transmit']
sow over a wide area, especially by hand
['broadcast seeds']
['broadcast']
cause to become widely known
['spread information', 'circulate a rumor', 'broadcast the news']
['circulate', 'circularize', 'circularise', 'distribute', 'disseminate', 'propagate', 'broadcast', 'spread', 'diffuse', 'disperse', 'pass_around']
```

That's pretty ugly though.

---

We can make the `print` statements do a little bit of formatting for us. It's a little better.

```python
def webster(word):
    synsets = wn.synsets(word)
    print(word)
    for s in synsets:
        print('- ', end='')
        print(s.definition())
        print(' Syn:', end='')
        print(s.lemma_names())
        print(' Exx:', end='')
        print(s.examples())
```

```python
>>> webster('broadcast')
broadcast
- message that is transmitted by radio or television
 Syn:['broadcast']
 Exx:[]
- a radio or television show
 Syn:['broadcast', 'program', 'programme']
 Exx:['did you see his program last night?']
- broadcast over the airwaves, as in radio or television
 Syn:['air', 'send', 'broadcast', 'beam', 'transmit']
 Exx:['We cannot air this X-rated song']
- sow over a wide area, especially by hand
 Syn:['broadcast']
 Exx:['broadcast seeds']
- cause to become widely known
 Syn:['circulate', 'circularize', 'circularise', 'distribute', 'disseminate', 'propagate', 'broadcast', 'spread', 'diffuse', 'disperse', 'pass_around']
 Exx:['spread information', 'circulate a rumor', 'broadcast the news']
```

---

We can make this more compact by using string formatting. (Ch 3, sec 3.9)

```python
def webster(word):
    synsets = wn.synsets(word)
    print(word)
    for s in synsets:
        print("- {}".format(s.definition()))
        print(" Syn: {}".format(s.lemma_names()))
        print(" Exx: {}".format(s.examples()))
```

The structure of a formatted string is that a template string is asked to `format` itself, filling in each of the `{}` blanks with values provided as parameters to `format`.

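
To watch the blanks being filled in without involving WordNet at all, here is a small sketch with made-up values (the variable names are just for illustration):

```python
# A template with two blanks; format() fills them in left to right.
template = "{} has {} senses"
line = template.format('broadcast', 5)
print(line)

# Any value can go in a blank. A list shows up the way print() would show it.
syn_line = " Syn: {}".format(['program', 'programme'])
print(syn_line)
```

The values do not have to be strings: `format` converts numbers and lists to text for us, which is why `webster` can hand it the list from `lemma_names()` directly.
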
```python
>>> webster('broadcast')
broadcast
- message that is transmitted by radio or television
 Syn: ['broadcast']
 Exx: []
- a radio or television show
 Syn: ['broadcast', 'program', 'programme']
 Exx: ['did you see his program last night?']
- broadcast over the airwaves, as in radio or television
 Syn: ['air', 'send', 'broadcast', 'beam', 'transmit']
 Exx: ['We cannot air this X-rated song']
- sow over a wide area, especially by hand
 Syn: ['broadcast']
 Exx: ['broadcast seeds']
- cause to become widely known
 Syn: ['circulate', 'circularize', 'circularise', 'distribute', 'disseminate', 'propagate', 'broadcast', 'spread', 'diffuse', 'disperse', 'pass_around']
 Exx: ['spread information', 'circulate a rumor', 'broadcast the news']
```

---

Sometimes there are no examples. We can make it only print an example line if there are examples. And let's add part of speech as well.

```python
def webster(word):
    synsets = wn.synsets(word)
    print(word)
    for s in synsets:
        print("- {}. {}".format(s.pos(), s.definition()))
        print(" Syn: {}".format(s.lemma_names()))
        if len(s.examples()) > 0:
            print(" Exx: {}".format(s.examples()))
```

```python
>>> webster('broadcast')
broadcast
- n. message that is transmitted by radio or television
 Syn: ['broadcast']
- n. a radio or television show
 Syn: ['broadcast', 'program', 'programme']
 Exx: ['did you see his program last night?']
- v. broadcast over the airwaves, as in radio or television
 Syn: ['air', 'send', 'broadcast', 'beam', 'transmit']
 Exx: ['We cannot air this X-rated song']
- v. sow over a wide area, especially by hand
 Syn: ['broadcast']
 Exx: ['broadcast seeds']
- v. cause to become widely known
 Syn: ['circulate', 'circularize', 'circularise', 'distribute', 'disseminate', 'propagate', 'broadcast', 'spread', 'diffuse', 'disperse', 'pass_around']
 Exx: ['spread information', 'circulate a rumor', 'broadcast the news']
```

---

And we probably don't need to keep the original word in the list of synonyms.

```python
def webster(word):
    synsets = wn.synsets(word)
    print(word)
    for s in synsets:
        print("- {}. {}".format(s.pos(), s.definition()))
        # Leave the headword itself out of its own synonym list.
        syns = [w for w in s.lemma_names() if w != word]
        if len(syns) > 0:
            print(" Syn: {}".format(syns))
        if len(s.examples()) > 0:
            print(" Exx: {}".format(s.examples()))
```

Incrementally improving.

```python
>>> webster('broadcast')
broadcast
- n. message that is transmitted by radio or television
- n. a radio or television show
 Syn: ['program', 'programme']
 Exx: ['did you see his program last night?']
- v. broadcast over the airwaves, as in radio or television
 Syn: ['air', 'send', 'beam', 'transmit']
 Exx: ['We cannot air this X-rated song']
- v. sow over a wide area, especially by hand
 Exx: ['broadcast seeds']
- v. cause to become widely known
 Syn: ['circulate', 'circularize', 'circularise', 'distribute', 'disseminate', 'propagate', 'spread', 'diffuse', 'disperse', 'pass_around']
 Exx: ['spread information', 'circulate a rumor', 'broadcast the news']
```

---

String formatting is even more useful for making pretty numbers.

You can put a formatting instruction inside the `{}` in a format string. These start with `:`. The simplest one is just the width. You can also specify alignment (`<` for left-aligned), or, for "floating point" (`f`) numbers, the width and the number of digits after the decimal point. Precede the width with `0` to "pad with zeros."

```python
def printlist(numbers):
    for n in numbers:
        print("{:6} - {:<6} - {:5.2f} - {:06.2f} - {} :)".format(n, n, n, n, n))
```

```python
>>> numberlist = [1, 2, 42, 3.14, 7.5]
>>> printlist(numberlist)
     1 - 1      -  1.00 - 001.00 - 1 :)
     2 - 2      -  2.00 - 002.00 - 2 :)
    42 - 42     - 42.00 - 042.00 - 42 :)
  3.14 - 3.14   -  3.14 - 003.14 - 3.14 :)
   7.5 - 7.5    -  7.50 - 007.50 - 7.5 :)
```

---

But back to WordNet. We were here.

```python
>>> bss = wn.synsets("broadcast")
>>> bss
[Synset('broadcast.n.01'), Synset('broadcast.n.02'), Synset('air.v.03'), Synset('broadcast.v.02'), Synset('circulate.v.02')]
>>> bss[1].definition()
'a radio or television show'
```

If we already have prior knowledge of the senses, we can ask for a specific synset by name:

```python
show = wn.synset('broadcast.n.02')
```

If we want to limit our words to verbs, we can do this:

```python
>>> bssv = wn.synsets('broadcast', pos=wn.VERB)
>>> bssv
[Synset('air.v.03'), Synset('broadcast.v.02'), Synset('circulate.v.02')]
```

Options are ADJ, ADJ_SAT, NOUN, ADV, VERB. Or: 'a', 's', 'n', 'r', 'v'.

```python
bssv = wn.synsets('broadcast', pos='v')
```

---

A Lemma object is a disambiguated word. The format of a Lemma's name is

```
<word>.<pos>.<number>.<lemma>
```

where `<word>.<pos>.<number>` is the identifier of the synset, and `<lemma>` is the specific form within the synset that we're looking at.

```python
>>> bss[2]
Synset('air.v.03')
>>> bsls = bss[2].lemmas()
>>> bsls
[Lemma('air.v.03.air'), Lemma('air.v.03.send'), Lemma('air.v.03.broadcast'), Lemma('air.v.03.beam'), Lemma('air.v.03.transmit')]
>>> bsls[1].name()
'send'
>>> bsls[1].synset()
Synset('air.v.03')
```

---

We can also use synsets to find relationships between words. Synsets are linked to more general synsets (hypernyms) and more specific ones (hyponyms):

```python
>>> bss[2]
Synset('air.v.03')
>>> bss[2].definition()
'broadcast over the airwaves, as in radio or television'
>>> bss[2].hypernyms()
[Synset('publicize.v.01')]
>>> bss[2].hyponyms()
[Synset('interrogate.v.01'), Synset('rerun.v.01'), Synset('satellite.v.01'), Synset('sportscast.v.01'), Synset('telecast.v.01')]
>>> bss[2].root_hypernyms()
[Synset('act.v.01')]
>>> bss[2].min_depth()
6
>>> wn.synset('sportscast.v.01').min_depth()
7
>>> wn.synset('publicize.v.01').min_depth()
5
```

---

We can also look at how word-senses (synsets) are related.

```python
>>> bss[1].definition()
'a radio or television show'
>>> bss[0].definition()
'message that is transmitted by radio or television'
>>> bss[1].lowest_common_hypernyms(bss[0])
[Synset('abstraction.n.06')]
>>> bss[0].lowest_common_hypernyms(bss[1])
[Synset('abstraction.n.06')]
>>> wn.synset('abstraction.n.06').definition()
'a general concept formed by extracting common features from specific examples'
>>> bss[0].path_similarity(bss[1])
0.1111111111111111
>>> bss[0].path_similarity(wn.synset('abstraction.n.06'))
0.25
>>> wn.synset('eat.v.01').definition()
'take in solid food'
>>> wn.synset('eat.v.01').entailments()
[Synset('chew.v.01'), Synset('swallow.v.01')]
```

---

If there's time, maybe we can do one more thing with our `webster` function.

```python
from nltk.corpus import cmudict
pro = cmudict.dict()
```

```python
def webster(word):
    synsets = wn.synsets(word)
    prons = pro[word]
    print("{} - {}".format(word, prons))
    for s in synsets:
        print("- {}. {}".format(s.pos(), s.definition()))
        syns = [w for w in s.lemma_names() if w != word]
        if len(syns) > 0:
            print(" Syn: {}".format(syns))
        if len(s.examples()) > 0:
            print(" Exx: {}".format(s.examples()))
```

```python
>>> webster('broadcast')
broadcast - [['B', 'R', 'AO1', 'D', 'K', 'AE2', 'S', 'T']]
- n. message that is transmitted by radio or television
- n. a radio or television show
 Syn: ['program', 'programme']
 Exx: ['did you see his program last night?']
- v. broadcast over the airwaves, as in radio or television
 Syn: ['air', 'send', 'beam', 'transmit']
 Exx: ['We cannot air this X-rated song']
- v. sow over a wide area, especially by hand
 Exx: ['broadcast seeds']
- v. cause to become widely known
 Syn: ['circulate', 'circularize', 'circularise', 'distribute', 'disseminate', 'propagate', 'spread', 'diffuse', 'disperse', 'pass_around']
 Exx: ['spread information', 'circulate a rumor', 'broadcast the news']
```

---

Can't stop now.

```python
>>> from nltk.corpus import swadesh
>>> en2fr = swadesh.entries(['en', 'fr'])
>>> translate = dict(en2fr)
```

```python
def webster(word):
    synsets = wn.synsets(word)
    prons = pro[word]
    print("{} - {}".format(word, prons))
    if word in translate:
        print("Fr.: {}".format(translate[word]))
    for s in synsets:
        print("- {}. {}".format(s.pos(), s.definition()))
        syns = [w for w in s.lemma_names() if w != word]
        if len(syns) > 0:
            print(" Syn: {}".format(syns))
        if len(s.examples()) > 0:
            print(" Exx: {}".format(s.examples()))
```

```python
>>> webster('dog')
dog - [['D', 'AO1', 'G']]
Fr.: chien
- n. a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
 Syn: ['domestic_dog', 'Canis_familiaris']
 Exx: ['the dog barked all night']
- n. a dull unattractive unpleasant girl or woman
 Syn: ['frump']
 Exx: ['she got a reputation as a frump', "she's a real dog"]
- n. informal term for a man
 Exx: ['you lucky dog']
- n. someone who is morally reprehensible
 Syn: ['cad', 'bounder', 'blackguard', 'hound', 'heel']
 Exx: ['you dirty dog']
- n. a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll
 Syn: ['frank', 'frankfurter', 'hotdog', 'hot_dog', 'wiener', 'wienerwurst', 'weenie']
- n. a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward
 Syn: ['pawl', 'detent', 'click']
- n. metal supports for logs in a fireplace
 Syn: ['andiron', 'firedog', 'dog-iron']
 Exx: ['the andirons were too hot to touch']
- v. go after with the intent to catch
 Syn: ['chase', 'chase_after', 'trail', 'tail', 'tag', 'give_chase', 'go_after', 'track']
 Exx: ['The policeman chased the mugger down the alley', 'the dog chased the rabbit']
```

---
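
One caveat: looking a word up with `pro[word]` raises a `KeyError` if the pronouncing dictionary does not contain that word. Here is a minimal sketch of one way around that. It uses a tiny hand-made stand-in dictionary so that it runs without any corpus data; `toy_pro` and `pronounce` are hypothetical names for illustration, not part of NLTK.

```python
# Tiny stand-in for cmudict.dict(): each word maps to a list of pronunciations.
toy_pro = {
    'dog': [['D', 'AO1', 'G']],
    'broadcast': [['B', 'R', 'AO1', 'D', 'K', 'AE2', 'S', 'T']],
}


def pronounce(word, pron_dict):
    """Return the pronunciations for word, or an empty list if it is missing."""
    # dict.get hands back the fallback value instead of raising KeyError.
    return pron_dict.get(word, [])


for w in ['dog', 'wordnet']:
    prons = pronounce(w, toy_pro)
    if len(prons) > 0:
        print("{} - {}".format(w, prons))
    else:
        print("{} - (no pronunciation found)".format(w))
```

The same `len(...) > 0` check that `webster` already uses for synonyms and examples would then decide whether to print the pronunciation at all.
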