Authorship and Semantics

CAS LX 390 / NLP/CL Homework 7
GRS LX 690 Fall 2017 due Mon 11/14

Authorship

We are going to take a quick look at the movie review database and try to determine who wrote a review, based on some statistics from other reviews by two different authors. This is a “toy” problem, but it will give you a sense of how this can work.

from nltk.corpus import movie_reviews

The fileids in the movie_reviews corpus look like neg/cv000_29416.txt. I checked on a few files that were in the database and cross-checked them on the IMDB archive page, and so I know 3 by one author (JB), 3 by another author (SG), and one that’s by one of them (which we’re going to try to check). To get the fileids, do this:

jbf = ['29416', '29417', '29439']
sgf = ['29423', '29444', '29465']
myf = ['29497']
sgfids = [f for f in movie_reviews.fileids() if f[10:15] in sgf]
jbfids = [f for f in movie_reviews.fileids() if f[10:15] in jbf]
myfids = [f for f in movie_reviews.fileids() if f[10:15] in myf]
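To see why the slice f[10:15] picks out the review number, it may help to try it on a single fileid by hand. This is just a small sketch using the example fileid quoted above; the variable names fileid and review_number are only for illustration.

```python
# Example fileid, as quoted in the text above.
fileid = 'neg/cv000_29416.txt'

# Characters 0-3 are 'neg/' (or 'pos/'), 4-8 are 'cv000', and 9 is '_',
# so characters 10-14 hold the five-digit number we are matching on.
review_number = fileid[10:15]
print(review_number)            # 29416

jbf = ['29416', '29417', '29439']
print(review_number in jbf)     # True
```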

Now sgfids has the fileid value for three reviews by SG. What we want to do is write a function that will extract some metrics from this text.

Task 1. Write a function auth_stats(fileid) that will return three values: average word length, average sentence length, and lexical diversity.

You can get the words using movie_reviews.words(fileids=fileid), and the sentences using movie_reviews.sents(fileids=fileid). Lexical diversity is the ratio of distinct words to total words. Your function can just return a list like [word_length, sent_length, lexical_diversity].

Task 2. Run the auth_stats function on the three reviews by SG, then on the three reviews by JB, and then on the mystery review (29497). What seems to characterize the reviews by SG as compared to JB, and who wrote the mystery review?

If you go to the IMDB archive page I linked above, you can read the reviews and check your answer.

Formalizing meaning

This largely follows the discussion from chapter 10 in the NLTK book, but I will try to elaborate on it here somewhat. This actually is also somewhat like what we did in class.

This will contain a bit more by way of exercises, to help make it clearer what the concepts are here. We will start by creating a little world that we can evaluate sentences against.

In this world, there are four people: Andrea, Bobby, Chris, and Dana. These are our individuals. An “individual” need not be a person; it’s just some kind of entity that we can refer to. So, let’s add a couple of non-human individuals as well. To keep things somewhat simple, they will be the Moon and the Sun. We are going to pretend their names are the_sun and the_moon, however (having a space in there makes things not work, so we will use _ instead of a space).

So, step one, let’s define a set of our individuals. (There is no intrinsic order to these individuals, they are just the individuals in our model of the world, so it should be a set and not a list.)

dom = {'a', 'b', 'c', 'd', 'm', 's'}

Now we will build up some information about how English maps to these individuals. First off, we will set up the names. As we did in class, and as it is done in the textbook, we will use the fromstring function to create these, because it’s easier to type. What we do is set up a multi-line string first, using the """ delimiters on each end, and then we will create a Valuation by parsing it.

import nltk
names = """
andrea => a
bobby => b
chris => c
dana => d
the_sun => s
the_moon => m
"""
val = nltk.Valuation.fromstring(names)
print(val)

Now that we have done this, we can “evaluate” the English words and get their referents.

print(val['bobby'])
print(val['the_moon'])

This tells us which individual in our set of individuals is being referred to by “bobby” and by “the moon”. Doing this counts, at least in a certain sense, as translating from English into “semantics.” We are determining the meaning of the word “bobby,” for example.

The individuals in this world also have properties and relationships, however. For example, some of the individuals are people. So, we define “person” as being something that holds of the individuals a, b, c, and d. For the moment, we are going to create a new Valuation to hold this information, and we will merge these together shortly.

valp = nltk.Valuation.fromstring("person => {a, b, c, d}")
print(valp)

It’s kind of a pain to keep typing this nltk.Valuation.fromstring thing, so let’s give it a shorter name. I’m going with vfs for “valuation-from-string”:

vfs = nltk.Valuation.fromstring

Now, the sun and the moon are not people, what are they?

valsb = vfs("spaceball => {s, m}")

So far, we have three different Valuations (val, valp, and valsb). But we need to merge them together into one. It is possible to combine two Valuations using update. So, let’s add valp and valsb to val:

val.update(valp)
print(val)
val.update(valsb)
print(val)

The update function is actually fairly general. It is defined for Valuations, but it is also defined for just regular sets, as well as for dictionaries. If you call update on a set or a dictionary, it merges the argument of update into it (with priority given to the additions, if there is a conflict).

aset = {1, 2, 3}
aset.update({3, 4, 5})
print(aset) # => {1, 2, 3, 4, 5}
adict = {'a': 1, 'b': 2}
adict.update({'b': 4, 'c': 6})
print(adict) # => {'a': 1, 'b': 4, 'c': 6}

Now, all of our world-building work to date is represented in val. Let’s do a little bit more building. It turns out that Andrea and Bobby are from Boston, while Chris and Dana are from Cambridge.

val.update(vfs("bostonian => {a, b}"))
val.update(vfs("cantabrigian => {c, d}"))

Now, we’ve defined the mapping between names and individuals, and we’ve defined some nouns/predicates that hold of sets of individuals. What remains is to define some relationships between them. Relationships are not necessarily symmetrical, so just because Andrea likes Bobby does not mean that Bobby likes Andrea. But let’s start with that.

val.update(vfs("likes => {(a, b)}"))

Ok, now let’s (attempt to) make it mutual.

val.update(vfs("likes => {(b, a)}"))
print(val['likes'])

Hmm. That didn’t really work. Instead of making Andrea and Bobby like each other, Bobby started liking Andrea and Andrea stopped liking Bobby. This simply replaced the liking pair, rather than adding to it. So, we could spell this out fully like this:

val.update(vfs("likes => {(b, a), (a, b)}"))
print(val['likes'])

And that gets what we want. But it would be nice to be able to add relations in stages, rather than redefine it in full every time. And you can, with a bit of trickery. The code below will add the fact that Chris likes Dana to our model of the world:

val['likes'].update(vfs("x => {(c,d)}")['x'])
print(val['likes'])

Task 1. Why does that work? Explain how this added ('c', 'd') to val['likes'].

We could have just spelled it out (essentially performing the fromstring operation ourselves). The code below has the same effect as the code above, and is arguably simpler.

val['likes'].update({('c', 'd')})

However, using Valuation.fromstring means that we can just type (c,d) instead of ('c', 'd') (and the latter could get annoying if we’re adding several relations at once).

Similarly, if we wanted to make Bobby a spaceball, we can just do this:

val['spaceball'].update(vfs("x => {b}")['x'])

instead of making the 1-tuple with a string in it by hand, as in:

val['spaceball'].update({('b',)})

You are free to believe that I have introduced a more complicated way to do a relatively simple thing. But wait until

Task 2. Finish setting up val['likes'] to represent the following world situation: Andrea likes everyone; everyone likes Dana; Bobby likes Andrea; Dana likes Chris; Andrea and Bobby like the Sun; Bobby and Chris like the Moon.

That is, start with what we already have for val['likes'] at this point, and add the rest in. You can use whatever method you want, including just redefining it completely from scratch. Just provide me with whatever command(s) you used to do it. What we start with for val['likes'] is:

{('a', 'b'), ('b', 'a'), ('c', 'd')}

Also: Assume that if Andrea likes everyone, Andrea also likes Andrea.

That’s complex enough that it’s probably worth checking to see if you wound up with what you intended to. You can print(val['likes']) but it’s a long list (set) of pairs, that isn’t necessarily in a helpful order. So, let’s see if Andrea likes everyone (that is, all the people). If you type the following, you should get True if Andrea likes all the people. (If you get False then you probably set up your world wrong, double check Task 2.)

not False in [(val['andrea'], x) in val['likes'] for (x,) in val['person']]

Got True? Great. But why? If you just blindly typed it in and got True without figuring out what it is doing, that’s fine. But now we’re going to figure out what it is doing.

First of all, remind yourself what val['andrea'], val['person'], and val['likes'] are:

print(val['andrea'])
print(val['person'])
print(val['likes'])

We are trying to determine whether Andrea likes all the people. So, we check, for each person, whether it is true that Andrea likes that person. When we’re done checking people, we should not have found any that yield False.

As you just saw, val['person'] is a set of 1-tuples, like ('a',). So to go through the people, we want to use for (x,) in val['person'] in order to set x to be the individual in our domain that corresponds to the person (e.g., 'a'). To determine whether Andrea likes the person in x, we need to find out whether the pair that has Andrea as the first member and x as the second member is in the set of “likings” in val['likes']. The individual that Andrea represents is val['andrea'] (which will be 'a'). So, we evaluate whether the pair (val['andrea'], x) is in val['likes']. The expression (val['andrea'], x) in val['likes'] will be True if Andrea likes x and False otherwise.

The list that this list comprehension builds will be a list of True or False values (one for each person). If Andrea indeed likes every person, then the list should be [True, True, True, True]. Finally, we check to see if False is anywhere in that list. If it is, we failed: Andrea doesn’t like every person. If there is no False in there, then we succeeded. So, not False in [...] is True if we succeeded.
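The same check can also be written with Python’s built-in all, which is True exactly when none of the values it is given is False. The sketch below uses a deliberately tiny made-up world (toy_person and toy_likes are hypothetical stand-ins, not the val from this assignment) so that it runs on its own.

```python
# A made-up two-person world, just to illustrate the technique.
toy_person = {('a',), ('b',)}           # 1-tuples, like val['person']
toy_likes = {('a', 'a'), ('a', 'b')}    # pairs, like val['likes']
andrea = 'a'

# The version from the text: build a list of True/False, look for a False.
checks = [(andrea, x) in toy_likes for (x,) in toy_person]
homespun = not False in checks

# The same test using all(): True only if every check is True.
with_all = all((andrea, x) in toy_likes for (x,) in toy_person)

print(homespun, with_all)   # True True

# Remove one "liking" and the check now fails.
toy_likes.discard(('a', 'b'))
after_removal = all((andrea, x) in toy_likes for (x,) in toy_person)
print(after_removal)        # False
```

Both spellings do the same work; all just says “no False anywhere” a bit more directly.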

Task 3. Use the same technique to verify that every person likes Dana.

Now, let’s formalize our model of the world into an official NLTK model. A model is just a pairing of a domain and a Valuation function.

m = nltk.Model(dom, val)

Once we have a model defined, we can use the model’s evaluate function to test the truth of things in the model. In order to use evaluate we also need to set up an “assignment function” (which can be thought of as a record of who we’re pointing to). To begin with, we’ll just set up an empty assignment function (we aren’t pointing at anything).

g = nltk.Assignment(dom)

Now, we can verify that Dana likes Chris, and verify that Bobby does not like Chris, like so:

print(m.evaluate('likes(dana, chris)', g))
print(m.evaluate('likes(bobby, chris)', g))

Task 4. Use evaluate to verify that Dana does not like Bobby, and that Chris likes the Moon.

We can also use quantifiers like all and exists with evaluate. For example, we can re-verify that Andrea likes every person, like so:

print(m.evaluate('all x.(person(x) -> likes(andrea, x))', g))

The way this works is pretty much exactly how our home-spun version above (the one you reused in Task 3) worked. It goes through all of the individuals in the domain one by one, and for each it checks to see if it’s a person, and if it is a person, then it checks to see if it is the second member of a pair, whose first member is Andrea, that can be found in the set of “likings”.
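To make that description concrete, here is a rough pure-Python rendering of what all x.(person(x) -> likes(andrea, x)) has to check. It is only a sketch of the idea, not NLTK’s actual implementation, and the toy_ names and the implies helper are made up for this illustration.

```python
# Made-up miniature world: two people and the Moon.
toy_dom = {'a', 'b', 'm'}
toy_person = {('a',), ('b',)}
toy_likes = {('a', 'a'), ('a', 'b')}

def implies(p, q):
    """Material implication: p -> q is false only when p is true and q is false."""
    return (not p) or q

# Go through every individual in the domain, not just the people.
# Non-people (like 'm') make person(x) false, so the implication holds for them.
result = all(
    implies((x,) in toy_person, ('a', x) in toy_likes)
    for x in toy_dom
)
print(result)   # True
```

Notice that Andrea does not like the Moon in this toy world, and the result is still True, because the Moon is not a person.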

Task 5. Use evaluate to verify that everybody likes Dana.

You can also use exists, which is true if the condition is met for at least one of the individuals in the domain. So, if we want to ascertain that at least somebody likes Bobby, we can do the following:

print(m.evaluate('exists x.(person(x) & likes(x, bobby))', g))

What that means is that we can find some x in our domain dom such that x is both a person and in a likes relation with Bobby.

Task 6. Use evaluate to verify that every Bostonian likes the Sun.

Task 7. Use evaluate to verify that no spaceballs are from Cambridge.

You can either use exists x.(...) and look for False as an answer, or you can use -exists x.(...) and look for True as an answer. The meaning of -exists x.(...) is: ‘it is not the case that there exists an x such that…’

The string that we give to evaluate is first interpreted as a “semantic Expression” built from a string. If we don’t want to evaluate immediately, we can define such expressions directly. The function that does this is nltk.sem.Expression.fromstring. Like before, we’ll give it a shorter name (sfs) to save on some typing. Then we’ll define a formula f1 to be “x likes the Moon”.

sfs = nltk.sem.Expression.fromstring
f1 = sfs('likes(x, the_moon)')
print(f1)

So, is “x likes the Moon” true? No idea. We can’t decide that until we know who x is supposed to be. Once we know who x is, then we can figure out whether it’s true. Because we don’t know who x is, x is considered a “free variable.” Although it’s kind of obvious, we can interrogate f1 to ask it what its free variables are:

print(f1.free())
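For comparison, it can be useful to look at a formula whose x is bound by a quantifier, since then x is no longer free. A minimal sketch (the names open_formula and closed_formula are just for illustration):

```python
import nltk

sfs = nltk.sem.Expression.fromstring

open_formula = sfs('likes(x, the_moon)')              # x is free
closed_formula = sfs('exists x.likes(x, the_moon)')   # x is bound by exists

# free() returns a set of Variable objects; .name gives the variable's name.
open_names = {v.name for v in open_formula.free()}
closed_names = {v.name for v in closed_formula.free()}
print(open_names)     # {'x'}
print(closed_names)   # set()
```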

If we want to know who likes the Moon, we can ask the model to tell us which individuals, when substituted in for x, would make f1 true:

print(m.satisfiers(f1, 'x', g))

Task 8. Use satisfiers to determine who/what Chris likes.

One way that we can set a value for x is to use x to point to an individual. That is, suppose we point (with our “x” finger) at Bobby, and then ask whether “x likes the Moon” is true. Since this tells us who x is (namely, Bobby), we can decide whether “x likes the Moon” is true. It’s true if (and only if) Bobby likes the Moon.

This is what the assignment function is for. It is a record of who/what we are pointing at, and with which fingers. (This is really designed to handle pronouns like he, she, it. If you use those pronouns, it is assumed that something in the discourse is basically pointing at the individual you mean. Without some kind of pointing (“deixis”) you won’t be able to interpret the referent of a pronoun.)
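Here is a small sketch of “pointing” in code. It assumes a cut-down toy world (toy_dom, toy_val, toy_m are made-up names holding only one fact from the situation described in Task 2, namely that Bobby likes the Moon), so it does not depend on your own val.

```python
import nltk

# A made-up slice of the world: three individuals and one "liking".
toy_dom = {'b', 'd', 'm'}
toy_val = nltk.Valuation.fromstring("""
bobby => b
dana => d
the_moon => m
likes => {(b, m)}
""")
toy_m = nltk.Model(toy_dom, toy_val)

f1 = nltk.sem.Expression.fromstring('likes(x, the_moon)')

# Point the "x" finger at Bobby, then ask whether "x likes the Moon" is true.
toy_g = nltk.Assignment(toy_dom, [('x', 'b')])
x_is_bobby = toy_m.satisfy(f1, toy_g)
print(x_is_bobby)    # True

# Stop pointing, then point the "x" finger at Dana instead.
toy_g.purge()
toy_g.add('x', 'd')
x_is_dana = toy_m.satisfy(f1, toy_g)
print(x_is_dana)     # False
```

The formula f1 never changes; only the assignment does, and the truth value changes with it.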

Parsing sentences

Let’s try to build a little grammar that can take sentences and interpret them. What we want to do here is create some phrase structure rules that will apply the semantics we defined to a syntactic structure. We’ll build this up from the bottom.

As a first step, we will define the NPs, which will be just the names we have. (We are going to build a big multi-line string and then create the grammar using a fromstring function.)

npdef = """
NP[SEM=<andrea>] -> 'andrea'
NP[SEM=<bobby>] -> 'bobby'
NP[SEM=<chris>] -> 'chris'
NP[SEM=<dana>] -> 'dana'
NP[SEM=<the_sun>] -> 'the_sun'
NP[SEM=<the_moon>] -> 'the_moon'
"""

What this means is that if the English word ‘andrea’ is encountered, that can be interpreted as an NP with the SEM feature being <andrea>. And likewise for the other proper names.

As for how the whole tree combines, it will start with S at the top, which is formed from an NP and a VP, and the VP is formed from a V and an NP. For now, that’s all we’ll do.

What we want is for the semantics of the VP to combine the semantics of the V with the semantics of the NP. So, if the V is “likes(x, y)”, and the NP is “bobby”, then we want the VP to be “likes(x, bobby)”, more or less.

cfgdef = r"""
% start S
S[SEM=<?vp(?subj)>] -> NP[SEM=?subj] VP[SEM=?vp]
"""

The way to understand this is: The semantics of S is the function that we get from the semantics of VP, applied to the argument that we get from the semantics of the NP subject. So, by saying NP[SEM=?subj] we are naming the value of the NP’s SEM feature (whatever it is) as ?subj. We name the value of the VP’s SEM feature (whatever it is) as ?vp. We assume that ?vp is a function that can take ?subj as an argument. And so, the SEM feature that we assign to S is whatever we get when we apply the function ?vp to the argument ?subj.

We then do the same thing for the VP. We assume that the V is going to be a function that we can apply to the NP.

cfgdef += r"""
VP[SEM=<?v(?obj)>] -> V[SEM=?v] NP[SEM=?obj]
V[SEM=<\y.\x.likes(x,y)>] -> 'likes'
"""

So, now we can add in the NP definitions we did at the beginning, and take a look at the whole grammar.

cfgdef += npdef
print(cfgdef)
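Before handing the grammar to a parser, it may help to see the two function applications that the VP and S rules describe, done by hand. This is only a sketch: it builds the expressions directly with Expression.fromstring rather than going through the grammar, and the names v, obj, subj, vp, and s are just for illustration.

```python
import nltk

sfs = nltk.sem.Expression.fromstring

v = sfs(r'\y.\x.likes(x,y)')    # semantics of the V 'likes'
obj = sfs('chris')              # semantics of the object NP
subj = sfs('bobby')             # semantics of the subject NP

# VP rule: ?v(?obj). simplify() carries out the substitution of chris for y.
vp = v.applyto(obj).simplify()
print(vp)                       # \x.likes(x,chris)

# S rule: ?vp(?subj). Now bobby is substituted for x.
s = vp.applyto(subj).simplify()
print(s)                        # likes(bobby,chris)
```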

Now that we have the definition, we can parse it into an actual grammar that NLTK can use, and then connect it to a parser (we will use the one called FeatureChartParser).

from nltk import grammar
gram = grammar.FeatureGrammar.fromstring(cfgdef)
cp = nltk.FeatureChartParser(gram)

And now we can parse some sentences. Let’s start with “bobby likes chris”:

parses = list(cp.parse('bobby likes chris'.split()))
print(len(parses))
print(parses[0])

If everything worked up to now, you should see that there is 1 parse, and print(parses[0]) will show you the parse it got.

The very first line is the overall semantic value for the tree, which we can get like this:

treesem = parses[0].label()['SEM']
print(treesem)

And, now that we have this expression, we can test it against the model to see if it is actually true. Note that we are using satisfy and not evaluate – the evaluate function takes a string and turns it into a semantic expression, and then calls satisfy. Since we already have a semantic expression, we can just call satisfy directly.

print(m.satisfy(treesem, g))

And thus we learn that, in this model, Bobby does not like Chris.
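The relationship between evaluate and satisfy can be seen side by side in a small sketch. It assumes a made-up cut-down world (toy_dom, toy_val, toy_m, toy_g are hypothetical names, with just the one fact that Dana likes Chris) so that it runs without your own model.

```python
import nltk

# A made-up slice of the world with a single "liking".
toy_dom = {'b', 'c', 'd'}
toy_val = nltk.Valuation.fromstring("""
bobby => b
chris => c
dana => d
likes => {(d, c)}
""")
toy_m = nltk.Model(toy_dom, toy_val)
toy_g = nltk.Assignment(toy_dom)

formula = 'likes(bobby, chris)'
expr = nltk.sem.Expression.fromstring(formula)

# evaluate() takes the string and parses it itself;
# satisfy() expects an Expression that has already been built.
from_string = toy_m.evaluate(formula, toy_g)
from_expr = toy_m.satisfy(expr, toy_g)
print(from_string, from_expr)   # False False
```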

If we want to know if Bobby likes Dana, we just change the sentence.

parses = list(cp.parse('bobby likes dana'.split()))
print(parses[0])
treesem = parses[0].label()['SEM']
print(treesem)
print(m.satisfy(treesem, g))

Task 9. Use this grammar to parse sentences telling you whether Chris likes Bobby and whether Chris likes the Sun.

Don’t forget that the Sun is all one word (the_sun) in this grammar.

That’s actually pretty cool. We can get from a sentence to a tree to truth conditions to an actual evaluation of whether a sentence is true or false. Granted, we can’t do very complicated sentences, but we have a place to start and we can kind of see how we could proceed.

Handling quantifiers

I’m going to take us one step further, but this is going to get a little bit complex. What we’re going to try to do is allow for quantifiers like “every bostonian”.

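What we want to get at the top of “every bostonian likes chris” is: every x.(bostonian(x) -> likes(x, chris)) (I am writing every here to match the English; in the formulas we give to NLTK, this quantifier is spelled all, as in the evaluate calls above.)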

Let’s assume that the VP is still going to be: \y.likes(y, chris). So, the question then is: what semantics can we give to “every bostonian” that can combine with the VP to give us what we want for S? It’s pretty clear that there’s nothing we can pick for y that we can put into likes(y, chris) to get that every x... semantics that we want for S.

Above I switched the variable name in the semantic value of the VP. Instead of saying \x.likes(x, chris) I said \y.likes(y, chris). I made this change because I think it will be less confusing later. But those two functions are completely equivalent. It doesn’t matter what the variable is, it just has to match. Those are both the same as \z.likes(z, chris) or \rhinoceros.likes(rhinoceros, chris).

The trick that semanticists pull at this point is to say that actually what is happening here is not that the VP is a function that takes the NP subject as an argument. Rather, the NP is a function that takes the VP as an argument. That is, the meaning of “every bostonian” is going to be something that takes a function (like “likes-chris”) and returns the value we want for S. More concretely, we assume that “every bostonian” is the function:

\P.(every x.(bostonian(x) -> P(x)))

If we apply this function to \y.likes(y, chris), that means we set P (for “predicate”) to be equal to \y.likes(y, chris), and so we can substitute \y.likes(y, chris) in for P in the part of the definition after the \P. prefix. That gives us:

every x.(bostonian(x) -> (\y.likes(y, chris))(x))

which simplifies to what we want:

every x.(bostonian(x) -> likes(x, chris))

The “simplification” step here comes up repeatedly. Remember that this “lambda-notation” for a function is \x.(something...x...something) and what that means is “given a value, replace all instances of x with that value”. The notation for a function with an argument is function(argument), and when the function is in lambda-notation, it looks like \x.(something...x...something)(argument). Replacing x with argument, we get (something...argument...something). That’s what is happening in these “simplification” steps.
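NLTK can carry out these simplification steps for us, which makes it possible to check the “every bostonian” derivation above. A minimal sketch, writing the quantifier as all (the spelling NLTK’s formulas use), with every_bostonian, likes_chris, and applied as illustrative names:

```python
import nltk

sfs = nltk.sem.Expression.fromstring

every_bostonian = sfs(r'\P.all x.(bostonian(x) -> P(x))')
likes_chris = sfs(r'\y.likes(y, chris)')

# The function together with its argument, not yet simplified.
applied = every_bostonian.applyto(likes_chris)

# simplify() substitutes likes_chris for P, and then x for y.
result = applied.simplify()
print(result)   # all x.(bostonian(x) -> likes(x,chris))
```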

I told you it was going to be complicated. But I think it’s still comprehensible, though it might take a couple of readings-through.

Before we put this into the grammar, let’s also deal with the fact that we can also talk about “every person” as well as “every bostonian”. We want to split up “every” and the noun, and assign a meaning to each. The meanings for bostonian, etc. can just be these:

ndef = r"""
N[SEM=<\x.bostonian(x)>] -> 'bostonian'
N[SEM=<\x.cantabrigian(x)>] -> 'cantabrigian'
N[SEM=<\x.spaceball(x)>] -> 'spaceball'
N[SEM=<\x.person(x)>] -> 'person'
"""

And then we want the semantics of “every” to take one of those Ns and give us back the “every” value we outlined above. Here’s what we can set “every” to in order to get that:

ddef = r"""
D[SEM=<\N.(\P.(all x.(N(x) -> P(x))))>] -> 'every'
"""

So with that D and the Ns before, we make NPs out of them, and then we need to define S to apply the subject NP to the VP (the reverse of what we had done before). So:

cfgdef = r"""
% start S
S[SEM=<?subj(?vp)>] -> NP[SEM=?subj] VP[SEM=?vp]
NP[SEM=<?d(?n)>] -> D[SEM=?d] N[SEM=?n]
VP[SEM=<?v(?obj)>] -> V[SEM=?v] NP[SEM=?obj]
"""

Now that we’ve changed the definition of S so that the NP is the function and the VP is the argument, we need to fix our proper names. The proper names used to be just referring to individuals, but if the subject needs to be a function that takes a predicate as an argument, we need to make proper names (like “Andrea”) be functions as well. What semanticists do here is interpret “Andrea” as being not the individual a, but rather a function that is true of any predicate that holds of a. That is:

npdef = r"""
NP[SEM=<\P.P(andrea)>] -> 'andrea'
NP[SEM=<\P.P(bobby)>] -> 'bobby'
NP[SEM=<\P.P(chris)>] -> 'chris'
NP[SEM=<\P.P(dana)>] -> 'dana'
NP[SEM=<\P.P(the_sun)>] -> 'the_sun'
NP[SEM=<\P.P(the_moon)>] -> 'the_moon'
"""

The last problem we need to tackle is that we need to derive the value of the VP correctly, but now objects are not individuals but functions taking predicates. We still want the semantics of the VP “likes chris” to be \x.likes(x, chris) but now we need to build that from a combination of whatever semantics we assign to “likes” and the semantics we just defined above for “Chris”.

What we’re going to do here is change “likes” so that it still takes “Chris” as an argument, but just expects it to be this higher type. It’s confusing, I know. But I’ll walk through it anyway.

The verb “likes” is going to take an argument NP, that argument NP might be “Chris” and the semantic value of “Chris” is \P.P(chris). We’re going to take that and call it X. This is the function that is true of any property Chris has. What we want to return is \x.likes(x, chris). Here is how we will define “likes”:

likesdef = r"""
V[SEM=<\X y.X(\x.likes(y,x))>] -> 'likes'
"""

So, if we are combining “likes” and “chris”, then we have:

(\X y.X(\x.likes(y,x)))(\P.P(chris))

Simplifying by replacing X with \P.P(chris) we get:

\y.(\P.P(chris))(\x.likes(y,x))

Simplifying by replacing P with \x.likes(y,x) we get:

\y.(\x.likes(y,x))(chris)

Simplifying by replacing x with chris we get:

\y.likes(y,chris)
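These three steps can be checked mechanically with simplify(). The sketch below builds the expressions directly (not through the grammar), and then also applies a proper-name subject to the resulting VP, the way the new S rule does; the variable names are just for illustration.

```python
import nltk

sfs = nltk.sem.Expression.fromstring

likes = sfs(r'\X y.X(\x.likes(y,x))')   # the new semantics for 'likes'
chris = sfs(r'\P.P(chris)')             # the new semantics for 'chris'
andrea = sfs(r'\P.P(andrea)')           # the new semantics for 'andrea'

# VP rule: ?v(?obj). All three substitutions happen inside simplify().
vp = likes.applyto(chris).simplify()
print(vp)       # \y.likes(y,chris)

# New S rule: ?subj(?vp). The subject is now the function.
s = andrea.applyto(vp).simplify()
print(s)        # likes(andrea,chris)
```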

And that is what we wanted. It’s hard to keep track of, and I think you would probably need to work out a bunch of these before you could feel confident that this is a generally applicable definition for a transitive verb, but let’s just assume it is. So, we are almost ready to assemble our new grammar. One other addition we can make is the quantifier “a”, which works in much the same way as “every” did:

adef = r"""
D[SEM=<\N.(\P.(exists x.(N(x) & P(x))))>] -> 'a'
"""

Ok, let’s finally build this grammar. We can pull all the pieces together like this:

cfgdef += ndef + ddef + adef + npdef + likesdef
print(cfgdef) # just to make sure it looks right
gram2 = grammar.FeatureGrammar.fromstring(cfgdef)
cp2 = nltk.FeatureChartParser(gram2)
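If you want to sanity-check the determiner meanings on their own before parsing, you can apply them to a noun meaning by hand. This is only a sketch that mirrors the NP rule ?d(?n) outside the grammar; every_det, a_det, and person are illustrative names.

```python
import nltk

sfs = nltk.sem.Expression.fromstring

every_det = sfs(r'\N.(\P.(all x.(N(x) -> P(x))))')    # semantics of 'every'
a_det = sfs(r'\N.(\P.(exists x.(N(x) & P(x))))')      # semantics of 'a'
person = sfs(r'\x.person(x)')                         # semantics of 'person'

# NP rule: ?d(?n). The result is still waiting for a predicate P (the VP).
every_person = every_det.applyto(person).simplify()
a_person = a_det.applyto(person).simplify()
print(every_person)   # \P.all x.(person(x) -> P(x))
print(a_person)       # \P.exists x.(person(x) & P(x))
```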

And now the moment of truth. Let’s try parsing “every person likes dana”.

parses = list(cp2.parse('every person likes dana'.split()))
print(parses[0])
treesem = parses[0].label()['SEM']
print(treesem)
print(m.satisfy(treesem, g))

If it said True, you have my permission to stand up and do a little jig.

If it didn’t, it should have, so you probably need to go back and check for typos.

Task 10. Check whether Andrea likes every person, whether a spaceball likes a person, whether every Bostonian likes the Sun.

One could imagine continuing on, but at this point, that’s on your own time.