Trees

CAS LX 390 / NLP/CL Homework 4
GRS LX 690 Fall 2017 due Tue 10/3

Setting up the environment

We’ll now use NLTK to do a little bit of actual theoretical linguistics. This is at least partly based on chapter 8 of the NLTK book.

The first step is to get the environment set up. Because we want to save a file and have Python find it, we need to put the files someplace it looks.

I mainly know how to do this on a Mac—if you are working in Windows, something similar should be possible.

It turns out that the Python console (iPython console) in Spyder will accept some commands that let you change your “working directory.” If you type

pwd

(for “print working directory”), it should tell you where it will look for files. When I did that just now it told me /Users/hagstrom (because my user name on this computer is hagstrom, so this is my home directory). If you do not see something analogous when you try this, you can point the working directory at somewhere else using cd like this:

cd /Users/hagstrom

(for “change directory”). After doing that, pwd should tell you it’s now looking where you just pointed it to. Although I don’t have a means to test it, my guess is that on Windows you can use the same thing except with backslashes (\) instead of forward slashes (/). That is, I think the Windows home directory is \Users\username\.

Note on IDLE: When I lauch IDLE, it sets the default working directory to my Documents folder rather than to my home folder, so the simplest way to work this out if you are using IDLE instead of Spyder is to put the files you want Python to find in the Documents folder.

Just to make sure everything’s working, though, let’s start with this, which kind of mimics what we were trying to in class.

Create a new file (choose New File… from the File menu in Spyder), and the clean out the template stuff it puts in there (like “# -*- coding: utf-8 -*-” etc.) so that it is actually empty.

Type just the following:

S -> NP VP

And save. It will bring up a file dialog, and it would normally insist that you save it as a file ending in .py—but we want to save this as a file ending in .cfg (“context-free grammar”). On the Mac at least, in order to make it allow a .cfg file extension, you need to select the popup that says “Files of type:” and change it to “All files (*)” first. Then it will let you save a .cfg file. Save it as something like homework4.cfg in your home directory (or, wherever you pointed your working directory at).

If it worked, you should be able to do the following without getting any errors (or, indeed, any kind of response).

import nltk
hw4cfg = nltk.data.load("file:homework4.cfg")

If you did get errors, it probably could not find the file, which would mean either it is looking in the wrong folder, or you saved it in the wrong folder, or both. Or there’s just a typo. To be sure it worked (in the absence of any response), try:

>>> hw4cfg.productions()
[S -> NP VP]

It should give you the rule you entered into homework4.cfg.

Next, add two more lines to homework4.cfg so that it looks like this:

S -> NP VP
NP -> "I"
VP -> "left"

and save it. Now, we’ll re-load it. Before we loaded it like in the block below—but if you try that now, you’ll find that it didn’t update hw4cfg:

>>> hw4cfg = nltk.data.load("file:homework4.cfg")
>>> hw4cfg.productions()
[S -> NP VP]

The reason (as I mentioned quickly in class) is that NLTK believes it already loaded this, so it doesn’t bother to load it again. We need to explicitly tell it that it needs to read the grammar again. The way we do that is to give it a cache=False parameter, to tell it not to read from its “cache” of stored files, but to read it in from the disk again.

>>> hw4cfg = nltk.data.load("file:homework4.cfg", cache=False)
>>> hw4cfg.productions()
[S -> NP VP, NP -> 'I', VP -> 'left']

That’s more like it. So, don’t forget about this as you change your grammar, you need to tell NLTK explicitly that you need it to be re-loaded.

Last bit of setup, let’s get it to draw a simple tree with this grammar. Clearly, this grammar is only useful for parsing the sentence “I left.” So, we will set that up. We set raw to be the string of words representing the sentence. We then split it into words (a form of “tokenizing”) using split() (which breaks raw at the spaces). The print(sent) line is there so you can see what it did. We then ask NLTK to make us a recursive descent parser based on the hw4cfg grammar, which we will call parser.

raw = "I left"
sent = raw.split()
print(sent)
parser = nltk.RecursiveDescentParser(hw4cfg)
treegen = parser.parse(sent)
trees = [t for t in treegen]
print(trees)
trees[0].draw()

Once we had parser set up, we asked the parser to parse our sentence, and store the result in treegen. I used this name because what we have really at this point is not the possible trees for the sentence, but rather a generator of those trees. The next step (trees = [t for t in treegen]) actually makes a list of those trees. It might seem like that line does very little, but it forces NLTK to use the treegen generator to construct the trees and put them in a list, called trees. Now we actually have a list of trees. The print(trees) shows you that. Then the last line tells the first (and in this case, only) tree in trees to draw itself.

When you say trees[0].draw(), this will cause a program (also called python, just like Spyder is) to start, and display a window with a small tree in it. This window is likely going to be hidden behind the Spyder window. You’ll need to find it, look at the tree it drew, and then explicitly close it—only then will you get control back in the console.

I hope that at this point, everything has either gone smoothly or at least been troubleshootable enough that we can proceed to do actual things from here on.

A note on annoying “path” related errors: As I’ve been setting this homework up, I keep bumping into various installation issues.

One that I hit pretty regularly is that if you have a Tree object and you try to display it directly (e.g., if tree is a Tree object and you just type tree at the Python prompt to see what is in it), I get an error that basically says that gs cannot be found. The gs that it can’t find is Ghostscript, which is used to create a graphical representation of the tree. As far as I can tell, you can’t turn this off, it wants to print the graphic, and it can’t find the program that lets it do this. In fact, I actually have GhostScript installed, but it is not “on the search path”. However, I am not going to encourage you, now at least, to install Ghostscript. Instead, we can just print(tree) to get a textual representation.

For those who have Ghostscript installed and want to fix it and know what I’m saying here: My Ghostscript was installed at /usr/local/bin/gs and /usr/local/bin was not in my path within Python. Rather than trying to add this to the path, I just made a symbolic link: ln -s /usr/local/bin/gs ~/anaconda/bin/gs — because ~/anaconda/bin is in the search path. There are about six other ways you could solve this problem, but that’s the one I picked.

When it is actually working, you see this:

tree display

Developing a phrase structure grammar

We can start with the basic “park grammar” that we used in class (and which comes from the book).

So, you can fix up homework4.cfg to have the rest of this grammar in it. It should be evident that this grammar can handle sentences like “Mary saw Bob,” “my dog saw a cat in the park,” “a dog ate John.”

S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"

Task 1. Prove to yourself that this can indeed handle “Mary saw Bob,” “my dog saw a cat in the park,” and “a dog ate John” by following the procedure we used above for “I left.”

Specifically, set raw to be the sentence string, split it into words using split() and store it in sent. At least for the first one, you’ll have to reload the grammar (using cache=False again), and create a new recursive descent parser for it called, e.g., parser. Once you have parser created, you can ask it to give you a tree generator treegen for each sentence as you check it, create a trees list using the generator, and then, if you want, draw the trees using trees[0].draw() again.

Task 2. Add adjectives to the grammar so that it can handle “My annoying little dog saw a cat in the lovely park” and “a vicious dog ate John.”

The chapter pretty much goes over this, around example 8.5, though it is quite concise there. When you think about the structure of “my annoying little dog”, we have a determiner (my), two adjectives (annoying and little), and then the head noun (dog). So, in addition to NP -> Det N (which we need to keep so we can still get “a dog”), we need an NP rule that puts an adjective between Det and N. We could just say NP -> Det Adj N, except that doesn’t cover “my annoying little dog” (since there are two adjectives). We could add yet another rule (NP -> Det Adj Adj N) to account for two adjectives, except then it seems like we’re missing something. It’s not that in English you can have zero, one, or two adjectives between Det and N—you actually can have basically any number.

So, instead of adding adjectives in as just suggested above, it’s better to say that you can combine an Adj and a Nom (which is what the book calls the node that in Intro you would have called N') to form a Nom—which you could then combine with another Adj to form a Nom—which you could then combine with another Adj to form a Nom—and so on. This is “recursive” in the sense that the result of applying the Nom -> Adj Nom rule yields a structure that you can apply the rule to again.

So, you want to add a rule like Adj -> "annoying" | "little" | "lovely" | "vicious" so that it covers the adjectives we need, and then a Nom -> Adj Nom rule to allow the iterative attachment of adjectives to “N’”, and then revise the NP rule so it expands to Det and Nom (rather than to Det and N), and add one more Nom rule that allows it to be rewritten just as N (without any further adjectives).

Task 3. This grammar will give you nothing for “a man walked in the park”. Why?

If you haven’t done this already, let’s define a function that can take a string, break it up into words, parse it, and return the trees. That will make it simpler to deal with things. What I had in mind is a function like this:

def get_trees(raw):
    sent = raw.split()
    hw4cfg = nltk.data.load("file:homework4.cfg", cache=False)
    parser = nltk.RecursiveDescentParser(hw4cfg)
    treegen = parser.parse(sent)
    trees = [t for t in treegen]
    return trees

This will re-initialize the parser each time, as well, so we don’t need to keep doing those steps by hand as we change things. Once we have that defined, we can just say:

>>> trees = get_trees("a man walked in the park")
>>> trees
[]
>>> trees = get_trees("John saw Mary")
>>> trees
[Tree('S', [Tree('NP', ['John']), Tree('VP', [Tree('V', ['saw']), Tree('NP', ['Mary'])])])]

Next, we will try to find the subject of a sentence. Descriptively, the subject of a sentence is the NP that is a daughter of S. Ultimately, in this grammar we have built so far, it’s always going to be in the same place, but let’s explore this a little bit anyway.

We can take a tree that our parser has found for us and break it up into subtrees, which will allow us to isolate NP-daughter-of-S pretty easily. So, for the “John saw Mary” tree, what get_trees("John saw Mary") gives us back is a list of parses (containing just one element), so let’s look at that tree. We’ll name it tree:

>>> tree = get_trees("John saw Mary")[0]
>>> print(tree)
(S (NP John) (VP (V saw) (NP Mary)))

Just a reminder of the “path-related problems” note from above. If you just type tree to find out what tree contains, it will either draw a graphic of the tree if you’re lucky, or crash out with a big list of errors stemming from an inability to find gs.

Just use print(tree), that prints the tree out in text.

If you ask a Tree to give you subtrees() it will go through a tree and give you a list of all the subtrees contained in it.

>>> [s for s in tree.subtrees()]
[Tree('S', [Tree('NP', ['John']), Tree('VP', [Tree('V', ['saw']), Tree('NP', ['Mary'])])]),
 Tree('NP', ['John']),
 Tree('VP', [Tree('V', ['saw']), Tree('NP', ['Mary'])]),
 Tree('V', ['saw']),
 Tree('NP', ['Mary'])]

So, if we want to find the “NP daughter of S”, we look for a subtree whose label is “S” and for which the label of one of the contained Tree nodes is “NP”.

There are not many subtrees that have the label “S”. In fact, there’s just the one, and it’s the whole tree. But we can still find it in a more general way, by adding an if clause to the list comprehension we used to display them above.

>>> [s for s in tree.subtrees() if s.label() == "S"]
[Tree('S', [Tree('NP', ['John']), Tree('VP', [Tree('V', ['saw']), Tree('NP', ['Mary'])])])]

Now, we want to find the daughter of that node S whose label is “NP”. What are the daughters of S? A Tree basically behaves like a list, so we can count the number of daughters with len(tree) and go through them with for n in tree. If there are two, tree[0] will be the left one, and tree[1] will be the right one.

This is illustrated below. Notice that at the end, when we are looking at the left daughter of VP, we can address it in three equivalent ways: we can explicitly request the 0th element of the 1st element of the snode Tree (that is, the left daughter of the right daughter of S), or we can compress that into a tuple and ask for the (1, 0)th element (still means the left daughter of the right daughter of S), and it’s possible (though kind of confusing I think) to leave the parentheses off the tuple when it is within square brackets (like in the last example).

>>> snodes = [s for s in tree.subtrees() if s.label() == "S"]
>>> snode = snodes[0]
>>> print(snode)
(S (NP John) (VP (V saw) (NP Mary)))
>>> len(snode)
2
>>> print(snode[0])
(NP John)
>>> len(snode[0])
1
>>> print(snode[1])
(VP (V saw) (NP Mary))
>>> len(snode[1])
2
>>> print(snode[1][0])
(V saw)
>>> print(snode[(1,0)])
(V saw)
>>> print(snode[1,0])
(V saw)

So, this is how you navigate subtrees. If we have snode set to be an “S”-labeled Tree, as we did above, we can cycle through its daughters and find the one that has an “NP” label, and that will be the NP daughter of S.

>>> [d for d in snode if d.label() == "NP"]
[Tree('NP', ['John'])]

We can actually combine these in a single (though complicated) list comprehension. This takes the NP-finder above, but adds in the computation of snodes as well. Notice the order. We’re making a list of ds, which are the daughters of snode. So we continue to start with [d for... but then we are going to find the snodes first, and then the daughters of those once we have an snode. So, we continue with n in tree.subtrees() if n.label() == "S" meaning that n is going to be a subtree with label “S” that we want to then check the daughters of. So, then we go through the daughters with for d in n if d.label() == "NP". Put together, it looks like this:

[d for n in tree.subtrees() if n.label() == 'S' for d in n if d.label() == 'NP']

Task 4. Understand how that complex list comprehension works. Convince yourself by changing it to find the object. (What we did above is find the subject, which is the NP daughter of S. So, how do we characterize the object? Use the technique above to find it. The answer should be “Mary”, right?)

Now, let’s enhance the grammar by adding the ability to embed clauses, like in “Bob thought that John saw Mary” and “Bob said that John saw Mary”.

Task 5. Enhance the grammar so that it can parse “Bob thought that John saw Mary” and “Bob said that John saw Mary”.

The idea here is to change homework4.cfg like we did before when we were adding adjectives. We need to add the verbs "said" and "thought" at least, and the complementizer "that".

Consider that, although “said” and “thought” are verbs, they do not take NP objects. So they’re a different kind of a verb. They are in the category of “verb” but they are a sub-type, a sub-category of verb. So, we do not simply want to add something like ... | "thought" | "said" to the V -> line. We need a different kind of verb, the book calls them “Sentential verbs” and gives them a label of SV, so we can follow that here.

If the sentential verbs are category SV, we still want to be able to form a VP out of a SV and its complement. So, we need to add that as an option to the VP rules. In order to do this, we also need to figure out what the complement of such a verb is.

This is simplifying things, but let’s assume that the complement of “thought” is basically always that S — so “that” is a complementizer, we can call it category C and we can form a CP from C and S. Then SV type verbs will have a CP as their complement. It’s pretty close to what you’d have seen in Intro syntax, apart from probably calling S “IP” instead.

Ultimately, you should be adding a rule for C (the category of “that”), a rule for CP (forming that S structures), a rule for SV (the category of “said” and “thought”), and adding a rule for VP that allows SV to be the head of a VP.

Task 6. Give trees for “Bob said that John saw Mary in the park” and “the annoying man thought that Bob said that my dog saw a vicious cat in the park”. Ideally, drawn out using tree.draw() but you can give the print(tree) version too.

Task 7. Find the subjects of those sentences using our subject-finding procedure from before. (Should be “Bob” and “John” in one case, “the annoying man”, “Bob”, and “my dog” in the other.)

Task 8. Find the objects of those sentences using our object-finding procedure from before. (Should be “Mary” in one case, “a vicious cat (in the park)” in the other.)

Relative clauses

A relative clause is something like “who saw Mary” in “the man who saw Mary”. It is formed by adding a wh-question to a noun, more or less. So the referent of “the man who saw Mary” is the individual that is a man, and also the answer to the question “Who saw Mary?”.

Suppose we want our parser to recognize “the man who saw Mary” as an NP.

It can already recognize “the man” and “the man in the park”, so we can simply add an extra option for the NP rule to allow for this.

Task 9. If we just add ... | Det Nom S | "who" to the NP rule, we can parse “the man who saw Mary saw Bob”. Give some reasons why this is not really that good a solution.

Coming up with a better solution though is kind of difficult. Let’s see if we can hammer together a kind of a solution for this case before we look at the more general problem.

In “the man who saw Mary”, it seems like “who” is basically the subject of “saw”. So, “who saw Mary” is a special kind of sentence with “who” as the subject. Let’s define this kind of special case by, first, making “who” a special kind of NP, and then making a relative clause be a special kind of sentence with “who” as its subject.

That is, add the following to the grammar:

RP -> "who"
RC -> RP VP
NP -> Det Nom RC

Task 10. Draw a tree using the grammar above of “the man who saw Mary saw Bob”.

There’s another form a relative clause can take, though. You can also say “the man who Mary saw saw Bob”. What’s different here is that “who” is now really playing the role of the object of the “saw” that Mary did. But “who Mary saw” is not in the right form for parsing.

What’s going on? Well, the relative pronoun “who” generally corresponds to a gap in the sentence. We didn’t notice the gap in subject posistion, but it’s obvious when the gap is in the object position. These relative clauses are, again, basically wh-questions, and the normal way wh-questions are formed is to move the wh-word to the front of the clause.

And this is where parsing becomes difficult, when things move around in a sentence.

Let’s try a kind of a hack to make this work.

For any transitive verb (“saw”, “ate”, and “walked” in our grammar), there is the version we already have, which form a VP with their object NP. If any of these appear in an “object relative”, then the object NP will be missing. So, let’s make a version of the VP that has a gap. That is, let’s define VPG to be just V rather than V NP.

Task 11. Define VPG that way in the grammar, so that any VP has an analogous VPG without an object.

We haven’t used VPG yet in the grammar apart from defining it. But conceptually, what we want is that VPG should be available in a relative clause where the object is missing. So, we want to redefine RC, in a form that’s closer to what we believe the structure of these is. So replace the RC definition you added before with this:

RC -> RP SOG
SOG -> NP VPG

The idea here is that SOG (sentence-with-object-gap) is like a regular S but has a VP with an object gap.

Task 12. Draw a tree using the grammar we have at this point of “the man who Mary saw in the park saw Bob”.

Late breaking note: If you are not getting anything for “the man who Mary saw in the park saw Bob” then you may have missed a subtle aspect of the instruction for Task 11. You want any VP to have an analogous VPG without an object. Recall that before we had

VP -> V NP | V NP PP

and (in this little grammar at least) the only place a PP can attach is inside a VP. If you only created a VPG rule corresponding to VP -> V NP and not to VP -> V NP PP then it is not going to know how to parse “…who Mary saw in the park…” (which has a VP with no object, but still with a PP). So, you need a VPG that corresponds to the second VP expansion as well.

Doing this broke our ability to parse the subject relative “the man who saw Mary”, however.

With our new understanding of relative clauses, the gap in “the man who saw Mary” must actually be the subject of “saw”. So, let’s add some rules to get subject relatives as well.

Start by adding RC -> RP SSG to the grammar (SSG being intended to mean “sentence-with-subject-gap”). So, there are now two kinds of relative clause the grammar can handle. Both start with the RP “who”, and one continues with SSG (an S but missing the subject), and the other continues with a SOG (an S but missing the object). We still need to define SSG. What should it be?

Task 13. Add a definition of SSG into the grammar.

Now we should be able to parse both “the man who saw Mary saw Bob” and “the man who Mary saw saw Bob”.

Task 14. Draw trees of “Mary walked the dog who my cat saw in the park” and “Mary walked the dog who saw my cat in the park”.

There were other things I was thinking about trying out here, but this is good enough.