name: title
layout: true
class: center, middle, inverse

---

# Issues in parsing #

---

layout: false

We've been looking at parsing with a "recursive descent" parser by building a little grammar and finding the parses of a string.

```python
nltk.app.rdparser()
```

A recursive descent parser breaks up goals into subgoals. The goal is to find an S. This can be accomplished by finding an NP and a VP.

Finding an NP can be accomplished in one of three ways:

```python
NP -> Det N PP
NP -> Det N
NP -> 'I'
```

To use the first one (`Det N PP`), we start by looking for a `Det`, which can be matched in one of two ways:

```python
Det -> 'the'
Det -> 'a'
```

First one: matched! Now on to N, then to PP. Which fails, ultimately, so backtrack to `Det N`.

---

We could write such a parser, but NLTK has one built in. That's what we were using. This is a "top-down" parser. It can be kind of inefficient, particularly when the vocabulary gets bigger.

Shift-reduce parsing: a "bottom-up" parser. It seems like it would be more efficient: it only considers the words that are actually there, and then works out what it can build from them.

```python
nltk.app.srparser()
```

Fairly clear what it does as it proceeds.

---

Shift-reduce parser: finds something it can build out of what it has, plus the incremental addition of the next word.

But the default sentence "my dog saw a man in the park with a statue" fails! Because it sees that it can build "saw a man" as a VP, and then that can be built into an S, and then there's no more room for the PP. You have to shift in "in" before it incorporates the NP into the VP.

This is also partly due to it not remembering its choices and not having a way to undo them if it gets into trouble.

We should be able to think of ways to improve this. For example, allow backtracking of some kind to make a different choice (generally speaking, failing to reduce something that can be reduced). But it's clearly not an easy problem to solve generally, at least not "cheaply."

---

A "left-corner parser" tries to combine a top-down approach with a connection to the actual words, by using only top-down expansions that could line up with the words.

```python
S -> NP VP
VP -> V NP
NP -> Det N | PropN
PropN -> 'John' | 'Mary'
Det -> 'the'
N -> 'dog'
V -> 'saw'
```

As we expand down, looking for S, then looking for NP VP, we have three options for expanding NP: 'John', 'Mary', and Det N. "Mary" is the first word (if we're going for "Mary saw the dog"). So, whatever we do here, we need to expand S such that the first word is "Mary". How to accomplish this?

---

```python
S -> NP VP
VP -> V NP
NP -> Det N | PropN
PropN -> 'John' | 'Mary'
Det -> 'the'
N -> 'dog'
V -> 'saw'
```

Looking at the grammar, the "left corner" of each non-terminal is:

| LHS category | left corner |
|--------------|-------------|
| S            | NP          |
| NP           | Det, PropN  |
| VP           | V           |

So, to find `S` we need to find `NP VP`. To find `NP` we need to find either `Det` or `PropN`. But "Mary" is not a `Det`, so discard that. So find `PropN`. "Mary" is a `PropN`, so we found the `NP`. The rest is easy (here).

Then: scale it up.

Beyond that, we'll just leave it as a concept without trying to implement it. There is a quite complex discussion of "chart parsing" that follows this in the book, and maybe we can work through that at some later point. Maybe in a homework, where there's time to think it through.

---

## Human sentence processing ##

Tom said that Bill had taken the cleaning out yesterday.

--

cf. One morning, I shot an elephant in my pyjamas.

--

The steak with the sauce that was tasty didn't win the prize.
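---

To connect the ambiguity in the elephant sentence back to NLTK: here is a minimal sketch (a toy grammar in the style of the NLTK book's PP-attachment example, not code from these slides) that finds *both* parses with a chart parser.

```python
import nltk

# A small CFG with an attachment ambiguity: the PP "in my pyjamas"
# can attach to the NP ("an elephant in my pyjamas") or to the VP.
grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pyjamas'
V -> 'shot'
P -> 'in'
""")

sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pyjamas']
parser = nltk.ChartParser(grammar)

# Prints two trees, one per attachment of the PP.
for tree in parser.parse(sent):
    print(tree)
```

Note that `VP -> VP PP` is left-recursive, which would send the simple top-down recursive descent parser from earlier into a loop; the chart parser handles it fine.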
---

## Human sentence processing ##

The teacher told the kids the ghost story that she knew would frighten them

--

The teacher told the kids the ghost story had frightened that it wasn't true

--

I

--

told

--

the

--

boy

--

the

--

cat

--

scratched

--

Mary

--

would

--

help

--

her.

--

The cotton clothing is made of grows in Mississippi.

--

I like garden path sentences am delightful.

---

## Python structures ##

Moving on (back) to chapter 4.

Variables, type/token (reference/value), `id()`, conditionals (`if...elif...else`), sequences, mutability, "procedural vs. declarative" loops, index variables (really, I think, a matter of readability and efficiency).

Next time: functions, variable scope issues.
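---

## Python structures ##

A minimal sketch of a couple of the chapter 4 points above (the variable names and examples are just illustrative, not the book's own code):

```python
# Reference vs. value: two names for the same list vs. a copy.
foo = ['Monty', 'Python']
bar = foo         # bar is another name for the same object
baz = foo[:]      # baz is a copy (a different object)
print(id(foo) == id(bar))   # True
print(id(foo) == id(baz))   # False

bar[0] = 'British'
print(foo)        # ['British', 'Python']  (mutated through the alias)
print(baz)        # ['Monty', 'Python']    (the copy is unaffected)

# "Procedural" loop with an explicit index variable...
words = ['colorless', 'green', 'ideas']
lengths = []
for i in range(len(words)):
    lengths.append(len(words[i]))

# ...vs. a more "declarative" version of the same computation.
lengths = [len(w) for w in words]
print(lengths)    # [9, 5, 5]
```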