Classification

CAS LX 390 / NLP/CL Homework 6
GRS LX 690 Fall 2016 due Mon 11/21

Classifying things

Ok, there’s been a lot of me typing in front of the class, and you generally following along, so here’s some stuff for you to do on your own to see how it goes. (Eventually, at least. First there will be a lot of me typing here on this page, and you generally following along.)

What we’ve been doing for the last little while is training a little machine to recognize patterns, and then seeing how well it learned. There are different little machines, different patterns, different success rates. But the basic idea is that we are training these little machines to do this so that we can drown them in huge amounts of data that we couldn’t realistically process ourselves, and see what insights we can derive.

We’ll start with something roughly based on exercise 7 of chapter 6. This makes use of the NPS Chat Corpus. These are posts collected from a few chat rooms in 2006, both part-of-speech tagged and dialogue-act tagged. For the moment we’re concentrating on the dialogue-act tags.

There are 15 such tags. The NLTK book lists “Statement”, “Emotion”, “ynQuestion”, and “Continuer” as being among them, but there are more. The description of the corpus linked above gives the whole list, but let’s find it ourselves.

import nltk
from nltk.corpus import nps_chat as nps

Above, I suggested importing nps_chat as nps—the purpose of this is to save typing. Instead of referring to nps_chat all the time, this allows us to refer to nps.

The nps corpus actually derives from several chatroom samples on several dates. You can see what they are, using:

print(nps.fileids())

The filenames indicate the date (in 2006, e.g., 10/19), the chat room they came from (designated by age group: 20s, 30s, 40s, teens), and the number of posts contained in the file. Never mind that for the moment; we will just load them all up into a list called posts. The way to get the posts out of the corpus is with xml_posts(), like so:

posts = nps.xml_posts()

I will now select one particular post, number 119. We can call it p for short; we can find out what dialogue act it is by using p.get('class'), and we can find out which user typed it by using p.get('user'). The user name encodes the date, chat room, and a unique user number, which allows us to see when different posts come from the same person.

p = posts[119]
print(p.text)
print(p.get('class'))
print(p.get('user'))

As I mentioned in class, this corpus is kind of questionable. It isn’t very much fun to read most of the texts. Post 119 is a Statement by 10-19-20sUser7, who says i feel like im in the wrong room. About as good as it gets. But it is dialog, and it is tagged. So, since we know how to get the tags out, let’s proceed with the initial goal of collecting the set of (dialogue-act) tags.

Task 1. Assign the set of dialogue-act tags to the variable acts.

There should be 15 overall. There are probably a couple of ways to do this, but what I’d suggest is something of the form acts = set(something for something in something). Show me how you did it, though. You can check whether you got the right answer by looking at the NPS Chat Corpus page.

We’ll now start with what the textbook does: train a classifier to predict the dialogue-act tag of a given post. To do this, we need to define a way to extract the features that the classifier will pay attention to when it learns, and then build a list pairing each post’s features with its tag. Then we split that list into training and test sets, train a classifier on the training set, and test to see how well it did on the test set.

So, step one, extracting the features we will train on. We can copy this (mostly) straight out of the chapter. I changed it a little bit in a couple of ways, but not the logic of it.

def extract_features(post):
    # One True-valued feature per (lowercased) word token in the post's text
    features = {}
    for word in nltk.word_tokenize(post.text):
        features['contains({})'.format(word.lower())] = True
    return features

Try it out on posts[119]:

print(extract_features(posts[119]))

Now, we will go through all of the posts to make a list of pairs, where the first member of each pair is the extracted features (using extract_features) and the second member is the classification.

fposts = [(extract_features(p), p.get('class')) for p in posts]

We have now applied extract_features to every post. Take a look at element 119, just to see what an element of this list looks like:

print(fposts[119])

We want to use 90% of the corpus for a training set, 10% for a test set. So, we figure out where the 10% line is, and define training and test sets:

test_size = int(len(fposts) * 0.1)
train_set, test_set = fposts[test_size:], fposts[:test_size]

And then we create a classifier (Naive Bayes Classifier) and train it on the training set.

classifier = nltk.NaiveBayesClassifier.train(train_set)

And then see how it did. The number you get, rounded up, should be 0.67.

print(nltk.classify.accuracy(classifier, test_set))

Not bad. Not great, but not horrible. Here comes exercise 7:

The dialog act classifier assigns labels to individual posts, without considering the context in which the post is found. However, dialog acts are highly dependent on context, and some sequences of dialog act are much more likely than others. For example, a ynQuestion dialog act is much more likely to be answered by a yanswer than by a greeting. Make use of this fact to build a consecutive classifier for labeling dialog acts. Be sure to consider what features might be useful. See the code for the consecutive classifier for part-of-speech tags in 1.7 to get some ideas.

If I were mean, I’d just say “do exercise 7 now.” But I think the consecutive classifier was a bit complicated for the uninitiated. So, I’ll have you do exercise 7 more or less, but I’ll try to step you through it a little bit more. See? Not mean.

Think first about what we want to do. Right now, the features we are training our Naive Bayes classifier on just indicate which words are contained in each individual post. We want to add one extra piece of information: the tag of the previous post.

The exercise suggests looking at the consecutive classifier for part-of-speech tags for ideas, but it seems to me that it should only be for ideas. What we’re doing here is somewhat different. One thing that made the part-of-speech case different is that NLTK already has a basic “framework” for part-of-speech tagging set up (the tagger interface nltk.TaggerI), so there was some benefit to adopting that interface and just adding in the specific parts we needed for the consecutive tagger. That is what made it useful to define ConsecutivePosTagger as a class.
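For reference, the book’s class looked roughly like the sketch below. This is paraphrased from memory of the chapter rather than copied verbatim, and pos_features here is a simplified, hypothetical stand-in for the chapter’s feature extractor; the important points are that the feature extractor gets to see the history of tags assigned so far, and that implementing tag() is what lets the class plug into the nltk.TaggerI interface.

def pos_features(sentence, i, history):
    # Simplified stand-in (not the book's exact extractor): the current word,
    # plus the tag assigned to the previous word.
    return {'word': sentence[i],
            'prev-tag': history[i - 1] if i > 0 else '<START>'}

class ConsecutivePosTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                train_set.append((pos_features(untagged_sent, i, history), tag))
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        # Providing tag() is what the TaggerI interface asks for; the interface
        # then supplies the evaluation machinery.
        history = []
        for i, word in enumerate(sentence):
            tag = self.classifier.classify(pos_features(sentence, i, history))
            history.append(tag)
        return list(zip(sentence, history))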

Here we don’t need to adopt a general interface framework, so we can do this without defining a class.

For thinking about this, step one is to revisit how we build up the list of feature sets. This is what we did before:

fposts = [(extract_features(p), p.get('class')) for p in posts]

It is not going to be easy to build up fposts in a one-line list comprehension like that, since we need to keep track of the previous decision and add it to the feature set. We can change this to a slightly more verbose function, though. Here’s a function to start with. This does the same thing as the list comprehension above.

def fpost_list(posts):
    # Pair each post's features with its dialogue-act class
    fposts = []
    for p in posts:
        fposts.append((extract_features(p), p.get('class')))
    return fposts

That is, if you then say fposts = fpost_list(posts) with the function as defined above, you will have the same thing for fposts that we had with the one-line list comprehension earlier. But fpost_list gives us more room to work on putting in the “history.”

Now: finally something to add in yourself! Here’s what we want, for now: the elements of the list that fpost_list returns should be pairs whose first member is a dictionary of features that includes all the ones we get from extract_features(post), as well as a feature like prev-class that holds the class of the previously processed post.

That is, print(fposts[119]) should yield something like:

({'prev-class': 'whQuestion', 'contains(in)': True, 'contains(like)': True, 'contains(feel)': True, 'contains(room)': True, 'contains(i)': True, 'contains(the)': True, 'contains(im)': True, 'contains(wrong)': True}, 'Statement')

Task 2. Modify the fpost_list function so that each feature dictionary includes the previous tag as prev-class, as above.

For the first post, prev-class should be NONE.

Now, let’s try it out:

fposts = fpost_list(posts)
classifier = nltk.NaiveBayesClassifier.train(fposts[test_size:])
print(nltk.classify.accuracy(classifier, fposts[:test_size]))

Task 3. How much did the new, “context-aware” version improve?

Yeah, that’s not so great.

Here are a couple of observations. The only time prev-class gets set to NONE is on the very first post. But the actual corpus is divided into several different chat rooms on several different days. The last post of one chat room should have no logical connection to the first post of another chat room, so you could try building up fposts one fileid at a time, setting prev-class of the first post of each file to NONE. That seems more appropriate.

Another observation. There are plenty of System-class posts in there, which are generally people joining or leaving the chat room. These might not really be true dialog events (that is, it is not clear that there would be a pattern to what precedes or follows them). You could try to have prev-class “skip over” System posts, so that when the prev-class would have been System you simply do not update it.
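If you want to try those two ideas, here is one possible skeleton. It is only a sketch: it assumes the nps and extract_features definitions from above, and the function and variable names are just suggestions, not the one right way to do it.

def fpost_list_by_file():
    # Sketch: walk the corpus one file at a time, resetting the previous class
    # at each file boundary, and (optionally) not letting System posts become
    # the "previous" class.
    fposts = []
    for fid in nps.fileids():
        prev_class = 'NONE'   # or however you represented "no previous post" in Task 2
        for p in nps.xml_posts(fid):
            features = extract_features(p)
            features['prev-class'] = prev_class
            fposts.append((features, p.get('class')))
            if p.get('class') != 'System':   # optionally skip over System posts
                prev_class = p.get('class')
    return fposts

You would then train and test on the result exactly as before (fposts = fpost_list_by_file(), and so on).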

The chat rooms are also divided into age groups (which you can determine by the file name), and maybe the patterns differ between age groups, so you might try adding that as a feature.
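If you go that route, pulling the age group out of the file name is just string manipulation. Something along these lines should work, though check print(nps.fileids()) to confirm the exact format; the fileid shown in the comment is my assumption about what they look like.

def age_group(fileid):
    # e.g., a fileid like '10-19-20s_706posts.xml' -> '20s'
    return fileid.split('_')[0].split('-')[-1]

The returned value could then go into the feature dictionary alongside prev-class, e.g., features['age-group'] = age_group(fid).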

Chat room dialog generally has several people talking at once, and often the messages are kind of interleaved, such that person A and B are talking to each other while person C is providing some kind of monologue in between them. It’s not clear what would help here, but maybe looking at the previous couple of dialogue-act tags (rather than just the preceding one) would make a difference.
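A minimal sketch of that last idea, again with names that are only suggestions (this processes the posts of a single file, so it would slot into the per-file skeleton above):

def two_tag_history(posts_in_one_file):
    # Remember the last *two* classes instead of just one
    out = []
    prev_class, prev2_class = 'NONE', 'NONE'
    for p in posts_in_one_file:
        features = extract_features(p)
        features['prev-class'] = prev_class
        features['prev2-class'] = prev2_class
        out.append((features, p.get('class')))
        prev2_class, prev_class = prev_class, p.get('class')
    return out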

Task 4. Try some of these things out, try to improve the overall accuracy. Try at least three variants to test out whether incorporating various additional features helps the accuracy. You can use the ideas I listed above, or if you have other ideas, try those.

I listed the ideas I had. I tried them all. I did not have very much luck at all improving the accuracy. The best accuracy score I got was 0.706.

Task 5. Try to explain what is going on with the accuracy as you change the features.

Maybe you will have better luck, or a better idea, but in fact I got almost the best accuracy score using the original version, without any context, letting the chat rooms all run together. So the question I am really asking is: why might the accuracy get worse as features are added? I did manage to beat the original version by a little bit, getting it up to 0.706 by running through the files in reverse order (that is, I did the last member of fileids() first, then the preceding one, and so on; the posts within each file were still in forward order). What difference might reversing the order have made?

I went into this problem expecting that there would be a fairly dramatic improvement when context was incorporated. Now that I’ve tried it out, well. A learning experience for all. But there’s some value in learning how to work through this at least.

Authorship

We are going to take a quick look at the movie review database and try to determine who wrote a review, based on some statistics from other reviews by two different authors. This is a “toy” problem, but it will give you a sense of how this can work.

from nltk.corpus import movie_reviews

The fileids in the movie_reviews corpus look like neg/cv000_29416.txt. I checked on a few files that were in the database and cross-checked them on the IMDB archive page, so I know of 3 reviews by one author (JB), 3 by another author (SG), and one that is by one of the two (which we’re going to try to determine). To get the fileids, do this:

jbf = ['29416', '29417', '29439']
sgf = ['29423', '29444', '29465']
myf = ['29497']
sgfids = [f for f in movie_reviews.fileids() if f[10:15] in sgf]
jbfids = [f for f in movie_reviews.fileids() if f[10:15] in jbf]
myfids = [f for f in movie_reviews.fileids() if f[10:15] in myf]

Now sgfids has the fileids for the three reviews by SG (and similarly for jbfids and myfids). What we want to do is write a function that will extract some metrics from a given review.

Task 6. Write a function auth_stats(fileid) that will return three values: average word length, average sentence length, and lexical diversity.

You can get the words using movie_reviews.words(fileids=fileid), and the sentences using movie_reviews.sents(fileids=fileid). Lexical diversity is the ratio of distinct words to total words. Your function can just return a list like [word_length, sent_length, lexical_diversity].
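As a quick sanity check on those corpus calls (and a reminder of how the lexical diversity definition looks in code), something like this should work on a single review; the names here are just placeholders, and your auth_stats will of course do more.

words = movie_reviews.words(fileids=sgfids[0])
sents = movie_reviews.sents(fileids=sgfids[0])
print(len(words), len(sents), len(set(words)) / len(words))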

Task 7. Run the auth_stats function on the three reviews by SG, then on the three reviews by JB, and then on the mystery review (29497). What seems to characterize the reviews by SG as compared to JB, and who wrote the mystery review?

If you go to the IMDB archive page I linked above, you can read the reviews and check your answer.

That’s not a very long homework, but it will do.