name: title
layout: true
class: center, middle, inverse

---

# Classification and testing #

---

layout: false

# Classifying texts #

For example:

- is this email spam?
- is this news article about sports, technology, politics?
- who wrote this text?

# Ex 1 - guessing gender from names #

```python
>>> import nltk
>>> from nltk.corpus import names
>>> print(names.fileids())
['female.txt', 'male.txt']
```

We have the answers here: `female.txt` has names coded as female, `male.txt` has names coded as male. So, the task is to figure out what kinds of things differentiate them, such that the computer can guess which list an unknown name would be in.

---

So let's take a look at what we have here.

```python
>>> print(names.words(fileids='female.txt')[:10])
>>> print(names.words(fileids='male.txt')[:10])
```

Not clear. First of all, there will *absolutely* be errors when guessing on "Abbey", "Abbie", "Abby". Second, is there anything that stands out? Not really.

But, let's take a guess. Suppose we have a feeling that a name that ends in "a" is usually going to come from the female list. Or, maybe more specifically, that the last letter might be a clue. (Note that this is so far a claim about the spelling: "Mika" ends in "a", "Micah" does not.)

Is this reasonable? Well, we can plot the last letters by gender and see.

```python
last_letters = [(fileid, name[-1])
                for fileid in names.fileids()
                for name in names.words(fileid)]
print(last_letters[:5])

cfd = nltk.ConditionalFreqDist(last_letters)
cfd.plot()
```

---

So, ending in "s" is a pretty good indicator that a name is in the male list, and ending in "a" or "i" is a pretty good indicator that a name is in the female list.

The hypothesis is that the last letter is information that allows prediction. So, to predict the file a name will be in, we need to know the last letter (and nothing else, for this hypothesis). Also, to train the classifier, we need to know the last letter. We'll define something that will extract the relevant features (perhaps to revise later).

```python
def gender_features(word):
    return {'last_letter': word[-1]}

print(gender_features("Gabriel"))
```

We are going to use a bit of quasi-magic called a "naive Bayes classifier" to do our predictions. The process is:

- create the classifier
- train the classifier by giving it a bunch of inputs and known-correct outputs
- it adjusts itself so that it would get the right answer for those inputs
- the classifier can then be presented with new inputs and guess the correct output

---

We are not going to focus on actual methods of implementing machine learning here. But the basic idea is that the classifier is trying to estimate, based on the training data, how likely a given input is to be associated with a particular output. So, here, if the input is `last_letter: a` then the name is likely to be in `female.txt`. It determines that by seeing how often those co-occur. So, the more data it gets trained on, the more accurately it reflects the statistical shape of the mapping.

This depends a lot on what we choose as features to train on. If we train the classifier with word length instead of last letter, we expect it will do much worse.

```python
def unpromising_features(word):
    return {'length': len(word)}

print(unpromising_features("Gabriel"))
```

How can we quantify this? How do we know if our classifier is doing a good job? We can't check against genuinely unknown outputs, but we also can't check against the outputs we trained on, because it should get those uncharacteristically well. So: a *training set* and a *test set*.
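Concretely, "doing a good job" will be measured as accuracy: the fraction of held-out items the classifier labels correctly. A minimal sketch of that idea (below we will just call `nltk.classify.accuracy`, which computes essentially this; `manual_accuracy` is only an illustrative name):

```python
# Sketch only: accuracy is the fraction of labeled items the classifier
# gets right. (We will use nltk.classify.accuracy for this below.)
def manual_accuracy(classifier, labeled_features):
    correct = sum(1 for (feats, label) in labeled_features
                  if classifier.classify(feats) == label)
    return correct / len(labeled_features)
```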
---

What we have: two piles of names, one set in `male.txt` and one set in `female.txt`. We want to train the classifier on a bunch of these. We have `len(names.words())` of them. So, let's plan to train on 85% of those, and then see how it fares on the last 15%. We'll set `test_boundary` to be the point where the data switches from test to train.

```python
test_boundary = int(.15 * len(names.words()))
print(test_boundary)
```

We still have to convert the data we have in these two files into something useful. We need a **list of pairs** with the input as the first element and the correct output as the second element. And we need to shuffle them up.

```python
labeled_names = ([(name, 'M') for name in names.words('male.txt')] +
                 [(name, 'F') for name in names.words('female.txt')])
print(labeled_names[:5])
```

```python
import random
random.shuffle(labeled_names)
print(labeled_names[:5])
```

---

This is not really what we need for training and testing, though: we are trying to find out how looking at the last letter compares to looking at the length. So, we need to create the actual data sets by applying each feature extractor to each name.

```python
name_letters = [(gender_features(n), f) for (n,f) in labeled_names]
name_lengths = [(unpromising_features(n), f) for (n,f) in labeled_names]
```

And now we can split these up into a training set and a testing set.

```python
lettrain, lettest = name_letters[test_boundary:], name_letters[:test_boundary]
lentrain, lentest = name_lengths[test_boundary:], name_lengths[:test_boundary]
```

---

And we are finally ready to train and test. Let's train the two classifiers.

```python
let_classifier = nltk.NaiveBayesClassifier.train(lettrain)
len_classifier = nltk.NaiveBayesClassifier.train(lentrain)
```

How does each do on "Neo" and "Trinity"?

```python
print(let_classifier.classify(gender_features('Neo')))
print(let_classifier.classify(gender_features('Trinity')))
print(len_classifier.classify(unpromising_features('Neo')))
print(len_classifier.classify(unpromising_features('Trinity')))
```

How does each do on the entire test set?

```python
print(nltk.classify.accuracy(let_classifier, lettest))
print(nltk.classify.accuracy(len_classifier, lentest))
```

Remember that anything below 50% would mean it is getting things systematically *wrong* (worse than guessing at chance). So, last letter is better than length (as expected), but it is still not great.

```python
let_classifier.show_most_informative_features(5)
len_classifier.show_most_informative_features(5)
```

---

Let's see if we can improve this. How? One thing we could do is add a zillion different features. Check the book for an example of this, where it proposes a feature extractor that has two features for each letter of the alphabet: how many times the letter occurs, and whether it occurs at all. This winds up **overfitting** the training set, making performance actually worse.

But let's take a look at what mistakes our classifier made and see if this can help us improve the features. This means that we are going to train with the training set, test with the test set, look at the errors, and then retrain with the training set. But the changes we are making in this scenario are *explicitly* for improving the performance on this particular test set. So, we actually want to have two test sets: one for guiding our revisions to the features and retraining, and one for actually seeing how the result does. That is, we need a "dev-test" set. This is actually why I picked 85% before.
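To spell out the arithmetic (just a sketch, reusing the `names` corpus from above; the variable names are made up for this illustration): the 15% held out of training will itself be split into a first 5% slice and a next 10% slice.

```python
# Illustrative arithmetic only: how the shuffled list gets carved up.
total = len(names.words())
n_test = int(.05 * total)              # first 5% of the shuffled list: real test set
n_devtest = int(.15 * total) - n_test  # from the 5% mark to the 15% mark: dev-test set
n_train = total - n_test - n_devtest   # the remaining 85%: training set
print(n_train, n_devtest, n_test)
```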
Let's use 10% of the corpus as a "dev-test" set, 5% as the real test set, and the same 85% of the corpus as the training set.

```python
devtest_boundary = int(.05 * len(names.words()))
print(devtest_boundary)
```

Let's cut the labeled names up into training, test, and dev-test sets:

```python
name_letters = [(gender_features(n), f) for (n,f) in labeled_names]
train_set = name_letters[test_boundary:]
devtest_set = name_letters[devtest_boundary:test_boundary]
test_set = name_letters[:devtest_boundary]
```

Then re-create the classifier (we're going to revise the feature extractor later, but we're setting up a baseline here).

```python
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))
```

Now, what are the errors? These are the places where the guess and the answer do not match.

```python
errors = []
for (name, tag) in labeled_names[devtest_boundary:test_boundary]:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )
```

```python
for (tag, guess, name) in sorted(errors):
    print('Correct:{:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))
```

---

There are a lot of errors, and it's pretty clear we cannot actually achieve 100% unless we at least handle the possibility that a name can be in both lists and allow the classifier to guess that. But the book suggests that "yn" as a final two letters pretty commonly triggers an incorrect guess. So, we can try adding the last two letters as extracted features.

```python
def gender_features2(word):
    return {'suffix1': word[-1], 'suffix2': word[-2]}
```

Now, let's re-train, then re-test.

```python
name_letters = [(gender_features2(n), f) for (n,f) in labeled_names]
train_set = name_letters[test_boundary:]
devtest_set = name_letters[devtest_boundary:test_boundary]
test_set = name_letters[:devtest_boundary]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))
```

I got a mild improvement. Though, actually, *super* mild. The book had better success. This can be iterated.

---

# Document classification #

Turning now to the question of how you could sort documents into categories by looking at their characteristics. The example now will be guessing whether a review is positive or negative. We can do this because there is a movie reviews corpus that has been hand-classified into "positive" and "negative" reviews, so we look for properties of a review that could tip us off, like containing the word "waste" or "horrible".

```python
from nltk.corpus import movie_reviews
print(movie_reviews.categories())
print(movie_reviews.fileids()[:5])
```

Let's create a list of documents, each paired with its categorization, in a useful format. Either of these works; the second relies on the fact that the category is the first three characters of the file id (`pos/...` or `neg/...`).

```python
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
```

```python
documents = [(list(movie_reviews.words(fileid)), fileid[:3])
             for fileid in movie_reviews.fileids()]
```

---

Whichever way you extracted it, we want the positive and negative reviews to be scrambled up, so:

```python
random.shuffle(documents)
```

Now, we are guessing that a good route to classifying these would be on the basis of what words they contain. Is "platypus" likely to differentiate a good review from a bad one? Well, not if it isn't in any review at all. And if it's in just one bad review, does that mean it 100% predicts a bad review? Unlikely.
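To make that concrete, here is a quick check you could run on the `documents` list built above (just a sketch; `document_count` is a name made up for this illustration): count how many reviews a candidate word appears in at all.

```python
# Sketch only: in how many reviews does a given word occur at all?
def document_count(word, documents):
    return sum(1 for (words, _) in documents if word in set(words))

print(document_count('platypus', documents))
print(document_count('waste', documents))
```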
So, let's grab the words that occur a lot in the whole corpus, and then see how those distribute (since they are expected to appear in many individual reviews).

```python
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
popwords = [w for (w,_) in all_words.most_common(2000)]
```

and then define a feature extractor for documents based on these words.

---

```python
def document_features(document):
    document_words = set(document)   # set membership is much faster than list membership
    features = {}
    for word in popwords:
        features['contains({})'.format(word)] = (word in document_words)
    return features
```

Note: checking `(word in document)` directly, against the list, would also have worked, but it is over 15 times slower. That version took me a bit over a minute to run; the version above took about 4 seconds.

```python
features['contains({})'.format(word)] = (word in document)
```

Here's what we have, then, for one of the reviews:

```python
r0 = movie_reviews.fileids()[0]
print(document_features(movie_reviews.words(r0)))
```

And away we go.

```python
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)
```

---

# Behind the magic a little bit #

What the naive Bayes classifier is doing can be described fairly simply. Once it has been trained, it starts with odds for each possible outcome based on how often each occurred in the training data. So, if we're classifying texts as "sports," "mystery," or "automotive," and the training data was 50% automotive, 30% mystery, and 20% sports, it will start from those odds for a new text.

Then it looks at one feature, e.g. `contains(run)`, and at how often that feature shows up in each of the three genres. The starting probability that the text is sports gets multiplied by how likely a sports text is to contain "run", and so on for all the features, until we have a final probability distribution across the genres. Then it picks the most likely one.

The "naive" part is that it assumes that all features contribute equally and independently. We can do better, but this is basically the simplest approach, and a good first approximation.

---
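To make that multiplication concrete, here is a toy run of the arithmetic with a single feature. The genres come from the example above, but every number is invented for illustration; the real NLTK classifier does this over all of the `contains(...)` features and also smooths its probability estimates.

```python
# Toy sketch of the naive Bayes arithmetic; every number here is invented
# for illustration.
priors = {'automotive': 0.5, 'mystery': 0.3, 'sports': 0.2}

# Invented estimates of how often a text in each genre contains "run":
likelihood = {'automotive': 0.1, 'mystery': 0.2, 'sports': 0.6}

# Multiply the prior odds by the per-genre likelihood of the feature...
scores = {genre: priors[genre] * likelihood[genre] for genre in priors}

# ...and normalize to get a probability distribution across the genres.
total = sum(scores.values())
probs = {genre: score / total for genre, score in scores.items()}

print(probs)
print(max(probs, key=probs.get))   # the classifier's guess
```

With more features, the same multiplication just repeats, feature by feature, before the final normalization.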