name: title
layout: true
class: center, middle, inverse

---

# Classification and testing #

---

layout: false

# Classifying texts #

For example:

- is this email spam?
- is this news article about sports, technology, politics?
- who wrote this text?

# Ex 1 - guessing gender from names #

```python
>>> import nltk
>>> from nltk.corpus import names
>>> print(names.fileids())
['female.txt', 'male.txt']
```

We have the answers here: `female.txt` has names coded as female, `male.txt` has names coded as male. So, the task is to figure out what kinds of things differentiate them, such that the computer can guess which list an unknown name would be in.

---

So let's take a look at what we have here.

```python
>>> print(names.words(fileids='female.txt')[:10])
>>> print(names.words(fileids='male.txt')[:10])
```

Not clear. First of all, there will *absolutely* be errors when guessing on "Abbey", "Abbie", "Abby". Second, is there anything that stands out? Not really.

But, let's take a guess. Suppose we have a feeling that a name that ends in "a" is usually going to come from the female list. Or, maybe more specifically, that the last letter might be a clue. (Note that this is so far a claim about the spelling: "Mika" ends in "a", "Micah" does not.)

Is this reasonable? Well, we can plot the last letters by gender and see.

```python
last_letters = [(fileid, name[-1])
                for fileid in names.fileids()
                for name in names.words(fileid)]
print(last_letters[:5])

cfd = nltk.ConditionalFreqDist(last_letters)
cfd.plot()
```

---

So, ending in "s" is a pretty good indicator that a name is in the male list, and ending in "a" or "i" is a pretty good indicator that a name is in the female list.

The hypothesis is that the last letter is information that allows prediction. So, to predict the file a name will be in, we need to know the last letter (and nothing else, for this hypothesis). Also, to train the classifier, we need to know the last letter. We'll define something that will extract the relevant features (perhaps to revise later).

```python
def gender_features(word):
    return {'last_letter': word[-1]}

print(gender_features("Gabriel"))
```

We are going to use a bit of quasi-magic called a "naive Bayes classifier" to do our predictions. The process is:

- create the classifier
- train the classifier by giving it a bunch of inputs and known-correct outputs
- it adjusts itself so that it would get the right answer for those inputs
- the classifier can then be presented with new inputs and guess the correct output

---

We are not going to focus on actual methods of implementing machine learning here. But the basic idea is that the classifier is trying to estimate, based on the training data, how likely a given input is to be associated with a particular output. So, here, if the input is `last_letter: a` then the name is likely to be in `female.txt`. It determines that by seeing how often those co-occur. So, the more data it gets trained on, the more accurately it reflects the statistical shape of the mapping.

This depends a lot on what we choose as features to train on. If we train the classifier with word length instead of last letter, we expect it will do much worse.

```python
def unpromising_features(word):
    return {'length': len(word)}

print(unpromising_features("Gabriel"))
```

How can we quantify this? How do we know if our classifier is doing a good job? We can't check against genuinely unknown outputs, but we also can't check against the outputs we trained on, because it should get those uncharacteristically well. So: a *training set* and a *test set*.
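Concretely, "doing a good job" will be measured as accuracy: the fraction of held-out items the classifier labels correctly. A minimal sketch of that idea (below we will just call `nltk.classify.accuracy`, which computes essentially this; `manual_accuracy` is only an illustrative name):

```python
# Sketch only: accuracy is the fraction of labeled items the classifier
# gets right. (We will use nltk.classify.accuracy for this below.)
def manual_accuracy(classifier, labeled_features):
    correct = sum(1 for (feats, label) in labeled_features
                  if classifier.classify(feats) == label)
    return correct / len(labeled_features)
```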
---

What we have: two piles of names, one set in `male.txt` and one set in `female.txt`. We want to train the classifier on a bunch of these. We have `len(names.words())` of them. So, let's plan to train on 85% of those, and then see how it fares on the last 15%. We'll set `test_boundary` to be the point where the data switches from test to train.

```python
test_boundary = int(.15 * len(names.words()))
print(test_boundary)
```

We still have to convert the data we have in these two files into something useful. We need a **list of pairs** with the input as the first element and the correct output as the second element. And we need to shuffle them up.

```python
labeled_names = ([(name, 'M') for name in names.words('male.txt')] +
                 [(name, 'F') for name in names.words('female.txt')])
print(labeled_names[:5])
```

```python
import random
random.shuffle(labeled_names)
print(labeled_names[:5])
```

---

This is not really what we need for training and testing, though: we are trying to find out how looking at the last letter compares to looking at the length. So, we need to create the actual data sets by applying each feature extractor to each name.

```python
name_letters = [(gender_features(n), f) for (n,f) in labeled_names]
name_lengths = [(unpromising_features(n), f) for (n,f) in labeled_names]
```

And now we can split these up into a training set and a testing set.

```python
lettrain, lettest = name_letters[test_boundary:], name_letters[:test_boundary]
lentrain, lentest = name_lengths[test_boundary:], name_lengths[:test_boundary]
```

---

And we are finally ready to train and test. Let's train the two classifiers.

```python
let_classifier = nltk.NaiveBayesClassifier.train(lettrain)
len_classifier = nltk.NaiveBayesClassifier.train(lentrain)
```

How does each do on "Neo" and "Trinity"?

```python
print(let_classifier.classify(gender_features('Neo')))
print(let_classifier.classify(gender_features('Trinity')))
print(len_classifier.classify(unpromising_features('Neo')))
print(len_classifier.classify(unpromising_features('Trinity')))
```

How does each do on the entire test set?

```python
print(nltk.classify.accuracy(let_classifier, lettest))
print(nltk.classify.accuracy(len_classifier, lentest))
```

Remember that anything below 50% would mean it is getting things systematically *wrong* (worse than guessing at chance). So, last letter is better than length (as expected), but it is still not great.

```python
let_classifier.show_most_informative_features(5)
len_classifier.show_most_informative_features(5)
```

---

Let's see if we can improve this. How? One thing we could do is add a zillion different features. Check the book for an example of this, where it proposes a feature extractor that has two features for each letter of the alphabet: how many times the letter occurs, and whether it occurs at all. This winds up **overfitting** the training set, making performance actually worse.

But let's take a look at what mistakes our classifier made and see if this can help us improve the features. This means that we are going to train with the training set, test with the test set, look at the errors, and then retrain with the training set. But the changes we are making in this scenario are *explicitly* for improving the performance on this particular test set. So, we actually want to have two test sets: one for guiding our revisions to the features and retraining, and one for actually seeing how the result does. That is, we need a "dev-test" set. This is actually why I picked 85% before.
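To spell out the arithmetic (just a sketch, reusing the `names` corpus from above; the variable names are made up for this illustration): the 15% held out of training will itself be split into a first 5% slice and a next 10% slice.

```python
# Illustrative arithmetic only: how the shuffled list gets carved up.
total = len(names.words())
n_test = int(.05 * total)              # first 5% of the shuffled list: real test set
n_devtest = int(.15 * total) - n_test  # from the 5% mark to the 15% mark: dev-test set
n_train = total - n_test - n_devtest   # the remaining 85%: training set
print(n_train, n_devtest, n_test)
```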
Let's use 10% of the corpus as a "dev-test" set, 5% as the real test set, and the same 85% of the corpus as the training set.

```python
devtest_boundary = int(.05 * len(names.words()))
print(devtest_boundary)
```

Let's cut the labeled names up into training, test, and dev-test sets:

```python
name_letters = [(gender_features(n), f) for (n,f) in labeled_names]
train_set = name_letters[test_boundary:]
devtest_set = name_letters[devtest_boundary:test_boundary]
test_set = name_letters[:devtest_boundary]
```

Then re-create the classifier (we're going to revise the feature extractor later, but we're setting up a baseline here).

```python
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))
```

Now, what are the errors? These are the places where the guess and the answer do not match.

```python
errors = []
for (name, tag) in labeled_names[devtest_boundary:test_boundary]:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )
```

```python
for (tag, guess, name) in sorted(errors):
    print('Correct:{:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))
```

---

There are a lot of errors, and it's pretty clear we cannot actually achieve 100% unless we at least handle the possibility that a name can be in both lists and allow the classifier to guess that. But the book suggests that "yn" as a final two letters pretty commonly triggers an incorrect guess. So, we can try adding the last two letters as extracted features.

```python
def gender_features2(word):
    return {'suffix1': word[-1], 'suffix2': word[-2]}
```

Now, let's re-train, then re-test.

```python
name_letters = [(gender_features2(n), f) for (n,f) in labeled_names]
train_set = name_letters[test_boundary:]
devtest_set = name_letters[devtest_boundary:test_boundary]
test_set = name_letters[:devtest_boundary]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))
```

I got a mild improvement. Though, actually, *super* mild. The book had better success. This can be iterated.

---

# Document classification #

Turning now to the question of how you could sort documents into categories by looking at their characteristics. The example now will be guessing whether a review is positive or negative. We can do this because there is a movie reviews corpus that has been hand-classified into "positive" and "negative" reviews, so we look for properties of a review that could tip us off, like containing the word "waste" or "horrible".

```python
from nltk.corpus import movie_reviews
print(movie_reviews.categories())
print(movie_reviews.fileids()[:5])
```

Let's create a list of documents, each paired with its categorization, in a useful format. Either of these works; the second relies on the fact that the category is the first three characters of the file id (`pos/...` or `neg/...`).

```python
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
```

```python
documents = [(list(movie_reviews.words(fileid)), fileid[:3])
             for fileid in movie_reviews.fileids()]
```

---

Whichever way you extracted it, we want the positive and negative reviews to be scrambled up, so:

```python
random.shuffle(documents)
```

Now, we are guessing that a good route to classifying these would be on the basis of what words they contain. Is "platypus" likely to differentiate a good review from a bad one? Well, not if it isn't in any review at all. And if it's in just one bad review, does that mean it 100% predicts a bad review? Unlikely.
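To make that concrete, here is a quick check you could run on the `documents` list built above (just a sketch; `document_count` is a name made up for this illustration): count how many reviews a candidate word appears in at all.

```python
# Sketch only: in how many reviews does a given word occur at all?
def document_count(word, documents):
    return sum(1 for (words, _) in documents if word in set(words))

print(document_count('platypus', documents))
print(document_count('waste', documents))
```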
So, let's grab the words that occur a lot in the whole corpus, and then see how those distribute (since they are expected to appear in many individual reviews).

```python
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
popwords = [w for (w,_) in all_words.most_common(2000)]
```

and then define a feature extractor for documents based on these words.

---

```python
def document_features(document):
    document_words = set(document)   # set membership is much faster than list membership
    features = {}
    for word in popwords:
        features['contains({})'.format(word)] = (word in document_words)
    return features
```

Note: checking `(word in document)` directly, against the list, would also have worked, but it is over 15 times slower. That version took me a bit over a minute to run; the version above took about 4 seconds.

```python
features['contains({})'.format(word)] = (word in document)
```

Here's what we have, then, for one of the reviews:

```python
r0 = movie_reviews.fileids()[0]
print(document_features(movie_reviews.words(r0)))
```

And away we go.

```python
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)
```

---

# Behind the magic a little bit #

What the naive Bayes classifier is doing can be described fairly simply. Once it has been trained, it starts with odds for each possible outcome based on how often each occurred in the training data. So, if we're classifying texts as "sports," "mystery," or "automotive," and the training data was 50% automotive, 30% mystery, and 20% sports, it will start from those odds for a new text.

Then it looks at one feature, e.g. `contains(run)`, and at how often that feature shows up in each of the three genres. The starting probability that the text is sports gets multiplied by how likely a sports text is to contain "run", and so on for all the features, until we have a final probability distribution across the genres. Then it picks the most likely one.

The "naive" part is that it assumes that all features contribute equally and independently. We can do better, but this is basically the simplest approach, and a good first approximation.

---
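To make that multiplication concrete, here is a toy run of the arithmetic with a single feature. The genres come from the example above, but every number is invented for illustration; the real NLTK classifier does this over all of the `contains(...)` features and also smooths its probability estimates.

```python
# Toy sketch of the naive Bayes arithmetic; every number here is invented
# for illustration.
priors = {'automotive': 0.5, 'mystery': 0.3, 'sports': 0.2}

# Invented estimates of how often a text in each genre contains "run":
likelihood = {'automotive': 0.1, 'mystery': 0.2, 'sports': 0.6}

# Multiply the prior odds by the per-genre likelihood of the feature...
scores = {genre: priors[genre] * likelihood[genre] for genre in priors}

# ...and normalize to get a probability distribution across the genres.
total = sum(scores.values())
probs = {genre: score / total for genre, score in scores.items()}

print(probs)
print(max(probs, key=probs.get))   # the classifier's guess
```

With more features, the same multiplication just repeats, feature by feature, before the final normalization.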