
Characterizing texts


FreqDist

NLTK provides a couple of statistical tools to count things. This can be useful in characterizing texts.

(Again, the point of all of this is that we're generally dealing with too much data to sensibly work with by hand.)

Suppose that we want to know what the most common words are.

>>> import nltk
>>> from nltk.book import *
>>> fdist4 = FreqDist(text4)
>>> print(fdist4)
<FreqDist with 9754 samples and 145735 outcomes>
>>> fdist4.most_common(40)

This does what we want (presuming now that we wanted to do this), but there are some things to say.


Capitalization

One thing to notice is that there are 1141 instances of we and 483 instances of We -- but it is almost certain that we don't care about that distinction. What we want is to be told that there are 1624 instances of we regardless of capitalization.
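
We can check this directly by indexing into the frequency distribution (a trick we will look at properly in a few slides; the counts are the ones just quoted):

>>> fdist4['we']
1141
>>> fdist4['We']
483
>>> fdist4['we'] + fdist4['We']
1624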

Stopwords

Also, the most common things are words like the and of and to, and it is almost guaranteed that these are not going to be interesting. These are grammatical words. It is possible that they could lead us to some kind of statistic that characterizes the text, but it's quite likely that they are not going to be interesting.


Collapsing distinctions

Recall that when we had a list like this, where repeats and order don't count, we could convert it to a set. That's one way of collapsing distinctions. Conceptually, it is the difference between counting the number of words in a (small) text and counting the number of distinct words in the text.

>>> list1 = ["The", "cook", "gave", "the", "panini", "to", "the", "butler"]
>>> print(len(list1))
8
>>> print(set(list1))
{'to', 'The', 'cook', 'butler', 'gave', 'the', 'panini'}
>>> print(len(set(list1)))
7

A problem here is that we are still distinguishing the and The. The computer doesn't know that they're the same; they seem different to it.
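
We can see the problem with the tools we already have: a set keeps both spellings, because as far as Python is concerned they are simply two different strings.

>>> print(set(["The", "the"]))
{'The', 'the'}
>>> print(len(set(["The", "the"])))
2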


A list of lowercase words

Goal: We want to convert a list of words (list1) into a list of words that contains only the lower-case versions of those words. We want to collapse the distinction between the and The before we start counting things.

That is, we want to go from:

["The", "cook", "gave", "the", "panini", "to", "the", "butler"]

to

["the", "cook", "gave", "the", "panini", "to", "the", "butler"]

The approach we will use (a lot) is to go through the original list, and make a list of lowercased versions of each word. First, the way we can lowercase a word:

>>> print("Hi".lower())
hi

List comprehensions

There is a pretty easy way to do this in Python, using list comprehensions. These are super useful, we'll use them a lot. The one that will get us what we are after is this:

llist1 = [w.lower() for w in list1]

This does exactly what we said we wanted to do, but it's written out in a kind of backwards way.

  • The end result is a list, it's enclosed in [ and ]
  • Each member of the result list is based on a member of list1 (called w)
  • The member of the result is derived by applying lower() to the current w.
  • So, it's a list of lowercased w for each w in list1

There's no special requirement that you refer to the element, incidentally.

>>> [1 for w in [5, 6, 7]]
[1, 1, 1]

So, now

>>> llist1 = [w.lower() for w in list1]
>>> print(llist1)
['the', 'cook', 'gave', 'the', 'panini', 'to', 'the', 'butler']
>>> print(set(llist1))
{'to', 'cook', 'butler', 'gave', 'the', 'panini'}
>>> print(len(set(llist1)))
6

While we're here, let's take a look at what FreqDist does. It's counting the number of times something occurs in the list. The frequency distribution. Terminology is non-transparent, but we have 6 distinct words ("samples") having analyzed 8 words ("outcomes").

>>> fd1 = FreqDist(llist1)
>>> print(fd1)
<FreqDist with 6 samples and 8 outcomes>
>>> fd1
FreqDist({'butler': 1, 'cook': 1, 'gave': 1, 'panini': 1, 'the': 3, 'to': 1})

There is a hint here in the representation that you can find out the count for a particular word like this:

>>> fd1['cook']
1
>>> fd1['the']
3

Ok, so now back to text4. What we want to do is collapse the distinction between upper and lower case words in text4 before we do a FreqDist calculation on it.

text4 can be treated as a list of words.

>>> len(text4)
145735
>>> text4[10:12]
['of', 'Representatives']

so...?

lower4 = [w.lower() for w in text4]
fdistl4 = FreqDist(lower4)
fdistl4.most_common(40)

et voilà. One problem taken care of.
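
As a sanity check, the we/We counts from a couple of slides back (1141 + 483) should now show up as a single combined count (assuming those were the only capitalizations in there):

>>> fdistl4['we']
1624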


Let's play around a bit more with list comprehensions.

>>> fdistl4['angry']
2

So there are 2 instances of angry in text4. What are the other things that there are 2 of?

What we want to do is make a list of all the things in fdistl4 for which the value you get from the distribution is 2.

>>> twos = [w for w in fdistl4 if fdistl4[w] == 2]

How many did we find?

>>> len(twos)
1407

What are the first 10?

>>> twos[0:10]
['injunction', 'core', 'inconvenient', 'willingly', ..., 'uncontrollable']

What words occur just once?

>>> ones = [w for w in fdistl4 if fdistl4[w] == 1]
>>> len(ones)
3712

There's actually another way to get those built in to FreqDist. These are "hapaxes" and we can get them like this as well:

>>> len(fdistl4.hapaxes())
3712
>>> fdistl4.hapaxes()[0:10]
['spasms', 'uncontrolled', ..., 'pitiful']
>>> ones[0:10]
['spasms', 'uncontrolled', ..., 'pitiful']

Conditionals

>>> ones = [w for w in fdistl4 if fdistl4[w] == 1]

The if in there is new, though it works basically like it sounds like it would. If the thing after if is True, then the element is included. If it is False, then the element is not included.

Why ==? That means "is equal?" -- it is different from =, which means "be equal!"
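
A tiny illustration of the difference, with a made-up variable x:

>>> x = 2          # assignment: make x be 2
>>> x == 2         # comparison: is x equal to 2?
True
>>> x == 3
False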

There are a couple of other things we can test for with words.

>>> print("Hello".startswith("H"))
True
>>> print("Hello".islower())
False
>>> print("Hello".isupper())
False
>>> print("Hello".istitle())
True
>>> print("Hello".endswith("o"))
True
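
One more test that can be handy (not used just yet, but relevant when punctuation becomes a nuisance later) is isalpha(), which asks whether a string consists entirely of letters:

>>> print("Hello".isalpha())
True
>>> print(",".isalpha())
False
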
>>> print(set([w for w in text4 if w.endswith("nment")]))
{'supergovernment', 'abandonment', 'discernment', 'assignment', 'government',
'Abandonment', 'concernment', 'arraignment', 'Government', 'environment',
'attainment'}

You can test for a word in a list or a letter in a string using in:

>>> print('i' in 'team')
False
>>> print('fun' in 'funeral')
True
>>> print('one' in ['one', 'two', 'three'])
True

You can also have multiple conditions, using and and or or not:

>>> print("Hello".istitle() and "Hello".endswith("o"))
True
>>> print([w for w in [1, 2, 3, 4] if w not in [2, 4]])
[1, 3]
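
For completeness, or works the same way; here is one more made-up test along the same lines:

>>> print("Hello".islower() or "Hello".istitle())
True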

Returning to the question of characterizing the text, there are still a lot of the and of tokens in there.

These are kind of "uninteresting." Why?


Let's try to get a version of text4 that doesn't have these uninteresting words in it.

How can we tell if a word is interesting? There's one easy approach we can try first.

Here are some uninteresting words: a, the, an, I, of, in, ., [

What do they have in common? Well, they're short. Suppose that there are not (m)any interesting words that are less than 4 letters long. How then might we limit our word list to the relatively long words?

>>> notshort4 = [w for w in lower4 if len(w) > 4]

Aside: what if we wanted to know how word length is distributed across the corpus? How many 4-letter words are there, 5-letter words, 6-letter words, etc.? We have what we need to do this. We want a list of word lengths, right? We want to find with what frequency each length occurs, right?

>>> wlengths = [len(w) for w in lower4]
>>> wlfd = FreqDist(wlengths)
>>> print(wlfd[4])
18158

Visualization

We can visualize the distribution of word lengths with the plot() function that FreqDist makes available:

>>> wlfd.plot()



You do want to be a bit cautious with this. wlfd has only 17 things on the x-axis. But fdistl4 has many more.

>>> len(wlfd)
17
>>> len(fdistl4)
9070

If we try to plot that, what is it going to plot? Is it going to be interesting? It is going to be processor intensive and take a while.
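
If you do want a look anyway, plot() will accept a number of samples, which keeps things manageable (the 50 here is an arbitrary choice):

>>> fdistl4.plot(50)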

[plot of fdistl4]


A couple more things we can do with FreqDist.

>>> fdistl4['core']
2
>>> fdistl4.N()
145735
>>> len(fdistl4)
9070
>>> fdistl4.freq('core')
1.3723539300785672e-05
>>> print(2/145735)
1.3723539300785672e-05
>>> fdistl4.max()
'the'
>>> fdistl4.most_common(2)
[('the', 9906), ('of', 6986)]
>>> wlfd.tabulate()
3 2 4 1 5 6 7 8 9 10 11 12 13 14 15 16 17
28426 27111 18158 16269 12885 10604 9827 7168 5591 4690 2442 1411 615 399 79 50 10

Like with graphs, be cautious with tabulate(), since if the x-axis is huge, it is going to be useless.

>>> wlfd.tabulate(4)
3 2 4 1
28426 27111 18158 16269

The gory details are here: FreqDist docs


Ok, where were we? We were going to try to characterize the text with something that eliminated the uninteresting words. We had gotten to this point.

>>> notshort4 = [w for w in lower4 if len(w) > 4]

We can now create a FreqDist of these.

>>> nsfd = FreqDist(notshort4)
>>> print(nsfd)
<FreqDist with 8185 samples and 55771 outcomes>
>>> nsfd.most_common(20)
[('which', 1002),
('their', 738),
('government', 593),
...
('power', 230),
('public', 225)]

This is a bit more like it. But it is still really an approximation, and we'd like it to be less of one. All we did was remove the "short" words, and we don't know for sure that "short" words are uninteresting, grammatical words.

Enter the "stopwords"


Stopwords

>>> from nltk.corpus import stopwords
>>> print(nltk.corpus.stopwords.fileids())
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian',
'italian', 'kazakh', 'norwegian', 'portuguese', 'russian', 'spanish',
'swedish', 'turkish']
>>> print(len(nltk.corpus.stopwords.words("english")))
153
>>> print(nltk.corpus.stopwords.words("english"))
['i', 'me', 'my', ..., 'wouldn']

These are the words deemed actually uninteresting, not just short.

For example, ax seems interesting. But short. And it is in our text.

>>> print('ax' in lower4)
True

So what we want is a list of words that are in lower4 but not in our English stopwords.


Removing the stopwords

Let's define a shortcut first, we'll name the English stopwords.

>>> sweng = nltk.corpus.stopwords.words("english")
>>> print(len(sweng))
153
>>> print('the' in sweng)
True
>>> print('ax' in sweng)
False

Let's start again with text4 just so we don't lose track of what we're doing. We want to find all of the words in text4 that are not in the stopwords; since the stopwords are lowercase, and we want to collapse case distinctions anyway, we lowercase first.

>>> lower4 = [w.lower() for w in text4]
>>> print(len(lower4))
145735
>>> nonsw = [w for w in lower4 if w not in sweng]
>>> print(len(nonsw))
76199

Now we can do the FreqDist on just the interesting words.

>>> nonswfd = FreqDist(nonsw)
>>> print(nonswfd)
<FreqDist with 8934 samples and 76199 outcomes>
>>> nonswfd.most_common(20)
[(',', 6840),
('.', 4676),
('government', 593),
('people', 563),
(';', 544),
...
('one', 243)]

This is better. But that punctuation does not seem like something we want in there. It wasn't included in the stopwords.

>>> print('.' in sweng)
False

Let's try one more time. We could explicitly test for punctuation, or we could test for words of length 1. There probably are no interesting words that are 1 character long, so length is probably safe.

>>> nonsw1 = [w for w in lower4 if len(w) > 1 and w not in sweng]
>>> print(len(nonsw1))
63177

Or we could add punctuation to our filter. If we can guess it all.

>>> swengpunc = sweng + ['.', ',', '-', '?', '!', ';', '--']
>>> print(len(sweng))
153
>>> print(len(swengpunc))
160
>>> nonswpunc = [w for w in lower4 if len(w) > 1 and w not in swengpunc]
>>> print(len(nonswpunc))
62814

The one where we filtered out 1-character words predictably left -- in there.

>>> nonsw1fd = FreqDist(nonsw1)
>>> print(nonsw1fd)
<FreqDist with 8907 samples and 63177 outcomes>
>>> nonsw1fd.most_common(20)
[('government', 593),
('people', 563),
('us', 455),
('upon', 369),
('--', 363),
...
('power', 230),
('public', 225)]

The one where we included -- as (essentially) a stopword filtered it out.

>>> nonswpuncfd = FreqDist(nonswpunc)
>>> print(nonswpuncfd)
<FreqDist with 8906 samples and 62814 outcomes>
>>> nonswpuncfd.most_common(20)
[('government', 593),
('people', 563),
('us', 455),
('upon', 369),
('must', 346),
...
('power', 230),
('public', 225),
('would', 209)]

Code blocks, scripts

First: instead of typing everything for immediate effect, it is possible to put the commands you want to use in a "script" file. This is one of the windows in the Anaconda interface. This is super useful: you can remember what you've done, replicate it, fix the middle of a complex sequence, and share it with others.

Second: it enables (more easily at least) more complex algorithms.
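
For instance, the pipeline we have been building interactively could live in one small script; this is just a sketch (the file name and what gets printed are arbitrary choices):

# characterize.py -- rerun the steps from these slides in one go
import nltk
from nltk.book import text4
from nltk import FreqDist

lower4 = [w.lower() for w in text4]                 # collapse case
sweng = nltk.corpus.stopwords.words("english")      # English stopwords, as above
nonsw1 = [w for w in lower4 if len(w) > 1 and w not in sweng]
print(FreqDist(nonsw1).most_common(20))             # the "interesting" words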

We've done some iteration, like in the list comprehensions:

>>> wordlist = [w.lower() for w in text1]

This iterates through the text1 list and executes w.lower() on each element w of the list.


You can also iterate like this:

for w in text1[0:10]:
    if len(w) > 4:
        print(w)
    else:
        print('Blah')
print('All done.')

Important things to note:

  • The for statement ends in a : that defines a block.
  • Indented lines define the extent of the block; everything in the indented part gets repeated.
  • Blocks are relevant for for loops like this, as well as for if conditionals.
  • Conditional if can generally be paired with else (where one of the two is guaranteed to execute).
  • There are more complexities as well, other ways to loop and test, but these are the basic ones.
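
For comparison, a loop like this can do the same work as the list comprehensions from earlier; this sketch builds the lowercased word list the explicit way and should end up with the same result as [w.lower() for w in text1]:

wordlist = []                     # start from an empty list
for w in text1:
    wordlist.append(w.lower())    # add the lowercased version of each word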


Functions

As our problems get more involved, we will generally want to break larger problems up into smaller ones. Also, it is good to be able to generalize solutions to use in a wider range of problems. Functions provide a way to do this.

For example: Suppose what we want to determine is how "diverse" a text is. That is, how many unique words there are as a proportion of the total number of words.

>>> print(len(text1))
260819
>>> print(len(set(text1)))
19317
>>> print(19317/260819)
0.07406285585022564

We can do this over and over for each text we want to do this with, but if we could create something like a Python "command" to automatically do this for a text, even better. This is what a function is (and actually, all of the NLTK stuff is basically made up of functions that the NLTK developers defined for us ahead of time).

def lexical_diversity(text):
    return len(set(text)) / len(text)

With that, we can do this:

>>> print(lexical_diversity(text1))
0.07406285585022564

Although you can define functions just fine at the command line, it's better as part of a script. Easier to modify, etc.


This is so far not great, because we didn't collapse the case distinction. We can amend our function to fix that.

def lexical_diversity(text):
    lowertext = [w.lower() for w in text]
    return len(set(lowertext)) / len(lowertext)

Now we get a slightly different answer. One hopes, a more conceptually correct one.

>>> print(lexical_diversity(text1))
0.06606497226045649

At this point, if we trust our work, we can just file that away. That's our lexical diversity function, we can mostly forget about how it works and just use it in the same way we use len or set or whatever (so long as we have the code to define it in our script).
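
For example, it works on any list of words, not just the NLTK texts. With the little list from near the start (6 distinct lowercased words out of 8), the amended function should give:

>>> print(lexical_diversity(list1))
0.75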

When we use those import lines, it is defining a bunch of functions (essentially) that are stored in files that NLTK (or another Python library) makes available.

There's plenty more to say about functions, but we have to start somewhere.
