name: title
layout: true
class: center, middle, inverse

---

# CHILDES notes #
## (Continuing from previous class) ##

---
layout: false

CHILDES comes from [http://childes.psy.cmu.edu](http://childes.psy.cmu.edu).

We need XML data to work with; let's get Brown.

It needs to go into "nltk_data/corpora/childes/data-xml/Eng-USA-MOR/Brown"

Unfortunately, I don't have an easier way to do this than just downloading it and putting it there.

Once we have the data, we can set up the `CHILDESCorpusReader` included in NLTK.

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader

corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
```

> Side note: In a lot of contexts, when looking for "all files that end in `.xml`," you would wind up needing to use `*.xml`. In those cases, `*` is a "wildcard" that means "anything can be here."

> Despite the similar appearance, the `*` above is not a wildcard: `.*.xml` is a "regular expression," and its meaning is "any character (`.`), any number of times (`*`), followed by `.xml`" -- same effect, but it's written in a different (but similar-looking) "language."

---

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader

corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
```

Here are the files that are included:

```python
>>> brown.fileids()
['Brown/Adam/adam01.xml',
 ...
 'Brown/Adam/adam55.xml',
 'Brown/Eve/eve01.xml',
 ...
 'Brown/Eve/eve20.xml',
 'Brown/Sarah/sarah001.xml',
 ...
 'Brown/Sarah/sarah139.xml']
```

There are three children; we want to concentrate on Eve. We will do that by getting just the subset of the `fileids()` that are for Eve.

So: how?

--

```python
eve = [f for f in brown.fileids() ...
]
```

--

```python
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
```

---

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader

corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
```

There are a number of things we can do with the corpus (which we've named `brown`).

```python
corpus_data = brown.corpus(eve)
corpus_parties = brown.participants(eve)
ages = brown.age(eve)
mlus = brown.MLU(eve)   # CHI appears to be assumed
```

If you want to know what something like `brown` "knows how to do," you can ask it for a "directory."

```python
dir(brown)
```

That contains a lot of internal stuff, though. And it's just returning a list, which we know because:

```python
>>> type(dir(brown))
list
```

---

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader

corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
```

There are a number of things we can do with the corpus (which we've named `brown`).

```python
corpus_data = brown.corpus(eve)
corpus_parties = brown.participants(eve)
ages = brown.age(eve)
mlus = brown.MLU(eve)   # CHI appears to be assumed
```

How can we get just the part of the list that doesn't start with `"_"`?

--

```python
[x for x in dir(brown) ...
]
```

--

```python
[x for x in dir(brown) if x[0] != '_']
```

---

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader

corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
[x for x in dir(brown) if x[0] != '_']
```

```python
['MLU', 'abspath', 'abspaths', 'age', 'childes_url_base', 'citation',
 'convert_age', 'corpus', 'encoding', 'ensure_loaded', 'fileids',
 'license', 'open', 'participants', 'raw', 'readme', 'root', 'sents',
 'tagged_sents', 'tagged_words', 'unicode_repr', 'webview_file',
 'words', 'xml']
```

---

We can compare the MLUs of children and parents.

```python
pmlus = brown.MLU(eve, speaker='MOT')
mlus
pmlus
```

We can look at pronouns, nouns, verbs.

```python
tws = brown.tagged_words(eve[0])
tws[:10]
```

So, how would we get just the pronouns?

```python
pros = [w for ...
        ]
```

--

```python
pros = [w for (w,p) in brown.tagged_words(eve[0]) ...
        ]
...?
```

--

```python
pros = [w for (w,p) in brown.tagged_words(eve[0]) if p[:3] == 'pro']
...?
```

--

```python
pros = [w for (w,p) in brown.tagged_words(eve[0]) if p and p[:3] == 'pro']
```

---

We can plot things, like the noun/verb ratio as it changes over time, compared to the adults'.

```python
from matplotlib import pyplot as plt

def nvratio(f, speaker=['CHI']):
    ws = brown.tagged_words(f, speaker=speaker)
    ns = [w for (w,p) in ws if p and p[0] == 'n']
    vs = [w for (w,p) in ws if p and p[0] == 'v']
    nns = len(ns)
    nvs = len(vs)
    ratio = nns/nvs
    return ratio

age_months = [brown.age(f, month=True)[0] for f in eve]
eve_rat = [nvratio(f) for f in eve]
mot_rat = [nvratio(f, speaker=['MOT']) for f in eve]

plt.plot(age_months, eve_rat, age_months, mot_rat)
plt.ylabel('noun-to-verb ratio')
plt.show()
```

---

If there's still time, we can start talking about stemming, tagging, and chunking, and prepare for regular expressions.

Some of this is coming from `pythonprogramming.net`.
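Since regular expressions are coming up, here is a quick self-contained preview using Python's built-in `re` module. It shows the "any character (`.`), any number of times (`*`)" reading from the side note earlier; the file list is made up for illustration.

```python
import re

# ".*\.xml" = any character (.), any number of times (*), then a literal ".xml".
# (Escaping the second dot makes it a literal period; left unescaped, as in
# 'Brown/.*.xml' earlier, it means "any character" there too -- which still
# matches the same filenames.)
pattern = re.compile(r'.*\.xml')

# A made-up file list, just for illustration:
files = ['Brown/Eve/eve01.xml', 'Brown/Eve/notes.txt', 'Brown/Adam/adam01.xml']
matches = [f for f in files if pattern.match(f)]
print(matches)
```

Only the two `.xml` paths survive the filter; `notes.txt` does not match.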
```python
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."

words = word_tokenize(new_text)
for w in words:
    print(ps.stem(w))
```

---

```python
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
```

---

```python
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            chunked.draw()
    except Exception as e:
        print(str(e))

process_content()
```
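
---

To see what that chunk grammar does without needing the State of the Union corpus (and without the `draw()` window), here is a minimal sketch that runs the same pattern over a small hand-tagged sentence. The sentence and its tags are made up for illustration.

```python
import nltk

# The same pattern as above: optional adverbs (RB), optional verbs (VB),
# one or more proper nouns (NNP), then an optional common noun (NN).
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunkParser = nltk.RegexpParser(chunkGram)

# A hand-tagged sentence, made up for illustration:
tagged = [("confidently", "RB"), ("spoke", "VBD"),
          ("George", "NNP"), ("Washington", "NNP"),
          ("to", "TO"), ("Congress", "NNP")]

chunked = chunkParser.parse(tagged)
print(chunked)
```

The parser groups `confidently/RB spoke/VBD George/NNP Washington/NNP` into one Chunk (note that `VBD` matches `<VB.?>`), leaves `to/TO` outside, and puts `Congress/NNP` into a second Chunk of its own.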