Class 4c

name: title
layout: true
class: center, middle, inverse
---
# CHILDES notes #

---
layout: false

CHILDES comes from [http://childes.psy.cmu.edu](http://childes.psy.cmu.edu).

We need XML data to work with, so let's get Brown.

The data needs to go into `nltk_data/corpora/childes/data-xml/Eng-USA-MOR/Brown`.

When we have the data, we can set up the `CHILDESCorpusReader` included in NLTK.

```python
import nltk  # needed for nltk.data.find below
from nltk.corpus.reader import CHILDESCorpusReader
corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
```

Side note: In a lot of contexts, when looking for "all files that end in `.xml`" you would wind up using `*.xml`. In those cases, `*` is a "wildcard" that means "anything can be here."

Despite the similar appearance, the `*` above is not a wildcard. `.*.xml` is a "regular expression" that means "any character (`.`), any number of times (`*`), followed by `.xml`." The effect is the same, but it is written in a different (though similar-looking) "language." Strictly speaking, the last `.` also matches any character; writing `\.xml` would pin it to a literal period.
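
To see the two "languages" side by side, here is a small sketch using only the Python standard library. The file names are made up for illustration, and `re.fullmatch` is used here just to test whole names; it is not meant to show how NLTK applies the pattern internally.

```python
import fnmatch
import re

# Hypothetical names, shaped like corpus file ids.
names = ['Brown/Eve/eve01.xml', 'Brown/Adam/adam01.xml',
         'Brown/notes.txt', 'Brown/Eve/eve01xxml']

# Shell-style wildcard: * means "anything can be here".
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, 'Brown/*.xml')]

# Regular expression: . is any character, * repeats it.
regex_hits = [n for n in names if re.fullmatch(r'Brown/.*.xml', n)]

# Escaping the last dot makes it match only a literal period.
strict_hits = [n for n in names if re.fullmatch(r'Brown/.*\.xml', n)]
```

The unescaped pattern also accepts `eve01xxml`, because its last `.` is happy to match the `x`. With real corpus file names this rarely matters, which is why the looser pattern works fine in practice.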

---
Here are the files that are included:

```python
brown.fileids()
```

There are three children. We want to concentrate on Eve.

```python
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
```
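
The slice `f[6:9]` works because `'Brown/'` is six characters long, so the child's folder name starts at position 6. A minimal sketch of the same idea on made-up file ids (the exact names in your copy of the corpus may differ), alongside a variant that splits on `/` instead of counting characters:

```python
# Hypothetical file ids, in the shape 'Brown/<Child>/<file>.xml'.
fileids = ['Brown/Adam/adam01.xml', 'Brown/Eve/eve01.xml',
           'Brown/Eve/eve02.xml', 'Brown/Sarah/sarah001.xml']

# Characters 6, 7 and 8 come right after 'Brown/'.
by_slice = [f for f in fileids if f[6:9] == 'Eve']

# Splitting on '/' picks out the child folder by position instead.
by_folder = [f for f in fileids if f.split('/')[1] == 'Eve']
```

Both give the same list here. The split version is a bit more forgiving if the child's name is not three letters long.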

There are a number of things we can do with the corpus (that we've named `brown`).

```python
corpus_data = brown.corpus(eve)
corpus_parties = brown.participants(eve)
ages = brown.age(eve)
mlus = brown.MLU(eve) # CHI appears to be assumed
dir(brown)
```
---
We can compare the MLUs of the child and the parent (here, the mother, `MOT`).

```python
pmlus = brown.MLU(eve, speaker='MOT')
mlus
pmlus
```
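
For intuition about what is being compared, here is a rough sketch of a mean-length-of-utterance calculation. It counts words per utterance over made-up utterances. MLU is normally computed over morphemes rather than words, so treat this as an approximation and not as what `brown.MLU` does; `mean_length` and the sample data are hypothetical.

```python
def mean_length(utterances):
    """Average number of words per utterance (0.0 if there are none)."""
    if not utterances:
        return 0.0
    return sum(len(u) for u in utterances) / len(utterances)

# Hypothetical utterances, each a list of words.
child = [['more', 'cookie'], ['that'], ['Eve', 'want', 'juice']]
mother = [['do', 'you', 'want', 'more', 'juice'], ['here', 'it', 'is']]

child_mlu = mean_length(child)    # (2 + 1 + 3) / 3
mother_mlu = mean_length(mother)  # (5 + 3) / 2
```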

We can look at pronouns, nouns, verbs.

```python
# tagged_words gives (word, tag) pairs.
tws = brown.tagged_words(eve[0])
tws[:10]
# Keep words whose tag starts with 'pro'; some tags are empty, hence `p and`.
pros = [w for (w, p) in tws if p and p[:3] == 'pro']
```
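
Once the pronouns are in a list, counting them is a natural next step. A minimal sketch with `collections.Counter`, run on made-up (word, tag) pairs standing in for the output of `brown.tagged_words()`; the tags shown are only illustrative.

```python
from collections import Counter

# Hypothetical (word, tag) pairs; one has no tag at all.
tagged = [('you', 'pro'), ('want', 'v'), ('that', 'pro:dem'),
          ('cookie', 'n'), ('xxx', None), ('you', 'pro')]

# Same filter as above: skip empty tags, keep tags starting with 'pro'.
pros = [w for (w, p) in tagged if p and p[:3] == 'pro']

# Tally how often each pronoun occurs.
pro_counts = Counter(pros)
```

`pro_counts.most_common()` then lists the pronouns from most to least frequent.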

---
We can plot things, like the child's noun/verb ratio as it changes over time, compared to the adult's.

```python
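from matplotlib import pyplot as plt

def nvratio(f, speaker=None):
    # Default to the child's speech; avoids a mutable default argument.
    if speaker is None:
        speaker = ['CHI']
    ws = brown.tagged_words(f, speaker=speaker)
    # Tags can be empty, so check p before looking at its first letter.
    nns = len([w for (w, p) in ws if p and p[0] == 'n'])
    nvs = len([w for (w, p) in ws if p and p[0] == 'v'])
    return nns / nvs

# brown.age() returns a list, so take the single value for each file.
age_months = [brown.age(f, month=True)[0] for f in eve]
eve_rat = [nvratio(f) for f in eve]
mot_rat = [nvratio(f, speaker=['MOT']) for f in eve]

plt.plot(age_months, eve_rat, label='CHI')
plt.plot(age_months, mot_rat, label='MOT')
plt.xlabel('age (months)')
plt.ylabel('noun-to-verb ratio')
plt.legend()
plt.show()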
```
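
One thing the ratio above does not handle is a file with no verbs at all, where the division fails. Here is a minimal sketch of a guarded version, run on made-up (word, tag) pairs rather than the real corpus; `safe_nvratio` and the sample data are hypothetical.

```python
def safe_nvratio(tagged_words):
    """Nouns per verb, or None when there are no verbs to divide by."""
    nns = sum(1 for (w, p) in tagged_words if p and p[0] == 'n')
    nvs = sum(1 for (w, p) in tagged_words if p and p[0] == 'v')
    return nns / nvs if nvs else None

# Hypothetical tagged words; real ones come from brown.tagged_words().
sample = [('Eve', 'n:prop'), ('want', 'v'), ('cookie', 'n'),
          ('more', 'adv'), ('xxx', None)]
no_verbs = [('cookie', 'n'), ('juice', 'n')]

ratio = safe_nvratio(sample)     # two nouns, one verb
empty = safe_nvratio(no_verbs)   # nothing to divide by
```

Note that the first-letter test counts any tag beginning with `n` as a noun, so it is worth checking which tags actually occur before trusting the numbers.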

---