Class 5a

name: title
layout: true
class: center, middle, inverse
---
# CHILDES #

---
layout: false

CHILDES comes from [http://childes.talkbank.org](http://childes.talkbank.org).

We need XML data to work with, so let's get the Brown corpus.

The files need to go into `nltk_data/corpora/childes/data-xml/Eng-USA-MOR/Brown`.

When we have the data, we can set up the `CHILDESCorpusReader` included in NLTK.

**NOTE.** *The IPython console currently crashes when using paths; I need to find
a workaround.*

```python
from nltk.corpus.reader import CHILDESCorpusReader
# corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
# brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
brown = CHILDESCorpusReader('nltk_data/corpora/childes/data-xml/Eng-USA-MOR/', 'Brown/.*.xml')
```

> Side note: In many contexts, when looking for "all files that end in `.xml`", you would wind up using `*.xml`.  In those cases, `*` is a "wildcard" that means "anything can be here."

> Despite the similar appearance, the `*` above is not a wildcard. `.*.xml` is a "regular expression," and it means "any character (`.`), any number of times (`*`), followed by any one character (`.`) and then `xml`." The effect here is the same, but it is written in a different (if similar-looking) "language."
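
To make the difference concrete, here is a small sketch that tries both "languages" on a few made-up file names (the names are hypothetical, not files from the corpus). It uses Python's `fnmatch` for the wildcard and `re` for the regular expression, and it assumes the pattern has to match the whole name.

```python
import fnmatch
import re

# Hypothetical names, just for illustration
names = ['Brown/Eve/eve01.xml', 'Brown/Eve/notes.txt', 'Brown/Eve/eve01_xml']

# Wildcard: * means "anything can be here"
wild = [n for n in names if fnmatch.fnmatch(n, '*.xml')]

# Regular expression: an unescaped . matches any single character
loose = [n for n in names if re.fullmatch(r'Brown/.*.xml', n)]

# Escaping the dot (\.) makes it match only a literal period
strict = [n for n in names if re.fullmatch(r'Brown/.*\.xml', n)]
```

Here `loose` also picks up `eve01_xml`, while `wild` and `strict` keep only `eve01.xml`. For the Brown files this should make no difference, since they all end in a literal `.xml`.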

---
```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader
corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
```

Here are the files that are included:

```python
>>> brown.fileids()
['Brown/Adam/adam01.xml',
    ...
 'Brown/Adam/adam55.xml',
 'Brown/Eve/eve01.xml',
    ...
 'Brown/Eve/eve20.xml',
 'Brown/Sarah/sarah001.xml',
    ...
 'Brown/Sarah/sarah139.xml']
```

There are three children; we want to concentrate on Eve.
We will do that by getting just the subset of the `fileids()` that belong to Eve.
So: how?

--
```python
eve = [f for f in brown.fileids() ... ]
```
--

```python
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
```
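
The slice `f[6:9]` works because `'Brown/'` is exactly six characters long. As a sketch of two alternatives that do not depend on counting characters, here is the same filter on a short, hypothetical list of file ids (standing in for `brown.fileids()`):

```python
# Hypothetical stand-in for brown.fileids()
fileids = ['Brown/Adam/adam01.xml',
           'Brown/Eve/eve01.xml',
           'Brown/Eve/eve02.xml',
           'Brown/Sarah/sarah001.xml']

# The version from the slide: characters 6, 7, 8 must spell 'Eve'
by_slice = [f for f in fileids if f[6:9] == 'Eve']

# Same idea, spelled out as a prefix
by_prefix = [f for f in fileids if f.startswith('Brown/Eve/')]

# Or split the path on '/' and look at the folder name
by_folder = [f for f in fileids if f.split('/')[1] == 'Eve']
```

All three lists come out the same here; which one to use is mostly a matter of taste.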

---

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader
corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
```

There are a number of things we can do with the corpus (that we've named `brown`).

```python
corpus_data = brown.corpus(eve)
corpus_parties = brown.participants(eve)
ages = brown.age(eve)
mlus = brown.MLU(eve) # CHI appears to be assumed
```

If you want to know what something like `brown` "knows how to do," you can ask it for a "directory."

```python
dir(brown)
```

That contains a lot of internal stuff, though.  And it's just returning a list, which we know because:

```python
>>> type(dir(brown))
list
```

---

There are a number of things we can do with the corpus (that we've named `brown`).

```python
corpus_data = brown.corpus(eve)
corpus_parties = brown.participants(eve)
ages = brown.age(eve)
mlus = brown.MLU(eve) # CHI appears to be assumed
```

How can we get just that part of the list that doesn't start with `"_"`?

--
```python
[x for x in dir(brown) ... ]
```
--

```python
[x for x in dir(brown) if x[0] != '_']
```
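
The same filter can be tried without the corpus. The sketch below uses a small made-up class (`Toy` is hypothetical, not part of NLTK) and `startswith`, which also avoids indexing into the name:

```python
class Toy:
    """A made-up class with one 'internal' attribute and two methods."""
    def __init__(self):
        self._cache = {}

    def words(self):
        return []

    def sents(self):
        return []

# dir() returns a sorted list of names, including the internal ones
everything = dir(Toy())

# Keep only the names that do not start with an underscore
public = [x for x in everything if not x.startswith('_')]
```

For `Toy`, `public` ends up holding just `sents` and `words`.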

---

```python
['MLU',
 'abspath',
 'abspaths',
 'age',
 'childes_url_base',
 'citation',
 'convert_age',
 'corpus',
 'encoding',
 'ensure_loaded',
 'fileids',
 'license',
 'open',
 'participants',
 'raw',
 'readme',
 'root',
 'sents',
 'tagged_sents',
 'tagged_words',
 'unicode_repr',
 'webview_file',
 'words',
 'xml']
```
---
We can compare the MLUs of the child and the adult (here, the mother).

```python
pmlus = brown.MLU(eve, speaker='MOT')
mlus
pmlus
```
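
MLU stands for "mean length of utterance." As a rough sketch of the idea only, the toy version below averages the number of words per utterance over made-up utterances. This is a simplification and an assumption on my part: it is not how `brown.MLU` computes its numbers, and the utterances are hypothetical.

```python
def mean_length(utterances):
    """Average number of words per utterance (0.0 if there are none)."""
    lengths = [len(u) for u in utterances]
    return sum(lengths) / len(lengths) if lengths else 0.0

# Hypothetical utterances, each one a list of words
chi = [['more', 'cookie'], ['that'], ['Eve', 'read', 'book']]
mot = [['do', 'you', 'want', 'a', 'cookie'], ['what', 'is', 'that']]

chi_mlu = mean_length(chi)
mot_mlu = mean_length(mot)
```

With these toy utterances, the child's average is 2.0 words and the mother's is 4.0.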

We can look at pronouns, nouns, verbs.

```python
tws = brown.tagged_words(eve[0])
tws[:10]
```

So, how would we get just the pronouns?

```python
pros = [w for  ... ]
```
--

Wait, before we go further, what are the options?

```python
poses = set([p for ... in ...])
```
--

```python
poses = set([p for (w,p) in brown.tagged_words(eve[0])])
```
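
If we also want to know how often each tag occurs, a `Counter` does the counting for us. This is a sketch on a handful of made-up `(word, tag)` pairs; the pairs are hypothetical and only imitate the shape of what `tagged_words` returns.

```python
from collections import Counter

# Hypothetical (word, tag) pairs
tagged = [('you', 'pro'), ('want', 'v'), ('cookie', 'n'),
          ('it', 'pro'), ('eat', 'v'), ('you', 'pro')]

# The set of tags, as on the slide (a set comprehension does the same job)
poses = {p for (w, p) in tagged}

# How many times each tag occurs
counts = Counter(p for (w, p) in tagged)
```

`counts.most_common()` then lists the tags from most to least frequent.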

Ok, now back to the question: How would we get just the pronouns?

---
```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader
corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
[x for x in dir(brown) if x[0] != '_']
poses = set([p for (w,p) in brown.tagged_words(eve[0])])
```

```python
pros = [w for (w,p) in brown.tagged_words(eve[0]) ... ] ...?
```
--

```python
pros = [w for (w,p) in brown.tagged_words(eve[0]) if p[:3] == 'pro']
```

In my notes, I added a check that `p` is not empty.  I do not at present see why it is needed here,
since every word seems to be paired with a tag.  But in case we run into a reason for it, here it is for completeness:

```python
pros = [w for (w,p) in brown.tagged_words(eve[0]) if p and p[:3] == 'pro']
```
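
One place where the `p and` check could matter is a word whose tag is empty or missing. Whether the reader ever hands back such pairs is an assumption here, so the sketch below uses made-up pairs to show what Python does in each case: slicing an empty string is harmless, but indexing it with `p[0]` is not, and a `None` tag breaks both.

```python
# Hypothetical pairs: one word has an empty tag, one has no tag at all
tagged = [('you', 'pro'), ('xxx', ''), ('cookie', 'n'), ('uh', None)]

# With the guard, both the slice and the index are safe
pros = [w for (w, p) in tagged if p and p[:3] == 'pro']
nouns = [w for (w, p) in tagged if p and p[0] == 'n']

def error_name(test):
    """Return the name of the exception an unguarded test raises, if any."""
    try:
        [w for (w, p) in tagged if test(p)]
    except (IndexError, TypeError) as err:
        return type(err).__name__
    return None

slice_error = error_name(lambda p: p[:3] == 'pro')  # fails only on None
index_error = error_name(lambda p: p[0] == 'n')     # fails already on ''
```

So the guard is cheap insurance: without it, the slice version stops at the `None` tag and the index version stops at the empty one.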

---
We can plot things, like the noun/verb ratio as it changes over time, compared to the adult's:

```python
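def nvratio(f, speaker=['CHI']):
    # (word, tag) pairs for the given speaker(s) in file f
    ws = brown.tagged_words(f, speaker=speaker)
    # `p and` skips untagged words before we look at p[0]
    ns = [w for (w,p) in ws if p and p[0] == 'n']
    vs = [w for (w,p) in ws if p and p[0] == 'v']
    nns = len(ns)
    nvs = len(vs)
    # float() keeps this a true division under Python 2 as well
    ratio = float(nns)/nvs
    return ratio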

age_months = [brown.age(f, month=True)[0] for f in eve]
eve_rat = [nvratio(f) for f in eve]
mot_rat = [nvratio(f, speaker=['MOT']) for f in eve]

from matplotlib import pyplot as plt
```
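
Two things about `nvratio` are worth a second look: testing `p[0] == 'n'` accepts any tag that starts with `n`, and a file with no verbs would divide by zero. Here is a sketch of a more cautious variant that works on a plain list of pairs. The function names, the sample pairs, and the tag shapes (`n:prop`, `neg`) are assumptions for illustration, and it assumes Python 3 division.

```python
def main_category(tag):
    """Return the part of a tag before the first colon ('' if no tag)."""
    return tag.split(':')[0] if tag else ''

def nv_ratio(pairs):
    """Noun-to-verb ratio for (word, tag) pairs, or None without verbs."""
    nouns = sum(1 for (w, p) in pairs if main_category(p) == 'n')
    verbs = sum(1 for (w, p) in pairs if main_category(p) == 'v')
    if verbs == 0:
        return None
    return nouns / verbs

# Hypothetical pairs; 'neg' starts with 'n' but is not a noun tag
sample = [('Eve', 'n:prop'), ('want', 'v'), ('cookie', 'n'),
          ('not', 'neg'), ('go', 'v'), ('xxx', '')]

ratio = nv_ratio(sample)
```

On `sample` this counts two nouns and two verbs, where the `p[0] == 'n'` test would have counted three nouns.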

For me, this drew the plot immediately in the window:

```python
plt.plot(age_months, eve_rat, age_months, mot_rat)
```

In some configurations, `plot` only builds the figure, and nothing appears until you call `show`.
As far as I can tell, the difference is whether matplotlib is in interactive mode (as it often is in IPython), but for completeness:

```python
plt.plot(age_months, eve_rat, age_months, mot_rat)
plt.ylabel('noun-to-verb-ratio')
plt.show()
```
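
With two lines on one plot, it helps to say which is which. Here is a sketch that adds axis labels and a legend. The numbers are made up (they stand in for `age_months`, `eve_rat`, and `mot_rat`), and the `Agg` backend is chosen only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # off-screen backend, only so this runs without a display
from matplotlib import pyplot as plt

# Hypothetical values, standing in for age_months, eve_rat, and mot_rat
ages = [18, 19, 20, 21]
child = [2.1, 1.8, 1.5, 1.3]
mother = [0.9, 1.0, 0.9, 1.0]

fig, ax = plt.subplots()
ax.plot(ages, child, label='CHI')    # one line per speaker
ax.plot(ages, mother, label='MOT')
ax.set_xlabel('age (months)')
ax.set_ylabel('noun-to-verb ratio')
ax.legend()                          # uses the label= values above
```

With a normal on-screen backend (that is, without the `matplotlib.use('Agg')` line), `plt.show()` would then display the figure.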