name: title
layout: true
class: center, middle, inverse
---
# CHILDES notes #
---
layout: false

CHILDES comes from [http://childes.psy.cmu.edu](http://childes.psy.cmu.edu).

We need XML data to work with; let's get the Brown corpus. It needs to go into `nltk_data/corpora/childes/data-xml/Eng-USA-MOR/Brown`.

When we have the data, we can set up the `CHILDESCorpusReader` included in NLTK.

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader

# Locate the corpus directory inside nltk_data
corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
# The second argument is a regular expression over file paths
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
```

Side note: In a lot of contexts, when looking for "all files that end in `.xml`" you would wind up needing to use `*.xml`. In those cases, `*` is a "wildcard" that means "anything can be here." Despite the similar appearance, the `*` above is not a wildcard. `.*.xml` is a "regular expression," and it means "any character (`.`), any number of times (`*`), followed by `.xml`" -- same effect, but it's in a different (but similar-looking) "language." Strictly speaking, the second `.` also matches any character, so `.*\.xml` would be the precise form, but the effect here is the same.

---

Here are the files that are included:

```python
brown.fileids()
```

There are three children; we want to concentrate on Eve.

```python
# File ids look like 'Brown/Eve/...', so characters 6-8 hold the child's name
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
```

There are a number of things we can do with the corpus (that we've named `brown`).

```python
corpus_data = brown.corpus(eve)
corpus_parties = brown.participants(eve)
ages = brown.age(eve)
mlus = brown.MLU(eve)  # CHI appears to be assumed
dir(brown)  # list everything else the reader offers
```

---

We can compare the MLUs of the child and her mother.

```python
pmlus = brown.MLU(eve, speaker='MOT')
mlus
pmlus
```

We can look at pronouns, nouns, verbs.
```python
tws = brown.tagged_words(eve[0])
tws[:10]
# Keep the words whose tag starts with 'pro'; the tag can be missing, hence `if p`
pros = [w for (w, p) in tws if p and p[:3] == 'pro']
```

---

We can plot things, like the noun/verb ratio as it changes over time, compared to the mother's.

```python
from matplotlib import pyplot as plt

def nvratio(f, speaker=['CHI']):
    ws = brown.tagged_words(f, speaker=speaker)
    # Tags starting with 'n' count as nouns, tags starting with 'v' as verbs
    ns = [w for (w, p) in ws if p and p[0] == 'n']
    vs = [w for (w, p) in ws if p and p[0] == 'v']
    nns = len(ns)
    nvs = len(vs)
    # Assumes the speaker produced at least one verb in this file
    ratio = nns / nvs
    return ratio

age_months = [brown.age(f, month=True)[0] for f in eve]
eve_rat = [nvratio(f) for f in eve]
mot_rat = [nvratio(f, speaker=['MOT']) for f in eve]

plt.plot(age_months, eve_rat, age_months, mot_rat)
plt.xlabel('age (months)')
plt.ylabel('noun-to-verb-ratio')
plt.legend(['Eve', 'Mother'])
plt.show()
```

---
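The ratio above divides by the number of verbs and matches tags on their first letter only. Here is a minimal sketch of a more defensive version, not necessarily how you would want to do it: it assumes tags have the shape `category:subcategory` (so `n:prop` counts as a noun but `neg` does not) and returns NaN when there are no verbs. The names `pos_category` and `noun_verb_ratio` and the sample data are made up for illustration and are not part of NLTK or the corpus.

```python
import math

def pos_category(tag):
    """Return the part of a tag before the first ':' (assumed tag shape)."""
    return tag.split(':')[0] if tag else None

def noun_verb_ratio(tagged_words):
    """Nouns per verb in a list of (word, tag) pairs; NaN if there are no verbs."""
    categories = [pos_category(tag) for (_, tag) in tagged_words]
    n_nouns = categories.count('n')
    n_verbs = categories.count('v')
    if n_verbs == 0:
        return math.nan  # avoid ZeroDivisionError
    return n_nouns / n_verbs

# Hypothetical hand-written sample, not taken from the corpus
sample = [('big', 'adj'), ('cookie', 'n'), ('Eve', 'n:prop'),
          ('want', 'v'), ('not', 'neg'), ('xxx', None)]

print(noun_verb_ratio(sample))             # 2 nouns / 1 verb
print(noun_verb_ratio([('cookie', 'n')]))  # no verbs, so NaN
```

With the real data, this could be called as `noun_verb_ratio(brown.tagged_words(f, speaker=['CHI']))`. matplotlib leaves a gap in the line wherever the ratio is NaN.

---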