name: title
layout: true
class: center, middle, inverse
---
# CHILDES #
---
layout: false

CHILDES comes from [http://childes.talkbank.org](http://childes.talkbank.org).

We need XML data to work with, so let's get the Brown corpus.
It needs to go into "nltk_data/corpora/childes/data-xml/Eng-USA-MOR/Brown".

When we have the data, we can set up the `CHILDESCorpusReader` included in NLTK.

**NOTE.** *The IPython console currently crashes when using paths; I need to find a workaround.*

```python
from nltk.corpus.reader import CHILDESCorpusReader
# corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
# brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
brown = CHILDESCorpusReader('nltk_data/corpora/childes/data-xml/Eng-USA-MOR/', 'Brown/.*.xml')
```

> Side note: In a lot of contexts when looking for "all files that end in `.xml`" you would wind up needing to use `*.xml`. In those cases, `*` is a "wildcard" that means "anything can be here."

> Despite the similar appearance, the `*` above is not a wildcard. `.*.xml` is a "regular expression," and its meaning is "any character (`.`), any number of times (`*`), followed by `.xml`". The effect is the same, but it is written in a different (though similar-looking) "language." (Strictly, the `.` before `xml` also matches any character; `.*\.xml` would match only a literal dot.)

---

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader
corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
```

Here are the files that are included:

```python
>>> brown.fileids()
Out[5]:
['Brown/Adam/adam01.xml',
 ...
 'Brown/Adam/adam55.xml',
 'Brown/Eve/eve01.xml',
 ...
 'Brown/Eve/eve20.xml',
 'Brown/Sarah/sarah001.xml',
 ...
 'Brown/Sarah/sarah139.xml']
```

There are three children; we want to concentrate on Eve.
We will do that by getting just the subset of the `fileids()` that are for Eve.
So: how?

--

```python
eve = [f for f in brown.fileids() ...
]
```

--

```python
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
```

---

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader
corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
```

There are a number of things we can do with the corpus (that we've named `brown`).

```python
corpus_data = brown.corpus(eve)
corpus_parties = brown.participants(eve)
ages = brown.age(eve)
mlus = brown.MLU(eve) # CHI appears to be assumed
```

If you want to know what something like `brown` "knows how to do," you can ask it for a "directory."

```python
dir(brown)
```

That contains a lot of internal stuff, though. And it's just returning a list, which we know because:

```python
>>> type(dir(brown))
list
```

---

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader
corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
```

There are a number of things we can do with the corpus (that we've named `brown`).

```python
corpus_data = brown.corpus(eve)
corpus_parties = brown.participants(eve)
ages = brown.age(eve)
mlus = brown.MLU(eve) # CHI appears to be assumed
```

How can we get just the part of the list that doesn't start with `"_"`?

--

```python
[x for x in dir(brown) ...
]
```

--

```python
[x for x in dir(brown) if x[0] != '_']
```

---

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader
corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
[x for x in dir(brown) if x[0] != '_']
```

```python
['MLU', 'abspath', 'abspaths', 'age', 'childes_url_base', 'citation',
 'convert_age', 'corpus', 'encoding', 'ensure_loaded', 'fileids',
 'license', 'open', 'participants', 'raw', 'readme', 'root', 'sents',
 'tagged_sents', 'tagged_words', 'unicode_repr', 'webview_file',
 'words', 'xml']
```

---

We can compare the MLUs of the child and her mother.

```python
pmlus = brown.MLU(eve, speaker='MOT')
mlus
pmlus
```

We can look at pronouns, nouns, verbs.

```python
tws = brown.tagged_words(eve[0])
tws[:10]
```

So, how would we get just the pronouns?

```python
pros = [w for ... ]
```

--

Wait, before we go further, what are the options?

```python
poses = set([p for ... in ...])
```

--

```python
poses = set([p for (w,p) in brown.tagged_words(eve[0])])
```

Ok, now back to the question: How would we get just the pronouns?

---

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader
corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
[x for x in dir(brown) if x[0] != '_']
poses = set([p for (w,p) in brown.tagged_words(eve[0])])
```

```python
pros = [w for (w,p) in brown.tagged_words(eve[0]) ... ]
...?
```

--

```python
pros = [w for (w,p) in brown.tagged_words(eve[0]) if p[:3] == 'pro']
...?
```

--

In my notes, I added an extra `p and` check to the condition. I do not at present see why it is needed, since every word seems to be paired with a tag.
But in case we run into a reason for having done this, for completeness:

```python
pros = [w for (w,p) in brown.tagged_words(eve[0]) if p and p[:3] == 'pro']
```

---

We can plot things, like the noun/verb ratio as it changes over time, compared to the adult's.

```python
def nvratio(f, speaker=['CHI']):
    # tagged words for one file, restricted to the given speaker(s)
    ws = brown.tagged_words(f, speaker=speaker)
    # tags starting with 'n' count as nouns, tags starting with 'v' as verbs
    ns = [w for (w,p) in ws if p and p[0] == 'n']
    vs = [w for (w,p) in ws if p and p[0] == 'v']
    nns = len(ns)
    nvs = len(vs)
    # assumes the file has at least one verb for this speaker
    ratio = nns/nvs
    return ratio

age_months = [brown.age(f, month=True)[0] for f in eve]
eve_rat = [nvratio(f) for f in eve]
mot_rat = [nvratio(f, speaker=['MOT']) for f in eve]

from matplotlib import pyplot as plt
```

For me, this drew immediately in the window:

```python
plt.plot(age_months, eve_rat, age_months, mot_rat)
```

I think that in some configurations `plot` only builds the figure, which you then display with `show`. I am not certain what differentiates these situations (it may depend on whether matplotlib's interactive mode is on), but for completeness:

```python
plt.plot(age_months, eve_rat, age_months, mot_rat)
plt.ylabel('noun-to-verb-ratio')
plt.show()
```
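---

The counting inside `nvratio` can be tried without loading the corpus. What follows is only a sketch under some assumptions: `noun_verb_ratio` is a hypothetical helper (not part of NLTK), the `(word, tag)` pairs are made up for illustration rather than taken from CHILDES, and returning `None` when there are no verbs is one possible way to avoid dividing by zero. It assumes Python 3, where `/` is true division.

```python
def noun_verb_ratio(tagged_words):
    """Return (number of nouns) / (number of verbs), or None if there are no verbs.

    Hypothetical helper: takes a list of (word, tag) pairs such as the ones
    returned by tagged_words().
    """
    # `p and ...` skips pairs whose tag is empty or None
    nouns = [w for (w, p) in tagged_words if p and p[0] == 'n']
    verbs = [w for (w, p) in tagged_words if p and p[0] == 'v']
    if not verbs:
        return None  # no verbs, so the ratio is undefined
    return len(nouns) / len(verbs)

# Made-up sample, NOT real corpus output; the tags are illustrative only
sample = [('more', 'qn'), ('cookie', 'n'), ('you', 'pro'),
          ('want', 'v'), ('juice', 'n'), ('xxx', None)]

ratio = noun_verb_ratio(sample)
print(ratio)  # two nouns, one verb: prints 2.0
```

Note that `'qn'` does not count as a noun here, because only the first character of the tag is checked. With a guard like this, a file in which a speaker produces no verbs would yield `None` rather than raising `ZeroDivisionError`; such points would then need to be filtered out before plotting.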