name: title
layout: true
class: center, middle, inverse

---

# CHILDES notes #
## (Continuing from previous class) ##

---
layout: false

CHILDES comes from [http://childes.psy.cmu.edu](http://childes.psy.cmu.edu).

We need XML data to work with; let's get Brown.

It needs to go into "nltk_data/corpora/childes/data-xml/Eng-USA-MOR/Brown"

Unfortunately, I don't have an easier way to do this than just downloading it and putting it there.

Once we have the data, we can set up the `CHILDESCorpusReader` included in NLTK.

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader

corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
```

> Side note: In a lot of contexts, when looking for "all files that end in `.xml`," you would wind up needing to use `*.xml`. In those cases, `*` is a "wildcard" that means "anything can be here."

> Despite the similar appearance, the `*` above is not a wildcard: `.*.xml` is a "regular expression," and its meaning is "any character (`.`), any number of times (`*`), followed by `.xml`" -- same effect, but it's written in a different (but similar-looking) "language."

---

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader

corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
```

Here are the files that are included:

```python
>>> brown.fileids()
['Brown/Adam/adam01.xml',
 ...
 'Brown/Adam/adam55.xml',
 'Brown/Eve/eve01.xml',
 ...
 'Brown/Eve/eve20.xml',
 'Brown/Sarah/sarah001.xml',
 ...
 'Brown/Sarah/sarah139.xml']
```

There are three children; we want to concentrate on Eve. We will do that by getting just the subset of the `fileids()` that are for Eve.

So: how?

--

```python
eve = [f for f in brown.fileids() ...
]
```

--

```python
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
```

---

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader

corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
```

There are a number of things we can do with the corpus (which we've named `brown`).

```python
corpus_data = brown.corpus(eve)
corpus_parties = brown.participants(eve)
ages = brown.age(eve)
mlus = brown.MLU(eve)   # CHI appears to be assumed
```

If you want to know what something like `brown` "knows how to do," you can ask it for a "directory."

```python
dir(brown)
```

That contains a lot of internal stuff, though. And it's just returning a list, which we know because:

```python
>>> type(dir(brown))
list
```

---

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader

corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
```

There are a number of things we can do with the corpus (which we've named `brown`).

```python
corpus_data = brown.corpus(eve)
corpus_parties = brown.participants(eve)
ages = brown.age(eve)
mlus = brown.MLU(eve)   # CHI appears to be assumed
```

How can we get just the part of the list that doesn't start with `"_"`?

--

```python
[x for x in dir(brown) ...
]
```

--

```python
[x for x in dir(brown) if x[0] != '_']
```

---

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader

corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA-MOR/')
brown = CHILDESCorpusReader(corpus_root, 'Brown/.*.xml')
eve = [f for f in brown.fileids() if f[6:9] == 'Eve']
[x for x in dir(brown) if x[0] != '_']
```

```python
['MLU', 'abspath', 'abspaths', 'age', 'childes_url_base', 'citation',
 'convert_age', 'corpus', 'encoding', 'ensure_loaded', 'fileids',
 'license', 'open', 'participants', 'raw', 'readme', 'root', 'sents',
 'tagged_sents', 'tagged_words', 'unicode_repr', 'webview_file',
 'words', 'xml']
```

---

We can compare the MLUs of children and parents.

```python
pmlus = brown.MLU(eve, speaker='MOT')
mlus
pmlus
```

We can look at pronouns, nouns, verbs.

```python
tws = brown.tagged_words(eve[0])
tws[:10]
```

So, how would we get just the pronouns?

```python
pros = [w for ...
        ]
```

--

```python
pros = [w for (w,p) in brown.tagged_words(eve[0]) ...
        ]
...?
```

--

```python
pros = [w for (w,p) in brown.tagged_words(eve[0]) if p[:3] == 'pro']
...?
```

--

```python
pros = [w for (w,p) in brown.tagged_words(eve[0]) if p and p[:3] == 'pro']
```

---

We can plot things, like the noun/verb ratio as it changes over time, compared to the adults'.

```python
from matplotlib import pyplot as plt

def nvratio(f, speaker=['CHI']):
    ws = brown.tagged_words(f, speaker=speaker)
    ns = [w for (w,p) in ws if p and p[0] == 'n']
    vs = [w for (w,p) in ws if p and p[0] == 'v']
    nns = len(ns)
    nvs = len(vs)
    ratio = nns/nvs
    return ratio

age_months = [brown.age(f, month=True)[0] for f in eve]
eve_rat = [nvratio(f) for f in eve]
mot_rat = [nvratio(f, speaker=['MOT']) for f in eve]

plt.plot(age_months, eve_rat, age_months, mot_rat)
plt.ylabel('noun-to-verb ratio')
plt.show()
```

---

If there's still time, we can start talking about stemming, tagging, and chunking, and prepare for regular expressions.

Some of this is coming from `pythonprogramming.net`.
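Since regular expressions are coming up, here is a quick self-contained preview using Python's built-in `re` module. It shows the "any character (`.`), any number of times (`*`)" reading from the side note earlier; the file list is made up for illustration.

```python
import re

# ".*\.xml" = any character (.), any number of times (*), then a literal ".xml".
# (Escaping the second dot makes it a literal period; left unescaped, as in
# 'Brown/.*.xml' earlier, it means "any character" there too -- which still
# matches the same filenames.)
pattern = re.compile(r'.*\.xml')

# A made-up file list, just for illustration:
files = ['Brown/Eve/eve01.xml', 'Brown/Eve/notes.txt', 'Brown/Adam/adam01.xml']
matches = [f for f in files if pattern.match(f)]
print(matches)
```

Only the two `.xml` paths survive the filter; `notes.txt` does not match.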
```python
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."

words = word_tokenize(new_text)
for w in words:
    print(ps.stem(w))
```

---

```python
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
```

---

```python
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            chunked.draw()
    except Exception as e:
        print(str(e))

process_content()
```
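
---

To see what that chunk grammar does without needing the State of the Union corpus (and without the `draw()` window), here is a minimal sketch that runs the same pattern over a small hand-tagged sentence. The sentence and its tags are made up for illustration.

```python
import nltk

# The same pattern as above: optional adverbs (RB), optional verbs (VB),
# one or more proper nouns (NNP), then an optional common noun (NN).
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunkParser = nltk.RegexpParser(chunkGram)

# A hand-tagged sentence, made up for illustration:
tagged = [("confidently", "RB"), ("spoke", "VBD"),
          ("George", "NNP"), ("Washington", "NNP"),
          ("to", "TO"), ("Congress", "NNP")]

chunked = chunkParser.parse(tagged)
print(chunked)
```

The parser groups `confidently/RB spoke/VBD George/NNP Washington/NNP` into one Chunk (note that `VBD` matches `<VB.?>`), leaves `to/TO` outside, and puts `Congress/NNP` into a second Chunk of its own.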