Standoff annotation, XML, and more CHILDES

Updated:

So, the last thing we did in class was a kind of live demonstration of how to deal with corpora that may not have all the structure you might like. I will try to write up here something like what happened, as a reminder and potentially for future reference.

CHILDES project reminder

So, the “default project” for the end of this semester was this idea I had for looking at bilingual CHILDES corpora to try to determine whether children leave the “root infinitive” stage in both languages simultaneously. The basic premise is that children from about 2 years old to 3 years old will use infinitive verbs in main clauses, where adults do not. The hypothesis being tested is that this is due to a kind of biological maturation, rather than anything about the language input itself. The caveat is that there is a set of languages that do not seem to show root infinitives at all, and which seem to be the “null subject” languages like Spanish and Italian. And there are some other languages where the infinitive form might not really exist, at least not in a way that is usefully detectable, e.g., Japanese, Mandarin.

In practical terms, what this means is that we have a pretty small set of possible corpora. To do this project, a corpus must be located that:

  • is bilingual,
  • involves two languages that both show root infinitives,
  • involves languages you know well enough to make sense of the transcripts, and
  • includes children between about 1.5 and 3.5 years old.

If you look at the descriptions of the bilingual corpora in CHILDES the options are… few. The ones with Cantonese, Catalan, Chinese, Italian, Japanese, Portuguese, and Spanish are out, at least. German, French, Russian, English, Dutch, Danish should be ok. But, this is not many to work with.

Even before we get to other questions, this does pose a problem for anyone who isn’t at least marginally comfortable with the non-English language in a workable pairing. Fluency isn’t required, but some way of identifying when the verb is in the infinitive (or possibly some kind of default) form probably is. Realistically, Dutch or German is probably kind of guessable based on English, though it would take some research.

But, supposing that there is a language pairing that will work, we then have another issue: pretty much none of the corpora are as well annotated as the Brown corpus (Adam, Eve, Sarah) that we worked with in class before. Lots of time and effort has gone into that corpus to tag the parts of speech, label the dependencies, etc. So, with that corpus we could just search for the verbs, look at that agreement, because it was all tagged.

In most of these corpora, we have mostly just the words. Fair enough, these were transcribed for some particular purpose of the original researchers. But it means that if we want to do things like look for main clause infinitives, we need to do some more work than just searching for them directly.

XML vs. CHAT

If you look at the description of the particular corpus you want to use, there will likely be links both into the browsable database and to a file you can download. However, the file you download there is almost certain to be in CHAT format (files ending in .cha). NLTK does not know how to handle those (except as general text files); the CHILDESCorpusReader is designed for XML files.

CHAT is specific to CHILDES, and its format is well specified. Participant and recording information goes at the top in specific forms, participants are referred to by three-letter codes (CHI = child, MOT = mother, etc.), individual utterances begin with * and the participant code (*CHI: ), dependent “tiers” begin with %, and so on. You can read the CHAT manual if you like, and you can kind of absorb how it works by looking at the browsable transcripts of any corpus.
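To give a sense of the shape of the format, here is a made-up fragment (not from any actual corpus) roughly following those conventions:

```
@Participants:	CHI Target_Child , MOT Mother
*CHI:	want go play .
%pho:	want go ple
*MOT:	you want to go play ?
```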

NLTK (or more specifically CHILDESCorpusReader) wants these to be in XML format, instead. Many of the corpora in CHILDES are in XML format already, but you need to go to the XML section specifically (these are not in general linked from the main page that describes the corpus). So, once you pick a corpus you want to use, you want to look in the bilingual XML corpora directory to find the XML files for that corpus.

Thing is: Not all the corpora have XML versions there. I’m not sure why not. The one I was experimenting with was actually the FallsChurch Japanese-English bilingual one (which, however, wouldn’t be great for this project due to Japanese not showing root infinitives in any obvious way). This seems to exist in CHAT format but not in XML format.

CHAT is well enough defined though that there is a pretty easy way to convert from CHAT to XML. There is a program to do this called Chatter. It is easy to download and use on the Mac, and there is a Java version that is supposed to work on Windows and Linux. I did not try it on Windows or Linux though. To use it, unzip the CHAT files you have downloaded, open the Chatter program, then choose “Open” in Chatter, select the folder where the CHAT files are, and it will process them into a folder that it creates alongside the folder with the CHAT files. It will give it the same name but with -xml at the end.

Once you have XML files, we can start operating with them in NLTK. So, now we can do some Python again.

Finding the files

There has been a persistent problem with finding the files. NLTK is supposed to be a bit smarter than it has proven to be in locating the data files, but people have generally had quite a bit of trouble getting this to work. The best idea would be just to explicitly specify where the files are.

On the Mac, my files are in a folder called nltk_data in my home directory. Within nltk_data there is a corpora folder, within that a childes folder, and within that a data-xml folder. This is where I had put the Brown files from earlier work.

So, if I download the GNP corpus (which has an XML version), I can move the GNP folder into the data-xml folder. And then the full path to this folder, given that on my computer my username is hagstrom, is:

/Users/hagstrom/nltk_data/corpora/childes/data-xml/GNP/

It should be clear how this is constructed. You could put it on your Desktop, and find it with /Users/whatever/Desktop/GNP/ instead. The main thing is to know exactly what folders it is in.

For Windows, I’m less sure, but in class the nltk_data folder was actually at the top level of the C: drive. So, the path I used in class was:

C:/nltk_data/corpora/childes/data-xml/GNP/

or something like that. By the way, I know that in class I separated the directories with the forward slash character (/). That is the normal separator on recent Macs and Linux machines. The normal separator for Windows in other contexts is actually the backslash character (\), and I don’t know why that path wasn’t instead:

C:\nltk_data\corpora\childes\data-xml\GNP\

Maybe that would have worked also.
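One way to sidestep the slash question entirely is to let Python build the path for you. This is just a sketch, assuming the nltk_data layout described above; os.path.expanduser and os.path.join pick the right home directory and separator for whatever platform you are on.

```python
import os

# Build the path to the GNP folder without hard-coding '/' or '\',
# assuming the nltk_data layout described above.
home = os.path.expanduser('~')  # e.g. /Users/hagstrom on a Mac
gnp_path = os.path.join(home, 'nltk_data', 'corpora', 'childes', 'data-xml', 'GNP')
print(gnp_path)
```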

Anyway, I will assume that the XML files landed where I indicated above. I am going to use this GNP corpus for examples, as I did in class. If you look at this, you will see that there are three folders in the GNP folder: Both, English, and French. I’m just going to look at the English one in the examples below.

To Python!

I used Spyder for this because I just feel more comfortable there being able to re-run things from beginning to end. Also, the “autocomplete” is a bit smarter there than it is in Jupyter Notebook. But, it’s Python, do it however you want.

So, to begin, we bring in NLTK and tell it where the corpus is.

import nltk
from nltk.corpus.reader import CHILDESCorpusReader
data_root = '/Users/hagstrom/nltk_data/corpora/childes/data-xml/'
gnpec = CHILDESCorpusReader(data_root, 'GNP/English/.*.xml')
print(gnpec.fileids())

You should get a list of the fileids in the corpus. This much should work whatever corpus you are using really (not just the GNP/English one).

Much of what I want to do here below requires picking a single transcript, so let’s name the last transcript (which will be somebody’s latest one, so more likely to have a bunch of words in it).

the_file = gnpec.fileids()[-1]

At this point we can do the stuff that the CHILDESCorpusReader allows us to do. But it’s a little disappointing.

gnpec.participants(fileids=the_file)
gnpec.sents(fileids=the_file, speaker='CHI')
gnpec.tagged_sents(fileids=the_file, speaker='CHI')

The thing that’s (potentially) disappointing is that there are no tags. Using tagged_sents() or tagged_words() just returns a bunch of pairs of words with empty strings.

What’s more, there’s no way (that I know of at least) to know what utterance we’re looking at. If we want to look at just the child utterances, we can limit the search to the speaker CHI, and we will get the sentences in order, but we won’t know what MOT said in between, and if we look at MOT’s utterances, we’ll get those in order, but we won’t know what order they occur in with respect to the child’s utterances. And if we don’t limit the speaker, then we don’t know who’s talking. It’s surprisingly limited.

It’s probably informative to look at the XML file itself. Below I’ve given what we see in a couple of these utterances. It’s useful to see the structure here. There is an utterance indicated by an opening <u ...> tag (and closed by a </u>), and inside each utterance we have a series of words enclosed by <w> and </w> tags. The utterances have attributes who (for the speaker) and uID for the utterance ID. That’s very interesting/useful to see. This means that we can pinpoint any utterance in a transcript by referring to its uID. There are also a couple of other tags. One is <t type="p"></t> which seems to correspond to clause type or turn type—it distinguishes between statements ("p") and questions ("q") at least. And there is a more arbitrary tag (<a>...</a>) that holds codes of special interest to the original researchers. The type="coding" one marks what language the utterance is in and to whom it was addressed. The other one (type="extension")? I don’t know. Whatever.

  ...
  <u who="CHI" uID="u12">
    <w>I</w>
    <w>want</w>
    <w>go</w>
    <w>play</w>
    <w>make</w>
    <w>a</w>
    <w>house</w>
    <t type="p"></t>

    <a type="extension" flavor="pho">ai want go ple mek a haus</a>
    <a type="coding">$LAN:E $ADD:MOT</a>
  </u>
  <u who="MOT" uID="u13">
    <w>you</w>
    <w>want</w>
    <w>to</w>
    <w>go</w>
    <w>make</w>
    <w>a</w>
    <w>house</w>
    <t type="p"></t>

    <a type="coding">$LAN:E $ADD:CHI</a>
  </u>
  ...

So, back to our disappointment with CHILDESCorpusReader: it doesn’t (again, as far as I know) give us access to that uID attribute of an utterance when we retrieve it. However, CHILDESCorpusReader is itself a type of a more general XMLCorpusReader, and using this we can actually get access to the parsed XML directly. That will give us a much more flexible way into these transcripts, though at the cost of having to deal with another bit of technology.

So, step one is to get the XML representation of the corpus we read. This can be done like so:

the_xml = gnpec.xml(the_file)

The .xml() call does require exactly one file, so we need to specify which transcript file we are going to look at. We’ll look at the last one, which we named the_file.

Finding our way around the XML

There is some brief discussion of using XML in the NLTK book chapter 11, section 4.

However, probably the most rigorous place to look for examples is the official Python documentation for XML ElementTree. I’m going to just mention a couple of things here.

The basic goal here is to be able to look at an utterance and figure out the speaker (who) and the utterance ID (uID), which we know is in the XML file but is inaccessible through the CHILDESCorpusReader.

So, the first thing we’ll do is find the utterances by searching for the <u>...</u> tags. This can be accomplished by using the findall() function called on the XML structure.

This should look like the following. But, as we’ll see, it doesn’t quite work.

utterances = the_xml.findall('u')

The thing above will not find anything, even though if you look at the XML file, there are u tags there. Why? The source of the issue is that at the top of the XML file, it specifies a “namespace”:

<CHAT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://www.talkbank.org/ns/talkbank"
      xsi:schemaLocation="http://www.talkbank.org/ns/talkbank http://talkbank.org/software/talkbank.xsd"
      PID="11312/c-00001462-1"
      Version="2.5.0"
      Lang="eng"
      Corpus="Genesee"
      Date="1994-03-08">
      ...

The xmlns is the XML Namespace, and it is http://www.talkbank.org/ns/talkbank. The point of specifying this is to allow mixing of tags from different files together. This file has u tags, but other XML files might use u not for “utterance” but for “underline” or something. So, the real tag, as far as the XML parser is concerned, is not u but rather {http://www.talkbank.org/ns/talkbank}u – that is, it is the namespace in braces preceding the tag we see in the file. So, what this boils down to is that to find the utterances we need to do this:

utterances = the_xml.findall('{http://www.talkbank.org/ns/talkbank}u')

That will work, but it’s clunky: we need to put the namespace before every tag. So what I will do is put the namespace in its own variable:

ns = '{http://www.talkbank.org/ns/talkbank}'
utterances = the_xml.findall(ns+'u')
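If you want to convince yourself of the namespace behavior without a corpus in hand, here is a tiny self-contained demonstration. The two-utterance document is invented, but the namespace is the real talkbank one:

```python
import xml.etree.ElementTree as ET

# A made-up miniature CHAT-XML document in the talkbank namespace.
xml_text = '''<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="CHI" uID="u0"><w>hi</w></u>
  <u who="MOT" uID="u1"><w>hello</w></u>
</CHAT>'''
root = ET.fromstring(xml_text)

# The bare tag finds nothing, because every element is in the namespace.
print(len(root.findall('u')))        # 0

# Prefixing the namespace finds both utterances.
ns = '{http://www.talkbank.org/ns/talkbank}'
print(len(root.findall(ns + 'u')))   # 2
```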

We can now interrogate the who and uID like this:

print(utterances[4].get('uID'))
print(utterances[4].get('who'))

What is utterance u4 though? Looking at the XML, the utterance is parent to a sequence of words (among other things), so we can collect them like this:

ws = [w for w in utterances[4]]

This isn’t quite what we want, though. This has collected the child elements, but not all of them are words. And even when they are words, we need to ask the word element what its text is in order to print it or compare it to something. So there are two things we want to do: make sure we are looking at words (the w tag), and collect the text (since my current side quest is to print the words of the utterance).

words = [w.text for w in utterances[4] if w.tag == ns+'w']
print(words)

Ok, good, now we’re getting somewhere. We are starting to be able to get access to the data in the corpus at a deeper level.

Dealing with the lack of POS tags

Now, one thing that this corpus does not have is any kind of part-of-speech tagging. Ultimately what we want to look at is the form that verbs take, but we have no good way to find the verbs.

So we need a strategy. Here’s the strategy I thought of, at least. We’ll find out what the most common words are first, and then look by hand to see which of them are verbs. We’ll take one or a few of the most common verbs in the corpus and we’ll then search for just those verbs to see what form those verbs are in. So we are no longer looking in general for verb forms, but we’re trying to take something like a representative sample with a couple of the verbs that we are most likely to find in varying contexts in the transcript/corpus.

Finding the words is something we can do with the basic functions given to us by CHILDESCorpusReader. And then we can make a Frequency Distribution to figure out what the most common ones are.

all_words = gnpec.words(fileids=the_file)
fd = nltk.FreqDist(all_words)
print(fd.most_common(20))

In my results, I see basically go and do among the top 20 words of this transcript. Perhaps you might want to gather the words over all transcripts. Anyway, however you want to do it. This just seems like a good place to start given that we don’t have tags built into the corpus that allow us to search automatically for verbs and agreement characteristics.

So, the idea from here would be to look for the various forms in which go can occur (goes, go, went) and see how often it is arguably infinitive in a main clause. Or perhaps simply missing agreement (in French it is at least plausible that agreement can come out as 3rd person singular as a kind of default when the agreement is deficient somehow).
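A sketch of what that search might look like, assuming you have already pulled the child’s sentences out with gnpec.sents(fileids=the_file, speaker='CHI'). The sentences below are invented stand-ins so the example is self-contained:

```python
# Forms of 'go' to catch (an assumption; extend the set as needed).
go_forms = {'go', 'goes', 'going', 'gone', 'went'}

# Stand-in for gnpec.sents(fileids=the_file, speaker='CHI');
# these example sentences are made up for illustration.
child_sents = [
    ['I', 'want', 'go', 'play'],
    ['he', 'goes', 'there'],
    ['a', 'house'],
]

# Keep only the sentences containing some form of 'go',
# so they can be inspected by hand.
hits = [s for s in child_sents if go_forms & set(s)]
for s in hits:
    print(' '.join(s))
```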

Standoff annotation

It might be that you want to add some coding to a corpus that you have. For example, perhaps your project might be to look at a transcript and code for whether (a) a child’s utterance is prompting an adult’s utterance/repetition, or (b) a child’s utterance is imitating or resulting from an adult’s utterance. This will not be coded in the transcripts already, it will require coding it by hand.

One way to do this would be to actually edit the XML file and add the tags in. To do this would require that you not mess up the XML file in the process, which is potentially not trivial.

The way I’d be more comfortable doing this would be to leave the original corpus as it is, but to create a second file that has the coding for each utterance you want to add a code to. More concretely, I’m suggesting a second annotation file that contains something like this:

u0 RESP
u2 RESP
u4 PROMPT
u6 PROMPT
u8 PROMPT

The intent here is that u0 and u2 are CHI utterances in which the child is responding to a prompt, and u4, u6, and u8 are utterances in which the child is asking or prompting the adult to respond.

Once this is coded, one might look to see whether, say, English transcripts show a different pattern from transcripts in another language.

So the goal would be to use the CHILDES transcript together with the new file of extra annotations. Since these are annotations to the file oli33b06m.xml, we can save the annotations file as oli33b06m.xml.ann.txt (the idea being that you can locate the annotations by using the fileid you are using in CHILDES/XML and adding .ann.txt to it).

This is called “standoff annotation” in the NLTK book chapter 11, because it is not a direct modification of the original corpus, but is a separate kind of “overlay” that stands apart but points to spots in the original corpus.

To load this up, you can do this (making some assumptions here about where the annotation files will go):

annroot = '/Users/hagstrom/nltk_data/annotations/'
annfile = '{}{}.ann.txt'.format(annroot, the_file)
with open(annfile) as f:
    annotations = [l.strip().split() for l in f if len(l.strip().split())>0]

The l.strip().split() part removes the newline from the end of each line and then breaks the line up into a list of whitespace-separated fields. So the first line would result in a list like ['u0', 'RESP']. And the entire file will be read into the annotations list (the len(...) > 0 test skips any blank lines).
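If you want to see that parsing step in isolation, here is a self-contained version that simulates the file contents with io.StringIO. The contents mirror the sample annotation file above, plus a stray blank line to show it being skipped:

```python
import io

# Simulated annotation file: same format as above, plus a blank line.
fake_file = io.StringIO('u0 RESP\nu2 RESP\n\nu4 PROMPT\n')

annotations = [l.strip().split() for l in fake_file
               if len(l.strip().split()) > 0]
print(annotations)  # [['u0', 'RESP'], ['u2', 'RESP'], ['u4', 'PROMPT']]
```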

Now, if you want to go through annotations and retrieve the utterance that corresponds to the annotation, you can do this:

for a in annotations:
    u = the_xml.find(".//*[@uID='{}']".format(a[0]))
    words = [w.text for w in u if w.tag == ns+'w' and w.text]
    print('Utterance {}: {}: {}'.format(u.get('uID'), u.get('who'), ' '.join(words)))

This finds any tag that has a uID attribute that matches the one in the current line of the annotations file, then assembles the words, and prints what it found.

Or, you could go through the corpus but catch cases where you have an extra annotation from the annotation file. To do this, it would be better to re-organize the annotations list so that we can look up the annotations by utterance ID. We can do this with a dictionary in Python. At the moment this annotations file is set up in such a way that it can pretty much automatically create this, because annotations is just a list of 2-member lists. All you have to do is

anndict = dict(annotations)
print(anndict['u4']) # 'PROMPT'

But this is not fully general; if you had multiple annotations on each line this would no longer work, it’s only good for the special case where each line has an utterance number followed by a single tag. Better would be to just use the first element in each line of the annotation file as the dictionary key, like so:

anndict = {a[0]: a[1:] for a in annotations}
print(anndict['u4']) # ['PROMPT']

The result is not identical (the entries are now lists instead of strings), but it is more general/adaptable.
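To see the difference, suppose (hypothetically) that a line in the annotation file carried more than one code. dict(annotations) would raise a ValueError on the three-element entry, but the comprehension handles it:

```python
# Hypothetical annotations where one line has two codes.
annotations = [['u0', 'RESP'], ['u4', 'PROMPT', 'LAN:E']]

# dict(annotations) would fail here (it needs exactly two items per
# entry), but the key/rest comprehension works fine:
anndict = {a[0]: a[1:] for a in annotations}
print(anndict['u0'])  # ['RESP']
print(anndict['u4'])  # ['PROMPT', 'LAN:E']
```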

So, now if we go through the utterances, we can check whether there is an extra annotation in the annotations file:

ns = '{http://www.talkbank.org/ns/talkbank}'
utterances = the_xml.findall(ns+'u')
for u in utterances:
    if u.get('uID') in anndict:
        promptresp = anndict[u.get('uID')][0]
    else:
        promptresp = 'UNKNOWN'
    words = [w.text for w in u if w.tag == ns+'w' and w.text]
    print('Utterance {}: {}: {}: {}'.format(u.get('uID'), u.get('who'), promptresp, ' '.join(words)))

One thing you might consider if you do this for a perhaps more serious project is to record the version number in your annotation file, so that it is clear what version of the corpus you are working with.

the_xml.get('Version')

Anyway

This is basically what I was trying to cover during class today. There will be more to do in your own projects, but I wanted to provide a couple of examples of how you might deal with the fact that some of the corpora are fairly sparse in terms of what they have tagged. Using this kind of a standoff annotation file keyed to the individual utterance numbers is one way that you can “extend” the corpus by hand without having to work out how to modify the corpus’ XML file itself. And, I wanted to suggest a strategy of looking for the most common words to find the most common verbs and then looking for forms of those most common verbs.

It is possible that even with all of this, the data sets you have are going to be small enough that it will be hard to say anything with much confidence. But, you can tell me what you did find at least, and what you might expect to find if you had bigger corpora (or better tagged corpora).