Standoff annotation, XML, and more CHILDES
So, the last thing we did in class was a kind of live demonstration of how to deal with corpora that may not have all the structure you might like. I will try to write up here something like what happened, as a reminder and potentially for future reference.
CHILDES project reminder
So, the “default project” for the end of this semester was this idea I had for looking at bilingual CHILDES corpora to try to determine whether children leave the “root infinitive” stage in both languages simultaneously. The basic premise is that children from about 2 years old to 3 years old will use infinitive verbs in main clauses, where adults do not. The hypothesis being tested is that this is due to a kind of biological maturation, rather than anything about the language input itself. The caveat is that there is a set of languages that do not seem to show root infinitives at all, and which seem to be the “null subject” languages like Spanish and Italian. And there are some other languages where the infinitive form might not really exist, at least not in a way that is usefully detectable, e.g., Japanese, Mandarin.
In practical terms, what this means is that we have a pretty small set of possible corpora. To do this project, a corpus must be located that:
- is bilingual,
- involving two languages that both show root infinitives,
- both of which you know well enough to make sense of the transcripts,
- has children between about 1.5 and 3.5 years old.
If you look at the descriptions of the bilingual corpora in CHILDES the options are… few. The ones with Cantonese, Catalan, Chinese, Italian, Japanese, Portuguese, and Spanish are out, at least. German, French, Russian, English, Dutch, Danish should be ok. But, this is not many to work with.
Even before we get to other questions, this does pose a problem for anyone who isn’t at least marginally comfortable with the language other than English in a workable pairing. Fluency isn’t required, but some way of identifying when the verb is in the infinitive (or possibly some kind of default) form probably is. Realistically, Dutch or German is probably kind of guessable based on English, though it would take some research.
But, supposing that there is a language pairing that will work, we then have another issue: pretty much none of the corpora are as well annotated as the Brown corpus (Adam, Eve, Sarah) that we worked with in class before. Lots of time and effort has gone into that corpus to tag the parts of speech, label the dependencies, etc. So, with that corpus we could just search for the verbs and look at their agreement, because it was all tagged.
In most of these corpora, we have mostly just the words. Fair enough, these were transcribed for some particular purpose of the original researchers. But it means that if we want to do things like look for main clause infinitives, we need to do some more work than just searching for them directly.
XML vs. CHAT
If you look at the description of the particular corpus you want to use, there will likely be links both into the browsable database and to a file you can download. However, the file you download there is almost certain to be in CHAT format (files ending in .cha). NLTK does not know how to handle those (except as general text files); the CHILDESCorpusReader is designed for XML files.
CHAT is specific to CHILDES, and the format is well specified. Participant and recording information goes at the top in specific forms, participants are referred to by three-letter codes (CHI = child, MOT = mother, etc.), individual utterances begin with * and the participant code (*CHI:), dependent “tiers” begin with %, and so on. You can read the CHAT manual if you like, and you can kind of absorb how it works by looking at the browsable transcripts of any corpus.
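To give a rough idea, here is a made-up fragment in roughly CHAT shape (a hypothetical sketch for illustration, not taken from a real transcript):

```
*CHI:	want go play .
%pho:	want go ple
*MOT:	you want to go play ?
```

The * lines are utterances by the child (CHI) and mother (MOT), and the %pho line is a dependent tier giving a rough pronunciation of the child’s utterance.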
NLTK (or more specifically CHILDESCorpusReader) wants these to be in XML format instead. Many of the corpora in CHILDES are in XML format already, but you need to go to the XML section specifically (these are not in general linked from the main page that describes the corpus). So, once you pick a corpus you want to use, you want to look in the bilingual XML corpora directory to find the XML files for that corpus.
Thing is: not all the corpora have XML versions there. I’m not sure why not. The one I was experimenting with was actually the FallsChurch Japanese-English bilingual one (which, however, wouldn’t be great for this project, due to Japanese not showing root infinitives in any obvious way). This seems to exist in CHAT format but not in XML format.
CHAT is well enough defined, though, that there is a pretty easy way to convert from CHAT to XML. There is a program to do this called Chatter. It is easy to download and use on the Mac, and there is a Java version that is supposed to work on Windows and Linux (I did not try it on those, though). To use it, unzip the CHAT files you have downloaded, open the Chatter program, choose “Open” in Chatter, select the folder where the CHAT files are, and it will process them into a folder that it creates alongside the folder with the CHAT files, giving it the same name but with -xml at the end.
Once you have XML files, we can start operating with them in NLTK. So, now we can do some Python again.
Finding the files
There has been a persistent problem with finding the files. NLTK is supposed to be a bit smarter than it has proven to be in locating the data files, but people have generally had quite a bit of trouble getting this to work. The best idea would be just to explicitly specify where the files are.
On the Mac, my files are in a folder called nltk_data in my home directory. Within nltk_data there is a corpora folder, within that a childes folder, and within that a data-xml folder. This is where I had put the Brown files from earlier work.
So, if I download the GNP corpus (which has an XML version), I can move the GNP folder into the data-xml folder. And then the full path to this folder, given that on my computer my username is hagstrom, is:
/Users/hagstrom/nltk_data/corpora/childes/data-xml/GNP/
It should be clear how this is constructed. You could put it on your Desktop instead, and find it with /Users/whatever/Desktop/GNP/. The main thing is to know exactly what folders it is in.
For Windows, I’m less sure, but in class the nltk_data folder was actually at the top level of the C: drive. So, the path I used in class was:
C:/nltk_data/corpora/childes/data-xml/GNP/
or something like that. By the way, I know that in class I separated the directories with the forward slash character (/). That is the normal separator on recent Macs and Linux machines. The normal separator for Windows in other contexts is actually the backslash character (\), but Python on Windows accepts forward slashes in paths, which is why the path above works. The backslash version:
C:\nltk_data\corpora\childes\data-xml\GNP\
would actually be riskier in Python, because inside an ordinary string literal a backslash starts an escape sequence (the \n in C:\nltk_data would be read as a newline); you would need to double the backslashes or use a raw string.
Anyway, I will assume that the XML files landed where I indicated above. I am going to use this GNP corpus for examples, as I did in class.
If you look at this, you will see that there are three folders in the GNP folder: Both, English, and French. I’m just going to look at the English one in the examples below.
To Python!
I used Spyder for this because I just feel more comfortable there being able to re-run things from beginning to end. Also, the “autocomplete” is a bit smarter there than it is in Jupyter Notebook. But, it’s Python, do it however you want.
So, to begin, we bring in NLTK and tell it where the corpus is.
import nltk
from nltk.corpus.reader import CHILDESCorpusReader
data_root = '/Users/hagstrom/nltk_data/corpora/childes/data-xml/'
gnpec = CHILDESCorpusReader(data_root, 'GNP/English/.*.xml')
print(gnpec.fileids())
You should get a list of the fileids in the corpus. This much should work whatever corpus you are using really (not just the GNP/English one).
Much of what I want to do here below requires picking a single transcript, so let’s name the last transcript (which will be somebody’s latest one, so more likely to have a bunch of words in it).
the_file = gnpec.fileids()[-1]
At this point we can do the stuff that the CHILDESCorpusReader allows us to do. But it’s a little disappointing.
gnpec.participants(fileids=the_file)
gnpec.sents(fileids=the_file, speaker='CHI')
gnpec.tagged_sents(fileids=the_file, speaker='CHI')
The thing that’s (potentially) disappointing is that there are no tags. Using tagged_sents() or tagged_words() just returns a bunch of pairs of words with empty strings.
What’s more, there’s no way (that I know of at least) to know which utterance we’re looking at. If we want to look at just the child utterances, we can limit the search to the speaker CHI, and we will get the sentences in order, but we won’t know what MOT said in between; and if we look at MOT’s utterances, we’ll get those in order, but we won’t know what order they occur in with respect to the child’s utterances. And if we don’t limit the speaker, then we don’t know who’s talking. It’s surprisingly limited.
It’s probably informative to look at the XML file itself. Below I’ve given what we see in a couple of these utterances. It’s useful to see the structure here. There is an utterance indicated by an opening <u ...> tag (and closed by a </u>), and inside each utterance we have a series of words enclosed by <w> and </w> tags. The utterances have attributes who (for the speaker) and uID (for the utterance ID). That’s very interesting/useful to see: it means that we can pinpoint any utterance in a transcript by referring to its uID. There are also a couple of other tags. One is <t type="p"></t>, which seems to correspond to clause type or turn type; it distinguishes between statements ("p") and questions ("q") at least. And there is a more arbitrary tag (<a>...</a>) that holds codes of special interest to the original researchers. The type="coding" one marks what language the utterance is in and to whom it was addressed. The other one (type="extension")? I don’t know. Whatever.
...
<u who="CHI" uID="u12">
<w>I</w>
<w>want</w>
<w>go</w>
<w>play</w>
<w>make</w>
<w>a</w>
<w>house</w>
<t type="p"></t>
<a type="extension" flavor="pho">ai want go ple mek a haus</a>
<a type="coding">$LAN:E $ADD:MOT</a>
</u>
<u who="MOT" uID="u13">
<w>you</w>
<w>want</w>
<w>to</w>
<w>go</w>
<w>make</w>
<w>a</w>
<w>house</w>
<t type="p"></t>
<a type="coding">$LAN:E $ADD:CHI</a>
</u>
...
So, back to our disappointment with CHILDESCorpusReader: it doesn’t (again, as far as I know) give us access to that uID attribute of an utterance when we retrieve it. However, CHILDESCorpusReader is itself a type of a more general XMLCorpusReader, and using this we can actually get access to the parsed XML directly. That will allow us a much more flexible way into these transcripts, though at the cost of having to deal with another bit of technology.
So, step one is to get the XML representation of the corpus we read. This can be done like so:
the_xml = gnpec.xml(the_file)
The .xml() call does require exactly one file, so we need to specify which transcript file we are going to look at. We’ll look at the last one, which we named the_file.
Finding our way around the XML
There is some brief discussion of using XML in the NLTK book chapter 11, section 4.
However, probably the most rigorous place to look for examples is the official Python documentation for XML ElementTree. I’m going to just mention a couple of things here.
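Before diving into the corpus itself, here is a minimal, self-contained illustration of the two ElementTree operations we will lean on, findall() and get(), run on a tiny made-up document. (Note that this toy document has no namespace declaration, a complication the real corpus files do have.)

```python
import xml.etree.ElementTree as ET

# Parse a small XML string into an Element; the corpus reader's .xml()
# hands us an object like this for the whole transcript.
root = ET.fromstring('<doc><u who="CHI" uID="u0"><w>hi</w></u></doc>')

# findall() returns matching child elements; get() reads an attribute.
for u in root.findall('u'):
    print(u.get('uID'), u.get('who'))        # u0 CHI
    print([w.text for w in u.findall('w')])  # ['hi']
```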
The basic goal here is to be able to look at an utterance and figure out the speaker (who) and the utterance ID (uID), which we know are in the XML file but are inaccessible through the CHILDESCorpusReader.
So, the first thing we’ll do is find the utterances by searching for the <u>...</u> tags. This can be accomplished by using the findall() function called on the XML structure. It should look like this, but it actually doesn’t quite work:
utterances = the_xml.findall('u')
The line above will not find anything, even though if you look at the XML file, there are u tags there. Why? The source of the issue is that at the top of the XML file, it specifies a “namespace”:
<CHAT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.talkbank.org/ns/talkbank"
xsi:schemaLocation="http://www.talkbank.org/ns/talkbank http://talkbank.org/software/talkbank.xsd"
PID="11312/c-00001462-1"
Version="2.5.0"
Lang="eng"
Corpus="Genesee"
Date="1994-03-08">
...
The xmlns is the XML namespace, and it is http://www.talkbank.org/ns/talkbank. The point of specifying this is to allow mixing of tags from different files together. This file has u tags, but other XML files might use u not for “utterance” but for “underline” or something. So, the real tag, as far as the XML parser is concerned, is not u but rather {http://www.talkbank.org/ns/talkbank}u; that is, it is the namespace in braces preceding the tag we see in the file. What this boils down to is that to find the utterances we need to do this:
utterances = the_xml.findall('{http://www.talkbank.org/ns/talkbank}u')
That will work, but it’s clunky, we need to put the namespace before any tag. So what I will do is put the namespace in its own variable:
ns = '{http://www.talkbank.org/ns/talkbank}'
utterances = the_xml.findall(ns+'u')
We can now interrogate the who and uID like this:
print(utterances[4].get('uID'))
print(utterances[4].get('who'))
What is utterance u4, though? Looking at the XML, the utterance is parent to a sequence of words (among other things), so we can collect them like this:
ws = [w for w in utterances[4]]
This isn’t quite what we want, though. This has collected the child elements, but not all of them are words. And even when they are words, we need to ask the word element for its text to get the word, if we want to print it or compare it to something.
So, there are two things we want to do. One is to make sure we are looking at words (the w tag), and the second is to collect the text (since my current side quest is to print the words of the utterance).
words = [w.text for w in utterances[4] if w.tag == ns+'w']
print(words)
Ok, good, now we’re getting somewhere. We are starting to be able to get access to the data in the corpus at a deeper level.
Dealing with the lack of POS tags
Now, one thing that this corpus does not have is any kind of part of speech tagging. Ultimately what we want to look at is the form that verbs take, but we have no good way to find the verbs.
So we need a strategy. Here’s the one I thought of, at least. First, we find out what the most common words are, and look by hand to see which of them are verbs. We then take one or a few of the most common verbs in the corpus and search for just those, to see what forms they appear in. So we are no longer looking for verb forms in general; instead we are trying to take something like a representative sample, with a couple of the verbs that we are most likely to find in varying contexts in the transcript/corpus.
Finding the words is something we can do with the basic
functions given to us by CHILDESCorpusReader
. And then we can make
a Frequency Distribution to figure out what the most common ones are.
all_words = gnpec.words(fileids=the_file)
fd = nltk.FreqDist(all_words)
print(fd.most_common(20))
In my results, I see basically go and do among the top 20 words of this transcript. Perhaps you might want to gather the words over all transcripts. Anyway, however you want to do it. This just seems like a good place to start given that we don’t have tags built into the corpus that allow us to search automatically for verbs and agreement characteristics.
So, the idea from here would be to look for the various forms in which go can occur (goes, go, went) and see how often it is arguably infinitive in a main clause. Or perhaps simply missing agreement (in French it is at least plausible that agreement can come out as 3rd person singular as a kind of default when the agreement is deficient somehow).
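As a sketch of what that tally might look like, here I use collections.Counter in place of nltk.FreqDist, with a made-up list of child utterances standing in for the real output of gnpec.sents():

```python
from collections import Counter

# Hypothetical child utterances, standing in for the real corpus data.
child_sents = [
    ['I', 'want', 'go', 'play'],
    ['he', 'goes', 'there'],
    ['we', 'went', 'home'],
    ['go', 'away'],
]

# Tally the forms of "go", so we can then inspect how often the bare form
# shows up where an inflected form would be expected.
go_forms = {'go', 'goes', 'going', 'went', 'gone'}
counts = Counter(w for sent in child_sents for w in sent if w in go_forms)
print(counts)  # Counter({'go': 2, 'goes': 1, 'went': 1})
```

From there, the real work is looking at the hits in context to decide which bare forms are plausibly main-clause infinitives.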
Standoff annotation
It might be that you want to add some coding to a corpus that you have. For example, perhaps your project might be to look at a transcript and code for whether (a) a child’s utterance is prompting an adult’s utterance/repetition, or (b) a child’s utterance is imitating or resulting from an adult’s utterance. This will not be coded in the transcripts already; it will require coding by hand.
One way to do this would be to actually edit the XML file and add the tags in. To do this would require that you not mess up the XML file in the process, which is potentially not trivial.
The way I’d be more comfortable doing this would be to leave the original corpus as it is, but to create a second file that has the coding for each utterance you want to add a code to. More concretely, I’m suggesting a second annotation file that contains something like this:
u0 RESP
u2 RESP
u4 PROMPT
u6 PROMPT
u8 PROMPT
The intent here is that u0 and u2 are CHI utterances in which the child is responding to a prompt, and u4, u6, and u8 are utterances in which the child is asking or prompting the adult to respond.
Once this is coded, one might look to see whether, say, English transcripts show a different pattern from transcripts in another language.
So the goal would be to use the CHILDES transcript together with the new file of extra annotations. Since these are annotations to the file oli33b06m.xml, we can save the annotations file as oli33b06m.xml.ann.txt (the idea being that you can locate the annotations by taking the fileid you are using in CHILDES/XML and adding .ann.txt to it).
This is called “standoff annotation” in the NLTK book chapter 11, because it is not a direct modification of the original corpus, but is a separate kind of “overlay” that stands apart but points to spots in the original corpus.
To load this up, you can do this (making some assumptions here about where the annotation files will go):
annroot = '/Users/hagstrom/nltk_data/annotations/'
annfile = '{}{}.ann.txt'.format(annroot, the_file)
with open(annfile) as f:
    annotations = [l.strip().split() for l in f if len(l.strip().split()) > 0]
The l.strip().split() part will remove the return character from the end of each line and then break the line up into a list of its whitespace-separated pieces. So the first line would result in a list like ['u0', 'RESP']. And the entire file will be read into the annotations list.
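Just to make the mechanics concrete on a single line:

```python
# strip() removes the trailing newline; split() breaks on runs of whitespace.
line = 'u4  PROMPT\n'
print(line.strip().split())  # ['u4', 'PROMPT']

# A blank line yields an empty list, which the len(...) > 0 test filters out.
print('\n'.strip().split())  # []
```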
Now, if you want to go through annotations and retrieve the utterance that corresponds to the annotation, you can do this:
for a in annotations:
    u = the_xml.find(".//*[@uID='{}']".format(a[0]))
    words = [w.text for w in u if w.tag == ns+'w' and w.text]
    print('Utterance {}: {}: {}'.format(u.get('uID'), u.get('who'), ' '.join(words)))
This finds any tag that has a uID attribute matching the one in the current line of the annotations file, then assembles the words, and prints what it found.
Or, you could go through the corpus but catch cases where you have an extra annotation from the annotation file. To do this, it would be better to reorganize the annotations list so that we can look up the annotations by utterance ID. We can do this with a dictionary in Python. At the moment the annotations file is set up in such a way that it can pretty much automatically create this, because annotations is just a list of 2-member lists. All you have to do is:
anndict = dict(annotations)
print(anndict['u4']) # 'PROMPT'
But this is not fully general; if you had multiple annotations on a line, this would no longer work. It is only good for the special case where each line has an utterance number followed by a single tag. Better would be to use the first element in each line of the annotation file as the dictionary key and the rest of the line as the value, like so:
anndict = {a[0]: a[1:] for a in annotations}
print(anndict['u4']) # ['PROMPT']
The result is not identical (the entries are now lists instead of strings), but it is more general/adaptable.
So, now if we go through the utterances, we can check whether there is an extra annotation in the annotations file:
ns = '{http://www.talkbank.org/ns/talkbank}'
utterances = the_xml.findall(ns+'u')
for u in utterances:
    if u.get('uID') in anndict:
        promptresp = anndict[u.get('uID')][0]
    else:
        promptresp = 'UNKNOWN'
    words = [w.text for w in u if w.tag == ns+'w' and w.text]
    print('Utterance {}: {}: {}: {}'.format(u.get('uID'), u.get('who'), promptresp, ' '.join(words)))
One thing you might consider if you do this for a perhaps more serious project is to record the corpus version number in your annotation file, so that it is clear what version of the corpus you are working with. The version is available as an attribute of the top-level element of the transcript:
the_xml.get('Version')
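Here is a sketch of one way to record it, under the assumption (mine, not anything built into the format) that lines starting with # in the annotation file are treated as comments. A tiny in-memory XML stand-in and a temporary file replace the real transcript and annotation file:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Stand-in for the real transcript: the top-level element carries Version.
chat = ET.fromstring('<CHAT Version="2.5.0"><u uID="u0"/></CHAT>')

# Write the version as a comment line at the top of the annotation file.
with tempfile.NamedTemporaryFile('w', suffix='.ann.txt', delete=False) as f:
    f.write('# corpus version: {}\n'.format(chat.get('Version')))
    f.write('u0 RESP\n')
    annfile = f.name

# When reading back, skip blank lines and comment lines.
with open(annfile) as f:
    annotations = [l.split() for l in f if l.strip() and not l.startswith('#')]
os.unlink(annfile)
print(annotations)  # [['u0', 'RESP']]
```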
Anyway
This is basically what I was trying to cover during class today. There will be more to do in your own projects, but I wanted to provide a couple of examples of how you might deal with the fact that some of the corpora are fairly sparse in terms of what they have tagged. Using this kind of a standoff annotation file keyed to the individual utterance numbers is one way that you can “extend” the corpus by hand without having to work out how to modify the corpus’ XML file itself. And, I wanted to suggest a strategy of looking for the most common words to find the most common verbs and then looking for forms of those most common verbs.
It is possible that even with all of this, the data sets you have are going to be small enough that it will be hard to say anything with much confidence. But, you can tell me what you did find at least, and what you might expect to find if you had bigger corpora (or better tagged corpora).