CHILDES lab

This lab is due February 25.

To do this lab you will need access to the CHILDES data and the CLAN analysis program. CLAN can be used either online or locally (it is available for Mac, Windows, and some flavors of Unix). The description here will be primarily for the CHILDES transcript browser (“CTB”) interface. (Another online option is the more bare-bones WebCLAN interface.)


(The task you’ll be doing in this lab assignment was originally formulated by Martha McGinnis, U. Calgary)

Picking a version of CLAN to use

The CLAN program is really a collection of several different commands that you can execute to analyze your data. There are two ways to use it. The simplest is to use the CHILDES transcript browser (“CTB”) online, but you can also download and install CLAN on your computer, along with the corpus we’ll be working with. Each has advantages and disadvantages, but my advice is to use the online CTB for this assignment, and to download the program and data if you intend to do your course project using CHILDES.

The instructions here will be tailored to the CTB, but you can refer to the notes at the end of this page about using the CLAN program on your own computer for comments on how things differ if you run the program locally, as well as a bit more elaboration on the advantages and disadvantages.

The structure of a CLAN instruction

A CLAN instruction comes in three parts, as in the example below. The first part names the sub-program (or, as I’ll call it, the “command”). In the example it is mlu. The mlu command computes the MLU (mean length of utterance) from the utterances in a transcript file. You will also be making use of the commands freq and combo before we are finished. For a more complete description of the available commands, you can consult the CLAN manual.

mlu +t*CHI nina*

The second part, after the command, contains the parameters. These modify the way in which the command operates. Above, there is one parameter given: +t*CHI. Transcripts that are in the standard format (called CHAT) are organized into “tiers”, and the major tiers represent who is speaking at any given point in the transcript. These major tiers are named with an asterisk as the first character. It is common for *CHI to be the tier assigned to the child. What +t*CHI will do in the instruction above is tell the mlu command to only consider utterances in the *CHI tier (“+t” stands for “examine tier”). That is, you only want the MLU to be computed for the child utterances. You can also examine other tiers (like *MOT for “mother”, if that’s what it’s called in the transcript) in the same way.
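To make the idea of tiers concrete, here is a minimal Python sketch of what restricting attention to the *CHI tier amounts to. It is only an illustration, not how CLAN itself is implemented. The function name select_tier is made up for this example, and the sample lines are borrowed from the combo output shown in the notes at the bottom of this page.

```python
def select_tier(lines, speaker):
    """Return the utterances on one major tier, e.g. speaker="CHI" for *CHI."""
    prefix = "*" + speaker + ":"
    # Major tiers start with an asterisk, then the speaker code, then a colon.
    return [line[len(prefix):].strip() for line in lines if line.startswith(prefix)]


transcript = [
    "*CHI: I want to play with you here .",
    "*CHI: look what my got .",
    "*MOT: I see what you got .",
    "*MOT: what did you get ?",
]

print(select_tier(transcript, "CHI"))
```

Passing "MOT" instead of "CHI" would pick out the mother's utterances in the same way, which mirrors swapping +t*CHI for +t*MOT in an instruction.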

Important note: There must be no spaces in +t*CHI. Parameters are separated by spaces, but +t*CHI is a single parameter.

The third part of the instruction, nina* above, indicates which transcripts you wish to process. You can name a particular file (such as nina09.cha), or you can use the asterisk as above as a “wildcard.” The effect of using nina* is that it will process all of the files whose names begin with “nina” (so, “nina01.cha”, “nina02.cha”, and so on).
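The three-part structure can be sketched in a few lines of Python. This is a rough illustration under simplifying assumptions, not CLAN's actual parser: it assumes that parameters are exactly the tokens beginning with + or -, and it does not handle quoted search strings containing spaces or output redirection. The name split_instruction and the file name adam01.cha are hypothetical.

```python
from fnmatch import fnmatchcase


def split_instruction(instruction):
    """Split a simple CLAN instruction into (command, parameters, files)."""
    tokens = instruction.split()  # the parts are separated by spaces
    command = tokens[0]
    # Assumption for this sketch: parameters begin with + or -,
    # and whatever is left names the files to process.
    parameters = [t for t in tokens[1:] if t[0] in "+-"]
    files = [t for t in tokens[1:] if t[0] not in "+-"]
    return command, parameters, files


command, parameters, files = split_instruction("mlu +t*CHI nina*")
print(command, parameters, files)

# The wildcard nina* picks out every file name beginning with "nina".
# adam01.cha is a made-up name included only to show a non-match.
names = ["nina01.cha", "nina02.cha", "adam01.cha"]
selected = [n for n in names if fnmatchcase(n, files[0])]
print(selected)
```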

In the CTB, the command is chosen using a popup menu in the lower left corner of the screen. The parameters and files to analyze are typed into the blank space right next to the popup menu. When you are ready to run the command, you press the Run button (or just hit return in the text box). In the list of transcripts that appears in the upper left corner of the screen, you will see a [+] symbol after each transcript. Clicking on the [+] is a shortcut that will put the name of the transcript at the end of the text in the instruction box.

The result of running a CLAN instruction will be displayed on the right side of the screen. As far as I know, there is no good way to capture this information except to select it, copy it, and paste it into a text editor (Notepad, TextEdit, Word, whatever you prefer).

Locating Nina’s transcripts

In the upper left corner, click Eng-USA, then Suppes. You should now see several files like nina01.cha; these are the transcripts we will be analyzing in this exercise.

Your assignment

The lab assignment comes in several parts.

Section 1: Sampling subject drop rate

In the first section, we’ll isolate a common verb in each of a couple of transcripts, and determine how often the subject is omitted, to get some experience with CLAN and to get some baseline information.

Part 1: Determine Nina’s age and MLU for files 01-19.

Use the mlu command to determine the MLU for these transcripts (note that there is no file nina08.cha). The command described above should do it. The MLU is the “ratio of morphemes over utterances.” To get Nina’s age, look inside each transcript; it is recorded at the top.
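If it helps to see what a “ratio of morphemes over utterances” means as arithmetic, here is a small Python sketch. It assumes the utterances have already been divided into morphemes by hand; it does not reproduce the rules the mlu command uses to count morphemes or to decide which utterances to include. The sample utterances are invented for illustration.

```python
def mean_length_of_utterance(utterances):
    """utterances: a list of utterances, each given as a list of morphemes."""
    if not utterances:
        raise ValueError("need at least one utterance")
    total_morphemes = sum(len(u) for u in utterances)
    return total_morphemes / len(utterances)


# Invented sample, already segmented into morphemes.
sample = [
    ["look", "what", "my", "got"],  # 4 morphemes
    ["play", "-ing"],               # 2 morphemes: stem plus -ing
    ["Mommy"],                      # 1 morpheme
]

# 7 morphemes over 3 utterances
print(round(mean_length_of_utterance(sample), 3))
```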

Hand in: A list containing, for each file from nina01.cha to nina19.cha, Nina’s age from the transcript, and the MLU as computed by the mlu command.

Part 2: Determine the word frequencies for two representative files.

We are going to be doing the analysis on two of Nina’s transcripts, which we will take to be representative. Those two files are nina10.cha and nina19.cha. The goal of this part is to pick a verb in each of the transcripts that occurs often. Later, we will examine all utterances containing these verbs to count things like the number of null subjects. The freq command will search a transcript and count the number of times each word appears, then provide you with a list.

The freq command works just like mlu did. The parameters are the same. We want to restrict the frequency computation to just the child utterances, and we are looking at just the files nina10.cha and nina19.cha. (You can if you wish put both file names after the parameters, and it will do the computation on both of them in a single Run, or you can do them separately.)

The result should be a list of words along with a count of how many times each word appears in the transcript. To double-check, you should find that there are 21 instances of “Mommy” in nina10.cha and 2 in nina19.cha. These results are sorted alphabetically by word, rather than by most common. In each file, look for verbs, and look for the verbs that are the most frequent. For these purposes, ignore the following verbs: have, go, want (because they can be used as an auxiliary and might behave differently) and see (because it so often occurs as “See?” which is correct without a subject). Pick the most frequent verb from each file (excluding those just listed).
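Because the list is alphabetical, it can be tedious to spot the most frequent words by eye. If you have pasted the freq output into a text file, a short Python sketch like the following could re-sort it by count. It assumes that each entry appears on its own line as a count followed by a word, which you should check against your own output. The function names are made up for this example, and the sample counts are invented (only the Mommy count comes from the check above).

```python
def parse_freq_output(text):
    """Collect word counts from lines that look like '<count> <word>'."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip headers, blank lines, and anything not in the assumed layout.
        if len(parts) == 2 and parts[0].isdigit():
            counts[parts[1]] = int(parts[0])
    return counts


def by_frequency(counts):
    """Most frequent first; ties are broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented sample in the assumed layout, not real freq output.
sample = """
  3 a
 21 Mommy
  5 see
  2 seeing
"""

for word, count in by_frequency(parse_freq_output(sample)):
    print(count, word)
```

You would still need to pick out the verbs yourself, since nothing in this sketch knows which words are verbs.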

Hand in: For each file (nina10.cha, nina19.cha), the verb you’ll look at and the number of times it occurs (according to freq). Don’t forget to count all forms (so, not just the 1sg present form, but also the past form and the progressive –ing form). As a check, the forms I picked each have 24 occurrences.

Part 3: Search the transcripts for the examples.

Having picked a common verb from each file, what we’re going to do is look at each time the verb is used in the transcript and count how often it appears with a subject (the idea being that we can extrapolate this result to estimate overall rates of subject omission).

To perform the search, use the CLAN command combo (you might want to refer to the notes on combo at the bottom of this page). You will want to use the -w2 parameter in order to display the two lines preceding any matches, so you can get an idea of the context in which the matched utterances occur. The context will allow you to determine whether an utterance should be excluded. Here too, be sure you include all forms of the verb (1st present, past tense, -ing form, etc.). You can search for more than one form using something like +s"(see+seeing+saw+seen)" as a parameter.
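The search string for several forms follows a simple pattern: the forms are joined with + inside parentheses and quotation marks, after +s. As a small sketch (the helper name or_search is made up for this example), it could be assembled in Python like this:

```python
def or_search(forms):
    """Build a combo +s parameter that matches any one of the given word forms."""
    if not forms:
        raise ValueError("need at least one form")
    # x+y means "x or y", so the forms are joined with plus signs.
    return '+s"(' + "+".join(forms) + ')"'


print(or_search(["see", "seeing", "saw", "seen"]))
# prints +s"(see+seeing+saw+seen)"
```

Remember that the result is a single parameter, so it must not contain spaces.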

The results of these searches will be somewhat large. You will want to copy and paste the results into a text file (using whatever you find convenient, e.g., TextEdit, Microsoft Word, whatever you have on hand).

Hand in: The full combo instruction you used (you can copy this from the top of the search result) and the number of matches it returned (you can copy this from the bottom of the search result).

Part 4: Count up and report on the totals

Now, go through each example and decide which of the following categories it falls under. Be sure to read the “exclusion” criteria carefully.

X. Excluded. The utterance is (a) a repetition of an immediately preceding utterance (either by the child or the adults), (b) incomprehensible, (c) part of a rote-learned expression (e.g., “…how I wonder what you are”), (d) an imperative or infinitive where a subject is not required in the adult language. We do not want to count these because they are not certain to reflect the child’s productive grammar, or because no subject is required in the adult language.

O. Overt subject. The verb has an overt (pronounced) subject. Among these I include cases where a modal (like will) or an auxiliary (like am or don’t or go) comes before the verb.

N. Null subject. The verb should have had a subject, but the subject is missing.

F. Fragment. These look a lot like null subjects, but if a child answers a question like “what are you doing?” with “Eating sandwiches”, it isn’t accurate to call that a null subject utterance. However, in response to “What were the monkeys eating?”, “Eating a balloon” should count as a null subject (not as a fragment), since this is not a well-formed fragment in adult speech. We will exclude these from the analysis, but you might as well differentiate them from the X category.

Hand in: Create a 2×3 table of results (2 rows and 3 columns) like the one below. Fill in the overt and null subject numbers for each file. In the third column, compute the percentage of included utterances for each file that have overt subjects (divide the number of overt subjects by the sum of both overt and missing subjects, and then multiply by 100).

Hand in: Write a couple of sentences that describe the results in the table.

              null subjects   overt subjects   percentage with overt subjects
nina10.cha    N               O                100*O/(N+O)
nina19.cha    N               O                100*O/(N+O)
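If you record one category letter (X, O, N, or F) per example as you go, the tallying and the percentage can be sketched in Python as follows. The list of codes is invented purely to show the arithmetic; it does not represent results from either transcript, and the function name overt_percentage is made up for this example.

```python
from collections import Counter


def overt_percentage(codes):
    """codes: one letter per example, using the categories X, O, N, F."""
    tally = Counter(codes)
    overt, null = tally["O"], tally["N"]
    if overt + null == 0:
        raise ValueError("no included utterances (only X and F codes)")
    # X and F are excluded, so only O and N enter the percentage.
    return overt, null, 100 * overt / (overt + null)


# Invented codes for illustration only.
example_codes = list("OONXOFNOOX")
overt, null, percent = overt_percentage(example_codes)
print(overt, null, round(percent, 1))
```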

Section 2: Determine how often subjects are dropped in wh-questions

In this section, we will look specifically at wh-questions to see what difference, if any, there is in the number of subjects omitted.

Part 5: Find the wh-questions

We’re going to use CLAN to study Nina’s use of subject drop in wh-questions, over two different time periods. So, first we need to find the wh-questions. We want to be sure we get all of them (including, e.g., “what’s”), so you can use the following search string as one of the parameters.

+s"(who*+what*+when*+how*+why*+where*+whose+which)"
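To get a feel for what the wildcards in this search string will and will not pick up, here is a rough Python approximation using shell-style pattern matching. It is a sketch only: combo's own matching may differ in details, and the lowercasing step is an assumption made for this example.

```python
from fnmatch import fnmatchcase

# The alternatives from the search string above.
WH_PATTERNS = ["who*", "what*", "when*", "how*", "why*", "where*", "whose", "which"]


def matches_wh_search(word):
    """True if the word fits any one of the alternatives."""
    # Lowercasing is an assumption of this sketch, not a documented CLAN behavior.
    return any(fnmatchcase(word.lower(), pattern) for pattern in WH_PATTERNS)


for word in ["what's", "Where", "however", "house"]:
    print(word, matches_wh_search(word))
```

Note that a word like “however” fits how* even though it does not introduce a wh-question, which is one reason every match has to be checked by hand in the next part.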

Do two separate searches. Do the first search on transcripts 01-09, and the second search on transcripts 10-19. When you do these searches, you will get a large result, which you will want to copy into a text file.

Hand in: The two CLAN instructions you used to do the searches, and the number of matches for the last file in each search (that is, give me the number of matches in nina09.cha and nina19.cha).
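If you want to cross-check the per-file counts after pasting the results into a text file, a sketch like the following could help. It assumes that every match in the output begins with a header line of the form *** File "…": line N. (as in the sample output in the notes on combo at the bottom of this page), and that the file name is the last piece of the path. The second match in the sample below is invented.

```python
import re
from collections import Counter

# Assumed header layout: *** File "<path>": line <number>.
HEADER = re.compile(r'^\*\*\* File "(?P<path>[^"]+)": line (?P<line>\d+)\.')


def matches_per_file(output_text):
    """Count match headers per transcript in pasted combo output."""
    counts = Counter()
    for line in output_text.splitlines():
        found = HEADER.match(line.strip())
        if found:
            # Keep only the file name; folders may be separated by colons or slashes.
            name = re.split(r"[:/\\]", found.group("path"))[-1]
            counts[name] += 1
    return counts


# First header copied from the sample output on this page; the second is invented.
sample = '''*** File "Moxie:CLAN:suppes:nina19.cha": line 254.
*CHI: look what my got .
*** File "Moxie:CLAN:suppes:nina19.cha": line 300.
*CHI: what my doing ?
'''

print(matches_per_file(sample)["nina19.cha"])
```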

Part 6: Count up and report on the totals

What we care about here are wh-questions where the wh-word is not the subject, and among those we will count the number of overt and missing subjects. So, omit from consideration all of those wh-questions (a) where the wh-word is the subject, (b) which are direct repetitions of a previous utterance, and (c) where the classification cannot be determined.

You will see a lot of examples like: “What’s that?” Let’s consider this to be derived from “That is what”, so “that” is the subject (and it is overt), and “what” is the object. Same for “where’s my candy”, “who’s that”.

Hand in: Create a 2×3 table of results (2 rows and 3 columns) like the one below. Fill in the overt and null subject numbers for each set of files. In the third column, compute the percentage of included utterances for each range of files that have overt subjects (divide the number of overt subjects by the sum of both overt and missing subjects, and then multiply by 100).

Hand in: Write a couple of sentences that describe the results in the table.

                            non-subject wh-word,   non-subject wh-word,   percentage of wh-questions
                            null subject           overt subject          with overt subjects
Early transcripts (01-09)
Late transcripts (10-19)

Part 7: Discuss the comparison with Valian’s (1991) results.

Consider the tables below, from O’Grady (1997), based on data from Valian (1991). They show overall percentages of dropped subjects in general, not just in (non-subject) wh-questions.

Group   No. of children   Age range     MLU
I       5                 1;10 – 2;2    1.53 – 1.99
II      5                 2;3 – 2;8     2.24 – 2.76
III     8                 2;3 – 2;6     3.07 – 3.72
IV      3                 2;6 – 2;8     4.12 – 4.38

Table 1. English-speaking children in Valian’s study (based on Valian 1991:38)

Group   Mean   Range
I       69%    55-82%
II      89%    84-94%
III     93%    87-99%
IV      95%    92-95%

Table 2. Proportion of utterances containing a subject (based on Valian 1991:44-45)

Hand in: Describe how your results on subject omission for the verbs you chose in part 4 compare with what Valian found. Mention things like whether you found more or less omission than Valian found, and pay particular attention to the groups of children whose age and/or MLU match the transcript you are looking at.
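One way to find the comparable group is to check which of the MLU ranges in Table 1 contains the MLU you computed for a transcript. The Python sketch below does this lookup and also converts ages written as years;months into months. The function names are made up for this example, and the handling of a trailing day count (as in 2;3.15) is an assumption about how ages are written in the transcripts.

```python
# MLU ranges copied from Table 1 above.
VALIAN_MLU = {
    "I": (1.53, 1.99),
    "II": (2.24, 2.76),
    "III": (3.07, 3.72),
    "IV": (4.12, 4.38),
}


def age_in_months(age):
    """Convert an age such as '2;3' (years;months) to whole months."""
    years, rest = age.split(";")
    # Assumption: anything after a period is a day count, which is ignored here.
    months = rest.split(".")[0]
    return 12 * int(years) + int(months)


def groups_for_mlu(mlu):
    """Groups whose MLU range contains the value (may be empty, since the ranges have gaps)."""
    return [g for g, (low, high) in VALIAN_MLU.items() if low <= mlu <= high]


print(age_in_months("2;3"), groups_for_mlu(2.5), groups_for_mlu(3.0))
```

An MLU that falls in a gap between ranges returns no group at all, in which case you will have to judge which neighboring group is the better comparison.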

Hand in: Describe how your results on subject omission in wh-questions (from part 6) compare to the overall rate of subject omission. Mention things like whether subjects are dropped more often or less often in wh-questions.

Hand in: Consider your results in light of the hypothesis that “topic drop” accounts for some of the cases of subject omission in Child English (cf. comments about Bromberg & Wexler 1995 from the class handouts). Do your results support this hypothesis? Briefly explain why or why not.

References

O’Grady, William (1997). Syntactic Development. Chicago: University of Chicago Press.

Valian, Virginia (1991). Syntactic subjects in the early speech of American and Italian children. Cognition 40:21-81.

Comments on combo

CLAN includes a relatively powerful searching tool called combo. I will outline a couple of points here, although you should probably refer to the CLAN manual for more information.

An example of a combo instruction is given below:

combo +t*CHI +w2 -w2 +s"what^my" nina* > whatmy.txt

This command says:

  • combo: the command
  • +t*CHI: restrict attention to the lines uttered by the child
  • +w2: show me the line you find and two lines after it.
  • -w2: show me the line you find and two lines before it.
  • +s"what^my": search for what followed directly by my.
  • nina*: search all of the files in the Working directory that begin with nina.
  • > whatmy.txt: Save the results in a file called whatmy.txt in the Output directory. (Note: this part of the instruction is only applicable to running CLAN on your own computer)

This will look for what immediately followed by my in any of the nina files, returning something like this:

*** File "Moxie:CLAN:suppes:nina19.cha": line 254.
*CHI: I want to play with you here .
*CHI: look what my got .
*CHI: look (1)what (1)my got .
*MOT: I see what you got .
*MOT: what did you get ?

You can see that we used the “^” character in the search string. This character means “immediately followed by”, so what we searched for was what immediately followed by my. In these search strings there are several other special characters that you can use.

  • x^y: Finds x immediately followed by y. x and y are full words (bounded by spaces).
  • *: Finds anything.
  • _ (the underline character): Finds any one character.
  • x+y: Finds x or y.
  • !x: Finds anything except x.

You can combine these in various ways to get useful effects. A couple of common things you might use are:

  • x^*^y: Finds x eventually followed by y (unlike with x^y, y does not need to immediately follow x). Literally this means, search for x, immediately followed by anything, immediately followed by y.
  • *ing: Finds anything that ends in ing. For example, verbs like swimming. Of course it will also get some irrelevant things like thing, boring, etc.
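To make the meaning of a few of these special characters concrete, here is a small Python sketch that imitates *, _, and x^y on a single utterance. It covers only those three pieces, treats words as space-separated, and is meant as an illustration of the descriptions above rather than a reimplementation of combo. The function names are made up for this example.

```python
import re


def word_regex(pattern):
    """Turn a single-word pattern using * and _ into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # * stands for anything
        elif ch == "_":
            parts.append(".")    # _ stands for any one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def immediately_followed(utterance, x, y):
    """True if a word matching x is directly followed by a word matching y (x^y)."""
    words = utterance.split()
    rx, ry = word_regex(x), word_regex(y)
    return any(rx.fullmatch(a) and ry.fullmatch(b) for a, b in zip(words, words[1:]))


print(immediately_followed("look what my got .", "what", "my"))
print(bool(word_regex("*ing").fullmatch("swimming")))
print(bool(word_regex("*ing").fullmatch("thing")))
```

The last line shows the caveat mentioned above: a pattern like *ing matches thing just as readily as it matches swimming.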

Some example combo instructions are:

combo +t*CHI +w2 -w2 +s"the^*^!grey^*^(dog+cat)" nina*

This will search for the followed eventually (^*^ means “followed by anything followed by…”) by something other than grey (!grey means “not grey”), followed eventually by either dog or cat (dog+cat means “either dog or cat”). It will not find the grey cat but it will find the black cat, the big red dog, etc.

combo +t*CHI +w2 -w2 +s"my^*^*ing" nina*

This will search for all instances of my followed eventually by something that ends in ing.

If you are running CLAN on your own computer, you can use a “search” file instead of typing in the thing you are searching for each time. The “search” file is a text file that contains the things you want to search for, one item per line. combo will match if an item from any line is found.

If the file containing your search items is called search-1pron.txt in your Working directory, then you could do the search with the following combo instruction, where the @ tells combo to look in your file for the list of things to search for.

combo +t*CHI +w2 -w2 +s@search-1pron.txt nina* > pron1-nina.txt