CHILDES lab

On this page, you will find information on:

The CHILDES lab assignment

The lab is due on Feb 14. Awww.

To do this lab you will need access to the CHILDES data and the CLAN analysis program. The analysis programs can be used either online or locally (available on Mac, Windows, and some flavors of Unix). The description here will be primarily for the CHILDES browsable database (“CBD”) interface.

There are two possible projects you can pick from. One is about case marking and verb finiteness, and the other is about wh-movement and subject omission. You can pick either one; I’m not certain that they are exactly equivalent in terms of the work involved. The case one involves more work using a spreadsheet than the wh-movement one does, but the wh-movement one involves a bit more counting by hand.

First, a bit of initial stuff that applies to both projects, and then the two projects will be outlined separately.

Background

This lab is designed to give you some experience with the CHILDES system.

The traditional way to use the CHILDES database is with a set of analysis programs called CLAN (Computerized Language ANalysis). You can download these and install them on your own computer and do the data analysis that way. A few years ago, a web interface called the CHILDES Browsable Database was made available that allows you to do much of this work online as well. The downside of working online this way is that it is somewhat trickier to save the output, and you can’t feed it your own input files. So for more complex projects, it is likely still preferable to have CLAN on your local machine. However, the local version does show its age, and it can be somewhat unforgiving when you are trying to install it or get it to find its directories.

CLAN download page. On Mac OS X, download the dmg, open it, and double-click the install package. This will also install CAfont by default. Windows and *nix versions can also be retrieved there.

For complete information on the format and searching utilities, look at the CHAT manual and the CLAN manual.

Using the CHILDES browsable database will suffice for simple projects. One limitation is that you cannot use input files or save to output files (although for output you can use copy and paste).

Picking a version of CLAN to use

The CLAN program is really a collection of several different commands that you can execute to analyze your data. There are two ways that you can use it. The simplest is to simply use the CHILDES Browsable Database (CBD) online, but you can also download and install it on your computer, along with the corpus we’ll be working with. Each has advantages and disadvantages, but my advice is to use the online CBD for this assignment, and to download the program and data if you intend to do your course project using CHILDES.

The instructions here will be tailored to the CBD, but you can refer to the notes at the end of this page about using the CLAN program on your own computer for comments on how things differ if you run the program locally, as well as a bit more elaboration on the advantages and disadvantages. If I get that far.

Note that the CBD says that it works well with Chrome, Firefox, and Opera, but not well with Edge or Safari.

The biggest disadvantage of using the CBD that I can see at the moment is that it is difficult to feed it custom searches with files (rather than typing full searches out on the command line).

The structure of a CLAN instruction

A CLAN instruction comes in three parts. Here is an example of one such instruction. The first part names the sub-program (or, as I’ll call it, the “command”). In the example below it is mlu. The mlu command computes the MLU (mean length of utterance) from the utterances in a transcript file. You will also be making use of the commands freq and combo before we are finished. For a more complete description of the available commands, you can consult the CLAN manual.

mlu +t*CHI *.cha

The second part, after the command, contains the parameters. These modify the way in which the command operates. Above, there is one parameter given: +t*CHI. Transcripts that are in the standard format (called CHAT) are organized into “tiers”, and the major tiers represent who is speaking at any given point in the transcript. These major tiers are named with an asterisk as the first character. It is common for *CHI to be the tier assigned to the child. What +t*CHI will do in the instruction above is tell the mlu command to only consider utterances in the *CHI tier (“+t” stands for “examine tier”). That is, you only want the MLU to be computed for the child utterances. You can also examine other tiers (like *MOT for “mother”, if that’s what it’s called in the transcript) in the same way.
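
For example, to compute the MLU over the mother’s utterances instead (assuming the mother’s tier is labeled *MOT in the transcript you are looking at), the instruction would be:

mlu +t*MOT *.cha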

Important note: There must be no spaces within the string +t*CHI. Parameters are separated by spaces, but +t*CHI is a single parameter.

The third part of the instruction, *.cha above, indicates which transcripts you wish to process. You can name a particular file (such as 020212.cha), or you can use the asterisk as above as a “wildcard.” The effect of using *.cha is that it will process all of the files whose names end with “.cha”.
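
For instance, given the age-based file naming used in the corpora here, a pattern like the following should process only the transcripts from age two years, one month (an illustrative pattern; check the actual file names in the corpus you are using):

mlu +t*CHI 0201*.cha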

In the CBD, the command is chosen using a popup menu in the lower left corner of the screen. The parameters and files to analyze are typed into the blank space right next to the popup menu. When you are ready to run the command, you press the Run button (or just hit return in the text box). In the list of transcripts that appears in the upper left corner of the screen, you will see a [+] symbol after each transcript. Clicking on the [+] is a shortcut that will put the name of the transcript at the end of the text in the instruction box.

The result of running a CLAN instruction will be displayed on the right side of the screen. As far as I know, there is no good way to capture this information except to select it, copy it, and paste it into a text editor (Notepad, TextEdit, Word, whatever you prefer).

Comments on combo

Diving briefly into slightly more technical territory: CLAN includes a relatively powerful searching tool called combo. I will outline a couple of points here, although you should probably refer to the CLAN manual for more information.

An example of a combo instruction is given below:

combo +t*CHI +w2 -w2 +s"what^my" *.cha > whatmy.txt

This command says:

  • combo: the command
  • +t*CHI: restrict attention to the lines uttered by the child
  • +w2: show me the line you find and the two lines after it.
  • -w2: show me the line you find and the two lines before it.
  • +s"what^my": search for what followed directly by my.
  • *.cha: search all of the files in the Working directory that end with .cha.
  • > whatmy.txt: Save the results in a file called whatmy.txt in the Output directory. (Note: this part of the instruction is only applicable to running CLAN on your own computer)

This will look for what immediately followed by my in any of the files searched (in this example output, one of the Nina transcripts), returning something like this:

*** File "Moxie:CLAN:suppes:nina19.cha": line 254.
*CHI: I want to play with you here .
*CHI: look what my got .
*CHI: look (1)what (1)my got .
*MOT: I see what you got .
*MOT: what did you get ?

You can see that we used the “^” character in the search string. This character means “immediately followed by”, so what we searched for was what immediately followed by my. In these search strings there are several other special characters that you can use.

x^y   Finds x immediately followed by y. x and y are full words (bounded by spaces).
*     Finds anything.
_     Finds any one character (the symbol is the underscore character).
x+y   Finds x or y.
!x    Finds anything except x.

You can combine these in various ways to get useful effects. A couple of common things you might use are:

x^*^y   Finds x eventually followed by y (unlike with x^y, y does not need to immediately follow x). Literally this means: search for x, immediately followed by anything, immediately followed by y.
*ing    Finds anything that ends in ing. For example, verbs like swimming. Of course it will also match some irrelevant things like thing, boring, etc.

Some example combo commands are

combo +t*CHI +w2 -w2 +s"the^*^!grey^*^(dog+cat)" *.cha

This will search for the followed eventually (^*^ means “followed by anything followed by…”) by something other than grey (!grey means “not grey”), followed eventually by either dog or cat (dog+cat means “either dog or cat”). It will not find the grey cat but it will find the black cat, the big red dog, etc.

combo +t*CHI +w2 -w2 +s"my^*^*ing" *.cha

This will search for all instances of my followed eventually by something that ends in ing. If you are running CLAN on your own computer, you can use a “search” file instead of typing in the thing you are searching for each time. The “search” file is a text file that contains the things you want to search for, one item per line. combo will match if an item from any line is found.

If you are running CLAN on your own computer (rather than using the web-based CHILDES Browsable Database), you can use a command like the following (which matches the first one in this section; note the > whatmy.txt at the end, which only works locally):

combo +t*CHI +w2 -w2 +s"what^my" *.cha > whatmy.txt

This extra part of this command says:

  • >: Save the output (result) in the file whose name follows.
  • whatmy.txt: The name of the file that will be created (or overwritten) in the Output directory with the output results. This saves you the copy and paste step you’d have to perform if using the CBD.

Another thing you can do if you are running CLAN on your own computer is to put the search string in a text file, rather than as part of the command line. So, if you put the file containing your search items in the Working directory and call it search-1pron.txt, then you could do the search with the following combo instruction, where the @ tells combo to look in your file for the list of things to search for.

combo +t*CHI +w2 -w2 +s@search-1pron.txt *.cha > pron1-nina.txt
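
For concreteness, here is what a hypothetical search-1pron.txt might contain for a first person pronoun search (these items mirror the first person search strings used in lab option 1 below):

i
i'*
me
me'*
my
my'*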

That’s particularly useful if you are searching for something relatively complicated, for example in the subject case project, where there are a bunch of pronouns to search for. You can do it all on the command line in principle, but it is a little clumsier. Still, that has to be balanced against the clumsiness of installing the CLAN program.

Lab option 1: Subject case and finiteness

The goal

The plan here is to analyze some of Nina’s use of subject pronouns and case to evaluate the connection between the case of the subject and the finiteness of the verb. Generally, we’re “replicating” part of the results from the Schütze & Wexler (1996) paper (although this isn’t exactly a replication, because we are looking at only a subpart of the data that they used in their analysis).

Generally, though, the hypothesis is that there is a relationship between the form of the verb that a child uses and the case form that a subject pronoun takes. The simplest kind of correlation would have been that finite verbs correspond to nominative subjects (I, they), and non-finite verbs correspond to non-nominative subjects (me, my). However, a) what they actually found seems a little bit more complicated, and b) determining whether a verb is finite or nonfinite is not entirely trivial.

Warmup with first person pronouns

To begin, we’ll analyze the first-person pronouns. This was demonstrated fairly quickly in class, but you may as well go through it yourself. To do this, we’ll look at a particular file from a child known as Peter. (And later at a file from a child known as Nina.)

Locating Peter’s transcripts

In the upper left corner, click Eng-NA, then Bloom70, then Peter. You should now see several files like 010908.cha; these are the transcripts we will be analyzing in this exercise.

Searching for first-person pronouns

Take a look at 020100.cha (clicking on it will bring it up in the panel on the right). This is the file we’re going to be working with at first.

We’ll start with first person pronouns. Specifically, what we are looking for are first person subjects. What are the options? Well, I of course. And me. And my. But also contracted forms like I’ll and I’m. And, given that it’s possible that children might use me, I suppose it’s possible that we might find something like me’ll or me’s or me’m.

Rather than listing out all the possibilities, we can use the wildcard character * to cover some of them, meaning that we should be able to get away with searching for the following things (note that each represents a word, surrounded by spaces). These are also designed to minimize the number of false positives; we don’t want just any word that starts with I.

I, I’*, me, me’*, my, my’*

The combo command is how searching is done. So, select combo from the command popup. Then we enter the parameters in the text box next to the command. (Notes on the combo command above are relevant here.)

We want to include at least -w2 in the command to include the two utterances prior to the line in which the search succeeded (“w” is for “window”). And +w2 is maybe also useful, just to get a slightly better sense of the context (and the reaction the child got).

We only care about what the child says, so we include +t*CHI.

To search, we include +s"search parameters" and in this case we want to search for one thing, or another, or another, etc. The way you do “or” in the search is with +, so searching for cat+dog will return instances of either cat or dog. In the present circumstance, we want all the pronoun forms from above.

The last thing is to say which file to search in. We’ll look at 020100.cha.

The ultimate result of all that is that you want to type the following into the combo command box, and then hit Run.

+t*CHI +w2 -w2 +s"i+i'*+me+my+me'*+my'*" 020100.cha

It will think briefly and then show you the result in the panel on the right side of the screen. We need to get that into a form we can use.

Moving on to third person pronouns

In class, we proceeded a bit further with these first person pronouns, which I thought was partly interesting because in this file there are quite a few utterances with my subjects, and it’s kind of neat to be able to see them in context.

There is, however, a big downside to looking at first person pronouns: the adult form of the verb that goes with first person is the same as the non-adult infinitive/bare verb. So, it is almost always impossible to tell whether the verb is adult-like or not. Since what we are looking at is the correlation between subject case and verb form, this winds up giving us almost no data. So I don’t want to drive you through all of that just to get to the end with almost nothing to show for it. Better to spend the energy on the third person pronouns.

The third person pronouns are better because the agreeing verb is distinct, so we can tell if it is adult-like (he plays) or non-adult-like (he play).

TASK: Search Nina’s transcript from 2;2.6 for third person pronouns

So, the first task: Find Nina’s transcript from when she was 2;2.6 (in the Suppes corpus), and search it for the following pronouns, in the same way we searched Peter’s transcript above.

he, he’*, his, his’*, him, him’*, she, she’*, her, her’*
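
Following the same pattern as the first person search above, the parameters should look something like this (the file name here is a guess based on the age-based naming convention, so verify it by opening the transcript and checking the age recorded at the top):

+t*CHI +w2 -w2 +s"he+he'*+his+his'*+him+him'*+she+she'*+her+her'*" 020206.cha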

Getting the result into a Google Sheet

The goal is to get this into a Google Sheets document so we can do some analysis. Here is what worked for me, and I was not able to skip any of these steps.

Double-click on the first word in the output (“combo”) to select it, then scroll the page down to the bottom of the output to where it says “123 times”. Hold down shift and click there, so that you have now selected/highlighted the whole output. Copy. (Note: it would have been great if you could just “select all” and “copy”, but that didn’t work for me; it selects too much.)

Open a plain text editor on your local computer. On a Mac, you can open TextEdit, make a new document (File->New), choose “Make plain text” under the “Format” menu, and paste your results in. Save if you like; it doesn’t matter. Then: select all, and copy. You can do this same thing on Windows in Notepad. (What we’re doing here is fixing the end-of-line characters; the results do not paste properly into Google Sheets unless you do this first.)

Go to Google Sheets, and create a new sheet. Your cursor should be in the upper left cell (column A, row 1), but if it isn’t, put it there. Paste. You should now have your results in the spreadsheet, and it should go down to around row 750.

Highlighting child utterances

This next bit is possibly surplus to requirements, but it’s kind of a cool trick and I think it reduces the pain involved in doing the counting. What we are going to do here is set it up so that it will highlight the child utterances in which the pronouns were found, so it’s easy to find them.

To start, hover on the label of the cell that heads column A and you will see that there’s a popup menu there. If you click on that you will get the option to insert 1 left. Do that. There should now be a column to the left of the search results you pasted in.

Here’s what we’re going to try to do. We are going to use that column to count how many utterances have gone by since the beginning of each result block, so that when we get to the middle one, it will paint the row yellow. Each search result is separated by a row of dashes, so we can reset our count when we have a row of dashes, and we can count up for every line that starts with an asterisk. Looking at it, you’ll see that the fourth line that starts with an asterisk is the one containing the child utterance that the search located.

This makes use of something called “conditional formatting” which is a way to apply formatting (like highlighting) when a certain condition (like being on the fourth line) is met. Step one is to make the first column hold the line we’re on, and step two is to make every row where that line we’re on is 4 yellow.

Put the cursor on the line that contains the first row of dashes in column A. For me that’s row 10, though it might depend a little bit on how you selected and pasted and things.

I’m going to build this up a little bit just so there’s a chance of it being clear what we’re doing. It winds up looking a little complicated at the end. So, start by typing this in A10 (or whatever row your first line of dashes is, but if it is not 10, replace the 10 in B10 with the number that matches your row).

=(LEFT(B10,3)="---")

It should say TRUE once you hit return. Put the cursor on A10 again, copy, move the cursor down a row to A11, and paste. A11 should now say FALSE. That’s because B10 does start with ---, and B11 does not. What the formula in A10 is doing is looking to see whether the left 3 characters of the contents of cell B10 are ---. And when we copied it to A11, it incremented the row in the formula as well, so the formula in A11 is assessing the contents of B11.

It turns out that FALSE also evaluates to 0 and TRUE to 1. So, if we subtract FALSE from 1, we get 1, and if we subtract TRUE from 1, we get 0. This has the effect both of reversing the value and of turning it into a number. So, if we change the formula in A11 to this, you should get 0 when the cell in B starts with --- and 1 when it doesn’t.

=1-(LEFT(B11,3)="---")

So, perfect. Now, the logic of what we want to do is: put 0 in this column if we are at a row of dashes, and otherwise increase the number by 1 if we are at a row that starts with *. Copy the following into A10.

=((LEFT(B10,1)="*")+A9)*(1-(LEFT(B10,3)="---"))

It should evaluate to 0. The formula refers to A9, which doesn’t have anything in it, but that’s ok (you’ll see in a moment why it is there, but it basically means “the cell in the row above this one”). Select A10, copy, and paste it into A11. It should evaluate to 1. Look at the formula in A11, and figure out how it works. It’s borderline clever, making use of the fact that FALSE is 0 and TRUE is 1, and building itself on the value from the row above it.
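
If it helps, here is the same formula (the A10 version) with the pieces spelled out; this is just an annotated restatement, not something to type in:

(LEFT(B10,1)="*")            1 if the text in B10 starts with an asterisk, otherwise 0
+A9                          add the running count from the row above
*(1-(LEFT(B10,3)="---"))     multiply by 0 at a row of dashes (resetting the count), by 1 otherwise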

Now, copy A11 and select the whole rest of the column from A12 down to row 750 or so. Paste. And you should get numbers in column A that count up from 0 to 6 cyclically, once per search result. Nice.

Now to the colorful part. Put the cursor on the cell with the first row of dashes, which for me is at B10. Select “Conditional Formatting…” from the “Format” menu in the spreadsheet. This will open a panel called “Conditional format rules”. We want to apply this to everything that we might want to colorize, so put “A10:B750” in the “Apply to range” box. That’s basically everything in the spreadsheet.

Under “Format rules”, change “Format cells if…” to “Custom formula is”. Now: for any given cell, if the formula we enter below that comes out as TRUE, the formatting will be applied, otherwise it will not. The formula is evaluated relative to the selected cell, which is why it was important to select a cell first before we opened the Conditional Formatting rules panel. If we’re looking at B10, what would have to be true for it to be colorized? Well, we’d want the number in A10 to be 4. Whenever that number over there is 4, we highlight the line. Of course, A10 isn’t 4, it’s 0. But we’re applying this assessment to the entire range, and since a) we’re at B10, and b) we’re checking A10, the meaning of assessing A10 is really assessing the cell one column to the left.

So if you type this as the formula there

=(A10=4)

You should get all the child utterances highlighted, as desired. In green probably, but you can change the color it highlights in by clicking on the color below the formula.

There is a subtle thing here that came up when I demonstrated it in class, but what we have done here is say “highlight any cell to the right of a cell that has 4 in it”. That’s not exactly what we wanted, but it amounts to the same thing in this case so far. What we really wanted was to highlight any cell in a row that has 4 in column A. In order to get what we actually meant, we need to “lock” the column to A, rather than letting it be evaluated relative to the current cell. Practically, this is accomplished by putting $ before the column. (The same thing can be done with rows as well, by putting $ before the row.) So, $A10 when evaluated relative to a selected cell in row 10 means “column A in this row” – and that is what we really meant. So, the correct formula for the conditional formatting should really be:

=($A10=4)

Once you’ve put that in, you can click the “Done” button. And at that point you can close the conditional formatting rules panel if you want, to give yourself more room to see the spreadsheet.

Counting what you found

Now that we have a somewhat more navigable spreadsheet, we want to start counting what we have here. So, first we will add some columns to record the counts in.

In the header for column A, select “Insert 1 right” to add a column, and then do that four more times, so that there are 5 columns between the counter in column A and the data, now in column G. Give the columns headers: “Nom”, “Acc”, “Gen”, “Fin”, “Nonfin”.

To help keep track of where you are, choose “Freeze” > “1 row” from the “View” menu, so that your labels always stay visible. And then, go down and in each highlighted line, put a 1 in one of the first three cells if you have a pronoun subject, and a 1 in one of the last two if you have a verb you can identify.

Right on the very first line we have an issue. The line we found is

(1)he's (.) (2)he's dirty .

How should we code this? The same line has two instances of nominative he and two finite verbs (is). Question 1: Do we count both of them? Generally we do not want to count exact repetitions because it does not necessarily reflect the active creation of a new sentence using the child’s grammar. Do these qualify? They are not identical, though it seems like maybe a kind of false start that might really reflect just two attempts at the same sentence. These are hard questions to which there is no indisputably right answer.

But suppose we take the view that we want to count both, because they are after all not identical. What should we put in the “Nom” column and the “Fin” column to reflect the fact that there were two in the same line? Here, looking ahead, the right thing to do is to put a 1 in “Nom”, a 1 in “Fin”, and then go to the row below and put another 1 in “Nom” and another 1 in “Fin”. Here’s why I suggest this: When we do the tabulation of how often things co-occur, the simplest thing to do is to multiply the case form by the verb form. So if there’s a zero (or empty cell) in one of them, the result will be zero, but if there’s a 1 in both, the result will be 1. For example: the number for “finite verb with nominative subject” can be determined by multiplying the number for “finite verb” by the number for “nominative subject”. Even if you aren’t 100% following right now, you’ll see what I mean shortly. The main thing is, you don’t want to put 2 in “Nom” and in “Fin” in this line because it will throw the math off (partly because it decouples the subject form from the verb form; we would no longer know for sure which things occurred together).

Let’s keep going a little bit. Next one is:

big dirty in [//] (1)her big mouth .

Almost certainly we want to discount this as being not a subject but a possessor, so: irrelevant to what we’re looking at. Proceeding:

(1)her big mouth .

This came right after the previous one. So that’s a repetition, we don’t want to count that one either. Then:

(1)her (.) (2)her have a big mouth .

Ok, interesting. New issues. First, we have one of these false starts again ((.)), but we wouldn’t have counted a lone pronoun her as anything anyway. This is no longer a repetition for sure. (And these three utterances are one right after another.) But now we clearly have a her subject and a nonfinite verb. Hooray! We have something to record.

But wait. Several things. First, is that an accusative subject or a genitive subject? It could be either; they both sound like her. Again, there’s no right answer here, it depends on what you’re trying to answer with the analysis. It might even be that the best idea is to make a fourth case column for “Acc/Gen” that holds counts specifically of her subjects, so you can analyze them later. The other possibilities are: ignore them because they are not unambiguous, count them as Acc, or count them as Gen. My inclination is to count them as Acc (or, keep track of them separately and then probably count them as Acc later), because unambiguous Acc subjects are a lot more frequent than unambiguous Gen subjects.

Second, in retrospect, were we right to discard her big mouth as a possessive construction? Given that this leads into her have a big mouth, maybe her big mouth is really just her have a big mouth but with the verb dropped. In which case, it at least should count as a non-nominative subject. (Though for what we’re counting it would still get ignored, because it isn’t co-occurring with a verb in any form.)

So far, we have looked at the first four utterances out of 123. So, it does require some thought, and, at certain points, judgment calls. The best thing to do is to try to make those calls in some consistent way. For the purposes of science, you probably also want to do what you can to make calls that would tend to count against your hypothesis, such that if you find that your hypothesis holds even despite that, it is all the more soundly supported.

One last note here on verb forms. In the nonfinite category, you would want to include verb forms like sleeping (in her sleeping) as well as bare forms like have. You’ll have to make a call about things like he on a horse because there are a few of them. It could arguably be that these are cases where be is omitted due to a lack of tense/inflection, such that these should count as nonfinite. Or, you could count them as not having verbs at all and just ignore them. You could also make more columns to keep track of these things in if you want to postpone the decision. There are calls to make here and not always a clear right and wrong approach, but perhaps the most important thing is to be consistent.

TASK: Continue through the rest of the examples and mark 1s in appropriate columns for subject case and verb finiteness.

Note that if you can’t determine whether a verb is finite or not (or there is no verb or something), it doesn’t matter what you put for subject case, since we’re only looking at co-occurrence. So you can also skip past any of those if you encounter them.

Computing co-occurrences

Now that the raw data is there, we can make the spreadsheet compute what the co-occurrence rates were. There are six scenarios that we want to count: Nom+Fin, Acc+Fin, Gen+Fin, Nom+Nonfin, Acc+Nonfin, Gen+Nonfin. (Though if you decided to keep track of other things like omitted be or Acc/Gen her the possible combinations multiply out considerably, and you might want to either combine things or make additional co-occurrence columns. If you have more columns, then references I make to columns below will need to be adjusted.)

The simplest way to compute this is to insert six more columns; let’s put them to the left, just after the row counter. So, in the header for column A, choose “Insert 1 right” again, six more times, to get six blank columns. You can label the columns as above: “Nom+Fin”, etc.

Then, put the cursor on the row of the first utterance. For me that would be cell B14. This is the “Nom+Fin” column so what we want here is to multiply the number in the Nom column by the number in the Fin column. If either is zero then the result in this co-occurrence column will be zero, but if they are both 1, then we’ll get 1.

=H14*K14

As it happens, the first one is both nominative and finite, so this will yield 1. In the next column (“Acc+Fin”) we want to multiply the Acc column and the Fin column, so:

=I14*K14

TASK: Continue with the other four, put formulas in that first row for columns “Gen+Fin”, “Nom+Nonfin”, “Acc+Nonfin”, “Gen+Nonfin”

Once you have all the formulas in (which should evaluate to 1 in the first column and 0 in the other five), select all six cells (from B14 to G14, or whichever row your first utterance is in) and press Copy.

Then put the cursor one row down (in B15), scroll down to the bottom, hold down shift and click in B735 (so you have selected from B15 down to the end of the data in column B) and hit Paste. The field should fill with mostly zeros.

As a last step we want to find out how many 1s there are in each of the co-occurrence columns. Go up above the data near the top of the columns and click in an empty space in the first column. For example, in B2 just under the “Nom+Fin” label. Enter:

=SUM(B10:B750)

This should evaluate to something representing the number of times nominative case occurred with a finite verb. It should be a reasonably non-zero number. (I overshot the range a little by picking B10 and B750 as its starting and ending points; you could be more precise if you want, just make sure you include all the data rows.)

You should be able to just copy this cell in B2 and paste it into C2 through G2 and it will adjust so that it is always getting the sum of the column below it.
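
For instance, after pasting, C2 should contain =SUM(C10:C750), D2 should contain =SUM(D10:D750), and so on across to G2.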

And you now have essentially the result from your study, and you can consider the implications of what you have found.

Writing it up

The last part is to write up a short mini-report about this. It can be brief, but you should include the following sections (more or less modeled on things we’ve read).

  • Background. A short review of the proposal Schütze & Wexler (1996) made and what it predicts for the co-occurrence of non-nominative subjects and non-finite verbs.
  • Methodology. CHILDES, what transcript, brief review of what you counted.
  • Results. A review of the results that you found. Probably not much more than a table and a sentence or two.
  • Discussion. What the counts you found mean about the relationship between case and finiteness.
  • References. Due to the ground rules for using CHILDES.

I anticipate this being like a 5-page thing at most. Then turn this in along with the spreadsheet.

Lab option 2: Subject omission in wh-questions

Locating Nina’s transcripts

In the upper left corner, click Eng-NA, then Suppes. You should now see several files like 011116.cha; these are the transcripts we will be analyzing in this exercise.

Your assignment

The lab assignment comes in two major parts, one centered around collecting the data (in Collection Sections 1 and 2 below), and the other centered around reporting on it (the Reporting section below).

  • Collection Section 1: Sampling subject drop rate.
    • Part 1. Determine Nina’s age and MLU for files up to 020328.
    • Part 2. Determine the word frequencies for two representative files.
    • Part 3. Search the transcripts for the examples.
    • Part 4. Count up and record the totals.
  • Collection Section 2: Determine how often subjects are dropped in wh-questions.
    • Part 5. Sketch a hypothesis.
    • Part 6. Find the wh-questions.
    • Part 7. Count up and record the totals.
  • Reporting: Writing up the results.
    • Part 1. Background.
    • Part 2. Methodology.
    • Part 3. Subjects.
    • Part 4. Results.
    • Part 5. Discussion.

Section 1: Sampling subject drop rate

In the first section, we’ll isolate a common verb in each of a couple of transcripts, and determine how often the subject is omitted, to get some experience with CLAN and to get some baseline information.

Part 1: Determine Nina’s age and MLU for files 01-19.

Back in the olden days, these files used to be called ninaXX.cha for various numbers. Now they are named by a different convention as a six-digit number. Fun game: what is the naming convention? Open a couple and read the top and it will probably become evident. But when you finish this task it will be obvious, regardless.

Use the mlu command to determine the MLU for the transcripts up to 020328; the command described above should do it (but see the note below about morphemes versus words). The MLU is the “ratio of morphemes over utterances.” To get Nina’s age, you can look inside each transcript; it is recorded at the top.

A note on computing MLU with CLAN: The default behavior of the CLAN program is to compute the Mean Length of Utterance in terms of average morphemes per utterance. However, the traditional measure of MLU is done in terms of average words per utterance. So, if you just use the command given earlier in the page, you will wind up with the “morpheme” version, which is of course going to be a higher number than the “word” version would have been (since many words have multiple morphemes within them). There are reasons one might prefer the “morpheme” version, but at the very end of this lab you’ll be asked to compare what you find in Nina’s files to what was reported by Valian (1991), and this comparison is only really possible if you have the “words” version of Nina’s MLU. So, you may as well get the “words” version here instead of the “morphemes” version. This is accomplished by using something like the following command (which will get the “word” MLU for all of the files whose names end with .cha):

mlu +t*CHI -t%mor *.cha

Note down: A list containing, for each file from 011116.cha to 020328.cha, Nina’s age from the transcript, and the MLU as computed by the mlu command. You might just want to copy and paste the output of the mlu command for later reference.

Part 2: Determine the word frequencies for two representative files.

We are going to do the analysis on two of Nina’s transcripts, which we will take to be representative: 020115.cha and 020328.cha. Our ultimate goal is to find out the rate at which Nina omits subjects at two different points in her linguistic development. We could determine this by simply going through the entire transcript and counting, but that would be very long and tedious. Instead, we are going to approximate it by looking at a subset of the utterances (and counting those by hand). To pick a sensible subset, we will find the most commonly used verb in each transcript (a different verb for each file, whatever occurs most often in that transcript), and then look at all of the sentences that contain that verb. Later, we will examine all utterances containing these verbs to count things like the number of null subjects. The freq command will search a transcript, count the number of times each word appears, and then provide you with a list.

The freq command works just like mlu did, except that now we don’t need to specify exclusion of the morpheme tier. We want to restrict the frequency computation to just the child utterances, and we are looking at just the files 020115.cha and 020328.cha. I won’t give you that command, but if you’re getting how this works you can figure it out. (You can if you wish put both file names after the parameters, and it will do the computation on both of them in a single Run, or you can do them separately.)

The result should be a list of words along with a count of how many times each word appears in the transcript. To double-check: you should find that there are 21 instances of “Mommy” in 020115.cha and 2 in 020328.cha. The results are sorted alphabetically by word, rather than by frequency. For each file, look for the verbs that are the most frequent. NOTE: For these purposes, ignore the following verbs: have, go, want (because they can be used as auxiliaries and might behave differently) and see (because it so often occurs as “See?”, which is correct without a subject). Pick the most frequent verb from each file (excluding those just listed). The two files differ in their frequencies, so you’ll have one verb for the first file and a different one for the second file.

Note down: For each file (020115.cha, 020328.cha), the verb you’ll look at and the number of times it occurs (according to freq). Don’t forget to count all forms (so, not just the 1sg present form, but also the past form and the progressive –ing form). As a check, the forms I picked each have 24 occurrences.

Part 3: Search the transcripts for the examples.

Having picked a common verb from each file, what we’re going to do is look at each time the verb is used in the transcript and count how often it appears with a subject (the idea being that we can extrapolate this result to estimate overall rates of subject omission). The first step here is to isolate those cases where the verb you’ve picked appears.

To perform the search, use the CLAN command combo (you might want to refer to the notes on combo earlier on this page). You will want to use the -w2 parameter in order to display the two lines preceding any matches, so you can get an idea of the context in which the matched utterances occur. The context will allow you to determine whether an utterance should be excluded. Here too, be sure you include all forms of the verb (1st present, past tense, -ing form, etc.). You can search for more than one form using something like +s"(see+seeing+saw+seen)" as a parameter.
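
Putting those pieces together, the instruction will look something like the following sketch, with see standing in for whatever verb you actually picked (recall that see itself is one of the excluded verbs):

combo +t*CHI -w2 +s"(see+sees+seeing+saw+seen)" 020115.cha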

The results of these searches will be somewhat large. You will want to copy and paste the results into a text file (using whatever you find convenient, e.g., TextEdit, Microsoft Word, whatever you have on hand).

Note down: The full combo instruction you used (you can copy this from the top of the search result) and the number of matches it returned (you can copy this from the bottom of the search result). The number of matches in each file should of course be 24, right? Right.

Part 4: Count up and report on the totals

Now, go through each example that combo found, all 48 of them, and decide which of the following categories the one you are looking at falls under. Be sure to read the “exclusion” criteria carefully.

X. Excluded. The utterance is (a) a repetition of an immediately preceding utterance (either by the child or the adults), (b) incomprehensible, (c) part of a rote-learned expression (e.g., “…how I wonder what you are”), (d) an imperative or infinitive where a subject is not required in the adult language. We do not want to count these because they are not certain to reflect the child’s productive grammar, or because no subject is required in the adult language.

O. Overt subject. The verb has an overt (pronounced) subject, like the adult language would. I include among these cases where there is a modal (like will or can) or auxiliary (like am or don’t or go) before the verb.

N. Null subject. The verb should have had a subject if an adult said it, but the subject is missing.

F. Fragment. These look a lot like null subjects, but if a child answers a question like “what are you doing?” with “Eating sandwiches”, it isn’t accurate to call that a null subject utterance. However, in response to “What were the monkeys eating?”, “Eating a balloon” should count as a null subject (not as a fragment), since this is not a well-formed fragment in adult speech. Just to see how often they appear, we will count the number of fragments, but when we do the analysis later we are going to exclude things both in the F category and in the X category.

Note down: Create a 2×3 table of results (2 rows and 3 columns) like the one below. Fill in the overt and null subject numbers for each file. In the third column, compute the percentage of included utterances for each file that have missing subjects (divide the number of missing subjects by the sum of both overt and missing subjects, and then multiply by 100).

Note down: Write a couple of sentences that describe the results in the table, just to help you keep track of what’s going on.

             null subjects   overt subjects   percentage with null subjects
020115.cha   N               O                100 * N / ( N + O )
020328.cha   N               O                100 * N / ( N + O )

This concludes the collection of the baseline data. Now, we can turn to the more interesting question of whether subjects are dropped more (or less) often in wh-questions than they are just in general. What we have already collected is the “just in general” rate, assuming that the files and verbs we used are basically representative.

Section 2: Determine how often subjects are dropped in wh-questions

In this section, we will look specifically at wh-questions to see what difference, if any, there is in the number of subjects omitted.

Part 5: Sketch a hypothesis

This is preparation for the write-up, but an important part of collecting the data is knowing what to collect and what is going to be interesting. This lab is designed around the question of what interaction there is, if any, between wh-questions and null subjects. The reason we would look at this is that at least one hypothesis we’re entertaining is that there is a connection between the two. The basic components of the hypothesis are these: A child can omit a subject by making it a “topic”, which is accomplished by syntactically moving the subject into the specifier of CP. This is supposed to be related to something like “Beans, I like” in adult speech (where the object “beans” has been topicalized). In addition, we maintain the usual analysis of wh-questions, according to which wh-words move into the specifier of CP. So, that’s two things that care about the specifier of CP, and we make the further assumption that only one thing can actually be in the specifier of CP. We will assume that in wh-questions, the specifier of CP is occupied by the wh-word. You can finish the thought there: what does this predict about how many null subjects we will find in wh-questions in child speech?

Note down: Write a couple of sentences about what we expect to find when we look at how many null subjects occur in wh-questions as compared to how many we find in statements, in both the case when the verb is finite and in the case when the verb is bare.

Part 6: Find the wh-questions

We’re going to use CLAN to study Nina’s use of subject drop in wh-questions, over two different time periods. So, first we need to find the wh-questions. We want to be sure we get all of them (including, e.g., “what’s”), so you can use the following search string as one of the parameters. If it’s not already clear to you, make sure you understand why this will find “who”, “whose”, and “who’s”.

+s"(who*+what*+when*+how*+why*+where*+which)"

Do two separate searches. Do the first search on the first 8 transcripts (up to and including 020106), and the second search on the rest of the transcripts up to 020328. When you do these searches, you will get a large result, which you will want to copy into a text file.
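
The shape of each instruction will be something like the following sketch; after the search string, list by name every transcript in the relevant group (only the first of the early files is shown here, with a placeholder for the rest):

combo +t*CHI -w2 +s"(who*+what*+when*+how*+why*+where*+which)" 011116.cha [remaining file names]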

Note down: The two CLAN instructions you used to do the searches.

Part 7: Count up and record the totals

What we care about here are wh-questions where the wh-word is not the subject, and among those we will count the number of overt and missing subjects. So, omit from consideration all of those wh-questions (a) where the wh-word is the subject, (b) which are direct repetitions of a previous utterance, and (c) where the classification cannot be determined.

You will see a lot of examples like: “What’s that?” Let’s consider this to be derived from “That is what”, so “that” is the subject (and it is overt), and “what” is the object. Same for “where’s my candy”, “who’s that”.

Note down: Create a 2×3 table of results (2 rows and 3 columns) like the one below. Fill in the overt and null subject numbers for each set of files. In the third column, compute the percentage of included utterances for each range of files that have null subjects (divide the number of null subjects by the sum of both overt and missing subjects, and then multiply by 100).

Note down: Write a couple of sentences that describe the results in the table, what pattern you see.

transcripts   non-subject wh-word, null subject   non-subject wh-word, overt subject   percentage of wh-questions with null subjects
Early
Late

Reporting: Writing up the results

In the second part of this exercise, we will concentrate on writing up the results in a standard way. This is the form that you’ll want to follow in your final project as well.

An experimental write-up generally consists of five basic sections. They are: Background, Methodology, Subjects, Results, and Discussion.

Hand in: Work through each of the following five sections and create a short paper. Hand in the resulting paper.

Part 1. Background.

The Background section lays out the theoretical background to the experiment you are conducting. This generally includes a couple of sub-components, one being a basic description of the issue, and another being a review of the known results that bear on the issue. The review of existing results is usually referred to as the “literature review.” One point worth making about the literature review: There are very often space constraints on papers like this, were you to submit it for publication in a journal or for presentation at a conference. The literature review should be concise: it is important to acknowledge the important work that sets the scene for your own experiment, and to indicate what other experiments have found, but at the same time it is also important to include only those experiments and discussions that have a direct relation to the question you are investigating. For this reason, the background section will sometimes need to be written/revised fairly late in the development of the write-up. The Background section is a place where you introduce things that you’ll refer to in your methodology and discussion, and motivate your particular experiment, but until you know what the analysis of your results will be, it isn’t always clear what will and will not be relevant to include.

The Background section should also outline the question that your experiment is going to address.

In this exercise, the experiment is about the things that were located in the second part of the data collection. Specifically, about the interaction of wh-words and missing subjects.

Specifically for this assignment: Your introduction should touch on the following points. I realize that I’m presenting you with the relatively bizarre task of putting these points in your own words, but this is partly why I left the points in very brief form, to give you more room. Write full sentences, flesh it out a little bit.

  • Children acquiring English are known to omit subjects frequently before the age of 3.
  • One hypothesis is that omitted subjects are topics, and topics must be moved to the specifier of CP.
  • The specifier of CP is also where wh-words move in wh-questions.
  • We assume that the specifier of CP cannot have both a wh-word and a topic in it.
  • The prediction is that children should not omit subjects in wh-questions.

Part 2. Methodology.

The Methodology section addresses your specific experiment, laying out the way in which you propose to answer the question that was set up in the Background section. In this section, you will put in all of the specifics of the experiment. If the experiment were conducted with actual subjects in trials of some sort, you would explain what sort of stimuli you provide, and what the subjects’ task is. In this case, the experiment is being done on a corpus, so what you would discuss in the Methodology section is what you are looking for within the corpus.

Specifically for this assignment: You should indicate that this study is done by searching transcripts of child productions contained within the CHILDES database. As per the CHILDES “ground rules”, you need to cite MacWhinney (2000) when you use CHILDES. You should also outline what you searched for (it is not necessary to include the exact command used), and how you coded individual results (basically describe what you did in “Part 7: Count up and record the totals” above).

Part 3. Subjects.

The Subjects section is usually pretty short, and outlines who the participants in your experiment are, along with any relevant characteristics. It is in general important for your subjects to remain anonymous, although sometimes it is useful to assign codes to individual subjects so that you can refer to them by subject number, or initials. What characteristics are relevant depends a lot on what you are looking at. For example, in an adult L2 study, it would be important to know things like what their first language is, whether they know any other languages apart from their L1 and L2, how long they have been exposed to the L2 in question, and in what circumstances, how early they were exposed to the L2, how central the L2 is in their daily lives, that sort of thing. In a child corpus study like this one, you would list the children whose transcripts you are using and their ages and MLUs.

Specifically for this assignment: Again according to the CHILDES “ground rules”, you need to cite the source of the data. The CHILDES database manual for American English will tell you what to cite for the Nina (Suppes) files. Apart from that, indicate which dataset you were using (Nina) and describe the division into early and late files, giving an age and MLU range for the early files and for the late files.

Part 4. Results.

In the Results section, you can report (usually in some kind of tabular or summarized form) what actually came out of the experiment you conducted. It is often appropriate to characterize the patterns in the data as well in the Results section, although interpretation of what the results actually tell us should be reserved for the Discussion section. Additionally, if there are statistics to be done, the Results section is the place to report those.

Specifically for this assignment: Here, you basically just provide the table you made in “Part 7: Count up and record the totals” above. You can add your couple of sentences describing the basic pattern, as well, though the discussion of the pattern’s relation to the hypothesis you were testing should be saved for…

Part 5. Discussion.

The Discussion section is the place where you can ultimately draw everything together, and talk about what the results you found indicate about the answer to the question you were endeavoring to answer (as laid out in the Background section). Although most of the numerical results will have been reported in the Results section, it is sometimes appropriate to do additional and more abstract computations in the Discussion section. The main point here is to describe the meaning of the results you found. A secondary task for this section is to identify weaknesses in the experimental design or results, to show where a future experiment might be able to improve on the results. This is also the place to discuss what experiments might be good as follow-up studies to clarify anything left unclear by the results you found.

Specifically for this assignment: This is where most of the “thinky” stuff goes. Given what you found in your results, how does this bear on the hypothesis? What can we conclude? What shortcomings might there be in the results you found? What further studies might be helpful in addressing the questions here?

To address the hypothesis, it will be useful to compare what you found searching the wh-questions with what you found in the first section (the approximate rate at which subjects are dropped overall). For additional information about how your results line up with those reported in earlier literature, compare your results to those reported below, from O’Grady (1997), based on data from Valian (1991). They show overall percentages of dropped subjects in general, not just in (non-subject) wh-questions. To compare your results to Valian’s results, you will need to match up Nina’s ages/MLUs to those of Valian’s subjects. You will then want to describe how the rates are different in wh-questions, if they are, and then wind up by talking about whether the prediction made by the hypothesis you gave in the Background section is borne out in the results you found.

Note about the MLUs: Valian reports MLUs in terms of words per utterance, rather than morphemes per utterance, so to do this comparison, you need to be sure that the MLUs you got in Part 1 are also in terms of mean words per utterance.

Group   No. of children   Age range    MLU
I       5                 1;10 – 2;2   1.53 – 1.99
II      5                 2;3 – 2;8    2.24 – 2.76
III     8                 2;3 – 2;6    3.07 – 3.72
IV      3                 2;6 – 2;8    4.12 – 4.38

Table 1. English-speaking children in Valian’s study (based on Valian 1991:38)

Group   Mean   Range
I       69%    55-82%
II      89%    84-94%
III     93%    87-99%
IV      95%    92-95%

Table 2. Proportion of utterances containing a subject (based on Valian 1991:44-45)

References (for tables 1 and 2 above)

O’Grady, William (1997). Syntactic Development. Chicago: University of Chicago Press.

Valian, Virginia (1991). Syntactic subjects in the early speech of American and Italian children. Cognition 40:21-81.