CHILDES lab

Background / demo

This page includes some links and notes relating to the CHILDES demo.

Snapshots of the data sets used in William Snyder’s textbook (Child language: The parametric approach), in case you want to follow along with his discussion in the book. At least one intermediate file created in the process is here in addition to the CHILDES data sets. Also note that in the data for Sarah, the child utterances are in the SAR tier in this version, although in the current version of the dataset within CHILDES, the tier has been renamed CHI (which is more conventional).

The traditional way to use the CHILDES database is with a set of analysis programs called CLAN (Computerized Language ANalysis). You can download these and install them on your own computer and do the data analysis that way. A few years ago, a web interface called the CHILDES Browsable Database was made available that allows you to do much of this work online as well. The downside of doing it online this way is that it is somewhat trickier to save the output, and you can’t feed it your own input files. So for more complex projects, it is likely still preferable to have it on your local machine. However, the local version does show its age, and it can be somewhat unforgiving when you are trying to install it or get it to find its directories.

CLAN download page. On Mac OS X, download dmg, open, double-click install package. Will also install CAfont by default. Windows and *nix versions can also be retrieved here.

For complete information on the format and searching utilities, look at the CHAT manual and the CLAN manual.

Using the CHILDES browsable database will suffice for simple projects. One limitation is that you cannot use input files or save to output files (although for output you can use copy and paste).

For the class demonstration, we’re going to mainly use the web interface, and you can probably manage the CHILDES lab there as well.

Some things to try in the browser.

  • Eng-NA > Bloom70 > Peter > peter07.cha. (020100 I believe)
  • Eng-NA > Brown > Adam > adam32. (030529 I believe)
  • Run MLU on adam32. Limit to CHI with +t*CHI. Results by copy and paste.
  • combo on adam32: +t*CHI +s"*^?"
  • Look at morpheme tier on these questions.
  • Look in child and morpheme tiers: +t*CHI +t%mor
  • Search the morpheme tier now exclusively for aux or be ending with ?
  • Almost +s"%mor:^*^(aux+be)^*^?"
  • Roast beef and chicken flowers?
  • Passing the time away, expanding the window. Trying this from Eve’s directory. +t*CHI +w2 -w2 +s"pass^the^time" 020300a.cha

The CHILDES lab assignment

The lab is in two conceptual parts: the data collection and the write-up. The write-up contains the results of the data collection, and it is all due on March 2.

To do this lab you will need access to the CHILDES data and the CLAN analysis program. The analysis programs can be used either online or locally (available on Mac, Windows, and some flavors of Unix). The description here will be primarily for the CHILDES browsable database (“CBD”) interface.

On this page, you will find information on picking a version of CLAN to use, the structure of a CLAN instruction, locating Nina’s transcripts, the lab assignment itself, and some comments on the combo command.

(The task you’ll be doing in this lab assignment was originally formulated by Martha McGinnis, University of Victoria.)

Picking a version of CLAN to use

The CLAN program is really a collection of several different commands that you can execute to analyze your data. There are two ways that you can use it. The simplest is to use the CHILDES Browsable Database (CBD) online, but you can also download and install it on your computer, along with the corpus we’ll be working with. Each has advantages and disadvantages, but my advice is to use the online CBD for this assignment, and to download the program and data if you intend to do your course project using CHILDES.

The instructions here will be tailored to the CBD, but you can refer to the notes at the end of this page about using the CLAN program on your own computer for comments on how things differ if you run the program locally, as well as a bit more elaboration on the advantages and disadvantages. If I get that far.

Note that the CBD says that it works well with Chrome, Firefox, and Opera, but not well with Edge or Safari.

The biggest disadvantage of using the CBD that I can see at the moment is that it is difficult to feed it custom searches with files (rather than typing full searches out on the command line).

The structure of a CLAN instruction

A CLAN instruction comes in three parts. Here is an example of one such instruction. The first part names the sub-program (or, as I’ll call it, the “command”). In the example below it is mlu. The mlu command computes the MLU (mean length of utterance) from the utterances in a transcript file. You will also be making use of the commands freq and combo before we are finished. For a more complete description of the available commands, you can consult the CLAN manual.

mlu +t*CHI *.cha

The second part, after the command, contains the parameters. These modify the way in which the command operates. Above, there is one parameter given: +t*CHI. Transcripts that are in the standard format (called CHAT) are organized into “tiers”, and the major tiers represent who is speaking at any given point in the transcript. These major tiers are named with an asterisk as the first character. It is common for *CHI to be the tier assigned to the child. What +t*CHI will do in the instruction above is tell the mlu command to only consider utterances in the *CHI tier (“+t” stands for “examine tier”). That is, you only want the MLU to be computed for the child utterances. You can also examine other tiers (like *MOT for “mother”, if that’s what it’s called in the transcript) in the same way.

Important note: There must be no spaces within the string +t*CHI. Parameters are separated by spaces, but +t*CHI is a single parameter.

The third part of the instruction, *.cha above, indicates which transcripts you wish to process. You can name a particular file (such as 020212.cha), or you can use the asterisk as above as a “wildcard.” The effect of using *.cha is that it will process all of the files whose names end with “.cha”.
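To make the three parts concrete, here is a small Python sketch of what “restrict to the *CHI tier” and the *.cha wildcard amount to. This is not CLAN itself, and the transcript lines are invented for illustration:

```python
import fnmatch

# Toy CHAT-style transcript lines (invented for illustration, not real data).
transcripts = {
    "020212.cha": [
        "*CHI:\tI want juice .",
        "*MOT:\tyou want juice ?",
        "*CHI:\twant more juice .",
    ],
}

def select_tier(lines, tier):
    """Keep only the lines belonging to the named major tier (e.g. '*CHI')."""
    return [line for line in lines if line.startswith(tier + ":")]

# The *.cha wildcard: process every file whose name matches the pattern.
for name in fnmatch.filter(transcripts, "*.cha"):
    child_lines = select_tier(transcripts[name], "*CHI")
    print(name, len(child_lines))  # two of the three lines are child utterances
```

The point is just that +t*CHI is a filter applied before any counting happens, and *.cha is a filename pattern, not a literal filename.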

In the CBD, the command is chosen using a popup menu in the lower left corner of the screen. The parameters and files to analyze are typed into the blank space right next to the popup menu. When you are ready to run the command, you press the Run button (or just hit return in the text box). In the list of transcripts that appears in the upper left corner of the screen, you will see a [+] symbol after each transcript. Clicking on the [+] is a shortcut that will put the name of the transcript at the end of the text in the instruction box.

The result of running a CLAN instruction will be displayed on the right side of the screen. As far as I know, there is no good way to capture this information except to select it, copy it, and paste it into a text editor (Notepad, TextEdit, Word, whatever you prefer).

Locating Nina’s transcripts

In the upper left corner, click Eng-NA, then Suppes. You should now see several files like 011116.cha; these are the transcripts we will be analyzing in this exercise.

Your assignment

The lab assignment comes in two major parts, one centered around collecting the data (in Collection Sections 1 and 2 below), and the other centered around reporting on it (the Reporting section below).

  • Collection Section 1: Sampling subject drop rate.
    • Part 1. Determine Nina’s age and MLU for files up to 020328.
    • Part 2. Determine the word frequencies for two representative files.
    • Part 3. Search the transcripts for the examples.
    • Part 4. Count up and record the totals.
  • Collection Section 2: Determine how often subjects are dropped in wh-questions.
    • Part 5. Sketch a hypothesis.
    • Part 6. Find the wh-questions.
    • Part 7. Count up and record the totals.
  • Reporting: Writing up the results.
    • Part 1. Background.
    • Part 2. Methodology.
    • Part 3. Subjects.
    • Part 4. Results.
    • Part 5. Discussion.

Section 1: Sampling subject drop rate

In the first section, we’ll isolate a common verb in each of a couple of transcripts, and determine how often the subject is omitted, to get some experience with CLAN and to get some baseline information.

Part 1: Determine Nina’s age and MLU for files 01-19.

Back in the olden days, these files used to be called ninaXX.cha for various numbers. Now they are named by a different convention as a six-digit number. Fun game: what is the naming convention? Open a couple and read the top and it will probably become evident. But when you finish this task it will be obvious, regardless.

Use the mlu command to determine the MLU for the transcripts up to 020328. The command described above should do it. The MLU is the “ratio of morphemes over utterances.” To get Nina’s age, you can look inside each transcript; it is recorded at the top.

A note on computing MLU with CLAN: The default behavior of the CLAN program is to compute the Mean Length of Utterance in terms of average morphemes per utterance. However, the traditional measure of MLU is done in terms of average words per utterance. So, if you just use the command given earlier in the page, you will wind up with the “morpheme” version, which is of course going to be a higher number than the “word” version would have been (since many words have multiple morphemes within them). There are reasons one might prefer the “morpheme” version, but at the very end of this lab you’ll be asked to compare what you find in Nina’s files to what was reported by Valian (1991), and this comparison is only really possible if you have the “words” version of Nina’s MLU. So, you may as well get the “words” version here instead of the “morphemes” version. This is accomplished by using something like the following command (which will get the “word” MLU for all of the files whose names end with .cha):

mlu +t*CHI -t%mor *.cha
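The arithmetic behind the “words” version is simply total words divided by total utterances. A quick Python sketch, using invented utterances (this is an illustration of the computation, not of CLAN’s actual code):

```python
# Invented child utterances (one string per utterance, words separated by spaces).
utterances = [
    "want juice",
    "I want more juice",
    "doggie running",
]

def mlu_words(utts):
    """Mean length of utterance in words: total words / total utterances."""
    total_words = sum(len(u.split()) for u in utts)
    return total_words / len(utts)

print(round(mlu_words(utterances), 2))  # (2 + 4 + 2) / 3 = 2.67
```

The “morpheme” version would count want-ed as two units where this counts one, which is why the morpheme MLU comes out higher.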

Note down: A list containing, for each file from 011116.cha to 020328.cha, Nina’s age from the transcript, and the MLU as computed by the mlu command. You might just want to copy and paste the output of the mlu command for later reference.

Part 2: Determine the word frequencies for two representative files.

We are going to be doing the analysis on two of Nina’s transcripts, which we will take to be representative. Those two files are 020115.cha and 020328.cha. Our ultimate goal here is to find out the rate at which Nina omits subjects at two different points in her linguistic development. We could do this by simply going through the entire transcript and counting, but that would be very long and tedious. Instead, we are going to approximate that by looking at a subset of the utterances (and count those by hand). To pick a sensible subset, we’ll find the most commonly used verb in each transcript (a different verb for each file, whatever occurs most often in that transcript), and then look at all of the sentences that contain that verb, counting things like the number of null subjects. The freq command will search a transcript and count the number of times each word appears, then provide you with a list.

The freq command works just like mlu did, except that now we don’t need to specify exclusion of the morpheme tier. We want to restrict the frequency computation to just the child utterances, and we are looking at just the files 020115.cha and 020328.cha. I won’t give you that command, but if you’re getting how this works you can figure it out. (You can if you wish put both file names after the parameters, and it will do the computation on both of them in a single Run, or you can do them separately.)

The result should be a list of words along with a count of how many times each word appears in the transcript. To double-check, you should find that there are 21 instances of “Mommy” in 020115.cha and 2 in 020328.cha. These results are sorted alphabetically by word, rather than by most common. For each file, look for verbs, and look for the verbs that are the most frequent. NOTE: For these purposes, ignore the following verbs: have, go, want (because they can be used as an auxiliary and might behave differently) and see (because it so often occurs as “See?” which is correct without a subject). Pick the most frequent verb from each file (excluding those just listed). The two files differ in their frequencies, so you’ll have one verb for the first file and a different one for the second file.

Note down: For each file (020115.cha, 020328.cha), the verb you’ll look at and the number of times it occurs (according to freq). Don’t forget to count all forms (so, not just the 1sg present form, but also the past form and the progressive –ing form). As a check, the forms I picked each have 24 occurrences.
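Conceptually, freq is just a word count. Here is a Python sketch on invented child-tier words (not real CLAN output), including the “total all forms together” step:

```python
from collections import Counter

# Invented child-tier words, for illustration only.
words = "play play playing played want play Mommy want".split()

counts = Counter(words)

# freq's default output is sorted alphabetically by word, not by count:
for word in sorted(counts):
    print(word, counts[word])

# Remember to total all forms of a verb together when picking the most frequent:
play_total = counts["play"] + counts["playing"] + counts["played"]
print("play (all forms):", play_total)
```

So a verb that looks less frequent than another form-by-form may still win once its past and -ing forms are added in.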

Part 3: Search the transcripts for the examples.

Having picked a common verb from each file, what we’re going to do is look at each time the verb is used in the transcript and count how often it appears with a subject (the idea being that we can extrapolate this result to estimate overall rates of subject omission). The first step here is to isolate those cases where the verb you’ve picked appears.

To perform the search, use the CLAN command combo (you might want to refer to the notes on combo at the bottom of this page). You will want to use the -w2 parameter in order to display the two lines preceding any matches, so you can get an idea of the context in which the matched utterances occur. The context will allow you to determine whether an utterance should be excluded. Here too, be sure you include all forms of the verb (1st present, past tense, -ing form, etc.). You can search for more than one form using something like +s"(see+seeing+saw+seen)" as a parameter.

The results of these searches will be somewhat large. You will want to copy and paste the results into a text file (using whatever you find convenient, e.g., TextEdit, Microsoft Word, whatever you have on hand).

Note down: The full combo instruction you used (you can copy this from the top of the search result) and the number of matches it returned (you can copy this from the bottom of the search result). The number of matches in each file should of course be 24, right? Right.

Part 4: Count up and report on the totals

Now, go through each example that combo found, all 48 of them, and decide which of the following categories the one you are looking at falls under. Be sure to read the “exclusion” criteria carefully.

X. Excluded. The utterance is (a) a repetition of an immediately preceding utterance (either by the child or the adults), (b) incomprehensible, (c) part of a rote-learned expression (e.g., “…how I wonder what you are”), (d) an imperative or infinitive where a subject is not required in the adult language. We do not want to count these because they are not certain to reflect the child’s productive grammar, or because no subject is required in the adult language.

O. Overt subject. The verb has an overt (pronounced) subject, like the adult language would. I include among these cases where there is a modal (like will or can) or auxiliary (like am or don’t or go) before the verb.

N. Null subject. The verb should have had a subject if an adult said it, but the subject is missing.

F. Fragment. These look a lot like null subjects, but if a child answers a question like “what are you doing?” with “Eating sandwiches”, it isn’t accurate to call that a null subject utterance. However, in response to “What were the monkeys eating?”, “Eating a balloon” should count as a null subject (not as a fragment), since this is not a well-formed fragment in adult speech. Just to see how often they appear, we will count the number of fragments, but when we do the analysis later we are going to exclude things both in the F category and in the X category.

Note down: Create a 2×3 table of results (2 rows and 3 columns) like the one below. Fill in the overt and null subject numbers for each file. In the third column, compute the percentage of included utterances for each file that have missing subjects (divide the number of missing subjects by the sum of both overt and missing subjects, and then multiply by 100).

Note down: Write a couple of sentences that describe the results in the table, just to help you keep track of what’s going on.

            null subjects   overt subjects   percentage with null subjects
020115.cha  N               O                100 * N / ( N + O )
020328.cha  N               O                100 * N / ( N + O )
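The percentage column is just the null-subject count over all included (null + overt) utterances. A quick sketch with made-up counts, to show the computation:

```python
def pct_null(null_count, overt_count):
    """Percentage of included utterances (null + overt) with a null subject.
    Excluded (X) and fragment (F) utterances are left out of both counts."""
    return 100 * null_count / (null_count + overt_count)

# Made-up counts, just to show the arithmetic (not real results):
print(round(pct_null(6, 18), 1))  # 6 null out of 24 included -> 25.0
```

Note that the fragments (F) and exclusions (X) you tallied do not enter this computation at all.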

This concludes the collection of the baseline data. Now, we can turn to the more interesting question of whether subjects are dropped more (or less) often in wh-questions than they are just in general. What we have already collected is the “just in general” rate, assuming that the files and verbs we used are basically representative.

Section 2: Determine how often subjects are dropped in wh-questions

In this section, we will look specifically at wh-questions to see what difference, if any, there is in the number of subjects omitted.

Part 5: Sketch a hypothesis

This is preparation for the write-up, but an important part of collecting the data is knowing what to collect and what is going to be interesting. This lab is designed around the question of what interaction there is, if any, between wh-questions and null subjects. The reason we would look at this is that at least one hypothesis we’re entertaining is that there is a connection between the two. The basic components of the hypothesis are these: A child can omit a subject by making it a “topic”, which is accomplished by syntactically moving the subject into the specifier of CP. This is supposed to be related to something like “Beans, I like” in adult speech (where the object “beans” has been topicalized). In addition, we maintain the usual analysis of wh-questions, according to which wh-words move into the specifier of CP. So, that’s two things that care about the specifier of CP, and we make the further assumption that only one thing can actually be in the specifier of CP. We will assume that in wh-questions, the specifier of CP is occupied by the wh-word. You can finish the thought there: what does this predict about how many null subjects we will find in wh-questions in child speech?

Note down: Write a couple of sentences about what we expect to find when we look at how many null subjects occur in wh-questions as compared to how many we find in statements, in both the case when the verb is finite and in the case when the verb is bare.

Part 6: Find the wh-questions

We’re going to use CLAN to study Nina’s use of subject drop in wh-questions, over two different time periods. So, first we need to find the wh-questions. We want to be sure we get all of them (including, e.g., “what’s”), so you can use the following search string as one of the parameters. If it’s not already clear to you, make sure you understand why this will find “who”, “whose”, and “who’s”.

+s"(who*+what*+when*+how*+why*+where*+which)"
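If it helps to see why the pattern catches all three of who, whose, and who’s: combo’s trailing * is a wildcard for any continuation of the word, much like a regular expression. Here is my rough Python regex analogue (this is a translation for illustration, not CLAN’s own implementation):

```python
import re

# Rough regex analogue of +s"(who*+what*+when*+how*+why*+where*+which)".
# combo's trailing * allows any continuation of the word; + means "or".
wh_pattern = re.compile(r"\b(who\w*|what\w*|when\w*|how\w*|why\w*|where\w*|which)\b")

for word in ["who", "whose", "who's", "whale"]:
    print(word, bool(wh_pattern.search(word)))
# who -> True, whose -> True, who's -> True (matches the "who" part), whale -> False
```

Because who* matches anything beginning with who, it catches whose as one word and who’s as who plus the clitic.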

Do two separate searches. Do the first search on the first 8 transcripts (up to and including 020106), and the second search on the rest of the transcripts up to 020328. When you do these searches, you will get a large result, which you will want to copy into a text file.

Note down: The two CLAN instructions you used to do the searches.

Part 7: Count up and record the totals

What we care about here are wh-questions where the wh-word is not the subject, and among those we will count the number of overt and missing subjects. So, omit from consideration all of those wh-questions (a) where the wh-word is the subject, (b) which are direct repetitions of a previous utterance, and (c) where the classification cannot be determined.

You will see a lot of examples like: “What’s that?” Let’s consider this to be derived from “That is what”, so “that” is the subject (and it is overt), and “what” is the object. Same for “where’s my candy”, “who’s that”.

Note down: Create a 2×3 table of results (2 rows and 3 columns) like the one below. Fill in the overt and null subject numbers for each set of files. In the third column, compute the percentage of included utterances for each range of files that have null subjects (divide the number of null subjects by the sum of both overt and missing subjects, and then multiply by 100).

Note down: Write a couple of sentences that describe the results in the table, what pattern you see.

transcripts   non-subject wh-word, null subject   non-subject wh-word, overt subject   percentage of wh-questions with null subjects
Early
Late

Reporting: Writing up the results

In the second part of this exercise, we will concentrate on writing up the results in a standard way. This is the form that you’ll want to follow in your final project as well.

An experimental write-up generally consists of five basic sections. They are: Background, Methodology, Subjects, Results, and Discussion.

Hand in: Work through each of the following five sections and create a short paper. Hand in the resulting paper.

Part 1. Background.

The Background section lays out the theoretical background to the experiment you are conducting. This generally includes a couple of sub-components, one being a basic description of the issue, and another being a review of the known results that bear on the issue. The review of existing results is usually referred to as the “literature review.” One point worth making about the literature review: There are very often space constraints on papers like this, were you to submit it for publication in a journal or for presentation at a conference. The literature review should be concise: it is important to acknowledge the important work that sets the scene for your own experiment, and to indicate what other experiments have found, but at the same time it is also important to include only those experiments and discussions that have a direct relation to the question you are investigating. For this reason, the background section will sometimes need to be written/revised fairly late in the development of the write-up. The Background section is a place where you introduce things that you’ll refer to in your methodology and discussion, and motivate your particular experiment, but until you know what the analysis of your results will be, it isn’t always clear what will and will not be relevant to include.

The Background section should also outline the question that your experiment is going to address.

In this exercise, the experiment is about what you located in the second section of the data collection: specifically, the interaction of wh-words and missing subjects.

Specifically for this assignment: Your introduction should touch on the following points. I’m presenting you with the relatively bizarre task of putting these points in your own words, but don’t try too hard. Write full sentences, flesh it out a little bit, but don’t go out of your way to avoid the specific wording I used, since some of these terms are standard and it would be weird to use synonyms.

  • Say something about the fact that children acquiring English are known to omit subjects frequently before the age of 3. There is a lot of literature on this, but for this I would suggest you cite Hyams & Wexler (1993) as a source, and leave it at that. Many of the basic points made in that paper were covered in class. Most of the points Hyams & Wexler (1993) make are negative, arguing against accounts of null subjects that take it to arise from non-syntactic properties.
  • Say that one prominent analysis of this phenomenon is that children can omit subjects that are treated as topics, and that topics occupy the specifier of CP in their syntactic structure. There is a certain degree of complexity in the full version of this hypothesis, but we’ll stick with this simple statement, and you can cite Hyams & Wexler (1993) for this hypothesis too.
  • Say that the standard analysis of wh-questions is that the wh-word is moved into the specifier of CP.
  • Say that you will be exploring the hypothesis that when a wh-word is in the specifier of CP, there is no room for a topic, which predicts that children should not omit subjects in wh-questions.

Part 2. Methodology.

The Methodology section addresses your specific experiment, laying out the way in which you propose to answer the question that was set up in the Background section. In this section, you will put in all of the specifics of the experiment. If the experiment were conducted with actual subjects in trials of some sort, you would explain what sort of stimuli you provide, and what the subjects’ task is. In this case, the experiment is being done on a corpus, so what you would discuss in the Methodology section is what you are looking for within the corpus.

Specifically for this assignment: You should indicate that this study is done by searching transcripts of child productions contained within the CHILDES database. As per the CHILDES “ground rules”, you need to cite MacWhinney (2000) when you use CHILDES. You should also outline what you searched for (it is not necessary to include the exact command used), and how you coded individual results (basically describe what you did in “Part 7: Count up and record the totals” above).

Part 3. Subjects.

The Subjects section is usually pretty short, and outlines who the participants in your experiment are, along with any relevant characteristics. It is in general important for your subjects to remain anonymous, although sometimes it is useful to assign codes to individual subjects so that you can refer to them by subject number, or initials. What characteristics are relevant depends a lot on what you are looking at. For example, in an adult L2 study, it would be important to know things like what their first language is, whether they know any other languages apart from their L1 and L2, how long they have been exposed to the L2 in question, and in what circumstances, how early they were exposed to the L2, how central the L2 is in their daily lives, that sort of thing. In a child corpus study like this one, you would list the children whose transcripts you are using and their ages and MLUs.

Specifically for this assignment: Again according to the CHILDES “ground rules”, you need to cite the source of the data. The CHILDES database manual for American English will tell you what to cite for the Nina (Suppes) files. Apart from that, indicate which dataset you were using (Nina) and describe the division into early and late files, giving an age and MLU range for the early files and for the late files.

Part 4. Results.

In the Results section, you can report (usually in some kind of tabular or summarized form) what actually came out of the experiment you conducted. It is often appropriate to characterize the patterns in the data as well in the Results section, although interpretation of what the results actually tell us should be reserved for the Discussion section. Additionally, if there are statistics to be done, the Results section is the place to report those.

Specifically for this assignment: Here, you basically just provide the table you made in “Part 7: Count up and record the totals” above. You can add your couple of sentences describing the basic pattern, as well, though the discussion of the pattern’s relation to the hypothesis you were testing should be saved for…

Part 5. Discussion.

The Discussion section is the place where you can ultimately draw everything together, and talk about what the results you found indicate about the answer to the question you were endeavoring to answer (as laid out in the Background section). Although most of the numerical results will have been reported in the Results section, it is sometimes appropriate to do additional and more abstract computations in the Discussion section. The main point here is to describe the meaning of the results you found. A secondary task for this section is to identify weaknesses in the experimental design or results, to show where a future experiment might be able to improve on the results. This is also the place to discuss what experiments might be good as follow-up studies to clarify anything left unclear by the results you found.

Specifically for this assignment: This is where most of the “thinky” stuff goes. Given what you found in your results, how does this bear on the hypothesis? What can we conclude? What shortcomings might there be in the results you found? What further studies might be helpful in addressing the questions here?

To address the hypothesis, it will be useful to compare what you found searching the wh-questions with what you found in the first section (the approximate rate at which subjects are dropped overall). For additional information about how your results line up with those reported in earlier literature, compare your results to those reported below, from O’Grady (1997), based on data from Valian (1991). They show overall percentages of dropped subjects in general, not just in (non-subject) wh-questions. To compare your results to Valian’s results, you will need to match up Nina’s ages/MLUs to those of Valian’s subjects. You will then want to describe how the rates are different in wh-questions, if they are, and then wind up by talking about whether the prediction made by the hypothesis you gave in the Background section is borne out in the results you found.

Note about the MLUs: Valian reports MLUs in terms of words per utterance, rather than morphemes per utterance, so to do this comparison, you need to be sure that the MLUs you got in Part 1 also indicate the mean words per utterance.

Group No. of children Age range MLU
I 5 1;10 – 2;2 1.53 – 1.99
II 5 2;3 – 2;8 2.24 – 2.76
III 8 2;3 – 2;6 3.07 – 3.72
IV 3 2;6 – 2;8 4.12 – 4.38

Table 1. English-speaking children in Valian’s study (based on Valian 1991:38)

Group Mean Range
I 69% 55-82%
II 89% 84-94%
III 93% 87-99%
IV 95% 92-95%

Table 2. Proportion of utterances containing a subject (based on Valian 1991:44-45)

References (for tables 1 and 2 above)

O’Grady, William (1997). Syntactic Development. Chicago: University of Chicago Press.

Valian, Virginia (1991). Syntactic subjects in the early speech of American and Italian children. Cognition 40:21–81.

Comments on combo

CLAN includes a relatively powerful searching tool called combo. I will outline a couple of points here, although you should probably refer to the CLAN manual for more information.

An example of a combo instruction is given below (the final > whatmy.txt part applies only when running CLAN locally, as noted in the breakdown):

combo +t*CHI +w2 -w2 +s"what^my" *.cha > whatmy.txt

This command says:

  • combo: the command
  • +t*CHI: restrict attention to the lines uttered by the child
  • +w2: show me the line you find and two lines after it.
  • -w2: show me the line you find and two lines before it.
  • +s"what^my": search for what followed directly by my.
  • *.cha: search all of the files in the Working directory that end with .cha.
  • > whatmy.txt: Save the results in a file called whatmy.txt in the Output directory. (Note: this part of the instruction is only applicable to running CLAN on your own computer)

This will look for what immediately followed by my in any of the nina files, returning something like this:

*** File "Moxie:CLAN:suppes:nina19.cha": line 254.
*CHI: I want to play with you here .
*CHI: look what my got .
*CHI: look (1)what (1)my got .
*MOT: I see what you got .
*MOT: what did you get ?

You can see that we used the “^” character in the search string. This character means “immediately followed by”, so what we searched for was what immediately followed by my. In these search strings there are several other special characters that you can use.

pattern   meaning
x^y       Finds x immediately followed by y. x and y are full words (bounded by spaces).
*         Finds anything
_         Finds any one character (this symbol is the underscore character)
x+y       Finds x or y
!x        Finds anything except x

You can combine these in various ways to get useful effects. A couple of common things you might use are:

pattern   meaning
x^*^y     Finds x eventually followed by y (unlike with x^y, y does not need to immediately follow x). Literally this means: search for x, immediately followed by anything, immediately followed by y.
*ing      Finds anything that ends in ing. For example, verbs like swimming. Of course it will also get some irrelevant things like thing, boring, etc.
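For readers who already know regular expressions, the combo operators translate roughly as follows. This is my own informal mapping using Python’s re module; combo matches whole words, so the translation is only approximate:

```python
import re

# Informal, approximate translations of combo search strings into Python regexes.
# combo works word-by-word; \s+ and \S* stand in for its word-level wildcards.
combo_to_regex = {
    "what^my":   r"what\s+my",        # x^y : x immediately followed by y
    "what^*^my": r"what\b.*\bmy\b",   # x^*^y : x eventually followed by y
    "*ing":      r"\S*ing\b",         # anything ending in ing
    "(dog+cat)": r"(dog|cat)",        # x+y : x or y
}

line = "look what my doggie is doing"
print(bool(re.search(combo_to_regex["what^my"], line)))  # "what" directly before "my"
print(bool(re.search(combo_to_regex["*ing"], line)))     # matches "doing"
```

Don’t lean on this too hard; for the authoritative behavior of each operator, the CLAN manual is the place to look.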

Some example combo commands are

combo +t*CHI +w2 -w2 +s"the^*^!grey^*^(dog+cat)" *.cha

This will search for the followed eventually (^*^ means “followed by anything followed by…”) by something other than grey (!grey means “not grey”), followed eventually by either dog or cat (dog+cat means “either dog or cat”). It will not find the grey cat but it will find the black cat, the big red dog, etc.

combo +t*CHI +w2 -w2 +s"my^*^*ing" *.cha

This will search for all instances of my followed eventually by something that ends in ing. If you are running CLAN on your own computer, you can use a “search” file instead of typing in the thing you are searching for each time. The “search” file is a text file that contains the things you want to search for, one item per line. combo will match if an item from any line is found.

If you are running CLAN on your own computer (rather than using the web-based CHILDES Browsable Database), you can use a command like the following (which matches the first one in this section, but has the additional > whatmy.txt at the end).

combo +t*CHI +w2 -w2 +s"what^my" *.cha > whatmy.txt

This extra part of this command says:

  • >: Save the output (result) in the file whose name follows.
  • whatmy.txt: The name of the file that will be created (or overwritten) in the Working directory with the output results. This saves you from the copy and paste step you’d have to perform if using the CBD.

Another thing you can do if you are running CLAN on your own computer is to put the search string in a text file, rather than as part of the command line. So, if you put the file containing your search items in the Working directory and call it search-1pron.txt, then you could do the search with the following combo instruction, where the @ tells combo to look in your file for the list of things to search for.

combo +t*CHI +w2 -w2 +s@search-1pron.txt *.cha > pron1-nina.txt