Of projects, CHILDES, and Twitter


Today was kind of a miscellaneous day, positioned as it was before the Thanksgiving break. There were basically three things discussed, mostly loosely related to the upcoming projects.

As a reminder, the plan for the “final” has changed from what was announced at the beginning of the semester (a change I remember us basically agreeing on early on). In particular, everyone will be doing a final project, regardless of the course number registered for. Students in LX390 will do a smaller project. There will be no in-class final for anyone.

Part of the idea for the LX390 project was that I’d provide a project topic, whereas the LX690 topics would be proposed by those planning to do them. So, I outlined a proposed project for LX390 students, although a) LX390 students with a different idea are not required to do it; and, b) if someone in LX690 wants to do it (perhaps more extensively), that’s ok too.

Before getting to that point, I talked a bit about getting access to corpora, which I find to be one of the most difficult hurdles to working out an interesting project idea. There are basically four corpora I’d suggest as a starting point, based on their availability, apart from the samples that NLTK comes with. One is Project Gutenberg, which is also described in the NLTK book chapter 2. This has a lot of public domain texts, generally old books. If it’s ok for your text to be coming from, like, the 1800s, this is a good source. Another is CHILDES, which we’ve talked a bit about as well (and to which I’ll come back). The last two are the British National Corpus and Twitter.

With respect to the British National Corpus, there was an announcement of the availability of BNC 2014 that caught my attention. It is available for non-commercial academic use through the Corpus Query Processor at Lancaster University, which you can sign up for. I successfully signed up for an account there and got access to BNC 2014 that way. You can do simple queries there on the included data sets. Some corpora you get “full access” to, which allows you to download them in their entirety; others you can query but can’t download in full.

For Twitter, I have successfully set up the tweepy package to access Twitter streams. I already had tweepy installed, I assume from the initial Anaconda installation, though it was not installed on the lab PCs. Installing it is pretty straightforward, though. The main documentation for tweepy does a pretty good job of covering the basic setup. I used user_timeline() and search() to good effect. Before you can use these, you need to sign in to the Twitter apps page and “create an application” that allows your Python scripts to sign in. It will eventually give you, under “Keys and Access tokens”, a set of four big strings that are used to sign in: the “Consumer Key”, “Consumer Secret”, “Access Token”, and “Access Token Secret”. When you have all four of those, you can plug them in and start retrieving things from Twitter.
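To make the sign-in step concrete, here is a minimal sketch of how the four strings get plugged in. It assumes a tweepy version in which the search method is still called search(), and the function names make_api, tweet_texts, and fetch_examples are my own, invented for illustration. The keys, account name, and query are whatever you supply.

```python
def make_api(consumer_key, consumer_secret, access_token, access_token_secret):
    """Sign in to Twitter with the four strings from the apps page."""
    import tweepy  # imported lazily so tweet_texts() works without tweepy

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    return tweepy.API(auth)


def tweet_texts(statuses):
    """Return just the text of each status object that tweepy hands back."""
    return [status.text for status in statuses]


def fetch_examples(api, screen_name, query, count=20):
    """Fetch recent tweets from one account and tweets matching a query."""
    timeline = api.user_timeline(screen_name=screen_name, count=count)
    # Assumption: a tweepy version where this method is called search();
    # later releases renamed it to search_tweets().
    found = api.search(q=query, count=count)
    return tweet_texts(timeline), tweet_texts(found)
```

Once make_api() has returned an api object, calling fetch_examples(api, some_account, some_query) should give you two lists of plain strings that can be handed to NLTK's tokenizers like any other text.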

There is an NLTK “Howto” about Twitter which might be of some help, though it uses a different Twitter library. Also, there are some interesting videos/demos at pythonprogramming.net.

Back to CHILDES and the proposed project. The basic outline is this: there is a phenomenon often referred to as “root infinitives” or “optional infinitives” that occurs in children acquiring any of a large set of languages, between the ages of 2 and 3. The short version is that they will sometimes use the infinitive form of a main verb instead of the tensed/agreeing form that adults would use. You can consult a handout about root infinitives from another class to get the basic idea. The important thing here, though, is that the phenomenon is proposed to be on a maturational schedule, meaning that the disappearance of root infinitives is essentially biological, kind of like losing one’s baby teeth.

There have been a bunch of studies of the root infinitive stage in a number of different languages. It doesn’t occur in all languages; it seems not to occur at all in languages like Italian and Spanish that allow for silent subjects and have elaborate verbal morphology. It does occur in English, French, Dutch, and a lot of other languages.

The project I’m suggesting is to look at transcripts of bilingual children acquiring two languages, both of which are known to show these root infinitives. Simply put, the prediction made by the hypothesis that root infinitives “mature away” on a biological schedule is that root infinitives should disappear at the same time from both languages a bilingual child speaks.
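One way to make that prediction checkable is to compare, for each language, the latest age at which root infinitives still show up at a non-trivial rate. The sketch below assumes you already have, per language, a mapping from age in months to the proportion of utterances that are root infinitives. The function names and the 0.05 cutoff are arbitrary choices of mine, not established measures.

```python
def last_age_above(rates_by_age, threshold=0.05):
    """Latest age (in months) at which the root-infinitive rate exceeds
    the threshold, or None if it never does.

    rates_by_age maps age in months to a proportion between 0 and 1.
    """
    ages = [age for age, rate in rates_by_age.items() if rate > threshold]
    return max(ages) if ages else None


def disappearance_gap(rates_a, rates_b, threshold=0.05):
    """Difference in months between the two languages' last ages above
    the threshold, or None if either language never exceeds it."""
    last_a = last_age_above(rates_a, threshold)
    last_b = last_age_above(rates_b, threshold)
    if last_a is None or last_b is None:
        return None
    return abs(last_a - last_b)
```

If the maturational hypothesis is right, the gap for a given bilingual child should be small. A large gap would suggest that something language-specific is going on instead.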

So, this project would be to find some sufficiently large corpora for bilingual children acquiring two root infinitive languages, then analyze each language for the presence of root infinitives. The languages may well be mixed in the transcripts, so it would probably be good to pick languages that you speak or at least can kind of decode. The challenge would be to work out how to use NLTK (or something) to narrow in at least on the candidates for root infinitives (even if the process of finding them cannot be fully automated), at which point you can start comparing them to see how often they occur at what age in what language.
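As a rough starting point for the narrowing-down step, here is a sketch for English only. It flags utterances in which a third person singular subject pronoun is directly followed by a bare verb form ("he go"), then counts the flagged utterances per age and language. The word lists are tiny and hand-made, the function names are mine, and the per-file language label passed to load_child_records is an assumption that will not hold if a transcript mixes languages. The loader uses NLTK's CHILDESCorpusReader and expects the XML version of a CHILDES corpus in a folder of your choosing.

```python
from collections import Counter

# Assumption: toy word lists; a real project needs a fuller lexicon or a tagger.
THIRD_SG_SUBJECTS = {"he", "she", "it"}
BARE_VERBS = {"go", "eat", "want", "like", "sleep", "play", "fall", "need"}


def is_candidate(tokens):
    """True if a 3sg subject pronoun is directly followed by a bare verb."""
    lowered = [token.lower() for token in tokens]
    return any(
        subject in THIRD_SG_SUBJECTS and verb in BARE_VERBS
        for subject, verb in zip(lowered, lowered[1:])
    )


def tally_by_age(records):
    """Count candidates per (age in months, language).

    records is an iterable of (age_in_months, language, tokens).
    Returns {(age, language): (n_candidates, n_utterances)}.
    """
    hits, totals = Counter(), Counter()
    for age, language, tokens in records:
        key = (age, language)
        totals[key] += 1
        if is_candidate(tokens):
            hits[key] += 1
    return {key: (hits[key], totals[key]) for key in totals}


def load_child_records(corpus_root, language):
    """Yield (age, language, tokens) for each child utterance under corpus_root."""
    from nltk.corpus.reader import CHILDESCorpusReader  # lazy import

    reader = CHILDESCorpusReader(corpus_root, r".*\.xml")
    for fileid in reader.fileids():
        age = reader.age(fileid, month=True)[0]  # may be None if not recorded
        for tokens in reader.sents(fileid, speaker=["CHI"]):
            yield age, language, tokens
```

The pattern overgenerates on purpose. Something like "does he go" gets flagged even though it is a perfectly adult question, so the output is a list of candidates to check by hand, not a count of root infinitives.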

I think it will be interesting to see what the results of this are, and I am not aware of anyone having looked at this directly before. So, that seems like as good a project as any: it addresses a real theoretical question by processing large amounts of language data.