Middle term

TL;DR: Take-home midterm, out Friday, due Wednesday.

We are almost to the middle of the semester. It is hard to imagine that. However, this means that we are also coming up on the promised midterm. The midterm was supposed to be Monday in class, according to the schedule. You may guess that my phrasing it that way suggests that I’m rethinking that.

Format of the midterm

Generally speaking, I don’t think the stuff we’ve been doing really lends itself that well to an in-class midterm. This is in significant part a “practical” course, and perhaps even more so here in this first half. I could provide a written test, though most of what we’ve been doing is getting Python on the computer to obey our commands. So, it seems more appropriate for the test to be (at least in part) in the same Python environment we’ve been working in. But the classes are 50 minutes long, and one technical mishap and half the time evaporates. So, to relieve the time pressure, etc., the midterm will be a “take-home.”

It is still a midterm, not a homework, and it will be midterm-sized, not homework-sized. It should still take about an hour, not several. It won’t cover new things; it will test your facility with what we have covered, possibly with some slight extensions.

So, the plan is: I will provide you with the midterm on Friday, and it will be due by class on Wednesday. Friday will still be a review day, but Monday will be a regular day when we will start new things, and Wednesday I intend to give out the next homework.

What have we done?

The topics in the schedule and what has been mentioned in class are what I will draw the midterm material from.

The sections of the NLTK book below are what I will restrict my attention to. We also talked a bit about CHILDES, which is not in the book, though we did not get too far with it. But I still might ask something about it, so be familiar with basically what it is and what we did with it.

Chapter 1: basic corpora, frequency distributions
Chapter 2: working with corpora, conditional frequency distributions, pronunciation corpus, WordNet
Chapter 3, sec. 9: formatting output
Chapter 4, sec. 1-4: basic Python, variables, functions
Chapter 8, sec. 1-4.3: parsing with grammars

Generally speaking, if we actually talked about something in the classroom or it was involved in a homework, it has a relatively high likelihood of coming up in the midterm. If it was in the readings mentioned above and we didn’t talk about it directly, it still might come up, but the likelihood is smaller. So, preparing by re-reading those parts of the book would be sensible, preferably before Friday so you can ask questions about anything you’re not clear about.

The midterm will ask you how to do some basic things in Python and with NLTK, and will set up a couple of problems to consider, describe, or solve. I will provide the ground rules with the midterm, but this is not going to be a “closed-book test” of any sort. You’ll want to know the material basically, but if you forget a detail and know where to find it, you can look it up. (You can’t ask your roommate/classmate, but you can look it up.) What I mostly want is for you to understand how it works and why we’re doing it, and I will try to focus the questions on that. But, really, in real life after this course, if you want to do something with NLP, you would look it up. I want you to be able to find things and look them up efficiently, but there’s no real point in memorizing things just to take a test on them.

  • announcements

Homework 4 posted

Ok, I’m happy enough with Homework 4 now that I’ll call it complete. I’m not really sure how long it will take, though I think it will take you a lot less time to do than it took me to write.

Let me know if you see typos, or find anything unclear or not working.

  • homework

Homework 4 beginning posted

Because it’s now the middle of the night, I’m posting just the first part of Homework 4 right now.

I will continue to expand this a bit more tomorrow. However, if you want to get started on it, the part that is there is mainly notes to help make sure that you can get the parser running on your computer. So, it would make sense to go through that part first, particularly if you had any trouble getting it to work in class.

I’ll post again when the homework is actually ready. Right now it just covers very basic tree parsing and drawing, like we did in class, and adding the ability to handle adjectives. What’s coming up are some further exercises involving complex sentences (one sentence inside another), locating subjects and objects by position in the tree, differentiating between transitive and intransitive verbs, and some initial explorations of relative clauses (like “the person who wrote the article” or “the person who I met”).
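For anyone who wants a quick sanity check that the parser runs, here is a minimal sketch of the kind of thing the first part covers: a tiny grammar with adjectives, parsed and printed. The grammar and the example sentence are made up for illustration, not taken from the homework, and the sketch assumes NLTK 3’s nltk.CFG.fromstring and nltk.ChartParser; the homework itself may set things up differently.

```python
import nltk

# Hypothetical toy grammar for illustration only (not the homework's grammar).
# The Nom rule is what lets any number of adjectives stack up before the noun.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det Nom
Nom -> Adj Nom | N
VP -> V NP | V
Det -> 'the'
Adj -> 'big' | 'old'
N -> 'dog' | 'cat'
V -> 'chased' | 'slept'
""")

parser = nltk.ChartParser(grammar)


def parse_sentence(sentence):
    """Return a list of all parse trees for a whitespace-separated sentence."""
    return list(parser.parse(sentence.split()))


trees = parse_sentence("the big old dog chased the cat")
for tree in trees:
    print(tree)  # tree.draw() would open a window with the drawn tree instead
```

If this prints a bracketed tree, the parser is working. A sentence with a word that is not in the grammar will raise an error rather than return an empty list, so stick to the words listed when trying variations.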

  • homework

Starting with CHILDES

As usual, 50 minutes arrived pretty fast. I’d hoped to talk a bit about the haiku generator, but it wasn’t necessary, really, since the thing itself more or less explains everything as it goes along.

I started with CHILDES, but all we really did was go through the basic characteristics of the corpora in there and start getting NLTK to recognize it. So, I’ll continue with that next time.

If you want to see what I’d been planning to do, or to replicate what was done in class (since this is not actually in the NLTK book), here is the presentation I was following. It was mostly just sketching what I was going to be typing, not what happened.

I’ll make a more elaborate version for next time, but if you wanted to see where I was headed, you can.

Also, notes on installing CHILDES:

  • On the CHILDES site, you want the XML version of the database
  • You need to put this somewhere NLTK can find it. The canonical place is in a folder called nltk_data inside the “home” directory. This is not immediately reachable on a default MacOS/Mac OS X installation, so you may need to go to “Documents” and then to “Enclosing Folder” from the “Go” menu. The icon should be a little house.
  • You may need to create the nltk_data folder if it wasn’t already there. It’s not there on the lab computers, but it might be there on your own.
  • Then, the appropriate place for it is a set of nested folders inside nltk_data: a corpora folder, then a childes folder inside that, then a data-xml folder, then an Eng-USA-MOR folder (assuming you downloaded something from the CHILDES Eng-USA-MOR directory, like Brown.zip). That is: ~/nltk_data/corpora/childes/data-xml/Eng-USA-MOR/.
  • Once the corpus is in place, the stuff we did in class should work.
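
If you’d rather check the location from Python than hunt for it by hand, something like the following sketch should do it. It only builds the path described above and reports whether the folder exists. The function name childes_dir is made up here, and it assumes your home directory is whatever Python’s Path.home() reports.

```python
from pathlib import Path


def childes_dir(home=None, collection="Eng-USA-MOR"):
    """Build the folder path where the notes above say the corpus should go.

    `home` defaults to the current user's home directory; `collection` is the
    CHILDES directory the download came from (assumed here: Eng-USA-MOR).
    """
    base = Path(home) if home is not None else Path.home()
    return base / "nltk_data" / "corpora" / "childes" / "data-xml" / collection


target = childes_dir()
print(target)
print("found" if target.is_dir() else "not there yet")
```

If it prints “not there yet,” create the missing folders and put the unzipped download (for example, the contents of Brown.zip) there.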

Next time, we can try to actually find out something substantive about child acquisition of English using this.

  • handouts

Homework 3 and presentation

I’ve changed the due date of homework 3 (haiku generator) to Monday.

Also, I’ve put the presentation slides from today online. They are linked from the schedule page from the title of today’s topic. I’ll plan regularly to link things in that way when there are future presentation slides. It’s just a web page, but it’s smart enough to let you page through it with the arrow keys. (I made these using Remark, which seems like a pretty “lightweight” way to make such things, without needing PowerPoint or Keynote or anything installed.) If it doesn’t work for you immediately, try just refreshing the page—but it probably should “just work.”

I didn’t get to slides 12-15; maybe I’ll come back and briefly talk about them on Friday.

  • homework
  • handouts