Midterm

CAS LX 390 / GRS LX 690, NLP/CL, Fall 2017
Midterm, due Tue 10/24

Status

As mentioned in class, this page was going to be updated and clarified a little bit after class. That update has happened. Commence coding.

Ground rules

This is a “take-home” midterm, because I don’t want to add unnecessary anxiety, or have technical trouble cause delays in the short time that would be available for an in-class midterm.

However, it is still of course important that you do this on your own. What that means specifically is not consulting with classmates, roommates, etc. I do not plan for this to be difficult. If you (feel that you) have been doing relatively well on understanding the homework, then these things should not be particularly challenging. If you are having technical trouble, please let me know, and I will try to help you troubleshoot.

One thing you are hereby explicitly allowed to do, however, is consult static online reference sources. So, looking things up in the book, in the detailed documentation on python.org, nltk.org, StackExchange, etc., is fine. That is, after all, how you would proceed in the real world if you have a problem you want to solve. If you have questions, you can ask me.

I don’t think there will be much cause to worry about the ground rules, but I still wanted to state them. I expect this to take not (much?) more time than it would have taken to do in class.

Task 1

Adapted from the NLTK book, ch. 3, exercise 10.

Read and understand what is happening in the following code.

>>> text = 'The dog gave John the newspaper'
>>> sent = text.split()
>>> result = []
>>> for word in sent:
...     word_len = (word, len(word))
...     result.append(word_len)
>>> result
[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]

The task for you is to convert the logic above into a list comprehension. More specifically, you essentially want to replace lines 3 to 6 of the code (from result = [] through result.append(word_len)) with a single line that defines the list result using a list comprehension, giving you the same list as you got above.
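As a reminder of the general pattern, here is a small sketch on a different, made-up example (the names nums, squares_loop, and squares_comp are just for illustration and are not part of the task). A loop that starts with an empty list and appends one item per pass can usually be collapsed into a single comprehension.

```python
# Hypothetical data, unrelated to the sentence in Task 1.
nums = [1, 2, 3, 4]

# Loop version: start with an empty list, append one item per iteration.
squares_loop = []
for n in nums:
    squares_loop.append(n * n)

# Comprehension version: the same items, built in one expression.
squares_comp = [n * n for n in nums]

print(squares_loop == squares_comp)
```

The expression before the for in the comprehension plays the role of whatever the loop was appending.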

Task 2

Adapted from the NLTK book, ch. 3, exercise 25.

Pig Latin is a simple transformation of English text. Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append “ay”. For example: string → ingstray, idle → idleay.

See: http://en.wikipedia.org/wiki/Pig_Latin

Task 2a

Write a function to convert a word to Pig Latin.

Specifically, the function should take a string (a word) and return a string (that word, in Pig Latin). So, if you called the function pig_latin_word, you want to be able to recreate the following:

>>> print(pig_latin_word('string'))
ingstray
>>> print(pig_latin_word('idle'))
idleay
>>> print(pig_latin_word('sun'))
unsay
>>> print(pig_latin_word('trash'))
ashtray
>>> print(pig_latin_word('beast'))
eastbay

Task 2b

Write code that converts text, instead of individual words. That is, write a function that takes a whole text (like a sentence) and converts each word (using the function from task 2a).

This should work for (basically) arbitrary text, so it should take a string and break it up into words. For this, you can assume that the text does not have any punctuation and is all lowercase. (It is relatively easy to make it all lowercase, slightly more complicated to factor out punctuation, so we’ll skip this here.)

So, the first one below satisfies the presuppositions, and your function should reproduce that. The second one below does not satisfy the presuppositions, so it is fine if your function produces the same output as shown below for the second one.

>>> print(pig_latin_text('trash idles in the sun'))
ashtray idlesay inay ethay unsay
>>> print(pig_latin_text('Trash idles in the sun!'))
ashTray idlesay inay ethay un!say

The goal here is to define a function pig_latin_text that takes a string, breaks it into words (on spaces), converts each word into its Pig Latin equivalent (using pig_latin_word from before), and then returns the converted words reassembled into a single string.
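The split-and-rejoin mechanics on their own might look something like the sketch below. This is not the Pig Latin function: transform_text is a made-up helper name, and str.upper is standing in as a placeholder for whatever per-word conversion you want to apply.

```python
def transform_text(text, transform):
    """Apply transform to each space-separated word and rejoin with spaces."""
    words = text.split()                 # break the string into words
    converted = [transform(w) for w in words]
    return ' '.join(converted)           # reassemble into one string

# str.upper is only a placeholder transformation for illustration.
print(transform_text('trash idles in the sun', str.upper))
```

With str.upper as the placeholder, this prints TRASH IDLES IN THE SUN.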

Task 2c

Extend the function further to do three things:

  • preserve capitalization (if a word is capitalized in English, it should also be in Pig Latin)
  • keep qu together, so that quiet becomes ietquay and squeak becomes eaksquay.
  • treat y as a consonant when it is a consonant (yellow) and as a vowel when it is a vowel (style)

Doing these things requires a bit of thinking through the problem, to figure out what we need. Ideas:

  • Recall that istitle() can tell you if a word is in “title-case” (capitalized in the relevant sense). So, "Hello".istitle() is True, while "hello".istitle() and "HELLO".istitle() are False. Similarly, title() can transform a string into title-case (like lower() transforms into lowercase).
  • Handling the qu sequence is pretty specific. You can check to see if your first vowel is u and if your initial consonants end in q, and handle it appropriately.
  • On the y case, think about the situations in which y seems to be a vowel. It is only ever going to matter (here) when y is the first vowel in the word.
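If you want to convince yourself of how the string methods mentioned in the hints behave, a quick check along these lines should do it. The variable name report is made up for this illustration, and this only exercises the built-in methods, not the Pig Latin logic.

```python
# Collect (istitle, title, lower) for a few differently-cased strings.
report = {}
for s in ["Hello", "hello", "HELLO"]:
    report[s] = (s.istitle(), s.title(), s.lower())
    print(s, report[s])

# title() capitalizes the first letter and lowercases the rest.
print("eaksquay".title())
```

Note that title() and lower() return new strings and leave the original unchanged.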

>>> print(better_pig_latin_word('quiet'))
ietquay
>>> print(better_pig_latin_word('Squeak'))
Eaksquay
>>> print(better_pig_latin_word('style'))
ylestay
>>> print(better_pig_latin_word('yahoo'))
ahooyay