Home-doesn't-work 5 notes

It has been observed that the first homework problem I assigned is made… difficult… by the fact that the example code does not actually work.

This is in fact still an open bug against the NLTK book, and a related bug report has some of the details.

The main point, though, is that the listing you would base your approach to problem 9 on, the one that starts with text = 'That U.S.A. poster-print costs $12.40...', does not work on the current version of NLTK. The reason is that the way the tokenizer handles “capturing groups” has changed.

Basically, the tokenizer misbehaves when the pattern contains capturing parentheses: instead of whole tokens, you get back the contents of the groups. So wherever you have a capturing group like (...), use a non-capturing group like (?:...) instead. Concretely, the example from chapter 3 should read:

>>> import nltk
>>> text = "That U.S.A. poster-print costs $12.40..."
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(?:-\w+)*        # words with optional internal hyphens
...   | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():_`-]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

That worked for me, anyway, and it should give you a basis to work from for problem 9.
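
If you are curious why the parentheses matter, here is a small sketch using only Python's own re module. It rests on an assumption on my part: that the current regexp_tokenize simply hands your pattern to re.findall, which is what the behavior looks like. The variable names and the shortened test strings are mine, not the book's.

```python
import re

# One capturing group: findall returns what the group matched,
# or '' for matches where the group took no part.
group_contents = re.findall(r"\w+(-\w+)*", "poster-print costs")

# Same pattern with a non-capturing group: findall returns whole matches.
whole_tokens = re.findall(r"\w+(?:-\w+)*", "poster-print costs")

# Two capturing groups: findall returns one tuple per match,
# with one slot per group.
group_tuples = re.findall(r"([A-Z]\.)+|\w+(-\w+)*", "U.S.A. poster-print")

print(group_contents)  # ['-print', '']
print(whole_tokens)    # ['poster-print', 'costs']
print(group_tuples)    # [('A.', ''), ('', '-print')]
```

So if your own pattern for problem 9 returns tuples or fragments of tokens, a stray capturing group is the first thing to check.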