<h2 id="standoff-annotation-xml-and-more-childes">Standoff annotation, XML, and more CHILDES</h2>
<p><em>CAS LX 390 / GRS LX 690, Fall 2017. Posted 2017-12-12.</em></p>
<p>So, the last thing we did in class was a kind of live demonstration of
how to deal with corpora that may not have all the structure you might like.
I will try to write up here something like what happened, as a reminder
and potentially for future reference.</p>
<h3 id="childes-project-reminder">CHILDES project reminder</h3>
<p>So, the “default project” for the end of this semester was this idea I had
for looking at bilingual CHILDES corpora to try to determine whether
children leave the “root infinitive” stage in both languages simultaneously.
The basic premise is that children from about 2 years old to 3 years old will
use infinitive verbs in main clauses, where adults do not. The hypothesis
being tested is that this is due to a kind of biological maturation, rather
than anything about the language input itself. The caveat is that there is a
set of languages that do not seem to show root infinitives at all, and which
seem to be the “null subject” languages like Spanish and Italian. And there
are some other languages where the infinitive form might not really exist,
at least not in a way that is usefully detectable, e.g., Japanese, Mandarin.</p>
<p>In practical terms, what this means is that we have a pretty small set of
possible corpora. To do this project, a corpus must be located that:</p>
<ul>
<li>is bilingual,</li>
<li>involving two languages that both show root infinitives,</li>
<li>both of which you know well enough to make sense of the transcripts,</li>
<li>has children between about 1.5 and 3.5 years old.</li>
</ul>
<p>If you look at the
<a href="http://childes.talkbank.org/access/Biling/">descriptions of the bilingual corpora in CHILDES</a>
the options are… few.
The ones with Cantonese, Catalan, Chinese, Italian, Japanese, Portuguese,
and Spanish are out, at least. German, French, Russian, English, Dutch,
Danish should be ok. But, this is not many to work with.</p>
<p>Even before we get to other questions, this does pose a problem for anyone
who isn’t at least marginally comfortable with the language other than
English in a workable pairing. Fluency isn’t required, but some way of
identifying when the verb is in the infinitive (or possibly some kind of default)
form probably is. Realistically, Dutch or German is probably kind of guessable
based on English, though it would take some research.</p>
<p>But, supposing that there is a language pairing that will work, we then
have another issue: pretty much none of the corpora are as well annotated
as the Brown corpus (Adam, Eve, Sarah) that we worked with in class before.
Lots of time and effort has gone into that corpus to tag the parts of speech,
label the dependencies, etc. So, with that corpus we could just search for
the verbs and look at the agreement, because it was all tagged.</p>
<p>In most of these corpora, we have mostly just the words. Fair enough,
these were transcribed for some particular purpose of the original researchers.
But it means that if we want to do things like look for main clause infinitives, we need
to do some more work than just searching for them directly.</p>
<h3 id="xml-vs-chat">XML vs. CHAT</h3>
<p>If you look at the description of the particular corpus you want to use, there will
likely be links both into the browsable database and to a file you can download.
However, the file you download there is almost certain to be in CHAT format
(files ending in <code class="highlighter-rouge">.cha</code>). NLTK does not know how to handle those
(except as general text files); the <code class="highlighter-rouge">CHILDESCorpusReader</code> is designed for XML files.</p>
<p>CHAT is specific to CHILDES, and its format guidelines are quite precise.
Participant and recording information goes at the top in specific forms,
participants are referred to by three-letter codes (CHI = child, MOT = mother, etc.),
and individual utterances begin with <code class="highlighter-rouge">*</code> and the participant (<code class="highlighter-rouge">*CHI:</code> ), while
dependent “tiers” begin with <code class="highlighter-rouge">%</code>, and so on. You can read the CHAT manual if you
like, and you can kind of absorb how it works by looking at the browsable transcripts
of any corpus.</p>
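<p>For orientation, a CHAT file looks roughly like this. This is a made-up, heavily abbreviated fragment (the real header lines carry more information; see the CHAT manual for the details):</p>

```
@Begin
@Languages:	eng
@Participants:	CHI Target_Child, MOT Mother
*CHI:	I want go play .
%pho:	ai want go ple
*MOT:	you want to go play ?
@End
```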
<p>NLTK (or more specifically <code class="highlighter-rouge">CHILDESCorpusReader</code>) wants these to be in XML format,
instead. Many of the corpora in CHILDES are in XML format already, but you need to
go to the XML section specifically (these are not in general linked from the
main page that describes the corpus). So, once you pick a corpus you want to
use, you want to look in
<a href="http://childes.talkbank.org/data-xml/Biling/">the bilingual XML corpora directory</a>
to find the XML files for that corpus.</p>
<p>Thing is: Not all the corpora have XML versions there. I’m not sure why not.
The one I was experimenting with was actually the <code class="highlighter-rouge">FallsChurch</code> Japanese-English
bilingual one (which, however, wouldn’t be great for this project due to Japanese
not showing root infinitives in any obvious way). This seems to exist in CHAT format
but not in XML format.</p>
<p>CHAT is well enough defined though that there is a pretty easy way to convert
from CHAT to XML. There is a program to do this called
<a href="http://talkbank.org/software/chatter.html">Chatter</a>.
It is easy to download and use on the Mac, and there is a Java version that
is supposed to work on Windows and Linux. I did not try it on Windows or Linux
though. To use it, unzip the CHAT files you have downloaded, open the Chatter
program, then choose “Open” in Chatter, select the folder where the CHAT files are,
and it will process them into a folder that it creates alongside the folder with
the CHAT files. It will give it the same name but with <code class="highlighter-rouge">-xml</code> at the end.</p>
<p>Once you have XML files, we can start operating with them in NLTK. So, now we
can do some Python again.</p>
<h3 id="finding-the-files">Finding the files</h3>
<p>There has been a persistent problem with finding the files. NLTK is supposed to
be a bit smarter than it has proven to be in locating the data files, but
people have generally had quite a bit of trouble getting this to work.
The best idea would be just to explicitly specify where the files are.</p>
<p>On the Mac, my files are in a folder called <code class="highlighter-rouge">nltk_data</code> in my home directory.
Within <code class="highlighter-rouge">nltk_data</code> there is a <code class="highlighter-rouge">corpus</code> folder, within that a <code class="highlighter-rouge">childes</code> folder,
and within that a <code class="highlighter-rouge">data-xml</code> folder. This is where I had put the Brown files
from earlier work.</p>
<p>So, if I download the <code class="highlighter-rouge">GNP</code> corpus (which has an XML version), I can move the
<code class="highlighter-rouge">GNP</code> folder into the <code class="highlighter-rouge">data-xml</code> folder. And then the full path to this
folder, given that on my computer my username is <code class="highlighter-rouge">hagstrom</code>, is:</p>
<p><code class="highlighter-rouge">/Users/hagstrom/nltk_data/corpora/childes/data-xml/GNP/</code></p>
<p>It should be clear how this is constructed. You could put it on your Desktop,
and find it with <code class="highlighter-rouge">/Users/whatever/Desktop/GNP/</code> instead. The main thing is
to know exactly what folders it is in.</p>
<p>For Windows, I’m less sure, but in class the <code class="highlighter-rouge">nltk_data</code> folder was actually
at the top level of the <code class="highlighter-rouge">C:</code> drive. So, the path I used in class was:</p>
<p><code class="highlighter-rouge">C:/nltk_data/corpora/childes/data-xml/GNP/</code></p>
<p>or something like that. By the way, I <em>know</em> that in class I separated the
directories with the forward slash character (<code class="highlighter-rouge">/</code>). That is the normal
separator on recent Macs and Linux machines. The normal separator for Windows in
other contexts is actually the backslash character (<code class="highlighter-rouge">\</code>), and I don’t know
why that path wasn’t instead:</p>
<p><code class="highlighter-rouge">C:\nltk_data\corpora\childes\data-xml\GNP\</code></p>
<p>Backslashes would probably have worked too; Python on Windows generally
accepts either separator. Note, though, that in a Python string literal the
backslashes would need doubling (or a raw string), since, e.g.,
<code class="highlighter-rouge">\n</code> would otherwise be read as a newline.</p>
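One way to sidestep the separator question entirely is to let Python assemble the path. A minimal sketch, using the folder layout described above; the helper name <code class="highlighter-rouge">corpus_path</code> is made up for illustration:

```python
import os

# Hypothetical helper (not part of NLTK): build the data-xml path from a
# home directory, using os.path.join so the separator is right on any OS.
def corpus_path(home, *parts):
    return os.path.join(home, 'nltk_data', 'corpora', 'childes', 'data-xml', *parts)

data_root = corpus_path(os.path.expanduser('~'), 'GNP')
if not os.path.isdir(data_root):
    print('Corpus folder not found: {}'.format(data_root))
```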
<p>Anyway, I will assume that the XML files landed where I indicated above.
I am going to use this <code class="highlighter-rouge">GNP</code> corpus for examples, as I did in class.
If you look at this, you will see that there are three folders in the
<code class="highlighter-rouge">GNP</code> folder: <code class="highlighter-rouge">Both</code>, <code class="highlighter-rouge">English</code>, and <code class="highlighter-rouge">French</code>. I’m just going to
look at the English one in the examples below.</p>
<h3 id="to-python">To Python!</h3>
<p>I used Spyder for this because I just feel more comfortable there being able
to re-run things from beginning to end. Also, the “autocomplete” is a bit
smarter there than it is in Jupyter Notebook. But, it’s Python, do it however
you want.</p>
<p>So, to begin, we bring in NLTK and tell it where the corpus is.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">nltk</span>
<span class="kn">from</span> <span class="nn">nltk.corpus.reader</span> <span class="kn">import</span> <span class="n">CHILDESCorpusReader</span>
<span class="n">data_root</span> <span class="o">=</span> <span class="s">'/Users/hagstrom/nltk_data/corpora/childes/data-xml/'</span>
<span class="n">gnpec</span> <span class="o">=</span> <span class="n">CHILDESCorpusReader</span><span class="p">(</span><span class="n">data_root</span><span class="p">,</span> <span class="s">'GNP/English/.*.xml'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">gnpec</span><span class="o">.</span><span class="n">fileids</span><span class="p">())</span>
</code></pre>
</div>
<p>You should get a list of the fileids in the corpus. This much should
work whatever corpus you are using really (not just the GNP/English one).</p>
<p>Much of what I want to do here below requires picking a single transcript,
so let’s name the last transcript (which will be somebody’s latest one, so
more likely to have a bunch of words in it).</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">the_file</span> <span class="o">=</span> <span class="n">gnpec</span><span class="o">.</span><span class="n">fileids</span><span class="p">()[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</code></pre>
</div>
<p>At this point we can do the stuff that the <code class="highlighter-rouge">CHILDESCorpusReader</code> allows
us to do. But it’s a little disappointing.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">gnpec</span><span class="o">.</span><span class="n">participants</span><span class="p">(</span><span class="n">fileids</span><span class="o">=</span><span class="n">the_file</span><span class="p">)</span>
<span class="n">gnpec</span><span class="o">.</span><span class="n">sents</span><span class="p">(</span><span class="n">fileids</span><span class="o">=</span><span class="n">the_file</span><span class="p">,</span> <span class="n">speaker</span><span class="o">=</span><span class="s">'CHI'</span><span class="p">)</span>
<span class="n">gnpec</span><span class="o">.</span><span class="n">tagged_sents</span><span class="p">(</span><span class="n">fileids</span><span class="o">=</span><span class="n">the_file</span><span class="p">,</span> <span class="n">speaker</span><span class="o">=</span><span class="s">'CHI'</span><span class="p">)</span>
</code></pre>
</div>
<p>The thing that’s (potentially) disappointing is that there are no tags.
Using <code class="highlighter-rouge">tagged_sents()</code> or <code class="highlighter-rouge">tagged_words()</code> just returns a bunch of pairs
of words with empty strings.</p>
<p>What’s more, there’s no way (that I know of at least) to know what
utterance we’re looking at. If we want to look at just the child
utterances, we can limit the search to the speaker <code class="highlighter-rouge">CHI</code>, and we will
get the sentences in order, but we won’t know what <code class="highlighter-rouge">MOT</code> said in between,
and if we look at <code class="highlighter-rouge">MOT</code>’s utterances, we’ll get those in order, but we
won’t know what order they occur in with respect to the child’s utterances.
And if we don’t limit the speaker, then we don’t know who’s talking.
It’s surprisingly limited.</p>
<p>It’s probably informative to look at the XML file itself. Below I’ve given what
we see in a couple of these utterances. It’s useful to see the structure
here. There is an utterance indicated by an opening <code class="highlighter-rouge"><u ...></code> tag
(and closed by a <code class="highlighter-rouge"></u></code>), and inside each utterance we have a series of
words enclosed by <code class="highlighter-rouge"><w></code> and <code class="highlighter-rouge"></w></code> tags. The utterances have attributes
<code class="highlighter-rouge">who</code> (for the speaker) and <code class="highlighter-rouge">uID</code> for the utterance ID. That’s very
interesting/useful to see. This means that we can pinpoint any utterance
in a transcript by referring to its <code class="highlighter-rouge">uID</code>. There are also a couple of other
tags. One is <code class="highlighter-rouge"><t type="p"></t></code> which seems to correspond to clause type
or turn type—it distinguishes between statements (<code class="highlighter-rouge">"p"</code>) and questions
(<code class="highlighter-rouge">"q"</code>) at least. And there is a more arbitrary tag (<code class="highlighter-rouge"><a>...</a></code>)
that holds codes of special interest to the original researchers.
The <code class="highlighter-rouge">type="coding"</code> one marks what language the utterance is in and
to whom it was addressed. The other one (<code class="highlighter-rouge">type="extension"</code>)? I don’t know.
Whatever.</p>
<div class="language-xml highlighter-rouge"><pre class="highlight"><code> ...
<span class="nt"><u</span> <span class="na">who=</span><span class="s">"CHI"</span> <span class="na">uID=</span><span class="s">"u12"</span><span class="nt">></span>
<span class="nt"><w></span>I<span class="nt"></w></span>
<span class="nt"><w></span>want<span class="nt"></w></span>
<span class="nt"><w></span>go<span class="nt"></w></span>
<span class="nt"><w></span>play<span class="nt"></w></span>
<span class="nt"><w></span>make<span class="nt"></w></span>
<span class="nt"><w></span>a<span class="nt"></w></span>
<span class="nt"><w></span>house<span class="nt"></w></span>
<span class="nt"><t</span> <span class="na">type=</span><span class="s">"p"</span><span class="nt">></t></span>
<span class="nt"><a</span> <span class="na">type=</span><span class="s">"extension"</span> <span class="na">flavor=</span><span class="s">"pho"</span><span class="nt">></span>ai want go ple mek a haus<span class="nt"></a></span>
<span class="nt"><a</span> <span class="na">type=</span><span class="s">"coding"</span><span class="nt">></span>$LAN:E $ADD:MOT<span class="nt"></a></span>
<span class="nt"></u></span>
<span class="nt"><u</span> <span class="na">who=</span><span class="s">"MOT"</span> <span class="na">uID=</span><span class="s">"u13"</span><span class="nt">></span>
<span class="nt"><w></span>you<span class="nt"></w></span>
<span class="nt"><w></span>want<span class="nt"></w></span>
<span class="nt"><w></span>to<span class="nt"></w></span>
<span class="nt"><w></span>go<span class="nt"></w></span>
<span class="nt"><w></span>make<span class="nt"></w></span>
<span class="nt"><w></span>a<span class="nt"></w></span>
<span class="nt"><w></span>house<span class="nt"></w></span>
<span class="nt"><t</span> <span class="na">type=</span><span class="s">"p"</span><span class="nt">></t></span>
<span class="nt"><a</span> <span class="na">type=</span><span class="s">"coding"</span><span class="nt">></span>$LAN:E $ADD:CHI<span class="nt"></a></span>
<span class="nt"></u></span>
...
</code></pre>
</div>
<p>So, back to our disappointment with <code class="highlighter-rouge">CHILDESCorpusReader</code>—it doesn’t
(again, as far as I know) give us access to that <code class="highlighter-rouge">uID</code> attribute of an
utterance when we retrieve it. However, <code class="highlighter-rouge">CHILDESCorpusReader</code> is itself
a type of a more general <code class="highlighter-rouge">XMLCorpusReader</code>, and using this we can actually
get access to the parsed XML directly. That will allow us a much more
flexible way into these transcripts, though at the cost of having to
deal with another bit of technology.</p>
<p>So, step one is to get the XML representation of the corpus we read.
This can be done like so:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">the_xml</span> <span class="o">=</span> <span class="n">gnpec</span><span class="o">.</span><span class="n">xml</span><span class="p">(</span><span class="n">the_file</span><span class="p">)</span>
</code></pre>
</div>
<p>The <code class="highlighter-rouge">.xml()</code> call does require exactly one file, so we need to specify
which transcript file we are going to look at. We’ll look at the last
one, which we named <code class="highlighter-rouge">the_file</code>.</p>
<h3 id="finding-our-way-around-the-xml">Finding our way around the XML</h3>
<p>There is some brief discussion of using XML in the
<a href="http://www.nltk.org/book/ch11.html">NLTK book chapter 11</a>, section 4.</p>
<p>However, probably the most rigorous place to look for examples is
<a href="https://docs.python.org/3.6/library/xml.etree.elementtree.html#xpath-support">the official Python documentation for XML ElementTree</a>.
I’m going to just mention a couple of things here.</p>
<p>The basic goal here is to be able to look at an utterance and
figure out the speaker (<code class="highlighter-rouge">who</code>) and the utterance ID (<code class="highlighter-rouge">uID</code>), which
we know is in the XML file but is inaccessible through the
<code class="highlighter-rouge">CHILDESCorpusReader</code>.</p>
<p>So, the first thing we’ll do is find the utterances by
searching for the <code class="highlighter-rouge"><u>...</u></code> tags. This can be accomplished
by using the <code class="highlighter-rouge">findall()</code> function called on the XML structure.</p>
<p>This <em>should</em> look like this—but, it actually doesn’t quite.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">utterances</span> <span class="o">=</span> <span class="n">the_xml</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s">'u'</span><span class="p">)</span>
</code></pre>
</div>
<p>The thing above will not find anything, even though if you look at the
XML file, there are <code class="highlighter-rouge">u</code> tags there. Why? The source of the issue is
that at the top of the XML file, it specifies a “namespace”:</p>
<div class="language-xml highlighter-rouge"><pre class="highlight"><code><span class="nt"><CHAT</span> <span class="na">xmlns:xsi=</span><span class="s">"http://www.w3.org/2001/XMLSchema-instance"</span>
<span class="na">xmlns=</span><span class="s">"http://www.talkbank.org/ns/talkbank"</span>
<span class="na">xsi:schemaLocation=</span><span class="s">"http://www.talkbank.org/ns/talkbank http://talkbank.org/software/talkbank.xsd"</span>
<span class="na">PID=</span><span class="s">"11312/c-00001462-1"</span>
<span class="na">Version=</span><span class="s">"2.5.0"</span>
<span class="na">Lang=</span><span class="s">"eng"</span>
<span class="na">Corpus=</span><span class="s">"Genesee"</span>
<span class="na">Date=</span><span class="s">"1994-03-08"</span><span class="nt">></span>
...
</code></pre>
</div>
<p>The <code class="highlighter-rouge">xmlns</code> is the XML Namespace, and it is <code class="highlighter-rouge">http://www.talkbank.org/ns/talkbank</code>.
The point of specifying this is to allow mixing of tags from different files together.
This file has <code class="highlighter-rouge">u</code> tags, but other XML files might use <code class="highlighter-rouge">u</code> not for “utterance” but for
“underline” or something. So, the <em>real</em> tag, as far as the XML parser is concerned,
is not <code class="highlighter-rouge">u</code> but rather <code class="highlighter-rouge"><span class="p">{</span><span class="err">http://www.talkbank.org/ns/talkbank</span><span class="p">}</span><span class="err">u</span></code> – that is, it is the
namespace in braces preceding the tag we see in the file. So, what this boils down
to is that to find the utterances we need to do this:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">utterances</span> <span class="o">=</span> <span class="n">the_xml</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s">'{http://www.talkbank.org/ns/talkbank}u'</span><span class="p">)</span>
</code></pre>
</div>
<p>That will work, but it’s clunky: we need to put the namespace before every tag.
So what I will do is put the namespace in its own variable:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">ns</span> <span class="o">=</span><span class="s">'{http://www.talkbank.org/ns/talkbank}'</span>
<span class="n">utterances</span> <span class="o">=</span> <span class="n">the_xml</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">ns</span><span class="o">+</span><span class="s">'u'</span><span class="p">)</span>
</code></pre>
</div>
<p>We can now interrogate the <code class="highlighter-rouge">who</code> and <code class="highlighter-rouge">uID</code> like this:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">utterances</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'uID'</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">utterances</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'who'</span><span class="p">))</span>
</code></pre>
</div>
<p>What <em>is</em> utterance <code class="highlighter-rouge">u4</code> though? Looking at the XML, the utterance is parent
to a sequence of words (among other things), so we can collect them like this:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">ws</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">utterances</span><span class="p">[</span><span class="mi">4</span><span class="p">]]</span>
</code></pre>
</div>
<p>This isn’t quite what we want, though. This has collected the child elements,
but not all are words. And even when they are words, we need to ask the word element
what its <code class="highlighter-rouge">text</code> is to get the word if we want to print or compare it to something.
So, there are two things we want to do. One is to make sure we are looking at words
(the <code class="highlighter-rouge">w</code> tag), and the second is that we want to collect the text
(since my current side quest is to print the words of the utterance).</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span><span class="o">.</span><span class="n">text</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">utterances</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="k">if</span> <span class="n">w</span><span class="o">.</span><span class="n">tag</span> <span class="o">==</span> <span class="n">ns</span><span class="o">+</span><span class="s">'w'</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">words</span><span class="p">)</span>
</code></pre>
</div>
<p>Ok, good, now we’re getting somewhere. We are starting to be able to get
access to the data in the corpus at a deeper level.</p>
<h3 id="dealing-with-the-lack-of-pos-tags">Dealing with the lack of POS tags</h3>
<p>Now, one thing that this corpus does not have is any kind of part of speech
tagging. Ultimately what we want to look at is the form that verbs take,
but we have no good way to find the verbs.</p>
<p>So we need a strategy. Here’s the strategy I thought of, at least.
We’ll find out what the most common words are first, and then look by hand
to see which of them are verbs. We’ll take one or a few of the most common
verbs in the corpus and we’ll then search for just those verbs to see what
form those verbs are in. So we are no longer looking in general for verb
forms, but we’re trying to take something like a representative sample with
a couple of the verbs that we are most likely to find in varying contexts
in the transcript/corpus.</p>
<p>Finding the words is something we can do with the basic
functions given to us by <code class="highlighter-rouge">CHILDESCorpusReader</code>. And then we can make
a Frequency Distribution to figure out what the most common ones are.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">all_words</span> <span class="o">=</span> <span class="n">gnpec</span><span class="o">.</span><span class="n">words</span><span class="p">(</span><span class="n">fileids</span><span class="o">=</span><span class="n">the_file</span><span class="p">)</span>
<span class="n">fd</span> <span class="o">=</span> <span class="n">nltk</span><span class="o">.</span><span class="n">FreqDist</span><span class="p">(</span><span class="n">all_words</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">fd</span><span class="o">.</span><span class="n">most_common</span><span class="p">(</span><span class="mi">20</span><span class="p">))</span>
</code></pre>
</div>
<p>In my results, I see basically <em>go</em> and <em>do</em> among the top 20 words
of this transcript. Perhaps you might want to gather the words over all
transcripts. Anyway, however you want to do it. This just seems like a
good place to start given that we don’t have tags built into the corpus
that allow us to search automatically for verbs and agreement characteristics.</p>
<p>So, the idea from here would be to look for the various forms in which <em>go</em>
can occur (<em>goes</em>, <em>go</em>, <em>went</em>) and see how often it is arguably infinitive
in a main clause. Or perhaps simply missing agreement (in French it is at least
plausible that agreement can come out as 3rd person singular as a kind of
default when the agreement is deficient somehow).</p>
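As a sketch of what that search could look like, here is the same kind of loop run over a tiny made-up XML fragment (standing in for a real transcript, so the example is self-contained); the <code class="highlighter-rouge">GO_FORMS</code> set and the sample utterances are invented for illustration, and with a real corpus <code class="highlighter-rouge">the_xml</code> would come from <code class="highlighter-rouge">gnpec.xml(the_file)</code> as above:

```python
import xml.etree.ElementTree as ET

NS = '{http://www.talkbank.org/ns/talkbank}'

# Forms of "go" to look for -- this set is our own choice, not anything
# taken from the corpus metadata.
GO_FORMS = {'go', 'goes', 'going', 'went', 'gone'}

# A tiny made-up fragment standing in for a real transcript.
sample = '''<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="CHI" uID="u0"><w>I</w><w>go</w><w>park</w></u>
  <u who="CHI" uID="u1"><w>he</w><w>goes</w><w>home</w></u>
  <u who="MOT" uID="u2"><w>we</w><w>went</w><w>yesterday</w></u>
</CHAT>'''
the_xml = ET.fromstring(sample)

hits = []
for u in the_xml.findall(NS + 'u'):
    # Collect the words of the utterance, then keep it if any word
    # is one of the target forms.
    words = [w.text for w in u if w.tag == NS + 'w' and w.text]
    found = [w for w in words if w.lower() in GO_FORMS]
    if found:
        hits.append((u.get('uID'), u.get('who'), found, ' '.join(words)))

for uid, who, found, sent in hits:
    print('{} ({}): {} -> {}'.format(uid, who, sent, found))
```

From there, each hit can be inspected by hand (via its <code class="highlighter-rouge">uID</code>) to judge whether the form is a plausible root infinitive.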
<h3 id="standoff-annotation">Standoff annotation</h3>
<p>It might be that you want to add some coding to a corpus that you have.
For example, perhaps your project might be to look at a transcript and code for whether
(a) a child’s utterance is prompting an adult’s utterance/repetition, or (b)
a child’s utterance is imitating or resulting from an
adult’s utterance. This will not already be coded in the transcripts;
it will require coding by hand.</p>
<p>One way to do this would be to actually edit the XML file and add
the tags in. To do this would require that you not mess up the XML
file in the process, which is potentially not trivial.</p>
<p>The way I’d be more comfortable doing this would be to leave the original
corpus as it is, but to create a second file that has the coding for each
utterance you want to add a code to. More concretely, I’m suggesting a
second annotation file that contains something like this:</p>
<div class="language-xml highlighter-rouge"><pre class="highlight"><code>u0 RESP
u2 RESP
u4 PROMPT
u6 PROMPT
u8 PROMPT
</code></pre>
</div>
<p>The intent here is that u0 and u2 are CHI utterances in which the child
is responding to a prompt, and u4, u6, and u8 are utterances in which the
child is asking or prompting the adult to respond.</p>
<p>Once this is coded, one might look to see whether, say, English transcripts
show a different pattern from transcripts in another language.</p>
<p>So the goal would be to use the CHILDES transcript together with the new
file of extra annotations. Since these are annotations to the file
<code class="highlighter-rouge">oli33b06m.xml</code>, we can save the annotations file as <code class="highlighter-rouge">oli33b06m.xml.ann.txt</code>
(the idea being that you can locate the annotations by using the fileid
you are using in CHILDES/XML and adding <code class="highlighter-rouge">.ann.txt</code> to it).</p>
<p>This is called “standoff annotation” in the NLTK book chapter 11, because
it is not a direct modification of the original corpus, but is a separate
kind of “overlay” that stands apart but points to spots in the original
corpus.</p>
<p>To load this up, you can do this (making some assumptions here about where
the annotation files will go):</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">annroot</span> <span class="o">=</span> <span class="s">'/Users/hagstrom/nltk_data/annotations/'</span>
<span class="n">annfile</span> <span class="o">=</span> <span class="s">'{}{}.ann.txt'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">annroot</span><span class="p">,</span> <span class="n">the_file</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">annfile</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">annotations</span> <span class="o">=</span> <span class="p">[</span><span class="n">l</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">()</span> <span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="n">f</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">l</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">())</span><span class="o">></span><span class="mi">0</span><span class="p">]</span>
</code></pre>
</div>
<p>The <code class="highlighter-rouge">l.strip().split()</code> part removes the return character from the end of
each line and then breaks the line into a list of its fields. So the first line would
result in a list like <code class="highlighter-rouge">['u0', 'RESP']</code>, and the entire file is read into
the <code class="highlighter-rouge">annotations</code> list.</p>
<p>Now, if you want to go through annotations and retrieve the utterance that
corresponds to the annotation, you can do this:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">annotations</span><span class="p">:</span>
<span class="n">u</span> <span class="o">=</span> <span class="n">the_xml</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">".//*[@uID='{}']"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
<span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span><span class="o">.</span><span class="n">text</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">u</span> <span class="k">if</span> <span class="n">w</span><span class="o">.</span><span class="n">tag</span> <span class="o">==</span> <span class="n">ns</span><span class="o">+</span><span class="s">'w'</span> <span class="ow">and</span> <span class="n">w</span><span class="o">.</span><span class="n">text</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Utterance {}: {}: {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">u</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'uID'</span><span class="p">),</span> <span class="n">u</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'who'</span><span class="p">),</span> <span class="s">' '</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">words</span><span class="p">)))</span>
</code></pre>
</div>
<p>This finds any tag that has a <code class="highlighter-rouge">uID</code> attribute that matches the one in the current
line of the annotations file, then assembles the words, and prints what it found.</p>
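<p>If you want to see how that XPath expression behaves in isolation, you can try it on a tiny stand-in document (this snippet is invented for illustration; the real CHAT XML files use the talkbank namespace shown in the code below):</p>

```python
import xml.etree.ElementTree as ET

# A tiny invented document with uID attributes, standing in for a CHAT file
doc = ET.fromstring("<doc><u uID='u0'><w>hi</w></u><u uID='u1'><w>bye</w></u></doc>")

# .//*[@uID='u1'] matches any descendant element whose uID attribute is 'u1'
u = doc.find(".//*[@uID='u1']")
print(u.get('uID'), [w.text for w in u])  # u1 ['bye']
```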
<p>Or, you could go through the corpus but catch cases where you have an extra
annotation from the annotation file. To do this, it would be better to re-organize
the <code class="highlighter-rouge">annotations</code> list so that we can look up the annotations by utterance ID.
We can do this with a dictionary in Python. At the moment this annotations file
is set up in such a way that it can pretty much automatically create this, because
<code class="highlighter-rouge">annotations</code> is just a list of 2-member lists. All you have to do is</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">anndict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">annotations</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">anndict</span><span class="p">[</span><span class="s">'u4'</span><span class="p">])</span> <span class="c"># 'PROMPT'</span>
</code></pre>
</div>
<p>But this is not fully general: if a line had multiple annotations, it would no longer work.
It is only good for the special case where each line has an utterance number
followed by a single tag. Better would be to use the first element of each line
of the annotation file as the dictionary key and the rest of the line as the value, like so:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">anndict</span> <span class="o">=</span> <span class="p">{</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">]:</span> <span class="n">a</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">annotations</span><span class="p">}</span>
<span class="k">print</span><span class="p">(</span><span class="n">anndict</span><span class="p">[</span><span class="s">'u4'</span><span class="p">])</span> <span class="c"># ['PROMPT']</span>
</code></pre>
</div>
<p>The result is not identical (the entries are now lists instead of strings),
but it is more general/adaptable.</p>
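<p>To see the difference, consider a hypothetical annotations list in which one line carried two tags:</p>

```python
# Invented annotations; the second entry has two tags, so dict(annotations)
# would raise a ValueError, but the dictionary comprehension still works
annotations = [['u0', 'RESP'], ['u4', 'PROMPT', 'IMIT']]
anndict = {a[0]: a[1:] for a in annotations}
print(anndict['u4'])  # ['PROMPT', 'IMIT']
```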
<p>So, now if we go through the utterances, we can check whether there is an extra
annotation in the annotations file:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">ns</span> <span class="o">=</span><span class="s">'{http://www.talkbank.org/ns/talkbank}'</span>
<span class="n">utterances</span> <span class="o">=</span> <span class="n">the_xml</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">ns</span><span class="o">+</span><span class="s">'u'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">u</span> <span class="ow">in</span> <span class="n">utterances</span><span class="p">:</span>
<span class="k">if</span> <span class="n">u</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'uID'</span><span class="p">)</span> <span class="ow">in</span> <span class="n">anndict</span><span class="p">:</span>
<span class="n">promptresp</span> <span class="o">=</span> <span class="n">anndict</span><span class="p">[</span><span class="n">u</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'uID'</span><span class="p">)][</span><span class="mi">0</span><span class="p">]</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">promptresp</span> <span class="o">=</span> <span class="s">'UNKNOWN'</span>
<span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span><span class="o">.</span><span class="n">text</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">u</span> <span class="k">if</span> <span class="n">w</span><span class="o">.</span><span class="n">tag</span> <span class="o">==</span> <span class="n">ns</span><span class="o">+</span><span class="s">'w'</span> <span class="ow">and</span> <span class="n">w</span><span class="o">.</span><span class="n">text</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Utterance {}: {}: {}: {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">u</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'uID'</span><span class="p">),</span> <span class="n">u</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'who'</span><span class="p">),</span> <span class="n">promptresp</span><span class="p">,</span> <span class="s">' '</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">words</span><span class="p">)))</span>
</code></pre>
</div>
<p>One thing you might consider, if you do this for a more serious project,
is recording the version number in your annotation file, so that it is clear
what version of the corpus you were working with.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">the_xml</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'Version'</span><span class="p">)</span>
</code></pre>
</div>
<h3 id="anyway">Anyway</h3>
<p>This is basically what I was trying to cover during class today. There will be more to do
in your own projects, but I wanted to provide a couple of examples of how you might
deal with the fact that some of the corpora are fairly sparse in terms of what they
have tagged. Using this kind of a standoff annotation file keyed to the individual
utterance numbers is one way that you can “extend” the corpus by hand without having to
work out how to modify the corpus’ XML file itself. And I wanted to suggest a
strategy for narrowing in on verbs: look for the most frequent
words to identify the most common verbs, and then search for the forms of
those verbs.</p>
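<p>A minimal sketch of that frequency strategy, using <code class="highlighter-rouge">collections.Counter</code> (the token list here is invented; in practice it would be the words extracted from the transcripts):</p>

```python
from collections import Counter

# Stand-in for the word tokens extracted from a transcript
tokens = ['go', 'dog', 'go', 'eat', 'go', 'eat', 'ball']

# The most frequent words are where to look for the most common verbs
counts = Counter(tokens)
print(counts.most_common(2))  # [('go', 3), ('eat', 2)]
```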
<p>It is possible that even with all of this, the data sets you have are going to be small
enough that it will be hard to say anything with much confidence. But, you can tell me
what you did find at least, and what you might expect to find if you had bigger corpora
(or better tagged corpora).</p>Of projects, CHILDES, and Twitter2017-11-21T20:00:00-05:002017-11-21T20:00:00-05:00http://ling-blogs.bu.edu/lx390f17/of-projects-childes-and-twitter<p>Today was kind of a miscellaneous day, positioned as it was before the
Thanksgiving break. There were basically three things discussed, mostly
loosely related to the upcoming projects.</p>
<p>As a reminder, the plan for the “final” has changed since the beginning
of the semester (which I remember us basically agreeing on early in the semester).
In particular, everyone will be doing a final project, regardless of the course
number registered for. Students in LX390 will do a smaller project. There
will be no in-class final for anyone.</p>
<p>Part of the idea for the LX390 project was that I’d provide a project topic,
whereas the LX690 topics would be proposed by those planning to do them.
So, I outlined a proposed project for LX390 students, although a) it is not
required that LX390 students with a different idea do it; and, b) if someone
in LX690 wants to do it (perhaps more extensively), that’s ok too.</p>
<p>Before getting to that point, I talked a bit about getting access to corpora,
which I find to be one of the most difficult hurdles to working out an
interesting project idea. There are basically four corpora I’d suggest as
a starting point, based on their availability, apart from the samples that
NLTK comes with. One is <a href="https://www.gutenberg.org">Project Gutenberg</a>,
which is also described in the <a href="http://www.nltk.org/book/ch02.html">NLTK book chapter 2</a>.
This has a lot of public domain texts, generally old books. If it’s ok for your
text to be coming from, like, the 1800s, this is a good source.
Another is <a href="http://childes.talkbank.org">CHILDES</a>, which we’ve talked a bit
about as well (and to which I’ll come back). The last two are the British National
Corpus and Twitter.</p>
<p>With respect to the British National Corpus, there was an announcement of
the availability of <a href="http://corpora.lancs.ac.uk/bnc2014/">BNC 2014</a> that caught my
attention. It is available for non-commercial academic use, through the
<a href="https://cqpweb.lancs.ac.uk">Corpus Query Processor, University of Lancaster</a>,
which you can sign up for. I successfully signed up for an account there and
got access to BNC 2014 that way. You can do simple queries there on the included
data sets. Some corpora you get “full access” to, which allows you to download
them in their entirety; others you can run queries on but cannot download in full.</p>
<p>For Twitter, I have successfully set up the <code class="highlighter-rouge">tweepy</code> package to access Twitter
streams. I had <code class="highlighter-rouge">tweepy</code> already installed, I assume through installation of
Anaconda initially, though it was not installed on the lab PCs. It’s pretty
straightforward, though. The
<a href="http://tweepy.readthedocs.io/en/v3.5.0/getting_started.html">main documentation for tweepy</a>
does a pretty good job of the basic setup. I used <code class="highlighter-rouge">user_timeline()</code> and
<code class="highlighter-rouge">search()</code> to good effect. Before you can use this, you need to sign in to
the <a href="https://apps.twitter.com">Twitter apps page</a> and “create an application”
that allows your Python scripts to sign in. It will eventually give you, under
“Keys and Access tokens”, a set of four big strings that are used to sign in:
The “Consumer Key”, “Consumer Secret”, “Access Token”, and “Access Token Secret”.
When you have all four of those you can plug them in and start retrieving things
from Twitter.</p>
<p>There is an <a href="http://www.nltk.org/howto/twitter.html">NLTK “Howto” about Twitter</a>
which might be of some help, though it uses
a different Twitter library. Also, there are some interesting videos/demos at
pythonprogramming.net:</p>
<ul>
<li><a href="https://pythonprogramming.net/twitter-sentiment-analysis-nltk-tutorial/">Twitter sentiment analysis with NLTK</a></li>
<li><a href="https://pythonprogramming.net/graph-live-twitter-sentiment-nltk-tutorial/">Graphing live Twitter sentiment analysis</a></li>
</ul>
<p>Back to CHILDES and the proposed project, the basic outline is this:
There is a phenomenon often referred to as “root infinitives” or
“optional infinitives” that occurs for children of a large set of
languages between the ages of 2 and 3. The short version of this is
that they will sometimes use the infinitive form of a main verb instead
of the tensed/agreeing form that adults would use. You can perhaps
consult a <a href="http://ling-blogs.bu.edu/lx540f12/files/2012/09/lx540f12-04-nrfs-handout.pdf">handout about root infinitives</a>
from another class to get the basic idea. The important thing really here,
though, is the fact that it is proposed to be on a maturational schedule,
meaning that the disappearance of root infinitives is essentially biological,
kind of like losing one’s baby teeth.</p>
<p>There have been a bunch of studies of the root infinitive stage in a number
of different languages. It doesn’t occur in all languages; it seems not to
occur at all in languages like Italian and Spanish that allow for silent
subjects and have elaborate verbal morphology. It does occur in English,
French, Dutch, and a lot of other languages.</p>
<p>The project I’m suggesting is looking at transcripts of <em>bilingual</em> children
acquiring two languages both of which are known to show these root infinitives.
Simply put, the prediction made by the hypothesis that root infinitives
“mature away” on a biological schedule is that root infinitives should disappear
from <em>both</em> languages a bilingual child speaks at the same time.</p>
<p>So, this project would be to find some sufficiently large corpora for bilingual
children acquiring two root infinitive languages, then analyze each language for
the presence of root infinitives. The languages may well be mixed in the transcripts
so it would probably be good to pick languages that you speak or at least
can kind of decode. The challenge would be to work out how to use NLTK (or something)
to narrow in at least on the candidates for root infinitives (even if the process
of finding them cannot be fully automated), at which point you can start
comparing them to see how often they occur at what age in what language.</p>
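<p>As one very rough illustration of narrowing in on candidates: in a language like Dutch, where infinitives generally end in <em>-en</em>, simple suffix matching gets you a first-pass candidate list (the tokens here are invented, and the approach overgenerates, so the candidates still need hand-checking):</p>

```python
# Invented child-utterance tokens from a Dutch-like transcript
tokens = ['pappa', 'schoenen', 'wassen', 'ik', 'slapen']

# Words ending in -en are candidate infinitives; note 'schoenen' is
# actually a plural noun, so this is only a starting point
candidates = [w for w in tokens if w.endswith('en')]
print(candidates)  # ['schoenen', 'wassen', 'slapen']
```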
<p>I think it will be interesting to see what the results of this are, but I am
not really aware of anyone looking at this directly before. So, that seems
like as good a project as any—it addresses a real theoretical question, by
processing large amounts of language data.</p>Office hours today2017-11-15T05:30:00-05:002017-11-15T05:30:00-05:00http://ling-blogs.bu.edu/lx390f17/office-hours-today<p>Turns out that I’m not going to be able to make my office
hours today, sorry for the late notice. I should be there
earlier, from 1pm to 2pm, but I have a conflict at 2pm.</p>Class notes and homework2017-11-07T20:30:00-05:002017-11-07T20:30:00-05:00http://ling-blogs.bu.edu/lx390f17/class-notes-and-homework<p>There hasn’t been very much posting on the blog here, but I thought I would take
a moment to add a note about today’s class, since no slides were used.</p>
<p>I can’t really entirely revisit what I talked about here, but it was primarily
drawn from the NLTK book, chapter 10, sections 1-5. We did run into the
fact that Prover9 is not installed by default with Anaconda, so we were not
able to see the theorem prover in action (and it is not expected that you will).
As with much of this, my goal was to provide some idea of what is possible,
the terminology, and where to find information about it. We won’t be directly
working with Prover9, but you can install it if you find something it would be
useful for.</p>
<p>On the whiteboard I spent a fair amount of time going over logical connectives,
semantic models, valuation functions, and lambda notation for functions.</p>
<p>Also, another thing that I had not resolved by the end of class was what
exactly the homework would consist of. Since no homework dealt with the
stuff we went over most recently (due to the BUCLD break), the homework just posted
has both a classification task (authorship determination) and a more theoretical
semantic exercise based on the stuff from today. The authorship part is quite
short, and the semantics part appears quite long but is really quite a lot of
reading with a few short tasks along the way. The semantics part of the homework
should help solidify the ideas that were discussed during the class time.</p>
<p>The homework is here: <a href="/lx390f17/hw7-authorship-semantics/">HW 7</a></p>
<p>Also, we’re rapidly approaching the point at which course project
ideas and proposals will be due. I will provide some specifications and
ideas for that shortly. Because that snuck up on me a little bit, the due
date for the proposals might be shifted to be after the Thanksgiving break,
but of course settling on a project sooner is probably better. Anyway, more
to come on that topic soon.</p>Homework 22017-09-11T16:30:00-04:002017-09-11T16:30:00-04:00http://ling-blogs.bu.edu/lx390f17/homework-2<p>Here’s the plan for the second homework assignment. It again involves exercises from the book.</p>
<p>From chapter 1: 19, 20, 23, 24, 26<del>, 27, 28</del> [27 and 28 were in HW#1 already]</p>
<p>From chapter 2: 17, 18</p>
<p><em>However</em>, if you feel quite Python-competent already, and feel insufficiently challenged,
try doing chapter 2 exercise 23 (about Zipf’s Law) as well. The other exercises are not
pointless if you know Python, but you might get through them quickly. Ch. 2 number 23 is
not required, and it is not a risk to do it (if you get it wrong somewhere, you won’t
be penalized).</p>
<p>This is still not a great division by background level, but it might help a little bit.</p>Homework 12017-09-04T17:07:40-04:002017-09-04T17:07:40-04:00http://ling-blogs.bu.edu/lx390f17/homework-1<p>The first homework is a “just getting comfortable” assignment, based on section 1 and partly on section 2 of the NLTK book, and using exercises already in the book.</p>
<p>Those exercises are: 2, 3, 4, 6, 12, 13, 21, 27, 28, 29.</p>
<p>You can send the answers to me in whatever form I can read. Electronic is preferable, just plain text in an email is fine, but so is Word, PDF. Please include the output (and whatever commentary is necessary) to anything that you do (that is, don’t just give me something that I need to run myself).</p>pythontutor.com2017-09-04T11:09:20-04:002017-09-04T11:09:20-04:00http://ling-blogs.bu.edu/lx390f17/pythontutorcom<p>Last time I taught this class, I was made aware that when people
are taught Python in CS classes, the
site <a href="http://www.pythontutor.com">pythontutor.com</a> is recommended
as a troubleshooting and learning resource.
It looks pretty nice. For a short program, you can copy and paste
it in, and then step through your program as it runs, line by line,
to see how variables evolve. I’ll demo this a little bit in class
as well when we work through how various Python functions operate.
But if you have a program that is misbehaving, this is one good
way to see where it is going wrong.</p>
<p>I have not played with it much really, but since I know about it
from the beginning this semester, we’ll get a chance to see how
useful it can be together. One downside is that you won’t be
able to use it for anything that involves NLTK, so it’s good for
logic, but for debugging NLTK-based issues, it will still be
necessary to work locally.</p>Anaconda is the recommended environment2017-09-04T10:26:56-04:002017-09-04T10:26:56-04:00http://ling-blogs.bu.edu/lx390f17/anaconda-is-the-recommended-environment<p>Even if you have Python working in some fashion on your computers already,
I would recommend installing <a href="https://www.continuum.io">Anaconda</a>, which is a “distribution” of Python
(and some other things, including NLTK) that makes setting up Python and related
software much easier than most of the alternatives.</p>
<p>(The Anaconda distribution makes everything pretty
painless: you don’t have to worry about, e.g., not having <code class="highlighter-rouge">matplotlib</code> installed,
and it also contains some other things that we’ll use later in the semester, notably R.)</p>
<p>To install it, go to the <a href="https://www.continuum.io">Anaconda page</a>, click on
“Download”, pick your platform (Mac, Windows, or Linux), and download
the Python 3.6 version. For the Mac, I’d advise picking the graphical installer option.
Double-click on the downloaded package, follow the instructions.</p>
<p>When you’re finished, you should have an <code class="highlighter-rouge">anaconda</code> folder in your home folder
(which might not be immediately visible on the Mac—open your Documents folder
and then press command + up-arrow to move out of your Documents folder up in the
hierarchy, and at that point you should see the <code class="highlighter-rouge">anaconda</code> folder). Inside that
folder you should see an application called Navigator. If you already see a
<code class="highlighter-rouge">python</code> application in there, you can double click on that, but otherwise double-click
on Navigator and then install/launch Spyder.</p>
<p>This will give you a multipanel interface, with a temporary file on the left,
and an “IPython console” in the lower right. The “IPython console” is basically
like IDLE. You can drag the separators around to make it bigger.</p>
<p>I’ll use this in class, so you can see how I use it there as well.</p>Welcome to NLP&CL2017-09-03T12:40:51-04:002017-09-03T12:40:51-04:00http://ling-blogs.bu.edu/lx390f17/announcements/welcome-to-nlp-cl<p>And, here we go. Welcome to Fall 2017, and welcome to the
Natural Language Processing and Computational Linguistics
course.</p>
<p>Announcements will be posted here, updates to the schedule,
etc. I will probably not even bother handing out a printed
copy of the schedule. If you would like to print it out, go
to the schedule page here and do so. It is likely to change
as the semester proceeds.</p>
<p>The course information page has stuff about the course
requirements and my office hours, etc.</p>