Has Deep Learning Uncovered Chomsky’s Deep Structure?
November 27, 2016
After Trump’s shocking victory, many of our professors began class with an opportunity for us to voice any fears or feelings we were harboring. One of my professors spoke about how studying linguistics is a way to study what unites us as humans: this strange ability called “language.” Despite all of our languages looking and sounding different, all humans have this amazing ability to learn complex rules and thousands of words in our first few years of existence. Moreover, we do this without being prodded to learn and without much explicit instruction. Language is something that should, at its core, unite us, not divide us.
Earlier this week, Google Research announced a breakthrough in “zero-shot” machine translation. What this means is that Google Translate can now perform translations between pairs of languages it was never directly trained on. Typically, a machine translation system needs to be trained on each language pair, e.g. English <–> French and French <–> Spanish. But Google’s latest system can translate between, e.g., Japanese <–> Korean after being trained only on pairs that each include English (see Google’s visual representation below). In essence, the machine is learning the “gist” of language, or of language relationships, rather than a specific pairing.
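Under the hood, the system reportedly relies on a simple trick: an artificial token naming the target language is prepended to every source sentence, and a single shared model is trained on all pairs at once. Here is a minimal sketch of the idea; the `tag_source` helper and the example sentences are my own illustrations, not Google’s code, and no actual model is trained:

```python
# Toy sketch of the "token trick" behind multilingual NMT: every source
# sentence is prefixed with an artificial token naming the target language,
# and one shared model sees all language pairs. (tag_source is a hypothetical
# helper invented for this illustration.)

def tag_source(sentence: str, target_lang: str) -> str:
    """Prefix the source sentence with a target-language token, e.g. <2ko>."""
    return f"<2{target_lang}> {sentence}"

# Supervised pairs seen in training, e.g. English <-> Japanese and English <-> Korean:
train = [
    (tag_source("Where is the station?", "ja"), "駅はどこですか"),
    (tag_source("駅はどこですか", "en"), "Where is the station?"),
    (tag_source("Where is the station?", "ko"), "역이 어디예요"),
]

# Zero-shot request: Japanese -> Korean, a pairing never seen together in
# training. The model only sees a familiar token plus familiar source text.
zero_shot_input = tag_source("駅はどこですか", "ko")
print(zero_shot_input)  # <2ko> 駅はどこですか
```

Because the target-language token is all that distinguishes one translation direction from another, any shared representation the model learns has to carry meaning in a largely language-neutral way; that shared space is the candidate “interlingua.”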
The Google team calls this “interlingua.” For linguists, this underlying abstract form has been the basis of their field since Chomsky’s earliest writings. “Deep Structure,” or D-structure, is distinct from “Surface Structure,” or S-structure: where the D-structure is something like the Platonic form of a sentence, the S-structure is its concrete realization in phonetic sounds. For example, the sentences I love New York and New York is loved by me have essentially the same meaning. According to Chomsky, the D-structure of both of these sentences is the same, and that deep structure is transformed in different ways en route to the different respective surface realizations.
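To make the idea concrete, here is a deliberately naive toy sketch (my own illustration, not an implementation of Chomskyan syntax) in which one shared deep-structure-like representation is “transformed” into both surface forms:

```python
# Toy illustration: one abstract predicate-argument structure (the "deep
# structure") yields two surface realizations via different "transformations."
# The structure and both functions are invented for illustration only.

def active(d):
    """Active-voice surface form: agent verb patient."""
    return f"{d['agent']} {d['verb']} {d['patient']}"

def passive(d):
    """Passive-voice surface form; the agent surfaces in object case."""
    obj = {"I": "me"}.get(d["agent"], d["agent"])  # e.g. "I" -> "me"
    return f"{d['patient']} is {d['verb']}d by {obj}"

deep = {"verb": "love", "agent": "I", "patient": "New York"}
print(active(deep))   # I love New York
print(passive(deep))  # New York is loved by me
```

The point of the toy is only that both sentences are derivable from one shared representation; the real transformational machinery is, of course, vastly more constrained and general.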
The field of generative syntax has been primarily concerned with elucidating the rules and constraints that each and all languages undergo during this transformational process. If we can unwind these transformations, peeling back layer upon layer of surface structure, then we can uncover the deep structure underlying all of language.
And now, it’s my turn to be speculative: For the last 20 years, computational linguists have been trying to apply the rules and constraints of generative syntax to the computational tasks of natural language understanding and translation. However, rules-based accounts have been less successful than the more flexible probability-based algorithms. The result has been that many “language engineers” have become dismissive of the rules-based Chomskian community.
But if we (speculatively) assume that Google’s algorithms have uncovered an underlying interlingua, then perhaps this means that Chomsky’s notion of D-structure has been right all along, we’ve just been going about the process of uncovering it in the wrong way. Whereas generative syntacticians base most of their findings on patterns in a single language or single collection of languages, maybe the real findings lie in the space between languages, the glue that binds it all together.
Of course, the findings of many deep learning-based systems are notoriously difficult to suss apart, so we don’t really know what the features of this possible interlingua look like. While this is frustrating, I suppose it also means there is still plenty of work left for a budding computational linguist. And if we can start to elucidate the ties that linguistically bind us, maybe we can elucidate the ties that bind humanity, as well.
A Funny Thing Happens When Typing Polysyllabic Words
December 28, 2015
Over the holidays I’ve been emptying my digital pockets and finding all sorts of fun knick-knacks. In particular, Jessie Daniels describes how to be a scholar now, when peer-reviewed articles can begin as tweets and blog posts. Taking up her clarion call, I thought I’d give it a shot.
[Warning: These findings are minimal and preliminary. A much more thorough analysis needs to be done, and many many more statistical tests need to be run.]
I’ve been meaning to study variation in language production, specifically on a word-by-word basis. For example, how does one typist or one population of typists produce the word giraffe versus another typist or population of typists?
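For context, here is a minimal sketch of how such pause times can be computed, assuming a keystroke log of (character, millisecond-timestamp) pairs; the log format is my assumption, not the actual logging setup:

```python
# Minimal sketch: compute the pause (inter-key interval) before each letter
# from a timestamped keystroke log. The log format is a hypothetical
# (char, timestamp_ms) list, invented for this illustration.

def pause_times(keystrokes):
    """Return (char, pause_ms) for each keystroke; the first pause is 0."""
    pauses = []
    prev_t = None
    for ch, t in keystrokes:
        pauses.append((ch, 0 if prev_t is None else t - prev_t))
        prev_t = t
    return pauses

# A made-up typing record of "giraffe":
log = [("g", 1000), ("i", 1180), ("r", 1320), ("a", 1490),
       ("f", 1650), ("f", 1770), ("e", 1930)]
print(pause_times(log))
```

In practice the interesting comparisons are between these per-letter pauses at different positions: word-initial, syllable-boundary, and syllable-internal.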
I took a few polysyllabic words from the word list used by Molly Lewis & Michael C. Frank in their recent paper The length of words reflects their conceptual complexity, and measured the pause times (intervals) before each letter. Here are the results:
The first thing to note is that pauses before a word are much longer than pauses within a word. This finding is well-established, though.
More interesting (to me, at least) is what happens at syllable boundaries. In the two compound words because and someone, the pause at the syllable boundary is more pronounced. An unpaired t-test shows that pauses at the syllable boundary are significantly longer than syllable-internal pauses (p < 0.01), whereas the syllable-internal pauses do not differ significantly from one another.
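The test itself is simple enough to sketch with the standard library; the millisecond values below are made-up illustrative numbers, not the data from this study:

```python
# Sketch of an unpaired (two-sample) t-test comparing syllable-boundary
# pauses with syllable-internal pauses. All numbers are invented for
# illustration; they are not the study's measurements.
import math
from statistics import mean, variance

def unpaired_t(a, b):
    """Student's two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

boundary = [210, 195, 230, 205, 220, 215]   # e.g. pauses before "cause" in because
internal = [120, 135, 110, 125, 130, 115]   # e.g. pauses within a syllable
t = unpaired_t(boundary, internal)
print(round(t, 2))  # 14.38 -- well above the p < 0.01 critical value (~3.17 at df = 10)
```

With real data one would also want effect sizes and a correction for multiple comparisons, which is part of the “many more statistical tests” caveat above.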
In typing research, a more pronounced pause time indicates “more cognition” is happening. There is some process, such as downloading a word into the lexical buffer, that causes a slowdown in figuring out which key to strike next. It is possible that we are observing a phenomenon where lexical retrieval occurs at the syllable level when a word is made up of multiple words, even if those words do not “compose” the compound word.
Specifically, the word someone can reasonably be decomposed into some + one. It might make sense that someone is downloaded syllable-by-syllable, and we see that delay in typing as the next word/syllable is retrieved.
More surprisingly, though, we do not think of because as being composed of be + cause, even though these are two perfectly good words. Nonetheless, we see _something_ happening when the next word/syllable is retrieved.
None of these delays, though, are observed in the words people and about, although I suppose about = a + bout.
tl;dr: Something fun is going on with multisyllabic, compound words. It needs a lot more investigation, and I plan on doing just that over the holidays.
NAACL ’15 Roundup
June 7, 2015
I just returned from NAACL 2015 in beautiful Denver, CO. This was my first “big” conference, so I didn’t know quite what to expect. Needless to say, I was blown away (for better or for worse).
First, a side note: I’d like to thank the NAACL and specifically the conference chair Rada Mihalcea for providing captions during the entirety of the conference. Although there were some technical hiccups, we all got through them. Moreover, Hal Daumé and the rest of the NAACL board were extremely receptive to expanding accessibility going forward. I look forward to working with all of them.
Since this was my first “big” conference, this is also my first “big” conference writeup. Let’s see how it goes.
Keynote #1: Lillian Lee Big Data Pragmatics etc….
- This was a really fun and insightful talk to open the conference. There were a few themes within Lillian’s talk, but my two favorites were why movie quotes become popular and why we use hedging. Regarding the first topic, my favorite quote was: “When Arnold says, ‘I’ll be back’, everyone talked about it. When I say ‘I’ll be back’, you guys are like ‘Well, don’t rush!'”
- The other theme I really enjoyed was “hedging” and why we do it. I find this topic fascinating, since it’s all around us. For instance, in saying “I’d claim it’s 200 yards away,” the phrase “I’d claim” adds no new information. So why do we say it? I think this is also a hallmark of hipster-speak, e.g. “This is maybe the best bacon I’ve ever had.”
Ehsan Mohammady Ardehaly & Aron Culotta Inferring latent attributes of Twitter users with label regularization
- This paper uses a lightly-supervised method to infer attributes like age and political orientation. It therefore avoids the need for costly annotation. One way that they infer attributes is by determining which Twitter accounts are central to a certain class. Really interesting, and I need to read the paper in-depth to fully understand it.
One Minute Madness
- This was fun. Everyone who presented a poster had one minute to preview/explain their poster. Some “presentations” were funny and some really pushed the 60-second mark. Joel Tetreault did a nice job enforcing the time limit. Here’s a picture of the “lineup” of speakers.
Nathan Schneider & Noah Smith A Corpus and Model Integrating Multiword Expressions and Supersenses
- Nathan Schneider has been doing some really interesting semantic work, whether on FrameNet or MWEs. Here, the CMU folks did a ton of manual annotation of the “supersense” of words and MWEs. Not only do they achieve some really impressive results on MWE tagging, but they have also provided a really valuable resource to the MWE community in the form of their STREUSLE 2.0 corpus of annotated MWEs/supersenses.
Keynote #2: Fei-Fei Li A Quest for Visual Intelligence in Computers
- This was a fascinating talk. The idea here is to combine image recognition with semantics/NLP. For a computer to really “identify” something, it has to understand its meaning; pixel values are not “meaning.” I wish I had taken better notes, but Fei-Fei’s lab was able to achieve some incredibly impressive results. Of course, even the best image recognition makes some (adorable) mistakes.
Manaal Faruqui et al. Retrofitting Word Vectors to Semantic Lexicons
- This was one of the papers that won a Best Paper Award, and for good reason. It addresses a fundamental conflict in computational linguistics, specifically within computational semantics: distributional meaning representation vs. lexical semantics. The authors combine distributional vector representations with information from lexicons such as WordNet and FrameNet, and achieve significantly higher accuracy on semantic evaluation tasks across multiple languages. Moreover, their methods are highly modular, and they have made their tools available online. This is something I look forward to tinkering around with.
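The core of the retrofitting method is simple enough to sketch. Below is my stdlib-only reading of the iterative update, in which each word vector is pulled toward its lexicon neighbors while staying anchored to its original distributional vector; the tiny 2-d vectors and the toy synonym lexicon are invented for illustration:

```python
# Sketch of the iterative retrofitting update in the spirit of Faruqui et al.:
# each vector q[w] moves toward its lexicon neighbors (e.g. WordNet synonyms)
# while staying anchored to its original distributional vector q_hat[w].
# Vectors and lexicon below are toy data invented for this illustration.

def retrofit(q_hat, edges, iters=10, alpha=1.0):
    q = {w: list(v) for w, v in q_hat.items()}
    for _ in range(iters):
        for w, nbrs in edges.items():
            beta = 1.0 / len(nbrs)          # a common degree-based weighting
            for d in range(len(q[w])):
                num = alpha * q_hat[w][d] + sum(beta * q[n][d] for n in nbrs)
                q[w][d] = num / (alpha + beta * len(nbrs))
    return q

q_hat = {"happy": [0.9, 0.1], "glad": [0.1, 0.9], "sad": [-0.8, 0.0]}
lexicon = {"happy": ["glad"], "glad": ["happy"]}   # toy synonym edges
q = retrofit(q_hat, lexicon)
print(q["happy"], q["glad"])  # the two synonyms end up closer together
```

After a few sweeps the synonym pair converges toward each other while “sad,” which has no lexicon edges, is left untouched; that locality is what makes the approach so modular.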
Some posters that I really enjoyed
- Oracle and Human Baselines for Native Language Identification – Shervin Malmasi, Joel Tetreault and Mark Dras
- Lexicon-Free Conversational Speech Recognition with Neural Networks – Andrew L. Maas, Ziang Xie, Dan Jurafsky, Andrew Y. Ng
- Using Zero-Resource Spoken Term Discovery for Ranked Retrieval – Jerome White et al.
- Recognizing Textual Entailment using Dependency Analysis and Machine Learning – Nidhi Sharma, Richa Sharma and Kanad K. Biswas
- Deep learning and neural nets are still breaking new ground in NLP. If you’re in the NLP domain, it would behoove you to gain a solid understanding of them, because they can achieve some incredibly impressive results.
- Word embeddings: The running joke throughout the conference was that if you wanted your paper to be accepted, it had to include “word embeddings” in the title. Embeddings were everywhere (I think I saw somewhere that ~30% of the posters included this in their title). Even Chris Manning felt the need to comment on this in his talk/on Twitter:
RT @aidotech: Chris actually showing a tweet on his slides! #deeplearning #naacl2015 pic.twitter.com/GWI7rDiQVC
— StanfordCSLI (@StanfordCSLI) June 5, 2015
Takeaways for Future Conferences
- I should’ve read more of the papers beforehand. Then I would have been better prepared to ask good questions and get more out of the presentations.
- As Andrew warned me beforehand, “You will burn out.” And he was right. There’s no way to fully absorb every paper at every talk you attend. At some point, it becomes beneficial to just take a breather and do nothing. I did this Wednesday morning, and I’m really glad I did it.
- Get to breakfast early. If you come downstairs 10 minutes before the first session, you’ll be scraping the (literal) bottom of the barrel on the buffet line.
Shameless self-citation: Here is the paper Andrew and I wrote for the conference.