Wednesday, December 23, 2009

Get Your LangFreq On Twitter

New Tool LangFreq Helps to Uncover the Most Common Words on Twitter

Everybody loves Twitter for the people, but not many people know how useful the bots can be. When Evan Williams talked about the impact of persistent communication on being social at LeWeb03 in December 2007, it was in the context of Twitter as a command line interface.

The idea had been around for quite some time and came about as Twitter began to open up the API, heralding the rise of Twitter as a mobile applications platform. A whole ecosystem of apps and bots sprang up around the API, and they now inhabit the Twitter fan wiki in various states of (dis)repair.

One of the points Ev liked to make was that too often we ask "What can we add to a product to make it better?" He thought we should instead be asking "What can we take away to create something new?"

The new austerity.

Many of the bots that have sprung up relate to productivity or novelty, but there is also a small handful devoted to languages. One of the most recent additions to this burgeoning hive of activity is LangFreq by Zyaga. LangFreq is a set of language tools covering word frequency, phrase rank and comparison, language translation and language identification.

Zyaga and I first talked about his ideas in my Japanese classes on eduFire as he was developing these web tools to compare word frequency across a few languages. During an exchange of emails I suggested that Twitter would not only be a good source of data, but a good platform for a command line interface to his web work.

A Twitter bot is born.

The LangFreq suite of language learning tools is in beta, so there is still plenty of work to be done, but it's already showing lots of promise. LangFreq is built on the simple premise that someone studying the 100 most frequent words in any language would be much further ahead than someone studying 100 words at random.

Linguists have long been aware that there might be some advantage to learners in identifying a core vocabulary[1]. There are a certain number of high frequency words in each language that cover a large proportion of words in common use[2]. The notion of a core vocabulary has also become a central principle of some language learning systems, notably Pimsleur.
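To make the idea of coverage concrete, here is a minimal sketch in Python. It is not LangFreq's code; the function names `top_words` and `coverage` and the sample sentence are made up for illustration, and the tokenizer is a crude one that only suits English-like text.

```python
from collections import Counter
import re

def tokenize(text):
    """Crude tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def top_words(text, n):
    """Return the n most frequent words in text with their counts."""
    return Counter(tokenize(text)).most_common(n)

def coverage(text, n):
    """Fraction of all word tokens accounted for by the n most frequent words."""
    words = tokenize(text)
    if not words:
        return 0.0
    covered = sum(count for _, count in Counter(words).most_common(n))
    return covered / len(words)

# Hypothetical sample text, standing in for a real corpus such as a pile of tweets.
sample = "the cat sat on the mat and the dog sat on the log"
print(top_words(sample, 3))
print(coverage(sample, 3))
```

On a real corpus you would compare the coverage figure for the top 100 words against the coverage for 100 words picked at random, which is the comparison the premise above rests on.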

LangFreq addresses the need to know these most frequent words simply and cleanly on the website for English and Spanish. For Japanese, however, the situation is a little more difficult. The trouble with Japanese is that it is difficult to define where one word ends and another begins[3]. Although breaking words down into smaller units may make them easier to work with, doing so limits the applicability of the tool.

Japanese is a language largely built on compounds of two or more kanji and the okurigana that help identify the nature of the verb conjugation or adjectival inflection. It will be interesting to see how Zyaga solves this computational problem. It might be worthwhile taking another look at Jim Breen's classic Japanese dictionary WWWJDIC, or looking at the work of Rick Noelle and his Japanese Sentence Parser, which uses the MeCab morphological analyzer.
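A quick way to see the problem is to try the naive approaches in Python. The sketch below is only an illustration, not how LangFreq or MeCab works: `script_runs` is a made-up helper that cuts a sentence wherever the script changes between kanji, hiragana and katakana, using approximate Unicode ranges.

```python
import re

# Runs of kanji, hiragana or katakana, matched by Unicode block.
# These ranges are a rough approximation, good enough for a demonstration.
SCRIPT_RUNS = re.compile(r"[\u4e00-\u9fff]+|[\u3040-\u309f]+|[\u30a0-\u30ff]+")

def script_runs(sentence):
    """Split a sentence into runs of the same script (a naive heuristic)."""
    return SCRIPT_RUNS.findall(sentence)

sentence = "私は日本語を勉強しています"

# There are no spaces, so splitting on whitespace returns the whole sentence.
print(sentence.split())

# Splitting on script changes gets closer, but it is still not word segmentation.
print(script_runs(sentence))

# The verb 食べる is one word, yet the heuristic separates the kanji stem
# from its okurigana.
print(script_runs("食べる"))
```

The last line shows why counting by script runs would give misleading frequencies: a kanji stem and its okurigana belong together, which is the kind of decision a morphological analyzer is built to make.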

Taking a closer look at the bot.

Please go and take a closer look at the bot @langfreq. I'll be doing an in-depth comparison with some other similar bots in the next week or so.

It's pretty easy to get started. There are only four commands: help, rank, translate and identify. The bot is an elegant way to get this kind of information about core words when you are out and about.

Have a play around with it and tell me what you think. Do you use bots like these?

1. Carter, Ronald. “Is there a Core Vocabulary? Some Implications for Language Teaching.” Applied Linguistics 8, no. 2 (February 1, 1987): 178-193.

2. Nation, I. S. P. Learning Vocabulary in Another Language. Cambridge University Press, 2001. [pdf]

3. Douglas, M. O. “Japanese Cloze Tests: Toward Their Construction.” [pdf]