Tika's language detection is not designed for short texts like tweets or Facebook statuses, as acknowledged in its documentation1. Nonetheless, I wanted to know what to expect when detecting the language of short documents such as tweets.
In a nutshell
I compared the language identified by Twitter with the language identified by Tika 1.5 on over 500,000 tweets. Tika's overall accuracy was a mere 48 %. However, when the results are broken down by language, Tika achieves a precision of 98 % or more on Greek, English, Spanish, Russian and Thai, at the expense of a recall ranging from 28 % to 75 %. Tika is thus a valid option for those languages, as long as missing many documents is not an issue.
I collected a corpus using twitter-sampler, an application that collects tweets from Twitter's sample stream. Over 700,000 tweets were downloaded, from which I removed tweets in languages unknown to Tika 1.52, leaving a corpus of 503,337 tweets.
I ran Tika's language detection on the tweets' text using a small program and saved the results in a CSV file (sample for easy viewing). Each row of this file contains a tweet id, Twitter's language tag, Tika's language tag and a binary flag indicating whether Tika is confident in its prediction. Note that Tika is not confident about any of its predictions.
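The detection step can be sketched with Tika 1.5's `LanguageIdentifier` class. This is not the exact program I used, just a minimal illustration; the sample tweet text is made up, and in practice the text would come from the downloaded corpus:

```java
import org.apache.tika.language.LanguageIdentifier;

public class DetectLanguage {
    public static void main(String[] args) {
        // Hypothetical tweet text; in the experiment this came from the corpus
        String tweet = "Just landed in Madrid, the weather is great!";

        LanguageIdentifier identifier = new LanguageIdentifier(tweet);
        String lang = identifier.getLanguage();             // ISO 639-1 code, e.g. "en"
        boolean certain = identifier.isReasonablyCertain(); // the confidence flag stored in the CSV

        System.out.println(lang + "," + certain);
    }
}
```

`isReasonablyCertain()` is the source of the binary confidence flag mentioned above; on short texts like tweets it tends to return false.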
| Language (ISO 639-1 code) | Size | Precision | Recall | F-measure |
|---|---|---|---|---|
As noted earlier, recall is poor for nearly all languages. However, precision is over 98 % for Greek, English, Spanish, Russian and Thai. If the task were to collect a corpus of micro-blogs in any of those languages, and we could afford to ignore many valid documents, Tika would still be a viable option.
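For reference, the per-language precision, recall and F-measure reported above can be computed from the (Twitter tag, Tika tag) pairs in the CSV. A minimal sketch, with a made-up five-tweet example rather than the real data:

```java
public class Metrics {
    // Per-language precision, recall and F-measure from parallel arrays of
    // gold labels (Twitter's tag) and predicted labels (Tika's tag).
    static double[] score(String[] gold, String[] pred, String lang) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < gold.length; i++) {
            boolean g = gold[i].equals(lang);
            boolean p = pred[i].equals(lang);
            if (g && p) tp++;        // correctly tagged as lang
            else if (p) fp++;        // tagged as lang but actually another language
            else if (g) fn++;        // lang tweet missed by the detector
        }
        double precision = (tp + fp == 0) ? 0 : (double) tp / (tp + fp);
        double recall    = (tp + fn == 0) ? 0 : (double) tp / (tp + fn);
        double f = (precision + recall == 0)
                ? 0 : 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f };
    }

    public static void main(String[] args) {
        // Toy example, not the real corpus
        String[] twitter = { "en", "en", "es", "en", "ru" };
        String[] tika    = { "en", "es", "es", "en", "en" };
        double[] m = score(twitter, tika, "en");
        System.out.printf("precision=%.2f recall=%.2f f=%.2f%n", m[0], m[1], m[2]);
    }
}
```

High precision with low recall, as observed for Greek, English, Spanish, Russian and Thai, means few false positives but many false negatives: what Tika does tag is usually right, but it misses a lot.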