Evaluating Tika language detection on tweets

Tika's language detection is not designed for short texts such as tweets or Facebook statuses, as acknowledged in its documentation1. Nonetheless, I wanted to know what to expect when detecting the language of short documents like tweets.

In a nutshell

I compared the language identified by Twitter with the language identified by Tika 1.5 on over 500,000 tweets. Tika's global accuracy was a mere 48 %. However, when the results are broken down by language, Tika achieves a precision of 98 % or more on Greek, English, Spanish, Russian and Thai, at the expense of a recall ranging between 28 % and 75 %. Tika is thus a valid option for those languages, as long as missing many documents is not an issue.

Experiment

I collected a corpus using twitter-sampler, an application that collects tweets from Twitter's sample stream. Over 700,000 tweets were downloaded, from which I removed tweets in languages unknown to Tika 1.52, resulting in a corpus of 503,337 tweets.

I ran Tika language detection on the tweets' texts using a small program and saved the results in a CSV file (sample for easy viewing). Each row of this file contains a tweet id, Twitter's language tag, Tika's language tag and a binary flag specifying whether Tika is confident in its prediction. Note that Tika is not confident about any of its predictions.
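To make the evaluation file concrete, here is a minimal Python sketch of how such a CSV could be consumed to get the global accuracy figure. The column layout follows the description above, but the sample rows and ids are made up for illustration:

```python
import csv
import io

# Hypothetical sample mirroring the described columns:
# tweet id, Twitter's language tag, Tika's language tag, confidence flag
sample = """1,en,en,false
2,es,pt,false
3,ru,ru,false
"""

rows = list(csv.reader(io.StringIO(sample)))
matches = sum(1 for _id, twitter_lang, tika_lang, _conf in rows
              if tika_lang == twitter_lang)

accuracy = matches / len(rows)
print(f"global accuracy: {accuracy:.2f}")
```

Accuracy here is simply the fraction of rows where Tika's tag agrees with Twitter's tag, which is how the 48 % figure above is defined.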

This CSV file was then fed to an R script computing precision, recall and f-measure3 for each language:

Language (ISO 639-1 code)  size  precision  recall  f-measure
da 614 0.05 0.07 0.06
de 5515 0.61 0.25 0.36
el 457 1.00 0.75 0.86
en 240424 0.98 0.45 0.62
es 135879 0.98 0.28 0.44
et 4041 0.03 0.25 0.05
fi 657 0.03 0.31 0.05
fr 24422 0.72 0.44 0.55
hu 867 0.01 0.24 0.03
is 245 0.01 0.15 0.02
it 10080 0.19 0.58 0.29
lt 349 0.00 0.32 0.01
nl 8039 0.55 0.30 0.39
no 389 0.00 0.39 0.01
pl 3698 0.28 0.48 0.35
pt 48340 0.68 0.34 0.45
ru 11513 0.99 0.65 0.79
sk 1771 0.02 0.13 0.03
sl 2723 0.08 0.21 0.11
sv 1745 0.42 0.37 0.39
th 840 1.00 0.65 0.79
uk 728 0.17 0.83 0.28
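The metrics in the table are the standard ones. As a cross-check, here is a small Python sketch (pure standard library; the function name and sample pairs are mine, the original used R) that computes per-language precision, recall and f-measure from (gold, predicted) pairs:

```python
from collections import Counter

def per_language_metrics(pairs):
    """pairs: iterable of (gold_lang, predicted_lang) tuples.
    Returns {lang: (precision, recall, f_measure)}."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in pairs:
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # pred wrongly claimed this language
            fn[gold] += 1  # this gold language was missed
    metrics = {}
    for lang in set(tp) | set(fp) | set(fn):
        p = tp[lang] / (tp[lang] + fp[lang]) if tp[lang] + fp[lang] else 0.0
        r = tp[lang] / (tp[lang] + fn[lang]) if tp[lang] + fn[lang] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        metrics[lang] = (p, r, f)
    return metrics

# Toy example, not the real corpus:
pairs = [("en", "en"), ("en", "fr"), ("fr", "fr"), ("es", "en")]
for lang, (p, r, f) in sorted(per_language_metrics(pairs).items()):
    print(f"{lang}: precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```

For a given language, precision is computed over the tweets Tika labelled with that language, while recall is computed over the tweets Twitter labelled with it, which is why a language can combine high precision with low recall.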

As I said earlier, recall is bad for nearly all languages. However, precision is 98 % or more for Greek, English, Spanish, Russian and Thai. If the task were to collect a corpus of micro-blogs in any of those languages, and we could afford to ignore many valid documents, Tika would still be a viable option.


  1. LanguageIdentifier#isReasonablyCertain() 

  2. be, ca, da, de, el, en, eo, es, et, fi, fr, gl, hu, is, it, lt, nl, no, pl, pt, ro, ru, sk, sl, sv, th, uk 

  3. For an introduction to precision and recall, see Wikipedia
