  1. Evaluating Tika language detection on tweets

    Tika language detection is not designed for short texts such as tweets or Facebook statuses, as its documentation acknowledges. Nonetheless, I wanted to know what to expect when detecting the language of short documents like tweets.

    In a nutshell

    I compared the language identified by Twitter to the language ...

    Tagged as : tika nlp
  2. Using Ruta in a Maven project

    For those who are unfamiliar with UIMA and its ecosystem, Ruta (for RUle-Based Text Annotation) is a tool for rule-based information extraction. For example, a very simple date extractor could look like:

    PACKAGE com.textjuicer.ruta.date;
    DECLARE Date;
    DECLARE Day;
    DECLARE Month;
    DECLARE Year;
    // A date is a month ...
    Tagged as : ruta uima uimafit
  3. Words lists in a shell

    I often need to manipulate sets of words that I want to compare. This post presents some of the one-liners that I frequently use to manipulate such lists.

    Get a set of words from a text file

    If you start from a text file, the following command will convert ...
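
The command from the post itself is cut off in this excerpt. As a stand-in, here is a minimal sketch of one common way to turn a text file into a sorted set of words. The function name `words_from_file` is hypothetical, and the pipeline assumes that a "word" is any run of alphabetic characters and that case should be ignored; the post may use a different definition.

```shell
# Hypothetical helper, not the command from the post.
# Assumption: a word is a run of alphabetic characters, compared case-insensitively.
words_from_file() {
  # 1. Replace every run of non-alphabetic characters with a single newline.
  # 2. Lowercase everything.
  # 3. Drop the empty line left behind when the file starts with a non-letter.
  # 4. Sort and deduplicate to obtain a set.
  tr -cs '[:alpha:]' '\n' < "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | grep -v '^$' \
    | sort -u
}
```

Calling `words_from_file notes.txt > notes.words` would write one distinct word per line. Because the output is sorted, it can be fed directly to tools such as `comm` to compare two lists.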

    Tagged as : shell
