Using Ruta in a Maven project

For those who are unfamiliar with UIMA and its ecosystem, Ruta (for RUle-Based Text Annotation) is a tool for rule-based information extraction. For example, a very simple date extractor could look like:

PACKAGE com.textjuicer.ruta.date;

DECLARE Date;
DECLARE Day;
DECLARE Month;
DECLARE Year;

// A date is a month, followed by a day, optionally an ordinal suffix and a year
//
// examples : "january 1st 2008", "february 28, 2010"
W{REGEXP("(?i)(january|february|march|...|november|december)") -> MARK(Month)}
    NUM {-> MARK(Day)}
    W?{REGEXP("(?i)(th|st|nd|rd)")}
    COMMA?
    NUM{-> MARK(Year), MARK(Date, 1, 5)}
    ;

// A date can also be specified in the MM/DD/YYYY format
NUM{-> MARK(Month)} "/" NUM{-> MARK(Day)} "/" NUM{-> MARK(Year), MARK(Date, 1, 5)};

// We can also create rules over our Date annotations
DECLARE DateRange;
"from" Date "to" Date{-> MARK(DateRange, 1, 4)};

Running this script on:

A date that should be extracted is September 1st 2013.
We can also use another format 12/31/1999.
A simple date range is from September 1st 2013 to September 3rd 2013.

will produce the following annotations:

[Figure: Date annotations]
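If you are more at home with regular expressions than with Ruta rules, the two Date rules behave roughly like the java.util.regex pattern below. This is only an approximation for illustration: Ruta matches over token annotations rather than raw characters, the class name DateRegexSketch is made up for this post, and the month list is written out in full here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a regex approximation of the two Date rules, not Ruta itself.
final class DateRegexSketch {

    private static final String MONTHS =
        "january|february|march|april|may|june|july|august|september|october|november|december";

    private static final Pattern DATE = Pattern.compile(
        // month, day, optional ordinal suffix, optional comma, year
        "\\b(?:" + MONTHS + ")\\s+\\d+(?:th|st|nd|rd)?(?:\\s*,\\s*|\\s+)\\d+"
        // or the MM/DD/YYYY format
        + "|\\d+/\\d+/\\d+",
        Pattern.CASE_INSENSITIVE);

    // Returns the text of every date-like match, in document order.
    static List<String> findDates(String text) {
        List<String> dates = new ArrayList<>();
        Matcher matcher = DATE.matcher(text);
        while (matcher.find()) {
            dates.add(matcher.group());
        }
        return dates;
    }

    public static void main(String[] args) {
        String text = "A date that should be extracted is September 1st 2013.\n"
            + "We can also use another format 12/31/1999.\n"
            + "A simple date range is from September 1st 2013 to September 3rd 2013.";
        for (String date : findDates(text)) {
            System.out.println("Found date: " + date);
        }
    }
}
```

Unlike the Ruta script, this sketch does not give you separate Day, Month and Year annotations, and it cannot build a DateRange on top of previously found dates.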

Not bad! But what if we need to export this annotator from Eclipse to a Maven project?

Ruta in a Maven project

Copy the script and its descriptor

All Ruta scripts are interpreted by the org.apache.uima.ruta.engine.RutaEngine analysis engine, so each script needs its own RutaEngine descriptor. Writing these descriptors by hand can be tedious (see NaiveDateExtractorEngine.xml for an example), which is why the Ruta Workbench generates them for us. Each time a Ruta script is saved, the Workbench updates the matching descriptor under the descriptor/ directory. You need to copy both the script and its descriptor into your Maven project.

For example, the descriptor for a script named scripts/com/textjuicer/ruta/date/NaiveDateExtractor.ruta will be saved under descriptor/com/textjuicer/ruta/date/NaiveDateExtractorEngine.xml. You can copy both of these files to src/main/resources/com/textjuicer/ruta/date in your Maven project.
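The naming convention in this example can be written down in a few lines of Java. The sketch below is only an illustration: DescriptorPaths and its methods are hypothetical helpers, and the scripts/ to descriptor/ mapping is inferred from the example above rather than taken from the Workbench itself.

```java
// Hypothetical helper illustrating the script-to-descriptor naming seen in the example.
final class DescriptorPaths {

    private static final String SCRIPT_ROOT = "scripts/";
    private static final String SCRIPT_SUFFIX = ".ruta";
    private static final String DESCRIPTOR_ROOT = "descriptor/";

    // "scripts/a/b/Foo.ruta" -> "descriptor/a/b/FooEngine.xml"
    static String descriptorPathFor(String scriptPath) {
        return DESCRIPTOR_ROOT + relativeName(scriptPath) + "Engine.xml";
    }

    // "scripts/a/b/Foo.ruta" -> "a.b.FooEngine", the name used to load the
    // descriptor from the classpath once it sits under src/main/resources.
    static String engineNameFor(String scriptPath) {
        return relativeName(scriptPath).replace('/', '.') + "Engine";
    }

    // Strips the assumed "scripts/" prefix and ".ruta" suffix.
    private static String relativeName(String scriptPath) {
        if (!scriptPath.startsWith(SCRIPT_ROOT) || !scriptPath.endsWith(SCRIPT_SUFFIX)) {
            throw new IllegalArgumentException("Not a script path: " + scriptPath);
        }
        return scriptPath.substring(
            SCRIPT_ROOT.length(), scriptPath.length() - SCRIPT_SUFFIX.length());
    }

    public static void main(String[] args) {
        String script = "scripts/com/textjuicer/ruta/date/NaiveDateExtractor.ruta";
        // prints descriptor/com/textjuicer/ruta/date/NaiveDateExtractorEngine.xml
        System.out.println(descriptorPathFor(script));
        // prints com.textjuicer.ruta.date.NaiveDateExtractorEngine
        System.out.println(engineNameFor(script));
    }
}
```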

Import the type systems by name

Note that Ruta descriptors import their type systems by location (from the file system). To package your script in a Maven artifact, you need to import them by name (from the classpath) instead:

<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
    [...]
    <analysisEngineMetaData>
        [...]
        <typeSystemDescription>
            [...]
            <imports>
                <!-- Replace <import location="../../../BasicTypeSystem.xml"/>   by -->
                <import name="org.apache.uima.ruta.engine.BasicTypeSystem"/>
            </imports>
            [...]
        </typeSystemDescription>
        [...]
    </analysisEngineMetaData>
    [...]
</analysisEngineDescription>
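With more than a couple of descriptors it is easy to miss an import. As a rough sanity check, the sketch below lists the imports that still use a location attribute. It relies only on the JDK's XML parser; the class name ImportCheck and its method are made up for this post, and it inspects every import element it finds, wherever it sits in the descriptor.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: finds imports that are still resolved from the file system.
final class ImportCheck {

    // Returns the location attribute of every <import> that has one.
    static List<String> locationImports(String descriptorXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(descriptorXml)));
            NodeList imports = document.getElementsByTagName("import");
            List<String> locations = new ArrayList<>();
            for (int i = 0; i < imports.getLength(); i++) {
                Element element = (Element) imports.item(i);
                if (element.hasAttribute("location")) {
                    locations.add(element.getAttribute("location"));
                }
            }
            return locations;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse descriptor", e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<analysisEngineDescription xmlns=\"http://uima.apache.org/resourceSpecifier\">"
            + "<analysisEngineMetaData><typeSystemDescription><imports>"
            + "<import location=\"../../../BasicTypeSystem.xml\"/>"
            + "<import name=\"org.apache.uima.ruta.engine.BasicTypeSystem\"/>"
            + "</imports></typeSystemDescription></analysisEngineMetaData>"
            + "</analysisEngineDescription>";
        // prints [../../../BasicTypeSystem.xml]
        System.out.println(locationImports(xml));
    }
}
```

An empty list means every import in the descriptor is already resolved by name.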

Add a dependency to Ruta

First, declare a dependency on ruta-core in your pom.xml:

<dependency>
    <groupId>org.apache.uima</groupId>
    <artifactId>ruta-core</artifactId>
    <version>2.0.1</version>
</dependency>

And while you are there, add a dependency on uimaFIT:

<dependency>
    <groupId>org.apache.uima</groupId>
    <artifactId>uimafit-core</artifactId>
    <version>2.0.0</version>
</dependency>

Run the script

Once your script and its descriptor are under the resource directory, it is easy to run your script from Java:

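import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.util.CasUtil;

// create the annotation engine to extract dates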
final AnalysisEngine engine =
    AnalysisEngineFactory.createEngine("com.textjuicer.ruta.date.NaiveDateExtractorEngine");
final CAS cas = engine.newCAS();

// build a document to process
cas.setDocumentText("text with a date, like September 2nd, 2013");

// annotate the document
engine.process(cas);

// extract dates from the annotated document
final Type dateType = cas.getTypeSystem().getType("com.textjuicer.ruta.date.NaiveDateExtractor.Date");
for (AnnotationFS date : CasUtil.select(cas, dateType)) {
    System.out.println("Found date: " + date.getCoveredText());
}
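The type name passed to getType() above is easy to get wrong, because it contains the script name as well as the package. Judging from this example, it is the script's PACKAGE, then the script name, then the declared type. The tiny helper below spells that out; RutaTypeNames is a hypothetical name, and the pattern is inferred from the example rather than quoted from the Ruta documentation.

```java
// Hypothetical helper: builds the type name as it appears in the example above.
final class RutaTypeNames {

    // <PACKAGE>.<script name>.<declared type>
    static String qualifiedTypeName(String scriptPackage, String scriptName, String declaredType) {
        return scriptPackage + "." + scriptName + "." + declaredType;
    }

    public static void main(String[] args) {
        // prints com.textjuicer.ruta.date.NaiveDateExtractor.Date
        System.out.println(
            qualifiedTypeName("com.textjuicer.ruta.date", "NaiveDateExtractor", "Date"));
    }
}
```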

You are now one mvn deploy away from using your new Ruta script as a Maven dependency in your other projects.

Try it yourself

I have pushed a working example to GitHub. Do not hesitate to use it as a starting point for your own projects.
