Apache UIMA DUCC (Distributed UIMA Cluster Computing) is a cluster management system providing tooling, management, and scheduling facilities that automate the scale-out of applications written using the UIMA framework. Core UIMA provides a generalized framework for applications that process unstructured information such as human language, but does not provide a scale-out mechanism. UIMA-AS extends UIMA and provides a scale-out mechanism for distributing UIMA pipelines over a cluster of computing resources, but does not provide job or cluster management of the resources. DUCC extends UIMA-AS by defining a formal job model that closely maps to a standard UIMA pipeline. Around this job model DUCC provides cluster management services to automate the scale-out of UIMA pipelines over computing clusters.
DKPro Core is a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. Many powerful and state-of-the-art NLP components are already freely available in the NLP research community. New and improved components are being developed and released continuously. The components cover the whole range of NLP-related processing tasks. DKPro Core provides wrappers for such third-party tools as well as original NLP components. DKPro Core builds heavily on uimaFIT, which allows for rapid and easy development of NLP processing pipelines.
uimaFIT provides Java annotations for describing UIMA components which can be used to directly describe the UIMA components in Java code without the need for traditional UIMA XML descriptors. This greatly simplifies refactoring a component definition (e.g., changing a configuration parameter name). uimaFIT also makes it easy to instantiate UIMA components without using XML descriptor files by providing convenient factory methods. This makes uimaFIT an ideal library for testing UIMA components because the component can be easily instantiated and invoked without requiring a descriptor file to be created first. uimaFIT is very useful in research environments in which programmatic/dynamic instantiation of UIMA pipelines can simplify experimentation. For example, when performing 10-fold cross-validation across a number of experimental conditions, it can be quite laborious to create a different set of descriptor files for each run, or even a script which generates such descriptor files. uimaFIT is type system agnostic and does not depend on (or provide) a specific type system. This project has been superseded by the Apache uimaFIT project.
Midao JDBC simplifies development with Java JDBC. It is flexible, customizable, and simple and intuitive to use, and provides a lot of functionality: transactions, working with metadata, type handling, profiling, input/output processing and conversion, support for pooled datasource libraries, cached/lazy query execution, named parameters, multiple vendor support out of the box, custom exception handling, and overrides. With a single jar, it supports both JDBC 3.0 (Java 5) and JDBC 4.0 (Java 6). Midao JDBC is well tested. Not only does it have around 700 unit and functional tests, but it is also tested with the latest drivers of Derby, MySQL (MariaDB), PostgreSQL, Microsoft SQL, and Oracle. Midao is a data-centric project. Its goal is to shield Java developers from the nuances of vendor implementations and from standard boilerplate code. Midao JDBC is the first library released under it.
WebAnno is a general purpose Web-based annotation tool for a wide range of linguistic annotations. It offers annotation project management, freely configurable tagsets, and the management of users in different roles. It uses technology from the brat rapid annotation tool for visualizing and editing annotations in a Web browser. It supports annotation and visualization of arbitrarily large documents, pluggable import/export filters, the curation of annotations across various users, and farming out annotations to a crowdsourcing platform.
DKPro Similarity is a framework for text similarity. Its goal is to provide a comprehensive repository of text similarity measures which are implemented using standardized interfaces. The framework is designed to complement DKPro Core, a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. DKPro Similarity comprises a wide variety of measures ranging from ones based on simple n-grams and common subsequences to high-dimensional vector comparisons and structural, stylistic, and phonetic measures. In order to promote the reproducibility of experimental results and to provide reliable, permanent experimental conditions for future studies, DKPro Similarity also comes with a set of full-featured experimental setups which can be run out-of-the-box and which future systems can build upon.
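To make the idea of a measure behind a standardized interface concrete, here is a minimal sketch of one of the simplest measure families mentioned above: a character n-gram overlap (Jaccard) score. The interface name TextSimilarity and the class CharNGramJaccard are hypothetical names chosen for illustration; they are not DKPro Similarity's actual API, and the real framework's interfaces may look quite different.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical common interface for similarity measures (illustration only). */
interface TextSimilarity {
    /** Returns a score in [0, 1]; higher means more similar. */
    double similarity(String a, String b);
}

/** Jaccard overlap of the sets of character n-grams of two strings. */
class CharNGramJaccard implements TextSimilarity {
    private final int n;

    CharNGramJaccard(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        this.n = n;
    }

    /** Collects all distinct substrings of length n. */
    private Set<String> ngrams(String s) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            out.add(s.substring(i, i + n));
        }
        return out;
    }

    @Override
    public double similarity(String a, String b) {
        Set<String> x = ngrams(a);
        Set<String> y = ngrams(b);
        // Assumption: two texts with no n-grams at all are treated as identical.
        if (x.isEmpty() && y.isEmpty()) {
            return 1.0;
        }
        Set<String> shared = new HashSet<>(x);
        shared.retainAll(y);
        int union = x.size() + y.size() - shared.size();
        return (double) shared.size() / union;
    }
}
```

Because every measure would implement the same interface, an experiment can swap CharNGramJaccard for a different measure without changing the surrounding code, which is the practical benefit of standardized interfaces.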
Apache uimaFIT provides Java annotations for describing UIMA components which can be used to directly describe the UIMA components in Java code without the need for traditional UIMA XML descriptors. This greatly simplifies refactoring a component definition (e.g., changing a configuration parameter name). It also makes it easy to instantiate UIMA components without using XML descriptor files by providing convenient factory methods. It is ideal for testing UIMA components because the component can be easily instantiated and invoked without requiring a descriptor file to be created first.
jWeb1T is a Java tool for efficiently searching n-gram data in the Web 1T 5-gram corpus format. It is based on a binary search algorithm that finds the n-grams and returns their frequency counts in logarithmic time. As the corpus is stored in many files, a simple index is used to retrieve the files containing the n-grams.
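The lookup strategy described above can be sketched roughly as follows. This is not jWeb1T's actual code or API: the class NGramLookup and its methods are hypothetical, the "files" are in-memory lists rather than files on disk, and the sketch assumes each entry has the form n-gram, tab, count, with entries sorted by Java's String.compareTo order (the real corpus may use a different sort order).

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative in-memory stand-in for sorted n-gram files (hypothetical, not jWeb1T's API). */
class NGramLookup {
    // Simple index: first n-gram of each "file" -> that file's sorted lines.
    private final TreeMap<String, List<String>> index = new TreeMap<>();

    /** Registers one file; lines must be sorted and look like "n-gram<TAB>count". */
    void addFile(List<String> sortedLines) {
        index.put(key(sortedLines.get(0)), sortedLines);
    }

    /** Extracts the n-gram part of a line (everything before the tab). */
    private static String key(String line) {
        return line.substring(0, line.indexOf('\t'));
    }

    /** Returns the frequency count of the n-gram, or 0 if it is not present. */
    long frequency(String ngram) {
        // Step 1: use the index to pick the only file that could contain the n-gram,
        // i.e. the file whose first n-gram is the greatest one <= the query.
        Map.Entry<String, List<String>> entry = index.floorEntry(ngram);
        if (entry == null) {
            return 0;
        }
        List<String> lines = entry.getValue();

        // Step 2: binary search within that file, O(log n) comparisons.
        int lo = 0;
        int hi = lines.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            String line = lines.get(mid);
            int cmp = key(line).compareTo(ngram);
            if (cmp == 0) {
                return Long.parseLong(line.substring(line.indexOf('\t') + 1));
            } else if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return 0;
    }
}
```

A tool working on the real corpus would presumably seek within files on disk rather than hold the lines in memory; the sketch keeps everything in memory only to stay short and self-contained.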