7 projects tagged "NLP"

Okapi Framework (updated 26 Apr 2010)

Pop 32.09
Vit 1.48

The Okapi project’s main purpose is to provide a set of building blocks for creating larger open source localization and translation tools. Many Okapi components are nevertheless generic enough to interest the text mining, natural language processing, and text retrieval communities. Okapi’s many text filters (HTML, Properties, XML with ITS XPath-based rules, OpenXML, ODF, Regex, etc.) provide a straightforward way to access the text of multiple document formats. Its document events and pipeline can be integrated with other frameworks such as UIMA, LingPipe, OpenPipeline, OpenNLP, GATE, and Lucene. The advantage of Okapi’s text filters is that they not only extract the text but also preserve all non-textual formatting. A document can be decomposed into events, processed via the pipeline, and then rebuilt without loss. Structural information can be added to Okapi document events so that tables, lists, links, titles, etc. are grouped together and treated as a unit. This is useful when context based on a “universal” document structure is needed. The Okapi event model supports user-configurable annotations, similar to UIMA but simpler and more restricted in scope. Users can annotate spans of text or add new resources such as translation memory matches, terminology, token types, or part-of-speech information.
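The decompose, process, and rebuild cycle described above can be illustrated with a small sketch. This is not Okapi's API (Okapi is a Java framework with a much richer event model); the event kinds and the function names `to_events`, `run_pipeline`, and `rebuild` are hypothetical, and the tag-matching regular expression is a deliberately naive stand-in for a real filter.

```python
import re

# Hypothetical event kinds; a real filter distinguishes many more.
TEXT, MARKUP = "text", "markup"

# Naive tag matcher used only for this illustration.
TAG = re.compile(r"(<[^>]*>)")

def to_events(document):
    """Split a document into (kind, content) events, keeping every character."""
    events = []
    for piece in TAG.split(document):
        if not piece:
            continue
        kind = MARKUP if TAG.fullmatch(piece) else TEXT
        events.append((kind, piece))
    return events

def run_pipeline(events, steps):
    """Apply each step to text events; markup events pass through untouched."""
    for kind, content in events:
        if kind == TEXT:
            for step in steps:
                content = step(content)
        yield kind, content

def rebuild(events):
    """Concatenate event contents back into a document."""
    return "".join(content for _, content in events)

html = '<p class="x">Hello <b>world</b>!</p>'
events = to_events(html)

# With no processing steps, the round trip reproduces the input exactly.
assert rebuild(events) == html

# A step that only touches text leaves the formatting intact.
shouted = rebuild(run_pipeline(events, [str.upper]))
print(shouted)
```

Because markup is carried along as events rather than discarded, any text-only processing step can be dropped into the pipeline without risking the document structure.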

Acopost (updated 04 Mar 2010)

Pop 24.06
Vit 1.00

ACOPOST is a set of freely available part-of-speech (POS) taggers modeled after well-known techniques. The programs are written in C (aiming for extreme portability and code correctness/safety) and run under various Unix flavors (and probably even under Windows). ACOPOST currently consists of four taggers that are based on different frameworks: Maximum Entropy Tagger (MET), Trigram Tagger (T3, based on Hidden Markov Models), Error-driven Transformation-based Tagger (TBT or Brill Tagger), and Example-based Tagger (ET).

Language Detection Library for Java (updated 15 Oct 2010)

Pop 52.42
Vit 31.26

The Language Detection Library for Java is a Java library to detect the natural languages in which texts are written. This task is also known as "language identification", "language guessing", and "language recognition". It has over 99% precision for more than 40 languages. The supported languages are Afrikaans, Arabic, Bulgarian, Bengali, Czech, German, Greek, English, Spanish, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Macedonian, Malayalam, Marathi, Nepali, Dutch, Punjabi, Polish, Portuguese, Romanian, Russian, Slovak, Somali, Albanian, Swedish, Swahili, Tamil, Telugu, Thai, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, and Simplified/Traditional Chinese.

Apache OpenNLP (updated 29 Nov 2011)

Pop 78.28
Vit 1.53

Apache OpenNLP is a machine-learning-based toolkit for processing natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services.

foma (updated 15 Dec 2011)

Pop 50.70
Vit 1.00

foma is a compiler, programming language, and C library for constructing finite-state automata and transducers for various uses. It has specific support for many natural language processing applications such as producing morphological analyzers. Although NLP applications are probably the main use of foma, it is sufficiently generic to use for a large number of purposes. It comes with an xfst-compatible interface and regular expression language. The library contains efficient implementations of all classical automata/transducer algorithms: determinization, minimization, epsilon-removal, composition, and boolean operations. More advanced construction methods are also available: context restriction, quotients, first-order regular logic, transducers from replacement rules, etc.
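To make one of the classical algorithms listed above concrete, here is a minimal sketch of determinization by subset construction, with epsilon-removal folded in through epsilon closures. It does not use foma or reflect its C implementation; the dictionary-based automaton encoding and the names `epsilon_closure`, `determinize`, and `accepts` are assumptions made for illustration.

```python
from collections import deque

EPS = ""  # label used here for epsilon transitions (an illustrative convention)

def epsilon_closure(states, delta):
    """Return all states reachable from `states` using only epsilon moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def determinize(start, finals, delta, alphabet):
    """Subset construction: return (dfa_start, dfa_finals, dfa_delta)."""
    dfa_start = epsilon_closure({start}, delta)
    dfa_delta, dfa_finals = {}, set()
    queue, seen = deque([dfa_start]), {dfa_start}
    while queue:
        current = queue.popleft()
        if current & finals:
            dfa_finals.add(current)
        for symbol in alphabet:
            moved = {t for s in current for t in delta.get((s, symbol), ())}
            if not moved:
                continue  # no transition: the DFA is left partial
            target = epsilon_closure(moved, delta)
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa_start, dfa_finals, dfa_delta

def accepts(dfa, word):
    """Run the deterministic automaton on `word`."""
    state, finals, delta = dfa
    for symbol in word:
        state = delta.get((state, symbol))
        if state is None:
            return False
    return state in finals

# Nondeterministic automaton for (a|b)*ab: state 0 loops and guesses the final "ab".
nfa_delta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
dfa = determinize(0, {2}, nfa_delta, "ab")
print(accepts(dfa, "abab"), accepts(dfa, "aba"))  # True False
```

Each state of the resulting automaton is a set of original states, so the worst case is exponential in the input size; minimization would normally follow as a separate step.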

jWeb1T (updated 16 Feb 2012)

Pop 28.88
Vit 1.00

jWeb1T is a Java tool for efficiently searching n-gram data in the Web 1T 5-gram corpus format. It is based on a binary search algorithm that finds the n-grams and returns their frequency counts in logarithmic time. Because the corpus is stored in many files, a simple index is used to locate the files containing the n-grams.
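The two-level lookup described above (an index to pick the file, then a binary search inside it) might look roughly like the following sketch. It is not jWeb1T's code or API: the file names, the in-memory `FILES` table standing in for sorted corpus files on disk, and the counts are all invented for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical stand-in for sorted corpus files: name -> (sorted n-grams, counts).
FILES = {
    "2gm-0001": (["a cat", "a dog", "big cat"], [120, 95, 40]),
    "2gm-0002": (["the cat", "the dog", "the end"], [310, 280, 77]),
}

# Simple index: the first n-gram of each file, in sorted order.
INDEX = sorted((keys[0], name) for name, (keys, _) in FILES.items())
INDEX_KEYS = [first for first, _ in INDEX]

def frequency(ngram):
    """Return the count for `ngram`, or 0 if it is not in the corpus."""
    # Step 1: the only file that can hold the n-gram is the last one
    # whose first entry is <= the n-gram.
    pos = bisect_right(INDEX_KEYS, ngram) - 1
    if pos < 0:
        return 0
    keys, counts = FILES[INDEX[pos][1]]
    # Step 2: binary search inside that file.
    i = bisect_left(keys, ngram)
    if i < len(keys) and keys[i] == ngram:
        return counts[i]
    return 0

print(frequency("the dog"))  # 280
print(frequency("a bird"))   # 0
```

Both steps are binary searches, so the number of comparisons grows logarithmically with the number of files and with the number of entries per file.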

UBY (updated 30 Apr 2013)

Pop 86.56
Vit 4.24

UBY is a large-scale unified lexical-semantic resource for natural language processing (NLP) based on the ISO standard Lexical Markup Framework (LMF).
