ACOPOST is a set of freely available POS taggers modeled after well-known techniques. The programs are written in C (aiming for extreme portability and code correctness/safety) and run under various Unix flavors (and probably even under Windows). ACOPOST currently consists of four taggers that are based on different frameworks: Maximum Entropy Tagger (MET), Trigram Tagger (T3, based on Hidden Markov Models), Error-driven Transformation-based Tagger (TBT or Brill Tagger), and Example-based tagger (ET).
The Language Detection Library for Java is a Java library to detect the natural languages in which texts are written. This task is also known as "language identification", "language guessing", and "language recognition". The library has over 99% precision for more than 40 languages. The supported languages are Afrikaans, Arabic, Bulgarian, Bengali, Czech, German, Greek, English, Spanish, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Macedonian, Malayalam, Marathi, Nepali, Dutch, Punjabi, Polish, Portuguese, Romanian, Russian, Slovak, Somali, Albanian, Swedish, Swahili, Tamil, Telugu, Thai, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, and Simplified/Traditional Chinese.
The AlchemyAPI Android SDK enables real-time semantic analysis of text, HTML, or Internet-hosted Web page content. The SDK provides mechanisms to extract concepts, named entities, keywords and tags, and categories, to clean HTML into plain text, and to detect languages. It can analyze text in eight different languages: English, French, German, Italian, Portuguese, Russian, Spanish, and Swedish. Example code and a demo application are included to help you get started.
Apache OpenNLP is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services.
foma is a compiler, programming language, and C library for constructing finite-state automata and transducers for various uses. It has specific support for many natural language processing applications such as producing morphological analyzers. Although NLP applications are probably the main use of foma, it is sufficiently generic to use for a large number of purposes. It comes with an xfst-compatible interface and regular expression language. The library contains efficient implementations of all classical automata/transducer algorithms: determinization, minimization, epsilon-removal, composition, and boolean operations. More advanced construction methods are also available: context restriction, quotients, first-order regular logic, transducers from replacement rules, etc.
jWeb1T is a Java tool for efficiently searching n-gram data in the Web 1T 5-gram corpus format. It is based on a binary search algorithm that finds the n-grams and returns their frequency counts in logarithmic time. Because the corpus is split across many files, a simple index is used to select the file that contains a given n-gram before that file is searched.
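The two-step lookup described for jWeb1T (an index to pick the file, then a binary search inside it) can be illustrated with a minimal sketch. This is not jWeb1T's actual API: the class NgramLookupSketch, its nested SortedFile type, and the demo data are all hypothetical, and the "files" are in-memory sorted arrays rather than files on disk. It also assumes every file is non-empty and sorted in String.compareTo order.

```java
import java.util.Arrays;

/** Hypothetical sketch of index-plus-binary-search lookup; not jWeb1T's real classes. */
final class NgramLookupSketch {

    /** Stand-in for one sorted corpus file: parallel arrays of n-grams and counts. */
    static final class SortedFile {
        final String[] ngrams; // assumed sorted ascending by String.compareTo
        final long[] counts;   // counts[i] is the frequency of ngrams[i]

        SortedFile(String[] ngrams, long[] counts) {
            this.ngrams = ngrams;
            this.counts = counts;
        }
    }

    private final SortedFile[] files; // assumed ordered so that file i precedes file i+1
    private final String[] firstKeys; // simple index: first n-gram of each file

    NgramLookupSketch(SortedFile... files) {
        this.files = files;
        this.firstKeys = new String[files.length];
        for (int i = 0; i < files.length; i++) {
            firstKeys[i] = files[i].ngrams[0];
        }
    }

    /** Returns the frequency count of the n-gram, or 0 if it is not in the data. */
    long count(String ngram) {
        // Step 1: use the index to find the last file whose first key is <= ngram.
        int f = Arrays.binarySearch(firstKeys, ngram);
        if (f < 0) {
            f = -f - 2; // insertion point minus one
        }
        if (f < 0) {
            return 0; // ngram sorts before the first key of the first file
        }
        // Step 2: binary search inside the selected file.
        int pos = Arrays.binarySearch(files[f].ngrams, ngram);
        return pos >= 0 ? files[f].counts[pos] : 0;
    }

    /** Builds a tiny two-file example with made-up counts. */
    static NgramLookupSketch demo() {
        SortedFile first = new SortedFile(
                new String[] {"a cat", "a dog"},
                new long[] {120L, 95L});
        SortedFile second = new SortedFile(
                new String[] {"the cat", "the dog", "the end"},
                new long[] {300L, 210L, 50L});
        return new NgramLookupSketch(first, second);
    }
}
```

Both steps are binary searches, so a lookup costs time logarithmic in the number of files plus time logarithmic in the size of the selected file. A disk-based version would need to seek within the file instead of indexing an array, which this sketch leaves out.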