The Okapi project’s main purpose is to provide a set of building blocks for creating larger open source localization and translation tools. Many Okapi components are also generic enough to be of interest to the text mining, natural language processing, and text retrieval communities. Okapi’s many text filters (HTML, Properties, XML with ITS XPath-based rules, OpenXML, ODF, Regex, etc.) provide a straightforward way to access the text of multiple document formats. Its document events and pipeline can be integrated with other frameworks such as UIMA, LingPipe, OpenPipeline, OpenNLP, GATE, and Lucene. The advantage of Okapi’s text filters is that they not only extract the text but also preserve all non-textual formatting. It is possible to decompose a document into events, process them via the pipeline, and then rebuild the input document without loss. Structural information can be added to Okapi document events so that tables, lists, links, titles, etc. are grouped together and treated as a unit. This is useful when context based on a “universal” document structure is needed. The Okapi event model supports user-configurable annotations, similar to UIMA but simpler and more restricted in scope. Users can annotate spans of text or add new resources such as translation memory matches, terminology, token types, or part-of-speech information.
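The decompose, process, and rebuild cycle described above can be illustrated with a minimal Python sketch. This is not Okapi's API (Okapi is a Java framework with its own filter and event classes); the `Event` class, the `decompose`, `annotate_word_count`, and `rebuild` functions, and the toy angle-bracket markup are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass, field

# Hypothetical event type: "text" carries extractable text,
# "code" carries non-textual markup that must be preserved as-is.
@dataclass
class Event:
    kind: str
    content: str
    annotations: dict = field(default_factory=dict)

# Toy assumption: anything between '<' and '>' is markup.
TAG = re.compile(r"(<[^>]*>)")

def decompose(document: str) -> list[Event]:
    """Split a document into an ordered list of text and code events."""
    events = []
    for piece in TAG.split(document):
        if not piece:
            continue  # re.split yields empty strings between adjacent tags
        kind = "code" if TAG.fullmatch(piece) else "text"
        events.append(Event(kind, piece))
    return events

def annotate_word_count(events):
    """Pipeline step: annotate text events, pass code events through untouched."""
    for event in events:
        if event.kind == "text":
            event.annotations["word_count"] = len(event.content.split())
        yield event

def rebuild(events) -> str:
    """Reassemble the document from its events in their original order."""
    return "".join(event.content for event in events)

source = "<p>Hello <b>world</b>, again.</p>"
events = list(annotate_word_count(decompose(source)))
assert rebuild(events) == source  # nothing was lost in the round trip
```

Because every piece of the input, textual or not, ends up in exactly one event, and the pipeline step only adds annotations, joining the events reproduces the input exactly. A real filter would additionally have to handle encodings, nested structure, and format-specific rules.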
Keystone is a cross-platform, object-oriented application framework that lets applications build on both GNU/Linux and Win32 without modification of their source. Keystone implements several modern Web standards, including SVG graphics and the XUL user interface description language.
LibAxl is an efficient implementation of the XML 1.0 specification. It has no external library dependencies and offers a clean implementation based on opaque types, with a consistent API for manipulating XML documents without compromising your code. It is extremely memory efficient and thread safe, with a small footprint (111k). It also includes XML Namespaces support.
SENTENSA Knowledge Miner is a platform-independent tool for searching any text. SENTENSA uses robust methods of indexing and searching text, drawing on more than 20 years of experience in information retrieval. SENTENSA products offer advanced text retrieval for large databases, intended to make searches for key information fast and effective. You can index on one platform and query on another.
Hyper Estraier is a full-text search system. It can be used for Web search, mailbox search, and similar tasks. It features high-performance searching, high scalability in the number of target documents, a perfect recall ratio by the N-gram method, phrase searching, attribute searching, and similarity searching. Multilingualism is supported with Unicode. It is independent of file format and repository, and has a simple and powerful API.
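To show why an N-gram index can find every occurrence of a query string, here is a minimal Python sketch of character-bigram indexing. It is not Hyper Estraier's implementation or API; the function names (`ngrams`, `build_index`, `search`), the choice of n = 2, and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict

def ngrams(text: str, n: int = 2) -> set[str]:
    """Return the set of all overlapping character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_index(docs: dict[str, str], n: int = 2) -> dict[str, set[str]]:
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text, n):
            index[gram].add(doc_id)
    return index

def search(query: str, docs: dict[str, str], index: dict[str, set[str]], n: int = 2) -> list[str]:
    """Find every document containing query as a substring."""
    grams = ngrams(query, n)
    if not grams:
        # Query shorter than n: the index cannot narrow it down, so scan all.
        candidates = set(docs)
    else:
        # A document containing the query must contain all of its n-grams.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Verify candidates, since sharing n-grams does not guarantee adjacency.
    return sorted(d for d in candidates if query in docs[d])

docs = {"a": "full-text search", "b": "textual data", "c": "contextual"}
index = build_index(docs)
print(search("textual", docs, index))  # ['b', 'c']
```

Any document that contains the query necessarily contains all of the query's n-grams, so the candidate set never misses a true match; this is the sense in which recall is complete. The query "textual" also matches inside "contextual", which an index keyed on whole words would miss. The final substring check removes false positives.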