The Okapi project’s main purpose is to provide a set of building blocks for creating larger open source localization and translation tools. Many Okapi components, however, are generic enough to be of interest to the text mining, natural language processing, and text retrieval communities. Okapi’s many text filters (HTML, Properties, XML with ITS XPath-based rules, OpenXML, ODF, Regex, etc.) provide a straightforward way to access the text of multiple document formats. Its document events and pipeline can be integrated with other frameworks such as UIMA, LingPipe, OpenPipeline, OpenNLP, GATE, and Lucene. The advantage of Okapi’s text filters is that they not only extract the text but also preserve all non-textual formatting. A document can be decomposed into events, processed via the pipeline, and then rebuilt without loss. Structural information can be added to Okapi document events so that tables, lists, links, titles, etc. are grouped together and treated as a unit. This is useful when context based on a “universal” document structure is needed. The Okapi event model supports user-configurable annotations, similar to UIMA but simpler and more restricted in scope. Users can annotate spans of text or add new resources such as translation memory matches, terminology, token types, or part-of-speech information.
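The decompose, process, and rebuild cycle described for Okapi can be illustrated with a toy sketch in Python. This is not Okapi's API (Okapi is a Java framework); the `Event` class and the `extract`, `annotate_tokens`, and `merge` functions are hypothetical names invented for this illustration, and the tag-matching regular expression is only a stand-in for a real filter.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Event:
    kind: str                      # "text" or "markup"
    data: str                      # the exact source characters
    annotations: dict = field(default_factory=dict)

# Naive tag pattern; a real filter would use a proper parser.
_TAG = re.compile(r"(<[^>]+>)")

def extract(document):
    """Split a document into text and markup events, keeping every character."""
    events = []
    for piece in _TAG.split(document):
        if not piece:
            continue
        kind = "markup" if _TAG.fullmatch(piece) else "text"
        events.append(Event(kind, piece))
    return events

def annotate_tokens(events):
    """Pipeline step: attach a whitespace tokenization to each text event."""
    for ev in events:
        if ev.kind == "text":
            ev.annotations["tokens"] = ev.data.split()
    return events

def merge(events):
    """Rebuild the document from its events."""
    return "".join(ev.data for ev in events)

doc = "<p>Hello <b>world</b></p>"
events = annotate_tokens(extract(doc))
rebuilt = merge(events)
```

Because markup is carried along as events rather than discarded, `rebuilt` is identical to `doc`, while the text events now carry annotations that a downstream step could use.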
itools is a collection of Python libraries which provides a wide range of capabilities, including an abstraction over directory and file resources, a search engine, type marshallers, datatype schemas, i18n support, URI handlers, a Web programming interface, a workflow interface, and support for data formats such as (X)HTML, XML, iCalendar, RSS 2.0, and XLIFF.
HEBCI is a technique that allows a Web form handler to transparently detect the character set in which its data was encoded. Carefully chosen character references allow the browser's encoding to be inferred. It is thus possible to guarantee that data is in a standard encoding without relying on (often unreliable) Web server/browser encoding interactions.
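The inference step can be sketched as a fingerprint lookup. The following Python is a minimal illustration under stated assumptions, not HEBCI's actual implementation: the probe characters, the candidate encoding list, and all function names are hypothetical. It assumes the browser expands the character references and submits the resulting characters as bytes in whatever encoding it is using.

```python
# Hypothetical probe: degree sign, division sign, em dash. In a form this
# could be sent as a hidden field whose value is "&deg;&divide;&mdash;".
PROBE = "\u00b0\u00f7\u2014"

# Candidate encodings to tell apart (an illustrative, incomplete list).
CANDIDATES = ["utf-8", "cp1252", "mac_roman"]

def build_fingerprints(probe=PROBE, encodings=CANDIDATES):
    """Map the probe's byte sequence in each encoding to the encoding name."""
    table = {}
    for name in encodings:
        try:
            raw = probe.encode(name)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the probe; skip it
        table.setdefault(raw, name)  # first candidate wins on a collision
    return table

def infer_encoding(submitted, table=None):
    """Return the candidate whose fingerprint equals the submitted bytes, else None."""
    if table is None:
        table = build_fingerprints()
    return table.get(submitted)

def decode_field(raw_value, submitted_probe):
    """Decode another form field using the encoding inferred from the probe."""
    encoding = infer_encoding(submitted_probe)
    if encoding is None:
        raise ValueError("unrecognized probe bytes")
    return raw_value.decode(encoding)
```

Once the encoding is known, the handler can decode every other field with it and re-encode the data in a standard encoding. The probe only distinguishes encodings in which its characters map to different bytes, so the choice of probe characters determines which encodings can be told apart.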
SILGraphite (formerly OpenGraphite) is a project within SIL's Non-Roman Script Initiative and Language Software Development groups to provide extensible cross-platform rendering capabilities for complex non-Roman writing systems. It consists of a rule-based programming language, Graphite Description Language (GDL), that can be used to describe the behavior of a writing system; a compiler for that language; and a rendering engine that can serve as the backend of a text processing application. SILGraphite renders TrueType fonts that have been extended by compiling a GDL program. It is currently being integrated into Gecko/Mozilla through the SILA project; a GNU/Linux port is also underway, and there are plans for OpenOffice.org and AbiWord integration.