Apache OpenNLP is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services.
EXIP provides a C library for the parsing and serialization of Efficient XML Interchange (EXI) format streams. The focus is portability and efficiency for embedded systems development. The project was started at the EISLAB research group in the Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, and is part of research efforts to bring resource-constrained embedded devices, such as wireless sensor nodes, closer to the enterprise business processes taking place in processing, manufacturing, and communication industries.
Ell is a library to write EBNF grammars as C++ code for quick development of LL(n) parsers or similar applications. It is not a tool to generate parsers (like ANTLR): the grammar you write is directly embedded into your C++ code. The core library is very light (less than 2000 lines of headers) and written in generation templates to achieve the fastest execution. The service provided by Ell is very similar to what Boost Spirit provides, but with a simpler object model, and without the need of the Boost library (it only depends on STL).
Flexc++ is a tool for generating scanners based on regular expressions. Flexc++ is highly comparable to the programs flex and flex++. The goal was to create a similar program, but to implement it completely in C++. Most flex and flex++ grammars should be usable with flexc++ with minor adjustments.
HtmlCleaner is an HTML parser. HTML found on the Web is usually dirty, ill-formed, and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to the tags, attributes, and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those which most Web browsers use to create a Document Object Model. However, the user may provide custom tag and rule sets for tag filtering and balancing.
JWPL is a language independent, database-driven, high performance Wikipedia API that provides structured access to information nuggets like redirects, categories, articles, and link structure. It contains a Mediawiki Markup parser that can be used to further analyze the contents of a Wikipedia page or standalone with other text, TimeMachine, which reconstructs a snapshot of Wikipedia from a specific date, or multiple snapshots from a time span, and RevisionMachine, which offers efficient access to the history of articles using a dedicated storage format which decreases storage space by 98%. This enables random access to the whole revision history without requiring several terabytes of storage for a single Wikipedia dump.