jsoup is a Java library for working with real-world HTML. It can parse HTML from a URL, file, or string. It can find and extract data, using DOM traversal or CSS selectors. The HTML elements, attributes, and text can be manipulated. It can clean user-submitted content against a safe white-list. jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag-soup; jsoup will create a sensible parse tree.
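To make the idea of extracting data from invalid tag-soup concrete, here is a minimal sketch. It does not use jsoup or its API (jsoup is a Java library). It relies only on Python's standard `html.parser` module, and the class name `LinkCollector`, the function `extract_links`, and the sample markup are hypothetical names chosen for illustration. It shows link extraction still working when tags are left unclosed or closed without ever being opened.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, whether or not the tags are closed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return every href found in the given (possibly malformed) markup."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical tag-soup: an unclosed <a>, unclosed <p> tags, and a stray </div>.
SOUP = '<p>See <a href="/docs">the docs<p>and the <a href=\'/faq\'>FAQ</a></div>'
print(extract_links(SOUP))
```

Unlike jsoup, this sketch builds no parse tree and supports no CSS selectors. It only reacts to start tags as they are encountered.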
Apache OpenNLP is a machine-learning-based toolkit for processing natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services.
gradle-sablecc-plugin is a Gradle plugin that creates parsers using SableCC. SableCC supports automatic CST-to-AST transformation, emits all the visitor patterns and analysis helpers you will likely ever need, and is LR, not LL(k). Many example grammars are available for modern languages; the author of this plugin has written dozens.
JWPL is a language-independent, database-driven, high-performance Wikipedia API that provides structured access to information nuggets like redirects, categories, articles, and link structure. It contains three components: a MediaWiki markup parser, which can be used to further analyze the contents of a Wikipedia page or standalone with other text; TimeMachine, which reconstructs a snapshot of Wikipedia from a specific date, or multiple snapshots from a time span; and RevisionMachine, which offers efficient access to the history of articles using a dedicated storage format that decreases storage space by 98%. RevisionMachine enables random access to the whole revision history without requiring several terabytes of storage for a single Wikipedia dump.
lihata is a compact textual language that can represent a tree of lists, hashes, and tables. The syntax tries to be minimal and flexible, so that a lihata file can be formatted to fit the context it represents. The source release contains an event parser, a DOM parser, and helper functions for maintaining lihata trees. lihata is a convenient language for both simple and complex configuration files and for text representation of data files.
YAJL (Yet Another JSON Library) is a small event-driven (SAX-style) JSON parser written in ANSI C, and a small validating JSON generator. It's highly portable, data representation independent, fast, generates verbose error messages including context of where the error occurs in the input text, can parse JSON data incrementally off a stream, and is tiny.
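The event-driven (SAX-style) model is easier to picture with a concrete event sequence. The following is a minimal sketch that does not use YAJL. It decodes a small document with Python's standard `json` module and then walks the result, reporting each piece to a handler callback. The function name `emit_events`, the event names, and the sample document are assumptions made for illustration. A real streaming parser would deliver such events while reading the input, without building the decoded value first.

```python
import json


def emit_events(value, handler):
    """Report a decoded JSON value to handler(kind, payload) as a flat event sequence."""
    if isinstance(value, dict):
        handler("start_map", None)
        for key, item in value.items():
            handler("map_key", key)
            emit_events(item, handler)
        handler("end_map", None)
    elif isinstance(value, list):
        handler("start_array", None)
        for item in value:
            emit_events(item, handler)
        handler("end_array", None)
    elif value is None:
        handler("null", None)
    elif isinstance(value, bool):
        # Checked before numbers because bool is a subclass of int in Python.
        handler("boolean", value)
    elif isinstance(value, (int, float)):
        handler("number", value)
    else:
        handler("string", value)


# Hypothetical sample document, used only to show the order of events.
DOCUMENT = '{"name": "yajl", "tags": ["c", "json"], "tiny": true}'

events = []
emit_events(json.loads(DOCUMENT), lambda kind, payload: events.append((kind, payload)))
for kind, payload in events:
    print(kind, payload)
```

The handler never sees a tree, only a stream of small notifications, which is why this style is independent of any particular data representation.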
Flexc++ is a tool for generating scanners based on regular expressions. Flexc++ is highly comparable to the programs flex and flex++. The goal was to create a similar program, but to implement it completely in C++. Most flex and flex++ grammars should be usable with flexc++ with minor adjustments.
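As a rough illustration of what a scanner driven by regular expressions does, here is a minimal sketch. It is written by hand with Python's standard `re` module and is not flexc++ input or output. The rule names, the patterns, and the `scan` function are hypothetical. Note one difference from flex-style scanners: Python alternation takes the first rule that matches rather than the longest match, so the order of the rules matters here.

```python
import re

# Hypothetical token rules as (name, regular expression), tried in this order.
RULES = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/=()]"),
    ("SKIP", r"\s+"),
]

# Combine all rules into one pattern with a named group per rule.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))


def scan(text):
    """Split text into (token name, matched text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(scan("x = 42 + y1"))
```

A scanner generator automates this step: it takes the rules as input and produces the matching code, typically as a generated state machine rather than a regular expression evaluated at run time.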