jsoup is a Java library for working with real-world HTML. It can parse HTML from a URL, file, or string. It can find and extract data, using DOM traversal or CSS selectors. The HTML elements, attributes, and text can be manipulated. It can clean user-submitted content against a safe white-list. jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag-soup; jsoup will create a sensible parse tree.
LEPL is a recursive descent parser library written in Python. It is based on parser combinator libraries popular in functional programming, but also exploits Python language features. Operators provide a friendly syntax, and the consistent use of generators supports full backtracking and resource management. Backtracking implies that a wide variety of grammars are supported; appropriate memoisation ensures that even left-recursive grammars terminate.
Apache OpenNLP is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services.