jsoup is a Java library for working with real-world HTML. It parses HTML from a URL, file, or string; finds and extracts data using DOM traversal or CSS selectors; manipulates HTML elements, attributes, and text; and cleans user-submitted content against a safelist of permitted tags and attributes. jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag soup; in each case it will create a sensible parse tree.
The Lean Mean C++ Option Parser handles program arguments (argc, argv). It supports the short and long option formats of getopt(), getopt_long(), and getopt_long_only(), but has a more convenient interface. It is a freestanding, header-only library with no dependencies, not even libc or STL. It comes with a usage message formatter which supports column alignment and line wrapping, making it ideal for localized messages with different lengths.
Apache OpenNLP is a machine-learning-based toolkit for processing natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services.
Yap4j is a simple library for parsing CSV files in Java. It deserializes a CSV file into a list of POJOs, using a set of Java annotations to specify the object-to-CSV mappings. It converts automatically to and from a wide range of data types, including types from popular libraries such as Joda Time, and it supports custom record delimiters.
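The annotation-driven mapping described above can be illustrated with a minimal sketch in plain Java. This is not Yap4j's API: the @CsvColumn annotation, the parse and convert helpers, and the Person class are all hypothetical names invented for the example. It only shows the general technique of binding CSV columns to POJO fields through reflection, with a caller-chosen record delimiter and a small amount of type conversion. Quoted fields and escaping are deliberately not handled.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class CsvMappingSketch {

    // Hypothetical annotation (not from any real library): binds a field to a CSV header name.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface CsvColumn {
        String value();
    }

    // Example POJO; the sketch requires a no-argument constructor.
    static class Person {
        @CsvColumn("name") String name;
        @CsvColumn("age") int age;
    }

    // Splits the input into records, reads the first record as the header,
    // and fills one instance of 'type' per remaining record.
    static <T> List<T> parse(String csv, Class<T> type, String recordDelimiter)
            throws ReflectiveOperationException {
        String[] records = csv.split(Pattern.quote(recordDelimiter));
        List<String> header = new ArrayList<>();
        for (String h : records[0].split(",", -1)) {
            header.add(h.trim());
        }
        List<T> result = new ArrayList<>();
        for (int i = 1; i < records.length; i++) {
            if (records[i].isEmpty()) continue; // skip blank records
            String[] cells = records[i].split(",", -1);
            T instance = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                CsvColumn column = field.getAnnotation(CsvColumn.class);
                if (column == null) continue; // unannotated fields are ignored
                int index = header.indexOf(column.value());
                if (index < 0 || index >= cells.length) continue; // column missing
                field.setAccessible(true);
                field.set(instance, convert(cells[index].trim(), field.getType()));
            }
            result.add(instance);
        }
        return result;
    }

    // Converts a raw cell to a few common field types; everything else stays a String.
    static Object convert(String raw, Class<?> target) {
        if (target == int.class || target == Integer.class) return Integer.parseInt(raw);
        if (target == double.class || target == Double.class) return Double.parseDouble(raw);
        if (target == boolean.class || target == Boolean.class) return Boolean.parseBoolean(raw);
        return raw;
    }

    public static void main(String[] args) throws Exception {
        // ';' is used as a custom record delimiter instead of a newline.
        List<Person> people = parse("name,age;Ada,36;Linus,54", Person.class, ";");
        for (Person p : people) {
            System.out.println(p.name + " is " + p.age);
        }
    }
}
```

Run as written, the sketch prints "Ada is 36" and "Linus is 54". A real library would add quoting rules, error reporting, and pluggable converters for types such as those in Joda Time; those are left out here to keep the mapping idea visible.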