stupid-xml is a ridiculously simple annotation-based XML stream parser for Java. The main goal of this project is to get the strings you care about out of XML and into Java as quickly as possible. You define a simple model class, specify the relative paths for its fields, and the parser generates instances for you from an XML stream. The functionality is deliberately limited: it parses only Strings into your model, which keeps everything extremely simple. Once you have the Strings in your model, you can perform filtering or more complex conversions.
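
To make the idea concrete, here is a minimal sketch of "stream XML into a String-only model" written against the JDK's built-in StAX API. It is not stupid-xml's annotation API, which is not shown in this description; the `Book` model, the `BookStreamSketch` class, and the element names `book`, `title`, and `author` are all assumptions made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical model: every field is a plain String, mirroring the String-only approach.
class Book {
    String title;
    String author;
}

class BookStreamSketch {
    // Streams through the XML once and emits one Book per <book> element.
    static List<Book> parse(String xml) {
        List<Book> books = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
            Book current = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("book")) {
                        current = new Book();
                    } else if (current != null && name.equals("title")) {
                        // getElementText() reads the text and leaves the reader on </title>.
                        current.title = reader.getElementText();
                    } else if (current != null && name.equals("author")) {
                        current.author = reader.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("book")) {
                    books.add(current);
                    current = null;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Wrapped so callers of this sketch do not need a checked exception.
            throw new IllegalStateException("Malformed XML", e);
        }
        return books;
    }
}
```

Calling `BookStreamSketch.parse(xml)` returns a list of `Book` instances whose String fields can then be filtered or converted in ordinary Java code, which is the division of labor the description above recommends.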
jsoup is a Java library for working with real-world HTML. It can parse HTML from a URL, file, or string. It can find and extract data, using DOM traversal or CSS selectors. The HTML elements, attributes, and text can be manipulated. It can clean user-submitted content against a safe white-list. jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag-soup; jsoup will create a sensible parse tree.
Apache OpenNLP is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services.
Yap4j is the simplest library for parsing CSV files in Java. It deserializes CSV files into a list of POJOs using a set of Java annotations, while allowing you to specify Object-CSV mappings. It automatically converts to and from a wide range of data types, including types from popular libraries such as Joda Time, and supports custom record delimiters.
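
The annotation-driven mapping described above can be illustrated with a small JDK-only sketch. This is not Yap4j's API: the `CsvColumn` annotation, the `Person` POJO, and the `CsvMapperSketch.read` method are hypothetical names invented here, and the parser is deliberately naive (comma-separated cells, no quoting or escaping).

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical annotation that binds a field to a CSV header name.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface CsvColumn {
    String value();
}

// Hypothetical POJO with one String column and one numeric column.
class Person {
    @CsvColumn("name") String name;
    @CsvColumn("age") int age;
}

class CsvMapperSketch {
    // Reads header + records, split on a caller-supplied record delimiter.
    static <T> List<T> read(String csv, Class<T> type, String recordDelimiter) {
        String[] records = csv.split(Pattern.quote(recordDelimiter));
        List<String> header = Arrays.asList(records[0].split(",", -1));
        List<T> result = new ArrayList<>();
        try {
            for (int i = 1; i < records.length; i++) {
                if (records[i].isEmpty()) continue; // skip blank records
                String[] cells = records[i].split(",", -1);
                T instance = type.getDeclaredConstructor().newInstance();
                for (Field field : type.getDeclaredFields()) {
                    CsvColumn column = field.getAnnotation(CsvColumn.class);
                    if (column == null) continue; // unmapped field
                    int index = header.indexOf(column.value());
                    if (index < 0 || index >= cells.length) continue;
                    field.setAccessible(true);
                    field.set(instance, convert(cells[index].trim(), field.getType()));
                }
                result.add(instance);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot map CSV to " + type.getName(), e);
        }
        return result;
    }

    // Minimal type conversion; a real library would cover far more types.
    static Object convert(String raw, Class<?> target) {
        if (target == int.class || target == Integer.class) return Integer.parseInt(raw);
        if (target == double.class || target == Double.class) return Double.parseDouble(raw);
        return raw;
    }
}
```

For example, `CsvMapperSketch.read("name,age;Ada,36;Alan,41", Person.class, ";")` uses `;` as a custom record delimiter and yields two `Person` objects with `age` already converted to `int`.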
HtmlCleaner is an HTML parser. HTML found on the Web is usually dirty, ill-formed, and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to the tags, attributes, and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those that most Web browsers use to create a Document Object Model. However, the user may provide custom tag and rule sets for tag filtering and balancing.