jsoup is a Java library for working with real-world HTML. It can parse HTML from a URL, file, or string; find and extract data using DOM traversal or CSS selectors; manipulate HTML elements, attributes, and text; and clean user-submitted content against a safe white-list. jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag soup, and it creates a sensible parse tree in every case.
Yap4j is the simplest library for parsing CSV files in Java. It deserializes CSV files into a list of POJOs using a set of Java annotations, while allowing you to specify Object-CSV mappings. It automatically converts to and from a wide range of data types, including types from popular libraries such as Joda Time, and it supports custom record delimiters.
Piglet is a tool for parsing and lexing text for the .NET framework. Its purpose is to provide an easy-to-use parsing tool that can be included in any .NET project as a single assembly. In contrast to most parser generators, Piglet provides a fluent interface that lets you express your grammar in a syntax accessible to users with no prior experience of parser generators. Piglet generates efficient, type-safe, and reentrant LALR(1) parsers at runtime, so no pre-compile step is needed to generate your parsing tables. It also includes a lexical scanner generator that can be used independently of the parser generator.
UniCC (Universal Compiler-Compiler) is a powerful LALR(1) parser generator and language development system for computer professionals. It serves as an all-round design and build tool assisting compiler writers in any parsing-related task, including production-quality compiler construction and the implementation of domain-specific languages. It unifies an integrated generator for lexical analyzers and a powerful LALR(1) parser generator into one software solution. The programming interface is a rich, extensible, and innovative BNF-based grammar definition language for expressing context-free grammars.
csvgrep is a command-line program that lets users search text-delimited files using a rudimentary query language. The query language is kept simple and expressive so that it is easy to understand. It aims to replace both grep and awk when you need to retrieve information from a text-delimited file based on the content of a specific field (or column). You can get what you want by using the semantics already present in the file's underlying structure.
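The idea of searching by field rather than by whole line can be sketched in a few lines of Java. This is only an illustration of the technique, not csvgrep's implementation or its query syntax: the class `FieldGrep` and the method `grepField` are hypothetical names, the first line is assumed to be a header row, and quoted fields that contain the delimiter are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class FieldGrep {
    /**
     * Returns the data rows whose value in the named column contains a match
     * for the given regular expression. The first line is treated as the header.
     * Hypothetical helper for illustration; not part of csvgrep.
     */
    static List<String> grepField(List<String> lines, String delimiter,
                                  String column, String regex) {
        List<String> matches = new ArrayList<>();
        if (lines.isEmpty()) {
            return matches;
        }
        // Split on the literal delimiter; the -1 limit keeps trailing empty fields.
        String splitOn = Pattern.quote(delimiter);
        int index = Arrays.asList(lines.get(0).split(splitOn, -1)).indexOf(column);
        if (index < 0) {
            throw new IllegalArgumentException("No such column: " + column);
        }
        Pattern pattern = Pattern.compile(regex);
        for (String line : lines.subList(1, lines.size())) {
            String[] fields = line.split(splitOn, -1);
            // Only the selected field is tested, never the rest of the line.
            if (index < fields.length && pattern.matcher(fields[index]).find()) {
                matches.add(line);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "name,city,age",
            "Ann,Paris,31",
            "Bob,Parma,45",
            "Cy,Oslo,27");
        // Select the rows whose "city" field starts with "Par".
        System.out.println(grepField(lines, ",", "city", "^Par"));
    }
}
```

Because the pattern is applied to a single field, a value such as "Par" appearing in another column does not produce a match, which is the behaviour that a plain line-oriented grep cannot express without extra work.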
MightyString adds array functionality and other tools for Ruby strings, including matching, indexing, substitution, and deletion. MightyString::HTML.strip_html produces cleaner HTML-to-ASCII output. It is an advanced block "filtering" module that works very well, with only extremely rare cases slipping through.