jsoup is a Java library for working with real-world HTML. It can parse HTML from a URL, file, or string, and find and extract data using DOM traversal or CSS selectors. It can manipulate HTML elements, attributes, and text, and clean user-submitted content against a safe whitelist. jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag soup; in every case it creates a sensible parse tree.
Metrix++ is a platform to collect and analyze code metrics. It has a plugin-based architecture, so it is easy to add support for new languages, define new metrics, and create new pre- and post-processing tools. Every metric has 'turn-on' and other configuration options. There are no predefined thresholds for metrics or rules; you can choose and configure any limit you want. It scales well to large codebases. For example, initial parsing of about 10000 files takes 2-3 minutes on an average PC, and an iterative re-run takes only 10-20 seconds. Reporting summary results and exceeded limits takes from under 1 second to 10 seconds. It can compare results for 2 code snapshots (collections) and differentiate added regions (classes, functions, etc.), modified regions, and unchanged regions. As a result, it is easy to deploy on legacy software, where it helps you deal with legacy code efficiently and either enforce the 'leave it not worse than it was before' rule or motivate refactoring.
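The snapshot comparison and user-chosen limits described above can be pictured with a small sketch. It is not Metrix++ code and does not use its API: the function names `classify_regions` and `exceeded_limits`, the region names, and the metric values are all hypothetical. Each snapshot is assumed to be a plain mapping from region name to one metric value, and a region counts as modified when that value changed, which is a simplification.

```python
def classify_regions(old, new):
    """Label each region in `new` as added, modified, or unchanged.

    Both arguments are hypothetical snapshots: dicts mapping a region
    name (class, function, etc.) to a single metric value.
    """
    result = {"added": [], "modified": [], "unchanged": []}
    for name in sorted(new):
        if name not in old:
            result["added"].append(name)
        elif new[name] != old[name]:
            result["modified"].append(name)
        else:
            result["unchanged"].append(name)
    return result


def exceeded_limits(snapshot, limit, regions):
    """Return the given regions whose metric value is above a user-chosen limit."""
    return [name for name in regions if snapshot[name] > limit]


# Made-up metric values for two snapshots of the same codebase.
old_snapshot = {"parse": 12, "render": 7, "legacy_io": 31}
new_snapshot = {"parse": 12, "render": 9, "legacy_io": 31, "export": 15}

changes = classify_regions(old_snapshot, new_snapshot)

# Apply the limit only to added and modified regions, so untouched
# legacy code (legacy_io) does not trigger a report.
touched = changes["added"] + changes["modified"]
print(changes)
print(exceeded_limits(new_snapshot, limit=10, regions=touched))
```

Restricting the check to touched regions is one way to read the 'leave it not worse than it was before' rule: old code that already exceeds a limit is tolerated until someone changes it.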
HtmlCleaner is an HTML parser. HTML found on the Web is usually dirty, ill-formed, and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to the tags, attributes, and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those which most Web browsers use to create a Document Object Model. However, the user may provide custom tag and rule sets for tag filtering and balancing.
EXIP provides a C library for the parsing and serialization of Efficient XML Interchange (EXI) format streams. The focus is portability and efficiency for embedded systems development. The project was started at the EISLAB research group in the Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, and is part of research efforts to bring resource-constrained embedded devices, such as wireless sensor nodes, closer to the enterprise business processes taking place in processing, manufacturing, and communication industries.
JWPL is a language-independent, database-driven, high-performance Wikipedia API that provides structured access to information nuggets like redirects, categories, articles, and link structure. It contains a MediaWiki markup parser, which can be used to further analyze the contents of a Wikipedia page or standalone with other text; TimeMachine, which reconstructs a snapshot of Wikipedia from a specific date, or multiple snapshots from a time span; and RevisionMachine, which offers efficient access to the history of articles using a dedicated storage format that decreases storage space by 98%. This enables random access to the whole revision history without requiring several terabytes of storage for a single Wikipedia dump.
UniCC (Universal Compiler-Compiler) is a powerful LALR(1) parser generator and language development system for computer professionals. It serves as an all-round design and build tool assisting compiler writers in any parsing-related task, including production-quality compiler construction and the implementation of domain-specific languages. It unifies an integrated generator for lexical analyzers and a powerful LALR(1) parser generator into one software solution. The programming interface is a rich, extendable, and innovative BNF-based grammar definition language for expressing context-free grammars.
The Lean Mean C++ Option Parser handles program arguments (argc, argv). It supports the short and long option formats of getopt(), getopt_long(), and getopt_long_only(), but has a more convenient interface. It is a freestanding, header-only library with no dependencies, not even libc or STL. It comes with a usage message formatter which supports column alignment and line wrapping, making it ideal for localized messages with different lengths.
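For readers unfamiliar with the getopt() conventions mentioned above, the short and long option formats can be illustrated with Python's standard `getopt` module, which follows the same conventions. This is not the Lean Mean C++ Option Parser's interface; the option names (`-v`, `-n`, `--output`) and the argument values are made up for the example.

```python
import getopt

# Hypothetical command line: a short flag, a long option with "=value",
# a short option with a separate value, then a positional argument.
argv = ["-v", "--output=out.txt", "-n", "3", "input.c"]

# "vn:" declares -v (no argument) and -n (requires an argument).
# "output=" declares --output with a required argument.
options, positional = getopt.getopt(argv, "vn:", ["output="])

print(options)     # list of (option, value) pairs in command-line order
print(positional)  # arguments left over after option parsing
```

Parsing stops at the first non-option argument, so everything from `input.c` onward is returned as positional.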