jsoup is a Java library for working with real-world HTML. It can parse HTML from a URL, file, or string, and find and extract data using DOM traversal or CSS selectors. HTML elements, attributes, and text can be manipulated, and user-submitted content can be cleaned against a safe white-list. jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag-soup; in every case it will create a sensible parse tree.
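As a rough illustration of the parsing, selecting, and manipulating described above, here is a minimal sketch. It assumes the jsoup jar is on the classpath; the class name, method names, and sample HTML are hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class name; only the org.jsoup calls come from the library.
public class ParseSketch {

    // Parse an HTML string and collect the href of every link matched by a CSS selector.
    static List<String> linkTargets(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<String>();
        for (Element link : doc.select("a[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    // Return the text of the first <p>, even when the markup leaves tags unclosed.
    static String firstParagraphText(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.select("p").first();
        return first == null ? "" : first.text();
    }

    // Manipulate the tree: set rel="nofollow" on every link, then read it back.
    static String relOfFirstLink(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a").attr("rel", "nofollow");
        Element first = doc.select("a").first();
        return first == null ? "" : first.attr("rel");
    }

    public static void main(String[] args) {
        // Sample tag-soup input (made up for this sketch): unclosed <b> and <p>.
        String html = "<p>Hello <b>world<p><a href='http://example.com/'>a link</a>";
        System.out.println(linkTargets(html));
        System.out.println(firstParagraphText(html));
        System.out.println(relOfFirstLink(html));
    }
}
```

The sketch parses from a string only; fetching from a URL or reading a file would go through other Jsoup entry points and is left out so that the example stays free of network and file access.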

| Field | Value |
|---|---|
| Tags | Java, HTML, Parser, HTML cleaner, Extract, whitelist, Cross Platform |
| Licenses | MIT/X |
| Operating Systems | Java, Cross Platform |
| Implementation | Java, Java 5, HTML |
| Translations | English |

Recent releases


Release Notes: This release adds a number of improvements and bugfixes, including renewed support for the Google App Engine and parsing fixes.


Release Notes: This release adds many improvements, including a relaxed XML parser, a lighter memory footprint, and a range of bugfixes.


Release Notes: This release includes a new HTML5-compliant parser and fixes for Java 1.5 and Android 2.2 compatibility.


Release Notes: This version of jsoup includes a brand new HTML5-conformant parser, which ensures HTML is parsed just as modern browsers do. It also improves parse time, lowers memory usage, and adds new convenience methods, including Element.unwrap(), Node.after(), and Node.before().


Release Notes: This release primarily corrects a regression in which the content-type of a document retrieved using Jsoup.connect(String url) might not be correctly detected when it is specified in a meta tag.
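
The convenience methods named in the release notes above, Element.unwrap(), Node.before(), and Node.after(), can be tried out roughly as follows. This is a sketch that assumes a jsoup version providing those methods is on the classpath; the class name, helper names, and sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class name; only the org.jsoup calls come from the library.
public class TreeEditSketch {

    // Remove every <span> wrapper but keep its children in place.
    static Document unwrapSpans(String html) {
        Document doc = Jsoup.parse(html);
        for (Element span : doc.select("span")) {
            span.unwrap();
        }
        return doc;
    }

    // Insert a heading before the first <p> and a rule after it,
    // then list the tag names of the body's direct children in order.
    static String framedParagraphTags(String html) {
        Document doc = Jsoup.parse(html);
        Element p = doc.select("p").first();
        if (p != null) {
            p.before("<h1>Title</h1>"); // new sibling placed ahead of the paragraph
            p.after("<hr>");            // new sibling placed behind the paragraph
        }
        StringBuilder tags = new StringBuilder();
        for (Element child : doc.body().children()) {
            if (tags.length() > 0) {
                tags.append(',');
            }
            tags.append(child.tagName());
        }
        return tags.toString();
    }

    public static void main(String[] args) {
        // Sample inputs are made up for this sketch.
        System.out.println(unwrapSpans("<p>One <span>Two</span> Three</p>").body().text());
        System.out.println(framedParagraphTags("<p>Body text</p>"));
    }
}
```

Both helpers work on a freshly parsed document, so the input string is never modified; the edits apply only to the in-memory parse tree.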