jsoup


jsoup is a Java library for working with real-world HTML. It can parse HTML from a URL, file, or string, and it can find and extract data using DOM traversal or CSS selectors. HTML elements, attributes, and text can be manipulated, and user-submitted content can be cleaned against a safe whitelist. jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag soup, and it creates a sensible parse tree in each case.
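As a rough illustration of the parse, select, and manipulate steps described above, here is a minimal sketch. It assumes the jsoup jar is on the classpath; the class name `JsoupSketch`, the helper `linkTexts`, and the example.com URL are hypothetical and not part of jsoup itself.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupSketch {
    // Hypothetical helper: collects the visible text of every link that has an href.
    static List<String> linkTexts(String html) {
        Document doc = Jsoup.parse(html);            // parse HTML from a string
        List<String> texts = new ArrayList<>();
        for (Element link : doc.select("a[href]")) { // CSS selector lookup
            texts.add(link.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Tag soup: no <html> or <body>, and the <p> is never closed.
        String html = "<p>See <a href='http://example.com/'>the example</a> site";

        Document doc = Jsoup.parse(html);
        Element link = doc.select("a[href]").first();

        System.out.println(link.attr("href"));  // read an attribute
        System.out.println(link.text());        // read the element text

        link.attr("rel", "nofollow");           // manipulate an attribute
        System.out.println(link.outerHtml());   // serialize the modified element

        System.out.println(linkTexts(html));
    }
}
```

Cleaning untrusted input is done through `Jsoup.clean`, whose whitelist argument type differs between jsoup versions, so it is left out of this sketch.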


Recent releases

  •  11 Nov 2013 11:22

Release Notes: This release introduces improved form handling, more robust character set detection, speed and memory optimizations in parsing and CSS selectors, and a number of bugfixes.

  •  28 Jan 2013 21:58

Release Notes: This release introduces selectors for structural CSS pseudo-classes, full support for international supplementary characters, and a raft of improvements and bugfixes.

  •  24 Sep 2012 21:31

Release Notes: This release parses HTML 2.3x faster. The author profiled the parse execution of thousands of documents, optimized every hotspot to streamline the parser, and significantly reduced node memory consumption. This release also trims the retained heap memory when retrieving data from parsed documents, reduces garbage collection when selecting elements, and removes lock contention so that jsoup can run concurrently on as many threads as are available.

  •  29 May 2012 01:03

Release Notes: This release adds a number of improvements and bugfixes, including renewed support for Google App Engine and parsing fixes.

  •  28 Mar 2012 16:37

Release Notes: This release adds many improvements, including a relaxed XML parser, a lighter memory footprint, and a range of bugfixes.

