Apache Nutch

Nutch is highly scalable Web search software built on top of Apache Hadoop and Lucene Java. Key features include a Web crawler, an indexer, crawl management tools, parsers for HTML, PDF, DOC, and several other document formats, and an extensible architecture that lets you plug in additional functionality such as document parsers, custom scoring algorithms, protocols, and more.

Recent releases

  •  28 Mar 2009 20:20

    Release Notes: This version contains a number of bugfixes and improvements such as Solr Integration, a new indexing framework, and a new scoring framework.

  •  06 Apr 2007 13:34

    Release Notes: This release includes several critical bugfixes, as well as key speedups.

  •  27 Sep 2006 17:32

    Release Notes: A thread blocking issue that negatively impacted crawling performance has been fixed. Bugs in scoring have been fixed. Problems with updatedb on Windows/Cygwin have been fixed. A bug in the generator where the lowest scoring pages were selected instead of highest scoring pages has been fixed.

  •  28 Jul 2006 13:06

    No changes have been submitted for this release.
