Apache Nutch

Nutch is highly scalable Web search software built on top of Apache Hadoop and Lucene Java. Key features include a Web crawler, an indexer, crawl management tools, parsers for HTML, PDF, DOC, and several other document formats, and an extensible architecture that allows you to plug in additional functionality such as document parsers, custom scoring algorithms, protocols, and more.

Recent releases

  •  29 Mar 2009 09:40

Release Notes: This version contains a number of bug fixes and improvements, including Solr integration, a new indexing framework, and a new scoring framework.

  •  06 Apr 2007 06:34

Release Notes: This release includes several critical bug fixes and significant speedups.

  •  27 Sep 2006 10:32

Release Notes: A thread blocking issue that degraded crawling performance has been fixed. Bugs in scoring have been fixed. Problems with updatedb on Windows/Cygwin have been fixed. A bug in the generator that caused the lowest-scoring pages to be selected instead of the highest-scoring pages has been fixed.

  •  28 Jul 2006 06:06

No changes have been submitted for this release.
