
Combine

Combine is an open and extensible system for crawling Internet resources, including harvesting and indexing them. It can be used as either a general or a focused crawler. Integration with database systems is provided so that a complete vertical search engine can be built.


Recent releases

  •  16 Jun 2009 07:45

    Release Notes: Better handling of special characters, better HTML-to-text extraction, support for new URL scheduling algorithms (including score-based algorithms), and support for exceptions to GeoIP. Some tests were fixed.

  •  09 Dec 2008 23:32

    Release Notes: This release is integrated with the Solr enterprise search server and can feed records directly to a Solr server. There is also a new version numbering system that is compatible with CPAN requirements.

  •  18 Nov 2008 21:33

    Release Notes: Code for simple Lucene integration has been added to the templates directory. The documentation HTML generator has been changed to use tex4ht.

  •  13 Nov 2008 19:07

    Release Notes: This release adds the switch ZebraIndexing to combineExport, which enables updating the configured Zebra server with exported records. It fixes a bug in Zebra recordId handling. It adds the switches 'collapseinlinks' and 'nooutlinks' to combineExport. It improves indexing of PDF documents. It fixes a bug in the processing of pure text documents.

  •  15 Oct 2008 09:52

    Release Notes: A full-text index was added to the MySQL table search, along with a configuration variable to enable or disable it. Integration with the Zebra database system was fixed. Updates, fixes, and code cleanup were done. Support for SVM classifiers was added (which depends on SVMLight). Country determination was added (which adds a dependency on GeoIP). Two new plugin types were added: "relevant text extraction" and "extra analysis".
