Yioop!

Yioop! is a search engine written in PHP. It can be configured either as a general-purpose search engine for the whole Web or to provide search results for a specific set of URLs or domains. Yioop! can crawl pages or directly index archives such as ARC and WARC files. It supports indexing many file formats, including HTML, Atom, PDF, DOC, PPT, RTF, RSS, XML, SVG, PNG, JPG, BMP, and GIF, as well as sitemaps. The Yioop! crawler can be deployed on one or many machines. It supports one or more crawl scheduler processes, as well as multiple fetchers and mirrors. Crawling respects robots.txt, including Crawl-delay. Yioop! crawls are stored in a Web archive format that is easy to move around, so crawling can be done on one machine and the results deployed elsewhere. Yioop! supports mixing of crawls. It comes with a search front end that can be localized using a GUI, which supports RTL languages and can also be used to manage crawls. Yioop! can be configured in a straightforward manner to use file caching or, if available, memcache.
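To illustrate what respecting robots.txt and Crawl-delay involves, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not Yioop!'s own code (Yioop! is written in PHP); the sample rules, the `ExampleBot` agent name, and the helper names `load_rules` and `fetch_policy` are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_policy(parser: RobotFileParser, agent: str, url: str):
    """Return (allowed, delay_seconds) for one URL and user agent.

    A missing Crawl-delay is treated as 0 seconds here; that default
    is an assumption of this sketch, not a rule from any standard.
    """
    allowed = parser.can_fetch(agent, url)
    delay = parser.crawl_delay(agent)
    return allowed, (delay if delay is not None else 0)


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    for url in ("http://example.com/index.html",
                "http://example.com/private/report.html"):
        print(url, fetch_policy(rules, "ExampleBot", url))
```

A crawler built along these lines would skip any URL for which `allowed` is false and wait at least `delay_seconds` between requests to the same host.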

Recent releases

  •  01 Dec 2013 19:24

    Release Notes: This version improves crawl stability and has been used in a crawl of one third of a billion pages. The indexing plugin API was improved to allow plugins to have configuration screens. A new example Word Filter plugin has been added. Yioop can now crawl Tor networks. A Manage Groups pane has been added to Yioop.

  •  24 Jul 2013 18:21

    Release Notes: This version includes a new hybrid inverted index/suffix tree indexing scheme that should make calculating search results from future crawls faster (it does not affect old crawls). It can make use of HTTP ETag: and Expires: information when deciding whether to download a URL it has seen before. It also supports the creation of classifiers using active learning. These can be used to label and add scoring information to documents during a crawl. This release includes improvements to the RSS feed news_updater and a segmenter for Chinese.

  •  05 Apr 2013 03:19

    Release Notes: This release adds a simple language called Page Rules for controlling how data is extracted from webpages during the summary creation phase of indexing. It also adds the ability to index records coming from a database query and adds a generic text importer which works on plain text, gzip'd, and bzip'd text records. Other features in this version of Yioop are Atom support as a News Feed Search Source, a dedicated process news_updater.php for handling news updates, and a better algorithm for distributing archive data during an archive crawl. Many other minor improvements have been made.

  •  05 Jan 2013 02:08

    Release Notes: This release supports materializing query-based combinations (crawl mixes) of old search indexes as new indexes. This should make query performance of crawl mixes much better. Cache pages of search results now have a new history UI which allows you to search cache pages in all indexes you have, much as the Internet Archive does. Yioop now supports spell corrections on searches after they have been performed, and it has an API for transliterating between Roman and other scripts. Query performance has been improved over previous versions, and lots of minor bugs have been fixed.

  •  17 Sep 2012 09:06

    Release Notes: This release adds an activity to manage search media sources. For now, one can add Video and RSS sources. When configured, RSS feeds are downloaded hourly and integrated into search results. Also new is a command line tool for configuring Yioop in VPS settings. An Italian stemmer has been added, as well as more translations. This version implements some important bug fixes in robot handling, as well as unit tests for them. Yioop! now works in PHP 5.4 as well as PHP 5.3, and works better with more recent versions of XAMPP on Windows.
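
The 24 Jul 2013 release note mentions using HTTP ETag and Expires information to decide whether to download a previously seen URL. The sketch below shows one plausible way such a check could work, in Python with only the standard library. It is an illustration under stated assumptions, not Yioop's implementation: the function names `needs_refetch` and `conditional_headers` and the dictionary of cached headers are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_refetch(cached_headers: dict, now: datetime) -> bool:
    """Return True if the cached copy should be requested again.

    Assumption of this sketch: a missing or unparseable Expires value
    means the copy is treated as already expired.
    """
    expires = cached_headers.get("Expires")
    if not expires:
        return True
    try:
        expiry = parsedate_to_datetime(expires)
    except (TypeError, ValueError):
        # Invalid dates such as "0" raise; the exception type depends
        # on the Python version, so both are caught.
        return True
    if expiry.tzinfo is None:
        # Treat a date without zone information as UTC.
        expiry = expiry.replace(tzinfo=timezone.utc)
    return now >= expiry


def conditional_headers(cached_headers: dict) -> dict:
    """Build request headers that let the server answer 304 Not Modified."""
    etag = cached_headers.get("ETag")
    return {"If-None-Match": etag} if etag else {}


if __name__ == "__main__":
    cached = {"ETag": '"abc123"', "Expires": "Wed, 21 Oct 2015 07:28:00 GMT"}
    now = datetime.now(timezone.utc)
    if needs_refetch(cached, now):
        print("revalidate with", conditional_headers(cached))
    else:
        print("cached copy still fresh; skip download")
```

With this split, an unexpired page costs no request at all, and an expired page costs a full download only if the server reports that its ETag has changed.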
