Yioop!

Yioop! is a PHP search engine. It can be configured either as a general-purpose search engine for the whole Web or to provide search results for a set of URLs or domains. It supports indexing several file formats, including HTML, PDF, DOC, PPT, RTF, RSS, XML, SVG, PNG, JPG, BMP, and GIF, as well as sitemaps. The Yioop! crawler can be deployed on one or many machines. It supports one or more to-crawl scheduler processes, as well as multiple fetchers and mirrors. Crawling respects robots.txt, including Crawl-delay. Yioop! crawls are stored in a Web archive format that is easy to move around, so crawling can be done on one machine and the results deployed elsewhere. Yioop! supports mixing of crawls. Yioop! comes with a search front end that can be localized as desired using a GUI, which also supports RTL languages. Management of crawls can also be done using this GUI. Yioop! can be configured in a straightforward manner to make use of memcache if available.

Recent releases

  •  03 May 2012 21:23

Release Notes: This release adds initial support for word suggestions as a user types in queries. The bigramming used to speed up common two-word queries now works with n-word grams. N-word gram filter files can now be created using Wikipedia raw page count dumps. This version adds support for * and $ in allowed- and disallowed-to-crawl sites. Using this, the user can crawl sites to a fixed depth. Robots.txt processing now supports * and $ in robots.txt paths. Support for NOSNIPPET, NOARCHIVE, and X-Robots-Tag HTTP headers has also been implemented. A tool for editing search summaries after a crawl has also been added.
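
As an illustration of the * and $ path matching mentioned above, here is a minimal sketch in Python. It is not Yioop!'s own code (Yioop! is written in PHP). It assumes the commonly used convention in which * matches any run of characters and a trailing $ anchors the pattern to the end of the path. The function names are hypothetical.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed convention (not taken from Yioop!'s source):
    '*' matches any sequence of characters, and a trailing '$'
    means the pattern must match through the end of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def path_matches(pattern, path):
    """Return True if the URL path is covered by the robots.txt pattern."""
    return robots_pattern_to_regex(pattern).search(path) is not None
```

Without a trailing $, a pattern behaves as a prefix match, so `/tmp` covers `/tmp/file`; with it, `/private/*.pdf$` covers `/private/a/b.pdf` but not `/private/a.pdf?x=1`.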

  •  20 Mar 2012 17:07

Release Notes: The crawler now has its own DNS caching mechanism independent of cURL's. Yioop! now has a mechanism for detecting when websites are becoming congested. The user can also set a quota on the number of URLs downloaded per hour from sites. A Web crawl statistics page can now be generated for a crawl. Bugs in robots.txt handling and in archive handling, which were introduced in 0.82, have been fixed. The demo site now features an example crawl of 100 million pages crawled with the previous version of the software.
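
One simple way to picture a per-site hourly download quota is a fixed-window counter, sketched below in Python. This is an assumption for illustration only: the release notes do not say how Yioop! enforces the quota, and the class and method names are hypothetical.

```python
from collections import defaultdict


class HourlyQuota:
    """Fixed-window counter: at most max_per_hour downloads per host per clock hour.

    Hypothetical sketch; not Yioop!'s actual mechanism.
    """

    def __init__(self, max_per_hour):
        self.max_per_hour = max_per_hour
        # Maps (host, hour_bucket) to the number of downloads allowed so far.
        self._counts = defaultdict(int)

    def allow(self, host, now_seconds):
        """Return True and count the download if the host is under quota."""
        bucket = int(now_seconds // 3600)
        key = (host, bucket)
        if self._counts[key] >= self.max_per_hour:
            return False
        self._counts[key] += 1
        return True
```

Passing the current time in explicitly (for example from `time.time()`) keeps the sketch easy to test. A real crawler would also discard counters for past hours.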

  •  04 Feb 2012 02:54

Release Notes: This release improved scalability by allowing multiple machines to maintain portions of the "to crawl next" queue. Query processing can also be split amongst machines, with different machines being responsible for documents of a given hash. Yioop! now supports mirroring of machines. Two-word phrases, as determined by an XML file such as a Wikipedia URL dump, can now be treated as a logical unit. The Yioop! model-view-controller framework has been made easier to extend, and documentation for it has been added to the website.
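
The idea of making each machine responsible for documents of a given hash can be sketched as follows. The choice of SHA-256 and of the URL as the hashed key are assumptions made for this example; the release notes do not specify the hash Yioop! uses, and the function name is hypothetical.

```python
import hashlib


def machine_for_document(doc_key, num_machines):
    """Map a document key (for example its URL) to a machine index.

    Assumption: a stable hash of the key, reduced modulo the number of
    machines, decides which machine is responsible for the document.
    """
    if num_machines < 1:
        raise ValueError("num_machines must be at least 1")
    digest = hashlib.sha256(doc_key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_machines
```

Because the mapping depends only on the key and the machine count, any machine can work out which peer is responsible for a document without a lookup table.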

  •  07 Dec 2011 23:57

Release Notes: This version supports starting, stopping, and viewing log files of the queue server and fetchers from a Web interface. One can now inject new URLs into an active crawl via a Web interface. This version of Yioop! supports re-crawling of pages after a fixed number of days. Also, the file extensions that are crawled, the number of bytes downloaded per page, and how Yioop! weighs different page components can now all be controlled through a Web interface rather than just the config.php file. Improvements have also been made to how the HTML Processor extracts text to index.

  •  29 Oct 2011 02:40

Release Notes: Character n-grams are now supported for many languages that did not have a stemmer. Language detection was improved and better UTF-8 preparation was provided for downloads. Yioop!'s ability to follow redirects, including bit.ly redirects, was improved. Proximity scoring of text in documents has also been enhanced.
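
For readers unfamiliar with character n-grams, the sketch below shows the general technique in Python: each word is split into overlapping runs of n characters, which can then be indexed in place of stems. The function name, the default n, and the handling of short words are assumptions for illustration, not details of Yioop!'s implementation.

```python
def char_ngrams(text, n=3):
    """Split each whitespace-separated word into overlapping character n-grams.

    Words shorter than n are kept whole (an assumption of this sketch).
    """
    grams = []
    for word in text.split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams
```

For example, `char_ngrams("search", 3)` yields `["sea", "ear", "arc", "rch"]`. Python strings are sequences of Unicode code points, so the same slicing works for non-Latin scripts.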
