HarvestMan is a multithreaded offline browser. It offers many options for customizing offline browsing, including URL filters, word filters, domain filters, URL priorities, depth-fetching, fetch levels, file limits, time limits, and robot exclusion protocol support, among others. It is useful for downloading an entire Web site, or selected files from a Web site, to the hard disk for later offline browsing. It supports the HTTP, HTTPS, and FTP protocols and can work across proxies.
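To make the filtering ideas above concrete, here is a minimal sketch of how a crawler might combine a domain filter, a depth limit, and a robots.txt check before fetching a URL. It uses only the Python standard library and is not HarvestMan's code or API; the names `should_fetch`, `make_robot_rules`, `ROBOTS_TXT`, and the `example-bot` user agent are hypothetical and chosen for illustration.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_robot_rules(text):
    """Build a robots.txt rule set from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def should_fetch(url, depth, rules, allowed_domain, max_depth, agent="example-bot"):
    """Return True if the URL passes the domain filter, depth limit, and robots rules."""
    if urlparse(url).hostname != allowed_domain:
        return False  # domain filter: stay on the allowed site
    if depth > max_depth:
        return False  # depth limit: do not follow links too far from the start page
    return rules.can_fetch(agent, url)  # robot exclusion check


rules = make_robot_rules(ROBOTS_TXT)
print(should_fetch("http://example.com/index.html", 1, rules, "example.com", 2))
print(should_fetch("http://example.com/private/a.html", 1, rules, "example.com", 2))
```

A real offline browser would fetch robots.txt from each site and apply many more rules (word filters, priorities, file and time limits); this sketch only shows the general shape of such a check.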
Xapian is a search engine library, scalable to collections containing hundreds of millions of documents. It is written in C++ with bindings for Perl, Python, PHP, Java, Tcl, C#, Ruby, and Lua. It is a highly adaptable toolkit that allows developers to easily add advanced indexing and search facilities to their own applications. It supports the probabilistic information retrieval model and also a rich set of Boolean query operators. Omega is a Web search application built upon the Xapian library. It can index a Web server's document tree (including HTML, PDF, OpenOffice, MS Word/Excel/PowerPoint/Works, WordPerfect, RTF, PS, etc.), or data exported from arbitrary sources (e.g. SQL databases).
SPyDI is a powerful engine for building distributed full-text indexing systems and distributed search engines. It supports harvesting and crawling (pull methods), and push methods (via a Web interface or SPyRO Web services). It supports Boolean and vector information retrieval models. It has few dependencies and comes with its own HTTP server, an HTML-embedded page language (called pyew and wey pages), and a session manager. It can use the SMTP support of the Python library, and its default modules can be replaced with more capable ones (Apache, exim, etc.).
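As a rough illustration of the difference between the Boolean and vector retrieval models mentioned above, the following sketch builds a tiny in-memory inverted index, answers a Boolean AND query, and ranks documents by cosine similarity over raw term counts. It is a generic textbook-style example, not SPyDI's implementation; the sample documents and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical sample collection.
DOCS = {
    "d1": "python search engine library",
    "d2": "python web crawler",
    "d3": "distributed search engine",
}


def tokenize(text):
    """Split text into lowercase whitespace-separated terms."""
    return text.lower().split()


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(tokenize(text)):
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and(index, terms):
    """Boolean model: return the documents that contain every query term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def vector_rank(docs, query):
    """Vector model: rank matching documents by similarity to the query."""
    q = Counter(tokenize(query))
    scores = {d: cosine(q, Counter(tokenize(t))) for d, t in docs.items()}
    return sorted((d for d in scores if scores[d] > 0), key=lambda d: -scores[d])


index = build_index(DOCS)
print(boolean_and(index, ["python", "search"]))
print(vector_rank(DOCS, "search engine"))
```

The Boolean query returns an unordered set of exact matches, while the vector query returns an ordered list in which shorter documents dominated by the query terms rank higher.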