Xapian is a search engine library, scalable to collections containing hundreds of millions of documents. It's written in C++ with bindings for Perl, Python, PHP, Java, Tcl, C#, Ruby, and Lua. It is a highly adaptable toolkit that allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model as well as a rich set of Boolean query operators. Omega is a Web search application built upon the Xapian library. It can index a Web server's document tree (including HTML, PDF, OpenOffice, MS Word/Excel/PowerPoint/Works, WordPerfect, RTF, PS, etc.), or data exported from arbitrary sources (e.g. SQL databases).
pyratemp is probably one of the smallest complete template engines for Python (about 500 LOC). Its templates use only a very small set of special syntax, which reduces complexity and the probability of bugs and leads to an easy-to-use, intuitive interface. It uses embedded Python expressions (in a "sandbox"), is well documented, has full Unicode support, and produces very good error messages, which is very useful when creating new templates.
itools is a collection of Python libraries which provides a wide range of capabilities, including an abstraction over directory and file resources, a search engine, type marshallers, datatype schemas, i18n support, URI handlers, a Web programming interface, a workflow interface, and support for data formats such as (X)HTML, XML, iCalendar, RSS 2.0, and XLIFF.
HarvestMan is a multithreaded off-line browser. It has many features for customizing offline browsing, including URL filters, word filters, domain filters, URL priorities, depth-fetching, fetch levels, file limits, time limits, robot exclusion protocol support, and many more. It is useful for downloading an entire Web site, or certain files from a Web site, to the hard disk for offline browsing later. It supports the HTTP/HTTPS and FTP protocols and can work across proxies.
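The kinds of rules listed above (domain filters, word filters, depth limits, robot exclusion) can be illustrated with a minimal sketch using only the Python standard library. This is not HarvestMan's code or configuration format; the function name should_fetch and all of its parameters are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def should_fetch(url, depth, *, allowed_domains, max_depth,
                 blocked_words=(), robots=None, agent="*"):
    """Return True if a crawler should download `url` under these rules.

    Hypothetical helper: it only illustrates how several filters combine.
    """
    parts = urlparse(url)
    # Protocol filter: only the schemes an offline browser typically handles.
    if parts.scheme not in ("http", "https", "ftp"):
        return False
    # Domain filter: accept a listed domain or any of its subdomains.
    host = parts.hostname or ""
    if not any(host == d or host.endswith("." + d) for d in allowed_domains):
        return False
    # Depth limit: stop following links too far from the start page.
    if depth > max_depth:
        return False
    # Word filter: skip URLs containing any blocked word.
    if any(word in url.lower() for word in blocked_words):
        return False
    # Robot exclusion: honour robots.txt rules when they are available.
    if robots is not None and not robots.can_fetch(agent, url):
        return False
    return True


# Example rules, parsed from an in-memory robots.txt instead of the network.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

rules = dict(allowed_domains={"example.com"}, max_depth=3,
             blocked_words=("logout",), robots=robots)
print(should_fetch("http://example.com/docs/a.html", 1, **rules))
print(should_fetch("http://example.com/private/a.html", 1, **rules))
```

Ordering the cheap checks (scheme, domain, depth) before the robots.txt lookup keeps the common rejections fast; a real crawler would also cache one robots.txt parser per host.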
pacparser is a library to parse proxy auto-config (PAC) files. PAC files are a widely used proxy configuration method. Web browsers can use a PAC file to determine which proxy server to use, or whether to connect directly, for a given URL. The idea behind pacparser is to make it easy to add PAC file parsing capability to any program (C and Python are supported right now). It comes as a shared C library and a Python module that can be used to make any C or Python program PAC-aware. Some very useful targets could be popular Web software like wget, curl, and python-urllib.
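A PAC file's FindProxyForURL function returns a string such as "PROXY proxy.example.com:8080; DIRECT", listing the choices in order of preference. The sketch below handles only that result string, using plain Python; it does not use pacparser and does not evaluate the JavaScript in a PAC file, which is the part pacparser provides. The function name parse_pac_result is hypothetical.

```python
def parse_pac_result(result):
    """Split a FindProxyForURL return string into (method, address) pairs.

    Hypothetical helper: the address is None for entries such as DIRECT
    that carry no host:port part.
    """
    choices = []
    for entry in result.split(";"):
        fields = entry.split()
        if not fields:
            continue  # ignore empty segments, e.g. after a trailing ';'
        method = fields[0].upper()
        address = fields[1] if len(fields) > 1 else None
        choices.append((method, address))
    return choices


# Try the proxy first, then fall back to a direct connection.
for method, address in parse_pac_result("PROXY proxy.example.com:8080; DIRECT"):
    print(method, address)
```

A client would try the returned choices in order and move to the next one when a proxy is unreachable.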
EmPy is a system for embedding Python expressions and statements in template text. It takes an EmPy source file, processes it, and produces output. This is accomplished via expansions, which are special signals to the EmPy system and are set off by a special prefix (by default the at sign, '@'). It can expand arbitrary Python expressions and statements in this way, as well as a variety of special forms. Textual data not explicitly delimited in this way is sent unaffected to the output, allowing Python to be used in effect as a markup language. Also supported are callbacks via hooks, recording and playback via diversions, and dynamic, chainable filters. The system is highly configurable via command line options and embedded commands.
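The prefix-based expansion idea can be shown with a toy sketch in plain Python. It handles only the @(expression) form, without nested parentheses, and passes all other text through unchanged. It is not EmPy's implementation and supports none of its statements, hooks, diversions, or filters; the function name expand is hypothetical.

```python
import re

# Matches "@(" ... ")" where the expression contains no parentheses.
_EXPANSION = re.compile(r"@\(([^()]*)\)")


def expand(text, variables):
    """Replace each @(expression) in `text` with its evaluated value.

    Toy illustration only: eval() with emptied builtins is NOT a secure
    sandbox, so do not use this on untrusted templates.
    """
    def substitute(match):
        value = eval(match.group(1), {"__builtins__": {}}, dict(variables))
        return str(value)

    return _EXPANSION.sub(substitute, text)


print(expand("Hello, @(name)! 2 + 3 = @(2 + 3).", {"name": "world"}))
# prints: Hello, world! 2 + 3 = 5.
```

Text outside the @(...) markers is never touched by the substitution, which mirrors the pass-through behaviour described above.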