NetCrawler is the frontend to a Web crawling system. This command line application downloads all of the pages within a domain, then parses and processes all of the relevant content (images, text, audio, video), saving it in an XML document for later processing. It is definitely alpha quality, but has been used quite extensively.
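The per-page step described above (find in-domain links, pick out media, store the result as XML) can be sketched with the Python standard library. This is only an illustration of the idea, not NetCrawler's code or output format: the `PageScanner` class, the `page_to_xml` function, and the `page`/`text`/`link` element names are all hypothetical, and the actual downloading of pages is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET


class PageScanner(HTMLParser):
    """Collect same-domain links, media references, and text from one page."""

    # Hypothetical mapping from HTML tag to the XML element name we emit.
    MEDIA_TAGS = {"img": "image", "audio": "audio", "video": "video"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.domain = urlparse(base_url).netloc
        self.links = []   # absolute URLs inside the same domain
        self.media = []   # (kind, absolute URL) pairs
        self.text = []    # non-empty text fragments

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])
            # Only follow links that stay within the starting domain.
            if urlparse(url).netloc == self.domain:
                self.links.append(url)
        elif tag in self.MEDIA_TAGS and attrs.get("src"):
            url = urljoin(self.base_url, attrs["src"])
            self.media.append((self.MEDIA_TAGS[tag], url))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def page_to_xml(base_url, html):
    """Return an ElementTree element describing one already-downloaded page."""
    scanner = PageScanner(base_url)
    scanner.feed(html)
    scanner.close()
    page = ET.Element("page", url=base_url)
    ET.SubElement(page, "text").text = " ".join(scanner.text)
    for kind, url in scanner.media:
        ET.SubElement(page, kind, src=url)
    for url in scanner.links:
        ET.SubElement(page, "link", href=url)
    return page
```

A crawler built on this sketch would keep a queue of the collected `link` URLs, fetch each one once, and append the resulting `page` elements to a single XML document.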
EZ Reusable Objects (EZRO) is a Web application that can be used by non-technical staff to manage content as "objects." Content objects containing text, video, and audio can be shared, modified, and re-styled to appear via a traditional Web site, an on-line course, an innovative "Coach," or as a community of interest site. It is highly scalable and can be used for public Web sites, secure environments, and private intra/extranets.
BTE (Body Text Extractor) is a Python module that extracts the main body of text from a Web page. Many Web articles consist of a main body which constitutes the relevant part of the particular page. Surrounding this body is irrelevant information such as copyright notices, advertising, links to sponsors, etc. BTE identifies and extracts the main body text of an article.
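One common heuristic for this kind of extraction treats the page as a sequence of tag tokens and word tokens and picks the contiguous span that contains many words and few tags. The sketch below illustrates that heuristic with the standard library only. It is an assumption about how such a tool could work, not BTE's actual algorithm or API, and the names `_Tokenizer` and `extract_body_text` are hypothetical.

```python
from html.parser import HTMLParser


class _Tokenizer(HTMLParser):
    """Flatten a page into (is_tag, text) tokens: one per tag, one per word."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append((True, tag))

    def handle_endtag(self, tag):
        self.tokens.append((True, tag))

    def handle_data(self, data):
        # Note: script and style contents also arrive here; a real tool
        # would filter them out first.
        self.tokens.extend((False, word) for word in data.split())


def extract_body_text(html):
    """Return the words of the span that maximizes (words - tags) inside it.

    Scoring each word +1 and each tag -1 turns the search into a
    maximum-sum contiguous subsequence problem, solved in one pass.
    """
    tokenizer = _Tokenizer()
    tokenizer.feed(html)
    tokenizer.close()
    tokens = tokenizer.tokens

    best_sum, best_start, best_end = 0, 0, 0  # half-open span, empty by default
    run_sum, run_start = 0, 0
    for idx, (is_tag, _) in enumerate(tokens):
        if run_sum <= 0:
            # A non-positive prefix never helps; start a new run here.
            run_sum, run_start = 0, idx
        run_sum += -1 if is_tag else 1
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, idx + 1

    return " ".join(text for is_tag, text in tokens[best_start:best_end] if not is_tag)
```

Navigation bars and footers tend to be tag-heavy with little text, so they score negatively and fall outside the chosen span, while a long run of article text scores highly.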
Silva is a CMS for organizations that manage multiple or complex Web sites. Content is stored in clean XML, independent of layout and presentation. Features include versioning, a workflow system, an integral visual editor, content reuse, sophisticated access control, multi-site management, extensive import/export facilities, fine-grained templating, and hi-res image storage and manipulation. Silva is built on top of the Zope Web application platform.
Libxslt is a C library for GNOME which allows developers to work with XSLT. It is based on libxml2 for XML parsing, tree manipulation, and XPath support. Also included is 'xsltproc', a command line XSLT processor. The library is written in plain C, making as few assumptions as possible, and sticking closely to ANSI C/POSIX for easy embedding. It should work on Linux, Unix, and Windows. Though not designed primarily with performance in mind, libxslt seems to be a relatively fast processor. It also includes full support for the EXSLT set of extension functions as well as some common extensions present in other XSLT engines.
Libxml2 is the XML C library developed for the GNOME project. The library code is portable (to Linux, Unix, Windows, embedded systems, etc.) and modular; most of the extensions can be compiled out. Libxml2 implements a number of existing standards related to markup languages, including the XML standard, Namespaces in XML, XML Base, Relax NG, RFC 2396, XPath, XPointer, HTML4, XInclude, SGML Catalogs, and XML Catalogs. In most cases, Libxml2 tries to implement the specifications in a relatively strict way. It also provides partial support for the following, without claiming to implement them fully: DOM, FTP client, HTTP client, and SAX2. Support for W3C XML Schemas is in progress. It includes xmllint, a command line XML validator.
itools is a collection of Python libraries which provides a wide range of capabilities, including an abstraction over directory and file resources, a search engine, type marshallers, datatype schemas, i18n support, URI handlers, a Web programming interface, a workflow interface, and support for data formats such as (X)HTML, XML, iCalendar, RSS 2.0, and XLIFF.
Myghty is a Python-based Web application framework originally ported from HTML::Mason. It supports the full feature set of Mason, allowing component-based Web development with Python-embedded HTML. It also features additional paradigms such as module components, environment-neutral session support, and many more language features. The HTTP connector API includes mod_python, CGI, WSGI, and standalone implementations. It also supports command line and custom non-HTTP environments.