NetCrawler is the frontend to a Web crawling system. This command-line application will download all of the pages within a domain, and then parse and process all of the relevant content (images, text, audio, video), saving this content within an XML document for later processing. It is definitely alpha quality, but has been used quite extensively.
GroupServer is a Web-based mailing list manager designed for large sites. It provides email interaction like a traditional mailing list manager but also supports reading, searching, and posting of messages and files via the Web. Users have forum-style profiles, and can manage their email addresses and other settings using the same Web interface. It supports features such as Atom feeds, a basic CMS, statistics, multiple verified addresses per user, and bounce detection, and can be heavily customized.
EZ Reusable Objects (EZRO) is a Web application that can be used by non-technical staff to manage content as "objects." Content objects containing text, video, and audio can be shared, modified, and re-styled to appear via a traditional Web site, an on-line course, an innovative "Coach," or as a community of interest site. It is highly scalable and can be used for public Web sites, secure environments, and private intra/extranets.
doclifter helps with lifting documents with nroff markup to XML-DocBook. Lifting documents from presentation level to semantic level is hard, and a really good job requires human polishing. This tool aims to do everything that can be mechanized, and to preserve any troff-level information that might have structural implications in XML comments. TBL tables are translated into DocBook table markup, PIC into SVG, and EQN into MathML (relying on pic2svg and GNU eqn for the last two).
LyX is a document processor that encourages an approach to writing based on the structure of your documents, not their appearance. It is intended for people who write and want their writing to look great without tinkering with formatting details, font attributes, or page boundaries. On screen, it looks like any word processor, but it uses the TeX engine for printed output and for producing richly cross-referenced PDFs. It is stable and fully featured.
BTE (Body Text Extractor) is a Python module that extracts the main body of text from a Web page. Many Web articles consist of a main body which constitutes the relevant part of the particular page. Surrounding this body is irrelevant information such as copyright notices, advertising, links to sponsors, etc. BTE identifies and extracts the main body text of an article.
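The entry above does not say how BTE locates the main body, so the following is only a rough sketch of one simple heuristic for the same task, not BTE's actual code or API. It assumes the body is the contiguous stretch of the page with the most words and the fewest tags; the names `TOKEN_RE` and `extract_body` are hypothetical.

```python
import re

# Hypothetical sketch, not BTE's real interface.
# A token is either a whole tag ("<...>") or a run of non-space text.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def extract_body(html: str) -> str:
    """Return the words of the token span that maximises
    (words inside the span) - (tags inside the span)."""
    tokens = TOKEN_RE.findall(html)

    best_sum, best_start, best_end = 0, 0, 0  # best_end is exclusive
    run_sum, run_start = 0, 0

    # Maximum-sum contiguous span: each word scores +1, each tag -1.
    for i, tok in enumerate(tokens):
        score = -1 if tok.startswith("<") else 1
        if run_sum <= 0:
            # A non-positive prefix cannot help; start a new span here.
            run_sum, run_start = 0, i
        run_sum += score
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1

    # Keep only the words from the winning span.
    # Limitation: text inside <script> or <style> is counted as words.
    words = [t for t in tokens[best_start:best_end] if not t.startswith("<")]
    return " ".join(words)


if __name__ == "__main__":
    page = (
        '<html><body><div><a href="/">Home</a></div>'
        "<p>The quick brown fox jumps over the lazy dog again and again.</p>"
        '<div><a href="/x">Ads</a><span>c</span></div></body></html>'
    )
    print(extract_body(page))
```

On a page like the sample, the navigation and advertising links sit in tag-dense regions and fall outside the chosen span, while the word-dense paragraph is kept.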