HTTrack is an easy-to-use offline browser utility. It allows you to download a Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Simply open a page of the mirrored Web site in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site, and resume interrupted downloads. WebHTTrack is a Web-based GUI for HTTrack.
libextractor is a library used to extract meta-data from files of arbitrary type. It is designed to use helper-libraries to perform the actual extraction, and to be trivially extendable by linking against external extractors for additional file types. The goal is to provide developers of file-sharing networks, file managers, and WWW-indexing bots with a universal library to obtain meta-data about files. It includes a shell-command and bindings for Java (JNI) and Python.
YaCy is a search engine that anyone can use to index the Internet (WWW and FTP) or to create a search portal for others (Internet or intranet). The scale of YaCy is limited only by the number of users. and can index billions of web pages. In P2P mode it is fully decentralized. All users of the search engine network are equal and it is not possible for anyone to censor the content of the distributed index.
Invenio (formerly CDSware) is a suite of applications that provides the framework and tools for building and managing an autonomous digital library server. It complies with the Open Archives Initiative metadata harvesting protocol (OAI-PMH) and uses MARC 21 as its underlying bibliographic standard. Its flexibility and performance make it a comprehensive solution for the management of document repositories of moderate to large size.