Redland is a set of C libraries providing a high-level API for the Resource Description Framework (RDF), allowing it to be stored, parsed, serialized, queried, and manipulated. It has an object-based, modular design and comes with detailed reference documentation and examples. Redland supports all RDF vocabularies such as FOAF, RSS 1.0, Dublin Core, DOAP, and OWL, the query languages SPARQL and RDQL, and all RDF syntaxes including Turtle, RDF/XML, RDF/JSON, RSS, Atom, RDFa, and GRDDL.
Nutch is highly scalable Web search software built on top of Apache Hadoop and Lucene Java. Key features include a Web crawler, an indexer, crawl management tools, and parsers for HTML, PDF, DOC, and several other document formats. Its extensible architecture lets you plug in additional functionality such as document parsers, custom scoring algorithms, protocols, and more.
Python Web Graph Generator is a threaded generator of Web graphs (power-law random graphs). It can generate a synthetic Web graph of about one million nodes in a few minutes on a desktop machine. It implements a threaded variant of the RMAT algorithm and supports both directed and undirected graphs. With minor adjustments, it can produce graphs that represent social networks or community networks. It can also output the connected components of a graph.
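As a rough illustration of the RMAT idea mentioned above, the following is a minimal single-threaded sketch, not the project's own code. The function names `rmat_edge` and `rmat_graph` and the quadrant probabilities (0.57, 0.19, 0.19, with the remainder for the fourth quadrant) are assumptions chosen for the example. Each edge is placed by recursively choosing one quadrant of the adjacency matrix, which tends to yield a skewed, power-law-like degree distribution.

```python
import random


def rmat_edge(scale, a=0.57, b=0.19, c=0.19, rng=random):
    """Pick one (row, col) cell in a 2**scale x 2**scale adjacency matrix.

    At each of the `scale` levels, one quadrant is chosen with
    probabilities a, b, c and d = 1 - a - b - c (assumed values).
    """
    row = col = 0
    for _ in range(scale):
        r = rng.random()
        row <<= 1
        col <<= 1
        if r < a:
            pass                 # top-left quadrant
        elif r < a + b:
            col |= 1             # top-right quadrant
        elif r < a + b + c:
            row |= 1             # bottom-left quadrant
        else:
            row |= 1             # bottom-right quadrant
            col |= 1
    return row, col


def rmat_graph(scale, num_edges, directed=True, seed=None):
    """Return a set of distinct edges without self-loops."""
    n = 1 << scale
    max_edges = n * (n - 1) if directed else n * (n - 1) // 2
    if num_edges > max_edges:
        raise ValueError("num_edges exceeds the number of possible edges")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < num_edges:
        u, v = rmat_edge(scale, rng=rng)
        if u == v:
            continue             # skip self-loops
        if not directed and u > v:
            u, v = v, u          # store undirected edges as (min, max)
        edges.add((u, v))
    return edges


if __name__ == "__main__":
    graph = rmat_graph(scale=10, num_edges=5000, seed=42)
    print(len(graph), "edges over", 1 << 10, "nodes")
```

A threaded variant could, for example, have several workers generate edge batches independently and merge the resulting sets; that step is left out here for brevity.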
Compass is a Java framework that makes it simple to map your Java object model to a search engine. It is built on top of the Lucene search engine. Compass features OSEM (Object/Search Engine Mapping), a declarative mapping technology similar to O/R database mapping, as well as transaction management, Google-like query syntax, externalization of common metadata, and more.
Catacomb is a WebDAV repository module for use with the Apache WebDAV module, mod_dav. Apache mod_dav parses WebDAV and DeltaV protocol requests into operations on a repository that provides persistent storage of resources and their properties. The default repository for mod_dav is provided by a separate module, mod_dav_fs, which stores resource bodies as files in the filesystem and stores properties in a (G)DBM database. Catacomb can be used for server-side searching and versioning of files over HTTP.
locust is a full-featured Internet search engine specifically designed to power vertical search, enterprise search, or knowledge-area search applications. It can index 2.5 million documents per 24 hours on a single Dell server. It consists of clean C++/STL code written from scratch.
SitemapGen4j is a Java library to generate XML sitemaps. It supports gzipped output, sitemap validation, and sitemap index generation. It can also generate Google-specific sitemaps, such as Mobile sitemaps, Geo sitemaps, Code Search sitemaps, Google News sitemaps, and Video sitemaps.
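To make the output format concrete, here is a minimal sketch of building a gzipped XML sitemap. It is written in Python with the standard library only and does not use or mirror the SitemapGen4j Java API. The helper name `build_sitemap` and the example URL are hypothetical; the namespace is the one defined by the sitemaps.org protocol. It assumes Python 3.8 or later for the `xml_declaration` argument.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Serialize (loc, lastmod) pairs as sitemap XML bytes.

    `lastmod` may be None, in which case the element is omitted.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    xml_bytes = build_sitemap([("https://example.com/", "2024-01-01")])
    # Gzipped output, as commonly served under a name like sitemap.xml.gz.
    with open("sitemap.xml.gz", "wb") as fh:
        fh.write(gzip.compress(xml_bytes))
```

A sitemap index follows the same pattern with `sitemapindex` and `sitemap` elements in place of `urlset` and `url`.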