SitemapGen4j is a Java library to generate XML sitemaps. It supports gzipped output, sitemap validation, and sitemap index generation. It can also generate Google-specific sitemaps, such as Mobile sitemaps, Geo sitemaps, Code Search sitemaps, Google News sitemaps, and Video sitemaps.
urlwatch is a script intended to help you watch URLs and get notified (via email) of any changes. The change notification will include the URL that has changed and a unified diff of what has changed. The script works out of a single directory, so there is no need to install anything. State files are kept in the same folder. The script supports stripping parts of a page that are always changing through the use of a filter hook function. It is typically run as a cronjob.
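The filter-then-diff idea described above can be sketched with Python's standard `difflib` module. This is a minimal illustration and not urlwatch's actual code: the names `strip_volatile` and `diff_report`, and the "Generated at" timestamp line, are assumptions made up for the example.

```python
import difflib
import re


def strip_volatile(url, text):
    """Hypothetical filter hook: drop parts of a page that change on every fetch."""
    # Assumption: the page carries a timestamp line such as "Generated at 12:00".
    return re.sub(r"^Generated at .*$\n?", "", text, flags=re.MULTILINE)


def diff_report(url, old_text, new_text, filter_hook=strip_volatile):
    """Return a unified diff for the URL, or "" if nothing relevant changed."""
    old_lines = filter_hook(url, old_text).splitlines(keepends=True)
    new_lines = filter_hook(url, new_text).splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile=url + " (old)", tofile=url + " (new)"
    )
    return "".join(diff)
```

A notifier would call `diff_report` with the stored state and the freshly fetched page, and send mail only when the returned string is non-empty.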
OpenEphyra is a question answering (QA) system. It retrieves answers to natural language questions from the Web and other sources. OpenEphyra comes with implementations of algorithms that proved effective in Carnegie Mellon's Ephyra system, which participated in the TREC evaluations. It is platform independent and can be set up in just a few minutes. The goal of this project is to give researchers the opportunity to develop new QA techniques without worrying about the end-to-end system.
index.rb is a general indexing framework for Ruby. With it, you can create collections of documents, then index and search them. The traditional inverted index is supported, as is Latent Semantic Indexing (LSI). Input documents may be stemmed, to make user queries more general. It also provides TextTiling to break input documents covering multiple topics into topic-specific sub-documents.
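To illustrate how stemming makes queries more general in an inverted index, here is a small sketch. It is written in Python rather than Ruby and does not reflect index.rb's API; `crude_stem` is a deliberately rough stand-in for a real stemmer, and all function names are hypothetical.

```python
import re
from collections import defaultdict


def crude_stem(word):
    """Rough suffix stripper; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text, split it into words, and stem each word."""
    return [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def build_index(docs):
    """Map each stemmed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every stemmed query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Because documents and queries pass through the same `tokenize` step, a query for "indexed" matches a document that only contains "indexing".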
Isobel is a framework for building complex information retrieval and analysis systems. Isobel can be functionally divided into two subsystems, Isobel Gatherer (the crawling and filtering subsystem) and Isobel Analyzer (the analysis subsystem). The two subsystems can also be used separately. Isobel Gatherer offers ready-to-use services like content fetching, scheduling, document format conversion, hyperlink graph storage and analysis, and content storage and indexing. A programmer may easily add new services. Isobel Analyzer uses the IBM UIMA architecture to reuse the analysis components developed for this architecture.
safox is a simple PHP API for XML handling. It merges the DOM approach with XML, and it provides a simple, object-oriented API for PHP-based XML generation, parsing, manipulation, and traversal. SAFOX provides a generation package and a package that parses XML documents and returns objects.
HEBCI is a technique that allows a Web form handler to transparently detect the character set with which its data was encoded. By using carefully chosen character references, the browser's encoding can be inferred. Thus, it is possible to guarantee that data is in a standard encoding without relying on (often unreliable) Web server/browser encoding interactions.
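The inference step can be sketched as a fingerprint lookup: a hidden form field holds a few character references, the browser submits them as bytes in its own encoding, and the handler matches those bytes against a table. The sketch below is not HEBCI's actual probe set or table; the probe characters, the candidate encodings, and the names `PROBE`, `FINGERPRINTS`, and `detect_encoding` are all assumptions chosen for illustration. It also assumes the browser submits unencodable characters as numeric character references, which Python's `xmlcharrefreplace` error handler imitates.

```python
# Probe characters: degree sign, division sign, em dash. In the HTML form these
# would appear as character references in a hidden field (hypothetical choice).
PROBE = "\u00b0\u00f7\u2014"

# Candidate encodings for this sketch only; a real table would cover many more.
CANDIDATES = ("utf-8", "iso-8859-1", "cp1252")


def build_fingerprints(probe, encodings):
    """Map the bytes each encoding produces for the probe back to that encoding."""
    table = {}
    for name in encodings:
        # Assumption: unencodable characters arrive as numeric character references.
        raw = probe.encode(name, errors="xmlcharrefreplace")
        if raw in table:
            raise ValueError(f"{name} and {table[raw]} are indistinguishable")
        table[raw] = name
    return table


FINGERPRINTS = build_fingerprints(PROBE, CANDIDATES)


def detect_encoding(submitted_bytes):
    """Return the encoding whose fingerprint matches the submitted field, or None."""
    return FINGERPRINTS.get(submitted_bytes)
```

Once the encoding is known, the handler can decode the remaining form fields with it and re-encode them to whatever standard encoding the application uses.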
Dowser is a Web research and archiving tool that clusters results from search engines, associates words that appear in previous searches, and keeps a local cache of all the results you click on in a searchable database along with summaries and links to related information. It helps you to keep track of what you find, with no advertising.