Nutch is highly scalable Web searching software which builds on top of Apache Hadoop and Lucene Java. Key features include a Web crawler, indexer, crawl management tools, parsers for HTML, PDF, DOC, and several other document formats, and an expandable architecture that allows you to plug in additional functionality such as document parsers, custom scoring algorithms, custom content parsers, protocols, and more.
Kallimachos is a simple Web-based digital book-catalog intended for personal use. Books are indexed by title, author, translator, edition, genre, page number, and ISBN. You can insert a new book, enter the library, and search by any item. It also provides a system-info viewer. The installation is quick and simple, and the program has a user-friendly interface.
A 'honeypot' is designed to detect server-side attacks. In contrast, a 'honeyclient' is designed to detect client-side attacks. Specifically, a honeyclient is a dedicated host that drives specially instrumented applications to access remote servers to see if those servers are behaving in a malicious manner (by compromising the client). Honeyclients can proactively detect exploits against client applications without known signatures. This framework uses a client-server model with SOAP messaging as the primary communication method, and uses the free version of VMware Server as a means of virtualizing the client environment.
PDFTextStream is a PDF text and metadata extraction library available for Java and .NET. It supports all versions of the PDF document specification (including v1.7, used by Acrobat 8, 9, and X), extraction of text encoded using double-byte character sets (including Chinese, Japanese, and Korean), decryption of documents encrypted using 40-bit, 128-bit, 256-bit, and variable bit length ciphers, and extraction of all document metadata provided by PDF documents (including form data, bookmarks, and annotations). Easy integration with Jakarta Lucene is included, as well as interactive form update capability.
Plait (pronounced "play") is a command-line jukebox and music player front end. It understands brief, easy to type queries that pick a single song, mix queries that combine works from multiple artists, and stream queries that find Shoutcast radio streams. A variety of filters are available to pick just the music you want to hear. In order to actually play the music it finds, Plait automatically hands off a playlist to one of the supported music players, or creates a playlist that you can manually load.
SWISH++ is a Unix-based file indexing and searching engine (typically used to index and search files on web sites). It was based on SWISH-E although SWISH++ is a complete rewrite. SWISH++ is at least 10 times faster and can handle much larger numbers of files. Additionally, it has unique features such as selective non-indexing, on-the-fly filters, user-selectable stemming, and more.