VectorSpace Database (VSDB) is a robust server that provides multi-dimensional similarity search. Any socket-capable programming language can use it to post and search vectorspaces of multi-dimensional data. Data can be of any base datatype (e.g. text, objects, dating profiles, sessions, ecommerce orders). VSDB also offers a clustering capability that can display groupings of data based on common dimensions. A built-in thesaurus feature can help bridge multiple similar dimensions in search or clustering.
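To illustrate the general idea of similarity search over named dimensions, here is a minimal Python sketch. It does not use VSDB's socket protocol or its actual scoring method, neither of which is described here. Cosine similarity, the function names, and the thesaurus format (a mapping from a dimension to a canonical synonym) are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {dimension: weight}."""
    dot = sum(weight * b.get(dim, 0.0) for dim, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def apply_thesaurus(vector, thesaurus):
    """Fold similar dimensions together (hypothetical format: {dimension: canonical})."""
    merged = {}
    for dim, weight in vector.items():
        canonical = thesaurus.get(dim, dim)
        merged[canonical] = merged.get(canonical, 0.0) + weight
    return merged


def search(query, items, thesaurus=None, top=3):
    """Return up to `top` (score, name) pairs, best match first."""
    thesaurus = thesaurus or {}
    q = apply_thesaurus(query, thesaurus)
    scored = [
        (cosine_similarity(q, apply_thesaurus(vec, thesaurus)), name)
        for name, vec in items.items()
    ]
    # Highest score first; ties are broken by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top]
```

With a thesaurus entry mapping "trekking" to "hiking", a profile that lists only "trekking" becomes an exact match for a "hiking" query, which is the kind of bridging between similar dimensions the description refers to.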
Solr is an enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g. Word and PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites. Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.
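As a rough illustration of the HTTP and JSON interface, the Python sketch below builds a query URL for Solr's `select` handler and fetches the result as JSON. The host, port, core name ("articles"), and field name are placeholders chosen for the example, and the exact URL layout depends on the Solr version and how it is deployed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, core, query, rows=10):
    """Build a URL for the select handler; `core` is a placeholder name."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


def run_query(url):
    """Send the request and decode the JSON response (needs a running server)."""
    with urlopen(url) as response:
        return json.load(response)


# Assumed local installation; adjust the base URL and core to your setup.
url = build_select_url("http://localhost:8983/solr", "articles", "title:lucene")
```

Calling `run_query(url)` against a running server would return the parsed response as a Python dictionary.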
Search::Xapian is a Perl XS frontend to the Xapian C++ search library. It is a fairly complete wrapper: most features of the Xapian library are made available for use from Perl. Xapian is a highly adaptable toolkit that allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model as well as a rich set of boolean query operators. It's fast and scalable to hundreds of millions of documents.
jSlovo is a fast database engine with a GUI that was designed for free dictionaries. It can create a file-based database from a text file and then search it for particular words. It can scan any number of file-based databases, and the size of the databases is not limited. HTML tags can be used in the text files and for cross-references.
Document clustering is a data mining suite for clustering a set of documents. The tools implement methods from a series of papers: "Clustering Web Pages Semantically using Combinatorial Topology", "Data mining using granular computing", and "A fast association rule algorithm based on bitmap and granular computing".
Invenio (formerly CDSware) is a suite of applications that provides the framework and tools for building and managing an autonomous digital library server. It complies with the Open Archives Initiative metadata harvesting protocol (OAI-PMH) and uses MARC 21 as its underlying bibliographic standard. Its flexibility and performance make it a comprehensive solution for the management of document repositories of moderate to large size.
TextSearch is a program to search through a set of text files in a directory structure. Each document is searched using a regular expression, and an overview of the results is shown as a tree structure. Clicking on a file opens it with the matches highlighted. Unlike other programs, its focus is not on statistics (i.e. how often a word occurs in an entire corpus of files) but on occurrences in single files.
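The per-file search described above can be sketched in a few lines of Python. This is not TextSearch's own code; the function name and the shape of the result (a mapping from each matching file to its matching lines) are assumptions made for illustration.

```python
import re
from pathlib import Path


def search_tree(root, pattern):
    """Return {relative path: [(line_number, line_text), ...]} for files with matches."""
    regex = re.compile(pattern)
    results = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip it
        hits = [
            (number, line)
            for number, line in enumerate(text.splitlines(), start=1)
            if regex.search(line)
        ]
        if hits:
            # Keys use forward slashes so results look the same on every platform.
            results[path.relative_to(root).as_posix()] = hits
    return results
```

Grouping the hits by file rather than summing them over the whole corpus mirrors the focus on occurrences in single files.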