Winnow efficiently trains and operates any number of unique Bayesian (Naive Bayes) classifiers on large sets of content. It has very high performance and works with very small training and unbalanced training sets. It has been used to power an innovative Web feed reader that uses smart tags, which learn and find the content you want to see, from more sources than you can follow with traditional feed readers. It works particularly well with Ruby and Ruby on Rails.
SiSU (Structured information, Serialized Units) is a lightweight markup based, text structuring and publishing framework (that features granular search). With minimal markup of a plaintext file, it produces: plain-text, HTML, XHTML, XML, ODF, LaTeX, PDF, and populates an SQL database at an object/paragraph level for granular searches. Prepare documents using your text editor of choice, then use SiSU to generate the desired output formats. SiSU is controlled from the command line.
Xapian is a search engine library, scalable to collections containing hundreds of millions of documents. It's written in C++ with bindings for Perl, Python, PHP, Java, Tcl, C#, Ruby, and Lua. It is a highly adaptable toolkit that allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model and also a rich set of boolean query operators. Omega is a Web search application built upon the Xapian library. It can index a Web server's document tree (including HTML, PDF, OpenOffice, MS Word/Excel/Powerpoint/Works, WordPerfect, RTF, PS, etc.), or data exported from arbitrary sources (e.g. SQL databases).
Sanzang is a compact and simple cross-platform machine translation system. It is especially useful for translating from the CJK languages (Chinese, Japanese, and Korean), and it is very suitable for working with ancient and otherwise difficult texts. Unlike most other machine translation systems, Sanzang is small and approachable. Any user can develop his or her own translation rules, and these rules are simply stored in a text file and applied at runtime.
AntFlow builds upon Apache Ant to provide a new approach to simplifying system automation that uses pipelines of hot folders chained together to perform a given task. Using XML, it associates an automated task such as data transfer, encryption, or XML processing with a directory on the local system. Whenever a file is copied or written into the hot folder, the associated task is executed and the file is moved to the next hot folder in the pipeline for further processing.