39 projects tagged "Text Processing"
Winnow efficiently trains and operates any number of unique Bayesian (Naive Bayes) classifiers on large sets of content. It offers very high performance and works with very small and unbalanced training sets. It has been used to power an innovative Web feed reader whose smart tags learn and find the content you want to see, drawing on more sources than you can follow with traditional feed readers. It works particularly well with Ruby and Ruby on Rails.
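For readers unfamiliar with the technique, the following is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is not Winnow's code or API: the `NaiveBayes` class, its `train` and `classify` methods, and the sample labels are all hypothetical names chosen for illustration, and whitespace tokenization is an assumption.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Illustrative multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()                  # every word seen in training

    def train(self, label, text):
        words = text.lower().split()
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocabulary.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: one "ruby" document, two "search" documents.
clf = NaiveBayes()
clf.train("ruby", "rails gem release ruby interpreter")
clf.train("search", "index query ranking search engine")
clf.train("search", "probabilistic retrieval query index")
print(clf.classify("new ruby gem"))  # prints "ruby"
```

Working in log space avoids floating-point underflow on long documents, and the smoothing term keeps a single unseen word from zeroing out a label, which matters most when training sets are small.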
Xapian is a search engine library, scalable to collections containing hundreds of millions of documents. It's written in C++ with bindings for Perl, Python, PHP, Java, Tcl, C#, Ruby, and Lua. It is a highly adaptable toolkit that allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model as well as a rich set of boolean query operators. Omega is a Web search application built upon the Xapian library. It can index a Web server's document tree (including HTML, PDF, OpenOffice, MS Word/Excel/PowerPoint/Works, WordPerfect, RTF, PS, etc.), or data exported from arbitrary sources (e.g. SQL databases).
SiSU (Structured information, Serialized Units) is a text structuring and publishing framework based on lightweight markup, with granular search. With minimal markup of a plain-text file, it produces plain text, HTML, XHTML, XML, ODF, LaTeX, and PDF output, and populates an SQL database at the object/paragraph level for granular searches. Prepare documents using your text editor of choice, then use SiSU to generate the desired output formats. SiSU is controlled from the command line.
glark offers grep-like searching of text files, with very powerful, complex regular expressions (e.g., "/foo\w+/ and /bar[^\d]*baz$/ within 4 lines of each other"). It also highlights matches, displays context (preceding and succeeding lines), performs case-insensitive matching, and automatically excludes non-text files. It supports most options from the GNU version of grep.
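The "within 4 lines of each other" idea can be sketched in a few lines of Python. This is only an illustration of the concept, not glark's implementation or command-line syntax: the `within_lines` helper is a hypothetical name, and treating "within N lines" as an absolute line-number difference of at most N is an assumption.

```python
import re


def within_lines(lines, first, second, distance):
    """Return (i, j) index pairs where `first` matches line i, `second`
    matches line j, and the two lines are at most `distance` apart."""
    a = re.compile(first)
    b = re.compile(second)
    hits_a = [i for i, line in enumerate(lines) if a.search(line)]
    hits_b = [j for j, line in enumerate(lines) if b.search(line)]
    return [(i, j) for i in hits_a for j in hits_b if abs(i - j) <= distance]


# Hypothetical sample text; line indices are zero-based.
sample = [
    "foobar starts here",   # matches /foo\w+/
    "nothing of interest",
    "bar and then baz",     # matches /bar[^\d]*baz$/
    "unrelated",
    "unrelated",
    "unrelated",
    "unrelated",
    "unrelated",
    "unrelated",
    "foolish ending",       # matches /foo\w+/ but is 7 lines from the baz line
]
print(within_lines(sample, r"foo\w+", r"bar[^\d]*baz$", 4))  # prints [(0, 2)]
```

The sketch compares every pair of matching lines, which is fine for small files; a tool built for large inputs would more likely scan with a sliding window.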
deplate converts wiki-like markup to LaTeX (standard classes, koma, dramatist, sweave), HTML/PHP (single page, chunked/website, HTML, or s5-based slideshow), DocBook (article, book, man/ref page), and really plain text. Currently supported input formats are viki and Ruby's rdoc. The viki markup supports footnotes, citations, index, table of contents, embedded LaTeX for mathematics, integration with R for dynamically generated figures and tables, and more. Output can be customized via page templates.
ZenWeb is a system for building entire Web sites, not just pages. It allows you to focus on the content and the structure of the Web site, while leaving page construction, markup, layout, and navigation as secondary concerns. It provides tools for complete Web site design and creation, simple paragraph-to-HTML generation with embellishments, and a rich set of tools for page and Web site modification and customization.
A Web-based document management system with a Google-like search engine.