HTMLDOC converts HTML files and Web pages into indexed HTML, PostScript, and PDF files suitable for online viewing and printing. It can be used as a standalone GUI application, in a batch document processing environment, as a Web-based report generation application, or in embedded environments to support printing of HTML content. It runs on all Unix platforms as well as Mac OS X and Windows 2000 and higher.
libextractor is a library used to extract metadata from files of arbitrary type. It is designed to use helper libraries to perform the actual extraction, and to be trivially extensible by linking against external extractors for additional file types. The goal is to provide developers of file-sharing networks, file managers, and WWW-indexing bots with a universal library to obtain metadata about files. It includes a shell command and bindings for Java (JNI) and Python.
SILVERCODERS DocStorage is a utility to improve document management. You can have one database for all invoices, guarantees, protocols, and other documents. DocStorage can extract plain text from documents in DOC, XLS, PPT, PDF, RTF, ODT, ODS, ODP, DOCX, XLSX, PPTX, and many other formats. It can use an OCR engine to extract plain text even from scanned documents. It can perform a global full-text search across all documents regardless of format. It supports document versioning, duplicate detection, document notes, and document signing. It provides full integration with software suites like Microsoft Office and OpenOffice.
Emdros is a corpus query system for storing and searching linguistically annotated text. It is very generic, supporting almost any kind of annotation from almost any linguistic theory. All linguistic levels of analysis are supported, including phonology, morphology, the lexical level, syntax, and discourse. The core libraries act as a middleware layer between a client and an underlying SQL database. MySQL, PostgreSQL, and SQLite are supported.
Multivalent PDF Tools is a suite of tools for manipulating PDF documents. It includes tools for compressing, decompressing (for hand editing), obtaining metadata, splitting and merging, encrypting and decrypting, validating, imposition (aka n-up), making page images, extracting text, and full-text indexing (with Lucene). The compress tool shrinks the PDF 1.5 Reference from 13.5 MB to 8 MB in PDF 1.5/Acrobat 6 format and down to 5.1 MB in a new proposed "Compact" format.
Hyper Estraier is a full-text search system. It can be used for Web search, mailbox search, and similar tasks. It features high-performance searching, high scalability in the number of target documents, a perfect recall ratio by the N-gram method, phrase searching, attribute searching, and similarity searching. Multilingualism is supported with Unicode. It is independent of file format and repository, and has a simple and powerful API.
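The "perfect recall by the N-gram method" claim can be illustrated with a minimal sketch. This is not Hyper Estraier's implementation or API; the class `NGramIndex`, the helper `ngrams`, and the choice of character bigrams are assumptions made purely for illustration. The idea is that every substring of length n in every document is indexed, so any query substring that occurs in a document is guaranteed to be found, without depending on word segmentation.

```python
from collections import defaultdict


def ngrams(text, n=2):
    """Return the set of overlapping character n-grams in text (lowercased)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class NGramIndex:
    """Hypothetical in-memory character n-gram index (illustration only)."""

    def __init__(self, n=2):
        self.n = n
        self.postings = defaultdict(set)  # n-gram -> set of document ids
        self.docs = {}                    # document id -> lowercased text

    def add(self, doc_id, text):
        """Index one document under every n-gram it contains."""
        self.docs[doc_id] = text.lower()
        for gram in ngrams(text, self.n):
            self.postings[gram].add(doc_id)

    def search(self, query):
        """Return sorted ids of documents containing query as a substring."""
        query = query.lower()
        grams = ngrams(query, self.n)
        if not grams:
            # Query is shorter than n: no n-grams to look up, so scan directly.
            return sorted(d for d, t in self.docs.items() if query in t)
        # A document containing the query must contain all of its n-grams,
        # so intersecting the posting sets never drops a true match.
        candidates = set.intersection(
            *(self.postings.get(g, set()) for g in grams)
        )
        # The n-grams may occur in a document without being adjacent,
        # so confirm each candidate with an exact substring check.
        return sorted(d for d in candidates if query in self.docs[d])


if __name__ == "__main__":
    index = NGramIndex(n=2)
    index.add(1, "Full-text search system")
    index.add(2, "Mailbox searching")
    index.add(3, "PDF tools")
    print(index.search("search"))
```

The intersection step gives recall (no matching document is missed), while the final substring check restores precision. A production system would store positions in the postings instead of rescanning document text.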
Connexor Machinese analyzers process sequences of written words, identify and classify the various entities in them, and show how these relate to each other, marking the language with a simple and systematic notation. Currently, the Machinese product family includes: Machinese Phrase Tagger, a fast, light-weight morphosyntactic tagger; Machinese Syntax, a full-scale dependency parser; Machinese Semantics, a dependency parser with semantic analysis; and Machinese Metadata, an entity extractor.