PDF OCR is a simple drag-and-drop utility that converts PDFs and images into text documents. It uses OCR (optical character recognition), performed by the Tesseract engine, to extract the text of the PDF or image, and currently supports over 20 languages. This is particularly useful for PDFs and images that were created via the scan-to-PDF function of a scanner or photocopier.
getxbook is a collection of tools to download books from websites. There are tools to download from Google Books' "book preview", Amazon's "look inside the book", and Barnes and Noble's "book viewer". There is an optional GUI written in Tcl/Tk, and some shell scripts using OCR to create plain text or searchable PDFs and DjVu files from the downloaded books.
OCRFeeder is a document layout analysis and optical character recognition application. It is able to automatically outline a document image's contents, distinguish between graphics and text and perform OCR over the latter. It can export to several formats, its main one being ODT. OCRFeeder has a GTK+ graphical user interface that allows the user to control the application and, for example, edit and correct the automatic recognition. It can also be used from the command line for automation.
Paperless Office is a document management and electronic filing system. It is similar to PaperPort, but adds many new features, such as automatic document classification, synchronization with your filing cabinet, date extraction, and semantic Web integration. It also applies sophisticated natural language processing, such as extracting to-do lists from documents, spam detection, and urgency classification, and provides planning, scheduling, and execution features. You can set due dates and interdependencies for documents and tasks, which gives it workflow support.
MALODOS helps you to scan, store, and easily retrieve all your personal documents. Its storage format is open and documented, so your document archive can remain accessible even without MALODOS. The documents themselves are stored as standard PDF files, while their metadata (such as title, tags, and description) is stored in a separate SQLite database in an open format. With MALODOS, you can also manage existing files in PDF, JPEG, TIFF, and other formats, so you can still use the documents that you've already scanned. You can connect it to any external OCR program to enable full-text search.
Aspose.OCR for .NET is a character recognition component that lets developers add OCR functionality to their ASP.NET Web applications, Web services, and applications. It provides a simple set of classes for controlling character recognition tasks and supports BMP and TIFF images.
FuzzyOcr is a plugin for SpamAssassin that targets image spam. It supports optical character recognition using different engines and settings; a fuzzy word matching algorithm applied to OCR results; an image hashing system to learn the unique properties of known spam images; dimension, size, and integrity checking of images; and content-type verification for the containing email message.
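As a rough illustration of what fuzzy word matching over OCR output can look like, here is a minimal Python sketch. It is not FuzzyOcr's actual algorithm or code: the function name `fuzzy_hits`, the keyword list, the sample text, and the 0.8 threshold are all hypothetical, and the similarity measure is simply the one provided by Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def fuzzy_hits(ocr_text, keywords, threshold=0.8):
    """Return (keyword, ocr_word, score) triples whose similarity meets the threshold.

    Hypothetical helper for illustration only; not part of FuzzyOcr.
    """
    # Normalize the OCR output: split on whitespace, strip punctuation, lowercase.
    words = [w.strip(".,!?:;").lower() for w in ocr_text.split()]
    hits = []
    for keyword in keywords:
        for word in words:
            if not word:
                continue
            # ratio() is 1.0 for identical strings and falls toward 0.0
            # as the two strings share fewer characters in order.
            score = SequenceMatcher(None, keyword.lower(), word).ratio()
            if score >= threshold:
                hits.append((keyword, word, score))
    return hits


# Assumed sample: OCR output in which digits stand in for letters.
sample = "Buy cheap v1agra and rep1ica watches today"
for keyword, word, score in fuzzy_hits(sample, ["viagra", "replica"]):
    print(f"{keyword!r} matched {word!r} (similarity {score:.2f})")
```

An exact string comparison would miss both obfuscated words in the sample, while the similarity threshold still catches them. Lowering the threshold catches more distortions at the cost of more false matches.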