Nutch is highly scalable web search software built on top of Apache Hadoop and Lucene Java. Key features include a web crawler, an indexer, crawl management tools, parsers for HTML, PDF, DOC, and several other document formats, and an extensible architecture that lets you plug in additional functionality such as document parsers, custom scoring algorithms, protocols, and more.
Apache OpenNLP is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services.
Apache PhotArk is a photo gallery application including a content repository for the images, a display piece, an access control layer, and upload capabilities. The idea is to have a rigid design for the content repository with a very flexible display piece. The images in the content repository will be protected with granular access control.
The mission of the Apache Portable Runtime (APR) project is to create and maintain software libraries that provide a predictable and consistent interface to underlying platform-specific implementations. The primary goal is to provide an API to which software developers may code and be assured of predictable if not identical behaviour regardless of the platform on which their software is built, relieving them of the need to code special-case conditions to work around or take advantage of platform-specific deficiencies or features.
Apache Qpid is a messaging broker that implements the latest AMQP specification, providing transaction management, queuing, distribution, security, management, clustering, federation, heterogeneous multi-platform support, and much more. It is designed for high performance and aims to be 100% AMQP compliant.
UIMA SDK is a software architecture and framework for supporting the development, integration, and deployment of search and analysis technologies. It can be used to analyze large volumes of unstructured information (text, audio, video, images, etc.) to discover, organize, and deliver relevant knowledge to the client or application end user.
Apertium is a machine translation platform, initially aimed at related-language pairs, but recently expanded to deal with more divergent language pairs (such as English-Catalan). The platform provides a language-independent machine translation engine, tools to manage the linguistic data necessary to build a machine translation system for a given language pair, and linguistic data for a growing number of language pairs.
Apitron PDF Rasterizer is a .NET component that performs high-quality conversion from PDF files to images. It supports complex PDF content, including text (with embedded, externally linked, standard, simple, and composite fonts); images, including masked ones; complex paths and fills; PDF forms; annotation objects of various types; all blending modes; tiling patterns; shading patterns (function-based, axial, radial); transparency groups; masked content (stencil masks, colorkey masks, soft masks); all colorspaces specified by the PDF standard; and files created in Adobe Illustrator. It also provides PDF bookmarks and page navigation, as well as text search and highlighting (including non-Latin alphabets).