PDFTextStream

PDFTextStream is a PDF text and metadata extraction library available for Java and .NET. It supports all versions of the PDF document specification (including v1.7, used by Acrobat 8, 9, and X), extraction of text encoded using double-byte character sets (including Chinese, Japanese, and Korean), decryption of documents encrypted using 40-bit, 128-bit, 256-bit, and variable bit length ciphers, and extraction of all document metadata provided by PDF documents (including form data, bookmarks, and annotations). Easy integration with Jakarta Lucene is included, as well as interactive form update capability.

Recent releases

  •  09 Aug 2012 22:59

    Release Notes: PDFTextStream is now free for use in single-threaded applications; the previous "evaluation" limitations no longer apply when PDFTextStream is operated without a license file. A new OutputHandler is now available: com.snowtide.pdf.SelectionOutputTarget, which implements text extraction based on "selection coordinates", as commonly found in user-facing PDF viewer UIs.

  •  02 Aug 2012 21:47

    Release Notes: This release adds support for decryption of AES-encrypted PDF documents (including support for 256-bit and variable bit length ciphers), and adds dozens of performance and PDF document compatibility enhancements and fixes. PDFTextStream for Java now requires version 1.5.0 or higher of the JVM/JRE, and PDFTextStream.NET now ships with IKVM 0.46.0.1 and requires .NET 2.0 or higher. PDF merge capability (com.snowtide.pdf.util.MergeUtil) has been deprecated, as has memory-mapping of opened PDF files (now disabled by default).

  •  15 Sep 2011 17:39

    Release Notes: This release includes a variety of fixes that allow PDFTextStream to extract text from PDF documents that do not conform to the PDF specification. It also includes a variety of performance enhancements.

  •  23 Apr 2009 14:42

    Release Notes: An .isStruckThrough() method was added to com.snowtide.pdf.TextUnit, indicating whether a character has a strikethrough drawn through it. PDFTextStream's support for embedded character mappings was improved. The calculation of whitespace between words was fixed to properly account for whitespace that is explicitly encoded in the source PDF documents. PDFTextStream's handling of composite content encodings was improved; previously it could fail, causing some ranges of PDF content to be "ignored" during extraction.

  •  30 Dec 2008 19:16

    Release Notes: This release adds support for extracting XFA forms data as XML and for PDF documents larger than 2GB. It significantly improves the performance of text extraction using VisualOutputTarget. It fixes a bug where the encodings from embedded Type1 fonts were not being applied properly in some circumstances, a problem where newer content in updated PDF documents was sometimes being ignored, and a problem where PDFDocEncoding-encoded bookmarks and metadata were not being decoded properly. A .getDestinationName() method was added to com.snowtide.pdf.Bookmark.
