PDFreactor is a formatting processor that converts HTML and XML to PDF. It uses Cascading Style Sheets (CSS) to define page layout and styles. It allows you to dynamically generate PDF documents such as invoices, delivery notes, shipping documents, or print versions of Web content on the fly. Vector graphics (SVG), barcodes, MathML, XSLT, and CMYK colors are supported. All common J2EE application servers are supported. Complete .NET, PHP, Perl, Python, and Ruby APIs are included. Direct integration into automated build processes using Apache Ant is also possible.
getxbook is a collection of tools to download books from websites. There are tools to download from Google Books' "book preview", Amazon's "look inside the book", and Barnes and Noble's "book viewer". There is an optional GUI written in Tcl/Tk, and some shell scripts using OCR to create plain text or searchable PDFs and DjVu files from the downloaded books.
PdfMasher is a tool to convert PDF files containing text into a ready-for-ebook set of HTML files. Most ebook readers support PDF files natively, but it's often a real pain to read those documents because you can't control the font size of the document and have to resort to the zooming feature instead. Another drawback of PDFs on ebook readers is that annotations are not supported. Unlike other tools that convert PDFs to ebooks, PdfMasher does not try to guess the role of each piece of text in the PDF, and instead asks the user about the role of each piece of text, and does so in an efficient manner.
Whyteboard allows you to annotate PDF and PostScript documents and various image formats. You can draw with common tools such as a pen, rectangle, ellipse, and text tool, and the shapes you draw can be moved, resized, and recoloured. Your drawing history is stored, allowing you to replay it. Tabbed painting is supported, with each sheet having its own unlimited undo and redo operations. There are live-updating thumbnails for each sheet. Closing a sheet can also be undone, restoring its data. Other features include note controls, similar to virtual, editable Post-It Notes, and a draggable, live-updating, resizable canvas that stretches to whatever size you want.
Apitron PDF Rasterizer is a .NET component that performs high-quality conversion from PDF files to images. It supports complex PDF content including text (with embedded, externally linked, standard, simple, and composite fonts), images (including masked ones), complex paths and fills, PDF Forms, annotation objects of various types, all blending modes, tiling patterns, shading patterns (function-based, axial, radial), transparency groups, masked content (stencil masks, colorkey masks, soft masks), all colorspaces specified by the PDF standard, and files created with Adobe Illustrator. It also supports PDF bookmarks, page navigation, and text search and highlighting (including non-Latin alphabets).
I, Librarian is a PDF manager or PDF organizer that allows individual researchers or a group of researchers to create an annotated collection of PDF articles. Users may build the virtual library collaboratively, thus sharing the workload of literature mining. It enables smart browsing and fast searching in reference data and PDF files, and includes an advanced tool for mining scientific literature from PubMed, PubMed Central, NASA ADS, arXiv, IEEE Xplore, and HighWire Press.
MALODOS helps you to scan, store, and easily retrieve all your personal documents. Its storage format is open and documented, so your document archive can remain accessible even without MALODOS. The documents themselves are stored as standard PDF files, while their metadata (such as title, tags, and description) is stored in a separate SQLite database in an open format. With MALODOS, you can also manage existing files in PDF, JPEG, TIFF, and other formats, so you can still use the documents that you've already scanned. You can connect it to any external OCR program to enable full-text search.
Solr-Connector-Files crawls and indexes directories and files from your filesystem (whatever is mountable to Linux) into Apache Solr. It features extraction of file contents with Tika, which extracts metadata and text from many document and file formats. It also integrates automatic text recognition (OCR) for images, photos, and PDFs using Tesseract OCR.
offrss is a standalone program that can download your favorite feeds and then show them in your favorite Web browser by spawning a simple local Web server. It will not only download the feeds' text, but also the pictures, so you will also be able to read comic strips and enjoy posts with pictures in them while offline. It can also generate PDFs from text. It remembers what you have read and what you haven't, and all the information stays in normal files, so you can synchronize it easily to any device that may not have an Internet connection. It can also work as a CGI program to serve your feeds on your Web site, and it can update the feeds from crontab. It has few dependencies to build and can be cross-compiled easily.