SILVERCODERS DocToText

SILVERCODERS DocToText is a utility that converts documents in many formats to plain text. It includes a console application and a C/C++ library, which allows text extraction to be embedded in other applications. It supports MS Office binary formats (MS Word (DOC), MS Excel (XLS, XLSB), and MS PowerPoint (PPT)), Rich Text Format (RTF), OpenDocument formats (text documents (ODT), spreadsheets (ODS), presentations (ODP), and graphics (ODG)), Office Open XML formats (MS Word (DOCX), MS Excel (XLSX), and MS PowerPoint (PPTX)), iWork formats (PAGES, NUMBERS, KEYNOTE), OpenDocument Flat XML formats (FODP, FODS, FODT), Portable Document Format (PDF), email files (EML), and HyperText Markup Language (HTML). DocToText extracts text not only from the document body but also from annotations (comments) embedded in ODT, DOC, DOCX, or RTF files, and it reads metadata such as author, last modification date, or number of pages. It can serve as a fast console viewer, and it can convert corrupted OpenDocument and Office Open XML documents, so it may recover text even where other recovery methods have failed.

Recent releases

  •  07 Jan 2014 00:17

    Release Notes: With the introduction of PDF, iWork, XLSB, OpenDocument Flat XML, and EML (email) support, this version covers all the major document formats on the market. Support for Object Linking and Embedding (OLE) in ODF formats has been added. Win64 is now officially supported. The capabilities of the C API have been expanded significantly. Many fixes and improvements have been made, including improvements for multithreaded applications.

  •  08 Mar 2013 21:55

    Release Notes: HyperText Markup Language (HTML) format support was introduced in this version. The ability to retrieve metadata such as document author, last modification date, or number of pages was added. An important new feature is the extraction of text from annotations (comments) embedded in ODT, DOC, DOCX, or RTF files. Some malfunctions were also fixed.

  •  19 Oct 2012 11:45

    Release Notes: This is the first version available for Mac OS X and also the first version available as a C/C++ library in addition to the console application. MS PowerPoint binary format (PPT) support has been added. Headers, footers, and embedded XLS workbooks in DOC files are now supported. Text extraction from OpenDocument and OOXML formats has been significantly optimized. Many bugs have been fixed.

  •  04 Aug 2010 13:57

    Release Notes: In addition to bug fixes and optimizations, MS Excel binary format (XLS) support was added in this version.

  •  04 Sep 2009 21:46

    Release Notes: In addition to bug fixes and optimizations, the ability to convert corrupted OpenDocument and Office Open XML documents was added.

Recent comments

12 Oct 2008 18:18 silvercoders

Re: prior art

> Thanks for both the utility and description update then; will have a look :-)

There is one more thing: catdoc has not been actively developed since 2005. Doctotext was started in 2006 and will gain new functionality, for example PDF support. You can consider it a future replacement.

We could have tried to add something to catdoc, but we started a new project because of licensing issues (we need to use doctotext in our commercial software).

11 Oct 2008 21:20 gvy

Re: prior art

Thanks for both the utility and description update then; will have a look :-)

11 Oct 2008 21:08 silvercoders

Re: prior art

> ...is it much better than catdoc(1)? :)

It supports more formats (OpenDocument, Office Open XML), and as far as I know some troublesome DOC documents are handled better.

03 Aug 2006 09:41 gvy

prior art

...is it much better than catdoc(1)? :)
