Docx to Text Converter (docx2txt)

docx2txt is a tool that attempts to generate equivalent text files from Microsoft .docx documents (even corrupted ones). It preserves some formatting and document information (which MS text conversion drops) and applies appropriate character conversions for a good (ASCII) text experience. It is a platform-independent solution consisting of a core Perl script, Unix/Windows wrapper shell scripts, and a configuration file that controls the appearance of the output text to a fair extent. It depends on a command-line unzipping program (such as unzip, 7z, pkzipc, or wzunzip) that can silently extract single files from zip archives to the console, standard output, or a pipe.

docx2txt can conveniently be used to build a Web-based docx document conversion service. Some Makefiles and Windows batch files are provided for easy installation of the scripts. With unzippers such as CakeCmd that can deal with corrupt Zip archives, this tool can in many cases extract text from corrupt docx documents that the MS word processor fails to even open.
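
To illustrate why a single-file extraction from the zip archive is all that is needed, here is a minimal Python sketch of the general idea. It is not docx2txt's actual Perl code: it uses Python's zipfile module in place of an external unzipper, and the function name docx_to_text is hypothetical. It reads word/document.xml (the main document part of a .docx archive) and collects the text runs of each paragraph.

```python
import zipfile
from xml.etree import ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the plain text of a .docx file (a path or a file-like object).

    Hypothetical helper for illustration only; formatting, lists, and
    hyperlinks are ignored.
    """
    # Only one member of the archive is needed, which is why an unzipper
    # that can write a single file to standard output is sufficient.
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W + "p"):
        parts = []
        for run in para.iter(W + "r"):
            for node in run:
                if node.tag == W + "t":        # ordinary text
                    parts.append(node.text or "")
                elif node.tag == W + "tab":    # tab character
                    parts.append("\t")
                elif node.tag == W + "br":     # line break
                    parts.append("\n")
                # w:delText (tracked deletions) is skipped on purpose.
        paragraphs.append("".join(parts))
    return "\n".join(paragraphs)
```

Calling docx_to_text("report.docx") (the file name is only an example) would return the document's paragraphs joined by newlines. A corrupt archive would make zipfile raise an error here, which is the situation where a more tolerant unzipper becomes useful.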

Recent releases

  •  16 May 2014 12:01

    Release Notes: This release adds the configuration variable config_unzip_opts, which removes the dependency on the unzip program and allows users to use unzipping programs like 7z, pkzipc, and winzip as well. It also fixes list numbering, improves list/paragraph indentation and the corresponding code, and updates the README with brief guidance on how this utility can be used to recover text from corrupted docx files.

  •  07 Apr 2014 18:40

    Release Notes: Adds support for handling lists (bullet, decimal, letter, and roman) along with (an attempt at) indentation. Adds the configuration variable config_twipsPerChar, and removes the configuration variables config_listIndent and config_exp_extra_deEscape. Text output now omits deleted text, which matters when changes are being tracked in a docx document.

  •  14 Jan 2012 21:15

    Release Notes: The Perl script can now take input from stdin, and also works with input/output redirection. Script files and the configuration file can now be installed in separate directories on (non-Windows) systems using Makefile for installation. The configuration file is now uniformly looked for in the current directory, the user configuration directory, and the system configuration directory, in the specified order. Handling of special (non-text) characters has been improved, along with support for more non-text characters, like fractions.

  •  12 Dec 2011 05:29

    Release Notes: Minor non-extraction feature enhancements and bugfixes, based on the feedback/input received from users. A check for the existence of the unzip command has been added. The configuration file is looked for in $HOME as well. Configuration variables now begin with config_. Bugs #3003903, #3082018, and #3082035 have been fixed. The null device for Cygwin has been fixed. Superscripted cross-references are now placed within [...].

  •  05 Oct 2009 08:32

    Release Notes: This release focuses mainly on user interaction aspects. The new features are a Windows installation script, a Windows wrapper script, support for using CakeCmd apart from Unzip, a configuration file, and support for working with a directory holding the unzipped content of a .docx file. Handling of short line justification has been improved; many cases that were missed by the earlier approach are now captured. Path names containing spaces are now handled.
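
The 07 Apr 2014 release notes above mention list handling (bullet, decimal, letter, and roman) and a config_twipsPerChar variable for indentation. The following Python sketch shows one way such markers and indentation could be produced in plain text. It is an illustration under stated assumptions, not the tool's implementation: all function names are hypothetical, the value of 120 twips per character is assumed here (1440 twips per inch divided by 12 characters per inch) rather than taken from the tool's configuration, and the letter sequence after "z" (aa, ab, ...) is a simplification that word processors may not follow.

```python
def to_roman(n):
    """Convert a positive integer to a lowercase roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, symbol in numerals:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def to_letter(n):
    """Map 1 -> a, 26 -> z, 27 -> aa (assumed spreadsheet-style sequence)."""
    out = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        out = chr(ord("a") + rem) + out
    return out


def list_marker(style, index):
    """Return the text marker for the index-th item (1-based) of a list."""
    if style == "bullet":
        return "*"
    if style == "decimal":
        return f"{index}."
    if style == "letter":
        return f"{to_letter(index)}."
    if style == "roman":
        return f"{to_roman(index)}."
    raise ValueError(f"unknown list style: {style}")


def render_item(style, index, text, left_twips, twips_per_char=120):
    """Indent a list item by converting a twips offset to whole characters."""
    columns = max(0, left_twips // twips_per_char)
    return " " * columns + f"{list_marker(style, index)} {text}"
```

For example, render_item("roman", 4, "fourth", 720) indents by 720 // 120 = 6 columns and yields "      iv. fourth". Making the twips-per-character ratio a parameter mirrors the idea of exposing it as a configuration variable, so the output can be tuned to the assumed width of a text character.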
