docx2txt is a tool that attempts to generate equivalent plain (ASCII) text files from Microsoft .docx documents. It preserves some formatting and document information that Microsoft's own text conversion drops, and applies appropriate character conversions for a good ASCII text experience. It is a platform-independent solution consisting of a core Perl script, Unix/Windows wrapper shell scripts, and a configuration file that controls the appearance of the output text to a fair extent. It can conveniently be used to build a Web-based docx conversion service. Makefiles and Windows batch files are provided for easy installation of the scripts. With unzippers such as CakeCmd that can deal with corrupt Zip archives, the tool can in many cases extract text from corrupt docx documents that Microsoft Word fails to even open.
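To illustrate the general idea behind this kind of conversion, here is a minimal Python sketch. It is not docx2txt's Perl implementation. It assumes only that a .docx file is a Zip archive whose main text lives in `word/document.xml`, and the function name `docx_to_text` is made up for this example. It handles paragraph ends, plain tabs, and plain line breaks, and nothing else.

```python
import re
import zipfile
from xml.sax.saxutils import unescape


def docx_to_text(source):
    """Return the text of a .docx file given as a path or file-like object.

    Illustrative only: formatting, hyperlinks, and lists are ignored.
    """
    with zipfile.ZipFile(source) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # Each closing paragraph tag becomes a newline.
    xml = re.sub(r"</w:p>", "\n", xml)
    # Attribute-less tab and break elements get plain-text equivalents.
    xml = re.sub(r"<w:tab\s*/>", "\t", xml)
    xml = re.sub(r"<w:br\s*/>", "\n", xml)
    # Drop every remaining tag, then decode &amp;, &lt; and &gt;.
    text = re.sub(r"<[^>]+>", "", xml)
    return unescape(text)
```

Because `zipfile.ZipFile` accepts file-like objects as well as paths, the same function could sit behind a Web upload handler. It will not cope with the corrupt archives mentioned above.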
| Tags | Text Processing, docx, Conversion |
|---|---|
| Licenses | GPLv3 |
| Operating Systems | OS Independent |
| Implementation | Perl, Unix Shell, bash, Windows batch file |
Recent releases


Release Notes: The Perl script can now take input from stdin, and also works with input/output redirection. Script files and the configuration file can now be installed in separate directories on (non-Windows) systems when the Makefile is used for installation. The configuration file is now uniformly looked for in the current directory, the user configuration directory, and the system configuration directory, in that order. Handling of special (non-text) characters has been improved, and more non-text characters, such as fractions, are supported.
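The lookup order described in this release can be sketched as a first-match search over a list of directories. This is a Python illustration, not the tool's own code. The function name `find_config` and the directories in the usage note are assumptions, since the release notes do not name the actual paths.

```python
import os


def find_config(filename, search_dirs):
    """Return the first existing path to filename, or None if not found.

    Directories are tried in the order given, so earlier entries win.
    """
    for directory in search_dirs:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None
```

A caller might pass something like `[os.getcwd(), os.path.expanduser("~"), "/etc"]` (hypothetical locations) so that a file in the current directory overrides the user-level one, which in turn overrides the system-wide one.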


Release Notes: Minor non-extraction feature enhancements and bugfixes, based on feedback from users. The script now checks for the existence of the unzip command. The configuration file is looked for in $HOME as well. Configuration variables now begin with config_. Bugs #3003903, #3082018, and #3082035 have been fixed. The null device for Cygwin has been fixed. Superscripted cross-references are now placed within [...].


Release Notes: This release focuses mainly on user interaction aspects. The new features are a Windows installation script, a Windows wrapper script, support for using CakeCmd as an alternative to Unzip, a configuration file, and support for working with a directory holding the unzipped content of a .docx file. Handling of short line justification has been improved; many cases that the earlier approach missed are now captured. Path names containing spaces are now handled.


Release Notes: Display of hyperlinks is configurable. TOC related cleanup was done. Many new character conversions were implemented. Character conversion tables were added. Currency characters are converted to full currency names. Code tweaks were done to speed up the conversion process.
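A conversion table of the kind mentioned here can be pictured as a simple character-to-string mapping. The Python sketch below is illustrative only. The three entries and the replacement names are assumptions, not the tool's actual table, and `currency_to_names` is a hypothetical name.

```python
# Hypothetical table: currency symbol -> ASCII name (not docx2txt's own table).
CURRENCY_NAMES = {
    "\u20ac": "Euro",   # euro sign
    "\u00a3": "Pound",  # pound sign
    "\u00a5": "Yen",    # yen sign
}

# str.maketrans accepts a dict whose keys are single characters.
_CURRENCY_TABLE = str.maketrans(CURRENCY_NAMES)


def currency_to_names(text):
    """Replace each mapped currency symbol with its name; leave the rest as-is."""
    return text.translate(_CURRENCY_TABLE)
```

Using one translation table keeps the conversion to a single pass over the text, however many characters the table covers.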


Release Notes: Center and right justification is now applied to text that fits in a line of 80 columns (the width is adjustable). Hyperlinked text is indicated along with its hyperlink. A BSD makefile has been added, along with more documentation and some suggestions on how Windows users can use this tool. The docx2txt.pl invocation has changed slightly. User involvement during installation has been reduced.
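The justification rule can be sketched as padding a line on the left when it is shorter than the column width. This Python example is a simplified illustration. The function name `justify`, its parameters, and the choice to leave over-long lines untouched are assumptions. How the tool itself decides on a line's alignment is not shown.

```python
def justify(line, align="left", width=80):
    """Left-pad line with spaces so it is centered or right-aligned in width columns.

    Lines that already fill the width, and left-aligned lines, are returned as-is.
    """
    free = width - len(line)
    if free <= 0 or align == "left":
        return line
    if align == "center":
        # No trailing padding is added, so output lines carry no extra spaces.
        return " " * (free // 2) + line
    if align == "right":
        return " " * free + line
    raise ValueError("unknown alignment: " + align)
```

For example, `justify("abcd", "right", width=10)` yields six spaces followed by `abcd`.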