402 projects tagged "Text Processing"
LibAxl is an efficient implementation of the XML 1.0 standard specification. It doesn't have any external library dependencies, having a clean implementation based on opaque types and a consistent API to manipulate your XML documents without compromising your code. It is extremely memory efficient and thread safe with a small footprint (111k). It also includes XML Namespaces support.
John the Ripper is a fast password cracker, currently available for many flavors of Unix, Windows, DOS, BeOS, and OpenVMS. Its primary purpose is to detect weak Unix passwords. It supports several crypt(3) password hash types commonly found on Unix systems, as well as Windows LM hashes. On top of this, lots of other hashes and ciphers are added in the community-enhanced version (-jumbo), and some are added in John the Ripper Pro.
TXR is a new data munging language to replace the likes of awk and Perl. TXR's special pattern language provides template-based matching of entire documents or large sections of documents. It also contains a language for functional and imperative programming. It is written in C and takes the form of a utility that is portable to Unix-like platforms and Windows.
Template Data Interface (TDI, /ʹtedɪ/) is a markup templating system written in Python with (optional but recommended) speedup code written in C. Unlike most templating systems, TDI does not invent its own language to provide functionality. Instead, you simply mark the nodes you want to manipulate within the template document. The template is parsed, and the marked nodes are presented to your Python code, where they can be modified in any way you want.
SILVERCODERS DocToText is a powerful utility which can convert documents in many formats to plain text. It includes a console application and C/C++ library, which allows embedding text extraction mechanisms into other applications. It supports MS Office binary formats (MS Word (DOC), MS Excel (XLS), MS PowerPoint (PPT), and Rich Text Format (RTF)), OpenDocument formats (text documents (ODT), spreadsheets (ODS), and presentations (ODP)), Office Open XML formats (MS Word (DOCX), MS Excel (XLSX), and MS PowerPoint (PPTX)), and HyperText Markup Language (HTML). DocToText can extract text not only from the document body but also from annotations (comments) embedded in odt, doc, docx, or rtf files and read metadata like author, last modification date, or number of pages. It can be used as a fast console viewer, and is able to convert corrupted OpenDocument and Office Open XML documents. It can be used to recover text even if other recovery methods failed.