Redet is a tool for developing and executing regular expressions using any of more than 50 search programs, editors, and programming languages, intended both for developing regular expressions for use elsewhere and as a search tool in its own right. For each program in each locale, a palette showing the available constructs is provided. The properties of each program are determined by runtime tests, which guarantees that they will be correct for the program version and locale. Additional features include persistent history, extensive help, a variety of character entry tools, and the ability to change locale while running. Redet is highly configurable and fully supports Unicode.
The Unicode Utilities are a set of programs for manipulating and analyzing Unicode text. uniname prints any combination of the character offset of each character, its byte offset, its hex code value, its encoding, the glyph itself, and its name. unidesc reports the character ranges to which different portions of the text belong. unihist generates a histogram of the characters in its input. ExplicateUTF8 determines and explains the validity of a sequence of bytes as a UTF-8 encoding. unirev reverses UTF-8 strings. unifuzz tests other programs' Unicode handling.
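The kind of per-character listing described for uniname can be sketched in a few lines of Python. This is only an illustration built on the standard unicodedata module, not the utility's own code; the function name describe and the tuple layout are assumptions made for the example.

```python
import unicodedata

def describe(text):
    """Yield (char offset, byte offset, code point, UTF-8 bytes, glyph, name)."""
    byte_offset = 0
    for char_offset, ch in enumerate(text):
        encoded = ch.encode("utf-8")
        # Hex dump of the UTF-8 encoding, e.g. "C3 A9"
        hex_bytes = " ".join(f"{b:02X}" for b in encoded)
        # Some characters (e.g. controls) have no name in the database
        name = unicodedata.name(ch, "<unnamed>")
        yield (char_offset, byte_offset, f"U+{ord(ch):04X}", hex_bytes, ch, name)
        byte_offset += len(encoded)

for row in describe("aé"):
    print(*row, sep="\t")
```

The character offset and byte offset diverge as soon as a character needs more than one byte in UTF-8, which is why a listing of this kind reports both.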
The Ascii2Binary project consists of two complementary programs that convert between textual and binary representations of numbers. Ascii2Binary reads input consisting of textual representations of numbers separated by whitespace and produces as output the binary equivalents. It is useful for generating test data and linking programs that generate textual output to programs that require binary input. Binary2Ascii converts binary numbers to text. In both programs, the type and size/precision of the input or output is selected using command line flags.
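The conversion in both directions can be illustrated with Python's struct module. The sketch below is not the programs' implementation and does not reproduce their command line flags; the function names and the fmt parameter (a struct format string standing in for the type and size selection) are assumptions for the example.

```python
import struct

def text_to_binary(text, fmt="<h", parse=int):
    """Pack whitespace-separated numbers using one struct format per value.

    fmt is a struct format such as "<h" (little-endian 16-bit signed integer)
    or "<d" (little-endian double); pass parse=float for floating-point formats.
    """
    return b"".join(struct.pack(fmt, parse(token)) for token in text.split())

def binary_to_text(data, fmt="<h"):
    """Unpack a byte string of fixed-size values back into text."""
    return " ".join(str(value[0]) for value in struct.iter_unpack(fmt, data))

packed = text_to_binary("1 -2 300")
print(packed.hex())
print(binary_to_text(packed))
```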
ISCII Utilities consists of two programs for analyzing text files encoded according to the Indian Script Code for Information Interchange (ISCII), the Indian national standard. IsciiName identifies each code, printing the byte offset, the code in hex, and an explanation of the meaning of the code. ATR codes for writing system transition and display mode are interpreted. CountIsciiChars counts the codes in an ISCII file and classifies them according to their type and function. The original purpose was computing accurate letter counts for reading studies, but this information is also useful when processing ISCII-encoded text.
WAVE Utilities is a set of three programs for dealing with WAVE format audio files. Some software is unable to parse complex WAVE files containing such things as playlists and padding. SimplifyWave converts complex files into files that such programs can read by stripping everything other than the data chunk. RepairWave inserts the required data chunk id and size information into ill-formed files in which the audio data directly follows the header. InfoWave extracts information from a RIFF/WAV or RIFX/WAV file and reports on the contents of the file.
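To show what a report on the contents of such a file involves, here is a minimal Python sketch that walks the chunk list of a RIFF or RIFX container. It is an illustration, not InfoWave's code; the function name list_chunks is hypothetical. It relies on the standard container layout: a four-byte magic, a four-byte size, a four-byte form type, then chunks that each carry a four-byte id and a four-byte size and are padded to an even length. RIFX files store sizes big-endian.

```python
import struct

def list_chunks(data):
    """Return (form type, [(chunk id, size), ...]) for RIFF or RIFX data."""
    magic = data[0:4]
    if magic == b"RIFF":
        endian = "<"   # little-endian sizes
    elif magic == b"RIFX":
        endian = ">"   # big-endian sizes
    else:
        raise ValueError("not a RIFF/RIFX file")
    form = data[8:12].decode("ascii")
    chunks = []
    pos = 12
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4].decode("ascii", "replace")
        (size,) = struct.unpack(endian + "I", data[pos + 4:pos + 8])
        chunks.append((chunk_id, size))
        # Skip the 8-byte chunk header, the payload, and one pad byte if size is odd
        pos += 8 + size + (size & 1)
    return form, chunks
```

A file that lists chunks other than the format and data chunks (a LIST chunk, for example) is the kind of complex file that a simplifying pass would reduce.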
uni2ascii and ascii2uni provide conversion in both directions between UTF-8 Unicode and more than thirty 7-bit ASCII equivalents, including RFC 2396 URI format and RFC 2045 Quoted Printable format, the representations used in HTML, SGML, XML, OOXML, the Unicode standard, Rich Text Format, POSIX portable charmaps, POSIX locale specifications, and Apache log files. They can also convert between the escapes used for Unicode in languages such as Ada, C, Common Lisp, Java, Pascal, Perl, PostScript, Python, Scheme, and Tcl.
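As an illustration of one such ASCII equivalent, the following Python sketch converts non-ASCII characters to HTML hexadecimal character references and back. It is not taken from either program and covers only this single format; the function names are hypothetical.

```python
import re

def to_html_hex(text):
    """Replace every non-ASCII character with an HTML hex character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):04X};" for ch in text)

def from_html_hex(text):
    """Replace HTML hex character references with the characters they denote."""
    return re.sub(r"&#x([0-9A-Fa-f]+);",
                  lambda m: chr(int(m.group(1), 16)),
                  text)

encoded = to_html_hex("café")
print(encoded)
print(from_html_hex(encoded))
```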
Msort sorts files in sophisticated ways. Records may be fixed size, newline-separated blocks, or terminated by any specified character. Key fields may be selected by position, tag, or character range. For each key, distinct exclusions, multigraphs, substitutions, and a sort order may be defined or locale collation rules used. Comparisons may be lexicographic, numeric, numeric string, hybrid, random, by string length, angle, domain name, date, time, month name, or ISO 8601 timestamp. Keys may be reversed so as to generate reverse dictionaries. Optional keys are supported. Unicode is supported, including full case-folding. Msort itself has a somewhat complex command line interface, but may be driven by an optional GUI.
For each byte of its input, ByteName prints a line consisting of the byte offset, the byte in hex, octal, binary, and decimal, and its description in a selected single-byte encoding. A command line flag suppresses printing of lines corresponding to ASCII characters, which is useful for locating stray non-ASCII codes. It can also generate a chart for a specified encoding or, for a specified codepoint, generate descriptions in all known encodings.
Minpair consists of two programs, a C command-line program and a Tcl/Tk GUI, each of which can independently generate a complete list of minimal pairs (words differing in exactly one segment) for use in linguistic research. The GUI may also be used to control the faster CLI program. Both allow sequences of characters to be defined as single segments. Unicode is fully supported. It is also possible to obtain a list of pairs differing in exactly two positions for use in finding phonological rules.
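The definition of a minimal pair given above translates directly into code. The Python sketch below is a naive pairwise comparison written for illustration, not Minpair's algorithm; the function name is hypothetical. Each word is a sequence of segments, so a plain string treats every character as one segment, while a tuple such as ("ch", "a", "t") lets a multi-character sequence count as a single segment.

```python
from itertools import combinations

def minimal_pairs(words):
    """Return all pairs of words that differ in exactly one segment."""
    pairs = []
    for a, b in combinations(words, 2):
        # Only words with the same number of segments are compared
        if len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1:
            pairs.append((a, b))
    return pairs

print(minimal_pairs(["pat", "bat", "pit", "dog"]))
```

Comparing every pair is quadratic in the size of the word list, which is acceptable for a sketch but is one reason a compiled implementation is preferable for large lexicons.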
Xlit converts text from one writing system into another. It allows the user to define a transliteration simply by typing the input strings in one window and the strings to which they are to be mapped in another. Transliteration may be restricted to regions bounded by specified delimiters or their complements. Transliteration may also be performed by external commands or plugins. Xlit can also convert one type of delimiter to another, e.g. from HZ escapes to XML. Xlit can read and write transliteration definitions in its own format and as Yudit keymaps. It can be run in batch mode without the GUI.
Pause determines the location of silences in an audio file for use in fragmentation of large recordings, studies of pause duration, and the like. It generates both a nicely formatted table intended to be read by people and a simple tab-delimited file that is easily parsed by software.
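One simple way to locate silences is to look for sufficiently long runs of samples whose amplitude stays at or below a threshold. The Python sketch below takes that approach as an assumption; the description above does not say how Pause itself detects silence, and the function name and parameters are hypothetical.

```python
def find_silences(samples, rate, threshold, min_duration):
    """Return (start, end) times in seconds of runs where |sample| <= threshold.

    samples is a sequence of numeric amplitudes, rate is in samples per second,
    and runs shorter than min_duration seconds are ignored.
    """
    min_len = int(min_duration * rate)
    silences = []
    start = None  # index where the current quiet run began, if any
    for i, sample in enumerate(samples):
        if abs(sample) <= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                silences.append((start / rate, i / rate))
            start = None
    # A quiet run may extend to the end of the recording
    if start is not None and len(samples) - start >= min_len:
        silences.append((start / rate, len(samples) / rate))
    return silences

# Tab-delimited output, one silence per line
for begin, end in find_silences([5] * 10 + [0] * 20 + [5] * 10, 10, 1, 1.0):
    print(f"{begin:.3f}\t{end:.3f}\t{end - begin:.3f}")
```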
SndBite is a specialized audio editor designed for breaking large recordings into smaller components with great efficiency. Its principal intended application is in linguistic research where it is often desirable to put each word or sentence into a separate file before further processing. It is also useful for measuring pause durations. Its features include multiple simultaneous views of the waveform at different resolutions, the ability to position window edges at transitions between sound and silence, automated setting of cut points at zero-crossings, automatic filename generation easily controlled by the user, and optional automatic playback on window motion. It is scriptable and may be run in batch mode without the GUI.
ColorExplorer is a tool for exploring the color space and finding out how colors, color names, and numerical color specifications are related. The user can specify a color by selecting its name from a list of color names, by adjusting sliders that control the mixture of red, green, and blue, by entering a numerical color specification, by copying it from the history list or elsewhere on the display, or by requesting a random color. The numerical specification of the current color and an example of that color are shown in a pair of adjacent boxes. The color name list may be searched by entering a regular expression or by requesting the closest match to the current color.
WordGenerator generates hypothetical words from specifications of their syllable structure. The user specifies the maximum length of the words in syllables, the abstract structure of syllables in the language (in terms of such units as consonants and vowels or onsets and rhymes), and the actual sounds that comprise each abstract class (e.g. the list of vowels in the language); WordGenerator then generates the words that conform to this specification. Such lists are useful to field linguists exploring the vocabulary of a language, and to designers of artificial languages.
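The generation procedure can be sketched in Python as two nested Cartesian products: one expanding each abstract syllable pattern into concrete syllables, and one concatenating syllables into words. This is an illustration under assumed input conventions, not WordGenerator's specification format; the function name, the pattern strings, and the class dictionary are hypothetical.

```python
from itertools import product

def generate_words(patterns, classes, max_syllables):
    """Generate all words of up to max_syllables syllables.

    patterns is a list of syllable shapes such as "CV" or "CVC";
    classes maps each abstract symbol to the sounds that realize it.
    """
    syllables = []
    for pattern in patterns:
        for combo in product(*(classes[symbol] for symbol in pattern)):
            syllables.append("".join(combo))
    words = []
    for n in range(1, max_syllables + 1):
        for combo in product(syllables, repeat=n):
            words.append("".join(combo))
    # Different syllabifications can spell the same word; keep the first occurrence
    return list(dict.fromkeys(words))

words = generate_words(["CV"], {"C": ["t", "k"], "V": ["a", "i"]}, 2)
print(len(words), words[:4])
```

The number of candidates grows exponentially with the maximum word length, so realistic inventories call for a small syllable limit.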
Tamil Converters is a collection of programs for converting among a variety of encodings and transliterations of Tamil, including Unicode, ISCII, TSCII, ITRANS, the International Phonetic Alphabet, the Koln, Penn, and Colloquial Tamil romanizations, ISO 15919 transliteration, and Unicode character names enclosed in angle brackets (as in POSIX locale source files).
CharEntry is a tool for inserting non-ASCII characters into text, with particular emphasis on linguistic notation. It provides charts of the consonants, vowels, and diacritics of the International Phonetic Alphabet as well as a chart of precomposed accented characters. Clicking on a character inserts it into a text region, the contents of which may be saved to a file or copied and pasted elsewhere. A widget for inserting characters by Unicode codepoint is also provided. Furthermore, it is possible to read the definition of a custom character chart from a file.
libuninum is a library for converting Unicode strings to integers and integers to Unicode strings. Internal computation is done using arbitrary precision arithmetic, so there is no limit on the size of the integer that can be converted. Values are passed and returned as ASCII decimal strings, GNU MP mpz_t objects, or unsigned long integers. Auto-detection of the number system is provided. A wide range of number systems is supported. Group delimitation for output strings is fully controllable. Command line and graphical interfaces are also provided.
AudioSpace calculates the amount of storage required by an audio recording of a given duration, for different sampling rates, resolutions, and numbers of channels. The calculation may be made for uncompressed audio data or for several types of compression. A variety of units may be selected for reporting the result. The calculation may also be inverted to determine the maximum duration of audio that will fit into the available storage.
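For uncompressed audio the calculation is a straightforward product, and inverting it gives the maximum duration for a given capacity. The Python sketch below covers only the uncompressed case; the function names are hypothetical and nothing here is taken from AudioSpace itself.

```python
def storage_bytes(seconds, sample_rate, bits_per_sample, channels):
    """Bytes needed for uncompressed PCM audio of the given duration."""
    return seconds * sample_rate * bits_per_sample * channels / 8

def max_duration(capacity_bytes, sample_rate, bits_per_sample, channels):
    """Seconds of uncompressed PCM audio that fit into capacity_bytes."""
    return capacity_bytes * 8 / (sample_rate * bits_per_sample * channels)

# One minute of 44.1 kHz, 16-bit stereo audio
size = storage_bytes(60, 44100, 16, 2)
print(size, "bytes")
print(max_duration(size, 44100, 16, 2), "seconds")
```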
UnicodeDataBrowser is a browser for the UnicodeData.txt file, which contains much useful information but is not easily read by humans. It creates a scrollable table in which columns represent properties. The table may be sorted on any column. Abbreviations are expanded and characters cross-referenced in decomposition and casing fields are named. Regular expression search restricted to a selected column is available. The set of characters for which information is displayed may be restricted to those characters matching a regular expression on a specified property.
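Each line of UnicodeData.txt holds fifteen semicolon-separated fields, which is what makes a column-oriented table a natural presentation. The Python sketch below parses such lines and performs a regular expression search restricted to one column. It is an illustration, not the browser's code; the short field names and the function names are labels chosen for this example.

```python
import re

# Labels for the fifteen fields, in file order (names chosen for this sketch)
FIELDS = ["code", "name", "category", "combining", "bidi", "decomposition",
          "decimal", "digit", "numeric", "mirrored", "old_name", "comment",
          "upper", "lower", "title"]

def parse(lines):
    """Turn UnicodeData.txt lines into a list of dicts keyed by field label."""
    return [dict(zip(FIELDS, line.rstrip("\n").split(";")))
            for line in lines if line.strip()]

def search(rows, column, pattern):
    """Return the rows whose value in the given column matches the pattern."""
    regex = re.compile(pattern)
    return [row for row in rows if regex.search(row[column])]

sample = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
]
rows = parse(sample)
for row in search(rows, "category", "^Lu$"):
    print(row["code"], row["name"], "lowercase:", row["lower"])
```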