seltz_analyzer is a PHP class that tries to find the most important words inside a well-formed XHTML chunk. Each word receives a score based on its role in the XHTML structure. For example, a word between strong tags receives 5 points. The class also applies some simple syntax rules. For example, a word whose first character is uppercase receives 4 points. The score is cumulative, so the more often a word is used, the higher its score and the more important it is considered.
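The scoring idea described above can be sketched in a few lines. This is a minimal Python illustration, not the PHP class itself: only the 5 points for strong tags and the 4 points for a leading uppercase character come from the description, while the base score of 1 per occurrence, the case-insensitive counting, and the names WordScorer and score_words are assumptions made for the example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

STRONG_POINTS = 5       # from the description: word between strong tags
CAPITALIZED_POINTS = 4  # from the description: first character uppercase
BASE_POINTS = 1         # assumption: every occurrence counts for something


class WordScorer(HTMLParser):
    """Accumulates a score per word while walking the markup."""

    def __init__(self):
        super().__init__()
        self.scores = Counter()
        self.strong_depth = 0  # > 0 while inside one or more <strong> tags

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.strong_depth += 1

    def handle_endtag(self, tag):
        if tag == "strong" and self.strong_depth > 0:
            self.strong_depth -= 1

    def handle_data(self, data):
        # Runs of letters are treated as words (assumption).
        for word in re.findall(r"[^\W\d_]+", data):
            points = BASE_POINTS
            if self.strong_depth > 0:
                points += STRONG_POINTS
            if word[0].isupper():
                points += CAPITALIZED_POINTS
            # Scores are cumulative; case is folded so "Word" and "word" share a total.
            self.scores[word.lower()] += points


def score_words(xhtml):
    """Return a Counter mapping each lowercased word to its cumulative score."""
    parser = WordScorer()
    parser.feed(xhtml)
    parser.close()
    return parser.scores


scores = score_words("<p>Python is <strong>fast</strong> and python is fun</p>")
```

Calling scores.most_common() on the result lists the words from highest to lowest score, which is the ranking such an analyzer would report.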
ssct is a command-line utility, humble of intent, that takes a single word, spell checks it, and then translates the resulting suggestion(s). It translates only to or from English. The available languages are limited by ispell for the spell-checking step and by the IDP (Internet Dictionary Project) files for the translation step. Currently the latter include Spanish, Portuguese (minimal), Latin, German, French and Italian. These files are included with this package. This utility was originally created to make it easier to decode badly-scrawled postcards from Spain.
transtoba2 facilitates the transliteration or transcription of a word or text from the Roman script into the Toba Batak script. Transliterating from the Roman into the Batak script is not an easy undertaking, as the Batak script has a number of peculiarities that complicate the process. This program uses a set of algorithms that enable the user to transliterate effortlessly from the Roman to the Toba Batak script.
Tspell is a library and a set of applications for solving computational problems related to Turkish Natural Language Processing (NLP). Turkish, by nature, has a morphological and grammatical structure very different from that of Indo-European languages such as English. Since it is an agglutinative language, like Finnish, even making a simple spell checker is very challenging. Some target problems are: a spell checker, a word analyzer that determines roots and suffixes, a word constructor based on suffixes, and much more.
uni2ascii and ascii2uni provide conversion in both directions between UTF-8 Unicode and more than thirty 7-bit ASCII equivalents, including RFC 2396 URI format and RFC 2045 Quoted Printable format, the representations used in HTML, SGML, XML, OOXML, the Unicode standard, Rich Text Format, POSIX portable charmaps, POSIX locale specifications, and Apache log files. They can also convert between the escapes used for Unicode in languages such as Ada, C, Common Lisp, Java, Pascal, Perl, PostScript, Python, Scheme, and Tcl.
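To make the idea of a 7-bit ASCII equivalent concrete, here is a small Python sketch of two of the representations mentioned above: HTML hexadecimal character references and URI percent-encoding. It is an independent illustration built on the Python standard library, not the implementation or command-line interface of uni2ascii and ascii2uni; the function names are hypothetical.

```python
import re
from urllib.parse import quote, unquote


def to_html_hex(text):
    """Replace each non-ASCII character with an HTML hexadecimal reference."""
    return "".join(c if ord(c) < 128 else f"&#x{ord(c):04X};" for c in text)


def from_html_hex(text):
    """Turn HTML hexadecimal references back into the characters they name."""
    return re.sub(r"&#x([0-9A-Fa-f]+);",
                  lambda m: chr(int(m.group(1), 16)), text)


def to_uri(text):
    """Percent-encode the UTF-8 bytes of every reserved or non-ASCII character."""
    return quote(text, safe="")


def from_uri(text):
    """Decode percent-escapes, interpreting the bytes as UTF-8."""
    return unquote(text)


html_form = to_html_hex("café")  # "caf&#x00E9;"
uri_form = to_uri("café")        # "caf%C3%A9"
```

Both directions round-trip: from_html_hex(html_form) and from_uri(uri_form) each give back the original string. Note that the HTML form encodes the code point (U+00E9), while the URI form encodes the two UTF-8 bytes of the same character.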