seltz_analyzer is a PHP class that tries to find the most important words inside a well-formed XHTML trunk. Every word takes a score based on the role in the XHTML structure. For example, a word between strong tags will take 5 points. In addition, it will look at some simple syntax rules. For example a word with the first character uppercase will take 4 points. The score is cumulative, so the more a word is used, the more meaning it will have.
Cypher is an AI program that generates the RDF graph and SPARQL query representations of plain language input, allowing users to speak plain language to update and query databases. With robust definition languages, Cypher's grammar and lexicon can quickly and easily be extended to process highly complex sentences and phrases of any natural language, and can cover any vocabulary. Equipped with Cypher, programmers can begin building next generation semantic Web applications that harness natural language.
PottyMouth transforms completely unstructured and untrusted text to valid, nice-looking, completely safe XHTML. PottyMouth is designed to handle input text from non-technical, potentially careless, or malicious users. It produces HTML that is completely safe, programmatically and visually, to include on any Web page. You don't need to make your users read any instructions before they start typing. They don't even need to know that PottyMouth is being used.
Sikher is a desktop program designed to archive, search, and display the Sikh scriptures using advanced functions. It allows the common person to understand and read the messages contained in the Sikh scriptures through translations and transliterations in different languages, thereby breaking the language and geographical barrier between Gurbani (Sikh Scriptures) and the world. Sikher is a robust, future proof, and cross-platform application which may be used by developers to create similar internationalized and localized search applications.
Uplug is a collection of tools for linguistic corpus processing, word alignment, and term extraction from parallel corpora. Several tools have been integrated in Uplug. Pre-processing tools include a sentence splitter, tokenizer, and external part-of-speech tagger and shallow parsers. The following external tools are used: the Grok system for English (tagging and chunking) and the morphological analyzer ChaSen for Japanese. Other tools such as the TreeTagger can easily be added. Translated documents can be sentence aligned using the length-based approach by Gale & Church. Words and phrases can be aligned using the clue alignment approach and the toolbox for training statistical alignment models GIZA++.
Xlit converts text from one writing system into another. It allows the user to define a transliteration simply by typing the input strings in one window and the strings to which they are to be mapped in another. Transliteration may be restricted to regions bounded by specified delimiters or their complements. Transliteration may also be performed by external commands or plugins. Xlit can also convert one type of delimiter to another, e.g. from HZ escapes to XML. Xlit can read and write transliteration definitions in its own format and as Yudit keymaps. It can be run in batch mode without the GUI.
uni2ascii and ascii2uni provide conversion in both directions between UTF-8 Unicode and more than thirty 7-bit ASCII equivalents, including RFC 2396 URI format and RFC 2045 Quoted Printable format, the representations used in HTML, SGML, XML, OOXML, the Unicode standard, Rich Text Format, POSIX portable charmaps, POSIX locale specifications, and Apache log files. It can also convert between the escapes used for Unicode in languages such as Ada, C, Common Lisp, Java, Pascal, Perl, Postscript, Python, Scheme, and Tcl.