Computational Linguistics Toolset

The Computational Linguistics Toolset is a collection of reusable tools for computational linguistics. It contains code for cleaning, splitting, refining, and sampling corpora (ICE, Penn, and a native one), for tagging them with the TnT tagger, for running permutation statistics on n-grams (useful for finding statistically significant syntactic differences between any two sets of tagged texts), and for examining the results. The tools themselves are well documented.
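To illustrate the idea behind permutation statistics on n-grams, here is a minimal Python sketch. It is not taken from the toolset: the function names, the choice of relative frequency as the test statistic, and the input format (each text as a list of POS tags) are all assumptions made for the example. It pools two sets of tagged texts, reshuffles them repeatedly, and checks how often a random split shows a frequency difference for one n-gram at least as large as the observed one.

```python
import random
from collections import Counter


def ngrams(tags, n):
    """Return all n-grams (as tuples) in one sequence of POS tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def relative_frequency(texts, target, n):
    """Share of all n-grams in `texts` that equal `target` (0.0 if there are none)."""
    counts = Counter()
    for tags in texts:
        counts.update(ngrams(tags, n))
    total = sum(counts.values())
    return counts[target] / total if total else 0.0


def permutation_p_value(group_a, group_b, target, n=2, rounds=1000, seed=0):
    """Estimate a two-sided p-value for the frequency difference of `target`.

    Hypothetical helper: texts are pooled and randomly re-split into two
    groups of the original sizes; the p-value is the share of re-splits
    whose absolute difference is at least as large as the observed one.
    """
    rng = random.Random(seed)

    def difference(a, b):
        return abs(relative_frequency(a, target, n) - relative_frequency(b, target, n))

    observed = difference(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        if difference(pooled[:size_a], pooled[size_a:]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (rounds + 1)


if __name__ == "__main__":
    corpus_a = [["DT", "NN", "VB"]] * 20
    corpus_b = [["NN", "VB", "DT"]] * 20
    print(permutation_p_value(corpus_a, corpus_b, ("DT", "NN")))
```

A small p-value suggests that the n-gram is used at genuinely different rates in the two sets of texts. A real analysis would repeat this for every n-gram of interest and account for the number of comparisons made.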

Recent releases

  •  22 Apr 2007 07:53

Release Notes: A CorpusTagsetReducer tool was added to the corpus task-set for filtering out tags and tag-types. The RowChecker, TableScaler, and TableTurner tools were added to the examine-set for checking the alignment of tags and words and for manipulating tab-delimited output tables. Several smaller fixes and additions were applied.

  •  30 Nov 2006 05:45

Release Notes: Compression was made the default for NgramPermutator and PermutationStatter, and it was removed as an option. A bug in the compression of NgramPermutator, which had prevented data from being created since version 1.1.2, was fixed.

  •  10 Oct 2006 08:04

Release Notes: Full support for the manual n-gram search function (-n option) was added to Tag Sample Finder.

  •  22 May 2006 08:08

Release Notes: PermStatResultSelector was added, which is a tool to select and sort significant POS-tag n-grams by weight for each compared sub-corpus. The Goall-script was restructured for the permutation testing, and a few minor bugs were fixed.

  •  13 Dec 2005 10:22

Release Notes: Disambiguation tools were added. They allow semantic disambiguation to be done about ten times faster than was previously possible with the WordNet::Similarity package alone. Two extra corpus tools for preparing the ICE corpus for disambiguation were added.
