2 projects tagged "Multilingual Corpora Harvesting"

Bitextor (updated 04 May 2009)

Bitextor is an application that generates translation memories using multilingual Web sites as a corpus source. It downloads all the HTML files in a Web site, preprocesses them into a consistent format, and then applies a set of heuristics (based mainly on HTML tag structure and text block length) to pair files that are likely to contain the same text in different languages. From these candidate pairs, it generates translation memories in TMX format using the LibTagAligner library, which aligns text using the HTML tags and the length of text chunks.
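
The pairing heuristic described above can be illustrated with a minimal sketch. This is not Bitextor's actual code: the names `Fingerprint`, `fingerprint`, and `is_candidate_pair`, and both thresholds, are assumptions made for illustration. The sketch treats two pages as a candidate pair when their opening-tag sequences are nearly identical and their visible text lengths are comparable.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class Fingerprint(HTMLParser):
    """Collects the opening-tag sequence and total text length of a page."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_len = 0

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # Count only non-whitespace-trimmed text so indentation is ignored.
        self.text_len += len(data.strip())


def fingerprint(html):
    """Return (list of opening tags, total text length) for an HTML string."""
    parser = Fingerprint()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.text_len


def is_candidate_pair(html_a, html_b, min_tag_sim=0.9, min_len_ratio=0.5):
    """Hypothetical heuristic: similar tag structure and comparable text length.

    Both thresholds are illustrative guesses, not values taken from Bitextor.
    """
    tags_a, len_a = fingerprint(html_a)
    tags_b, len_b = fingerprint(html_b)
    tag_sim = SequenceMatcher(None, tags_a, tags_b).ratio()
    longest = max(len_a, len_b)
    len_ratio = min(len_a, len_b) / longest if longest else 1.0
    return tag_sim >= min_tag_sim and len_ratio >= min_len_ratio
```

Comparing text length as a ratio rather than an absolute difference allows for the fact that translations of the same text routinely differ in length; a real system would likely tune this per language pair.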

TagAligner (updated 04 May 2009)

TagAligner is an application that generates translation memories from two XHTML-tagged files. It uses XHTML tag structure and text block length to calculate the most probable alignment between the two files. TagAligner then applies a set of user-defined rules to split each text block into phrases and generates a TMX file representing the translation memory obtained from the original files. TagAligner is available both as a standalone application and as a library for use by other applications.
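
The last two steps (splitting aligned text blocks into phrases and writing a TMX file) can be sketched as follows. This is a minimal illustration, not TagAligner's implementation: the splitting rule `SENTENCE_END`, the functions `split_block` and `build_tmx`, the fallback of keeping a block whole when the phrase counts differ, and the header attribute values are all assumptions. The input is assumed to be text blocks that have already been aligned.

```python
import re
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serializes as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Stand-in for a user-defined splitting rule: break after ., ! or ?
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_block(block, rule=SENTENCE_END):
    """Split one text block into phrases using the given rule."""
    return [s for s in rule.split(block.strip()) if s]


def build_tmx(block_pairs, src_lang, tgt_lang):
    """Build a TMX 1.4 document (as a string) from aligned text-block pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        # Placeholder header values chosen for this sketch.
        "creationtool": "sketch", "creationtoolversion": "0.1",
        "segtype": "sentence", "o-tmf": "none", "adminlang": "en",
        "srclang": src_lang, "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src_block, tgt_block in block_pairs:
        src_segs, tgt_segs = split_block(src_block), split_block(tgt_block)
        # Assumed fallback: if phrase counts differ, keep the block whole.
        if len(src_segs) != len(tgt_segs):
            src_segs, tgt_segs = [src_block], [tgt_block]
        for src, tgt in zip(src_segs, tgt_segs):
            tu = ET.SubElement(body, "tu")
            for lang, text in ((src_lang, src), (tgt_lang, tgt)):
                tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
                ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

For example, `build_tmx([("Hello. How are you?", "Hola. ¿Cómo estás?")], "en", "es")` yields a document with two translation units, one per phrase pair.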
