Bitextor is an application that generates translation memories using multilingual Web sites as a corpus source. It downloads all the HTML files in a Web site, preprocesses them into a coherent and suitable format and, finally, applies a set of heuristics (based mainly on HTML tag structure and text block length) to pair files that are likely to contain the same text in different languages. From these candidate pairs, translation memories are generated in TMX format using the LibTagAligner library, which uses the HTML tags and the length of text chunks to perform the alignment.
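To make the pairing heuristics more concrete, here is a minimal sketch of how two pages could be compared by tag structure and text length. It is not Bitextor's actual implementation: the names `Fingerprint`, `fingerprint` and `is_candidate_pair`, and the two thresholds, are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class Fingerprint(HTMLParser):
    """Collects the sequence of opening tags and the total text length."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_length = 0

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text_length += len(data.strip())


def fingerprint(html):
    """Return (list of opening tags, number of text characters) for a page."""
    parser = Fingerprint()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.text_length


def is_candidate_pair(html_a, html_b, min_tag_similarity=0.9, max_length_ratio=1.5):
    """Hypothetical heuristic: similar tag sequences and comparable text length.

    Both thresholds are illustrative assumptions, not values used by Bitextor.
    """
    tags_a, len_a = fingerprint(html_a)
    tags_b, len_b = fingerprint(html_b)
    tag_similarity = SequenceMatcher(None, tags_a, tags_b).ratio()
    shorter, longer = sorted((len_a, len_b))
    length_ratio = longer / shorter if shorter else float("inf")
    return tag_similarity >= min_tag_similarity and length_ratio <= max_length_ratio
```

Under these assumptions, an English page and its Spanish translation that share the same layout would be accepted as a candidate pair, while two pages with different markup would be rejected even if their text lengths were similar.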
TagAligner is an application that generates translation memories from two XHTML-tagged files. It uses the XHTML tag structure and text block length to calculate the most probable alignment between the two files. Once it has done so, TagAligner uses a set of user-defined rules to split every text block into phrases, and then generates a TMX file that represents the translation memory obtained from the original files. You can download TagAligner as an application or as a library to be used by other applications.
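The last two steps (splitting aligned text blocks into phrases and writing a TMX file) can be sketched as follows. This is only an illustration and not TagAligner's code or API: it assumes the text blocks have already been aligned, uses a single regular expression as a stand-in for the user-defined rules, and falls back to one unit per block when the phrase counts differ. The names `split_phrases` and `build_tmx`, and the header values, are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree serializes this namespaced attribute as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def split_phrases(block, rule=r"(?<=[.!?])\s+"):
    """Split a text block into phrases; `rule` stands in for a user-defined rule."""
    return [phrase for phrase in re.split(rule, block.strip()) if phrase]


def build_tmx(aligned_blocks, src_lang, tgt_lang):
    """Build a TMX document (as a string) from already aligned text blocks."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "phrase",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src_block, tgt_block in aligned_blocks:
        src_phrases = split_phrases(src_block)
        tgt_phrases = split_phrases(tgt_block)
        # Assumption: if the phrase counts differ, keep the whole block as one unit.
        if len(src_phrases) != len(tgt_phrases):
            src_phrases, tgt_phrases = [src_block.strip()], [tgt_block.strip()]
        for src, tgt in zip(src_phrases, tgt_phrases):
            tu = ET.SubElement(body, "tu")
            for lang, text in ((src_lang, src), (tgt_lang, tgt)):
                tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
                ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

For example, `build_tmx([("Hello. How are you?", "Hola. ¿Cómo estás?")], "en", "es")` would return a TMX string with two translation units, one per phrase.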