Forum: CAT Tools Technical Help
Topic: Re-segmentation of TMX files: is there a tool for this?
Poster: Samuel Murray
Post title: I can't help but I can comment
[quote]Hans Lenting wrote:
Is there a tool to re-segment TMX files *without further conversion* in such a way that every sentence (and its translation) is placed in its own TU? [/quote]
1. No. I have had to do this myself, and I could only accomplish it with conversions and with loss of information (i.e. the resulting TM can only be used as a reference TM, not as an active TM). What I did in the past was to convert the TMX file to WFC TM format, then convert that to PO format using a hack script of mine, then use "posegment" (from TranslateHouse) to split it into sentences, and then use "po2tmx" (also from TranslateHouse) to convert it back to TMX (and if I wanted WFC TM format, I'd convert the TMX to WFC TM as well). The posegment program is not interactive, so when it encounters a paragraph segment with a different number of sentences in the source and target fields, it simply creates a sentence segment for each source text sentence, containing the entire paragraph's text as the target text. This allows a CAT tool to perform fuzzy matching on the sentences, but requires the user to check the translations manually (i.e. don't use pre-translation with such a TM). Personally, my solution to that would have been to simply add a single dummy character to each source text in such cases of dissimilarity, to ensure that it never gives a 100% match to anything.
2. Apparently, you can do the conversion using OmegaT, by tricking OmegaT into thinking that you've loaded an old project from the days when OmegaT could not do sentence segmentation. This might be an option for you, since there is no "conversion" to other formats -- OmegaT takes your TMX file and rewrites it. The instructions are given by Didier [url= [url removed] ]here[/url], but I have just tried it again and can't get it to work (perhaps you can).
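
To make the fallback from point 1 concrete, here is a rough Python sketch of the pairing logic as I described it. It is not posegment's actual code: the regex splitter is a naive stand-in for real segmentation rules, and the names `split_sentences`, `resegment`, `mark_mismatch` and the `DUMMY` character are all made up for illustration.

```python
import re

# Naive splitter (assumption): break after ., ! or ? followed by whitespace.
# Real segmentation rules are language-aware and handle abbreviations etc.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

# Hypothetical marker appended to mismatched sources so that they can
# never produce a 100% match in a CAT tool.
DUMMY = "\u00a4"


def split_sentences(text):
    """Split a paragraph into sentences, dropping empty pieces."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def resegment(source, target, mark_mismatch=True):
    """Return (source, target) sentence pairs for one paragraph segment.

    Same sentence count on both sides: pair the sentences one to one.
    Different counts: emit one pair per source sentence, each carrying
    the whole target paragraph, optionally marked with DUMMY.
    """
    src = split_sentences(source)
    tgt = split_sentences(target)
    if len(src) == len(tgt):
        return list(zip(src, tgt))
    suffix = DUMMY if mark_mismatch else ""
    return [(s + suffix, target.strip()) for s in src]


if __name__ == "__main__":
    # Matching counts: clean one-to-one pairs.
    print(resegment("One. Two.", "Een. Twee."))
    # Mismatched counts: every source sentence gets the full target.
    print(resegment("One. Two.", "Een en twee."))
```

With `mark_mismatch=False` you get the behaviour I described for posegment (fuzzy matches are possible, but so are misleading 100% matches); with the marker switched on, the mismatched units can only ever come up as fuzzy matches.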
[Edited at 2017-05-25 09:34 GMT]