Channel: ProZ.com Translation Forums

TMLookup | Well...

Forum: CAT Tools Technical Help
Topic: TMLookup
Poster: FarkasAndras
Post title: Well...

[quote]Michael Joseph Wdowiak Beijer wrote:

[quote]FarkasAndras wrote:

[quote]Michael Joseph Wdowiak Beijer wrote:

The TMX was created by Déjà Vu X3. Just sent it to you.

Thanks for the new version!

Michael [/quote]

As expected, this is due to creative tag formatting. The TMX has the tag split across two lines and TMLookup expects it to be on one line. Again, like a previous issue, this is because TMLookup doesn't have a proper XML parser, since I can't be bothered to learn how to implement one. So instead of wrangling with a horrible coding problem I'm left wrangling with somewhat less horrible troubleshooting problems every now and then. So it goes. I could just look for xml:lang= without the tuv, but in principle some TMX files could have other elements where the language is specified with xml:lang=, not just the text itself, and then it could break on those. This can be solved in multiple ways of course, but none of them are trivial or appetizing to me. Implementing a proper parser is the least appetizing of all. So maybe I'll fix this... maybe not. Adding the language codes to the filename should work.

In the meantime, the new version of SQLite that will allow for somewhat fancier/faster text searches is trickling down the pipeline. It went through two stages and it is now one step away from where I can start fiddling with it. We'll see.
[/quote]

Indeed, I just had another look at it, and they sure got creative with the line breaks.

Until you fix it I'll just add the language codes to the filename, which seems to work fine. I might also mention to Atril support that their TMXs are a little weird. [/quote]
To be fair, their TMX is perfectly valid. Perhaps a little unusual, but fine. It's my "parsing" that is not up to scratch.
BTW, if you have a lot of TMX files to import, you could fix them by removing the line breaks after <tuv. sed and other tools can do mass replacements on multiple files.
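
For anyone without sed at hand, the same line-break fix could be sketched in Python roughly as follows. This is only a sketch under assumptions: the names `join_split_tuv_tags` and `fix_file` are made up for illustration, the break is assumed to come directly after `<tuv` as described above, and the files are assumed to be UTF-8 unless another encoding is passed in. It overwrites files in place, so try it on copies first.

```python
import re
from pathlib import Path

# Any run of whitespace (line breaks included) directly after "<tuv".
_TUV_BREAK = re.compile(r"<tuv\s+")

def join_split_tuv_tags(text):
    """Put '<tuv' and its first attribute back on one line."""
    return _TUV_BREAK.sub("<tuv ", text)

def fix_file(path, encoding="utf-8"):
    """Rewrite one TMX file in place. The encoding is an assumption;
    pass e.g. "utf-16" if that is what your files use."""
    path = Path(path)
    # Work on bytes so the remaining line endings are left untouched.
    text = path.read_bytes().decode(encoding)
    path.write_bytes(join_split_tuv_tags(text).encode(encoding))

# Hypothetical usage on a folder of copies:
# for tmx in Path("tmx_copies").glob("*.tmx"):
#     fix_file(tmx)
```

Tags that are already on one line pass through unchanged, so running it twice does no harm.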

I don't want the codes in filenames to override the language codes read from inside the file... the latter tend to be more reliable, I would think.
But in your specific example "nl-BE" should be read by TMLookup as nl and "en-US" as en... it chops off the bit after the hyphen.
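
To make the two ideas concrete (language codes read from the tuv elements, and chopping off the bit after the hyphen), here is a minimal Python sketch. It is not how TMLookup does it: it uses Python's standard `xml.etree.ElementTree` parser, which doesn't care about line breaks inside a tag, and the names `base_language` and `tuv_languages` are invented for this example. It also assumes the language sits in `xml:lang` on each `<tuv>`.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def base_language(code):
    """Chop off everything after the first hyphen: 'nl-BE' -> 'nl'."""
    return code.split("-", 1)[0]

def tuv_languages(tmx_text):
    """Return the base language codes found on <tuv> elements,
    in order of first appearance."""
    root = ET.fromstring(tmx_text)
    found = []
    for tuv in root.iter("tuv"):
        code = tuv.get(XML_LANG)
        if code and base_language(code) not in found:
            found.append(base_language(code))
    return found

# A tiny made-up TMX fragment with the tuv tags split across lines.
sample = (
    '<tmx version="1.4"><body><tu>'
    '<tuv\n  xml:lang="nl-BE"><seg>Hallo</seg></tuv>'
    '<tuv\n  xml:lang="en-US"><seg>Hello</seg></tuv>'
    '</tu></body></tmx>'
)
print(tuv_languages(sample))  # ['nl', 'en']
```

Because only `<tuv>` elements are inspected, an `xml:lang` on some other element would not be picked up by mistake.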

