Channel: ProZ.com Translation Forums

New free & open source aligner (for Windows, OS X and linux) | A "dumb aligner" would be great!

Forum: CAT Tools Technical Help
Topic: New free & open source aligner (for Windows, OS X and linux)
Poster: Michael Beijer
Post title: A "dumb aligner" would be great!

[quote]FarkasAndras wrote:

[quote]Michael Beijer wrote:

There really should be an easier way to do this though, seeing as how all of these files are effectively already aligned. All I need is for a program to: take the first line of text file de1.txt and match it to the first line of text file en1.txt, and turn this into a TU. Then, it needs to take the second line of text file de1.txt and match it with the second line of text file en1.txt, and turn it into a TU. Then repeat that a few times.

PS: this is what I'm currently working on: [url removed] [/quote]

I thought you already did this with the public patent TM? In any case, some dumb aligners do this. I also have my own software for this because I store my large multilingual TMs in a similar format (separate txt files for each document in each language, one line per segment).
Maybe one day I will add such a dumb pairing feature to lf aligner, or release a separate program that merges files into tabbed files. [/quote]

I did, but there were far fewer files, and they were much bigger. Now I have hundreds of small txt files, so my approach is different.
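For the many-small-files case, the "dumb" pairing described in the quote could look roughly like the Python sketch below. It is only an illustration, not how LF Aligner or any existing tool does it. The function names, the de/en file-name prefixes and the output file name are all assumptions made for the example.

```python
from itertools import zip_longest
from pathlib import Path


def pair_lines(src_path, tgt_path):
    """Yield (source, target) pairs: line N of one file with line N of the other."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for n, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                # Different line counts mean the files are not really pre-aligned.
                raise ValueError(f"{src_path} and {tgt_path} differ in length at line {n}")
            yield s.rstrip("\r\n"), t.rstrip("\r\n")


def merge_folder(folder, out_path, src_prefix="de", tgt_prefix="en"):
    """Pair every de*.txt with its en*.txt twin and write one tab-delimited file.

    The prefixes are hypothetical; adjust them to your own naming scheme.
    Returns the number of segment pairs written.
    """
    folder = Path(folder)
    count = 0
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        for src_file in sorted(folder.glob(f"{src_prefix}*.txt")):
            # de1.txt -> en1.txt, de2.txt -> en2.txt, ...
            tgt_file = folder / (tgt_prefix + src_file.name[len(src_prefix):])
            for s, t in pair_lines(src_file, tgt_file):
                # A tab inside a segment would break the two-column layout.
                out.write(s.replace("\t", " ") + "\t" + t.replace("\t", " ") + "\n")
                count += 1
    return count
```

Called as merge_folder("my_folder", "pairs.tsv"), this stops with an error as soon as two files have different line counts, which is probably what you want: a silent mismatch would shift every later TU.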

This is how I did the PatT data:

Original workflow:

1. Append ".txt" to file names
2. Open the files in EmEditor (or another text editor capable of opening large files; UltraEdit is also good)
3. In Ron's CSV Editor, create an empty file and paste in the contents of the .txt files (src + trgt language) to create a tab-delimited .csv
4. In Xbench, convert the aforementioned .csv to .tmx
5. In Heartsome TMX editor, edit the TMX custom attributes and clean up the TMX (remove duplicates)
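For anyone who would rather script the last two steps, here is a rough Python sketch that turns a two-column tab-delimited file into a minimal TMX and drops exact duplicate pairs along the way. It is not what Xbench or Heartsome do internally. The function name, the header values (creationtool, segtype and so on) and the default language codes are placeholders chosen for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this namespaced attribute out as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tsv_to_tmx(tsv_path, tmx_path, src_lang="de", tgt_lang="en"):
    """Convert 'source<TAB>target' lines to TMX; return the number of TUs written."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "tsv_to_tmx-sketch",  # placeholder values
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "tab-delimited",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    seen = set()
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\r\n").split("\t")
            if len(parts) != 2 or not all(parts):
                continue  # skip malformed or half-empty rows
            pair = tuple(parts)
            if pair in seen:
                continue  # exact duplicate of an earlier TU
            seen.add(pair)
            tu = ET.SubElement(body, "tu")
            for lang, text in zip((src_lang, tgt_lang), pair):
                tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
                ET.SubElement(tuv, "seg").text = text
    ET.ElementTree(tmx).write(tmx_path, encoding="utf-8", xml_declaration=True)
    return len(seen)
```

This builds the whole TMX in memory, so for files with millions of lines you would want to split the input first or write the TUs out incrementally.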

Improved workflow:

1. Append ".txt" to file names
2. Use the "split" command in cmd.exe to split the large text file into smaller files by line count (1,000,000 lines each): split -l 1000000 filename.txt (split is a Unix utility, so on Windows it has to be installed separately, e.g. via GnuWin32 or Cygwin)
3. Use "generate_tabbed.exe" (in András Farkas’s "Grab Bag", included in the LF Aligner download) to convert the src and trgt language .txt files into a tab-delimited .txt containing both src + trgt
4. Use Heartsome TMX editor to convert bilingual tab-del .txt files into .tmx
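If "split" isn't available on your machine, step 2 can be approximated with a few lines of Python. This is just a sketch: the function name and the ".partNNNN" naming are made up for the example, and differ from the xaa, xab, ... names that split itself produces.

```python
from itertools import islice


def split_by_lines(path, lines_per_file=1_000_000):
    """Split a text file into numbered parts; return the list of files created."""
    created = []
    # newline="" keeps the original line endings untouched.
    with open(path, encoding="utf-8", newline="") as src:
        while True:
            chunk = list(islice(src, lines_per_file))
            if not chunk:
                break
            part = f"{path}.part{len(created) + 1:04d}"
            with open(part, "w", encoding="utf-8", newline="") as out:
                out.writelines(chunk)
            created.append(part)
    return created
```

The source and target files need to be split with the same lines_per_file value, otherwise part N of one language no longer lines up with part N of the other.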
