Channel: ProZ.com Translation Forums

smartCAT: match analysis | word count differences

Forum: CAT Tools Technical Help
Topic: smartCAT: match analysis
Poster: Pavel Doronin
Post title: word count differences

Dear Chiara,
Every program has its own algorithm. Usually, the statistics are calculated in the following steps:
Text extraction. Different programs may or may not extract text from headers, footers, tables of contents and embedded objects. This affects the total number of words or characters. For example:
MS Word ignores header text, whereas SDL Trados and Smartcat include it in the word count.
MS Word does not include automatically generated page numbers in its statistics, while SDL Trados does.
MS Word counts the words in the table of contents as separate words, while SDL Trados and Smartcat do not. We believe this makes sense: the table of contents is generated automatically from the titles and subtitles, which will be translated anyway, so once the translation is completed you only need to update the table of contents.
Text segmentation (splitting the document into sentences). This is not applicable to MS Word. Here, the approach may differ depending on:
What is considered a “segment”. For example, a line that contains only spaces will not be seen as a segment by either Smartcat or Trados, so the spaces won’t be counted as characters. In MS Word, however, they are considered characters and included in the statistics.
Which characters (or combinations of characters, line breaks) are treated as segment delimiters. This may also affect the number of TM matches (when a Trados TM is used in a Smartcat document, or a Smartcat TM is used in a Trados document).
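As a rough illustration of the whitespace-only case, here is a toy model in Python. It is not the actual algorithm of Smartcat, Trados or MS Word, and both function names are made up for this sketch:

```python
def chars_counting_every_line(lines):
    """Count characters in every line, including whitespace-only lines."""
    return sum(len(line) for line in lines)


def chars_counting_segments_only(lines):
    """Skip whitespace-only lines, since they do not form a segment."""
    return sum(len(line) for line in lines if line.strip())


lines = ["Hello world.", "   ", "Second line."]
print(chars_counting_every_line(lines))     # 27
print(chars_counting_segments_only(lines))  # 24
```

The three spaces on the middle line account for the whole difference between the two totals.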
Splitting segments into words can also work differently in different software, and even in different versions of the same software, as each uses its own algorithm. The differences may include:
Apostrophes and slashes are not treated as word delimiters in MS Word, but they are in Trados and Smartcat (there, “Student’s Book” counts as 3 words).
Trados 2011 does not consider digits-only segments to contain any words, while Trados 2007 and MS Word do.
Dashes are treated as delimiters in Trados 2007, but not in the other software.
MS Word counts numbers in numbered lists as separate words, while Trados and Smartcat ignore them.
Various character sequences, such as ________ or ***** are treated as words in MS Word but are not considered to be such by Trados and Smartcat.
PowerPoint statistics are a total mess.
And the list goes on.
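To see how much the delimiter rules alone can change a count, here is a minimal Python sketch. It only models two of the differences listed above (extra delimiters and symbol-only sequences); the function and its parameters are hypothetical and do not reproduce the real tokenizer of any of these tools:

```python
import re


def count_words(text, extra_delimiters="", skip_symbol_runs=False):
    """Count tokens split on whitespace plus any extra delimiter characters."""
    pattern = r"[\s" + re.escape(extra_delimiters) + r"]+"
    tokens = [t for t in re.split(pattern, text) if t]
    if skip_symbol_runs:
        # Drop tokens without any letter or digit, e.g. "________" or "*****".
        tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]
    return len(tokens)


# Whitespace-only splitting: the apostrophe stays inside the word.
print(count_words("Student's Book"))                         # 2
# Apostrophes and slashes as delimiters: "Student", "s", "Book".
print(count_words("Student's Book", extra_delimiters="'/"))  # 3
# A run of underscores counted as a word, or ignored.
print(count_words("Name: ________"))                         # 2
print(count_words("Name: ________", skip_symbol_runs=True))  # 1
```

The example uses a straight apostrophe; a curly one (’) would have to be added to `extra_delimiters` separately.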
Matches and repetitions. If two lines are almost identical and the only difference between them is a number, a tag or a certain kind of character, they will be counted as repetitions. TM matches work in a similar way.
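One way to picture this is to replace numbers and tags with placeholders before comparing segments. The Python sketch below does exactly that. It is only an assumption about how such a comparison could work, not the actual logic of Smartcat or Trados, and the `<...>` tag format and the function names are invented for the example:

```python
import re


def normalize(segment):
    """Replace tags and numbers with placeholders so they are ignored when comparing."""
    segment = re.sub(r"<[^>]+>", "{tag}", segment)           # assumed tag format
    segment = re.sub(r"\d+(?:[.,]\d+)*", "{num}", segment)   # integers and decimals
    return segment.strip()


def count_repetitions(segments):
    """Count segments whose normalized form has already been seen."""
    seen = set()
    repetitions = 0
    for seg in segments:
        key = normalize(seg)
        if key in seen:
            repetitions += 1
        else:
            seen.add(key)
    return repetitions


segments = ["See page 12.", "See page 47.", "See <b>page</b> 3.", "Go to page 12."]
print(count_repetitions(segments))  # 1
```

Only the second segment is counted as a repetition here: it differs from the first by a number alone, while the other two differ in their tags or wording.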

[Edited at 2017-06-28 18:00 GMT]
