Working with plain text


Default encoding

Plain text files, which in most cases have a .txt extension, contain only textual information. They have no clearly defined way to tell the computer which language they contain. In (very) simple terms, this means the computer will by default assume the text is written in the same language the computer itself uses.


Garbled displays

If you are Russian, it is very likely that your computer works in Russian too: the menus are in Russian, the files you open are in Russian, and so on. In most cases, the computer makes the right assumption about the contents of files in general: they all contain Russian, and nothing that Russian characters could not display.

Now, if you are a Russian translator who translates from Japanese, the Japanese files you receive, if they are plain text files, will most probably be treated by the computer as files containing Russian, because nothing in the file itself tells the computer which language it is written in.

The Japanese file contents could be:

OmegaTとは、コンピュータを利用した翻訳ツールです。

But your text editor could very well display it like this:

OmegaTВ∆ВЌБAГRГУГsГЕБ[Г^ВрЧШЧpµšЦ|ЦуГcБ[ГЛВ≈ВЈБB

This happens because it expects the contents to be Russian. But this is not Russian: these are Japanese characters wrongly displayed as Russian characters.
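
The garbling above can be reproduced in a few lines. The following is a minimal Python sketch; it assumes the Japanese file was saved as Shift_JIS and that the Russian computer defaults to Mac Cyrillic. Both choices are assumptions made for illustration, since the actual encodings depend on the systems involved.

```python
# Japanese source text, as the client wrote it.
original = "OmegaTとは、コンピュータを利用した翻訳ツールです。"

# Assumption: the client's computer saved the file as Shift_JIS.
raw_bytes = original.encode("shift_jis")

# Assumption: the Russian computer decodes with Mac Cyrillic by default.
garbled = raw_bytes.decode("mac_cyrillic")
print(garbled)

# The bytes were never damaged: decoding them with the right encoding
# recovers the Japanese text.
recovered = raw_bytes.decode("shift_jis")
print(recovered)
```

Only the interpretation of the bytes differs between the two print calls; the file contents are identical in both cases.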

OmegaT is no different. OmegaT assumes that plain text files contain text that can be displayed correctly using the computer's defaults. That works well when the computer works in French and you receive English files, or when the computer works in German and you receive Italian files.


Character sets and encodings

Why would that work with English and French but not with Russian and Japanese? Because English and French share a common character set, namely Latin-1 or a variation of it. Until recently, Russian and Japanese did not share any character set. Most current Russian character sets do not cover Japanese, and vice versa. The result is as shown above.

The Japanese client works with a Japanese computer and creates text files that contain Japanese. The character set selected by the client's computer will depend on the operating system and on other settings, but it is very unlikely that the chosen (Japanese) character set will be correctly interpreted by the Russian computer.

Now, how the textual information in the specified character set is physically transmitted (i.e. how it is written in the file for the computer to interpret and display) depends on an encoding. When the computer reads the file, it "decodes" the information according to the encoding and displays it according to the character set. Roughly speaking, one encoding corresponds to one character set...
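
To make the distinction concrete, here is a small Python sketch showing that the same character becomes different bytes under different encodings, and that an encoding can only write characters its character set covers. The sample strings are chosen for illustration only.

```python
french = "é"

# The same character is stored as different bytes depending on the encoding.
print(french.encode("latin-1"))  # one byte
print(french.encode("utf-8"))    # two bytes

japanese = "翻訳"

# Latin-1 has no Japanese characters, so the text cannot be written in it.
try:
    japanese.encode("latin-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False
print(fits_latin1)

# UTF-8 covers both the French and the Japanese sample.
print(len(japanese.encode("utf-8")))
```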


The OmegaT solution

There are basically three ways to fix this in OmegaT. All three involve using the file filters in the Options menu.

  1. Specify the encoding for your plain text files, i.e. files with a .txt extension.
    In the Text files section of the file filters dialog, change the Source File Encoding from <auto> to the encoding that corresponds to your source .txt file.
  2. Change the extensions of your plain text source files - from .txt to .jp for Japanese plain texts for instance.
    In the Text files section of the file filters dialog, add the *.jp Source Filename Pattern and select the appropriate parameters for the source and target encoding.
  3. Open your source file in a text editor that correctly interprets its encoding and save the file in the "UTF-8" encoding.
    Change the file extension from .txt to .utf8.
    OmegaT will automatically interpret the file as a UTF-8 file.
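
The third option can also be scripted instead of done by hand in a text editor. The sketch below is one possible approach in Python and is not part of OmegaT; the function name convert_to_utf8 and the example file name are hypothetical, and you must already know the source encoding (Shift_JIS is assumed in the usage comment).

```python
from pathlib import Path


def convert_to_utf8(source_path, source_encoding):
    """Re-save a plain text file as UTF-8 with a .utf8 extension.

    Returns the path of the new file; the original file is left untouched.
    """
    source = Path(source_path)
    # Decode using the encoding the file was really written in.
    text = source.read_text(encoding=source_encoding)
    # Same name, but with the .utf8 extension.
    target = source.with_suffix(".utf8")
    target.write_text(text, encoding="utf-8")
    return target


# Hypothetical usage: a Japanese file assumed to be in Shift_JIS.
# convert_to_utf8("source/readme.txt", "shift_jis")
```

If the source encoding is guessed wrongly, the script will either fail to decode or silently write garbled text, so it is worth opening the result once to check it.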

Currently, OmegaT is set to understand plain text files as follows

You can check that yourself by selecting File Filters in the Options menu.

OmegaT just keeps this short list ready to make it easier for you to deal with some plain text files.

For example, when you have a Czech text file (very probably encoded in ISO-8859-2), you just need to change the extension from .txt to .txt2 and OmegaT will interpret its contents correctly.

