Character Encodings in Linux

The issue of how bits and bytes of a text file are interpreted and displayed as letters on the computer, and how a computer user can type letters in, is rather complicated. This document explains the related concepts of character encoding, key code, and input method, especially as they apply to the GNU/Linux operating system and software that runs on it.

Terminology

character encoding

The data in a document can be thought of as a string of 0’s and 1’s: binary digits, or bits. These bits are grouped into bytes, each of which is a set of 8 bits. A byte can be interpreted as a number from 0 to 255.
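As a small illustration of this view, the Python sketch below lists the numeric value and the eight bits of each byte. The byte string is an arbitrary example chosen for this sketch.

```python
# A bytes object is a sequence of integers, each in the range 0..255.
data = b"Hi!"

values = list(data)                         # one integer per byte
bits = [format(v, "08b") for v in values]   # the same bytes as 8 binary digits

for v, b in zip(values, bits):
    print(v, b)
```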

A character encoding is a re-interpretation of these numbers as characters from human language alphabets.

The most famous encoding is the old ASCII encoding, which interprets numbers in the range 32–126 as the space and the characters found on the keys of an American typewriter. (The remaining numbers 0–31 and 127 are used as device control characters: carriage return, tab, newline, delete, etc.) ASCII uses only the first half of the possible numbers in a byte (0–127), so it is called a “7-bit” encoding. This encoding only barely suffices for American English, however. Many other encodings have been developed to encode other languages.
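The following Python sketch shows the ASCII interpretation in practice: the printable characters, a short decoded word, and what happens to a byte above 127. The helper name is_ascii is invented for this example.

```python
# Numbers 32..126 are the printable ASCII characters (32 is the space).
printable = "".join(chr(n) for n in range(32, 127))
print(printable)

# Decoding bytes as ASCII maps each number to its character.
word = bytes([72, 101, 108, 108, 111]).decode("ascii")
print(word)

def is_ascii(raw: bytes) -> bool:
    """Return True if every byte fits in the 7-bit ASCII range."""
    try:
        raw.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False

# A byte above 127 is outside the 7-bit range, so decoding as ASCII fails.
print(is_ascii(b"plain text"), is_ascii(b"caf\xe9"))
```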

The first 128 characters (numbers 0–127) of most (but not all) encodings are ASCII. This is a good thing because it means that the US English part of any text can be understood (almost) independently of the encoding.
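To see this, one can decode the same bytes under several 8-bit encodings. In the Python sketch below, the three codecs are picked only as examples: the ASCII part of the text survives in each, while the one byte above 127 turns into a different character every time.

```python
raw = b"caf\xe9"   # three ASCII bytes followed by one byte above 127

# Decode the same four bytes under three different 8-bit encodings.
decoded = {name: raw.decode(name) for name in ("latin_1", "cp437", "iso8859_7")}

for name, text in decoded.items():
    print(f"{name:10} {text}")
```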

The character encoding to end all character encodings is Unicode, whose goal is to encode the writing systems of all the world’s languages (as well as many other symbolic systems). To do this, it is insufficient to associate the 256 possible byte values with characters, so Unicode encodings use multiple bytes to encode characters. Such an encoding is called a multi-byte encoding. The common encoding forms of Unicode are UTF-8 and UTF-16; ISO/IEC 10646 is the international standard that defines the same character set.
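The multi-byte nature of Unicode encodings can be checked directly. The Python sketch below encodes a few sample characters (an arbitrary selection) in UTF-8 and UTF-16 and records how many bytes each one needs.

```python
samples = [
    "A",            # Latin capital letter A
    "\u00e9",       # e with acute accent
    "\u20ac",       # euro sign
    "\u4e2d",       # a Chinese character
    "\U0001F600",   # an emoji outside the 16-bit range
]

sizes = {}
for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")   # big-endian, without a byte order mark
    sizes[ch] = (len(utf8), len(utf16))
    print(f"U+{ord(ch):04X}  utf-8: {utf8.hex()}  utf-16: {utf16.hex()}")
```

UTF-8 keeps ASCII characters at one byte and grows to as many as four, while UTF-16 uses either two or four bytes per character.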

Computer systems have mostly completed the move to Unicode. There are still weak spots, however, and older systems often don't support it at all.

Before Unicode, dozens of other encoding systems were developed. Most of them were 8-bit encodings, specialized for a particular language, or for some small group of languages. These include the international standard ISO-8859 series, each member of which enables one to type in both English and some other language or languages. Then there were series of encodings meant for a particular computer architecture, such as the PC CodePage (CP) series for IBM PC compatibles and Microsoft Windows, and the Macintosh encodings.

An 8-bit encoding is insufficient for writing systems with thousands of characters, such as Chinese, Japanese, and Korean. So multi-byte encoding systems were developed for these: Big-5 (Chinese), JIS (Japanese), and KSC (Korean).
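A short Python sketch illustrates these older multi-byte encodings. It assumes Python's codec names big5, shift_jis, and euc_kr; the latter two are common encodings built on the JIS and KSC character sets rather than those standards themselves. The sample words are arbitrary.

```python
samples = {
    "big5": "\u4e2d\u6587",               # Chinese: "zhongwen"
    "shift_jis": "\u65e5\u672c\u8a9e",    # Japanese: "nihongo"
    "euc_kr": "\ud55c\uad6d\uc5b4",       # Korean: "hangugeo"
}

byte_lengths = {}
for codec, text in samples.items():
    encoded = text.encode(codec)
    byte_lengths[codec] = len(encoded)
    # Decoding with the same codec must give the original text back.
    round_trip = encoded.decode(codec)
    print(codec, len(text), "characters ->", len(encoded), "bytes", round_trip == text)
```

In each case every character of the sample takes two bytes, which is what lets these encodings cover far more than 256 characters.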

key code

A key code identifies a particular key on the computer keyboard. Unfortunately, different manufacturers numbered their keyboards in incompatible ways. However, with knowledge of your keyboard brand and model, the X Window System can sort this out.

input method

For English, we have to use the shift key to obtain upper-case letters. In other languages, it is necessary to use other combinations of keys to obtain accent marks. In many languages, complex combinations of keys may be required to obtain a given character. For a given writing system, an input method is the computer algorithm which obtains a character from a combination of key strokes. Notice that some languages may have several input methods.
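As a rough sketch of what such an algorithm might look like, the Python example below implements a toy input method in which a special key starts a two-stroke sequence that selects one character. The key name, the sequence table, and the function name are all hypothetical and invented for this illustration; real input methods are considerably more elaborate.

```python
COMPOSE_KEY = "<compose>"   # hypothetical name for the key that starts a sequence

# Hypothetical table: the two key strokes after the compose key select a character.
SEQUENCES = {
    ('"', "a"): "\u00e4",   # a with diaeresis
    ("'", "e"): "\u00e9",   # e with acute accent
    (",", "c"): "\u00e7",   # c with cedilla
    ("s", "s"): "\u00df",   # sharp s
}

def apply_input_method(keystrokes):
    """Turn a list of key strokes into text, resolving compose sequences."""
    out = []
    keys = iter(keystrokes)
    for key in keys:
        if key != COMPOSE_KEY:
            out.append(key)
            continue
        # Consume the next two key strokes as one sequence.
        pair = (next(keys, None), next(keys, None))
        # Unknown or incomplete sequences produce nothing in this sketch.
        out.append(SEQUENCES.get(pair, ""))
    return "".join(out)

typed = ["M", COMPOSE_KEY, '"', "a", "d", "c", "h", "e", "n"]
print(apply_input_method(typed))
```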

keyboard layout

A keyboard layout is a mapping of the keys on the keyboard to character values.

Usually, if you have a U.S. English keyboard, you expect the character produced by typing a certain key to correspond to the letter displayed on the key. But of course the plastic letter has no direct bearing on the function of the keyboard --- that is determined by software, and the software can be changed, so that pushing a key produces a character different from what is displayed on the key.

This can be very useful for people who regularly type in more than one language. By switching the keyboard layout between, say U.S. English QWERTY, and a standard German layout, they can conveniently type text in one language or the other.
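The idea that a layout is only a software mapping can be sketched in a few lines of Python. The key names and the two tiny tables below are hypothetical and cover only three keys; they are labelled by the key's position on a U.S. keyboard.

```python
# Hypothetical layout tables: physical key -> character produced.
LAYOUTS = {
    "us": {"KEY_Y": "y", "KEY_Z": "z", "KEY_SEMICOLON": ";"},
    "de": {"KEY_Y": "z", "KEY_Z": "y", "KEY_SEMICOLON": "\u00f6"},
}

def type_keys(keys, layout):
    """Translate a sequence of physical keys using the chosen layout."""
    table = LAYOUTS[layout]
    return "".join(table[key] for key in keys)

pressed = ["KEY_Y", "KEY_Z", "KEY_SEMICOLON"]
print(type_keys(pressed, "us"))
print(type_keys(pressed, "de"))
```

The same three physical keys give different text depending only on which table is active, which is exactly what switching the layout does.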

There are also multiple keyboard layouts for individual languages. For English, there is the conventional U.S. QWERTY layout, as well as the Dvorak layout, which was designed to be more efficient; and then there is the standard U.K. English layout.

KDE

Control of the keyboard layout is provided by the Keyboard Layout module of the KDE Control Center.

Gnome

Control of the keyboard layout is provided by the Keyboard Layout module of the System Settings utility.

In addition to English, I often type in German and occasionally in French. For this, I find the layout “U.S. English” convenient.

Under Options set your preferred "Compose Key", which will be used to type non-English letters and characters. I use the "right-Alt" key for this purpose.

Applications

Setting the keyboard layout only makes it possible for the X Window System to deliver special symbols. It does not mean that a given application can accept them.

Some older applications such as Kterm have built-in input methods.

Applications linked with the KDE 3 or Gnome 2 libraries should be able to accept general Unicode typing. That is, you should be able to type in any language, provided you have the fonts and know how to work the input method for that language.

Recent versions of the text editors KWrite and Gedit are both Unicode enabled (but in different ways: Gedit assumes everything is Unicode; KWrite will open and save files in a variety of encodings). The mail client Balsa is also Unicode enabled.
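Opening a file in one encoding and saving it in another can also be done in a few lines of Python, as sketched below. The function name convert_encoding and the file names are made up for this example, and the source encoding must be known in advance, since it cannot be detected reliably from the bytes alone.

```python
import os
import tempfile

def convert_encoding(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Read a text file in one encoding and write it out in another."""
    # newline="" keeps the line endings unchanged in both directions.
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
    return text

with tempfile.TemporaryDirectory() as tmp:
    old = os.path.join(tmp, "old.txt")
    new = os.path.join(tmp, "new.txt")
    with open(old, "wb") as f:
        f.write(b"Gr\xfc\xdfe\n")   # a German greeting stored as ISO-8859-1 bytes
    convert_encoding(old, new, "iso8859_1")
    with open(new, "rb") as f:
        converted = f.read()

print(converted)
```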