Quick Introduction to HTML
Internationalization
Introduction
Most text on the World Wide Web is in English, because at this time, that is the most widely understood language. However, there are many reasons you may wish to insert text from some other language in an English web page, or create a web page completely in some language other than English.
The main internationalization issue for the Web is character encoding: the way the data in a document is interpreted as, say, letters in an alphabet. An encoding is necessary because, without further information, a computer file is just a sequence of bytes. To interpret those bytes, one must further specify that the file is, say, ASCII text or Russian text in a particular encoding.
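To make this concrete, here is a minimal sketch in Python (chosen only for illustration; it is not part of HTML) that reads the same three bytes under three different encodings. The helper name `decode_as` is an assumption of this example.

```python
# The same three bytes, interpreted under three single-byte encodings.
raw = bytes([0xE1, 0xE2, 0xE3])


def decode_as(data: bytes, encoding: str) -> str:
    """Interpret raw bytes as text using the named character encoding."""
    return data.decode(encoding)


for name in ("iso-8859-1", "iso-8859-5", "iso-8859-7"):
    # Western European, Cyrillic, and Greek readings of identical bytes.
    print(f"{name}: {decode_as(raw, name)}")
# iso-8859-1: áâã
# iso-8859-5: сту
# iso-8859-7: αβγ
```

The bytes never change; only the declared encoding decides which letters the reader sees.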
Another unfortunate historical fact is that some common punctuation marks in English (curly quotes and dashes, particularly) are encoded differently on different computer systems. If you use these, you should indicate the character encoding in your document. Otherwise, on someone else’s machine they may appear as garbage.
It is also important to indicate in the web page what language the text is meant to represent, and if the text is a mixture of languages, which text is in one language and which is in another.
Character entities
If you just want to include in your document a single non-English character, such as an accented vowel, often the easiest way is to use HTML Character Entities. These include most of the letters from the Western European alphabets. Their use side-steps the issue of encodings.
A good use of this is to insert non-English names in an English document:
HTML code | Rendered |
---|---|
`Garc&iacute;a` | García |
`Eug&egrave;ne` | Eugène |
`Sch&ouml;nberg` | Schönberg |
It is also possible, but not recommended, to insert Greek this way:
HTML code | Rendered |
---|---|
`&Alpha;&pi;&omicron;&lambda;&lambda;&omicron;` | Απολλο |
The last example shows one of the weaknesses of this approach—the HTML is big and ugly and looks nothing like the finished product. If you want to write a whole document in an alphabet other than that of English, this is not the way to go. We will discuss better alternatives below.
The support for Greek in the HTML Character Entities is really accidental. The Greek Entities were really meant for writing math formulas, not for display of Greek text. For the same reason, there is an HTML Entity for one Hebrew letter (Aleph). But there are no entities for Russian, Arabic, or Chinese.
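As a rough illustration of which characters have named entities, the following Python sketch uses the standard library's HTML 4 entity table. The function name `to_named_entities` is hypothetical; it leaves any character without a named entity unchanged.

```python
import html
import unicodedata
from html.entities import codepoint2name  # HTML 4 named entities


def to_named_entities(text: str) -> str:
    """Replace non-ASCII characters with named entities where one exists."""
    pieces = []
    # Normalize first so an accented letter is a single code point.
    for ch in unicodedata.normalize("NFC", text):
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name is not None:
            pieces.append(f"&{name};")
        else:
            pieces.append(ch)  # ASCII, or no named entity available
    return "".join(pieces)


print(to_named_entities("Schönberg"))  # Sch&ouml;nberg
print(to_named_entities("Жук"))        # Жук (no named entities for Cyrillic)
print(html.unescape("Garc&iacute;a"))  # García
```

The Cyrillic word passes through untouched, which matches the point above: named entities cover Western European letters and a set of Greek letters, but not Russian.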
It is also possible to specify a single character by its numeric Unicode code point, written as a numeric character reference. With this, one can in principle specify any character from any of the world’s writing systems:
HTML code | Rendered |
---|---|
`&#1588;` | ش |
`&#3602;` | ฒ |
`&#20004;` | 两 |
Of course, the HTML code in this approach is even less readable than with the HTML Character Entities. One has to have a Unicode table to read it at all.
Furthermore, since HTML Entities were really meant for display of single characters, the browser may have a hard time displaying words in languages such as Arabic, where the shape of one letter depends on the surrounding letters. See below for a solution to this problem.
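If you do need numeric references, you do not have to look them up by hand. The sketch below, in Python and offered only as an illustration, converts text to decimal numeric references and back; `to_numeric_references` is a name invented for this example.

```python
import html


def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric reference."""
    # The "xmlcharrefreplace" error handler emits &#NNNN; for any
    # character that plain ASCII cannot represent.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_numeric_references("ش"))   # &#1588;
print(to_numeric_references("两"))  # &#20004;
# Decimal and hexadecimal forms name the same character.
print(html.unescape("&#1588;") == html.unescape("&#x634;"))  # True
```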
Character encodings
For historical and technical reasons, there are numerous character encodings, some to accommodate different languages, some invented by different companies.
If you produce your document in one encoding, and your user’s browser interprets it as another, it is likely to appear as computer-gibberish, and will at least have some misinterpreted characters.
Some important encodings include:
- ASCII
- ISO-8859-1 through ISO-8859-16
- Unicode
The Unicode (UTF) encodings represent the ultimate solution to the problem of encoding all human written languages. However, to date, not all of Unicode is completely implemented on any browser, or any platform. Whole languages are missing, fonts aren’t complete, and some platforms don’t support it at all.
The iso-8859 series of encodings from the International Organization for Standardization (ISO) represents a fairly safe intermediate step to Unicode. For Western European languages, iso-8859-15 (Latin-9) does almost everything. The other iso-8859 encodings provide support for mixtures of Western and Eastern European languages, and a few other alphabetic writing systems, such as Arabic, Hebrew, and Thai.
Before the iso-8859 encodings, and before Unicode, many other character encodings were invented by different groups for special purposes. There are several encodings for Chinese and Japanese, and several that permit Russian, Arabic, Hebrew, Thai, etc. to be mixed with Western languages.
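One practical way to choose among these encodings is to check which of them can represent your text at all. The following Python sketch does that by trial encoding; the function name and the candidate list are assumptions of this example, not a standard procedure.

```python
def encodings_that_fit(
    text,
    candidates=("ascii", "iso-8859-1", "iso-8859-15", "iso-8859-7", "utf-8"),
):
    """Return the candidate encodings able to represent every character."""
    fitting = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this encoding
        fitting.append(name)
    return fitting


print(encodings_that_fit("García"))  # ['iso-8859-1', 'iso-8859-15', 'utf-8']
print(encodings_that_fit("Œuvre"))   # ['iso-8859-15', 'utf-8']
print(encodings_that_fit("Απολλο"))  # ['iso-8859-7', 'utf-8']
```

Note that UTF-8 fits every sample, while each iso-8859 encoding fits only the languages it was designed for.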
Specify an encoding
The de facto default encoding on the Web is Latin-1 (iso-8859-1). That means, if the user’s browser gets no other information about a page’s encoding, it will assume Latin-1 (most browsers also allow the user to alter this default, but to do so is not usually a good idea).
Unless you are very sure your document will end up as Latin-1, you should always inform your user’s browser as to which encoding you intend. Since different HTML files may have different encodings, it is best for this information to be contained in the document.
The HTML way is to place a special `meta` tag in the `head` section of the document.
For example, for mixed Greek and English text, the encoding `iso-8859-7` is a good choice.
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-7" />
If your document is XHTML rather than HTML, it is best to put the information in the XML declaration on the first line of the document:
<?xml version="1.0" encoding="iso-8859-15"?>
For best results in XHTML, use both the XML declaration and the `meta` tag.
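To check what a document actually declares, a short script can read both declarations back. This is a minimal sketch in Python using the standard `html.parser` module; the class `DeclaredEncodingFinder`, the function `declared_encodings`, and the sample markup are all invented for illustration, and the sketch only looks at the two forms shown above.

```python
import re
from html.parser import HTMLParser

CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)
XML_ENCODING_RE = re.compile(r"encoding\s*=\s*[\"']([A-Za-z0-9._:-]+)[\"']")


class DeclaredEncodingFinder(HTMLParser):
    """Collect the encoding named in the XML declaration and the meta tag."""

    def __init__(self):
        super().__init__()
        self.xml_encoding = None
        self.meta_encoding = None

    def handle_pi(self, data):
        # Called for <?xml ... ?>; data is the text after "<?".
        if self.xml_encoding is None and data.lower().startswith("xml"):
            match = XML_ENCODING_RE.search(data)
            if match:
                self.xml_encoding = match.group(1)

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.meta_encoding is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            match = CHARSET_RE.search(attributes.get("content") or "")
            if match:
                self.meta_encoding = match.group(1)


def declared_encodings(markup):
    """Return (XML declaration encoding, meta tag encoding); None if absent."""
    finder = DeclaredEncodingFinder()
    finder.feed(markup)
    finder.close()
    return finder.xml_encoding, finder.meta_encoding


sample = (
    '<?xml version="1.0" encoding="iso-8859-15"?>\n'
    "<html><head>"
    '<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-15" />'
    "</head><body></body></html>"
)
print(declared_encodings(sample))  # ('iso-8859-15', 'iso-8859-15')
```

If the two values ever differ, the document is contradicting itself and one of the declarations should be corrected.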
Specify language
It is considered polite to specify which languages are being used in a document. Some search engines can filter documents based on this information. For example, with the following HTML `meta` tag in the head section of the document, we can explain that the document contains U.S. English (`en-US`) and Greek (`el`):
<meta http-equiv="Content-Language" content="en-US,el" />
It is very useful to specify the language you mean a particular piece of text to represent. You can do this for the text in any HTML element by setting the value of its `lang` attribute to the code for that language. For example:
Our friend Miguel says "<span lang="es">¡Hola!</span>"
is rendered as
Our friend Miguel says "¡Hola!"
Reasons to specify the language include:
- Some browsers do a much better job of rendering text in mixed alphabets when they are given the language hint
- A program that reads web pages audibly can only have a chance of pronouncing words correctly if this information is provided
- In documents containing mixed languages, you can specify different font style properties (even color!) for different languages
If you mean the whole HTML document to be in a specific language, you can provide that information in the `lang` attribute of the `html` tag for the document.
Note that the way to specify the language differs in XHTML. In HTML5, the correct attribute is `lang`. With XHTML 1.0, you can use both `lang` and `xml:lang`, but in XHTML 1.1, only `xml:lang` is acceptable.
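A language set on an outer element applies to everything inside it unless an inner element overrides it. The Python sketch below shows roughly how a screen reader or search engine might split a page into runs of text by language. The class name `LanguageRuns` and the sample markup are assumptions of this example, and the handling of unclosed tags is deliberately simplistic.

```python
from html.parser import HTMLParser


class LanguageRuns(HTMLParser):
    """Record each piece of text together with the language in effect."""

    def __init__(self):
        super().__init__()
        self._stack = []  # (tag, effective language) for each open element
        self.runs = []    # (language or None, text) in document order

    def _current(self):
        return self._stack[-1][1] if self._stack else None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # An element without its own language inherits its parent's.
        lang = attributes.get("lang") or attributes.get("xml:lang") or self._current()
        self._stack.append((tag, lang))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, discarding unclosed children.
        for index in range(len(self._stack) - 1, -1, -1):
            if self._stack[index][0] == tag:
                del self._stack[index:]
                break

    def handle_data(self, data):
        if data.strip():
            self.runs.append((self._current(), data))


page = (
    '<html lang="en"><body><p>Our friend Miguel says '
    '"<span lang="es">¡Hola!</span>"</p></body></html>'
)
parser = LanguageRuns()
parser.feed(page)
parser.close()
for language, text in parser.runs:
    print(language, repr(text))
```

Here the greeting is reported as Spanish and the surrounding sentence as English, which is the information a speech program needs to pronounce each part correctly.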
Encodings and servers
If you don’t specify a character encoding for your document, your web server will. The usual default encoding is iso-8859-1 (Latin-1), although many servers are now switching to UTF-8 (Unicode).
The web server at a site can also specify a default encoding. This is important if most of the files at the site are written in a non-Latin alphabet.
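The server announces its encoding in the HTTP `Content-Type` header. The sketch below, in Python, shows one plausible way to combine the sources: it assumes that the header wins over the document's `meta` tag when both are present, and that Latin-1 is the fallback as described above. The function names are hypothetical, and real browsers apply more rules than this.

```python
from email.message import Message


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()  # lowercased, or None if absent


def effective_encoding(content_type_header, meta_charset, default="iso-8859-1"):
    """Assumed precedence: HTTP header, then meta tag, then the default."""
    from_header = charset_from_header(content_type_header)
    if from_header:
        return from_header
    if meta_charset:
        return meta_charset.lower()
    return default


print(effective_encoding("text/html; charset=UTF-8", "iso-8859-7"))  # utf-8
print(effective_encoding("text/html", "iso-8859-7"))                 # iso-8859-7
print(effective_encoding(None, None))                                # iso-8859-1
```

Under this assumption, a server that sends its own charset can silently override a correct `meta` tag, so it is worth checking what your server sends.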
Fonts and encodings
A computer font is a collection of glyphs, which are the graphical representations of characters in an encoding. It is this glyph that is displayed in the user’s browser window.
For the browser to display the glyph of a character in a certain encoding, a font that includes that glyph must be installed on the user’s system. Nowadays, most computer systems come with at least one fairly complete Unicode font, so for most languages there is a good chance that your viewer will see the correct glyphs.
Note that it is not the business of your web page to specify the font for a language encoding. The user’s browser is meant to find a font on their system that best represents the encoding of the text.
The correct behaviour of a web browser is to identify and use a font containing the requested glyph, regardless of the font currently being used for text display. For example, if a run of text is being displayed in one font, and a character that is not supported by the font appears in the text, the browser should locate another font which does support that character, and use that font to display the character. Only when no font can be found that supports the character should the browser insert a placeholder glyph in place of the unsupported character.
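The fallback behaviour just described can be sketched in a few lines of Python. This is a toy model only: the font names and their character coverage are invented data, and `None` stands in for the placeholder glyph.

```python
def assign_fonts(text, current_font, installed):
    """Pick a font for each character, falling back when coverage is missing.

    `installed` maps a font name to the set of characters it covers
    (hypothetical data). A result of None means no font covers the
    character, so a placeholder glyph would be shown.
    """
    assignments = []
    for ch in text:
        if ch in installed.get(current_font, set()):
            font = current_font  # the font already in use covers it
        else:
            # Otherwise use the first installed font that has the glyph.
            font = next(
                (name for name, covered in installed.items() if ch in covered),
                None,
            )
        assignments.append((ch, font))
    return assignments


installed = {
    "Body Serif": set("abc"),
    "Greek Sans": set("αβγ"),
}
print(assign_fonts("aαᏣ", "Body Serif", installed))
# [('a', 'Body Serif'), ('α', 'Greek Sans'), ('Ꮳ', None)]
```

The Cherokee letter ends up with no font at all, which is the situation the next paragraph addresses.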
The writing systems of some languages, such as Cherokee and Myanmar, have appeared in only a few Unicode fonts. At this time, the only solution is to ask your users to install the proper font in case they don’t see the characters displayed properly.
Curly quotes and dashes
Users of Microsoft Windows: beware! Many Windows programs for typing text automatically make quote marks and apostrophes “curly”, so
How's "this"?
becomes
How’s “this”?
This is fine in itself, but unless the document’s character encoding is properly set, these characters will not appear correctly on other computer systems. Unfortunately, many Windows applications (especially older ones, and including Microsoft Word) fail to set it.
A good encoding for documents in English (and other Western languages) made in Windows is “Code page 1252 - West European Latin”. The `meta` tag for this would be:
<meta http-equiv="Content-Type" content="text/html;charset=windows-1252" />
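To see why the declaration matters, the following Python sketch saves the curly-quote sentence as Windows-1252 and then reads the bytes back under other encodings. It is an illustration only; the variable names are arbitrary.

```python
text = "How\u2019s \u201cthis\u201d?"  # How’s “this”?
raw = text.encode("windows-1252")
print(raw)  # b'How\x92s \x93this\x94?'

# Declared correctly, the quotes survive the round trip.
print(raw.decode("windows-1252") == text)  # True

# Read as Latin-1, the quote bytes become invisible control characters.
as_latin1 = raw.decode("iso-8859-1")
print(as_latin1 == text)  # False

# Read as UTF-8, the bytes are not even valid.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False
```

The three quote characters occupy bytes 0x92, 0x93, and 0x94, a range where Windows-1252 and Latin-1 disagree, which is why an undeclared Windows document so often shows garbage exactly where the quotes should be.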
Other gotchas
Internet Explorer only recognizes the Unicode character encoding UTF-8 when it appears in all capitals in the `meta` tag:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
Also, while other browsers will recognize XHTML documents and use the XML declaration’s encoding
<?xml version="1.0" encoding="utf-8"?>
Internet Explorer 7 ignores the XML declaration if the server tells it that the document is HTML. At the time of this writing, this is almost always the case.
In summary, if you use Unicode encoding, and you want Internet Explorer users to be able to read your document, always include the `meta` tag above, and be careful to capitalize the UTF-8.