Historical Document - Last Updated Wed Nov 10 16:18:03 1999

The Web - not only in English

If you read any of the articles in the media about the World Wide Web and Mosaic, or look at the hotlists and Internet starting points provided with Mosaic and Netscape, you might be forgiven for thinking that the Web is only available in English. This is not the case.

The Hypertext Markup Language (HTML) and the Web itself were developed in Europe at CERN, which straddles the Swiss-French border. It should come as no surprise, then, that HTML provides for European languages such as French, German and Italian by means of standard escape sequences. Mosaic and Netscape, the most common graphical Web browsers, were designed to use the ISO 8859-1 character set, so that pages in Western European languages display without extra effort, e.g. http://www.afuu.fr, http://www.chemie.fu-berlin.de.

So how about the rest of the world? It turns out that providing Web pages in a language such as Russian or Greek is fairly simple, provided there is an 8-bit character set with ASCII as a subset. The document has standard HTML tags using ASCII in the lower 128 characters of the charset, and the foreign language text in the upper 128 characters. For example, see http://www.kiae.su/ and http://uranus.eng.auth.gr/home/gr/.

The only trouble with this scheme is that the globe-trotting Net surfer must know the charset ahead of time, otherwise the screen will fill with nonsense. In Windows 3.1, TrueType fonts may be tried one after another until the page is readable. In Unix, the fonts must be selected before starting Mosaic. If one lives in, say, Russia or Japan, then the fonts would be set correctly by default.
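The charset problem can be illustrated with a minimal Python sketch. It assumes KOI8-R as the Russian 8-bit charset (the article does not name one) and uses a made-up one-line page. It only illustrates the byte layout and is not meant to show how any particular browser works.

```python
# Hypothetical one-line page: ASCII tags around a Russian word ("Privet", hello).
PAGE = "<H1>Привет</H1>"

# Assumption: the server stores the page in KOI8-R, an 8-bit charset
# that keeps ASCII in the lower 128 code positions.
raw = PAGE.encode("koi8_r")

# Split the bytes by range: tags fall below 128, Cyrillic letters at 128 or above.
low_bytes = bytes(b for b in raw if b < 128)
high_bytes = bytes(b for b in raw if b >= 128)


def render(data: bytes, charset: str) -> str:
    """Decode page bytes the way a browser set to `charset` would read them."""
    return data.decode(charset)


readable = render(raw, "koi8_r")   # browser configured for the right charset
nonsense = render(raw, "latin_1")  # browser assuming ISO 8859-1
```

Here `readable` matches the original page, while `nonsense` keeps the tags intact but shows the six Cyrillic letters as unrelated ISO 8859-1 characters, which is the kind of garbage a reader with the wrong font selected would see.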

Is there an answer to this frustration? Yes, there is, and it is called Mosaic-l10n. Developed by TAKADA Toshihiro at Nippon Telegraph and Telephone Corporation in Japan, Mosaic-l10n is a multi-localized browser that can switch fonts easily. It can display the common 8-bit charsets such as Russian and Greek, and also 16- and 24-bit charsets such as the Japanese JIS or Chinese BIG-5. It will even switch automatically between fonts when a new page is selected, though this feature, and indeed the coding of non-Latin languages, are not part of the HTML-2 standard. See http://www.ntt.jp/Mosaic-l10n/ for details.

At this time Mosaic-l10n exists as a patch to the standard Unix distribution of Mosaic-2.4, so to run it on a PC you would need a Unix such as Linux, FreeBSD, or SCO Unix. However, the Netscape developers are following these developments, so there may eventually be a multilingual version of Netscape for Windows.

What's next?

Software developers around the world are in pursuit of the Holy Grail of a true multilingual Web browser, able to display many languages on a single page, and multilingual servers able to serve documents in the user's language of choice. These might be based on the Unicode character set (16-bit, with a 32-bit form defined in ISO 10646), though there are a few problems. Any new protocol must be compatible with existing software, and some languages such as Japanese have a different charset for each operating system (JIS, EUC and shift-JIS). Languages such as Hebrew that are not written left-to-right present a problem for the browser's paragraph formatting, though this may be crudely circumvented using the HTML <PRE> tag.
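The Japanese case can be sketched in Python as well. This is an illustration only: it assumes that the JIS, EUC and shift-JIS named above correspond to Python's `iso2022_jp`, `euc_jp` and `shift_jis` codecs, and the sample word is chosen arbitrarily.

```python
# Arbitrary sample text: "nihongo" (the Japanese language).
TEXT = "日本語"

# Assumed mapping from the article's names to Python codec names.
CODECS = {"JIS": "iso2022_jp", "EUC": "euc_jp", "shift-JIS": "shift_jis"}

# The same three characters produce a different byte sequence in each encoding.
encoded = {name: TEXT.encode(codec) for name, codec in CODECS.items()}


def misread(data: bytes, wrong_codec: str) -> str:
    """Decode bytes with the wrong codec, substituting undecodable bytes."""
    return data.decode(wrong_codec, errors="replace")


# A page saved as EUC but read as shift-JIS does not come back as the original.
garbled = misread(encoded["EUC"], CODECS["shift-JIS"])

# Show the raw bytes of each encoding as hexadecimal.
for name, data in encoded.items():
    print(name, data.hex())
```

Each encoding round-trips correctly with its own codec, but the byte sequences differ, so software that guesses the wrong one produces unreadable text. That is why any new protocol would need a way to say which charset a document uses.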

Andrew Daviel, January 1995


Andrew Daviel is an Engineer/Programmer at TRIUMF with a keen interest in the Web. An HTML version (with links) of this article may be found at http://andrew.triumf.ca/mling.html
Some sample pages in foreign languages may be found at the NTT webserver (http://www.ntt.jp/Mosaic-l10n/) and also at http://andrew.triumf.ca/multilingual/samples

Information on Chinese is available at http://www.webcom.com/~chinabus