All Mimsy were the Borogoves: A Brief Introduction to the Unicode Standard

By Scott Paul McGinnis

You can’t spell “digital humanities” without letters, and you can’t make letters appear on a computer screen without character encodings. The ubiquity of character encodings, and the enormity of the challenges involved in creating and standardizing them, are (happily) obscured by the fact that, when done well, they are not seen at all. It is when they break down, when a beloved paradox from an ancient text, 名可名非常名 (Laozi 1), turns into a string of unintelligible jabber—hÑOvqÝŸ/PÿQ¶úðÍ¶Œ ¡éØc:¹að—that the issue demands your attention.

The intelligibility of an electronic text rests on its encoding. Since its inception in 1987, the Unicode project has endeavored to create an encoding system capable of including all the world’s written languages, past and present, in a single, standardized format. Its mantra: “a unique number for every character, no matter what the platform, no matter what the program, no matter what the language” (http://www.unicode.org/standard/WhatIsUnicode.html).
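To make the mantra concrete, here is a minimal sketch in Python (a language chosen for illustration; the essay itself does not use one) that looks up the number assigned to each character of the Laozi phrase quoted above. The variable names are hypothetical.

```python
# Each character maps to exactly one number (its "code point"),
# and the same character always maps to the same number.
phrase = "名可名非常名"

code_points = [ord(ch) for ch in phrase]

for ch, number in zip(phrase, code_points):
    # Code points are conventionally written as U+ followed by hexadecimal digits.
    print(ch, f"U+{number:04X}")

# Going the other way, the numbers alone are enough to rebuild the text.
rebuilt = "".join(chr(number) for number in code_points)
```

The character 名 appears three times in the phrase and receives the same number each time, which is what allows any program on any platform to recover the same text from the same numbers.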

Before Unicode, the standard was ASCII (the American Standard Code for Information Interchange), which was developed because computers need a shared character set in order to exchange text and use the same programs. When it was introduced in 1963, the ASCII character set was limited by hardware capabilities to 128 characters (2^7, or 7 bits), which were based on English (http://edition.cnn.com/TECH/computing/9907/06/1963.idg/index.html).

This standardization made communication between computers easier, at least in English, but ASCII's 128-character set was too limited to encompass even French with its diacritics, not to mention Arabic, Braille, Sanskrit, or mathematical notation. As hardware limitations relaxed, some of these were accommodated by creating alternative sets of 128 additional characters (using an eighth bit to make a full byte, that is, 2^8 = 256 possibilities). But this led to a proliferation of separate, mutually unintelligible extended sets. And with a repertoire of more than 50,000 characters, the Chinese, Japanese, and Korean (CJK) group of written languages presented an encoding challenge orders of magnitude greater than the others. The eight bits of extended ASCII would not suffice.
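The problem of mutually unintelligible extended sets can be sketched in a few lines of Python. The three code pages below (ISO 8859-1, Windows-1251, and ISO 8859-7) are examples chosen for illustration and are not named in the essay; the point is only that one and the same byte means something different in each.

```python
# One byte value above 127, read through three different 8-bit character sets.
raw = bytes([0xE9])

western = raw.decode("latin-1")      # ISO 8859-1 (Western European)
cyrillic = raw.decode("cp1251")      # Windows-1251 (Cyrillic)
greek = raw.decode("iso8859_7")      # ISO 8859-7 (Greek)

print(western, cyrillic, greek)

# Plain 7-bit ASCII assigns no character at all to this byte.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

A text written under one of these sets and opened under another turns into the wrong letters without any warning, because nothing in the bytes themselves says which set was intended.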

Indeed, it was while developing a Japanese Kanji-enabled Macintosh computer in 1985 that Unicode President and co-founder Mark Davis first realized the need for a much larger, comprehensive encoding standard. In 1987, Davis met with researchers from Xerox who were working on multilingual character encoding. He joined with two of them, Joe Becker and Lee Collins, and together the three began the Unicode project (http://www.unicode.org/history/earlyyears.html). In 1991, the Unicode Consortium was officially incorporated (ibid.), and in 1993 the Unicode standard replaced ASCII for the first time in an operating system, Windows NT version 3.1 (http://support.microsoft.com/kb/99884).

As is well known, advances in hardware have meant that the memory constraints that limited ASCII to 7 bits in 1963 are, thankfully, moot. Unicode was originally developed as a 16-bit standard, which allows for 65,536 unique code points without the need for extension into other “planes”; in today's UTF-16, characters beyond that range are encoded as pairs of 16-bit units. The standard also includes a variable-length 8-bit encoding (UTF-8) and a fixed-width 32-bit encoding (UTF-32). Today, systems based on Windows NT (e.g. XP, Vista, Windows 7) and Mac OS X use UTF-16, and many UNIX-based systems and a majority of websites use UTF-8 (http://trends.builtwith.com/encoding/UTF-8).
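The difference between the three encodings is easiest to see by counting bytes. The following Python sketch encodes four sample characters (picked here for illustration) in each form and records how much space each one takes; the `-be` codec variants are used only so that no byte-order mark is added to the count.

```python
# A Latin letter, an accented letter, a common CJK character,
# and a rarer CJK character from outside the first 65,536 code points.
samples = ["A", "\u00e9", "\u540d", "\U00020000"]

sizes = {}
for ch in samples:
    sizes[ch] = {
        "utf-8": len(ch.encode("utf-8")),       # one to four bytes per character
        "utf-16": len(ch.encode("utf-16-be")),  # two bytes, or four for a pair of units
        "utf-32": len(ch.encode("utf-32-be")),  # always four bytes
    }
    print(f"U+{ord(ch):04X}", sizes[ch])
```

UTF-8 spends a single byte on ASCII text and grows as needed, UTF-16 spends two bytes on most characters in common use, and UTF-32 trades space for a uniform width.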

Now, with hardware limitations no longer an issue, Unicode offers a practical and comprehensive character encoding standard.

“The majority of common-use characters fit into the first 64K code points, an area of the codespace that is called the basic multilingual plane, or BMP for short. There are sixteen other supplementary planes available for encoding other characters, with currently over 860,000 unused code points. More characters are under consideration for addition to future versions of the standard” (http://www.unicode.org/standard/principles.html).
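The arithmetic behind the planes described in this passage can be sketched as follows. The constant and function names are hypothetical, introduced only for this illustration.

```python
import sys

PLANE_SIZE = 0x10000   # 65,536 code points per plane (the "64K" of the BMP)
PLANE_COUNT = 17       # the BMP plus sixteen supplementary planes

# Total size of the Unicode codespace.
total_code_points = PLANE_SIZE * PLANE_COUNT


def plane_of(ch: str) -> int:
    """Return the plane number (0 = BMP) that a single character belongs to."""
    return ord(ch) // PLANE_SIZE


print(total_code_points)
print(plane_of("\u540d"))       # a common CJK character
print(plane_of("\U00020000"))   # a rarer CJK character in a supplementary plane
```

Python itself exposes the highest code point as `sys.maxunicode`, which is one less than the total computed here.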

In other words, there is plenty of space in the Unicode standard to handle all of the world’s written languages.

The Unicode Consortium has brought all the major languages written today, and many less-common and ancient ones, into a single standard, thus allowing humanities researchers in many fields and areas of study to read electronic texts as though they weren't just strings of ones and zeros, blissfully unaware of the jabberwocky behind the screen.

Scott Paul McGinnis is a Graduate Student Researcher at the Townsend Center for the Humanities.

TOWNSEND CENTER FOR THE HUMANITIES