The GNU C Library - Multibyte Char Intro

Go to the first, previous, next, last section, table of contents.

Multibyte Characters

In the ordinary ASCII code, a sequence of characters is a sequence of bytes, and each character is one byte. This is very simple, but allows for only 256 distinct characters.

In a multibyte character code, a sequence of characters is a sequence of bytes, but each character may occupy one or more consecutive bytes of the sequence.

There are many different ways of designing a multibyte character code; different systems use different codes. To specify a particular code means designating the basic byte sequences--those which represent a single character--and what characters they stand for. A code that a computer can actually use must have a finite number of these basic sequences, and typically none of them is more than a few characters long.

These sequences need not all have the same length. In fact, many of them are just one byte long. Because the basic ASCII characters in the range from 0 to 0177 are so important, they stand for themselves in all multibyte character codes. That is to say, a byte whose value is 0 through 0177 is always a character in itself. The characters which are more than one byte must always start with a byte in the range from 0200 through 0377.

The byte value 0 can be used to terminate a string, just as it is often used in a string of ASCII characters.

Specifying the basic byte sequences that represent single characters automatically gives meanings to many longer byte sequences, as more than one character. For example, if the two byte sequence 0205 049 stands for the Greek letter alpha, then 0205 049 065 must stand for an alpha followed by an `A' (ASCII code 065), and 0205 049 0205 049 must stand for two alphas in a row.

If any byte sequence can have more than one meaning as a sequence of characters, then the multibyte code is ambiguous--and no good. The codes that systems actually use are all unambiguous.

In most codes, there are certain sequences of bytes that have no meaning as a character or characters. These are called invalid.

The simplest possible multibyte code is a trivial one:

The basic sequences consist of single bytes.

This particular code is equivalent to not using multibyte characters at all. It has no invalid sequences. But it can handle only 256 different characters.

Here is another possible code which can handle 9376 different characters:

The basic sequences consist of

single bytes with values in the range 0 through 0237.
two-byte sequences, in which both of the bytes have values in the range from 0240 through 0377.

This code or a similar one is used on some systems to represent Japanese characters. The invalid sequences are those which consist of an odd number of consecutive bytes in the range from 0240 through 0377.

Here is another multibyte code which can handle more distinct extended characters--in fact, almost thirty million:

The basic sequences consist of

single bytes with values in the range 0 through 0177.
sequences of up to four bytes in which the first byte is in the range from 0200 through 0237, and the remaining bytes are in the range from 0240 through 0377.

In this code, any sequence that starts with a byte in the range from 0240 through 0377 is invalid.

And here is another variant which has the advantage that removing the last byte or bytes from a valid character can never produce another valid character. (This property is convenient when you want to search strings for particular characters.)

The basic sequences consist of

single bytes with values in the range 0 through 0177.
two-byte sequences in which the first byte is in the range from 0200 through 0207, and the second byte is in the range from 0240 through 0377.
three-byte sequences in which the first byte is in the range from 0210 through 0217, and the other bytes are in the range from 0240 through 0377.
four-byte sequences in which the first byte is in the range from 0220 through 0227, and the other bytes are in the range from 0240 through 0377.

The list of invalid sequences for this code is long and not worth stating in full; examples of invalid sequences include 0240 and 0220 0300 065.

The number of possible multibyte codes is astronomical. But a given computer system will support at most a few different codes. (One of these codes may allow for thousands of different characters.) Another computer system may support a completely different code. The library facilities described in this chapter are helpful because they package up the knowledge of the details of a particular computer system's multibyte code, so your programs need not know them.

You can use special standard macros to find out the maximum possible number of bytes in a character in the currently selected multibyte code with MB_CUR_MAX, and the maximum for any multibyte code supported on your computer with MB_LEN_MAX.

Macro: int MB_LEN_MAX: This is the maximum length of a multibyte character for any supported locale. It is defined in `limits.h'.

Macro: int MB_CUR_MAX

This macro expands into a (possibly non-constant) positive integer expression that is the maximum number of bytes in a multibyte character in the current locale. The value is never greater than MB_LEN_MAX.

MB_CUR_MAX is defined in `stdlib.h'.

Normally, each basic sequence in a particular character code stands for one character, the same character regardless of context. Some multibyte character codes have a concept of shift state; certain codes, called shift sequences, change to a different shift state, and the meaning of some or all basic sequences varies according to the current shift state. In fact, the set of basic sequences might even be different depending on the current shift state. See section Multibyte Codes Using Shift Sequences, for more information on handling this sort of code.

What happens if you try to pass a string containing multibyte characters to a function that doesn't know about them? Normally, such a function treats a string as a sequence of bytes, and interprets certain byte values specially; all other byte values are "ordinary". As long as a multibyte character doesn't contain any of the special byte values, the function should pass it through as if it were several ordinary characters.

For example, let's figure out what happens if you use multibyte characters in a file name. The functions such as open and unlink that operate on file names treat the name as a sequence of byte values, with `/' as the only special value. Any other byte values are copied, or compared, in sequence, and all byte values are treated alike. Thus, you may think of the file name as a sequence of bytes or as a string containing multibyte characters; the same behavior makes sense equally either way, provided no multibyte character contains a `/'.

Go to the first, previous, next, last section, table of contents.