In the ordinary ASCII code, a sequence of characters is a sequence of bytes, and each character is one byte. This is very simple, but a one-byte code allows for at most 256 distinct characters.
In a multibyte character code, a sequence of characters is a sequence of bytes, but each character may occupy one or more consecutive bytes of the sequence.
There are many different ways of designing a multibyte character code; different systems use different codes. To specify a particular code means designating the basic byte sequences--those which represent a single character--and what characters they stand for. A code that a computer can actually use must have a finite number of these basic sequences, and typically none of them is more than a few bytes long.
These sequences need not all have the same length. In fact, many of them are just one byte long. Because the basic ASCII characters in the range from 0 to 0177 are so important, they stand for themselves in all multibyte character codes. That is to say, a byte whose value is 0 through 0177 is always a character in itself. The characters which are more than one byte must always start with a byte in the range from 0200 through 0377.
The byte value 0 can be used to terminate a string, just as it is often used in a string of ASCII characters.
Specifying the basic byte sequences that represent single characters automatically gives meanings to many longer byte sequences, as more than one character. For example, if the two-byte sequence 0205 061 stands for the Greek letter alpha, then 0205 061 0101 must stand for an alpha followed by an `A' (ASCII code 0101), and 0205 061 0205 061 must stand for two alphas in a row.
If any byte sequence can have more than one meaning as a sequence of characters, then the multibyte code is ambiguous--and no good. The codes that systems actually use are all unambiguous.
In most codes, there are certain sequences of bytes that have no meaning as a character or characters. These are called invalid.
The simplest possible multibyte code is a trivial one:
The basic sequences consist of single bytes.
This particular code is equivalent to not using multibyte characters at all. It has no invalid sequences. But it can handle only 256 different characters.
Here is another possible code which can handle 9376 different characters (160 single bytes plus 96 times 96, or 9216, two-byte sequences):
The basic sequences consist of
- single bytes with values in the range 0 through 0237.
- two-byte sequences, in which both of the bytes have values in the range from 0240 through 0377.
This code or a similar one is used on some systems to represent Japanese characters. The invalid sequences are those which consist of an odd number of consecutive bytes in the range from 0240 through 0377.
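As a rough illustration of how a program would have to walk a string in this code if it did the work by hand, here is a minimal sketch. The function name count_chars_two_range is hypothetical, and the byte ranges are taken directly from the description above; it is not how any particular system implements its Japanese code.

```c
#include <stddef.h>

/* Count the characters in the NUL-terminated string S, assuming the
   code described above: single bytes 0 through 0237, and two-byte
   sequences with both bytes in 0240 through 0377.  Return
   (size_t) -1 if S contains an invalid sequence.  */
size_t
count_chars_two_range (const unsigned char *s)
{
  size_t count = 0;

  while (*s != 0)
    {
      if (*s <= 0237)
        s += 1;                 /* Single-byte character.  */
      else if (s[1] >= 0240)
        s += 2;                 /* Two bytes, both in the high range.  */
      else
        return (size_t) -1;     /* Unpaired high byte: invalid.  */
      count++;
    }
  return count;
}
```

For example, the four-byte string 0244 0242 `x' `y' counts as three characters, while 0244 `x' is rejected because the high byte has no partner.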
Here is another multibyte code which can handle more distinct extended characters--in fact, almost thirty million:
The basic sequences consist of
- single bytes with values in the range 0 through 0177.
- sequences of up to four bytes in which the first byte is in the range from 0200 through 0237, and the remaining bytes are in the range from 0240 through 0377.
In this code, any sequence that starts with a byte in the range from 0240 through 0377 is invalid.
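The figure of almost thirty million can be checked with a little arithmetic. The sketch below assumes that a first byte may be followed by zero, one, two or three remaining bytes, which is one reading of "up to four bytes"; the function name count_code_points is hypothetical.

```c
/* Count the basic sequences of the code described above: 128 single
   bytes, plus sequences of one to four bytes whose first byte is in
   0200 through 0237 and whose remaining bytes are in 0240 through
   0377.  */
unsigned long
count_code_points (void)
{
  unsigned long lead = 0237 - 0200 + 1;    /* 32 possible first bytes.  */
  unsigned long trail = 0377 - 0240 + 1;   /* 96 possible remaining bytes.  */
  unsigned long total = 0177 + 1;          /* 128 single-byte characters.  */
  unsigned long tails = 1;                 /* Tails possible at this length.  */
  int len;

  for (len = 1; len <= 4; len++)
    {
      total += lead * tails;
      tails *= trail;
    }
  return total;
}
```

Under that assumption the function returns 28609696; leaving out the one-byte sequences in the range 0200 through 0237 lowers the total by only 32.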
And here is another variant which has the advantage that removing the last byte or bytes from a valid character can never produce another valid character. (This property is convenient when you want to search strings for particular characters.)
The basic sequences consist of
- single bytes with values in the range 0 through 0177.
- two-byte sequences in which the first byte is in the range from 0200 through 0207, and the second byte is in the range from 0240 through 0377.
- three-byte sequences in which the first byte is in the range from 0210 through 0217, and the other bytes are in the range from 0240 through 0377.
- four-byte sequences in which the first byte is in the range from 0220 through 0227, and the other bytes are in the range from 0240 through 0377.
The list of invalid sequences for this code is long and not worth stating in full; examples of invalid sequences include 0240 and 0220 0300 065.
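Because the first byte alone determines the length of a character in this variant, recognizing a character is straightforward. The following is a minimal sketch based only on the ranges listed above; the function name char_length_prefix_free is hypothetical.

```c
#include <stddef.h>

/* Return the length in bytes of the character that starts at S in
   the variant code described above, or 0 if the bytes at S do not
   form a valid character.  S must point into a NUL-terminated
   string; a NUL byte at S itself is reported as a one-byte
   character, so the caller should test for the terminator first.  */
size_t
char_length_prefix_free (const unsigned char *s)
{
  size_t len, i;

  if (*s <= 0177)
    return 1;                   /* ASCII byte stands for itself.  */
  else if (*s <= 0207)
    len = 2;
  else if (*s <= 0217)
    len = 3;
  else if (*s <= 0227)
    len = 4;
  else
    return 0;                   /* 0230 through 0377 cannot start a character.  */

  for (i = 1; i < len; i++)
    if (s[i] < 0240)
      return 0;                 /* Bad trailing byte, or the string ended early.  */
  return len;
}
```

Both invalid examples given above are rejected by this function: 0240 cannot start a character, and in 0220 0300 065 the third byte is outside the range 0240 through 0377.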
The number of possible multibyte codes is astronomical. But a given computer system will support at most a few different codes. (One of these codes may allow for thousands of different characters.) Another computer system may support a completely different code. The library facilities described in this chapter are helpful because they package up the knowledge of the details of a particular computer system's multibyte code, so your programs need not know them.
Two standard macros report the relevant limits: MB_CUR_MAX gives the maximum possible number of bytes in a character in the currently selected multibyte code, and MB_LEN_MAX gives the maximum for any multibyte code supported on your computer.
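Here is a small sketch of how the two macros might be used together. MB_CUR_MAX is declared in stdlib.h and MB_LEN_MAX in limits.h; the helper name max_bytes_current_code is hypothetical, and the call to setlocale assumes you want the multibyte code named by the environment.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <locale.h>   /* setlocale */
#include <stdlib.h>   /* MB_CUR_MAX */

/* Select the multibyte code named by the environment and return the
   maximum number of bytes one character can occupy in it.  */
size_t
max_bytes_current_code (void)
{
  /* If the requested locale is unavailable, setlocale fails and the
     current locale stays in effect.  */
  setlocale (LC_CTYPE, "");
  return MB_CUR_MAX;
}
```

MB_CUR_MAX depends on the selected locale and is not a compile-time constant, so a fixed-size buffer meant to hold one character is normally declared with MB_LEN_MAX instead, as in char buf[MB_LEN_MAX]. The value of MB_CUR_MAX is never greater than MB_LEN_MAX.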
Normally, each basic sequence in a particular character code stands for one character, the same character regardless of context. Some multibyte character codes have a concept of shift state; certain codes, called shift sequences, change to a different shift state, and the meaning of some or all basic sequences varies according to the current shift state. In fact, the set of basic sequences might even be different depending on the current shift state. See section Multibyte Codes Using Shift Sequences, for more information on handling this sort of code.
What happens if you try to pass a string containing multibyte characters to a function that doesn't know about them? Normally, such a function treats a string as a sequence of bytes, and interprets certain byte values specially; all other byte values are "ordinary". As long as a multibyte character doesn't contain any of the special byte values, the function should pass it through as if it were several ordinary characters.
For example, let's figure out what happens if you use multibyte characters in a file name. Functions such as open and unlink that operate on file names treat the name as a sequence of byte values, with `/' as the only special value. Any other byte values are copied, or compared, in sequence, and all byte values are treated alike. Thus, you may think of the file name as a sequence of bytes or as a string containing multibyte characters; the same behavior makes sense equally either way, provided no multibyte character contains a `/'.
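The same reasoning applies to your own byte-oriented code. The sketch below uses the standard function strrchr to find the last component of a file name; the helper name last_component is hypothetical. It knows nothing about multibyte characters, yet it gives a sensible answer as long as no multibyte character contains a `/' byte, which holds for the example codes above because their trailing bytes all lie in the range 0240 through 0377.

```c
#include <string.h>

/* Return a pointer to the last component of PATH, treating PATH as
   plain bytes with `/' as the only special value.  */
const char *
last_component (const char *path)
{
  const char *slash = strrchr (path, '/');

  return slash != NULL ? slash + 1 : path;
}
```

Given the name "dir/" followed by the bytes 0244 0242 and ".txt", the function returns a pointer to the 0244 byte, so the two-byte character is passed through intact.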