Managing character sets and encodings


There are many languages in use throughout the world, and they use many different character sets. There are also many ways of encoding character sets into binary formats of bytes. This chapter considers some of the issues involved.




Once upon a time there was EBCDIC and ASCII... Actually, it was never that simple and has just become more complex over time. There is light on the horizon, but some estimates are that it may be 50 years before we all live in the daylight on this!


Early computers were developed in the English-speaking countries of the US, the UK and Australia. As a result of this, assumptions were made about the language and character sets in use. Basically, the Latin alphabet was used, plus numerals, punctuation characters and a few others. These were then encoded into bytes using ASCII or EBCDIC.


The character-handling mechanisms were based on this: text files and I/O consisted of a sequence of bytes, with each byte representing a single character. String comparison could be done by matching corresponding bytes; conversions from upper to lower case could be done by mapping individual bytes, and so on.


There are about 6,000 living languages in the world (more than 800 of them in Papua New Guinea!). A few languages use the "English" characters but most do not. The Romance languages such as French have adornments on various characters, so that you can write "j'ai arrêté", with two differently accented vowels. Similarly, the Germanic languages have extra characters such as 'ß'. Even UK English has characters not in the standard ASCII set: the pound symbol '£' and recently the euro '€'.

But the world is not restricted to variations on the Latin alphabet. Thailand has its own alphabet, with words looking like this: "ภาษาไทย". There are many other alphabets, and Japanese even has two, Hiragana and Katakana.


There are also the ideographic languages such as Chinese where you can write "百度一下,你就知道".


It would be nice from a technical viewpoint if the world just used ASCII. However, the trend is in the opposite direction, with more and more users demanding that software use the language that they are familiar with. If you build an application that can be run in different countries then users will demand that it uses their own language. In a distributed system, different components of the system may be used by users expecting different languages and characters.


Internationalisation (i18n) is how you write your applications so that they can handle the variety of languages and cultures. Localisation (l10n) is the process of customising your internationalised application to a particular cultural group.


i18n and l10n are big topics in themselves. For example, they cover issues such as colours: while white means "purity" in Western cultures, it means "death" to the Chinese and "joy" to Egyptians. In this chapter we just look at issues of character handling.




It is important to be careful about exactly what part of a text handling system you are talking about. Here is a set of definitions that have proven useful.




A character is a "unit of information that roughly corresponds to a grapheme (written symbol) of a natural language, such as a letter, numeral, or punctuation mark" (Wikipedia). A character is "the smallest component of written language that has a semantic value" (Unicode). This includes letters such as 'a' and 'À' (or letters in any other language), digits such as '2', punctuation characters such as ',' and various symbols such as the English pound currency symbol '£'.


A character is some sort of abstraction of any actual symbol: the character 'a' is to any written 'a' as a Platonic circle is to any actual circle. The concept of character also includes control characters, which do not correspond to natural language symbols but to other bits of information used to process texts of the language.


A character does not have any particular appearance, although we use the appearance to help recognise the character. However, even the appearance may have to be understood in a context: in mathematics, if you see the symbol π (pi) it is the character for the ratio of circumference to diameter of a circle, while if you are reading Greek text, it is the sixteenth letter of the alphabet: "προσ" is the Greek word for "with" and has nothing to do with 3.14159...

Character repertoire/character set


A character repertoire is a set of distinct characters, such as the Latin alphabet. No particular ordering is assumed. In English, although we say that 'a' is earlier in the alphabet than 'z', we wouldn't say that 'a' is less than 'z'. The "phone book" ordering which puts "McPhee" before "MacRea" shows that "alphabetic ordering" isn't critical to the characters.


A repertoire specifies the names of the characters and often a sample of how the characters might look. For example, the letter 'a' might be shown in several different typefaces. But the repertoire doesn't force the characters to look like that: these are just samples. The repertoire may make distinctions such as upper and lower case, so that 'a' and 'A' are different. But it may regard them as the same, just with different sample appearances (just as some programming languages, such as Go, treat upper and lower case as different, while others, such as Basic, do not). On the other hand, a repertoire might contain different characters with the same sample appearance: the repertoire for a Greek mathematician would have two different characters with appearance π. This is also called a noncoded character set.


Character code


A character code is a mapping from characters to integers. The mapping for a character set is also called a coded character set or code set. The value of each character in this mapping is often called a code point. ASCII is a code set. The codepoint for 'a' is 97 and for 'A' is 65 (decimal).
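As a minimal sketch of the idea, Go's rune type holds a code point directly, so the mapping from character to integer is just a conversion. The helper name codePoint is made up for this illustration.

```go
package main

import "fmt"

// codePoint returns the integer code point of a character.
// A Go rune already is that integer, so this is only a type conversion.
// (codePoint is a hypothetical helper, not part of any library.)
func codePoint(c rune) int {
	return int(c)
}

func main() {
	for _, c := range []rune{'a', 'A'} {
		fmt.Printf("%c has code point %d\n", c, codePoint(c))
	}
}
```

For the ASCII characters 'a' and 'A' this reports 97 and 65, matching the values quoted above.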


The character code is still an abstraction. It isn't yet what we will see in text files, or in TCP packets. However, it is getting close, as it supplies the mapping from human-oriented concepts into numerical ones.


Character encoding


To communicate or store a character you need to encode it in some way. To transmit a string, you need to encode all characters in the string. There are many possible encodings for any code set.


For example, 7-bit ASCII code points can be encoded as themselves into 8-bit bytes (octets). So ASCII 'A' (with code point 65) is encoded as the 8-bit octet 01000001. However, a different encoding would be to use the top bit for parity checking, e.g. with odd parity ASCII 'A' would be the octet 11000001. Some protocols such as Sun's XDR use 32-bit word-length encoding. ASCII 'A' would be encoded as 00000000 00000000 00000000 01000001.

The character encoding is where we function at the programming level. Our programs deal with encoded characters. It obviously makes a difference whether we are dealing with 8-bit characters with or without parity checking, or with 32-bit characters.
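The three encodings of 'A' mentioned above can be sketched in a few lines of Go. The function names oddParity and word32 are invented for this example, and the 32-bit form simply uses big-endian byte order, which is what XDR specifies.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// oddParity takes a 7-bit ASCII code point and sets the top bit when
// needed so that the octet contains an odd number of 1 bits.
// (Hypothetical helper for illustration.)
func oddParity(c byte) byte {
	c &= 0x7f
	if bits.OnesCount8(c)%2 == 0 {
		c |= 0x80
	}
	return c
}

// word32 encodes the code point as a 32-bit big-endian word.
// (Hypothetical helper for illustration.)
func word32(c byte) []byte {
	buf := make([]byte, 4)
	binary.BigEndian.PutUint32(buf, uint32(c))
	return buf
}

func main() {
	fmt.Printf("plain octet: %08b\n", 'A')
	fmt.Printf("odd parity:  %08b\n", oddParity('A'))
	fmt.Printf("32-bit word: %08b\n", word32('A'))
}
```

The same code point, 65, ends up as three different byte patterns, which is why a receiver has to know which encoding the sender used.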


The encoding extends to strings of characters. A word-length even parity encoding of "ABC" might be 10000000 (parity bit in high byte) 01000011 (C) 01000010 (B) 01000001 (A in low byte). The comments about the importance of an encoding apply equally strongly to strings, where the rules may be different.


Transport encoding


A character encoding will suffice for handling characters within a single application. However, once you start sending text between applications, then there is the further issue of how the bytes, shorts or words are put on the wire. An encoding can be based on space-and hence bandwidth-saving techniques such as zip'ping the text. Or it could be reduced to a 7-bit format to allow a parity checking bit, such as base64.
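To make the layering concrete, here is a small sketch that applies two transport encodings on top of a UTF-8 character encoding: gzip for compression, then base64 so that only 7-bit ASCII characters go on the wire. The function names pack and unpack are assumptions made for this example; the library calls are from the standard compress/gzip and encoding/base64 packages.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/base64"
	"fmt"
	"io"
)

// pack gzips the text and then base64-encodes the compressed bytes.
func pack(text string) (string, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(text)); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// unpack reverses pack: base64-decode first, then gunzip.
func unpack(packed string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(packed)
	if err != nil {
		return "", err
	}
	zr, err := gzip.NewReader(bytes.NewReader(raw))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	text, err := io.ReadAll(zr)
	if err != nil {
		return "", err
	}
	return string(text), nil
}

func main() {
	packed, err := pack("j'ai arr\u00eat\u00e9")
	if err != nil {
		fmt.Println("pack failed:", err)
		return
	}
	fmt.Println("on the wire:", packed)

	text, err := unpack(packed)
	if err != nil {
		fmt.Println("unpack failed:", err)
		return
	}
	fmt.Println("decoded:    ", text)
}
```

The receiver must undo the transport encodings in the reverse order before it can apply the character encoding.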


If we do know the character and transport encoding, then it is a matter of programming to manage characters and strings. If we don't know the character or transport encoding then it is a matter of guesswork as to what to do with any particular string. There is no convention for files to signal the character encoding.


There is however a convention for signalling encoding in text transmitted across the internet. It is simple: the header of a text message contains information about the encoding. For example, an HTTP header can contain lines such as


Content-Type: text/html; charset=ISO-8859-4

Content-Encoding: gzip


which says that the character set is ISO 8859-4 (corresponding to certain countries in Europe) with the default encoding, but then gzipped. The second part - content encoding - is what we are referring to as "transfer encoding" (IETF RFC 2130).

But how do you read this information? Isn't it encoded? Don't we have a chicken and egg situation? Well, no. The convention is that such information is given in ASCII (to be precise, US ASCII) so that a program can read the headers and then adjust its encoding for the rest of the document.
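A program can pull the character set out of such a header with the standard mime package. This is only a sketch: the helper name charsetOf is made up, and a real client would go on to choose a decoder based on the result.

```go
package main

import (
	"fmt"
	"mime"
)

// charsetOf extracts the charset parameter from a Content-Type header
// value. It returns an empty string if the header does not name one.
// (Hypothetical helper for illustration.)
func charsetOf(contentType string) (string, error) {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "", err
	}
	return params["charset"], nil
}

func main() {
	cs, err := charsetOf("text/html; charset=ISO-8859-4")
	if err != nil {
		fmt.Println("bad header:", err)
		return
	}
	fmt.Println("charset:", cs)
}
```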




ASCII has the repertoire of the English characters plus digits, punctuation and some control characters. The code points for ASCII are given by the familiar table


       Oct   Dec   Hex   Char           Oct   Dec   Hex   Char


       000   0     00    NUL '\0'       100   64    40    @

       001   1     01    SOH            101   65    41    A

       002   2     02    STX            102   66    42    B

       003   3     03    ETX            103   67    43    C

       004   4     04    EOT            104   68    44    D

       005   5     05    ENQ            105   69    45    E

       006   6     06    ACK            106   70    46    F

       007   7     07    BEL '\a'       107   71    47    G

       010   8     08    BS  '\b'       110   72    48    H

       011   9     09    HT  '\t'       111   73    49    I

       012   10    0A    LF  '\n'       112   74    4A    J

       013   11    0B    VT  '\v'       113   75    4B    K

       014   12    0C    FF  '\f'       114   76    4C    L

       015   13    0D    CR  '\r'       115   77    4D    M

       016   14    0E    SO             116   78    4E    N

       017   15    0F    SI             117   79    4F    O

       020   16    10    DLE            120   80    50    P

       021   17    11    DC1            121   81    51    Q

       022   18    12    DC2            122   82    52    R

       023   19    13    DC3            123   83    53    S

       024   20    14    DC4            124   84    54    T

       025   21    15    NAK            125   85    55    U

       026   22    16    SYN            126   86    56    V

       027   23    17    ETB            127   87    57    W

       030   24    18    CAN            130   88    58    X

       031   25    19    EM             131   89    59    Y

       032   26    1A    SUB            132   90    5A    Z

       033   27    1B    ESC            133   91    5B    [

       034   28    1C    FS             134   92    5C    \   '\\'

       035   29    1D    GS             135   93    5D    ]

       036   30    1E    RS             136   94    5E    ^

       037   31    1F    US             137   95    5F    _

       040   32    20    SPACE          140   96    60    `

       041   33    21    !              141   97    61    a

       042   34    22    "              142   98    62    b

       043   35    23    #              143   99    63    c

       044   36    24    $              144   100   64    d

       045   37    25    %              145   101   65    e

       046   38    26    &              146   102   66    f

       047   39    27    '              147   103   67    g

       050   40    28    (              150   104   68    h

       051   41    29    )              151   105   69    i

       052   42    2A    *              152   106   6A    j

       053   43    2B    +              153   107   6B    k

       054   44    2C    ,              154   108   6C    l

       055   45    2D    -              155   109   6D    m

       056   46    2E    .              156   110   6E    n

       057   47    2F    /              157   111   6F    o

       060   48    30    0              160   112   70    p

       061   49    31    1              161   113   71    q

       062   50    32    2              162   114   72    r

       063   51    33    3              163   115   73    s

       064   52    34    4              164   116   74    t

       065   53    35    5              165   117   75    u

       066   54    36    6              166   118   76    v

       067   55    37    7              167   119   77    w

       070   56    38    8              170   120   78    x

       071   57    39    9              171   121   79    y

       072   58    3A    :              172   122   7A    z

       073   59    3B    ;              173   123   7B    {

       074   60    3C    <              174   124   7C    |

       075   61    3D    =              175   125   7D    }

       076   62    3E    >              176   126   7E    ~

       077   63    3F    ?              177   127   7F    DEL


The most common encoding for ASCII uses the code points as 7-bit bytes, so that the encoding of 'A' for example is 65.


This set is actually US ASCII. Due to European desires for accented characters, some punctuation characters are omitted to form a minimal set, ISO 646, while there are "national variants" with suitable European characters. The page by Jukka Korpela has more information for those interested. We shall not need these variants though.


ISO 8859


Octets are now the standard size for bytes. This allows 128 extra code points for extensions to ASCII. The ISO 8859 series is a family of code sets that capture the repertoires of various groups of European languages. ISO 8859-1 is also known as Latin-1 and covers many languages in western Europe, while others in this series cover the rest of Europe and even Hebrew, Arabic and Thai. For example, ISO 8859-5 includes the Cyrillic characters of countries such as Russia, while ISO 8859-8 includes the Hebrew alphabet.

The standard encoding for these character sets is to use their code point as an 8-bit value. For example, the character 'Á' in ISO 8859-1 has the code point 193 and is encoded as 193. All of the ISO 8859 series have the bottom 128 values identical to ASCII, so that the ASCII characters are the same in all of these sets.


The HTML specifications used to recommend the ISO 8859-1 character set. HTML 3.2 was the last one to do so, and after that HTML 4.0 recommended Unicode. In 2010 Google estimated that, of the pages it sees, about 20% were still in ISO 8859 format while 20% were still in ASCII ("Unicode nearing 50% of the web").



Neither ASCII nor ISO 8859 covers the languages based on ideographs. Chinese is estimated to have about 20,000 separate characters, with about 5,000 in common use. These need more than a byte, and typically two bytes have been used. There have been many of these two-byte character sets: Big5, EUC-TW, GB2312 and GBK/GBX for Chinese, JIS X 0208 for Japanese, and so on. These encodings are generally not mutually compatible.

Unicode is an embracing standard character set intended to cover all major character sets in use. It includes European, Asian, Indian and many more. At the time of writing it is up to version 5.2 and has over 107,000 characters. The number of code points now exceeds 65,536, that is, more than 2^16. This has implications for character encodings.


The first 256 code points correspond to ISO 8859-1, with US ASCII as the first 128. There is thus a backward compatibility with these major character sets, as the code points for ISO 8859-1 and ASCII are exactly the same in Unicode. The same is not true for other character sets: for example, while most of the Big5 characters are also in Unicode, the code points are not the same. The page contains one example of a (large) table mapping from Big5 to Unicode.

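To represent Unicode characters in a computer system, an encoding must be used. The encoding UCS-2 is a two-byte encoding using the code point values of the Unicode characters. However, since there are now too many characters in Unicode to fit them all into 2 bytes, this encoding is obsolete and no longer used. Instead there are UTF-32, which uses four bytes for every character; UTF-16, which uses two bytes for the most common characters and four bytes for the rest; and UTF-8, which uses between one and four bytes per character.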


UTF-8, Go and runes


UTF-8 is the most commonly used encoding. Google estimates that 50% of the pages that it sees are encoded in UTF-8. The ASCII set has the same encoding values in UTF-8, so a UTF-8 reader can read text consisting of just ASCII characters as well as text from the full Unicode set.


Go uses UTF-8 encoded characters in its strings. Each character is of type rune. This is an alias for int32, which is large enough to hold any Unicode code point; in the UTF-8 encoding a character can occupy 1, 2, 3 or 4 bytes. In terms of characters, a string is an array of runes.


A string is also an array of bytes, but you have to be careful: only for the ASCII subset is a byte equal to a character. All other characters occupy two, three or four bytes. This means that the length of a string in characters (runes) is generally not the same as the length of its byte array. They are only equal when the string consists of ASCII characters only.
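The standard unicode/utf8 package can report how many bytes a given character needs. The following sketch picks one sample character for each possible width; the helper name width is invented for the example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// width reports how many bytes the UTF-8 encoding of r occupies.
// (Hypothetical wrapper around utf8.RuneLen for illustration.)
func width(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// One sample character for each possible UTF-8 width:
	// ASCII, accented Latin, Chinese, and a character beyond U+FFFF.
	for _, r := range []rune{'A', '\u00ea', '\u767e', '\U0001F600'} {
		fmt.Printf("%U needs %d byte(s) in UTF-8\n", r, width(r))
	}
}
```

Only the first of these is in the ASCII subset, so it is the only one for which a byte and a character coincide.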


The following program fragment illustrates this. If you take a UTF-8 string and test its length, you get the length of the underlying byte array. But if you convert the string to an array of runes []rune then you get an array of the Unicode code points, whose length is generally the number of characters:


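str := "百度一下,你就知道" // eight Chinese characters plus a full-width comma

println("String length", len([]rune(str))) // number of runes (characters)

println("Byte length", len(str)) // number of bytes in the UTF-8 encoding


prints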




String length 9

Byte length 27


UTF-8 client and server


Possibly surprisingly, you need do nothing special to handle UTF-8 text in either the client or the server. The underlying data type for a UTF-8 string in Go is a byte array, and as we saw just above, Go looks after encoding the string into 1, 2, 3 or 4 bytes as needed. The length of the string is the length of the byte array, so you write any UTF-8 string by writing the byte array.


Similarly to read a string, you just read into a byte array and then convert the array to a string using string([]byte). The conversion copies the bytes without checking them, so the length of the string is the number of bytes read. If Go later cannot properly decode some of the bytes into Unicode characters, for example when you range over the string or convert it to []rune, then it gives the Unicode Replacement Character \uFFFD for each invalid byte.


So the clients and servers given in earlier chapters work perfectly well with UTF-8 encoded text.


ASCII client and server


The ASCII characters have the same encoding in ASCII and in UTF-8. So ordinary UTF-8 character handling works fine for ASCII characters. No special handling needs to be done.


UTF-16 and Go


UTF-16 deals with arrays of short 16-bit unsigned integers. The package utf16 is designed to manage such arrays. To convert a normal Go string, that is a UTF-8 string, into UTF-16, you first extract the code points by coercing it into a []rune and then use utf16.Encode to produce an array of type uint16.


Similarly, to decode an array of unsigned short UTF-16 values into a Go string, you use utf16.Decode to convert it into code points as type []rune and then to a string. The following code fragment illustrates this


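str := "百度一下,你就知道"

shorts := utf16.Encode([]rune(str)) // UTF-8 string -> code points -> UTF-16 values

runes := utf16.Decode(shorts) // UTF-16 values -> code points

str = string(runes) // code points -> UTF-8 string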


These type conversions need to be applied by clients or servers as appropriate, to read and write 16-bit short integers, as shown below.


Little-endian and big-endian


Unfortunately, there is a little devil lurking behind UTF-16. It is basically an encoding of characters into 16-bit short integers. The big question is: for each short, how is it written as two bytes? The top one first, or the top one second? Either way is fine, as long as the receiver uses the same convention as the sender.


Unicode has addressed this with a special character known as the BOM (byte order mark). This is a zero-width non-printing character, so you never see it in text. But its code point U+FEFF is chosen so that you can tell the byte order: the byte-swapped value 0xFFFE is guaranteed never to be a character. If the first two bytes are FE FF then the text is big-endian; if they are FF FE then it is little-endian.


Text will sometimes place the BOM as the first character in the text. The reader can then examine these two bytes to determine what endian-ness has been used.


UTF-16 client and server


Using the BOM convention, we can write a server that prepends a BOM and writes a string in UTF-16 as


/* UTF16 Server
 */
package main

import (
	"fmt"
	"net"
	"os"
	"unicode/utf16"
)

// The byte order mark. Read with its bytes swapped (0xFFFE) it is not a character.
const BOM = '\ufeff'

func main() {
	service := ":1210" // example port, chosen for illustration
	tcpAddr, err := net.ResolveTCPAddr("tcp", service)
	checkError(err)

	listener, err := net.ListenTCP("tcp", tcpAddr)
	checkError(err)

	for {
		conn, err := listener.Accept()
		if err != nil {
			continue
		}

		str := "j'ai arrêté"
		shorts := utf16.Encode([]rune(str))
		writeShorts(conn, shorts)
		conn.Close() // we're finished
	}
}

func writeShorts(conn net.Conn, shorts []uint16) {
	var bytes [2]byte

	// send the BOM as first two bytes, high byte first (big-endian)
	bytes[0] = BOM >> 8
	bytes[1] = BOM & 255
	_, err := conn.Write(bytes[0:])
	if err != nil {
		return
	}

	// send each 16-bit value in the same byte order as the BOM
	for _, v := range shorts {
		bytes[0] = byte(v >> 8)
		bytes[1] = byte(v & 255)
		_, err = conn.Write(bytes[0:])
		if err != nil {
			return
		}
	}
}

func checkError(err error) {
	if err != nil {
		fmt.Println("Fatal error ", err.Error())
		os.Exit(1)
	}
}


while a client that reads a byte stream, extracts and examines the BOM and then decodes the rest of the stream is


/* UTF16 Client
 */
package main

import (
	"fmt"
	"net"
	"os"
	"unicode/utf16"
)

// The byte order mark. Read with its bytes swapped (0xFFFE) it is not a character.
const BOM = '\ufeff'

func main() {
	if len(os.Args) != 2 {
		fmt.Println("Usage: ", os.Args[0], "host:port")
		os.Exit(1)
	}
	service := os.Args[1]

	conn, err := net.Dial("tcp", service)
	checkError(err)

	shorts := readShorts(conn)
	ints := utf16.Decode(shorts)
	str := string(ints)

	fmt.Println(str)

	os.Exit(0)
}

func readShorts(conn net.Conn) []uint16 {
	var buf [512]byte

	// read everything into the buffer
	n := 0
	for n < len(buf) {
		m, err := conn.Read(buf[n:])
		n += m
		if m == 0 || err != nil {
			break
		}
	}
	if n < 2 {
		fmt.Println("No BOM received")
		return nil
	}

	// the first two bytes are the BOM, the rest are the 16-bit values
	shorts := make([]uint16, 0, (n-2)/2)

	if buf[0] == BOM>>8 && buf[1] == BOM&255 {
		// FE FF: big endian
		for i := 2; i+1 < n; i += 2 {
			shorts = append(shorts, uint16(buf[i])<<8+uint16(buf[i+1]))
		}
	} else if buf[0] == BOM&255 && buf[1] == BOM>>8 {
		// FF FE: little endian
		for i := 2; i+1 < n; i += 2 {
			shorts = append(shorts, uint16(buf[i+1])<<8+uint16(buf[i]))
		}
	} else {
		// unknown byte order
		fmt.Println("Unknown order")
	}
	return shorts
}

func checkError(err error) {
	if err != nil {
		fmt.Println("Fatal error ", err.Error())
		os.Exit(1)
	}
}


Unicode gotchas


This book is not about i18n issues. In particular we don't want to delve into the arcane areas of Unicode. But you should know that Unicode is not a simple encoding and there are many complexities. For example, some earlier character sets used non-spacing characters, particularly for accents. This was brought into Unicode, so you can produce accented characters in two ways: as a single Unicode character, or as a pair of non-spacing accent plus non-accented character. For example, U+04D6 CYRILLIC CAPITAL LETTER IE WITH BREVE is a single character. It is equivalent to U+0415 CYRILLIC CAPITAL LETTER IE combined with the breve accent U+0306 COMBINING BREVE. This makes string comparison difficult on occasions. The Go specification does not at present address such issues.
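The effect on string comparison is easy to demonstrate. In this sketch the two spellings of the same accented letter are written as escapes; the helper runeCount is invented for the example. Go's == compares the underlying bytes, so it treats them as different strings.

```go
package main

import "fmt"

// runeCount returns the number of code points in s.
// (Hypothetical helper for illustration.)
func runeCount(s string) int {
	return len([]rune(s))
}

func main() {
	precomposed := "\u04D6"    // CYRILLIC CAPITAL LETTER IE WITH BREVE
	combined := "\u0415\u0306" // CYRILLIC CAPITAL LETTER IE + COMBINING BREVE

	fmt.Println("equal:", precomposed == combined)
	fmt.Println("code points:", runeCount(precomposed), runeCount(combined))
}
```

Comparing such strings reliably needs Unicode normalisation first. That is outside the language itself; the supplementary golang.org/x/text/unicode/norm package is one place to look for it.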


ISO 8859 and Go


The ISO 8859 series are 8-bit character sets for different parts of Europe and some other areas. They all have the ASCII set common in the low part, but differ in the top part. According to Google, ISO 8859 codes account for about 20% of the web pages it sees.


The first code, ISO 8859-1 or Latin-1, has the first 256 characters in common with Unicode. The encoded value of the Latin-1 characters is the same in UTF-16 and in the default ISO 8859-1 encoding. But this doesn't really help much, as UTF-16 is a 16-bit encoding and ISO 8859-1 is an 8-bit encoding. UTF-8 is an 8-bit encoding, but it uses the top bit to signal extra bytes, so only the ASCII subset overlaps for UTF-8 and ISO 8859-1. So UTF-8 doesn't help much either.


But the ISO 8859 series don't have any complex issues. To each character in each set corresponds a unique Unicode character. For example, in ISO 8859-2, the character "latin capital letter C with caron" has ISO 8859-2 code point 0xc8 (in hexadecimal) and corresponding Unicode code point of U+010C. Transforming either way between an ISO 8859 set and the corresponding Unicode characters is essentially just a table lookup.

The table from ISO 8859 code points to Unicode code points could be done as an array of 256 integers. But many of these will have the same value as the index. So we just use a map of the different ones, and those not in the map take the index value.


For ISO 8859-2 a portion of the map is


var unicodeToISOMap = map[rune]uint8{
	0x106: 0xc6, // Ć
	0x10c: 0xc8, // Č
	0x118: 0xca, // Ę
	// plus more
}


and a function to convert UTF-8 strings to an array of ISO 8859-2 bytes is


/* Turn a UTF-8 string into an ISO 8859 encoded byte array
 */
func unicodeStrToISO(str string) []byte {
	// get the unicode code points
	codePoints := []rune(str)

	// create a byte array of the same length
	bytes := make([]byte, len(codePoints))

	for n, v := range codePoints {
		// see if the point is in the exception map
		iso, ok := unicodeToISOMap[v]
		if !ok {
			// just use the value; a code point above 0xff that is
			// not in the map has no ISO equivalent and is truncated
			iso = uint8(v)
		}
		bytes[n] = iso
	}
	return bytes
}

In a similar way you can change an array of ISO 8859-2 bytes into a UTF-8 string:

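var isoToUnicodeMap = map[uint8]rune{
	0xc6: 0x106, // Ć
	0xc8: 0x10c, // Č
	0xca: 0x118, // Ę
	// and more
}

/* Turn an ISO 8859 encoded byte array into a UTF-8 string
 */
func isoBytesToUnicode(bytes []byte) string {
	codePoints := make([]rune, len(bytes))
	for n, v := range bytes {
		// see if the byte is in the exception map
		unicode, ok := isoToUnicodeMap[v]
		if !ok {
			// the code point is the same as the byte value
			unicode = rune(v)
		}
		codePoints[n] = unicode
	}
	return string(codePoints)
}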

These functions can be used to read and write UTF-8 strings as ISO 8859-2 bytes. By changing the mapping table, you can cover the other ISO 8859 codes. Latin-1, or ISO 8859-1, is a special case - the exception map is empty as the code points for Latin-1 are the same in Unicode. You could also use the same technique for other character sets based on a table mapping, such as Windows 1252.


Other character sets and Go


There are very, very many character set encodings. According to Google, these generally only have a small use, which will hopefully decrease even further in time. But if your software wants to capture all markets, then you may need to handle them.


In the simplest cases, a lookup table will suffice. But that doesn't always work. The character coding ISO 2022 minimised character set sizes by using a finite state machine to swap code pages in and out. This was borrowed by some of the Japanese encodings, and makes things very complex.


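The Go language and standard library do not at present give any support for these other character sets (the supplementary golang.org/x/text/encoding packages, which are outside the standard library, cover many of them). So you either avoid their use, fail to talk to applications that do use them, rely on such a package, or write lots of your own code!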




There hasn't been much code in this chapter. Instead, there have been some of the concepts of a very complex area. It's up to you: if you want to assume everyone speaks US English then the world is simple. But if you want your applications to be usable by the rest of the world, then you need to pay attention to these complexities.


Copyright Jan Newmarch,
