Surrogate Universe
Windows handles characters as little-endian 16-bit values. This is the UTF-16LE encoding, which Microsoft calls "Unicode". For better or worse, Windows NT 3.1 adopted Unicode because its development overlapped with the early drafting of the Unicode standard. Development of NT dates back to the end of the 1980s. Unicode was conceived around the same time; Microsoft joined the standardization effort in 1990, and the Unicode Consortium was established the following year, in 1991. That is why every version since Windows NT, including the current Windows 10 and Windows 11, uses 16-bit "Unicode", that is, little-endian UTF-16, as its internal character encoding.
Most users work with Windows through its GUI and see characters only as rendered fonts, so they neither see nor need to know what the character codes are. Even so, users sometimes run into character-encoding problems. One is so-called "garbled characters" (mojibake): text files downloaded from the Internet or brought over from other platforms can turn out unreadable. Applications used as file viewers, such as Notepad, address this by supporting multiple encodings.
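As a rough illustration of how garbled characters arise, the following Python sketch encodes a short Japanese string as Shift JIS and then decodes the same bytes with the wrong encoding. The sample string and the choice of Latin-1 as the "wrong" encoding are assumptions made for the example; a viewer that supports multiple encodings is, in effect, choosing the right codec for the second step.

```python
# Bytes written under one encoding become unreadable when decoded under another.
original = "日本語"

# A file saved on a system that uses Shift JIS would contain these bytes.
sjis_bytes = original.encode("shift_jis")

# Decoding with the wrong encoding "succeeds" but yields garbled text.
garbled = sjis_bytes.decode("latin-1")

# Decoding with the encoding that was actually used restores the text.
restored = sjis_bytes.decode("shift_jis")

print(len(sjis_bytes))       # 6 (two bytes per character)
print(garbled == original)   # False
print(restored == original)  # True
```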
The other is the "surrogate pair" problem in Excel and PowerShell. In 1996, the year Windows NT 4.0 appeared, the Unicode Consortium published Unicode 2.0. By then American companies had also come to understand that 16 bits could not hold all the world's characters, and the code space was expanded. However, systems such as Windows NT that had already defined characters as 16 bits were beginning to spread. Surrogate pairs emerged as a compromise. First, the range of code points assigned to characters was capped at 21 bits (0 to 0x10FFFF). Next, part of the 16-bit code range was set aside for surrogates, so that two 16-bit units together represent 20 bits. This covers the whole 21-bit range of Unicode code points. Windows itself, however, did not support surrogate pairs until Windows Vista in 2006.
In a surrogate pair, two 16-bit units represent Planes 1 to 16 of Unicode (0x010000 to 0x10FFFF) (Table 01). Planes 4 to 13 are still unassigned, so there is plenty of room for now. The range used for surrogates does not overlap with any other characters, and the high surrogate, which carries the upper 10 bits, and the low surrogate, which carries the lower 10 bits, occupy separate ranges that differ in their leading bits, so they are easy to tell apart.
■Table 01

| Plane | Main use | Start | End | Remarks |
|---|---|---|---|---|
| Plane 0 (Basic Multilingual Plane) | General scripts | 00000 | 01FFF | |
| | Symbols | 02000 | 02DFF | |
| | CJK phonetics and symbols | 02E00 | 033FF | |
| | CJK Unified Ideographs | 03400 | 09FFF | |
| | Yi | 0A000 | 0A4CF | |
| | Hangul Syllables | 0AC00 | 0D7AF | |
| | Surrogate code points | 0D800 | 0DFFF | Surrogate range |
| | High surrogates | 0D800 | 0DBFF | 10 bits |
| | Low surrogates | 0DC00 | 0DFFF | 10 bits |
| | Private use | 0E000 | 0F8FF | |
| | Compatibility and special characters | 0F900 | 0FFFD | |
| Plane 1 | Supplementary Multilingual Plane | 10000 | 1FFFF | This 20-bit range is represented by D800-DC00 to DBFF-DFFF |
| Plane 2 | Supplementary Ideographic Plane | 20000 | 2FFFF | |
| Plane 3 | Tertiary Ideographic Plane | 30000 | 3FFFF | |
| Planes 4-13 | Unassigned | 40000 | DFFFF | |
| Plane 14 | Supplementary Special-purpose Plane | E0000 | EFFFF | |
| Planes 15-16 | Private use planes | F0000 | 10FFFF | |
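The arithmetic behind Table 01 can be sketched in a few lines of Python. The function names `to_surrogate_pair` and `from_surrogate_pair` are made up for this illustration; the constants are the range boundaries listed in the table.

```python
HIGH_START = 0xD800  # first high surrogate
LOW_START = 0xDC00   # first low surrogate

def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point in Planes 1-16 into (high, low) 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value, 0x00000..0xFFFFF
    high = HIGH_START + (offset >> 10)   # upper 10 bits
    low = LOW_START + (offset & 0x3FF)   # lower 10 bits
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)

high, low = to_surrogate_pair(0x1F600)              # U+1F600 GRINNING FACE
print(f"{high:04X} {low:04X}")                      # D83D DE00
print(f"{from_surrogate_pair(high, low):05X}")      # 1F600
```

Subtracting 0x10000 first is what lets 20 bits cover the 16 planes above Plane 0: the lowest pair D800-DC00 maps to U+10000 and the highest pair DBFF-DFFF maps to U+10FFFF.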
However, some software still assumes, as it did before surrogates were introduced, that every character fits in 16 bits. Such software counts a surrogate pair as two characters even though it is really one. As more characters are added to Plane 1 and beyond, surrogate pairs come up more often. For example, many of the increasingly popular emoji fall in the range represented by surrogate pairs.
Windows does have string-processing functions that correctly treat a surrogate pair as one character. It is also not hard to detect a surrogate pair from its bit layout. Such processing should be provided by the system rather than by individual applications, because the rules for handling Unicode are too numerous for any single application to cover.
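To show how little is needed to detect surrogates from the bit layout, here is a minimal Python sketch. It is not the Windows API; the helper names are hypothetical, and it assumes well-formed UTF-16 in which every low surrogate follows a high surrogate.

```python
import struct

def utf16_units(text: str) -> list:
    """Return the little-endian UTF-16 code units of text as integers."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))

def is_high_surrogate(unit: int) -> bool:
    return (unit & 0xFC00) == 0xD800   # top six bits are 110110

def is_low_surrogate(unit: int) -> bool:
    return (unit & 0xFC00) == 0xDC00   # top six bits are 110111

def count_code_points(units) -> int:
    """Count characters by skipping the second half of each surrogate pair."""
    return sum(1 for u in units if not is_low_surrogate(u))

units = utf16_units("A\U0001F600\u5B57")     # 'A', an emoji, and a kanji
print([f"{u:04X}" for u in units])           # ['0041', 'D83D', 'DE00', '5B57']
print(len(units), count_code_points(units))  # 4 3
```

A single mask-and-compare per code unit is enough because the high and low ranges differ only in one of their top six bits.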
However, some programs written on the premise that one character is 16 bits cannot handle surrogate pairs without a large-scale rewrite. Most programs never deal with strings that contain surrogate pairs, so the problem simply does not surface.
The widely used Excel, and the Windows PowerShell that ships with Windows, are "ancient" software that treat a surrogate pair as two characters. For example, if you put a single emoji in an Excel cell and apply the LEN function to get the string length, it returns 2 (Photo 01). For a kanji that is not encoded as a surrogate pair, it correctly returns 1. The discrepancy in string length carries over to the MID/LEFT/RIGHT string functions, which can cut a surrogate pair in half, so that characters are not displayed correctly. Version upgrades used to come once every two or three years, but Excel is now updated every six months. Excel once added byte-oriented functions such as LENB alongside LEN to support double-byte characters such as Japanese, and it seems to me that adding new string functions that handle surrogate pairs in the same way would be enough. Not everything is ancient, though: Excel does support variation selectors with IVS, for example. I can see why that got high priority, since there would have been strong demand from government offices. Still, it has been 15 years since Windows began supporting surrogate pairs, and I would like to see them properly supported here soon.
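The behavior described above can be modeled in Python by treating a string as a list of UTF-16 code units. This is only a model of a program that counts 16-bit units, not Excel's actual implementation, and the function names are invented for the sketch.

```python
def to_units(text: str) -> list:
    """UTF-16 code units of text, as a 16-bit-per-character program sees them."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def len_by_units(text: str) -> int:
    """Length in 16-bit units, the value the article reports for LEN."""
    return len(to_units(text))

def left_by_units(text: str, n: int) -> list:
    """First n code units, cut without regard to surrogate pairs."""
    return to_units(text)[:n]

def ends_with_split_pair(units: list) -> bool:
    """True if the cut left a high surrogate without its low surrogate."""
    return bool(units) and 0xD800 <= units[-1] <= 0xDBFF

emoji = "\U0001F600"   # needs a surrogate pair
kanji = "\u5B57"       # fits in one 16-bit unit

print(len_by_units(emoji), len_by_units(kanji))       # 2 1
print(ends_with_split_pair(left_by_units(emoji, 1)))  # True: half a character
print(ends_with_split_pair(left_by_units(kanji, 1)))  # False
```

A surrogate-aware variant would check `ends_with_split_pair` after cutting and either extend or shorten the result by one unit.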
For PowerShell, things are a little more complicated because it is a language. As in Excel, for strings containing surrogate pairs the string length does not match the number of characters the user perceives. Of course, you can call .NET Framework string-handling classes (such as System.Globalization.StringInfo) from PowerShell to get the correct length of a string that includes surrogate pairs. But Length, the property that returns the length of an ordinary string, counts a surrogate-pair character as 2 (Photo 02).
In Windows 10 the environment keeps improving: bundled applications such as Notepad support UTF-8, an emoji input panel is built in, and standard text input fields display emoji with color fonts (color font support was introduced in Windows 8). For example, the search fields in the Settings app and File Explorer can hold an emoji showing the face of a man with red hair. This is a sequence of four code points, "Man, Skin Tone, Zero Width Joiner (ZWJ), Red Hair", but it is displayed as one emoji (Photo 03). Since ZWJ is a "control character", it is hard to say "how many characters" this glyph is, but the sequence (a Unicode text segment) is meant to be displayed as a single glyph. In fact, the search field in Windows Settings and the address bar of many browsers display it that way. That Unicode can be displayed correctly means Windows has built-in functions for the purpose. Excel and PowerShell, however, do not use them.
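The following Python sketch takes such a sequence apart. The specific skin tone (U+1F3FB, light) is an assumption, since the text only says "Skin Tone". The standard library has no text-segmentation function, so the sketch only contrasts the counts that simpler measures give for what a user sees as one emoji.

```python
import unicodedata

# Man + light skin tone (assumed) + ZERO WIDTH JOINER + red hair component
red_haired_man = "\U0001F468\U0001F3FB\u200D\U0001F9B0"

# List the code points; names depend on the Unicode version Python ships with.
for ch in red_haired_man:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")

code_points = len(red_haired_man)                        # Python counts code points
utf16_units = len(red_haired_man.encode("utf-16-le")) // 2
utf8_bytes = len(red_haired_man.encode("utf-8"))

print(code_points, utf16_units, utf8_bytes)              # 4 7 15
```

None of these numbers is 1. Getting the user-perceived count requires the segmentation rules that the display side of Windows evidently applies.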
In the 1980s, various operating systems, including UNIX, began to implement what we now call localization. There were many debates and many implementations at the time. Things were somewhat chaotic: there were even cases of "Shift JIS", which had been created to support Japanese in MS-DOS and BASIC, being used on UNIX. In the mid-1980s EUC was proposed, and localization based on it (using each country's kanji code as the encoding) was standardized. "Internationalization" was already being taken into account then, so the path forward was clear. For Unicode support, UTF-8 was adopted, and Linux carries that on today. So even if you put an emoji in a bash variable, the length of the string is reported correctly (Photo 04).
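UTF-8 has no surrogates, so counting characters in it never involves pairs of units. As a sketch of the general idea (not of bash's actual implementation), the Python below counts code points in UTF-8 by skipping continuation bytes, which always have the bit pattern 10xxxxxx. It assumes well-formed UTF-8 input.

```python
def utf8_code_point_count(data: bytes) -> int:
    """Count code points by counting every byte that is not a continuation byte."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

text = "a\U0001F600\u5B57"          # ASCII letter, emoji, kanji
data = text.encode("utf-8")

print(len(data))                    # 8 bytes: 1 + 4 + 3
print(utf8_code_point_count(data))  # 3 characters
```

An emoji outside Plane 0 is simply a four-byte sequence here, handled by the same rule as any other multi-byte character.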
Windows and its development tools have progressed as well: in software development, UTF-8 can now be used for strings and source code, and the surrogate pair problem is fading. However, as with the LEN function in Excel, surrogate pairs still occasionally surface where ordinary users can see them. Of course, if you never process strings containing surrogate pairs in Excel you will not hit this problem, but string data containing surrogate pairs, such as emoji, will keep increasing, and I suspect this will become a big problem.
Addendum
Actually, every subtitle in this series has a source. This time it is the American TV drama "Stargate Universe". It made me wonder what a remake of the 1960s American drama "The Time Tunnel" would look like. It is a little military, but then the Time Tunnel itself was set up as a US military project. For this reason, the spelling of "surrogate" in the title was matched to the source. The Unicode Consortium's official Japanese term for surrogate pairs translates as "substitute pairs".
Unicode Terminology: English - Japanese
https://www.unicode.org/terminology/termenja.html