Additional notes on Unicode-based documents
[I]. Set up 1) Outlook-98 in Win-NT Tools -- Options -- Mail Format * Message Format: HTML * Stationery & Fonts: Character Set - Universal Alphabet (UTF-8) - Set as Default
2) For IE 6.x View -- Font / Encoding -- Universal Alphabet (UTF-8) or Right-click the mouse, then: Language -- Universal Alphabet (UTF-8)
3) For Netscape: * View -- Encoding (or Character Set) -- Unicode (UTF-8) * Edit -- Preferences -- Appearance-Fonts -- Use document-specified fonts
[II]. Printers: 1) HP Laser printers may need adjustment, as follows: 1.a) Models HP-III, HP-4M, HP-5Si File -- Print -- Properties -- Advanced -- Documents Options -- Print Text as Graphics: ON
1.b) Model HP-5M File -- Print -- Properties -- Advanced -- Options -- Graphic Mode: HP-GL/2 Laser III compatible: ENABLED
1.c) Model HP-8000, HP-4MP File -- Print -- Properties -- Finishing -- Details -- Font Settings: Send True Type as Bitmaps.
1.d) Other models: follow one of the above procedures.
2) HP Inkjet printers: - HP Inkjet 2500C: cannot print Unicode pages, either from a browser or from Word-97. - HP Inkjet 721C: can print Unicode in Word-97 - HP 970 Deskjet: can print in Word-97 and in Netscape 4.x (but not IE 5.x)
3) Other printers: - CANON Bubblejet BJC: can print Unicode in Win-98/Word-2000 - PANASONIC Laser printer KX series: can print Unicode with both browsers - RICOH Aficio 270: can print Unicode only in Word - EPSON Color Stylus series can print Unicode documents either from browsers or from Word.
[III]. Resources: 1) Alan Wood's Unicode Resources: https://www.alanwood.net/unicode/ 2) Unicode for Vietnamese: https://www.vovisoft.com/vovisoft/UnicodeChoVN.htm 3) Unicode Consortium: https://www.unicode.org/ 4) See also links and information on Viet Unicode: https://vietunicode.sourceforge.net/
[IV]. Fonts: 1) Basic fonts come with Office-2000, Windows-98 SE, Windows-Me, Windows-2000, Windows XP. For older versions, check these fonts: - Core fonts: Arial, Courier New and Times New Roman should be version 2.76 or later; if yours are older, download and install newer versions. - Not all WGL-4 fonts supplied by Microsoft contain VN characters.
2) A larger set: Arial-Unicode MS by Microsoft and CN-Times by Chan-Nguyen include Chinese-Japanese-Korean characters (15 MB, zipped), for Viet-Han texts. 3) VU-Times by Ho Phuoc Hung for Viet-Pali texts.
[V]. Software and Hardware
The following is a list of common software and hardware I use for our web site.
Keyboard programs: 1) VPS-Keys 4.3 (freeware): https://www.hcgvn.net/software/ 2) WinVNKey 4.0 (freeware): https://sourceforge.net/projects/winvnkey 3) UniKey 3.55 (freeware): https://sourceforge.net/projects/unikey
Document and graphics preparation: 1) MS Word-2000, -XP 2) MS Image Composer 1.5 3) Corel Draw and Corel PhotoPaint, versions 9 & 11
Document conversion programs: 1) Convert2anything (freeware), by Cafe68T https://cafe68t.multimania.com/content/unicode/download.html 2) VoviSoft (freeware), https://www.vovisoft.com/vovisoft/UnicodeChoVN.htm 3) VPSKeys 4.3 (freeware), https://www.hcgvn.net/software/ 4) UniKey 3.55 (freeware), https://sourceforge.net/projects/unikey 5) WinVNKey, 4.0 (freeware): https://sourceforge.net/projects/winvnkey
Web page set up: 1) MS Frontpage-2000, -XP (commercial) 2) Arachnophilia 4.0 (freeware): https://www.arachnoid.com/arachnophilia/
Operating systems: 1) Windows 2000 2) Windows XP
Browser: IE 6.x
System hardware: 1) PC Pentium-IV 1.6 GHz, 512 MB RAM, with Win-XP 2) PC Pentium Celeron 1.6 GHz, 256 MB RAM, with Win-XP 3) PC Pentium Xeon 2.8 GHz, 2 GB RAM, with Win 2000
Printers: 1) Epson Stylus series (color inkjet) 2) HP Laser 5L 3) Many networked HP Laser printers (4x, 5x) and Inkjet printers.
[VI]. Mac machines: I have no experience with Mac machines and Mac-OS. You might like to consult Alan Wood's website at: https://www.alanwood.net/unicode
[VII] UTF-8
UTF-8 (UTF: Unicode Transformation Format) preserves the full US-ASCII range, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. This section is only an illustration of how you can encode a Unicode character in UTF-8.
1) Take the Unicode value of the character to find out how many bytes you need. Unicode values are given in hexadecimal and decimal numbers:
| Hex Range | Dec Range | Bytes needed |
| 0000-007F | 0 - 127 | 1 byte |
| 0080-07FF | 128 - 2,047 | 2 bytes |
| 0800-FFFF | 2,048 - 65,535 | 3 bytes |
| 10000-1FFFFF | 65,536 - 2,097,151 | 4 bytes |
| 200000-3FFFFFF | 2,097,152 - 67,108,863 | 5 bytes |
| 4000000-7FFFFFFF | 67,108,864 - 2,147,483,647 (*) | 6 bytes |
(*) Maximum 2,147,483,648 (2**31) characters could be created.
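The range lookup in step 1 can be sketched in a few lines of Python. This is only an illustration of the table above; the names `UTF8_RANGES` and `utf8_length` are made up for this example. Note that the 5- and 6-byte rows belong to the original UTF-8 definition; the current standard (RFC 3629) stops at 4 bytes and at the value 0x10FFFF.

```python
# Upper bound of each range in the table, paired with its byte count.
# Hypothetical names; the 5- and 6-byte rows follow the older UTF-8
# definition described in this section, not the current 4-byte limit.
UTF8_RANGES = [
    (0x7F, 1),
    (0x7FF, 2),
    (0xFFFF, 3),
    (0x1FFFFF, 4),
    (0x3FFFFFF, 5),
    (0x7FFFFFFF, 6),
]


def utf8_length(code_point):
    """Return how many bytes the table assigns to a Unicode value."""
    if code_point < 0:
        raise ValueError("Unicode value must be non-negative")
    for upper_bound, byte_count in UTF8_RANGES:
        if code_point <= upper_bound:
            return byte_count
    raise ValueError("value exceeds 0x7FFFFFFF")


print(utf8_length(0x41))    # Latin 'A'      -> 1
print(utf8_length(0x1EA1))  # Vietnamese 'ạ' -> 3
print(utf8_length(0x8336))  # Han 'tea'      -> 3
```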
2) Convert the hex code to binary form and fill in the empty bits:
| 1 byte | 0xxxxxxx |
| 2 bytes | 110xxxxx 10xxxxxx |
| 3 bytes | 1110xxxx 10xxxxxx 10xxxxxx |
| 4 bytes | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
| 5 bytes | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
| 6 bytes | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
Example: The Unicode value of 'tea' (Han) is 8336 (dec: 33,590), so you need 3 bytes. The binary form of hexadecimal 8336 is: - 10000011 00110110
Fill the empty slots of the three-byte template with the binary value of 'tea' and you will get: - 11101000 10001100 10110110
Thus you have converted 0x8336 to 3 bytes: 0xE8 0x8C 0xB6.
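The 'tea' example can be checked with a short Python sketch. It fills only the three-byte template, using shifts and masks in place of the manual bit-filling; the function name `encode_three_bytes` is an assumption made for this illustration. The last line compares the result with Python's built-in UTF-8 encoder.

```python
def encode_three_bytes(code_point):
    """Fill the template 1110xxxx 10xxxxxx 10xxxxxx with a 16-bit value."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("the three-byte template covers 0x0800-0xFFFF")
    byte1 = 0b11100000 | (code_point >> 12)          # top 4 bits
    byte2 = 0b10000000 | ((code_point >> 6) & 0x3F)  # middle 6 bits
    byte3 = 0b10000000 | (code_point & 0x3F)         # low 6 bits
    return bytes([byte1, byte2, byte3])


tea = encode_three_bytes(0x8336)
print(" ".join(f"{b:08b}" for b in tea))  # 11101000 10001100 10110110
print([hex(b) for b in tea])              # ['0xe8', '0x8c', '0xb6']
print(tea == chr(0x8336).encode("utf-8"))  # True
```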
[VIII] UTF-16 Conversion
UTF-16 definition
Each character is assigned a number, which Unicode calls the Unicode scalar value. In the UTF-16 encoding, characters are represented using either one or two unsigned 16-bit integers. The rules for how characters are encoded in UTF-16 are:
- Characters with values less than 0x10000 are represented as a single 16-bit integer with a value equal to that of the character number.
- Characters with values between 0x10000 and 0x10FFFF are represented by a 16-bit integer with a value between 0xD800 and 0xDBFF (within the so-called high-half zone or high surrogate area) followed by a 16-bit integer with a value between 0xDC00 and 0xDFFF (within the so-called low-half zone or low surrogate area).
- Characters with values greater than 0x10FFFF cannot be encoded in UTF-16.
Note: Values between 0xD800 and 0xDFFF are specifically reserved for use with UTF-16, and don't have any characters assigned to them.
Encoding UTF-16
Encoding of a single character from an ISO 10646 character value to UTF-16 proceeds as follows. Let U be the character number, no greater than 0x10FFFF.
1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.
2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF, U' must be less than or equal to 0xFFFFF. That is, U' can be represented in 20 bits.
3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and 0xDC00, respectively. These integers each have 10 bits free to encode the character value, for a total of 20 bits.
4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order bits of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. Terminate.
Graphically, steps 2 through 4 look like:
U' = yyyyyyyyyyxxxxxxxxxx (binary, 20 bits)
W1 = 110110yyyyyyyyyy
W2 = 110111xxxxxxxxxx
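The four encoding steps can be sketched in Python as follows. This is a minimal illustration, not a complete encoder: the function name `utf16_encode` is made up for this example, and, like the steps above, it does not reject values in the reserved 0xD800-0xDFFF zone. The second test value, 0x1D11E, is chosen here only because it lies above 0xFFFF.

```python
def utf16_encode(u):
    """Return the 16-bit integers that encode character number u."""
    if not 0 <= u <= 0x10FFFF:
        raise ValueError("value cannot be encoded in UTF-16")
    if u < 0x10000:
        return [u]                   # step 1: a single 16-bit integer
    u_prime = u - 0x10000            # step 2: fits in 20 bits
    w1 = 0xD800 | (u_prime >> 10)    # steps 3-4: high 10 bits into W1
    w2 = 0xDC00 | (u_prime & 0x3FF)  # steps 3-4: low 10 bits into W2
    return [w1, w2]


print([hex(w) for w in utf16_encode(0x8336)])   # ['0x8336']
print([hex(w) for w in utf16_encode(0x1D11E)])  # ['0xd834', '0xdd1e']
```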
-ooOoo-