| author | engelsman <engelsman> | 2010-05-17 20:16:51 +0000 |
|---|---|---|
| committer | engelsman <engelsman> | 2010-05-17 20:16:51 +0000 |
| commit | f0be902828479b806ef28cb2ee9b81aa9cfff015 (patch) | |
| tree | a01a7d7de205b28dc00491c064046abb6b55bd15 /documentation/src/unicode.dox | |
| parent | 20a837c75620f0f947748f63d76c936f85e42e13 (diff) | |
documentation/unicode.dox: added to the Unicode and UTF-8 Support chapter
added references to RFC 3629 as the source of the 21-bit U+10FFFF limit,
outlined the illegal character strategy of fl_utf8decode(), and
added warnings that fl_utf8len() is unsafe
git-svn-id: file:///fltk/svn/fltk/branches/branch-1.3@7610 ea41ed52-d2ee-0310-a9c1-e6b18d33e121
Diffstat (limited to 'documentation/src/unicode.dox')
| -rw-r--r-- | documentation/src/unicode.dox | 90 |
1 files changed, 80 insertions, 10 deletions
diff --git a/documentation/src/unicode.dox b/documentation/src/unicode.dox
index 7528f0a28..46026d2e8 100644
--- a/documentation/src/unicode.dox
+++ b/documentation/src/unicode.dox
@@ -19,6 +19,7 @@ For further information, please see:
 - http://www.iso.org
 - http://en.wikipedia.org/wiki/Unicode
 - http://www.cl.cam.ac.uk/~mgk25/unicode.html
+- http://www.apps.ietf.org/rfc/rfc3629.html
 
 \par The Unicode Standard
 
@@ -64,9 +65,20 @@ and are usually shown using 'U+' and the code in hexadecimal,
 e.g. U+0041 is the "Latin capital letter A". The UCS characters
 U+0000 to U+007F correspond to US-ASCII, and U+0000 to U+00FF
 correspond to ISO 8859-1 (Latin1).
+
+ISO 10646 was originally designed to handle a 31-bit character set
+from U+00000000 to U+7FFFFFFF, but the current idea is that 21-bits
+will be sufficient for all future needs, giving characters up to
+U+10FFFF. The complete character set is sub-divided into \e planes.
+<i>Plane 0</i>, also known as the <b>Basic Multilingual Plane</b>
+(BMP), ranges from U+0000 to U+FFFD and consists of the most commonly
+used characters from previous encoding standards. Other planes
+contain characters for specialist applications.
+\todo
+Do we need this info about planes?
+
 The UCS also defines various methods of encoding characters as
 a sequence of bytes.
-
 UCS-2 encodes Unicode characters into two bytes,
 which is wasteful if you are only dealing with ASCII or Latin1 text,
 and insufficient if you need characters above U+00FFFF.
@@ -77,6 +89,8 @@ but this is even more wasteful for ASCII or Latin1.
 
 The Unicode standard defines various UCS Transformation Formats.
 UTF-16 and UTF-32 are based on units of two and four bytes.
+UCS characters requiring more than 16-bits are encoded using
+"surrogate pairs" in UTF-16.
 
 UTF-8 encodes all Unicode characters into variable length
 sequences of bytes. Unicode characters in the 7-bit ASCII
@@ -86,7 +100,7 @@ making the transformation to Unicode quick and easy.
 All UCS characters above U+007F are encoded as a sequence of
 several bytes. The top bits of the first byte are set to show
 the length of the byte sequence, and subseqent bytes are
-always in the range 0x80 to 8x8F. This combination provides
+always in the range 0x80 to 0x8F. This combination provides
 some level of synchronisation and error detection.
 
 <table summary="Unicode character byte sequences" align="center">
@@ -128,9 +142,13 @@ library.
 
 \section unicode_in_fltk Unicode in FLTK
 
-FLTK will be entirely converted to Unicode in UTF-8 encoding.
-If a different encoding is required by the underlying operatings
-system, FLTK will convert string as needed.
+\todo
+Work through the code and this documentation to harmonize
+the [<b>OksiD</b>] and [<b>fltk2</b>] functions.
+
+FLTK will be entirely converted to Unicode using UTF-8 encoding.
+If a different encoding is required by the underlying operating
+system, FLTK will convert the string as needed.
 
 It is important to note that the initial implementation of
 Unicode and UTF-8 in FLTK involves three important areas:
@@ -138,7 +156,7 @@ Unicode and UTF-8 in FLTK involves three important areas:
 - provision of Unicode character tables and some simple related functions;
 
 - conversion of char* variables and function parameters from single byte
-  per character representation to UTF-8 variable length characters;
+  per character representation to UTF-8 variable length sequences;
 
 - modifications to the display font interface to accept general
   Unicode character or UCS code numbers instead of just ASCII or Latin1
@@ -147,9 +165,15 @@ Unicode and UTF-8 in FLTK involves three important areas:
 The current implementation of Unicode / UTF-8 in FLTK will impose
 the following limitations:
 
-- An implementation note in the code says that all functions are
-  LIMITED to 24 bit Unicode values, but also says that only 16 bits
+- An implementation note in the [<b>OksiD</b>] code says that all functions
+  are LIMITED to 24 bit Unicode values, but also says that only 16 bits
   are really used under linux and win32.
+  <b>[Can we verify this?]</b>
+
+- The [<b>fltk2</b>] %fl_utf8encode() and %fl_utf8decode() functions are
+  designed to handle Unicode characters in the range U+000000 to U+10FFFF
+  inclusive, which covers all UTF-16 characters, as specified in RFC 3629.
+  <i>Note that the user must first convert UTF-16 surrogate pairs to UCS.</i>
 
 - FLTK will only handle single characters, so composed characters
   consisting of a base character and floating accent characters
@@ -164,8 +188,54 @@ the following limitations:
   Verify 16/24 bit Unicode limit for different character sets?
   OksiD's code appears limited to 16-bit whereas the FLTK2 code
   appears to handle a wider set. What about illegal characters?
-  See comments in fl_utf8fromwc() and fl_utf8toUtf16().
-
+  See comments in %fl_utf8fromwc() and %fl_utf8toUtf16().
+
+\section unicode_illegals Illegal Unicode and UTF8 sequences
+
+Three pre-processor variables are defined in the source code that
+determine how %fl_utf8decode() handles illegal UTF8 sequences:
+
+- if ERRORS_TO_CP1252 is set to 1 (the default), %fl_utf8decode() will
+  assume that a byte sequence starting with a byte in the range 0x80
+  to 0x9f represents a Microsoft CP1252 character, and will instead
+  return the value of an equivalent UCS character. Otherwise, it
+  will be processed as an illegal byte value as described below.
+
+- if STRICT_RFC3629 is set to 1 (not the default!) then UTF-8
+  sequences that correspond to illegal UCS values are treated as
+  errors. Illegal UCS values include those above U+10FFFF, or
+  corresponding to UTF-16 surrogate pairs. Illegal byte values
+  are handled as described below.
+
+- if ERRORS_TO_ISO8859_1 is set to 1 (the default), the illegal
+  byte value is returned unchanged, otherwise 0xFFFD, the Unicode
+  REPLACEMENT CHARACTER, is returned instead.
+
+%fl_utf8encode() is less strict, and only generates the UTF-8
+sequence for 0xFFFD, the Unicode REPLACEMENT CHARACTER, if it is
+asked to encode a UCS value above U+10FFFF.
+
+Many of the [<b>fltk2</b>] functions below use %fl_utf8decode() and
+%fl_utf8encode() in their own implementation, and are therefore
+somewhat protected from bad UTF-8 sequences.
+
+The [<b>OksiD</b>] %fl_utf8len() function assumes that the byte it is
+passed is the first byte in a UTF-8 sequence, and returns the length
+of the sequence. Trailing bytes in a UTF-8 sequence will return -1.
+
+- \b WARNING:
+  %fl_utf8len() can not distinguish between single
+  bytes representing Microsoft CP1252 characters 0x80-0x9f and
+  those forming part of a valid UTF-8 sequence. You are strongly
+  advised not to use %fl_utf8len() in your own code unless you
+  know that the byte sequence contains only valid UTF-8 sequences.
+
+- \b WARNING:
+  Some of the [OksiD] functions below use still use %fl_utf8len() in
+  their implementations. These may need further validation.
+
+Please see the individual function description for further details
+about error handling and return values.
 
 \section unicode_fltk_calls FLTK Unicode and UTF8 functions
 
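The encoding behaviour described in this commit can be sketched in a few lines of standalone C++. This is only an illustration, not FLTK code: the names `sketch_utf8encode` and `sketch_utf8len` are hypothetical. The substitution of U+FFFD for values above U+10FFFF follows the commit's description of %fl_utf8encode(), the -1 result for trailing bytes follows its description of %fl_utf8len(), and the byte ranges are taken from RFC 3629. Rejecting lead bytes of 0xF8 and above is an assumption of this sketch.

```cpp
#include <cstddef>

// Hypothetical helper (not part of FLTK): encode one UCS value as UTF-8.
// Writes at most 4 bytes into buf and returns the number of bytes written.
// Values above U+10FFFF are replaced by U+FFFD, the REPLACEMENT CHARACTER,
// as the commit text describes for fl_utf8encode().
std::size_t sketch_utf8encode(unsigned long ucs, unsigned char* buf) {
  if (ucs > 0x10FFFFUL) ucs = 0xFFFDUL;
  if (ucs < 0x80UL) {                       // 1 byte: 0xxxxxxx
    buf[0] = static_cast<unsigned char>(ucs);
    return 1;
  }
  if (ucs < 0x800UL) {                      // 2 bytes: 110xxxxx 10xxxxxx
    buf[0] = static_cast<unsigned char>(0xC0 | (ucs >> 6));
    buf[1] = static_cast<unsigned char>(0x80 | (ucs & 0x3F));
    return 2;
  }
  if (ucs < 0x10000UL) {                    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    buf[0] = static_cast<unsigned char>(0xE0 | (ucs >> 12));
    buf[1] = static_cast<unsigned char>(0x80 | ((ucs >> 6) & 0x3F));
    buf[2] = static_cast<unsigned char>(0x80 | (ucs & 0x3F));
    return 3;
  }
  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  buf[0] = static_cast<unsigned char>(0xF0 | (ucs >> 18));
  buf[1] = static_cast<unsigned char>(0x80 | ((ucs >> 12) & 0x3F));
  buf[2] = static_cast<unsigned char>(0x80 | ((ucs >> 6) & 0x3F));
  buf[3] = static_cast<unsigned char>(0x80 | (ucs & 0x3F));
  return 4;
}

// Hypothetical helper (not part of FLTK): sequence length implied by a
// lead byte. Trailing bytes (0x80..0xBF) return -1. This sketch also
// returns -1 for 0xF8 and above, which cannot start an RFC 3629 sequence.
int sketch_utf8len(unsigned char c) {
  if (c < 0x80) return 1;
  if (c < 0xC0) return -1;   // trailing byte, or a stray CP1252 byte
  if (c < 0xE0) return 2;
  if (c < 0xF0) return 3;
  if (c < 0xF8) return 4;
  return -1;
}
```

The second function also shows why the commit warns about %fl_utf8len(): a lone byte in the range 0x80 to 0x9f looks exactly like a trailing byte, so a length lookup on the first byte alone cannot tell a stray CP1252 character from part of a valid sequence.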
