Diffstat (limited to 'documentation/src')
| -rw-r--r-- | documentation/src/unicode.dox | 43 |
1 files changed, 25 insertions, 18 deletions
diff --git a/documentation/src/unicode.dox b/documentation/src/unicode.dox index ecd9074bd..8d78701cf 100644 --- a/documentation/src/unicode.dox +++ b/documentation/src/unicode.dox @@ -12,8 +12,9 @@ the current state of Unicode support. \section unicode_about About Unicode, ISO 10646 and UTF-8 The summary of Unicode, ISO 10646 and UTF-8 given below is -deliberately brief, and provides just enough information for +deliberately brief and provides just enough information for the rest of this chapter. + For further information, please see: - http://www.unicode.org - http://www.iso.org @@ -21,11 +22,12 @@ For further information, please see: - http://www.cl.cam.ac.uk/~mgk25/unicode.html - http://www.apps.ietf.org/rfc/rfc3629.html + \par The Unicode Standard The Unicode Standard was originally developed by a consortium of mainly US computer manufacturers and developers of multi-lingual software. -It has now become a defacto standard for character encoding, +It has now become a defacto standard for character encoding and is supported by most of the major computing companies in the world. Before Unicode, many different systems, on different platforms, @@ -40,7 +42,8 @@ and typographic publishing systems, such as algorithms for sorting and comparing text, composite character and text rendering, right-to-left and bi-directional text handling. -<i>There are currently no plans to add this extra functionality to FLTK.</i> +\note There are currently no plans to add this extra functionality to FLTK. + \par ISO 10646 @@ -57,8 +60,8 @@ which contains the characters required for almost all known languages. The standard also defines three different implementation levels specifying how these characters can be combined. -<i>There are currently no plans for handling the different implementation -levels or the combining characters in FLTK.</i> +\note There are currently no plans for handling the different implementation +levels or the combining characters in FLTK. 
In UCS, characters have a unique numerical code and an official name, and are usually shown using 'U+' and the code in hexadecimal, @@ -67,15 +70,15 @@ The UCS characters U+0000 to U+007F correspond to US-ASCII, and U+0000 to U+00FF correspond to ISO 8859-1 (Latin1). ISO 10646 was originally designed to handle a 31-bit character set -from U+00000000 to U+7FFFFFFF, but the current idea is that 21-bits +from U+00000000 to U+7FFFFFFF, but the current idea is that 21 bits will be sufficient for all future needs, giving characters up to U+10FFFF. The complete character set is sub-divided into \e planes. <i>Plane 0</i>, also known as the <b>Basic Multilingual Plane</b> (BMP), ranges from U+0000 to U+FFFD and consists of the most commonly used characters from previous encoding standards. Other planes contain characters for specialist applications. -\todo -Do we need this info about planes? + +\todo Do we need this info about planes? The UCS also defines various methods of encoding characters as a sequence of bytes. @@ -87,7 +90,7 @@ but this is even more wasteful for ASCII or Latin1. \par UTF-8 -The Unicode standard defines various UCS Transformation Formats. +The Unicode standard defines various UCS Transformation Formats (UTF). UTF-16 and UTF-32 are based on units of two and four bytes. UCS characters requiring more than 16 bits are encoded using "surrogate pairs" in UTF-16. @@ -100,9 +103,11 @@ making the transformation to Unicode quick and easy. All UCS characters above U+007F are encoded as a sequence of several bytes. The top bits of the first byte are set to show the length of the byte sequence, and subseqent bytes are -always in the range 0x80 to 0x8F. This combination provides +always in the range 0x80 to 0xBF. This combination provides some level of synchronisation and error detection. +\par + <table summary="Unicode character byte sequences" align="center"> <tr> <td>Unicode range</td> @@ -134,6 +139,8 @@ some level of synchronisation and error detection. 
</tr> </table> +\par + Moving from ASCII encoding to Unicode will allow all new FLTK applications to be easily internationalized and used all over the world. By choosing UTF-8 encoding, FLTK remains largely @@ -176,12 +183,12 @@ the following limitations: - FLTK will only handle single characters, so composed characters consisting of a base character and floating accent characters - will be treated as multiple characters; + will be treated as multiple characters. - FLTK will only compare or sort strings on a byte by byte basis - and not on a general Unicode character basis; + and not on a general Unicode character basis. -- FLTK will not handle right-to-left or bi-directional text; +- FLTK will not handle right-to-left or bi-directional text. \todo Verify 16/24 bit Unicode limit for different character sets? @@ -189,7 +196,7 @@ the following limitations: appears to handle a wider set. What about illegal characters? See comments in %fl_utf8fromwc() and %fl_utf8toUtf16(). -\section unicode_illegals Illegal Unicode and UTF-8 sequences +\section unicode_illegals Illegal Unicode and UTF-8 Sequences Three pre-processor variables are defined in the source code [1] that determine how %fl_utf8decode() handles illegal UTF-8 sequences: @@ -240,7 +247,7 @@ of the sequence. Trailing bytes in a UTF-8 sequence will return -1. Please see the individual function description for further details about error handling and return values. -\section unicode_fltk_calls FLTK Unicode and UTF-8 functions +\section unicode_fltk_calls FLTK Unicode and UTF-8 Functions This section currently provides a brief overview of the functions. For more details, consult the main text for each function via its link. @@ -348,8 +355,8 @@ or ISO-8859-1 characters below 0xFF are replaced with '?'. \par Both functions return the number of bytes that would be written, not counting the null terminator. 
-\p destlen provides a means of limiting the number of bytes written, -so setting \p destlen to zero is a means of measuring how much storage +\p dstlen provides a means of limiting the number of bytes written, +so setting \p dstlen to zero is a means of measuring how much storage would be needed before doing the real conversion. @@ -455,7 +462,7 @@ converts the strings to lower case Unicode as part of the comparison. \p %flt_utf_strncasecmp() only compares the first \p n characters [bytes?] -\section unicode_system_calls FLTK Unicode versions of system calls +\section unicode_system_calls FLTK Unicode Versions of System Calls - int fl_access(const char* f, int mode) \b OksiD |
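Note on the hunk at -100,9: the corrected range for trailing bytes (0x80 to 0xBF instead of 0x80 to 0x8F) can be checked with a short sketch. The code below is not part of FLTK or of this commit. It is a hand-written encoder that follows the byte-sequence table in unicode.dox, and the names `utf8_encode` and `trailing_bytes` are hypothetical, chosen only for illustration.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point (U+0000..U+10FFFF, no surrogates) as UTF-8.

    Illustrative helper only; not an FLTK function.
    """
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:          # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def trailing_bytes(encoded: bytes) -> bytes:
    """Return every byte after the first one in a single encoded character."""
    return encoded[1:]


if __name__ == "__main__":
    for cp in (0x41, 0xFF, 0x20AC, 0x10FFFF):
        enc = utf8_encode(cp)
        # Each trailing byte carries six payload bits, so it spans 0x80..0xBF.
        assert all(0x80 <= b <= 0xBF for b in trailing_bytes(enc))
        print(f"U+{cp:04X} -> {enc.hex(' ')}")
```

U+00FF already yields a trailing byte of 0xBF, which lies outside the old 0x80 to 0x8F range, so the documentation fix in this hunk is consistent with the encoding table.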
