| author | engelsman <engelsman> | 2009-04-11 20:46:06 +0000 |
|---|---|---|
| committer | engelsman <engelsman> | 2009-04-11 20:46:06 +0000 |
| commit | d1593df45be79099595e36bff9960169f6ad4b8c (patch) | |
| tree | d8283c6aeddf6ff34261075cd516781b09ebb4fe /documentation/src | |
| parent | 01a6e197c2488ecaf7e952cf932d97b652c48f7d (diff) | |
fleshed out the background information in unicode.dox
added more info and links on the Unicode Standard, ISO 10646, and UTF-8.
added bullet points about what FLTK will and won't do.
git-svn-id: file:///fltk/svn/fltk/branches/branch-1.3@6752 ea41ed52-d2ee-0310-a9c1-e6b18d33e121
Diffstat (limited to 'documentation/src')
| -rw-r--r-- | documentation/src/unicode.dox | 148 |
1 files changed, 136 insertions, 12 deletions
diff --git a/documentation/src/unicode.dox b/documentation/src/unicode.dox
index 41733558b..f27551412 100644
--- a/documentation/src/unicode.dox
+++ b/documentation/src/unicode.dox
@@ -1,37 +1,161 @@
 /**
 
- \page unicode Unicode and utf-8 Support
+ \page unicode Unicode and UTF-8 Support
 
 This chapter explains how FLTK handles international
-text via Unicode and utf-8.
+text via Unicode and UTF-8.
 
 Unicode support was only recently added to FLTK and is still
 incomplete. This chapter is Work in Progress, reflecting the
 current state of Unicode support.
 
-\section unicode_about About Unicode and utf-8
+\section unicode_about About Unicode, ISO 10646 and UTF-8
+
+The summary of Unicode, ISO 10646 and UTF-8 given below is
+deliberately brief, and provides just enough information for
+the rest of this chapter.
+For further information, please see:
+- http://www.unicode.org
+- http://www.iso.org
+- http://en.wikipedia.org/wiki/Unicode
+- http://www.cl.cam.ac.uk/~mgk25/unicode.html
+
+\par The Unicode Standard
+
+The Unicode Standard was originally developed by a consortium of mainly
+US computer manufacturers and developers of mult-lingual software.
+It has now become a defacto standard for character encoding,
+and is supported by most of the major computing companies in the world.
+
+Before Unicode, many different systems, on different platforms,
+had been developed for encoding characters for different languages,
+but no single encoding could satisfy all languages.
+Unicode provides access to over 100,000 characters
+used in all the major languages written today,
+and is independent of platform and language.
+
+Unicode also provides higher-level concepts needed for text processing
+and typographic publishing systems, such as algorithms for sorting and
+comparing text, composite character and text rendering, right-to-left
+and bi-directional text handling.
+
+<i>There are currently no plans to add this extra functionality to FLTK.</i>
+
+\par ISO 10646
+
+The International Organisation for Standardization (ISO) had also
+been trying to develop a single unified character set.
+Although both ISO and the Unicode Consortium continue to publish
+their own standards, they have agreed to coordinate their work so
+that specific versions of the Unicode and ISO 10646 standards are
+compatible with each other.
+
+The international standard ISO 10646 defines the
+<b>Universal Character Set</b> (UCS)
+which contains the characters required for almost all known languages.
+The standard also defines three different implementation levels specifying
+how these characters can be combined.
+
+<i>There are currently no plans for handling the different implementation
+levels or the combining characters in FLTK.</i>
+
+In UCS, characters have a unique numerical code and an official name,
+and are usually shown using 'U+' and the code in hexadecimal,
+e.g. U+0041 is the "Latin capital letter A".
+The UCS characters U+0000 to U+007F correspond to US-ASCII,
+and U+0000 to U+00FF correspond to ISO 8859-1 (Latin1).
+The UCS also defines various methods of encoding characters as
+a sequence of bytes.
+
+UCS-2 encodes Unicode characters into two bytes,
+which is wasteful if you are only dealing with ASCII or Latin1 text,
+and insufficient if you need characters above U+00FFFF.
+UCS-4 uses four bytes, which lets it handle higher characters,
+but this is even more wasteful for ASCII or Latin1.
+
+\par UTF-8
+
+The Unicode standard defines various UCS Transformation Formats.
+UTF-16 and UTF-32 are based on units of two and four bytes.
+
+UTF-8 encodes all Unicode characters into variable length
+sequences of bytes. Unicode characters in the 7-bit ASCII
+range map to the same value and are represented as a single byte,
+making the transformation to Unicode quick and easy.
 
-The Unicode Standard is a worldwide accepted charatcer encoding
-standard. Unicode provides access to over 100,000 characters
-used in all the major languages written today.
+All UCS characters above U+007F are encoded as a sequence of
+several bytes. The top bits of the first byte are set to show
+the length of the byte sequence, and subseqent bytes are
+always in the range 0x80 to 8x8F. This combination provides
+some level of synchronisation and error detection.
 
-Utf-8 encodes all Unicode characters into variable length
-sequences of bytes. Unicode characters in the 7-bit ASCII
-range map to the same value in utf-8, making the transformation
-to Unicode quick and easy.
+<table summary="Unicode character byte sequences" align="center">
+<tr>
+  <td>Unicode range</td>
+  <td>Byte sequences</td>
+</tr>
+<tr>
+  <td><tt>U+00000000 - U+0000007F</tt></td>
+  <td><tt>0xxxxxxx</tt></td>
+</tr>
+<tr>
+  <td><tt>U+00000080 - U+000007FF</tt></td>
+  <td><tt>110xxxxx 10xxxxxx</tt></td>
+</tr>
+<tr>
+  <td><tt>U+00000800 - U+0000FFFF</tt></td>
+  <td><tt>1110xxxx 10xxxxxx 10xxxxxx</tt></td>
+</tr>
+<tr>
+  <td><tt>U+00010000 - U+001FFFFF</tt></td>
+  <td><tt>11110xxx 10xxxxxx 10xxxxxx 10xxxxxx</tt></td>
+</tr>
+<tr>
+  <td><tt>U+00200000 - U+03FFFFFF</tt></td>
+  <td><tt>111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx</tt></td>
+</tr>
+<tr>
+  <td><tt>U+04000000 - U+7FFFFFFF</tt></td>
+  <td><tt>1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx</tt></td>
+</tr>
+</table>
 
 Moving from ASCII encoding to Unicode will allow all new FLTK
 applications to be easily internationalized and and used all
-over the world. By choosing utf-8 encoding, FLTK remains
+over the world. By choosing UTF-8 encoding, FLTK remains
 largely source-code compatible to previous iteration of the
 library.
 
 \section unicode_in_fltk Unicode in FLTK
 
-FLTK will be entirely converted to Unicode in utf-8 encoding.
+FLTK will be entirely converted to Unicode in UTF-8 encoding.
 If a different encoding is required by the underlying operatings
 system, FLTK will convert string as needed.
 
+It is important to note that the initial implementation of
+Unicode and UTF-8 in FLTK involves three important areas:
+
+- provision of Unicode character tables and some simple related functions;
+
+- conversion of char* variables and function parameters from single byte
+  per character representation to UTF-8 variable length characters;
+
+- modifications to the display font interface to accept general
+  Unicode character or UCS code numbers instead of just ASCII or Latin1
+  characters.
+
+The current implementation of Unicode / UTF-8 in FLTK will impose
+the following limitations:
+
+- FLTK will only handle single characters, so composed characters
+  consisting of a base character and floating accent characters
+  will be treated as multiple characters;
+
+- FLTK will only compare or sort strings on a byte by byte basis
+  and not on a general Unicode character basis;
+
+- FLTK will not handle right-to-left or bi-directional text;
+
 \par TODO:
 
 \li more doc on unicode, add links
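
The byte-sequence table added by this patch can be illustrated with a short C++ sketch. It is not FLTK code: `encode_utf8` and `count_code_points` are hypothetical names chosen for this example, and the encoder covers only the one- to four-byte rows of the table. Note that the `10xxxxxx` pattern in the table means continuation bytes fall in the range 0x80 to 0xBF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not an FLTK function): encode one UCS code point
// as UTF-8, following the first four rows of the byte-sequence table.
std::string encode_utf8(std::uint32_t ucs) {
    std::string out;
    if (ucs <= 0x7F) {                       // 0xxxxxxx
        out += static_cast<char>(ucs);
    } else if (ucs <= 0x7FF) {               // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (ucs >> 6));
        out += static_cast<char>(0x80 | (ucs & 0x3F));
    } else if (ucs <= 0xFFFF) {              // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (ucs >> 12));
        out += static_cast<char>(0x80 | ((ucs >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (ucs & 0x3F));
    } else if (ucs <= 0x1FFFFF) {            // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (ucs >> 18));
        out += static_cast<char>(0x80 | ((ucs >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((ucs >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (ucs & 0x3F));
    }
    return out;  // empty for values outside the ranges handled here
}

// Hypothetical helper: count code points by skipping continuation
// bytes, i.e. bytes of the form 10xxxxxx.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

Under these assumptions, U+0041 encodes to the single byte 0x41 and U+00E9 to the two bytes 0xC3 0xA9. The listed limitation on composed characters also shows up here: "e" followed by the combining acute accent U+0301 counts as two code points and is a different byte string from the precomposed U+00E9, so a byte-by-byte comparison treats the two as unequal.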
