/**

 \page unicode Unicode and UTF-8 Support

This chapter explains how FLTK handles international text via Unicode and UTF-8.

Unicode support was only recently added to FLTK and is still incomplete.
This chapter is a work in progress, reflecting the current state of Unicode support.

\section unicode_about About Unicode, ISO 10646 and UTF-8

The summary of Unicode, ISO 10646 and UTF-8 given below is deliberately brief and provides just enough information for the rest of this chapter.

For further information, please see:

- http://www.unicode.org
- http://www.iso.org
- http://en.wikipedia.org/wiki/Unicode
- http://www.cl.cam.ac.uk/~mgk25/unicode.html

\par The Unicode Standard

The Unicode Standard was originally developed by a consortium of mainly US computer manufacturers and developers of multi-lingual software. It has now become a de facto standard for character encoding, and is supported by most of the major computing companies in the world.

Before Unicode, many different systems, on different platforms, had been developed for encoding characters for different languages, but no single encoding could satisfy all languages. Unicode provides access to over 100,000 characters used in all the major languages written today, and is independent of platform and language.

Unicode also provides higher-level concepts needed for text processing and typographic publishing systems, such as algorithms for sorting and comparing text, composite character and text rendering, and right-to-left and bi-directional text handling.

There are currently no plans to add this extra functionality to FLTK.

\par ISO 10646

The International Organization for Standardization (ISO) had also been trying to develop a single unified character set. Although both ISO and the Unicode Consortium continue to publish their own standards, they have agreed to coordinate their work so that specific versions of the Unicode and ISO 10646 standards are compatible with each other.

The international standard ISO 10646 defines the Universal Character Set (UCS), which contains the characters required for almost all known languages. The standard also defines three different implementation levels specifying how these characters can be combined.

There are currently no plans for handling the different implementation levels or the combining characters in FLTK.

In UCS, characters have a unique numerical code and an official name, and are usually shown using 'U+' and the code in hexadecimal, e.g. U+0041 is the "Latin capital letter A".

The UCS characters U+0000 to U+007F correspond to US-ASCII, and U+0000 to U+00FF correspond to ISO 8859-1 (Latin1).

The UCS also defines various methods of encoding characters as a sequence of bytes. UCS-2 encodes Unicode characters into two bytes, which is wasteful if you are only dealing with ASCII or Latin1 text, and insufficient if you need characters above U+FFFF. UCS-4 uses four bytes, which lets it handle higher characters, but this is even more wasteful for ASCII or Latin1.

\par UTF-8

The Unicode standard defines various UCS Transformation Formats. UTF-16 and UTF-32 are based on units of two and four bytes. UTF-8 encodes all Unicode characters into variable length sequences of bytes. Unicode characters in the 7-bit ASCII range map to the same value and are represented as a single byte, making the transformation to Unicode quick and easy.

All UCS characters above U+007F are encoded as a sequence of several bytes. The top bits of the first byte are set to show the length of the byte sequence, and subsequent bytes are always in the range 0x80 to 0xBF. This combination provides some level of synchronisation and error detection.
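
<table>
<tr>
  <th>Unicode range</th>
  <th>Byte sequences</th>
</tr>
<tr>
  <td><tt>U+00000000 - U+0000007F</tt></td>
  <td><tt>0xxxxxxx</tt></td>
</tr>
<tr>
  <td><tt>U+00000080 - U+000007FF</tt></td>
  <td><tt>110xxxxx 10xxxxxx</tt></td>
</tr>
<tr>
  <td><tt>U+00000800 - U+0000FFFF</tt></td>
  <td><tt>1110xxxx 10xxxxxx 10xxxxxx</tt></td>
</tr>
<tr>
  <td><tt>U+00010000 - U+001FFFFF</tt></td>
  <td><tt>11110xxx 10xxxxxx 10xxxxxx 10xxxxxx</tt></td>
</tr>
<tr>
  <td><tt>U+00200000 - U+03FFFFFF</tt></td>
  <td><tt>111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx</tt></td>
</tr>
<tr>
  <td><tt>U+04000000 - U+7FFFFFFF</tt></td>
  <td><tt>1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx</tt></td>
</tr>
</table>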
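
As an illustration of the table above, the following is a minimal sketch of an encoder that turns one UCS code point into its UTF-8 byte sequence, together with a helper that reads the sequence length from the top bits of the first byte. Both function names, ucs_to_utf8() and utf8_sequence_length(), are hypothetical and are not part of the FLTK API; the sketch does not call any FLTK function. It follows the table as given, including the five and six byte forms. Note that current UTF-8 definitions (RFC 3629) restrict the encoding to U+10FFFF and at most four bytes, and this sketch does not reject surrogate code points.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not an FLTK function): encode one UCS code point
// as a UTF-8 byte sequence, following the ranges in the table above.
// Returns an empty string for values above U+7FFFFFFF.
std::string ucs_to_utf8(std::uint32_t ucs) {
  std::string out;
  if (ucs < 0x80) {                      // 0xxxxxxx: plain 7-bit ASCII
    out += static_cast<char>(ucs);
    return out;
  }
  if (ucs > 0x7FFFFFFF) return out;      // outside the range of the table

  // Pick the number of continuation bytes and the first-byte marker bits.
  int extra;
  std::uint32_t lead;
  if      (ucs < 0x800)     { extra = 1; lead = 0xC0; }  // 110xxxxx
  else if (ucs < 0x10000)   { extra = 2; lead = 0xE0; }  // 1110xxxx
  else if (ucs < 0x200000)  { extra = 3; lead = 0xF0; }  // 11110xxx
  else if (ucs < 0x4000000) { extra = 4; lead = 0xF8; }  // 111110xx
  else                      { extra = 5; lead = 0xFC; }  // 1111110x

  // First byte: marker bits plus the highest payload bits.
  out += static_cast<char>(lead | (ucs >> (6 * extra)));
  // Continuation bytes: 10xxxxxx, six payload bits each (0x80 to 0xBF).
  for (int i = extra - 1; i >= 0; --i)
    out += static_cast<char>(0x80 | ((ucs >> (6 * i)) & 0x3F));
  return out;
}

// Hypothetical helper: length of a sequence as announced by its first byte.
// Returns 0 for a continuation byte (10xxxxxx) or for 0xFE / 0xFF, which
// never start a sequence. This is what allows resynchronisation.
int utf8_sequence_length(unsigned char first) {
  if (first < 0x80) return 1;
  if (first < 0xC0) return 0;  // continuation byte, not a start byte
  if (first < 0xE0) return 2;
  if (first < 0xF0) return 3;
  if (first < 0xF8) return 4;
  if (first < 0xFC) return 5;
  if (first < 0xFE) return 6;
  return 0;
}
```

Under these assumptions, ucs_to_utf8(0x20AC) yields the three bytes 0xE2 0x82 0xAC, and utf8_sequence_length(0xE2) returns 3.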

Moving from ASCII encoding to Unicode will allow all new FLTK applications to be easily internationalized and used all over the world. By choosing UTF-8 encoding, FLTK remains largely source-code compatible with previous iterations of the library.

\section unicode_in_fltk Unicode in FLTK

FLTK will be entirely converted to Unicode in UTF-8 encoding. If a different encoding is required by the underlying operating system, FLTK will convert strings as needed.

It is important to note that the initial implementation of Unicode and UTF-8 in FLTK involves three important areas:

- provision of Unicode character tables and some simple related functions;

- conversion of char* variables and function parameters from single byte per character representation to UTF-8 variable length characters;

- modifications to the display font interface to accept general Unicode character or UCS code numbers instead of just ASCII or Latin1 characters.

The current implementation of Unicode / UTF-8 in FLTK will impose the following limitations:

- FLTK will only handle single characters, so composed characters consisting of a base character and floating accent characters will be treated as multiple characters;

- FLTK will only compare or sort strings on a byte by byte basis and not on a general Unicode character basis;

- FLTK will not handle right-to-left or bi-directional text.

\par TODO:

\li more doc on unicode, add links
\li write something about filename encoding on OS X...
\li explain the fl_utf8_... commands
\li explain issues with Fl_Preferences
\li why FLTK has no Fl_String class

\par DONE:

\li initial transfer of the Ian/O'ksi'D patch
\li adapted Makefiles and IDEs for available platforms
\li hacked some Unicode keyboard entry for OS X

\par ISSUES:

\li IDEs:
 - Makefile support: tested on Fedora Core 5 and OS X, but heaven knows on which platforms this may fail
 - Xcode: tested, seems to be working (but see comments below on OS X)
 - VisualC (VC6): tested, test/utf8 works, but may have had some issues during merge. Some additional work needed (imm32.lib)
 - VisualStudio2005: tested, test/utf8 works, some additional work needed (imm32.lib)
 - VisualCNet: sorry, I no longer have access to that IDE
 - Borland and other compilers: sorry, I can't update those

\li Platforms:
 - you will encounter problems on all platforms!
 - X11: many characters are missing, but that may be related to bad fonts on my machine. I also could not do any keyboard tests yet. Rendering seems to generally work ok.
 - Win32: US and German keyboard worked ok, but no compositing was tested. Rendering looks pretty good.
 - OS X: rendering looks good. Keyboard is completely messed up, even in US setting (with Alt key)
 - all: while merging I have seen plenty of places that are not entirely utf8-safe, particularly Fl_Input, Fl_Text_Editor, and Fl_Help_View. Keycodes from the keyboard conflict with Unicode characters. Right-to-left rendered text cannot be marked or edited, and probably much more.

\htmlonly
\endhtmlonly */