git-svn-id: file:///fltk/svn/fltk/branches/branch-1.3@8217 ea41ed52-d2ee-0310-a9c1-e6b18d33e121

author: Matthias Melcher <fltk@matthiasm.com> 2011-01-08 16:28:16 +0000
committer: Matthias Melcher <fltk@matthiasm.com> 2011-01-08 16:28:16 +0000
commit: 2dc664935d8109767c2d107c6b644082fe06ac05 (patch)
tree: 6e5e622962a1503161b86884cd3423cb2bba1ab1 /branch-3.0-2011/documentation/src/unicode.dox
parent: f62a6a927a8ce7aa91b023e7aafad9b5ff96f755 (diff)
1 files changed, 520 insertions, 0 deletions
diff --git a/branch-3.0-2011/documentation/src/unicode.dox b/branch-3.0-2011/documentation/src/unicode.dox
new file mode 100644
index 000000000..f1a6e4cf3
--- /dev/null
+++ b/branch-3.0-2011/documentation/src/unicode.dox
@@ -0,0 +1,520 @@
+/**
+
+ \page unicode Unicode and UTF-8 Support
+
+This chapter explains how FLTK handles international 
+text via Unicode and UTF-8.
+
+Unicode support was only recently added to FLTK and is
+still incomplete. This chapter is a work in progress, reflecting
+the current state of Unicode support.
+
+\section unicode_about About Unicode, ISO 10646 and UTF-8
+
+The summary of Unicode, ISO 10646 and UTF-8 given below is
+deliberately brief, and provides just enough information for
+the rest of this chapter.
+For further information, please see:
+- http://www.unicode.org
+- http://www.iso.org
+- http://en.wikipedia.org/wiki/Unicode
+- http://www.cl.cam.ac.uk/~mgk25/unicode.html
+- http://www.apps.ietf.org/rfc/rfc3629.html
+
+\par The Unicode Standard
+
+The Unicode Standard was originally developed by a consortium of mainly
+US computer manufacturers and developers of multi-lingual software.
+It has now become a de facto standard for character encoding,
+and is supported by most of the major computing companies in the world.
+
+Before Unicode, many different systems, on different platforms,
+had been developed for encoding characters for different languages,
+but no single encoding could satisfy all languages.
+Unicode provides access to over 100,000 characters 
+used in all the major languages written today,
+and is independent of platform and language.
+
+Unicode also provides higher-level concepts needed for text processing
+and typographic publishing systems, such as algorithms for sorting and
+comparing text, composite character and text rendering, right-to-left
+and bi-directional text handling.
+
+<i>There are currently no plans to add this extra functionality to FLTK.</i>
+
+\par ISO 10646
+
+The International Organization for Standardization (ISO) had also
+been trying to develop a single unified character set.
+Although both ISO and the Unicode Consortium continue to publish
+their own standards, they have agreed to coordinate their work so
+that specific versions of the Unicode and ISO 10646 standards are
+compatible with each other.
+
+The international standard ISO 10646 defines the
+<b>Universal Character Set</b> (UCS)
+which contains the characters required for almost all known languages.
+The standard also defines three different implementation levels specifying
+how these characters can be combined.
+
+<i>There are currently no plans for handling the different implementation
+levels or the combining characters in FLTK.</i>
+
+In UCS, characters have a unique numerical code and an official name,
+and are usually shown using 'U+' and the code in hexadecimal,
+e.g. U+0041 is the "Latin capital letter A".
+The UCS characters U+0000 to U+007F correspond to US-ASCII,
+and U+0000 to U+00FF correspond to ISO 8859-1 (Latin1).
+
+ISO 10646 was originally designed to handle a 31-bit character set
+from U+00000000 to U+7FFFFFFF, but the current idea is that 21 bits
+will be sufficient for all future needs, giving characters up to
+U+10FFFF.  The complete character set is sub-divided into \e planes.
+<i>Plane 0</i>, also known as the <b>Basic Multilingual Plane</b>
+(BMP), ranges from U+0000 to U+FFFF and consists of the most commonly
+used characters from previous encoding standards. Other planes
+contain characters for specialist applications.
+\todo
+Do we need this info about planes?
+
+The UCS also defines various methods of encoding characters as
+a sequence of bytes.
+UCS-2 encodes Unicode characters into two bytes,
+which is wasteful if you are only dealing with ASCII or Latin1 text,
+and insufficient if you need characters above U+00FFFF.
+UCS-4 uses four bytes, which lets it handle higher characters,
+but this is even more wasteful for ASCII or Latin1.
+
+\par UTF-8
+
+The Unicode standard defines various UCS Transformation Formats.
+UTF-16 and UTF-32 are based on units of two and four bytes.
+UCS characters requiring more than 16 bits are encoded using
+"surrogate pairs" in UTF-16.
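+
+As a minimal sketch of how a surrogate pair is formed, the two helpers
+below split a code point above U+FFFF into a high and a low 16-bit unit
+and join them again. The names \p to_surrogates_sketch() and
+\p from_surrogates_sketch() are hypothetical and are not part of FLTK.

```cpp
#include <cassert>

// Hypothetical helper, not an FLTK function: split a code point in the
// range U+10000..U+10FFFF into a UTF-16 surrogate pair.
void to_surrogates_sketch(unsigned long ucs,
                          unsigned short *hi, unsigned short *lo) {
  assert(hi != nullptr && lo != nullptr);
  assert(ucs >= 0x10000UL && ucs <= 0x10FFFFUL);
  const unsigned long v = ucs - 0x10000UL;             // 20 bits remain
  *hi = static_cast<unsigned short>(0xD800UL + (v >> 10));    // top 10 bits
  *lo = static_cast<unsigned short>(0xDC00UL + (v & 0x3FFUL)); // low 10 bits
}

// Hypothetical helper: join a high and a low surrogate back into one
// code point.
unsigned long from_surrogates_sketch(unsigned short hi, unsigned short lo) {
  assert(hi >= 0xD800 && hi <= 0xDBFF);
  assert(lo >= 0xDC00 && lo <= 0xDFFF);
  return 0x10000UL
       + (static_cast<unsigned long>(hi - 0xD800) << 10)
       + static_cast<unsigned long>(lo - 0xDC00);
}
```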
+
+UTF-8 encodes all Unicode characters into variable length 
+sequences of bytes. Unicode characters in the 7-bit ASCII 
+range map to the same value and are represented as a single byte,
+making the transformation to Unicode quick and easy.
+
+All UCS characters above U+007F are encoded as a sequence of
+several bytes. The top bits of the first byte are set to show
+the length of the byte sequence, and subsequent bytes are
+always in the range 0x80 to 0xBF. This combination provides
+some level of synchronisation and error detection.
+
+<table summary="Unicode character byte sequences" align="center">
+<tr>
+ <td>Unicode range</td>
+ <td>Byte sequences</td>
+</tr>
+<tr>
+ <td><tt>U+00000000 - U+0000007F</tt></td>
+ <td><tt>0xxxxxxx</tt></td>
+</tr>
+<tr>
+ <td><tt>U+00000080 - U+000007FF</tt></td>
+ <td><tt>110xxxxx 10xxxxxx</tt></td>
+</tr>
+<tr>
+ <td><tt>U+00000800 - U+0000FFFF</tt></td>
+ <td><tt>1110xxxx 10xxxxxx 10xxxxxx</tt></td>
+</tr>
+<tr>
+ <td><tt>U+00010000 - U+001FFFFF</tt></td>
+ <td><tt>11110xxx 10xxxxxx 10xxxxxx 10xxxxxx</tt></td>
+</tr>
+<tr>
+ <td><tt>U+00200000 - U+03FFFFFF</tt></td>
+ <td><tt>111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx</tt></td>
+</tr>
+<tr>
+ <td><tt>U+04000000 - U+7FFFFFFF</tt></td>
+ <td><tt>1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx</tt></td>
+</tr>
+</table>
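+
+The first four rows of the table can be sketched in code as follows.
+This is an illustration only, not the implementation of
+%fl_utf8encode(); the name \p encode_utf8_sketch() is hypothetical.
+It stops at U+10FFFF and does not reject UTF-16 surrogate values.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not an FLTK function: encode one code point using
// the first four rows of the table. Writes up to 4 bytes into buf and
// returns the number of bytes written, or 0 if ucs is above U+10FFFF.
std::size_t encode_utf8_sketch(unsigned long ucs, unsigned char *buf) {
  assert(buf != nullptr);
  if (ucs < 0x80UL) {                       // 0xxxxxxx
    buf[0] = static_cast<unsigned char>(ucs);
    return 1;
  }
  if (ucs < 0x800UL) {                      // 110xxxxx 10xxxxxx
    buf[0] = static_cast<unsigned char>(0xC0 | (ucs >> 6));
    buf[1] = static_cast<unsigned char>(0x80 | (ucs & 0x3F));
    return 2;
  }
  if (ucs < 0x10000UL) {                    // 1110xxxx 10xxxxxx 10xxxxxx
    buf[0] = static_cast<unsigned char>(0xE0 | (ucs >> 12));
    buf[1] = static_cast<unsigned char>(0x80 | ((ucs >> 6) & 0x3F));
    buf[2] = static_cast<unsigned char>(0x80 | (ucs & 0x3F));
    return 3;
  }
  if (ucs <= 0x10FFFFUL) {                  // 11110xxx + three 10xxxxxx
    buf[0] = static_cast<unsigned char>(0xF0 | (ucs >> 18));
    buf[1] = static_cast<unsigned char>(0x80 | ((ucs >> 12) & 0x3F));
    buf[2] = static_cast<unsigned char>(0x80 | ((ucs >> 6) & 0x3F));
    buf[3] = static_cast<unsigned char>(0x80 | (ucs & 0x3F));
    return 4;
  }
  return 0;                                 // outside the range handled here
}
```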
+
+Moving from ASCII encoding to Unicode will allow all new FLTK
+applications to be easily internationalized and used all
+over the world. By choosing UTF-8 encoding, FLTK remains
+largely source-code compatible with previous versions of the
+library.
+
+\section unicode_in_fltk Unicode in FLTK
+
+\todo
+Work through the code and this documentation to harmonize
+the [<b>OksiD</b>] and [<b>fltk2</b>] functions.
+
+FLTK will be entirely converted to Unicode using UTF-8 encoding.
+If a different encoding is required by the underlying operating
+system, FLTK will convert the string as needed.
+
+It is important to note that the initial implementation of
+Unicode and UTF-8 in FLTK involves three important areas:
+
+- provision of Unicode character tables and some simple related functions;
+
+- conversion of char* variables and function parameters from single byte
+  per character representation to UTF-8 variable length sequences;
+
+- modifications to the display font interface to accept general
+  Unicode character or UCS code numbers instead of just ASCII or Latin1
+  characters.
+
+The current implementation of Unicode / UTF-8 in FLTK will impose
+the following limitations:
+
+- An implementation note in the [<b>OksiD</b>] code says that all functions
+  are LIMITED to 24 bit Unicode values, but also says that only 16 bits
+  are really used under Linux and Win32.
+  <b>[Can we verify this?]</b>
+  
+- The [<b>fltk2</b>] %fl_utf8encode() and %fl_utf8decode() functions are
+  designed to handle Unicode characters in the range U+000000 to U+10FFFF
+  inclusive, which covers all UTF-16 characters, as specified in RFC 3629.
+  <i>Note that the user must first convert UTF-16 surrogate pairs to UCS.</i>
+
+- FLTK will only handle single characters, so composed characters
+  consisting of a base character and floating accent characters
+  will be treated as multiple characters; 
+
+- FLTK will only compare or sort strings on a byte by byte basis
+  and not on a general Unicode character basis;
+
+- FLTK will not handle right-to-left or bi-directional text;
+  
+  \todo
+  Verify 16/24 bit Unicode limit for different character sets?
+  OksiD's code appears limited to 16-bit whereas the FLTK2 code
+  appears to handle a wider set. What about illegal characters?
+  See comments in %fl_utf8fromwc() and %fl_utf8toUtf16().
+
+\section unicode_illegals Illegal Unicode and UTF-8 sequences
+
+Three pre-processor variables are defined in the source code that
+determine how %fl_utf8decode() handles illegal UTF-8 sequences:
+
+- if ERRORS_TO_CP1252 is set to 1 (the default), %fl_utf8decode() will
+  assume that a byte sequence starting with a byte in the range 0x80
+  to 0x9f represents a Microsoft CP1252 character, and will instead
+  return the value of an equivalent UCS character. Otherwise, it
+  will be processed as an illegal byte value as described below.
+
+- if STRICT_RFC3629 is set to 1 (not the default!) then UTF-8
+  sequences that correspond to illegal UCS values are treated as
+  errors.  Illegal UCS values include those above U+10FFFF, or
+  corresponding to UTF-16 surrogate pairs. Illegal byte values
+  are handled as described below.
+
+- if ERRORS_TO_ISO8859_1 is set to 1 (the default), the illegal
+  byte value is returned unchanged, otherwise 0xFFFD, the Unicode
+  REPLACEMENT CHARACTER, is returned instead.
+
+%fl_utf8encode() is less strict, and only generates the UTF-8
+sequence for 0xFFFD, the Unicode REPLACEMENT CHARACTER, if it is
+asked to encode a UCS value above U+10FFFF.
+
+Many of the [<b>fltk2</b>] functions below use %fl_utf8decode() and
+%fl_utf8encode() in their own implementation, and are therefore
+somewhat protected from bad UTF-8 sequences.
+
+The [<b>OksiD</b>] %fl_utf8len() function assumes that the byte it is
+passed is the first byte in a UTF-8 sequence, and returns the length
+of the sequence. Trailing bytes in a UTF-8 sequence will return -1.
+
+- \b WARNING:
+  %fl_utf8len() cannot distinguish between single
+  bytes representing Microsoft CP1252 characters 0x80-0x9f and
+  those forming part of a valid UTF-8 sequence. You are strongly
+  advised not to use %fl_utf8len() in your own code unless you
+  know that the byte sequence contains only valid UTF-8 sequences.
+
+- \b WARNING:
+  Some of the [OksiD] functions below still use %fl_utf8len() in
+  their implementations. These may need further validation.
+
+Please see the individual function description for further details
+about error handling and return values.
+
+\section unicode_fltk_calls FLTK Unicode and UTF-8 functions
+
+This section currently provides a brief overview of the functions.
+For more details, consult the main text for each function via its link.
+
+int fl_utf8locale()
+  \b FLTK2
+  <br>
+\par
+\p %fl_utf8locale() returns true if the "locale" seems to indicate
+that UTF-8 encoding is used.
+\par
+<i>It is highly recommended that you change your system so this does return
+true!</i>
+
+
+int fl_utf8test(const char *src, unsigned len)
+  \b FLTK2
+  <br>
+\par
+\p %fl_utf8test() examines the first \p len bytes of \p src.
+It returns 0 if there are any illegal UTF-8 sequences;
+1 if \p src contains plain ASCII or if \p len is zero;
+or 2, 3 or 4 to indicate the range of Unicode characters found.
+
+
+int fl_utf_nb_char(const unsigned char *buf, int len)
+  \b OksiD
+  <br>
+\par
+Returns the number of UTF-8 characters in the first \p len bytes of \p buf.
+
+
+int fl_unichar_to_utf8_size(Fl_Unichar)
+  <br>
+int fl_utf8bytes(unsigned ucs)
+  <br>
+\par
+Returns the number of bytes needed to encode \p ucs in UTF-8.
+
+
+int fl_utf8len(char c)
+  \b OksiD
+  <br>
+\par
+If \p c is a valid first byte of a UTF-8 encoded character sequence,
+\p %fl_utf8len() will return the number of bytes in that sequence.
+It returns -1 if \p c is not a valid first byte.
+
+
+unsigned int fl_nonspacing(unsigned int ucs)
+  \b OksiD
+  <br>
+\par
+Returns true if \p ucs is a non-spacing character.
+<b>[What are non-spacing characters?]</b>
+
+
+const char* fl_utf8back(const char *p, const char *start, const char *end)
+  \b FLTK2
+  <br>
+const char* fl_utf8fwd(const char *p, const char *start, const char *end)
+  \b FLTK2
+  <br>
+\par
+If \p p already points to the start of a UTF-8 character sequence,
+these functions will return \p p.
+Otherwise \p %fl_utf8back() searches backwards from \p p
+and \p %fl_utf8fwd() searches forwards from \p p,
+within the \p start and \p end limits,
+looking for the start of a UTF-8 character.
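+
+The backward search can be sketched as stepping over trailing bytes
+(bit pattern 10xxxxxx) until a byte that can start a character is found.
+The helper \p back_to_start_sketch() is hypothetical and performs no
+validation of the sequence it lands on, so it should not be read as the
+implementation of %fl_utf8back().

```cpp
#include <cassert>

// Hypothetical helper, not fl_utf8back() itself: move p backwards until it
// no longer points at a trailing byte, without going before start.
// p must point at a readable byte inside the buffer.
const char *back_to_start_sketch(const char *p, const char *start) {
  assert(p != nullptr && start != nullptr && p >= start);
  while (p > start && (static_cast<unsigned char>(*p) & 0xC0) == 0x80) {
    --p;                                // 10xxxxxx: still inside a character
  }
  return p;
}
```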
+
+
+unsigned int fl_utf8decode(const char *p, const char *end, int *len)
+  \b FLTK2
+  <br>
+int fl_utf8encode(unsigned ucs, char *buf)
+  \b FLTK2
+  <br>
+\par
+\p %fl_utf8decode() attempts to decode the UTF-8 character that starts
+at \p p and may not extend past \p end.
+It returns the Unicode value, and the length of the UTF-8 character sequence
+is returned via the \p len argument.
+\p %fl_utf8encode() writes the UTF-8 encoding of \p ucs into \p buf
+and returns the number of bytes in the sequence.
+See the main documentation for the treatment of illegal Unicode
+and UTF-8 sequences.
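+
+A minimal decoder with the same calling shape is sketched below. It is
+not %fl_utf8decode(): the name \p decode_utf8_sketch() is hypothetical,
+only one to four byte sequences are handled, overlong forms and surrogate
+values are not rejected, and every error simply yields U+FFFD with a
+length of 1 rather than following the pre-processor settings described
+earlier.

```cpp
#include <cassert>

// Hypothetical decoder sketch. Decodes the character starting at p without
// reading at or beyond end. Stores the number of bytes consumed in *len and
// returns the code point, or U+FFFD with *len == 1 on malformed input.
unsigned long decode_utf8_sketch(const unsigned char *p,
                                 const unsigned char *end, int *len) {
  assert(p != nullptr && end != nullptr && len != nullptr && p < end);
  const unsigned char c = p[0];
  int n = 0;
  unsigned long ucs = 0;
  if (c < 0x80)                { n = 1; ucs = c; }
  else if ((c & 0xE0) == 0xC0) { n = 2; ucs = c & 0x1F; }
  else if ((c & 0xF0) == 0xE0) { n = 3; ucs = c & 0x0F; }
  else if ((c & 0xF8) == 0xF0) { n = 4; ucs = c & 0x07; }
  else { *len = 1; return 0xFFFDUL; }   // trailing byte or longer form

  if (end - p < n) { *len = 1; return 0xFFFDUL; }   // truncated sequence

  for (int i = 1; i < n; ++i) {
    if ((p[i] & 0xC0) != 0x80) { *len = 1; return 0xFFFDUL; }
    ucs = (ucs << 6) | (p[i] & 0x3FUL); // append six payload bits
  }
  *len = n;
  return ucs;
}
```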
+
+
+unsigned int fl_utf8froma(char *dst, unsigned dstlen, const char *src, unsigned srclen)
+  \b FLTK2
+  <br>
+unsigned int fl_utf8toa(const char *src, unsigned srclen, char *dst, unsigned dstlen)
+  \b FLTK2
+  <br>
+\par
+\p %fl_utf8froma() converts a character string containing single bytes
+per character (i.e. ASCII or ISO-8859-1) into UTF-8.
+If the \p src string contains only ASCII characters, the return value will
+be the same as \p srclen.
+\par
+\p %fl_utf8toa() converts a string containing UTF-8 characters into
+single byte characters. UTF-8 characters that do not correspond to ASCII
+or ISO-8859-1 characters, i.e. those above 0xFF, are replaced with '?'.
+
+\par
+Both functions return the number of bytes that would be written, not
+counting the null terminator.
+\p dstlen provides a means of limiting the number of bytes written,
+so setting \p dstlen to zero is a means of measuring how much storage
+would be needed before doing the real conversion.
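+
+The measure-then-convert pattern can be sketched without FLTK as shown
+below. \p latin1_to_utf8_sketch() is a hypothetical stand-in that follows
+the contract described above (it returns the full size needed and writes
+at most \p dstlen bytes including the terminator); it is an assumption
+for illustration and not the code of %fl_utf8froma().

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in, NOT fl_utf8froma(): convert ISO-8859-1 bytes to
// UTF-8. Returns the number of bytes a full conversion needs, not counting
// the terminator. Writes at most dstlen bytes, terminator included.
unsigned latin1_to_utf8_sketch(char *dst, unsigned dstlen,
                               const char *src, unsigned srclen) {
  assert(src != nullptr || srclen == 0);
  unsigned need = 0;                    // size of the complete result
  unsigned out = 0;                     // bytes actually written so far
  bool room = (dst != nullptr);
  for (unsigned i = 0; i < srclen; ++i) {
    const unsigned char c = static_cast<unsigned char>(src[i]);
    const unsigned n = (c < 0x80) ? 1u : 2u;
    // Write a character only if it and the terminator both still fit.
    if (room && out + n < dstlen) {
      if (n == 1) {
        dst[out++] = static_cast<char>(c);
      } else {
        dst[out++] = static_cast<char>(0xC0 | (c >> 6));
        dst[out++] = static_cast<char>(0x80 | (c & 0x3F));
      }
    } else {
      room = false;                     // stop writing, keep measuring
    }
    need += n;
  }
  if (dst != nullptr && dstlen > 0) dst[out] = '\0';
  return need;
}

// Two-pass use: measure with dstlen == 0, then allocate and convert.
std::string latin1_to_utf8_string(const std::string &in) {
  const unsigned srclen = static_cast<unsigned>(in.size());
  const unsigned need = latin1_to_utf8_sketch(nullptr, 0, in.data(), srclen);
  std::vector<char> buf(need + 1);      // room for the terminator
  latin1_to_utf8_sketch(buf.data(), need + 1, in.data(), srclen);
  return std::string(buf.data(), need);
}
```

+The first call only measures, so no buffer is needed; the second call
+receives a buffer one byte larger than the measured size so that the
+terminator fits.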
+
+
+char* fl_utf2mbcs(const char *src)
+  \b OksiD
+  <br>
+\par
+Converts a UTF-8 string to a local multi-byte character string.
+<b>[More info required here!]</b>
+
+unsigned int fl_utf8fromwc(char *dst, unsigned dstlen, const wchar_t *src, unsigned srclen)
+  \b FLTK2
+  <br>
+unsigned int fl_utf8towc(const char *src, unsigned srclen, wchar_t *dst, unsigned dstlen)
+  \b FLTK2
+  <br>
+unsigned int fl_utf8toUtf16(const char *src, unsigned srclen, unsigned short *dst, unsigned dstlen)
+  \b FLTK2
+  <br>
+\par
+These routines convert between UTF-8 and \p wchar_t or "wide character"
+strings.
+The difficulty lies in the fact that \p sizeof(wchar_t) is 2 on Windows
+and 4 on Linux and most other systems.
+Therefore some "wide characters" on Windows may be represented
+as "surrogate pairs" of more than one \p wchar_t.
+
+\par
+\p %fl_utf8fromwc() converts from a "wide character" string to UTF-8.
+Note that \p srclen is the number of \p wchar_t elements in the source
+string, and on Windows this might be larger than the number of characters.
+\p dstlen specifies the maximum number of \b bytes to copy, including
+the null terminator.
+
+\par
+\p %fl_utf8towc() converts a UTF-8 string into a "wide character" string.
+Note that on Windows, some "wide characters" might result in "surrogate
+pairs" and therefore the return value might be more than the number of
+characters.
+\p dstlen specifies the maximum number of \b wchar_t elements to copy,
+including a zero terminating element.
+<b>[Is this all worded correctly?]</b>
+
+\par
+\p %fl_utf8toUtf16() converts a UTF-8 string into a "wide character"
+string using UTF-16 encoding to handle the "surrogate pairs" on Windows.
+\p dstlen specifies the maximum number of \b 16-bit elements to copy,
+including a zero terminating element.
+<b>[Is this all worded correctly?]</b>
+
+\par
+These routines all return the number of elements that would be required
+for a full conversion of the \p src string, including the zero terminator.
+Therefore setting \p dstlen to zero is a way of measuring how much storage
+would be needed before doing the real conversion.
+
+
+unsigned int fl_utf8from_mb(char *dst, unsigned dstlen, const char *src, unsigned srclen)
+  \b FLTK2
+  <br>
+unsigned int fl_utf8to_mb(const char *src, unsigned srclen, char *dst, unsigned dstlen)
+  \b FLTK2
+  <br>
+\par
+These functions convert between UTF-8 and the locale-specific multi-byte
+encodings used on some systems for filenames, etc.
+If fl_utf8locale() returns true, these functions don't do anything useful.
+<b>[Is this all worded correctly?]</b>
+
+
+int fl_tolower(unsigned int ucs)
+  \b OksiD
+  <br>
+int fl_toupper(unsigned int ucs)
+  \b OksiD
+  <br>
+int fl_utf_tolower(const unsigned char *str, int len, char *buf)
+  \b OksiD
+  <br>
+int fl_utf_toupper(const unsigned char *str, int len, char *buf)
+  \b OksiD
+  <br>
+\par
+\p %fl_tolower() and \p %fl_toupper() convert a single Unicode character
+from upper to lower case, and vice versa.
+\p %fl_utf_tolower() and \p %fl_utf_toupper() convert a string of bytes,
+some of which may be multi-byte UTF-8 encodings of Unicode characters,
+from upper to lower case, and vice versa.
+\par
+Warning: to be safe, \p buf length must be at least \p 3*len
+[for 16-bit Unicode]
+
+
+int fl_utf_strcasecmp(const char *s1, const char *s2)
+  \b OksiD
+  <br>
+int fl_utf_strncasecmp(const char *s1, const char *s2, int n)
+  \b OksiD
+  <br>
+\par
+\p %fl_utf_strcasecmp() is a UTF-8 aware string comparison function that
+converts the strings to lower case Unicode as part of the comparison.
+\p %fl_utf_strncasecmp() only compares the first \p n characters [bytes?]
+
+
+\section unicode_system_calls FLTK Unicode versions of system calls
+
+- int fl_access(const char* f, int mode)
+  \b OksiD
+- int fl_chmod(const char* f, int mode)
+  \b OksiD
+- int fl_execvp(const char* file, char* const* argv)
+  \b OksiD
+- FILE* fl_fopen(const char* f, const char* mode)
+  \b OksiD
+- char* fl_getcwd(char* buf, int maxlen)
+  \b OksiD
+- char* fl_getenv(const char* name)
+  \b OksiD
+- char fl_make_path(const char* path)	- returns char ?
+  \b OksiD
+- void fl_make_path_for_file(const char* path)
+  \b OksiD
+- int fl_mkdir(const char* f, int mode)
+  \b OksiD
+- int fl_open(const char* f, int o, ...)
+  \b OksiD
+- int fl_rename(const char* f, const char* t)
+  \b OksiD
+- int fl_rmdir(const char* f)
+  \b OksiD
+- int fl_stat(const char* path, struct stat* buffer)
+  \b OksiD
+- int fl_system(const char* f)
+  \b OksiD
+- int fl_unlink(const char* f)
+  \b OksiD
+
+\par TODO:
+
+\li more doc on unicode, add links
+\li write something about filename encoding on OS X...
+\li explain the fl_utf8_... commands
+\li explain issues with Fl_Preferences
+\li why FLTK has no Fl_String class
+
+\htmlonly
+<hr>
+<table summary="navigation bar" width="100%" border="0">
+<tr>
+  <td width="45%" align="LEFT">
+    <a class="el" href="advanced.html">
+    [Prev]
+    Advanced FLTK
+    </a>
+  </td>
+  <td width="10%" align="CENTER">
+    <a class="el" href="main.html">[Index]</a>
+  </td>
+  <td width="45%" align="RIGHT">
+    <a class="el" href="enumerations.html">
+    FLTK Enumerations
+    [Next]
+    </a>
+  </td>
+</tr>
+</table>
+\endhtmlonly
+
+*/