Diffstat (limited to 'documentation')
-rw-r--r-- documentation/src/unicode.dox 850
1 files changed, 344 insertions, 506 deletions
diff --git a/documentation/src/unicode.dox b/documentation/src/unicode.dox
index 31bae7756..ff4702186 100644
--- a/documentation/src/unicode.dox
+++ b/documentation/src/unicode.dox
@@ -2,513 +2,351 @@
\page unicode Unicode and UTF-8 Support
-This chapter explains how FLTK handles international
-text via Unicode and UTF-8.
-
-Unicode support was added to FLTK starting with version 1.3.0 and is
-still incomplete but mostly functional. This chapter is Work in Progress,
-reflecting the current state of Unicode support.
-
-\section unicode_about About Unicode, ISO 10646 and UTF-8
-
-The summary of Unicode, ISO 10646 and UTF-8 given below is
-deliberately brief and provides just enough information for
-the rest of this chapter.
-
-For further information, please see:
-- https://unicode.org
-- https://iso.org
-- https://en.wikipedia.org/wiki/Unicode
-- https://www.cl.cam.ac.uk/~mgk25/unicode.html
-- https://tools.ietf.org/html/rfc3629
-
-
-\par The Unicode Standard
-
-The Unicode Standard was originally developed by a consortium of mainly
-US computer manufacturers and developers of multi-lingual software.
-It has now become a defacto standard for character encoding
-and is supported by most of the major computing companies in the world.
-
-Before Unicode, many different systems, on different platforms,
-had been developed for encoding characters for different languages,
-but no single encoding could satisfy all languages.
-Unicode provides access to over 130,000 characters
-used in all the major languages written today,
-and is independent of platform and language.
-
-Unicode also provides higher-level concepts needed for text processing
-and typographic publishing systems, such as algorithms for sorting and
-comparing text, composite character and text rendering, right-to-left
-and bi-directional text handling.
-
-\note There are currently no plans to add this extra functionality to FLTK.
-
-
-\par ISO 10646
-
-The International Organisation for Standardization (ISO) had also
-been trying to develop a single unified character set.
-Although both ISO and the Unicode Consortium continue to publish
-their own standards, they have agreed to coordinate their work so
-that specific versions of the Unicode and ISO 10646 standards are
-compatible with each other.
-
-The international standard ISO 10646 defines the
-<b>Universal Character Set</b> (UCS)
-which contains the characters required for almost all known languages.
-The standard also defines three different implementation levels specifying
-how these characters can be combined.
-
-\note There are currently no plans for handling the different implementation
-levels or the combining characters in FLTK.
-
-In UCS, characters have a unique numerical code and an official name,
-and are usually shown using 'U+' and the code in hexadecimal,
-e.g. U+0041 is the "Latin capital letter A".
-The UCS characters U+0000 to U+007F correspond to US-ASCII,
-and U+0000 to U+00FF correspond to ISO 8859-1 (Latin1).
-
-ISO 10646 was originally designed to handle a 31-bit character set
-from U+00000000 to U+7FFFFFFF, but the current idea is that 21 bits
-will be sufficient for all future needs, giving characters up to
-U+10FFFF. The complete character set is sub-divided into \e planes.
-<i>Plane 0</i>, also known as the <b>Basic Multilingual Plane</b>
-(BMP), ranges from U+0000 to U+FFFD and consists of the most commonly
-used characters from previous encoding standards. Other planes
-contain characters for specialist applications.
-
-\todo FLTK 1.3 and later supports the full Unicode range (21 bits), but
- there are a few exceptions, for instance binary shortcut values in menus
- (\ref Fl_Shortcut) can only be used with characters from the BMP (16 bits).
- This may be extended in a future FLTK version.
-
-The UCS also defines various methods of encoding characters as
-a sequence of bytes.
-UCS-2 encodes Unicode characters into two bytes,
-which is wasteful if you are only dealing with ASCII or Latin1 text,
-and insufficient if you need characters above U+00FFFF.
-UCS-4 uses four bytes, which lets it handle higher characters,
-but this is even more wasteful for ASCII or Latin1.
-
-\par UTF-8
-
-The Unicode standard defines various UCS Transformation Formats (UTF).
-UTF-16 and UTF-32 are based on units of two and four bytes.
-UCS characters requiring more than 16 bits are encoded using
-"surrogate pairs" in UTF-16.
-
-UTF-8 encodes all Unicode characters into variable length
-sequences of bytes. Unicode characters in the 7-bit ASCII
-range map to the same value and are represented as a single byte,
-making the transformation to Unicode quick and easy.
-
-All UCS characters above U+007F are encoded as a sequence of
-several bytes. The top bits of the first byte are set to show
-the length of the byte sequence, and subseqent bytes are
-always in the range 0x80 to 0xBF. This combination provides
-some level of synchronisation and error detection.
-
-\par
-
-<table summary="Unicode character byte sequences" align="center">
-<tr>
- <td>Unicode range</td>
- <td>Byte sequences</td>
-</tr>
-<tr>
- <td><tt>U+00000000 - U+0000007F</tt></td>
- <td><tt>0xxxxxxx</tt></td>
-</tr>
-<tr>
- <td><tt>U+00000080 - U+000007FF</tt></td>
- <td><tt>110xxxxx 10xxxxxx</tt></td>
-</tr>
-<tr>
- <td><tt>U+00000800 - U+0000FFFF</tt></td>
- <td><tt>1110xxxx 10xxxxxx 10xxxxxx</tt></td>
-</tr>
-<tr>
- <td><tt>U+00010000 - U+001FFFFF</tt></td>
- <td><tt>11110xxx 10xxxxxx 10xxxxxx 10xxxxxx</tt></td>
-</tr>
-<tr>
- <td><tt>U+00200000 - U+03FFFFFF</tt></td>
- <td><tt>111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx</tt></td>
-</tr>
-<tr>
- <td><tt>U+04000000 - U+7FFFFFFF</tt></td>
- <td><tt>1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx</tt></td>
-</tr>
-</table>
-
-\note This table contains theoretical values outside the valid Unicode
- range (<tt>U+000000 - U+10FFFF</tt>). Such values can only be returned by
- conversion functions for illegal input values (see \ref unicode_illegals).
-
-
-\par
-
-Moving from ASCII encoding to Unicode will allow all new FLTK
-applications to be easily internationalized and used all over
-the world. By choosing UTF-8 encoding, FLTK remains largely
-source-code compatible to previous iterations of the library.
-
-\section unicode_in_fltk Unicode in FLTK
-
-\todo
-Work through the code and this documentation to harmonize
-the [<b>OksiD</b>] and [<b>fltk2</b>] functions.
-
-FLTK will be entirely converted to Unicode using UTF-8 encoding.
-If a different encoding is required by the underlying operating
-system, FLTK will convert the string as needed.
-
-It is important to note that the initial implementation of
-Unicode and UTF-8 in FLTK involves three important areas:
-
-- provision of Unicode character tables and some simple related functions;
-
-- conversion of char* variables and function parameters from single byte
- per character representation to UTF-8 variable length sequences;
-
-- modifications to the display font interface to accept general
- Unicode character or UCS code numbers instead of just ASCII or Latin1
- characters.
-
-The current implementation of Unicode / UTF-8 in FLTK will impose
-the following limitations:
-
-- An implementation note in the [<b>OksiD</b>] code says that all functions
- are LIMITED to 24 bit Unicode values, but also says that only 16 bits
- are really used under linux and win32.
- <b>[Can we verify this?]</b>
-
-- The [<b>fltk2</b>] %fl_utf8encode() and %fl_utf8decode() functions are
- designed to handle Unicode characters in the range U+000000 to U+10FFFF
- inclusive, which covers all UTF-16 characters, as specified in RFC 3629.
- <i>Note that the user must first convert UTF-16 surrogate pairs to UCS.</i>
-
-- FLTK will only handle single characters, so composed characters
- consisting of a base character and floating accent characters
- will be treated as multiple characters.
-
-- FLTK will only compare or sort strings on a byte by byte basis
- and not on a general Unicode character basis.
-
-- FLTK will not handle right-to-left or bi-directional text.
-
- \todo
- Verify 16/24 bit Unicode limit for different character sets?
- OksiD's code appears limited to 16-bit whereas the FLTK2 code
- appears to handle a wider set. What about illegal characters?
- See comments in %fl_utf8fromwc() and %fl_utf8toUtf16().
-
-\section unicode_illegals Illegal Unicode and UTF-8 Sequences
-
-Three pre-processor variables are defined in the source code [1] that
-determine how %fl_utf8decode() handles illegal UTF-8 sequences:
-
-- if ERRORS_TO_CP1252 is set to 1 (the default), %fl_utf8decode() will
- assume that a byte sequence starting with a byte in the range 0x80
- to 0x9f represents a Microsoft CP1252 character, and will return
- the value of an equivalent UCS character. Otherwise, it will be
- processed as an illegal byte value as described below.
-
-- if STRICT_RFC3629 is set to 1 (not the default!) then UTF-8
- sequences that correspond to illegal UCS values are treated as
- errors. Illegal UCS values include those above U+10FFFF, or
- corresponding to UTF-16 surrogate pairs. Illegal byte values
- are handled as described below.
-
-- if ERRORS_TO_ISO8859_1 is set to 1 (the default), the illegal
- byte value is returned unchanged, otherwise 0xFFFD, the Unicode
- REPLACEMENT CHARACTER, is returned instead.
-
-[1] Since FLTK 1.3.4 you may set these three pre-processor variables on
- your compile command line with -D"variable=value" (value: 0 or 1)
- to avoid editing the source code.
+FLTK provides comprehensive Unicode support through UTF-8 encoding, allowing your applications to handle international text and be easily localized for users worldwide.
+
+\section unicode_overview Overview
+
+Starting with version 1.3.0, FLTK uses UTF-8 as its primary text encoding. This means:
+- All text in FLTK is expected to be UTF-8 encoded
+- Your application can display text in any language
+- File operations work correctly with international filenames
+- Most existing ASCII code continues to work unchanged
+
+\note Unicode support in FLTK is functional but still evolving. Some advanced features like bidirectional text and complex script shaping are not yet implemented.
+
+\section unicode_quick_start Quick Start
+
+For most applications, you simply need to ensure your text is UTF-8 encoded:
+
+\code
+// These all work automatically with UTF-8:
+Fl_Window window(400, 300, "Hello 世界"); // Mixed ASCII and Chinese
+button->label("Café"); // Accented characters
+fl_fopen("документ.txt", "r"); // Cyrillic filename
+\endcode
+
+\section unicode_background What is Unicode and UTF-8?
+
+__Unicode__ is a standard that assigns a unique number to every character used in human languages - from Latin letters to Chinese characters to emoji. Each character has a "code point" like U+0041 for 'A' or U+4E2D for '中'.
+
+__UTF-8__ is a way to store Unicode characters as bytes. It's backward-compatible with ASCII and efficient for most text:
+- ASCII characters (like 'A') use 1 byte
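+- Characters from U+0080 to U+07FF (accented Latin letters, Greek, Cyrillic, Hebrew, Arabic) use 2 bytes
+- Most other characters in the Basic Multilingual Plane (Chinese, Japanese, Korean, Indic scripts, etc.) use 3 bytes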
+- Rare characters and emoji may use 4 bytes
+
+FLTK chose UTF-8 because it works well with existing C string functions and doesn't break legacy ASCII code.
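
The byte counts in the list above follow directly from the code point ranges that UTF-8 defines. A minimal standalone sketch of that rule, assuming a hypothetical helper named `utf8_sequence_length` that is not part of FLTK (in FLTK code, fl_utf8bytes() is the function to use, and its treatment of invalid values may differ from this sketch):

```cpp
#include <cstdint>

// Hypothetical helper, not an FLTK function: number of bytes UTF-8 needs
// for one code point, or 0 if the value is not a valid Unicode scalar value.
int utf8_sequence_length(std::uint32_t code_point) {
    if (code_point <= 0x7F) return 1;    // ASCII
    if (code_point <= 0x7FF) return 2;   // accented Latin, Greek, Cyrillic, Hebrew, Arabic
    if (code_point >= 0xD800 && code_point <= 0xDFFF) return 0;  // UTF-16 surrogates
    if (code_point <= 0xFFFF) return 3;  // rest of the Basic Multilingual Plane
    if (code_point <= 0x10FFFF) return 4; // supplementary planes (e.g. emoji)
    return 0;                            // beyond the Unicode range
}
```

For example, `utf8_sequence_length(0x4E2D)` yields 3 for '中', matching the 3-byte encoding used in the snippets in this chapter.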
+
+\section unicode_functions Unicode Functions in FLTK
+
+\subsection unicode_validation Text Validation and Analysis
+
+Functions to check and analyze UTF-8 text:
+
+fl_utf8test() - Check if a string contains valid UTF-8
+\code
+const char* text = "Hello 世界";
+int result = fl_utf8test(text, strlen(text));
+// Returns: 0=invalid, 1=ASCII, 2=2-byte chars, 3=3-byte chars, 4=4-byte chars
+\endcode
+
+fl_utf8len() - Get the byte length of a UTF-8 character
+\code
+char ch = '\xE4'; // First byte of a 3-byte UTF-8 sequence
+int len = fl_utf8len(ch); // Returns 3 (or -1 if invalid)
+\endcode
+
+fl_utf8locale() - Check if system uses UTF-8 encoding
+\code
+if (fl_utf8locale()) {
+ // System uses UTF-8, no conversion needed
+} else {
+ // May need to convert from local encoding
+}
+\endcode
+
+fl_utf_nb_char() - Count UTF-8 characters in a buffer
+\code
+const char* text = "Hello 世界";
+int char_count = fl_utf_nb_char((unsigned char*)text, strlen(text));
+// Returns 8 (number of characters, not bytes)
+\endcode
+
+fl_utf8bytes() / fl_unichar_to_utf8_size() - Get bytes needed for Unicode character
+\code
+unsigned int unicode_char = 0x4E2D; // Chinese character '中'
+int bytes_needed = fl_utf8bytes(unicode_char); // Returns 3
+\endcode
+
+fl_nonspacing() - Check if character is non-spacing (combining character)
+\code
+unsigned int accent = 0x0300; // Combining grave accent
+if (fl_nonspacing(accent)) {
+ // This is a combining character, doesn't take visual space
+}
+\endcode
+
+\subsection unicode_conversion Text Conversion
+
+Functions to convert between encodings:
+
+fl_utf8decode() / fl_utf8encode() - Convert between UTF-8 and Unicode values
+\code
+// Decode UTF-8 to Unicode code point
+const char* utf8_char = "中";
+int len;
+unsigned int unicode = fl_utf8decode(utf8_char, utf8_char + 3, &len);
+// unicode = 0x4E2D, len = 3
+
+// Encode Unicode back to UTF-8
+char buffer[5];
+int bytes = fl_utf8encode(0x4E2D, buffer); // Returns 3
+buffer[bytes] = '\0'; // Now buffer contains "中"
+\endcode
+
+fl_utf8froma() / fl_utf8toa() - Convert between UTF-8 and single-byte encodings
+\code
+// Convert ISO-8859-1 to UTF-8 (0xE9 is 'é' in ISO-8859-1)
+char utf8_buffer[200];
+fl_utf8froma(utf8_buffer, sizeof(utf8_buffer), "caf\xE9", 4);
+
+// Convert UTF-8 to single-byte (non-representable chars become '?')
+char ascii_buffer[100];
+fl_utf8toa("café", 5, ascii_buffer, sizeof(ascii_buffer));
+\endcode
+
+fl_utf8fromwc() / fl_utf8towc() - Convert between UTF-8 and wide characters
+\code
+// Convert wide string to UTF-8
+wchar_t wide_text[] = L"Hello 世界";
+char utf8_buffer[100];
+fl_utf8fromwc(utf8_buffer, sizeof(utf8_buffer), wide_text, wcslen(wide_text));
+
+// Convert UTF-8 to wide string
+const char* utf8_text = "Hello 世界";
+wchar_t wide_buffer[50];
+fl_utf8towc(utf8_text, strlen(utf8_text), wide_buffer, 50);
+\endcode
+
+fl_utf8toUtf16() - Convert UTF-8 to UTF-16
+\code
+const char* utf8_text = "Hello 世界";
+unsigned short utf16_buffer[100];
+unsigned int result = fl_utf8toUtf16(utf8_text, strlen(utf8_text),
+ utf16_buffer, 100);
+// Converts to UTF-16; code points above U+FFFF become surrogate pairs
+\endcode
+
+fl_utf2mbcs() - Convert UTF-8 to local multibyte encoding
+\code
+const char* utf8_text = "Hello 世界";
+char* local_text = fl_utf2mbcs(utf8_text);
+// Converts to system's local encoding (Windows CP, etc.)
+// The result is managed by FLTK (on some platforms it is the input pointer
+// itself), so do not free() it; copy it if it must outlive the next call.
+\endcode
+
+fl_utf8from_mb() / fl_utf8to_mb() - Convert between UTF-8 and local multibyte
+\code
+// Convert from local multibyte to UTF-8
+char utf8_buffer[200];
+fl_utf8from_mb(utf8_buffer, sizeof(utf8_buffer), local_text, strlen(local_text));
+
+// Convert from UTF-8 to local multibyte
+char local_buffer[200];
+fl_utf8to_mb(utf8_text, strlen(utf8_text), local_buffer, sizeof(local_buffer));
+\endcode
+
+\subsection unicode_navigation Text Navigation
+
+Functions to move through UTF-8 text safely:
+
+fl_utf8back() / fl_utf8fwd() - Find character boundaries
+\code
+const char* text = "Café";
+const char* start = text;
+const char* end = text + strlen(text);
+const char* e_pos = text + 3; // Points to 'é'
+
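+// Both functions return p unchanged if it already points to the start of a
+// character, so step one byte first to reach a neighbouring character.
+const char* f_pos = fl_utf8back(e_pos - 1, start, end); // Points to 'f'
+
+// Move to next character
+const char* next_pos = fl_utf8fwd(e_pos + 1, start, end); // Points after 'é'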
+\endcode
+
+\subsection unicode_string_ops String Operations
+
+UTF-8 aware string functions:
+
+fl_utf8strlen() - Count UTF-8 characters (not bytes)
+\code
+const char* text = "Café"; // 5 bytes, 4 characters
+int chars = fl_utf8strlen(text); // Returns 4
+int bytes = strlen(text); // Returns 5
+\endcode
+
+fl_utf_strcasecmp() / fl_utf_strncasecmp() - Compare strings ignoring case
+\code
+int result = fl_utf_strcasecmp("Café", "CAFÉ"); // Returns 0 (equal)
+int result2 = fl_utf_strncasecmp("Café", "CAFÉ", 2); // Compare first 2 chars
+\endcode
+
+fl_tolower() / fl_toupper() - Convert case for individual Unicode characters
+\code
+unsigned int lower_a = fl_tolower(0x41); // 'A' -> 'a' (0x61)
+unsigned int upper_e = fl_toupper(0xE9); // 'é' -> 'É' (0xC9)
+\endcode
+
+fl_utf_tolower() / fl_utf_toupper() - Convert case for UTF-8 strings
+\code
+const char* text = "Café";
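+char lower_buffer[20]; // to be safe, at least 3 * len bytes
+int n = fl_utf_tolower((unsigned char*)text, strlen(text), lower_buffer);
+lower_buffer[n] = '\0'; // terminate explicitly; n is the converted length in bytes
+// lower_buffer now contains "café"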
+\endcode
+
+\subsection unicode_file_ops File Operations
+
+Cross-platform file functions that handle UTF-8 filenames correctly:
+
+__Basic file operations:__
+\code
+// These work with international filenames on all platforms:
+FILE* f = fl_fopen("测试文件.txt", "r"); // Open file
+int fd = fl_open("документ.bin", O_RDONLY); // Open with file descriptor
+struct stat stat_buf;
+int result = fl_stat("файл.dat", &stat_buf); // Get file info
+\endcode
+
+__File access and properties:__
+\code
+fl_access("测试文件.txt", R_OK); // Check if file is readable
+fl_chmod("文档.dat", 0644); // Change file permissions
+fl_unlink("临时文件.tmp"); // Delete file
+fl_rename("旧名.txt", "新名.txt"); // Rename file
+\endcode
+
+__Directory operations:__
+\code
+fl_mkdir("新文件夹", 0755); // Create directory
+fl_rmdir("旧文件夹"); // Remove directory
+char current_dir[1024];
+fl_getcwd(current_dir, sizeof(current_dir)); // Get current directory
+\endcode
+
+__Path operations:__
+\code
+fl_make_path("新目录/子目录/深层目录"); // Create directory path
+fl_make_path_for_file("路径/到/新文件.txt"); // Create path for file
+\endcode
+
+__Process and system operations:__
+\code
+fl_execvp("程序名", argv); // Execute program
+fl_system("echo 'Hello 世界'"); // Execute system command
+char* value = fl_getenv("环境变量"); // Get environment variable
+\endcode
+
+\section unicode_best_practices Best Practices
+
+\subsection unicode_practices_files File Handling
+- Always use fl_fopen(), fl_open(), etc. for file operations with international names
+- Save source code files as UTF-8; add a BOM only if your compiler or editor requires it
+- Test with international filenames during development
+
+\subsection unicode_practices_strings String Processing
+- Use fl_utf8strlen() instead of strlen() for character counts
+- Use fl_utf8fwd()/fl_utf8back() when iterating through text character by character
+- Validate user input with fl_utf8test() if accepting external data
+- Be careful when truncating strings - use character boundaries, not arbitrary byte positions
+
+\subsection unicode_practices_display Display and UI
+- Test your interface with text in various languages (especially long German words or wide Asian characters)
+- Consider that text length varies greatly between languages when designing layouts
+- Ensure your chosen fonts support the characters you need to display
+
+\subsection unicode_practices_performance Performance Notes
+- ASCII text has no performance overhead compared to single-byte encodings
+- UTF-8 functions are optimized for common cases (ASCII and Western European text)
+- File operations may be slightly slower on Windows due to UTF-16 conversion
+
+\section unicode_troubleshooting Common Issues and Solutions
+
+\subsection unicode_problem_display "My international text shows up as question marks"
+__Solution:__ Ensure your text is UTF-8 encoded and your font supports the characters. If reading from files, verify they're saved as UTF-8.
+
+\subsection unicode_problem_files "File operations fail with international names"
+__Solution:__ Use FLTK's Unicode file functions instead of standard C functions:
+\code
+// Instead of:
+FILE* f = fopen("файл.txt", "r"); // May fail on Windows
+
+// Use:
+FILE* f = fl_fopen("файл.txt", "r"); // Works correctly
+\endcode
+
+\subsection unicode_problem_length "String length calculations are wrong"
+__Solution:__ Use UTF-8 aware functions:
+\code
+// Wrong - counts bytes, not characters:
+int len = strlen("Café"); // Returns 5
+
+// Correct - counts characters:
+int len = fl_utf8strlen("Café"); // Returns 4
+\endcode
+
+\subsection unicode_problem_truncation "Text gets corrupted when I truncate it"
+__Solution:__ Don't truncate UTF-8 strings at arbitrary byte positions:
+\code
+// Wrong - may cut in middle of character:
+char truncated[10];
+strncpy(truncated, utf8_text, 9);
+
+// Correct - find proper character boundary:
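+const char* limit = utf8_text + strlen(utf8_text);
+const char* end = utf8_text;
+int char_count = 0;
+while (char_count < max_chars && end < limit) {
+  // fl_utf8fwd() returns its argument if it already starts a character,
+  // so begin the search one byte further on.
+  end = fl_utf8fwd(end + 1, utf8_text, limit);
+  char_count++;
+}
+int safe_length = end - utf8_text;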
+\endcode
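
The boundary search above can also be illustrated without FLTK. The following is a minimal sketch, assuming the input is already valid UTF-8 and using a hypothetical helper named `utf8_prefix_bytes` (not an FLTK function). It relies only on the fact that continuation bytes have the bit pattern 10xxxxxx:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not an FLTK function: returns how many bytes cover
// at most max_chars characters at the start of a valid UTF-8 string.
std::size_t utf8_prefix_bytes(const std::string& text, std::size_t max_chars) {
    std::size_t pos = 0;
    std::size_t chars = 0;
    while (pos < text.size() && chars < max_chars) {
        ++pos;  // step over the lead byte of the current character
        // Skip continuation bytes (10xxxxxx) that belong to the same character.
        while (pos < text.size() &&
               (static_cast<unsigned char>(text[pos]) & 0xC0) == 0x80) {
            ++pos;
        }
        ++chars;
    }
    return pos;
}
```

Calling `text.substr(0, utf8_prefix_bytes(text, 3))` on "Café" keeps "Caf", and a limit of 4 keeps all 5 bytes, so the two-byte 'é' is never split.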
+
+\section unicode_error_handling Error Handling
+
+FLTK handles invalid UTF-8 sequences gracefully using configurable behavior:
+
+__Error handling modes (compile-time configuration):__
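+- __ERRORS_TO_CP1252__ (default: 1): A sequence starting with a byte in the range 0x80-0x9F is treated as a Microsoft CP1252 character, and the equivalent Unicode value is returned
+- __STRICT_RFC3629__ (default: 0): When set to 1, UTF-8 sequences that encode values above U+10FFFF or UTF-16 surrogates are treated as errors
+- __ERRORS_TO_ISO8859_1__ (default: 1): An illegal byte is returned unchanged; when set to 0, the Unicode replacement character (U+FFFD) is returned instead
+
+\note Since FLTK 1.3.4 you can set these on the compiler command line, for example -DERRORS_TO_CP1252=0, instead of editing the source code.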
+
+This design allows FLTK to handle legacy text files that mix encodings, making it more robust in real-world scenarios.
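
To make the two fallback behaviours concrete, here is a simplified standalone sketch. It is not FLTK's decoder: the function name `decode_utf8_lenient` and the `latin1_fallback` parameter are hypothetical, the CP1252 mapping is omitted, and unlike FLTK's default configuration it always rejects surrogates and overlong forms.

```cpp
#include <cstdint>

// Hypothetical sketch, not FLTK's implementation. Decodes one code point
// from [p, end). On a malformed sequence it consumes a single byte and
// returns either that byte's value (Latin-1 style) or U+FFFD.
std::uint32_t decode_utf8_lenient(const unsigned char* p, const unsigned char* end,
                                  int* len, bool latin1_fallback) {
    const std::uint32_t kReplacement = 0xFFFD;
    if (p >= end) { *len = 0; return 0; }
    const unsigned char lead = *p;
    *len = 1;
    if (lead < 0x80) return lead;  // plain ASCII

    // Error path shared by all malformed cases below.
    auto fail = [&]() -> std::uint32_t {
        *len = 1;
        return latin1_fallback ? static_cast<std::uint32_t>(lead) : kReplacement;
    };

    int need = 0;             // number of continuation bytes expected
    std::uint32_t value = 0;  // payload bits collected so far
    std::uint32_t min = 0;    // smallest value allowed for this length
    if ((lead & 0xE0) == 0xC0)      { need = 1; value = lead & 0x1F; min = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { need = 2; value = lead & 0x0F; min = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { need = 3; value = lead & 0x07; min = 0x10000; }
    else return fail();  // stray continuation byte or invalid lead byte

    if (end - p < need + 1) return fail();  // sequence is truncated
    for (int i = 1; i <= need; ++i) {
        if ((p[i] & 0xC0) != 0x80) return fail();  // not a continuation byte
        value = (value << 6) | (p[i] & 0x3F);
    }
    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if (value < min || value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF))
        return fail();
    *len = need + 1;
    return value;
}
```

Returning the raw byte lets Latin-1 text that was mislabelled as UTF-8 still display sensibly, while returning U+FFFD makes the damage visible; that trade-off is what the ERRORS_TO_ISO8859_1 setting selects.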
+
+\section unicode_limitations Current Limitations
+
+FLTK's Unicode support covers most common use cases but has some limitations:
+
+__Text Processing:__
+- No automatic text normalization (combining characters are treated separately)
+- No complex script shaping (may affect some Arabic, Indic scripts)
+- No bidirectional text support (right-to-left languages like Arabic/Hebrew)
+
+__Character Range:__
+- Full Unicode range supported (U+000000 to U+10FFFF)
+- Some legacy APIs may be limited to 16-bit characters (Basic Multilingual Plane)
+
+__Sorting and Comparison:__
+- String comparison is byte-based, not linguistically correct
+- Use system locale functions for proper collation when needed for sorting
+
+__Composed Characters:__
+- Composed characters (base + combining accents) are treated as separate characters
+- No automatic character composition or decomposition
-%fl_utf8encode() is less strict, and only generates the UTF-8
-sequence for 0xFFFD, the Unicode REPLACEMENT CHARACTER, if it is
-asked to encode a UCS value above U+10FFFF.
-
-Many of the [<b>fltk2</b>] functions below use %fl_utf8decode() and
-%fl_utf8encode() in their own implementation, and are therefore
-somewhat protected from bad UTF-8 sequences.
-
-The [<b>OksiD</b>] %fl_utf8len() function assumes that the byte it is
-passed is the first byte in a UTF-8 sequence, and returns the length
-of the sequence. Trailing bytes in a UTF-8 sequence will return -1.
-
-- \b WARNING:
- %fl_utf8len() can not distinguish between single
- bytes representing Microsoft CP1252 characters 0x80-0x9f and
- those forming part of a valid UTF-8 sequence. You are strongly
- advised not to use %fl_utf8len() in your own code unless you
- know that the byte sequence contains only valid UTF-8 sequences.
-
-- \b WARNING:
- Some of the [OksiD] functions below still use %fl_utf8len() in
- their implementations. These may need further validation.
-
-Please see the individual function description for further details
-about error handling and return values.
-
-\section unicode_fltk_calls FLTK Unicode and UTF-8 Functions
-
-This section provides a brief overview of the functions.
-For more details, consult the main text for each function via its link.
-
-int fl_utf8locale()
- \b FLTK2
- <br>
-\par
-\p %fl_utf8locale() returns true if the "locale" seems to indicate
-that UTF-8 encoding is used.
-\par
-<i>It is highly recommended that you change your system so this does return
-true!</i>
-
-
-int fl_utf8test(const char *src, unsigned len)
- \b FLTK2
- <br>
-\par
-\p %fl_utf8test() examines the first \p len bytes of \p src.
-It returns 0 if there are any illegal UTF-8 sequences;
-1 if \p src contains plain ASCII or if \p len is zero;
-or 2, 3 or 4 to indicate the range of Unicode characters found.
-
-
-int fl_utf_nb_char(const unsigned char *buf, int len)
- \b OksiD
- <br>
-\par
-Returns the number of UTF-8 characters in the first \p len bytes of \p buf.
-
-
-int fl_unichar_to_utf8_size(Fl_Unichar)
- <br>
-int fl_utf8bytes(unsigned ucs)
- <br>
-\par
-Returns the number of bytes needed to encode \p ucs in UTF-8.
-
-
-int fl_utf8len(char c)
- \b OksiD
- <br>
-\par
-If \p c is a valid first byte of a UTF-8 encoded character sequence,
-\p %fl_utf8len() will return the number of bytes in that sequence.
-It returns -1 if \p c is not a valid first byte.
-
-
-unsigned int fl_nonspacing(unsigned int ucs)
- \b OksiD
- <br>
-\par
-Returns true if \p ucs is a non-spacing character.
-
-
-const char* fl_utf8back(const char *p, const char *start, const char *end)
- \b FLTK2
- <br>
-const char* fl_utf8fwd(const char *p, const char *start, const char *end)
- \b FLTK2
- <br>
-\par
-If \p p already points to the start of a UTF-8 character sequence,
-these functions will return \p p.
-Otherwise \p %fl_utf8back() searches backwards from \p p
-and \p %fl_utf8fwd() searches forwards from \p p,
-within the \p start and \p end limits,
-looking for the start of a UTF-8 character.
-
-
-unsigned int fl_utf8decode(const char *p, const char *end, int *len)
- \b FLTK2
- <br>
-int fl_utf8encode(unsigned ucs, char *buf)
- \b FLTK2
- <br>
-\par
-\p %fl_utf8decode() attempts to decode the UTF-8 character that starts
-at \p p and may not extend past \p end.
-It returns the Unicode value, and the length of the UTF-8 character sequence
-is returned via the \p len argument.
-\p %fl_utf8encode() writes the UTF-8 encoding of \p ucs into \p buf
-and returns the number of bytes in the sequence.
-See the main documentation for the treatment of illegal Unicode
-and UTF-8 sequences.
-
-
-unsigned int fl_utf8froma(char *dst, unsigned dstlen, const char *src, unsigned srclen)
- \b FLTK2
- <br>
-unsigned int fl_utf8toa(const char *src, unsigned srclen, char *dst, unsigned dstlen)
- \b FLTK2
- <br>
-\par
-\p %fl_utf8froma() converts a character string containing single bytes
-per character (i.e. ASCII or ISO-8859-1) into UTF-8.
-If the \p src string contains only ASCII characters, the return value will
-be the same as \p srclen.
-\par
-\p %fl_utf8toa() converts a string containing UTF-8 characters into
-single byte characters. UTF-8 characters that do not correspond to ASCII
-or ISO-8859-1 characters below 0xFF are replaced with '?'.
-
-\par
-Both functions return the number of bytes that would be written, not
-counting the null terminator.
-\p dstlen provides a means of limiting the number of bytes written,
-so setting \p dstlen to zero is a means of measuring how much storage
-would be needed before doing the real conversion.
-
-
-char* fl_utf2mbcs(const char *src)
- \b OksiD
- <br>
-\par
-converts a UTF-8 string to a local multi-byte character string.
-<b>[More info required here!]</b>
-
-unsigned int fl_utf8fromwc(char *dst, unsigned dstlen, const wchar_t *src, unsigned srclen)
- \b FLTK2
- <br>
-unsigned int fl_utf8towc(const char *src, unsigned srclen, wchar_t *dst, unsigned dstlen)
- \b FLTK2
- <br>
-unsigned int fl_utf8toUtf16(const char *src, unsigned srclen, unsigned short *dst, unsigned dstlen)
- \b FLTK2
- <br>
-\par
-These routines convert between UTF-8 and \p wchar_t or "wide character"
-strings.
-The difficulty lies in the fact that \p sizeof(wchar_t) is 2 on Windows
-and 4 on Linux and most other systems.
-Therefore some "wide characters" on Windows may be represented
-as "surrogate pairs" of more than one \p wchar_t.
-
-\par
-\p %fl_utf8fromwc() converts from a "wide character" string to UTF-8.
-Note that \p srclen is the number of \p wchar_t elements in the source
-string and on Windows this might be larger than the number of characters.
-\p dstlen specifies the maximum number of \b bytes to copy, including
-the null terminator.
-
-\par
-\p %fl_utf8towc() converts a UTF-8 string into a "wide character" string.
-Note that on Windows, some "wide characters" might result in "surrogate
-pairs" and therefore the return value might be more than the number of
-characters.
-\p dstlen specifies the maximum number of \b wchar_t elements to copy,
-including a zero terminating element.
-<b>[Is this all worded correctly?]</b>
-
-\par
-\p %fl_utf8toUtf16() converts a UTF-8 string into a "wide character"
-string using UTF-16 encoding to handle the "surrogate pairs" on Windows.
-\p dstlen specifies the maximum number of \b wchar_t elements to copy,
-including a zero terminating element.
-<b>[Is this all worded correctly?]</b>
-
-\par
-These routines all return the number of elements that would be required
-for a full conversion of the \p src string, including the zero terminator.
-Therefore setting \p dstlen to zero is a way of measuring how much storage
-would be needed before doing the real conversion.
-
-
-unsigned int fl_utf8from_mb(char *dst, unsigned dstlen, const char *src, unsigned srclen)
- \b FLTK2
- <br>
-unsigned int fl_utf8to_mb(const char *src, unsigned srclen, char *dst, unsigned dstlen)
- \b FLTK2
- <br>
-\par
-These functions convert between UTF-8 and the locale-specific multi-byte
-encodings used on some systems for filenames, etc.
-If fl_utf8locale() returns true, these functions don't do anything useful.
-<b>[Is this all worded correctly?]</b>
-
-
-int fl_tolower(unsigned int ucs)
- \b OksiD
- <br>
-int fl_toupper(unsigned int ucs)
- \b OksiD
- <br>
-int fl_utf_tolower(const unsigned char *str, int len, char *buf)
- \b OksiD
- <br>
-int fl_utf_toupper(const unsigned char *str, int len, char *buf)
- \b OksiD
- <br>
-\par
-\p %fl_tolower() and \p %fl_toupper() convert a single Unicode character
-from upper to lower case, and vice versa.
-\p %fl_utf_tolower() and \p %fl_utf_toupper() convert a string of bytes,
-some of which may be multi-byte UTF-8 encodings of Unicode characters,
-from upper to lower case, and vice versa.
-\par
-Warning: to be safe, \p buf length must be at least \p 3*len
-[for 16-bit Unicode]
-
-
-int fl_utf_strcasecmp(const char *s1, const char *s2)
- \b OksiD
- <br>
-int fl_utf_strncasecmp(const char *s1, const char *s2, int n)
- \b OksiD
- <br>
-\par
-\p %fl_utf_strcasecmp() is a UTF-8 aware string comparison function that
-converts the strings to lower case Unicode as part of the comparison.
-\p %flt_utf_strncasecmp() only compares the first \p n characters [bytes?]
-
-
-\section unicode_system_calls FLTK Unicode Versions of System Calls
-
-- int fl_access(const char* f, int mode)
- \b OksiD
-- int fl_chmod(const char* f, int mode)
- \b OksiD
-- int fl_execvp(const char* file, char* const* argv)
- \b OksiD
-- FILE* fl_fopen(cont char* f, const char* mode)
- \b OksiD
-- char* fl_getcwd(char* buf, int maxlen)
- \b OksiD
-- char* fl_getenv(const char* name)
- \b OksiD
-- char fl_make_path(const char* path) - returns char ?
- \b OksiD
-- void fl_make_path_for_file(const char* path)
- \b OksiD
-- int fl_mkdir(const char* f, int mode)
- \b OksiD
-- int fl_open(const char* f, int o, ...)
- \b OksiD
-- int fl_rename(const char* f, const char* t)
- \b OksiD
-- int fl_rmdir(const char* f)
- \b OksiD
-- int fl_stat(const char* path, struct stat* buffer)
- \b OksiD
-- int fl_system(const char* f)
- \b OksiD
-- int fl_unlink(const char* f)
- \b OksiD
-
-\par TODO:
-
-\li more doc on unicode, add links
-\li write something about filename encoding on OS X...
-\li explain the fl_utf8_... commands
-\li explain issues with Fl_Preferences
+Most applications won't encounter these limitations in practice. The Unicode support in FLTK is sufficient for displaying and processing international text in the majority of real-world scenarios.
\htmlonly
<hr>