Talk:UTF-8

From Second Life Wiki
Revision as of 03:40, 30 September 2014 by Pedro Oval (talk | contribs) (→‎Testing and Usage: Valid in LSO, but for Mono it's hopeless.)

Testing and Usage

Yeah, I always stayed away from the weird ranges. These functions were written mostly with LSO in mind, and LSO did no special conversions; it was happy with 6-byte character codes (old UTF-8). Mono is another thing altogether. The only testing I did for Mono was to make sure the functions compiled correctly and the logic executed the same; I did not do the extensive testing for Mono that I did for LSO. For LSO these functions were pretty much mirrors of each other, but I always tested the range of characters I was going to use to make sure, and I always stayed away from ranges with special behavior.

Should I make a version for surrogate pairs? (On second thought, do I want to learn enough about surrogate pairs to write a function for them?) -- Strife (talk|contribs) 21:08, 29 September 2014 (PDT)

LSO uses byte strings and does not care about the validity of the Unicode codepoints they represent. You can even store an incomplete UTF-8 sequence such as llUnescapeURL("%C3") in LSO. For that reason, the conversion to/from integer is safe in LSO, being able to encode every code point between U+0001 and U+7FFFFFFF. This line works fine under LSO, proving the conversion is working flawlessly: llOwnerSay((string)UTF8ToUnicodeInteger(UnicodeIntegerToUTF8(0x7FFFFFFF))); displays 2147483647 as expected. Surrogates and U+FFFE are encoded and decoded without problems.
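To make the LSO behaviour concrete, here is a minimal sketch of the old 1-to-6-byte UTF-8 scheme. It is written in Python rather than LSL so it can be run outside Second Life, and it is not the wiki's actual UnicodeIntegerToUTF8 / UTF8ToUnicodeInteger code. The names encode_old_utf8 and decode_old_utf8 are made up for illustration, and the sketch deliberately skips validity checks (surrogates, U+FFFE, overlong forms), mirroring the "byte strings, no validation" behaviour described above.

```python
def encode_old_utf8(cp: int) -> bytes:
    """Encode an integer with the original 1-to-6-byte UTF-8 scheme.

    Hypothetical helper: no checks for surrogates or noncharacters.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, sequence length, lead-byte marker)
    table = ((0x800, 2, 0xC0), (0x10000, 3, 0xE0), (0x200000, 4, 0xF0),
             (0x4000000, 5, 0xF8), (0x80000000, 6, 0xFC))
    for limit, length, lead in table:
        if cp < limit:
            break
    out = bytearray(length)
    # Fill continuation bytes from the end, 6 bits at a time.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (cp & 0x3F)
        cp >>= 6
    out[0] = lead | cp
    return bytes(out)


def decode_old_utf8(data: bytes) -> int:
    """Decode the first sequence in data back to an integer (no validation)."""
    first = data[0]
    if first < 0x80:
        return first
    # (mask, expected lead bits, sequence length)
    table = ((0xE0, 0xC0, 2), (0xF0, 0xE0, 3), (0xF8, 0xF0, 4),
             (0xFC, 0xF8, 5), (0xFE, 0xFC, 6))
    for mask, lead, length in table:
        if (first & mask) == lead:
            break
    else:
        raise ValueError("invalid lead byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    cp = first & ~mask & 0xFF  # payload bits of the lead byte
    for b in data[1:length]:
        cp = (cp << 6) | (b & 0x3F)
    return cp


print(decode_old_utf8(encode_old_utf8(0x7FFFFFFF)))  # 2147483647
print(encode_old_utf8(0xD800).hex())                 # eda080
```

The second line of output matches the %ED%A0%80 sequence for U+D800: a byte-oriented encoder produces it without complaint, and only a validating consumer rejects it.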
Mono, on the other hand, uses UTF-16 internally for its strings, and respects Unicode rules with respect to validity. Thus surrogates and U+FFFE are invalid.
I don't know what "a version for surrogate pairs" would look like. I don't see how a function can be made to work under Mono that allows mapping of the range U+D800-U+DFFF to a two-byte character in UTF-16 losslessly. Or that would allow the range above 0x10FFFF to be mapped to a valid Unicode string, for that matter. At least not without using escapes, i.e. reinventing the wheel. For the record, a UTF-8 sequence that would produce a surrogate (e.g. llUnescapeURL("%ED%A0%80")) translates to "???" in Mono. --Pedro Oval 03:40, 30 September 2014 (PDT)
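As a rough illustration of why those ranges cannot survive a validating UTF-16 string, here is a small Python sketch. It uses Python's strict UTF-8 decoder as a stand-in for a validating runtime; it is an assumption that this resembles what Mono does, not a description of Mono's internals. The helper names strict_decode and to_utf16_units are hypothetical.

```python
# Same bytes as llUnescapeURL("%ED%A0%80"): the UTF-8 form of U+D800.
SURROGATE_BYTES = bytes.fromhex("EDA080")


def strict_decode(data: bytes):
    """Return the decoded string, or None if a strict decoder rejects it."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def to_utf16_units(cp: int) -> list:
    """Return the UTF-16 code units for a valid Unicode scalar value."""
    if 0xD800 <= cp <= 0xDFFF:
        # These values are reserved as halves of surrogate pairs.
        raise ValueError("surrogates are not characters in UTF-16")
    if cp > 0x10FFFF:
        # Two 10-bit halves plus the 0x10000 offset top out at U+10FFFF.
        raise ValueError("beyond the range UTF-16 can represent")
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    return [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]


print(strict_decode(SURROGATE_BYTES))             # None
print([hex(u) for u in to_utf16_units(0x1F600)])  # ['0xd83d', '0xde00']
```

A code unit in U+D800-U+DFFF always reads as half of a pair, and nothing above 0x10FFFF has a pair at all, so a lossless mapping would need some escape convention layered on top.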