Difference between revisions of "Talk:UTF-8"

From Second Life Wiki
(→‎Testing and Usage: Valid in LSO, but for Mono it's hopeless.)
(→‎Testing and Usage: On the gritty little details)
 

Latest revision as of 09:22, 1 October 2014

Testing and Usage

Yeah, I always stayed away from the weird ranges. These functions were mostly written with LSO in mind, and LSO did no special conversions. LSO was happy with 6-byte character codes (old UTF-8). Mono is another thing altogether. The only testing I did for Mono was to make sure it compiled correctly and the logic executed the same. I did not do the extensive testing for Mono that I did for LSO. For LSO these functions were pretty much mirrors of each other, but I always tested the range of characters I was going to be using to make sure, and I always stayed away from ranges that get special treatment.

Should I make a version for surrogate pairs? (On second thought, do I want to learn enough about surrogate pairs to write a function for them?) -- Strife (talk|contribs) 21:08, 29 September 2014 (PDT)

LSO uses byte strings and does not care about the validity of the Unicode codepoints they represent. You can even store an incomplete UTF-8 sequence such as llUnescapeURL("%C3") in LSO. For that reason, the conversion to/from integer is safe in LSO, being able to encode every code point between U+0001 and U+7FFFFFFF. This line works fine under LSO, proving the conversion is working flawlessly: llOwnerSay((string)UTF8ToUnicodeInteger(UnicodeIntegerToUTF8(0x7FFFFFFF))); displays 2147483647 as expected. Surrogates and U+FFFE are encoded and decoded without problems.
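To make the round trip above concrete outside of Second Life, here is a minimal Python sketch of the original (up to 6-byte) UTF-8 scheme. It is not the LSL code from the article: the names encode_old_utf8 and decode_old_utf8 are made up for illustration, and, like LSO, the decoder does no validity checking.

```python
def encode_old_utf8(cp: int) -> bytes:
    """Encode an integer with the original UTF-8 scheme (1 to 6 bytes)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, lead-byte marker) for 2- to 6-byte sequences
    table = [(0x800, 0xC0), (0x10000, 0xE0), (0x200000, 0xF0),
             (0x4000000, 0xF8), (0x80000000, 0xFC)]
    for length, (limit, lead) in enumerate(table, start=2):
        if cp < limit:
            break
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))  # low 6 bits go into a continuation byte
        cp >>= 6
    return bytes([lead | cp] + tail[::-1])


def decode_old_utf8(data: bytes) -> int:
    """Decode one old-style UTF-8 sequence to its integer, without validation."""
    first = data[0]
    if first < 0x80:
        return first
    # The number of leading 1 bits in the lead byte is the sequence length.
    length = 0
    while first & (0x80 >> length):
        length += 1
    cp = first & (0x7F >> length)
    for byte in data[1:length]:
        cp = (cp << 6) | (byte & 0x3F)
    return cp


print(decode_old_utf8(encode_old_utf8(0x7FFFFFFF)))
print(encode_old_utf8(0xD800).hex())
```

Because nothing here rejects surrogates or values above 0x10FFFF, every integer in the 31-bit range survives the round trip, which is the property the LSO test line relies on.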
Mono, on the other hand, uses UTF-16 internally for its strings, and respects Unicode rules with respect to validity. Thus surrogates and U+FFFE are invalid.
I don't know what "a version for surrogate pairs" would look like. I don't see how a function can be made to work under Mono that allows mapping of the range U+D800-U+DFFF to a two-byte character in UTF-16 losslessly. Or that would allow the range above 0x10FFFF to be mapped to a valid Unicode string, for that matter. At least not without using escapes, i.e. reinventing the wheel. For the record, a UTF-8 sequence that would produce a surrogate (e.g. llUnescapeURL("%ED%A0%80")) translates to "???" in Mono. --Pedro Oval 03:40, 30 September 2014 (PDT)
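The same limits can be seen in any runtime that validates Unicode strictly. The Python sketch below is only an analogy, not a description of Mono's internals: strict_decode_ok and to_utf16_units are hypothetical helper names, and Python's decoder stands in for a validating one.

```python
from urllib.parse import unquote_to_bytes

# The byte sequence from the example above; it would decode to U+D800.
raw = unquote_to_bytes("%ED%A0%80")


def strict_decode_ok(data: bytes) -> bool:
    """Return True if the bytes are valid UTF-8 under strict rules."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf16_units(cp: int) -> list:
    """Split a code point into UTF-16 code units, rejecting what UTF-16 cannot hold."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot stand alone in UTF-16")
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("UTF-16 cannot represent values above 0x10FFFF")
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]


print(strict_decode_ok(raw))
# A lenient decode substitutes replacement characters instead of failing.
print(raw.decode("utf-8", errors="replace"))
print([hex(u) for u in to_utf16_units(0x1F600)])
```

The two ValueError branches are exactly the ranges discussed above: a lone surrogate has no valid encoding, and anything past 0x10FFFF has no surrogate pair at all, so a lossless mapping would need some escape convention layered on top.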
I'm really OK with you documenting every gritty little detail of Mono and where these functions fail. Seriously, please do. It's interesting stuff. You might, if you haven't already, want to look at how Mono implemented its string libraries. -- Strife (talk|contribs) 13:26, 30 September 2014 (PDT)
For the record, it's not what I intended to do. I was tempted to remove the whole section added by Ollj Oh, but decided to try to salvage the part that was not misleading and was perhaps useful. --Pedro Oval 10:22, 1 October 2014 (PDT)