Difference between revisions of "Talk:UTF-8"
m (Testing and Usage) |
Pedro Oval (talk | contribs) (→Testing and Usage: Valid in LSO, but for Mono it's hopeless.) |
||
Line 4: | Line 4: | ||
Should I make a version for surrogate pairs? (on second thought do I want to learn enough about surrogate pairs to write a function for them?) -- '''[[User:Strife_Onizuka|Strife]]''' <sup><small>([[User talk:Strife_Onizuka|talk]]|[[Special:Contributions/Strife_Onizuka|contribs]])</small></sup> 21:08, 29 September 2014 (PDT) | Should I make a version for surrogate pairs? (on second thought do I want to learn enough about surrogate pairs to write a function for them?) -- '''[[User:Strife_Onizuka|Strife]]''' <sup><small>([[User talk:Strife_Onizuka|talk]]|[[Special:Contributions/Strife_Onizuka|contribs]])</small></sup> 21:08, 29 September 2014 (PDT) | ||
:LSO uses byte strings and does not care about the validity of the Unicode codepoints they represent. You can even store an incomplete UTF-8 sequence such as <code>llUnescapeURL("%C3")</code> in LSO. For that reason, the conversion to/from integer is safe in LSO, being able to encode every code point between U+0001 and U+7FFFFFFF. This line works fine under LSO, proving the conversion is working flawlessly: <code>llOwnerSay((string)UTF8ToUnicodeInteger(UnicodeIntegerToUTF8(0x7FFFFFFF)));</code> displays 2147483647 as expected. Surrogates and U+FFFE are encoded and decoded without problems. | |||
:Mono, on the other hand, uses UTF-16 internally for its strings, and respects Unicode rules with respect to validity. Thus surrogates and U+FFFE are invalid. | |||
:I don't know how "a version for surrogate pairs" would look like. I don't see how a function can be made to work under Mono that allows mapping of the range U+D800-U+DFFF to a two-byte character in UTF-16 losslessly. Or that would allow the range above 0x10FFFF to be mapped to a valid Unicode string, for the matter. At least not without using escapes, i.e. reinventing the wheel. For the record, a UTF-8 sequence that would produce a surrogate (e.g. <code>llUnescapeURL("%ED%A0%80")</code>) translates to "???" in Mono. --[[User:Pedro Oval|Pedro Oval]] 03:40, 30 September 2014 (PDT) |
Revision as of 02:40, 30 September 2014
Testing and Usage
Yeah I always stayed away from the weird ranges. These functions were mostly written with LSO in mind, and LSO did no special conversions. LSO was happy with 6 byte character codes (old UTF-8). Mono is another thing altogether. The only testing I did for Mono was to make sure it compiled correctly and the logic executed the same. I did not do the extensive testing for Mono that I did for LSO. For LSO these functions were pretty much mirrors but I always tested the range of characters I was going to be using to make sure, and I always stayed away from ranges that do things.
Should I make a version for surrogate pairs? (on second thought do I want to learn enough about surrogate pairs to write a function for them?) -- Strife (talk|contribs) 21:08, 29 September 2014 (PDT)
- LSO uses byte strings and does not care about the validity of the Unicode codepoints they represent. You can even store an incomplete UTF-8 sequence such as
llUnescapeURL("%C3")
in LSO. For that reason, the conversion to/from integer is safe in LSO, being able to encode every code point between U+0001 and U+7FFFFFFF. This line works fine under LSO, proving the conversion is working flawlessly:llOwnerSay((string)UTF8ToUnicodeInteger(UnicodeIntegerToUTF8(0x7FFFFFFF)));
displays 2147483647 as expected. Surrogates and U+FFFE are encoded and decoded without problems. - Mono, on the other hand, uses UTF-16 internally for its strings, and respects Unicode rules with respect to validity. Thus surrogates and U+FFFE are invalid.
- I don't know how "a version for surrogate pairs" would look like. I don't see how a function can be made to work under Mono that allows mapping of the range U+D800-U+DFFF to a two-byte character in UTF-16 losslessly. Or that would allow the range above 0x10FFFF to be mapped to a valid Unicode string, for the matter. At least not without using escapes, i.e. reinventing the wheel. For the record, a UTF-8 sequence that would produce a surrogate (e.g.
llUnescapeURL("%ED%A0%80")
) translates to "???" in Mono. --Pedro Oval 03:40, 30 September 2014 (PDT)