Difference between revisions of "Talk:UTF-8"

Revision as of 02:40, 30 September 2014

Testing and Usage

Yeah, I always stayed away from the weird ranges. These functions were mostly written with LSO in mind, and LSO did no special conversions. LSO was happy with 6-byte character codes (old UTF-8). Mono is another thing altogether. The only testing I did for Mono was to make sure the code compiled correctly and the logic executed the same way. I did not do the extensive testing for Mono that I did for LSO. For LSO these functions were pretty much mirrors of each other, but I always tested the range of characters I was going to use to make sure, and I always stayed away from ranges that get special treatment.

Should I make a version for surrogate pairs? (On second thought, do I want to learn enough about surrogate pairs to write a function for them?) -- Strife (talk|contribs) 21:08, 29 September 2014 (PDT)

LSO uses byte strings and does not care about the validity of the Unicode code points they represent. You can even store an incomplete UTF-8 sequence such as llUnescapeURL("%C3") in LSO. For that reason, the conversion to and from integer is safe in LSO and can encode every code point between U+0001 and U+7FFFFFFF. This line works fine under LSO, showing that the round trip holds even at the top of the range: llOwnerSay((string)UTF8ToUnicodeInteger(UnicodeIntegerToUTF8(0x7FFFFFFF))); displays 2147483647 as expected. Surrogates and U+FFFE are encoded and decoded without problems.
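
For anyone who wants to try the round trip outside Second Life, here is a rough Python sketch of the original 1-to-6-byte UTF-8 layout operating on raw bytes, which is roughly how LSO treats strings. It is only an illustration: the names `int_to_old_utf8` and `old_utf8_to_int` are made up for this example and are not the wiki's `UnicodeIntegerToUTF8` / `UTF8ToUnicodeInteger`, whose LSL source may differ in detail.

```python
def int_to_old_utf8(cp: int) -> bytes:
    """Encode an integer in 1..0x7FFFFFFF with the original 1-6 byte UTF-8 layout."""
    if not 0 < cp <= 0x7FFFFFFF:
        raise ValueError("value out of range for 6-byte UTF-8")
    if cp < 0x80:
        return bytes([cp])
    # (sequence length, exclusive upper limit, lead-byte marker)
    for length, limit, marker in (
        (2, 0x800, 0xC0), (3, 0x10000, 0xE0), (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8), (6, 0x80000000, 0xFC),
    ):
        if cp < limit:
            break
    out = bytearray()
    for _ in range(length - 1):
        out.insert(0, 0x80 | (cp & 0x3F))  # continuation byte, low 6 bits
        cp >>= 6
    out.insert(0, marker | cp)             # lead byte carries the remaining bits
    return bytes(out)


def old_utf8_to_int(data: bytes) -> int:
    """Decode the first 1-6 byte sequence in data; no validity checks on the value."""
    first = data[0]
    if first < 0x80:
        return first
    length, mask = 0, 0x80
    while first & mask:                    # count leading 1 bits of the lead byte
        length += 1
        mask >>= 1
    if length < 2 or length > 6 or len(data) < length:
        raise ValueError("malformed or incomplete sequence")
    cp = first & (mask - 1)
    for b in data[1:length]:
        if b & 0xC0 != 0x80:
            raise ValueError("expected a continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp


print(old_utf8_to_int(int_to_old_utf8(0x7FFFFFFF)))  # 2147483647
print(int_to_old_utf8(0xD800).hex())                 # eda080
```

Note that nothing here rejects surrogates or U+FFFE; like LSO, the sketch just shuffles bits between an integer and a byte sequence.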

Mono, on the other hand, uses UTF-16 internally for its strings and enforces Unicode validity rules. Thus surrogates and U+FFFE are invalid.

I don't know what "a version for surrogate pairs" would look like. I don't see how a function can be made to work under Mono that losslessly maps the range U+D800-U+DFFF to a two-byte character in UTF-16, or that maps the range above 0x10FFFF to a valid Unicode string, for that matter. At least not without using escapes, i.e. reinventing the wheel. For the record, a UTF-8 sequence that would produce a surrogate (e.g. llUnescapeURL("%ED%A0%80")) translates to "???" in Mono. --Pedro Oval 03:40, 30 September 2014 (PDT)
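
A similar effect can be reproduced with any strict UTF-8 decoder. The Python sketch below is only an analogy, not Mono's actual code path: Python substitutes U+FFFD where Mono shows "?", and the helper names (`strict_utf8_ok`, `lenient_decode`, `fits_utf16`) are invented for this example.

```python
def strict_utf8_ok(raw: bytes) -> bool:
    """True if raw is well-formed UTF-8 according to Python's strict decoder."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def lenient_decode(raw: bytes) -> str:
    """Decode raw, replacing each undecodable piece with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def fits_utf16(cp: int) -> bool:
    """True if cp is a code point that well-formed UTF-16 can represent."""
    return 0 <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF


surrogate_bytes = bytes.fromhex("EDA080")    # the bytes behind %ED%A0%80
print(strict_utf8_ok(surrogate_bytes))       # False
print(len(lenient_decode(surrogate_bytes)))  # 3
print(fits_utf16(0xD800), fits_utf16(0x7FFFFFFF))  # False False
```

The three bytes are rejected one at a time, which is consistent with three substitute characters appearing rather than one.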

