UTF-8
Revision as of 09:52, 29 September 2014
Second Life uses UTF-8 to store and transmit strings. With these functions you can work with Unicode characters. See Unicode In 5 Minutes for a brief introduction to Unicode.
These functions are part of the Combined Library written by Strife Onizuka.
Limits
Unicode consists predominantly of simple characters: for the vast majority of the range, a Unicode value is simply a character. Some values, however, are not characters but instructions. UnicodeIntegerToUTF8 and UTF8ToUnicodeInteger work perfectly for the simple characters, but not so well for the other values. More on these ranges can be found in the Unicode specification and Specials (Unicode block).
In the following integer ranges, UTF8ToUnicodeInteger() and UnicodeIntegerToUTF8() are NOT inverse functions of each other. Converting one of these integers to UTF-8 and back may yield a different integer, so uniqueness is no longer guaranteed:
- [55296..57343] = U+D800..U+DFFF (2048 = 2^11 values). These are the code points reserved for UTF-16 surrogates, which are not valid characters in UTF-8, so the failure here is expected and does not mean that the 11-bit UTF-8 part of one of the functions is wrong.
- [65534] = U+FFFE (hex) = 0b1111111111111110 (bin) is an intentionally invalid value (a Unicode noncharacter) that cannot be unescaped because it is invalid by design. (The earlier report of [65534..65535] was possibly a false alert from my older range-testing scripts.)
- [1114112..???] The function is simply not well defined here: "else if(input <= 0x10FFFF)" is the highest integer range it handles, and the next hex value is 0x110000 = 1114112 (dec), so there are no cases covering inputs > 1114111.
The range [65536..1114111] seems to be fine, although testing every integer in a range this long takes a while.
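The ranges listed above can be folded into a small guard that a script calls before relying on a round trip. This is only a sketch: the name IsRoundTripSafe is hypothetical and not part of the Combined Library, and the excluded ranges are just the ones reported above, not the result of an exhaustive test.

<lsl>
// Hypothetical helper, not part of the Combined Library.
// Returns TRUE when the integer lies outside the problem ranges listed above.
integer IsRoundTripSafe(integer input)
{
    if(input <= 0) return FALSE;                          // the Standard version returns "" for input <= 0
    if(input >= 0xD800 && input <= 0xDFFF) return FALSE;  // [55296..57343]
    if(input == 0xFFFE) return FALSE;                     // [65534]
    if(input > 0x10FFFF) return FALSE;                    // [1114112..]
    return TRUE;
}

default
{
    state_entry()
    {
        llOwnerSay((string)IsRoundTripSafe(0x20AC));   // a value outside the listed ranges
        llOwnerSay((string)IsRoundTripSafe(0xD800));   // first value of [55296..57343]
        llOwnerSay((string)IsRoundTripSafe(0x110000)); // first value past 0x10FFFF
    }
}
</lsl>

Rejecting doubtful values up front keeps the check in one place, so the conversion functions themselves do not need to change.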
Some characters within the ranges where the two functions are inverses of each other may still behave very strangely when a string holds more than one character. They may:
- become double characters and mess up your indexing within a string (this may even vary depending on whether a text field uses UTF-8 or UTF-16)
- merge with other characters into a "super-mega-dino-mecha-character" (see "unicode combined characters")
- be reserved characters (especially for JSON-like string serializations, but also for simple things like "\n")
- become invisible and be skipped over by some functions (depending on compiler, VM, environment and programming language)
Because of that you quickly get down to a range of 65504 unique SINGLE characters (fewer than 2^16, because many reserved and invalid characters are avoided). These still work in LSO, in case you unchecked the "mono" button by accident. That is enough to store 15 bits (2^15 distinct values) per character in a string within LSL, under both the Mono and the LSO compiler.
Storing 15 bits per character in a string quickly becomes more memory efficient than storing a list of 32-bit integers. In Mono, each integer takes 16 bytes to store 32 bits, whereas each character takes only 2 bytes to store 15 bits, plus 18 bytes per string for the string itself.
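The comparison can be made concrete with a little arithmetic. The sketch below uses only the byte costs quoted above (16 bytes per integer, 2 bytes per character, 18 bytes per string) and ignores any overhead of the list itself. The helper names ListBytes and StringBytes are hypothetical, and nothing here measures real script memory.

<lsl>
// Hypothetical helpers; the byte costs are the figures quoted above, not measured here.
integer ListBytes(integer bits)
{
    return 16 * ((bits + 31) / 32);      // one 16-byte integer per 32 bits, rounded up
}

integer StringBytes(integer bits)
{
    return 18 + 2 * ((bits + 14) / 15);  // 18 bytes per string plus 2 bytes per 15-bit character
}

default
{
    state_entry()
    {
        integer bits = 480; // 15 integers, or 32 characters
        llOwnerSay("list: " + (string)ListBytes(bits) + " bytes, string: " + (string)StringBytes(bits) + " bytes");
    }
}
</lsl>

For 480 bits these formulas give 240 bytes for the list against 82 bytes for the string. Under the same figures, a single integer (16 bytes) is still cheaper than the three characters needed for 32 bits (24 bytes).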
Standard
This version of UnicodeIntegerToUTF8 complies with the latest standard, whereas LSO complies with an earlier one. The newer standard covers only a subset of the older standard. The extended range of the old standard went unused, so this incompleteness is moot.
<lsl>string UnicodeIntegerToUTF8(integer input)//Mono Safe, LSLEditor Safe, LSO Incomplete
{//LSO allows for the older UTF-8 range, this function only supports the new UTF-16 range.
    if(input > 0)
    {
        if(input <= 0x7FF)
        {//instead of a flat if else chain, this redistributes the fork load so that only the 4 byte characters result in 3 forks, all the other paths are 2 forks.
            if(input <= 0x7F){
                input = input << 24;
                jump quick_return;//saves us from the implicit double jump that using an else would cause.
            }
            input = 0xC0800000 | ((input << 18) & 0x1F000000) | ((input << 16) & 0x3F0000);
        }
        else if(input <= 0xFFFF)
            input = 0xE0808000 | ((input << 12) & 0x0F000000) | ((input << 10) & 0x3F0000) | ((input << 8) & 0x3F00);
        else if(input <= 0x10FFFF)
            input = 0xF0808080 | ((input << 06) & 0x07000000) | ((input << 04) & 0x3F0000) | ((input << 2) & 0x3F00) | (input & 0x3F);
        else
            jump error;//not in our range
        @quick_return;
        return llBase64ToString(llIntegerToBase64(input));
    }
    @error;
    return "";
}</lsl>
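To see the three-byte branch at work, the sketch below repeats its bit arithmetic outside the function for U+20AC, the euro sign, chosen here only as an example. The shifts move the three payload groups under the 0xE0808000 template and give 0xE282AC00, that is the UTF-8 bytes E2 82 AC followed by a zero byte.

<lsl>
default
{
    state_entry()
    {
        integer input = 0x20AC; // example value, handled by the "input <= 0xFFFF" branch
        integer packed = 0xE0808000
            | ((input << 12) & 0x0F000000)  // top 4 bits    -> 0x02000000
            | ((input << 10) & 0x3F0000)    // middle 6 bits -> 0x00020000
            | ((input << 8) & 0x3F00);      // low 6 bits    -> 0x00002C00
        // packed now holds 0xE282AC00
        llOwnerSay((string)(packed == 0xE282AC00));
        // same final step as the function: integer -> Base64 -> string
        llOwnerSay(llBase64ToString(llIntegerToBase64(packed)));
    }
}
</lsl>

The one-, two- and four-byte branches follow the same pattern with a different template constant and different shift amounts.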
General Use
This version will work fine in LSO and Mono but not in LSLEditor.
<lsl>//===================================================//
// Combined Library //
// "Feb 4 2008", "08:35:00" //
// Copyright (C) 2004-2008, Strife Onizuka (cc-by) //
// http://creativecommons.org/licenses/by/3.0/ //
//===================================================//
//{

integer UTF8ToUnicodeInteger(string input)//LSLEditor Unsafe, LSO Safe
{
    integer result = llBase64ToInteger(llStringToBase64(input = llGetSubString(input,0,0)));
    if(result & 0x80000000)//multibyte, continuing to use base64 is impractical because it requires smart shifting.
        return ( ( 0x0000003f & result ) |
                 (( 0x00003f00 & result) >> 2 ) |
                 (( 0x003f0000 & result) >> 4 ) |
                 (( 0x3f000000 & (result = (integer)("0x"+llGetSubString(input,-8,-1)))) >> 6 ) |
                 (( 0x0000003f & result) << 24) |
                 (( 0x00000100 & (result = (integer)("0x"+llDeleteSubString(input = (string)llParseString2List(llEscapeURL(input),(list)"%",[]),-8,-1)))) << 22)
               ) & ( 0x7FFFFFFF >> (5 * ((integer)(llLog(~result) / 0.69314718055994530941723212145818) - 25)));
//               (( 0x00000100 & (result = (integer)("0x"+llDeleteSubString(input,-8,-1)))) << 22)
//             ) & ( 0x7FFFFFFF >> (30 - (5 * (llStringLength(input = (string)llParseString2List(llEscapeURL(input),(list)"%",[])) >> 1))));
    return result >> 24;
}

string UnicodeIntegerToUTF8(integer input)//LSLEditor Unsafe, LSO Safe
{
    integer bytes = llCeil((llLog(input) / 0.69314718055994530941723212145818));
    string result = "%" + byte2hex((input >> (6 * bytes)) | ((0x3F80 >> bytes) << !(bytes = ((input >= 0x80) * (bytes + ~(((1 << bytes) - input) > 0)) / 5))));
    while (bytes)
        result += "%" + byte2hex((((input >> (6 * (bytes = ~-bytes))) | 0x80) & 0xBF));
    return llUnescapeURL(result);
}

string byte2hex(integer x)//LSLEditor Safe, LSO Safe
{//Helper function for use with unicode characters.
    integer y = (x >> 4) & 0xF;
    return llGetSubString(hexc, y, y) + llGetSubString(hexc, x & 0xF, x & 0xF);
}//This function would benefit greatly from the DUP opcode, it would remove 19 bytes.

string hexc="0123456789ABCDEF";

//} Combined Library</lsl>
LSLEditor Safe
This version will work in Mono, LSO & LSLEditor. There will be a slight performance hit in LSO as compared to the LSLEditor Unsafe version.
<lsl>//===================================================//
// Combined Library //
// "Feb 4 2008", "08:38:13" //
// Copyright (C) 2004-2008, Strife Onizuka (cc-by) //
// http://creativecommons.org/licenses/by/3.0/ //
//===================================================//
//{

integer UTF8ToUnicodeInteger(string input)//LSLEditor Safe, LSO Safe
{
    integer result = llBase64ToInteger(llStringToBase64(input = llGetSubString(input,0,0)));
    if(result & 0x80000000){//multibyte, continuing to use base64 is impractical because it requires smart shifting.
        integer end = (integer)("0x"+llGetSubString(input = (string)llParseString2List(llEscapeURL(input),(list)"%",[]),-8,-1));
        integer begin = (integer)("0x"+llDeleteSubString(input,-8,-1));
        return ( ( 0x0000003f & end ) |
                 (( 0x00003f00 & end) >> 2 ) |
                 (( 0x003f0000 & end) >> 4 ) |
                 (( 0x3f000000 & end) >> 6 ) |
                 (( 0x0000003f & begin) << 24) |
                 (( 0x00000100 & begin) << 22)
               ) & (0x7FFFFFFF >> (5 * ((integer)(llLog(~result) / 0.69314718055994530941723212145818) - 25)));
    }
    return result >> 24;
}

string UnicodeIntegerToUTF8(integer input)//LSLEditor Safe, LSO Safe
{
    integer bytes = llCeil((llLog(input) / 0.69314718055994530941723212145818));
    bytes = (input >= 0x80) * (bytes + ~(((1 << bytes) - input) > 0)) / 5;//adjust
    string result = "%" + byte2hex((input >> (6 * bytes)) | ((0x3F80 >> bytes) << !bytes));
    while (bytes)
        result += "%" + byte2hex((((input >> (6 * (bytes = ~-bytes))) | 0x80) & 0xBF));
    return llUnescapeURL(result);
}

string byte2hex(integer x)//LSLEditor Safe, LSO Safe
{//Helper function for use with unicode characters.
    integer y = (x >> 4) & 0xF;
    return llGetSubString(hexc, y, y) + llGetSubString(hexc, x & 0xF, x & 0xF);
}//This function would benefit greatly from the DUP opcode, it would remove 19 bytes.

string hexc="0123456789ABCDEF";

//} Combined Library</lsl>