Difference between revisions of "Talk:LlUnescapeURL"

From Second Life Wiki

Latest revision as of 20:41, 29 January 2014

url description and url breakdown

I'm not keen on the url parameter description. It doesn't have to be a valid URL; it doesn't even have to be a URL at all. I'm also not keen on the sample URL; it's just not obvious how it applies to the article. -- Strife (talk|contribs) 22:03, 25 January 2014 (PST)

I see the point about the description. The sample URL was copied from llGetHTTPHeader to llEscapeURL and llUnescapeURL because there was some text about the URL path and query string in llEscapeURL, and I thought a reference wouldn't do any harm for those who don't know what it is... yet. -- Kireji Haiku (talk|contribs) 07:12, 26 January 2014 (PST)

ASCII7

While looking at the above, could the following be better explained?

 "The hexadecimal encoded representation of UTF-8 byte encoding is the only supported means of access to non ASCII7 characters (Unicode characters)."

The word 'encoding' seems to be superfluous. What is the scope of this 'access'? In relation to this function only? Or in LSL in general? Does it perhaps mean that in order to use non ASCII7 characters in any string in LSL you need to write the string using a specific hex representation of those characters and then process the string via llUnescapeURL()? And should there be an example (because by hex representation here I believe we mean %hh and not 0xhhhh)? Omei Qunhua 04:37, 26 January 2014 (PST)

It does seem a bit wordy. These functions (both directions) only use UTF-8 byte encoding. That means you can't use UCS-2 (UTF-16) byte pairs or UTF-32 byte quads, or for that matter any other encoding scheme (UTF-EBCDIC, CESU-8). So llEscapeURL will always encode characters via their byte representation in UTF-8 and llUnescapeURL will always decode bytes as if they were UTF-8 (that is all it's trying to say). While this doesn't rule out the possibility of the "%u####" syntax, LL never implemented it. I'm thinking I should maybe write a spec, notes or deepnotes section for this one, as it's easier to describe how it could work than it is to describe what it does (convert the input string to a byte array and type pun the array into a UTF-8 backed string; anything that doesn't pun correctly gets converted to a "?"). -- Strife (talk|contribs) 15:59, 26 January 2014 (PST)
Well if I wasn't already confused, I am now! :) Omei Qunhua 01:55, 27 January 2014 (PST)
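
For anyone who wants to experiment outside the viewer, the "bytes in, UTF-8 out" description above can be sketched in Python. This is only an illustration under stated assumptions, not LL's implementation: unescape_utf8 is a made-up name, malformed "%" sequences are simply kept as literal text, and Python's replacement character is mapped to "?" to match the behaviour described above.

```python
import string


def unescape_utf8(text):
    """Collect %hh escapes as raw bytes, then decode every byte as UTF-8."""
    raw = bytearray()
    i = 0
    while i < len(text):
        hex_pair = text[i + 1:i + 3]
        is_escape = (
            text[i] == "%"
            and len(hex_pair) == 2
            and all(c in string.hexdigits for c in hex_pair)
        )
        if is_escape:
            raw.append(int(hex_pair, 16))  # %hh contributes exactly one byte
            i += 3
        else:
            # Assumption: anything that is not a valid %hh escape is kept literally.
            raw.extend(text[i].encode("utf-8"))
            i += 1
    # Bytes that are not valid UTF-8 decode to U+FFFD, which is mapped to "?".
    # Side effect of this shortcut: a literal U+FFFD in the input also becomes "?".
    return raw.decode("utf-8", errors="replace").replace("\ufffd", "?")


print(unescape_utf8("%E2%82%AC"))  # the three UTF-8 bytes of the euro sign: €
print(unescape_utf8("%u20AC"))     # "%u####" is not an escape here: %u20AC
print(unescape_utf8("%FF"))        # 0xFF is never valid in UTF-8: ?
```

Note that the escapes are %hh byte values, not 0xhhhh code points: the euro sign U+20AC is written as its three UTF-8 bytes.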

Let's see if this helps (meant to be read; I've not had time to compile it, I must get up in the morning ~_~):

<lsl>string UnescapeURL(string str) {
    integer pos = -llStringLength(str); // walk str with negative indexes, starting at the first character
    string out; // output buffer
    list queue; // queue of decoded %hh bytes awaiting UTF-8 decoding
    while(pos) {
        string char = llGetSubString(str, pos, pos);
        if(char != "%") {
            out += ByteList2UTF8String(queue); // flush the queue before appending this char
            queue = [];
            out += char;
            ++pos;
        } else if(pos > -3) { // fewer than two chars follow the "%": the string was truncated, we are done
            pos = 0;
        } else { // omg it's probably hex!
            integer test = (integer)("0xF" + llGetSubString(str, pos + 1, pos + 2) + "5BA11");
            if((test & 0xF005BA11) == 0xF005BA11) { // true only if both characters are hex
                queue += (test >> 20) & 0xFF;
                pos += 3;
            } else // if the first char is not hex, test will equal 0xF; if it's the second that is bad, test will equal 0xF*
                pos = 0;
        }
    }
    return out + ByteList2UTF8String(queue);
}

// Take this array of bytes and turn it into characters.
// UnicodeIntegerToUTF8 is not defined in this snippet.
string ByteList2UTF8String(list array) {
    string out;
    integer g; // bytes found
    integer h; // extracted bits
    integer i = ~llGetListLength(array); // position (negative index)
    integer k; // bytes per character
    while( ++i ) {
        integer j = llList2Integer(array, i);
        if(k > 0) { // are we in the middle of a multibyte char?
            if((0xC0 & j) == 0x80) { // make sure it's a payload byte
                h = (h << 6) | (j & 0x3F); // push the payload into h
                if(k == ++g) { // should verify that h is the smallest encoding possible, not going to do it.
                    out += UnicodeIntegerToUTF8(h); // we have the entire char now, go make it!
                    g = k = 0; // reset so the next byte starts a new char
                }
            } else { // it's not a payload byte, something's gone wrong.
                out += llGetSubString(":??????", 1, g); // one "?" per byte found; must run before g is reset
                g = k = 0;
            }
        } else { // start of a new char
            k = llListFindList(llListSort([j, 0x7F, 0xBF, 0xDF, 0xEF, 0xF7, 0xFB, 0xFD], 1, TRUE), [j]); // Footnote 1
            if(!k) // single byte char
                out += UnicodeIntegerToUTF8(j);
            else if((k > 1) && (k < 7)) { // start of a multibyte char.
                h = j & (0x7F >> k);
                g = 1;
            } else { // invalid char
                g = k = 0;
                out += "?";
            }
        }
    }
    if(k != g) // the input ended in the middle of a multibyte char: flush the corruption markers
        out += llGetSubString(":??????", g + 1, k);
    return out;
}</lsl>
Footnotes:

  1. These values are taken from UTF-8#Description, specifically from the Byte 1 column with the Xs replaced with 1s.
    k represents the number of bytes required. The first two values have special meaning: 0 is a single-byte character and 1 is a continuation byte, which cannot start a character. Because the list is sorted, the index of the first occurrence of j gives us this value.
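
The sorted-list trick from the footnote can be tried out in a few lines of Python. This is a sketch only: lead_byte_class and decode_sequence are hypothetical names, and decode_sequence assumes well-formed input and skips the error handling of the LSL version.

```python
import bisect

# Highest first-byte value for each row of the UTF-8 table (Xs replaced with 1s).
BOUNDS = [0x7F, 0xBF, 0xDF, 0xEF, 0xF7, 0xFB, 0xFD]


def lead_byte_class(j):
    """Index of the first occurrence of j once it is sorted in among BOUNDS."""
    return sorted([j] + BOUNDS).index(j)


def decode_sequence(data):
    """Rebuild one code point from a well-formed UTF-8 byte sequence."""
    k = lead_byte_class(data[0])
    h = data[0] & (0x7F >> k)      # keep only the payload bits of the first byte
    for j in data[1:k]:
        h = (h << 6) | (j & 0x3F)  # each continuation byte carries 6 payload bits
    return h


print(lead_byte_class(0x41))                      # ASCII byte: 0
print(lead_byte_class(0x82))                      # continuation byte: 1
print(lead_byte_class(0xE2))                      # starts a 3-byte character: 3
print(hex(decode_sequence([0xE2, 0x82, 0xAC])))   # 0x20ac
```

The same index is what bisect.bisect_left(BOUNDS, j) returns, since both count how many boundary values are strictly smaller than j.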

-- Strife (talk|contribs) 20:02, 28 January 2014 (PST)
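
The hex check in UnescapeURL is the least obvious line, so here is a Python sketch of why it works. It relies on the behaviour that the code comment describes, namely that the integer cast stops reading at the first non-hex character; lsl_hex_cast is a hypothetical stand-in for that cast, not a real LSL or Python API, and hex_pair_to_byte is likewise a made-up name.

```python
import re

SENTINEL = 0xF005BA11  # the bits contributed by the "0xF" prefix and the "5BA11" suffix


def lsl_hex_cast(text):
    """Stand-in for (integer)"0x...": parse hex digits up to the first non-hex char."""
    match = re.match(r"0[xX]([0-9a-fA-F]*)", text)
    if not match or not match.group(1):
        return 0
    return int(match.group(1), 16) & 0xFFFFFFFF  # keep 32 bits


def hex_pair_to_byte(pair):
    """Return the byte value of a two-character hex pair, or None if it is not hex."""
    test = lsl_hex_cast("0xF" + pair + "5BA11")
    # The suffix is only parsed if both characters of the pair were hex digits,
    # so the sentinel bits are all present exactly when the pair is valid.
    if (test & SENTINEL) == SENTINEL:
        return (test >> 20) & 0xFF
    return None


print(hex_pair_to_byte("4A"))  # 74
print(hex_pair_to_byte("4z"))  # None, parsing stops after "0xF4"
print(hex_pair_to_byte("zz"))  # None, parsing stops after "0xF"
```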

Fixed. It really didn't take that much to fix it. I had apparently forgotten how llListSort is used. The result is now closer to LSL's.

This could be done without an array, but you end up combining the two loops, and the code gets very muddy: the two tasks become hard to separate from each other (and hard to debug). It would be something of a state machine. Suffice it to say, not fun. -- Strife (talk|contribs) 20:22, 29 January 2014 (PST)