Difference between revisions of "User:LindaB Helendale/UTF8StringLength"

Latest revision as of 16:28, 24 January 2015

UTF8StringLength : returns the number of bytes a string takes in UTF-8 encoding.

Channel communication, llOwnerSay, notecards, email, HTTP calls, etc. use UTF-8 encoding, in which one character takes from one to four bytes (one, two or three bytes for the characters used in most text). Message length limits are defined in bytes (e.g. llEmail 4500 bytes, llSay 1023 bytes), so the character count alone does not tell whether a message fits. This function can be used to guard against clipped messages and to split long messages into parts that fit within the limits.

 Explanation of the formula:
   L is the length of the string in UTF-8 bytes, which is what we want.
   N is the length of the string escaped by llEscapeURL.
   In the escaped string, UTF-8 characters of [1, 2, 3] bytes 
   map to strings of [1, 6, 9] plain ASCII chars, with each triplet
   of the form %XX.
   Let P be the number of '%' characters in the escaped string, 
   and n1, n2 and n3 the number of 1, 2 and 3 byte characters. Then
           L = n1 + 2 n2 + 3 n3
           N = n1 + 6 n2 + 9 n3
           P = 2 n2 + 3 n3
   Substitute P into N 
           N = n1 + 3 P   =>   n1 = N - 3 P
   and substitute into L = n1 + P
           L = (N - 3 P) + P = N - 2 P
   One-byte characters that llEscapeURL escapes (such as space or '%')
   and four-byte characters also become one %XX triplet per byte, so the
   same result holds for them.
     
   Another way to derive the formula, more intuitively: 
        In the escaped string every % starts a triplet %XX, which corresponds
        to one byte in the UTF-8 code but adds three characters to the escaped
        string length. Subtracting 2 * (number of "%"s) from the escaped string
        length therefore gives the number of bytes.
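
As a worked example of the formula L = N - 2 P, the following sketch prints the intermediate values for a short test string. It assumes that llEscapeURL escapes every byte outside [a-zA-Z0-9] as %XX; the sample string and the values in the comments are illustrative and follow from that assumption.

```lsl
default
{
    state_entry()
    {
        // Sample string (an assumption for illustration): 3 characters,
        // taking 2 + 1 + 1 = 4 bytes in UTF-8.
        string str = "Ä b";
        // Under the assumption above this should be "%C3%84%20b".
        string strEscaped = llEscapeURL(str);
        // N: length of the escaped string (3 + 3 + 3 + 1 = 10 here).
        integer N = llStringLength(strEscaped);
        // P: number of '%' characters (3 here).
        integer P = llGetListLength(llParseStringKeepNulls(strEscaped, ["%"], [])) - 1;
        // L = N - 2 P = 10 - 6 = 4 bytes.
        llOwnerSay("escaped=" + strEscaped + " N=" + (string)N
            + " P=" + (string)P + " L=" + (string)(N - 2 * P));
    }
}
```

The space is a one-byte character that is escaped anyway; it contributes 3 to N and 1 to P, so it still counts as one byte in the result.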
        

You may use this script any way you wish. (c) LindaB Helendale


      
integer UTF8StringLength(string str) {
    // UTF8StringLength : returns the number of bytes a string takes in UTF-8 coding.
    // Useful in guarding against limits in communication to avoid clipped messages.
    // LindaB Helendale, permission to use this script in any way granted.
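    string strEscaped = llEscapeURL(str);   // every escaped byte becomes %XX
    integer N = llStringLength(strEscaped); // length of the escaped string
    // P = number of '%' characters = number of pieces split at '%', minus one
    integer P = llGetListLength(llParseStringKeepNulls(strEscaped,["%"],[]))-1;
    return N - 2 * P ;                      // each %XX is 3 chars for 1 byte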
}
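
One possible way to use the function for splitting a long message into parts that fit a byte limit is sketched below. SayInParts and its parameters are hypothetical names introduced here for illustration and are not part of the original script. The sketch splits between characters, not between words.

```lsl
integer UTF8StringLength(string str) {
    string strEscaped = llEscapeURL(str);
    integer N = llStringLength(strEscaped);
    integer P = llGetListLength(llParseStringKeepNulls(strEscaped, ["%"], [])) - 1;
    return N - 2 * P;
}

// Hypothetical helper: say msg on a channel in pieces of at most maxBytes bytes.
SayInParts(integer channel, string msg, integer maxBytes) {
    while (msg != "") {
        // Every character takes at least one byte, so no more than
        // maxBytes characters can fit in one piece.
        integer count = llStringLength(msg);
        if (count > maxBytes) count = maxBytes;
        integer bytes = UTF8StringLength(llGetSubString(msg, 0, count - 1));
        while (bytes > maxBytes && count > 1) {
            // A character takes at most 4 bytes, so at least this many
            // characters have to be dropped from the end of the piece.
            count -= (bytes - maxBytes + 3) / 4;
            if (count < 1) count = 1;
            bytes = UTF8StringLength(llGetSubString(msg, 0, count - 1));
        }
        // A single character is always sent, even if maxBytes is smaller
        // than that character, so the loop cannot get stuck.
        llSay(channel, llGetSubString(msg, 0, count - 1));
        msg = llDeleteSubString(msg, 0, count - 1);
    }
}

default
{
    state_entry()
    {
        // 1023 is the llSay limit quoted in the text above.
        SayInParts(0, "A long message with ÄÖÅ and ☈☉☊ would go here.", 1023);
    }
}
```

Shrinking each piece by a computed step instead of one character at a time keeps the number of llEscapeURL calls small, which matters because each call processes the whole candidate piece.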


Demo script to show that it works:

integer UTF8StringLength(string str) {
    // UTF8StringLength : returns the number of bytes a string takes in UTF-8 coding.
    // Useful in guarding against limits in communication to avoid clipped messages.
    // LindaB Helendale, permission to use this script in any way granted.
    string strEscaped = llEscapeURL(str);
    integer N = llStringLength(strEscaped);
    integer P = llGetListLength(llParseStringKeepNulls(strEscaped,["%"],[]))-1;
    return N - 2 * P ;
}

test(string s) {
    llOwnerSay("[" + s + "] Length: " + (string)llStringLength(s) + " characters, " + (string)UTF8StringLength(s) + " bytes");
}

default
{
    state_entry()
    {
        test("This should be 24 bytes.");
        test("% or %% won't break it :)");
        test("ÄÖÅ are two bytes each, they add 3 bytes.");
        test("these 20 three-byte chars add 40 bytes ☈☉☊☋☌☍☎☏☐☑☒ℋℌℍℏℐℑℒ⚃㐎.");
    }
}

With output

[04:45] demo: [This should be 24 bytes.] Length: 24 characters, 24 bytes
[04:45] demo: [% or %% won't break it :)] Length: 25 characters, 25 bytes
[04:45] demo: [ÄÖÅ are two bytes each, they add 3 bytes.] Length: 41 characters, 44 bytes
[04:45] demo: [these 20 three-byte chars add 40 bytes ☈☉☊☋☌☍☎☏☐☑☒ℋℌℍℏℐℑℒ⚃㐎.] Length: 60 characters, 100 bytes

An alternative function is StringUTF8Size (see the Combined Library page, section "Byte Length of UTF-8 Encoded String").