Difference between revisions of "User:Becky Pippen/Text Storage"

From Second Life Wiki
Jump to navigation Jump to search
m (minor tweaks)
 
(2 intermediate revisions by 2 users not shown)
Line 3: Line 3:
===Terminology===
===Terminology===


Be warned -- these terms are not always used correctly and consistently in the LSL wikis. Here's the terminology from their various standards:
Be warned these terms are not always used correctly and consistently in the LSL wikis. Here's the terminology from their various standards:


'''[http://en.wikipedia.org/wiki/Unicode Unicode]''' is a character set of several thousand character glyphs and their assigned numeric ID codes. The numeric codes are 21-bit integers in the range 0 through 0x10ffff, although a few numbers in that range have special meaning other than character data. How are those 21-bit numbers stored in memory? That's what UTF-8 and UTF-16 are all about.
'''[http://en.wikipedia.org/wiki/Unicode Unicode]''' is a character set of several thousand character glyphs and their assigned numeric ID codes. The numeric codes are 21-bit integers in the range 0 through 0x10ffff, although a few numbers in that range have special meaning other than character data. How are those 21-bit numbers stored in memory? That's what UTF-8 and UTF-16 are all about.
Line 15: Line 15:
===So, summarize, ok?===
===So, summarize, ok?===


So, if using only ASCII and compiling with Mono, we can possibly reduce the memory requirements to LSO levels through clever encoding. Otherwise any attempt made to compress text will probably use more instruction code space than the amount of memory saved.
So, if using only an ASCII character set and compiling with Mono, we can possibly reduce the memory requirements to near LSO levels through clever encoding. Otherwise any attempt made to compress text will probably use more instruction code space than the amount of memory saved.


===ASCII text compression in Mono===
===ASCII text compression in Mono===
Line 21: Line 21:
This technique applies if your script uses only ASCII characters and only if compiling with Mono. We'll take the ASCII characters, two at a time, convert them to their Unicode numeric ID codes (which for these characters are identical to their ASCII numeric codes), combine them into a 14-bit integer, add a bias that so that the 14-bit numbers all are inside a valid range of Unicode characters, then convert it into a single Unicode character. This will result in a 16-bit Unicode character that bears no resemblance to the two ASCII characters that it encodes.
This technique applies if your script uses only ASCII characters and only if compiling with Mono. We'll take the ASCII characters, two at a time, convert them to their Unicode numeric ID codes (which for these characters are identical to their ASCII numeric codes), combine them into a 14-bit integer, add a bias that so that the 14-bit numbers all are inside a valid range of Unicode characters, then convert it into a single Unicode character. This will result in a 16-bit Unicode character that bears no resemblance to the two ASCII characters that it encodes.


The decoding process is just the reverse -- convert a Unicode character into its numeric value which will be a 14-bit integer, divide it into two 7-bit numbers, and convert that back to a string of two characters.
The decoding process is just the reverse convert a Unicode character into its numeric value which will be a 14-bit integer, divide it into two 7-bit numbers, and convert that back to a string of two characters.


For encoding, we need a couple of functions -- one to convert a single Unicode character to its numeric ID, and one to convert a 14-bit number into a single Unicode character. Here's the former:
For encoding, we need a couple of functions one to convert a single Unicode character to its numeric ID, and one to convert a 14-bit number into a single Unicode character. Here's the former:


<lsl> // Given a single character c, this returns its Unicode ID number
<source lang="lsl2">
// Given a single character c, this returns its Unicode ID number
  // This works only for character codes 0 through 0xffff.
  // This works only for character codes 0 through 0xffff.
  //
  //
Line 45: Line 46:
                     ((cInt & 0x00003f00) >> 8);
                     ((cInt & 0x00003f00) >> 8);
         } // else ignore the 4-byte UTF-8 form
         } // else ignore the 4-byte UTF-8 form
     } </lsl>
     }
</source>


This works because the function [[llStringToBase64]]() converts the character c into UTF-8 format first, then into base64 encoding. This isn't documented well, so let's be clear about it -- regardless if running in LSO where text is UTF-8 or in Mono where text is UTF-16, the function [[llStringToBase64]]() will return a base64-encoded form of the UTF-8 encoding of its argument.
This works because the function [[llStringToBase64]]() converts the character c into UTF-8 format first, then into base64 encoding. This isn't documented well, so let's be clear about it regardless if running in LSO where text is UTF-8 or in Mono where text is UTF-16, the function [[llStringToBase64]]() will return a base64-encoded form of the UTF-8 encoding of its argument.


This function is very similar to the function
This function is very similar to the function
<lsl> integer UTF8ToUnicodeInteger(string input); </lsl>
<source lang="lsl2">
integer UTF8ToUnicodeInteger(string input);
</source>
found in [[Combined_Library]]. I'm offering my own version here for two reasons: (1) I've named the function to emphasize that it takes a single character as string input, and to remove "UTF-8" from the function name because the role of UTF-8 in the function is only as a necessary intermediate encoding. The function is meant to work with LSO or Mono so that the underlying internal encoding of the input argument is transparent; and (2) my version is a little easier to read (at the expense of being a few bytes longer).
found in [[Combined_Library]]. I'm offering my own version here for two reasons: (1) I've named the function to emphasize that it takes a single character as string input, and to remove "UTF-8" from the function name because the role of UTF-8 in the function is only as a necessary intermediate encoding. The function is meant to work with LSO or Mono so that the underlying internal encoding of the input argument is transparent; and (2) my version is a little easier to read (at the expense of being a few bytes longer).


And just for completeness, if you need to revise the function to also work with Unicode ID codes above 0xffff, the additional else clause looks like this:
And just for completeness, if you need to revise the function to also work with Unicode ID codes above 0xffff, the additional else clause looks like this:


<lsl> . . .
<source lang="lsl2">
. . .
  } else {
  } else {
     // four-byte UTF-8 form:  1111 0vvv 10vv vvvv 10vv vvvv 10vv vvvv
     // four-byte UTF-8 form:  1111 0vvv 10vv vvvv 10vv vvvv 10vv vvvv
Line 62: Line 67:
             ((cInt & 0x00003f00) >> 2) |
             ((cInt & 0x00003f00) >> 2) |
             (cInt & 0x0000003f);
             (cInt & 0x0000003f);
  } </lsl>
  }
</source>


Next, we need a way to combine two 7-bit character codes into a single Unicode character. We'll simply combine two numbers to make a 14-bit integer, then convert that to a Unicode character using the function encode15BitsToChar() found in [[User:Becky_Pippen/Numeric_Storage]]:
Next, we need a way to combine two 7-bit character codes into a single Unicode character. We'll simply combine two numbers to make a 14-bit integer, then convert that to a Unicode character using the function encode15BitsToChar() found in [[User:Becky_Pippen/Numeric_Storage]]:
Line 72: Line 78:
First, we need to run a benchmark to measure memory usage in the normal uncompressed way. Simply put a bunch of ASCII text into a notecard and drop it into a prim with this script. It will concatenate all the notecard lines into a single global string named bigText. I used the text from http://www.gutenberg.org/files/6274/6274.txt as the notecard text. It runs out of memory after saving 1063 notecard lines:
First, we need to run a benchmark to measure memory usage in the normal uncompressed way. Simply put a bunch of ASCII text into a notecard and drop it into a prim with this script. It will concatenate all the notecard lines into a single global string named bigText. I used the text from http://www.gutenberg.org/files/6274/6274.txt as the notecard text. It runs out of memory after saving 1063 notecard lines:


<lsl>
<source lang="lsl2">
  integer lineNumber;
  integer lineNumber;
  string notecardName = "ASCII text";
  string notecardName = "ASCII text";
Line 94: Line 100:
     }
     }
  }
  }
</lsl>
</source>


Now we can compare -- here's the same notecard reader script using ASCII compression. Even though the added compression functions consume about 3K of program space, we can now store 1854 lines. '''That's 74% more text using compression.'''  
Now we can compare here's the same notecard reader script using ASCII compression. Even though the added compression functions consume about 3K of program space, we can now store 1854 lines. '''That's 74% more text using compression.'''  


<lsl> // Demo of ASCII compression in Mono scripts
<source lang="lsl2">
// Demo of ASCII compression in Mono scripts
  // By Becky Pippen, 2009, contributed to Public Domain
  // By Becky Pippen, 2009, contributed to Public Domain
   
   
Line 269: Line 276:
     }
     }
  }
  }
</lsl>
</source>


===Also see===
===Also see===


For more ideas about compressing bits into Unicode strings, see
For more ideas about compressing data and other memory-savings techniques, see
* [[User:Becky_Pippen/Key_Storage| Scripting techniques for storing lots of keys (UUIDs)]]
* [[User:Becky_Pippen/Numeric_Storage| Scripting techniques for storing arbitrary binary data without lists]]
* [[User:Becky_Pippen/Numeric_Storage| Scripting techniques for storing arbitrary binary data without lists]]
* [[User:Becky_Pippen/Script_Memory_Limits| Memory limits FAQ and Checklist for memory reduction]]
* [[User:Becky_Pippen/Script_Memory_Limits| Memory limits FAQ and Checklist for memory reduction]]
* [[User:Becky_Pippen/Hashing| Hashing for memory reduction]]
* [[User:Becky_Pippen/Hashing| Hashing for memory reduction]]

Latest revision as of 10:53, 2 May 2016

Text strings in scripts

Terminology

Be warned — these terms are not always used correctly and consistently in the LSL wikis. Here's the terminology from their various standards:

Unicode is a character set of several thousand character glyphs and their assigned numeric ID codes. The numeric codes are 21-bit integers in the range 0 through 0x10ffff, although a few numbers in that range have special meaning other than character data. How are those 21-bit numbers stored in memory? That's what UTF-8 and UTF-16 are all about.

UTF-8 and UTF-16 are two ways to represent the Unicode numeric codes in memory. UTF-8 encodes a Unicode ID number in a variable-length sequence of one to four bytes. UTF-16 encodes most characters in 16 bits, and some in 32 bits.

For a summary of Unicode character set and examples of the UTF-8 and UTF-16 storage formats, see: Unicode_In_5_Minutes.

LSO-compiled scripts store strings internally in memory in UTF-8 format and Mono uses UTF-16, but all that should be transparent to the script for most purposes. The main impact for the script and scripter is the amount of memory used. This page says that globally scoped strings use one byte per character in LSO and two bytes per character in Mono, which is true for strings containing only 7-bit ASCII characters. When the strings contain mostly international characters outside the ASCII range, then the memory consumption for UTF-8 in LSO and UTF-16 in Mono will both be close to two bytes per character, and in some cases the UTF-8 will be longer if there are many characters that require the three-byte or four-byte forms of UTF-8 encoding.

So, summarize, ok?

So, if using only an ASCII character set and compiling with Mono, we can possibly reduce the memory requirements to near LSO levels through clever encoding. Otherwise any attempt made to compress text will probably use more instruction code space than the amount of memory saved.

ASCII text compression in Mono

This technique applies if your script uses only ASCII characters and only if compiling with Mono. We'll take the ASCII characters, two at a time, convert them to their Unicode numeric ID codes (which for these characters are identical to their ASCII numeric codes), combine them into a 14-bit integer, add a bias that so that the 14-bit numbers all are inside a valid range of Unicode characters, then convert it into a single Unicode character. This will result in a 16-bit Unicode character that bears no resemblance to the two ASCII characters that it encodes.

The decoding process is just the reverse — convert a Unicode character into its numeric value which will be a 14-bit integer, divide it into two 7-bit numbers, and convert that back to a string of two characters.

For encoding, we need a couple of functions — one to convert a single Unicode character to its numeric ID, and one to convert a 14-bit number into a single Unicode character. Here's the former:

 // Given a single character c, this returns its Unicode ID number
 // This works only for character codes 0 through 0xffff.
 //
 integer charToUnicodeIdNumber(string c)
     integer cInt = llBase64ToInteger(llStringToBase64(c));
 
     if (!(cInt & 0x80000000)) {
         // UTF-8 single-byte form
         cInt = cInt >> 24;
     } else {
         if ((cInt & 0xe0000000) == 0xc0000000) {
             // two-byte UTF-8 form:  110v vvvv  10vv vvvv
             cInt = ((cInt & 0x1f000000) >> 18) |
                    ((cInt & 0x003f0000) >> 16);
         } else {
             // assume three-byte UTF-8 form:  1110 vvvv  10vv vvvv  10vv vvvv
             cInt = ((cInt & 0x0f000000) >> 12) |
                    ((cInt & 0x003f0000) >> 10) |
                    ((cInt & 0x00003f00) >> 8);
         } // else ignore the 4-byte UTF-8 form
     }

This works because the function llStringToBase64() converts the character c into UTF-8 format first, then into base64 encoding. This isn't documented well, so let's be clear about it — regardless if running in LSO where text is UTF-8 or in Mono where text is UTF-16, the function llStringToBase64() will return a base64-encoded form of the UTF-8 encoding of its argument.

This function is very similar to the function

 integer UTF8ToUnicodeInteger(string input);

found in Combined_Library. I'm offering my own version here for two reasons: (1) I've named the function to emphasize that it takes a single character as string input, and to remove "UTF-8" from the function name because the role of UTF-8 in the function is only as a necessary intermediate encoding. The function is meant to work with LSO or Mono so that the underlying internal encoding of the input argument is transparent; and (2) my version is a little easier to read (at the expense of being a few bytes longer).

And just for completeness, if you need to revise the function to also work with Unicode ID codes above 0xffff, the additional else clause looks like this:

 . . .
 } else {
     // four-byte UTF-8 form:  1111 0vvv 10vv vvvv 10vv vvvv 10vv vvvv
     cInt = ((cInt & 0x07000000) >> 6) |
            ((cInt & 0x003f0000) >> 4) |
            ((cInt & 0x00003f00) >> 2) |
             (cInt & 0x0000003f);
 }

Next, we need a way to combine two 7-bit character codes into a single Unicode character. We'll simply combine two numbers to make a 14-bit integer, then convert that to a Unicode character using the function encode15BitsToChar() found in User:Becky_Pippen/Numeric_Storage:

For decoding, we'll use the function decodeCharTo15Bits() found in User:Becky_Pippen/Numeric_Storage to get our 14-bit number, then split that into two 7-bit numbers, then use llUnescapeURL() to turn those into two characters in a string.

Code Example

First, we need to run a benchmark to measure memory usage in the normal uncompressed way. Simply put a bunch of ASCII text into a notecard and drop it into a prim with this script. It will concatenate all the notecard lines into a single global string named bigText. I used the text from http://www.gutenberg.org/files/6274/6274.txt as the notecard text. It runs out of memory after saving 1063 notecard lines:

 integer lineNumber;
 string notecardName = "ASCII text";
 integer linesRead;
 string bigText;
 
 default
 {
     touch_start(integer num)
     {
         llGetNotecardLine(notecardName, lineNumber++);
     }
     
     dataserver(key id, string data)
     {
         if (data != EOF) {
             bigText += data;
             llOwnerSay((string)(++linesRead) + ":" + data);
             llGetNotecardLine(notecardName, lineNumber++);
         }
     }
 }

Now we can compare — here's the same notecard reader script using ASCII compression. Even though the added compression functions consume about 3K of program space, we can now store 1854 lines. That's 74% more text using compression.

 // Demo of ASCII compression in Mono scripts
 // By Becky Pippen, 2009, contributed to Public Domain
 
 // Report memory free and the change from the last report:
 //
 integer lastFree;
 
 mem(string label)
 {
     integer newFree = llGetFreeMemory();
     integer change = lastFree - newFree;
     llOwnerSay((string)newFree + " free, change = " + (string)change);
     lastFree = newFree;
 }
 
 // Converts n = [0..0xff] to two hex characters
 //
 string hexChar2(integer n)
 {
     string hexChars = "0123456789abcdef";
     return llGetSubString(hexChars, n >> 4, n >> 4) +
            llGetSubString(hexChars, n & 0xf, n & 0xf);
 }
 
 // Given a single character c, this returns its Unicode ID number
 // This works only for character codes 0 through 0xffff.
 // For a more compact alternative, see UTF8ToUnicodeInteger()
 // found in http://wiki.secondlife.com/wiki/Combined_Library .
 //
 integer charToUnicodeIdNumber(string c)
 {
     integer cInt = llBase64ToInteger(llStringToBase64(c));
 
     if (!(cInt & 0x80000000)) {
         // UTF-8 single-byte form
         cInt = cInt >> 24;
     } else {
         if ((cInt & 0xe0000000) == 0xc0000000) {
             // two-byte UTF-8 form:  110v vvvv  10vv vvvv
             cInt = ((cInt & 0x1f000000) >> 18) |
                    ((cInt & 0x003f0000) >> 16);
         } else {
             // assume three-byte UTF-8 form:  1110 vvvv  10vv vvvv  10vv vvvv
             cInt = ((cInt & 0x0f000000) >> 12) |
                    ((cInt & 0x003f0000) >> 10) |
                    ((cInt & 0x00003f00) >> 8);
         } // else ignore the 4-byte UTF-8 form
     }
 
     return cInt;
 }
 
 // This is a memory-savings technique for use with Mono-compiled LSL scripts.
 // (It probably works in classic LSO too, but not as efficiently.) This technique
 // stores 15 bits of information in each 16-bit Unicode character. Use the
 // encode function below to convert any 15-bit data to a Unicode character, and
 // use the decode function to convert it back to the original 15-bit data.
 //
 // This example maps the data values 0 through 0x7fff to the Unicode
 // characters U-001000 through U-008fff. Use the matching function
 // decodeCharTo15Bits() to decode the Unicode character back into the original
 // 15-bit number.
 //
 // The technique used here is very similar to the technique used in the "Base 32768
 // Script" in http://wiki.secondlife.com/wiki/Key_Compression .
 
 // Convert any 15-bit integer into a single Unicode character
 //
 string encode15BitsToChar(integer num)
 {
     // Check the incoming range
 
     if (num < 0 || num >= 0x8000) {
         // illegal input -- do whatever is appropriate
         return "�";
     }
 
     // Bias the incoming numeric value by 0x1000 to avoid illegal Unicode codes:
 
     num += 0x1000;
 
     // Construct an escaped hex UTF-8 representation and return
     // it as a Unicode character
 
     return llUnescapeURL(
                   "%" + hexChar2(0xe0 + (num >> 12)) +
                   "%" + hexChar2(0x80 + ((num >> 6) & 0x3f)) +
                   "%" + hexChar2(0x80 + (num & 0x3f)));
 }
 
 
 // This is the inverse of encode15BitsToChar(), supra, q.v.
 // This expects a single 16-bit Unicode character that was created by
 // encode15BitsToChar() and returns the 15-bit numeric value used to create it.
 // The 15-bit return value will always be in the range 0x0000 - 0x7fff.
 //
 integer decodeCharTo15Bits(string ch)
 {
     string utf8 = llEscapeURL(ch); // convert to escaped hex UTF-8
 
     return
         (((integer)("0x" + llGetSubString(utf8, 1, 2)) & 0x1f) << 12) +
         (((integer)("0x" + llGetSubString(utf8, 4, 5)) & 0x3f) << 6) +
          ((integer)("0x" + llGetSubString(utf8, 7, 8)) & 0x3f) - 0x1000;
 }
 
 
 // Returns a Unicode string that encodes twice as many ASCII characters.
 // Use the matching function decompressAscii() to expand it back into
 // the original ASCII.
 //
 string compressAscii(string s)
 {
     integer len = llStringLength(s);
 
     // Append a space if needed to make s an even number of chars
     if (len % 2) {
        s += " ";
        ++len;
     }
 
     string encodedChars;
     integer i;
     for (i = 0; i < len; i += 2) {
         encodedChars += encode15BitsToChar(
                 charToUnicodeIdNumber(llGetSubString(s, i, i)) << 7 |
                 charToUnicodeIdNumber(llGetSubString(s, i+1, i+1)));
     }
 
     return encodedChars;
 }
 
 // This is the inverse of compressAscii()
 //
 string uncompressAscii(string s)
 {
     string result;
 
     integer len = llStringLength(s);
     integer i;
     for (i = 0; i < len; ++i) {
         integer cInt15 = decodeCharTo15Bits(llGetSubString(s, i, i));
         result += llUnescapeURL("%" + hexChar2(cInt15 >> 7) +
                                 "%" + hexChar2(cInt15 & 0x7f));
     }
 
     return result;
 }
 
 integer lineNumber;
 string notecardName = "ASCII text";
 integer linesRead;
 string bigText;
 
 default
 {
     touch_start(integer num)
     {
         llGetNotecardLine(notecardName, lineNumber++);
     }
     
     dataserver(key id, string data)
     {
         if (data != EOF) {
             string compressedLine = compressAscii(data);
             bigText += compressedLine;
             llOwnerSay((string)(++linesRead) + ":" + uncompressAscii(compressedLine));
             llGetNotecardLine(notecardName, lineNumber++);
         }
     }
 }

Also see

For more ideas about compressing data and other memory-savings techniques, see