User:Becky Pippen/Numeric Storage

A General Method of Encoding Binary Data in UTF-16 Unicode Strings in Mono

The problem and goal

A list in LSL incurs a large memory overhead for every list element (see LSL_Script_Memory). A true array type would solve the problem, but LSL has no array type. Strings, however, behave conceptually like an array in which each element is a Unicode character. So we'll use a string as the storage container and encode any binary data we want as fixed-size Unicode strings. Then the usual string functions can access, set, or get any element. This page describes a general method of storing 15 bits of information in each 16-bit Unicode character, for a memory efficiency of 15/16 = 93.75%.
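
For instance, here is a minimal sketch of treating a global string as an array of one-character elements (the helper names getElement and setElement are purely illustrative, not part of the method described below):

<lsl>
string DataBase;    // the whole "array", one Unicode character per element

// Read the i-th element (one character)
string getElement(integer i)
{
    return llGetSubString(DataBase, i, i);
}

// Overwrite the i-th element with a new one-character value
setElement(integer i, string ch)
{
    DataBase = llDeleteSubString(DataBase, i, i);
    DataBase = llInsertString(DataBase, i, ch);
}
</lsl>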

Background and terminology

Unicode refers to a set of several thousand glyphs, where each glyph is assigned a numeric ID in the range from 0 through 0x10ffff. However, that says nothing about how those 21-bit ID numbers are stored in memory.

  • UTF-8 refers to a way of storing the Unicode character codes in variable length byte sequences of one or more bytes. This is how text is stored in a classic LSO script.
  • UTF-16 refers to a way of storing the Unicode character numbers as one or two 16-bit integers. This is how text is stored in Mono scripts.

Escaped hex refers to a way of showing a character code as hex digits for the human's benefit. For example, the Unicode character "á" corresponds to Unicode character number 225. In UTF-8 format, that's stored as two bytes (0xc3,0xa1). The escaped hex format for that would look like the string "%c3%a1".
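
As a quick sketch of the same idea in a Mono script (just a demonstration, not part of the storage scheme itself), llEscapeURL() and llUnescapeURL() convert between a character and its escaped hex UTF-8 form:

<lsl>
default
{
    state_entry()
    {
        llOwnerSay(llEscapeURL("á"));        // prints "%C3%A1" (uppercase hex)
        llOwnerSay(llUnescapeURL("%c3%a1")); // prints "á"
    }
}
</lsl>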

For a summary of Unicode, UTF-8, and UTF-16, see: Unicode_In_5_Minutes.

Design considerations

In classic LSO, we could construct, store, and decode any Unicode character from U-00000000 through U-ffffffff, but in Mono the character range is limited to U-00000000 through U-0010fffd minus these codes that are not allowed to be used as characters:

0x0000
0xd800 - 0xdfff
0xfdd0 - 0xfdef
0xfffe - 0xffff
0x1fffe - 0x1ffff
0x2fffe - 0x2ffff

That's why we can't just use the entire range of character codes, so we'll use half. That means we must map the 32768 possible values of a 15-bit data item to a well-defined set of 32768 Unicode characters that are legal to use. And we must restrict our choices to the Unicode characters that can be represented with two bytes in UTF-16 format. The example code below uses Unicode characters U-001000 through U-008fff to store the data values 0 through 0x7fff. This could be called a base 32768 solution, and is very similar to the "Base 32768" solution for key storage described in Key_Compression.

Converting a 15-bit numeric value to a 16-bit Unicode character works as follows:

  • Add 0x1000 to remap the 15-bit values from [0x0000..0x8000) to the character range [0x1000..0x9000).
  • Convert the numeric value to an escaped UTF-8 string.
  • Convert the escaped UTF-8 representation to a single UTF-16 Unicode character using llUnescapeURL().

For example, here are the steps to encode the 15-bit value 0x112b:

  • Add 0x1000, resulting in 0x212b
  • Convert 0x212b to UTF-8 represented in escaped hex form = "%e2%84%ab"
  • Convert the escaped UTF-8 to a 16-bit Unicode char using llUnescapeURL("%e2%84%ab") (the resulting character is "Å")
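
A minimal sketch of that last step, hard-coding the escaped UTF-8 form from above:

<lsl>
default
{
    state_entry()
    {
        // 0x112b + 0x1000 = 0x212b, whose escaped UTF-8 form is "%e2%84%ab"
        string ch = llUnescapeURL("%e2%84%ab");
        llOwnerSay(ch);                          // prints "Å"
        llOwnerSay((string)llStringLength(ch));  // prints 1 -- a single character
    }
}
</lsl>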

The intermediate step of converting the 15-bit numeric value to variable-length UTF-8 depends on the number range:

bytes | bits | range      | representation
    1 |    7 | 0..7f      | 0vvvvvvv
    2 |   11 | 80..7ff    | 110vvvvv 10vvvvvv
    3 |   16 | 800..ffff  | 1110vvvv 10vvvvvv 10vvvvvv
    4 |   21 | 10000..    | 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv

Because we're only encoding values in the range 0x1000 - 0x8fff, we only need to deal with the 3-byte form shown above, simplifying the code but also making these functions unsuitable for general character conversions.
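
For example, the biased value 0x212b packs into the 3-byte form like this:

  0x212b = binary 0010 0001 0010 1011
  byte 1:  0xe0 + (0x212b >> 12)          = 0xe2  (1110 0010)
  byte 2:  0x80 + ((0x212b >> 6) & 0x3f)  = 0x84  (1000 0100)
  byte 3:  0x80 + (0x212b & 0x3f)         = 0xab  (1010 1011)

which gives the escaped hex string "%e2%84%ab" seen earlier.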

Converting a Unicode character to its numeric code is the reverse process. For example, given the Angstrom character "Å", llEscapeURL("Å") returns "%E2%84%AB", which is chopped up and converted in the reverse process to get 0x212b; subtracting the bias then gives back 0x112b.
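
Worked through with the same character, the arithmetic looks like this:

  llEscapeURL("Å")  =  "%E2%84%AB"  ->  bytes 0xe2, 0x84, 0xab
  (0xe2 & 0x1f) << 12  =  0x2000
  (0x84 & 0x3f) <<  6  =  0x0100
  (0xab & 0x3f)        =  0x002b
  sum                  =  0x212b,  minus the 0x1000 bias  =  0x112b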

A Mono script with a small amount of code dedicated to data storage could easily store 25,000 data items where each is an arbitrary 15-bit value. As long as you can combine or split your data to fit into 15-bit chunks, you can use this technique. For example, two such 15-bit items can hold a region positional vector with coordinates rounded off, or a 30-bit hash of an avatar name or UUID. This technique gives us 15/16 = 93.75% efficient utilization of memory, compared to about 20% efficiency when using global lists of integers or vectors.
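
As one possible sketch of that kind of packing (the helper names encode30Bits and decode30Bits are hypothetical; they rely on the encode/decode functions given in the source code example below), a 30-bit value can be split into two 15-bit chunks, one character each:

<lsl>
// Store a 30-bit value (0 .. 0x3fffffff) as two Unicode characters
string encode30Bits(integer num)
{
    return encode15BitsToChar(num >> 15) + encode15BitsToChar(num & 0x7fff);
}

// Recover the 30-bit value from the two characters
integer decode30Bits(string twoChars)
{
    return (decodeCharTo15Bits(llGetSubString(twoChars, 0, 0)) << 15) +
            decodeCharTo15Bits(llGetSubString(twoChars, 1, 1));
}
</lsl>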

Usage, Example, Properties:

<lsl>
string DataBase;              // some global string that holds all the elements

integer data15Bits = 0x1234;  // an example 15-bit data item to store
. . .
DataBase += encode15BitsToChar(data15Bits);  // append the new value to the DB
. . .
// Round-trip encoding/decoding is lossless, i.e., the following is always true:
decodeCharTo15Bits(encode15BitsToChar(data15Bits)) == data15Bits
</lsl>

Additional information can be found in the comments in the source code example below.

Source code example

<lsl>
// Converts n = [0..0xff] to two hex characters
//
string hexChar2(integer n)
{
    string hexChars = "0123456789abcdef";
    return llGetSubString(hexChars, n >> 4, n >> 4) +
           llGetSubString(hexChars, n & 0xf, n & 0xf);
}

// This is a memory-savings technique for use with Mono-compiled LSL scripts.
// (It probably works in classic LSO too, but not as efficiently.) This technique
// stores 15 bits of information in each 16-bit Unicode character. Use the
// encode function below to convert any 15-bit data to a Unicode character, and
// use the decode function to convert it back to the original 15-bit data.
//
// This example maps the data values 0 through 0x7fff to the Unicode
// characters U-001000 through U-008fff.
//
// By Becky Pippen, 2009, contributed to Public Domain
// The technique used here is very similar to the technique used in the "Base 32768
// Script" in http://wiki.secondlife.com/wiki/Key_Compression .

// Return a single 16-bit Unicode character in UTF-16 two-byte form
//
string encode15BitsToChar(integer num)
{
    // Check the incoming range

    if (num < 0 || num >= 0x8000) {
        // illegal input -- do whatever is appropriate
        return "�";
    }

    // Bias the incoming numeric value by 0x1000 to avoid illegal Unicode codes:

    num += 0x1000;

    // construct an escaped hex UTF-8 representation:

    string utf8 = "%" + hexChar2(0xe0 + (num >> 12)) +
                  "%" + hexChar2(0x80 + ((num >> 6) & 0x3f)) +
                  "%" + hexChar2(0x80 + (num & 0x3f));

    // Convert the UTF-8 into a 16-bit Unicode character:

    return llUnescapeURL(utf8);
}


// This is the inverse of encode15BitsToChar(), supra, q.v.
// This expects a single 16-bit Unicode character that was created by
// encode15BitsToChar() and returns the numeric value used to create it.
// The 15-bit return value will always be in the range 0x0000 - 0x7fff.
//
integer decodeCharTo15Bits(string ch)
{
    integer retVal;
    string utf8 = llEscapeURL(ch); // convert to escaped hex UTF-8

    retVal =  (((integer)("0x" + llGetSubString(utf8, 1, 2)) & 0x1f) << 12) +
              (((integer)("0x" + llGetSubString(utf8, 4, 5)) & 0x3f) << 6) +
               ((integer)("0x" + llGetSubString(utf8, 7, 8)) & 0x3f);

    return retVal - 0x1000;
}

string Database; // think of this as an array of 15-bit values

default
{
    touch_start(integer num)
    {
        integer data15Bits = (integer)llFrand(32768.0); // [0..32767] pretend data
        string encodedChar = encode15BitsToChar(data15Bits);
        Database += encodedChar;

        llOwnerSay("Appended data value " + (string)data15Bits +
                " to database as character \"" + encodedChar + "\"");

        llOwnerSay("At " + (string)llStringLength(Database) +
                " items, free mem = " + (string)llGetFreeMemory());
    }
}
</lsl>
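
As a small companion sketch (readElement is a hypothetical helper, not part of the example above), any stored value can be read back by its index, just like an array element:

<lsl>
// Returns the i-th 15-bit value stored in Database
integer readElement(integer i)
{
    return decodeCharTo15Bits(llGetSubString(Database, i, i));
}
</lsl>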

Also See

Read more about script memory limits, and other ways to reduce script memory usage.