User:Becky Pippen/Numeric Storage: Difference between revisions

Revision as of 00:00, 31 December 2009

A General Method of Encoding Binary Data in UTF-16 Unicode Strings in Mono

The problem and goal

A list in LSL incurs a large memory overhead for every list element (see LSL_Script_Memory ). A true array type in LSL would solve the problem, but LSL has no array type. Strings, however, have properties that are conceptually like an array where each element is a Unicode character. So, we'll use a string as a storage container, and encode any binary data we want as fixed-sized Unicode strings. Then you can use the usual string functions to access, set, or get any element. This page describes a general method of storing 15 bits of information in each 16-bit Unicode character for a memory efficiency of 15/16 = 93.75%.

Background and terminology

Unicode refers to a set of several thousand glyphs, where each glyph is assigned a numeric ID in the range from 0 through 0x10ffff. However, that says nothing about how those 21-bit ID numbers are stored in memory.

UTF-8 refers to a way of storing the Unicode character codes in variable length byte sequences of one or more bytes. This is how text is stored in a classic LSO script.

UTF-16 refers to a way of storing the Unicode character numbers as one or two 16-bit integers. This is how text is stored in Mono scripts.

Escaped hex refers to a way of showing a character code as hex digits for the human's benefit. For example, the Unicode character "G" corresponds to Unicode character number 71. In UTF-16 format used in Mono, that's stored as two bytes (0x00,0x47). The escaped hex format would look like the string "%00%47".

For a summary of Unicode, UTF-8, and UTF-16, see: Unicode_In_5_Minutes.

Design considerations

In classic LSO, we could construct, store, and decode any Unicode character from U-00000000 through U-ffffffff, but in Mono the character range is limited to U-00000000 through U-0010ffff minus these codes that are not allowed to be used as characters:

0x0000
0xd800 - 0xdfff
0xfdd0 - 0xfdef
0xfffe - 0xffff

That's why we can't just use the entire range of character codes, so we'll use half. That means we must map the 32767 possible values of a 15-bit data item to a well-defined set of 32768 Unicode characters that are legal to use. And we must restrict our choices to the Unicode characters that can be represented with two bytes in UTF-16 format. The example code below uses Unicode characters U-001000 through U-008fff to store the data values 0 through 0x7fff. This could be called a base 32768 solution, and is very similar to the "Base 32768" solution for key storage described in Key_Compression.

Converting a 15-bit numeric value to a 16-bit Unicode character is as follows:

Add 0x1000 to remap the 15-bit values from [0x0000..0x8000) to the character range [0x1000.. 0x9000).
Convert the numeric value to an escaped UTF-8 string.
Convert the escaped UTF-8 representation to a single UTF-16 Unicode character using llUnescapeURL().

For example, here are the steps to encode the 15-bit value 0x112b:

Add 0x1000, resulting in 0x212b
Convert 0x212b to UTF-8 represented in escaped hex form = "%e2%84%ab"
Convert the escaped UTF-8 to a 16-bit Unicode char using llUnescapeURL("%e2%84%ab") (The resulting character is "Å")

The intermediate step of converting the 15-bit numeric value to variable-length UTF-8 depends on the number range:

bytes | bits | range      | representation
    1 |    7 | 0..7f      | 0vvvvvvv
    2 |   11 | 80..7ff    | 110vvvvv 10vvvvvv
    3 |   16 | 800..ffff  | 1110vvvv 10vvvvvv 10vvvvvv
    4 |   21 | 10000..    | 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv

Because we're only encoding values in the range 0x1000 - 0x8fff, we only need to deal with the 3-byte form shown above, simplifying the code but also making these functions unsuitable for general character conversions.

Converting a Unicode character to its numeric code is the reverse process. For example, given the Angstrom character "Å", llEscapeURL("Å") returns "%E2%84%AB" which must be chopped up and converted in the reverse process to get 0x212b, then subtract the bias to get 0x112b.

A Mono script with a small amount of code dedicated to data storage could easily store 25,000 data items where each is an arbitrary 15-bit value. As long as you can combine or split your data to fit into 15-bit chunks, then you can use this technique. E.g, two such 15-bit items can hold a region positional vector with coordinates rounded off, or a 30-bit hash of an avatar name or UUID. This technique gives us 15/16 = 93.75% efficient utilization of memory, compared to about 20% efficiency if using global lists of integers or vectors.

Usage, Example, Properties:

<lsl> string DataBase; // some global string that holds all the elements

integer data15Bits = 0x1234; // an example 15-bit data item to store
. . .
DataBase += encode15BitsToChar(data15Bits); // append the new value to the DB
. . .
// Round-trip encoding/decoding is lossless. I.e., The following will always be true:
decodeCharTo15Bits(encode15BitsToChar(data15Bits)) == data15Bits </lsl>

Additional information can be found in the comments in the source code example below.

Source code example

<lsl> // Converts n = [0..0xff] to two hex characters

//
string hexChar2(integer n)
{
    string hexChars = "0123456789abcdef";
    return llGetSubString(hexChars, n >> 4, n >> 4) +
           llGetSubString(hexChars, n & 0xf, n & 0xf);
}

// This is a memory-savings technique for use with Mono-compiled LSL scripts.
// (It probably works in classic LSO too, but not as efficiently.) This technique
// stores 15 bits of information in each 16-bit Unicode character. Use the
// encode function below to convert any 15-bit data to a Unicode character, and
// use the decode function to convert it back to the original 15-bit data.
//
// This example maps the data values 0 through 0x7fff to the Unicode
// characters U-001000 through U-008fff.
//
// By Becky Pippen, 2009, contributed to Public Domain
// The technique used here is very similar to the technique used in the "Base 32768
// Script" in http://wiki.secondlife.com/wiki/Key_Compression .

// Return a single 16-bit Unicode character in UTF-16 two-byte form
//
string encode15BitsToChar(integer num)
{
    // Check the incoming range

    if (num < 0 || num >= 0x8000) {
        // illegal input -- do whatever is appropriate
        return "�";
    }

    // Bias the incoming numeric value by 0x1000 to avoid illegal Unicode codes:

    num += 0x1000;

    // construct an escaped hex UTF-8 representation:

    string utf8 = "%" + hexChar2(0xe0 + (num >> 12)) +
                  "%" + hexChar2(0x80 + ((num >> 6) & 0x3f)) +
                  "%" + hexChar2(0x80 + (num & 0x3f));

    // Convert the UTF-8 into a 16-bit Unicode character:

    return llUnescapeURL(utf8);
}


// This is the inverse of encode15BitsToChar(), supra, q.v.
// This expects a single 16-bit Unicode character that was created by
// encode15BitsToChar() and returns the numeric value used to create it.
// The 15-bit return value will always be in the range 0x0000 - 0x7fff.
//
integer decodeCharTo15Bits(string ch)
{
    integer retVal;
    string utf8 = llEscapeURL(ch); // convert to escaped hex UTF-8

    retVal =  (((integer)("0x" + llGetSubString(utf8, 1, 2)) & 0x1f) << 12) +
              (((integer)("0x" + llGetSubString(utf8, 4, 5)) & 0x3f) << 6) +
               ((integer)("0x" + llGetSubString(utf8, 7, 8)) & 0x3f);

    return retVal - 0x1000;
}

string Database; // think of this as an array of 15-bit values

default
{
    touch_start(integer num)
    {
        integer data15Bits = (integer)llFrand(32768.0); // [0..32767] pretend data
        string encodedChar = encode15BitsToChar(data15Bits);
        Database += encodedChar;

        llOwnerSay("Appended data value " + (string)data15Bits +
                " to database as character \"" + encodedChar + "\"");

        llOwnerSay("At " + (string)llStringLength(Database) +
                " items, free mem = " + (string)llGetFreeMemory());
    }
} </lsl>

@@ Line 20: / Line 20: @@
 through 0x10ffff. However, that says nothing about how those 21-bit ID numbers are stored in memory.
-* '''[[http://en.wikipedia.org/wiki/Utf-8 UTF-8]''' refers to a way of storing the Unicode character codes in variable length byte sequences of one or more bytes. This is how text is stored in a classic LSO script.
+* '''[http://en.wikipedia.org/wiki/Utf-8 UTF-8]''' refers to a way of storing the Unicode character codes in variable length byte sequences of one or more bytes. This is how text is stored in a classic LSO script.
-* '''[[http://en.wikipedia.org/wiki/Utf-16 UTF-16]''' refers to a way of storing the Unicode character numbers as one or two 16-bit integers. This is how text is stored in Mono scripts.
+* '''[http://en.wikipedia.org/wiki/Utf-16 UTF-16]''' refers to a way of storing the Unicode character numbers as one or two 16-bit integers. This is how text is stored in Mono scripts.
 '''Escaped hex''' refers to a way of showing a character code as hex digits

User:Becky Pippen/Numeric Storage: Difference between revisions

Revision as of 00:00, 31 December 2009

Contents

A General Method of Encoding Binary Data in UTF-16 Unicode Strings in Mono

The problem and goal

Background and terminology

Design considerations

Source code example

Also See

Navigation menu

User:Becky Pippen/Numeric Storage: Difference between revisions

Revision as of 00:00, 31 December 2009

A General Method of Encoding Binary Data in UTF-16 Unicode Strings in Mono

The problem and goal

Background and terminology

Design considerations

Source code example

Also See

Navigation menu

Search