Latest revision as of 04:57, 16 January 2012
A General Method of Encoding Binary Data in UTF-16 Unicode Strings in Mono
The problem and goal
A list in LSL incurs a large memory overhead for every list element (see LSL_Script_Memory ). A true array type in LSL would solve the problem, but LSL has no array type. Strings, however, have properties that are conceptually like an array where each element is a Unicode character. So, we'll use a string as a storage container, and encode any binary data we want as fixed-sized Unicode strings. Then you can use the usual string functions to access, set, or get any element. This page describes a general method of storing 15 bits of information in each 16-bit Unicode character for a memory efficiency of 15/16 = 93.75%.
Background and terminology
Unicode refers to a standardized set of many thousands of characters, where each character is assigned a numeric ID (its code point) in the range from 0 through 0x10ffff. However, that says nothing about how those 21-bit ID numbers are stored in memory.
- UTF-8 refers to a way of storing the Unicode character codes in variable length byte sequences of one or more bytes. This is how text is stored in a classic LSO script.
- UTF-16 refers to a way of storing the Unicode character numbers as one or two 16-bit integers. This is how text is stored in Mono scripts.
Escaped hex refers to a way of showing a character code as hex digits for the human's benefit. For example, the Unicode character "á" corresponds to Unicode character number 225. In UTF-8 format, that's stored as two bytes (0xc3,0xa1). The escaped hex format for that would look like the string "%c3%a1".
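The "á" example above can be checked mechanically. Here is a small sketch, in Python rather than LSL so it can be run outside the viewer (the arithmetic is the same in any language):

```python
# The character "á" is Unicode character number 225. UTF-8 stores it as the
# two bytes 0xc3 0xa1, and escaped hex writes those bytes as "%c3%a1".
ch = "\u00e1"                        # "á", Unicode character number 225
utf8_bytes = ch.encode("utf-8")      # UTF-8 form: b"\xc3\xa1"
escaped = "".join("%%%02x" % b for b in utf8_bytes)
print(escaped)                       # -> %c3%a1
```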
For a summary of Unicode, UTF-8, and UTF-16, see: Unicode_In_5_Minutes.
Design considerations
In classic LSO, we could construct, store, and decode any Unicode character from U-00000000 through U-ffffffff, but in Mono the character range is limited to U-00000000 through U-0010fffd minus these codes that are not allowed to be used as characters:
 0x0000
 0xd800 - 0xdfff
 0xfdd0 - 0xfdef
 0xfffe - 0xffff
 0x1fffe - 0x1ffff
 0x2fffe - 0x2ffff
Because of these exclusions, we can't simply use the entire range of character codes, so we'll use half of it. That means we must map the 32768 possible values of a 15-bit data item onto a well-defined set of 32768 Unicode characters that are legal to use. We must also restrict our choices to the Unicode characters that can be represented with two bytes in UTF-16 format. The example code below uses Unicode characters U-001000 through U-008fff to store the data values 0 through 0x7fff. This could be called a base-32768 solution, and it is very similar to the "Base 32768" solution for key storage described in Key_Compression.
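A quick sanity check of that choice of range, sketched in Python: the biased range 0x1000..0x8fff contains exactly 32768 code points and avoids every disallowed range listed above.

```python
# Disallowed codes from the list above: NUL, the UTF-16 surrogate block,
# and the noncharacter ranges.
disallowed = ({0x0000}
              | set(range(0xd800, 0xe000))     # UTF-16 surrogates
              | set(range(0xfdd0, 0xfdf0))     # noncharacters
              | {0xfffe, 0xffff, 0x1fffe, 0x1ffff, 0x2fffe, 0x2ffff})
usable = set(range(0x1000, 0x9000))            # the range used by the code below
print(len(usable), usable & disallowed)        # -> 32768 set()
```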
Converting a 15-bit numeric value to a 16-bit Unicode character is as follows:
- Add 0x1000 to remap the 15-bit values from [0x0000..0x8000) to the character range [0x1000..0x9000).
- Convert the numeric value to an escaped UTF-8 string.
- Convert the escaped UTF-8 representation to a single UTF-16 Unicode character using llUnescapeURL().
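The three steps above can be sketched as follows, in Python for illustration; `urllib.parse.unquote()` plays the role of LSL's llUnescapeURL(), and the function name `encode_15_bits` is ours:

```python
import urllib.parse

def encode_15_bits(num):
    """Map a 15-bit value (0..0x7fff) to one Unicode character."""
    num += 0x1000                                   # step 1: bias into 0x1000..0x8fff
    escaped = ("%%%02x" % (0xe0 | (num >> 12)) +    # step 2: 3-byte UTF-8, escaped hex
               "%%%02x" % (0x80 | ((num >> 6) & 0x3f)) +
               "%%%02x" % (0x80 | (num & 0x3f)))
    return urllib.parse.unquote(escaped)            # step 3: like llUnescapeURL()

print(encode_15_bits(0x112b))                       # -> Å  (U+212B, Angstrom sign)
```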
For example, here are the steps to encode the 15-bit value 0x112b:
- Add 0x1000, resulting in 0x212b
- Convert 0x212b to UTF-8, represented in escaped hex form as "%e2%84%ab"
- Convert the escaped UTF-8 to a 16-bit Unicode char using llUnescapeURL("%e2%84%ab") (The resulting character is "Å")
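Each step of that worked example can be verified directly (sketched in Python for convenience):

```python
value = 0x112b
biased = value + 0x1000                  # step 1
utf8 = chr(biased).encode("utf-8")       # step 2: the escaped form "%e2%84%ab"
# step 3: the resulting character is the Angstrom sign
print(hex(biased), utf8, chr(biased))    # -> 0x212b b'\xe2\x84\xab' Å
```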
The intermediate step of converting the 15-bit numeric value to variable-length UTF-8 depends on the number range:
 bytes | bits | range      | representation
   1   |   7  | 0..7f      | 0vvvvvvv
   2   |  11  | 80..7ff    | 110vvvvv 10vvvvvv
   3   |  16  | 800..ffff  | 1110vvvv 10vvvvvv 10vvvvvv
   4   |  21  | 10000..    | 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
Because we're only encoding values in the range 0x1000 - 0x8fff, we only need to deal with the 3-byte form shown above, simplifying the code but also making these functions unsuitable for general character conversions.
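That claim is easy to confirm: every code point in the biased range 0x1000..0x8fff lies inside the three-byte row of the table (0x800..0xffff). A one-line Python check:

```python
# Collect the UTF-8 encoded lengths of every character in the biased range.
lengths = {len(chr(n).encode("utf-8")) for n in range(0x1000, 0x9000)}
print(lengths)   # -> {3}
```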
Converting a Unicode character to its numeric code is the reverse process. For example, given the Angstrom character "Å", llEscapeURL("Å") returns "%E2%84%AB" which must be chopped up and converted in the reverse process to get 0x212b, then subtract the bias to get 0x112b.
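The reverse process can be sketched the same way, again in Python for illustration; `urllib.parse.quote()` plays the role of llEscapeURL(), and the function name `decode_char_to_15_bits` is ours:

```python
import urllib.parse

def decode_char_to_15_bits(ch):
    """Reverse the encoding: escape to hex UTF-8, reassemble the bits, remove the bias."""
    esc = urllib.parse.quote(ch)           # like llEscapeURL(), e.g. "%E2%84%AB"
    code = (((int(esc[1:3], 16) & 0x1f) << 12) |   # low 4 bits of byte 1
            ((int(esc[4:6], 16) & 0x3f) << 6) |    # low 6 bits of byte 2
            (int(esc[7:9], 16) & 0x3f))            # low 6 bits of byte 3
    return code - 0x1000

print(hex(decode_char_to_15_bits("\u212b")))       # -> 0x112b
```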
A Mono script with a small amount of code dedicated to data storage could easily store 25,000 data items where each is an arbitrary 15-bit value. As long as you can combine or split your data to fit into 15-bit chunks, you can use this technique. E.g., two such 15-bit items can hold a region positional vector with coordinates rounded off, or a 30-bit hash of an avatar name or UUID. This technique gives us 15/16 = 93.75% efficient utilization of memory, compared to about 20% efficiency if using global lists of integers or vectors.
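As one example of splitting data into 15-bit chunks, here is a hypothetical packing scheme, sketched in Python. The function names and the 10-bits-per-coordinate quantization are our assumptions for illustration, not part of the original method: a position with each coordinate rounded to an integer 0..1023 fits in 30 bits, i.e. exactly two 15-bit items.

```python
def pack_position(x, y, z):
    """Pack three 10-bit coordinates (0..1023 each) into two 15-bit items."""
    packed = (x << 20) | (y << 10) | z        # 30 bits total
    return (packed >> 15, packed & 0x7fff)    # two 15-bit chunks

def unpack_position(hi, lo):
    """Inverse of pack_position()."""
    packed = (hi << 15) | lo
    return (packed >> 20, (packed >> 10) & 0x3ff, packed & 0x3ff)

print(unpack_position(*pack_position(128, 255, 42)))   # -> (128, 255, 42)
```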
Usage, Example, Properties:
<lsl>
string DataBase;    // some global string that holds all the elements

integer data15Bits = 0x1234;    // an example 15-bit data item to store
    . . .
DataBase += encode15BitsToChar(data15Bits);    // append the new value to the DB
    . . .

// Round-trip encoding/decoding is lossless. I.e., the following will always be true:
decodeCharTo15Bits(encode15BitsToChar(data15Bits)) == data15Bits
</lsl>
Additional information can be found in the comments in the source code example below.
Source code example
<lsl>
// Converts n = [0..0xff] to two hex characters
//
string hexChar2(integer n)
{
    string hexChars = "0123456789abcdef";
    return llGetSubString(hexChars, n >> 4, n >> 4) +
           llGetSubString(hexChars, n & 0xf, n & 0xf);
}

// This is a memory-savings technique for use with Mono-compiled LSL scripts.
// (It probably works in classic LSO too, but not as efficiently.) This technique
// stores 15 bits of information in each 16-bit Unicode character. Use the
// encode function below to convert any 15-bit data to a Unicode character, and
// use the decode function to convert it back to the original 15-bit data.
//
// This example maps the data values 0 through 0x7fff to the Unicode
// characters U-001000 through U-008fff.
//
// By Becky Pippen, 2009, contributed to Public Domain
//
// The technique used here is very similar to the technique used in the "Base 32768
// Script" in http://wiki.secondlife.com/wiki/Key_Compression .

// Return a single 16-bit Unicode character in UTF-16 two-byte form
//
string encode15BitsToChar(integer num)
{
    // Check the incoming range
    if (num < 0 || num >= 0x8000) {
        // illegal input -- do whatever is appropriate
        return "�";
    }

    // Bias the incoming numeric value by 0x1000 to avoid illegal Unicode codes:
    num += 0x1000;

    // construct an escaped hex UTF-8 representation:
    string utf8 = "%" + hexChar2(0xe0 + (num >> 12)) +
                  "%" + hexChar2(0x80 + ((num >> 6) & 0x3f)) +
                  "%" + hexChar2(0x80 + (num & 0x3f));

    // Convert the UTF-8 into a 16-bit Unicode character:
    return llUnescapeURL(utf8);
}

// This is the inverse of encode15BitsToChar(), supra, q.v.
// This expects a single 16-bit Unicode character that was created by
// encode15BitsToChar() and returns the numeric value used to create it.
// The 15-bit return value will always be in the range 0x0000 - 0x7fff.
//
integer decodeCharTo15Bits(string ch)
{
    integer retVal;
    string utf8 = llEscapeURL(ch);    // convert to escaped hex UTF-8
    retVal = (((integer)("0x" + llGetSubString(utf8, 1, 2)) & 0x1f) << 12) +
             (((integer)("0x" + llGetSubString(utf8, 4, 5)) & 0x3f) << 6) +
              ((integer)("0x" + llGetSubString(utf8, 7, 8)) & 0x3f);
    return retVal - 0x1000;
}

string Database;    // think of this as an array of 15-bit values

default
{
    touch_start(integer num)
    {
        integer data15Bits = (integer)llFrand(32768.0);    // [0..32767] pretend data
        string encodedChar = encode15BitsToChar(data15Bits);
        Database += encodedChar;

        llOwnerSay("Appended data value " + (string)data15Bits +
                   " to database as character \"" + encodedChar + "\"");
        llOwnerSay("At " + (string)llStringLength(Database) +
                   " items, free mem = " + (string)llGetFreeMemory());
    }
}
</lsl>
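The lossless round-trip property promised above can be checked exhaustively outside the viewer. This Python sketch mirrors the bit manipulation of the two LSL functions (with `urllib.parse.unquote`/`quote` standing in for llUnescapeURL/llEscapeURL) and tests all 32768 values:

```python
import urllib.parse

def encode(num):
    """Python mirror of encode15BitsToChar() for num in 0..0x7fff."""
    num += 0x1000
    esc = "".join("%%%02x" % b for b in
                  (0xe0 | (num >> 12), 0x80 | ((num >> 6) & 0x3f), 0x80 | (num & 0x3f)))
    return urllib.parse.unquote(esc)

def decode(ch):
    """Python mirror of decodeCharTo15Bits()."""
    esc = urllib.parse.quote(ch)
    return (((int(esc[1:3], 16) & 0x1f) << 12) |
            ((int(esc[4:6], 16) & 0x3f) << 6) |
            (int(esc[7:9], 16) & 0x3f)) - 0x1000

ok = all(decode(encode(v)) == v for v in range(0x8000))
print(ok)   # -> True
```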
See also
Read more about script memory limits, and other ways to reduce script memory usage.