User:Becky Pippen/Hashing

From Second Life Wiki
< User:Becky Pippen
Revision as of 22:14, 4 January 2010 by Becky Pippen (talk | contribs) (scripting tutorial: hashing data to reduce memory usage)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Hashing to conserve memory

Hashing takes a longer item like an avatar name or UUID and maps it to something smaller, like an integer or a shorter string. It's compact, but the trade-off is that different inputs can possibly map to the same hash value by accident.

Here's an illustration of how much memory we can save by hashing UUIDs in a Mono script -- from 72 bytes in their native form to just a few bytes each. This illustrates techniques discussed in the rest of this article:

Key hash-1.jpg

There are two problems to solve to make hashing work in LSL: (1) designing a hash function that makes hash codes of the right size, evenly distributed across the hash space, and (2) storing the hash codes efficiently without using suffering the overhead of lists. Those two aspects are treated separately below.

Designing the hash function

A hash size too small increases the chance of a hash collision. That's when multiple values of the real data map to the same hash value. A hash size too large wastes memory.

Calculating the probability of hash collisions is similar to calculating the "Birthday Paradox" probabilities. The probability that there will be a duplicate among N random hashes of W bits each is approximately:

P = 1.0 - e ^ -(N ^2 / (2 * 2^W))

The table below shows some examples of the probability P of having at least one duplicate among N hashes of various sizes. For example, there is a probability of 0.0001 that there will be a duplicate in a set of 1000 32-bit hash codes.

15-bit 16-bit 30-bit 32-bit 45-bit
N P P P P P
100 0.14 0.07 0.000005 0.000001 0.0000000001
200 0.46 0.26 0.00002 0.000005 0.0000000006
500 0.978 0.85 0.0001 0.00003 0.0000000004
1000 1.0 0.9995 0.0005 0.0001 0.00000001
2000 1.0 1.0 0.0005 0.0005 0.00000006
5000 1.0 1.0 0.01 0.003 0.0000007
10000 1.0 1.0 0.05 0.01 0.000001
25000 1.0 1.0 0.26 0.07 0.000009
50000 1.0 1.0 0.69 0.25 0.00004

Hashing UUIDs

It appears that most UUIDs in SL are constructed with random bits (or sufficiently so for our purposes) in the first six bytes (first 12 hex digits) of the UUID. This lets us make a trivial hash function by extracting some of the hex digits from the UUID. For example, this code extracts the first eight hex digits as a hash:

<lsl> string hash(key k)

{
    return llGetSubString((string)k, 0, 7); // first 8 hex chars
} </lsl> 

That gives us a hash code with 32 bits of information content (eight hex characters at four bits each).

At this point you might wonder why not reduce the hash code to a regular 32-bit integer that only uses four bytes of memory. Unfortunately, we can't do anything with integers except store them in a list, and that makes any string-based solution more efficient, as described in a later section.

If you have any concern that the first few bytes of the UUID are not really as random as you would like, you can make an MD5 hash out of the UUID, then take hex digits from the MD5 hash string. For example, this hash function extracts eight hex digits from the MD5 hash of the UUID:

<lsl> string hash(key k)

{
    return llGetSubString(llMD5String((string)k, salt), 0, 7);
} </lsl>

Any salt value works in the code above, as long as it's consistent. Choose an unguessable value for salt if you want your hashing function to yield different values than other scripts using the same code snippet above. Think of the salt value as serving a purpose similar to the seed value used in pseudo-random number generators.

We can make the hash code even smaller by treating the eight hex digits as a 32-bit numeric value and converting that number into a six-character string using base64 representation. Something like this:

<lsl> string hash(key k)

{
    return llGetSubString(llIntegerToBase64(
        (integer)("0x" + llGetSubString((string)k, 0, 7))), 0, 5);
} </lsl>

In other words, the same information contained in eight hex characters fits into six base64 characters.

Hashing avatar names and other strings

Almost anything can be hashed using llMD5String(). That's a function historically used for encryption and security, and it's useful for our hashing needs because it nicely distributes all possible input values across its output range. The MD5 conversion gives us a long string of hex characters, and we can simply extract however many we want to use for a hash. The longer the hash, the less likely there will be a hash collision but the more memory it will consume. This example extracts twelve hex characters from the MD5 result to form a hash code worth 48 bits of information:

<lsl> string hash(string input)

{
    string md5 = llMD5String(input, salt);
    return llGetSubString(md5, 0, 11);
} </lsl> 

The problem with this hash function is that the average avatar name in SL is about 14 characters, so a twelve-character hash code is not a dramatic improvement. And then if we store these hash codes in a list, we eat up additional memory with list overhead. To save memory, we need to think of these hash codes as containing so many bits of information, not as characters. Then we need to think of a way to store those bits of information in a more compact form, i.e., without using lists. The next section describes a way of encoding information into Unicode character strings.

Avoiding list overhead

It's very difficult to store data compactly in Mono scripts -- there are no arrays; lists incur 20+ bytes of overhead per list element; and string data uses a minimum of two bytes per character. About the only way we have to encode arbitrary data in memory is by encoding the information as Unicode characters.

Using the functions in this article, you can store 15 bits of arbitrary data in each 16-bit Unicode character. It's not quite as straightforward as interpreting the 15 bits as the Unicode character ID because certain Unicode character codes may not be used. For example, we can't simply use the Unicode character ID 0x000000 to store a numeric value of zero, because 0x000000 is an illegal Unicode character code. So we have to remap any 15-bit number we want to store into some range of legal Unicode characters. The functions encode15BitsToChar() and decodeCharTo15Bits() do this remapping by adding 0x1000 to the 15-bit number to make a Unicode character ID in the range 0x1000 through 0x8fff. The result is that a 15-bit hash code can be stored as one character; a 30-bit hash code in two characters, and a 45-bit hash code in three Unicode characters.

Searching for hash codes when concatenated in a string

If the consequence of a false positive match is an acceptable risk, then searching a long string for a given hash code can be simplified to a call to llSubStringIndex(). In this example, hash codes are 45-bit values encoded as three Unicode characters each. All hash codes are concatenated into one long string named hashCodes:

<lsl> // Returns logical index of the given hash code

// or -1 if not found

string hashCodes;  // all hash codes are concatenated here

integer findHash(string hashCode)
{
    integer index = llSubStringIndex(hashCodes, hashCode);
    if (index != -1) {
        index /= 3;  // because each hash code is three Unicode chars
    }

    return index;
} </lsl>

The code above will find a false match if the hashCode value accidentally randomly matches characters that span two different hash codes in the hashCodes string. Your script can detect false positives by checking if the match occurred on a boundary of hashSize bytes:

<lsl> // Returns logical index of the given hash code

// or -1 if not found

string hashCodes;  // all hash codes are concatenated here
integer hashSize = 6; // number of chars in each hash code

integer findHash(string hashCode)
{
    integer index = llSubStringIndex(hashCodes, hashCode);
    if (index != -1 && (index % hashSize)) {
        // this is a false match.
        // fall back to a loop search, or whatever is appropriate
    } else if (index != -1) {
        // a valid match
        index /= hashSize;
    }

    return index;
} </lsl>

Code Example

Here is a complete example of a script that stores avatar names read from a notecard. First let's get a benchmark of the memory consumption using the naive approach of storing the names in a global list. The notecard I used for a test contains 2000 actual SL avatar names, one per line. There are only 1500 unique names in the notecard, so 500 of them are duplicates.

<lsl> integer lineNumber;

string notecardName = "names";
integer namesSaved;
list hashedNames;

default
{
    touch_start(integer num)
    {
        llGetNotecardLine(notecardName, lineNumber++);
    }

    dataserver(key id, string avName)
    {
        if (avName != EOF) {
            if (llListFindList(hashedNames, [avName]) == -1) {
                hashedNames += [avName];
                llOwnerSay((string)(++namesSaved) + ": " + avName + ", " +
                        (string)llGetFreeMemory() + " free");
            } else {
                llOwnerSay("duplicate: " + avName);
            }

            llGetNotecardLine(notecardName, lineNumber++);
        }
    }
} </lsl>

The data ate up 56000 bytes of available memory and ran out of memory after saving 1137 unique names. That works out to about 49 bytes consumed per item. That's roughly what we expected in Mono, based on the data in this page -- an average of 14 characters per avatar name uses 28 bytes of memory in Mono, plus 22 bytes of overhead per list element for a total of 14+28=42 bytes each.

Here is the same script but with the functions needed to save the avatar identities as 45-bit hash codes encoded as three Unicode characters each. It stores 1500 hashed names in just 9000 bytes. Even adding the 2500 extra bytes of program code to support hashing, that comes out to a total 11500 bytes consumed to store 1500 avatar identities, or 7.7 bytes each, compared to 49 bytes each on the average for the naive list approach above. That's an 84% net reduction in memory consumption.

<lsl> // Demo - store avatar identities as 45-bit hases encoded in 3 Unicode

// chars each, concatenated into one string.
// By Becky Pippen, 2010, contributed to the public domain.

integer lastFree;

// Report memory free and the change from the last report:
//
mem(string label)
{
    integer newFree = llGetFreeMemory();
    integer change = lastFree - newFree;
    llOwnerSay((string)newFree + " free, change = " + (string)change);
    lastFree = newFree;
}

// Converts n = [0..0xff] to "%" + two hex characters
//
string escapeHexChar(integer n)
{
    string hexChars = "0123456789abcdef";
    return "%" +
          llGetSubString(hexChars, n >> 4, n >> 4) +
          llGetSubString(hexChars, n & 0xf, n & 0xf);
}


// Encode 15 bits of data as a single Unicode character. Use decodeCharTo15Bits()
// to recover the original number.
//
// The technique used here is very similar to the technique used in the "Base 32768
// Script" in http://wiki.secondlife.com/wiki/Key_Compression .

string encode15BitsToChar(integer num)
{
    // Use only the lower 15 bits and bias by 0x1000 to avoid illegal Unicode IDs:

    num = 0x1000 + (num & 0x00007fff);

    // Construct an escaped hex UTF-8 representation and
    // return it as one Unicode character:

    return llUnescapeURL(
              escapeHexChar(0xe0 + (num >> 12)) +
              escapeHexChar(0x80 + ((num >> 6) & 0x3f)) +
              escapeHexChar(0x80 + (num & 0x3f)));
}


//****************************************
// demo -- store avatar identities as compressed 45-bit hashes of their name

integer lineNumber;       // for the notecard reader
string notecardName = "names"; // notecard of av names, one per line
integer namesSaved;       // count how many successfully saved
string hashedNames;       // all the hashes get concatenated in this string
integer hashSize = 3;     // num of Unicode chars per encoded hash code

default
{
    touch_start(integer num)
    {
        llGetNotecardLine(notecardName, lineNumber++);
    }

    dataserver(key id, string avName)
    {
        if (avName != EOF) {
            // Pick off 48 bits of information to start with:
            string md5 = llMD5String(avName, 0);
            integer n1 = (integer)("0x" + llGetSubString(md5, 0, 7));
            integer n2 = (integer)("0x" + llGetSubString(md5, 8, 11));

            // Map 45 of the bits into three characters. We cannot encode Unicode
            // directly, so we have to use UTF-8 as an intermediate form
            //
            // |------------------ n1 ------------------|  |------- n2 -------|
            // x000 0000  0000 0000  x000 0000  0000 0000  x000 0000  0000 0000
            //  |------char1------|   |------char2------|   |------char3------|

            string encoded3Chars = 
                encode15BitsToChar(n1 >> 16) +
                encode15BitsToChar(n1) +
                encode15BitsToChar(n2);

            integer idx = llSubStringIndex(hashedNames, encoded3Chars);
            if (idx == -1) {
                // this is a new one, so save it
                hashedNames += encoded3Chars;
                llOwnerSay((string)(++namesSaved) + ": " + avName + ", " +
                           (string)llGetFreeMemory() + " free");
            } else if (idx % hashSize) {
                llOwnerSay("WARNING!!! False match for " + avName);
            } else {
                llOwnerSay("Duplicate, already saved: " + avName);
            }

            llGetNotecardLine(notecardName, lineNumber++);
        } else {
            llOwnerSay("hashedNames len=" + (string)llStringLength(hashedNames));
        }
    }
} </lsl>

Also see

For more ideas about compressing bits into Unicode strings, see