Difference between revisions of "User:Becky Pippen/Key Storage"

Revision as of 12:57, 31 December 2009

Key/UUID Storage

Analysis of memory needed vs. memory used

Each key, or UUID, contains 128 bits of information which are stored in memory as 36 characters (32 hex digits plus four hyphens). If stored in a global list, there's an additional overhead of 12 bytes per element in classic LSO, or 34 bytes of overhead in Mono. That gives us a theoretical memory efficiency of 128/(8*(12+36)) = 33% for LSO, or 128/(8*(34+72)) = 16% for Mono.

Lossless key compression

If you need to store a lot of keys without losing any bits, see the suggestions at Key_Compression.

Lossy storage using hashes

If you need to store avatar names or keys just so the script can answer the question, "have I encountered this name or key before?" then consider storing only a hash of the name or key. Here is a convenient way to map all possible avatar names to 32-bit integers: <lsl> // Returns a 32-bit integer hash of the specified string

//
integer hash(string s)
{
    return (integer)("0x" + llGetSubString(llMD5String(s, salt), 0, 7));
}</lsl>

This uses the MD5 hash algorithm not for encryption or security, but for the way it evenly distributes all possible strings across the set of all possible integers.

Here is another hash function that maps the input string to a 32-bit hash space represented by a six-character base64 string:

<lsl> // Returns a six-character base64 hash of the specified string

//
string hash(string s)
{
    return llGetSubString(llIntegerToBase64((integer)("0x" + llGetSubString(llMD5String(s, salt), 0, 7))), 0, 5);
}</lsl>

With any hashing method that reduces a large set of possible values to a smaller set, there is a chance of hash collisions. This is when two or more strings accidentally map to the same hash value. The chances of this happening is similar to birthday paradox and can be approximated as:

P(n) = 1 - e ^ -(n^2 / 2 * x)

where P(n) is the possibility of having at least one duplicate among n hash values where x is the number of possible hash values. For example, if we map avatar names or keys to the set of 32-bit integers, the probability of a duplicate among 1000 hashes is P(1000) = 1 - e ^ -(1000^2 / 2 * 2^32) = 0.00012. The probability increases like this:

For 32-bit hashes:
n   probability of a hash duplicate
-   -------------------------------
1000    0.00012
10000   0.012
50000   0.25
100000  0.69

Storing the hashes as strings

Storing a bunch of hash values in an LSL list is not very efficient due to the overhead with every list element. Instead, consider concatenating a bunch of fixed-width hash values into a single string. For example, suppose you want to store avatar names and your hash function maps names to six-character strings as shown in the last hash function above:

<lsl> hash("Alli Avatar") returns "H2AyMw"

hash("Ally Avatar") returns "o+fBPA"
hash("Bobby Avatar") returns "y1qWIw"</lsl>

Those can be concatenated and stored in a string variable:

<lsl>string database = "H2AyMwo+fBPAy1qWIw";</lsl>

For six-character hashes, the n-th hash can be extracted as:

<lsl> integer hashSize = 6;

llGetSubString(database, n*6, n*6 + hashSize-1)</lsl>

In Mono, this uses 12 bytes of storage per avatar instead of 106 bytes per avatar if we stored UUIDs in a list.

Searching the list for a hash value can be done in a loop. Or it can be done with a single call to llSubStringIndex() at the risk of accidentally matching on an unaligned boundary:

<lsl> // Returns logical index [0..n) of the specified hash value, or -1 if not found:

//
integer findHashInDatabase(string hashValue)
{
    integer hashSize = 6;
    integer idx = llSubStringIndex(database, hashValue);
    if (idx == -1) {
        return -1;
    } else if (idx % hashSize == 0) {
        return idx / hashSize;
    } else {
        // oops -- matched on parts of two hashes -- here we could
        // fall back to a slower loop or recover in some other way
        . . .
    }
}</lsl>

For even better compression, write a hash function that generates 30-bit numeric hash values, then store them as two Unicode characters each as described in this page.

For more ideas about script memory usage, see this checklist of things you can do.