Difference between revisions of "User:Becky Pippen/Hashing"
Becky Pippen (talk | contribs) m (typos and formatting) |
(updated <lsl> tags to <source lang="lsl2"> tags) |
||
Line 102: | Line 102: | ||
It appears that most [[UUID]]s in SL are constructed with random bits (or sufficiently so for our purposes) in the first six bytes (first 12 hex digits) of the UUID. This lets us make a trivial hash function by extracting some of the hex digits from the UUID. For example, this code extracts the first eight hex digits as a hash: | It appears that most [[UUID]]s in SL are constructed with random bits (or sufficiently so for our purposes) in the first six bytes (first 12 hex digits) of the UUID. This lets us make a trivial hash function by extracting some of the hex digits from the UUID. For example, this code extracts the first eight hex digits as a hash: | ||
< | <source lang="lsl2">string hash(key k) | ||
{ | { | ||
return llGetSubString((string)k, 0, 7); // first 8 hex chars | return llGetSubString((string)k, 0, 7); // first 8 hex chars | ||
} </ | }</source> | ||
That gives us a hash code with 32 bits of information content (eight hex characters at four bits each). | That gives us a hash code with 32 bits of information content (eight hex characters at four bits each). | ||
Line 113: | Line 113: | ||
If you have any concern that the first few bytes of the UUID are not really as random as you would like, you can make an MD5 hash out of the UUID, then extract some hex digits from the MD5 hash string to make a shorter hash. For example, this hash function extracts eight hex digits from the MD5 hash of the UUID: | If you have any concern that the first few bytes of the UUID are not really as random as you would like, you can make an MD5 hash out of the UUID, then extract some hex digits from the MD5 hash string to make a shorter hash. For example, this hash function extracts eight hex digits from the MD5 hash of the UUID: | ||
< | <source lang="lsl2">string hash(key k) | ||
{ | { | ||
return llGetSubString(llMD5String((string)k, salt), 0, 7); | return llGetSubString(llMD5String((string)k, salt), 0, 7); | ||
} </ | } </source> | ||
Any salt value works in the code above, as long as it's consistent. Choose an unguessable value for salt if you want your hashing function to yield different values than other scripts using the same code snippet above. Think of the salt value as serving a purpose similar to the seed value used in pseudo-random number generators. | Any salt value works in the code above, as long as it's consistent. Choose an unguessable value for salt if you want your hashing function to yield different values than other scripts using the same code snippet above. Think of the salt value as serving a purpose similar to the seed value used in pseudo-random number generators. | ||
Line 122: | Line 122: | ||
We can make the hash code even smaller by treating the eight hex digits as a 32-bit numeric value and converting that number into a six-character string using [http://en.wikipedia.org/wiki/Base64 base64] representation. Something like this: | We can make the hash code even smaller by treating the eight hex digits as a 32-bit numeric value and converting that number into a six-character string using [http://en.wikipedia.org/wiki/Base64 base64] representation. Something like this: | ||
< | <source lang="lsl2">string hash(key k) | ||
{ | { | ||
return llGetSubString(llIntegerToBase64( | return llGetSubString(llIntegerToBase64( | ||
(integer)("0x" + llGetSubString((string)k, 0, 7))), 0, 5); | (integer)("0x" + llGetSubString((string)k, 0, 7))), 0, 5); | ||
} </ | } </source> | ||
In other words, the same information contained in eight hex characters fits into six base64 characters. | In other words, the same information contained in eight hex characters fits into six base64 characters. | ||
Line 134: | Line 134: | ||
Almost anything can be hashed using [[llMD5String]](). That's a function historically used for encryption and security, and it's useful for our hashing needs because it nicely distributes all possible input values across its output range. The MD5 conversion gives us a long string of hex characters, and we can simply extract however many we want to use for a hash. The longer the hash, the less likely there will be a hash collision but the more memory it will consume. This example extracts twelve hex characters from the MD5 result to form a hash code worth 48 bits of information: | Almost anything can be hashed using [[llMD5String]](). That's a function historically used for encryption and security, and it's useful for our hashing needs because it nicely distributes all possible input values across its output range. The MD5 conversion gives us a long string of hex characters, and we can simply extract however many we want to use for a hash. The longer the hash, the less likely there will be a hash collision but the more memory it will consume. This example extracts twelve hex characters from the MD5 result to form a hash code worth 48 bits of information: | ||
< | <source lang="lsl2">string hash(string input) | ||
{ | { | ||
string md5 = llMD5String(input, salt); | string md5 = llMD5String(input, salt); | ||
return llGetSubString(md5, 0, 11); | return llGetSubString(md5, 0, 11); | ||
} </ | }</source> | ||
The problem with this hash function is that the average avatar name in SL is about 14 characters, so a twelve-character hash code is not a dramatic improvement. And then if we store these hash codes in a list, we eat up additional memory with list overhead. To save memory, we need to think of these hash codes as containing so many bits of information, not as characters. Then we need to think of a way to store those bits of information in a more compact form, i.e., without using lists. The next section describes a way of encoding information into Unicode character strings. | The problem with this hash function is that the average avatar name in SL is about 14 characters, so a twelve-character hash code is not a dramatic improvement. And then if we store these hash codes in a list, we eat up additional memory with list overhead. To save memory, we need to think of these hash codes as containing so many bits of information, not as characters. Then we need to think of a way to store those bits of information in a more compact form, i.e., without using lists. The next section describes a way of encoding information into Unicode character strings. | ||
Line 152: | Line 152: | ||
If the consequence of a false positive match is an acceptable risk, then searching a long string for a given hash code can be simplified to a call to [[llSubStringIndex]](). In this example, hash codes are 45-bit values encoded as three Unicode characters each. All hash codes are concatenated into one long string named hashCodes: | If the consequence of a false positive match is an acceptable risk, then searching a long string for a given hash code can be simplified to a call to [[llSubStringIndex]](). In this example, hash codes are 45-bit values encoded as three Unicode characters each. All hash codes are concatenated into one long string named hashCodes: | ||
< | <source lang="lsl2">// Returns logical index of the given hash code | ||
// or -1 if not found | // or -1 if not found | ||
Line 165: | Line 165: | ||
return index; | return index; | ||
} </ | }</source> | ||
The code above will find a false match if the hashCode value accidentally randomly matches characters that span two different hash codes in the hashCodes string. Your script can detect false positives by checking if the match occurred on a boundary of hashSize bytes: | The code above will find a false match if the hashCode value accidentally randomly matches characters that span two different hash codes in the hashCodes string. Your script can detect false positives by checking if the match occurred on a boundary of hashSize bytes: | ||
< | <source lang="lsl2">// Returns logical index of the given hash code | ||
// or -1 if not found | // or -1 if not found | ||
Line 187: | Line 187: | ||
return index; | return index; | ||
} </ | }</source> | ||
===Code Example=== | ===Code Example=== | ||
Line 193: | Line 193: | ||
Here is a complete example of a script that stores avatar names read from a notecard. For each name read, the script searches its memory to see if that avatar name has been encountered before. If not, it adds the name to the list. First let's get a benchmark of the memory consumption using the naive approach of storing the names as strings in a global list. The notecard I used for a test contains 2000 actual SL avatar names, one per line. There are only 1500 unique names in the notecard, so 500 of them are duplicates scattered throughout. | Here is a complete example of a script that stores avatar names read from a notecard. For each name read, the script searches its memory to see if that avatar name has been encountered before. If not, it adds the name to the list. First let's get a benchmark of the memory consumption using the naive approach of storing the names as strings in a global list. The notecard I used for a test contains 2000 actual SL avatar names, one per line. There are only 1500 unique names in the notecard, so 500 of them are duplicates scattered throughout. | ||
< | <source lang="lsl2">integer lineNumber; | ||
string notecardName = "names"; | string notecardName = "names"; | ||
integer namesSaved; | integer namesSaved; | ||
Line 219: | Line 219: | ||
} | } | ||
} | } | ||
} </ | }</source> | ||
The data ate up 56000 bytes of available memory and ran out of memory after saving 1137 unique names. That works out to about 49 bytes consumed per item. That's very close to what we could have predicted in Mono, based on the data in [[LSL_Script_Memory|this page]] — an average of 14 characters per avatar name uses 28 bytes of memory in Mono, plus 22 bytes of overhead per list element for a total of 28+22=50 bytes each. | The data ate up 56000 bytes of available memory and ran out of memory after saving 1137 unique names. That works out to about 49 bytes consumed per item. That's very close to what we could have predicted in Mono, based on the data in [[LSL_Script_Memory|this page]] — an average of 14 characters per avatar name uses 28 bytes of memory in Mono, plus 22 bytes of overhead per list element for a total of 28+22=50 bytes each. | ||
Line 225: | Line 225: | ||
Here is the same script but with the functions needed to save the avatar identities as 45-bit hash codes encoded as three Unicode characters each. It stores 1500 hashed names in just 9000 bytes. Even adding the 2500 extra bytes of program code to support hashing, that comes out to a total 11500 bytes consumed to store 1500 avatar identities, or 7.7 bytes each, compared to 49 bytes each on the average for the naive list approach above. '''That's an 84% net reduction in memory consumption.''' | Here is the same script but with the functions needed to save the avatar identities as 45-bit hash codes encoded as three Unicode characters each. It stores 1500 hashed names in just 9000 bytes. Even adding the 2500 extra bytes of program code to support hashing, that comes out to a total 11500 bytes consumed to store 1500 avatar identities, or 7.7 bytes each, compared to 49 bytes each on the average for the naive list approach above. '''That's an 84% net reduction in memory consumption.''' | ||
< | <source lang="lsl2">// Demo - store avatar identities as 45-bit hashes encoded in 3 Unicode | ||
// chars each, concatenated into one string. | // chars each, concatenated into one string. | ||
// By Becky Pippen, 2010, contributed to the public domain. | // By Becky Pippen, 2010, contributed to the public domain. | ||
Line 315: | Line 315: | ||
} | } | ||
} | } | ||
} </ | }</source> | ||
===Also see=== | ===Also see=== |
Latest revision as of 23:22, 25 April 2015
Hashing to conserve memory
Hashing takes a longer item such as an avatar name or UUID and maps it to something smaller, like an integer or a shorter string. It's compact, but the trade-off is that different inputs can possibly map to the same hash value by accident.
Here's an illustration of how much memory we can save by hashing UUIDs in a Mono script — from 72 bytes in their native form to just a few bytes each. This illustrates techniques discussed in the rest of this article:
There are two problems to solve to make hashing work in LSL: (1) designing a hash function that makes hash codes of the right size, evenly distributed across the hash space, and (2) storing the hash codes efficiently without suffering the overhead of lists. Those two aspects are treated separately below.
Designing the hash function
A hash size too small increases the chance of a hash collision. That's when multiple values of the real data map to the same hash value. A hash size too large wastes memory.
Calculating the probability of hash collisions is similar to calculating the "Birthday Paradox" probabilities. The probability that there will be a duplicate among N random hashes of W bits each is approximately:
P = 1.0 - e ^ -(N^2 / (2 * 2^W))
The table below shows some examples of the probability P of having at least one duplicate among N hashes of various sizes. For example, there is a probability of 0.0001 that there will be a duplicate in a set of 1000 32-bit hash codes.
15-bit | 16-bit | 30-bit | 32-bit | 45-bit | |
N | P | P | P | P | P |
100 | 0.14 | 0.07 | 0.000005 | 0.000001 | 0.0000000001 |
200 | 0.46 | 0.26 | 0.00002 | 0.000005 | 0.0000000006 |
500 | 0.978 | 0.85 | 0.0001 | 0.00003 | 0.0000000004 |
1000 | 1.0 | 0.9995 | 0.0005 | 0.0001 | 0.00000001 |
2000 | 1.0 | 1.0 | 0.001 | 0.0005 | 0.00000006 |
5000 | 1.0 | 1.0 | 0.01 | 0.003 | 0.0000007 |
10000 | 1.0 | 1.0 | 0.05 | 0.01 | 0.000001 |
25000 | 1.0 | 1.0 | 0.26 | 0.07 | 0.000009 |
50000 | 1.0 | 1.0 | 0.69 | 0.25 | 0.00004 |
Hashing UUIDs
It appears that most UUIDs in SL are constructed with random bits (or sufficiently so for our purposes) in the first six bytes (first 12 hex digits) of the UUID. This lets us make a trivial hash function by extracting some of the hex digits from the UUID. For example, this code extracts the first eight hex digits as a hash:
string hash(key k)
{
return llGetSubString((string)k, 0, 7); // first 8 hex chars
}
That gives us a hash code with 32 bits of information content (eight hex characters at four bits each).
At this point you might wonder why not reduce the hash code to a regular 32-bit integer that only uses four bytes of memory. Unfortunately, we can't do anything with integers except store them in a list, and that makes any string-based solution more efficient, as described below.
If you have any concern that the first few bytes of the UUID are not really as random as you would like, you can make an MD5 hash out of the UUID, then extract some hex digits from the MD5 hash string to make a shorter hash. For example, this hash function extracts eight hex digits from the MD5 hash of the UUID:
string hash(key k)
{
return llGetSubString(llMD5String((string)k, salt), 0, 7);
}
Any salt value works in the code above, as long as it's consistent. Choose an unguessable value for salt if you want your hashing function to yield different values than other scripts using the same code snippet above. Think of the salt value as serving a purpose similar to the seed value used in pseudo-random number generators.
We can make the hash code even smaller by treating the eight hex digits as a 32-bit numeric value and converting that number into a six-character string using base64 representation. Something like this:
string hash(key k)
{
return llGetSubString(llIntegerToBase64(
(integer)("0x" + llGetSubString((string)k, 0, 7))), 0, 5);
}
In other words, the same information contained in eight hex characters fits into six base64 characters.
Hashing avatar names and other strings
Almost anything can be hashed using llMD5String(). That's a function historically used for encryption and security, and it's useful for our hashing needs because it nicely distributes all possible input values across its output range. The MD5 conversion gives us a long string of hex characters, and we can simply extract however many we want to use for a hash. The longer the hash, the less likely there will be a hash collision but the more memory it will consume. This example extracts twelve hex characters from the MD5 result to form a hash code worth 48 bits of information:
string hash(string input)
{
string md5 = llMD5String(input, salt);
return llGetSubString(md5, 0, 11);
}
The problem with this hash function is that the average avatar name in SL is about 14 characters, so a twelve-character hash code is not a dramatic improvement. And then if we store these hash codes in a list, we eat up additional memory with list overhead. To save memory, we need to think of these hash codes as containing so many bits of information, not as characters. Then we need to think of a way to store those bits of information in a more compact form, i.e., without using lists. The next section describes a way of encoding information into Unicode character strings.
Avoiding list overhead
It's very difficult to store data compactly in Mono scripts — there are no arrays; lists incur 20+ bytes of overhead per list element; and string data uses a minimum of two bytes per character. About the only way we have to encode arbitrary data in memory is by encoding the information as Unicode characters.
Using the functions in this article, you can store 15 bits of arbitrary data in each 16-bit Unicode character. It's not quite as straightforward as interpreting the 15 bits as the Unicode character ID because certain Unicode character codes may not be used. For example, we can't simply use the Unicode character ID 0x000000 to store a numeric value of zero, because 0x000000 is an illegal Unicode character code. So we have to remap any 15-bit number we want to store into some range of legal Unicode characters. The functions encode15BitsToChar() and decodeCharTo15Bits() do this remapping by adding 0x1000 to the 15-bit number to make a Unicode character ID in the range 0x1000 through 0x8fff. The result is that a 15-bit hash code can be stored as one character; a 30-bit hash code in two characters, and a 45-bit hash code in three Unicode characters.
Searching for hash codes when concatenated in a string
If the consequence of a false positive match is an acceptable risk, then searching a long string for a given hash code can be simplified to a call to llSubStringIndex(). In this example, hash codes are 45-bit values encoded as three Unicode characters each. All hash codes are concatenated into one long string named hashCodes:
// Returns logical index of the given hash code
// or -1 if not found
string hashCodes; // all hash codes are concatenated here
integer findHash(string hashCode)
{
integer index = llSubStringIndex(hashCodes, hashCode);
if (index != -1) {
index /= 3; // because each hash code is three Unicode chars
}
return index;
}
The code above will find a false match if the hashCode value accidentally randomly matches characters that span two different hash codes in the hashCodes string. Your script can detect false positives by checking if the match occurred on a boundary of hashSize bytes:
// Returns logical index of the given hash code
// or -1 if not found
string hashCodes; // all hash codes are concatenated here
integer hashSize = 6; // number of chars in each hash code
integer findHash(string hashCode)
{
integer index = llSubStringIndex(hashCodes, hashCode);
if (index != -1 && (index % hashSize)) {
// this is a false match.
// fall back to a loop search, or whatever is appropriate
} else if (index != -1) {
// a valid match
index /= hashSize;
}
return index;
}
Code Example
Here is a complete example of a script that stores avatar names read from a notecard. For each name read, the script searches its memory to see if that avatar name has been encountered before. If not, it adds the name to the list. First let's get a benchmark of the memory consumption using the naive approach of storing the names as strings in a global list. The notecard I used for a test contains 2000 actual SL avatar names, one per line. There are only 1500 unique names in the notecard, so 500 of them are duplicates scattered throughout.
integer lineNumber;
string notecardName = "names";
integer namesSaved;
list hashedNames;
default
{
touch_start(integer num)
{
llGetNotecardLine(notecardName, lineNumber++);
}
dataserver(key id, string avName)
{
if (avName != EOF) {
if (llListFindList(hashedNames, [avName]) == -1) {
hashedNames += [avName];
llOwnerSay((string)(++namesSaved) + ": " + avName + ", " +
(string)llGetFreeMemory() + " free");
} else {
llOwnerSay("duplicate: " + avName);
}
llGetNotecardLine(notecardName, lineNumber++);
}
}
}
The data ate up 56000 bytes of available memory and ran out of memory after saving 1137 unique names. That works out to about 49 bytes consumed per item. That's very close to what we could have predicted in Mono, based on the data in this page — an average of 14 characters per avatar name uses 28 bytes of memory in Mono, plus 22 bytes of overhead per list element for a total of 28+22=50 bytes each.
Here is the same script but with the functions needed to save the avatar identities as 45-bit hash codes encoded as three Unicode characters each. It stores 1500 hashed names in just 9000 bytes. Even adding the 2500 extra bytes of program code to support hashing, that comes out to a total 11500 bytes consumed to store 1500 avatar identities, or 7.7 bytes each, compared to 49 bytes each on the average for the naive list approach above. That's an 84% net reduction in memory consumption.
// Demo - store avatar identities as 45-bit hashes encoded in 3 Unicode
// chars each, concatenated into one string.
// By Becky Pippen, 2010, contributed to the public domain.
// Converts n = [0..0xff] to "%" + two hex characters
//
string escapeHexChar(integer n)
{
string hexChars = "0123456789abcdef";
return "%" +
llGetSubString(hexChars, n >> 4, n >> 4) +
llGetSubString(hexChars, n & 0xf, n & 0xf);
}
// Encode 15 bits of data as a single Unicode character. Use decodeCharTo15Bits()
// to recover the 15-bit data.
//
// The technique used here is very similar to the technique used in the "Base 32768
// Script" in http://wiki.secondlife.com/wiki/Key_Compression .
string encode15BitsToChar(integer num)
{
// Use only the lower 15 bits and bias by 0x1000 to avoid illegal Unicode IDs:
num = 0x1000 + (num & 0x00007fff);
// Construct an escaped hex UTF-8 representation and
// return it as one Unicode character:
return llUnescapeURL(
escapeHexChar(0xe0 + (num >> 12)) +
escapeHexChar(0x80 + ((num >> 6) & 0x3f)) +
escapeHexChar(0x80 + (num & 0x3f)));
}
//****************************************
// demo -- store avatar identities as compressed 45-bit hashes of their name
integer lineNumber; // for the notecard reader
string notecardName = "names"; // notecard of av names, one per line
integer namesSaved; // count how many successfully saved
string hashedNames; // all the hashes get concatenated in this string
integer hashSize = 3; // num of Unicode chars per encoded hash code
default
{
touch_start(integer num)
{
llGetNotecardLine(notecardName, lineNumber++);
}
dataserver(key id, string avName)
{
if (avName != EOF) {
// Pick off 48 bits of information to start with:
string md5 = llMD5String(avName, 0);
integer n1 = (integer)("0x" + llGetSubString(md5, 0, 7));
integer n2 = (integer)("0x" + llGetSubString(md5, 8, 11));
// Map 45 of the bits into three characters. We cannot encode Unicode
// directly from Unicode IDs, but we can do it by constructing an escaped
// hex UTF-8 string representation and converting that to Unicode
//
// |------------------ n1 ------------------| |------- n2 -------|
// x000 0000 0000 0000 x000 0000 0000 0000 x000 0000 0000 0000
// |------char1------| |------char2------| |------char3------|
string encoded3Chars =
encode15BitsToChar(n1 >> 16) +
encode15BitsToChar(n1) +
encode15BitsToChar(n2);
integer idx = llSubStringIndex(hashedNames, encoded3Chars);
if (idx == -1) {
// this is a new one, so save it
hashedNames += encoded3Chars;
llOwnerSay((string)(++namesSaved) + ": " + avName + ", " +
(string)llGetFreeMemory() + " free");
} else if (idx % hashSize) {
llOwnerSay("WARNING!!! False match for " + avName);
} else {
llOwnerSay("Duplicate, already saved: " + avName);
}
llGetNotecardLine(notecardName, lineNumber++);
}
}
}
Also see
For more ideas about compressing bits into Unicode strings, see
- Scripting techniques for storing arbitrary binary data without lists
- Memory limits FAQ and Checklist for memory reduction
If you need to store many keys without losing any bits, see the suggestions at Key_Compression.