Difference between revisions of "User:JTBarnabas Resident"

From Second Life Wiki
Jump to navigation Jump to search
(Unique(ish) Identifiers)
 
 
(7 intermediate revisions by the same user not shown)
Line 1: Line 1:
==Unique(ish) Indentifiers==
My challenge was to produce a script that allowed each avatar to interact with the scripted object in a particular way only one time and after that to interact with it in a different way.
My challenge was to produce a script that allowed each avatar to interact with the scripted object in a particular way only one time and after that to interact with it in a different way.


Line 9: Line 11:
I eventually rejected names. In order to use them I would either have to accept variable identifier lengths or encode them so all were the same length as the longest thus defeating the purpose of using them, specifically to find a more compact, less memory expensive unique identifier.
I eventually rejected names. In order to use them I would either have to accept variable identifier lengths or encode them so all were the same length as the longest thus defeating the purpose of using them, specifically to find a more compact, less memory expensive unique identifier.


So, that left me with UUIDs. Stored in a list as a key they require 110 bytes per UUID. Storage as string reduces this by 20 bytes per string but that is still a very large record size. Even stripping out the dashes only reduces them by 8 bytes. However, those remaining 32 characters are hexidecimals and by converting them to integers I knew I could find a more compact way to store them.
So, that left me with UUIDs. Stored in a list as a key they require 110 bytes per UUID. Storage as string reduces this by 20 bytes per string but that is still a very large record size. Even stripping out the dashes only reduces them by 8 bytes. However, those remaining 32 characters are hexadecimals and by converting them to integers I knew I could find a more compact way to store them.


My first idea was to convert them to conventional string data using something like this:
My first idea was to convert them to conventional string data using something like this:
<source lang="lsl2">
<source lang="lsl2">
integer count;
    integer count = 5;
integer data = integerValueOfFirst7HexCharacters;
    integer data = integerValueOfFirst7HexCharacters;
while (data)
    while (data)
    {
makeSymbol(data % 1114111);
data = (data / 1114111);
if (data < 1114111 && count)
{
{
makeSymbol(data % 1114111);
    data += integerValueOf5MoreHexCharacters * 1114111;
data = llFloor(data / 1114111);
    --count;
if (data < 1114111 && count <= 5)
data += integerValueOf5MoreHexCharacters * 1114111;
++count;
}
}
    }
</source>
</source>
where makeSymbol was a function to convert the number to a UTF-16 symbol. The problem with this approach was each symbol could be 2 or 4 bytes and again I would have the problem of variable record lengths necessitating the use of a list.
where makeSymbol was a function to convert the number to a UTF-16 symbol. The problem with this approach was each symbol could be 2 or 4 bytes and again I would have the problem of variable record lengths necessitating the use of a list.


I thought of something I later learned was sometimes called base32768 encoding. It is a method of storing 15 bits of data as a UTF-8 character. Each UTF-8 character requires exactly 2 bytes so I could eliminate the variable record size *and* bring the storage efficiency from 83.70% to 93.75%
I thought of something I later learned was sometimes called base32768 encoding. It is a method of storing 15 bits of data as a UTF-8 character. Each UTF-8 character requires exactly 2 bytes so I could eliminate the variable record size *and* bring the storage efficiency from 83.70% to 93.75%


Using base32768 encoding allowed me to reduce a 36 character UUID to just 9 characters *and* allowed me to store them in what I like to call a strided string. Even though the script would require the use of llGetSubString, which I have been warned can crash a script, I deemed the risk to be acceptable, especially considering I would not be using UTF-16 characters, which I believe the only documented case of script crashing involved.
Using base32768 encoding allowed me to reduce a 36 character UUID to just 9 characters *and* allowed me to store them in something like a strided string what I like to call a straddle.


I discussed ways to reduce storage space with the good folks in the Scripting Academy group chat. One of them, The Scripter (Silvermane Mode), who runs one of the largest Name2Key databases in SL, informed me converting the first 5 characters of the UUID produced a hash that has not clashed (produced identical results) in his list of over 400,000 avatars and I had decided to trust his years of experience and use the same while I was still using the base1114111 encoding; however, after deciding to use the base32768 encoding I realized I could use the first 7 characters in the UUID to produce 2 UTF-8 characters decreasing the chance of hash clash by over 94.6%.
I discussed ways to reduce storage space with the good folks in the Scripting Academy group chat. One of them, The Scripter (Silvermane Mode), who runs one of the largest Name2Key databases in SL, informed me converting the first 5 characters of the UUID produced a hash that has not clashed (produced identical results) in his list of over 400,000 avatars and I had decided to trust his years of experience and use the same while I was still using the base1114111 encoding; however, after deciding to use the base32768 encoding I realized I could use the first 7 characters in the UUID to produce 2 UTF-8 characters decreasing the chance of hash clash by over 94.6%.


While considering ways to further compact my storage requirements I realized there were unused bits and I could use those bits to make the first character different from any of the following characters by simply shifting them all up one bit (e.g. multiply by 2) and adding 1 to only the first character. I will call this base16384+1 encoding.
While considering ways to further compact my storage requirements I realized there were unused bits and I could use those bits to make the first character different from any of the following characters by simply shifting them all up one bit (e.g. multiply by 2) and adding 1 to only the first character. I will call this base0x4000+1 encoding.


By doing this I will be able to eliminate the need to use llGetSubString and simplify the search routine using llSubStringIndex. Adding a bitwise not (~) I can pass it as the parameter as a conditional reducing the code space needed to work with the unique identifiers in my script even further. However, to store the full UUID in a form that it is retrievable will require 10 string characters using base16384+1 encoding.
By doing this I will be able to eliminate the need to use llGetSubString and simplify the search routine using llSubStringIndex. Adding a bitwise not (~) I can pass it as the parameter of a conditional reducing the code space needed to work with the unique identifiers in my script even further. However, to store the full UUID in a form that it is retrievable will require 10 string characters using base0x4000+1 encoding.


If all UUIDs were v4 or v5 9 characters would suffice but I understand not all are. I do not need to retrieve the UUIDs but the code here http://wiki.secondlife.com/wiki/User:Becky_Pippen/Text_Storage will provide the details you'll need to make your database so you can.
If all UUIDs were v4 or v5 9 characters would suffice but I understand not all are. I do not need to retrieve the UUIDs but the code here http://wiki.secondlife.com/wiki/User:Becky_Pippen/Text_Storage will provide the details you'll need to make your database so you can.
It might be possible to encode 7 characters using base16384+1 encoding and 2 using base32768 encoding to produce a searchable pattern that would not run the risk, or at least as high a risk, of producing false positives allowing you to use just 9 characters to store the full UUID. I would welcome your thoughts on the matter.


Here's some code to illustrate what I'm talking about:
Here's some code to illustrate what I'm talking about:
Line 45: Line 46:
string HexChars(integer n)
string HexChars(integer n)
{
{
string hexChars = "0123456789abcdef";
    string hexChars = "0123456789abcdef";
return llGetSubString(hexChars, n >> 4, n >> 4) +
    return llGetSubString(hexChars, n >> 4, n >> 4) +  
llGetSubString(hexChars, n & 0xf, n & 0xf);
llGetSubString(hexChars, n & 0xf, n & 0xf);
}
}


string Encode(integer num)
string Encode(integer num)
{
{
num += 0x1000;
    num += 0x1000;
return llUnescapeURL("%" + HexChars(0xe0 + (num >> 12)) +
    return llUnescapeURL("%" + HexChars(0xe0 + (num >> 12)) +
"%" + HexChars(0x80 + ((num >> 6) & 0x3f)) +
"%" + HexChars(0x80 + ((num >> 6) & 0x3f)) +
"%" + HexChars(0x80 + (num & 0x3f)));
"%" + HexChars(0x80 + (num & 0x3f)));
}
}


Line 61: Line 62:
string dbID(string UUID)
string dbID(string UUID)
{
{
integer data1 = (integer)("0x" + llGetSubString(UUID, 0 , 6));
    integer data1 = (integer)("0x" + llGetSubString(UUID, 0 , 6));
integer data2 = (data1 % 16384) * 2;
    integer data2 = (data1 % 0x4000) * 2;
data1 = llFloor(data1 / 16384) * 2 + 1;
    data1 = (data1 / 0x4000) * 2 + 1;
return Encode(data1) + Encode(data2);
    return Encode(data1) + Encode(data2);
}
}


// The need for this function was eliminated by using base16484+1 encoding. I
// The need for this function was eliminated by using base0x4000+1 encoding. I
// left it to show what ~llSubStringIndex(database, check) is replacing.
// left it to show what ~llSubStringIndex(database, check) is replacing.
integer inDB(string check)
integer inDB(string check)
{
{
integer known;
    integer known;
integer i = llStringLength(database) / 2;
    integer i = llStringLength(database) / 2;
while (i-- && !known) // (i--) is equivalant to (--i <= 0)
    while (i-- && !known) // (i--) is equivalant to (--i <= 0)
if (llGetSubString(database, i * 2, i * 2 + 1) == check) ++known;
if (llGetSubString(database, i * 2, i * 2 + 1) == check) ++known;
return known;
    return known;
}
}


string database;
string database;
// database is a strided string, 2 characters per stride, each character
// database is a straddle, 2 characters per stride, each character containing
// containing 14-bits of base16384+1 encoded data.
// 14-bits of base0x4000+1 encoded data.


default
default
{
{
state_entry()
    state_entry()
    {
llSay(0, "There should be enough free memory to hold " +
    (string)(llGetFreeMemory() / 4) + " identifiers.");
    }
 
    touch_start(integer total_number)
    {
key id = llDetectedKey(0);
// If I didn't need to use the UUID later I'd replace id below with
// llDetectKey(0) and eliminate the definition line for id.
string myID = dbID(id);
if(~llSubStringIndex(database, myID))
{
{
llSay(0, "There should be enough free memory to hold " +
    // Do whatever you do with, to, or for first encounter avatars.
(string)llFloor(llGetFreeMemory() / 2) + " identifiers.");
database += myID;
}
}
 
else
touch_start(integer total_number)
{
{
key id = llDetectedKey(0);
    // Do whatever you do with, to, or for "known" avatars.
// If I didn't need to use the UUID later I'd replace id below with
    ;
// llDetectKey(0) and eliminate the definition line for id.
string myID = dbID(id);
if(~llSubStringIndex(database, myID))
{
// Do whatever you do with, to, or for first encounter avatars.
database += myID;
}
else
{
// Do whatever you do with, to, or for "known" avatars.
;
}
}
}
    }
}
}
</source>
</source>
My thanks to Pavl Duke and Omei Qunhua for their kindly remarks that helped make this page better.
If anyone wants to encode the full UUID they could replace the dbID function above with this:
<source lang="lsl2">
string dbID(string UUID)
{
    integer group;
    string work;
    do
    {
integer data1 = (integer)("0x" + llGetSubString(UUID, group * 7 , group * 7 + 6));
integer data2 = (data1 % 0x4000) * 2;
data1 = (data1 / 0x4000) * 2 + !group; // Adds 1 if group = 0 (e.g. the first character only)
work += Encode(data1) + Encode(data2);
    }
    while (++group < 5);
    return work;
}
</source>
and the call with:
<source lang="lsl2">
string myID = dbID((string)llParseString2List(id, ["-"], []) + "000");
</source>
(string)llParseString2List(id, ["-"], []) return the 32 hexadecimal characters without the dashes. Adding "000" defines the 12 unused bits and makes the string 35 characters so the last group can be encoded just like the first 4.

Latest revision as of 14:43, 28 April 2016

Unique(ish) Indentifiers

My challenge was to produce a script that allowed each avatar to interact with the scripted object in a particular way only one time and after that to interact with it in a different way.

In order to do this I needed to come up with a unique identifier for each avatar that used the scripted object. There are already 2 unique identifiers in second life, the UUID and the avatar name and I considered both as sources for my unique identifier.

Avatar names can be up to 31 alphanumeric characters and legacy names can have a last name that is the same possible size. There is a database of all the last names ever used in SL available from SLNameWatch.com. I saved all 109 pages of them to my computer and concatenated them into a list of 10,889 names and did some analysis.

The longest last name in second life is only 19 characters and no name had any significant characters beyond the 13th. I further discovered there were only 3 names which contained numerals and those names only contained one numeral and the location of the numeral was not the same in any of the names. This would reduce the range of characters from 36 to 27 if one allowed for a single value to stand for any non-alphabetic character in a name.

I eventually rejected names. In order to use them I would either have to accept variable identifier lengths or encode them so all were the same length as the longest thus defeating the purpose of using them, specifically to find a more compact, less memory expensive unique identifier.

So, that left me with UUIDs. Stored in a list as a key they require 110 bytes per UUID. Storage as string reduces this by 20 bytes per string but that is still a very large record size. Even stripping out the dashes only reduces them by 8 bytes. However, those remaining 32 characters are hexadecimals and by converting them to integers I knew I could find a more compact way to store them.

My first idea was to convert them to conventional string data using something like this:

    integer count = 5;
    integer data = integerValueOfFirst7HexCharacters;
    while (data)
    {
	makeSymbol(data % 1114111);
	data = (data / 1114111);
	if (data < 1114111 && count)
	{
	    data += integerValueOf5MoreHexCharacters * 1114111;
	    --count;
	}
    }

where makeSymbol was a function to convert the number to a UTF-16 symbol. The problem with this approach was each symbol could be 2 or 4 bytes and again I would have the problem of variable record lengths necessitating the use of a list.

I thought of something I later learned was sometimes called base32768 encoding. It is a method of storing 15 bits of data as a UTF-8 character. Each UTF-8 character requires exactly 2 bytes so I could eliminate the variable record size *and* bring the storage efficiency from 83.70% to 93.75%

Using base32768 encoding allowed me to reduce a 36 character UUID to just 9 characters *and* allowed me to store them in something like a strided string what I like to call a straddle.

I discussed ways to reduce storage space with the good folks in the Scripting Academy group chat. One of them, The Scripter (Silvermane Mode), who runs one of the largest Name2Key databases in SL, informed me converting the first 5 characters of the UUID produced a hash that has not clashed (produced identical results) in his list of over 400,000 avatars and I had decided to trust his years of experience and use the same while I was still using the base1114111 encoding; however, after deciding to use the base32768 encoding I realized I could use the first 7 characters in the UUID to produce 2 UTF-8 characters decreasing the chance of hash clash by over 94.6%.

While considering ways to further compact my storage requirements I realized there were unused bits and I could use those bits to make the first character different from any of the following characters by simply shifting them all up one bit (e.g. multiply by 2) and adding 1 to only the first character. I will call this base0x4000+1 encoding.

By doing this I will be able to eliminate the need to use llGetSubString and simplify the search routine using llSubStringIndex. Adding a bitwise not (~) I can pass it as the parameter of a conditional reducing the code space needed to work with the unique identifiers in my script even further. However, to store the full UUID in a form that it is retrievable will require 10 string characters using base0x4000+1 encoding.

If all UUIDs were v4 or v5 9 characters would suffice but I understand not all are. I do not need to retrieve the UUIDs but the code here http://wiki.secondlife.com/wiki/User:Becky_Pippen/Text_Storage will provide the details you'll need to make your database so you can.

Here's some code to illustrate what I'm talking about:

string HexChars(integer n)
{
    string hexChars = "0123456789abcdef";
    return llGetSubString(hexChars, n >> 4, n >> 4) + 
	llGetSubString(hexChars, n & 0xf, n & 0xf);
}

string Encode(integer num)
{
    num += 0x1000;
    return llUnescapeURL("%" + HexChars(0xe0 + (num >> 12)) +
	"%" + HexChars(0x80 + ((num >> 6) & 0x3f)) +
	"%" + HexChars(0x80 + (num & 0x3f)));
}

// This function creates the (hopefully) unique idenfier for use in my script
string dbID(string UUID)
{
    integer data1 = (integer)("0x" + llGetSubString(UUID, 0 , 6));
    integer data2 = (data1 % 0x4000) * 2;
    data1 = (data1 / 0x4000) * 2 + 1;
    return Encode(data1) + Encode(data2);
}

// The need for this function was eliminated by using base0x4000+1 encoding. I
// left it to show what ~llSubStringIndex(database, check) is replacing.
integer inDB(string check)
{
    integer known;
    integer i = llStringLength(database) / 2;
    while (i-- && !known) // (i--) is equivalant to (--i <= 0)
	if (llGetSubString(database, i * 2, i * 2 + 1) == check) ++known;
    return known;
}

string database;
// database is a straddle, 2 characters per stride, each character containing
// 14-bits of base0x4000+1 encoded data.

default
{
    state_entry()
    {
	llSay(0, "There should be enough free memory to hold " +
	    (string)(llGetFreeMemory() / 4) + " identifiers.");
    }

    touch_start(integer total_number)
    {
	key id = llDetectedKey(0);
	// If I didn't need to use the UUID later I'd replace id below with
	// llDetectKey(0) and eliminate the definition line for id.
	string myID = dbID(id);
	if(~llSubStringIndex(database, myID))
	{
	    // Do whatever you do with, to, or for first encounter avatars.
	database += myID;
	}
	else
	{
	    // Do whatever you do with, to, or for "known" avatars.
	    ;
	}
    }
}


My thanks to Pavl Duke and Omei Qunhua for their kindly remarks that helped make this page better.


If anyone wants to encode the full UUID they could replace the dbID function above with this:

string dbID(string UUID)
{
    integer group;
    string work;
    do
    {
	integer data1 = (integer)("0x" + llGetSubString(UUID, group * 7 , group * 7 + 6));
	integer data2 = (data1 % 0x4000) * 2;
	data1 = (data1 / 0x4000) * 2 + !group; // Adds 1 if group = 0 (e.g. the first character only)
	work += Encode(data1) + Encode(data2);
    }
    while (++group < 5);
    return work;
}

and the call with:

	string myID = dbID((string)llParseString2List(id, ["-"], []) + "000");

(string)llParseString2List(id, ["-"], []) return the 32 hexadecimal characters without the dashes. Adding "000" defines the 12 unused bits and makes the string 35 characters so the last group can be encoded just like the first 4.