Difference between revisions of "LLSD"
Revision as of 09:37, 27 January 2009
The LLSD flexible data system
The following text is from the comments in the source of the file: linden\indra\common\llsd.cpp
According to Andrew Linden, LLSD stands for Linden Lab Structured Data[1].
Summary
LLSD provides a flexible data system similar to the data facilities of dynamic languages like Perl and Python. It was created to support the exchange of structured data between loosely coupled systems (those not compiled together into the same module).
Data in such exchanges must be highly tolerant of changes on either side, for example:
- Recompilation
- Implementation in a different language
- Addition of extra parameters
- Execution of older versions (with fewer parameters)
To this end, the C++ API of LLSD strives to be easy to use, and to default to "the right thing" wherever possible. It is extremely tolerant of errors and unexpected situations.
The fundamental class is LLSD. LLSD is a value holding object. It holds one value that is either undefined, one of the scalar types, or a map or an array. LLSD objects have value semantics (copying them copies the value, though it can be considered efficient, due to sharing), and are mutable.
Undefined is the singular value given to LLSD objects that are not initialized with any data.
The scalar data types are:
- Boolean - true or false
- Integer - a 32 bit signed integer
- Real - a 64 bit IEEE 754 floating point value
- UUID - a 128 bit unique value
- String - a sequence of zero or more Unicode characters
- Date - an absolute point in time, UTC, with resolution to the second
- URI - a String that is a URI
- Binary - a sequence of zero or more octets (unsigned bytes)
A map is a dictionary mapping String keys to LLSD values. The keys are unique within a map, and have only one value (though that value could be an LLSD array).
An array is a sequence of zero or more LLSD values.
Scalar Accessors
Function: Fetch a scalar value, converting if needed and possible.
Conversion among the basic types, Boolean, Integer, Real and String, is fully defined. Each type can be converted to another with a reasonable interpretation. These conversions can be used as a convenience even when you know the data is in one format, but you want it in another. Of course, many of these conversions lose information.
Note: These conversions are not the same as Perl's. In particular, when converting a String to a Boolean, only the empty string converts to false. Converting the String "0" to Boolean results in true.
Conversion to and from UUID, Date, and URI is only defined to and from String. Conversion is defined to be information preserving for valid values of those types. These conversions can be used when one needs to convert data to or from another system that cannot handle these types natively, but can handle strings.
Conversion to and from Binary isn't defined.
Conversion of the Undefined value to any scalar type results in a reasonable null or zero value for the type.
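The Boolean conversion rules above can be sketched in a few lines. This is a minimal illustration in Python rather than the C++ API itself; the function name `to_boolean` and the use of `None` as a stand-in for Undefined are assumptions made for the example.

```python
def to_boolean(value):
    """Convert a Python stand-in for an LLSD scalar to a Boolean.

    Hypothetical helper: None plays the role of Undefined here.
    """
    if value is None:
        # Undefined converts to the null/zero value of the target type.
        return False
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        # Zero is false, every other number is true.
        return value != 0
    if isinstance(value, str):
        # Unlike Perl, only the empty string is false.
        return value != ""
    raise TypeError("no Boolean conversion sketched for %r" % type(value))


print(to_boolean("0"))   # the String "0" is true
print(to_boolean(""))    # the empty String is false
```

The point worth noticing is the String branch: it tests for emptiness only and never inspects the characters.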
Automatic Cast Protection
These are not implemented on purpose. Without them, C++ can perform some conversions that are clearly not what the programmer intended.
If you get a linker error about these being missing, you have made a mistake in your code. DO NOT IMPLEMENT THESE FUNCTIONS as a fix.
All of these problems stem from trying to support char* in LLSD or in std::string. There are too many automatic casts that will convert an arbitrary pointer or scalar type to std::string.
Attributes and Data
Attributes are only used for encoding parser and formatting instructions. The data in the elements is always data.
Root Element
The root element is llsd. The root must have only one child element which can be any container or atomic type.
Atomic Types
Each atomic type represents one value with type information. An atomic does not have a name, but may have attributes to specify format or processing considerations for the parser. Consumers of atomics are encouraged to massage the data into the preferred native representation, but further serialization should honor the original type information if possible.
undefined
The undefined type is a placeholder to indicate something is there, but it has no value, and cannot be converted to any other atomic type. Though limited in this way, an undefined is still considered a first-class atomic, and is expected to behave like any other atomic structured data type at runtime.
Serialization example
<undef />
boolean
A true or false value.
Conversion
type | rules |
---|---|
boolean | unity |
integer | true => 1, false => 0 |
real | true => 1.0, false => 0.0 |
uuid | n/a |
string | 'true', 'false' |
binary | one byte us-ascii where true => 1, false => 0 |
date | n/a |
uri | n/a |
Serialization examples
<!-- true -->
<boolean>1</boolean>
<boolean>true</boolean>
<!-- false -->
<boolean>0</boolean>
<boolean>false</boolean>
<boolean />
integer
A signed integer value with a representation of 32 bits.
Conversion
type | rules |
---|---|
boolean | 0 => false, all other values => true |
integer | unity |
real | closest representable number |
uuid | n/a |
string | human readable string |
binary | 4 byte network byte order representation |
date | seconds since epoch |
uri | n/a |
Serialization examples
<integer>289343</integer>
<integer>-3</integer>
<integer /> <!-- zero -->
real
A 64 bit double as defined by IEEE.
Conversion
type | rules |
---|---|
boolean | exactly 0 => false, all other values => true |
integer | rounded to closest representable number |
real | unity |
uuid | n/a |
string | human readable string |
binary | 8 byte network byte order representation |
date | seconds since epoch |
uri | n/a |
Serialization examples
<real>-0.28334</real>
<real>2983287453.3848387</real>
<real /> <!-- exactly zero -->
uuid
A 128 bit unsigned integer.
Conversion
type | rules |
---|---|
boolean | null uuid => false, all other values => true |
integer | n/a |
real | n/a |
uuid | unity |
string | standard 8-4-4-4-12 serialization format |
binary | 16 byte raw representation |
date | n/a |
uri | n/a |
Serialization examples
<uuid>d7f4aeca-88f1-42a1-b385-b9db18abb255</uuid>
<uuid /> <!-- null uuid '00000000-0000-0000-0000-000000000000' -->
string
A simple string of any character data which is intended to be human comprehensible.
Strings in the system that hold text a user might see or enter (chat, IM, notecards, AV names, region names,... basically almost everything!) should move to using a consistent set of acceptable characters. This set is:
- Unicode code points U+20 through U+10FFFD
- except U+D800 through U+DFFF
- except U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF ... U+10FFFE, U+10FFFF
- except U+FDD0 through U+FDEF
- U+9 (tab, '\t')
- U+A (newline or line feed, '\n')
- U+D (carriage return, '\r')
Strings may be sequences of zero or more of these characters. Strings *may* be normalized by mapping line ending sequences to U+A. Do not rely on differences in strings that normalize to the same string.
These valid strings are drawn from Unicode 4.0, which defines the following valid code points:
- Unicode code points U+0 through U+10FFFD
- except U+D800 through U+DFFF (the UTF-16 surrogate pair range)
- except U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF ... U+10FFFE, U+10FFFF
- except U+FDD0 through U+FDEF (some historical screw up with Arabic)
The choice for special characters < U+20 is because XML defines acceptable text as all valid Unicode code points >= U+20, and U+9, U+A and U+D. The normalization is because XML defines that all line ending sequences are normalized to U+A.
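The acceptable character set described above can be checked mechanically. The following is a minimal sketch in Python; the function names are hypothetical, and it reads the "..." in the exclusion list as covering the last two code points of every plane (U+nFFFE and U+nFFFF).

```python
def is_acceptable_code_point(cp):
    """Return True if the code point is in the acceptable set (sketch)."""
    # Tab, line feed and carriage return are the only characters
    # below U+20 that are allowed.
    if cp in (0x9, 0xA, 0xD):
        return True
    if cp < 0x20 or cp > 0x10FFFD:
        return False
    # UTF-16 surrogate range.
    if 0xD800 <= cp <= 0xDFFF:
        return False
    # U+FDD0 through U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return False
    # Last two code points of each plane: U+FFFE, U+FFFF, U+1FFFE, ...
    if (cp & 0xFFFF) in (0xFFFE, 0xFFFF):
        return False
    return True


def is_acceptable_string(text):
    """A string is acceptable if every character in it is (sketch)."""
    return all(is_acceptable_code_point(ord(ch)) for ch in text)
```

An empty string passes, which matches the statement that strings may be sequences of zero or more of these characters.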
See: Unicode In 5 Minutes for a brief introduction to Unicode.
Conversion
type | rules |
---|---|
boolean | empty => false, all other values => true |
integer | A simple conversion of the initial characters to an integer |
real | A simple conversion of the initial characters to a real number |
uuid | A valid 8-4-4-4-12 is converted to a uuid, all other values => null uuid |
string | unity |
binary | raw representation of the characters |
date | An interpretation of the string as a date |
uri | An interpretation of the string as a link |
Serialization examples
<string>The quick brown fox jumped over the lazy dog.</string>
<string>540943c1-7142-4fdd-996f-fc90ed5dd3fa</string>
<string /> <!-- empty string -->
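Two of the string conversions in the table can be sketched as follows. This is an illustration in Python, not the actual implementation: the helper names are hypothetical, "a simple conversion of the initial characters" is assumed to behave like C's atoi (optional sign, leading digits, 0 when there are none), and 32 bit overflow is not handled.

```python
import re
import uuid

NULL_UUID = uuid.UUID(int=0)

# Optional leading whitespace, optional sign, then digits (atoi-like; assumed).
_INT_PREFIX = re.compile(r"\s*([+-]?\d+)")
# The standard 8-4-4-4-12 hexadecimal form.
_UUID_FORM = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def string_to_integer(text):
    """Convert the initial characters of a string to an integer (sketch)."""
    match = _INT_PREFIX.match(text)
    return int(match.group(1)) if match else 0


def string_to_uuid(text):
    """A valid 8-4-4-4-12 string becomes a uuid, anything else the null uuid."""
    if _UUID_FORM.fullmatch(text):
        return uuid.UUID(text)
    return NULL_UUID
```

With these helpers the second serialization example above converts to a real uuid, while the first converts to the null uuid.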
binary data
A chunk of binary data. The serialization format is allowed to specify an encoding. Parsers must support base64 encoding. Parsers may support base16 and base85.
Conversion
type | rules |
---|---|
boolean | empty => false, all other values => true |
integer | len < 4 => 0, otherwise first four bytes are interpreted as a network byte order integer |
real | len < 8 => 0, otherwise first eight bytes are interpreted as a network byte order double |
uuid | len < 16 => null uuid, otherwise first sixteen bytes are interpreted as the raw binary uuid |
string | the raw binary data interpreted as utf-8 character data |
binary | unity |
date | n/a |
uri | the raw binary data interpreted as a utf-8 serialized link |
Serialization examples
<binary encoding="base64">cmFuZG9t</binary> <!-- base 64 encoded binary data -->
<binary>dGhlIHF1aWNrIGJyb3duIGZveA==</binary> <!-- base 64 encoded binary data is default -->
<binary /> <!-- empty binary blob -->
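As a rough illustration of the binary rules, the sketch below decodes the text of a binary element and applies the binary-to-integer conversion from the table. It is written in Python with the standard `base64` and `struct` modules; the function names are hypothetical, and base85 is left out because the text does not say which base85 variant is meant.

```python
import base64
import struct


def decode_binary(text, encoding="base64"):
    """Decode the character data of a binary element (sketch)."""
    if encoding == "base64":
        return base64.b64decode(text)
    if encoding == "base16":
        # casefold=True accepts lowercase hex digits as well.
        return base64.b16decode(text, casefold=True)
    raise ValueError("encoding not covered by this sketch: %s" % encoding)


def binary_to_integer(data):
    """len < 4 => 0, otherwise the first four bytes as a big-endian int."""
    if len(data) < 4:
        return 0
    # "!i" is a signed 32 bit integer in network byte order.
    return struct.unpack("!i", data[:4])[0]


print(decode_binary("cmFuZG9t"))                      # b'random'
print(decode_binary("dGhlIHF1aWNrIGJyb3duIGZveA=="))  # b'the quick brown fox'
```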
date
A specific point in time. Intervals or relative dates are not supported. The serializer and parser only understand ISO-8601 numeric encoding in UTC. The time may be omitted, in which case it is interpreted as midnight at the start of the day.
Conversion
type | rules |
---|---|
boolean | n/a |
integer | seconds since epoch |
real | seconds since epoch |
uuid | n/a |
string | standard serialization format |
binary | n/a |
date | unity |
uri | n/a |
Serialization examples
<date>2006-02-01T14:29:53.43Z</date>
<date /> <!-- epoch -->
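A date parser following these rules might look like the sketch below. It is written in Python with the standard `datetime` module; the function names are hypothetical, and "epoch" is assumed to mean the Unix epoch (1970-01-01T00:00:00Z), which the text does not state explicitly.

```python
from datetime import datetime, timezone

# Assumption: "epoch" is the Unix epoch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

_FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%fZ",  # with fractional seconds
    "%Y-%m-%dT%H:%M:%SZ",     # whole seconds
    "%Y-%m-%d",               # time omitted: midnight at the start of the day
)


def parse_llsd_date(text):
    """Parse the character data of a date element into a UTC datetime."""
    text = text.strip()
    if not text:
        return EPOCH  # an empty date element means the epoch
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError("not an ISO-8601 UTC date: %r" % text)


def date_to_real(moment):
    """Date => real conversion: seconds since the epoch."""
    return (moment - EPOCH).total_seconds()
```

Trying the most specific format first keeps the date-only form as a fallback, so a missing time is filled in as midnight rather than rejected.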
uri
A link to an external resource. The data is expected to conform to RFC 2396 for interpretation, meaning, serialization, and deserialization.
Conversion
type | rules |
---|---|
boolean | n/a |
integer | n/a |
real | n/a |
uuid | n/a |
string | standard serialization format |
binary | n/a |
date | n/a |
uri | unity |
Serialization examples
<uri>http://sim956.agni.lindenlab.com:12035/runtime/agents</uri>
<uri /> <!-- an empty link -->
Containers
Containers are special data types which can contain any other data type, including other containers.
map
A map of key and value pairs where key ordering is unspecified and keys are unique. The key is always interpreted as a character string and any character string is acceptable. If there are any elements in the map, each is serialized as a key followed by an atomic or container value. For every key, there must be exactly one value. Well-formed and valid serialized maps may nevertheless contain non-unique keys. When such a map is deserialized, the implementation should choose one of the value objects, but which one is not specified.
Serialization example
<map>
  <key>foo</key>
  <string>bar</string>
  <key>agent info</key>
  <map>
    <key>agent_id</key>
    <uuid>93c73b16-cd86-434d-8b4a-76e12eee950a</uuid>
    <key>name</key>
    <string>testtest tester</string>
  </map>
</map>
array
An ordered collection of data members. Any member can be any atomic or container type.
Serialization example
<array>
  <real>7343.0194</real>
  <array>
    <map>
      <key>offset</key>
      <integer>9847</integer>
    </map>
    <string>da boom</string>
  </array>
</array>
XML Serialization
MIME type: application/llsd+xml
When possible, prefer using us-ascii or UTF-8 XML encoding.
XML is the "standard" serialization format, being future-proof and readable by a wide variety of tools. The XML serialization should be preferred unless profiling reveals that the binary serialization provides an essential performance benefit. All the serialization examples in the above sections are of the XML serialization.
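To show how little is needed to consume the XML form, here is a minimal parsing sketch in Python using the standard `xml.etree.ElementTree` module. It is not the Linden parser: the function names are hypothetical, dates and URIs are left as plain strings, only base64 binary is handled, and for repeated map keys it simply keeps the last value (the choice is unspecified).

```python
import base64
import uuid
import xml.etree.ElementTree as ET


def parse_element(node):
    """Turn one LLSD XML element into a Python value (sketch)."""
    tag = node.tag
    text = (node.text or "")
    if tag == "undef":
        return None
    if tag == "boolean":
        return text.strip() in ("1", "true")
    if tag == "integer":
        return int(text) if text.strip() else 0
    if tag == "real":
        return float(text) if text.strip() else 0.0
    if tag == "uuid":
        return uuid.UUID(text.strip()) if text.strip() else uuid.UUID(int=0)
    if tag == "string":
        return text
    if tag in ("date", "uri"):
        return text.strip()  # left as strings in this sketch
    if tag == "binary":
        if node.get("encoding", "base64") != "base64":
            raise ValueError("only base64 is handled in this sketch")
        return base64.b64decode(text)
    if tag == "array":
        return [parse_element(child) for child in node]
    if tag == "map":
        children = list(node)
        result = {}
        # Children alternate: key, value, key, value, ...
        for key, value in zip(children[0::2], children[1::2]):
            if key.tag != "key":
                raise ValueError("expected <key>, found <%s>" % key.tag)
            result[key.text or ""] = parse_element(value)
        return result
    raise ValueError("unknown element <%s>" % tag)


def parse_llsd_xml(document):
    """Parse a whole document: an llsd root with exactly one child."""
    root = ET.fromstring(document)
    if root.tag != "llsd" or len(root) != 1:
        raise ValueError("root must be <llsd> with exactly one child")
    return parse_element(root[0])
```

Empty elements fall through to the zero value of their type, which mirrors the empty-element examples in the sections above.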
DTD
<!ELEMENT llsd (DATA)>
<!ELEMENT DATA (ATOMIC|map|array)>
<!ELEMENT ATOMIC (undef|boolean|integer|real|uuid|string|date|uri|binary)>
<!ELEMENT KEYDATA (key,DATA)>
<!ELEMENT key (#PCDATA)>
<!ELEMENT map (KEYDATA*)>
<!ELEMENT array (DATA*)>
<!ELEMENT undef (EMPTY)>
<!ELEMENT boolean (#PCDATA)>
<!ELEMENT integer (#PCDATA)>
<!ELEMENT real (#PCDATA)>
<!ELEMENT uuid (#PCDATA)>
<!ELEMENT string (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT uri (#PCDATA)>
<!ELEMENT binary (#PCDATA)>
<!ATTLIST string xml:space (default|preserve) 'preserve'>
<!ATTLIST binary encoding (base64|base16|base85) 'base64'>
]>
Example XML Output
This is a sample from a recently running sim (indention for readability):
<?xml version="1.0" encoding="UTF-8"?>
<llsd>
  <map>
    <key>region_id</key>
    <uuid>67153d5b-3659-afb4-8510-adda2c034649</uuid>
    <key>scale</key>
    <string>one minute</string>
    <key>simulator statistics</key>
    <map>
      <key>time dilation</key><real>0.9878624</real>
      <key>sim fps</key><real>44.38898</real>
      <key>pysics fps</key><real>44.38906</real>
      <key>agent updates per second</key><real>nan</real>
      <key>lsl instructions per second</key><real>0</real>
      <key>total task count</key><real>4</real>
      <key>active task count</key><real>0</real>
      <key>active script count</key><real>4</real>
      <key>main agent count</key><real>0</real>
      <key>child agent count</key><real>0</real>
      <key>inbound packets per second</key><real>1.228283</real>
      <key>outbound packets per second</key><real>1.277508</real>
      <key>pending downloads</key><real>0</real>
      <key>pending uploads</key><real>0.0001096525</real>
      <key>frame ms</key><real>0.7757886</real>
      <key>net ms</key><real>0.3152919</real>
      <key>sim other ms</key><real>0.1826937</real>
      <key>sim physics ms</key><real>0.04323055</real>
      <key>agent ms</key><real>0.01599029</real>
      <key>image ms</key><real>0.01865955</real>
      <key>script ms</key><real>0.1338836</real>
    </map>
  </map>
</llsd>
Binary Serialization
MIME type: application/llsd+binary
We also have support for binary serialization and deserialization in C++ and Python. The binary format is useful where parse time must be kept to a minimum. Binary LLSD is the binary LLSD prefix followed by a single LLSD element of any type.
<?llsd/binary?>\n
type | serialization | notes |
---|---|---|
undef | '!' | |
true | '1' | |
false | '0' | |
integer | 'i' + htonl(value) | |
real | 'r' + htond(value) | |
uuid | 'u' + uuid | uuid is 16 bytes |
binary | 'b' + htonl(binary.size()) + binary | |
string | 's' + htonl(string.size()) + string | notation serialization is considered valid |
uri | 'l' + htonl(uri.size()) + uri | |
date | 'd' + htond(seconds_since_epoch) | |
array | '[' + htonl(array.length()) + (child0, child1, ...) + ']' | order is always preserved |
map | '{' + htonl(map.length()) + ((key0,value0), (key1, value1), ...)+ '}' | order is not always preserved. |
size() is a byte count.
length() is a child count.
htonl() is a function to generate a 4 byte network byte order integer.
htond() is a function to generate an 8 byte network byte order double. htond is not a standard system call, but you can find a C implementation in indra/llcommon/llsdserialize.cpp.
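The table above translates fairly directly into code. The sketch below is an illustration in Python using the standard `struct` module, where the `!` format prefix selects network byte order and so plays the role of htonl() and htond(). The function name is hypothetical, strings are assumed to be UTF-8 encoded before their size is taken, and maps, dates and URIs are left out: maps because the table does not spell out how keys are encoded.

```python
import struct
import uuid

BINARY_PREFIX = b"<?llsd/binary?>\n"


def format_binary(value):
    """Serialize a Python stand-in for an LLSD value as binary (sketch)."""
    if value is None:
        return b"!"
    # bool must be tested before int, since bool is a subclass of int.
    if value is True:
        return b"1"
    if value is False:
        return b"0"
    if isinstance(value, int):
        return b"i" + struct.pack("!i", value)   # 4 byte signed, big-endian
    if isinstance(value, float):
        return b"r" + struct.pack("!d", value)   # 8 byte double, big-endian
    if isinstance(value, uuid.UUID):
        return b"u" + value.bytes                # 16 raw bytes
    if isinstance(value, bytes):
        return b"b" + struct.pack("!I", len(value)) + value
    if isinstance(value, str):
        raw = value.encode("utf-8")              # size() is a byte count
        return b"s" + struct.pack("!I", len(raw)) + raw
    if isinstance(value, list):
        body = b"".join(format_binary(child) for child in value)
        return b"[" + struct.pack("!I", len(value)) + body + b"]"
    raise TypeError("type not covered by this sketch: %r" % type(value))


def format_binary_document(value):
    """Prefix plus a single element."""
    return BINARY_PREFIX + format_binary(value)
```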
Notation Serialization
We also have support for a serialization format meant for human readability. Parsing and formatting are currently only available in C++. Notation LLSD is the notation LLSD prefix followed by a single LLSD element of any type.
<?llsd/notation?>\n
type | serialization | notes |
---|---|---|
undef | '!' | |
true | '1' | 't' | 'T' | 'true' | 'TRUE' | |
false | '0' | 'f' | 'F' | 'false' | 'FALSE' | |
integer | 'i' str(value) | |
real | 'r' str(value) | |
uuid | 'u' str(uuid) | |
binary | 'b(' str(size) ')"' raw_data '"' | 'b' base '"' encoded_data '"' | Base 16 and 64 encodings are supported. |
string | " escaped_string " | ' escaped_string ' | 's(' str(size) ')"' raw_string '"' | When using single quotes, double quotes do not need escaping and vice versa. |
uri | 'l"' escaped_uri '"' | See RFC 1738 for encoding rules. |
date | 'd"' YYYY-MM-DD 'T' HH:MM:SS [.FF] 'Z"' | Fractional seconds are optional |
array | '[' object0 ',' object1 ',' ... ']' | order is always preserved |
map | '{' string0:object0 ',' string1:object1 ',' ... '}' | order is not always preserved. The string is any supported string serialization format |
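A formatter for a subset of the notation grammar can be sketched as below. This is a Python illustration with a hypothetical function name, not the C++ formatter. To stay clear of escaping rules it always writes strings in the sized form s(size)"raw", and it assumes that size is the UTF-8 byte count; binary, date and uri values are omitted.

```python
import uuid


def format_notation(value):
    """Serialize a Python stand-in for an LLSD value as notation (sketch)."""
    if value is None:
        return "!"
    # bool must be tested before int, since bool is a subclass of int.
    if value is True:
        return "true"
    if value is False:
        return "false"
    if isinstance(value, int):
        return "i%d" % value
    if isinstance(value, float):
        return "r" + repr(value)
    if isinstance(value, uuid.UUID):
        return "u" + str(value)
    if isinstance(value, str):
        # Sized form: no escaping needed. Size assumed to be a byte count.
        size = len(value.encode("utf-8"))
        return 's(%d)"%s"' % (size, value)
    if isinstance(value, list):
        return "[" + ",".join(format_notation(v) for v in value) + "]"
    if isinstance(value, dict):
        pairs = (format_notation(k) + ":" + format_notation(v)
                 for k, v in value.items())
        return "{" + ",".join(pairs) + "}"
    raise TypeError("type not covered by this sketch: %r" % type(value))


print(format_notation({"version": 1}))  # {s(7)"version":i1}
```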
String Escaping
Strings delimited with single or double quotes require escaping when they contain non-printable characters. If a single-quote-delimited string contains single quotes, those must be escaped. If a double-quote-delimited string contains double quotes, those must be escaped.
To escape the delimiter character, prefix a backslash. Backslashes must always be escaped with another backslash.
"And then he said, \"I have nothing more to say on the subject.\""
'Look in "C:\\linden\\"'
The most generic escaping is to specify a hex value of the byte after a literal backslash and character 'x'. This can be used for any character and is required for all non-printable characters which do not have an abbreviation. For example:
\x0C
Serialized strings should only contain UTF-8 characters, so non-printable characters other than tab, newline, and carriage return should be avoided. However, common non-printable characters have short-hand abbreviations.
character | value | serialization |
---|---|---|
alert/bell | 0x7 | \a |
backspace | 0x8 | \b |
form feed | 0xc | \f |
newline | 0xa | \n |
carriage return | 0xd | \r |
horizontal tab | 0x9 | \t |
vertical tab | 0xb | \v |
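The escaping rules and the abbreviation table can be combined into one small routine. The sketch below is a Python illustration with a hypothetical function name; it assumes that the generic hex escape always uses two uppercase hex digits, as in the \x0C example, and it treats everything below 0x20 plus 0x7F as non-printable.

```python
# Short-hand abbreviations from the table above.
_SHORTHAND = {
    0x07: "\\a", 0x08: "\\b", 0x0C: "\\f", 0x0A: "\\n",
    0x0D: "\\r", 0x09: "\\t", 0x0B: "\\v",
}


def escape_notation_string(text, delimiter="'"):
    """Return text as a quoted, escaped notation string (sketch)."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch == "\\":
            out.append("\\\\")             # backslashes are always escaped
        elif ch == delimiter:
            out.append("\\" + delimiter)   # only the active delimiter needs it
        elif code in _SHORTHAND:
            out.append(_SHORTHAND[code])
        elif code < 0x20 or code == 0x7F:
            out.append("\\x%02X" % code)   # generic hex escape
        else:
            out.append(ch)
    return delimiter + "".join(out) + delimiter


print(escape_notation_string("Where's the beef?", '"'))  # "Where's the beef?"
print(escape_notation_string("Where's the beef?", "'"))  # 'Where\'s the beef?'
```

Choosing the delimiter that does not occur in the text avoids most escapes, which is why the usage lines show the same string with both delimiters.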
Example Notation Output
This is an excerpt from an agent request to enter a region serialized as notation:
[
  {'destination':'http://secondlife.com'},
  {'version':i1},
  {
    'agent_id':u3c115e51-04f4-523c-9fa6-98aff1034730,
    'session_id':u2c585cec-038c-40b0-b42e-a25ebab4d132,
    'circuit_code':i1075,
    'first_name':'Phoenix',
    'last_name':'Linden',
    'position':[r70.9247,r254.378,r38.7304],
    'look_at':[r-0.043753,r-0.999042,r0],
    'granters':[ua2e76fcd-9360-4f6d-a924-000000000003],
    'attachment_data':
    [
      {
        'attachment_point':i2,
        'item_id':ud6852c11-a74e-309a-0462-50533f1ef9b3,
        'asset_id':uc69b29b1-8944-58ae-a7c5-2ca7b23e22fb
      },
      {
        'attachment_point':i10,
        'item_id':uff852c22-a74e-309a-0462-50533f1ef900,
        'asset_id':u5868dd20-c25a-47bd-8b4c-dedc99ef9479
      }
    ]
  }
]
[ { 'creation-date':d"2007-03-15T18:30:18Z", 'creator-id':u3c115e51-04f4-523c-9fa6-98aff1034730 }, s(10)"0123456789", "Where's the beef?", 'Over here.', b(160)"default { state_entry() { llSay(0, "Hello, Avatar!"); } touch_start(integer total_number) { llSay(0, "Touched."); } }", b64"AABAAAAAAAAAAAIAAAA//wAAP/8AAADgAAAA5wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AABkAAAAZAAAAAAAAAAAAAAAZAAAAAAAAAABAAAAAAAAAAAAAAAAAAAABQAAAAEAAAAQAAAAAAAA AAUAAAAFAAAAABAAAAAAAAAAPgAAAAQAAAAFAGNbXgAAAABgSGVsbG8sIEF2YXRhciEAZgAAAABc XgAAAAhwEQjRABeVAAAABQBjW14AAAAAYFRvdWNoZWQuAGYAAAAAXF4AAAAIcBEI0QAXAZUAAEAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" ]
Guidelines
Questions & Things To Do
Would Binary be more convenient with unsigned char* buffer semantics?
Should Binary be convertible to/from String, and if so how?
- as UTF8 encoded strings (making it unlike UUID<->String)
- as Base64 or Base96 encoded (making it like UUID<->String)
Conversions to std::string and LLUUID do not result in easy assignment to std::string, LLString or LLUUID due to non-unique conversion paths.