User:Kadah Coba/Unicode

Unicode, UTF-16, and UTF-8

AKA: when 1 character isn't 2 bytes, characters unsupported in LSL, and some other random notes.

LSL Strings and UTF-16

LSL strings are UTF-16. Characters U+0000 through U+D7FF and U+E000 through U+FFFF use 2 bytes per character, while U+10000 through U+10FFFF use 4 bytes, so characters from the supplementary planes use twice as much memory as those below them. Unless you are using a lot of characters from the supplementary planes, or doing custom data encoding, you are unlikely to encounter enough of these to cause a significant memory usage difference.
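
A rough way to see this in a script is to estimate the UTF-16 size of a string's character data from its code points. This is a minimal sketch, assuming llOrd() and llStringLength() are available and operate on whole characters; Mono's per-string overhead is not counted.

// Estimate the UTF-16 size of a string's character data:
// 2 bytes per code point below U+10000, 4 bytes per supplementary-plane character.
integer Utf16Bytes(string s)
{
    integer bytes = 0;
    integer len = llStringLength(s);
    integer i;
    for (i = 0; i < len; ++i)
    {
        if (llOrd(s, i) >= 0x10000) bytes += 4;
        else bytes += 2;
    }
    return bytes;
}

default
{
    state_entry()
    {
        llOwnerSay((string)Utf16Bytes("abc"));            // 6 bytes
        llOwnerSay((string)Utf16Bytes(llChar(0x1F600)));  // 4 bytes (emoji; llChar() assumed available)
    }
}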

From my testing, a string variable will work with (not going to use the word "support" out of caution) any character in those ranges except for the following: U+0000 (null, blocked for use in LSL), U+D800 through U+DFFF (the surrogate range), and U+FFFE and U+FFFF ("noncharacters"[1]). I do not know why U+FFFE and U+FFFF do not work while the other defined noncharacters (U+FDD0 through U+FDEF, and U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, ... U+10FFFE, U+10FFFF) do. Perhaps those two are used internally by the VM? I would avoid using the noncharacters. Similarly, you might also want to avoid the "private-use characters"[2] ranges in case these gain some internal use within the VM in the future and stop working. Some clarification from LL on this at some point would be handy.
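
Below is a minimal sketch of the kind of round-trip test described above, assuming llChar() and llOrd() are available. What llChar() returns for a blocked code point (an empty string, a replacement character, or something else) is not guaranteed, so the helper only checks whether the code point survives the round trip.

// Returns TRUE if the code point can be stored in a string and read back unchanged.
integer CodePointWorks(integer codepoint)
{
    string s = llChar(codepoint);
    if (llStringLength(s) == 0) return FALSE;   // nothing was stored
    return (llOrd(s, 0) == codepoint);          // stored and read back unchanged
}

default
{
    state_entry()
    {
        llOwnerSay("U+0041: " + (string)CodePointWorks(0x41));      // expected TRUE
        llOwnerSay("U+FFFE: " + (string)CodePointWorks(0xFFFE));    // expected FALSE per the notes above
        llOwnerSay("U+1F600: " + (string)CodePointWorks(0x1F600));  // supplementary plane, expected TRUE
    }
}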

UTF-8

Even though the string datatype is UTF-16, most (all?) data fields in SL that are accessible to LSL are UTF-8. With the exception of link_message (which has no limit other than script memory), every function I have tested so far uses UTF-8 and is subject to the differences between UTF-8 and UTF-16, which can cause some headaches if you are not aware that different encodings are involved.

Many LSL functions that take strings have some stated input limit, and those limits are probably in bytes, not characters. There may be a few wiki pages that just state a "character" limit, but it's possible that limit is actually in bytes. Some pages, like llSay, do a better job of highlighting the implications of this than others.

1 character equals 1 byte is only the case in UTF-8 when dealing with ASCII (U+0000 to U+007F). Characters U+0080 to U+07FF are 2 bytes, U+0800 to U+FFFF are 3 bytes, and U+10000 to U+10FFFF are 4 bytes. Using non-ASCII characters will, at best, halve the number of characters that fit. For example, llSetText has a stated limit of 254 bytes, which is 254 characters with 1-byte chars, but only 127 with all 2-byte chars, 84 with 3-byte, and just 63 with 4-byte.
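
One way to check a string against a byte limit is to count its UTF-8 bytes via llStringToBase64(), which encodes the string's UTF-8 bytes. This is a sketch rather than an official helper; the 254-byte figure is llSetText's stated limit from above.

// Count the UTF-8 bytes of a string from the length of its Base64 encoding
// (every 4 Base64 characters encode 3 bytes, minus any "=" padding).
integer Utf8Bytes(string s)
{
    string b64 = llStringToBase64(s);
    integer bytes = (llStringLength(b64) / 4) * 3;
    if (llGetSubString(b64, -2, -1) == "==") return bytes - 2;
    if (llGetSubString(b64, -1, -1) == "=") return bytes - 1;
    return bytes;
}

default
{
    state_entry()
    {
        string text = "12 €";  // 4 characters but 6 UTF-8 bytes (the euro sign is 3 bytes)
        llOwnerSay((string)llStringLength(text) + " characters, "
            + (string)Utf8Bytes(text) + " bytes as UTF-8");
        if (Utf8Bytes(text) <= 254) llSetText(text, <1.0, 1.0, 1.0>, 1.0);
    }
}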

What will or won't work

I tested many LSL functions that take strings for which characters they will or won't accept. Almost everything will take any printable character that isn't otherwise stated to be disallowed (for example, "|" cannot be used in prim descriptions). Most non-printable characters also worked, such as control and private-use characters.

Some functions had additional, sometimes undocumented, characters that would not work. For example, PRIM_MEDIA_WHITELIST could not use these in all cases: U+0009 (tab), U+000A (line feed), U+000B (vertical tab), U+000C (form feed), U+000D (carriage return), U+0020 (space), U+002C (comma).

Link_messages will support anything that can be stored in a string variable.
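
Here is a minimal sketch of passing such characters through a link message; llChar() is assumed available, and the control character and emoji are just example payload.

default
{
    touch_start(integer total_number)
    {
        // Payload containing a control character and a supplementary-plane character.
        llMessageLinked(LINK_SET, 0, llChar(0x07) + llChar(0x1F600) + " test", NULL_KEY);
    }

    link_message(integer sender, integer num, string msg, key id)
    {
        llOwnerSay("got: " + msg);  // arrives unchanged; no UTF-8 re-encoding or byte limit beyond script memory
    }
}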

Experience Tools

The experience key-value store is also UTF-8. Everything above that applies to UTF-8 applies here.


Encoding / packing data

If you are going to make a data encoding scheme that will be stored as UTF-8 and data size is a factor, it is more efficient to stick with ASCII, or just Base64, than to use the higher code points. 2-byte UTF-8 characters only have 11 potentially usable bits, while 2 characters of Base64 have 12 usable bits.
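
As a concrete comparison using the sizes above: 6 bytes spent on Base64 text is 6 characters at 6 usable bits each, or 36 bits, while the same 6 bytes spent on 2-byte UTF-8 characters is only 3 characters at 11 usable bits each, or at most 33 bits, and the latter also needs hand-written packing code.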

It is possible to do a higher base encoding within the usable ASCII character space, but there would be substantial code overhead versus using the built-in Base64 functions.
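
To illustrate that overhead, here is a rough sketch of packing a non-negative integer into base 94 using the printable ASCII range U+0021 ("!") through U+007E ("~"). The function names and the choice of base 94 are just for illustration, llChar() and llOrd() are assumed available, and built-in functions like llIntegerToBase64() avoid all of this.

// Pack a non-negative integer into base-94 printable ASCII ("!" = digit 0).
string IntToBase94(integer n)
{
    if (n == 0) return "!";
    string out = "";
    while (n > 0)
    {
        out = llChar(0x21 + (n % 94)) + out;
        n = n / 94;
    }
    return out;
}

// Unpack a base-94 string produced by IntToBase94().
integer Base94ToInt(string s)
{
    integer n = 0;
    integer len = llStringLength(s);
    integer i;
    for (i = 0; i < len; ++i)
    {
        n = n * 94 + (llOrd(s, i) - 0x21);
    }
    return n;
}

default
{
    state_entry()
    {
        string packed = IntToBase94(1234567);
        llOwnerSay(packed + " -> " + (string)Base94ToInt(packed));
    }
}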

Kadah Coba (talk) 23:51, 27 February 2022 (PST)