User:Zero Linden/Office Hours/2007 Apr 24

From Second Life Wiki
Revision as of 11:58, 1 May 2007 by Zero Linden (talk | contribs)

Transcript of Zero Linden's office hours:

[13:09] Rex Cronon: hello zero
[13:09] Jarod Godel: Yaaaaay.
[13:09] Zero Linden: sorry all - my lunch meeting ran a bit late
[13:09] Zha Ewry: Heh. I am so betting the IRC channel is noisy
[13:09] Kitto Flora has quick Q for Zero as he has to depart pretty soon: Does Zero know anything about email to object UUIDs in mainland sims the last two weekends? Is incoming object email becoming overloaded?
[13:10] Zero Linden: and - ack - I have a 2pm right on the heels of this
[13:10] Zero Linden: I don't know anything about that, alas
[13:10] Kitto Flora: OK, TY
[13:10] Zero Linden: I did see some talk on the #ops channel about e-mail system unhappiness
[13:10] Zero Linden: but no specifics
[13:11] Zero Linden: Yes - so here is an object lesson in conversion from one style of programming: app + db, to another: web services
[13:11] Zha Ewry: TP is flaky. Took 25 minutes to load my inv. (and it's small)
[13:11] Zero Linden: The beauty of HTTP is that its stateless nature, and total lack of guarantees
[13:11] Tao Takashi: Hi Zero :)
[13:11] Zha Ewry: 7 minutes to pass a notecard to someone
[13:11] Zha Ewry: Ick
[13:11] Zero Linden: means that you are forced to write robust systems
[13:11] Zero Linden: Someone here needed to change a db call, specifically
[13:12] Zero Linden: the query of which database server a given agent's inventory is stored on
[13:12] Zero Linden: into a web service
[13:12] Wyn Galbraith returns with ice coffee and toast.
[13:12] Zero Linden: since that query was heating up the central db (all hail.....)
[13:12] Zero Linden likes ice coffee!
[13:13] Zero Linden: Well, alas, the code that did the DB query was very very tied up in a sequential way of doing it: it blocked on the db access
[13:13] Zha Ewry: At this point. I am ready to sacrifice chickens to the poor central DB if it will help
[13:13] Zero Linden: but db access is sort of designed to work or quickly fail....
[13:13] Rex Cronon: maybe local storage of inventory is better
[13:13] Zero Linden: when it became a web service - the engineer opted to keep the sequential nature of the system
[13:13] Zero Linden: SO it made a blocking HTTP request
[13:13] Rex Cronon: at least for objects created by user
[13:13] Zha Ewry stares and looks appalled
[13:13] Zero Linden: usually this is fine - most of those come back in about .2ms
[13:13] Wyn Galbraith smiles.
[13:13] Khamon Fate: You do that so well Zha.
[13:14] Jarod Godel: Zero, why don't you guys cluster and load balance?
[13:14] Zero Linden: alas, HTTP is designed to allow things like: "ooops, dum de dum... I'll just take 20 seconds now...."
[13:14] Zero Linden: Jarod - MySQL doesn't just "cluster and load balance"
[13:14] Zero Linden: alas
[13:15] Jarod Godel: Zero, I know, but it does cluster.
[13:15] Jarod Godel: Live Journal had this problem a few years back.
[13:15] Zero Linden: especially in an environment with a very high write load - like we have
[13:15] Zha Ewry nods and mutters about vendors which sell industrial DB solutions..
[13:15] Jarod Godel: Millions of hits a second. They wrote their own software.
[13:15] Zero Linden: well - we know of the industrial solutions -
[13:15] Zero Linden: but eventually, at the future scale of SL, even those would fail
[13:16] Jarod Godel: I think this was one: http://www.danga.com/memcached/
[13:16] Zero Linden: so much better to attack the architectural problem head on - and make the central db issue go away for good
[13:16] Zero Linden: hence my role at LL
[13:16] Zha Ewry nods. Yep. But.. even when it goes away.. the load issues won't.
[13:16] Tao Takashi: ah, I think we now use memcached with Plone
[13:16] Zero Linden: Yes - we know of memcached....
[13:17] Zero Linden: and we've written our own solutions for our needs - we have a python based non-blocking, server framework called backbone that works very nicely
[13:17] Kitto Flora has RL catch up with him...:(
[13:17] Zero Linden: Well - if we can distribute the load without incurring central cost, then, yes, the load does go away
[13:17] Khamon Fate: Bye Kitto
[13:17] Zha Ewry: Well.. to an extent. You still have to drop it somewhere
[13:18] Zero Linden: as for why we don't just drop in some big commercial DB for now..... well, alas, despite being all SQL, they really aren't all drop in compatible, now are they?
[13:18] Zha Ewry nods. Hardly
[13:18] Zero Linden: Right - it is easily several months of several engineers to replace with a different DB vendor's product.... FIE!
[13:18] Wyn Galbraith knows they're not.
[13:19] Tao Takashi: yep, unfortunately they are not
[13:19] Wyn Galbraith thinks you just get new issues that way anyway.
[13:19] Tao Takashi: and you don't know how they might behave under load in your system
[13:19] Zero Linden: So - we get it down to the only central question at all being "who do I talk to about entity X"
[13:20] Zero Linden: and if we can make that be partitionable (on either a name hierarchy on X, like DNS, or by evenly dividing the number space of random UUIDs)
[13:20] Zero Linden: then even that query can easily be distributed
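A minimal sketch of the second option Zero mentions, dividing the UUID number space into equal slices so that the "who do I talk to about entity X" lookup can be spread over several servers. The function names and host names are hypothetical and are not taken from Linden Lab code.

```python
import uuid

def shard_for(entity_id: uuid.UUID, num_shards: int) -> int:
    """Map a UUID to one of num_shards equal slices of the 128-bit number space."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # floor(id * num_shards / 2**128): slice 0 holds the lowest ids, and so on.
    return (entity_id.int * num_shards) >> 128

# Hypothetical host names, one per slice; not real servers.
SHARD_HOSTS = [
    "lookup0.example",
    "lookup1.example",
    "lookup2.example",
    "lookup3.example",
]

def host_for(entity_id: uuid.UUID) -> str:
    """Return the (hypothetical) host responsible for this entity."""
    return SHARD_HOSTS[shard_for(entity_id, len(SHARD_HOSTS))]
```

With four slices, ids whose first hex digit is 0-3 land on slice 0, 4-7 on slice 1, and so on. Because random UUIDs are spread roughly uniformly over the leading bits, each slice should see a similar share of the load.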
[13:20] Zero Linden: THEN - we just have to get stuff to stop being single threaded where it is....
[13:20] Zha Ewry nods "DNS and such or structuring the UUID"
[13:21] Zero Linden: One of the nice things about writing in our python server framework, backbone, is that you *CAN* write what looks like single-threaded code;
[13:22] Zero Linden: "make HTTP Request A, get result, use it to make request B, get that result, generate reply"
[13:22] Zero Linden: and it all executes async w/o threads.... you just need to code knowing that
[13:22] Zero Linden: there is context switching at the network IO points
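The pattern Zero describes (request A, use its result for request B, then reply, with switching only at network I/O points) can be sketched as follows. This is an illustration only: it uses the standard library's asyncio rather than eventlet, so the switch points are marked with an explicit `await`, and `fake_http_get` is a hypothetical stand-in that sleeps instead of touching the network.

```python
import asyncio

async def fake_http_get(url, log):
    """Hypothetical stand-in for a non-blocking HTTP GET (no real network)."""
    log.append(f"start {url}")
    # The handler is suspended here, at the "network I/O" point,
    # and other handlers get to run in the meantime.
    await asyncio.sleep(0.01)
    log.append(f"done {url}")
    return f"body-of-{url}"

async def handle_request(name, log):
    """Reads top to bottom like blocking code, but never blocks the process."""
    a = await fake_http_get(f"{name}/A", log)            # request A
    b = await fake_http_get(f"{name}/B?from={a}", log)   # request B uses A's result
    return f"reply({b})"

async def main():
    log = []
    # Two handlers run concurrently in a single thread.
    replies = await asyncio.gather(
        handle_request("r1", log),
        handle_request("r2", log),
    )
    return replies, log

replies, log = asyncio.run(main())
```

The log shows the second handler starting its first request before the first handler's request has completed, even though each handler is written as straight-line code and no threads are involved.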
[13:23] Zero Linden: In C++ this is generally impossible w/o either writing for threads (hard, and in general more resource intensive)
[13:23] Jarod Godel: You're writing objects.
[13:23] Jarod Godel: Like Smalltalk.
[13:23] Zero Linden: or writing everything in a sort of fake continuation style (what we do now: things like passing completion objects that get called when the operation completes)
[13:24] Jarod Godel: Do you make a practice of using exception handling: try...catch?
[13:24] Zha Ewry: Tail of each call just passes a faux stack frame?
[13:24] Tao Takashi: what network library do you use for the python framework? the standard libs?
[13:24] Zero Linden: we don't use try catch in general ....
[13:24] Tao Takashi: and do you come to EuroPython to give a talk about that? :)
[13:25] Zero Linden: personally, I have mixed feelings about try/catch as a construct.... but my personal language theories are a topic for another day.... :-)
[13:25] Jarod Godel: I'm just trying to figure how you handle an asynchronous call when it fails.
[13:26] Zero Linden: things like: " LLHTTPClient::get(some_url, new LLMyResponderSubclass(data_it_needs_to_deal_with_response)); "
[13:26] Zero Linden: the instance of the LLHTTPClient::Responder subclass gets called with either result(...) or error(....)
[13:26] Zero Linden: it is garunteed to get called with one or the other, exactly once
[13:26] Zero Linden: *guranteed -- hate that word!
[13:27] Jarod Godel: Does error() recall the function recursively, or does it just fail?
[13:27] Wyn Galbraith always gets caught by that word too, Zero.
[13:27] Zero Linden: That is up to the subclass's error() implementation
[13:28] Jarod Godel: Ok.
[13:28] Zero Linden: the default implementation just logs
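The responder contract described here (exactly one of result() or error() is called, and the default error() just logs) can be sketched in Python as below. This is an analogue for illustration, not the C++ LLHTTPClient code: `FakeHTTPClient`, its injected `transport` function, `pump()`, the status value 503, and `InventoryHostResponder` are all hypothetical.

```python
import logging

class Responder:
    """Completion object: exactly one of result() or error() is called, once."""

    def result(self, body):
        pass

    def error(self, status, reason):
        # Default behaviour: just log the failure, no retry.
        logging.warning("request failed: %s %s", status, reason)

class FakeHTTPClient:
    """Hypothetical client; `transport` is injected in place of real network I/O."""

    def __init__(self, transport):
        self._transport = transport
        self._pending = []

    def get(self, url, responder):
        # Returns immediately; the responder is called later from pump().
        self._pending.append((url, responder))

    def pump(self):
        pending, self._pending = self._pending, []
        for url, responder in pending:
            try:
                body = self._transport(url)
            except Exception as exc:
                responder.error(503, str(exc))
            else:
                # Runs only on success, so error() is never called as well.
                responder.result(body)

class InventoryHostResponder(Responder):
    """Carries the data it needs to deal with the response."""

    def __init__(self, agent_id, calls):
        self.agent_id = agent_id
        self.calls = calls

    def result(self, body):
        self.calls.append(("result", self.agent_id, body))

    def error(self, status, reason):
        self.calls.append(("error", self.agent_id, status))

def transport(url):
    # Simulated backend: URLs containing "down" fail, everything else succeeds.
    if "down" in url:
        raise ConnectionError("timed out")
    return "inv-host-7"

calls = []
client = FakeHTTPClient(transport)
client.get("http://svc.example/agent/a1", InventoryHostResponder("a1", calls))
client.get("http://down.example/agent/a2", InventoryHostResponder("a2", calls))
client.pump()
```

Whether a failed request is retried is left entirely to the subclass's error() method, which matches Zero's answer above.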
[13:28] Jarod Godel: Your Python backbone, does it use BaseHTTPServer for serving the web services?
[13:28] Zero Linden: Usually, we don't retry - there is a fair bit of retrying going on at the lower protocol layers (TCP/IP, the initial DNS, SSL/TLS if used...)
[13:28] Jarod Godel: (Or is "backbone" just for MySQL?)
[13:29] Zero Linden: that we figure that if it didn't go through, it isn't likely to immediately
[13:29] Zero Linden: The python server, Backbone, uses a web server framework called Ultramini
[13:30] Zero Linden: it in turn is based on eventlet
[13:30] Tao Takashi: never heard of that framework
[13:30] Zero Linden: for making all the network I/O into cooperative co-routines
[13:31] Jarod Godel: Zero, do you use many stored procedures in your sql or do you let Python do the work?
[13:31] Hiro Market: aye, which version of mysql in that case :-)
[13:31] Tao Takashi: and Google also has problems finding that framework ;-)
[13:32] Zero Linden: there are no stored procedures at all
[13:32] Zero Linden: mysql 3 didn't have them!
[13:32] Jarod Godel: I think Ultramini might be an embeddable web server.
[13:32] Zero Linden: it is a python web server framework
[13:32] Zero Linden: that takes a tree of objects (either instantiated or dynamic) and turns them into a servable URL tree
[13:32] Zero Linden: it was written by Donovan Preston
[13:32] Tao Takashi: sounds like Zope ;-)
[13:33] Tao Takashi: but more lightweight I guess
[13:33] Zero Linden: very very lightweight
[13:33] Zero Linden: now, at present
[13:34] Zero Linden: most DB access is done in a server called the dataserver
[13:34] Zero Linden: that is written in C++
[13:34] Zero Linden: we use our UDP messaging system to ask the dataserver to do something to the database
[13:34] Zero Linden: and then have handlers sitting on messages that come back with the results from the dataserver
[13:34] Tao Takashi: ok, so come to EuroPython and give a talk about Python at LL ;-)
[13:34] Zero Linden: they did this because the mysql lib is blocking
[13:34] Zero Linden: and so the dataserver blocks, doing one request at a time
[13:35] Zero Linden: but the sim keeps going
[13:35] Zero Linden: so the good news is that from the simulator perspective - we are already doing things in a
[13:35] Zero Linden: nice async, fault tolerant way
[13:35] Zero Linden: the sim expects the dataserver to sometimes never respond
[13:35] Zero Linden: etc....
[13:35] Zero Linden: BUT the dataserver is now the weak link
[13:36] Zero Linden: if it falls over - or takes 20 seconds to answer a query - ALL the other queries from the simulator get queued up behind it ---
[13:36] Zero Linden: right now, we have THREE dataserver processes on every simulator host
[13:36] Tao Takashi: Hm, I need to go unfortunately
[13:36] Tao Takashi: but thanks for hosting :)
[13:36] Khamon Fate: See ya Tao
[13:36] Zero Linden: we partition queries among them in hopes of keeping potentially slow queries from blocking the ones that need to be fast
[13:37] Zero Linden: backbone is used right now in two ways
[13:37] Tao Takashi: good infos as always :)
[13:37] Wyn Galbraith: Bye Tao.
[13:37] Rex Cronon: why does the sim need to talk to dataserver, can't it get the data from the client, the client already has all its data. right?
[13:37] Khamon Fate: You have THREE dataserver processes running on every simulator machine or every CPU that supports a sim?
[13:37] Zero Linden: ...wait for it....
[13:37] Zero Linden: 1) It takes non-long lived data, such as agent presence, and rather than put that in the central DB - puts it in.....
[13:38] Zero Linden: ...in-memory python structures
[13:38] Zero Linden ewe - notices the bad mis-ordering of his chat packets....
[13:38] Wyn Galbraith whees she lives inside a python!
[13:38] Zero Linden: This is fast: you do an HTTP PUT to set some agent's presence ("Zero Linden is in Grasmere now.")
[13:39] Zero Linden: and you HTTP GET to find out which sim, if any, they are on
[13:39] Zero Linden: all that data is in memory - easy - and really, it is just replaying the XML that the PUT stored
[13:40] Zero Linden: it is all in a few hundred lines of python code
[13:40] Zero Linden: handles about 600 requests/second on our stock server hw
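A minimal sketch of the presence service as described: PUT stores the document for an agent in an in-memory structure, and GET replays exactly what was stored. The `/presence/<agent-id>` path layout, the XML body, and the class name are assumptions made for this example; the transcript does not give the real URL scheme or document format, and the sketch dispatches on method and path directly instead of running a web server.

```python
class PresenceStore:
    """In-memory agent presence: PUT stores a document, GET replays it verbatim."""

    PREFIX = "/presence/"  # hypothetical URL layout

    def __init__(self):
        self._docs = {}  # agent id -> raw bytes exactly as PUT

    def handle(self, method, path, body=b""):
        """Return (status, body) for one request; no network involved."""
        if not path.startswith(self.PREFIX):
            return 404, b""
        agent_id = path[len(self.PREFIX):]
        if method == "PUT":
            self._docs[agent_id] = body
            return 200, b""
        if method == "GET":
            doc = self._docs.get(agent_id)
            if doc is None:
                return 404, b""  # agent is not present on any sim
            return 200, doc
        if method == "DELETE":
            self._docs.pop(agent_id, None)
            return 200, b""
        return 405, b""

store = PresenceStore()
doc = b"<presence><region>Grasmere</region></presence>"  # made-up format
store.handle("PUT", "/presence/agent-123", doc)
status, replayed = store.handle("GET", "/presence/agent-123")
```

Since nothing is parsed or written to disk, each request is a dictionary operation, which is consistent with the whole service fitting in a small amount of code.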
[13:40] Hiro Market: really in core? even with 50 avs in the same sim?
[13:41] Zero Linden: Hiro - we keep agent presence divided onto four backbone servers (UUIDs 0-3, 4-7, etc...)
[13:41] Zero Linden: so, about 10k per server - all in core
[13:41] Zero Linden: right now, for example, one of those has a core size of 337Meg virtual, and 330M resident
[13:41] Zero Linden: that's nothing
[13:42] Zero Linden: the other use of backbone right now
[13:42] Zero Linden: is caching larger datasets from the db
[13:43] Zero Linden: for example, the "which inventory server is agent X on"
[13:43] Zero Linden: that data is in the central DB, but doesn't change often
[13:43] Zero Linden: so we can cache it in a backbone server - in core, and serve it very fast
[13:44] Zero Linden: we also use backbone as the host for the whole capability framework - though that runs locally on each simulator host
[13:45] Zero Linden: so when you get the map layer data, you are invoking a URL that is a capability
[13:45] Zero Linden: that is really a URL that points to that simulator's backbone
[13:45] Zero Linden: the backbone in turn, maps the capability to the real, internal URL of the map layer service, and calls it
[13:45] Zero Linden: proxying the result back to the viewer
[13:46] Zero Linden: what I love is that the whole capability framework is just 375 lines of code
[13:46] Zero Linden: so that it is clear, easy, and testable
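The capability flow described above (the viewer holds an opaque URL on the simulator's backbone, which maps it to the real internal URL and proxies the result back) can be sketched as follows. This is an assumption-laden illustration, not the 375-line framework: the class name, the `/cap/<token>` layout, the example host names, and the injected `fetch` function are all hypothetical.

```python
import uuid

class CapabilityHost:
    """Maps opaque capability URLs to internal service URLs and proxies calls."""

    def __init__(self, public_base, fetch):
        self._public_base = public_base.rstrip("/")
        self._fetch = fetch   # callable(internal_url) -> bytes, injected for the sketch
        self._caps = {}       # token -> internal URL

    def grant(self, internal_url):
        """Create a capability: an unguessable public URL for one internal URL."""
        token = str(uuid.uuid4())
        self._caps[token] = internal_url
        return f"{self._public_base}/cap/{token}"

    def invoke(self, public_url):
        """Resolve the capability, call the internal service, proxy the result."""
        token = public_url.rsplit("/", 1)[-1]
        internal_url = self._caps.get(token)
        if internal_url is None:
            return 404, b""
        return 200, self._fetch(internal_url)

def fetch(internal_url):
    # Stand-in for the real internal HTTP call.
    return f"data from {internal_url}".encode()

host = CapabilityHost("https://sim.example:12043", fetch)
cap_url = host.grant("http://internal.example/map-layer")
status, body = host.invoke(cap_url)
```

The viewer only ever sees `cap_url`; the internal URL stays inside the host, and an unknown token simply gets a 404.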
[13:46] Zero Linden: soon- the user server will go away - one of our two remaining central servers (other than DB)
[13:47] Zero Linden: the functionality - mostly some of the group IM stuff, will now go through backbones
[13:48] Zero Linden: Next time Donovan is here, we should get him to talk about backbone, eventlet and ultramini (which I think he is renaming mulib)
[13:48] Zero Linden: On an administrative note - sorry about falling behind on the wiki transcripts
[13:48] Zero Linden: my iMac died last week
[13:48] Zero Linden: and I have the transcripts - but I've been busy installing and configuring SW!
[13:49] Wyn Galbraith: Sorry about your system, Zero. They all tend to do that once in a while.
[13:49] Zero Linden: Alas - I have a 2pm that I must not be late for.... so I'm going to end just a few minutes early today....
[13:49] Jarod Godel: so Ultramini is an in-house project?
[13:50] Khamon Fate: Thanks for hosting Zero.
[13:50] Zero Linden: I know - and like all anecdotal computer misery - of course mine died just over a month out of warrantee
[13:50] Zero Linden: waranty
[13:50] Zero Linden: Ultramini is an open source project done by Donovan Preston before he joined Linden Lab
[13:50] Jarod Godel: ok
[13:50] Zero Linden: though we have done some fair bit of improvement to it
[13:50] Zero Linden: I think the new version will be called mulib and will be opensourced
[13:51] Wyn Galbraith: Thanks Zero, always a learning experience ;)
[13:51] Zero Linden: okay - I'm going to have to run now
[13:51] Jarod Godel: bye
[13:51] Zero Linden: thanks all - till next time
[13:51] Khamon Fate: Byeo
[13:51] Zha Ewry: Thanks, as always
[13:51] Rex Cronon: bye