User:Zero Linden/Office Hours/2007 Apr 26
Transcript of Zero Linden's office hours:
[7:43] | Rex Cronon: | if the text has a color closer to the background is realllllllllyyyyyyy fuzzzzyyyyy |
[7:43] | Khamon Fate: | Welcome back Zero. |
[7:43] | Zha Ewry: | wb Zero |
[7:43] | Zero Linden: | Fie! |
[7:44] | Zero Linden: | Dang Lindens! Can't write anything?!?!?! |
[7:44] | Jarod Godel: | lol@irony |
[7:44] | Wyn Galbraith: | That's why it's nice to have Lindens in-world. |
[7:44] | Wyn Galbraith thinks her boyfriend has finally accepted SL. She woke up with the laptop already on the bed beside her. | |
[7:44] | Zero Linden: | Yeah - Lindens in world - If I can actually manage to stay logged in! |
[7:45] | Zero Linden: | Ah well |
[7:45] | Zero Linden: | Welcome all |
[7:45] | Wyn Galbraith: | SL doesn't play favorites. |
[7:45] | Zha Ewry would like to take the graphics group and shake them. As far as she can tell, the new drivers hate ATI chips worse than the old ones | |
[7:45] | Zero Linden: | I thought I'd use this morning to give you a behind-the-scenes look at one of our current miseries |
[7:45] | Wyn Galbraith: | Cool. |
[7:45] | Zha Ewry: | Go for it. :-) |
[7:46] | Zero Linden: | So - here's the basics of one of the problems |
[7:46] | Zero Linden: | Turns out that converting agent ids to names is one of the higher-volume queries on the db |
[7:47] | Zero Linden: | which is ironic - as it is one of the least changing parts of the dataset |
[7:47] | Rex Cronon: | lots of scanners/radars do that too |
[7:47] | Wyn Galbraith: | So why bother until the actual name is needed? |
[7:47] | Zero Linden: | we don't Wyn |
[7:47] | Wyn Galbraith: | Oh. |
[7:47] | Zero Linden: | but it turns out that it is needed a lot |
[7:48] | Zero Linden: | Digression: Part of the problem of making the simulator responsible for agents is that they have to do all the work that is related to communicating with humans |
[7:48] | Zero Linden: | like, for example, formatting messages saying "so and so" is off line, etc.... |
[7:48] | Khamon Fate: | Who you callin human? |
[7:48] | Zero Linden: | which means that they have to know the name mapping for many more agents than they would otherwise care about |
[7:49] | Khamon Fate: | We need names every time we examine something, look at group listing, friends lists, IMs |
[7:49] | Zero Linden: | Well, even you advanced AI systems seem to prefer names to ids.... |
[7:49] | Zero Linden: | but I digress |
[7:49] | Zero Linden: | SO |
[7:49] | Khamon Fate: | Yes scripts demand that resolution very often as well. |
[7:49] | Zero Linden: | Someone had the idea to use a partitioned REST server as a store for that data |
[7:50] | Khamon Fate: | We even log into the Second Life websites with our names rather than our keys. |
[7:50] | Zero Linden: | so they dutifully made the dataservers query via HTTP for this info. |
[7:50] | Zero Linden: | which then should be cacheable etc..... |
[7:50] | Khamon Fate: | I think that was Jarod's idea originally. |
[7:50] | Zero Linden: | Only at the same time, someone else generalized the web service architecture |
[7:51] | Zero Linden: | and it turns out, that set of code DIDN'T make those HTTP requests via the local squid cache |
[7:51] | Zero Linden: | so --- no caching |
[7:51] | Zero Linden: | which still should have been better than hitting the central db |
[7:51] | Zero Linden: | BUT |
[7:51] | Zero Linden: | the load on the central backbone servers (our partitioned REST system) |
[7:52] | Zero Linden: | went from about 340 requests/second /machine |
[7:52] | Zero Linden: | to pegging at about 760 req./sec./server |
[7:52] | Zero Linden: | And quick log analysis revealed that it was all this name lookup |
[7:52] | Zero Linden: | Alas - that is the same set of servers that has agent presence.... |
[7:53] | Zero Linden: | which is why friends and login are both bork'd now |
[7:53] | Zero Linden: | SO - three fixes in the works last evening: |
[7:53] | Rex Cronon: | that is why it took me yesterday 1 hour to log in? |
[7:53] | Zero Linden: | 1) increase the partitioning from 4 to 8 servers |
[7:54] | Zero Linden: | 2) merge in the code to support query via local squid cache into the web services framework (had been done in another branch ) |
[7:54] | Wyn Galbraith: | That's why groups didn't show members too? |
[7:54] | Zero Linden: | 3) Figure out why the in-process name cache in the sim and the dataserver wasn't working at all |
[7:54] | Zero Linden: | yes and yes |
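The in-process name cache mentioned in fix 3 can be sketched roughly as follows. This is a minimal illustration, not Linden Lab's code: the class name `NameCache`, the `fetch` callback (standing in for the HTTP query to the backbone service), and the one-hour time-to-live are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class NameCache:
    """Tiny in-process cache mapping agent ids to names, with a time-to-live."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical stand-in for the backbone lookup
        self._ttl = ttl_seconds      # assumed lifetime; names rarely change
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, agent_id: str) -> str:
        now = self._clock()
        entry = self._entries.get(agent_id)
        if entry is not None and entry[1] > now:
            # Fresh entry: answer locally without touching the backbone.
            self.hits += 1
            return entry[0]
        # Missing or expired: ask the backing service and remember the answer.
        self.misses += 1
        name = self._fetch(agent_id)
        self._entries[agent_id] = (name, now + self._ttl)
        return name


backend_calls = []

def fake_fetch(agent_id: str) -> str:
    """Pretend backbone service that records every request it receives."""
    backend_calls.append(agent_id)
    return "Resident " + agent_id

cache = NameCache(fake_fetch)
for _ in range(3400):
    cache.lookup("agent-x")
```

With a working cache, 3400 lookups of one id from a single process would cost the central service a single request per time-to-live window.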
[7:54] | Wyn Galbraith: | Wow, interesting. |
[7:54] | Zero Linden: | At the moment - none of these has been deployed |
[7:55] | Zero Linden: | so - amazing the system still works even under such load |
[7:55] | Zero Linden: | So - here's the thing |
[7:55] | Khamon Fate: | Do the squids only cache what the sim requests from a dataserver? |
[7:55] | Zero Linden: | why didn't we see this in testing? how come we couldn't anticipate this? Or notice that expected caching was all missing? |
[7:56] | Zero Linden: | Khamon - no - many things on a simhost go via that squid. In particular assets |
[7:56] | Khamon Fate: | Because you didn't have enough users to register the exceptions. |
[7:56] | Rex Cronon: | u didn't have enough users to stress the system |
[7:56] | Zero Linden: | Rex - no |
[7:56] | Zero Linden: | oddly enough |
[7:56] | Zero Linden: | so - here are the load stats for this name lookup |
[7:57] | Zero Linden: | yesterday afternoon we snagged a million lines from one of the four backbone logs |
[7:57] | Zero Linden: | it covers 4:06pm to 4:26pm |
[7:58] | Rex Cronon: | hope u won't paste it in chat |
[7:58] | Zero Linden: | so in 20min we did 1M requests on one host -- and there are four hosts |
[7:58] | Wyn Galbraith: | Fun stuff reading logs. |
[7:58] | Zero Linden: | of those 578k were name lookups |
[7:58] | Zero Linden: | of those - only 50k unique agent ids were looked up |
[7:59] | Zero Linden: | So - on average each agent id is looked up 10x |
[7:59] | Zero Linden: | but with 2100 simulator hosts hitting it |
[7:59] | Zero Linden: | it is easy to imagine that |
[8:00] | Zero Linden: | If it were evenly distributed - then |
[8:00] | Zha Ewry blinks. Are you simply flushing the cache as fast as you load it? | |
[8:00] | Zero Linden: | we'd have one set of problems |
[8:00] | Zero Linden: | BUT |
[8:00] | Zero Linden: | it isn't |
[8:00] | Rex Cronon: | a solution might be to store obj/av name alongside the key<------- |
[8:01] | Khamon Fate: | Plus, if a squid has agent data cached, it can serve that to scripts and other agents in the sim right? |
[8:01] | Zero Linden: | Well - look, if it is evenly distributed - and those 577k requests came randomly from the 2100 hosts |
[8:01] | Zero Linden: | then there is little likelihood, even with only 50k unique agent ids, that any given host would be requesting the same id twice |
[8:01] | Zero Linden: | hence - the cache on that host wouldn't help at all |
[8:01] | Rex Cronon: | if u do that then there should be only 1 request |
[8:02] | Rex Cronon: | per sim for each av/obj |
[8:02] | Zero Linden: | so each host looks up 280 random agent ids from the pool of 50k in 20min --- the likelihood that it picks the same agent id twice is very low |
[8:03] | Zero Linden: | hence the cache on the local host is useless - and so all the queries go to the central service |
[8:03] | Zero Linden: | so 280 x 2100 = 577k hit the central service |
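The uniform-distribution argument above can be checked with a little arithmetic. The sketch below assumes each of a host's roughly 280 lookups is drawn independently and uniformly from the pool of 50k ids; the function name is illustrative only.

```python
def expected_repeat_lookups(lookups: int, pool: int) -> float:
    """Expected number of lookups that repeat an id this host has already seen,
    assuming every lookup is an independent uniform draw from `pool` ids."""
    # Expected count of distinct ids seen after `lookups` uniform draws.
    expected_distinct = pool * (1.0 - (1.0 - 1.0 / pool) ** lookups)
    # Every lookup beyond the distinct ones is a repeat a local cache could serve.
    return lookups - expected_distinct


per_host = 280      # lookups per host in the 20 minute window (from the chat)
pool = 50_000       # unique agent ids seen in the log sample (from the chat)

repeats = expected_repeat_lookups(per_host, pool)
hit_rate = repeats / per_host
print(f"expected repeated lookups per host: {repeats:.2f}")
print(f"best-case local cache hit rate: {hit_rate:.2%}")
```

Under those assumptions a host expects fewer than one repeated id in the window, so a per-host cache would absorb well under 1% of the lookups and nearly everything would still reach the central service.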
[8:03] | Zero Linden: | BUT |
[8:03] | Zero Linden: | that isn't what is happening at all.... |
[8:03] | Wyn Galbraith waits for it. | |
[8:04] | Zero Linden: | one sim did 4000 requests, but 3400 of them were for the same agent id |
[8:04] | Zero Linden: | there are about 20 or so of these rogue sims |
[8:04] | Zero Linden: | well - really only about 5 really bad ones |
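A log analysis of this kind can be sketched as a per-host tally. The input shape (pairs of host and agent id already extracted from the log) and the two thresholds are assumptions made for illustration; the transcript does not describe the real log format or the cut-offs used.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, List, Tuple


def find_rogue_hosts(lookups: Iterable[Tuple[str, str]],
                     min_requests: int = 1000,
                     min_share: float = 0.5) -> List[Tuple[str, int, str, int]]:
    """Return (host, total, top_agent_id, top_count) for hosts that make many
    requests and send most of them for a single agent id."""
    per_host: DefaultDict[str, Counter] = defaultdict(Counter)
    for host, agent_id in lookups:
        per_host[host][agent_id] += 1

    rogues = []
    for host, counts in per_host.items():
        total = sum(counts.values())
        agent_id, top = counts.most_common(1)[0]
        # Hypothetical thresholds: busy host, dominated by one id.
        if total >= min_requests and top / total >= min_share:
            rogues.append((host, total, agent_id, top))
    return sorted(rogues, key=lambda r: r[1], reverse=True)


# Synthetic sample shaped like the figures in the chat: one host sends 4000
# requests with 3400 for one id, another sends 150 requests for distinct ids.
sample = [("sim-a", "agent-x")] * 3400
sample += [("sim-a", f"agent-{i}") for i in range(600)]
sample += [("sim-b", f"agent-{i}") for i in range(150)]

rogue_hosts = find_rogue_hosts(sample)
print(rogue_hosts)
```

On the synthetic sample only `sim-a` is flagged, since `sim-b` stays below both thresholds.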
[8:05] | Khamon Fate: | Good analysis skills |
[8:05] | Rex Cronon: | look what i said above zero, could that be a solution |
[8:05] | Zha Ewry: | /me blinks. And the 5 bad ones are still online? |
[8:05] | Wyn Galbraith: | Rogue sims, that's scary. |
[8:05] | Khamon Fate: | Give us names so we can harass their owners unfairly. |
[8:05] | Khamon Fate: | I mean mercilessly. |
[8:06] | Wyn Galbraith: | Do we know it's the sim owner's fault? |
[8:06] | Khamon Fate: | Do y'all know what facilitates the rogueness? Is it some scripting command perhaps? |
[8:06] | Zero Linden: | the vast majority of them did 150 |
[8:06] | Rex Cronon: | were those requests coming from scripts? |
[8:06] | Zha Ewry frowns. "Or just throttle them. Choke down their requests." | |
[8:06] | Jarod Godel: | /multiply 20 60 |
[8:06] | Bash: Multiplying. | |
[8:06] | Bash: 1200.000000 | |
[8:06] | Jarod Godel: | /divide 1000000 1200 |
[8:06] | Bash: Dividing. | |
[8:06] | Bash: 833.333313 | |
[8:06] | Khamon Fate: | No Wyn, I'm kidding about the owners. |
[8:06] | Jarod Godel: | That's 833 requests a second. |
[8:06] | Jarod Godel: | Not too shabby. |
[8:06] | Zero Linden: | At present, we don't know why those regions went berserk |
[8:06] | Wyn Galbraith smiles at Khamon. | |
[8:07] | Zero Linden: | But the key is that on a test grid - if we pulled 50 random region states to be on that grid |
[8:07] | Zero Linden: | likelihood is we'd never see this |
[8:07] | Khamon Fate: | I do agree that the sims should be taken offline. I'd not complain if Slate or Taber had to be disconnected for such rogueness, assuming the problem was being worked on. |
[8:07] | Rex Cronon: | can't u trace what/who generated that many requests for name? |
[8:08] | Jarod Godel: | Zero, I don't know too much about MySQL indexing to be honest, but would indexing names-to-keys speed things up? |
[8:08] | Zero Linden: | So, if the bad actions of .5% of the grid can cause problems -- then we'd need test grids of well over 400 regions to get statistically good chance of hitting all the cases |
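The test-grid sizing point can be illustrated with a short calculation. The sketch assumes each region pulled onto a test grid is an independent draw with a 0.5% chance of being one of the misbehaving regions; that independence is a simplification, and the function name is illustrative.

```python
def chance_of_catching(sample_size: int, rogue_fraction: float) -> float:
    """Probability that a random sample of regions contains at least one rogue
    region, treating each region as an independent draw."""
    return 1.0 - (1.0 - rogue_fraction) ** sample_size


rogue_fraction = 0.005   # ".5% of the grid", from the chat
for regions in (50, 400, 1000):
    chance = chance_of_catching(regions, rogue_fraction)
    print(f"{regions:>5} regions: {chance:.0%} chance of including a rogue region")
```

Under that assumption a 50-region test grid includes a rogue region only about one time in five, while a 400-region grid does so in most trials. Catching every distinct failure case would need more than a single hit.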
[8:09] | Zero Linden: | Jarod - you can bet your hammer it is indexed |
[8:09] | Khamon Fate: | Question is, can ANY sim end up going rogue or is this a specialized problem caused by some type of script or agent configuration or such? |
[8:09] | Zero Linden: | of course, in this case, it isn't even hitting the db - this is all just requesting data out of an in-memory hash table |
[8:09] | Zero Linden: | Khamon - we don't know |
[8:09] | Khamon Fate claps for in memory hash tables | |
[8:10] | Jarod Godel: | Zero, ok. |
[8:10] | Khamon Fate: | Hi Dnate |
[8:10] | Zero Linden: | My best guess is that the in-simulator name cache is working -- since that has been in for a long time.... |
[8:10] | Khamon Fate: | We're fixing the grid. |
[8:10] | Jarod Godel: | Zero, is that because I use it often or because it's a build? |
[8:10] | Zero Linden: | and that the new in dataserver cache is failing --- which leads to some strange situation |
[8:10] | Zero Linden: | where it isn't the sim itself causing the requests, but something in the dataserver ..... |
[8:10] | Jarod Godel: | (The latter being the same as "everything is indexed.") |
[8:10] | Khamon Fate: | That was my next question. |
[8:11] | Khamon Fate: | I'm still unclear, is there a dataserver process running for each sim/processor? |
[8:11] | Zero Linden: | Jarod? on the index - it is because we index any column that we do regular queries on |
[8:11] | Khamon Fate: | Speaking of which, we'd like to see the asset server schema if that's possible. |
[8:12] | Zero Linden: | Khamon - each host has four simulators running (one per CPU), sharing three dataservers ("fast", "slow" and "inventory"), one backbone web service proxy, and one squid cache |
[8:12] | Zero Linden: | The asset server isn't a database - |
[8:12] | Khamon Fate: | Thank You |
[8:12] | Jarod Godel: | Zero, ok. Thanks. |
[8:12] | Jarod Godel: | and, I second Khamon's request. |
[8:12] | Zero Linden: | it is a HTTP server cluster backed by a distributed fault-tolerant file system |
[8:12] | Khamon Fate: | I'll rephrase that, we'd like to see the grid's database schema if that's possible. |
[8:13] | Zero Linden: | Ah - the central DB schema!?!?!? HA - you really don't want to see it! Not unless you're on relaxants and anti-depressants.... |
[8:13] | Khamon Fate: | I have pills and yes I would like to see it. |
[8:14] | Zero Linden: | About a year ago Tess Linden did this giant chart of it -- it fills a wall - a large wall |
[8:14] | Khamon Fate: | Oh well I'll be SF in Sept. Perhaps I can come by and see it there. |
[8:14] | Zero Linden: | Also - there are many parts of it that are being moved into other dbs, moved into web services, or being excised completely |
[8:14] | Khamon Fate: | Was there no recorded diagram before a year ago? |
[8:15] | Zero Linden: | Fundamentally - it isn't very much more complex than you'd expect |
[8:15] | Zero Linden: | So - the lessons from yesterday are that simulators have very very wide behavior |
[8:16] | Zero Linden: | and that unit tests rock! |
[8:16] | Zero Linden: | This morning - we'll be deciding what to deploy of those three -- probably all three.... |
[8:17] | Khamon Fate: | Will they solve the problem or accommodate it? |
[8:17] | Wyn Galbraith: | Any one here read "When HARLIE Was One" |
[8:17] | Zero Linden: | I'm not sure there is a "solution" - though if you mean root out the cause of the sims that are going hog wild? |
[8:17] | Zero Linden: | well, no, we probably won't find that |
[8:17] | Zero Linden: | at least not before we make the grid happy |
[8:19] | Jarod Godel: | Zero, how much of your metric-taking is done in-world and out? Are there any tools (script finders, etc.) that work better as LSL scripts than looking at logs? |
[8:19] | Zero Linden: | almost all the metrics are done through log analysis |
[8:19] | Wyn Galbraith: | HARLIE was a computer that wanted to get high like its creator did and OD'd on input streams. Just a thought ;) |
[8:19] | Jarod Godel: | Could LSL/sl developers have rooted out this issue by doing anything investigative in-world? |
[8:19] | Zero Linden: | We have "smoke test" objects that we run that measure responsiveness for scripts in world |
[8:20] | Jarod Godel: | Wyn, like a Maker who creates cyber-drugs? |
[8:20] | Zero Linden: | dpn |
[8:20] | Zero Linden: | I'm not thinking of any obvious ways..... |
[8:20] | Wyn Galbraith: | No cyber drugs. |
[8:20] | Jarod Godel: | ok |
[8:20] | Zero Linden: | alas, there isn't much introspection one can do in scripts |
[8:20] | Zero Linden: | What you could / should do is |
[8:20] | Jarod Godel: | That's all abstracted away for security? |
[8:20] | Zero Linden: | have an object that tests all the subsystems you rely on |
[8:21] | Zero Linden: | if your scripts need http or xml-rpc -- or if they expect a certain behavior from chat or link messages |
[8:21] | Zero Linden: | it would be best to have an object you can rez after a new release |
[8:21] | Zero Linden: | and run its tests |
[8:21] | Zero Linden: | THAT would be great |
[8:21] | Wyn Galbraith nods, "Good idea." | |
[8:21] | Zero Linden: | LSL acceptance tests |
[8:22] | Zero Linden: | It would be great if there were a common framework for them - like jtest or other such test libs |
[8:22] | Zero Linden: | I could imagine one rezzing prims in a test sim and them communicating with a controller |
[8:23] | Zero Linden: | I could imagine the controller gathering up the results in a common format.... |
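The idea of a controller gathering results in a common format can be modeled off-world as below. No such format is defined in the discussion, so the `STATUS|subsystem|detail` report line, the `CheckResult` type, and the function names are all hypothetical. The sketch shows only the aggregation step, not the in-world LSL side.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CheckResult:
    """One acceptance-check outcome reported by a test prim (hypothetical)."""
    subsystem: str
    passed: bool
    detail: str


def parse_report(line: str) -> CheckResult:
    """Parse one 'STATUS|subsystem|detail' line; the format is an assumption."""
    status, subsystem, detail = line.split("|", 2)
    if status not in ("PASS", "FAIL"):
        raise ValueError(f"unknown status: {status!r}")
    return CheckResult(subsystem, status == "PASS", detail)


def summarize(lines: List[str]) -> Dict[str, List[str]]:
    """Group reported subsystems into passed and failed lists."""
    summary: Dict[str, List[str]] = {"passed": [], "failed": []}
    for line in lines:
        result = parse_report(line)
        summary["passed" if result.passed else "failed"].append(result.subsystem)
    return summary


# Made-up reports covering the subsystems named in the chat.
reports = [
    "PASS|http|response received within timeout",
    "FAIL|xml-rpc|no reply",
    "PASS|link-messages|messages arrived in order",
]
summary = summarize(reports)
print(summary)
```

Keeping every report to one delimited line would let test prims send results over chat or HTTP without any per-test parsing logic in the controller.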
[8:23] | Jarod Godel: | test libs assumes #include ;) |
[8:23] | Zero Linden: | you know - I use #include in my LSL scripts - I just run 'em all through the C++ preprocessor before pasting! |
[8:24] | Zero Linden: | and actually - you can code in a style where you think of scripts as DLLs - not static libs, and use llMessageLinked as the calling convention |
[8:24] | Rex Cronon: | but, there are tests that require human observation, as the lsl is kind of limited in some areas |
[8:24] | Zero Linden: | My goban has several such dynamic libs in it |
[8:24] | Zero Linden: | Rex - true - one way to do that is to automate the human |
[8:24] | Wyn Galbraith is just learning LSL though she is a tester type, "I'd love to do that." | |
[8:25] | Zero Linden: | llSay(0, "you will see a red sphere appear. It should be 5m and above you. It should appear within 4 seconds.") |
[8:25] | Rex Cronon: | or give more abilities/functions to lsl |
[8:25] | Zero Linden: | llSay(0, "did you see it?") -- after sleeping 4 seconds |
[8:25] | Zha Ewry has done the multi-script thing. It's good, up to a point. Global state.. is a PITA | |
[8:26] | Khamon Fate: | I tend to just multistate in single scripts as much as possible. |
[8:26] | Zero Linden: | yes - it is - you can use tricks like setting params on child prims .... but - ick! |
[8:26] | Wyn Galbraith: | I did a lot of test scripts like that at one job where I tested the graphic output. You had to watch because there was no way to record graphic displays without a camera. We did do screen shots. |
[8:26] | Zha Ewry would love some better tools for that state | |
[8:26] | Khamon Fate: | My trees have to use child prims; but llSetPrimitiveParams is totally hosed right now. The clients don't render the results properly. |
[8:27] | Zero Linden: | another option I use for state is silo |
[8:27] | Zero Linden: | just store it off world - |
[8:27] | Zero Linden: | it is often faster than using llSay to communicate it around |
[8:27] | Khamon Fate: | child scripts I mean |
[8:28] | Wyn Galbraith: | llSetPrimitiveParams is hosed? Is that why I haven't been able to get it to work? |
[8:28] | Zero Linden tries to think of the company that did automated testing by actually parsing the desktop pixmap image..... | |
[8:28] | Zha Ewry nods.. It is easy to have some of that off world... But.. fragile in annoying ways | |
[8:29] | Zero Linden: | right - it isn't perfect -- but that does bring up an interesting meta thought to me |
[8:29] | Khamon Fate: | SVC-38 and related reports on JIRA describe some of the problems. The command works, the server calcs properly, but the clients render the changes badly. |
[8:29] | Wyn Galbraith thought it was her lack of LSL knowledge. | |
[8:29] | Zero Linden: | we think of the programming system of LSL/SL like we think of other programming systems like python or C++ or Java |
[8:29] | Rex Cronon: | zero, is there any way the sims can set all physical objects, to non-physical after a restart? |
[8:30] | Zero Linden: | they are well defined, with consistent semantics, and sort of "these things must work!" |
[8:30] | Jarod Godel: | When we should think in terms of Smalltalk and asynchronous objects. |
[8:30] | Zero Linden: | like - you call a function -- it will execute and then return |
[8:30] | Zero Linden: | On the other hand, there is REST - a whole design stance, and a programming model for a large, unreliable distributed system |
[8:30] | Zero Linden: | make this data request... and it might not complete |
[8:31] | Khamon Fate: | Yes those functions are alive and do whatever they please ha ha |
[8:31] | Rex Cronon: | that is anarchy |
[8:31] | Zero Linden: | Even Smalltalk is a very well defined, lock step system - (and smalltalk is no more async. than other common language) |
[8:32] | Khamon Fate: | I suppose that's the reason y'all can't track rogue requests but simply have to accommodate them. At the risk of sounding snarky though, I have to wonder how long that can continue. Is such a method scalable? |
[8:32] | Jarod Godel: | It's not. I always thought it kind of had to be asynch. Shows what I know. |
[8:32] | Zha Ewry checks the clock. Zero.. One quick question. What is being deployed today that needs a shutdown of the grid? | |
[8:32] | Rex Cronon: | what grid shutdown today:( |
[8:33] | Zero Linden: | Zha - I'm not sure - all three changes can be done live, with only rolling restarts (though host by host, not region by region) |
[8:33] | Zero Linden: | So I'm not sure why we're kicking all. |
[8:33] | Zha Ewry glances at Rex.. check the blog | |
[8:33] | Zero Linden: | The morning huddle on this is about to happen.... I'll know more then |
[8:33] | Zha Ewry: | Ahh. |
[8:33] | Rex Cronon: | didn't have time to read the blog |
[8:34] | Zha Ewry: | There's a 1 hour window positied |
[8:34] | Zero Linden: | it is likely that when we were all working on those fixes -- they decided to schedule the downtime in case the fixes turned out to require it |
[8:34] | Zha Ewry: | I'm surprised that LL can shutdown, do anything and restart in 1 hour at this point |
[8:34] | Wyn Galbraith notes these hours go by so quickly, "I think my blue screen display stops might be fixed by the update. But the jury is still out." | |
[8:35] | Zero Linden: | well - I must head off... and see what we can get deployed today! |
[8:35] | Zero Linden: | thanks again |
[8:35] | Khamon Fate: | Sim servers reboot rather quickly. It's only when the other systems have to be ipled that more time is required. |
[8:35] | Wyn Galbraith: | Thanks Zero. Have a nice day ;) |
[8:35] | Rex Cronon: | bye |
[8:35] | Khamon Fate: | Thanks for hosting Zero. You p4wnz the office hours. |
[8:35] | Zha Ewry: | Thanks as always Zero. |
[8:35] | Rex Cronon: | too bad u can't answer all the questions:( |
[8:36] | Zha Ewry: | And good luck.. |
[8:36] | Zha Ewry: | This has been a rough one |