User:Zero Linden/Office Hours/2007 Apr 26

Transcript of Zero Linden's office hours:

[7:43] Rex Cronon: if the text has a color closer to the background is realllllllllyyyyyyy fuzzzzyyyyy
[7:43] Khamon Fate: Welcome back Zero.
[7:43] Zha Ewry: wb Zero
[7:43] Zero Linden: Fie!
[7:44] Zero Linden: Dang Lindens! Can't write anything?!?!?!
[7:44] Jarod Godel: lol@irony
[7:44] Wyn Galbraith: That's why it's nice to have Lindens in-world.
[7:44] Wyn Galbraith thinks her boyfriend has finally accepted SL. She woke up with the laptop already on the bed beside her.
[7:44] Zero Linden: Yeah - Lindens in world - If I can actually manage to stay logged in!
[7:45] Zero Linden: Ah well
[7:45] Zero Linden: Welcome all
[7:45] Wyn Galbraith: SL doesn't play favorites.
[7:45] Zha Ewry would like to take the graphics group and shake them. As far as she can tell, the new drivers hate ATI chips worse than the old ones
[7:45] Zero Linden: I thought I'd use this morning to give you a behind-the-scenes look at one of our current miseries
[7:45] Wyn Galbraith: Cool.
[7:45] Zha Ewry: Go for it. :-)
[7:46] Zero Linden: So - here's the basics of one of the problems
[7:46] Zero Linden: Turns out that converting agent ids to names is one of the most frequent queries on the db
[7:47] Zero Linden: which is ironic - as it is one of the least changing parts of the dataset
[7:47] Rex Cronon: lots of scanners/radars do that too
[7:47] Wyn Galbraith: So why bother until the actual name is needed?
[7:47] Zero Linden: we don't Wyn
[7:47] Wyn Galbraith: Oh.
[7:47] Zero Linden: but it turns out that it is needed a lot
[7:48] Zero Linden: Digression: Part of the problem of making the simulator responsible for agents is that they have to do all the work that is related to communicating with humans
[7:48] Zero Linden: like, for example, formatting messages saying "so and so" is off line, etc....
[7:48] Khamon Fate: Who you callin human?
[7:48] Zero Linden: which means that they have to know the name mapping for many more agents than they would otherwise care about
[7:49] Khamon Fate: We need names every time we examine something, look at group listing, friends lists, IMs
[7:49] Zero Linden: Well, even you advanced AI systems seem to prefer names to ids....
[7:49] Zero Linden: but I digress
[7:49] Zero Linden: SO
[7:49] Khamon Fate: Yes scripts demand that resolution very often as well.
[7:49] Zero Linden: Someone had the idea to use a partitioned REST server as a store for that data
[7:50] Khamon Fate: We even log into the Second Life websites with our names rather than our keys.
[7:50] Zero Linden: so they dutifully made the dataservers query via HTTP for this info.
[7:50] Zero Linden: which then should be cacheable etc.....
[7:50] Khamon Fate: I think that was Jarod's idea originally.
[7:50] Zero Linden: Only at the same time, someone else generalized the web service architecture
[7:51] Zero Linden: and it turns out, that set of code DIDN'T make those HTTP requests via the local squid cache
[7:51] Zero Linden: so --- no caching
[7:51] Zero Linden: which still should have been better than hitting the central db
[7:51] Zero Linden: BUT
[7:51] Zero Linden: the load on the central backbone servers (our partitioned REST system)
[7:52] Zero Linden: went from about 340 requests/second /machine
[7:52] Zero Linden: to pegging at about 760 req./sec./server
[7:52] Zero Linden: And quick log analysis revealed that it was all this name lookup
[7:52] Zero Linden: Alas - that is the same set of servers that has agent presence....
[7:53] Zero Linden: which is why friends and login are both bork'd now
[7:53] Zero Linden: SO - three fixes in the works last evening:
[7:53] Rex Cronon: that is why it took me yesterday 1 hour to log in?
[7:53] Zero Linden: 1) increase the partitioning from 4 to 8 servers
[7:54] Zero Linden: 2) merge in the code to support query via local squid cache into the web services framework (had been done in another branch )
[7:54] Wyn Galbraith: That's why groups didn't show members too?
[7:54] Zero Linden: 3) Figure out why the in-process name cache in the sim and the dataserver wasn't working at all
[7:54] Zero Linden: yes and yes
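
A minimal sketch of the kind of in-process name cache fix 3 refers to, assuming a simple read-through LRU design. The class name `NameCache`, the `fetch` callable, and the fake backend are hypothetical and not taken from the simulator or dataserver code. The hit and miss counters are the sort of signal that would show when an expected cache is not working at all.

```python
from collections import OrderedDict


class NameCache:
    """Read-through LRU cache mapping agent ids to names, with hit/miss counters."""

    def __init__(self, fetch, capacity=1024):
        self._fetch = fetch            # hypothetical callable: agent_id -> name
        self._capacity = capacity
        self._entries = OrderedDict()  # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def lookup(self, agent_id):
        if agent_id in self._entries:
            self.hits += 1
            self._entries.move_to_end(agent_id)  # mark as most recently used
            return self._entries[agent_id]
        self.misses += 1
        name = self._fetch(agent_id)             # only misses reach the backend
        self._entries[agent_id] = name
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)    # evict least recently used
        return name


backend_calls = []


def fake_backend(agent_id):
    """Stand-in for the central name service (assumption for this sketch)."""
    backend_calls.append(agent_id)
    return f"Resident {agent_id}"


cache = NameCache(fake_backend, capacity=2)
for agent_id in ["a", "a", "b", "a", "c", "b"]:
    cache.lookup(agent_id)
print(cache.hits, cache.misses, len(backend_calls))
```

If the miss count tracks the total lookup count, the cache is effectively absent, which is the symptom described above.
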
[7:54] Wyn Galbraith: Wow, interesting.
[7:54] Zero Linden: At the moment - none of these has been deployed
[7:55] Zero Linden: so - amazing the system still works even under such load
[7:55] Zero Linden: So - here's the thing
[7:55] Khamon Fate: Do the squids only cache what the sim requests from a dataserver?
[7:55] Zero Linden: why didn't we see this in testing? how come we couldn't anticipate this? Or notice that expected caching was all missing?
[7:56] Zero Linden: Khamon - no - many things on a simhost go via that squid. In particular assets
[7:56] Khamon Fate: Because you didn't have enough users to register the exceptions.
[7:56] Rex Cronon: u didn't have enough users to stress the system
[7:56] Zero Linden: Rex - no
[7:56] Zero Linden: oddly enough
[7:56] Zero Linden: so - here are the load stats for this name lookup
[7:57] Zero Linden: yesterday afternoon we snagged a million lines from one of the four backbone logs
[7:57] Zero Linden: it covers 4:06pm to 4:26pm
[7:58] Rex Cronon: hope u won't paste it in chat
[7:58] Zero Linden: so in 20min we did 1M requests on one host -- and there are four hosts
[7:58] Wyn Galbraith: Fun stuff reading logs.
[7:58] Zero Linden: of those 578k were name lookups
[7:58] Zero Linden: of those - only 50k unique agent ids were looked up
[7:59] Zero Linden: So - on average each agent id is looked up 10x
[7:59] Zero Linden: but with 2100 simulator hosts hitting it
[7:59] Zero Linden: it is easy to imagine that
[8:00] Zero Linden: If it were evenly distributed - then
[8:00] Zha Ewry blinks. Are you simply flushing the cache as fast as you load it?
[8:00] Zero Linden: we'd have one set of problems
[8:00] Zero Linden: BUT
[8:00] Zero Linden: it isn't
[8:00] Rex Cronon: a solution might be to store obj/av name alongside the key<-------
[8:01] Khamon Fate: Plus, if a squid has agent data cached, it can serve that to scripts and other agents in the sim right?
[8:01] Zero Linden: Well - look, if it is evenly distributed - and those 577k requests came randomly from the 2100 hosts
[8:01] Zero Linden: then there is little likelihood, even with only 50k unique agent ids, that any given host would be requesting the same id twice
[8:01] Zero Linden: hence - the cache on that host wouldn't help at all
[8:01] Rex Cronon: if u do that then there should be only 1 request
[8:02] Rex Cronon: per sim for each av/obj
[8:02] Zero Linden: so each host looks up 280 random agent ids from the pool of 50k in 20min --- the likelihood that it picks the same agent id twice is very low
[8:03] Zero Linden: hence the cache on the local host is useless - and so all the queries go to the central service
[8:03] Zero Linden: so roughly 280 x 2100, about 577k requests, hit the central service
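
The argument above can be checked with a short calculation. This is a sketch under the stated assumption that every lookup picks uniformly at random from the 50k ids and that lookups are spread evenly over the 2100 hosts; the function name is hypothetical.

```python
def expected_repeat_lookups(lookups_per_host, unique_ids):
    """Expected number of lookups that repeat an id the host already requested,
    assuming each lookup picks uniformly at random from unique_ids."""
    # Expected number of distinct ids seen after n uniform draws from N ids.
    distinct = unique_ids * (1 - (1 - 1 / unique_ids) ** lookups_per_host)
    return lookups_per_host - distinct


total_lookups = 577_000   # name lookups in the 20 minute sample
hosts = 2100              # simulator hosts
unique_ids = 50_000       # unique agent ids in the sample

per_host = total_lookups // hosts
repeats = expected_repeat_lookups(per_host, unique_ids)
hit_rate = repeats / per_host  # best case for a per-host cache

print(per_host, round(repeats, 2), f"{hit_rate:.2%}")
```

Under these assumptions a host expects fewer than one repeated id in the whole window, so a per-host cache would serve well under 1% of lookups.
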
[8:03] Zero Linden: BUT
[8:03] Zero Linden: that isn't what is happening at all....
[8:03] Wyn Galbraith waits for it.
[8:04] Zero Linden: one sim did 4000 requests, but 3400 of them were for the same agent id
[8:04] Zero Linden: there are about 20 or so of these rogue sims
[8:04] Zero Linden: well - really only about 5 really bad ones
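
A sketch of the kind of log analysis that surfaces such sims, assuming the log has already been reduced to (sim, agent id) pairs. The pair format, the thresholds, and the function name are assumptions for illustration, not the actual backbone log layout or tooling.

```python
from collections import Counter, defaultdict


def find_rogue_sims(lookups, min_requests=1000, min_share=0.5):
    """lookups: iterable of (sim, agent_id) pairs, one per name lookup.
    Returns (sim, total, top_id, top_count) for sims whose traffic is both
    heavy and dominated by a single agent id, busiest first."""
    per_sim = defaultdict(Counter)
    for sim, agent_id in lookups:
        per_sim[sim][agent_id] += 1

    rogues = []
    for sim, counts in per_sim.items():
        total = sum(counts.values())
        top_id, top_count = counts.most_common(1)[0]
        if total >= min_requests and top_count / total >= min_share:
            rogues.append((sim, total, top_id, top_count))
    return sorted(rogues, key=lambda r: r[1], reverse=True)


# Synthetic data shaped like the case described above: a typical sim with
# 150 lookups, and one sim sending 3400 lookups for a single agent id.
sample = [("sim-typical", f"agent-{i}") for i in range(150)]
sample += [("sim-rogue", "agent-x")] * 3400
sample += [("sim-rogue", f"agent-{i}") for i in range(600)]

rogues = find_rogue_sims(sample)
print(rogues)
```
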
[8:05] Khamon Fate: Good analysis skills
[8:05] Rex Cronon: look what i said above zero, could that be a solution
[8:05] Zha Ewry blinks. And the 5 bad ones are still online?
[8:05] Wyn Galbraith: Rogue sims, that's scary.
[8:05] Khamon Fate: Give us names so we can harass their owners unfairly.
[8:05] Khamon Fate: I mean mercilessly.
[8:06] Wyn Galbraith: Do we know it's the sim owners' fault?
[8:06] Khamon Fate: Do y'all know what facilitates the rogueness? Is it some scripting command perhaps?
[8:06] Zero Linden: the vast majority of them did 150
[8:06] Rex Cronon: were those requests coming from scripts?
[8:06] Zha Ewry frowns. "Or just throttle them. Choke down their requests."
[8:06] Jarod Godel: /multiply 20 60
[8:06] Bash: Multiplying.
[8:06] Bash: 1200.000000
[8:06] Jarod Godel: /divide 1000000 1200
[8:06] Bash: Dividing.
[8:06] Bash: 833.333313
[8:06] Khamon Fate: No Wyn, I'm kidding about the owners.
[8:06] Jarod Godel: That's 833 requests a second.
[8:06] Jarod Godel: Not too shabby.
[8:06] Zero Linden: At present, we don't know why those regions went berserk
[8:06] Wyn Galbraith smiles at Khamon.
[8:07] Zero Linden: But the key is that on a test grid - if we pulled 50 random region states to be on that grid
[8:07] Zero Linden: likelihood is we'd never see this
[8:07] Khamon Fate: I do agree that the sims should be taken offline. I'd not complain if Slate or Taber had to be disconnected for such rogueness, assuming the problem was being worked on.
[8:07] Rex Cronon: can't u trace what/who generated that many requests for name?
[8:08] Jarod Godel: Zero, I don't know too much about MySQL indexing to be honest, but would indexing names-to-keys speed things up?
[8:08] Zero Linden: So, if the bad actions of .5% of the grid can cause problems -- then we'd need test grids of well over 400 regions to get a statistically good chance of hitting all the cases
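
A rough way to see the test grid sizing point, assuming each region copied onto a test grid independently has a 0.5% chance of being one of the misbehaving ones. This only estimates the chance of including at least one such region, which is a weaker goal than hitting all the cases; the function name is hypothetical.

```python
def chance_of_sampling_rogue(test_regions, rogue_fraction=0.005):
    """Probability that a random sample of regions contains at least one
    rogue region, assuming independent draws with the given rogue fraction."""
    return 1 - (1 - rogue_fraction) ** test_regions


for regions in (50, 400, 1000):
    print(regions, f"{chance_of_sampling_rogue(regions):.1%}")
```

With these assumptions a 50 region test grid includes a rogue region a bit more than one time in five, while 400 regions raises that to roughly 87%.
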
[8:09] Zero Linden: Jarod - you can bet your hammer it is indexed
[8:09] Khamon Fate: Question is, can ANY sim end up going rogue or is this a specialized problem caused by some type of script or agent configuration or such?
[8:09] Zero Linden: of course, in this case, it isn't even hitting the db - this is all just requesting data out of an in-memory hash table
[8:09] Zero Linden: Khamon - we don't know
[8:09] Khamon Fate claps for in memory hash tables
[8:10] Jarod Godel: Zero, ok.
[8:10] Khamon Fate: Hi Dnate
[8:10] Zero Linden: My best guess is that the in-simulator name cache is working -- since that has been in for a long time....
[8:10] Khamon Fate: We're fixing the grid.
[8:10] Jarod Godel: Zero, is that because I use it often or because it's a build?
[8:10] Zero Linden: and that the new in dataserver cache is failing --- which leads to some strange situation
[8:10] Zero Linden: where it isn't the sim itself causing the requests, but something in the dataserver .....
[8:10] Jarod Godel: (The latter being the same as "everything is indexed.")
[8:10] Khamon Fate: That was my next question.
[8:11] Khamon Fate: I'm still unclear, is there a dataserver process running for each sim/processor?
[8:11] Zero Linden: Jarod? on the index - it is because we index any column that we do regular queries on
[8:11] Khamon Fate: Speaking of which, we'd like to see the asset server schema if that's possible.
[8:12] Zero Linden: Khamon - each host has four simulators running (one per CPU), sharing three dataservers ("fast", "slow" and "inventory"), one backbone web service proxy, and one squid cache
[8:12] Zero Linden: The asset server isn't a database -
[8:12] Khamon Fate: Thank You
[8:12] Jarod Godel: Zero, ok. Thanks.
[8:12] Jarod Godel: and, I second Khamon's request.
[8:12] Zero Linden: it is an HTTP server cluster backed by a distributed fault-tolerant file system
[8:12] Khamon Fate: I'll rephrase that, we'd like to see the grid's database schema if that's possible.
[8:13] Zero Linden: Ah - the central DB schema!?!?!? HA - you really don't want to see it! Not unless you're on relaxants and anti-depressants....
[8:13] Khamon Fate: I have pills and yes I would like to see it.
[8:14] Zero Linden: About a year ago Tess Linden did this giant chart of it -- it fills a wall - a large wall
[8:14] Khamon Fate: Oh well I'll be in SF in Sept. Perhaps I can come by and see it there.
[8:14] Zero Linden: Also - there are many parts of it that are being moved into other dbs, moved into web services, or being excised completely
[8:14] Khamon Fate: Was there no recorded diagram before a year ago?
[8:15] Zero Linden: Fundamentally - it isn't very much more complex than you'd expect
[8:15] Zero Linden: So - the lessons from yesterday are that simulators have very very wide behavior
[8:16] Zero Linden: and that unit tests rock!
[8:16] Zero Linden: This morning - we'll be deciding what to deploy of those three -- probably all three....
[8:17] Khamon Fate: Will they solve the problem or accommodate it?
[8:17] Wyn Galbraith: Any one here read "When HARLIE Was One"
[8:17] Zero Linden: I'm not sure there is a "solution" - though if you mean root out the cause of the sims that are going hog wild?
[8:17] Zero Linden: well, no, we probably won't find that
[8:17] Zero Linden: at least not before we make the grid happy
[8:19] Jarod Godel: Zero, how much of your metric-taking is done in-world and out? Are there any tools (script finders, etc.) that work better as LSL scripts than looking at logs?
[8:19] Zero Linden: almost all the metrics are done through log analysis
[8:19] Wyn Galbraith: HARLIE was a computer that wanted to get high like its creator did and OD'd on input streams. Just a thought ;)
[8:19] Jarod Godel: Could LSL/SL developers have rooted out this issue by doing anything investigative in-world?
[8:19] Zero Linden: We have "smoke test" objects that we run that measure responsiveness for scripts in world
[8:20] Jarod Godel: Wyn, like a Maker who creates cyber-drugs?
[8:20] Zero Linden: dpn
[8:20] Zero Linden: I'm not thinking of any obvious ways.....
[8:20] Wyn Galbraith: No cyber drugs.
[8:20] Jarod Godel: ok
[8:20] Zero Linden: alas, there isn't much introspection one can do in scripts
[8:20] Zero Linden: What you could / should do is
[8:20] Jarod Godel: That's all abstracted away for security?
[8:20] Zero Linden: have an object that tests all the subsystems you rely on
[8:21] Zero Linden: if your scripts need http or xml-rpc -- or if they expect a certain behavior from chat or link messages
[8:21] Zero Linden: it would be best to have an object you can rez after a new release
[8:21] Zero Linden: and run its tests
[8:21] Zero Linden: THAT would be great
[8:21] Wyn Galbraith nods, "Good idea."
[8:21] Zero Linden: LSL acceptance tests
[8:22] Zero Linden: It would be great if there were a common framework for them - like jtest or other such test libs
[8:22] Zero Linden: I could imagine one rezzing prims in a test sim and them communicating with a controller
[8:23] Zero Linden: I could imagine the controller gathering up the results in a common format....
[8:23] Jarod Godel: test libs assumes #include ;)
[8:23] Zero Linden: you know - I use #include in my LSL scripts - I just run 'em all through the C++ preprocessor before pasting!
[8:24] Zero Linden: and actually - you can code in a style where you think of scripts as DLLs - not static libs, and use llMessageLinked as the calling convention
[8:24] Rex Cronon: but, there are tests that require human observation, as the lsl is kind of limited in some areas
[8:24] Zero Linden: My goban has several such dynamic libs in it
[8:24] Zero Linden: Rex - true - one way to do that is to automate the human
[8:24] Wyn Galbraith is just learning LSL though she is a tester type, "I'd love to do that."
[8:25] Zero Linden: llSay(0, "you will see a red sphere appear. It should be 5m and above you. It should appear within 4 seconds.")
[8:25] Rex Cronon: or give more abilities/functions to lsl
[8:25] Zero Linden: llSay(0, "did you see it?") -- after sleeping 4 seconds
[8:25] Zha Ewry has done the multi-script thing. It's good, up to a point. Global state.. is a PITA
[8:26] Khamon Fate: I tend to just multistate in single scripts as much as possible.
[8:26] Zero Linden: yes - it is - you can use tricks like setting params on child prims .... but - ick!
[8:26] Wyn Galbraith: I did a lot of test scripts like that at one job where I tested the graphic output. You had to watch because there was no way to record graphic displays without a camera. We did do screen shots.
[8:26] Zha Ewry would love some better tools for that state
[8:26] Khamon Fate: My trees have to use child prims; but llSetPrimitiveParams is totally hosed right now. The clients don't render the results properly.
[8:27] Zero Linden: another option I use for state is silo
[8:27] Zero Linden: just store it off world -
[8:27] Zero Linden: it is often faster than using llSay to communicate it around
[8:27] Khamon Fate: child scripts I mean
[8:28] Wyn Galbraith: llSetPrimitiveParams is hosed? Is that why I haven't been able to get it to work?
[8:28] Zero Linden tries to think of the company that did automated testing by actually parsing the desktop pixmap image.....
[8:28] Zha Ewry nods.. It is easy to have some of that off world... But.. fragile in annoying ways
[8:29] Zero Linden: right - it isn't perfect -- but that does bring up an interesting meta thought to me
[8:29] Khamon Fate: SVC-38 and related reports on JIRA describe some of the problems. The command works, the server calcs properly, but the clients render the changes badly.
[8:29] Wyn Galbraith thought it was her lack of LSL knowledge.
[8:29] Zero Linden: we think of programming system of LSL/SL like we think of other programming systems like python or C++ or Java
[8:29] Rex Cronon: zero, is there any way the sims can set all physical objects, to non-physical after a restart?
[8:30] Zero Linden: they are well defined, with consistent semantics, and sort of "these things must work!"
[8:30] Jarod Godel: When we should think in terms of Smalltalk and asynchronus objects.
[8:30] Zero Linden: like - you call a function -- it will execute and then return
[8:30] Zero Linden: On the other hand, there is REST - a whole design stance, and a programming model for a large, unreliable distributed system
[8:30] Zero Linden: make this data request... and it might not complete
[8:31] Khamon Fate: Yes those functions are alive and do whatever they please ha ha
[8:31] Rex Cronon: that is anarchy
[8:31] Zero Linden: Even Smalltalk is a very well defined, lock step system - (and smalltalk is no more async. than other common language)
[8:32] Khamon Fate: I suppose that's the reason y'all can't track rogue requests but simply have to accommodate them. At the risk of sounding snarky though, I have to wonder how long that can continue. Is such a method scalable?
[8:32] Jarod Godel: It's not. I always thought it kind of had to be asynch. Shows what I know.
[8:32] Zha Ewry checks the clock. Zero.. One quick question. What is being deployed today that needs a shutdown of the grid?
[8:32] Rex Cronon: what grid shutdown today:(
[8:33] Zero Linden: Zha - I'm not sure - all three changes can be done live, with only rolling restarts (though host by host, not region by region)
[8:33] Zero Linden: So I'm not sure why we're kicking all.
[8:33] Zha Ewry glances at Rex.. check the blog
[8:33] Zero Linden: The morning huddle on this is about to happen.... I'll know more then
[8:33] Zha Ewry: Ahh.
[8:33] Rex Cronon: didn't have time to read the blog
[8:34] Zha Ewry: There's a 1 hour window posited
[8:34] Zero Linden: it is likely that when we were all working on those fixes -- they decided to schedule the downtime in case the fixes turned out to require it
[8:34] Zha Ewry: I'm surprised that LL can shutdown, do anything and restart in 1 hour at this point
[8:34] Wyn Galbraith notes these hours go by so quickly, "I think my blue screen display stops might be fixed by the update. But the jury is still out."
[8:35] Zero Linden: well - I must head off... and see what we can get deployed today!
[8:35] Zero Linden: thanks again
[8:35] Khamon Fate: Sim servers reboot rather quickly. It's only when the other systems have to be ipled that more time is required.
[8:35] Wyn Galbraith: Thanks Zero. Have a nice day ;)
[8:35] Rex Cronon: bye
[8:35] Khamon Fate: Thanks for hosting Zero. You p4wnz the office hours.
[8:35] Zha Ewry: Thanks as always Zero.
[8:35] Rex Cronon: too bad u can't answer all the questions:(
[8:36] Zha Ewry: And good luck..
[8:36] Zha Ewry: This has been a rough one