User:Zero Linden/Office Hours/2007 Apr 26

From Second Life Wiki
Revision as of 12:59, 1 May 2007 by Zero Linden (talk | contribs)

Transcript of Zero Linden's office hours:

[7:43] Rex Cronon: if the text has a color closer to the background is realllllllllyyyyyyy fuzzzzyyyyy
[7:43] Khamon Fate: Welcome back Zero.
[7:43] Zha Ewry: wb Zero
[7:43] Zero Linden: Fie!
[7:44] Zero Linden: Dang Lindens! Can't write anything?!?!?!
[7:44] Jarod Godel: lol@irony
[7:44] Wyn Galbraith: That's why it's nice to have Lindens in-world.
[7:44] Wyn Galbraith thinks her boyfriend has finally accepted SL. She woke up with the laptop already on the bed beside her.
[7:44] Zero Linden: Yeah - Lindens in world - If I can actually manage to stay logged in!
[7:45] Zero Linden: Ah well
[7:45] Zero Linden: Welcome all
[7:45] Wyn Galbraith: SL doesn't play favorites.
[7:45] Zha Ewry would like to take the graphics group and shake them. As far as she can tell, the new drivers hate ATI chips worse than the old ones
[7:45] Zero Linden: I thought I'd use this morning to give you a behind-the-scenes look at one of our current miseries
[7:45] Wyn Galbraith: Cool.
[7:45] Zha Ewry: Go for it. :-)
[7:46] Zero Linden: So - here's the basics of one of the problems
[7:46] Zero Linden: Turns out that converting agent id's to names is one of the higher queries on the db
[7:47] Zero Linden: which is ironic - as it is one of the least changing parts of the dataset
[7:47] Rex Cronon: lots of scanners/radars do that too
[7:47] Wyn Galbraith: So why bother until the actual name is needed?
[7:47] Zero Linden: we don't Wyn
[7:47] Wyn Galbraith: Oh.
[7:47] Zero Linden: but it turns out that it is needed a lot
[7:48] Zero Linden: Digression: Part of the problem of making the simulator responsible for agents is that they have to do all the work that is related to communicating with humans
[7:48] Zero Linden: like, for example, formatting messages saying "so and so" is off line, etc....
[7:48] Khamon Fate: Who you callin human?
[7:48] Zero Linden: which means that they have to know the name mapping for many more agents than they would otherwise care about
[7:49] Khamon Fate: We need names every time we examine something, look at group listing, friends lists, IMs
[7:49] Zero Linden: Well, even you advanced AI systems seem to prefer names to ids....
[7:49] Zero Linden: but I digress
[7:49] Zero Linden: SO
[7:49] Khamon Fate: Yes scripts demand that resolution very often as well.
[7:49] Zero Linden: Someone had the idea to use a partitioned REST server as a store for that data
[7:50] Khamon Fate: We even log into the Second Life websites with our names rather than our keys.
[7:50] Zero Linden: so they dutifully made the dataservers query via HTTP for this info.
[7:50] Zero Linden: which then should be cacheable etc.....
[7:50] Khamon Fate: I think that was Jarod's idea originally.
[7:50] Zero Linden: Only at the same time, someone else generalized the web service architecture
[7:51] Zero Linden: and it turns out, that set of code DIDN'T make those HTTP requests via the local squid cache
[7:51] Zero Linden: so --- no caching
[7:51] Zero Linden: which still should have been better than hitting the central db
[7:51] Zero Linden: BUT
[7:51] Zero Linden: the load on the central backbone servers (our partitioned REST system)
[7:52] Zero Linden: went from about 340 requests/second /machine
[7:52] Zero Linden: to pegging at about 760 req./sec./server
[7:52] Zero Linden: And quick log analysis revealed that it was all this name lookup
[7:52] Zero Linden: Alas - that is the same set of servers that has agent presence....
[7:53] Zero Linden: which is why friends and login are both bork'd now
[7:53] Zero Linden: SO - three fixes in the works last evening:
[7:53] Rex Cronon: that is why it took me yesterday 1 hour to log in?
[7:53] Zero Linden: 1) increase the partitioning from 4 to 8 servers
[7:54] Zero Linden: 2) merge in the code to support query via local squid cache into the web services framework (had been done in another branch )
[7:54] Wyn Galbraith: That's why groups didn't show members too?
[7:54] Zero Linden: 3) Figure out why the in-process name cache in the sim and the dataserver wasn't working at all
[7:54] Zero Linden: yes and yes
[7:54] Wyn Galbraith: Wow, interesting.
[7:54] Zero Linden: At the moment - none of these has been deployed
[7:55] Zero Linden: so - amazing the system still works even under such load
[7:55] Zero Linden: So - here's the thing
[7:55] Khamon Fate: Do the squids only cache what the sim requests from a dataserver?
[7:55] Zero Linden: why didn't we see this in testing? how come we couldn't anticipate this? Or notice that expected caching was all missing?
[7:56] Zero Linden: Khamon - no - many things on a simhost go via that squid. In particular assets
[7:56] Khamon Fate: Because you didn't have enough users to register the exceptions.
[7:56] Rex Cronon: u didn't have enough users to stress the system
[7:56] Zero Linden: Rex - no
[7:56] Zero Linden: oddly enough
[7:56] Zero Linden: so - here are the load stats for this name lookup
[7:57] Zero Linden: yesterday afternoon we snagged a million lines from one of the four backbone logs
[7:57] Zero Linden: it covers 4:06pm to 4:26pm
[7:58] Rex Cronon: hope u won't paste it in chat
[7:58] Zero Linden: so in 20min we did 1M requests on one host -- and there are four hosts
[7:58] Wyn Galbraith: Fun stuff reading logs.
[7:58] Zero Linden: of those 578k were name lookups
[7:58] Zero Linden: of those - only 50k unique agent ids were looked up
[7:59] Zero Linden: So - on average each agent id is looked up 10x
[7:59] Zero Linden: but with 2100 simulator hosts hitting it
[7:59] Zero Linden: it is easy to imagine that
[8:00] Zero Linden: If it were evenly distributed - then
[8:00] Zha Ewry blinks. Are you simply flushing the cache as fast as you load it?
[8:00] Zero Linden: we'd have one set of problems
[8:00] Zero Linden: BUT
[8:00] Zero Linden: it isn't
[8:00] Rex Cronon: a solution might be to store obj/av name alongside the key<-------
[8:01] Khamon Fate: Plus, if a squid has agent data cached, it can serve that to scripts and other agents in the sim right?
[8:01] Zero Linden: Well - look, if it is evenly distributed - and those 577k requests came randomly from the 2100 hosts
[8:01] Zero Linden: then there is little likelihood, even with only 50k unique agent ids, that any given host would be requesting the same id twice
[8:01] Zero Linden: hence - the cache on that host wouldn't help at all
[8:01] Rex Cronon: if u do that then there should be only 1 request
[8:02] Rex Cronon: per sim for each av/obj
[8:02] Zero Linden: so each host looks up 280 random agent ids from the pool of 50k in 20min --- the likelihood that it picks the same agent id twice is very low
[8:03] Zero Linden: hence the cache on the local host is useless - and so all the queries go to the central service
[8:03] Zero Linden: so 280 x 2100 = 577k hit the central service
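[Editor's note] The even-distribution estimate can be checked with a short calculation. This is a minimal sketch, assuming each host draws its 280 lookups uniformly and independently from the 50k ids; the function name and the uniform-draw model are illustrative assumptions, not anything taken from Linden Lab's analysis.

```python
def expected_repeat_lookups(lookups_per_host: int, unique_ids: int) -> float:
    """Expected number of lookups on one host that repeat an id it already asked for.

    Assumes every lookup picks an id uniformly at random (an idealised model).
    """
    # Expected count of distinct ids seen after n uniform draws from a pool of N.
    expected_distinct = unique_ids * (
        1.0 - (1.0 - 1.0 / unique_ids) ** lookups_per_host
    )
    return lookups_per_host - expected_distinct


LOOKUPS_PER_HOST = 280  # figure quoted in the transcript
UNIQUE_IDS = 50_000     # figure quoted in the transcript

repeats = expected_repeat_lookups(LOOKUPS_PER_HOST, UNIQUE_IDS)
hit_rate = repeats / LOOKUPS_PER_HOST  # best case for a per-host cache

print(f"expected repeated lookups per host: {repeats:.2f}")
print(f"best-case local cache hit rate: {hit_rate:.2%}")
```

Under this model a host repeats fewer than one lookup in the 20 minute window, so a per-host cache would save well under 1% of the requests.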
[8:03] Zero Linden: BUT
[8:03] Zero Linden: that isn't what is happening at all....
[8:03] Wyn Galbraith waits for it.
[8:04] Zero Linden: one sim did 4000 requests, but 3400 of them were for the same agent id
[8:04] Zero Linden: there are about 20 or so of these rogue sims
[8:04] Zero Linden: well - really only about 5 really bad ones
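[Editor's note] The kind of log analysis described here, finding hosts whose lookups are dominated by a single agent id, might look roughly like the sketch below. The record shape (a pair of host name and agent id), the thresholds, and the sample data are all hypothetical; the real backbone log format is not shown in the transcript.

```python
from collections import Counter, defaultdict


def find_rogue_hosts(records, min_requests=1000, min_share=0.5):
    """Return hosts whose name lookups are dominated by one agent id.

    records: iterable of (host, agent_id) pairs (hypothetical parsed log lines).
    Each result is (host, total_requests, top_agent_id, top_agent_requests).
    """
    per_host = defaultdict(Counter)
    for host, agent_id in records:
        per_host[host][agent_id] += 1

    rogues = []
    for host, counts in per_host.items():
        total = sum(counts.values())
        top_agent, top_count = counts.most_common(1)[0]
        # Flag only busy hosts where a single id accounts for most of the traffic.
        if total >= min_requests and top_count / total >= min_share:
            rogues.append((host, total, top_agent, top_count))
    # Busiest offenders first.
    return sorted(rogues, key=lambda r: r[1], reverse=True)


# Synthetic data: one host hammering a single id, one well-behaved host.
sample = (
    [("sim-a", "agent-1")] * 3400
    + [("sim-a", f"agent-{i}") for i in range(2, 602)]
    + [("sim-b", f"agent-{i}") for i in range(150)]
)
rogues = find_rogue_hosts(sample)
print(rogues)
```

Only the host with 3400 requests for one id is flagged; the host making 150 distinct lookups is ignored.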
[8:05] Khamon Fate: Good analysis skills
[8:05] Rex Cronon: look what i said above zero, could that be a solution
[8:05] Zha Ewry: /me blinks. And the 5 bad ones are still online?
[8:05] Wyn Galbraith: Rogue sims, that's scary.
[8:05] Khamon Fate: Give us names so we can harass their owners unfairly.
[8:05] Khamon Fate: I mean mercilessly.
[8:06] Wyn Galbraith: Do we know it's the sim owners fault?
[8:06] Khamon Fate: Do y'all know what facilitates the rogueness? Is it some scripting command perhaps?
[8:06] Zero Linden: the vast majority of them did 150
[8:06] Rex Cronon: were those requests coming from scripts?
[8:06] Zha Ewry frowns. "Or just throttle them. Choke down their requests."
[8:06] Jarod Godel: /multiply 20 60
[8:06] Bash: Multiplying.
[8:06] Bash: 1200.000000
[8:06] Jarod Godel: /divide 1000000 1200
[8:06] Bash: Dividing.
[8:06] Bash: 833.333313
[8:06] Khamon Fate: No Wyn, I'm kidding about the owners.
[8:06] Jarod Godel: That's 833 requests a minute.
[8:06] Jarod Godel: Not too shabby.
[8:06] Zero Linden: At present, we don't know why those regions went berserk
[8:06] Wyn Galbraith smiles at Khamon.
[8:07] Zero Linden: But the key is that on a test grid - if we pulled 50 random region states to be on that grid
[8:07] Zero Linden: likelihood is we'd never see this
[8:07] Khamon Fate: I do agree that the sims should be taken offline. I'd not complain if Slate or Taber had to be disconnected for such rogueness, assuming the problem was being worked on.
[8:07] Rex Cronon: can't u trace what/who generated that many requests for name?
[8:08] Jarod Godel: Zero, I don't know too much about MySQL indexing to be honest, but would indexing names-to-keys speed things up?
[8:08] Zero Linden: So, if the bad actions of .5% of the grid can cause problems -- then we'd need test grids of well over 400 regions to get statistically good chance of hitting all the cases
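[Editor's note] The test-grid sizing argument can be made concrete with a small calculation. This is a rough sketch, assuming 0.5% of regions misbehave (the figure quoted above) and that regions are picked independently at random; the function name is illustrative.

```python
def chance_of_sampling_a_rogue(sample_size: int, rogue_fraction: float) -> float:
    """Probability that a random sample of regions contains at least one rogue region.

    Assumes each sampled region is rogue independently with probability rogue_fraction.
    """
    return 1.0 - (1.0 - rogue_fraction) ** sample_size


ROGUE_FRACTION = 0.005  # ".5% of the grid", as quoted in the transcript

for regions in (50, 400, 1000):
    chance = chance_of_sampling_a_rogue(regions, ROGUE_FRACTION)
    print(f"{regions:>5} regions: {chance:.1%} chance of including a rogue region")
```

Under these assumptions a 50-region test grid catches a rogue region less than a quarter of the time, while a 400-region grid does so most of the time, which is consistent with the "well over 400 regions" estimate.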
[8:09] Zero Linden: Jarod - you can bet your hammer it is indexed
[8:09] Khamon Fate: Question is, can ANY sim end up going rogue or is this a specialized problem caused by some type of script or agent configuration or such?
[8:09] Zero Linden: of course, in this case, it isn't even hitting the db - this is all just requesting data out of an in-memory hash table
[8:09] Zero Linden: Khamon - we don't know
[8:09] Khamon Fate claps for in memory hash tables
[8:10] Jarod Godel: Zero, ok.
[8:10] Khamon Fate: Hi Dnate
[8:10] Zero Linden: My best guess is that the in-simulator name cache is working -- since that has been in for a long time....
[8:10] Khamon Fate: We're fixing the grid.
[8:10] Jarod Godel: Zero, is that because I use it often or because it's a build?
[8:10] Zero Linden: and that the new in dataserver cache is failing --- which leads to some strange situation
[8:10] Zero Linden: where it isn't the sim itself causing the requests, but something in the dataserver .....
[8:10] Jarod Godel: (The latter being the same as "everything is indexed.")
[8:10] Khamon Fate: That was my next question.
[8:11] Khamon Fate: I'm still unclear, is there a dataserver process running for each sim/processor?
[8:11] Zero Linden: Jarod? on the index - it is because we index any column that we do regular queries on
[8:11] Khamon Fate: Speaking of which, we'd like to see the asset server schema if that's possible.
[8:12] Zero Linden: Khamon - each host has four simulators running (one per CPU), sharing three dataservers ("fast", "slow" and "inventory"), one backbone web service proxy, and one squid cache
[8:12] Zero Linden: The asset server isn't a database -
[8:12] Khamon Fate: Thank You
[8:12] Jarod Godel: Zero, ok. Thanks.
[8:12] Jarod Godel: and, I second Khamon's request.
[8:12] Zero Linden: it is a HTTP server cluster backed by a distributed fault-tolerant file system
[8:12] Khamon Fate: I'll rephrase that, we'd like to see the grid's database schema if that's possible.
[8:13] Zero Linden: Ah - the central DB schema!?!?!? HA - you really don't want to see it! Not unless you're on relaxants and anti-depressants....
[8:13] Khamon Fate: I have pills and yes I would like to see it.
[8:14] Zero Linden: About a year ago Tess Linden did this giant chart of it -- it fills a wall - a large wall
[8:14] Khamon Fate: Oh well I'll be in SF in Sept. Perhaps I can come by and see it there.
[8:14] Zero Linden: Also - there are many parts of it that are being moved into other dbs, moved into web services, or being excised completely
[8:14] Khamon Fate: Was there no recorded diagram before a year ago?
[8:15] Zero Linden: Fundamentally - it isn't very much more complex than you'd expect
[8:15] Zero Linden: So - the lessons from yesterday are that simulators have very very wide behavior
[8:16] Zero Linden: and that unit tests rock!
[8:16] Zero Linden: This morning - we'll be deciding what to deploy of those three -- probably all three....
[8:17] Khamon Fate: Will they solve the problem or accommodate it?
[8:17] Wyn Galbraith: Any one here read "When HARLIE Was One"
[8:17] Zero Linden: I'm not sure there is a "solution" - though if you mean root out the cause of the sims that are going hog wild?
[8:17] Zero Linden: well, no, we probably won't find that
[8:17] Zero Linden: at least not before we make the grid happy
[8:19] Jarod Godel: Zero, how much of your metric-taking is done in-world and out? Are there any tools (script finders, etc.) that work better as LSL scripts than looking at logs?
[8:19] Zero Linden: almost all the metrics are done through log analysis
[8:19] Wyn Galbraith: HARLIE was a computer that wanted to get high like its creator did and OD'd on input streams. Just a thought ;)
[8:19] Jarod Godel: Could LSL/sl developers have rooted out this issue by doing anything investigative in-world?
[8:19] Zero Linden: We have "smoke test" objects that we run that measure responsiveness for scripts in world
[8:20] Jarod Godel: Wyn, like a Maker who creates cyber-drugs?
[8:20] Zero Linden: dpn
[8:20] Zero Linden: I'm not thinking of any obvious ways.....
[8:20] Wyn Galbraith: No cyber drugs.
[8:20] Jarod Godel: ok
[8:20] Zero Linden: alas, there isn't much introspection one can do in scripts
[8:20] Zero Linden: What you could / should do is
[8:20] Jarod Godel: That's all abstracted away for security?
[8:20] Zero Linden: have an object that tests all the subsystems you rely on
[8:21] Zero Linden: if your scripts need http or xml-rpc -- or if they expect a certain behavior from chat or link messages
[8:21] Zero Linden: it would be best to have an object you can rez after a new release
[8:21] Zero Linden: and run its tests
[8:21] Zero Linden: THAT would be great
[8:21] Wyn Galbraith nods, "Good idea."
[8:21] Zero Linden: LSL acceptance tests
[8:22] Zero Linden: It would be great if there were a common framework for them - like jtest or other such test libs
[8:22] Zero Linden: I could imagine one rezzing prims in a test sim and them communicating with a controller
[8:23] Zero Linden: I could imagine the controller gathering up the results in a common format....
[8:23] Jarod Godel: test libs assumes #include ;)
[8:23] Zero Linden: you know - I use #include in my LSL scripts - I just run 'em all through the C++ preprocessor before pasting!
[8:24] Zero Linden: and actually - you can code in a style where you think of scripts as DLLs - not static libs, and use llMessageLinked as the calling convention
[8:24] Rex Cronon: but, there are tests that require human observation, as the lsl is kind of limited in some areas
[8:24] Zero Linden: My goban has several such dynamic libs in it
[8:24] Zero Linden: Rex - true - one way to do that is to automate the human
[8:24] Wyn Galbraith is just learning LSL though she is a tester type, "I'd love to do that."
[8:25] Zero Linden: llSay(0, "you will see a red sphere appear. It should be 5m and above you. It should appear within 4 seconds.")
[8:25] Rex Cronon: or give more abilities/functions to lsl
[8:25] Zero Linden: llSay(0, "did you see it?") -- after sleeping 4 seconds
[8:25] Zha Ewry has done the multi-script thing. It's good, up to a point. Global state.. is a PITA
[8:26] Khamon Fate: I tend to just multistate in single scripts as much as possible.
[8:26] Zero Linden: yes - it is - you can use tricks like setting params on child prims .... but - ick!
[8:26] Wyn Galbraith: I did a lot of test scripts like that at one job where I tested the graphic output. You had to watch because there was no way to record graphic displays without a camera. We did do screen shots.
[8:26] Zha Ewry would love some better tools for that state
[8:26] Khamon Fate: My trees have to use child prims; but llSetPrimitiveParams is totally hosed right now. The clients don't render the results properly.
[8:27] Zero Linden: another option I use for state is silo
[8:27] Zero Linden: just store it off world -
[8:27] Zero Linden: it is often faster than using llSay to communicate it around
[8:27] Khamon Fate: child scripts I mean
[8:28] Wyn Galbraith: llSetPrimitiveParams is hosed? Is that why I haven't been able to get it to work?
[8:28] Zero Linden tries to think of the company that did automated testing by actually parsing the desktop pixmap image.....
[8:28] Zha Ewry nods.. It is easy to have some of that off world... But.. fragile in annoying ways
[8:29] Zero Linden: right - it isn't perfect -- but that does bring up an interesting meta thought to me
[8:29] Khamon Fate: SVC-38 and related reports on JIRA describe some of the problems. The command works, the server calcs properly, but the clients render the changes badly.
[8:29] Wyn Galbraith thought it was her lack of LSL knowledge.
[8:29] Zero Linden: we think of programming system of LSL/SL like we think of other programming systems like python or C++ or Java
[8:29] Rex Cronon: zero, is there any way the sims can set all physical objects, to non-physical after a restart?
[8:30] Zero Linden: they are well defined, with consistent semantics, and sort of "these things must work!"
[8:30] Jarod Godel: When we should think in terms of Smalltalk and asynchronus objects.
[8:30] Zero Linden: like - you call a function -- it will execute and then return
[8:30] Zero Linden: On the other hand, there is REST - a whole design stance, and a programming model for a large, unreliable distributed system
[8:30] Zero Linden: make this data request... and it might not complete
[8:31] Khamon Fate: Yes those functions are alive and do whatever they please ha ha
[8:31] Rex Cronon: that is anarchy
[8:31] Zero Linden: Even Smalltalk is a very well defined, lock step system - (and smalltalk is no more async. than other common languages)
[8:32] Khamon Fate: I suppose that's the reason y'all can't track rogue requests but simply have to accommodate them. At the risk of sounding snarky though, I have to wonder how long that can continue. Is such a method scalable?
[8:32] Jarod Godel: It's not. I always thought it kind of had to be asynch. Shows what I know.
[8:32] Zha Ewry checks the clock. Zero.. One quick question. What is being deployed today that needs a shutdown of the grid?
[8:32] Rex Cronon: what grid shutdown today:(
[8:33] Zero Linden: Zha - I'm not sure - all three changes can be done live, with only rolling restarts (though host by host, not region by region)
[8:33] Zero Linden: So I'm not sure why we're kicking all.
[8:33] Zha Ewry glances at Rex.. check the blog
[8:33] Zero Linden: The morning huddle on this is about to happen.... I'll know more then
[8:33] Zha Ewry: Ahh.
[8:33] Rex Cronon: didn't have time to read the blog
[8:34] Zha Ewry: There's a 1 hour window posited
[8:34] Zero Linden: it is likely that when we were all working on those fixes -- they decided to schedule the downtime in case the fixes turned out to require it
[8:34] Zha Ewry: I'm surprised that LL can shutdown, do anything and restart in 1 hour at this point
[8:34] Wyn Galbraith notes these hours go by so quickly, "I think my blue screen display stops might be fixed by the update. But the jury is still out."
[8:35] Zero Linden: well - I must head off... and see what we can get deployed today!
[8:35] Zero Linden: thanks again
[8:35] Khamon Fate: Sim servers reboot rather quickly. It's only when the other systems have to be IPLed that more time is required.
[8:35] Wyn Galbraith: Thanks Zero. Have a nice day ;)
[8:35] Rex Cronon: bye
[8:35] Khamon Fate: Thanks for hosting Zero. You p4wnz the office hours.
[8:35] Zha Ewry: Thanks as always Zero.
[8:35] Rex Cronon: too bad u can't answer all the questions:(
[8:36] Zha Ewry: And good luck..
[8:36] Zha Ewry: This has been a rough one