User:Which Linden/Office Hours/2008 Apr 10

From Second Life Wiki

This conversation is a bit disjointed because I was crashing all the time, and didn't have logging to disk enabled in my viewer.

  • [11:31] Zai Lynch: cu
  • [11:31] Which Linden: Sheesh. Maybe the normal viewer will be more stable for me
  • [11:31] Zai Lynch: well i got nVidia as well morgaine.
  • [11:32] Mastorian Kingsford: I hate to say this, since it makes me look anti-WL, but I have noticed a lot of players crash on sunrises and sunsets. When the server does the trigger, I get a 3 sec pause on light changes
  • [11:32] Which Linden: Hm, interesting, that might be useful information to pass on to Pastrami if you can get a decent repro
  • [11:33] Morgaine Dinova: Is that in QA, or different studio for client?
  • [11:33] Mastorian Kingsford: A way to test it: have a few Lindens in a bust area and watch the light event. You should see it. It has to be a sim that's got a few on it
  • [11:33] Mastorian Kingsford: busy area
  • [11:34] Trinity Coulter: so, Which, you are saying because MySQL doesn't have a timeout essentially, that the systems that depend on it shouldn't be having timeouts
  • [11:34] Which Linden: Trinity: yes
  • [11:34] Trinity Coulter: is that the essence :)
  • [11:34] Trinity Coulter:  :)
  • [11:34] Which Linden: Cause if it's your own code, you can abort on command, but if it's some other process, you can't
  • [11:35] Trinity Coulter: i would guess that undoing something after a timeout is as troublesome?
  • [11:35] Which Linden: Yes, cause you can't be sure that the undo will work
  • [11:35] Trinity Coulter: since the MySQL might have already done it by that point.....?
  • [11:35] Which Linden: Yeah
  • [11:35] Which Linden: You just don't know the state, and if you gave someone some L$ they may have spent it already
  • [11:36] Which Linden: I would prefer to have some way for the viewer to get a list of ongoing transactions and see their status
  • [11:36] Morgaine Dinova: Doesn't this tackle a symptom though, rather than the problem? Being in the queue for 20s is not the normal operating regime.
  • [11:36] Which Linden: Yes, well, that's true as well
  • [11:36] Which Linden: But you have to design around the edge cases
  • [11:37] Which Linden: And in the edge case where the commit takes too long, currently our timeouts cause the simulator to think the transaction failed, when it actually did go through in the database
  • [11:37] Morgaine Dinova: The problem is, extending or removing the timeout buys you nothing. All that will happen is that your queue ends up hogging all of memory and then everything collapses.
  • [11:38] Morgaine Dinova: Because your queue will just grow longer and longer.
  • [11:38] Which Linden: Yeah, but at least you're more correctly reporting the status of the transaction
  • [11:38] Which Linden: And, actually it's not that different from the current situation
  • [11:38] Which Linden: The dataserver will still wait for the COMMIT to return
  • [11:39] Which Linden: It's just that the simulator gives up before the dataserver gets back and says "I'm done"
  • [11:39] Trinity Coulter: but maybe that is part of why you are thinking about the escrow situation, so that things can be rearranged a bit, to become more efficient where they can be?
  • [11:39] Which Linden: I'm actually not sure that the escrow will be any faster
  • [11:39] Morgaine Dinova: Yeah, misreporting is bad --- same issue in IM, where we get an error returned after timeout, saying that the post failed .... and then the damn thing appears anyway, often twice because you reposted after getting the failure message.
  • [11:40] Which Linden: In fact it will probably be slower, because there will be a lot more messaging involved. But it will be more consistent.
  • [11:40] Trinity Coulter: i would think that money and inventory would simply have to be nearly infallible in SL
  • [11:40] Morgaine Dinova: Agree with the fixing status. But the problem still remains.
  • [11:40] Trinity Coulter: just because it's so close to "real" money
  • [11:40] Which Linden: Agreed. This is why this project is taking so long -- we don't want to F it up
  • [11:41] Which Linden: We probably will in some way, let's be realistic, but hopefully not too egregiously
  • [11:41] Trinity Coulter: just hire a person who you can conveniently blame later
  • [11:41] Trinity Coulter:  :)
  • [11:42] Which Linden: Heh, Scapegoat Linden
  • [11:42] Morgaine Dinova: Instead of removing the timeout, can't you add a corresponding one to MySQL? At least your system wouldn't collapse then after the queue hits the ceiling.
  • [11:43] Mastorian Kingsford: lol that would be hard, an LL meltdown: have avs in line for LL workman's comp and aid haha
  • [11:43] Which Linden: I don't think you can add a corresponding timeout that covers all cases
  • [11:43] Which Linden: There's a lock wait timeout, but I doubt that L$ transactions take a long time due to locks
  • [11:44] Morgaine Dinova: Can you find out? Does MySQL have reasonable instrumentation?
  • [11:44] Which Linden: It might exceed the timeout just waiting for the disk flush() to complete.
  • [11:44] Which Linden: I don't know, but I wouldn't be comfortable unless I knew that the timeout was covering all cases
  • [11:45] Which Linden: If you can find out how to get MySQL to guarantee that a COMMIT will either succeed or fail within a certain timeout, then we're golden
  • [11:45] Which Linden: But I don't see any such ability
  • [11:45] Morgaine Dinova: Isn't it all one single case? "Transaction not completed in required time" --- BANG, roll back.
  • [11:46] Which Linden: Yeah, it might be easy for the MySQL guys to implement it
  • [11:46] Which Linden: (with the caveat that nothing is easy)
  • [11:46] Morgaine Dinova: Hehe
  • [11:46] Morgaine Dinova: Handwaving is very easy, which is why we love it ;-)
  • [11:47] Which Linden: I don't think there's any way that we can work around it, short of implementing our own rollbacks, which will have their own horrible reliability problems
  • [11:48] Qie Niangao: seems to me it would be better to throttle incoming L$ transactions earlier, before they're even enqueued, if they were likely to time out at the db.
  • [11:48] Which Linden: Yeah, we might end up doing something like that
  • [11:49] Which Linden: But I think that it would be a band-aid and our efforts would be better spent on distributing the L$ data
  • [11:49] Morgaine Dinova: I really don't see any future in it. You need to increase the transaction bandwidth to greater than transaction demand, nothing else will solve it.
  • [11:50] Which Linden: So, did I convince you about the unfeasibility of timeouts?
  • [11:50] Mastorian Kingsford:  :-)
  • [11:50] Trinity Coulter: only good for 2 year olds
  • [11:50] Qie Niangao: oh, i agree with the distributed approach, too... just that it's really hard to tune cascading timeout queues... so better not let them happen
  • [11:50] Morgaine Dinova: Not really Which
  • [11:51] Which Linden: OK, so what should we do to make the timeouts work reliably, Morgaine?
  • [11:52] Which Linden: I guess we could implement our own rollback mechanism, but that seems like you're just extending the two-generals problem one step further
  • [11:52] Morgaine Dinova: I'm not focussing on the timeouts, despite agreeing that it would be far better if the reporting were consistent with what's really happening. I'm more focussed on the underlying problem: your transaction processing is too slow.
  • [11:52] Which Linden: Oh, yeah.
  • [11:52] Which Linden: Well, we'll address that in two stages
  • [11:52] Which Linden: First stage: create a dedicated central host for L$ transactions
  • [11:53] Which Linden: This will improve the performance by a constant factor
  • [11:53] Bjorlyn Loon: Hi and ouch
  • [11:53] Which Linden: And will buy us time to deploy the escrow, which is infinitely scaleable
  • [11:53] Which Linden: Hey Bjorlyn
  • [11:53] Bjorlyn Loon: Trinity!
  • [11:53] Bjorlyn Loon: good to see you!
  • [11:53] Bjorlyn Loon: Hi Which.
  • [11:53] Trinity Coulter: OMG Bjorlyn :))
  • [11:53] Mastorian Kingsford: increasing the cores and hyperthreading would improve load balancing on transactions too
  • [11:54] Which Linden: I believe that our database performance is gated on disk I/O speed at the moment
  • [11:54] Mastorian Kingsford: not on single threads
  • [11:54] Bjorlyn Loon: I know I am terribly late.
  • [11:54] Bjorlyn Loon: may I sit?
  • [11:55] Which Linden: We're buying bigger and better disks, but you can only get a small constant factor from hardware upgrades
  • [11:55] Which Linden: Yeah, have a seat
  • [11:55] Which Linden: Make yourself at home. :-)
  • [11:56] Which Linden: I guess the database downtime earlier this week was to upgrade the hardware
  • [11:56] Bjorlyn Loon: listens.
  • [11:56] Trinity Coulter: omg 72 people in Brampton
  • [11:56] Which Linden: But I haven't heard any word on whether or not that improved performance
  • [11:56] Which Linden: Is there some event in Brampton?
  • [11:56] Trinity Coulter: i guess
  • [11:56] Mastorian Kingsford: It's always bad to swamp inventory and transactions onto the same DBs, since both will always be getting heavy calls from population use. They should be separate chains
  • [11:57] Morgaine Dinova: I'm not even sure that there's a problem that LL needs to find an original solution for (unless you want to "do a Google"). Zillions of people have created high-bandwidth databases, it's business as usual. Just give the spec to Sun or IBM or someone and it'll be done. It's boring RDBMS stuff --- not original VW trailblazing :-)
  • [11:57] Which Linden: Inventory goes to a separate database now, actually, Mastorian.
  • [11:58] Mastorian Kingsford: but aren't the assets still ported in through one point
  • [11:58] Which Linden: Agreed, Morgaine, but I think that there is no vendor that provides an infinitely-scaleable database
  • [11:58] Which Linden: It would, again, only buy us a constant factor
  • [11:59] Which Linden: Mastorian: no, the asset "server" is actually a cluster.
  • [11:59] Mastorian Kingsford: k
  • [11:59] Which Linden: The asset cluster has intercommunication between its nodes so sometimes the whole thing can get sad
  • [12:00] Which Linden: But generally we don't notice the failures of individual machines
  • [12:00] Morgaine Dinova: Fair enough there --- but what you're saying is, VWs need research-grade distributed databases, nothing else will scale. It's true ... but it's hard, and I don't see a project on the table for it. (A real one)
  • [12:01] Which Linden: We don't do that many transactions in our database, though, so it seems that a full-scale RDBMS is overkill for our needs
  • [12:02] Which Linden: AFAIK, it can be completely partitioned on a per-agent basis and we only need escrow-style transactions
  • [12:03] Which Linden: A fully-scaleable RDBMS would be hampered by the fact that it would have to be able to perform arbitrary transactions between all data within itself
  • [12:03] Which Linden: We are, essentially, optimizing for a specific case.
  • [12:03] Morgaine Dinova: Good. Google only gets away with their MapReduce approach because they found a small subset of the storage+processing requirement that it can handle. I guess this applies to LL too: find that small subset.
  • [12:04] Which Linden: Yeah, that's a great parallel to draw
  • [12:05] Which Linden: See this article by the hard-core RDBMS people about how horrible MapReduce is: [1]
  • [12:05] Which Linden:  :-)
  • [12:06] Morgaine Dinova: I guess one could reasonably ask, if you "don't do that many transactions in your database", how come it's jammed solid ;-))))) But that can be rhetorical :P
  • [12:06] Morgaine Dinova: Bad RDBMS coding?
  • [12:06] Which Linden: Heh. Well, it's jammed solid with non-transactional writes
  • [12:07] Which Linden: E.g. every time you log out, it's a write to the disk
  • [12:07] Which Linden: (that's being distributed so it doesn't write to the central db right now, btw)
  • [12:08] Which Linden: Actual transactions don't take up a lot of the resources on the central db, it's just write, write, write, seek, seek, seek
  • [12:08] Morgaine Dinova: Much worse than that, each time a person sends an IM, it's a read from the database for the group exploder, followed by a whole pile of reads to get user preferences. A total disaster, no transactions yet jamming your DB solid.
  • [12:09] Which Linden: Group IMs don't hit the database anymore, I believe, or at least they're not a major source of load on it.
  • [12:09] Which Linden: But yeah, we need to distribute user preferences to the inventory databases
  • [12:09] Which Linden: No doubt about it
  • [12:09] Morgaine Dinova: There's been no change in IM behaviour in last 6-9 months.
  • [12:09] Which Linden: There has been
  • [12:10] Morgaine Dinova: OK, I missed out the term "user-visible" there :-)
  • [12:10] Which Linden: Jonathan has been trying to fix it. But since it's not a major source of "OMG the database is about to die", it's not as much of a priority.
  • [12:10] Which Linden: Group chat did improve by a constant factor. But the whole system needs to be redesigned, IMO
  • [12:11] Which Linden: It won't be simple, so that's why we have to focus on other things
  • [12:11] Which Linden: Higher nails
  • [12:11] Which Linden: I am as sad about this as you are, believe me
  • [12:11] Morgaine Dinova: AW Groupies IM behaves exactly the same now as at the time that AWG was formed.
  • [12:12] Which Linden: Hm, it should be somewhat faster.
  • [12:12] Mastorian Kingsford: I ran chat as a separate client using JavaScript in a window assigned to space in my client. It floated, and since it was separate from the game it never lagged me. It also worked for video webcams in my game world, etc.
  • [12:13] Which Linden: Did you connect to a separate chat server for that, Mastorian?
  • [12:13] Which Linden: Agh, it's lunchtime here in the Lab, I should depart
  • [12:13] Mastorian Kingsford: I ran it as a separate server. It was a smaller world; I didn't have 60,000 on. Hell, the bandwidth would have killed me haha
  • [12:14] Which Linden: Ha ha, gotcha
  • [12:14] Morgaine Dinova: The big problem (for everyone) is that despite the openness with the community, there is no manner in which we can help.
  • [12:14] Which Linden:  :-(
  • [12:14] Which Linden: I agree, that is sad
  • [12:14] Morgaine Dinova: KK Which, have a good lunch :-)
  • [12:14] Which Linden: Thanks for an engaging conversation, I look forward to next week
  • [12:14] Mastorian Kingsford: Been running homebrew games for a long time, always on the edge of what's new. A lot of stuff was in mine before store-shelf games
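
Editorial note: the timeout problem discussed between 11:34 and 11:39 (the simulator gives up and reports a failure while the dataserver's COMMIT still goes through) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in invented for illustration, not Linden Lab code: the `Dataserver` class, `simulator_transfer`, `run_demo`, the account names, and the delays. It also sketches the idea raised at 11:36 of looking up a transaction's status by ID rather than trusting the timeout.

```python
import threading
import time
import uuid


class Dataserver:
    """Toy stand-in (assumption) for a dataserver whose commits are slow."""

    def __init__(self, commit_delay):
        self.commit_delay = commit_delay
        self.status = {}  # transaction id -> "pending" or "committed"
        self.balances = {"alice": 100, "bob": 0}
        self._lock = threading.Lock()

    def submit(self, txn_id, src, dst, amount):
        # Record the transaction, then commit it on a background thread.
        with self._lock:
            self.status[txn_id] = "pending"
        worker = threading.Thread(
            target=self._commit, args=(txn_id, src, dst, amount)
        )
        worker.start()
        return worker

    def _commit(self, txn_id, src, dst, amount):
        time.sleep(self.commit_delay)  # simulates a slow disk flush
        with self._lock:
            self.balances[src] -= amount
            self.balances[dst] += amount
            self.status[txn_id] = "committed"

    def get_status(self, txn_id):
        with self._lock:
            return self.status.get(txn_id, "unknown")


def simulator_transfer(server, src, dst, amount, timeout):
    """Hypothetical simulator side: waits at most `timeout` seconds."""
    txn_id = str(uuid.uuid4())
    worker = server.submit(txn_id, src, dst, amount)
    worker.join(timeout)
    # If the commit has not returned yet, the simulator reports a failure,
    # even though the dataserver will still finish the commit.
    reported = "failed (timeout)" if worker.is_alive() else "ok"
    return txn_id, worker, reported


def run_demo():
    server = Dataserver(commit_delay=0.3)
    txn_id, worker, reported = simulator_transfer(
        server, "alice", "bob", 10, timeout=0.05
    )
    status_at_timeout = server.get_status(txn_id)
    worker.join()  # let the slow commit finish
    final_status = server.get_status(txn_id)
    return reported, status_at_timeout, final_status, server.balances["bob"]


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the simulator's report and the database's final state disagree: the transfer is reported as failed, yet the status lookup later shows it committed and the L$ have moved. Querying `get_status` with the transaction ID is one way a caller could learn the real outcome instead of retrying blindly.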