Talk:AWG Scalability through reverse proxies: the paravirtual grid

From Second Life Wiki
Revision as of 03:41, 24 October 2007 by Jesrad Seraph (talk | contribs)

Analysing scalability of the paravirtual grid

Call me confused, but I don't see much scalability in the paravirtualization of the Grid. The reverse proxies are described as quasi-static caches, but simulation data, as it exists in SL today, is not static enough for that to be practical or useful. As I understand it, this proposal is a form of sharding of the regions. It does not solve the high-concurrency problem, for two reasons:

  • The subdivision method given as an example (US, Europe, Asia) is useless for current use cases: populations from different continents already log into SL and interact with each other at roughly the same hours of the day. Thousands of people connecting from the same geographical origin still face asset lag and simulator cutoff. So the proposal does not really address the concurrency problem, or at best addresses only a marginal aspect of it.
  • Even if we set aside the above problem by redirecting excess same-origin load to other reverse proxies, the data streams are still mixed as high in the chain as the origin. That is, the load arising from the interaction and intervisibility of the crowd still has to be processed by the source, and we are just moving the bottleneck from the simulator to this interconnection mechanism.

Even if there is a way to separate the data streams at the source and keep them separate all the way down (which implies splitting people into separate grids where they cannot see each other), this proposal does not offer any mechanism for aggregating those streams at the "top" of the architecture: no method is proposed for concurrent modification of the central databases, which therefore still face the same scalability issues known today. Besides, bottlenecks are still present inside the proposed architecture: they appear in the flow of data between the source, be it a reverse proxy or the original simulator, and the clients. This is bound to happen with any form of static subdivision (geographic or not), and dynamic subdivisions can only mitigate the effect. The distribution of the load in SL is not uniform in any globally determinable way, so there is no correct way to distribute that load from the top down; it must be done from the clients, bottom-up.

In other words, I think approaching concurrency issues in SL through a macroscopic view, and thinking of the population load in global aggregates, is not the correct method. The bottom line is that the elementary design of Grid processes determines everything else, including the global result. I'm afraid LL might be making the same mistake in their own proposal: in many common use cases, it does not, in fact, scale the way they hope it will. Simply throwing more machines at the growing Grid load does not relieve it. The fundamental origins of the scalability constraints, which lie within the design of the Grid processes and were identified months ago, have to be addressed first.

I hope this can help improve the proposal. Reverse proxies definitely have their place in a well-designed Grid architecture; working out exactly where is an important task.

This analysis feels about right. I've been chewing on some of this in these two pages:

AWG: state melding exploration
AWG: Core simulator exploration

These try to look, at two levels of abstraction, at what the core work of a region simulator is. (Rough draft warning.)

I appreciate the analyses and intent to address the concerns.

This is a plan to evolve the grid over time rather than rely on one big switch. I've only documented up to the point where the reverse proxies evolve into a paravirtual grid. Before that point, it is all about the reverse proxies themselves and not about any change to simulators or to the states of simulator frames. The analyses so far have shown a lot of concern about the simulator frame states. Those are concerns I have yet to address at the paravirtual stage, however.

The more immediate deployment of the reverse proxies provides the means to decentralize content that does not need to be constantly accessed at the current central location. That means simulators deal only with simulator state and do not need to deal with requests for intellectual property content or other assets that are cacheable in a decentralized manner. With the previous model, the central location handles every request itself. The reverse proxies take a large portion of that content and make it available closer to the viewers.

The pipes to the central location hold only so much bandwidth, and what flows through them is mainly asset content and sim state. Reverse proxies located outside the central location allow those pipes to be filled with sim state and much less asset content.
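To make the asset-offloading idea concrete, here is a minimal sketch of a read-through asset cache sitting at a reverse proxy. All names (`AssetCachingProxy`, `central_asset_store`, the asset ids) are hypothetical and only illustrate the principle; nothing here reflects actual Second Life code or protocols.

```python
class AssetCachingProxy:
    """Hypothetical read-through cache placed near the viewers."""

    def __init__(self, fetch_from_central):
        self._fetch = fetch_from_central  # callable that hits the central location
        self._cache = {}                  # asset_id -> asset bytes
        self.central_fetches = 0          # how often the central pipe was used

    def get_asset(self, asset_id):
        # Only a cache miss travels over the pipe to the central location.
        if asset_id not in self._cache:
            self._cache[asset_id] = self._fetch(asset_id)
            self.central_fetches += 1
        return self._cache[asset_id]


def central_asset_store(asset_id):
    # Stand-in for the central asset servers (assumption for illustration).
    return f"asset-bytes-for-{asset_id}".encode()


proxy = AssetCachingProxy(central_asset_store)
requests = ["texture-1", "texture-2", "texture-1", "texture-1", "sound-9"]
served = [proxy.get_asset(a) for a in requests]

# Five viewer requests, but only three distinct assets crossed the central pipe.
print(len(served), proxy.central_fetches)
```

The sketch ignores invalidation and eviction entirely, which is only reasonable for content that is effectively immutable once uploaded.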

As both of you seem to agree, this is only one measure. It is not a complete solution or redesign.

Thank you for the concerns, and I'll address them in the next description of the paravirtual grid design, where this evolves after the point "through reverse proxies".

OK, thanks for the clarification. As I see it, the reverse proxies are meant as a way to unicast the simulators, so that they essentially serve a few "wide" clients (the proxies) instead of many "narrow" clients. I understand this aims at addressing high local concurrency (event scaling), and helps only marginally with the global grid concurrency load.

Basically you're distributing the simulation task and the broadcasting task onto different machines. This means the layer of reverse proxies between sim and clients acts as a software router, at the price of additional internal bandwidth (which should be cheap): the information that must reach the clients is no longer duplicated once per client at the sim level, but is duplicated at the reverse-proxy level instead.
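A small counting sketch of that trade-off, assuming every client needs one copy of each update and each proxy receives a single copy from the sim. The function names and the client distribution are made up for illustration.

```python
def sim_outbound_messages(clients_per_proxy, direct):
    """Copies of one update the sim itself must send.

    direct=True  : the sim serves every client itself.
    direct=False : the sim sends one copy to each proxy that has clients.
    """
    if direct:
        return sum(clients_per_proxy)
    return sum(1 for n in clients_per_proxy if n > 0)


def proxy_outbound_messages(clients_per_proxy):
    """Copies each proxy must send: one per connected client."""
    return list(clients_per_proxy)


# Hypothetical event: 100 clients spread over three proxies.
clients = [40, 25, 35]

print(sim_outbound_messages(clients, direct=True))   # sim fan-out without proxies
print(sim_outbound_messages(clients, direct=False))  # sim fan-out with proxies
print(proxy_outbound_messages(clients))              # fan-out moved to the proxies
```

The total number of copies reaching clients is unchanged (100 in this example); only the machine that performs the duplication changes.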

One caveat though: what mechanism is proposed to send the reverse proxy only the information its clients actually require, and not the whole sim state in bulk? Otherwise a single client in a proxied sim forces the sim to send the whole region state to the reverse proxy that client is connected through. Without such a mechanism, and depending on which sort of high concurrency is hitting the Grid (more or less uniformly distributed across the whole Grid in raw numbers, or concentrated, event-style concurrency), the gains may turn into losses.
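One possible shape for such a mechanism, offered purely as a sketch and not as part of the proposal: the proxy tracks which parts of the region each client cares about and subscribes to the sim only for the union of those interests. The grid-cell model, class name, and function names below are all assumptions.

```python
from collections import defaultdict


class InterestFilteringProxy:
    """Hypothetical proxy that subscribes only to what its clients need."""

    def __init__(self):
        self._interest = defaultdict(set)  # client_id -> set of region cell ids

    def set_interest(self, client_id, cells):
        self._interest[client_id] = set(cells)

    def subscription(self):
        # Union of all client interests: what the proxy asks the sim for.
        wanted = set()
        for cells in self._interest.values():
            wanted |= cells
        return wanted

    def route(self, update):
        # Clients that should receive this (cell, payload) update.
        cell, _payload = update
        return sorted(c for c, cells in self._interest.items() if cell in cells)


def sim_updates_for(subscription, all_updates):
    # The sim sends the proxy only updates for subscribed cells.
    return [u for u in all_updates if u[0] in subscription]


# Hypothetical region split into 16 cells, each producing one update per frame.
all_updates = [(cell, f"state-of-cell-{cell}") for cell in range(16)]

proxy = InterestFilteringProxy()
proxy.set_interest("alice", {5, 6})

sent = sim_updates_for(proxy.subscription(), all_updates)
print(len(sent), "of", len(all_updates), "updates cross the sim-to-proxy link")
print(proxy.route(sent[0]))
```

With a single connected client, only the cells in that client's interest set cross the link instead of the whole region state. How interest would actually be defined (draw distance, object ids, something else) is left open here.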