Difference between revisions of "ANALYSIS: Region Subdivision as a scaling method"

Revision as of 06:05, 24 November 2008

Introduction

A recent AWGroupies discussion examined the issue of sim and region scalability in the direction of increasing the processing power available to regions. One mechanism for achieving scalability that was proposed was region subdivision of statically-bounded region spaces. This section first provides a rough picture of conditions under such scaling, and then examines the problems that exist with that approach which impose inherent limits to its scalability.

Crowd conditions under projected scaling -- a few numbers

The scalability for events target for the normal use case has an event scaling factor requirement of 200 (20,000/100) --- this is the factor by which the number of people who wish to attend a maxed-out-sim event today will increase under Zero's total-population projections. (The massively higher viewer numbers possible for the SecondlifeTube use cases are not examined numerically for now, except in discussing event headroom.) Event scalability projections were calculated in the discussion on Project Motivation - Scaling for events.

For this event scaling factor requirement of 200, the region subdivision method subdivides today's 65,536 m^2 of land per region into subregions of (65,536/200) = 327.68 m^2 under a policy of equal-size subdivision, which is a square of 18.1m per side and 25.6m diagonal. Each subdivision then holds 100 people, each with a personal space of 3.28 m^2 (a square of 1.81m per side). Thus, this represents a lighter packing density than normal crowd packing in real life (RL), which approaches 1 person per m^2, and hence represents a less onerous scaling than would a true crowd simulation. Visually then, this packing density would provide a normal and untroubling experience for any person used to popular standup venues in RL.

Assuming simple rectangular NS/EW subdivision, the maximum rate of sim handover would be experienced every 18.1m of linear travel when aligned with a NS/EW grid, and the minimum rate when travelling diagonally every 25.6m. Determining the rate of handovers is hard without performing a statistical study of typical avatar movement patterns, so this is not attempted here. However, under rectangular spacing, a square subregion containing 100 people has 10 of them along each side bordering on the adjacent subregion, which provides a lower bound on handover for near-zero travel and therefore bears examination.

At each such boundary, any outbound movement by anyone from within their subregion causes handover (this is reduced by hysteresis, see below), and what's more it occurs in both directions across a boundary concurrently (by symmetry with the adjacent subregion), within any cross-section of an event crowd.

Therefore the maximum rate of handovers at a single boundary without any contribution from deeper within a subregion is somewhere between zero (nobody moving) and 2*10 = 20 (10 on each side of a boundary moving outbound). Since there are 4 adjacent subregions per subregion, this gives us a maximum of 20*4 = 80 possible handovers at the boundaries of any given subregion, under conditions of zero hysteresis, nobody coming out from deeper within their subregion, nobody getting closer than 1.8m to anyone else, and any movement whatsoever that is greater than zero distance (since hysteresis == 0). The average would be half of this or 40 handovers, since each of those 10 might be moving deeper into their own subregion rather than outbound.

Note that this is not a handover rate as such (since movement rates are unknown), but a bounding coefficient governing maximum handover rates for minimum travel. It doesn't establish an actual target for handover rate handling, but just gives us some idea of the magnitudes involved --- in other words, it shows that discussing thousands of handovers per second per subregion is not relevant, but a few tens of handovers might be. It offers an initial suggestion of possible viability.

Potential problems with Region Subdivision

The above scenario gives some idea of operational conditions under the normal (ie. current) use case, which are complicated by the following observations:

Max-scaling conditions apply under very small scaling (density-based subdivision is required)

First of all, it needs to be borne in mind that the above conditions do not apply only to the end projection of a 2bn user population, because crowds at events almost always congregate on the primary focus of attention, for example at the stage of a live music event. This would apply right now if regions and clients were scalable, because the top musicians already max out their sims for well-advertised concerts, even if they perform more than once a day. (Check out any [Komuso Tokugawa] concert, for example.) Live music is of course not the only type of event that maxes out sims in SL today, even without considering one-off special events like the recent SL4B celebration that attracted many thousands concurrently but was not able to handle them.

Because (i) popular events easily exceed the 100-av subregion regularly today, (ii) will do so even more with each passing day, and (iii) audiences draw together around an attraction, the described subdivision policy of equal-size subdivision is not viable. Instead, we require an equal-density subdivision policy, because without it we will not be able to scale events in the immediate future, let alone far ahead. The equal-size policy was described only to provide a rough idea of the numbers that govern handover densities, but the actual policy to be employed must be based on local participant density or it will not accomplish its goal, because otherwise subregions at the event focus would be massively oversubscribed while outer ones would be empty.

This is clearly more complex from a design and implementation standpoint, but it is inevitable under the region subdivision method otherwise the premise of limiting subregions to handling a maximum of 100 avs is exceeded very early on in the population growth, or even right now.

An alternative to equal-density subdivision that might address the problem is modulus subdivision (the region is sliced into sections each containing rather less than the maximum number a typical CPU can handle, plus one more section to carry the remainder). Geometric land partitioning however is not an option, because agent density is highly non-uniform.

In other words, we could not design for equal-size subdivision initially and then evolve to equal-density subdivision, but instead we would require a density-based subdivision right from the start: equal-size subdivision does not accomplish the goal, which is to handle loading from high avatar densities. And that introduces numerous difficulties, examined later.

Hard ceiling on scalability (non-existence of scalability headroom)

The 18.1m per side square of the simple example subregion (which is not actually viable because equal-density subdivision would be needed as above, but is nevertheless still useful for analysis) seems at first glance to be a viable extent for a subregion. Unfortunately, this would not be the minimum size of a subregion under the figures of our Project_Motivation. The reasons why subregions would have to be far smaller than 18.1m include the following:

Prim count expectations will rise, inevitably. This includes land-placed prims in the event region, but these are not the primary concern, since the biggest prim-related loading is generated by avatar attachments in all event-scaling scenarios. While it is hoped that sculpties will reduce the need for very high-prim attachments, a knowledge of people suggests that the opposite will occur, and that ever-more detailed attachments will increase instead of, in parallel with, or on top of sculptie-based designs. Region loading from prims will therefore be higher than forecasts based only on population growth.
Scripting load is highly likely to increase, for three reasons:

The number of scripts never decreases, and if the ceiling on land prim counts is raised then script numbers will rise too. More importantly however, scripts numbers in attachments have no user-explicit ceiling, and will be expected to grow not only alongside normal population growth but also because people will want facilities for crowd management.
As the population grows, new residents experience the pleasure of LSL programming, so the volume of available scripted objects always heads upwards. One additional factor that adds to the concurrent scripting load is that as the worldwide number of free or cheap scripts increases, the initial expense barrier to event participants running quality scripted products falls away. Since exponential growth of the SL population creates an ever-larger percentage of newcomers compared to well-funded oldies, this effect results in an ever more rapid takeup of scripted objects by residents. The onset of usage of scripted attachments is therefore no longer deferred while they earn some money, but becomes ever more immediate.
Mono is coming. This will speed up scripts markedly, and means that more scripts will be runnable in a region for any fixed amount of CPU resource, and therefore inevitably also means that more scripts will be run. It also means that more efficient scripts will access backend assets more rapidly, which compounds loading still further.

Adding these 3 points together suggests that region loading from scripts will therefore be higher than forecasts based only on population growth.

The SecondlifeTube use cases are highly likely to become massively popular, potentially becoming the "new TV" of the second decade of this millennium. The impact of these uses cases on server-side scalability is nothing short of ultra scary, and since the client-side changes for these use cases are quite trivial, the pressure will be on for massive server-side upscaling for events.

Factoring in the effect of more prims, more scripts, and a future which includes SecondlifeTube-type clients, immediately indicates that the 18x18m minimum-size subregions that we thought might be sufficient for scalability by region subdivision are wholly inappropriate. Indeed, the analysis is out by some orders of magnitude (log10(~millions/20,000)), whereas not even a single order of magnitude improvement is available since 18m/10 is roughly the size of an avatar, and other factors add loading too.

In case the impact of this is not immediately obvious: subregions would have to be smaller than an avatar to scale fully by this method. In other words, the method is completely non-scalable to many use cases of high interest, because region handover makes no sense when regions are smaller than avatars.

Or, as another way of putting it, the region subdivision proposal contains an immoveable ceiling or cap on region scalability. What little headroom exists for possible reduction of subregions below 18m per side can never go below the size of an avatar because of handover, so the ceiling cannot be raised further. Simple use cases that are expected to be popular require scalability beyond that.

Adjacent region view non-scalability

When a region is subdivided into a large number of small subregions, the 3D view from any given position will require object data to be gathered from all the region servers whose land or objects are visible within the observer's field of view, or possibly more for speculative caching. This process entails a lot of distributed activity. While that is not in itself a problem in the first instance, it gets progressively worse as the subregions become smaller and smaller, since the number then involved in almost any view will then grow larger and larger.

What's more, the intended goal of letting other machines assist by working on their own subregion is lost, because the physical proximity of subregions to other subregions forces additional object viewing traffic on them. The separation of domains which gives tiling its good performance is lost when those domains are so close that they become local. This is a fundamental design flaw in Region Subdivision.

The effect is pathological. As subregion size is reduced, the interconnection topology tends towards total connectivity.

In other words, region object loading which used to be primarily local now becomes global as regions are subdivided, because 3D viewing distances are not reduced in proportion. This appears to be a recipe for large additional machine loading which wastes CPU and network resources by coupling region land extent to the (desired) distribution of computing.

Increasing grid workload from subdivision (non-scalable internal workload)

Region subdivision always increases the overall workload. This stems from the simple observation that, for any given area, area subdivision adds new interaction points within that area.

As an example, consider a square of 100 units per side and containing a uniform density of objects (or other reasons for boundary interactions to occur), and assume that each unit of distance also generates a unit of boundary interaction workload:

S/1: The original 100x100 square then features (4 * 100) = 400 units of workload, ie. 100 per side.
S/2: Splitting the square down the middle results in 2*((2 * 100) + (2 * 50)) = 600 units of workload.
S/4: Splitting each half into half again results in 4*((2 * 50) + (2 * 50)) = 800 units of workload.
S/8: Splitting each quarter into halves again results in 8*((2 * 50) + (2 * 25)) = 1200 units of workload.
S/16: More directly, dividing the original square into 16 results in 16*(4 * 100/(16/4)) = 1600 units of workload.
And so on. For every 4-way split, the boundary interaction workload doubles, ie. exponential in 2^(2N).

This then is a design for non-scalability, since in the limit it tends towards infinite internal workload as the subdivision tends towards zero size, and since the trend is exponential, it is the exact opposite of a scalable trend.

Hysteresis zone size shrinks to unusable

Real systems which employ the concept of handover at region boundaries do not usually trigger handover at the actual boundary lines, because this would generate pathological behaviour for agents sitting exactly on the boundary, moving along the boundary line, or making infinitessimal movements while at the boundary. All of these cases can result in 'handover thrashing, ie. continuous switching back and forth between regions, potentially at very high rates.

To avoid this, handover is usually tamed by placing handover bands that add [hysteresis] to handover at boundaries. This can be done in many ways, but the general goal is to introduce delay into the handover process (this ties into normal control theory for damping uncontrolled behaviour through negative feedback with a delay in the feedback loop). The handover band does not necessarily have a physical extent (it's more of a time parameter), but it can be associated with a physical extent if the velocity of an agent perpendicular to the boundary is considered.

As the conceptual handover band shrinks in size, hysteresis is reduced, and handover behaviour draws closer to the pathologcal condition that it was designed to avoid.

When regions are repeatedly subdivided as in this proposal, retaining existing sim behaviour will require reducing the size of handover bands correspondingly, otherwise in due course handover bands will overlap and the current handover design is then no longer viable since the handover process loses its normal 1:1 mapping. If the handover bands are reduced sufficiently, this then leads to handover thrashing.

In the limit of subdivision, handover cannot work at all without continuous thrashing, although in practice the bag of problems is so full by then (eg. regions smaller than avatars) that it hardly matters.

In summary, the concept of handover by region adjacency becomes progressively less manageable and usable as subregions shrink in size, and would need to be replaced by some form of non-adjacent region handover in order to retain the important principle of hysteresis.

Adjacency-based power contribution is flawed

Region Subdivision is an attempt to retain static mapping between land and servers while sharing the workload of highly loaded adjacent regions. In other words, load sharing requires land sharing (through subdivision), in this model. This constraint on load sharing is deeply flawed, as a very simple example illustrates:

Imagine a grid split down the middle, in which all servers in the right half are 100% idle, and all servers in the left half are 100% busy.
Now replace the server right in the middle of the left half by one that is running at 80% CPU capacity, that knows that a major event is scheduled in 5 minutes' time, and that is desperate for help to avoid total collapse.
None of the adjacent servers in the left half can help, as they have no spare capacity. None of the extremely bored servers on the right half can help because their land is not adjacent to that of the ailing server.
In 5 minutes' time, that server collapses.

While this is not a real scenario, it illustrates well that the concept of land adjacency is not helpful for power sharing. In real scenarios, exactly the same considerations apply: the ability of a server to accept a share of a region's workload should have nothing at all to do with land adjacency, extent, or location. Land should be a virtual concept, and not tied to the CPU resourcing model.

Dynamic subregioning vs stateless access to virtual regions

As a little analysis revealed above, some key aspects of Region Subdivision have turned out to require dynamic handling:

The actual subdivision policy cannot be equal-size subdivision, otherwise workload sharing is not effective and some subregions suffer overload beyond viable limits while others in the partitioning are relatively idle. Instead, a policy of equal-density subdivision is required, and this is a dynamic policy.
Hysteresis for handover stability requires handover band reduction proportional to subregion shrinkage to retain 1:1 handovers, or else requires that the handover mechanism be replaced entirely by one that is not based on subregion adjacency. Such handling becomes highly dynamic.
As subregions reduce in size to cope with increased region load, each contributing CPU is required to handle object fanout for more and more viewers because 3D views do not shrink in step with subregion reduction. Consequently subregion shrinkage tends towards serving the needs of everyone in the locality, which is the same goal that Virtualization of Regions achieves without any handover overhead. (Original link in Brainstorming was here.)

When the design of region mechanisms starts to embrace changing sizes and adaptive handover systems, it is no longer a simple statically tiled grid architecture. Instead, it is halfway to a dynamic architecture in which region parameters are entirely virtualized, but instead of benefitting from that new freedom, it retains the disadvantages of the old design and introduces many new problems, as highlighted in the analysis.
This tends to suggest that, if the static grid functionality is being redesigned to have dynamic features, then those dynamic features should operate independently of those static constraints, in order not to introduce new disadvantages by distorting the old model.
The latter is the approach suggested in Virtualization of regions, which works alongside the old static resource mapping, merely offering any unused local resources to all other participants in a parallel dynamic infrastructure.
It is worth noting that virtualizing regions inherently results in stateless communication with the servers that perform operations on virtual regions, and stateless communication is a central tennet of REST. The principle of virtualizing regions is therefore inherently consistent with massively scalable HTTP-based loadbalanced access to virtual resources, of which a virtual region is an archetypal case.

Scalability vs scaling

When the proposal to use Region Subdivision was described and discussed in the AWGroupies discussion, the impact of the scary numbers of Project_Motivation on the proposal was not considered, and was in fact dismissed, on the grounds that scaling would occur in an evolutionary manner and therefore only a smaller number would initially be relevant. This is flawed reasoning, because it confuses scalability with degree of scaling.

If the goal is to design a system that can handle (say) a population of N users, then in general there will be any number of different designs that can fulfil that requirement. These designs will typically differ in their scalability: one may support only N users and no more, one may be able to handle some small multiple of N, and another may be able to support a vastly larger population. However, because they all fulfil the short term goal of supporting N users, these differences are not apparent if the only design case that has been considered is the scaling to N. If a system implements a design that is viable only up to N users, then if that system needs to be scaled up further, the only way forward will entail a redesign, and if that new design is not evolutionary from the earlier one then this will also entail a reimplementation. This is clearly not satisfactory if the envisaged population is hugely greater than N right from the start, even if N is the initial target.

In respect of the Region Subdivision proposal then, it is not sound scalability analysis to ignore the scary numbers of Project_Motivation just because some smaller number is envisaged as a short term scaling target (a specific implementation). All candidate designs need to be assessed against the scary numbers to ensure that they remain viable at the projected limits of scalability, regardless of the magnitude of short term physical scaling targets. Without that, gradual evolution is not assured.

Illusion of scalability (abstracting resources into clouds)

First, a general observation on allegedly designing for scalability without actually doing so (also known as designing with clouds):

You cannot put a scalable, massively parallelized HTTP access mechanism on the front of a non-scalable resource and magically claim that you have a scalable resource. All you really have is a scalable access mechanism. If the resource is non-scalable, then the scalable access mechanism is wasted. This means that if we put the resource in a cloud labelled "Scalability to be left to implementors" then we have not actually achieved resource scalability. We have not even produced a design that makes the resource scalable, let alone actually achieved scaling. We *know* that stateless HTTP front ends and load balancers give us a scalable access mechanism --- it's been normal industry practice for 10-15 years now (the author scaled a national ISP from 64k users to 2+ million over 6 years using that approach, and it was severely limited in scope). That's not the hard task that the AWG is tackling, it's the easy part of the hard task.

So, if your project goal is to design scalable resources, you have to do exactly that. Anything else is self-delusion.

Now to place this in the current context: it has been suggested that we might abstract scalability of regions by placing the internal design of regions inside a design cloud, and to claim scalability because we have placed a scalable HTTP-based access method in front of that cloud. That suggestion is not designing for region scalability, it is brushing the whole issue under the carpet. It does not achieve scalability of regions in the current SL, nor in the projected SL, neither in design nor in implementation. It is simply ignoring the issue of region scalability entirely.

Given that scalability for events is needed extremely badly right now, and that the nil scalability for events of the current grid has been causing intense dissatisfaction among event participants for far longer than the limits along any other dimension of scalability (well over 3 years), ignoring this element of scalability while pretending that it is addressed is not a attitude of responsibility towards SL residents. It should not be even considered. (This paragraph is not technical, but is relevant in the sense that design engineering and the work of the AWG ultimately has a social purpose.)

We are not designing for a future SL in which 99.5% to 99.995% of the residents who wish to attend a given event have to stay at home --- see Talk:Project_Motivation. Region scalability for events is a primary goal, as well as an urgent requirement.

Summary of observations

The Region Subdivision approach suffers from several difficulties and restrictions which will result in high overheads if scaled even to the lower reaches of the numbers offered in Project_Motivation, and high overheads generally reduce actual ability to scale to far less than predicted figures. Even more worrying though is that the proposed approach contains an inherent ceiling on achievable scalability (let alone actual scaling) for predictably popular use cases which are already implemented in other systems.

Given that such difficulties are already identified at this early design stage, Region Subdivision would be a poor design choice for a new architecture.

"This concludes the results of the Norwegion Jury." Not quite nil points, but close.

@@ Line 95: / Line 95: @@
 ::* The actual subdivision policy cannot be '''equal-size subdivision''', otherwise workload sharing is not effective and some subregions suffer overload beyond viable limits while others in the partitioning are relatively idle.  Instead, a policy of '''equal-density subdivision''' is required, and this is a dynamic policy.
 ::* Hysteresis for handover stability requires handover band reduction proportional to subregion shrinkage to retain 1:1 handovers, or else requires that the handover mechanism be replaced entirely by one that is not based on subregion adjacency.  Such handling becomes highly dynamic.
-::* As subregions reduce in size to cope with increased region load, each contributing CPU is required to handle object fanout for more and more viewers because 3D views do not shrink in step with subregion reduction.  Consequently subregion shrinkage tends towards serving the needs of ''everyone'' in the locality, which is the same goal that [[Virtualization of Regions]] achieves without any handover overhead.  (Original link in [[Brainstorming]] was [[Brainstorming#Virtualization of regions|Virtualization of regions]].)
+::* As subregions reduce in size to cope with increased region load, each contributing CPU is required to handle object fanout for more and more viewers because 3D views do not shrink in step with subregion reduction.  Consequently subregion shrinkage tends towards serving the needs of ''everyone'' in the locality, which is the same goal that [[Virtualization of Regions]] achieves without any handover overhead.  (Original link in [[Brainstorming]] was [[Brainstorming#Virtualization of regions|here]].)
 :* When the design of region mechanisms starts to embrace changing sizes and adaptive handover systems, it is no longer a simple statically tiled grid architecture.  Instead, it is halfway to a dynamic architecture in which region parameters are entirely virtualized, but instead of benefitting from that new freedom, it retains the disadvantages of the old design and introduces many new problems, as highlighted in the analysis.
 :* This tends to suggest that, if the static grid functionality is being redesigned to have dynamic features, then those dynamic features should operate independently of those static constraints, in order not to introduce new disadvantages by distorting the old model.