Difference between revisions of "Service Disruptions"
Revision as of 15:26, 9 April 2008
Second Life is a complex system with many inter-operating components, from simulators and databases to the viewer you run, and the Internet connections over which data flows. Although it is intended to be as loosely coupled as possible and resilient against problems, events can still occur which lead to service disruptions.
In some cases the failures are beyond the control of Linden Lab. However, in nearly all cases there is active work being done to mitigate the disruptions - either prevent them from happening or significantly reduce the impact.
When a disruption occurs, the following sequence usually unfolds:
- a system stops responding
- automated notifications go off, alerting our operations team
- residents often notice immediately, and alert in-world support, who confirm the problem to our operations team
- the operations team identifies the system that is the root cause of the problem
- the communications team is notified, and asked to provide information about the disruption to the blog
- if the disruption lasts more than a few minutes, updates are made to the blog regularly
- once the problem is solved, an "all clear" is reported to the blog
Note that in many cases a disruption may be solved before any information can make it to the blog explaining the details of the problem. One purpose of this document is to provide a "clearing house" for types of service disruptions, so that in the event of a system failure the blog post can reference this page.
At the time of this writing, and in no particular order, these are the systems which have been known to cause service disruptions:
Asset Storage Cluster
What is it: A cluster of machines that form a whopping huge WebDAV (think "web-based disk drive") storage mechanism with terabytes of space for storing assets, including uploaded textures, snapshots, scripts, objects taken into inventory, script states, saved region states (simstates), etc that make up Second Life. The technology (software and hardware) is licensed from a third party.
How it can fail: The system should be resilient against single node failures. In the case of multiple disk failures, software upgrades, removing problem nodes or adding new nodes, some or all of the cluster can fall offline. If this happens, asset uploads and downloads fail - this causes texture uploads and simstate saves to fail. Since transient data during region crossings (attachment states, etc) are written as assets, region crossings will also often fail.
How we fix it: When a failure is detected, we often disable logins and send an in-world message (if possible) to help avoid data loss. Failed nodes can be taken out of rotation. A restart of other nodes may be necessary. When upgrading the software on the nodes, the grid is usually closed to prevent data loss during any inadvertent outages. An asset system failure requiring a reboot occurred on March 28th, 2008.
Central Database Cluster
What is it: A cluster of databases that store the core persistent information about Second Life - including resident profiles, groups, regions, parcels, L$ transactions and classifieds.
How it can fail: The database can become loaded enough during normal operations that some fraction of transactions fail and either must be manually retried or are automatically retried. Hardware failure and software bugs in the database code can also cause the database to crash or stop responding. Logins will fail, transactions in-world and on the web site will fail, and so forth.
How we fix it: If the primary database fails, we swap to one of the secondaries. If the database load is high but the database hasn't failed, we can turn off services to try to reduce the load.
Eliminating this cluster as a scalability bottleneck and failure point is a very high priority for Linden Lab. While this is in progress, load mitigation is occurring. Watch the Second Life Blog for updates.
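The automatic retry of transactions that fail under load can be pictured with a small sketch. This is only an illustration, not Linden Lab's code: the TransientDBError type, the run_with_retry helper, and the retry count and delays are all hypothetical.

```python
import time


class TransientDBError(Exception):
    """Hypothetical error standing in for a load-related transaction failure."""


def run_with_retry(transaction, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call transaction(); on a transient failure, wait and retry.

    The delay doubles after each failed attempt so that retries do not
    pile extra load onto a database that is already struggling.
    """
    for attempt in range(attempts):
        try:
            return transaction()
        except TransientDBError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay * (2 ** attempt))
```

A transaction that still fails after the last attempt raises to the caller, which corresponds to the case in the text where it must be retried manually.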
Agent ("Inventory") Database Cluster
What is it: Storage for most agent-specific data such as the inventory tree is partitioned across a series of databases. Each agent is associated with a particular inventory partition (a primary database and its secondary backups). At the time of this writing, we have approximately 15 agent database partitions.
The initial use of these agent-partitioned databases was for inventory, so they are often referred to as "inventory databases" by Lindens, but this is no longer the extent of what agent-specific data is stored within them.
How it can fail: Hardware or software failure can affect the primary database within the partition, so that it either stops responding to queries or becomes excessively slow.
How we fix it: When an agent database fails we can swap to the backup within that partition, which takes a few minutes. If this will not happen immediately or if problems are encountered, that particular agent partition is "blacklisted" temporarily; while the fix is in progress, logins of agents associated with that partition are blocked and any such agents already logged in are "kicked". This will affect some fraction of the grid, but not everybody.
This is an example of a system that in the past was prone to causing global service disruption. It was re-designed to limit the impact to Residents even in the face of hardware failure; only Residents associated with a particular partition are affected during such a failure.
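A minimal sketch of how partition blacklisting limits the impact of a failure follows. The text does not say how an agent is assigned to a partition, so the hash-based partition_for function is purely an assumption, as are the PartitionGate class and its method names.

```python
import zlib

NUM_PARTITIONS = 15  # "approximately 15" per the text


def partition_for(agent_id, num_partitions=NUM_PARTITIONS):
    """Map an agent to a partition (assumed: a stable hash of the agent id)."""
    return zlib.crc32(agent_id.encode("utf-8")) % num_partitions


class PartitionGate:
    """Tracks which agent partitions are temporarily blacklisted."""

    def __init__(self):
        self.blacklisted = set()

    def blacklist(self, partition, online_agents):
        """Block the partition and return the online agents to be kicked."""
        self.blacklisted.add(partition)
        return [a for a in online_agents if partition_for(a) == partition]

    def restore(self, partition):
        """Lift the block once the partition is healthy again."""
        self.blacklisted.discard(partition)

    def may_login(self, agent_id):
        """Only agents on a blacklisted partition are refused."""
        return partition_for(agent_id) not in self.blacklisted
```

Agents on every other partition pass may_login unchanged, which is the "some fraction of the grid, but not everybody" behaviour described above.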
Other Database Clusters
What is it: There are a handful of other database clusters in use. One is used for logging data.
How it can fail: Hardware or software failures can take a database cluster offline. There should be no in-world effect from one of these other database clusters failing, but occasionally a software design flaw does introduce a dependency that is not caught. For example, logins used to require a successful connection to the login database to record the login and viewer statistics, but this dependency has been removed.
How we fix it: All databases act in clusters with a primary machine and several secondaries. In case of failure, a secondary can be swapped into place as the new primary.
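The primary/secondary swap can be sketched as follows. The DatabaseCluster class and the host names are hypothetical, and the sketch ignores replication lag and data consistency, which a real swap has to deal with.

```python
class DatabaseCluster:
    """One primary machine plus an ordered list of secondaries."""

    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = list(secondaries)

    def fail_over(self):
        """Promote the first secondary and return the failed primary."""
        if not self.secondaries:
            raise RuntimeError("no secondary available to promote")
        failed = self.primary
        self.primary = self.secondaries.pop(0)
        return failed


# Usage with made-up host names:
cluster = DatabaseCluster("db-a", ["db-b", "db-c"])
failed_host = cluster.fail_over()  # db-b is now the primary
```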
Our data warehousing team has been doing significant work over the past year to ensure that the ever increasing amount of data being logged about simulator and other system performance can be analyzed, and that the collection of this data is "transparent" to the other systems - logging database failures should no longer cause service disruptions.
Transient Data Services
What is it: A cluster of machines (currently: 16) that store data in memory for "transient" state. This includes things like agent presence ("who is logged on?"), group chat participation, inbound email to script mapping, and so forth. This data is not stored in a database and is either constantly refreshed (e.g. simulators update agent presence every few minutes) or otherwise recoverable (e.g. rejoining a group chat).
How it can fail: Hardware failure can reduce the capacity of these machines, or take them offline entirely. Software bugs can also cause poor performance - for example, a memory leak in a service could cause the services to start responding slowly. While the specific service is disrupted, the overall service remains functional.
How we fix it: Because the state is transient, a replacement host can be brought online quickly and the data "heals" itself over time. If the error is software-side the services can be restarted as soon as a fix for the bug is found, with little impact to residents.
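To illustrate why transient data "heals" itself, here is a small sketch of an in-memory presence table. The PresenceTable class and the five-minute expiry are assumptions made for the example; the text only says that simulators refresh presence every few minutes.

```python
import time


class PresenceTable:
    """In-memory "who is logged on?" table with expiring entries."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.last_seen = {}  # agent id -> time of the last refresh

    def refresh(self, agent_id):
        """Called periodically by a simulator for each agent it hosts."""
        self.last_seen[agent_id] = self.clock()

    def is_online(self, agent_id):
        """An agent counts as online only if refreshed within the TTL."""
        seen = self.last_seen.get(agent_id)
        return seen is not None and self.clock() - seen <= self.ttl
```

A replacement host starts with an empty table and reports everyone as offline, but becomes accurate again as soon as each simulator sends its next round of refreshes.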
Simulators
What is it: A sim is a machine that runs simulators, which are the computer processes that run regions. (Think of a region like a document, the simulator like a word processor, and the sim as the computer itself that runs the program.) Since these are closely coupled concepts, jargon/terminology tends to be somewhat loose, e.g. "simstate" should really be "region state". The simulator divides its time between communicating with viewers, communicating with other system components, simulating physics and executing scripts.
How it can fail: A bug in the simulator code can cause a crash. Most crashes cause a region simstate to be saved, and another simulator will load that simstate after a few minutes. Often, the bug is triggered by some of the region content - a script or physical object.
Other problems fall into two categories - problems with a specific sim, or grid-wide. Problems with a specific sim may include overloading (e.g. 4 high-traffic regions on the same sim) or failures (disk full, network interfaces lost, hardware failure). Grid-wide problems are usually caused by the other factors listed here, such as loss of network, or database or asset cluster failure (any of which could, for example, prevent simulators from loading simstates). New simulator code releases occasionally introduce bugs with grid-wide consequences (e.g. excessive logging causing network traffic congestion).
How we fix it: Simulator crashes are reported just as viewer crashes are. We can use the data to determine the general subsystem that caused the crash (initialization, physics, scripts, messaging, etc.). If the crash is caused by content (script or physical object) we can use the crash data to determine why this occurred. In the meantime (since the fix may take several days or, in the case of physics, a project like the move to Havok 4), the region is brought up with scripts/physics disabled and the offending objects removed.
Problems with a specific sim can be addressed by restarting the regions, which causes a simulator process on a different sim to run them. Grid-wide problems are fixed at the source (e.g. by repairing the network). Bugs from new code releases require either a configuration change (to turn off a new feature) or a rolling restart with updated code.
Reducing simulator crashes was the main motivation behind the move to Havok 4 for physics simulation, and the upcoming move to Mono for script execution.
Dataservers
What is it: Most of the simulator to database communication proxies through a process called "dataserver"; there are a few dataserver processes on each sim host. This eliminates a direct dependency on the database and allows the dataserver to block on a lengthy query while the simulator targets a fixed frame rate.
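The decoupling described here can be sketched with a worker that blocks on queries while the caller only ever polls. This is an illustration under assumptions: the real dataserver is a separate process, whereas this sketch uses a thread, and the DataserverProxy class and its methods are hypothetical.

```python
import queue
import threading


class DataserverProxy:
    """Runs slow queries off the simulator's frame loop."""

    def __init__(self, run_query):
        self._run_query = run_query  # blocking function: query -> result
        self._requests = queue.Queue()
        self._results = queue.Queue()
        worker = threading.Thread(target=self._serve, daemon=True)
        worker.start()

    def _serve(self):
        # The worker may block as long as a query takes.
        while True:
            request_id, query = self._requests.get()
            self._results.put((request_id, self._run_query(query)))
            self._requests.task_done()

    def submit(self, request_id, query):
        """Queue a query and return immediately."""
        self._requests.put((request_id, query))

    def poll(self):
        """Return finished (request_id, result) pairs without blocking."""
        done = []
        while True:
            try:
                done.append(self._results.get_nowait())
            except queue.Empty:
                return done
```

A simulator loop would call submit when it needs data and poll once per frame, so a lengthy query delays only its own answer and not the frame.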
How it can fail: The dataserver process can crash as a result of bugs related to unforeseen circumstances. For example, if the network hiccups, a connection to a database may be lost. Usually the system recovers gracefully and transparently from a dataserver failure, but on a particular simulator some transactions may fail temporarily. The service disruption is localized to the specific simulator. It is also possible that a software update could introduce bugs that cause grid-wide effects (for example, increased load on the central database cluster, or just more frequent crashes). When a database is not responding to connections, the dataserver process watcher will automatically stop and restart the dataserver so new requests can be serviced.
How we fix it: When an individual dataserver crashes, it is automatically restored. If a bug is introduced that causes grid-wide effects the dataserver processes can usually be replaced without downtime.
The dataserver component is being phased out and replaced with web dataservices; simulators will use HTTP to talk to a new set of hosts that in turn relay queries to the database. This will allow us to more easily tweak the system to improve performance and eliminate disruptions.
Login Server Cluster
What is it: A cluster of servers which represent the first service that the viewer connects to when attempting to log in. This validates the resident's credentials, checks the viewer version for possible updates, and ensures the latest Terms of Service have been accepted. Assuming those check out, it sends the viewer an initial overview of the resident's inventory folders and a few other chunks of data. Finally, it negotiates with the simulator for the requested start location and lets the viewer know which simulator to talk to.
How it can fail: If one login server drops offline, some percentage of logins will fail. Additionally, since the login sequence is database-intensive, if the central database or inventory database cluster is having problems then logins will also fail. Finally, after a major disruption that leads to many Residents being kicked or unable to connect, there may be more Residents trying to connect than Second Life can handle (roughly 1000 logins/minute); this can appear to Residents trying to log in as though the login service is failing, even though it is fully functional and just at maximum capacity.
How we fix it: If a login server itself fails, we take it out of rotation. If the problem is in another system or service, we fix it there.
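The "at maximum capacity" case can be illustrated with a sliding-window counter. The text gives only the rough figure of 1000 logins/minute; the LoginThrottle class and the way it turns excess attempts away are assumptions made for this sketch, not a description of the actual login servers.

```python
from collections import deque


class LoginThrottle:
    """Accepts at most max_per_window logins in any sliding window."""

    def __init__(self, max_per_window=1000, window_seconds=60.0):
        self.max = max_per_window
        self.window = window_seconds
        self.accepted = deque()  # timestamps of recently accepted logins

    def try_login(self, now):
        """Return True if a login at time `now` fits within capacity."""
        # Forget logins that have aged out of the window.
        while self.accepted and now - self.accepted[0] >= self.window:
            self.accepted.popleft()
        if len(self.accepted) >= self.max:
            return False  # looks like a failure to the resident
        self.accepted.append(now)
        return True
```

If 1500 residents retry in the same instant after an outage, 1000 get in and 500 are turned away even though nothing is broken; those 500 succeed once the window has moved on.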
Web Site
What is it: A cluster of machines that serve the web pages and web services exposed to the public - including secondlife.com, lindenlab.com, slurl.com, etc.
How it can fail: Hardware failures can slow down or shut down a machine in the web cluster. In that case, a load balancer should automatically redirect web traffic away from machines that are performing poorly, but the load balancer itself may have bugs (e.g. it may not detect such failures properly, or itself become blocked up). Web site bugs can be introduced by code updates to the web site, which are made daily. In addition, the web site relies on the central database cluster for many service actions, so failures there will affect web site actions such as the LindeX and transaction history, land store, friends online, and so forth.
How we fix it: Problematic hardware can be taken out of rotation to restore the responsiveness of the web site. Problems in other systems such as the central database cluster need to be addressed there.
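Taking poorly performing machines out of rotation can be sketched as a health filter in front of a round-robin pool. The host names, the 500 ms threshold, and both function names are hypothetical; this is not how any particular load balancer is implemented.

```python
from itertools import cycle


def healthy_hosts(latency_ms, max_latency_ms=500.0):
    """Keep hosts that answered the last probe quickly enough.

    latency_ms maps host name -> probe latency in milliseconds,
    or None if the host did not answer at all.
    """
    return [
        host
        for host, ms in latency_ms.items()
        if ms is not None and ms <= max_latency_ms
    ]


def round_robin(hosts):
    """Return an endless iterator that cycles through the given hosts."""
    if not hosts:
        raise RuntimeError("no healthy web hosts available")
    return cycle(hosts)


# Usage with made-up probe results:
probe = {"web1": 40.0, "web2": None, "web3": 2500.0, "web4": 55.0}
rotation = round_robin(healthy_hosts(probe))
```

The sketch is only as good as its probe: if the health check misreports a failing machine, traffic keeps flowing to it, which is the kind of load balancer bug mentioned above.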
Linden Network
What is it: The tubes through which stuff travels. Most notably, the connections between our co-location facilities ("colos"), e.g. SF and Dallas, but also the plumbing within colos. This includes "VPNs", switches, routers, and other esoteric stuff. Some of this is Linden equipment, some of this is leased equipment (e.g. we pay a third party to have dedicated use of their "tubes" between our colos), and public Internet pipes are also used.
How it can fail: A component can go bad; for example, a router can start dropping packets. This often appears as one of the other problems (asset storage, database, simulators, logins) since the systems can no longer talk to each other. The failure on April 5th, 2008 is an example of this kind of failure.
How we fix it: We isolate the affected component and take it out of service or replace it as quickly as possible. If this is a leased component we need to talk to our provider.
Internet
What is it: A series of tubes that bring Second Life to your computer, from the large trans-oceanic and trans-continental pipes that link the world down to the high-speed connection to your home from your Internet Service Provider (ISP).
How it can fail: Failures occur on several levels. If this happens at a high level - for example, a major Internet trunk to Europe drops offline - thousands of residents can be disconnected from Second Life.
How we fix it: This is usually beyond our control. If we can isolate the problem we can report it to network contacts, but otherwise we, like the residents, just need to wait for the issue to get fixed.