Rolling restart


What is a "rolling restart"?

The Second Life world has thousands upon thousands of Regions running on a great many servers. Because of the logistics involved in managing them, and because our Residents prefer Second Life to have as much uptime as possible, it isn't feasible to restart every Region at the same time when an upgrade needs to be deployed.

[Image: Region restart.png]

Think of a rolling restart like a wave: it doesn't occur everywhere simultaneously, but travels from one place to another. While some servers are restarting, others were restarted several minutes earlier and are coming back online. Thus, only a portion of Second Life is down at any one time. For example:

As with all of our server deploys, each region will be restarted once during one of the rolling restart periods. Most regions should be down no more than 5-10 minutes, although some fraction of the regions will take 20-30 minutes to upgrade. If your region stays down for more than 30 minutes, please contact support. Each region will receive warnings starting 5 minutes before that region is restarted.

Rolling restarts usually apply to all of Second Life, including Private Regions.

Specific details of how a rolling restart proceeds are sometimes announced in our Grid Status Reports, and more details are available on the Server Deploys forum.

Technical Details[1]

First, some definitions

Colo -- a colocation site which holds many racks of sims
Sim (or sim host) -- a computer (or server)
Simulator -- the binary that runs on a sim host
Regions -- run by a simulator


Here's the deploy process in a nutshell
1. New server code is prepared.

2. Deploy day rolls around.

3. The server code is compiled and the binaries are put into a tarball.

4. The tarball is put on the Asset System.

5. For each colo, a BitTorrent tracker is started and the sim nodes start downloading the tarball. That usually takes less than 1 hour.

6. A command is sent out to have each sim unpack the tarball and get it ready for prime time. That takes another hour or so per colo.

7. When it's time for the rolling restart, the grid is put into what's called "Startup Mode". That means that when a sim goes down, the system won't try to bring its regions up on another host. This way, each region is restarted only once.

8. While in startup mode, 200 sims are selected at a time and the following steps are taken:
8a. The simulator is shut down.
8b. The new binaries are installed.
8c. The simulator is restarted.

This is repeated until all (approximately) 6000 sims have been processed.
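
The batching in steps 7 and 8 can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual deploy tooling: the `SimHost` and `Grid` classes, the `rolling_restart` function, and the version strings are all hypothetical names made up for this example. Only the batch size of 200 and the total of roughly 6000 sims come from the description above.

```python
from dataclasses import dataclass
from typing import Iterator, List

BATCH_SIZE = 200  # from the text: 200 sims are selected at a time


@dataclass
class SimHost:
    """Hypothetical stand-in for one sim host."""
    name: str
    version: str = "old"
    running: bool = True
    restarts: int = 0


@dataclass
class Grid:
    """Hypothetical stand-in for the grid; it only tracks the Startup Mode flag."""
    sims: List[SimHost]
    startup_mode: bool = False


def batches(sims: List[SimHost], size: int) -> Iterator[List[SimHost]]:
    """Yield consecutive groups of at most `size` sims."""
    for start in range(0, len(sims), size):
        yield sims[start:start + size]


def rolling_restart(grid: Grid, new_version: str, batch_size: int = BATCH_SIZE) -> int:
    """Restart every sim exactly once, one batch at a time.

    Returns the number of batches processed.
    """
    # Step 7: in Startup Mode, regions are not brought up on another host.
    grid.startup_mode = True
    batch_count = 0
    try:
        for batch in batches(grid.sims, batch_size):
            for sim in batch:
                sim.running = False        # 8a: shut the simulator down
                sim.version = new_version  # 8b: install the new binaries
                sim.running = True         # 8c: restart the simulator
                sim.restarts += 1
            batch_count += 1
    finally:
        grid.startup_mode = False
    return batch_count


grid = Grid(sims=[SimHost(f"sim{i}") for i in range(6000)])
print(rolling_restart(grid, "new"))  # 30
```

With about 6000 sims and 200 per batch, the loop runs 30 times, which is why the restart moves across the grid like a wave instead of hitting everything at once.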

References