Certified HTTP Versioning

From Second Life Wiki
Revision as of 17:04, 4 January 2008 by Which Linden (Talk | contribs)


Problem

One problem that we have to face is oplog versioning. The definitions of the functions that use an oplog change over time, and if we do not account for that we risk destroying data. Here is a simple, admittedly contrived, example that demonstrates the problem.

<python>import time

agents = {'a': 0, 'b': 0}

# touch_agent version 1
def touch_agent(op, agent_id):
    touch_time = time.time()
    if touch_time != agents[agent_id]:
        op.persist(agents.__setitem__, agent_id, touch_time)</python>

When you run the touch_agent code, you'll end up with an oplog that looks like this:

  1. None (this is the return value of __setitem__)

In any running system, you have to assume that the oplogs will stay around for a while and be rerun at some point in the future. So you're bound to have a set of these one-entry oplogs sitting around in your persistent store.

Now let's say you change the touch_agent code:

<python># touch_agent version 2
def touch_agent(op, agent_id):
    touch_time = op.persist(time.time)  # whoops, have to persist the time too
    if touch_time != agents[agent_id]:
        op.persist(agents.__setitem__, agent_id, touch_time)</python>

This version of the code expects to be run over oplogs that look like this:

  1. 1199490918.72645
  2. None

If you try to run version 2 of the code over an oplog generated by version 1, this is what will happen:

<python>touch_time = op.persist(time.time)  # -> time.time will be ignored, touch_time will be set to None
if touch_time != agents[agent_id]:  # -> the condition will be True because None != 0
    op.persist(agents.__setitem__, agent_id, touch_time)  # -> this will set the agent's time to None</python>

This is bad because the incorrect value None makes its way into the "database". The fact that anything at all was stored in the database is very bad, since that completely violates the idempotency guarantees that are the entire reason for creating chttp in the first place. This is but one example -- the potential scope of damage for poorly-versioned programs is unbounded.
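The failure described above can be reproduced with a small simulation. This is only a sketch: ToyOplog is a hypothetical stand-in, not the chttp implementation, and it assumes that persist returns the logged result when one exists at the current position and otherwise calls the function and records its result. The names touch_agent_v1 and touch_agent_v2 are likewise made up for the illustration.

```python
import time


class ToyOplog:
    """Hypothetical stand-in for an oplog: an ordered list of persisted results."""

    def __init__(self, entries=None):
        self.entries = list(entries or [])
        self.position = 0

    def persist(self, func, *args):
        # On replay, hand back the logged result instead of calling func again.
        if self.position < len(self.entries):
            result = self.entries[self.position]
        else:
            result = func(*args)
            self.entries.append(result)
        self.position += 1
        return result


agents = {'a': 0, 'b': 0}


def touch_agent_v1(op, agent_id):
    touch_time = time.time()
    if touch_time != agents[agent_id]:
        op.persist(agents.__setitem__, agent_id, touch_time)


def touch_agent_v2(op, agent_id):
    touch_time = op.persist(time.time)
    if touch_time != agents[agent_id]:
        op.persist(agents.__setitem__, agent_id, touch_time)


# First run with version 1: the log records a single None.
first_run = ToyOplog()
touch_agent_v1(first_run, 'a')
old_entries = list(first_run.entries)

# Replay the old log with version 2: the logged None is read back as the time,
# and the second persist call runs for real, writing None into agents.
replay = ToyOplog(old_entries)
touch_agent_v2(replay, 'a')
corrupted_value = agents['a']
```

In this toy model the replay both reads a wrong value and performs a fresh write, which is the behavior the oplog is supposed to rule out.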

Solution

The only solution that has a chance of working for all cases is to always run an oplog with the code that generated it.

In order to achieve this, any time you change the code that uses an oplog, you must give the new code a new name and keep it around alongside the old code, until you can be certain that no oplogs use the old code any more. After you're sure that the old code isn't used, you can delete it, but you still can't rename the new code.
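As a rough illustration of this rule, the sketch below keeps both versions available under distinct names and selects the code by the name recorded with each stored log. The dictionary layout, the replay helper, and the simplified function bodies are assumptions made for the example; they are not part of chttp.

```python
def touch_agent_1(entries):
    # Version 1 expects exactly one logged result: the __setitem__ return value.
    return len(entries) == 1


def touch_agent_2(entries):
    # Version 2 expects two: the persisted time, then the __setitem__ result.
    return len(entries) == 2


# Every version that may still have stored oplogs stays registered under its own name.
CODE_BY_NAME = {
    'touch_agent_1': touch_agent_1,
    'touch_agent_2': touch_agent_2,
}


def replay(stored):
    """Run a stored log with the code that generated it (hypothetical helper)."""
    name = stored['code']
    if name not in CODE_BY_NAME:
        raise LookupError('no code named %r; it was deleted or renamed too early' % name)
    return CODE_BY_NAME[name](stored['entries'])


old_log = {'code': 'touch_agent_1', 'entries': [None]}
new_log = {'code': 'touch_agent_2', 'entries': [1199490918.72645, None]}
old_ok = replay(old_log)
new_ok = replay(new_log)
```

Deleting touch_agent_1 from the registry is only safe once no stored log names it, and reusing a retired name for different code would silently reintroduce the original problem.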

Resumable Oplogs

For resumable oplogs, this is simple to do, since the function name is stored in the oplog itself. E.g.:

<python>op = persistence.oplog(touch_agent, 'a')
op.execute()
op.reset()
op._main[0]  # => <function touch_agent at 0xa7d4eed4></python>

The first entry in the oplog is a pickled version of the touch_agent method, so every time this oplog is re-executed it'll call the touch_agent method again. If you want to modify the touch_agent behavior, you have to write a new function with a different name.

<python># touch_agent version 1
def touch_agent_1(op, agent_id):
    touch_time = time.time()
    if touch_time != agents[agent_id]:
        op.persist(agents.__setitem__, agent_id, touch_time)


# touch_agent version 2
def touch_agent_2(op, agent_id):
    touch_time = op.persist(time.time)  # whoops, have to persist the time too
    if touch_time != agents[agent_id]:
        op.persist(agents.__setitem__, agent_id, touch_time)</python>

Changing all the call sites from touch_agent to touch_agent_2 would be a pain, so the oplog.resumable method exists to serve as a sort of symlink in these cases (as well as handling the creation and lifetime of the oplog).

<python>touch_agent = oplog.resumable(touch_agent_2)</python>

This way, all call sites can just refer to touch_agent, and the oplogs will store the correct underlying method. You'll never call touch_agent to resume a stored oplog, so there's no chance for confusion there.
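The alias idea can be sketched in a few lines. The resumable function here is a hypothetical toy, not the real oplog.resumable: it uses a plain list as the log, records the underlying function's name where the real oplog stores a pickled function, and does not manage the oplog's lifetime. The body of touch_agent_2 is simplified accordingly.

```python
import time

agents = {'a': 0}


def resumable(func):
    """Hypothetical alias factory: callers use the alias, logs record func."""
    def alias(*args):
        log = [func.__name__]  # stands in for the pickled function entry
        func(log, *args)
        return log
    return alias


def touch_agent_2(log, agent_id):
    # Simplified body: record the time, then the result of the write.
    touch_time = time.time()
    log.append(touch_time)
    agents[agent_id] = touch_time
    log.append(None)


# Call sites keep using the unversioned name.
touch_agent = resumable(touch_agent_2)
log = touch_agent('a')
recorded_name = log[0]
```

The log names touch_agent_2 even though the caller only ever wrote touch_agent, so pointing the alias at a later version does not affect logs that were already stored.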

Server Oplogs

Versioning the server code is a somewhat trickier problem. An incoming request is one of two things: a brand new request, which should use the latest version of the handler code, or a replayed request, which might have to use an older version of the handler code in order to complete successfully.

Our solution is to dispatch to the proper version of the code inside the handler. This imposes some additional constraints on the Resource object:

  • upon construction, the Resource takes in an argument 'old_versions' that contains a list of Resources implementing all previous versions of the code.
  • as little work as possible is done in the handle() method itself; instead, the handle() method mostly does version dispatch and then calls handle_impl() on the correct resource.

Here's an example containing two versioned server resources:

<python>class AgentToucher_1(server.Resource):

    def __init__(self, persister, database):
        self.database = database
        super(AgentToucher_1, self).__init__(persister)

    def handle_post(self, op, request):
        agent_id = request['agent_id']
        touch_time = time.time()
        if touch_time != self.database[agent_id]:
            op.persist(self.database.__setitem__, agent_id, touch_time)


class AgentToucher_2(server.Resource):

    _version = 2

    def __init__(self, persister, database):
        self.database = database
        old_versions = (AgentToucher_1(persister, database), )
        super(AgentToucher_2, self).__init__(persister, old_versions=old_versions)

    def handle_post(self, op, request):
        agent_id = request['agent_id']
        touch_time = op.persist(time.time)  # whoops, have to persist the time too
        if touch_time != self.database[agent_id]:
            op.persist(self.database.__setitem__, agent_id, touch_time)


AgentToucher = AgentToucher_2</python>

Notes:

  • the old code is kept around untouched
  • the version is set to 2 (the default is 1)
  • the new Resource is responsible for creating an instance of the older code and passing it to the base class constructor in the old_versions keyword argument

The Certified HTTP Resource then handles all the version dispatch for you. One important caveat is that only the latest Resource needs the old_versions argument. When you create a new version, either move the construction of old_versions into the latest resource, or make old_versions a constructor argument and populate it outside the constructors.
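The second option, populating old_versions outside the constructors, might look like the sketch below. ToyResource is a hypothetical base class written for this example; the real server.Resource may work differently, and the logged_version parameter is an assumption about how a replayed request could identify the version that first handled it.

```python
class ToyResource:
    """Hypothetical base class that dispatches on a version number."""

    _version = 1  # default, as in the notes above

    def __init__(self, old_versions=()):
        # Map version number -> resource implementing that version.
        self._by_version = {res._version: res for res in old_versions}
        self._by_version[self._version] = self

    def handle(self, request, logged_version=None):
        # New requests use the latest code; replays use the version they recorded.
        version = self._version if logged_version is None else logged_version
        return self._by_version[version].handle_impl(request)


class Toucher_1(ToyResource):
    def handle_impl(self, request):
        return 'v1 handled %s' % request['agent_id']


class Toucher_2(ToyResource):
    _version = 2

    def handle_impl(self, request):
        return 'v2 handled %s' % request['agent_id']


class Toucher_3(ToyResource):
    _version = 3

    def handle_impl(self, request):
        return 'v3 handled %s' % request['agent_id']


# Only the latest resource receives old_versions, built outside the constructors.
old_versions = (Toucher_1(), Toucher_2())
toucher = Toucher_3(old_versions=old_versions)

fresh = toucher.handle({'agent_id': 'a'})
replayed = toucher.handle({'agent_id': 'a'}, logged_version=1)
```

With this arrangement, adding a fourth version means adding one class and extending the old_versions tuple; none of the existing constructors need to change.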