Reliable Message Behavior

From Second Life Wiki
Revision as of 12:02, 13 July 2007 by Which Linden (talk | contribs) (→‎Workflow Option 2: 202 is noncommittal, 204 means "no content")

The basic goal of reliable messaging is exactly-once guaranteed delivery. That is, the sender can queue up a message that is guaranteed to be delivered at least once, and the receiving side has duplicate suppression logic that ensures that the code that handles the message only gets called once.
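A minimal sketch of the duplicate-suppression half of that goal, assuming message IDs are UUIDs and using an in-memory set where a real receiver would use a durable store. The class and method names (`DedupingReceiver`, `deliver`) are hypothetical and not part of any existing API.

```python
import uuid


class DedupingReceiver:
    """Invokes the handler at most once per message ID, however often it arrives."""

    def __init__(self, handler):
        self.handler = handler
        self.seen_ids = set()  # assumption: stands in for a durable store

    def deliver(self, message_id, body):
        if message_id in self.seen_ids:
            return False  # duplicate delivery: suppressed
        self.seen_ids.add(message_id)
        self.handler(body)
        return True


calls = []
receiver = DedupingReceiver(calls.append)
message_id = str(uuid.uuid4())

# An at-least-once sender may deliver the same message several times.
results = [receiver.deliver(message_id, "hello") for _ in range(3)]
```

Only the first delivery reaches the handler; the retries are recognised by ID and dropped.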

Assumptions

  • Reliable hosts will not be down forever. Either a clone will be brought up on different hardware, or the machine itself will reappear within a day or so.
  • Each reliable host can perform ACID operations on the data it contains.
  • Generating a globally unique message ID is inexpensive.
  • We can store an arbitrary amount of 'small' data (such as UUIDs, urls, and date/timestamps) for a "long" time.
  • Any transaction or data handled by this system will become useless well before a "long" time has elapsed. We haven't decided yet whether this means hours, days, or weeks. Any data that needs to be longer-lived will be communicated to another system, so any data in the reliable messaging system that is older than a "long" time can be deleted safely.
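The last assumption implies a periodic cleanup of old records. The sketch below shows one way that could look, assuming each stored record carries a `stored_at` timestamp in seconds. The function name, the record layout, and the `max_age_seconds` parameter are all hypothetical, and the actual length of a "long" time is left to the caller because the page has not fixed it.

```python
import time


def purge_expired(store, max_age_seconds, now=None):
    """Delete records older than max_age_seconds; return the IDs removed."""
    now = time.time() if now is None else now
    expired = [message_id for message_id, record in store.items()
               if now - record["stored_at"] > max_age_seconds]
    for message_id in expired:
        del store[message_id]
    return expired
```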

Workflows

There are two options, which differ slightly in their implementation and API.

Workflow Option 1

Sending a Message

The sending-side API looks very much like a normal HTTP method call:

 response = reliable_http.put(url, body)

What happens under the covers is:

  1. Generate a globally unique message ID for the message
  2. Store the outgoing request (headers and all), including the message ID, in a durable "outbox" store, and wait for a response.
  3. A potentially asynchronous process performs the following steps:
    1. Retrieves the request from the outbox.
    2. Performs the HTTP request specified by the outbox request and waits for a response.
    3. If a response is not forthcoming, for whatever reason, the process retries after a certain period.
    4. If the server sends an error code that indicates that the reliable message will never complete (e.g. 501), or a long timeout expires indicating that an absurd amount of time has elapsed, the method throws an exception.
  4. Opens a transaction on the durable store
  5. Stores the incoming response in a durable inbox.
  6. 'Tombstones' the message in the outbox, which essentially marks the message as having been received, so that if the application resumes again, it doesn't resend.
  7. Closes the transaction on the durable store
  8. If the response contains a header indicating a confirmation url on the recipient, performs an HTTP DELETE on the resource to ack the incoming message.

As with HTTP itself, there are no explicit semantics for the response body. The content will vary depending on the application.
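The sending steps above can be sketched roughly as follows. This is an illustration under several assumptions, not the actual `reliable_http` implementation: the durable store is a SQLite database, `transport` is a hypothetical callable that takes `(method, url, message_id, body)` and returns `(status, response_body, confirm_url)`, a fixed `max_attempts` stands in for the long timeout, and retries happen immediately rather than after a delay.

```python
import sqlite3
import uuid


class PermanentFailure(Exception):
    """The message can never complete (e.g. a 501, or too many attempts)."""


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS outbox "
               "(id TEXT PRIMARY KEY, url TEXT, body TEXT, tombstoned INTEGER DEFAULT 0)")
    db.execute("CREATE TABLE IF NOT EXISTS inbox (id TEXT PRIMARY KEY, body TEXT)")
    db.commit()
    return db


def reliable_put(db, transport, url, body, max_attempts=5):
    message_id = str(uuid.uuid4())                      # step 1
    with db:                                            # step 2: durable outbox
        db.execute("INSERT INTO outbox (id, url, body) VALUES (?, ?, ?)",
                   (message_id, url, body))
    for _ in range(max_attempts):                       # step 3: send, retrying
        try:
            status, response_body, confirm_url = transport("PUT", url, message_id, body)
        except OSError:
            continue  # no response; a real sender would wait before retrying
        if status == 501:
            raise PermanentFailure(f"{url} answered {status}")
        with db:                                        # steps 4-7: one transaction
            db.execute("INSERT INTO inbox (id, body) VALUES (?, ?)",
                       (message_id, response_body))
            db.execute("UPDATE outbox SET tombstoned = 1 WHERE id = ?", (message_id,))
        if confirm_url:                                 # step 8: ack via DELETE
            transport("DELETE", confirm_url, message_id, None)
        return response_body
    raise PermanentFailure(f"no response from {url} after {max_attempts} attempts")
```

Storing the response and tombstoning the outbox entry inside a single `with db:` block means both changes commit together or not at all, which is what prevents a resend after a crash between the two.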

Receiving a Message

The receiver sets up a node in the url hierarchy, just like a regular http node. When an incoming request comes in, the receiver:

  1. Stores the incoming request in a durable store "inbox", if it doesn't already contain a message with the same ID.
  2. A potentially asynchronous process performs the following steps:
    1. Looks for responses in the outbox matching the incoming message id, and if it finds one, sends it as the response without invoking anything else.
    2. Opens a transaction on the database, locking the inbox request
    3. Calls the handler method on the receiving node:
      1. def handle_put(body, txn):
        return "My Response"
      2. The handler method can use the open transaction to perform actions in the database that are atomic with the receipt of the message. Any non-idempotent operation must be done atomically in this way.
    4. Stores the return value of the handle method as an outgoing response in the outbox, without closing the transaction.
    5. Removes the incoming request from the inbox
    6. Closes the transaction
  3. Discovers a new item in the outbox, responds to the incoming http request with the response from the outbox.

There is no "tombstoning" of the message in the outbox.
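A rough sketch of the receiving side, again assuming a SQLite database as the durable store and a caller that has already extracted the message ID from the request. The function name `receive_put` is hypothetical. One ordering choice is an assumption of this sketch: the outbox is checked before the inbox insert, so that a message that has already been handled is not queued a second time.

```python
import sqlite3


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS inbox (id TEXT PRIMARY KEY, body TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS outbox (id TEXT PRIMARY KEY, body TEXT)")
    db.commit()
    return db


def receive_put(db, message_id, body, handler):
    # Step 2.1: a stored response means this message was already handled;
    # replay it without invoking the handler again.
    row = db.execute("SELECT body FROM outbox WHERE id = ?", (message_id,)).fetchone()
    if row is not None:
        return row[0]
    # Step 1: persist the request unless the inbox already holds this ID.
    with db:
        db.execute("INSERT OR IGNORE INTO inbox (id, body) VALUES (?, ?)",
                   (message_id, body))
    # Steps 2.2-2.6: the handler's writes, the stored response, and the inbox
    # removal commit together, or roll back together if the handler raises.
    with db:
        response = handler(body, db)
        db.execute("INSERT INTO outbox (id, body) VALUES (?, ?)", (message_id, response))
        db.execute("DELETE FROM inbox WHERE id = ?", (message_id,))
    return response  # step 3: answer the HTTP request from the outbox entry
```

If the handler raises, the transaction rolls back and the request stays in the inbox, so a later retry from the sender can run the handler again safely.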

Performing a Job

There is no special logic for performing a job. If a long-running job is to be performed, the Receiver simply delays its response until the job is complete. This may result in the Sender timing out and retrying, but that's OK because the Receiver will simply respond to whichever retry happens to occur at or after the time the response is put in the outbox.

Workflow Option 2

Sending a Message

The sending-side API looks like a message send:

 response = reliable_http.send(url, body)

What happens under the covers is:

  1. Generate a globally unique message ID for the message
  2. Store the outgoing request (headers and all), including the message ID, in a durable "outbox" store, and wait for an ack.
  3. A potentially asynchronous process performs the following steps:
    1. Retrieves the request from the outbox.
    2. Performs the HTTP request specified by the outbox request and waits for an ack.
    3. If an ack is not forthcoming, for whatever reason, the process retries after a certain period.
    4. If the server sends an error code that indicates that the reliable message will never complete (e.g. 501), or a long timeout expires indicating that an absurd amount of time has elapsed, the method throws an exception.
  4. When an ack is received, 'Tombstone' the message in the outbox, which essentially marks the message as having been acked, so that if the application resumes again, it doesn't resend.

The response body is explicitly content-free.
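The point of the tombstone is that a restarted sender skips messages that were already acked. A small sketch of that behaviour, assuming a plain dict as the outbox (a real one would be durable) and a hypothetical `post(url, message_id, body)` callable that returns the HTTP status code. Permanent errors such as 501 are left out for brevity.

```python
import uuid


def enqueue(outbox, url, body):
    """Steps 1-2: assign an ID and record the message as pending."""
    message_id = str(uuid.uuid4())
    outbox[message_id] = {"url": url, "body": body, "tombstoned": False}
    return message_id


def flush_outbox(outbox, post):
    """Try every pending message once; return the IDs that are still pending."""
    for message_id, entry in outbox.items():
        if entry["tombstoned"]:
            continue  # already acked: never resent
        try:
            status = post(entry["url"], message_id, entry["body"])
        except OSError:
            continue  # no ack; the message stays pending for the next pass
        if status in (202, 204):
            entry["tombstoned"] = True  # step 4
    return [mid for mid, entry in outbox.items() if not entry["tombstoned"]]
```

Calling `flush_outbox` repeatedly (for example, after a restart) gives the retry behaviour: unacked messages are sent again, acked ones are not.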

Receiving a Message

The receiver sets up a node in the url hierarchy, just like a regular http node. When an incoming request comes in, the receiver:

  1. Stores the incoming request in a durable store "inbox", if it doesn't already contain a message with the same ID.
  2. Responds with an ack (an empty http response with 204 status) to the requesting server.
  3. A potentially asynchronous process performs the following steps:
    1. Opens a transaction on the database, locking the inbox request
    2. Calls the handler method on the receiving node:
      1. def handle_reliable_message(body, txn):
        txn.perform_operations()
      2. The handler method can use the open transaction to perform actions in the database that are atomic with the receipt of the message. Any non-idempotent operation must be done atomically in this way.
    3. Tombstones the incoming request in the inbox
    4. Closes the transaction
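The receiver steps above split into an immediate "accept" and a later "process" phase. The sketch below illustrates that split, assuming a SQLite inbox. The function names `accept` and `process_inbox` are hypothetical; the handler signature follows `handle_reliable_message(body, txn)` from the steps.

```python
import sqlite3


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS inbox "
               "(id TEXT PRIMARY KEY, body TEXT, tombstoned INTEGER DEFAULT 0)")
    db.commit()
    return db


def accept(db, message_id, body):
    """Steps 1-2: persist the request (ignoring duplicates), then ack with 204."""
    with db:
        db.execute("INSERT OR IGNORE INTO inbox (id, body) VALUES (?, ?)",
                   (message_id, body))
    return 204


def process_inbox(db, handler):
    """Step 3: handle each pending request and tombstone it in one transaction."""
    pending = db.execute(
        "SELECT id, body FROM inbox WHERE tombstoned = 0").fetchall()
    for message_id, body in pending:
        with db:  # handler writes and the tombstone commit together
            handler(body, db)
            db.execute("UPDATE inbox SET tombstoned = 1 WHERE id = ?", (message_id,))
    return len(pending)
```

Because the tombstoned row stays in the inbox, a late duplicate of the same message is ignored by `INSERT OR IGNORE` and is never handled a second time.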

Performing a Job

A job differs from a message in that it is expected to take a "long" time, and return a result at the end, essentially an RPC. In Option 2, the job requires a separate http connection to report on its completion.

To perform a job, the API looks like:

 response = reliable_http.perform_job(url, local_url, body)

Under the covers, the Sender needs to perform these additional actions:

  • Add to the outgoing request a Response URL pointing back to the Sender (influenced by the local_url argument)
  • Before initiating the outgoing request, perform whatever bookkeeping is necessary to activate a reliable node at the Response URL
  • After receiving the ack from the first http request, tombstone the outgoing request in the outbox (the ack means the recipient has persistently stored the request)
  • Wait for a response on the Response URL node
  • When the Response URL is hit by the job Receiver, the Sender receives a reliable message (following steps above), persisting the request body and responding with an ack.
  • Return the body of the Response message sent by the Receiver as the return value of the perform_job function.

The Receiver performs the following additional actions:

  • When sending the ack, use status code 202 instead of 204.
  • Before tombstoning the incoming request, takes the return value from the handler method and stores it as the body of an outgoing message destined for the Response URL provided in the request.
  • In delivering the response message, the Receiver sends a reliable message (following steps above), with retries, and tombstoning the outgoing message after receiving an ack.
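The job exchange can be illustrated with an in-process simulation. Everything here is an assumption made for the sketch: a module-level `network` dict stands in for HTTP, the Response URL is made unique per job by appending a UUID to `local_url`, and `perform_job` is split into `start_job` and `collect_result` so the example can run without threads. Durable storage and retries are omitted.

```python
import uuid

network = {}  # URL -> inbox ({message_id: message}); stands in for HTTP nodes


def send(url, message):
    """Deliver a message to the node at url and return the ack status code."""
    network[url][str(uuid.uuid4())] = message
    return 202 if "response_url" in message else 204


def start_job(url, local_url, body):
    response_url = f"{local_url}/{uuid.uuid4()}"
    network[response_url] = {}  # activate the Response URL node before sending
    status = send(url, {"body": body, "response_url": response_url})
    return response_url, status


def run_jobs(url, handler):
    """Receiver side: run each pending job and message the result back."""
    for message in network[url].values():
        if message.get("done"):
            continue
        send(message["response_url"], {"body": handler(message["body"])})
        message["done"] = True  # tombstone the incoming request


def collect_result(response_url):
    """Sender side: return the job result once it has arrived, else None."""
    replies = list(network[response_url].values())
    return replies[0]["body"] if replies else None
```

The first request is acked with 202 because it carries a Response URL; the result travels back later as an ordinary message, which is acked with 204.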

Discussion

The major difference between the two options relates to the number of http requests made, and how long each is held open. Option 1 involves fewer http connections for short operations (the best case), but each one is held open for longer (the entire duration of the transaction on the recipient). Option 2 always involves 2 http connections, but each one is only open for as long as it takes to transmit and persist the bytes in the message.

Because you cannot expect a response from an Option 2 message, HTTP verbs that rely on a response, such as GET, make less sense. Option 2 is therefore a little less RESTful than Option 1.

There is no inherent conflict in implementing both Option 1 and Option 2. A sophisticated client could provide a response url if it expects the message handling to take a long time. Upon receipt of the request, the server could decide whether to deliver a return message in the response, or to immediately return a 202.
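One way a server might make that choice is sketched below. The function name, the request layout, and the `is_slow` predicate are hypothetical; how a server would actually estimate handling time is not specified here.

```python
def handle_request(request, handler, is_slow, deferred):
    """Answer inline (Option 1) or ack with 202 and defer the work (Option 2)."""
    if request.get("response_url") and is_slow(request):
        deferred.append(request)  # result is delivered later to the response URL
        return 202, ""
    return 200, handler(request["body"])
```

A request without a response URL is always answered inline, since the server has nowhere else to deliver the result.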