Difference between revisions of "Reliable Message Behavior"

From Second Life Wiki
The basic goal of reliable messaging is exactly-once guaranteed delivery.  That is, the sender can queue up a message that is guaranteed to be delivered at least once, and the receiving side has duplicate suppression logic that ensures that the code that handles the message only gets called once.
#REDIRECT [[Certified HTTP]]
 
= Assumptions =
 
* Reliable hosts will not be down forever.  Either a clone will be brought up on different hardware, or the machine itself will reappear within a day or so.
* Each reliable host can perform ACID operations on the data it contains.
* Generating a globally unique message ID is inexpensive.
* We can store an arbitrary amount of 'small' data (such as UUIDs, URLs, and date/timestamps) for a "long" time.
* Any transaction or data handled by this system will become useless well before a "long" time has elapsed.  We haven't decided yet whether this means hours, days, or weeks.  Any data that needs to be longer-lived will be communicated to another system, so any data in the reliable messaging system that is older than a "long" time can be deleted safely.
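
The ID and retention assumptions can be illustrated with a minimal sketch. The names <code>new_message_id</code>, <code>purge_expired</code> and <code>LONG_TIME_SECONDS</code> are hypothetical, and the one-week window is only a placeholder, since the length of a "long" time is still undecided.

```python
import time
import uuid

# Placeholder retention window (one week); the actual "long" time is undecided.
LONG_TIME_SECONDS = 7 * 24 * 3600


def new_message_id():
    """Return a globally unique message ID without any coordination."""
    return str(uuid.uuid4())


def purge_expired(records, now=None):
    """Return only the records that are still inside the retention window.

    `records` maps a message ID to its creation time (seconds since epoch).
    """
    now = time.time() if now is None else now
    return {mid: ts for mid, ts in records.items()
            if now - ts <= LONG_TIME_SECONDS}
```

Random UUIDs are one inexpensive way to get unique IDs; any other scheme with the same guarantee would do.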
 
= Workflows =
 
There are two options, which differ slightly in their implementation and API.
 
== Workflow Option 1 ==
 
=== Sending a Message ===
 
The sending-side API looks very much like a normal HTTP method call:
  response = reliable_http.put(url, body)
 
What happens under the covers is:
# Generate a globally unique message ID for the message
# Store the outgoing request (headers and all), including the message ID, in a durable store "outbox", and wait for a response.
# A potentially asynchronous process performs the following steps:
## Retrieves the request from the outbox.
## Performs the HTTP request specified by the outbox request and waits for an ack.
## If an ack is not forthcoming, for whatever reason, the process retries after a certain period.
## If the server sends an error code that indicates that the reliable message will never complete (e.g. 501), or a long timeout expires indicating that an absurd amount of time has elapsed, the method throws an exception.
# When a response is received, 'tombstone' the message in the outbox, which essentially marks the message as having been acked, so that if the application resumes again, it doesn't resend.
 
We don't have any explicit semantics for the response body.
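
A rough sketch of the Option 1 sending steps, assuming an SQLite table as the durable outbox. The function <code>reliable_put</code>, the <code>transport</code> callable (which stands in for the real HTTP request) and <code>max_attempts</code> (which stands in for the long timeout) are all hypothetical; none of them is the actual <code>reliable_http</code> implementation.

```python
import sqlite3
import uuid


class PermanentFailure(Exception):
    """Raised when the reliable message can never complete."""


def open_outbox(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "message_id TEXT PRIMARY KEY, url TEXT, body TEXT, "
        "response TEXT, acked INTEGER NOT NULL DEFAULT 0)"
    )
    db.commit()
    return db


def reliable_put(db, transport, url, body, max_attempts=5):
    message_id = str(uuid.uuid4())
    # Store the request durably before touching the network.
    with db:
        db.execute(
            "INSERT INTO outbox (message_id, url, body) VALUES (?, ?, ?)",
            (message_id, url, body),
        )
    for _ in range(max_attempts):
        try:
            # transport returns (status, response_body) or raises OSError.
            status, response = transport(url, message_id, body)
        except OSError:
            continue  # no response; a real sender would wait before retrying
        if status == 501:
            raise PermanentFailure("receiver will never accept " + message_id)
        if status == 200:
            # Tombstone: mark as acked so a restart does not resend.
            with db:
                db.execute(
                    "UPDATE outbox SET response = ?, acked = 1 "
                    "WHERE message_id = ?",
                    (response, message_id),
                )
            return response
    raise PermanentFailure("gave up on " + message_id)
```

Every retry reuses the same message ID, which is what lets the receiving side suppress duplicates.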
 
=== Receiving a Message ===
 
The receiver sets up a node in the URL hierarchy, just like a regular HTTP node.  When a request comes in, the receiver:
 
# Stores the incoming request in a durable store "inbox", if it doesn't already contain a message with the same ID.
# A potentially asynchronous process performs the following steps:
## Looks for responses in the outbox matching the incoming message id, and if it finds one, sends it as the response without invoking anything else.
## Opens a transaction on the database, locking the inbox request
## Calls the handler method on the receiving node:
### <code>def handle_put(body, txn):<br/>&nbsp;&nbsp;&nbsp;&nbsp;return "No response"</code>
### The handler method can use the open transaction to perform actions in the database that are atomic with the receipt of the message.  Any non-idempotent operation must be done atomically in this way.
## Stores the return value of the handler method as an outgoing response in the outbox, without closing the transaction
## Removes the incoming request from the inbox
## Closes the transaction
# On discovering the new item in the outbox, responds to the incoming HTTP request with the response from the outbox.
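
The receiving steps above might look roughly like the following, assuming SQLite as the ACID store and a synchronous handler call. The table names and the <code>receive</code> function are hypothetical. As a small simplification, the sketch looks for a stored response before re-storing the request, so a retry leaves no stray inbox row.

```python
import sqlite3


def open_store():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE inbox (message_id TEXT PRIMARY KEY, body TEXT)")
    db.execute("CREATE TABLE responses (message_id TEXT PRIMARY KEY, body TEXT)")
    db.commit()
    return db


def receive(db, message_id, body, handler):
    # Duplicate suppression: a stored response is replayed as-is,
    # without invoking the handler again.
    row = db.execute(
        "SELECT body FROM responses WHERE message_id = ?", (message_id,)
    ).fetchone()
    if row is not None:
        return row[0]
    # Durably record the request unless it is already there.
    with db:
        db.execute("INSERT OR IGNORE INTO inbox VALUES (?, ?)",
                   (message_id, body))
    # One transaction covers the handler's work, storing the response,
    # and removing the request; an exception rolls all three back.
    with db:
        response = handler(body, db)
        db.execute("INSERT INTO responses VALUES (?, ?)",
                   (message_id, response))
        db.execute("DELETE FROM inbox WHERE message_id = ?", (message_id,))
    return response
```

If the handler raises, the request stays in the inbox and nothing the handler wrote is kept, so a later retry starts from a clean state.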
 
=== Performing a Job ===
 
There is no special logic for performing a job.  If a long-running job is to be performed, the Receiver simply delays its response until the job is complete.  This may result in the Sender timing out and retrying, but that's OK because the Receiver will simply respond to whichever retry happens to occur at or after the time the response is put in the outbox.
 
== Workflow Option 2 ==
 
=== Sending a Message ===
 
The sending-side API looks like a message send:
  response = reliable_http.send(url, body)
 
What happens under the covers is:
# Generate a globally unique message ID for the message
# Store the outgoing request (headers and all), including the message ID, in a durable store "outbox", and wait for an ack.
# A potentially asynchronous process performs the following steps:
## Retrieves the request from the outbox.
## Performs the HTTP request specified by the outbox request and waits for an ack.
## If an ack is not forthcoming, for whatever reason, the process retries after a certain period.
## If the server sends an error code that indicates that the reliable message will never complete (e.g. 501), or a long timeout expires indicating that an absurd amount of time has elapsed, the method throws an exception.
# When an ack is received, 'Tombstone' the message in the outbox, which essentially marks the message as having been acked, so that if the application resumes again, it doesn't resend.
 
The response body is explicitly content-free.
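
One way to picture the asynchronous delivery process in Option 2 is a pass over the outbox that sends every message not yet tombstoned. This is only a sketch: <code>enqueue</code>, <code>deliver_pending</code> and the <code>transport</code> callable (standing in for the HTTP request) are hypothetical names, and SQLite is an assumed choice of durable store.

```python
import sqlite3
import uuid


def open_outbox():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE outbox (message_id TEXT PRIMARY KEY, url TEXT, "
        "body TEXT, tombstoned INTEGER NOT NULL DEFAULT 0)"
    )
    db.commit()
    return db


def enqueue(db, url, body):
    """Durably queue a message and return its ID."""
    message_id = str(uuid.uuid4())
    with db:
        db.execute(
            "INSERT INTO outbox (message_id, url, body) VALUES (?, ?, ?)",
            (message_id, url, body),
        )
    return message_id


def deliver_pending(db, transport):
    """Run one delivery pass; return how many messages were acked."""
    acked = 0
    pending = db.execute(
        "SELECT message_id, url, body FROM outbox WHERE tombstoned = 0"
    ).fetchall()
    for message_id, url, body in pending:
        try:
            status = transport(url, message_id, body)
        except OSError:
            continue  # no ack; the next pass retries this message
        if status == 200:
            # The ack carries no content; just tombstone the message.
            with db:
                db.execute(
                    "UPDATE outbox SET tombstoned = 1 WHERE message_id = ?",
                    (message_id,),
                )
            acked += 1
    return acked
```

Because the tombstone lives in the durable store, a process that restarts and runs another pass skips messages that were already acked.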
 
=== Receiving a Message ===
 
The receiver sets up a node in the URL hierarchy, just like a regular HTTP node.  When a request comes in, the receiver:
 
# Stores the incoming request in a durable store "inbox", if it doesn't already contain a message with the same ID.
# Responds with an ack (an empty HTTP response with a 200 code) to the requesting server.
# A potentially asynchronous process performs the following steps:
## Opens a transaction on the database, locking the inbox request
## Calls the handler method on the receiving node:
### <code>def handle_reliable_message(body, txn):<br/>&nbsp;&nbsp;&nbsp;&nbsp;txn.perform_operations()</code>
### The handler method can use the open transaction to perform actions in the database that are atomic with the receipt of the message.  Any non-idempotent operation must be done atomically in this way.
## Tombstones the incoming request in the inbox
## Closes the transaction
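
A minimal sketch of the Option 2 receiver, again assuming SQLite as the ACID store. The functions <code>accept</code> and <code>process_inbox</code> are hypothetical names: the first stores and acks, the second is the asynchronous step that calls the handler.

```python
import sqlite3


def open_inbox():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE inbox (message_id TEXT PRIMARY KEY, body TEXT, "
        "tombstoned INTEGER NOT NULL DEFAULT 0)"
    )
    db.commit()
    return db


def accept(db, message_id, body):
    """Store the request unless its ID is already known, then ack."""
    with db:
        db.execute(
            "INSERT OR IGNORE INTO inbox (message_id, body) VALUES (?, ?)",
            (message_id, body),
        )
    return 200  # the ack is an empty response with a 200 code


def process_inbox(db, handler):
    """Handle every message not yet tombstoned; return how many were found."""
    pending = db.execute(
        "SELECT message_id, body FROM inbox WHERE tombstoned = 0"
    ).fetchall()
    for message_id, body in pending:
        # The handler's work and the tombstone commit together or not at all.
        with db:
            handler(body, db)
            db.execute(
                "UPDATE inbox SET tombstoned = 1 WHERE message_id = ?",
                (message_id,),
            )
    return len(pending)
```

Keeping the tombstoned row rather than deleting it is what lets a late retry of the same message be acked without being handled a second time.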
 
=== Performing a Job ===
 
A job differs from a message in that it is expected to take a "long" time, and return a result at the end.  In Option 2, the job requires a separate HTTP connection to report on its completion.
 
To perform a job, the API looks like:
  response = reliable_http.perform_job(url, local_url, body)
Under the covers, the Sender needs to perform these additional actions:
* Add to the outgoing request a Response URL pointing back to the Sender (influenced by the local_url argument)
* Before initiating the outgoing request, perform whatever bookkeeping is necessary to activate a reliable node at the Response URL
* After receiving the ack from the first http request, tombstone the outgoing request in the outbox (the ack means the recipient has persistently stored the request)
* Wait for a response on the Response URL node
* When the Response URL is hit by the job Receiver, the Sender receives a reliable message (following steps above), persisting the request body and responding with an ack.
* Return the body of the Response message sent by the Receiver as the return value of the <code>perform_job</code> function.
 
The Receiver performs the following additional actions:
* Before tombstoning the incoming request, takes the return value from the handler method and stores it as the body of an outgoing message destined for the Response URL provided in the request.
* To deliver the response message, the Receiver sends a reliable message (following steps above), with retries, and tombstones the outgoing message after receiving an ack.
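
The two-message job exchange can be traced with an in-memory toy. Everything here is hypothetical: <code>Node</code> stands in for a reliable host, dictionaries stand in for the durable stores, deleting an outbox entry stands in for tombstoning it, and <code>route</code> stands in for resolving a Response URL to a host.

```python
import uuid


class Node:
    """In-memory stand-in for one reliable host."""

    def __init__(self):
        self.inbox = {}       # message ID -> received message
        self.handled = set()  # IDs of tombstoned inbox messages
        self.outbox = {}      # message ID -> message not yet acked

    def deliver(self, message):
        """Persist the message unless already seen, then ack."""
        self.inbox.setdefault(message["id"], message)
        return 200


def request_job(sender, receiver, response_url, body):
    """Sender: send the job request, carrying the Response URL."""
    request = {"id": str(uuid.uuid4()),
               "response_url": response_url, "body": body}
    sender.outbox[request["id"]] = request
    if receiver.deliver(request) == 200:
        del sender.outbox[request["id"]]  # acked, so tombstone it
    return request["id"]


def run_jobs(receiver, handler):
    """Receiver: run pending jobs and queue each result for delivery."""
    for message_id, request in receiver.inbox.items():
        if message_id in receiver.handled:
            continue
        reply = {"id": str(uuid.uuid4()), "url": request["response_url"],
                 "body": handler(request["body"])}
        # Store the result as an outgoing message before tombstoning.
        receiver.outbox[reply["id"]] = reply
        receiver.handled.add(message_id)


def flush_outbox(node, route):
    """Deliver queued messages; drop each one once it is acked."""
    for message_id, message in list(node.outbox.items()):
        if route(message["url"]).deliver(message) == 200:
            del node.outbox[message_id]


def job_result(sender, response_url):
    """Sender: return the result body once it has arrived, else None."""
    for message in sender.inbox.values():
        if message.get("url") == response_url:
            return message["body"]
    return None
```

A real <code>perform_job</code> would block until <code>job_result</code> has something to return; the toy leaves that waiting to the caller.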

Latest revision as of 21:10, 11 July 2008
