Issue21

Issue Title: Node restart handling
Document: GIMPS Protocol Specification v05
Section: n/a
Category: Technical
Priority: Must Fix
Status: Closed

Created on 2005-01-17.12:27:49 by reh, last changed 2005-05-13.14:20:53.

Messages
msg129 Author: reh Date: 2005-05-13.14:20:53
OK: there are actually 3 parts to this problem:

(1) detecting the peer has restarted
(2) recovering routing and messaging association state at the GIMPS level
(3) providing the service of informing the NSLPs that a restart has occurred

For (1), it looks like there are three parts to the solution:
- for MAs with reliable transport, rely on the transport protocol to detect
that it has been reset
- for MAs with unreliable transport, allow the GIMPS-MA-Hello message to be used
as a keepalive probe
- for D-mode messaging, the Query can be used as a keepalive probe (from the
initiating node), or in general there is the "No Routing State" response to
other messages
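
As a rough illustration of the second bullet, here is a minimal sketch of a
keepalive timer for an MA over unreliable transport. The class name, the
probe interval and the retry limit are all assumptions made for the example;
none of them come from the specification. Note that "peer_lost" only means
the peer stopped answering; a restart is one possible cause.

```python
class HelloKeepalive:
    """Hypothetical keepalive timer for one messaging association (MA).

    The caller drives it with explicit timestamps, so it has no dependency
    on a real clock or on any GIMPS message encoding.
    """

    def __init__(self, probe_interval: float, max_unanswered: int, now: float = 0.0):
        self.probe_interval = probe_interval    # assumed: seconds of silence before probing
        self.max_unanswered = max_unanswered    # assumed: probes allowed without any reply
        self._last_event = now                  # last peer activity or last probe sent
        self._unanswered = 0

    def on_peer_activity(self, now: float) -> None:
        """Any message from the peer (including a Hello reply) resets the timer."""
        self._last_event = now
        self._unanswered = 0

    def tick(self, now: float) -> str:
        """Return "idle", "send_hello" or "peer_lost" for the current time."""
        if now - self._last_event < self.probe_interval:
            return "idle"
        if self._unanswered >= self.max_unanswered:
            return "peer_lost"
        self._unanswered += 1
        self._last_event = now
        return "send_hello"
```

On "peer_lost" the implementation would tear down the MA and fall back on the
normal recovery described under (2).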

For (2) it seems that - given the error messages above - once the error has been
detected, a sensible GIMPS implementation can recover promptly using the
standard protocol features. Exactly how to schedule the various stages of the
recovery should probably be left up to the implementation.

For (3), one possibility would be to introduce the concept of an Epoch, probably
associated with the peer-identity (part of the Node-Addressing object in version
-05). Each NSLP-Data message would be associated with an Epoch, and when the
Epoch changed the receiver would know that a peer had restarted. GIMPS could
provide this as a common service to all NSLPs, at the cost of imposing some
marginal extra complexity on the protocol and overhead on the D-mode messages,
and an extra requirement on the implementation. (There are several ways of
generating the Epoch: a boot counter read/updated in non-volatile store, a clock
read at boot, a random number.)
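
To make the Epoch idea concrete, here is a minimal sketch of how a receiver
could track it per peer-identity and notify NSLPs when it changes. Everything
named here (EpochTracker, the 32-bit width, the callback interface) is an
assumption for illustration only; the actual encoding and API would still
need to be defined.

```python
import os
import time
from typing import Callable, Dict, List


def epoch_from_random() -> int:
    """Random Epoch drawn once at start-up (32 bits is an assumed width)."""
    return int.from_bytes(os.urandom(4), "big")


def epoch_from_clock() -> int:
    """Epoch taken from a clock read at boot (seconds since the Unix epoch)."""
    return int(time.time())


class EpochTracker:
    """Hypothetical receiver-side record of the last Epoch seen per peer."""

    def __init__(self) -> None:
        self._last_seen: Dict[bytes, int] = {}
        self._listeners: List[Callable[[bytes], None]] = []

    def subscribe(self, callback: Callable[[bytes], None]) -> None:
        """Register an NSLP callback to be told when a peer restarts."""
        self._listeners.append(callback)

    def observe(self, peer_identity: bytes, epoch: int) -> bool:
        """Record the Epoch carried by a message; return True on a change."""
        previous = self._last_seen.get(peer_identity)
        self._last_seen[peer_identity] = epoch
        restarted = previous is not None and previous != epoch
        if restarted:
            for callback in self._listeners:
                callback(peer_identity)
        return restarted
```

With a random Epoch there is a small chance that a restarted peer picks the
same value again, in which case this check would miss the restart; a boot
counter in non-volatile store avoids that at the cost of requiring the store.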

At the moment, my assumption is that the first two parts of the issue are closed
by the current round of -06 updates, and the last will be made a new open issue.
msg112 Author: caoun Date: 2005-04-21.13:03:20
There are two main cases where an NE could be impacted by the restart of a node:
the restarting node could either be upstream or downstream. 
*For a downstream node restart: If the restart happens after a messaging
association was established, the downstream node will drop messaging association
packets (possibly without sending any ICMP error messages). If the messaging
association in use has reliable delivery capabilities, it would detect that
the message can't be sent and, after the maximum number of retries, take down
the messaging association(s) and trigger a Message Routing State (MRS) entry
refresh (possibly for all the entries that have that downstream node, which
might have a timing impact. The decision on refreshing all the entries with
that downstream node vs. only one needs to be based on the MRS table size and
the frequency of messages).
If the restart happens after message routing state was installed and datagram
mode messaging is used, then the downstream node should send an error indicating
that it has no MRS for that flow (there are some hints for this error in section
5.3 but no more; it will need to be defined), and the node would refresh the
MRS in its downstream MRS table (as above, either a single-entry refresh or a
refresh of all entries with that downstream node).
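
The single-entry vs. all-entries choice could look roughly like the sketch
below. The table layout, the string flow and peer identifiers and the
refresh_all flag are assumptions for illustration; how the flag is derived
from table size and message frequency is left open here, as it is above.

```python
from typing import Dict, List


class DownstreamMrsTable:
    """Hypothetical downstream MRS table: flow identifier -> downstream peer."""

    def __init__(self) -> None:
        self._peer_by_flow: Dict[str, str] = {}

    def install(self, flow_id: str, peer: str) -> None:
        """Record that messages for flow_id are routed to the given peer."""
        self._peer_by_flow[flow_id] = peer

    def entries_to_refresh(self, flow_id: str, refresh_all: bool) -> List[str]:
        """Flows to refresh after an error (or MA failure) seen on flow_id.

        refresh_all=False refreshes only the flow that triggered the error;
        refresh_all=True refreshes every flow sharing its downstream peer.
        """
        peer = self._peer_by_flow.get(flow_id)
        if peer is None:
            return []  # no state for this flow, so nothing to refresh
        if not refresh_all:
            return [flow_id]
        return sorted(f for f, p in self._peer_by_flow.items() if p == peer)
```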

*For an upstream node restart: If the restart happens after messaging
associations were established, then when the upstream node sends a query message
for a well known flow associated with an entry in the upstream message routing
state, there will be issues when responding to the upstream node over the
(expected to be) installed messaging association. If the messaging association
has reliable messaging capabilities, it would detect the problem and request
removal of the MA from the MA table as well as of the entry in the upstream
message routing state table. The response message would then be sent using
datagram mode. If the node needs to send an asynchronous message (an error
message due to system issues), it will retransmit until giving up, terminate
the MA and request removal of the flow's upstream MRS from the MRS table. In
the case of datagram mode there are no issues unless asynchronous messages need
to be sent before the upstream node sends a query. If the upstream node receives
the message before sending a query message for the flow, an error should be
returned (the same one as above could be used). After receiving the error, the
node should remove the message routing state (as above, either all of them or
only the one just discovered) with the upstream node.
msg60 Author: admin Date: 2005-03-02.16:10:47
[updated to refer to -05 spec.]
msg22 Author: reh Date: 2005-01-17.12:27:49
The protocol specification currently has no way to identify if a node
(signalling peer) has restarted, or what to do if such a condition is identified.
History
Date                 User   Action  Args
2005-05-13 14:20:54  reh    set     status: Pending -> Closed
                                    messages: + msg129
2005-04-21 13:03:20  caoun  set     status: No Discussion -> Pending
                                    messages: + msg112
2005-03-02 16:10:47  admin  set     document: GIMPS Protocol Specification v04 -> GIMPS Protocol Specification v05
                                    messages: + msg60
2005-01-17 12:27:49  reh    create