OK: there are actually 3 parts to this problem:
(1) detecting the peer has restarted
(2) recovering routing and messaging association state at the GIMPS level
(3) providing the service of informing the NSLPs that a restart has occurred
For (1), it looks like there are three parts to the solution:
- for MAs with reliable transport, rely on the transport protocol to detect that
the connection has been reset
- for MAs with unreliable transport, allow the GIMPS-MA-Hello message to be used
as a keepalive probe
- for D-mode messaging, the Query can be used as a keepalive probe (from the
initiating node), or in general there is the "No Routing State" response to
other messages
For (2), it seems that, given the error messages above, once the error has been
detected a sensible GIMPS implementation can recover promptly using the
standard protocol features. Exactly how to schedule the various stages of the
recovery should probably be left up to the implementation.
For (3), one possibility would be to introduce the concept of an Epoch, probably
associated with the peer-identity (part of the Node-Addressing object in version
-05). Each NSLP-Data message would be associated with an Epoch, and when the
Epoch changed, the receiver would know that the peer had restarted. GIMPS could
provide this as a common service to all NSLPs, at the cost of imposing some
marginal extra complexity on the protocol and overhead on the D-mode messages,
and an extra requirement on the implementation. (There are several ways of
generating the Epoch: a boot counter read/updated in non-volatile store, a clock
read at boot, a random number.)
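
To make the Epoch idea concrete, here is a minimal sketch of the three generation options and of the receiver-side check. It is only an illustration under stated assumptions: the function names, the `EpochTracker` class, the use of a plain file as the non-volatile store, the 32-bit size of the random Epoch, and the string form of the peer-identity are all hypothetical and not taken from the draft.

```python
import secrets
import time
from typing import Callable, Dict


def epoch_from_boot_counter(path: str) -> int:
    """Boot counter read/updated in non-volatile store (here: a file, by assumption)."""
    try:
        with open(path) as f:
            count = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        count = 0  # first boot, or unreadable store
    count += 1
    with open(path, "w") as f:
        f.write(str(count))
    return count


def epoch_from_clock() -> int:
    """Clock read once at boot."""
    return int(time.time())


def epoch_from_random() -> int:
    """Random number drawn once at boot (32 bits is an arbitrary choice here)."""
    return secrets.randbits(32)


class EpochTracker:
    """Receiver side: remembers the last Epoch seen per peer-identity."""

    def __init__(self, on_restart: Callable[[str], None]) -> None:
        self._last_seen: Dict[str, int] = {}
        # Hypothetical hook through which the NSLPs would be informed.
        self._on_restart = on_restart

    def observe(self, peer_identity: str, epoch: int) -> bool:
        """Record the Epoch carried with a message; return True if the peer restarted."""
        previous = self._last_seen.get(peer_identity)
        self._last_seen[peer_identity] = epoch
        if previous is not None and previous != epoch:
            self._on_restart(peer_identity)
            return True
        return False
```

The check is for inequality rather than ordering, so it works for all three generators. A random Epoch has a small chance of repeating across a restart, and a clock read at boot depends on the clock being sane at that point; the boot counter avoids both at the price of needing the non-volatile store.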
At the moment, my assumption is that the first two parts of the issue are closed
by the current round of -06 updates, and the last will be made a new open issue.