Issue 191

Title: Routing State Errors
Document: GIST Protocol Specification v11
Section: 6
Category: Technical
Priority: Must Fix
Status: Text Proposed

Created on 2007-02-21.18:51:15 by reh, last changed 2007-02-22.11:15:50.

Messages
msg519 Author: reh Date: 2007-02-22.11:15:50
Created a new section 4.4.5 specifically addressing the issue, and trimmed
associated text in 4.3.2/5.3.3/6. The new section is as follows:

4.4.5.  Routing State Failures

   A GIST node can receive a message from a GIST peer, which can only be
   correctly processed in the context of some routing state, but where
   no corresponding routing state exists.  Cases where this can arise
   include:

   o  Where the message is random traffic from an attacker, or
      backscatter (responses to such traffic).

   o  Where routing state has been correctly installed but the peer has
      since lost it, for example because of aggressive timer settings at
      the peer, or because the node has crashed and restarted.

   o  Where the routing state has never been correctly installed in the
      first place, but the sending node does not know this.  This can
      happen if the Confirm message of the handshake is lost.

   It is important for GIST to recover from such situations promptly
   where they represent genuine errors (node restarts, or lost messages
   which would not otherwise be retransmitted).  Note that only
   Response, Confirm, Error and Data messages ever require routing state
   to exist, and these are considered in turn:

   Response:  A Response can be received at a node which never sent (or
      has forgotten) the corresponding Query.  If the node wants routing
      state to exist, it will initiate it itself; a diagnostic error
      would not allow the sender of the Response to take any corrective
      action, and the diagnostic could itself be a form of backscatter.
      Therefore, an error message MUST NOT be generated, but the
      condition MAY be logged locally.

   Confirm:  For a Responding node which implements delayed state
      installation, this is normal behaviour, and routing state will be
      created provided the Confirm is validated.  Otherwise, this is a
      case of a non-existent or forgotten Response, and the node may not
      have sufficient information in the Confirm to create the correct
      state.  The requirement is to notify the Querying node so that it
      can recover the routing state.

   Data:  This arises when a node receives Data where routing state is
      required, but either it does not exist at all, or it has not been
      finalised (no Confirm message).  To avoid Data being black-holed,
      a notification must be sent to the peer.

   Error:  Some error messages can only be interpreted in the context of
      routing state.  However, the only error messages which require a
      response within the protocol are routing state error messages
      themselves.  Therefore, this case should be treated the same as a
      Response: an error message MUST NOT be generated, but the
      condition MAY be logged locally.

   For the case of Confirm or Data messages, if the state is required
   but does not exist, the node MUST reject the incoming message with a
   "No Routing State" error message (Appendix A.4.4.5).  There are then
   three cases at the receiver of the error message:

   No routing state:  The condition MAY be logged but a reply MUST NOT
      be sent (see above).

   Querying node:  The node MUST restart the GIST handshake from the
      beginning.

   Responding node:  The node MUST delete its own routing state and
      SHOULD report an error condition to the local signalling
      application.
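
As an illustration only, the rules above for messages that arrive without
matching routing state, and for the receiver of a "No Routing State" error,
can be sketched as two small decision functions. This is a minimal sketch and
not part of the proposed text: the names Msg, Role, on_missing_state,
on_no_routing_state_error and the confirm_creates_state flag are hypothetical,
and the action strings are placeholders for whatever an implementation does.

```python
from enum import Enum, auto


class Msg(Enum):
    """Message types named in the text (hypothetical enum for this sketch)."""
    QUERY = auto()
    RESPONSE = auto()
    CONFIRM = auto()
    DATA = auto()
    ERROR = auto()


class Role(Enum):
    """Role of the node that receives a "No Routing State" error."""
    NONE = auto()        # no matching routing state exists
    QUERYING = auto()
    RESPONDING = auto()


def on_missing_state(msg, confirm_creates_state=False):
    """Action for a message that needs routing state which does not exist.

    confirm_creates_state is an assumption of this sketch: it stands for
    "delayed state installation is in use and the Confirm was validated".
    """
    if msg in (Msg.RESPONSE, Msg.ERROR):
        # An error MUST NOT be generated; the condition MAY be logged.
        return "log-only"
    if msg is Msg.CONFIRM and confirm_creates_state:
        # Normal behaviour with delayed state installation.
        return "install-state"
    if msg in (Msg.CONFIRM, Msg.DATA):
        # Reject with a "No Routing State" error so the peer can recover.
        return "send-no-routing-state"
    # Only Response, Confirm, Error and Data ever require routing state.
    return "not-applicable"


def on_no_routing_state_error(role):
    """Actions at the receiver of a "No Routing State" error."""
    if role is Role.NONE:
        return ["log-only"]                  # a reply MUST NOT be sent
    if role is Role.QUERYING:
        return ["restart-handshake"]         # MUST restart from the beginning
    # Responding node: MUST delete state, SHOULD report to the application.
    return ["delete-routing-state", "report-to-application"]
```

For example, on_missing_state(Msg.DATA) selects the error reply, while
on_missing_state(Msg.RESPONSE) never produces one, which is what prevents the
diagnostic itself from becoming backscatter.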

   The rules at the Querying or Responding node make GIST open to
   disruption by randomly injected error messages, similar to blind
   reset attacks on TCP (cf. [45]), although because routing state
   matching includes the SID this is mainly limited to on-path
   attackers.  If a GIST node detects a significant rate of such
   attacks, it MAY adopt a policy of using secured messaging
   associations to communicate for the affected MRIs, and only accepting
   "No Routing State" error messages over such associations.
msg518 Author: reh Date: 2007-02-21.18:54:11
And the full follow up email thread (relevant parts):

>> Section 6.1 and Section 8.4
>> A DoS attack can come from any node in the network.
>> Such a node can send a Q-mode Query message as though it
>> was forwarding it at the GIST level. The result, from any
>> GIST capable node that intercepts it, may be a D-mode
>> Response direct to the (spoofed) Query node. The Query
>> node will handle the Response (according to Rule 2) by
>> sending a "No Routing State" error message.
>> Please consider whether a Responder node should be
>> recommended to keep track of such events and rate limit
>> processing of new Query messages under such circumstances.
>
>Interesting question. This falls into the general class
>of attack detection/countermeasures, which I believe should
>be considered by protocol implementors but do not need to
>be specified, at least not normatively. Essentially I don't
>see a clear boundary to the level of complexity that should
>be captured: a simple implementation could do nothing, a
>complex implementation could do what is described above and
>considerably more. So it is not clear what level of detail
>to capture.

Sort of OK.
My concern is that the protocol spec says "MUST respond" etc. when
describing receiving a message. So an implementation has no choice to make
this kind of sensible control action if it wants to be conformant.

Actually, thinking about this a bit further, since GIST is soft state, why
should a node respond to an uncorrelated Response at all? It is either a
stray message (from a race condition), a bug, or an attack. I wonder if it
might be better to ignore such messages.

>> Section 6.2
>> Can I receive er_NoRSM in Awaiting Refresh state?
>> I think so because my sent Query refresh may have been
>> lost, the peer may have timed out the state, and I may
>> have sent Data.
>
>I hope not: since Queries are retransmitted, the node should
>have dropped out of Awaiting Refresh to Death according to
>rule 2 before the peer times out the state.

Oh, but I think this is very unclear from the definition of the timers. What
you are saying is that a node that generates Queries MUST execute the
to_NoResponse[nResp_reached] before its neighbour executes to_Expire_RNode.

Since this would be hard to police, I assumed that it is possible that the
neighbour may time out first. Normally this is no problem, because the next
refresh Query will simply re-initiate state. But if data is sent in the
intervening time, then an error (i.e. lost data) will be reported by the
responder.

I guess you either have to fix the description of the timers or allow the
event in the state.

>Of course, the peer could have crashed and rebooted. In
>which case the answer is yes. But in that case the correct
>transition would be back to Awaiting Response (needs to be
>added, I think).  See further below (after your '100 messages'
>comment).

Well, even in this case, is that the correct transition? It is true that a
new Query will establish the state at the responder and all will be well,
but what of the lost data? Recall, you have promised the NSLP that you will
reliably deliver the data, and you have got it to the next GIST node, but
that node is telling you that it has discarded the data, and you are
proposing to keep this secret from the NSLP application (and not to recover
it within GIST).

Is it the case that a GIST implementation cannot restart under the feet of
existing NSLP sessions, so the problem would be resolved at the application
layer? If so, you might as well do a full reset at the GIST layer anyway.

[SNIP]

>> ---- 
>> Section 6.2 Rule 3
>> This action needs to check to see if Confirm was requested.
>> Should not send a Confirm if one was never requested.
>
>But if a Confirm was not requested, and we have received
>a Response, there should be no complaints that there is no
>routing state. However, again there could be crash/restart
>reasons why the error is being generated so a different
>reaction is probably more robust anyway (see below).
>
>> What if the er_NoRSM is sent in response to the Confirm?
>> This is opening a tight loop! (Such could arise if the
>> peer has 'lost' the Routing State.)
>
>In 'normal' operation this should not happen: an er_NoRSM
>is only sent in response to a Confirm by a node which does
>not require a Confirm to create the routing state. In
>other words, this situation arises where there was no
>Query in the first place (or, more likely, a spoofed
>query). So the er_NoRSM will refer to non-existent
>routing state at the Querying node and should only be
>logged locally.
>
>But again, there could be a crash/restart situation,
>where the Query created routing state at the Responding
>node, which then restarted before the next message
>arrived. Sending the Confirm is actually no help here,
>so ... (see long note following the next comment).
>
>> Also, if you have sent several (say 100) Data messages
>> immediately after a lost Confirm, you will receive 100
>> er_NoRSM messages and (according to this state machine) you
>> will send 100 new Confirm messages.
>>
>> This is not just a matter of implementation optimisation
>> because you are impacting the network and the adjacent
>> nodes. Further, this state machine is presented as
>> normative so variations from it will be tested by
>> certification labs.
>OK. The behaviour you describe is indeed what I would
>expect to happen and looks like a problem in long-fat-network
>scenarios. However, it is a question of tradeoffs for likely
>signalling patterns. We have two options:
>
>- use a full 4-way handshake. This would avoid the above
>problem of a flood of Data messages. However, it adds
>1RTT to the delivery time of the initial message after
>routing state is created. We'd like to avoid that.

How long does it take to set up a TCP connection? Is the 4th handshake
significant?

>- use a strategy where an error triggers the recovery.
>This allows wasted messages if the Confirm is lost;
>however, note that the messages affected are Data
>messages for the same routing state (i.e. same
>SID/MRI/NSLPID). It's conceivable that an NSLP
>would send a large number of messages for the same
>flow/session combination unreliably and before receiving
>any feedback from the peer at the NSLP level, but unlikely
>to be typical.

OK. I believe you. I have no real understanding of how GIST will be used.

>Your further points on protecting the network and adjacent
>nodes are well taken, but should already be accounted for,
>in that if this messaging is going on in D-mode it will
>at least be rate limited, and in C-mode will be congestion
>controlled (note that C-mode for post-handshake messaging
>is likely to become a SHOULD). So if this is a new messaging
>association, you won't be able to send large numbers of data
>messages without some feedback from the far end to open the
>congestion window, and that feedback would include an
>early er_NoRSM; even if the MA has been open for a long time,
>the lost Confirm will cause a throttling. At least the network
>is protected.

That's good.

>However, having said all this, I think the earlier points above
>have uncovered a problem with reboot handling affecting the
>Responding node, which shows up as er_NoRSM messages in multiple
>states of the Query state machine. I think the cleanest solution
>to all these problems is to change the er_NoRSM handling to
>the following:
>- if in Awaiting Response: ignore
>- otherwise: restart the handshake with a new Query

Yeah, that makes me most comfortable.

>As well as handling the above problems, this probably simplifies
>the specification and implementation compared to the current
>re-send Confirm approach; it also partly handles the message
>flood issue (it doesn't do any more to prevent the 100 Data
>messages scenario, but it does avoid the flood of repeated
>Confirms).
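
The revised er_NoRSM handling proposed at the end of this message (ignore in
Awaiting Response, otherwise restart the handshake with a new Query) can be
sketched as follows. This is a hedged illustration, not the normative state
machine: the class QueryingNode, its queries_sent counter, and the ESTABLISHED
state name are assumptions of the sketch, and it assumes that restarting the
handshake returns the node to Awaiting Response, as suggested earlier in the
thread.

```python
from enum import Enum, auto


class QState(Enum):
    AWAITING_RESPONSE = auto()
    ESTABLISHED = auto()        # assumed name for the steady state
    AWAITING_REFRESH = auto()


class QueryingNode:
    """Toy Querying-node model that only tracks er_NoRSM handling."""

    def __init__(self, state):
        self.state = state
        self.queries_sent = 0   # counts handshake restarts, for illustration

    def on_er_no_rsm(self):
        if self.state is QState.AWAITING_RESPONSE:
            return              # a handshake is already in progress: ignore
        # Otherwise restart the handshake with a new Query.
        self.queries_sent += 1
        self.state = QState.AWAITING_RESPONSE


# 100 Data messages sent after a lost Confirm produce 100 er_NoRSM errors.
node = QueryingNode(QState.ESTABLISHED)
for _ in range(100):
    node.on_er_no_rsm()
print(node.queries_sent)
```

Under these assumptions only the first error triggers a new Query; the
remaining errors arrive in Awaiting Response and are ignored, so the flood of
repeated Confirms described in the review comment does not occur.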
msg517 Author: reh Date: 2007-02-21.18:51:15
From Adrian Farrel:

> Section 6.1 and Section 8.4 
> A DoS attack can come from any node in the network. 
> Such a node can send a Q-mode Query message as though it 
> was forwarding it at the GIST level. The result, from any 
> GIST capable node that intercepts it, may be a D-mode 
> Response direct to the (spoofed) Query node. The Query 
> node will handle the Response (according to Rule 2) by 
> sending a "No Routing State" error message. 
> Please consider whether a Responder node should be 
> recommended to keep track of such events and rate limit 
> processing of new Query messages under such circumstances.

> Section 6.2 
> Can I receive er_NoRSM in Awaiting Refresh state? 
> I think so because my sent Query refresh may have been 
> lost, the peer may have timed out the state, and I may 
> have sent Data. 

> Section 6.2 Rule 3 
> This action needs to check to see if Confirm was requested. 
> Should not send a Confirm if one was never requested. 
> 
> What if the er_NoRSM is sent in response to the Confirm? 
> This is opening a tight loop! (Such could arise if the 
> peer has 'lost' the Routing State.)

> Also, if you have sent several (say 100) Data messages 
> immediately after a lost Confirm, you will receive 100 
> er_NoRSM messages and (according to this state machine) you 
> will send 100 new Confirm messages.
>
> This is not just a matter of implementation optimisation
> because you are impacting the network and the adjacent 
> nodes. Further, this state machine is presented as 
> normative so variations from it will be tested by 
> certification labs.
History
Date                 User  Action  Args
2007-02-22 11:15:50  reh   set     status: No Discussion -> Text Proposed
                                   messages: + msg519
2007-02-21 18:54:11  reh   set     messages: + msg518
2007-02-21 18:51:15  reh   create