Message518

Author reh
Recipients
Date 2007-02-21.18:54:11
Content
And the full follow-up email thread (relevant parts):

>> Section 6.1 and Section 8.4
>> A DoS attack can come from any node in the network.
>> Such a node can send a Q-mode Query message as though it
>> was forwarding it at the GIST level. The result, from any
>> GIST capable node that intercepts it, may be a D-mode
>> Response direct to the (spoofed) Query node. The Query
>> node will handle the Response (according to Rule 2) by
>> sending a "No Routing State" error message.
>> Please consider whether a Responder node should be
>> recommended to keep track of such events and rate limit
>> processing of new Query messages under such circumstances.
>
>Interesting question. This falls into the general class
>of attack detection/countermeasures, which I believe should
>be considered by protocol implementors but do not need to
>be specified, at least not normatively. Essentially I don't
>see a clear boundary to the level of complexity that should
>be captured: a simple implementation could do nothing, a
>complex implementation could do what is described above and
>considerably more. So it is not clear what level of detail
>to capture.

Sort of OK.
My concern is that the protocol spec says "MUST respond" etc. when
describing receiving a message. So an implementation is not free to take
this kind of sensible control action if it wants to be conformant.

Actually, thinking about this a bit further, since GIST is soft state, why
should a node respond to an uncorrelated Response at all? It is either a
stray message (from a race condition), a bug, or an attack. I wonder if it
might be better to ignore such messages.
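
One way to picture the kind of control action discussed above is a token bucket that caps how many error replies a node generates for uncorrelated Responses. This is a minimal sketch, not anything taken from the GIST specification: the class name, the rate and burst values, and the explicit clock argument are all assumptions made for illustration.

```python
class ErrorRateLimiter:
    """Token bucket limiting replies to uncorrelated Responses (hypothetical)."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec      # tokens added per second (assumed value)
        self.burst = burst            # maximum tokens held (assumed value)
        self.tokens = float(burst)
        self.last = 0.0               # time of the previous call, in seconds

    def allow(self, now):
        """Return True if an error reply may be sent at time `now`."""
        elapsed = max(0.0, now - self.last)
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: drop silently instead of replying


def replies_sent(limiter, arrival_times):
    """Count how many uncorrelated Responses would draw an error reply."""
    return sum(1 for t in arrival_times if limiter.allow(t))
```

With a burst of 5, a spoofed batch of 100 Responses arriving at the same instant would draw at most 5 error messages in this model; the rest are ignored.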

>> Section 6.2
>> Can I receive er_NoRSM in Awaiting Refresh state?
>> I think so because my sent Query refresh may have been
>> lost, the peer may have timed out the state, and I may
>> have sent Data.
>
>I hope not: since Queries are retransmitted, the node should
>have dropped out of Awaiting Refresh to Death according to
>rule 2 before the peer times out the state.

Oh, but I think this is very unclear from the definition of the timers. What
you are saying is that a node that generates Queries MUST execute the
to_NoResponse[nResp_reached] before its neighbour executes to_Expire_RNode.

Since this would be hard to police, I assumed that it is possible that the
neighbour may time out first. Normally this is no problem, because the next
refresh Query will simply re-initiate state. But if data is sent in the
intervening time, then an error (i.e. lost data) will be reported by the
responder.

I guess you either have to fix the description of the timers or allow the
event in the state.
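
The race described above can be made concrete with a small calculation. This is a sketch under assumed numbers: the initial retransmission timeout, the doubling backoff, the retry limit, and the responder expiry time are illustrative values, not parameters quoted from the draft, and the function names are hypothetical. It computes when the querying node would give up (the to_NoResponse[nResp_reached] event) and compares that with the moment the neighbour's to_Expire_RNode fires.

```python
def give_up_time(refresh_start, initial_timeout, max_tries, backoff=2.0):
    """Time at which the querier stops retransmitting the refresh Query.

    Assumes the Query is sent `max_tries` times, waiting `initial_timeout`
    after the first send and multiplying the wait by `backoff` each time.
    """
    total_wait = sum(initial_timeout * backoff ** i for i in range(max_tries))
    return refresh_start + total_wait


def exposure_window(responder_expiry, refresh_start, initial_timeout, max_tries):
    """Length of the interval in which the responder has dropped its state
    but the querier is still in Awaiting Refresh (0.0 if there is none)."""
    querier_gives_up = give_up_time(refresh_start, initial_timeout, max_tries)
    return max(0.0, querier_gives_up - responder_expiry)
```

With these assumed values, a refresh that starts at t=25 s with a 0.5 s initial timeout and 5 tries, against a responder expiry at t=30 s, leaves a 10.5 s window in which Data sent by the querier would be answered with er_NoRSM.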

>Of course, the peer could have crashed and rebooted. In
>which case the answer is yes. But in that case the correct
>transition would be back to Awaiting Response (needs to be
>added, I think).  See further below (after your '100 messages'
>comment).

Well, even in this case, is that the correct transition? It is true that a
new Query will establish the state at the responder and all will be well,
but what of the lost data? Recall, you have promised the NSLP that you will
reliably deliver the data, and you have got it to the next GIST node, but
that node is telling you that it has discarded the data, and you are
proposing to keep this secret from the NSLP application (and not to recover
it within GIST).

Is it the case that a GIST implementation cannot restart under the feet of
existing NSLP sessions, so the problem would be resolved at the application
layer? If so, you might as well do a full reset at the GIST layer anyway.

[SNIP]

>> ---- 
>> Section 6.2 Rule 3
>> This action needs to check to see if Confirm was requested.
>> Should not send a Confirm if one was never requested.
>
>But if a Confirm was not requested, and we have received
>a Response, there should be no complaints that there is no
>routing state. However, again there could be crash/restart
>reasons why the error is being generated so a different
>reaction is probably more robust anyway (see below).
>
>> What if the er_NoRSM is sent in response to the Confirm?
>> This is opening a tight loop! (Such could arise if the
>> peer has 'lost' the Routing State.)
>
>In 'normal' operation this should not happen: an er_NoRSM
>is only sent in response to a Confirm by a node which does
>not require a Confirm to create the routing state. In
>other words, this situation arises where there was no
>Query in the first place (or, more likely, a spoofed
>query). So the er_NoRSM will refer to non-existent
>routing state at the Querying node and should only be
>logged locally.
>
>But again, there could be a crash/restart situation,
>where the Query created routing state at the Responding
>node, which then restarted before the next message
>arrived. Sending the Confirm is actually no help here,
>so ... (see long note following the next comment).
>
>> Also, if you have sent several (say 100) Data messages
>> immediately after a lost Confirm, you will receive 100
>> er_NoRSM messages and (according to this state machine) you
>> will send 100 new Confirm messages.
>>
>> This is not just a matter of implementation optimisation
>> because you are impacting the network and the adjacent
>> nodes. Further, this state machine is presented as
>> normative so variations from it will be tested by
>> certification labs.
>
>OK. The behaviour you describe is indeed what I would
>expect to happen and looks like a problem in long-fat-network
>scenarios. However, it is a question of tradeoffs for likely
>signalling patterns. We have two options:
>
>- use a full 4-way handshake. This would avoid the above
>problem of a flood of Data messages. However, it adds
>1RTT to the delivery time of the initial message after
>routing state is created. We'd like to avoid that.

How long does it take to set up a TCP connection? Is the 4th handshake
significant?

>- use a strategy where an error triggers the recovery.
>This allows wasted messages if the Confirm is lost;
>however, note that the messages affected are Data
>messages for the same routing state (i.e. same
>SID/MRI/NSLPID). It's conceivable that an NSLP
>would send a large number of messages for the same
>flow/session combination unreliably and before receiving
>any feedback from the peer at the NSLP level, but unlikely
>to be typical.

OK. I believe you. I have no real understanding of how GIST will be used.

>Your further points on protecting the network and adjacent
>nodes are well taken, but should already be accounted for,
>in that if this messaging is going on in D-mode it will
>at least be rate limited, and in C-mode will be congestion
>controlled (note that C-mode for post-handshake messaging
>is likely to become a SHOULD). So if this is a new messaging
>association, you won't be able to send large numbers of data
>messages without some feedback from the far end to open the
>congestion window, and that feedback would include an
>early er_NoRSM; even if the MA has been open for a long time,
>the lost Confirm will cause a throttling. At least the network
>is protected.

That's good.

>However, having said all this, I think the earlier points above
>have uncovered a problem with reboot handling affecting the
>Responding node, which shows up as er_NoRSM messages in multiple
>states of the Query state machine. I think the cleanest solution
>to all these problems is to change the er_NoRSM handling to
>the following:
>- if in Awaiting Response: ignore
>- otherwise: restart the handshake with a new Query

Yeah, that makes me most comfortable.

>As well as handling the above problems, this probably simplifies
>the specification and implementation compared to the current
>re-send Confirm approach; it also partly handles the message
>flood issue (it doesn't do any more to prevent the 100 Data
>messages scenario, but it does avoid the flood of repeated
>Confirms).
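
The proposed er_NoRSM handling can be sketched as a small state machine. This is an illustration only: "Awaiting Response" and "Awaiting Refresh" are state names used in the thread, while "Established" is an assumed name for the steady state. The class, the `sent` log, and the move to Awaiting Response after restarting are likewise assumptions, and timers are omitted.

```python
from enum import Enum


class State(Enum):
    AWAITING_RESPONSE = "Awaiting Response"
    ESTABLISHED = "Established"          # assumed name for the steady state
    AWAITING_REFRESH = "Awaiting Refresh"


class QueryingNode:
    """Toy model of the Querying-node reaction to er_NoRSM (hypothetical)."""

    def __init__(self, state):
        self.state = state
        self.sent = []  # log of messages this node would transmit

    def on_er_norsm(self):
        if self.state is State.AWAITING_RESPONSE:
            return  # handshake already in progress: ignore the error
        # Any other state: restart the handshake with a new Query.
        self.sent.append("Query")
        self.state = State.AWAITING_RESPONSE
```

Feeding 100 er_NoRSM events into a node in the Established state produces a single Query in this model, because the first event moves it to Awaiting Response and the remaining 99 are ignored.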
History
Date User Action Args
2007-02-21 18:54:11 reh link issue191 messages
2007-02-21 18:54:11 reh create