0009618: No mechanism to handle case of never receiving first data - MantisBT

ID	Project	Category	View Status	Date Submitted	Last Update

0009618	Part 81: UAFX Connecting Devices and Information Model	Spec	public	2024-06-21 14:09	2024-09-06 13:55

Reporter	Brian Batke	Assigned To	Brian Batke
Priority	normal	Severity	major	Reproducibility	have not tried
Status	assigned	Resolution	open

Summary	0009618: No mechanism to handle case of never receiving first data
Description	Consider a bidirectional connection in which Controller (or Device) A starts publishing and its publications are received at B, but the publications from B are never received at A. This could be due to a faulty cable or some other network issue. In this case, the subscriber on A will be stuck in preoperational, and the connection will not time out. B would not know that its publications are not being received unless it somehow explicitly checks, which should not be a required behavior. There needs to be a mechanism to handle this situation.
Tags	No tags attached.

Paul Hunkar 2024-07-20 06:12 manager ~0021497	Discussion from working group meeting: 1) For a remote connection, the status can be returned in the PubSub message as a warning that the connection is still in preoperational. There would probably be several warning codes: one for SKS, one for network setup, one for message received, and so on. 2) A timer could be defined for receipt of the first message, so that this will time out at some point.

Brian Batke 2024-07-20 12:37 developer ~0021499	Sending a status in the PubSub message (indicating preoperational) may be informative, but it doesn't really solve the problem. Consider a case where both Endpoint 1 and Endpoint 2 begin publishing, but neither is receiving any messages because of some kind of network configuration issue. Neither would ever see the status indicating that the other side is still preoperational, so both remain in preoperational indefinitely. Or consider the unidirectional case where EP2 starts sending (it is operational) but EP1 never receives any messages because of a network error. EP1 would remain in preoperational indefinitely. There needs to be some kind of timeout -- probably configurable -- to handle these situations.

David Puffer 2024-07-23 12:20 developer ~0021502 Last edited: 2024-07-23 12:23	In the described case, the publications from B are never received on A, which means that A (not B) is stuck in preoperational, correct? In this case, A would know that something is wrong with B, because it never receives any data from B. The ConnectionEndpoint on A would never be cleaned up, which may only be a problem if B or the network link is never repaired. Even then, what would be the problem? Any diagnosing entity observing A would observe EP1 to be in preoperational, which according to Part 81 is only the case if one or both of DSR and DSW are in preoperational. If the endpoint were cleaned up, an observing entity would not even be able to distinguish between "was never established" and "was cleaned up".

Brian Batke 2024-07-23 20:47 developer ~0021506	I edited the original description to correct a mixup of endpoint A and B. I think there are two example cases of interest: one endpoint having an issue, and both endpoints having an issue.
Case 1: One side never receives.
- DSW A (operational) sends; DSR B receives (now operational).
- DSW B (operational) sends; DSR A does not receive (preoperational).
EPB is operational. EPA is still preoperational because its DSR has never received a message. EPA will remain preoperational forever unless the application has some sort of timeout whereby it can force the cleanup. This case has another example, which Jan has pointed out, where EPB loses power before EPA receives any message. EPA remains preoperational and is a potential resource leak unless some action is taken to clean it up.
Case 2: Both sides never receive data from the other side.
- DSW A (operational) sends; DSR B does not receive (still preoperational).
- DSW B (operational) sends; DSR A does not receive (still preoperational).
Both EPA and EPB remain preoperational. Sending a status in the DataSet message header would not help at all in this case.
In the case of a controller opening connections via its CM, the controller needs (in my experience at least) a time limit on connections being established and providing data in order for the user application to run. Having the connection possibly sit in a "waiting for data" state indefinitely is a problem, so having a timeout for going operational would solve this. We could say that this should be the application's responsibility, but it seems to me that this should be part of the connection state machine. You wouldn't expect a TCP connection to complete the handshake and start sending data on one side, and then sit waiting forever for the other side. It is a different sort of protocol, but the principle is similar.
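
For illustration only, the "timeout for going operational" idea in the note above could be sketched as below. This is a minimal model, not text from Part 81: the class `ReaderSketch`, the parameter `first_message_timeout_s`, and the `ERROR` state are all hypothetical names chosen for the example, and the time values are arbitrary. It replays the one-sided case, where the reader on A never receives and the reader on B does.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PREOPERATIONAL = "preoperational"
    OPERATIONAL = "operational"
    ERROR = "error"  # hypothetical state reached when the timeout expires


@dataclass
class ReaderSketch:
    """Hypothetical subscriber-side model; not taken from Part 81."""
    first_message_timeout_s: float  # assumed to be a configurable parameter
    started_at: float               # time the reader entered preoperational
    state: State = State.PREOPERATIONAL

    def on_message(self) -> None:
        # The first received message moves the reader to operational.
        if self.state is State.PREOPERATIONAL:
            self.state = State.OPERATIONAL

    def poll(self, now: float) -> State:
        # Leave preoperational if no first message arrived within the window.
        waited = now - self.started_at
        if (self.state is State.PREOPERATIONAL
                and waited >= self.first_message_timeout_s):
            self.state = State.ERROR
        return self.state


# One-sided case: B's reader receives, A's reader never does.
reader_a = ReaderSketch(first_message_timeout_s=5.0, started_at=0.0)
reader_b = ReaderSketch(first_message_timeout_s=5.0, started_at=0.0)
reader_b.on_message()

state_a_early = reader_a.poll(now=4.0)  # still inside the window
state_a_late = reader_a.poll(now=5.0)   # window expired
state_b_late = reader_b.poll(now=5.0)   # unaffected, already operational
```

What happens after the timeout (cleanup, retry, or only a diagnostic) is deliberately left open here, since that is the point under discussion in this issue.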

Paul Hunkar 2024-09-06 05:56 manager ~0021667	The additional status in the header would eliminate the case where one side is stuck in preoperational, since this would be reported to the other side (i.e., your case 1). If you have a CM that is monitoring the connection, it would detect the problem, and at least diagnostics would show the issue. A timeout does not really help, since the CM would just see that the connection went away at the AC level and re-create it, and since nothing was changed, it would go back to being stuck. Just leaving it stuck in the monitored environment would, I think, result in it appearing as stuck on a diagnostic display. The application would also notice that the values are bad or not being reported. In case 2, diagnostics would still show that the connection is preoperational (since my side is preoperational), so again, in the monitored world it would not be a leak but would be detected. The application would also see the issue. To me, the issue with a simple timeout is that, depending on why the connection is preoperational, the timeout values and/or what should be done are very different. If it is waiting on an SKS, then the CM should be checking the SKS, and this operation could be slow, so a longer timeout might be needed. If it is not receiving an initial value, it would have a different timeout, probably fairly short or at least different. The CM in this case would want to check whether the other node is reachable and configured correctly (if TSN is involved, the CM might be looking at the TSN configuration, and the timeout might again be larger). I think in all cases, as long as the CM is monitoring, we would not have a leak but would just be reporting a problem that the application or engineer can correct; in a case like a bad switch configuration or a stuck SKS, it would be fixed without any changes to the connection.
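
As a rough sketch of the point that the appropriate limit depends on why the connection is still preoperational, a monitoring CM could keep one limit per cause and report a diagnostic instead of tearing the connection down. Everything here is an assumption made for illustration: the `Cause` names mirror the warning codes suggested earlier in this thread, and the numeric limits are placeholders, not proposed values.

```python
from enum import Enum
from typing import Optional


class Cause(Enum):
    # Hypothetical reasons for remaining preoperational.
    WAITING_FOR_SKS = "sks"
    WAITING_FOR_NETWORK_SETUP = "network_setup"
    WAITING_FOR_FIRST_MESSAGE = "first_message"


# Placeholder limits in seconds; real values would be deployment-specific.
LIMITS_S = {
    Cause.WAITING_FOR_SKS: 60.0,            # key retrieval may be slow
    Cause.WAITING_FOR_NETWORK_SETUP: 30.0,  # e.g. TSN configuration
    Cause.WAITING_FOR_FIRST_MESSAGE: 5.0,   # expected to be short
}


def diagnose(cause: Cause, elapsed_s: float) -> Optional[str]:
    """Return a diagnostic string once the per-cause limit is exceeded.

    The connection itself is left untouched; this only reports.
    """
    limit = LIMITS_S[cause]
    if elapsed_s < limit:
        return None
    return (f"preoperational for {elapsed_s:.0f}s "
            f"(cause={cause.value}, limit={limit:.0f}s)")


within_limit = diagnose(Cause.WAITING_FOR_FIRST_MESSAGE, 3.0)
over_limit = diagnose(Cause.WAITING_FOR_FIRST_MESSAGE, 6.0)
sks_same_elapsed = diagnose(Cause.WAITING_FOR_SKS, 6.0)
```

The same elapsed time is reported for a missing first message but not for a pending SKS, which is the distinction a single fixed timeout would lose.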

Brian Batke 2024-09-06 13:55 developer ~0021673	We should talk about this in a meeting. This was described better with Jan's slides.

Date Modified	Username	Field	Change
2024-06-21 14:09	Brian Batke	New Issue
2024-07-20 06:12	Paul Hunkar	Note Added: 0021497
2024-07-20 12:37	Brian Batke	Note Added: 0021499
2024-07-23 12:20	David Puffer	Note Added: 0021502
2024-07-23 12:23	David Puffer	Note Edited: 0021502
2024-07-23 15:14	Brian Batke	Description Updated
2024-07-23 20:47	Brian Batke	Note Added: 0021506
2024-08-16 12:38	Paul Hunkar	Relationship added	related to 0009619
2024-08-16 12:39	Paul Hunkar	Assigned To	=> Brian Batke
2024-08-16 12:39	Paul Hunkar	Status	new => assigned
2024-09-06 05:56	Paul Hunkar	Note Added: 0021667
2024-09-06 13:55	Brian Batke	Note Added: 0021673