View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0009618 | Part 81: UAFX Connecting Devices and Information Model | Spec | public | 2024-06-21 14:09 | 2025-01-21 15:15 |
Reporter | Brian Batke | Assigned To | Brian Batke | ||
Priority | normal | Severity | major | Reproducibility | have not tried |
Status | assigned | Resolution | open | ||
Summary | 0009618: No mechanism to handle case of never receiving first data | ||||
Description | Consider a bidirectional connection, where Controller (or Device) A starts publishing, which is received at B. But the publications from B are never received at A. This could be due to a faulty cable or some other network issue. In this case, the subscriber on A will be stuck in preoperational, and the connection will not time out. B would not know that its publications are not being received unless it somehow explicitly checks, which should not be a required behavior. There needs to be a mechanism to handle this situation. | ||||
Tags | No tags attached. | ||||
related to | 0009619 | closed | Brian Batke | Missing description of connection behavior when heartbeat or data subscription times out |
|
Discussion from working group meeting |
|
Sending a status in the PubSub message (indicating preoperational) may be informative, but it doesn't really solve the problem. Consider a case where both Endpoint 1 and Endpoint 2 begin publishing, but neither is receiving any messages because of some kind of network configuration issue. Neither would ever see the status indicating the other side is still preoperational. They would both remain in preoperational indefinitely. Or consider the unidirectional case where EP2 starts sending (it is operational) but EP1 never receives any messages because of a network error. EP1 would remain in preoperational indefinitely. There needs to be some kind of timeout -- probably configurable -- to handle these situations. |
|
In the described case, A would know that something is wrong with B, because it never receives any data from B. Any diagnosing entity observing A would observe EP1 to be in Preoperational, which according to Part 81 is only the case if one or both of DSR and DSW are in preoperational. |
|
I edited the original description to correct a mixup of endpoint A and B. I think there are two example cases of interest: One endpoint having an issue, and both endpoints having an issue.
In the case of a controller opening connections via its CM, the controller needs (in my experience at least) a time limit on connections being established and providing data in order for the user application to run. To have the possibility of the connection sitting in a "waiting for data" state, indefinitely, is a problem. So having a timeout for going operational would solve this. We could say that this should be the application's responsibility, but it seems to me that this should be part of the connection state machine. You wouldn't expect a TCP connection to complete the handshake and start sending data on one side, and then sit waiting forever for the other side. Different sort of protocol but the principle is similar. |
|
The additional status in the header would eliminate the case where one side is stuck in preoperational, since this would be reported to the other side (i.e., your case 1). To me, the issue with a simple timeout is that, depending on why the endpoint is preoperational, the timeout value and/or what should be done are very different. If it is waiting on an SKS, then the CM should be checking the SKS, and this operation could be slow, so a longer timeout might be needed. If it is not receiving an initial value, it would have a different timeout, probably fairly short or at least different. The CM in this case would want to check whether the other node is reachable and configured correctly (if TSN is involved, the CM might be looking at the TSN configuration, and the timeout might again be larger). I think in all cases, as long as the CM is monitoring, we would not have a leak, just a reported problem that the application or engineer can correct; a case like a bad switch configuration or a stuck SKS would be fixed without any changes to the connection. |
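
To make the point about cause-dependent timeouts concrete, here is a minimal sketch in Python. It is not taken from Part 81: the cause names, the default values, and the helpers `preop_timeout_s` and `is_overdue` are all hypothetical and only illustrate the idea that the limit depends on why the endpoint is still preoperational.

```python
import math
from enum import Enum


class PreopCause(Enum):
    """Hypothetical reasons an endpoint is still preoperational."""
    WAITING_FOR_KEYS = "waiting for keys from the SKS"
    WAITING_FOR_FIRST_DATA = "waiting for the first data message"
    WAITING_FOR_TSN = "waiting for TSN configuration"


# Illustrative defaults in seconds; none of these values come from the spec.
DEFAULT_TIMEOUTS_S = {
    PreopCause.WAITING_FOR_KEYS: 30.0,        # key retrieval could be slow
    PreopCause.WAITING_FOR_FIRST_DATA: 2.0,   # probably fairly short
    PreopCause.WAITING_FOR_TSN: 60.0,         # might be larger with TSN
}


def preop_timeout_s(cause, overrides=None):
    """Return the timeout for a cause; math.inf means 'wait forever'."""
    if overrides and cause in overrides:
        return overrides[cause]
    return DEFAULT_TIMEOUTS_S[cause]


def is_overdue(cause, elapsed_s, overrides=None):
    """True if the endpoint has waited longer than allowed for this cause."""
    return elapsed_s > preop_timeout_s(cause, overrides)
```

With these assumed defaults, five seconds without a first data message would be flagged, while five seconds of waiting on the SKS would not. What the CM does once a cause is overdue (check reachability, check the SKS, report a diagnostic) is left open here, as it is in the discussion.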
|
We should talk about this in a meeting. This was described better with Jan's slides. |
|
@Brian, in response to your comment from 2024-07-23, and in prep for Telco 2025-01-17 (sorry for the latency): ad 1) If EPA remains in preoperational because it never receives a message, then the application on EPA will detect that it does not receive required data and can transition to an error state. ad 2) See 1). What is the expectation regarding a PreOperationalToOperational timeout? If the timeout expires, should the ConnectionEndpointState transition to Error (which would lead to a cleanup if CleanupTimeout != 0, making it difficult to diagnose afterwards what happened)? Note: The non-operational state of the bi-directional connection is already indicated by the ConnectionState exposed in the ConnectionDiagnostics array in the CM. |
|
@David, thanks for looking. We can discuss further in the telco, but I'll just add some brief comments. From the PLC perspective, it will want to know when the connection is fully established and data is flowing (for all necessary connections) in order to transition to an "application is running" mode. The user application may just be looking at data locations for inputs/outputs that don't carry connection status. So it could be that we would just say that it is up to the PLC implementation (or similarly on the device side) to handle these cases. If they want to have a timeout for the data to start flowing fully, then they can do that. And the UAFX stack would need to expose the connection status to the PLC application. In a situation that is a transitory failure, e.g., the case where EPB has a power failure before it starts publishing, I would expect that situation to correct itself without any operator intervention. So that is a case where I think you would need to have a timeout, somewhere, in order to trigger connection reestablishment. In the case of a persistent failure such as a misconfigured switch or router or the SKS, there will be some action needed. In that case, maybe it is OK to leave the endpoints sitting in "preoperational", because someone will need to troubleshoot that. We should just make sure that there is sufficient status information so that it doesn't just look to the user that they have the equivalent of the "spinning hourglass". |
|
In summary @Brian and David: I agree with your points. The issue can be resolved by implementing the following updates: sending the status back with the header of the publisher (in case 1) and having a configurable timeout (in case 2). |
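
As a rough illustration of how the two measures in this summary could complement each other, the sketch below classifies what one endpoint can conclude from the peer status carried in received messages plus a local timer. All names (`Finding`, `evaluate`, the `peer_reported` argument) are hypothetical, and the header encoding itself is not modeled.

```python
from enum import Enum


class PeerStatus(Enum):
    PREOPERATIONAL = "preoperational"
    OPERATIONAL = "operational"


class Finding(Enum):
    OK = "data flowing in both directions"
    STILL_STARTING = "within the allowed startup time"
    PEER_NOT_RECEIVING = "peer still reports preoperational (case 1)"
    NOTHING_RECEIVED = "no message from peer before timeout (case 2)"


def evaluate(peer_reported, elapsed_s, timeout_s):
    """Classify the connection from one endpoint's point of view.

    peer_reported: status taken from the last received message header,
                   or None if no message has been received at all.
    elapsed_s:     time since this endpoint started publishing.
    timeout_s:     assumed configurable limit for reaching operational.
    """
    if peer_reported is PeerStatus.OPERATIONAL:
        return Finding.OK
    if elapsed_s <= timeout_s:
        return Finding.STILL_STARTING
    if peer_reported is None:
        # Nothing arrives, so a status header cannot help; only the timer can.
        return Finding.NOTHING_RECEIVED
    # Messages arrive, but the peer says it has not seen our data yet.
    return Finding.PEER_NOT_RECEIVING
```

In the one-sided failure, the endpoint that does receive messages ends up with `PEER_NOT_RECEIVING`, while an endpoint that receives nothing can only reach `NOTHING_RECEIVED` through the timer, which is why both mechanisms appear in the summary.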
|
I added a proposal for a preoperational timeout into Part 81. See the ConnectionEndpointType and ConnectionEndpointParameterType. It seems rather simple to me, and if you are in a situation where you don't want to use it, then you set it to infinite. |
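
For readers who want to picture the behavior, a preoperational timeout with an "infinite" setting could look roughly like the sketch below. This is not the Part 81 text: the class `EndpointSketch`, its method names, and the choice to move to Error on expiry are assumptions made for illustration (the earlier note raises the transition to Error as an open question).

```python
import math
from enum import Enum


class State(Enum):
    PREOPERATIONAL = "PreOperational"
    OPERATIONAL = "Operational"
    ERROR = "Error"


class EndpointSketch:
    """Hypothetical endpoint with a configurable preoperational timeout."""

    def __init__(self, preop_timeout_s=math.inf, now_s=0.0):
        # math.inf stands in for "infinite", i.e. the timeout is disabled.
        self.preop_timeout_s = preop_timeout_s
        self.state = State.PREOPERATIONAL
        self._entered_preop_s = now_s

    def on_first_data(self):
        """Called when the first data message is received."""
        if self.state is State.PREOPERATIONAL:
            self.state = State.OPERATIONAL

    def tick(self, now_s):
        """Periodic check; the caller supplies the current time in seconds."""
        waited_s = now_s - self._entered_preop_s
        if self.state is State.PREOPERATIONAL and waited_s > self.preop_timeout_s:
            # Assumption: expiry leads to Error; the spec may decide otherwise.
            self.state = State.ERROR
        return self.state
```

An endpoint created with `EndpointSketch(preop_timeout_s=5.0)` that never receives data would report `State.ERROR` on the first `tick` after five seconds, whereas the default (`math.inf`) keeps it in `State.PREOPERATIONAL` indefinitely, matching the current behavior.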
Date Modified | Username | Field | Change |
---|---|---|---|
2024-06-21 14:09 | Brian Batke | New Issue | |
2024-07-20 06:12 | Paul Hunkar | Note Added: 0021497 | |
2024-07-20 12:37 | Brian Batke | Note Added: 0021499 | |
2024-07-23 12:20 | David Puffer | Note Added: 0021502 | |
2024-07-23 12:23 | David Puffer | Note Edited: 0021502 | |
2024-07-23 15:14 | Brian Batke | Description Updated | |
2024-07-23 20:47 | Brian Batke | Note Added: 0021506 | |
2024-08-16 12:38 | Paul Hunkar | Relationship added | related to 0009619 |
2024-08-16 12:39 | Paul Hunkar | Assigned To | => Brian Batke |
2024-08-16 12:39 | Paul Hunkar | Status | new => assigned |
2024-09-06 05:56 | Paul Hunkar | Note Added: 0021667 | |
2024-09-06 13:55 | Brian Batke | Note Added: 0021673 | |
2025-01-16 14:22 | David Puffer | Note Added: 0022287 | |
2025-01-16 18:14 | Brian Batke | Note Added: 0022301 | |
2025-01-16 18:15 | Brian Batke | Note Edited: 0022301 | |
2025-01-17 12:02 | Suad Morgan | Note Added: 0022307 | |
2025-01-21 15:15 | Brian Batke | Note Added: 0022319 |