View Issue Details

ID: 0009618
Project: Part 81: UAFX Connecting Devices and Information Model
Category: Spec
View Status: public
Last Update: 2025-01-21 15:15
Reporter: Brian Batke
Assigned To: Brian Batke
Priority: normal
Severity: major
Reproducibility: have not tried
Status: assigned
Resolution: open
Summary: 0009618: No mechanism to handle case of never receiving first data
Description

Consider a bidirectional connection, where Controller (or Device) A starts publishing, which is received at B. But the publications from B are never received at A. Could be due to a faulty cable or some other network issue.

In this case, the subscriber on A will be stuck in preoperational, and the connection will not time out. B would not know that its publications are not being received unless it somehow explicitly checks, which should not be a required behavior.

There needs to be a mechanism to handle this situation.

Tags: No tags attached.

Relationships

related to 0009619 (closed, Brian Batke): Missing description of connection behavior when heartbeat or data subscription times out

Activities

Paul Hunkar

2024-07-20 06:12

manager   ~0021497

Discussion from working group meeting
1) For a remote connection, the status can be returned in the PubSub message: a warning that the connection is still preoperational. There would probably be several warning codes: one for SKS, one for network setup, one for message received, ...
2) A timer could be defined for receipt of the first message, so that the connection times out at some point.

Brian Batke

2024-07-20 12:37

developer   ~0021499

Sending a status in the PubSub message (indicating preoperational) may be informative, but it doesn't really solve the problem.

Consider a case where both Endpoint 1 and Endpoint 2 begin publishing, but neither is receiving any messages because of some kind of network configuration issue. Neither would ever see the status indicating the other side is still preoperational. They both remain in preoperational indefinitely. Or consider the unidirectional case where EP2 starts sending (it is operational) but EP1 never receives any messages because of a network error. EP1 would remain in preoperational indefinitely. There needs to be some kind of timeout -- probably configurable -- to handle these situations.

David Puffer

2024-07-23 12:20

developer   ~0021502

Last edited: 2024-07-23 12:23

In the described case:

  • The publications from B are never received on A. Which means that A (not B) is stuck in preoperational, correct?

In this case, A would know that something is wrong with B, because it never receives any data from B.
In this case, the ConnectionEndpoint on A would never be cleaned up, which may only be a problem if B or the network link is never repaired.
Even then: what would be the problem?

Any diagnosing entity observing A would observe EP1 to be in Preoperational, which according to part 81 is only the case if either one or both of DSR and DSW are in pre-operational.
If the Endpoint was cleaned up, an observing entity would not even be able to distinguish between "was never established" vs "was cleaned up".

Brian Batke

2024-07-23 20:47

developer   ~0021506

I edited the original description to correct a mixup of endpoint A and B.

I think there are two example cases of interest: One endpoint having an issue, and both endpoints having an issue.

  1. One side never receives.
    DSW A (operational) sends; DSR B receives (now operational)
    DSW B (operational) sends; DSR A does not receive (preoperational).
    EPB is operational. EPA is still preoperational because its DSR has never received a message.
    EPA will remain preoperational forever unless the application has some sort of timeout whereby it can force the cleanup.
    This case has another example, which Jan has pointed out, where EPB loses power before EPA receives any message. EPA remains preoperational and is a potential resource leak unless some action is taken to clean it up.

  2. Both sides never receive data from the other side.
    DSW A (operational) sends; DSR B does not receive (still preoperational)
    DSW B (operational) sends; DSR A does not receive (still preoperational)
    Both EPA and B remain preoperational. Sending a status in the DataSet message header would not help at all in this case.
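
The two cases above can be sketched as a small model. This is only an illustration, not text from Part 81: the names `State`, `endpoint_state`, and `simulate` are hypothetical, and the rule that an endpoint is operational only when both its DSW and DSR are operational is taken from the earlier note in this thread.

```python
from enum import Enum


class State(Enum):
    PREOPERATIONAL = "preoperational"
    OPERATIONAL = "operational"


def endpoint_state(dsw: State, dsr: State) -> State:
    """Endpoint is operational only if both its writer and reader are."""
    if dsw is State.OPERATIONAL and dsr is State.OPERATIONAL:
        return State.OPERATIONAL
    return State.PREOPERATIONAL


def simulate(a_receives: bool, b_receives: bool) -> dict:
    """Both DSWs publish; a DSR goes operational only once it has
    received at least one message from the peer (assumed behavior)."""
    dsr_a = State.OPERATIONAL if a_receives else State.PREOPERATIONAL
    dsr_b = State.OPERATIONAL if b_receives else State.PREOPERATIONAL
    return {
        "EPA": endpoint_state(State.OPERATIONAL, dsr_a),
        "EPB": endpoint_state(State.OPERATIONAL, dsr_b),
    }


# Case 1: only A never receives. Case 2: neither side receives.
case_1 = simulate(a_receives=False, b_receives=True)
case_2 = simulate(a_receives=False, b_receives=False)
print(case_1)
print(case_2)
```

Without a timeout there is no event in this model that ever moves a preoperational endpoint to another state, which is the gap being described.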

In the case of a controller opening connections via its CM, the controller needs (in my experience at least) a time limit on connections being established and providing data in order for the user application to run. To have the possibility of the connection sitting in a "waiting for data" state, indefinitely, is a problem. So having a timeout for going operational would solve this. We could say that this should be the application's responsibility, but it seems to me that this should be part of the connection state machine. You wouldn't expect a TCP connection to complete the handshake and start sending data on one side, and then sit waiting forever for the other side. Different sort of protocol but the principle is similar.

Paul Hunkar

2024-09-06 05:56

manager   ~0021667

The additional status in the header would eliminate the case where one side is stuck in preoperational, since this would be reported to the other side, i.e., your case 1.
If you have a CM that is monitoring the connection, it would detect the problem, and at least diagnostics would show the issue. A timeout does not really help, since the CM would just see that the connection went away at the AC level and re-create it; since nothing has changed, it would go back to being stuck. Just leaving it stuck in the monitored environment would, I think, result in it appearing as stuck on a diagnostic display. The application would also notice that the values are bad or not being reported.
In case 2, diagnostics would still show that the connection is preoperational (since my side is preoperational), so again, in the monitored world, it would not be a leak but would be detected. The application would also see the issue.

To me, the issue with a simple timeout is that, depending on why the connection is preoperational, the timeout values and what should be done are very different. If it is waiting on an SKS, then the CM should be checking the SKS, and this operation could be slow, so a longer timeout might be needed. If it is not receiving an initial value, it would have a different timeout, probably fairly short or at least different. The CM in this case would want to check whether the other node is reachable and configured correctly. If TSN is involved, the CM might be looking at the TSN configuration, and the timeout might again be larger.
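
A cause-dependent timeout of this kind could look roughly like the sketch below. It is an illustration only: the reason names echo the warning codes floated in the working group note, while `PreopReason`, `TIMEOUTS_S`, `timed_out`, and all numeric values are hypothetical placeholders, not values from any specification.

```python
from enum import Enum


class PreopReason(Enum):
    WAITING_FOR_SKS = "waiting for SKS"
    WAITING_FOR_NETWORK_SETUP = "waiting for network setup (e.g. TSN)"
    WAITING_FOR_FIRST_MESSAGE = "waiting for first message"


# Hypothetical timeouts in seconds; real values would be configurable.
TIMEOUTS_S = {
    PreopReason.WAITING_FOR_SKS: 60.0,            # key retrieval may be slow
    PreopReason.WAITING_FOR_NETWORK_SETUP: 30.0,  # TSN setup may take longer
    PreopReason.WAITING_FOR_FIRST_MESSAGE: 5.0,   # expected to be fairly short
}


def timed_out(reason: PreopReason, elapsed_s: float) -> bool:
    """True if the endpoint has been preoperational for this reason
    at least as long as the timeout assigned to that reason."""
    return elapsed_s >= TIMEOUTS_S[reason]


for reason in PreopReason:
    print(reason.value, timed_out(reason, elapsed_s=6.0))
```

Keying the timeout on the reason keeps a slow SKS lookup from being treated the same way as a missing first message.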

I think in all cases, as long as the CM is monitoring, we would not have a leak but just a reported problem that the application or engineer can correct. A case like a bad switch configuration or a stuck SKS would be fixed without any changes to the connection.

Brian Batke

2024-09-06 13:55

developer   ~0021673

We should talk about this in a meeting. This was described better with Jan's slides.

David Puffer

2025-01-16 14:22

developer   ~0022287

@Brian, in response to your comment from 2024-07-23, and in prep for Telco 2025-01-17 (sorry for latency >>):

ad 1) If EPA remains preoperational because it never receives a message, then the application on EPA will detect that it does not receive required data and can transition to an error state.
The fact that EPA is preoperational does not per se constitute a problem, and the diagnostic capabilities of a CM monitoring the state of its connections should reflect the PreOperational state in the diag array. Table 165 describes this as "Connection is established and enabled, but communication has not started".

ad 2) see 1)

What is the expectation regarding a PreOperationalToOperational Timeout? If the timeout expires, should the ConnectionEndpointState transition to Error (which would lead to a cleanup if CleanupTimeout != 0 and subsequently make it difficult to diagnose what happened)?
If the actual requirement is for an application to signal an error if it doesn't receive data, would that timeout not be implemented in the application (observing data availability)?
I see that it could be argued that such a timeout should be part of the communication layer, but I'd be interested to know how the information would be used by the application.
And in terms of layer separation: do we want an application to "know" about communication? Or should an application only know about IN/OUT data, rather than communication state?

Note: The non-operational state of the bi-directional connection is already indicated by the ConnectionState exposed in ConnectionDiagnostics array in the CM.

Brian Batke

2025-01-16 18:14

developer   ~0022301

Last edited: 2025-01-16 18:15

@David, thanks for looking. We can discuss further in the telco, but I'll just add some brief comments.

From the PLC perspective, it will want to know when the connection is fully established and data is flowing (for all necessary connections) in order to transition to an "application is running" mode. The user application may just be looking at data locations for inputs/outputs that don't carry connection status. So it could be that we would just say that this is up to the PLC implementation (or similarly on the device side) to handle these cases. If they want to have a timeout for the data to start flowing fully, then they can do that. And the UAFX stack would need to expose the connection status to the PLC application.

In a situation that is a transitory failure, e.g., the case where EPB has a power failure before it starts publishing, I would expect that situation to correct itself without any operator intervention. So that is a case where I think you would need to have a timeout, somewhere, in order to trigger connection reestablishment. In the case of a persistent failure such as a misconfigured switch or router or the SKS, there will be some action needed. In that case, maybe it is ok to leave the endpoints sitting in "preoperational", because someone will need to troubleshoot that. We should just make sure that there is sufficient status information so that it doesn't look to the user like they have the equivalent of the "spinning hourglass".

Suad Morgan

2025-01-17 12:02

reporter   ~0022307

In summary @Brian and David: I agree with your points. The issue can be resolved by implementing the following updates: sending the status back with the header of the publisher (in case 1) and having a configurable timeout (in case 2).

Brian Batke

2025-01-21 15:15

developer   ~0022319

I added a proposal for a preoperational timeout into Part 81. See the ConnectionEndpointType and ConnectionEndpointParameterType. It seems rather simple to me, and if you are in a situation where you don't want to use it, then you set it to infinite.
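
A minimal sketch of how such a timeout might behave, for discussion only. The class and field names (`ConnectionEndpoint`, `preop_timeout_s`, `poll`) are hypothetical and do not reproduce the Part 81 proposal. The transition to an Error state on expiry is an assumption, picked up from the question raised earlier in this thread.

```python
import math
from dataclasses import dataclass


@dataclass
class ConnectionEndpoint:
    # Hypothetical parameter: math.inf means "never time out".
    preop_timeout_s: float = math.inf
    state: str = "PreOperational"
    entered_preop_at: float = 0.0

    def on_first_message(self) -> None:
        """First data received: the endpoint can go operational."""
        if self.state == "PreOperational":
            self.state = "Operational"

    def poll(self, now: float) -> None:
        """Called periodically; assumed to move to Error on expiry."""
        waited = now - self.entered_preop_at
        if self.state == "PreOperational" and waited >= self.preop_timeout_s:
            self.state = "Error"


finite = ConnectionEndpoint(preop_timeout_s=5.0)
finite.poll(now=4.9)
print(finite.state)   # still waiting for first data
finite.poll(now=5.0)
print(finite.state)   # timeout expired

infinite = ConnectionEndpoint()  # default: timeout disabled
infinite.poll(now=1e9)
print(infinite.state)
```

With the default of infinity the endpoint behaves as it does today, so a deployment that prefers to leave connections in preoperational for troubleshooting loses nothing.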

Issue History

Date Modified Username Field Change
2024-06-21 14:09 Brian Batke New Issue
2024-07-20 06:12 Paul Hunkar Note Added: 0021497
2024-07-20 12:37 Brian Batke Note Added: 0021499
2024-07-23 12:20 David Puffer Note Added: 0021502
2024-07-23 12:23 David Puffer Note Edited: 0021502
2024-07-23 15:14 Brian Batke Description Updated
2024-07-23 20:47 Brian Batke Note Added: 0021506
2024-08-16 12:38 Paul Hunkar Relationship added related to 0009619
2024-08-16 12:39 Paul Hunkar Assigned To => Brian Batke
2024-08-16 12:39 Paul Hunkar Status new => assigned
2024-09-06 05:56 Paul Hunkar Note Added: 0021667
2024-09-06 13:55 Brian Batke Note Added: 0021673
2025-01-16 14:22 David Puffer Note Added: 0022287
2025-01-16 18:14 Brian Batke Note Added: 0022301
2025-01-16 18:15 Brian Batke Note Edited: 0022301
2025-01-17 12:02 Suad Morgan Note Added: 0022307
2025-01-21 15:15 Brian Batke Note Added: 0022319