0009619: Missing description of connection behavior when heartbeat or data subscription times out

ID	Project	Category	View Status	Date Submitted	Last Update

0009619	Part 81: UAFX Connecting Devices and Information Model	Spec	public	2024-06-21 18:42	2024-11-01 14:08

Reporter	Brian Batke	Assigned To	Brian Batke
Priority	normal	Severity	major	Reproducibility	have not tried
Status	closed	Resolution	no change required

Summary	0009619: Missing description of connection behavior when heartbeat or data subscription times out
Description	Consider a fully active connection that is either unidirectional with heartbeat or bidirectional. Data is being sent and received at both endpoints. If the subscriber on either endpoint times out, the publisher on that endpoint should stop. Otherwise the other endpoint may never know that its published data is not being received. For example, consider a controller that is sending outputs to, and receiving inputs from, a device. If the device stops receiving the outputs (inputs to the device) but keeps on sending inputs, the controller will never know that its outputs are not being received. This was the reason for defining a heartbeat connection. If there is no action to be taken in the event of a timeout, there is no reason to have the heartbeat. It should not be required for an endpoint to "poll" the other to get its status. Such a status could perhaps be communicated in the PubSub data, but there is no definition for this, and it would not work in the case of a heartbeat.
Tags	No tags attached.

Matthias Damm 2024-07-11 12:43 developer ~0021446	I think it would be better to change the status in the header instead of STOP. The status is also sent for the heartbeat. If a specific uncertain code were defined, it would even be possible to still send the data plus the information that the corresponding reader does not receive data. If the Writer is stopped, this could cause different timing-related issues in situations where communication is not reliable and the state changes more frequently. The status may even be used for the case where the Reader is in preoperational, to indicate to the other side that nothing was received so far.

Brian Batke 2024-07-11 15:09 developer ~0021447	Which header? The network message header? I don't see a status in the network msg header. If it is a heartbeat message, then there would be no other header, correct? But let's say that it is possible to add a status to the header. Seems like you would then need a specific status code to say that the corresponding reader is not receiving data. And then we would still need to define behavior when the other side receives the bad status, i.e., it would then need to stop publishing. So in the end, the result needs to be the same. We have this pattern in other protocols (EtherNet/IP and PROFI) with stopping production when the receiving side is in error or timeout, and this does not cause any timing issues.

Matthias Damm 2024-07-11 17:01 developer ~0021448	The NetworkMessage is the wrong level. The NetworkMessage is just a transport container for the DataSetMessages. And each DataSetMessage has a Status, even if the payload is empty for heartbeat messages. An FX connection works on DataSetWriter/Reader and therefore on DataSetMessage level. In theory, multiple DataSetMessages from multiple FX connections in different states may be contained in a NetworkMessage. For Periodic Fixed the NetworkMessages would still be sent even if one of the Writers in the WriterGroup is disabled. The corresponding DataSetMessages would have the first (valid) bit set to false. I saw the slides from Jan today and do not like the idea that one side behaves differently than the other. This requires additional configuration and it is not symmetric. Sending a dedicated Status instead of stopping the Writer would have the same effect (indicate the error faster to the other side), but recovery would be faster and more reliable if the problem is temporary and resolved before the clean-up time is over. Since the Status is already in the DataSetMessage, no change in PubSub is necessary. And the bundling / error handling of Reader/Writer pairs must be defined in FX anyhow. I think a dedicated Status has the same result but is more reliable and gives applications more information and flexibility.

Brian Batke 2024-07-12 11:45 developer ~0021452	OK, I didn't realize that the DataSet message header would always be included with a heartbeat. But the definition of the Status in that header says: "The overall status of the DataSetMessage", which would seem to be contrary if the status then said something like "the corresponding reader is in error". And then in that case, we would still need to define what the subscriber is supposed to do when it receives that status. And what it would need to do is to stop publishing for the corresponding writer. So the end effect is the same. If you don't stop publishing or otherwise tear down the connection, then it is forever stuck. But we should probably discuss all this in a meeting.

Matthias Damm 2024-07-12 13:27 developer ~0021454	If there is a related Reader that is relevant for the "overall status", I do not see a problem with indicating a problem in the status that is not directly a problem in the DataSetMessage itself; from the OPC UA FX point of view, the "overall status of the DataSetMessage" is not GOOD. I agree that you still need to define the behaviour, but this is also the case if you simply stop the Writer. The clean-up behaviour could be the same for "did not receive DataSetMessages" and "received DataSetMessages with Uncertain_RelatedReaderError".
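
The equivalence suggested in the note above can be illustrated with a minimal Python sketch. Everything in it is an assumption for illustration only: `Uncertain_RelatedReaderError` is the status name floated in this discussion and is not defined in any OPC UA part, and the `Status`, `DataSetMessage`, and `should_start_cleanup_timer` names are hypothetical rather than taken from Part 14 or Part 81.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Status(Enum):
    GOOD = auto()
    # Hypothetical code named in this discussion; not a defined OPC UA StatusCode.
    UNCERTAIN_RELATED_READER_ERROR = auto()


@dataclass
class DataSetMessage:
    status: Status
    payload: bytes = b""  # empty payload models a heartbeat message


def should_start_cleanup_timer(msg: Optional[DataSetMessage]) -> bool:
    """Decide whether the receiving endpoint should treat the connection as failed.

    `msg is None` models "no DataSetMessage arrived within the receive timeout".
    Any message whose status is not GOOD is handled the same way.
    """
    return msg is None or msg.status is not Status.GOOD


# Silence and an explicit "related reader error" status lead to the same decision,
# while a GOOD heartbeat keeps the connection alive.
silent = should_start_cleanup_timer(None)
flagged = should_start_cleanup_timer(
    DataSetMessage(Status.UNCERTAIN_RELATED_READER_ERROR, b"\x01")
)
healthy = should_start_cleanup_timer(DataSetMessage(Status.GOOD))
```

In this model the only difference between the two proposals is what the peer still receives while the timer runs: with a dedicated status, data and heartbeats keep flowing, so a temporary fault could clear without the Writer ever having stopped.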

David Puffer 2024-07-19 10:42 developer ~0021491 Last edited: 2024-07-19 10:56	This issue is about a missing description for the case of "loss of heartbeat" or "loss of subscribed data". Both of these are actually defined in the specification: "5.5.4.2 Operation The lifetime of a logical connection on an AutomationComponent may be tied to its Status (see 6.6.2), which is, in turn, tied to the reception of data or heartbeat messages. A configurable CleanupTimeout (see 6.6.2) allows the deletion of all resources allocated to a specific ConnectionEndpoint once its status indicates loss of data reception."
Edit: If this is the problem that was discussed on Tuesday, I think we need to differentiate: 1) Loss of heartbeat or loss of subscribed data (indicating transmission failure on device or frame propagation on wire) vs 2) Indication that data reception on one Endpoint is compromised. 1) is covered by what is already specified. 2) is covered as well: the Reader on Endpoint 2 going into Error will cause Endpoint 2 to go into Error. If CleanupTimeout is specified, it will trigger cleanup, and thus loss of subscription/heartbeat for Endpoint 1 and corresponding cleanup on Endpoint 1 if set. Why is it relevant if this takes CleanupTimeout2 + CleanupTimeout1? The Connection is supposed to exist, and its failure is an error condition. Cleaning up the connection will free resources on either end, which are supposed to be used in any case, once the reason for the failure has been resolved. Also, stopping the publication on Endpoint1 if Endpoint2 is no longer receiving is only a temporary condition, is it not? My expectation would be that the failure will be resolved and the connection re-established.
Edit2: If Endpoint1 was interested in the status of Endpoint2, regardless of whether a CleanupTimeout was defined, it could always map the Status Variable of the ConnectionEndpoint into a data message, rather than using a heartbeat.
Using the Status of a publication to refer to an error in a subscription would require this to be done in Part 14, but Part 14 knows nothing about FX ConnectionEndpoints and the notion of bidirectional connections that consist of a Reader/Writer pair.
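
The cascade described in this note can be put into numbers with a small Python sketch. It is only an illustration under stated assumptions: the `Endpoint` class, the `cascade` function, and the `receive_timeout_ms` field (standing in for however long a Reader waits before going into Error) are hypothetical names, and the example timeout values are made up.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Endpoint:
    name: str
    receive_timeout_ms: int  # assumed: time without data before the Reader goes to Error
    cleanup_timeout_ms: int  # CleanupTimeout: time in Error before resources are deleted


def cascade(failed: Endpoint, peer: Endpoint, reader_error_at_ms: int = 0) -> Tuple[int, int]:
    """Return (cleanup time of `failed`, cleanup time of `peer`) in milliseconds.

    `failed` is the endpoint whose Reader is already in Error at `reader_error_at_ms`.
    Its Writer is assumed to keep publishing until the endpoint is cleaned up.
    """
    # Step 1: the failed endpoint stays in Error until its CleanupTimeout expires.
    failed_cleanup = reader_error_at_ms + failed.cleanup_timeout_ms
    # Step 2: cleanup stops its Writer, so the peer's Reader times out shortly after.
    peer_reader_error = failed_cleanup + peer.receive_timeout_ms
    # Step 3: the peer then waits for its own CleanupTimeout.
    peer_cleanup = peer_reader_error + peer.cleanup_timeout_ms
    return failed_cleanup, peer_cleanup


endpoint2 = Endpoint("Endpoint2", receive_timeout_ms=100, cleanup_timeout_ms=5000)
endpoint1 = Endpoint("Endpoint1", receive_timeout_ms=100, cleanup_timeout_ms=5000)
timeline = cascade(failed=endpoint2, peer=endpoint1)  # (5000, 10100)
```

With these made-up values, Endpoint1 first notices the problem roughly CleanupTimeout2 plus its own receive timeout after Endpoint2's Reader failed, and the whole connection is gone after about CleanupTimeout2 + CleanupTimeout1. Shortening the cleanup timeouts shortens both figures proportionally.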

Paul Hunkar 2024-09-06 04:59 manager ~0021665	I agree with David that for the case of a running connection (both sides are in operation) no additional text or work is required, but the solution that Matthias described, with some standard uncertain status code, is worth implementing. I do think this is an issue that initially would have to be handled in Part 14, and then we could make use of it in our specs. Matthias's proposed solution could be used to handle the case where one or both of the devices have never reached operation (and are only in pre-operational, due to missing keys or missing values etc.). It could also be useful for an application that would like to provide feedback that there is a problem at the application level.

Brian Batke 2024-09-06 13:44 developer ~0021670	In the Operational case, I suppose we could say that the CleanupTimeout will handle this. If the reader stops receiving and goes to error, the ConnectionEndpoint goes to error, and eventually it will be cleaned up and the writer will stop as well. It may be problematic if the cleanup timeout is long (if you want quick detection of a problem), but then the user or engineering tool could just set a short cleanup timeout. The Preoperational case is another problem, though.

Brian Batke 2024-11-01 14:02 developer ~0021965	See previous note. Determined to be "not a problem". Connection Endpoint will be deleted after the cleanup timeout expires, which will then cause data to stop being produced. If users want quick feedback and endpoint deletion, a short cleanup timeout can be set.

Paul Hunkar 2024-11-01 14:08 manager ~0021966	Agreed in call that this issue is not a problem and no changes are required

Date Modified	Username	Field	Change
2024-06-21 18:42	Brian Batke	New Issue
2024-07-11 12:43	Matthias Damm	Note Added: 0021446
2024-07-11 15:09	Brian Batke	Note Added: 0021447
2024-07-11 17:01	Matthias Damm	Note Added: 0021448
2024-07-12 11:45	Brian Batke	Note Added: 0021452
2024-07-12 13:27	Matthias Damm	Note Added: 0021454
2024-07-19 10:42	David Puffer	Note Added: 0021491
2024-07-19 10:51	David Puffer	Note Edited: 0021491
2024-07-19 10:56	David Puffer	Note Edited: 0021491
2024-08-16 12:38	Paul Hunkar	Relationship added	related to 0009618
2024-08-16 12:39	Paul Hunkar	Assigned To	=> Brian Batke
2024-08-16 12:39	Paul Hunkar	Status	new => assigned
2024-09-06 04:59	Paul Hunkar	Note Added: 0021665
2024-09-06 13:44	Brian Batke	Note Added: 0021670
2024-11-01 14:02	Brian Batke	Status	assigned => resolved
2024-11-01 14:02	Brian Batke	Resolution	open => fixed
2024-11-01 14:02	Brian Batke	Fixed in Version	=> 1.00.03
2024-11-01 14:02	Brian Batke	Note Added: 0021965
2024-11-01 14:07	Paul Hunkar	Resolution	fixed => no change required
2024-11-01 14:08	Paul Hunkar	Status	resolved => closed
2024-11-01 14:08	Paul Hunkar	Fixed in Version	1.00.03 =>
2024-11-01 14:08	Paul Hunkar	Note Added: 0021966
