[Question] ThingsBoard Edge PE disconnects from cloud #57
Comments
Hello @akseerali, To fully understand the issue you're experiencing, we would need some additional information. Could you please provide the complete log from your ThingsBoard Edge container? Additionally, if you could attach your docker-compose.yml file, it would be very helpful. This additional information is crucial because, without a comprehensive log analysis, determining the root cause of your problem is challenging. Thank you in advance for your cooperation! |
Please find attached the docker-compose configuration and edge log file. |
Hi @volodymyr-babak, I have observed another issue that might be related to this problem. Today the Cloud is unable to send the RPC requests to Devices connected to Edge even though the edge is connected. The rule chain message shows "NO_ACTIVE_CONNECTION". Please see the screenshot below. I have tried to unassign and then assign all the users to Edge, but the issue persists. |
No, it's not showing RPC call in the Downlinks section. |
It seems like you're using the cloud version of ThingsBoard along with a ThingsBoard PE Edge license. As such, you should have access to our ThingsBoard Customer Portal, available at https://thingsboard-portal.atlassian.net/browse/CP. As the troubleshooting of this issue may require additional private information from you, I would suggest continuing our investigation on this closed portal to ensure your data privacy. Please note, if the root of the issue turns out to be a bug within our platform, we will ensure to update this GitHub ticket with that information. This way, our broader user community can also benefit from the findings of our investigation. Looking forward to assisting you further on the ThingsBoard Customer Portal. |
Thanks a lot @volodymyr-babak for the support. Extra configuration |
Thank you for providing the updated docker-compose file and the previous logs. I've reviewed the information, but the root cause of the disconnection issue is not immediately clear to me. However, it's possible that the disconnections may be related to an issue that we've recently addressed and fixed in our latest release: thingsboard/thingsboard#8346 We just updated our cloud to the 3.5 release yesterday, and the 3.5 Edge version will be publicly available today. We'll also update the documentation on our website accordingly. Once these updates are live, I would kindly ask you to upgrade your version to 3.5.0 and monitor the behavior. If my assumption is correct, this upgrade should resolve the disconnection issues and you should no longer see the disconnects in your logs. Please let us know if you continue to experience problems after this update. We are committed to ensuring the smooth operation of our service for your needs. |
Hi @volodymyr-babak, We upgraded the TB Edge to version 3.5; however, this did not resolve the issue of sending the RPC request to Edge from Cloud. After this, we re-assigned the Devices group to edge and it worked. I think the upgrade of Edge instance also played its part because I had tried the same method with Edge version 3.4.3. Regarding the disconnection/synchronization issue of edge with cloud, we'll continue to observe it for more days. Many thanks. |
Hi @volodymyr-babak, The NO_ACTIVE_CONNECTION RPC call to Device error appeared again when we tried to send the server RPC requests to Edge today. The issue is once again cleared after re-assigning the Devices group to Edge. |
Hello @akseerali, I appreciate your patience as we work to resolve your issue. To aid in our troubleshooting, could you please verify whether you can observe the RPC Call event under the Downlinks tab of the Edge entity? I'm currently trying to ascertain whether the issue originates from the Edge or if it lies within the cloud's capability to send the RPC Call event to the Edge. For further investigation, I'll be running my own Edge demo overnight in an attempt to replicate the issue locally. I'm currently hypothesizing that the problem might be associated with the device session timeout. After a certain period, the cloud may begin to send RPC requests under the assumption that the device is directly connected to the cloud and not interfacing via the Edge. I will share my findings and any potential solutions as soon as I have more information. In the meantime, I encourage you to check for the RPC Call event, as mentioned earlier, and report any findings. Thank you for your understanding, and I look forward to resolving this issue promptly. |
Hi @volodymyr-babak, Thanks for the information and efforts. I have double-checked the Downlinks tab under the Edge details option, and no RPC Call Event action was observed due to this error until the Devices group was re-assigned to Edge instance. You may be right, the issue can be related with session. Please let me know in case of any findings. Many thanks |
Hi @akseerali, I have a few clarifying questions that could help us diagnose this issue more effectively. Firstly, do you have a single Edge entity in your system, or are there multiple ones? If there are multiple Edge entities, could you please verify if your device belongs to a group that is assigned exclusively to a single Edge entity? Additionally, it would be beneficial to ensure that this device doesn't belong to any other group that could potentially be assigned to another Edge. These steps will help us isolate the problem more accurately. Looking forward to your response. |
Hi @volodymyr-babak, We have only one Edge entity in our system, and the device is assigned only to this Edge. In our case, one Device is directly connected to Edge. The RPC NO_ACTIVE_CONNECTION error appeared when the Device Profile of type Default was assigned to that Device. This is probably due to the session timeout. I changed the Device Type to MQTT 2-3 days ago, and so far no RPC error has appeared. Please see the attached diagram of the system architecture. One more thing: this issue only appeared after the Cloud version was updated. I'll continue to observe it after these changes. Many thanks |
Hello @volodymyr-babak, Today the postgres container is showing an error after updating and upgrading some file in the Ubuntu system. `PostgreSQL Database directory appears to contain a database; Skipping initialization 2023-06-21 12:43:05.832 IST [1] LOG: starting PostgreSQL 12.14 (Debian 12.14-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit ` |
Hello @akseerali, Are these the complete logs for the PostgreSQL container? If not, could you please provide the full logs for a more comprehensive overview? Additionally, could you clarify the exact steps you've undertaken when you refer to 'updating and upgrading some file in the Ubuntu system'? Providing these details will allow for a more accurate analysis and assist in identifying the issue at hand. Thank you. |
Hi @volodymyr-babak, Please find attached the postgres container logs. I have noticed that an old postgres container (used for upgrading the PE Edge from 3.4 to 3.5) was somehow started. I have now stopped the container. Below are the commands used in Ubuntu system.
|
Hi @volodymyr-babak, Please let me know if the use of backup database can fix this issue. The backup was saved during the upgrade of Edge instance. |
Hello @akseerali, according to the Postgres container logs, the checkpoint file is corrupted and Postgres is not able to start because of this. https://sysopspro.com/fix-postgresql-error-panic-could-not-locate-a-valid-checkpoint-record/ According to this article, you will need to log in to the Postgres container and reset the log file by executing the command: /usr/bin/pg_resetxlog -f /path/to/pg/data/directory Please try this and let me know your results. |
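A note on the command name: the container log earlier in this thread reports PostgreSQL 12.14, and `pg_resetxlog` was renamed to `pg_resetwal` in PostgreSQL 10, so the newer name is probably the one available in this container. The sketch below only picks the tool name from a version string; `pg_reset_tool` is a hypothetical helper name, not something from the article or from ThingsBoard.

```shell
# pg_reset_tool VERSION
# Hypothetical helper: prints the name of the WAL reset tool that matches a
# PostgreSQL version string such as "12.14" or "9.6.24".
# pg_resetxlog was renamed to pg_resetwal in PostgreSQL 10.
pg_reset_tool() {
    major=${1%%.*}    # keep only the part before the first dot
    case $major in
        ''|*[!0-9]*)
            echo "unrecognised version: $1" >&2
            return 1
            ;;
    esac
    if [ "$major" -ge 10 ]; then
        echo pg_resetwal
    else
        echo pg_resetxlog
    fi
}
```

For the version in the log, `pg_reset_tool 12.14` prints `pg_resetwal`, so the command from the article would become `pg_resetwal -f /path/to/pg/data/directory`, with the data directory path still to be filled in for your setup.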
Since the postgres container was restarting every few seconds, logging in to the container was not possible. From this topic, I found a way to reset the Postgres database log file in a Docker container. Please see the steps below. The above method cleared the log error, but now some other errors are observed in the Postgres container. The Edge is also not working properly. Please see the attached Edge and Postgres container logs. I think there is an issue with the database. Please let me know if I can just use the previous backup or create a new database to clear the issue. The PE Edge is newly deployed, so losing the old data is not a problem. Thanks |
Hello @akseerali, indeed, this looks like a database issue, and some files/permissions are corrupted. |
I have followed these instructions to backup the database and the command used is mentioned below. |
Thanks for the provided information. In this case, you can try the following:
sudo cp -r ~/.mytb-edge-data/db ~/.mytb-edge-db-BACKUP-BROKEN
sudo rm -rf ~/.mytb-edge-data/db
sudo cp -r ~/.mytb-edge-db-BACKUP ~/.mytb-edge-data/db
Once you have completed these steps, please let me know the results. |
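The three commands above can also be wrapped in a small function that refuses to delete anything unless the backup is actually there. This is only a sketch: `restore_edge_db` is a hypothetical name, and the guard conditions are my assumption about what is safe, not part of the original instructions.

```shell
# restore_edge_db DATA_DB BACKUP BROKEN_COPY
# Hypothetical helper that performs the same three steps as the commands
# above, with two extra guards (both are assumptions, not from the thread).
restore_edge_db() {
    data_db=$1       # e.g. ~/.mytb-edge-data/db
    backup=$2        # e.g. ~/.mytb-edge-db-BACKUP
    broken_copy=$3   # e.g. ~/.mytb-edge-db-BACKUP-BROKEN

    # Guard 1: never remove the live directory if there is no backup.
    if [ ! -d "$backup" ]; then
        echo "backup directory not found: $backup" >&2
        return 1
    fi
    # Guard 2: do not nest a copy inside an older "broken" copy.
    if [ -e "$broken_copy" ]; then
        echo "refusing to overwrite existing copy: $broken_copy" >&2
        return 1
    fi

    # Step 1: keep the broken database for later inspection.
    if [ -d "$data_db" ]; then
        cp -r "$data_db" "$broken_copy" || return 1
    fi
    # Step 2: remove the broken database directory.
    rm -rf "$data_db" || return 1
    # Step 3: put the backup in its place.
    cp -r "$backup" "$data_db"
}
```

With the paths from this thread, the call would be `restore_edge_db ~/.mytb-edge-data/db ~/.mytb-edge-db-BACKUP ~/.mytb-edge-db-BACKUP-BROKEN`, run with the same privileges the original commands needed (they use `sudo`). Stopping the containers first is probably wise, although the thread does not say so.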
Thanks for the detailed steps. Using the backup database solves the error; however, when I upgrade the edge from 3.4.3EDGEPE to 3.5.0EDGEPE or 3.5.1EDGEPE, the upgrade process shows an error. Please find attached the edge container logs from when I tried to upgrade from 3.4.3EDGEPE to 3.5.0EDGEPE. With the 3.4.3EDGEPE version, the instance runs as it did before the upgrade. I think the only stable way now is to use a new database with the latest EDGEPE version. |
Hello @akseerali, Based on the logs, it seems the system is not upgrading from version 3.4.3 to 3.5.0 as expected. Could you please check the contents of the following file in the edge container: If it's not set to 3.4.3, please adjust it to reflect 3.4.3 and initiate the upgrade procedure following the steps provided here: |
Thanks. After changing 3.5.0 to 3.4.3 in the /data/.upgradeversion file inside the Edge container, edge is finally upgraded with new version. The new version also solves the edge connectivity problem, so I am closing this issue. Thanks again |
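For anyone repeating this fix, the check-and-change of the version file can be sketched as below. `set_upgrade_version` is a hypothetical helper, and the assumption that the file holds just the version string on one line should be confirmed by looking at the existing contents first.

```shell
# set_upgrade_version FILE VERSION
# Hypothetical helper: shows what the version file currently contains and
# rewrites it only when it differs from the wanted version.
# Assumption: the file holds a single line with the version string.
set_upgrade_version() {
    file=$1
    wanted=$2
    current=""
    if [ -f "$file" ]; then
        current=$(cat "$file")
    fi
    if [ "$current" = "$wanted" ]; then
        echo "already $wanted"
        return 0
    fi
    echo "changing '$current' -> '$wanted'"
    printf '%s\n' "$wanted" > "$file"
}
```

Inside the Edge container this would be called as `set_upgrade_version /data/.upgradeversion 3.4.3` before starting the upgrade procedure again.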
A TB Edge PE synchronization issue was observed on 01/07/2023.
Question: How can this kind of issue be avoided in a production environment in the future? Edge version |
Please note that, after some time, TB Cloud again shows only one Device as active despite receiving telemetry data of the other Devices from the Edge instance. |
Hey @akseerali , I noticed errors in the logs that could be a major communication bug in the most recent release:
and
I plan to investigate these issues and prepare a hotfix for the 3.5.1 release. I'll update this ticket as soon as the hotfix is ready. My goal is to have the hotfix released by tomorrow. |
The hotfix for the Community Edition, CE 3.5.1.1, has been completed and released. You can find it at this link: The specific commit that addresses the IndexOutOfBoundException issue can be found here: The Professional Edition hotfix, PE 3.5.1.1, is on its way and will be available soon. |
The Professional Edition (PE 3.5.1.1) hotfix has also been released. Please follow the upgrade guide to update your ThingsBoard Edge instances. It's worth noting that this update doesn't require a database update, only a package update, so it should be a quick process. For the Community Edition (CE) upgrade instructions, follow this link: For the Professional Edition (PE) upgrade instructions, refer to this link: Should you encounter any issues after the update, please don't hesitate to inform me. |
We updated the edge version to 3.5.1.1 on 7th July 2023 and found that the cloud again had a synchronization issue with the edge instance on 8th July 2023. From Cloud, the Edge downlinks section was not sending any updates, including the RPC requests (please see the attached figure). To clear this issue, I tried to restart docker compose; however, it didn't solve the issue and I had to re-assign the device group to the edge instance. It should be noted that the cloud was again able to receive the telemetry data and the Edge status was showing active at both edge and cloud. In the edge container, I found the error below. I have also attached the complete logs. Please fix the synchronization issue between edge and cloud.
|
Hi @akseerali, I think it would be beneficial for us to set up a short call to troubleshoot this situation, as I am currently unable to clearly understand the steps needed to reproduce the issue. I've been running a personal PE Edge for a month now, and have successfully been able to send RPC requests to the device every single day. It seems like I might be missing a step to reproduce this correctly. Could you kindly send me an email to the address mentioned in my profile? We can coordinate the details of our call via email. Thank you in advance for your cooperation. |
I have been testing the connectivity of RPC requests for the Device connected to TB PE Edge from Cloud, and it's working fine with the setup below.
• Edge PE version 3.5.1.1
I observed that this problem arose specifically when sending RPC requests to a Device that originated from the Cloud and was allocated to any Device group other than an Edge Device group beginning with "[Edge]". Thank you so much for the support. |
Thank you for the updates. I am reopening the ticket to re-examine this theory, specifically looking at multiple device groups other than those that begin with "[Edge]." |
Thank you for all the input. I believe I've finally identified the root cause of the issue. In cases where a device is not created over the edge but is created on the cloud and then assigned to the edge, a specific "ManagedByEdge" relation from the device to the edge is not created automatically. However, this relation is essential in the DeviceActor to find the related edge and send RPC commands to it. As a temporary fix, please add the following relation from the device to the required edge:
Please let me know if this update resolves your issues. In the meantime, I will consider ways to improve this approach to eliminate the need for manually adding this relation while still achieving the expected functionality. |
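If the relation needs to be added for many devices, it can also be created through the ThingsBoard REST API (`POST /api/relation`) instead of the UI. The sketch below only builds the request body. `relation_payload` is a hypothetical helper, the `COMMON` type group is my assumption, and which entity goes on the `from` side and which on the `to` side should be checked against the relation actually shown in your instance, so both sides are passed as arguments.

```shell
# relation_payload FROM_TYPE FROM_ID TO_TYPE TO_ID
# Hypothetical helper: prints a JSON body for POST /api/relation with the
# "ManagedByEdge" relation type named in the comment above.
# Assumption: the relation lives in the COMMON type group.
relation_payload() {
    printf '{"from":{"entityType":"%s","id":"%s"},"to":{"entityType":"%s","id":"%s"},"type":"ManagedByEdge","typeGroup":"COMMON"}\n' \
        "$1" "$2" "$3" "$4"
}

# Example call, following the wording of the comment (device -> edge).
# TB_URL, JWT_TOKEN, DEVICE_ID and EDGE_ID are placeholders:
#
#   curl -X POST "$TB_URL/api/relation" \
#        -H "X-Authorization: Bearer $JWT_TOKEN" \
#        -H "Content-Type: application/json" \
#        -d "$(relation_payload DEVICE "$DEVICE_ID" EDGE "$EDGE_ID")"
```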
@akseerali |
We resolved this issue by following the instructions in the last comment. See the details below.
|
@akseerali Thank you for your reply. I’ll give it a try. |
@akseerali Also, I would like to ask if this issue has been resolved in version 3.5.1 that you are using, as I am using the CE version? My current issue is that, after a while, the edge device continues to send telemetry data normally, but the cloud shows the edge device’s active attribute as false, and I am unable to send commands or synchronize. I am currently using version 3.4.3 CE. |
Component
Description
I am using a ThingsBoard PE Perpetual license with ThingsBoard Cloud Maker. The issue is that the Edge status is shown as offline in the cloud. Below are some of the symptoms I have observed so far:
tb-edge | 2023-05-18 10:16:52,867 [tb-rule-engine-consumer-47-thread-7 | QK(Main,TB_RULE_ENGINE,system)-2] INFO o.t.s.s.q.DefaultTbRuleEngineConsumerService - Failed to process [2] messages
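To get a rough idea of how often this happens, lines like the one above can be summed up from the Edge log. This is a sketch that relies only on the message format shown in that line; `failed_message_total` is a hypothetical helper name.

```shell
# failed_message_total
# Hypothetical helper: reads log text on stdin and prints the total number
# of messages reported in "Failed to process [N] messages" lines.
failed_message_total() {
    sed -n 's/.*Failed to process \[\([0-9][0-9]*\)] messages.*/\1/p' |
        awk '{ total += $1 } END { print total + 0 }'
}
```

It could be fed from a saved log file, for example `failed_message_total < edge.log`, where `edge.log` is a placeholder for wherever the container output was saved.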
Below are the screenshots of Edge activity status from Cloud and Edge.
Edge activity status from cloud
Edge activity status from Edge
Questions
Environment