[netatmo] Refresh misbehavior #16485
Situation now seems stabilized: https://community.openhab.org/t/netatmo-bridge-offline-99-of-time/149168/65?u=glhopital. @jlaur: can we close the issue?
I agree, it's much more stable now. I had a scenario more than a month ago where one of my stations was offline and didn't come online again until I disabled/enabled the Thing. Unfortunately I didn't secure the logs or perform any investigation because I was short on time. In any case, it's rare and unrelated to the issues that were already fixed by your PRs. I'll create a new issue if/when it happens again and I have collected the needed evidence. 🙂
The only remaining thing mentioned in the issue is:
You have already created #16489, which went off my radar a few times, sorry. The issue is that it's hard to test in real life, so I have to run simulations of different scenarios, which is time-consuming and has to be done on my production system because of the way Netatmo handles OAuth2 sessions. Since it was also waiting for some fixes for a long time, I forgot how exactly I performed those tests previously. 😆 I should be able to get back to this during the Christmas holidays, so we can finally merge that PR and close this issue.
@clinique - unfortunately, as mentioned here and repeated here and here, #16489 is still not working as intended (I guess). When hitting 429, all Things will go offline:
Notice "Reconnection scheduled in 3600 seconds" and the last log line: "next refresh in PT15M". And surely enough:
and everything came back online. |
@clinique - I believe all calls are going through the bridge handler. So if we store what is shown in lines 353 to 354 in 22c7ca9,
we could check it here: line 297 in 22c7ca9,
and refrain from making any calls within the next hour. WDYT? EDIT: But we probably also need this to affect the calculations here, so we don't just skip, but actually adjust the rescheduled jobs according to the "grace" period for reconnecting: lines 85 to 112 in 6af04fc
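A minimal sketch of that idea, assuming a hypothetical helper with made-up names (the real ApiBridgeHandler and its job scheduling look different): the instant after which calls are allowed again is stored once when a 429 is received, every call checks it, and rescheduled jobs stretch their delay to at least the remaining grace period.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of the proposed "grace period" after an HTTP 429.
// Field and method names are made up; the actual ApiBridgeHandler differs.
public class RateLimitGuard {

    private static final Duration GRACE_PERIOD = Duration.ofHours(1);

    // Set when the API answers 429 Too Many Requests.
    private volatile Instant blockedUntil = Instant.MIN;

    public void onTooManyRequests() {
        blockedUntil = Instant.now().plus(GRACE_PERIOD);
    }

    // Called before every API request; true means "skip this call".
    public boolean isBlocked() {
        return Instant.now().isBefore(blockedUntil);
    }

    // Used when rescheduling refresh jobs: never fire before the grace period ends.
    public Duration adjustDelay(Duration normalDelay) {
        Duration untilUnblocked = Duration.between(Instant.now(), blockedUntil);
        return untilUnblocked.compareTo(normalDelay) > 0 ? untilUnblocked : normalDelay;
    }
}
```

The key point is that both the immediate skip and the rescheduling calculation read the same stored instant, so a single 429 pushes all jobs back instead of only the failing request.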
@jlaur: let me know if my proposal suits your thoughts.
I have had many alerts about old data from my Netatmo weather stations today and am now in the process of digging into the logs. Luckily, I already had trace-level logging enabled, since I've experienced stability issues since 4.1.
The full logs are pretty extensive, and can be requested by private mail.
Here's a preliminary analysis...
Everything was normal until around 8:46. I have two weather stations, and here are the logs from that point filtered by lines containing "executeUri GET":
At the end of this snippet, the number of calls has doubled, and the frequency is down to two minutes.
Around 9:00, when three calls are made instead of two, the following can be noticed:
It seems this can only be caused by:
openhab-addons/bundles/org.openhab.binding.netatmo/src/main/java/org/openhab/binding/netatmo/internal/handler/ApiBridgeHandler.java, lines 359 to 362 in fb26d0e
and handled here:
openhab-addons/bundles/org.openhab.binding.netatmo/src/main/java/org/openhab/binding/netatmo/internal/handler/capability/WeatherCapability.java, lines 42 to 46 in fb26d0e
This repeated twice around 9:15 and once around 9:28.
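If each of those events schedules an additional refresh job without cancelling the one already running, the call rate multiplies exactly as observed. A hypothetical illustration of that pattern (not the actual binding code; names are made up):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of how rescheduling on failure without cancelling
// the previous job multiplies the polling rate.
public class DuplicatePollingDemo {

    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4);
    private ScheduledFuture<?> refreshJob;

    // Buggy variant: every failure starts an additional periodic job,
    // while the previously running job keeps firing, so calls double.
    void onFailureBuggy() {
        refreshJob = scheduler.scheduleWithFixedDelay(this::poll, 0, 10, TimeUnit.MINUTES);
    }

    // Fixed variant: cancel the old job before scheduling a replacement.
    void onFailureFixed() {
        if (refreshJob != null) {
            refreshJob.cancel(false);
        }
        refreshJob = scheduler.scheduleWithFixedDelay(this::poll, 0, 10, TimeUnit.MINUTES);
    }

    private void poll() {
        // call the Netatmo API here
    }
}
```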
After that, things gradually got worse and worse; more and more additional calls were made:
And as a result:
This error was not taken into account, and the aggressive polling went on until late evening when I started looking into it.
At this point I disabled all Things by removing my .things file:
However, this did not stop the polling:
So it seems there are three distinct issues identified here:
Expected Behavior
Handlers should be disposed when Things are disabled or removed (see the sketch after this list).
Netatmo error code 26 should be handled appropriately, for example by lowering the polling frequency or waiting for some time before resuming any polling. Perhaps there are some guidelines to be found in the API documentation.
The binding should not increase the number of calls after experiencing issues such as InterruptedException.
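For the first point, a minimal sketch of what proper disposal could look like in an openHAB handler; the class and field names are made up and only illustrate the expectation that dispose() cancels the polling job so nothing keeps calling the API after a Thing is disabled or removed.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

import org.openhab.core.thing.ChannelUID;
import org.openhab.core.thing.Thing;
import org.openhab.core.thing.ThingStatus;
import org.openhab.core.thing.binding.BaseThingHandler;
import org.openhab.core.types.Command;

// Hypothetical handler sketch, not the actual Netatmo handlers.
public class ExampleWeatherHandler extends BaseThingHandler {

    private ScheduledFuture<?> refreshJob;

    public ExampleWeatherHandler(Thing thing) {
        super(thing);
    }

    @Override
    public void initialize() {
        // 'scheduler' is the shared pool provided by BaseThingHandler
        refreshJob = scheduler.scheduleWithFixedDelay(this::refresh, 0, 600, TimeUnit.SECONDS);
        updateStatus(ThingStatus.UNKNOWN);
    }

    @Override
    public void dispose() {
        // Without this cancellation, the job keeps polling after the Thing is gone.
        ScheduledFuture<?> job = refreshJob;
        if (job != null) {
            job.cancel(true);
            refreshJob = null;
        }
    }

    @Override
    public void handleCommand(ChannelUID channelUID, Command command) {
        // refresh-only sketch, no commands handled
    }

    private void refresh() {
        // fetch data from the Netatmo API and update channels here
    }
}
```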
Current Behavior
See description.
Possible Solution
Steps to Reproduce
Since it's a cloud service, it would be hard to reproduce naturally, but it can perhaps be reproduced artificially by some code modifications mimicking the observed events in my logs.
Your Environment