Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nut on Raspberry PI keeps giving stale data after USB disconnect (usbhid-ups driver) #2742

Open
slavatarsyurimi opened this issue Dec 30, 2024 · 6 comments
Labels
APC Connection stability issues Issues about driver<->device and/or networked connections (upsd<->upsmon...) going AWOL over time Incorrect or missing readings On some devices driver-reported values are systemically off (e.g. x10, x0.1, const+Value, etc.) macOS raspberry USB

Comments

@slavatarsyurimi
Copy link

Greetings. I've got a single Smart-UPS X 1500 connected via USB to a Raspberry PI with nut installed with fairly basic configuration. Aside from locally getting data via the upsc command I also have nut-http set up to provide the same data via a HTTP API (end usecase is making a bitfocus companion module that is going to query that data). Nut is talking to the UPS via the usbhid-ups driver.

If I disconnect the usb cable - nut keeps reporting last known data, seemly unaware of the disconnect (both when asked via upsc or the http api). Expected behaviour after a usb disconnect would be seeing a "Driver not connected" error. I've tinkered with a bunch of settings, googled issues and chatgpt suggestions for a few hours now, but whatever I do it doesn't change this behavior. Here's some stuff I tried:

  • pollonly = "enabled" in the ups.conf UPS setting
  • playing around with pollinterval and pollfreq in the same config
  • playing around with MAXAGE in upsd.conf
  • manual UPS port selection in ups.conf

The curious thing is that it keeps returning stale data even after restarting the nut server (sudo systemctl resart nut-server). The only 'soft' way I've found for it to figure out it's gone was either making an adjustment in ups.conf or doing a hard usb driver restart for this UPS (/bin/systemctl restart nut-driver@[upsname].service).
To poke it around some more I installed nut on my macbook, and plugged the UPS right in. It looks like it doesn't automatically keep the usb driver running after configuring the ups on mac like it does on linux. When running the usb driver manually and disconnecting the usb cable I see instantly see a nut_libusb_get_report: No such device (it may have been disconnected) error, and the driver crashing shortly after - which is nice.

To explore some more I discovered you can tail the usb driver logs like so sudo journalctl -u nut-driver@[upsname].service -f.
What I observe there is that the driver is infact well aware of the fact that a disconnect took place. Only in this case it keeps repeating the following:

  96.184957     [D1] Can't retrieve Report 0f: No such device
  96.184996     nut_libusb_get_report: No such device (it may have been disconnected)

Meanwhile the driver is still running, and both upsc and nut-http keep dumping old data. I haven't let it take quite a lot of time to see if it realizes eventually, but it definitely lasts for 5+ minutes which is unaccetable for my use case. Also, logically the MAXAGE setting should tell nut to declare data as stale as quickly as after 15 seconds.
Anyone any clue what's going on here and what else I can try? Many thanks!

@jimklimov jimklimov added USB raspberry Incorrect or missing readings On some devices driver-reported values are systemically off (e.g. x10, x0.1, const+Value, etc.) Connection stability issues Issues about driver<->device and/or networked connections (upsd<->upsmon...) going AWOL over time APC macOS labels Dec 30, 2024
@jimklimov
Copy link
Member

Thanks for the report, indeed seems like a serious issue. Can you please clarify, which version of NUT you encounter it with (on Pi and on Mac) - or if you've tried a build of current master to see if it behaves better?

IIRC some time during NUT v2.8.x timeframe (maybe leading up to 2.8.0) there were changes about driver reconnection after communication loss, so maybe this was either fixed by those changes, or a regression brought in by them (e.g. not marking data as stale while we're trying to reconnect - which the earlier crashing approach sort of avoided?)

Also to clarify, your use-case does not insist on specifically the driver exiting due to problems - just on having the NUT stack report that up-to-date information is not available, right?

@slavatarsyurimi
Copy link
Author

@jimklimov Thanks for the prompt response! I'm not around the PI area that the moment, so I'll be able to report the PI version, and attempt a build from master either this Wednesday or Thursday. In any case, it was installed with apt on the PI and with brew on the mac. OSX version: ==> nut: stable 2.8.2 (bottled), HEAD

IIRC some time during NUT v2.8.x timeframe (maybe leading up to 2.8.0) there were changes about driver reconnection after communication loss, so maybe this was either fixed by those changes, or a regression brought in by them (e.g. not marking data as stale while we're trying to reconnect - which the earlier crashing approach sort of avoided?)

Good to know, thanks! I'll try tinkering around with different version when I have the chance.

Also to clarify, your use-case does not insist on specifically the driver exiting due to problems - just on having the NUT stack report that up-to-date information is not available, right?

That is pretty correct, as long as there's a reliable way to know if the reported info is 'fresh' or not via the http API - all good.

@jimklimov
Copy link
Member

Looking at https://github.com/networkupstools/nut/blame/0f543e745e7830dac714e28592972319fae04888/drivers/usbhid-ups.c#L1180 and history before it, the main loop in upsdrv_updateinfo() with detection of (various names of) LIBUSB_ERROR_NO_DEVICE should have been in place for years. Likewise, in
https://github.com/networkupstools/nut/blame/0f543e745e7830dac714e28592972319fae04888/drivers/usbhid-ups.c#L1140-L1144 there is logic, for the past 18 years, to mark it stale: if (!reconnect_ups()) { ... dstate_datastale(); }

Since #1771 (released as part of NUT v2.8.1) code should also report the situation via dstate_setinfo("driver.state", "reconnect.trying"); - do you see that in e.g. upsc query results?

You can also bump debug logging verbosity to check if messages from https://github.com/networkupstools/nut/blame/0f543e745e7830dac714e28592972319fae04888/drivers/usbhid-ups.c#L2095 method are seen?

Also, USB-capable drivers since NUT v2.8.0 should report which libusb version they were built against (there's different code for libusb-0.1 and libusb-1.0 API, and differently capable libraries on the other side). The "crashing" you see on MacOS is probably one of fatalx() exits from reconnect_ups() calling an open method like

static int nut_libusb_open(libusb_device_handle **udevp,
or https://github.com/networkupstools/nut/blob/master/drivers/libusb0.c#L213 - probably higher debug would confirm that, and at which point it decides to exit the driver.

In fact, maybe it makes sense to extend these methods with optional ability to not exit() when reconnecting vs. initial attempt (or even initially), since not all platforms run frameworks like systemd or SMF that would start another copy of the driver and provide data when the broken connection is reinstated.

@jimklimov
Copy link
Member

In short, ideally you should check if the current NUT codebase behaves well on your Raspberry (so there's nothing to fix for your original report, in fact -- the idea in my post above is a separate thing), see e.g. https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests about building your own.

@slavatarsyurimi
Copy link
Author

Hi @jimklimov looks like your hunch is currect. The "problematic" nut version is 2.8.0. I've just spent a while manually building 2.8.2 - after unplugging the usb cable I get the Error: Data stale with the upsc command after 5 seconds.
Haven't tried it with a http API yet, but I'm confident that it would give an error as well.

My understanding of C is pretty non-existent, so I'm going to have to take your word for most of the technical stuff in your last message. Is there anything in particular you'd like me to try out? I've got the PI instance with 2.8.0 on the separate sd card.

Since #1771 (released as part of NUT v2.8.1) code should also report the situation via dstate_setinfo("driver.state", "reconnect.trying"); - do you see that in e.g. upsc query results?

Now I can't get that to happen as of yet. The normale state I get when the ups is plugged in is driver.state: quiet, and it stays that way even during the 5 seconds after the cable is unplugged before the data is declared stale.

For now it seems like I can move forward with my module development. Hand building the current codebase is fine for our use-case.

@jimklimov
Copy link
Member

jimklimov commented Jan 3, 2025

The "during the 5 seconds" part may be due to the frequency of the big loop in drivers/main.c, dictated by pollinterval setting.

So the driver (any one of them) does some work in upsdrv_updateinfo(), effectively sleeps for the remainder of timeout in dstate_poll_fds() (if no extrafd like those for a serial port or a raw network socket gets bytes incoming), and only then gets to run upsdrv_updateinfo() again and perhaps notice the absence of the device as reported by libusb.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
APC Connection stability issues Issues about driver<->device and/or networked connections (upsd<->upsmon...) going AWOL over time Incorrect or missing readings On some devices driver-reported values are systemically off (e.g. x10, x0.1, const+Value, etc.) macOS raspberry USB
Projects
None yet
Development

No branches or pull requests

2 participants