ilorest simultaneous run contention

Bug #1990702 reported by John Lettman
This bug affects 2 people
Affects: hw-health-charm
Status: Won't Fix
Importance: Low
Assigned to: Unassigned

Bug Description

The `ilorest` tool permits only one application instance at a time:
$ TMP=/var/lib/hw-health sudo /usr/sbin/ilorest list -j --selector=Power --refresh
iLOrest : RESTful Interface Tool version 3.2.2
Copyright (c) 2014-2021 Hewlett Packard Enterprise Development LP
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
ERROR : 2

Blob was overwritten by another user. Please ensure only one user is making changes at a time locally.

$ echo $?
76

As a result, whenever a previous run of the cronjob that reads `ilorest` takes too long, the next run hits this error. Further, multi-threaded attempts to acquire `ilorest` data hit the same wall. The script halts on this error and fails to write out data.

We are starting to see cases where this happens, causing the following alarm:
/var/lib/nagios/ilorest.out: was last modified on Fri Sep 23 14:52:47 2022 and is too old (> 1200 seconds).

A probable solution is to take a lock that checks for running instances before executing the `ilorest` program, so only one invocation runs at a time.
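
A minimal sketch of that idea, assuming the cron wrapper is a Python script such as cron_ilorest.py (the lock file path and the ilorest arguments are illustrative, not part of the charm):

#!/usr/bin/env python3
"""Sketch: serialize ilorest invocations with an exclusive file lock."""
import fcntl
import subprocess
import sys

LOCK_PATH = "/var/lock/hw-health-ilorest.lock"  # hypothetical lock file

def main():
    with open(LOCK_PATH, "w") as lock:
        try:
            # Non-blocking: if another run still holds the lock, skip this run
            # instead of hitting the "Blob was overwritten" error.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print("previous ilorest run still in progress; skipping")
            return 0
        result = subprocess.run(
            ["/usr/sbin/ilorest", "list", "-j", "--selector=Power", "--refresh"],
            capture_output=True, text=True,
        )
        # The lock is released when the file is closed at the end of the block.
        return result.returncode

if __name__ == "__main__":
    sys.exit(main())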

Tags: bseng-1094


Andrea Ieri (aieri)
Changed in charm-hw-health:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Andreas Hamacher (andreashamacher) wrote (last edit):

I was hit by the same bug, but it spat out a different error message:

sudo /usr/sbin/ilorest list -j --selector=Power --refresh
Unable to locate instance for power
root@cz20430cj4:~# echo $?
23

The same happened with Chassis instead of Power, and the error code remained the same.

A manual workaround would be:
# in a root shell
ilorest logout; ilorest login
sudo chmod 644 /var/lib/nagios/ilorest.out
sudo chown nagios:nagios /var/lib/nagios/ilorest.out
# regenerate fresh data
sudo /usr/local/lib/nagios/plugins/cron_ilorest.py
# rerun the nagios health check
sudo -u nagios /usr/local/lib/nagios/plugins/check_hw_health_cron_output.py --filename /var/lib/nagios/ilorest.out
OK No errors found
All OK

Revision history for this message
Marcus Boden (marcusboden) wrote :

Small addendum: the `ilorest logout; ilorest login` needs to be run in a root login shell (e.g. via sudo -i), otherwise the cron job will fail later.

Andrea Ieri (aieri)
tags: added: bseng-1094
Revision history for this message
JamesLin (jneo8) wrote :

It seems this is not a bug, because the blob only allows one connection at a time.

https://github.com/HewlettPackard/python-ilorest-library/blob/f3f5e3c2ade8f316e08fc93f871059f119d32d52/src/redfish/rest/connections.py#L532

We should try to avoid multi-threaded or repeated calls.
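
If the collection code itself is multi-threaded, one way to avoid concurrent calls is to funnel every invocation through a single process-wide lock. A hedged sketch, with the lock and helper names being illustrative rather than anything in the charm:

import subprocess
import threading

# One process-wide lock so only a single thread talks to ilorest at a time.
_ILOREST_LOCK = threading.Lock()

def run_ilorest(args):
    """Run an ilorest command while holding the process-wide lock."""
    with _ILOREST_LOCK:
        return subprocess.run(
            ["/usr/sbin/ilorest", *args],
            capture_output=True, text=True,
        )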

Revision history for this message
JamesLin (jneo8) wrote :

As I understand it, the output of the ilorest command should show the overwrite error.
So one solution is to simply add a try/except around the part that executes the ilorest command and check the output, raising a warning if the previous ilorest connection is still being held.
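
A rough sketch of that idea, assuming the charm shells out to ilorest via subprocess (the helper and exception names here are hypothetical):

import subprocess

class IlorestBusyError(Exception):
    """Raised when a previous ilorest connection still holds the blob."""

def fetch_ilorest(selector):
    cmd = ["/usr/sbin/ilorest", "list", "-j", f"--selector={selector}", "--refresh"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    output = result.stdout + result.stderr
    if result.returncode != 0:
        if "Blob was overwritten by another user" in output:
            # A previous run is still active; warn instead of writing bad data.
            raise IlorestBusyError("previous ilorest connection still active")
        result.check_returncode()  # other failures become CalledProcessError
    return result.stdout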

Revision history for this message
JamesLin (jneo8) wrote (last edit):

The error message is also mentioned here: https://support.hpe.com/hpesc/public/docDisplay?docId=a00128387en_us

> RESOLUTION
> This is working as designed. This may occur if more than one user is accessing the HPE Integrated Lights-Out 5 (iLO 5) in local mode. Ensure only one user is accessing the iLO 5 data. In addition, check if the size of the data exceeds 15kb.

Eric Chen (eric-chen)
Changed in charm-hw-health:
importance: Medium → Low
Revision history for this message
Eric Chen (eric-chen) wrote :

This charm is no longer being actively maintained. Please consider using the new hardware-observer-operator instead (https://github.com/canonical/hardware-observer-operator). This issue does not impact normal usage much, so I will close it. Furthermore, this is a design limitation in ilorest (it cannot accept multiple connections).

Changed in charm-hw-health:
status: Triaged → Won't Fix