[focal][ussuri] Cinder volume service fails to start after authentication issue against innodb cluster
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Cinder Charm | Confirmed | Undecided | Unassigned |
Bug Description
As seen during this test run: https:/
Crashdump here: https:/
This was a Focal-Ussuri run using our current stable SKUs. The deployment went fairly normally; however, at some point cinder/0 got hung up due to a failed cinder-volume service:
cinder/0 blocked idle 1/lxd/2 10.244.40.244 8776/tcp Services not running that should be: cinder-volume
cinder-ceph/2 waiting idle 10.244.40.244 Incomplete relations: ceph
cinder-
hacluster-
logrotated/50 active idle 10.244.40.244 Unit is ready.
public-
cinder/1 active idle 3/lxd/2 10.244.41.4 8776/tcp Unit is ready
cinder-ceph/1 waiting idle 10.244.41.4 Incomplete relations: ceph
cinder-
hacluster-
logrotated/48 active idle 10.244.41.4 Unit is ready.
public-
cinder/2* active idle 5/lxd/2 10.244.40.253 8776/tcp Unit is ready
cinder-ceph/0* waiting idle 10.244.40.253 Incomplete relations: ceph
cinder-
hacluster-
logrotated/39 active idle 10.244.40.253 Unit is ready.
public-
Journalctl on that unit was fairly unhelpful; however, looking into the cinder-volume log, we can see the issue:
2020-12-04 15:17:15.244 64496 ERROR cinder.cmd.volume Traceback (most recent call last):
2020-12-04 15:17:15.244 64496 ERROR cinder.cmd.volume File "/usr/lib/
2020-12-04 15:17:15.244 64496 ERROR cinder.cmd.volume server = service.
...
...
2020-12-04 15:17:15.244 64496 ERROR cinder.cmd.volume sqlalchemy.
2020-12-04 15:17:15.244 64496 ERROR cinder.cmd.volume (Background on this error at: http://
2020-12-04 15:17:15.244 64496 ERROR cinder.cmd.volume
2020-12-04 15:17:15.255 64496 ERROR cinder.cmd.volume [req-e608b7f5-
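One quick way to tell whether this is purely an authentication/connectivity problem (rather than a cinder bug) would be to open a connection with the same SQLAlchemy URL that cinder-volume uses. A minimal sketch, assuming the real connection string is copied from the [database] section of /etc/cinder/cinder.conf on the failing unit; the URL below is only a placeholder:

```python
# Minimal sketch: try the same SQLAlchemy connection that cinder-volume uses.
# The URL is a placeholder; copy the real 'connection' value from the
# [database] section of /etc/cinder/cinder.conf on the failing unit.
import sys

from sqlalchemy import create_engine, text

DB_URL = "mysql+pymysql://cinder:PASSWORD@10.x.x.x/cinder"  # placeholder

try:
    engine = create_engine(DB_URL)
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
    print("connection OK")
except Exception as exc:
    # An OperationalError here (e.g. access denied) would point at a
    # missing grant/ACL rather than a problem inside cinder itself.
    print(f"connection failed: {exc}", file=sys.stderr)
    sys.exit(1)
```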
The other two units have no such connection issues. The innodb cluster is also reporting healthy:
mysql-innodb-
mysql-innodb-
mysql-innodb-
Not sure what to make of this. My best guess is that either the unit was never granted access to the database, or it was never properly added to the ACL for that cluster.
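If that guess is right, it should be visible from the MySQL side. A minimal sketch of such a check, assuming cluster root credentials (the host and password below are placeholders), listing which user@host accounts actually exist for cinder:

```python
# Sketch: list MySQL accounts to see whether the failing unit's address was
# ever granted access. Host and credentials are placeholders; on a real
# deployment they would come from the mysql-innodb-cluster leader.
import pymysql

conn = pymysql.connect(
    host="10.x.x.x",           # placeholder: an innodb-cluster unit address
    user="root",
    password="ROOT_PASSWORD",  # placeholder
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT user, host FROM mysql.user WHERE user = 'cinder'")
        for user, host in cur.fetchall():
            print(f"{user}@{host}")
finally:
    conn.close()
```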
I saw this exact same behavior in a focal-ussuri deployment: two cinder units managed to establish a connection with mysql-innodb-cluster, but a third one failed, causing the cinder-volume failure.
The workaround found was to restart the cinder-volume service inside the failing unit.
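For reference, a minimal sketch of that workaround, assuming it is run as root on the failing unit (e.g. after 'juju ssh cinder/0'):

```python
# Sketch of the workaround: restart cinder-volume on the failing unit
# and confirm it comes back up. Run as root on the unit itself.
import subprocess

subprocess.run(["systemctl", "restart", "cinder-volume"], check=True)
subprocess.run(["systemctl", "is-active", "cinder-volume"], check=True)
```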