RBD driver and partially dead Ceph monitors causing DB locks and quota inconsistencies

Bug #1856182 reported by Alex Walender
This bug affects 1 person
Affects: Cinder
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

We encountered a problem which is partially related to https://bugs.launchpad.net/cinder/+bug/1685818 and https://bugs.launchpad.net/cinder/+bug/1789828

System:
OpenStack Rocky (prod), Stein (dev)
3x controllers, each running cinder-volume in HA
MariaDB in a Galera cluster
Ceph Nautilus with 3 monitors, storing RBD images for Cinder and Glance.

Observation:
During parallel creation and deletion of multiple volumes, cinder-volume throws multiple exceptions
about dropped MySQL connections and DB lock problems on the quota_usage table.
We created a total of 100 volumes, 20 at a time in parallel. Roughly 20% of these volumes remain stuck in "Creating" forever. Creating these volumes also takes an extremely long time (~20 minutes).
We tuned InnoDB parameters and RPC thread counts in Cinder with no success. Increasing "connect_timeout" reduced the excessive number of dropped SQL connections, but the underlying problem remained.

During testing, we found that our ceph.conf was outdated. Two of the three monitors were not reachable, causing long timeouts on RBD connections.
I suspect that a Cinder RPC worker makes quota reservations in the database before connecting to the RBD pool via the nonexistent monitors. This would lock the DB row for a long time and cause other RPC workers to misbehave during their own volume creation.
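
To spot an outdated monitor list like the one described above, a plain TCP probe with a short timeout can help. The following is a minimal sketch, not part of Cinder or Ceph: the function name `reachable_monitors` and the way the addresses are passed in are assumptions for illustration, and a successful TCP connect only shows that something is listening on the port, not that the monitor is in quorum.

```python
import socket


def reachable_monitors(monitors, timeout=2.0):
    """Return the (host, port) pairs that accept a TCP connection.

    `monitors` is a hypothetical list of monitor addresses, for example
    the entries of mon_host in ceph.conf (commonly port 6789 for the v1
    messenger or 3300 for v2).
    """
    alive = []
    for host, port in monitors:
        try:
            # A short timeout keeps a dead monitor from stalling the check.
            with socket.create_connection((host, port), timeout=timeout):
                alive.append((host, port))
        except OSError:
            # Refused, unreachable, or timed out: treat as not reachable.
            pass
    return alive
```

Calling it with the three configured monitor addresses and comparing the result against the input would have revealed the two stale entries without waiting for RBD connection timeouts.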

This ultimately causes inconsistencies in the quota_usage table, since there is no clean-up after a failed RBD volume creation. Database rollbacks probably complicate matters further.

Proposal:
Make sure that we have a valid Ceph connection to the RBD pool before making any reservations or quota manipulations. Downed monitors are a valid situation for Ceph. Cinder-volume should not expect all monitors listed in ceph.conf to be up and running.
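
The ordering proposed above could be sketched roughly as follows. This is an illustration only and does not reflect how Cinder is actually structured: `create_volume_guarded`, `BackendUnavailable`, and all five callables are hypothetical stand-ins, not Cinder or librados APIs.

```python
class BackendUnavailable(Exception):
    """Raised when the storage backend cannot be reached (hypothetical)."""


def create_volume_guarded(check_backend, reserve_quota, create_image,
                          commit_quota, rollback_quota):
    """Check the backend first, then reserve quota; roll back on failure.

    All five arguments are hypothetical callables supplied by the caller.
    """
    if not check_backend():
        # Fail fast: nothing is reserved yet, so no quota row is touched.
        raise BackendUnavailable("no reachable Ceph monitor")

    reservation = reserve_quota()
    try:
        create_image()
    except Exception:
        # Release the reservation so quota usage stays consistent.
        rollback_quota(reservation)
        raise
    commit_quota(reservation)
    return reservation
```

With this ordering, an unreachable backend is detected before any quota row is reserved, and a failure during image creation releases the reservation instead of leaving it behind.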
