Duplicate Cinder services DB entries after upgrade

Bug #1891330 reported by Oliver Horecny
Affects: Cinder
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

Environment
===========
- SETUP: HA setup with 2 controllers.
- OS: Red Hat Enterprise Linux Server release 7.7 (Maipo)
- KERNEL: 3.10.0-957.el7.x86_64
- DOCKER VERSION: 1.13.1
- DOCKER IMAGES: binary
- upgrade from Ocata to Rocky by using Rocky kolla-ansible

Description
===========
This issue was first reported against the kolla-ansible project (https://bugs.launchpad.net/kolla-ansible/+bug/1889202) and was also hit during an upgrade from Stein to Train. In the original ticket we concluded that this is an issue in the Cinder DB, which is why this ticket is also filed against the cinder project.

It seems that the best way to fix this would be a uniqueness constraint in the DB, which was proposed some time ago in this change: https://review.opendev.org/#/c/389049/
Unfortunately, that change was abandoned. Can it be reconsidered?
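As a rough sketch of what such a constraint buys (using sqlite3 as a stand-in; this is not Cinder's actual schema or code), a unique index over (host, `binary`, deleted) turns a duplicate registration into a catchable error, so the service that loses the race can fall back to the existing row instead of creating a second one:

```python
import sqlite3

# Minimal stand-in for the cinder.services table; column names follow
# the DB query in this report, but the schema is simplified.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE services (
        id      INTEGER PRIMARY KEY,
        host    TEXT NOT NULL,
        binary  TEXT NOT NULL,
        deleted INTEGER NOT NULL DEFAULT 0,
        UNIQUE (host, binary, deleted)
    )
""")

def register(host, binary):
    """Create the service row, or reuse the existing one on conflict."""
    try:
        conn.execute(
            "INSERT INTO services (host, binary) VALUES (?, ?)",
            (host, binary),
        )
    except sqlite3.IntegrityError:
        # Another service instance won the race; the constraint rejects
        # the duplicate, so we reuse the existing row instead.
        pass

# Both controllers try to register cinder-backup at the same time.
register("128.0.0.50", "cinder-backup")
register("128.0.0.50", "cinder-backup")

count = conn.execute(
    "SELECT count(*) FROM services WHERE binary = 'cinder-backup'"
).fetchone()[0]
print(count)  # 1 -- the second registration reused the existing row
```

Note that a real migration would also have to deal with rows already duplicated before the constraint is added, which was one of the complications in the abandoned review.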

Original ticket description:
During an OpenStack upgrade there is an Ansible part that configures the Cinder services and then restarts their containers. In an HA setup the services on both controllers are restarted in parallel, and there appears to be a race condition that sometimes causes the cinder-backup or cinder-scheduler entry to be added to the DB twice, seemingly one entry per controller.
The upgrade finished successfully and OpenStack was fully functional, but the doubled cinder-backup service entry was visible in OpenStack:

[root@osc1 softi-ops(keystone_admin)]# openstack volume service list
+------------------+----------------+------+---------+-------+----------------------------+
| Binary           | Host           | Zone | Status  | State | Updated At                 |
+------------------+----------------+------+---------+-------+----------------------------+
| cinder-scheduler | 128.0.0.50     | nova | enabled | up    | 2020-05-15T14:44:25.000000 |
| cinder-volume    | 128.0.0.50@lvm | nova | enabled | up    | 2020-05-15T14:44:23.000000 |
| cinder-backup    | 128.0.0.50     | nova | enabled | up    | 2020-05-15T14:44:25.000000 |
| cinder-backup    | 128.0.0.50     | nova | enabled | down  | 2020-05-15T14:31:53.000000 |
+------------------+----------------+------+---------+-------+----------------------------+

And also in DB:

MariaDB [(none)]> select created_at,updated_at,deleted_at,deleted,id,host,`binary` from cinder.services;
+---------------------+---------------------+------------+---------+----+--------------------------+------------------+
| created_at          | updated_at          | deleted_at | deleted | id | host                     | binary           |
+---------------------+---------------------+------------+---------+----+--------------------------+------------------+
| 2020-05-15 13:18:18 | 2020-05-15 14:46:25 | NULL       |       0 |  2 | 128.0.0.50               | cinder-scheduler |
| 2020-05-15 13:18:20 | 2020-05-15 14:46:23 | NULL       |       0 |  4 | 128.0.0.50@lvm           | cinder-volume    |
| 2020-05-15 13:18:20 | 2020-05-15 14:46:25 | NULL       |       0 |  8 | 128.0.0.50               | cinder-backup    |
| 2020-05-15 13:18:20 | 2020-05-15 13:23:51 | NULL       |       0 | 10 | 128.0.0.50               | cinder-backup    |
+---------------------+---------------------+------------+---------+----+--------------------------+------------------+

The DB query output above shows that there are two cinder-backup entries and that both were created at exactly the same time.
The kolla-ansible logs show that the handler restarting the cinder-backup containers was called a few moments before that time, so it looks like a race condition when both services start up and connect to the DB.
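The suspected interleaving is a classic check-then-insert race: each restarting service looks up its row, finds none, and inserts one. A minimal simulation of that interleaving (hypothetical helper names, not Cinder's actual code path; sqlite3 stands in for MariaDB), where both controllers' lookups run before either insert:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No uniqueness constraint on (host, binary), matching the affected schema.
conn.execute(
    "CREATE TABLE services (id INTEGER PRIMARY KEY, host TEXT, binary TEXT)"
)

def lookup(host, binary):
    """Step 1 of check-then-insert: does my service row already exist?"""
    return conn.execute(
        "SELECT id FROM services WHERE host = ? AND binary = ?",
        (host, binary),
    ).fetchone()

def insert(host, binary):
    """Step 2: create the row if the lookup found nothing."""
    conn.execute(
        "INSERT INTO services (host, binary) VALUES (?, ?)", (host, binary)
    )

# Interleaving seen during the parallel restart: both controllers run
# the lookup before either one has inserted, so both see "no row yet".
seen_a = lookup("128.0.0.50", "cinder-backup")
seen_b = lookup("128.0.0.50", "cinder-backup")
if seen_a is None:
    insert("128.0.0.50", "cinder-backup")
if seen_b is None:
    insert("128.0.0.50", "cinder-backup")

count = conn.execute("SELECT count(*) FROM services").fetchone()[0]
print(count)  # 2 -- the duplicate cinder-backup rows, as in this bug
```

With a unique constraint in place, the second insert would fail instead of silently creating the duplicate, which is why the abandoned review is relevant here.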

Steps to reproduce
==================
Upgrade from Ocata to Rocky using the Rocky kolla-ansible. This issue is hard to reproduce because it happens only occasionally, but it has been hit several times.

Expected result
===============
After the upgrade, only one entry for cinder-backup should be present in the DB.

Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

The patch adding a DB uniqueness constraint was rejected because A/A was not yet supported. It is now, so we should probably reconsider the general approach of https://review.opendev.org/#/c/389049/ (there were some other objections to the patch).

I haven't been able to reproduce this locally, but am going ahead and marking 'triaged' based on yoctozepto's confirmation in https://bugs.launchpad.net/kolla-ansible/+bug/1889202/comments/4

Changed in cinder:
importance: Undecided → Low
status: New → Triaged
Revision history for this message
Oliver Horecny (horecoli) wrote :

Good, so what should be done next? Should we somehow contact the owner of the original change?
