[19.04][stein] barbican-worker is down: Requested revision 1a0c2cdafb38 overlaps with other requested revisions 39cf2e645cba

Bug #1827690 reported by Dmitrii Shcherbakov
This bug affects 1 person
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Barbican Charm  Fix Released  Critical    David Ames
barbican (Ubuntu)         Invalid       Undecided   Unassigned

Bug Description

After deploying barbican (stein, 19.04 charms) I got into a situation where both workers (two-node HA) are down due to a failed alembic migration:

https://private-fileshare.canonical.com/~dima/charm-dumps/04-05-2019-barbican-0-var-log.tar.gz
https://private-fileshare.canonical.com/~dima/charm-dumps/04-05-2019-barbican-1-var-log.tar.gz

https://paste.ubuntu.com/p/J6GvwkXMWc/ - bundle.yaml

May 03 20:51:11 juju-35b20e-3-lxd-0 systemd[1]: Started OpenStack Barbican Key Management Workers.
May 03 20:51:39 juju-35b20e-3-lxd-0 systemd[1]: barbican-worker.service: Main process exited, code=exited, status=1/FAILURE
May 03 20:51:39 juju-35b20e-3-lxd-0 systemd[1]: barbican-worker.service: Failed with result 'exit-code'.

juju status barbican
Model      Controller  Cloud/Region  Version  SLA          Timestamp
openstack  samaas      samaas        2.6-rc1  unsupported  23:27:23Z

App                 Version  Status   Scale  Charm           Store       Rev  OS      Notes
barbican            8.0.0    blocked  2      barbican        jujucharms  18   ubuntu
barbican-vault      1.2.2    active   2      barbican-vault  jujucharms  2    ubuntu
hacluster-barbican           active   2      hacluster       jujucharms  54   ubuntu

Unit                     Workload  Agent  Machine  Public address  Ports              Message
barbican/0               blocked   idle   3/lxd/0  10.232.46.135   9311/tcp,9312/tcp  Services not running that should be: barbican-worker
  barbican-vault/1       active    idle            10.232.46.135                      Unit is ready
  hacluster-barbican/1   active    idle            10.232.46.135                      Unit is ready and clustered
barbican/1*              blocked   idle   2/lxd/0  10.232.46.130   9311/tcp,9312/tcp  Services not running that should be: barbican-worker
  barbican-vault/0*      active    idle            10.232.46.130                      Unit is ready
  hacluster-barbican/0*  active    idle            10.232.46.130                      Unit is ready and clustered

https://paste.ubuntu.com/p/BS3fHw287r/
2019-05-03 21:13:46.582 115638 ERROR barbican alembic.util.exc.CommandError: Requested revision 1a0c2cdafb38 overlaps with other requested revisions 39cf2e645cba

mysql> use barbican;
mysql> select * from alembic_version;
+--------------+
| version_num  |
+--------------+
| 1a0c2cdafb38 |
| 39cf2e645cba |
+--------------+
2 rows in set (0.00 sec)
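
For barbican's linear migration history alembic expects exactly one row in this table; the two rows above are what produce the "overlaps" error. A minimal read-only check of the same state from Python (a sketch only; connection parameters are placeholders):

import pymysql

# Placeholder credentials/host; this only reads the version table.
conn = pymysql.connect(host='127.0.0.1', user='barbican',
                       password='changeme', database='barbican')
cur = conn.cursor()
cur.execute('SELECT version_num FROM alembic_version')
versions = [row[0] for row in cur.fetchall()]
conn.close()

# A healthy barbican schema has exactly one version row.
print('broken:' if len(versions) > 1 else 'ok:', versions)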

Tags: cpe-onsite
description: updated
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Added the UCA package as well, as I am not exactly sure where the root cause is.

The logs attached above point to a synchronization issue in the charm code: both units hit similar errors, but the "Table already exists" messages are for different tables on different units. The barbican-worker logs on both units also contain messages about model migrations, which is problematic because model migrations are not idempotent (see the sketch after the log excerpts below).

barbican/0:

2019-05-03 20:51:24.375 78559 INFO barbican.model.repositories [-] Auto-creating barbican registry DB
2019-05-03 20:51:38.930 78559 ERROR oslo_db.sqlalchemy.exc_filters [-] DBAPIError exception wrapped from (pymysql.err.InternalError) (1050, "Table 'kek_data' already exists") [SQL: '\nCREATE TABLE kek_data (\n\tid VARCHAR(36) NOT NULL, \n\tcreated_at DATETIME NOT NULL, \n\tupdated_at DATETIME NOT NULL, \n\tdeleted_at DATETIME, \n\tdeleted BOOL NOT NULL, \n\tstatus VARCHAR(20) NOT NULL, \n\tplugin_name VARCHAR(255) NOT NULL, \n\tkek_label VARCHAR(255), \n\tproject_id VARCHAR(36) NOT NULL, \n\tactive BOOL NOT NULL, \n\tbind_completed BOOL NOT NULL, \n\talgorithm VARCHAR(255), \n\tbit_length INTEGER, \n\tmode VARCHAR(255), \n\tplugin_meta TEXT, \n\tPRIMARY KEY (id), \n\tCHECK (deleted IN (0, 1)), \n\tCONSTRAINT kek_data_project_fk FOREIGN KEY(project_id) REFERENCES projects (id), \n\tCHECK (active IN (0, 1)), \n\tCHECK (bind_completed IN (0, 1))\n)ENGINE=InnoDB\n\n'] (Background on this error at: http://sqlalche.me/e/2j85): pymysql.err.InternalError: (1050, "Table 'kek_data' already exists")

barbican/1:

2019-05-03 20:51:08.942 96228 INFO barbican.model.repositories [-] Auto-creating barbican registry DB
2019-05-03 20:51:41.095 96228 ERROR oslo_db.sqlalchemy.exc_filters [-] DBAPIError exception wrapped from (pymysql.err.InternalError) (1050, "Table 'secret_store_metadata' already exists") [SQL: '\nCREATE TABLE secret_store_metadata (\n\tid VARCHAR(36) NOT NULL, \n\tcreated_at DATETIME NOT NULL, \n\tupdated_at DATETIME NOT NULL, \n\tdeleted_at DATETIME, \n\tdeleted BOOL NOT NULL, \n\tstatus VARCHAR(20) NOT NULL, \n\t`key` VARCHAR(255) NOT NULL, \n\tvalue VARCHAR(255) NOT NULL, \n\tsecret_id VARCHAR(36) NOT NULL, \n\tPRIMARY KEY (id), \n\tCHECK (deleted IN (0, 1)), \n\tFOREIGN KEY(secret_id) REFERENCES secrets (id)\n)ENGINE=InnoDB\n\n'] (Background on this error at: http://sqlalche.me/e/2j85): pymysql.err.InternalError: (1050, "Table 'secret_store_metadata' already exists")
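
The racy pattern behind these errors: schema auto-creation probes for each table and then issues CREATE TABLE, and the two steps are not atomic across workers. An illustrative SQLAlchemy snippet (not barbican's actual code; the connection URL is a placeholder):

import sqlalchemy as sa

engine = sa.create_engine(
    'mysql+pymysql://barbican:changeme@127.0.0.1/barbican')

metadata = sa.MetaData()
kek_data = sa.Table(
    'kek_data', metadata,
    sa.Column('id', sa.String(36), primary_key=True),
)

# checkfirst=True probes for the table before issuing CREATE TABLE, but
# the probe and the CREATE are not one atomic step: a second worker can
# create the table in between, and this call then fails with MySQL
# error 1050, "Table 'kek_data' already exists".
metadata.create_all(engine, checkfirst=True)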

description: updated
Revision history for this message
David Ames (thedac) wrote :

TRIAGE:

I suspect we are not limiting the db migration to the leader only.
We need to guarantee that only the leader runs the migration.
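
A minimal sketch of that guard, assuming a charm hook context with charmhelpers available (this is not the actual charm code):

import subprocess

from charmhelpers.core import hookenv

def migrate_if_leader():
    # Peer units skip the migration entirely; only the elected
    # Juju leader touches the database schema.
    if hookenv.is_leader():
        subprocess.check_call(['barbican-manage', 'db', 'upgrade'])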

Changed in charm-barbican:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 19.07
James Page (james-page)
Changed in charm-barbican:
assignee: nobody → James Page (james-page)
status: Triaged → In Progress
David Ames (thedac)
Changed in charm-barbican:
milestone: 19.07 → 19.10
David Ames (thedac)
Changed in charm-barbican:
milestone: 19.10 → 20.01
James Page (james-page)
Changed in charm-barbican:
milestone: 20.01 → 20.05
David Ames (thedac)
Changed in charm-barbican:
milestone: 20.05 → 20.08
James Page (james-page)
Changed in charm-barbican:
milestone: 20.08 → none
Revision history for this message
David Ames (thedac) wrote :

Doing some research.

1) I don't think this is caused by more than one charm unit running migrations at the same time (see comment #2).
2) The problem is barbican itself failing, per the traceback.
3) My current theory is that a failed migration attempt leaves behind two alembic versions.

Per [1] we should only ever expect one alembic version at a time. As a workaround we may need to do some DB surgery per [1], after a backup of course.

Finding out what gets barbican into this state will take a bit more digging through the logs.

[1] https://stackoverflow.com/questions/42424320/how-do-i-fix-alembics-requested-revision-overlaps-with-other-requested-revisio
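
Concretely, the surgery from [1] amounts to deleting the stale row, e.g. (a sketch with placeholder credentials; take a backup first — the revision kept here matches the head all units converge on in the next comment):

import pymysql

conn = pymysql.connect(host='127.0.0.1', user='barbican',
                       password='changeme', database='barbican')
try:
    with conn.cursor() as cur:
        # Keep the head revision (39cf2e645cba) and drop the
        # overlapping ancestor left behind by the failed run.
        cur.execute('DELETE FROM alembic_version WHERE version_num = %s',
                    ('1a0c2cdafb38',))
    conn.commit()
finally:
    conn.close()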

Revision history for this message
Márton Kiss (marton-kiss) wrote :

After additional debugging and research I can confirm the issue is the result of a race condition: multiple barbican-worker units try to populate the database schema and upgrade the alembic version simultaneously.

As a result the barbican-worker service stops, because it tries to recreate tables already created by other worker units.

I applied the following manual steps as a workaround:

1. stop all barbican-worker services and the jujud agents
2. drop the db tables
3. clean up alembic db change scripts if present: rm -rf /usr/lib/python3/dist-packages/barbican/model/migration/alembic_migrations/versions/*_db_change.py
4. start barbican-worker on a single unit (this populates the db tables up to the head revision)
5. restart all barbican-workers and start the jujud agents

As a result, every unit should report the same db revision:
$ juju run --application barbican 'barbican-db-manage current -V | grep Revision'
- Stdout: |2
        Revision ID: 39cf2e645cba
  UnitId: barbican/0
- Stdout: |2
        Revision ID: 39cf2e645cba
  UnitId: barbican/1
- Stdout: |2
        Revision ID: 39cf2e645cba
  UnitId: barbican/2

The root cause of the problem is the default behaviour of the barbican-worker service: each worker populates the database schema after service start, which can lead to this race condition.

A proper permanent charm fix should:
1. set db_auto_create to false in /etc/barbican/barbican.conf, as [1] describes in the "Install and configure components" / "Populate the Key Manager service database" section;
2. have the leader unit populate the database;
3. start the barbican-worker services only after the leader has finished the db schema upgrade.

[1] https://docs.openstack.org/barbican/stein/install/install-ubuntu.html
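
Sketched end to end as it might run on the leader unit (illustrative only — a real charm renders the config from templates rather than editing the file in place):

import configparser
import subprocess

CONF = '/etc/barbican/barbican.conf'

# 1. Stop workers from auto-creating the schema at startup.
cfg = configparser.ConfigParser(interpolation=None)
cfg.optionxform = str  # preserve option-name case
cfg.read(CONF)
cfg['DEFAULT']['db_auto_create'] = 'False'
with open(CONF, 'w') as fh:
    cfg.write(fh)

# 2. Populate the schema once, from the leader only.
subprocess.check_call(['barbican-manage', 'db', 'upgrade'])

# 3. Only now (re)start the worker service.
subprocess.check_call(['systemctl', 'restart', 'barbican-worker'])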

James Page (james-page)
Changed in charm-barbican:
status: In Progress → New
assignee: James Page (james-page) → nobody
David Ames (thedac)
Changed in charm-barbican:
assignee: nobody → David Ames (thedac)
milestone: none → 20.10
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-barbican (master)

Fix proposed to branch: master
Review: https://review.opendev.org/756931

Changed in charm-barbican:
status: Triaged → In Progress
Revision history for this message
David Ames (thedac) wrote :

The fix [0] will resolve this bug but is currently blocked by LP Bug#1899104 [1].

[0] https://review.opendev.org/756931
[1] https://bugs.launchpad.net/ubuntu/+source/barbican/+bug/1899104

Changed in barbican (Ubuntu):
status: New → Invalid
David Ames (thedac)
Changed in charm-barbican:
milestone: 20.10 → 21.01
David Ames (thedac)
Changed in charm-barbican:
milestone: 21.01 → none
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-barbican (master)

Reviewed: https://review.opendev.org/c/openstack/charm-barbican/+/756931
Committed: https://opendev.org/openstack/charm-barbican/commit/ca80b266ed831813a2a9ecec84d288ec3f8f5104
Submitter: "Zuul (22348)"
Branch: master

commit ca80b266ed831813a2a9ecec84d288ec3f8f5104
Author: David Ames <email address hidden>
Date: Thu Oct 8 22:55:30 2020 +0000

    Run db sync manually by leader only

    Without db_auto_create = False each barbican-worker attempts to run
    alembic upgrades on the database whenever it starts. This leads to
    race conditions that can leave the state of the database broken
    (multiple alembic versions that "overlap").

    This change runs the barbican-manage db upgrade from the leader only,
    avoiding the race condition. charms.openstack also handles the
    openstack upgrade process with a call to instance.db_sync.

    Change-Id: I6b9498059c7057b73b1c3db0e355456c38b0510e
    Closes-Bug: #1827690
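
In charms.openstack terms the merged change is shaped roughly like this (a sketch; the handler name and trigger state are illustrative — see the review above for the real code):

import charms.reactive as reactive
import charms_openstack.charm as charm
from charmhelpers.core import hookenv

@reactive.when('shared-db.available')
def run_db_sync(shared_db):
    # Non-leader units configure themselves but never touch the
    # database schema; only the leader runs the alembic upgrade.
    if hookenv.is_leader():
        with charm.provide_charm_instance() as instance:
            instance.db_sync()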

Changed in charm-barbican:
status: In Progress → Fix Committed
Changed in charm-barbican:
milestone: none → 21.04
Changed in charm-barbican:
status: Fix Committed → Fix Released