[19.04][stein] barbican-worker is down: Requested revision 1a0c2cdafb38 overlaps with other requested revisions 39cf2e645cba

Bug #1827690 reported by Dmitrii Shcherbakov
This bug affects 1 person
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Barbican Charm  Fix Released  Critical    David Ames
barbican (Ubuntu)         Invalid       Undecided   Unassigned

Bug Description

After deploying barbican (stein, 19.04 charms) I got into a situation where both workers (two-node HA) are down due to a failed alembic migration:

https://private-fileshare.canonical.com/~dima/charm-dumps/04-05-2019-barbican-0-var-log.tar.gz
https://private-fileshare.canonical.com/~dima/charm-dumps/04-05-2019-barbican-1-var-log.tar.gz

https://paste.ubuntu.com/p/J6GvwkXMWc/ - bundle.yaml

May 03 20:51:11 juju-35b20e-3-lxd-0 systemd[1]: Started OpenStack Barbican Key Management Workers.
May 03 20:51:39 juju-35b20e-3-lxd-0 systemd[1]: barbican-worker.service: Main process exited, code=exited, status=1/FAILURE
May 03 20:51:39 juju-35b20e-3-lxd-0 systemd[1]: barbican-worker.service: Failed with result 'exit-code'.

juju status barbican
Model      Controller  Cloud/Region  Version  SLA          Timestamp
openstack  samaas      samaas        2.6-rc1  unsupported  23:27:23Z

App                 Version  Status   Scale  Charm           Store       Rev  OS      Notes
barbican            8.0.0    blocked  2      barbican        jujucharms  18   ubuntu
barbican-vault      1.2.2    active   2      barbican-vault  jujucharms  2    ubuntu
hacluster-barbican           active   2      hacluster       jujucharms  54   ubuntu

Unit                     Workload  Agent  Machine  Public address  Ports              Message
barbican/0               blocked   idle   3/lxd/0  10.232.46.135   9311/tcp,9312/tcp  Services not running that should be: barbican-worker
  barbican-vault/1       active    idle            10.232.46.135                      Unit is ready
  hacluster-barbican/1   active    idle            10.232.46.135                      Unit is ready and clustered
barbican/1*              blocked   idle   2/lxd/0  10.232.46.130   9311/tcp,9312/tcp  Services not running that should be: barbican-worker
  barbican-vault/0*      active    idle            10.232.46.130                      Unit is ready
  hacluster-barbican/0*  active    idle            10.232.46.130                      Unit is ready and clustered

https://paste.ubuntu.com/p/BS3fHw287r/
2019-05-03 21:13:46.582 115638 ERROR barbican alembic.util.exc.CommandError: Requested revision 1a0c2cdafb38 overlaps with other requested revisions 39cf2e645cba

mysql> use barbican;
mysql> select * from alembic_version;
+--------------+
| version_num  |
+--------------+
| 1a0c2cdafb38 |
| 39cf2e645cba |
+--------------+
2 rows in set (0.00 sec)
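
For barbican's linear migration history alembic expects exactly one row in this table; the two rows above are what produce the "overlaps" error. A minimal read-only check of the same state from Python (a sketch only; connection parameters are placeholders):

import pymysql

# Placeholder credentials/host; this only reads the version table.
conn = pymysql.connect(host='127.0.0.1', user='barbican',
                       password='changeme', database='barbican')
cur = conn.cursor()
cur.execute('SELECT version_num FROM alembic_version')
versions = [row[0] for row in cur.fetchall()]
conn.close()

# A healthy barbican schema has exactly one version row.
print('broken:' if len(versions) > 1 else 'ok:', versions)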

Tags: cpe-onsite
description: updated
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Added the UCA package as well, as I am not exactly sure where the root cause is.

The logs attached above point to a synchronization issue in the charm code: both units hit similar errors, but the "Table already exists" messages are for different tables on different units. The barbican-worker logs on both units also contain messages about model migrations, which is problematic because model migrations are not idempotent (see the sketch after the log excerpts below).

barbican/0:

2019-05-03 20:51:24.375 78559 INFO barbican.model.repositories [-] Auto-creating barbican registry DB
2019-05-03 20:51:38.930 78559 ERROR oslo_db.sqlalchemy.exc_filters [-] DBAPIError exception wrapped from (pymysql.err.InternalError) (1050, "Table 'kek_data' already exists") [SQL: '\nCREATE TABLE kek_data (\n\tid VARCHAR(36) NOT NULL, \n\tcreated_at DATETIME NOT NULL, \n\tupdated_at DATETIME NOT NULL, \n\tdeleted_at DATETIME, \n\tdeleted BOOL NOT NULL, \n\tstatus VARCHAR(20) NOT NULL, \n\tplugin_name VARCHAR(255) NOT NULL, \n\tkek_label VARCHAR(255), \n\tproject_id VARCHAR(36) NOT NULL, \n\tactive BOOL NOT NULL, \n\tbind_completed BOOL NOT NULL, \n\talgorithm VARCHAR(255), \n\tbit_length INTEGER, \n\tmode VARCHAR(255), \n\tplugin_meta TEXT, \n\tPRIMARY KEY (id), \n\tCHECK (deleted IN (0, 1)), \n\tCONSTRAINT kek_data_project_fk FOREIGN KEY(project_id) REFERENCES projects (id), \n\tCHECK (active IN (0, 1)), \n\tCHECK (bind_completed IN (0, 1))\n)ENGINE=InnoDB\n\n'] (Background on this error at: http://sqlalche.me/e/2j85): pymysql.err.InternalError: (1050, "Table 'kek_data' already exists")

barbican/1:

2019-05-03 20:51:08.942 96228 INFO barbican.model.repositories [-] Auto-creating barbican registry DB
2019-05-03 20:51:41.095 96228 ERROR oslo_db.sqlalchemy.exc_filters [-] DBAPIError exception wrapped from (pymysql.err.InternalError) (1050, "Table 'secret_store_metadata' already exists") [SQL: '\nCREATE TABLE secret_store_metadata (\n\tid VARCHAR(36) NOT NULL, \n\tcreated_at DATETIME NOT NULL, \n\tupdated_at DATETIME NOT NULL, \n\tdeleted_at DATETIME, \n\tdeleted BOOL NOT NULL, \n\tstatus VARCHAR(20) NOT NULL, \n\t`key` VARCHAR(255) NOT NULL, \n\tvalue VARCHAR(255) NOT NULL, \n\tsecret_id VARCHAR(36) NOT NULL, \n\tPRIMARY KEY (id), \n\tCHECK (deleted IN (0, 1)), \n\tFOREIGN KEY(secret_id) REFERENCES secrets (id)\n)ENGINE=InnoDB\n\n'] (Background on this error at: http://sqlalche.me/e/2j85): pymysql.err.InternalError: (1050, "Table 'secret_store_metadata' already exists")
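
The racy pattern behind these errors: schema auto-creation probes for each table and then issues CREATE TABLE, and the two steps are not atomic across workers. An illustrative SQLAlchemy snippet (not barbican's actual code; the connection URL is a placeholder):

import sqlalchemy as sa

engine = sa.create_engine(
    'mysql+pymysql://barbican:changeme@127.0.0.1/barbican')

metadata = sa.MetaData()
kek_data = sa.Table(
    'kek_data', metadata,
    sa.Column('id', sa.String(36), primary_key=True),
)

# checkfirst=True probes for the table before issuing CREATE TABLE, but
# the probe and the CREATE are not one atomic step: a second worker can
# create the table in between, and this call then fails with MySQL
# error 1050, "Table 'kek_data' already exists".
metadata.create_all(engine, checkfirst=True)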

description: updated
Revision history for this message
David Ames (thedac) wrote :

TRIAGE:

I suspect we are not limiting the db migration to the leader only.
We need to guarantee that only the leader runs the migration.
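
A minimal sketch of that guard, assuming a charm hook context with charmhelpers available (this is not the actual charm code):

import subprocess

from charmhelpers.core import hookenv

def migrate_if_leader():
    # Peer units skip the migration entirely; only the elected
    # Juju leader touches the database schema.
    if hookenv.is_leader():
        subprocess.check_call(['barbican-manage', 'db', 'upgrade'])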

Changed in charm-barbican:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 19.07
James Page (james-page)
Changed in charm-barbican:
assignee: nobody → James Page (james-page)
status: Triaged → In Progress
David Ames (thedac)
Changed in charm-barbican:
milestone: 19.07 → 19.10
David Ames (thedac)
Changed in charm-barbican:
milestone: 19.10 → 20.01
James Page (james-page)
Changed in charm-barbican:
milestone: 20.01 → 20.05
David Ames (thedac)
Changed in charm-barbican:
milestone: 20.05 → 20.08
James Page (james-page)
Changed in charm-barbican:
milestone: 20.08 → none
Revision history for this message
David Ames (thedac) wrote :

Doing some research.

1) I don't think this is caused by more than one charm unit running migrations at the same time (see comment #2).
2) The problem is barbican itself failing, per the traceback.
3) My current theory is that a failed migration attempt leaves behind two alembic versions.

Per [1] we should only ever expect one alembic version at a time. As a workaround we may need to do some DB surgery per [1], after a backup of course.

Finding out what gets barbican into this state will take a bit more digging through the logs.

[1] https://stackoverflow.com/questions/42424320/how-do-i-fix-alembics-requested-revision-overlaps-with-other-requested-revisio
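
Concretely, the surgery from [1] amounts to deleting the stale row, e.g. (a sketch with placeholder credentials; take a backup first — the revision kept here matches the head all units converge on in the next comment):

import pymysql

conn = pymysql.connect(host='127.0.0.1', user='barbican',
                       password='changeme', database='barbican')
try:
    with conn.cursor() as cur:
        # Keep the head revision (39cf2e645cba) and drop the
        # overlapping ancestor left behind by the failed run.
        cur.execute('DELETE FROM alembic_version WHERE version_num = %s',
                    ('1a0c2cdafb38',))
    conn.commit()
finally:
    conn.close()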

Revision history for this message
Márton Kiss (marton-kiss) wrote :

After additional debugging and research I can confirm the issue is the result of a race condition: multiple barbican-worker units try to populate the database schema and upgrade the alembic version simultaneously.

As a result the barbican-worker service stops, because it tries to recreate tables already created by other worker units.

I applied the following manual steps as a workaround:

1. stop all barbican-worker services and the jujud agents
2. drop the db tables
3. clean up alembic db change scripts if present: rm -rf /usr/lib/python3/dist-packages/barbican/model/migration/alembic_migrations/versions/*_db_change.py
4. start barbican-worker on a single unit (this populates the db tables up to the head revision)
5. restart all barbican-workers and start the jujud agents

As a result, every unit should report the same db revision:
$ juju run --application barbican 'barbican-db-manage current -V | grep Revision'
- Stdout: |2
        Revision ID: 39cf2e645cba
  UnitId: barbican/0
- Stdout: |2
        Revision ID: 39cf2e645cba
  UnitId: barbican/1
- Stdout: |2
        Revision ID: 39cf2e645cba
  UnitId: barbican/2

The root cause of the problem is the default behaviour of the barbican-worker service: each worker populates the database schema after service start, which can lead to this race condition.

A proper permanent charm fix should:
1. set db_auto_create to false in /etc/barbican/barbican.conf, as [1] describes in the "Install and configure components" / "Populate the Key Manager service database" section;
2. have the leader unit populate the database;
3. start the barbican-worker services only after the leader has finished the db schema upgrade.

[1] https://docs.openstack.org/barbican/stein/install/install-ubuntu.html
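
Sketched end to end as it might run on the leader unit (illustrative only — a real charm renders the config from templates rather than editing the file in place):

import configparser
import subprocess

CONF = '/etc/barbican/barbican.conf'

# 1. Stop workers from auto-creating the schema at startup.
cfg = configparser.ConfigParser(interpolation=None)
cfg.optionxform = str  # preserve option-name case
cfg.read(CONF)
cfg['DEFAULT']['db_auto_create'] = 'False'
with open(CONF, 'w') as fh:
    cfg.write(fh)

# 2. Populate the schema once, from the leader only.
subprocess.check_call(['barbican-manage', 'db', 'upgrade'])

# 3. Only now (re)start the worker service.
subprocess.check_call(['systemctl', 'restart', 'barbican-worker'])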

James Page (james-page)
Changed in charm-barbican:
status: In Progress → New
assignee: James Page (james-page) → nobody
David Ames (thedac)
Changed in charm-barbican:
assignee: nobody → David Ames (thedac)
milestone: none → 20.10
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-barbican (master)

Fix proposed to branch: master
Review: https://review.opendev.org/756931

Changed in charm-barbican:
status: Triaged → In Progress
Revision history for this message
David Ames (thedac) wrote :

The fix [0] will resolve this bug but is currently blocked by LP Bug#1899104 [1].

[0] https://review.opendev.org/756931
[1] https://bugs.launchpad.net/ubuntu/+source/barbican/+bug/1899104

Changed in barbican (Ubuntu):
status: New → Invalid
David Ames (thedac)
Changed in charm-barbican:
milestone: 20.10 → 21.01
David Ames (thedac)
Changed in charm-barbican:
milestone: 21.01 → none
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-barbican (master)

Reviewed: https://review.opendev.org/c/openstack/charm-barbican/+/756931
Committed: https://opendev.org/openstack/charm-barbican/commit/ca80b266ed831813a2a9ecec84d288ec3f8f5104
Submitter: "Zuul (22348)"
Branch: master

commit ca80b266ed831813a2a9ecec84d288ec3f8f5104
Author: David Ames <email address hidden>
Date: Thu Oct 8 22:55:30 2020 +0000

    Run db sync manually by leader only

    Without db_auto_create = False each barbican-worker attempts to run
    alembic upgrades on the database whenever it starts. This leads to
    race conditions that can leave the state of the database broken
    (multiple alembic versions that "overlap").

    This change runs the barbican-manage db upgrade from the leader only,
    avoiding the race condition. charms.openstack also handles the
    openstack upgrade process with a call to instance.db_sync.

    Change-Id: I6b9498059c7057b73b1c3db0e355456c38b0510e
    Closes-Bug: #1827690
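
In charms.openstack terms the merged change is shaped roughly like this (a sketch; the handler name and trigger state are illustrative — see the review above for the real code):

import charms.reactive as reactive
import charms_openstack.charm as charm
from charmhelpers.core import hookenv

@reactive.when('shared-db.available')
def run_db_sync(shared_db):
    # Non-leader units configure themselves but never touch the
    # database schema; only the leader runs the alembic upgrade.
    if hookenv.is_leader():
        with charm.provide_charm_instance() as instance:
            instance.db_sync()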

Changed in charm-barbican:
status: In Progress → Fix Committed
Changed in charm-barbican:
milestone: none → 21.04
Changed in charm-barbican:
status: Fix Committed → Fix Released