nova-manage cellv2 discover_hosts traces when run in parallel

Bug #1824445 reported by melanie witt
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Low
Assigned to: melanie witt

Bug Description

Saw this issue downstream [1] and found a couple of similar issues [2][3] where deployments were running the 'nova-manage cellv2 discover_hosts' command in parallel and experiencing tracebacks like this:

"DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u\"Duplicate entry 'compute-0.localdomain' for key 'uniq_host_mappings0host'\") [SQL: u'INSERT INTO host_mappings (created_at, updated_at, cell_id, host) VALUES (%(created_at)s, %(updated_at)s, %(cell_id)s, %(host)s)'] [parameters: {'host': u'compute-0.localdomain', 'cell_id': 5, 'created_at': datetime.datetime(2019, 4, 10, 15, 20, 50, 527925), 'updated_at': None}] (Background on this error at: http://sqlalche.me/e/gkpj)",

After some discussion on IRC today [4], we concluded it would be best to address the situation with improved command help and warnings when duplicate host mappings are encountered.

While we could catch and ignore DBDuplicateEntry with a try-except, this is not a situation we want to hide from users, because it means they are likely hammering their database with parallel updates, most of which will not succeed. So we should stop and warn instead.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1698630
[2] https://bugs.launchpad.net/openstack-ansible/+bug/1752540
[3] https://github.com/bloomberg/chef-bcpc/issues/1378
[4] http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-04-11.log.html#t2019-04-11T23:04:47
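
To illustrate the "stop and warn" behaviour discussed above, here is a minimal sketch, not the actual nova code. The in-memory HostMappingTable, the DuplicateEntry exception (a stand-in for DBDuplicateEntry) and the discover_hosts function are all hypothetical names chosen for this example; the return value 2 for a duplicate follows the merged commit message.

```python
import logging

LOG = logging.getLogger(__name__)


class DuplicateEntry(Exception):
    """Hypothetical stand-in for oslo.db's DBDuplicateEntry."""


class HostMappingTable:
    """In-memory stand-in for host_mappings with a unique 'host' column."""

    def __init__(self):
        self._rows = {}

    def insert(self, host, cell_id):
        # Mimic the unique constraint on the host column.
        if host in self._rows:
            raise DuplicateEntry(host)
        self._rows[host] = cell_id

    def hosts(self):
        return sorted(self._rows)


def discover_hosts(table, hosts, cell_id):
    """Map hosts to a cell; warn and stop on the first duplicate.

    Returns 0 on success and 2 when a duplicate mapping is hit, instead
    of swallowing the error and moving on to the next host.
    """
    for host in hosts:
        try:
            table.insert(host, cell_id)
        except DuplicateEntry:
            LOG.warning(
                "Duplicate host mapping for %s; is discover_hosts "
                "being run in parallel?", host)
            return 2
    return 0
```

In this sketch a second, colliding run stops at the first duplicate rather than retrying every remaining host, which is the tight-loop behaviour the discussion wanted to avoid.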

melanie witt (melwitt)
tags: added: nova-manage
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/651947

Changed in nova:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.0.0

This issue was fixed in the openstack/tripleo-heat-templates 11.0.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/651947
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5c544c7e2a7e266d69a9d0f0bf3ee8a0c636202b
Submitter: Zuul
Branch: master

commit 5c544c7e2a7e266d69a9d0f0bf3ee8a0c636202b
Author: melanie witt <email address hidden>
Date: Fri Apr 12 00:32:01 2019 +0000

    Warn for duplicate host mappings during discover_hosts

    When the 'nova-manage cellv2 discover_hosts' command is run in parallel
    during a deployment, it results in simultaneous attempts to map the
    same compute or service hosts at the same time, resulting in
    tracebacks:

      "DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u\"Duplicate
      entry 'compute-0.localdomain' for key 'uniq_host_mappings0host'\")
      [SQL: u'INSERT INTO host_mappings (created_at, updated_at, cell_id,
      host) VALUES (%(created_at)s, %(updated_at)s, %(cell_id)s,
      %(host)s)'] [parameters: {'host': u'compute-0.localdomain',
      'cell_id': 5, 'created_at': datetime.datetime(2019, 4, 10, 15, 20,
      50, 527925), 'updated_at': None}]

    This adds more information to the command help and adds a warning
    message when duplicate host mappings are detected with guidance about
    how to run the command. The command will return 2 if a duplicate host
    mapping is encountered and the documentation is updated to explain
    this.

    This also adds a warning to the scheduler periodic task to recommend
    enabling the periodic on only one scheduler to prevent collisions.

    We choose to warn and stop instead of ignoring DBDuplicateEntry because
    there could potentially be a large number of parallel tasks competing
    to insert duplicate records where only one can succeed. If we ignore
    and continue to the next record, the large number of tasks will
    repeatedly collide in a tight loop until all get through the entire
    list of compute hosts that are being mapped. So we instead stop the
    colliding task and emit a message.

    Closes-Bug: #1824445

    Change-Id: Ia7718ce099294e94309103feb9cc2397ff8f5188
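
As a rough sketch of how a deployment tool might react to the new return code described in the commit message above: the wrapper below is hypothetical and not part of nova. The only facts taken from the commit are that the command returns 2 on a duplicate host mapping; the function name, the returned labels and the injectable `run` parameter are assumptions for illustration, and the default command uses the `cell_v2` spelling of the nova-manage category.

```python
import subprocess

# Per the commit message: 2 means a duplicate host mapping was encountered.
DUPLICATE_HOST_MAPPING = 2


def run_discover_hosts(
        command=("nova-manage", "cell_v2", "discover_hosts"),
        run=subprocess.run):
    """Run discover_hosts once and classify the outcome.

    `run` is injectable so the logic can be exercised without nova
    installed; by default it is subprocess.run.
    """
    result = run(list(command))
    if result.returncode == 0:
        return "ok"
    if result.returncode == DUPLICATE_HOST_MAPPING:
        # Another run is mapping the same hosts. Do not retry in a loop;
        # the deployment should invoke the command from a single place.
        return "duplicate"
    return "error"
```

A caller that sees "duplicate" would typically log it and fix the deployment so that only one node runs the command, rather than retrying.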

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 9.4.0

This issue was fixed in the openstack/tripleo-heat-templates 9.4.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 8.4.0

This issue was fixed in the openstack/tripleo-heat-templates 8.4.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 10.6.0

This issue was fixed in the openstack/tripleo-heat-templates 10.6.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.
