nova-manage db sync fails due to nova-api/nova-conductor being run before the migration scripts have been applied

Bug #1335804 reported by Sergey Murashov
This bug affects 2 people
Affects              Status         Importance  Assigned to      Milestone
Fuel for OpenStack   Fix Committed  High        Bogdan Dobrelya  —
  5.0.x              Fix Committed  High        Bogdan Dobrelya  —

Bug Description

Note: ISO #77 (Fuel 5.0.1), VirtualBox

Steps To Reproduce:
1. Create new environment with the following configuration:
2 nodes (Controller+Mongo, Compute+CEPH-OSD), CentOS with Neutron GRE, install MURANO, SAVANNA and Ceilometer.
2. Start deployment.

Actual result:
Deployment fails with error (please see attached snapshot, localhost/var/log/remote/node-5.domain.tld/puppet-apply.log):
nova-manage db sync returned 1 instead of one of [0]

and

-----------
2014-06-30T09:45:28.358521+00:00 notice: (/Stage[main]/Nova::Api/Exec[nova-db-sync]/returns) IntegrityError: (IntegrityError) (1062, "Duplicate entry 'node-5-conductor-0' for key 'uniq_services0host0topic0deleted'") 'ALTER TABLE services ADD CONSTRAINT uniq_services0host0topic0deleted UNIQUE (host, topic, deleted)' ()
2014-06-30T09:45:28.359332+00:00 notice: (/Stage[main]/Nova::Api/Exec[nova-db-sync]/returns) nova-nova CRITICAL: IntegrityError: (IntegrityError) (1062, "Duplicate entry 'node-5-conductor-0' for key 'uniq_services0host0topic0deleted'") 'ALTER TABLE services ADD CONSTRAINT uniq_services0host0topic0deleted UNIQUE (host, topic, deleted)' ()
-----------

and

___________
2014-06-30T09:47:39.229728+00:00 notice: (/Stage[main]/Nova::Api/Exec[nova-db-sync]/returns) OperationalError: (OperationalError) (1050, "Table 'instances' already exists") "\nCREATE TABLE instances (\n\tcreated_at DATETIME, \n\tupdated_at DATETIME, \n\tdeleted_at DATETIME, \n\tid INTEGER NOT NULL AUTO_INCREMENT, \n\tinternal_id INTEGER, \n\tuser_id VARCHAR(255), \n\tproject_id VARCHAR(255), \n\timage_ref VARCHAR(255), \n\tkernel_id VARCHAR(255), \n\tramdisk_id VARCHAR(255), \n\tlaunch_index INTEGER, \n\tkey_name VARCHAR(255), \n\tkey_data MEDIUMTEXT, \n\tpower_state INTEGER, \n\tvm_state VARCHAR(255), \n\tmemory_mb INTEGER, \n\tvcpus INTEGER, \n\thostname VARCHAR(255), \n\thost VARCHAR(255), \n\tuser_data MEDIUMTEXT, \n\treservation_id VARCHAR(255), \n\tscheduled_at DATETIME, \n\tlaunched_at DATETIME, \n\tterminated_at DATETIME, \n\tdisplay_name VARCHAR(255), \n\tdisplay_description VARCHAR(255), \n\tavailability_zone VARCHAR(255), \n\tlocked BOOL, \n\tos_type VARCHAR(255), \n\tlaunched_on MEDIUMTEXT, \n\tinstance_type_id INTEGER, \n\tvm_mode VARCHAR(255), \n\tuuid VARCHAR(36), \n\tarchitecture VARCHAR(255), \n\troot_device_name VARCHAR(255), \n\taccess_ip_v4 VARCHAR(39), \n\taccess_ip_v6 VARCHAR(39), \n\tconfig_drive VARCHAR(255), \n\ttask_state VARCHAR(255), \n\tdefault_ephemeral_device VARCHAR(255), \n\tdefault_swap_device VARCHAR(255), \n\tprogress INTEGER, \n\tauto_disk_config BOOL, \n\tshutdown_terminate BOOL, \n\tdisable_terminate BOOL, \n\troot_gb INTEGER, \n\tephemeral_gb INTEGER, \n\tcell_name VARCHAR(255), \n\tnode VARCHAR(255), \n\tdeleted INTEGER, \n\tlocked_by ENUM('owner','admin'), \n\tcleaned INTEGER, \n\tPRIMARY KEY (id), \n\tCHECK (locked IN (0, 1)), \n\tCHECK (auto_disk_config IN (0, 1)), \n\tCHECK (shutdown_terminate IN (0, 1)), \n\tCHECK (disable_terminate IN (0, 1))\n)ENGINE=InnoDB CHARSET=utf8\n\n" ()
2014-06-30T09:47:39.248732+00:00 err: /usr/bin/nova-manage db sync returned 1 instead of one of [0]
2014-06-30T09:47:39.248732+00:00 err: /usr/lib/ruby/vendor_ruby/puppet/util/errors.rb:97:in `fail'
2014-06-30T09:47:39.248732+00:00 err: /usr/lib/ruby/vendor_ruby/puppet/type/exec.rb:120:in `sync'

___________

Tags: nova db
Revision history for this message
Sergey Murashov (smurashov) wrote :
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
milestone: none → 5.0.1
description: updated
description: updated
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
description: updated
tags: added: ceph nova
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Seems like an intermittent database consistency problem. May be related to the performance of the VirtualBox environment. We need a reproducer on a different environment.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Happened to me too. In my case, nova-cloud-controller is in an LXC container on a beefy machine.
2014-07-04 19:01:48 INFO juju-log shared-db:45: Wrote template /etc/haproxy/haproxy.cfg.
2014-07-04 19:01:48 INFO juju-log shared-db:45: Migrating the nova database.
2014-07-04 19:01:49 INFO shared-db-relation-changed 2014-07-04 19:01:49.908 11320 INFO migrate.versioning.api [-] 215 -> 216...
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 CRITICAL nova [-] IntegrityError: (IntegrityError) (1062, "Duplicate entry 'juju-machine-0-lxc-0-conductor-0' for key 'uniq_services0host0topic0deleted'") 'ALTER TABLE services ADD CONSTRAINT uniq_services0host0topic0deleted UNIQUE (host, topic, deleted)' ()
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 TRACE nova Traceback (most recent call last):
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 TRACE nova File "/usr/bin/nova-manage", line 10, in <module>
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 TRACE nova sys.exit(main())
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 TRACE nova File "/usr/lib/python2.7/dist-packages/nova/cmd/manage.py", line 1374, in main
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 TRACE nova ret = fn(*fn_args, **fn_kwargs)
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 TRACE nova File "/usr/lib/python2.7/dist-packages/nova/cmd/manage.py", line 883, in sync
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 TRACE nova return migration.db_sync(version)
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 TRACE nova File "/usr/lib/python2.7/dist-packages/nova/db/migration.py", line 29, in db_sync
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 TRACE nova return IMPL.db_sync(version=version)
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 TRACE nova File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/migration.py", line 44, in db_sync
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 TRACE nova return versioning_api.upgrade(get_engine(), repository, version)
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 TRACE nova File "/usr/lib/python2.7/dist-packages/migrate/versioning/api.py", line 186, in upgrade
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 TRACE nova return _migrate(url, repository, version, upgrade=True, err=err, **opts)
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 TRACE nova File "<string>", line 2, in _migrate
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 TRACE nova File "/usr/lib/python2.7/dist-packages/migrate/versioning/util/__init__.py", line 159, in with_engine
2014-07-04 19:02:07 INFO shared-db-relation-changed 2014-07-04 19:02:07.057 11320 TRACE nova return f(*a, **kw)
2014-0...

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I'm sorry, disregard my comment since I'm not using Fuel. I was just searching Launchpad for a backtrace similar to mine and found this bug.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Not reproduced in my environment.

Changed in fuel:
status: Incomplete → Invalid
milestone: 5.0.1 → none
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

api: '1.0'
astute_sha: 5df009e8eab611750309a4c5b5c9b0f7b9d85806
build_id: 2014-07-14_00-31-14
build_number: '122'
fuellib_sha: 2d1e1369c13bc9771e9473086cb064d257a21fc2
fuelmain_sha: 8a8faa68f7939d095dde60665f16b86d068f696d
mirantis: 'yes'
nailgun_sha: 4637ae61920b92f56154a03f7e717e3bd597525c
ostf_sha: 09b6bccf7d476771ac859bb3c76c9ebec9da9e1f
production: docker
release: 5.0.1

Reproduced on ISO #122 (KVM); my steps:

1. Created a new environment: Ubuntu + HA + Neutron GRE + Ceph (Glance, Cinder, ephemeral, RadosGW) + Ceilometer
2. Added 8 nodes: 1 (Controller+Mongo), 2 (Controller+Ceph), 5 (Compute+Ceph)
3. Started deployment. It failed; see the puppet logs on node-1 (Controller+Mongo):

http://paste.openstack.org/show/86370/

That environment is still alive and I can provide access to it if necessary.

Changed in fuel:
status: Invalid → New
milestone: none → 5.0.1
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This is a database migration issue; migrations should work seamlessly. I guess this may be related to the Nova code. Reassigning to the MOS Nova team.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Nova (mos-nova)
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

From the logs it's clear that nova-api/nova-conductor are up and running before *nova-manage db sync* has completed. It means that nova-api is using a partially applied DB schema, which obviously can't work well.

The initial Havana migration script (216_havana.py) is written so that it creates all the tables first and then adds constraints to them. In this particular case *all the tables* have been created and nova-api is running and processing queries, but the 'services' table is still missing one important unique constraint: without it, a race condition is possible that leads to the insertion of two equivalent entries. Later, 216_havana tries to create this unique constraint and fails.

While this particular problem *can* be fixed in Nova, that would require patching the upstream DB schema migrations, and I'm not sure that's the way to go. In general, you can't expect nova-api to work reliably if it's run before nova-manage db sync has completed.
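
For illustration, here is a condensed sketch of the two-pass pattern Roman describes, in sqlalchemy-migrate style (a sketch only, not the actual 216_havana.py, which defines many more tables and constraints):

-----------
from migrate import UniqueConstraint
from sqlalchemy import Column, DateTime, Integer, MetaData, String, Table


def upgrade(migrate_engine):
    meta = MetaData()
    meta.bind = migrate_engine

    # Pass 1: create the table. At this point there is no uniqueness
    # guarantee on (host, topic, deleted).
    services = Table('services', meta,
        Column('created_at', DateTime),
        Column('deleted', Integer),
        Column('id', Integer, primary_key=True, nullable=False),
        Column('host', String(255)),
        Column('topic', String(255)),
        mysql_engine='InnoDB',
        mysql_charset='utf8',
    )
    services.create()

    # ... every other table is created here; if nova-conductor is already
    # running, it inserts its own rows into 'services' in the meantime ...

    # Pass 2: add the constraint. This is the ALTER TABLE that fails with
    # MySQL error 1062 once two equivalent rows have been inserted.
    UniqueConstraint('host', 'topic', 'deleted', table=services,
                     name='uniq_services0host0topic0deleted').create()
-----------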

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Having said that, I'd go for changing the order in which *nova-manage db sync* is performed and *nova services* are run.

Changed in fuel:
status: New → Invalid
status: Invalid → Triaged
Changed in mos:
status: New → Invalid
assignee: nobody → Roman Podoliaka (rpodolyaka)
importance: Undecided → High
milestone: none → 5.0.1
tags: added: db
removed: ceph
Changed in mos:
assignee: Roman Podoliaka (rpodolyaka) → MOS Nova (mos-nova)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/106785

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vladimir Kuklin (vkuklin)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/106786

Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/5.1.x
Revision history for this message
Vladimir Kuklin (vkuklin) wrote : Re: OS deployment fail: nova-manage db sync returned 1 instead of one of [0]

The workaround is to run nova-manage db sync on the controller manually and restart all the nova services on all the affected nodes.
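
Expressed as a small script (a sketch only; the exact service list depends on the node's roles, and in practice these are just shell commands run on the controller and the affected nodes):

-----------
import subprocess

# Re-run the migration to completion; this blocks until every migration
# script has been applied or one of them fails.
subprocess.check_call(['nova-manage', 'db', 'sync'])

# Then restart the nova services so they pick up the fully migrated
# schema (the service list here is illustrative).
for svc in ('nova-api', 'nova-conductor', 'nova-scheduler'):
    subprocess.check_call(['service', svc, 'restart'])
-----------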

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Guys, we are not expecting the migration command to exit before it finishes the migration. This is clearly a race condition. We need to implement a fix in the migration code; it would be much harder to fix in the Puppet manifests.

Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → MOS Nova (mos-nova)
no longer affects: fuel
no longer affects: fuel/5.0.x
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

@Vladimir, nova-manage db sync *can't* exit before it's applied all the migration scripts or failed. Are you sure you don't do something like nova-manage db sync & in your manifests?

Revision history for this message
Kevin Benton (kevinbenton) wrote :

We just encountered this same bug in a multi-controller deployment (3 controllers) and the fix proposed by Vladimir seems to work. Is there a reason it was abandoned?

Revision history for this message
Kevin Benton (kevinbenton) wrote :

I see that the proposed fix should actually not do anything, since the ordering is already implied by the notification. However, I don't think this bug should be marked as invalid. Nova services shouldn't be started during a migration.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

I agree with Kevin that the bug is not invalid. It is questionable how we should fix it (in the Puppet modules or in the *-db-manage scripts) and when (in 5.1/5.0.2, or later if it is not highly reproducible), but we need to fix it eventually.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Dmitry, there is no way we can fix that in the *-db-manage scripts. IMO this is purely a deployment issue: from the code side, we just can't guarantee that migration scripts will be applied correctly if you *are* changing the database during the process.

summary: - OS deployment fail: nova-manage db sync returned 1 instead of one of [0]
+ nova-manage db sync fails due to nova-api/nova-conductor being running
+ before the migration scripts have been applied
Changed in fuel:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 5.1
Revision history for this message
Vladimir Kuklin (vkuklin) wrote : Re: nova-manage db sync fails due to nova-api/nova-conductor being running before the migration scripts have been applied

Kevin, Roman

We have not reproduced this bug for a while. The manifests ensure that nova services are started after all the migrations have finished.

Kevin, would you please attach a diagnostic snapshot so we can confirm that this is the bug you reproduced?

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I think the post-install script of the RPM/DEB package starts the service *right after* the package has been installed. While this might be OK for other services CentOS/Ubuntu ship (Apache/MySQL/etc.), I'm not sure it works for us: before nova-api is started, we need to make sure that:

1. nova-api.conf with correct settings is there

2. nova-manage db sync has been executed

If those requirements aren't satisfied, we may end up with nova-api either configured incorrectly (e.g. using the default sqlite database) or working in parallel with db sync, in which case we can't guarantee that the migration scripts will be applied correctly.
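
A minimal sketch of the required ordering (a hypothetical helper for illustration only; the real deployment logic lives in Puppet manifests, not Python):

-----------
import subprocess


def render_nova_conf():
    # Hypothetical step: write /etc/nova/nova.conf with the correct
    # database connection string before anything touches the DB.
    pass


def deploy_nova_controller():
    # 1. Correct config first; otherwise nova-manage would sync the
    #    default sqlite database instead of MySQL.
    render_nova_conf()

    # 2. Apply all schema migrations; check_call blocks until every
    #    migration script has run, and raises if any of them fails.
    subprocess.check_call(['nova-manage', 'db', 'sync'])

    # 3. Only now is it safe to start the services that use the schema.
    for svc in ('nova-api', 'nova-conductor', 'nova-scheduler'):
        subprocess.check_call(['service', svc, 'start'])
-----------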

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I agree with Roman. The deployment must ensure that none of the OpenStack service instances in the cluster are running while their DB is being synced. I suggest:
1) For new deployments, run db sync on the very first (primary) controller only, and never on the other nodes - by then the OpenStack services are already deployed and running on the previously deployed node(s).
2) For patching, stop all instances of the OpenStack service whose DB is about to be synced, and start them again once the sync is done.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

I can propose that we disable service start-up on installation.
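
The mechanism behind that proposal, sketched for illustration (an assumption about the implementation; the actual change landed in the fuel-library Puppet manifests): on Ubuntu, writing the keyword 'manual' into an upstart .override file prevents the job from being auto-started when its package is installed.

-----------
# Illustrative only; the real fix is applied via Puppet, not Python.
for svc in ('nova-api', 'nova-conductor', 'nova-scheduler'):
    with open('/etc/init/%s.override' % svc, 'w') as f:
        f.write('manual\n')
-----------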

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

We did that in the related bug https://bugs.launchpad.net/fuel/+bug/1348185,
but it looks like the fix should be adjusted a bit.

summary: - nova-manage db sync fails due to nova-api/nova-conductor being running
+ nova-manage db sync fails due to nova-api/nova-conductor being run
before the migration scripts have been applied
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

According to https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/openstack/manifests/controller.pp#L535-609

we already have automated service startup disabled, so this bug should not be reproducible.

Commit 099dceefd267ad6f20b67a17ec95df01c223e673 should contain the fix. Kevin, please verify that your code contains this fix. If it does, then reopen the bug.

no longer affects: mos
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Well, actually it doesn't require any additional fixes and should be completely fixed by https://bugs.launchpad.net/fuel/+bug/1348185

no longer affects: mos/5.0.x
no longer affects: mos/5.1.x
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/5.1.x
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/5.0)

Reviewed: https://review.openstack.org/117194
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=4073deae506edb5803a7e35edc51acfcc4acde30
Submitter: Jenkins
Branch: stable/5.0

commit 4073deae506edb5803a7e35edc51acfcc4acde30
Author: Bogdan Dobrelya <email address hidden>
Date: Thu Jul 24 18:24:26 2014 +0300

    Disable Openstack services startup on install

    Upstart starts services as soon as they are installed.
    We want them started only once they are configured.
    If Ceph is used for volumes, leave the cinder-volume override
    alone (rbd.pp will configure it).

    Closes-bug: #1348185
    Closes-bug: #1335804

    Change-Id: I50ff5b0f5bc3d80bfef917116a620bd9386e8e56
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/118267

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/118267
Reason: no longer relevant

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/5.0)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/5.0
Review: https://review.openstack.org/106786
Reason: abandoned at master
