Rolling upgrade M to N: DBDeadlock error when creating an instance during database sync

Bug #1640164 reported by Anh Tran
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

I have 3 controller nodes running HA active/active, using the KVM hypervisor and a MariaDB cluster as the shared database. The system was deployed with DevStack (Mitaka version) on virtual machines created with virt-manager.

I upgraded Keystone to the N version, then tried a rolling upgrade of Nova from M to N, following:
http://docs.openstack.org/developer/nova/upgrade.html#rolling-upgrade-process

The document says:
Using the newly installed nova code, run the DB sync. (nova-manage db sync; nova-manage api_db sync). These schema change operations should have minimal or no effect on performance, and should not cause any operations to fail.

However, while the database sync is running, I cannot create a VM. Nova raises:

ERROR (ClientException): Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_db.exception.DBDeadlock'> (HTTP 500) (Request-ID: req-b5c82715-6306-4f6d-972b-3387015da12c)

Full log here: http://paste.openstack.org/show/588365/

After the sync process finishes, I can create VMs again.

==Reproduce==

# Controller1:
1. Stop all nova services except nova-api (n-api).

2. Upgrade source code:
$ cd /opt/stack/nova/
$ git checkout remotes/origin/stable/newton
$ git checkout -b stable/newton remotes/origin/stable/newton
$ git pull
$ sudo -E pip install -r requirements.txt --upgrade

3. Downgrade some package dependencies (because I used --upgrade above)
$ sudo pip uninstall oslo.messaging
$ sudo pip uninstall kombu
$ sudo pip uninstall cffi
$ sudo -E pip install oslo.messaging==5.10.0
$ sudo -E pip install kombu==3.0.35
$ sudo -E pip install cffi==1.5.2

4. Update /etc/nova/nova.conf:
[upgrade_levels]
compute = auto

5. Sync DB
$ nova-manage db sync
$ nova-manage api_db sync

6. While the DB sync is running, try to create a VM from controllers 2 and 3 (not concurrently):
$ nova boot --flavor m1.nano --image 21ffa33b-e9eb-43f4-aa73-ceb8f2cbc6fc --nic net-name=net1 VM_test
ERROR (ClientException): Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_db.exception.DBDeadlock'> (HTTP 500) (Request-ID: req-b5c82715-6306-4f6d-972b-3387015da12c)

tags: added: upgrades
Sujitha (sujitha-neti)
Changed in nova:
assignee: nobody → Sujitha (sujitha-neti)
Changed in nova:
status: New → In Progress
Anh Tran (trananhkma) wrote :

Hi Sujitha, are you still working on this bug? Could you please share some information about it with me? :D I hope this bug can be fixed soon :)

Sujitha (sujitha-neti) wrote :

Hi Anh Tran, I think this is expected because you are trying to write to the DB while the sync is happening, so we get a DBDeadlock error. I'm trying to find out whether we should change the document so that it doesn't mislead users. Another way would be to retry on deadlock. I'm not sure that would be needed here, since the sync operation takes very little time.

Anh Tran (trananhkma) wrote :

I agree with you about the reason why it happened, but I'm not sure about the sync operation. It may take a lot of time if the database is large, a few terabytes for example. What do you think about this?

John Garbutt (johngarbutt) wrote :

This certainly shouldn't be expected behaviour.

Anh, what version of MariaDB are you using here please? I believe the difference in what schema migrations are non-impacting is quite dramatic between the different versions (at least that was true with MySQL 5.5->5.6).

Sujitha, are you able to reproduce this?

From the logs, it looks like this line is failing, build_request.create():
https://github.com/openstack/nova/blob/stable/mitaka/nova/compute/api.py#L951

As a nasty workaround, we could add a deadlock retry decorator:
@oslo_db_api.wrap_db_retry(max_retries=5, retry_on_deadlock=True)
On the DB method here:
https://github.com/openstack/nova/blob/master/nova/objects/build_request.py#L162
But of course, we shouldn't be hitting a deadlock in the first place!
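
To illustrate what a retry-on-deadlock wrapper does, here is a minimal standard-library sketch. It is not the oslo.db implementation: the DBDeadlock class below is a local stand-in for oslo_db.exception.DBDeadlock, and retry_on_deadlock and create_build_request are hypothetical names used only for this example. The real decorator named above may differ in its defaults and backoff behaviour.

```python
import functools
import time


class DBDeadlock(Exception):
    """Local stand-in for oslo_db.exception.DBDeadlock (assumption)."""


def retry_on_deadlock(max_retries=5, retry_interval=0.1):
    """Retry the wrapped call when it raises DBDeadlock.

    Hypothetical helper: re-runs the function up to max_retries extra
    times, sleeping a little longer after each failed attempt.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            attempt = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except DBDeadlock:
                    attempt += 1
                    if attempt > max_retries:
                        # Out of retries: let the caller see the deadlock.
                        raise
                    time.sleep(retry_interval * attempt)
        return wrapper
    return decorator


# Simulated DB write that deadlocks twice, then succeeds.
calls = {"count": 0}


@retry_on_deadlock(max_retries=5, retry_interval=0.01)
def create_build_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise DBDeadlock("simulated deadlock during schema sync")
    return "created"
```

A wrapper like this only hides short-lived deadlocks. If a migration holds its lock for longer than the total retry window, the request still fails.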

There are a few changes to the build_request, like adding an index and changing nullable columns that may be causing issues in this cluster setup if they happen while we are adding a build request into the DB. It would be interesting to know which one was taking a long time:
https://github.com/openstack/nova/blob/stable/newton/nova/db/sqlalchemy/api_migrations/migrate_repo/versions/

* 013_build_request_extended_attrs.py (index added)
* 015_build_request_nullable_columns.py (drops a unique constraint)
* 020_block_device_mappings_mediumtext.py (does alter table to make things medium text)
* 021_build_requests_instance_mediumtext.py (as above)

I would reach out to Andrew Gardener on our team to help with this DB issue.

John Garbutt (johngarbutt) wrote :

Hmm, I think there was also a foreign key drop in 015 that I didn't mention above.

Anh, are you using Rolling schema updates as described here?
http://galeracluster.com/documentation-webpages/schemaupgrades.html

I don't know if that would fix your problem here, but it looks like it could help, if a bit fiddly. We do the odd table create, which may cause problems, by the looks of things.

It seems MariaDB doesn't support as much online DDL as MySQL 5.6 as described here:
https://dev.mysql.com/doc/refman/5.6/en/innodb-create-index-overview.html
Although even in that case we have some type changes that look like they would block a lot of online operations :(

I think we need quite a deep look again at this problem.

Anh Tran (trananhkma) wrote :

Hi John, thank you for your help :)

> what version of MariaDB are you using here
stack@mariadb1:~$ mysql --version
mysql Ver 15.1 Distrib 5.5.52-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2

> are you using Rolling schema updates as described here?
No, I'm not. At that time, I just wanted to test the rolling upgrade feature for OpenStack. My system is now being set up with a different topology for another task, so I will try it later if I have time :)

Sujitha (sujitha-neti)
Changed in nova:
assignee: Sujitha (sujitha-neti) → nobody
Changed in nova:
status: In Progress → Won't Fix
status: Won't Fix → New
Sean Dague (sdague) wrote :

Is this just a docs issue, i.e. that you need the MySQL 5.6 engine, and MariaDB doesn't seem to provide that?

Changed in nova:
status: New → Incomplete
Sean Dague (sdague) wrote :

Automatically discovered version mitaka in description. If this is incorrect, please update the description to include 'nova version: ...'

tags: added: openstack-version.mitaka
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired