Retry of neutron-db-sync doesn't work if execution fails during table creation

Bug #1769860 reported by Alexander Rubtsov
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Won't Fix
Importance: High
Assigned to: Alexander Rubtsov
Milestone: 9.x-updates

Bug Description

Release: MOS 9.2

Relevant excerpts from the Puppet log file:
/Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]/returns (debug): Exec try 1/10
Exec[neutron-db-sync](provider=posix) (debug): Executing 'neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini upgrade head'
Puppet (debug): Executing 'neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini upgrade head'
/Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]/returns (debug): Sleeping for 5.0 seconds between tries
.....
/Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]/returns (debug): Exec try 10/10
Exec[neutron-db-sync](provider=posix) (debug): Executing 'neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini upgrade head'
Puppet (debug): Executing 'neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini upgrade head'
/Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]/returns (debug): Sleeping for 5.0 seconds between tries
/Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]/returns (debug): Sleeping for 5.0 seconds between tries
.....
/Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]/returns (notice): sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (1050, "Table 'agents' already exists") [SQL: u"\nCREATE TABLE agents (\n\tid VARCHAR(36) NOT NULL, \n\tagent_type VARCHAR(255) NOT NULL,.....]
/Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync] (err): Failed to call refresh: neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini upgrade head returned 1 instead of one of [0]
...
http://paste.openstack.org/show/x8rHSl9ErSvD9tivzj4h/

Fuel does not mark the entire deployment as failed after this error, which is wrong, because Neutron is actually unable to operate.
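
The retry behaviour visible in the log (10 tries, 5 seconds apart) and the outcome the reporter expects (a hard failure once every try has failed) can be sketched in Python. This is only an illustration, not Puppet's or Fuel's implementation: run_with_retries and DbSyncFailed are hypothetical names, and the failing command is a stand-in for neutron-db-manage.

```python
import subprocess
import sys
import time


class DbSyncFailed(RuntimeError):
    """Raised when every attempt of the command exits non-zero."""


def run_with_retries(cmd, tries=10, try_sleep=5.0):
    """Run cmd up to `tries` times, sleeping `try_sleep` seconds between tries.

    Returns the number of the attempt that succeeded.
    Raises DbSyncFailed if no attempt returns exit code 0.
    """
    last_rc = None
    for attempt in range(1, tries + 1):
        last_rc = subprocess.run(cmd).returncode
        if last_rc == 0:
            return attempt
        if attempt < tries:
            time.sleep(try_sleep)
    raise DbSyncFailed(f"{cmd!r} returned {last_rc} after {tries} tries")


if __name__ == "__main__":
    # Stand-in for a db sync command that fails on every try.
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    try:
        run_with_retries(failing, tries=3, try_sleep=0)
    except DbSyncFailed as exc:
        # A deployment tool would be expected to stop here.
        print(f"deployment step failed: {exc}")
```

The point of the sketch is that the exception propagates to the caller, so a step that fails on all tries cannot be silently skipped.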

The issue is rarely reproducible.
It seems to occur only when the creation or population of the MySQL tables is interrupted partway through.
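
Why a retry cannot recover from an interrupted run can be shown with a small Python sketch. It is an assumption-laden illustration, not the neutron-db-manage code: SQLite stands in for MySQL, the explicit commit after each statement mimics MySQL's implicit commit on DDL, the "ports" table and the names SCHEMA, Interrupted, migrate and demo are hypothetical, and only the two "agents" columns are taken from the error message above.

```python
import sqlite3

# Ordered list of DDL statements, like a migration that creates tables.
SCHEMA = [
    "CREATE TABLE agents (id VARCHAR(36) NOT NULL, "
    "agent_type VARCHAR(255) NOT NULL)",
    "CREATE TABLE ports (id VARCHAR(36) NOT NULL)",  # hypothetical table
]


class Interrupted(Exception):
    """Simulates the process being killed in the middle of a run."""


def migrate(conn, interrupt_after=None):
    """Apply SCHEMA in order; optionally stop after N statements."""
    for i, stmt in enumerate(SCHEMA, start=1):
        conn.execute(stmt)
        conn.commit()  # the table now exists permanently
        if interrupt_after == i:
            raise Interrupted(f"stopped after statement {i}")


def demo():
    """Interrupt the first run, retry, and return the retry's error text."""
    conn = sqlite3.connect(":memory:")
    try:
        migrate(conn, interrupt_after=1)  # first run dies after 'agents'
    except Interrupted:
        pass
    try:
        migrate(conn)  # retry starts from the first statement again
    except sqlite3.OperationalError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(demo())
```

Every retry starts from the first statement and hits the table left behind by the interrupted run, so repeating the command any number of times gives the same error.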

Alexander Rubtsov (arubtsov) wrote :

sla2 for 9.0-updates

Changed in fuel:
importance: Undecided → Medium
assignee: nobody → MOS Maintenance (mos-maintenance)
milestone: none → 9.x-updates
tags: added: customer-found sla2
Changed in fuel:
status: New → Confirmed
assignee: MOS Maintenance (mos-maintenance) → Oleksiy Molchanov (omolchanov)
assignee: Oleksiy Molchanov (omolchanov) → Alexander Rubtsov (arubtsov)
Oleksiy Molchanov (omolchanov) wrote :

Alexander,

1) Is this an initial Fuel deployment?
2) Can we have a full diagnostic snapshot?

Alexander Rubtsov (arubtsov) wrote :

Oleksiy,

1) I will ask the customer about that.
2) Unfortunately, the log files from the problematic deployment are no longer available.

Alexander Rubtsov (arubtsov) wrote :

The customer was unable to reproduce the issue, so no diagnostic snapshot could be collected.

Changed in fuel:
status: Confirmed → Incomplete
Alexander Rubtsov (arubtsov) wrote :

The issue has appeared again, and this time the customer was able to collect the diagnostic snapshot. Please contact me and I will provide the log files directly.

Changed in fuel:
status: Incomplete → New
assignee: Alexander Rubtsov (arubtsov) → Oleksiy Molchanov (omolchanov)
Alexander Rubtsov (arubtsov) wrote :

sla1 for 9.0-updates

Changed in fuel:
importance: Medium → High
tags: added: sla1
removed: sla2
Changed in fuel:
milestone: 9.x-updates → 9.2-mu-7
Alexander Rubtsov (arubtsov) wrote :

It seems the issue is wider than just neutron-db-sync. The same customer has hit a similar incident with the Cinder database.

Denis Meltsaykin (dmeltsaykin) wrote :

This looks like a rarely occurring issue that is not easy to fix; see also https://bugs.launchpad.net/fuel/+bug/1641136.

Denis Meltsaykin (dmeltsaykin) wrote :

OK, so it seems the root cause is some kind of high CPU load during deployment that broke MySQL. This makes the db syncs fail. Most of them are executed as "refreshonly => true" events, and a well-known bug in Puppet lets the run continue even when refreshonly events fail. That shouldn't be a big problem, though, because the deployment will still fail while MySQL is not working. Regarding possible fixes: db syncs are known to be non-idempotent in older versions, so in Mitaka it is simply dangerous to run *-db-sync over an already prepared database, and these events *must be* "refreshonly". On the other hand, we cannot change the Puppet code itself or update Puppet to 5.x, since that would take an enormous amount of testing, which is not available for stable products.

Given all of the above, and the fact that the issue normally occurs very rarely (no occurrences in the last year, across ~100 SWARM runs with ~150 full cluster deployments each) and is mostly connected with overloaded hardware or VM resources, I'm marking it as Won't Fix. Please add more resources to your deployments and troubleshoot the high load on your environments.

Changed in fuel:
status: New → Won't Fix
assignee: Oleksiy Molchanov (omolchanov) → Alexander Rubtsov (arubtsov)
Changed in fuel:
milestone: 9.2-mu-7 → 9.x-updates