Fuel tries to continue deployment when nova-manage db sync failed

Bug #1641136 reported by Alexei Sheplyakov
26
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Confirmed
Medium
Fuel Sustaining
Nominated for Ocata by Oleksiy Molchanov
Mitaka
Confirmed
Medium
Fuel Sustaining
Newton
Confirmed
Medium
Fuel Sustaining

Bug Description

nova-{api,scheduler,consoleauth,cert} services fail right after the start.
The error message is (from /var/log/nova-api.log):

2016-11-11 14:38:57.074 21442 ERROR oslo_service.service OperationalError: (_mysql_exceptions.OperationalError) (1054, "Unknown column 'services.last_seen_up' in 'field list'") [SQL: u'SELECT services.created_at AS services_created_at, services.updated_at AS services_updated_at, services.deleted_at AS services_deleted_at, services.deleted AS services_deleted, services.id AS services_id, services.host AS services_host, services.`binary` AS services_binary, services.topic AS services_topic, services.report_count AS services_report_count, services.disabled AS services_disabled, services.disabled_reason AS services_disabled_reason, services.last_seen_up AS services_last_seen_up, services.forced_down AS services_forced_down, services.version AS services_version \nFROM services \nWHERE services.deleted = %s AND services.host = %s AND services.`binary` = %s \n LIMIT %s'] [parameters: (0, '192.168.0.8', 'nova-metadata', 1)]
2016-11-11 14:38:57.074 21442 ERROR oslo_service.service
2016-11-11 14:38:57.453 4895 INFO oslo_service.service [-] Child 21412 exited with status 1

root@node-1:~# apt-cache policy nova-api
nova-api:
  Installed: 2:13.1.1-7~u14.04+mos20
  Candidate: 2:13.1.1-7~u14.04+mos20
  Version table:
 *** 2:13.1.1-7~u14.04+mos20 0
       1050 http://10.20.0.1/mirror/mirror.fuel-infra.org/mos-repos/ubuntu/9.0/ mos9.0-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     2:13.0.0-6~u14.04+mos43 0
       1050 http://10.20.0.2:8080/mitaka-9.0/ubuntu/x86_64/ mos9.0/main amd64 Packages
     1:2014.1.5-0ubuntu1.5 0
        500 http://10.20.0.1/mirror/archive.ubuntu.com/ubuntu/ trusty-updates/main amd64 Packages
     1:2014.1.3-0ubuntu1.1 0
        500 http://10.20.0.1/mirror/archive.ubuntu.com/ubuntu/ trusty-security/main amd64 Packages
     1:2014.1-0ubuntu1 0
        500 http://10.20.0.1/mirror/archive.ubuntu.com/ubuntu/ trusty/main amd64 Packages

Setting importance to Critical since this breaks the deployment (100% reproducible)

Tags: area-library
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

mysql> describe services;
+-----------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+--------------+------+-----+---------+----------------+
| created_at | datetime | YES | | NULL | |
| updated_at | datetime | YES | | NULL | |
| deleted_at | datetime | YES | | NULL | |
| id | int(11) | NO | PRI | NULL | auto_increment |
| host | varchar(255) | YES | MUL | NULL | |
| binary | varchar(255) | YES | | NULL | |
| topic | varchar(255) | YES | | NULL | |
| report_count | int(11) | NO | | NULL | |
| disabled | tinyint(1) | YES | | NULL | |
| deleted | int(11) | YES | | NULL | |
| disabled_reason | varchar(255) | YES | | NULL | |
+-----------------+--------------+------+-----+---------+----------------+
11 rows in set (0.00 sec)

There are no `last_seen_up' field in the `services' database.

Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Marking as Incomplete. Please attach diagnostic snapshot.

Changed in fuel:
status: New → Incomplete
tags: added: area-library
Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Oleksiy,

> Marking as Incomplete. Please attach diagnostic snapshot.

Please stop these bureaucratic games and be constructive.
If you need a specific log, backtrace, etc - feel free to ask for it, and I'll provide it.
If you don't know the cause of the problem, and have no idea how to debug it - please leave
this bug report alone and *don't* mark it in any way, so other people who know the code
better than me and you (or have experienced a similar problem) can have a look at it
Thanks in advance.

Changed in fuel:
status: Incomplete → New
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :
Download full text (9.6 KiB)

Apparently nova-manage db sync fails, however puppet continues deployment instead of failing or retrying:

2016-11-11T13:55:54.881170+00:00 debug: (Exec[nova-db-sync](provider=posix)) Executing '/usr/bin/nova-manage db sync'
2016-11-11T13:55:54.881389+00:00 debug: Executing '/usr/bin/nova-manage db sync'
2016-11-11T13:58:45.465663+00:00 notice: (/Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]/returns) Option "verbose" from group "DEFAULT" is deprecated for removal. Its value may be silently ignored in the future.
2016-11-11T13:58:45.465663+00:00 notice: (/Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]/returns) Option "notification_driver" from group "DEFAULT" is deprecated. Use option "driver" from group "oslo_messaging_notifications".
2016-11-11T13:58:45.465663+00:00 notice: (/Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]/returns) Option "notification_topics" from group "DEFAULT" is deprecated. Use option "topics" from group "oslo_messaging_notifications".
2016-11-11T13:58:45.465663+00:00 notice: (/Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]/returns) /usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:450: Warning: Duplicate index 'block_device_mapping_instance_uuid_virtual_name_device_name_idx' defined on the table 'nova.block_device_mapping'. This is deprecated and will be disallowed in a future release.
2016-11-11T13:58:45.465663+00:00 notice: (/Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]/returns) cursor.execute(statement, parameters)
2016-11-11T13:58:45.465663+00:00 notice: (/Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]/returns) error: (_mysql_exceptions.OperationalError) (2006, 'MySQL server has gone away')
2016-11-11T13:58:45.465663+00:00 err: (/Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]) Failed to call refresh: /usr/bin/nova-manage db sync returned 1 instead of one of [0]
2016-11-11T13:58:45.465663+00:00 err: (/Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]) /usr/lib/ruby/vendor_ruby/puppet/util/errors.rb:106:in `fail'
2016-11-11T13:58:45.465663+00:00 err: (/Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]) /usr/lib/ruby/vendor_ruby/puppet/type/exec.rb:160:in `sync'
2016-11-11T13:58:45.465663+00:00 err: (/Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]) /usr/lib/ruby/vendor_ruby/puppet/type/exec.rb:583:in `refresh'
2016-11-11T13:58:45.465788+00:00 err: (/Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]) /usr/lib/ruby/vendor_ruby/puppet/transaction/event_manager.rb:101:in `process_callbac
k'
2016-11-11T13:58:45.465908+00:00 err: (/Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]) /usr/lib/ruby/vendor_ruby/puppet/transaction/event_manager.rb:20:in `block in process
_events'
2016-11-11T13:58:45.465908+00:00 err: (/Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]) /usr/lib/ruby/vendor_ruby/puppet/transaction/event_manager.rb:93:in `block in queued_
events'
2016-11-11T13:58:45.466018+00:00 err: (/Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]) /usr/lib/ruby/vendor_ruby/puppet/transaction/event_manager.rb:92:in `each'
2016-11-11T13:58:45.466125+00:00 err: (/Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]) /usr/lib/ruby/vendor_ruby/puppet/transaction/event_manager.rb:92:in `queued_events'
2016-11-11T13:58:45.466218+00:00...

Read more...

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> Apparently nova-manage db sync fails, however puppet continues deployment instead of failing or retrying

nova-manage db sync is triggered via refresh:

https://github.com/openstack/puppet-nova/blob/stable/mitaka/manifests/db/sync.pp#L19-L29

and puppet does NOT consider a failed exec as a failure if refreshonly is true, see

https://tickets.puppetlabs.com/browse/PUP-1223
https://tickets.puppetlabs.com/browse/PUP-2280

Perhaps 'nova-db-sync' should use onlyif (with a command which checks if the actual DB version
matches the expected one) instead of refreshonly

summary: - nova-{api,scheduler,cert,consoleauth} service fail to start due to a
- wrong SQL request (or a wrong DB schema?)
+ Fuel tries to continue deployment when nova-manage db-sync failed
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote : Re: Fuel tries to continue deployment when nova-manage db-sync failed

As if it was not enough db sync is not idempotent with mysql (DDL is *not* transactional in mysql).
Thus a failed db sync leaves the cluster in a badly broken state. Recovering from such a state
requires manual inspection of tables and selectively running DB migration scripts to add missing tables, columns, constraints, etc.
This is not a big deal for a new deployment (one can wipe out all tables and start over),
however this is a real disaster for upgrades.

summary: - Fuel tries to continue deployment when nova-manage db-sync failed
+ Fuel tries to continue deployment when nova-manage db sync failed
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

I am marking this as duplicate for 1641609

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> Thus a failed db sync leaves the cluster in a badly broken state.

Let's track this one separately: https://bugs.launchpad.net/fuel/+bug/1641609

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> I am marking this as duplicate for 1641609

It's not a duplicate at all. There are two separate problems here:

1) nova-manage db sync failure irreversibly breaking the cluster
2) fuel does NOT detect the failure and tries to continue the deployment

hence two different bug reports (this one and 1641609)

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Issue is not reproducible in default configuration on major of environments. So it is not critical according our criteria.

Changed in fuel:
importance: Critical → High
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Alexei,

What build number of 9.x do you use? Is it 9.0 or 9.1? It seems we have already this patch

https://github.com/openstack/fuel-library/blob/stable/mitaka/deployment/puppet/openstack_tasks/manifests/openstack_controller/openstack_controller.pp#L293-L297

Kindly asking you, next time attach Diagnostic Snapshot (it can be created by clicking 2-3 buttons in UI), this will help us to quickly debug and find the root cause.

Changed in fuel:
status: New → Invalid
Changed in fuel:
status: Invalid → New
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> What build number of 9.x do you use? Is it 9.0 or 9.1?

It's official 9.0 ISO (http://seed.fuel-infra.org/fuelweb-release/MirantisOpenStack-9.0.iso.torrent)

> attach Diagnostic Snapshot (it can be created by clicking 2-3 buttons in UI)

I'm afraid you are wrong, for

a) the documentation [1] on creating diagnostic snapshots is wrong:

> Log in to the Fuel web UI.
> Navigate to Support > Download Diagnostic Snapshot.
> Click Generate Diagnostic Snapshot.

For starters, there's no 'Support' to navigate to, see the attached screenshot (fueluimain-201611161326.png) of fuel web UI.

b) the `shotgun' tool which actually collects the data is way too memory hungry and typically gets OOM killed

[1] http://docs.openstack.org/developer/fuel-docs/userdocs/fuel-user-guide/maintain-environment/create-snapshot.html

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> It seems we have already this patch

> https://github.com/openstack/fuel-library/blob/stable/mitaka/deployment/puppet/openstack_tasks/manifests/openstack_controller/openstack_controller.pp#L293-L297

This patch does not change puppet behavior regarding exec resources evaluated on notify.

(And unfortunately retrying is not very useful due to nova-manage bug https://bugs.launchpad.net/fuel/+bug/1641609)

Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

@Alexei,

a) it seems to be UI bug, please try https://IP:8443/#support
b) it is very rare situation

Anyway, I do not see any solution we can implement.

Changed in fuel:
status: New → Invalid
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Just because you can't fix it doesn't mean the bug report is invalid. It's a major usability issue.

Changed in fuel:
status: Invalid → New
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> I do not see any solution we can implement.

It's two easy steps:

1) https://bugs.launchpad.net/fuel/+bug/1641136/comments/6
2) make nova-manage idempotent

Changed in fuel:
status: New → Confirmed
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

I guess the workaround would be to add a check to openstack-controller task at the end that checks if migration is fine.

Changed in fuel:
milestone: 9.2 → 11.0
importance: High → Medium
Revision history for this message
Dmitry Sutyagin (dsutyagin) wrote :

Reproduced this on a 9.2 virtual lab env with a slow backing hard drive. Deployment failed on Task[primary-openstack-controller/3], nova database is incomplete and subsequent deployment runs also fail because nova-api is down due to MySQL nova database missing some columns (services.forced_down in my case).

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.