nova-* services are marked as down after MySQL / RabbitMQ failures

Bug #1516993 reported by Artem Panchenko
This bug affects 3 people
Affects             Status        Importance  Assigned to      Milestone
Mirantis OpenStack  Fix Released  Critical    Roman Podoliaka
7.0.x               Invalid      Critical    Roman Podoliaka
8.0.x               Fix Released  Critical    Roman Podoliaka

Bug Description

Upstream bugs:
https://bugs.launchpad.net/nova/+bug/1505471 https://bugs.launchpad.net/nova/+bug/1517926

==========================================================

Fuel version info (8.0 build #169): http://paste.openstack.org/show/479097/

Nova compute service (1 of 2 instances in the cloud) was marked as down after MySQL termination on controllers (one by one):

OSTF check 'Check that required services are running' failed:

fuel_health.tests.sanity.test_sanity_infrastructure: DEBUG: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/fuel_health/tests/sanity/test_sanity_infrastructure.py", line 75, in test_001_services_state
    u'XXX' not in output, 'Step 2 failed: Some nova services '
  File "/usr/lib/python2.6/site-packages/fuel_health/common/test_mixins.py", line 164, in verify_response_true
    self.fail(message.format(failed_step_msg, msg))
  File "/usr/lib/python2.6/site-packages/unittest2/case.py", line 415, in fail
    raise self.failureException(msg)
AssertionError: Step 2 failed: Some nova services have not been started.. Please refer to OpenStack logs for more details.

Binary Host Zone Status State Updated_At
nova-cert node-2.test.domain.local internal enabled :-) 2015-11-17 06:32:31
nova-consoleauth node-2.test.domain.local internal enabled :-) 2015-11-17 06:32:31
nova-scheduler node-2.test.domain.local internal enabled :-) 2015-11-17 06:32:32
nova-conductor node-2.test.domain.local internal enabled :-) 2015-11-17 06:32:31
nova-cert node-3.test.domain.local internal enabled :-) 2015-11-17 06:31:38
nova-consoleauth node-3.test.domain.local internal enabled :-) 2015-11-17 06:31:38
nova-scheduler node-3.test.domain.local internal enabled :-) 2015-11-17 06:31:38
nova-conductor node-3.test.domain.local internal enabled :-) 2015-11-17 06:31:38
nova-cert node-4.test.domain.local internal enabled :-) 2015-11-17 06:32:24
nova-consoleauth node-4.test.domain.local internal enabled :-) 2015-11-17 06:32:24
nova-scheduler node-4.test.domain.local internal enabled :-) 2015-11-17 06:32:25
nova-conductor node-4.test.domain.local internal enabled :-) 2015-11-17 06:32:24
nova-compute node-1.test.domain.local nova enabled :-) 2015-11-17 06:32:08
nova-compute node-5.test.domain.local nova enabled XXX 2015-11-17 06:28:16

Steps to reproduce:

1. Terminate mysql on a controller node
2. Wait while it is being restarted
3. Verify it is restarted
4. Go to another controller (kill the mysql daemon, wait until it is recovered by pacemaker, go to the next controller, kill the mysql daemon, ...)
5. Run OSTF

Expected result: all health checks passed
Actual: test 'Check that required services are running' failed

Here is a part of nova-compute log on node-5:

2015-11-17 06:29:18.560 8431 ERROR oslo.service.loopingcall RemoteError: Remote error: DBConnectionError (_mysql_exceptions.OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") [SQL: u'SELECT 1']

http://paste.openstack.org/show/479104/

The service stayed marked as down (for up to one hour) until I restarted nova-compute on node-5.

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Artem, have you checked Galera status?

2015-11-17 06:29:18.560 8431 ERROR oslo.service.loopingcall RemoteError: Remote error: DBConnectionError (_mysql_exceptions.OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") [SQL: u'SELECT 1']

means that oslo.db did a pessimistic connection check (SELECT 1 at the beginning of each new transaction) and found that the connection had been invalidated. Error 2013 returned by Galera means the node is currently not available to clients (most likely it is in the middle of a cluster re-assembly).

Changed in fuel:
status: New → Incomplete
assignee: MOS Nova (mos-nova) → Artem Panchenko (apanchenko-8)
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

@Roman,

yep, the MySQL cluster worked fine, and the HA OSTF check passed, btw. Looks like nova-compute got that error right after we killed one MySQL instance (on slave-01):

2015-11-17 06:29:11,734 - INFO test_failover_base.py:348 -- Terminating MySQL on slave-01
2015-11-17 06:29:36,904 - INFO checkers.py:446 -- MySQL daemon is started on slave-01

Most probably HAProxy didn't manage to mark the failed MySQL instance as down immediately, and some requests were proxied to it. The question here is what we expect from Nova in such cases: should it try to re-connect to the DB so the service recovers, or should HA for the database cluster be improved instead?

Changed in fuel:
status: Incomplete → New
assignee: Artem Panchenko (apanchenko-8) → MOS Nova (mos-nova)
Dmitry Pyzhov (dpyzhov)
tags: added: area-mos
tags: added: swarm-blocker
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Artem, oslo.db makes sure you reconnect to the DB; in fact, 2013 errors are what you get when you try to connect. I'm wondering what settings we have for HAProxy - it looks like it continues to proxy connections to a MySQL node that is not yet part of the Galera cluster.
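A common Galera-aware HAProxy pattern addresses exactly this. The fragment below is illustrative only (not Fuel's actual configuration; addresses, server names, and timings are made up): health-check each backend via an HTTP clustercheck script on port 9200 instead of a bare TCP connect, so a node that is up but not yet synced with the cluster is kept out of rotation.

```
# Illustrative HAProxy stanza (NOT Fuel's real config): route MySQL traffic
# only to Galera nodes whose clustercheck script reports "synced" over HTTP
# on port 9200, so a node still joining the cluster is never proxied to.
listen mysqld
    bind 192.168.0.2:3306
    mode tcp
    option httpchk            # health-check via HTTP, not a plain TCP connect
    server node-1 192.168.0.11:3306 check port 9200 inter 2s rise 2 fall 3
    server node-2 192.168.0.12:3306 check port 9200 inter 2s rise 2 fall 3 backup
    server node-3 192.168.0.13:3306 check port 9200 inter 2s rise 2 fall 3 backup
```

With a plain TCP check, a mysqld that accepts connections but is still re-joining the cluster looks healthy, which matches the behavior described above.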

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

We are going to give it a try, but this must not be a blocker, IMO.

Changed in fuel:
status: New → Incomplete
summary: - Nova compute is marked as down after MySQL failover
+ nova-compute is marked as down after MySQL failover
Changed in fuel:
status: Incomplete → Confirmed
description: updated
summary: - nova-compute is marked as down after MySQL failover
+ nova-* services are marked as down after MySQL failover
Changed in fuel:
assignee: MOS Nova (mos-nova) → Roman Podoliaka (rpodolyaka)
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote : Re: nova-* services are marked as down after MySQL failover

Seems to be caused by https://github.com/openstack/nova/commit/3bc171202163a3810fdc9bdb3bad600487625443 , which reduced the scope of exceptions handled and no longer includes DBError / RemoteError exceptions.

Changed in fuel:
status: Confirmed → In Progress
description: updated
summary: - nova-* services are marked as down after MySQL failover
+ nova-* services are marked as down after MySQL / RabbitMQ failures
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

BVT blocker, many duplicates. Impact raised to critical

Changed in fuel:
importance: High → Critical
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Fixed in https://review.fuel-infra.org/14398 . Waiting for CI

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Reproduced again - http://paste.openstack.org/show/482783/

root@node-4:/var/log# nova --version
2.30.2
root@node-4:/var/log# dpkg -l | grep nova
ii nova-common 2:12.0.0-1~u14.04+mos19 all OpenStack Compute - common files
ii nova-compute 2:12.0.0-1~u14.04+mos19 all OpenStack Compute - compute node
ii nova-compute-qemu 2:12.0.0-1~u14.04+mos19 all OpenStack Compute - compute node (QEmu)
ii python-nova 2:12.0.0-1~u14.04+mos19 all OpenStack Compute - libraries
ii python-novaclient 2:2.30.2-1~u14.04+mos3 all client library for OpenStack Compute API
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "361"
  build_id: "361"
  fuel-nailgun_sha: "53c72a9600158bea873eec2af1322a716e079ea0"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "7463551bc74841d1049869aaee777634fb0e5149"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "ba8063d34ff6419bddf2a82b1de1f37108d96082"
  fuel-ostf_sha: "889ddb0f1a4fa5f839fd4ea0c0017a3c181aa0c1"
  fuel-mirror_sha: "8adb10618bb72bb36bb018386d329b494b036573"
  fuelmenu_sha: "824f6d3ebdc10daf2f7195c82a8ca66da5abee99"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "07d5f1c3e1b352cb713852a3a96022ddb8fe2676"
[root@nailgun ~]#

Steps, first case (Neutron VLAN):

1. Create and deploy a cluster - Neutron VLAN, cinder/swift, 3 controller, 2 compute, 1 cinder node
2. Run OSTF
3. Verify network
4. Fill cinder storage up to 30%
5. Shut down all nodes as described in this manual - https://docs.mirantis.com/openstack/fuel/fuel-6.1/operations.html#howto-shut-down-the-whole-cluster
6. Wait 5 minutes
7. Start the cluster as it is described in this manual - https://docs.mirantis.com/openstack/fuel/fuel-6.1/operations.html#howto-shut-down-the-whole-cluster
8. Wait until the OSTF 'HA' suite passes
9. Run OSTF

Scenario 2 is the same as the first, but Neutron VXLAN is used.

Actual result:
nova-manage service list shows nova-compute as XXX - http://paste.openstack.org/show/482783/. Restarting the service does not help.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

>>> Nova-manage service list show nova compute as XXX http://paste.openstack.org/show/482783/. Restart of service do not help

Tatyana, I think it's actually a different issue, as the RCA for the original one showed that a restart *always* helped.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I checked the env / logs and see the following:

http://paste.openstack.org/show/482800/

I.e. nova-compute was stuck on getting information from libvirt until I tried to force generation of a guru meditation report by sending a SIGUSR1 signal. After that, nova-compute found out that the libvirt connection was broken and disabled itself in the service list.

The bottom line is, I believe, this is a separate issue - I filed a bug for it - https://bugs.launchpad.net/mos/+bug/1529810 . Closing this one.

Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :

Also reproduced with the following scenario:

1. Deploy cluster: 3 controller+cinder+ceph + 2 compute nodes, Neutron VxLAN, cinder for volumes and ceph for images + Rados GW
2. Verify networks
3. Add 1 ceph+cinder+controller node
4. Re-deploy cluster
5. Verify networks
6. Add 1 new ceph+cinder+controller node and delete one already deployed ceph+cinder+controller node
7. Re-deploy cluster
8. Run OSTF

tags: added: on-verification
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I can see no commits merged in the bug comments - why does it have the status Fix Committed?

Changed in fuel:
status: Fix Committed → Incomplete
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Is it https://review.fuel-infra.org/#/c/14398/ ? But that targets only the MOS project, not Fuel.

Changed in mos:
status: New → Fix Committed
importance: Undecided → Critical
assignee: nobody → Roman Podoliaka (rpodolyaka)
milestone: none → 8.0
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

This particular issue was fixed in https://review.fuel-infra.org/#/c/14398/ . Please read the description carefully - I believe you are seeing a different issue. Most likely it's https://bugs.launchpad.net/mos/+bug/1529810

Changed in fuel:
status: Incomplete → Fix Committed
Revision history for this message
Alexander Zatserklyany (zatserklyany) wrote :

Fix released

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "456"

Changed in mos:
status: Fix Committed → Fix Released
tags: removed: on-verification
no longer affects: fuel
Changed in mos:
status: Fix Released → New
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Kilo (7.0) is not affected
