nova-* services are marked as down after MySQL / RabbitMQ failures

Bug #1516993 reported by Artem Panchenko
This bug affects 3 people
Affects             Status        Importance  Assigned to      Milestone
Mirantis OpenStack  Fix Released  Critical    Roman Podoliaka
7.0.x               Invalid      Critical    Roman Podoliaka
8.0.x               Fix Released  Critical    Roman Podoliaka

Bug Description

Upstream bugs:
https://bugs.launchpad.net/nova/+bug/1505471 https://bugs.launchpad.net/nova/+bug/1517926

==========================================================

Fuel version info (8.0 build #169): http://paste.openstack.org/show/479097/

Nova compute service (1 of 2 instances in the cloud) was marked as down after MySQL termination on controllers (one by one):

OSTF check 'Check that required services are running' failed:

fuel_health.tests.sanity.test_sanity_infrastructure: DEBUG: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/fuel_health/tests/sanity/test_sanity_infrastructure.py", line 75, in test_001_services_state
    u'XXX' not in output, 'Step 2 failed: Some nova services '
  File "/usr/lib/python2.6/site-packages/fuel_health/common/test_mixins.py", line 164, in verify_response_true
    self.fail(message.format(failed_step_msg, msg))
  File "/usr/lib/python2.6/site-packages/unittest2/case.py", line 415, in fail
    raise self.failureException(msg)
AssertionError: Step 2 failed: Some nova services have not been started.. Please refer to OpenStack logs for more details.

Binary Host Zone Status State Updated_At
nova-cert node-2.test.domain.local internal enabled :-) 2015-11-17 06:32:31
nova-consoleauth node-2.test.domain.local internal enabled :-) 2015-11-17 06:32:31
nova-scheduler node-2.test.domain.local internal enabled :-) 2015-11-17 06:32:32
nova-conductor node-2.test.domain.local internal enabled :-) 2015-11-17 06:32:31
nova-cert node-3.test.domain.local internal enabled :-) 2015-11-17 06:31:38
nova-consoleauth node-3.test.domain.local internal enabled :-) 2015-11-17 06:31:38
nova-scheduler node-3.test.domain.local internal enabled :-) 2015-11-17 06:31:38
nova-conductor node-3.test.domain.local internal enabled :-) 2015-11-17 06:31:38
nova-cert node-4.test.domain.local internal enabled :-) 2015-11-17 06:32:24
nova-consoleauth node-4.test.domain.local internal enabled :-) 2015-11-17 06:32:24
nova-scheduler node-4.test.domain.local internal enabled :-) 2015-11-17 06:32:25
nova-conductor node-4.test.domain.local internal enabled :-) 2015-11-17 06:32:24
nova-compute node-1.test.domain.local nova enabled :-) 2015-11-17 06:32:08
nova-compute node-5.test.domain.local nova enabled XXX 2015-11-17 06:28:16

Steps to reproduce:

1. Terminate mysql on a controller node
2. Wait while it is being restarted
3. Verify it is restarted
4. Go to another controller (kill the mysql daemon, wait until it is recovered by pacemaker, go to the next controller, kill the mysql daemon, ...)
5. Run OSTF

Expected result: all health checks passed
Actual: test 'Check that required services are running' failed

Here is a part of nova-compute log on node-5:

2015-11-17 06:29:18.560 8431 ERROR oslo.service.loopingcall RemoteError: Remote error: DBConnectionError (_mysql_exceptions.OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") [SQL: u'SELECT 1']

http://paste.openstack.org/show/479104/

The service stayed marked as down (for up to one hour) until I restarted nova-compute on node-5.

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Artem, have you checked Galera status?

2015-11-17 06:29:18.560 8431 ERROR oslo.service.loopingcall RemoteError: Remote error: DBConnectionError (_mysql_exceptions.OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") [SQL: u'SELECT 1']

means that oslo.db did a pessimistic connection check (SELECT 1 at the beginning of each new transaction) and found that the connection had been invalidated. Error 2013 returned by Galera means the node is currently not available to clients (most likely it is in the middle of a cluster re-assembly).

Changed in fuel:
status: New → Incomplete
assignee: MOS Nova (mos-nova) → Artem Panchenko (apanchenko-8)
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

@Roman,

yep, the MySQL cluster worked fine, and the HA OSTF check passed, btw. Looks like nova-compute got that error right after we killed one MySQL instance (on slave-01):

2015-11-17 06:29:11,734 - INFO test_failover_base.py:348 -- Terminating MySQL on slave-01
2015-11-17 06:29:36,904 - INFO checkers.py:446 -- MySQL daemon is started on slave-01

Most probably HAProxy didn't manage to mark the failed MySQL instance as down immediately, and some requests were proxied to it. The question here is what we expect from Nova in such cases: should it try to re-connect to the DB so the service recovers, or should HA for the database cluster be improved instead?

Changed in fuel:
status: Incomplete → New
assignee: Artem Panchenko (apanchenko-8) → MOS Nova (mos-nova)
Dmitry Pyzhov (dpyzhov)
tags: added: area-mos
tags: added: swarm-blocker
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Artem, oslo.db makes sure you reconnect to the DB; in fact, 2013 errors are what you get when you try to connect. I'm wondering what settings we have for HAProxy - it looks like it continues to proxy connections to a MySQL node that is not yet part of the Galera cluster.
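A common Galera-aware HAProxy pattern addresses exactly this. The fragment below is illustrative only (not Fuel's actual configuration; addresses, server names, and timings are made up): health-check each backend via an HTTP clustercheck script on port 9200 instead of a bare TCP connect, so a node that is up but not yet synced with the cluster is kept out of rotation.

```
# Illustrative HAProxy stanza (NOT Fuel's real config): route MySQL traffic
# only to Galera nodes whose clustercheck script reports "synced" over HTTP
# on port 9200, so a node still joining the cluster is never proxied to.
listen mysqld
    bind 192.168.0.2:3306
    mode tcp
    option httpchk            # health-check via HTTP, not a plain TCP connect
    server node-1 192.168.0.11:3306 check port 9200 inter 2s rise 2 fall 3
    server node-2 192.168.0.12:3306 check port 9200 inter 2s rise 2 fall 3 backup
    server node-3 192.168.0.13:3306 check port 9200 inter 2s rise 2 fall 3 backup
```

With a plain TCP check, a mysqld that accepts connections but is still re-joining the cluster looks healthy, which matches the behavior described above.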

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

We are going to give it a try, but this must not be a blocker, IMO.

Changed in fuel:
status: New → Incomplete
summary: - Nova compute is marked as down after MySQL failover
+ nova-compute is marked as down after MySQL failover
Changed in fuel:
status: Incomplete → Confirmed
description: updated
summary: - nova-compute is marked as down after MySQL failover
+ nova-* services are marked as down after MySQL failover
Changed in fuel:
assignee: MOS Nova (mos-nova) → Roman Podoliaka (rpodolyaka)
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote : Re: nova-* services are marked as down after MySQL failover

Seems to be caused by https://github.com/openstack/nova/commit/3bc171202163a3810fdc9bdb3bad600487625443 , which reduced the scope of exceptions handled and no longer includes DBError / RemoteError exceptions.

Changed in fuel:
status: Confirmed → In Progress
description: updated
summary: - nova-* services are marked as down after MySQL failover
+ nova-* services are marked as down after MySQL / RabbitMQ failures
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

BVT blocker, many duplicates. Impact raised to critical

Changed in fuel:
importance: High → Critical
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Fixed in https://review.fuel-infra.org/14398 . Waiting for CI

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Reproduced again - http://paste.openstack.org/show/482783/

root@node-4:/var/log# nova --version
2.30.2
root@node-4:/var/log# dpkg -l | grep nova
ii nova-common 2:12.0.0-1~u14.04+mos19 all OpenStack Compute - common files
ii nova-compute 2:12.0.0-1~u14.04+mos19 all OpenStack Compute - compute node
ii nova-compute-qemu 2:12.0.0-1~u14.04+mos19 all OpenStack Compute - compute node (QEmu)
ii python-nova 2:12.0.0-1~u14.04+mos19 all OpenStack Compute - libraries
ii python-novaclient 2:2.30.2-1~u14.04+mos3 all client library for OpenStack Compute API
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "361"
  build_id: "361"
  fuel-nailgun_sha: "53c72a9600158bea873eec2af1322a716e079ea0"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "7463551bc74841d1049869aaee777634fb0e5149"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "ba8063d34ff6419bddf2a82b1de1f37108d96082"
  fuel-ostf_sha: "889ddb0f1a4fa5f839fd4ea0c0017a3c181aa0c1"
  fuel-mirror_sha: "8adb10618bb72bb36bb018386d329b494b036573"
  fuelmenu_sha: "824f6d3ebdc10daf2f7195c82a8ca66da5abee99"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "07d5f1c3e1b352cb713852a3a96022ddb8fe2676"
[root@nailgun ~]#

Steps, first case (Neutron VLAN):

1. Create and deploy a cluster - Neutron VLAN, cinder/swift, 3 controller, 2 compute, 1 cinder node
2. Run OSTF
3. Verify network
4. Fill cinder storage up to 30%
5. Shut down all nodes as described in this manual - https://docs.mirantis.com/openstack/fuel/fuel-6.1/operations.html#howto-shut-down-the-whole-cluster
6. Wait 5 minutes
7. Start the cluster as it is described in this manual - https://docs.mirantis.com/openstack/fuel/fuel-6.1/operations.html#howto-shut-down-the-whole-cluster
8. Wait until the OSTF 'HA' suite passes
9. Run OSTF

Scenario 2 is the same as the first, but Neutron VXLAN is used.

Actual result:
nova-manage service list shows nova-compute as XXX - http://paste.openstack.org/show/482783/. Restarting the service does not help.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

>>> Nova-manage service list show nova compute as XXX http://paste.openstack.org/show/482783/. Restart of service do not help

Tatyana, I think it's actually a different issue, as the RCA for the original one showed that a restart *always* helped.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I checked the env / logs and see the following:

http://paste.openstack.org/show/482800/

I.e. nova-compute was stuck on getting information from libvirt until I tried to force generation of a guru meditation report by sending a SIGUSR1 signal. After that, nova-compute found out that the libvirt connection was broken and disabled itself in the service list.

The bottom line is, I believe, this is a separate issue - I filed a bug for it - https://bugs.launchpad.net/mos/+bug/1529810 . Closing this one.

Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :

Also reproduced with the following scenario:

1. Deploy cluster: 3 controller+cinder+ceph + 2 compute nodes, Neutron VxLAN, cinder for volumes and ceph for images + Rados GW
2. Verify networks
3. Add 1 ceph+cinder+controller node
4. Re-deploy cluster
5. Verify networks
6. Add 1 new ceph+cinder+controller node and delete one already deployed ceph+cinder+controller node
7. Re-deploy cluster
8. Run OSTF

tags: added: on-verification
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I can see no commits merged in the bug comments - why does it have the status Fix Committed?

Changed in fuel:
status: Fix Committed → Incomplete
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Is it https://review.fuel-infra.org/#/c/14398/ ? But that targets only the MOS project, not Fuel.

Changed in mos:
status: New → Fix Committed
importance: Undecided → Critical
assignee: nobody → Roman Podoliaka (rpodolyaka)
milestone: none → 8.0
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

This particular issue was fixed in https://review.fuel-infra.org/#/c/14398/ . Please read the description carefully - I believe you are seeing a different issue. Most likely it's https://bugs.launchpad.net/mos/+bug/1529810

Changed in fuel:
status: Incomplete → Fix Committed
Revision history for this message
Alexander Zatserklyany (zatserklyany) wrote :

Fix released

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "456"

Changed in mos:
status: Fix Committed → Fix Released
tags: removed: on-verification
no longer affects: fuel
Changed in mos:
status: Fix Released → New
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Kilo (7.0) is not affected
