after full cluster restart MySQL is not available

Bug #1297355 reported by Andrey Sledzinskiy on 2014-03-25
This bug affects 3 people
Affects: Fuel for OpenStack | Importance: High | Assigned to: Bartosz Kupidura
Affects: 4.1.x | Importance: High | Assigned to: Fuel Library (Deprecated)

Bug Description

Bug is reproduced on {"build_id": "2014-03-25_11-38-46", "mirantis": "yes", "build_number": "43", "nailgun_sha": "3044c2054904525601c921387322a2978e821677", "ostf_sha": "013c13ab033a6829ca4eeaa2476c30837e814902", "fuelmain_sha": "f7ee8bcaa3d993395669f2bcae893176ff2b3bbe", "astute_sha": "d7c6c4d00ffd6e2fa74da442f573e6f39049961e", "release": "5.0", "fuellib_sha": "3445ab7550486074ec8e47fdaed869c697991364"}

Steps:
1. Create a new cluster: Ubuntu, HA, KVM, Neutron VLAN, Ceph for volumes, Ceph for images, Rados
2. Add 3 controller nodes with Ceph and 2 compute nodes with Ceph
3. Deploy the cluster - the cluster deploys successfully
4. Shut down the primary controller first, then the other two controllers; then start the primary controller first, followed by the other two controllers
5. Run OSTF tests and check the Neutron server logs

Expected - all tests pass and there are no errors
Actual - the volume- and instance-creation tests failed, with "Lost connection to MySQL server" in the logs

Neutron server log on primary controller:
neutron [-] (OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 11") None None
neutron.openstack.common.rpc.common [-] ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/amqp.py", line 438, in _process_data\n **args)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 45, in dispatch\n neutron_ctxt, version, method, namespace, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n result = getattr(proxyobj, method)(ctxt, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/l3_rpc_base.py", line 56, in sync_routers\n context, host, router_ids)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/l3_agentschedulers_db.py", line 131, in list_active_sync_routers_on_active_l3_agent\n context, constants.AGENT_TYPE_L3, host)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/agents_db.py", line 133, in _get_agent_by_type_and_host\n host=host)\n', 'AgentNotFoundByTypeHost: Agent with agent_type=L3 agent and host=node-10 could not be found\n']

Diagnostic snapshot is attached

Tatyanka (tatyana-leontovich) wrote :

Caused by broken Galera - see the MySQL logs for details.

Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Changed in fuel:
milestone: none → 5.0

Got the same error after resetting and restarting all controller and compute nodes (ISO #59):
- Environment - Ubuntu, HA, KVM, Neutron GRE, Cinder LVM, Sahara, Murano, Ceilometer
- Add 3 controller nodes and 1 compute node
- Deploy the cluster
- Shut down and start the controller and compute nodes
- Run OSTF tests

Diagnostic snapshot is attached

Error in Neutron server log:
neutron.openstack.common.rpc.common [-] ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/amqp.py", line 438, in _process_data\n **args)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 45, in dispatch\n neutron_ctxt, version, method, namespace, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n result = getattr(proxyobj, method)(ctxt, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/l3_rpc_base.py", line 56, in sync_routers\n context, host, router_ids)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/l3_agentschedulers_db.py", line 131, in list_active_sync_routers_on_active_l3_agent\n context, constants.AGENT_TYPE_L3, host)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/agents_db.py", line 133, in _get_agent_by_type_and_host\n host=host)\n', 'AgentNotFoundByTypeHost: Agent with agent_type=L3 agent and host=node-10 could not be found\n']

2014-04-01 12:01:12 ERROR

neutron.openstack.common.rpc.common [-] Returning exception Agent with agent_type=L3 agent and host=node-10 could not be found to caller

2014-04-01 12:01:12 ERROR

neutron.openstack.common.rpc.amqp [-] Exception during message handling

2014-04-01 12:01:12 ERROR

neutron.openstack.common.rpc.common [-] ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/amqp.py", line 438, in _process_data\n **args)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 45, in dispatch\n neutron_ctxt, version, method, namespace, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n result = getattr(proxyobj, method)(ctxt, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/l3_rpc_base.py", line 56, in sync_routers\n context, host, router_ids)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/l3_agentschedulers_db.py", line 131, in list_active_sync_routers_on_active_l3_agent\n context, constants.AGENT_TYPE_L3, host)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/agents_db.py", line 133, in _get_agent_by_type_and_host\n host=host)\n', 'AgentNotFoundByTypeHost: Agent with agent_type=L3 agent and host=node-12 could not be found\n']

2014-04-01 12:01:12 ERROR

neutron.openstack.common.rpc.common [-] Returning exception Agent with agent_type=L3 agent and host=node-12 could not be found to caller

2014-04-01 12:01:12 ERROR

neutron.openstack.common.rpc.amqp [-] Exception during message handling

2014-04-01 12:01:02 ERROR

neutron.common.legacy [-] Skipping unknown group key: firewall_driver

2014-04-01 12:01:02 CRITICAL

neutro...


Bogdan Dobrelya (bogdando) wrote :

When you shut down all controllers, you break the Galera cluster, so the OpenStack environment becomes non-operational. That is a known HA architecture limitation.
There are also two known bugs: 1) a RabbitMQ < v3 glitch on cluster-member disconnect, and 2) a Fuel-specific Galera HA bug after VIP migration.
Please elaborate which controller(s) you shut down to reproduce the subject issue: the primary one? All controllers? The one hosting the (management) VIP address?
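
The limitation described above is quorum arithmetic: a Galera/Pacemaker cluster only stays operational while a majority of members is alive, so a full shutdown always leaves it without a primary component. A minimal illustrative sketch (not Fuel code):

```python
# Illustrative quorum check, not actual Fuel/Pacemaker code: a cluster keeps
# its primary component only while more than half of its members are alive.

def has_quorum(alive, total):
    """Return True if `alive` members out of `total` form a majority."""
    return alive > total / 2

# 3-controller HA cluster:
print(has_quorum(2, 3))  # True: losing one controller is survivable
print(has_quorum(0, 3))  # False: a full shutdown leaves no primary component
```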

For now I set this bug as incomplete.

Changed in fuel:
status: New → Incomplete

To reproduce the subject issue you need to shut down and start all controllers.

Bogdan Dobrelya (bogdando) wrote :

Got it. I am putting the issue in the Invalid state then. Please update it if the given full-shutdown test case should be supported.

Changed in fuel:
status: Incomplete → Invalid
Andrew Woodward (xarses) wrote :

Cold start of the cluster must be supported.

Andrey, please update the reproduction steps; this thread has become hard to follow on how to reproduce the subject case. Please edit the bug description.

The bug was moved to the New state due to Andrew's comment that this case should be supported. The bug description was updated.

description: updated
Changed in fuel:
status: Invalid → New
Bogdan Dobrelya (bogdando) wrote :

Please elaborate the exact actions for "Shutdown and start all controllers":
which order to shut down, and which order to start.

Changed in fuel:
status: New → Confirmed
description: updated
Bogdan Dobrelya (bogdando) wrote :

We need to fix the Galera OCF script to support arbitrary startup vs. shutdown orders, and to ship RabbitMQ 3.x along with its OCF script, to resolve this issue. Hence, confirmed and triaged.
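
A key input for an order-independent bootstrap is the sequence number that Galera persists in each node's grastate.dat (a seqno of -1 indicates an unclean shutdown). The sketch below, with an illustrative parser and sample file contents that are not the actual OCF script code, shows how that state could be read:

```python
# Hypothetical illustration: before bootstrapping, an OCF script must learn
# which node has the most recent Galera state. Galera records this in
# /var/lib/mysql/grastate.dat; seqno == -1 means an unclean shutdown.

def parse_grastate(text):
    """Extract the cluster uuid and seqno from grastate.dat contents."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("uuid:"):
            state["uuid"] = line.split(":", 1)[1].strip()
        elif line.startswith("seqno:"):
            state["seqno"] = int(line.split(":", 1)[1].strip())
    return state

# Sample file contents (values are made up for illustration):
sample = """\
# GALERA saved state
version: 2.1
uuid:    8bcf4a34-aa29-11e3-9c33-77c05b6a9b38
seqno:   1234
cert_index:
"""
print(parse_grastate(sample)["seqno"])  # -> 1234
```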

Changed in fuel:
status: Confirmed → Triaged
Andrew Woodward (xarses) on 2014-04-04
tags: added: ha library
Andrew Woodward (xarses) on 2014-04-04
tags: added: backports-4.1.1
summary: - Lost connection to MySQL server at 'reading initial communication
- packet' on primary controller
+ after full cluster restart MySQL is not available
Vladimir Kuklin (vkuklin) wrote :

This bug is going to be addressed by an improved Galera OCF script in version 5.1 of Fuel.

Changed in fuel:
milestone: 5.0 → 5.1
Mike Scherbakov (mihgen) on 2014-05-08
tags: added: release-notes
Dmitry Borodaenko (angdraug) wrote :

Vladimir, is there a blueprint for the improved Galera OCF script? Or is this going to be worked on right in this bug?

Meg McRoberts (dreidellhasa) wrote :

Added to the Known Issues list in the 5.0 Release Notes. Is there a workaround we can document?

Bogdan Dobrelya (bogdando) wrote :

AFAIK, the only workaround is following the manual scenario (https://mirantis.jira.com/wiki/display/PRD/Working+with+Galera+cluster). Perhaps, Miroslav Anashkin could clarify it better...

Bartosz Kupidura (zynzel) wrote :

Guys,
Please read and comment:
https://lists.launchpad.net/fuel-dev/msg01100.html
https://blueprints.launchpad.net/fuel/+spec/reliable-galera-ocf-script
https://review.openstack.org/#/c/95764/

There is a ready RA for Galera which supports whole-cluster reboot.

We need to test this heavily, especially because I removed all cs_commit/cs_shadow calls from Puppet.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)
Changed in fuel:
status: Triaged → In Progress
Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Bartosz Kupidura (zynzel)

Change abandoned by Bartosz Kupidura (<email address hidden>) on branch: master
Review: https://review.openstack.org/102140

Changed in fuel:
assignee: Bartosz Kupidura (zynzel) → Bogdan Dobrelya (bogdando)
Bogdan Dobrelya (bogdando) wrote :

Returned the assignee back (it had been changed because a rebase was submitted)

Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Bartosz Kupidura (zynzel)
Changed in fuel:
assignee: Bartosz Kupidura (zynzel) → Bogdan Dobrelya (bogdando)
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Bartosz Kupidura (zynzel)

Reviewed: https://review.openstack.org/95764
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=46748d05d656d0ec617d0876e2620b5100c224b9
Submitter: Jenkins
Branch: master

commit 46748d05d656d0ec617d0876e2620b5100c224b9
Author: Bartosz Kupidura <email address hidden>
Date: Tue Jun 24 09:40:14 2014 +0200

    New RA for galera/pacemaker.

    Implements: blueprint galera-improvements
    Closes-Bug: 1297355

    Change-Id: I593d113b430d7607f92ff68ea269e071898a5068

Changed in fuel:
status: In Progress → Fix Committed
tags: added: to-be-covered-by-tests
tags: removed: backports-4.1.1
Meg McRoberts (dreidellhasa) wrote :

Marked as "Known Issue" in 5.0.1 Release Notes.

Egor Kotko (ykotko) wrote :

{u'build_id': u'2014-08-18_02-01-17', u'ostf_sha': u'd2a894d228c1f3c22595a77f04b1e00d09d8e463', u'build_number': u'448', u'auth_required': True, u'nailgun_sha': u'bc9e377dbe010732bc2ba47161ed9d433998e07b', u'production': u'docker', u'api': u'1.0', u'fuelmain_sha': u'08f04775dcfadd8f5b438a31c63e81f29276b7d3', u'astute_sha': u'8e1db3926b2320b30b23d7a772122521b0d96166', u'feature_groups': [u'mirantis'], u'release': u'5.1', u'fuellib_sha': u'2c9ad4aec9f3b6fc060cb5a394733607f07063c1'}

Changed in fuel:
status: Fix Committed → Fix Released
Meg McRoberts (dreidellhasa) wrote :

Here is the text for the Resolved Issues section -- please verify that it is accurate and complete:

MySQL is available after full restart of environment
----------------------------------------------------

Older versions of Galera
(which manages MySQL in an OpenStack environment)
sometimes failed if the controllers in an HA environment
came back online in a different order than Galera expected.
Release 5.1 includes a new RA (resource agent)
for Galera and Pacemaker
that supports bootstrapping the cluster
after a reboot of the whole cluster or of any node in it.
It uses the Galera GTID (Global Transaction ID)
to determine which node has the latest database version
and uses that node as the Galera PC (Primary Component).
The administrator can manually choose a different node
to serve as the PC.
This fixes the issue.
See `LP1297355 <https://bugs.launchpad.net/fuel/+bug/1297355>`_.
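
The GTID-based selection described above can be sketched as follows; the function and node data are hypothetical illustrations, not the actual resource agent code:

```python
# Illustrative sketch (not the real RA): given each node's recovered Galera
# sequence number, bootstrap from the node with the latest committed state.
# The override mirrors the "manually choose a different node" option.

def pick_primary(seqnos, override=None):
    """seqnos: dict of node name -> last committed seqno."""
    if override is not None:
        return override
    # max() with a key picks the node holding the newest database state
    return max(seqnos, key=seqnos.get)

nodes = {"node-1": 3041, "node-2": 3060, "node-3": 3057}
print(pick_primary(nodes))            # -> node-2 (highest seqno)
print(pick_primary(nodes, "node-1"))  # -> node-1 (operator override)
```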

Denis Meltsaykin (dmeltsaykin) wrote :

4.x is out of support and there is no fix for this bug in that series. Closing this as Won't Fix.
