after full cluster restart MySQL is not available

Bug #1297355 reported by Andrey Sledzinskiy on 2014-03-25
This bug affects 3 people
Affects: Fuel for OpenStack | Importance: High | Assigned to: Bartosz Kupidura
Affects: 4.1.x | Importance: High | Assigned to: Fuel Library (Deprecated)

Bug Description

Bug is reproduced on {"build_id": "2014-03-25_11-38-46", "mirantis": "yes", "build_number": "43", "nailgun_sha": "3044c2054904525601c921387322a2978e821677", "ostf_sha": "013c13ab033a6829ca4eeaa2476c30837e814902", "fuelmain_sha": "f7ee8bcaa3d993395669f2bcae893176ff2b3bbe", "astute_sha": "d7c6c4d00ffd6e2fa74da442f573e6f39049961e", "release": "5.0", "fuellib_sha": "3445ab7550486074ec8e47fdaed869c697991364"}

Steps:
1. Create a new cluster: Ubuntu, HA, KVM, Neutron VLAN, Ceph for volumes, Ceph for images, Rados
2. Add 3 controller nodes with Ceph and 2 compute nodes with Ceph
3. Deploy the cluster - the cluster deploys successfully
4. Shut down the primary controller first, then the other two controllers; then start the primary controller first, followed by the other two controllers
5. Run OSTF tests and check the Neutron server logs

Expected - all tests pass and there are no errors
Actual - the volume- and instance-creation tests failed, with "Lost connection to MySQL server" in the logs

Neutron server log on primary controller:
neutron [-] (OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 11") None None
neutron.openstack.common.rpc.common [-] ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/amqp.py", line 438, in _process_data\n **args)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 45, in dispatch\n neutron_ctxt, version, method, namespace, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n result = getattr(proxyobj, method)(ctxt, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/l3_rpc_base.py", line 56, in sync_routers\n context, host, router_ids)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/l3_agentschedulers_db.py", line 131, in list_active_sync_routers_on_active_l3_agent\n context, constants.AGENT_TYPE_L3, host)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/agents_db.py", line 133, in _get_agent_by_type_and_host\n host=host)\n', 'AgentNotFoundByTypeHost: Agent with agent_type=L3 agent and host=node-10 could not be found\n']

Diagnostic snapshot is attached

Tatyanka (tatyana-leontovich) wrote :

Caused by broken Galera - see the MySQL logs for details.

Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Changed in fuel:
milestone: none → 5.0

Got the same error after resetting and restarting all controller and compute nodes (ISO #59):
- Environment - Ubuntu, HA, KVM, Neutron GRE, Cinder LVM, Sahara, Murano, Ceilometer
- Add 3 controller nodes and 1 compute node
- Deploy the cluster
- Shut down and start the controller and compute nodes
- Run OSTF tests

Diagnostic snapshot is attached

Error in Neutron server log:
neutron.openstack.common.rpc.common [-] ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/amqp.py", line 438, in _process_data\n **args)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 45, in dispatch\n neutron_ctxt, version, method, namespace, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n result = getattr(proxyobj, method)(ctxt, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/l3_rpc_base.py", line 56, in sync_routers\n context, host, router_ids)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/l3_agentschedulers_db.py", line 131, in list_active_sync_routers_on_active_l3_agent\n context, constants.AGENT_TYPE_L3, host)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/agents_db.py", line 133, in _get_agent_by_type_and_host\n host=host)\n', 'AgentNotFoundByTypeHost: Agent with agent_type=L3 agent and host=node-10 could not be found\n']

2014-04-01 12:01:12 ERROR

neutron.openstack.common.rpc.common [-] Returning exception Agent with agent_type=L3 agent and host=node-10 could not be found to caller

2014-04-01 12:01:12 ERROR

neutron.openstack.common.rpc.amqp [-] Exception during message handling

2014-04-01 12:01:12 ERROR

neutron.openstack.common.rpc.common [-] ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/amqp.py", line 438, in _process_data\n **args)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 45, in dispatch\n neutron_ctxt, version, method, namespace, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n result = getattr(proxyobj, method)(ctxt, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/l3_rpc_base.py", line 56, in sync_routers\n context, host, router_ids)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/l3_agentschedulers_db.py", line 131, in list_active_sync_routers_on_active_l3_agent\n context, constants.AGENT_TYPE_L3, host)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/agents_db.py", line 133, in _get_agent_by_type_and_host\n host=host)\n', 'AgentNotFoundByTypeHost: Agent with agent_type=L3 agent and host=node-12 could not be found\n']

2014-04-01 12:01:12 ERROR

neutron.openstack.common.rpc.common [-] Returning exception Agent with agent_type=L3 agent and host=node-12 could not be found to caller

2014-04-01 12:01:12 ERROR

neutron.openstack.common.rpc.amqp [-] Exception during message handling

2014-04-01 12:01:02 ERROR

neutron.common.legacy [-] Skipping unknown group key: firewall_driver

2014-04-01 12:01:02 CRITICAL

neutro...


Bogdan Dobrelya (bogdando) wrote :

When you shut down all controllers, you break the Galera cluster, so the OpenStack environment becomes non-operational. That is a known HA architecture limitation.
There are also two known bugs: 1) a RabbitMQ < v3 glitch on cluster-member disconnect, and 2) a Fuel-specific Galera HA bug after VIP migration.
Please elaborate which controller(s) you shut down to reproduce the subject issue: the primary one? All controllers? The one hosting the (management) VIP address?
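
The limitation described above is quorum arithmetic: a Galera/Pacemaker cluster only stays operational while a majority of members is alive, so a full shutdown always leaves it without a primary component. A minimal illustrative sketch (not Fuel code):

```python
# Illustrative quorum check, not actual Fuel/Pacemaker code: a cluster keeps
# its primary component only while more than half of its members are alive.

def has_quorum(alive, total):
    """Return True if `alive` members out of `total` form a majority."""
    return alive > total / 2

# 3-controller HA cluster:
print(has_quorum(2, 3))  # True: losing one controller is survivable
print(has_quorum(0, 3))  # False: a full shutdown leaves no primary component
```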

For now I set this bug as incomplete.

Changed in fuel:
status: New → Incomplete

To reproduce the subject issue you need to shut down and start all controllers.

Bogdan Dobrelya (bogdando) wrote :

Got it. I am putting the issue in the Invalid state then. Please update it if the given full-shutdown test case should be supported.

Changed in fuel:
status: Incomplete → Invalid
Andrew Woodward (xarses) wrote :

Cold start of the cluster must be supported.

Andrey, please update the reproduction steps; this thread has become hard to follow on how to reproduce the subject case. Please edit the bug description.

The bug was moved to the New state due to Andrew's comment that this case should be supported. The bug description was updated.

description: updated
Changed in fuel:
status: Invalid → New
Bogdan Dobrelya (bogdando) wrote :

Please elaborate the exact actions for "Shutdown and start all controllers":
which order to shut down, and which order to start.

Changed in fuel:
status: New → Confirmed
description: updated
Bogdan Dobrelya (bogdando) wrote :

We need to fix the Galera OCF script to support arbitrary startup vs. shutdown orders, and to ship RabbitMQ 3.x along with its OCF script, to resolve this issue. Hence, confirmed and triaged.
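
A key input for an order-independent bootstrap is the sequence number that Galera persists in each node's grastate.dat (a seqno of -1 indicates an unclean shutdown). The sketch below, with an illustrative parser and sample file contents that are not the actual OCF script code, shows how that state could be read:

```python
# Hypothetical illustration: before bootstrapping, an OCF script must learn
# which node has the most recent Galera state. Galera records this in
# /var/lib/mysql/grastate.dat; seqno == -1 means an unclean shutdown.

def parse_grastate(text):
    """Extract the cluster uuid and seqno from grastate.dat contents."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("uuid:"):
            state["uuid"] = line.split(":", 1)[1].strip()
        elif line.startswith("seqno:"):
            state["seqno"] = int(line.split(":", 1)[1].strip())
    return state

# Sample file contents (values are made up for illustration):
sample = """\
# GALERA saved state
version: 2.1
uuid:    8bcf4a34-aa29-11e3-9c33-77c05b6a9b38
seqno:   1234
cert_index:
"""
print(parse_grastate(sample)["seqno"])  # -> 1234
```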

Changed in fuel:
status: Confirmed → Triaged
Andrew Woodward (xarses) on 2014-04-04
tags: added: ha library
Andrew Woodward (xarses) on 2014-04-04
tags: added: backports-4.1.1
summary: - Lost connection to MySQL server at 'reading initial communication
- packet' on primary controller
+ after full cluster restart MySQL is not available
Vladimir Kuklin (vkuklin) wrote :

This bug is going to be addressed by an improved Galera OCF script in version 5.1 of Fuel.

Changed in fuel:
milestone: 5.0 → 5.1
Mike Scherbakov (mihgen) on 2014-05-08
tags: added: release-notes
Dmitry Borodaenko (angdraug) wrote :

Vladimir, is there a blueprint for the improved Galera OCF script? Or is this going to be worked on right in this bug?

Meg McRoberts (dreidellhasa) wrote :

Added to the Known Issues list in the 5.0 Release Notes. Is there a workaround we can document?

Bogdan Dobrelya (bogdando) wrote :

AFAIK, the only workaround is following the manual scenario (https://mirantis.jira.com/wiki/display/PRD/Working+with+Galera+cluster). Perhaps, Miroslav Anashkin could clarify it better...

Bartosz Kupidura (zynzel) wrote :

Guys,
Please read and comment:
https://lists.launchpad.net/fuel-dev/msg01100.html
https://blueprints.launchpad.net/fuel/+spec/reliable-galera-ocf-script
https://review.openstack.org/#/c/95764/

There is a ready RA for Galera which supports whole-cluster reboot.

We need to test this heavily, especially because I removed all cs_commit/cs_shadow calls from Puppet.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)
Changed in fuel:
status: Triaged → In Progress
Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Bartosz Kupidura (zynzel)

Change abandoned by Bartosz Kupidura (<email address hidden>) on branch: master
Review: https://review.openstack.org/102140

Changed in fuel:
assignee: Bartosz Kupidura (zynzel) → Bogdan Dobrelya (bogdando)
Bogdan Dobrelya (bogdando) wrote :

Returned the assignee back (it had been changed because a rebase was submitted)

Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Bartosz Kupidura (zynzel)
Changed in fuel:
assignee: Bartosz Kupidura (zynzel) → Bogdan Dobrelya (bogdando)
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Bartosz Kupidura (zynzel)

Reviewed: https://review.openstack.org/95764
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=46748d05d656d0ec617d0876e2620b5100c224b9
Submitter: Jenkins
Branch: master

commit 46748d05d656d0ec617d0876e2620b5100c224b9
Author: Bartosz Kupidura <email address hidden>
Date: Tue Jun 24 09:40:14 2014 +0200

    New RA for galera/pacemaker.

    Implements: blueprint galera-improvements
    Closes-Bug: 1297355

    Change-Id: I593d113b430d7607f92ff68ea269e071898a5068

Changed in fuel:
status: In Progress → Fix Committed
tags: added: to-be-covered-by-tests
tags: removed: backports-4.1.1
Meg McRoberts (dreidellhasa) wrote :

Marked as "Known Issue" in 5.0.1 Release Notes.

Egor Kotko (ykotko) wrote :

{u'build_id': u'2014-08-18_02-01-17', u'ostf_sha': u'd2a894d228c1f3c22595a77f04b1e00d09d8e463', u'build_number': u'448', u'auth_required': True, u'nailgun_sha': u'bc9e377dbe010732bc2ba47161ed9d433998e07b', u'production': u'docker', u'api': u'1.0', u'fuelmain_sha': u'08f04775dcfadd8f5b438a31c63e81f29276b7d3', u'astute_sha': u'8e1db3926b2320b30b23d7a772122521b0d96166', u'feature_groups': [u'mirantis'], u'release': u'5.1', u'fuellib_sha': u'2c9ad4aec9f3b6fc060cb5a394733607f07063c1'}

Changed in fuel:
status: Fix Committed → Fix Released
Meg McRoberts (dreidellhasa) wrote :

Here is the text for the Resolved Issues section -- please verify that it is accurate and complete:

MySQL is available after full restart of environment
----------------------------------------------------

Older versions of Galera
(which manages MySQL in an OpenStack environment)
sometimes failed if the controllers in an HA environment
came back online in a different order than Galera expected.
Release 5.1 includes a new RA (resource agent)
for Galera and Pacemaker
that supports bootstrapping the cluster
after a reboot of the whole cluster or of any node in it.
It uses the Galera GTID (Global Transaction ID)
to determine which node has the latest database version
and uses that node as the Galera PC (Primary Component).
The administrator can manually choose a different node
to serve as the PC.
This fixes the issue.
See `LP1297355 <https://bugs.launchpad.net/fuel/+bug/1297355>`_.
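
The GTID-based selection described above can be sketched as follows; the function and node data are hypothetical illustrations, not the actual resource agent code:

```python
# Illustrative sketch (not the real RA): given each node's recovered Galera
# sequence number, bootstrap from the node with the latest committed state.
# The override mirrors the "manually choose a different node" option.

def pick_primary(seqnos, override=None):
    """seqnos: dict of node name -> last committed seqno."""
    if override is not None:
        return override
    # max() with a key picks the node holding the newest database state
    return max(seqnos, key=seqnos.get)

nodes = {"node-1": 3041, "node-2": 3060, "node-3": 3057}
print(pick_primary(nodes))            # -> node-2 (highest seqno)
print(pick_primary(nodes, "node-1"))  # -> node-1 (operator override)
```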

Denis Meltsaykin (dmeltsaykin) wrote :

4.x is out of support and there is no fix for this bug in that series. Closing this as Won't Fix.
