Lost connection to MySQL server at 'reading initial communication packet'

Bug #1352931 reported by Sergey Murashov
This bug affects 4 people
Affects              Status    Importance  Assigned to      Milestone
Mirantis OpenStack   Invalid   High        Sergey Murashov
6.0.x                Invalid   High        MOS Oslo
6.1.x                Invalid   High        Sergey Murashov

Bug Description

ISO: {"build_id": "2014-08-04_22-06-52", "ostf_sha": "a3fa823ea0e4e03beb637ae07a91adea82c33182", "build_number": "564", "auth_required": true, "api": "1.0", "nailgun_sha": "5d1e8e38ea3f2cd88ceca8e0927d53e9770598bc", "production": "docker", "fuelmain_sha": "6b05280621c23ab158e9943f88b6f71a22fd4c69", "astute_sha": "99a790ad1b7526cbbd5bf8add0cb2b4e503fccd4", "feature_groups": ["mirantis"], "release": "5.1", "fuellib_sha": "25eb632c70fe0bb3b2c1053314b558a160e4af50"}

Steps To Reproduce:
1. Install OS (Ubuntu, Nova Network, Ceilometer, 3 Controllers + Mongo, 2 Computes)
2. Launch the "Ceilometer create, update, check, delete alarm" OSTF test
3. Launch any OSTF test (the sanity tests, for example)

Actual result:
The test fails at step 6, "Wait for 'alarm' alarm state", and every OSTF test launched after this fails with the reason "Keystone client is not available. Please, refer to OpenStack logs to fix this problem".
The Keystone logs contain the following traceback:
node-1.test.domain.local/keystone-all.log:2014-08-05 14:48:25.500 16283 TRACE keystone.common.wsgi (repeated log prefix stripped from the lines below):
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/keystone/common/wsgi.py", line 212, in __call__
    result = method(context, **params)
  File "/usr/lib/python2.7/dist-packages/keystone/token/controllers.py", line 98, in authenticate
    context, auth)
  File "/usr/lib/python2.7/dist-packages/keystone/token/controllers.py", line 267, in _authenticate_local
    username, CONF.identity.default_domain_id)
  File "/usr/lib/python2.7/dist-packages/keystone/identity/core.py", line 189, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/keystone/identity/core.py", line 314, in get_user_by_name
    driver = self._select_identity_driver(domain_id)
  File "/usr/lib/python2.7/dist-packages/keystone/identity/core.py", line 256, in _select_identity_driver
    self.assignment_api.get_domain(domain_id)
  File "/usr/lib/python2.7/dist-packages/dogpile/cache/region.py", line 936, in decorate
    should_cache_fn)
  File "/usr/lib/python2.7/dist-packages/dogpile/cache/region.py", line 588, in get_or_create
    async_creator) as value:
  File "/usr/lib/python2.7/dist-packages/dogpile/core/dogpile.py", line 158, in __enter__
    return self._enter()
  File "/usr/lib/python2.7/dist-packages/dogpile/core/dogpile.py", line 98, in _enter
    generated = self._enter_create(createdtime)
  File "/usr/lib/python2.7/dist-packages/dogpile/core/dogpile.py", line 149, in _enter_create
    created = self.creator()
  File "/usr/lib/python2.7/dist-packages/dogpile/cache/region.py", line 565, in gen_value
    created_value = creator()
  File "/usr/lib/python2.7/dist-packages/dogpile/cache/region.py", line 932, in creator
    return fn(*arg, **kw)
  File "/usr/lib/python2.7/dist-packages/keystone/assignment/core.py", line 297, in get_domain
    return self.driver.get_domain(domain_id)
  File "/usr/lib/python2.7/dist-packages/keystone/assignment/backends/sql.py", line 464, in get_domain
    return self._get_domain(session, domain_id).to_dict()
  File "/usr/lib/python2.7/dist-packages/keystone/assignment/backends/sql.py", line 457, in _get_domain
    ref = session.query(Domain).get(domain_id)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 816, in get
    return loading.load_on_ident(self, key)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/loading.py", line 226, in load_on_ident
    return q.one()
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2306, in one
    ret = list(self)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2349, in __iter__
    return self._execute_and_instances(context)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2362, in _execute_and_instances
    close_with_result=True)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2353, in _connection_from_session
    **kw)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 799, in connection
    close_with_result=close_with_result)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 803, in _connection_for_bind
    return self.transaction._connection_for_bind(engine)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 299, in _connection_for_bind
    conn = bind.contextual_connect()
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1661, in contextual_connect
    self.pool.connect(),
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 272, in connect
    return _ConnectionFairy(self).checkout()
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 425, in __init__
    rec = self._connection_record = pool._do_get()
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 778, in _do_get
    con = self._create_connection()
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 225, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 318, in __init__
    self.connection = self.__connect()
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 368, in __connect
    connection = self.__pool._creator()
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/strategies.py", line 80, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 283, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 193, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
OperationalError: (OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") None None

Revision history for this message
Sergey Murashov (smurashov) wrote :
Changed in fuel:
milestone: none → 5.1
importance: Undecided → High
status: New → Confirmed
description: updated
Dima Shulyak (dshulyak)
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Bogdan Dobrelya (bogdando) wrote :

Deployment finished with a healthy Galera cluster, according to the logs:
2014-08-05T14:05:17.848575 node-5 ./remote/node-5.test.domain.local/puppet-apply.log:2014-08-05T14:05:17.848575+01:00 notice: Finished catalog run in 0.15 seconds
2014-08-05T14:27:53.636527 node-1 ./remote/node-1.test.domain.local/nova-scheduler.log:2014-08-05T14:27:53.636527+01:00 err: 2014-08-05 13:27:53.468 15641 ERROR nova.servicegroup.drivers.db [-] model server went away

The Galera cluster issues started 22 minutes later.

Vladimir Kuklin (vkuklin) wrote :

The 2013 error should be handled by the oslo.db code. Reassigning to mos-oslo.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Oslo (mos-oslo)
Changed in mos:
status: New → Confirmed
milestone: none → 5.1
no longer affects: fuel
Changed in mos:
importance: Undecided → High
assignee: nobody → MOS Oslo (mos-oslo)
Bogdan Dobrelya (bogdando) wrote :
tags: added: oslo
Roman Podoliaka (rpodolyaka) wrote :

This seems to be caused by Galera going away for a (short?) period of time. It's a good question why this happens and how long the outage lasts (it would be nice to find the root cause of this and make Galera available again as soon as possible).

On the OpenStack services side there are two ways to treat such failures:

1. Let API requests issuing DB queries fail while Galera is unavailable (those failures aren't critical: once the Galera cluster has recovered, the OpenStack component will start working properly again).

2. Don't let API requests fail, by retrying DB *transactions* until they succeed (it's crucial here that we retry transactions rather than individual queries, as that is the only way to ensure the DB data remains in a consistent state).

OpenStack services currently use the first approach, and it's probably *OK* to do that as long as Galera doesn't go away for a long period of time.

If we really want to go with the second approach here and try to make API requests always succeed (though they might end up hanging for a while until a Galera cluster recovers), we'll need to reuse the functionality implemented in oslo.db in OpenStack projects. It should be rather simple to do in Nova and probably a bit more complex in Cinder and Glance (at the same time, we'll need to do that in MOS trees as upstream will probably refuse to merge such patches). The DB API layer of Keystone, Neutron and Ceilometer is quite different from the Nova implementation, so I'm not sure we can reuse the oslo.db approach there.
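The second approach can be sketched roughly as a retry decorator wrapped around a whole transaction. This is a hypothetical simplification of what the oslo.db retry functionality provides, not its actual API; the names `retry_db_transaction` and the local `OperationalError` stand-in are illustrative only:

```python
import functools
import time


class OperationalError(Exception):
    """Stand-in for the DB-API disconnect error (e.g. MySQL error 2013)."""


def retry_db_transaction(max_retries=10, initial_delay=0.5, max_delay=30.0):
    """Retry a whole DB transaction (not individual queries) when the
    database connection is lost, with exponential backoff.

    Hypothetical sketch of the transaction-retry approach; retrying at
    the transaction boundary is what keeps the data consistent."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except OperationalError:
                    # Give up only after the last allowed attempt.
                    if attempt == max_retries - 1:
                        raise
                    time.sleep(delay)
                    delay = min(delay * 2, max_delay)
        return wrapper
    return decorator
```

The key design point from the comment above is visible here: the decorator must wrap the function that performs the *entire* transaction, so that a retry replays every statement from the beginning rather than resuming mid-transaction.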

Roman Podoliaka (rpodolyaka) wrote :

Overall, I'm not sure this should have High importance, as OpenStack services continue to work properly once Galera has recovered. And we definitely can't target this for 5.1, as we are very close to the hard code freeze stage.

It should probably be left up to the MOS component teams to fix their own projects, as everything is ready on the oslo.db side, and the MOS Oslo team might not be familiar with all of the consuming projects.

Changed in mos:
milestone: 5.1 → 6.0
assignee: MOS Oslo (mos-oslo) → nobody
tags: added: db
Changed in fuel:
status: New → Confirmed
Changed in mos:
importance: High → Medium
Changed in fuel:
importance: Undecided → Medium
milestone: none → 6.0
Roman Podoliaka (rpodolyaka) wrote :

Seeing this again on another env.

It seems that galera/mysql worked normally for a while, and then mysqld received SIGTERM and shut down properly. After it was restarted, however, Galera is no longer working: the process is there, but it listens neither on ports 3306/3307 nor on the mysql.sock Unix domain socket.

Adding logs.

Roman Podoliaka (rpodolyaka) wrote :

[root@node-6 ~]# mysql -u nova -h 192.168.0.1 -P 3306 -p
Enter password:
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 0
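Error 2013 at 'reading initial communication packet' means the TCP connection was established, but the server closed it before sending its initial handshake packet. A raw socket probe can distinguish "nothing listening" from "listening but never handshaking" without the mysql client; this is an illustrative sketch, and `probe_mysql_handshake` is a hypothetical helper name:

```python
import socket


def probe_mysql_handshake(host, port=3306, timeout=5.0):
    """Connect to a MySQL server and try to read the start of its
    initial handshake packet.

    Returns True if the server sent any handshake bytes, False if the
    port is closed or the server hung up before the handshake (the
    situation that produces client error 2013)."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return False  # connection refused / unreachable: nothing listening
    try:
        # The first 4 bytes of any MySQL packet are its length + sequence header.
        data = sock.recv(4)
        return len(data) > 0
    finally:
        sock.close()
```

Run against the affected node, a False result with the mysqld process still alive would match the symptom described above: the process exists but serves no handshake on its ports.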

Dmitry Pyzhov (dpyzhov)
Changed in mos:
assignee: nobody → MOS Linux (mos-linux)
Changed in fuel:
assignee: nobody → MOS Linux (mos-linux)
Changed in fuel:
assignee: MOS Linux (mos-linux) → MOS All (mos-all)
Changed in mos:
assignee: MOS Linux (mos-linux) → MOS All (mos-all)
Bogdan Dobrelya (bogdando) wrote :

Raised to high due to more duplicates incoming

Changed in mos:
importance: Medium → High
Changed in fuel:
importance: Medium → High
no longer affects: fuel
Changed in mos:
milestone: 6.0 → 6.0.1
Changed in mos:
assignee: MOS All (mos-all) → MOS Oslo (mos-oslo)
Dennis Dmitriev (ddmitriev) wrote :

Reproduced in system test 'Update OS on reverted env' deploy_ha_after_upgrade on the CI job: http://jenkins-product.srt.mirantis.net:8080/view/6.0_swarm/job/6.0_fuelmain.system_test.centos.ceph_multinode_compact.os_patching/16/

        Scenario:
            1. Revert snapshot with simple ceph env (Fuel 5.1.1): 1 controller + ceph, 2 compute + ceph
            2. Run upgrade on master
            3. Check that upgrade was successful
            4. Create new cluster #2 (Fuel 6.0): CentOS/HA, neutron/vlan, 3 controllers, 2 compute, 1 cinder
            5. Run OSTF

Deploy failed on step 4.

/var/log/neutron/rescheduling.log from node-4 : http://paste.openstack.org/show/153900/
======================
2014-12-12 14:20:46.172 9791 TRACE neutron.openstack.common.rpc.amqp OperationalError: (OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") None None

Dmitry Savenkov (dsavenkov) wrote :

It is very likely that this won't be fixed by any of the 6.0.x releases, due to a number of systemic issues that we hope to resolve once we have obtained some stress-testing results.

Kyrylo Romanenko (kromanenko) wrote :

Attempted to reproduce. The bug was not reproduced on a similar configuration:
Ubuntu, Nova Network, Ceilometer, 3 Controllers + Mongo, 2 Computes

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.1"
  openstack_version: "2014.2-6.1"
  api: "1.0"
  build_number: "317"
  build_id: "2015-04-16_21-30-10"
  nailgun_sha: "52d92c86e68602fb5dd2f3b8870a773d20a387ef"
  python-fuelclient_sha: "279ffc358e40dbdc162cfe76898cbd0874529f1f"
  astute_sha: "055b2d82fe8499b27c7047295e2e36a7a2c5d430"
  fuellib_sha: "db5f39e96e7ab9f79691202755e547bf8242661f"
  ostf_sha: "b0991dbad159f53d335efa5d31cb94016ad5312e"
  fuelmain_sha: "0de2d2039e76839d339f977df45b111bef7200d6"

Roman Rufanov (rrufanov)
tags: added: customer-found support