Some nova services marked as down after destroying VIPs due to MySQL failures

Bug #1570936 reported by Tatyanka
This bug affects 1 person
Affects                       Status     Importance  Assigned to        Milestone
Fuel for OpenStack            Confirmed  High        Oleksiy Molchanov
Fuel for OpenStack (Mitaka)   Confirmed  High        Oleksiy Molchanov

Bug Description

Steps to reproduce:

1. Deploy or revert any HA environment with configuration:
3 controllers + cinder + 2 computes, Neutron VLAN
2. Delete all secondary VIPs
3. Wait until they are restored
4. Verify they are restored
5. Run OSTF -------------------------------> Fails here

Actual result:
For one node, all nova services are marked as down:
fuel_health.tests.sanity.test_sanity_infrastructure: DEBUG: +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary | Host | Zone | Status | State | Updated_at | Disabled Reason |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| 1 | nova-consoleauth | node-1.test.domain.local | internal | enabled | up | 2016-04-15T03:29:52.000000 | - |
| 4 | nova-cert | node-1.test.domain.local | internal | enabled | up | 2016-04-15T03:29:51.000000 | - |
| 7 | nova-scheduler | node-1.test.domain.local | internal | enabled | up | 2016-04-15T03:29:52.000000 | - |
| 10 | nova-conductor | node-1.test.domain.local | internal | enabled | up | 2016-04-15T03:29:51.000000 | - |
| 19 | nova-cert | node-3.test.domain.local | internal | enabled | down | 2016-04-15T05:32:23.000000 | - |
| 22 | nova-consoleauth | node-3.test.domain.local | internal | enabled | down | 2016-04-15T05:32:22.000000 | - |
| 25 | nova-scheduler | node-3.test.domain.local | internal | enabled | down | 2016-04-15T05:32:23.000000 | - |
| 28 | nova-conductor | node-3.test.domain.local | internal | enabled | down | 2016-04-15T05:32:22.000000 | - |
| 40 | nova-consoleauth | node-2.test.domain.local | internal | enabled | up | 2016-04-15T03:29:01.000000 | - |
| 43 | nova-cert | node-2.test.domain.local | internal | enabled | up | 2016-04-15T03:29:00.000000 | - |
| 46 | nova-scheduler | node-2.test.domain.local | internal | enabled | up | 2016-04-15T03:29:01.000000 | - |
| 49 | nova-conductor | node-2.test.domain.local | internal | enabled | up | 2016-04-15T03:29:01.000000 | - |
| 61 | nova-compute | node-5.test.domain.local | nova | enabled | up | 2016-04-15T03:29:05.000000 | - |
| 64 | nova-compute | node-4.test.domain.local | nova | enabled | up | 2016-04-15T03:29:45.000000 | - |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+

At the same time there is a related oslo.messaging error:

2016-04-15T03:31:37.409637+00:00 err: 2016-04-15 03:31:37.407 32467 ERROR oslo_messaging._drivers.common [req-9e9f6105-6610-4eb2-a259-db7b86cf006e b9f00b65bc1740bea862bac8fd8016da a471cc729b214463bbccbf3317987344 - - -] ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 138, in _dispatch_and_reply\n incoming.message))\n', ' File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 185, in _dispatch\n return self._do_dispatch(endpoint, method, ctxt, args)\n', ' File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 127, in _do_dispatch\n result = func(ctxt, **new_args)\n', ' File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 150, in inner\n return func(*args, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/nova/scheduler/manager.py", line 104, in select_destinations\n dests = self.driver.select_destinations(ctxt, spec_obj)\n', ' File "/usr/lib/python2.7/dist-packages/nova/scheduler/filter_scheduler.py", line 53, in select_destinations\n selected_hosts = self._schedule(context, spec_obj)\n', ' File "/usr/lib/python2.7/dist-packages/nova/scheduler/filter_scheduler.py", line 104, in _schedule\n hosts = self._get_all_host_states(elevated)\n', ' File "/usr/lib/python2.7/dist-packages/nova/scheduler/filter_scheduler.py", line 145, in _get_all_host_states\n return self.host_manager.get_all_host_states(context)\n', ' File "/usr/lib/python2.7/dist-packages/nova/scheduler/host_manager.py", line 554, in get_all_host_states\n context, \'nova-compute\', include_disabled=True)}\n', ' File "/usr/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 181, in wrapper\n result = fn(cls, context, *args, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/nova/objects/service.py", line 399, in get_by_binary\n context, binary, include_disabled=include_disabled)\n', ' File "/usr/lib/python2.7/dist-packages/nova/db/ap

which is caused by:
2016-04-15T03:31:37.211234+00:00 err: 2016-04-15 03:31:37.044 32467 ERROR oslo_messaging.rpc.dispatcher [req-9e9f6105-6610-4eb2-a259-db7b86cf006e b9f00b65bc1740bea862bac8fd8016da a471cc729b214463bbccbf3317987344 - - -] Exception during message handling: (_mysql_exceptions.OperationalError) (2006, 'MySQL server has gone away') [SQL: u'SELECT 1']
2016-04-15 03:31:37.044 32467 ERROR oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2016-04-15 03:31:37.044 32467 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 138, in _dispatch_and_reply
2016-04-15 03:31:37.044 32467 ERROR oslo_messaging.rpc.dispatcher incoming.message))
2016-04-15 03:31:37.044 32467 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 185, in _dispatch
2016-04-15 03:31:37.044 32467 ERROR oslo_messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2016-04-15 03:31:37.044 32467 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 127, in _do_dispatch
2016-04-15 03:31:37.044 32467 ERROR oslo_messaging.rpc.dispatcher result = func(ctxt, **new_args)
2016-04-15 03:31:37.044 32467 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 150, in inner
2016-04-15 03:31:37.044 32467 ERROR oslo_messaging.rpc.dispatcher return func(*args, **kwargs)
2016-04-15 03:31:37.044 32467 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/scheduler/manager.py", line 104, in select_destinations
2016-04-15 03:31:37.044 32467 ERROR oslo_messaging.rpc.dispatcher dests = self.driver.select_destinations(ctxt, spec_obj)
2016-04-15 03:31:37.044 32467 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/scheduler/filter_scheduler.py", line 53, in select_destinations
2016-04-15 03:31:37.044 32467 ER
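
(For reference: the usual way to make SQLAlchemy detect and replace such stale connections is pool pre-ping. The following is only an illustrative sketch with a made-up DSN, assuming SQLAlchemy >= 1.2; it is not what Mitaka-era nova/oslo.db is doing here.)

from sqlalchemy import create_engine, text

# Sketch only: pool_pre_ping issues a cheap "SELECT 1" when a connection is
# checked out of the pool and transparently replaces connections that raise
# "MySQL server has gone away" instead of propagating the error.
engine = create_engine(
    "mysql+pymysql://nova:secret@192.168.0.2/nova",  # hypothetical DSN / VIP
    pool_pre_ping=True,
    pool_recycle=3600,  # additionally drop connections older than an hour
)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))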

Additionally, the mysqld log contains a crash:

2016-04-15 00:40:52 22720 [Note] InnoDB: 5.6.23 started; log sequence number 1625977
00:40:52 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

key_buffer_size=16777216
read_buffer_size=131072
max_used_connections=0
max_threads=151
thread_count=1
connection_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 76505 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7fb9a5897770
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fb9a2dc3e88 thread_stack 0x30000
/usr/sbin/mysqld(my_print_stacktrace+0x2c)[0x7fb9a35fdb7c]
/usr/sbin/mysqld(handle_fatal_signal+0x3c2)[0x7fb9a334e5c2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7fb9a2043340]
/usr/sbin/mysqld(thd_get_ha_data+0xc)[0x7fb9a339b54c]
/usr/sbin/mysqld(_Z20thd_binlog_trx_resetP3THD+0x2e)[0x7fb9a35a879e]
/usr/sbin/mysqld(_Z17wsrep_post_commitP3THDb+0xcc)[0x7fb9a348832c]
/usr/sbin/mysqld(_Z12trans_commitP3THD+0x6f)[0x7fb9a346edcf]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x38d1)[0x7fb9a33dc851]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x3c8)[0x7fb9a33e09d8]
/usr/sbin/mysqld(+0x508c24)[0x7fb9a33e0c24]
/usr/sbin/mysqld(_Z19do_handle_bootstrapP3THD+0x111)[0x7fb9a33e0ff1]
/usr/sbin/mysqld(+0x509060)[0x7fb9a33e1060]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fb9a203b182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fb9a175e47d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7fb97c0009f0): is an invalid pointer
Connection ID (thread ID): 1
Status: NOT_KILLED

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

Expected:
All ostf tests are passed

Version:
nailgun.test.domain.local:
  cat /etc/fuel_build_id          -> 201
  cat /etc/fuel_build_number      -> 201
  cat /etc/fuel_release           -> 9.0
  cat /etc/fuel_openstack_version -> mitaka-9.0

Tags: area-library
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Revision history for this message
Alex Schultz (alex-schultz) wrote :

The signal 11 is not related; please note the time differences.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

Also, according to haproxy, the MySQL backends never went down.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Oh, yep, I missed the time difference, but still, according to the service exception, some services think that MySQL went away.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

Yeah, that could just be a connectivity error from that one node. Just wanted to make sure no one spends a lot of time looking at MySQL.

Dmitry Pyzhov (dpyzhov)
tags: added: area-library
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Guys, it cannot be a duplicate of the bug with re-deployment or plugins, because here we have a clean install + deleted VIPs. So the steps and environments are quite different.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Matthew Mosesohn (raytrac3r)
Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Oleksiy Molchanov (omolchanov)
Dmitry Klenov (dklenov)
Changed in fuel:
status: New → Confirmed
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I checked Oleksiy's environment and see the following:

- right after reverting a snapshot, the nova-cert service is marked as down

- nova-cert process is up and running

- there are no open connections to mysqld - http://paste.openstack.org/show/494792/ - NOTE: there is one broken socket (can't identify protocol)

- service logs contain no errors: http://paste.openstack.org/show/494794/

- when I attached to the process with strace, it figured out that the connection to mysqld (haproxy in our case) was broken, then closed and re-opened it properly - http://paste.openstack.org/show/494795/

- all services are up now - http://paste.openstack.org/show/494796/

NOTE: no service restarts were performed

My current understanding is that destroying a VIP produces a dead TCP peer - nova-cert simply does not know it should close the existing connection and open a new one. Maybe we should take a look at TCP keepalives? (http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html)
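
(Illustrative sketch only, not nova/oslo.db code, with a hypothetical VIP address: this is what enabling TCP keepalives on a client socket looks like on Linux. With these options set, the kernel would eventually reset the stale connection to the destroyed VIP, so the next database call would fail fast and reconnect instead of the service sitting in epoll_wait() on a dead socket.)

import socket

# Sketch: turn on keepalive probing so a silently dead peer is detected by
# the kernel even if the application never writes to the socket.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Linux-specific tuning: start probing after 30s idle, probe every 10s,
# and drop the connection after 5 failed probes (~80s total).
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
sock.connect(("192.168.0.2", 3306))  # hypothetical MySQL VIP address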

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

^ attaching with gdb shows that the process is stuck in epoll_wait() - http://paste.openstack.org/show/494807/

As with strace, attaching with gdb triggers a transition to the normal state - the disconnect is detected properly and the process is fully operational again.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

/proc/$PID/status shows that the kernel won't schedule the process for execution: http://paste.openstack.org/show/494821/

This is weird, as eventlet is supposed to call epoll_wait() with a 60s timeout, which should wake the process up eventually.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

a core dump of nova-cert

Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Seems that the most likely cause is the 'snapshotting-reverting' stuff, as I tried to reproduce it and didn't manage to.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

So, I checked the core dump of nova-cert process and it looks interesting:

(gdb) info locals
_save = 0x1ef70a0
dtimeout = 7500.1718587875366
timeout = 7500171
maxevents = 1023
nfds = <optimized out>
i = <optimized out>
elist = 0x0
etuple = 0x0
evs = 0x41ef400
kwlist = {0x5d99e1 "timeout", 0x5d9b76 "maxevents", 0x0}
kwlist = {0x5d99e1 "timeout", 0x5d9b76 "maxevents", 0x0}
(gdb) frame
#1 0x0000000000435ef8 in pyepoll_poll (self=0x7f4167656eb8, args=<optimized out>, kwds=<optimized out>) at ../Modules/selectmodule.c:1034
1034 in ../Modules/selectmodule.c

the timeout value is actually ~7500 seconds, and epoll_wait() legitimately blocks as there are no events on the sockets being watched. Attaching with gdb or strace sends a SIGSTOP to the process and causes epoll_wait() to be interrupted - then nova-cert tries to send something via the MySQL socket and finally figures out it's broken.

AFAIU, the "snapshotting-reverting" done by dos.py somehow causes the issue with the timeout - maybe due to the fact that we try to sync time on revert?

Looks like epoll_wait() wrapper in Python 2.x does not use monotonic clock - https://github.com/python/cpython/blob/2.7/Modules/selectmodule.c#L986-L1058 , while Python 3.x does - https://github.com/python/cpython/blob/master/Modules/selectmodule.c#L1485-L1592 - this must be the reason the timeout value is unexpectedly big.
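
(To illustrate the effect - a sketch only, not the actual eventlet/CPython code path: a timeout derived from wall-clock deadlines blows up when the clock is stepped, as happens on revert + time sync, while a monotonic clock does not. time.monotonic() is Python 3; Python 2 would need a backport.)

import time

WAIT = 60.0           # the poll timeout we expect eventlet to use
CLOCK_STEP = -7440.0  # pretend time sync stepped the wall clock back ~2 hours

deadline_wall = time.time() + WAIT
deadline_mono = time.monotonic() + WAIT

# Simulate the step by shifting "now" on the wall clock only.
now_wall = time.time() + CLOCK_STEP

print("timeout from wall clock:      %.0f s" % (deadline_wall - now_wall))          # ~7500 s
print("timeout from monotonic clock: %.0f s" % (deadline_mono - time.monotonic()))  # ~60 s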

The bottom line is, this should only be reproducible on the virtual environments in our tests.
