Nova-compute service is down on compute nodes after destroying controller with running l3 agent

Bug #1333143 reported by Andrey Sledzinskiy
Affects: Fuel for OpenStack
Importance: High
Assigned to: Bogdan Dobrelya

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.centos.thread_5/94/testReport/junit/%28root%29/deploy_ha_neutron/deploy_ha_neutron/

Test steps:
1. Create cluster - HA, Neutron with GRE segmentation, CentOS, KVM
2. Add 3 nodes with controller roles
3. Add 2 nodes with compute roles
4. Add 1 node with cinder role
5. Deploy the cluster
6. Destroy controller with running l3 agent
7. Wait for OFFLINE of the controller at fuel UI
8. Run instance connectivity OSTF tests

Expected - ostf tests passed
Actual - nova-compute is down on all compute nodes (node-4, node-5)

Errors in nova-compute.log for compute nodes:

2014-06-23T07:06:59.556640+01:00 err: 2014-06-23 06:06:59.553 26274 ERROR nova.servicegroup.drivers.db [-] model server went away
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/servicegroup/drivers/db.py", line 95, in _report_state
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db service.service_ref, state_catalog)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/conductor/api.py", line 218, in service_update
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db return self._manager.service_update(context, service, values)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/conductor/rpcapi.py", line 330, in service_update
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db service=service_p, values=values)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/rpc/client.py", line 150, in call
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db wait_for_reply=True, timeout=timeout)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/transport.py", line 90, in _send
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db timeout=timeout)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 409, in send
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db return self._send(target, ctxt, message, wait_for_reply, timeout)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 400, in _send
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db result = self._waiter.wait(msg_id, timeout)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 267, in wait
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db reply, ending = self._poll_connection(msg_id, timeout)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 217, in _poll_connection
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db % msg_id)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID f8c8161b08aa47828af194ad2f9432f1
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db

Logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Reproducing, RCA in progress

Changed in fuel:
status: New → In Progress
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Reproduced on {"build_id": "2014-06-16_00-31-15", "mirantis": "yes", "build_number": "255", "ostf_sha": "67b61ed3788297fa5d985afec32498d8c0f812db", "nailgun_sha": "984aa7a86487f1488c2f83c052904abd9f589b7f", "production": "docker", "api": "1.0", "fuelmain_sha": "6f355160366475d52050d7898a1080a95ecb9cbf", "astute_sha": "17b1afa5f0dc8f4fca5ed4eb03ec566fbfb5ed19", "release": "5.1", "fuellib_sha": "99d74172887ab81d38132655d6e5d180e8726437"}
with an additional notice:
All OSTF tests in the HA section passed; all others failed.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Some details from the env where the issue was reproduced:

Node-5 (192.168.0.3) hosted the l3-agent and was shut down while reproducing.
After the shutdown:
- there are 2 controllers left: node-6 (192.168.0.4) and node-7 (192.168.0.5),
- p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started node-6.test.domain.local
- vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-6.test.domain.local
- vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-7.test.domain.local
but:

1) OpenStack services try to contact MySQL via the management VIP (192.168.0.2), but it does not accept TCP connections on port 3307.
[root@node-7 ~]# less /var/log/nova/conductor.log
 ...
 2014-06-23 14:33:53.349 11110 TRACE oslo.messaging.rpc.dispatcher OperationalError: (OperationalError) (2003, "Can't connect to MySQL server on '192.168.0.2' (111)") None None
[root@node-7 ~]# nova-manage service list
 ...
 WARNING nova.openstack.common.db.sqlalchemy.session [req-<id> ] SQL connection failed"
[root@node-7 ~]# telnet 192.168.0.2 3307
 Trying 192.168.0.2...
 telnet: connect to address 192.168.0.2: Connection refused
[root@node-7 ~]# ip netns exec haproxy telnet 192.168.0.2 3307
 Trying 192.168.0.2...
 telnet: connect to address 192.168.0.2: Connection refused
(ICMP is OK, though)
[root@node-7 ~]# ping -c 3 192.168.0.2
 PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
 64 bytes from 192.168.0.2: icmp_seq=1 ttl=63 time=0.682 ms
 64 bytes from 192.168.0.2: icmp_seq=2 ttl=63 time=0.123 ms
 64 bytes from 192.168.0.2: icmp_seq=3 ttl=63 time=0.295 ms
[root@node-7 ~]# ip netns exec haproxy ping -c 3 192.168.0.2
 PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
 64 bytes from 192.168.0.2: icmp_seq=1 ttl=62 time=0.564 ms
 64 bytes from 192.168.0.2: icmp_seq=2 ttl=62 time=0.134 ms
 64 bytes from 192.168.0.2: icmp_seq=3 ttl=62 time=0.200 ms

2) there is a varying number of half-open TCP connections from the public VIP (172.16.0.2) to node-5 (192.168.0.3), initiated from the controller hosting the public VIP (inside the haproxy netns):
[root@node-7 ~]# ip netns exec haproxy ss -untap | grep 'SYN-SENT'
 tcp SYN-SENT 0 1 172.16.0.2:58439 192.168.0.3:35357 users:(("haproxy",11520,43))
 tcp SYN-SENT 0 1 172.16.0.2:44653 192.168.0.3:9696 users:(("haproxy",11520,34))
 tcp SYN-SENT 0 1 172.16.0.2:42812 192.168.0.3:3307 users:(("haproxy",11520,1))
 tcp SYN-SENT 0 1 172.16.0.2:36520 192.168.0.3:80 users:(("haproxy",11520,39))
 tcp SYN-SENT 0 1 172.16.0.2:53966 192.168.0.3:8004 users:(("haproxy",11520,37))
 tcp SYN-SENT 0 1 172.16.0.2:59298 192.168.0.3:8773 users:(("haproxy",11520,44))
 tcp SYN-SENT 0 1 172.16.0.2:49034 192.168.0.3:9191 users:(("haproxy",11520,35))
 tcp SYN-SENT 0 1 172.16.0.2:35413 192.168.0.3:8774 users:(("haproxy",11520,42))
 tcp SYN-SENT 0 1 172.16.0.2:59207 192.168.0.3:5000 users:(("haproxy",11520,38))
 tcp SYN-SENT 0 1 172.16.0.2:36513 192.168.0.3:8775 users:((...
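The half-open sessions above can be summarized per dead backend instead of eyeballed. This is a hypothetical helper (`count_syn_sent` is not part of the env); on a live node it would be fed by `ip netns exec haproxy ss -untap state syn-sent`, but here it reads a captured sample so the pipeline runs standalone:

```shell
# Count SYN-SENT (half-open) haproxy sessions per destination host.
# Field 6 of `ss -untap` output is the peer address:port.
count_syn_sent() {
  awk '/SYN-SENT/ { split($6, dst, ":"); n[dst[1]]++ }
       END { for (d in n) printf "%s %d\n", d, n[d] }'
}

# Captured sample from the report; live use:
#   ip netns exec haproxy ss -untap state syn-sent | count_syn_sent
count_syn_sent <<'EOF'
tcp SYN-SENT 0 1 172.16.0.2:58439 192.168.0.3:35357 users:(("haproxy",11520,43))
tcp SYN-SENT 0 1 172.16.0.2:44653 192.168.0.3:9696 users:(("haproxy",11520,34))
EOF
```

A non-decreasing count for a single host is a quick signal that every haproxy backend on that node is unreachable at once.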


Revision history for this message
Ryan Moe (rmoe) wrote :

I'm able to reproduce this and one thing I notice is that the failure is related to killing the node with the management VIP, not the l3-agent.

Revision history for this message
Ryan Moe (rmoe) wrote :

I tested this same deployment on the 5.0 release and saw the same issue. After ~9 minutes everything recovered on its own and OSTF passed.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

The initial problem is with RabbitMQ:

2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID f8c8161b08aa47828af194ad2f9432f1

Seems to be the same as described here:

https://bugs.launchpad.net/fuel/+bug/1323277
( see https://bugs.launchpad.net/fuel/+bug/1323277/comments/15 )

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

My bad with #4: actually I was checking [root@node-7 ~]# telnet 192.168.0.2 3306,
because we have to verify the haproxy frontend at management_vip:3306; 3307 is for backends only.
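The frontend-vs-backend port check can be done without telnet, using bash's `/dev/tcp` redirection. The `probe` helper below is a hypothetical sketch (it is not part of the env); the VIP and port in the comment follow the thread above, and the demo line probes a port that is almost certainly closed locally:

```shell
# probe HOST PORT -> prints "open" or "closed"
# Uses bash's /dev/tcp pseudo-device; `timeout` bounds the connect attempt.
probe() {
  if timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
  then echo open
  else echo closed
  fi
}

# On the env this would be: probe 192.168.0.2 3306   (haproxy MySQL frontend)
probe 127.0.0.1 1
```

This distinguishes "connection refused" from a routing problem faster than waiting on a hung telnet, and works inside the haproxy netns via `ip netns exec haproxy bash`.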

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I reproduced the issue once more, but with slightly different results again.
OSTF tests are still failing, but this time far fewer of them. In any case, I cannot spawn instances or create volumes after killing the node with the l3 agent.

Here is a log snapshot; please analyze only the time frame 2014-06-24T12:51 to 2014-06-24T13:51, all other records are not relevant (the env was deployed some time ago and reverted to reproduce this issue).

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
  • logs (48.6 MiB, application/x-tar)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The issue has nothing to do with RabbitMQ clustering and https://bugs.launchpad.net/fuel/+bug/1323277/comments/15. In all reproduced cases there were no problems with either the Galera or the RabbitMQ clusters.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I can confirm there are "nova.openstack.common.periodic_task MessagingTimeout: Timed out waiting for a reply to message ID" messages in the logs of the envs where this issue was reproduced, though.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Will check the solution for https://bugs.launchpad.net/fuel/+bug/1323277/comments/15
"after killing problem rabbitmq on node-2 (when only one working rabbitmq left on node-4), nova-compute services successfully recovered and I was able to create instances and pass OSTF."

If it applies here as well, this bug could be considered a dup of https://bugs.launchpad.net/fuel/+bug/1323277

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Ok, as far as I can see this issue is indeed a dup, and it is not related to the l3-agent location, but to the VIP location and RabbitMQ clustering instead.
The full log of the steps done to confirm it is a dup can be found at http://pastebin.com/qqCVj7i1
Some notes:
1) After the VIP failover, SYN-SENT sessions keep going on forever (see step 3). So it looks like that is not the issue.
2) The issue was resolved by a RabbitMQ service restart *which caused the compute service to re-establish its AMQP sessions*, so it is a dup of 1323277, indeed.
3) OSTF tests for RabbitMQ HA should also verify that RabbitMQ services are gracefully restartable (restarting them one by one would allow detecting hang glitches).
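The one-by-one restart check suggested in note 3 can be sketched as a small loop. Everything here is hypothetical (the node names, the `rolling_restart` and `restart_stub` helpers); the stub just prints, so the sketch runs without a cluster, and the comments show where the real commands would go:

```shell
# rolling_restart CMD NODE...  -- run CMD against each node in turn.
rolling_restart() {
  cmd="$1"; shift
  for node in "$@"; do
    "$cmd" "$node"
    # On a real env, wait for the node to rejoin before moving on, e.g.:
    #   until ssh "$node" rabbitmqctl cluster_status >/dev/null 2>&1; do sleep 5; done
  done
}

# Dry-run stub; a real env would instead do something like:
#   ssh "$1" service rabbitmq-server restart
restart_stub() { echo "would restart rabbitmq-server on $1"; }

rolling_restart restart_stub node-3 node-6 node-7
```

The point of restarting serially, with a readiness check between nodes, is that a hung node stalls the loop visibly instead of silently degrading the cluster.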

no longer affects: fuel/4.1.x
no longer affects: fuel/5.0.x
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Also had to add
10) Recheck queues
[root@node-3 ~]# rabbitmqctl list_queues name messages | egrep -v "0|^$"
Listing queues ...
q-agent-notifier-security_group-update_fanout_8d465979f67e4bc785b1c6257ddb99e8 39
...done.
Notifications are consumed OK, but q-agent-notifier-security_group-update keeps growing.
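One caveat with the `egrep -v "0|^$"` filter above: it also hides any queue whose name or depth merely contains a 0 (a backlog of 10 or 205 would be filtered out). An awk test on the numeric field is stricter. `nonempty_queues` is a hypothetical helper; it reads a captured sample here so it runs standalone, while live use would be `rabbitmqctl list_queues name messages | nonempty_queues`:

```shell
# Print only queues with a strictly positive message count.
# `rabbitmqctl list_queues name messages` emits "name count" pairs plus
# "Listing queues ..." / "...done." framing lines, which fail the NF test.
nonempty_queues() { awk 'NF == 2 && $2 + 0 > 0 { print }'; }

nonempty_queues <<'EOF'
Listing queues ...
notifications.info 0
q-agent-notifier-security_group-update_fanout_8d465979f67e4bc785b1c6257ddb99e8 39
conductor 0
...done.
EOF
```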

tags: added: to-be-covered-by-tests