neutron response to nova with security-group-rules.json too long

Bug #1498167 reported by Leontii Istomin
Affects: Mirantis OpenStack (status tracked in 10.0.x)

Series   Status    Importance   Assigned to
10.0.x   Invalid   High         Leontii Istomin
7.0.x    Invalid   High         MOS Maintenance
8.0.x    Invalid   High         Mikhail Chernik
9.x      Invalid   High         Leontii Istomin

Bug Description

During the NovaSecGroup.boot_and_delete_server_with_secgroups rally scenario, an iteration failed while creating a security group rule:
from rally.log: http://paste.openstack.org/show/473183/
from haproxy: http://paste.openstack.org/show/473156/ Note that every request should take less than 30 seconds.
from nova-all.log on node-170: http://paste.openstack.org/show/473164/
Let's look at all similar requests at that time which took more than 30 seconds: http://paste.openstack.org/show/473194/
So, neutron on node-174 responded too late. I have not found any errors except reconnections to the rabbitmq servers: http://paste.openstack.org/show/473214/
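
A minimal sketch for pulling such requests out of the haproxy log (assuming the default HAProxy HTTP log format, where the total request time Tt is the last number of the Tq/Tw/Tc/Tr/Tt timing block; the path and threshold are illustrative):

# print haproxy log lines whose total time Tt exceeded 30 000 ms
awk '{ for (i = 1; i <= NF; i++)
         if ($i ~ /^-?[0-9]+\/-?[0-9]+\/-?[0-9]+\/-?[0-9]+\/[0-9]+$/) {
           split($i, t, "/"); if (t[5] > 30000) print; break
         } }' /var/log/haproxy.log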

Cluster Configuration:
Baremetal,Ubuntu,IBP,HA,Neutron-vxlan,Ceph-all,Nova-debug,Nova-quotas,7.0-297
Controllers:3 Computes:178 Computes+Ceph:20 LMA:2

api: '1.0'
astute_sha: 6c5b73f93e24cc781c809db9159927655ced5012
auth_required: true
build_id: '298'
build_number: '298'
feature_groups:
- mirantis
fuel-agent_sha: 082a47bf014002e515001be05f99040437281a2d
fuel-library_sha: 0623b4daad438ceeb5dc41b10cdd3011795fff7e
fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
fuel-ostf_sha: 1f08e6e71021179b9881a824d9c999957fcc7045
fuelmain_sha: 6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85
nailgun_sha: d590b26dbb09785b8a8b3651b0ef69746fcf9991
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 486bde57cda1badb68f915f66c61b544108606f3
release: '7.0'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-09-22_06-53-27.tar.xz

Tags: scale
Revision history for this message
Georgy Okrokvertskhov (gokrokvertskhov) wrote :

Can it be worked around by changing haproxy or other service parameters?
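
For illustration, a sketch of the kind of haproxy change meant here, assuming the neutron backend is the one hitting the 30 s limit (the backend name and timeout value are illustrative, not taken from the affected cluster):

# /etc/haproxy/haproxy.cfg (sketch): raise the server-side timeout so
# slow neutron responses are not cut off at 30 s
backend neutron
    timeout server 60s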

Changed in mos:
importance: Undecided → High
Changed in mos:
milestone: 7.0 → 7.0-updates
Revision history for this message
Leontii Istomin (listomin) wrote :

2015-09-19T10:28:31.956130+00:00 info: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 137: /usr/sbin/rabbitmqctl list_channels 2>&1 > /dev/null
2015-09-19T10:28:31.989742+00:00 err: ERROR: p_rabbitmq-server: get_monitor(): 'rabbitmqctl list_channels' timed out 1 of max. 1 time(s) in a row and is not responding. The resource is failed.

description: updated
Revision history for this message
Leontii Istomin (listomin) wrote :

RabbitMQ was stopped at Sep 19 10:28:31 and recovered after 5 minutes, but the failure occurred at Sep 19 23:16:49.

Revision history for this message
Leontii Istomin (listomin) wrote :

I mean the rabbitmq failure at Sep 19 10:28:31 isn't the cause of this bug.

Revision history for this message
Leontii Istomin (listomin) wrote :

Ilya Shakhat has found the following in neutron.log:
2015-09-19 23:17:13.980 2733 ERROR oslo_messaging._drivers.impl_rabbit [req-6e630e90-66c3-46a7-9079-b29412d3033c ] AMQP server on 192.168.0.149:5673 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 5 seconds.
192.168.0.149 is node-174
I've checked rabbitmq on node-174 at that time:
=ERROR REPORT==== 19-Sep-2015::23:16:20 ===
Error on AMQP connection <0.8579.1> (192.168.0.149:44638 -> 192.168.0.149:5673, vhost: '/', user: 'nova', state: running), channel 0:
{amqp_error,connection_forced,
            "broker forced connection closure with reason 'shutdown'",none}

=INFO REPORT==== 19-Sep-2015::23:16:21 ===
Stopped RabbitMQ application

=INFO REPORT==== 19-Sep-2015::23:16:21 ===
Halting Erlang VM

=INFO REPORT==== 19-Sep-2015::23:17:33 ===
Error description:
   {error,{inconsistent_cluster,"Node 'rabbit@node-174' thinks it's clustered with node 'rabbit@node-170', but 'rabbit@node-170' disagrees"}}

Log files (may contain more information):
   /<email address hidden>
   /<email address hidden>

Stack trace:
   [{rabbit_mnesia,check_cluster_consistency,0,
                   [{file,"src/rabbit_mnesia.erl"},{line,588}]},
    {rabbit,'-start/0-fun-0-',0,[{file,"src/rabbit.erl"},{line,317}]},
    {rabbit,start_it,1,[{file,"src/rabbit.erl"},{line,353}]},
    {rpc,'-handle_call_call/6-fun-0-',5,[{file,"rpc.erl"},{line,205}]}]

=INFO REPORT==== 19-Sep-2015::23:17:35 ===
Resetting Rabbit

RabbitMQ was broken on node-174 at that time, while it was alive on the two other controller nodes.
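
A minimal sketch for spotting this kind of inconsistent-cluster state (rabbitmqctl cluster_status is a standard command; the node names are the ones from the logs above):

# run against each controller; the running_nodes lists should agree
ssh node-170 rabbitmqctl cluster_status
ssh node-174 rabbitmqctl cluster_status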

Revision history for this message
Ilya Shakhat (shakhat) wrote :

from neutron-server @node-100:
2015-09-19 23:16:48.628 2733 INFO neutron.wsgi [-] (2733) accepted ('192.168.0.2', 34259)

2015-09-19 23:16:48.633 2733 ERROR oslo_messaging._drivers.impl_rabbit [req-23fdf2a1-d4ac-4b7e-b9ae-08affabd4ab3 ] AMQP server on 192.168.0.149:5673 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 5 seconds.

several messages similar to:
2015-09-19 23:16:53.644 2733 ERROR oslo_messaging._drivers.impl_rabbit [req-23fdf2a1-d4ac-4b7e-b9ae-08affabd4ab3 ] AMQP server on 192.168.0.149:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 5 seconds.

2015-09-19 23:17:25.717 2733 INFO oslo_messaging._drivers.impl_rabbit [req-23fdf2a1-d4ac-4b7e-b9ae-08affabd4ab3 ] Reconnected to AMQP server on 192.168.0.168:5673

2015-09-19 23:17:26.037 2733 INFO neutron.wsgi [req-23fdf2a1-d4ac-4b7e-b9ae-08affabd4ab3 ] 192.168.0.2 - - [19/Sep/2015 23:17:26] "POST /v2.0/security-group-rules.json HTTP/1.1" 201 564 37.408184

Changed in mos:
assignee: MOS Neutron (mos-neutron) → MOS Oslo (mos-oslo)
description: updated
Changed in mos:
status: New → Confirmed
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Hi, since the time this bug was filed, we have added max_rabbitmqctl_timeouts. Please make sure this is set to 10 and try to recreate the problem. Please let us know if you are able to reproduce it.

Thanks,
Dims
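
max_rabbitmqctl_timeouts is a parameter of the p_rabbitmq-server Pacemaker resource; a sketch of setting it, assuming the standard Fuel resource name:

# raise the number of tolerated rabbitmqctl timeouts to 10
crm_resource --resource p_rabbitmq-server \
             --set-parameter max_rabbitmqctl_timeouts --parameter-value 10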

Dmitry Pyzhov (dpyzhov)
tags: added: area-qa
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

It looks like we need to verify whether this issue is reproduced in MOS 8.0 and MOS 9.0 with new updates.

Revision history for this message
Mikhail Chernik (mchernik) wrote :

During the last verification run on MOS 8.0 (ISO 482, 3 controllers, 180 computes, 20 ceph+compute nodes) the mentioned test passed. The configuration had a different network segmentation (VLAN instead of VxLAN).

Rally log file can be found here: http://mos-scale-share.mirantis.com/boot_and_delete_server_with_secgroups-rally-stdout.log.gz

Revision history for this message
Mikhail Chernik (mchernik) wrote :

Since the issue is not connected with Neutron and no problems with RabbitMQ connectivity were detected during the test run, I'm changing the status for 8.0 to Fix Released.

Revision history for this message
Dina Belova (dbelova) wrote :

For 7.0: mos-maintenance, please follow Dims' comment and set the needed options to fix the issue.
For 9.0: Neutron team, please ensure that these options are in place.

Thanks.

Revision history for this message
Alexander Ignatov (aignatov) wrote :

@mchernik it would be good to check this issue in 9.0 to make sure it's not valid anymore.

Revision history for this message
Mikhail Chernik (mchernik) wrote :

@aignatov: Will check during the next run on 200 nodes and then update this issue.

Revision history for this message
Dina Belova (dbelova) wrote :

Planned to be verified ~Apr 25th

Revision history for this message
Leontii Istomin (listomin) wrote :

Reproduced with 9.0-244 during the nova.create_and_delete_secgroups rally scenario:
3 controllers, 20 computes+ceph, 175 computes, vxlan+dvr, Ceph for all, LMA
from rally.log: http://paste.openstack.org/show/495720/
from node-49 nova-api: http://paste.openstack.org/show/495721/
from node-49 nova-api, by request_id: http://paste.openstack.org/show/495722/
from neutron-server.log on node-140: http://paste.openstack.org/show/495735/ we can see that the request was processed in 33.661700 s, while the nova timeout is 30 s
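
The 30 s limit referred to here is nova's client-side timeout for neutron API calls; a sketch of where it is configured (option name as used by nova in this era; raising it only masks the slowness, it does not fix it):

# /etc/nova/nova.conf (sketch)
[neutron]
url_timeout = 30   # seconds nova waits for a neutron API response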

Revision history for this message
Alexander Ignatov (aignatov) wrote :

This issue is caused by overall slowness of the cloud. Perhaps it was the huge amount of debug logs generated during the test. It would be nice to repeat these tests without debug logging and with a live environment to debug. In the current state there seems to be nothing to fix in the nova-neutron interactions.

Revision history for this message
Leontii Istomin (listomin) wrote :

Neutron works slowly because of syslog. Eugeny Nikanorov straced the neutron process and found that the syslog calls take a lot of time.
It seems we need to try increasing the MaxMessageSize parameter in the /etc/rsyslog.conf config files. But we can only raise it to 4k (twice the current value of 2k), because Fuel has a feature whereby rsyslog sends logs via UDP.
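
A sketch of the change being proposed ($MaxMessageSize is a standard rsyslog directive and must appear before any input modules are loaded; the 4k value is the one suggested above):

# /etc/rsyslog.conf (sketch): must precede the $ModLoad im* lines
$MaxMessageSize 4k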

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

This issue isn't related to QA, tag "area-qa" was removed.

tags: removed: area-qa
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Closing as Invalid after a month without feedback. Feel free to re-open if this is still reproducible.
