neutron response to nova with security-group-rules.json too long

Bug #1498167 reported by Leontii Istomin
Affects: Mirantis OpenStack (status tracked in 10.0.x)

Series   Status    Importance   Assigned to
10.0.x   Invalid   High         Leontii Istomin
7.0.x    Invalid   High         MOS Maintenance
8.0.x    Invalid   High         Mikhail Chernik
9.x      Invalid   High         Leontii Istomin

Bug Description

During the NovaSecGroup.boot_and_delete_server_with_secgroups rally scenario, an iteration failed while creating a security group rule:
from rally.log: http://paste.openstack.org/show/473183/
from haproxy: http://paste.openstack.org/show/473156/ Note that every request should take less than 30 seconds.
from nova-all.log on node-170: http://paste.openstack.org/show/473164/
Let's look at all similar requests at that time which took more than 30 seconds: http://paste.openstack.org/show/473194/
So, neutron on node-174 responded too late. I have not found any errors except reconnections to the rabbitmq servers: http://paste.openstack.org/show/473214/
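
A minimal sketch for pulling such requests out of the haproxy log (assuming the default HAProxy HTTP log format, where the total request time Tt is the last number of the Tq/Tw/Tc/Tr/Tt timing block; the path and threshold are illustrative):

# print haproxy log lines whose total time Tt exceeded 30 000 ms
awk '{ for (i = 1; i <= NF; i++)
         if ($i ~ /^-?[0-9]+\/-?[0-9]+\/-?[0-9]+\/-?[0-9]+\/[0-9]+$/) {
           split($i, t, "/"); if (t[5] > 30000) print; break
         } }' /var/log/haproxy.log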

Cluster Configuration:
Baremetal,Ubuntu,IBP,HA,Neutron-vxlan,Ceph-all,Nova-debug,Nova-quotas,7.0-297
Controllers:3 Computes:178 Computes+Ceph:20 LMA:2

api: '1.0'
astute_sha: 6c5b73f93e24cc781c809db9159927655ced5012
auth_required: true
build_id: '298'
build_number: '298'
feature_groups:
- mirantis
fuel-agent_sha: 082a47bf014002e515001be05f99040437281a2d
fuel-library_sha: 0623b4daad438ceeb5dc41b10cdd3011795fff7e
fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
fuel-ostf_sha: 1f08e6e71021179b9881a824d9c999957fcc7045
fuelmain_sha: 6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85
nailgun_sha: d590b26dbb09785b8a8b3651b0ef69746fcf9991
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 486bde57cda1badb68f915f66c61b544108606f3
release: '7.0'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-09-22_06-53-27.tar.xz

Tags: scale
Revision history for this message
Georgy Okrokvertskhov (gokrokvertskhov) wrote :

Can it be worked around by changing haproxy or other service parameters?
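
For illustration, a sketch of the kind of haproxy change meant here, assuming the neutron backend is the one hitting the 30 s limit (the backend name and timeout value are illustrative, not taken from the affected cluster):

# /etc/haproxy/haproxy.cfg (sketch): raise the server-side timeout so
# slow neutron responses are not cut off at 30 s
backend neutron
    timeout server 60s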

Changed in mos:
importance: Undecided → High
Changed in mos:
milestone: 7.0 → 7.0-updates
Revision history for this message
Leontii Istomin (listomin) wrote :

2015-09-19T10:28:31.956130+00:00 info: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 137: /usr/sbin/rabbitmqctl list_channels 2>&1 > /dev/null
2015-09-19T10:28:31.989742+00:00 err: ERROR: p_rabbitmq-server: get_monitor(): 'rabbitmqctl list_channels' timed out 1 of max. 1 time(s) in a row and is not responding. The resource is failed.

description: updated
Revision history for this message
Leontii Istomin (listomin) wrote :

RabbitMQ was stopped at Sep 19 10:28:31 and recovered after 5 minutes, but the failure occurred at Sep 19 23:16:49.

Revision history for this message
Leontii Istomin (listomin) wrote :

I mean the rabbitmq failure at Sep 19 10:28:31 isn't the cause of this bug.

Revision history for this message
Leontii Istomin (listomin) wrote :

Ilya Shakhat has found the following in neutron.log:
2015-09-19 23:17:13.980 2733 ERROR oslo_messaging._drivers.impl_rabbit [req-6e630e90-66c3-46a7-9079-b29412d3033c ] AMQP server on 192.168.0.149:5673 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 5 seconds.
192.168.0.149 is node-174
I've checked rabbitmq on node-174 at that time:
=ERROR REPORT==== 19-Sep-2015::23:16:20 ===
Error on AMQP connection <0.8579.1> (192.168.0.149:44638 -> 192.168.0.149:5673, vhost: '/', user: 'nova', state: running), channel 0:
{amqp_error,connection_forced,
            "broker forced connection closure with reason 'shutdown'",none}

=INFO REPORT==== 19-Sep-2015::23:16:21 ===
Stopped RabbitMQ application

=INFO REPORT==== 19-Sep-2015::23:16:21 ===
Halting Erlang VM

=INFO REPORT==== 19-Sep-2015::23:17:33 ===
Error description:
   {error,{inconsistent_cluster,"Node 'rabbit@node-174' thinks it's clustered with node 'rabbit@node-170', but 'rabbit@node-170' disagrees"}}

Log files (may contain more information):
   /<email address hidden>
   /<email address hidden>

Stack trace:
   [{rabbit_mnesia,check_cluster_consistency,0,
                   [{file,"src/rabbit_mnesia.erl"},{line,588}]},
    {rabbit,'-start/0-fun-0-',0,[{file,"src/rabbit.erl"},{line,317}]},
    {rabbit,start_it,1,[{file,"src/rabbit.erl"},{line,353}]},
    {rpc,'-handle_call_call/6-fun-0-',5,[{file,"rpc.erl"},{line,205}]}]

=INFO REPORT==== 19-Sep-2015::23:17:35 ===
Resetting Rabbit

RabbitMQ was broken on node-174 at that time, while it was alive on the two other controller nodes.
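
A minimal sketch for spotting this kind of inconsistent-cluster state (rabbitmqctl cluster_status is a standard command; the node names are the ones from the logs above):

# run against each controller; the running_nodes lists should agree
ssh node-170 rabbitmqctl cluster_status
ssh node-174 rabbitmqctl cluster_status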

Revision history for this message
Ilya Shakhat (shakhat) wrote :

from neutron-server @node-100:
2015-09-19 23:16:48.628 2733 INFO neutron.wsgi [-] (2733) accepted ('192.168.0.2', 34259)

2015-09-19 23:16:48.633 2733 ERROR oslo_messaging._drivers.impl_rabbit [req-23fdf2a1-d4ac-4b7e-b9ae-08affabd4ab3 ] AMQP server on 192.168.0.149:5673 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 5 seconds.

several messages similar to:
2015-09-19 23:16:53.644 2733 ERROR oslo_messaging._drivers.impl_rabbit [req-23fdf2a1-d4ac-4b7e-b9ae-08affabd4ab3 ] AMQP server on 192.168.0.149:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 5 seconds.

2015-09-19 23:17:25.717 2733 INFO oslo_messaging._drivers.impl_rabbit [req-23fdf2a1-d4ac-4b7e-b9ae-08affabd4ab3 ] Reconnected to AMQP server on 192.168.0.168:5673

2015-09-19 23:17:26.037 2733 INFO neutron.wsgi [req-23fdf2a1-d4ac-4b7e-b9ae-08affabd4ab3 ] 192.168.0.2 - - [19/Sep/2015 23:17:26] "POST /v2.0/security-group-rules.json HTTP/1.1" 201 564 37.408184

Changed in mos:
assignee: MOS Neutron (mos-neutron) → MOS Oslo (mos-oslo)
description: updated
Changed in mos:
status: New → Confirmed
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Hi, since the time this bug was filed, we have added max_rabbitmqctl_timeouts. Please make sure this is set to 10 and try to recreate the problem. Please let us know if you are able to reproduce it.

Thanks,
Dims
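
max_rabbitmqctl_timeouts is a parameter of the p_rabbitmq-server Pacemaker resource; a sketch of setting it, assuming the standard Fuel resource name:

# raise the number of tolerated rabbitmqctl timeouts to 10
crm_resource --resource p_rabbitmq-server \
             --set-parameter max_rabbitmqctl_timeouts --parameter-value 10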

Dmitry Pyzhov (dpyzhov)
tags: added: area-qa
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

It looks like we need to verify whether this issue is reproduced in MOS 8.0 and MOS 9.0 with new updates.

Revision history for this message
Mikhail Chernik (mchernik) wrote :

During the last verification run on MOS 8.0 (ISO 482, 3 controllers, 180 computes, 20 ceph+compute nodes) the mentioned test passed. The configuration had a different network segmentation (VLAN instead of VxLAN).

Rally log file can be found here: http://mos-scale-share.mirantis.com/boot_and_delete_server_with_secgroups-rally-stdout.log.gz

Revision history for this message
Mikhail Chernik (mchernik) wrote :

Since the issue is not connected with Neutron and no problems with RabbitMQ connectivity were detected during the test run, I'm changing the status for 8.0 to Fix Released.

Revision history for this message
Dina Belova (dbelova) wrote :

For 7.0: mos-maintenance, please follow Dims' comment and set the needed options to fix the issue.
For 9.0: Neutron team, please ensure that these options are in place.

Thanks.

Revision history for this message
Alexander Ignatov (aignatov) wrote :

@mchernik it would be good to check this issue in 9.0 to make sure it's not valid anymore.

Revision history for this message
Mikhail Chernik (mchernik) wrote :

@aignatov: Will check during the next run on 200 nodes and then update this issue.

Revision history for this message
Dina Belova (dbelova) wrote :

Planned to be verified ~Apr 25th

Revision history for this message
Leontii Istomin (listomin) wrote :

Reproduced with 9.0-244 during the nova.create_and_delete_secgroups rally scenario:
3 controllers, 20 computes+ceph, 175 computes, vxlan+dvr, Ceph for all, LMA
from rally.log: http://paste.openstack.org/show/495720/
from node-49 nova-api: http://paste.openstack.org/show/495721/
from node-49 nova-api, by request_id: http://paste.openstack.org/show/495722/
from neutron-server.log on node-140: http://paste.openstack.org/show/495735/ we can see that the request was processed in 33.661700 s, while the nova timeout is 30 s
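
The 30 s limit referred to here is nova's client-side timeout for neutron API calls; a sketch of where it is configured (option name as used by nova in this era; raising it only masks the slowness, it does not fix it):

# /etc/nova/nova.conf (sketch)
[neutron]
url_timeout = 30   # seconds nova waits for a neutron API response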

Revision history for this message
Alexander Ignatov (aignatov) wrote :

This issue is caused by overall slowness of the cloud. Perhaps it was the huge amount of debug logs generated during the test. It would be nice to repeat these tests without debug logging and with a live environment to debug. In the current state there seems to be nothing to fix in the nova-neutron interactions.

Revision history for this message
Leontii Istomin (listomin) wrote :

Neutron works slowly because of syslog. Eugeny Nikanorov straced the neutron process and found that the syslog calls take a lot of time.
It seems we need to try increasing the MaxMessageSize parameter in the /etc/rsyslog.conf config files. But we can only raise it to 4k (twice the current value of 2k), because Fuel has a feature whereby rsyslog sends logs via UDP.
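
A sketch of the change being proposed ($MaxMessageSize is a standard rsyslog directive and must appear before any input modules are loaded; the 4k value is the one suggested above):

# /etc/rsyslog.conf (sketch): must precede the $ModLoad im* lines
$MaxMessageSize 4k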

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

This issue isn't related to QA, tag "area-qa" was removed.

tags: removed: area-qa
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Closing as Invalid after a month without feedback. Feel free to re-open if this is still reproducible.
