Long kombu_reconnect_delay for Neutron blocks metadata

Bug #1440134 reported by Miroslav Anashkin on 2015-04-03
Affects / Milestone   Importance  Assigned to
Fuel for OpenStack    High        Ivan Berezovskiy
5.1.x                 High        MOS Maintenance
6.0.x                 High        MOS Maintenance
6.1.x                 High        Alexey Deryugin
7.0.x                 High        Alexey Deryugin
8.0.x                 High        Alexey Deryugin
Mitaka                High        Ivan Berezovskiy

Bug Description

We've implemented the following workaround in Fuel 5.1.1:

https://bugs.launchpad.net/fuel/+bug/1322259/
Set delay to 5.0 to recover channel errors on highly loaded environments.

This fix was actually intended for slow environments.

However, it introduced a new issue affecting all Mirantis OpenStack versions from 5.1.1 through 6.1: instance launch may fail.
It is related to this HA+RabbitMQ+Oslo Messaging bug:
https://bugs.launchpad.net/mos/+bug/1415932

So far customers reported the following errors:

===================================================================

Error Message 1
Timed out waiting for a reply to message ID 7bc529a2d71141cf8e65cbee6402f817
Code
500
Details
File "/usr/lib/python2.7/dist-packages/nova/conductor/manager.py", line 614, in build_instances
  request_spec, filter_properties)
File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 49, in select_destinations
  context, request_spec, filter_properties)
File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 35, in __run_method
  return getattr(self.instance, __name)(*args, **kwargs)
File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/query.py", line 34, in select_destinations
  context, request_spec, filter_properties)
File "/usr/lib/python2.7/dist-packages/nova/scheduler/rpcapi.py", line 108, in select_destinations
  request_spec=request_spec, filter_properties=filter_properties)
File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
  retry=self.retry)
File "/usr/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
  timeout=timeout, retry=retry)
File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 434, in send
  retry=retry)
File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 423, in _send
  result = self._waiter.wait(msg_id, timeout)
File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 289, in wait
  reply, ending = self._poll_connection(msg_id, timeout)
File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 239, in _poll_connection
  % msg_id)

===================================================================

Error message 2
The problem I am seeing is that when cloud-init accesses the metadata service, each query takes about 10 seconds to respond, which slows instance startup from ~30 seconds to over 3 minutes.

time curl http://169.254.169.254/2009-04-04/meta-data/instance-id
i-00000046
real 0m10.289s
user 0m0.000s
sys 0m0.003s

After some troubleshooting I noticed that the neutron-metadata-agent was calling out for locks to the neutron-server; when I looked at the neutron-server logs I started seeing error messages like this:

2015-03-12 21:00:34.247 41004 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to publish message to topic 'reply_948744ac899648c19260a53a8e5c0f4b': [Errno 32] Broken pipe

===================================================================

Should we set kombu_reconnect_delay back to its default value, 1.0?
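For reference, the proposed revert is a one-line configuration change in each service's config file (nova.conf, neutron.conf, and other oslo.messaging consumers). The section name varies by release — [DEFAULT] in older oslo.messaging releases, [oslo_messaging_rabbit] later — so treat this as a sketch:

```ini
[oslo_messaging_rabbit]
# Fuel 5.1.1 workaround from bug #1322259, which this bug proposes to revert:
# kombu_reconnect_delay = 5.0

# Back to the oslo.messaging default:
kombu_reconnect_delay = 1.0
```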

description: updated
tags: added: low-hanging-fruit
Ilya Shakhat (shakhat) wrote :

The root cause of the issue looks to be the same as in https://bugs.launchpad.net/mos/+bug/1410797. That bug is already fixed in the 6.1 release.

Miroslav Anashkin (manashkin) wrote :

It is not a duplicate.
This bug is about setting kombu_reconnect_delay back to the default value.

Giuseppe Cossu (giuseppecossu) wrote :

It is not a duplicate; it actually refers to metadata slowness.
Setting kombu_reconnect_delay=1.0 solves the issue for metadata. Tested with Fuel 6.0.

Dmitry Sutyagin (dsutyagin) wrote :

Issue was observed in MOS 7.0 with kombu_reconnect_delay = 5.0 (the Fuel default).
Some metadata requests get late replies due to "Broken pipe" errors and the 5-second reconnect delay, causing cloud-init inside the VM to fail with a timeout.

Not sure whether heartbeats are supposed to fix this, but if so, it looks like heartbeats do not work for neutron-metadata-agent in a DVR configuration.
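For anyone checking whether AMQP heartbeats are in play: they are controlled by the oslo.messaging options below (introduced around the Kilo-era releases; exact availability in a given MOS release is an assumption here, so verify against the installed oslo.messaging version):

```ini
[oslo_messaging_rabbit]
# Connection is considered dead after this many seconds without a
# heartbeat; 0 disables heartbeats entirely.
heartbeat_timeout_threshold = 60
# How many times per timeout interval the heartbeat is checked.
heartbeat_rate = 2
```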

tags: added: support
Dmitry Pyzhov (dpyzhov) on 2016-05-06
tags: added: area-library
Dmitry Mescheryakov (dmitrymex) wrote :

FYI, we indeed decreased kombu_reconnect_delay to the default 1 second in change https://review.openstack.org/#/q/Id7bac2d1988b9f91ea19edc0fe8a4bdc9958d458,n,z

which for now went into Mitaka only. I suggest backporting that change to the other branches as well.

Dmitry Pyzhov (dpyzhov) on 2016-05-13
no longer affects: fuel/future
no longer affects: fuel/newton
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Networking (l23-network)
Ivan Berezovskiy (iberezovskiy) wrote :

This is fixed for 9.0 - https://review.openstack.org/#/c/305352/ and for 10.0 - https://review.openstack.org/#/c/304517/. It needs to be backported to all other branches.

Changed in fuel:
status: Triaged → Fix Committed
assignee: Networking (l23-network) → Ivan Berezovskiy (iberezovskiy)
tags: added: on-verification
Vitaly Sedelnik (vsedelnik) wrote :

Fix Committed for 8.0 - the review is https://review.openstack.org/#/c/316853/

Kyrylo Romanenko (kromanenko) wrote :

Requesting from cirros instance:
time curl http://169.254.169.254/latest/meta-data/instance-id
i-00000016
real 0m 2.04s
user 0m 0.11s
sys 0m 0.10s

cat /etc/fuel_build_id:
 490
cat /etc/fuel_build_number:
 490
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6349.noarch
 fuel-misc-9.0.0-1.mos8460.noarch
 python-packetary-9.0.0-1.mos140.noarch
 fuel-bootstrap-cli-9.0.0-1.mos285.noarch
 fuel-migrate-9.0.0-1.mos8460.noarch
 rubygem-astute-9.0.0-1.mos750.noarch
 fuel-mirror-9.0.0-1.mos140.noarch
 shotgun-9.0.0-1.mos90.noarch
 fuel-openstack-metadata-9.0.0-1.mos8743.noarch
 fuel-notify-9.0.0-1.mos8460.noarch
 nailgun-mcagents-9.0.0-1.mos750.noarch
 python-fuelclient-9.0.0-1.mos325.noarch
 fuel-9.0.0-1.mos6349.noarch
 fuel-utils-9.0.0-1.mos8460.noarch
 fuel-setup-9.0.0-1.mos6349.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8743.noarch
 fuel-library9.0-9.0.0-1.mos8460.noarch
 network-checker-9.0.0-1.mos74.x86_64
 fuel-agent-9.0.0-1.mos285.noarch
 fuel-ui-9.0.0-1.mos2717.noarch
 fuel-ostf-9.0.0-1.mos936.noarch
 fuelmenu-9.0.0-1.mos274.noarch
 fuel-nailgun-9.0.0-1.mos8743.noarch

Fix released.

tags: removed: on-verification
tags: added: on-verification
Ekaterina Shutova (eshutova) wrote :

Checked on MOS 8.0 + MU2 updates.

Env: 3 controllers, 2 computes, 1 cinder
On controller:
- kill rabbit
- check reconnect time to rabbit (for example, nova-scheduler)
    Before updates ~5sec:
2016-06-27 14:36:36.597 5358 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.109.1.7:5673 is unreachable: [Errno 111] ECONN$
2016-06-27 14:36:41.612 5358 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.109.1.8:5673 is unreachable: [Errno 111] ECONN$
2016-06-27 14:36:46.622 5358 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.109.1.6:5673 is unreachable: [Errno 111] ECONN$
    After updates ~ 1-2sec:
2016-06-27 14:55:44.805 2117 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.109.1.8:5673 is unreachable: [Errno 111] ECONN$
2016-06-27 14:55:45.827 2117 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.109.1.7:5673 is unreachable: [Errno 111] ECONN$
2016-06-27 14:55:46.850 2117 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.109.1.9:5673 is unreachable: [Errno 111] ECONN$
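The before/after comparison above can also be checked mechanically: the effective reconnect delay is simply the gap between consecutive "AMQP server ... is unreachable" timestamps. A minimal sketch (the sample lines are truncated copies of the "after updates" excerpt above):

```python
from datetime import datetime

# Truncated copies of the "after updates" log lines shown above
log_lines = [
    "2016-06-27 14:55:44.805 2117 ERROR oslo.messaging._drivers.impl_rabbit",
    "2016-06-27 14:55:45.827 2117 ERROR oslo.messaging._drivers.impl_rabbit",
    "2016-06-27 14:55:46.850 2117 ERROR oslo.messaging._drivers.impl_rabbit",
]

def reconnect_intervals(lines):
    """Seconds between consecutive AMQP-unreachable errors.

    The first 23 characters of each oslo log line form the
    'YYYY-MM-DD HH:MM:SS.mmm' timestamp.
    """
    stamps = [datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")
              for line in lines]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

# Intervals close to 1 s indicate the default kombu_reconnect_delay
print(reconnect_intervals(log_lines))
```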

tags: removed: on-verification

Reviewed: https://review.openstack.org/316871
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=234335f3cbebd62d8ed2fee45a26a684f594a76c
Submitter: Jenkins
Branch: stable/6.1

commit 234335f3cbebd62d8ed2fee45a26a684f594a76c
Author: Alexey Deryugin <email address hidden>
Date: Mon May 16 17:14:12 2016 +0300

    Use default value of kombu_reconnect_delay for all services

    Current reconnection time is 5.0 which is pretty long.
    This patch switches kombu_reconnect_delay to its default
    value - 1.0 which is the same for all services.

    Change-Id: I8e0230bf48620a8b5e1d7339e419b9b682e5cbd8
    Closes-bug: #1560097
    Closes-Bug: #1440134

Reviewed: https://review.openstack.org/316866
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=1722961c12eea9031280f9622481d97730769951
Submitter: Jenkins
Branch: stable/7.0

commit 1722961c12eea9031280f9622481d97730769951
Author: Alexey Deryugin <email address hidden>
Date: Mon May 16 17:07:04 2016 +0300

    Use default value of kombu_reconnect_delay for all services

    Current reconnection time is 5.0 which is pretty long.
    This patch switches kombu_reconnect_delay to its default
    value - 1.0 which is the same for all services.

    Change-Id: I0d2418f940e7036ad71264da7ad10d7a32bb329b
    Closes-bug: #1560097
    Closes-Bug: #1440134

Change abandoned by Fuel DevOps Robot (<email address hidden>) on branch: stable/5.1
Review: https://review.openstack.org/316878
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Change abandoned by Fuel DevOps Robot (<email address hidden>) on branch: stable/6.0
Review: https://review.openstack.org/316885
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Ekaterina Shutova (eshutova) wrote :

Verified on 10.1 #1578.
After killing rabbit, reconnection time is about 1 sec.
root@node-3:~# grep -i ERROR /var/log/nova/nova-scheduler.log
2017-04-10 08:56:11.497 23579 ERROR oslo.messaging._drivers.impl_rabbit [-] [e151046d-4688-42a7-880c-01cef9758717] AMQP server on 10.109.11.4:5673 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds. Client port: 44298
2017-04-10 08:56:12.518 23579 ERROR oslo.messaging._drivers.impl_rabbit [-] [e151046d-4688-42a7-880c-01cef9758717] AMQP server on 10.109.11.4:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: 44298
2017-04-10 08:56:13.542 23579 ERROR oslo.messaging._drivers.impl_rabbit [-] [e151046d-4688-42a7-880c-01cef9758717] AMQP server on 10.109.11.5:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: 44298

Changed in fuel:
status: Fix Committed → Fix Released
Alexey Stupnikov (astupnikov) wrote :

MOS 5.1 and MOS 6.0 are no longer supported. Moving to Won't Fix.
