Long kombu_reconnect_delay for Neutron blocks metadata

Bug #1440134 reported by Miroslav Anashkin on 2015-04-03
Affects / Milestone   Importance  Assigned to
Fuel for OpenStack    High        Ivan Berezovskiy
5.1.x                 High        MOS Maintenance
6.0.x                 High        MOS Maintenance
6.1.x                 High        Alexey Deryugin
7.0.x                 High        Alexey Deryugin
8.0.x                 High        Alexey Deryugin
Mitaka                High        Ivan Berezovskiy

Bug Description

We've implemented the following workaround in Fuel 5.1.1:

https://bugs.launchpad.net/fuel/+bug/1322259/
Set delay to 5.0 to recover channel errors on highly loaded environments.

This fix was actually intended for slow environments.

However, it introduced a new issue affecting all Mirantis OpenStack versions from 5.1.1 through 6.1: instance launch may fail.
It is related to this HA+RabbitMQ+Oslo Messaging bug:
https://bugs.launchpad.net/mos/+bug/1415932

So far customers reported the following errors:

===================================================================

Error Message 1
Timed out waiting for a reply to message ID 7bc529a2d71141cf8e65cbee6402f817
Code
500
Details
File "/usr/lib/python2.7/dist-packages/nova/conductor/manager.py", line 614, in build_instances
  request_spec, filter_properties)
File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 49, in select_destinations
  context, request_spec, filter_properties)
File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 35, in __run_method
  return getattr(self.instance, __name)(*args, **kwargs)
File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/query.py", line 34, in select_destinations
  context, request_spec, filter_properties)
File "/usr/lib/python2.7/dist-packages/nova/scheduler/rpcapi.py", line 108, in select_destinations
  request_spec=request_spec, filter_properties=filter_properties)
File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
  retry=self.retry)
File "/usr/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
  timeout=timeout, retry=retry)
File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 434, in send
  retry=retry)
File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 423, in _send
  result = self._waiter.wait(msg_id, timeout)
File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 289, in wait
  reply, ending = self._poll_connection(msg_id, timeout)
File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 239, in _poll_connection
  % msg_id)

===================================================================

Error message 2
The problem I am seeing is that when cloud-init accesses the metadata service, each query takes about 10 seconds to respond, which slows instance startup from ~30 seconds to over 3 minutes.

time curl http://169.254.169.254/2009-04-04/meta-data/instance-id
i-00000046
real 0m10.289s
user 0m0.000s
sys 0m0.003s

After some troubleshooting I noticed that the neutron-metadata-agent was calling out for locks to the neutron-server; when I looked at the neutron-server logs I started seeing error messages like this:

2015-03-12 21:00:34.247 41004 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to publish message to topic 'reply_948744ac899648c19260a53a8e5c0f4b': [Errno 32] Broken pipe

===================================================================

Should we set kombu_reconnect_delay back to its default value, 1.0?
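For reference, the proposed revert is a one-line configuration change in each service's config file (nova.conf, neutron.conf, and other oslo.messaging consumers). The section name varies by release — [DEFAULT] in older oslo.messaging releases, [oslo_messaging_rabbit] later — so treat this as a sketch:

```ini
[oslo_messaging_rabbit]
# Fuel 5.1.1 workaround from bug #1322259, which this bug proposes to revert:
# kombu_reconnect_delay = 5.0

# Back to the oslo.messaging default:
kombu_reconnect_delay = 1.0
```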

description: updated
tags: added: low-hanging-fruit
Ilya Shakhat (shakhat) wrote :

The root cause of the issue looks to be the same as in https://bugs.launchpad.net/mos/+bug/1410797. That bug is already fixed in the 6.1 release.

Miroslav Anashkin (manashkin) wrote :

It is not a duplicate.
This bug is about setting kombu_reconnect_delay back to the default value.

Giuseppe Cossu (giuseppecossu) wrote :

It is not a duplicate; it actually refers to metadata slowness.
Setting kombu_reconnect_delay=1.0 solves the issue for metadata. Tested with Fuel 6.0.

Dmitry Sutyagin (dsutyagin) wrote :

Issue was observed in MOS 7.0 with kombu_reconnect_delay = 5.0 (the Fuel default).
Some metadata requests get late replies due to "Broken pipe" errors and the 5-second reconnect delay, causing cloud-init inside the VM to fail with a timeout.

Not sure whether heartbeats are supposed to fix this, but if so, it looks like heartbeats do not work for neutron-metadata-agent in a DVR configuration.
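For anyone checking whether AMQP heartbeats are in play: they are controlled by the oslo.messaging options below (introduced around the Kilo-era releases; exact availability in a given MOS release is an assumption here, so verify against the installed oslo.messaging version):

```ini
[oslo_messaging_rabbit]
# Connection is considered dead after this many seconds without a
# heartbeat; 0 disables heartbeats entirely.
heartbeat_timeout_threshold = 60
# How many times per timeout interval the heartbeat is checked.
heartbeat_rate = 2
```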

tags: added: support
Dmitry Pyzhov (dpyzhov) on 2016-05-06
tags: added: area-library
Dmitry Mescheryakov (dmitrymex) wrote :

FYI, we indeed decreased kombu_reconnect_delay to the default 1 second in change https://review.openstack.org/#/q/Id7bac2d1988b9f91ea19edc0fe8a4bdc9958d458,n,z

which for now went into Mitaka only. I suggest backporting that change to the other branches as well.

Dmitry Pyzhov (dpyzhov) on 2016-05-13
no longer affects: fuel/future
no longer affects: fuel/newton
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Networking (l23-network)
Ivan Berezovskiy (iberezovskiy) wrote :

This is fixed for 9.0 - https://review.openstack.org/#/c/305352/ and for 10.0 - https://review.openstack.org/#/c/304517/. It needs to be backported to all other branches.

Changed in fuel:
status: Triaged → Fix Committed
assignee: Networking (l23-network) → Ivan Berezovskiy (iberezovskiy)
tags: added: on-verification
Vitaly Sedelnik (vsedelnik) wrote :

Fix Committed for 8.0 - the review is https://review.openstack.org/#/c/316853/

Kyrylo Romanenko (kromanenko) wrote :

Requesting from cirros instance:
time curl http://169.254.169.254/latest/meta-data/instance-id
i-00000016
real 0m 2.04s
user 0m 0.11s
sys 0m 0.10s

cat /etc/fuel_build_id:
 490
cat /etc/fuel_build_number:
 490
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6349.noarch
 fuel-misc-9.0.0-1.mos8460.noarch
 python-packetary-9.0.0-1.mos140.noarch
 fuel-bootstrap-cli-9.0.0-1.mos285.noarch
 fuel-migrate-9.0.0-1.mos8460.noarch
 rubygem-astute-9.0.0-1.mos750.noarch
 fuel-mirror-9.0.0-1.mos140.noarch
 shotgun-9.0.0-1.mos90.noarch
 fuel-openstack-metadata-9.0.0-1.mos8743.noarch
 fuel-notify-9.0.0-1.mos8460.noarch
 nailgun-mcagents-9.0.0-1.mos750.noarch
 python-fuelclient-9.0.0-1.mos325.noarch
 fuel-9.0.0-1.mos6349.noarch
 fuel-utils-9.0.0-1.mos8460.noarch
 fuel-setup-9.0.0-1.mos6349.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8743.noarch
 fuel-library9.0-9.0.0-1.mos8460.noarch
 network-checker-9.0.0-1.mos74.x86_64
 fuel-agent-9.0.0-1.mos285.noarch
 fuel-ui-9.0.0-1.mos2717.noarch
 fuel-ostf-9.0.0-1.mos936.noarch
 fuelmenu-9.0.0-1.mos274.noarch
 fuel-nailgun-9.0.0-1.mos8743.noarch

Fix released.

tags: removed: on-verification
tags: added: on-verification
Ekaterina Shutova (eshutova) wrote :

Checked on MOS 8.0 + MU2 updates.

Env: 3 controllers, 2 computes, 1 cinder
On controller:
- kill rabbit
- check reconnect time to rabbit (for example, nova-scheduler)
    Before updates ~5sec:
2016-06-27 14:36:36.597 5358 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.109.1.7:5673 is unreachable: [Errno 111] ECONN$
2016-06-27 14:36:41.612 5358 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.109.1.8:5673 is unreachable: [Errno 111] ECONN$
2016-06-27 14:36:46.622 5358 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.109.1.6:5673 is unreachable: [Errno 111] ECONN$
    After updates ~ 1-2sec:
2016-06-27 14:55:44.805 2117 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.109.1.8:5673 is unreachable: [Errno 111] ECONN$
2016-06-27 14:55:45.827 2117 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.109.1.7:5673 is unreachable: [Errno 111] ECONN$
2016-06-27 14:55:46.850 2117 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.109.1.9:5673 is unreachable: [Errno 111] ECONN$
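The before/after comparison above can also be checked mechanically: the effective reconnect delay is simply the gap between consecutive "AMQP server ... is unreachable" timestamps. A minimal sketch (the sample lines are truncated copies of the "after updates" excerpt above):

```python
from datetime import datetime

# Truncated copies of the "after updates" log lines shown above
log_lines = [
    "2016-06-27 14:55:44.805 2117 ERROR oslo.messaging._drivers.impl_rabbit",
    "2016-06-27 14:55:45.827 2117 ERROR oslo.messaging._drivers.impl_rabbit",
    "2016-06-27 14:55:46.850 2117 ERROR oslo.messaging._drivers.impl_rabbit",
]

def reconnect_intervals(lines):
    """Seconds between consecutive AMQP-unreachable errors.

    The first 23 characters of each oslo log line form the
    'YYYY-MM-DD HH:MM:SS.mmm' timestamp.
    """
    stamps = [datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")
              for line in lines]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

# Intervals close to 1 s indicate the default kombu_reconnect_delay
print(reconnect_intervals(log_lines))
```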

tags: removed: on-verification

Reviewed: https://review.openstack.org/316871
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=234335f3cbebd62d8ed2fee45a26a684f594a76c
Submitter: Jenkins
Branch: stable/6.1

commit 234335f3cbebd62d8ed2fee45a26a684f594a76c
Author: Alexey Deryugin <email address hidden>
Date: Mon May 16 17:14:12 2016 +0300

    Use default value of kombu_reconnect_delay for all services

    Current reconnection time is 5.0 which is pretty long.
    This patch switches kombu_reconnect_delay to its default
    value - 1.0 which is the same for all services.

    Change-Id: I8e0230bf48620a8b5e1d7339e419b9b682e5cbd8
    Closes-bug: #1560097
    Closes-Bug: #1440134

Reviewed: https://review.openstack.org/316866
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=1722961c12eea9031280f9622481d97730769951
Submitter: Jenkins
Branch: stable/7.0

commit 1722961c12eea9031280f9622481d97730769951
Author: Alexey Deryugin <email address hidden>
Date: Mon May 16 17:07:04 2016 +0300

    Use default value of kombu_reconnect_delay for all services

    Current reconnection time is 5.0 which is pretty long.
    This patch switches kombu_reconnect_delay to its default
    value - 1.0 which is the same for all services.

    Change-Id: I0d2418f940e7036ad71264da7ad10d7a32bb329b
    Closes-bug: #1560097
    Closes-Bug: #1440134

Change abandoned by Fuel DevOps Robot (<email address hidden>) on branch: stable/5.1
Review: https://review.openstack.org/316878
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Change abandoned by Fuel DevOps Robot (<email address hidden>) on branch: stable/6.0
Review: https://review.openstack.org/316885
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Ekaterina Shutova (eshutova) wrote :

Verified on 10.1 #1578.
After killing rabbit, reconnection time is about 1 sec.
root@node-3:~# grep -i ERROR /var/log/nova/nova-scheduler.log
2017-04-10 08:56:11.497 23579 ERROR oslo.messaging._drivers.impl_rabbit [-] [e151046d-4688-42a7-880c-01cef9758717] AMQP server on 10.109.11.4:5673 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds. Client port: 44298
2017-04-10 08:56:12.518 23579 ERROR oslo.messaging._drivers.impl_rabbit [-] [e151046d-4688-42a7-880c-01cef9758717] AMQP server on 10.109.11.4:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: 44298
2017-04-10 08:56:13.542 23579 ERROR oslo.messaging._drivers.impl_rabbit [-] [e151046d-4688-42a7-880c-01cef9758717] AMQP server on 10.109.11.5:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: 44298

Changed in fuel:
status: Fix Committed → Fix Released
Alexey Stupnikov (astupnikov) wrote :

MOS 5.1 and MOS 6.0 are no longer supported. Moving to Won't Fix.
