Connection timed out

Bug #1602274 reported by Pavel
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mirantis OpenStack
Invalid
Medium
Pavel

Bug Description

Detailed bug description:
Problem discovered on MOS 9.0 ISO #495 RC2

Problem was discovered during execution of NovaServers.resize_server rally test: AMQP server is unreachable 6 times from 97.

Nova log file atached to this bug.
Nova api log of this issue:

2016-07-07 13:09:44.688 19298 DEBUG keystoneauth.session [-] REQ: curl -g -i -X GET http://192.168.0.2:35357/v3/auth/tokens -H "X-Subject-Token: {SHA1}b4af682fec7d7d1d9eae361e225206444e9248c2" -H "User-Agent: python-keystoneclient" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}63adf02cc36a824b4c6f0ee6ba6003f06ea88ae9" _http_log_request /usr/lib/python2.7/dist-packages/keystoneauth1/session.py:248
2016-07-07 13:09:48.363 19400 DEBUG nova.metadata.wsgi.server [req-17a7f302-f66c-432d-8ebf-ec180b6da5cc - - - - -] (19400) accepted ('240.0.0.2', 42160) server /usr/lib/python2.7/dist-packages/eventlet/wsgi.py:867
2016-07-07 13:09:48.366 19400 INFO nova.metadata.wsgi.server [-] 240.0.0.2 "OPTIONS / HTTP/1.0" status: 200 len: 234 time: 0.0007420
2016-07-07 13:09:48.811 19311 DEBUG nova.osapi_compute.wsgi.server [req-ea429186-3ca0-4e49-bd6b-fd333b1ba765 - - - - -] (19311) accepted ('240.0.0.2', 45402) server /usr/lib/python2.7/dist-packages/eventlet/wsgi.py:867
2016-07-07 13:09:48.813 19311 DEBUG nova.api.openstack.wsgi [-] Calling method '<bound method Versions.index of <nova.api.openstack.compute.versions.Versions object at 0x7f8517191690>>' _process_stack /usr/lib/python2.7/dist-packages/nova/api/openstack/wsgi.py:699
2016-07-07 13:09:48.814 19311 INFO nova.osapi_compute.wsgi.server [-] 240.0.0.2 "OPTIONS / HTTP/1.0" status: 200 len: 501 time: 0.0018778
2016-07-07 13:09:56.098 19271 DEBUG oslo.messaging._drivers.impl_rabbit [-] Received recoverable error from kombu: on_error /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py:683
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 436, in _ensured
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit return fun(*args, **kwargs)
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 508, in __call__
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit return fun(*args, channel=channels[0], **kwargs), channels[0]
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py", line 734, in execute_method
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit method()
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1028, in _consume
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit self.connection.drain_events(timeout=poll_timeout)
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 275, in drain_events
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit return self.transport.drain_events(self.connection, **kwargs)
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/transport/pyamqp.py", line 95, in drain_events
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit return connection.drain_events(**kwargs)
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/connection.py", line 303, in drain_events
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit chanmap, None, timeout=timeout,
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/connection.py", line 366, in _wait_multiple
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit channel, method_sig, args, content = read_timeout(timeout)
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/connection.py", line 337, in read_timeout
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit return self.method_reader.read_method()
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/method_framing.py", line 189, in read_method
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit raise m
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit error: [Errno 110] Connection timed out
2016-07-07 13:09:56.098 19271 ERROR oslo.messaging._drivers.impl_rabbit
2016-07-07 13:09:56.100 19271 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.0.7:5673 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.

Steps to reproduce:
 Start rally test NovaServers.resize_server during the load.
Expected results:
 Test passed. Server flavor was chenged.
Actual result:
 Test failed.
Reproducibility:
 Iterations: 97, Failures: 6
Workaround:
 n/a
Impact:
 unknown
Description of the environment:
 * 10 baremetal nodes:
    - CPU: 12 x 2.10 GHz
    - Disks: 2 drives (SSD - 80 GB, HDD - 931.5 GB), 1006.0 GB total
    - Memory: 2 x 16.0 GB, 32.0 GB total
    - NUMA topology: 1 NUMA node
 * Node roles:
   - 3 controllers (1 was is offline because of disk problems)
   - 7 computes
 * Details:
   - OS on controllers: Mitaka on Ubuntu 14.04
   - OS on computes: RHEL
   - Compute: KVM
   - Neutron with VLAN segmentation

Additional information:
 Diagnostic snapshot download link: http://mos-scale-share.mirantis.com/fuel-snapshot-2016-07-11_08-57-28.tar.gz

Revision history for this message
Pavel (pshvetsov) wrote :
Pavel (pshvetsov)
affects: nova → mos
Changed in mos:
importance: Undecided → Medium
milestone: none → 9.0
Changed in mos:
assignee: nobody → MOS Oslo (mos-oslo)
milestone: 9.0 → 9.1
status: New → Confirmed
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Pavel, we need an env with repro to investigate the issue.

tl;dr

Logs from var/log/remote/node-17.domain.tld/rabbitmq.log.3.gz (http://paste.openstack.org/show/532892/) show that at 2016-07-07T12:51:41 node-17 lost contact with both its peers (node-11 and node-19) and was shut down. Later, at 2016-07-07T13:09:40 node-19 also lost connection with node-11 and also was shut down (rabbitmq log from node-19 http://paste.openstack.org/show/532893/). Eventually nodes recovered and joined back to node-11.

In both cases it is not clear what was the root cause, but strange entries from var/log/remote/node-19.domain.tld/lrmd.log.4.gz like
2016-07-07T13:11:11.951448+00:00 notice: notice: vip__vrouter_monitor_5000:8313:stderr [ cat: /sys/class/net/br-mgmt/carrier: Invalid argument]
2016-07-07T13:14:18.256347+00:00 warning: warning: Notification of client crmd/d5013cf8-70df-49fb-8dec-0842aa92dbe3 failed

indicate that something was not right with the system and most probably it was either overloaded or network bandwidth was exhausted. Since the environment is small and the test executed seems to be rather light for controllers, the issue looks strange and needs to be investigated on a live environment.

Changed in mos:
status: Confirmed → Incomplete
assignee: MOS Oslo (mos-oslo) → Pavel (pshvetsov)
Revision history for this message
Pavel (pshvetsov) wrote :

Sorry, but env with this case was used by another expert and already redeployed.
If i can reproduce this case in the future - i'll contact with you, ok?

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Setting as Invalid after a month in Incomplete. Feel free to re-open if the issue occurs again.

Changed in mos:
status: Incomplete → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.