Sleep in RPC reconnect is affected by NTPd on neutron services' startup

Bug #1366004 reported by Miroslav Los
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron (Ubuntu)
Triaged
Low
Unassigned

Bug Description

Under certain situations, the neutron server and all its agents will hang for a while on startup and become unresponsive.

The issue can be demonstrated by doing neutron agent-list shortly after reboot:

$ neutron agent-list
Connection to neutron failed: Maximum attempts reached

or

$ neutron agent-list
+--------------------------------------+--------------------+-----------+-------+----------------+
| id | agent_type | host | alive | admin_state_up |
+--------------------------------------+--------------------+-----------+-------+----------------+
| 4306e16a-9786-4772-acbd-93c9911a6971 | Linux bridge agent | myhost | xxx | True |
| 9635d63a-1606-4baf-9093-e99f3eaf0552 | DHCP agent | myhost | xxx | True |
| 9cfb2966-5cc0-4528-9bf6-388703d0d203 | Metadata agent | myhost | xxx | True |
| e6bbaf50-adfb-47fd-9e22-e27624e16704 | L3 agent | myhost | xxx | True |
+--------------------------------------+--------------------+-----------+-------+----------------+

as opposed to them being reported alive.

Note that is its expected to see output (1) - server isn't started, or output (2) - [some] agents don't report being alive, for a while, but not too long, certainly not minutes.

The place where they hang seems to always be the RPC connections' setup to the AMQP server, in neutron.openstack.common.rpc.impl_kombu.Connection.reconnect, on the time.sleep(sleep_time) - final statement of that method.

I strongly believe it only happens if a NTP time step happens while the reconnection attempts are being made. The sleep_time variable is set to 1 initially and grows by 2 seconds on each retry, for a maximum of 30 retries.

However, it has been observed by stracing the process, that the underlying select(2) call, using which time.sleep is implemented in Python, seems to operate with a timeout parameter of more than 15 minutes on the affected machine, which approximately coincides with the NTP stepping difference for that machine. Associated neutron server/agent logs show a similar step of 15 minutes between messages at some point. If an agent run happens to not be affected, the message with the time skip is from elsewhere in the codebase.

I'd be seriously baffling if GNU C library's select(2) were truly affected by setting the clock, but I have no other explanation at this time.

A way to verify that the bad condition is happening is, for example on the DHCP agent:

# get PID of the agent and verify it's running:
$ ps aux | grep neutron-dhcp-agent
neutron 2740 0.3 0.0 110228 34976 ? Ss 16:38 0:00 /usr/bin/python /usr/bin/neutron-dhcp-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/dhcp_agent.ini --log-file=/var/log/neutron/dhcp-agent.log

# check that it's waiting in select with a large seconds parameter.
$ sudo strace -p 2740
Process 2740 attached
select(0, NULL, NULL, NULL, {753, 918063}^CProcess 2740 detached
 <detached ...>

The example above shows remaining timeout of almost 753.9 seconds. The value reported is not the original value, it decreases as you strace the process again.

A workaround to this issue would be forcing the neutron upstart jobs to only run after ntp steps the clock. Since ntp is an sysv job and not an upstart job, the easiest way to achieve this is waiting for all sysv jobs to finish, which coincides with the 'rc' upstart job's stopping.

Hence, the proposed workaround is to edit the neutron upstart jobs:

sudo sed -i -e '/start on/s/$/ and stopped rc/' /etc/init/neutron*conf

Description: Ubuntu 14.04.1 LTS
Release: 14.04

ii neutron-common 1:2014.1.2-0ubuntu1.1 all Neutron is a virtual network service for Openstack - common
ii neutron-dhcp-agent 1:2014.1.2-0ubuntu1.1 all Neutron is a virtual network service for Openstack - DHCP agent
ii neutron-l3-agent 1:2014.1.2-0ubuntu1.1 all Neutron is a virtual network service for Openstack - l3 agent
ii neutron-metadata-agent 1:2014.1.2-0ubuntu1.1 all Neutron is a virtual network service for Openstack - metadata agent
ii neutron-plugin-linuxbridge 1:2014.1.2-0ubuntu1.1 all Neutron is a virtual network service for Openstack - linuxbridge plugin
ii neutron-plugin-linuxbridge-agent 1:2014.1.2-0ubuntu1.1 all Neutron is a virtual network service for Openstack - linuxbridge plugin agent
ii neutron-plugin-ml2 1:2014.1.2-0ubuntu1.1 all Neutron is a virtual network service for Openstack - ML2 plugin
ii neutron-server 1:2014.1.2-0ubuntu1.1 all Neutron is a virtual network service for Openstack - server

James Page (james-page)
Changed in neutron (Ubuntu):
importance: Undecided → Low
James Page (james-page)
Changed in neutron (Ubuntu):
status: New → Triaged
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.