Comment 3 for bug 1436414

Ilya Shakhat (shakhat) wrote :

Reconstruction of workflow sequence:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

03:47:47 - 03:47:50
--------------------------
The OCF script /usr/lib/ocf/resource.d/fuel/ocf-neutron-dhcp-agent is running its "neutron_dhcp_agent_stop" function (this is shown by the message "OpenStack DHCP Agent (neutron-dhcp-agent) stopped", https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/ocf/ocf-neutron-dhcp-agent#L633).

During this time the function kills all processes (https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/ocf/ocf-neutron-dhcp-agent#L599) by sending SIGTERM, which shows up as a SIGTERM message in the dhcp-agent log. At least 2 processes are not killed in time, so the script tries to kill them again 3 seconds later. kill then reports that it could not find 2 pids, so they had apparently exited in the meantime.
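
The kill-and-retry behaviour described above can be sketched roughly as follows. This is an illustration only, not the code of the OCF script: the function name terminate_with_retry, its parameters, and the use of SIGTERM for the second attempt are assumptions. Only the 3 second wait and the "pid not found" outcome come from the logs.

```python
import os
import signal
import time


def terminate_with_retry(pids, grace_seconds=3.0, kill=os.kill, sleep=time.sleep):
    """Send SIGTERM to each pid, wait, then signal the same pids again.

    Returns the pids that could no longer be found on the second attempt,
    which corresponds to kill reporting "no such process".
    The kill and sleep parameters exist only so the sketch can be exercised
    without real processes (hypothetical, not part of the OCF script).
    """
    signalled = []
    for pid in pids:
        try:
            kill(pid, signal.SIGTERM)
            signalled.append(pid)
        except ProcessLookupError:
            pass  # already gone before the first attempt

    sleep(grace_seconds)

    not_found = []
    for pid in signalled:
        try:
            # Assumption: the retry signal used by the real script is unknown.
            kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            not_found.append(pid)  # exited during the grace period
    return not_found
```

A pid showing up in the returned list is harmless: it means the process handled the first SIGTERM, just not before the retry was scheduled.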

03:47:50
------------
The OCF script logs that the agent is stopped (https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/ocf/ocf-neutron-dhcp-agent#L633)

03:48:47
-----------
Pacemaker reports that the stop operation timed out after 60 seconds, while q-agent-cleanup.py is still running:
<27>Mar 25 03:48:47 node-14 crmd[6977]: error: process_lrm_event: LRM operation p_neutron-dhcp-agent_stop_0 (407) Timed Out (timeout=60000ms)
<29>Mar 25 03:48:47 node-14 crmd[6977]: notice: process_lrm_event: node-14.domain.tld-p_neutron-dhcp-agent_stop_0:407 [ 2015-03-25 03:47:50,827 - INFO - Started: /usr/bin/q-agent-cleanup.py --agent=dhcp
--cleanup-ports\n ]
This means that the OCF script is still at https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/ocf/ocf-neutron-dhcp-agent#L639 and the function has not finished.
As a result, Pacemaker considers the stop operation hung and marks the resource as unmanaged. The same happens on all 3 controllers, leaving OpenStack without any DHCP agents.
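
The timestamps are consistent with the 60 second timeout being counted from the start of the stop operation, not from the start of the cleanup. A small check, assuming the stop began at 03:47:47 (the start of the first interval above; the log excerpts do not give its exact millisecond):

```python
from datetime import datetime, timedelta

# Assumption: stop operation began at the start of the first interval.
stop_started = datetime(2015, 3, 25, 3, 47, 47)
# From the log line: "2015-03-25 03:47:50,827 - INFO - Started: ..."
cleanup_started = datetime(2015, 3, 25, 3, 47, 50, 827000)
# From the crmd message: timeout=60000ms
stop_timeout = timedelta(milliseconds=60000)

deadline = stop_started + stop_timeout
cleanup_budget = deadline - cleanup_started

print("stop deadline:", deadline.time())
print("time left for cleanup (s):", cleanup_budget.total_seconds())
```

Under that assumption the deadline falls at 03:48:47, matching the crmd error, and q-agent-cleanup.py has a little under 57 seconds to finish.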

q-agent-cleanup is slow when the number of namespaces is high, e.g. __collect_ports_for_namespace is called for every namespace. Even under low load, a call to "ip netns exec <namespace> ip l show" takes 0.1 seconds, which adds up to 200 seconds for 2k namespaces, more than 3 times the Pacemaker timeout.
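
The estimate can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch: the helper names are made up for illustration, and it assumes one 0.1 second "ip netns exec" call per namespace and nothing else, so the real cleanup can only be slower.

```python
PER_CALL_MS = 100        # observed cost of one "ip netns exec ... ip l show"
STOP_TIMEOUT_MS = 60000  # Pacemaker stop timeout from the crmd log


def estimated_cleanup_ms(namespace_count, per_call_ms=PER_CALL_MS):
    """Lower bound on cleanup time: one call per namespace, run sequentially."""
    return namespace_count * per_call_ms


def max_namespaces_within(timeout_ms=STOP_TIMEOUT_MS, per_call_ms=PER_CALL_MS):
    """Largest namespace count whose per-namespace calls fit in the timeout."""
    return timeout_ms // per_call_ms


total_ms = estimated_cleanup_ms(2000)
print("estimated cleanup for 2k namespaces (s):", total_ms / 1000)
print("ratio to stop timeout:", total_ms / STOP_TIMEOUT_MS)
print("namespaces that fit in the timeout:", max_namespaces_within())
```

By this estimate the cleanup only fits into the stop timeout for up to about 600 namespaces, ignoring the few seconds already spent killing processes.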