Metadata proxy startup can fail when Daemon class doesn't properly match running processes

Bug #1177416 reported by Brian Haley
Affects    Status         Importance   Assigned to    Milestone
neutron    Fix Released   High         Brian Haley
Grizzly    Fix Released   High         Gary Kotton

Bug Description

I came across a case where two metadata namespace proxy pid files contained the same pid, possibly due to a reboot of the network controller:

root@qu-network-controller:# for f in `ls -1 /var/lib/quantum/external/pids/*.pid`; do cat $f | grep 31857 && echo $f; done
31857
/var/lib/quantum/external/pids/91e99f72-6fb0-49e5-9fbc-0c11d013d66e.pid
31857
/var/lib/quantum/external/pids/dc8af719-e6a0-4cc7-92d0-b2bf309e4245.pid
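
The shell loop above checks one known pid. A more general check, finding every pid that appears in more than one pid file, could be sketched in Python as below. This is only an illustration and not part of quantum: the function name find_shared_pids is hypothetical, and the directory is the one used in the loop above.

```python
import collections
import glob
import os


def find_shared_pids(pid_dir):
    """Map each pid to the pid files that contain it, keeping only
    pids that appear in more than one file."""
    files_by_pid = collections.defaultdict(list)
    for path in sorted(glob.glob(os.path.join(pid_dir, '*.pid'))):
        try:
            with open(path) as f:
                pid = f.read().strip()
        except IOError:
            # Unreadable pid file; skip it.
            continue
        if pid:
            files_by_pid[pid].append(path)
    return {pid: paths for pid, paths in files_by_pid.items()
            if len(paths) > 1}


if __name__ == '__main__':
    # Directory taken from the report; adjust for other deployments.
    shared = find_shared_pids('/var/lib/quantum/external/pids')
    for pid, paths in shared.items():
        print(pid, paths)
```

Any pid reported here is claimed by several pid files, so at most one of those files can describe the process that is actually running.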

The pid in question was for a proxy for the namespace ending in d66e:

root@qu-network-controller:# cat /proc/31857/cmdline
python/usr/bin/quantum-ns-metadata-proxy--pid_file=/var/lib/quantum/external/pids/91e99f72-6fb0-49e5-9fbc-0c11d013d66e.pid--network_id=91e99f72-6fb0-49e5-9fbc-0c11d013d66e--state_path=/var/lib/quantum--metadata_port=80--debug--verbose--log-file=quantum-ns-metadata-proxy91e99f72-6fb0-49e5-9fbc-0c11d013d66e.log--log-dir=

Unfortunately, when quantum went to spawn the dhcp agent for the 4245 namespace, which also spawns a namespace proxy, the code in the Daemon class incorrectly matched this existing proxy and threw an exception.

From dhcp-agent.log:

2013-04-25 14:00:04 ERROR [quantum.agent.dhcp_agent] Unable to sync network state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/quantum/agent/dhcp_agent.py", line 155, in sync_state
    self.refresh_dhcp_helper(network_id)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/dhcp_agent.py", line 209, in refresh_dhcp_helper
    return self.enable_dhcp_helper(network_id)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/dhcp_agent.py", line 188, in enable_dhcp_helper
    self.enable_isolated_metadata_proxy(network)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/dhcp_agent.py", line 329, in enable_isolated_metadata_proxy
    pm.enable(callback)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/external_process.py", line 55, in enable
    ip_wrapper.netns.execute(cmd)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/ip_lib.py", line 407, in execute
    check_exit_code=check_exit_code)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/utils.py", line 61, in execute
    raise RuntimeError(m)
RuntimeError:
Command: ['sudo', 'quantum-rootwrap', '/etc/quantum/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-dc8af719-e6a0-4cc7-92d0-b2bf309e4245', 'quantum-ns-metadata-proxy', '--pid_file=/var/lib/quantum/external/pids/dc8af719-e6a0-4cc7-92d0-b2bf309e4245.pid', '--network_id=dc8af719-e6a0-4cc7-92d0-b2bf309e4245', '--state_path=/var/lib/quantum', '--metadata_port=80', '--debug', '--verbose', '--log-file=quantum-ns-metadata-proxydc8af719-e6a0-4cc7-92d0-b2bf309e4245.log', '--log-dir=/var/log/quantum']

Looking further in dhcp-agent.log, I found the actual problem:

2013-04-25 14:00:04 DEBUG [quantum.agent.linux.utils] Running command: ['sudo', 'cat', '/proc/31857/cmdline']\n2013-04-25 14:00:04 DEBUG [quantum.agent.linux.utils] \nCommand: ['sudo', 'cat', '/proc/31857/cmdline']\nExit code: 0\nStdout: 'python\\x00/usr/bin/quantum-ns-metadata-proxy\\x00-pid_file=/var/lib/quantum/external/pids/91e99f72-6fb0-49e5-9fbc-0c11d013d66e.pid\\x00network_id=91e99f72-6fb0-49e5-9fbc-0c11d013d66e\\x00state_path=/var/lib/quantum\\x00metadata_port=80\\x00debug\\x00verbose\\x00log-file=quantum-ns-metadata-proxy91e99f72-6fb0-49e5-9fbc-0c11d013d66e.log\\x00-log-dir=/var/log/quantum\\x00'\nStderr: ''\n2013-04-25 14:00:04 ERROR [quantum.agent.linux.daemon] Pidfile /var/lib/quantum/external/pids/dc8af719-e6a0-4cc7-92d0-b2bf309e4245.pid already exist. Daemon already running?\n"

That's the right pid file, but the cmdline belongs to the other proxy process.
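
For reference, /proc/<pid>/cmdline separates its arguments with NUL bytes, which is why the plain cat output earlier runs the arguments together. A minimal sketch of splitting that text and reading out which network a proxy belongs to follows. The helper names parse_cmdline and network_id_of are hypothetical and are not quantum functions; the --network_id= option is the one shown in the command in the traceback above.

```python
def parse_cmdline(raw):
    """Split the NUL-separated text of /proc/<pid>/cmdline into args."""
    return [arg for arg in raw.split('\x00') if arg]


def network_id_of(args):
    """Return the value of the --network_id= option, or None if absent."""
    prefix = '--network_id='
    for arg in args:
        if arg.startswith(prefix):
            return arg[len(prefix):]
    return None


if __name__ == '__main__':
    # Shortened sample modelled on the process in this report.
    sample = ('python\x00/usr/bin/quantum-ns-metadata-proxy\x00'
              '--network_id=91e99f72-6fb0-49e5-9fbc-0c11d013d66e\x00')
    print(network_id_of(parse_cmdline(sample)))
```

Comparing that value with the uuid in the pid file name shows immediately whether the pid file is stale.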

The Daemon class needs to check for the uuid when looking at cmdline so that it doesn't match the wrong process. I have a patch that fixes the problem by passing an additional argument at init time. I'll assign this to myself and send the change out.
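
The matching rule described here can be sketched as below. This is an illustration of the idea only, not the merged patch: the function is_expected_process and its parameters are hypothetical, and it takes the cmdline text as an argument instead of reading /proc itself.

```python
def is_expected_process(cmdline, procname, uuid=None):
    """Return True only if cmdline looks like the daemon we expect.

    Matching on procname alone is what misfired in this report: any
    process with the same name matches. When a uuid is given, it must
    appear in the command line as well.
    """
    if procname not in cmdline:
        return False
    return uuid is None or uuid in cmdline


if __name__ == '__main__':
    # Shortened command line of the proxy for the d66e network.
    running = ('python\x00/usr/bin/quantum-ns-metadata-proxy\x00'
               '--network_id=91e99f72-6fb0-49e5-9fbc-0c11d013d66e\x00')
    name = 'quantum-ns-metadata-proxy'
    # Name-only matching wrongly accepts the other network's proxy.
    print(is_expected_process(running, name))
    # Requiring the 4245 uuid rejects it, so the new proxy can start.
    print(is_expected_process(
        running, name, 'dc8af719-e6a0-4cc7-92d0-b2bf309e4245'))
```

Keeping uuid optional means callers that do not pass one keep the earlier name-only behaviour.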

Tags: l3-ipam-dhcp
Changed in quantum:
assignee: nobody → Brian Haley (brian-haley)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to quantum (master)

Fix proposed to branch: master
Review: https://review.openstack.org/28419

Changed in quantum:
status: New → In Progress
Changed in quantum:
importance: Undecided → High
tags: added: grizzly-backport-potential l3-ipam-dhcp
OpenStack Infra (hudson-openstack) wrote : Fix merged to quantum (master)

Reviewed: https://review.openstack.org/28419
Committed: http://github.com/openstack/quantum/commit/6db5b0b77d0dd8b0f5010a71c769a821881797c5
Submitter: Jenkins
Branch: master

commit 6db5b0b77d0dd8b0f5010a71c769a821881797c5
Author: Brian Haley <email address hidden>
Date: Tue May 7 11:06:29 2013 -0400

    Change Daemon class to better match process command lines.

    Add additional uuid argument Daemon class to help it better
    match output from /proc/$id/cmdline to the correct daemon.
    If there is a stale pid in the pidfile, and that process has
    the same name, then it could match accidentally and not
    start the daemon up properly.

    Fixes bug 1177416

    Change-Id: I1109ca73c539c5e96cbe3dbb55ce68c92013ee10

Changed in quantum:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to quantum (stable/grizzly)

Fix proposed to branch: stable/grizzly
Review: https://review.openstack.org/28934

Changed in quantum:
milestone: none → havana-1
OpenStack Infra (hudson-openstack) wrote : Fix merged to quantum (stable/grizzly)

Reviewed: https://review.openstack.org/28934
Committed: http://github.com/openstack/quantum/commit/83eae03e9dcb3e8637dfe9980772a524fd2c5501
Submitter: Jenkins
Branch: stable/grizzly

commit 83eae03e9dcb3e8637dfe9980772a524fd2c5501
Author: Brian Haley <email address hidden>
Date: Tue May 7 11:06:29 2013 -0400

    Change Daemon class to better match process command lines.

    Add additional uuid argument Daemon class to help it better
    match output from /proc/$id/cmdline to the correct daemon.
    If there is a stale pid in the pidfile, and that process has
    the same name, then it could match accidentally and not
    start the daemon up properly.

    Fixes bug 1177416

    Change-Id: I1109ca73c539c5e96cbe3dbb55ce68c92013ee10
    (cherry picked from commit 6db5b0b77d0dd8b0f5010a71c769a821881797c5)

tags: added: in-stable-grizzly
Gary Kotton (garyk)
tags: removed: grizzly-backport-potential
Thierry Carrez (ttx)
Changed in quantum:
status: Fix Committed → Fix Released
Alan Pevec (apevec)
tags: removed: in-stable-grizzly
Thierry Carrez (ttx)
Changed in neutron:
milestone: havana-1 → 2013.2