This failure happened just once. Levels are Ubuntu Raring 13.04, Grizzly Quantum packages at 1:2013.1.2-0ubuntu1.
I noticed the metadata namespace proxy hadn't started after the network node was booted. The l3-agent.log (was only at INFO) has:
2013-09-04 15:53:16 INFO [quantum.openstack.common.rpc.common] Connected to AMQP server on 10.0.10.10:5672
2013-09-04 15:53:16 INFO [quantum.agent.l3_agent] L3 agent started
2013-09-04 15:53:28 ERROR [quantum.agent.l3_agent] Failed synchronizing routers
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 638, in _sync_routers_task
self._process_routers(routers, all_routers=True)
File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 618, in _process_routers
self._router_added(r['id'], r)
File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 236, in _router_added
self._spawn_metadata_proxy(ri)
File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 270, in _spawn_metadata_proxy
pm.enable(callback)
File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/external_process.py", line 55, in enable
ip_wrapper.netns.execute(cmd)
File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/ip_lib.py", line 414, in execute
check_exit_code=check_exit_code)
File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/utils.py", line 61, in execute
raise RuntimeError(m)
RuntimeError:
Command: ['sudo', 'quantum-rootwrap', '/etc/quantum/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-fa2ec96d-d1f9-4af2-a022-cac171646aa7', 'quantum-ns-metadata-proxy', '--pid_file=/var/lib/quantum/external/pids/fa2ec96d-d1f9-4af2-a022-cac171646aa7.pid', '--router_id=fa2ec96d-d1f9-4af2-a022-cac171646aa7', '--state_path=/var/lib/quantum', '--metadata_port=9697', '--verbose', '--log-file=quantum-ns-metadata-proxyfa2ec96d-d1f9-4af2-a022-cac171646aa7.log', '--log-dir=/var/log/quantum']
Exit code: 1
Stdout: ''
Stderr: '2013-09-04 15:53:28 INFO [quantum.common.config] Logging enabled!\n2013-09-04 15:53:28 ERROR [quantum.agent.linux.daemon] Pidfile /var
/lib/quantum/external/pids/fa2ec96d-d1f9-4af2-a022-cac171646aa7.pid already exist. Daemon already running?\n'
And quantum-ns-metadata-proxyfa2ec96d-d1f9-4af2-a022-cac171646aa7.log has:
2013-08-29 19:04:04 INFO [quantum.common.config] Logging enabled!
2013-09-04 15:53:28 INFO [quantum.common.config] Logging enabled!
2013-09-04 15:53:28 ERROR [quantum.agent.linux.daemon] Pidfile /var/lib/quantum/external/pids/fa2ec96d-d1f9-4af2-a022-cac171646aa7.pid already exist. Daemon already running?
It is the same error message as https://bugs.launchpad.net/neutron/+bug/1177416 - but the patch from that bug was applied.
The file /lib/quantum/external/pids/fa2ec96d-d1f9-4af2-a022-cac171646aa7.pid had 2045 in it - but no process with pid 2045 was running when I checked - /proc/2045/ did not exist. The pid file was stale as its date was that of the previous launch.
The process call chain in short-hand is like this:
l3-agent --> sudo rootwrap... --> python rootwrap ip netns exec qrouter-uuid quantum-ns-metadata-proxy router_id=uuid... --> python quantum-ns-metadata-proxy router_id=uuid...
Now the code in external_process.py either didn't find a /proc/2045/cmdline, or if it did then that file did not have the strings 'python' and 'fa2ec96d-d1f9-4af2-a022-cac171646aa7'. But the code in daemon.py must have found a /proc/2045/cmdline and it must have had those strings. The only explaination I can give for this is that the python rootwrap process started by sudo just happened to get pid 2045 that time, and this is what daemon.py is_running() found. Its full command line would have looked like:
/usr/bin/python /usr/bin/quantum-rootwrap /etc/quantum/rootwrap.conf ip netns exec qrouter-fa2ec96d-d1f9-4af2-a022-cac171646aa7 quantum-ns-metadata-proxy --pid_file=/var/lib/quantum/external/pids/fa2ec96d-d1f9-4af2-a022-cac171646aa7.pid --router_id=fa2ec96d-d1f9-4af2-a022-cac171646aa7 --state_path=/var/lib/quantum --metadata_port=9697 --verbose --log-file=quantum-ns-metadata-proxyfa2ec96d-d1f9-4af2-a022-cac171646aa7.log --log-dir=/var/log/quantum
It has the strings 'python' and the router's uuid, so it would have matched. If my theory is right, then a possible fix would be to change the checks to not report cmdlines with 'ip\x00netns\x00exec' as a running daemon.
Attached a script spoof.py to help reproduce.
TEST 1
======
- stop l3-agent
- kill metadata ns proxy
$ ./spoof.py rootwrap ip netns exec qrouter- e907733f- 3292-4b0b- 80f7-b093285c5b d6 arping 1.2.3.4 data/neutron/ external/ pids/e907733f- 3292-4b0b- 80f7-b093285c5b d6.pid with data: 27907 python\ x00./spoof. py\x00rootwrap\ x00ip\x00netns\ x00exec\ x00qrouter- e907733f- 3292-4b0b- 80f7-b093285c5b d6\x00arping\ x001.2. 3.4\x00'
my pid is: 27907
created stale pid file /opt/stack/
/proc/27907/cmdline has: '/usr/bin/
ps -ef | grep metadata-proxy process. active thinks it's already running
result - metadata ns proxy did not start because external_
TEST 2
======
- stop l3-agent
- kill metadata ns proxy
$ ./spoof.py python rootwrap ip netns exec qrouter- e907733f- 3292-4b0b- 80f7-b093285c5b d6 metadata_ns_proxy blab data/neutron/ external/ pids/[' e907733f' , '3292', '4b0b', '80f7', 'b093285c5bd6'].pid with data: 15442 python\ x00./spoof. py\x00python\ x00rootwrap\ x00ip\x00netns\ x00exec\ x00qrouter- e907733f- 3292-4b0b- 80f7-b093285c5b d6\x00metadata_ ns_proxy\ x00blab\ x00'
my pid is: 15442
created stale pid file /opt/stack/
/proc/15442/cmdline has: '/usr/bin/
- try to start metadata ns proxy on console
$ /usr/bin/python /usr/local/ bin/neutron- ns-metadata- proxy --pid_file= /opt/stack/ data/neutron/ external/ pids/e907733f- 3292-4b0b- 80f7-b093285c5b d6.pid --metadata_ proxy_socket= /opt/stack/ data/neutron/ metadata_ proxy --router_ id=e907733f- 3292-4b0b- 80f7-b093285c5b d6 --state_ path=/opt/ stack/data/ neutron --metadata_ port=9697 --debug --verbose
result: agent.linux. daemon [-] Pidfile /opt/stack/ data/neutron/ external/ pids/e907733f- 3292-4b0b- 80f7-b093285c5b d6.pid already exist. Daemon already running?
neutron.