Metadata ns proxy didn't start - pid already exist. Daemon already running?

Bug #1223369 reported by Darragh O'Reilly
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Expired
Medium
Unassigned

Bug Description

This failure happened just once. Levels are Ubuntu Raring 13.04, Grizzly Quantum packages at 1:2013.1.2-0ubuntu1.
I noticed the metadata namespace proxy hadn't started after the network node was booted. The l3-agent.log (was only at INFO) has:

2013-09-04 15:53:16 INFO [quantum.openstack.common.rpc.common] Connected to AMQP server on 10.0.10.10:5672
2013-09-04 15:53:16 INFO [quantum.agent.l3_agent] L3 agent started
2013-09-04 15:53:28 ERROR [quantum.agent.l3_agent] Failed synchronizing routers
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 638, in _sync_routers_task
    self._process_routers(routers, all_routers=True)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 618, in _process_routers
    self._router_added(r['id'], r)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 236, in _router_added
    self._spawn_metadata_proxy(ri)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 270, in _spawn_metadata_proxy
    pm.enable(callback)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/external_process.py", line 55, in enable
    ip_wrapper.netns.execute(cmd)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/ip_lib.py", line 414, in execute
    check_exit_code=check_exit_code)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/utils.py", line 61, in execute
    raise RuntimeError(m)
RuntimeError:
Command: ['sudo', 'quantum-rootwrap', '/etc/quantum/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-fa2ec96d-d1f9-4af2-a022-cac171646aa7', 'quantum-ns-metadata-proxy', '--pid_file=/var/lib/quantum/external/pids/fa2ec96d-d1f9-4af2-a022-cac171646aa7.pid', '--router_id=fa2ec96d-d1f9-4af2-a022-cac171646aa7', '--state_path=/var/lib/quantum', '--metadata_port=9697', '--verbose', '--log-file=quantum-ns-metadata-proxyfa2ec96d-d1f9-4af2-a022-cac171646aa7.log', '--log-dir=/var/log/quantum']
Exit code: 1
Stdout: ''
Stderr: '2013-09-04 15:53:28 INFO [quantum.common.config] Logging enabled!\n2013-09-04 15:53:28 ERROR [quantum.agent.linux.daemon] Pidfile /var
/lib/quantum/external/pids/fa2ec96d-d1f9-4af2-a022-cac171646aa7.pid already exist. Daemon already running?\n'

And quantum-ns-metadata-proxyfa2ec96d-d1f9-4af2-a022-cac171646aa7.log has:

2013-08-29 19:04:04 INFO [quantum.common.config] Logging enabled!
2013-09-04 15:53:28 INFO [quantum.common.config] Logging enabled!
2013-09-04 15:53:28 ERROR [quantum.agent.linux.daemon] Pidfile /var/lib/quantum/external/pids/fa2ec96d-d1f9-4af2-a022-cac171646aa7.pid already exist. Daemon already running?

It is the same error message as https://bugs.launchpad.net/neutron/+bug/1177416 - but the patch from that bug was applied.

The file /lib/quantum/external/pids/fa2ec96d-d1f9-4af2-a022-cac171646aa7.pid had 2045 in it - but no process with pid 2045 was running when I checked - /proc/2045/ did not exist. The pid file was stale as its date was that of the previous launch.

The process call chain in short-hand is like this:
l3-agent --> sudo rootwrap... --> python rootwrap ip netns exec qrouter-uuid quantum-ns-metadata-proxy router_id=uuid... --> python quantum-ns-metadata-proxy router_id=uuid...

Now the code in external_process.py either didn't find a /proc/2045/cmdline, or if it did then that file did not have the strings 'python' and 'fa2ec96d-d1f9-4af2-a022-cac171646aa7'. But the code in daemon.py must have found a /proc/2045/cmdline and it must have had those strings. The only explaination I can give for this is that the python rootwrap process started by sudo just happened to get pid 2045 that time, and this is what daemon.py is_running() found. Its full command line would have looked like:

/usr/bin/python /usr/bin/quantum-rootwrap /etc/quantum/rootwrap.conf ip netns exec qrouter-fa2ec96d-d1f9-4af2-a022-cac171646aa7 quantum-ns-metadata-proxy --pid_file=/var/lib/quantum/external/pids/fa2ec96d-d1f9-4af2-a022-cac171646aa7.pid --router_id=fa2ec96d-d1f9-4af2-a022-cac171646aa7 --state_path=/var/lib/quantum --metadata_port=9697 --verbose --log-file=quantum-ns-metadata-proxyfa2ec96d-d1f9-4af2-a022-cac171646aa7.log --log-dir=/var/log/quantum

It has the strings 'python' and the router's uuid, so it would have matched. If my theory is right, then a possible fix would be to change the checks to not report cmdlines with 'ip\x00netns\x00exec' as a running daemon.

Tags: l3-ipam-dhcp
Changed in neutron:
assignee: nobody → Darragh O'Reilly (darragh-oreilly)
Revision history for this message
Darragh O'Reilly (darragh-oreilly) wrote :

Attached a script spoof.py to help reproduce.

TEST 1
======
- stop l3-agent
- kill metadata ns proxy

$ ./spoof.py rootwrap ip netns exec qrouter-e907733f-3292-4b0b-80f7-b093285c5bd6 arping 1.2.3.4
my pid is: 27907
created stale pid file /opt/stack/data/neutron/external/pids/e907733f-3292-4b0b-80f7-b093285c5bd6.pid with data: 27907
/proc/27907/cmdline has: '/usr/bin/python\x00./spoof.py\x00rootwrap\x00ip\x00netns\x00exec\x00qrouter-e907733f-3292-4b0b-80f7-b093285c5bd6\x00arping\x001.2.3.4\x00'

ps -ef | grep metadata-proxy
result - metadata ns proxy did not start because external_process.active thinks it's already running

TEST 2
======
- stop l3-agent
- kill metadata ns proxy

$ ./spoof.py python rootwrap ip netns exec qrouter-e907733f-3292-4b0b-80f7-b093285c5bd6 metadata_ns_proxy blab
my pid is: 15442
created stale pid file /opt/stack/data/neutron/external/pids/['e907733f', '3292', '4b0b', '80f7', 'b093285c5bd6'].pid with data: 15442
/proc/15442/cmdline has: '/usr/bin/python\x00./spoof.py\x00python\x00rootwrap\x00ip\x00netns\x00exec\x00qrouter-e907733f-3292-4b0b-80f7-b093285c5bd6\x00metadata_ns_proxy\x00blab\x00'

- try to start metadata ns proxy on console

$ /usr/bin/python /usr/local/bin/neutron-ns-metadata-proxy --pid_file=/opt/stack/data/neutron/external/pids/e907733f-3292-4b0b-80f7-b093285c5bd6.pid --metadata_proxy_socket=/opt/stack/data/neutron/metadata_proxy --router_id=e907733f-3292-4b0b-80f7-b093285c5bd6 --state_path=/opt/stack/data/neutron --metadata_port=9697 --debug --verbose

result:
neutron.agent.linux.daemon [-] Pidfile /opt/stack/data/neutron/external/pids/e907733f-3292-4b0b-80f7-b093285c5bd6.pid already exist. Daemon already running?

Revision history for this message
Darragh O'Reilly (darragh-oreilly) wrote :

the last comment is missing the step to start the l3-agent in test1 - should be:

TEST 1
======
- stop l3-agent
- kill metadata ns proxy

$ ./spoof.py rootwrap ip netns exec qrouter-e907733f-3292-4b0b-80f7-b093285c5bd6 arping 1.2.3.4

- start the l3-agent

ps -ef | grep metadata-proxy
result - metadata ns proxy did not start because external_process.active thinks it's already running

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/47806

Changed in neutron:
status: New → In Progress
Changed in neutron:
importance: Undecided → Medium
tags: added: havana-rc-potential l3-ipam-dhcp
Thierry Carrez (ttx)
tags: added: havana-backport-potential
removed: havana-rc-potential
Changed in neutron:
status: In Progress → Confirmed
Alan Pevec (apevec)
tags: removed: havana-backport-potential
Revision history for this message
Miguel Lavalle (minsel) wrote :

Marked this bug as incomplete since it hasn't had activity in more that a year. Prposed fix has a -1 since more than a year ago

Changed in neutron:
status: Confirmed → Incomplete
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This bug is > 180 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

Changed in neutron:
assignee: Darragh O'Reilly (darragh-oreilly) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.