neutron-ovn-metadata-agent dies on broken namespace

Bug #2037102 reported by Felix Huettner
This bug affects 2 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Felix Huettner

Bug Description

neutron-ovn-metadata-agent uses network namespaces to separate the metadata services for individual networks. For each network it automatically creates or destroys an appropriate namespace.

If the metadata agent dies for reasons outside of its control (e.g. a SIGKILL) while it is destroying a namespace, a broken namespace can be left behind.

---
Background on pyroute2 namespace management:

Creating a network namespace works by:
1. Forking the process and doing everything in the new child
2. Ensuring /var/run/netns exists
3. Ensuring the file for the network namespace under /var/run/netns exists by creating a new empty file
4. Calling `unshare` with `CLONE_NEWNET` to move the process to a new network namespace
5. Creating a bind mount from `/proc/self/ns/net` to the file under /var/run/netns

Deleting a network namespace works the other way around (but shorter):
1. Unmounting the previously created bind mount
2. Deleting the file for the network namespace

---

If the neutron-ovn-metadata-agent is killed between step 1 and 2 of deleting the network namespace then the namespace file will still be around, but not point to any namespace.
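
A rough way to spot such leftovers, sketched under assumptions: a healthy namespace file is a bind mount and should therefore appear as a mount point in `/proc/self/mountinfo`, while a stale file does not. The helper names below are hypothetical, the `nsfs` filesystem type is what recent kernels report for namespace mounts, and this is not how neutron or pyroute2 detect the condition.

```python
import os


def netns_mount_points(mountinfo_text):
    """Return the mount points that mountinfo lists with filesystem type nsfs."""
    points = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # Layout: id parent major:minor root mount_point options
        #         [optional fields...] - fstype source super_options
        if "-" not in fields:
            continue
        sep = fields.index("-")
        if len(fields) > 4 and sep + 1 < len(fields) and fields[sep + 1] == "nsfs":
            points.add(fields[4])
    return points


def find_stale_netns_files(netns_dir, mountinfo_text):
    """Return names in netns_dir that are not nsfs mount points."""
    mounted = netns_mount_points(mountinfo_text)
    # /var/run is commonly a symlink to /run, so compare resolved paths.
    real_dir = os.path.realpath(netns_dir)
    return [
        name
        for name in sorted(os.listdir(real_dir))
        if os.path.join(real_dir, name) not in mounted
    ]
```

On a host this could be called with the contents of `/proc/self/mountinfo` and `/var/run/netns`. It assumes the caller reads mountinfo from the same mount namespace in which the network namespaces were bind-mounted.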

When `garbage_collect_namespace` checks whether the namespace is empty, it tries to enter the network namespace to list all devices in it. This raises an exception because the namespace can no longer be entered.
neutron-ovn-metadata-agent then crashes, and every subsequent start hits the same error and crashes again.

```
Traceback (most recent call last):
  File "/usr/local/bin/neutron-ovn-metadata-agent", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/neutron/cmd/eventlet/agents/ovn_metadata.py", line 24, in main
    metadata_agent.main()
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata_agent.py", line 41, in main
    agt.start()
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 277, in start
    self.sync()
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 61, in wrapped
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 349, in sync
    self.teardown_datapath(self._get_datapath_name(ns))
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 400, in teardown_datapath
    ip.garbage_collect_namespace()
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/linux/ip_lib.py", line 268, in garbage_collect_namespace
    if self.namespace_is_empty():
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/linux/ip_lib.py", line 263, in namespace_is_empty
    return not self.get_devices()
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/linux/ip_lib.py", line 180, in get_devices
    devices = privileged.get_device_names(self.namespace)
  File "/usr/local/lib/python3.9/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 609, in get_device_names
    in get_link_devices(namespace, **kwargs)]
  File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 333, in wrapped_f
    return self(f, *args, **kw)
  File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 423, in __call__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 360, in iter
    return fut.result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 426, in __call__
    result = fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/oslo_privsep/priv_context.py", line 271, in _wrap
    return self.channel.remote_call(name, args, kwargs,
  File "/usr/local/lib/python3.9/site-packages/oslo_privsep/daemon.py", line 215, in remote_call
    raise exc_type(*result[2])
OSError: [Errno 22] failed to open netns
```

Versions: as far as I know, this affects all versions

Reproduction: the easiest way is to create an empty file named `/var/run/netns/ovnmeta-<some-uuid>` and restart the neutron-ovn-metadata-agent. Alternatively, use a breakpoint or a well-timed kill command.
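
The file-based reproduction can be sketched as follows. The function name and its parameters are hypothetical; writing to the default directory `/var/run/netns` normally requires root, and restarting the agent is left to whatever service manager the deployment uses.

```python
import os
import uuid


def create_stale_netns_file(netns_dir="/var/run/netns", network_id=None):
    """Create an empty file that looks like a leftover ovnmeta namespace."""
    network_id = network_id or str(uuid.uuid4())
    path = os.path.join(netns_dir, "ovnmeta-" + network_id)
    # O_EXCL makes this fail instead of touching an existing,
    # possibly healthy, namespace file with the same name.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    os.close(fd)
    return path
```

Because nothing is bind-mounted onto the file, it has the same shape as the leftover described in this report.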

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/896251

Changed in neutron:
status: New → In Progress
Changed in neutron:
assignee: nobody → Felix Huettner (felix.huettner)
tags: added: ovn
Changed in neutron:
importance: Undecided → High
summary: - neutron-ovn-metadata-agent dies on broken namspace
+ neutron-ovn-metadata-agent dies on broken namespace
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/896251
Committed: https://opendev.org/openstack/neutron/commit/566fea3fed837b0130023303c770aade391d3d61
Submitter: "Zuul (22348)"
Branch: master

commit 566fea3fed837b0130023303c770aade391d3d61
Author: Felix Huettner <email address hidden>
Date: Fri Sep 22 16:25:10 2023 +0200

    fix netns deletion of broken namespaces

    normal network namespaces are bind-mounted to files under
    /var/run/netns. If a process deleting a network namespace gets killed
    during that operation there is the chance that the bind mount to the
    netns has been removed, but the file under /var/run/netns still exists.

    When the neutron-ovn-metadata-agent tries to clean up such network
    namespaces it first tries to validate that the network namespace is
    empty. For the cases described above this fails, as this network
    namespace no longer really exists, but is just a stray file lying
    around.

    To fix this we treat network namespaces where we get an `OSError` with
    errno 22 (Invalid Argument) as empty. The calls to pyroute2 to delete
    the namespace will then clean up the file.

    Additionally we add a guard to teardown_datapath to continue even if
    this fails. Failing to remove a datapath is not critical and in the
    worst case leaves a process and a network namespace running. However,
    previously it would also have prevented the creation of new datapaths,
    which is critical for VM startup.

    Closes-Bug: #2037102
    Change-Id: I7c43812fed5903f98a2e491076c24a8d926a59b4
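
The two ideas in this commit message can be illustrated with a minimal sketch. It is not the merged neutron code: `get_devices` and `teardown` are hypothetical callables standing in for the real device lookup and datapath teardown, and `teardown_all` is an invented helper name.

```python
import errno
import logging

LOG = logging.getLogger(__name__)


def namespace_is_empty(get_devices):
    """Return True if the namespace has no devices or cannot be entered."""
    try:
        return not get_devices()
    except OSError as exc:
        if exc.errno == errno.EINVAL:
            # errno 22: the file exists but no namespace is mounted on it,
            # so treat it as empty and let the deletion remove the file.
            return True
        raise


def teardown_all(names, teardown):
    """Tear down each datapath, logging failures instead of aborting."""
    failed = []
    for name in names:
        try:
            teardown(name)
        except Exception:
            LOG.exception("Failed to tear down %s, continuing", name)
            failed.append(name)
    return failed
```

Other `OSError` values are re-raised on purpose, so only the specific "file without a namespace" case is treated as empty.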

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
alisafari (alisafar1212) wrote :

Is it possible to have this patch on Antelope?

Revision history for this message
Felix Huettner (felix.huettner) wrote :

I'll trigger backports, but I'm not sure whether they will apply cleanly.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/neutron/+/905528

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/neutron/+/905529

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 24.0.0.0b1

This issue was fixed in the openstack/neutron 24.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/905528
Committed: https://opendev.org/openstack/neutron/commit/f07cc43964e9dea7a48bf8564944d1e05f4e22a8
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit f07cc43964e9dea7a48bf8564944d1e05f4e22a8
Author: Felix Huettner <email address hidden>
Date: Fri Sep 22 16:25:10 2023 +0200

    Closes-Bug: #2037102
    Change-Id: I7c43812fed5903f98a2e491076c24a8d926a59b4
    (cherry picked from commit 566fea3fed837b0130023303c770aade391d3d61)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/905529
Committed: https://opendev.org/openstack/neutron/commit/69c49c4ef24648f97d895bfaacd7336917634565
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 69c49c4ef24648f97d895bfaacd7336917634565
Author: Felix Huettner <email address hidden>
Date: Fri Sep 22 16:25:10 2023 +0200

    Closes-Bug: #2037102
    Change-Id: I7c43812fed5903f98a2e491076c24a8d926a59b4
    (cherry picked from commit 566fea3fed837b0130023303c770aade391d3d61)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/neutron/+/908695

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (unmaintained/wallaby)

Fix proposed to branch: unmaintained/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/913804

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/908695
Committed: https://opendev.org/openstack/neutron/commit/38ac22354d62820d5156113c907a91b55b2ab2c3
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 38ac22354d62820d5156113c907a91b55b2ab2c3
Author: Felix Huettner <email address hidden>
Date: Fri Sep 22 16:25:10 2023 +0200

    Conflicts:
            neutron/tests/unit/agent/ovn/metadata/test_agent.py

    Closes-Bug: #2037102
    Change-Id: I7c43812fed5903f98a2e491076c24a8d926a59b4
    (cherry picked from commit 566fea3fed837b0130023303c770aade391d3d61)
    (cherry picked from commit 69c49c4ef24648f97d895bfaacd7336917634565)

tags: added: in-stable-zed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (unmaintained/yoga)

Fix proposed to branch: unmaintained/yoga
Review: https://review.opendev.org/c/openstack/neutron/+/914002

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (unmaintained/xena)

Fix proposed to branch: unmaintained/xena
Review: https://review.opendev.org/c/openstack/neutron/+/914003

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 21.2.1

This issue was fixed in the openstack/neutron 21.2.1 release.
