Juno to Kilo upgrade: lxc-container can not mount /var/run/netns and losing all namespaces

Bug #1487130 reported by Bjoern Teipel on 2015-08-20
This bug affects 2 people
Affects            Importance  Assigned to
openstack-ansible  High        Kevin Carter
Kilo               High        Jesse Pretorius
Trunk              High        Kevin Carter

Bug Description

During the container upgrades, the neutron containers in particular lose all of their network namespaces.
This causes DHCP and floating IP connectivity issues: DHCP leases are not served during this time, so instances lose network connectivity.
The issue occurs very early in the run_upgrade script, roughly at lxc-hosts-setup.yml, and I expect it to persist until setup-openstack.yml (neutron portion) has finished.

2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent Traceback (most recent call last):
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent File "/usr/local/lib/python2.7/dist-packages/neutron/common/utils.py", line 341, in call
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent return func(*args, **kwargs)
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/l3_agent.py", line 892, in process_router
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent self.internal_network_added(ri, p)
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/l3_agent.py", line 1537, in internal_network_added
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent ri.is_ha)
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/l3_agent.py", line 1517, in _internal_network_added
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent prefix=prefix)
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/linux/interface.py", line 425, in plug
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent namespace2=namespace)
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/linux/ip_lib.py", line 134, in add_veth
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent self.ensure_namespace(namespace2)
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/linux/ip_lib.py", line 148, in ensure_namespace
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent ip = self.netns.add(name)
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/linux/ip_lib.py", line 526, in add
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent self._as_root('add', name, use_root_namespace=True)
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/linux/ip_lib.py", line 242, in _as_root
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent kwargs.get('use_root_namespace', False))
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/linux/ip_lib.py", line 74, in _as_root
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent log_fail_as_error=self.log_fail_as_error)
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/linux/ip_lib.py", line 86, in _execute
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent log_fail_as_error=log_fail_as_error)
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/linux/utils.py", line 84, in execute
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent raise RuntimeError(m)
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent RuntimeError:
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'add', 'qrouter-51c88e3e-5097-421f-99bd-ae52522ba2b7']
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent Exit code: 1
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent Stdout: ''
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent Stderr: 'mount --make-shared /var/run/netns failed: Permission denied\n'
2015-08-20 16:14:19.416 687 TRACE neutron.agent.l3_agent

Bjoern Teipel (bjoern-teipel) wrote :

It appears that we have an issue with multiple lxc.aa_profile entries inside the lxc config:

root@xxxx-infra01:/opt/rpc-openstack/os-ansible-deployment# grep profile /var/lib/lxc/*/config
/var/lib/lxc/xxxx-infra01_cinder_api_container-75e87d70/config:lxc.aa_profile = lxc-openstack
/var/lib/lxc/xxxx-infra01_cinder_api_container-75e87d70/config:lxc.aa_profile = unconfined
/var/lib/lxc/xxxx-infra01_cinder_scheduler_container-bbeb71cc/config:lxc.aa_profile = lxc-openstack
/var/lib/lxc/xxxx-infra01_cinder_scheduler_container-bbeb71cc/config:lxc.aa_profile = unconfined
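
The same check can be scripted. The following is a minimal Python sketch, not part of the project's tooling: the helper name `profile_values` and the sample config text are made up for illustration, and the parser only assumes the simple `key = value` line format seen in the grep output above.

```python
def profile_values(config_text, key="lxc.aa_profile"):
    """Return every value assigned to `key`, in file order."""
    values = []
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            values.append(value.strip())
    return values


# Hypothetical config content mirroring the duplicated entries above.
sample = """\
lxc.utsname = example_container
lxc.aa_profile = lxc-openstack
lxc.aa_profile = unconfined
"""

profiles = profile_values(sample)
if len(set(profiles)) > 1:
    print("conflicting profiles:", profiles)
```

Running this over the contents of each /var/lib/lxc/*/config file would flag any container whose config carries more than one distinct profile.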

Reviewed: https://review.openstack.org/216301
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=ffb701f8a3a325e0c321fb2d3e37eea25e66a8af
Submitter: Jenkins
Branch: master

commit ffb701f8a3a325e0c321fb2d3e37eea25e66a8af
Author: kevin <email address hidden>
Date: Mon Aug 24 16:24:02 2015 +0100

    Removed default lxc profile on container create

    Having the lxc container create role drop the lxc-openstack apparmor
    profile on all containers anytime it's executed leads to the possibility
    of the lxc container create task overwriting the running profile on a given
    container. If this happens it's likely to cause service interruption until the
    correct profile is loaded for all containers affected by the action.

    To fix this issue the default "lxc-openstack" profile has been removed from the
    lxc container create task and added to all plays that are known to be executed
    within an lxc container. This will ensure that the profile is untouched on
    subsequent runs of the lxc-container-create.yml play.

    Change-Id: Ifa4640be60c18f1232cc7c8b281fb1dfc0119e56
    Closes-Bug: 1487130

Changed in openstack-ansible:
status: Invalid → Fix Committed
Dolph Mathews (dolph) on 2015-08-26
summary: Juno to Kilo upgrade: lxc-container can not mount /var/run/netns and
- loosing all namespaces
+ losing all namespaces

Change abandoned by Ian Cordasco (<email address hidden>) on branch: master
Review: https://review.openstack.org/217367
Reason: The actual problem is with scripts/run-upgrade.sh

Reviewed: https://review.openstack.org/217367
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=999cdf52d793e171d0b26c6da1bf786972a6f137
Submitter: Jenkins
Branch: master

commit 999cdf52d793e171d0b26c6da1bf786972a6f137
Author: Ian Cordasco <email address hidden>
Date: Wed Aug 26 14:42:36 2015 -0500

    Remove temporary upgrade task that removes profile

    When performing an upgrade, this project strives to have minimal
    downtime for VMs that are running. By removing the apparmor profile as a
    precondition for upgrades, when the container create role runs, the
    profile will default to contained (the most restrictive profile). This
    causes instance downtime since neutron can not create network
    namespaces.

    Related-bug: 1487130
    Closes-bug: 1489144
    Change-Id: Ife7aab044c7cb882a89c6b108b2d66f5e39aa10c

Bjoern Teipel (bjoern-teipel) wrote :

lxc-containers-create.yml seems to set lxc-openstack as the aa_profile without differentiating between containers that need unconfined and those that don't. It also appends an additional lxc.aa_profile line to the containers already using unconfined, so the parameter appears twice in their config.
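
To illustrate the duplication, here is a hedged Python sketch of an idempotent update that leaves exactly one lxc.aa_profile line in a config. This is not what the committed fix does (the fix moves the profile out of the container create task and into the individual plays); the function name `set_single_key` and the sample input are assumptions for illustration only.

```python
def set_single_key(config_text, value, key="lxc.aa_profile"):
    """Return config_text with exactly one `key = value` line."""
    kept = []
    for line in config_text.splitlines():
        name, sep, _ = line.partition("=")
        if sep and name.strip() == key:
            continue  # drop every existing assignment of the key
        kept.append(line)
    # Append the single desired assignment at the end.
    kept.append(f"{key} = {value}")
    return "\n".join(kept) + "\n"


# Hypothetical config with the duplicated parameter.
before = (
    "lxc.utsname = example_container\n"
    "lxc.aa_profile = lxc-openstack\n"
    "lxc.aa_profile = unconfined\n"
)
after = set_single_key(before, "unconfined")
print(after, end="")
```

Because existing assignments are removed before the new one is appended, running the function again on its own output changes nothing, which is the property the append-only behaviour described above lacks.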

Reviewed: https://review.openstack.org/217640
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=11ae61e96758a50acee9d1dbd32e349d73de89b0
Submitter: Jenkins
Branch: kilo

commit 11ae61e96758a50acee9d1dbd32e349d73de89b0
Author: Ian Cordasco <email address hidden>
Date: Wed Aug 26 14:42:36 2015 -0500

    Remove temporary upgrade task that removes profile

    When performing an upgrade, this project strives to have minimal
    downtime for VMs that are running. By removing the apparmor profile as a
    precondition for upgrades, when the container create role runs, the
    profile will default to contained (the most restrictive profile). This
    causes instance downtime since neutron can not create network
    namespaces.

    Related-bug: 1487130
    Closes-bug: 1489144
    Change-Id: Ife7aab044c7cb882a89c6b108b2d66f5e39aa10c
    (cherry picked from commit 999cdf52d793e171d0b26c6da1bf786972a6f137)

tags: added: in-kilo

Reviewed: https://review.openstack.org/217014
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=5b9c77a55709925fea349ab128288218a50ef5a8
Submitter: Jenkins
Branch: kilo

commit 5b9c77a55709925fea349ab128288218a50ef5a8
Author: kevin <email address hidden>
Date: Mon Aug 24 16:24:02 2015 +0100

    Change AppArmor profile application order

    This patch is a combination of two patches committed to master as the
    first patch on its own results in continual gate check fails:

    Patch 1:

    Removed default lxc profile on container create

    Having the lxc container create role drop the lxc-openstack apparmor
    profile on all containers anytime it's executed leads to the possibility
    of the lxc container create task overwriting the running profile on a given
    container. If this happens it's likely to cause service interruption until the
    correct profile is loaded for all containers affected by the action.

    To fix this issue the default "lxc-openstack" profile has been removed from the
    lxc container create task and added to all plays that are known to be executed
    within an lxc container. This will ensure that the profile is untouched on
    subsequent runs of the lxc-container-create.yml play.

    Closes-Bug: 1487130
    (cherry picked from commit ffb701f8a3a325e0c321fb2d3e37eea25e66a8af)

    Patch 2:

    Wait for container ssh after apparmor profile update

    This patch adds a wait for the container's sshd to be available
    after the container's apparmor profile is updated. When the
    profile is updated the container is restarted, so this wait is
    essential to the success of the playbook's completion.

    It also includes 3 retries, which have been found to improve the
    rate of success.

    Due to an upstream change in behaviour with netaddr 0.7.16 we
    need to pin the package to a lower version until Neutron is
    adjusted and we bump the Neutron SHA.

    Closes-Bug: #1490142
    (cherry picked from commit a40cb5811822181369ee3269bc57d8bd19f05913)

    Change-Id: Ifa4640be60c18f1232cc7c8b281fb1dfc0119e56
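
The wait-and-retry behaviour described in Patch 2 can be sketched as follows. The playbook itself is Ansible, so this Python version is only an illustration of the idea: the function name `wait_for_port`, its parameters, and the default delay and timeout are all assumptions, with only the retry count of 3 taken from the commit message.

```python
import socket
import time


def wait_for_port(host, port, retries=3, delay=1.0, timeout=2.0):
    """Return True once a TCP connection to host:port succeeds."""
    for attempt in range(retries):
        try:
            # A successful connect means something (e.g. sshd) is listening.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Not reachable yet; pause before the next attempt, if any.
            if attempt < retries - 1:
                time.sleep(delay)
    return False
```

A caller would invoke something like `wait_for_port(container_address, 22)` after the container restart triggered by the profile change, where `container_address` is a placeholder for the container's management address, and continue only if it returns True.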

This issue was fixed in the openstack/openstack-ansible 11.2.11 release.

This issue was fixed in the openstack/openstack-ansible 11.2.12 release.

This issue was fixed in the openstack/openstack-ansible 11.2.14 release.
