Dataplane downtime when containers are stopped/restarted

Bug #1738768 reported by Daniel Alvarez
This bug affects 2 people
Affects   Status         Importance   Assigned to      Milestone
neutron   Invalid        Critical     Unassigned
tripleo   Fix Released   High         Jiří Stránský

Bug Description

I have deployed an HA environment with 3 controllers and 3 computes using ML2/OVS, and observed dataplane downtime when restarting or stopping the neutron-l3 container on the controllers. This is what I did:

1. Created a network, subnet, router, a VM and attached a FIP to the VM
2. Left a ping running on the undercloud to the FIP
3. Stopped l3 container in controller-0.
   Result: Observed some packet loss while the router was failed over to controller-1
4. Stopped l3 container in controller-1
   Result: Observed some packet loss while the router was failed over to controller-2
5. Stopped l3 container in controller-2
   Result: No traffic to/from the FIP at all.

(overcloud) [stack@undercloud ~]$ ping 10.0.0.131
PING 10.0.0.131 (10.0.0.131) 56(84) bytes of data.
64 bytes from 10.0.0.131: icmp_seq=1 ttl=63 time=1.83 ms
64 bytes from 10.0.0.131: icmp_seq=2 ttl=63 time=1.56 ms

<---- Last l3 container was stopped here (step 5 above)---->

From 10.0.0.1 icmp_seq=10 Destination Host Unreachable
From 10.0.0.1 icmp_seq=11 Destination Host Unreachable
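
Steps 1 and 2 above could be scripted roughly as follows. This is only a sketch: the resource names, the image, the flavor, the subnet range and the external network name are placeholders that do not come from the report; only the floating IP 10.0.0.131 is taken from the ping output. By default the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch of reproduction steps 1-2. Everything except the floating IP is a
# hypothetical placeholder; adjust to the environment under test.
RUN=${RUN-echo}                  # prints commands by default; use RUN= to execute
EXT_NET=${EXT_NET:-public}       # assumed name of the external network
IMAGE=${IMAGE:-cirros}           # assumed image name
FLAVOR=${FLAVOR:-m1.tiny}        # assumed flavor name
FIP=${FIP:-10.0.0.131}           # floating IP seen in the ping output above

create_test_topology() {
    # Step 1: network, subnet, router, VM and floating IP
    $RUN openstack network create net-test
    $RUN openstack subnet create --network net-test \
        --subnet-range 192.168.99.0/24 subnet-test
    $RUN openstack router create router-test
    $RUN openstack router set --external-gateway "$EXT_NET" router-test
    $RUN openstack router add subnet router-test subnet-test
    $RUN openstack server create --image "$IMAGE" --flavor "$FLAVOR" \
        --network net-test vm-test
    $RUN openstack floating ip create --floating-ip-address "$FIP" "$EXT_NET"
    $RUN openstack server add floating ip vm-test "$FIP"
}
```

Calling create_test_topology with RUN= set in the environment would run the commands for real; step 2 is then simply a ping to "$FIP" left running on the undercloud while the l3 containers are stopped one by one.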

When containers are stopped, I guess that the qrouter namespace is not accessible by the kernel:

[heat-admin@overcloud-controller-2 ~]$ sudo ip netns e qrouter-5244e91c-f533-4128-9289-f37c9656792c ip a
RTNETLINK answers: Invalid argument
RTNETLINK answers: Invalid argument
setting the network namespace "qrouter-5244e91c-f533-4128-9289-f37c9656792c" failed: Invalid argument
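
To check every namespace on a node rather than a single one, something like the following could be used. It is a minimal sketch, not part of the original report: check_netns and NETNS_EXEC are hypothetical names, and the function only repeats the "ip netns exec" test shown above for each name it reads on stdin.

```shell
#!/bin/sh
# Report which network namespaces can still be entered.
# NETNS_EXEC is a hypothetical override hook so the check can be dry-run.
NETNS_EXEC=${NETNS_EXEC-"ip netns exec"}

check_netns() {
    status=0
    # `ip netns list` may print "name (id: N)"; only the first field is used.
    while read -r ns rest; do
        [ -n "$ns" ] || continue
        # Run a trivial command inside the namespace; stdin is redirected so
        # the command cannot consume the list being read by the loop.
        if $NETNS_EXEC "$ns" true </dev/null >/dev/null 2>&1; then
            echo "OK   $ns"
        else
            echo "FAIL $ns"
            status=1
        fi
    done
    return $status
}
```

Run as root, for example with "ip netns list 2>/dev/null | check_netns"; a non-zero exit status means at least one namespace could not be entered.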

This means that we're getting not only control plane downtime but also dataplane downtime, which could be seen as a regression compared to non-containerized environments.
The same would happen with DHCP: I expect instances to be unable to fetch IP addresses from dnsmasq when the dhcp containers are stopped.

description: updated
Daniel Alvarez (dalvarezs) wrote :

Further details:

This happens because the containers mount the host's /run as their own /run, and namespaces are left behind after stopping/restarting the containers, as these bugs show [0][1]. I applied [2]; with it, stopping the container still causes dataplane downtime, and restarting the containers simply doesn't work (we may need an additional bug for this).

Namespaces can no longer be seen from outside the containers:

[heat-admin@overcloud-controller-2 ~]$ sudo ip netns | grep qrouter
RTNETLINK answers: Invalid argument
RTNETLINK answers: Invalid argument
[heat-admin@overcloud-controller-2 ~]$

But from inside the container, they can:

[heat-admin@overcloud-controller-2 ~]$ sudo docker exec --user root -it 9f8a322c4a3c bash
()[root@overcloud-controller-2 /]# ip netns | grep qrouter
RTNETLINK answers: Invalid argument
RTNETLINK answers: Invalid argument
qrouter-5244e91c-f533-4128-9289-f37c9656792c

However, the l3 agent fails to initialize because it can't access them after the restart:

()[root@overcloud-controller-2 /]# ip netns exec qrouter-5244e91c-f533-4128-9289-f37c9656792c ip a
RTNETLINK answers: Invalid argument
setting the network namespace "qrouter-5244e91c-f533-4128-9289-f37c9656792c" failed: Invalid argument

If I manually delete the namespace from inside the container and restart it, it'll work again:

()[root@overcloud-controller-2 /]# ip netns del qrouter-5244e91c-f533-4128-9289-f37c9656792c
RTNETLINK answers: Invalid argument

()[root@overcloud-controller-2 /]# ip netns del qrouter-5244e91c-f533-4128-9289-f37c9656792c
Cannot remove namespace file "/var/run/netns/qrouter-5244e91c-f533-4128-9289-f37c9656792c": No such file or directory

[heat-admin@overcloud-controller-2 ~]$ sudo docker restart 9f8a322c4a3c

And now ping to the FIP works again:

(overcloud) [stack@undercloud ~]$ sudo ping 10.0.0.131 -i 0.2
PING 10.0.0.131 (10.0.0.131) 56(84) bytes of data.
64 bytes from 10.0.0.131: icmp_seq=1 ttl=63 time=38.5 ms
64 bytes from 10.0.0.131: icmp_seq=2 ttl=63 time=6.58 ms
64 bytes from 10.0.0.131: icmp_seq=3 ttl=63 time=5.28 ms
64 bytes from 10.0.0.131: icmp_seq=4 ttl=63 time=2.71 ms
64 bytes from 10.0.0.131: icmp_seq=5 ttl=63 time=0.980 ms
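
The manual workaround above (delete the stale namespace from inside the container, then restart the container) can be sketched as a small helper. reset_stale_netns is a hypothetical name; the docker and ip commands are the ones used in the session above. By default it only prints what it would run.

```shell
#!/bin/sh
# Sketch of the manual workaround: remove the stale namespace entry from
# inside the L3 agent container, then restart that container.
RUN=${RUN-echo}    # prints commands by default; use RUN= to execute

reset_stale_netns() {
    container=$1   # container ID or name, e.g. 9f8a322c4a3c
    ns=$2          # namespace name, e.g. qrouter-<router uuid>
    # In the session above the delete printed an RTNETLINK error but still
    # removed the namespace file, so a failure here is not treated as fatal.
    $RUN docker exec --user root "$container" ip netns del "$ns" || true
    $RUN docker restart "$container"
}
```

For example, "reset_stale_netns 9f8a322c4a3c qrouter-5244e91c-f533-4128-9289-f37c9656792c" prints the two commands from the session; with RUN= set in the environment it would execute them.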

Lujin Luo (luo-lujin) wrote :

(1) could you please provide which version of Neutron you are using? master branch I guess?
(2) i think you forgot to paste the references you mentioned in #1
(3) from what you described, you stopped all 3 containers running l3 agent, which means you do not have any running l3 agents now, shouldn't this lead to dataplane downtime for sure?

Lujin Luo (luo-lujin)
Changed in neutron:
status: New → Incomplete
Daniel Alvarez (dalvarezs) wrote :

@Lujin:

>> (1) could you please provide which version of Neutron you are using? master branch I guess?
It's not latest master branch but latest promoted packages in RDO:

openstack-tripleo-common-8.1.1-0.20171130034833.0e92cba.el7.centos.noarch
openstack-tripleo-puppet-elements-8.0.0-0.20171127180031.cc2c715.el7.centos.noarch
openstack-tripleo-ui-8.0.1-0.20171129193834.1e42711.el7.centos.noarch
openstack-tripleo-common-containers-8.1.1-0.20171130034833.0e92cba.el7.centos.noarch
openstack-tripleo-validations-8.0.1-0.20171129140336.c1f2069.el7.centos.noarch
openstack-tripleo-heat-templates-8.0.0-0.20171130031741.4df242c.el7.centos.noarch
openstack-tripleo-image-elements-8.0.0-0.20171118092222.90b9a25.el7.centos.noarch
openstack-kolla-5.0.0-0.20171107075441.61495b1.el7.centos.noarch

()[root@overcloud-controller-2 /]# rpm -qa | grep neutron
python-neutron-12.0.0-0.20171206144209.1ca38a1.el7.centos.noarch
python-neutron-lbaas-12.0.0-0.20171206032035.0c76484.el7.centos.noarch
openstack-neutron-lbaas-12.0.0-0.20171206032035.0c76484.el7.centos.noarch
python2-neutronclient-6.5.0-0.20171023215239.355983d.el7.centos.noarch
openstack-neutron-common-12.0.0-0.20171206144209.1ca38a1.el7.centos.noarch
python-neutron-fwaas-12.0.0-0.20171206094459.b5b4491.el7.centos.noarch
openstack-neutron-fwaas-12.0.0-0.20171206094459.b5b4491.el7.centos.noarch
openstack-neutron-ml2-12.0.0-0.20171206144209.1ca38a1.el7.centos.noarch
python2-neutron-lib-1.11.0-0.20171129185804.ff5ee17.el7.centos.noarch
openstack-neutron-12.0.0-0.20171206144209.1ca38a1.el7.centos.noarch

>> (2) i think you forgot to paste the references you mentioned in #1
Right:

[0] https://bugs.launchpad.net/kolla/+bug/1616268
[1] https://bugs.launchpad.net/tripleo/+bug/1734333
[2] https://github.com/openstack/tripleo-heat-templates/commit/2e3a91f58bb48d4e7ab88258fbd704975cf1c79c

>> (3) from what you described, you stopped all 3 containers running l3 agent, which means you do not have any running l3 agents now, shouldn't this lead to dataplane downtime for sure?

In non-containerized environments, if everything is up and running and you stop the l3 agents, the dataplane keeps working (namespaces are still there, ports are connected, flows are installed, etc.). Obviously you'll lose the control plane for L3, but that's expected. The scenario I'm describing is different since the dataplane is lost as well, which IMO is a regression.

Thanks,
Daniel

Lujin Luo (luo-lujin)
tags: added: l3
Assaf Muller (amuller)
Changed in neutron:
status: Incomplete → Confirmed
importance: Undecided → Critical
Changed in tripleo:
status: New → Confirmed
Changed in tripleo:
status: Confirmed → Triaged
importance: Undecided → High
milestone: none → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Brent Eagles (beagles)
Changed in tripleo:
assignee: nobody → Brent Eagles (beagles)
Changed in tripleo:
status: Triaged → In Progress
Brent Eagles (beagles) wrote :

Proposed TripleO patch is here:

https://review.openstack.org/#/c/542858/

Changed in tripleo:
milestone: queens-rc1 → rocky-1
Changed in tripleo:
assignee: Brent Eagles (beagles) → Jiří Stránský (jistr)
Changed in tripleo:
assignee: Jiří Stránský (jistr) → Brent Eagles (beagles)
Changed in tripleo:
assignee: Brent Eagles (beagles) → Jiří Stránský (jistr)
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/552073
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=ae085825e22cb4ce7bf877087c2e324b8bec1f03
Submitter: Zuul
Branch: master

commit ae085825e22cb4ce7bf877087c2e324b8bec1f03
Author: Jiri Stransky <email address hidden>
Date: Mon Mar 12 17:02:36 2018 +0100

    Add pre_upgrade_rolling_tasks

    The resulting pre_upgrade_rolling_steps_playbook will be executed in a
    node-by-node rolling fashion at the beginning of major upgrade
    workflow (before upgrade_steps_playbook).

    The current intended use case is special handling of L3 agent upgrade
    when moving Neutron services into containers. Special care needs to be
    taken in this case to preserve L3 connectivity of instances (with
    regard to dnsmasq and keepalived sub-processes of L3 agent).

    The playbook can be run before the main upgrade like this:

    openstack overcloud upgrade run --roles overcloud --playbook pre_upgrade_rolling_steps_playbook.yaml

    Partial-Bug: #1738768
    Change-Id: Icb830f8500bb80fd15036e88fcd314bf2c54445d
    Implements: blueprint major-upgrade-workflow

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/556454

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/queens)

Reviewed: https://review.openstack.org/556454
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=402a0483eaa149951c493ec8a194e665bdbc09c1
Submitter: Zuul
Branch: stable/queens

commit 402a0483eaa149951c493ec8a194e665bdbc09c1
Author: Jiri Stransky <email address hidden>
Date: Mon Mar 12 17:02:36 2018 +0100

    Add pre_upgrade_rolling_tasks

    The resulting pre_upgrade_rolling_steps_playbook will be executed in a
    node-by-node rolling fashion at the beginning of major upgrade
    workflow (before upgrade_steps_playbook).

    The current intended use case is special handling of L3 agent upgrade
    when moving Neutron services into containers. Special care needs to be
    taken in this case to preserve L3 connectivity of instances (with
    regard to dnsmasq and keepalived sub-processes of L3 agent).

    The playbook can be run before the main upgrade like this:

    openstack overcloud upgrade run --roles overcloud --playbook pre_upgrade_rolling_steps_playbook.yaml

    Partial-Bug: #1738768
    Change-Id: Icb830f8500bb80fd15036e88fcd314bf2c54445d
    Implements: blueprint major-upgrade-workflow
    (cherry picked from commit ae085825e22cb4ce7bf877087c2e324b8bec1f03)

tags: added: in-stable-queens
Changed in tripleo:
milestone: rocky-1 → rocky-2
Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
milestone: stein-3 → stein-rc1
Jiří Stránský (jistr) wrote :

AFAIK this should be fixed, looking at https://review.openstack.org/#/c/542858/. Closing on TripleO side, please shout if I'm wrong :).

Changed in tripleo:
status: In Progress → Fix Released
Bernard Cafarelli (bcafarel) wrote :

And this was not affecting non-containerized neutron, so nothing more to do on the neutron side, closing too (same, please shout if it's incorrect)

Changed in neutron:
status: Confirmed → Invalid