Bug #1738768 “Dataplane downtime when containers are stopped/res...” : Bugs : neutron

Daniel Alvarez (dalvarezs) on 2017-12-18

description:

updated

Revision history for this message

Daniel Alvarez (dalvarezs) wrote on 2017-12-18:

#1

Further details:

This happens because the containers are mounting host /run in their own /run and namespaces are left behind after stopping/restarting the namespaces as these bug show [0][1]. I applied [2] and now stopping the container will still cause dataplane downtime but also restarting containers simply won't work (we may need additional bug for this).

Namespaces can't be now seen from outside the containers:

[heat-admin@overcloud-controller-2 ~]$ sudo ip netns | grep qrouter
RTNETLINK answers: Invalid argument
RTNETLINK answers: Invalid argument
[heat-admin@overcloud-controller-2 ~]$

But from inside the container, they can:

[heat-admin@overcloud-controller-2 ~]$ sudo docker exec --user root -it 9f8a322c4a3c bash
()[root@overcloud-controller-2 /]# ip netns | grep qrouter
RTNETLINK answers: Invalid argument
RTNETLINK answers: Invalid argument
qrouter-5244e91c-f533-4128-9289-f37c9656792c

However, l3 agent fails to initialize because it can't access to them after restart:

()[root@overcloud-controller-2 /]# ip netns exec qrouter-5244e91c-f533-4128-9289-f37c9656792c ip a
RTNETLINK answers: Invalid argument
setting the network namespace "qrouter-5244e91c-f533-4128-9289-f37c9656792c" failed: Invalid argument

If I manually delete the namespace from inside the container and restart it, it'll work again:

()[root@overcloud-controller-2 /]# ip netns del qrouter-5244e91c-f533-4128-9289-f37c9656792c
RTNETLINK answers: Invalid argument

()[root@overcloud-controller-2 /]# ip netns del qrouter-5244e91c-f533-4128-9289-f37c9656792c
Cannot remove namespace file "/var/run/netns/qrouter-5244e91c-f533-4128-9289-f37c9656792c": No such file or directory

[heat-admin@overcloud-controller-2 ~]$ sudo docker restart 9f8a322c4a3c

And now ping to the FIP works back again:

(overcloud) [stack@undercloud ~]$ sudo ping 10.0.0.131 -i 0.2
PING 10.0.0.131 (10.0.0.131) 56(84) bytes of data.
64 bytes from 10.0.0.131: icmp_seq=1 ttl=63 time=38.5 ms
64 bytes from 10.0.0.131: icmp_seq=2 ttl=63 time=6.58 ms
64 bytes from 10.0.0.131: icmp_seq=3 ttl=63 time=5.28 ms
64 bytes from 10.0.0.131: icmp_seq=4 ttl=63 time=2.71 ms
64 bytes from 10.0.0.131: icmp_seq=5 ttl=63 time=0.980 ms

Further details:

This happens because the containers are mounting host /run in their own /run and namespaces are left behind after stopping/restarting the namespaces as these bug show [0][1]. I applied [2] and now stopping the container will still cause dataplane downtime but also restarting containers simply won't work (we may need additional bug for this).

Namespaces can't be now seen from outside the containers:

[heat-admin@overcloud-controller-2 ~]$ sudo ip netns | grep qrouter
RTNETLINK answers: Invalid argument
RTNETLINK answers: Invalid argument
[heat-admin@overcloud-controller-2 ~]$

But from inside the container, they can:

[heat-admin@overcloud-controller-2 ~]$ sudo docker exec --user root -it 9f8a322c4a3c bash
()[root@overcloud-controller-2 /]# ip netns | grep qrouter 
RTNETLINK answers: Invalid argument 
RTNETLINK answers: Invalid argument 
qrouter-5244e91c-f533-4128-9289-f37c9656792c

However, l3 agent fails to initialize because it can't access to them after restart:

()[root@overcloud-controller-2 /]# ip netns exec qrouter-5244e91c-f533-4128-9289-f37c9656792c ip a 
RTNETLINK answers: Invalid argument 
setting the network namespace "qrouter-5244e91c-f533-4128-9289-f37c9656792c" failed: Invalid argument

If I manually delete the namespace from inside the container and restart it, it'll work again:

()[root@overcloud-controller-2 /]# ip netns del qrouter-5244e91c-f533-4128-9289-f37c9656792c
RTNETLINK answers: Invalid argument

()[root@overcloud-controller-2 /]# ip netns del qrouter-5244e91c-f533-4128-9289-f37c9656792c 
Cannot remove namespace file "/var/run/netns/qrouter-5244e91c-f533-4128-9289-f37c9656792c": No such file or directory

[heat-admin@overcloud-controller-2 ~]$ sudo docker restart 9f8a322c4a3c

And now ping to the FIP works back again:

(overcloud) [stack@undercloud ~]$ sudo ping 10.0.0.131 -i 0.2
PING 10.0.0.131 (10.0.0.131) 56(84) bytes of data.
64 bytes from 10.0.0.131: icmp_seq=1 ttl=63 time=38.5 ms
64 bytes from 10.0.0.131: icmp_seq=2 ttl=63 time=6.58 ms
64 bytes from 10.0.0.131: icmp_seq=3 ttl=63 time=5.28 ms
64 bytes from 10.0.0.131: icmp_seq=4 ttl=63 time=2.71 ms
64 bytes from 10.0.0.131: icmp_seq=5 ttl=63 time=0.980 ms

Revision history for this message

Lujin Luo (luo-lujin) wrote on 2017-12-19:

#2

(1) could you please provide which version of Neutron you are using? master branch I guess?
(2) i think you forgot to paste the references you mentioned in #1
(3) from what you described, you stopped all 3 containers running l3 agent, which means you do not have any running l3 agents now, shouldn't this lead to dataplane downtime for sure?

Lujin Luo (luo-lujin) on 2017-12-19

Changed in neutron:
status:	New → Incomplete

Revision history for this message

Daniel Alvarez (dalvarezs) wrote on 2017-12-19:

#4

@Lujin:

>> (1) could you please provide which version of Neutron you are using? master branch I guess?
It's not latest master branch but latest promoted packages in RDO:

openstack-tripleo-common-8.1.1-0.20171130034833.0e92cba.el7.centos.noarch
openstack-tripleo-puppet-elements-8.0.0-0.20171127180031.cc2c715.el7.centos.noarch
openstack-tripleo-ui-8.0.1-0.20171129193834.1e42711.el7.centos.noarch
openstack-tripleo-common-containers-8.1.1-0.20171130034833.0e92cba.el7.centos.noarch
openstack-tripleo-validations-8.0.1-0.20171129140336.c1f2069.el7.centos.noarch
openstack-tripleo-heat-templates-8.0.0-0.20171130031741.4df242c.el7.centos.noarch
openstack-tripleo-image-elements-8.0.0-0.20171118092222.90b9a25.el7.centos.noarch
openstack-kolla-5.0.0-0.20171107075441.61495b1.el7.centos.noarch

()[root@overcloud-controller-2 /]# rpm -qa | grep neutron
python-neutron-12.0.0-0.20171206144209.1ca38a1.el7.centos.noarch
python-neutron-lbaas-12.0.0-0.20171206032035.0c76484.el7.centos.noarch
openstack-neutron-lbaas-12.0.0-0.20171206032035.0c76484.el7.centos.noarch
python2-neutronclient-6.5.0-0.20171023215239.355983d.el7.centos.noarch
openstack-neutron-common-12.0.0-0.20171206144209.1ca38a1.el7.centos.noarch
python-neutron-fwaas-12.0.0-0.20171206094459.b5b4491.el7.centos.noarch
openstack-neutron-fwaas-12.0.0-0.20171206094459.b5b4491.el7.centos.noarch
openstack-neutron-ml2-12.0.0-0.20171206144209.1ca38a1.el7.centos.noarch
python2-neutron-lib-1.11.0-0.20171129185804.ff5ee17.el7.centos.noarch
openstack-neutron-12.0.0-0.20171206144209.1ca38a1.el7.centos.noarch

>> (2) i think you forgot to paste the references you mentioned in #1
Right:

[0] https://bugs.launchpad.net/kolla/+bug/1616268
[1] https://bugs.launchpad.net/tripleo/+bug/1734333
[2] https://github.com/openstack/tripleo-heat-templates/commit/2e3a91f58bb48d4e7ab88258fbd704975cf1c79c

>> (3) from what you described, you stopped all 3 containers running l3 agent, which means you do not have any running l3 agents now, shouldn't this lead to dataplane downtime for sure?

In non-containerized environments, if everything is up and running and you stop l3 agents, dataplane remains working (namespaces are still there, ports are connected, flows installed, etc.). Obviously you'll lose control plane for L3 but that's expected. The scenario I'm describing is different since dataplane is lost as well which IMO it's a regression.

Thanks,
Daniel

@Lujin:

>> (1) could you please provide which version of Neutron you are using? master branch I guess?
It's not latest master branch but latest promoted packages in RDO:

openstack-tripleo-common-8.1.1-0.20171130034833.0e92cba.el7.centos.noarch
openstack-tripleo-puppet-elements-8.0.0-0.20171127180031.cc2c715.el7.centos.noarch
openstack-tripleo-ui-8.0.1-0.20171129193834.1e42711.el7.centos.noarch
openstack-tripleo-common-containers-8.1.1-0.20171130034833.0e92cba.el7.centos.noarch
openstack-tripleo-validations-8.0.1-0.20171129140336.c1f2069.el7.centos.noarch
openstack-tripleo-heat-templates-8.0.0-0.20171130031741.4df242c.el7.centos.noarch
openstack-tripleo-image-elements-8.0.0-0.20171118092222.90b9a25.el7.centos.noarch 
openstack-kolla-5.0.0-0.20171107075441.61495b1.el7.centos.noarch

()[root@overcloud-controller-2 /]# rpm -qa | grep neutron
python-neutron-12.0.0-0.20171206144209.1ca38a1.el7.centos.noarch
python-neutron-lbaas-12.0.0-0.20171206032035.0c76484.el7.centos.noarch
openstack-neutron-lbaas-12.0.0-0.20171206032035.0c76484.el7.centos.noarch
python2-neutronclient-6.5.0-0.20171023215239.355983d.el7.centos.noarch
openstack-neutron-common-12.0.0-0.20171206144209.1ca38a1.el7.centos.noarch
python-neutron-fwaas-12.0.0-0.20171206094459.b5b4491.el7.centos.noarch
openstack-neutron-fwaas-12.0.0-0.20171206094459.b5b4491.el7.centos.noarch
openstack-neutron-ml2-12.0.0-0.20171206144209.1ca38a1.el7.centos.noarch
python2-neutron-lib-1.11.0-0.20171129185804.ff5ee17.el7.centos.noarch
openstack-neutron-12.0.0-0.20171206144209.1ca38a1.el7.centos.noarch

>> (2) i think you forgot to paste the references you mentioned in #1
Right:

[0] https://bugs.launchpad.net/kolla/+bug/1616268
[1] https://bugs.launchpad.net/tripleo/+bug/1734333
[2] https://github.com/openstack/tripleo-heat-templates/commit/2e3a91f58bb48d4e7ab88258fbd704975cf1c79c

>> (3) from what you described, you stopped all 3 containers running l3 agent, which means you do not have any running l3 agents now, shouldn't this lead to dataplane downtime for sure?

In non-containerized environments, if everything is up and running and you stop l3 agents, dataplane remains working (namespaces are still there, ports are connected, flows installed, etc.). Obviously you'll lose control plane for L3 but that's expected. The scenario I'm describing is different since dataplane is lost as well which IMO it's a regression.

Thanks,
Daniel

Lujin Luo (luo-lujin) on 2017-12-20

tags:

added: l3

Assaf Muller (amuller) on 2017-12-21

Changed in neutron:
status:	Incomplete → Confirmed
importance:	Undecided → Critical
Changed in tripleo:
status:	New → Confirmed

Emilien Macchi (emilienm) on 2017-12-22

Changed in tripleo:
status:	Confirmed → Triaged
importance:	Undecided → High
milestone:	none → queens-3

Emilien Macchi (emilienm) on 2018-01-26

Changed in tripleo:
milestone:	queens-3 → queens-rc1

Brent Eagles (beagles) on 2018-02-09

Changed in tripleo:
assignee:	nobody → Brent Eagles (beagles)

OpenStack Infra (hudson-openstack) on 2018-02-09

Changed in tripleo:
status:	Triaged → In Progress

Revision history for this message

Brent Eagles (beagles) wrote on 2018-02-09:

#6

Proposed TripleO patch is here:

https://review.openstack.org/#/c/542858/

Alex Schultz (alex-schultz) on 2018-03-02

Changed in tripleo:
milestone:	queens-rc1 → rocky-1

OpenStack Infra (hudson-openstack) on 2018-03-14

Changed in tripleo:
assignee:	Brent Eagles (beagles) → Jiří Stránský (jistr)

Jiří Stránský (jistr) on 2018-03-14

Changed in tripleo:
assignee:	Jiří Stránský (jistr) → Brent Eagles (beagles)

OpenStack Infra (hudson-openstack) on 2018-03-16

Changed in tripleo:
assignee:	Brent Eagles (beagles) → Jiří Stránský (jistr)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-03-21: Fix merged to tripleo-heat-templates (master)

#7

Reviewed: https://review.openstack.org/552073
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=ae085825e22cb4ce7bf877087c2e324b8bec1f03
Submitter: Zuul
Branch: master

commit ae085825e22cb4ce7bf877087c2e324b8bec1f03
Author: Jiri Stransky <email address hidden>
Date: Mon Mar 12 17:02:36 2018 +0100

Add pre_upgrade_rolling_tasks

    The resultin pre_upgrade_rolling_steps_playbook will be executed in a
    node-by-node rolling fashion at the beginning of major upgrade
    workflow (before upgrade_steps_playbook).

    The current intended use case is special handling of L3 agent upgrade
    when moving Neutron services into containers. Special care needs to be
    taken in this case to preserve L3 connectivity of instances (with
    regard to dnsmasq and keepalived sub-processes of L3 agent).

The playbook can be run before the main upgrade like this:

openstack overcloud upgrade run --roles overcloud --playbook pre_upgrade_rolling_steps_playbook.yaml

    Partial-Bug: #1738768
    Change-Id: Icb830f8500bb80fd15036e88fcd314bf2c54445d
    Implements: blueprint major-upgrade-workflow

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-03-26: Fix proposed to tripleo-heat-templates (stable/queens)

#8

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/556454

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-03-26: Fix merged to tripleo-heat-templates (stable/queens)

#9

Reviewed: https://review.openstack.org/556454
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=402a0483eaa149951c493ec8a194e665bdbc09c1
Submitter: Zuul
Branch: stable/queens

commit 402a0483eaa149951c493ec8a194e665bdbc09c1
Author: Jiri Stransky <email address hidden>
Date: Mon Mar 12 17:02:36 2018 +0100

Add pre_upgrade_rolling_tasks

    The resultin pre_upgrade_rolling_steps_playbook will be executed in a
    node-by-node rolling fashion at the beginning of major upgrade
    workflow (before upgrade_steps_playbook).

    The current intended use case is special handling of L3 agent upgrade
    when moving Neutron services into containers. Special care needs to be
    taken in this case to preserve L3 connectivity of instances (with
    regard to dnsmasq and keepalived sub-processes of L3 agent).

The playbook can be run before the main upgrade like this:

openstack overcloud upgrade run --roles overcloud --playbook pre_upgrade_rolling_steps_playbook.yaml

    Partial-Bug: #1738768
    Change-Id: Icb830f8500bb80fd15036e88fcd314bf2c54445d
    Implements: blueprint major-upgrade-workflow
    (cherry picked from commit ae085825e22cb4ce7bf877087c2e324b8bec1f03)

tags:

added: in-stable-queens

Alex Schultz (alex-schultz) on 2018-04-20

Changed in tripleo:
milestone:	rocky-1 → rocky-2

Emilien Macchi (emilienm) on 2018-06-05

Changed in tripleo:
milestone:	rocky-2 → rocky-3

Emilien Macchi (emilienm) on 2018-07-26

Changed in tripleo:
milestone:	rocky-3 → rocky-rc1

Alex Schultz (alex-schultz) on 2018-08-14

Changed in tripleo:
milestone:	rocky-rc1 → stein-1

Juan Antonio Osorio Robles (juan-osorio-robles) on 2018-10-30

Changed in tripleo:
milestone:	stein-1 → stein-2

Emilien Macchi (emilienm) on 2019-01-13

Changed in tripleo:
milestone:	stein-2 → stein-3

Alex Schultz (alex-schultz) on 2019-03-14

Changed in tripleo:
milestone:	stein-3 → stein-rc1

Revision history for this message

Jiří Stránský (jistr) wrote on 2019-04-12:

#10

AFAIK this should be fixed, looking at https://review.openstack.org/#/c/542858/. Closing on TripleO side, please shout if i'm wrong :).

Changed in tripleo:
status:	In Progress → Fix Released

Revision history for this message

Bernard Cafarelli (bcafarel) wrote on 2019-04-12:

#11

And this was not affecting non-containerized neutron, so nothing more to do on the neutron side, closing too (same, please shout if it's incorrect)

Changed in neutron:
status:	Confirmed → Invalid

neutron

Dataplane downtime when containers are stopped/restarted

Bug Description

Other bug subscribers

Remote bug watches

Affects		Status	Importance	Assigned to	Milestone
	neutron	Invalid	Critical	Unassigned
	tripleo	Fix Released	High	Jiří Stránský	tripleo stein-rc1