l3 agent downtime can cause tenant VM outages during upgrade

Bug #1671504 reported by Steven Hardy
This bug affects 1 person
Affects   Status    Importance   Assigned to       Milestone
tripleo   Invalid   High         Marios Andreou

Bug Description

We currently upgrade the l3 agent on the controllers during the upgrade_tasks steps, which are not batched (we take down the services on all nodes at the same time).

It would be better to instead use upgrade_batch_tasks to minimise downtime (these tasks stop, upgrade, then start the service on each node one by one).

There is a question over package dependencies if we do this, but provided there aren't too many it may be possible to simply move the tasks to upgrade_batch_tasks.
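
For illustration, here is a minimal sketch of what moving this into upgrade_batch_tasks could look like in a tripleo-heat-templates service template. The task names, step tags and package name are assumptions for illustration only, not the content of the actual review:

    outputs:
      role_data:
        value:
          service_name: neutron_l3
          upgrade_batch_tasks:
            # Batch tasks run to completion on one batch of nodes at a
            # time, so only a single l3 agent is down at any moment.
            - name: Stop neutron-l3-agent before updating its packages
              tags: step1
              service: name=neutron-l3-agent state=stopped
            - name: Update the neutron l3 agent package
              tags: step1
              yum: name=openstack-neutron state=latest
            - name: Start neutron-l3-agent again
              tags: step1
              service: name=neutron-l3-agent state=started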

Steven Hardy (shardy)
Changed in tripleo:
status: New → Triaged
importance: Undecided → High
milestone: none → pike-1
Changed in tripleo:
assignee: nobody → Marios Andreou (marios-b)
status: Triaged → In Progress
Revision history for this message
Marios Andreou (marios-b) wrote :

WIP - https://review.openstack.org/445494 for possibly getting this into the Newton to Ocata upgrade workflow... still being tested.

Revision history for this message
Marios Andreou (marios-b) wrote :

FYI this is also discussed at https://bugzilla.redhat.com/show_bug.cgi?id=1419751 - copy/pasting from a comment I just left there describing latest status:

Hi, update on progress (tl;dr: we know what breaks the pingtest, but we are still blocked on an openvswitch-related issue). I reached out to jlibosva from the network team for help and he immediately responded (see the copy/paste of my email at [0] for context).

So Jakub quickly confirmed that it is openvswitch which is causing the neutron-openvswitch-agent to be started (even though we have explicitly stopped it). He found an issue in the neutron-openvswitch-agent service file and posted https://review.rdoproject.org/r/#/c/5951/ to fix it. The idea is that if we upgrade to a version of the openstack-neutron packages that carries Jakub's fix, the subsequent openvswitch upgrade should no longer cause the neutron-openvswitch-agent to try to start prematurely (see [0] for more info on why this is a problem).

Unfortunately, my testing of upgrading this way (that is, first upgrading the openstack-neutron packages to the ones with Jakub's fix, from a repo of builds he made that I enabled as part of my upgrade-init.yaml environment file, and then upgrading openvswitch and everything else) did not go well. As soon as openvswitch is upgraded to 2.6 I lose all connectivity to all 3 controller nodes. I tried upgrading openvswitch both via a plain 'yum update' and also by including https://review.openstack.org/#/c/434346/ (i.e. the 'special case upgrade with flags' discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1424945#c16) and had the same result both times.
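
For reference, the 'special case upgrade with flags' mentioned above boils down to updating the openvswitch package while skipping the rpm scriptlets that would otherwise stop or restart the service. The following is a rough, illustrative sketch only; the exact commands live in the linked review and bugzilla comment:

    - name: Special case openvswitch upgrade, skipping the problematic scriptlets
      tags: step2
      shell: |
        # Illustrative: grab the new openvswitch package and install it with
        # rpm flags that skip the %postun and trigger scriptlets.
        yumdownloader --destdir /tmp/ovs_upgrade openvswitch
        rpm -U --replacepkgs --nopostun --notriggerun /tmp/ovs_upgrade/openvswitch-*.rpm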

I think we still need the fix at https://review.rdoproject.org/r/#/c/5951/ (adding it to the trackers), but I am not sure what is going on with openvswitch... we can pick this up next week, unless someone has more ideas in the meantime.

thanks, marios

[0] (via email marios->jlibosva):
"
https://review.openstack.org/#/c/445494/ is the review (the 'rolling one node at a time' mechanism is already in place, we are just using it and adding it to the l3 agent service here). It does what it's meant to - the code is executed on one node at a time, so only one l3 agent is down and we lose maybe one or two pings. Great.

However, the upgrade continues and at this point all services are down (the cluster, neutron-* except l3, and everything else). Then _something_ starts the neutron-openvswitch-agent. I am fairly confident it is openvswitch itself (we go from openvswitch-2.5.0-14 to 2.6, so there is an openvswitch restart). Someone suggested it may even be python-openvswitch, but I am not sure at this point. In other words, as these packages are updated as part of the workflow, the neutron-openvswitch-agent is started.

The problem is that neutron-openvswitch-agent cannot start at this point because rabbit is still down. The fact that it starts (or tries to start) kills the ping, and the ping stays down (even though the l3 agents are running) until puppet reconfigures and starts everything again.
"

Revision history for this message
Marios Andreou (marios-b) wrote :

After a call with ajo today, I think the premise behind this bug is wrong. We apparently don't even need the l3 agents running for the tenant VM IPs to stay reachable. Sure, if all l3 agents are down you won't be able to _create_ new IPs, for example, but the existing ones should remain reachable.

If they aren't, then there is a bug, like the one discovered while testing https://review.openstack.org/#/c/445494/9: neutron-openvswitch-agent is being started when openvswitch is updated during the upgrade package update. Another is an issue with 'ryu', which is likely the one I hit on Friday (see the review comments at https://review.openstack.org/#/c/445494/). Apparently there are newer package builds from Friday afternoon for neutron-* that might solve some of these.

So the plan today is to test without this change, using those latest packages, and see where the ping fails, so we can be clearer about any outstanding bugs.

Changed in tripleo:
milestone: pike-1 → pike-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Marios Andreou (<email address hidden>) on branch: master
Review: https://review.openstack.org/445494

Revision history for this message
Marios Andreou (marios-b) wrote :

See https://bugzilla.redhat.com/show_bug.cgi?id=1419751#c11 for testing info and more context, but essentially we no longer need the agents running to keep the floating IPs accessible (though we won't be able to manage them during the upgrade). Marking the bug as invalid.

Changed in tripleo:
status: In Progress → Invalid