networking disruption on upgrade from 14.0.0 to 14.0.3

Bug #1859649 reported by Junien Fridrick on 2020-01-14
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Ubuntu Cloud Archive
Undecided
Unassigned
neutron
Undecided
Unassigned
neutron (Ubuntu)
Undecided
Unassigned

Bug Description

Hi,

Not entirely sure why, but if a cloud has services running these two versions of neutron (2:14.0.3-0ubuntu1~cloud0 and 2:14.0.0-0ubuntu1.1~cloud0), networking is basically broken until everything is running 2:14.0.3-0ubuntu1~cloud0.

This causes networking disruption when not all nodes are upgraded at the same time.

Thanks

Junien Fridrick (axino) wrote :

All errors appear to be on the neutron gateway.

One kind is in neutron-ovs-agent:
neutron-openvswitch-agent.log:2020-01-14 16:21:31.563 456292 ERROR neutron.agent.rpc
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc [req-c5918d0d-f11b-4de7-b77c-f4628e33774d - - - - -] Failed to get details for device ab683ddb-a1f3-4404-b2b4-f912efe3530a: oslo_messaging.rpc.client.RemoteError: Remote error: InvalidTargetVersion Invalid target version 1.5
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc Traceback (most recent call last):
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc File "/usr/lib/python3/dist-packages/neutron/agent/rpc.py", line 303, in get_devices_details_list_and_failed_devices
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc agent_restarted))
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc File "/usr/lib/python3/dist-packages/neutron/agent/rpc.py", line 312, in get_device_details
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc resources.PORT, device, agent_restarted)
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc File "/usr/lib/python3/dist-packages/neutron/agent/resource_cache.py", line 61, in get_resource_by_id
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc agent_restarted=agent_restarted)
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc File "/usr/lib/python3/dist-packages/neutron/agent/resource_cache.py", line 79, in _flood_cache_for_query
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc filter_kwargs=filter_kwargs)
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc File "/usr/lib/python3/dist-packages/oslo_log/helpers.py", line 67, in wrapper
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc return method(*args, **kwargs)
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc File "/usr/lib/python3/dist-packages/neutron/api/rpc/handlers/resources_rpc.py", line 114, in bulk_pull
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc version=resource_type_cls.VERSION, filter_kwargs=filter_kwargs)
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc File "/usr/lib/python3/dist-packages/neutron_lib/rpc.py", line 157, in call
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc return self._original_context.call(ctxt, method, **kwargs)
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/client.py", line 178, in call
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc retry=self.retry)
neutron-openvswitch-agent.log:2020-01-14 16:21:31.674 456292 ERROR neutron.agent.rpc File "/usr/lib/python3/dist-packages/oslo_messaging/tr...


Junien Fridrick (axino) wrote :

The other is in neutron-l3-agent (still on the neutron-gateway), but this is probably due to the upgrade restarting the daemons:

neutron-l3-agent.log:2020-01-14 16:22:11.298 456968 ERROR pyroute2.netns.nslink [-] forced shutdown procedure, clean up netns manually
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info [-] [Errno 9] Bad file descriptor: OSError: [Errno 9] Bad file descriptor
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info Traceback (most recent call last):
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info File "/usr/lib/python3/dist-packages/neutron/common/utils.py", line 158, in call
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info return func(*args, **kwargs)
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info File "/usr/lib/python3/dist-packages/neutron/agent/l3/router_info.py", line 1186, in process
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info self._process_internal_ports()
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info File "/usr/lib/python3/dist-packages/neutron/agent/l3/router_info.py", line 584, in _process_internal_ports
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info self.internal_network_added(p)
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info File "/usr/lib/python3/dist-packages/neutron/agent/l3/router_info.py", line 486, in internal_network_added
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info mtu=port.get('mtu'))
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info File "/usr/lib/python3/dist-packages/neutron/agent/l3/router_info.py", line 461, in _internal_network_added
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info prefix=prefix, mtu=mtu)
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info File "/usr/lib/python3/dist-packages/neutron/agent/linux/interface.py", line 265, in plug
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info namespace=namespace):
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 818, in device_exists
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info return IPDevice(device_name, namespace=namespace).exists()
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 318, in exists
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info return privileged.interface_exists(self.name, self.namespace)
neutron-l3-agent.log:2020-01-14 16:22:11.322 456743 ERROR neutron.agent.l3.router_info File "/usr/lib/python3/dist-packages/neutron/privileged/agent/linux/ip_lib.py"...


Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in neutron (Ubuntu):
status: New → Confirmed
Felipe Reyes (freyes) wrote :

About "neutron.agent.rpc oslo_messaging.rpc.client.RemoteError: Remote error: InvalidTargetVersion Invalid target version 1.5"

Commit b452c508b62 landed in 14.0.3; it bumped the Port object's version to 1.5, making it incompatible with older versions -> https://github.com/openstack/neutron/commit/b452c508b62

This commit was backported to fix bug 1834484 ([QoS] qos_plugin._extend_port_resource_request is killing port retrieval performance)

Felipe Reyes (freyes) wrote :

According to the comments in https://review.opendev.org/#/c/669360/ the backport mentioned earlier shouldn't have broken things.

Felipe Reyes (freyes) wrote :

Taking a closer look, this could have happened because the neutron-gateway (server side) ran an older version that is not aware of 1.5 objects, while the neutron-ovs-agent (client side) requested a 1.5 versioned object. The backwards-compatibility layer is meant to be used the other way around: the server is aware of newer versions while the client is not, so the server can remove fields from the response to downgrade the object and hand back a compatible one.
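A minimal, stdlib-only sketch of that asymmetry (the class, field names, and version map below are illustrative stand-ins, not neutron's actual code): a server can downgrade an object to an older schema it knows about, but it cannot serve a version newer than anything in its own schema table.

```python
class InvalidTargetVersion(Exception):
    """Stand-in for the InvalidTargetVersion error seen in the agent log."""


class PortObject:
    """Toy stand-in for neutron's versioned Port object (illustrative only)."""

    # The schemas this server knows. A 14.0.0 server tops out at 1.4;
    # a 14.0.3 agent asks for 1.5, which this server has never heard of.
    FIELDS_BY_VERSION = {
        '1.4': {'id', 'mac_address'},
        # a 14.0.3 server would also carry a '1.5' entry here
    }

    def __init__(self, **fields):
        self.fields = fields

    def obj_to_primitive(self, target_version):
        # Downgrading works: drop the fields the older schema did not have.
        # "Upgrading" is impossible: there is no schema for a newer version.
        if target_version not in self.FIELDS_BY_VERSION:
            raise InvalidTargetVersion(target_version)
        keep = self.FIELDS_BY_VERSION[target_version]
        return {k: v for k, v in self.fields.items() if k in keep}
```

So a newer server answering an older agent can hand back a 1.4-shaped Port, while an older server asked for 1.5 raises InvalidTargetVersion, which matches the RemoteError in the log above.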

Corey Bryant (corey.bryant) wrote :

Added upstream to this bug; I would like to get their opinion. Minimally, this should have been release-noted, I'd think.

summary: - neutron 2:14.0.3-0ubuntu1~cloud0 and 2:14.0.0-0ubuntu1.1~cloud0 not
- compatible
+ networking disruption on upgrade from 14.0.0 to 14.0.3
Corey Bryant (corey.bryant) wrote :

If we're sure the patch in comment #5 is causing the issue (which seems likely based on the log), we could revert it in the distro and fast-path that SRU. This likely affects stable/train as well.

Changed in neutron (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → Critical
Changed in neutron (Ubuntu Focal):
status: Triaged → Confirmed
importance: Critical → Undecided

Hello:

As Felipe pointed out, the compatibility layer relies on the server side. But a stable-branch upgrade should, of course, not trigger this problem.

The mentioned patch [1] complies with the OVO versioning rules, allowing version downgrade (on the server side).

Did you upgrade the services in a particular order? According to the "Rolling upgrade" section in [2]:
"""
To simplify the matter, it's always assumed that the order of service upgrades is as follows:

first, all neutron-servers are upgraded.
then, if applicable, neutron agents are upgraded.
"""

Regards.

[1] https://github.com/openstack/neutron/commit/b452c508b62
[2] https://github.com/openstack/neutron/blob/master/doc/source/contributor/internals/upgrade.rst
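The order prescribed in [2] can be sketched as a dry-run shell script; the `upgrade` function below is only a placeholder for whatever actually updates the packages on each node (e.g. apt run against the right hosts), and the service names are illustrative:

```shell
#!/bin/sh
# Dry-run sketch of the rolling-upgrade order from the upstream doc.
# upgrade() is a placeholder, not a real command; substitute your own
# tooling run against the appropriate nodes.
set -e

upgrade() {
    echo "upgrading $1"
}

# 1. First, all neutron-servers: the server side then knows the newest
#    object versions and can downgrade responses for older agents.
upgrade neutron-server

# 2. Then, if applicable, the agents.
upgrade neutron-openvswitch-agent
upgrade neutron-l3-agent
```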

Changed in neutron:
status: New → Incomplete
Haw Loeung (hloeung) on 2020-01-15
Changed in cloud-archive:
status: New → Confirmed
Dr. Jens Harbott (j-harbott) wrote :

I agree that this is kind of expected behaviour, but I also think it would be good if Neutron adopted a policy of at least bumping the minor release version whenever object versions change; that would make it more obvious to deployers that special care is needed. It should also be accompanied by a release note.

Slawek Kaplonski (slaweq) wrote :

@Jens: You are right about the version numbers. I just added a short note about it to the release-checklist doc in Neutron: https://review.opendev.org/702822
I hope it will help us remember this in the future.

no longer affects: neutron (Ubuntu Disco)
no longer affects: neutron (Ubuntu Eoan)
no longer affects: neutron (Ubuntu Focal)
Changed in cloud-archive:
status: Confirmed → Incomplete
Changed in neutron (Ubuntu):
status: Confirmed → Incomplete
Corey Bryant (corey.bryant) wrote :

Agree that a minor semver bump would be good in this scenario. Thanks for updating the checklist @Slawek.

This page in the charm deployment guide needs better guidance on upgrade order:

https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/latest/app-upgrade-openstack.html#known-openstack-upgrade-issues

There is service-specific guidance at the following that should be accounted for: https://docs.openstack.org/operations-guide/ops-upgrades.html#service-specific-upgrade-instructions.

It says:
"In terms of the upgrade order, begin with ‘keystone’. After that, the rest of the charms can be upgraded in any order."

It also has nova-compute and neutron-gateway both listed as 3 in the "Upgrade order" table.

Corey Bryant (corey.bryant) wrote :

Comment #12, while I still think it needs action, actually doesn't apply to this bug, since the package upgrades weren't done via charms.

Corey Bryant (corey.bryant) wrote :

I opened a new bug for the charm guide at LP: #1859990

no longer affects: charm-deployment-guide
Junien Fridrick (axino) wrote :

So how does that work when we use Landscape to auto-upgrade the packages?

Changed in cloud-archive:
status: Incomplete → Confirmed
Andrea Ieri (aieri) wrote :

For completeness, this affects even an upgrade from just 2:14.0.2-0ubuntu1.1~cloud0 to 2:14.0.3-0ubuntu1.1~cloud0.

Although we try as hard as we can to ensure that unattended upgrades are disabled for the clouds we manage, upgrades between minor versions (not just patch versions) should always be safe. This bug makes it harder for us to deliver rolling package upgrades to our customers with no downtime.

Is it confirmed that this is triggered only if computes are upgraded before the gateways?
