Liberty server and Kilo security group aware agent fail to refresh firewall for DHCP and router IPv6 ports

Bug #1531772 reported by Ihar Hrachyshka
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Ihar Hrachyshka

Bug Description

When we try to mix Liberty server with Kilo L2 agent, we get the following traceback in the agent log:

ERROR oslo_messaging.rpc.dispatcher [-] Exception during message handling: Endpoint does not support RPC version 1.3. Attempted method: security_groups_provider_updated
TRACE oslo_messaging.rpc.dispatcher Traceback (most recent call last):
TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
TRACE oslo_messaging.rpc.dispatcher executor_callback))
TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 195, in _dispatch
TRACE oslo_messaging.rpc.dispatcher raise UnsupportedVersion(version, method=method)
TRACE oslo_messaging.rpc.dispatcher UnsupportedVersion: Endpoint does not support RPC version 1.3. Attempted method: security_groups_provider_updated

In Kilo, the server just sent a bare notification about some change, and the firewall was reset for all devices; in Liberty, it passes the list of devices to refresh, so that a security group change only triggers firewall setup for the affected devices.

Missing the notification could cause a range of issues that all come down to ‘my firewall is not updated after a security group change’. From what I see in the code, it would affect DHCP and router IPv6 ports only.

Now, since the signature of the RPC call was changed (adding the list of devices), the server requires version = 1.3 for the agent endpoint, which is the version that knows about the new argument. If it were a regular notification directed at a specific agent, we could just use call() instead of cast() and handle the UnsupportedVersion exception by repeating the call without the device list. But since it’s a fanout cast, we can’t do that.
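
For illustration, the directed-call fallback described above could look roughly like the sketch below. It does not use oslo.messaging: ToyEndpoint, toy_call and notify_with_fallback are hypothetical stand-ins, and the version numbers are assumptions chosen to mirror the traceback. The fallback only works because the caller sees the exception, which a fanout cast() never reports back.

```python
class UnsupportedVersion(Exception):
    """Toy stand-in for the exception seen in the traceback."""


def is_compatible(supported, requested):
    """Same major version, and requested minor not above the supported minor."""
    s_major, s_minor = (int(part) for part in supported.split("."))
    r_major, r_minor = (int(part) for part in requested.split("."))
    return s_major == r_major and r_minor <= s_minor


class ToyEndpoint:
    """Hypothetical agent endpoint; 'supported' is an assumed version."""

    def __init__(self, supported):
        self.supported = supported
        self.received = []

    def security_groups_provider_updated(self, context, **kwargs):
        self.received.append(kwargs)


def toy_call(endpoint, method, version, **kwargs):
    """Blocking call: a version mismatch is raised back to the caller."""
    if not is_compatible(endpoint.supported, version):
        raise UnsupportedVersion(
            "Endpoint does not support RPC version %s. "
            "Attempted method: %s" % (version, method))
    return getattr(endpoint, method)(None, **kwargs)


def notify_with_fallback(endpoint, devices_to_update):
    """Try the new signature first, then retry without the device list."""
    method = "security_groups_provider_updated"
    try:
        toy_call(endpoint, method, "1.3", devices_to_update=devices_to_update)
        return "1.3"
    except UnsupportedVersion:
        toy_call(endpoint, method, "1.1")
        return "1.1"


old_agent = ToyEndpoint(supported="1.1")
new_agent = ToyEndpoint(supported="1.3")
used_old = notify_with_fallback(old_agent, ["port-a"])
used_new = notify_with_fallback(new_agent, ["port-a"])
```

The old endpoint ends up receiving the bare notification, while the new one receives the device list.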

The solution for the upgrade issue would probably be reverting the optimization in Liberty. Since we don’t support spanning upgrades through multiple cycles just yet, it should be enough.

Other alternatives do not seem to work here:
- cast()ing for both new and old signatures would effectively disable the optimization, because the same agent would receive both versions of the method, and the old one will trigger full firewall reset anyway;
- calling cast() with the new signature but without the version specified would probably make the older Kilo agent crash in a more horrible way (note: I need to check that locally).
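
The first alternative can be sketched with a toy Liberty agent. ToyLibertyAgent and the port names are hypothetical; the only behaviour assumed is the one described above, where a notification without a device list triggers a full firewall reset.

```python
class ToyLibertyAgent:
    """Hypothetical agent that understands both notification signatures."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.refreshed = []  # one entry per firewall refresh performed

    def security_groups_provider_updated(self, context, **kwargs):
        devices = kwargs.get("devices_to_update")
        if devices is None:
            # Old-style bare notification: reset the firewall for every port.
            self.refreshed.append(sorted(self.ports))
        else:
            # New-style notification: refresh only the listed ports.
            self.refreshed.append(sorted(p for p in self.ports if p in devices))


agent = ToyLibertyAgent(ports=["port-a", "port-b", "port-c"])

# The server casts both signatures so that old and new agents are both covered.
agent.security_groups_provider_updated(None, devices_to_update=["port-a"])
agent.security_groups_provider_updated(None)

# Union of all ports touched across both notifications.
ports_touched = {port for batch in agent.refreshed for port in batch}
```

Even though the first cast named a single port, the second cast makes the agent refresh every port, so the optimization is lost.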

Side note: it’s interesting that we have backwards-compatible code on the agent side to accommodate older servers. I will probably remove it, since it’s not in line with the rolling upgrade scenarios we support, where you never run a server older than an agent in the cluster.

Changed in neutron:
importance: Undecided → High
assignee: nobody → Ihar Hrachyshka (ihar-hrachyshka)
tags: added: upgrade
tags: added: liberty-backport-potential
Rossella Sblendido (rossella-o) wrote :

Hi Ihar,

thanks for looking into this.
I think we need to find a good and long term strategy for this kind of problem. We might want to increase the RPC version of a cast whose server side is on the agent again in future.

Here is a possible solution: in Liberty we introduce the new code in the agent and increase the version there, but the neutron server, which uses the client side of the RPC, doesn't require the newer version and keeps using the old one. That way the Liberty server will be able to work with the Kilo agents. In Mitaka we can require version 1.3, since Liberty agents will be able to handle it.

If we go this way, we should keep the backward-compatible code on the agent side to accommodate older servers, because it will be needed if both agents and server are using Liberty. What do you think?

Ihar Hrachyshka (ihar-hrachyshka) wrote :

Rossella, that approach makes time to deliver a change span two cycles: it's both slow and hard not to forget to follow up. :)

The proper response in the future should be *not modifying* 'server to agent' APIs. Instead, introduce a new method and call both the old and the new one. In the next cycle, switch to using the new one only.

Ihar Hrachyshka (ihar-hrachyshka) wrote :

Actually, it's a lot easier: we indeed already capture unknown arguments on agent notification side, so we should just avoid enforcing the agent version.
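
A minimal sketch of why dropping the version requirement is enough, assuming the agent callback accepts arbitrary keyword arguments as described above. The classes and fanout_cast are hypothetical stand-ins, not the real neutron or oslo.messaging code.

```python
class ToyKiloAgent:
    """Hypothetical Kilo-style callback: unknown arguments land in kwargs."""

    def __init__(self):
        self.actions = []

    def security_groups_provider_updated(self, context, **kwargs):
        # devices_to_update is silently ignored; everything is refreshed.
        self.actions.append("refresh all")


class ToyLibertyAgent(ToyKiloAgent):
    """Hypothetical Liberty-style callback that uses the device list."""

    def security_groups_provider_updated(self, context, **kwargs):
        devices = kwargs.get("devices_to_update")
        if devices is None:
            self.actions.append("refresh all")
        else:
            self.actions.append("refresh %s" % ", ".join(sorted(devices)))


def fanout_cast(agents, method, **kwargs):
    """Toy fanout with no version enforced: every agent gets the same payload."""
    for agent in agents:
        getattr(agent, method)(None, **kwargs)


kilo, liberty = ToyKiloAgent(), ToyLibertyAgent()
fanout_cast([kilo, liberty], "security_groups_provider_updated",
            devices_to_update=["port-a"])
```

With a single payload, the older agent falls back to a full refresh while the newer one keeps the optimized path.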

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/266886

Changed in neutron:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/266886
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f8f366024052a191eb0fc74af1643be15c541aef
Submitter: Jenkins
Branch: master

commit f8f366024052a191eb0fc74af1643be15c541aef
Author: Ihar Hrachyshka <email address hidden>
Date: Wed Jan 13 12:37:21 2016 +0100

    Make security_groups_provider_updated work with Kilo agents

    Initially, we bumped the required version for the agent endpoint from
    1.1 (the initial version that implemented security groups) to 1.3
    without considering that the code should work with old agents that do
    not yet know about the new devices_to_update argument.

    Actually, there was no need to bump the version: old agent side code
    already captures all unknown arguments that could be passed from the
    server, ignoring them:

    https://github.com/openstack/neutron/blob/608b54137fb67512c07099089ea7e074176e12df/neutron/agent/securitygroups_rpc.py#L155

    (^ the link shows the latest Kilo code as of writing)

    Note: some people may argue that the approach that is taken in Neutron
    to support backwards compatibility for server notifications is wrong,
    and we instead should adopt some stricter mechanism like nova version
    pinning. While that is a noble thing to do, it's out of scope for the
    patch that is designed to be easily backportable to stable/liberty.

    Note: some people may also argue that the patch should go straight into
    stable/liberty because we don't claim support for rolling upgrade
    scenarios that span multiple releases. That's indeed true, though my
    take on it is that if we have a way to handle more unofficial scenarios
    without more coding effort, it's worth doing it.

    Change-Id: I741e6e5c460658ac17095551040e67e8d1990812
    Closes-Bug: #1531772

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/268697

Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b2

This issue was fixed in the openstack/neutron 8.0.0.0b2 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/268697
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=725a93bb7663c47ecde324b011d5c582d4e94c3f
Submitter: Jenkins
Branch: stable/liberty

commit 725a93bb7663c47ecde324b011d5c582d4e94c3f
Author: Ihar Hrachyshka <email address hidden>
Date: Wed Jan 13 12:37:21 2016 +0100

    Make security_groups_provider_updated work with Kilo agents

    Initially, we bumped the required version for the agent endpoint from
    1.1 (the initial version that implemented security groups) to 1.3
    without considering that the code should work with old agents that do
    not yet know about the new devices_to_update argument.

    Actually, there was no need to bump the version: old agent side code
    already captures all unknown arguments that could be passed from the
    server, ignoring them:

    https://github.com/openstack/neutron/blob/608b54137fb67512c07099089ea7e074176e12df/neutron/agent/securitygroups_rpc.py#L155

    (^ the link shows the latest Kilo code as of writing)

    Note: some people may argue that the approach that is taken in Neutron
    to support backwards compatibility for server notifications is wrong,
    and we instead should adopt some stricter mechanism like nova version
    pinning. While that is a noble thing to do, it's out of scope for the
    patch that is designed to be easily backportable to stable/liberty.

    Note: some people may also argue that the patch should go straight into
    stable/liberty because we don't claim support for rolling upgrade
    scenarios that span multiple releases. That's indeed true, though my
    take on it is that if we have a way to handle more unofficial scenarios
    without more coding effort, it's worth doing it.

    Change-Id: I741e6e5c460658ac17095551040e67e8d1990812
    Closes-Bug: #1531772
    (cherry picked from commit f8f366024052a191eb0fc74af1643be15c541aef)

tags: added: in-stable-liberty
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 7.0.3

This issue was fixed in the openstack/neutron 7.0.3 release.

tags: removed: liberty-backport-potential upgrade