l3-agent stops processing router updates

Bug #1847203 reported by Lars Erik Pedersen
This bug affects 2 people
Affects: neutron
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

In our work on upgrading from Queens to Rocky, we have stumbled upon some weird behaviour in neutron-l3-agent. After "a while" (usually ~days), the l3-agent will simply stop processing router updates. In the debug log, we see:

Got routers updated notification :[u'1dea9d84-e5ec-44be-b37f-7f9070dd159e'] routers_updated /usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py:446

But then nothing happens after that. We're testing this by adding and removing a floating IP.

The problem is that we really have no clue what happens, other than the observed symptoms, so we can't really provide a way to reproduce this.

neutron-l3-agent 2:13.0.4-0ubuntu1~cloud0
openvswitch-switch 2.10.0-0ubuntu2~cloud0

Ubuntu 18.04 LTS, running the 4.15.0-51-generic kernel

Slawek Kaplonski (slaweq) wrote :

Do you have debug enabled in the L3 agent? Is there any interesting information there?
I'm marking it as incomplete for now, as we can't do much with such a small piece of information about the issue.

tags: added: l3-dvr-backlog
Changed in neutron:
status: New → Incomplete
Lars Erik Pedersen (pedersen-larserik) wrote :

I've attached some config files. These should be all the config that neutron-l3-agent cares about (I've removed all comments and secrets).

neutron.conf: http://paste.openstack.org/show/781853/
l3_agent.ini: http://paste.openstack.org/show/781854/
fwaas_driver.ini: http://paste.openstack.org/show/781855/

Lars Erik Pedersen (pedersen-larserik) wrote :

@slaweq yes, we have debug enabled. The thing is, it does not really tell us much.

This is what (not) happens when a router update arrives (a floating IP was added or removed) and the l3-agent doesn't care: http://paste.openstack.org/show/781857/

When it works, it outputs a whole lot: http://paste.openstack.org/show/781858/

Slawek Kaplonski (slaweq) wrote :

Thanks for the details, Lars.
It seems to me that those routers are either not added to the queue here:
https://github.com/openstack/neutron/blob/1c2e10f8595d2286bd9bec513bc5a346a84a6f7c/neutron/agent/l3/agent.py#L574
or not returned from each_update_to_next_resource() here:
https://github.com/openstack/neutron/blob/ad028b55cafd74c39b1eb0708dba45a6fcfff059/neutron/agent/common/resource_processing_queue.py#L171 - as routers returned from this method are processed in https://github.com/openstack/neutron/blob/1c2e10f8595d2286bd9bec513bc5a346a84a6f7c/neutron/agent/l3/agent.py#L671

Could you add some additional debug messages in your environment to check exactly where it "stops"?
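To illustrate the kind of extra debug messages meant here, below is a minimal toy sketch, not Neutron code. ToyUpdateQueue, each_update() and process_pending() are hypothetical names that only mimic the three stages mentioned above (enqueue, hand-out from the queue, processing). The idea is that a log line at each stage shows which one an update never reaches.

```python
import logging
from collections import deque

logging.basicConfig(level=logging.DEBUG)
LOG = logging.getLogger("toy_l3_agent")


class ToyUpdateQueue(object):
    """Hypothetical stand-in for the agent's update queue (not Neutron code)."""

    def __init__(self):
        self._updates = deque()

    def add(self, router_id):
        # Checkpoint 1: the update reached the queue.
        LOG.debug("Queued update for router %s", router_id)
        self._updates.append(router_id)

    def each_update(self):
        while self._updates:
            router_id = self._updates.popleft()
            # Checkpoint 2: the update left the queue for processing.
            LOG.debug("Dequeued update for router %s", router_id)
            yield router_id


def process_pending(queue):
    """Drain the queue and return the router IDs that were processed."""
    processed = []
    for router_id in queue.each_update():
        # Checkpoint 3: processing actually started.
        LOG.debug("Processing router %s", router_id)
        processed.append(router_id)
    return processed
```

If the "Queued" line shows up in the log but "Dequeued" never does, the update is stuck between the first two stages; if neither shows up, it never made it into the queue.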

Changed in neutron:
status: Incomplete → Opinion
status: Opinion → New
Lars Erik Pedersen (pedersen-larserik) wrote :

Hi, I'm not sure how I can get the l3-agent to log more debug messages than that. Is there any lower level than "debug = true" in neutron.conf?

Or were you thinking about "hot-adding" more output directly to the code?

Brian Haley (brian-haley) wrote :

Could you add something to periodic_sync_routers_task() just to see whether it returned? I'm assuming it just had nothing to do. Something like the LOG.debug before the return:

neutron/agent/l3/agent.py:

    def periodic_sync_routers_task(self, context):
        if not self.fullsync:
            LOG.debug("Not in fullsync, periodic_sync_routers_task returning early")
            return
        LOG.debug("Starting fullsync periodic_sync_routers_task")
    [...]

Also, what version of oslo-service do you have installed? That is the python library dealing with periodically running this function.
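One possible way to answer the version question from Python itself is sketched below. It assumes the library is installed under the distribution name "oslo.service" and that the snippet runs with the same interpreter as the agent. installed_version() is a hypothetical helper written for this example, not part of Neutron or oslo.

```python
def installed_version(dist_name):
    """Return the installed version string of dist_name, or None if absent."""
    try:
        from importlib import metadata  # available on Python 3.8+
    except ImportError:
        metadata = None

    if metadata is not None:
        try:
            return metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            return None

    # Fallback for older interpreters, such as the Python 2.7 in this report.
    import pkg_resources
    try:
        return pkg_resources.get_distribution(dist_name).version
    except pkg_resources.DistributionNotFound:
        return None


# Prints the version string, or None if the distribution is not installed.
print(installed_version("oslo.service"))
```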

Brian Haley (brian-haley) wrote :

We have since updated the l3-agent to better log when it starts/stops processing messages, so there is at least a way to look at the logs and help determine what happened. That said, as only this one user has seen this issue and it was on an older release, I'll close this as there have been no other reports (and it might have been fixed by any number of changes). If you still see it on a newer release, please re-open this bug with more information.

Changed in neutron:
status: New → Invalid