Trace in fetch_and_sync_all_routers

Bug #1726370 reported by venkata anil
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Brian Haley
Milestone: (not set)

Bug Description

I am seeing the trace below in fetch_and_sync_all_routers for an HA router:

2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task [req-19638c71-4ad9-412f-b5d7-dc9cb84eca4f - - - - -] Error during L3NATAgentWithStateReport.periodic_sync_routers_task
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task Traceback (most recent call last):
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/oslo_service/periodic_task.py", line 220, in run_periodic_tasks
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task task(self, context)
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 568, in periodic_sync_routers_task
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task self.fetch_and_sync_all_routers(context, ns_manager)
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 603, in fetch_and_sync_all_routers
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task r['id'], r.get(l3_constants.HA_ROUTER_STATE_KEY))
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py", line 120, in check_ha_state_for_router
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task if ri and current_state != TRANSLATION_MAP[ri.ha_state]:
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 81, in ha_state
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task ha_state_path = self.keepalived_manager.get_full_config_file_path(
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task AttributeError: 'NoneType' object has no attribute 'get_full_config_file_path'
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task

Changed in neutron:
assignee: nobody → venkata anil (anil-venkata)
tags: added: l3-ha
Boden R (boden) wrote :

Is this reproducible?
Based on [1], it seems this was an issue on 10/17/2017 but hasn't occurred since.

[1] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Error%20during%20L3NATAgentWithStateReport.periodic_sync_routers_task%5C%22

Brian Haley (brian-haley) wrote :

I have seen this downstream recently as well, but only randomly, so I would say it is still an issue.

James Anziano (janzian)
Changed in neutron:
status: New → Confirmed
Brian Haley (brian-haley) wrote :

Anil,

Can this call in ha_router.py:initialize() move to the end?

  super(HaRouter, self).initialize(process_monitor)

I believe the problem is that we initialize the router_info before we complete the HA initialization. You recently moved this call from the first line further down to fix bug 1662804; maybe we can move it to the end of the method?

Either that, or the ha_state getter has to check self.keepalived_manager?
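
For illustration, the second option (having the getter check the manager before using it) could look roughly like the sketch below. This is not the neutron code: HaRouterSketch, KeepalivedManagerStub, the "state" file name and the "unknown" fallback value are all assumptions made up for the example. Only the attribute name keepalived_manager and the method name get_full_config_file_path come from the trace above. A caller that indexes a map with the result (as check_ha_state_for_router does with TRANSLATION_MAP) would also have to cope with the fallback value.

```python
import os


class KeepalivedManagerStub:
    """Hypothetical stand-in for the real keepalived manager."""

    def __init__(self, conf_dir):
        self.conf_dir = conf_dir

    def get_full_config_file_path(self, filename):
        # Same method name as in the trace; the body is an assumption.
        return os.path.join(self.conf_dir, filename)


class HaRouterSketch:
    """Illustrative only; not the actual neutron HaRouter class."""

    def __init__(self):
        # Stays None until HA initialization has completed.
        self.keepalived_manager = None

    @property
    def ha_state(self):
        # Guard: a periodic task may read the state before the manager exists.
        if self.keepalived_manager is None:
            return "unknown"
        path = self.keepalived_manager.get_full_config_file_path("state")
        try:
            with open(path) as state_file:
                return state_file.read().strip()
        except OSError:
            # Missing or unreadable state file is treated like "not ready".
            return "unknown"
```

With this guard, reading ha_state on a half-initialized router returns the fallback instead of raising AttributeError on None.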

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/516322

Changed in neutron:
assignee: venkata anil (anil-venkata) → Brian Haley (brian-haley)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/517097

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/517639

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Brian Haley (<email address hidden>) on branch: master
Review: https://review.openstack.org/517097

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/516322
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d2b909f5339e72f84de797977384e4164d72a154
Submitter: Zuul
Branch: master

commit d2b909f5339e72f84de797977384e4164d72a154
Author: Brian Haley <email address hidden>
Date: Mon Oct 30 09:41:46 2017 -0400

    Move check_ha_state_for_router() into notification code

    As soon as we call router_info.initialize(), we could
    possibly try and process a router. If it is HA, and
    we have not fully initialized the HA port or keepalived
    manager, we could trigger an exception.

    Move the call to check_ha_state_for_router() into the
    update notification code so it's done after the router
    has been created. Updated the functional tests for this
    since the unit tests are now invalid.

    Also added a retry counter to the RouterUpdate object so
    the l3-agent code will stop re-enqueuing the same update
    in an infinite loop. We will delete the router if the
    limit is reached.

    Finally, have the L3 HA code verify that ha_port and
    keepalived_manager objects are valid during deletion since
    there is no need to do additional work if they are not.

    Change-Id: Iae65305cbc04b7af482032ddf06b6f2162a9c862
    Closes-bug: #1726370
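
The retry counter described in the commit message can be pictured with the minimal sketch below. It is not the merged patch: RouterUpdateSketch, hit_retry_limit, process_updates, the default of 3 tries and the callback parameters are hypothetical names and values chosen for the example. It only shows the general idea of decrementing a per-update counter on failure, re-enqueuing while tries remain, and deleting the router once the limit is reached.

```python
from collections import deque


class RouterUpdateSketch:
    """Hypothetical update object carrying a remaining-tries counter."""

    def __init__(self, router_id, tries=3):
        self.router_id = router_id
        self.tries = tries  # attempts still allowed (assumed default)

    def hit_retry_limit(self):
        return self.tries <= 0


def process_updates(queue, process, delete):
    """Drain the queue, re-enqueuing failed updates until tries run out.

    `process` and `delete` are caller-supplied callables taking a router id.
    """
    while queue:
        update = queue.popleft()
        try:
            process(update.router_id)
        except Exception:
            update.tries -= 1
            if update.hit_retry_limit():
                # Give up on this router instead of looping forever.
                delete(update.router_id)
            else:
                queue.append(update)
```

Because the counter travels with the update object, a router that can never be processed is attempted a bounded number of times and then removed, rather than being re-enqueued indefinitely.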

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/518635

Changed in neutron:
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/518636

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/518636
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=56a9522bde6f74eeaec8fd37809ee6acc8fdd3ba
Submitter: Zuul
Branch: stable/ocata

commit 56a9522bde6f74eeaec8fd37809ee6acc8fdd3ba
Author: Brian Haley <email address hidden>
Date: Mon Oct 30 09:41:46 2017 -0400

    Move check_ha_state_for_router() into notification code

    As soon as we call router_info.initialize(), we could
    possibly try and process a router. If it is HA, and
    we have not fully initialized the HA port or keepalived
    manager, we could trigger an exception.

    Move the call to check_ha_state_for_router() into the
    update notification code so it's done after the router
    has been created. Updated the functional tests for this
    since the unit tests are now invalid.

    Also added a retry counter to the RouterUpdate object so
    the l3-agent code will stop re-enqueuing the same update
    in an infinite loop. We will delete the router if the
    limit is reached.

    Finally, have the L3 HA code verify that ha_port and
    keepalived_manager objects are valid during deletion since
    there is no need to do additional work if they are not.

    Conflicts:
          neutron/agent/l3/agent.py

    Change-Id: Iae65305cbc04b7af482032ddf06b6f2162a9c862
    Closes-bug: #1726370
    (cherry picked from commit d2b909f5339e72f84de797977384e4164d72a154)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/518635
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6809a6dd775f553d58be0f3fcb27d43f575e2881
Submitter: Zuul
Branch: stable/pike

commit 6809a6dd775f553d58be0f3fcb27d43f575e2881
Author: Brian Haley <email address hidden>
Date: Mon Oct 30 09:41:46 2017 -0400

    Move check_ha_state_for_router() into notification code

    As soon as we call router_info.initialize(), we could
    possibly try and process a router. If it is HA, and
    we have not fully initialized the HA port or keepalived
    manager, we could trigger an exception.

    Move the call to check_ha_state_for_router() into the
    update notification code so it's done after the router
    has been created. Updated the functional tests for this
    since the unit tests are now invalid.

    Also added a retry counter to the RouterUpdate object so
    the l3-agent code will stop re-enqueuing the same update
    in an infinite loop. We will delete the router if the
    limit is reached.

    Finally, have the L3 HA code verify that ha_port and
    keepalived_manager objects are valid during deletion since
    there is no need to do additional work if they are not.

    Change-Id: Iae65305cbc04b7af482032ddf06b6f2162a9c862
    Closes-bug: #1726370
    (cherry picked from commit d2b909f5339e72f84de797977384e4164d72a154)

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.0.0b2

This issue was fixed in the openstack/neutron 12.0.0.0b2 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.5

This issue was fixed in the openstack/neutron 10.0.5 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.3

This issue was fixed in the openstack/neutron 11.0.3 release.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Brian Haley <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/517639
