neutron-openvswitch-agent crashes due to TypeError exception in ovs_ryuapp

Bug #1731494 reported by Daniel Alvarez
Affects   Status         Importance   Assigned to       Milestone
neutron   Fix Released   Medium       Ihar Hrachyshka

Bug Description

During a Rally test run, we saw this exception in the OVS agent logs:

2017-11-07 13:35:51.428 597682 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-62f85bb3-db4c-4485-b35c-b7c1cafb3970 3d527bdd3ede4c6a97f91b701393b8e3 5f753e92a5d740fc97252bd39f868561 - - -] port_delete message processed for port 3e8348d0-40e1-4146-b803-1e6c6eddba53 port_delete /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:430
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp [req-141ecd16-22d7-4b1c-aa91-25d5077414f5 - - - - -] Agent main thread died of an exception: TypeError: int() can't convert non-string with explicit base
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp Traceback (most recent call last):
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_ryuapp.py", line 40, in agent_main_wrapper
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp ovs_agent.main(bridge_classes)
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 2205, in main
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp agent.daemon_loop()
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp return f(*args, **kwargs)
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 2120, in daemon_loop
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp self.rpc_loop(polling_manager=pm)
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp return f(*args, **kwargs)
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 1985, in rpc_loop
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp ovs_status = self.check_ovs_status()
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp return f(*args, **kwargs)
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 1787, in check_ovs_status
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp status = self.int_br.check_canary_table()
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/br_int.py", line 52, in check_canary_table
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp flows = self.dump_flows(constants.CANARY_TABLE)
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py", line 141, in dump_flows
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp (dp, ofp, ofpp) = self._get_dp()
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_bridge.py", line 68, in _get_dp
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp self._cached_dpid = int(new_dpid_str, 16)
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp TypeError: int() can't convert non-string with explicit base
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp
2017-11-07 13:35:54.861 597682 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: receive error: Connection reset by peer: RuntimeError: OVS transaction timed out

This crashes the agent. When it is restarted, it performs a full sync, which slows things down considerably.

Revision history for this message
Daniel Alvarez (dalvarezs) wrote :

The error above appears to be related to an earlier OVS timeout:

2017-11-07 13:35:51.377 597682 ERROR ovsdbapp.backend.ovs_idl.command TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.DbGetCommand object at 0x11fc6890>] exceeded timeout 10 seconds
2017-11-07 13:35:51.377 597682 ERROR ovsdbapp.backend.ovs_idl.command

2017-11-07 13:35:51.378 597682 INFO neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_bridge [req-141ecd16-22d7-4b1c-aa91-25d5077414f5 - - - - -] Bridge br-int changed its datapath-ID from dae36ebcec4d to None

2017-11-07 13:35:38.520 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-141ecd16-22d7-4b1c-aa91-25d5077414f5 - - - - -] Switch connection timeout: TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.ListPortsCommand object at 0xa935750>] exceeded timeout 10 seconds

2017-11-07 13:35:51.330 597682 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): DbGetCommand(column=datapath_id, table=Bridge, record=br-int) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2017-11-07 13:35:51.331 597682 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:110

2017-11-07 13:35:51.419 597682 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: None
2017-11-07 13:35:51.419 597682 ERROR neutron.agent.linux.async_process [-] Process [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json] dies due to the error: None
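
The "changed its datapath-ID from dae36ebcec4d to None" line suggests that the datapath_id lookup returned None after the timeout, and that this value then reached the int(new_dpid_str, 16) call shown in the traceback. A minimal sketch of that failure, assuming None is what the lookup returned (the variable name is borrowed from the traceback; nothing here is the agent's actual code):

```python
# Assumed value: what the timed-out datapath_id lookup appears to have returned.
new_dpid_str = None

try:
    int(new_dpid_str, 16)
except TypeError as exc:
    # Same class of error as in the agent log: int() rejects a non-string
    # argument when an explicit base is given.
    error_message = str(exc)
    print(error_message)

# A healthy lookup yields a hex string, which converts without error.
healthy_dpid = int("dae36ebcec4d", 16)
```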

Changed in neutron:
assignee: nobody → venkatamahesh (venkatamaheshkotha)
Changed in neutron:
status: New → Confirmed
tags: added: ovs
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/519306

Changed in neutron:
status: Confirmed → In Progress
Miguel Lavalle (minsel)
Changed in neutron:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/561047

Changed in neutron:
assignee: venkatamahesh (venkatamaheshkotha) → Ihar Hrachyshka (ihar-hrachyshka)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/561053

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/561054

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/561060

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: master
Review: https://review.openstack.org/519306
Reason: Replaced by https://review.openstack.org/561060

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/561047
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=544597c6ef9fb297693dbeb0f2d7dc22f3a1b25d
Submitter: Zuul
Branch: master

commit 544597c6ef9fb297693dbeb0f2d7dc22f3a1b25d
Author: Ihar Hrachyshka <email address hidden>
Date: Thu Apr 12 20:30:20 2018 +0000

    ovs: survive errors from check_ovs_status

    Instead of allowing an error to bubble up and exit from rpc_loop, catch
    it and assume the switch is dead which will make the agent to wait until
    the switch is back without failing the service.

    Change-Id: Ic3095dd42b386f56b1f75ebb6a125606f295551b
    Closes-Bug: #1731494
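
The approach described in this commit message can be sketched roughly as below. This is an illustration only, not the merged patch: the wrapper name safe_check_ovs_status and the status strings are hypothetical stand-ins, and the real change lives inside the agent's rpc_loop.

```python
import logging

LOG = logging.getLogger(__name__)

# Hypothetical status values used only in this sketch.
OVS_NORMAL = "normal"
OVS_DEAD = "dead"


def safe_check_ovs_status(check_ovs_status):
    """Return the switch status, treating any failure as a dead switch."""
    try:
        return check_ovs_status()
    except Exception:
        # Log the error and report the switch as dead so the caller can
        # wait for it to come back instead of exiting its loop.
        LOG.exception("Failure while checking OVS status")
        return OVS_DEAD


def failing_check():
    # Simulates the TypeError from the bug description.
    raise TypeError("int() can't convert non-string with explicit base")
```

With this shape, safe_check_ovs_status(failing_check) returns the "dead" status rather than letting the exception end the loop.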

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/561053
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=164e4563945a3efd31899c682d2948c5ab6964d0
Submitter: Zuul
Branch: master

commit 164e4563945a3efd31899c682d2948c5ab6964d0
Author: Ihar Hrachyshka <email address hidden>
Date: Thu Apr 12 21:08:29 2018 +0000

    ovs: split OVS_RESTARTED handler into a separate method

    Change-Id: If535cad87369980010ef0111d5416d22db707cfe
    Related-Bug: #1731494

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/561060
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=38d0b2b52d7daf77cd3d5123bd2d9853fea7448f
Submitter: Zuul
Branch: master

commit 38d0b2b52d7daf77cd3d5123bd2d9853fea7448f
Author: Ihar Hrachyshka <email address hidden>
Date: Thu Apr 12 21:51:30 2018 +0000

    ovs: raise RuntimeError in _get_dp if id is None

    If the switch misbehaves, we may receive None from db_get_val. In this
    case, int() on the return value will raise TypeError which is not
    expected by callers and may result in ovs agent crash.

    Instead of bubbling up the TypeError exception, we raise RuntimeError if
    datapath id is None.

    Change-Id: I53bea00b9a7302d694b8066e969c894bf64cb2d4
    Closes-Bug: #1731494
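
A minimal sketch of the check this commit message describes, assuming the datapath ID arrives as either a hex string or None. The helper name parse_datapath_id and the error text are hypothetical; the actual change is in _get_dp.

```python
def parse_datapath_id(dpid_str):
    """Convert a hex datapath-ID string to an int, rejecting a missing value."""
    if dpid_str is None:
        # Raise an error callers are prepared for, rather than letting
        # int() fail with a TypeError.
        raise RuntimeError("Unknown datapath id.")
    return int(dpid_str, 16)
```

Callers that already handle RuntimeError from a misbehaving switch then need no extra case for a missing ID.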

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.0.0b1

This issue was fixed in the openstack/neutron 13.0.0.0b1 development milestone.

tags: added: ocata-backport-potential
tags: added: pike-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/645393

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/645394

tags: added: queens-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/645395

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/645396

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/645397

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/645398

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/645395
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cff6a2db8820e2d1d4ba461025de6bd7b882c663
Submitter: Zuul
Branch: stable/queens

commit cff6a2db8820e2d1d4ba461025de6bd7b882c663
Author: Ihar Hrachyshka <email address hidden>
Date: Thu Apr 12 21:51:30 2018 +0000

    ovs: raise RuntimeError in _get_dp if id is None

    If the switch misbehaves, we may receive None from db_get_val. In this
    case, int() on the return value will raise TypeError which is not
    expected by callers and may result in ovs agent crash.

    Instead of bubbling up the TypeError exception, we raise RuntimeError if
    datapath id is None.

    Change-Id: I53bea00b9a7302d694b8066e969c894bf64cb2d4
    Closes-Bug: #1731494
    (cherry picked from commit 38d0b2b52d7daf77cd3d5123bd2d9853fea7448f)

tags: added: in-stable-queens
tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/645394
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=876e1d7969b3ff44f6b48645964cb24a18558a8f
Submitter: Zuul
Branch: stable/pike

commit 876e1d7969b3ff44f6b48645964cb24a18558a8f
Author: Ihar Hrachyshka <email address hidden>
Date: Thu Apr 12 21:51:30 2018 +0000

    ovs: raise RuntimeError in _get_dp if id is None

    If the switch misbehaves, we may receive None from db_get_val. In this
    case, int() on the return value will raise TypeError which is not
    expected by callers and may result in ovs agent crash.

    Instead of bubbling up the TypeError exception, we raise RuntimeError if
    datapath id is None.

    Change-Id: I53bea00b9a7302d694b8066e969c894bf64cb2d4
    Closes-Bug: #1731494
    (cherry picked from commit 38d0b2b52d7daf77cd3d5123bd2d9853fea7448f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/645396
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=12c928b77cd8dbd308279b452ef4031244adb7e5
Submitter: Zuul
Branch: stable/queens

commit 12c928b77cd8dbd308279b452ef4031244adb7e5
Author: Ihar Hrachyshka <email address hidden>
Date: Thu Apr 12 20:30:20 2018 +0000

    ovs: survive errors from check_ovs_status

    Instead of allowing an error to bubble up and exit from rpc_loop, catch
    it and assume the switch is dead which will make the agent to wait until
    the switch is back without failing the service.

    Change-Id: Ic3095dd42b386f56b1f75ebb6a125606f295551b
    Closes-Bug: #1731494
    (cherry picked from commit 544597c6ef9fb297693dbeb0f2d7dc22f3a1b25d)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/645397
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=93cd1921f1a8b5cd58c04d84834b9cd144c61d4d
Submitter: Zuul
Branch: stable/pike

commit 93cd1921f1a8b5cd58c04d84834b9cd144c61d4d
Author: Ihar Hrachyshka <email address hidden>
Date: Thu Apr 12 20:30:20 2018 +0000

    ovs: survive errors from check_ovs_status

    Instead of allowing an error to bubble up and exit from rpc_loop, catch
    it and assume the switch is dead which will make the agent to wait until
    the switch is back without failing the service.

    Change-Id: Ic3095dd42b386f56b1f75ebb6a125606f295551b
    Closes-Bug: #1731494
    (cherry picked from commit 544597c6ef9fb297693dbeb0f2d7dc22f3a1b25d)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/645398
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6a2753a7c16c77b5a7c853b5e56526c41ce09d2e
Submitter: Zuul
Branch: stable/ocata

commit 6a2753a7c16c77b5a7c853b5e56526c41ce09d2e
Author: Ihar Hrachyshka <email address hidden>
Date: Thu Apr 12 20:30:20 2018 +0000

    ovs: survive errors from check_ovs_status

    Instead of allowing an error to bubble up and exit from rpc_loop, catch
    it and assume the switch is dead which will make the agent to wait until
    the switch is back without failing the service.

    Change-Id: Ic3095dd42b386f56b1f75ebb6a125606f295551b
    Closes-Bug: #1731494
    (cherry picked from commit 544597c6ef9fb297693dbeb0f2d7dc22f3a1b25d)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.7

This issue was fixed in the openstack/neutron 11.0.7 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.6

This issue was fixed in the openstack/neutron 12.0.6 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/561054
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.opendev.org/645393
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9fe09cb2c29c5b87564eec71fb1f1c4878014091
Submitter: Zuul
Branch: stable/ocata

commit 9fe09cb2c29c5b87564eec71fb1f1c4878014091
Author: Ihar Hrachyshka <email address hidden>
Date: Thu Apr 12 21:51:30 2018 +0000

    ovs: raise RuntimeError in _get_dp if id is None

    If the switch misbehaves, we may receive None from db_get_val. In this
    case, int() on the return value will raise TypeError which is not
    expected by callers and may result in ovs agent crash.

    Instead of bubbling up the TypeError exception, we raise RuntimeError if
    datapath id is None.

    Change-Id: I53bea00b9a7302d694b8066e969c894bf64cb2d4
    Closes-Bug: #1731494
    (cherry picked from commit 38d0b2b52d7daf77cd3d5123bd2d9853fea7448f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ocata-eol

This issue was fixed in the openstack/neutron ocata-eol release.
