TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.SetFailModeCommand in q-agt gate failure

Bug #1802640 reported by Matt Riedemann
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Medium
Unassigned

Bug Description

Seen here:

http://logs.openstack.org/40/617040/1/check/nova-next/0a82b26/logs/screen-q-agt.txt.gz?level=TRACE

Nov 10 03:51:03.120446 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.backend.ovs_idl.command [-] Error executing command: TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.SetFailModeCommand object at 0x7fb4e1a94190>] exceeded timeout 10 seconds
Nov 10 03:51:03.120918 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.backend.ovs_idl.command Traceback (most recent call last):
Nov 10 03:51:03.121258 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.backend.ovs_idl.command File "/usr/local/lib/python2.7/dist-packages/ovsdbapp/backend/ovs_idl/command.py", line 35, in execute
Nov 10 03:51:03.121528 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.backend.ovs_idl.command txn.add(self)
Nov 10 03:51:03.121801 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.backend.ovs_idl.command File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
Nov 10 03:51:03.122032 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.backend.ovs_idl.command self.gen.next()
Nov 10 03:51:03.122299 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.backend.ovs_idl.command File "/usr/local/lib/python2.7/dist-packages/ovsdbapp/api.py", line 112, in transaction
Nov 10 03:51:03.122563 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.backend.ovs_idl.command del self._nested_txns_map[cur_thread_id]
Nov 10 03:51:03.122747 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.backend.ovs_idl.command File "/usr/local/lib/python2.7/dist-packages/ovsdbapp/api.py", line 69, in __exit__
Nov 10 03:51:03.123003 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.backend.ovs_idl.command self.result = self.commit()
Nov 10 03:51:03.123198 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.backend.ovs_idl.command File "/usr/local/lib/python2.7/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 57, in commit
Nov 10 03:51:03.123418 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.backend.ovs_idl.command timeout=self.timeout)
Nov 10 03:51:03.123647 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.backend.ovs_idl.command TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.SetFailModeCommand object at 0x7fb4e1a94190>] exceeded timeout 10 seconds
Nov 10 03:51:03.123865 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.backend.ovs_idl.command
Nov 10 03:51:03.131466 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Commands [<ovsdbapp.schema.open_vswitch.commands.SetFailModeCommand object at 0x7fb4e1a94190>] exceeded timeout 10 seconds Agent terminated!: TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.SetFailModeCommand object at 0x7fb4e1a94190>] exceeded timeout 10 seconds
Nov 10 03:51:03.134192 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.schema.open_vswitch.impl_idl [-] Post-commit checks failed: TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.SetFailModeCommand object at 0x7fb4e1a94190>] exceeded timeout 10 seconds
Nov 10 03:51:03.134358 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.schema.open_vswitch.impl_idl Traceback (most recent call last):
Nov 10 03:51:03.134471 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.schema.open_vswitch.impl_idl File "/usr/local/lib/python2.7/dist-packages/ovsdbapp/schema/open_vswitch/impl_idl.py", line 40, in post_commit
Nov 10 03:51:03.134590 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.schema.open_vswitch.impl_idl self.do_post_commit(txn)
Nov 10 03:51:03.134706 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.schema.open_vswitch.impl_idl File "/usr/local/lib/python2.7/dist-packages/ovsdbapp/schema/open_vswitch/impl_idl.py", line 60, in do_post_commit
Nov 10 03:51:03.134821 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.schema.open_vswitch.impl_idl timeout=self.timeout)
Nov 10 03:51:03.134989 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.schema.open_vswitch.impl_idl TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.SetFailModeCommand object at 0x7fb4e1a94190>] exceeded timeout 10 seconds
Nov 10 03:51:03.135117 ubuntu-xenial-ovh-bhs1-0000461070 neutron-openvswitch-agent[26802]: ERROR ovsdbapp.schema.open_vswitch.impl_idl
Nov 10 03:51:03.518750 ubuntu-xenial-ovh-bhs1-0000461070 systemd[1]: <email address hidden>: Main process exited, code=exited, status=1/FAILURE
Nov 10 03:51:03.519100 ubuntu-xenial-ovh-bhs1-0000461070 systemd[1]: <email address hidden>: Unit entered failed state.
Nov 10 03:51:03.519135 ubuntu-xenial-ovh-bhs1-0000461070 systemd[1]: <email address hidden>: Failed with result 'exit-code'.

This could just be an intermittent failure on slow CI nodes.

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22TimeoutException%3A%20Commands%20%5B%3Covsdbapp.schema.open_vswitch.commands.SetFailModeCommand%20object%20at%5C%22%20AND%20message%3A%5C%22exceeded%20timeout%2010%20seconds%5C%22%20AND%20tags%3A%5C%22screen-q-agt.txt%5C%22&from=10d

Reporting this for tracking in elastic-recheck.

Revision history for this message
Julia Kreger (juliaashleykreger) wrote :

Over the past few days, according to elasticsearch, this has occurred 15-25 times per every 3 hour window. Most recently observed it causing a grenade job to fail in a patch on ironic's gate. :(

Matt Riedemann (mriedem)
Changed in neutron:
importance: Undecided → Medium
Revision history for this message
Slawek Kaplonski (slaweq) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-tempest-plugin (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/621318

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

The timeout always happens as we are waiting for the response from ovsdb-server incrementing next_cfg. The time between successfully calling OVSBridge.create() and set_secure_mode() (which fails) in one of the logs I looked at is 0.3 seconds and there is nothing weird in ovsdb-server.log (we just see the disconnect from the agent after the TimeoutException), so it seems unlikely that things slowed down by over 30x in that short time to cause an ovsdb timeout.

We're not adding any interfaces in the set_secure_mode call and the exception is a TimeoutException so it's definitely hitting here: https://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/schema/open_vswitch/impl_idl.py#L59. So it looks like vswitchd_has_completed(next_cfg) never becomes True. I don't see anything in the ovs-vswitchd logs that would indicate it is having an issue. pre_commit() is definitely being called to increment next_cfg(), so to me it looks like maybe python-ovs is hitting an exception and eating it? It would help to turn on the python-ovs vlog debug logging to see what is going on under the hood. (we normally have it turned off because it is *extremely* chatty.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/621572

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

According to logstash, it hasn't hit since Nov 30 for some reason.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron-tempest-plugin (master)

Change abandoned by Hongbin Lu (<email address hidden>) on branch: master
Review: https://review.openstack.org/621318

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Terry Wilson (<email address hidden>) on branch: master
Review: https://review.openstack.org/621572
Reason: was just for testing.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/703791

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/703791
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cf36bd4ca52bec4a707c35df261b8a2630850e68
Submitter: Zuul
Branch: master

commit cf36bd4ca52bec4a707c35df261b8a2630850e68
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Jan 22 13:34:29 2020 +0000

    Enable ovsdb debug messages in functional and fullstack

    Enable the configuration flag "ovsdb_debug" for functional and
    fullstack tests. If enable, NeutronOvsdbIdl won't define any log
    level when setting up the logger; if not "max_level" is defined,
    all messages are logged.

    Change-Id: I1185c1ca2556e2aed27051efd98a1d141f5298c9
    Related-Bug: #1802640

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/neutron/+/794625

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/794625
Committed: https://opendev.org/openstack/neutron/commit/9cfa441335c484b499d741723d77121960ce6cf2
Submitter: "Zuul (22348)"
Branch: stable/train

commit 9cfa441335c484b499d741723d77121960ce6cf2
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Jan 22 13:34:29 2020 +0000

    Enable ovsdb debug messages in functional and fullstack

    Enable the configuration flag "ovsdb_debug" for functional and
    fullstack tests. If enable, NeutronOvsdbIdl won't define any log
    level when setting up the logger; if not "max_level" is defined,
    all messages are logged.

    Change-Id: I1185c1ca2556e2aed27051efd98a1d141f5298c9
    Related-Bug: #1802640
    (cherry picked from commit cf36bd4ca52bec4a707c35df261b8a2630850e68)

tags: added: in-stable-train
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I don't think we still have such issues in the tempest jobs. It can happen from time to time in the functional job maybe, but it's very rare recently and IMHO this may be related to overloaded nodes simply. So I'm closing that bug for now.

Changed in neutron:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.