Sometimes ovsdb fails with "tcp:127.0.0.1:6640: error parsing stream"

Bug #1687074 reported by Ihar Hrachyshka
This bug affects 2 people
Affects     Status         Importance   Assigned to   Milestone
neutron     Fix Released   High         Unassigned
ovsdbapp    Invalid        Undecided    Unassigned

Bug Description

Example (Ocata): http://logs.openstack.org/67/460867/1/check/gate-neutron-dsvm-functional-ubuntu-xenial/382d800/logs/dsvm-functional-logs/neutron.tests.functional.agent.ovsdb.test_impl_idl.ImplIdlTestCase.test_post_commit_vswitchd_completed_no_failures.txt.gz

2017-04-28 07:59:01.430 11929 WARNING neutron.agent.ovsdb.native.vlog [-] tcp:127.0.0.1:6640: error parsing stream: line 0, column 1, byte 1: syntax error at beginning of input
2017-04-28 07:59:01.431 11929 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): AddBridgeCommand(name=test-brc6de03bf, may_exist=True, datapath_type=None) do_commit neutron/agent/ovsdb/impl_idl.py:100
2017-04-28 07:59:01.433 11929 DEBUG neutron.agent.ovsdb.impl_idl [-] OVSDB transaction returned TRY_AGAIN, retrying do_commit neutron/agent/ovsdb/impl_idl.py:111
2017-04-28 07:59:01.433 11929 WARNING neutron.agent.ovsdb.native.vlog [-] tcp:127.0.0.1:6640: connection dropped (Protocol error)

If we look at logstash here:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22tcp%3A127.0.0.1%3A6640%3A%20error%20parsing%20stream%5C%22

We see some interesting data points. Sometimes it actually logs what's in the buffer, and I see instances of:

2017-04-27 19:02:51.755 [neutron.tests.functional.tests.common.exclusive_resources.test_port.TestExclusivePort.test_port] 3300 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: error parsing stream: line 0, column 1355, byte 1355: invalid keyword 'id'

2017-04-27 14:22:02.294 [neutron.tests.functional.agent.linux.test_ovsdb_monitor.TestSimpleInterfaceMonitor.test_get_events_native_] 3433 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: error parsing stream: line 0, column 3, byte 3: invalid keyword 'new'

2017-04-27 04:46:17.667 [neutron.tests.functional.agent.test_dhcp_agent.DHCPAgentOVSTestCase.test_bad_address_allocation] 4136 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: error parsing stream: line 0, column 3, byte 3: invalid keyword 'ace'

2017-04-26 18:04:59.110 [neutron.tests.functional.agent.linux.test_linuxbridge_arp_protect.LinuxBridgeARPSpoofTestCase.test_arp_correct_protection_allowed_address_pairs] 3477 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: error parsing stream: line 0, column 3, byte 3: invalid keyword 'err'

2017-04-25 19:00:01.452 [neutron.tests.functional.agent.test_dhcp_agent.DHCPAgentOVSTestCase.test_agent_mtu_set_on_interface_driver] 4251 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: error parsing stream: line 0, column 3, byte 3: invalid keyword 'set'

2017-04-25 16:34:11.355 [neutron.tests.functional.agent.linux.test_linuxbridge_arp_protect.LinuxBridgeARPSpoofTestCase.test_arp_fails_incorrect_mac_protection] 3332 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: error parsing stream: line 0, column 5, byte 5: invalid keyword 'tatus'

2017-04-25 03:28:25.858 [neutron.tests.functional.agent.ovsdb.test_impl_idl.ImplIdlTestCase.test_post_commit_vswitchd_completed_no_failures] 4112 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: error parsing stream: line 0, column 3, byte 3: invalid keyword 'set'

2017-04-24 21:59:39.094 [neutron.tests.functional.agent.linux.test_linuxbridge_arp_protect.LinuxBridgeARPSpoofTestCase.test_arp_protection_port_security_disabled] 3682 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: error parsing stream: line 0, column 5, byte 5: invalid keyword 'rsion'

Terry says it doesn't resemble the protocol, but rather some random crap, potentially from some random place in memory (SCARY!).
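
For illustration, this class of error is easy to reproduce by handing the OVS JSON parser a stream that starts mid-message. This is a minimal sketch, assuming the ovs python lib's ovs.json.from_string() helper; the truncated payload below is made up, not taken from the failing tests:

    import ovs.json

    # A JSON-RPC update whose leading bytes were lost, so parsing starts at "new".
    fragment = 'new":{"name":"br-int"}}'

    result = ovs.json.from_string(fragment)
    # On a parse failure from_string() returns an error string rather than a
    # parsed object, along the lines of:
    #   line 0, column 3, byte 3: invalid keyword 'new'
    print(result)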

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

The least we could do is dump some more details to understand where that bogus data came from. Maybe once we see more context, we will be able to figure out where the buffer contents come from.

Changed in neutron:
importance: Undecided → High
status: New → Confirmed
Changed in ovsdbapp:
status: New → Confirmed
tags: added: fullstack functional-tests gate-failure ovs
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Still happening. Not every hit results in job failure.

Revision history for this message
Jakub Libosvar (libosvar) wrote :

I wonder whether https://review.openstack.org/#/c/520176/ also fixes this issue.

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

Adding debugging info, e.g. dumping the entire buffer when there is a parsing issue as opposed to just the part of the message that caused the error, is made a little difficult since the code that would need to be added lives in the OVS python lib (in the name-mangled ovs.jsonrpc.Connection.__process_msg).

__process_msg calls self.error() which closes the stream and resets the buffer to "". We could probably monkey patch in a subclassed jsonrpc.Connection with an error() that dumps the self.output, then calls super().error() to get a better look at the messages that are causing the issue.
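
A rough sketch of that idea, patching the existing method in place rather than injecting a subclass (assumptions: the installed ovs lib exposes ovs.jsonrpc.Connection.error(self, error) plus input/output buffer attributes; exact names may differ between versions):

    import logging

    import ovs.jsonrpc

    LOG = logging.getLogger(__name__)

    _orig_error = ovs.jsonrpc.Connection.error

    def _dumping_error(self, error):
        # Log whatever is still sitting in the buffers before the original
        # error() closes the stream and throws the data away.
        LOG.warning("jsonrpc error %s on %s; input=%r output=%r",
                    error, getattr(self, "name", "?"),
                    getattr(self, "input", None),
                    getattr(self, "output", None))
        return _orig_error(self, error)

    ovs.jsonrpc.Connection.error = _dumping_error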

Some of the "error parsing stream" messages seem to occur at the very beginning of a message, as if the buffer is starting in the middle of an existing stream. Others seem to happen many bytes into the message. It could be that functional tests unceremoniously dropping the connection in the middle of a read causes some of the messages? ovsdbapp added the ability to "stop()" a Connection, though looking at the code it doesn't actually stop the ovs.db.Idl connection itself. Maybe adding a self.idl.close() to the end of Connection.run() and then adding an explicit ovsdbapp.backend.ovs_idl.connection.Connection.stop() for the test teardown would help--though neutron shouldn't really care about the backend in use and the vsctl implementation obviously has no connection...
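
A sketch of that explicit-teardown idea, assuming the Connection object exposes .stop() and an .idl attribute as described above (names taken from this comment, not verified against a particular ovsdbapp release):

    def cleanup_ovsdb_connection(conn):
        """Tear down an ovsdbapp.backend.ovs_idl.connection.Connection."""
        # Stop the poll loop first, then close the underlying Idl session so
        # the server never sees the socket dropped in the middle of a read.
        conn.stop()
        idl = getattr(conn, "idl", None)
        if idl is not None:
            idl.close()

In a functional test this would be registered with addCleanup() so it runs before the fixture tears everything down.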

Anyway, I'll look into it. :p

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

It blew up again here: http://logs.openstack.org/99/526199/1/gate/legacy-neutron-dsvm-functional/ee6740e/logs/dsvm-functional-logs/neutron.tests.functional.agent.test_dhcp_agent.DHCPAgentOVSTestCase.test_agent_mtu_set_on_interface_driver.txt.gz

Interestingly, this time it just churns around for 3 minutes retrying, then bails out with a failure. Could the latest retry changes in ovsdbapp be affecting this?

Anyway, seems that the root cause is still the same: bad data inside the buffer.

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

The retrying is just because the fixture create_resource() tries forever whenever there is a RuntimeError, which is what we get when a timeout happens. It is interesting that it doesn't seem to recover when it runs the transaction again. I'm guessing that Idl.run() is breaking out of the loop before actually doing anything. I'll investigate.
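
For illustration only (not neutron's actual helper), the retry pattern being described looks roughly like this; a commit timeout surfaces as a RuntimeError, which keeps the loop spinning:

    def create_resource(creation_func, *args, **kwargs):
        # Illustrative stand-in for the fixture helper mentioned above.
        while True:
            try:
                return creation_func(*args, **kwargs)
            except RuntimeError:
                # Timed-out OVSDB transactions land here, so the loop simply
                # tries again until the test's own timeout fires.
                continue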

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

Still looking. Interesting here: http://logs.openstack.org/82/423382/16/check/legacy-neutron-dsvm-functional/d9fe859/logs/dsvm-functional-index.txt#_2017-12-11_23_25_49_606 :

[neutron.tests.functional.services.trunk.drivers.openvswitch.agent.test_trunk_manager.TrunkManagerTestCase.test_connectivity] 9965 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: error parsing stream: line 0, column 0, byte 0: syntax error at beginning of input: RuntimeError: Port nonexistantport does not exist

That text is clearly output from test_ovs_lib's test_delete_port_if_exists_false, not from the test named in the log message. Also, the error should stop at "syntax error at beginning of input", since that is what is set by __error() in ovs.json.Parser and what parser.finish() should return to make ovs.jsonrpc.Connection.__process_msg log the "error parsing stream" message. The ": RuntimeError: Port nonexistantport does not exist" in the log must come from somewhere else, and I don't see any place where it could just be appended. Is it something weird with the vlog logging? Each of the subsequent vlog messages from the same test also has that garbage appended to the end. The "error parsing stream" messages from other tests end properly.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Marked the neutron part as fixed because we haven't seen gate failures caused by the error message lately. It could even be that the failures we saw in the past were not related to the error message at all, and the message was just a red herring. Slawek and others have made huge progress lately on stabilizing the fullstack job and merged a lot of patches unrelated to the message, and the job is now quite stable.

We are closing it for neutron. ovsdbapp folks should decide on their own if they want to pursue it.

Changed in neutron:
status: Confirmed → Fix Released
Revision history for this message
Terry Wilson (otherwiseguy) wrote :

Going ahead and closing since I haven't seen the error pop up in a while. We didn't do anything in ovsdbapp to specifically target the error message, so I suspect it was something weird going on with the neutron tests. But can't be sure, so just going to mark it incomplete for ovsdbapp.

Changed in ovsdbapp:
status: Confirmed → Incomplete
Revision history for this message
Terry Wilson (otherwiseguy) wrote :

I guess incomplete is still "open", so invalid it is.

Changed in ovsdbapp:
status: Incomplete → Invalid
Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

We are still having this issue with Pike.

2018-05-29 02:18:09.056 7 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=1): DbAddCommand(column=protocols, record=br-int, values=('OpenFlow10',), table=Bridge) do_commit /var/lib/kolla/venv/local/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-05-29 02:18:09.057 7 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] OVSDB transaction returned TRY_AGAIN, retrying do_commit /var/lib/kolla/venv/local/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:93
2018-05-29 02:18:09.060 7 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): AddBridgeCommand(datapath_type=system, may_exist=True, name=br-int) do_commit /var/lib/kolla/venv/local/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-05-29 02:18:09.060 7 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=1): DbAddCommand(column=protocols, record=br-int, values=('OpenFlow10',), table=Bridge) do_commit /var/lib/kolla/venv/local/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-05-29 02:18:37.223 7 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp [-] Agent main thread died of an exception: TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.AddBridgeCommand object at 0x7fc2c4545f90>, <ovsdbapp.schema.open_vswitch.commands.DbAddCommand object at 0x7fc2c9c5d0d0>] exceeded timeout 10 seconds
2018-05-29 02:18:37.223 7 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp Traceback (most recent call last):
2018-05-29 02:18:37.223 7 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_ryuapp.py", line 40, in agent_main_wrapper
2018-05-29 02:18:37.223 7 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp ovs_agent.main(bridge_classes)
2018-05-29 02:18:37.223 7 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 2219, in main
2018-05-29 02:18:37.223 7 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp agent = OVSNeutronAgent(bridge_classes, ext_mgr, cfg.CONF)
2018-05-29 02:18:37.223 7 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 172, in __init__
2018-05-29 02:18:37.223 7 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp self.setup_integration_br()
2018-05-29 02:18:37.223 7 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
2018-05-29 02:18:37.223 7 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp return f(*ar...
