Open vSwitch (Version 2.9.2) goes into deadlocked state

Bug #1839592 reported by Camilo
This bug affects 30 people
Affects                Status      Importance  Assigned to  Milestone
glibc (Ubuntu)         Incomplete  High        Unassigned
  Bionic               Incomplete  Undecided   Unassigned
  Focal                Incomplete  Undecided   Unassigned
openvswitch (Ubuntu)   Invalid     Undecided   Unassigned
  Bionic               Invalid     Undecided   Unassigned
  Focal                Invalid     Undecided   Unassigned

Bug Description

Description: Ubuntu 18.04.2 LTS
Release: 18.04

root@kv02:~# apt-cache policy openvswitch-common
openvswitch-common:
  Installed: 2.9.2-0ubuntu0.18.04.3
  Candidate: 2.9.2-0ubuntu0.18.04.3
  Version table:
 *** 2.9.2-0ubuntu0.18.04.3 500

At random times the ovs-vswitchd service locks up while waiting for a revalidator thread to quiesce. Here is the log tail:

ovs-vswitchd.log:
2019-08-07T04:25:01.687Z|00101|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2019-08-07T08:48:23.885Z|00012|ovs_rcu(urcu5)|WARN|blocked 1000 ms waiting for revalidator127 to quiesce
2019-08-07T08:48:24.884Z|00102|ovs_rcu|WARN|blocked 1000 ms waiting for revalidator127 to quiesce
2019-08-07T08:48:24.885Z|00013|ovs_rcu(urcu5)|WARN|blocked 2000 ms waiting for revalidator127 to quiesce
2019-08-07T08:48:25.883Z|00103|ovs_rcu|WARN|blocked 2000 ms waiting for revalidator127 to quiesce
2019-08-07T08:48:26.886Z|00014|ovs_rcu(urcu5)|WARN|blocked 4001 ms waiting for revalidator127 to quiesce
2019-08-07T08:48:27.884Z|00104|ovs_rcu|WARN|blocked 4000 ms waiting for revalidator127 to quiesce
2019-08-07T08:48:30.886Z|00015|ovs_rcu(urcu5)|WARN|blocked 8001 ms waiting for revalidator127 to quiesce
2019-08-07T08:48:31.883Z|00105|ovs_rcu|WARN|blocked 8000 ms waiting for revalidator127 to quiesce
2019-08-07T08:48:38.886Z|00016|ovs_rcu(urcu5)|WARN|blocked 16001 ms waiting for revalidator127 to quiesce
2019-08-07T08:48:39.883Z|00106|ovs_rcu|WARN|blocked 16000 ms waiting for revalidator127 to quiesce
2019-08-07T08:48:54.885Z|00017|ovs_rcu(urcu5)|WARN|blocked 32000 ms waiting for revalidator127 to quiesce
2019-08-07T08:48:55.883Z|00107|ovs_rcu|WARN|blocked 32000 ms waiting for revalidator127 to quiesce
2019-08-07T08:49:26.885Z|00018|ovs_rcu(urcu5)|WARN|blocked 64000 ms waiting for revalidator127 to quiesce
2019-08-07T08:49:27.883Z|00108|ovs_rcu|WARN|blocked 64000 ms waiting for revalidator127 to quiesce
2019-08-07T08:50:30.885Z|00019|ovs_rcu(urcu5)|WARN|blocked 128000 ms waiting for revalidator127 to quiesce
2019-08-07T08:50:31.883Z|00109|ovs_rcu|WARN|blocked 128000 ms waiting for revalidator127 to quiesce
2019-08-07T08:52:38.885Z|00020|ovs_rcu(urcu5)|WARN|blocked 256000 ms waiting for revalidator127 to quiesce
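
For anyone triaging a similar log, here is a minimal Python sketch that pulls the stuck thread name and the longest reported wait out of the `ovs_rcu` warnings. It is not part of Open vSwitch: the function name `longest_block_per_thread` is hypothetical, and it assumes the `|`-separated line layout shown in the log above.

```python
import re

# Message part of an ovs_rcu warning, as it appears in the log above.
BLOCKED_RE = re.compile(r"blocked (\d+) ms waiting for (\S+) to quiesce")


def longest_block_per_thread(lines):
    """Return {thread_name: longest reported block in ms} for ovs_rcu warnings."""
    longest = {}
    for line in lines:
        fields = line.rstrip("\n").split("|")
        # Assumed layout: timestamp | sequence | module[(thread)] | level | message
        if len(fields) < 5 or not fields[2].startswith("ovs_rcu"):
            continue
        match = BLOCKED_RE.search("|".join(fields[4:]))
        if match:
            ms, thread = int(match.group(1)), match.group(2)
            longest[thread] = max(longest.get(thread, 0), ms)
    return longest


# A few lines copied from the log tail in this report.
sample = [
    "2019-08-07T04:25:01.687Z|00101|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log",
    "2019-08-07T08:48:23.885Z|00012|ovs_rcu(urcu5)|WARN|blocked 1000 ms waiting for revalidator127 to quiesce",
    "2019-08-07T08:50:31.883Z|00109|ovs_rcu|WARN|blocked 128000 ms waiting for revalidator127 to quiesce",
    "2019-08-07T08:52:38.885Z|00020|ovs_rcu(urcu5)|WARN|blocked 256000 ms waiting for revalidator127 to quiesce",
]

print(longest_block_per_thread(sample))  # prints {'revalidator127': 256000}
```

A wait that keeps doubling for a single thread, as it does here, is the signature of the hang rather than a transient stall.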

This causes commands such as the following to block without producing any output:

ovs-ofctl show sw0

ovs-ofctl dump-flows sw0

It is necessary to restart the ovs-vswitchd service to recover.
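
Since the hung daemon makes `ovs-ofctl` block indefinitely, one way to notice the state from a monitoring script is to run the command with a timeout. The following is a minimal Python sketch under that assumption. The helper name `command_hangs` and the 5-second default are illustrative choices, not a mechanism provided by Open vSwitch.

```python
import subprocess


def command_hangs(argv, timeout_s=5.0):
    """Return True if the command in argv does not finish within timeout_s seconds."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return True
    return False


# Example call, using the bridge name "sw0" from this report:
#   command_hangs(["ovs-ofctl", "show", "sw0"])
```

A `True` result only says that the command did not return in time. Whether a restart of ovs-vswitchd is appropriate is still a decision for the operator, since the restart disrupts traffic.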

Related https://github.com/openvswitch/ovs-issues/issues/153

Are you planning to update openvswitch to version 2.9.5?

https://www.openvswitch.org/releases/NEWS-2.9.5.txt

Ubuntu 18.04 is stuck on 2.9.2.

Tags: sts
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in openvswitch (Ubuntu):
status: New → Confirmed
Seyeong Kim (seyeongkim) wrote :

https://github.com/openvswitch/ovs/commit/2ed1c95e9c80a21a5d81cb760c872b62b32b0733

This seems to be the commit that fixes this, and 2.9.3 has it.

@Camilo, do you have a reproducer for this issue?

tags: added: sts
Dr. Jens Harbott (j-harbott) wrote :

From the discussion in https://github.com/openvswitch/ovs-issues/issues/175 it seems like this may actually be a glibc bug https://sourceware.org/bugzilla/show_bug.cgi?id=23844 , I'll try to get some tests done in order to confirm this.

James Page (james-page) wrote :

bug ref for 2.9.5 SRU - bug 1854360

James Page (james-page) wrote :
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in glibc (Ubuntu):
status: New → Confirmed
Juul Spies (juul) wrote :

I just came across this bug report and would like to share my experience.

I've been having similar issues on 6 servers since we upgraded from 16.04 to 18.04 about 2 years ago with openvswitch.
Our biggest problem is our inability to reproduce it. We just see Open vSwitch hanging from time to time. Sometimes it takes a day to get stuck, sometimes it takes months.
The only way to recover from it is to restart openvswitch.

Right now we are running with a backport of openvswitch from Disco (2.11.0-0ubuntu2) in Bionic. With that version backported we are having the same issues as with the previously installed 2.9.2-0ubuntu0.18.04.3 version that Bionic has.

I have gdb traces from both versions, which I will attach.

Here is a small portion of the ovs log and gdb trace from openvswitch 2.9.2-0ubuntu0.18.04.3:
Sun Aug 25 06:16:14 2019-2019-08-25T04:16:14.943Z|00001|ovs_rcu(urcu4)|WARN|blocked 1000 ms waiting for revalidator127 to quiesce
Sun Aug 25 06:16:15 2019-2019-08-25T04:16:15.943Z|00002|ovs_rcu(urcu4)|WARN|blocked 2000 ms waiting for revalidator127 to quiesce
Sun Aug 25 06:16:50 2019-2019-08-25T04:16:17.943Z|00003|ovs_rcu(urcu4)|WARN|blocked 4001 ms waiting for revalidator127 to quiesce

Small portion of the trace:
32 Thread 0x7f1bfa7fc700 (LWP 1461) "revalidator127" 0x00007f1c61aeb37b in futex_abstimed_wait (private=<optimized out>, abstime=0x0, expected=10, futex_word=0x55e4ed0aa800 <rwlock>) at ../sysdeps/unix/sysv/linux/futex-internal.h:172

The full trace is attached in gdbwrap.1566706577.log.gz (Openvswitch 2.9.2)

Juul Spies (juul) wrote :

Here is a similar portion of the ovs log and gdb trace from a system running openvswitch 2.11.0-0ubuntu2:
Thu Oct 31 18:20:23 2019-2019-10-31T17:20:23.521Z|00001|ovs_rcu(urcu4)|WARN|blocked 1000 ms waiting for revalidator124 to quiesce
Thu Oct 31 18:20:24 2019-2019-10-31T17:20:24.521Z|00002|ovs_rcu(urcu4)|WARN|blocked 2000 ms waiting for revalidator124 to quiesce
Thu Oct 31 18:20:59 2019-2019-10-31T17:20:26.520Z|00003|ovs_rcu(urcu4)|WARN|blocked 4000 ms waiting for revalidator124 to quiesce

In the trace:
29 Thread 0x7f72f97fa700 (LWP 26608) "revalidator124" 0x00007f734aee237b in futex_abstimed_wait (private=<optimized out>, abstime=0x0, expected=10, futex_word=0x5577bab397c0 <rwlock>) at ../sysdeps/unix/sysv/linux/futex-internal.h:172

Full trace attached in gdbwrap.1572542426.log.gz

The traces are a bit of hocus-pocus to me. I really don't have a clue what's going on in them, but I guess they might help you make sense of what's going on here.

Juul Spies (juul) wrote :
Dr. Jens Harbott (j-harbott) wrote :

So as documented in https://github.com/openvswitch/ovs-issues/issues/175 , this really seems to be a bug in glibc and not in openvswitch. After installing libc6 and related packages at version 2.29-0ubuntu2 (from disco), the issue is resolved for us. The single patch from https://sourceware.org/bugzilla/show_bug.cgi?id=23844 did not fix the issue. So we would either need to make an effort to get the version from disco into bionic-updates, or someone would need to locate a set of possible patches and backport them.

Changed in openvswitch (Ubuntu):
status: Confirmed → Invalid
James Page (james-page)
Changed in glibc (Ubuntu):
importance: Undecided → High
TWENTY |20 (tw20) wrote :

I would also like to share our experiences:
We have had problems with this bug for a whole year. "ovs is dead..."
At first we suspected a bug in openvswitch, but this issue made us aware of glibc: https://github.com/openvswitch/ovs-issues/issues/175
We updated glibc to version 2.29-0ubuntu2 from the disco repository.
Since then, for more than a month, we haven't had any problems on 6 compute nodes.
It would be nice if the glibc package updates could also be made available in the bionic repositories.

do3meli (d-info-e) wrote :

We are also affected by this issue and would appreciate it if glibc were updated to 2.29-0ubuntu2 in the bionic repositories.

Salman (salmankh) wrote :

Is there a glibc upgrade coming to 18.04? We are hit by this problem very frequently, and the only way out is to restart the ovs and neutron services.

Salman (salmankh) wrote :

And obviously, restarting the ovs and neutron services disrupts network traffic massively.

trya uuum (tryauuum) wrote :

@salmankh
you can solve this by installing these packages from eoan:
```
libc6
libc6-dbg
libcap-ng0
libc-bin
libidn2-0
locales
openvswitch-common
openvswitch-switch
```
and don't forget to run `systemctl daemon-reexec` afterwards!

Dimitri John Ledkov (xnox) wrote :
Dr. Jens Harbott (j-harbott) wrote :

Likely it is not a duplicate, see my comment in #10 and in the bugzilla. So if branch-pthread_rwlock_trywrlock-hang-23844.patch is the patch from the bugzilla, I tested that earlier and it did not solve the issue. The issue here seems to be with pthread_rwlock_rdlock hanging, see the tracebacks in the ovs github issue.

Mohammed Naser (mnaser) wrote :

I have done some extensive research on this and I am able to confirm that while Bug #1864864 does solve a class of issues, it does not solve the one affecting Open vSwitch, because Open vSwitch relies on PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP:

https://github.com/openvswitch/ovs/blob/0de1b425962db073ebbaa3ddbde445580afda840/lib/ovs-thread.c#L247-L248

The hang can be reproduced with the reproducer attached here:

https://sourceware.org/bugzilla/show_bug.cgi?id=23861

by doing the following:

```
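# Fetch the reproducer attachment (URL quoted so the shell does not treat "?" as a glob).
curl 'https://sourceware.org/bugzilla/attachment.cgi?id=11382' -o bug23861.c
# Apply the one-line source edit used for this test.
sed -i 's/do_exit = 0/do_exit(0)/' bug23861.c
g++ bug23861.c -lpthread -o bug23861
# Run the reproducer 99 times; an iteration that never returns is the hang.
for ((x=1;x<100;x++)) ; do echo $x;date;./bug23861 --prefer-writer-nonrecursive;done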
```

I will add that in my initial tests of doing this inside a small VM with a Docker container, it did not hang after running for 10-15 minutes, but it immediately hung on a much more powerful machine. I will file a bug to try and get the patch above SRU'd into glibc as well and hopefully that will be the end of it.

Mohammed Naser (mnaser) wrote :

So, it turns out that 2.27-3ubuntu1.3 could fix this, because it includes not only that patch but also a full rebase onto the 2.27 branch of upstream glibc, which does include the other fix mentioned above (I found this out when I tried to rebuild locally and quilt complained about an already built patch).

The code above freezes on the 2.27-3ubuntu1.2 system but works fine under 2.27-3ubuntu1.3.

James Page (james-page) wrote :

For context - 2.27-3ubuntu1.3 is in bionic-proposed

James Page (james-page) wrote :

Thanks for the research @mnaser. It is interesting that this is only seen on much more powerful machines, which I think has been the key blocker to reproducing this issue.

Balint Reczey (rbalint) wrote :

2.27-3ubuntu1.3 has been released to bionic-updates. @mnaser do I understand correctly that this bug is fully fixed by it?

Balint Reczey (rbalint) wrote :

Also I assume the bug is fixed in Ubuntu 20.04 and later as well. Could you please confirm that?

Changed in glibc (Ubuntu):
status: Confirmed → Incomplete
Changed in glibc (Ubuntu Bionic):
status: New → Incomplete
Changed in glibc (Ubuntu Focal):
status: New → Incomplete
Changed in openvswitch (Ubuntu Bionic):
status: New → Incomplete
status: Incomplete → Invalid
Changed in openvswitch (Ubuntu Focal):
status: New → Invalid