o2cb configuration options ignored in 16.04

Bug #1614038 reported by Sven
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

We've been trying to add a 16.04 node (ocfs2-tools 1.6.4-3.1) to our existing OCFS2 filesystem based on Ubuntu 13.04 (ocfs2-tools 1.6.4-2ubuntu1) and Ubuntu 14.04 (ocfs2-tools 1.6.4-3ubuntu1).

 * Node1: Ubuntu 16.04, Slot 1, 10.22.44.21
 * Node2: Ubuntu 13.04, Slot 2, 10.22.44.22
 * Node3: Ubuntu 14.04, Slot 6, 10.22.44.23
 * Node4: Ubuntu 14.04, Slot 7, 10.22.44.24

The existing system has an O2CB_HEARTBEAT_THRESHOLD=61 setting, but these tweaks seem to be ignored when adding the new node. Here's the syslog section:

Aug 16 15:58:02 node1 kernel: [ 936.294820] (o2hb-37AAEB0304,5741,7):o2hb_check_slot:895 ERROR: Node 2 on device sdc has a dead count of 122000 ms, but our count is 62000 ms.
Aug 16 15:58:02 node1 kernel: [ 936.294820] Please double check your configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
Aug 16 15:58:02 node1 kernel: [ 936.294949] (o2hb-37AAEB0304,5741,7):o2hb_check_slot:895 ERROR: Node 6 on device sdc has a dead count of 122000 ms, but our count is 62000 ms.
Aug 16 15:58:02 node1 kernel: [ 936.294949] Please double check your configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
Aug 16 15:58:02 node1 kernel: [ 936.295071] (o2hb-37AAEB0304,5741,7):o2hb_check_slot:895 ERROR: Node 7 on device sdc has a dead count of 122000 ms, but our count is 62000 ms.
Aug 16 15:58:02 node1 kernel: [ 936.295071] Please double check your configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
Aug 16 15:58:03 node1 kernel: [ 937.123350] o2net: node node3 (num 6) at 10.22.44.23:7777 uses a heartbeat timeout of 120000 ms, but we use 60000 ms locally. Disconnecting.
Aug 16 15:58:03 node1 kernel: [ 937.393608] o2net: node node2 (num 2) at 10.22.44.22:7777 uses a heartbeat timeout of 120000 ms, but we use 60000 ms locally. Disconnecting.
Aug 16 15:58:04 node1 kernel: [ 938.055983] o2net: node node4 (num 7) at 10.22.44.24:7777 uses a heartbeat timeout of 120000 ms, but we use 60000 ms locally. Disconnecting.
Aug 16 15:58:29 node1 kernel: [ 963.213554] o2net: node node3 (num 6) at 10.22.44.23:7777 uses a heartbeat timeout of 120000 ms, but we use 60000 ms locally. Disconnecting.
Aug 16 15:58:30 node1 kernel: [ 964.057995] o2net: node node4 (num 7) at 10.22.44.24:7777 uses a heartbeat timeout of 120000 ms, but we use 60000 ms locally. Disconnecting.
Aug 16 15:58:32 node1 kernel: [ 966.404380] o2net: No connection established with node 2 after 30.0 seconds, check network and cluster configuration.
Aug 16 15:58:32 node1 kernel: [ 966.404390] o2net: No connection established with node 6 after 30.0 seconds, check network and cluster configuration.
Aug 16 15:58:32 node1 kernel: [ 966.404393] o2net: No connection established with node 7 after 30.0 seconds, check network and cluster configuration.
Aug 16 15:58:59 node1 kernel: [ 993.296012] o2net: node node3 (num 6) at 10.22.44.23:7777 uses a heartbeat timeout of 120000 ms, but we use 60000 ms locally. Disconnecting.
Aug 16 15:59:00 node1 kernel: [ 994.060435] o2net: node node4 (num 7) at 10.22.44.24:7777 uses a heartbeat timeout of 120000 ms, but we use 60000 ms locally. Disconnecting.
Aug 16 15:59:02 node1 kernel: [ 996.486396] o2net: No connection established with node 2 after 30.0 seconds, check network and cluster configuration.
Aug 16 15:59:02 node1 kernel: [ 996.486405] o2net: No connection established with node 6 after 30.0 seconds, check network and cluster configuration.
Aug 16 15:59:02 node1 kernel: [ 996.486409] o2net: No connection established with node 7 after 30.0 seconds, check network and cluster configuration.
Aug 16 15:59:05 node1 kernel: [ 999.582560] o2cb: This node could not connect to nodes: 2 6 7.
Aug 16 15:59:05 node1 kernel: [ 999.582607] o2cb: Cluster check failed. Fix errors before retrying.
Aug 16 15:59:05 node1 kernel: [ 999.582647] (mount.ocfs2,5740,1):ocfs2_dlm_init:3025 ERROR: status = -107
Aug 16 15:59:05 node1 kernel: [ 999.582814] (mount.ocfs2,5740,1):ocfs2_mount_volume:1863 ERROR: status = -107
Aug 16 15:59:05 node1 kernel: [ 999.582895] ocfs2: Unmounting device (8,32) on (node 0)
Aug 16 15:59:05 node1 kernel: [ 999.582905] (mount.ocfs2,5740,1):ocfs2_fill_super:1219 ERROR: status = -107
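
For reference, the threshold is set in /etc/default/o2cb on each node; a minimal sketch of the relevant lines (standard o2cb option names, with the cluster name assumed to be the default):

O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=61

The numbers in the log are consistent with the new node running with the stock default of 31: per the o2cb documentation the heartbeat dead count is threshold × 2000 ms (61 × 2000 = 122000 ms versus 31 × 2000 = 62000 ms), and the network heartbeat timeout is (threshold - 1) × 2000 ms (120000 ms versus 60000 ms).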

Revision history for this message
Joshua Powers (powersj) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better.

Since there isn't enough information in your report to differentiate between a local configuration problem and a bug in Ubuntu, I'm marking this bug as Incomplete.

If indeed this is a local configuration problem, you can find pointers to get help for this sort of problem here: http://www.ubuntu.com/support/community

Or if you believe that this is really a bug, then you may find it helpful to read "How to report bugs effectively" http://www.chiark.greenend.org.uk/~sgtatham/bugs.html. We'd be grateful if you would then provide a more complete description of the problem, explain why you believe this is a bug in Ubuntu rather than a problem specific to your system, and then change the bug status back to New.

Changed in ocfs2-tools (Ubuntu):
status: New → Incomplete
Revision history for this message
Sven (sven-solberg) wrote :

Yes, sorry ... I should have added the config files... I'll do that now.

These files are identical across all 4 nodes.

There are additional references in the file for "sentinel", "belt" and "moli7". These are not part of the main cluster file system, but are used for other ocfs2 volumes.
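
For completeness, the shared /etc/ocfs2/cluster.conf for the four nodes above would look roughly like this (a sketch only: the cluster name is assumed to be the default "ocfs2", the node names are taken from the syslog, and the extra "sentinel", "belt" and "moli7" entries are omitted, so the real node_count is higher):

cluster:
    node_count = 4
    name = ocfs2

node:
    ip_port = 7777
    ip_address = 10.22.44.21
    number = 1
    name = node1
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 10.22.44.22
    number = 2
    name = node2
    cluster = ocfs2

with analogous stanzas for node3 (number 6, 10.22.44.23) and node4 (number 7, 10.22.44.24).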

Revision history for this message
Sven (sven-solberg) wrote :
Joshua Powers (powersj)
Changed in ocfs2-tools (Ubuntu):
status: Incomplete → Triaged
importance: Undecided → Medium
Revision history for this message
Alexandre Derumier (aderumier-odiso) wrote :

Hi,
I have the same problem as you with Debian Jessie and kernel 4.7.
It works fine with kernel 3.16.

It seems that on kernel 3.16
the sysfs threshold key was

/sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold

and now on 4.7
/sys/kernel/config/cluster/ocfs2/heartbeat/threshold

but the o2cb init script is setting the value in
/sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold

(I have checked in the ocfs2-tools git repository; it still uses the same old sysfs key)

Changing the script helps, but sometimes it doesn't work. Maybe other keys have changed as well.
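
A quick way to see which attribute name the running kernel actually exposes (once o2cb has registered the cluster in configfs; substitute your cluster name for the wildcard if you prefer):

ls /sys/kernel/config/cluster/*/heartbeat/

On 3.16 the listing contains dead_threshold; on the affected 4.x kernels it contains threshold instead.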

Revision history for this message
Alexandre Derumier (aderumier-odiso) wrote :

set_timeouts()
{
    O2CB_HEARTBEAT_THRESHOLD_FILE_OLD=/proc/fs/ocfs2_nodemanager/hb_dead_threshold
- O2CB_HEARTBEAT_THRESHOLD_FILE=$(configfs_path)/cluster/${CLUSTER}/heartbeat/dead_threshold
+ O2CB_HEARTBEAT_THRESHOLD_FILE=$(configfs_path)/cluster/${CLUSTER}/heartbeat/threshold
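
A slightly more defensive variant of the same hunk, as an untested sketch (it relies on the configfs_path helper and the CLUSTER variable already defined in the o2cb init script), would write to whichever attribute the running kernel exposes instead of hard-coding one name:

    hb_dir="$(configfs_path)/cluster/${CLUSTER}/heartbeat"
    # use the historical configfs name when present, otherwise fall back to
    # the renamed attribute found on the affected kernels
    if [ -f "${hb_dir}/dead_threshold" ]; then
        O2CB_HEARTBEAT_THRESHOLD_FILE="${hb_dir}/dead_threshold"
    else
        O2CB_HEARTBEAT_THRESHOLD_FILE="${hb_dir}/threshold"
    fi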

Revision history for this message
Joshua Powers (powersj) wrote :

Thanks Alexandre for your additional logs and potential fix. This will get looked at for the next release to understand if that is the right fix going forward.

tags: added: server-next
Revision history for this message
Alexandre Derumier (aderumier-odiso) wrote :

Note that looking at kernel git, dead_threshold still seems to be the right value.

I don't know why on the latest Ubuntu/Debian kernels it's now "threshold".

I'll try to build a stock kernel to compare.

Revision history for this message
ian (ircwaves) wrote :

Any updates on this issue? It is blocking our move to 16.04. Hunting for workarounds now, other than just accepting the default.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi,
I have to beg your pardon: nothing has happened since the last updates.
I'm trying to clear a few bugs that seem to expire by being dormant.

On this one, the first thing I can provide is a trivial reproduction, which often
helps more general developers focus on an issue without feeling blocked
on a particular configuration.

# in VM needed for extra modules
$ sudo apt install linux-image-extra-virtual
$ sudo apt install ocfs2-tools
# very very basic config
$ sudo sed -i 's/O2CB_ENABLED=false/O2CB_ENABLED=true/' /etc/default/o2cb
$ sudo sed -i '/localhost/a192.168.122.237 ocfs2node1' /etc/hosts
$ sudo o2cb add-cluster ocfs2
$ sudo o2cb add-node --ip 192.168.122.237 ocfs2 $(hostname)
# should be running on restart now
$ sudo systemctl restart o2cb
$ sudo systemctl status o2cb
$ sudo o2cb cluster-status

That is enough to get everything initialized and to see that you have:
$ cat /sys/kernel/config/cluster/ocfs2/heartbeat/threshold
31
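
To tie this back to the original report, the same minimal setup can be extended with the tuned threshold from the bug description (assuming the variable is not already set elsewhere in /etc/default/o2cb):

$ echo 'O2CB_HEARTBEAT_THRESHOLD=61' | sudo tee -a /etc/default/o2cb
$ sudo systemctl restart o2cb
$ cat /sys/kernel/config/cluster/ocfs2/heartbeat/threshold
31

On an affected kernel the value stays at the default 31, because the init script only writes to .../heartbeat/dead_threshold, which no longer exists there.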

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I can also confirm that this is the case:
3.13.0-123-generic: /sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold
4.4.0-81-generic: /sys/kernel/config/cluster/ocfs2/heartbeat/threshold

So where does the change come from, and what needs to adapt (kernel or ocfs2-tools)?
It was reported that upstream in the kernel this would still be dead_threshold.

I compared:
Trusty: git://kernel.ubuntu.com/ubuntu/ubuntu-trusty.git
Xenial: git://kernel.ubuntu.com/ubuntu/ubuntu-xenial.git
Upstream: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

I found that at that level upstream agrees with Trusty and has dead_threshold.
So why is Xenial off from that? I found the upstream commit [1] that broke it upstream.
That broken state has been in since 4.4 and was only fixed recently, in 4.12, with commit [2].

There is also stable kernel activity on this at [3] for 4.11, [4] for 4.9, and [5] for 4.4.

Given that pre-analysis, I think it is the kernel team that will want to look at including the updates.
I hope my analysis helps to do so, and I have reassigned the bug by adapting the bug tasks accordingly.

[1]: https://github.com/torvalds/linux/commit/45b997737a80
[2]: https://github.com/torvalds/linux/commit/33496c3c3d7b
[3]: https://www.spinics.net/lists/stable/msg179361.html
[4]: https://www.spinics.net/lists/stable/msg179433.html
[5]: https://www.spinics.net/lists/stable/msg179582.html
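
For anyone who wants to check whether a given kernel tree already carries the fix, something along these lines should work from a checkout of the trees above (commit ids taken from [1] and [2]; stable backports are cherry-picked with different ids, so this only reflects the mainline history):

$ git merge-base --is-ancestor 45b997737a80 HEAD && echo "breaking commit present"
$ git merge-base --is-ancestor 33496c3c3d7b HEAD && echo "fix present"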

affects: ocfs2-tools (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
status: Triaged → New
tags: removed: server-next
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1614038

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired