drbd split-brain?

Bug #1850899 reported by yanwei on 2019-11-01
This bug affects 2 people
Affects: StarlingX
Importance: Undecided
Assigned to: Austin Sun

Bug Description

My StarlingX cluster is a controller-storage deployment with four nodes running in virtual machines.
A DRBD problem brought the cluster into an unusable state.
Some info follows:
1. drbd state
controller-1:~$ drbd-overview
  0:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  1:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  2:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  3:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  5:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  7:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  8:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
controller-0:~$ drbd-overview
  0:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  1:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  2:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  3:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  5:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  7:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  8:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
2. pmond could not be started, and other processes such as sm and etcd were down too.

After about 20 hours, the cluster recovered on its own.

The tar file of all logs is too big to upload, sorry...

If more info is needed, I will copy it here.

Thanks
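Both controllers reporting Secondary/Secondary for every resource means neither node holds the DRBD filesystems, so nothing that lives on them can start. A minimal sketch (assuming the drbd-overview column layout pasted above; the function name is illustrative) for flagging such resources from pasted output:

```python
# Flag drbd-overview lines where neither node is Primary.
# Assumes the "<minor>:<name> <cstate> <role>/<peer-role> ..." layout
# shown in this report; real output may differ by DRBD version.
def no_primary_resources(overview_text):
    hung = []
    for line in overview_text.strip().splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        minor_name, roles = fields[0], fields[2]
        if "Primary" not in roles:  # Secondary/Secondary: nothing mounted
            hung.append(minor_name)
    return hung

sample = """\
  0:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  1:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r-----
"""
print(no_primary_resources(sample))  # ['0:??not-found??']
```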

yanwei (yanwei) on 2019-11-01
description: updated
description: updated
Ghada Khalil (gkhalil) wrote :

There isn't enough information here. Please use the template so that we have the load info. Without logs, nobody can investigate this further.

tags: added: stx.storage
Changed in starlingx:
status: New → Incomplete
yanwei (yanwei) wrote :

all logs in attachment

yanwei (yanwei) wrote :

After recovery, drbd looks like:

[sysadmin@controller-0 ~(keystone_admin)]$ drbd-overview
  0:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /var/lib/postgresql ext4 40G 100M 38G 1%
  1:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /var/lib/rabbitmq ext4 2.0G 6.2M 1.9G 1%
  2:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /opt/platform ext4 2.0G 7.9M 1.9G 1%
  3:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /opt/cgcs ext4 20G 13M 19G 1%
  5:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /opt/extension ext4 992M 2.6M 923M 1%
  7:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /opt/etcd ext4 4.8G 420M 4.2G 10%
  8:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /var/lib/docker-distribution ext4 16G 4.6G 11G 31%
[sysadmin@controller-0 ~(keystone_admin)]$

Ghada Khalil (gkhalil) wrote :

Assigning to the storage PL for review/next steps

Changed in starlingx:
assignee: nobody → Austin Sun (sunausti)
Austin Sun (sunausti) on 2019-11-05
Changed in starlingx:
status: Incomplete → Confirmed
Austin Sun (sunausti) wrote :

From the logs, LVM on controller-1 hit some issues; several logical volumes were not mounted successfully.
From var/extra/blockdev.info of controller-1:
  ├─cgts--vg-scratch--lv 253:0 0 8G 0 lvm /scratch
  ├─cgts--vg-log--lv 253:1 0 7.8G 0 lvm /var/log
  ├─cgts--vg-docker--lv 253:2 0 30G 0 lvm /var/lib/docker
  ├─cgts--vg-etcd--lv 253:3 0 5G 0 lvm
  │ └─drbd7 147:7 0 5G 1 disk
  ├─cgts--vg-kubelet--lv 253:4 0 10G 0 lvm /var/lib/kubelet
  ├─cgts--vg-extension--lv 253:5 0 1G 0 lvm
  │ └─drbd5 147:5 0 1024M 1 disk
  ├─cgts--vg-pgsql--lv 253:6 0 40G 0 lvm
  │ └─drbd0 147:0 0 40G 1 disk
  ├─cgts--vg-ceph--mon--lv 253:7 0 20G 0 lvm /var/lib/ceph/mon
  ├─cgts--vg-backup--lv 253:8 0 60G 0 lvm /opt/backups
  ├─cgts--vg-cgcs--lv 253:9 0 20G 0 lvm
  │ └─drbd3 147:3 0 20G 1 disk
  ├─cgts--vg-dockerdistribution--lv 253:10 0 16G 0 lvm
  │ └─drbd8 147:8 0 16G 1 disk
  ├─cgts--vg-rabbit--lv 253:11 0 2G 0 lvm
  │ └─drbd1 147:1 0 2G 1 disk
  └─cgts--vg-platform--lv 253:12 0 2G 0 lvm
    └─drbd2 147:2 0 2G 1 disk

/opt/cgcs, /opt/platform, /opt/etcd and /opt/extension were not mounted successfully; those are the DRBD-backed filesystems, so DRBD did not start syncing.

Since this setup runs in a virtual environment, is it possible that the host disk is full or has a problem?

yanwei (yanwei) wrote :

host disk info
starlingx@starlingx-EC600G4:~$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 126G 0 126G 0% /dev
tmpfs 26G 219M 25G 1% /run
/dev/sda2 1.1T 353G 691G 34% /
tmpfs 126G 12M 126G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 126G 0 126G 0% /sys/fs/cgroup
/dev/sda1 511M 3.7M 508M 1% /boot/efi
cgmfs 100K 0 100K 0% /run/cgmanager/fs
tmpfs 26G 28K 26G 1% /run/user/1000
starlingx@starlingx-EC600G4:~$

starlingx@starlingx-EC600G4:~$ sudo hdparm -tT /dev/sda2
[sudo] password for starlingx:

/dev/sda2:
 Timing cached reads: 13450 MB in 1.99 seconds = 6748.01 MB/sec
 Timing buffered disk reads: 648 MB in 3.00 seconds = 215.73 MB/sec
starlingx@starlingx-EC600G4:~$
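The host `df -h` above shows `/` at 34% with 691G free, so a full host disk looks unlikely. A quick sketch (parsing the `df -h` layout as pasted; threshold and function name are illustrative) to confirm no filesystem is near capacity:

```python
# Parse `df -h` output (header + space-separated columns, as pasted
# above) and flag filesystems at or above a usage threshold.
def nearly_full(df_text, threshold=90):
    full = []
    for line in df_text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 6 and fields[4].endswith("%"):
            use = int(fields[4].rstrip("%"))
            if use >= threshold:
                full.append((fields[5], use))
    return full

sample = """\
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 1.1T 353G 691G 34% /
tmpfs 126G 12M 126G 1% /dev/shm
"""
print(nearly_full(sample))  # [] -> no filesystem near capacity
```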

Austin Sun (sunausti) wrote :

It seems none of the /opt mounts that DRBD needs are present. Is your environment still up? Could you check whether the /opt folder exists on controller-1?

yanwei (yanwei) wrote :

controller-1:~$ drbd-overview
  0:??not-found?? Connected Secondary/Primary UpToDate/UpToDate C r-----
  1:??not-found?? Connected Secondary/Primary UpToDate/UpToDate C r-----
  2:??not-found?? Connected Secondary/Primary UpToDate/UpToDate C r-----
  3:??not-found?? Connected Secondary/Primary UpToDate/UpToDate C r-----
  5:??not-found?? Connected Secondary/Primary UpToDate/UpToDate C r-----
  7:??not-found?? Connected Secondary/Primary UpToDate/UpToDate C r-----
  8:??not-found?? Connected Secondary/Primary UpToDate/UpToDate C r-----
controller-1:~$ ls /opt/
backups branding cgcs cni collectd containerd dc deploy etcd extension extracharts patching platform
controller-1:~$ df -h |grep opt
/dev/mapper/cgts--vg-backup--lv 59G 53M 56G 1% /opt/backups
controller-1:~$

yanwei (yanwei) wrote :

controller-1:~$ mount /dev/drbd2 /opt/platform/
mount: only root can do that
controller-1:~$ sudo mount /dev/drbd2 /opt/platform/
Password:
mount: /dev/drbd2 is write-protected, mounting read-only
mount: mount /dev/drbd2 on /opt/platform failed: Wrong medium type
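"Wrong medium type" here is DRBD refusing I/O, not filesystem damage: DRBD returns errno EMEDIUMTYPE when a device in the Secondary role is opened for use. The device would have to be promoted (`drbdadm primary <resource>`) before it could be mounted, but on StarlingX the service manager controls DRBD roles, so a manual promotion can be reverted. A small Linux-specific check of the errno mapping:

```python
import errno
import os

# "Wrong medium type" corresponds to errno EMEDIUMTYPE (124 on Linux),
# the error DRBD returns when a Secondary device is opened for I/O.
print(errno.EMEDIUMTYPE)               # 124 on Linux
print(os.strerror(errno.EMEDIUMTYPE))  # 'Wrong medium type'
```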

yanwei (yanwei) wrote :

controller-0:~$ drbd-overview
  0:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /var/lib/postgresql ext4 40G 100M 38G 1%
  1:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /var/lib/rabbitmq ext4 2.0G 6.2M 1.9G 1%
  2:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /opt/platform ext4 2.0G 7.9M 1.9G 1%
  3:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /opt/cgcs ext4 20G 13M 19G 1%
  5:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /opt/extension ext4 992M 2.6M 923M 1%
  7:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /opt/etcd ext4 4.8G 420M 4.2G 10%
  8:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /var/lib/docker-distribution ext4 16G 4.6G 11G 31%
controller-0:~$ df -h |grep drbd
/dev/drbd8 16G 4.6G 11G 31% /var/lib/docker-distribution
/dev/drbd7 4.8G 420M 4.2G 10% /opt/etcd
/dev/drbd5 992M 2.6M 923M 1% /opt/extension
/dev/drbd3 20G 13M 19G 1% /opt/cgcs
/dev/drbd1 2.0G 6.2M 1.9G 1% /var/lib/rabbitmq
/dev/drbd0 40G 100M 38G 1% /var/lib/postgresql
/dev/drbd2 2.0G 7.9M 1.9G 1% /opt/platform

Austin Sun (sunausti) wrote :

OK. I think this output is from after everything recovered (DRBD is working fine now), not from when the issue occurred, right?

Austin Sun (sunausti) wrote :

From mtcClient there is abnormal behavior: the hostname jumps from controller-0 to localhost, and the timestamp jumps as well.

2019-10-31T08:27:48.221 [86957.00069] controller-0 mtcClient --- daemon_config.cpp ( 405) daemon_dump_cfg : Info : debug_event = none
2019-10-31T16:27:25.947 [6342.00000] localhost mtcClient --- daemon_files.cpp ( 443) daemon_system_type : Info : System Type : Standard System
2019-10-31T16:27:25.947 [6342.00001] localhost mtcClient --- mtcNodeComp.cpp ( 969) daemon_init : Info : Node Type : controller (1:1)
2019-10-31T16:27:25.947 [6342.00002] localhost mtcClient --- daemon_files.cpp (1051) daemon_files_init : Info : --- Daemon Start-Up --- pid:6342
2019-10-31T16:27:25.947 [6342.00003] localhost mtcClient sig daemon_signal.cpp ( 255) daemon_signal_init : Info : Signal Hdlr : Installed (sigaction)
2019-10-31T16:27:25.949 [6342.00004] localhost mtcClient --- mtcNodeComp.cpp ( 237) mtc_config_handler : Info : Shutdown TO : 120 secs
2019-10-31T16:27:25.949 [6342.00005] localhost mtcClient --- daemon_debug.cpp ( 364) get_debug_options : Info : Config File : /etc/mtc.conf
2019-10-31T16:27:25.949 [6342.00006] localhost mtcClient --- daemon_debug.cpp ( 307) debug_config_handler : Info : FIT host : none
2019-10-31T16:27:25.949 [6342.00007] localhost mtcClient --- mtcNodeComp.cpp ( 273) daemon_configure : Info : Agent Mgmnt : 2101 (tx)
2019-10-31T16:27:25.949 [6342.00008] localhost mtcClient --- mtcNodeComp.cpp ( 274) daemon_configure : Info : Client Mgmnt: 2118 (rx)
2019-10-31T16:27:25.949 [6342.00009] localhost mtcClient --- daemon_config.cpp ( 76) client_timeout_handler : Info : goEnabled TO: 600 secs
2019-10-31T16:27:25.949 [6342.00010] localhost mtcClient --- daemon_config.cpp ( 81) client_timeout_handler : Info : Host Svcs TO: 300 secs
2019-10-31T16:27:25.949 [6342.00011] localhost mtcClient com nodeUtil.cpp ( 702) get_hostname : Info : controller-0
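The excerpt shows two anomalies: the PID changes from 86957 to 6342 (mtcClient restarted, apparently before the hostname was applied, hence "localhost"), and adjacent timestamps jump by roughly eight hours. A small sketch (timestamp format taken from the lines above; the helper name is illustrative) to measure such gaps:

```python
from datetime import datetime

# Measure the jump between two mtcClient log timestamps.
# The timestamp is the first whitespace-separated field of each line,
# in the "%Y-%m-%dT%H:%M:%S.%f" format seen in the excerpt above.
def gap_seconds(line_a, line_b):
    fmt = "%Y-%m-%dT%H:%M:%S.%f"
    t_a = datetime.strptime(line_a.split()[0], fmt)
    t_b = datetime.strptime(line_b.split()[0], fmt)
    return (t_b - t_a).total_seconds()

a = "2019-10-31T08:27:48.221 [86957.00069] controller-0 mtcClient ..."
b = "2019-10-31T16:27:25.947 [6342.00000] localhost mtcClient ..."
print(round(gap_seconds(a, b) / 3600, 2))  # 7.99 -> roughly eight hours
```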

hutianhao27 (hutianhao) wrote :

@Austin Sun
@yanwei
I have two systems, one AIO Simplex and one AIO Duplex, and both have this problem. After configuring the system and unlocking controller-0, it shows controller-0 as the Secondary node, and /opt/extension, /var/lib/postgresql, /opt/etcd, /var/lib/docker-distribution, /var/lib/rabbitmq and /opt/platform are not mounted successfully. I tried to change controller-0 to Primary, but it changes back to Secondary quickly and the logical volumes are still not mounted. Is there a known solution to this problem, or some operation I can perform to fix it?
