drbd split-brain?

Bug #1850899 reported by yanwei on 2019-11-01
This bug affects 2 people
Affects: StarlingX
Importance: Undecided
Assigned to: Austin Sun

Bug Description

My StarlingX cluster is a controller-storage deployment with four nodes running in virtual machines.
A DRBD problem brought the cluster into an unusable state.
Some info follows:
1. drbd state
controller-1:~$ drbd-overview
  0:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  1:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  2:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  3:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  5:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  7:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  8:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
controller-0:~$ drbd-overview
  0:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  1:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  2:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  3:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  5:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  7:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  8:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
2. pmond could not be started, and other processes such as sm and etcd were down too.

After about 20 hours, the cluster recovered on its own.

The tar file of all logs is too big to upload, sorry...

If more info is needed, I will copy it here.

Thanks
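Both controllers reporting Secondary/Secondary for every resource means neither node holds the DRBD filesystems, so nothing that lives on them can start. A minimal sketch (assuming the drbd-overview column layout pasted above; the function name is illustrative) for flagging such resources from pasted output:

```python
# Flag drbd-overview lines where neither node is Primary.
# Assumes the "<minor>:<name> <cstate> <role>/<peer-role> ..." layout
# shown in this report; real output may differ by DRBD version.
def no_primary_resources(overview_text):
    hung = []
    for line in overview_text.strip().splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        minor_name, roles = fields[0], fields[2]
        if "Primary" not in roles:  # Secondary/Secondary: nothing mounted
            hung.append(minor_name)
    return hung

sample = """\
  0:??not-found?? Connected Secondary/Secondary UpToDate/UpToDate C r-----
  1:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r-----
"""
print(no_primary_resources(sample))  # ['0:??not-found??']
```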

yanwei (yanwei) on 2019-11-01
description: updated
description: updated
Ghada Khalil (gkhalil) wrote :

There isn't enough information here. Please use the template so that we have the load info. Without logs, nobody can investigate this further.

tags: added: stx.storage
Changed in starlingx:
status: New → Incomplete
yanwei (yanwei) wrote :

all logs in attachment

yanwei (yanwei) wrote :

After recovery, drbd looks like:

[sysadmin@controller-0 ~(keystone_admin)]$ drbd-overview
  0:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /var/lib/postgresql ext4 40G 100M 38G 1%
  1:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /var/lib/rabbitmq ext4 2.0G 6.2M 1.9G 1%
  2:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /opt/platform ext4 2.0G 7.9M 1.9G 1%
  3:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /opt/cgcs ext4 20G 13M 19G 1%
  5:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /opt/extension ext4 992M 2.6M 923M 1%
  7:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /opt/etcd ext4 4.8G 420M 4.2G 10%
  8:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /var/lib/docker-distribution ext4 16G 4.6G 11G 31%
[sysadmin@controller-0 ~(keystone_admin)]$

Ghada Khalil (gkhalil) wrote :

Assigning to the storage PL for review/next steps

Changed in starlingx:
assignee: nobody → Austin Sun (sunausti)
Austin Sun (sunausti) on 2019-11-05
Changed in starlingx:
status: Incomplete → Confirmed
Austin Sun (sunausti) wrote :

From the logs, LVM on controller-1 hit some issues; several logical volumes were not mounted successfully.
From var/extra/blockdev.info of controller-1:
  ├─cgts--vg-scratch--lv 253:0 0 8G 0 lvm /scratch
  ├─cgts--vg-log--lv 253:1 0 7.8G 0 lvm /var/log
  ├─cgts--vg-docker--lv 253:2 0 30G 0 lvm /var/lib/docker
  ├─cgts--vg-etcd--lv 253:3 0 5G 0 lvm
  │ └─drbd7 147:7 0 5G 1 disk
  ├─cgts--vg-kubelet--lv 253:4 0 10G 0 lvm /var/lib/kubelet
  ├─cgts--vg-extension--lv 253:5 0 1G 0 lvm
  │ └─drbd5 147:5 0 1024M 1 disk
  ├─cgts--vg-pgsql--lv 253:6 0 40G 0 lvm
  │ └─drbd0 147:0 0 40G 1 disk
  ├─cgts--vg-ceph--mon--lv 253:7 0 20G 0 lvm /var/lib/ceph/mon
  ├─cgts--vg-backup--lv 253:8 0 60G 0 lvm /opt/backups
  ├─cgts--vg-cgcs--lv 253:9 0 20G 0 lvm
  │ └─drbd3 147:3 0 20G 1 disk
  ├─cgts--vg-dockerdistribution--lv 253:10 0 16G 0 lvm
  │ └─drbd8 147:8 0 16G 1 disk
  ├─cgts--vg-rabbit--lv 253:11 0 2G 0 lvm
  │ └─drbd1 147:1 0 2G 1 disk
  └─cgts--vg-platform--lv 253:12 0 2G 0 lvm
    └─drbd2 147:2 0 2G 1 disk

/opt/cgcs, /opt/platform, /opt/etcd and /opt/extension were not mounted successfully; those are the DRBD-backed filesystems, so DRBD did not start syncing.

Since this setup runs in a virtual environment, is it possible that the host disk is full or has a problem?

yanwei (yanwei) wrote :

host disk info
starlingx@starlingx-EC600G4:~$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 126G 0 126G 0% /dev
tmpfs 26G 219M 25G 1% /run
/dev/sda2 1.1T 353G 691G 34% /
tmpfs 126G 12M 126G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 126G 0 126G 0% /sys/fs/cgroup
/dev/sda1 511M 3.7M 508M 1% /boot/efi
cgmfs 100K 0 100K 0% /run/cgmanager/fs
tmpfs 26G 28K 26G 1% /run/user/1000
starlingx@starlingx-EC600G4:~$

starlingx@starlingx-EC600G4:~$ sudo hdparm -tT /dev/sda2
[sudo] password for starlingx:

/dev/sda2:
 Timing cached reads: 13450 MB in 1.99 seconds = 6748.01 MB/sec
 Timing buffered disk reads: 648 MB in 3.00 seconds = 215.73 MB/sec
starlingx@starlingx-EC600G4:~$
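The host `df -h` above shows `/` at 34% with 691G free, so a full host disk looks unlikely. A quick sketch (parsing the `df -h` layout as pasted; threshold and function name are illustrative) to confirm no filesystem is near capacity:

```python
# Parse `df -h` output (header + space-separated columns, as pasted
# above) and flag filesystems at or above a usage threshold.
def nearly_full(df_text, threshold=90):
    full = []
    for line in df_text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 6 and fields[4].endswith("%"):
            use = int(fields[4].rstrip("%"))
            if use >= threshold:
                full.append((fields[5], use))
    return full

sample = """\
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 1.1T 353G 691G 34% /
tmpfs 126G 12M 126G 1% /dev/shm
"""
print(nearly_full(sample))  # [] -> no filesystem near capacity
```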

Austin Sun (sunausti) wrote :

It seems none of the /opt mounts that DRBD needs are present. Is your environment still up? Could you check whether the /opt folder exists on controller-1?

yanwei (yanwei) wrote :

controller-1:~$ drbd-overview
  0:??not-found?? Connected Secondary/Primary UpToDate/UpToDate C r-----
  1:??not-found?? Connected Secondary/Primary UpToDate/UpToDate C r-----
  2:??not-found?? Connected Secondary/Primary UpToDate/UpToDate C r-----
  3:??not-found?? Connected Secondary/Primary UpToDate/UpToDate C r-----
  5:??not-found?? Connected Secondary/Primary UpToDate/UpToDate C r-----
  7:??not-found?? Connected Secondary/Primary UpToDate/UpToDate C r-----
  8:??not-found?? Connected Secondary/Primary UpToDate/UpToDate C r-----
controller-1:~$ ls /opt/
backups branding cgcs cni collectd containerd dc deploy etcd extension extracharts patching platform
controller-1:~$ df -h |grep opt
/dev/mapper/cgts--vg-backup--lv 59G 53M 56G 1% /opt/backups
controller-1:~$

yanwei (yanwei) wrote :

controller-1:~$ mount /dev/drbd2 /opt/platform/
mount: only root can do that
controller-1:~$ sudo mount /dev/drbd2 /opt/platform/
Password:
mount: /dev/drbd2 is write-protected, mounting read-only
mount: mount /dev/drbd2 on /opt/platform failed: Wrong medium type
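"Wrong medium type" here is DRBD refusing I/O, not filesystem damage: DRBD returns errno EMEDIUMTYPE when a device in the Secondary role is opened for use. The device would have to be promoted (`drbdadm primary <resource>`) before it could be mounted, but on StarlingX the service manager controls DRBD roles, so a manual promotion can be reverted. A small Linux-specific check of the errno mapping:

```python
import errno
import os

# "Wrong medium type" corresponds to errno EMEDIUMTYPE (124 on Linux),
# the error DRBD returns when a Secondary device is opened for I/O.
print(errno.EMEDIUMTYPE)               # 124 on Linux
print(os.strerror(errno.EMEDIUMTYPE))  # 'Wrong medium type'
```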

yanwei (yanwei) wrote :

controller-0:~$ drbd-overview
  0:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /var/lib/postgresql ext4 40G 100M 38G 1%
  1:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /var/lib/rabbitmq ext4 2.0G 6.2M 1.9G 1%
  2:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /opt/platform ext4 2.0G 7.9M 1.9G 1%
  3:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /opt/cgcs ext4 20G 13M 19G 1%
  5:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /opt/extension ext4 992M 2.6M 923M 1%
  7:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /opt/etcd ext4 4.8G 420M 4.2G 10%
  8:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r----- /var/lib/docker-distribution ext4 16G 4.6G 11G 31%
controller-0:~$ df -h |grep drbd
/dev/drbd8 16G 4.6G 11G 31% /var/lib/docker-distribution
/dev/drbd7 4.8G 420M 4.2G 10% /opt/etcd
/dev/drbd5 992M 2.6M 923M 1% /opt/extension
/dev/drbd3 20G 13M 19G 1% /opt/cgcs
/dev/drbd1 2.0G 6.2M 1.9G 1% /var/lib/rabbitmq
/dev/drbd0 40G 100M 38G 1% /var/lib/postgresql
/dev/drbd2 2.0G 7.9M 1.9G 1% /opt/platform

Austin Sun (sunausti) wrote :

OK. I think this output is from after everything recovered (DRBD is working fine now), not from when the issue occurred, right?

Austin Sun (sunausti) wrote :

From mtcClient there is abnormal behavior: the hostname jumps from controller-0 to localhost, and the timestamp jumps as well.

2019-10-31T08:27:48.221 [86957.00069] controller-0 mtcClient --- daemon_config.cpp ( 405) daemon_dump_cfg : Info : debug_event = none
2019-10-31T16:27:25.947 [6342.00000] localhost mtcClient --- daemon_files.cpp ( 443) daemon_system_type : Info : System Type : Standard System
2019-10-31T16:27:25.947 [6342.00001] localhost mtcClient --- mtcNodeComp.cpp ( 969) daemon_init : Info : Node Type : controller (1:1)
2019-10-31T16:27:25.947 [6342.00002] localhost mtcClient --- daemon_files.cpp (1051) daemon_files_init : Info : --- Daemon Start-Up --- pid:6342
2019-10-31T16:27:25.947 [6342.00003] localhost mtcClient sig daemon_signal.cpp ( 255) daemon_signal_init : Info : Signal Hdlr : Installed (sigaction)
2019-10-31T16:27:25.949 [6342.00004] localhost mtcClient --- mtcNodeComp.cpp ( 237) mtc_config_handler : Info : Shutdown TO : 120 secs
2019-10-31T16:27:25.949 [6342.00005] localhost mtcClient --- daemon_debug.cpp ( 364) get_debug_options : Info : Config File : /etc/mtc.conf
2019-10-31T16:27:25.949 [6342.00006] localhost mtcClient --- daemon_debug.cpp ( 307) debug_config_handler : Info : FIT host : none
2019-10-31T16:27:25.949 [6342.00007] localhost mtcClient --- mtcNodeComp.cpp ( 273) daemon_configure : Info : Agent Mgmnt : 2101 (tx)
2019-10-31T16:27:25.949 [6342.00008] localhost mtcClient --- mtcNodeComp.cpp ( 274) daemon_configure : Info : Client Mgmnt: 2118 (rx)
2019-10-31T16:27:25.949 [6342.00009] localhost mtcClient --- daemon_config.cpp ( 76) client_timeout_handler : Info : goEnabled TO: 600 secs
2019-10-31T16:27:25.949 [6342.00010] localhost mtcClient --- daemon_config.cpp ( 81) client_timeout_handler : Info : Host Svcs TO: 300 secs
2019-10-31T16:27:25.949 [6342.00011] localhost mtcClient com nodeUtil.cpp ( 702) get_hostname : Info : controller-0
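The excerpt shows two anomalies: the PID changes from 86957 to 6342 (mtcClient restarted, apparently before the hostname was applied, hence "localhost"), and adjacent timestamps jump by roughly eight hours. A small sketch (timestamp format taken from the lines above; the helper name is illustrative) to measure such gaps:

```python
from datetime import datetime

# Measure the jump between two mtcClient log timestamps.
# The timestamp is the first whitespace-separated field of each line,
# in the "%Y-%m-%dT%H:%M:%S.%f" format seen in the excerpt above.
def gap_seconds(line_a, line_b):
    fmt = "%Y-%m-%dT%H:%M:%S.%f"
    t_a = datetime.strptime(line_a.split()[0], fmt)
    t_b = datetime.strptime(line_b.split()[0], fmt)
    return (t_b - t_a).total_seconds()

a = "2019-10-31T08:27:48.221 [86957.00069] controller-0 mtcClient ..."
b = "2019-10-31T16:27:25.947 [6342.00000] localhost mtcClient ..."
print(round(gap_seconds(a, b) / 3600, 2))  # 7.99 -> roughly eight hours
```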

hutianhao27 (hutianhao) wrote :

@Austin Sun
@yanwei
I have two systems, one AIO Simplex and one AIO Duplex, and both have this problem. After configuring the system and unlocking controller-0, it shows controller-0 as the Secondary node, and /opt/extension, /var/lib/postgresql, /opt/etcd, /var/lib/docker-distribution, /var/lib/rabbitmq and /opt/platform are not mounted successfully. I tried to change controller-0 to Primary, but it changes back to Secondary quickly and the logical volumes are still not mounted. Is there a known solution to this problem, or some operation I can perform to fix it?
