So we absolutely need to add /var/log/cluster to what we collect from the RDO jobs (we used to collect it in TripleO CI), otherwise debugging failures like this one gets particularly hard.
In any case this is a networking issue of some sort. From: https://logs.rdoproject.org/52/526152/2/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/Zc82f7d56a5fd452192f6d6426c191787/overcloud-controller-1/var/log/journal.txt.gz we can see that corosync is losing the totem token that is being passed around:
Dec 07 22:55:30 overcloud-controller-1 cib[18522]: notice: Node overcloud-controller-0 state is now member
Dec 07 22:55:30 overcloud-controller-1 attrd[18525]: notice: Node overcloud-controller-0 state is now member
Dec 07 22:55:30 overcloud-controller-1 stonith-ng[18523]: notice: Node overcloud-controller-0 state is now member
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
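To compare controllers without reading the journal by eye, something like the following could summarize the retransmit noise. It is only a rough sketch: the function name summarize_retransmits is made up here, the regex assumes the line layout seen in the excerpt above, and it assumes the Retransmit List entries are hexadecimal sequence numbers.

```python
import re
from collections import Counter

# Assumed layout, taken from the journal excerpt:
# "<Mon DD HH:MM:SS> <host> corosync[<pid>]: [TOTEM ] Retransmit List: <ids>"
RETRANSMIT_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) corosync\[\d+\]: "
    r"\[TOTEM\s*\] Retransmit List: (?P<seqs>[0-9a-f ]+)$"
)


def summarize_retransmits(lines):
    """Count retransmit lines per (host, timestamp) and collect the distinct ids.

    Hypothetical helper, not part of corosync or of the CI tooling.
    """
    per_second = Counter()
    seen = set()
    for line in lines:
        match = RETRANSMIT_RE.match(line.strip())
        if match is None:
            continue  # not a corosync retransmit line
        per_second[(match.group("host"), match.group("ts"))] += 1
        seen.update(match.group("seqs").split())
    # Sort numerically, assuming the ids are hexadecimal.
    return per_second, sorted(seen, key=lambda s: int(s, 16))
```

It can be fed the decompressed journal, for example with gzip.open(path, "rt") on a local copy of journal.txt.gz, and run once per controller to see which nodes report retransmits in the same second.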
I'll try to look more on Monday. So far all the evidence points to some networking issue between controllers.