So we absolutely need to add /var/log/cluster to what we collect from the RDO jobs (we used to collect it in TripleO CI); otherwise problems like this one are particularly hard to debug. In any case, this is a networking issue of some sort.

From https://logs.rdoproject.org/52/526152/2/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/Zc82f7d56a5fd452192f6d6426c191787/overcloud-controller-1/var/log/journal.txt.gz we can see that corosync is losing the totem token that is being passed around:

Dec 07 22:55:30 overcloud-controller-1 cib[18522]: notice: Node overcloud-controller-0 state is now member
Dec 07 22:55:30 overcloud-controller-1 attrd[18525]: notice: Node overcloud-controller-0 state is now member
Dec 07 22:55:30 overcloud-controller-1 stonith-ng[18523]: notice: Node overcloud-controller-0 state is now member
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
[the last line repeats many more times, all at 22:55:30]

I'll try to dig further on Monday. So far all the evidence points to a networking issue between the controllers.
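
To get a quick feel for how bad the retransmits are without scrolling through the journal, something like the following might help. It is only a rough sketch: the function name `summarize_retransmits` and the regex are my own assumptions based on the line format quoted above, not anything shipped with corosync or the CI tooling. It counts the "Retransmit List" messages per timestamp and host, and collects the sequence numbers that were listed for retransmission.

```python
import re
from collections import Counter

# Assumed line format, taken from the journal excerpt above, e.g.:
# Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b
RETRANSMIT_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) corosync\[\d+\]: "
    r"\[TOTEM \] Retransmit List: (?P<seqs>[0-9a-f ]+)$"
)


def summarize_retransmits(lines):
    """Return (counts, pending) keyed by (timestamp, host).

    counts[key]  -> number of Retransmit List messages logged
    pending[key] -> set of hex sequence numbers seen in those messages
    """
    counts = Counter()
    pending = {}
    for line in lines:
        match = RETRANSMIT_RE.match(line.strip())
        if match is None:
            continue  # not a corosync retransmit line
        key = (match.group("ts"), match.group("host"))
        counts[key] += 1
        pending.setdefault(key, set()).update(match.group("seqs").split())
    return counts, pending


if __name__ == "__main__":
    sample = [
        "Dec 07 22:55:30 overcloud-controller-1 cib[18522]: notice: "
        "Node overcloud-controller-0 state is now member",
        "Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: "
        "[TOTEM ] Retransmit List: 1a 1b",
        "Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: "
        "[TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30",
    ]
    counts, pending = summarize_retransmits(sample)
    for (ts, host), n in sorted(counts.items()):
        print(ts, host, n, sorted(pending[(ts, host)]))
```

For the real file, the lines could be read from a local copy of journal.txt.gz with `gzip.open(path, "rt")`. A large count within a single second on one controller, with a list that keeps growing, would be consistent with the networking theory, though it does not prove it.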