Comment 1 for bug 1737066

Michele Baldessari (michele) wrote :

We absolutely need to add /var/log/cluster to the logs we collect from the RDO jobs (we used to
collect it in TripleO CI); without it, debugging cluster problems like this one is particularly hard.

In any case this is a networking issue of some sort. From
https://logs.rdoproject.org/52/526152/2/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/Zc82f7d56a5fd452192f6d6426c191787/overcloud-controller-1/var/log/journal.txt.gz
we can see that corosync keeps requesting retransmits right after controller-0 joins, i.e. the totem
traffic being passed around the ring is getting lost:

Dec 07 22:55:30 overcloud-controller-1 cib[18522]: notice: Node overcloud-controller-0 state is now member
Dec 07 22:55:30 overcloud-controller-1 attrd[18525]: notice: Node overcloud-controller-0 state is now member
Dec 07 22:55:30 overcloud-controller-1 stonith-ng[18523]: notice: Node overcloud-controller-0 state is now member
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
Dec 07 22:55:30 overcloud-controller-1 corosync[18514]: [TOTEM ] Retransmit List: 1a 1b 2c 2d 2e 2f 30 34 35 36 37
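To get a feel for how bad this is across the whole journal rather than one excerpt, something along these lines could count the retransmit messages per second and per host. This is only a rough sketch: the regular expression assumes the line format shown in the excerpt above, and the function name summarize_retransmits is made up for illustration.

```python
import re
from collections import Counter

# Matches corosync retransmit lines in the format seen in the journal excerpt.
# Assumption: timestamps look like "Dec 07 22:55:30" and sequence numbers are hex.
RETRANSMIT_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"corosync\[\d+\]: \[TOTEM \] Retransmit List: (?P<seqs>[0-9a-f ]+)$"
)


def summarize_retransmits(lines):
    """Return (Counter keyed by (timestamp, host), set of sequence numbers seen)."""
    counts = Counter()
    seqs = set()
    for line in lines:
        match = RETRANSMIT_RE.match(line.strip())
        if match is None:
            continue  # not a retransmit line (e.g. cib/attrd/stonith-ng notices)
        counts[(match.group("ts"), match.group("host"))] += 1
        seqs.update(match.group("seqs").split())
    return counts, seqs
```

It can be fed the lines of a locally downloaded copy of the journal, for example via gzip.open(path, "rt"). A high count concentrated in a few seconds would be consistent with a short burst of packet loss between the controllers.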

I'll try to look into this more on Monday. So far all the evidence points to a networking issue between the controllers.